Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
makes some file organization and prepares for the first release
- Loading branch information
Showing
18 changed files
with
170,749 additions
and
67,787 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
The MIT License (MIT) | ||
|
||
Copyright (c) 2014 Mustafa Atik | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
include *.txt | ||
include *.md |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
Genderize | ||
====================== | ||
|
||
Genderize is a language independent module which tries to detect gender by looking given first names and/or analyzing sample texts. | ||
|
||
##Installation | ||
You can install this package using the following ```pip``` command: | ||
|
||
```sh | ||
$ sudo pip install genderize | ||
``` | ||
|
||
|
||
##Examples | ||
|
||
```python | ||
1 | ||
print Genderize.detect(firstName = 'John') | ||
# >>> male | ||
|
||
print Genderize.detect(firstName = 'Marry') | ||
# >>> female | ||
|
||
print Genderize.detect(firstName = 'mustafa') | ||
# >>> male | ||
|
||
print Genderize.detect(firstName = 'fatma') | ||
# >>> female | ||
|
||
print Genderize.detect(firstName = 'fikret') | ||
# >>> None | ||
|
||
print Genderize.detect(firstName = 'fikret', text='galatasary maçı kaçmaz') | ||
# >>> male | ||
|
||
print Genderize.detect(firstName = 'fikret', text='annemlerle yemek yedik') | ||
# >>> female | ||
|
||
print Genderize.detect(text='askerlik yoklamasını kaçırdım mk') | ||
# >>> male | ||
|
||
print Genderize.detect(text='bana çiçek alan erkek için canım feda') | ||
# >>> female | ||
|
||
print Genderize.detect(firstName='fatma', text='askerlik yoklamasını kaçırdım mk') | ||
# >>> female | ||
|
||
print Genderize.detect(text='futbol sevgi') | ||
# >>> None | ||
|
||
print Genderize.detect(text='lan bi siktir git') | ||
# >>> male | ||
|
||
``` | ||
***Note***: You may claim that women can say ```lan bi siktir git```, of course. But the probability of being female is less than the probability of being male according the trained data of the classifier. | ||
|
||
By the way, in Turkish saying 'lan bi siktir git' makes you quite rude. | ||
|
||
|
||
## How It Works | ||
Genderize is a module which tries to detect gender by looking given first names and/or analyzing sample text of a person. | ||
|
||
If a first name is definitely used for only one gender, the system will accept this gender and will not make any further analysis. For example, while 'Mustafa', 'Osman', 'Hasan' is used in Turkish only for male; 'Fatma', 'Ceyda', 'Elif' only for female. | ||
|
||
When looking at first names does not infer any gender for sure, the system will make text analysis if it is given. For example; 'Ekim', 'Meric', or 'Tümay' is used for both male and female. | ||
|
||
The text analysis is the classification of sample texts. It simply try to compute the probability of being male or female mining the sample text. In this system Naive Bayesian Classification is adopted and [naiveBayesClassifier][1] is used. | ||
|
||
## How To Improve It | ||
TODO: write a step by step guide | ||
|
||
##Preparing Language Dependent Training Sets | ||
TODO: give a few examples | ||
|
||
## Customization and Optimization | ||
TODO: Tell what options are there and when to choose | ||
```python | ||
MongoNamesCollection.mongodbURL = 'mongodb://192.168.1.170' | ||
Genderize.init( | ||
lang='tr', | ||
namesCollection=MongoNamesCollection | ||
) | ||
|
||
MemcachedNamesCollection.memcacheHost = '127.0.0.1:11211' | ||
Genderize.init( | ||
lang='tr', | ||
namesCollection=MemcachedNamesCollection | ||
) | ||
|
||
Genderize.init( | ||
lang='tr', | ||
namesCollection=NamesCollection, | ||
classifier=Classifier(CachedModel.get('tr'), tokenizer) | ||
) | ||
|
||
print Genderize.detect(firstName = 'fikret', text='annem') | ||
|
||
``` | ||
|
||
## TODO | ||
* inline docs | ||
* unit-tests | ||
|
||
## AUTHORS | ||
* Mustafa Atik [@muatik][2] | ||
* Nejdet Yucesoy @nejdetckenobi | ||
|
||
|
||
[1]: https://github.com/muatik/naive-bayes-classifier | ||
[2]: https://twitter.com/muatik2 |
Oops, something went wrong.