<h1><b>Gender identification</b></h1>

Model logic: names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male.
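Before training anything, the rule above can be written down directly as a baseline. This is only a minimal sketch: `rule_based_gender`, `FEMALE_ENDINGS` and `MALE_ENDINGS` are hypothetical names introduced here, and the function returns `None` for letters the rule does not mention.

```python
# Hypothetical baseline: encode the stated rule by hand, no training involved.
FEMALE_ENDINGS = set("aei")
MALE_ENDINGS = set("korst")


def rule_based_gender(name):
    """Guess a gender from the last letter alone; None when the rule is silent."""
    if not name:
        return None
    last = name[-1].lower()
    if last in FEMALE_ENDINGS:
        return "female"
    if last in MALE_ENDINGS:
        return "male"
    return None


for name in ["Rafaelita", "Caspar", "Bryn"]:
    print(name, rule_based_gender(name))
```

A trained classifier earns its keep on the cases the hand-written rule leaves open, such as names ending in n.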

a. Feature extraction - grab the last letter of the word (returns a dict)

The returned dictionary, known as a feature set, maps feature names (last_letter) to their values (word[-1]).

In [18]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [19]:
gender_features('check')

{'last_letter': 'k'}

b. Get data

In [20]:
from nltk.corpus import names

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])


# Shuffle the names so male and female entries are mixed
import random

random.shuffle(labeled_names)
labeled_names

[('Spud', 'male'),
 ('Hamilton', 'male'),
 ('Rutter', 'male'),
 ('Bernd', 'male'),
 ('Ragnar', 'male'),
 ('Mic', 'male'),
 ('Simmonds', 'male'),
 ('Rafaelita', 'female'),
 ('Josephus', 'male'),
 ('Georgeta', 'female'),
 ('Mignonne', 'female'),
 ('Maribelle', 'female'),
 ('Hashim', 'male'),
 ('Muire', 'female'),
 ('Berchtold', 'male'),
 ('Thorn', 'male'),
 ('Nanci', 'female'),
 ('Cathlene', 'female'),
 ('Lamb', 'female'),
 ('Bryn', 'female'),
 ('Benson', 'male'),
 ('Allyce', 'female'),
 ('Marabel', 'female'),
 ('Thomasin', 'female'),
 ('Toinette', 'female'),
 ('Josefa', 'female'),
 ('Linn', 'female'),
 ('Nerissa', 'female'),
 ('Anastasie', 'female'),
 ('Johny', 'male'),
 ('Adrian', 'female'),
 ('Gunilla', 'female'),
 ('Viole', 'female'),
 ('Sandi', 'female'),
 ('Lilith', 'female'),
 ('Julianne', 'female'),
 ('Caspar', 'male'),
 ('Elli', 'female'),
 ('Dorita', 'female'),
 ('Melly', 'female'),
 ('Cat', 'female'),
 ('Brigid', 'female'),
 ('Gae', 'female'),
 ('Venus', 'female'),
 ('Horst', 

c. Make feature sets from the data (labeled_names)

d. Divide train and test

e. Run classifier

In [21]:
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy
from nltk.classify import apply_features

featuresets = [(gender_features(name), gender) for (name, gender) in labeled_names]
# Hold out the first 500 examples for testing and train on the rest
train_set, test_set = featuresets[500:], featuresets[:500]


classifier = NaiveBayesClassifier.train(train_set)

In [22]:
classifier.classify(gender_features('Siddhi'))

'female'

In [23]:
print(accuracy(classifier, test_set))

0.76
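Conceptually, the accuracy figure is the fraction of test examples whose predicted label matches the true label. The sketch below spells that out in plain Python. `manual_accuracy` and `toy_classify` are hypothetical stand-ins, not NLTK functions, and the four examples are the last letters of names from the listing above.

```python
def manual_accuracy(classify, labeled_featuresets):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    if not labeled_featuresets:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)


def toy_classify(features):
    """Hypothetical stand-in for a trained classifier's classify method."""
    return "female" if features["last_letter"] in "aei" else "male"


toy_test_set = [
    ({"last_letter": "a"}, "female"),  # Rafaelita
    ({"last_letter": "r"}, "male"),    # Caspar
    ({"last_letter": "n"}, "female"),  # Bryn, which the toy rule gets wrong
    ({"last_letter": "s"}, "male"),    # Josephus
]

print(manual_accuracy(toy_classify, toy_test_set))  # 0.75
```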


In [24]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     34.3 : 1.0
             last_letter = 'k'              male : female =     28.5 : 1.0
             last_letter = 'f'              male : female =     15.4 : 1.0
             last_letter = 'p'              male : female =     11.3 : 1.0
             last_letter = 'm'              male : female =     10.7 : 1.0


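Large corpus? Building a list of every feature set up front can use a lot of memory. apply_features returns an object that acts like a list but does not store all the feature sets in memory.

How? It uses the LazyMap class to construct a lazy list-like object that is analogous to map(feature_func, toks).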

In [25]:
train_set1 = apply_features(gender_features, labeled_names[500:])
test_set1 = apply_features(gender_features, labeled_names[:500])

In [26]:
len(train_set), len(train_set1)  # same length; the lazy version builds feature sets on demand instead of storing them

(7444, 7444)
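The lazy behaviour can be illustrated without NLTK. The class below is a simplified sketch of the idea, not NLTK's actual LazyMap implementation: `LazyFeatureList` is a hypothetical name, and it only computes a feature set when an item is requested.

```python
class LazyFeatureList:
    """List-like view that computes (features, label) pairs only on access."""

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._tokens = labeled_tokens  # only the raw (token, label) pairs are stored

    def __len__(self):
        return len(self._tokens)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing stays lazy: wrap the sliced raw data again
            return LazyFeatureList(self._func, self._tokens[index])
        token, label = self._tokens[index]
        return (self._func(token), label)

    def __iter__(self):
        for token, label in self._tokens:
            yield (self._func(token), label)


def gender_features(word):
    return {'last_letter': word[-1]}


names_sample = [('Spud', 'male'), ('Rafaelita', 'female'), ('Bryn', 'female')]
lazy = LazyFeatureList(gender_features, names_sample)
print(len(lazy), lazy[1])
```

The trade-off is that each access recomputes the features, so memory is saved at the cost of some repeated work.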

Summary

1. Feature sets - turn raw data into features

2. **apply_features - builds the feature sets in one call and lazy loads them (win-win)**

3. classify.accuracy

4. **clf.show_most_informative_features(number of features)**

5. clf.classify (the counterpart of predict in sklearn)


<h1>Choosing the right features</h1>

1. First letter
2. Last letter
3. Count of each letter a-z in the name (count)
4. Whether the name contains each letter a-z (has)

This is also an example of overfitting on features :)


If you provide too many features, the algorithm has a higher chance of relying on idiosyncrasies of your training data that don't generalize well to new examples. This problem is known as overfitting, and it can be especially problematic when working with small training sets.

In [27]:
def gender_feature2(name):
    name = name.lower()
    features = {}
    features["first_letter"] = name[0]
    features["last_letter"] = name[-1]
    
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count({})'.format(letter)] = name.count(letter)
        features['has({})'.format(letter)] = (letter in name)
        
    return features

In [28]:
gender_feature2('John')

{'count(a)': 0,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 0,
 'count(h)': 1,
 'count(i)': 0,
 'count(j)': 1,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 0,
 'count(n)': 1,
 'count(o)': 1,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 0,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 0,
 'count(x)': 0,
 'count(y)': 0,
 'count(z)': 0,
 'first_letter': 'j',
 'has(a)': False,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': False,
 'has(h)': True,
 'has(i)': False,
 'has(j)': True,
 'has(k)': False,
 'has(l)': False,
 'has(m)': False,
 'has(n)': True,
 'has(o)': True,
 'has(p)': False,
 'has(q)': False,
 'has(r)': False,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': False,
 'has(x)': False,
 'has(y)': False,
 'has(z)': False,
 'last_letter': 'n'}
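To get a feel for how large this feature set is, the sketch below counts the features produced for a single name. The function is repeated from the cell above (using `string.ascii_lowercase` for the alphabet) only so the snippet runs on its own.

```python
import string


def gender_feature2(name):
    name = name.lower()
    features = {"first_letter": name[0], "last_letter": name[-1]}
    for letter in string.ascii_lowercase:
        features['count({})'.format(letter)] = name.count(letter)
        features['has({})'.format(letter)] = (letter in name)
    return features


features = gender_feature2('John')

# 2 positional features + 26 counts + 26 presence flags
print(len(features))  # 54

# Only a handful of the presence flags are True for any one name
active = sorted(k for k, v in features.items() if k.startswith('has(') and v)
print(active)  # ['has(h)', 'has(j)', 'has(n)', 'has(o)']
```

With 54 features per name, against 1 for the last-letter model, there is far more room for the classifier to pick up on quirks of the training names.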

In [29]:
featuresets2 = [(gender_feature2(name), gender) for (name, gender) in labeled_names]
train_set2, test_set2 = featuresets2[500:], featuresets2[:500]
classifier2 = NaiveBayesClassifier.train(train_set2)
accuracy_score = accuracy(classifier2, test_set2)
accuracy_score

0.782

<h3>0.782 with the 4 kinds of features vs 0.76 when considering only the last letter</h3>

<h4> After choosing features -> **Error analysis** </h4>

Divide the training data into -> Train | Dev-Test

In [30]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

len(labeled_names), len(train_names), len(devtest_names), len(test_names)

(7944, 6444, 1000, 500)
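The slicing above can be wrapped in a small helper so the partition sizes are explicit. This is a sketch under assumptions: `three_way_split` and its parameters are hypothetical names, and it shuffles a copy of the data with an optional seed so the split is reproducible.

```python
import random


def three_way_split(items, n_test=500, n_devtest=1000, seed=None):
    """Return (train, devtest, test) partitions of a shuffled copy of items."""
    items = list(items)
    if n_test + n_devtest >= len(items):
        raise ValueError("not enough items left over for a training set")
    random.Random(seed).shuffle(items)
    test = items[:n_test]
    devtest = items[n_test:n_test + n_devtest]
    train = items[n_test + n_devtest:]
    return train, devtest, test


# Integers stand in for the 7944 labeled names
data = list(range(7944))
train, devtest, test = three_way_split(data, seed=0)
print(len(train), len(devtest), len(test))  # 6444 1000 500
```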

In [31]:
train_set = [(gender_features(name), gender) for (name, gender) in train_names]
devtest_set = [(gender_features(name), gender) for (name, gender) in devtest_names]
test_set = [(gender_features(name), gender) for (name, gender) in test_names]

clf = NaiveBayesClassifier.train(train_set)

print(accuracy(clf, devtest_set))

clf.show_most_informative_features(30)

0.769
Most Informative Features
             last_letter = 'a'            female : male   =     33.7 : 1.0
             last_letter = 'k'              male : female =     22.4 : 1.0
             last_letter = 'f'              male : female =     13.2 : 1.0
             last_letter = 'd'              male : female =     10.6 : 1.0
             last_letter = 'p'              male : female =     10.5 : 1.0
             last_letter = 'o'              male : female =      9.6 : 1.0
             last_letter = 'm'              male : female =      9.2 : 1.0
             last_letter = 'v'              male : female =      9.1 : 1.0
             last_letter = 'r'              male : female =      7.1 : 1.0
             last_letter = 'w'              male : female =      6.6 : 1.0
             last_letter = 'g'              male : female =      5.3 : 1.0
             last_letter = 'u'              male : female =      5.1 : 1.0
             last_letter = 's'              male : female =      4.2

^ Evaluate on the dev-test set instead of the test set directly, so the test set stays unseen while the features are being tuned

<h3>Get to the bottom of why clf is making errors</h3> :)

In [32]:
errors = []

for (name, gender) in devtest_names:
    pred_gender = clf.classify(gender_features(name))
    if pred_gender != gender:
        errors.append((gender, pred_gender, name))
        
# Let's see the errors, sorted by correct label

for (gender, pred_gender, name) in sorted(errors):
    print('correct = {:<8s} guess = {:<8s} name = {:<30}'.format(gender, pred_gender, name))

correct = female   guess = male     name = Adel                          
correct = female   guess = male     name = Aigneis                       
correct = female   guess = male     name = Alexis                        
correct = female   guess = male     name = Alis                          
correct = female   guess = male     name = Allsun                        
correct = female   guess = male     name = Amabel                        
correct = female   guess = male     name = Annabel                       
correct = female   guess = male     name = Ardys                         
correct = female   guess = male     name = Brear                         
correct = female   guess = male     name = Brynn                         
correct = female   guess = male     name = Calypso                       
correct = female   guess = male     name = Cameo                         
correct = female   guess = male     name = Carleen                       
correct = female   guess = male     na

Names ending in yn appear to be predominantly female, despite the fact that names ending in n tend to be male; and names ending in ch are usually male, even though names that end in h tend to be female. (*wow*) <-someday
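One way to check a hunch like this is to tally genders per suffix. The sketch below uses a hypothetical helper, `suffix_gender_counts`, and a handful of names taken from the listings above. A sample this small only shows the mechanics; the real check would run over the training names.

```python
from collections import Counter, defaultdict


def suffix_gender_counts(labeled_names, n=2):
    """Map each n-letter suffix to a Counter of the genders seen with it."""
    counts = defaultdict(Counter)
    for name, gender in labeled_names:
        counts[name[-n:].lower()][gender] += 1
    return counts


# A few names ending in n, copied from the outputs above
sample = [('Bryn', 'female'), ('Linn', 'female'), ('Benson', 'male'),
          ('Thorn', 'male'), ('Hamilton', 'male')]

by_last_letter = suffix_gender_counts(sample, n=1)
by_last_two = suffix_gender_counts(sample, n=2)

print(dict(by_last_letter['n']))  # one-letter suffix lumps everything together
for suffix, genders in sorted(by_last_two.items()):
    print(suffix, dict(genders))  # two-letter suffixes separate the groups
```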

Result: use 2 features
1. last letter
2. last 2 letters

In [33]:
def gender_features3(word):
    return { 'suffix1' : word[-1:],
             'suffix2' : word[-2:] }

gender_features3('Jojjs')

{'suffix1': 's', 'suffix2': 'js'}

In [34]:
train_set = [(gender_features3(name), gender) for (name, gender) in train_names]
devtest_set = [(gender_features3(name), gender) for (name, gender) in devtest_names]
test_set = [(gender_features3(name), gender) for (name, gender) in test_names]

clf = NaiveBayesClassifier.train(train_set)
print(accuracy(clf, devtest_set))

0.789


<h3>0.769 -> 0.789: improvement after using 2 features</h3>

Summary II

1. Feature selection - do not overfit the model
2. Use a Train | Dev-Test | Test split
3. **Test on Dev-Test, check the errors, analyze them and change the features accordingly**
4. Re-test on **another** Dev-Test partition
Repeat 3 and 4