<h2>DATA 620 - Project 3</h2>
<h3>Name Gender Classifier</h3>

<h3>Team : Mohamed Thasleem, Kalikul Zaman and Jeyaraman Ramalingam</h3>

<h3>Assignment</h3>

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

<h3>Libraries</h3>

In [46]:
#import libraries
import nltk
import pandas as pd
import random
from nltk.corpus import names
from nltk.classify import apply_features

<h3>Data Preparation</h3>

Load the male and female name lists from the NLTK Names Corpus, combine them into a single labeled dataset, and shuffle the names randomly.

In [19]:
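# Set a seed so the shuffle is reproducible
random.seed(620)

# Build one list of (name, gender) pairs from the two corpus files.
# Note: this rebinds `names` from the corpus reader to a plain list, so
# re-running this cell requires re-running the import cell first.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])

# The corpus files are alphabetical, so shuffle to make sure
# every subset samples across all the names
random.shuffle(names)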
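The seed matters because the later slices depend on the shuffled order. A minimal sketch of the same idea, using a private generator instead of the global one; the helper name `seeded_shuffle` and the toy list are hypothetical and not part of the notebook:

```python
import random

def seeded_shuffle(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)   # independent of the global random state
    shuffled = list(items)      # copy so the input list stays untouched
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical toy data in alphabetical order, like the corpus files
toy = [("Adam", "male"), ("Alyss", "female"), ("Chet", "male"), ("Eunice", "female")]

first = seeded_shuffle(toy, 620)
second = seeded_shuffle(toy, 620)
print(first == second)  # same seed, same order
```

Using a dedicated `random.Random` instance is one way to keep the split stable even if other cells also draw random numbers.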

<h4>Sample Data</h4>

In [20]:
#list names
names[0:10]

[('Christie', 'female'),
 ('Tibold', 'male'),
 ('Chet', 'male'),
 ('Alyss', 'female'),
 ('Eunice', 'female'),
 ('Mehetabel', 'female'),
 ('Marj', 'female'),
 ('Adam', 'male'),
 ('Natka', 'female'),
 ('Sarene', 'female')]

<h4>Data Stats</h4>

In [34]:
print("Total count: " , len(names))

Total count:  7944
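A useful reference point for the accuracies reported later is the majority-class baseline, the accuracy of always guessing the more common gender. The sketch below assumes per-file counts of 2943 male and 5001 female names; these counts are not printed in this notebook (only their total is), so they should be checked against the corpus files:

```python
# Assumed per-file counts (not shown in this notebook); they sum to the
# 7944 total printed above, but should be verified against the corpus.
male_count = 2943
female_count = 5001

total = male_count + female_count
majority_baseline = max(male_count, female_count) / total

print("total:", total)
print("majority-class baseline accuracy: %.3f" % majority_baseline)
```

Under these assumed counts, a classifier needs to beat roughly 0.63 accuracy to do better than always answering 'female'.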


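Split the dataset as the assignment specifies: 500 names for the test set, 500 names for the dev-test set, and the remaining names for the training set. Because the shuffled corpus holds 7944 names, the training set has 6944 names rather than the 6900 quoted in the assignment.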

In [33]:
test_set = names[:500]
print("test set: " , len(test_set))
devtest_set = names[500:1000] 
print("dev test set: " , len(devtest_set))
train_set = names[1000:]
print("train set: " , len(train_set))

test set:  500
dev test set:  500
train set:  6944
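The slicing above can also be wrapped in a small helper that guards against overlapping or oversized subsets. This is a sketch only; `split_three_ways` is a hypothetical name, and a list of integers stands in for the shuffled (name, gender) pairs:

```python
def split_three_ways(items, n_test, n_devtest):
    """Slice an already shuffled list into test, dev-test, and training parts."""
    if n_test + n_devtest > len(items):
        raise ValueError("not enough items for the requested held-out sets")
    test = items[:n_test]
    devtest = items[n_test:n_test + n_devtest]
    train = items[n_test + n_devtest:]
    return test, devtest, train

# Stand-in for the 7944 shuffled (name, gender) pairs
items = list(range(7944))
test, devtest, train = split_three_ways(items, 500, 500)

print(len(test), len(devtest), len(train))
# The three parts are disjoint and together cover the whole list
print(test + devtest + train == items)
```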


<h3>Feature Set - Gender</h3>

A feature is an individual measurable property of the thing being observed. Here the features are character patterns in each name: the first and last letters, the two-letter prefix and suffix, and, for each vowel, how often it occurs and whether it occurs at all. The classifier uses these features to predict the output variable, gender.

In [57]:
def gender_features(name):
    """Extract letter-pattern features from a name."""
    lowered = name.lower()
    features = {}
    features["firstletter"] = lowered[0]
    features["lastletter"] = lowered[-1]
    features["suffix2"] = lowered[-2:]
    features["prefix2"] = lowered[:2]
    # Per-vowel count and presence flag
    for letter in 'aeiou':
        features["count(%s)" % letter] = lowered.count(letter)
        features["has(%s)" % letter] = (letter in lowered)
    return features
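To see what kind of dictionary the classifier receives for each name, here is a reduced illustration. `demo_features` is a hypothetical, simplified extractor (suffix features and vowel counts only), not the notebook's full feature function:

```python
def demo_features(name):
    """Hypothetical reduced feature extractor: suffixes and vowel counts only."""
    lowered = name.lower()
    features = {
        "lastletter": lowered[-1],
        "suffix2": lowered[-2:],
    }
    for letter in "aeiou":
        features["count(%s)" % letter] = lowered.count(letter)
    return features

# Two names taken from the sample data shown earlier
for name in ["Christie", "Chet"]:
    print(name, demo_features(name))
```

Each name becomes a flat dictionary of feature name to value, which is the input format NLTK classifiers expect when paired with a label.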

In [36]:
featuresets = [(gender_features(n), g) for (n,g) in names]

In [37]:
train_set_fe = featuresets[1000:]
test_set_fe = featuresets[:500]
devtest_set_fe = featuresets[500:1000]

<h3>Naive Bayes Classifier</h3>

Train a Naive Bayes classifier on the training features and measure its prediction accuracy.

In [47]:
classifier = nltk.NaiveBayesClassifier.train(train_set_fe)

<h4>Accuracy</h4>

In [48]:
# Show Accuracy
print("train_set: ", nltk.classify.accuracy(classifier, train_set_fe))
print("test_set: ", nltk.classify.accuracy(classifier, test_set_fe))
print("devtest_set: ", nltk.classify.accuracy(classifier, devtest_set_fe))

train_set:  0.8109158986175116
test_set:  0.802
devtest_set:  0.778


<h4>Naive Bayes Most Informative Features</h4>

In [49]:
# Show important features
classifier.show_most_informative_features(20)

Most Informative Features
                 suffix2 = 'na'           female : male   =     93.8 : 1.0
                 suffix2 = 'la'           female : male   =     71.8 : 1.0
                 suffix2 = 'ia'           female : male   =     52.5 : 1.0
              lastletter = 'a'            female : male   =     34.6 : 1.0
                 suffix2 = 'sa'           female : male   =     32.6 : 1.0
                 suffix2 = 'rd'             male : female =     29.4 : 1.0
              lastletter = 'f'              male : female =     28.5 : 1.0
              lastletter = 'k'              male : female =     28.0 : 1.0
                 suffix2 = 'us'             male : female =     27.5 : 1.0
                 suffix2 = 'ra'           female : male   =     24.2 : 1.0
                 suffix2 = 'ta'           female : male   =     24.1 : 1.0
                 suffix2 = 'io'             male : female =     23.6 : 1.0
                 suffix2 = 'ld'             male : female =     23.4 : 1.0
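Each row above can be read as a likelihood ratio: how much more probable a feature value is under one label than under the other. The sketch below computes such a ratio on a made-up sample. The additive smoothing constant of 0.5 and the helper names are assumptions chosen for illustration, so the numbers will not match NLTK's output exactly:

```python
from collections import Counter

def smoothed_prob(count, total, n_values, alpha=0.5):
    """P(value | label) with additive smoothing (alpha=0.5 is an assumption)."""
    return (count + alpha) / (total + alpha * n_values)

def likelihood_ratio(examples, feature_value):
    """Ratio of P(value | female) to P(value | male) on (value, label) pairs."""
    label_totals = Counter(label for _, label in examples)
    value_counts = Counter(examples)
    n_values = len({value for value, _ in examples})
    p_female = smoothed_prob(value_counts[(feature_value, "female")],
                             label_totals["female"], n_values)
    p_male = smoothed_prob(value_counts[(feature_value, "male")],
                           label_totals["male"], n_values)
    return p_female / p_male

# Hypothetical toy sample of (suffix2, gender) pairs, not the real corpus
toy = ([("na", "female")] * 9 + [("ie", "female")] * 1 +
       [("na", "male")] * 1 + [("rd", "male")] * 9)

print("female : male for 'na' = %.1f : 1.0" % likelihood_ratio(toy, "na"))
```

A ratio far from 1 in either direction marks a feature value that strongly separates the two labels, which is why suffixes such as 'na' and 'rd' appear near the top of the list.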

<h4>Error Analysis on the Dev-Test Set - NB</h4>

In [50]:
# Collect the dev-test names the classifier gets wrong
errors = []
for (name, tag) in devtest_set:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

In [51]:
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=female   guess=male     name=Aurore                        
correct=female   guess=male     name=Austin                        
correct=female   guess=male     name=Barbe                         
correct=female   guess=male     name=Barby                         
correct=female   guess=male     name=Bebe                          
correct=female   guess=male     name=Bird                          
correct=female   guess=male     name=Birgit                        
correct=female   guess=male     name=Bunnie                        
correct=female   guess=male     name=Cameo                         
correct=female   guess=male     name=Caron                         
correct=female   guess=male     name=Clemmy                        
correct=female   guess=male     name=Cloris                        
correct=female   guess=male     name=Coleen                        
correct=female   guess=male     name=Colleen                       
correct=female   guess=male     name=Corliss    
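One way to turn an error listing into ideas for new features is to tally a simple property of the misclassified names. The sketch below counts last letters for the fifteen names shown above; this is only the visible part of the listing, so the tally is illustrative rather than a summary of all errors:

```python
from collections import Counter

# The misclassified female names listed above (a partial listing)
misclassified = ["Aurore", "Austin", "Barbe", "Barby", "Bebe", "Bird", "Birgit",
                 "Bunnie", "Cameo", "Caron", "Clemmy", "Cloris", "Coleen",
                 "Colleen", "Corliss"]

# Tally the final letter of each misclassified name
last_letters = Counter(name[-1].lower() for name in misclassified)
for letter, count in last_letters.most_common():
    print("%s: %d" % (letter, count))
```

Endings that show up repeatedly among the errors are candidates for longer suffix features or other adjustments in the next iteration.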

<h4>Dev-Test Error Count - NB</h4>

In [52]:
print("Error count: ", len(errors))

Error count:  111


<h3>Decision Tree Classifier</h3>

Train a decision tree classifier on the same training features and measure its accuracy.

In [53]:
classifier_tree = nltk.DecisionTreeClassifier.train(train_set_fe)



<h4>Accuracy</h4>

In [58]:
print("train_set: ", nltk.classify.accuracy(classifier_tree, train_set_fe))
print("test_set: ", nltk.classify.accuracy(classifier_tree, test_set_fe))
print("devtest_set: ", nltk.classify.accuracy(classifier_tree, devtest_set_fe))

train_set:  0.9344758064516129
test_set:  0.74
devtest_set:  0.736


<h4>Error Analysis on the Dev-Test Set - DT</h4>

In [54]:
# Collect the dev-test names the decision tree gets wrong
errors2 = []
for (name, tag) in devtest_set:
    guess = classifier_tree.classify(gender_features(name))
    if guess != tag:
        errors2.append((tag, guess, name))

In [55]:
# List the decision tree errors (errors2), not the Naive Bayes errors
for (tag, guess, name) in sorted(errors2):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

<h4>Dev-Test Error Count - DT</h4>

In [56]:
print("Error count: ", len(errors2))

Error count:  132
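The error counts and the dev-test accuracies are two views of the same measurement, which gives a quick consistency check. A small sketch using the counts reported in this notebook; the function name is illustrative:

```python
def accuracy_from_errors(n_errors, n_examples):
    """Accuracy implied by a number of misclassified examples."""
    return 1 - n_errors / n_examples

devtest_size = 500
nb_accuracy = accuracy_from_errors(111, devtest_size)  # Naive Bayes
dt_accuracy = accuracy_from_errors(132, devtest_size)  # Decision Tree

print("Naive Bayes:   %.3f" % nb_accuracy)
print("Decision Tree: %.3f" % dt_accuracy)
```

Both values agree with the dev-test accuracies printed by the accuracy cells (0.778 and 0.736).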

<h3>Classifier Comparison</h3>

In [66]:
compare = {
        'Data Set': ['train_set','test_set','devtest_set','error count'],
        'Naive Bayes': [0.81,0.80,0.78,111],
        'Decision Tree': [0.93,0.74,0.74,132]
        }

df = pd.DataFrame(compare, columns = ['Data Set', 'Naive Bayes','Decision Tree'])

print(df)

      Data Set  Naive Bayes  Decision Tree
0    train_set         0.81           0.93
1     test_set         0.80           0.74
2  devtest_set         0.78           0.74
3  error count       111.00         132.00
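A compact way to compare how well each model generalizes is the gap between training accuracy and dev-test accuracy. The sketch below uses the accuracies printed by the accuracy cells, with the training values rounded to four decimals; the dictionary layout is just one possible arrangement:

```python
# Accuracies as printed by the notebook's accuracy cells
results = {
    "Naive Bayes": {"train": 0.8109, "devtest": 0.778, "test": 0.802},
    "Decision Tree": {"train": 0.9345, "devtest": 0.736, "test": 0.74},
}

# Gap between training and dev-test accuracy for each model
gaps = {}
for model, acc in results.items():
    gaps[model] = acc["train"] - acc["devtest"]
    print("%-13s train - devtest = %.3f" % (model, gaps[model]))
```

A large gap, as for the decision tree here, is a common sign that a model has fit details of the training set that do not carry over to unseen names.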


<h3>Conclusion</h3>

<h4>How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?</h4>




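For both models, test-set accuracy is close to dev-test accuracy: 0.802 versus 0.778 for Naive Bayes, and 0.74 versus 0.736 for the Decision Tree. This is what we would expect, because both sets are random 500-name samples from the same shuffled corpus, so small differences between them are consistent with sampling variation.

The Decision Tree has the higher training accuracy (0.93 versus 0.81), but it makes more dev-test errors (132 versus 111) and scores lower on both held-out sets, which suggests it overfits the training data. Considering all three data sets, Naive Bayes outperformed the Decision Tree.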
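To judge whether the test and dev-test accuracies really differ, it helps to estimate how much an accuracy measured on 500 names can vary by chance. The sketch below uses the binomial standard error and treats the two sets as independent samples, which is a simplifying assumption:

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy estimated from n examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

n = 500
test_acc, devtest_acc = 0.802, 0.778  # Naive Bayes results reported above

se_test = accuracy_standard_error(test_acc, n)
se_devtest = accuracy_standard_error(devtest_acc, n)
# Standard error of the difference, treating the two sets as independent
se_diff = math.sqrt(se_test ** 2 + se_devtest ** 2)

print("difference: %.3f" % (test_acc - devtest_acc))
print("standard error of difference: %.3f" % se_diff)
```

Under these assumptions the observed difference is smaller than one standard error of the difference, so it does not indicate a real change in performance between the two sets.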