# Project 3

**Goal**

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import names 
import random
from IPython.display import display, HTML

**Load the Corpus**

The Names Corpus is loaded, each name is labeled with its gender, and the combined list is shuffled.

The length of the data is displayed below.

In [2]:
random.seed(123)
n = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])
random.shuffle(n)
len(n)

7944

**Display the head records**

In [5]:
n[:5]

[('Cordelie', 'female'),
 ('Peggie', 'female'),
 ('Solange', 'female'),
 ('Rana', 'female'),
 ('Jessy', 'female')]

**Splitting the corpus**

500 words for the test set, 500 words for the dev-test set, and the remaining words for the training set.

The length of each set is displayed below.

In [6]:
test,devtest,train = n[:500],n[500:1000],n[1000:]
print("Length of Test set:",len(test),",","Length of Dev Test set:", len(devtest),",","Length of Train set", len(train))

Length of Test set: 500 , Length of Dev Test set: 500 , Length of Train set 6944
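The slicing above can also be wrapped in a small helper that checks the requested sizes and keeps the shuffle reproducible. This is only a sketch: `split_corpus` and the toy data are hypothetical names used for illustration, not part of the original notebook.

```python
import random

def split_corpus(items, n_test, n_devtest, seed=123):
    """Shuffle a copy of items and split it into (test, devtest, train)."""
    if n_test + n_devtest > len(items):
        raise ValueError("not enough items for the requested split sizes")
    shuffled = list(items)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # local RNG, global random state is unchanged
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return test, devtest, train

# Toy data standing in for the labeled names list
toy = [("name%d" % i, "female" if i % 2 else "male") for i in range(20)]
toy_test, toy_devtest, toy_train = split_corpus(toy, n_test=5, n_devtest=5)
print(len(toy_test), len(toy_devtest), len(toy_train))  # 5 5 10
```

Because the three slices come from one shuffled list, no name can land in more than one subset.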


**Selecting Features**

The first step is selecting the features. I chose 5 feature sets to explore:

1. First Letter: *The first letter of each word is analysed*
2. Last letter: *The last letter of each word is analysed*
3. First letter, last letter: *The first and the last letters of each word are analysed, along with a count and a presence flag for every letter of the alphabet*
4. First letter, last letter and second last: *Everything in feature 3, plus the last 2 letters of each word as a single suffix*
5. first letter, last letter and second and third last: *Everything in feature 4, plus the last 3 letters of each word as a single suffix*
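The code for features 4 and 5 later in this notebook uses slices (name[-2:], name[-3:]) rather than single positions, so the values are suffixes, not individual letters. The sketch below illustrates the difference with a hypothetical extractor; `suffix_features` and its feature names are assumptions made for this example and are not the functions used in the notebook.

```python
def suffix_features(name, max_suffix=3):
    """Hypothetical extractor: first letter plus suffixes of length 1..max_suffix."""
    lowered = name.lower()
    features = {"first_letter": lowered[0]}
    for k in range(1, max_suffix + 1):
        # lowered[-k:] is the last k letters as one string;
        # lowered[-k] would be a single letter instead
        features["suffix_%d" % k] = lowered[-k:]
    return features

print(suffix_features("Lawrence"))
# {'first_letter': 'l', 'suffix_1': 'e', 'suffix_2': 'ce', 'suffix_3': 'nce'}
print("Lawrence"[-2], "Lawrence"[-2:])  # c ce
```

For names shorter than the requested suffix length, the slice simply returns the whole name.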

The *infer* DataFrame stores the dev-test and test accuracy for each of the 5 features.

In [7]:
infer = pd.DataFrame(index=['Devtest Accuracy','Test Performamce'])

**Error Function**

In [8]:
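def error_fn(gf):
    # Collect the dev-test names that the current global classifier mislabels.
    # gf must be the feature function that was used to train that classifier.
    errors = []
    for (name, tag) in devtest:
        guess = classifier.classify(gf(name))
        if guess != tag:
            # Store (observed, predicted, name) for the error table
            errors.append( (tag, guess, name) )
    return errors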
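The function above reads *classifier* and *devtest* from the notebook's global state, so it only works after the matching training cell has run. A variant that takes them as arguments might look like the sketch below. `collect_errors`, `LastLetterRule` and `last_letter` are hypothetical names; the stand-in classifier exists only so the example runs without a trained model.

```python
def collect_errors(classifier, feature_fn, labeled_names):
    """Return (observed, predicted, name) for every misclassified name."""
    errors = []
    for name, tag in labeled_names:
        guess = classifier.classify(feature_fn(name))
        if guess != tag:
            errors.append((tag, guess, name))
    return errors

class LastLetterRule:
    """Stand-in classifier for illustration: names ending in 'a' are 'female'."""
    def classify(self, featureset):
        return "female" if featureset["last_letter"] == "a" else "male"

def last_letter(name):
    return {"last_letter": name[-1].lower()}

sample = [("Rana", "female"), ("Sam", "female"), ("Fabio", "male")]
print(collect_errors(LastLetterRule(), last_letter, sample))
# [('female', 'male', 'Sam')]
```

Any object with a `classify` method, including a trained NLTK classifier, can be passed in the same way.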

**Feature 1**

First Letter: *The first letter of each word is analysed*

The function below returns the first letter of a word as a feature dictionary.

In [9]:
def gender_features4(word):
    return {'first_letter': word[0]}

**Naive Bayes used for training on feature 1**

A NaiveBayesClassifier is trained on the train set and its accuracy is measured on the devtest set. I also record the accuracy on the test set so the two can be compared later on.

In [10]:
train_set = [(gender_features4(n), g) for (n,g) in train]
devtest_set = [(gender_features4(n), g) for (n,g) in devtest]
classifier = nltk.NaiveBayesClassifier.train(train_set)
x=nltk.classify.accuracy(classifier, devtest_set)
print("Devtest accuracy",x)
test_set=[(gender_features4(n), g) for (n,g) in test]
y=nltk.classify.accuracy(classifier, test_set)
print("Performance on the test set for feature 1:",y)
errors=error_fn(gender_features4)
error_df=pd.DataFrame(errors,columns=['Observed','Predicted','Name'])
display(error_df)
len(error_df)
infer[0]=[x,y]

Devtest accuracy 0.624
Performance on the test set for feature 1: 0.646


| | Observed | Predicted | Name |
|---|---|---|---|
| 0 | male | female | Dugan |
| 1 | male | female | Park |
| 2 | male | female | Ferinand |
| 3 | female | male | Ulrike |
| 4 | male | female | Lawrence |
| ... | ... | ... | ... |
| 183 | male | female | Fabio |
| 184 | male | female | Rickie |
| 185 | male | female | Pooh |
| 186 | male | female | Jehu |
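For intuition about what a Naive Bayes classifier does with a single feature such as the first letter, here is a simplified from-scratch sketch. It is not NLTK's implementation, whose smoothing and other details may differ: the add-one smoothing, the function names `train_nb` and `classify_nb`, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_featuresets, alpha=1.0):
    """Count labels and (label, feature, value) occurrences."""
    label_counts = Counter(label for _, label in labeled_featuresets)
    value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values_seen = defaultdict(set)       # feature name -> distinct values
    for featureset, label in labeled_featuresets:
        for fname, fval in featureset.items():
            value_counts[(label, fname)][fval] += 1
            values_seen[fname].add(fval)
    return label_counts, value_counts, values_seen, alpha

def classify_nb(model, featureset):
    """Pick the label with the highest log P(label) + sum of log P(value | label)."""
    label_counts, value_counts, values_seen, alpha = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)  # prior
        for fname, fval in featureset.items():
            count = value_counts[(label, fname)][fval]
            n_values = max(len(values_seen[fname]), 1)
            # Smoothed likelihood, so unseen values never give log(0)
            score += math.log((count + alpha) / (n_label + alpha * n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_train = (
    [({"last_letter": "a"}, "female")] * 3
    + [({"last_letter": "o"}, "male")] * 2
    + [({"last_letter": "a"}, "male")]
)
model = train_nb(toy_train)
print(classify_nb(model, {"last_letter": "a"}))  # female
print(classify_nb(model, {"last_letter": "o"}))  # male
```

With only one feature, the decision reduces to comparing how often each value appears with each gender, weighted by how common each gender is in the training data.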


**Feature 2**

Last letter: *The last letter of each word is analysed*

The function below returns the last letter of a word as a feature dictionary.

In [11]:
def gender_features(word):
    return {'last_letter': word[-1]}

**Naive Bayes used for training on feature 2**

A NaiveBayesClassifier is trained on the train set and its accuracy is measured on the devtest set. I also record the accuracy on the test set so the two can be compared later on.

In [12]:
train_set = [(gender_features(n), g) for (n,g) in train]
devtest_set = [(gender_features(n), g) for (n,g) in devtest]
classifier = nltk.NaiveBayesClassifier.train(train_set)
x=nltk.classify.accuracy(classifier, devtest_set)
print("Devtest accuracy",x)
test_set=[(gender_features(n), g) for (n,g) in test]
y=nltk.classify.accuracy(classifier, test_set)
print("Performance on the test set for feature 2:",y)
errors=error_fn(gender_features)
error_df=pd.DataFrame(errors,columns=['Observed','Predicted','Name'])
display(error_df)
len(error_df)
infer[1]=[x,y]

Devtest accuracy 0.78
Performance on the test set for feature 2: 0.778


| | Observed | Predicted | Name |
|---|---|---|---|
| 0 | female | male | Annabal |
| 1 | male | female | Lawrence |
| 2 | female | male | Sam |
| 3 | female | male | Margo |
| 4 | male | female | Saxe |
| ... | ... | ... | ... |
| 105 | female | male | Ardys |
| 106 | male | female | Pooh |
| 107 | male | female | Guthrey |
| 108 | female | male | Vivian |
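Each feature in this notebook repeats the same steps: build feature sets, train, then score on the dev-test and test sets. Those steps could be collected into one helper, sketched below. `evaluate` and `MajorityClassifier` are hypothetical names; the default trainer is `nltk.NaiveBayesClassifier.train` as used in the cells, while the majority-class stand-in is only there so the example runs without NLTK.

```python
from collections import Counter

def evaluate(feature_fn, train, devtest, test, train_fn=None):
    """Train on `train`; return (classifier, devtest_accuracy, test_accuracy)."""
    if train_fn is None:
        import nltk  # imported lazily so another trainer can be swapped in
        train_fn = nltk.NaiveBayesClassifier.train

    def featurize(pairs):
        return [(feature_fn(name), label) for name, label in pairs]

    def accuracy(classifier, featuresets):
        hits = sum(classifier.classify(fs) == label for fs, label in featuresets)
        return hits / len(featuresets)

    classifier = train_fn(featurize(train))
    return (classifier,
            accuracy(classifier, featurize(devtest)),
            accuracy(classifier, featurize(test)))

class MajorityClassifier:
    """Stand-in trainer for illustration: always predicts the most common label."""
    def __init__(self, labeled_featuresets):
        labels = Counter(label for _, label in labeled_featuresets)
        self.label = labels.most_common(1)[0][0]
    def classify(self, featureset):
        return self.label

toy_train = [("Rana", "female"), ("Peggie", "female"), ("Fabio", "male")]
toy_devtest = [("Solange", "female"), ("Jehu", "male")]
toy_test = [("Jessy", "female"), ("Cordelie", "female")]
_, dev_acc, test_acc = evaluate(lambda n: {"last_letter": n[-1]},
                                toy_train, toy_devtest, toy_test,
                                train_fn=MajorityClassifier)
print(dev_acc, test_acc)  # 0.5 1.0
```

In the notebook itself, a call such as `evaluate(gender_features, train, devtest, test)` would use the NLTK trainer and return the two accuracies that the cells store in *infer*.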


**Feature 3**

First letter, last letter: *The first and the last letters of each word are analysed*

The function below returns a dictionary *features* holding the first and last letters of the word, plus a count and a presence flag for each letter of the alphabet.

In [13]:
def gender_features1(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

**Naive Bayes used for training on feature 3**

A NaiveBayesClassifier is trained on the train set and its accuracy is measured on the devtest set. I also record the accuracy on the test set so the two can be compared later on.

In [14]:
train_set = [(gender_features1(n), g) for (n,g) in train]
devtest_set = [(gender_features1(n), g) for (n,g) in devtest]
classifier = nltk.NaiveBayesClassifier.train(train_set)
x=nltk.classify.accuracy(classifier, devtest_set)
print("Devtest accuracy",x)
test_set=[(gender_features1(n), g) for (n,g) in test]
y=nltk.classify.accuracy(classifier, test_set)
print("Performance on the test set for feature 3:",y)
errors=error_fn(gender_features1)
error_df=pd.DataFrame(errors,columns=['Observed','Predicted','Name'])
display(error_df)
len(error_df)
infer[2]=[x,y]

Devtest accuracy 0.782
Performance on the test set for feature 3: 0.796


| | Observed | Predicted | Name |
|---|---|---|---|
| 0 | female | male | Ulrike |
| 1 | male | female | Lawrence |
| 2 | female | male | Daffy |
| 3 | female | male | Sam |
| 4 | female | male | Margo |
| ... | ... | ... | ... |
| 104 | female | male | Delores |
| 105 | male | female | Nickolas |
| 106 | female | male | Penny |
| 107 | male | female | Rickie |


**Feature 4**

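First letter, last letter and second last: *The first letter and the last 2 letters of each word are analysed*

The function below returns the same dictionary *features* as feature 3, extended with *secondlastletter*. Despite its name, this entry holds the last two letters of the word (name[-2:]) as a single suffix, not the second-last letter alone.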

In [15]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    features["secondlastletter"] = name[-2:].lower()
    return features

**Naive Bayes used for training on feature 4**

A NaiveBayesClassifier is trained on the train set and its accuracy is measured on the devtest set. I also record the accuracy on the test set so the two can be compared later on.

In [16]:
train_set = [(gender_features2(n), g) for (n,g) in train]
devtest_set = [(gender_features2(n), g) for (n,g) in devtest]
classifier = nltk.NaiveBayesClassifier.train(train_set)
x=nltk.classify.accuracy(classifier, devtest_set)
print("Devtest accuracy",x)
test_set=[(gender_features2(n), g) for (n,g) in test]
y=nltk.classify.accuracy(classifier, test_set)
print("Performance on the test set for feature 4:",y)
errors=error_fn(gender_features2)
error_df=pd.DataFrame(errors,columns=['Observed','Predicted','Name'])
display(error_df)
len(error_df)
infer[3]=[x,y]

Devtest accuracy 0.788
Performance on the test set for feature 4: 0.82


| | Observed | Predicted | Name |
|---|---|---|---|
| 0 | female | male | Ulrike |
| 1 | male | female | Lawrence |
| 2 | female | male | Daffy |
| 3 | female | male | Sam |
| 4 | female | male | Margo |
| ... | ... | ... | ... |
| 101 | male | female | Maurie |
| 102 | female | male | Delores |
| 103 | female | male | Penny |
| 104 | male | female | Rickie |


**Feature 5**

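First letter, last letter and second and third last: *The first letter and the last 3 letters of each word are analysed*

The function below returns the same dictionary *features* as feature 4, extended with *thirdlastletter*. Despite its name, this entry holds the last three letters of the word (name[-3:]) as a single suffix, not the third-last letter alone.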

In [17]:
def gender_features3(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    features["secondlastletter"] = name[-2:].lower()
    features["thirdlastletter"] = name[-3:].lower()
    return features

**Naive Bayes used for training on feature 5**

A NaiveBayesClassifier is trained on the train set and its accuracy is measured on the devtest set. I also record the accuracy on the test set so the two can be compared later on.

In [18]:
train_set = [(gender_features3(n), g) for (n,g) in train]
devtest_set = [(gender_features3(n), g) for (n,g) in devtest]
classifier = nltk.NaiveBayesClassifier.train(train_set)
x=nltk.classify.accuracy(classifier, devtest_set)
print("Devtest accuracy",x)
test_set=[(gender_features3(n), g) for (n,g) in test]
y=nltk.classify.accuracy(classifier, test_set)
print("Performance on the test set for feature 5:",y)
errors=error_fn(gender_features3)
error_df=pd.DataFrame(errors,columns=['Observed','Predicted','Name'])
display(error_df)
len(error_df)
infer[4]=[x,y]

Devtest accuracy 0.81
Performance on the test set for feature 5: 0.814


| | Observed | Predicted | Name |
|---|---|---|---|
| 0 | female | male | Ulrike |
| 1 | female | male | Daffy |
| 2 | female | male | Sam |
| 3 | female | male | Margo |
| 4 | female | male | Vicky |
| ... | ... | ... | ... |
| 90 | male | female | Maurie |
| 91 | female | male | Delores |
| 92 | female | male | Penny |
| 93 | male | female | Rickie |


The infer dataframe is displayed below

In [20]:
infer.columns=['First','Last','First Last','First 2 Lasts','First 3 Lasts']

display(infer)

| | First | Last | First Last | First 2 Lasts | First 3 Lasts |
|---|---|---|---|---|---|
| Devtest Accuracy | 0.624 | 0.78 | 0.782 | 0.788 | 0.81 |
| Test Performamce | 0.646 | 0.778 | 0.796 | 0.82 | 0.814 |
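The exercise asks for the dev-test set to guide choices and the test set to be read only at the end. A minimal sketch of that selection step, using the accuracies copied from the table above, could look like this; the `results` dictionary and its keys are a hypothetical restructuring of *infer*, not code from the notebook.

```python
# Accuracies copied from the summary table above
results = {
    "First":         {"devtest": 0.624, "test": 0.646},
    "Last":          {"devtest": 0.78,  "test": 0.778},
    "First Last":    {"devtest": 0.782, "test": 0.796},
    "First 2 Lasts": {"devtest": 0.788, "test": 0.82},
    "First 3 Lasts": {"devtest": 0.81,  "test": 0.814},
}

# Selection looks at dev-test accuracy only; the test score is read afterwards
best = max(results, key=lambda name: results[name]["devtest"])
print(best, results[best])
# First 3 Lasts {'devtest': 0.81, 'test': 0.814}

# Gap between test and dev-test accuracy for every feature set
for name, acc in results.items():
    print(name, "test - devtest = %+.3f" % (acc["test"] - acc["devtest"]))
```

Selecting on the test column instead would favour "First 2 Lasts", which is why it matters which column drives the choice.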


**Conclusion**

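From the table above, feature 4 (first letter and last 2 letters) gave the best test accuracy of the 5 features I chose (0.82), although feature 5 (first letter and last 3 letters) scored slightly higher on the dev-test set (0.81 versus 0.788).

The dev-test and test accuracies differ by at most 0.032 for any feature. That gap is small relative to the sampling variation expected from sets of only 500 names each, so the test performance is in line with what the dev-test set suggested.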
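One way to judge whether such gaps are meaningful is a rough standard-error calculation. The sketch below assumes the normal approximation to the binomial and treats the dev-test and test sets as independent samples; both function names are hypothetical, and the inputs are the feature 4 accuracies from the table.

```python
import math

def accuracy_margin(accuracy, n, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n items."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)

def gap_in_standard_errors(acc_a, acc_b, n_a, n_b):
    """Difference between two accuracies, in units of its standard error."""
    se = math.sqrt(acc_a * (1 - acc_a) / n_a + acc_b * (1 - acc_b) / n_b)
    return abs(acc_a - acc_b) / se

# Feature 4: dev-test 0.788 and test 0.82, each measured on 500 names
print(round(accuracy_margin(0.788, 500), 3))                    # 0.036
print(round(gap_in_standard_errors(0.788, 0.82, 500, 500), 2))  # 1.28
```

Under these assumptions, even the largest gap in the table is a little over one standard error, which is what two random samples of this size could easily produce.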

**Video**

https://youtu.be/CrNuNJrHK58 