## Project 3
### Amanda Arce, Monu Chacko, Abdelmalek Hajjam, Nick Schettini

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.
Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev- test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

In [1]:
import nltk
from nltk.corpus import names
import random
import pandas as pd
import itertools
from string import ascii_lowercase


#nltk.download('names')

In [2]:
names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])
#shuffle the names
random.shuffle(names)

#### Let divide the data into test, dev and training datasets with 500, 500, x data split

In [3]:
#print(len(names))
#unpacking the names to 3 sets
test, dev_test, training = names[:500], names[500:1000], names[1000:]

## Accuracy

#### The gender feature 1 extractor uses first letter, last letter and suffix as its feature

In [4]:
def gender_features1(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    return features

#### Train data using Naive Bayes 

In [5]:
train_set = [(gender_features1(n), g) for (n,g) in training]
dev_test_set = [(gender_features1(n), g) for (n,g) in dev_test]
classifier = nltk.NaiveBayesClassifier.train(train_set)

acc_dev_test_1 = nltk.classify.accuracy(classifier, dev_test_set)
print("The accuracy for the dev using Feature 1 is: " + str(acc_dev_test_1))

The accuracy for the dev using Feature 1 is: 0.734


In [6]:
# Performance test - Feature 1
test_set = [(gender_features1(n), g) for (n,g) in test]
test_set_1 = nltk.classify.accuracy(classifier, test_set)
print("The accuracy for the test using Feature 1 is: " + str(test_set_1))

The accuracy for the test using Feature 1 is: 0.812


#### The gender feature 2 extractor uses first letter, last letter and two suffixes as its feature

In [7]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    features["suffix3"] = name[-3:].lower()
    return features

#### Train feature 2 using Naive Bayes Classifier

In [8]:
train_set = [(gender_features2(n), g) for (n,g) in training]
dev_test_set = [(gender_features2(n), g) for (n,g) in dev_test]
classifier = nltk.NaiveBayesClassifier.train(train_set)

acc_dev_test_2 = nltk.classify.accuracy(classifier, dev_test_set)
print("The accuracy for the dev using Feature 2 is: " + str(acc_dev_test_2))

The accuracy for the dev using Feature 2 is: 0.744


In [9]:
# Performance test - Feature 2
test_set = [(gender_features2(n), g) for (n,g) in test]
test_set_2 = nltk.classify.accuracy(classifier, test_set)
print("The accuracy for the test using Feature 2 is: " + str(test_set_2))

The accuracy for the test using Feature 2 is: 0.824


#### The gender feature 3 extractor uses first letter, last letter and three suffixes as its feature

In [10]:
def gender_features3(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    features["suffix3"] = name[-3:].lower()
    features["prefix3"] = name[:3].lower()
    return features

#### Train feature 3 data using Naive Bayes

In [11]:
train_set = [(gender_features3(n), g) for (n,g) in training]
dev_test_set = [(gender_features3(n), g) for (n,g) in dev_test]
classifier = nltk.NaiveBayesClassifier.train(train_set)

acc_dev_test_3 = nltk.classify.accuracy(classifier, dev_test_set)
print("The accuracy for the dev using Feature 3 is: " + str(acc_dev_test_3))

The accuracy for the dev using Feature 3 is: 0.766


In [12]:
# Performance test - Feature 3
test_set = [(gender_features3(n), g) for (n,g) in test]
test_set_3 = nltk.classify.accuracy(classifier, test_set)
print("The accuracy for the test using Feature 3 is: " + str(test_set_3))

The accuracy for the test using Feature 3 is: 0.852


In [13]:
def gender_features4(name):
    
    features = {}
    keywords = [''.join(i) for i in itertools.product(ascii_lowercase, repeat = 2)]
    
    #look at first, first2, last, last2 letters of name
    #apply .lower() method to convert all text to lowercase
    features["first_letter"] = name[0].lower()
    features["first_2letter"] = name[0:1].lower()
    features["last_letter"] = name[-1].lower()
    features["last_2letter"] = name[-2:-1].lower()
    
    for letter in ascii_lowercase:
        features["has({})".format(letter)] = (letter in name.lower())

        for keyword in keywords:
            features["combo2({})".format(keyword)] = (keyword in name.lower())
            
        return features

In [14]:
train_set = [(gender_features4(n), g) for (n,g) in training]
dev_test_set = [(gender_features4(n), g) for (n,g) in dev_test]
classifier = nltk.NaiveBayesClassifier.train(train_set)

acc_dev_test_4 = nltk.classify.accuracy(classifier, dev_test_set)
print("The accuracy for the dev using Feature 3 is: " + str(acc_dev_test_4))

The accuracy for the dev using Feature 3 is: 0.772


In [15]:
# Performance test - Feature 4
test_set = [(gender_features4(n), g) for (n,g) in test]
test_set_4 = nltk.classify.accuracy(classifier, test_set)
print("The accuracy for the test using Feature 4 is: " + str(test_set_4))

The accuracy for the test using Feature 4 is: 0.82


## Errors

In [16]:
def error_analysis(gender_features):
    errors = []
    for (name, tag) in dev_test:
        guess = classifier.classify(gender_features(name))
        if guess != tag:
            errors.append((tag, guess, name))
    print('no. of errors: ', len(errors))        
        
    #for (tag, guess, name) in sorted(errors): 
    #    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))
    return errors

In [17]:
lst1 = error_analysis(gender_features1)
lst1[0: 10]

no. of errors:  166


[('male', 'female', 'Kaspar'),
 ('male', 'female', 'Upton'),
 ('male', 'female', 'Wilber'),
 ('male', 'female', 'Erny'),
 ('male', 'female', 'Erl'),
 ('male', 'female', 'Say'),
 ('male', 'female', 'Serge'),
 ('male', 'female', 'Horace'),
 ('male', 'female', 'Efram'),
 ('male', 'female', 'Wolf')]

In [18]:
lst2 = error_analysis(gender_features2)
lst2[0:10]

no. of errors:  166


[('male', 'female', 'Kaspar'),
 ('male', 'female', 'Upton'),
 ('male', 'female', 'Wilber'),
 ('male', 'female', 'Erny'),
 ('male', 'female', 'Erl'),
 ('male', 'female', 'Say'),
 ('male', 'female', 'Serge'),
 ('male', 'female', 'Horace'),
 ('male', 'female', 'Efram'),
 ('male', 'female', 'Wolf')]

In [19]:
lst3 = error_analysis(gender_features3)
lst3[0:10] 

no. of errors:  166


[('male', 'female', 'Kaspar'),
 ('male', 'female', 'Upton'),
 ('male', 'female', 'Wilber'),
 ('male', 'female', 'Erny'),
 ('male', 'female', 'Erl'),
 ('male', 'female', 'Say'),
 ('male', 'female', 'Serge'),
 ('male', 'female', 'Horace'),
 ('male', 'female', 'Efram'),
 ('male', 'female', 'Wolf')]

In [20]:
lst4 = error_analysis(gender_features4)
lst4[0:10] 

no. of errors:  114


[('female', 'male', 'Row'),
 ('female', 'male', 'Patrice'),
 ('male', 'female', 'Erny'),
 ('female', 'male', 'Roz'),
 ('female', 'male', 'Sheree'),
 ('male', 'female', 'Kimball'),
 ('male', 'female', 'Cornellis'),
 ('male', 'female', 'Germaine'),
 ('female', 'male', 'Wynn'),
 ('male', 'female', 'Charlie')]

## Accuracy Comparition

In [21]:
print("Accuracy Dev Feature 1: " + str(acc_dev_test_1))
print("Accuracy Test Feature 1: " + str(test_set_1))

Accuracy Dev Feature 1: 0.734
Accuracy Test Feature 1: 0.812


In [22]:
print("Accuracy Dev Feature 2: " + str(acc_dev_test_2))
print("Accuracy Test Feature 2: " + str(test_set_2))

Accuracy Dev Feature 2: 0.744
Accuracy Test Feature 2: 0.824


In [23]:
print("Accuracy Dev Feature 3: " + str(acc_dev_test_3))
print("Accuracy Test Feature 3: " + str(test_set_3))

Accuracy Dev Feature 3: 0.766
Accuracy Test Feature 3: 0.852


In [24]:
print("Accuracy Dev Feature 4: " + str(acc_dev_test_4))
print("Accuracy Test Feature 4: " + str(test_set_4))

Accuracy Dev Feature 4: 0.772
Accuracy Test Feature 4: 0.82


## Simulation

In [25]:
def AccuracySimulation(numIterations, callBackFunction):
    acc_df = {
        "classifier": [],
        "test_set_accuracy": [],
        "dev_test_set_accuracy": [],
        "train_set_accuracy": [],
        "dev_test_errors": []
    }
    for i in range(numIterations):
        random.shuffle(names)
        acc_train_names = names[1000:]
        acc_dev_test_names = names[500:1000]
        acc_test_names = names[:500]
        acc_train_set = [(callBackFunction(n), g) for (n,g) in acc_train_names]
        acc_dev_test_set = [(callBackFunction(n), g) for (n,g) in acc_dev_test_names]
        acc_test_set = [(callBackFunction(n), g) for (n,g) in acc_test_names]
        acc_classifier = nltk.NaiveBayesClassifier.train(acc_train_set)
        acc_df["classifier"].append(acc_classifier) 
        acc_df["test_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_test_set))
        acc_df["dev_test_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_dev_test_set))
        acc_df["train_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_train_set))
       
        acc_errors = []
        for (name, tag) in acc_dev_test_names:
            acc_guess = acc_classifier.classify(callBackFunction(name))
            if acc_guess != tag:
                acc_errors.append( (tag, acc_guess, name) )
        acc_df["dev_test_errors"].append(acc_errors)
    acc_df = pd.DataFrame.from_dict(acc_df)
    return(acc_df)

In [26]:
df_1 = AccuracySimulation(10, gender_features1)
df_1.describe()

Unnamed: 0,test_set_accuracy,dev_test_set_accuracy,train_set_accuracy
count,10.0,10.0,10.0
mean,0.7918,0.7926,0.800446
std,0.023934,0.013794,0.001462
min,0.762,0.772,0.797667
25%,0.774,0.78,0.799863
50%,0.789,0.799,0.800835
75%,0.8135,0.803,0.801591
max,0.824,0.808,0.801987


In [27]:
df_2 = AccuracySimulation(10, gender_features2)
df_2.describe()

Unnamed: 0,test_set_accuracy,dev_test_set_accuracy,train_set_accuracy
count,10.0,10.0,10.0
mean,0.8118,0.8004,0.829349
std,0.023598,0.015771,0.002241
min,0.776,0.768,0.826757
25%,0.7955,0.793,0.827837
50%,0.816,0.804,0.828917
75%,0.826,0.8125,0.830321
max,0.844,0.818,0.833957


In [28]:
df_3 = AccuracySimulation(10, gender_features3)
df_3.describe()

Unnamed: 0,test_set_accuracy,dev_test_set_accuracy,train_set_accuracy
count,10.0,10.0,10.0
mean,0.8294,0.8232,0.863292
std,0.013533,0.017338,0.001368
min,0.812,0.8,0.861751
25%,0.8185,0.8105,0.862111
50%,0.828,0.821,0.863191
75%,0.8415,0.829,0.863947
max,0.85,0.852,0.866071


In [29]:
df_4 = AccuracySimulation(10, gender_features4)
df_4.describe()

Unnamed: 0,test_set_accuracy,dev_test_set_accuracy,train_set_accuracy
count,10.0,10.0,10.0
mean,0.8042,0.803,0.815294
std,0.018961,0.021066,0.003028
min,0.768,0.754,0.811492
25%,0.7965,0.797,0.81268
50%,0.802,0.807,0.815668
75%,0.816,0.8175,0.817216
max,0.836,0.824,0.820853


## Conclusion

#### - We found that feature 3 performed better than all the other features.
#### - When comparing dev and test sets we found difference but were not significant. This was as expected.
