## Project 3
### Amanda Arce, Monu Chacko, Abdelmalek Hajjam, Nick Schettini

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.
Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev- test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

In [1]:
import nltk
from nltk.corpus import names
import random
import itertools
from string import ascii_lowercase

#nltk.download('names')

In [2]:
names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])
#shuffle the names
random.shuffle(names)

#### Let divide the data into test, dev and training datasets with 500, 500, x data split

In [3]:
#print(len(names))
#unpacking the names to 3 sets
test, dev_test, training = names[:500], names[500:1000], names[1000:]

## Accuracy

#### The gender feature 1 extractor uses first letter, last letter and suffix as its feature

In [4]:
def gender_features1(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    return features

#### Train data using Naive Bayes 

In [5]:
train_set = [(gender_features1(n), g) for (n,g) in training]
dev_test_set = [(gender_features1(n), g) for (n,g) in dev_test]
classifier = nltk.NaiveBayesClassifier.train(train_set)

acc_dev_test_1 = nltk.classify.accuracy(classifier, dev_test_set)
print("The accuracy for the dev using Feature 1 is: " + str(acc_dev_test_1))

The accuracy for the dev using Feature 1 is: 0.788


In [6]:
# Performance test - Feature 1
test_set = [(gender_features1(n), g) for (n,g) in test]
test_set_1 = nltk.classify.accuracy(classifier, test_set)
print("The accuracy for the test using Feature 1 is: " + str(test_set_1))

The accuracy for the test using Feature 1 is: 0.778


#### The gender feature 2 extractor uses first letter, last letter and two suffixes as its feature

In [7]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    features["suffix3"] = name[-3:].lower()
    return features

#### Train feature 2 using Naive Bayes Classifier

In [8]:
train_set = [(gender_features2(n), g) for (n,g) in training]
dev_test_set = [(gender_features2(n), g) for (n,g) in dev_test]
classifier = nltk.NaiveBayesClassifier.train(train_set)

acc_dev_test_2 = nltk.classify.accuracy(classifier, dev_test_set)
print("The accuracy for the dev using Feature 2 is: " + str(acc_dev_test_2))

The accuracy for the dev using Feature 2 is: 0.798


In [9]:
# Performance test - Feature 2
test_set = [(gender_features2(n), g) for (n,g) in test]
test_set_2 = nltk.classify.accuracy(classifier, test_set)
print("The accuracy for the test using Feature 2 is: " + str(test_set_2))

The accuracy for the test using Feature 2 is: 0.792


#### The gender feature 3 extractor uses first letter, last letter and three suffixes as its feature

In [10]:
def gender_features3(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    features["suffix3"] = name[-3:].lower()
    features["prefix3"] = name[:3].lower()
    return features

#### Train feature 3 data using Naive Bayes

In [11]:
train_set = [(gender_features3(n), g) for (n,g) in training]
dev_test_set = [(gender_features3(n), g) for (n,g) in dev_test]
classifier = nltk.NaiveBayesClassifier.train(train_set)

acc_dev_test_3 = nltk.classify.accuracy(classifier, dev_test_set)
print("The accuracy for the dev using Feature 3 is: " + str(acc_dev_test_3))

The accuracy for the dev using Feature 3 is: 0.824


In [12]:
# Performance test - Feature 3
test_set = [(gender_features3(n), g) for (n,g) in test]
test_set_3 = nltk.classify.accuracy(classifier, test_set)
print("The accuracy for the test using Feature 3 is: " + str(test_set_3))

The accuracy for the test using Feature 3 is: 0.806


In [13]:
def gender_features4(name):
    
    features = {}
    keywords = [''.join(i) for i in itertools.product(ascii_lowercase, repeat = 2)]
    
    #look at first, first2, last, last2 letters of name
    #apply .lower() method to convert all text to lowercase
    features["first_letter"] = name[0].lower()
    features["first_2letter"] = name[0:1].lower()
    features["last_letter"] = name[-1].lower()
    features["last_2letter"] = name[-2:-1].lower()
    
    for letter in ascii_lowercase:
        features["has({})".format(letter)] = (letter in name.lower())

        for keyword in keywords:
            features["combo2({})".format(keyword)] = (keyword in name.lower())
            
        return features

In [14]:
train_set = [(gender_features4(n), g) for (n,g) in training]
dev_test_set = [(gender_features4(n), g) for (n,g) in dev_test]
classifier = nltk.NaiveBayesClassifier.train(train_set)

acc_dev_test_4 = nltk.classify.accuracy(classifier, dev_test_set)
print("The accuracy for the dev using Feature 3 is: " + str(acc_dev_test_4))

The accuracy for the dev using Feature 3 is: 0.816


In [15]:
# Performance test - Feature 4
test_set = [(gender_features4(n), g) for (n,g) in test]
test_set_4 = nltk.classify.accuracy(classifier, test_set)
print("The accuracy for the test using Feature 4 is: " + str(test_set_4))

The accuracy for the test using Feature 4 is: 0.79


## Errors

In [16]:
def error_analysis(gender_features):
    errors = []
    for (name, tag) in dev_test:
        guess = classifier.classify(gender_features(name))
        if guess != tag:
            errors.append((tag, guess, name))
    print('no. of errors: ', len(errors))        
        
    #for (tag, guess, name) in sorted(errors): 
    #    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))
    return errors

In [17]:
lst1 = error_analysis(gender_features1)
lst1[0: 10]

no. of errors:  199


[('male', 'female', 'Bartolomeo'),
 ('male', 'female', 'Derk'),
 ('male', 'female', 'Archibald'),
 ('male', 'female', 'Benson'),
 ('male', 'female', 'Patin'),
 ('male', 'female', 'Wake'),
 ('male', 'female', 'Tristan'),
 ('male', 'female', 'Smith'),
 ('male', 'female', 'Tyson'),
 ('male', 'female', 'Ferinand')]

In [18]:
lst2 = error_analysis(gender_features2)
lst2[0:10]

no. of errors:  199


[('male', 'female', 'Bartolomeo'),
 ('male', 'female', 'Derk'),
 ('male', 'female', 'Archibald'),
 ('male', 'female', 'Benson'),
 ('male', 'female', 'Patin'),
 ('male', 'female', 'Wake'),
 ('male', 'female', 'Tristan'),
 ('male', 'female', 'Smith'),
 ('male', 'female', 'Tyson'),
 ('male', 'female', 'Ferinand')]

In [19]:
lst3 = error_analysis(gender_features3)
lst3[0:10] 

no. of errors:  199


[('male', 'female', 'Bartolomeo'),
 ('male', 'female', 'Derk'),
 ('male', 'female', 'Archibald'),
 ('male', 'female', 'Benson'),
 ('male', 'female', 'Patin'),
 ('male', 'female', 'Wake'),
 ('male', 'female', 'Tristan'),
 ('male', 'female', 'Smith'),
 ('male', 'female', 'Tyson'),
 ('male', 'female', 'Ferinand')]

In [20]:
lst4 = error_analysis(gender_features4)
lst4[0:10] 

no. of errors:  92


[('female', 'male', 'Trudy'),
 ('male', 'female', 'Patin'),
 ('male', 'female', 'Smith'),
 ('male', 'female', 'Ferinand'),
 ('female', 'male', 'Haley'),
 ('male', 'female', 'Grady'),
 ('female', 'male', 'Rosamond'),
 ('female', 'male', 'Goldie'),
 ('male', 'female', 'Dane'),
 ('female', 'male', 'Violet')]

## Accuracy Comparition

In [21]:
print("Accuracy Dev Feature 1: " + str(acc_dev_test_1))
print("Accuracy Test Feature 1: " + str(test_set_1))

Accuracy Dev Feature 1: 0.788
Accuracy Test Feature 1: 0.778


In [22]:
print("Accuracy Dev Feature 2: " + str(acc_dev_test_2))
print("Accuracy Test Feature 2: " + str(test_set_2))

Accuracy Dev Feature 2: 0.798
Accuracy Test Feature 2: 0.792


In [23]:
print("Accuracy Dev Feature 3: " + str(acc_dev_test_3))
print("Accuracy Test Feature 3: " + str(test_set_3))

Accuracy Dev Feature 3: 0.824
Accuracy Test Feature 3: 0.806


In [24]:
print("Accuracy Dev Feature 4: " + str(acc_dev_test_4))
print("Accuracy Test Feature 4: " + str(test_set_4))

Accuracy Dev Feature 4: 0.816
Accuracy Test Feature 4: 0.79


## Conclusion

#### - We found that feature 3 performed better than all the other features.
#### - When comparing dev and test sets we found difference but were not significant. This was as expected.
