# Introduction


### Once upon a time in a land called Nomenia, there was a village where people had unique names. The names in this village were not only distinct but also carried a secret: certain features of a name, such as its first or last letter, could hint at the gender of the person who bore it.

### Intrigued by this phenomenon, a young linguist named Lily embarked on a quest to build the ultimate name gender classifier. Lily had read about various classifiers in the famous book "Natural Language Processing with Python," and she was determined to put her knowledge into action.

### Lily began her journey by diving into the Names Corpus, a collection of roughly 8,000 English first names labeled by gender. She shuffled the corpus and split it into three subsets: a test set of 500 names, a dev-test set of 1,000 names, and a training set made up of the remaining names (about 6,400).
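
The slices that appear later in this notebook (`[:500]`, `[500:1500]`, and `[1500:]`) can be tried on a stand-in list first. The sketch below is only an illustration: `split_names` is a hypothetical helper, and the 8,000 placeholder entries are made-up stand-ins for the shuffled corpus, not real data.

```python
import random

def split_names(labeled, n_test=500, n_devtest=1000):
    """Split an already shuffled list into (train, devtest, test).

    Hypothetical helper; the default sizes mirror the slices used in this notebook.
    """
    test = labeled[:n_test]
    devtest = labeled[n_test:n_test + n_devtest]
    train = labeled[n_test + n_devtest:]
    return train, devtest, test

# Placeholder (name, label) pairs standing in for the real corpus
placeholder = [(f"name{i}", "female" if i % 2 else "male") for i in range(8000)]
random.Random(0).shuffle(placeholder)  # fixed seed so the split is repeatable

train, devtest, test = split_names(placeholder)
print(len(train), len(devtest), len(test))  # 6500 1000 500
```

Seeding the shuffle is an assumption added for this sketch. The notebook's own `random.shuffle` call is unseeded, so its accuracies will vary somewhat from run to run.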


In [1]:
import random

import nltk
from nltk.corpus import names
from nltk.classify import apply_features
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Fetch the Names Corpus used throughout this notebook
nltk.download('names')

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.





### Equipped with the power of classifiers, Lily created four different sets of features that could help determine the gender of a name. The first set focused on the last letter of the name, while the second set considered only the first letter. The third set combined the first and last letters, and the fourth set included the first, middle, and last letters of the name.
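
To make the four feature sets concrete, the sketch below applies them to a single name. It is a self-contained illustration rather than the notebook's own code: `all_feature_sets` is a hypothetical helper that bundles the four extractors, it lowercases every letter for uniformity (an assumption), and it picks the middle letter with `round(len(name) / 2)`.

```python
def all_feature_sets(name):
    """Return the four feature dictionaries for one name (hypothetical helper)."""
    first, last = name[0].lower(), name[-1].lower()
    # Middle letter: index is half the length, rounded to an integer
    middle = name[int(round(len(name) / 2))].lower()
    return {
        "last": {"last_letter": last},
        "first": {"first_letter": first},
        "first+last": {"first_letter": first, "last_letter": last},
        "first+middle+last": {
            "first_letter": first,
            "middle_letter": middle,
            "last_letter": last,
        },
    }

# Show each feature dictionary for a sample name
for label, features in all_feature_sets("Lily").items():
    print(label, features)
```

Each dictionary is what a classifier would see in place of the raw name, so two names with the same letters in those positions are indistinguishable to it.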



In [None]:
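def gender_features(name):
    # Feature set 1: last letter only
    return {'lastletter': name[-1]}

def gender_features1(name):
    # Feature set 2: first letter only
    return {'firstletter': name[0]}

def gender_features2(name):
    # Feature set 3: first and last letters, lowercased
    features = {}
    features['firstletter'] = name[0].lower()
    features['lastletter'] = name[-1].lower()
    return features

def gender_features3(name):
    # Feature set 4: first, middle, and last letters, lowercased.
    # Python 3 rounds halves to the nearest even integer, so the index can
    # land one past the true middle (a 3-letter name uses its last letter).
    m = int(round(len(name)/2))
    midletter = name[m].lower()
    features = {}
    features['first_letter'] = name[0].lower()
    features['middle_letter'] = midletter
    features['last_letter'] = name[-1].lower()
    return features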

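# Use a separate variable so the imported `names` corpus reader is not overwritten
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

# 500 test names, 1,000 dev-test names, and the rest for training
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]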


### With her feature sets in place, Lily embarked on training her classifiers. She used the NaiveBayesClassifier from the powerful NLTK library to train each classifier using the training set. Once the training was complete, she evaluated the accuracy of each classifier using the dev-test set, which allowed her to fine-tune and improve her classifiers incrementally.

### After numerous iterations of training and testing, Lily was finally satisfied with the performance of her classifiers. She eagerly tested the final versions of the classifiers on the test set, which contained previously unseen names.
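
One common way to use a dev-test set during iterations like these is to list the names the classifier gets wrong and look for patterns that suggest new features. The sketch below shows that loop on a handful of made-up names. To stay self-contained it swaps in a tiny lookup classifier in place of `NaiveBayesClassifier`: `train_last_letter` is a hypothetical stand-in that predicts the most common label seen for each last letter, and the toy names and labels are illustrative, not drawn from the corpus.

```python
from collections import Counter, defaultdict

def train_last_letter(labeled_names):
    """Stand-in classifier: most common label for each last letter."""
    counts = defaultdict(Counter)
    overall = Counter()
    for name, label in labeled_names:
        counts[name[-1].lower()][label] += 1
        overall[label] += 1
    fallback = overall.most_common(1)[0][0]  # used for unseen last letters

    def classify(name):
        seen = counts.get(name[-1].lower())
        return seen.most_common(1)[0][0] if seen else fallback

    return classify

# Made-up toy data, for illustration only
toy_train = [("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
             ("John", "male"), ("Ivan", "male"), ("Marco", "male")]
toy_devtest = [("Laura", "female"), ("Luca", "male"), ("Ethan", "male")]

classify = train_last_letter(toy_train)

# Collect (correct label, guess, name) for every dev-test mistake
errors = [(label, classify(name), name)
          for name, label in toy_devtest
          if classify(name) != label]

for label, guess, name in sorted(errors):
    print(f"correct={label:<8} guess={guess:<8} name={name}")
```

With the notebook's classifier, `classifier.classify(gender_features(name))` would take the place of `classify(name)`, and `devtest_names` would replace the toy list.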


In [3]:

# Gender features = Last Letter
train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, test_set)
print("Accuracy (Last Letter):", accuracy)

# Gender features = First Letter
train_set = [(gender_features1(n), g) for (n, g) in train_names]
devtest_set = [(gender_features1(n), g) for (n, g) in devtest_names]
test_set = [(gender_features1(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, test_set)
print("Accuracy (First Letter):", accuracy)

# Gender features = First and Last Letter
train_set = [(gender_features2(n), g) for (n, g) in train_names]
devtest_set = [(gender_features2(n), g) for (n, g) in devtest_names]
test_set = [(gender_features2(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, test_set)
print("Accuracy (First and Last Letter):", accuracy)

# Gender features = First, Middle Letter, and Last Letter
train_set = [(gender_features3(n), g) for (n, g) in train_names]
devtest_set = [(gender_features3(n), g) for (n, g) in devtest_names]
test_set = [(gender_features3(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, test_set)
print("Accuracy (First, Middle, and Last Letter):", accuracy)

Accuracy (Last Letter): 0.768
Accuracy (First Letter): 0.618
Accuracy (First and Last Letter): 0.792
Accuracy (First, Middle, and Last Letter): 0.778



### To her delight, Lily discovered that the classifiers that looked at the last letter performed well on the test set. The Last Letter classifier achieved an accuracy of 76.8%, the First Letter classifier scored only 61.8%, the First and Last Letter classifier reached 79.2%, and the First, Middle, and Last Letter classifier achieved 77.8%. The last letter carried most of the signal: adding the first letter helped slightly, and adding the middle letter did not help further. She also ran the classifiers on her dev-test sample and got similar results.
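
Because the test set holds 500 names, each accuracy can be turned back into a count of correct predictions, which puts the gaps between feature sets in perspective. The sketch below does that arithmetic with the reported test accuracies and adds a rough binomial standard error, sqrt(p(1-p)/n). Treating the test names as independent draws is an assumption made here for illustration.

```python
import math

N_TEST = 500  # size of the test set used in this notebook

# Test-set accuracies reported in this notebook
reported = {
    "Last Letter": 0.768,
    "First Letter": 0.618,
    "First and Last Letter": 0.792,
    "First, Middle, and Last Letter": 0.778,
}

summary = {}
for label, acc in reported.items():
    correct = round(acc * N_TEST)                  # names classified correctly
    std_err = math.sqrt(acc * (1 - acc) / N_TEST)  # rough binomial standard error
    summary[label] = (correct, std_err)
    print(f"{label:<32} {correct}/{N_TEST} correct, +/- {std_err:.3f}")

best = max(reported, key=reported.get)
print("Best on the test set:", best)
```

The top two feature sets differ by 7 names out of 500 (396 versus 389). That 1.4-point gap is smaller than the roughly 1.8-point standard error, so a single run like this one does not clearly separate them.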



In [4]:
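# Gender features = Last Letter
# Rebuild the feature sets first; otherwise train_set and devtest_set still
# hold the first/middle/last features left over from the previous cell
train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, devtest_set)
print("Accuracy (Last Letter) - Dev Test:", accuracy)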

# Gender features = First Letter
train_set = [(gender_features1(n), g) for (n, g) in train_names]
devtest_set = [(gender_features1(n), g) for (n, g) in devtest_names]
test_set = [(gender_features1(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, devtest_set)
print("Accuracy (First Letter) - Dev Test:", accuracy)

# Gender features = First and Last Letter
train_set = [(gender_features2(n), g) for (n, g) in train_names]
devtest_set = [(gender_features2(n), g) for (n, g) in devtest_names]
test_set = [(gender_features2(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, devtest_set)
print("Accuracy (First and Last Letter) - Dev Test:", accuracy)

# Gender features = First, Middle Letter, and Last Letter
train_set = [(gender_features3(n), g) for (n, g) in train_names]
devtest_set = [(gender_features3(n), g) for (n, g) in devtest_names]
test_set = [(gender_features3(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, devtest_set)
print("Accuracy (First, Middle, and Last Letter) - Dev Test:", accuracy)

Accuracy (Last Letter) - Dev Test: 0.775
Accuracy (First Letter) - Dev Test: 0.636
Accuracy (First and Last Letter) - Dev Test: 0.77
Accuracy (First, Middle, and Last Letter) - Dev Test: 0.775


# Conclusion

### Lily was thrilled with her achievements, and she marveled at how analyzing the features of a name could reveal so much about a person's gender. She shared her findings with the people of Nomenia, who were equally fascinated by the power of the name gender classifier.

### And so, Lily's journey came to an end, leaving a lasting impact on the village of Nomenia. The villagers began to appreciate the significance of their names, and the classifiers became a valuable tool in determining gender, building a sense of identity, and fostering understanding among the community.
