In this project we will use a heuristic to construct a feature vector and use it to train a classifier. The heuristic used here is the last N letters of a given name. For example, if the name ends with ia, it is most likely a female name, such as Amelia or Genelia. On the other hand, if the name ends with rk, it is likely a male name, such as Mark or Clark. Because we do not know in advance how many letters to use, we will vary this parameter and see which value gives the best accuracy.

In [1]:
import random
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from nltk.corpus import names

In [2]:
# Define a function to extract the last N letters from the input word
def extract_features(word, N=2):
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}
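
A quick check of what the function returns can help before any training is done. The sketch below repeats the definition so that it runs on its own; the sample names are only illustrative. Note that Python slicing returns the whole word when it is shorter than N, so short names do not raise an error.

```python
# Same definition as in the cell above, repeated so this snippet is self-contained
def extract_features(word, N=2):
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}

# Default of two end letters
print(extract_features('Amelia'))    # {'feature': 'ia'}
# Explicit N
print(extract_features('Mark', 3))   # {'feature': 'ark'}
# Name shorter than N: the whole (lowercased) name becomes the feature
print(extract_features('Al', 5))     # {'feature': 'al'}
```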

In [5]:
import nltk
nltk.download('names')
# Create the training data using the labeled names available in NLTK
male_list = [(name, 'male') for name in names.words('male.txt')]
female_list = [(name, 'female') for name in names.words('female.txt')]
data = male_list + female_list

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.


In [6]:
# Seed the random number generator
random.seed(5)
# Shuffle the data
random.shuffle(data)

In [7]:
# Create some sample names that will be used for testing
input_names = ['Alexander', 'Daniellle', 'David', 'Cheryl']
# Define the number of samples used for train and test
num_train = int(0.8*len(data))
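
The 80/20 split can be sketched on a small stand-in list. This is a minimal illustration, assuming a made-up `toy_data` list in place of the real corpus; the names `toy_data`, `train_part` and `test_part` are hypothetical and not part of the notebook.

```python
import random

# Toy stand-in for the labeled name list; these entries are made up for illustration
toy_data = [('name' + str(i), 'female' if i % 2 == 0 else 'male') for i in range(10)]

# Shuffle reproducibly, as in the notebook
random.seed(5)
random.shuffle(toy_data)

# First 80% for training, remaining 20% for testing
num_train = int(0.8 * len(toy_data))
train_part, test_part = toy_data[:num_train], toy_data[num_train:]
print(len(train_part), len(test_part))  # 8 2
```

Shuffling before slicing matters here because the original list holds all male names first, followed by all female names.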


The last N characters will be used as the feature vector to predict the gender. This parameter will be changed to see how the performance varies. In this case we will go from 1 to 5.

In [9]:
for i in range(1, 6):
    print('\nNumber of end letters:', i)
    features = [(extract_features(n, i), gender) for (n, gender) in data]
    train_data, test_data = features[:num_train], features[num_train:]
    classifier = NaiveBayesClassifier.train(train_data)
    # Compute the accuracy of the classifier using NLTK's built-in accuracy function
    accuracy = round(100 * nltk_accuracy(classifier, test_data), 2)
    print('Accuracy = ' + str(accuracy) + '%')
    # Predict the output for each name in the input test list
    for name in input_names:
        print(name, '==>', classifier.classify(extract_features(name, i)))


Number of end letters: 1
Accuracy = 74.7%
Alexander ==> male
Daniellle ==> female
David ==> male
Cheryl ==> male

Number of end letters: 2
Accuracy = 78.79%
Alexander ==> male
Daniellle ==> female
David ==> male
Cheryl ==> female

Number of end letters: 3
Accuracy = 77.22%
Alexander ==> male
Daniellle ==> female
David ==> male
Cheryl ==> female

Number of end letters: 4
Accuracy = 69.98%
Alexander ==> male
Daniellle ==> female
David ==> male
Cheryl ==> female

Number of end letters: 5
Accuracy = 64.63%
Alexander ==> male
Daniellle ==> female
David ==> male
Cheryl ==> female


As the results above show, the accuracy peaked at 78.79% with 2 letters and decreased as more letters were used.
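
To pick the best value programmatically rather than by eye, the printed accuracies can be collected and compared. A minimal sketch, assuming the figures from the run above are copied by hand into a hypothetical `accuracy_by_n` dictionary (this name is not part of the original notebook):

```python
# Accuracies (in percent) copied from the run above, keyed by the number of end letters.
# The dictionary name is illustrative and not part of the original notebook.
accuracy_by_n = {1: 74.7, 2: 78.79, 3: 77.22, 4: 69.98, 5: 64.63}

# Pick the N with the highest test accuracy
best_n = max(accuracy_by_n, key=accuracy_by_n.get)
print('Best number of end letters:', best_n)
print('Accuracy = ' + str(accuracy_by_n[best_n]) + '%')
```

Inside the loop, the same effect could be had by storing each `accuracy` value as it is computed. These numbers come from a single shuffled split with a fixed seed, so a different seed may give somewhat different figures.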