<a href="https://colab.research.google.com/github/nandir2512/NLP/blob/main/Name_Gender_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Name Gender Prediction**

Let’s assume that we have collected a list of personal names and we have their corresponding gender labels, i.e., whether the name is a male or female one.

The goal of this example is to create a classifier that would automatically classify a given name into either male or female.

# Prepare Data
* We use the data provided in NLTK. Please download the corpus data if necessary.

* We load the corpus, nltk.corpus.names and randomize it before we proceed.

In [1]:
import numpy as np
import nltk
nltk.download('names')
from nltk.corpus import names
import random


[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!


In [2]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

random.shuffle(labeled_names)

In [3]:
labeled_names[:10]

[('Heinz', 'male'),
 ('Lilli', 'female'),
 ('Zared', 'male'),
 ('Goose', 'male'),
 ('Nonah', 'female'),
 ('Baird', 'male'),
 ('Myrtle', 'female'),
 ('Aila', 'female'),
 ('Bear', 'male'),
 ('Elna', 'female')]

In [4]:
len(labeled_names)

7944

# **1. Simple Model**

# Feature Engineering
* Now our unit for classification is a name.
*  In feature engineering, our goal is to transform the texts (i.e., names) into vectorized representations.
*  To start with, let’s represent each text (name) by using its last character as the features.

In [5]:
def text_vectorizer(word):
  return {'last_letter': word[-1]}

text_vectorizer('shrek')

{'last_letter': 'k'}

# Train-Test Split
We then apply the feature engineering method to every text in the data and split the data into training and testing sets.

In [6]:
featuresets = [(text_vectorizer(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]

In [7]:
len(train_set), len(test_set)

(7444, 500)

# **Train the Model - simple Naive Bayes**
A good start is to try the simple Naive Bayes Classifier.

In [8]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Model Prediction

In [9]:
print(classifier.classify(text_vectorizer('Neo')))
print(classifier.classify(text_vectorizer('Trinity')))
print(classifier.classify(text_vectorizer('Alvin')))

male
female
male


In [10]:
print(nltk.classify.accuracy(classifier, test_set))

0.774


# Post-hoc Analysis
One of the most important steps after model training is to examine which features contribute the most to the classifier prediction of the class.

In [11]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     36.8 : 1.0
             last_letter = 'k'              male : female =     30.8 : 1.0
             last_letter = 'f'              male : female =     16.6 : 1.0
             last_letter = 'p'              male : female =     11.2 : 1.0
             last_letter = 'd'              male : female =      9.9 : 1.0


* Please note that in NLTK, we can use the apply_features to create training and testing datasets.

* When you have a very large feature set, this can be more effective in terms of memory management.

* This is our earlier method of creating training and testing sets:

In [12]:
# featuresets = [(text_vectorizer(n), gender) for (n, gender) in labeled_names]
# train_set, test_set = featuresets[500:], featuresets[:500]

from nltk.classify import apply_features
train_set = apply_features(text_vectorizer, labeled_names[500:])
test_set = apply_features(text_vectorizer, labeled_names[:500])

# 2. How can we improve the model/classifier?
In the following, we will talk about methods that we may consider to further improve the model training.

* Feature Engineering
* Error Analysis
* Cross Validation
* Try Different Machine-Learning Algorithms
* (Ensemble Methods)'

More Sophisticated Feature Engineering
* We can extract more features from the names.
* Use the following features for vectorized representations of names:
  * The first/last letter
  * Frequencies of all 26 alphabets in the names

In [13]:
def text_vectorizer2(name):
  features = {}
  features["first_letter"] = name[0].lower()
  features["last_letter"] = name[-1].lower()
  for letter in 'abcdefghijklmnopqrstuvwxyz':
    features["count({})".format(letter)] = name.lower()
    features["has({})".format(letter)] = (letter in name.lower())
  return features

text_vectorizer2('Alvin')

{'first_letter': 'a',
 'last_letter': 'n',
 'count(a)': 'alvin',
 'has(a)': True,
 'count(b)': 'alvin',
 'has(b)': False,
 'count(c)': 'alvin',
 'has(c)': False,
 'count(d)': 'alvin',
 'has(d)': False,
 'count(e)': 'alvin',
 'has(e)': False,
 'count(f)': 'alvin',
 'has(f)': False,
 'count(g)': 'alvin',
 'has(g)': False,
 'count(h)': 'alvin',
 'has(h)': False,
 'count(i)': 'alvin',
 'has(i)': True,
 'count(j)': 'alvin',
 'has(j)': False,
 'count(k)': 'alvin',
 'has(k)': False,
 'count(l)': 'alvin',
 'has(l)': True,
 'count(m)': 'alvin',
 'has(m)': False,
 'count(n)': 'alvin',
 'has(n)': True,
 'count(o)': 'alvin',
 'has(o)': False,
 'count(p)': 'alvin',
 'has(p)': False,
 'count(q)': 'alvin',
 'has(q)': False,
 'count(r)': 'alvin',
 'has(r)': False,
 'count(s)': 'alvin',
 'has(s)': False,
 'count(t)': 'alvin',
 'has(t)': False,
 'count(u)': 'alvin',
 'has(u)': False,
 'count(v)': 'alvin',
 'has(v)': True,
 'count(w)': 'alvin',
 'has(w)': False,
 'count(x)': 'alvin',
 'has(x)': False,
 '

In [14]:
text_vectorizer2('John')

{'first_letter': 'j',
 'last_letter': 'n',
 'count(a)': 'john',
 'has(a)': False,
 'count(b)': 'john',
 'has(b)': False,
 'count(c)': 'john',
 'has(c)': False,
 'count(d)': 'john',
 'has(d)': False,
 'count(e)': 'john',
 'has(e)': False,
 'count(f)': 'john',
 'has(f)': False,
 'count(g)': 'john',
 'has(g)': False,
 'count(h)': 'john',
 'has(h)': True,
 'count(i)': 'john',
 'has(i)': False,
 'count(j)': 'john',
 'has(j)': True,
 'count(k)': 'john',
 'has(k)': False,
 'count(l)': 'john',
 'has(l)': False,
 'count(m)': 'john',
 'has(m)': False,
 'count(n)': 'john',
 'has(n)': True,
 'count(o)': 'john',
 'has(o)': True,
 'count(p)': 'john',
 'has(p)': False,
 'count(q)': 'john',
 'has(q)': False,
 'count(r)': 'john',
 'has(r)': False,
 'count(s)': 'john',
 'has(s)': False,
 'count(t)': 'john',
 'has(t)': False,
 'count(u)': 'john',
 'has(u)': False,
 'count(v)': 'john',
 'has(v)': False,
 'count(w)': 'john',
 'has(w)': False,
 'count(x)': 'john',
 'has(x)': False,
 'count(y)': 'john',
 'ha

In [15]:
train_set = apply_features(text_vectorizer2, labeled_names[500:])
test_set = apply_features(text_vectorizer2, labeled_names[:500])

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.332


In [16]:
classifier.show_most_informative_features(n=20)

Most Informative Features
             last_letter = 'a'            female : male   =     36.8 : 1.0
             last_letter = 'k'              male : female =     30.8 : 1.0
             last_letter = 'f'              male : female =     16.6 : 1.0
             last_letter = 'p'              male : female =     11.2 : 1.0
             last_letter = 'd'              male : female =      9.9 : 1.0
             last_letter = 'm'              male : female =      9.3 : 1.0
             last_letter = 'v'              male : female =      9.2 : 1.0
             last_letter = 'o'              male : female =      7.8 : 1.0
             last_letter = 'r'              male : female =      6.9 : 1.0
             last_letter = 'g'              male : female =      5.5 : 1.0
             last_letter = 'w'              male : female =      5.1 : 1.0
            first_letter = 'w'              male : female =      5.0 : 1.0
                  has(w) = True             male : female =      4.4 : 1.0

# Train-Development-Test Data Splits for Error Analysis
* Normally we have training-testing splits of data
* Sometimes we can use development (dev) set for error analysis and feature engineering.
* This dev set should be independent of training and testing sets.
* Now let’s train the model on the training set and first check the classifier’s performance on the dev set.
* We then identify the errors the classifier made in the dev set.
* We perform error analysis for further improvement.
* We only test our final model on the testing set. (Note: Testing set can only be used once.)

In [17]:
# Using text_vectorizer

train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

train_set = [(text_vectorizer(n), gender) for (n, gender) in train_names]
devtest_set = [(text_vectorizer(n), gender) for (n, gender) in devtest_names]
test_set = [(text_vectorizer(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.774


In [18]:
errors = []
for (name, tag) in devtest_names:
  guess = classifier.classify(text_vectorizer(name))
  if guess !=tag:
    errors.append((tag, guess, name))

In [19]:
import pandas as pd
pd.DataFrame(errors, columns= ['tag', 'guess', 'name'])

Unnamed: 0,tag,guess,name
0,female,male,Jaquelyn
1,female,male,Mariel
2,female,male,Madelin
3,male,female,Frederich
4,female,male,Noelyn
...,...,...,...
221,female,male,Isabeau
222,male,female,George
223,male,female,Maury
224,male,female,Kirby


# **Confusion Matrix**

In [20]:
print('Accuracy: {:4.2f}'.format(nltk.classify.accuracy(classifier, test_set)))

Accuracy: 0.77


In [21]:
def createCM(classifier, test_set):
  t_f = [feature for (feature, label) in test_set]
  t_l = [label for (feature, label) in test_set]
  t_l_pr = [classifier.classify(f) for f in t_f]
  cm = nltk.ConfusionMatrix(t_l, t_l_pr)
  print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))

In [22]:
createCM(classifier, test_set)

       |      f        |
       |      e        |
       |      m      m |
       |      a      a |
       |      l      l |
       |      e      e |
-------+---------------+
female | <52.8%>  9.8% |
  male |  12.8% <24.6%>|
-------+---------------+
(row = reference; col = test)



# **Cross Validation**

In [23]:
import sklearn.model_selection

kf = sklearn.model_selection.KFold(n_splits=5)
acc_kf = []

for train_index, test_index in kf.split(train_set):
  classifier = nltk.NaiveBayesClassifier.train(
      train_set[train_index[0]:train_index[len(train_index)- 1]])
  cur_fold_acc = nltk.classify.util.accuracy(
      classifier, train_set[test_index[0]:test_index[len(test_index)-1]])
  acc_kf.append(cur_fold_acc)
  print('accuracy:', np.round(cur_fold_acc,2))

accuracy: 0.75
accuracy: 0.77
accuracy: 0.76
accuracy: 0.75
accuracy: 0.77


In [24]:
np.mean(acc_kf)

0.760057454622672

# **3. Naive Bayes in sklearn**

In [25]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

sk_classifier = SklearnClassifier(MultinomialNB())
sk_classifier.train(train_set)

<SklearnClassifier(MultinomialNB())>

In [26]:
nltk.classify.accuracy(sk_classifier, test_set)

0.774

# **4. Decision Tree**

* Parameters:
  * ***binary***: whether the features are binary
  * ***entropy_cutoff:*** a value used during tree refinement process
    * entropy = 1 -> high-level uncertainty
    * entropy = 0 -> perfect model prediction
  * ***depth_cutoff:*** to control the depth of the tree
  * ***support_cutoff:*** the minimum number of instances that are required to make a decision about a feature.

In [27]:
from nltk.classify import DecisionTreeClassifier

In [28]:
%%time
classifier_dt = DecisionTreeClassifier.train(train_set,
                                             binary=True,
                                             entropy_cutoff=0.7,
                                             depth_cutoff=5,
                                             support_cutoff=5)

CPU times: user 1.56 s, sys: 11.2 ms, total: 1.57 s
Wall time: 1.59 s


In [29]:
nltk.classify.accuracy(classifier_dt, test_set)

0.72

In [30]:
createCM(classifier_dt, test_set)

       |      f        |
       |      e        |
       |      m      m |
       |      a      a |
       |      l      l |
       |      e      e |
-------+---------------+
female | <59.4%>  3.2% |
  male |  24.8% <12.6%>|
-------+---------------+
(row = reference; col = test)



In [31]:
%%time

for train_index, test_index in kf.split(train_set):
  classifier = DecisionTreeClassifier.train(
      train_set[train_index[0]:train_index[len(train_index)-1 ]],
      binary = True,
      entropy_cutoff=0.7,
      depth_cutoff=5,
      support_cutoff=5
  )
  print('accuracy:', nltk.classify.util.accuracy(classifier,
                train_set[test_index[0]:test_index[len(test_index)-1]]))

accuracy: 0.6995341614906833
accuracy: 0.7228260869565217
accuracy: 0.7228260869565217
accuracy: 0.717391304347826
accuracy: 0.728049728049728
CPU times: user 11.5 s, sys: 30.6 ms, total: 11.5 s
Wall time: 17 s


# **5. Logistic Regression**

In [32]:
from sklearn.linear_model import LogisticRegression

sk_classifier = SklearnClassifier(LogisticRegression(max_iter=500))
sk_classifier.train(train_set)
nltk.classify.accuracy(sk_classifier, test_set)

0.776

# **6. Support Vector Machine**

In [33]:
from sklearn.svm import SVC
sk_classifier = SklearnClassifier(SVC())
sk_classifier.train(train_set)
nltk.classify.accuracy(sk_classifier, test_set)

0.776

In [34]:
from sklearn.svm import NuSVC
sk_classifier = SklearnClassifier(NuSVC())
sk_classifier.train(train_set)
nltk.classify.accuracy(sk_classifier, test_set)

0.776

In [35]:
from sklearn.svm import LinearSVC
sk_classifier = SklearnClassifier(LinearSVC(max_iter=2000))
sk_classifier.train(train_set)
nltk.classify.accuracy(sk_classifier, test_set)

0.776

#### *Need to required Further Tune the Model* ####