<a href="https://colab.research.google.com/github/nitin-barthwal/TextAnalytics/blob/master/NLP_Male_Female.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification [Natural Language Processing (NLP)]

**Identifying whether a name is male or female**

We will use NLTK’s names corpus as our labeled training data. The corpus contains around 8,000 male and female names and was compiled by Kantrowitz and Ross.

So we have two categories for classification: male and female. Every name in our training data (the “names” corpus) is already labeled as one or the other.

In [1]:
import nltk
nltk.download('names')

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.


True

In [2]:
from nltk.corpus import names 
 
#print (names.fileids()) # Output: ['female.txt', 'male.txt']
 
male_names = names.words('male.txt')
female_names = names.words('female.txt')
 
print (len(male_names)) # Output: 2943
print (len(female_names)) # Output: 5001
 
# print 10 male names
print (male_names[10:20])

 
# print 10 female names
print (female_names[10:20]) 


2943
5001
['Abdullah', 'Abe', 'Abel', 'Abelard', 'Abner', 'Abraham', 'Abram', 'Ace', 'Adair', 'Adam']
['Abra', 'Acacia', 'Ada', 'Adah', 'Adaline', 'Adara', 'Addie', 'Addis', 'Adel', 'Adela']


**Feature Extraction**

To classify text into a category, we need to define some criteria. Based on those criteria, our classifier learns which kind of text falls into which category. Each such criterion is known as a feature. We can define one or more features to train our classifier.

In this example, we will use the last letter of the names as the feature.

We will define a function that extracts the last letter of any provided word. The function will return a dictionary containing the last letter information of the given word.

In [3]:
def gender_features(word):
    return {'last_letter' : word[-1]}
 
print (gender_features('Nitin')) # Output: {'last_letter': 'n'}

{'last_letter': 'n'}


The dictionary returned by the above function is called a feature set. This feature set is used to train the classifier.

We will now create a feature set using all the male and female names.

For this, we first combine the male and female names and shuffle the combined array.

In [4]:
from nltk.corpus import names 
import random 
 
male_names = names.words('male.txt')
female_names = names.words('female.txt')
 
labeled_male_names = [(str(name), 'male') for name in male_names]
 
# printing  10 labeled male names
print (labeled_male_names[10:20])
print('No of Male names :',len(labeled_male_names))
 
labeled_female_names = [(str(name), 'female') for name in female_names]
 
# printing  10 labeled female names
print (labeled_female_names[10:20])
print('No of Female names :',len(labeled_female_names))
 
# combine labeled male and labeled female names
labeled_all_names = labeled_male_names + labeled_female_names
print('Total Names : ',len(labeled_all_names))

# shuffle the labeled names array
random.shuffle(labeled_all_names)
 
# printing  10 labeled all/combined names
print (labeled_all_names[10:20])

[('Abdullah', 'male'), ('Abe', 'male'), ('Abel', 'male'), ('Abelard', 'male'), ('Abner', 'male'), ('Abraham', 'male'), ('Abram', 'male'), ('Ace', 'male'), ('Adair', 'male'), ('Adam', 'male')]
No of Male names : 2943
[('Abra', 'female'), ('Acacia', 'female'), ('Ada', 'female'), ('Adah', 'female'), ('Adaline', 'female'), ('Adara', 'female'), ('Addie', 'female'), ('Addis', 'female'), ('Adel', 'female'), ('Adela', 'female')]
No of Female names : 5001
Total Names :  7944
[('Jereme', 'male'), ('Alister', 'male'), ('Hetti', 'female'), ('Rustie', 'male'), ('Denice', 'female'), ('Lidia', 'female'), ('Denna', 'female'), ('Donovan', 'male'), ('Avi', 'male'), ('Devan', 'female')]


**Extracting Feature & Creating Feature Set**

We use the gender_features function defined above to extract the feature from the labeled names. As mentioned, the feature in this example is the last letter of each name. So, we extract the last letter of every labeled name and build a new list that pairs the feature dictionary of each name with its label. This new list is called the feature set.

In [5]:

feature_set = [(gender_features(name), gender) for (name, gender) in labeled_all_names]
 
print (labeled_all_names[:10])

 
print (feature_set[:10])

[('Kane', 'male'), ('Silas', 'male'), ('Liliane', 'female'), ('Fallon', 'female'), ('Kimmy', 'female'), ('Nathanial', 'male'), ('Glynis', 'female'), ('Shaw', 'male'), ('Janie', 'female'), ('Deeann', 'female')]
[({'last_letter': 'e'}, 'male'), ({'last_letter': 's'}, 'male'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'n'}, 'female'), ({'last_letter': 'y'}, 'female'), ({'last_letter': 'l'}, 'male'), ({'last_letter': 's'}, 'female'), ({'last_letter': 'w'}, 'male'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'n'}, 'female')]


**Training Classifier**

From the feature set we created above, we now create a separate training set and a separate testing/validation set. The train set is used to train the classifier and the test set is used to test the classifier to check how accurately it classifies the given text.

**Creating Train and Test Dataset**


In this example, we use the first 1500 elements of the feature set as the test set and the rest of the data as the train set. An 80/20 split, i.e. 80 percent for training and 20 percent for testing, is a common choice; here, 1500 of 7944 names is roughly 19 percent.
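The split can also be expressed as a fraction rather than a fixed count. The following is a minimal sketch, not part of the original notebook: the helper name `split_train_test`, its parameters, and the toy data are all hypothetical, and the helper shuffles a copy of the input so the caller's list is left untouched.

```python
import random

def split_train_test(items, test_fraction=0.2, seed=0):
    """Hypothetical helper: shuffle a copy of items, then split by fraction.

    Returns (train, test), where test holds roughly test_fraction of the items.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for the feature set: 10 (features, label) pairs.
toy_feature_set = [
    ({'last_letter': chr(ord('a') + i)}, 'male' if i % 2 else 'female')
    for i in range(10)
]

train, test = split_train_test(toy_feature_set)
print(len(train), len(test))  # 8 2
```

With the real feature set, the same call would replace the two slicing lines in the next cell, assuming a proportional split is wanted instead of exactly 1500 test items.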

In [6]:

test_set = feature_set[:1500]
train_set = feature_set[1500:]
 
print ('Train set Length',len(train_set)) # Output: 6444
print ('Test set Length',len(test_set)) # Output: 1500

Train set Length 6444
Test set Length 1500


**Training a Classifier**

Now, we train a classifier using the training dataset. There are different kinds of classifiers, such as the Naive Bayes Classifier, Maximum Entropy Classifier, Decision Tree Classifier, and Support Vector Machine Classifier.

In this example, we use the Naive Bayes Classifier. It is simple and fast, and it performs well on small datasets. It is a probabilistic classifier based on Bayes’ theorem, which describes the probability of an event given prior knowledge of conditions that might be related to the event.
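To make the use of Bayes’ theorem concrete, here is a minimal sketch that computes the probability of each gender given a last letter, using a small made-up dataset. The data and the function name `posterior` are hypothetical, and the sketch applies no smoothing, so it illustrates the idea rather than reproducing what NLTK does internally.

```python
from collections import Counter

# Hypothetical toy data (not the NLTK corpus): (last_letter, gender) pairs.
toy = [
    ('a', 'female'), ('a', 'female'), ('a', 'female'), ('a', 'male'),
    ('k', 'male'), ('k', 'male'), ('n', 'male'), ('n', 'female'),
]

def posterior(letter, data):
    """P(gender | letter) via Bayes' theorem: prior * likelihood, normalised.

    No smoothing is applied, so a letter never seen in data would make the
    evidence zero and raise ZeroDivisionError.
    """
    label_counts = Counter(gender for _, gender in data)
    pair_counts = Counter(data)
    total = len(data)

    scores = {}
    for gender, n in label_counts.items():
        prior = n / total                                # P(gender)
        likelihood = pair_counts[(letter, gender)] / n   # P(letter | gender)
        scores[gender] = prior * likelihood

    evidence = sum(scores.values())                      # P(letter)
    return {gender: score / evidence for gender, score in scores.items()}

print(posterior('a', toy))  # {'female': 0.75, 'male': 0.25}
```

In the toy data, three of the four names ending in “a” are female, so the posterior for “female” is 0.75.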

In [0]:

from nltk import NaiveBayesClassifier
 
classifier = NaiveBayesClassifier.train(train_set)

**Testing the trained Classifier**

Let’s see the output of the classifier by providing some names to it.

In [8]:

print (classifier.classify(gender_features('Robert'))) # Output: male
 
print (classifier.classify(gender_features('Katrina'))) # Output: female

male
female


Let’s see the accuracy of the trained classifier. The accuracy value changes each time you run the program because the list of names is shuffled above, so the train and test sets differ from run to run.
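If a repeatable accuracy figure is wanted, one option is to shuffle with a fixed seed. The sketch below assumes this is acceptable for the experiment; the helper `shuffled_copy`, the seed value, and the short name list are hypothetical and only illustrate that the same seed yields the same order.

```python
import random

# A few names used purely for illustration.
names_sample = ['Ada', 'Abel', 'Adam', 'Adela', 'Abner', 'Addie']

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    copy = list(items)
    random.Random(seed).shuffle(copy)
    return copy

first = shuffled_copy(names_sample, seed=42)
second = shuffled_copy(names_sample, seed=42)
print(first == second)  # True
```

In the notebook itself, calling `random.seed(...)` with a fixed integer before `random.shuffle(labeled_all_names)` would have a similar effect.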



In [9]:

from nltk import classify 
 
accuracy = classify.accuracy(classifier, test_set)
 
print (accuracy) # Output: about 0.75 (varies with the shuffle)

0.7513333333333333


Let’s see which features in the feature set are the most informative.

In the run shown below, names ending with the letter “a” are female 37.7 times more often than they are male, while names ending with the letter “k” are male 27.3 times more often than they are female, and similarly for the other letters. These ratios are also called likelihood ratios, and their exact values change from run to run because the data is shuffled.

Therefore, if you give the trained classifier a name ending with the letter “k”, it will predict “male”, and if you give it a name ending with the letter “a”, it will predict “female”.
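A likelihood ratio of this kind can be sketched from raw counts. The function name `likelihood_ratio` and the counts below are hypothetical, and the calculation is unsmoothed, so the numbers it produces will not match the classifier's own table exactly.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(feature | label A) to P(feature | label B), unsmoothed.

    count_a / total_a is the share of label-A names with the feature;
    count_b / total_b is the same share for label B.
    """
    return (count_a / total_a) / (count_b / total_b)

# Hypothetical counts: 6 of 20 male names and 1 of 30 female names end in "k".
ratio = likelihood_ratio(6, 20, 1, 30)
print(round(ratio, 1))  # 9.0
```

Here names ending in “k” would be reported as male roughly 9 times more often than female, relative to how common each label is.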


In [10]:

# show 5 most informative features
# (the method prints the table itself and returns None, so no print() wrapper is needed)
classifier.show_most_informative_features(5)

print (classifier.classify(gender_features('Nitin'))) # Output: male

print (classifier.classify(gender_features('Maggie'))) # Output: female


Most Informative Features
             last_letter = 'a'            female : male   =     37.7 : 1.0
             last_letter = 'k'              male : female =     27.3 : 1.0
             last_letter = 'f'              male : female =     15.2 : 1.0
             last_letter = 'd'              male : female =     11.3 : 1.0
             last_letter = 'o'              male : female =      9.1 : 1.0
male
female


**Note**:
We can modify the *gender_features* function to produce a richer feature set, which may improve the accuracy of the trained classifier. For example, we can use both the first and last letters of each name as features.
Feature extractors are built through trial and error, guided by intuition.
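As one possible variant, the sketch below returns both the first and the last letter. The name `gender_features_v2` is hypothetical, and lower-casing the word is an added assumption (so that “Nitin” and “nitin” give the same features); whether this variant actually improves accuracy would need to be checked by retraining.

```python
def gender_features_v2(word):
    """Hypothetical variant of gender_features: first and last letter.

    The word is lower-cased first (an assumption, not in the original
    function) so that capitalisation does not create separate features.
    """
    word = word.lower()
    return {'first_letter': word[0], 'last_letter': word[-1]}

print(gender_features_v2('Nitin'))   # {'first_letter': 'n', 'last_letter': 'n'}
print(gender_features_v2('Maggie'))  # {'first_letter': 'm', 'last_letter': 'e'}
```

To try it, the feature set would be rebuilt with this function in place of gender_features, and the classifier retrained and re-evaluated on the same split.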