
# Natural Language Processing (NLP)

## Text Classification

**Identifying whether a name is male or female**

I am using NLTK’s names corpus as the labeled training data. The corpus contains around 8K male and female names.
It was compiled by Kantrowitz and Ross.

So we have two categories for classification: male and female. Every name in our training data (the “names” corpus) is already labeled with one of them.

In [9]:
import nltk
nltk.download('names')

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!


True

In [24]:
from nltk.corpus import names 
 
# View the file IDs present in the names corpus
print (names.fileids()) # Output: ['female.txt', 'male.txt']

['female.txt', 'male.txt']


In [0]:
male_names = names.words('male.txt')
female_names = names.words('female.txt')
 
print (len(male_names)) # Output: 2943
print (len(female_names)) # Output: 5001
 
# print 15 female names
print (female_names[1200:1215]) 
 
# print 15 male names
print (male_names[1200:1215])

**Feature Extraction**

To classify text into a category, we need to define some criteria. On the basis of those criteria, our classifier learns that a particular kind of text falls into a particular category. Each such criterion is known as a feature. We can define one or more features to train our classifier.

In this example, we will use the last letter of the names as the feature.

We will define a function that extracts the last letter of any provided word. The function will return a dictionary containing the last letter information of the given word.

In [11]:
def gender_features(word):
    return {'last_letter' : word[-1]}
 
print (gender_features('Nipun')) # Output: {'last_letter': 'n'}

{'last_letter': 'n'}


The dictionary returned by the above function is called a feature set. This feature set is used to train the classifier.

We will now create a feature set using all the male and female names.

For this, we first combine the male and female names and shuffle the combined array.

In [21]:
from nltk.corpus import names 
import random 
 
names_masculine = names.words('male.txt')
names_feminine = names.words('female.txt')

labeled_names_feminine = [(str(name), 'female') for name in names_feminine]
 
# printing  15 labeled female names
print (labeled_names_feminine[1200:1215])
print('No of Female names :',len(labeled_names_feminine))

labeled_names_masculine = [(str(name), 'male') for name in names_masculine]
 
# printing  15 labeled male names
print (labeled_names_masculine[1200:1215])
print('No of Male names :',len(labeled_names_masculine))
 
# combine labeled male and labeled female names
labeled_all_names = labeled_names_masculine + labeled_names_feminine
print('Total Names : ',len(labeled_all_names))

# shuffle the labeled names array
random.shuffle(labeled_all_names)
 
# printing  15 labeled all/combined names
print (labeled_all_names[1200:1215])

[('Danita', 'female'), ('Danna', 'female'), ('Danni', 'female'), ('Dannie', 'female'), ('Danny', 'female'), ('Dannye', 'female'), ('Danya', 'female'), ('Danyelle', 'female'), ('Danyette', 'female'), ('Daphene', 'female'), ('Daphna', 'female'), ('Daphne', 'female'), ('Dara', 'female'), ('Darb', 'female'), ('Darbie', 'female')]
No of Female names : 5001
[('Howard', 'male'), ('Howie', 'male'), ('Hoyt', 'male'), ('Hubert', 'male'), ('Hudson', 'male'), ('Huey', 'male'), ('Hugh', 'male'), ('Hugo', 'male'), ('Humbert', 'male'), ('Humphrey', 'male'), ('Hunt', 'male'), ('Hunter', 'male'), ('Huntington', 'male'), ('Huntlee', 'male'), ('Huntley', 'male')]
No of Male names : 2943
Total Names :  7944
[('Eleanora', 'female'), ('Tersina', 'female'), ('Magdaia', 'female'), ('Nina', 'female'), ('Shepherd', 'male'), ('Danelle', 'female'), ('Valerye', 'female'), ('Bernetta', 'female'), ('Kiri', 'female'), ('Giralda', 'female'), ('Kelcy', 'female'), ('Sharla', 'female'), ('Annmarie', 'female'), ('Bella', 

**Extracting Feature & Creating Feature Set**

We use the gender_features function that we defined above to extract the feature from the labeled names data. As mentioned above, the feature for this example will be the last letter of the names. So, we extract the last letter of all the labeled names and create a new array with the last letter of each name and the associated label for that particular name. This new array is called the feature set.

In [28]:

feature_set = [(gender_features(name), gender) for (name, gender) in labeled_all_names]
 
print (labeled_all_names[:15])

 
print (feature_set[:15])

[('Franklyn', 'male'), ('Donica', 'female'), ('Stanly', 'male'), ('Cookie', 'female'), ('Ferinand', 'male'), ('Sandor', 'male'), ('Melodee', 'female'), ('Wallas', 'male'), ('Marwin', 'male'), ('Liuka', 'female'), ('Loree', 'female'), ('Rachelle', 'female'), ('Eloisa', 'female'), ('Yehudi', 'male'), ('Viva', 'female')]
[({'last_letter': 'n'}, 'male'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'y'}, 'male'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'd'}, 'male'), ({'last_letter': 'r'}, 'male'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 's'}, 'male'), ({'last_letter': 'n'}, 'male'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'i'}, 'male'), ({'last_letter': 'a'}, 'female')]


**Training Classifier**

From the feature set we created above, we now create a separate training set and a separate testing/validation set. The train set is used to train the classifier and the test set is used to test the classifier to check how accurately it classifies the given text.

**Creating Train and Test Dataset**

Now we split the dataset using scikit-learn’s train_test_split.
We use an 80/20 split between the training and testing sets, i.e. 80 percent training set and 20 percent testing set.

test_size : 0.20, the fraction of samples held out for testing; the rest form the training set

random_state : 73, the seed used by the random number generator

In [26]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(feature_set, test_size=0.20, random_state=73)

print ('Train set Length',len(train_set)) # Output: 6355
print ('Test set Length',len(test_set)) # Output: 1589

Train set Length 6355
Test set Length 1589


**Training a Classifier**

Now we train a classifier using the training dataset. There are different kinds of classifiers, such as the Naive Bayes Classifier, Maximum Entropy Classifier, Decision Tree Classifier, and Support Vector Machine Classifier.

In this example, we use the Naive Bayes Classifier. It’s a simple, fast probabilistic classifier based on applying Bayes’ theorem, and it performs well on small datasets. Bayes’ theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event.
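
To make the Bayes’ theorem step concrete, here is a minimal hand-rolled sketch for a single feature (the last letter). The toy data, the `alpha=1.0` add-one smoothing, and the 26-letter `vocab_size` are assumptions chosen for illustration; NLTK’s classifier uses its own probability estimator, so its numbers will differ.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (last_letter, label) pairs, not taken from the corpus.
toy_train = [('a', 'female'), ('a', 'female'), ('e', 'female'),
             ('n', 'male'), ('k', 'male'), ('a', 'male')]

def train_counts(pairs):
    """Count labels and, for each label, how often each letter occurs."""
    label_counts = Counter(label for _, label in pairs)
    letter_counts = defaultdict(Counter)
    for letter, label in pairs:
        letter_counts[label][letter] += 1
    return label_counts, letter_counts

def posterior(letter, label_counts, letter_counts, alpha=1.0, vocab_size=26):
    """Return P(label | letter) via Bayes' theorem with add-alpha smoothing."""
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        prior = n_label / total  # P(label)
        # Smoothed P(letter | label); alpha and vocab_size are assumptions.
        likelihood = (letter_counts[label][letter] + alpha) / (n_label + alpha * vocab_size)
        scores[label] = prior * likelihood
    norm = sum(scores.values())  # P(letter), the normalising constant
    return {label: score / norm for label, score in scores.items()}

label_counts, letter_counts = train_counts(toy_train)
probs = posterior('a', label_counts, letter_counts)
print(max(probs, key=probs.get))  # female
```

With these toy counts both priors are 0.5, so the prediction is driven entirely by how often “a” appears under each label.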

In [0]:

from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)

**Testing the trained Classifier**

Let’s see the output of the classifier by providing some names to it.

In [22]:

print (classifier.classify(gender_features('Nipun'))) # Output: male
 
print (classifier.classify(gender_features('Roxie'))) # Output: female

male
female


Let’s see the accuracy of the trained classifier. The accuracy value changes each time you run the program because the names array is shuffled above without a fixed seed.
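
If a repeatable accuracy figure is wanted, one option is to shuffle with a seeded generator. This is only a sketch: `seeded_shuffle` is a hypothetical helper, the seed value 73 simply mirrors the `random_state` used earlier, and the four names stand in for `labeled_all_names`.

```python
import random

# Hypothetical stand-in for labeled_all_names.
labeled = [('Anna', 'female'), ('Rock', 'male'), ('Sara', 'female'), ('Hugo', 'male')]

def seeded_shuffle(items, seed=73):
    """Return a shuffled copy whose order depends only on the seed."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # private generator, global state untouched
    return shuffled

first = seeded_shuffle(labeled)
second = seeded_shuffle(labeled)
print(first == second)  # True: same seed, same order
```

Using a private `random.Random` instance keeps the rest of the notebook’s random state unaffected.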



In [17]:

from nltk import classify 
 
accuracy = classify.accuracy(classifier, test_set)
 
print (accuracy) # Output: 0.77

0.7746666666666666
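
Accuracy here is just the fraction of test examples whose predicted label matches the true label. The sketch below spells that out with a hypothetical rule-based `toy_predict` standing in for `classifier.classify` and a made-up four-item test set; it illustrates the idea and is not NLTK’s implementation.

```python
def accuracy(predict, labeled_featuresets):
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if predict(features) == label)
    return correct / len(labeled_featuresets)

# Hypothetical rule-based stand-in for classifier.classify.
def toy_predict(features):
    return 'female' if features['last_letter'] in 'aei' else 'male'

# Hypothetical test set in the same (features, label) shape as feature_set.
toy_test = [({'last_letter': 'a'}, 'female'),
            ({'last_letter': 'n'}, 'male'),
            ({'last_letter': 'e'}, 'female'),
            ({'last_letter': 'i'}, 'male')]

print(accuracy(toy_predict, toy_test))  # 3 of 4 correct -> 0.75
```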


Let’s see the most informative features among the entire features in the feature set.

The result shows that names ending with the letter “a” are female 36.1 times more often than they are male, while names ending with the letter “k” are male 28.5 times more often than they are female. The same reading applies to the other letters. These ratios are also called likelihood ratios.

Therefore, if you provide a name ending with letter “k” to the above trained classifier then it will predict it as “male” and if you provide a name ending with the letter “a” to the classifier then it will predict it as “female”.
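
A likelihood ratio of this kind can be sketched as the ratio of two conditional frequencies. The counts below are hypothetical and chosen only to keep the arithmetic easy; they are not the corpus counts, and NLTK additionally smooths its estimates.

```python
def likelihood_ratio(count_in_a, total_a, count_in_b, total_b):
    """P(feature | class A) / P(feature | class B) from raw counts."""
    return (count_in_a / total_a) / (count_in_b / total_b)

# Hypothetical counts: the letter appears in 450 of 1000 class-A names
# and in 25 of 500 class-B names.
ratio = likelihood_ratio(count_in_a=450, total_a=1000, count_in_b=25, total_b=500)
print(round(ratio, 1))  # 0.45 / 0.05 -> 9.0
```

A ratio of 9.0 would be displayed in the same style as the table below, as `9.0 : 1.0` in favour of class A.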


In [35]:

# show 5 most informative features
# (show_most_informative_features prints the table itself and returns None,
# so it is called directly rather than wrapped in print)
classifier.show_most_informative_features(5)

print ('Rock : ',classifier.classify(gender_features('Rock'))) # Output: male

print ('Sara : ',classifier.classify(gender_features('Sara'))) # Output: female

print ('Nipun : ',classifier.classify(gender_features('Nipun'))) # Output: male


Most Informative Features
             last_letter = 'a'            female : male   =     36.1 : 1.0
             last_letter = 'k'              male : female =     28.5 : 1.0
             last_letter = 'v'              male : female =     15.3 : 1.0
             last_letter = 'f'              male : female =     14.7 : 1.0
             last_letter = 'd'              male : female =     10.0 : 1.0
Rock :  male
Sara :  female
Nipun :  male


**Note**:
We can modify the *gender_features* function to generate a feature set that may improve the accuracy of the trained classifier. For example, we can use both the first and last letters of the names as features.
Feature extractors are built through a process of trial and error, guided by intuition.
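
As a sketch of the idea in the note, a variant extractor could return both letters. The name `gender_features_v2` and the choice to lowercase the letters are assumptions made for illustration; whether this actually improves accuracy has to be checked by retraining and re-evaluating.

```python
def gender_features_v2(word):
    """Hypothetical variant: first and last letter of the name, lowercased."""
    return {'first_letter': word[0].lower(), 'last_letter': word[-1].lower()}

print(gender_features_v2('Nipun'))  # {'first_letter': 'n', 'last_letter': 'n'}
```

To try it, the feature set would be rebuilt with this function in place of *gender_features* before splitting and training again.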