
# Installation and import of necessary packages

In [1]:
import nltk
import string
import random
import pandas as pd

# Download necessary corpus and models from nltk

**Use the "names" corpus from nltk to build a simple model for gender classification of names.**

In [2]:
nltk.download("names")
nltk.download('product_reviews_1')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package names to /usr/share/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package product_reviews_1 to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package product_reviews_1 is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
print(nltk.corpus.names.fileids())

['female.txt', 'male.txt']


# Data acquisition

* The names corpus contains two text files.
* `male.txt` lists the names most commonly used for males.
* `female.txt` lists the names most commonly used for females.

**Start by extracting the female and male names into separate lists.**

In [4]:
female_name=nltk.corpus.names.words('female.txt')
male_name=nltk.corpus.names.words('male.txt')

**Create a labeled list of `(name, label)` tuples, with names from `male.txt` labeled `'male'` and names from `female.txt` labeled `'female'`.**

In [5]:
labeled_data= ([(name,'male') for name in male_name] + [(name,'female') for name in female_name])

In [6]:
labeled_data[:10]

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male')]

# Feature Extraction

* Text data is unstructured, so features must be extracted before it can be used in ML models.
* Here the features are chosen by hand: name length, first letter, last letter, vowel count, consonant count, two-letter suffix, and the count of each letter in the name.
* The function below extracts these features and returns them as a dictionary.

In [7]:
def get_features(name):
    name=name.lower()
    feature_dict={}
    
    # Getting the features like last letter, first letter,length
    feature_dict['length']=len(name)
    feature_dict['first_letter']=name[0]
    feature_dict['last letter']=name[-1]
    
    # Count the vowels; every other character (including hyphens or spaces) counts as a consonant
    vowels=set('aeiou')
    feature_dict['vowel_count']=sum(1 for char in name if char in vowels)
    feature_dict['consonant_count']=len(name)- feature_dict['vowel_count']
    
    # Common suffix (last two letters)
    feature_dict['suffix'] = name[-2:] if len(name) > 1 else name[-1]

    # Frequency of each letter in the name
    for char in string.ascii_lowercase:
        feature_dict[f'count_{char}'] = name.count(char)
    
    return feature_dict

**Transform names in the labeled data to these features using the above function.**

In [8]:
new_lab_data= []
for name, label in labeled_data:
    features = get_features(name)
    new_lab_data.append((features, label))

new_lab_data[:2]

[({'length': 5,
   'first_letter': 'a',
   'last letter': 'r',
   'vowel_count': 3,
   'consonant_count': 2,
   'suffix': 'ir',
   'count_a': 2,
   'count_b': 0,
   'count_c': 0,
   'count_d': 0,
   'count_e': 0,
   'count_f': 0,
   'count_g': 0,
   'count_h': 0,
   'count_i': 1,
   'count_j': 0,
   'count_k': 0,
   'count_l': 0,
   'count_m': 1,
   'count_n': 0,
   'count_o': 0,
   'count_p': 0,
   'count_q': 0,
   'count_r': 1,
   'count_s': 0,
   'count_t': 0,
   'count_u': 0,
   'count_v': 0,
   'count_w': 0,
   'count_x': 0,
   'count_y': 0,
   'count_z': 0},
  'male'),
 ({'length': 5,
   'first_letter': 'a',
   'last letter': 'n',
   'vowel_count': 3,
   'consonant_count': 2,
   'suffix': 'on',
   'count_a': 2,
   'count_b': 0,
   'count_c': 0,
   'count_d': 0,
   'count_e': 0,
   'count_f': 0,
   'count_g': 0,
   'count_h': 0,
   'count_i': 0,
   'count_j': 0,
   'count_k': 0,
   'count_l': 0,
   'count_m': 0,
   'count_n': 1,
   'count_o': 1,
   'count_p': 0,
   'count_q': 0,
 

# Model development

**Shuffle the data before splitting it into train and test sets. The list holds all male names first and all female names after them, so without shuffling the first 1000 records would belong to a single class.**

In [9]:
random.shuffle(new_lab_data)

**Select the first 1000 records of the shuffled data as the test set and the remainder as the training set.**

In [10]:
test_data = new_lab_data[:1000]
train_data = new_lab_data[1000:]

**Train a Naive Bayes classifier on the training set.**

In [11]:
classifier = nltk.NaiveBayesClassifier.train(train_data)

**Once training is complete, the classifier can be used to classify a single name.**

In [12]:
classifier.classify(get_features('Rohan'))

'male'


**Note:**
  - Before classification, the input text must be converted into the same features as the training data.
  - The same feature extraction function, `get_features`, handles this transformation.


**This classifier object can also be used to classify multiple text inputs at the same time.**

* To do so, pass a list of unlabeled feature sets to the classifier's `classify_many` method.
* The snippet below separates the labels from the feature-extracted test data to prepare the input for `classify_many`.

In [13]:
test_features = []
test_labels = []
for feature_set, label in test_data:
    test_features.append(feature_set)
    test_labels.append(label)
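
The same separation can be written more compactly by transposing the list of pairs with `zip`. A minimal sketch on toy data; `toy_test_data` and its feature values are made up for illustration:

```python
# Toy stand-in for test_data: a list of (feature_dict, label) tuples
toy_test_data = [
    ({"length": 5, "suffix": "an"}, "male"),
    ({"length": 4, "suffix": "na"}, "female"),
]

# zip(*pairs) transposes the list of pairs into two tuples
toy_features, toy_labels = zip(*toy_test_data)
toy_features, toy_labels = list(toy_features), list(toy_labels)

print(toy_labels)  # ['male', 'female']
```

Note that the unpacking fails on an empty list, so the explicit loop is the safer choice when the input may be empty.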

**Obtain the classes for the test input.**

In [14]:
test_labels_pred = classifier.classify_many(test_features)

# Evaluation

**Confusion Matrix**

In [15]:
for_matrix = pd.DataFrame({'pred' : test_labels_pred, 'act' : test_labels})

In [16]:
confusion_mat = pd.crosstab(for_matrix.pred, for_matrix.act)
confusion_mat

| pred \ act | female | male |
|:-----------|-------:|-----:|
| female     |    507 |   81 |
| male       |    119 |  293 |

In [17]:
# Get the true positives, true negatives, false positives and false negatives,
# treating 'female' as the positive class (rows = predicted, columns = actual)
TP = confusion_mat.iloc[0,0]
TN = confusion_mat.iloc[1,1]
FP = confusion_mat.iloc[0,1]
FN = confusion_mat.iloc[1,0]

In [18]:
Accuracy = (TP + TN) / sum([TP, TN, FP, FN]) * 100
print(f"Accuracy : {Accuracy:0.2f} %")

Accuracy : 80.00 %
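
Accuracy alone hides how the errors are distributed between the two classes. A minimal sketch of precision, recall and F1 computed from the same four counts, assuming `'female'` is the positive class as in the cell above; the values are copied by hand from the confusion matrix shown earlier:

```python
# Counts copied from the confusion matrix, with 'female' as the positive class
TP, FP, FN, TN = 507, 81, 119, 293

precision = TP / (TP + FP)  # share of predicted females that are actually female
recall = TP / (TP + FN)     # share of actual females that were predicted female
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision : {precision:.3f}")
print(f"Recall    : {recall:.3f}")
print(f"F1 score  : {f1:.3f}")
```

For this split, precision is roughly 0.862, recall roughly 0.810 and F1 roughly 0.835.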


**NLTK also provides a function to compute the model's accuracy. `nltk.classify.accuracy` returns it as a fraction rather than a percentage.**

In [19]:
## Accuracy on test data :
nltk.classify.accuracy(classifier, test_data)

0.8

**The NLTK Naive Bayes classifier can also list the `n` most informative features for the classification.**

In [20]:
classifier.show_most_informative_features(n = 15)

Most Informative Features
                  suffix = 'na'           female : male   =     96.3 : 1.0
                  suffix = 'la'           female : male   =     70.9 : 1.0
                  suffix = 'rt'             male : female =     52.8 : 1.0
                  suffix = 'ia'           female : male   =     35.0 : 1.0
                  suffix = 'sa'           female : male   =     32.6 : 1.0
             last letter = 'a'            female : male   =     31.9 : 1.0
             last letter = 'k'              male : female =     29.9 : 1.0
                  suffix = 'rd'             male : female =     27.9 : 1.0
                  suffix = 'us'             male : female =     26.5 : 1.0
                  suffix = 'ra'           female : male   =     24.9 : 1.0
                  suffix = 'ta'           female : male   =     23.9 : 1.0
                  suffix = 'do'             male : female =     21.7 : 1.0
                  suffix = 'ld'             male : female =     21.7 : 1.0
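
Each row above is a likelihood ratio: how much more likely a feature value is under one label than under the other. A minimal sketch of that arithmetic, assuming the classifier's default smoothing (the expected likelihood estimate, which adds 0.5 to every count). The helper `ele_prob` and all counts below are hypothetical and are not taken from the names corpus:

```python
def ele_prob(count, total, bins, gamma=0.5):
    """Lidstone-smoothed probability; gamma=0.5 gives the expected likelihood estimate."""
    return (count + gamma) / (total + gamma * bins)

# Hypothetical counts for illustration only
bins = 300                             # assumed number of distinct suffix values in training
female_total, male_total = 4000, 2500  # assumed training names per label
female_na, male_na = 200, 1            # assumed names ending in "na" per label

p_female = ele_prob(female_na, female_total, bins)  # P(suffix='na' | female)
p_male = ele_prob(male_na, male_total, bins)        # P(suffix='na' | male)
ratio = p_female / p_male

print(f"suffix = 'na'  female : male = {ratio:.1f} : 1.0")
```

The smoothing keeps the ratio finite even when a value never occurs under one label, which is why rare suffixes can still appear in the list with large but bounded ratios.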