# **NBC Full name model**

The naïve Bayes approach grew out of classical mathematical theory; it has a sound 
mathematical basis and gives consistent classification performance. It is a simple classifier 
built on Bayes' theorem: it estimates the probability of each class from the probabilities of 
the observed events, here the words in a name. Encoding these per-word probabilities is 
helpful because they are later combined to give the final probability that a name belongs to 
a class. The main disadvantage of the naïve Bayes classifier is that, without smoothing, it 
assigns a probability of 0 to words that are not in the list. 
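
The zero-probability problem can be illustrated with a small hand-rolled sketch. The names, labels, and helper functions below are hypothetical and only show the idea; this is not scikit-learn's implementation. For reference, scikit-learn's `MultinomialNB` applies additive smoothing through its `alpha` parameter, which is 1.0 by default.

```python
from collections import Counter

# Toy training data (hypothetical names and labels, for illustration only)
train = [
    ("raj patel", "A"),
    ("anil patel", "A"),
    ("john smith", "B"),
    ("mary smith", "B"),
]

def fit(samples):
    """Count tokens per class and collect the shared vocabulary."""
    counts, docs = {}, Counter()
    for text, label in samples:
        docs[label] += 1
        counts.setdefault(label, Counter()).update(text.split())
    vocab = {tok for c in counts.values() for tok in c}
    return counts, docs, vocab

def token_prob(token, label, counts, vocab, alpha):
    """P(token | label) with add-alpha smoothing (alpha=0 means no smoothing)."""
    total = sum(counts[label].values())
    return (counts[label][token] + alpha) / (total + alpha * len(vocab))

def class_scores(text, counts, docs, vocab, alpha):
    """Unnormalised P(label) * product of P(token | label) for each class."""
    n_docs = sum(docs.values())
    scores = {}
    for label in counts:
        score = docs[label] / n_docs
        for tok in text.split():
            if tok in vocab:  # tokens never seen in training are skipped
                score *= token_prob(tok, label, counts, vocab, alpha)
        scores[label] = score
    return scores

counts, docs, vocab = fit(train)
unsmoothed = class_scores("raj smith", counts, docs, vocab, alpha=0.0)
smoothed = class_scores("raj smith", counts, docs, vocab, alpha=1.0)
print("no smoothing:", unsmoothed)
print("alpha = 1   :", smoothed)
```

Without smoothing, a single word that was never seen with a class drives that class's score to zero, so both classes tie at zero here; with `alpha = 1` every class keeps a non-zero score and the two can be ranked.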

## Data processing

Here we import the ethnicity data from the file and select the relevant columns. We convert the labels to numeric codes and clean the names by removing titles (such as dr, mr, prof) and special characters. Cleaning is essential because noisy data can produce a wrong model.

Vectorising text: creating tokens and a vocabulary from the words is an important step. We use CountVectorizer, which transforms a given text into a vector based on the frequency (count) of each word that occurs in the text. With ngram_range=(1,2) it counts single words and pairs of adjacent words, so each name is converted into a single vector.
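
As a rough illustration of what a unigram-plus-bigram count vector looks like, here is a minimal standard-library sketch. The names and the helpers `name_ngrams` and `vectorise` are hypothetical, and tokens are split on whitespace only, which is a simplification: CountVectorizer's default tokeniser also drops single-character tokens.

```python
from collections import Counter

def name_ngrams(name, ngram_range=(1, 2)):
    """Return all word n-grams of a name for the given (min_n, max_n) range."""
    tokens = name.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Hypothetical names used to learn the vocabulary
names = ["anil kumar patel", "maria garcia"]
vocabulary = sorted({g for name in names for g in name_ngrams(name)})
index = {gram: i for i, gram in enumerate(vocabulary)}

def vectorise(name):
    """Count each known n-gram; n-grams outside the vocabulary are ignored."""
    row = [0] * len(vocabulary)
    for gram, count in Counter(name_ngrams(name)).items():
        if gram in index:
            row[index[gram]] = count
    return row

print(vocabulary)
print(vectorise("maria garcia"))
print(vectorise("maria patel"))  # the bigram "maria patel" is unknown and dropped
```

Each name becomes one row whose length equals the vocabulary size, which is why the feature matrix for the full dataset is very wide and stored as a sparse matrix.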

In [None]:
# NBC FULL NAME MODEL

#import libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
import re
from sklearn.model_selection import KFold
import joblib
from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import MultinomialNB
import seaborn as sns
import matplotlib.pyplot as plt

# read data , and choose columns and get info about data
df = pd.read_csv("ethnicity collection data.csv")
df.shape
df.head()
df.groupby('ethinicity')['full name'].size()
df =df[['full name','ethinicity']]

# convert the columns into lower case and remove na
# converting classes into numeric 
df['full name'] = df['full name'].str.lower()
df['ethinicity']=df['ethinicity'].str.lower()
df['ethinicity']=df['ethinicity'].map({'asian-indian':1,'black non hispanic':2,'hispanic':3 , 'white non hispanic':4,'asian-east':5})
df = df.dropna()

# clean text: remove special characters, numbers and leading titles (dr, mr, prof, adv)
def cleaning(text):
    only_words = re.sub(r'([^A-Za-z ]+)|(^dr\.)|(^dr )|(^mr\.)|(^mr )|(^prof\.)|(^adv\. )',' ',text )
    return only_words

df['full_name_cleaned']=df['full name'].apply(cleaning)

#save data in new file 
df.to_csv('data.csv', encoding='utf-8', index=False)
#load new data (using the same encoding it was written with)
cleaned_data = pd.read_csv("data.csv", encoding="utf-8")
pd.set_option('display.max_colwidth', None)
#select the feature (cleaned names) and the target (numeric labels)
data = cleaned_data.full_name_cleaned
target = cleaned_data.ethinicity

# vectorise text - ngram 1,2 using count vectoriser
# here we are using full name model
countvect = CountVectorizer(ngram_range=(1,2))
name = countvect.fit_transform(data)
a = countvect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
len(a)


## Model Creation

To create the model we followed these steps:

1.   We used cross-validation to compare models and select the best-trained one. It gives a general idea of how the model will perform on unseen data, and its performance estimate is generally less biased and less optimistic than one from a single split. We specifically used the k-fold technique with k = 10.
2.   We used SMOTE to oversample the data. Our data was not balanced, which can bias the model towards the majority class. To avoid that, we applied SMOTE to the training data of each fold.
3. We split the data into training and test sets and ran the model.
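
The ordering of these steps matters: oversampling is applied to the training part of each fold only, so the test part keeps its original class distribution. The sketch below illustrates that ordering with the standard library. It uses plain random duplication as a stand-in for SMOTE (SMOTE instead synthesises new points by interpolating between neighbouring minority samples), and the labels and helper names are hypothetical.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=1):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def oversample(train, labels, seed=101):
    """Duplicate random minority samples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for j in train:
        by_class.setdefault(labels[j], []).append(j)
    largest = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(largest - len(members)))
    return balanced

# Hypothetical imbalanced labels: 14 of class 4, 4 of class 1, 2 of class 5
labels = [4] * 14 + [1] * 4 + [5] * 2

fold_summaries = []
for train, test in k_fold_indices(len(labels), k=5):
    balanced = oversample(train, labels)  # training fold only; test is untouched
    class_counts = Counter(labels[j] for j in balanced)
    fold_summaries.append((class_counts, len(test)))
    print(dict(class_counts), "test size:", len(test))
```

Balancing before splitting would let copies of the same sample land in both the training and the test part, which inflates the measured accuracy.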



In [None]:
# implementing cross-validation, oversampling and fitting the model
def Cross_validation(data, target, countvect, clf_cv, model_name): # Performs cross-validation

    kf = KFold(n_splits=10, shuffle=True, random_state=1) # 10-fold cross-validation
    scores=[]
    data_train_list = []
    targets_train_list = []
    data_test_list = []
    targets_test_list = []
    iteration = 0
    print("Performing cross-validation for {}...".format(model_name))
    for train_index, test_index in kf.split(data):
        iteration += 1
        print("Iteration ", iteration)
        # splitting the text data into train and test folds (positional indexing)
        data_train_cv, targets_train_cv = data.iloc[train_index], target.iloc[train_index]
        data_test_cv, targets_test_cv = data.iloc[test_index], target.iloc[test_index]
        data_train_list.append(data_train_cv) 
        data_test_list.append(data_test_cv) 
        targets_train_list.append(targets_train_cv) 
        targets_test_list.append(targets_test_cv)
        # using CountVectorizer to convert the text into numeric count vectors
        countvect.fit(data_train_cv.values.astype('U')) # learning vocabulary of training set
        data_train_countvect_cv = countvect.transform(data_train_cv.values.astype('U'))
        print(data_train_countvect_cv.shape)
        print(targets_train_cv.shape)

        # balancing the training dataset for each iteration using SMOTE
        print("Number of observations in each class before oversampling (training data): \n", pd.Series(targets_train_cv).value_counts())
        smote = SMOTE(random_state = 101)
        # fit_sample() was removed from imbalanced-learn; fit_resample() is the current name
        data_train_countvect_cv,targets_train_cv = smote.fit_resample(data_train_countvect_cv,targets_train_cv)
        print("Number of observations in each class after oversampling (training data): \n", pd.Series(targets_train_cv).value_counts())
        
        # print shape of train and test data
        print("Shape of training data: ", data_train_countvect_cv.shape)
        data_test_countvect_cv = countvect.transform(data_test_cv.values.astype('U'))
        print("Shape of test data: ", data_test_countvect_cv.shape)
        clf_cv.fit(data_train_countvect_cv, targets_train_cv) # Fitting model
        score = clf_cv.score(data_test_countvect_cv, targets_test_cv) # Calculating accuracy
        scores.append(score) # appending cross-validation accuracy for each iteration
    print("List of cross-validation accuracies for {}: ".format(model_name), scores)
    mean_accuracy = np.mean(scores)
    print("Mean cross-validation accuracy for {}: ".format(model_name), mean_accuracy)
    print("Best cross-validation accuracy for {}: ".format(model_name), max(scores))

    # finding the fold with the best cross-validation accuracy
    max_acc_index = scores.index(max(scores))
    max_acc_data_train = data_train_list[max_acc_index]
    max_acc_data_test = data_test_list[max_acc_index]
    max_acc_targets_train = targets_train_list[max_acc_index] 
    max_acc_targets_test = targets_test_list[max_acc_index] 

    return mean_accuracy, max_acc_data_train, max_acc_data_test, max_acc_targets_train, max_acc_targets_test

def c_matrix(max_acc_data_train, max_acc_data_test, max_acc_targets_train, max_acc_targets_test, countvect, target, clf, model_name): # Refits NBC on the best fold (without SMOTE) and prints the classification report
    countvect.fit(max_acc_data_train.values.astype('U'))
    max_acc_data_train_countvect = countvect.transform(max_acc_data_train.values.astype('U'))
    max_acc_data_test_countvect = countvect.transform(max_acc_data_test.values.astype('U'))
    clf.fit(max_acc_data_train_countvect, max_acc_targets_train) # Fitting NBC
    targets_pred = clf.predict(max_acc_data_test_countvect) # Prediction on test data
    conf_mat = classification_report(max_acc_targets_test, targets_pred)
    print(conf_mat)

# fitting model
NBC_clf = MultinomialNB() 
NBC_mean_accuracy, max_acc_data_train, max_acc_data_test, max_acc_targets_train, max_acc_targets_test = Cross_validation(data, target, countvect, NBC_clf, "NBC") # NBC cross-validation
c_matrix(max_acc_data_train, max_acc_data_test, max_acc_targets_train, max_acc_targets_test, countvect, target, NBC_clf, "NBC") # NBC confusion matrix

# Saving model
def NBC_Save(data, target, countvect):
    countvect.fit(data.values.astype('U')) # learn vocabulary of entire data
    data_countvect = countvect.transform(data.values.astype('U'))
    # one row per n-gram: the n-gram, then its column index (no header row)
    pd.DataFrame.from_dict(data=dict([word, i] for i, word in enumerate(countvect.get_feature_names_out())), orient='index').to_csv('vocabulary_NBC.csv', header=False)
    print("Shape of countvect matrix for saved NBC Model: ", data_countvect.shape)
    clf = MultinomialNB().fit(data_countvect, target)
    joblib.dump(clf, 'nbc.sav')

NBC_Save(data, target, countvect)


Performing cross-validation for NBC...
Iteration  1
(560904, 790226)
(560904,)
Number of observations in each class before oversampling (training data): 
 4.0    374214
1.0     85482
3.0     49811
2.0     37902
5.0     13495
Name: ethinicity, dtype: int64




Number of observations in each class after oversampling (training data): 
 1.0    374214
5.0    374214
4.0    374214
3.0    374214
2.0    374214
dtype: int64
Shape of training data:  (1871070, 790226)
Shape of test data:  (62323, 790226)
Iteration  2
(560904, 789900)
(560904,)
Number of observations in each class before oversampling (training data): 
 4.0    374124
1.0     85736
3.0     49659
2.0     37962
5.0     13423
Name: ethinicity, dtype: int64




Number of observations in each class after oversampling (training data): 
 1.0    374124
5.0    374124
4.0    374124
3.0    374124
2.0    374124
dtype: int64
Shape of training data:  (1870620, 789900)
Shape of test data:  (62323, 789900)
Iteration  3
(560904, 789837)
(560904,)
Number of observations in each class before oversampling (training data): 
 4.0    374394
1.0     85395
3.0     49696
2.0     37972
5.0     13447
Name: ethinicity, dtype: int64




Number of observations in each class after oversampling (training data): 
 1.0    374394
5.0    374394
4.0    374394
3.0    374394
2.0    374394
dtype: int64
Shape of training data:  (1871970, 789837)
Shape of test data:  (62323, 789837)
Iteration  4
(560904, 790116)
(560904,)
Number of observations in each class before oversampling (training data): 
 4.0    374340
1.0     85493
3.0     49660
2.0     37975
5.0     13436
Name: ethinicity, dtype: int64




Number of observations in each class after oversampling (training data): 
 1.0    374340
5.0    374340
4.0    374340
3.0    374340
2.0    374340
dtype: int64
Shape of training data:  (1871700, 790116)
Shape of test data:  (62323, 790116)
Iteration  5
(560904, 789934)
(560904,)
Number of observations in each class before oversampling (training data): 
 4.0    374330
1.0     85423
3.0     49762
2.0     38017
5.0     13372
Name: ethinicity, dtype: int64




Number of observations in each class after oversampling (training data): 
 1.0    374330
5.0    374330
4.0    374330
3.0    374330
2.0    374330
dtype: int64
Shape of training data:  (1871650, 789934)
Shape of test data:  (62323, 789934)
Iteration  6
(560904, 789856)
(560904,)
Number of observations in each class before oversampling (training data): 
 4.0    374281
1.0     85461
3.0     49745
2.0     37939
5.0     13478
Name: ethinicity, dtype: int64




Number of observations in each class after oversampling (training data): 
 1.0    374281
5.0    374281
4.0    374281
3.0    374281
2.0    374281
dtype: int64
Shape of training data:  (1871405, 789856)
Shape of test data:  (62323, 789856)
Iteration  7
(560904, 790034)
(560904,)
Number of observations in each class before oversampling (training data): 
 4.0    374345
1.0     85521
3.0     49681
2.0     37927
5.0     13430
Name: ethinicity, dtype: int64




Number of observations in each class after oversampling (training data): 
 1.0    374345
5.0    374345
4.0    374345
3.0    374345
2.0    374345
dtype: int64
Shape of training data:  (1871725, 790034)
Shape of test data:  (62323, 790034)
Iteration  8
(560905, 790009)
(560905,)
Number of observations in each class before oversampling (training data): 
 4.0    374298
1.0     85489
3.0     49693
2.0     37965
5.0     13460
Name: ethinicity, dtype: int64




Number of observations in each class after oversampling (training data): 
 1.0    374298
5.0    374298
4.0    374298
3.0    374298
2.0    374298
dtype: int64
Shape of training data:  (1871490, 790009)
Shape of test data:  (62322, 790009)
Iteration  9
(560905, 789874)
(560905,)
Number of observations in each class before oversampling (training data): 
 4.0    374346
1.0     85460
3.0     49720
2.0     37950
5.0     13429
Name: ethinicity, dtype: int64




Number of observations in each class after oversampling (training data): 
 1.0    374346
5.0    374346
4.0    374346
3.0    374346
2.0    374346
dtype: int64
Shape of training data:  (1871730, 789874)
Shape of test data:  (62322, 789874)
Iteration  10
(560905, 790302)
(560905,)
Number of observations in each class before oversampling (training data): 
 4.0    374248
1.0     85486
3.0     49715
2.0     37975
5.0     13481
Name: ethinicity, dtype: int64




Number of observations in each class after oversampling (training data): 
 1.0    374248
5.0    374248
4.0    374248
3.0    374248
2.0    374248
dtype: int64
Shape of training data:  (1871240, 790302)
Shape of test data:  (62322, 790302)
List of cross-validation accuracies for NBC:  [0.8784236959068081, 0.8736902909038397, 0.8794506041108419, 0.8775893329910306, 0.8769314699228214, 0.8793703769074017, 0.8792099225005214, 0.8768492667115946, 0.8774429575430827, 0.8793042585282885]
Mean cross-validation accuracy for NBC:  0.877826217602623
Best cross-validation accuracy for NBC:  0.8794506041108419
              precision    recall  f1-score   support

         1.0       0.98      0.92      0.95      9599
         2.0       0.94      0.11      0.20      4204
         3.0       0.89      0.63      0.74      5542
         4.0       0.85      0.99      0.92     41486
         5.0       0.99      0.53      0.69      1492

    accuracy                           0.88     62323
   macro avg    

The model in the last fold was built on a vocabulary of 790,302 unigrams and bigrams. It 
performs well on full names, with a mean cross-validation accuracy of about 88%. With the 
name as the only feature, the model works best on three groups (Asian-Indian Subcontinent, 
Hispanic, White non-Hispanic). This was expected, as the vocabulary for these classes was 
larger. Black non-Hispanic and East Asian names received recall scores of only 0.11 and 
0.53, respectively. For our main group, i.e., Indian Subcontinent origin, we got precision 
and recall scores of 0.98 and 0.92, respectively. This observation tells us that NBC might 
perform well for our class. However, using such a conditional model results in more false 
positives, which increases manual work. The advantage of naïve Bayes classification is that 
its training time is significantly shorter than that of other classifiers. It also needs 
little memory and CPU time.
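
To make the gap between accuracy and per-class recall concrete, the short sketch below recomputes F1 from the rounded precision and recall values printed in the report above, and compares the accuracy with a majority-class baseline for the same test fold. The inputs are the rounded figures from the output, so the results are approximate; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) copied from the classification report above
report = {
    "1.0 (Asian-Indian)": (0.98, 0.92),
    "2.0 (Black non-Hispanic)": (0.94, 0.11),
    "5.0 (Asian-East)": (0.99, 0.53),
}

f1_by_class = {label: round(f1_score(p, r), 2) for label, (p, r) in report.items()}
for label, value in f1_by_class.items():
    print(label, "F1 =", value)

# Accuracy of always predicting class 4 (White non-Hispanic) on this test fold:
# 41,486 of the 62,323 test names belong to that class.
majority_baseline = 41486 / 62323
print("majority-class baseline:", round(majority_baseline, 3))
```

A high precision cannot compensate for a low recall in F1, which is why class 2 ends up near 0.20 despite a precision of 0.94. The baseline also shows that roughly two thirds of the 88% accuracy is available just from the class imbalance, so the per-class scores are the more informative figures.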

## Deploying model


To deploy this model on the company's unseen data, we import the data together with the saved model and vocabulary. We clean the data and remove records that are not human names (e.g., LLC, trust, ltd, etc.). We also remove titles and special characters to get a better result. We then run the model on the unseen data and check the results.
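
The filtering and cleaning steps can be sketched with regular expressions as follows. The patterns, helper names, and sample owner names are assumptions made for illustration, not the exact rules used on the company data. The word boundaries (`\b`) are one possible design choice: they prevent a plain substring such as "inc" from discarding a personal name like "vincent".

```python
import re

# Hypothetical patterns: organisation keywords, leading titles, non-letters
ORG_PATTERN = re.compile(r"\b(?:llc|ltd|inc|trust|trustee|estate)\b")
TITLE_PATTERN = re.compile(r"^(?:dr|mr|prof|adv)\b\.?\s*")
NON_LETTER = re.compile(r"[^a-z ]+")

def is_organisation(raw):
    """True if the record contains an organisation keyword as a whole word."""
    return ORG_PATTERN.search(raw.lower()) is not None

def clean_owner_name(raw):
    """Keep the first owner, drop a leading title and any non-letter characters."""
    name = raw.lower().split("&", 1)[0].strip()
    name = TITLE_PATTERN.sub("", name)
    name = NON_LETTER.sub(" ", name)
    return " ".join(name.split())  # collapse repeated spaces

records = [
    "Dr. Anil Patel & Mina Patel",
    "Vincent Prince",
    "Sunrise Holdings LLC",
    "Smith Family Trust",
]
people = [clean_owner_name(r) for r in records if not is_organisation(r)]
print(people)
```

Only the first two records survive the filter; the title and the second owner are removed from the first one before it is passed to the vectoriser.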



In [None]:
#NBC MODEL DEPLOYMENT 

#importing libraries
import re
import numpy as np
import pandas as pd
import joblib  # sklearn.externals.joblib was removed from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

#reading company data, data info, selecting relevant columns
df = pd.read_csv("main1.csv")
df.shape
df.head()
df =df[['prop_owner_name', 'IsSelectedforExport']]

# convert the columns into lower case and remove na
df['prop_owner_name'] = df['prop_owner_name'].str.lower()
df = df.dropna()
 
# removing unwanted data (organisation-like records); word boundaries keep
# personal names such as "vincent" that merely contain "inc"
df = df[~df.prop_owner_name.str.contains(r'\b(?:llc|ltd|estates?|trusts?|inc|trustees?)\b')]

# splitting joint owners on "&" and saving the first name in a new column
new=df['prop_owner_name'].str.split("&",n = 1, expand = True)
df['name1']= new[0]

# clean text: remove special characters, numbers and leading titles (dr, mr, prof, adv)
# the alternatives must be separated by "|", as in the training script
def cleaning(text):
    only_words = re.sub(r'([^A-Za-z ]+)|(^dr\.)|(^dr )|(^mr\.)|(^mr )|(^prof\.)|(^adv\. )',' ',text )
    return only_words
df['name1']=df['name1'].apply(cleaning)

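#loading model and vocabulary
model = joblib.load("nbc.sav")
# the vocabulary file has no header row: column 0 is the n-gram, column 1 its index
vocabulary_model = pd.read_csv('vocabulary_NBC.csv', header=None, names=['ngram', 'index'],
                               dtype={'ngram': str}, keep_default_na=False)

#converting vocabulary into a dictionary {n-gram: column index}
vocabulary_model_dict = {word: int(i) for word, i in zip(vocabulary_model['ngram'], vocabulary_model['index'])}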

# count vectoriser with the vocabulary the model was trained on
countvect = CountVectorizer(ngram_range=(1,2),vocabulary = vocabulary_model_dict) 
new_name_countvect = countvect.transform(df['name1'])
# predicting from the model
targets_pred = model.predict(new_name_countvect)
df['predicited_ethnicity'] = targets_pred
# saving predictions
df.to_csv('predicted_name_NBC.csv', encoding='utf-8', index=False)
