## Introduction to Prediction using Surnames Analysis

---
### Goal
---

Predict whether a name is of Russian origin or not.

In this iteration we are going to:
* build a unigram model (bag of characters)
* learn the weights for the Russian-language predictor
* implement multi-linear regression
* test predictions using test data
* compute accuracy, recall, and precision for Russian names.

------


In [79]:
import pandas as pd
from pandas import DataFrame
import numpy as np

In [67]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB # Multi Naive Bayes with discrete values
from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

---
### Let's perform some EDA

---

In [68]:
# read the csv file into data frame.
surname_csv = "data_set/surnames_dev.csv"
surname_test_csv = "data_set/surnames_test.csv"

surname_df = pd.read_csv(surname_csv, index_col = None)
surname_test = pd.read_csv(surname_test_csv, index_col = None)

In [69]:
# rename dev data columns.
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)
surname_test.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

In [70]:
surname_df.head()

Unnamed: 0,surname,nationality
0,Fakhoury,Arabic
1,Toma,Arabic
2,Koury,Arabic
3,Bata,Arabic
4,Samaha,Arabic


In [71]:
surname_test.head()

Unnamed: 0,surname,nationality
0,Moghadam,Arabic
1,Najjar,Arabic
2,Said,Arabic
3,Cham,Arabic
4,Tuma,Arabic


#### Features Exploration

In [72]:
features = surname_df["surname"] # features (x) needed to predict nationatlity
target = surname_df["nationality"] # what we are predicting (y)
test_target = surname_test["nationality"]

---
### Tokenize Data

---
Create a bag of characters (unigram model).

In [107]:
# vectorize features - unigrams only
# cv = CountVectorizer()
# X = cv.fit_transform(features)
# min_df=0.0, max_df=1.0

cv = CountVectorizer(analyzer='char', ngram_range=(1,1), min_df=0.0, max_df=1.0)
X = cv.fit_transform(surname_df["nationality"], surname_test["nationality"])

In [108]:
cv

CountVectorizer(analyzer='char', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=0.0,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [109]:
# alphabet
cv.get_feature_names()

['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'r',
 's',
 't',
 'u',
 'v',
 'z']

In [110]:
print(X)

  (0, 0)	2
  (0, 16)	1
  (0, 1)	1
  (0, 8)	1
  (0, 2)	1
  (1, 0)	2
  (1, 16)	1
  (1, 1)	1
  (1, 8)	1
  (1, 2)	1
  (2, 0)	2
  (2, 16)	1
  (2, 1)	1
  (2, 8)	1
  (2, 2)	1
  (3, 0)	2
  (3, 16)	1
  (3, 1)	1
  (3, 8)	1
  (3, 2)	1
  (4, 0)	2
  (4, 16)	1
  (4, 1)	1
  (4, 8)	1
  (4, 2)	1
  :	:
  (2999, 20)	1
  (3000, 0)	1
  (3000, 8)	1
  (3000, 13)	1
  (3000, 4)	3
  (3000, 17)	1
  (3000, 18)	1
  (3000, 12)	1
  (3000, 20)	1
  (3001, 0)	1
  (3001, 8)	1
  (3001, 13)	1
  (3001, 4)	3
  (3001, 17)	1
  (3001, 18)	1
  (3001, 12)	1
  (3001, 20)	1
  (3002, 0)	1
  (3002, 8)	1
  (3002, 13)	1
  (3002, 4)	3
  (3002, 17)	1
  (3002, 18)	1
  (3002, 12)	1
  (3002, 20)	1


In [111]:
X.toarray()

array([[2, 1, 1, ..., 0, 0, 0],
       [2, 1, 1, ..., 0, 0, 0],
       [2, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 1, 0],
       [1, 0, 0, ..., 0, 1, 0],
       [1, 0, 0, ..., 0, 1, 0]])

In [112]:
print(DataFrame(X.A, columns=cv.get_feature_names()).to_string())

      a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  r  s  t  u  v  z
0     2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
1     2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
2     2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
3     2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
4     2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
5     2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
6     2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
7     2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
8     2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
9     2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
10    2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
11    2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
12    2  1  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
13    

---
### Train/Test Data

---
To make the data a little more accurate in it's predictions, we are going to split the surnames into train (65%) and test (35%) datasets.

In [None]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, target, test_size=0.35, random_state = 32)

---
### Model Developement - Entire Model

---
Multinomial NB - entire model

In [None]:
# fit the model
surname_model = MultinomialNB()
surname_model.fit(x_train, y_train)

In [None]:
y_pred = surname_model.predict(x_test)

#### -Accuracy-
------
Note: Accuracy will be computed using sklearn.metrics's ```f1_score``` library.


------

In [None]:
# Accuracy using test data
surname_accuracy = f1_score(y_test, y_pred, average='micro')
print(f"f1 score: {surname_accuracy}")

__Observation 1__: Accuracy  is at 50%.
Possible Reasons: The amount of data used to train the model is too low or it's because there is only a small amount of data.

In [None]:
# Which nationalities is the model more accurate with
surname_null_accuracy = y_test.value_counts().head(5) / len(y_test)
print(surname_null_accuracy)

__Observation 2__: This model is more accurate with Russian names. Japanese surnames accuracy is at 5% compared to Russia's 48%. This could be due to Russia having the most names in this dataset!

#### -Precision and Recall-
------
Note: Precision and Recall will be computed using sklearn.metrics's ```precision_score``` and ```recall_score``` library.


------

In [None]:
surname_precision = 0.0
surname_recall = 0.0

In [None]:
surname_precision = precision_score(y_test, y_pred, average="micro")
surname_recall = recall_score(y_test, y_pred, average="micro")

print("Overall")
print(f"Precision: {surname_precision}")
print(f"Recall: {surname_recall}")

#### -Predictions-
Now we will run some test predictions.

In [None]:
# let's try predicting a Japanese name
pred_name1 = ["Showa"]
reshape_feature = cv.transform(pred_name1)
surname_model.predict(reshape_feature)

In [None]:
# Arabic surname name
pred_name2 = ["Bata"]
reshape_feature = cv.transform(pred_name2)
surname_model.predict(reshape_feature)

In [None]:
# Russian surname name
pred_name3 = ["Jugai"]
reshape_feature = cv.transform(pred_name3)
surname_model.predict(reshape_feature)

In [None]:
# multiple names (German, English, Japanese, Arabic)
pred_names = ["Samuel", "Drew", "Shunji", "Mustafa"]
reshape_feature = cv.transform(pred_names)
surname_model.predict(reshape_feature)

__Observation 3__: It thinks every name is Russian. This makes sense given Russian surname accuracy is almost 50% while the odds of predicting other nationalities are less than 16%.

---
### -Conclusion- Fix

Although the second model has the best accuracy, precision, and recall it is not practical. By training the dataset only on one value, it now assumes that all names are Japanese. While the test data is indeed all Japanese surnames, the model has been overfitted and so there are lots of false positives for names that have not been classified as Japanese. 

We will go with the first model because it is more realistic and practical.

---

In [None]:
print(f"Accuracy: {surname_accuracy}")
print(f"Precision: {surname_precision}")
print(f"Recall: {surname_recall}")

In [None]:
print(f"Nationality Accuracy: \n {surname_null_accuracy}")