## Introduction to Prediction using Surnames Analysis

---
### Hypothesis
---

To relate a surname to a nationality, look for the letter combinations and pronunciations that are most common in a specific language and that make it distinctive.
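
As a rough illustration of what "letter combinations" could mean in practice, the sketch below breaks a surname into overlapping character n-grams and counts them. The helper name `char_ngrams` and the choice of n = 2 are assumptions made for illustration; neither is used elsewhere in this notebook.

```python
from collections import Counter

def char_ngrams(name, n=2):
    """Return the overlapping character n-grams of a lowercased name."""
    name = name.lower()
    return [name[i:i + n] for i in range(len(name) - n + 1)]

# Sample surnames that are used elsewhere in this notebook
names = ["Onohara", "Fuwa", "Fakhoury"]

for name in names:
    print(name, char_ngrams(name))

# Count how often each bigram occurs across the sample
bigram_counts = Counter(g for name in names for g in char_ngrams(name))
print(bigram_counts.most_common(3))
```

Counts like these, gathered per nationality, are the kind of signal the hypothesis relies on.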

In this iteration we are going to:
* use Naive Bayes and logistic regression
* compare accuracy, recall, and precision for classifying Japanese names.

------


In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB # multinomial Naive Bayes for discrete (count) features
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer # tokenize text and build the vocabulary
from sklearn.model_selection import train_test_split

---
### Let's perform some EDA

---

In [3]:
# read the csv file into data frame.
surname_csv = "data_set/surnames-dev.csv"

surname_df = pd.read_csv(surname_csv, index_col = None)

surname_df.shape

(3003, 2)

In [4]:
# check columns
surname_df.columns

Index(['Unnamed: 0', 'Unnamed: 1'], dtype='object')

In [5]:
# a glimpse at the data
surname_df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1
0,Fakhoury,Arabic
1,Toma,Arabic
2,Koury,Arabic
3,Bata,Arabic
4,Samaha,Arabic


In [6]:
# rename the columns
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

In [7]:
surname_df.head()

Unnamed: 0,surname,nationality
0,Fakhoury,Arabic
1,Toma,Arabic
2,Koury,Arabic
3,Bata,Arabic
4,Samaha,Arabic


In [8]:
# types of nationalities
surname_df.nationality.unique()

array(['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French',
       'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean',
       'Polish', 'Portuguese', 'Russian', 'Scottish', 'Spanish',
       'Vietnamese'], dtype=object)

In [9]:
# there are 18 unique nationalities in the dataset
len(surname_df.nationality.unique())

18

In [10]:
# Number of names per nationality
surname_df.groupby("nationality")["surname"].size()

nationality
Arabic         300
Chinese         40
Czech           77
Dutch           44
English        550
French          41
German         108
Greek           30
Irish           34
Italian        106
Japanese       148
Korean          14
Polish          20
Portuguese      11
Russian       1411
Scottish        15
Spanish         44
Vietnamese      10
Name: surname, dtype: int64

---
Note that Japanese has 148 surnames in this dataset. The Japanese names are what we will be focusing on.

---

#### Features Exploration

In [11]:
features = surname_df["surname"] # features (x) needed to predict nationality
target = surname_df["nationality"] # what we are predicting (y)

---
### Train/Test Data

---
To estimate how well the model predicts surnames it has not seen, we are going to split the surnames into train (80%) and test (20%) datasets.

In [12]:
# vectorize features
vector_features = CountVectorizer()
X = vector_features.fit_transform(features)
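
With its default settings, `CountVectorizer` splits text into word tokens, so a one-word surname becomes a single feature and a surname missing from the fitted vocabulary transforms to an all-zero row. The sketch below contrasts that with a character n-gram analyzer, which is closer to the letter-combination hypothesis. The `analyzer="char"` and `ngram_range=(2, 3)` settings and the names `word_vec` and `char_vec` are assumptions chosen for illustration, not settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_names = ["Fakhoury", "Toma", "Koury"]

# Default: word-level tokens, so each surname is one vocabulary entry
word_vec = CountVectorizer()
word_vec.fit(train_names)
print(sorted(word_vec.vocabulary_))

# Assumed alternative: character bigrams and trigrams
char_vec = CountVectorizer(analyzer="char", ngram_range=(2, 3))
char_vec.fit(train_names)
print(sorted(char_vec.vocabulary_))

# An unseen surname has no word-level features,
# but it shares character n-grams with the training names
unseen = ["Khoury"]
word_nnz = word_vec.transform(unseen).nnz
char_nnz = char_vec.transform(unseen).nnz
print(word_nnz, char_nnz)
```

Whether character n-grams actually improve accuracy on this dataset would have to be checked by rerunning the cells that follow.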

In [13]:
# split the data to train the model
# random_state = 32 for reproducibility
x_train, x_test, y_train, y_test = train_test_split(X, target, train_size =0.8, test_size=0.2, random_state = 32)
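
Because the class sizes are very uneven (Russian has 1,411 surnames, Vietnamese has 10), a plain random split can leave a small class with few or no test rows. One option, sketched below on toy data, is the `stratify` argument of `train_test_split`, which keeps class proportions roughly equal in both splits. The toy labels and variable names are made up for illustration; the split used in this notebook does not stratify.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 40 "Russian" rows and 10 "Japanese" rows
names = [f"name{i}" for i in range(50)]
labels = ["Russian"] * 40 + ["Japanese"] * 10

# stratify=labels keeps the 4:1 class ratio in both train and test
names_train, names_test, y_train, y_test = train_test_split(
    names, labels, train_size=0.8, test_size=0.2,
    random_state=32, stratify=labels,
)

print(Counter(y_train))
print(Counter(y_test))
```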

---
### Model Development

---
As stated earlier, the plan is to use Naive Bayes and logistic regression. The cells below fit the Naive Bayes model.

In [14]:
# fit the model
surname_model = MultinomialNB()
surname_model.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
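
Only `MultinomialNB` is fit in the cell above, although logistic regression is also part of the plan. A minimal sketch of fitting both classifiers on the same features, using a handful of surnames from this notebook as toy data, might look like the following. The `max_iter=1000` setting and the `models` dictionary are assumptions for illustration; no logistic regression results are reported in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy training data (surnames and labels as they appear in this notebook)
names = ["Fakhoury", "Toma", "Koury", "Sarraf", "Onohara", "Fuwa"]
labels = ["Arabic", "Arabic", "Arabic", "Arabic", "Japanese", "Japanese"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(names)

models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

predictions = {}
for model_name, model in models.items():
    model.fit(X, labels)
    # Predict on the training rows only to show the shared fit/predict API
    predictions[model_name] = model.predict(X)
    print(model_name, list(predictions[model_name]))
```

Both estimators expose the same `fit`, `predict`, and `score` methods, so the accuracy cell that follows would work unchanged for either one.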

#### Accuracy

In [15]:
# Accuracy using test data
accuracy = surname_model.score(x_test, y_test)
print(f"Complete model accuracy: {accuracy}")

Complete model accuracy: 0.5241264559068219


__Observation__: This accuracy is awful.
Possible reason: the dataset is small (3,003 surnames), so the model has too little data to train on.
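
For context, the reported accuracy can be compared with a majority-class baseline, meaning a model that always answers with the most frequent nationality. The sketch below computes that baseline from the per-nationality counts listed earlier. It uses the full dataset rather than the 20% test split, so the comparison is approximate.

```python
# Per-nationality counts copied from the groupby output earlier in this notebook
counts = {
    "Arabic": 300, "Chinese": 40, "Czech": 77, "Dutch": 44,
    "English": 550, "French": 41, "German": 108, "Greek": 30,
    "Irish": 34, "Italian": 106, "Japanese": 148, "Korean": 14,
    "Polish": 20, "Portuguese": 11, "Russian": 1411, "Scottish": 15,
    "Spanish": 44, "Vietnamese": 10,
}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"Total surnames: {total}")
print(f"Majority class: {majority_class}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

On these counts the baseline is about 0.47, so the reported 0.524 is only a few points better than always guessing Russian.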

#### Predictions
Now we will run some test predictions.

In [16]:
# let's try predicting a Japanese name
pred_name1 = ["Showa"]
reshape_feature = vector_features.transform(pred_name1)
surname_model.predict(reshape_feature)

array(['Russian'], dtype='<U10')

In [17]:
# Chinese surname
pred_name2 = ["Xiao"]
reshape_feature = vector_features.transform(pred_name2)
surname_model.predict(reshape_feature)

array(['Russian'], dtype='<U10')

In [18]:
# Russian surname
pred_name3 = ["Jugai"]
reshape_feature = vector_features.transform(pred_name3)
surname_model.predict(reshape_feature)

array(['Russian'], dtype='<U10')

In [19]:
# multiple names (Chinese, Japanese, Japanese, Arabic)
pred_names = ["Li", "Onohara", "Fuwa", "Sarraf"]
reshape_feature = vector_features.transform(pred_names)
surname_model.predict(reshape_feature)

array(['Russian', 'Russian', 'Russian', 'Russian'], dtype='<U10')

__Observation__: As you can see, the model thinks that every name is Russian. But at least it got its own nationality right! This is probably because Russian has by far the most surnames (1,411) in the dataset.

This model is awful.
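
One plausible mechanism behind the all-Russian predictions: with word-level features, a surname that is not in the fitted vocabulary becomes an all-zero row, and `MultinomialNB` then has only the class priors to go on, so it returns the most frequent class. The toy data below is invented to show the effect; it is not the dataset used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented, imbalanced toy data: Russian is the majority class
names = ["Ivanov", "Petrov", "Sidorov", "Smirnov", "Onohara", "Sarraf"]
labels = ["Russian", "Russian", "Russian", "Russian", "Japanese", "Arabic"]

vectorizer = CountVectorizer()  # default word-level tokens
X = vectorizer.fit_transform(names)
model = MultinomialNB().fit(X, labels)

# "Fuwa" is not in the vocabulary, so its feature row is all zeros
unseen = vectorizer.transform(["Fuwa"])
print(unseen.nnz)

# With no feature evidence, the predicted probabilities equal the class priors
probabilities = model.predict_proba(unseen)[0]
print(dict(zip(model.classes_, probabilities.round(3))))
print(model.predict(unseen))
```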

----
## Results

---
### Accuracy, Precision, and Recall


In [None]:
import numpy as np

In [None]:
print(pred_surnames.columns) # predicted surnames
print(surname_df.columns) # ground truth surnames
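
`pred_surnames` is not defined in any cell shown here. One way such a frame could be built, assuming it should mirror the `surname` and `nationality` columns of `surname_df` row for row, is sketched below with a toy stand-in for the data. The pipeline and the toy rows are assumptions for illustration, not the construction actually used for this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for surname_df (ground truth)
surname_df = pd.DataFrame({
    "surname": ["Fakhoury", "Toma", "Koury", "Onohara", "Fuwa"],
    "nationality": ["Arabic", "Arabic", "Arabic", "Japanese", "Japanese"],
})

# Bundle vectorizer and classifier so raw surnames can be passed in directly
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(surname_df["surname"], surname_df["nationality"])

# Same rows and column names as surname_df, with predicted labels.
# Predicting on the training rows here only keeps the sketch short.
pred_surnames = pd.DataFrame({
    "surname": surname_df["surname"],
    "nationality": pipeline.predict(surname_df["surname"]),
})

print(pred_surnames.columns)
print(pred_surnames.head())
```

Keeping the row order identical to `surname_df` matters, because the metric cells that follow compare the two frames position by position.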

----
Let's focus on the classification of Japanese surnames for now.

----

In [None]:
# Japanese surnames only
pred_japanese = pred_surnames.loc[pred_surnames["nationality"] == "Japanese"]
gt_japanese = surname_df.loc[surname_df["nationality"] == "Japanese"] # ground truth

In [None]:
pred_japanese.shape

In [None]:
gt_japanese.shape

#### Accuracy
------
Note: Accuracy will be computed using the ```accuracy_score``` function from sklearn.metrics. We will find the columns shared by the ground truth (original surnames) and the predicted surnames, compute the accuracy for each shared column, and then average those accuracies.


------

In [None]:
# find an intersection between the two datasets and compute the accuracy score.
ground_truth = surname_df #gt_japanese
predictions = pred_surnames #pred_japanese

col_overlap = list(set(ground_truth.columns).intersection(predictions.columns))

In [None]:
# compute accuracy
avg_accuracy = 0.0
for col in col_overlap:
    accuracy = accuracy_score(ground_truth[col], predictions[col])
    print(f"Column: {col} \t Accuracy: {accuracy}")
    avg_accuracy += accuracy / len(col_overlap)
    
print(f"B1 Surname Predictions Accuracy: {avg_accuracy} \n")

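In [None]:
# accuracy on the rows whose ground truth is Japanese (assumes both frames share the same row order)
is_jp = ground_truth["nationality"] == "Japanese"
print("Japanese surnames accuracy: ", accuracy_score(ground_truth.loc[is_jp, "nationality"], predictions.loc[is_jp, "nationality"]))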

#### Precision and Recall
------
Note: Precision and recall will be computed using the ```precision_score``` and ```recall_score``` functions from sklearn.metrics.


------
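
For a single class such as Japanese, precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP, and FN count the true positives, false positives, and false negatives for that class. The sketch below works through the arithmetic without scikit-learn. The label lists and the helper name `precision_recall` are invented for illustration.

```python
def precision_recall(y_true, y_pred, positive):
    """Compute precision and recall for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when the class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels: three true Japanese surnames, one of them found
y_true = ["Japanese", "Japanese", "Japanese", "Russian", "Russian", "Chinese"]
y_pred = ["Japanese", "Russian", "Russian", "Russian", "Japanese", "Russian"]

jp_precision, jp_recall = precision_recall(y_true, y_pred, "Japanese")
print(f"Precision: {jp_precision:.3f}")
print(f"Recall: {jp_recall:.3f}")
```

Here one of the two "Japanese" predictions is correct (precision 1/2), and one of the three true Japanese surnames is recovered (recall 1/3).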

In [None]:
precision = 0.0
recall = 0.0

y_gt = ground_truth["nationality"] # ground truth for nationality
y_pred = predictions["nationality"] # predictions for nationality

In [None]:
precision = precision_score(y_gt, y_pred, average="macro")
recall = recall_score(y_gt, y_pred, average="macro")

print("Overall")
print(f"Precision: {precision}")
print(f"Recall: {recall}")

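In [None]:
# restrict the scores to the Japanese class
jp_precision = precision_score(y_gt, y_pred, labels=["Japanese"], average="macro")
jp_recall = recall_score(y_gt, y_pred, labels=["Japanese"], average="macro")
print("Japanese Surnames ")
print(f"Precision: {jp_precision}")
print(f"Recall: {jp_recall}")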