## Introduction to Prediction using Surnames Analysis

---
### Hypothesis
---

To relate a surname to a nationality, look for the letter combinations and pronunciations that are most common in a specific language and make it distinctive.

In this iteration we are going to:
* use naive bayes and logistic regression
* compare accuracy, recall, and precision for classifying Japanese names.
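Since the hypothesis is about letter combinations, one way to make it concrete is to count character n-grams instead of whole words. The snippet below is a minimal sketch rather than what this notebook runs: the two toy surnames appear in outputs further down, while the name `char_cv` and the `(2, 3)` n-gram range are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy surnames (one Japanese, one Russian); not the full dataset.
names = ["Ishida", "Abramov"]

# analyzer="char" counts character n-grams instead of whole-word tokens.
# ngram_range=(2, 3) keeps letter pairs and triples (an illustrative choice).
char_cv = CountVectorizer(analyzer="char", ngram_range=(2, 3))
X_char = char_cv.fit_transform(names)

# vocabulary_ maps each n-gram to its column index in X_char.
print(sorted(char_cv.vocabulary_))
```

With features like these, a suffix such as "ov" or a cluster such as "shi" becomes something a classifier can weigh, which is closer to the hypothesis than whole-name tokens.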

------


In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB # multinomial Naive Bayes for discrete counts
from sklearn.feature_extraction.text import CountVectorizer # tokenize text/build vocabulary
from sklearn.model_selection import train_test_split

---
### Let's perform some EDA

---

In [3]:
# read the csv file into data frame.
surname_csv = "data_set/surnames-dev.csv"

surname_df = pd.read_csv(surname_csv, index_col = None)

surname_df.shape

(3003, 2)

In [4]:
# check columns
surname_df.columns

Index(['Unnamed: 0', 'Unnamed: 1'], dtype='object')

In [5]:
# a glimpse at the data
surname_df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1
0,Fakhoury,Arabic
1,Toma,Arabic
2,Koury,Arabic
3,Bata,Arabic
4,Samaha,Arabic


In [6]:
# rename the columns
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

In [7]:
surname_df.head()

Unnamed: 0,surname,nationality
0,Fakhoury,Arabic
1,Toma,Arabic
2,Koury,Arabic
3,Bata,Arabic
4,Samaha,Arabic


In [8]:
# types of nationalities
surname_df.nationality.unique()

array(['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French',
       'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean',
       'Polish', 'Portuguese', 'Russian', 'Scottish', 'Spanish',
       'Vietnamese'], dtype=object)

In [9]:
# there are 18 unique nationalities in the dataset
len(surname_df.nationality.unique())

18

In [10]:
# Number of names per nationality
surname_df.groupby("nationality")["surname"].size()

nationality
Arabic         300
Chinese         40
Czech           77
Dutch           44
English        550
French          41
German         108
Greek           30
Irish           34
Italian        106
Japanese       148
Korean          14
Polish          20
Portuguese      11
Russian       1411
Scottish        15
Spanish         44
Vietnamese      10
Name: surname, dtype: int64
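The raw counts are easier to judge as shares of the whole dataset. The sketch below copies the counts from the output above into a Series (the names `counts` and `shares` are illustrative) and normalizes them; on the real data frame, `surname_df["nationality"].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.Series({
    "Arabic": 300, "Chinese": 40, "Czech": 77, "Dutch": 44, "English": 550,
    "French": 41, "German": 108, "Greek": 30, "Irish": 34, "Italian": 106,
    "Japanese": 148, "Korean": 14, "Polish": 20, "Portuguese": 11,
    "Russian": 1411, "Scottish": 15, "Spanish": 44, "Vietnamese": 10,
})

# Share of the whole dataset held by each nationality, largest first.
shares = (counts / counts.sum()).sort_values(ascending=False)
print(shares.head(3).round(3))
```

Russian names account for roughly 47% of the rows and Japanese names for just under 5%, so the classes are heavily imbalanced.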

#### Features Exploration

In [11]:
features = surname_df["surname"] # features (x) used to predict nationality
target = surname_df["nationality"] # what we are predicting (y)

---
### Train/Test Data

---
To estimate how well the model predicts names it has not seen, we are going to split the surnames into train (65%) and test (35%) datasets.

In [12]:
# vectorize features
cv = CountVectorizer()
X = cv.fit_transform(features)
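It helps to check what the default vectorizer does with a one-word input. With its default word analyzer, `CountVectorizer` turns each surname into a single token, so a surname missing from the fitted vocabulary transforms to an all-zero row. The snippet is a small sketch on three names from the preview above; `word_cv`, `seen`, and `unseen` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_names = ["Fakhoury", "Toma", "Koury"]

# Default settings: analyzer="word", lowercase=True.
word_cv = CountVectorizer()
word_cv.fit(train_names)

seen = word_cv.transform(["Toma"])     # surname present in the vocabulary
unseen = word_cv.transform(["Showa"])  # surname absent from the vocabulary

# Total feature count per name: one token for the seen name, none for the unseen one.
print(seen.sum(), unseen.sum())
```

A classifier given an all-zero row has nothing but class priors to go on, which matters for any test surname that never appeared in training.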

In [13]:
# get feature names (removed in scikit-learn 1.2; newer versions use cv.get_feature_names_out())
cv.get_feature_names()

['ababko',
 'abadi',
 'abakumtsev',
 'abalakin',
 'abalikhin',
 'abalmasoff',
 'abana',
 'abandonato',
 'abascal',
 'abasheev',
 'abashin',
 'abbatangelo',
 'abboud',
 'abdrakhmanoff',
 'abdrazakov',
 'abdulbasirov',
 'abdulgapuroff',
 'abdulladjanov',
 'abdulladzhanoff',
 'abdulrahmanov',
 'abdurahmanoff',
 'abdurakhmanov',
 'abel',
 'abelman',
 'abelmazov',
 'abeltsev',
 'aberquero',
 'abertasov',
 'abih',
 'abikh',
 'ablesimoff',
 'abletsov',
 'ableuhoff',
 'ableukhoff',
 'ablov',
 'ablyakimoff',
 'aboev',
 'aboimov',
 'abolin',
 'abragamson',
 'abrahams',
 'abrameitsev',
 'abramov',
 'abramovitch',
 'abramovsky',
 'abraroff',
 'abreu',
 'abrikosov',
 'absalyamov',
 'abuhoff',
 'abulhanoff',
 'abyshev',
 'abyzgiddin',
 'abzyaparoff',
 'acconci',
 'acconcio',
 'achterberg',
 'ackermann',
 'ackroyd',
 'acquati',
 'acton',
 'adabashian',
 'adamenko',
 'adamidis',
 'adamoff',
 'adamson',
 'adarchenko',
 'addams',
 'addis',
 'adelfinski',
 'adelkhanoff',
 'adelung',
 'ader',
 'adilov',
 

In [14]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, target, test_size=0.35, random_state = 32)
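With classes this uneven, a plain random split can leave rare nationalities with very few test rows. One option, shown here as a sketch on made-up labels rather than as part of this notebook, is the `stratify` argument of `train_test_split`, which keeps class proportions roughly equal in both parts. The toy names, labels, and the 0.25 test size are assumptions for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 12 "Russian" and 8 "Japanese" placeholder names.
labels = ["Russian"] * 12 + ["Japanese"] * 8
names = [f"name{i}" for i in range(len(labels))]

# stratify=labels asks for the same class mix in train and test.
n_train, n_test, l_train, l_test = train_test_split(
    names, labels, test_size=0.25, random_state=32, stratify=labels
)

print(Counter(l_train), Counter(l_test))
```

Note that stratification needs at least two rows per class, so very rare classes may need to be merged or dropped first.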

---
### Model Development - Entire Model

---
Multinomial NB - entire model

In [15]:
# fit the model
surname_model = MultinomialNB()
surname_model.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [16]:
y_pred = surname_model.predict(x_test)

#### -Accuracy-
------
Note: Accuracy will be computed using the ```accuracy_score``` function from sklearn.metrics.


------

In [17]:
# Accuracy using test data
accuracy = accuracy_score(y_test, y_pred)
print(f"Complete model accuracy: {accuracy}")

Complete model accuracy: 0.5095057034220533


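__Observation 1__: This accuracy is awful: the model gets only about half of the test names right.
Possible reasons: the dataset is small (3003 names) and heavily imbalanced, and the default CountVectorizer treats each surname as a single token, so a test surname that never appeared in training gives the model no features to work with.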

In [18]:
# nationalities in test data
print(y_test.value_counts())

Russian       507
English       183
Arabic         96
Japanese       56
German         42
Italian        37
Czech          27
Dutch          20
French         14
Polish         12
Spanish        11
Greek          10
Chinese        10
Irish          10
Scottish        5
Portuguese      5
Vietnamese      5
Korean          2
Name: nationality, dtype: int64


In [19]:
# Share of the test set held by the five most frequent nationalities (the top value is the null accuracy)
null_accuracy = y_test.value_counts().head(5) / len(y_test)
print(null_accuracy)

Russian     0.481939
English     0.173954
Arabic      0.091255
Japanese    0.053232
German      0.039924
Name: nationality, dtype: float64


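__Observation 2__: Russian names make up about 48% of the test set, so a model that always predicted Russian would already score about 48% accuracy. The 51% measured above is barely better than that baseline, and the imbalance leaves plenty of room for misidentifying the other nationalities.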
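The class shares above can also be turned into an explicit baseline. The sketch below uses scikit-learn's `DummyClassifier` with `strategy="most_frequent"` on made-up labels; the toy label counts and the names `baseline` and `null_acc` are assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels: half of the rows belong to the majority class.
y_toy = np.array(["Russian"] * 5 + ["English"] * 3 + ["Japanese"] * 2)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predicts the most frequent label seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
null_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(null_acc)
```

Any real model should be judged against this number rather than against zero.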

#### -Precision and Recall-
------
Note: Precision and Recall will be computed using the ```precision_score``` and ```recall_score``` functions from sklearn.metrics.


------

In [20]:
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")

print("Overall")
print(f"Precision: {precision}")
print(f"Recall: {recall}")

Overall
Precision: 0.08308895405669599
Recall: 0.07233796296296297
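Macro averaging weights all 18 nationalities equally, so it hides how any single class is doing. To look at Japanese names alone, the `labels` argument can restrict the score to one class. The snippet is a sketch on made-up predictions; it assumes scikit-learn 0.22 or newer for the `zero_division` argument, and `jp_precision` and `jp_recall` are illustrative names.

```python
from sklearn.metrics import precision_score, recall_score

# Toy ground truth and predictions; not results from this notebook.
y_true = ["Japanese", "Japanese", "Russian", "Russian", "Arabic"]
y_hat = ["Japanese", "Russian", "Russian", "Russian", "Russian"]

# labels=["Japanese"] restricts the macro average to that single class.
# zero_division=0 returns 0 instead of warning when a class is never predicted.
jp_precision = precision_score(
    y_true, y_hat, labels=["Japanese"], average="macro", zero_division=0
)
jp_recall = recall_score(
    y_true, y_hat, labels=["Japanese"], average="macro", zero_division=0
)
print(jp_precision, jp_recall)
```

Here one of the two Japanese names is found (recall 0.5) and the single "Japanese" prediction is correct (precision 1.0).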




#### -Predictions-
Now we will run some test predictions.

In [21]:
# let's try predicting a Japanese name
pred_name1 = ["Showa"]
reshape_feature = cv.transform(pred_name1)
surname_model.predict(reshape_feature)

array(['Russian'], dtype='<U10')

In [22]:
# Arabic surname
pred_name2 = ["Bata"]
reshape_feature = cv.transform(pred_name2)
surname_model.predict(reshape_feature)

array(['Arabic'], dtype='<U10')

In [23]:
# Russian surname
pred_name3 = ["Jugai"]
reshape_feature = cv.transform(pred_name3)
surname_model.predict(reshape_feature)

array(['Russian'], dtype='<U10')

In [24]:
# multiple names (German, English, Japanese, Arabic)
pred_names = ["Samuel", "Drew", "Shunji", "Mustafa"]
reshape_feature = cv.transform(pred_names)
surname_model.predict(reshape_feature)

array(['Russian', 'Russian', 'Russian', 'Arabic'], dtype='<U10')

__Observation 3__: The model predicts Russian for every non-Arabic name tried above, including both Japanese names. Russian names make up almost 50% of the test set, while no other nationality exceeds 18%.
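One possible direction, in line with the hypothesis about letter combinations, is to feed the same Naive Bayes model character n-grams so that unseen surnames still produce features. This is a sketch on a handful of names from this notebook's outputs, not a tuned or evaluated model; the pipeline name `char_nb` and the `(1, 3)` n-gram range are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set.
train_names = ["Ishida", "Takashi", "Soga", "Seo",
               "Abramov", "Ablov", "Adilov", "Aboev"]
train_labels = ["Japanese"] * 4 + ["Russian"] * 4

# Character n-grams (1 to 3 letters) instead of whole-surname tokens.
char_nb = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
)
char_nb.fit(train_names, train_labels)

# "Shunji" is not in the training names, yet it still shares n-grams with them.
vectorizer = char_nb.named_steps["countvectorizer"]
n_features_hit = vectorizer.transform(["Shunji"]).sum()
prediction = char_nb.predict(["Shunji"])[0]
print(n_features_hit, prediction)
```

Whether this improves accuracy on the real dataset would need to be measured with the same train/test split used above.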

---
### Model Development - Japanese Surnames Only

What if we train the model on only Japanese surnames?

---

#### -Features Exploration-

In [43]:
# get japanese names
japanese_surnames = surname_df.loc[surname_df["nationality"] == "Japanese"]
japanese_surnames.head()

Unnamed: 0,surname,nationality
1330,Takashi,Japanese
1331,Ishida,Japanese
1332,Soga,Japanese
1333,Mitsuharu,Japanese
1334,Seo,Japanese


In [26]:
features = japanese_surnames["surname"] # features (x) used to predict nationality
target = japanese_surnames["nationality"] # what we are predicting (y)

---
### Train/Test Data

---
To estimate how well the model predicts names it has not seen, we are going to split the surnames into train (70%) and test (30%) datasets.

In [27]:
# vectorize features
cv = CountVectorizer()
X = cv.fit_transform(features)

In [28]:
# get feature names (removed in scikit-learn 1.2; newer versions use cv.get_feature_names_out())
cv.get_feature_names()

['akimoto',
 'aoki',
 'araki',
 'arita',
 'asai',
 'asanuma',
 'asuhara',
 'ayabito',
 'daishi',
 'dan',
 'eda',
 'endo',
 'erizawa',
 'fukuoka',
 'fuwa',
 'hamada',
 'hatayama',
 'hattori',
 'higo',
 'higoshi',
 'higuchi',
 'hirase',
 'hirota',
 'honami',
 'ichisada',
 'iesada',
 'igarashi',
 'ijiri',
 'imoo',
 'ishida',
 'iwakura',
 'joshuya',
 'kahaya',
 'kajitani',
 'kamata',
 'kanesaka',
 'kawamura',
 'kawasie',
 'kazuyoshi',
 'kido',
 'kikugawa',
 'kimiyama',
 'kinashita',
 'kishi',
 'kiski',
 'komon',
 'konoe',
 'koshin',
 'kotoku',
 'kubota',
 'kuno',
 'kurkawa',
 'kurohiko',
 'kurosawa',
 'kurusu',
 'kwakami',
 'maeno',
 'makioka',
 'maruyama',
 'masuno',
 'matsuzawa',
 'mazuka',
 'mifune',
 'minabuchi',
 'mitsuharu',
 'mitsuya',
 'miyahara',
 'miyake',
 'miyoshi',
 'morioka',
 'murasaki',
 'mutsu',
 'nagata',
 'nakadai',
 'nakanishi',
 'nakanoi',
 'nakasato',
 'nakasawa',
 'nakayama',
 'nakazawa',
 'namiki',
 'nishihara',
 'nozara',
 'numajiri',
 'odaka',
 'ohmae',
 'ohmiya',

In [29]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state = 32)

In [30]:
# fit the model
jap_surname_model = MultinomialNB()
jap_surname_model.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [31]:
y_pred = jap_surname_model.predict(x_test)

#### -Accuracy-

In [32]:
# Accuracy using test data
accuracy = accuracy_score(y_test, y_pred)
print(f"Complete model accuracy: {accuracy}")

Complete model accuracy: 1.0


__Observation 1__: Whoa, that accuracy. It is 1.0 solely because I trained and tested the model using only Japanese names: with a single class, every prediction is trivially correct.

In [33]:
# Number of Japanese names in test data
print(y_test.value_counts())

Japanese    45
Name: nationality, dtype: int64


In [34]:
# number of Japanese names in training data
print(y_train.value_counts())

Japanese    103
Name: nationality, dtype: int64


In [35]:
# Share of the test set held by each nationality (only Japanese is present here)
null_accuracy = y_test.value_counts().head(5) / len(y_test)
print(null_accuracy)

Japanese    1.0
Name: nationality, dtype: float64


In [36]:
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")

print("Japanese Surnames ")
print(f"Precision: {precision}")
print(f"Recall: {recall}")

Japanese Surnames 
Precision: 1.0
Recall: 1.0


#### -Predictions-
Now we will run some test predictions.

In [37]:
# let's try predicting a Japanese name
pred_name1 = ["Showa"]
reshape_feature = cv.transform(pred_name1)
jap_surname_model.predict(reshape_feature)

array(['Japanese'], dtype='<U8')

In [38]:
# Arabic surname
pred_name2 = ["Bata"]
reshape_feature = cv.transform(pred_name2)
jap_surname_model.predict(reshape_feature)

array(['Japanese'], dtype='<U8')

In [39]:
# Russian surname
pred_name3 = ["Jugai"]
reshape_feature = cv.transform(pred_name3)
jap_surname_model.predict(reshape_feature)

array(['Japanese'], dtype='<U8')

In [40]:
# multiple names (German, English, Japanese, Arabic)
pred_names = ["Samuel", "Drew", "Shunji", "Mustafa"]
reshape_feature = cv.transform(pred_names)
jap_surname_model.predict(reshape_feature)

array(['Japanese', 'Japanese', 'Japanese', 'Japanese'], dtype='<U8')

__Observation 2__: The model labels every name as Japanese. This is because it has been trained on no other type of name, so "Japanese" is the only label it can output.
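A classifier needs negative examples to learn what is not Japanese. One way to keep the focus on Japanese names, sketched here and not run on the real dataset, is to relabel every other nationality as "Other" and train a binary model. The example uses logistic regression, which the introduction lists as a planned method; the toy names, the "Other" label, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy surnames and nationalities taken from this notebook's outputs.
names = ["Ishida", "Takashi", "Soga", "Seo", "Abramov", "Toma", "Koury", "Ablov"]
nationality = np.array(["Japanese"] * 4 + ["Russian", "Arabic", "Arabic", "Russian"])

# Binary target: "Japanese" versus everything else.
binary_target = np.where(nationality == "Japanese", "Japanese", "Other")

# Character n-grams so that unseen surnames still yield features.
bin_cv = CountVectorizer(analyzer="char", ngram_range=(1, 3))
X_bin = bin_cv.fit_transform(names)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_bin, binary_target)

pred = clf.predict(bin_cv.transform(["Shunji", "Mustafa"]))
print(pred)
```

Unlike the Japanese-only model, this setup can be wrong in both directions, so its precision and recall for the Japanese class become meaningful numbers.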