## Language Modeling using Surnames Analysis - Japanese Surnames

---
### Hypothesis
---

To relate a surname to a nationality, look for the letter combinations and pronunciations that are most common in, and distinctive of, a specific language.

In this iteration we are going to:
* use a multinomial naive Bayes classifier
* compare accuracy, precision, and recall for classifying Japanese surnames.
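
To make "letter combinations" concrete, here is a minimal sketch that counts overlapping character bigrams in a few Japanese surnames taken from the dataset. The helper `char_ngrams` and the three-name sample are assumptions for illustration only; they are not part of the notebook's pipeline.

```python
from collections import Counter


def char_ngrams(name, n=2):
    """Return the overlapping character n-grams of a lowercased name."""
    name = name.lower()
    return [name[i:i + n] for i in range(len(name) - n + 1)]


# Illustrative sample; these surnames appear in the Japanese part of the data.
japanese_sample = ["Takashi", "Ishida", "Mitsuharu"]

bigram_counts = Counter()
for surname in japanese_sample:
    bigram_counts.update(char_ngrams(surname, n=2))

print(char_ngrams("Ishida"))
print(bigram_counts.most_common(3))
```

Bigrams such as `sh` and `hi` recur across the sample, which is the kind of signal the hypothesis relies on.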

------


In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB # multinomial naive Bayes for discrete count features
from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

---
### Let's perform some EDA

---

In [3]:
# read the csv file into data frame.
surname_csv = "data_set/surnames-dev.csv"

surname_df = pd.read_csv(surname_csv, index_col = None)

surname_df.shape

(3003, 2)

In [4]:
# check columns
surname_df.columns

Index(['Unnamed: 0', 'Unnamed: 1'], dtype='object')

In [5]:
# a glimpse at the data
surname_df.head()

| | Unnamed: 0 | Unnamed: 1 |
|---|---|---|
| 0 | Fakhoury | Arabic |
| 1 | Toma | Arabic |
| 2 | Koury | Arabic |
| 3 | Bata | Arabic |
| 4 | Samaha | Arabic |


In [6]:
# rename the columns
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

In [7]:
surname_df.head()

| | surname | nationality |
|---|---|---|
| 0 | Fakhoury | Arabic |
| 1 | Toma | Arabic |
| 2 | Koury | Arabic |
| 3 | Bata | Arabic |
| 4 | Samaha | Arabic |


In [8]:
# types of nationalities
surname_df.nationality.unique()

array(['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French',
       'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean',
       'Polish', 'Portuguese', 'Russian', 'Scottish', 'Spanish',
       'Vietnamese'], dtype=object)

In [9]:
# there are 18 unique nationalities in the dataset
len(surname_df.nationality.unique())

18

In [10]:
# Number of names per nationality
surname_df.groupby("nationality")["surname"].size()

nationality
Arabic         300
Chinese         40
Czech           77
Dutch           44
English        550
French          41
German         108
Greek           30
Irish           34
Italian        106
Japanese       148
Korean          14
Polish          20
Portuguese      11
Russian       1411
Scottish        15
Spanish         44
Vietnamese      10
Name: surname, dtype: int64

#### Features Exploration

In [11]:
features = surname_df["surname"] # features (x) needed to predict nationality
target = surname_df["nationality"] # what we are predicting (y)

---
### Train/Test Data

---
To estimate how well the model predicts names it has not seen, we are going to split the surnames into train (65%) and test (35%) datasets.

In [12]:
# vectorize features
cv = CountVectorizer()
X = cv.fit_transform(features)
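
Note that `CountVectorizer()` with default settings tokenizes on word boundaries, so each surname becomes a single vocabulary entry rather than a bag of letter combinations. The sketch below contrasts that with character n-grams via `analyzer="char_wb"`. The n-gram range `(2, 3)` and the toy names are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training names (assumption: a tiny sample, not the full dataset).
names = ["Takashi", "Ishida", "Toma"]

# Default settings: word-level tokens, so each surname is one feature.
word_cv = CountVectorizer()
word_cv.fit(names)
word_vocab = sorted(word_cv.vocabulary_)

# Assumed alternative: character 2-grams and 3-grams inside word boundaries.
char_cv = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
char_cv.fit(names)

# A surname that was never seen during fitting.
unseen = ["Ishikawa"]
word_hits = word_cv.transform(unseen).sum()  # no word-level features match
char_hits = char_cv.transform(unseen).sum()  # shared n-grams such as "is", "sh"

print(word_vocab)
print(word_hits, char_hits)
```

With word-level features, an unseen surname maps to an all-zero vector; with character n-grams it still shares features with the training names, which is closer to the stated hypothesis.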

In [13]:
# list the vocabulary learned by the vectorizer
# cv.get_feature_names()  # use cv.get_feature_names_out() on scikit-learn >= 1.0

In [14]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, target, test_size=0.35, random_state = 32)
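
Because the nationalities are very unevenly represented (10 Vietnamese names versus 1411 Russian ones), a purely random split can leave a small class under-represented in one of the two sets. One possible variation, not used in this notebook, is the `stratify` argument of `train_test_split`. The toy labels below are assumptions chosen only to show the effect.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data with an imbalance similar in spirit to the surname dataset.
names = [f"name{i}" for i in range(20)]
labels = ["Russian"] * 14 + ["Japanese"] * 6

# stratify=labels keeps class proportions roughly equal in both splits.
n_train, n_test, l_train, l_test = train_test_split(
    names, labels, test_size=0.35, random_state=32, stratify=labels
)

train_counts = Counter(l_train)
test_counts = Counter(l_test)
print(train_counts)
print(test_counts)
```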

---
### Model Development - Entire Model

---
Multinomial NB - entire model

In [15]:
# fit the model
surname_model = MultinomialNB()
surname_model.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [16]:
y_pred = surname_model.predict(x_test)

#### -Accuracy-
------
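Note: Accuracy will be computed using the ```f1_score``` function from sklearn.metrics with ```average='micro'```. For single-label multiclass predictions, micro-averaged F1 is equal to plain accuracy.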
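
As a sanity check on why a micro-averaged score can stand in for accuracy, the sketch below pools true positives, false positives, and false negatives over all classes by hand. The label lists are made-up toy values, not the notebook's predictions.

```python
from collections import Counter

# Toy labels for illustration only.
y_true = ["Russian", "Russian", "Japanese", "Arabic", "English", "Japanese"]
y_pred = ["Russian", "Russian", "Russian", "Arabic", "Russian", "Japanese"]

tp, fp, fn = Counter(), Counter(), Counter()
for truth, guess in zip(y_true, y_pred):
    if truth == guess:
        tp[truth] += 1
    else:
        fp[guess] += 1  # the predicted class gains a false positive
        fn[truth] += 1  # the true class gains a false negative

# Micro-averaging pools the counts over all classes before dividing.
total_tp = sum(tp.values())
micro_precision = total_tp / (total_tp + sum(fp.values()))
micro_recall = total_tp / (total_tp + sum(fn.values()))
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro_precision, micro_recall, micro_f1, accuracy)
```

Every misclassification adds exactly one false positive and one false negative, so the pooled precision, recall, and F1 all reduce to correct predictions divided by total predictions.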


------

In [17]:
# Accuracy using test data
surname_accuracy = f1_score(y_test, y_pred, average='micro')
print(f"f1 score: {surname_accuracy}")

f1 score: 0.5095057034220533


__Observation 1__: Accuracy is at about 51%.
Possible reasons: the dataset is small (3,003 names) and heavily imbalanced, with Russian names making up almost half of it.

In [18]:
# Share of each nationality in the test set (null accuracy), top 5
surname_null_accuracy = y_test.value_counts().head(5) / len(y_test)
print(surname_null_accuracy)

Russian     0.481939
English     0.173954
Arabic      0.091255
Japanese    0.053232
German      0.039924
Name: nationality, dtype: float64


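__Observation 2__: These values are each nationality's share of the test set (the null accuracy), not the model's accuracy per nationality. Russian names make up 48% of the test set and Japanese names only 5%, so always guessing Russian would already score about 48%, close to the model's 51%. The model most likely favors Russian because Russia has the most names in this dataset!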
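
The `value_counts() / len(y_test)` ratios printed above are class shares in the test set. The largest share is what a classifier that always predicts the majority label would score, which makes it a useful baseline. A minimal sketch with made-up labels (the list below is an assumption, not the real test split):

```python
from collections import Counter

# Toy test labels for illustration only.
y_true = ["Russian"] * 5 + ["English"] * 2 + ["Arabic", "Japanese", "German"]

# Share of each class in the test labels.
class_share = {
    label: count / len(y_true) for label, count in Counter(y_true).items()
}
majority_label = max(class_share, key=class_share.get)

# Baseline: always predict the majority class.
baseline_pred = [majority_label] * len(y_true)
baseline_accuracy = sum(
    t == p for t, p in zip(y_true, baseline_pred)
) / len(y_true)

print(majority_label, baseline_accuracy)
```

A trained model is only informative to the extent that it beats this baseline.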

#### -Precision and Recall-
------
Note: Precision and recall will be computed using the ```precision_score``` and ```recall_score``` functions from sklearn.metrics, micro-averaged over all nationalities.


------

In [19]:
surname_precision = 0.0
surname_recall = 0.0

In [20]:
surname_precision = precision_score(y_test, y_pred, average="micro")
surname_recall = recall_score(y_test, y_pred, average="micro")

print("Overall")
print(f"Precision: {surname_precision}")
print(f"Recall: {surname_recall}")

Overall
Precision: 0.5095057034220533
Recall: 0.5095057034220533
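
The micro-averaged numbers above summarize all nationalities at once. To look at Japanese names specifically, one option is to pass `labels=["Japanese"]` with `average=None`, which returns one score per listed class. This is a sketch on made-up labels, not a result from the notebook's data.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels for illustration only.
y_true = ["Japanese", "Japanese", "Russian", "Russian", "Arabic", "Japanese"]
y_pred = ["Japanese", "Russian", "Russian", "Russian", "Arabic", "Russian"]

# average=None returns an array with one entry per label in `labels`.
jp_precision = precision_score(
    y_true, y_pred, labels=["Japanese"], average=None, zero_division=0
)[0]
jp_recall = recall_score(
    y_true, y_pred, labels=["Japanese"], average=None, zero_division=0
)[0]

print(f"Japanese precision: {jp_precision:.2f}")
print(f"Japanese recall: {jp_recall:.2f}")
```

In the toy data only one of three Japanese names is recovered, which a micro-average dominated by Russian names would hide.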


#### -Predictions-
Now we will run some test predictions.

In [21]:
# let's try predicting a Japanese name
pred_name1 = ["Showa"]
reshape_feature = cv.transform(pred_name1)
surname_model.predict(reshape_feature)

array(['Russian'], dtype='<U10')

In [22]:
# Arabic surname
pred_name2 = ["Bata"]
reshape_feature = cv.transform(pred_name2)
surname_model.predict(reshape_feature)

array(['Arabic'], dtype='<U10')

In [23]:
# Russian surname
pred_name3 = ["Jugai"]
reshape_feature = cv.transform(pred_name3)
surname_model.predict(reshape_feature)

array(['Russian'], dtype='<U10')

In [24]:
# multiple names (German, English, Japanese, Arabic)
pred_names = ["Samuel", "Drew", "Shunji", "Mustafa"]
reshape_feature = cv.transform(pred_names)
surname_model.predict(reshape_feature)

array(['Russian', 'Russian', 'Russian', 'Arabic'], dtype='<U10')

__Observation 3__: It thinks most names are Russian; only the Arabic names (Bata, Mustafa) are classified correctly. This makes sense given that Russian names make up almost 50% of the data while no other nationality accounts for more than about 17%.

---
### Model Development - Japanese Surnames Only

What if we train the model on only Japanese surnames?

---

#### -Features Exploration-

In [25]:
# get japanese names
japanese_surnames = surname_df.loc[surname_df["nationality"] == "Japanese"]
japanese_surnames.head()

| | surname | nationality |
|---|---|---|
| 1330 | Takashi | Japanese |
| 1331 | Ishida | Japanese |
| 1332 | Soga | Japanese |
| 1333 | Mitsuharu | Japanese |
| 1334 | Seo | Japanese |


In [26]:
features = japanese_surnames["surname"] # features (x) needed to predict nationality
target = japanese_surnames["nationality"] # what we are predicting (y)

---
### Train/Test Data

---
To check the model on names it was not trained on, we are again going to split the surnames into train (65%) and test (35%) datasets. With `test_size=0.35`, the 148 Japanese surnames are divided into 96 training and 52 test names.

In [27]:
# vectorize features
cv = CountVectorizer()
X = cv.fit_transform(features)

In [28]:
# get feature names
# cv.get_feature_names()  # use cv.get_feature_names_out() on scikit-learn >= 1.0

In [29]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, target, test_size=0.35, random_state = 32)

In [30]:
# fit the model
jap_surname_model = MultinomialNB()
jap_surname_model.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [31]:
y_pred = jap_surname_model.predict(x_test)

#### -Accuracy-

In [32]:
# Accuracy using test data
jap_accuracy = f1_score(y_test, y_pred, average='macro')
print(f"f1 score: {jap_accuracy}")

f1 score: 1.0


__Observation 1__: Whoa, a perfect score. This is solely because the model was trained and tested using only Japanese names: with a single class to choose from, every prediction is trivially correct.

In [33]:
# Number of Japanese names in test data
print(y_test.value_counts())

Japanese    52
Name: nationality, dtype: int64


In [34]:
# number of Japanese names in training data
print(y_train.value_counts())

Japanese    96
Name: nationality, dtype: int64


In [35]:
# Share of each nationality in the test set (null accuracy)
null_accuracy = y_test.value_counts().head(5) / len(y_test)
print(null_accuracy)

Japanese    1.0
Name: nationality, dtype: float64


In [36]:
jap_precision = 0.0
jap_recall = 0.0

In [37]:
jap_precision = precision_score(y_test, y_pred, average="micro")
jap_recall = recall_score(y_test, y_pred, average="micro")

print("Japanese Surnames ")
print(f"Precision: {jap_precision}")
print(f"Recall: {jap_recall}")

Japanese Surnames 
Precision: 1.0
Recall: 1.0


#### -Predictions-
Now we will run some test predictions.

In [38]:
# let's try predicting a Japanese name
pred_name1 = ["Showa"]
reshape_feature = cv.transform(pred_name1)
jap_surname_model.predict(reshape_feature)

array(['Japanese'], dtype='<U8')

In [39]:
# Arabic surname
pred_name2 = ["Bata"]
reshape_feature = cv.transform(pred_name2)
jap_surname_model.predict(reshape_feature)

array(['Japanese'], dtype='<U8')

In [40]:
# Russian surname
pred_name3 = ["Jugai"]
reshape_feature = cv.transform(pred_name3)
jap_surname_model.predict(reshape_feature)

array(['Japanese'], dtype='<U8')

In [41]:
# multiple names (German, English, Japanese, Arabic)
pred_names = ["Samuel", "Drew", "Shunji", "Mustafa"]
reshape_feature = cv.transform(pred_names)
jap_surname_model.predict(reshape_feature)

array(['Japanese', 'Japanese', 'Japanese', 'Japanese'], dtype='<U8')

__Observation 2__: The model assumes that all names are Japanese. This is because it has been trained on no other type of name; it is all it knows. The model is completely biased toward one label: it cannot reject a name, so its perfect scores say nothing about how it would handle non-Japanese surnames.

---
### -Conclusion-

Although the second model has the best accuracy, precision, and recall, it is not practical. Because it was trained on only one label, it now assumes that all names are Japanese. The test data is indeed all Japanese surnames, so the scores look perfect, but on mixed data every non-Japanese name would be a false positive.
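
To put a number on the false-positive problem, the sketch below scores an "always Japanese" predictor against a mixed set of labels. The labels are made up for illustration; the constant prediction mirrors what the second model returned for every name tried above.

```python
# Toy mixed-nationality labels (illustrative, not the notebook's data).
y_true = [
    "Japanese", "Arabic", "Russian", "German",
    "English", "Japanese", "Russian", "Russian",
]

# A model trained on a single class can only ever answer with that class.
y_pred = ["Japanese"] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(t == "Japanese" and p == "Japanese" for t, p in pairs)
fp = sum(t != "Japanese" and p == "Japanese" for t, p in pairs)
fn = sum(t == "Japanese" and p != "Japanese" for t, p in pairs)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Recall stays perfect because no Japanese name is ever missed, but precision drops to the share of Japanese names in the data.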

We will go with the first model because it is more realistic and practical.

---

In [42]:
print(f"Accuracy: {surname_accuracy}")
print(f"Precision: {surname_precision}")
print(f"Recall: {surname_recall}")

Accuracy: 0.5095057034220533
Precision: 0.5095057034220533
Recall: 0.5095057034220533


In [43]:
print(f"Test-set share per nationality (top 5): \n {surname_null_accuracy}")

Test-set share per nationality (top 5): 
 Russian     0.481939
English     0.173954
Arabic      0.091255
Japanese    0.053232
German      0.039924
Name: nationality, dtype: float64
