## Introduction to Nationality Prediction Using Surname Analysis

### Hypothesis
To relate a surname to a nationality, look for the letter combinations (and the pronunciations they suggest) that are most common in a specific language and make it distinctive.

Note: This is a naïve approach. We will continue to improve!

In this iteration we are going to:
* calculate the accuracy, precision, and recall of the first iteration of Japanese names.
* improve precision
* compare accuracy, recall, and precision for classifying Japanese names.

------


In [1]:
import pandas as pd

In [2]:
def is_arabic(word):
    if "afa" in word:
        return True
    elif "ba" in word:
        return True
    elif "zay" in word:
        return True
    elif "shin" in word:
        return True
    elif "ra" in word:
        return True
    elif "sud" in word:
        return True
    elif "alif" in word:
        return True
    elif "aa" in word:
        return True
    elif "aha" in word:
        return True
    elif "dh" in word:
        return True
    elif "tha" in word:
        return True
    elif "slah" in word:
        return True
    elif "att" in word:
        return True
    elif "kh" in word:
        return True
    elif "zog" in word:
        return True
    elif "tar" in word:
        return True
    elif "eeb" in word:
        return True
    elif "alb" in word:
        return True
    elif "haj" in word:
        return True
    else:
        return False
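
The long `if`/`elif` chain above can also be written as a table of fragments plus one loop. The sketch below is a minimal alternative, not the notebook's implementation: `ARABIC_PATTERNS`, `matches_any`, and `is_arabic_compact` are hypothetical names introduced here, and the fragments are copied from `is_arabic`. One deliberate difference is assumed: the input is lowercased first, so a capitalised surname such as `Bata` matches `ba`, which the case-sensitive original misses.

```python
# Hypothetical pattern table; the fragments are copied from is_arabic above.
ARABIC_PATTERNS = (
    "afa", "ba", "zay", "shin", "ra", "sud", "alif", "aa", "aha", "dh",
    "tha", "slah", "att", "kh", "zog", "tar", "eeb", "alb", "haj",
)


def matches_any(word, patterns):
    """Return True if any fragment in `patterns` occurs in `word`.

    Assumption: matching is case-insensitive, unlike the original chain.
    """
    lowered = word.lower()
    return any(fragment in lowered for fragment in patterns)


def is_arabic_compact(word):
    """Table-driven sketch of is_arabic (hypothetical helper)."""
    return matches_any(word, ARABIC_PATTERNS)


print(is_arabic_compact("Fakhoury"))  # matches the "kh" fragment
print(is_arabic_compact("Smith"))     # no fragment matches
```

Keeping the fragments in a tuple makes it easier to add, remove, or count rules per language without touching the control flow.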

In [3]:
def is_chinese(word):
    if len(word) < 4:
        return True
    elif "ang" in word:
        return True
    elif "iang" in word:
        return True
    elif "ing" in word:
        return True
    elif "eng" in word:
        return True
    elif "sen" in word:
        return True
    elif "ong" in word:
        return True
    elif "hao" in word:
        return True
    elif "aw" in word:
        return True
    elif "eah" in word:
        return True
    else:
        return False

In [4]:
def is_czech(word):
    if "mak" in word:
        return True
    elif "tak" in word:
        return True
    elif "erv" in word:
        return True
    elif "fk" in word:
        return True
    elif "som" in word:
        return True
    elif "iova" in word:
        return True
    elif "perer" in word:
        return True
    elif "nek" in word:
        return True
    elif "esi" in word:
        return True
    elif "ovic" in word:
        return True
    elif "nec" in word:
        return True
    elif "zka" in word:
        return True
    elif "abb" in word:
        return True
    elif "enk" in word:
        return True
    elif "chek" in word:
        return True
    elif "ss" in word:
        return True
    elif "erka" in word:
        return True
    elif "kovsky" in word:
        return True
    elif "gl" in word:
        return True
    elif "ovy" in word:
        return True
    elif "pp" in word:
        return True
    elif "jj" in word:
        return True
    elif "lik" in word:
        return True
    elif "sk" in word:
        return True
    # Explicit fallback so the function always returns a bool, not None.
    return False

In [5]:
def is_dutch(word):
    if "eter" in word:
        return True
    elif "aijer" in word:
        return True
    elif "oorn" in word:
        return True
    elif "hren" in word:
        return True
    elif "rik" in word:
        return True
    elif "ogt" in word:
        return True
    elif "eeu" in word:
        return True
    elif "man" in word:
        return True
    elif "aar" in word:
        return True
    elif "ijk" in word:
        return True
    elif "eghe" in word:
        return True
    elif "del" in word:
        return True
    elif "kk" in word:
        return True
    elif "seg" in word:
        return True
    elif "ning" in word:
        return True
    elif "sch" in word:
        return True
    elif "ken" in word:
        return True
    return False

In [6]:
def is_german(word):
    word = word.lower()
    keys = "äöü"
    for letter in word:
        if letter in keys:
            return True
        
    if "isch" in word:
        return True
    if "tsch" in word:
        return True
    if "aus" in word:
        return True
    return False

In [7]:
def is_italian(word):
    """Naive Italian Surname Identification"""
    return word.count("i") >= 3    

In [8]:
def is_japanese(word):
    """Naive Japanese Surname Identification"""
    if "naka" in word:
        return True
    elif "tsu" in word:
        return True
    elif "kawa" in word:
        return True
    else:
        return False

In [9]:
def is_spanish(word):
    """Naive Spanish Surname Identification"""
    word = word.lower()
    keys = "áéíóúüñ"
    for letter in word:
        if letter in keys:
            return True
    return False

In [10]:
def check_nationality(surname):
    """
    Naive Nationality Identification

    Checks the language rules in a fixed order (Arabic, Chinese, Czech,
    Dutch, German, Italian, Japanese, Spanish) and returns the first match.
    Returns "Unknown" when no rule matches.
    """
    
    if is_arabic(surname):
        return "Arabic"
    
    if is_chinese(surname):
        return "Chinese"
    
    if is_czech(surname):
        return "Czech"
    
    if is_dutch(surname):
        return "Dutch"
    
    if is_german(surname):
        return "German"
    
    if is_italian(surname):
        return "Italian"
    
    if is_japanese(surname):
        return "Japanese"
    
    if is_spanish(surname):
        return "Spanish"
    
    return "Unknown"
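
Because `check_nationality` returns on the first rule that fires, the order of the checks decides the label whenever a surname satisfies more than one rule. The sketch below illustrates this with two stand-in rules; `is_arabic_stub`, `is_japanese_stub`, `RULES`, `first_match`, and `all_matches` are hypothetical names, and each stub keeps only one fragment from the corresponding function above.

```python
def is_arabic_stub(word):
    """Stand-in using one fragment ("ra") from is_arabic."""
    return "ra" in word


def is_japanese_stub(word):
    """Stand-in using one fragment ("tsu") from is_japanese."""
    return "tsu" in word


# Rules are tried in list order, mirroring the order in check_nationality.
RULES = [("Arabic", is_arabic_stub), ("Japanese", is_japanese_stub)]


def first_match(surname, rules):
    """Return the label of the first rule that matches, else "Unknown"."""
    for label, rule in rules:
        if rule(surname):
            return label
    return "Unknown"


def all_matches(surname, rules):
    """Return every label whose rule matches, to expose overlaps."""
    return [label for label, rule in rules if rule(surname)]


print(first_match("tsukahara", RULES))   # Arabic
print(all_matches("tsukahara", RULES))   # ['Arabic', 'Japanese']
```

Listing all matches for each surname is one way to find which rules collide before deciding how to reorder or tighten them.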

In [11]:
# Read the CSV file into a DataFrame and name its two columns.
def read_file(file_name):
    df = pd.read_csv(file_name, index_col = None)
    df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)
    print("Preview of original file:")
    print(df.head())
    
    return df

In [12]:
# Keep only the first column (the surnames).
def modify_file(df):
    surname_list = pd.DataFrame(df[df.columns[0]])
    surname_list.rename(columns = {'Unnamed: 0':'surname'}, inplace = True)
    
    return surname_list

In [13]:
# retrieve nationality for each name.
def insert_nationality_pred(surname_list):
    nationality = []
    col = "surname"
    for name, row in surname_list.iterrows():
        nationality.append(check_nationality(row[col]))

    surname_list.insert(1, "nationality", nationality, True)
    
    return surname_list
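
The row-by-row loop above can also be written with `Series.apply`. The following is a minimal sketch under stated assumptions: `label_stub` is a hypothetical stand-in for `check_nationality` so that the example runs on its own, and `add_predictions` is a name introduced here, not part of the notebook.

```python
import pandas as pd


def label_stub(surname):
    """Hypothetical stand-in for check_nationality."""
    return "Japanese" if "kawa" in surname.lower() else "Unknown"


def add_predictions(surnames, classify):
    """Return a copy of `surnames` with a 'nationality' prediction column."""
    out = surnames.copy()
    # Apply the classifier to every value in the 'surname' column.
    out["nationality"] = out["surname"].apply(classify)
    return out


demo = pd.DataFrame({"surname": ["Kawasaki", "Smith"]})
labelled = add_predictions(demo, label_stub)
print(labelled)
```

Returning a copy leaves the input frame unchanged, so the cell can be re-run without side effects on `demo`.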

In [14]:
# files to be predicted
surname_csv = "data_set/surnames-dev.csv"

surname_df = read_file(surname_csv)
surname_list = modify_file(surname_df)
pred_surnames = insert_nationality_pred(surname_list)

Preview of original file:
    surname nationality
0  Fakhoury      Arabic
1      Toma      Arabic
2     Koury      Arabic
3      Bata      Arabic
4    Samaha      Arabic


In [15]:
musician_csv = "data_set/musician_surnames.csv"
musician_df = read_file(musician_csv)
surname_list = modify_file(musician_df)
pred_mus_surnames = insert_nationality_pred(surname_list)

Preview of original file:
      surname nationality
0       Belén     Spanish
1  Cunningham    Scottish
2          Ji      Korean
3  Grönemeyer      German
4        Lévy      French


In [16]:
# Output the results
pred_surnames.to_csv("results/surname_results.csv", index=False)
pred_mus_surnames.to_csv("results/musician_results.csv", index=False)

## Results
### Accuracy, Precision, and Recall of Iteration I


In [32]:
from sklearn.metrics import accuracy_score
import numpy as np

In [18]:
print(pred_surnames.columns) # predicted surnames
print(surname_df.columns) # ground truth surnames

Index(['surname', 'nationality'], dtype='object')
Index(['surname', 'nationality'], dtype='object')


#### Accuracy
------
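Note: Accuracy is computed with the ```accuracy_score``` function from sklearn.metrics. The code scores every column shared by the ground truth (original surnames) and the predictions, then averages the per-column scores. Because the `surname` column is identical in both data frames, it always scores 1.0, so the averaged figure overstates performance; the accuracy of the `nationality` column is the meaningful measure of the classifier.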


------

In [36]:
# Find the columns shared by both data frames and compute the accuracy score for each.
ground_truth = surname_df
col_overlap = list(set(ground_truth.columns).intersection(pred_surnames.columns))

# compute accuracy
avg_accuracy = 0.0
for col in col_overlap:
    accuracy = accuracy_score(ground_truth[col], pred_surnames[col])
    print(f"Column: {col} \t Accuracy: {accuracy}")
    avg_accuracy += accuracy / len(col_overlap)
    
print(f"B1 Surname Predictions Accuracy: {avg_accuracy} \n")

Column: surname 	 Accuracy: 1.0
Column: nationality 	 Accuracy: 0.05427905427905428
B1 Surname Predictions Accuracy: 0.5271395271395272 
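
To see why averaging over both columns inflates the score, the sketch below computes accuracy from its definition (correct predictions divided by total predictions) on toy data. The lists and the `accuracy` helper are illustrative assumptions, not values or code from the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the ground truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy data (made up for illustration).
surnames = ["Toma", "Bata", "Belén", "Kraus"]
truth = ["Arabic", "Arabic", "Spanish", "German"]
preds = ["Arabic", "Chinese", "Unknown", "German"]

nationality_accuracy = accuracy(truth, preds)    # 2 of 4 correct
surname_accuracy = accuracy(surnames, surnames)  # identical columns
averaged = (nationality_accuracy + surname_accuracy) / 2

print(nationality_accuracy, surname_accuracy, averaged)  # 0.5 1.0 0.75
```

The averaged value is pulled halfway toward 1.0 by a column that carries no information about the classifier, so only the `nationality` score should be reported.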



#### Precision and Recall
------
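Note: Precision and recall are computed with the ```precision_score``` and ```recall_score``` functions from sklearn.metrics, using macro averaging: the score is computed for each nationality separately and the unweighted mean over all nationalities is reported.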


------

In [37]:
from sklearn.metrics import precision_score, recall_score

In [38]:
precision = 0.0
recall = 0.0

y_gt = ground_truth["nationality"] # ground truth for nationality
y_pred = pred_surnames["nationality"] # predictions for nationality

In [39]:
precision = precision_score(y_gt, y_pred, average="macro")
recall = recall_score(y_gt, y_pred, average="macro")

print(f"Precision: {precision}")
print(f"Recall: {recall}")

Precision: 0.07993759550310502
Recall: 0.10168449378975694
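
Macro-averaged scores hide how individual classes behave. The sketch below computes precision and recall for a single label (for example Japanese) and the macro average from their definitions, on toy data. The helper names and lists are hypothetical, and undefined ratios are treated as 0 as an assumption of this sketch. The set of classes is the union of true and predicted labels, so a label such as "Unknown" that never appears in the ground truth still counts as a class and lowers the macro average.

```python
def precision_recall_for(y_true, y_pred, label):
    """Precision and recall for one class; undefined ratios count as 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [precision_recall_for(y_true, y_pred, lab) for lab in labels]
    n = len(labels)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n


# Toy data (made up for illustration).
truth = ["Japanese", "Japanese", "Spanish", "Arabic"]
preds = ["Japanese", "Arabic", "Unknown", "Arabic"]

print(precision_recall_for(truth, preds, "Japanese"))  # (1.0, 0.5)
print(macro_precision_recall(truth, preds))            # (0.375, 0.375)
```

Reporting the per-class pair for Japanese alongside the macro figures makes it possible to track whether a change to `is_japanese` improves precision without being masked by the other languages.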
