## Introduction to Sequence Modeling - Russian vs English Surnames

---
### Goal
---

Develop a classifier for Russian vs English surnames.

In this iteration we are going to:
* Compute bigram frequencies for English names.
* Compute bigram frequencies for Russian names.
* Develop a bag of bigrams model for distinguishing English and Russian names.
* Implement Good-Turing discounting for smoothing.
* Test performance of model using English data.

------


In [1]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import re

In [2]:
import nltk
from nltk import ngrams
import collections
from collections import defaultdict, Counter

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.corpus import brown # corpus of english words

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB # multinomial Naive Bayes for discrete count features

from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer # tokenizes text and normalizes

---
### Let's perform some EDA

---

In [4]:
# read the csv file into data frame.
surname_csv = "data_set/russian_and_english_dev.csv"
surname_df = pd.read_csv(surname_csv, index_col = None, encoding="UTF-8")

In [5]:
# rename dev data columns.
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

#### Features Exploration

In [6]:
surname_df = surname_df.dropna()

In [7]:
# Retrieve English names only
english_df = surname_df.loc[surname_df["nationality"] == "English"]
english_df = english_df[["surname"]]
english_names = english_df.apply(lambda x: x.astype(str).str.lower())
english_names.head()

       surname
940  fairhurst
941  wateridge
942     nemeth
943    moroney
944    goodall


In [8]:
# save all english names in txt file
english_names.to_csv("data_set/corpus/english_names.txt", sep='\t', index=False, header=False)

In [9]:
# Russian names only
russian_df = surname_df.loc[surname_df["nationality"] == "Russian"]
russian_df = russian_df[["surname"]]
russian_names = russian_df.apply(lambda x: x.astype(str).str.lower())
russian_names.head()

        surname
0     mokrousov
1         nurov
2      judovich
3  mikhailjants
4    jandarbiev


In [10]:
russian_names.to_csv("data_set/corpus/russian_names.txt", sep='\t', index=False, header=False)

-------
### Calculate Frequencies and Probabilities

-----

In [11]:
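# generate character bigrams and their frequencies
def generate_bigrams(names, n=2):
    # iterating a DataFrame yields column labels, so take its single column of names
    if isinstance(names, DataFrame):
        names = names.iloc[:, 0]
    n_gram = collections.Counter()
    for c in names:
        # slide a window of width n; stop early so every slice has exactly n characters
        n_gram.update(c[idx : idx + n] for idx in range(len(c) - n + 1))

    return n_gram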
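To make the counting step concrete, here is a minimal sketch of character bigram extraction on two surnames taken from the previews above. The helper name `char_ngrams` is an assumption for illustration and is not part of the notebook.

```python
from collections import Counter

def char_ngrams(name, n=2):
    """Return every contiguous n-character slice of name (hypothetical helper)."""
    return [name[i:i + n] for i in range(len(name) - n + 1)]

# two surnames from the previews above
sample_names = ["nurov", "moroney"]

counts = Counter()
for name in sample_names:
    counts.update(char_ngrams(name))

print(char_ngrams("nurov"))  # ['nu', 'ur', 'ro', 'ov']
print(counts["ro"])          # 2: once in "nurov", once in "moroney"
```

A name shorter than `n` characters contributes no n-grams, because the range is empty.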

In [12]:
# print n-grams and their frequencies in descending order of frequency
def freq_sorted(n_gram):
    for key, value in sorted(n_gram.items(), key=lambda x: x[1], reverse=True):
        print(key, value)

In [13]:
# retrieve english bigrams
eng_gram = generate_bigrams(english_names)

In [14]:
# sort frequencies
# freq_sorted(eng_gram)

In [15]:
rus_gram = generate_bigrams(russian_names)

In [16]:
n_grams = generate_bigrams(surname_df['surname'])

In [17]:
# freq_sorted(rus_gram)

-------

### Feature Selection

In [18]:
features = surname_df['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x)) # features (x) needed to predict nationality
target = surname_df['nationality'] # what we are predicting (y)

------
## Naive Bayes

------

In [19]:
cv = CountVectorizer(lowercase=True, analyzer='char', ngram_range=(1,3), strip_accents="ascii", min_df=0.09, max_df=1.0)
X = cv.fit_transform(features)
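A fractional `min_df` keeps only n-grams that appear in at least that share of the documents, which is why the vocabulary ends up small. The sketch below approximates that filter in plain Python on a toy list of surnames. It is not scikit-learn's implementation, and `char_ngram_set` and `build_vocabulary` are hypothetical helper names.

```python
from collections import Counter

def char_ngram_set(text, n_min=1, n_max=3):
    """Distinct character n-grams of text for n_min <= n <= n_max."""
    text = text.lower()
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}

def build_vocabulary(docs, min_df):
    """Keep n-grams whose document frequency is at least min_df * len(docs)."""
    doc_freq = Counter()
    for doc in docs:
        # a set, so each n-gram is counted once per document
        doc_freq.update(char_ngram_set(doc))
    threshold = min_df * len(docs)
    return sorted(g for g, df in doc_freq.items() if df >= threshold)

# toy input; the threshold is 0.75 * 4 = 3 documents
toy = ["nurov", "petrov", "smith", "moroney"]
vocab = build_vocabulary(toy, min_df=0.75)
print(vocab)  # ['o', 'r', 'ro']
```

Only `o`, `r`, and `ro` occur in three of the four toy names, so every other n-gram is dropped.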

In [20]:
target.shape

(1305,)

In [21]:
X.shape

(1305, 28)

### Train/Test Data

To estimate how well the model generalizes to surnames it has not seen, we split the data into train (80%) and test (20%) sets.
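Conceptually, the split shuffles the rows and sets one fifth of them aside. Here is a minimal pure-Python sketch, assuming a hypothetical helper `split_train_test`; `train_test_split` itself works directly on arrays and sparse matrices and offers more options.

```python
import random

def split_train_test(items, test_size=0.2, seed=32):
    """Shuffle a copy of items and hold out a test_size fraction (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # first n_test shuffled items form the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(1305))  # one index per surname in the data set
train, test = split_train_test(rows)
print(len(train), len(test))  # 1044 261
```

Fixing the seed makes the split reproducible, which is the role `random_state` plays in the notebook.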

In [22]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [23]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, target, test_size=0.2, random_state = 32)

## Model

In [24]:
from sklearn.naive_bayes import GaussianNB

In [25]:
# fit the model
surname_model = GaussianNB()
surname_model.fit(x_train.toarray(), y_train)

GaussianNB()

In [26]:
y_pred = surname_model.predict(x_test.toarray())

### Summary

In [27]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

In [28]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     English       0.62      0.98      0.76        86
     Russian       0.98      0.70      0.82       175

    accuracy                           0.79       261
   macro avg       0.80      0.84      0.79       261
weighted avg       0.86      0.79      0.80       261



In [29]:
print(confusion_matrix(y_test, y_pred))

[[ 84   2]
 [ 52 123]]


In [30]:
print(f"f1_score: {f1_score(y_test, y_pred, average='micro')} ")

f1_score: 0.7931034482758621 


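__Observe__: Accuracy is about 79% (207 of 261 test surnames); the micro-averaged F1 score equals accuracy here. Most errors are Russian surnames labeled English (52 of 175), which is why English precision is only 0.62.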
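The figures in the classification report can be re-derived from the confusion matrix by hand. The sketch below does so in plain Python, assuming the usual scikit-learn layout: rows are true labels, columns are predicted labels, and labels are in sorted order.

```python
# rows are true labels, columns are predicted labels, ordered English, Russian
matrix = [[84, 2],
          [52, 123]]
labels = ["English", "Russian"]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

report = {}
for i, label in enumerate(labels):
    true_pos = matrix[i][i]
    predicted = sum(row[i] for row in matrix)  # column sum: everything predicted as label
    actual = sum(matrix[i])                    # row sum: everything truly label
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.4f}")  # accuracy=0.7931
```

The low English precision comes from the 52 Russian surnames in the English column: 84 / (84 + 52) is roughly 0.62.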

In [31]:
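def good_turing_smoothing(n_gram):
    # N_c: number of distinct n-grams that occur exactly c times
    count_of_counts = Counter(n_gram.values())
    smoothing = {}

    for k, c in n_gram.items():
        n_c, n_c_next = count_of_counts[c], count_of_counts[c + 1]
        # adjusted count c* = (c + 1) * N_{c+1} / N_c;
        # keep the raw count when no n-gram was seen c + 1 times (N_{c+1} = 0)
        smoothing[k] = (c + 1) * n_c_next / n_c if n_c_next else c

    return smoothing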

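__Note:__ The smoothing method we will use is Good-Turing discounting. It reserves probability mass for bigrams that have not yet been observed by discounting the counts of those that have.

Equation: $c^* = (c + 1)\,\frac{N_{c+1}}{N_c}$, where $c$ is a bigram's observed count and $N_c$ is the number of distinct bigrams observed exactly $c$ times.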
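A small worked example may help with the formula. The sketch below applies it to made-up bigram counts. The helper name `good_turing_counts` is hypothetical, and falling back to the raw count when $N_{c+1} = 0$ is an assumption of this sketch, not part of the formula.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Map each observed count c to c* = (c + 1) * N_{c+1} / N_c."""
    n = Counter(ngram_counts.values())  # n[c] = number of n-grams seen exactly c times
    adjusted = {}
    for c in sorted(n):
        if n[c + 1] > 0:
            adjusted[c] = (c + 1) * n[c + 1] / n[c]
        else:
            adjusted[c] = float(c)  # assumption: keep c when N_{c+1} is 0
    return adjusted

# made-up counts: three bigrams seen once, two seen twice, one seen three times
toy = Counter({"ov": 3, "ev": 2, "in": 2, "th": 1, "sh": 1, "ng": 1})

adjusted = good_turing_counts(toy)
# probability mass reserved for unseen bigrams: N_1 / (total bigram tokens)
unseen_mass = Counter(toy.values())[1] / sum(toy.values())

print(adjusted[1])  # (1 + 1) * 2 / 3
print(adjusted[2])  # (2 + 1) * 1 / 2 = 1.5
print(unseen_mass)  # 3 / 10
```

Bigrams seen twice are discounted from 2 to 1.5, and the mass removed this way is what gets reassigned to bigrams that never appeared.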

In [32]:
# english metaparameters
eng_meta = good_turing_smoothing(eng_gram)

In [33]:
# russian metaparameters
rus_meta = good_turing_smoothing(rus_gram)
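One way smoothed bigram counts could feed a bag-of-bigrams classifier is to score a name under each language and pick the higher score. The sketch below is a minimal illustration that swaps in add-one (Laplace) smoothing for Good-Turing to stay short. The helpers `bigrams`, `train`, `log_score`, and `classify` are hypothetical and not part of the notebook, and the training names are the ones shown in the previews above.

```python
import math
from collections import Counter

def bigrams(name):
    """Character bigrams of name."""
    return [name[i:i + 2] for i in range(len(name) - 1)]

def train(names):
    """Count bigrams over a list of names."""
    counts = Counter()
    for name in names:
        counts.update(bigrams(name))
    return counts

def log_score(name, counts, vocab_size):
    """Sum of add-one-smoothed log probabilities of the name's bigrams."""
    total = sum(counts.values())
    return sum(math.log((counts[bg] + 1) / (total + vocab_size))
               for bg in bigrams(name))

def classify(name, models):
    """Return the language whose bigram model gives name the highest score."""
    vocab = set().union(*models.values())  # all bigrams seen in any language
    return max(models, key=lambda lang: log_score(name, models[lang], len(vocab)))

models = {
    "English": train(["fairhurst", "wateridge", "moroney", "goodall"]),
    "Russian": train(["mokrousov", "nurov", "judovich", "jandarbiev"]),
}
print(classify("petrov", models))   # Russian
print(classify("woodall", models))  # English
```

With only four names per language the model is a toy, but it shows why smoothing matters: without it, a single unseen bigram would give a name zero probability under that language.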

### Using English Data to Test Performance
The data we will be using is from the United States Naval Academy. It is 7 KB and has a good amount of English first names.

In [34]:
# List of First Names
with open("data_set/corpus/name_catalog.txt", "r") as names:
    # strip trailing whitespace and lowercase each name
    test_names = [x.rstrip().lower() for x in names]

In [35]:
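# open the results file once instead of reopening it for every name
with open("results/test_pred.txt", "a") as f:
    for name in test_names:
        reshape_feature = cv.transform([name])
        # reshape(-1, 1) turns the single 1 x 28 feature row into 28 one-value rows,
        # so predict returns 28 labels per name rather than one
        y_pred = surname_model.predict(reshape_feature.toarray().reshape(-1, 1))
        f.write(f"{[name]}: {y_pred} \n\n")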

## Results
All of the output can be found in results/test_pred.txt. However, here is a snapshot:

__michael__: 

['Russian' 'English' 'English' 'Russian' 'Russian' 'English' 'Russian'
 'English' 'English' 'English' 'Russian' 'Russian' 'English' 'English'
 'English' 'Russian' 'Russian' 'English' 'English' 'English' 'English'
 'English' 'English' 'English' 'English' 'English' 'English' 'English']
 
__christopher__: 

['English' 'English' 'English' 'Russian' 'Russian' 'English' 'Russian'
 'Russian' 'English' 'English' 'Russian' 'Russian' 'English' 'English'
 'English' 'English' 'English' 'English' 'Russian' 'English' 'Russian'
 'Russian' 'Russian' 'Russian' 'English' 'English' 'English' 'English'] 
 
__jessica__: 

['Russian' 'English' 'English' 'Russian' 'English' 'English' 'Russian'
 'English' 'English' 'English' 'English' 'Russian' 'English' 'English'
 'English' 'English' 'English' 'English' 'English' 'English' 'English'
 'English' 'Russian' 'English' 'English' 'English' 'English' 'English'] 


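__Observe__: Although all names in the test data are English, some of the 28 labels returned for each name are Russian. The number 28 is the size of the `CountVectorizer` vocabulary (`X.shape[1]`): because of `reshape(-1, 1)`, each label corresponds to a single vocabulary n-gram count rather than to the name as a whole, so these are not per-trigram or per-name predictions. N-grams that occur more frequently in Russian than in English surnames likely account for some of the Russian labels. Training the model on more English data and gathering more n-grams could reduce this misidentification.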