## Introduction to Sequence Modeling - Russian and English names

---
### Goal
---

Develop and classifier for Russian vs English surnames.

In this iteration we are going to:
* Compute bigram frequencies for English names.
* Compute bigram frequencies for Russian names.
* Develop a bag of bigrams model for distinguishing English and Russian names.
* Implement Good Turing Discounting Model Smoothing
* Test performance of model using English data.

------


In [352]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import re

import collections
from collections import defaultdict, Counter

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer # tokenizes text and normalizes

---
### Let's perform some EDA

---

In [353]:
# read the csv file into data frame.
surname_csv = "data_set/russian_and_english_dev.csv"
surname_df = pd.read_csv(surname_csv, index_col = None, encoding="UTF-8")

In [354]:
# rename dev data columns.
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

In [355]:
surname_df = surname_df.dropna()

---
### Generate Bigrams
Calculate n_grams and frequencies of names

---

In [356]:
# generate ngrams and frequencies
def generate_ngrams(names):
    n_gram = collections.Counter()
    n_gram_freq = 3
    for c in names:
        n_gram.update(Counter(c[idx : idx + n_gram_freq] for idx in range(len(c) - 1)))
        
    return n_gram

In [357]:
# retrieve names for computing ngrams
names = open("data_set/corpus/english_names.txt", "r")
english_names = [x.rstrip() for x in names.readlines()]
english_names = [x.lower() for x in english_names]

names = open("data_set/corpus/russian_names.txt", "r")
russian_names = [x.rstrip() for x in names.readlines()]
russian_names = [x.lower() for x in russian_names]

In [358]:
eng_gram = generate_ngrams(english_names)
rus_gram = generate_ngrams(russian_names)

### Good Turing Smoothing
__Note:__ The smoothing method we will use is the Good-Turing Discounting Formula. It is perfect for accounting for bigrams that have yet to occur.

Equation: C^* = (c + 1) Nc+1/Nc

In [359]:
def good_turing_smoothing(n_gram):
    dict(n_gram)
    smoothing = {}
    
    result = None
    
    
    for k in n_gram:
        result = (n_gram[k] + 1 / n_gram[k])
        smoothing[k] = result
        
    return smoothing

In [360]:
# english metaparameters
eng_meta = good_turing_smoothing(eng_gram)

In [361]:
# russian metaparameters
rus_meta = good_turing_smoothing(rus_gram)

### Feature Selection

In [362]:
# Creating another column for when surname is English or not.
surname_df['label_eng'] = [1 if x =='English' else 0 for x in surname_df['nationality']]
label_eng = surname_df["label_eng"]

In [363]:
# Creating another column for when surname is Russian or not.
surname_df['label_rus'] = [1 if x =='Russian' else 0 for x in surname_df['nationality']]
label_rus = surname_df["label_rus"]

In [364]:
surname_df.head()

Unnamed: 0,surname,nationality,label_eng,label_rus
0,Mokrousov,Russian,0,1
1,Nurov,Russian,0,1
2,Judovich,Russian,0,1
3,Mikhailjants,Russian,0,1
4,Jandarbiev,Russian,0,1


Create a bag of ngrams

In [365]:
surname_list = surname_df['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x))

In [366]:
# vectorize features - unigrams, bigrams, and trigrams
cv = CountVectorizer(lowercase=True, analyzer='char', ngram_range=(1,2), strip_accents="ascii", min_df=0.09, max_df=1.0)
X_freq = cv.fit_transform(surname_list)

# tf_transformer for normalization
tf_transformer = TfidfTransformer(use_idf=False).fit(X_freq)
X = tf_transformer.transform(X_freq)

------
## Multiple Linear Regression

------

We will train two (2) models: One for English and the other for Russian surnames!

In [370]:
def metaparameters(X, meta):
    for key in meta:
        return meta[key] * X

In [371]:
X_eng = metaparameters(X, eng_meta)
X_rus = metaparameters(X, rus_meta)

#### English Surname Model

In [372]:
# split the data to train the model
x_train_eng, x_test_eng, y_train_eng, y_test_eng = train_test_split(X_eng, label_eng, test_size=0.20)

In [373]:
english_model = LinearRegression()
english_model.fit(x_train_eng, y_train_eng)

LinearRegression()

#### Russian Surname Model

In [374]:
x_train_rus, x_test_rus, y_train_rus, y_test_rus = train_test_split(X_rus, label_rus, test_size=0.20)

In [375]:
russian_model = LinearRegression()
russian_model.fit(x_train_rus, y_train_rus)

LinearRegression()

### Test Data and Predictions

#### English

In [376]:
englishness_test = english_model.predict(x_test_eng)

In [377]:
# summary of results
from statsmodels.api import OLS
OLS(y_test_eng, englishness_test).fit().summary()

0,1,2,3
Dep. Variable:,label_eng,R-squared (uncentered):,0.632
Model:,OLS,Adj. R-squared (uncentered):,0.631
Method:,Least Squares,F-statistic:,446.7
Date:,"Mon, 21 Sep 2020",Prob (F-statistic):,2.2e-58
Time:,23:12:56,Log-Likelihood:,-73.597
No. Observations:,261,AIC:,149.2
Df Residuals:,260,BIC:,152.8
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
x1,1.0301,0.049,21.134,0.000,0.934,1.126

0,1,2,3
Omnibus:,0.685,Durbin-Watson:,2.103
Prob(Omnibus):,0.71,Jarque-Bera (JB):,0.761
Skew:,0.119,Prob(JB):,0.683
Kurtosis:,2.882,Cond. No.,1.0


#### Russian

In [378]:
russianess_test = russian_model.predict(x_test_rus)

In [379]:
# summary of results
OLS(y_test_rus, russianess_test).fit().summary()

0,1,2,3
Dep. Variable:,label_rus,R-squared (uncentered):,0.857
Model:,OLS,Adj. R-squared (uncentered):,0.856
Method:,Least Squares,F-statistic:,1557.0
Date:,"Mon, 21 Sep 2020",Prob (F-statistic):,9.17e-112
Time:,23:12:57,Log-Likelihood:,-69.581
No. Observations:,261,AIC:,141.2
Df Residuals:,260,BIC:,144.7
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
x1,0.9971,0.025,39.457,0.000,0.947,1.047

0,1,2,3
Omnibus:,2.651,Durbin-Watson:,2.066
Prob(Omnibus):,0.266,Jarque-Bera (JB):,2.512
Skew:,-0.173,Prob(JB):,0.285
Kurtosis:,2.666,Cond. No.,1.0


----
### Observations

----

#### (1) Checking if the following names are Russian or English.

In [380]:
# predicting the following names
names = ["Fergus", "Angus", "Boston", "Austin", "Dankworth", "Denkworth", "Birtwistle", "Birdwhistle"]

reshape_feature = cv.transform(names)
english_res = english_model.predict(reshape_feature)
russian_res = russian_model.predict(reshape_feature)

print(f"English Model Results: \n {english_res} \n")
print(f"Russian Model Results: \n {russian_res}")

English Model Results: 
 [ 0.43321572  0.25252268  0.38993089 -0.16778808  0.17241999  0.47250489
  0.100231    0.07384694] 

Russian Model Results: 
 [0.5078675  0.59411788 0.52490733 0.92214076 0.62885159 0.40034486
 0.69007533 0.70230745]


Note: The english model does not see any of the above names as being of English origin. What's interesting is that it sees most of them more as Russian names.

---

#### 2) Predicting the most likely possible name

In [390]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

from nltk import bigrams, trigrams

In [391]:
# opening list of english names pulled from the United States Navy Academy
names = open("data_set/corpus/name_catalog.txt", "r")
names_df = [x.rstrip() for x in names.readlines()]
names_df = [x.lower() for x in names_df]

In [392]:
# convert list of names to ngrams
name_gram = generate_ngrams(names_df)

In [393]:
# write list of ngrams to a file. This will be our corpus.
path = "data_set/corpus/name_corpus/name_gram.txt" 
dict(name_gram)
f = open(path, 'w') 
for n in name_gram:
    f.write(n + "\n")

f.close()

In [394]:
# read from the corpus we just created
corpus_root = 'data_set/corpus/name_corpus/'
names_txt = PlaintextCorpusReader(corpus_root, '.*')
names_txt.words('name_gram.txt')

['mic', 'ich', 'cha', 'hae', 'ael', 'el', 'chr', 'hri', ...]

In [395]:
# creating new english model for predicting the name
eng_name_model = defaultdict(lambda: defaultdict(lambda: 0))

In [396]:
for sentence in names_txt.sents():
    for c1, c2 in bigrams(sentence, pad_right=True, pad_left=True):
        eng_name_model[c1][c2] += 1

In [397]:
# count probabilities
for c1 in eng_name_model:
    total_count = float(sum(eng_name_model[c1].values()))
    for c2 in n_gram_model[c1]:
        eng_name_model[c1][c2] /= total_count

In [409]:
test_names = ["Lou", "Ber", "Cul", "Ede", "Zjo"]

for name in test_names:
    name = name.lower()
    print(f"{name} -> {name.join(eng_name_model[name])}")

lou -> oui
ber -> ert
cul -> ull
ede -> ryl
zjo -> 


----
### Improvements

----

For observation (1), the model needs to be tested on more english data. It appears to be more partial to Russian names.

For observation (2), it's the same as above. Although it was able to predict majority of the names given, it could not give a possible name to Zjo. This could be due to the fact that the model has not seen this trigram before and therefore defaulted to unknown or it could be that no English name contains this combination. 