## Introduction to Sequence Modeling - Russian and English names

---
### Goal
---

Develop and classifier for Russian vs English surnames.

In this iteration we are going to:
* Compute bigram frequencies for English names.
* Compute bigram frequencies for Russian names.
* Develop a bag of bigrams model for distinguishing English and Russian names.
* Implement Good Turing Discounting Model Smoothing
* Test performance of model using English data.

------


In [224]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import re

import nltk
from nltk import ngrams
import collections
from collections import defaultdict, Counter

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer # tokenizes text and normalizes

---
### Let's perform some EDA

---

In [225]:
# read the csv file into data frame.
surname_csv = "data_set/russian_and_english_dev.csv"
surname_df = pd.read_csv(surname_csv, index_col = None, encoding="UTF-8")

In [226]:
# rename dev data columns.
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

In [227]:
surname_df = surname_df.dropna()

#### Features Exploration

We will focus on the English names and Russian names only.

In [228]:
# read english names
# english_df = pd.read_csv("data_set/corpus/english_names.txt")

In [229]:
# english_df.head()

In [230]:
# russian_df = pd.read_csv("data_set/corpus/russian_names.txt")

In [231]:
# russian_df.head()

### Generate Bigrams
Calculate n_grams and frequencies

In [232]:
# generate ngrams and frequencies
def generate_ngrams(names):
    n_gram = collections.Counter()
    n_gram_freq = 3
    for c in names:
        n_gram.update(Counter(c[idx : idx + n_gram_freq] for idx in range(len(c) - 1)))
        
    return n_gram

In [233]:
# sorting frequences in descending order
def freq_sorted(n_gram):
    [print(key, value) for (key, value) in sorted(n_gram.items(), key=lambda x: x[1], reverse=True)]

In [234]:
names = open("data_set/corpus/english_names.txt", "r")
english_names = [x.rstrip() for x in names.readlines()]
english_names = [x.lower() for x in english_names]

names = open("data_set/corpus/russian_names.txt", "r")
russian_names = [x.rstrip() for x in names.readlines()]
russian_names = [x.lower() for x in russian_names]

In [235]:
# retrieve english and russian ngrams
eng_gram = generate_ngrams(english_names)
rus_gram = generate_ngrams(russian_names)

In [236]:
# eng_gram

In [237]:
# rus_gram

### Good Turing Smoothing

In [238]:
def good_turing_smoothing(n_gram):
    dict(n_gram)
    smoothing = {}
    
    result = None
    
    
    for k in n_gram:
        result = (n_gram[k] + 1 / n_gram[k])
        smoothing[k] = result
        
    return smoothing

__Note:__ The smoothing method we will use is the Good-Turing Discounting Formula. It is perfect for accounting for bigrams that have yet to occur.

Equation: C^* = (c + 1) Nc+1/Nc

In [239]:
# english metaparameters
eng_meta = good_turing_smoothing(eng_gram)

In [240]:
# russian metaparameters
rus_meta = good_turing_smoothing(rus_gram)

In [241]:
# Creating another column for when surname is English or not.
surname_df['label_eng'] = [1 if x =='English' else 0 for x in surname_df['nationality']]
label_eng = surname_df["label_eng"]

In [242]:
# Creating another column for when surname is Russian or not.
surname_df['label_rus'] = [1 if x =='Russian' else 0 for x in surname_df['nationality']]
label_rus = surname_df["label_rus"]

In [243]:
surname_df.head()

Unnamed: 0,surname,nationality,label_eng,label_rus
0,Mokrousov,Russian,0,1
1,Nurov,Russian,0,1
2,Judovich,Russian,0,1
3,Mikhailjants,Russian,0,1
4,Jandarbiev,Russian,0,1


Create a bag of ngrams

In [244]:
surname_list = surname_df['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x))

In [245]:
# vectorize features - unigrams, bigrams, and trigrams
cv = CountVectorizer(lowercase=True, analyzer='char', ngram_range=(1,3), strip_accents="ascii", min_df=0.09, max_df=1.0)
X_freq = cv.fit_transform(surname_list)

# tf_transformer for normalization
tf_transformer = TfidfTransformer(use_idf=False).fit(X_freq)
X = tf_transformer.transform(X_freq)

In [246]:
print(X.toarray())

[[0.         0.         0.         ... 0.25       0.         0.        ]
 [0.         0.         0.         ... 0.40824829 0.         0.        ]
 [0.         0.         0.         ... 0.33333333 0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.5        ... 0.         0.         0.        ]]


In [247]:
X.shape

(1305, 28)

In [248]:
print(cv.get_feature_names())

['a', 'an', 'b', 'c', 'ch', 'd', 'e', 'er', 'ev', 'g', 'h', 'i', 'in', 'k', 'ko', 'l', 'm', 'n', 'o', 'ov', 'p', 'r', 's', 't', 'u', 'v', 'y', 'z']


------
## Multiple Linear Regression

------

We will train two (2) models: One for English and the other for Russian surnames!

In [249]:
def metaparameters(X, meta):
    for key in meta:
        return meta[key] * X

In [250]:
X_eng = metaparameters(X, eng_meta)
X_rus = metaparameters(X, rus_meta)

#### English Surname Model

In [251]:
# split the data to train the model
x_train_eng, x_test_eng, y_train_eng, y_test_eng = train_test_split(X_eng, label_eng, test_size=0.20, random_state = 32)

In [252]:
english_model = LinearRegression()
english_model.fit(x_train_eng, y_train_eng)

LinearRegression()

#### Russian Surname Model

In [253]:
x_train_rus, x_test_rus, y_train_rus, y_test_rus = train_test_split(X_rus, label_rus, test_size=0.20, random_state = 32)

In [254]:
russian_model = LinearRegression()
russian_model.fit(x_train_rus, y_train_rus)

LinearRegression()

### Test Data and Predictions

#### English

In [255]:
englishness_test = english_model.predict(x_test_eng)

In [256]:
# summary of results
from statsmodels.api import OLS
OLS(y_test_eng, englishness_test).fit().summary()

0,1,2,3
Dep. Variable:,label_eng,R-squared (uncentered):,0.651
Model:,OLS,Adj. R-squared (uncentered):,0.65
Method:,Least Squares,F-statistic:,485.2
Date:,"Mon, 21 Sep 2020",Prob (F-statistic):,2.1799999999999998e-61
Time:,23:03:26,Log-Likelihood:,-88.056
No. Observations:,261,AIC:,178.1
Df Residuals:,260,BIC:,181.7
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
x1,1.0113,0.046,22.027,0.000,0.921,1.102

0,1,2,3
Omnibus:,9.104,Durbin-Watson:,1.98
Prob(Omnibus):,0.011,Jarque-Bera (JB):,4.92
Skew:,0.113,Prob(JB):,0.0854
Kurtosis:,2.366,Cond. No.,1.0


#### Russian

In [257]:
russianess_test = russian_model.predict(x_test_rus)

In [258]:
# summary of results
OLS(y_test_rus, russianess_test).fit().summary()

0,1,2,3
Dep. Variable:,label_rus,R-squared (uncentered):,0.829
Model:,OLS,Adj. R-squared (uncentered):,0.828
Method:,Least Squares,F-statistic:,1256.0
Date:,"Mon, 21 Sep 2020",Prob (F-statistic):,1.53e-101
Time:,23:03:26,Log-Likelihood:,-88.075
No. Observations:,261,AIC:,178.1
Df Residuals:,260,BIC:,181.7
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
x1,1.0044,0.028,35.442,0.000,0.949,1.060

0,1,2,3
Omnibus:,9.518,Durbin-Watson:,1.982
Prob(Omnibus):,0.009,Jarque-Bera (JB):,5.106
Skew:,-0.121,Prob(JB):,0.0779
Kurtosis:,2.359,Cond. No.,1.0


----
### Observations

----

1) Checking if the following names are Russian or English.

In [259]:
# predicting the following names
names = ["Fergus", "Angus", "Boston", "Austin", "Dankworth", "Denkworth", "Birtwistle", "Birdwhistle"]

reshape_feature = cv.transform(names)
english_res = english_model.predict(reshape_feature)
russian_res = russian_model.predict(reshape_feature)

print(f"English Model Results: \n {english_res} \n")
print(f"Russian Model Results: \n {russian_res}")

English Model Results: 
 [ 0.43234363  0.28832746  0.40257493 -0.11332377  0.12043297  0.43754444
  0.0987624   0.03762174] 

Russian Model Results: 
 [0.45579119 0.57100412 0.47960615 0.89232511 0.70531972 0.45163054
 0.72265617 0.7715687 ]


Note: The english model does not see any of the above names as being of English origin. What's interesting is that it sees most of them more as Russian names.

---

2) Predicting the most likely possible name

In [260]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

from nltk import bigrams, trigrams
from collections import defaultdict, Counter

In [261]:
# opening list of english names pulled from the United States Navy Academy
names = open("data_set/corpus/name_catalog.txt", "r")
names_df = [x.rstrip() for x in names.readlines()]
names_df = [x.lower() for x in names_df]

In [262]:
# convert list of names to ngrams
name_gram = generate_ngrams(names_df)

In [263]:
# write list of ngrams to a file. This will be our corpus.
path = "data_set/corpus/name_corpus/name_gram.txt" 
dict(name_gram)
f = open(path, 'w') 
for n in name_gram:
    f.write(n + "\n")

f.close()

In [264]:
# read from the corpus we just created
corpus_root = 'data_set/corpus/name_corpus/'
names_txt = PlaintextCorpusReader(corpus_root, '.*')
names_txt.words('name_gram.txt')

['mic', 'ich', 'cha', 'hae', 'ael', 'el', 'chr', 'hri', ...]

In [265]:
# creating new english model for predicting the name
eng_name_model = defaultdict(lambda: defaultdict(lambda: 0))

In [266]:
for sentence in names_txt.sents():
    for c1, c2 in bigrams(sentence, pad_right=True, pad_left=True):
        eng_name_model[c1][c2] += 1

In [267]:
# count probabilities
for c1 in eng_name_model:
    total_count = float(sum(eng_name_model[c1].values()))
    for c2 in n_gram_model[c1]:
        eng_name_model[c1][c2] /= total_count

In [268]:
test_names = ["Lou", "Ber", "Cul", "Ede", "Zjo"]

for name in test_names:
    name = name.lower()
    print(f"{name} -> {dict(eng_name_model[name])}")

lou -> {'oui': 2}
ber -> {'ert': 2}
cul -> {'ull': 2}
ede -> {'ryl': 2}
zjo -> {}


----
### Improvements

----

The model needs to be tested on more english data. It appears to be more partial to Russian names.