## Introduction to Sequence Modeling - Russian and English Names

---
### Goal
---

Develop a classifier for Russian vs. English surnames.

In this iteration we will:
* Compute bigram frequencies for English names.
* Compute bigram frequencies for Russian names.
* Develop a bag-of-bigrams model for distinguishing English and Russian names.
* Implement Good-Turing discounting for smoothing.
* Test the performance of the model using English data.

------


In [38]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import re

import nltk
from nltk import ngrams
import collections
from collections import defaultdict, Counter

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer # tokenizes text and normalizes

---
### Let's perform some EDA

---

In [2]:
# read the csv file into data frame.
surname_csv = "data_set/russian_and_english_dev.csv"
surname_df = pd.read_csv(surname_csv, index_col = None, encoding="UTF-8")

In [3]:
# rename dev data columns.
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

In [4]:
surname_df = surname_df.dropna()

#### Features Exploration

We will focus on the English names and Russian names only.

In [5]:
# read english names; without header=None, the first name becomes the column label
english_df = pd.read_csv("data_set/corpus/english_names.txt")

In [6]:
english_df.head()

| | fairhurst |
|---|---|
| 0 | wateridge |
| 1 | nemeth |
| 2 | moroney |
| 3 | goodall |
| 4 | agar |


In [7]:
# read russian names; without header=None, the first name becomes the column label
russian_df = pd.read_csv("data_set/corpus/russian_names.txt")

In [8]:
russian_df.head()

| | mokrousov |
|---|---|
| 0 | nurov |
| 1 | judovich |
| 2 | mikhailjants |
| 3 | jandarbiev |
| 4 | govyadin |


### Generate Bigrams
Count the character bigrams of each language and their frequencies.

In [9]:
# count character bigrams across an iterable of names
def generate_bigrams(names):
    n_gram = collections.Counter()
    n = 2  # n-gram length (2 = bigram)
    for name in names:
        # slide a window of length n over the name
        n_gram.update(name[idx : idx + n] for idx in range(len(name) - n + 1))

    return n_gram

In [10]:
# print bigram frequencies in descending order
def freq_sorted(n_gram):
    for key, value in sorted(n_gram.items(), key=lambda x: x[1], reverse=True):
        print(key, value)

In [11]:
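# retrieve english and russian n-grams
# NOTE: iterating over a DataFrame yields its column labels, not its rows,
# so each call below sees only the single name used as the column label
eng_gram = generate_bigrams(english_df)
rus_gram = generate_bigrams(russian_df)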

In [12]:
eng_gram

Counter({'fa': 1,
         'ai': 1,
         'ir': 1,
         'rh': 1,
         'hu': 1,
         'ur': 1,
         'rs': 1,
         'st': 1})

In [13]:
rus_gram

Counter({'mo': 1,
         'ok': 1,
         'kr': 1,
         'ro': 1,
         'ou': 1,
         'us': 1,
         'so': 1,
         'ov': 1})
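
Each counter above holds the bigrams of a single name. A minimal sketch of counting bigrams across a whole list of names is given below. The `count_bigrams` helper and the toy name list are illustrative assumptions, not part of the original notebook; in the notebook, one would pass a full column of names instead (for example after reading the file with `header=None`).

```python
from collections import Counter

def count_bigrams(names):
    """Count character bigrams over every name in an iterable of strings."""
    counts = Counter()
    for name in names:
        name = name.lower()
        # slide a window of length 2 over the name
        counts.update(name[i:i + 2] for i in range(len(name) - 1))
    return counts

# Toy data (hypothetical sample taken from the names shown earlier)
toy_names = ["nurov", "mokrousov", "govyadin"]
toy_counts = count_bigrams(toy_names)
print(toy_counts.most_common(3))
```

With more than one name, frequent endings such as `ov` start to stand out, which is the signal the classifier is meant to use.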

#### Smoothing

In [14]:
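def good_turing_smoothing(n_gram):
    smoothing = {}

    for k in n_gram:
        # NOTE: division binds tighter than addition, so this evaluates to
        # c + 1/c; it does not use the frequency-of-frequency counts N_c
        # that appear in the Good-Turing formula
        smoothing[k] = n_gram[k] + 1 / n_gram[k]

    return smoothing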

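__Note:__ The smoothing method we will use is Good-Turing discounting. It shifts some probability mass from observed bigrams to bigrams that have not yet occurred.

Equation: $$c^* = (c + 1)\,\frac{N_{c+1}}{N_c}$$

where $c$ is the observed count of a bigram and $N_c$ is the number of distinct bigrams observed exactly $c$ times.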
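
The Good-Turing formula can be sketched directly from a bigram counter. The helper names, the toy counts, and the fallback to the raw count when no bigram has count c + 1 are assumptions made for this illustration; practical implementations usually smooth the frequency-of-frequency values first.

```python
from collections import Counter

def good_turing_counts(n_gram):
    """Return adjusted counts c* = (c + 1) * N_{c+1} / N_c for each bigram."""
    freq_of_freq = Counter(n_gram.values())  # N_c: how many bigrams occur c times
    adjusted = {}
    for gram, c in n_gram.items():
        n_c = freq_of_freq[c]
        n_c_plus_1 = freq_of_freq.get(c + 1, 0)
        if n_c_plus_1:
            adjusted[gram] = (c + 1) * n_c_plus_1 / n_c
        else:
            # Assumption: keep the raw count when N_{c+1} is zero
            adjusted[gram] = float(c)
    return adjusted

def unseen_mass(n_gram):
    """Probability mass reserved for unseen bigrams: N_1 / N."""
    total = sum(n_gram.values())
    return Counter(n_gram.values())[1] / total if total else 0.0

# Toy counts (hypothetical): N_1 = 3, N_2 = 2, N_3 = 1
toy = Counter({"ov": 3, "in": 2, "ko": 2, "ev": 1, "sk": 1, "ch": 1})
adjusted = good_turing_counts(toy)
print(adjusted, unseen_mass(toy))
```

In this toy case a bigram seen once is discounted from 1 to 2 · 2/3 ≈ 1.33, and the share reserved for unseen bigrams is N_1 / N = 3/10.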

In [15]:
# english metaparameters
eng_meta = good_turing_smoothing(eng_gram)

In [16]:
# russian metaparameters
rus_meta = good_turing_smoothing(rus_gram)

In [17]:
# binary label: 1 if the surname is English, 0 otherwise
surname_df['label_eng'] = [1 if x =='English' else 0 for x in surname_df['nationality']]
label_eng = surname_df["label_eng"]

In [18]:
# binary label: 1 if the surname is Russian, 0 otherwise
surname_df['label_rus'] = [1 if x =='Russian' else 0 for x in surname_df['nationality']]
label_rus = surname_df["label_rus"]

In [19]:
surname_df.head()

| | surname | nationality | label_eng | label_rus |
|---|---|---|---|---|
| 0 | Mokrousov | Russian | 0 | 1 |
| 1 | Nurov | Russian | 0 | 1 |
| 2 | Judovich | Russian | 0 | 1 |
| 3 | Mikhailjants | Russian | 0 | 1 |
| 4 | Jandarbiev | Russian | 0 | 1 |


---
### Tokenize Data

---
Create a bag of character n-grams (unigrams, bigrams, and trigrams).

In [20]:
surname_list = surname_df['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x))

In [21]:
# vectorize features - character unigrams, bigrams, and trigrams;
# min_df=0.09 keeps only n-grams that occur in at least 9% of the surnames
cv = CountVectorizer(lowercase=True, analyzer='char', ngram_range=(1,3), strip_accents="ascii", min_df=0.09, max_df=1.0)
X_freq = cv.fit_transform(surname_list)

# term-frequency normalization only (use_idf=False); each row is L2-normalized by default
tf_transformer = TfidfTransformer(use_idf=False).fit(X_freq)
X = tf_transformer.transform(X_freq)

In [22]:
print(X.toarray())

[[0.         0.         0.         ... 0.25       0.         0.        ]
 [0.         0.         0.         ... 0.40824829 0.         0.        ]
 [0.         0.         0.         ... 0.33333333 0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.5        ... 0.         0.         0.        ]]
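
The values in this matrix are L2-normalized counts: with `use_idf=False` and the default `norm='l2'`, each row of counts is divided by its Euclidean norm. For instance, 0.40824829 equals 1/√6, which is what a count of 1 becomes in a row whose squared counts sum to 6. The sketch below reproduces the idea in plain Python; `char_ngrams`, `l2_normalized_counts`, and the toy vocabulary are hypothetical helpers that only approximate what the scikit-learn objects do.

```python
import math
from collections import Counter

def char_ngrams(name, n_min=1, n_max=3):
    """All character n-grams of a lowercased name for n_min <= n <= n_max."""
    name = name.lower()
    return [name[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(name) - n + 1)]

def l2_normalized_counts(name, vocabulary):
    """Count the n-grams found in `vocabulary`, then divide by the L2 norm."""
    vocab = set(vocabulary)
    counts = Counter(g for g in char_ngrams(name) if g in vocab)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {g: (counts[g] / norm if norm else 0.0) for g in vocabulary}

# Toy vocabulary (hypothetical subset of the features kept by min_df)
toy_vocab = ["n", "o", "v", "ov", "ko"]
row = l2_normalized_counts("Nurov", toy_vocab)
print(row)
```

Four vocabulary entries occur once each in "Nurov", so the norm is √4 = 2 and every non-zero entry becomes 0.5.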


In [23]:
X.shape

(1305, 28)

In [24]:
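# get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
print(cv.get_feature_names())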

['a', 'an', 'b', 'c', 'ch', 'd', 'e', 'er', 'ev', 'g', 'h', 'i', 'in', 'k', 'ko', 'l', 'm', 'n', 'o', 'ov', 'p', 'r', 's', 't', 'u', 'v', 'y', 'z']


------
## Multiple Linear Regression

------

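We will train two models: one scores how English a surname is, the other how Russian. Both are linear regressions fit to a 0/1 label, so their outputs are scores rather than calibrated probabilities.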

In [25]:
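def metaparameters(X, meta):
    # NOTE: the return sits inside the loop, so only the first value in
    # `meta` is used and the whole matrix X is scaled by that one scalar
    for key in meta:
        return meta[key] * X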

In [26]:
X_eng = metaparameters(X, eng_meta)
X_rus = metaparameters(X, rus_meta)
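
If the intent is to weight each n-gram feature by its own smoothed count, one option is a per-column weight vector aligned with the vectorizer's feature names. The sketch below uses plain lists; `weight_features`, the default weight of 1.0 for features missing from `meta`, and the toy values are assumptions made for illustration only.

```python
def weight_features(rows, feature_names, meta, default=1.0):
    """Scale column j of every row by the weight of feature_names[j]."""
    # Assumption: features without a smoothed count keep their value (weight 1.0)
    weights = [meta.get(name, default) for name in feature_names]
    return [[value * w for value, w in zip(row, weights)] for row in rows]

# Toy inputs (hypothetical)
toy_features = ["ov", "in", "th"]
toy_rows = [[0.5, 0.5, 0.0],
            [0.0, 1.0, 2.0]]
toy_meta = {"ov": 3.0, "in": 1.5}

weighted = weight_features(toy_rows, toy_features, toy_meta)
print(weighted)
```

With a sparse matrix, the same effect could be obtained by multiplying with a diagonal matrix of weights, which keeps the data sparse.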

#### English Surname Model

In [27]:
# split the data to train the model
x_train_eng, x_test_eng, y_train_eng, y_test_eng = train_test_split(X_eng, label_eng, test_size=0.20, random_state = 32)

In [28]:
english_model = LinearRegression()
english_model.fit(x_train_eng, y_train_eng)

LinearRegression()

#### Russian Surname Model

In [29]:
x_train_rus, x_test_rus, y_train_rus, y_test_rus = train_test_split(X_rus, label_rus, test_size=0.20, random_state = 32)

In [30]:
russian_model = LinearRegression()
russian_model.fit(x_train_rus, y_train_rus)

LinearRegression()

### Test Data and Predictions

#### English

In [31]:
englishness_test = english_model.predict(x_test_eng)

In [32]:
# summary of results: regress the true labels on the predictions (no intercept)
from statsmodels.api import OLS
OLS(y_test_eng, englishness_test).fit().summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | label_eng | R-squared (uncentered): | 0.651 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.65 |
| Method: | Least Squares | F-statistic: | 485.2 |
| Date: | Mon, 21 Sep 2020 | Prob (F-statistic): | 2.18e-61 |
| Time: | 21:51:43 | Log-Likelihood: | -88.056 |
| No. Observations: | 261 | AIC: | 178.1 |
| Df Residuals: | 260 | BIC: | 181.7 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 1.0113 | 0.046 | 22.027 | 0.000 | 0.921 | 1.102 |

| | | | |
|---|---|---|---|
| Omnibus: | 9.104 | Durbin-Watson: | 1.98 |
| Prob(Omnibus): | 0.011 | Jarque-Bera (JB): | 4.92 |
| Skew: | 0.113 | Prob(JB): | 0.0854 |
| Kurtosis: | 2.366 | Cond. No. | 1.0 |


#### Russian

In [33]:
russianess_test = russian_model.predict(x_test_rus)

In [34]:
# summary of results: regress the true labels on the predictions (no intercept)
OLS(y_test_rus, russianess_test).fit().summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | label_rus | R-squared (uncentered): | 0.829 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.828 |
| Method: | Least Squares | F-statistic: | 1256.0 |
| Date: | Mon, 21 Sep 2020 | Prob (F-statistic): | 1.53e-101 |
| Time: | 21:51:43 | Log-Likelihood: | -88.075 |
| No. Observations: | 261 | AIC: | 178.1 |
| Df Residuals: | 260 | BIC: | 181.7 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 1.0044 | 0.028 | 35.442 | 0.000 | 0.949 | 1.060 |

| | | | |
|---|---|---|---|
| Omnibus: | 9.518 | Durbin-Watson: | 1.982 |
| Prob(Omnibus): | 0.009 | Jarque-Bera (JB): | 5.106 |
| Skew: | -0.121 | Prob(JB): | 0.0779 |
| Kurtosis: | 2.359 | Cond. No. | 1.0 |
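
Because both models regress a 0/1 label, a more direct way to judge them as classifiers is to threshold the predicted scores and measure accuracy on the held-out split. The sketch below shows the idea; the 0.5 threshold, the helper names, and the toy scores and labels are assumptions for illustration, not results from this notebook.

```python
def classify(scores, threshold=0.5):
    """Turn regression scores into 0/1 predictions."""
    return [1 if s >= threshold else 0 for s in scores]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values (hypothetical)
toy_scores = [0.13, 0.26, 0.81, 0.62]
toy_labels = [0, 1, 1, 1]

toy_pred = classify(toy_scores)
toy_acc = accuracy(toy_labels, toy_pred)
print(toy_pred, toy_acc)
```

In the notebook, the scores would be the arrays returned by `predict` and the labels would be `y_test_eng` or `y_test_rus`.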


#### Observations

Note: __Wasem__ is an Arabic name. The model seems to think it is Russian because of similarities in spelling, so it is misclassified.

In [35]:
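pred_name1 = ["See"]
# NOTE: cv.transform returns raw counts, whereas the training features were
# also passed through tf_transformer and metaparameters
reshape_feature = cv.transform(pred_name1)
russian_model.predict(reshape_feature)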

array([0.13105165])

Note: __See__ is a Dutch name, and the model gives it a low Russian score of about 0.13. The output of a linear regression is a score, not a calibrated probability.

In [36]:
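pred_name2 = ["Los"]
# NOTE: cv.transform returns raw counts, whereas the training features were
# also passed through tf_transformer and metaparameters
reshape_feature = cv.transform(pred_name2)
russian_model.predict(reshape_feature)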

array([0.26052117])

Note: __Los__ is a Russian name, but its score of about 0.26 is still below 0.5. It scores only modestly higher than __See__, so the model separates it weakly at best.