## Introduction to Sequence Modeling - Russian vs English Surnames

---
### Goal
---

Develop a classifier for Russian vs English surnames.

In this iteration we are going to:
* Compute bigram frequencies for English names.
* Compute bigram frequencies for Russian names.
* Develop a bag of bigrams model for distinguishing English and Russian names.
* Implement Good-Turing discounting for smoothing.
* Test the performance of the model using English data.

------


In [1]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import re

In [2]:
import nltk
# from nltk import bigrams, trigrams, word_tokenize
import collections
from collections import defaultdict, Counter

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.corpus import brown # corpus of english words

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB # Multinomial Naive Bayes for discrete counts

from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer # tokenizes text and normalizes

---
### Let's perform some EDA

---

In [4]:
# read the csv file into data frame.
surname_csv = "data_set/russian_and_english_dev.csv"
surname_df = pd.read_csv(surname_csv, index_col = None, encoding="UTF-8")

In [5]:
# rename dev data columns.
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

#### Features Exploration

In [6]:
# check the number of rows and columns
surname_df.shape

(1306, 2)

In [7]:
surname_df = surname_df.dropna()

In [8]:
surname_df.shape

(1305, 2)

In [9]:
surname_df.to_csv("data_set/corpus/surnames_names.txt", sep='\t', index=False, header=False)

In [10]:
# Retrieve English names only
english_df = surname_df.loc[surname_df["nationality"] == "English"]
english_df = english_df[["surname"]]
english_df.head()

Unnamed: 0,surname
940,Fairhurst
941,Wateridge
942,Nemeth
943,Moroney
944,Goodall


In [11]:
# save all english names in txt file
english_df.to_csv("data_set/corpus/english_names.txt", sep='\t', index=False, header=False)

In [12]:
# Russian names only
russian_df = surname_df.loc[surname_df["nationality"] == "Russian"]
russian_df = russian_df[["surname"]]
russian_df.head()

Unnamed: 0,surname
0,Mokrousov
1,Nurov
2,Judovich
3,Mikhailjants
4,Jandarbiev


In [13]:
russian_df.to_csv("data_set/corpus/russian_names.txt", sep='\t', index=False, header=False)

---
### Create New Corpus

---
Create a new corpus of English and Russian names to be used for the n-gram model.

In [14]:
# English: read one name per line, strip trailing whitespace and lowercase
with open("data_set/corpus/english_names.txt", "r") as names:
    english_names = [x.rstrip().lower() for x in names]
# english_names

In [15]:
# Russian: read one name per line, strip trailing whitespace and lowercase
with open("data_set/corpus/russian_names.txt", "r") as names:
    russian_names = [x.rstrip().lower() for x in names]
# russian_names

-------
### Calculate Frequencies and Probabilities

-----

In [16]:
# generate bigrams and frequencies
def generate_bigrams(names):
    n_gram = Counter()
    n_gram_size = 2  # bigrams: two consecutive characters
    for name in names:
        n_gram.update(name[idx : idx + n_gram_size] for idx in range(len(name) - n_gram_size + 1))

    return n_gram
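
To make the counting step concrete, here is a minimal sketch on two toy names. The helper `char_bigrams` and the list `toy_names` are hypothetical and only serve the illustration; they are not part of the notebook's pipeline.

```python
from collections import Counter

def char_bigrams(name):
    """Return the overlapping two-character substrings of a name."""
    return [name[i:i + 2] for i in range(len(name) - 1)]

toy_names = ["nurov", "petrov"]  # hypothetical input, not the real corpus

counts = Counter()
for name in toy_names:
    counts.update(char_bigrams(name))

print(char_bigrams("nurov"))  # bigrams of a single name
print(counts["ov"])           # how often "ov" occurs across both names
```

Each name of length `n` contributes `n - 1` bigrams, so a one-letter name contributes none.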

In [17]:
# sorting frequencies in descending order
def freq_sorted(n_gram):
    for key, value in sorted(n_gram.items(), key=lambda x: x[1], reverse=True):
        print(key, value)

In [18]:
# retrieve english bigrams
eng_gram = generate_bigrams(english_names)

In [38]:
# sort frequencies
# freq_sorted(eng_gram)

er 49
on 47
ar 40
in 37
le 37
an 36
ey 33
ll 33
ne 28
en 26
el 25
ra 25
th 23
or 23
re 23
to 23
ck 21
ma 20
st 19
ri 19
de 19
ur 18
es 18
ha 17
so 17
la 17
rt 16
ke 16
ro 15
ho 15
ou 15
is 15
al 14
am 14
ng 14
ns 14
pe 14
nd 14
se 14
od 13
ld 13
il 13
nn 13
ol 13
mo 12
as 12
sh 12
tt 12
ic 12
wa 11
te 11
be 11
rd 11
co 11
rs 10
at 10
ge 10
ga 10
he 10
do 10
ea 10
hi 10
rr 10
lo 10
oo 9
wo 9
ul 9
li 9
ki 9
ow 9
ir 8
et 8
rn 8
gh 8
pa 8
ve 8
wi 8
we 8
no 7
ak 7
tr 7
ed 7
rl 7
dd 7
vi 7
ch 7
oc 7
un 7
ei 7
ss 7
ad 7
di 7
ie 7
hu 6
em 6
da 6
ag 6
na 6
ut 6
gr 6
ee 6
dr 6
bi 6
ta 6
nt 6
ds 6
fo 6
um 6
fi 6
ug 5
gg 5
it 5
ks 5
ay 5
au 5
bb 5
ru 5
ni 5
ig 5
aw 5
dl 5
ja 5
ac 5
op 5
ls 5
ti 5
ry 5
ai 4
rh 4
id 4
dg 4
me 4
go 4
du 4
ot 4
ka 4
lm 4
va 4
kn 4
gi 4
gu 4
tu 4
av 4
sm 4
br 4
bu 4
us 4
om 4
lk 4
cu 4
cr 4
pl 4
wn 4
ms 4
fa 3
rb 3
mc 3
ty 3
mi 3
gt 3
ud 3
oa 3
kl 3
ox 3
tl 3
ib 3
wh 3
ff 3
pp 3
rg 3
wr 3
fr 3
nc 3
sa 3
sw 3
po 3
ob 3
oy 3
yl 3
dm 3
ya 3
mp 3
ye 3
mm 3
bo 3
os 3
si 3

In [20]:
rus_gram = generate_bigrams(russian_names)

In [21]:
n_grams = generate_bigrams(surname_df['surname'])

In [22]:
# freq_sorted(rus_gram)

__Question__: What bigram is most informative for distinguishing between English and Russian names?

__Observation__: English top 5 bigrams:

* er: 49
* on: 47
* ar: 40
* in: 37
* le: 37

__Observation__: Russian top 5 bigrams:

* ov: 342
* in: 198
* ko: 159
* ev: 133
* ch: 117
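
Raw counts alone do not answer the question, because a bigram such as `in` is frequent in both languages. One possible way to rank bigrams is the log ratio of their relative frequencies in the two corpora. The sketch below assumes add-one smoothing to avoid division by zero and runs on hypothetical toy names; `bigram_counts` and `log_ratio` are illustrative helpers, not functions from the notebook.

```python
import math
from collections import Counter

def bigram_counts(names):
    """Count character bigrams over a list of names."""
    counts = Counter()
    for name in names:
        counts.update(name[i:i + 2] for i in range(len(name) - 1))
    return counts

def log_ratio(bigram, counts_a, counts_b, vocab_size):
    """Log of P(bigram | A) / P(bigram | B), with add-one smoothing."""
    p_a = (counts_a[bigram] + 1) / (sum(counts_a.values()) + vocab_size)
    p_b = (counts_b[bigram] + 1) / (sum(counts_b.values()) + vocab_size)
    return math.log(p_a / p_b)

# Hypothetical toy data, not the notebook's corpus.
english = bigram_counts(["miller", "turner", "martin"])
russian = bigram_counts(["nurov", "petrov", "martinov"])

vocab = set(english) | set(russian)
scores = {b: log_ratio(b, russian, english, len(vocab)) for b in vocab}

# Positive scores lean Russian, negative scores lean English.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])   # strongest Russian indicator in the toy data
print(ranked[-1])  # strongest English indicator in the toy data
```

On this toy data `ov` ranks first and `er` last, while bigrams shared by both lists score close to zero. Applying the same idea to `rus_gram` and `eng_gram` would be one way to answer the question above.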

### Feature Selection

In [23]:
features = surname_df['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x)) # features (x) needed to predict nationality
target = surname_df['nationality'] # what we are predicting (y)

------
## Naive Bayes

------

### English Surnames

__Observation__: There are significantly more Russian names than English names in the dataset, so the classes are imbalanced.

In [24]:
cv = CountVectorizer(lowercase=True, analyzer='char', ngram_range=(1,4), strip_accents="ascii", min_df=0.09, max_df=1.0)
X = cv.fit_transform(features)
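
The settings above ask for character n-grams of length 1 to 4 and, through `min_df=0.09`, keep only those that occur in at least 9% of the names, which is presumably why the resulting matrix has so few columns. The following pure-Python sketch approximates that filtering on hypothetical toy names. It is not scikit-learn's implementation, and `char_ngrams` and `frequent_ngrams` are illustrative names.

```python
from collections import Counter

def char_ngrams(name, min_n=1, max_n=4):
    """All character substrings of length min_n..max_n, lowercased."""
    name = name.lower()
    return [
        name[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(name) - n + 1)
    ]

def frequent_ngrams(names, min_df):
    """Keep n-grams that occur in at least a min_df fraction of the names."""
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(char_ngrams(name)))  # count each name at most once
    return sorted(g for g, df in doc_freq.items() if df / len(names) >= min_df)

toy_names = ["Nurov", "Petrov", "Turner", "Miller"]  # hypothetical input
vocabulary = frequent_ngrams(toy_names, min_df=0.5)
print(vocabulary)
```

With a threshold of 0.5 only n-grams found in at least two of the four toy names survive, so most longer n-grams are dropped. The same effect on the real data means the classifier sees mainly single characters and a few very common short n-grams.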

In [25]:
target.shape

(1305,)

In [26]:
X.shape

(1305, 28)

### Train/Test Data

To evaluate the model on names it has not seen during training, we split the surnames into train (65%) and test (35%) datasets.

In [27]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [28]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, target, test_size=0.35, random_state = 32)

## Model

In [29]:
from sklearn.naive_bayes import GaussianNB

In [30]:
# fit the model
surname_model = GaussianNB()
surname_model.fit(x_train.toarray(), y_train)

GaussianNB()

In [31]:
y_pred = surname_model.predict(x_test.toarray())

### Summary

In [32]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

In [33]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     English       0.58      0.98      0.73       135
     Russian       0.99      0.70      0.82       322

    accuracy                           0.78       457
   macro avg       0.78      0.84      0.77       457
weighted avg       0.87      0.78      0.79       457



In [34]:
print(confusion_matrix(y_test, y_pred))

[[132   3]
 [ 96 226]]


In [35]:
print(f"f1_score: {f1_score(y_test, y_pred, average='micro')} ")

f1_score: 0.7833698030634574 
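
The micro-averaged F1 printed above is the same number as the accuracy in the classification report, since for single-label classification the two coincide. As a check, the sketch below recomputes the per-class metrics directly from the confusion matrix, assuming the usual layout in which rows are true labels and columns are predicted labels, in the order English, Russian.

```python
# Confusion matrix copied from the notebook output:
# rows are true labels, columns are predicted labels.
labels = ["English", "Russian"]
matrix = [[132, 3],
          [96, 226]]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

f1_per_class = {}
for i, label in enumerate(labels):
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                 # true label i, predicted otherwise
    fp = sum(row[i] for row in matrix) - tp  # predicted i, true label otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_per_class[label] = 2 * precision * recall / (precision + recall)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1_per_class[label]:.2f}")

macro_f1 = sum(f1_per_class.values()) / len(labels)
print(f"accuracy (= micro F1): {accuracy:.4f}")
print(f"macro F1: {macro_f1:.2f}")
```

The 96 Russian names predicted as English explain both the low English precision and the low Russian recall. Because the classes are imbalanced, the macro average is arguably the more informative summary here.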


In [36]:
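def good_turing_smoothing(n_gram):
    # N_c: number of distinct bigrams that occur exactly c times
    freq_of_freqs = Counter(n_gram.values())
    smoothing = {}

    for k, c in n_gram.items():
        n_c = freq_of_freqs[c]
        n_c_plus_1 = freq_of_freqs[c + 1]
        # adjusted count c* = (c + 1) * N_{c+1} / N_c
        # design choice: keep the raw count when no bigram occurs c + 1 times,
        # because the formula would otherwise give 0
        smoothing[k] = (c + 1) * n_c_plus_1 / n_c if n_c_plus_1 else c

    return smoothing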

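__Note:__ The smoothing method we use is Good-Turing discounting. It lowers the counts of observed bigrams slightly so that some probability mass is left over for bigrams that have not yet occurred.

Equation: $c^* = (c + 1) \frac{N_{c+1}}{N_c}$, where $c$ is the observed count of a bigram and $N_c$ is the number of distinct bigrams that occur exactly $c$ times.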
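
A small worked example may help with the notation. The counts below are hypothetical and chosen only so that the arithmetic is easy to follow; `adjusted_count` is an illustrative helper that returns `None` where the formula cannot be applied.

```python
from collections import Counter

# Hypothetical bigram counts, not taken from the notebook's corpus.
toy_counts = Counter({"ov": 3, "in": 2, "ko": 2, "ev": 1, "ch": 1, "er": 1})

# N_c: how many distinct bigrams were seen exactly c times.
freq_of_freqs = Counter(toy_counts.values())  # N_1 = 3, N_2 = 2, N_3 = 1
total = sum(toy_counts.values())              # N = 10 observed bigram tokens

# Probability mass reserved for bigrams that were never seen: N_1 / N.
unseen_mass = freq_of_freqs[1] / total

def adjusted_count(c):
    """c* = (c + 1) * N_{c+1} / N_c, or None when N_{c+1} is zero."""
    if freq_of_freqs[c + 1] == 0:
        return None
    return (c + 1) * freq_of_freqs[c + 1] / freq_of_freqs[c]

print(unseen_mass)        # share of probability given to unseen bigrams
print(adjusted_count(1))  # 2 * N_2 / N_1
print(adjusted_count(2))  # 3 * N_3 / N_2
print(adjusted_count(3))  # N_4 is zero, so the formula does not apply
```

The most frequent bigram has no $N_{c+1}$ to draw on, which is why practical variants such as Simple Good-Turing first smooth the $N_c$ values before applying the formula.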

In [37]:
# smoothed bigram counts for English
eng_meta = good_turing_smoothing(eng_gram)

In [39]:
# smoothed bigram counts for Russian
rus_meta = good_turing_smoothing(rus_gram)