## Introduction to Sequence Modeling - Russian vs English Surnames

---
### Goal
---

Develop a classifier for Russian vs English surnames.

In this iteration we are going to:
* Compute bigram frequencies for English names.
* Compute bigram frequencies for Russian names.
* Develop a bag of bigrams model for distinguishing English and Russian names.
* Implement Katz's back-off smoothing.
* Test performance of model using English data.
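
The list above names Katz's back-off smoothing, so here is a minimal sketch of the idea for character bigrams. It is simplified: a fixed discount (the hypothetical `discount=0.5`) stands in for the Good-Turing discounting of the full Katz model, names are assumed to contain only lowercase a-z, and `train_backoff` and `backoff_prob` are illustrative names rather than part of this notebook.

```python
from collections import Counter, defaultdict
import string

ALPHABET = string.ascii_lowercase  # assumption: names contain only a-z


def train_backoff(names, discount=0.5):
    """Collect unigram and bigram statistics for a back-off model."""
    unigrams = Counter()
    followers = defaultdict(Counter)  # followers[c1][c2] = count of bigram c1c2
    for name in names:
        unigrams.update(name)
        for i in range(len(name) - 1):
            followers[name[i]][name[i + 1]] += 1
    # Add-one smoothed unigram distribution so every letter has some mass
    total = sum(unigrams[c] for c in ALPHABET) + len(ALPHABET)
    p_uni = {c: (unigrams[c] + 1) / total for c in ALPHABET}
    return p_uni, followers, discount


def backoff_prob(c1, c2, model):
    """P(c2 | c1): discounted bigram estimate, else back off to unigrams."""
    p_uni, followers, d = model
    seen = followers.get(c1)
    if not seen:
        return p_uni[c2]  # context never observed: use the unigram estimate
    context_total = sum(seen.values())
    if c2 in seen:
        return (seen[c2] - d) / context_total
    # Probability mass freed by discounting, shared among unseen continuations
    alpha = d * len(seen) / context_total
    unseen_mass = sum(p for c, p in p_uni.items() if c not in seen)
    return alpha * p_uni[c2] / unseen_mass


model = train_backoff(["nurov", "murov", "ukhov"])
print(backoff_prob("o", "v", model), backoff_prob("o", "z", model))
```

For any observed context, the probabilities over the alphabet sum to one, because the mass removed from seen bigrams is exactly what the unseen ones receive.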

------


In [1]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import re

In [86]:
import nltk
# from nltk import bigrams, trigrams, word_tokenize
import collections
from collections import defaultdict, Counter

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.corpus import brown # corpus of english words

In [3]:
from sklearn.linear_model import LinearRegression # used in the regression section below
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB # multinomial Naive Bayes for discrete count features

from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer # tokenizes text and normalizes

---
### Let's perform some EDA

---

In [4]:
# read the csv file into data frame.
surname_csv = "data_set/russian_and_english_dev.csv"
surname_df = pd.read_csv(surname_csv, index_col = None, encoding="UTF-8")

In [5]:
# rename dev data columns.
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

#### Features Exploration

In [6]:
# removing non-alphabetic characters 
# surname_df = surname_df['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x))
surname_df

Unnamed: 0,surname,nationality
0,Mokrousov,Russian
1,Nurov,Russian
2,Judovich,Russian
3,Mikhailjants,Russian
4,Jandarbiev,Russian
...,...,...
1301,Foxall,English
1302,Cowan,English
1303,Wrightson,English
1304,Loft,English
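
The commented-out cleaning step above would replace the whole data frame with a single column if it were run as written. A small sketch of the same regular expression applied name by name is shown here. The helper `clean_surname` and the sample names are hypothetical and only illustrate the behaviour.

```python
import re


def clean_surname(surname):
    """Drop every character that is not an ASCII letter."""
    return re.sub(r"[^a-zA-Z]", "", surname)


# Hypothetical examples, not taken from the data set
raw = ["O'Neill", "Smith-Jones", "Mokrousov"]
cleaned = [clean_surname(s) for s in raw]
print(cleaned)
```

With the data frame, assigning the result back to the column, as in `surname_df["surname"] = surname_df["surname"].apply(clean_surname)`, would keep the `nationality` column intact.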


In [7]:
# Retrieve English names only
english_df = surname_df.loc[surname_df["nationality"] == "English"]
english_df = english_df[["surname"]]
english_df.head()

Unnamed: 0,surname
940,Fairhurst
941,Wateridge
942,Nemeth
943,Moroney
944,Goodall


In [8]:
# save all english names in txt file
english_df.to_csv("data_set/english-corpus/english_names.txt", sep='\t', index=False, header=False)

In [95]:
# Russian names only
russian_df = surname_df.loc[surname_df["nationality"] == "Russian"]
russian_df = russian_df[["surname"]]
russian_df.head()

Unnamed: 0,surname
0,Mokrousov
1,Nurov
2,Judovich
3,Mikhailjants
4,Jandarbiev


In [10]:
russian_df.to_csv("data_set/russian-corpus/russian_names.txt", sep='\t', index=False, header=False)

---
### Create New Corpus

---
Create a new corpus of English and Russian names to be used for the n-gram model.

In [11]:
# eng_gram = defaultdict(lambda: defaultdict(lambda: 0))
# rus_gram = defaultdict(lambda: defaultdict(lambda: 0))

In [100]:
# corpus_root = 'data_set/english-corpus/'
# english_names = PlaintextCorpusReader(corpus_root, '.*')
# english_names.words('english_names.txt')

# read one name per line, strip the newline and lowercase; the file is closed automatically
with open("data_set/corpus/english_names.txt", "r") as names:
    english_names = [x.rstrip().lower() for x in names]
english_names

['fairhurst',
 'wateridge',
 'nemeth',
 'moroney',
 'goodall',
 'agar',
 'thonon',
 'duggan',
 'nash',
 'herbert',
 'mcarthur',
 'moriarty',
 'douthwaite',
 'dell',
 'wakeham',
 'mottram',
 'beamish',
 'karne',
 'greenwood',
 'ullman',
 'aldred',
 'darlington',
 'judd',
 'hicks',
 'kay',
 'dervish',
 'oakley',
 'morrison',
 'bethell',
 'vaughn',
 'knox',
 'iles',
 'trattles',
 'gibbins',
 'whelan',
 'mctaggart',
 'charnock',
 'thorley',
 'thorpe',
 'garland',
 'gunter',
 'turland',
 'turney',
 'reisser',
 'ruff',
 'newall',
 'sheppard',
 'knigge',
 'davey',
 'rodrigues',
 'smullen',
 'alam',
 'bradshaw',
 'kingston',
 'pelling',
 'auberton',
 'kennett',
 'newham',
 'ware',
 'millar',
 'wallis',
 'sugden',
 'butler',
 'lofthouse',
 'prendergast',
 'wragg',
 'francis',
 'eddleston',
 'sykes',
 'thurston',
 'ullmann',
 'reynolds',
 'eggison',
 'jackson',
 'savage',
 'ransom',
 'holdsworth',
 'hiscocks',
 'dick',
 'warden',
 'powis',
 'dunford',
 'vickars',
 'johns',
 'mcconnell',
 'leigh'

In [101]:
# corpus_root = 'data_set/russian-corpus/'
# russian_names = PlaintextCorpusReader(corpus_root, '.*')
# russian_names.words('russian_names.txt')

# read one name per line, strip the newline and lowercase; the file is closed automatically
with open("data_set/corpus/russian_names.txt", "r") as names:
    russian_names = [x.rstrip().lower() for x in names]
russian_names

['mokrousov',
 'nurov',
 'judovich',
 'mikhailjants',
 'jandarbiev',
 'govyadin',
 'tubylov',
 'tunkin',
 'turetsky',
 'remyannikov',
 'adam',
 'ablesimov',
 'bakastov',
 'munin',
 'tsenkovsky',
 'polikarpov',
 'dogel',
 'janek',
 'obolonsky',
 'marhasin',
 'abdrashitov',
 'mochalin',
 'rifkind',
 'nasonov',
 'abramchuk',
 'pohlebaev',
 'murov',
 'timaev',
 'jminko',
 'pavlenkov',
 'gaur',
 'bekhoev',
 'vainson',
 'mikhailidi',
 'kartunov',
 'batchaev',
 'jukhman',
 'talkov',
 'bagmevsky',
 'jakimchik',
 'vaidanovich',
 'vavkin',
 'privalihin',
 'gujavin',
 'jijilev',
 'guk',
 'drozdetsky',
 'ukhov',
 'muijel',
 'avdulov',
 'zhavoronkov',
 'tolbuhin',
 'ryjkin',
 'rahalsky',
 'minchenkov',
 'yuhma',
 'glavinsky',
 'zinovin',
 'zhitnikov',
 'musalnikov',
 'yanpolsky',
 'richter',
 'hamukov',
 'ageitchik',
 'bibler',
 'hismatulov',
 'bakihanov',
 'virenius',
 'avtokratov',
 'egin',
 'dubrowski',
 'jitny',
 'mojar',
 'lihtentul',
 'gulenko',
 'awtokratoff',
 'mogila',
 'gaspirovich',
 'ra

-------
### Calculate Frequencies and Probabilities

-----

In [80]:
# def generate_bigrams(word):
#     lower = word[:-1]
#     upper = word[1:]
#     bigram_gen = map(lambda l,u: l+u, lower, upper)
#     for bigram in bigram_gen:
#         yield bigram

In [None]:
# def compute_frequencies(data, language):
#     """data is an iterable.    Frequency counts for both Bigrams and Unigrams,  on a specific language
#     """
#     bigram_freqs = {a+b:0 for (a,b) in itertools.product(
#             string.ascii_lowercase, string.ascii_lowercase)}
#     letter_freqs = {a:0 for a in string.ascii_lowercase}

#     filtered = filter(lambda x: x[1] == language, data)

#     for (name, _) in filtered:
#         for letter in name:
#             letter_freqs[letter] += 1
#         for bigram in generate_bigrams(name):
#             bigram_freqs[bigram] += 1

#     return bigram_freqs, letter_freqs

# bigram_freqs = {a+b:0 for (a,b) in itertools.product(
#             string.ascii_lowercase, string.ascii_lowercase)}


In [103]:
# english bigrams
eng_gram = collections.Counter()
for name in english_names:
    eng_gram.update(Counter(name[idx : idx + 2] for idx in range(len(name) - 1)))

{'fa': 3,
 'ai': 4,
 'ir': 8,
 'rh': 4,
 'hu': 6,
 'ur': 18,
 'rs': 10,
 'st': 19,
 'wa': 11,
 'at': 10,
 'te': 11,
 'er': 49,
 'ri': 19,
 'id': 4,
 'dg': 4,
 'ge': 10,
 'ne': 28,
 'em': 6,
 'me': 4,
 'et': 8,
 'th': 23,
 'mo': 12,
 'or': 23,
 'ro': 15,
 'on': 47,
 'ey': 33,
 'go': 4,
 'oo': 9,
 'od': 13,
 'da': 6,
 'al': 14,
 'll': 33,
 'ag': 6,
 'ga': 10,
 'ar': 40,
 'ho': 15,
 'no': 7,
 'du': 4,
 'ug': 5,
 'gg': 5,
 'an': 36,
 'na': 6,
 'as': 12,
 'sh': 12,
 'he': 10,
 'rb': 3,
 'be': 11,
 'rt': 16,
 'mc': 3,
 'ca': 1,
 'ia': 1,
 'ty': 3,
 'do': 10,
 'ou': 15,
 'ut': 6,
 'hw': 1,
 'it': 5,
 'de': 19,
 'el': 25,
 'ak': 7,
 'ke': 16,
 'eh': 1,
 'ha': 17,
 'am': 14,
 'ot': 4,
 'tt': 12,
 'tr': 7,
 'ra': 25,
 'ea': 10,
 'mi': 3,
 'is': 15,
 'ka': 4,
 'rn': 8,
 'gr': 6,
 're': 23,
 'ee': 6,
 'en': 26,
 'nw': 2,
 'wo': 9,
 'ul': 9,
 'lm': 4,
 'ma': 20,
 'ld': 13,
 'dr': 6,
 'ed': 7,
 'rl': 7,
 'li': 9,
 'in': 37,
 'ng': 14,
 'gt': 3,
 'to': 23,
 'ju': 2,
 'ud': 3,
 'dd': 7,
 'hi': 10,
 'i

In [None]:
dict(eng_gram)

In [111]:
# print bigrams from most to least frequent
for key, value in eng_gram.most_common():
    print(key, value)

er 49
on 47
ar 40
in 37
le 37
an 36
ey 33
ll 33
ne 28
en 26
el 25
ra 25
th 23
or 23
re 23
to 23
ck 21
ma 20
st 19
ri 19
de 19
ur 18
es 18
ha 17
so 17
la 17
rt 16
ke 16
ro 15
ho 15
ou 15
is 15
al 14
am 14
ng 14
ns 14
pe 14
nd 14
se 14
od 13
ld 13
il 13
nn 13
ol 13
mo 12
as 12
sh 12
tt 12
ic 12
wa 11
te 11
be 11
rd 11
co 11
rs 10
at 10
ge 10
ga 10
he 10
do 10
ea 10
hi 10
rr 10
lo 10
oo 9
wo 9
ul 9
li 9
ki 9
ow 9
ir 8
et 8
rn 8
gh 8
pa 8
ve 8
wi 8
we 8
no 7
ak 7
tr 7
ed 7
rl 7
dd 7
vi 7
ch 7
oc 7
un 7
ei 7
ss 7
ad 7
di 7
ie 7
hu 6
em 6
da 6
ag 6
na 6
ut 6
gr 6
ee 6
dr 6
bi 6
ta 6
nt 6
ds 6
fo 6
um 6
fi 6
ug 5
gg 5
it 5
ks 5
ay 5
au 5
bb 5
ru 5
ni 5
ig 5
aw 5
dl 5
ja 5
ac 5
op 5
ls 5
ti 5
ry 5
ai 4
rh 4
id 4
dg 4
me 4
go 4
du 4
ot 4
ka 4
lm 4
va 4
kn 4
gi 4
gu 4
tu 4
av 4
sm 4
br 4
bu 4
us 4
om 4
lk 4
cu 4
cr 4
pl 4
wn 4
ms 4
fa 3
rb 3
mc 3
ty 3
mi 3
gt 3
ud 3
oa 3
kl 3
ox 3
tl 3
ib 3
wh 3
ff 3
pp 3
rg 3
wr 3
fr 3
nc 3
sa 3
sw 3
po 3
ob 3
oy 3
yl 3
dm 3
ya 3
mp 3
ye 3
mm 3
bo 3
os 3
si 3
d

In [28]:
# russian bigrams
rus_gram = collections.Counter()
for name in russian_names:
    rus_gram.update(Counter(name[idx : idx + 2] for idx in range(len(name) - 1)))
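
Once bigram counts exist for both languages, they can be turned into probabilities and used to score a name. The sketch below is one way to do that, not necessarily the approach intended here: it treats a name as a bag of bigrams, uses add-one smoothing over an assumed vocabulary of 26 × 26 letter pairs, assumes equal class priors, and compares log-likelihoods. The function names and the tiny training lists are illustrative.

```python
import math
from collections import Counter


def bigram_counts(names):
    """Count character bigrams over a list of lowercase names."""
    counts = Counter()
    for name in names:
        counts.update(name[i:i + 2] for i in range(len(name) - 1))
    return counts


def log_likelihood(name, counts, vocab_size=26 * 26):
    """Sum of add-one smoothed log probabilities of the name's bigrams."""
    total = sum(counts.values())
    score = 0.0
    for i in range(len(name) - 1):
        bigram = name[i:i + 2]
        score += math.log((counts[bigram] + 1) / (total + vocab_size))
    return score


def classify(name, eng_counts, rus_counts):
    """Pick the language whose bigram model gives the higher score."""
    name = name.lower()
    eng = log_likelihood(name, eng_counts)
    rus = log_likelihood(name, rus_counts)
    return "Russian" if rus > eng else "English"


# Tiny stand-ins for the full name lists
eng_counts = bigram_counts(["fairhurst", "wateridge", "goodall"])
rus_counts = bigram_counts(["mokrousov", "nurov", "polikarpov"])
print(classify("Petrov", eng_counts, rus_counts))
```

Working in log space avoids underflow when many small probabilities are multiplied, and the add-one term keeps unseen bigrams from producing a probability of zero.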

In [40]:
dict(english_bigrams)

{None: defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'Fairhurst': 1.0}),
 'Fairhurst': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'Wateridge': 1.0}),
 'Wateridge': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'Nemeth': 1.0}),
 'Nemeth': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'Moroney': 1.0}),
 'Moroney': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'Goodall': 1.0}),
 'Goodall': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'Agar': 1.0}),
 'Agar': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'Thonon': 1.0}),
 'Thonon': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'Duggan': 1.0}),
 'Duggan': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
             {'Nash': 1.0}),
 'Nash': defaultdict(<function __main__.<lambda>.<locals>.

------
## Multiple Linear Regression

------

### Train/Test Data

To evaluate the model on names it has not seen during training, we are going to split the surnames into train (65%) and test (35%) datasets.
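
The split relies on a feature matrix `X` and a vector `labels` that are not built in the cells shown. One plausible construction, offered only as a sketch, is a character-bigram count matrix from `CountVectorizer` and a 0/1 label per name. The ten names are a toy stand-in for the full data, and the `stratify` argument is an addition that keeps both classes represented in each part.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the surname data (names appear in the listings above)
surnames = ["Mokrousov", "Nurov", "Judovich", "Jandarbiev", "Govyadin",
            "Fairhurst", "Wateridge", "Nemeth", "Moroney", "Goodall"]
nationalities = ["Russian"] * 5 + ["English"] * 5

# 1 for Russian, 0 for English
labels = [1 if n == "Russian" else 0 for n in nationalities]

# Bag of character bigrams: one column per letter pair seen in the names
cv = CountVectorizer(analyzer="char", ngram_range=(2, 2), lowercase=True)
X = cv.fit_transform(surnames)

# 65% train / 35% test, keeping the class balance in both parts
x_train, x_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.35, random_state=32, stratify=labels
)
print(x_train.shape, x_test.shape)
```

Fitting the vectorizer on the training names only, and calling `cv.transform` on the test names, would keep test-set bigrams out of the vocabulary.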

In [None]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, labels, test_size=0.35, random_state = 32)

In [None]:
y_train.shape

In [None]:
x_train.shape

### Linear Regression


In [None]:
russian_model = LinearRegression()
russian_model.fit(x_train, y_train)

In [None]:
intercept = russian_model.intercept_
intercept

In [None]:
weight = russian_model.coef_
weight

### Test Data and Predictions

In [None]:
surname_test['label'] = [1 if x =='Russian' else 0 for x in surname_test['nationality']]
labels = surname_test["label"]

In [None]:
# test data: reuse the vectorizer fitted on the training names so the feature columns match the model
cv_feature = cv.transform(surname_test_list)
tf_transformer = TfidfTransformer(use_idf=False).fit(cv_feature)
reshape_feature = tf_transformer.transform(cv_feature)

In [None]:
russianess = russian_model.predict(reshape_feature)
russianess

#### -Model Summary-

In [None]:
# convert to same type as russianess (y_pred)
reshape_feature = reshape_feature.toarray()

In [None]:
from statsmodels.api import OLS
OLS(labels,russianess).fit().summary()
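
The regression output `russianess` is a continuous score rather than a class label. One way to judge it as a classifier, sketched here under the assumption of a 0.5 cut-off (the notebook does not state one), is to threshold the scores and compare them with the 0/1 labels. The function name and the sample numbers are made up for illustration.

```python
def accuracy_at_threshold(scores, labels, threshold=0.5):
    """Share of names whose thresholded score matches the 0/1 label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical scores and labels (1 = Russian, 0 = not Russian)
example_scores = [0.9, 0.2, 0.6, 0.4]
example_labels = [1, 0, 0, 1]
print(accuracy_at_threshold(example_scores, example_labels))
```

With the notebook's variables this would be called as `accuracy_at_threshold(russianess, labels)`.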

#### -Observations-

In [None]:
pred_name1 = ["Wasem"]
reshape_feature = cv.transform(pred_name1)
russian_model.predict(reshape_feature)

Note: __Wasem__ is an Arabic name. The model seems to think it is Russian due to a similarity in spelling, so it is misclassified.

In [None]:
pred_name2 = ["See"]
reshape_feature = cv.transform(pred_name2)
russian_model.predict(reshape_feature)

Note: __See__ is a Dutch name. This is correct.

In [None]:
pred_name3 = ["Los"]
reshape_feature = cv.transform(pred_name3)
russian_model.predict(reshape_feature)

Note: __Los__ is a Russian name. It has been misclassified as not Russian, most likely because Spanish has a similar name.