## Introduction to Sequence Modeling - Russian vs English Surnames

---
### Goal
---

Develop a classifier for Russian vs English surnames.

In this iteration we are going to:
* Compute bigram frequencies for English names.
* Compute bigram frequencies for Russian names.
* Develop a bag of bigrams model for distinguishing English and Russian names.
* Implement Katz's Back-Off Model Smoothing
* Test performance of model using English data.

------


In [1]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import re

In [2]:
import nltk
# from nltk import bigrams, trigrams, word_tokenize
import collections
from collections import defaultdict, Counter

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.corpus import brown # corpus of english words

In [3]:
from sklearn.model_selection import train_test_split
# from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import MultinomialNB # multinomial Naive Bayes for discrete count features

from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer # tokenizes text and normalizes

---
### Let's perform some EDA

---

In [4]:
# read the csv file into data frame.
surname_csv = "data_set/russian_and_english_dev.csv"
surname_df = pd.read_csv(surname_csv, index_col = None, encoding="UTF-8")

In [5]:
# rename dev data columns.
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

#### Features Exploration

In [6]:
# check the number of rows and columns before dropping missing values
surname_df.shape

(1306, 2)

In [7]:
surname_df = surname_df.dropna()

In [8]:
surname_df.shape

(1305, 2)

In [9]:
surname_df.to_csv("data_set/corpus/surnames_names.txt", sep='\t', index=False, header=False)

In [10]:
# Retrieve English names only
english_df = surname_df.loc[surname_df["nationality"] == "English"]
english_df = english_df[["surname"]]
english_df.head()

Unnamed: 0,surname
940,Fairhurst
941,Wateridge
942,Nemeth
943,Moroney
944,Goodall


In [11]:
# save all english names in txt file
english_df.to_csv("data_set/corpus/english_names.txt", sep='\t', index=False, header=False)

In [12]:
# Russian names only
russian_df = surname_df.loc[surname_df["nationality"] == "Russian"]
russian_df = russian_df[["surname"]]
russian_df.head()

Unnamed: 0,surname
0,Mokrousov
1,Nurov
2,Judovich
3,Mikhailjants
4,Jandarbiev


In [13]:
russian_df.to_csv("data_set/corpus/russian_names.txt", sep='\t', index=False, header=False)

---
### Create New Corpus

---
Create a new corpus of English and Russian names to be used for the n-gram model.

In [14]:
# English: read one surname per line, strip trailing whitespace, and lowercase
with open("data_set/corpus/english_names.txt", "r", encoding="utf-8") as names:
    english_names = [x.rstrip().lower() for x in names]
# english_names

In [15]:
# Russian: read one surname per line, strip trailing whitespace, and lowercase
with open("data_set/corpus/russian_names.txt", "r", encoding="utf-8") as names:
    russian_names = [x.rstrip().lower() for x in names]
# russian_names

-------
### Calculate Frequencies and Probabilities

-----

In [16]:
# generate bigrams and frequencies
def generate_bigrams(names):
    n_gram = collections.Counter()
    for c in names:
        n_gram.update(Counter(c[idx : idx + 2] for idx in range(len(c) - 1)))
        
    return n_gram
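
To make the counting step concrete, here is a small sketch of what a character-bigram counter of this kind returns on a toy list. The function name `count_char_bigrams` and the two sample names are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def count_char_bigrams(names):
    """Count overlapping two-character windows across all names."""
    counts = Counter()
    for name in names:
        # a name of length n contributes n - 1 bigrams
        counts.update(name[i:i + 2] for i in range(len(name) - 1))
    return counts

# Toy input for illustration only.
toy_counts = count_char_bigrams(["nurov", "petrov"])
print(toy_counts.most_common(3))
```

Both toy names end in "rov", so "ro" and "ov" are each counted twice while every other bigram is counted once.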

In [17]:
# print bigram frequencies in descending order
def freq_sorted(n_gram):
    for key, value in sorted(n_gram.items(), key=lambda x: x[1], reverse=True):
        print(key, value)

In [18]:
# retrieve english bigrams
eng_gram = generate_bigrams(english_names)

In [19]:
# sort frequencies
# freq_sorted(eng_gram)

In [20]:
rus_gram = generate_bigrams(russian_names)

In [136]:
n_grams = generate_bigrams(surname_df['surname'])

In [21]:
# freq_sorted(rus_gram)

__Question__: What bigram is most informative for distinguishing between English and Russian names?

__Observation__: English top 5 bigrams:

er : 49

on : 47

ar : 40

in : 37

le : 37

__Observation__: Russian top 5 bigrams:

ov : 342

in : 198

ko : 159

ev : 133

ch : 117
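
Raw counts alone do not answer the question: "in" appears in both top-5 lists, and the Russian counts are larger partly because there are more Russian names. One possible way to rank bigrams is by the log-ratio of their smoothed relative frequencies in each language. The sketch below assumes add-one smoothing and uses made-up toy counts; the helper name `bigram_log_ratios` is hypothetical, and in the notebook the inputs would be `eng_gram` and `rus_gram`.

```python
import math
from collections import Counter

def bigram_log_ratios(eng_counts, rus_counts):
    """Return {bigram: log(P(bigram | Russian) / P(bigram | English))}.

    Add-one smoothing (an assumption of this sketch) keeps bigrams that
    are missing from one language from producing log(0).
    """
    vocab = set(eng_counts) | set(rus_counts)
    eng_total = sum(eng_counts.values()) + len(vocab)
    rus_total = sum(rus_counts.values()) + len(vocab)
    return {
        bg: math.log((rus_counts.get(bg, 0) + 1) / rus_total)
            - math.log((eng_counts.get(bg, 0) + 1) / eng_total)
        for bg in vocab
    }

# Toy counts for illustration only; they are not the notebook's data.
toy_eng = Counter({"er": 5, "on": 4, "in": 3})
toy_rus = Counter({"ov": 9, "in": 5, "ko": 4})

ratios = bigram_log_ratios(toy_eng, toy_rus)
most_russian = max(ratios, key=ratios.get)  # largest positive log-ratio
most_english = min(ratios, key=ratios.get)  # most negative log-ratio
print(most_russian, most_english)
```

A bigram that is frequent in both languages, such as "in" in the toy counts, ends up with a log-ratio near zero, so it carries little information despite its high count.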

### Feature Selection

In [144]:
# features (x): surnames with non-alphabetic characters removed, used to predict nationality
features = surname_df['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x))
target = surname_df['nationality'] # what we are predicting (y)

------
## Naive Bayes

------

### English Surnames

__Observation__: There are significantly more Russian names than English names in the dataset, so the classes are imbalanced.

In [242]:
# earlier attempt (commented out): character unigrams and bigrams with TF-IDF normalization
# cv = CountVectorizer(lowercase=True, analyzer='char', ngram_range=(1,2), strip_accents="ascii", min_df=0.0, max_df=1.0)
# X_freq = cv.fit_transform(features.ravel())
# tf_transformer = TfidfTransformer(use_idf=True).fit(X_freq)
# X = tf_transformer.transform(X_freq)

# vectorize features: character n-grams of length 1 to 4, keeping only those found in at least 9% of the surnames
cv = CountVectorizer(lowercase=True, analyzer='char', ngram_range=(1,4), strip_accents="ascii", min_df=0.09, max_df=1.0)
X = cv.fit_transform(features)
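
The `min_df=0.09` setting is why the feature matrix ends up with only a few dozen columns: an n-gram is kept only if it occurs in at least 9% of the surnames. The pure-Python sketch below approximates that document-frequency filter on a toy list. It is not scikit-learn's implementation (it skips accent stripping, for example), and the helper names and toy names are assumptions made for illustration.

```python
from collections import Counter

def char_ngrams(text, n_min, n_max):
    """All overlapping character n-grams of length n_min..n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def frequent_ngrams(names, n_min=1, n_max=4, min_df=0.09):
    """Keep n-grams that occur in at least a min_df share of the names."""
    doc_freq = Counter()
    for name in names:
        # set(): each name counts at most once per n-gram (document frequency)
        doc_freq.update(set(char_ngrams(name.lower(), n_min, n_max)))
    threshold = min_df * len(names)
    return sorted(g for g, df in doc_freq.items() if df >= threshold)

# Toy input; min_df is raised to 0.5 so the filter is visible on four names.
toy_names = ["Nurov", "Petrov", "Goodall", "Moroney"]
kept = frequent_ngrams(toy_names, min_df=0.5)
print(kept)
```

With a threshold this strict, most surviving features are single letters and short, common endings, which suggests why a 9% cutoff on the real data leaves so few columns.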

In [243]:
target.shape

(1305,)

In [244]:
X.shape

(1305, 28)

### Train/Test Data

To estimate how well the model predicts surnames it has not seen during training, we split the data into train (65%) and test (35%) sets.

In [245]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [246]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, target, test_size=0.35, random_state = 32)
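
As a quick sanity check on the split sizes, the arithmetic can be reproduced by hand. The sketch assumes the test count is rounded up, which agrees with the 457 test rows that appear in the classification report later in the notebook.

```python
import math

n_samples = 1305   # rows remaining after dropna()
test_size = 0.35

# Assumption: the fractional test count (456.75) is rounded up.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

Because the classes are imbalanced, it may also be worth trying the `stratify` argument of `train_test_split` so that both sets keep a similar English-to-Russian ratio; the notebook does not do this.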

## Model

In [247]:
from sklearn.naive_bayes import GaussianNB

In [248]:
# fit the model (GaussianNB needs a dense array, so the sparse count matrix is converted first)
surname_model = GaussianNB()
surname_model.fit(x_train.toarray(), y_train)

GaussianNB()
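
For comparison with the bag-of-bigrams goal stated at the top, the sketch below shows how a multinomial Naive Bayes classifier over character bigrams could be written from scratch. This is a different variant from the `GaussianNB` model fitted above, and it is not the notebook's method. The function names are hypothetical, add-one smoothing is an assumption, and the toy training names are the ten surnames shown in the `head()` outputs earlier.

```python
import math
from collections import Counter

def bigrams(name):
    """Lowercased, overlapping character bigrams of a name."""
    name = name.lower()
    return [name[i:i + 2] for i in range(len(name) - 1)]

def train(names_by_label):
    """Collect per-class bigram counts, class priors, and the shared vocabulary."""
    counts = {label: Counter(bg for n in names for bg in bigrams(n))
              for label, names in names_by_label.items()}
    vocab = set().union(*counts.values())
    n_names = sum(len(names) for names in names_by_label.values())
    priors = {label: len(names) / n_names
              for label, names in names_by_label.items()}
    return counts, priors, vocab

def predict(name, counts, priors, vocab):
    """Pick the class with the highest add-one-smoothed log posterior."""
    scores = {}
    for label, label_counts in counts.items():
        denom = sum(label_counts.values()) + len(vocab)
        scores[label] = math.log(priors[label]) + sum(
            math.log((label_counts[bg] + 1) / denom) for bg in bigrams(name))
    return max(scores, key=scores.get)

# Toy training set: the surnames displayed in the head() outputs above.
toy_data = {
    "Russian": ["Mokrousov", "Nurov", "Judovich", "Mikhailjants", "Jandarbiev"],
    "English": ["Fairhurst", "Wateridge", "Nemeth", "Moroney", "Goodall"],
}
counts, priors, vocab = train(toy_data)
print(predict("Ivanov", counts, priors, vocab))
print(predict("Goodwin", counts, priors, vocab))
```

Working in log space avoids numerical underflow from multiplying many small probabilities, and the smoothing keeps a bigram that was never seen in one class from zeroing out that class's score.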

In [249]:
y_pred = surname_model.predict(x_test.toarray())

### Summary

In [250]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

In [256]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     English       0.58      0.98      0.73       135
     Russian       0.99      0.70      0.82       322

    accuracy                           0.78       457
   macro avg       0.78      0.84      0.77       457
weighted avg       0.87      0.78      0.79       457



In [252]:
print(confusion_matrix(y_test, y_pred))

[[132   3]
 [ 96 226]]
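
The per-class numbers in the classification report can be recovered directly from this confusion matrix. The sketch below assumes the usual scikit-learn layout, with rows as true labels and columns as predictions, in sorted label order (English, Russian); the helper name `per_class_metrics` is illustrative.

```python
# Confusion matrix copied from the output above:
# rows are true labels, columns are predicted labels.
cm = [[132, 3],
      [96, 226]]
labels = ["English", "Russian"]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class index k."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp  # predicted k, truly another class
    fn = sum(cm[k]) - tp                             # truly k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k, label in enumerate(labels):
    p, r, f = per_class_metrics(cm, k)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The 96 Russian names predicted as English account for both the low English precision (0.58) and the low Russian recall (0.70).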


In [255]:
print(f"f1_score: {f1_score(y_test, y_pred, average='micro')} ")

f1_score: 0.7833698030634574 
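
This value matches the accuracy in the report, which is expected: with `average='micro'`, true positives, false positives, and false negatives are pooled over all classes, and in a single-label problem every misclassification is one false positive and one false negative. A short check using the confusion matrix values printed earlier:

```python
# Confusion matrix from the notebook output (rows: true, columns: predicted).
cm = [[132, 3],
      [96, 226]]
n_classes = len(cm)

# Pool the counts over both classes.
tp = sum(cm[k][k] for k in range(n_classes))
fp = sum(cm[r][k] for k in range(n_classes) for r in range(n_classes) if r != k)
fn = sum(cm[k][c] for k in range(n_classes) for c in range(n_classes) if c != k)

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

accuracy = tp / sum(sum(row) for row in cm)
print(micro_f1, accuracy)
```

For a view that weights both nationalities equally despite the imbalance, the macro-averaged F1 (0.77 in the report) is the more informative figure.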
