## Introduction to Prediction using Surnames Analysis

---
### Goal
---

Predict whether a name is of Russian origin or not.

In this iteration we are going to:
* build a unigram model (bag of characters)
* learn the weights for the Russian-language predictor
* implement multiple linear regression
* test predictions using test data
* compute accuracy, recall, and precision for Russian names.

------


In [1]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import re

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer # tokenizes text and normalizes

---
### Let's perform some EDA

---

In [3]:
# read the csv file into data frame.
surname_csv = "data_set/surnames_dev.csv"
surname_test_csv = "data_set/surnames_test.csv"

surname_df = pd.read_csv(surname_csv, index_col = None, encoding="UTF-8")
surname_test = pd.read_csv(surname_test_csv, index_col = None, encoding="UTF-8")

In [4]:
# rename dev data columns.
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)
surname_test.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

#### Features Exploration

In [5]:
# making sure it works - delete later
surname_df.tail(100)

Unnamed: 0,surname,nationality
2903,Timin,Russian
2904,Porshnev,Russian
2905,Gassan,Russian
2906,Drozdetsky,Russian
2907,Bawvykin,Russian
...,...,...
2998,Banh,Vietnamese
2999,Thach,Vietnamese
3000,Hoang,Vietnamese
3001,Do,Vietnamese


In [6]:
surname_df.info

<bound method DataFrame.info of        surname nationality
0     Fakhoury      Arabic
1         Toma      Arabic
2        Koury      Arabic
3         Bata      Arabic
4       Samaha      Arabic
...        ...         ...
2998      Banh  Vietnamese
2999     Thach  Vietnamese
3000     Hoang  Vietnamese
3001        Do  Vietnamese
3002      Than  Vietnamese

[3003 rows x 2 columns]>

In [7]:
# surname_list = surname_df["surname"] # all nationalities

# removing non-alphabetic characters 
surname_list = surname_df['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x))
surname_test_list = surname_test['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x))

# russian_rows = surname_df.loc[surname_df["nationality"] == "Russian"] # russian rows only
# russian_surnames = russian_rows["surname"] # russian names only
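
One side effect of the regex above is worth checking: `[^a-zA-Z]` also removes accented letters entirely, so the `strip_accents` option used later in the vectorizer never gets to see them. The sketch below uses made-up names (not rows from the dataset), and `fold_then_strip` is a hypothetical alternative, not something this notebook uses.

```python
import re
import unicodedata

def strip_non_letters(name):
    """Same regex as the cell above: drop everything outside a-z / A-Z."""
    return re.sub('[^a-zA-Z]', '', name)

def fold_then_strip(name):
    """Hypothetical variant: decompose accents first so the base letter survives."""
    decomposed = unicodedata.normalize('NFKD', name)
    return re.sub('[^a-zA-Z]', '', decomposed)

# made-up sample names; "M\u00fcller" is "Muller" with a u-umlaut
samples = ["O'Neill", "M\u00fcller", "De la Cruz"]

plain = [strip_non_letters(s) for s in samples]
folded = [fold_then_strip(s) for s in samples]

print(plain)   # the umlauted letter disappears completely
print(folded)  # the umlauted letter is kept as a plain "u"
```

Whether this matters depends on how many accented surnames the CSV files actually contain, which we have not checked here.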

In [8]:
surname_df

Unnamed: 0,surname,nationality
0,Fakhoury,Arabic
1,Toma,Arabic
2,Koury,Arabic
3,Bata,Arabic
4,Samaha,Arabic
...,...,...
2998,Banh,Vietnamese
2999,Thach,Vietnamese
3000,Hoang,Vietnamese
3001,Do,Vietnamese


In [9]:
# Creating another column for when surname is Russian or not.
surname_df['label'] = [1 if x =='Russian' else 0 for x in surname_df['nationality']]
labels = surname_df["label"]

In [10]:
surname_list.head()

0    Fakhoury
1        Toma
2       Koury
3        Bata
4      Samaha
Name: surname, dtype: object

In [11]:
labels.head()

0    0
1    0
2    0
3    0
4    0
Name: label, dtype: int64

In [12]:
surname_df.groupby("label").count()

label,surname,nationality
0,1592,1592
1,1411,1411
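
The two counts above give a quick baseline: a predictor that always answers with the majority class ("not Russian") is right a little over half the time. A small sketch of that arithmetic, with the counts copied by hand from the output above:

```python
# counts copied from the groupby output above (label -> number of surnames)
counts = {0: 1592, 1: 1411}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# accuracy of always predicting the majority label
null_accuracy = counts[majority_label] / total
russian_share = counts[1] / total

print(f"total rows: {total}")
print(f"majority label: {majority_label}")
print(f"null accuracy: {null_accuracy:.4f}")
print(f"share of Russian names: {russian_share:.4f}")
```

Any model trained below should beat this number on comparable data to be considered useful.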


In [13]:
surname_list.shape

(3003,)

In [14]:
labels.shape

(3003,)

---
### Tokenize Data

---
Create a bag of characters (unigram model).

In [15]:
# vectorize features - unigrams only
cv = CountVectorizer(lowercase=True, analyzer='char', ngram_range=(1,1), strip_accents="ascii", min_df=0.0, max_df=1.0)
# X = cv.fit_transform(surname_list)
X_freq = cv.fit_transform(surname_list)

# X_freq = cv.fit_transform(target, test_target)

tf_transformer = TfidfTransformer(use_idf=False).fit(X_freq)
X = tf_transformer.transform(X_freq) # weights

In [16]:
print(X.toarray())

[[0.35355339 0.         0.         ... 0.         0.35355339 0.        ]
 [0.5        0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.4472136  0.        ]
 ...
 [0.4472136  0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.5        0.         0.         ... 0.         0.         0.        ]]
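
The values above are letter counts divided by the length (L2 norm) of each row, which is what `TfidfTransformer(use_idf=False)` does with its default settings. The sketch below redoes that by hand for the first two surnames; `char_unigram_vector` is a hypothetical helper written for this illustration only.

```python
import math
from collections import Counter

def char_unigram_vector(name):
    """Count a-z letters in a name and scale the counts to unit L2 length."""
    letters = [ch for ch in name.lower() if 'a' <= ch <= 'z']
    counts = Counter(letters)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # no letters at all, nothing to normalize
    return {ch: c / norm for ch, c in sorted(counts.items())}

fakhoury = char_unigram_vector("Fakhoury")  # 8 distinct letters
toma = char_unigram_vector("Toma")          # 4 distinct letters

print(fakhoury)
print(toma)
```

"Fakhoury" has eight distinct letters, so each one gets 1/sqrt(8), roughly 0.3536; "Toma" has four, so each gets 0.5. Both agree with the first two rows printed above.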


In [17]:
X.shape

(3003, 26)

In [18]:
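# alphabet
# note: get_feature_names() was removed in scikit-learn 1.2; newer versions need get_feature_names_out()
cv.get_feature_names()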

['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

In [19]:
# output to csv - delete later
# l = DataFrame(X.A, columns=cv.get_feature_names())
# l.to_csv("data_set/vector_normalization.csv", index = False)

In [20]:
# print(DataFrame(X.A, columns=cv.get_feature_names()).to_string())

In [21]:
# print(X)

------
## Multiple Linear Regression

------

### Train/Test Data

To check the model on names it has not seen during fitting, we split the surnames into train (65%) and test (35%) datasets.

In [22]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, labels, test_size=0.35, random_state = 32)

In [23]:
# print("Training set: %d samples" % len(x_train))
# print("Test set: %d samples" % len(x_test))

print(x_train[0].shape)

(1, 26)


In [24]:
print(y_train[0].shape)

()


In [25]:
y_train.shape

(1951,)

In [26]:
x_train.shape

(1951, 26)
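
The 1951 rows follow from the 65/35 split of 3003 surnames. As far as we understand `train_test_split`, a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to training; the sketch below assumes that rule and reproduces the shape shown above.

```python
import math

n_samples = 3003   # rows in the dev set
test_size = 0.35   # fraction passed to train_test_split

# assumption: the test count is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}")
print(f"test rows:  {n_test}")
```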

In [27]:
print(x_train[:10])

  (0, 0)	0.4472135954999579
  (0, 12)	0.4472135954999579
  (0, 14)	0.4472135954999579
  (0, 15)	0.4472135954999579
  (0, 17)	0.4472135954999579
  (1, 3)	0.4082482904638631
  (1, 8)	0.4082482904638631
  (1, 10)	0.4082482904638631
  (1, 12)	0.4082482904638631
  (1, 17)	0.4082482904638631
  (1, 20)	0.4082482904638631
  (2, 3)	0.6324555320336759
  (2, 13)	0.6324555320336759
  (2, 14)	0.31622776601683794
  (2, 20)	0.31622776601683794
  (3, 1)	0.3333333333333333
  (3, 3)	0.3333333333333333
  (3, 5)	0.6666666666666666
  (3, 10)	0.3333333333333333
  (3, 14)	0.3333333333333333
  (3, 20)	0.3333333333333333
  (4, 3)	0.35355339059327373
  (4, 6)	0.35355339059327373
  (4, 7)	0.35355339059327373
  (4, 8)	0.35355339059327373
  :	:
  (6, 3)	0.30151134457776363
  (6, 4)	0.30151134457776363
  (6, 14)	0.6030226891555273
  (6, 17)	0.30151134457776363
  (6, 18)	0.30151134457776363
  (6, 19)	0.30151134457776363
  (6, 21)	0.30151134457776363
  (7, 0)	0.5
  (7, 12)	0.5
  (7, 19)	0.5
  (7, 20)	0.5
  (8, 4)	0.5

In [28]:
print(y_train[:10])

437     0
2296    1
467     0
2369    1
1631    1
2128    1
2564    1
46      0
2913    1
1221    0
Name: label, dtype: int64


#### Linear Regression


In [29]:
russian_model = LinearRegression()
russian_model.fit(x_train, y_train)

LinearRegression()

In [30]:
intercept = russian_model.intercept_
intercept

-0.5841042407646291

In [31]:
weight = russian_model.coef_
weight

array([ 0.26901713,  0.60000776, -0.07457439,  0.42081762,  0.27859781,
        0.64886292,  0.35141401,  0.68804682,  0.54788032,  0.75338328,
        0.76185174,  0.26654431,  0.36193052,  0.50536713,  0.26336493,
        0.44137044, -0.22693039,  0.12529151,  0.2481927 ,  0.46114508,
        0.30065996,  1.63249024,  0.09537375, -0.32264184,  0.63706962,
        0.72023425])
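
With the intercept and the 26 weights in hand, a prediction is just the intercept plus the weighted sum of a name's normalized letter counts. The sketch below copies the rounded numbers printed above and scores "Said", one of the test surnames listed further down; `russianness_score` is a hypothetical helper, not part of the notebook.

```python
import math
import string
from collections import Counter

# values copied from the two outputs above (weights are rounded to 8 decimals)
intercept = -0.5841042407646291
weights = [
    0.26901713, 0.60000776, -0.07457439, 0.42081762, 0.27859781,
    0.64886292, 0.35141401, 0.68804682, 0.54788032, 0.75338328,
    0.76185174, 0.26654431, 0.36193052, 0.50536713, 0.26336493,
    0.44137044, -0.22693039, 0.12529151, 0.2481927, 0.46114508,
    0.30065996, 1.63249024, 0.09537375, -0.32264184, 0.63706962,
    0.72023425,
]
weight_by_letter = dict(zip(string.ascii_lowercase, weights))

def russianness_score(name):
    """Intercept plus the weighted sum of L2-normalized letter counts."""
    counts = Counter(ch for ch in name.lower() if ch in weight_by_letter)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return intercept
    return intercept + sum(weight_by_letter[ch] * c / norm
                           for ch, c in counts.items())

score_said = russianness_score("Said")
heaviest_letter = max(weight_by_letter, key=weight_by_letter.get)

print(f"score for 'Said': {score_said:.8f}")
print(f"letter with the largest weight: {heaviest_letter}")
```

The score for "Said" agrees, up to rounding of the copied weights, with the third value of the prediction array printed in the next section.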

#### Test Data and Predictions

In [32]:
surname_test["surname"]

0       Moghadam
1         Najjar
2           Said
3           Cham
4           Tuma
          ...   
3023         Thi
3024          Ta
3025         Cao
3026       Trieu
3027          Ha
Name: surname, Length: 3028, dtype: object

In [33]:
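# reuse the vectorizer fitted on the dev data so the test columns match the trained weights
X_freq = cv.transform(surname_test_list)

tf_transformer = TfidfTransformer(use_idf=False).fit(X_freq)
# note: despite its name, y_test now holds the test feature matrix, not labels
y_test = tf_transformer.transform(X_freq)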
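
One thing to watch when vectorizing held-out names: the column positions must be the ones the model was trained on. The toy sketch below (made-up strings, hypothetical helpers `fit_vocabulary` and `count_rows`) shows how learning a new vocabulary from the test names can silently change what a column means.

```python
def fit_vocabulary(names):
    """Map each character seen in `names` to a column index (sorted order)."""
    chars = sorted({ch for name in names for ch in name.lower()})
    return {ch: i for i, ch in enumerate(chars)}

def count_rows(names, vocab):
    """Count characters per name, ignoring characters missing from `vocab`."""
    rows = []
    for name in names:
        row = [0] * len(vocab)
        for ch in name.lower():
            if ch in vocab:
                row[vocab[ch]] += 1
        rows.append(row)
    return rows

train_names = ["abc", "cab"]   # toy training data
test_names = ["bad"]           # toy test data with an unseen letter "d"

train_vocab = fit_vocabulary(train_names)
aligned = count_rows(test_names, train_vocab)

refit_vocab = fit_vocabulary(test_names)
misaligned = count_rows(test_names, refit_vocab)

print(train_vocab, aligned)      # column 2 still means "c"
print(refit_vocab, misaligned)   # column 2 now means "d"
```

In this notebook both files appear to produce the same 26 letters, so the columns happen to line up either way, but that is a property of the data rather than of the code.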

In [34]:
print(y_test)

  (0, 0)	0.5773502691896258
  (0, 3)	0.2886751345948129
  (0, 6)	0.2886751345948129
  (0, 7)	0.2886751345948129
  (0, 12)	0.5773502691896258
  (0, 14)	0.2886751345948129
  (1, 0)	0.6324555320336759
  (1, 9)	0.6324555320336759
  (1, 13)	0.31622776601683794
  (1, 17)	0.31622776601683794
  (2, 0)	0.5
  (2, 3)	0.5
  (2, 8)	0.5
  (2, 18)	0.5
  (3, 0)	0.5
  (3, 2)	0.5
  (3, 7)	0.5
  (3, 12)	0.5
  (4, 0)	0.5
  (4, 12)	0.5
  (4, 19)	0.5
  (4, 20)	0.5
  (5, 0)	0.7559289460184544
  (5, 8)	0.3779644730092272
  (5, 13)	0.3779644730092272
  :	:
  (3019, 20)	0.3779644730092272
  (3019, 24)	0.3779644730092272
  (3020, 0)	0.7071067811865475
  (3020, 13)	0.7071067811865475
  (3021, 0)	0.7071067811865475
  (3021, 11)	0.7071067811865475
  (3022, 0)	0.5
  (3022, 3)	0.5
  (3022, 6)	0.5
  (3022, 13)	0.5
  (3023, 7)	0.5773502691896258
  (3023, 8)	0.5773502691896258
  (3023, 19)	0.5773502691896258
  (3024, 0)	0.7071067811865475
  (3024, 19)	0.7071067811865475
  (3025, 0)	0.5773502691896258
  (3025, 2)	0.57735

In [35]:
y_test.shape

(3028, 26)

In [36]:
russianess = russian_model.predict(y_test)

In [41]:
russianess.shape

(3028,)

In [37]:
russianess

array([ 0.27774654,  0.26195033,  0.15884965, ..., -0.31978886,
        0.18222966,  0.09264217])
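
These are continuous scores, not class labels, and a linear regression can return values below 0 (or above 1). Classification metrics need 0/1 predictions, so the scores have to be cut at some threshold first. The sketch below uses the six values visible above and a cutoff of 0.5, which is an assumption of this example rather than something tuned in the notebook.

```python
# the six scores visible in the truncated output above
scores = [0.27774654, 0.26195033, 0.15884965,
          -0.31978886, 0.18222966, 0.09264217]

THRESHOLD = 0.5  # assumed cutoff: scores at or above it count as "Russian"

predicted = [1 if s >= THRESHOLD else 0 for s in scores]
has_negative_score = min(scores) < 0

print(predicted)
print(f"any score below zero: {has_negative_score}")
```

All six names shown fall below the cutoff and would be labelled non-Russian under this assumption.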

#### -Accuracy- will update

In [None]:
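# f1 score for the Russian class
# assumption: a regression score >= 0.5 counts as a "Russian" prediction
test_labels = (surname_test["nationality"] == "Russian").astype(int)
russian_pred = (russianess >= 0.5).astype(int)
surname_accuracy = f1_score(test_labels, russian_pred)
print(f"f1 score: {surname_accuracy}")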

__Observation 1__: Accuracy  is at [blank]
Comment here


#### -Precision and Recall- will update
------
Note: Precision and recall are computed with the ```precision_score``` and ```recall_score``` functions from sklearn.metrics.


------
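
For reference, here is what these metrics measure for the positive (Russian) class, computed by hand. The labels below are made up for illustration and are not results from this notebook.

```python
# made-up ground truth and predictions (1 = Russian, 0 = not Russian)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # Russian, predicted Russian
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not Russian, predicted Russian
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Russian, missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not Russian, correctly rejected

precision = tp / (tp + fp)           # how many predicted-Russian names are Russian
recall = tp / (tp + fn)              # how many Russian names were found
accuracy = (tp + tn) / len(y_true)   # share of all names labelled correctly
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision}, recall={recall}, accuracy={accuracy}, f1={f1:.4f}")
```

Note that accuracy and F1 are different quantities, so it helps to report each under its own name.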

In [None]:
surname_precision = 0.0
surname_recall = 0.0

In [47]:
# assumption: a regression score >= 0.5 counts as a "Russian" prediction
test_labels = (surname_test["nationality"] == "Russian").astype(int)
russian_pred = (russianess >= 0.5).astype(int)
surname_precision = precision_score(test_labels, russian_pred)
surname_recall = recall_score(test_labels, russian_pred)

print("Russian names")
print(f"Precision: {surname_precision}")
print(f"Recall: {surname_recall}")

__Observation 3__: Comment here!


---
### -Conclusion-

UPDATE 

---

In [None]:
print(f"Accuracy: {surname_accuracy}")
print(f"Precision: {surname_precision}")
print(f"Recall: {surname_recall}")

In [None]:
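surname_null_accuracy = surname_df["label"].value_counts(normalize=True)  # assumption: class shares of the dev labels, used as the baseline
print(f"Nationality Accuracy: \n {surname_null_accuracy}")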