## Introduction to Prediction Using Surname Analysis

---
### Goal
---

Predict whether a name is of Russian origin or not.

In this iteration we are going to:
* build a unigram model (bag of characters)
* learn the weights for the Russian-language predictor
* implement multiple linear regression
* test predictions using test data
* compute accuracy, recall, and precision for Russian names.

------


In [54]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import re

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.feature_extraction.text import TfidfTransformer # turns a count matrix into normalized tf or tf-idf weights

---
### Let's perform some EDA

---

In [3]:
# read the csv file into data frame.
surname_csv = "data_set/surnames_dev.csv"
surname_test_csv = "data_set/surnames_test.csv"

surname_df = pd.read_csv(surname_csv, index_col = None, encoding="UTF-8")
surname_test = pd.read_csv(surname_test_csv, index_col = None, encoding="UTF-8")

In [4]:
# rename dev data columns.
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)
surname_test.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

In [5]:
surname_df.head()

Unnamed: 0,surname,nationality
0,Fakhoury,Arabic
1,Toma,Arabic
2,Koury,Arabic
3,Bata,Arabic
4,Samaha,Arabic


In [6]:
surname_test.head()

Unnamed: 0,surname,nationality
0,Moghadam,Arabic
1,Najjar,Arabic
2,Said,Arabic
3,Cham,Arabic
4,Tuma,Arabic


#### Features Exploration

In [7]:
# add a binary label column: 1 if the surname is Russian, 0 otherwise
surname_df['label'] = [1 if x =='Russian' else 0 for x in surname_df['nationality']]
labels = surname_df["label"]
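
The same 0/1 label can also be built with a vectorized comparison instead of a list comprehension. The sketch below uses a small made-up frame (`toy_df` and its rows are illustrative, not taken from the CSV) to show the idea:

```python
import pandas as pd

# toy stand-in for surname_df; the rows are illustrative only
toy_df = pd.DataFrame({
    "surname": ["Timin", "Toma", "Porshnev", "Hoang"],
    "nationality": ["Russian", "Arabic", "Russian", "Vietnamese"],
})

# boolean comparison, then cast: 1 for Russian, 0 for everything else
toy_df["label"] = (toy_df["nationality"] == "Russian").astype(int)
print(toy_df["label"].tolist())
```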

In [25]:
surname_df.tail(100)

Unnamed: 0,surname,nationality,label
2903,Timin,Russian,1
2904,Porshnev,Russian,1
2905,Gassan,Russian,1
2906,Drozdetsky,Russian,1
2907,Bawvykin,Russian,1
...,...,...,...
2998,Banh,Vietnamese,0
2999,Thach,Vietnamese,0
3000,Hoang,Vietnamese,0
3001,Do,Vietnamese,0


In [50]:
surname_df.head()

Unnamed: 0,surname,nationality,label
0,Fakhoury,Arabic,0
1,Toma,Arabic,0
2,Koury,Arabic,0
3,Bata,Arabic,0
4,Samaha,Arabic,0


In [97]:
surnames_df = surname_df['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x))

In [91]:
surnames_df.head()

0    Fakhoury
1        Toma
2       Koury
3        Bata
4      Samaha
Name: surname, dtype: object

In [99]:
surname_list = surname_df["surname"] # all nationalities
# russian_rows = surname_df.loc[surname_df["nationality"] == "Russian"] # russian rows only
# russian_surnames = russian_rows["surname"] # russian names only

In [100]:
surname_list = surname_df['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x))

In [101]:
surname_list.head()

0    Fakhoury
1        Toma
2       Koury
3        Bata
4      Samaha
Name: surname, dtype: object
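
To see what the `[^a-zA-Z]` substitution does to names that are not plain ASCII letters, here is a small sketch. The example names and the helper `fold_then_strip` are hypothetical and only illustrate one possible alternative. Note that the regex drops accented letters entirely, so `strip_accents="ascii"` in the vectorizer never gets a chance to fold them.

```python
import re
import unicodedata

def strip_non_letters(name):
    """Same cleaning as in the cell above: drop everything outside a-z / A-Z."""
    return re.sub('[^a-zA-Z]', '', name)

def fold_then_strip(name):
    """Hypothetical alternative: fold accents to ASCII first, then drop the rest."""
    folded = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    return re.sub('[^a-zA-Z]', '', folded)

# illustrative names; "\u00fc" is a precomposed u-umlaut
examples = ["O'Neill", "Abdul-Rahman", "M\u00fcller"]
for name in examples:
    print(name, "->", strip_non_letters(name), "|", fold_then_strip(name))
```

Whether the accented letter should be folded or dropped depends on how the CSV encodes such names, which this notebook does not show.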

In [102]:
surname_df.groupby("label").count()

label,surname,nationality
0,1592,1592
1,1411,1411
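
The two counts above also give a reference point for the scores later on: a model that always predicts the larger class would already be right about half of the time. A quick arithmetic sketch, using only the counts from this table (which cover the whole dev set, not the later test split):

```python
# counts copied from the groupby output above
n_other = 1592    # label 0
n_russian = 1411  # label 1
total = n_other + n_russian

share_russian = n_russian / total
majority_baseline = max(n_other, n_russian) / total

print(f"Russian share: {share_russian:.3f}")                         # 0.470
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")  # 0.530
```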


---
### Tokenize Data

---
Create a bag of characters (unigram model).

In [103]:
# vectorize features - unigrams only
cv = CountVectorizer(lowercase=True, analyzer='char', ngram_range=(1,1), strip_accents="ascii", min_df=0.0, max_df=1.0)
X = cv.fit_transform(surname_list)

# X_freq = cv.fit_transform(target, test_target)

# tf_transformer = TfidfTransformer(use_idf=False).fit(X_freq)
# X = tf_transformer.transform(X_freq)

In [104]:
print(X.toarray())

[[1 0 0 ... 0 1 0]
 [1 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 ...
 [1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]]


In [105]:
# alphabet (on scikit-learn >= 1.0, use cv.get_feature_names_out() instead)
cv.get_feature_names()

['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']
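
To make the count matrix concrete, the sketch below vectorizes just two of the surnames shown earlier. The variable names (`toy_names`, `toy_cv`, `toy_X`) are illustrative. The column order is read from `vocabulary_`, which maps each character to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_names = ["Toma", "Bata"]  # two surnames from the head() output

toy_cv = CountVectorizer(lowercase=True, analyzer='char', ngram_range=(1, 1))
toy_X = toy_cv.fit_transform(toy_names)

# order the characters by their column index
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)            # ['a', 'b', 'm', 'o', 't']
print(toy_X.toarray())  # row 0 = "toma", row 1 = "bata"
```

Each row holds one surname and each column one character, so "bata" gets a 2 in the `a` column and a 0 in the `m` and `o` columns.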

In [None]:
# output to csv - delete later
# l = DataFrame(X.A, columns=cv.get_feature_names())
# l.to_csv("data_set/vector_normalization.csv", index = False)

In [None]:
# print(DataFrame(X.A, columns=cv.get_feature_names()).to_string())

In [None]:
# print(X)

------
## Multiple Linear Regression

------

### Train/Test Data

To estimate how well the model predicts names it has not seen during training, we are going to split the surnames into train (65%) and test (35%) datasets.

In [26]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, labels, test_size=0.35, random_state = 32)
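
The cell above splits an already vectorized matrix. A variation worth considering is to split the raw surnames first and fit the vectorizer on the training names only, so the vocabulary is not influenced by test data. The sketch below shows that order on made-up toy data. The names, labels, `test_size`, and the use of `stratify` are assumptions for illustration, not the notebook's settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# toy data; names and labels are illustrative only
names = ["Timin", "Porshnev", "Gassan", "Drozdetsky", "Toma", "Bata", "Hoang", "Thach"]
is_russian = [1, 1, 1, 1, 0, 0, 0, 0]

# split the raw strings; stratify keeps the class ratio in both parts
names_train, names_test, lab_train, lab_test = train_test_split(
    names, is_russian, test_size=0.25, random_state=32, stratify=is_russian
)

# fit the vocabulary on training names only, then reuse it for the test names
split_cv = CountVectorizer(analyzer='char', ngram_range=(1, 1))
feats_train = split_cv.fit_transform(names_train)
feats_test = split_cv.transform(names_test)

print(feats_train.shape, feats_test.shape)
```

Characters that appear only in test names are ignored by `transform`, which mirrors what happens with genuinely new names at prediction time.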

In [23]:
# sparse matrices do not support len(); use shape[0] for the row count
print("Training set: %d samples" % x_train.shape[0])
print("Test set: %d samples" % x_test.shape[0])

In [27]:
y_train.shape

(1951,)

In [28]:
x_train.shape

(1951, 33)

In [30]:
print(x_train[:10])

  (0, 7)	1
  (0, 21)	1
  (0, 24)	1
  (0, 19)	1
  (0, 22)	1
  (1, 17)	1
  (1, 27)	1
  (1, 24)	1
  (1, 19)	1
  (1, 10)	1
  (1, 15)	1
  (2, 21)	1
  (2, 27)	1
  (2, 10)	2
  (2, 20)	2
  (3, 12)	2
  (3, 17)	1
  (3, 21)	1
  (3, 27)	1
  (3, 8)	1
  (3, 10)	1
  (4, 14)	1
  (4, 21)	1
  (4, 27)	1
  (4, 25)	1
  :	:
  (6, 24)	1
  (6, 26)	1
  (6, 8)	1
  (6, 25)	1
  (6, 10)	1
  (6, 11)	1
  (6, 28)	1
  (7, 7)	1
  (7, 27)	1
  (7, 26)	1
  (7, 19)	1
  (8, 17)	1
  (8, 14)	1
  (8, 21)	1
  (8, 24)	1
  (8, 31)	1
  (8, 25)	2
  (8, 11)	2
  (8, 28)	1
  (9, 7)	1
  (9, 21)	1
  (9, 19)	1
  (9, 11)	1
  (9, 20)	1
  (9, 18)	1


In [17]:
print(y_train[:10])

437     0
2296    1
467     0
2369    1
1631    1
2128    1
2564    1
46      0
2913    1
1221    0
Name: label, dtype: int64


#### Bag of Characters


In [20]:
# x_train / x_test are already character-count matrices from the split above.
# Passing them to CountVectorizer again raises "AttributeError: lower not found",
# because the vectorizer expects raw strings, so they are stored as-is.
boc = dict()
boc["train"] = (x_train, y_train)
boc["test"] = (x_test, y_test)

In [None]:
russian_model = LinearRegression()
russian_model.fit(x_train, y_train)
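
Once the model is fitted, the learned weight for each character is available in `coef_`, one value per column of the count matrix. The sketch below shows how the weights could be paired with their characters. It trains on a handful of toy names, so the variable names and the printed numbers are illustrative only and say nothing about the real data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

# toy data; illustrative only
toy_names = ["Timin", "Porshnev", "Drozdetsky", "Toma", "Bata", "Hoang"]
toy_labels = np.array([1, 1, 1, 0, 0, 0])

w_cv = CountVectorizer(analyzer='char', ngram_range=(1, 1))
toy_X = w_cv.fit_transform(toy_names).toarray()

toy_model = LinearRegression().fit(toy_X, toy_labels)

# pair each character with its coefficient, largest (most "Russian") first
chars = sorted(w_cv.vocabulary_, key=w_cv.vocabulary_.get)
weights = sorted(zip(chars, toy_model.coef_), key=lambda cw: cw[1], reverse=True)
for ch, w in weights[:5]:
    print(f"{ch!r}: {w:+.3f}")
```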

In [None]:
y_predicted = russian_model.predict(x_test)
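
`LinearRegression.predict` returns continuous scores rather than 0/1 labels, and the scores can fall below 0 or above 1. Classification metrics need hard labels, so the scores have to be cut at some threshold. A minimal sketch, assuming a cut-off of 0.5 (halfway between the two label values) and made-up scores:

```python
import numpy as np

# hypothetical regression outputs; real values would come from russian_model.predict
scores = np.array([-0.12, 0.31, 0.50, 0.74, 1.08])

# 0.5 is an assumed cut-off, not something the notebook specifies
threshold = 0.5
hard_labels = (scores >= threshold).astype(int)
print(hard_labels)  # [0 0 1 1 1]
```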

In [None]:
# x-axis: actual label (0 or 1); y-axis: continuous regression output
plt.scatter(y_test, y_predicted)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
plt.title("Russianness of a Surname")
plt.xlabel("Actual label (1 = Russian)")
plt.ylabel("Predicted score")
plt.show()

-----
#### Scoring

----

#### -Accuracy- will update

In [None]:
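# LinearRegression returns continuous scores; cut at 0.5 (an assumed threshold) to get 0/1 labels
y_pred = (y_predicted >= 0.5).astype(int)
# with average='micro' on single-label data, this F1 score equals plain accuracy
surname_accuracy = f1_score(y_test, y_pred, average='micro')
print(f"f1 score: {surname_accuracy}")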

In [None]:
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_predicted))
print('R²: %.2f' % r2_score(y_test, y_predicted))
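
Both regression metrics can be checked by hand on a tiny example. The sketch below computes the mean squared error and R² from their definitions and compares them with the scikit-learn functions. The labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# toy labels and scores; illustrative only
y_true = np.array([0, 0, 1, 1])
y_hat = np.array([0.2, 0.4, 0.6, 0.8])

# MSE: average squared residual
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

Both metrics describe how close the continuous scores are to the 0/1 labels; they do not say how many names are classified correctly.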

__Observation 1__: Accuracy  is at [blank]
Comment here


#### -Precision and Recall- will update
------
Note: Precision and recall will be computed using the `precision_score` and `recall_score` functions from `sklearn.metrics`.


------
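
As a reminder of what the two scores measure for the Russian class, the sketch below counts true positives, false positives, and false negatives on made-up labels and compares the result with scikit-learn. It also shows that `average="micro"` reduces to plain accuracy for this kind of single-label problem. All arrays are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# toy labels; 1 = Russian, 0 = other (illustrative only)
actual = np.array([1, 1, 1, 0, 0, 0])
predicted = np.array([1, 1, 0, 1, 0, 0])

tp = np.sum((predicted == 1) & (actual == 1))  # Russian, predicted Russian
fp = np.sum((predicted == 1) & (actual == 0))  # other, predicted Russian
fn = np.sum((predicted == 0) & (actual == 1))  # Russian, predicted other

precision_manual = tp / (tp + fp)  # of the names predicted Russian, how many are
recall_manual = tp / (tp + fn)     # of the Russian names, how many were found

print(precision_manual, precision_score(actual, predicted))
print(recall_manual, recall_score(actual, predicted))

# micro-averaging pools both classes and equals accuracy here
micro_precision = precision_score(actual, predicted, average="micro")
accuracy = np.mean(actual == predicted)
print(micro_precision, accuracy)
```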

In [None]:
surname_precision = 0.0
surname_recall = 0.0

In [None]:
# pos_label=1 scores the Russian class only; average="micro" would just reproduce accuracy
surname_precision = precision_score(y_test, y_pred, pos_label=1)
surname_recall = recall_score(y_test, y_pred, pos_label=1)

print("Russian names")
print(f"Precision: {surname_precision}")
print(f"Recall: {surname_recall}")

__Observation 3__: Comment here!


---
### -Conclusion-

UPDATE 

---

In [None]:
print(f"Accuracy: {surname_accuracy}")
print(f"Precision: {surname_precision}")
print(f"Recall: {surname_recall}")

In [None]:
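surname_null_accuracy = max(y_test.mean(), 1 - y_test.mean())  # accuracy of always predicting the majority class
print(f"Null accuracy: \n {surname_null_accuracy}")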