## Introduction to Prediction using Surnames Analysis

---
### Goal
---

Predict whether a name is of Russian origin or not.

In this iteration we are going to:
* build a unigram model (bag of characters)
* learn the weights for the Russian-language predictor
* implement multiple linear regression
* test predictions using test data
* compute accuracy, recall, and precision for Russian names.

------


In [1]:
import pandas as pd
from pandas import DataFrame
import numpy as np

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.feature_extraction.text import TfidfTransformer # normalizes a count matrix to tf or tf-idf

---
### Let's perform some EDA

---

In [3]:
# read the csv file into data frame.
surname_csv = "data_set/surnames_dev.csv"
surname_test_csv = "data_set/surnames_test.csv"

surname_df = pd.read_csv(surname_csv, index_col = None, encoding="UTF-8")
surname_test = pd.read_csv(surname_test_csv, index_col = None, encoding="UTF-8")

In [4]:
# rename the columns of both the dev and the test data.
surname_df.rename(columns={'Unnamed: 0': 'surname', 'Unnamed: 1': 'nationality'}, inplace=True)
surname_test.rename(columns={'Unnamed: 0': 'surname', 'Unnamed: 1': 'nationality'}, inplace=True)

In [5]:
surname_df.head()

    surname nationality
0  Fakhoury      Arabic
1      Toma      Arabic
2     Koury      Arabic
3      Bata      Arabic
4    Samaha      Arabic


In [6]:
surname_test.head()

    surname nationality
0  Moghadam      Arabic
1    Najjar      Arabic
2      Said      Arabic
3      Cham      Arabic
4      Tuma      Arabic


#### Features Exploration

In [7]:
features = surname_df["surname"] # features (x) needed to predict nationality


# target = surname_df["nationality"]
# russian names
russian_names = surname_df.loc[surname_df["nationality"] == "Russian"]
target = russian_names["surname"]


test_target = surname_test["surname"]

In [8]:
target

1523                 Mindra
1524                 Holuev
1525             Researcher
1526                  Vasin
1527              Beltyukov
               ...         
2929                  Bader
2930     To The First  Page
2931               Vaksberg
2932                Martidi
2933                Munster
Name: surname, Length: 1411, dtype: object

---
### Tokenize Data

---
Create a bag of characters (unigram model).

In [27]:
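# vectorize features - unigrams only
cv = CountVectorizer(lowercase=True, analyzer='char', ngram_range=(1,1), strip_accents="ascii", min_df=0.0, max_df=1.0)
# the second argument of fit_transform is y, which CountVectorizer ignores, so only the documents are passed
X_counts = cv.fit_transform(target)

# use_idf=False keeps plain term frequencies; the default norm='l2' scales each row to unit length
X = TfidfTransformer(use_idf=False).fit_transform(X_counts)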

In [28]:
# alphabet (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
cv.get_feature_names()

[' ',
 "'",
 ',',
 '-',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'y',
 'z']

In [29]:
# write the normalized character-frequency matrix to csv
char_freq_df = DataFrame(X.toarray(), columns=cv.get_feature_names())
char_freq_df.to_csv("data_set/vector_normalization.csv", index=False)

In [30]:
print(DataFrame(X.toarray(), columns=cv.get_feature_names()).to_string())

                       '         ,         -         a         b         c         d         e         f         g         h         i         j         k         l         m         n         o         p         r         s         t         u         v         w         y         z
0     0.000000  0.000000  0.000000  0.000000  0.408248  0.000000  0.000000  0.408248  0.000000  0.000000  0.000000  0.000000  0.408248  0.000000  0.000000  0.000000  0.408248  0.408248  0.000000  0.000000  0.408248  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
1     0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.408248  0.000000  0.000000  0.408248  0.000000  0.000000  0.000000  0.408248  0.000000  0.000000  0.408248  0.000000  0.000000  0.000000  0.000000  0.408248  0.408248  0.000000  0.000000  0.000000
2     0.000000  0.000000  0.000000  0.000000  0.213201  0.000000  0.213201  0.000000  0.639602  0.000000  0.000000  0.213201  0.000000  0.000000 
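
The first rows of this table can be checked by hand. With `use_idf=False`, `TfidfTransformer` keeps the raw character counts and, under its default `norm='l2'`, divides each row by its Euclidean length. The sketch below reproduces that calculation with the standard library only. The helper name `char_unigram_l2` is hypothetical, and it skips the `strip_accents` step, which does not matter for these two names.

```python
from collections import Counter
import math


def char_unigram_l2(name):
    """Return L2-normalized character counts for a lowercased name."""
    counts = Counter(name.lower())
    # Euclidean length of the count vector
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {ch: c / norm for ch, c in counts.items()}


mindra = char_unigram_l2("Mindra")          # row 0 of the table
researcher = char_unigram_l2("Researcher")  # row 2 of the table

print(round(mindra["m"], 6))      # six distinct letters, each 1/sqrt(6) -> 0.408248
print(round(researcher["e"], 6))  # 'e' occurs three times: 3/sqrt(22) -> 0.639602
print(round(researcher["a"], 6))  # 'a' occurs once: 1/sqrt(22) -> 0.213201
```

The values match the `0.408248`, `0.639602`, and `0.213201` entries printed for rows 0 and 2.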

------
## Multiple Linear Regression

------

### Train/Test Data

To measure how well the model predicts surnames it has not seen during training, we split the surnames into train (65%) and test (35%) datasets.

In [31]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, target, test_size=0.35, random_state = 32)

In [32]:
russian_model = LinearRegression()
russian_model.fit(x_train, y_train)

ValueError: could not convert string to float: 'Bagaturiya'
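
The `ValueError` arises because `target` holds the surnames themselves, so `LinearRegression` is asked to regress on strings. A regression needs a numeric target. One option is a 0/1 indicator for Russian origin, with the features built from every surname rather than only the Russian ones. The following is a minimal sketch under that assumption, using a small made-up data frame in place of `surnames_dev.csv`. The names `toy_df`, `is_russian`, `scores`, and `labels`, as well as the 0.5 cutoff, are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LinearRegression

# toy stand-in for the surname data (hypothetical selection of rows)
toy_df = pd.DataFrame({
    "surname": ["Vasin", "Beltyukov", "Holuev", "Vaksberg",
                "Fakhoury", "Toma", "Koury", "Samaha"],
    "nationality": ["Russian"] * 4 + ["Arabic"] * 4,
})

# numeric target: 1 for Russian, 0 for any other nationality
is_russian = (toy_df["nationality"] == "Russian").astype(int)

# bag of characters over ALL surnames, then L2-normalized term frequencies
cv = CountVectorizer(lowercase=True, analyzer="char", ngram_range=(1, 1))
X_counts = cv.fit_transform(toy_df["surname"])
X = TfidfTransformer(use_idf=False).fit_transform(X_counts)

# the target is numeric now, so the fit no longer raises ValueError
model = LinearRegression()
model.fit(X, is_russian)

# the regression returns continuous scores; an assumed 0.5 cutoff turns them into labels
scores = model.predict(X)
labels = (scores >= 0.5).astype(int)
print(labels.shape, scores.dtype)
```

With the real data, the same 0/1 series would be passed to `train_test_split` alongside `X` so that `y_train` and `y_test` are numeric.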

In [None]:
y_predicted = russian_model.predict(x_test)

In [15]:
# x-axis: actual labels, y-axis: model predictions (assumes y_test is numeric, e.g. 1 = Russian, 0 = not)
plt.scatter(y_test, y_predicted)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
plt.title("Russianness of a Surname")
plt.xlabel("Actual label")
plt.ylabel("Predicted score")
plt.show()

NameError: name 'x' is not defined

-----
#### Scoring

----

#### -Accuracy- will update

In [None]:
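# f1 score; assumes y_test holds 0/1 labels
# the regression output is continuous, so threshold it at 0.5 (an assumed cutoff) to get class labels
y_pred = (y_predicted >= 0.5).astype(int)
# with average='micro' on a single-label problem, f1 equals accuracy
surname_accuracy = f1_score(y_test, y_pred, average='micro')
print(f"f1 score: {surname_accuracy}")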
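
To see why a micro-averaged F1 score can be stored in a variable called `surname_accuracy`, the sketch below computes both quantities by hand on made-up labels and scores. The data, the 0.5 cutoff, and the variable names are assumptions for illustration only.

```python
# hypothetical true labels (1 = Russian) and continuous regression scores
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]

# threshold at 0.5 (an assumed cutoff) to get predicted labels
y_hat = [1 if s >= 0.5 else 0 for s in scores]

# accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_hat))
accuracy = correct / len(y_true)

# micro-averaging pools TP/FP/FN over both classes:
# every correct prediction is a TP for its class, and every error
# is one FP for the predicted class and one FN for the true class
tp = correct
fp = fn = len(y_true) - correct
micro_f1 = 2 * tp / (2 * tp + fp + fn)

print(accuracy)  # 6 of 8 correct -> 0.75
print(micro_f1)  # 0.75
```

Because the pooled false positives and false negatives are both equal to the number of errors, the two numbers always coincide for single-label predictions.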

In [None]:
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_predicted))
print('R²: %.2f' % r2_score(y_test, y_predicted))

__Observation 1__: Accuracy  is at [blank]
Comment here


#### -Precision and Recall- will update
------
Note: Precision and recall will be computed using the ```precision_score``` and ```recall_score``` functions from sklearn.metrics.


------

In [None]:
surname_precision = 0.0
surname_recall = 0.0

In [None]:
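# threshold the regression scores at 0.5 (an assumed cutoff); recomputed here so the cell runs on its own
y_pred = (y_predicted >= 0.5).astype(int)
surname_precision = precision_score(y_test, y_pred, average="micro")
surname_recall = recall_score(y_test, y_pred, average="micro")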

print("Overall")
print(f"Precision: {surname_precision}")
print(f"Recall: {surname_recall}")
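
The stated goal is precision and recall for Russian names, whereas `average="micro"` pools both classes into one overall figure. In scikit-learn, leaving `average` at its default of `'binary'` with `pos_label=1` reports the positive class only. The definitions behind those numbers are sketched below in plain Python on made-up labels, where 1 stands for Russian. All values and names are assumptions for illustration.

```python
# hypothetical true and predicted labels (1 = Russian, 0 = not Russian)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 1, 0, 0, 1, 1, 0, 1]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # Russian, predicted Russian
fp = sum(t == 0 and p == 1 for t, p in pairs)  # not Russian, predicted Russian
fn = sum(t == 1 and p == 0 for t, p in pairs)  # Russian, predicted not Russian

# precision: of the names predicted Russian, how many really are
russian_precision = tp / (tp + fp)
# recall: of the truly Russian names, how many were found
russian_recall = tp / (tp + fn)

print(russian_precision)  # 3 / (3 + 2) -> 0.6
print(russian_recall)     # 3 / (3 + 1) -> 0.75
```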

__Observation 3__: Comment here!


---
### -Conclusion-

UPDATE 

---

In [None]:
print(f"Accuracy: {surname_accuracy}")
print(f"Precision: {surname_precision}")
print(f"Recall: {surname_recall}")

In [None]:
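# null accuracy (assumed meaning): share of the most frequent nationality, i.e. the score of always predicting it
surname_null_accuracy = surname_df["nationality"].value_counts(normalize=True).head(1)
print(f"Nationality Accuracy: \n {surname_null_accuracy}")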