# Name Classification with Naive Bayes

## Overview

The focus of this project is to build a Python module that determines whether a person is Japanese or not, based on their name string. 
The final product is a class file containing the `NameClassifier` class, which is capable of 
- loading and preprocessing train and test data
- training the model
- predicting with and evaluating the model
- saving and loading the trained model for future use

The name data, covering various origins from around the world, was generated with the `Faker` library in Python. 100,000 fake names were created for each class for training and testing purposes. 

## Libraries / Dependencies

A couple of Python libraries were used to build this class:
- scikit-learn
- pandas
- pickle (part of Python's standard library)

In order to use the class, scikit-learn, pandas, and their dependencies need to be installed on your system.

## Setup and Locations

This class has only been tested on Ubuntu Linux 18.04 and is used by importing it. The class file `model.py` needs to be located in the directory where you intend to use it. The data and saved model files can be located anywhere, as long as you have a relative path to them from the class file. However, it's generally a good idea to keep everything within the same directory or one of its child directories. 
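
As a rough illustration of saving a trained model to a path and loading it back, the sketch below pickles a small scikit-learn pipeline into a temporary directory. The pipeline, the toy names, and the file name `name_classifier.pkl` are assumptions made for this example; the actual save and load methods of `NameClassifier` may look different.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (assumption): 1 = Japanese, 0 = non-Japanese
names = ["高橋 涼平", "加納 健一", "James George", "Tom Eriksen"]
labels = [1, 1, 0, 0]

# Stand-in for the trained classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(names, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any path reachable from your script works
    model_path = Path(tmp_dir) / "name_classifier.pkl"

    # Save the fitted model
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back as a new object
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original
original_pred = model.predict(names).tolist()
restored_pred = restored.predict(names).tolist()
print(original_pred == restored_pred)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.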

Now that the basics are all out of the way, let's get started!

## Taking a look at data

This module loads the name data from CSV files using pandas. You should have separate CSV files, one for Japanese names and one for non-Japanese names. 

In a file, this would look like

>Country, Address, name, other col..<br>
value1, value2, John Smith, value4

**As long as there is a column named `name`, other columns won't be a problem.<br> 
There should always be a white space between the first and last name, though.**
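
To illustrate the column requirement, here is a small sketch that reads an in-memory CSV with extra columns and keeps only `name`. The CSV content and the whitespace check are made up for this example and are not taken from the class itself.

```python
import io

import pandas as pd

# Made-up CSV content with extra columns around `name`
csv_text = """Country,Address,name,other
US,12 Example St,John Smith,x
JP,1-2-3 Example,田辺 稔,y
"""

df = pd.read_csv(io.StringIO(csv_text))

# Only the `name` column is needed; the other columns are ignored
names = df["name"]

# Check that every name has white space between its parts
has_space = bool(names.str.contains(r"\s").all())

print(names.tolist())
print(has_space)
```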

For example, loaded as DataFrames, the data might look like

import pandas as pd
j_name = pd.read_csv('data/jp_names.csv')
f_name = pd.read_csv('data/f_names.csv')

print("Japanese Names: \n", j_name.sample(10))
print("\nnon-Japanese Names:\n", f_name.sample(10))

Japanese Names: 
         code    name
65086  jp_JP    田辺 稔
80397  jp_JP   高橋 涼平
98682  jp_JP   加納 健一
81025  jp_JP   宇野 明美
74745  jp_JP   杉山 直子
7759   jp_JP    中島 稔
68173  jp_JP  田辺 さゆり
12969  jp_JP   高橋 太一
81645  jp_JP   田辺 康弘
94883  jp_JP   廣川 里佳

non-Japanese Names:
         code                        name
31089  es_ES      Alvaro Valencia Bernat
27573  en_US                James George
15361  en_AU           Alexandra Schmidt
78193  ru_RU  Антонов Аверьян Георгиевич
75810  ru_RU     Самойлов Иван Архипович
71612  ro_RO                 Savina Niță
65742  pt_BR            Bernardo Peixoto
64837  pl_PL           Leonard Matejczyk
56023  no_NO                 Tom Eriksen
27458  en_US             Victoria Cooley


## Preprocessing

Preprocessing the data is one of the most important aspects of machine learning. It can boost or ruin the model's performance. 
Here, since we're dealing with text data, it needs to be encoded into numbers. 

### Splitting Dataset
The dataset is split into train and test datasets, for model training and testing.
The default ratio is set to 
> train : test = 70% : 30%

This ratio can be modified if necessary.
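
A 70% : 30% split can be sketched with scikit-learn's `train_test_split`. Whether the class uses this function, a fixed seed, or stratification is an assumption here; the placeholder names are made up for the example.

```python
from sklearn.model_selection import train_test_split

# Toy data (assumption): 10 Japanese and 10 non-Japanese placeholder names
names = [f"jp_name_{i}" for i in range(10)] + [f"other_name_{i}" for i in range(10)]
labels = [1] * 10 + [0] * 10  # 1 = Japanese, 0 = non-Japanese

X_train, X_test, y_train, y_test = train_test_split(
    names,
    labels,
    test_size=0.3,     # 30% held out for testing
    random_state=42,   # fixed seed so the split is reproducible
    stratify=labels,   # keep the class balance in both parts
)

print(len(X_train), len(X_test))
```

Changing `test_size` is how the ratio would be adjusted in this sketch.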

### Bag of Words Model
In this simple technique, each word that appears in the dataset is assigned a unique number, so that each text can be expressed as a sequence of those numbers.
The sequences are then converted into vectors, where each position (index) represents a word and the value expresses how often that word occurs in the text.

<br>
Specifically, this class uses word counts via scikit-learn's `CountVectorizer`. 

<br>
The data is encoded into a SciPy sparse matrix, which is ready to be fed into the Naive Bayes model. 
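
The encoding step can be sketched as follows, assuming `CountVectorizer` with its default settings (the settings used inside the class are not stated here). Note that the default token pattern only keeps tokens of two or more characters, so a single-character given name such as `稔` is dropped from the vocabulary.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# A few names taken from the sample output above
names = ["田辺 稔", "高橋 涼平", "高橋 太一", "James George"]

vectorizer = CountVectorizer()        # default settings (assumption)
X = vectorizer.fit_transform(names)   # sparse matrix: one row per name

# vocabulary_ maps each word to its column index
vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(X.shape)       # (number of names, number of distinct words)
print(X.toarray())   # dense view, only sensible for tiny examples
```

If single-character tokens should be kept, `CountVectorizer` accepts a custom `token_pattern`; whether that matters for your data is worth checking.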


## How Naive Bayes works...

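Naive Bayes applies Bayes' theorem under the "naive" assumption that the features (here, word counts) are conditionally independent given the class. It predicts the class with the highest posterior probability.

This class uses scikit-learn's `sklearn.naive_bayes.MultinomialNB`.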
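
A minimal sketch of how word counts and `MultinomialNB` fit together is shown below. The toy training names, the labels, and the default hyperparameters are assumptions for illustration, not the configuration of the actual class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (assumption): 1 = Japanese, 0 = non-Japanese
train_names = [
    "田辺 稔", "高橋 涼平", "加納 健一",
    "James George", "Tom Eriksen", "Victoria Cooley",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Encode names as word-count vectors
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_names)

# Fit the model; the default alpha=1.0 applies Laplace smoothing,
# so words unseen in a class do not get zero probability
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# New names must go through the same fitted vectorizer
new_names = ["高橋 健一", "Tom George"]
X_new = vectorizer.transform(new_names)

predictions = clf.predict(X_new).tolist()
probabilities = clf.predict_proba(X_new)  # one row per name, one column per class

print(predictions)
print(probabilities)
```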

## Evaluation Metrics

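The model was evaluated on the test set with 3 metrics. Below, Japanese is treated as the positive class: TP and TN are the correctly predicted Japanese and non-Japanese names, FP are non-Japanese names predicted as Japanese, and FN are Japanese names predicted as non-Japanese.
### Accuracy
How many data points did the model predict correctly, regardless of class?

$$acc = \frac{TP + TN}{TP + TN + FP + FN}$$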

### Precision
Out of all predicted Japanese names, how many were actually Japanese names?

$$precision = \frac{TP}{TP + FP}$$

### Recall
Out of all actual Japanese names, how many did we predict as Japanese?

$$recall = \frac{TP}{TP + FN}$$
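
The three formulas can be worked through in plain Python as below. The labels are made up for the example, and `confusion_counts` is a hypothetical helper, not part of the class. scikit-learn also provides `accuracy_score`, `precision_score`, and `recall_score` in `sklearn.metrics` for the same purpose.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN with `positive` as the Japanese class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


# Made-up labels: 1 = Japanese, 0 = non-Japanese
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn)
print(accuracy, precision, recall)
```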