# Build your classifier

## 1. Data preprocessing

We demonstrate common preprocessing techniques with two examples:
* Normalization in the cat classifier
* Categorical data conversion and missing values in the Pokémon classifier

### (1 - A) Normalization

This example follows Andrew Ng's course on Coursera.
The cat image dataset contains pictures of cats and of other things. In our [cat classifier](#cat_classifier) example, each X represents a single 64 * 64 * 3 RGB image, and the label of each image is defined as follows.
- 0 : non-cat
- 1 : cat

We divide each pixel value by 255; this operation is called normalization. After the division, each color value is rescaled to the range 0 to 1. Normalizing the data helps the optimization algorithm find the optimum more efficiently, and putting all features on the same scale lets each feature matter equally.

In this example, we already know that each pixel value ranges from 0 to 255, which is why the magic number is 255. In general, however, you will not know the range of each feature in advance. A common solution is [min-max normalization](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), which rescales each feature using the minimum and maximum observed in the data.
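
As a minimal sketch of min-max normalization, the snippet below applies sklearn's `MinMaxScaler` to a small made-up array. The array values and variable names are assumptions for illustration only. Note that the scaler is fitted on the training data and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales
train = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 1000.0]])
test = np.array([[2.5, 600.0]])

scaler = MinMaxScaler()                     # rescales each column to [0, 1]
scaled_train = scaler.fit_transform(train)  # learn min and max from the training data only
scaled_test = scaler.transform(test)        # reuse the training min and max on the test data

print(scaled_train)
print(scaled_test)
```

Test values that fall outside the training range would map outside [0, 1]; whether that matters depends on the classifier.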

### (1 - B) Categorical data conversion

Normally, many features are not described by numerical data, such as color and type in our Pokémon dataset.

| **Ordinal Values** | **Nominal Values** |
|--------------------|--------------------|
| Generation | Color, Type|

Instead, they are usually represented as categorical data. We introduce two different approaches to convert categorical data to numerical data, taking two features, **Color** and **Type_1**, as examples.

1. **Color**:
    * The method we demonstrate here is one-hot encoding, which transforms a categorical feature into combinations of 0s and 1s. Representing categories with a single numeric feature (0: red, 1: blue, ...) would confuse the classifier, because the **incremental** numbers imply an order and a distance that the colors do not have.
  
2. **Type_1**:
    * Assume the relation between types is so simple that we can view it as the intensity of the Pokémon.
    * We do not one-hot encode Type_1 because it is used as the label. The classifier presented in this example is an SVM, and it treats the labels 0, 1, 2, 3, ... as different categories. As a result, we only convert the string categories to numbers.
    * Here, we merge similar types into the same category, as shown below.
    
|  Type_1  |  category  |
|--------|---------|
|Grass|1|
|Fire|2|
|Water + Ice|3|
|Bug|4|
|Normal|5|
|Poison + Ghost + Dark|6|
|Electric|7|
|Ground + Rock|8|
|Flying + Fairy + Dragon|9|
|Fighting + Psychic + Steel|10|
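
The two conversions described above can be sketched on a few made-up rows. The toy DataFrame and the `type_to_category` name are assumptions for illustration; only the Grass, Fire, and Water + Ice mappings are taken from the table.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset has many more rows and columns
toy = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Blue"],
    "Type_1": ["Fire", "Water", "Grass", "Ice"],
})

# Nominal feature: one indicator column per color, with no implied order
color_onehot = pd.get_dummies(toy["Color"], dtype=int)

# Label: map strings to integer class ids, merging Water and Ice as in the table
type_to_category = {"Grass": 1, "Fire": 2, "Water": 3, "Ice": 3}
toy["Type_1"] = toy["Type_1"].map(type_to_category)

# Replace the original Color column with its one-hot columns
encoded = pd.concat([toy.drop(columns="Color"), color_onehot], axis=1)
print(encoded)
```

Each row has exactly one 1 among the color columns, so no color is treated as larger or closer to another.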

### (1 - C) Missing values

An even more common situation is that some values in the data do not exist; these are called "**missing values**". There are many ways to deal with missing values, such as dropping the feature or sample, or filling in zero, the column mean, or an interpolated value. The right substitute depends on your domain knowledge and experience.
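
The options listed above can be sketched with pandas on a small made-up table. The values and variable names are assumptions for illustration; which option is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with gaps in one column
toy = pd.DataFrame({"HP": [45.0, np.nan, 80.0, np.nan, 60.0],
                    "Attack": [49, 62, 82, 52, 64]})

dropped_rows = toy.dropna(axis=0)     # drop samples that have a missing value
dropped_cols = toy.dropna(axis=1)     # drop features that have a missing value
filled_zero = toy.fillna(0)           # fill with a constant
filled_mean = toy.fillna(toy.mean())  # fill with the column mean
filled_interp = toy.interpolate()     # linear interpolation between neighboring rows

print(toy.isnull().sum())             # number of missing values per column
print(filled_interp)
```

Interpolation only makes sense when neighboring rows are related, for example in a time series; for unordered samples, the mean or a domain-specific default is usually a safer starting point.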

Notes:
1. The Pokémon dataset can be downloaded [here](https://www.kaggle.com/alopez247/pokemon).
2. The [Pokémon classifier](#pokemon) takes the features other than `Type_1` as input and predicts the corresponding `Type_1`.

## 2. Use machine learning tools to build models

1. [sklearn](http://scikit-learn.org/stable/)
    sklearn provides convenient functions so that you do not have to implement every classifier yourself; it covers both supervised and unsupervised learning algorithms. We use it to build the classifiers in the following examples.

2. [pandas](https://pandas.pydata.org/)
    pandas also provides many high-level APIs for processing data. Compared with sklearn, pandas focuses on making data preprocessing simple and clean. Try applying it to your stock model.

<a id='cat_classifier'>cat classifier</a>

In [None]:
%matplotlib inline
import numpy as np
import sklearn
import matplotlib.pyplot as plt

train_X = np.load('train_X.npy')
train_Y = np.load('train_Y.npy')
test_X = np.load('test_X.npy')
test_Y = np.load('test_Y.npy')
plt.imshow(train_X[7]) # show one input image (64 * 64 RGB)

# flatten the input image
num_train = train_X.shape[0]
num_test = test_X.shape[0]
flatten_train_X = train_X.reshape(num_train,-1)
flatten_test_X = test_X.reshape(num_test, -1)

In [None]:
# Normalize each pixel by dividing by 255
norm_train_X = flatten_train_X / 255
norm_test_X = flatten_test_X / 255

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(norm_train_X, train_Y)

# Demonstrate the classifier on a single test image
index = 25
print("Model prediction: {}".format(clf.predict(norm_test_X[index].reshape(1,-1))))
plt.imshow(test_X[index])

<a id='pokemon'>pokemon classifier</a>

In [None]:
import pandas as pd
df = pd.read_csv('pokemon_alopez247.csv')
df.head(5) # before preprocessing

In [None]:
mapping_dictionary = {"Type_1":{ 'Grass': 1, 'Fire': 2, 'Water': 3, 'Bug': 4, 'Normal': 5, 'Poison': 6, 'Electric': 7, 'Ground': 8, 'Fairy': 9, 
 'Fighting': 10, 'Psychic': 10, 'Rock': 8, 'Ghost': 6, 'Ice': 3, 'Dragon': 9, 'Dark':6, 'Steel': 10, 'Flying': 9}}
df = df.replace(mapping_dictionary) # replace() conveniently converts the strings to their corresponding numbers
df["isLegendary"] = df["isLegendary"].astype(int) # Boolean to int
df["hasMegaEvolution"] = df["hasMegaEvolution"].astype(int)

dummy_df = pd.get_dummies(df['Color'])  ## one-hot encoding
df = pd.concat([df, dummy_df], axis=1)
df = df.drop('Color', axis=1)
df = df.drop(['Number','Name', 'Type_2', 'Egg_Group_1', 'Egg_Group_2', 'hasGender', 'Body_Style'],axis=1)
df.head(5) # after preprocessing

In [None]:
## Deal with missing value
print(df.isnull().sum()) # Pr_Male has 77 missing values
df.dropna(axis=1, inplace=True)

In [None]:
## You can draw histograms with pandas
df[['HP', 'Attack']].plot.hist(alpha=0.5)

In [None]:
## Split data
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

def get_arrays(df):
    # Assumes Type_1 is the first column, so the remaining columns are the features
    X = np.array(df.iloc[:,1:])
    y = np.array(df['Type_1'])  # label to predict
    return X, y


In [None]:
train_X, train_Y = get_arrays(df_train)
test_X, test_Y = get_arrays(df_test)
scaler = StandardScaler()
svc = SVC(C=5, gamma=0.04)
clf = Pipeline([('scaler', scaler), ('svc', svc)])
clf.fit(train_X, train_Y)
print("Training accuracy: {}".format(clf.score(train_X, train_Y)))
print("Test accuracy: {}".format(clf.score(test_X, test_Y)))

# Homework


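 * This assignment is to build a classification model for the [UCI German Credit Data](https://onlinecourses.science.psu.edu/stat857/node/215).
 * The dataset contains 1000 loan records, of which 700 are positive samples (credit-worthy) and 300 are negative samples (not credit-worthy). Each sample has 20 features: 17 categorical and 3 numeric. See the [feature details](https://onlinecourses.science.psu.edu/stat857/node/222).
 * Use the prepared tools to build a classification model. Refer to [Classifier comparison](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) to choose a machine learning classifier.
 * Below, the TAs provide an introduction to the dataset and a demo classifier.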

In [None]:
# `prepared` is the demo code prepared in advance by the TAs
import sys
sys.path.append('.prepared')
import classification as prepared

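### Dataset

The UCI German Credit Data is split into 800 training samples and 200 test samples. Each sample consists of a 20-dimensional feature vector and a class label of `1` or `0`, where `1` denotes a positive sample (credit-worthy) and `0` denotes a negative sample (not credit-worthy).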

In [None]:
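print(prepared.x_train.shape) # features of the 800 training samples
print(prepared.x_train[:3])   # print the features of the first three training samples
print()
print(prepared.y_train.shape) # labels of the 800 training samples
print(prepared.y_train[:3])   # print the labels of the first three training samples
print()
print(prepared.x_test.shape)  # features of the 200 test samples
print(prepared.y_test.shape)  # labels of the 200 test samples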

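### Demo classifier

Train the demo classifier (`demo_clf`) on the training data (`x_train` and `y_train`), then evaluate its accuracy on both the training data and the test data (`x_test` and `y_test`). Modify the code in [Hands-on](#動手做) below and try to beat this demo classifier.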

In [None]:
# Train the demo classifier, `demo_clf`, on the training data
demo_clf = prepared.demo(prepared.x_train, prepared.y_train)

# Evaluate its accuracy on the training data and the test data
prepared.evaluate(demo_clf, prepared.x_train, prepared.x_test, prepared.y_train, prepared.y_test)

if 'DecisionTreeClassifier' == type(demo_clf).__name__: # if `demo_clf` is a decision tree
    prepared.plot(demo_clf) # plot the decision tree

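# Note
## In addition to building the classifier below, this week's homework also requires submitting results to the stock system. See the stock folder for an introduction to the stock task.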

<a id='動手做'></a>

# Hands-on

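Modify the code below and try different machine learning algorithms to build a model that is better than the demo classifier. In other words, achieve an accuracy above `0.765` on the test data.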

In [None]:
# TODO: import classifiers you want to use
from sklearn.tree import DecisionTreeClassifier

# TODO: try different classifiers
clf = DecisionTreeClassifier(max_depth=2)

clf = clf.fit(prepared.x_train, prepared.y_train) # train `clf`
prepared.evaluate(clf, prepared.x_train, prepared.x_test, prepared.y_train, prepared.y_test) # evaluate `clf`
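
One possible way to compare several candidates before settling on one is cross-validation on the training data. The sketch below is not part of the TAs' `prepared` module: it uses synthetic data from `make_classification` as a stand-in with the same shape as the training set (an assumption), and the three candidate classifiers are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data: 800 samples, 20 features (an assumption)
X, y = make_classification(n_samples=800, n_features=20, random_state=0)

# Arbitrary candidates; swap in any sklearn classifier
candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=2, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy, estimated without touching the test data
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
for name, score in scores.items():
    print("{}: {:.3f}".format(name, score))
```

To apply the same idea to the homework, `X` and `y` would be replaced by `prepared.x_train` and `prepared.y_train`, keeping the test data for the final evaluation only.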



### References: the Coursera courses of Andrew Ng and Professor Lin, and Professor Wu at NTHU.