## What is scikit-learn?

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python.


## Data in scikit-learn

Data in scikit-learn, with very few exceptions, is assumed to be stored as a two-dimensional array, of shape \[n_samples, n_features\].

- **n_samples** : The number of samples. Each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, a row in a database or CSV file, or anything else you can describe with a fixed set of quantitative traits.

- **n_features** : The number of features or distinct traits that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be Boolean or discrete-valued in some cases.
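
As a minimal sketch of this layout, the toy array below holds three samples described by two features each. The values and the variable names `X` and `one_sample` are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: 3 samples (rows), 2 features (columns).
X = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 2.9],
])

n_samples, n_features = X.shape
print(n_samples, n_features)

# A single sample taken from X is one-dimensional; reshape it to
# (1, n_features) so that it is again a two-dimensional array.
one_sample = X[0].reshape(1, -1)
print(one_sample.shape)
```

The same `reshape(1, -1)` idiom is what allows a single example to be passed to an estimator that expects a two-dimensional input.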



## General Machine Learning Steps

1. Data collection, preprocessing (e.g., integration, cleaning, etc.), and exploration:
    - Split a dataset into the training and testing datasets
2. Model development:
    - Assume a model $\mathcal F = \{f_1, f_2, \cdots \}$, that is, a collection of candidate functions $f$. Let's assume that each $f$ is parametrized by $w$.
    - Define a cost function $\mathcal C(w)$ that measures how well a particular $f$ explains the training data. The lower the cost, the better.
3. Training: employ an algorithm that finds the best (or a good enough) function $f^*$ in the model, i.e. one that minimizes the cost function over the training dataset.
4. Testing: evaluate the performance of the learned $f^*$ using the testing dataset.
5. Apply the model in the real world.
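
The steps above can be sketched end to end as follows. This is only an illustrative sketch: the data is synthetic (two Gaussian blobs generated with NumPy), and all variable names are assumptions rather than part of the original material. Logistic regression is used as the model family simply because it appears later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step 1: collect data (here: hypothetical synthetic blobs) and split it.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: choose a model family. Each candidate function is determined
# by a weight vector w; the cost is a regularized log-loss.
model = LogisticRegression()

# Step 3: training searches for the weights that minimize the cost
# on the training set.
model.fit(X_train, y_train)

# Step 4: testing measures accuracy on data not used for training.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(train_acc, test_acc)

# Step 5: apply the learned function to a new, unseen point.
new_point = np.array([[1.5, 2.5]])
prediction = model.predict(new_point)
print(prediction)
```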

> The data is usually presented to the algorithm as a two-dimensional array (or matrix) of numbers. Each data point (also known as a sample or training instance) that we want to either learn from or make a decision on is represented as a list of numbers, a so-called feature vector, and the features it contains represent the properties of this point.

> In classification, the label is discrete, such as "spam" or "no spam". In other words, it provides a clear-cut distinction between categories. Furthermore, it is important to note that class labels are nominal, not ordinal variables. Nominal and ordinal variables are both subcategories of categorical variables. Ordinal variables imply an order, for example, T-shirt sizes "XL > L > M > S". On the contrary, nominal variables don't imply an order, for example, we (usually) can't assume "orange > blue > green".
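
The difference between ordinal and nominal variables can be illustrated with pandas categoricals. This is a small sketch using the T-shirt and color examples from the text; the variable names are assumptions.

```python
import pandas as pd

# Ordinal: T-shirt sizes have a meaningful order, so we declare one.
sizes = pd.Categorical(["M", "XL", "S", "L"],
                       categories=["S", "M", "L", "XL"],
                       ordered=True)
# Integer codes follow the declared order: S=0, M=1, L=2, XL=3.
size_codes = list(sizes.codes)
print(size_codes)
print(sizes.max())

# Nominal: colors have no order, so the categorical stays unordered.
colors = pd.Categorical(["orange", "blue", "green"])
print(colors.ordered)
```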

## Cat classifier - with reference to Andrew Ng's course on Coursera
The dataset contains pictures of cats and of other objects. Each X represents a single image, and the label of each image is defined as follows.
- 0 : non-cat
- 1 : cat


In [None]:
%matplotlib inline
import numpy as np
import sklearn
import matplotlib.pyplot as plt

train_X = np.load('train_X.npy')
train_Y = np.load('train_Y.npy')
test_X = np.load('test_X.npy')
test_Y = np.load('test_Y.npy')


print("train_X shape: {}".format(train_X.shape))
print("train_Y shape: {}".format(train_Y.shape))
print("test_X shape: {}".format(test_X.shape))
print("test_Y shape: {}".format(test_Y.shape))
#print(train_X[0])

In [None]:
# Feel free to change the index
index = 15
plt.imshow(train_X[index])

In [None]:
# Reshape the training and test data sets so that each image of size (64, 64, 3) is flattened into a single
# row vector of length 64 * 64 * 3, giving arrays of shape (n_samples, 64 * 64 * 3).
num_train = train_X.shape[0]
num_test = test_X.shape[0]
flatten_train_X = train_X.reshape(num_train,-1)
flatten_test_X = test_X.reshape(num_test, -1)

print("Flatten train_X shape: {}".format(flatten_train_X.shape))
print("Flatten test_X shape: {}".format(flatten_test_X.shape))
#print(flatten_train_X[0,:10])

# To represent color images, the red, green and blue channels (RGB) must be specified for each pixel, and so the 
# pixel value is actually a vector of three numbers ranging from 0 to 255.
# One common preprocessing step in machine learning is to center and standardize your dataset, meaning that you subtract
# the mean of the whole numpy array from each example, and then divide each example by the standard deviation of the whole
# numpy array.
#
# But for picture datasets, it is simpler and more convenient and works almost as well to just divide every row of the 
# dataset by 255 (the maximum value of a pixel channel).



Here we divide each image by 255; this operation is called normalization. After the division, the original color values are rescaled to the range 0 to 1. Normalizing the data helps the optimization algorithm find the optimum more efficiently.
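
To make the rescaling concrete, here is a small sketch on a hypothetical 2x2 RGB image (the pixel values are invented). It also shows the standardization alternative mentioned in the comments above, for comparison.

```python
import numpy as np

# Hypothetical 2x2 RGB image with 8-bit channel values in [0, 255].
image = np.array([[[0, 128, 255], [64, 64, 64]],
                  [[255, 0, 0], [10, 20, 30]]], dtype=np.uint8)

# Flatten to a single row of 2 * 2 * 3 = 12 features.
flat = image.reshape(1, -1)

# Normalization: true division promotes uint8 to float64 in [0, 1].
scaled = flat / 255
print(scaled.min(), scaled.max())

# Standardization (the alternative): zero mean, unit standard deviation.
standardized = (flat - flat.mean()) / flat.std()
print(standardized.mean(), standardized.std())
```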
    

In [None]:
norm_train_X = flatten_train_X / 255
norm_test_X = flatten_test_X / 255

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(norm_train_X, train_Y)
print(clf.score(norm_train_X, train_Y))
print(clf.score(norm_test_X, test_Y))

In [None]:
index = 25
print("Model prediction: {}".format(clf.predict(norm_test_X[index].reshape(1,-1))))
plt.imshow(test_X[index])

##  [Kaggle - Pokémon for Data Mining and Machine Learning](https://www.kaggle.com/alopez247/pokemon)
The Pokémon dataset uses HP, attack, defense, etc. to predict the type of a Pokémon. The label is the Type_1 column, and the remaining columns are treated as features.  
  
The example below uses the Python package pandas to read and parse CSV-like input data. For basic usage, see the link below.
- [Pandas basic usage](https://pandas.pydata.org/)
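
As a quick illustration of the pandas calls used in this section, the sketch below reads a tiny in-memory CSV instead of the real file. The three rows and the names `csv_text`, `toy`, `types`, and `features` are assumptions made for the example.

```python
import io
import pandas as pd

# Hypothetical miniature CSV, standing in for the real dataset file.
csv_text = """Name,Type_1,HP,Attack
Bulbasaur,Grass,45,49
Charmander,Fire,39,52
Squirtle,Water,44,48
"""

toy = pd.read_csv(io.StringIO(csv_text))  # parse CSV text into a DataFrame
print(toy.shape)                          # (rows, columns)
print(toy.head(2))                        # first two rows

types = toy["Type_1"].unique()            # distinct values of one column
features = toy.drop(columns=["Name", "Type_1"])  # keep numeric columns only
print(list(features.columns))
```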


In [None]:
import pandas as pd
df = pd.read_csv('pokemon_alopez247.csv')
print("Classes of type_1: {}".format(df['Type_1'].unique()))
print("Classes of Body_Style: {}".format(df['Body_Style'].unique()))
df.head(5)

## Convert Categorical Features to Numeric Features

- **Ordinal Values**
> Generation
- **Nominal Values**
> Color, Type

Different preprocessing methods are applied to the **Color** and **Type_1** columns:

1. **Color**:
    * Here we demonstrate one-hot encoding, which transforms a categorical feature into a combination of 0s and 1s. Representing the categories with plain integers would confuse the classifier, since 0 and 1 are numerically closer than 0 and 2, whereas Green and Red are no closer than Green and Black.  
  
2. **Type_1**:
    * We did not one-hot encode Type_1 because it is used as the label. The classifier presented in this example is an SVM, which treats 0, 1, 2, 3, ... as distinct categories. As a result, we only map the string categories to numbers.
    * Here, we merge similar types into the same category, as shown below.
    
|  Type_1  |  category  |
|--------|---------|
|Grass|1|
|Fire|2|
|Water + Ice|3|
|Bug|4|
|Normal|5|
|Poison + Ghost + Dark|6|
|Electric|7|
|Ground + Rock|8|
|Flying + Fairy + Dragon|9|
|Fighting + Psychic + Steel|10|
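
The two encodings can be contrasted on a toy frame. This is a minimal sketch with invented rows; the names `toy`, `color_onehot`, `type_to_id`, and `labels` are assumptions, and only three entries of the table above are included in the mapping.

```python
import pandas as pd

# Hypothetical toy frame; the rows are illustrative only.
toy = pd.DataFrame({"Color": ["Green", "Red", "Black", "Red"],
                    "Type_1": ["Grass", "Fire", "Ghost", "Fire"]})

# Feature column: one-hot encoding creates one 0/1 column per color,
# so no artificial ordering or distance between colors is introduced.
color_onehot = pd.get_dummies(toy["Color"], dtype=int)
print(color_onehot)

# Label column: a plain string-to-integer mapping is sufficient,
# because the classifier treats label values as category ids.
type_to_id = {"Grass": 1, "Fire": 2, "Ghost": 6}
labels = toy["Type_1"].map(type_to_id)
print(labels.tolist())
```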

In [None]:
mapping_dictionary = {"Type_1":{ 'Grass': 1, 'Fire': 2, 'Water': 3, 'Bug': 4, 'Normal': 5, 'Poison': 6, 'Electric': 7, 'Ground': 8, 'Fairy': 9, 
 'Fighting': 10, 'Psychic': 10, 'Rock': 8, 'Ghost': 6, 'Ice': 3, 'Dragon': 9, 'Dark':6, 'Steel': 10, 'Flying': 9}}
df = df.replace(mapping_dictionary) # replace() conveniently maps each string to its corresponding number
df["isLegendary"] = df["isLegendary"].astype(int) # Boolean to int
df["hasMegaEvolution"] = df["hasMegaEvolution"].astype(int)

dummy_df = pd.get_dummies(df['Color'])  ## one-hot encoding
df = pd.concat([df, dummy_df], axis=1)
df = df.drop('Color', axis=1)
df.head(5)

In [None]:
df.describe() # description of the dataset only for numeric feature

In [None]:
# Data shape
df.shape

In [None]:
## Drop columns that are irrelevant to the result
   
df = df.drop(['Number','Name', 'Type_2', 'Egg_Group_1', 'Egg_Group_2', 'hasGender', 'Body_Style'],axis=1)
df.head(5)

In [None]:
# Data shape after drop column
df.shape

In [None]:
# class counts of Type_1
# note the unbalanced class distribution
df['Type_1'].value_counts()

## Missing value

There are many ways to deal with missing values, such as dropping the feature or sample, or filling in values with zero, the column mean, or interpolation. There is no single right way; the choice depends on your domain knowledge and experience.
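
The options listed above map onto a handful of pandas calls. The sketch below applies each of them to a hypothetical toy frame; the values and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in Pr_Male.
toy = pd.DataFrame({"HP": [45.0, 60.0, 80.0, 39.0],
                    "Pr_Male": [0.875, np.nan, 0.5, 0.25]})

dropped_cols = toy.dropna(axis=1)     # drop features that contain NaN
dropped_rows = toy.dropna(axis=0)     # drop samples that contain NaN
filled_zero = toy.fillna(0)           # replace NaN with zero
filled_mean = toy.fillna(toy.mean())  # replace NaN with the column mean
interpolated = toy.interpolate()      # linear interpolation between neighbors

print(list(dropped_cols.columns))
print(len(dropped_rows))
print(interpolated.loc[1, "Pr_Male"])
```

Dropping a column discards the feature for every sample, while dropping rows shrinks the dataset, so the trade-off depends on how many values are missing and how informative the feature is.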

In [None]:
## Missing value
print(df.isnull().sum()) # Pr_Male has 77 missing values
df.dropna(axis=1, inplace=True) # drop every column (feature) that contains a missing value
df.shape

In [None]:
# All features must be numeric; check whether any column still has dtype = object
df.dtypes.value_counts()

In [None]:
# Finally, check that all features have been processed by the steps described above.
df.info()

In [None]:
## You can also draw a histogram using pandas
df[['HP', 'Attack']].plot.hist(alpha=0.5)

In [None]:
## Split data
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

def get_arrays(df):
    # Type_1 is the label; every other column is used as a feature.
    X = np.array(df.drop(columns=['Type_1']))
    y = np.array(df['Type_1'])
    return X, y


In [None]:
train_X, train_Y = get_arrays(df_train)
test_X, test_Y = get_arrays(df_test)
scaler = StandardScaler()
svc = SVC(C=5, gamma=0.04)
clf = Pipeline([('scaler', scaler), ('svc', svc)])
clf.fit(train_X, train_Y)
print("Training accuracy: {}".format(clf.score(train_X, train_Y)))
print("Test accuracy: {}".format(clf.score(test_X, test_Y)))

## Exercise - Build a classifier on stock data (predict whether a stock will rise (1) or not (0))

- Please refer to week03_classifier/stock_system.ipynb for a detailed illustration.
- Load the data in the stock directory.
- Choose a classifier from the sklearn package (SVC, decision tree, KNN, MLP, etc.).
- Make predictions on the test data and report the results.
- The raw data is in "/home/mlb/res/stock/twse/raw/"; the JSON data is in "/home/mlb/res/stock/twse/json/".
- The feature set consists of thirty features (six features per day: high price, low price, open price, close price, adjust_close, and volume).

In [None]:
# Note:
# We have already parsed the raw data and saved it in npy format
# for you. This exercise is only meant to let you build a
# model conveniently. Please parse and preprocess the raw data
# yourself for your own model in the stock simulation.

stock_train_X = np.load('stock/train_X.npy') # train 2017-05-01 ~ 2017-05-31
stock_train_Y = np.load('stock/train_Y.npy')
stock_test_X = np.load('stock/test_X.npy') # test 2017-06-01 ~ 2017-06-30
stock_test_Y = np.load('stock/test_Y.npy')

# ... build your own classifier with a module from sklearn or another library

print(stock_train_Y[:5])

# References: Andrew Ng's and Professor Lin's courses on Coursera, and Professor Wu at NTHU.