# Regressor

In this exercise, you will build a classifier based on a regressor and try to beat a demo classifier. This document contains two parts:

1. **Data preprocessing** describes how to use a [scikit-learn (sklearn)](http://scikit-learn.org/stable/) pipeline for data preprocessing.
2. **Exercise** describes your homework.

## 1. Data preprocessing

The **[Breast Cancer Wisconsin dataset](https://goo.gl/2xTMPR)** is used here.

In [None]:
import numpy as np
import pandas as pd
df = pd.read_csv('./breast_cancer_data.csv') # load data
df = df.drop('id', axis=1) # drop unused columns
df.info() # print column types and non-null counts (info() prints directly and returns None)
df.head(5) # show the first 5 samples

Next, handle the categorical label and check for missing values. See [another document for data preprocessing](/notebooks/unit/classifier/classifier.ipynb#1.-Data-preprocessing).

In [None]:
# convert the label B/M to 0/1
# B indicates benign
# M indicates malignant
mapping = {'diagnosis': {
    'B': 0,
    'M': 1,
}}
df = df.replace(mapping)

# check for missing values
print(df.isnull().sum()) # no missing values

df.head(5)

[Pipeline](http://scikit-learn.org/stable/modules/pipeline.html) is a tool that chains multiple processing steps (e.g., standardization, dimensionality reduction, and a machine learning algorithm) into a single object for convenience. See [Python Machine Learning (ch6)](https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch06/ch06.ipynb) for more information. The code below builds a pipeline of [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) and [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# hold out 20% of the data for testing
df_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

# convert df to numpy
train_X, train_y = df_train.iloc[:, 1:].values.astype(np.float64), df_train.iloc[:, 0].values.astype(np.float64)
test_X, test_y = df_test.iloc[:, 1:].values.astype(np.float64), df_test.iloc[:, 0].values.astype(np.float64)

# build a pipeline 
clf = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression(random_state=1))
])
print(clf)

# train parameters of all pipeline steps with `fit()`
clf.fit(train_X, train_y)

# apply all pipeline steps with `predict()`
y_pred = clf.predict(test_X)

print("\nTest Accuracy: %.3f" % (accuracy_score(test_y, y_pred)))
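
To make concrete what `fit()` and `predict()` do on a pipeline, the sketch below compares a pipeline with the same three steps applied by hand. It is an illustration under assumptions, not part of the original notebook: it loads scikit-learn's built-in breast cancer dataset via `load_breast_cancer()` instead of the CSV file, sets `svd_solver='full'` so that PCA is deterministic, and the names `pipe`, `manual_scl`, `manual_pca` and `manual_clf` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: use scikit-learn's built-in dataset rather than the CSV file;
# its label encoding may differ from the B/M -> 0/1 mapping used above.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Pipeline version: one object, one fit(), one predict()
pipe = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=2, svd_solver='full')),
    ('clf', LogisticRegression(random_state=1)),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual version: fit each transformer on the training data only,
# then reuse the fitted transformers on the test data
manual_scl = StandardScaler()
manual_pca = PCA(n_components=2, svd_solver='full')
manual_clf = LogisticRegression(random_state=1)

Z_train = manual_pca.fit_transform(manual_scl.fit_transform(X_train))
manual_clf.fit(Z_train, y_train)

Z_test = manual_pca.transform(manual_scl.transform(X_test))
manual_pred = manual_clf.predict(Z_test)

# Check whether both versions produce identical predictions
same = np.array_equal(pipe_pred, manual_pred)
print("Identical predictions:", same)
```

The point of the pipeline is that the test data never influences the scaler or the PCA projection: `predict()` only calls `transform()` on the earlier steps, exactly as in the manual version.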

## 2. Exercise

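In general, a regression model is not limited to outputting predefined classes the way a classification model is; it can output any numeric value. Take a stock system as an example: instead of predicting whether a stock goes up or down, a regression model predicts the stock price directly. Therefore, by setting a threshold, a regression model can be used as a classification model. For example, subtracting the previous day's price from the predicted price tells you whether the stock is predicted to rise or fall.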
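
The thresholding idea can be sketched on toy data as follows. This is a minimal illustration under assumptions, not the TA's method: the data is synthetic (generated with `make_classification`), the regressor is plain `LinearRegression`, and the threshold of 0.5 is simply the midpoint between the two labels 0 and 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic two-class data stands in for a real dataset
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_clusters_per_class=1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Fit a regressor on the 0/1 labels treated as real-valued targets
reg = LinearRegression()
reg.fit(X_train, y_train.astype(np.float64))

# The regressor outputs arbitrary real values, not classes
scores = reg.predict(X_test)

# Turn the real-valued output into a class with a threshold
threshold = 0.5  # assumption: midpoint between the labels 0 and 1
y_pred = (scores >= threshold).astype(int)

acc = accuracy_score(y_test, y_pred)
print("First scores:", np.round(scores[:5], 3))
print("First classes:", y_pred[:5])
print("Accuracy: %.3f" % acc)
```

The threshold is a free parameter: moving it away from 0.5 trades one kind of error for the other, so it can be tuned like any other hyperparameter.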

### Dataset
The exercise uses the same data as above, the **[Breast Cancer Wisconsin dataset](https://goo.gl/2xTMPR)**, and the same preprocessing steps. Here you only need to call the tools prepared by the TA.


In [None]:
# load the prepared tools
import sys
sys.path.append('.prepared')
import regressor as prepared

train_X, train_Y, test_X, test_Y = prepared.load_data() # load the X and Y data for train and test

print(train_X.shape) # features of the 455 training samples
print(train_X[0])    # print the features of the first training sample
print()
print(train_Y.shape) # labels of the 455 training samples
print(train_Y[0])    # print the label of the first training sample
print()
print(test_X.shape)
print(test_X[0])
print()
print(test_Y.shape)
print(test_Y[0])

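### Hands-on

Modify the code cell below. The train and test data are already prepared and available through `prepared.load_data()`. A basic [Linear Regression example](http://scikit-learn.org/stable/modules/linear_model.html) is provided for reference.

Requirements:

1. Only regression models may be used to implement the classifier; the choice of regressor is up to you.
2. Reach a test accuracy of 0.93.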

In [None]:
# The TA reached a classification accuracy of 0.956 on this data with a regressor

TA_predict = prepared.demo() # TA_predict holds the predictions made for test_X
prepared.evaluate(test_Y, TA_predict)

In [None]:
# Write your code here