# Regressor

In this exercise, you will build a classifier based on a regressor and try to beat a demo classifier. This document contains two parts:

1. **Data preprocessing** describes how to use a [scikit-learn (sklearn)](http://scikit-learn.org/stable/) pipeline for data preprocessing.
2. **Exercise** describes your homework.

## 1. Data preprocessing

The **Breast Cancer Wisconsin dataset** is used here.

In [None]:
import numpy as np
import pandas as pd
df = pd.read_csv('./breast_cancer_data.csv') # load data
df = df.drop('id', axis=1) # drop unused columns
print(df.info())
df.head(5) # show the first 5 samples

Handle categorical and missing values. See [another document for data preprocessing](/notebooks/unit/classifier/classifier.ipynb#1.-Data-preprocessing).

In [None]:
# convert the label B/M to 0/1
# B indicates benign
# M indicates malignant
mapping = {'diagnosis': {
    'B': 0,
    'M': 1,
}}
df = df.replace(mapping)

# check for missing values
print(df.isnull().sum()) # no missing values

df.head(5)
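
The dataset above has no missing values, so the missing-value step is never exercised. As a minimal sketch of what both steps could look like, the example below uses a hypothetical toy frame (the column `feature_1` and its values are made up for illustration) and fills the gap with the column median, which is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the real dataset): one categorical label
# column and one numeric feature with a missing entry.
toy = pd.DataFrame({
    'diagnosis': ['B', 'M', 'B', 'M'],
    'feature_1': [12.0, np.nan, 11.0, 19.0],
})

# Encode the categorical label as integers with an explicit mapping.
toy['diagnosis'] = toy['diagnosis'].map({'B': 0, 'M': 1})

# Count missing values per column before filling them.
missing_before = toy.isnull().sum()

# Fill the missing entry with the median of the observed values.
toy['feature_1'] = toy['feature_1'].fillna(toy['feature_1'].median())

print(missing_before)
print(toy)
```

Median imputation is an assumption here; dropping incomplete rows with `dropna()` is another common option when few samples are affected.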

[Pipeline](http://scikit-learn.org/stable/modules/pipeline.html) is a tool that chains multiple processing steps (e.g. standardization, dimensionality reduction and a machine learning algorithm) into a single object for convenience. See [Python Machine Learning](https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch06/ch06.ipynb) for more information. The code below builds a pipeline of [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) and [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# split 20% of data for test
df_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

# convert df to numpy
train_X, train_y = df_train.iloc[:, 1:].values.astype(np.float64), df_train.iloc[:, 0].values.astype(np.float64)
test_X, test_y = df_test.iloc[:, 1:].values.astype(np.float64), df_test.iloc[:, 0].values.astype(np.float64)

# build a pipeline 
clf = Pipeline([('scl', StandardScaler()),
                ('pca', PCA(n_components=2)),
                ('clf', LogisticRegression(random_state=1))])
print(clf)

# train parameters of all pipeline steps with `fit()`
clf.fit(train_X, train_y)

# apply all pipeline steps with `predict()`
y_pred = clf.predict(test_X)

print("\nTest Accuracy: %.3f" % (accuracy_score(test_y, y_pred)))
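
To make clearer what the pipeline does internally, the sketch below compares a `Pipeline` with the same steps applied by hand. It uses synthetic data (an assumption, so that it runs without the CSV file) and the same step names as the cell above. It is an illustration of the idea, not a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data (an assumption for illustration only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 5)),
               rng.normal(3.0, 1.0, size=(50, 5))])
y = np.array([0] * 50 + [1] * 50)

# Pipeline version: one fit() call trains every step in order.
pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=1))])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

# Manual version: fit_transform each preprocessing step, then fit the model.
scl = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression(random_state=1)
X_pca = pca.fit_transform(scl.fit_transform(X))
clf.fit(X_pca, y)

# At prediction time only transform() is called on the fitted steps.
manual_pred = clf.predict(pca.transform(scl.transform(X)))

print(np.array_equal(pipe_pred, manual_pred))
```

The manual version has to remember which steps to `transform` at prediction time. The pipeline keeps that bookkeeping in one object, which also makes it harder to fit a preprocessing step on test data by accident.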

## 2. Exercise

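In general, a regression model is meant to output a numeric value. In a stock system, for example, the output might be a price rather than a category. A regression model can still be used as a classifier, though: applying a threshold to its output turns the regression model into a classification model.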

![regression model](https://qph.fs.quoracdn.net/main-qimg-914b29e777e78b44b67246b66a4d6d71)

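We use a Linear Regression model as a demonstration. The model takes the features of the data as input and outputs a floating-point value.
In the training stage, the biggest difference from a classification model is that the labels are now the floats 0. and 1. rather than plain integers, so the model can be expected to output a series of floating-point values in the prediction stage.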
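
As a minimal sketch of the thresholding idea, the example below fits a `LinearRegression` on synthetic data with float labels and converts its outputs to classes. The synthetic data and the threshold of 0.5 are assumptions chosen for illustration; this is not the reference solution to the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# Synthetic two-class data with float labels 0. and 1. (illustration only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
               rng.normal(3.0, 1.0, size=(100, 3))])
y = np.array([0.0] * 100 + [1.0] * 100)

# Fit an ordinary regressor on the float labels.
reg = LinearRegression()
reg.fit(X, y)

# The raw predictions are floats, not class labels.
scores = reg.predict(X)

# Assumed threshold of 0.5: scores at or above it become class 1.
threshold = 0.5
y_pred = (scores >= threshold).astype(np.float64)

print(scores[:5])
print("Accuracy: %.3f" % accuracy_score(y, y_pred))
```

The accuracy printed here is measured on the training data of a toy problem, so it says nothing about the test accuracy on the real dataset. The threshold is a tunable choice as well, and 0.5 is only the midpoint between the two label values.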


<img src="https://ppt.cc/ffYNvx@.png" alt="Drawing" style="width: 200px;"/>

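## Hands-on
Write your code in the cell below. The train and test data are already prepared and can be used directly. The type of regression model is not restricted; the most basic [Linear Regression example](http://scikit-learn.org/stable/modules/linear_model.html) is provided as a starting point.

### Requirements
1. Only a regression model may be used to implement the classifier
2. Reach a test accuracy of 0.93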

In [None]:
# import something

# train and test data
df_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
train_X, train_Y = df_train.iloc[:, 1:].values.astype(np.float64), df_train.iloc[:, 0].values.astype(np.float64)
test_X, test_Y = df_test.iloc[:, 1:].values.astype(np.float64), df_test.iloc[:, 0].values.astype(np.float64)

# train and predict

In [None]:
import sys
sys.path.append('.prepared')
import regressor as prepared
prepared.demo()