# Regressor

In this exercise, you will build a regressor, use it for classification, and try to beat a demo classifier. This document has two parts:

1. **Data preprocessing** describes how to use a [scikit-learn (sklearn)](http://scikit-learn.org/stable/) pipeline for data preprocessing.
2. **Exercise** describes your homework.

## 1. Data preprocessing

The [Breast Cancer Wisconsin (Diagnostic) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic%29) is used here.

In [1]:
import numpy as np
import pandas as pd
df = pd.read_csv('./breast_cancer_data.csv') # load data
df = df.drop('id', axis=1) # drop unused columns
print(df.info())
df.head(5) # show the first 5 samples

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    object 
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  5

| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


Handle categorical and missing values. See [another document for data preprocessing](/notebooks/unit/classifier/classifier.ipynb#1.-Data-preprocessing) for details.

In [2]:
# convert label B/M to 0/1
# B indicates benign
# M indicates malignant
mapping = {'diagnosis': {
    'B': 0,
    'M': 1,
}}
df = df.replace(mapping)

# check for missing values
print(df.isnull().sum()) # every count is 0, so there are no missing values

df.head(5)

diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64


| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 1 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 1 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 1 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


[Pipeline](http://scikit-learn.org/stable/modules/pipeline.html) chains multiple processing steps (e.g., standardization, dimensionality reduction, and a machine learning algorithm) into a single object for convenience. See [Python Machine Learning (ch6)](https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch06/ch06.ipynb) for more information. The code below builds a pipeline of [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), and [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [3]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# split 20% of data for test
df_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

# convert df to numpy
train_X, train_y = df_train.iloc[:, 1:].values.astype(np.float64), df_train.iloc[:, 0].values.astype(np.float64)
test_X, test_y = df_test.iloc[:, 1:].values.astype(np.float64), df_test.iloc[:, 0].values.astype(np.float64)

# build a pipeline 
clf = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression(random_state=1))
])
print(clf)

# train parameters of all pipeline steps with `fit()`
clf.fit(train_X, train_y)

# apply all pipeline steps with `predict()`
y_pred = clf.predict(test_X)

print("\nTest Accuracy: %.3f" % (accuracy_score(test_y, y_pred)))

Pipeline(steps=[('scl', StandardScaler()), ('pca', PCA(n_components=2)),
                ('clf', LogisticRegression(random_state=1))])

Test Accuracy: 0.947
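
To make it clearer what the pipeline does internally, the sketch below runs the same three steps twice: once through `Pipeline` and once by calling each step by hand. It assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) in place of `./breast_cancer_data.csv`, so the label encoding differs from the mapping above (the bundled copy uses 0 for malignant and 1 for benign). The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the CSV file used above.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Version 1: let the pipeline chain the steps.
pipe = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression(random_state=1)),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Version 2: the same steps by hand.
scaler = StandardScaler()
pca = PCA(n_components=2)
model = LogisticRegression(random_state=1)

# fit(): every transformer is fitted on the training data only,
# and its output is fed to the next step.
Z_train = pca.fit_transform(scaler.fit_transform(X_train))
model.fit(Z_train, y_train)

# predict(): the fitted transformers are applied (not re-fitted) to the test data.
Z_test = pca.transform(scaler.transform(X_test))
manual_pred = model.predict(Z_test)

# Both versions are expected to give identical predictions.
print(np.array_equal(pipe_pred, manual_pred))
```

The point of the pipeline is the second half of the manual version: the scaler and PCA are fitted on the training split only and merely applied to the test split, which is easy to get wrong when the steps are called by hand.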


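## 2. Exercise

In general, a regression model, unlike a classification model that can only output predefined classes, can output arbitrary numeric values. Take a stock system as an example: a regression model does not predict whether a stock will rise or fall, but predicts the stock price directly. By setting a threshold, a regression model can therefore be used as a classification model. For example, subtracting the previous day's price from the predicted price tells you whether the stock is predicted to rise or fall.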
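
The stock example can be sketched in a few lines of plain Python. The function name `predict_direction` and all prices are made up for illustration; the threshold here is a price change of zero.

```python
def predict_direction(predicted_price, previous_price):
    """Turn a regression output (a price) into a class: 1 = rise, 0 = no rise."""
    change = predicted_price - previous_price
    return 1 if change > 0 else 0

# Hypothetical prices, not real data.
previous_prices = [100.0, 52.5, 31.0]
predicted_prices = [101.2, 51.9, 31.0]

directions = [
    predict_direction(pred, prev)
    for pred, prev in zip(predicted_prices, previous_prices)
]
print(directions)  # [1, 0, 0]
```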

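### Dataset

The exercise uses the same dataset as the Data preprocessing part above. To avoid differences caused by preprocessing, use `prepared.load_data()` to load the prepared training and test data.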

In [4]:
# load the prepared tools
import sys
sys.path.append('.prepared')
import regressor as prepared

# load the prepared dataset
x_train, y_train, x_test, y_test = prepared.load_data()

print(x_train.shape) # 455 training samples, each with 30 features
print(x_train[0])    # print the features of the first training sample
print(y_train[0])    # print the label of the first training sample
print()
print(x_test.shape)
print(y_test.shape)

(455, 30)
[1.799e+01 2.066e+01 1.178e+02 9.917e+02 1.036e-01 1.304e-01 1.201e-01
 8.824e-02 1.992e-01 6.069e-02 4.537e-01 8.733e-01 3.061e+00 4.981e+01
 7.231e-03 2.772e-02 2.509e-02 1.480e-02 1.414e-02 3.336e-03 2.108e+01
 2.541e+01 1.381e+02 1.349e+03 1.482e-01 3.735e-01 3.301e-01 1.974e-01
 3.060e-01 8.503e-02]
1.0

(114, 30)
(114,)


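### Hands-on

Use a regression model to classify the test data; the [Linear Regression Example](http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html) is a useful reference. The accuracy of your predictions, as evaluated by `prepared.evaluate()`, must exceed `0.93`.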
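
Because the labels are 0 and 1, a natural cut-off for a regressor's continuous output is the midpoint 0.5. The sketch below shows that conversion and an accuracy computation in plain Python. The helper names `threshold_predictions` and `accuracy` and the sample scores are hypothetical; this is not the implementation of `prepared.evaluate()`, only an illustration of the kind of accuracy it reports.

```python
def threshold_predictions(scores, threshold=0.5):
    """Map continuous regression outputs to class labels 0/1."""
    return [1 if s >= threshold else 0 for s in scores]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up regression outputs; a linear regressor is not bounded,
# so values may fall below 0 or above 1.
scores = [0.91, 0.12, 0.47, 1.30, -0.05]
labels = [1, 0, 1, 1, 0]

preds = threshold_predictions(scores)
print(preds)                    # [1, 0, 0, 1, 0]
print(accuracy(labels, preds))  # 0.8
```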

In [5]:
demo_y_pred = prepared.demo() 

print(x_test.shape)      # `x_test` has 114 samples
print(demo_y_pred.shape) # `demo_y_pred` has 114 corresponding predictions
print()
print(demo_y_pred[:5])   # show the first five predictions
print()

prepared.evaluate(y_test, demo_y_pred) # try to beat this instead of merely 0.93

(114, 30)
(114,)

[1. 1. 0. 1. 0.]

Accuracy on test samples: 0.939


In [6]:
# TODO: generate your `y_pred`
from sklearn import linear_model

# fit a linear regression directly on the 0/1 labels
regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)

# set threshold: outputs >= 0.5 become class 1, the rest class 0
y_pred = [1 if x >= 0.5 else 0 for x in y_pred]
prepared.evaluate(y_test, y_pred) # evaluate accuracy on the test samples

Accuracy on test samples: 0.947
