## SI 670 Applied Machine Learning, Week 1: A simple classification task (Due 09/19 11:59pm)

For this assignment, you will use the Breast Cancer Wisconsin (Diagnostic) Database to create a classifier that can help diagnose patients. First, read through the description of the dataset (below). Then try the first one or two questions, which use basic numpy and pandas to prepare the data, so you can get familiar with the columns. Finally, use k-NN classifiers to learn from the data and make predictions.

Each question is worth 20 points, for a total of 80 points. Correct answers and code receive full credit, but partial credit will be awarded if you have the right idea even if your final answers aren't quite right.

Submit your completed notebook file to the Canvas site - IMPORTANT: please name your submitted file si670-hw1-youruniqname.ipynb.

As a reminder, the notebook code you submit must be your own work. Feel free to discuss general approaches to the homework with classmates. If you end up working through multiple questions as a group, please include the names of the people you worked with at the top of your notebook file.

### Put your name here: Huan Zhao
### Put your uniquename here: huanzhao

In [1]:
# import required modules and load data file
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.DESCR)  # print the data set description

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

The object returned by `load_breast_cancer()` is a scikit-learn Bunch object, which is similar to a dictionary.

In [2]:
cancer.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
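To make the dictionary-like behavior concrete, here is a minimal sketch showing that a Bunch field can be read either by key or by attribute. The variable names (`same_array`, `class_names`, and so on) are only illustrative.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Key access and attribute access return the very same underlying array.
same_array = cancer["data"] is cancer.data

# The feature matrix has one row per instance and one column per feature.
n_samples, n_features = cancer.data.shape

# target_names maps the integer labels in cancer.target to class names.
class_names = list(cancer.target_names)

print(same_array, n_samples, n_features, class_names)
```

Attribute access is convenient when exploring interactively, while key access makes it clearer that you are reading from a dictionary-like container.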

In [3]:
cancer.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

### Question 1

Scikit-learn works with lists, numpy arrays, scipy-sparse matrices, and pandas DataFrames, so converting the dataset to a DataFrame is not necessary for training this model. A DataFrame does, however, make many tasks easier, such as manipulating data, so let's practice creating a classifier with a pandas DataFrame.
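As a small illustration of the point about input types, the sketch below fits the same k-NN model on a list, a numpy array, and a DataFrame and compares the predictions. The tiny dataset and the column names `a` and `b` are made up for this example and are not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up dataset: two well-separated groups of points.
rows = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.3]]
labels = [0, 0, 0, 1, 1, 1]
queries = [[0.1, 0.1], [5.0, 5.0]]

# The same data held in three different containers.
containers = {
    "list": (rows, queries),
    "ndarray": (np.array(rows), np.array(queries)),
    "DataFrame": (pd.DataFrame(rows, columns=["a", "b"]),
                  pd.DataFrame(queries, columns=["a", "b"])),
}

predictions = {}
for name, (X, X_new) in containers.items():
    model = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
    predictions[name] = model.predict(X_new).tolist()

print(predictions)
```

All three containers should lead to the same predictions, since the model only sees the numeric values.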



Convert the sklearn.dataset `cancer` to a DataFrame. 

*This function should return a `(569, 31)` DataFrame with*

*columns =*

    ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
    'mean smoothness', 'mean compactness', 'mean concavity',
    'mean concave points', 'mean symmetry', 'mean fractal dimension',
    'radius error', 'texture error', 'perimeter error', 'area error',
    'smoothness error', 'compactness error', 'concavity error',
    'concave points error', 'symmetry error', 'fractal dimension error',
    'worst radius', 'worst texture', 'worst perimeter', 'worst area',
    'worst smoothness', 'worst compactness', 'worst concavity',
    'worst concave points', 'worst symmetry', 'worst fractal dimension',
    'target']

*and index =*

    RangeIndex(start=0, stop=569, step=1)

In [4]:
def answer_one():
    # Build the feature table, then append the class labels as a 'target' column.
    d = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
    d['target'] = cancer['target']
    return d


answer_one()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0
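One way to check a result against the specification in this question is to assert the expected shape, columns, and index directly. The sketch below rebuilds the DataFrame so that it runs on its own; the `*_ok` names are just illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

# 30 feature columns plus 'target', one row per instance.
shape_ok = df.shape == (569, 31)

# Column order: the feature names followed by 'target'.
columns_ok = list(df.columns) == list(cancer.feature_names) + ["target"]

# The default index is a RangeIndex from 0 to 568.
index_ok = df.index.equals(pd.RangeIndex(start=0, stop=569, step=1))

# Number of rows per class label (0 and 1).
class_counts = df["target"].value_counts().to_dict()

print(shape_ok, columns_ok, index_ok, class_counts)
```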


### Question 2
Using `train_test_split`, split the dataset into training and test sets (`X_train`, `X_test`, `y_train`, and `y_test`).

**Set the random number generator state to 0 using `random_state=0` so that your results match ours.**

*This function should return a tuple of length 4:* `(X_train, X_test, y_train, y_test)`*, where* 
* `X_train` *has shape* `(426, 30)`
* `X_test` *has shape* `(143, 30)`
* `y_train` *has shape* `(426,)`
* `y_test` *has shape* `(143,)`

In [5]:
from sklearn.model_selection import train_test_split

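def answer_two():
    df = answer_one()
    X = df.drop('target', axis=1)  # the 30 feature columns
    y = df['target']               # class labels
    # With no test_size given, train_test_split holds out 25% of the rows.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    return X_train, X_test, y_train, y_test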

answer_two()

(     mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
 293       11.850         17.46           75.54      432.7          0.08372   
 332       11.220         19.86           71.94      387.3          0.10540   
 565       20.130         28.25          131.20     1261.0          0.09780   
 278       13.590         17.84           86.24      572.3          0.07948   
 489       16.690         20.20          107.10      857.6          0.07497   
 ..           ...           ...             ...        ...              ...   
 277       18.810         19.98          120.90     1102.0          0.08923   
 9         12.460         24.04           83.97      475.9          0.11860   
 359        9.436         18.32           59.82      278.6          0.10090   
 192        9.720         18.22           60.73      288.1          0.06950   
 559       11.510         23.93           74.52      403.5          0.09261   
 
      mean compactness  mean concavity  mean conca
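The expected shapes (426 training rows and 143 test rows) follow from the default split: when neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows and rounds the test count up. The sketch below reproduces that arithmetic and cross-checks it on a placeholder array of the same length; `placeholder` is a stand-in, not the real data.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 569
default_test_fraction = 0.25  # used when neither test_size nor train_size is set

# The test count is rounded up; the training set gets the remaining rows.
n_test = math.ceil(default_test_fraction * n_samples)
n_train = n_samples - n_test

# Cross-check with scikit-learn on placeholder data of the same length.
placeholder = np.arange(n_samples)
train_part, test_part = train_test_split(placeholder, random_state=0)

print(n_train, n_test, len(train_part), len(test_part))
```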

### Question 3
Using `KNeighborsClassifier`, fit a k-nearest neighbors (knn) classifier with `n_neighbors = 5` on `X_train` and `y_train`. Then evaluate the classifier's accuracy on `X_test` and `y_test` using the `score` method.

*This function should return a tuple of (knn, accuracy), where*
* `knn` is a fitted `sklearn.neighbors.KNeighborsClassifier`
* `accuracy` is the `float` returned by the `score` method

In [8]:
from sklearn.neighbors import KNeighborsClassifier

def answer_three():
    X_train, X_test, y_train, y_test = answer_two()
    knn = KNeighborsClassifier(n_neighbors = 5)
    knn.fit(X_train, y_train)
    accuracy = knn.score(X_test, y_test)
    return (knn, accuracy)


answer_three()

(KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                      metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                      weights='uniform'), 0.9370629370629371)
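For a classifier, `score` reports accuracy: the fraction of test rows whose predicted label matches the true label. The sketch below computes that fraction by hand and compares it with `score`. It works on the raw numpy arrays rather than the DataFrame, purely to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Accuracy by hand: the share of test rows predicted correctly.
predictions = knn.predict(X_test)
n_correct = int(np.sum(predictions == y_test))
manual_accuracy = n_correct / len(y_test)

# The same quantity as reported by the estimator.
score_accuracy = knn.score(X_test, y_test)

print(n_correct, len(y_test), manual_accuracy, score_accuracy)
```

With 143 test rows, the accuracy shown in the output above (0.9370629370629371) corresponds to 134 correct predictions, since 134/143 gives that value.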

### Question 4
Recall from the fruits example in lab1 that feature scales matter. In this question, examine the mean and standard deviation of `X_train`, then use `sklearn.preprocessing.StandardScaler` to standardize the features. Then train another knn (k=5) classifier on the standardized data and evaluate it.

*This function should return a tuple of (standardized_X_train, knn, accuracy), where*
* `standardized_X_train` is a `pandas.DataFrame` of the standardized features
* `knn` is a fitted `sklearn.neighbors.KNeighborsClassifier`
* `accuracy` is the `float` returned by the `score` method

In [11]:
from sklearn.preprocessing import StandardScaler

def answer_four():
    X_train, X_test, y_train, y_test = answer_two()

    scaler = StandardScaler()
    columns = cancer.feature_names

    # Copy first so the DataFrames returned by answer_two() are not modified.
    standardized_X_train = X_train.copy()
    standardized_X_test = X_test.copy()

    # Fit the scaler on the training data only, then reuse its statistics on the test data.
    standardized_X_train[columns] = scaler.fit_transform(X_train[columns])
    standardized_X_test[columns] = scaler.transform(X_test[columns])

    knn = KNeighborsClassifier(n_neighbors = 5)
    knn.fit(standardized_X_train, y_train)
    accuracy = knn.score(standardized_X_test, y_test)

    return (standardized_X_train, knn, accuracy)


answer_four()    

(     mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
 293    -0.650799     -0.430573       -0.680248  -0.626983        -0.913819   
 332    -0.828353      0.152265       -0.827738  -0.753094         0.652812   
 565     1.682772      2.189772        1.600098   1.673839         0.103624   
 278    -0.160411     -0.338290       -0.241878  -0.239207        -1.220208   
 489     0.713269      0.234834        0.612740   0.553289        -1.546108   
 ..           ...           ...             ...        ...              ...   
 277     1.310754      0.181407        1.178115   1.232174        -0.515658   
 9      -0.478881      1.167376       -0.334878  -0.506984         1.606665   
 359    -1.331142     -0.221723       -1.324284  -1.055037         0.327635   
 192    -1.251102     -0.246008       -1.287002  -1.028648        -1.941379   
 559    -0.746622      1.140663       -0.722037  -0.708094        -0.271413   
 
      mean compactness  mean concavity  mean conca
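To see why feature scale matters for k-NN, compare how much two features contribute to the squared Euclidean distance between training rows 293 and 332 (the classifier output in Question 3 shows `metric='minkowski'` with `p=2`, which is Euclidean distance). The numbers below are copied from the `answer_two()` and `answer_four()` outputs above; the helper `squared_contributions` is only illustrative.

```python
import numpy as np

# Values for training rows 293 and 332, copied from the outputs above.
# Columns are (mean area, mean smoothness).
raw = np.array([[432.7, 0.08372],
                [387.3, 0.10540]])
standardized = np.array([[-0.626983, -0.913819],
                         [-0.753094,  0.652812]])

def squared_contributions(pair):
    """Per-feature squared difference: each feature's share of the squared distance."""
    return (pair[0] - pair[1]) ** 2

raw_area, raw_smoothness = squared_contributions(raw)
std_area, std_smoothness = squared_contributions(standardized)

print("raw:         ", raw_area, raw_smoothness)
print("standardized:", std_area, std_smoothness)
```

On the raw data, `mean area` dominates the distance and `mean smoothness` is effectively ignored, even though the two rows differ by more than one standard deviation in smoothness. After standardization, each feature contributes in proportion to how unusual the difference is.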

In [14]:
X_train, X_test, y_train, y_test = answer_two()
X_train.mean(axis = 0)    

mean radius                 14.159171
mean texture                19.233005
mean perimeter              92.143897
mean area                  658.415023
mean smoothness              0.096366
mean compactness             0.103670
mean concavity               0.088650
mean concave points          0.049144
mean symmetry                0.180473
mean fractal dimension       0.062617
radius error                 0.404795
texture error                1.212227
perimeter error              2.840979
area error                  40.695674
smoothness error             0.006987
compactness error            0.025078
concavity error              0.031699
concave points error         0.011702
symmetry error               0.020437
fractal dimension error      0.003713
worst radius                16.316817
worst texture               25.637981
worst perimeter            107.459131
worst area                 887.647887
worst smoothness             0.132503
worst compactness            0.252836
worst concav

In [15]:
X_train.std(axis = 0)

mean radius                  3.552381
mean texture                 4.122619
mean perimeter              24.437275
mean area                  360.425054
mean smoothness              0.013855
mean compactness             0.050683
mean concavity               0.078517
mean concave points          0.038819
mean symmetry                0.027692
mean fractal dimension       0.006852
radius error                 0.287167
texture error                0.546234
perimeter error              2.061693
area error                  48.515510
smoothness error             0.002827
compactness error            0.016972
concavity error              0.031449
concave points error         0.005913
symmetry error               0.008203
fractal dimension error      0.002630
worst radius                 4.894808
worst texture                6.064671
worst perimeter             33.965066
worst area                 586.352988
worst smoothness             0.022930
worst compactness            0.151899
worst concav
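One detail worth knowing when comparing these numbers with `StandardScaler`: pandas `std` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below uses a small made-up column to show the difference; none of these values come from the cancer data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small made-up column, used only to compare the two conventions.
values = pd.DataFrame({"x": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})
n = len(values)

sample_std = values["x"].std()            # pandas default: ddof=1
population_std = values["x"].std(ddof=0)  # the convention StandardScaler uses

scaler = StandardScaler().fit(values)
scaled = scaler.transform(values)[:, 0]

# Standardizing by hand with the population standard deviation.
manual = ((values["x"] - values["x"].mean()) / population_std).to_numpy()

# Measured with pandas' default, the scaled column's std is slightly above 1.
scaled_sample_std = pd.Series(scaled).std()

print(sample_std, population_std, scaler.scale_[0], scaled_sample_std)
```

For the 426 training rows here the two conventions differ by a factor of sqrt(426/425), so the effect is small, but it explains why calling `.std()` on the standardized DataFrame does not return exactly 1.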