# DAT210x - Programming with Python for DS

## Module5 - Lab7

In [123]:
import random, math
import pandas as pd
import numpy as np
import scipy.io
import matplotlib

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

matplotlib.style.use('ggplot') # Look Pretty
%matplotlib notebook


# Leave this alone until indicated:
Test_PCA = False

### A Convenience Function

This method is for your visualization convenience only. You aren't expected to know how to put this together yourself, although you should be able to follow the code by now:

In [124]:
def plotDecisionBoundary(model, X, y):
    print("Plotting...")

    fig = plt.figure()
    ax = fig.add_subplot(111)

    padding = 0.1
    resolution = 0.1

    #(2 for benign, 4 for malignant)
    colors = {2:'royalblue', 4:'lightsalmon'} 


    # Calculate the boundaries
    x_min, x_max = X[:, 0].min(), X[:, 0].max()
    y_min, y_max = X[:, 1].min(), X[:, 1].max()
    x_range = x_max - x_min
    y_range = y_max - y_min
    x_min -= x_range * padding
    y_min -= y_range * padding
    x_max += x_range * padding
    y_max += y_range * padding

    # Create a 2D Grid Matrix. The values stored in the matrix
    # are the predictions of the class at said location
    xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution),
                         np.arange(y_min, y_max, resolution))

    # What class does the classifier say?
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the contour map
    plt.contourf(xx, yy, Z, cmap=plt.cm.seismic)
    plt.axis('tight')

    # Plot your testing points as well...
    for label in np.unique(y):
        indices = np.where(y == label)
        plt.scatter(X[indices, 0], X[indices, 1], c=colors[label], alpha=0.8)

    p = model.get_params()
    plt.title('K = ' + str(p['n_neighbors']))
    plt.show()

### The Assignment

Load in the dataset, identify nans, and set proper headers. Be sure to verify the rows line up by looking at the file in a text editor.

In [125]:
# .. your code here ..
X = pd.read_csv('Datasets/breast-cancer-wisconsin.data', sep=',', names = ['sample', 'thickness', 'size', 'shape', 'adhesion', 'epithelial', 'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'status'])
X.head(5)

Unnamed: 0,sample,thickness,size,shape,adhesion,epithelial,nuclei,chromatin,nucleoli,mitoses,status
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
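The missing entries can also be turned into nans at load time. The sketch below assumes they are marked with `?` in the file and uses a tiny in-memory stand-in (`sample_csv`, a hypothetical name) rather than the real dataset:

```python
import io
import pandas as pd

# A tiny stand-in for the real file; the third row uses '?' for a missing value.
sample_csv = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
)

columns = ['sample', 'thickness', 'size', 'shape', 'adhesion', 'epithelial',
           'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'status']

# na_values tells pandas to parse '?' as NaN, so 'nuclei' loads as a numeric column.
df = pd.read_csv(sample_csv, names=columns, na_values='?')

print(df.dtypes['nuclei'])
print(df.isnull().sum()['nuclei'])
```

With this approach the separate `pd.to_numeric(..., errors = 'coerce')` step is not needed, although both routes should lead to the same nans.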


Copy out the status column into a slice, then drop it from the main dataframe. Always verify that you executed the drop properly by double checking (printing out the result of the operation)! Many people forget to set the right axis here.

If you goofed up on loading the dataset and notice you have a `sample` column, this would be a good place to drop that too if you haven't already.

In [126]:
# .. your code here ..
#X.dtypes # nuclei is object, want numeric
X.nuclei = pd.to_numeric(X.nuclei, errors = 'coerce')
X.isnull().sum() #16 nan values in nuclei
#X.nuclei.unique() # values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, so mean would not be appropriate
# Backfill along axis = 1: each nan takes the value of the next column in the same row (chromatin)
X = X.bfill(axis = 1)
X.head(10)

Unnamed: 0,sample,thickness,size,shape,adhesion,epithelial,nuclei,chromatin,nucleoli,mitoses,status
0,1000025.0,5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,2.0
1,1002945.0,5.0,4.0,4.0,5.0,7.0,10.0,3.0,2.0,1.0,2.0
2,1015425.0,3.0,1.0,1.0,1.0,2.0,2.0,3.0,1.0,1.0,2.0
3,1016277.0,6.0,8.0,8.0,1.0,3.0,4.0,3.0,7.0,1.0,2.0
4,1017023.0,4.0,1.0,1.0,3.0,2.0,1.0,3.0,1.0,1.0,2.0
5,1017122.0,8.0,10.0,10.0,8.0,7.0,10.0,9.0,7.0,1.0,4.0
6,1018099.0,1.0,1.0,1.0,1.0,2.0,10.0,3.0,1.0,1.0,2.0
7,1018561.0,2.0,1.0,2.0,1.0,2.0,1.0,3.0,1.0,1.0,2.0
8,1033078.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,5.0,2.0
9,1033078.0,4.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0
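The fill call above works along `axis = 1`, so a missing `nuclei` value is copied from the neighboring column in the same row rather than derived from other `nuclei` values. The sketch below contrasts that with column-wise fills on a made-up four-row frame (all values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with one missing 'nuclei' value.
df = pd.DataFrame({
    'nuclei':    [1.0, np.nan, 1.0, 10.0],
    'chromatin': [3.0, 7.0, 3.0, 9.0],
})

# Row-wise backfill: the gap takes the value of the next column in the same row.
row_filled = df.bfill(axis=1)

# Column-wise alternatives: the gap is filled from the 'nuclei' column itself.
mean_filled = df.fillna(df.mean())
median_filled = df.fillna(df.median())

print(row_filled.loc[1, 'nuclei'])     # 7.0, copied from 'chromatin'
print(mean_filled.loc[1, 'nuclei'])    # 4.0, the mean of 1, 1 and 10
print(median_filled.loc[1, 'nuclei'])  # 1.0, the median of 1, 1 and 10
```

Which fill is preferable depends on the feature. For an ordinal 1-10 score, the median keeps the filled value on the original scale.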


In [127]:
#X.describe()

With the labels safely extracted from the dataset, replace any nan values with the mean feature / column value:

In [128]:
# .. your code here ..
y = X.status
X.drop(labels = ['sample', 'status'], axis = 1, inplace = True)
y.head(5)

0    2.0
1    2.0
2    2.0
3    2.0
4    2.0
Name: status, dtype: float64

Do train_test_split. Use the same variable names as on the EdX platform in the reading material, but set the random_state=7 for reproducibility, and keep the test_size at 0.5 (50%).

In [129]:
# .. your code here ..
#No, do this after normalizing the data.

#from sklearn.model_selection import train_test_split
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .5, random_state = 7)

#

Experiment with the basic SKLearn preprocessing scalers. We know that the features consist of different units mixed in together, so it might be reasonable to assume feature scaling is necessary. Print out a description of the dataset, post transformation. Recall: when you do pre-processing, which portion of the dataset is your model trained upon? Also which portion(s) of your dataset actually get transformed?

In [130]:
# .. your code here ..
from sklearn import preprocessing
#X = preprocessing.StandardScaler().fit_transform(X) # accuracy = 0.957142857143
X = preprocessing.MinMaxScaler().fit_transform(X) # accuracy = 0.962857142857
#X = preprocessing.MaxAbsScaler().fit_transform(X) # accuracy = 0.96
#X = preprocessing.RobustScaler().fit_transform(X) # accuracy = 0.945714285714
#X = preprocessing.Normalizer().fit_transform(X) # accuracy = 0.857142857143
#X = X # No Change, accuracy = 0.957142857143
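One common way to answer the "which portion" question: the model only ever trains on the training portion, so the scaler is fit on that portion alone and then applied to both. The following is a minimal sketch on made-up data, not the notebook's own pipeline; `X_demo`, `y_demo`, `X_tr` and `X_te` are hypothetical names, and `sklearn.model_selection` is assumed to be available:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 20 samples, 3 features on a 1-10 scale.
rng = np.random.RandomState(7)
X_demo = rng.randint(1, 11, size=(20, 3)).astype(float)
y_demo = rng.choice([2, 4], size=20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=7)

# Learn the per-feature min and max from the training half only...
scaler = MinMaxScaler().fit(X_tr)

# ...then apply that same mapping to both halves.
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

Fitting on the full dataset before splitting, as the cell above does, lets the test half influence the scaling parameters, which can make the reported accuracy slightly optimistic.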


### Dimensionality Reduction

PCA and Isomap are your new best friends

In [131]:
#split the normalized data
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in 0.20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .5, random_state = 7)


from sklearn.decomposition import PCA
from sklearn import manifold
model = None

if Test_PCA:
    print('Computing 2D Principal Components...')
    # TODO: Implement PCA here. Save your model into the variable 'model'.
    # You should reduce down to two dimensions.
    
    # .. your code here ..
    model = PCA(n_components = 2)
    # Fit on the training half only, then project both halves
    X_train = model.fit_transform(X_train)
    X_test = model.transform(X_test)
    print('Done.')

else:
    print('Computing 2D Isomap Manifold...')
    # TODO: Implement Isomap here. Save your model into the variable 'model'
    # Experiment with K values from 5-10.
    # You should reduce down to two dimensions.

    # .. your code here ..
    model = manifold.Isomap(n_neighbors = 5, n_components = 2)
    X_train = model.fit_transform(X_train)
    X_test = model.transform(X_test)
    print('\n', 'Done.')

Computing 2D Isomap Manifold...

 Done.


Train your model against `data_train`, then transform both `data_train` and `data_test` using your model (this notebook stores them as `X_train` and `X_test`). You can save the results right back into the variables themselves.

In [132]:
# .. your code here ..
# Done above 
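The fit-then-transform pattern described in the prompt can be sketched in isolation as follows. The arrays are random placeholders, and `data_train` and `data_test` simply reuse the names from the prompt:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-ins for the two halves of the split: 30 samples, 9 features each.
rng = np.random.RandomState(7)
data_train = rng.rand(30, 9)
data_test = rng.rand(30, 9)

# Learn the projection from the training half only.
model = PCA(n_components=2)
model.fit(data_train)

# Apply the learned projection to both halves, saving back into the same variables.
data_train = model.transform(data_train)
data_test = model.transform(data_test)

print(data_train.shape, data_test.shape)
```

`manifold.Isomap` exposes the same `fit` and `transform` methods, so the pattern should carry over unchanged.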

Implement and train `KNeighborsClassifier` on your projected 2D training data here. You can name your variable `knmodel`. You can use any `K` value from 1 - 15, so play around with it and see what results you can come up with. Your goal is to find a good balance where you are neither too specific (low K) nor too general (high K). You should also experiment with how changing the `weights` parameter affects the results.

In [133]:
# .. your code here ..
from sklearn.neighbors import KNeighborsClassifier
knmodel = KNeighborsClassifier(n_neighbors = 5, weights = 'distance') # 'distance' generally performs better than 'uniform'
knmodel.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='distance')
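The K and `weights` experiment can be automated with a small loop. This is a sketch on synthetic 2D data from `make_classification`, standing in for the projected training and testing sets; `X_demo`, `y_demo` and `results` are hypothetical names:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2D data standing in for the projected training and testing sets.
X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=7)

# Score every combination of K in 1..15 and both built-in weighting schemes.
results = {}
for weights in ('uniform', 'distance'):
    for k in range(1, 16):
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        knn.fit(X_tr, y_tr)
        results[(k, weights)] = knn.score(X_te, y_te)

best = max(results, key=results.get)
print(best, results[best])
```

Scoring against the testing half mirrors what the lab asks for. Outside of an exercise, a separate validation set or cross-validation would usually be used to pick K, so that the test score stays an honest estimate.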

Always keep the domain of the problem in mind! It is far better to errantly classify a benign tumor as malignant and have it removed than to incorrectly leave a malignant tumor in place, believing it to be benign, and have the patient's cancer progress. Since user-defined weight functions (UDF weights) don't receive any class information, the only way to introduce this preference into SKLearn's KNN classifier is by "baking" it into your data, for example by randomly reducing the ratio of benign to malignant samples in the training set.
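A minimal sketch of reducing the share of benign samples in a training set is shown below. The data is made up, the 50% keep fraction is an arbitrary illustration, and `X_tr_demo` and `y_tr_demo` are hypothetical names:

```python
import numpy as np

rng = np.random.RandomState(7)

# Hypothetical training labels: 2 = benign, 4 = malignant.
y_tr_demo = np.array([2] * 12 + [4] * 6)
X_tr_demo = rng.rand(len(y_tr_demo), 2)

benign_idx = np.where(y_tr_demo == 2)[0]
malignant_idx = np.where(y_tr_demo == 4)[0]

# Keep every malignant sample, but only a random half of the benign ones.
keep_benign = rng.choice(benign_idx, size=len(benign_idx) // 2, replace=False)
keep = np.sort(np.concatenate([keep_benign, malignant_idx]))

X_tr_biased = X_tr_demo[keep]
y_tr_biased = y_tr_demo[keep]

print(np.unique(y_tr_biased, return_counts=True))
```

Only the training set is resampled here. The testing set keeps its original class ratio so the score still reflects the real distribution.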

Calculate and display the accuracy of the testing set:

In [134]:
# .. your code changes above ..
print(knmodel.score(X_test, y_test))

0.962857142857
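Accuracy alone does not show which kind of mistake the classifier makes, and that distinction matters for the domain concern raised above. As a sketch on made-up labels (not results from this notebook), a confusion matrix and the recall for the malignant class could be inspected like this:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted labels: 2 = benign, 4 = malignant.
y_true = np.array([2, 2, 2, 2, 4, 4, 4, 4])
y_pred = np.array([2, 2, 2, 4, 4, 4, 4, 2])

# Rows are true classes, columns are predicted classes, both ordered [2, 4].
cm = confusion_matrix(y_true, y_pred, labels=[2, 4])
print(cm)

# Fraction of malignant samples that were correctly flagged as malignant.
malignant_recall = recall_score(y_true, y_pred, pos_label=4)
print(malignant_recall)
```

In the notebook itself, `y_true` would correspond to `y_test` and `y_pred` to `knmodel.predict(X_test)`.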


In [135]:
plotDecisionBoundary(knmodel, X_test, y_test)

Plotting...

