# DAT210x - Programming with Python for DS

## Module 5 - Lab 7

In [186]:
import random, math
import pandas as pd
import numpy as np
import scipy.io

from matplotlib.colors import ListedColormap
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import matplotlib

from sklearn.decomposition import PCA
from sklearn import manifold



matplotlib.style.use('ggplot') # Look Pretty


# Leave this alone until indicated:
Test_PCA = False

### A Convenience Function

This function is for your visualization convenience only. You aren't expected to know how to put it together yourself, although you should be able to follow the code by now:

In [187]:
def plotDecisionBoundary(model, X, y):
    print("Plotting...")

    fig = plt.figure()
    ax = fig.add_subplot(111)

    padding = 0.1
    resolution = 0.1

    #(2 for benign, 4 for malignant)
    colors = {2:'royalblue', 4:'lightsalmon'} 


    # Calculate the boundaries
    x_min, x_max = X[:, 0].min(), X[:, 0].max()
    y_min, y_max = X[:, 1].min(), X[:, 1].max()
    x_range = x_max - x_min
    y_range = y_max - y_min
    x_min -= x_range * padding
    y_min -= y_range * padding
    x_max += x_range * padding
    y_max += y_range * padding

    # Create a 2D grid matrix. The values stored in the matrix
    # are the predictions of the class at each grid location
    xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution),
                         np.arange(y_min, y_max, resolution))

    # What class does the classifier say?
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the contour map
    plt.contourf(xx, yy, Z, cmap=plt.cm.seismic)
    plt.axis('tight')

    # Plot your testing points as well...
    for label in np.unique(y):
        indices = np.where(y == label)
        plt.scatter(X[indices, 0], X[indices, 1], c=colors[label], alpha=0.8)

    p = model.get_params()
    plt.title('K = ' + str(p['n_neighbors']))
    plt.show()

### The Assignment

Load in the dataset, identify NaNs, and set proper headers. Be sure to verify that the rows line up by looking at the file in a text editor.

In [188]:
df = pd.read_csv('Datasets/breast-cancer-wisconsin.data',  names = ['sample', 'thickness', 'size', 'shape', 'adhesion',
                                                                    'epithelial', 'nuclei', 'chromatin', 'nucleoli', 'mitoses',
                                                                    'status'])
df.nuclei = pd.to_numeric(df.nuclei, errors = 'coerce')
print(df.head(10))
print('------------------------------------------------')
print(df.dtypes)
print('------------------------------------------------')
print(df.isnull().sum())
print('------------------------------------------------')

    sample  thickness  size  shape  adhesion  epithelial  nuclei  chromatin  \
0  1000025          5     1      1         1           2     1.0          3   
1  1002945          5     4      4         5           7    10.0          3   
2  1015425          3     1      1         1           2     2.0          3   
3  1016277          6     8      8         1           3     4.0          3   
4  1017023          4     1      1         3           2     1.0          3   
5  1017122          8    10     10         8           7    10.0          9   
6  1018099          1     1      1         1           2    10.0          3   
7  1018561          2     1      2         1           2     1.0          3   
8  1033078          2     1      1         1           2     1.0          1   
9  1033078          4     2      1         1           2     1.0          2   

   nucleoli  mitoses  status  
0         1        1       2  
1         2        1       2  
2         1        1       2  
3     
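
The `pd.to_numeric(..., errors='coerce')` call above is what exposes the missing values: any entry that cannot be parsed as a number becomes NaN. A minimal sketch on a made-up Series, assuming the raw column uses a non-numeric placeholder such as `?` (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical raw column: numeric strings mixed with a '?' placeholder.
raw = pd.Series(['1', '10', '?', '4'])

# errors='coerce' turns anything unparseable into NaN instead of raising.
nuclei = pd.to_numeric(raw, errors='coerce')

print(nuclei.isnull().sum())  # how many values became NaN
print(nuclei.dtype)           # the column is now floating point
```

Because NaN is a float, the coerced column is stored as floating point, which matches the `1.0`, `10.0` values printed for `nuclei` above.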

Copy out the status column into a slice, then drop it from the main dataframe. Always verify that you properly executed the drop by double checking (printing out the result of the operation)! Many people forget to set the right axis here.

If you goofed up on loading the dataset and notice you have a `sample` column, this would be a good place to drop that too if you haven't already.

In [189]:
y = df.status
df.drop(labels=['sample', 'status'], axis=1, inplace = True)
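
One way to do the double check described above is to print the remaining columns after the drop. This is a small sketch on a toy frame; the values and the names `toy`, `features`, and `labels` are hypothetical, and only the column names mirror the lab:

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({'sample': [1, 2], 'thickness': [5, 3], 'status': [2, 4]})

# Copy the labels out first, then drop along the column axis.
labels = toy['status'].copy()
features = toy.drop(labels=['sample', 'status'], axis=1)

# Verify the drop: only feature columns should remain.
print(features.columns.tolist())
print(labels.tolist())
```

With `axis=0` (the default), pandas would look for rows named `sample` and `status` instead of columns, which is the mistake the text warns about.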

With the labels safely extracted from the dataset, replace any nan values with the mean feature / column value:

In [190]:
# Assign the filled column back rather than relying on an in-place chained call
df['nuclei'] = df['nuclei'].fillna(df['nuclei'].mean())
print(df.isnull().sum())

thickness     0
size          0
shape         0
adhesion      0
epithelial    0
nuclei        0
chromatin     0
nucleoli      0
mitoses       0
dtype: int64
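
To see what mean imputation does to a column, here is a minimal sketch on a toy frame (the values are made up; `toy` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value.
toy = pd.DataFrame({'nuclei': [1.0, np.nan, 3.0]})

# mean() skips NaN by default, so the fill value is the mean of 1.0 and 3.0.
toy['nuclei'] = toy['nuclei'].fillna(toy['nuclei'].mean())

print(toy['nuclei'].tolist())
print(toy.isnull().sum())
```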


Perform a `train_test_split`. Use the same variable names as on the edX platform in the reading material, but set `random_state=7` for reproducibility, and keep the `test_size` at 0.5 (50%).

In [191]:
# sklearn.cross_validation was removed in newer releases; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
data_train, data_test, label_train, label_test = train_test_split(df, y, test_size = 0.5, random_state = 7)
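
The effect of `test_size` and `random_state` can be checked on a small example. This sketch uses made-up arrays (`X` and `y` are hypothetical stand-ins for the features and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, labelled 2 or 4.
X = np.arange(20).reshape(10, 2)
y = np.array([2, 4] * 5)

# test_size=0.5 puts half of the samples in each portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=7)

# Repeating the call with the same random_state gives the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.5, random_state=7)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_train, X_train2))
```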

Experiment with the basic SKLearn preprocessing scalers. We know that the features consist of different units mixed in together, so it might be reasonable to assume feature scaling is necessary. Print out a description of the dataset, post transformation. Recall: when you do pre-processing, which portion of the dataset is your model trained upon? Also which portion(s) of your dataset actually get transformed?

In [192]:
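from sklearn import preprocessing
T = preprocessing.StandardScaler()
# Scalers are unsupervised, so the labels are not passed in.
# NOTE: the returned arrays are not assigned here, so data_train and data_test stay unscaled.
T.fit_transform(data_train)
T.transform(data_test)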

array([[ 1.19102499,  0.24973863,  2.17854679, ...,  1.46915484,
         2.3126771 , -0.3431157 ],
       [-1.18810343, -0.38414343, -0.10663561, ..., -0.61175035,
        -0.61758461, -0.3431157 ],
       [-0.16847696, -0.70108446, -0.75954487, ..., -1.02793139,
        -0.61758461, -0.3431157 ],
       ..., 
       [ 0.17139853, -0.70108446, -0.75954487, ..., -1.02793139,
        -0.61758461, -0.3431157 ],
       [-0.84822794, -0.70108446, -0.75954487, ..., -0.61175035,
        -0.61758461, -0.3431157 ],
       [ 0.51127402,  0.88362069,  0.87272828, ...,  1.46915484,
         1.01033856,  0.20871554]])
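
As for the two questions in the prompt: the scaler is fitted on the training portion only, and both portions are then transformed with it. A minimal sketch on random data, assuming you want to keep the scaled arrays (the names `train`, `test`, and `scaler` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
train = rng.uniform(1, 10, size=(50, 3))  # made-up training features
test = rng.uniform(1, 10, size=(50, 3))   # made-up testing features

# Fit on the training portion only, so nothing about the test set leaks in.
scaler = StandardScaler().fit(train)

# transform() returns new arrays, so the results have to be assigned.
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Training columns end up with mean ~0 and standard deviation ~1;
# the test columns are only approximately standardized.
print(train_scaled.mean(axis=0))
print(train_scaled.std(axis=0))
```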

### Dimensionality Reduction

PCA and Isomap are your new best friends.

In [193]:
model = None

if Test_PCA:
    print('Computing 2D Principal Components')
    # TODO: Implement PCA here. Save your model into the variable 'model'.
    # You should reduce down to two dimensions.
    
    model = PCA(n_components = 2)

else:
    print('Computing 2D Isomap Manifold')
    # TODO: Implement Isomap here. Save your model into the variable 'model'
    # Experiment with K values from 5-10.
    # You should reduce down to two dimensions.

    model = manifold.Isomap(n_neighbors = 5, n_components = 2)

Computing 2D Isomap Manifold


Train your model against data_train, then transform both `data_train` and `data_test` using your model. You can save the results right back into the variables themselves.

In [194]:
# NOTE: neither result is assigned here, so data_train and data_test are left unchanged.
model.fit_transform(data_train)
model.transform(data_test)

array([[  1.52136328e+01,  -3.77260170e-01],
       [ -1.12574472e+01,  -1.07767157e+00],
       [ -9.83619282e+00,  -1.36613843e+00],
       [ -1.19223792e+01,  -3.78241565e-01],
       [  2.39646267e+01,   1.34487701e+00],
       [ -9.60345160e+00,  -2.95256810e-01],
       [  1.67288915e+01,   8.98354538e+00],
       [  1.92012854e+00,   7.70517860e+00],
       [ -9.00289315e+00,   1.93050957e-01],
       [  1.75637410e+01,   1.26806807e+01],
       [ -8.53861444e+00,  -1.25647630e+00],
       [ -8.82782401e+00,  -6.25090763e-01],
       [ -9.86397610e+00,   1.77168396e+00],
       [ -9.10485916e+00,   2.27107202e+00],
       [  1.65304818e+01,   8.86136147e+00],
       [ -8.32366320e+00,   4.08186846e+00],
       [ -9.20439631e+00,  -2.15627741e+00],
       [ -1.00880575e+01,  -7.88672147e-01],
       [  1.65000353e+01,   1.14995170e+01],
       [ -1.19549866e+01,  -2.14012923e+00],
       [ -1.02010889e+01,   2.37224687e-01],
       [ -9.11935187e+00,  -3.46783662e-01],
       [ -
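
Saving the projected results, as the instructions suggest, could look like the sketch below. It runs on random nine-feature data rather than the lab's dataset, and the names `train`, `test`, and `iso` are hypothetical:

```python
import numpy as np
from sklearn import manifold

rng = np.random.RandomState(0)
train = rng.rand(60, 9)  # made-up training features
test = rng.rand(40, 9)   # made-up testing features

iso = manifold.Isomap(n_neighbors=10, n_components=2)

# Fit on the training data and keep the 2D projection...
train_2d = iso.fit_transform(train)
# ...then project the testing data with the already-fitted model.
test_2d = iso.transform(test)

print(train_2d.shape, test_2d.shape)
```

In the lab itself, the same pattern would assign the results back to `data_train` and `data_test`.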

Implement and train `KNeighborsClassifier` on your projected 2D training data here. You can name your variable `knmodel`. You can use any `K` value from 1 - 15, so play around with it and see what results you can come up with. Your goal is to find a good balance where you are neither too specific (low-K) nor too general (high-K). You should also experiment with how changing the weights parameter affects the results.

In [195]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 4, weights = 'distance')
knn.fit(data_train, label_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
           weights='distance')
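
One way to play around with `K` and the weights parameter is a small loop over both. The sketch below uses two made-up 2D clusters instead of the lab data, so the scores it prints say nothing about the tumor dataset; all names are hypothetical:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
# Two made-up 2D clusters, labelled 2 and 4 like the lab's classes.
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
y = np.array([2] * 40 + [4] * 40)

# Alternate rows between the training and testing portions.
X_train, X_test = X[::2], X[1::2]
y_train, y_test = y[::2], y[1::2]

results = {}
for weights in ('uniform', 'distance'):
    for k in (1, 5, 15):
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        knn.fit(X_train, y_train)
        results[(weights, k)] = knn.score(X_test, y_test)

for key, score in sorted(results.items()):
    print(key, score)
```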

Be sure to always keep the domain of the problem in mind! It is far less harmful to errantly classify a benign tumor as malignant, and have it removed, than to incorrectly leave a malignant tumor in place, believing it to be benign, and have the patient's cancer progress. Since the user-defined function (UDF) weights don't give you any class information, the only way to introduce this data into SKLearn's KNN Classifier is by "baking" it into your data. For example, you could randomly reduce the ratio of benign samples to malignant samples in the training set.
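
A rough sketch of "baking" the preference into the data by randomly dropping some benign training samples. The arrays are made up, and keeping half of the benign rows is an arbitrary choice for illustration, not a recommendation:

```python
import numpy as np

rng = np.random.RandomState(7)

# Made-up training set: 8 benign (2) and 4 malignant (4) samples.
y = np.array([2] * 8 + [4] * 4)
X = np.arange(len(y) * 2).reshape(len(y), 2)

benign = np.where(y == 2)[0]
malignant = np.where(y == 4)[0]

# Randomly keep only half of the benign rows; keep every malignant row.
kept_benign = rng.choice(benign, size=len(benign) // 2, replace=False)
keep = np.sort(np.concatenate([kept_benign, malignant]))

X_bal, y_bal = X[keep], y[keep]
print((y_bal == 2).sum(), (y_bal == 4).sum())
```

Only the training set should be resampled this way; the testing set stays untouched so the score still reflects the original class mix.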

Calculate and display the accuracy of the testing set:

In [196]:
knn.score(data_test, label_test)

0.95999999999999996

In [197]:
plotDecisionBoundary(knn, data_test, label_test)

Plotting...


TypeError: unhashable type: 'slice'
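
This error is what pandas raises when a DataFrame is indexed NumPy-style, as `plotDecisionBoundary` does with `X[:, 0]`. It suggests that `data_test` was still a DataFrame when it was passed in, which is what happens when the results of the `transform` calls are never assigned. A minimal sketch of the failure and one possible fix, on a toy frame (the names `frame` and `arr` are hypothetical):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

try:
    frame[:, 0]  # NumPy-style indexing is not valid on a DataFrame
    failed = False
except Exception as err:
    # TypeError in the pandas version used above; newer versions may raise a different type.
    failed = True
    print(type(err).__name__)

# Converting to an ndarray first makes column slicing work.
arr = np.asarray(frame)
print(arr[:, 0])
```

Note that the classifier must also be trained on the same 2D projection, because `plotDecisionBoundary` calls `model.predict` on two-column grid points.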