In [16]:
import pandas as pd
import sklearn
import numpy as np
%matplotlib inline
import matplotlib
import matplotlib.pyplot  as plt

###  Machine Learning

####  Familiarising ourselves with the data

In this programming task, we are going to use the Breast Cancer Wisconsin dataset. It contains data about different patients and their corresponding diagnoses. In particular, the features are computed from a digitised image of a fine needle aspirate (FNA) of a breast mass, and they describe characteristics of the cell nuclei present in the image. Each tumour is diagnosed as either 'malignant' or 'benign' (labelled 0 and 1, respectively).

This dataset is built into scikit-learn, just like the iris dataset that we saw in this week's programming task.

We'll load the dataset and call it cancer_dataset.

In [35]:
from sklearn.datasets import load_breast_cancer
cancer_dataset = load_breast_cancer()

Note that, similarly to the iris_dataset object that we saw in this week's programming task, the cancer_dataset object that is returned by load_breast_cancer is a Bunch object. By running the next cell, you will see that its structure is very similar to that of the iris_dataset object.

In [11]:
print("Keys of cancer_dataset: ", cancer_dataset.keys())

Keys of cancer_dataset:  dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])


Optional: If you are interested in exploring the cancer_dataset object (e.g. its feature names, target names, etc.), then write your code in the following cell and run it. Remember that you can add more cells if you wish to.

In [12]:
print(cancer_dataset.data)

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
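
For the optional exploration above, a minimal sketch of inspecting the other fields of the Bunch might look like the following. It reloads the dataset so that it runs on its own; the names `class_counts` and `first_features` are illustrative and not part of the original task.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer_dataset = load_breast_cancer()

# Class names indexed by label: position 0 is label 0, position 1 is label 1
print("Target names:", cancer_dataset.target_names)

# Number of samples per class label
class_counts = dict(zip(cancer_dataset.target_names,
                        np.bincount(cancer_dataset.target)))
print("Samples per class:", class_counts)

# One feature name per column of cancer_dataset.data
first_features = list(cancer_dataset.feature_names[:5])
print("Number of features:", len(cancer_dataset.feature_names))
print("First five feature names:", first_features)
```

Checking the class counts early is useful because accuracy scores are easier to interpret once you know how balanced the two classes are.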


### Question

Write your code in the next cell to get the shape of the data part of the cancer_dataset.

In [36]:
cancer_dataset.data.shape

(569, 30)

### Splitting our dataset into training data and test data

In [37]:
# sklearn.cross_validation has been removed; import train_test_split from sklearn.model_selection instead
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    cancer_dataset['data'], cancer_dataset['target'], random_state=0)
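
train_test_split is called here without a test_size argument, so it falls back to its default of holding out 25% of the samples for testing, and random_state=0 makes the shuffle reproducible. A quick way to confirm the result is to look at the shapes of the four arrays. The sketch below reloads the data so that it runs on its own:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer_dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer_dataset['data'], cancer_dataset['target'], random_state=0)

# Rows are patients, columns are the 30 features
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
```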

### K Nearest Neighbours

We will now learn how to build a classification model for the breast cancer dataset with the use of the k nearest neighbours algorithm.

##### Building and evaluating the model for 1 nearest neighbour

Run the code below to create a KNeighborsClassifier model called knn_model. Note that n_neighbors=1 sets the number of nearest neighbours to 1.

In [22]:
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=1)
knn_model.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=1)

In [23]:
print("Test set score:{:.3f}".format(knn_model.score(X_test, y_test)))

Test set score:0.916
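
To make the role of n_neighbors concrete, here is a minimal NumPy sketch of k nearest neighbours classification on a made-up toy dataset. It assumes Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy points are hypothetical, and this is not how scikit-learn implements KNeighborsClassifier internally.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=1):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours (ties go to the smaller label)
    return np.bincount(y_train[nearest]).argmax()

# Toy data: five points in 2D, labelled 0 or 1
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [0.4, 0.4]])
y_toy = np.array([0, 0, 1, 1, 1])

query = np.array([0.3, 0.3])
pred_k1 = knn_predict(X_toy, y_toy, query, k=1)
pred_k3 = knn_predict(X_toy, y_toy, query, k=3)
print("k=1 prediction:", pred_k1)  # the single closest point, (0.4, 0.4), has label 1
print("k=3 prediction:", pred_k3)  # two of the three closest points have label 0
```

The same query point receives different labels for k=1 and k=3, which is why the number of neighbours is worth tuning.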


#### Write your code in the next cell(s) to build and evaluate a K Nearest Neighbours model for 5 neighbours.

In [24]:
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

KNeighborsClassifier()

In [25]:
print("Test set score:{:.3f}".format(knn_model.score(X_test, y_test)))

Test set score:0.937
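
To see how the choice of n_neighbors affects the test score more broadly, one option is to loop over several values. This is a sketch that uses the same split as above; the range of 1 to 10 and the name `test_scores` are arbitrary choices and not part of the task.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer_dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer_dataset['data'], cancer_dataset['target'], random_state=0)

test_scores = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the given data
    test_scores[k] = model.score(X_test, y_test)
    print("n_neighbors={:2d}  test set score: {:.3f}".format(k, test_scores[k]))
```

Note that picking k by its test score means the test set is no longer a fully independent check; a separate validation set or cross-validation is the usual remedy.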


### Using the model to make predictions

The following code specifies a previously unseen patient case.

In [26]:
X_new = np.array([[
  1.239e+01, 1.538e+01, 1.328e+02, 1.382e+03, 1.007e-01, 2.661e-01, 3.791e-01,
  1.001e-01, 2.009e-01, 6.371e-02, 6.895e-01, 8.943e-01, 4.259e+00, 9.594e+01,
  5.789e-03, 3.864e-02, 3.233e-02, 1.187e-02, 3.003e-02, 5.923e-03, 2.242e+01,
  1.689e+01, 1.926e+02, 2.721e+03, 1.782e-01, 5.461e-01, 6.579e-01, 1.958e-01,
  4.811e-01, 1.008e-01]])
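
A prediction for this case can be obtained with the model's predict method. The sketch below refits the 5-neighbour model so that it runs on its own; the names `prediction` and `predicted_name` are illustrative. X_new has to be two-dimensional (one row per patient), which is why it is written with double brackets.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer_dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer_dataset['data'], cancer_dataset['target'], random_state=0)

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# The unseen patient case from the cell above: 1 row, 30 features
X_new = np.array([[
  1.239e+01, 1.538e+01, 1.328e+02, 1.382e+03, 1.007e-01, 2.661e-01, 3.791e-01,
  1.001e-01, 2.009e-01, 6.371e-02, 6.895e-01, 8.943e-01, 4.259e+00, 9.594e+01,
  5.789e-03, 3.864e-02, 3.233e-02, 1.187e-02, 3.003e-02, 5.923e-03, 2.242e+01,
  1.689e+01, 1.926e+02, 2.721e+03, 1.782e-01, 5.461e-01, 6.579e-01, 1.958e-01,
  4.811e-01, 1.008e-01]])

# predict() returns an array with one label per row of X_new
prediction = knn_model.predict(X_new)
# Map the numeric label back to its class name
predicted_name = cancer_dataset['target_names'][prediction]
print("Prediction:", prediction)
print("Predicted target name:", predicted_name)
```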

###  Decision Tree

Use the training and test data specified in Section 2.2 to create a Decision Tree with a maximum depth of 5.

Important note: You should set the random_state parameter of the DecisionTreeClassifier to 20.

In [27]:
from sklearn.tree import DecisionTreeClassifier
decision_model = DecisionTreeClassifier(random_state=20, max_depth=5)
decision_model.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=5, random_state=20)

#### Evaluate the accuracy of the decision tree that you just built.

In [28]:
print("Test set score:{:.3f}".format(decision_model.score(X_test, y_test)))

Test set score:0.909
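
max_depth caps how many successive splits the tree may make, which is one way to limit overfitting. As a hedged sketch, the fitted tree's actual depth and the features it relied on most can be inspected as follows. The name `top_features` and the choice of showing three features are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer_dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer_dataset['data'], cancer_dataset['target'], random_state=0)

decision_model = DecisionTreeClassifier(random_state=20, max_depth=5)
decision_model.fit(X_train, y_train)

# The fitted depth can be at most max_depth
print("Depth of fitted tree:", decision_model.get_depth())
print("Number of leaves:", decision_model.get_n_leaves())

# Comparing training and test accuracy gives a rough indication of overfitting
print("Training set score: {:.3f}".format(decision_model.score(X_train, y_train)))
print("Test set score: {:.3f}".format(decision_model.score(X_test, y_test)))

# feature_importances_ sums to 1; larger values mean the feature contributed more to the splits
order = np.argsort(decision_model.feature_importances_)[::-1]
top_features = [(cancer_dataset['feature_names'][i],
                 decision_model.feature_importances_[i]) for i in order[:3]]
for name, importance in top_features:
    print("{}: {:.3f}".format(name, importance))
```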


Write your code in the next cell(s) to use your decision tree model to make a prediction for the new patient case specified earlier in this notebook.

In [29]:
X_new = np.array([[
  1.239e+01, 1.538e+01, 1.328e+02, 1.382e+03, 1.007e-01, 2.661e-01, 3.791e-01,
  1.001e-01, 2.009e-01, 6.371e-02, 6.895e-01, 8.943e-01, 4.259e+00, 9.594e+01,
  5.789e-03, 3.864e-02, 3.233e-02, 1.187e-02, 3.003e-02, 5.923e-03, 2.242e+01,
  1.689e+01, 1.926e+02, 2.721e+03, 1.782e-01, 5.461e-01, 6.579e-01, 1.958e-01,
  4.811e-01, 1.008e-01]])
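
With X_new defined as above, one way to answer the question is to call predict on the depth-5 tree. The sketch below refits that tree so that it runs on its own; the names `tree_prediction` and `tree_probabilities` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer_dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer_dataset['data'], cancer_dataset['target'], random_state=0)

decision_model = DecisionTreeClassifier(random_state=20, max_depth=5)
decision_model.fit(X_train, y_train)

# The unseen patient case: 1 row, 30 features
X_new = np.array([[
  1.239e+01, 1.538e+01, 1.328e+02, 1.382e+03, 1.007e-01, 2.661e-01, 3.791e-01,
  1.001e-01, 2.009e-01, 6.371e-02, 6.895e-01, 8.943e-01, 4.259e+00, 9.594e+01,
  5.789e-03, 3.864e-02, 3.233e-02, 1.187e-02, 3.003e-02, 5.923e-03, 2.242e+01,
  1.689e+01, 1.926e+02, 2.721e+03, 1.782e-01, 5.461e-01, 6.579e-01, 1.958e-01,
  4.811e-01, 1.008e-01]])

# Predicted label (0 or 1) and the class fractions in the leaf that X_new falls into
tree_prediction = decision_model.predict(X_new)
tree_probabilities = decision_model.predict_proba(X_new)
print("Prediction:", tree_prediction)
print("Predicted target name:", cancer_dataset['target_names'][tree_prediction])
print("Class probabilities:", tree_probabilities)
```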

In [33]:
# sklearn.cross_validation has been removed; use sklearn.model_selection instead
from sklearn.model_selection import train_test_split
# Recreate the same split. The training features go in X_train so that
# X_new (the unseen patient case) is not overwritten.
X_train, X_test, y_train, y_test = train_test_split(
    cancer_dataset['data'], cancer_dataset['target'], random_state=0)


# Refit a shallower tree (max_depth=2) for comparison with the depth-5 tree
from sklearn.tree import DecisionTreeClassifier
decision_model = DecisionTreeClassifier(random_state=20, max_depth=2)
decision_model.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=2, random_state=20)

In [34]:
print("Test set score:{:.3f}".format(decision_model.score(X_test, y_test)))

Test set score:0.937
