## We covered a lot of information today, and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments along with the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later).

In [189]:
import numpy as np
import pandas as pd
import pydotplus
import matplotlib.pyplot as plt
%matplotlib inline

from pandas.plotting import scatter_matrix
from sklearn import datasets
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
from io import StringIO

### 1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?)

In [190]:
iris = datasets.load_iris()
iris

 'data': array([[ 5.1,  3.5,  1.4,  0.2],
        [ 4.9,  3. ,  1.4,  0.2],
        [ 4.7,  3.2,  1.3,  0.2],
        [ 4.6,  3.1,  1.5,  0.2],
        [ 5. ,  3.6,  1.4,  0.2],
        [ 5.4,  3.9,  1.7,  0.4],
        [ 4.6,  3.4,  1.4,  0.3],
        [ 5. ,  3.4,  1.5,  0.2],
        [ 4.4,  2.9,  1.4,  0.2],
        [ 4.9,  3.1,  1.5,  0.1],
        [ 5.4,  3.7,  1.5,  0.2],
        [ 4.8,  3.4,  1.6,  0.2],
        [ 4.8,  3. ,  1.4,  0.1],
        [ 4.3,  3. ,  1.1,  0.1],
        [ 5.8,  4. ,  1.2,  0.2],
        [ 5.7,  4.4,  1.5,  0.4],
        [ 5.4,  3.9,  1.3,  0.4],
        [ 5.1,  3.5,  1.4,  0.3],
        [ 5.7,  3.8,  1.7,  0.3],
        [ 5.1,  3.8,  1.5,  0.3],
        [ 5.4,  3.4,  1.7,  0.2],
        [ 5.1,  3.7,  1.5,  0.4],
        [ 4.6,  3.6,  1. ,  0.2],
        [ 5.1,  3.3,  1.7,  0.5],
        [ 4.8,  3.4,  1.9,  0.2],
        [ 5. ,  3. ,  1.6,  0.2],
        [ 5. ,  3.4,  1.6,  0.4],
        [ 5.2,  3.5,  1.5,  0.2],
        [ 5.2,  3.4,  1.4,  0.2],
      

In [193]:
x = iris.data[:,2:]
y = iris.target
dt = tree.DecisionTreeClassifier()

In [194]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5)

In [195]:
dt = dt.fit(x_train,y_train)

In [196]:
def measure_performance(x,y,dt, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    y_pred=dt.predict(x)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)),"\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y,y_pred),"\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y,y_pred),"\n")

In [197]:
measure_performance(x_test,y_test,dt)

Accuracy:0.947 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        30
          1       0.93      0.93      0.93        28
          2       0.88      0.88      0.88        17

avg / total       0.95      0.95      0.95        75
 

Confusion matrix
[[30  0  0]
 [ 0 26  2]
 [ 0  2 15]] 



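Nearly 95% accuracy on the test set. Precision and recall are 100% for the first species (class 0) and lower for the other two (0.93 for class 1, 0.88 for class 2).

Reading the confusion matrix (rows are true classes, columns are predicted classes): 30 samples of class 0, 26 of class 1, and 15 of class 2 were classified correctly. Two class 1 samples were predicted as class 2, and two class 2 samples were predicted as class 1. With only the two petal features and half the data held out, this is a good result.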
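
As a rough check on the numbers in the report, the per-class precision and recall can be recomputed by hand from the confusion matrix. The sketch below copies the matrix printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the 50-50 iris output above
# (rows = true class, columns = predicted class).
cm = np.array([[30, 0, 0],
               [0, 26, 2],
               [0, 2, 15]])

correct = np.diag(cm)                 # correctly classified samples per class
precision = correct / cm.sum(axis=0)  # correct / everything predicted as that class
recall = correct / cm.sum(axis=1)     # correct / everything truly in that class
accuracy = correct.sum() / cm.sum()   # correct / all test samples

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("accuracy: ", round(float(accuracy), 3))
```

Here precision and recall happen to be equal for each class because the two kinds of error between classes 1 and 2 are symmetric (two each way); in general they differ.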

### 2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be.

In [198]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)
dt = dt.fit(x_train,y_train)
y_pred=dt.predict(x_test)
measure_performance(x_test,y_test,dt)

Accuracy:0.974 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        12
          1       0.93      1.00      0.97        14
          2       1.00      0.92      0.96        12

avg / total       0.98      0.97      0.97        38
 

Confusion matrix
[[12  0  0]
 [ 0 14  0]
 [ 0  1 11]] 



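Over 97% accuracy, again with 100% precision for the first species (class 0), and higher scores for the other two classes (f1 of 0.97 for class 1 and 0.96 for class 2).

12 samples of class 0, 14 of class 1, and 11 of class 2 were classified correctly; one class 2 sample was predicted as class 1.

The test set is smaller here (38 samples instead of 75) because only 25% of the data is held out. The results are slightly better than with the 50-50 split, plausibly because the tree was trained on more data. The difference amounts to only a few samples, though, and the splits are random, so it may not hold up on a different split.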
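
A single random split can be lucky or unlucky, so one way to compare the two split sizes more fairly is to repeat each split several times and average the accuracy. This is a minimal sketch, not part of the original exercise: it assumes a recent scikit-learn (where `train_test_split` lives in `sklearn.model_selection`), and the helper `mean_accuracy` and the choice of 20 repeats are hypothetical.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split


def mean_accuracy(x, y, test_size, n_repeats=20):
    """Average test accuracy of a decision tree over several random splits."""
    scores = []
    for seed in range(n_repeats):
        # A different seed gives a different (but reproducible) split.
        x_tr, x_te, y_tr, y_te = train_test_split(
            x, y, test_size=test_size, random_state=seed)
        clf = tree.DecisionTreeClassifier(random_state=seed)
        clf.fit(x_tr, y_tr)
        scores.append(clf.score(x_te, y_te))  # fraction classified correctly
    return float(np.mean(scores)), float(np.std(scores))


iris = datasets.load_iris()
x, y = iris.data[:, 2:], iris.target  # petal length and width only

for test_size in (0.5, 0.25):
    mean, std = mean_accuracy(x, y, test_size)
    print("test_size={0}: mean accuracy {1:.3f} (std {2:.3f})".format(
        test_size, mean, std))
```

If the two means are within a standard deviation of each other, the gap seen in a single run is probably mostly noise.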

### 3. Load the breast cancer dataset (`datasets.load_breast_cancer()`) and perform basic exploratory analysis. What attributes do we have? What are we trying to predict?
For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

In [199]:
bc = datasets.load_breast_cancer()
bc

 'data': array([[  1.80e+01,   1.04e+01,   1.23e+02, ...,   2.65e-01,   4.60e-01,
           1.19e-01],
        [  2.06e+01,   1.78e+01,   1.33e+02, ...,   1.86e-01,   2.75e-01,
           8.90e-02],
        [  1.97e+01,   2.12e+01,   1.30e+02, ...,   2.43e-01,   3.61e-01,
           8.76e-02],
        ..., 
        [  1.66e+01,   2.81e+01,   1.08e+02, ...,   1.42e-01,   2.22e-01,
           7.82e-02],
        [  2.06e+01,   2.93e+01,   1.40e+02, ...,   2.65e-01,   4.09e-01,
           1.24e-01],
        [  7.76e+00,   2.45e+01,   4.79e+01, ...,   0.00e+00,   2.87e-01,
           7.04e-02]]),
 'feature_names': array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
        'mean smoothness', 'mean compactness', 'mean concavity',
        'mean concave points', 'mean symmetry', 'mean fractal dimension',
        'radius error', 'texture error', 'perimeter error', 'area error',
        'smoothness error', 'compactness error', 'concavity error',
        'concave points error', 

The attributes (x) are listed under `feature_names` and include the mean radius, texture, perimeter, area, and so on, along with an error value for each measurement. 
The target (y) is what we are trying to predict. Here, that is whether a tumor is malignant or benign. 
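
To make the exploratory step more concrete, the object returned by `load_breast_cancer()` can be put into a pandas DataFrame. This is one possible sketch; the names `df` and `diagnosis` are illustrative and not from the original notebook.

```python
import pandas as pd
from sklearn import datasets

bc = datasets.load_breast_cancer()

# One row per tumor, one column per attribute.
df = pd.DataFrame(bc.data, columns=bc.feature_names)
# Translate the numeric target (0/1) into its label via target_names.
df["diagnosis"] = bc.target_names[bc.target]

print(df.shape)                        # (number of samples, number of columns)
print(list(bc.target_names))           # class labels, in target order
print(df["diagnosis"].value_counts())  # class balance
print(df.iloc[:, :5].describe())       # summary statistics for the first few attributes
```

The class counts are worth checking before modeling, since an imbalanced target makes plain accuracy harder to interpret.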

In [200]:
x = bc.data[:,:] 
y = bc.target 
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5)
dt = dt.fit(x_train,y_train)

In [201]:
def measure_performance(x,y,dt, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    y_pred=dt.predict(x)
    if show_accuracy:
        print("Accuracy:{0:.5f}".format(metrics.accuracy_score(y, y_pred)),"\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y,y_pred),"\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y,y_pred),"\n")

In [202]:
measure_performance(x_test,y_test,dt)

Accuracy:0.92632 

Classification report
             precision    recall  f1-score   support

          0       0.98      0.83      0.90       112
          1       0.90      0.99      0.94       173

avg / total       0.93      0.93      0.93       285
 

Confusion matrix
[[ 93  19]
 [  2 171]] 



### 4. Using the breast cancer data, create a classifier to predict whether a tumor is malignant or benign. Perform the above hold out evaluation (50-50 and 75-25) and discuss the results.

In [203]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)
dt = dt.fit(x_train,y_train)

In [204]:
measure_performance(x_test,y_test,dt)

Accuracy:0.95105 

Classification report
             precision    recall  f1-score   support

          0       0.98      0.89      0.93        53
          1       0.94      0.99      0.96        90

avg / total       0.95      0.95      0.95       143
 

Confusion matrix
[[47  6]
 [ 1 89]] 



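Over 95% accuracy with the 75-25 split, with roughly 95% average precision; 7 of the 143 test samples were misclassified. The 50-50 split gave about 92.6% accuracy, with 21 of the 285 test samples misclassified. In both cases most of the errors are class 0 samples predicted as class 1 (19 and 6 respectively), which shows up as the lower recall for class 0 (0.83 and 0.89). In scikit-learn's encoding of this dataset class 0 is malignant, so these are malignant tumors predicted as benign.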
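
Because the two kinds of error are not equally costly here, it can help to look at the mistakes on class 0 separately. The sketch below only redoes the arithmetic on the two confusion matrices printed above; it assumes scikit-learn's encoding for this dataset (class 0 is malignant, class 1 is benign), and the helper `summarize` is hypothetical.

```python
# Confusion matrices copied from the outputs above
# (rows = true class, columns = predicted class; class 0 = malignant).
results = {
    "50-50": [[93, 19], [2, 171]],
    "75-25": [[47, 6], [1, 89]],
}


def summarize(cm):
    """Return (misclassified, total, recall for class 0) for a 2x2 matrix."""
    total = sum(sum(row) for row in cm)
    errors = cm[0][1] + cm[1][0]                 # off-diagonal entries
    recall_0 = cm[0][0] / (cm[0][0] + cm[0][1])  # share of malignant cases caught
    return errors, total, recall_0


for split, cm in results.items():
    errors, total, recall_0 = summarize(cm)
    print("{0}: {1} of {2} misclassified, {3} malignant predicted as benign, "
          "class 0 recall {4:.2f}".format(split, errors, total, cm[0][1], recall_0))
```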