## We covered a lot of information today, and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments along with the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later).

### 1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?)

In [1]:
import pandas as pd
%matplotlib inline
from sklearn import datasets
from pandas.plotting import scatter_matrix  # pandas.tools.plotting no longer exists in current pandas
import matplotlib.pyplot as plt
iris = datasets.load_iris() 

In [2]:
x = iris.data[:,2:] # the attributes: columns 2 and 3 only (petal length, petal width)
y = iris.target # the target variable: species encoded as 0, 1, 2

x.shape

(150, 2)

In [3]:
from sklearn import tree
dt = tree.DecisionTreeClassifier()
dt = dt.fit(x,y)
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
x.shape

(150, 2)

In [4]:
len(x[0])

2

In [5]:
print(iris.DESCR)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5,random_state=42)
dt = dt.fit(x_train,y_train)
from sklearn import metrics
import numpy as np

In [7]:
def measure_performance(X, y, clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    """Print accuracy, classification report, and confusion matrix for clf on (X, y)."""
    y_pred = clf.predict(X)  # predicted class labels for X
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)),"\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y,y_pred),"\n")
    if show_confusion_matrix:
        # rows are true classes, columns are predicted classes
        print("Confusion matrix")
        print(metrics.confusion_matrix(y,y_pred),"\n")

In [8]:
measure_performance(x_test,y_test,dt)

Accuracy:0.947 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        29
          1       0.85      1.00      0.92        23
          2       1.00      0.83      0.90        23

avg / total       0.95      0.95      0.95        75
 

Confusion matrix
[[29  0  0]
 [ 0 23  0]
 [ 0  4 19]] 
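
The accuracy, precision, and recall in the report can be recomputed by hand from the confusion matrix, which helps when deciding whether the results are good. The sketch below copies the matrix from the output above; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([[29, 0, 0],
               [0, 23, 0],
               [0, 4, 19]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actually in that class

print("accuracy: ", round(float(accuracy), 3))
print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
```

All errors come from the last row: four flowers whose true class is 2 were predicted as class 1, which lowers recall for class 2 and precision for class 1.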



### 2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be.

In [9]:
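x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75,random_state=42)
# Note: dt is not refit on this split, so the output below still scores the tree
# trained in exercise 1 (on 50% of the data), not a tree trained on the new 75% split.
measure_performance(x_test,y_test,dt)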

Accuracy:0.974 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        15
          1       0.92      1.00      0.96        11
          2       1.00      0.92      0.96        12

avg / total       0.98      0.97      0.97        38
 

Confusion matrix
[[15  0  0]
 [ 0 11  0]
 [ 0  1 11]] 
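
The cell above creates a new 75/25 split but evaluates `dt` as it was fitted on the 50% training set. To compare the two splits as the exercise asks, the tree would need to be refit on the new training data first. A minimal sketch, assuming a current scikit-learn (`sklearn.model_selection`); the names `dt_75` and `acc_75` and the `random_state` on the tree are additions for illustration, not part of the original notebook:

```python
from sklearn import datasets, metrics, tree
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x = iris.data[:, 2:]  # petal length and petal width, as in exercise 1
y = iris.target

# 75% training / 25% test, same seed as the notebook
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

# Fit a fresh tree on the 75% training set (random_state is an assumption,
# added only so repeated runs give the same tree)
dt_75 = tree.DecisionTreeClassifier(random_state=42)
dt_75.fit(x_train, y_train)

acc_75 = metrics.accuracy_score(y_test, dt_75.predict(x_test))
print("Accuracy: {0:.3f}".format(acc_75))
```

With only 38 test rows, a single misclassified flower moves accuracy by about 2.6 percentage points, so small differences between the two splits should be read cautiously.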



### 3. Load the breast cancer dataset (`datasets.load_breast_cancer()`) and perform basic exploratory analysis. What attributes do we have? What are we trying to predict?
For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

In [10]:
bc = datasets.load_breast_cancer() 

In [11]:
print(bc.DESCR)

Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)
        
        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.
 

In [12]:
print(bc.target_names)

['malignant' 'benign']


In [13]:
print(bc.feature_names)

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
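
Beyond printing the description and names, a quick way to explore the data is to load it into a pandas DataFrame and look at its shape, class balance, and feature ranges. This is one possible sketch; the names `df` and `class_counts` are illustrative.

```python
import pandas as pd
from sklearn import datasets

bc = datasets.load_breast_cancer()

# One row per tumor image, one column per feature, plus the encoded target
df = pd.DataFrame(bc.data, columns=bc.feature_names)
df["target"] = bc.target

print(df.shape)  # rows, columns (30 features + target)

# How many samples fall in each class, using the readable class names
class_counts = df["target"].map(dict(enumerate(bc.target_names))).value_counts()
print(class_counts)

# Range and spread of the first few features
print(df.describe().T[["mean", "std", "min", "max"]].head())
```

The target is binary: 0 corresponds to `malignant` and 1 to `benign`, following the order of `bc.target_names`.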


### 4. Using the breast cancer data, create a classifier to predict whether a tumor is malignant or benign. Perform the above holdout evaluation (50-50 and 75-25) and discuss the results.

In [20]:
x = bc.data[:,13:] # the attributes: columns 13 onward (17 of the 30 features)
y = bc.target # the target variable: 0 = malignant, 1 = benign

dt = tree.DecisionTreeClassifier()
dt = dt.fit(x,y)

In [21]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5,random_state=42)

In [22]:
dt = dt.fit(x_train,y_train)

In [23]:
# Same helper as in exercise 1, redefined here
def measure_performance(X, y, clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    """Print accuracy, classification report, and confusion matrix for clf on (X, y)."""
    y_pred = clf.predict(X)  # predicted class labels for X
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)),"\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y,y_pred),"\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y,y_pred),"\n")

In [24]:
measure_performance(x_test,y_test,dt)

Accuracy:0.919 

Classification report
             precision    recall  f1-score   support

          0       0.84      0.94      0.89        98
          1       0.97      0.91      0.94       187

avg / total       0.92      0.92      0.92       285
 

Confusion matrix
[[ 92   6]
 [ 17 170]] 



In [26]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75,random_state=42)

In [27]:
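# Note: dt is not refit on the 75% split, so the output below still scores the
# tree trained on the 50% split in In [22].
measure_performance(x_test,y_test,dt)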

Accuracy:0.930 

Classification report
             precision    recall  f1-score   support

          0       0.89      0.93      0.91        54
          1       0.95      0.93      0.94        89

avg / total       0.93      0.93      0.93       143
 

Confusion matrix
[[50  4]
 [ 6 83]] 
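
Because the second evaluation above reuses the tree fitted on the 50% split, the two results do not yet compare a 50-50 model with a 75-25 model. One way to make the comparison is a small helper that fits a fresh tree for each split. The sketch below assumes a current scikit-learn; `holdout_accuracy`, its `seed` parameter, and `results` are hypothetical names introduced for illustration.

```python
from sklearn import datasets, metrics, tree
from sklearn.model_selection import train_test_split

bc = datasets.load_breast_cancer()
x = bc.data[:, 13:]  # same feature subset as the notebook
y = bc.target


def holdout_accuracy(x, y, test_size, seed=42):
    """Fit a fresh tree on the training part and score it on the held-out part."""
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=seed)
    clf = tree.DecisionTreeClassifier(random_state=seed)  # seeded for repeatability
    clf.fit(x_train, y_train)
    return metrics.accuracy_score(y_test, clf.predict(x_test))


# Evaluate the 50-50 and 75-25 splits, refitting for each one
results = {ts: holdout_accuracy(x, y, ts) for ts in (0.5, 0.25)}
for ts, acc in results.items():
    print("test_size={0:.2f}  accuracy={1:.3f}".format(ts, acc))
```

For this dataset it is also worth reading the confusion matrix alongside accuracy, since predicting benign for a malignant tumor is the costlier kind of error.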

