## We covered a lot of information today, and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments along with the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later).

In [1]:
import pandas as pd
%matplotlib inline
from sklearn import datasets
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in later pandas versions
import matplotlib.pyplot as plt

### 1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?)

In [59]:
iris = datasets.load_iris() # load iris data set
x = iris.data[:,2:] # the attributes: petal length and petal width only (columns 2 and 3)
y = iris.target # the target variable
iris['feature_names']

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [6]:
# 50% for training, 50% for test
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in later scikit-learn versions
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5)

In [10]:
from sklearn import tree
dt = tree.DecisionTreeClassifier()
dt = dt.fit(x_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [12]:
from sklearn import metrics
import numpy as np

In [13]:
def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    """Print accuracy, classification report and confusion matrix of clf on (X, y)."""
    y_pred=clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)),"\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y,y_pred),"\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y,y_pred),"\n")

In [14]:
measure_performance(x_test,y_test,dt) #measure on the test data (rather than train)

Accuracy:0.973 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        23
          1       0.96      0.96      0.96        25
          2       0.96      0.96      0.96        27

avg / total       0.97      0.97      0.97        75
 

Confusion matrix
[[23  0  0]
 [ 0 24  1]
 [ 0  1 26]] 



**Comment:** The classifier misclassified 2 of the 75 entries in the test set (one each from classes 1 and 2), and precision is high for every class. Since we are only sorting flowers, this is good enough and there is no need to push for higher precision.
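The figures in the report can be recomputed by hand from the confusion matrix. A minimal sketch in plain Python, assuming scikit-learn's convention that rows are true classes and columns are predicted classes (the helper name `per_class_scores` is made up for this illustration):

```python
# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = [[23, 0, 0],
      [0, 24, 1],
      [0, 1, 26]]

def per_class_scores(cm):
    """Return overall accuracy plus a (precision, recall) pair per class."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n_classes))
    scores = []
    for k in range(n_classes):
        predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
        actual_k = sum(cm[k])                                   # row sum
        precision = cm[k][k] / predicted_k if predicted_k else 0.0
        recall = cm[k][k] / actual_k if actual_k else 0.0
        scores.append((precision, recall))
    return correct / total, scores

accuracy, scores = per_class_scores(cm)
print("Accuracy: {0:.3f}".format(accuracy))
for k, (p, r) in enumerate(scores):
    print("class {0}: precision {1:.2f}, recall {2:.2f}".format(k, p, r))
```

Accuracy is the diagonal divided by the total (73 of 75), precision divides each diagonal entry by its column sum, and recall divides it by its row sum.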

### 2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be.

In [18]:
iris = datasets.load_iris() # load iris data set
x = iris.data[:,2:] # the attributes: petal length and petal width only (columns 2 and 3)
y = iris.target # the target variable

In [19]:
# 75% for training, 25% for test
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in later scikit-learn versions
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)

In [17]:
dt = tree.DecisionTreeClassifier()
dt = dt.fit(x_train,y_train)

In [20]:
measure_performance(x_test,y_test,dt) #measure on the test data (rather than train)

Accuracy:0.974 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        10
          1       1.00      0.93      0.97        15
          2       0.93      1.00      0.96        13

avg / total       0.98      0.97      0.97        38
 

Confusion matrix
[[10  0  0]
 [ 0 14  1]
 [ 0  0 13]] 



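**Comment:** With more data for learning, the accuracy is marginally higher (0.974 versus 0.973), but both runs make only one or two errors, so the difference is too small to attribute to the split. On the other hand, there were not many values left to test (38 instead of 75), which makes the estimate itself less reliable. So maybe we would achieve the best result with a split in between, for example 2/3 to 1/3.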
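The concern about the small test set can be made concrete with a rough estimate of how uncertain each accuracy figure is. The sketch below uses the binomial standard error, sqrt(p(1-p)/n), which is only an approximation (it is crude when accuracy is close to 1); the function name `accuracy_standard_error` is hypothetical.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Approximate binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Correct predictions and test-set sizes taken from the two runs above
acc_50, n_50 = 73 / 75, 75   # 50/50 split
acc_75, n_75 = 37 / 38, 38   # 75/25 split

se_50 = accuracy_standard_error(acc_50, n_50)
se_75 = accuracy_standard_error(acc_75, n_75)

print("50/50 split: accuracy {0:.3f} +/- {1:.3f}".format(acc_50, se_50))
print("75/25 split: accuracy {0:.3f} +/- {1:.3f}".format(acc_75, se_75))
print("one extra error changes accuracy by {0:.1f} points (n=75) or {1:.1f} points (n=38)".format(
    100.0 / n_50, 100.0 / n_75))
```

Under this approximation the uncertainty of either figure is far larger than the 0.001 gap between them, so the two runs cannot be ranked on this evidence alone.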

### 3. Load the breast cancer dataset (`datasets.load_breast_cancer()`) and perform basic exploratory analysis. What attributes do we have? What are we trying to predict?
For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

We want to predict whether a tumor is benign or malignant.

In [47]:
cancer = datasets.load_breast_cancer()
cancer['feature_names']

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'], 
      dtype='<U23')
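Besides the attribute names, it helps to look at the size of the data and at the target itself. A small sketch using only attributes of the object returned by `datasets.load_breast_cancer()` (`data`, `target`, `target_names`); the variable name `counts` is introduced here for illustration:

```python
from collections import Counter
from sklearn import datasets

cancer = datasets.load_breast_cancer()

print(cancer.data.shape)          # (number of samples, number of attributes)
print(list(cancer.target_names))  # class labels; position in this list = value in cancer.target

# Count how many samples fall into each class
counts = Counter(str(cancer.target_names[label]) for label in cancer.target)
for name, count in counts.items():
    print(name, count)
```

The position of each name in `target_names` gives the integer used in `target`, which is needed to read the classification report and confusion matrix further down.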

In [76]:
x = cancer.data[:,2:] # all attributes except the first two (mean radius, mean texture)
y = cancer.target # the target variable

In [77]:
# 75% for training, 25% for test
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in later scikit-learn versions
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)

In [78]:
dt = tree.DecisionTreeClassifier()
dt = dt.fit(x_train,y_train)

In [79]:
measure_performance(x_test,y_test,dt) #measure on the test data (rather than train)

Accuracy:0.930 

Classification report
             precision    recall  f1-score   support

          0       0.91      0.91      0.91        54
          1       0.94      0.94      0.94        89

avg / total       0.93      0.93      0.93       143
 

Confusion matrix
[[49  5]
 [ 5 84]] 
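The confusion matrix above can be read in terms of the diagnosis. The sketch below assumes scikit-learn's label encoding for this dataset (0 = malignant, 1 = benign) and its convention of rows = true class, columns = predicted class; the variable names are illustrative only.

```python
# Confusion matrix from the output above; rows = true class, columns = predicted class.
# Assumed label encoding (scikit-learn's load_breast_cancer): 0 = malignant, 1 = benign.
cm = [[49, 5],
      [5, 84]]

missed_malignant = cm[0][1]   # malignant tumors predicted as benign
false_alarms = cm[1][0]       # benign tumors predicted as malignant

sensitivity = cm[0][0] / (cm[0][0] + cm[0][1])   # share of malignant cases detected
specificity = cm[1][1] / (cm[1][0] + cm[1][1])   # share of benign cases recognized

print("missed malignant cases:", missed_malignant)
print("false alarms:", false_alarms)
print("sensitivity (malignant recall): {0:.3f}".format(sensitivity))
print("specificity (benign recall): {0:.3f}".format(specificity))
```

For a diagnostic task the two kinds of error are not equally costly: the 5 malignant cases predicted as benign are usually the more serious ones, so recall on class 0 arguably deserves more attention here than overall accuracy.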



### 4. Using the breast cancer data, create a classifier to predict the diagnosis (malignant or benign). Perform the above hold out evaluation (50-50 and 75-25) and discuss the results.

In [61]:
cancer = datasets.load_breast_cancer()

In [62]:
x = cancer.data[:,2:] # all attributes except the first two (mean radius, mean texture)
y = cancer.target # the target variable

In [63]:
# 75% for training, 25% for test (matches test_size=0.25 below and the 143 test rows in the output)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)

In [65]:
dt = tree.DecisionTreeClassifier()
dt = dt.fit(x_train,y_train)

In [66]:
measure_performance(x_test,y_test,dt) #measure on the test data (rather than train)

Accuracy:0.944 

Classification report
             precision    recall  f1-score   support

          0       0.90      0.90      0.90        42
          1       0.96      0.96      0.96       101

avg / total       0.94      0.94      0.94       143
 

Confusion matrix
[[38  4]
 [ 4 97]] 



In [68]:
# 75% for training, 25% for test (matches test_size=0.25 below and the 143 test rows in the output)
cancer = datasets.load_breast_cancer()
x = cancer.data[:,2:] # all attributes except the first two (mean radius, mean texture)
y = cancer.target # the target variable

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)

dt = tree.DecisionTreeClassifier()
dt = dt.fit(x_train,y_train)

measure_performance(x_test,y_test,dt) #measure on the test data (rather than train)

Accuracy:0.930 

Classification report
             precision    recall  f1-score   support

          0       0.89      0.95      0.92        57
          1       0.96      0.92      0.94        86

avg / total       0.93      0.93      0.93       143
 

Confusion matrix
[[54  3]
 [ 7 79]] 
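Both outputs above report 143 test samples, which corresponds to a 25% test set of the 569 rows, so they show two random 75-25 splits rather than one 50-50 and one 75-25 split. A sketch of how the two splits could be compared in one loop follows; `random_state=0` is an arbitrary seed added here for repeatability and is not part of the original cells, and the `results` dictionary is introduced only for this example.

```python
from sklearn import datasets, metrics, tree
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
x = cancer.data[:, 2:]   # same attribute selection as in the cells above
y = cancer.target

results = {}
for test_size in (0.5, 0.25):
    # random_state is an arbitrary seed chosen so the run can be repeated
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=0)
    dt = tree.DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
    accuracy = metrics.accuracy_score(y_test, dt.predict(x_test))
    results[test_size] = (len(y_test), accuracy)

for test_size, (n_test, accuracy) in results.items():
    print("test_size={0}: {1} test samples, accuracy {2:.3f}".format(
        test_size, n_test, accuracy))
```

Because the two recorded 75-25 runs already differ (0.944 versus 0.930) purely through the random split, any difference between the 50-50 and 75-25 results should be judged against that run-to-run variation.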

