## We covered a lot of information today and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments along with the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later)

### 1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?)

In [31]:
import pandas as pd
%matplotlib inline

In [32]:
from sklearn import datasets
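from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas
from sklearn import tree
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn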
from sklearn import metrics
import numpy as np

In [33]:
import matplotlib.pyplot as plt

In [34]:
iris = datasets.load_iris()

In [51]:
iris

 'data': array([[ 5.1,  3.5,  1.4,  0.2],
        [ 4.9,  3. ,  1.4,  0.2],
        [ 4.7,  3.2,  1.3,  0.2],
        [ 4.6,  3.1,  1.5,  0.2],
        [ 5. ,  3.6,  1.4,  0.2],
        [ 5.4,  3.9,  1.7,  0.4],
        [ 4.6,  3.4,  1.4,  0.3],
        [ 5. ,  3.4,  1.5,  0.2],
        [ 4.4,  2.9,  1.4,  0.2],
        [ 4.9,  3.1,  1.5,  0.1],
        [ 5.4,  3.7,  1.5,  0.2],
        [ 4.8,  3.4,  1.6,  0.2],
        [ 4.8,  3. ,  1.4,  0.1],
        [ 4.3,  3. ,  1.1,  0.1],
        [ 5.8,  4. ,  1.2,  0.2],
        [ 5.7,  4.4,  1.5,  0.4],
        [ 5.4,  3.9,  1.3,  0.4],
        [ 5.1,  3.5,  1.4,  0.3],
        [ 5.7,  3.8,  1.7,  0.3],
        [ 5.1,  3.8,  1.5,  0.3],
        [ 5.4,  3.4,  1.7,  0.2],
        [ 5.1,  3.7,  1.5,  0.4],
        [ 4.6,  3.6,  1. ,  0.2],
        [ 5.1,  3.3,  1.7,  0.5],
        [ 4.8,  3.4,  1.9,  0.2],
        [ 5. ,  3. ,  1.6,  0.2],
        [ 5. ,  3.4,  1.6,  0.4],
        [ 5.2,  3.5,  1.5,  0.2],
        [ 5.2,  3.4,  1.4,  0.2],
      

In [50]:
iris['feature_names']

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [35]:
x = iris.data[:,2:] # the attributes (columns 2 and 3: petal length and petal width)
y = iris.target # the target variable

In [36]:
dt = tree.DecisionTreeClassifier()

In [37]:
#dt = dt.fit(x,y)

In [38]:
#Creating the training/test (holdout) split

In [39]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5)

In [40]:
dt = dt.fit(x_train,y_train)

In [41]:
def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    #Predict on X with the fitted classifier and print the selected metrics against the true labels y
    y_pred=clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)),"\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y,y_pred),"\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y,y_pred),"\n")

In [42]:
measure_performance(x_test,y_test,dt)

Accuracy:0.973 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        25
          1       0.96      0.96      0.96        28
          2       0.95      0.95      0.95        22

avg / total       0.97      0.97      0.97        75
 

Confusion matrix
[[25  0  0]
 [ 0 27  1]
 [ 0  1 21]] 



In [None]:
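#On the first run, 4 test samples were predicted wrongly, which I would not consider good.
#When I re-ran the split and the fit 2-3 times, the results improved (2 errors out of 75 in the output above).
#The results change between runs because train_test_split shuffles the data randomly when no random_state is set.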
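The run-to-run variation noted above can be removed by fixing the random seed. The following is a minimal sketch, not part of the original exercise: the seed value 42 is arbitrary, and `stratify=y` is an optional extra that keeps the three classes equally represented in both halves.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X = iris.data[:, 2:]   # petal length and petal width, as in the cells above
y = iris.target

# random_state fixes the shuffle, so the same split is produced on every run.
# stratify=y keeps the class proportions identical in the training and test halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

# The tree itself also takes a random_state, which makes the fit repeatable too.
clf = tree.DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
print("Accuracy: {0:.3f}".format(acc))
```

Re-running this cell should print the same accuracy each time, which makes it easier to tell real changes apart from shuffling noise.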

### 2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be.

In [43]:
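x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.75,train_size=0.25)
#Note: this trains on 25% and tests on 75% of the data (113 test samples in the output below),
#which is the reverse of the split the exercise asks for.
#For 75% training / 25% test, use test_size=0.25,train_size=0.75.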

In [None]:
dt = dt.fit(x_train,y_train)

In [44]:
def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    #Predict on X with the fitted classifier and print the selected metrics against the true labels y
    y_pred=clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)),"\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y,y_pred),"\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y,y_pred),"\n")

In [45]:
measure_performance(x_test,y_test,dt)

Accuracy:0.982 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        40
          1       0.97      0.97      0.97        36
          2       0.97      0.97      0.97        37

avg / total       0.98      0.98      0.98       113
 

Confusion matrix
[[40  0  0]
 [ 0 35  1]
 [ 0  1 36]] 



In [46]:
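#The accuracy is slightly higher than before (0.982 vs. 0.973), but the split above (test_size=0.75)
#puts only 25% of the data in training, so more training data cannot be the reason.
#Both runs misclassify just 2 test samples; the difference mostly reflects the larger test set and the random split.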
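Because a single split gives a noisy estimate, one way to compare split ratios is to repeat each one over several random seeds and average the accuracy. This is a sketch only: the helper name `mean_accuracy` and the choice of 20 repeats are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, 2:], iris.target   # petal length and petal width

def mean_accuracy(test_size, n_repeats=20):
    """Return mean and standard deviation of test accuracy over n_repeats random splits."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        clf = tree.DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))   # score() returns accuracy for classifiers
    return np.mean(scores), np.std(scores)

# Compare the 50/50 split, the requested 75/25 split, and the reversed 25/75 split
for test_size in (0.5, 0.25, 0.75):
    mean, std = mean_accuracy(test_size)
    print("test_size={0:.2f}: accuracy {1:.3f} +/- {2:.3f}".format(test_size, mean, std))
```

If the standard deviations are as large as the gaps between the means, the ratios cannot really be ranked from a single run.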

### 3. Load the breast cancer dataset (`datasets.load_breast_cancer()`) and perform basic exploratory analysis. What attributes do we have? What are we trying to predict?
For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

In [None]:
#We are predicting whether a tumor is malignant or benign, based on measurements of its cell nuclei

In [48]:
breast_cancer = datasets.load_breast_cancer()

In [49]:
breast_cancer['feature_names']

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'], 
      dtype='<U23')
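For the exploratory part of the question, a small sketch of one way to look at the data: load the features into a pandas DataFrame and check its shape, the label names, and the class balance. The variable name `df` is an illustrative choice.

```python
import pandas as pd
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer()

# Put the numeric features into a DataFrame with named columns
df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
df["target"] = breast_cancer.target

print(df.shape)                       # (rows, feature columns + target)
print(breast_cancer.target_names)     # label names for target values 0 and 1
print(df["target"].value_counts())    # how many samples per class
print(df.describe().T.head(10))       # summary statistics for the first 10 features
```

The feature scales differ by orders of magnitude (compare the area and smoothness columns), which does not matter for a decision tree but would for distance-based models.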

In [62]:
#breast_cancer.target

In [181]:
x = breast_cancer.data[:,20:] # the attributes (columns 20 onward, i.e. the "worst" features)
#With 29 only 64% accuracy, with 27 - 29 already 85%, with 26 - 29 already 87%. With 25:29 it goes down again to 83%. With 20:
#it seems to work best. But is there a way to calculate the best combination of parameters (features) to take?
y = breast_cancer.target # the target variable
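One possible way to approach the question in the comment above, sketched under assumptions: score each column slice with 5-fold cross-validation rather than a single split, and look at the `feature_importances_` of a tree fit on all 30 features. The slice starts tried here and the fixed `random_state=0` are illustrative choices; this is not an exhaustive search over feature combinations.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.model_selection import cross_val_score

breast_cancer = datasets.load_breast_cancer()
X_all, y = breast_cancer.data, breast_cancer.target

# 5-fold cross-validated accuracy for a few column slices (start index to the end)
for start in (29, 27, 25, 20, 0):
    clf = tree.DecisionTreeClassifier(random_state=0)
    scores = cross_val_score(clf, X_all[:, start:], y, cv=5)
    print("columns {0}: -> mean accuracy {1:.3f}".format(start, scores.mean()))

# Feature importances from a single tree fit on all 30 features
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_all, y)
order = np.argsort(clf.feature_importances_)[::-1]   # most important first
for i in order[:5]:
    print("{0}: {1:.3f}".format(breast_cancer.feature_names[i], clf.feature_importances_[i]))
```

Importances from a single tree can shift with the seed, so they are better read as a rough ranking than as a definitive answer.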

In [190]:
x

array([[  2.53800000e+01,   1.73300000e+01,   1.84600000e+02, ...,
          2.65400000e-01,   4.60100000e-01,   1.18900000e-01],
       [  2.49900000e+01,   2.34100000e+01,   1.58800000e+02, ...,
          1.86000000e-01,   2.75000000e-01,   8.90200000e-02],
       [  2.35700000e+01,   2.55300000e+01,   1.52500000e+02, ...,
          2.43000000e-01,   3.61300000e-01,   8.75800000e-02],
       ..., 
       [  1.89800000e+01,   3.41200000e+01,   1.26700000e+02, ...,
          1.41800000e-01,   2.21800000e-01,   7.82000000e-02],
       [  2.57400000e+01,   3.94200000e+01,   1.84600000e+02, ...,
          2.65000000e-01,   4.08700000e-01,   1.24000000e-01],
       [  9.45600000e+00,   3.03700000e+01,   5.91600000e+01, ...,
          0.00000000e+00,   2.87100000e-01,   7.03900000e-02]])

In [183]:
y 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 0,

In [184]:
dt = tree.DecisionTreeClassifier()

In [185]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5)

In [186]:
dt = dt.fit(x_train,y_train)

In [187]:
def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    #Predict on X with the fitted classifier and print the selected metrics against the true labels y
    y_pred=clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)),"\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y,y_pred),"\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y,y_pred),"\n")

In [188]:
measure_performance(x_test,y_test,dt)

Accuracy:0.951 

Classification report
             precision    recall  f1-score   support

          0       0.92      0.95      0.93       105
          1       0.97      0.95      0.96       180

avg / total       0.95      0.95      0.95       285
 

Confusion matrix
[[100   5]
 [  9 171]] 
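To read the matrix above: scikit-learn puts the true classes in rows and the predicted classes in columns, and in this dataset class 0 is malignant and class 1 is benign. The class-0 row of the report can be checked by hand from the four counts. The variable names below are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows = true class, columns = predicted class (0 = malignant, 1 = benign).
cm = [[100, 5],
      [9, 171]]

tp = cm[0][0]   # malignant, predicted malignant
fn = cm[0][1]   # malignant, predicted benign (missed cases)
fp = cm[1][0]   # benign, predicted malignant (false alarms)
tn = cm[1][1]   # benign, predicted benign

recall_malignant = tp / (tp + fn)              # 100 / 105
precision_malignant = tp / (tp + fp)           # 100 / 109
accuracy = (tp + tn) / (tp + fn + fp + tn)     # 271 / 285

print("recall (malignant):    {0:.2f}".format(recall_malignant))
print("precision (malignant): {0:.2f}".format(precision_malignant))
print("accuracy:              {0:.3f}".format(accuracy))
```

For a diagnostic task the 5 missed malignant cases are arguably the more costly error, so recall on class 0 deserves at least as much attention as overall accuracy.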



### 4. Using the breast cancer data, create a classifier to predict the diagnosis (malignant or benign). Perform the above holdout evaluation (50-50 and 75-25) and discuss the results.

In [207]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)
#Why do results vary?

In [208]:
dt = dt.fit(x_train,y_train)

In [209]:
def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    #Predict on X with the fitted classifier and print the selected metrics against the true labels y
    y_pred=clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)),"\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y,y_pred),"\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y,y_pred),"\n")

In [210]:
measure_performance(x_test,y_test,dt)

Accuracy:0.951 

Classification report
             precision    recall  f1-score   support

          0       0.95      0.93      0.94        56
          1       0.95      0.97      0.96        87

avg / total       0.95      0.95      0.95       143
 

Confusion matrix
[[52  4]
 [ 3 84]] 

