In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

<h1>Model Evaluation and Selection</h1>

<p>In this assignment we introduce techniques for evaluating the quality of a classifier and for selecting good parameter values.</p>

<p>We will be using the scikit-learn built-in breast cancer data set. It is a binary classification problem where breast masses are classified as malignant (equal to 0) or benign (equal to 1).</p>

In [3]:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()

data = dataset.data
target = dataset.target

### Find how many features we have and their names
features = dataset.feature_names
print("there are %d features in the dataset, namely:" % features.size)
print(features)

# Columns 10 to 19 are measurement errors; we can drop them without much effect on the work done here
### Remove columns 10 to 19 from the data (np.arange(10, 20) covers all ten columns, not only 10 and 19)
index = np.arange(10, 20)
features_sub = np.delete(features, index)
data_sub = np.delete(data, index, axis=1)  # the cells below still use the full `data`



there are 30 features in the dataset, namely:
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


<h2>Example on a Single Decision Tree</h2>

<p>In this section we will introduce evaluation and parameter selection techniques on a single decision tree.</p>

<h4>Simple Evaluation</h4>
<p>The simplest way to evaluate the accuracy of a method is to split the data set into a training set and a test set.
We train our classifier on the training set and evaluate its accuracy on the test set, which it has never seen.<br>
In scikit-learn this is easily done with the <i>.score()</i> method of the classifier.</p>


In [6]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in scikit-learn < 0.18

### Split the data into train and test sets and the target into train_target and test_target (ratio 70%-30%)
## Hint : passing data and target in the same call keeps their rows aligned;
##        the keyword "random_state=0" makes the split reproducible

train, test, train_target, test_target = train_test_split(data, target, test_size=0.3, random_state=0)

###Import a decision tree and train it on the training set with the default settings
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(train, train_target)


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [14]:
###Compute the accuracy on the test set
pred_test = dtc.predict(test)
dtc.score(test, test_target)

0.91812865497076024

<p>The accuracy is simply the fraction of samples that have been correctly classified.<br>
Other measures of classifier quality are available. For instance, one can use the F1 score, which is the harmonic mean of the <i>precision</i> and the <i>recall</i> (see https://en.wikipedia.org/wiki/Precision_and_recall).</p>

In [15]:
from sklearn.metrics import f1_score
f1_score(test_target,pred_test,average='binary')


0.93333333333333335
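
<p>To make these definitions concrete, here is a minimal sketch that computes accuracy, precision, recall and F1 by hand. The label vectors <i>y_true</i> and <i>y_pred</i> are made up for illustration and are not taken from the data set used in this notebook.</p>

```python
import numpy as np

# Toy labels, invented for illustration (1 = positive class, 0 = negative class)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

accuracy = np.mean(y_pred == y_true)        # fraction of correct predictions
precision = tp / float(tp + fp)             # how many predicted positives are real
recall = tp / float(tp + fn)                # how many real positives are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f"
      % (accuracy, precision, recall, f1))
```

<p>With <i>average='binary'</i>, <i>f1_score</i> reports this quantity for the class labelled 1, which is the benign class in this data set.</p>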

<p>It is also possible to have the detail of precision and recall for both classes :</p>

In [22]:
from sklearn.metrics import classification_report
print(classification_report(test_target, pred_test, target_names=['malignant', 'benign']))


             precision    recall  f1-score   support

  malignant       0.86      0.94      0.89        63
     benign       0.96      0.91      0.93       108

avg / total       0.92      0.92      0.92       171



<p>We started this study with a single random train/test split. All the scores computed so far depend on that particular split.</p>

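<h4> <u>QUESTION 1 :</u> Explain why all scores are specific to our first train/test split.</h4>
<p><i>The function <em style="color:red">train_test_split</em> assigns samples to the training and test sets at random (using a pseudo-random number generator). A different split would train the tree on different samples and evaluate it on different ones, so the scores would change.</i></p>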
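
<p>The dependence on the split can be observed directly. The sketch below, which assumes scikit-learn 0.18 or later (for <i>sklearn.model_selection</i>) and uses illustrative names such as <i>X</i>, <i>y</i> and <i>split_scores</i>, repeats the 70%-30% split with several seeds and records the test accuracy each time.</p>

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

split_scores = []
for seed in range(5):
    # A different random_state gives a different train/test partition
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    # The tree's own random_state is fixed so that only the split changes
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    split_scores.append(tree.score(X_te, y_te))

print("accuracy per split: %s" % np.round(split_scores, 3))
print("spread (max - min): %.3f" % (max(split_scores) - min(split_scores)))
```

<p>Cross-validation addresses this by averaging the score over several train/test partitions instead of relying on a single one.</p>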

In [None]:
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in scikit-learn < 0.18

### Use the cross_val_score function on the whole dataset
cvs = cross_val_score(dtc, data, target, cv=3)

# cross_val_score returns one score per fold;
# with cv=3 (three-fold cross-validation) you get three values
### Compute and print the mean and the standard deviation of the scores
print("accuracy: %.3f +/- %.3f" % (cvs.mean(), cvs.std()))



<p>Several techniques exist to divide the data set into folds (see http://scikit-learn.org/stable/modules/cross_validation.html for more details).</p>
<p>One of them is worth mentioning here: ShuffleSplit. This technique generates a predefined number of independent train/test splits. For each split, the samples are first shuffled and then divided into a training set and a test set.</p>
<p>This can be implemented as follows:</p>

In [None]:
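from sklearn.model_selection import ShuffleSplit  # sklearn.cross_validation in scikit-learn < 0.18
# n_splits replaces the old n_iter argument; the number of samples is no longer passed here
cv_ss = ShuffleSplit(n_splits=5, test_size=0.4, random_state=0)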

###Use again the cross_val_score function and set "cv=cv_ss"

###Compute again the mean and the standard deviation of the output
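
<p>One possible way to complete this cell is sketched below, assuming scikit-learn 0.18 or later. The names <i>X</i>, <i>y</i> and <i>ss_scores</i> are illustrative, and a fresh decision tree stands in for the classifier trained earlier.</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Five independent shuffled splits, each keeping 40% of the samples for testing
cv_ss = ShuffleSplit(n_splits=5, test_size=0.4, random_state=0)

# One accuracy value per split
ss_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv_ss)

print("ShuffleSplit accuracy: %.3f +/- %.3f" % (ss_scores.mean(), ss_scores.std()))
```

<p>Unlike k-fold cross-validation, the test sets of different ShuffleSplit iterations may overlap, because each split is drawn independently.</p>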



<h4> <u>QUESTION 2 :</u> Are the results very different?</h4>
<p><i>Type your answer here</i></p>

<h3>Finding Optimal Parameters</h3>

<p>In the previous section we used the default settings for our classifier, but these are usually not the best choice.</p>
<p>In this section we will introduce tools to find good parameter values.</p>

<h4>Grid Search - a brute force approach</h4>
<p>A decision tree has several parameters we can change to optimize the classification. Scikit-learn offers the possibility to investigate several parameters at once using <i>GridSearchCV</i>. You simply need to define a "parameter grid" (a dictionary in Python) that lists the parameter values you want to try and pass it to a GridSearchCV object:</p>

In [None]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in scikit-learn < 0.18

### Example of a parameter grid dictionary; run it once and then include more parameters in p_grid
p_grid = {'criterion': ['entropy', 'gini']}

# grd = GridSearchCV(<classifier>, cv=3, param_grid=<dictionary of parameters to investigate>)
# grd.fit(<training set>, <training set targets>)
grd = GridSearchCV(dtc, cv=3, param_grid=p_grid)  # dtc is the decision tree defined above
grd.fit(train, train_target)

<h4> <u>QUESTION 2 :</u> What does the "CV" at the end of GridSearchCV stand for? What is it telling you?</h4>
<p><i>Type your answer here</i></p>

In [None]:
#You can ask for the best parameters found by running the following command
grd.best_params_

In [None]:
#And directly create a new classifier with the optimal parameters by running
new_dtr = grd.best_estimator_

###Check the "new_dtr" parameters


In [None]:
###Now train the new classifier


In [None]:
###Compute its accuracy


In [None]:
###print the classification report
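
<p>A possible way to fill in the cells above is sketched here as one self-contained block, assuming scikit-learn 0.18 or later. The extended grid values and the names <i>X_tr</i>, <i>X_te</i>, <i>y_tr</i>, <i>y_te</i> and <i>test_accuracy</i> are illustrative choices, not recommendations.</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Hypothetical grid: the values are only examples of what could be tried
p_grid = {'criterion': ['entropy', 'gini'], 'max_depth': [2, 4, 6, None]}
grd = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid=p_grid, cv=3)
grd.fit(X_tr, y_tr)

# best_estimator_ has already been refit on (X_tr, y_tr) since refit=True by default
new_dtr = grd.best_estimator_
print(new_dtr.get_params())

# Accuracy and per-class report on the held-out test set
test_accuracy = new_dtr.score(X_te, y_te)
print("test accuracy: %.3f" % test_accuracy)
print(classification_report(y_te, new_dtr.predict(X_te),
                            target_names=['malignant', 'benign']))
```

<p>Calling <i>fit</i> again on <i>new_dtr</i> is harmless but not strictly needed, since the grid search refits the best model on the whole training set.</p>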


<h4>Learning and Validation Curves</h4>
<p> Scikit-learn provides additional tools to tune our algorithm.<br>
One useful tool is the learning curve. It gives the cross-validated training and test scores for different training set sizes.
We can use it on the previously defined new classifier "new_dtr" :</p>

In [None]:
from sklearn.model_selection import learning_curve  # sklearn.learning_curve in scikit-learn < 0.18

###Compute the learning curve
#<number of elements in train>,<train scores>,<test scores> = learning_curve(<classifier>,<data>,<targets>,train_sizes=<list of training sizes>,cv=3)
#The training sizes can be absolute numbers of samples (integers) or fractions of the training set (floats in (0, 1])

###Visualize the learning curve (don't forget labels, title,legend,etc)

#You should see why I chose a 70%-30% ratio
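
<p>The following sketch shows one way to compute and plot a learning curve, assuming scikit-learn 0.19 or later (for the <i>shuffle</i> argument). A depth-limited tree named <i>clf</i> stands in for "new_dtr"; its <i>max_depth=4</i> is an arbitrary illustrative value.</p>

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=4, random_state=0)  # stand-in for new_dtr

# Ten training sizes from 10% to 100% of the samples available in each fold
sizes, train_scores, test_scores = learning_curve(
    clf, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=3,
    shuffle=True, random_state=0)

# Each row holds one score per fold, so average over the folds (axis=1)
fig, ax = plt.subplots()
ax.plot(sizes, train_scores.mean(axis=1), 'o-', label='training score')
ax.plot(sizes, test_scores.mean(axis=1), 's-', label='cross-validated score')
ax.set_xlabel('number of training samples')
ax.set_ylabel('accuracy')
ax.set_title('Learning curve of a decision tree')
ax.legend(loc='best')
```

<p>The returned <i>sizes</i> are absolute sample counts, which makes them convenient for the x axis even when fractions were requested.</p>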

<h4> <u>QUESTION 3 :</u> Why isn't the training score equal to one?</h4>
<p><i>Type your answer here</i></p>

<p>A second tool, meant to investigate the influence of one specific parameter on the scores, is the <i>validation curve</i>. It is essentially a grid search over a single parameter. </p>

In [None]:
from sklearn.model_selection import validation_curve  # sklearn.learning_curve in scikit-learn < 0.18

#Just an example
train_score,test_score = validation_curve(new_dtr,data,target,param_name='max_depth',param_range=np.arange(2,8),cv=3)

###Plot the validation curve
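
<p>A minimal plotting sketch for the validation curve follows, assuming scikit-learn 0.18 or later. A fresh decision tree is used in place of "new_dtr", and the name <i>depths</i> is illustrative.</p>

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
depths = np.arange(2, 8)  # same range as in the example above

# One row per max_depth value, one column per cross-validation fold
train_score, test_score = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name='max_depth', param_range=depths, cv=3)

fig, ax = plt.subplots()
ax.plot(depths, train_score.mean(axis=1), 'o-', label='training score')
ax.plot(depths, test_score.mean(axis=1), 's-', label='cross-validated score')
ax.set_xlabel('max_depth')
ax.set_ylabel('accuracy')
ax.set_title('Validation curve of a decision tree')
ax.legend(loc='best')
```

<p>A training score that keeps rising while the cross-validated score flattens or drops is the usual sign of overfitting.</p>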


<h3>Application - Evaluating the Random Forest classifier and the SVC</h3>

<p>In the following you will apply the evaluation and optimization tools to compare the Random Forest technique and the SVC technique.</p>

In [None]:
###Import the "RandomForestClassifier" classifier

###Train it on the training set


In [None]:
###Import the "SVC" classifier

###Train it on the training set  (don't forget to scale the data!)
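
<p>The two cells above could be completed along the following lines, assuming scikit-learn 0.18 or later. Wrapping the scaler and the SVC in a pipeline is one design choice among several; the names <i>rfc</i>, <i>svc</i>, <i>rfc_acc</i> and <i>svc_acc</i> and the value <i>n_estimators=100</i> are illustrative.</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Tree ensembles do not need feature scaling
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X_tr, y_tr)

# The pipeline learns the scaling on the training set only
# and applies the same transformation at prediction time
svc = make_pipeline(StandardScaler(), SVC())
svc.fit(X_tr, y_tr)

rfc_acc = rfc.score(X_te, y_te)
svc_acc = svc.score(X_te, y_te)
print("Random Forest test accuracy: %.3f" % rfc_acc)
print("SVC (scaled) test accuracy:  %.3f" % svc_acc)
```

<p>Scaling matters for the SVC because its kernel depends on distances between samples, and the features of this data set have very different ranges.</p>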


<p>Before evaluating the performance of both classifiers, we will first determine the best values for their parameters.</p>

In [None]:
###Using the grid search optimize the Random Forest Classifier



In [None]:
###Create a new Random Forest Classifier using the best parameters found


In [None]:
###Using the grid search optimize the SVC Classifier
#(!don't forget to scale the data!)



In [None]:
###Create a new SVC classifier using the best parameters found
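
<p>A sketch of both grid searches is given below, assuming scikit-learn 0.18 or later. The grids <i>p_grid_rfc</i> and <i>p_grid_svc</i> contain illustrative values only, and the step names 'scale' and 'svc' are arbitrary labels chosen for this example.</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Hypothetical grids: the values are examples, not recommendations
p_grid_rfc = {'n_estimators': [50, 100], 'max_depth': [4, 8, None]}
grd_rfc = GridSearchCV(RandomForestClassifier(random_state=0),
                       param_grid=p_grid_rfc, cv=3)
grd_rfc.fit(X_tr, y_tr)
new_rfc = grd_rfc.best_estimator_

# Scaling lives inside the pipeline, so it is refit on the training part of every fold.
# Parameters of a pipeline step are addressed as <step name>__<parameter name>.
svc_pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
p_grid_svc = {'svc__C': [0.1, 1, 10], 'svc__gamma': [0.001, 0.01, 0.1]}
grd_svc = GridSearchCV(svc_pipe, param_grid=p_grid_svc, cv=3)
grd_svc.fit(X_tr, y_tr)
new_svc = grd_svc.best_estimator_

print(grd_rfc.best_params_)
print(grd_svc.best_params_)
```

<p>Scaling the whole training set once before the grid search would let information from the validation folds leak into the scaler, which is why the pipeline is used here.</p>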


<p>Now that we have optimized our two classifiers, we can compare how they perform.</p>

In [None]:
###Compute the accuracy of both classifiers on the training set using cross_val_score
# (!scale for SVC!)


#Print the results (average and std)
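
<p>The comparison could look like the sketch below, assuming scikit-learn 0.18 or later. Default-parameter classifiers are used as stand-ins for the optimized ones, and the names <i>rfc_scores</i> and <i>svc_scores</i> are illustrative.</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Stand-ins for the optimized classifiers found by the grid searches
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
svc = make_pipeline(StandardScaler(), SVC())  # scaling is refit inside each fold

# Three-fold cross-validated accuracy on the training set
rfc_scores = cross_val_score(rfc, X_tr, y_tr, cv=3)
svc_scores = cross_val_score(svc, X_tr, y_tr, cv=3)

for name, scores in [('Random Forest', rfc_scores), ('SVC (scaled)', svc_scores)]:
    print("%s: %.3f +/- %.3f" % (name, scores.mean(), scores.std()))
```

<p>When the two means differ by less than the standard deviations, the data do not clearly favour one classifier over the other.</p>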



<h4> <u>QUESTION 4 :</u> Which classifier gives the best accuracy?</h4>
<p><i>Type your answer here</i></p>

In [None]:
###Print the Classification report for the Random Forest Classifier


In [None]:
###Do the same for the SVC classifier


<h4> <u>QUESTION 5 :</u> Analyze the last two classification reports.</h4>
<p><i>Type your answer here</i></p>

<h4> <u>QUESTION 6 :</u> Recall the classification report from the optimized decision tree and conclude which algorithm is the best choice to detect malignant masses reliably.</h4>
<p><i>Type your answer here</i></p>

<h4> <u>BONUS :</u> Repeat the optimization and evaluation procedure with the k-nearest neighbors approach.</h4>
<p><i>Type your answer here</i></p>