# Project 2: Supervised Learning
### Building a Student Intervention System

## 1. Classification vs Regression

Your goal is to identify students who might need early intervention - which type of supervised machine learning problem is this, classification or regression? Why?

This is a classification problem: the objective is to predict a discrete label, namely whether a student will pass or not (`yes`/`no`), rather than a continuous value.

## 2. Exploring the Data

Let's go ahead and read in the student dataset first.

_To execute a code cell, click inside it and press **Shift+Enter**._

In [1]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
# Read student data
student_data = pd.read_csv(r"C:\Users\marimuthuananthavelu\Desktop\Student-Intervention-System_Udacity_Marimuthu\student_intervention\student-data.csv")  # raw string keeps the backslashes literal
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label, all other are feature columns

Student data read successfully!


Now, can you find out the following facts about the dataset?
- Total number of students
- Number of students who passed
- Number of students who failed
- Graduation rate of the class (%)
- Number of features

_Use the code block below to compute these values. Instructions/steps are marked using **TODO**s._

In [3]:
# TODO: Compute desired values - replace each '?' with an appropriate expression/function call
n_students = len(student_data)
n_features = student_data.shape[1]-1
n_passed = len(student_data[student_data['passed']=='yes'])
n_failed = len(student_data[student_data['passed']=='no'])
grad_rate = 100.0* n_passed/n_students
print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%
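As a quick sanity check on the figures above, the graduation rate is simply the number of students who passed divided by the total. The sketch below redoes that arithmetic with the counts hard-coded from the printed output (it does not read the CSV), and uses the `print()` function form so it can be run on its own.

```python
# Counts copied from the output above (not recomputed from the dataset).
n_passed = 265
n_failed = 130
n_students = n_passed + n_failed

# Graduation rate as a percentage of all students.
grad_rate = 100.0 * n_passed / n_students

print("Total: {}, graduation rate: {:.2f}%".format(n_students, grad_rate))
```

Roughly two thirds of the students pass, so the two classes are not balanced. This is worth keeping in mind when reading the scores reported later.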


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [4]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print "\nFeature values:-"
print X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...    

### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
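To make the idea of dummy variables concrete, here is a minimal hand-written sketch of the transformation on a toy column. The helper `one_hot` is hypothetical and only illustrates the idea; it is not how `pandas.get_dummies()` is implemented. The job labels are taken from the first five rows of the preview above.

```python
def one_hot(values, prefix):
    """Return {column_name: [0/1, ...]} with one column per distinct value."""
    categories = sorted(set(values))
    return {
        "{}_{}".format(prefix, cat): [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Toy column: the Fjob values of the first five students shown above.
fjob = ["teacher", "other", "other", "services", "other"]
dummies = one_hot(fjob, "Fjob")

for name in sorted(dummies):
    print("{}: {}".format(name, dummies[name]))
```

Each row has exactly one `1` across the generated columns, which is what "assign a `1` to one of them and `0` to all others" means.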

In [5]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
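The jump from 30 to 48 feature columns can be checked by hand. The sketch below counts, from the processed column names listed above, how many dummy columns each multi-valued feature produced; the dictionary is transcribed manually from that list.

```python
# Number of dummy columns per expanded feature, counted from the list above.
dummy_counts = {
    "school": 2, "sex": 2, "address": 2, "famsize": 2, "Pstatus": 2,
    "Mjob": 5, "Fjob": 5, "reason": 4, "guardian": 3,
}

n_original = 30                          # feature columns before preprocessing
n_expanded = len(dummy_counts)           # columns replaced by dummy variables
n_dummies = sum(dummy_counts.values())   # dummy columns that replaced them

n_processed = n_original - n_expanded + n_dummies
print("Processed feature columns: {}".format(n_processed))
```

The remaining 21 columns were either already numeric or were `yes`/`no` columns converted in place to `1`/`0`, so they keep their original names.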


### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.

In [6]:
from sklearn import cross_validation
from sklearn.cross_validation import train_test_split
# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# TODO: Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X_all, y_all, train_size=num_train)

print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data

Training set: 300 samples
Test set: 95 samples
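The note about shuffling can be illustrated without scikit-learn. The sketch below shuffles the row indices and cuts them into two groups; `shuffled_split` is a hypothetical helper written for illustration, not the library's implementation, and the seed value is an arbitrary assumption.

```python
import random

def shuffled_split(n_samples, n_train, seed=None):
    """Return (train_idx, test_idx) after shuffling the row indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]

# 395 students, 300 of them for training, as in the cell above.
# The seed is an arbitrary choice that makes the split repeatable.
train_idx, test_idx = shuffled_split(395, 300, seed=0)
print("Training set: {} samples".format(len(train_idx)))
print("Test set: {} samples".format(len(test_idx)))
```

Without a fixed seed, every run produces a different split, so the scores reported in the following sections would vary from run to run. `train_test_split` accepts a `random_state` argument for the same purpose.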


## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

- What are the general applications of this model? What are its strengths and weaknesses?
- Given what you know about the data so far, why did you choose this model to apply?
- Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.

Produce a table showing training time, prediction time, F<sub>1</sub> score on training set and F<sub>1</sub> score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.
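Since every table is built around the F<sub>1</sub> score, it helps to see how it is computed. The sketch below derives it from precision and recall for a single positive class. The function `f1_from_labels` and the label lists are made up for illustration; the notebook itself uses `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred, pos_label="yes"):
    """F1 = 2 * precision * recall / (precision + recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / float(tp + fp)  # share of predicted "yes" that were right
    recall = tp / float(tp + fn)     # share of actual "yes" that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels, made up for illustration only.
y_true = ["yes", "yes", "yes", "no", "no", "yes"]
y_pred = ["yes", "no", "yes", "yes", "no", "yes"]
score = f1_from_labels(y_true, y_pred)
print("F1 score: {:.2f}".format(score))
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision and recall are both 3/4 and the F<sub>1</sub> score is 0.75.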


Three supervised learning models were used to solve this problem:



1. Decision Trees
2. Gaussian Naive Bayes
3. Support Vector Machines

1. Decision Trees

A decision tree predicts the value of a target variable by learning simple decision rules from the data features.

The strengths and weaknesses of decision trees are:

Strengths(I,II):

1. Decision trees are very flexible, easy to understand, and easy to debug. One of their most attractive properties is that they need only a table of data and will build a classifier directly from it, without any up-front design work. They can handle both numerical and categorical data, whereas other techniques are usually specialised in analysing datasets that have only one type of variable.(I,IV)

2. Requires little data preparation. Other techniques often require data normalisation, the creation of dummy variables, and the removal of blank values. Note, however, that the scikit-learn tree module does not support missing values.(I)

3. The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.(I)

4. Able to handle multi-output problems.

5. Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.

6. Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.

7. Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

8. Addresses non-linearity.

Weaknesses(I,III,IV):

1. They overfit. Repeated splitting leads to complex trees and raises the probability of overfitting: decision-tree learners can create over-complex trees that do not generalise well from the data. Pruning the tree, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree is necessary to avoid this problem. Their simplicity means there is no up-front design cost, but that cost is paid back when tuning the tree's performance.(I,IV)

2. Instability: decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.(III)

3. The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.(I)

4. There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.(I)

5. Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.(I)
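The greedy, node-by-node learning mentioned in the weaknesses above can be sketched in a few lines. The code below chooses the single best threshold on one feature using Gini impurity, which is one common splitting criterion. The helpers `gini` and `best_threshold` and the toy data are hypothetical illustrations, not scikit-learn's implementation.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / float(n)) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Greedy step: pick the threshold with the lowest weighted Gini impurity."""
    best = (None, gini(labels))  # None means that no split improves on the parent
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / float(len(labels))
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: number of past failures vs. outcome.
failures = [0, 0, 0, 1, 2, 3]
passed = ["yes", "yes", "yes", "no", "no", "no"]
threshold, impurity = best_threshold(failures, passed)

# XOR-like labels: no single threshold on this feature reduces impurity.
xor_threshold, xor_impurity = best_threshold([0, 0, 1, 1], ["no", "yes", "yes", "no"])
```

On the toy data a split at `failures <= 0` separates the classes perfectly. On the XOR-like labels no threshold helps at all, which illustrates why such concepts are hard for a learner that only makes locally optimal choices.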

References:

(I) http://scikit-learn.org/stable/modules/tree.html
(II) https://www.quora.com/What-are-the-advantages-of-using-a-decision-tree-for-classification
(III) https://www.quora.com/What-are-the-disadvantages-of-using-a-decision-tree-for-classification
(IV) http://stackoverflow.com/questions/10317885/decision-tree-vs-naive-bayes-classifier


Reason for considering Decision Trees:

a. Decision trees can work with both numerical and categorical variables, unlike many other models, and they take little effort for data preparation. Given the content of the dataset and the effort needed to prepare it, I chose decision trees as one of the algorithms.

b. Decision trees are simple to understand and interpret. For the student data, I can visualize how the tree will create nodes and leaves from the features, ending in one of 2 possible outcomes, i.e. 'yes' or 'no'. The result is easy to interpret overall.


2. Gaussian Naive Bayes

Strengths (I,III):
1. Simple in comparison, fast to train and fast to classify.
2. Not sensitive to irrelevant features (see #1 in weaknesses).
3. Good for a smaller dataset.
4. Handles streaming data well. (III)
5. Based on conditional probability, so its answers are expressed as probabilities, e.g. P(yes) = 95%, P(no) = 5%.

Weaknesses (II,III):
1. Assumes independence of features.
2. High-bias classifier: the independence assumption restricts what the model can represent. (III)
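The independence assumption and the probability-style answers listed above can both be seen in a small from-scratch sketch. The code below fits one Gaussian per feature and class, then multiplies the per-feature likelihoods as if the features were independent. The functions and the toy data are hypothetical; this is a simplified illustration, not scikit-learn's `GaussianNB`.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit(rows, labels):
    """Per class: prior plus (mean, variance) of every feature."""
    model = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        stats = []
        for column in zip(*members):
            mean = sum(column) / float(len(column))
            # Small constant avoids a zero variance for constant columns.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[c] = (len(members) / float(len(rows)), stats)
    return model

def predict_proba(model, row):
    """Posterior P(class | row), treating features as independent given the class."""
    scores = {}
    for c, (prior, stats) in model.items():
        likelihood = 1.0
        for x, (mean, var) in zip(row, stats):
            likelihood *= gaussian_pdf(x, mean, var)
        scores[c] = prior * likelihood
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical toy data: [absences, studytime] per student.
rows = [[0, 3], [2, 4], [1, 3], [20, 1], [25, 2], [18, 1]]
labels = ["yes", "yes", "yes", "no", "no", "no"]
proba = predict_proba(fit(rows, labels), [3, 3])
```

For a student with few absences and a typical study time, almost all of the probability mass goes to `yes`. Multiplying many small densities can underflow on real data, so practical implementations usually work with log-probabilities instead.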

References:

(I) https://books.google.ca/books?id=3DPcCgAAQBAJ&pg=PT139&lpg=PT139&dq=naive+Bayes#v=onepage&q=naive%20Bayes&f=false

(II) http://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf

(III) https://www.quora.com/What-are-the-advantages-of-different-classification-algorithms

Reason for considering Naive Bayes:

a. Naive Bayes is based on conditional probability. Since the task is to classify student performance, my intuition is that we can estimate the probability that a student will pass from the input features, all of which give information about the student's nature, activities, age, etc.

For example, suppose the students who passed attended about 70% of classes on average. The intuition is then that the probability of passing is higher for students who attend more classes.

b. Our dataset is small, with close to 400 samples, so I believe Naive Bayes will yield good predictions.

3. Support Vector Machines

Strengths(I):

1. High accuracy. Effective in high-dimensional spaces. The kernel trick is the main strength of this algorithm: with an appropriate kernel, SVMs can work well even if the data isn't linearly separable in the base feature space.(I)

2. Still effective in cases where the number of dimensions is greater than the number of samples.

3. Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
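As an illustration of what a kernel function is, the sketch below computes the RBF (Gaussian) kernel, the kernel later used in the grid search of this report. It measures how similar two feature vectors are, and `gamma` controls how quickly similarity decays with distance. The function `rbf_kernel` and the two student vectors are hypothetical and written by hand for illustration.

```python
import math

def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Hypothetical feature vectors, e.g. [age, studytime, absences].
student_a = [17, 2, 4]
student_b = [18, 2, 6]

for gamma in (1e-2, 1e-3):
    print("gamma={}: K={:.4f}".format(gamma, rbf_kernel(student_a, student_b, gamma)))
```

Identical vectors always give a similarity of 1, and a smaller `gamma` makes distant students look more alike, which tends to give a smoother decision boundary.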

Weaknesses:

1. Perhaps the biggest limitation of the support vector approach lies in the choice of the kernel and the selection of the kernel function parameters in high-dimensional spaces.(II)

2. A second limitation is speed and size, both in training and testing. (I)

3. If the number of features is much greater than the number of samples, the method is likely to give poor performance.(II)

4. SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.(II)

References:
(I) https://www.quora.com/What-are-the-advantages-of-different-classification-algorithms
(II) http://www.svms.org/disadvantages.html


Reason for considering Support Vector Machines:

a. Its accuracy in predicting student performance. This high predictive ability was one of the most important reasons for picking it.

b. Its efficiency in high-dimensional spaces. With a suitable kernel, the data does not have to be linearly separable.

In [7]:
# Train a model
import time

def train_classifier(clf, X_train, y_train):
    print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    trainingtime=end-start
    print "Done!\nTraining time (secs): {:.3f}".format(trainingtime)
    return trainingtime

# TODO: Choose a model, import it and instantiate an object
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()

# Fit model to training data
DTC_trainingtime_300=train_classifier(clf, X_train, y_train)  # note: using entire training set here
#print clf  # you can inspect the learned model by printing it


Training DecisionTreeClassifier...
Done!
Training time (secs): 0.013


In [8]:
# Predict on training set and compute F1 score
from sklearn.metrics import f1_score

def predict_labels(clf, features, target):
    print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    print "Done!\nPrediction time (secs): {:.3f}".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes')

train_f1_score = predict_labels(clf, X_train, y_train)
print "F1 score for training set: {}".format(train_f1_score)

Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 1.0


In [9]:
# Predict on test data
print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))

Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.736842105263


In [10]:
# Train and predict using different training set sizes
def train_predict(clf, X_train, y_train, X_test, y_test):
    print "------------------------------------------"
    print "Training set size: {}".format(len(X_train))
    train_classifier(clf, X_train, y_train)
    print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))
    
train_predict(clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(clf, X_train, y_train, X_test, y_test)

# TODO: Run the helper function above for desired subsets of training data
# Note: Keep the test set constant

------------------------------------------
Training set size: 100
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.003
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 1.0
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.744525547445
------------------------------------------
Training set size: 200
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.004
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 1.0
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.721804511278
------------------------------------------
Training set size: 300
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.006
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for training set: 

In [11]:
from sklearn.naive_bayes import GaussianNB
clf=GaussianNB()
train_classifier(clf, X_train, y_train)

Training GaussianNB...
Done!
Training time (secs): 0.004


0.003999948501586914

In [13]:
train_predict(clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(clf, X_train, y_train, X_test, y_test)

------------------------------------------
Training set size: 100
Training GaussianNB...
Done!
Training time (secs): 0.003
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.883720930233
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.69696969697
------------------------------------------
Training set size: 200
Training GaussianNB...
Done!
Training time (secs): 0.003
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.002
F1 score for training set: 0.798507462687
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.714285714286
------------------------------------------
Training set size: 300
Training GaussianNB...
Done!
Training time (secs): 0.003
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.002
F1 score for training set: 0.790816326531
Predicting labels using GaussianNB...
Done!
Prediction time 

In [14]:
from sklearn.svm import SVC
clf=SVC()
train_classifier(clf, X_train, y_train)

Training SVC...
Done!
Training time (secs): 0.049


0.04900002479553223

In [15]:
train_predict(clf, X_train[0:100], y_train[0:100], X_test, y_test)
train_predict(clf, X_train[0:200], y_train[0:200], X_test, y_test)
train_predict(clf, X_train, y_train, X_test, y_test)

------------------------------------------
Training set size: 100
Training SVC...
Done!
Training time (secs): 0.004
Predicting labels using SVC...
Done!
Prediction time (secs): 0.002
F1 score for training set: 0.916666666667
Predicting labels using SVC...
Done!
Prediction time (secs): 0.002
F1 score for test set: 0.858974358974
------------------------------------------
Training set size: 200
Training SVC...
Done!
Training time (secs): 0.007
Predicting labels using SVC...
Done!
Prediction time (secs): 0.008
F1 score for training set: 0.870967741935
Predicting labels using SVC...
Done!
Prediction time (secs): 0.005
F1 score for test set: 0.858895705521
------------------------------------------
Training set size: 300
Training SVC...
Done!
Training time (secs): 0.018
Predicting labels using SVC...
Done!
Prediction time (secs): 0.014
F1 score for training set: 0.864988558352
Predicting labels using SVC...
Done!
Prediction time (secs): 0.005
F1 score for test set: 0.8625


## 5. Choosing the Best Model

- Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?
- In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).
- Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.
- What is the model's final F<sub>1</sub> score?

Of the chosen models, we should pick the one with the best balance of processing time and F1 score. The observations from the results above for the 3 models are as follows:

1. Training and prediction times are higher for SVM than for Decision Trees and Naïve Bayes, and they increase considerably with the size of the training set. SVM also has the highest average F1 score on the test set.

2. Training time for Naïve Bayes does not increase consistently with the size of the training set.

3. The average F1 score on the test set, across the different training set sizes, is lower for Decision Trees than for the other chosen classifiers.

4. Gaussian Naïve Bayes is fast in training and classifying. After Naïve Bayes, Decision Trees take the least time to learn the training set, for all training set sizes.
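The averages referred to above can be recomputed from the printed outputs. The sketch below is only a partial check: the decision tree and Gaussian Naive Bayes outputs are cut off at 300 training samples in this report, so only the scores visible for 100 and 200 samples are included for those two models.

```python
# Test-set F1 scores copied from the outputs above. For the decision tree and
# Gaussian Naive Bayes the run at 300 samples is cut off, so only the scores
# for 100 and 200 training samples are listed.
test_f1 = {
    "DecisionTreeClassifier": [0.744525547445, 0.721804511278],
    "GaussianNB": [0.69696969697, 0.714285714286],
    "SVC": [0.858974358974, 0.858895705521, 0.8625],
}

averages = {name: sum(scores) / len(scores) for name, scores in test_f1.items()}

for name in sorted(averages, key=averages.get, reverse=True):
    print("{}: {:.3f}".format(name, averages[name]))
```

On the visible numbers SVC is clearly ahead of the other two models. Because two of the runs are incomplete here, the relative order of the decision tree and Naive Bayes should not be read from this sketch.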


Since processing time and F1 score are the essential criteria for choosing the best algorithm, I propose the Support Vector Machine for the given dataset: its accuracy is high, though this comes at the expense of additional processing time in comparison.

There are 2 main reasons for choosing SVM:
1. SVMs are more accurate than the other chosen algorithms; see the average F1 score on the test set for each classifier in this report. Even a small difference in accuracy changes the predictions for a few or more students who are likely to join the program.

2. Although SVM takes more processing time than the other 2 classifiers, the following practical considerations make this acceptable:
    a. First, the student intervention system is not time-critical in how quickly it must learn and predict. For example, a high school teacher can reasonably wait an additional hour for the predictions from the machine learning model. Less computing power increases the wait time, while more computing power shortens it; as an intuition, consider AlphaGo, where the thinking time is 2 seconds (https://en.wikipedia.org/wiki/AlphaGo).
    
    b. We do not need critical output from our model within a very short moment. In chemical process industries or oil refinery applications, for example, data must be processed quickly to avoid any unsafe activity. In our case, predicting student performance can take another several minutes or an hour, depending on the dataset. The higher accuracy gives more confidence in choosing SVM.

The chosen algorithm, Support Vector Machines, works in the following way:

Let's imagine we place balls of two colors on a table and want to separate (classify) them by color. You are given a ruler to place on the table so that it separates the two colors exactly. How does our intuition decide where to place the ruler between the two colors of balls?

We look for a place where all the balls of one color end up on one side and all the balls of the other color on the other side, so we find the gap between the two colors. Within that gap the ruler can be placed in many ways and at different angles. To choose the best placement, we would prefer the ruler to sit closest to the middle of the gap between the two colors. Support vector machines do the same when classifying: the distance between the ruler and the nearest balls on each side is made as large as possible.
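The ruler intuition can be written down for the simplest possible case, where the balls lie along a single line. The sketch below places the boundary midway between the two closest balls of different colors. The function `max_margin_threshold` and the positions are hypothetical; a real SVM solves an optimization problem over many dimensions, so this is only a one-dimensional analogy.

```python
def max_margin_threshold(left_points, right_points):
    """Place the boundary midway between the closest points of the two groups.

    Assumes every value in left_points is smaller than every value in right_points.
    """
    closest_left = max(left_points)    # nearest ball on the left side
    closest_right = min(right_points)  # nearest ball on the right side
    boundary = (closest_left + closest_right) / 2.0
    margin = (closest_right - closest_left) / 2.0
    return boundary, margin

# Hypothetical ball positions along the table edge, in centimetres.
red_balls = [5, 12, 20]
blue_balls = [34, 41, 50]
boundary, margin = max_margin_threshold(red_balls, blue_balls)
print("Ruler at {} cm, margin {} cm on each side".format(boundary, margin))
```

Only the two balls nearest the gap determine where the ruler goes. In SVM terminology these closest points are the support vectors.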

SVM also comes with a trick called the 'kernel', which is used for handling data that is not linearly separable.

In our example, let's say a kid comes along, randomly mixes the balls on the table, and leaves it to you to place a ruler that separates the two colors. The balls are no longer in order on the table.

That is a difficult task, as the balls are no longer nicely segregated as before and a straight ruler cannot be used directly. Let's think about how we can still classify them.

One way to separate these balls is to raise one set of balls of the same color. We can tie a thread to each ball of that color and lift it above the table surface. This puts one color of balls at a different elevation from the other.

Now, if we look from the side, the two sets of balls can be separated with a straight ruler. Here we introduced a new hero called 'Thread' to make it happen.

As we can see, in Support Vector Machines we add more features to the existing set of features. In the example above, that added feature is the 'Thread'.
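The thread analogy corresponds to mapping the data into a higher dimension where a straight boundary works. The sketch below does this explicitly for points on a line, using the square of the position as the extra "height". The mapping `lift` and the positions are assumptions chosen for illustration; a kernel achieves a similar effect implicitly, without computing the new coordinates.

```python
def lift(x):
    """Map a 1-D position to 2-D by adding its square as a second coordinate."""
    return (x, x * x)

# Hypothetical 1-D positions: one colour sits in the middle, the other on both
# sides, so no single cut point on the line separates them.
inner = [-1.0, 0.0, 1.0]
outer = [-3.0, -2.5, 2.5, 3.0]

# After lifting, a horizontal line at this height separates the two groups.
height = 3.0
inner_below = all(lift(x)[1] < height for x in inner)
outer_above = all(lift(x)[1] > height for x in outer)
print("Separable after lifting: {}".format(inner_below and outer_above))
```

In the original one-dimensional view the outer points surround the inner ones, yet after the lift a single straight line is enough to tell them apart.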

In our student data as well, new features can be introduced. For example, students' study time and travel time could be combined into a new feature called 'time management', which might classify students as 'Good' or 'little Good'.

The following video is an example for your reference.
https://www.youtube.com/watch?v=3liCbRZPrZA

It shows nicely how SVM works with features to convert non-linearly separable data into linearly separable data.

In our student dataset, we likewise intend to classify students into those who require intervention and those who do not, by drawing the best decision boundary in a similar way. The data is not linear enough for conclusions to be drawn easily, so it requires techniques such as the kernel that comes with SVM.

References:
https://www.udacity.com/course/viewer#!/c-ud726-nd/l-5447009165/e-2428048554/m-2436168579

In [17]:
# TODO: Fine-tune your model and report the best F1 score
from sklearn import grid_search
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
from sklearn.cross_validation import StratifiedShuffleSplit

# Note: cv is defined here but is not passed to GridSearchCV below,
# so the grid search uses its default cross-validation splitting
cv = StratifiedShuffleSplit(y_train, random_state=42)

clf = SVC()
param_grid = [
  {'C': [1, 10, 100, 200, 300, 400, 500, 600, 700],
   'gamma': [1e-2, 1e-3, 1e-4, 1e-5, 1e-6],
   'kernel': ['rbf'], 'tol': [1e-3, 1e-4, 1e-5, 1e-6]
  }
 ]

f1_scorer = make_scorer(f1_score, pos_label="yes")
# Use a separate name so that the grid_search module is not shadowed
grid = grid_search.GridSearchCV(clf, param_grid, scoring=f1_scorer)
grid.fit(X_train, y_train)
best_clf = grid.best_estimator_  # SVC with the best parameters found
train_f1_score = predict_labels(best_clf, X_train, y_train)
print "F1 score for training set: {}".format(train_f1_score)
print "F1 score for test set: {}".format(predict_labels(best_clf, X_test, y_test))

# References:
# http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
# http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
# http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html

Predicting labels using SVC...
Done!
Prediction time (secs): 0.012
F1 score for training set: 0.816593886463
Predicting labels using SVC...
Done!
Prediction time (secs): 0.004
F1 score for test set: 0.880503144654
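To get a feel for how much work the grid search above does, the sketch below enumerates every parameter combination in the same grid using only the standard library. It counts candidates and does not train anything; the variable names are illustrative.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'C': [1, 10, 100, 200, 300, 400, 500, 600, 700],
    'gamma': [1e-2, 1e-3, 1e-4, 1e-5, 1e-6],
    'kernel': ['rbf'],
    'tol': [1e-3, 1e-4, 1e-5, 1e-6],
}

# Build one dict per combination of parameter values.
names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

print("Parameter combinations: {}".format(len(combinations)))
```

Each combination is fitted once per cross-validation fold, so the total number of SVC fits is the number of combinations multiplied by the number of folds. This is why the search takes noticeably longer than training a single model.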


Final F1 score for test set is: 0.880503144654