# Project 2: Supervised Learning
### Building a Student Intervention System

## 1. Classification vs Regression

Your goal is to identify students who might need early intervention - which type of supervised machine learning problem is this, classification or regression? Why?

#### Answer: 

The most fundamental question in model building is determining what you would like the model to predict. Since we are trying to predict a discrete binary outcome (whether the student will pass or fail), this is a classification problem.


## 2. Exploring the Data

Let's go ahead and read in the student dataset first.

In [49]:
# Import libraries
import numpy as np
import pandas as pd

In [50]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label, all others are feature columns

Student data read successfully!


Now, can you find out the following facts about the dataset?
- Total number of students
- Number of students who passed
- Number of students who failed
- Graduation rate of the class (%)
- Number of features


In [51]:
# TODO: Compute desired values - replace each '?' with an appropriate expression/function call
n_students = student_data.shape[0]
n_features = student_data.shape[1] - 1  # excluding target column
n_passed = (student_data[student_data.passed == 'yes']).shape[0]
n_failed = (student_data[student_data.passed == 'no']).shape[0]
grad_rate = (n_passed / float(n_passed + n_failed)) * 100

print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)


Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [52]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print "\nFeature values:-"
print X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...    

### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
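
As a small illustration of the two conversions described above, here is a sketch on a made-up four-row table (the values are invented for the example and are not taken from `student-data.csv`). It maps a `yes`/`no` column to `1`/`0` and expands a multi-valued column into dummy variables.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes", "no"],
    "Fjob": ["teacher", "other", "services", "other"],
})

# Binary yes/no column -> 1/0
internet = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued column -> one indicator column per distinct value
fjob_dummies = pd.get_dummies(toy["Fjob"], prefix="Fjob").astype(int)

encoded = pd.concat([internet, fjob_dummies], axis=1)
print(encoded)
```

Each row has exactly one `1` among the `Fjob_*` columns, which is what "assign a `1` to one of them and `0` to all others" means in practice.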

In [53]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.

In [54]:
# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# TODO: Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset

from sklearn import cross_validation
# An integer test_size is read as an absolute number of test samples
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X_all,
                                                                     y_all,
                                                                     test_size=num_test)

print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data


Training set: 300 samples
Test set: 95 samples
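
The split above is random, so the exact rows in each set change from run to run. The following is a minimal sketch of a reproducible shuffled split using only the standard library; `shuffled_split` and its `seed` argument are hypothetical names introduced for illustration, not part of the notebook.

```python
import random

def shuffled_split(n_samples, n_train, seed=0):
    """Return (train_idx, test_idx): a shuffled partition of range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed -> same split on every run
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = shuffled_split(395, 300, seed=0)
print("Training set: {} samples".format(len(train_idx)))
print("Test set: {} samples".format(len(test_idx)))
```

With a DataFrame, the index lists could be applied as `X_all.iloc[train_idx]` and `X_all.iloc[test_idx]`. scikit-learn's `train_test_split` offers a `random_state` argument for the same purpose.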


## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

- What are the general applications of this model? What are its strengths and weaknesses?
- Given what you know about the data so far, why did you choose this model to apply?
- Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.

Produce a table showing training time, prediction time, F<sub>1</sub> score on training set and F<sub>1</sub> score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.

#### Helper Functions

In [55]:
class CreateTable(dict):
    # Overridden dict class which takes a dict in the form {'a': 2, 'b': 3},
    # and renders an HTML Table in IPython Notebook.
    def _repr_html_(self):
        tableId = self['1_trainigTime']['tableId']
        
        html = ["<table width=60% id='" + tableId + "'>"]
        html.append("<tr style='background-color:#E6E6E6; font-weight: bold;'>")
        html.append("<td width=25%>{0}</td>".format(tableId))
        html.append("<td colspan=3>Training set size </td>")
        html.append("</tr>")
        html.append("<tr style='background-color:#E6E6E6; font-weight: bold;'>")
        html.append("<td width=40%>{0}</td>".format(''))
        html.append("<td width=20%>{0}</td>".format('100'))
        html.append("<td width=20%>{0}</td>".format('200'))
        html.append("<td width=20%>{0}</td>".format('300'))
        html.append("</tr>")
            
        # Iterate in sorted key order ('1_...', '2_...', ...) so rows appear in a fixed order
        for key in sorted(self):
            html.append("<tr>")            
            html.append("<td width=40% style='background-color:#E6E6E6;'>{0}</td>".format(self[key]['name']))
            html.append("<td width=20%>{0}</td>".format(self[key]['100']))
            html.append("<td width=20%>{0}</td>".format(self[key]['200']))
            html.append("<td width=20%>{0}</td>".format(self[key]['300']))
            html.append("</tr>")
            
        html.append("</table>")
                
        return ''.join(html)

In [56]:
# Train a model
import time

def train_classifier(clf, X_train, y_train):
    # print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    # print "Done!\nTraining time (secs): {:.3f}".format(end - start)
    
    model_results['1_trainigTime'][str(len(X_train))] = '{0:.3f}'.format(end - start)

In [57]:
# Predict on training set and compute F1 score
from sklearn.metrics import f1_score

def predict_labels(clf, features, target):
    # print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    # print "Done!\nPrediction time (secs): {:.3f}".format(end - start)
    
    model_results['2_predictionTime'][str(len(features))] = '{0:.3f}'.format(end - start)
    
    return f1_score(target.values, y_pred, pos_label='yes')

In [58]:
# Train and predict using different training set sizes
def train_predict(clf, X_train, y_train, X_test, y_test):
    # print "------------------------------------------"
    # print "Training set size: {}".format(len(X_train))
    
    # Fit model to training data
    train_classifier(clf, X_train, y_train)      
    
    # print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train))
    # print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))
    
    train_f1_score = predict_labels(clf, X_train, y_train)    
    test_f1_score = predict_labels(clf, X_test, y_test)
    
    model_results['3_train_f1_score'][str(len(X_train))] = '{0:.3f}'.format(train_f1_score)    
    model_results['4_test_f1_score'][str(len(X_train))] = '{0:.3f}'.format(test_f1_score)
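
The helpers above record their measurements in the global `model_results` dictionary. As an alternative sketch (not the code used to produce the tables in this notebook), the same measurements can be returned from a single function. `f1_for_label`, `MajorityClassifier` and `evaluate` are hypothetical names introduced for illustration, and the data is made up.

```python
import time

def f1_for_label(y_true, y_pred, pos_label="yes"):
    """F1 score for one label, computed from true/false positive counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == pos_label and p != pos_label)
    return 2.0 * tp / (2 * tp + fp + fn) if tp else 0.0

class MajorityClassifier(object):
    """Hypothetical stand-in with fit/predict: always predicts the most common label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_] * len(X)

def evaluate(clf, X_train, y_train, X_test, y_test):
    """Fit clf, time fit and predict, and return all measurements as a dict."""
    start = time.time()
    clf.fit(X_train, y_train)
    train_time = time.time() - start

    start = time.time()
    test_pred = clf.predict(X_test)
    predict_time = time.time() - start

    return {
        "train_size": len(X_train),
        "train_time": train_time,
        "predict_time": predict_time,
        "train_f1": f1_for_label(y_train, clf.predict(X_train)),
        "test_f1": f1_for_label(y_test, test_pred),
    }

# Made-up toy data: one feature per row, labels 'yes'/'no'
X_tr, y_tr = [[0], [1], [2], [3]], ["yes", "yes", "no", "yes"]
X_te, y_te = [[4], [5]], ["yes", "no"]
results = evaluate(MajorityClassifier(), X_tr, y_tr, X_te, y_te)
```

Returning a dictionary keeps each run's results separate, so nothing depends on a global variable being reset between models.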

For the student intervention problem I've decided to train and evaluate the following classifiers:

* Decision Tree Classifier
* Support Vector Machine
* Random Forest

In [59]:
# Train and predict using different training set sizes and Models
train_sizes = [100, 200, 300] # Training Sizes

# Model 1: Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
dtc_clf = DecisionTreeClassifier(random_state=0)

# Model 2: Support Vector Machine
from sklearn.svm import SVC
SVM_clf = SVC()

# Model 3: Random Forest
from sklearn.ensemble import RandomForestClassifier
RF_clf = RandomForestClassifier(n_estimators=15)

models = {"SVM classifier": SVM_clf, "Decision Tree Classifier" : dtc_clf, "Randomized Forest": RF_clf}

dic_DecisionTrees = {}
dic_SVC = {}
dic_RandomForest = {}

for model_name, model in models.items():
 
    # Fresh results dict for each model; the helper functions above write into it as a global
    model_results = {'1_trainigTime': {'name': 'Training Time'}, 
        '2_predictionTime': {'name': 'Prediction Time'}, 
        '3_train_f1_score': {'name': 'F1 score for training set'},
        '4_test_f1_score': {'name': 'F1 score for testing set'}}
    model_results['1_trainigTime']['tableId'] = model.__class__.__name__

    #print "\n%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n"
    #print "Testing Model {}\n".format(model_name)
    for size in train_sizes:
        train_predict(model, X_train[:size], y_train[:size], X_test, y_test)
        
    if (model.__class__.__name__ == "SVC"):
        dic_SVC = model_results
    if (model.__class__.__name__ == "DecisionTreeClassifier"):
        dic_DecisionTrees = model_results
    if (model.__class__.__name__ == "RandomForestClassifier"):
        dic_RandomForest = model_results


### Decision Tree Classifier

#### What are the general applications of this model?

A decision tree is a set of rules, arranged as a sequence of yes/no questions, used to classify data into categories. It looks at the variables in a data set, determines which are most important, and then comes up with a tree of decisions which best partitions the data. The tree is created by splitting the data on one variable at a time and counting how many examples of each class fall into each bucket after the split; splits that produce purer buckets are preferred.
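
To make the "split, then count what lands in each bucket" idea concrete, here is a minimal sketch that scores two candidate yes/no questions on made-up data. It assumes Gini impurity as the purity measure (the default criterion of scikit-learn's `DecisionTreeClassifier`); `gini` and `split_impurity` are hypothetical helper names.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure bucket, 0.5 for a 50/50 two-class bucket."""
    n = float(len(labels))
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, question):
    """Size-weighted Gini impurity of the two buckets produced by a yes/no question."""
    yes = [lab for row, lab in zip(rows, labels) if question(row)]
    no = [lab for row, lab in zip(rows, labels) if not question(row)]
    n = float(len(labels))
    return sum(len(bucket) / n * gini(bucket) for bucket in (yes, no) if bucket)

# Made-up students: (failures, absences) with a pass/fail label
rows = [(0, 2), (0, 10), (1, 4), (2, 6), (0, 0), (3, 12)]
labels = ["yes", "yes", "no", "no", "yes", "no"]

by_failures = split_impurity(rows, labels, lambda r: r[0] > 0)  # "failures > 0?"
by_absences = split_impurity(rows, labels, lambda r: r[1] > 5)  # "absences > 5?"
```

On this toy data the `failures` question separates the classes perfectly (impurity 0), so a tree learner would ask it first.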

#### Strengths 

* Simple to understand and to interpret. Trees can be visualized, and a short tree is easy for a human to follow.
* Able to handle both numerical and categorical data (scikit-learn's implementation needs categorical features encoded numerically, as was done in the preprocessing step).
* Able to handle multi-output problems.
* Ability to deal with irrelevant features. The algorithm selects “relevant” features first, and generally ignores irrelevant features.
* Requires little data preparation.
* Decision trees combined into an ensemble create some of the best binary classifiers.


#### Weaknesses

* Decision tree learners can create overly complex trees that do not generalize well beyond the training data. This is called overfitting, and it is one of their main disadvantages.
* Decision trees can be unstable because small changes in the data might result in large changes in the tree. This problem is mitigated by using decision trees within an ensemble.
* Decision tree learners create biased trees if some classes dominate.
* Another disadvantage is that they don’t support online learning, which means that the tree has to be rebuilt when new examples arrive.


#### Why did you choose this model to apply?

I thought it would be a good candidate for modeling the student intervention system because:
* It is computationally simple, so it should be very time efficient.
* It is a simple model that is easy to interpret and explain to the audience, and it delivers the most human-readable results.
* It offers reasonable predictive power for its relative simplicity.

### Support Vector Machine (SVM) Classifier

#### What are the general applications of this model?

An SVM is a non-probabilistic classifier with a broad range of applications. It separates the data points with a maximum-margin hyperplane and, through the use of kernels, can also act as a nonlinear classifier.

* Especially popular in text classification problems where very high-dimensional spaces are the norm.
* Hand-written characters can be recognized using SVM.
* SVMs are helpful in text and hypertext categorization as their application can significantly reduce the need for labeled training instances.
* Classification of images can also be performed using SVMs. Experimental results show that SVMs achieve significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.
* SVMs are also useful in medical science to classify proteins with up to 90% of the compounds classified correctly.


#### Strengths 

* SVMs generally have good generalization performance and can reach high accuracy.
* Effective in cases where number of dimensions is greater than the number of samples.
* Uses a subset of training points in the decision function, so it is also memory efficient.
* With an appropriate kernel they can work well even if the data isn't linearly separable in the base feature space.
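
A kernel is simply a similarity function between two examples. The sketch below writes out the three kernels tried later in this notebook (`linear`, `poly`, `rbf`) in plain Python, using the formulas given in the scikit-learn documentation; the helper names and the example points are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_kernel(a, b):
    """<a, b>"""
    return dot(a, b)

def poly_kernel(a, b, degree=2, gamma=1.0, coef0=0.0):
    """(gamma * <a, b> + coef0) ** degree"""
    return (gamma * dot(a, b) + coef0) ** degree

def rbf_kernel(a, b, gamma=1.0):
    """exp(-gamma * ||a - b||^2); equals 1.0 when a == b"""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

a, b = [1.0, 2.0], [3.0, 0.5]
similarities = (linear_kernel(a, b), poly_kernel(a, b), rbf_kernel(a, b))
```

Swapping the kernel changes what "similar" means, which is how the same algorithm can draw a straight boundary or a curved one.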


#### Weaknesses

* The major downside of SVMs is that they can be painfully inefficient to train: training time grows at least quadratically with the number of training examples.
* Training and prediction can be expensive, both in time and in memory.
* They are not recommended for problems with a very large number of training examples.
* SVMs are not recommended for most "industry scale" applications. Anything beyond a toy/lab problem might be better approached with a different algorithm.
* SVMs do not directly provide probability estimates; in scikit-learn these are calculated using an expensive five-fold cross-validation.


#### Why did you choose this model to apply?

* SVM does not require a large amount of data to make a reasonable prediction. Since the student intervention dataset is small, the training cost of an SVM is negligible here, and SVMs offer good generalization performance and high accuracy, so I thought it would be another good candidate for modeling the student intervention system. Also, SVM is a powerful and flexible classifier that can be adjusted to specific applications by fine-tuning its parameters.

### Random Forest Classifier

#### What are the general applications of this model?

A Random Forest classifier is an ensemble learning method in which a number of relatively simple hypotheses (decision trees, each trained on a random sample of the data) are aggregated to create a more robust hypothesis.
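
The aggregation step can be sketched with a simple majority vote. The three rules below are made-up stand-ins for individual trees, and the helper names are hypothetical. This is a simplified illustration: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the individual predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Three deliberately simple, made-up rules standing in for individual trees
weak_rules = [
    lambda s: "no" if s["failures"] > 0 else "yes",
    lambda s: "no" if s["absences"] > 10 else "yes",
    lambda s: "yes" if s["studytime"] >= 2 else "no",
]

def ensemble_predict(student):
    """Ask every rule for a prediction and return the majority answer."""
    return majority_vote([rule(student) for rule in weak_rules])

student = {"failures": 1, "absences": 4, "studytime": 3}
prediction = ensemble_predict(student)
```

Here one rule says `no` and two say `yes`, so the ensemble answers `yes`; a single unreliable rule is outvoted by the others.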


#### Strengths 

* Random forests are robust to outliers.
* They give a good idea of which features in the data set are the most important.
* They almost always have lower classification error and better F-scores than single decision trees.
* They deal well with uneven data sets that have missing variables.
* They can produce very good classifiers because techniques such as bagging (training each tree on a random resample of the data) make the forest generalize better than a single tree.


#### Weaknesses

* The algorithm is computationally intensive and can require a relatively large amount of computing power.
* Better accuracy generally needs more trees, which slows down training.
* It may not work as well with a small dataset.
* It can still overfit the training data.


#### Why did you choose this model to apply?

* Although the dataset is small, I decided to try the Random Forest model. I also wanted to see whether a computationally intensive algorithm is viable on a small data set given the budget constraints.

In [60]:
CreateTable(dic_DecisionTrees)

| DecisionTreeClassifier (by training set size) | 100 | 200 | 300 |
| --- | --- | --- | --- |
| Training Time (s) | 0.000 | 0.000 | 0.000 |
| Prediction Time (s) | 0.000 | 0.000 | 0.000 |
| F1 score for training set | 1.000 | 1.000 | 1.000 |
| F1 score for testing set | 0.723 | 0.750 | 0.748 |


In [61]:
CreateTable(dic_SVC)

| SVC (by training set size) | 100 | 200 | 300 |
| --- | --- | --- | --- |
| Training Time (s) | 0.004 | 0.004 | 0.012 |
| Prediction Time (s) | 0.000 | 0.008 | 0.004 |
| F1 score for training set | 0.883 | 0.887 | 0.870 |
| F1 score for testing set | 0.792 | 0.789 | 0.803 |


In [62]:
CreateTable(dic_RandomForest)

| RandomForestClassifier (by training set size) | 100 | 200 | 300 |
| --- | --- | --- | --- |
| Training Time (s) | 0.028 | 0.043 | 0.047 |
| Prediction Time (s) | 0.004 | 0.000 | 0.000 |
| F1 score for training set | 0.993 | 1.000 | 0.998 |
| F1 score for testing set | 0.800 | 0.808 | 0.789 |


## 5. Choosing the Best Model


##### Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?

All models performed well in terms of time efficiency (training and prediction time). Only Random Forest was slightly slower to train than the other models (up to 0.047 s). However, the time difference is so small that it cannot be the main factor in choosing the best model for this problem, so the model should be chosen based on the F1 scores.

Although the Decision Tree classified the training data perfectly (training F1 score of 1.000 at every training size), its test F1 scores were the lowest of the three models (0.723 to 0.750), which means it overfits and performs worse on unseen data. Random Forest shows a similar gap, with training F1 scores of 0.993 to 1.000 and test F1 scores of 0.789 to 0.808; it produced the best test F1 scores at 100 and 200 training examples. SVM produced the best test F1 score with the full 300 training examples (0.803).

Considering all this, I chose scikit-learn's SVM classifier (SVC) as the best model for this problem. Its F1 scores were consistent and changed very little with training set size (0.789 to 0.803 on the test set), and its gap between training and test scores was the smallest of the three models.

##### In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).


SVM is a type of linear separator. Suppose we want to split black circles from white ones by drawing a line. Typically there are an infinite number of lines that will accomplish this task. An SVM finds the "maximum-margin" line: the line "in the middle" that stays as far as possible from the nearest circles of either color. Intuitively, this works well because it allows for noise and is most tolerant of mistakes on either side.

In a two-dimensional problem, the SVM draws a line or curve between the points to separate the outcomes. In our case the problem has many dimensions (one per feature), so the SVM builds a surface, instead of a single curve, that best separates the students who graduated from those who did not. The best surface is the one that maximizes the distance to the nearest students on each side.
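
The margin idea can be checked numerically. The sketch below uses made-up 2-D points and two hand-picked separating lines (neither is claimed to be the line an SVM would find); `signed_distance` and `margin` are hypothetical helper names. Both lines separate the two groups, but one leaves much more room than the other.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the line w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Smallest distance from the line to any point; negative if a point is misclassified."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Made-up 2-D points: +1 = passed, -1 = failed
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, +1, +1]

line_a = ((1.0, 1.0), -5.5)  # x + y = 5.5, passes between the two groups
line_b = ((1.0, 1.0), -3.5)  # x + y = 3.5, hugs the "failed" group
margin_a = margin(line_a[0], line_a[1], points, labels)
margin_b = margin(line_b[0], line_b[1], points, labels)
```

An SVM searches over all separating lines for the one with the widest margin, so it would prefer something like `line_a` over `line_b`.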

##### Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.


In [63]:
# TODO: Fine-tune your model and report the best F1 score
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer

parameters = {'kernel': ['linear', 'poly', 'rbf'], 'degree': [1,2,3,4,5], 'C': (0.05,0.15,0.25,0.35,0.45,0.50)}
f1_scorer = make_scorer(f1_score, pos_label="yes")

clf = GridSearchCV(SVM_clf, parameters, scoring = f1_scorer)

clf.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'kernel': ['linear', 'poly', 'rbf'], 'C': (0.05, 0.15, 0.25, 0.35, 0.45, 0.5), 'degree': [1, 2, 3, 4, 5]},
       pre_dispatch='2*n_jobs', refit=True,
       scoring=make_scorer(f1_score, pos_label=yes), verbose=0)
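
Grid search simply tries every combination of the listed parameter values. The sketch below enumerates the combinations in plain Python to show how many candidates the grid above contains; `expand_grid` is a hypothetical helper, not a scikit-learn function. Because `degree` is only used by the polynomial kernel, the sketch also counts a split grid in which only `poly` is paired with `degree`.

```python
from itertools import product

def expand_grid(grid):
    """Return every parameter combination in a {name: values} grid as a dict."""
    names = sorted(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]

C_values = (0.05, 0.15, 0.25, 0.35, 0.45, 0.50)

# The single grid used above: every kernel is paired with every degree
full_grid = expand_grid({"kernel": ["linear", "poly", "rbf"],
                         "degree": [1, 2, 3, 4, 5],
                         "C": C_values})

# Alternative: only the polynomial kernel is paired with 'degree'
split_grid = (expand_grid({"kernel": ["linear", "rbf"], "C": C_values}) +
              expand_grid({"kernel": ["poly"], "degree": [1, 2, 3, 4, 5], "C": C_values}))

print("{} vs {} candidate settings".format(len(full_grid), len(split_grid)))
```

`GridSearchCV` accepts a list of dictionaries as its parameter grid, so the split form could be passed directly and would avoid refitting `linear` and `rbf` models that differ only in an unused `degree`.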

##### What is the model's final F<sub>1</sub> score?

In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It is the harmonic mean of precision (the fraction of predicted positives that are correct) and recall (the fraction of actual positives that are found); correct negatives do not enter the calculation.

The final F1 score of the tuned model on the test set is 0.819, which is slightly better than the 0.803 that the default SVM produced with 300 training examples.
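
For reference, the F1 score is defined as 2 * precision * recall / (precision + recall). The sketch below computes it from raw counts; `f1_from_counts` is a hypothetical helper. As an illustration only, it uses the class counts from Section 2 to show what a classifier that always predicts `yes` would score on the full dataset (not on the test set used above).

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustration: always predicting 'yes' on all 395 students.
# 265 passed (true positives), 130 failed (false positives), no student who passed is missed.
baseline_f1 = f1_from_counts(tp=265, fp=130, fn=0)
print("{:.3f}".format(baseline_f1))
```

Because about two thirds of the students passed, even this trivial strategy reaches an F1 score of about 0.80, which is a useful reference point when reading the scores in this notebook.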

In [64]:
# f1_score expects the true labels first, then the predictions
best_F1_score = '{0:.3f}'.format(f1_score(y_test, clf.predict(X_test), pos_label='yes'))

print "Best F1 Score: " +  best_F1_score
print "\nBest model parameter:  " + str( clf.best_params_)
print "\nBest estimator:\n{}".format(clf.best_estimator_)

Best F1 Score: 0.819

Best model parameter:  {'kernel': 'poly', 'C': 0.25, 'degree': 2}

Best estimator:
SVC(C=0.25, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=2, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
