# Project 2: Supervised Learning
### Building a Student Intervention System

## 1. Classification vs Regression

This is a classification problem because the target column contains categorical data (`yes`/`no`). It would have been a regression problem if the target column had contained real-valued (continuous) data.

## 2. Exploring the Data

Let's go ahead and read in the student dataset first.

_To execute a code cell, click inside it and press **Shift+Enter**._

In [3]:
# Import libraries
import numpy as np
import pandas as pd

In [4]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label; all others are feature columns

Student data read successfully!


Now, can you find out the following facts about the dataset?
- Total number of students
- Number of students who passed
- Number of students who failed
- Graduation rate of the class (%)
- Number of features

_Use the code block below to compute these values. Instructions/steps are marked using **TODO**s._

In [5]:
# TODO: Compute desired values - replace each '?' with an appropriate expression/function call
n_students = student_data.shape[0]
n_features = student_data.shape[1]-1
n_passed = student_data.passed[student_data.passed == 'yes'].count()
n_failed = student_data.passed[student_data.passed == 'no'].count()
grad_rate = (float(n_passed) / n_students) * 100.0
print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [6]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print "\nFeature values:-"
print X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...    

### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
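To make the transformation concrete, here is a minimal pure-Python sketch of dummy-variable encoding. The helper `one_hot` and the sample values are illustrative assumptions; the notebook itself relies on `pandas.get_dummies()`.

```python
def one_hot(values, prefix):
    """Return the dummy column names and one 0/1 dict per input value."""
    categories = sorted(set(values))  # fixed, repeatable column order
    columns = ["{}_{}".format(prefix, c) for c in categories]
    rows = []
    for v in values:
        hot = "{}_{}".format(prefix, v)
        rows.append({col: int(col == hot) for col in columns})
    return columns, rows

# Illustrative values only; not taken from the dataset.
fjob = ["teacher", "other", "services", "other"]
columns, rows = one_hot(fjob, "Fjob")
print(columns)
print(rows[0])
```

Each row ends up with exactly one `1` across the generated columns, which is the property the paragraph above describes.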

In [7]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.

In [8]:
# First, decide how many training vs test samples you want
from sklearn.cross_validation import train_test_split
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# TODO: Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset

X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, train_size=num_train, test_size=num_test)
print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data

Training set: 300 samples
Test set: 95 samples
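
Because the split is random, the rows that land in each set change on every run unless a seed is fixed. The sketch below shows the idea of a seeded shuffle split in plain Python; `shuffled_split` and the seed value are hypothetical and are not scikit-learn's implementation.

```python
import random

def shuffled_split(n_samples, n_train, seed):
    """Shuffle row indices with a fixed seed, then cut them into train/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = shuffled_split(395, 300, seed=0)
print("Training set: {} samples".format(len(train_idx)))
print("Test set: {} samples".format(len(test_idx)))
```

With scikit-learn, passing a `random_state` argument to `train_test_split` serves the same purpose.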


## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

- What are the general applications of this model? What are its strengths and weaknesses?
- Given what you know about the data so far, why did you choose this model to apply?
- Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.

Produce a table showing training time, prediction time, F<sub>1</sub> score on training set and F<sub>1</sub> score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.
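
For reference, the F<sub>1</sub> score is the harmonic mean of precision and recall for the positive class. Below is a minimal sketch, with made-up labels and a hypothetical helper `f1`, of the quantity that `f1_score(..., pos_label='yes')` measures.

```python
def f1(y_true, y_pred, pos_label="yes"):
    """F1 = 2 * precision * recall / (precision + recall) for pos_label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no true positives, so the score is defined as 0 here
    precision = float(tp) / (tp + fp)
    recall = float(tp) / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels for illustration.
y_true = ["yes", "yes", "yes", "no", "no"]
y_pred = ["yes", "yes", "no", "yes", "no"]
score = f1(y_true, y_pred)
print("F1: {:.3f}".format(score))
```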
### Assumptions about our data
- Our dataset is small.
- Our dataset is slightly unbalanced. (There are more passing students than failing students)

### Decision Tree Classifier

#### Reason for using this classifier
I chose the Decision Tree Classifier because
- it can easily be visualized and explained to a non-technical audience
- once trained, it makes predictions in time logarithmic in the number of training samples
- it handles binary classification, which is what we need here: each prediction is whether a student passes or fails.
#### The Advantages:
- A decision tree can be visualized, which makes it easy to understand and interpret.
- It requires little data preparation.
- The cost of using the tree to predict is logarithmic in the number of data points used to train it.
#### The Disadvantages:
- Decision trees can grow overly complex, performing well on the training data but poorly on unseen data, a situation known as overfitting. It can be mitigated by, for example, limiting the maximum depth of the tree.
- Decision trees can be unstable: a small variation in the data may produce a completely different tree.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset before fitting.

### Support Vector Classifier
#### Reason for using this classifier
I also chose the Support Vector Classifier (SVC) because
- SVCs work well in complicated domains where there are many features relative to the amount of data available.
- SVCs work well when the dataset is small.
- Our dataset does not contain a lot of noise, which makes it a good candidate for SVCs.
#### The advantages:
- Effective in high-dimensional spaces: it can still give good predictions when the number of features is greater than the number of samples.
- Uses only a subset of the training points (the support vectors) in the decision function, so it is memory efficient.
- Versatile: different kernel functions can be specified for the decision function.
#### The disadvantages:
- It is likely to perform poorly if the number of features is much greater than the number of samples.
- It does not directly provide probability estimates; scikit-learn computes them with an expensive cross-validation step.
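
To illustrate the support-vector and kernel points: once trained, an SVC classifies a new point using only its support vectors, weighted through a kernel function. The sketch below uses invented support vectors, coefficients and intercept purely for illustration; in scikit-learn the fitted values are exposed as attributes such as `support_vectors_` and `dual_coef_`.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared distance)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

def linear_kernel(a, b):
    """Plain dot product."""
    return sum(x * y for x, y in zip(a, b))

def decision_function(x, support_vectors, coefs, intercept, kernel):
    """Sum of coef_i * K(sv_i, x) plus the intercept; the sign gives the class."""
    total = sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coefs))
    return total + intercept

# Invented values: two support vectors, one per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
coefs = [1.0, -1.0]  # signed weight of each support vector
intercept = 0.0

point = (0.5, 2.0)
for kernel in (linear_kernel, rbf_kernel):
    value = decision_function(point, support_vectors, coefs, intercept, kernel)
    label = "yes" if value > 0 else "no"
    print("{}: {:.3f} -> {}".format(kernel.__name__, value, label))
```

Swapping the kernel changes the shape of the decision boundary without touching the rest of the procedure, which is the sense in which the model is versatile.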


### Gaussian Naive Bayes
#### Reason for choosing this classifier
I chose Naive Bayes because
- It has a light memory and CPU footprint.
- I would argue that this classifier can still make good predictions with little data, because it only has to estimate a mean and a variance per feature for each class.
#### The Advantages:
- It requires only a small amount of training data to estimate the necessary parameters.
- Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods.
- The decoupling of the class-conditional feature distributions means that each distribution can be estimated independently as a one-dimensional distribution.
#### The Disadvantages:
- Although it is a decent classifier, it is known to be a bad estimator, so its predicted probabilities should not be taken too seriously.
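
The "independently estimated" point can be made concrete: Gaussian Naive Bayes only needs a mean and a variance per feature per class, plus the class priors. Below is a minimal sketch with invented numbers; the helper names `fit_gnb` and `predict_gnb` are hypothetical, and this is not scikit-learn's implementation.

```python
import math
from collections import defaultdict

def fit_gnb(X, y):
    """Estimate a prior plus a per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = float(len(rows))
        stats = []
        for column in zip(*rows):  # each feature is handled on its own
            mean = sum(column) / n
            var = sum((v - mean) ** 2 for v in column) / n + 1e-9  # avoid zero variance
            stats.append((mean, var))
        model[label] = (n / len(X), stats)
    return model

def predict_gnb(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    best_label, best_score = None, None
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented data: [absences, failures] -> passed
X = [[0, 0], [2, 0], [4, 1], [20, 3], [16, 2], [25, 3]]
y = ["yes", "yes", "yes", "no", "no", "no"]
model = fit_gnb(X, y)
print(predict_gnb(model, [3, 0]))
print(predict_gnb(model, [18, 2]))
```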


<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Classifier</th>
      <th>F1 score - test</th>
      <th>F1 score - train</th>
      <th>Size</th>
      <th>Train time</th>
      <th>predict time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>DecisionTreeClassifier</td>
      <td>0.650407</td>
      <td>1.000000</td>
      <td>100</td>
      <td>0.001</td>
      <td>0.000</td>
    </tr>
    <tr>
      <th>1</th>
      <td>DecisionTreeClassifier</td>
      <td>0.661017</td>
      <td>1.000000</td>
      <td>200</td>
      <td>0.002</td>
      <td>0.000</td>
    </tr>
    <tr>
      <th>2</th>
      <td>DecisionTreeClassifier</td>
      <td>0.837209</td>
      <td>1.000000</td>
      <td>300</td>
      <td>0.003</td>
      <td>0.000</td>
    </tr>
    <tr>
      <th>3</th>
      <td>SVC</td>
      <td>0.800000</td>
      <td>0.858896</td>
      <td>100</td>
      <td>0.002</td>
      <td>0.001</td>
    </tr>
    <tr>
      <th>4</th>
      <td>SVC</td>
      <td>0.815789</td>
      <td>0.872131</td>
      <td>200</td>
      <td>0.006</td>
      <td>0.002</td>
    </tr>
    <tr>
      <th>5</th>
      <td>SVC</td>
      <td>0.828947</td>
      <td>0.846316</td>
      <td>300</td>
      <td>0.037</td>
      <td>0.005</td>
    </tr>
    <tr>
      <th>6</th>
      <td>GaussianNB</td>
      <td>0.225000</td>
      <td>0.409091</td>
      <td>100</td>
      <td>0.002</td>
      <td>0.001</td>
    </tr>
    <tr>
      <th>7</th>
      <td>GaussianNB</td>
      <td>0.738462</td>
      <td>0.804270</td>
      <td>200</td>
      <td>0.006</td>
      <td>0.001</td>
    </tr>
    <tr>
      <th>8</th>
      <td>GaussianNB</td>
      <td>0.805755</td>
      <td>0.817156</td>
      <td>300</td>
      <td>0.002</td>
      <td>0.000</td>
    </tr>
  </tbody>
</table>

In [9]:
# Train a model
import time

def train_classifier(clf, X_train, y_train):
    print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    print "Done!\nTraining time (secs): {:.3f}".format(end - start)
    return end - start

# TODO: Choose a model, import it and instantiate an object
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

# Fit model to training data
train_classifier(clf, X_train, y_train)  # note: using entire training set here
#print clf  # you can inspect the learned model by printing it

Training DecisionTreeClassifier...
Done!
Training time (secs): 0.000


0.0
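
A training time that prints as 0.000 suggests the single `fit` call finished faster than the timer could resolve. One hedged alternative is to repeat the call and average, using `timeit.default_timer`. The `average_time` helper and the stand-in workload below are assumptions for illustration, not part of the project template.

```python
import timeit

def average_time(func, repeats=100):
    """Run func several times and return the mean wall-clock seconds per call."""
    start = timeit.default_timer()
    for _ in range(repeats):
        func()
    return (timeit.default_timer() - start) / repeats

# Stand-in workload; in the notebook this could be lambda: clf.fit(X_train, y_train)
def workload():
    return sorted(range(1000, 0, -1))

seconds = average_time(workload)
print("Average time per call (secs): {:.6f}".format(seconds))
```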

In [10]:
# Predict on training set and compute F1 score
from sklearn.metrics import f1_score

def predict_labels(clf, features, target):
    print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    print "Done!\nPrediction time (secs): {:.3f}".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes'), end - start

train_f1_score = predict_labels(clf, X_train, y_train)
print "F1 score for training set: {}".format(train_f1_score[0])

Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for training set: 1.0


In [17]:
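# Predict on test data
# Note: the output below is from a later re-run of this cell (In [17]), by which
# point `clf` had been rebound to GaussianNB by the model-comparison loop.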
print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test)[0])

Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.705882352941


In [12]:
# Train and predict using different training set sizes
def train_predict(clf, X_train, y_train, X_test, y_test):
    print "------------------------------------------"
    print "Training set size: {}".format(len(X_train))
    train_classifier(clf, X_train, y_train)
    print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train)[0])
    print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test)[0])

# TODO: Run the helper function above for desired subsets of training data
# Note: Keep the test set constant

In [13]:
# TODO: Train and predict using two other models
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
classifiers = [DecisionTreeClassifier(), SVC(), GaussianNB()]
results = { 
        'Classifier': [],
        'Size': [], 
        'Train time': [], 
        'predict time': [], 
        'F1 score - train': [], 
        'F1 score - test': []
    }
datasets = [train_test_split(X_all, y_all, train_size=x, test_size=95) for x in [100, 200, 300]]
for clf in classifiers:
    for data in datasets:
        X_train, X_test, y_train, y_test = data
        time_train = train_classifier(clf, X_train, y_train)
        f1_train, time_predict = predict_labels(clf, X_train, y_train)
        f1_test, time_predict = predict_labels(clf, X_test,y_test)
        
        results['Classifier'].append(clf.__class__.__name__)
        results['Size'].append(X_train.shape[0])
        results['Train time'].append("{:.3f}".format(time_train))
        results['predict time'].append("{:.3f}".format(time_predict))
        results['F1 score - train'].append(f1_train)
        results['F1 score - test'].append(f1_test)
        
pd.DataFrame(results)

Training DecisionTreeClassifier...
Done!
Training time (secs): 0.000
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.015
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.000
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.000
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
Training SVC...
Done!
Training time (secs): 0.000
Predicting labels using SVC...
Done!
Prediction time (secs): 0.000
Predicting labels using SVC...
Done!
Prediction time (secs): 0.000
Training SVC...
Done!
Training time (secs): 0.000
Predicting labels using SVC...
Done!
Predic

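<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Classifier</th>
      <th>F1 score - test</th>
      <th>F1 score - train</th>
      <th>Size</th>
      <th>Train time</th>
      <th>predict time</th>
    </tr>
  </thead>
  <tbody>
    <tr><th>0</th><td>DecisionTreeClassifier</td><td>0.744526</td><td>1.0</td><td>100</td><td>0.0</td><td>0.0</td></tr>
    <tr><th>1</th><td>DecisionTreeClassifier</td><td>0.677966</td><td>1.0</td><td>200</td><td>0.0</td><td>0.0</td></tr>
    <tr><th>2</th><td>DecisionTreeClassifier</td><td>0.710744</td><td>1.0</td><td>300</td><td>0.0</td><td>0.0</td></tr>
    <tr><th>3</th><td>SVC</td><td>0.873418</td><td>0.88</td><td>100</td><td>0.0</td><td>0.0</td></tr>
    <tr><th>4</th><td>SVC</td><td>0.758621</td><td>0.864353</td><td>200</td><td>0.0</td><td>0.0</td></tr>
    <tr><th>5</th><td>SVC</td><td>0.780822</td><td>0.881356</td><td>300</td><td>0.016</td><td>0.015</td></tr>
    <tr><th>6</th><td>GaussianNB</td><td>0.75</td><td>0.820896</td><td>100</td><td>0.0</td><td>0.0</td></tr>
    <tr><th>7</th><td>GaussianNB</td><td>0.727273</td><td>0.78626</td><td>200</td><td>0.0</td><td>0.0</td></tr>
    <tr><th>8</th><td>GaussianNB</td><td>0.705882</td><td>0.804819</td><td>300</td><td>0.0</td><td>0.0</td></tr>
  </tbody>
</table>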


In [16]:
res = pd.DataFrame(results)
res.columns

Index([u'Classifier', u'F1 score - test', u'F1 score - train', u'Size',
       u'Train time', u'predict time'],
      dtype='object')

## 5. Choosing the Best Model

- Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?
- In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).
- Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.
- What is the model's final F<sub>1</sub> score?

### Reviewing the experiment's result
Based on the experiments above, I believe the Decision Tree Classifier is the most appropriate model for the available data. To explain why, I review the three models below in terms of performance (F<sub>1</sub> score on the test set), training time and prediction time, using the first results table in Section 4. Remember that one of our goals is to use as little computation as possible.

#### Decision Trees
The average test F<sub>1</sub> score is `0.716` across all training sizes. The training time increased roughly linearly with the training size (0.001 s, 0.002 s, 0.003 s), and the prediction time is negligible.
#### Support Vector Classifiers
The average test F<sub>1</sub> score is `0.815` across all training sizes. The training time grew much faster than linearly with the training size (0.002 s, 0.006 s, 0.037 s), and so did the prediction time. This model has the best F<sub>1</sub> score, but it takes the most time to both train and predict.
#### Gaussian Naive Bayes
The average test F<sub>1</sub> score is `0.59` across all training sizes. The training time shows no clear pattern relative to the training size, but it is consistently low. The prediction time is near constant, but slightly higher than the decision tree's.

I would argue that the decision tree classifier is the better model for our data because it gives a good F<sub>1</sub> score, trains fast and predicts in negligible time.
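
The per-model averages can be recomputed from the test-set F<sub>1</sub> column of the first results table in Section 4. A small sketch, with the nine values copied by hand:

```python
# Test-set F1 scores copied from the results table (training sizes 100, 200, 300).
f1_test = {
    "DecisionTreeClassifier": [0.650407, 0.661017, 0.837209],
    "SVC": [0.800000, 0.815789, 0.828947],
    "GaussianNB": [0.225000, 0.738462, 0.805755],
}

# Mean test F1 per classifier.
averages = {name: sum(scores) / len(scores) for name, scores in f1_test.items()}
for name in ["DecisionTreeClassifier", "SVC", "GaussianNB"]:
    print("{}: {:.3f}".format(name, averages[name]))
```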

### How Decision Trees work
A decision tree is a set of rules used to classify data into categories. It looks at the variables in a dataset, determines which are most important, and then builds a tree of decisions that best partitions the data. The tree is created by splitting the data on one variable at a time and counting how many samples of each class fall into each bucket after the split.

Imagine you are playing a guessing game where your opponent has a secret answer but allows you to ask true-or-false questions, and tells you whether the answer to each is true or false. How do you find the secret answer with the fewest questions? Suppose the secret is whether a student passes or fails. Some of the questions you can ask are `Does the student live in an urban area?`, `Does the student receive extra educational support?`, `Does the student have internet access?`, `Is the student female?`, and so on. The algorithm arranges these questions into a tree, placing the most informative questions first. To find out if a student passes or fails, you start at the top and answer the questions in turn. Each answer leads either to a smaller subtree of questions or to a final prediction of pass or fail. The tree is constructed so that a prediction needs as few questions as possible.
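
The "split, then count what lands in each bucket" idea can be sketched in a few lines. The example below scores yes/no questions by Gini impurity, one common splitting criterion (and scikit-learn's default for `DecisionTreeClassifier`), on a handful of invented students. The helper names and the data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a bucket: 0 when every label in it is the same."""
    if not labels:
        return 0.0
    p_yes = labels.count("yes") / float(len(labels))
    return 2 * p_yes * (1 - p_yes)

def split_impurity(rows, question):
    """Weighted impurity of the two buckets produced by a yes/no question."""
    true_bucket = [r["passed"] for r in rows if r[question]]
    false_bucket = [r["passed"] for r in rows if not r[question]]
    n = float(len(rows))
    return (len(true_bucket) / n) * gini(true_bucket) + \
           (len(false_bucket) / n) * gini(false_bucket)

# Invented students for illustration.
students = [
    {"internet": True,  "schoolsup": False, "passed": "yes"},
    {"internet": True,  "schoolsup": False, "passed": "yes"},
    {"internet": True,  "schoolsup": True,  "passed": "no"},
    {"internet": False, "schoolsup": True,  "passed": "no"},
    {"internet": False, "schoolsup": False, "passed": "yes"},
    {"internet": False, "schoolsup": True,  "passed": "no"},
]

questions = ["internet", "schoolsup"]
best = min(questions, key=lambda q: split_impurity(students, q))
for q in questions:
    print("{}: {:.3f}".format(q, split_impurity(students, q)))
print("Best first question: {}".format(best))
```

In this toy data the `schoolsup` question separates the two outcomes perfectly, so it would be asked first; a real tree repeats the same comparison inside each resulting bucket.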

### Explaining the Final Model
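My model's final F<sub>1</sub> score on the test set is about 0.81 (0.8125 in the run shown below; the exact value varies slightly between runs because neither the data split nor the tree is seeded).
To arrive at that score, grid search showed that a maximum depth of 2 and a maximum number of features equal to the square root of the number of features produce the best F<sub>1</sub> score.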
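
Grid search tries every combination of the listed parameter values and keeps the one with the best cross-validated score. Below is a sketch of the enumeration step only, using the same grid as the cell that follows. The `expand_grid` helper is hypothetical; scoring each candidate is left to `GridSearchCV`.

```python
import itertools

def expand_grid(grid):
    """Yield one dict per combination of the parameter values in grid."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

parameters = {
    "max_features": ["log2", "sqrt"],
    "max_depth": [1, 2, 3],
}

candidates = list(expand_grid(parameters))
print("{} candidate settings".format(len(candidates)))
for candidate in candidates:
    print(candidate)
```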

In [124]:
# TODO: Fine-tune your model and report the best F1 score
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer
def performance_metric(label, prediction):
    return f1_score(label, prediction, pos_label='yes')

X_train, X_test, y_train, y_test = datasets[2]
parameters = {
    'max_features':['log2', 'sqrt'], 
    'max_depth':[1,2,3]}

dtc = DecisionTreeClassifier()
clf = GridSearchCV(dtc, parameters, scoring = make_scorer(performance_metric, greater_is_better=True))
clf.fit(X_train, y_train)
f1_test, time_predict = predict_labels(clf.best_estimator_, X_test,y_test)
f1_test

Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000


0.81250000000000011

In [116]:
clf.best_score_

0.80641348255869216