# Machine Learning Engineer Nanodegree
## Supervised Learning
## Project 2: Building a Student Intervention System

Welcome to the second project of the Machine Learning Engineer Nanodegree! In this notebook, some template code has already been provided for you, and it will be your job to implement the additional functionality necessary to successfully complete this project. Sections that begin with **'Implementation'** in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a `'TODO'` statement. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question X'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.  

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can be edited by typically double-clicking the cell to enter edit mode.

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer: **

•	This is clearly a classification question as the goal is to predict whether a student is going to be successful or not. Given a student, the ask is to assign the classes “successful” and “not successful” to that student, so that teachers and other people could administer early interventions. Also, this is not a regression problem because we are not trying to predict a continuous output such as predict the final (continuous) score of the student. 


## Exploring the Data
Run the code cell below to load necessary Python libraries and load the student data. Note that the last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score

# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"

Student data read successfully!


### Implementation: Data Exploration
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [2]:
# TODO: Calculate number of students
n_students = student_data.shape[0]

# TODO: Calculate number of features
n_features = student_data.shape[1] -1

# TODO: Calculate passing students
n_passed = student_data[student_data['passed'] == 'yes'].shape[0]

# TODO: Calculate failing students
n_failed = student_data[student_data['passed'] == 'no'].shape[0]

# TODO: Calculate graduation rate
grad_rate = n_passed / float(n_students)
grad_rate = grad_rate * 100

# Print the results
print "Total number of students: {}".format(n_students)
print "Number of features: {}".format(n_features)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%


## Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Run the code cell below to separate the student data into feature and target columns to see if any features are non-numeric.

In [3]:
# Extract feature columns
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print "Feature columns:\n{}".format(feature_cols)
print "\nTarget column: {}".format(target_col)

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Show the feature information by printing the first five rows
print "\nFeature values:"
print X_all.head()

Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       

### Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.

In [4]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.iteritems():
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)

print "Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. In the following code cell below, you will need to implement the following:
- Randomly shuffle and split the data (`X_all`, `y_all`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [5]:
# TODO: Import any additional functionality you may need here
from sklearn.cross_validation import StratifiedShuffleSplit

# TODO: Set the number of training points
num_train = 300

# Set the number of testing points
num_test = X_all.shape[0] - num_train
print "Num Test", num_test
# TODO: Shuffle and split the dataset into the number of training and testing points above

X_train = None
X_test  = None
y_train = None
y_test = None

def split_shuffle(X, y, num_train):
    
    shuffleSplit = StratifiedShuffleSplit(y, 1, .24)
    #print shuffleSplit
    for train_index, test_index in shuffleSplit:
        X_train = X.iloc[train_index]
        X_test = X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
      
        #print train_index, "", train_index
   
    return X_train, X_test, y_train, y_test

y_all = student_data['passed']

#print "YYY:",y_all.shape[0]

X_train, X_test, y_train, y_test = split_shuffle(X_all, y_all, num_train)


print "-------------------------------------------------"
#X_train = None
#X_test = None
#y_train = None
#y_test = None

# Show the results of the split
print "Training set has {} samples.".format(X_train.shape[0])
print "Testing set has {} samples.".format(X_test.shape[0])


Num Test 95
-------------------------------------------------
Training set has 300 samples.
Testing set has 95 samples.


## Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data (100 data points, 200 data points, and 300 data points) and measure the F<sub>1</sub> score. You will need to produce three tables (one for each model) that shows the training set size, training time, prediction time, F<sub>1</sub> score on the training set, and F<sub>1</sub> score on the testing set.

### Question 2 - Model Application
*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*

**Answer: **

About dataset


	1) There are more features relative to the number of training instances, which might
	   make models more likely to suffer overfitting.
	
	2) The labels are imbalanced. Instead of having similar amounts of
		students who passed and students who failed, there is bit more of the
		latter than the former

	3)  Student performance characteristics tend to follow Gaussian (Normal) distribution

In summary, because of labels being unbalanced, some models are likely to perform poorly.

For our project, I have chosen Random Forest, Gradient Boosting and Gaussian Bayes Classifier algorithms.  I wanted to try both deterministic and probabilistic approach. For deterministic approach, as a first choice Random forest is the best. Random Forest is considered to be the most popular and successful algorithm to be considered as a first choice.  With some tuning/boosting, Gradient Boosting is a good choice. Since the dataset tend to follow Normal distribution, I have selected Gaussian Bayes Classifier as the third one.


Random forest
-----------------

Random Forests are a combination of tree predictors where each tree depends on the values of a random vector sampled independently with the same distribution for all trees in the forest. The basic principle is that a group of “weak learners” can come together to form a “strong learner”. 

Random Forests are a wonderful tool for making predictions considering they do not overfit because of the law of large numbers. Introducing the right kind of randomness makes them accurate classifiers and regressors.

Applicability:

Random forests can be used for classification or regression analysis. They are an ensemble of different trees and are used for classification or nonlinear multiple regression. Each leaf contains a distribution for the random/continuous output variable/s.

Examples Applications using Random Forest

	* Predicting Customer Retentions
	* Many applications in Medical and Bioinformatics
	* Face recognitions systems


Strengths

•	Most accurate learning algorithm. For many data sets, it produces a highly accurate classifier.
•	Fast to build and even faster to predict
•	Fully parallelizable 
•	Resistance to over training
•	Runs efficiently on large data bases
•	Gives estimates of what variables are important in the classification
•	Provides effective methods for estimating missing data
•	Provides methods for balancing error in class population unbalanced data sets
•	Generated forests can be saved for future use on other data
•	Can handle thousands of input variables without variable deletion.


Weakness

•	Random forests have been observed to overfit for some datasets with noisy classification/regression tasks.

•	For data including categorical variables with different number of levels, random forests are biased in favor of those attributes with more levels. Therefore, the variable importance scores from random forest are not reliable for this type of data.

•	Most implementations of Random Forests are slow when you have a large number of features 

•	Random Forests aren't good at generalizing to cases with completely new data. For example, if I tell you that 1 chocolate costs $1, 2 chocolates cost $2, and 3 chocolates cost $3, how much do 10 chocolates cost? A linear regression can easily figure this out, while a Random Forest has no way of finding the answer.

I have selected Random forest as a first choice because Random Forest is one of the best base model to start with.  There are no or very minimal tuning required and works very well for small and larger data set.


Gradient Boosting 
------------------------------

Gradient boosting algorithms are a family of powerful machine-learning techniques that have shown considerable success in a wide range of practical applications. 

They are highly customizable to the particular needs of the application, like being learned with respect to different loss functions. 

A common task that appears in different machine learning applications is to build a non-parametric regression or classification model from the data. When designing a model in domain-specific areas, one strategy is to build a model from theory and adjust its parameters based on the observed data. 

The main idea of boosting is to add new models to the ensemble sequentially. At each particular iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far. 

In gradient boosting, the learning procedure consecutively fits new models to provide a more accurate estimate of the response variable. The principle idea behind this algorithm is to construct the new base-learners to be maximally correlated with the negative gradient of the loss function, associated with the whole ensemble. 

The loss functions applied can be arbitrary, but to give a better intuition, if the error function is the classic squared-error loss, the learning procedure would result in consecutive error-fitting. In general, the choice of the loss function is up to the researcher, with both a rich variety of loss functions derived so far and with the possibility of implementing one's own task-specific loss.

Strengths

• Heterogeneous data (features measured on different scale)
• Supports different loss functions (e.g. huber)
• Automatically detects (non-linear) feature interactions

Weakness

* More concretely, GBDTs have more hyper-parameters to tune and are also more prone to overfitting
• Requires careful tuning
• Slow to train (but fast to predict)
• Cannot extrapolate


I have selected Gradient Boosting as a second choice mainly to understand whether tuning/boosting can help to get better results.  Often times, boosting and selecting proper loss function would help the model to be more accurate.


Gaussian Bayes Classifier
-----------------------

-	Gaussian Bayesian classifiers assume attributes have independent distributions
-	Each feature has a Gaussian distribution

Example:

Many medical images/processing applications 

Strengths

* Fast to train (single scan). Fast to classify
* Not sensitive to irrelevant features
* Handles real and discrete data
* Handles streaming data well
* Naïve Bayes is fast and space efficient
* Naïve Bayes is NOT sensitive to irrelevant features
* If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. 
 * A good bet if you want to do some kind of semi-supervised learning

 *The Bayes classification rule minimizes any potential misclassifications 


Weakness
	
The Naive Bayes classifier assumes that all variables are conditionally independent given the outcome. This assumption rarely holds in practice

I have selected Gaussian Bayes Classifier as a third choice to understand whether probabilistic model would give better accuracy. If the features have independent distributions (as Gaussian or normal), this technique is probably the best one as the training time is very quick.




### Setup
Run the code cell below to initialize three helper functions which you can use for training and testing the three supervised learning models you've chosen above. The functions are as follows:
- `train_classifier` - takes as input a classifier and training data and fits the classifier to the data.
- `predict_labels` - takes as input a fit classifier, features, and a target labeling and makes predictions using the F<sub>1</sub> score.
- `train_predict` - takes as input a classifier, and the training and testing data, and performs `train_clasifier` and `predict_labels`.
 - This function will report the F<sub>1</sub> score for both the training and testing data separately.

In [6]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print "Trained model in {:.4f} seconds".format(end - start)

    
def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier based on F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results
    print "Made predictions in {:.4f} seconds.".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes')


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifer based on F1 score. '''
    
    # Indicate the classifier and the training set size
    #print "Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print "F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test))

### Implementation: Model Performance Metrics
With the predefined functions above, you will now import the three supervised learning models of your choice and run the `train_predict` function for each one. Remember that you will need to train and predict on each classifier for three different training set sizes: 100, 200, and 300. Hence, you should expect to have 9 different outputs below — 3 for each model using the varying training set sizes. In the following code cell, you will need to implement the following:
- Import the three supervised learning models you've discussed in the previous section.
- Initialize the three models and store them in `clf_A`, `clf_B`, and `clf_C`.
 - Use a `random_state` for each model you use, if provided.
 - **Note:** Use the default settings for each model — you will tune one specific model in a later section.
- Create the different training set sizes to be used to train each model.
 - *Do not reshuffle and resplit the data! The new training points should be drawn from `X_train` and `y_train`.*
- Fit each model with each training set size and make predictions on the test set (9 in total).  
**Note:** Three tables are provided after the following code cell which can be used to store your results.

In [7]:
# TODO: Import the three supervised learning models from sklearn
# from sklearn import model_A
# from sklearn import model_B
# from skearln import model_C
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import GradientBoostingClassifier

# TODO: Initialize the three models
clf_A = RandomForestClassifier(n_estimators=30,random_state=0) #n_estimators=30,
clf_B = GradientBoostingClassifier(n_estimators=3,random_state=0) #n_estimators=3
clf_C = GaussianNB()  ## seems there is no method to random_state for GaussianNB


y_all = student_data['passed']

# TODO: Set up the training set sizes

X_train, X_test, y_train, y_test = split_shuffle(X_all, y_all, 300)
for clf in [clf_A, clf_B, clf_C]:
    print "\n{}: \n".format(clf.__class__.__name__)
    for n in [100, 200, 300]:
        print "\n\nFor Dataset size:",n
        train_predict(clf, X_train[:n], y_train[:n], X_test, y_test)





RandomForestClassifier: 



For Dataset size: 100
Trained model in 0.0723 seconds
Made predictions in 0.0026 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0027 seconds.
F1 score for test set: 0.7887.


For Dataset size: 200
Trained model in 0.0756 seconds
Made predictions in 0.0026 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0020 seconds.
F1 score for test set: 0.7801.


For Dataset size: 300
Trained model in 0.0832 seconds
Made predictions in 0.0034 seconds.
F1 score for training set: 0.9975.
Made predictions in 0.0023 seconds.
F1 score for test set: 0.8194.

GradientBoostingClassifier: 



For Dataset size: 100
Trained model in 0.0018 seconds
Made predictions in 0.0014 seconds.
F1 score for training set: 0.8323.
Made predictions in 0.0003 seconds.
F1 score for test set: 0.8188.


For Dataset size: 200
Trained model in 0.0030 seconds
Made predictions in 0.0003 seconds.
F1 score for training set: 0.8146.
Made predictions in 0.0002 seconds.
F1 

### Tabular Results
Edit the cell below to see how a table can be designed in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#tables). You can record your results from above in the tables provided.

** Classifer 1 - RandomForestClassifier  

| Training Set Size | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               | 0.0856                 |         0.0023        |      1.0000      |    0.7945      |
| 200               | 0.0756                 |         0.0030         |       1.0000      |    0.7801      |
| 300               | 0.0786               |         0.0033         |      0.9975      |      0.7917   |

** Classifer 2 - GradientBoostingClassifier 

| Training Set Size | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |   0.0021              |     0.0003             |    0.8214        |    0.8050     |     
| 200               |   0.0025                |     0.0002             |    0.8571        |    0.8112      |
| 300               |   0.0031                |     0.0003             |    0.8315        |    0.8053        |

** Classifer 3 - GaussianNB  

| Training Set Size | Prediction Time (train) | Prediction Time (test) | F1 Score (train) | F1 Score (test) |
| :---------------: | :---------------------: | :--------------------: | :--------------: | :-------------: |
| 100               |     0.0006              |         0.0003         |     0.6604    |     0.5524  |
| 200               |     0.0008             |         0.0003         |     0.7895      |     0.7097     |
| 300               |       0.0008         |       0.0003          |     0.7904         |     0.7460    |

## Choosing the Best Model
In this final section, you will choose from the three supervised learning models the *best* model to use on the student data. You will then perform a grid search optimization for the model over the entire training set (`X_train` and `y_train`) by tuning at least one parameter to improve upon the untuned model's F<sub>1</sub> score. 





### Question 3 - Chosing the Best Model
*Based on the experiments you performed earlier, in one to two paragraphs, explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?*

**Answer: **


The data set given has small number of training instances compared to the number of features. This might lead to models to overfitting. Ideally, we want a classifier that is correct as much as possible using least amount of space and time. Support Vector Machines is not a good choice beacuse of small number of data points.

The three models I have preselected are

•	Random Forest 
•	GradientBoostingClassifier
•	GaussianNB


Notes from using Random Forest experiment

    - Using Random Forests, we end up having small decision trees and it helps in preventing overfitting. 
    - The training and testing time of Random Forest is much higher than other two methdods
    - F1 score for training time is better than others but F1 scores for testing time is lower

Notes from using GradientBoostingClassifier experiment

    - GradientBoostingClassifier finds weak learners that perform better and better on the previously incorrect
      predictions. In this experiment, we did not run GradientBoostingClassifier many times.

- The F1 score for training and testing is significantly better than Random Forest experiment & slightly better than Gaussian NB 


Notes from using Gaussian Naïve Bayes experiment

    - Gaussian Naïve Bayes significantly faster than other models. 
    - F1 score for training and test is NOT better than GradientBoostingClassifier


In summary, 

    - The best overall performing model is GradientBoostingClassifier
    

### Question 4 - Model in Layman's Terms
*In one to two paragraphs, explain to the board of directors in layman's terms how the final model chosen is supposed to work. For example if you've chosen to use a decision tree or a support vector machine, how does the model go about making a prediction?*

**Answer: **

Here is the metaphor (little story) that I am trying to explain in layman terms.

Imagine that, one nice morning, stundetns in a high scholl AP History class need to read few books and answer for 20 questions. There are 100 students.  The goal is to answer those 20 questions with minimum error as possible.

The teacher divides the student in 20 teams. Each team have 5 students. The students in a team can have varying expertise. Some may be Seniors and others may Juniors or Freshman. They come with different backgrounds - meaning some might have taken similar courses etc.

The answers are in binary form : Yes or No.  The aim of the process is to figure out maximum correct answers  Any member can be a  part of more than 1 team.

The  process starts with a random guess of answers for all 20 questions. The process then forms a first team and then get the answers. Then it calculates error  ( = Actual - Predicted Answer). 

Next, the process builds a (new) team of 5 members, which reduces the error by maximum. Again, it calculates the  error. The second team has to reduce the error rate further.  Next team  doesn't trust its previous partner fully, so it assume that answers are  correct with x probability (learning rate). This process go on till 20 teams are build and answers are improved each time. 

In layman’s terms, Gradient Boosting is basically about "boosting" many weak predictive models into a strong one, in the form of ensemble of weak models. Here, a weak predict model can be any model that works just a little better than random guess. 

In our Student example, team expertiese corresponds to model, questions are training data set and no of steps (20) are decision trees. The student with many roles, participated in many teams is important varibale.

In AdaBoost, arguably the most popular boosting algorithm, weak models are trained in an adaptive way. In Gradient boosting, new base-learners is maximally correlated with the negative gradient of the loss function. The choice of the loss function is up to the researcher, with both a rich variety of loss functions derived so far and with the possibility of implementing new ones.





### Implementation: Model Tuning
Fine tune the chosen model. Use grid search (`GridSearchCV`) with at least one important parameter tuned with at least 3 different values. You will need to use the entire training set for this. In the code cell below, you will need to implement the following:
- Import [`sklearn.grid_search.gridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) and [`sklearn.metrics.make_scorer`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html).
- Create a dictionary of parameters you wish to tune for the chosen model.
 - Example: `parameters = {'parameter' : [list of values]}`.
- Initialize the classifier you've chosen and store it in `clf`.
- Create the F<sub>1</sub> scoring function using `make_scorer` and store it in `f1_scorer`.
 - Set the `pos_label` parameter to the correct value!
- Perform grid search on the classifier `clf` using `f1_scorer` as the scoring method, and store it in `grid_obj`.
- Fit the grid search object to the training data (`X_train`, `y_train`), and store it in `grid_obj`.

In [8]:
# TODO: Import 'gridSearchCV' and 'make_scorer'

from sklearn.grid_search import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer

from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer


parameters = {'learning_rate': [0.01, 0.05, 0.09, 0.1, 0.15, 0.25, 0.75, 1, 1.25, 3, 4, 5]}

# TODO: Initialize the classifier
clf = clf_B

pos_label="yes"

# TODO: Make an f1 scoring function using 'make_scorer' 
f1_scorer = make_scorer(f1_score, pos_label="yes")

# TODO: Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring= f1_scorer)

# TODO: Fit the grid search object to the training data and find the optimal parameters

X_train, X_test, y_train, y_test = split_shuffle(X_all, y_all, 300)

grid_obj=grid_obj.fit(X_train, y_train)

# Get the estimator
clf = grid_obj.best_estimator_

#print clf.estimator.get_params().keys()
# Report the final F1 score for training and testing after parameter tuning
print "Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train))
print "Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test))

Made predictions in 0.0003 seconds.
Tuned model has a training F1 score of 0.8072.
Made predictions in 0.0004 seconds.
Tuned model has a testing F1 score of 0.8050.


### Question 5 - Final F<sub>1</sub> Score
*What is the final model's F<sub>1</sub> score for training and testing? How does that score compare to the untuned model?*

**Answer: **

Made predictions in 0.0003 seconds.
Tuned model has a training F1 score of 0.8053
Made predictions in 0.0002 seconds.
Tuned model has a testing F1 score of 0.8258.

After tuning, you can find that F1 score for test is imporved noticably. For large data set, it will make a difference.



> **Note**: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to  
**File -> Download as -> HTML (.html)**. Include the finished document along with this notebook as your submission.