Boosting algorithms play a crucial role in managing the bias-variance trade-off. Unlike bagging algorithms, which mainly control for high variance in a model, boosting controls both aspects (bias and variance) and is often considered to be more effective.

Boosting is a sequential ensemble technique. It combines a set of weak learners to deliver improved prediction accuracy. At any step t, the observations are weighted based on the outcomes of the previous step t-1: observations predicted correctly are given a lower weight and misclassified ones are given a higher weight, so the next learner concentrates on the hard cases. This reweighting scheme is used for classification problems, while a similar technique is used for regression.
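
The reweighting idea can be sketched in a few lines of plain Python. This is a minimal illustration of one AdaBoost-style update for binary classification, not the scikit-learn implementation; the function name `reweight` and the toy data are assumptions made for the example.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style reweighting step for binary classification.

    weights: current observation weights (summing to 1)
    correct: booleans, True where the current weak learner was right
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error rate of the current weak learner
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    # Learner importance: larger when the weighted error is smaller
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink the weights of correct predictions, grow the weights of mistakes
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha

weights = [0.2] * 5                          # five observations, equal start
correct = [True, True, True, False, True]    # the learner misses the fourth one
weights, alpha = reweight(weights, correct)
print([round(w, 3) for w in weights])
```

After the update, the misclassified observation carries half of the total weight, so the next weak learner is pushed to get it right.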

# Boosting – Essential Tuning Parameters
Model complexity and over-fitting can be controlled by using correct values for three categories of parameters.

### 1. Tree structure (these affect each individual tree)

#### max_depth: 
    Maximum depth of the individual estimators. The best value depends on the interaction of the input variables.
    
    max_depth = 8 : Should be chosen (5-8) based on the number of observations and predictors
    
#### max_leaf_nodes:
    The maximum number of terminal nodes or leaves in a tree. Can be defined in place of max_depth. Since binary trees are created, a depth of 'n' would produce a maximum of 2^n leaves. If this is defined, GBM may ignore max_depth (older scikit-learn releases documented this behavior), so check the documentation for your installed version.

#### min_samples_leaf: 
    Defines the minimum samples (or observations) required in a terminal node or leaf.
    This helps ensure that each leaf is based on a sufficient number of samples. Used to control over-fitting.
    
    min_samples_leaf = 50 : Can be selected based on intuition. It is used only to prevent over-fitting; choose a small value for imbalanced classes, because the regions where the minority class dominates are small.
    
#### min_weight_fraction_leaf:
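    Similar to min_samples_leaf but defined as a fraction of the total sum of sample weights instead of an integer count. When no sample weights are given, this is the same as a fraction of the total number of observations.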

#### min_samples_split:
    Defines the minimum number of samples (or observations) required in a node for it to be considered for splitting. Used to control over-fitting. Higher values prevent a model from learning relations that might be highly specific to the particular sample selected for a tree. Values that are too high can lead to under-fitting, so it should be tuned using CV.
    
    min_samples_split = 500 : This should be ~0.5-1% of the total number of observations. For an imbalanced class problem, take a small value from the range.
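
The arithmetic behind two of these rules of thumb is easy to check. The sketch below is only an illustration; the helper names and the dataset size of 50,000 rows are assumptions chosen so that 1% of the rows matches the value of 500 quoted above.

```python
def max_leaves(depth):
    """Upper bound on the number of leaves of a binary tree of a given depth."""
    return 2 ** depth

def min_samples_split_range(n_rows, low=0.005, high=0.01):
    """Rule-of-thumb range (0.5% to 1% of the rows) for min_samples_split."""
    return round(n_rows * low), round(n_rows * high)

n_rows = 50_000  # hypothetical dataset size
print(max_leaves(8))                    # leaves possible at max_depth = 8
print(min_samples_split_range(n_rows))  # candidate min_samples_split values
```

A tree of depth 8 can therefore have up to 256 leaves, which is why max_leaf_nodes can act as an alternative cap on tree size.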


    
    
### 2. Regularization parameters (these affect the boosting operations in the model)

#### learning_rate: 
    This controls the magnitude of change each estimator contributes. A lower learning rate generally gives a more robust model but requires a higher n_estimators (that is the trade-off). When you start tuning, choose a relatively high learning rate: the default value of 0.1 generally works, and somewhere between 0.05 and 0.2 should work for different problems.

#### n_estimators: 
    This is the number of weak learners to be built. Determine the optimum number of trees for the chosen learning rate; at the starting learning rate this should be around 40-70. Remember to choose a value at which your system can work fairly fast.

    Once the other parameters are tuned, lower the learning rate and increase the estimators proportionally to get more robust models.
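
The proportional relationship between learning_rate and n_estimators can be seen in an idealized model. The sketch below assumes, purely for illustration, that every weak learner fits the current residual perfectly, so each round removes a learning_rate share of what is left; real trees are weaker, so actual counts will differ. The function name `rounds_needed` is hypothetical.

```python
def rounds_needed(learning_rate, tolerance=0.01):
    """Boosting rounds until the remaining residual falls below tolerance.

    Idealized assumption: each round removes a `learning_rate` share
    of the current residual.
    """
    residual, rounds = 1.0, 0
    while residual > tolerance:
        residual *= (1.0 - learning_rate)
        rounds += 1
    return rounds

for lr in (0.2, 0.1, 0.05):
    print(lr, rounds_needed(lr))
```

Under this assumption, halving the learning rate roughly doubles the number of rounds needed to reach the same fit.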

#### subsample: 
    The fraction of samples to be used for fitting individual models (default=1). Typically 0.8 (80%) is used to introduce random selection of samples, which in turn increases robustness against over-fitting.
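
What subsample does per tree can be sketched with the standard library alone. This is an illustration of the idea (each tree sees a fresh random 80% of the rows, drawn without replacement), not scikit-learn's internal code; `draw_subsample` is a hypothetical helper.

```python
import random

def draw_subsample(n_rows, subsample=0.8, seed=None):
    """Row indices used to fit one tree, drawn without replacement."""
    rng = random.Random(seed)
    k = int(round(n_rows * subsample))
    return sorted(rng.sample(range(n_rows), k))

n_rows = 10
for tree in range(3):
    # Each tree gets its own random subset of the rows
    print(tree, draw_subsample(n_rows, subsample=0.8, seed=tree))
```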

### 3. Miscellaneous Parameters: Other parameters for overall functioning.

#### loss:
    It refers to the loss function to be minimized by the boosting stages. Generally the default value works fine. Other values should be chosen only if you understand their impact on the model.
    
#### init:
    This can be used if we have built another model whose outcome is to be used as the initial estimate for GBM.
    
#### random_state:
    The random number seed, so that the same random numbers are generated every time. This is important for parameter tuning. If we don't fix the seed, we'll get different outcomes for subsequent runs with the same parameters and it becomes difficult to compare models.
    
#### verbose:
    The type of output to be printed when the model fits. The different values can be:
    0: no output generated (default)
    1: output generated for trees in certain intervals
    >1: output generated for all trees

#### warm_start:
    Using this, we can fit additional trees on top of the previous fit of a model. It can save a lot of time, and you should explore this option for advanced applications.
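
A minimal sketch of warm_start, assuming scikit-learn is installed. The data is a synthetic stand-in generated with make_classification, not the diabetes data used in the cells that follow, and the tree counts are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (an assumption made for this example)
X, y = make_classification(n_samples=300, n_features=8, random_state=10)

gbm = GradientBoostingClassifier(n_estimators=20, warm_start=True, random_state=10)
gbm.fit(X, y)                        # builds the first 20 trees
trees_first = gbm.estimators_.shape[0]

gbm.set_params(n_estimators=40)      # ask for 20 more trees
gbm.fit(X, y)                        # keeps the existing 20 and adds 20 new ones
trees_second = gbm.estimators_.shape[0]
print(trees_first, trees_second)
```

Because the second fit reuses the trees already built, only the additional trees have to be trained.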
    


    


In [83]:
#Gradient Boosting

#Import libraries:
import pandas as pd
import numpy as np
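import matplotlib.pyplot as plt

from sklearn.ensemble import GradientBoostingClassifier  #GBM algorithm
from sklearn import metrics   #Additional scikit-learn functions
#sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
#model_selection replaces both (aliased so the cross_validation.* calls keep working)
from sklearn import model_selection as cross_validation
from sklearn.model_selection import GridSearchCV   #Performing grid search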


import os

os.chdir('C:\\Analytics\\Personal\\Machine Learning\\Training\\R\\Dataset')


In [None]:
# read the data in
df = pd.read_csv("diabetes.csv")

In [None]:
#name of the target column; modelfit selects the predictor columns itself

#X = df.iloc[:,:8] # independent variables
y = 'Class' # name of the dependent variable column

In [88]:
def modelfit(alg, dtrain, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
    #Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Class'])
        
    #Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]
    
    #Perform cross-validation:
    if performCV:
        cv_score = cross_validation.cross_val_score(alg, dtrain[predictors], dtrain['Class'], 
                                                    cv=cv_folds, scoring='roc_auc')
    
    #Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain['Class'].values, dtrain_predictions))
    print ("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Class'], dtrain_predprob))
    
    if performCV:
        print ("CV Score : Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g" % (np.mean(cv_score),np.std(cv_score),np.min(cv_score),np.max(cv_score)))

In [89]:
#Choose all predictors except target
predictors = df.columns.values[:8]
gbm0 = GradientBoostingClassifier(random_state=10)
modelfit(gbm0, df, predictors)


Model Report
Accuracy : 0.9062
AUC Score (Train): 0.971530
CV Score : Mean - 0.8259696 | Std - 0.02850008 | Min - 0.7925 | Max - 0.8775472


In [104]:
predictors = df.columns.values[:8]
gbm01 = GradientBoostingClassifier(learning_rate=0.1,
                                   n_estimators=60,
                                   max_depth=9,
                                   subsample=0.8,
                                   random_state=10)
modelfit(gbm01, df, predictors)                       


Model Report
Accuracy : 1
AUC Score (Train): 1.000000
CV Score : Mean - 0.8103424 | Std - 0.03942149 | Min - 0.7509259 | Max - 0.8735849
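
The second model reaches a perfect training score while its CV score drops, which suggests over-fitting. One way to search for better settings is the GridSearchCV class imported earlier. The following is a minimal sketch assuming scikit-learn 0.20 or later; it uses synthetic stand-in data so that it runs on its own, and the small grid is hypothetical rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption); swap in df[predictors], df['Class']
X, y = make_classification(n_samples=300, n_features=8, random_state=10)

# Small hypothetical grid over two of the parameters discussed above
param_grid = {
    'n_estimators': [20, 40, 60],
    'max_depth': [2, 3, 5],
}
search = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1, subsample=0.8,
                                         random_state=10),
    param_grid=param_grid,
    scoring='roc_auc',   # same metric as the CV score in modelfit
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
print("Best CV AUC: %.4f" % search.best_score_)
```

The best parameters are chosen by cross-validated AUC rather than training accuracy, which guards against picking a model that merely memorizes the training data.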


In [37]:
import pandas as pd
import numpy as np

# Boosted Decision Trees (AdaBoost) for Classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn import metrics

import os

os.chdir('C:\\Analytics\\Personal\\Machine Learning\\Training\\R\\Dataset')

In [4]:
# read the data in
df = pd.read_csv("diabetes.csv")

In [5]:
#separate the independent variables from the dependent variable

X = df.iloc[:,:8] # independent variables
y = df['Class'] # dependent variables

In [52]:
#Normalize
X = preprocessing.StandardScaler().fit_transform(X)

In [57]:
# evaluate the model by splitting into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)



In [58]:
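# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
import sklearn.model_selection as cross_validation
# folds are now built from the labels passed to cross_val_score; no shuffling, so no random_state is needed
kfold = cross_validation.StratifiedKFold(n_splits=5)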

In [59]:
num_trees = 100
# Decision tree with 5-fold cross-validation
# let's restrict max_depth to 1 (a decision stump) to have more impure leaves
clf_DT = DecisionTreeClassifier(max_depth=1, random_state=2017).fit(X_train,y_train)
results = cross_validation.cross_val_score(clf_DT, X_train,y_train,cv=kfold)
print ("Decision Tree (stand alone) - Train : ", results.mean())
print ("Decision Tree (stand alone) - Test : ", metrics.accuracy_score(clf_DT.predict(X_test), y_test))

Decision Tree (stand alone) - Train :  0.698752327026
Decision Tree (stand alone) - Test :  0.720779220779


In [60]:
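# Using adaptive boosting with 100 iterations
# (base estimator passed positionally: the keyword was renamed from base_estimator to estimator in scikit-learn 1.2)
clf_DT_Boost = AdaBoostClassifier(clf_DT, n_estimators=num_trees, 
                                  learning_rate=0.1, random_state=2017).fit(X_train,y_train)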
results = cross_validation.cross_val_score(clf_DT_Boost, X_train, y_train,
cv=kfold)
print ("Decision Tree (AdaBoosting) - Train : ", results.mean())
print ("Decision Tree (AdaBoosting) - Test : ", metrics.accuracy_score(clf_DT_Boost.predict(X_test), y_test))

Decision Tree (AdaBoosting) - Train :  0.755730181046
Decision Tree (AdaBoosting) - Test :  0.798701298701


# XGBoost (eXtreme Gradient Boosting):

It is an extended, more regularized version of the gradient boosting algorithm, built in C++ as part of the Distributed (Deep) Machine Learning Community (DMLC). It is one of the best-performing scalable machine learning algorithms and has played a major role in winning Kaggle solutions.

Some of the key advantages of the xgboost algorithm are these:
1. It implements parallel processing.
2. It has a built-in routine to handle missing values: the user can specify the value that marks missing observations (such as -1 or -999) and pass it as a parameter.
3. It splits each tree up to the specified maximum depth and then prunes back splits that give no positive gain, unlike Gradient Boosting, which stops splitting a node as soon as it encounters a negative loss in the split.

XGBoost has a bundle of parameters, and at a high level we can group them into three categories. Let's look at the most important within these categories.
1. General Parameters
   #nthread - Number of parallel threads; if not given a value, all cores will be used.
   #booster - The type of model to run, with gbtree (tree-based model) being the default; 'gblinear' is used for linear models.
2. Boosting Parameters
   #eta - The learning rate or step size shrinkage used to prevent over-fitting; the default is 0.3 and it can range between 0 and 1.
   #max_depth - Maximum depth of a tree, with a default of 6.
   #min_child_weight - Minimum sum of weights of all observations required in a child. Start with 1/square root of the event rate.
   #colsample_bytree - Fraction of columns to be randomly sampled for each tree, with a default value of 1.
   #subsample - Fraction of observations to be randomly sampled for each tree, with a default value of 1. Lowering this value makes the algorithm more conservative to avoid over-fitting.
   #lambda - L2 regularization term on weights, with a default value of 1.
   #alpha - L1 regularization term on weights, with a default value of 0.
   
3. Task Parameters
   #objective - The loss function to be minimized, with a default value of 'reg:linear' (renamed 'reg:squarederror' in newer releases). For binary classification it should be 'binary:logistic'; for multiclass, use 'multi:softprob' to get the probability values or 'multi:softmax' to get the predicted class. For multiclass, num_class (the number of unique classes) must also be specified.
   #eval_metric - Metric to be used for validating model performance.
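
As a rough illustration, the parameters above can be collected into a plain Python dictionary, which is the form the native XGBoost training API (`xgboost.train`) accepts. The sketch below only builds the dictionary and does not import xgboost; the event rate of 0.35 is a hypothetical value, and the choices of 'binary:logistic' and 'auc' assume a binary classification task.

```python
import math

event_rate = 0.35  # hypothetical share of positive cases in the training data

params = {
    # General parameters (nthread is left unset so that all cores are used)
    'booster': 'gbtree',
    # Boosting parameters, set to the XGBoost defaults unless noted
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1 / math.sqrt(event_rate),  # rule of thumb from the text
    'colsample_bytree': 1.0,
    'subsample': 1.0,
    'lambda': 1.0,
    'alpha': 0.0,
    # Task parameters (assuming binary classification)
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
}
print(params)
```

These values are starting points only; as with GBM, eta, max_depth, subsample and the regularization terms are usually tuned with cross-validation.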