# Tree-Based-Models

\begin{exercise}
In this home quiz you must do the following:

1) Study the 5 presentations about three models and their performance. I will ask each group to make a presentation on each part. You can enhance the presentation using web resources or the material that I will upload today on Piazza.

2) Each part comes with code. You must select an appropriate dataset to run it, and you will use the code to support your presentations.

\end{exercise}


# PART 1 - CLASSIFICATION AND REGRESSION TREES (CART)
- Set of supervised learning models used for problems involving classification and regression

## Classification Tree
- Sequence of if-else questions about individual features
- Objective: Infer class labels
- Able to capture non-linear relationships between features and labels
- Don't require feature scaling; for example, no standardization is needed.

## Decision regions 
- A decision region is the region of the feature space where all instances are assigned to one class label.
- For example, if the outcome has two classes, Pass or Fail, there are two decision regions: a Pass region and a Fail region.

## Decision boundary
- It is the separating boundary between two decision regions.
- In the example above, the decision boundary is at 33% (the passing mark).

## Logistic regression vs classification tree
- A classification tree divides the feature space into rectangular regions.
- In contrast, a linear model such as logistic regression produces only a single linear decision boundary dividing the feature space into two decision regions.
- In other words, the decision boundary produced by logistic regression is linear (a line), while a classification tree divides the feature space into rectangular regions (boxes, rather than a single line, separate the two classes).

## Building block of Decision Tree 
- Root: No parent node, question giving rise to two children nodes.
- Internal node: One parent node, question giving rise to two children nodes.
- Leaf: One parent node, no children nodes -> Prediction.

## Classification-Tree Learning (Working)
- Nodes are grown recursively (each split depends on the node above it).
- At each node, split the data based on:
1. The feature f and split-point (sp) that maximize the information gain, IG(node).
2. If IG(node) = 0, declare the node a leaf.

## Information Gain
- In information theory, IG is a synonym for Kullback–Leibler (KL) divergence.
- It is the amount of information gained about a random variable or signal from observing another random variable.
- In the context of decision trees, the term is sometimes used synonymously with mutual information, which is the conditional expected value of the KL divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other one.

## Criteria to measure the impurity of a node I(node):
1. Variance (Regression) [The variance reduction of a node N is defined as the total reduction of the variance of the target variable due to the split at this node]
2. Gini impurity (Classification) [A measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset]
3. Entropy (Classification) [Also a measure of impurity: it is zero for a pure node and largest when the classes are evenly mixed. Information entropy is the average rate at which information is produced by a stochastic source of data]

Note 
- Most of the time, the gini index and entropy lead to the same results.
- The gini index is slightly faster to compute and is the default criterion used in the DecisionTreeClassifier model of scikit-learn
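
The impurity criteria above can be made concrete with a small example. The sketch below is a plain-Python illustration, not scikit-learn's implementation; the function names `gini`, `entropy` and `information_gain` and the toy Pass/Fail labels are assumptions made for this example. Information gain is computed as the parent's impurity minus the size-weighted impurity of the two children.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    children = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - children


# Toy node with 5 'pass' and 5 'fail' labels, split into two children (made-up data)
parent = ["pass"] * 5 + ["fail"] * 5
left = ["pass"] * 4 + ["fail"]
right = ["pass"] + ["fail"] * 4

print("Gini(parent):    {:.3f}".format(gini(parent)))
print("Entropy(parent): {:.3f}".format(entropy(parent)))
print("IG (gini):       {:.3f}".format(information_gain(parent, left, right)))
print("IG (entropy):    {:.3f}".format(information_gain(parent, left, right, entropy)))
```

Both criteria are zero for a pure node and largest for an evenly mixed node, which is why they usually rank candidate splits in a similar way.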

## Regression Tree
- The impurity of a node is measured through the MSE (mean squared error) of the targets in that node.
- Information gain is therefore computed from the MSE of the target variable: a split is good when it reduces the MSE.
- A regression tree tries to find the splits that produce leaves in which the target values are, on average, as close as possible to the mean value of the labels in that leaf.
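
To illustrate how a regression tree picks a split, here is a minimal sketch for a single feature. It is not how scikit-learn implements the search; the helper names `mse` and `best_split` and the toy data are assumptions for this example. Every midpoint between consecutive feature values is tried, and the one with the lowest size-weighted MSE of the two children is kept.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(x, y):
    """Return (split_point, weighted_mse) for the best split of a 1-D feature."""
    pairs = sorted(zip(x, y))
    best_point, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical feature values
        split_point = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if score < best_score:
            best_point, best_score = split_point, score
    return best_point, best_score


# Made-up data: two clearly separated groups of target values
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]

split_point, weighted_mse = best_split(x, y)
print("Best split point:", split_point)
print("MSE before split: {:.2f}, after split: {:.2f}".format(mse(y), weighted_mse))
```

Each leaf then predicts the mean of the targets that fall into it.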

# PART 2 - BIAS VARIANCE TRADEOFF

## Supervised Learning
- y = f(x), where f is an unknown function
- Our model's output will be an approximation of that function
- But the data may contain various types of error, such as noise

## Goals of Supervised Learning
- Find a model f1 that best approximates f: f1 ≈ f
- f1 can be Logistic Regression, Decision Tree, Neural Network ...
- Discard noise as much as possible.
- End goal: f1 should achieve a low predictive error on unseen datasets.

## Difficulties in Approximating f
- Overfitting: f1(x) fits the training set noise.
- Underfitting: f1 is not flexible enough to approximate f.

## Generalization error 
- Generalization error of f1: does f1 generalize well on unseen data?
- It can be decomposed as follows: generalization error of f1 = bias² + variance + irreducible error

## Bias
- Bias: error term that tells you, on average, how much f1 ≠ f.
- High bias leads to underfitting

## Variance
- Variance: tells you how much f1 is inconsistent over different training sets.
- High variance leads to overfitting

- Decreasing bias typically increases variance, and vice versa.
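
For squared error, the decomposition of the generalization error can be written out explicitly. The form below is the standard one under the assumption (not stated in these notes) that the labels follow y = f(x) + ε with zero-mean noise of variance σ², and that the expectation is taken over training sets and noise:

```latex
\mathbb{E}\big[(y - f_1(x))^2\big]
  = \underbrace{\big(\mathbb{E}[f_1(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(f_1(x) - \mathbb{E}[f_1(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The first term measures how far the average model is from f, the second how much f1 changes from one training set to another, and the third is the noise that no model can remove.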

## Model Complexity
- Model complexity: sets the flexibility of f1.
- Examples: maximum tree depth, minimum samples per leaf, etc.

## Bias Variance Tradeoff 
- It is the problem of trying to simultaneously minimize these two sources of error, both of which prevent supervised learning algorithms from generalizing beyond their training set.

## Estimating the Generalization Error, Solution:
- Split the data to training and test sets 
- Fit f1 to the training set
- Evaluate the error of f1 on the unseen test set
- Generalization error of f1 ≈ test set error of f1.

## Better Model Evaluation with Cross-Validation
- The test set should not be touched until we are confident about f1's performance.
- Evaluating f1 on the training set gives a biased estimate, because f1 has already seen all training points.
- Solution → K-fold cross-validation (CV)
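
As a rough illustration of what K-fold CV does with the training set, the sketch below splits sample indices into K folds; each fold serves once as the validation set while the remaining folds are used for training. The helper `k_fold_indices` is a hypothetical name for this example (in practice scikit-learn's `cross_val_score` handles this), and no shuffling is done here.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, validation_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=5)
for fold_number, (train, validation) in enumerate(folds, start=1):
    print("Fold {}: train={} validation={}".format(fold_number, train, validation))
```

The CV error is the mean of the K validation errors, so every training point is used for validation exactly once.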

## Diagnose Variance Problems
- If f1 suffers from high variance: CV error of f1 > training set error of f1.
- f1 is said to overfit the training set. To remedy overfitting:
- decrease model complexity
- for ex: decrease max depth, increase min samples per leaf
- gather more data

## Diagnose Bias Problems
- If f1 suffers from high bias: CV error of f1 ≈ training set error of f1 >> desired error.
- f1 is said to underfit the training set. To remedy underfitting:
- increase model complexity
- for ex: increase max depth, decrease min samples per leaf
- gather more relevant features

## Limitations of CARTs
- Classification: can only produce orthogonal decision boundaries.
- Sensitive to small variations in the training set.
- High variance: unconstrained CARTs may overfit the training set.
- Solution: ensemble learning.

## Ensemble Learning
- Train different models on the same dataset.
- Let each model make its predictions.
- Meta-model: aggregates predictions of individual models.
- Final prediction: more robust and less prone to errors.
- Best results: models are skillful in different ways.

## Steps in Ensemble learning 
1. The training set is fed to different classifiers such as Decision Tree, Logistic Regression, KNN, etc.
2. Each classifier learns its parameters and makes predictions.
3. The predictions are fed into another model (the meta-model), which makes the final prediction.
4. That final model is known as the ensemble model.
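
One simple way for the meta-model to aggregate predictions is hard (majority) voting. The sketch below assumes three already-trained classifiers whose test-set predictions are given as lists; the function name `majority_vote` and the prediction values are made up for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model prediction lists into one list by majority vote."""
    # zip(*predictions) groups the predictions of all models for each instance
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical predictions of three classifiers on four test instances
predictions = [
    [1, 0, 1, 1],  # e.g. decision tree
    [1, 1, 0, 1],  # e.g. logistic regression
    [0, 0, 1, 1],  # e.g. KNN
]

final_prediction = majority_vote(predictions)
print(final_prediction)  # [1, 0, 1, 1]
```

With an odd number of binary classifiers there are no ties; with an even number, a tie-breaking rule would be needed.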

# PART 3 - BAGGING AND RANDOM FOREST

## Bagging
- Bagging is an ensemble method that trains the same algorithm many times on different subsets sampled from the training data.
- Only one algorithm is used for all models in the ensemble.
- However, no model is trained on the entire training set.
- Instead, each model is trained on a different subset of the data.
- Bagging stands for Bootstrap Aggregation.
- It uses a technique known as the bootstrap.
- It reduces the variance of the individual models in the ensemble.
- For example, suppose a training dataset contains 3 instances: a, b, c.
- Subsets are created by sampling with replacement, for example aaa, aab, aba, acc, aca, etc.
- The models are trained on these subsets.

## Bagging Classification
- Aggregates predictions by majority voting (the final prediction is selected by voting).
- BaggingClassifier in scikit-learn.

## Bagging Regression
- Aggregates predictions through averaging (the final prediction is the average).
- BaggingRegressor in scikit-learn.

## Bagging limitations
- Some instances may be sampled several times for one model,
- Other instances may not be sampled at all.

## Out Of Bag (OOB) instances
- On average, for each model, 63% of the training instances are sampled.
- The remaining 37% constitute the OOB instances.
- Since OOB instances are not seen by the model during training, they can be used to estimate the performance of the model without the need for cross-validation.
- This technique is known as OOB evaluation.
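
The 63% / 37% figures can be checked with a quick simulation. The sketch below draws one bootstrap sample (sampling with replacement, same size as the training set) and counts how many distinct instances were drawn; the helper name `bootstrap_indices`, the seed and the training-set size are arbitrary choices for this example.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices from range(n) with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(0)  # fixed seed for reproducibility
n = 1000                # hypothetical training-set size

sample = bootstrap_indices(n, rng)
in_bag = set(sample)                              # distinct instances that were sampled
oob = [i for i in range(n) if i not in in_bag]    # instances never sampled

# Probability that a given instance appears at least once in the sample
expected_in_bag = 1 - (1 - 1 / n) ** n

print("Sampled fraction: {:.3f}".format(len(in_bag) / n))
print("OOB fraction:     {:.3f}".format(len(oob) / n))
print("Theoretical sampled fraction: {:.3f}".format(expected_in_bag))
```

The theoretical value approaches 1 - 1/e ≈ 0.632 as the training set grows, which is where the 63% comes from.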

## Random Forest
- Another ensemble model
- Base estimator: Decision Tree
- Each estimator is trained on a different bootstrap sample having the same size as the training set
- RF introduces further randomization in the training of individual trees
- d features are sampled at each node without replacement (d < total number of features)
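
The extra randomization can be pictured as follows: each time a node is split, only a random subset of d features is considered. The sketch below is an illustration, not scikit-learn's code; the helper `sample_features`, the feature names, and the choice d = √(number of features) (a common default for classification) are assumptions.

```python
import math
import random


def sample_features(feature_names, rng):
    """Pick d = floor(sqrt(total)) candidate features without replacement."""
    d = max(1, int(math.sqrt(len(feature_names))))
    return rng.sample(feature_names, d)


rng = random.Random(1)
features = ["f{}".format(i) for i in range(9)]  # 9 hypothetical features

# Each node draws its own candidate subset before searching for the best split
for node in range(3):
    print("Node {} considers: {}".format(node, sample_features(features, rng)))
```

Because different nodes and trees see different candidate features, the trees in the forest become less correlated with each other.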

## Random Forests Classification:
- Aggregates predictions by majority voting
- RandomForestClassifier in scikit-learn

## Random Forests Regression:
- Aggregates predictions through averaging
- RandomForestRegressor in scikit-learn

## Feature Importance
- Tree-based methods enable measuring the importance of each feature in prediction.
- In sklearn:
- importance reflects how much the tree nodes use a particular feature (weighted average) to reduce impurity
- accessed using the attribute feature_importances_

# PART 4 - BOOSTING

- Boosting refers to an ensemble method in which several models are trained sequentially with each model learning from the errors of its predecessors. 
- Boosting: Ensemble method combining several weak learners to form a strong learner.
- Weak learner: Model doing slightly better than random guessing.
- Example of weak learner: Decision stump (CART whose maximum depth is 1).
- Train an ensemble of predictors sequentially.
- Each predictor tries to correct its predecessor.
- Most popular boosting methods: AdaBoost, Gradient Boosting.

## Adaboost
- Stands for Adaptive Boosting.
- Each predictor pays more attention to the instances wrongly predicted by its predecessor.
- Achieved by changing the weights of training instances.
- Each predictor is assigned a coefficient α.
- α depends on the predictor's training error
- Learning rate: 0 < η ≤ 1. It is used to shrink the coefficient α. There is a tradeoff between η and the number of estimators.
- A smaller value of η should be compensated by a larger number of estimators.
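
To make the reweighting idea concrete, here is a sketch of one round of the classic binary AdaBoost update. This is an assumption about the variant, since these notes do not give formulas, and scikit-learn's AdaBoostClassifier uses its own multi-class formulation. The coefficient is α = η · ½ · ln((1 − err) / err), misclassified instances are multiplied by e^α, correctly classified ones by e^(−α), and the weights are then normalized. The function name `adaboost_round` and the toy labels are made up.

```python
from math import exp, log


def adaboost_round(weights, y_true, y_pred, eta=1.0):
    """Return (alpha, new_weights) after one boosting round.

    Assumes the weighted error is strictly between 0 and 1.
    """
    wrong = [t != p for t, p in zip(y_true, y_pred)]
    err = sum(w for w, is_wrong in zip(weights, wrong) if is_wrong) / sum(weights)
    alpha = eta * 0.5 * log((1 - err) / err)
    # Increase the weight of mistakes, decrease the weight of correct predictions
    updated = [w * exp(alpha if is_wrong else -alpha) for w, is_wrong in zip(weights, wrong)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5            # start with uniform weights
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, 1, 1]      # the predictor gets instance 3 wrong

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print("alpha = {:.3f}".format(alpha))
print("new weights:", [round(w, 3) for w in new_weights])
```

After this round the misclassified instance carries half of the total weight, so the next predictor is pushed to get it right. A smaller η shrinks α and therefore the size of each reweighting step.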

## AdaBoost Classification
- Weighted majority voting.
- In sklearn: AdaBoostClassifier.

## AdaBoost Regression
- Weighted average.
- In sklearn: AdaBoostRegressor.

## Gradient Boosted Trees
- Sequential correction of predecessor's errors.
- Does not tweak the weights of training instances.
- Each predictor is trained using its predecessor's residual errors as labels.
- Gradient Boosted Trees: a CART is used as a base learner.

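## Gradient Boosted Regression:
- $y_{pred} = y_1 + \eta r_1 + \dots + \eta r_N$, where $y_1$ is the prediction of the first tree, $r_i$ is the residual predicted by the i-th subsequent tree, and η is the learning rate
- In sklearn: GradientBoostingRegressor.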
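
The residual-fitting loop can be sketched from scratch for a single feature. This is a simplified illustration, not scikit-learn's GradientBoostingRegressor: the helpers `fit_stump` and `gradient_boost` and the toy data are assumptions, and the sketch starts from the mean of the labels rather than from a first tree. Each round fits a decision stump to the current residuals and adds η times its prediction to the ensemble.

```python
def fit_stump(x, targets):
    """Fit a depth-1 regression tree on a 1-D feature; return a predict function."""
    best = None
    unique_x = sorted(set(x))
    for low, high in zip(unique_x, unique_x[1:]):
        split = (low + high) / 2
        left = [t for xi, t in zip(x, targets) if xi <= split]
        right = [t for xi, t in zip(x, targets) if xi > split]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= split else right_mean


def gradient_boost(x, y, n_estimators=50, eta=0.1):
    """Return a predict function built from stumps fitted to successive residuals."""
    initial = sum(y) / len(y)          # start from the mean label
    predictions = [initial] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + eta * stump(xi) for xi, pi in zip(x, predictions)]
    return lambda xi: initial + eta * sum(stump(xi) for stump in stumps)


def mean_squared_error(y_true, y_hat):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / len(y_true)


# Made-up 1-D regression data
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.0, 5.1]

model = gradient_boost(x, y)
mse_before = mean_squared_error(y, [sum(y) / len(y)] * len(y))
mse_after = mean_squared_error(y, [model(xi) for xi in x])
print("Training MSE with mean only: {:.3f}".format(mse_before))
print("Training MSE after boosting: {:.3f}".format(mse_after))
```

Lowering `eta` makes each correction smaller, so more estimators are needed to reach the same training error.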

## Gradient Boosted Classification:
- In sklearn: GradientBoostingClassifier.

## Gradient Boosting: Cons
- GB involves an exhaustive search procedure.
- Each CART is trained to find the best split points and features.
- May lead to CARTs using the same split points and maybe the same features.

## Stochastic Gradient Boosting
- Each tree is trained on a random subset of rows of the training data.
- The sampled instances (40%-80% of the training set) are sampled without replacement.
- Features are sampled (without replacement) when choosing split points.
- Result: further ensemble diversity.
- Effect: adding further variance to the ensemble of trees.

# PART 5 - MODEL TUNING

- The hyperparameters of a machine learning model are parameters that are not learned from data.
- They should be set prior to fitting the model to the training set.

## Parameters
- learned from data
- CART example: split-point of a node, split-feature of a node, ...

## Hyperparameters
- not learned from data, set prior to training
- CART example: max_depth , min_samples_leaf , splitting criterion ...

## What is hyperparameter tuning?
- Problem: search for a set of optimal hyperparameters for a learning algorithm.
- Solution: find a set of optimal hyperparameters that results in an optimal model.
- Optimal model: yields an optimal score.
- Score: in sklearn defaults to accuracy (classification) and R-squared (regression).
- Cross validation is used to estimate the generalization performance.

## Why tune hyperparameters?
- In sklearn, a model's default hyperparameters are not optimal for all problems.
- Hyperparameters should be tuned to obtain the best model performance.

## Approaches to hyperparameter tuning
- Grid Search
- Random Search
- Bayesian Optimization
- Genetic Algorithms, etc.

## Grid search cross validation
- Manually set a grid of discrete hyperparameter values.
- Set a metric for scoring model performance.
- Search exhaustively through the grid.
- For each set of hyperparameters, evaluate the model's CV score.
- The optimal hyperparameters are those of the model achieving the best CV score.

## Grid search cross validation: example
- Hyperparameters grids:
- max_depth = {2,3,4},
- min_samples_leaf = {0.05, 0.1}
- hyperparameter space = { (2,0.05) , (2,0.1) , (3,0.05), ... }
- CV scores = { score , ... }
- optimal hyperparameters = set of hyperparameters corresponding to the best CV score.
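
The hyperparameter space in this example is the Cartesian product of the two grids, which can be enumerated in a few lines. In the sketch below, `cv_score` is a made-up stand-in for a real cross-validation score (GridSearchCV would compute that by fitting the model on each fold); only the enumeration and the selection of the best combination are meant to be illustrative.

```python
from itertools import product

grid = {
    "max_depth": [2, 3, 4],
    "min_samples_leaf": [0.05, 0.1],
}

# Every combination of one value per hyperparameter: 3 x 2 = 6 candidates
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]


def cv_score(params):
    """Hypothetical score function standing in for a real CV score."""
    return -abs(params["max_depth"] - 3) - params["min_samples_leaf"]


scores = [cv_score(params) for params in combinations]
best_params = max(combinations, key=cv_score)

for params, score in zip(combinations, scores):
    print(params, "-> score {:.2f}".format(score))
print("Best hyperparameters:", best_params)
```

The number of candidates grows multiplicatively with each added hyperparameter, and each candidate is fitted once per CV fold, which is why grid search becomes expensive quickly.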

## Tuning a RF's Hyperparameters
- Tuning is expensive

## Hyperparameter tuning:
- computationally expensive,
- sometimes leads to very slight improvement,
- Weigh the impact of tuning on the whole project.

# Part 1 - CLASSIFICATION AND REGRESSION TREES (CART)

In [None]:
# CLASSIFICATION
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import train_test_split
from sklearn.model_selection import train_test_split

# Import accuracy_score
from sklearn.metrics import accuracy_score

# Split dataset into 80% train, 20% test
X_train, X_test, y_train, y_test= train_test_split(X, y,test_size=0.2,stratify=y,random_state=1)

# Instantiate dt with a maximum depth of 2
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

# Alternative: set 'criterion' to 'gini' (this replaces the previous dt)
dt = DecisionTreeClassifier(criterion='gini', random_state=1)

# Alternative: set 'entropy' as the information criterion
# (this last instance is the one that is fitted below)
dt = DecisionTreeClassifier(criterion='entropy', max_depth=8, random_state=1)

# Most of the time, the gini index and entropy lead to the same results.
# The gini index is slightly faster to compute and is the default criterion used in the DecisionTreeClassifier model of scikit-learn

# Fit dt to the training set
dt.fit(X_train,y_train)

# Predict test set labels
y_pred = dt.predict(X_test)

# Evaluate test-set accuracy
acc = accuracy_score(y_test, y_pred)
print("Test set accuracy: {:.2f}".format(acc))

# REGRESSION
# Import DecisionTreeRegressor
from sklearn.tree import DecisionTreeRegressor

# Import train_test_split
from sklearn.model_selection import train_test_split

# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

# Split data into 80% train and 20% test
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=3)

# Instantiate a DecisionTreeRegressor 'dt'
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.1, random_state=3)
# 0.1 implies that each leaf must contain at least 10% of the training data

# Fit 'dt' to the training-set
dt.fit(X_train, y_train)

# Predict test-set labels
y_pred = dt.predict(X_test)

# Compute test-set MSE
mse_dt = MSE(y_test, y_pred)

# Compute test-set RMSE
rmse_dt = mse_dt**(1/2)

# Print rmse_dt
print(rmse_dt)
print("Test set RMSE of dt: {:.2f}".format(rmse_dt))


# Part 2 - Bias Variance Tradeoff

In [None]:
#K-Fold CV in regression

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score

# Set seed for reproducibility
SEED = 123

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,random_state=SEED)

# Instantiate decision tree regressor and assign it to 'dt'
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.14, random_state=SEED)

# Evaluate the list of MSE obtained by 10-fold CV
# Set n_jobs to -1 in order to exploit all CPU cores in computation
MSE_CV = - cross_val_score(dt, X_train, y_train, cv= 10, scoring= 'neg_mean_squared_error' , n_jobs = -1)

# Fit 'dt' to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_predict_train = dt.predict(X_train)

# Predict the labels of test set
y_predict_test = dt.predict(X_test)

# CV MSE
print('CV MSE: {:.2f}'.format(MSE_CV.mean()))

# Training set MSE
print('Train MSE: {:.2f}'.format(MSE(y_train, y_predict_train)))

# Test set MSE
print('Test MSE: {:.2f}'.format(MSE(y_test, y_predict_test)))

# Suppose CV MSE = 20.51, Train MSE = 15.30 and Test MSE = 20.92
# Train MSE < CV MSE,
# which suggests that the model overfits and suffers from high variance.
# CV MSE and Test MSE are roughly equal

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV.mean())**(1/2)
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_predict_train))**(1/2)
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

# Suppose RMSE_CV = 5.14, RMSE_train = 5.15 and baseline_RMSE = 5.1
# dt suffers from high bias because RMSE_CV ≈ RMSE_train and both scores are greater than baseline_RMSE.
# dt is indeed underfitting the training set, as the model is too constrained to capture the nonlinear dependencies between features and labels


# Ensemble Learning
# Import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Import models, including VotingClassifier meta-model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier

# Set seed for reproducibility
SEED = 1

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state= SEED)

# Instantiate individual classifiers
lr = LogisticRegression(random_state=SEED)
knn = KNN(n_neighbors=27)
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define a list called classifier that contains the tuples (classifier_name, classifier)
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

# Iterate over the defined list of tuples containing the classifiers
for clf_name, clf in classifiers:
  
  #fit clf to the training set
  clf.fit(X_train, y_train)
  
  # Predict the labels of the test set
  y_pred = clf.predict(X_test)
  
  # Evaluate the accuracy of clf on the test set
  print('{:s} : {:.3f}'.format(clf_name, accuracy_score(y_test, y_pred)))
  
# OR 
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:    
 
    # Fit clf to the training set
    clf.fit(X_train, y_train)   
   
    # Predict y_pred
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred) 
   
    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

# Instantiate a VotingClassifier 'vc'
vc = VotingClassifier(estimators=classifiers)

# Fit 'vc' to the training set and predict test set labels
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)

# Evaluate the test-set accuracy of 'vc'
print('Voting Classifier: {:.3f}'.format(accuracy_score(y_test, y_pred)))


# Part 3 - Bagging and Random Forest  

In [None]:
# BAGGING CLASSIFICATION
# Import models and utility functions

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Set seed for reproducibility
SEED = 1

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

# Instantiate a classification-tree 'dt'
dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16, random_state=SEED)

# Instantiate a BaggingClassifier 'bc'
# (the 'estimator' parameter was named 'base_estimator' before scikit-learn 1.2)
bc = BaggingClassifier(estimator=dt, n_estimators=300, n_jobs=-1)

# Fit 'bc' to the training set
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate and print test-set accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy of Bagging Classifier: {:.3f}'.format(accuracy))


# OOB EVALUATION IN SKLEARN
# Import models and split utility function
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Set seed for reproducibility
SEED = 1

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, stratify= y, random_state=SEED)

# Instantiate a classification-tree 'dt'
dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16, random_state=SEED)

# Instantiate a BaggingClassifier 'bc'; set oob_score=True
# (the 'estimator' parameter was named 'base_estimator' before scikit-learn 1.2)
bc = BaggingClassifier(estimator=dt, n_estimators=300, oob_score=True, n_jobs=-1)

# Fit 'bc' to the training set
bc.fit(X_train, y_train)

# Predict the test set labels
y_pred = bc.predict(X_test)

# Evaluate test set accuracy
test_accuracy = accuracy_score(y_test, y_pred)

# Extract the OOB accuracy from 'bc'
oob_accuracy = bc.oob_score_

# Print test set accuracy
print('Test set accuracy: {:.3f}'.format(test_accuracy))

# Print OOB accuracy
print('OOB accuracy: {:.3f}'.format(oob_accuracy))
# If the difference between test and OOB accuracy is small, OOB evaluation gives a good estimate of the model's accuracy without the need for cross-validation


# RANDOM FOREST REGRESSOR
# Basic imports
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

# Set seed for reproducibility
SEED = 1

# Split dataset into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Instantiate a random forests regressor 'rf' with 400 estimators
rf = RandomForestRegressor(n_estimators=400, min_samples_leaf=0.12, random_state=SEED)

# Fit 'rf' to the training set
rf.fit(X_train, y_train)

# Predict the test set labels 'y_pred'
y_pred = rf.predict(X_test)

# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print the test set RMSE
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))


# FEATURE IMPORTANCE in sklearn

import pandas as pd
import matplotlib.pyplot as plt

# Create a pd.Series of features importances
importances_rf = pd.Series(rf.feature_importances_, index = X_train.columns)
#importances_rf = pd.DataFrame(rf.feature_importances_,
#                                   index = X_train.columns,
#                                    columns=['importance']).sort_values('importance',  
#ascending=False)

# Sort importances_rf
sorted_importances_rf = importances_rf.sort_values()

# Make a horizontal bar plot
sorted_importances_rf.plot(kind= 'barh', color= 'lightgreen'); 
plt.show()


# Part 4 - Boosting

In [None]:
# AdaBoost Classification in sklearn
# Import models and utility functions
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Set seed for reproducibility
SEED = 1

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

# Instantiate a classification-tree 'dt'
dt = DecisionTreeClassifier(max_depth=1, random_state=SEED)

# Instantiate an AdaBoost classifier 'adb_clf'
# (the 'estimator' parameter was named 'base_estimator' before scikit-learn 1.2)
adb_clf = AdaBoostClassifier(estimator=dt, n_estimators=100)

# Fit 'adb_clf' to the training set
adb_clf.fit(X_train, y_train)

# Predict the test set probabilities of positive class
# Once the classifier adb_clf is trained, call the .predict_proba() method by passing X_test as a parameter 
# Extract these probabilities by slicing all the values in the second column as follows
y_pred_proba = adb_clf.predict_proba(X_test)[:,1]

# Evaluate test-set roc_auc_score


adb_clf_roc_auc_score= roc_auc_score(y_test, y_pred_proba)

# Print adb_clf_roc_auc_score
print('ROC AUC score: {:.2f}'.format(adb_clf_roc_auc_score))


# Gradient Boosting in sklearn
# Import models and utility functions
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

# Set seed for reproducibility
SEED = 1

# Split dataset into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=SEED)

# Instantiate a GradientBoostingRegressor 'gbt'
gbt = GradientBoostingRegressor(n_estimators=300, max_depth=1, random_state=SEED)

# Fit 'gbt' to the training set
gbt.fit(X_train, y_train)

# Predict the test set labels
y_pred = gbt.predict(X_test)

# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print the test set RMSE
print('Test set RMSE: {:.2f}'.format(rmse_test))


# Stochastic Gradient Boosting in sklearn
# Import models and utility functions
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

# Set seed for reproducibility
SEED = 1

# Split dataset into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,random_state=SEED)

# Instantiate a stochastic GradientBoostingRegressor 'sgbt'
sgbt = GradientBoostingRegressor(max_depth=1, subsample=0.8, max_features=0.2, n_estimators=300, random_state=SEED)
# subsample=0.8: each tree is trained on a random 80% of the training data
# max_features=0.2: each tree uses 20% of the available features to find the best split

# Fit 'sgbt' to the training set
sgbt.fit(X_train, y_train)

# Predict the test set labels
y_pred = sgbt.predict(X_test)

# Evaluate test set RMSE 'rmse_test'
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print 'rmse_test'
print('Test set RMSE: {:.2f}'.format(rmse_test))


# Part 5 - Model Tuning 

In [None]:
#Inspecting the hyperparameters of a CART
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Set seed to 1 for reproducibility
SEED = 1

# Instantiate a DecisionTreeClassifier 'dt'
dt = DecisionTreeClassifier(random_state=SEED)

# Print out 'dt's hyperparameters
print(dt.get_params())

# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of hyperparameters 'params_dt'
params_dt = {
              'max_depth': [3, 4,5, 6],
              'min_samples_leaf': [0.04, 0.06, 0.08],
              'max_features': [0.2, 0.4,0.6, 0.8]
            }

# Instantiate a 10-fold CV grid search object 'grid_dt'
grid_dt = GridSearchCV(estimator=dt, param_grid=params_dt, scoring='accuracy', cv=10, n_jobs=-1)

# Fit 'grid_dt' to the training data
grid_dt.fit(X_train, y_train)

# Extract best hyperparameters from 'grid_dt'
best_hyperparams = grid_dt.best_params_
print('Best hyperparameters:\n', best_hyperparams)

# Extract best CV score from 'grid_dt'
best_CV_score = grid_dt.best_score_
print('Best CV accuracy: {:.3f}'.format(best_CV_score))

# Extract best model from 'grid_dt'
best_model = grid_dt.best_estimator_

# Evaluate test set accuracy
test_acc = best_model.score(X_test,y_test)

# Print test set accuracy
print("Test set accuracy of best model: {:.3f}".format(test_acc))

# Import roc_auc_score from sklearn.metrics 
from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))


# Inspecting RF Hyperparameters in sklearn
# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# Set seed for reproducibility
SEED = 1

# Instantiate a random forests regressor 'rf'
rf = RandomForestRegressor(random_state= SEED)

# Inspect rf's hyperparameters
rf.get_params()

# Basic imports
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import GridSearchCV

# Define a grid of hyperparameter 'params_rf'
params_rf = {
                'n_estimators': [300, 400, 500],
                'max_depth': [4, 6, 8],
                'min_samples_leaf': [0.1, 0.2],
                'max_features': ['log2','sqrt']
            }

# Instantiate 'grid_rf'
grid_rf = GridSearchCV(estimator=rf,param_grid=params_rf, cv=3, scoring= 'neg_mean_squared_error',verbose=1, n_jobs=-1)

# Searching for the best hyperparameters
# Fit 'grid_rf' to the training set
grid_rf.fit(X_train, y_train)

# Extract best hyperparameters from 'grid_rf'
best_hyperparams = grid_rf.best_params_
print('Best hyperparameters:\n', best_hyperparams)

# Extract best model from 'grid_rf'
best_model = grid_rf.best_estimator_

# Predict the test set labels
y_pred = best_model.predict(X_test)

# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print the test set RMSE
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))
