# Ensemble Models
- Combining a diverse set of learners (individual models) to improve the stability and predictive power of the model.
- 3 broad model types
    - Bagging (Bootstrap Aggregation): building multiple models (usually of the same type) from different subsamples of the training data.
    - Boosting: building multiple models (usually of the same type) sequentially, each of which learns to fix the errors of the prior model.
    - Voting: building multiple models (usually of different types) and combining their predictions with simple statistics (e.g. mean).

## Decision Trees
- Classification and Regression Trees (CART)
- Advantages
    - Easy to understand
    - Useful during data exploration
    - Less data cleaning required
    - Data type not a constraint: handles categorical & numerical
    - Non-parametric
- Disadvantages:
    - Overfitting
- Types:
    - Categorical Variable Decision Tree: categorical target variable
    - Continuous Variable Decision Tree: continuous target variable
- Does not require special data preparation
- Decision trees are sensitive to the specific data on which they are trained. If the training data is changed (e.g. a tree is trained on a subset of the training data) the resulting decision tree can be quite different and in turn the predictions can be quite different.
- Foundation for algorithms: bagged decision trees, random forest and boosted decision trees.
- Representation for the CART model is a binary tree
- Creating a CART model involves selecting input variables and split points on those variables until a suitable tree is constructed.
- Creating a binary decision tree:
    - Process of dividing up the input space. 
    - A greedy approach called recursive binary splitting is used to divide the space.
    - All the values are lined up and different split points are tried and tested using a cost function. 
    - The split with the best cost (lowest cost because we minimize cost) is selected.
    - For regression predictive modeling problems, the cost function minimized to choose split points is the sum of squared errors across all training samples that fall within the rectangle: sum((y - prediction)^2)
    - For classification, the Gini index is used. It indicates how “pure” the leaf nodes are (how mixed the training data assigned to each node is): G = sum(pk * (1 - pk)), where G is the Gini index over all classes and pk is the proportion of training instances with class k in the rectangle of interest. A node whose instances all belong to one class (perfect class purity) has G = 0, whereas a node with a 50-50 split of classes in a binary classification problem (worst purity) has G = 0.5.
- Stopping Criterion
    - The recursive binary splitting procedure described above needs to know when to stop splitting as it works its way down the tree with the training data.
    - Most common stopping procedure is to use a minimum count on the number of training instances assigned to each leaf node. If the count is less than some minimum then the split is not accepted and the node is taken as a final leaf node.
    - The count of training members is tuned to the dataset, e.g. 5 or 10. It defines how specific to the training data the tree will be. Too specific (e.g. a count of 1) and the tree will overfit the training data and likely have poor performance on the test set.
- Pruning the tree
    - The stopping criterion is important as it strongly influences the performance of your tree. You can use pruning after learning your tree to further lift performance.
    - The complexity of a decision tree is defined as the number of splits in the tree. Simpler trees are preferred. They are easy to understand (you can print them out and show them to subject matter experts), and they are less likely to overfit your data.
    - The fastest and simplest pruning method is to work through each leaf node in the tree and evaluate the effect of removing it using a hold-out test set. A leaf node is removed only if doing so results in a drop in the overall cost function on the entire test set. You stop removing nodes when no further improvements can be made.
    - More sophisticated pruning methods can be used such as cost complexity pruning (also called weakest link pruning) where a learning parameter (alpha) is used to weigh whether nodes can be removed based on the size of the sub-tree.
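
The Gini index and the greedy split search described above can be sketched in a few lines. This is a minimal illustration, not the CART implementation itself: the helper names `gini_index` and `best_split` are hypothetical, only a single numeric feature is searched, and candidate thresholds are assumed to be the midpoints between consecutive distinct values.

```python
import numpy as np


def gini_index(labels):
    """Gini index G = sum_k p_k * (1 - p_k) for one group of class labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return float(np.sum(p * (1.0 - p)))


def best_split(x, y):
    """Greedy search over one numeric feature for the lowest weighted Gini cost."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    best_threshold, best_cost = None, float("inf")
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values (an assumption).
    for threshold in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= threshold], y[x > threshold]
        # Cost of a split: Gini of each side, weighted by the share of samples it holds.
        cost = (left.size * gini_index(left) + right.size * gini_index(right)) / y.size
        if cost < best_cost:
            best_threshold, best_cost = float(threshold), float(cost)
    return best_threshold, best_cost


pure = gini_index([1, 1, 1, 1])    # all instances in one class
mixed = gini_index([0, 1, 0, 1])   # 50-50 binary split
threshold, cost = best_split([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0, 0, 0, 1, 1, 1])
print(pure, mixed, threshold, cost)
```

A full tree would apply `best_split` to every feature, keep the cheapest split, and recurse on both sides until the stopping criterion (for example a minimum leaf count) is met.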

## Bagging
- Fits similar learners on different samples of the training data (drawn with replacement) and then combines their predictions, e.g. by taking the mean.
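
To make the mechanism concrete, here is a hand-rolled sketch of bagging. It assumes synthetic data from `make_classification` in place of a real training set and an arbitrary choice of 25 trees; in practice `BaggingClassifier` does the same work.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real training set (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=7)
rng = np.random.default_rng(7)

n_models = 25
models = []
for _ in range(n_models):
    # Bootstrap sample: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: average the 0/1 predictions of all trees, then threshold (a majority vote).
votes = np.mean([m.predict(X) for m in models], axis=0)
bagged_pred = (votes >= 0.5).astype(int)
print("training accuracy of the bagged vote:", np.mean(bagged_pred == y))
```

The accuracy printed here is measured on the training data, so it is optimistic; it is only meant to show the sample-fit-aggregate loop.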

### Bagged Decision Trees
- Bagging works best for algorithms with high variance like decision trees (which are often constructed without pruning)
- Example: use the BaggingClassifier with the Classification and Regression Trees algorithm (DecisionTreeClassifier) to create a total of 100 trees
- Running the example, we get a robust estimate of model accuracy.

In [None]:
# Bagged Decision Trees for Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
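# shuffle=True is required for random_state to take effect (recent scikit-learn raises an error otherwise)
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)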
cart = DecisionTreeClassifier()
num_trees = 100
# base learner passed positionally: the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

### Random Forest
- Random forest is an extension & improvement of bagged decision trees.
- Samples of the training dataset are taken with replacement.
- Trees are constructed in a way that reduces the correlation between individual classifiers.
- Rather than searching all features for the best split point, only a random subset of features is considered for each split.
- Builds multiple decision trees and amalgamates them together to get a more accurate and stable prediction

In [None]:
# Random Forest Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
num_trees = 100
max_features = 3
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

#### Random forest parameters
- Improve predictive power:
    - max_features
        - Maximum number of features the random forest is allowed to consider when looking for the best split. There are multiple options.
        - Auto/None : None considers all the features at every split, putting no restriction on the individual tree. "auto" meant all features for the regressor but sqrt for the classifier, and was removed in scikit-learn 1.3.
        - sqrt : Considers the square root of the total number of features at each split. For instance, if there are 100 variables, only 10 of them are considered at each split. 'log2' is a similar option for max_features.
        - 0.2 : Considers 20% of the variables at each split. Any float between 0 and 1 can be given to consider that fraction of the features.
        - Increasing max_features generally improves the performance of the model, as each node now has more options to consider. However, this is not guaranteed, because it decreases the diversity of the individual trees, which is the main strength of random forest. Increasing max_features does, for sure, slow the algorithm down.
    - n_estimators
        - Number of trees to build before taking the majority vote or the average of the predictions.
        - A higher number of trees gives better performance but makes your code slower.
        - Choose as high a value as your processor can handle, because this makes your predictions stronger and more stable.
    - min_samples_leaf
        - A leaf is the end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in the training data.
        - I prefer a minimum leaf size of more than 50. However, you should try multiple leaf sizes to find the best one for your use case.
- Training speed and convenience
    - n_jobs
        - How many processors the model is allowed to use.
        - A value of “-1” means there is no restriction (all processors are used), whereas a value of “1” means it can only use one processor.
    - random_state
        - Makes a solution easy to replicate
        - A definite value of random_state will always produce same results if given with same parameters and training data.
        - I have personally found that an ensemble of multiple models with different random states and all optimum parameters sometimes performs better than a single random state.
    - oob_score
        - The random forest's built-in validation method (out-of-bag score).
        - Similar to leave-one-out validation, but much faster.
        - This method tags every observation used in the different trees and finds a maximum vote score for every observation based only on the trees that did not use that observation for training.
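
As a rough illustration of how these parameters interact, the sketch below compares several `max_features` settings by their out-of-bag accuracy. The synthetic dataset and the specific values (200 trees, a minimum leaf size of 5) are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for a real training set (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=16, n_informative=5, random_state=42)

oob_scores = {}
for max_features in ["sqrt", "log2", 0.2, None]:
    forest = RandomForestClassifier(
        n_estimators=200,           # more trees: more stable, but slower
        max_features=max_features,  # features considered at each split
        min_samples_leaf=5,         # larger leaves capture less noise
        oob_score=True,             # score each row using only trees that did not train on it
        n_jobs=-1,                  # use all available processors
        random_state=42,            # make the run reproducible
    )
    forest.fit(X, y)
    oob_scores[str(max_features)] = forest.oob_score_
    print(f"max_features={max_features!s:>5}  OOB accuracy={forest.oob_score_:.3f}")
```

For a classifier, `oob_score_` is an accuracy, so no separate hold-out set is needed for this quick comparison.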

In [None]:
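from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
import pandas as pd

# Assumes a local train.csv with a binary "Survived" target column
x = pd.read_csv("train.csv")
y = x.pop("Survived")

# Assumption: train on the numeric columns only, with missing values filled by the column median
numeric_variable = x.select_dtypes(include="number").columns
x_numeric = x[numeric_variable].fillna(x[numeric_variable].median())

model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
model.fit(x_numeric, y)
# oob_prediction_ holds each row's prediction from the trees that did not train on it
print("AUC - ROC : ", roc_auc_score(y, model.oob_prediction_))

# Tuning model

sample_leaf_options = [1, 5, 10, 50, 100, 200, 500]
for leaf_size in sample_leaf_options:
    # max_features=None uses all features, which is what "auto" meant for the regressor
    model = RandomForestRegressor(n_estimators=200, oob_score=True, n_jobs=-1, random_state=50, max_features=None, min_samples_leaf=leaf_size)
    model.fit(x_numeric, y)
    print("AUC - ROC : ", roc_auc_score(y, model.oob_prediction_))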

### Extra Trees
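- Extra Trees (extremely randomized trees) are another modification of bagging where random trees are constructed from the training dataset: split thresholds are drawn at random for each candidate feature rather than optimized. Note that scikit-learn's ExtraTreesClassifier trains each tree on the whole training set by default (bootstrap=False).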
- The example below provides a demonstration of extra trees with the number of trees set to 100 and splits chosen from 7 random features.

In [None]:
# Extra Trees Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import ExtraTreesClassifier
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
num_trees = 100
max_features = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

## Boosting
- Boosting ensemble algorithms create a sequence of models that attempt to correct the mistakes of the models before them in the sequence.
- Once created, the models make predictions which may be weighted by their demonstrated accuracy and the results are combined to create a final output prediction.
- Common boosting algorithms: AdaBoost & Stochastic Gradient Boosting
- In AdaBoost-style boosting, the weight of an observation is adjusted based on the last classification: if an observation was classified incorrectly, its weight is increased, and vice versa. Boosting in general decreases the bias error and builds strong predictive models. However, it may sometimes overfit the training data.

### AdaBoost
- Works by weighting instances in the dataset by how easy or difficult they are to classify, allowing the algorithm to pay more or less attention to them in the construction of subsequent models.
- The example below demonstrates the construction of 30 decision trees in sequence using the AdaBoost algorithm.
- AdaBoost is best used to boost the performance of decision trees on binary classification problems.
- Mainly used for classification rather than regression (scikit-learn also offers AdaBoostRegressor).
- AdaBoost can be used to boost the performance of any machine learning algorithm. It is best used with weak learners.
- The most suited, and therefore most common, algorithm used with AdaBoost is a decision tree with one level. Because these trees are so short and only contain one decision for classification, they are often called decision stumps.
- Weak models are added sequentially, each trained using the weighted training data.
- Data preparation:
    - Quality Data: Ensemble method continues to attempt to correct misclassifications in the training data
    - Outliers: Force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic - remove
    - Noisy Data: Especially noise in the output variable can cause problems - try to isolate and clean

In [None]:
# AdaBoost Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
num_trees = 30
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
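
The instance-weighting idea can also be written out by hand. The following is a minimal sketch of discrete AdaBoost with decision stumps, assuming labels recoded to -1/+1 and synthetic data; it is not how `AdaBoostClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real training set (an assumption for illustration).
X, y01 = make_classification(n_samples=400, n_features=8, random_state=7)
y = np.where(y01 == 1, 1, -1)             # recode labels to {-1, +1}

n_rounds = 30
weights = np.full(len(X), 1.0 / len(X))   # start with uniform instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a one-level tree trained on the weighted data.
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = np.sum(weights[pred != y])           # weighted error (weights sum to 1)
    err = np.clip(err, 1e-10, 1 - 1e-10)       # guard against log(0)
    alpha = 0.5 * np.log((1 - err) / err)      # influence of this stump in the final vote
    weights *= np.exp(-alpha * y * pred)       # raise weights of mistakes, lower the rest
    weights /= weights.sum()                   # renormalise to sum to 1
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump predictions.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
ensemble_pred = np.where(scores >= 0, 1, -1)
train_accuracy = np.mean(ensemble_pred == y)
print("training accuracy:", train_accuracy)
```

Because misclassified rows keep gaining weight, outliers and label noise can come to dominate later rounds, which is why the data preparation notes above matter.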

### Stochastic Gradient Boosting (also called Gradient Boosting Machines)
- Sophisticated ensemble technique: one of the best available for improving performance via ensembles

In [None]:
# Stochastic Gradient Boosting Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
num_trees = 100
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

## Voting/Stacking
- Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms.
- Stacking uses a learner to combine the output from different learners.
- Voting works by first creating two or more standalone models from your training dataset. A Voting Classifier can then be used to wrap your models and average the predictions of the sub-models when asked to make predictions for new data.
- The predictions of the sub-models can be weighted, but specifying the weights for classifiers manually or even heuristically is difficult. More advanced methods can learn how to best weight the predictions from sub-models; this is called stacking (stacked generalization) and is provided in scikit-learn 0.22 and later as StackingClassifier and StackingRegressor.
- The code below provides an example of combining the predictions of logistic regression, classification and regression trees and support vector machines together for a classification problem.

In [None]:
# Voting Ensemble for Classification
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))
# create the ensemble model
ensemble = VotingClassifier(estimators)
results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())
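
For comparison with the voting ensemble above, here is a minimal stacking sketch. It assumes scikit-learn 0.22 or later, where `StackingClassifier` is available, and uses synthetic data; the choice of logistic regression as the final estimator is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real training set (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=7)

# Same three kinds of sub-model as in the voting example.
base_models = [
    ("logistic", LogisticRegression(max_iter=1000)),
    ("cart", DecisionTreeClassifier(random_state=7)),
    ("svm", SVC(random_state=7)),
]

# The final estimator is trained on cross-validated outputs of the base models,
# so it learns how much to trust each one instead of averaging them equally.
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression(), cv=5)
stack_scores = cross_val_score(stack, X, y, cv=5)
print("stacking accuracy: %.3f" % stack_scores.mean())
```

The inner `cv=5` controls how the base-model outputs are generated for the final estimator, while the outer `cross_val_score` estimates the accuracy of the whole stack.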