let's take an example: if we ask n people for their view on a problem, aggregating all their answers usually gives a better perspective than asking a single person (the wisdom of the crowd)
similarly, if we aggregate the predictions of a group of predictors (such as classifiers or regressors), we will often get better predictions than with the best individual predictor
a group of predictors is called an ensemble, the technique is called ensemble learning, and an ensemble learning algorithm is called an ensemble method

as an example, we can train a group of decision tree classifiers, each on a different random subset of the same training set. at prediction time each tree votes for a class, and the final prediction is the class with the most votes; this is called majority voting
such an ensemble of decision trees is called a random forest, and despite its simplicity it is one of the most powerful machine learning algorithms available today

as discussed in chap 2, you will often use ensemble methods near the end of a project, once you have already built a few good predictors, to combine them into an even better predictor

in this chapter we will discuss the most popular ensemble methods, including voting classifiers, bagging and pasting ensembles, random forests, boosting, and stacking ensembles

VOTING CLASSIFIERS :

suppose you trained a few classifiers, each one achieving about 80% accuracy.
you may have a logistic regression classifier, an SVM classifier, a random forest classifier, a k-nearest neighbors classifier, and perhaps a few more, as shown in fig 7-1

a very simple way to create an even better classifier is to aggregate the predictions of each classifier: the class that gets the most votes is the ensemble's prediction. this majority-vote classifier is called a hard voting classifier

even weak models can come together to form a strong model if certain conditions are met
a voting classifier (or other ensemble method) often achieves higher accuracy than the best individual model in the group
this is because: each model can catch different patterns, their errors can partly cancel each other out, and the majority is often right even when some models make a mistake
even if every member is a weak learner (only slightly better than random guessing), the ensemble can still be a strong learner, provided there are enough weak learners and they are sufficiently diverse

eg:
you are flipping a slightly biased coin :
probab of heads = 51%
probab of tails = 49%

you toss this coin 10,000 times
and repeat the whole process in 10 separate series
the chart shows the heads ratio as the number of tosses increases

see fig 7-3:
at the start the heads ratio fluctuates a lot; randomness dominates
as the no of tosses increases, the ratio converges toward 51%, and eventually all 10 series end up very close to 51% and consistently above 50%
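
the coin experiment is easy to reproduce. this is only a sketch with NumPy: the seed and the variable names are my own choices, and the exact ratios will differ from the figure

```python
import numpy as np

rng = np.random.default_rng(42)      # arbitrary seed (assumption)
heads_proba = 0.51                   # slightly biased coin
n_tosses, n_series = 10_000, 10

# one column per series: True where the toss came up heads
tosses = rng.random((n_tosses, n_series)) < heads_proba

# running heads ratio after 1, 2, ..., n_tosses tosses
toss_counts = np.arange(1, n_tosses + 1).reshape(-1, 1)
heads_ratio = tosses.cumsum(axis=0) / toss_counts

print("after 10 tosses:    ", heads_ratio[9].round(2))
print("after 10,000 tosses:", heads_ratio[-1].round(3))
```

the first row of output should be all over the place, while the last one should sit close to 0.51 for every series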

this shows how ensemble methods work :
each weak model has only a slight advantage over random guessing
when you combine many weak learners, their collective vote becomes much more reliable

similarly, suppose you have 1,000 classifiers, each with only 51% accuracy:
if these classifiers are perfectly independent and make uncorrelated errors then, just like in the biased coin example, the majority vote will be correct most of the time, potentially up to about 75% accuracy (with 10,000 such classifiers it would climb above 97%)
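
a quick sanity check of the majority-vote arithmetic, assuming perfectly independent voters (which real classifiers are not). the helper name majority_vote_accuracy is hypothetical, and it counts a tie as a wrong answer, so the numbers can be slightly below figures that count ties as correct

```python
import math

def majority_vote_accuracy(n_voters, p_correct):
    """probability that strictly more than half of n independent voters are right"""
    total = 0.0
    for k in range(n_voters // 2 + 1, n_voters + 1):
        # log of the binomial pmf, to avoid overflow with large n
        log_pmf = (
            math.lgamma(n_voters + 1) - math.lgamma(k + 1) - math.lgamma(n_voters - k + 1)
            + k * math.log(p_correct) + (n_voters - k) * math.log(1 - p_correct)
        )
        total += math.exp(log_pmf)
    return total

for n in (1, 100, 1_000, 10_000):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

the accuracy of the vote should grow with the number of voters, even though each voter stays at 51%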

but in reality that's not the case: your classifiers are not independent because they are trained on the same data, so they tend to make the same types of mistakes. there will therefore be many majority votes for the wrong class, which reduces the ensemble's accuracy

ensemble methods work best when the predictors are as independent from one another as possible. one way to get diverse classifiers is to train them using very different algorithms; this increases the chance that they will make very different types of errors, improving the ensemble's accuracy

scikit-learn provides a VotingClassifier class that's quite easy to use. we will load the moons dataset and split it into a training set and a test set, then create and train a voting classifier composed of three diverse classifiers:

In [1]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

x, y = make_moons(n_samples=500, noise=0.30, random_state=42)
x_train, x_test , y_train, y_test = train_test_split(x,y,random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)

voting_clf.fit(x_train, y_train)

let's see how it works internally:

when you fit a VotingClassifier, scikit-learn doesn't use the original estimators directly. instead it:
clones each estimator using sklearn.base.clone(), so your originals are not altered
trains these cloned estimators
keeps the fitted clones in voting_clf.estimators_ (a list) and voting_clf.named_estimators_ (a dict)
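
the cloning behaviour can be checked directly. a small sketch on a toy ensemble (the variable names lr, tree and voting are mine, and the two estimators are an arbitrary choice):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

x, y = make_moons(n_samples=200, noise=0.30, random_state=42)

lr = LogisticRegression(random_state=42)
tree = DecisionTreeClassifier(random_state=42)
voting = VotingClassifier(estimators=[("lr", lr), ("tree", tree)])
voting.fit(x, y)

fitted_lr = voting.named_estimators_["lr"]
print(fitted_lr is lr)              # False: the ensemble trained a clone
print(hasattr(fitted_lr, "coef_"))  # True: the clone is fitted
print(hasattr(lr, "coef_"))         # False: the original was left untouched
```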

In [2]:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(x_test,y_test))

lr = 0.864
rf = 0.896
svc = 0.896


named_estimators_ = a dict of the fitted estimators, keyed by name
this loop prints each individual model's accuracy on the test set

now lets see the results collectively :

In [3]:
[clf.predict(x_test[:1]) for clf in voting_clf.estimators_]

[array([1], dtype=int64), array([1], dtype=int64), array([0], dtype=int64)]

this line gets the predictions for the first instance from each model in the ensemble
log reg predicts class 1
rf predicts class 1
svm predicts class 0

hard voting result :

In [4]:
voting_clf.predict(x_test[:1])

array([1], dtype=int64)

the majority votes for class 1 (two votes against one), so the ensemble's prediction is class 1
now let's see the ensemble's accuracy:

In [5]:
voting_clf.score(x_test,y_test)

0.912

91.2% accuracy: the voting classifier outperforms all the individual classifiers

now let's switch from hard voting to soft voting, which often gives better performance:

what is soft voting:
instead of counting one vote per classifier using predict(), soft voting uses the class probability estimates from predict_proba()
for each class it averages the predicted probabilities across all classifiers
the final prediction is the class with the highest average probab
this way highly confident votes get more weight
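
a tiny made-up example where the two voting schemes disagree. the probabilities below are invented for illustration, they do not come from the classifiers trained above

```python
import numpy as np

# rows = classifiers, columns = [P(class 0), P(class 1)] for one instance
probas = np.array([
    [0.55, 0.45],   # classifier A: weakly prefers class 0
    [0.55, 0.45],   # classifier B: weakly prefers class 0
    [0.10, 0.90],   # classifier C: strongly prefers class 1
])

# hard voting: each classifier votes for its most likely class, majority wins
votes = probas.argmax(axis=1)
hard_pred = np.bincount(votes).argmax()

# soft voting: average the probabilities per class, highest average wins
avg_proba = probas.mean(axis=0)
soft_pred = avg_proba.argmax()

print(votes, hard_pred)       # [0 0 1] 0
print(avg_proba, soft_pred)   # averages are about 0.4 and 0.6, so class 1
```

two lukewarm votes for class 0 win under hard voting, but the single confident vote for class 1 wins under soft voting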

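soft voting only works if every classifier in the ensemble can estimate class probabilities. SVC does not provide predict_proba() unless you set probability=True (it then uses cross-validation to estimate class probabilities, which slows down training)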

In [6]:
voting_clf.voting = "soft"
voting_clf.named_estimators["svc"].probability = True
voting_clf.fit(x_train, y_train)
voting_clf.score(x_test, y_test)

0.92

92% accuracy, slightly better than hard voting (91.2%)

BAGGING AND PASTING :

one way to get diverse classifiers is to use very different training algorithms, as just discussed. another way is to use the same training algorithm for every predictor but train each one on a different random subset of the training set

when sampling is performed with replacement it's called bagging (short for bootstrap aggregating): you randomly pick instances with replacement from the training set to build each subset, so some instances may appear multiple times in a single subset
goal: reduce variance by averaging out model instability

when it's performed without replacement it's called pasting: you still build random subsets, but each training instance appears at most once per subset, so each subset contains only distinct instances
goal: same idea (reduce variance by aggregating), with slightly less randomness in each subset than bagging

in other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows a training instance to be sampled several times for the same predictor
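
the difference between the two sampling schemes in a few lines of NumPy. this is only a toy sketch on instance indices (subset size and seed are arbitrary), not how scikit-learn implements it internally

```python
import numpy as np

rng = np.random.default_rng(42)
indices = np.arange(10)          # toy training set: instances 0..9

# bagging: sample with replacement, so duplicates are possible
bagging_subset = rng.choice(indices, size=8, replace=True)

# pasting: sample without replacement, so every index is unique
pasting_subset = rng.choice(indices, size=8, replace=False)

print("bagging:", np.sort(bagging_subset))
print("pasting:", np.sort(pasting_subset))
print("unique in pasting subset:", len(np.unique(pasting_subset)))   # always 8
```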

once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all the predictors
the aggregation function is typically the statistical mode for classification and the average for regression:

| Task Type      | Aggregation Method                   | Analogy                     |
| -------------- | ------------------------------------ | --------------------------- |
| Classification | **Statistical mode** (majority vote) | Like hard voting classifier |
| Regression     | **Average of predictions**           | Smooths out extreme values  |
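
both aggregation functions on made-up predictions (three predictors; the numbers are invented for illustration):

```python
import numpy as np

# classification: rows = predictors, columns = instances, values = predicted class
class_preds = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
])
# statistical mode of each column, i.e. the majority vote per instance
majority = np.array([np.bincount(col).argmax() for col in class_preds.T])
print(majority)   # [1 0 1]

# regression: rows = predictors, columns = instances, values = predicted target
reg_preds = np.array([
    [2.0, 10.0],
    [4.0, 12.0],
    [3.0, 50.0],   # one extreme prediction for the second instance
])
average = reg_preds.mean(axis=0)
print(average)    # 3.0 for the first instance, 24.0 for the second
```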

each individual predictor is trained on less data, so it has a slightly higher bias than if it were trained on the full training set (it is a bit less accurate on average)
but aggregating the predictions tends to cancel out random errors, which reduces both bias and variance
net result:
the ensemble generally has a similar (or slightly higher) bias but a lower variance than a single predictor trained on the full training set, so it generalises better overall

see fig 7-4 :
as you can see, the predictors can all be trained in parallel, on different CPU cores or even different servers; similarly, predictions can be made in parallel
this is one of the reasons bagging and pasting are so popular: they scale very well

bagging and pasting in scikit learn :

scikit-learn provides the BaggingClassifier class for both bagging and pasting (and BaggingRegressor for regression)

the code below trains an ensemble of 500 decision tree classifiers, each trained on 100 training instances randomly sampled from the training set with replacement. this is bagging; if you want pasting instead, just set bootstrap=False
the n_jobs parameter tells scikit-learn the no of CPU cores to use for training and predictions; -1 tells it to use all available cores

In [7]:
from sklearn.ensemble import BaggingClassifier #for the ensemble 
from sklearn.tree import DecisionTreeClassifier #as the base model for each ensemble member 

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),  # base estimator
    n_estimators=500,  # train 500 different trees; more trees give more stable predictions
    max_samples=100,  # no of training instances sampled for each tree
    n_jobs=-1,  # use all CPU cores
    random_state=42  # for reproducibility
)

bag_clf.fit(x_train, y_train)

A BaggingClassifier automatically performs soft voting instead
of hard voting if the base classifier can estimate class probabilities
i.e., if it has a predict_proba() method, which is the case with
decision tree classifiers.

see fig 7-5 :

it compares the decision boundary of a single decision tree with the decision boundary of a bagging ensemble of 500 trees (from the code above), both trained on the moons dataset. as you can see, the ensemble's predictions will likely generalise much better than the single decision tree's predictions
the ensemble has a comparable bias but a smaller variance
it makes roughly the same number of errors on the training set, but its decision boundary is less irregular

| Feature     | **Bagging**                      | **Pasting**                             |
| ----------- | -------------------------------- | --------------------------------------- |
| Sampling    | With replacement                 | Without replacement                     |
| Diversity   | Higher (same samples can repeat) | Lower (each instance used at most once) |
| Correlation | Lower (predictors less similar)  | Higher (predictors more similar)        |
| Bias        | Slightly **higher**              | Slightly **lower**                      |
| Variance    | **Lower** due to more diversity  | Higher                                  |
| Overall     | Often better generalization      | Sometimes slightly better fit           |

bagging is often preferred because:
sampling with replacement adds more diversity between the predictors, which reduces the correlation between their errors and ultimately lowers the ensemble's variance, even if each individual model has a slightly higher bias
it's not a rule though: if you have the time and CPU power, you can use cross-validation to compare bagging and pasting on your specific task
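
one possible way to run that comparison, as a sketch only: it assumes the same moons data as above, uses fewer trees than the earlier cell to keep it quick, and the 5-fold setting is an arbitrary choice

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

x, y = make_moons(n_samples=500, noise=0.30, random_state=42)

scores = {}
for name, bootstrap in [("bagging", True), ("pasting", False)]:
    clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=100,      # fewer trees than above, just to keep this quick
        max_samples=100,
        bootstrap=bootstrap,   # True = with replacement, False = without
        random_state=42,
    )
    # mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(clf, x, y, cv=5).mean()
    print(name, round(scores[name], 3))
```

on a dataset this small the two scores may well be within noise of each other, so don't read too much into a tiny gap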

OUT OF BAG EVALUATION :

with bagging, some training instances may be sampled several times for any given predictor, while others may not be sampled at all. by default a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set
mathematically, this means only about 63% of the training instances are sampled on average for each predictor; the remaining 37% that are not sampled are called out-of-bag (OOB) instances. note that they are not the same 37% for every predictor
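
where the 63% comes from: an instance is missed by one draw with probability 1 - 1/m, so it is missed by all m draws with probability (1 - 1/m)^m, which tends to 1/e as m grows. a short check:

```python
import math

for m in (10, 100, 1_000, 100_000):
    # chance that a given instance is picked at least once in m draws with replacement
    sampled = 1 - (1 - 1 / m) ** m
    print(m, round(sampled, 4))

limit = 1 - math.exp(-1)            # value the ratio approaches as m grows
print("limit:", round(limit, 4))    # 0.6321
```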

since a predictor never sees its OOB instances during training, they can be used as a test set for that individual predictor
for each training instance, some estimators will not have seen it during training; those are the OOB estimators for that instance
we can use these OOB estimators to make a prediction for that specific training instance (if an instance wasn't used to train 200 of the 500 trees, then those 200 trees can be used to predict its label)
this gives us one prediction per training instance, based solely on the models that didn't see it during training

benefits :
no need for a separate validation set
allows you to evaluate ensemble performance during training 
helps avoid overfitting to a validation set 

lets see the code part :

In [8]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    oob_score=True, #enables out of bag evaluation
    n_jobs=-1,
    random_state=42
)

bag_clf.fit(x_train, y_train)

for each of the 500 trees, the model samples the training set with replacement
for each training instance, the model averages the predictions of the trees for which that instance was OOB
the overall OOB accuracy is calculated from how well those averaged predictions match the true labels

In [9]:
bag_clf.oob_score_

0.896

according to this OOB evaluation, this BaggingClassifier is likely to achieve about 89.6% accuracy on the test set. let's verify this:

In [10]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred)

0.92

we got 92% on the test set: the OOB evaluation was a bit too pessimistic, just over 2 percentage points too low

when the base estimator supports predict_proba(), scikit-learn can also return the OOB class probabilities for each training instance, through the oob_decision_function_ attribute

In [11]:
bag_clf.oob_decision_function_[:3]

array([[0.32352941, 0.67647059],
       [0.3375    , 0.6625    ],
       [1.        , 0.        ]])

interpretation:

instance 1:
32.4% probability of class 0
67.6% probability of class 1, so class 1 is the likely prediction

this shows how confident the OOB estimators are about each training instance
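
the link between these probabilities and oob_score_ can be checked by hand: take the most probable class per training instance and compare it with the true label. a sketch, assuming every instance was OOB for at least one tree (practically certain with this many trees); the names bag and oob_pred are mine

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = make_moons(n_samples=500, noise=0.30, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                        oob_score=True, random_state=42)
bag.fit(x_train, y_train)

# most probable class for each training instance, according to its OOB trees
oob_pred = bag.classes_[bag.oob_decision_function_.argmax(axis=1)]
manual_oob_accuracy = accuracy_score(y_train, oob_pred)

print(manual_oob_accuracy, bag.oob_score_)   # the two values should match
```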

RANDOM PATCHES AND RANDOM SUBSPACES :

let's look at two more techniques that improve ensemble diversity by sampling the features, not just the training instances:

the idea is to make individual predictors less correlated by giving each one access to only a random subset of the features rather than the full feature set
this is helpful in high-dimensional datasets, where a few features might otherwise dominate every model

| Hyperparameter       | Purpose                                                                  | Works like    | Applies to   |
| -------------------- | ------------------------------------------------------------------------ | ------------- | ------------ |
| `max_features`       | Max number (or fraction) of features each base estimator sees            | `max_samples` | **Features** |
| `bootstrap_features` | Whether to sample **with** replacement (`True`) or **without** (`False`) | `bootstrap`   | **Features** |


| Technique            | Description                                                                                                         |
| -------------------- | ------------------------------------------------------------------------------------------------------------------- |
| **Random Patches**   | Sample **both** training instances and features (rows & columns). Combines instance and feature sampling.           |
| **Random Subspaces** | Sample **only features** (not training instances). Useful to reduce overfitting in high-dimensional feature spaces. |


how it helps with high-dimensional data (like images):
training speeds up considerably, as each model uses fewer features
predictor diversity increases, as each model sees different features
this trades a bit more bias (each model may individually lack key features) for a lower variance
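
how the two techniques could be configured with BaggingClassifier. a sketch on a synthetic 20-feature dataset; the 0.5 and 0.7 ratios and the number of trees are arbitrary choices, not recommended values

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# synthetic data, only here to have more than two features to sample from
x, y = make_classification(n_samples=300, n_features=20, random_state=42)

# random subspaces: keep all training instances, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,            # every instance, no resampling
    bootstrap_features=True, max_features=0.5,   # 10 of the 20 features per tree
    random_state=42,
).fit(x, y)

# random patches: sample both the training instances and the features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=42,
).fit(x, y)

# feature indices drawn for the first tree of each ensemble
print(subspaces_clf.estimators_features_[0])
print(patches_clf.estimators_features_[0])
```

since bootstrap_features=True samples features with replacement, the same feature index can show up more than once for a given tree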

RANDOM FORESTS :

as we know, a random forest is an ensemble of decision trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set
instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier class, which is more convenient and optimized for decision trees (similarly, there is a RandomForestRegressor class for regression)
the following code trains a random forest classifier with 500 trees, each limited to a maximum of 16 leaf nodes, using all available CPU cores:

In [12]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)

rnd_clf.fit(x_train, y_train)

y_pred_rf = rnd_clf.predict(x_test)

the random forest algorithm introduces extra randomness when growing trees:
when training a regular DecisionTreeClassifier, the algorithm evaluates all features at each node and picks the best split among them
but in a random forest, at each node it randomly selects a subset of the features and searches for the best split only within this subset, not the full set. this is called feature bagging or random feature selection

by default it samples √n features at each node, where n is the total number of features. this results in greater tree diversity, which (again) trades a higher bias for a lower variance, generally yielding an overall better model

the BaggingClassifier below is roughly equivalent to the RandomForestClassifier given above:

In [13]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42)

EXTRA TREES :

when growing a tree in a random forest, only a random subset of the features is considered for splitting at each node. it is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds. for this, set splitter="random" when creating a DecisionTreeClassifier

This increases tree randomness, which:
Reduces correlation between trees.
Lowers ensemble variance
Increases bias (because split thresholds aren’t optimized).
Often improves generalization by preventing overfitting.

| Technique     | Feature Selection | Threshold Selection | Diversity | Bias   | Variance |
| ------------- | ----------------- | ------------------- | --------- | ------ | -------- |
| Decision Tree | All               | Optimal             | Low       | Low    | High     |
| Random Forest | Random subset     | Optimal             | Medium    | Medium | Medium   |
| Extra-Trees   | Random subset     | **Random**          | High      | Higher | Lower    |

a forest of such extremely random trees is called an extremely randomized trees ensemble (or extra-trees for short). it also makes extra-trees classifiers much faster to train than regular random forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree

you can create one using scikit-learn's ExtraTreesClassifier class. its API is identical to RandomForestClassifier, except that bootstrap defaults to False. the same goes for ExtraTreesRegressor, which mirrors RandomForestRegressor
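
a minimal extra-trees example on the moons data, mirroring the random forest cell above. the hyperparameters are copied from that cell for comparison, not tuned

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

x, y = make_moons(n_samples=500, noise=0.30, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

ext_clf = ExtraTreesClassifier(n_estimators=200, max_leaf_nodes=16, random_state=42)
ext_clf.fit(x_train, y_train)

print(ext_clf.bootstrap)   # False: each tree sees the whole training set by default
ext_accuracy = ext_clf.score(x_test, y_test)
print(ext_accuracy)
```

it is hard to tell in advance whether a random forest or extra-trees will do better, so the usual approach is to try both and compare them with cross-validation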

FEATURE IMPORTANCE :

another great quality of random forests is that they make it easy to measure the relative importance of each feature: scikit-learn looks at how much the tree nodes that use that feature reduce impurity on average, across all trees in the forest
in simple words, the total impurity reduction that a feature contributes across all its splits is used to judge its importance

If a node affects a large number of training samples, the impurity reduction at that node is more significant.
So, each node’s impurity reduction is weighted by how many samples passed through it.

scikit-learn computes this score automatically for each feature after training, then scales the results so that the sum of all importances is equal to 1. you can access the result through the feature_importances_ attribute

lets see it :

In [15]:
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score,2), name)

0.11 sepal length (cm)
0.02 sepal width (cm)
0.44 petal length (cm)
0.42 petal width (cm)


we used the iris dataset and found that the most important features are petal length (44%) and petal width (42%), while sepal length (11%) and sepal width (2%) matter much less

similarly, if we train a random forest classifier on the MNIST dataset and plot each pixel's importance, we get the image in fig 7-6

random forests are very handy for getting a quick understanding of which features actually matter, in particular if you need to perform feature selection
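
one way to turn these importances into an actual feature selection step is scikit-learn's SelectFromModel. a sketch on iris; the "mean" threshold is just one possible cut-off, chosen here for illustration

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris(as_frame=True)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=500, random_state=42),
    threshold="mean",   # keep features whose importance is above the average
)
selector.fit(iris.data, iris.target)

# boolean mask of kept features, mapped back to column names
selected = list(iris.data.columns[selector.get_support()])
print(selected)
```

with importances like the ones printed above, the petal features should be kept and the sepal features dropped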