let's take an example: if we ask n people for their views on a problem and aggregate the answers, we usually get a better perspective than by asking just one person
similarly, if we aggregate the predictions of a group of predictors (such as classifiers or regressors), we will often get a better prediction than with the best individual predictor
a group of predictors is called an ensemble, the technique is called ensemble learning, and an ensemble learning algorithm is called an ensemble method

as an example, we can train multiple decision trees, each on a different random subset of the same training set. at prediction time each tree votes for a class, and the final prediction is the class with the most votes (this is called majority voting)
such an ensemble of decision trees is called a random forest, and despite its simplicity it is one of the most powerful machine learning algorithms available today

as discussed in chap 2, you will often use ensemble methods near the end of a project, once you have already built a few good predictors, to combine them into an even better predictor

in this chapter we will discuss the most popular ensemble methods, including voting classifiers, bagging and pasting ensembles, random forests, boosting, and stacking ensembles

VOTING CLASSIFIERS :

suppose you trained a few classifiers, each one achieving about 80% accuracy.
you may have a logistic regression classifier, an SVM classifier, a random forest classifier, a k-nearest neighbors classifier, and perhaps a few more, as shown in fig 7-1

a very simple way to create an even better classifier is to aggregate the predictions of each classifier: the class that gets the most votes is the ensemble's prediction. this majority-vote classifier is called a hard voting classifier
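
a minimal sketch of hard voting in plain python, assuming each classifier has already produced one class label for the instance. the `hard_vote` helper is made up for illustration, it is not part of scikit-learn:

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class label that received the most votes."""
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(predictions).most_common(1)[0][0]

# votes from four hypothetical classifiers for a single instance
votes = [1, 0, 1, 1]
winner = hard_vote(votes)
print(winner)
```

with an even number of voters a tie is possible; this sketch simply keeps whichever tied label `Counter` lists first.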

even weak models can come together to create a strong model if certain conditions are met
a voting classifier (or other ensemble method) often achieves higher accuracy than the best individual model in the group
this is because: each model can catch different patterns, their errors can partly cancel each other out, and the majority is often right even when some models make a mistake
weak learners (only slightly better than random guessing) can become a strong learner together
this works even if all members are weak, provided there are enough of them and they are sufficiently diverse (for example because of randomness in their training data)

eg:
you are flipping a slightly biased coin :
probability of heads = 51%
probability of tails = 49%

you toss this coin 10000 times
and repeat this process in 10 separate series
the chart shows the heads ratio as the number of tosses increases

see fig 7-3:
at the start the heads ratio fluctuates a lot (randomness dominates)
as the number of tosses increases, the ratio converges toward 51%, and eventually all 10 series stabilise above 50%, very close to 51%
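
the experiment can be reproduced with a few lines of plain python. this is only a sketch: the seed and the `heads_ratio_series` helper are arbitrary choices, so the exact ratios will differ from the ones plotted in the figure:

```python
import random

def heads_ratio_series(n_tosses, p_heads, rng):
    """Toss a biased coin n_tosses times and return the running heads ratio."""
    heads = 0
    ratios = []
    for toss in range(1, n_tosses + 1):
        heads += rng.random() < p_heads  # True counts as 1
        ratios.append(heads / toss)
    return ratios

rng = random.Random(42)  # fixed seed so the run is repeatable
series = [heads_ratio_series(10_000, 0.51, rng) for _ in range(10)]

for i, ratios in enumerate(series):
    # heads ratio after 10, 100 and 10000 tosses
    print(i, ratios[9], ratios[99], round(ratios[-1], 4))
```

the early columns jump around a lot, while the last column should sit close to 0.51 for every series.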

this shows how ensemble methods work :
each weak model has only a slight advantage over random guessing
when you combine many weak learners their collective power becomes stronger 

similarly, suppose you have 1000 classifiers, each with only 51% accuracy:
if these classifiers are perfectly independent and make uncorrelated errors, then just as in the biased coin example the majority vote will be correct most of the time, potentially reaching about 75% accuracy (and about 97% with 10000 such classifiers)
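
the numbers can be checked with the binomial distribution. the sketch below assumes perfectly independent voters and counts a tie as a wrong answer, so it gives a figure slightly below the one you get when ties are counted as wins (`majority_correct_prob` is a made-up helper):

```python
from math import exp, lgamma, log

def majority_correct_prob(n, p):
    """P(more than half of n independent voters are right), each right with prob p."""
    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        # log of the binomial pmf, computed in log space to avoid overflow
        log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log(p) + (n - k) * log(1 - p))
        total += exp(log_pmf)
    return total

for n in (1, 100, 1_000, 10_000):
    print(n, round(majority_correct_prob(n, 0.51), 3))
```

the probability grows with the number of voters, which is the whole point of the coin analogy.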

but in reality that's not the case: your classifiers are not independent because they are trained on the same data (or similar features), so they tend to make the same types of errors. there will be many majority votes for the wrong class, reducing the ensemble's accuracy

ensemble methods work best when the predictors are as independent from one another as possible. one way to get diverse classifiers is to train them using very different algorithms. this increases the chance that they will make very different types of errors, improving the ensemble's accuracy

scikit-learn provides a VotingClassifier class that's quite easy to use. we will load the moons dataset and split it into a training set and a test set, then create and train a voting classifier composed of three diverse classifiers:

In [1]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

x, y = make_moons(n_samples=500, noise=0.30, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)

voting_clf.fit(x_train, y_train)

lets see how it works internally :

scikit-learn doesn't use the original estimators directly. when you fit a VotingClassifier it:
clones each estimator using sklearn.base.clone() so it doesn't alter your originals
trains these cloned estimators
keeps the fitted clones in voting_clf.estimators_ (also available by name in voting_clf.named_estimators_)

In [2]:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(x_test,y_test))

lr = 0.864
rf = 0.896
svc = 0.896


named_estimators_ = a dict of the fitted estimators, keyed by name
this loop prints each individual model's accuracy on the test set

now lets see the results collectively :

In [3]:
[clf.predict(x_test[:1]) for clf in voting_clf.estimators_]

[array([1], dtype=int64), array([1], dtype=int64), array([0], dtype=int64)]

this line gets the predictions for the first instance from each model in the ensemble
log reg predicts class 1
rf predicts class 1
svm predicts class 0

hard voting result :

In [4]:
voting_clf.predict(x_test[:1])

array([1], dtype=int64)

the majority votes for class 1, so the ensemble's prediction is class 1
now lets see the ensemble accuracy :

In [5]:
voting_clf.score(x_test,y_test)

0.912

91.2% accuracy: the voting classifier outperforms all the individual classifiers

now let's switch from hard voting to soft voting, which often gives better performance:

what is soft voting :
instead of counting one vote per classifier from predict(), soft voting uses the probability estimates from predict_proba()
for each class it averages the predicted probabilities across all classifiers
the final prediction is the class with the highest average probability
this way, more confident predictions are weighted more heavily
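
a small sketch of the difference, using made-up probabilities from three hypothetical classifiers for one instance. two of them lean slightly toward class 1 while the third is very sure about class 0, so hard and soft voting disagree:

```python
from collections import Counter

def soft_vote(probas):
    """Average the per-class probabilities and return (best class, averages)."""
    n_classes = len(probas[0])
    averages = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: averages[c])
    return best, averages

def hard_vote_from_probas(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probas]
    return Counter(votes).most_common(1)[0][0]

# rows = classifiers, columns = [P(class 0), P(class 1)]
probas = [[0.45, 0.55], [0.40, 0.60], [0.95, 0.05]]

soft_class, averages = soft_vote(probas)
hard_class = hard_vote_from_probas(probas)
print("hard:", hard_class, "soft:", soft_class, averages)
```

hard voting picks class 1 (two votes to one), but the averaged probabilities favour class 0 because the confident classifier carries more weight.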

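all classifiers must be able to estimate class probabilities for this to work. SVC does not provide predict_proba() unless you set probability=True (this makes it use cross-validation to estimate the probabilities, which slows down training)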

In [6]:
voting_clf.voting = "soft"
voting_clf.named_estimators["svc"].probability = True
voting_clf.fit(x_train, y_train)
voting_clf.score(x_test, y_test)

0.92

92% accuracy, better than hard voting

BAGGING AND PASTING :

one way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. another way is to use the same training algorithm for every predictor but train each one on a different random subset of the training set

when sampling is performed with replacement it's called bagging (short for bootstrap aggregating): you randomly pick instances with replacement from the training set to build each subset, so some instances may appear multiple times in a single subset
goal : reduce variance by averaging out model instability

when sampling is performed without replacement it's called pasting: you still build random subsets, but each training instance appears at most once per subset
goal : the same as bagging (lower variance), usually with a little less diversity between the subsets

in other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor
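
the two sampling schemes map directly onto two functions in python's `random` module. a toy sketch with a 10-instance training set, where the seed and sizes are arbitrary:

```python
import random

rng = random.Random(42)
training_indices = list(range(10))  # a toy training set of 10 instances
subset_size = 6

# bagging: sample WITH replacement, so duplicates are possible
bagging_subset = rng.choices(training_indices, k=subset_size)

# pasting: sample WITHOUT replacement, so every index appears at most once
pasting_subset = rng.sample(training_indices, k=subset_size)

print("bagging:", sorted(bagging_subset))
print("pasting:", sorted(pasting_subset))
```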

once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all the predictors
the aggregation function is typically the statistical mode for classification and the average for regression
how it's done :

| Task Type      | Aggregation Method                   | Analogy                     |
| -------------- | ------------------------------------ | --------------------------- |
| Classification | **Statistical mode** (majority vote) | Like hard voting classifier |
| Regression     | **Average of predictions**           | Smooths out extreme values  |

each individual predictor is trained on less data, so its bias is higher than if it were trained on the full training set
but aggregating the predictions tends to cancel out random errors, which reduces both bias and variance
so the variance drops more than the bias increases
net result :
the ensemble generally has a similar (or slightly higher) bias but a lower variance than a single predictor trained on the full training set, so it generalises better overall
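
a toy simulation of why averaging lowers variance. it assumes each "predictor" outputs the true value plus independent gaussian noise, which is an idealisation: real predictors trained on overlapping data are correlated, so the gain is smaller in practice:

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)
true_value = 1.0
n_trials, n_predictors = 2000, 25

single, ensemble = [], []
for _ in range(n_trials):
    # each toy predictor returns the true value plus independent unit gaussian noise
    preds = [true_value + rng.gauss(0, 1) for _ in range(n_predictors)]
    single.append(preds[0])        # what one predictor alone would say
    ensemble.append(mean(preds))   # what the averaged ensemble would say

var_single = pvariance(single)
var_ensemble = pvariance(ensemble)
print(var_single, var_ensemble)
```

with independent errors the variance of the average is roughly the single-predictor variance divided by the number of predictors, while the mean (the bias) stays the same.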

see fig 7-4 :
as you can see, the predictors can all be trained in parallel, on different CPU cores or even different servers. similarly, predictions can be made in parallel
this is one of the reasons bagging and pasting are so popular: they scale very well

bagging and pasting in scikit learn :

scikit-learn provides the BaggingClassifier class for both bagging and pasting (and BaggingRegressor for regression)

the code below trains an ensemble of 500 decision tree classifiers, each trained on 100 training instances randomly sampled from the training set with replacement. this is bagging; if you want pasting instead, just set bootstrap=False
the n_jobs parameter tells scikit-learn the number of CPU cores to use for training and predictions, and -1 tells it to use all available cores

In [7]:
from sklearn.ensemble import BaggingClassifier #for the ensemble 
from sklearn.tree import DecisionTreeClassifier #as the base model for each ensemble member 

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), # base estimator
    n_estimators=500, # train 500 different trees; more trees give more stable predictions
    max_samples=100, # number of instances sampled to train each tree
    n_jobs=-1, # use all cpu cores
    random_state=42 # for reproducibility
)

bag_clf.fit(x_train, y_train)

A BaggingClassifier automatically performs soft voting instead
of hard voting if the base classifier can estimate class probabilities
i.e., if it has a predict_proba() method, which is the case with
decision tree classifiers.

see fig 7-5 :

it compares the decision boundary of a single decision tree with that of the bagging ensemble of 500 trees, both trained on the moons dataset. as you can see, the ensemble's predictions will likely generalise much better than the single decision tree's
the ensemble has a comparable bias but a smaller variance
it makes roughly the same number of errors on the training set, but its decision boundary is less irregular

| Feature     | **Bagging**                      | **Pasting**                             |
| ----------- | -------------------------------- | --------------------------------------- |
| Sampling    | With replacement                 | Without replacement                     |
| Diversity   | Higher (same samples can repeat) | Lower (each instance used at most once) |
| Correlation | Lower (predictors less similar)  | Higher (predictors more similar)        |
| Bias        | Slightly **higher**              | Slightly **lower**                      |
| Variance    | **Lower** due to more diversity  | Higher                                  |
| Overall     | Often better generalization      | Sometimes slightly better fit           |

bagging is often preferred because :
more diversity between the predictors leads to less correlation between their errors, which ultimately lowers the ensemble's variance even if each individual model has slightly higher bias
it's not a rule though. if you have the time and CPU power, you can use cross-validation to compare bagging and pasting on your specific task
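
one possible way to run that comparison, sketched with `cross_val_score` on the moons dataset. the fold count, the number of trees and the `make_ensemble` helper are arbitrary choices for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

def make_ensemble(bootstrap):
    # bootstrap=True -> bagging, bootstrap=False -> pasting
    return BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=100, max_samples=100,
        bootstrap=bootstrap, random_state=42)

bagging_scores = cross_val_score(make_ensemble(True), X, y, cv=5)
pasting_scores = cross_val_score(make_ensemble(False), X, y, cv=5)

print("bagging:", bagging_scores.mean())
print("pasting:", pasting_scores.mean())
```

on a dataset this small the two usually end up very close, so the difference is only worth acting on if it is larger than the spread between folds.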

OUT OF BAG EVALUATION :

with bagging, some training instances may be sampled several times for any given predictor, while others may not be sampled at all. by default a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set
mathematically, only about 63% of the training instances are sampled on average for each predictor. the remaining 37% that are not sampled are called out-of-bag (OOB) instances, and they are not the same 37% for every predictor
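
where the 63% comes from: the chance that a given instance is never picked in m draws is (1 - 1/m)^m, which approaches 1/e (about 37%) as m grows. a quick check, both with the formula and by simulation (helper names and seed are made up):

```python
import math
import random

def expected_sampled_fraction(m):
    """Chance that a given instance is drawn at least once in m draws with replacement."""
    return 1 - (1 - 1 / m) ** m

def simulated_sampled_fraction(m, rng):
    """Draw m indices with replacement and measure the share of distinct ones."""
    drawn = {rng.randrange(m) for _ in range(m)}
    return len(drawn) / m

rng = random.Random(42)
for m in (10, 100, 10_000):
    print(m, round(expected_sampled_fraction(m), 4),
          round(simulated_sampled_fraction(m, rng), 4))

limit = 1 - math.exp(-1)
print("limit:", round(limit, 4))
```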

since a predictor never sees its OOB instances during training, they can be used as a test set for that predictor
for each training instance, some estimators will not have seen it during training. those are the OOB estimators for that instance
we can use these OOB estimators to make a prediction for that specific training instance (if an instance wasn't used to train 200 of the 500 trees, those 200 trees can be used to predict its label)
this gives us one prediction per training instance, based solely on the models that didn't see it during training

benefits :
no need for a separate validation set
lets you estimate the ensemble's performance as a by-product of training
helps avoid overfitting to a validation set 

lets see the code part :

In [8]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    oob_score=True, #enables out of bag evaluation
    n_jobs=-1,
    random_state=42
)

bag_clf.fit(x_train, y_train)

for each of the 500 trees, the model samples the training set with replacement
for each training instance, the model averages the predictions of the trees for which that instance was OOB
the overall OOB accuracy is calculated from how well those averaged predictions match the true labels

In [9]:
bag_clf.oob_score_

0.896

according to the OOB evaluation, this BaggingClassifier is likely to achieve about 89.6% accuracy on the test set. let's verify this :

In [10]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred)

0.92

we get 92% accuracy on the test set, so the OOB evaluation was a bit too pessimistic, just over 2% too low

when the base estimator supports predict_proba(), the oob_decision_function_ attribute returns the OOB-predicted class probabilities for each training instance

In [11]:
bag_clf.oob_decision_function_[:3]

array([[0.32352941, 0.67647059],
       [0.3375    , 0.6625    ],
       [1.        , 0.        ]])

interpretation :

instance 1 :
32.4% probability of class 0
67.6% probability of class 1, so class 1 is the likely prediction

this shows how confident the OOB estimators are for each training instance

RANDOM PATCHES AND RANDOM SUBSPACES :

let's look at two techniques for improving ensemble diversity by sampling features, not just training instances :

the idea is to make individual predictors less correlated by giving each one access to only a random subset of the features, not the full feature set
this is particularly helpful with high-dimensional inputs, where some features may dominate every model if they are always present

| Hyperparameter       | Purpose                                                                  | Works like    | Applies to   |
| -------------------- | ------------------------------------------------------------------------ | ------------- | ------------ |
| `max_features`       | Max number (or fraction) of features each base estimator sees            | `max_samples` | **Features** |
| `bootstrap_features` | Whether to sample **with** replacement (`True`) or **without** (`False`) | `bootstrap`   | **Features** |


| Technique            | Description                                                                                                         |
| -------------------- | ------------------------------------------------------------------------------------------------------------------- |
| **Random Patches**   | Sample **both** training instances and features (rows & columns). Combines instance and feature sampling.           |
| **Random Subspaces** | Sample **only features** (not training instances). Useful to reduce overfitting in high-dimensional feature spaces. |


how it helps with high-dimensional data (like images) :
training speeds up, as each model uses fewer features
predictor diversity increases, as each model sees different features
this trades a bit more bias (each model may individually lack key features) for a lower variance, which reduces overfitting
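
a sketch of how the two techniques could be configured with BaggingClassifier. the synthetic 20-feature dataset and the 0.5 / 0.7 fractions are arbitrary values picked for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# random subspaces: keep all training instances, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5, random_state=42)

# random patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5, random_state=42)

subspaces_clf.fit(X, y)
patches_clf.fit(X, y)

# estimators_features_ lists the feature indices each tree was given
print(subspaces_clf.estimators_features_[0])
```

with max_features=0.5 and 20 input features, every tree is handed 10 feature indices (possibly with repeats, since bootstrap_features=True).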

RANDOM FORESTS :

as we know, a random forest is an ensemble of decision trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set
instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier class, which is more convenient and optimized for decision trees
the following code trains a random forest classifier with 500 decision trees, each limited to a maximum of 16 leaf nodes, using all available CPU cores :

In [12]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)

rnd_clf.fit(x_train, y_train)

y_pred_rf = rnd_clf.predict(x_test)

the random forest algorithm introduces extra randomness when growing trees :
when training a regular decision tree (DecisionTreeClassifier), the algorithm evaluates all features at each node and picks the best split among them
but in a random forest, at each node it randomly selects a subset of the features and chooses the best split only from this subset, not the full set. this is called feature bagging or random feature selection

by default it samples √n features, where n is the total number of features. this results in greater tree diversity, which (again) trades a higher bias for a lower variance, generally yielding an overall better model

the BaggingClassifier below is equivalent to the RandomForestClassifier given above :

In [13]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42)

EXTRA TREES :

when growing a tree in a random forest, only a random subset of the features is considered for splitting at each node. it is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds. for this, set splitter="random" when creating a DecisionTreeClassifier

This increases tree randomness, which:
Reduces correlation between trees.
Lowers ensemble variance
Increases bias (because split thresholds aren’t optimized).
Often improves generalization by preventing overfitting.

| Technique     | Feature Selection | Threshold Selection | Diversity | Bias   | Variance |
| ------------- | ----------------- | ------------------- | --------- | ------ | -------- |
| Decision Tree | All               | Optimal             | Low       | Low    | High     |
| Random Forest | Random subset     | Optimal             | Medium    | Medium | Medium   |
| Extra-Trees   | Random subset     | **Random**          | High      | Higher | Lower    |

a forest of such extremely random trees is called an extremely randomized trees (or extra-trees) ensemble. it also makes extra-trees classifiers much faster to train than regular random forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree

you can create an extra-trees classifier using scikit-learn's ExtraTreesClassifier class. its API is almost identical to RandomForestClassifier, except that bootstrap defaults to False. the same goes for ExtraTreesRegressor versus RandomForestRegressor
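
a short sketch with scikit-learn's ExtraTreesClassifier on the moons dataset. the hyperparameter values are illustrative only, chosen to mirror the random forest example:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

extra_clf = ExtraTreesClassifier(n_estimators=100, max_leaf_nodes=16, random_state=42)
extra_clf.fit(X_train, y_train)

print(extra_clf.bootstrap)  # False by default: each tree sees the whole training set
test_accuracy = extra_clf.score(X_test, y_test)
print(test_accuracy)
```

it is hard to tell in advance whether extra-trees or a random forest will do better on a given task, so comparing both with cross-validation is the usual way to decide.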

FEATURE IMPORTANCE :

one of the great qualities of random forests is that they make it easy to measure the relative importance of each feature: scikit-learn looks at how much the tree nodes that use that feature reduce impurity on average, across all trees in the forest
in simple words, the total impurity reduction that a feature contributes across all splits is used to judge its importance

If a node affects a large number of training samples, the impurity reduction at that node is more significant.
So, each node’s impurity reduction is weighted by how many samples passed through it.
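
a hand-computed sketch of that weighting for a single split, using gini impurity. the class counts are made up; scikit-learn then sums these contributions per feature, averages them over the trees and normalises them:

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity reduction of one split, weighted by the share of samples reaching the node."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    decrease = (gini(parent)
                - n_left / n_parent * gini(left)
                - n_right / n_parent * gini(right))
    return n_parent / n_total * decrease

# the same kind of split at a node holding 100 samples vs one holding only 20 (out of 200)
big_node = weighted_impurity_decrease([50, 50], [45, 5], [5, 45], n_total=200)
small_node = weighted_impurity_decrease([10, 10], [9, 1], [1, 9], n_total=200)
print(big_node, small_node)
```

both splits reduce the gini impurity by the same amount (0.32), but the one that affects more samples contributes five times as much to the feature's importance.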

scikit-learn computes this score automatically for each feature after training, then scales the results so that the importances sum to 1. you can access the result through the feature_importances_ variable

lets see it :

In [14]:
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score,2), name)

0.11 sepal length (cm)
0.02 sepal width (cm)
0.44 petal length (cm)
0.42 petal width (cm)


we used the iris dataset and found that the most important features are petal length (44%) and petal width (42%), while sepal length and sepal width are much less important in comparison (11% and 2% respectively)

similarly, if we train a random forest classifier on the MNIST dataset and plot each pixel's importance, we get the image in fig 7-6

random forests are very handy for getting a quick understanding of which features actually matter, in particular if you need to perform feature selection

BOOSTING :

boosting (originally called hypothesis boosting) refers to any ensemble method that combines several weak learners into a strong learner
the general idea is to train the predictors sequentially, each one trying to correct the errors made by its predecessor
this is different from bagging (and random forests), where models are trained independently and can be trained in parallel. boosting trains them one after another

there are many boosting methods, but the most popular are AdaBoost (short for adaptive boosting) and gradient boosting. let's look at them now :

ADABOOST :

one way for a new predictor to correct its predecessor is to pay more attention to the training instances that the predecessor underfit. this results in new predictors that focus more and more on the hard cases. this is the technique used by adaboost

adaboost starts by training an initial base model, let's say a decision tree
it uses this model to make predictions on the entire training set
training instances that were misclassified by the first model are given higher weights
these weights tell the next model that these points are more important and need more attention
the process repeats: each new model is trained on a reweighted version of the training set, until a stopping condition is met



see fig 7-8 :

it shows the decision boundaries of 5 consecutive predictors on the moons dataset, where each predictor is an SVM classifier with an RBF kernel. the first classifier gets many instances wrong, so their weights get boosted
the second classifier therefore does a better job on these instances, and so on

Learning rate = 0.5 (right plot): the weight updates are more conservative.
Learning rate = 1 (left plot): weight updates are stronger.
Lower learning rates result in slower adaptation by each model, often leading to smoother, less overfit boundaries.

Gradient descent tweaks the parameters of a single model to reduce error.
AdaBoost, by contrast, adds entire models one-by-one, each correcting the errors of its predecessor.
This makes AdaBoost resemble gradient descent conceptually — but at the ensemble level, not the parameter level.

Left Plot (learning_rate = 1):
Decision boundaries from 5 predictors are more adaptive and complex.
The ensemble rapidly fits the hard points (but may overfit).

Right Plot (learning_rate = 0.5):
Decision boundaries evolve more smoothly across iterations.
The model learns more conservatively, reducing the risk of overfitting.

Here's the main difference from bagging:
In AdaBoost, each model is not treated equally.
Models that performed better (i.e., had lower error on the weighted data) are given more importance (higher weight) in the final prediction.
The final prediction is a weighted vote (for classification) or weighted average (for regression).

one important drawback of this sequential learning technique is that training cannot be parallelized: each predictor can only be trained after the previous one has been trained and evaluated. as a result, it does not scale as well as bagging or pasting

now lets look at adaboost algorithm :

initial weights :
we have m training instances and every instance weight w(i) starts at the same value, 1/m. this means that at the start all data points are equally important, see copy 1

train the first predictor :
the algorithm trains a base model on the weighted training set
this model predicts ŷ_j(i) for each training instance i, where j is the predictor number

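weighted error rate r(j) :
see copy 2 :
r(j) is the sum of the weights of the instances that predictor j misclassified, divided by the sum of all the weights. this means only misclassified instances contribute to the error, and their contribution is proportional to their weight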

why weights matter :
initially all points have equal weights, so it's like a normal error rate calculation
but in later iterations misclassified points get higher weights, so the error rate reflects how well the predictor does on the harder cases

now we assign a weight α(j) to the predictor to decide how much influence it has in the final ensemble, see copy 3
η is the learning rate hyperparameter (default 1, can be smaller to slow learning)
if r(j) is small (the predictor is good), α(j) is large -> the predictor gets more influence in the final vote
if r(j) = 0.5 (random guessing), α(j) = 0 -> the predictor has no influence
if r(j) > 0.5 (worse than random), α(j) becomes negative -> the predictor's votes are effectively flipped in the ensemble

after each predictor is trained, adaboost increases the focus on the examples it got wrong, so that the next predictor will try harder to classify them correctly
see copy 4
if the instance was classified correctly, its weight is left unchanged
if the instance was classified incorrectly, its weight is increased, meaning that this instance will count more when training the next predictor

then we normalize the weights: after boosting the misclassified instances, the total sum of the weights becomes greater than 1, but adaboost treats the weights as a probability distribution over the training set, so we divide each weight by the sum of all the weights, see copy 5
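
the "copy" references above point to formulas that aren't reproduced in these notes. assuming they are the standard adaboost equations (the copy numbering in the comments is a guess), with w^(i) the weight of instance i, ŷ_j^(i) the j-th predictor's prediction for instance i, and η the learning rate, they can be written as:

```latex
% copy 1: initial instance weights
w^{(i)} = \frac{1}{m} \quad \text{for } i = 1, \dots, m

% copy 2: weighted error rate of the j-th predictor
r_j = \frac{\sum_{i \,:\, \hat{y}_j^{(i)} \neq y^{(i)}} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}

% copy 3: predictor weight
\alpha_j = \eta \, \log \frac{1 - r_j}{r_j}

% copy 4: instance weight update
w^{(i)} \leftarrow
\begin{cases}
w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\
w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)}
\end{cases}

% copy 5: normalization
w^{(i)} \leftarrow \frac{w^{(i)}}{\sum_{k=1}^{m} w^{(k)}}
```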

now with the new normalized weights train another weak learner 
this new model will focus more on the samples that were misclassified in previous rounds because they now have higher weights 

Repeat :
Compute the new predictor’s weight 𝛼j (based on its error rate).
Update the instance weights (boost the ones it got wrong).
Normalize the weights again.
Train the next weak learner.

Stopping Conditions :
Option 1: You’ve reached the maximum number of predictors you planned for.
Option 2: You find a perfect predictor (zero classification error) — in that case, adding more models is pointless.

the whole algo :
give all training instances equal weights
train a simple model (eg: a decision stump) using the current weights
measure its weighted error rate r(j); a smaller r(j) means the predictor is more accurate
compute the predictor's weight α(j), ie its influence in the final vote
update the instance weights
normalize the weights
repeat from the training step until a stopping condition is met
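
the steps above can be sketched in plain python. this is a toy version under several assumptions: one numeric feature, binary 0/1 labels, threshold stumps as weak learners, and made-up helper names. it is not how scikit-learn implements adaboost:

```python
from math import exp, log

def fit_stump(xs, ys, weights):
    """Find the 1-D threshold stump with the lowest weighted error rate."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in thresholds:
        for left_class in (1, 0):  # class predicted when x < threshold
            preds = [left_class if x < threshold else 1 - left_class for x in xs]
            wrong = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            error = wrong / sum(weights)
            if best is None or error < best[0]:
                best = (error, threshold, left_class)
    return best

def stump_predict(stump, x):
    _, threshold, left_class = stump
    return left_class if x < threshold else 1 - left_class

def adaboost_fit(xs, ys, n_estimators=10, learning_rate=1.0):
    m = len(xs)
    weights = [1 / m] * m                         # equal initial weights
    ensemble = []                                 # list of (alpha, stump)
    for _ in range(n_estimators):
        stump = fit_stump(xs, ys, weights)        # train on the current weights
        r = stump[0]                              # weighted error rate
        if r == 0:                                # perfect predictor: use it alone
            return [(1.0, stump)]
        alpha = learning_rate * log((1 - r) / r)  # predictor weight
        ensemble.append((alpha, stump))
        # boost the misclassified instances, then normalize
        weights = [w * exp(alpha) if stump_predict(stump, x) != y else w
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    votes = {0: 0.0, 1: 0.0}                      # weighted votes per class
    for alpha, stump in ensemble:
        votes[stump_predict(stump, x)] += alpha
    return max(votes, key=votes.get)

xs = list(range(10))
ys = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]               # no single stump can separate this
ensemble = adaboost_fit(xs, ys, n_estimators=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

no single stump gets this toy dataset right, but three reweighted stumps voting together classify every training point correctly.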

now, to make predictions, adaboost computes the predictions of all the predictors and weighs them using the predictor weights α(j). the predicted class is the one that receives the majority of weighted votes, see copy 7

Once all predictors are trained:
Get predictions from each predictor 𝑦j(x) for the new input 𝑥
Weight each prediction by 𝛼(j)
Sum weights per class:
pick the class with max score 




scikit-learn uses a multiclass version of adaboost called SAMME
SAMME (stagewise additive modeling using a multiclass exponential loss function)
it is similar to adaboost, but the weight calculation is adjusted so that it can handle more than two classes
for binary classification, SAMME is equivalent to adaboost (no difference)

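SAMME.R (the R stands for "real") :
instead of using the hard predictions (like class 'A' or class 'B'), it uses the predicted class probabilities from each base model (the predict_proba() output)
this allows finer weight updates (eg: a model that is 90% sure counts differently from one that is 51% sure)
note that SAMME.R was deprecated in scikit-learn 1.4, so recent versions only use SAMME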

lets see the code now :

In [15]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=30, 
    learning_rate=0.5, random_state=42)

ada_clf.fit(x_train, y_train)



the above code trains an adaboost classifier based on 30 decision stumps. a decision stump is a decision tree with max_depth=1, in other words a tree made of a single decision node plus two leaf nodes
if your adaboost ensemble is overfitting the training set, you can try reducing the number of estimators or more strongly regularizing the base estimator

GRADIENT BOOSTING :

just like adaboost, gradient boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. but instead of tweaking the instance weights at every iteration like adaboost does, it keeps the weights the same and fits the new predictor to the residual errors made by the previous predictor
you can think of it as correcting the leftover mistakes step by step

now let's go through a regression example using decision trees as the base predictors. this is called gradient tree boosting, or gradient boosted regression trees (GBRT). first we generate a noisy quadratic dataset and fit a DecisionTreeRegressor to it

In [16]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
x = np.random.rand(100,1) - 0.5
y = 3 * x[:,0]**2 + 0.05 * np.random.randn(100) #y = 3x^2 + gaussian noise 

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(x,y)

now we train a second DecisionTreeRegressor on the residual errors made by the first predictor :

In [17]:
y2 = y - tree_reg1.predict(x)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
tree_reg2.fit(x, y2)

now we train a third regressor on the residual errors made by the second predictor :

In [18]:
y3 = y2 - tree_reg2.predict(x)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)
tree_reg3.fit(x, y3)

now we have an ensemble containing 3 trees. it can make predictions on a new instance simply by adding up the predictions of all the trees

In [19]:
x_new = np.array([[-0.4], [0.], [0.5]])
sum(tree.predict(x_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

array([0.49484029, 0.04021166, 0.75026781])

see fig 7-9 

the left column shows the predictions of the three individual trees, and the right column shows the ensemble's predictions
in the first row the two plots are the same, because the ensemble contains only tree 1
in the second row, the right side shows the sum of the predictions of tree 1 and tree 2
in the third row, the right side shows the sum of all three trees' predictions
you can see that the ensemble's predictions gradually get better as trees are added to the ensemble

scikit-learn has a GradientBoostingRegressor class to train GBRT ensembles more easily. just like RandomForestRegressor, it has hyperparameters to control the growth of the trees (eg: max_depth, min_samples_leaf) as well as hyperparameters to control the ensemble training, such as the number of trees (n_estimators)
the following code creates the same ensemble as the one above :

In [20]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)

gbrt.fit(x,y)

the learning_rate hyperparameter scales the contribution of each tree. if you set it to a low value such as 0.05, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalise better. this is a regularisation technique called shrinkage
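
a sketch of how shrinkage fits into the manual residual-fitting loop. it assumes a freshly generated quadratic dataset and starts the ensemble from zero, like the manual three-tree example (GradientBoostingRegressor itself starts from a constant initial prediction). `fit_gbrt` and `predict_gbrt` are made-up helpers:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

def fit_gbrt(X, y, n_trees, learning_rate):
    """Fit each tree on the residuals left by the shrunken trees before it."""
    trees, residuals = [], y.copy()
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=2, random_state=42)
        tree.fit(X, residuals)
        # shrinkage: only a fraction of each tree's prediction is removed
        residuals = residuals - learning_rate * tree.predict(X)
        trees.append(tree)
    return trees

def predict_gbrt(trees, X, learning_rate):
    """Ensemble prediction = learning rate times the sum of the tree predictions."""
    return learning_rate * sum(tree.predict(X) for tree in trees)

mse = {}
for n_trees in (3, 30):
    trees = fit_gbrt(X, y, n_trees, learning_rate=0.3)
    mse[n_trees] = np.mean((y - predict_gbrt(trees, X, 0.3)) ** 2)
    print(n_trees, mse[n_trees])
```

with a learning rate of 0.3, three trees are not enough to fit the training set, and the training error keeps dropping as more trees are added.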

see fig 7-10 
it shows two GBRT ensembles trained with different hyperparameters: the one on the left does not have enough trees to fit the training set, while the one on the right has about the right number. if we added more trees, the GBRT would start to overfit the training set

to find the optimal number of trees you can use cross-validation with GridSearchCV or RandomizedSearchCV, but there's a simpler way: if you set the n_iter_no_change hyperparameter to an integer value, let's say 10, the GradientBoostingRegressor will automatically stop adding trees during training if it sees that the last 10 trees didn't help
this is simply early stopping with a little patience: it tolerates having no progress for a few iterations before it stops
let's see the code :

In [21]:
gbrt_best = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.05, n_estimators=500, n_iter_no_change=10, random_state=42
)

gbrt_best.fit(x,y)

if you set n_iter_no_change too low, training may stop too early and the model will underfit, but if you set it too high it will overfit instead. we also set a fairly small learning rate and a high n_estimators, but early stopping ends the training well before it reaches 500 estimators :

In [22]:
gbrt_best.n_estimators_

92

when n_iter_no_change is set, the fit() method automatically splits the training set into a smaller training set and a validation set. this lets it evaluate the model's performance each time a tree is added to the ensemble. the size of the validation set is controlled by the validation_fraction hyperparameter, which is 10% by default
the tol hyperparameter sets the minimum improvement that counts as progress (anything smaller is treated as negligible). it defaults to 0.0001
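
the patience logic can be sketched on a plain list of validation losses. this is a simplified illustration of the idea (the `iterations_before_stop` helper and the loss values are made up), not scikit-learn's actual implementation:

```python
def iterations_before_stop(val_losses, n_iter_no_change, tol):
    """Return how many iterations run before patience-based early stopping kicks in."""
    best = float("inf")
    since_improvement = 0
    for i, loss in enumerate(val_losses):
        if loss < best - tol:          # improved by more than tol: real progress
            best = loss
            since_improvement = 0
        else:                          # negligible or no improvement
            since_improvement += 1
        if since_improvement >= n_iter_no_change:
            return i + 1
    return len(val_losses)

# validation loss after each added tree (made-up numbers)
losses = [1.0, 0.8, 0.7, 0.69, 0.71, 0.70, 0.72, 0.5]
stopped_at = iterations_before_stop(losses, n_iter_no_change=3, tol=1e-4)
print(stopped_at)
```

here the loss stops improving after the fourth tree, so training ends after three more non-improving trees and never reaches the last value in the list.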

there's one more hyperparam called subsample, which is 1.0 by default, meaning each tree is trained on all the training instances
if we set subsample < 1.0, each tree is trained on a random fraction of the training set
for eg: if we set it to 0.25, each tree is trained on 25% of the training instances, chosen randomly

this trades a higher bias for a lower variance :
lower variance: since each tree is trained on a different random subset, the ensemble becomes less sensitive to noise
higher bias: since each tree is trained on fewer samples, it's a bit less accurate individually

it also speeds up the training process. this is called stochastic gradient boosting
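
a minimal sketch of stochastic gradient boosting, assuming the same kind of synthetic data as a stand-in (X_demo, y_demo and gbrt_sgb are hypothetical names). each of the 100 trees sees a random 25% of the training instances :

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# hypothetical noisy quadratic dataset, only for illustration
rng = np.random.default_rng(42)
X_demo = rng.random((400, 1)) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.standard_normal(400)

gbrt_sgb = GradientBoostingRegressor(
    max_depth=2,
    learning_rate=0.05,
    n_estimators=100,
    subsample=0.25,  # each tree is trained on a random 25% of the instances
    random_state=42,
)
gbrt_sgb.fit(X_demo, y_demo)

# oob_improvement_ only exists when subsample < 1.0: it tracks the loss
# improvement on the instances each tree did not see
print(gbrt_sgb.oob_improvement_[:5])
```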

histogram-based gradient boosting :

histogram-based gradient boosting (hgb) is designed especially for large datasets. instead of working with continuous values, it bins the input features, replacing them with integers
1. each feature value is converted into an integer representing its bin index
2. the number of bins is controlled by the max_bins hyperparam, which is 255 by default
3. max_bins cannot be set higher than 255, the cap is chosen to balance precision and speed (bin indices fit in 8-bit integers, with one extra bin reserved for missing values)

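how it works :
binning :
1. each continuous feature is split into a fixed number of bins
2. eg : a feature's values range from 0 to 100 and max_bins is 10. in the simplest case it is divided into 10 equal-width bins (scikit-learn actually places the bin edges at quantiles of the training data, so the bins have equal width only when the values are evenly spread)
3. a feature value like 37.5 gets mapped to bin 3 (counting from 0)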
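
the binning step can be sketched in plain python. this is only an illustration of the idea, not scikit-learn's internal code: equal_width_edges and bin_index are hypothetical helpers, and the skewed list is made-up data :

```python
from bisect import bisect_right
from statistics import quantiles


def equal_width_edges(lo, hi, n_bins):
    """Return the n_bins - 1 interior edges of equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def bin_index(value, edges):
    """Map a value to its 0-based bin index: the number of edges <= value."""
    return bisect_right(edges, value)


# the example from the notes: values from 0 to 100, 10 bins
edges = equal_width_edges(0.0, 100.0, 10)  # [10.0, 20.0, ..., 90.0]
print(bin_index(37.5, edges))  # 3

# quantile-based edges put roughly the same number of training values
# in each bin, so the same value can land in a different bin
skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 100]
quantile_edges = quantiles(skewed, n=10)  # 9 cut points
quantile_bin = bin_index(37.5, quantile_edges)
print(quantile_bin)
```

once every value is replaced by its bin index, a split only has to be tested at the bin edges instead of at every distinct feature value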

tree training :
instead of testing all possible thresholds, the algo only tests bin boundaries, which reduces the computation massively
working with integers makes it possible to use faster and more memory-efficient data structures, and the way the bins are built removes the need for sorting the features when training each tree.

this implementation has a computational complexity of O(b×m) instead of O(n×m×log(m)), where b is the number of bins, m is the number of training instances and n is the number of features
in practice hgb can train around 100 times faster than the regular gbrt on large datasets. however, binning causes a precision loss which acts as a regulariser: depending on the dataset this may help reduce overfitting, but it may also cause underfitting

Scikit-Learn provides two classes for HGB: HistGradientBoostingRegressor and
HistGradientBoostingClassifier 
They’re similar to GradientBoostingRegressor and GradientBoostingClassifier, with a few notable differences:
1. early stopping :
in HGB early stopping is automatically enabled if your dataset has more than 10,000 instances
you can also turn it on or off yourself by setting early_stopping to True or False

2. subsampling is not supported 

3. in hgb the no of boosting iterations is controlled by max_iter instead of n_estimators 

4. HGB exposes fewer tree hyperparams than GBRT, you can only tune :
max_leaf_nodes (max no of leaf nodes per tree)
min_samples_leaf (minimum no of samples per leaf)
max_depth (max depth of each tree)


the hgb classes have two more nice features: they support both categorical features and missing values, which simplifies preprocessing a little
however, the categorical features must be represented as integers ranging from 0 to a number lower than max_bins. you can use an OrdinalEncoder for this, for example :

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
import pandas as pd

file_1 = r"C:\Users\gulsh\OneDrive\Desktop\ml weekly tasks\housing.csv"

housing = pd.read_csv(file_1)

housing_labels = housing["median_house_value"].copy()

housing_tr = housing.drop("median_house_value", axis=1)

hgb_reg = make_pipeline(
    make_column_transformer((OrdinalEncoder(), ["ocean_proximity"]), 
                            remainder="passthrough"), #keeps other feature as it is
    HistGradientBoostingRegressor(categorical_features=[0], random_state=42)
)

hgb_reg.fit(housing_tr, housing_labels)


no need for an imputer, a scaler or a one hot encoder. categorical_features must be set to the categorical column indices (here [0], since the column transformer outputs the encoded ocean_proximity column first). without any hyperparam tuning the model gives an rmse of about 47,600, which is not too bad

several other optimized implementations of gradient boosting are available in the python ML ecosystem, in particular xgboost, catboost and lightgbm. they are all specialized for gradient boosting, their apis are very similar to scikit-learn's, and they provide additional features such as gpu acceleration