### 🔹 Bagging Regression  

Bagging Regression (Bootstrap Aggregating) is an **ensemble learning technique** that aims to improve model stability and accuracy by combining predictions from multiple base regressors.  
It reduces variance and helps prevent overfitting, especially for models that are sensitive to fluctuations in the training data (like Decision Trees).  

The main idea:  
- Multiple base models (often Decision Trees) are trained on **different bootstrap samples** of the training data.  
- Each sample is drawn **with replacement**, so some data points appear multiple times while others are left out of that sample.  
- The final prediction is obtained by **averaging** the predictions of all models.  

Mathematically:  

$$
\hat{Y} = \frac{1}{N} \sum_{i=1}^{N} f_i(X)
$$  

where:  
- $f_i(X)$ is the prediction of the $i$-th base regressor,  
- $\hat{Y}$ is the final predicted value,  
- $N$ is the number of models in the ensemble.  
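
The averaging formula can be illustrated with a hand-rolled version of bagging. This is only a minimal sketch, not how `BaggingRegressor` is implemented internally: the synthetic data and the names `x_demo`, `y_demo`, `n_models`, and `x_new` are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic, illustrative data (not the notebook's dataset).
x_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(x_demo[:, 0]) + rng.normal(scale=0.3, size=200)

n_models = 25  # N in the formula
models = []
for _ in range(n_models):
    # Bootstrap sample: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(x_demo), size=len(x_demo))
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(x_demo[idx], y_demo[idx])
    models.append(tree)

# Final prediction: the plain average of the individual predictions f_i(X).
x_new = np.array([[2.5], [7.0]])
all_preds = np.stack([m.predict(x_new) for m in models])  # shape (n_models, 2)
y_hat = all_preds.mean(axis=0)
print(y_hat)
```

Each tree sees a slightly different training set, so individual predictions differ; the average is typically smoother than any single tree.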

Bagging Regression helps us to:  
- **Reduce variance** and improve model robustness.  
- **Stabilize** high-variance models like Decision Trees.  
- Work effectively with **non-linear and noisy data**.  
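
The variance-reduction claim can be made concrete with a standard result. As an assumption for illustration (the symbols $\sigma^2$ and $\rho$ are not defined elsewhere in this notebook), suppose each $f_i(X)$ has variance $\sigma^2$ and any two of them have correlation $\rho$. Then:

$$
\operatorname{Var}\left(\frac{1}{N} \sum_{i=1}^{N} f_i(X)\right) = \rho\,\sigma^2 + \frac{1-\rho}{N}\,\sigma^2
$$

As $N$ grows the second term shrinks toward zero, so averaging removes the part of the variance that the models do not share. The first term remains, which is why less correlated base models give a larger benefit.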

In this notebook, we will implement **Bagging Regression** and compare its performance with a single Decision Tree and Random Forest 🚀.  


# --------------------------------------------------------------------------

# import dataset

In [1]:
# from google.colab import files, drive

# up = files.upload()
# drive.mount('/content/drive')

In [2]:
import pandas as pd

df = pd.read_csv('dataset.csv')
df.head(3)

Unnamed: 0,A,B,C,T
0,2.0,4,8.5,196
1,2.4,4,9.6,221
2,1.5,4,5.9,136


In [3]:
# df.info()

# cleaning

In [4]:
# clean the data

# encoding

In [5]:
# encode the data

# define x , y

In [6]:
import numpy as np

x = df[['A', 'B', 'C']].values  # 2D
y = df['T'].values              # 1D

# splitting

In [7]:
# # finding best random state 

# from sklearn.model_selection import train_test_split
# from sklearn.ensemble import BaggingRegressor
# from sklearn.tree import DecisionTreeRegressor
# base_model = DecisionTreeRegressor()
# br = BaggingRegressor(estimator=base_model)
# from sklearn.metrics import r2_score

# import time
# t1 = time.time()
# lst = []
# for i in range(1,10):
#     x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=i)
#     br.fit(x_train, y_train)
#     yhat_test = br.predict(x_test)
#     r2 = r2_score(y_test, yhat_test)
#     lst.append(r2)
# t2 = time.time()

# print(f"run time: {round((t2 - t1)/60, 2)} min")
# print(f"r2_score: {round(max(lst), 2)}")
# rs = np.argmax(lst) + 1
# print(f"random_state: {rs}")

In [8]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

# scaling

In [9]:
# tree-based models (such as bagged Decision Trees) do not typically require feature scaling

# fit the model

In [10]:
### K-fold cross-validation (grid search)

# from sklearn.ensemble import BaggingRegressor
# from sklearn.tree import DecisionTreeRegressor
# from sklearn.model_selection import GridSearchCV

# parameters = {
#     '': [],
#     '': []
# }

# br = BaggingRegressor()
# gs = GridSearchCV(estimator=br, param_grid=parameters, cv=5)

# gs.fit(x_train, y_train)

# best_params = gs.best_params_
# print(best_params)

In [11]:
# default parameters of BaggingRegressor
# estimator=None, n_estimators=10, *, max_samples=1.0, max_features=1.0, bootstrap=True, 
# bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0

In [12]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

base_model = DecisionTreeRegressor()

br = BaggingRegressor(estimator=base_model, n_estimators=100, random_state=42)
br.fit(x_train, y_train)

# predict test data

In [13]:
yhat_test = br.predict(x_test)

# evaluate the model

In [14]:
from sklearn.metrics import r2_score

print("r2-score (train data): %0.4f" % r2_score(y_train, br.predict(x_train)))
print("r2-score (test data): %0.4f" % r2_score(y_test, yhat_test))

r2-score (train data): 0.9926
r2-score (test data): 0.9613


In [15]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

yhat_train = br.predict(x_train)  # predict once and reuse below

print(f"MSE (train data): {mean_squared_error(y_train, yhat_train)}")
print(f"RMSE (train data): {np.sqrt(mean_squared_error(y_train, yhat_train))}")
print(f"MAE (train data): {mean_absolute_error(y_train, yhat_train)}")
print('------------')
print(f"MSE (test data): {mean_squared_error(y_test, yhat_test)}")
print(f"RMSE (test data): {np.sqrt(mean_squared_error(y_test, yhat_test))}")
print(f"MAE (test data): {mean_absolute_error(y_test, yhat_test)}")

MSE (train data): 29.230014952164993
RMSE (train data): 5.406478979166107
MAE (train data): 1.5520581733891106
------------
MSE (test data): 160.42008965654398
RMSE (test data): 12.665705256974205
MAE (test data): 4.284602466821569


# save the model

In [16]:
# import joblib

# joblib.dump(br, 'br_model.pkl')

# load the model

In [17]:
# import joblib

# br = joblib.load('br_model.pkl')