### 🔹 Extra Trees Regression  

Extra Trees Regression (Extremely Randomized Trees) is an **ensemble learning method** similar to Random Forest, but with **more randomization** during tree construction.  
It aims to reduce variance and improve generalization by averaging many weakly correlated trees.  

The main idea:  
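- Like Random Forest, it trains many decision trees and can consider a random subset of features at each split. By default, scikit-learn's `ExtraTreesRegressor` fits each tree on the whole training set (`bootstrap=False`) rather than on a bootstrap sample.  
- However, **Extra Trees draw split thresholds at random** for each candidate feature and keep the best of those random splits, instead of searching for the optimal threshold.  
- This added randomness often makes training faster and can reduce overfitting.  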
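
The split rule described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the helper name `random_split`, its arguments, and the toy data are all assumptions made for the example.

```python
import numpy as np

def random_split(X, y, n_candidate_features, rng):
    """Pick one node split with random thresholds (hypothetical helper)."""
    n_samples, n_features = X.shape
    # Choose which features are considered at this node.
    features = rng.choice(n_features, size=n_candidate_features, replace=False)
    best = None
    for j in features:
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:
            continue  # constant feature, nothing to split on
        # Draw the threshold uniformly at random instead of searching for the best one.
        threshold = rng.uniform(lo, hi)
        left = X[:, j] <= threshold
        right = ~left
        # Score the random split by the weighted variance of the two children.
        score = (left.sum() * y[left].var() + right.sum() * y[right].var()) / n_samples
        if best is None or score < best[0]:
            best = (score, int(j), float(threshold))
    return best  # (weighted child variance, feature index, threshold) or None

# Toy data (assumed): the target depends mostly on feature 0.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

split = random_split(X_toy, y_toy, n_candidate_features=3, rng=rng)
print(split)
```

A Random Forest would instead scan the possible thresholds of each candidate feature; skipping that search is where the potential speed-up comes from.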

Mathematically, the final prediction is the **average** of all tree predictions:  

$$
\hat{Y} = \frac{1}{N} \sum_{i=1}^{N} f_i(X)
$$  

where:  
- $f_i(X)$ is the prediction of the $i$-th tree,  
- $\hat{Y}$ is the final predicted value,  
- $N$ is the total number of trees.  
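
The averaging formula can be checked against a fitted model. The sketch below assumes synthetic data from `make_regression` (a stand-in, not the notebook's dataset) and uses the `estimators_` attribute, which holds the individual fitted trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in data (assumed for illustration).
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
model = ExtraTreesRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the predictions of the N individual trees by hand ...
manual = np.mean([tree.predict(X_demo) for tree in model.estimators_], axis=0)

# ... and compare with the ensemble's own prediction.
matches = np.allclose(manual, model.predict(X_demo))
print(matches)
```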

Extra Trees Regression helps us to:  
- Achieve **high accuracy** with less variance than individual decision trees.  
- Often train **faster** than Random Forest, since it does not search for the optimal threshold of each feature.  
- Handle **non-linear relationships** and **noisy datasets** effectively.  

In this notebook, we will implement **Extra Trees Regression** and compare its performance with Random Forest and Gradient Boosting models 🚀.  
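
One possible shape for such a comparison is sketched below. It is only an outline under stated assumptions: the data is synthetic (`make_regression`), the models use mostly default settings, and 5-fold cross-validated R² is chosen as the yardstick. Results on the real dataset may rank the models differently.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumed for illustration).
X_demo, y_demo = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)

models = {
    "Extra Trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Score every model with the same 5-fold cross-validation.
mean_r2 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    mean_r2[name] = scores.mean()
    print(f"{name}: mean r2 = {mean_r2[name]:.3f}")
```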


# --------------------------------------------------------------------------

# import dataset

In [1]:
# from google.colab import files, drive

# up = files.upload()
# drive.mount('/content/drive')

In [2]:
import pandas as pd

df = pd.read_csv('dataset.csv')
df.head(3)

|   | A | B | C | T |
|---|---|---|---|---|
| 0 | 2.0 | 4 | 8.5 | 196 |
| 1 | 2.4 | 4 | 9.6 | 221 |
| 2 | 1.5 | 4 | 5.9 | 136 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067 entries, 0 to 1066
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       1067 non-null   float64
 1   B       1067 non-null   int64  
 2   C       1067 non-null   float64
 3   T       1067 non-null   int64  
dtypes: float64(2), int64(2)
memory usage: 33.5 KB


# cleaning

In [4]:
# clean the data

# encoding

In [5]:
# encode the data

# define x, y

In [6]:
import numpy as np

x = df[['A', 'B', 'C']].values  # 2D
y = df['T'].values                 # 1D

# splitting

In [7]:
# # finding best random state 

# from sklearn.model_selection import train_test_split
# from sklearn.ensemble import ExtraTreesRegressor
# etr = ExtraTreesRegressor()
# from sklearn.metrics import r2_score

# import time
# t1 = time.time()
# lst = []
# for i in range(1,10):
#     x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=i)
#     etr.fit(x_train, y_train)  # fit on the training split only, so the test score is not inflated
#     yhat_test = etr.predict(x_test)
#     r2 = r2_score(y_test, yhat_test)
#     lst.append(r2)
# t2 = time.time()

# print(f"run time: {round((t2 - t1)/60, 2)} min")
# print(f"r2_score: {round(max(lst), 2)}")
# rs = np.argmax(lst) + 1
# print(f"random_state: {rs}")

In [8]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=5)
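
The score of a single split depends on the chosen `random_state`. As a hedged complement to picking one seed, the sketch below repeats the split over several seeds and reports the mean and spread of the test R² rather than only the maximum. The synthetic data and the seed range are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed for illustration).
X_demo, y_demo = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)

scores = []
for seed in range(1, 6):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed
    )
    # Fit on the training part only, then score on the held-out part.
    model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    scores.append(r2_score(y_te, model.predict(X_te)))

scores = np.array(scores)
print(f"r2 per seed: {scores.round(3)}")
print(f"mean r2: {scores.mean():.3f}, std: {scores.std():.3f}")
```

A large spread across seeds would suggest that the score of any single split, including the best one, should be read with caution.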

# fit the model

In [9]:
# # k-fold cross validation

# from sklearn.ensemble import ExtraTreesRegressor
# from sklearn.model_selection import GridSearchCV

# parameters = {
#     # 'n_estimators': [100],
#     'max_depth': [10, 20],
#     'min_samples_split': [2, 5],
#     'min_samples_leaf': [1, 2],
#     'max_features': [0.5, 'sqrt'],
# }

# et = ExtraTreesRegressor()
# gs = GridSearchCV(estimator=et, param_grid=parameters, cv=5)

# gs.fit(x_train, y_train)

# best_params = gs.best_params_
# print(best_params)

# # parameters = {
# #     'n_estimators': [50, 100, 200],
# #     'max_depth': [None, 10, 20, 30],
# #     'min_samples_split': [2, 5, 10],
# #     'min_samples_leaf': [1, 2, 4],
# #     'max_features': [1.0, 'sqrt', 0.5],  # 'auto' is no longer accepted by recent scikit-learn releases
# #     'bootstrap': [False, True],
# #     'ccp_alpha': [0.0, 0.01, 0.1]
# # }
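
For reference, a small runnable version of the grid search is sketched below. The grid is deliberately tiny and the data is synthetic (`make_regression`); both are assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumed for illustration).
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# A deliberately small grid: 2 x 2 combinations, 3 folds each.
param_grid = {"max_depth": [5, None], "min_samples_leaf": [1, 4]}

search = GridSearchCV(
    estimator=ExtraTreesRegressor(n_estimators=20, random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print(search.best_params_)           # best combination found on this grid
print(round(search.best_score_, 3))  # its mean cross-validated r2
tuned = search.best_estimator_       # refitted on all of X_demo, y_demo
```

Since `refit=True` is the default, `best_estimator_` can be used directly for prediction without fitting it again.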

In [10]:
# default parameters of ExtraTreesRegressor
# n_estimators=100, *, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, 
# min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, 
# bootstrap=False, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, 
# ccp_alpha=0.0, max_samples=None, monotonic_cst=None

In [11]:
from sklearn.ensemble import ExtraTreesRegressor

etr = ExtraTreesRegressor(
    # max_depth=20, max_features=0.5, min_samples_leaf=1, min_samples_split=2
)
etr.fit(x_train, y_train)

# predict test data

In [12]:
yhat_test = etr.predict(x_test)

# evaluate the model

In [13]:
from sklearn.metrics import r2_score

print("r2-score (train data): %0.4f" % r2_score(y_train, etr.predict(x_train)))
print("r2-score (test data): %0.4f" % r2_score(y_test, yhat_test))

r2-score (train data): 0.9956
r2-score (test data): 0.9812


In [14]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

print(f"MSE (train data): {mean_squared_error(y_train, etr.predict(x_train))}")
print(f"RMSE (train data): {np.sqrt(mean_squared_error(y_train, etr.predict(x_train)))}")
print(f"MAE (train data): {mean_absolute_error(y_train, etr.predict(x_train))}")
print('------------')
print(f"MSE (test data): {mean_squared_error(y_test, yhat_test)}")
print(f"RMSE (test data): {np.sqrt(mean_squared_error(y_test, yhat_test))}")
print(f"MAE (test data): {mean_absolute_error(y_test, yhat_test)}")

MSE (train data): 17.397084270833332
RMSE (train data): 4.170981211997164
MAE (train data): 0.7118229166666663
------------
MSE (test data): 80.05026464315442
RMSE (test data): 8.947081347744327
MAE (test data): 1.9342509363295886
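
To make the metrics concrete, the sketch below writes them out in plain NumPy. The helper `regression_report` and the three-point example are assumptions for illustration; on real data the scikit-learn functions used above are the better choice.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Compute r2, MSE, RMSE and MAE from their definitions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                 # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(residuals)),
    }

# Tiny worked example: the residuals are 1, 0 and -2.
report = regression_report([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(report)
```

Because RMSE squares the residuals before averaging, it reacts more strongly to a few large errors than MAE does. An RMSE well above the MAE, as in the output above, is consistent with most predictions being close and a small number being far off.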


# save the model

In [15]:
# import joblib
# joblib.dump(etr, 'etr_model.pkl')

# load the model

In [16]:
# import joblib
# etr = joblib.load('etr_model.pkl')
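
A quick way to confirm that saving and loading work is a round trip: dump the model, load it back, and compare predictions. The sketch below assumes synthetic data and writes to a temporary directory so that no file is left behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in data (assumed for illustration).
X_demo, y_demo = make_regression(n_samples=100, n_features=3, random_state=0)
model = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "etr_model.pkl")
    joblib.dump(model, path)     # save the fitted model
    restored = joblib.load(path)  # load it back into a new object

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickled models are generally only safe to load with the same scikit-learn version that wrote them, and only from sources you trust.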