# Project - Day 9 - Ensemble Methods

> Today's goals:
1. Apply bagging (Random Forest, an ensemble of Decision Tree Regressors), evaluate its performance, and check whether it can be deployed.
2. Apply boosting algorithms (XGBoost, LightGBM, AdaBoost) and compare model performance.
3. Apply stacking with Random Forest, XGBoost, and LightGBM as base models and a simple model such as Linear Regression as the final estimator.
4. Write the documentation, deploy the best and simplest model, and finalize the project.

In [23]:
import pandas as pd
import numpy as np

In [24]:
df = pd.read_csv('../data/processed/data_after_feature_engineering2.csv')

In [25]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)

In [26]:
df.head(2)

Unnamed: 0,Age,Best overall,Overall rating,Potential,foot,Value,Wage,Height_cm,Weight_kg,Acceleration,Sprint speed,Agility,Reactions,Balance,Stamina,Strength,Jumping,Total attacking,Crossing,Finishing,Heading accuracy,Short passing,Volleys,Total skill,Dribbling,Curve,FK Accuracy,Long passing,Ball control,Total defending,Defensive awareness,Standing tackle,Sliding tackle,Interceptions,Aggression,Total goalkeeping,GK Diving,GK Handling,GK Kicking,GK Positioning,GK Reflexes,Total mentality,Att. Position,Vision,Penalties,Composure,Total power,Shot power,Long shots,Total stats,International reputation,On Loan,Team_encoded,Best position_CAM,Best position_CB,Best position_CDM,Best position_CF,Best position_CM,Best position_GK,Best position_LB,Best position_LM,Best position_LW,Best position_LWB,Best position_RB,Best position_RM,Best position_RW,Best position_RWB,Best position_ST,Years left,Forward Score,Midfielder Score,Defender Score,Goalkeeper Score,Position Category_Defender,Position Category_Forward,Position Category_Goalkeeper,Position Category_Midfielder
0,22,74,72,84,1,15.520259,10.203629,191,75,65,69,62.0,70,54.0,66,75,79.0,0.559494,59,29,72,66,41.0,0.567901,66,51.0,34,64,67,0.801653,70,75,74.0,70,68,0.091335,12,8,11.0,10,8,0.593838,48,42.0,49.0,64.0,0.588424,52,44,1730,1,0,14.960267,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,56.2,61.0,72.8,9.8,1.0,0.0,0.0,0.0
1,20,78,76,86,1,16.618871,11.066654,180,75,77,73,77.0,72,69.0,77,75,87.0,0.726582,56,67,78,79,53.0,0.740741,77,68.0,52,77,78,0.834711,69,78,80.0,72,78,0.093677,5,10,15.0,12,8,0.806723,75,74.0,54.0,78.0,0.78135,72,65,2059,1,1,14.939842,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,73.2,77.0,74.8,10.0,0.0,0.0,0.0,1.0


## 1. Recap

In [37]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score, mean_squared_error, mean_absolute_error

In [38]:
X = df.drop('Value', axis=1)
y = df['Value']

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=42)

In [40]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3840, 76), (960, 76), (3840,), (960,))

In [41]:
Lr = LinearRegression()
Lr.fit(X_train, y_train)

In [42]:
ridge = Ridge()  # lowercase name so the Ridge class is not shadowed
ridge.fit(X_train, y_train)

In [43]:
Dt = DecisionTreeRegressor()
Dt.fit(X_train, y_train)

In [44]:
lr_pred = Lr.predict(X_test)
ridge_pred = ridge.predict(X_test)
dt_pred = Dt.predict(X_test)

models = {
    'Linear Regression': lr_pred,
    'Ridge': ridge_pred,
    'Decision Tree': dt_pred
}

metrics = {}

for model_name, predictions in models.items():
    metrics[model_name] = {
        'R2 Score': r2_score(y_test, predictions),
        'MAE': mean_absolute_error(y_test, predictions),
        'RMSE': np.sqrt(mean_squared_error(y_test, predictions))
    }

metrics_df = pd.DataFrame(metrics).round(6)
print("Model Performance Comparison:")
print("-" * 50)
print(metrics_df)

Model Performance Comparison:
--------------------------------------------------
          Linear Regression     Ridge  Decision Tree
R2 Score           0.973219  0.973231       0.986520
MAE                0.172437  0.172429       0.084679
RMSE               0.222162  0.222110       0.157613
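For reference, the three metrics in the table can be reproduced by hand. The sketch below is only an illustration on made-up toy numbers; `regression_metrics` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R2, MAE and RMSE from their definitions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                 # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return {
        "R2 Score": 1.0 - ss_res / ss_tot,
        "MAE": np.mean(np.abs(residuals)),
        "RMSE": np.sqrt(np.mean(residuals ** 2)),
    }

# Toy values chosen for easy mental arithmetic (not project data).
toy = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(toy)
```

Because RMSE squares the residuals before averaging, it penalizes large errors more heavily than MAE, which is why the two can rank models differently.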


## 2. Ensemble Methods

### Bagging with Random Forest

Bootstrap aggregation (bagging) reduces variance by training many decision trees on bootstrap samples of the training data and averaging their predictions.
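To make the idea concrete, here is a minimal hand-rolled sketch of bagging. It assumes synthetic data from `make_regression` as a stand-in for the project dataset, and all `*_demo` names are illustrative. A random forest can additionally randomize the features considered at each split, which this sketch does not do.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the project's dataset).
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 50
tree_preds = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement, same size as the training set.
    idx = rng.integers(0, len(Xtr_demo), size=len(Xtr_demo))
    tree = DecisionTreeRegressor(random_state=0).fit(Xtr_demo[idx], ytr_demo[idx])
    tree_preds.append(tree.predict(Xte_demo))

# Aggregate: average the per-tree predictions.
bagged_pred = np.mean(tree_preds, axis=0)
single_pred = DecisionTreeRegressor(random_state=0).fit(Xtr_demo, ytr_demo).predict(Xte_demo)

print("single tree R2 :", r2_score(yte_demo, single_pred))
print("bagged trees R2:", r2_score(yte_demo, bagged_pred))
```

Each fully grown tree overfits its own bootstrap sample in a different way, so averaging cancels much of that tree-specific noise.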

In [45]:
Rf = RandomForestRegressor()
Rf.fit(X_train, y_train)

In [46]:
y_pred = Rf.predict(X_test)
accuracy = r2_score(y_test, y_pred)
print(f"R2 Score: {accuracy}")

R2 Score: 0.9947531014679627


In [57]:
import joblib
joblib.dump(Rf, '../models/random_forest_model.pkl')

['../models/random_forest_model.pkl']

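Wow, brilliant! The random forest reaches an R2 of about 0.995 on the test set, up from 0.987 for the single decision tree. Can hyperparameter tuning push it any further?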

In [47]:
from sklearn.model_selection import GridSearchCV

In [49]:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}
grid_search = GridSearchCV(Rf, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)
best_Rf = grid_search.best_estimator_
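Once a grid search has been fitted, the winning combination can be read from the search object rather than retyped by hand. The sketch below assumes a small synthetic dataset and a reduced grid so that it runs quickly; `demo_search` and `tuned_rf` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the project's dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

demo_grid = {"n_estimators": [20, 40], "max_depth": [None, 5]}
demo_search = GridSearchCV(
    RandomForestRegressor(random_state=0), demo_grid, cv=3, scoring="r2"
)
demo_search.fit(X_demo, y_demo)

print("best params  :", demo_search.best_params_)
print("best CV score:", demo_search.best_score_)  # mean cross-validated R2 of the winner

# With the default refit=True, best_estimator_ is already refitted on all the data
# passed to fit(), so it can be used for prediction directly.
tuned_rf = demo_search.best_estimator_
print(tuned_rf.predict(X_demo[:3]))
```

Note that `best_score_` is a cross-validated score on the training data, so it is not directly comparable to an R2 measured on a held-out test set.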

In [None]:
rfbest = RandomForestRegressor(n_estimators=200, max_depth=20)

In [56]:
rfbest.fit(X_train, y_train)
y_pred = rfbest.predict(X_test)
accuracy = r2_score(y_test, y_pred)
print(f"R2 Score: {accuracy}")

R2 Score: 0.9948399642317804


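Yep! The best model so far is `rfbest`, obtained by hyperparameter tuning. The change is not drastic (R2 of 0.99484 versus 0.99475), just a small upgrade, and since neither forest sets `random_state`, a gap this small may partly be run-to-run noise. But guess what: this is happiness.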

### Gradient Boosting Regressor

Boosting builds trees sequentially, with each new tree fitted to correct the errors of the ensemble built so far.
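For squared-error loss, "correcting the errors" amounts to fitting each new tree to the current residuals. The following is a rough hand-written sketch of that loop on synthetic data (an assumption, not the project dataset); it mirrors the idea behind `GradientBoostingRegressor` but is not its actual implementation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the project's dataset).
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

learning_rate = 0.1
n_rounds = 100

# Start from a constant prediction: the training mean.
train_pred = np.full(len(ytr_demo), ytr_demo.mean())
test_pred = np.full(len(yte_demo), ytr_demo.mean())
r2_per_round = []

for _ in range(n_rounds):
    residuals = ytr_demo - train_pred  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(Xtr_demo, residuals)
    # Add a shrunken version of the new tree's correction to the running prediction.
    train_pred += learning_rate * tree.predict(Xtr_demo)
    test_pred += learning_rate * tree.predict(Xte_demo)
    r2_per_round.append(r2_score(yte_demo, test_pred))

print("test R2 after round 1  :", r2_per_round[0])
print("test R2 after round 100:", r2_per_round[-1])
```

The `learning_rate` and `n_rounds` values trade off against each other: a smaller step usually needs more trees, which is why both appear in the parameter grid.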

In [58]:
from sklearn.ensemble import GradientBoostingRegressor


In [59]:
Gb = GradientBoostingRegressor(random_state=42)
Gb.fit(X_train, y_train)
y_pred = Gb.predict(X_test)

In [61]:
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5]
}
grid_search = GridSearchCV(Gb, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)
best_gb = grid_search.best_estimator_

In [62]:
best_gb

In [64]:
best_Gb = GradientBoostingRegressor(max_depth=5, n_estimators=200, random_state=42)

In [65]:
best_Gb.fit(X_train, y_train)
y_pred = best_Gb.predict(X_test)
accuracy = r2_score(y_test, y_pred)
print(f"R2 Score: {accuracy}")

R2 Score: 0.9967783719841966


*This is the best model so far: an R2 of 0.9968, versus 0.9948 for the tuned random forest.*

### Stacking

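Stacking combines the predictions of several base models (here Gradient Boosting, Random Forest, and a Decision Tree) using a meta-model (here Ridge) that is trained on their cross-validated predictions.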

In [None]:
from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

In [None]:
# These are our best models

base_models = [
    ('Gradient Boosting', GradientBoostingRegressor(max_depth=5, n_estimators=200, random_state=42)),
    ('Random Forest', RandomForestRegressor(n_estimators=200, max_depth=20)),
    ('Decision Tree', DecisionTreeRegressor()),
]

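TypeError: 'Ridge' object is not callable

This error appears when the name `Ridge` has been rebound to a model instance rather than the class, so `Ridge()` tries to call the instance. Re-importing `Ridge` from `sklearn.linear_model` restores the class.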

In [None]:
stacked = StackingRegressor(estimators=base_models, final_estimator=Ridge())
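The stacked model above is defined but not yet fitted. As a minimal sketch of the remaining steps, the example below fits and scores a smaller stack on synthetic data; the dataset, the reduced `n_estimators`, and all `demo_*` names are assumptions for illustration, so the printed score says nothing about the project data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the project's dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_base_models = [
    ("gb", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("dt", DecisionTreeRegressor(random_state=0)),
]

# The final estimator is trained on cross-validated predictions of the base models.
demo_stacked = StackingRegressor(
    estimators=demo_base_models, final_estimator=Ridge(), cv=5
)
demo_stacked.fit(Xtr_demo, ytr_demo)

stack_r2 = r2_score(yte_demo, demo_stacked.predict(Xte_demo))
print("stacked R2:", stack_r2)
# One Ridge coefficient per base model: how much weight each one gets.
print("meta-model weights:", demo_stacked.final_estimator_.coef_)
```

Inspecting the meta-model coefficients is a quick way to see whether a weaker base model, such as the single decision tree, contributes anything once the stronger ensembles are present.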