# Project - Day 9 - Ensemble Methods

> Today's goals:
1. Apply bagging (Random Forest, an ensemble of decision tree regressors), evaluate its performance, and decide whether it can be deployed.
2. Apply boosting algorithms (XGBoost, LightGBM, AdaBoost) and compare model performance.
3. Apply stacking with Random Forest, XGBoost, and LightGBM as base models and a simple model such as Linear Regression as the final estimator.
4. Document the work, deploy the best and simplest model, and finalize the project.

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('../data/processed/final_feature_engineering.csv')

In [4]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)

In [5]:
df.head(2)

Unnamed: 0,Age,foot,Value,Wage,Height_cm,Weight_kg,Acceleration,Sprint speed,Agility,Balance,Stamina,Strength,International reputation,On Loan,Team_encoded,Years left,Forward Score,Midfielder Score,Defender Score,Goalkeeper Score,Position Category_Defender,Position Category_Forward,Position Category_Goalkeeper,Position Category_Midfielder
0,22,1,15.520259,10.203629,191,75,65,69,62.0,54.0,66,75,1,0,14.960267,3,56.2,61.0,72.8,9.8,1.0,0.0,0.0,0.0
1,20,1,16.618871,11.066654,180,75,77,73,77.0,69.0,77,75,1,1,14.939842,0,73.2,77.0,74.8,10.0,0.0,0.0,0.0,1.0


## 1. Recap

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [7]:
X = df.drop('Value', axis=1)
y = df['Value']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=42)

In [9]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3840, 23), (960, 23), (3840,), (960,))

In [10]:
Lr = LinearRegression()
Lr.fit(X_train, y_train)

In [11]:
ridge = Ridge()  # lowercase name so the Ridge class is not shadowed
ridge.fit(X_train, y_train)

In [12]:
Dt = DecisionTreeRegressor()
Dt.fit(X_train, y_train)

In [13]:
lr_pred = Lr.predict(X_test)
ridge_pred = ridge.predict(X_test)
dt_pred = Dt.predict(X_test)

models = {
    'Linear Regression': lr_pred,
    'Ridge': ridge_pred,
    'Decision Tree': dt_pred
}

metrics = {}

for model_name, predictions in models.items():
    metrics[model_name] = {
        'R2 Score': r2_score(y_test, predictions),
        'MAE': mean_absolute_error(y_test, predictions),
        'RMSE': np.sqrt(mean_squared_error(y_test, predictions))
    }

metrics_df = pd.DataFrame(metrics).round(6)
print("Model Performance Comparison:")
print("-" * 50)
print(metrics_df)

Model Performance Comparison:
--------------------------------------------------
          Linear Regression     Ridge  Decision Tree
R2 Score           0.830564  0.830734       0.801407
MAE                0.431873  0.431954       0.452892
RMSE               0.558800  0.558519       0.604973
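As a reminder of what the three metrics in the table measure, here is a minimal sketch that computes them by hand with NumPy. The arrays `y_true` and `y_hat` are invented toy values, not project data.

```python
import numpy as np

# Toy values, invented for illustration only (not from the project data).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_hat

mae = np.mean(np.abs(residuals))                  # mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2))           # root mean squared error
ss_res = np.sum(residuals ** 2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1.0 - ss_res / ss_tot                        # share of variance explained

print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```

These should match `mean_absolute_error`, `mean_squared_error` (after the square root), and `r2_score` on the same inputs.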


## 2. Ensemble Methods

### Bagging with Random Forest

Bootstrap aggregation (bagging) reduces variance by averaging many decision trees, each trained on a bootstrap sample of the training data.
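To make the idea concrete, the following is a rough sketch of bagging done by hand: resample the training rows with replacement, fit one tree per resample, and average the predictions. It uses synthetic data from `make_regression` as a stand-in (an assumption, not the project data), and it is not how `RandomForestRegressor` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for the project's features (assumption).
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 30
tree_preds = []

for _ in range(n_trees):
    # Bootstrap: draw len(Xtr) row indices with replacement.
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    tree = DecisionTreeRegressor(random_state=0).fit(Xtr[idx], ytr[idx])
    tree_preds.append(tree.predict(Xte))

# Aggregate: average the per-tree predictions.
bagged_pred = np.mean(tree_preds, axis=0)

single_tree = DecisionTreeRegressor(random_state=0).fit(Xtr, ytr)
print("single tree R2:", r2_score(yte, single_tree.predict(Xte)))
print("bagged trees R2:", r2_score(yte, bagged_pred))
```

A random forest can additionally restrict the features considered at each split (`max_features`), which this sketch leaves out.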

In [14]:
Rf = RandomForestRegressor()
Rf.fit(X_train, y_train)

In [15]:
y_pred = Rf.predict(X_test)
r2 = r2_score(y_test, y_pred)  # R2 is a regression metric, not an accuracy
print(f"R2 Score: {r2}")

R2 Score: 0.9290588776843546


In [16]:
import joblib
joblib.dump(Rf, '../models/random_forest_model.pkl')

['../models/random_forest_model.pkl']
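For completeness, a saved model can be read back with `joblib.load`. The sketch below round-trips a small model through a temporary directory; the model, the synthetic data, and the file name `demo_model.pkl` are illustrative assumptions, not the project's artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic problem so the sketch is self-contained (assumption).
X_demo, y_demo = make_regression(n_samples=100, n_features=4, random_state=0)
demo_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(demo_model, demo_path)
    restored = joblib.load(demo_path)

# The restored model should reproduce the original predictions.
same = np.array_equal(restored.predict(X_demo), demo_model.predict(X_demo))
print("identical predictions after reload:", same)
```

Pickled models are generally only safe to load with the same scikit-learn version that wrote them, so it is worth recording the library versions alongside the file.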

The Random Forest reaches a test R² of about 0.929, well above the single decision tree's 0.801.

In [17]:
from sklearn.model_selection import GridSearchCV

In [18]:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}
grid_search = GridSearchCV(Rf, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)
best_Rf = grid_search.best_estimator_
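After a grid search finishes, the winning settings can be read from the fitted object rather than copied by hand. This is a small sketch on synthetic data with a deliberately tiny grid; `demo_grid` and `demo_search` are illustrative names, and the grid is not the one used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a tiny grid so the sketch runs quickly (assumptions).
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
demo_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}

demo_search = GridSearchCV(
    RandomForestRegressor(random_state=0), demo_grid, cv=3, scoring="r2"
)
demo_search.fit(X_demo, y_demo)

print("best params:", demo_search.best_params_)
print("best mean CV R2:", demo_search.best_score_)

# With the default refit=True, best_estimator_ is already refit on all
# the data passed to fit(), so it can be used for prediction directly.
tuned_model = demo_search.best_estimator_
print(tuned_model.predict(X_demo[:3]))
```

Fixing `random_state` on the estimator keeps the search reproducible from run to run.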

In [19]:
rfbest = RandomForestRegressor(n_estimators=200, max_depth=20)

In [20]:
rfbest.fit(X_train, y_train)
y_pred = rfbest.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R2 Score: {r2}")

R2 Score: 0.9292498581557154


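We now have our best model so far: `rfbest`, obtained through hyperparameter tuning. The change is not drastic (R² goes from 0.9291 to 0.9292), only a small upgrade, and because neither forest fixes `random_state`, a gap this small may be within run-to-run variation. Still, it counts.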

### Gradient Boosting Regressor

Boosting builds trees sequentially, with each new tree correcting the errors of the trees before it.
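The sequential idea can be sketched by hand for squared-error loss: start from the mean, then repeatedly fit a small tree to the current residuals and add a shrunken copy of its prediction. This is a simplified illustration on synthetic data, not the exact procedure inside `GradientBoostingRegressor`; `learning_rate` and `n_rounds` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for the project's features (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean of the target).
current_pred = np.full(len(y_demo), y_demo.mean())
mse_history = [np.mean((y_demo - current_pred) ** 2)]

for _ in range(n_rounds):
    # Each new tree is trained on what the ensemble still gets wrong.
    residuals = y_demo - current_pred
    small_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, residuals)
    current_pred = current_pred + learning_rate * small_tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - current_pred) ** 2))

print(f"training MSE: {mse_history[0]:.1f} -> {mse_history[-1]:.1f}")
```

The training error keeps falling with more rounds, which is why boosting is usually tuned against held-out data (for example through `n_estimators` and `learning_rate`) rather than training error.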

In [21]:
from sklearn.ensemble import GradientBoostingRegressor


In [22]:
Gb = GradientBoostingRegressor(random_state=42)
Gb.fit(X_train, y_train)
y_pred = Gb.predict(X_test)

In [23]:
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5]
}
grid_search = GridSearchCV(Gb, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)
best_gb = grid_search.best_estimator_

In [24]:
best_gb

In [25]:
best_Gb = GradientBoostingRegressor(max_depth=5, n_estimators=200, random_state=42)

In [26]:
best_Gb.fit(X_train, y_train)
y_pred = best_Gb.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R2 Score: {r2}")

R2 Score: 0.9465900185424172


In [27]:
cleaned_data = pd.read_csv('../data/processed/cleaned_data.csv')

In [28]:
# Extract the 'Name' column from cleaned_data
names = cleaned_data[['Name']]

# Merge the 'Name' column with the original dataframe 'df'
df_with_names = df.merge(names, left_index=True, right_index=True)

# Display the first few rows of the updated dataframe
df_with_names.head()

Unnamed: 0,Age,foot,Value,Wage,Height_cm,Weight_kg,Acceleration,Sprint speed,Agility,Balance,Stamina,Strength,International reputation,On Loan,Team_encoded,Years left,Forward Score,Midfielder Score,Defender Score,Goalkeeper Score,Position Category_Defender,Position Category_Forward,Position Category_Goalkeeper,Position Category_Midfielder,Name
0,22,1,15.520259,10.203629,191,75,65,69,62.0,54.0,66,75,1,0,14.960267,3,56.2,61.0,72.8,9.8,1.0,0.0,0.0,0.0,K. De Winter
1,20,1,16.618871,11.066654,180,75,77,73,77.0,69.0,77,75,1,1,14.939842,0,73.2,77.0,74.8,10.0,0.0,0.0,0.0,1.0,Andrey Santos
2,21,1,16.341239,10.491302,173,75,84,85,82.0,82.0,78,79,1,0,16.318711,3,78.8,71.2,33.6,11.0,0.0,1.0,0.0,0.0,G. Simeone
3,16,1,13.910822,7.601402,185,78,74,74,73.0,77.0,68,65,1,0,14.150029,1,66.8,56.8,31.4,8.2,0.0,1.0,0.0,0.0,M. Melia
4,33,1,17.96655,12.506181,181,75,67,67,75.0,78.0,76,75,5,0,16.651179,0,79.2,89.8,66.0,11.2,0.0,0.0,0.0,1.0,K. De Bruyne


In [29]:
# Predict on the test set using the refit best_Gb model
predicted_values = best_Gb.predict(X_test)

# Create a DataFrame to display the results
# np.expm1 maps the values back from the log scale to the original units
results_df = pd.DataFrame({
    'Name': df_with_names.loc[X_test.index, 'Name'],
    'Actual Value': np.expm1(y_test).astype(float),
    'Predicted Value': np.expm1(predicted_values).astype(int),
})

# Display the results
print(results_df.sample(20))

                 Name  Actual Value  Predicted Value
2463      M. Cleworth      600000.0           924699
2509         T. Alloh     1500000.0          1261004
1084          J. Doig     6000000.0          5122923
3638       R. Stutter     1100000.0          1522527
1941         A. Harit      650000.0          1055260
2830         D. Rossi    11000000.0         11206935
3077      F. Guilbert     1300000.0          1714390
1837     Víctor Gómez     2000000.0          1792090
3256          Rodinei     1800000.0          3177179
139   B. El Khannouss    12500000.0         19319207
1728      Lee Tae Suk     2100000.0          2985282
1732         L. Mbete     2800000.0          2702942
3437       E. Mangala      500000.0           423633
1813     F. Ioannidis    10000000.0          6562342
3496       K. Boateng     4700000.0          3046012
3673           Damián     1800000.0           897253
2458             Simo      170000.0           179288
3857       T. Mansour      800000.0          1

*Gradient boosting is by far the best model so far (test R² of about 0.947).*
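The results table converts predictions with `np.expm1`, which suggests that `Value` was stored as `log1p` of the original amount in an earlier step (an assumption, since that step is not shown here). The sketch below, using invented amounts, shows the round trip and how a log-scale error translates into original units.

```python
import numpy as np

# Invented amounts in original units; not taken from the dataset.
values = np.array([170_000.0, 1_500_000.0, 12_500_000.0])

log_values = np.log1p(values)      # forward transform: log(1 + x)
recovered = np.expm1(log_values)   # inverse transform: exp(x) - 1

print("round trip ok:", np.allclose(recovered, values))

# An error of d on the log scale is roughly a multiplicative factor of
# exp(d) in original units, so the same d matters more for large values.
log_error = 0.3
factor = np.exp(log_error)
print(f"a log-scale error of {log_error} is about a factor of {factor:.2f}")
```

This is why a model with a small log-scale MAE can still miss an expensive player by millions in the table above.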

### Stacking

Stacking combines the predictions of several base models (here Gradient Boosting, Random Forest, and a Decision Tree) using a meta-model (here Ridge) as the final estimator.
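Roughly, stacking works in two stages: the base models produce out-of-fold predictions on the training set, and the meta-model is trained on those predictions as its features. Here is a hand-rolled sketch of that idea on synthetic data; the base models and names such as `demo_bases` are illustrative assumptions, and `StackingRegressor` handles these steps for you.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for the project's features (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

demo_bases = [
    RandomForestRegressor(n_estimators=20, random_state=0),
    DecisionTreeRegressor(max_depth=4, random_state=0),
]

# Out-of-fold predictions: each row is predicted by a model that never saw it.
meta_train = np.column_stack(
    [cross_val_predict(model, Xtr, ytr, cv=5) for model in demo_bases]
)

# The meta-model learns how to weight the base models' predictions.
meta_model = Ridge().fit(meta_train, ytr)

# For new data, the base models are refit on the full training set.
meta_test = np.column_stack(
    [model.fit(Xtr, ytr).predict(Xte) for model in demo_bases]
)
stacked_pred = meta_model.predict(meta_test)
print("meta-model coefficients:", meta_model.coef_)
```

Using out-of-fold predictions keeps the meta-model from simply trusting whichever base model overfits the training data most.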

In [30]:
from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

In [31]:
# These are our best models

base_models = [
    ('Gradient Boosting', GradientBoostingRegressor(max_depth=5, n_estimators=200, random_state=42)),
    ('Random Forest', RandomForestRegressor(n_estimators=200, max_depth=20)),
    ('Decision Tree', DecisionTreeRegressor()),
]

In [32]:
stacked = StackingRegressor(estimators=base_models, final_estimator=Ridge())


In [33]:
r2 = stacked.fit(X_train, y_train).score(X_test, y_test)  # .score() returns R2 for regressors
print(f"R2 Score: {r2}")

R2 Score: 0.9470304680993231


In [34]:
df.head(3)

Unnamed: 0,Age,foot,Value,Wage,Height_cm,Weight_kg,Acceleration,Sprint speed,Agility,Balance,Stamina,Strength,International reputation,On Loan,Team_encoded,Years left,Forward Score,Midfielder Score,Defender Score,Goalkeeper Score,Position Category_Defender,Position Category_Forward,Position Category_Goalkeeper,Position Category_Midfielder
0,22,1,15.520259,10.203629,191,75,65,69,62.0,54.0,66,75,1,0,14.960267,3,56.2,61.0,72.8,9.8,1.0,0.0,0.0,0.0
1,20,1,16.618871,11.066654,180,75,77,73,77.0,69.0,77,75,1,1,14.939842,0,73.2,77.0,74.8,10.0,0.0,0.0,0.0,1.0
2,21,1,16.341239,10.491302,173,75,84,85,82.0,82.0,78,79,1,0,16.318711,3,78.8,71.2,33.6,11.0,0.0,1.0,0.0,0.0


In [41]:
df.to_pickle('../models/df.pkl')

In [42]:
import sklearn
import pandas as pd
import joblib

print("scikit-learn version:", sklearn.__version__)
print("pandas version:", pd.__version__)
print("joblib version:", joblib.__version__)

scikit-learn version: 1.5.2
pandas version: 2.2.2
joblib version: 1.4.2


In [35]:
df_with_names.head(3)

Unnamed: 0,Age,foot,Value,Wage,Height_cm,Weight_kg,Acceleration,Sprint speed,Agility,Balance,Stamina,Strength,International reputation,On Loan,Team_encoded,Years left,Forward Score,Midfielder Score,Defender Score,Goalkeeper Score,Position Category_Defender,Position Category_Forward,Position Category_Goalkeeper,Position Category_Midfielder,Name
0,22,1,15.520259,10.203629,191,75,65,69,62.0,54.0,66,75,1,0,14.960267,3,56.2,61.0,72.8,9.8,1.0,0.0,0.0,0.0,K. De Winter
1,20,1,16.618871,11.066654,180,75,77,73,77.0,69.0,77,75,1,1,14.939842,0,73.2,77.0,74.8,10.0,0.0,0.0,0.0,1.0,Andrey Santos
2,21,1,16.341239,10.491302,173,75,84,85,82.0,82.0,78,79,1,0,16.318711,3,78.8,71.2,33.6,11.0,0.0,1.0,0.0,0.0,G. Simeone


In [37]:
df_with_names.to_csv('../data/processed/final_data.csv', index=False)

In [39]:
df_with_names.to_pickle('../models/df.pkl')

In [40]:
df_loaded = pd.read_pickle('../models/df.pkl')
print(df_loaded.head())

   Age  foot      Value       Wage  Height_cm  Weight_kg  Acceleration  \
0   22     1  15.520259  10.203629        191         75            65   
1   20     1  16.618871  11.066654        180         75            77   
2   21     1  16.341239  10.491302        173         75            84   
3   16     1  13.910822   7.601402        185         78            74   
4   33     1  17.966550  12.506181        181         75            67   

   Sprint speed  Agility  Balance  Stamina  Strength  \
0            69     62.0     54.0       66        75   
1            73     77.0     69.0       77        75   
2            85     82.0     82.0       78        79   
3            74     73.0     77.0       68        65   
4            67     75.0     78.0       76        75   

   International reputation  On Loan  Team_encoded  Years left  Forward Score  \
0                         1        0     14.960267           3           56.2   
1                         1        1     14.939842      