## Applied ML for baseball:

#### By Thomas Maxence Franco 
Submitted to the Faculty of Science in partial fulfillment of the requirements for the degree of 
#### Master of Modeling for Science and Engineering 
at the 
#### UNIVERSITAT AUTÒNOMA DE BARCELONA 
Directed by 
Tomás Manuel Margalef Burrull
July 2024


In [None]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import confusion_matrix, mean_squared_error, r2_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBRegressor

In [None]:
file_path = "C:\\Users\\mfran\\OneDrive - UAB\\Masters\\Thesis\\Batting\\tables\\tradfinalbat.csv"
df = pd.read_csv(file_path)


In [None]:
df.head()

In [None]:
df.info()

Null values were handled manually in a previous step. The small number of null values came from players who did not play through a season because of injury, but a missing season would damage both seasons in a record, since last season's data is needed to fill in the second-to-last. I had to look up each player's stats so as not to lose any data. This could be handled differently to save time.

In [None]:
df2 = df.rename(columns={'date':'year'})
df2.head()

#### The Shohei Ohtani case. 

The Japanese superstar signed a record 10-year, 700M deal with the LA Dodgers this season (2024). The contract has very special characteristics that make it unique. First, he is the first player since Babe Ruth to both pitch and bat at an elite level. The Dodgers are essentially getting two extremely talented players in one, except for this year, when he won't be able to pitch as he recovers from a UCL injury. Excluding Ohtani, the record AAV for a pitcher is 43.3M, for Justin Verlander and Max Scherzer. For a batter it is 40M per season, for the contract Aaron Judge signed last year.

But Ohtani is not getting paid 70M dollars per year. The Dodgers deferred 680 million, to be paid starting in 2034, when the contract expires. Ohtani is receiving 2M per season and will get 68M every year starting 10 years from now. As explained in the paper, today's money won't be worth the same in 10 years. He is not getting 700 million dollars in 2024 money; he will be getting considerably less.

We could solve this in two ways. The first is to take Ohtani's accumulated WAR from the last 3 years (28.5), split it into offensive WAR (14.3) and pitching WAR (14.2), and calculate how much the Dodgers are paying Ohtani for each ability, which is approximately 35M for each. That is right up there with what the top players at each position have received.


The second proposed solution is to take the annual competitive balance tax (CBT, or luxury tax) figure for the Dodgers, which is approximately 46.6M, and use it as the definitive number for both pitching and batting. This figure is an estimate of what the Dodgers will be paying him once interest rates are taken into account. On average, MLB salaries have increased 3.5% per year, which would make Ohtani's 70 million worth about 50 million in 2024.
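
The 50 million figure is consistent with discounting the nominal 70M by ten years of 3.5% growth. A minimal sketch of that arithmetic follows; the variable names are illustrative, and treating the deferral as a single 10-year discount is a simplifying assumption, not the actual payment schedule.

```python
# Discount a nominal yearly amount back to 2024 money, assuming salaries
# grow 3.5% per year (the rate quoted in the text) over a 10-year deferral.
nominal_aav = 70.0      # millions of dollars per season, nominal
growth_rate = 0.035     # average yearly increase in MLB salaries
years_deferred = 10

present_value = nominal_aav / (1 + growth_rate) ** years_deferred
print(f"2024 value of a deferred 70M season: {present_value:.1f}M")
```

The result lands just under 50M, which matches the rounded figure in the text.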

For this work I will use the first solution as I see it more fitting for both the pitching models and the batting models. 

I had already converted the 28.5 WAR to 14.3 in the Data Preprocessing step. 

Now I will convert the salary and AAV to match his valuation as a batter.



In [None]:
shohei_index = df2.index[df2['Name'] == 'Shohei Ohtani'].tolist()[0]
df2.loc[shohei_index, 'salary'] *= 0.501754386
df2.loc[shohei_index, 'AAV'] *= 0.501754386
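
The factor 0.501754386 is the batting share of Ohtani's three-year WAR, 14.3 / 28.5. A short check of that arithmetic, using the figures quoted in this section (the variable names are illustrative):

```python
# Share of Ohtani's 3-year WAR that comes from batting
offensive_war = 14.3
total_war = 28.5

batting_share = offensive_war / total_war
# Nominal AAV (in millions) attributed to his batting
batting_aav = 70.0 * batting_share

print(f"batting share: {batting_share:.9f}")
print(f"batting AAV:   {batting_aav:.1f}M")
```

This reproduces the roughly 35M per role mentioned in the first solution.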

In [None]:
df2.head()

In [None]:
df2.dropna(inplace=True)

In [None]:
df2.info()

### Interest Rates

As mentioned in the Ohtani case, money is not worth the same every year. The cells below convert every contract to its 2024 value. The coefficients were calculated beforehand from the entire salary mass in MLB and its change per year.

In [None]:
df2.loc[df2['year'] == 2024, 'salary'] *= 1
df2.loc[df2['year'] == 2023, 'salary'] *= 0.994232329
df2.loc[df2['year'] == 2022, 'salary'] *= 1.097132508
df2.loc[df2['year'] == 2020, 'salary'] *= 1.187629677
df2.loc[df2['year'] == 2019, 'salary'] *= 1.188733275
df2.loc[df2['year'] == 2018, 'salary'] *= 1.183309539
df2.loc[df2['year'] == 2017, 'salary'] *= 1.171102114
df2.loc[df2['year'] == 2016, 'salary'] *= 1.231297408
df2.head()

In [None]:
df2.loc[df2['year'] == 2024, 'AAV'] *= 1
df2.loc[df2['year'] == 2023, 'AAV'] *= 0.994232329
df2.loc[df2['year'] == 2022, 'AAV'] *= 1.097132508
df2.loc[df2['year'] == 2020, 'AAV'] *= 1.187629677
df2.loc[df2['year'] == 2019, 'AAV'] *= 1.188733275
df2.loc[df2['year'] == 2018, 'AAV'] *= 1.183309539
df2.loc[df2['year'] == 2017, 'AAV'] *= 1.171102114
df2.loc[df2['year'] == 2016, 'AAV'] *= 1.231297408
df2.head()
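
The two cells above apply the same factors to `salary` and `AAV`. As a sketch, the factors could be kept in one dictionary and applied to both columns at once. The toy frame and the names `to_2024` and `toy` are assumptions for illustration; the coefficients are the ones used above.

```python
import pandas as pd

# Conversion factors to 2024 money, copied from the cells above
to_2024 = {
    2024: 1.0,
    2023: 0.994232329,
    2022: 1.097132508,
    2020: 1.187629677,
    2019: 1.188733275,
    2018: 1.183309539,
    2017: 1.171102114,
    2016: 1.231297408,
}

# Toy data standing in for df2 (values are made up)
toy = pd.DataFrame({
    "year": [2024, 2022, 2016],
    "salary": [10.0, 10.0, 10.0],
    "AAV": [5.0, 5.0, 5.0],
})

# Years without a listed factor are left unchanged (factor 1.0)
factor = toy["year"].map(to_2024).fillna(1.0)
toy[["salary", "AAV"]] = toy[["salary", "AAV"]].mul(factor, axis=0)
print(toy)
```

Keeping the factors in one place avoids the two lists drifting apart if a coefficient is later revised.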

In [None]:
df2.count()

#### Drop minor league players

Minor league contracts almost always have the same AAV, with some rare exceptions. Keeping these contracts won't help, as we would be trying to predict a value we already know. The objective is to predict which players are worth major league contracts.

In [None]:
# Rows that remain once minor league contracts are removed
len(df2) - (df2['minor_league'] == 1).sum()

In [None]:
df3 = df2[df2['minor_league'] != 1].copy()
df3.head()

In [None]:
(df3['minor_league'] == 1).sum()

In [None]:
del df3["minor_league"]

#### Categorizing team names. 


This could help reveal whether certain teams overpay for free agents and which ones spend relatively little.

In [None]:
df3["team_code"]=df3["new_team"].astype("category").cat.codes.copy()
df3["prev_team_code"]=df3["former_team"].astype("category").cat.codes.copy()

In [None]:
df3.head()
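
To make the encoding concrete, here is a small sketch of what `.astype("category").cat.codes` produces. The team abbreviations are made up for illustration.

```python
import pandas as pd

# Hypothetical team labels; the real columns are new_team and former_team
teams = pd.Series(["LAD", "NYY", "LAD", "BOS"])

# Categories are sorted alphabetically, so BOS=0, LAD=1, NYY=2
codes = teams.astype("category").cat.codes
print(codes.tolist())
```

Note that these codes are arbitrary integers. A linear model would read them as an ordered quantity, so they are better suited to grouping and inspection than to direct use as a numeric feature.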

#### SB and CS in a single column, as a percentage SB%. 

This makes the running value easier to interpret than having two variables for the same thing.

A runner can have 20 SB, which is a lot, but may also have been caught stealing 10 times. The raw counts show a large number of stealing attempts, but they don't show the runner's efficiency clearly.

In [None]:
df3["SB_success"] = (df3["SB"] / (df3["SB"] + df3["CS"]) * 100).fillna(0)
df3["SB_success_2"] = (df3["SB_2"] / (df3["SB_2"] + df3["CS_2"]) * 100).fillna(0)
df3.head()
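
A small sketch of the same calculation on made-up numbers shows how the percentage separates runners with equal SB totals, and what happens for a player with no attempts. The frame `toy` is an assumption for illustration.

```python
import pandas as pd

# Made-up runners: same SB total, very different efficiency, plus one with no attempts
toy = pd.DataFrame({
    "SB": [20, 20, 0],
    "CS": [10, 2, 0],
})

attempts = toy["SB"] + toy["CS"]
# 0 / 0 produces NaN for players with no attempts; fillna(0) maps that to 0%
toy["SB_success"] = (toy["SB"] / attempts * 100).fillna(0)
print(toy)
```

One design consequence worth keeping in mind: a player who never attempted a steal gets the same 0% as a player who was caught every time.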

In [None]:
df_full=df3.copy()
df3=df3.copy()
df3.head()

## Feature Selection

Our goal is to predict AAV and nothing else, not contract years or the final accumulated salary.

Salary and contract years are dropped, as we only need AAV.

MLBAMID is redundant when we have PlayerId.
new_team and former_team have already been used to determine whether the player stayed with the same team after signing.

Year won't be needed anymore, as we have converted all the values to their 2024 equivalent.

WAR3 will not be used until the Advanced Statistics part.

In [None]:
removed_columns = ['salary', 'contract_years', 'MLBAMID', 'new_team', 'former_team', 'WAR3', 'SB','SB_2','CS','CS_2']
selected_columns = df3.columns[~df3.columns.isin(removed_columns)]


In [None]:
df3.describe()

In [None]:
df4 = df3.drop(columns=removed_columns).copy()

df4.head()

In [None]:
df4.select_dtypes(include=['number']).corr().style.background_gradient("coolwarm", vmin=-1, vmax=1)

We can see that K% in both years doesn't have a significant correlation with AAV.

The same goes for BB%. Its correlation is higher, but it is very highly correlated with OBP, which makes sense as both measure essentially the same thing. OBP has a much higher correlation with AAV, so dropping BB% for both years won't impact our model.

Our created variable 'catcher' seems to have a very small correlation with AAV, but a higher one with some other variables. I chose to keep it.

I can't say the same for 'stayed_same_team'. Players seem to have accepted a very slight pay cut when staying with the same team. This can be attributed, as pointed out by Libsch (2018), to players' familiarity with the city and an already established position. Basically, they gave up very little money for comfort. I chose to keep it in the meantime.

The stolen-base data (SB and CS, now combined in SB_success) will be kept for now.

Finally, of Yrs, career_games and Age, the one that shows the highest correlation with AAV is Age. It doesn't make sense to keep all three, so only Age is kept.
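
The reasoning above (drop a feature when it is weakly related to the target, or when it duplicates another feature that is more strongly related) can be sketched on synthetic data. Everything below is made up for illustration: BB% is built as a noisy copy of OBP, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: OBP drives AAV, and BB% is a noisy copy of OBP
obp = rng.normal(0.320, 0.030, n)
bb_pct = 8.0 + 100 * (obp - 0.320) + rng.normal(0, 0.5, n)
aav = 10 + 150 * (obp - 0.320) + rng.normal(0, 3, n)

toy = pd.DataFrame({"OBP": obp, "BB%": bb_pct, "AAV": aav})
corr = toy.corr()

# Correlation of each feature with the target, strongest first
print(corr["AAV"].drop("AAV").sort_values(ascending=False))
# Correlation between the two candidate features (redundancy check)
print(f"OBP vs BB%: {corr.loc['OBP', 'BB%']:.2f}")
```

When two features are this strongly correlated with each other, keeping only the one with the higher correlation with the target loses little information.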

In [None]:
columns_to_drop = ['BB%', 'BB%_2', 'Yrs', 'career_games', 'K%', 'K%_2', 'team_code', 'prev_team_code']

df5 = df4.drop(columns=columns_to_drop)

df5.head()

In [None]:
df5.info()

In [None]:
boolean_columns = ['catcher', 'stayed_same_team']
df5[boolean_columns] = df5[boolean_columns].astype(bool)

In [None]:
df5.to_csv('df5.csv', index=False)

I will select all of the features in our updated df except the target, PlayerId, Name and year.

In [None]:
target = "AAV"
features = [col for col in df5.columns if col != target and col != "PlayerId" and col!= "Name" and col!= "year"]
X, y = df5[features], df5[target]

## Train-Test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

#### Distribution of y_train values

In [None]:
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
plt.hist(y_train, bins=30, edgecolor='k', alpha=0.7)
plt.title('Histogram of y_train')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.kdeplot(y_train, fill=True)  # `shade` is deprecated in recent seaborn versions; `fill` replaces it
plt.title('KDE of y_train')
plt.xlabel('Value')
plt.ylabel('Density')

plt.tight_layout()
plt.show()

#### Distribution of test values

In [None]:
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
plt.hist(y_test, bins=30, edgecolor='k', alpha=0.7)
plt.title('Histogram of y_test')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.kdeplot(y_test, fill=True)  # `shade` is deprecated in recent seaborn versions; `fill` replaces it
plt.title('KDE of y_test')
plt.xlabel('Value')
plt.ylabel('Density')

plt.tight_layout()
plt.show()

## Linear Model

In [None]:
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

In [None]:
linear_model.coef_, linear_model.intercept_

In [None]:
linear_y_pred = linear_model.predict(X_test)

results_df = X_test.copy()
results_df["y_real"] = y_test
results_df["y_pred"] = linear_y_pred
results_df["err"] = results_df["y_real"] - results_df["y_pred"]
results_df["%_err"] = results_df["err"] / results_df["y_real"] * 100
results_df

### Evaluation Metrics 

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error, r2_score

print(f"RMSE: {mean_squared_error(y_test, linear_y_pred)**0.5}")
print(f"MAPE: {mean_absolute_percentage_error(y_test, linear_y_pred)}")
print(f"R^2: {r2_score(y_test, linear_y_pred)}")
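
For reference, the three metrics can be written out by hand. The sketch below uses made-up values, not the thesis results, and the names `y_true` and `y_hat` are illustrative.

```python
import numpy as np

y_true = np.array([2.0, 5.0, 10.0, 20.0])   # made-up AAV values, in millions
y_hat = np.array([3.0, 4.0, 12.0, 15.0])    # made-up predictions

# Root mean squared error: same units as the target
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Mean absolute percentage error, as a fraction (0.29 means 29%)
mape = np.mean(np.abs((y_true - y_hat) / y_true))

# Coefficient of determination: 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.3f}, MAPE: {mape:.4f}, R^2: {r2:.3f}")
```

`mean_absolute_percentage_error` in scikit-learn also returns a fraction, so a printed value of 1.06 corresponds to 106%. Because each error is divided by the real value, MAPE is dominated by the cheapest contracts, where a miss of 1M is already a large percentage.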

In [None]:
def bin_values(values, bin_size):
    return np.floor(values / bin_size).astype(int)

bin_size = 5
y_test_binned = bin_values(y_test, bin_size)
linear_y_pred_binned = bin_values(linear_y_pred, bin_size)

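# Fix the bin order explicitly so the axis labels match the matrix rows and columns,
# even if some bin is empty or a prediction is negative
bin_ids = np.union1d(y_test_binned, linear_y_pred_binned)
conf_matrix = confusion_matrix(y_test_binned, linear_y_pred_binned, labels=bin_ids)

bin_labels = [f'{i*bin_size}-{(i+1)*bin_size}' for i in bin_ids]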
conf_matrix_df = pd.DataFrame(conf_matrix, index=bin_labels, columns=bin_labels)

plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix_df, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Confusion Matrix (Binned Ranges)')
plt.xlabel('Predicted Ranges')
plt.ylabel('Actual Ranges')
plt.show()

#### Correction for negative predictions (Linear Regression)

In [None]:
negative_values_exist = (results_df["y_pred"] < 0).any()

if negative_values_exist:
    print("There are negative values in the 'y_pred' column.")
else:
    print("There are no negative values in the 'y_pred' column.")

If our target variable (AAV) is strictly positive, we can transform the target variable before training the model and then inverse-transform the predictions. A common transformation for strictly positive data is the logarithm.

Step-by-step implementation:

1. Log transformation: apply a logarithmic transformation to the target variable (y_train). This maps the target values from the positive domain to the real number domain, where the regression model can better capture the relationships without producing negative predictions.
2. Training the model: train the regression model using the transformed target variable.
3. Prediction: make predictions using the trained model on the test set.
4. Inverse transformation: apply the exponential function to the predicted values to transform them back to the original scale.

In [None]:
linear_model = LinearRegression()

y_train_log = np.log(y_train)
linear_model.fit(X_train, y_train_log)

linear_y_pred_log = linear_model.predict(X_test)


linear_y_pred = np.exp(linear_y_pred_log)

results_df = X_test.copy()
results_df["y_real"] = y_test
results_df["y_pred"] = linear_y_pred
results_df["err"] = results_df["y_real"] - results_df["y_pred"]
results_df["%_err"] = results_df["err"] / results_df["y_real"] * 100
results_df
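
As an alternative to transforming the target by hand, scikit-learn provides `TransformedTargetRegressor`, which applies the transformation before fitting and the inverse after predicting. The sketch below runs on synthetic data, not the thesis data; `X_toy`, `y_toy` and `log_linear` are names chosen for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, strictly positive target (not the thesis data)
X_toy = rng.normal(size=(100, 3))
y_toy = np.exp(1.0 + X_toy @ np.array([0.5, -0.3, 0.2]) + rng.normal(0, 0.1, 100))

# func is applied to y before fitting, inverse_func to the predictions
log_linear = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
log_linear.fit(X_toy, y_toy)
pred = log_linear.predict(X_toy)

# exp() of any real number is positive, so no prediction can be negative
print((pred > 0).all())
```

Wrapping the two steps in one estimator means the log and the exponential cannot get out of sync, for example when the model is later passed to a cross-validation routine.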

In [None]:
print(f"RMSE: {mean_squared_error(y_test, linear_y_pred)**0.5}")
print(f"MAPE: {mean_absolute_percentage_error(y_test, linear_y_pred)}")
print(f"R^2: {r2_score(y_test, linear_y_pred)}")

The model improves drastically in terms of average deviation (MAPE): the average error went from about 100% to about 50%.

### Evaluating per ranges

#### Testing 0-5 range values

In [None]:
range_min, range_max = 0, 5

def in_range(y_real, y_pred, range_min, range_max):
    # Half-open interval [range_min, range_max), so a boundary value falls in exactly one range
    return range_min <= y_real < range_max and range_min <= y_pred < range_max

results_df['y_real_in_range'] = results_df['y_real'].apply(lambda y: range_min <= y < range_max)

results_df['in_range'] = results_df.apply(lambda row: in_range(row['y_real'], row['y_pred'], range_min, range_max), axis=1)

total_y_real_in_range = results_df['y_real_in_range'].sum()

correct_predictions = results_df['in_range'].sum()

print(f'Number of y_real values in the range {range_min}-{range_max}: {total_y_real_in_range}')
print(f'Number of correct predictions in the range {range_min}-{range_max}: {correct_predictions}')

#### All ranges visualized with a bar plot

In [None]:
ranges = [ (0, 5), (5, 10), (10, 15), (15, 20), (20, 25), (25, 30), (30, 35), (35, 40), (40, float('inf'))]

def in_range(y_real, y_pred, range_min, range_max):
    # Half-open interval [range_min, range_max), so a boundary value falls in exactly one range
    return range_min <= y_real < range_max and range_min <= y_pred < range_max

results_list = []

for range_min, range_max in ranges:
    results_df['y_real_in_range'] = results_df['y_real'].apply(lambda y: range_min <= y < range_max)
    results_df['in_range'] = results_df.apply(lambda row: in_range(row['y_real'], row['y_pred'], range_min, range_max), axis=1)
    total_y_real_in_range = results_df['y_real_in_range'].sum()
    
    correct_predictions = results_df['in_range'].sum()

    results_list.append({
        'Range': f'{range_min}-{range_max}' if range_max != float('inf') else f'{range_min}+',
        'Total Real in range': total_y_real_in_range,
        'Correct Predictions': correct_predictions
    })

results_summary = pd.DataFrame(results_list)

print(results_summary)

results_summary.set_index('Range').plot(kind='bar', figsize=(10, 6))
plt.title('Number of Real values and Correct Predictions by Range')
plt.xlabel('Range')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
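
The per-range counts can also be computed without an explicit loop. The sketch below uses `pd.cut` on made-up values; the frame `toy` stands in for `results_df`, and the label strings are assumptions chosen to mirror the ranges above.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for results_df["y_real"] and results_df["y_pred"]
toy = pd.DataFrame({
    "y_real": [1.0, 4.0, 5.0, 12.0, 41.0],
    "y_pred": [2.0, 6.0, 7.0, 11.0, 30.0],
})

edges = [0, 5, 10, 15, 20, 25, 30, 35, 40, np.inf]
labels = ["0-5", "5-10", "10-15", "15-20", "20-25", "25-30", "30-35", "35-40", "40+"]

# right=False gives half-open bins [0, 5), [5, 10), ... so 5.0 falls in one bin only
real_bin = pd.cut(toy["y_real"], bins=edges, right=False, labels=labels)
pred_bin = pd.cut(toy["y_pred"], bins=edges, right=False, labels=labels)

# A prediction is "correct" when it lands in the same bin as the real value
hit = real_bin == pred_bin

summary = pd.DataFrame({
    "Total Real in range": real_bin.value_counts(sort=False),
    "Correct Predictions": real_bin[hit].value_counts(sort=False),
})
print(summary)
```

Values outside the edges, such as a negative prediction, become NaN under `pd.cut` and therefore never count as a hit.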

In [None]:
def bin_values(values, bin_size):
    return np.floor(values / bin_size).astype(int)

bin_size = 5
y_test_binned = bin_values(y_test, bin_size)
linear_y_pred_binned = bin_values(linear_y_pred, bin_size)

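# Fix the bin order explicitly so the axis labels match the matrix rows and columns,
# even if some bin is empty
bin_ids = np.union1d(y_test_binned, linear_y_pred_binned)
conf_matrix = confusion_matrix(y_test_binned, linear_y_pred_binned, labels=bin_ids)

bin_labels = [f'{i*bin_size}-{(i+1)*bin_size}' for i in bin_ids]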
conf_matrix_df = pd.DataFrame(conf_matrix, index=bin_labels, columns=bin_labels)

plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix_df, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Confusion Matrix (Binned Ranges)')
plt.xlabel('Predicted Ranges')
plt.ylabel('Actual Ranges')
plt.show()

## XGBoost

In [None]:
from xgboost import XGBRegressor

xgb_model = XGBRegressor()

xgb_model.fit(X_train, y_train)
xgb_y_pred = xgb_model.predict(X_test)

results_df_xgb = X_test.copy()
results_df_xgb["y_real"] = y_test
results_df_xgb["y_pred"] = xgb_y_pred
results_df_xgb["err"] = results_df_xgb["y_real"] - results_df_xgb["y_pred"]
results_df_xgb["%_err"] = results_df_xgb["err"] / results_df_xgb["y_real"] * 100


results_df_xgb


In [None]:
print(f"RMSE: {mean_squared_error(y_test, xgb_y_pred)**0.5}")
print(f"MAPE: {mean_absolute_percentage_error(y_test, xgb_y_pred)}")
print(f"R^2: {r2_score(y_test, xgb_y_pred)}")

The XGBoost model performed better in terms of average deviation (MAPE), but it was still off by almost 100% on average. The RMSE rose to 5.18, from 4.63 for the linear model. The MAPE decreased slightly, from 106% to 92%. The R^2 score dropped by 10 percent.

In [None]:
def bin_values(values, bin_size):
    return np.floor(values / bin_size).astype(int)

bin_size = 5
y_test_binned = bin_values(y_test, bin_size)
xgb_y_pred_binned = bin_values(xgb_y_pred, bin_size)

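# Fix the bin order explicitly so the axis labels match the matrix rows and columns,
# even if some bin is empty or a prediction is negative
bin_ids = np.union1d(y_test_binned, xgb_y_pred_binned)
conf_matrix = confusion_matrix(y_test_binned, xgb_y_pred_binned, labels=bin_ids)

bin_labels = [f'{i*bin_size}-{(i+1)*bin_size}' for i in bin_ids]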
conf_matrix_df = pd.DataFrame(conf_matrix, index=bin_labels, columns=bin_labels)

plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix_df, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Confusion Matrix (Binned Ranges)')
plt.xlabel('Predicted Ranges')
plt.ylabel('Actual Ranges')
plt.show()

#### Correction for negative predictions (XGBoost Regression)

In [None]:
negative_values_exist = (results_df_xgb["y_pred"] < 0).any()

if negative_values_exist:
    print("There are negative values in the 'y_pred' column.")
else:
    print("There are no negative values in the 'y_pred' column.")

## Ridge Regression

In [None]:
from sklearn.linear_model import Ridge

ridge_model = Ridge()

ridge_model.fit(X_train, y_train)

ridge_y_pred = ridge_model.predict(X_test)

results_df_ridge = X_test.copy()
results_df_ridge["y_real"] = y_test
results_df_ridge["y_pred"] = ridge_y_pred
results_df_ridge["err"] = results_df_ridge["y_real"] - results_df_ridge["y_pred"]
results_df_ridge["%_err"] = results_df_ridge["err"] / results_df_ridge["y_real"] * 100

results_df_ridge

In [None]:
print(f"RMSE: {mean_squared_error(y_test, ridge_y_pred)**0.5}")
print(f"MAPE: {mean_absolute_percentage_error(y_test, ridge_y_pred)}")
print(f"R^2: {r2_score(y_test, ridge_y_pred)}")

In [None]:
def bin_values(values, bin_size):
    return np.floor(values / bin_size).astype(int)

bin_size = 5
y_test_binned = bin_values(y_test, bin_size)
ridge_y_pred_binned = bin_values(ridge_y_pred, bin_size)

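# Fix the bin order explicitly so the axis labels match the matrix rows and columns,
# even if some bin is empty or a prediction is negative
bin_ids = np.union1d(y_test_binned, ridge_y_pred_binned)
conf_matrix = confusion_matrix(y_test_binned, ridge_y_pred_binned, labels=bin_ids)

bin_labels = [f'{i*bin_size}-{(i+1)*bin_size}' for i in bin_ids]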
conf_matrix_df = pd.DataFrame(conf_matrix, index=bin_labels, columns=bin_labels)

plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix_df, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Confusion Matrix (Binned Ranges)')
plt.xlabel('Predicted Ranges')
plt.ylabel('Actual Ranges')
plt.show()

#### Correction for negative predictions (Ridge Regression)

In [None]:
negative_values_exist = (results_df_ridge["y_pred"] < 0).any()

if negative_values_exist:
    print("There are negative values in the 'y_pred' column.")
else:
    print("There are no negative values in the 'y_pred' column.")

In [None]:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score


y_train_log = np.log(y_train)


ridge_model = Ridge()
ridge_model.fit(X_train, y_train_log)


ridge_y_pred_log = ridge_model.predict(X_test)


ridge_y_pred = np.exp(ridge_y_pred_log)

rmse = np.sqrt(mean_squared_error(y_test, ridge_y_pred))
r2 = r2_score(y_test, ridge_y_pred)
mape = np.mean(np.abs((y_test - ridge_y_pred) / y_test)) 

print(f"RMSE: {rmse}")
print(f"MAPE: {mape}")
print(f"R^2: {r2}")


results_df_ridge = X_test.copy()
results_df_ridge["y_real"] = y_test
results_df_ridge["y_pred"] = ridge_y_pred
results_df_ridge["err"] = results_df_ridge["y_real"] - results_df_ridge["y_pred"]
results_df_ridge["%_err"] = results_df_ridge["err"] / results_df_ridge["y_real"] * 100

results_df_ridge


#### Testing 0-5 range values

In [None]:
range_min, range_max = 0, 5

def in_range(y_real, y_pred, range_min, range_max):
    # Half-open interval [range_min, range_max), so a boundary value falls in exactly one range
    return range_min <= y_real < range_max and range_min <= y_pred < range_max

results_df_ridge['y_real_in_range'] = results_df_ridge['y_real'].apply(lambda y: range_min <= y < range_max)

results_df_ridge['in_range'] = results_df_ridge.apply(lambda row: in_range(row['y_real'], row['y_pred'], range_min, range_max), axis=1)

total_y_real_in_range = results_df_ridge['y_real_in_range'].sum()

correct_predictions = results_df_ridge['in_range'].sum()

print(f'Number of Real values in the range {range_min}-{range_max}: {total_y_real_in_range}')
print(f'Number of correct predictions in the range {range_min}-{range_max}: {correct_predictions}')

#### All ranges visualized with a bar plot

In [None]:
def in_range(y_real, y_pred, range_min, range_max):
    # Half-open interval [range_min, range_max), so a boundary value falls in exactly one range
    return range_min <= y_real < range_max and range_min <= y_pred < range_max

results_list = []

for range_min, range_max in ranges:
    results_df_ridge['y_real_in_range'] = results_df_ridge['y_real'].apply(lambda y: range_min <= y < range_max)
    results_df_ridge['in_range'] = results_df_ridge.apply(lambda row: in_range(row['y_real'], row['y_pred'], range_min, range_max), axis=1)
    total_y_real_in_range = results_df_ridge['y_real_in_range'].sum()
    
    correct_predictions = results_df_ridge['in_range'].sum()

    results_list.append({
        'Range': f'{range_min}-{range_max}' if range_max != float('inf') else f'{range_min}+',
        'Total Real in range': total_y_real_in_range,
        'Correct Predictions': correct_predictions
    })

results_summary = pd.DataFrame(results_list)

print(results_summary)

results_summary.set_index('Range').plot(kind='bar', figsize=(10, 6))
plt.title('Number of Real values and Correct Predictions by Range')
plt.xlabel('Range')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

In [None]:
def bin_values(values, bin_size):
    return np.floor(values / bin_size).astype(int)

bin_size = 5
y_test_binned = bin_values(y_test, bin_size)
ridge_y_pred_binned = bin_values(ridge_y_pred, bin_size)

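# Fix the bin order explicitly so the axis labels match the matrix rows and columns,
# even if some bin is empty
bin_ids = np.union1d(y_test_binned, ridge_y_pred_binned)
conf_matrix = confusion_matrix(y_test_binned, ridge_y_pred_binned, labels=bin_ids)

bin_labels = [f'{i*bin_size}-{(i+1)*bin_size}' for i in bin_ids]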
conf_matrix_df = pd.DataFrame(conf_matrix, index=bin_labels, columns=bin_labels)

plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix_df, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Confusion Matrix (Binned Ranges)')
plt.xlabel('Predicted Ranges')
plt.ylabel('Actual Ranges')
plt.show()

#### Min Max Scale for Ridge Regression

In [None]:
y_train_log = np.log(y_train)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

ridge_model = Ridge()
ridge_model.fit(X_train_scaled, y_train_log)

ridge_y_pred_log = ridge_model.predict(X_test_scaled)

ridge_y_pred = np.exp(ridge_y_pred_log)

rmse = np.sqrt(mean_squared_error(y_test, ridge_y_pred))
r2 = r2_score(y_test, ridge_y_pred)
mape = np.mean(np.abs((y_test - ridge_y_pred) / y_test))

print(f"RMSE: {rmse}")
print(f"MAPE: {mape}")
print(f"R^2: {r2}")

results_df_ridge = X_test.copy()
results_df_ridge["y_real"] = y_test
results_df_ridge["y_pred"] = ridge_y_pred
results_df_ridge["err"] = results_df_ridge["y_real"] - results_df_ridge["y_pred"]
results_df_ridge["%_err"] = results_df_ridge["err"] / results_df_ridge["y_real"] * 100

results_df_ridge
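
Scaling matters for Ridge because the penalty acts on the coefficients, and the size of a coefficient depends on the scale of its feature. As a sketch, the scaler and the model can be chained in a pipeline so that the scaler is always fitted on the training rows only. The data below is synthetic, and the names `X_toy`, `y_toy` and `scaled_ridge` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)

# Synthetic features on very different scales (not the thesis data)
X_toy = np.column_stack([rng.uniform(20, 40, 120), rng.uniform(0.2, 0.45, 120)])
y_toy = np.exp(0.05 * X_toy[:, 0] + 4.0 * X_toy[:, 1] + rng.normal(0, 0.1, 120))

X_tr, X_te = X_toy[:90], X_toy[90:]
y_tr, y_te = y_toy[:90], y_toy[90:]

# The scaler is fitted on the training rows only, then reused for the test rows
scaled_ridge = make_pipeline(MinMaxScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(X_tr, np.log(y_tr))

# Back-transform the log predictions to the original scale
pred = np.exp(scaled_ridge.predict(X_te))
print(pred[:5])
```

This is equivalent to the manual `fit_transform` / `transform` calls above, with less room for accidentally fitting the scaler on the test set.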


In [None]:
def bin_values(values, bin_size):
    return np.floor(values / bin_size).astype(int)

bin_size = 5
y_test_binned = bin_values(y_test, bin_size)
ridge_y_pred_binned = bin_values(ridge_y_pred, bin_size)

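# Fix the bin order explicitly so the axis labels match the matrix rows and columns,
# even if some bin is empty
bin_ids = np.union1d(y_test_binned, ridge_y_pred_binned)
conf_matrix = confusion_matrix(y_test_binned, ridge_y_pred_binned, labels=bin_ids)

bin_labels = [f'{i*bin_size}-{(i+1)*bin_size}' for i in bin_ids]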
conf_matrix_df = pd.DataFrame(conf_matrix, index=bin_labels, columns=bin_labels)

plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix_df, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Confusion Matrix (Binned Ranges)')
plt.xlabel('Predicted Ranges')
plt.ylabel('Actual Ranges')
plt.show()