# wOBA (Weighted On-Base Average) is a metric designed to measure a hitter’s offensive productivity. Unlike on-base percentage, which treats all times on base equally, wOBA assigns a run-value-based weight to each offensive event. These weights are recalculated each season according to the league run environment. Let’s build a linear regression model to estimate 2024 wOBA from some simple statistics.
* wOBA = (uBB factor × uBB + HBP factor × HBP + 1B factor × 1B + 2B factor × 2B + 3B factor × 3B + HR factor × HR) / (AB + uBB + SF + HBP), where uBB is unintentional walks
* Reference: Baseball Savant
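
As a worked illustration of the formula above, here is a minimal sketch that computes wOBA from counting stats. The function name `woba`, the `WEIGHTS` dictionary, and the sample stat line are hypothetical. The weight values are round placeholders, not the official 2024 factors, which should be taken from the published season constants.

```python
# Placeholder weights for illustration only; NOT the official 2024 factors.
WEIGHTS = {"uBB": 0.70, "HBP": 0.70, "1B": 0.90, "2B": 1.25, "3B": 1.60, "HR": 2.00}


def woba(ubb, hbp, singles, doubles, triples, hr, ab, sf, weights=WEIGHTS):
    """Weighted sum of offensive events divided by AB + uBB + SF + HBP."""
    numerator = (
        weights["uBB"] * ubb
        + weights["HBP"] * hbp
        + weights["1B"] * singles
        + weights["2B"] * doubles
        + weights["3B"] * triples
        + weights["HR"] * hr
    )
    denominator = ab + ubb + sf + hbp
    return numerator / denominator


# Hypothetical stat line (not a real player)
example = woba(ubb=50, hbp=5, singles=100, doubles=30, triples=5, hr=20, ab=550, sf=5)
print(round(example, 3))  # 214 / 610 -> 0.351
```

Passing the weights as a parameter keeps the function reusable across seasons, since the factors change every year.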


Data preprocessing

In [52]:
# Load the CSV file with pd.read_csv()
import pandas as pd
df = pd.read_csv('2024stats.csv')
df.head()

Unnamed: 0,"last_name, first_name",player_id,year,player_age,ab,hit,single,double,triple,home_run,strikeout,walk,batting_avg,slg_percent,on_base_percent,b_rbi,r_total_stolen_base,woba,n_outs_above_average,sprint_speed
0,"Henderson, Gunnar",683002,2024,23,630,177,102,31,7,37,159,78,0.281,0.529,0.364,92,21,0.381,,28.9
1,"Bohm, Alec",664761,2024,27,554,155,94,44,2,15,86,40,0.28,0.448,0.332,97,5,0.335,,26.3
2,"Mountcastle, Ryan",663624,2024,27,473,128,83,30,2,13,114,27,0.271,0.425,0.308,63,3,0.316,,27.6
3,"Contreras, William",661388,2024,26,595,167,105,37,2,23,139,78,0.281,0.466,0.365,92,9,0.359,,26.4
4,"Suárez, Eugenio",553993,2024,32,571,146,86,28,2,30,176,49,0.256,0.469,0.319,101,2,0.337,,26.5


In [53]:
# Rename the columns to short, conventional abbreviations
df.columns = ['Player','ID','Year','Age','AB','H','1B','2B','3B','HR','SO','BB','AVG','SLG','OBP','RBI','SB','wOBA','OAA','Sprint_Speed']
df.head(1)

Unnamed: 0,Player,ID,Year,Age,AB,H,1B,2B,3B,HR,SO,BB,AVG,SLG,OBP,RBI,SB,wOBA,OAA,Sprint_Speed
0,"Henderson, Gunnar",683002,2024,23,630,177,102,31,7,37,159,78,0.281,0.529,0.364,92,21,0.381,,28.9
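
Assigning a full list to `df.columns` depends on the column order of the CSV. A minimal alternative sketch, assuming the original column names shown in the first output, maps each old name to its new one explicitly. The two-row `raw` frame and the `RENAME_MAP` name are illustrative only.

```python
import pandas as pd

# Tiny stand-in for the loaded CSV, using a few of the original column names
raw = pd.DataFrame(
    {
        "last_name, first_name": ["Henderson, Gunnar", "Bohm, Alec"],
        "player_age": [23, 27],
        "home_run": [37, 15],
        "woba": [0.381, 0.335],
    }
)

# Explicit old -> new mapping; columns not listed would keep their names
RENAME_MAP = {
    "last_name, first_name": "Player",
    "player_age": "Age",
    "home_run": "HR",
    "woba": "wOBA",
}
renamed = raw.rename(columns=RENAME_MAP)
print(renamed.columns.tolist())  # ['Player', 'Age', 'HR', 'wOBA']
```

With a mapping, a reordered or extended CSV still gets the right labels, whereas a positional list would silently mislabel columns.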


In [54]:
# Check the size of the table
# 129 hitters reached the qualifying number of plate appearances in 2024; the table has 20 columns.
print('The shape of DataFrame: ',df.shape)

The shape of DataFrame:  (129, 20)


In [55]:
# Inspect the column dtypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129 entries, 0 to 128
Data columns (total 20 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Player        129 non-null    object 
 1   ID            129 non-null    int64  
 2   Year          129 non-null    int64  
 3   Age           129 non-null    int64  
 4   AB            129 non-null    int64  
 5   H             129 non-null    int64  
 6   1B            129 non-null    int64  
 7   2B            129 non-null    int64  
 8   3B            129 non-null    int64  
 9   HR            129 non-null    int64  
 10  SO            129 non-null    int64  
 11  BB            129 non-null    int64  
 12  AVG           129 non-null    float64
 13  SLG           129 non-null    float64
 14  OBP           129 non-null    float64
 15  RBI           129 non-null    int64  
 16  SB            129 non-null    int64  
 17  wOBA          129 non-null    float64
 18  OAA           51 non-null     

In [56]:
# OAA is the only column with missing values (51 of 129 non-null), so decide how to handle it.
# The goal is to predict wOBA, and OAA is a defensive metric, so drop the column.
df.drop('OAA',axis=1,inplace=True)
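
Before dropping a column, it can help to count the missing values in each one. The sketch below uses a three-row stand-in frame (`sample` is a hypothetical name) built from rows shown earlier, where OAA is empty, and uses the non-mutating `drop(columns=...)` form instead of `inplace=True`.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {
        "Player": ["Henderson, Gunnar", "Bohm, Alec", "Mountcastle, Ryan"],
        "wOBA": [0.381, 0.335, 0.316],
        "OAA": [np.nan, np.nan, np.nan],  # empty in the rows shown earlier
    }
)

# Count missing values per column
missing = sample.isna().sum()
print(missing)

# Return a new frame without OAA instead of modifying the original
cleaned = sample.drop(columns=["OAA"])
print(cleaned.columns.tolist())  # ['Player', 'wOBA']
```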

In [57]:
# Summary statistics for the numeric columns with describe()
df.describe()

Unnamed: 0,ID,Year,Age,AB,H,1B,2B,3B,HR,SO,BB,AVG,SLG,OBP,RBI,SB,wOBA,Sprint_Speed
count,129.0,129.0,129.0,129.0,129.0,129.0,129.0,129.0,129.0,129.0,129.0,129.0,129.0,129.0,129.0,129.0,129.0,129.0
mean,643345.418605,2024.0,27.728682,545.976744,141.317829,89.573643,27.620155,2.449612,21.674419,128.48062,53.503876,0.258147,0.435922,0.328845,75.178295,12.736434,0.330783,27.434884
std,51254.830162,0.0,3.652379,53.492182,22.405351,17.029283,6.985693,2.588922,9.134552,33.990371,18.710844,0.025159,0.061989,0.02986,18.977556,12.610462,0.032852,1.30741
min,457705.0,2024.0,20.0,430.0,96.0,52.0,11.0,0.0,2.0,29.0,15.0,0.196,0.331,0.27,32.0,0.0,0.27,24.5
25%,621566.0,2024.0,25.0,501.0,128.0,78.0,23.0,1.0,16.0,103.0,41.0,0.242,0.394,0.312,62.0,3.0,0.314,26.5
50%,663993.0,2024.0,27.0,550.0,139.0,90.0,27.0,2.0,20.0,127.0,52.0,0.256,0.428,0.325,74.0,9.0,0.326,27.4
75%,676475.0,2024.0,30.0,584.0,155.0,101.0,31.0,3.0,26.0,156.0,65.0,0.275,0.464,0.342,86.0,19.0,0.342,28.5
max,701538.0,2024.0,39.0,671.0,211.0,161.0,48.0,14.0,58.0,218.0,133.0,0.332,0.701,0.458,144.0,67.0,0.476,30.5


In [59]:
# Find the AVG, HR, and wOBA leaders - boolean indexing returns a Series, so values[0] extracts just the name
print('2024 Batting Leader: ', df[df['AVG']==df['AVG'].max()]['Player'].values[0])
print('2024 Homerun Leader: ', df[df['HR']==df['HR'].max()]['Player'].values[0])
print('2024   wOBA  Leader: ', df[df['wOBA']==df['wOBA'].max()]['Player'].values[0])

2024 Batting Leader:  Witt Jr., Bobby
2024 Homerun Leader:  Judge, Aaron
2024   wOBA  Leader:  Judge, Aaron
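
An equivalent way to look up a leader is `idxmax`, which returns the index label of the first maximum. This is a sketch on a three-row subset (`top` and `leaders` are illustrative names) using values that appear in this notebook's tables, not a replacement for the cell above.

```python
import pandas as pd

top = pd.DataFrame(
    {
        "Player": ["Judge, Aaron", "Ohtani, Shohei", "Witt Jr., Bobby"],
        "AVG": [0.322, 0.310, 0.332],
        "HR": [58, 54, 32],
        "wOBA": [0.476, 0.431, 0.410],
    }
)

# idxmax gives the row label of the maximum; .loc then fetches the player's name
leaders = {stat: top.loc[top[stat].idxmax(), "Player"] for stat in ["AVG", "HR", "wOBA"]}
print(leaders)
```

Note that both approaches report only one player when there is a tie for the lead.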


In [60]:
# Find top 5 wOBA list - using sort_values
df.sort_values('wOBA',ascending = False).head(5)

Unnamed: 0,Player,ID,Year,Age,AB,H,1B,2B,3B,HR,SO,BB,AVG,SLG,OBP,RBI,SB,wOBA,Sprint_Speed
7,"Judge, Aaron",592450,2024,32,559,180,85,36,1,58,171,133,0.322,0.701,0.458,144,10,0.476,26.8
71,"Ohtani, Shohei",660271,2024,29,636,197,98,38,7,54,162,81,0.31,0.646,0.39,130,59,0.431,28.1
43,"Soto, Juan",665742,2024,25,576,166,90,31,4,41,119,129,0.288,0.569,0.419,109,7,0.421,26.8
123,"Witt Jr., Bobby",677951,2024,24,636,211,123,45,11,32,106,57,0.332,0.588,0.389,109,31,0.41,30.5
10,"Alvarez, Yordan",670541,2024,27,552,170,99,34,2,35,95,69,0.308,0.567,0.392,86,6,0.402,26.1


In [61]:
# Mean AVG and wOBA among qualified hitters
round(df[['AVG','wOBA']].mean() ,3)

Unnamed: 0,0
AVG,0.258
wOBA,0.331


In [63]:
# Make new column - (OPS = SLG+OBP)
# Find top 5 OPS list
df['OPS'] = df['SLG']+df['OBP']
df.sort_values('OPS',ascending = False).head()

Unnamed: 0,Player,ID,Year,Age,AB,H,1B,2B,3B,HR,SO,BB,AVG,SLG,OBP,RBI,SB,wOBA,Sprint_Speed,OPS
7,"Judge, Aaron",592450,2024,32,559,180,85,36,1,58,171,133,0.322,0.701,0.458,144,10,0.476,26.8,1.159
71,"Ohtani, Shohei",660271,2024,29,636,197,98,38,7,54,162,81,0.31,0.646,0.39,130,59,0.431,28.1,1.036
43,"Soto, Juan",665742,2024,25,576,166,90,31,4,41,119,129,0.288,0.569,0.419,109,7,0.421,26.8,0.988
123,"Witt Jr., Bobby",677951,2024,24,636,211,123,45,11,32,106,57,0.332,0.588,0.389,109,31,0.41,30.5,0.977
10,"Alvarez, Yordan",670541,2024,27,552,170,99,34,2,35,95,69,0.308,0.567,0.392,86,6,0.402,26.1,0.959


In [64]:
# Correlation of each numeric feature with wOBA
df1=df[['Age','AB','H','1B','2B','3B','HR','SO','BB','AVG','SLG','OBP','RBI','SB','OPS','Sprint_Speed','wOBA']]
df1.corr()['wOBA']

Unnamed: 0,wOBA
Age,0.179275
AB,0.221182
H,0.599068
1B,0.165524
2B,0.477555
3B,0.119375
HR,0.761774
SO,0.098029
BB,0.620523
AVG,0.735981


In [65]:
# The correlation analysis shows that SLG, OBP, and OPS (SLG + OBP) are strongly correlated with wOBA

# To predict wOBA, split the data into training and test sets with model_selection.train_test_split
X=df[['Age','AB','H','1B','2B','3B','HR','SO','BB','AVG','SLG','OBP','RBI','SB','OPS']] # Sprint_Speed is left out of the features
y=df['wOBA']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [68]:
# Fit a LinearRegression model (sklearn.linear_model.LinearRegression)
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train,y_train)

In [69]:
# Three common metrics for regression quality: MSE, MAE, and R² (from sklearn.metrics)
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Predict wOBA for the test set with the fitted model
y_pred = model.predict(X_test)

# Compare the test-set predictions (y_pred) with the actual wOBA values (y_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MSE : ",mse)
print("MAE : ",mae)
print("r2  : ",r2)

MSE :  4.343081825923922e-06
MAE :  0.001479041911444889
r2  :  0.9975522896491291
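
A single 80/20 split on 129 rows gives a fairly noisy score, so k-fold cross-validation is one way to get a steadier estimate. The sketch below runs on synthetic data, not the real table: the SLG and OBP means and standard deviations are taken from the `describe()` output, while the coefficients (0.03, 0.3, 0.6) and the noise level are arbitrary assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_players = 129

# Synthetic stand-ins for SLG and OBP (means/stds from the describe() output)
slg = rng.normal(0.436, 0.062, n_players)
obp = rng.normal(0.329, 0.030, n_players)
X_demo = np.column_stack([slg, obp])

# Made-up linear relationship plus small noise; NOT the fitted model above
y_demo = 0.03 + 0.3 * slg + 0.6 * obp + rng.normal(0, 0.002, n_players)

# 5-fold cross-validation: each fold serves once as the held-out test set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")
print("R2 per fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

On the real data, the same call would take the notebook's `X` and `y` in place of `X_demo` and `y_demo`.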


In [76]:
# The fit is very good, but the model could be overfitted because the dataset is small (129 players).
# Compare actual and predicted values (X includes the training rows, so this is not an out-of-sample check)
df['y_pred']=model.predict(X)
df[['Player','wOBA','y_pred']].sort_values('wOBA',ascending=False).head(10)

Unnamed: 0,Player,wOBA,y_pred
7,"Judge, Aaron",0.476,0.483126
71,"Ohtani, Shohei",0.431,0.43202
43,"Soto, Juan",0.421,0.419947
123,"Witt Jr., Bobby",0.41,0.410723
10,"Alvarez, Yordan",0.402,0.40604
81,"Guerrero Jr., Vladimir",0.398,0.399876
64,"Ozuna, Marcell",0.395,0.392587
90,"Rooker Jr., Brent",0.392,0.390207
58,"Marte, Ketel",0.391,0.391918
0,"Henderson, Gunnar",0.381,0.380352
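
To make the three metrics concrete, the sketch below recomputes them by hand with NumPy for the ten rows in the table above. Because these rows include players from the training set, the numbers illustrate the formulas rather than out-of-sample performance. The array names are illustrative.

```python
import numpy as np

# Actual wOBA and model predictions for the ten rows in the table above
actual = np.array([0.476, 0.431, 0.421, 0.410, 0.402, 0.398, 0.395, 0.392, 0.391, 0.381])
predicted = np.array([0.483126, 0.43202, 0.419947, 0.410723, 0.40604,
                      0.399876, 0.392587, 0.390207, 0.391918, 0.380352])

residuals = actual - predicted

# MSE: mean of squared residuals; MAE: mean of absolute residuals
mse = np.mean(residuals ** 2)
mae = np.mean(np.abs(residuals))

# R2: 1 minus (residual sum of squares / total sum of squares around the mean)
r2 = 1 - np.sum(residuals ** 2) / np.sum((actual - actual.mean()) ** 2)

print("MSE:", mse)
print("MAE:", mae)
print("r2 :", r2)
```

The largest residual in this subset belongs to Judge, whose 0.476 wOBA sits at the extreme of the data.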


In [77]:
# Find coefficient and intercept
print('model.coefficient:', model.coef_)
print('model.intercept  :', model.intercept_)

model.coefficient: [ 8.54467755e-07 -1.89921213e-05  2.07881058e-04 -1.87397596e-04
 -3.35439328e-05  8.90805449e-05  3.39742043e-04  9.03057297e-06
 -9.51101249e-05 -2.18016373e-02 -4.43947613e-02  3.28314889e-01
 -3.26230943e-05 -1.27380320e-05  2.83920127e-01]
model.intercept  : 0.0282786468417644
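
One caveat when reading these coefficients: the feature set contains exact linear dependencies (OPS = SLG + OBP, and H = 1B + 2B + 3B + HR). In that situation the predictions are still well defined, but the individual coefficients are not unique; the solver returns one of many coefficient vectors that fit equally well. The sketch below checks both dependencies with NumPy on the five players from the top-5 table. The array names are illustrative.

```python
import numpy as np

# SLG and OBP for the five players in the top-5 OPS table
slg = np.array([0.701, 0.646, 0.569, 0.588, 0.567])
obp = np.array([0.458, 0.390, 0.419, 0.389, 0.392])
ops = slg + obp  # OPS is defined as SLG + OBP

# Three columns, but the third is the sum of the first two
features = np.column_stack([slg, obp, ops])
rank = np.linalg.matrix_rank(features)
print(features.shape[1], rank)  # 3 columns but rank 2

# 1B, 2B, 3B, HR and H for the same players: H equals the sum of the hit types
hit_types = np.array([[85, 36, 1, 58], [98, 38, 7, 54], [90, 31, 4, 41],
                      [123, 45, 11, 32], [99, 34, 2, 35]])
hits = np.array([180, 197, 166, 211, 170])
hits_match = np.array_equal(hit_types.sum(axis=1), hits)
print(hits_match)  # True
```

Dropping one column from each dependent group (for example OPS and H) would be one way to make the coefficients interpretable.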


# Result
The model's accuracy was very high (R² ≈ 0.998 on the test set), which reflects the strong correlation between wOBA and OPS (SLG + OBP). Because the dataset is small (129 players), this score may overstate how well the model generalizes. This project did not involve feature selection or preprocessing steps such as scaling, as the main focus was on building a simple model. The result reaffirms that wOBA effectively reflects modern baseball's emphasis on slugging and getting on base.