In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

# MLB DATA ANALYSIS #

## Authors: ##

# Introduction/Background: #

what are we looking at, why do we care, what are we interested in?

# Data #

where were the datasets obtained, and what did we do to the data?

In [None]:
contract_df = pd.read_csv('2025_contract_avg.csv')
show(contract_df)

This dataset lists the average annual contract value of every MLB player as of 2025. Average contract value is a useful metric because it reflects the full value of a contract rather than only the amount paid out in a given year. This matters for players like Shohei Ohtani, who has a 700 million dollar contract but has deferred most of it and is paid only 2 million per year. Contract average accounts for this and still shows that he is currently the highest-paid player in MLB. The data was obtained from spotrac.com's MLB Contract Average Rankings list. It was cleaned down to names and salaries, and we then added a categorical description of the salary range for extra clarity.
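
The salary-range column could be derived from the average contract value along these lines. This is a minimal sketch, not the cleaning code actually used: the sample figures are made up, the bin edges are inferred from the category labels used later in the notebook, and the choice to put a boundary value (for example exactly 10M) into the higher range is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up sample values for illustration only.
sample = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C', 'Player D'],
    'Salary': [760_000, 8_500_000, 25_000_000, 51_000_000],
})

# Bin edges in dollars, inferred from the category labels (an assumption).
edges = [0, 1e6, 10e6, 20e6, 30e6, 40e6, 50e6, np.inf]
labels = ['<1M', '1 - 10M', '10 - 20M', '20 - 30M', '30 - 40M', '40 - 50M', '>50M']

# right=False makes each bin include its lower edge and exclude its upper edge.
sample['salcat'] = pd.cut(sample['Salary'], bins=edges, labels=labels, right=False)
print(sample)
```

Using `pd.cut` with explicit labels yields a categorical column, which keeps the ranges in a fixed order when plotting.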

In [None]:
plus_df = pd.read_csv('plusstats.csv')
show(plus_df)

This dataset covers three main batting stats for MLB players in their "+" form, which adjusts for park factors, since some parks confer an advantage through their weather conditions or non-standard outfield dimensions. These stats are then normalized so that 100 is the league average, meaning a player above 100 is better than average and a player below 100 is worse. OBP+ is based on on-base percentage, which measures how consistently a player reaches base by a walk or a hit. SLG+ is based on slugging percentage, which measures the power of a player's hits by counting the bases earned per at-bat rather than just the number of successful hits. Finally, wRC+ estimates the runs a player creates for their team, crediting not just home runs but every way of getting on base and advancing runners.
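
The normalization to a league average of 100 can be illustrated with a simplified calculation. This sketch assumes the basic form 100 × (player rate / league rate) and leaves out the park adjustment entirely; the function name `plus_stat` and the numbers are hypothetical, and the published formulas may differ in detail.

```python
def plus_stat(player_rate: float, league_rate: float) -> float:
    """Scale a rate stat so that the league average maps to 100 (no park adjustment)."""
    return 100 * player_rate / league_rate

# Hypothetical numbers: a .380 OBP in a league averaging .320.
obp_plus = plus_stat(0.380, 0.320)
print(round(obp_plus))  # 119
```

Read this way, an OBP+ of 119 means the player reaches base roughly 19% more often than the league-average hitter.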

# Initial Exploration #

what did we do initially to analyze the dataset and get ideas?

(probably just do some value counts and such here)

also probably talk about how we decided what stats were good determiners here

# Proposed Questions #

1. How does age relate to player quality?
2. How does salary relate to player quality?
3. How does position relate to player quality?

# Analysis and Results #
just start dumping stuff in here, we can clean and explain later

## Salary ##

In [None]:
# Just merging the plus and salary dataframes for easier graphing
df_plussal = pd.merge(plus_df, contract_df, on = 'Name', how = 'left')

# Defining the order of categories to call on later for easier visuals
desired_order = ['<1M', '1 - 10M', '10 - 20M', '20 - 30M', '30 - 40M', '40 - 50M', '>50M']
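
Because this is a left join on `Name`, any batter whose name does not match exactly between the two files is kept but has missing salary fields. One way to check for that is sketched below on made-up stand-in frames (`plus_demo` and `contract_demo` are hypothetical, not the real data), using the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Made-up stand-ins for plus_df and contract_df.
plus_demo = pd.DataFrame({'Name': ['Player A', 'Player B', 'Player C'],
                          'OBP+': [110, 95, 102]})
contract_demo = pd.DataFrame({'Name': ['Player A', 'Player C'],
                              'Salary': [12_000_000, 800_000]})

# indicator=True adds a '_merge' column recording where each row was found.
merged = pd.merge(plus_demo, contract_demo, on='Name', how='left', indicator=True)

# Rows marked 'left_only' had no matching salary record.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Name'].tolist()
print(unmatched)  # ['Player B']
```

Unmatched players have no salary category, so it is worth knowing how many there are before reading the plots.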

### Salary vs. OBP+ ###

In [None]:
fig = px.scatter(df_plussal,
                 x = 'Salary',
                 y = 'OBP+',
                 color = 'salcat',
                 category_orders={'salcat': desired_order},
                 color_discrete_sequence = (['#D62728', '#FF7F0E', '#EECA3B', '#109618', '#4C78A8', '#9467BD', '#999999']),
                 hover_data = ['Name'])

fig.update_layout(width = 800,
                  height = 600,
                  legend_title = 'Average Contract Range',
                  title = 'Average Contract vs. OBP+',
                  xaxis_title = 'Average Contract Value')

fig.show()

In [None]:
fig = px.box(df_plussal,
             x = 'salcat',
             y = 'OBP+',
             color = 'salcat',
             category_orders={'salcat': desired_order},
             color_discrete_sequence = (['#D62728', '#FF7F0E', '#EECA3B', '#109618', '#4C78A8', '#9467BD', '#999999']))

fig.update_layout(width = 800,
                  height = 600,
                  legend_title = 'Average Contract Range',
                  title = 'Average Contract vs. OBP+',
                  xaxis_title = 'Average Contract Range')

fig.show()

Here we can see some increase in both mean and median OBP+ as salary rises. It is clearest at the very top, where the OBP+ of players earning 40 million or more sits visibly higher than the rest, while the differences among the lower ranges appear fairly marginal. The upper salary ranges also contain far fewer players, so we cannot rule out that those players are outliers.
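
The sample-size concern can be made concrete by tabulating the count, median, and mean per salary range. The sketch below uses a small made-up frame (`demo`) with the same column names as the merged data; the numbers are illustrative only and are not results.

```python
import pandas as pd

order = ['<1M', '1 - 10M', '10 - 20M']  # subset of the category order, for brevity
demo = pd.DataFrame({
    'salcat': ['<1M', '<1M', '<1M', '1 - 10M', '1 - 10M', '10 - 20M'],
    'OBP+':   [90, 100, 104, 98, 106, 115],
})

# An ordered categorical keeps the rows of the summary in salary order.
demo['salcat'] = pd.Categorical(demo['salcat'], categories=order, ordered=True)

summary = demo.groupby('salcat', observed=True)['OBP+'].agg(['count', 'median', 'mean'])
print(summary)
```

A range with only a handful of players in the `count` column is one where the median can swing a lot from a single contract.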

### Salary vs. SLG+ ###

In [None]:
fig = px.scatter(df_plussal,
                 x = 'Salary',
                 y = 'SLG+',
                 color = 'salcat',
                 category_orders={'salcat': desired_order},
                 color_discrete_sequence = (['#D62728', '#FF7F0E', '#EECA3B', '#109618', '#4C78A8', '#9467BD', '#999999']),
                 hover_data = ['Name'])

fig.update_layout(width = 800,
                  height = 600,
                  legend_title = 'Average Contract Range',
                  title = 'Average Contract vs. SLG+',
                  xaxis_title = 'Average Contract Value')

fig.show()

In [None]:
fig = px.box(df_plussal,
             x = 'salcat',
             y = 'SLG+',
             color = 'salcat',
             category_orders={'salcat': desired_order},
             color_discrete_sequence = (['#D62728', '#FF7F0E', '#EECA3B', '#109618', '#4C78A8', '#9467BD', '#999999']))

fig.update_layout(width = 800,
                  height = 600,
                  legend_title = 'Average Contract Range',
                  title = 'Average Contract vs. SLG+',
                  xaxis_title = 'Average Contract Range')

fig.show()

Here the increase in SLG+ with salary is much more noticeable, including across the lower salary ranges. Aside from <1M vs. 1-10M, the median increases with each step up in salary range. The spread of the points shows that a higher-paid player is not guaranteed to hit harder, but salary does seem to play some partial role in predicting SLG+.
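
One way to put a number on this "partial role" is a Spearman rank correlation, which measures how consistently one value rises with the other without assuming a straight-line relationship. The sketch below computes it as the Pearson correlation of the ranks, on made-up values; it is an illustration of the technique and not a result from the dataset.

```python
import pandas as pd

# Made-up values for illustration only.
demo = pd.DataFrame({
    'Salary': [0.8e6, 5e6, 12e6, 22e6, 35e6],
    'SLG+':   [95, 90, 105, 110, 120],
})

# Spearman correlation equals the Pearson correlation of the ranked values.
rho = demo['Salary'].rank().corr(demo['SLG+'].rank())
print(round(rho, 2))  # 0.9
```

A value near 1 would mean higher-paid players almost always slug better, while a value near 0 would mean salary says little about SLG+.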

### Salary vs. wRC+ ###

In [None]:
fig = px.scatter(df_plussal,
                 x = 'Salary',
                 y = 'wRC+',
                 color = 'salcat',
                 category_orders={'salcat': desired_order},
                 color_discrete_sequence = (['#D62728', '#FF7F0E', '#EECA3B', '#109618', '#4C78A8', '#9467BD', '#999999']),
                 hover_data = ['Name'])

fig.update_layout(width = 800,
                  height = 600,
                  legend_title = 'Average Contract Range',
                  title = 'Average Contract vs. wRC+',
                  xaxis_title = 'Average Contract Value')

fig.show()

In [None]:
fig = px.box(df_plussal,
             x = 'salcat',
             y = 'wRC+',
             color = 'salcat',
             category_orders={'salcat': desired_order},
             color_discrete_sequence = (['#D62728', '#FF7F0E', '#EECA3B', '#109618', '#4C78A8', '#9467BD', '#999999']))

fig.update_layout(width = 800,
                  height = 600,
                  legend_title = 'Average Contract Range',
                  title = 'Average Contract vs. wRC+',
                  xaxis_title = 'Average Contract Range')

fig.show()

With wRC+, there also appears to be a moderate increase in the range of values as you move up the salary ranges. Once again, a 30-40M player is far from guaranteed to have a higher wRC+ than a 20-30M player, but they do seem noticeably better than a <10M player.
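
How far from "guaranteed" this is can be expressed as the share of player pairs in which the higher-paid player has the higher wRC+. The sketch below compares every player in one bracket with every player in another using made-up values; the arrays `lower` and `higher` are hypothetical and do not come from the dataset.

```python
import numpy as np

# Made-up wRC+ values for two salary brackets.
lower = np.array([85, 95, 100, 110, 120])   # e.g. a 20-30M group
higher = np.array([90, 105, 115, 125])      # e.g. a 30-40M group

# Broadcasting builds a 4 x 5 grid of pairwise comparisons;
# the mean is the fraction of pairs won by the higher bracket.
wins = (higher[:, None] > lower[None, :]).mean()
print(wins)  # 0.65
```

A value of 0.5 would mean the two brackets are indistinguishable on this measure, and 1.0 would mean every higher-paid player outperforms every lower-paid one.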

# Conclusion #
what did we find made a difference, what didn't?