# Analysis of MLB Projections Data (via Steamer) from FanGraphs #

We'll start by analyzing hitters and seeing which ones are the best to target. 

The data comes from the FanGraphs auction calculator, configured to match the settings of the league I play in:

1. Salary cap of 270 dollars for your active roster
2. AL Only league with minimum bid of 1 dollar and roto point scoring format
3. Offensive Categories: RBI, R, SB, HR, OBP, TB
4. Pitching Categories: W, K, ERA, WHIP, HLD, SV
5. Roster Composition: P-10, C-2, 1B-1, 2B-1, 3B-1, SS-1, CI-1, MI-1, OF-5, DH-1
6. You do get 10 players as reserves but these are picked in a snake format after the auction draft and do not count towards your $270 cap.
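As a rough sanity check on these settings, the roster and cap figures can be turned into a few budget numbers. The sketch below is only arithmetic on the values listed above; the 12-team league size is taken from later in this notebook, and the variable names are purely illustrative.

```r
# Roster slots per team, copied from the league settings above
roster <- c(P = 10, C = 2, `1B` = 1, `2B` = 1, `3B` = 1,
            SS = 1, CI = 1, MI = 1, OF = 5, DH = 1)
cap <- 270
teams <- 12  # assumption: league size mentioned later in the notebook

active_slots <- sum(roster)                   # total active roster spots
hitter_slots <- active_slots - roster[["P"]]  # spots that are not pitchers
league_dollars <- cap * teams                 # total auction money in the league
avg_per_slot <- cap / active_slots            # average spend per active spot

print(c(active_slots = active_slots, hitter_slots = hitter_slots,
        league_dollars = league_dollars, avg_per_slot = avg_per_slot))
```

The average spend per active slot is a handy yardstick when reading the per-category dollar values that follow.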

NOTE: I did not factor in keepers even though this is a keeper league. I wanted to see what the calculator thinks a player is worth and then contrast with the keeper decisions guys in the league make as we head towards the draft.  

Let's prepare the requisite packages. 

In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Let's see how players are distributed in terms of RBI production, i.e., who can get me the most RBIs.

In [None]:
# Load necessary library
library(ggplot2)

# Load the data from the attached file
data <- read.csv("/kaggle/input/auction-calc-hitters/fangraphs-auction-calculator.csv")

# Ensure the mRBI column is numeric
data$mRBI <- as.numeric(data$mRBI)

# Calculate the percentiles
percentiles <- quantile(data$mRBI, probs = c(0.25, 0.50, 0.75, 1.00), na.rm = TRUE)

# Find the corresponding players for each percentile
players <- sapply(percentiles, function(score) {
  data[which.min(abs(data$mRBI - score)), "Name"]
})

# Combine players and scores
annotations <- data.frame(
  Player = players,
  Percentile = c("25th", "50th (Median)", "75th", "100th"),
  mRBI = as.numeric(percentiles)
)

# Create the boxplot
p <- ggplot(data, aes(y = mRBI)) +
  geom_boxplot(fill = "skyblue", color = "darkblue", outlier.color = "red") +
  theme_minimal() +
  labs(title = "Boxplot of mRBI with Player Annotations", y = "mRBI", x = "") +
  geom_text(
    data = annotations,
    aes(y = mRBI, x = 1, label = paste(Player, "\n", Percentile, "\n", "mRBI:", round(mRBI, 2))),
    color = "darkred",
    size = 3,
    hjust = 2
  )

# Print the plot
print(p)

Unsurprisingly, Aaron Judge projects to be the top RBI man, worth $5.8 on RBI production alone relative to a replacement-level player (that's what mRBI means).

Another interesting observation is that Matt Wallner ranks in the 75th percentile while adding just $0.17 in value from his RBI production relative to a hypothetical replacement. 

Let's see how many players contribute an mRBI rating of at least $1 and who they are.

In [None]:
# Load necessary library
library(dplyr)

# Filter players with mRBI >= 1 and sort in descending order
filtered_players <- data %>% 
  filter(mRBI >= 1) %>% 
  arrange(desc(mRBI)) %>% 
  select(Name, Team, POS, mRBI)

# Print the result
print(filtered_players)

Out of the set of 308 players, only 60 add at least $1 of value in RBIs. This means if you want success in the RBI category, there is a very narrow pool of players that you must target. 
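To make the size of that pool concrete, a small helper can report how many players clear a dollar threshold and what share of the pool they represent. This is a minimal sketch on made-up numbers; `value_pool` and the toy data frame are hypothetical and not part of the FanGraphs export.

```r
# Hypothetical helper: count players at or above a dollar threshold
value_pool <- function(df, col, threshold = 1) {
  values <- df[[col]]
  over <- sum(values >= threshold, na.rm = TRUE)  # players clearing the bar
  total <- sum(!is.na(values))                    # players with a value at all
  c(over = over, total = total, share = over / total)
}

# Toy data, for illustration only
toy <- data.frame(
  Name = c("A", "B", "C", "D", "E"),
  mRBI = c(5.8, 1.2, 0.17, -0.9, NA)
)
pool <- value_pool(toy, "mRBI")
print(pool)

# The same arithmetic with the counts reported above: 60 of 308 players
share_reported <- 60 / 308
print(round(share_reported, 3))
```

On the real data the call would be `value_pool(data, "mRBI")`, and the same helper works unchanged for the other category columns.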

But how do things break down by position? We'll plot a histogram to see.

In [None]:
#Modify the POS column to ignore everything after the first slash
filtered_players <- filtered_players %>%
  mutate(POS = gsub("/.*", "", POS))

# Create a histogram of the modified POS column
ggplot(filtered_players, aes(x = POS)) +
  geom_bar(fill = "blue", color = "black") +
  labs(title = "Number of Players by POS (Simplified)", x = "Position (POS)", y = "Count of Players") +
  theme_minimal()

As expected, there is more to be had among outfielders and first basemen. Note that only 5 catchers feature in the top 60. Since we require two catchers on our roster in this league, and there are 12 teams, it is very difficult to find even one who is a better than 75th percentile RBI producer. Having two would be a massive advantage, especially if they are keepers at solid price points. Middle infielders are also scarce here, though far less so than catchers. Still, GMs at the draft would likely be willing to pay a premium to secure RBI production at SS or 2B.
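The catcher squeeze can be put in numbers. The sketch below first shows what the `gsub("/.*", "", POS)` step does to multi-position strings (the example strings are made up), then compares the 5 catchers in the top 60 with the catcher slots a 12-team, two-catcher league has to fill.

```r
# Made-up position strings to show the primary-position extraction
pos_raw <- c("C/1B", "OF", "2B/SS/3B", "C")
pos_primary <- gsub("/.*", "", pos_raw)  # drop everything from the first slash on
print(table(pos_primary))

# Catcher demand versus supply, using the figures quoted in the text
teams <- 12
catchers_per_team <- 2
catcher_slots <- teams * catchers_per_team
catchers_in_top_60 <- 5
coverage <- catchers_in_top_60 / catcher_slots  # share of slots a top-60 catcher can fill
print(round(coverage, 2))
```

One caveat with this simplification: a player listed as `C/1B` is counted only as a catcher, so eligibility at secondary positions is not reflected in the bar charts.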


Now, let's look at total bases. 

In [None]:
# Ensure the mTB column is numeric
data$mTB <- as.numeric(data$mTB)

# Calculate the percentiles
percentiles <- quantile(data$mTB, probs = c(0.25, 0.50, 0.75, 1.00), na.rm = TRUE)

# Find the corresponding players for each percentile
players <- sapply(percentiles, function(score) {
  data[which.min(abs(data$mTB - score)), "Name"]
})

# Combine players and scores
annotations <- data.frame(
  Player = players,
  Percentile = c("25th", "50th (Median)", "75th", "100th"),
  mTB = as.numeric(percentiles)
)

# Create the boxplot
p_TB <- ggplot(data, aes(y = mTB)) +
  geom_boxplot(fill = "skyblue", color = "darkblue", outlier.color = "red") +
  theme_minimal() +
  labs(title = "Boxplot of mTB with Player Annotations", y = "mTB", x = "") +
  geom_text(
    data = annotations,
    aes(y = mTB, x = 1, label = paste(Player, "\n", Percentile, "\n", "mTB:", round(mTB, 2))),
    color = "darkred",
    size = 3,
    hjust = 2
  )
# Print the plot
print(p_TB)

So again, Aaron Judge leads the pack. Given that he's twice eclipsed the 50HR plateau, no surprise here. He projects at 47HR and adds $8.5 in value just from HR. 

Curiously, you just break even with Colt Keith (0.3mHR), even though he's projected at a healthy 16HR. In today's more homer-centric game, perhaps homers are depressed a bit in value by basic supply and demand. Still though, there's a gulf in class between most of the pack, and the elite sluggers in the game. 

So as we did with mRBI, who adds at least a dollar over replacement value in homers?

In [None]:
# Filter players with mTB >= 1 and sort in descending order
filtered_players_TB <- data %>% 
  filter(mTB >= 1) %>% 
  arrange(desc(mTB)) %>% 
  select(Name, Team, POS, mTB)

# Print the result
print(filtered_players_TB)

As you might expect, a lot of the same players appear from the RBI list, given that power hitters tend to be the great run producers. And yet, there's a smaller pool of players to choose from. Also note that Jordan Westburg is the last name on the list. Steamer projects him at 20HR exactly, i.e., to get $1 of value, you need at least a 20HR player, and there are only 55 to go around, not including keepers.

Again, we want to see how they break down by position, as this will matter a lot.

In [None]:
#Modify the POS column to ignore everything after the first slash
filtered_players_TB <- filtered_players_TB %>%
  mutate(POS = gsub("/.*", "", POS))

# Create a histogram of the modified POS column
ggplot(filtered_players_TB, aes(x = POS)) +
  geom_bar(fill = "blue", color = "black") +
  labs(title = "Number of Players by POS (Simplified)", x = "Position (POS)", y = "Count of Players") +
  theme_minimal()

Again, note the acute shortage of catchers and middle infielders (SS/2B). In fact, you have more outfielders to choose from than all these positions combined. This further reinforces the need to secure at least one solid hitting catcher, preferably two. If they're keepers, better still.

Let's see who contributes most in stolen bases. 

In [None]:
# Ensure the mSB column is numeric
data$mSB <- as.numeric(data$mSB)

# Calculate the percentiles
percentiles <- quantile(data$mSB, probs = c(0.25, 0.50, 0.75, 1.00), na.rm = TRUE)

# Find the corresponding players for each percentile
players <- sapply(percentiles, function(score) {
  data[which.min(abs(data$mSB - score)), "Name"]
})

# Combine players and scores
annotations <- data.frame(
  Player = players,
  Percentile = c("25th", "50th (Median)", "75th", "100th"),
  mSB = as.numeric(percentiles)
)

# Create the boxplot
p_SB <- ggplot(data, aes(y = mSB)) +
  geom_boxplot(fill = "skyblue", color = "darkblue", outlier.color = "red") +
  theme_minimal() +
  labs(title = "Boxplot of mSB with Player Annotations", y = "mSB", x = "") +
  geom_text(
    data = annotations,
    aes(y = mSB, x = 1, label = paste(Player, "\n", Percentile, "\n", "mSB:", round(mSB, 2))),
    color = "darkred",
    size = 3,
    hjust = 2
  )
# Print the plot
print(p_SB)

If he does indeed steal the 36 bases projected for him, Victor Robles will probably justify the 7.5 mSB valuation, but playing time is a concern given his historically weak hitting. The more salient observation here is the lack of quality base stealers available. This justifies what we often say in our league: you have to buy SB if you want to do well in that category. Anyone offering an mSB of even 0 is basically an outlier, and to reach an mSB of 2 you need an extreme outlier. That's just how baseball is these days: base stealers are hard to find.
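The outlier language can be checked against the usual boxplot rule: by default, `geom_boxplot()` draws points beyond 1.5 times the interquartile range from the box as outliers. The sketch below applies that rule by hand to a made-up vector of mSB-like values; the numbers are illustrative and are not taken from the FanGraphs data.

```r
# Toy mSB-like values: most players cluster below zero, a few stand out
msb <- c(-1.2, -1.1, -1.0, -1.0, -0.9, -0.9, -0.8, -0.7, 0.1, 2.3, 7.5)

# Quartiles and interquartile range
q <- quantile(msb, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Upper fence under the 1.5 * IQR rule; values above it count as outliers
upper_fence <- q[2] + 1.5 * iqr
outliers <- msb[msb > upper_fence]
print(outliers)
```

On the real data, the same steps would run on `data$mSB` with `na.rm = TRUE` passed to `quantile()`, which gives the dollar value above which a base stealer counts as an outlier.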

In [None]:
# Filter players with mSB >= 1 and sort in descending order
filtered_players_SB <- data %>% 
  filter(mSB >= 1) %>% 
  arrange(desc(mSB)) %>% 
  select(Name, Team, POS, mSB)

# Print the result
print(filtered_players_SB)

Reinforcing the earlier point, only 41 of the 308 players would net you even $1 of value in SB, and in some cases that's contingent on them playing enough to do so. Some of these players are still fighting for full-time jobs out of camp.

Let's see the breakdown by position now.

In [None]:
#Modify the POS column to ignore everything after the first slash
filtered_players_SB <- filtered_players_SB %>%
  mutate(POS = gsub("/.*", "", POS))

# Create a histogram of the modified POS column
ggplot(filtered_players_SB, aes(x = POS)) +
  geom_bar(fill = "blue", color = "black") +
  labs(title = "Number of Players by POS (Simplified)", x = "Position (POS)", y = "Count of Players") +
  theme_minimal()

I don't believe anyone will be surprised by the fact that there are no first basemen nor catchers among the stolen base leaders. Outfielders are more plentiful, but judging by some of the names, not all are exactly sure bets. Middle infielders have a little depth re: SB, and for 3B, well, there's Jose Ramirez of course. As the guys in my league say, you gotta buy stolen bases. There's just not much to go around, especially when you factor in keepers. 

Now let's look at OBP. Same process. 

In [None]:
# Ensure the mOBP column is numeric
data$mOBP <- as.numeric(data$mOBP)

# Calculate the percentiles
percentiles <- quantile(data$mOBP, probs = c(0.25, 0.50, 0.75, 1.00), na.rm = TRUE)

# Find the corresponding players for each percentile
players <- sapply(percentiles, function(score) {
  data[which.min(abs(data$mOBP - score)), "Name"]
})

# Combine players and scores
annotations <- data.frame(
  Player = players,
  Percentile = c("25th", "50th (Median)", "75th", "100th"),
  mOBP = as.numeric(percentiles)
)

# Create the boxplot
p_OBP <- ggplot(data, aes(y = mOBP)) +
  geom_boxplot(fill = "skyblue", color = "darkblue", outlier.color = "red") +
  theme_minimal() +
  labs(title = "Boxplot of mOBP with Player Annotations", y = "mOBP", x = "") +
  geom_text(
    data = annotations,
    aes(y = mOBP, x = 1, label = paste(Player, "\n", Percentile, "\n", "mOBP:", round(mOBP, 2))),
    color = "darkred",
    size = 3,
    hjust = 2
  )
# Print the plot
print(p_OBP)

Unsurprisingly, OBP is also dominated by Aaron Judge. Also note that you need to go to the 75th percentile (Nick Loftin) just to break even in terms of value. Now let's see which players get us at least 1 mOBP.

In [None]:
# Filter players with mOBP >= 1 and sort in descending order
filtered_players_OBP <- data %>% 
  filter(mOBP >= 1) %>% 
  arrange(desc(mOBP)) %>% 
  select(Name, Team, POS, mOBP)

# Print the result
print(filtered_players_OBP)

Given the state of our game today, is it really surprising that there are fewer than 40 guys who add even $1 in value to your OBP? OBPs are low across baseball now, given the emphasis on power and the correspondingly high strikeout rates. Guys also don't walk as much.

Alright, so the breakdown by position?

In [None]:
#Modify the POS column to ignore everything after the first slash
filtered_players_OBP <- filtered_players_OBP %>%
  mutate(POS = gsub("/.*", "", POS))

# Create a histogram of the modified POS column
ggplot(filtered_players_OBP, aes(x = POS)) +
  geom_bar(fill = "blue", color = "black") +
  labs(title = "Number of Players by POS (Simplified)", x = "Position (POS)", y = "Count of Players") +
  theme_minimal()

OBP is generally a more position agnostic stat, but given the demands of the position, again, catchers are a rarity. Good hitting catchers are hard to find, period.

Now, let's bring it all together. I think if you really want to get max value from a given player, you need the best combo of OBP, TB and SB. SB are self-explanatory. TB takes care of HR and probably R and RBI as well. Also, you generally want good hitters who get on base consistently and therefore have a better chance of scoring. Think Moneyball.

Below is a plot of the top 25 players who best combine these three metrics: mSB, mOBP, mTB. The plot is interactive, so you can hover over a circle to see each player. You can also drag to rotate it and make one metric or another easier to see.

In [None]:
# Rank players by the sum of mTB, mOBP, and mSB and keep the top 25
top_players <- data %>%
  mutate(Combined_Score = mTB + mOBP + mSB) %>%
  arrange(desc(Combined_Score)) %>%
  slice(1:25) %>%
  select(Name, mTB, mOBP, mSB)

# plot_ly() comes from the plotly package, which must be loaded first
library(plotly)

# Create a 3D scatter plot with interactive tooltips to reduce clutter
plot <- plot_ly(
  data = top_players,
  x = ~mTB,
  y = ~mOBP,
  z = ~mSB,
  type = "scatter3d",
  mode = "markers",
  text = ~paste(Name, "<br>mTB:", mTB, "<br>mOBP:", mOBP, "<br>mSB:", mSB),
  marker = list(size = 4, color = 'blue'),
  hoverinfo = "text" # Show names and values in tooltips
) %>%
  layout(
    title = "Top 25 Players: mTB, mOBP, mSB",
    scene = list(
      xaxis = list(title = "mTB"),
      yaxis = list(title = "mOBP"),
      zaxis = list(title = "mSB")
    )
  )

# Show the plot
plot

Let's list them too in case the plot is cumbersome.

In [None]:
print(top_players)