# Initial MLB Trade Predictor EDA

The idea for my product is a tool that predicts how likely a player is to be traded during the coming offseason and season.

*Alternatively, it might be equally (or more) useful to predict this on a daily basis. However, that would mean finding a signal of ~50 traded players in ~100,000 data points, a base rate of ~0.05%. I don't think that's as doable.*

For predictor variables, I expect to use:

* Player's stats from the previous season
* Player's salary (and/or portion of team's total salary)
* Player's age
* Team's record the past X seasons (maybe X is 3-5?)
* Team's total salary commitment

Some other features that might be helpful (but very possibly not):

* Population of team's immediate area (proxy for attendance)
* Team annual/average attendance for home games
* Appearance in "likely to be traded" articles or lists
* Appearance in "Team X should trade for player Y" articles
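As one example of turning these ideas into features, the "team's record the past X seasons" item could be computed as a rolling winning percentage. This is only a minimal sketch on toy data: the `W` and `L` columns, the team IDs, and all values are assumptions for illustration, and `X = 3` is just one of the candidate window sizes.

```python
import pandas as pd

# Toy data: one row per team-season. W and L are assumed column names and
# the values are made up, not real standings.
records = pd.DataFrame({
    "teamID": ["AAA"] * 4 + ["BBB"] * 4,
    "yearID": [2014, 2015, 2016, 2017] * 2,
    "W": [90, 80, 70, 60, 60, 70, 80, 90],
    "L": [72, 82, 92, 102, 102, 92, 82, 72],
})

X = 3  # number of previous seasons to average over

records = records.sort_values(["teamID", "yearID"])
records["win_pct"] = records["W"] / (records["W"] + records["L"])

# shift(1) drops the current season, so each row only sees earlier seasons
records["past_win_pct"] = (
    records.groupby("teamID")["win_pct"]
    .transform(lambda s: s.shift(1).rolling(X, min_periods=1).mean())
)
print(records[["teamID", "yearID", "win_pct", "past_win_pct"]])
```

The first season for each team has no history, so its `past_win_pct` is missing and would need to be dropped or imputed before modeling.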

# Alternative idea: Predict free agent locations

Alternatively, every offseason there are ~100-200 major league free agents (not counting players from other leagues). Using similar features (statistics from previous year(s), age, team payrolls from the previous couple of years, and position), could we predict where free agents end up, or how large a contract they sign? Specifically, predict one of:

* Team to sign him
* How many years
* How much total money
* How much AAV

Reasons this is better:

* Free agents only happen in the offseason; very defined time
* Pool of free agents is a small, defined group of ~100-200 players
* A fair number of people write columns about who will go where
* Might be a lower bar for success; random guessing would get ~1/30 (~3.3%) correct, versus having to beat ~95% for the trade predictor

Reasons this is worse:

* Not as much data to work with; rather than ~900 trade candidates, there are only ~200
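The two bars for success can be worked out from the rough counts quoted in this notebook (30 teams, ~50 trades, ~900 trade candidates). Reading the ~95% figure as the accuracy of always guessing "not traded" is an assumption on my part, and the counts are estimates rather than measured values.

```python
# Rough figures quoted in this notebook (estimates, not measured counts)
n_teams = 30
trade_candidates = 900
traded = 50

# Free agent idea: picking a signing team uniformly at random
random_team_accuracy = 1 / n_teams

# Trade idea: always predicting "not traded" (assumed meaning of the ~95%)
always_no_trade_accuracy = 1 - traded / trade_candidates

print(f"Random team guess:       {random_team_accuracy:.1%}")
print(f"Always 'not traded':     {always_no_trade_accuracy:.1%}")
```

Accuracy is a weak yardstick for the trade problem for exactly this reason; a metric such as precision or recall on the traded class would say more.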

In [1]:
# Load necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import re
from bs4 import BeautifulSoup

## Task 1: Pull down leaderboard data:

Perhaps I'll pull down leaderboard data from Fangraphs (.csv). The link to the leaderboard itself is:

http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2017&month=0&season1=2017&ind=0

That also requires clicking to download, and the DOWNLOAD button includes this URL info:

"javascript:__doPostBack('LeaderBoard1$cmdCSV','')"

Alternatively, there's the Lahman Database, which aggregates these data all the way back to 1871 and has a .CSV link here:

http://seanlahman.com/files/database/baseballdatabank-2017.1.zip

It also has a SQL link here:

http://seanlahman.com/files/database/lahman2016-sql.zip

There's also a GitHub version...
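If I went with the .CSV zip instead of a local clone, a table could be read straight out of the archive without unpacking it. This is a sketch under assumptions: the helper names are hypothetical, I don't know the folder layout inside the zip (so the lookup matches on the file name at any depth), and the download function is defined but not called because the link may have moved.

```python
import io
import urllib.request
import zipfile

import pandas as pd

LAHMAN_ZIP_URL = "http://seanlahman.com/files/database/baseballdatabank-2017.1.zip"


def read_csv_from_zip(zip_bytes, filename):
    """Load the archive member named `filename` (in any folder) as a DataFrame."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        matches = [
            name for name in archive.namelist()
            if name == filename or name.endswith("/" + filename)
        ]
        if not matches:
            raise FileNotFoundError(f"{filename} not found in archive")
        with archive.open(matches[0]) as handle:
            return pd.read_csv(handle)


def download_lahman_table(filename, url=LAHMAN_ZIP_URL):
    """Download the zip and load one table (hypothetical helper, not run here)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return read_csv_from_zip(response.read(), filename)
```

Usage would look like `download_lahman_table("Batting.csv")`.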

### First let's try the GitHub version, which I've cloned locally

In [2]:
all_players = pd.read_csv("/home/matt/Github/baseballdatabank/core/Batting.csv")
all_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104324 entries, 0 to 104323
Data columns (total 22 columns):
playerID    104324 non-null object
yearID      104324 non-null int64
stint       104324 non-null int64
teamID      104324 non-null object
lgID        103586 non-null object
G           104324 non-null int64
AB          104324 non-null int64
R           104324 non-null int64
H           104324 non-null int64
2B          104324 non-null int64
3B          104324 non-null int64
HR          104324 non-null int64
RBI         103568 non-null float64
SB          101956 non-null float64
CS          80832 non-null float64
BB          104324 non-null int64
SO          97974 non-null float64
IBB         67722 non-null float64
HBP         101507 non-null float64
SH          98255 non-null float64
SF          68259 non-null float64
GIDP        78921 non-null float64
dtypes: float64(9), int64(10), object(3)
memory usage: 17.5+ MB


In [3]:
all_players.tail()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
104319,zimmejo02,2017,1,DET,AL,29,6,0,1,0,...,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0
104320,zimmery01,2017,1,WAS,NL,144,524,90,159,33,...,108.0,1.0,0.0,44,126.0,1.0,3.0,0.0,5.0,16.0
104321,zobribe01,2017,1,CHN,NL,128,435,58,101,20,...,50.0,2.0,2.0,54,71.0,2.0,2.0,2.0,3.0,13.0
104322,zuninmi01,2017,1,SEA,AL,124,387,52,97,25,...,64.0,1.0,0.0,39,160.0,0.0,8.0,0.0,1.0,8.0
104323,zychto01,2017,1,SEA,AL,45,0,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
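The `stint` column may help build the target variable. As I understand the Lahman schema, it numbers a player's successive teams within a season, so a player-season with more than one stint changed teams mid-year. That also covers waiver claims and releases, and it misses offseason trades entirely, so treat this as a rough proxy only. The rows below are made up for illustration.

```python
import pandas as pd

# Toy rows mimicking Batting.csv (player IDs and values are made up)
batting = pd.DataFrame({
    "playerID": ["aaa01", "aaa01", "bbb01", "ccc01"],
    "yearID": [2017, 2017, 2017, 2017],
    "stint": [1, 2, 1, 1],
    "teamID": ["DET", "HOU", "SEA", "WAS"],
})

# Highest stint per player-season; above 1 means more than one team that year
max_stint = batting.groupby(["playerID", "yearID"])["stint"].max()
changed_teams = (max_stint > 1).rename("changed_teams").reset_index()
print(changed_teams)
```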


In [4]:
# Load the Teams csv file
all_teams = pd.read_csv("/home/matt/Github/baseballdatabank/upstream/Teams.csv")
all_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2865 entries, 0 to 2864
Data columns (total 19 columns):
yearID            2865 non-null int64
lgID              2815 non-null object
teamID            2865 non-null object
franchID          2865 non-null object
divID             1348 non-null object
Rank              2865 non-null int64
Ghome             2466 non-null float64
DivWin            1320 non-null object
WCWin             684 non-null object
LgWin             2837 non-null object
WSWin             2508 non-null object
name              2865 non-null object
park              2831 non-null object
attendance        2586 non-null float64
BPF               2865 non-null int64
PPF               2865 non-null int64
teamIDBR          2865 non-null object
teamIDlahman45    2865 non-null object
teamIDretro       2865 non-null object
dtypes: float64(2), int64(4), object(13)
memory usage: 425.4+ KB


In [5]:
# Look at the last couple records to get an idea
all_teams.tail()

Unnamed: 0,yearID,lgID,teamID,franchID,divID,Rank,Ghome,DivWin,WCWin,LgWin,WSWin,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro
2860,2017,NL,SLN,STL,C,3,81.0,N,N,N,N,St. Louis Cardinals,Busch Stadium III,3447937.0,98,98,STL,SLN,SLN
2861,2017,AL,TBA,TBD,E,3,81.0,N,N,N,N,Tampa Bay Rays,Tropicana Field,1253619.0,94,94,TBR,TBA,TBA
2862,2017,AL,TEX,TEX,W,4,81.0,N,N,N,N,Texas Rangers,Rangers Ballpark in Arlington,2507760.0,107,107,TEX,TEX,TEX
2863,2017,AL,TOR,TOR,E,4,81.0,N,N,N,N,Toronto Blue Jays,Rogers Centre,3203886.0,105,105,TOR,TOR,TOR
2864,2017,NL,WAS,WSN,E,1,81.0,Y,N,N,N,Washington Nationals,Nationals Park,2524980.0,103,102,WSN,MON,WAS


In [115]:
# Filter the teams data to only 1998-2017
all_teams = all_teams[all_teams.yearID >= 1998]
all_teams.head()

Unnamed: 0,yearID,lgID,teamID,franchID,divID,Rank,Ghome,DivWin,WCWin,LgWin,WSWin,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro
2265,1998,AL,ANA,ANA,W,2,81.0,N,N,N,N,Anaheim Angels,Edison International Field,2519280.0,102,102,ANA,ANA,ANA
2266,1998,NL,ARI,ARI,W,5,81.0,N,N,N,N,Arizona Diamondbacks,Bank One Ballpark,3610290.0,100,99,ARI,ARI,ARI
2267,1998,NL,ATL,ATL,E,1,81.0,Y,N,N,N,Atlanta Braves,Turner Field,3360860.0,100,98,ATL,ATL,ATL
2268,1998,AL,BAL,BAL,E,4,81.0,N,N,N,N,Baltimore Orioles,Oriole Park at Camden Yards,3684650.0,98,97,BAL,BAL,BAL
2269,1998,AL,BOS,BOS,E,2,81.0,N,Y,N,N,Boston Red Sox,Fenway Park II,2314704.0,102,101,BOS,BOS,BOS


We should also read the "Batting", "Pitching", and "Master" data into DataFrames

In [None]:
# Load the Batting, Pitching, and Master csv files
all_batting = pd.read_csv("/home/matt/Github/baseballdatabank/upstream/Batting.csv")
all_pitching = pd.read_csv("/home/matt/Github/baseballdatabank/upstream/Pitching.csv")
all_master = pd.read_csv("/home/matt/Github/baseballdatabank/upstream/Master.csv")

## Task 2: Compile Salary Data

Note that we don't have any salary data yet. There's salary data per year here:

http://www.thebaseballcube.com/extras/payrolls/

There's also this site:

http://www.stevetheump.com/Payrolls.htm

And there's some other sites...let's see what we can do with some BeautifulSoup

In [26]:
# Get the first site
url = "http://www.thebaseballcube.com/extras/payrolls/"

r = requests.get(url).text

salary_soup = BeautifulSoup(r, 'html.parser')
just_table = salary_soup.find_all('table')[1]

After some exploration of how to access the data, I'm going to do it using the table:

In [37]:
# A header row and a row for each team
len(just_table.contents)

31
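One caveat with `.contents`: it returns every child node, which includes whitespace text nodes when the markup has line breaks between rows. Selecting rows with `find_all("tr")` sidesteps that. A minimal sketch on made-up HTML follows; the real page's markup may well differ, and the team names and numbers are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the payroll table (structure is an assumption)
html = """
<table>
  <tr><th>Team</th><th>2017</th><th>2016</th></tr>
  <tr><td>Team A</td><td><a href="?Y=2017">90.73</a></td><td><a href="?Y=2016">98.66</a></td></tr>
  <tr><td>Team B</td><td><a href="?Y=2017">120.00</a></td><td><a href="?Y=2016">110.50</a></td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

# find_all("tr") returns only row tags, never the whitespace between them
rows = table.find_all("tr")

payrolls = {}
for row in rows[1:]:  # skip the header row
    cells = row.find_all("td")
    payrolls[cells[0].text] = [float(link.text) for link in row.find_all("a")]

print(payrolls)
```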

In [98]:
# Get all the data on salary (1998-2017) for the Diamondbacks
post_2002 = [line.text for line in just_table.contents[1].find_all('a')]
pre_2003 = [line.text.split(' - ')[1] for line in just_table.contents[1].find_all('td')[-1].find_all('option')]
all_dbacks = (post_2002 + pre_2003)
all_dbacks.reverse() # Sort so now it's 1998->2017, not the other way around
print(all_dbacks)


['29.16', '70.37', '77.88', '85.25', '102.82', '80.64', '69.78', '62.33', '59.68', '52.07', '66.20', '73.52', '60.72', '53.64', '74.28', '90.16', '112.69', '65.77', '98.66', '90.73']


In [113]:
# Get data for all teams
team_dict = {}
for entry in just_table.contents[1:]:
    team = entry.find_all('td')[0].text
    post_2002 = [line.text for line in entry.find_all('a')]
    pre_2003 = [line.text.split(' - ')[1] for line in entry.find_all('td')[-1].find_all('option')]
    
    # Keep only the 5 most recent pre-2003 seasons (2002 back to 1998)
    all_salaries = post_2002 + pre_2003[0:5]
    all_salaries.reverse()
    team_dict[team] = all_salaries

# Make it a dataframe
salary_df = pd.DataFrame(team_dict, index=list(range(1998, 2018)))
salary_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 1998 to 2017
Data columns (total 30 columns):
Arizona Diamondbacks     20 non-null object
Atlanta Braves           20 non-null object
Baltimore Orioles        20 non-null object
Boston Red Sox           20 non-null object
Chicago Cubs             20 non-null object
Chicago White Sox        20 non-null object
Cincinnati Reds          20 non-null object
Cleveland Indians        20 non-null object
Colorado Rockies         20 non-null object
Detroit Tigers           20 non-null object
Houston Astros           20 non-null object
Kansas City Royals       20 non-null object
Los Angeles Angels       20 non-null object
Los Angeles Dodgers      20 non-null object
Miami Marlins            20 non-null object
Milwaukee Brewers        20 non-null object
Minnesota Twins          20 non-null object
New York Mets            20 non-null object
New York Yankees         20 non-null object
Oakland Athletics        20 non-null object
Philadelphia
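Every column comes back as `object` because the scraped values are still strings. Before merging with the team data they would need to be numeric, and a long layout (one row per year and team) is probably easier to join on. A sketch using a small stand-in for `salary_df`; the team names, values, and the `team_name` and `payroll` column names are placeholders I made up.

```python
import pandas as pd

# Small stand-in for salary_df: string payrolls indexed by year (made-up values)
salary_df = pd.DataFrame(
    {"Team A": ["29.16", "70.37"], "Team B": ["55.00", "61.25"]},
    index=[1998, 1999],
)

# Convert every column from string to float
salary_numeric = salary_df.apply(pd.to_numeric)

# Reshape to one row per (year, team) so it can be merged on year and team
salary_long = (
    salary_numeric.rename_axis("yearID")
    .reset_index()
    .melt(id_vars="yearID", var_name="team_name", value_name="payroll")
)
print(salary_long)
```

Joining this to `all_teams` would still need a mapping from full team names to `teamID`, since names such as "Anaheim Angels" changed over the period.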

In [66]:
# Get the team name titles (column names)
team_names = [team.find_all('td')[0].text for team in just_table.contents][1:]
team_names

['Arizona Diamondbacks',
 'Atlanta Braves',
 'Baltimore Orioles',
 'Boston Red Sox',
 'Chicago Cubs',
 'Chicago White Sox',
 'Cincinnati Reds',
 'Cleveland Indians',
 'Colorado Rockies',
 'Detroit Tigers',
 'Houston Astros',
 'Kansas City Royals',
 'Los Angeles Angels',
 'Los Angeles Dodgers',
 'Miami Marlins',
 'Milwaukee Brewers',
 'Minnesota Twins',
 'New York Mets',
 'New York Yankees',
 'Oakland Athletics',
 'Philadelphia Phillies',
 'Pittsburgh Pirates',
 'San Diego Padres',
 'San Francisco Giants',
 'Seattle Mariners',
 'St. Louis Cardinals',
 'Tampa Bay Rays',
 'Texas Rangers',
 'Toronto Blue Jays',
 'Washington Nationals']

In [81]:
just_table.contents[1].find_all('a')[0]

# Use a raw string so \d is not treated as an invalid string escape
re.findall(r'Y=\d{4}', str(just_table.contents[1].find_all('a')[1]))

['Y=2016']
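A capture group would return the year directly as a number, rather than the whole `Y=2016` string. In this sketch the anchor tag is hypothetical; only the `Y=2016` fragment is known from the output above.

```python
import re

# Hypothetical anchor tag; the real href format beyond "Y=2016" is unknown
link_html = '<a href="payroll.asp?Y=2016&T=ARI">98.66</a>'

# The parentheses capture just the four digits after "Y="
match = re.search(r"Y=(\d{4})", link_html)
year = int(match.group(1)) if match else None
print(year)
```

Pairing each link's year with its text this way would avoid relying on the order of the links to line up with 1998-2017.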

In [90]:
# Grab the 5 earliest years of the dataset (2002 back to 1998)


102.82
85.25
77.88
70.37
29.16


We also need the free agent data itself; one source is Baseball Reference:

Example (2016): https://www.baseball-reference.com/leagues/MLB/2016-free-agents.shtml#fa_signings::none

Probably easiest to just scrape the years I want; could do it fairly quickly
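The per-year pages could be enumerated by swapping the year into the example URL. This assumes the same URL pattern holds for every season, which I have not checked; the function name and the year range are placeholders.

```python
# URL pattern taken from the 2016 example; assumed to hold for other years
FREE_AGENT_URL = "https://www.baseball-reference.com/leagues/MLB/{year}-free-agents.shtml"


def free_agent_urls(first_year, last_year):
    """Map each season (inclusive range) to its free agent page URL."""
    return {
        year: FREE_AGENT_URL.format(year=year)
        for year in range(first_year, last_year + 1)
    }


urls = free_agent_urls(2013, 2017)
for year, url in urls.items():
    print(year, url)
```

When actually fetching these, a short pause between requests would be the polite way to scrape.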