# Getting Real-time Stats

The idea of this notebook is to figure out how to get real-time stats that I can pull daily (nightly) and use to update my site. There are a number of sites that might suffice:

* Fangraphs
* MLB.com
* Baseball Reference
* ESPN.com
...

They all seem to have pluses and minuses; I'm going to start with Fangraphs.

**Notes on sites:**

* Fangraphs is a pain because it's posting via JavaScript; I'm having a hell of a time getting the .CSV to download on its own. Otherwise, I'd have to scrape all the pages (e.g. 7 of them)
* MLB.com doesn't have what I want
* Baseball Reference doesn't have anything like what I want
* Yahoo is not useful
* BP might be the solution!!

## Data from Fangraphs

Fangraphs is tricky because it presents 2 non-optimal options:

1. Scrape data from a series of HTML tables
2. Export a .csv file through a Javascript post statement

The 2nd option is probably better, as the CSV file is very friendly. But I need to figure out how to do it, as the "Export" link actually runs this JavaScript:

```
<a id="LeaderBoard1_cmdCSV" href="javascript:__doPostBack('LeaderBoard1$cmdCSV','')">Export Data</a>
```

I'll import the necessary libraries:

In [215]:
# Import DS tools
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Import Web tools
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.request import urlretrieve

# Import my own fxns
import sys
sys.path.append("/home/matt/Github/MLB_FA_Predictor")
import buildFeatureMatrix as bfm

Apparently, this behaves as a form; here's the page itself:

In [8]:
# Grab 2018 page
fg_page = requests.get("https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2018&month=0&season1=2018&ind=0")

# Show the top
print(fg_page.text[0:1000])


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head id="Head1"><script type="text/javascript">window.NREUM||(NREUM={});NREUM.info = {"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"8c8459e5ba","applicationID":"2284934","transactionName":"ZlMHMEtVDUdTW0ZQC18ZJDdpGw9RU1xXSxcfVxYUQQ==","queueTime":0,"applicationTime":134,"agent":"","atts":""}</script><script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(e,t,n){function r(n){if(!t[n]){var o=t[n]={exports:{}};e[n][0].call(o.exports,function(t){var o=e[n][1][t];return r(o||t)},o,o.exports)}return t[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(e,t,n){function r(){}function o(e,t,n){return function(){return i(e,[f.now()].concat(u(arguments)),t?null:this,n),t?void 0:this}}var i=e("handle"),a=e(2),


Note that the HTTP trace shows this:

```
https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2018&month=0&season1=2018&ind=0

POST https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2018&month=0&season1=2018&ind=0
Origin: https://www.fangraphs.com
Upgrade-Insecure-Requests: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2018&month=0&season1=2018&ind=0
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: __qca=P0-1096495856-1511568322815; _jsuid=3860423848; __gads=ID=46c9360072e56023:T=1511568323:S=ALNI_MYcBKGNgue606hF789xDyvSpqAUZw; _cb_ls=1; opti-userid=58b2535a-37d6-47f6-9094-3187e1bf74d9; _ga=GA1.2.2091433700.1511568323; _cb=-orahCJsTnrDQuPoq; _pk_id.5863.07a3=75985b650b517c43.1511568324.2.1514091887.1514091879.; opti-position=1; _1ci_7ag23o86kjasbfd=393d0950-0564-11e8-a7ce-f5d3fbed2065; fitracking_12=no; btpdb.0M6ZVb2.dGZjLjYwOTQyNzg=REFZUw; btpdb.0M6ZVb2.dGZjLjYwOTQyODU=VVNFUg; btpdb.0M6ZVb2.dGZjLjYwMjU2Mjg=REFZUw; btpdb.0M6ZVb2.dGZjLjYwOTc4MDM=REFZUw; btpdb.0M6ZVb2.dGZjLjYwMjU2NDk=VVNFUg; fgadp=0; _gid=GA1.2.532765737.1524073489; __atuvc=1%7C12%2C0%7C13%2C3%7C14%2C6%7C15%2C2%7C16; OX_plg=pm; _gat=1; _first_pageview=1; heatmaps_g2g_100553825=yes; fiutm=direct|direct||||; _chartbeat2=.1511568328475.1524162311895.1001011110001011.d3vRTb1Q3LC0SDV3x2CxFRdzPo.1; _cb_svref=https%3A%2F%2Fwww.fangraphs.com%2Fstandings%2Fplayoff-odds-graphs%3Flg%3DNL%26div%3DW%26stat%3Dpoff%26year%3D2018; _eventqueue=%7B%22heatmap%22%3A%5B%7B%22type%22%3A%22heatmap%22%2C%22href%22%3A%22%2Fleaders.aspx%3Fpos%3Dall%26stats%3Dbat%26lg%3Dall%26qual%3Dy%26type%3D8%26season%3D2018%26month%3D0%26season1%3D2018%26ind%3D0%22%2C%22x%22%3A1360%2C%22y%22%3A670%2C%22w%22%3A1835%7D%5D%2C%22events%22%3A%5B%5D%7D

HTTP/1.1 200 OK
Cache-Control: private
Content-Type: text/csv; charset=utf-8;
Content-Encoding: gzip
Vary: Accept-Encoding
Server: Microsoft-IIS/10.0
Content-Disposition: attachment;filename="FanGraphs Leaderboard.csv"
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Thu, 19 Apr 2018 18:25:33 GMT
Content-Length: 12960
```
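
The trace suggests a simple way to tell a successful export from the ordinary page: the CSV response carries `Content-Type: text/csv`, while a failed attempt comes back as HTML. A minimal sketch of that check, assuming a plain mapping of response headers (the helper name `looks_like_csv` is made up for illustration):

```python
def looks_like_csv(headers):
    """Return True if a header mapping describes a CSV response."""
    # Header names are case-insensitive, so normalise the keys first
    lowered = {key.lower(): value for key, value in headers.items()}
    content_type = lowered.get("content-type", "")
    # Drop parameters such as "; charset=utf-8" and compare the media type
    media_type = content_type.split(";")[0].strip().lower()
    return media_type == "text/csv"


# Header value copied from the trace above
print(looks_like_csv({"Content-Type": "text/csv; charset=utf-8;"}))  # True
```

With requests, the same helper could presumably be called as `looks_like_csv(response.headers)` instead of eyeballing the first 1000 characters of the body.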

In [53]:
# Try posting to the same link (note: the dict below holds request headers, but `data=` sends it as form fields)
fg_dl = requests.post('https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2018&month=0&season1=2018&ind=0',
                      data = {'Origin' : 'https://www.fangraphs.com',
                              'Upgrade-Insecure-Requests' : '1',
                              'Content-Type' : 'application/x-www-form-urlencoded',
                              'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
                              'Referer' : 'https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2018&month=0&season1=2018&ind=0',
                              'Accept-Encoding' : 'gzip, deflate, br',
                              'Accept-Language' : 'en-US,en;q=0.9',
                              'Cookie' : '__qca=P0-1096495856-1511568322815; _jsuid=3860423848; __gads=ID=46c9360072e56023:T=1511568323:S=ALNI_MYcBKGNgue606hF789xDyvSpqAUZw; _cb_ls=1; opti-userid=58b2535a-37d6-47f6-9094-3187e1bf74d9; _ga=GA1.2.2091433700.1511568323; _cb=-orahCJsTnrDQuPoq; _pk_id.5863.07a3=75985b650b517c43.1511568324.2.1514091887.1514091879.; opti-position=1; _1ci_7ag23o86kjasbfd=393d0950-0564-11e8-a7ce-f5d3fbed2065; fitracking_12=no; btpdb.0M6ZVb2.dGZjLjYwOTQyNzg=REFZUw; btpdb.0M6ZVb2.dGZjLjYwOTQyODU=VVNFUg; btpdb.0M6ZVb2.dGZjLjYwMjU2Mjg=REFZUw; btpdb.0M6ZVb2.dGZjLjYwOTc4MDM=REFZUw; btpdb.0M6ZVb2.dGZjLjYwMjU2NDk=VVNFUg; fgadp=0; _gid=GA1.2.532765737.1524073489; __atuvc=1%7C12%2C0%7C13%2C3%7C14%2C6%7C15%2C2%7C16; OX_plg=pm; _gat=1; _first_pageview=1; heatmaps_g2g_100553825=yes; fiutm=direct|direct||||; _chartbeat2=.1511568328475.1524162311895.1001011110001011.d3vRTb1Q3LC0SDV3x2CxFRdzPo.1; _cb_svref=https%3A%2F%2Fwww.fangraphs.com%2Fstandings%2Fplayoff-odds-graphs%3Flg%3DNL%26div%3DW%26stat%3Dpoff%26year%3D2018; _eventqueue=%7B%22heatmap%22%3A%5B%7B%22type%22%3A%22heatmap%22%2C%22href%22%3A%22%2Fleaders.aspx%3Fpos%3Dall%26stats%3Dbat%26lg%3Dall%26qual%3Dy%26type%3D8%26season%3D2018%26month%3D0%26season1%3D2018%26ind%3D0%22%2C%22x%22%3A1360%2C%22y%22%3A670%2C%22w%22%3A1835%7D%5D%2C%22events%22%3A%5B%5D%7D'
                             }
                     )

In [54]:
fg_dl.text[0:1000]

'\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<head id="Head1"><script type="text/javascript">window.NREUM||(NREUM={});NREUM.info = {"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"8c8459e5ba","applicationID":"2284934","transactionName":"ZlMHMEtVDUdTW0ZQC18ZJDdpGw9RU1xXSxcfVxYUQQ==","queueTime":0,"applicationTime":93,"agent":"","atts":""}</script><script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(e,t,n){function r(n){if(!t[n]){var o=t[n]={exports:{}};e[n][0].call(o.exports,function(t){var o=e[n][1][t];return r(o||t)},o,o.exports)}return t[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(e,t,n){function r(){}function o(e,t,n){return function(){return i(e,[f.now()].concat(u(arguments)),t?null:this,n),t?void 0:this}}var i=e("handle"),a
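
A likely reason this returns the normal page rather than the CSV: the dictionary in the POST holds request headers, but `data=` sends it as form fields, and the form fields the export link relies on are missing. In ASP.NET Web Forms, `__doPostBack(target, argument)` fills the hidden inputs `__EVENTTARGET` and `__EVENTARGUMENT` and submits the page's form together with its other hidden inputs (such as `__VIEWSTATE`). A rough sketch of rebuilding that payload using only the standard library; the helper names are made up here, and whether Fangraphs accepts the result is untested:

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_payload(page_html, event_target, event_argument=""):
    """Build the form fields that a __doPostBack call would submit."""
    parser = HiddenInputParser()
    parser.feed(page_html)
    payload = dict(parser.fields)
    # These two fields are what __doPostBack sets before submitting the form
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return payload
```

If this works, the payload would go in `data=` and the browser-style headers in `headers=`, e.g. `requests.post(url, data=build_postback_payload(fg_page.text, 'LeaderBoard1$cmdCSV'), headers={...})`.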

In [55]:
test = requests.get('https://in.getclicky.com/in.php?site_id=100553825&res=1920x1080&lang=en&secure=1&type=heatmap&heatmap[]=%2Fleaders.aspx%3Fpos%3Dall%26stats%3Dbat%26lg%3Dall%26qual%3Dy%26type%3D8%26season%3D2018%26month%3D0%26season1%3D2018%26ind%3D0|1006|662|1142&jsuid=3860423848&hmset&mime=js&x=0.4253376462049516')

In [56]:
test.text

'// static34\n\n// exit trax0r\n'

This is dumb; I'm going to try Selenium

In [35]:
# Set URL
my_url = 'https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2018&month=0&season1=2018&ind=0'
#my_url = 'https://www.fangraphs.com/'

In [52]:
# Open Chrome and navigate to the page
driver = webdriver.Chrome()
driver.get(my_url)
print(driver.title)

# Wait for the export link to be present, then grab it
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, 'LeaderBoard1_cmdCSV')))
print("Page loaded")
link = driver.find_element_by_id('LeaderBoard1_cmdCSV')

# Download the file
print(link)
csv_url = urljoin(my_url, link.get_attribute("href"))
urlretrieve(csv_url, "players.csv")

driver.close()

Major League Leaderboards » 2018 » Batters » Dashboard | FanGraphs Baseball
Page loaded
<selenium.webdriver.remote.webelement.WebElement (session="6f3f773c0e84559f76d3731b1f5e4732", element="0.5879974233295713-1")>


URLError: <urlopen error unknown url type: javascript>
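
The `URLError` comes from the link's `href`: it is a `javascript:` pseudo-URL rather than an address that can be fetched, so `urljoin` passes it through unchanged and `urlretrieve` rejects the scheme. A small guard, sketched with the standard library (the function name `is_fetchable` is hypothetical), makes that visible before attempting a download:

```python
from urllib.parse import urljoin, urlparse


def is_fetchable(href):
    """Return True only for links that point at an http(s) resource."""
    return urlparse(href).scheme in ("http", "https")


page_url = "https://www.fangraphs.com/leaders.aspx"
export_href = "javascript:__doPostBack('LeaderBoard1$cmdCSV','')"

# urljoin leaves the javascript: link as it is, so the check still fails
joined = urljoin(page_url, export_href)
print(urlparse(joined).scheme)  # javascript
print(is_fetchable(joined))     # False
```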

This is dumb; I'm going to try out BP:

## Baseball Prospectus

In general, the batting/pitching URLs I want look like this:

Batting (2018) on 4/19:

`https://legacy.baseballprospectus.com/sortable/index.php?cid=1918875`

Pitching (2018) on 4/19:

`https://legacy.baseballprospectus.com/sortable/index.php?cid=2508773`

These look the same apart from the `cid`...I have 2 ideas for approaches:

1. Hard-code the stat URLs; this isn't ideal, as they could change day to day
2. Use Selenium to grab the stats by starting on the main page (cut off URL after "sortable")

I'm going to try Selenium first:

### BP w/ Selenium:

In [62]:
# Set URL
my_url = 'https://legacy.baseballprospectus.com/sortable'

# Open Chrome and navigate to the page
driver = webdriver.Chrome()
driver.get(my_url)
print(driver.title)

# wait for page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.LINK_TEXT, '1. Individual Stats - Season Totals')))
print("Page loaded")

# Find the batting stats URL
bat_link = driver.find_element_by_link_text('1. Individual Stats - Season Totals').get_attribute('href')
print(bat_link)

# Find the pitching stats URL
pitch_link = driver.find_element_by_link_text('3. Individual Stats - Season Totals').get_attribute('href')
print(pitch_link)

driver.close()

Baseball Prospectus | Statistics | Custom Statistics Reports
Page loaded
https://legacy.baseballprospectus.com/sortable/index.php?cid=1918875
https://legacy.baseballprospectus.com/sortable/index.php?cid=2508773


Now that I have the URLs I want, I can get the pages using requests and scrape with Beautiful Soup

In [98]:
# Get the batting and pitching results
bat_results = requests.get(bat_link).text
pitch_results = requests.get(pitch_link).text

# Grab all the batting stats
bat_table = BeautifulSoup(bat_results, 'html.parser').find('table', {'id' : 'TTdata'})

# Print out just the header
bat_head = bat_table.find('tr', {'class': 'TTdata_ltblue'})
col_names = [col_name.text for col_name in bat_head.find_all('td')]
print(col_names)

['#', 'NAME', 'YEAR', 'AGE', 'G', 'PA', 'AB', 'R', 'H', '1B', '2B', '3B', 'HR', 'TB', 'BB', 'IBB', 'SO', 'HBP', 'SF', 'SH', 'RBI', 'DP', 'NETDP', 'SB', 'CS', 'AVG', 'OBP', 'SLG', 'OPS', 'ISO', 'BPF', 'oppOPS', 'TAv', 'VORP', 'FRAA', 'BWARP']


Now I'll grab the rest of the table

In [102]:
# Try grabbing everything, but drop the first row
all_stats = [stat.text for stat in bat_table.find_all('td')][len(col_names):]

# Print the first 2 players
print(all_stats[0:72])

['1.', 'Mookie Betts', '2018', '25', '16', '72', '59', '20', '23', '11', '7', '0', '5', '45', '10', '1', '6', '2', '1', '0', '13', '0', '-1.10', '2', '1', '.390', '.486', '.763', '1.249', '.373', '99', '.769', '.415', '14.2', '-0.8', '1.35', '2.', 'Didi Gregorius', '2018', '28', '16', '69', '51', '14', '17', '4', '7', '1', '5', '41', '14', '1', '4', '1', '3', '0', '16', '2', '0.21', '2', '1', '.333', '.464', '.804', '1.268', '.471', '107', '.746', '.394', '12.9', '0.1', '1.30']


In [103]:
# Reshape as a 2-D array
bat_array = np.reshape(np.array(all_stats), (-1, len(col_names)))
print(bat_array)

[['1.' 'Mookie Betts' '2018' ..., '14.2' '-0.8' '1.35']
 ['2.' 'Didi Gregorius' '2018' ..., '12.9' '0.1' '1.30']
 ['3.' 'Christian Villanueva' '2018' ..., '12.5' '-1.2' '1.13']
 ..., 
 ['682.' 'Logan Morrison' '2018' ..., '-6.6' '-0.1' '-0.66']
 ['683.' 'Lewis Brinson' '2018' ..., '-6.5' '-0.1' '-0.67']
 ['684.' 'Pat Valaika' '2018' ..., '-6.8' '0.0' '-0.68']]
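
The reshape relies on the number of scraped cells being an exact multiple of the number of columns; `np.reshape` raises an error otherwise. A plain-Python sketch of the same chunking, with that check spelled out (the function name and sample values are illustrative only):

```python
def chunk_rows(cells, n_cols):
    """Split a flat list of table cells into rows of n_cols cells each."""
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    if len(cells) % n_cols != 0:
        raise ValueError(
            "{} cells cannot be split into rows of {}".format(len(cells), n_cols)
        )
    return [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]


# Toy example with 3 columns instead of the real 36
flat = ['1.', 'Mookie Betts', '2018', '2.', 'Didi Gregorius', '2018']
print(chunk_rows(flat, 3))
```

A failure here would most likely mean a row with merged or missing cells, which is worth knowing before the values end up shifted into the wrong columns.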


In [106]:
# Make it into a dataframe, add column names, and drop the "#" one
bat_df = pd.DataFrame(bat_array)
bat_df.columns = col_names
bat_df = bat_df.drop('#', axis = 'columns')

# Inspect it
print(bat_df.head())
print(bat_df.info())

                   NAME  YEAR AGE   G  PA  AB   R   H  1B 2B  ...    OBP  \
0          Mookie Betts  2018  25  16  72  59  20  23  11  7  ...   .486   
1        Didi Gregorius  2018  28  16  69  51  14  17   4  7  ...   .464   
2  Christian Villanueva  2018  27  16  60  50  10  17   7  4  ...   .450   
3       Yasmani Grandal  2018  29  14  62  54   9  19  11  5  ...   .435   
4          Todd Frazier  2018  32  17  73  55  10  16   9  5  ...   .438   

    SLG    OPS   ISO  BPF oppOPS   TAv  VORP  FRAA BWARP  
0  .763  1.249  .373   99   .769  .415  14.2  -0.8  1.35  
1  .804  1.268  .471  107   .746  .394  12.9   0.1  1.30  
2  .780  1.230  .440   90   .684  .433  12.5  -1.2  1.13  
3  .611  1.047  .259   87   .651  .377  10.3   0.9  1.13  
4  .491   .929  .200   91   .700  .360  10.3   0.9  1.12  

[5 rows x 35 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 684 entries, 0 to 683
Data columns (total 35 columns):
NAME      684 non-null object
YEAR      684 non-null object
A

In [108]:
# Convert numericals
to_num = bat_df.columns.drop('NAME')
bat_df[to_num] = bat_df[to_num].apply(pd.to_numeric)
print(bat_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 684 entries, 0 to 683
Data columns (total 35 columns):
NAME      684 non-null object
YEAR      684 non-null int64
AGE       684 non-null int64
G         684 non-null int64
PA        684 non-null int64
AB        684 non-null int64
R         684 non-null int64
H         684 non-null int64
1B        684 non-null int64
2B        684 non-null int64
3B        684 non-null int64
HR        684 non-null int64
TB        684 non-null int64
BB        684 non-null int64
IBB       684 non-null int64
SO        684 non-null int64
HBP       684 non-null int64
SF        684 non-null int64
SH        684 non-null int64
RBI       684 non-null int64
DP        684 non-null int64
NETDP     684 non-null float64
SB        684 non-null int64
CS        684 non-null int64
AVG       684 non-null float64
OBP       684 non-null float64
SLG       684 non-null float64
OPS       684 non-null float64
ISO       684 non-null float64
BPF       539 non-null float64
oppOPS    

I should be able to do the same with pitching...let me make it into a fxn

In [207]:
def bpStatsToDF(url, stat_type):
    
    '''Given a BP stats page link, create a dataframe with correct data types'''
    
    # Debug: print the url
    print(url)
    
    # Get the results page
    stat_results = requests.get(url).text

    # Grab the stats table
    stat_table = BeautifulSoup(stat_results, 'html.parser').find('table', {'id' : 'TTdata'})

    # Grab just the header
    stat_head = stat_table.find('tr', {'class': 'TTdata_ltblue'})
    col_names = [col_name.text for col_name in stat_head.find_all('td')]
    
    # Try grabbing everything, but drop the first row
    all_stats = [stat.text for stat in stat_table.find_all('td')][len(col_names):]
    
    # Reshape as a 2-D array
    stat_array = np.reshape(np.array(all_stats), (-1, len(col_names)))
    
    # Make it into a dataframe, add column names, and drop the "#" one
    stat_df = pd.DataFrame(stat_array)
    stat_df.columns = col_names
    
    # Change columns to drop based on whether it's batting or pitching
    if stat_type == 'batting':
        stat_df = stat_df.drop('#', axis = 'columns')
    elif stat_type == 'pitching':
        stat_df = stat_df.drop(['#','LVL'], axis = 'columns')
    else:
        raise ValueError("Must be either 'batting' or 'pitching'")
            
    # Convert numericals, dropping commas
    to_num = stat_df.columns.drop('NAME')
    stat_df[to_num] = stat_df[to_num].apply(lambda x: x.str.replace(',','')).apply(pd.to_numeric)
    
    return stat_df
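
The last step strips thousands separators and then converts each column with `pd.to_numeric`. As a plain-Python illustration of what happens to a single cell, assuming empty cells should become missing values (the helper `to_number` is hypothetical, not part of pandas):

```python
import math


def to_number(cell):
    """Convert one scraped cell to int or float, or NaN if it is empty."""
    text = cell.replace(",", "").strip()
    if text == "":
        return math.nan
    try:
        return int(text)
    except ValueError:
        # Not an integer; fall back to float (raises ValueError for non-numbers)
        return float(text)


print(to_number("1,234"))  # 1234
print(to_number(".390"))   # 0.39
```

Empty cells are presumably why a column like BPF shows fewer non-null values than there are rows after conversion.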

In [135]:
# Test on batters
test = bpStatsToDF(bat_link, 'batting')

print(test.head())

                   NAME  YEAR  AGE   G  PA  AB   R   H  1B  2B  ...      OBP  \
0          Mookie Betts  2018   25  17  77  64  22  25  11   8  ...    0.481   
1        Didi Gregorius  2018   28  17  73  55  14  18   5   7  ...    0.452   
2          Todd Frazier  2018   32  18  76  58  11  17   9   5  ...    0.434   
3       Yasmani Grandal  2018   29  14  62  54   9  19  11   5  ...    0.435   
4  Christian Villanueva  2018   27  16  60  50  10  17   7   4  ...    0.450   

     SLG    OPS    ISO    BPF  oppOPS    TAv  VORP  FRAA  BWARP  
0  0.797  1.277  0.406   99.0   0.773  0.418  15.4  -0.8   1.46  
1  0.764  1.216  0.436  107.0   0.745  0.383  12.7   0.3   1.30  
2  0.534  0.969  0.241   92.0   0.692  0.368  11.4   1.2   1.26  
3  0.611  1.047  0.259   87.0   0.648  0.377  10.3   1.4   1.17  
4  0.780  1.230  0.440   90.0   0.679  0.433  12.5  -1.3   1.13  

[5 rows x 35 columns]


In [136]:
# Test on pitchers
test = bpStatsToDF(pitch_link, 'pitching')

print(test.head())

              NAME  YEAR  AGE  G  GS  PITCHES    IP  IP Start  IP Relief  W  \
0    Bartolo Colon  2018   45  4   2      243  18.7      13.7        5.0  0   
1      Cc Sabathia  2018   37  3   3      213  13.3      13.3        0.0  0   
2  Fernando Rodney  2018   41  6   0      101   5.7       0.0        5.7  1   
3     Matt Belisle  2018   38  5   0       95   6.3       0.0        6.3  0   
4  Adam Wainwright  2018   36  3   3      275  15.7      15.7        0.0  1   

   ...    SH    PPF  CMD  PWR  STM   FIP  cFIP   ERA  DRA  PWARP  
0  ...     0  109.0  NaN  NaN  NaN  2.33     0  1.45  0.0    0.0  
1  ...     0  107.0  NaN  NaN  NaN  5.98     0  2.70  0.0    0.0  
2  ...     0  110.0  NaN  NaN  NaN  4.72     0  3.18  0.0    0.0  
3  ...     0  104.0  NaN  NaN  NaN  5.50     0  5.68  0.0    0.0  
4  ...     2   97.0  NaN  NaN  NaN  5.17     0  3.45  0.0    0.0  

[5 rows x 39 columns]


*Note: to dedent or indent a block, use Ctrl + [ or Ctrl + ]*

In [143]:
# Test an error
import unittest

class bpTestCase(unittest.TestCase):
    def test_Exception(self):
        with self.assertRaises(ValueError):
            bpStatsToDF(pitch_link, 'nothing')
        
unittest.main(argv=[''], verbosity=2, exit=False)

test_Exception (__main__.bpTestCase) ... ok

----------------------------------------------------------------------
Ran 1 test in 1.550s

OK


<unittest.main.TestProgram at 0x7fe52332d5c0>

## Getting old BP stats

I've gotten the BP stats for the current year (2018); what I need is the archived stats. This would allow me to be consistent, plus I could avoid the Lahman database altogether. 

I think the script will be similar, I just need to:

1. Go to the URL using the correct link
2. Sequentially, change the dropdown menu year going back till 2004 (or maybe 1998?)
3. Each time, pull the stats for that year and save to a data frame
4. Concatenate the data frames into 1

I'll start by testing that I can mess with the dropdown menu:

In [151]:
# Set URL
my_url = 'https://legacy.baseballprospectus.com/sortable'

# Open Chrome and navigate to the page
driver = webdriver.Chrome()
driver.get(my_url)
print(driver.title)

# wait for page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.LINK_TEXT, '1. Individual Stats - Season Totals')))
print("2018 Page loaded")

# Go to the batting stats
bat_link = driver.find_element_by_link_text('1. Individual Stats - Season Totals')
bat_link.click()
print(driver.title)

# Locate the dropdown menu and pick the year
driver.find_element_by_xpath("//select[@name='year']/option[text()='2017']").click()
driver.find_element_by_xpath("//input[@value = 'View Data']").click()

# wait for page to load
wait = WebDriverWait(driver, 10)
#wait.until(EC.presence_of_element_located((By.LINK_TEXT, '1. Individual Stats - Season Totals')))
print("2017 Page loaded")

# Do it again, but for 2016
driver.find_element_by_xpath("//select[@name='year']/option[text()='2016']").click()
driver.find_element_by_xpath("//input[@value = 'View Data']").click()

# wait for page to load
wait = WebDriverWait(driver, 10)
#wait.until(EC.presence_of_element_located((By.LINK_TEXT, '1. Individual Stats - Season Totals')))
print("2016 Page loaded")

# Close the browser
driver.close()

Baseball Prospectus | Statistics | Custom Statistics Reports
2018 Page loaded
Baseball Prospectus | Statistics | Custom Statistics Reports: Batter Season
2017 Page loaded
2016 Page loaded


Okay, so I think I ultimately want to make a couple functions:

1. Takes an open driver and a year, then returns the correct data frame
2. Opens a driver with the general page and goes to the correct batting/pitching page 

I think this is all I need; I'll modify what I had earlier. I might also need to make 1 more function that wraps around these...I'll see

**Change of plans; I just need the URLs for each year!**

I've already written the code to get the correct data for a link, so I might as well just get the links

In [163]:
# Function that opens a driver on the stat home page and returns the URL for each year in a range:
def getURLForYears(year_range, stat_type):
    
    '''Given a year range and stat type, return the link for the years requested'''
    
    # Initialize a list of URLs
    all_year_urls = []
    
    # Change link based on stat type
    if stat_type == 'batting':
        stat_link = '1. Individual Stats - Season Totals'
    elif stat_type == 'pitching':
        stat_link = '3. Individual Stats - Season Totals'
    else:
        raise ValueError("Must be either 'batting' or 'pitching'")
    
    # Create the driver and use the basic URL
    my_url = 'https://legacy.baseballprospectus.com/sortable'
    driver = webdriver.Chrome()
    driver.get(my_url)
    
    # Wait for page to load
    wait = WebDriverWait(driver, 10)
    wait.until(EC.presence_of_element_located((By.LINK_TEXT, stat_link)))
    
    # Click on the correct link
    stat_link = driver.find_element_by_link_text(stat_link)
    stat_link.click()
    
    # Loop through each year supplied
    for year in year_range:
        
        # Locate the dropdown menu and pick the year, then bring up data
        driver.find_element_by_xpath("//select[@name='year']/option[text()='{}']".format(year)).click()
        driver.find_element_by_xpath("//input[@value = 'View Data']").click()

        # Grab the URL, then add it to the list
        all_year_urls.append(driver.current_url)
    
    # Close the browser to be nice
    driver.close()
    
    # Return the URLs collected for each year
    return all_year_urls

I'm going to test this for 2016-2017 batting and pitching

In [170]:
test = getURLForYears(list(range(2016, 2018)), 'batting')
print(test)
test = getURLForYears(list(range(2016, 2018)), 'pitching')
print(test)

['https://legacy.baseballprospectus.com/sortable/index.php?cid=2021133', 'https://legacy.baseballprospectus.com/sortable/index.php?cid=2556049']
['https://legacy.baseballprospectus.com/sortable/index.php?cid=2508797', 'https://legacy.baseballprospectus.com/sortable/index.php?cid=2555916']


Now wrap everything into 2 blocks

* Grab all necessary URLs
* Get the data frames

In [172]:
# Get all URLs till 2018
bat_urls = getURLForYears(list(range(2001, 2019)), 'batting')
pitch_urls = getURLForYears(list(range(2001, 2019)), 'pitching')

Now get ALL the data!

In [176]:
%%time
# Do all the batting links
all_bat_dfs = [bpStatsToDF(bat_link, 'batting') for bat_link in bat_urls]

CPU times: user 35.3 s, sys: 158 ms, total: 35.5 s
Wall time: 1min 5s


In [209]:
%%time
# Do all the pitching links
all_pitch_dfs = [bpStatsToDF(pitch_link, 'pitching') for pitch_link in pitch_urls]

https://legacy.baseballprospectus.com/sortable/index.php?cid=2563320
https://legacy.baseballprospectus.com/sortable/index.php?cid=2525123
https://legacy.baseballprospectus.com/sortable/index.php?cid=2563321
https://legacy.baseballprospectus.com/sortable/index.php?cid=2513248
https://legacy.baseballprospectus.com/sortable/index.php?cid=2510188
https://legacy.baseballprospectus.com/sortable/index.php?cid=2528667
https://legacy.baseballprospectus.com/sortable/index.php?cid=2510190
https://legacy.baseballprospectus.com/sortable/index.php?cid=2510189
https://legacy.baseballprospectus.com/sortable/index.php?cid=2510191
https://legacy.baseballprospectus.com/sortable/index.php?cid=2510192
https://legacy.baseballprospectus.com/sortable/index.php?cid=2510193
https://legacy.baseballprospectus.com/sortable/index.php?cid=2510194
https://legacy.baseballprospectus.com/sortable/index.php?cid=2510195
https://legacy.baseballprospectus.com/sortable/index.php?cid=2510196
https://legacy.baseballprospectus.

Now I'll convert them to 1 DF each

In [210]:
full_bat = pd.concat(all_bat_dfs)
full_pitch = pd.concat(all_pitch_dfs)

print(full_bat.shape)
print(full_pitch.shape)

(21807, 35)
(11703, 39)


In [211]:
print(full_bat.head())

            NAME  YEAR  AGE    G   PA   AB    R    H   1B  2B  ...      OBP  \
0    Barry Bonds  2001   36  153  664  476  129  156   49  32  ...    0.515   
1     Sammy Sosa  2001   32  160  711  577  146  189   86  34  ...    0.437   
2   Jason Giambi  2001   30  154  671  520  109  178   91  47  ...    0.477   
3    Shawn Green  2001   28  161  701  619  121  184  100  31  ...    0.372   
4  Luis Gonzalez  2001   33  162  728  609  128  198   98  36  ...    0.429   

     SLG    OPS    ISO    BPF  oppOPS    TAv   VORP  FRAA  BWARP  
0  0.863  1.379  0.536   96.0   0.757  0.428  134.8 -16.5  11.67  
1  0.737  1.174  0.409   97.0   0.765  0.381  104.2   4.2  10.92  
2  0.660  1.137  0.317   95.0   0.764  0.381   91.6   0.1   9.08  
3  0.598  0.970  0.300   94.0   0.757  0.338   70.3  16.8   8.80  
4  0.688  1.117  0.363  101.0   0.763  0.354   88.4  -1.0   8.47  

[5 rows x 35 columns]


For now, I'm saving these to pickle files so I don't have to re-do it while I'm testing

In [212]:
full_bat.to_pickle('./batting_stats.pickle')
full_pitch.to_pickle('./pitching_stats.pickle')
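
To make the "don't re-do it" part automatic, the scrape can be wrapped in a load-or-build helper. This is a sketch using the standard `pickle` module with a made-up function name; the same pattern should work with `pd.read_pickle` and `DataFrame.to_pickle`:

```python
import os
import pickle


def load_or_build(path, build):
    """Load a pickled object from path, or build and save it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    # No cached copy yet: run the expensive step once and store the result
    result = build()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result
```

Here `build` would be a zero-argument function that runs the scraping and concatenation; deleting the pickle file forces a fresh pull.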

## Joining with free agent data

As I did in the actual project, now I need to join with the free agents themselves. This is easy enough, I just need to grab the correct data:

In [227]:
# Bring in free agent data
engine = bfm.db_connect()
people = bfm.pullFullTable('people', engine)
free_agents = bfm.pullFullTable('free_agents', engine)

# Print the free agent head
print(free_agents.sort_values('WAR_3', ascending = False).head())
print(free_agents.shape)

      index  Age       Full_Name  WAR_3 nameFirst   nameLast  Year  \
297     297   32  Alex Rodriguez   23.3      Alex  Rodriguez  2007   
1115   1115   32   Albert Pujols   22.5    Albert     Pujols  2011   
1460   1460   31   Robinson Cano   22.0  Robinson       Cano  2013   
1797   1797   32    Zack Greinke   20.4      Zack    Greinke  2015   
493     493   28     CC Sabathia   18.2        CC   Sabathia  2008   

          Dollars  Length            Name Position  
297   275000000.0      10  Alex Rodriguez       DH  
1115  250000000.0      10   Albert Pujols       DH  
1460  240000000.0      10   Robinson Cano       2B  
1797  206500000.0       6    Zack Greinke       SP  
493   161000000.0       7     CC Sabathia       SP  
(2195, 11)


A few things to note:

1. This has ALL free agents; that means pitchers and batters, which are in different stats data frames
2. There are 2 name fields (`Full_Name` and `Name`), plus the nameFirst/nameLast fields. I'm going to go out on a limb here and say I don't need all those
3. The age and index columns are also redundant. Time to clean!

So I'll start by stripping out the age, index, and excess name columns

In [228]:
fa_trimmed = free_agents.drop(['index','Age','nameFirst', 'nameLast', 'Name'], axis = 'columns')
print(fa_trimmed.sort_values('WAR_3', ascending = False).head())

           Full_Name  WAR_3  Year      Dollars  Length Position
297   Alex Rodriguez   23.3  2007  275000000.0      10       DH
1115   Albert Pujols   22.5  2011  250000000.0      10       DH
1460   Robinson Cano   22.0  2013  240000000.0      10       2B
1797    Zack Greinke   20.4  2015  206500000.0       6       SP
493      CC Sabathia   18.2  2008  161000000.0       7       SP


Now then, I'll have to join on name AND Year. What sort of shape will it be? Remember...

- There are ~2200 free agents
- There are ~33,500 total player season stat records (21,807 batting + 11,703 pitching)

If the join returns nearly as many records as there are free agents, then the punctuation effect isn't a big deal. But it probably is

In [231]:
# Test joining batters (remove the pitchers!)
test_bat_join = pd.merge(fa_trimmed[fa_trimmed.Position.isin(['SP','RP']) == False] , full_bat, 
                         left_on= ['Full_Name', 'Year'], 
                         right_on = ['NAME', 'YEAR'])
# Test joining pitchers
test_pitch_join = pd.merge(fa_trimmed, full_pitch, 
                         left_on= ['Full_Name', 'Year'], 
                         right_on = ['NAME', 'YEAR'])

print(test_bat_join.shape)
print(test_pitch_join.shape)

(1047, 41)
(1022, 45)


I had to take pitchers out of the batting stats for obvious reasons. Unfortunately, I'm missing a bit over 100 players. Maybe removing periods and spaces will help? I'll look at the free agents who aren't there...
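
The "bit over 100" figure follows from the shapes printed above, treating each joined row as one free agent (an approximation, since a name could match more than one stats row):

```python
n_free_agents = 2195   # rows in free_agents
n_bat_joined = 1047    # rows in test_bat_join
n_pitch_joined = 1022  # rows in test_pitch_join

n_missing = n_free_agents - (n_bat_joined + n_pitch_joined)
print(n_missing)  # 126
```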

In [235]:
# Gather the free agents who DO appear in the joins
in_join = pd.concat([test_bat_join.Full_Name, test_pitch_join.Full_Name]).unique()

# Find set difference
not_in_join = fa_trimmed[fa_trimmed.Full_Name.isin(in_join) == False]

# Print the head of it
print(not_in_join.head())

          Full_Name  WAR_3  Year     Dollars  Length Position
8      Sandy Alomar   -0.4  2006         NaN       0        C
11      Mike DeJean    0.9  2006         NaN       0       RP
15        J.D. Drew   15.5  2006  70000000.0       5       RF
81      J.C. Romero    1.2  2006   1600000.0       1       RP
85  Greg Maddux HOF    9.3  2006  10000000.0       1       SP


Things I need to catch:

* Suffixes (e.g. "Jr")
* Dots (e.g. "J.D.")
* The "HOF" thing

Maybe best to join on first AND last AND year? Plus replace the dots with nothing!
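
One way to handle all three cases is to normalise names on both sides before joining. The sketch below is only an assumption about what the cleanup could look like (the function name and the suffix list are made up); it also lowercases, since the BP data has "Cc Sabathia" where the free agent table has "CC Sabathia":

```python
import re

# Trailing tokens to drop: generational suffixes plus the "HOF" marker
TRAILING_TOKENS = {"jr", "sr", "ii", "iii", "iv", "hof"}


def normalize_name(name):
    """Lowercase a name, remove dots, and strip trailing suffix tokens."""
    cleaned = name.replace(".", "").lower().strip()
    tokens = [token for token in re.split(r"\s+", cleaned) if token]
    # Only strip from the end, and always keep at least one token
    while len(tokens) > 1 and tokens[-1] in TRAILING_TOKENS:
        tokens.pop()
    return " ".join(tokens)


print(normalize_name("J.D. Drew"))        # jd drew
print(normalize_name("Greg Maddux HOF"))  # greg maddux
```

The normalised value could then be stored in a new column on both data frames and used with the year as the join key. Note that "J.D." and "J. D." would still differ ("jd" vs "j d"), so removing spaces inside initials might also be needed.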