## Baseball Prediction: 5a - Getting (Raw) Individual Pitcher Data
In the previous lesson we compared our simple, hitting-only model to the Las Vegas odds. We concluded that incorporating starting-pitcher information would be a crucial next step in improving our model.

In this notebook we will learn how to scrape individual, game-level pitching data from Retrosheet. We will write a loop that downloads the data for each pitcher. This will let us augment our game-level dataframe with features derived from the starting pitcher's previous performance.

Let's start by going to Retrosheet and finding the stats for Dwight Gooden.

www.retrosheet.org


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_columns',1000)
pd.set_option('display.max_rows',1000)

import lxml
import html5lib
from urllib.request import urlopen
import time

from bs4 import BeautifulSoup
import requests

## Scrape a single season

In [2]:
url = 'https://www.retrosheet.org/boxesetc/1985/Kgoodd0010021985.htm'
page = requests.get(url)

In [4]:
#page.content

In [5]:
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "https://www.w3.org/TR/REC-html40/strict.dtd">

<html dir="LTR" lang="EN">
<pre><a href="../MISC/Kdescr.htm">Read Me</a></pre>
<head>
<title>The 1985 NY  N Regular Season Pitching Log for Dwight Gooden</title>
<link href="https://www.retrosheet.org/menubar/menubar.css" rel="stylesheet" type="text/css"/>
<script src="https://www.retrosheet.org/menubar/menubar.js" type="text/javascript"></script>
</head>
<body>
<p class="nopad"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="#">About ↓</a>
<ul>
<li><a href="https://www.retrosheet.org/site.htm">Overview</a>
<li><a href="https://www.retrosheet.org/archives.htm">Site history</a>
<li><a href="https://www.retrosheet.org/news.htm">Newsletters</a>
<li><a href="https://www.retrosheet.org/faq.htm">FAQ</a>
</li></li></li

In [11]:
soup1 = list(soup.children)[-1]
soup1

<html dir="LTR" lang="EN">
<pre><a href="../MISC/Kdescr.htm">Read Me</a></pre>
<head>
<title>The 1985 NY  N Regular Season Pitching Log for Dwight Gooden</title>
<link href="https://www.retrosheet.org/menubar/menubar.css" rel="stylesheet" type="text/css"/>
<script src="https://www.retrosheet.org/menubar/menubar.js" type="text/javascript"></script>
</head>
<body>
<p class="nopad"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="#">About ↓</a>
<ul>
<li><a href="https://www.retrosheet.org/site.htm">Overview</a>
<li><a href="https://www.retrosheet.org/archives.htm">Site history</a>
<li><a href="https://www.retrosheet.org/news.htm">Newsletters</a>
<li><a href="https://www.retrosheet.org/faq.htm">FAQ</a>
</li></li></li></li></ul>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><

In [18]:
soup2 = list(soup1.children)[-1]
soup2

<body>
<p class="nopad"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="#">About ↓</a>
<ul>
<li><a href="https://www.retrosheet.org/site.htm">Overview</a>
<li><a href="https://www.retrosheet.org/archives.htm">Site history</a>
<li><a href="https://www.retrosheet.org/news.htm">Newsletters</a>
<li><a href="https://www.retrosheet.org/faq.htm">FAQ</a>
</li></li></li></li></ul>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Managers</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Coaches">Coaches</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Umpires">Umpires</a>
<li><a href="https://www.retrosheet.o

In [19]:
soup3 = list(soup2.children)
soup3

['\n',
 <p class="nopad"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></p>,
 '\n',
 <div class="mbcenter">
 <ul class="nav">
 <li><a href="https://www.retrosheet.org/">Home</a>
 <li><a href="#">About ↓</a>
 <ul>
 <li><a href="https://www.retrosheet.org/site.htm">Overview</a>
 <li><a href="https://www.retrosheet.org/archives.htm">Site history</a>
 <li><a href="https://www.retrosheet.org/news.htm">Newsletters</a>
 <li><a href="https://www.retrosheet.org/faq.htm">FAQ</a>
 </li></li></li></li></ul>
 <li><a href="#">Games/People/Parks ↓</a>
 <ul>
 <li><a href="#">People →</a>
 <ul>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Managers</a>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html#Coaches">Coaches</a>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html#Umpires">Umpires</a>
 <li><a hr

In [20]:
index_num = np.where(["Opponent" in str(x) for x in soup3])[0][0]
index_num

12
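
The positional walk through `.children` works, but it depends on exactly where the log sits in the page. As a hedged alternative, here is a minimal sketch that searches every `<pre>` tag for the one containing "Opponent", which is the same idea the notebook uses later for the player page. The HTML string is a trimmed stand-in for the real page, and `find_log_block` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for a pitcher-season page (not the full Retrosheet markup)
html = """<html><body>
<pre><a href="../MISC/Kdescr.htm">Read Me</a></pre>
<p>menu</p>
<pre>   Date    #         Opponent  GS  IP
<a href="../1985/04091985.htm"> 4- 9-1985</a>   <a href="../1985/B04090NYN1985.htm">BOX+PBP</a> VS STL N   1   6
</pre>
</body></html>"""


def find_log_block(soup):
    """Return the first <pre> tag whose text mentions 'Opponent', or None."""
    for pre in soup.find_all("pre"):
        if "Opponent" in pre.get_text():
            return pre
    return None


soup = BeautifulSoup(html, "html.parser")
log_block = find_log_block(soup)
print(log_block.get_text().splitlines()[0])
```

Returning `None` when nothing matches lets the caller decide how to handle a page without a game log, instead of failing with an `IndexError`.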

In [21]:
soup4 = soup3[index_num]
soup4

<pre>   Date    #         Opponent  GS  CG SHO  GF  SV  IP     H  BFP  HR   R  ER  BB  IB  SO  SH  SF  WP HBP  BK  2B  3B GDP ROE   W   L    ERA
<a href="../1985/04091985.htm"> 4- 9-1985</a>   <a href="../1985/B04090NYN1985.htm">BOX+PBP</a> VS STL N   1   0   0   0   0   6     6   26   1   4   3   2   0   6   0   0   0   0   0   1   0   0   0   0   0   4.50
<a href="../1985/04141985.htm"> 4-14-1985</a>   <a href="../1985/B04140NYN1985.htm">BOX+PBP</a> VS CIN N   1   1   1   0   0   9     4   33   0   0   0   2   0  10   0   0   0   0   0   0   0   0   0   1   0   1.80
<a href="../1985/04191985.htm"> 4-19-1985</a>   <a href="../1985/B04190PHI1985.htm">BOX+PBP</a> AT PHI N   1   0   0   0   0   8     3   27   0   0   0   1   0   7   0   0   0   0   0   0   0   0   0   1   0   1.17
<a href="../1985/04241985.htm"> 4-24-1985</a>   <a href="../1985/B04240SLN1985.htm">BOX+PBP</a> AT STL N   1   0   0   0   0   7     4   27   0   2   2   3   1   3   1   1   0   0   0   0   0   0   0   0   1   

In [22]:
soup5 = list(soup4.children)
soup5

['   Date    #         Opponent  GS  CG SHO  GF  SV  IP     H  BFP  HR   R  ER  BB  IB  SO  SH  SF  WP HBP  BK  2B  3B GDP ROE   W   L    ERA\n',
 <a href="../1985/04091985.htm"> 4- 9-1985</a>,
 '   ',
 <a href="../1985/B04090NYN1985.htm">BOX+PBP</a>,
 ' VS STL N   1   0   0   0   0   6     6   26   1   4   3   2   0   6   0   0   0   0   0   1   0   0   0   0   0   4.50\n',
 <a href="../1985/04141985.htm"> 4-14-1985</a>,
 '   ',
 <a href="../1985/B04140NYN1985.htm">BOX+PBP</a>,
 ' VS CIN N   1   1   1   0   0   9     4   33   0   0   0   2   0  10   0   0   0   0   0   0   0   0   0   1   0   1.80\n',
 <a href="../1985/04191985.htm"> 4-19-1985</a>,
 '   ',
 <a href="../1985/B04190PHI1985.htm">BOX+PBP</a>,
 ' AT PHI N   1   0   0   0   0   8     3   27   0   0   0   1   0   7   0   0   0   0   0   0   0   0   0   1   0   1.17\n',
 <a href="../1985/04241985.htm"> 4-24-1985</a>,
 '   ',
 <a href="../1985/B04240SLN1985.htm">BOX+PBP</a>,
 ' AT STL N   1   0   0   0   0   7     4   27   0  

In [23]:
for i in range(12):
    print(soup5[i].get_text().split())


['Date', '#', 'Opponent', 'GS', 'CG', 'SHO', 'GF', 'SV', 'IP', 'H', 'BFP', 'HR', 'R', 'ER', 'BB', 'IB', 'SO', 'SH', 'SF', 'WP', 'HBP', 'BK', '2B', '3B', 'GDP', 'ROE', 'W', 'L', 'ERA']
['4-', '9-1985']
[]
['BOX+PBP']
['VS', 'STL', 'N', '1', '0', '0', '0', '0', '6', '6', '26', '1', '4', '3', '2', '0', '6', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '4.50']
['4-14-1985']
[]
['BOX+PBP']
['VS', 'CIN', 'N', '1', '1', '1', '0', '0', '9', '4', '33', '0', '0', '0', '2', '0', '10', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '1.80']
['4-19-1985']
[]
['BOX+PBP']
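
The printout shows the layout we will rely on: one header string, then a repeating group of four children per game (date link, a short string that the function below treats as the doubleheader number, box-score link, stats string). Here is a minimal sketch of that grouping, with abbreviated plain strings standing in for the BeautifulSoup objects. `group_game_children` is a hypothetical helper, not part of the notebook.

```python
def group_game_children(children):
    """Split a flat child list into (header, list of 4-item game groups)."""
    header, rest = children[0], children[1:]
    # Ignore any incomplete trailing group
    usable = len(rest) - len(rest) % 4
    games = [tuple(rest[i:i + 4]) for i in range(0, usable, 4)]
    return header, games


# Abbreviated stand-ins for the first two games in the 1985 log
children = [
    "Date # Opponent GS CG",
    "4- 9-1985", "   ", "BOX+PBP", " VS STL N 1 0",
    "4-14-1985", "   ", "BOX+PBP", " VS CIN N 1 1",
]

header, games = group_game_children(children)
for date_text, dbl_marker, box_text, stats_text in games:
    print(date_text.strip(), stats_text.split()[:3])
```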


In [26]:
## Given the URL of a specific pitcher-season page,
## scrape the game log and return it as a DataFrame
def get_season_pitching_data(url):
    time.sleep(1)  # pause before each request to go easy on the server
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    html = list(soup.children)[-1]
    body = list(html.children)[-1]
    sec_next = list(body.children)
    # the game log is the first child of <body> that contains "Opponent"
    secnum = np.where(["Opponent" in str(x) for x in sec_next])[0][0]
    key_section = sec_next[secnum]
    working_part = list(key_section.children)
    # Each stats string starts with three tokens (AT/VS, team, league),
    # so we use our own column names instead of the header on the page
    mod_header = ['at_vs', 'Opponent', 'League', 'GS', 'CG', 'SHO', 'GF', 'SV', 'IP', 'H',
                  'BFP', 'HR', 'R', 'ER', 'BB', 'IB', 'SO', 'SH', 'SF', 'WP', 'HBP',
                  'BK', '2B', '3B', 'GDP', 'ROE', 'W', 'L', 'ERA']

    # After the header string, the children repeat in groups of four:
    # date link, doubleheader-number string, box-score link, stats string

    # 1st of each group: the date link (hrefs are collected but not returned)
    date_list = []
    day_href_list = []
    for k in range(1, len(working_part), 4):
        date_list.append(working_part[k].get_text().strip())
        day_href_list.append(working_part[k].attrs['href'])

    # 2nd of each group: the doubleheader number (blank for single games)
    dblhead_num_list = []
    for k in range(2, len(working_part), 4):
        dblhead_num_list.append(working_part[k].strip())

    # 3rd of each group: the box-score link (collected but not returned)
    game_href_list = []
    for k in range(3, len(working_part), 4):
        game_href_list.append(working_part[k].attrs['href'])

    # 4th of each group: the stats; keep the first 29 tokens to match mod_header
    main_data_matrix = []
    for k in range(4, len(working_part), 4):
        main_data_row = (working_part[k].strip().split())[:29]
        main_data_matrix.append(main_data_row)

    out_df = pd.DataFrame(main_data_matrix, columns=mod_header)
    out_df['Date'] = date_list
    out_df['dblhead_num'] = dblhead_num_list
    return out_df

In [27]:
url

'https://www.retrosheet.org/boxesetc/1985/Kgoodd0010021985.htm'

In [28]:
get_season_pitching_data(url)

Unnamed: 0,at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
0,VS,STL,N,1,0,0,0,0,6.0,6,26,1,4,3,2,0,6,0,0,0,0,0,1,0,0,0,0,0,4.5,4- 9-1985,
1,VS,CIN,N,1,1,1,0,0,9.0,4,33,0,0,0,2,0,10,0,0,0,0,0,0,0,0,0,1,0,1.8,4-14-1985,
2,AT,PHI,N,1,0,0,0,0,8.0,3,27,0,0,0,1,0,7,0,0,0,0,0,0,0,0,0,1,0,1.17,4-19-1985,
3,AT,STL,N,1,0,0,0,0,7.0,4,27,0,2,2,3,1,3,1,1,0,0,0,0,0,0,0,0,1,1.5,4-24-1985,
4,VS,HOU,N,1,1,0,0,0,9.0,4,29,1,1,1,2,0,8,0,0,0,0,0,0,0,2,1,1,0,1.38,4-30-1985,
5,AT,CIN,N,1,0,0,0,0,7.0,7,30,0,2,2,3,0,9,0,0,0,0,0,1,0,0,0,1,0,1.57,5- 5-1985,
6,VS,PHI,N,1,1,1,0,0,9.0,3,32,0,0,0,3,0,13,0,0,0,0,0,1,0,1,0,1,0,1.31,5-10-1985,
7,AT,HOU,N,1,0,0,0,0,6.1,8,29,0,3,3,2,0,1,0,0,0,0,0,2,0,0,0,1,0,1.61,5-15-1985,
8,VS,SD,N,1,0,0,0,0,8.0,9,33,1,2,2,0,0,9,0,0,1,0,0,2,0,0,0,0,1,1.69,5-20-1985,
9,VS,LA,N,1,0,0,0,0,7.0,5,24,1,3,3,1,0,9,0,0,0,0,1,0,0,0,0,0,1,1.89,5-25-1985,
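
Note that the `Date` column is still raw text, and single-digit days are padded with a space rather than a zero (`4- 9-1985`). A minimal sketch of one way to turn these strings into real dates, assuming every value follows the month-day-year pattern seen above; `parse_retrosheet_date` is a hypothetical helper name.

```python
from datetime import date, datetime


def parse_retrosheet_date(raw):
    """Convert a string like '4- 9-1985' into a datetime.date."""
    cleaned = raw.replace(" ", "")  # '4- 9-1985' -> '4-9-1985'
    return datetime.strptime(cleaned, "%m-%d-%Y").date()


for raw in ["4- 9-1985", "4-14-1985", "5- 5-1985"]:
    print(raw, "->", parse_retrosheet_date(raw).isoformat())
```

Having true dates makes it straightforward to sort a pitcher's games chronologically and to match them against the game-level dataframe.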


## Get all season links for a player

In [29]:
url = 'https://www.retrosheet.org/boxesetc/G/Pgoodd001.htm'
page = requests.get(url)
sup = BeautifulSoup(page.content, 'html.parser')
sup

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "https://www.w3.org/TR/REC-html40/strict.dtd">

<html dir="LTR" lang="EN">
<pre><a href="../MISC/Pdescr.htm">Read Me</a></pre>
<head>
<title>Dwight Gooden</title>
<link href="https://www.retrosheet.org/menubar/menubar.css" rel="stylesheet" type="text/css"/>
<script src="https://www.retrosheet.org/menubar/menubar.js" type="text/javascript"></script>
</head>
<body>
<p class="nopad"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="#">About ↓</a>
<ul>
<li><a href="https://www.retrosheet.org/site.htm">Overview</a>
<li><a href="https://www.retrosheet.org/archives.htm">Site history</a>
<li><a href="https://www.retrosheet.org/news.htm">Newsletters</a>
<li><a href="https://www.retrosheet.org/faq.htm">FAQ</a>
</li></li></li></li></ul>
<li><a href="#">Games/People/Parks 

In [30]:
sup2 = list(sup.children)[2]
sup3 = list(sup2.children)[5]

In [31]:
sup3

<body>
<p class="nopad"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="#">About ↓</a>
<ul>
<li><a href="https://www.retrosheet.org/site.htm">Overview</a>
<li><a href="https://www.retrosheet.org/archives.htm">Site history</a>
<li><a href="https://www.retrosheet.org/news.htm">Newsletters</a>
<li><a href="https://www.retrosheet.org/faq.htm">FAQ</a>
</li></li></li></li></ul>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Managers</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Coaches">Coaches</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Umpires">Umpires</a>
<li><a href="https://www.retrosheet.o

In [32]:
# Plan - find the <pre> tag that starts with 'Pitching Record' (after stripping whitespace)
# Get the href attribute for all the <a> tags with the word "Daily"

In [33]:
pre_tags = [x for x in sup3.find_all('pre')]
pre_tag_text = [x.get_text().strip() for x in pre_tags]
pre_tag_text

['Top Performances',
 'Pitcher Matchups   Batter Matchups',
 'Batting Record\nYear Team                     G    AB    R    H  2B  3B  HR  RBI   BB IBB   SO HBP  SH  SF  XI ROE GDP   SB  CS   AVG   OBP   SLG   BFW Year Team\n1984 NY  N    Daily Splits   31    70    5   14   0   0   0    3    1   0   14   0  10   2   0   3   3    0   0  .200  .205  .200   0.0 1984 NY  N\n1985 NY  N    Daily Splits   35    93   11   21   2   0   1    9    5   0   15   0   9   0   0   5   1    0   0  .226  .265  .280   0.0 1985 NY  N\n1986 NY  N    Daily Splits   33    81    5    7   0   1   0    4    2   0   16   1  13   0   0   3   3    0   0  .086  .119  .111   0.0 1986 NY  N\n1987 NY  N    Daily Splits   25    64    4   14   0   0   0    4    1   0    9   0   5   1   0   2   1    0   0  .219  .227  .219   0.0 1987 NY  N\n1988 NY  N    Daily Splits   34    90    8   16   1   0   1    9    1   0   18   0   9   1   0   2   1    0   0  .178  .185  .222   0.0 1988 NY  N\n1989 NY  N    Daily Splits   19    

In [34]:
np.where([x.startswith('Pitching Record') for x in pre_tag_text])[0][0]

7

In [35]:
ind = np.where([x.startswith('Pitching Record') for x in pre_tag_text])[0][0]
a_tags = pre_tags[ind].find_all('a')
a_tags

[<a href="../1984/Y_1984.htm">1984</a>,
 <a href="../1984/TNYN01984.htm">NY  N</a>,
 <a href="../1984/Kgoodd0010011984.htm">Daily</a>,
 <a href="../1984/Lgoodd0010011984.htm">Splits</a>,
 <a href="../1984/Y_1984.htm">1984</a>,
 <a href="../1984/TNYN01984.htm">NY  N</a>,
 <a href="../1985/Y_1985.htm">1985</a>,
 <a href="../1985/TNYN01985.htm">NY  N</a>,
 <a href="../1985/Kgoodd0010021985.htm">Daily</a>,
 <a href="../1985/Lgoodd0010021985.htm">Splits</a>,
 <a href="../1985/Y_1985.htm">1985</a>,
 <a href="../1985/TNYN01985.htm">NY  N</a>,
 <a href="../1986/Y_1986.htm">1986</a>,
 <a href="../1986/TNYN01986.htm">NY  N</a>,
 <a href="../1986/Kgoodd0010031986.htm">Daily</a>,
 <a href="../1986/Lgoodd0010031986.htm">Splits</a>,
 <a href="../1986/Y_1986.htm">1986</a>,
 <a href="../1986/TNYN01986.htm">NY  N</a>,
 <a href="../1987/Y_1987.htm">1987</a>,
 <a href="../1987/TNYN01987.htm">NY  N</a>,
 <a href="../1987/Kgoodd0010041987.htm">Daily</a>,
 <a href="../1987/Lgoodd0010041987.htm">Splits</a>,


In [36]:
links = [x.attrs['href'] for x in a_tags if x.get_text()=='Daily']
links

['../1984/Kgoodd0010011984.htm',
 '../1985/Kgoodd0010021985.htm',
 '../1986/Kgoodd0010031986.htm',
 '../1987/Kgoodd0010041987.htm',
 '../1988/Kgoodd0010051988.htm',
 '../1989/Kgoodd0010061989.htm',
 '../1990/Kgoodd0010071990.htm',
 '../1991/Kgoodd0010081991.htm',
 '../1992/Kgoodd0010091992.htm',
 '../1993/Kgoodd0010101993.htm',
 '../1994/Kgoodd0010111994.htm',
 '../1996/Kgoodd0010121996.htm',
 '../1997/Kgoodd0010131997.htm',
 '../1998/Kgoodd0010141998.htm',
 '../1999/Kgoodd0010151999.htm',
 '../2000/Kgoodd0010162000.htm',
 '../2000/Kgoodd0010172000.htm',
 '../2000/Kgoodd0010182000.htm']
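
These links are relative to the player page, so they need to be turned into full URLs before we can request them. The function below does this by slicing off the leading `../`. As a hedged alternative, the standard library's `urljoin` resolves a relative link against the URL of the page it came from:

```python
from urllib.parse import urljoin

# The player page we scraped, and two of the relative links found on it
player_page = 'https://www.retrosheet.org/boxesetc/G/Pgoodd001.htm'
relative_links = ['../1984/Kgoodd0010011984.htm',
                  '../1985/Kgoodd0010021985.htm']

full_links = [urljoin(player_page, link) for link in relative_links]
for link in full_links:
    print(link)
```

This keeps working if a link is already absolute or uses a different relative prefix, whereas the slice assumes every href starts with exactly three characters to discard.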

In [37]:
### Get the links to the pitcher-season tables given the pitcher id
def get_daily_season_links(pitcher_id):
    # player pages are filed under the capitalized first letter of the id
    letter = pitcher_id.upper()[0]
    url_prefix = 'https://www.retrosheet.org/boxesetc/'
    url = url_prefix+letter+'/P'+pitcher_id+'.htm'
    time.sleep(1)  # pause before each request to go easy on the server
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    html = list(soup.children)
    # positions 2 and 5 hold <html> and <body>, as on Gooden's page above
    body = list(html[2].children)[5]
    pre_tags = body.find_all('pre')
    # find the <pre> block that starts with 'Pitching Record'
    secnum = np.where([x.get_text().strip().startswith('Pitching Record') for x in pre_tags])[0][0]
    a_tags = pre_tags[secnum].find_all('a')
    # hrefs look like '../1985/K...htm': drop the '../' and prepend the prefix
    daily_season_links = [url_prefix+x.attrs['href'][3:] for x in a_tags if x.get_text()=='Daily']
    return daily_season_links

In [38]:
get_daily_season_links('goodd001')

['https://www.retrosheet.org/boxesetc/1984/Kgoodd0010011984.htm',
 'https://www.retrosheet.org/boxesetc/1985/Kgoodd0010021985.htm',
 'https://www.retrosheet.org/boxesetc/1986/Kgoodd0010031986.htm',
 'https://www.retrosheet.org/boxesetc/1987/Kgoodd0010041987.htm',
 'https://www.retrosheet.org/boxesetc/1988/Kgoodd0010051988.htm',
 'https://www.retrosheet.org/boxesetc/1989/Kgoodd0010061989.htm',
 'https://www.retrosheet.org/boxesetc/1990/Kgoodd0010071990.htm',
 'https://www.retrosheet.org/boxesetc/1991/Kgoodd0010081991.htm',
 'https://www.retrosheet.org/boxesetc/1992/Kgoodd0010091992.htm',
 'https://www.retrosheet.org/boxesetc/1993/Kgoodd0010101993.htm',
 'https://www.retrosheet.org/boxesetc/1994/Kgoodd0010111994.htm',
 'https://www.retrosheet.org/boxesetc/1996/Kgoodd0010121996.htm',
 'https://www.retrosheet.org/boxesetc/1997/Kgoodd0010131997.htm',
 'https://www.retrosheet.org/boxesetc/1998/Kgoodd0010141998.htm',
 'https://www.retrosheet.org/boxesetc/1999/Kgoodd0010151999.htm',
 'https://

In [39]:
get_season_pitching_data(get_daily_season_links('goodd001')[2])

Unnamed: 0,at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
0,AT,PIT,N,1,1,0,0,0,9.0,6,33,1,2,2,1,0,6,1,0,0,0,0,2,1,0,0,1,0,2.0,4- 8-1986,
1,VS,STL,N,1,0,0,0,0,8.0,5,30,0,2,2,1,0,6,0,1,0,0,0,1,1,0,0,0,0,2.12,4-14-1986,
2,VS,PHI,N,1,1,0,0,0,9.0,6,36,0,2,1,2,0,10,1,0,0,0,0,2,0,0,1,1,0,1.73,4-19-1986,
3,AT,STL,N,1,1,1,0,0,9.0,5,30,0,0,0,0,0,5,0,0,0,0,0,0,0,1,0,1,0,1.29,4-25-1986,
4,AT,ATL,N,1,0,0,0,0,8.0,6,31,1,1,1,2,0,5,0,0,0,0,1,0,0,0,0,1,0,1.26,4-30-1986,
5,VS,HOU,N,1,1,1,0,0,9.0,2,31,0,0,0,2,0,7,0,0,0,0,0,0,0,1,1,1,0,1.04,5- 6-1986,
6,VS,CIN,N,1,0,0,0,0,5.0,8,23,0,3,3,1,0,3,0,0,0,0,0,1,0,1,0,0,1,1.42,5-11-1986,
7,AT,LA,N,1,0,0,0,0,8.0,7,33,0,3,0,1,0,7,0,0,0,0,0,0,0,0,1,0,0,1.25,5-16-1986,
8,AT,SF,N,1,0,0,0,0,4.0,9,23,0,7,6,2,0,3,1,0,0,0,0,1,0,0,0,0,1,1.96,5-22-1986,
9,VS,LA,N,1,1,0,0,0,9.0,5,33,2,2,2,2,0,10,0,0,0,0,0,1,0,1,0,1,0,1.96,5-28-1986,


In [40]:
# Get all the data for a particular pitcher
def get_full_pitching_data(pitcher_id):
    link_list = get_daily_season_links(pitcher_id)
    df_pitching = pd.DataFrame()
    for url in link_list:
        df_pitching = pd.concat((df_pitching, get_season_pitching_data(url)))
    return(df_pitching)
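
Because each season's frame is concatenated with its own index, the combined frame has repeated index values. A hedged variant, if a clean running index is preferred, is to collect the season frames in a list and concatenate once with `ignore_index=True`. The sketch below uses two tiny hand-built frames in place of scraped seasons, and `combine_seasons` is a hypothetical helper name.

```python
import pandas as pd


def combine_seasons(season_frames):
    """Stack per-season frames into one frame with a fresh 0..n-1 index."""
    frames = list(season_frames)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Tiny stand-ins for two scraped seasons (values taken from the logs above)
season_1985 = pd.DataFrame({"Date": ["4- 9-1985", "4-14-1985"], "Opponent": ["STL", "CIN"]})
season_1986 = pd.DataFrame({"Date": ["4- 8-1986"], "Opponent": ["PIT"]})

combined = combine_seasons([season_1985, season_1986])
print(combined)
```

Concatenating once at the end also avoids copying the growing frame on every pass through the loop.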

In [41]:
dg_data = get_full_pitching_data('goodd001')

In [43]:
dg_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 430 entries, 0 to 17
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   at_vs        430 non-null    object
 1   Opponent     430 non-null    object
 2   League       430 non-null    object
 3   GS           430 non-null    object
 4   CG           430 non-null    object
 5   SHO          430 non-null    object
 6   GF           430 non-null    object
 7   SV           430 non-null    object
 8   IP           430 non-null    object
 9   H            430 non-null    object
 10  BFP          430 non-null    object
 11  HR           430 non-null    object
 12  R            430 non-null    object
 13  ER           430 non-null    object
 14  BB           430 non-null    object
 15  IB           430 non-null    object
 16  SO           430 non-null    object
 17  SH           430 non-null    object
 18  SF           430 non-null    object
 19  WP           430 non-null    o
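
Every column comes back as `object` because we built the frame from split strings. A minimal sketch of converting the statistic columns to numbers with `pd.to_numeric`, shown on a small hand-built frame. The choice of which columns to leave as text is an assumption for illustration, and `errors="coerce"` turns anything unparseable into `NaN` rather than raising.

```python
import pandas as pd

# Small stand-in for the scraped frame: every value is a string
raw = pd.DataFrame({
    "at_vs": ["VS", "AT"],
    "Opponent": ["STL", "PHI"],
    "SO": ["6", "7"],
    "ERA": ["4.50", "1.17"],
})

# Columns that should stay as text (assumed list for this sketch)
text_cols = ["at_vs", "Opponent"]
numeric_cols = [c for c in raw.columns if c not in text_cols]

typed = raw.copy()
typed[numeric_cols] = typed[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(typed.dtypes)
```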

In [44]:
dg_data.sample(5)

Unnamed: 0,at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
0,AT,HOU,N,1,0,0,0,0,5,3,20,0,1,1,2,0,5,0,0,0,0,0,0,0,0,0,1,0,1.8,4- 7-1984,
4,VS,KC,A,1,0,0,0,0,7,4,24,0,2,1,1,0,4,1,1,0,0,0,2,0,0,1,0,0,1.86,8- 1-2000,
6,AT,SEA,A,1,0,0,0,0,5,10,25,2,4,4,3,0,4,0,0,0,0,0,0,0,1,0,0,0,6.61,5-19-2000,
16,AT,CIN,N,1,0,0,0,0,2,3,11,2,3,2,0,0,1,1,0,0,0,0,0,0,0,2,0,1,2.99,7- 1-1989,
14,VS,PHI,N,1,0,0,0,0,7,5,27,1,3,3,1,1,8,0,0,0,1,0,1,0,1,0,0,1,3.36,6-17-1988,
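
One column to be careful with is `IP`. In standard box-score notation the digit after the decimal point counts outs, not tenths, so `6.1` means six innings plus one out. A minimal sketch of converting that notation into total outs, which can be summed safely; `innings_to_outs` is a hypothetical helper, and it assumes the fractional digit is always 0, 1, or 2.

```python
def innings_to_outs(ip):
    """Convert box-score innings such as '6.1' into a count of outs."""
    whole, _, frac = str(ip).partition(".")
    frac = frac or "0"
    if frac not in ("0", "1", "2"):
        raise ValueError(f"unexpected innings value: {ip!r}")
    return 3 * int(whole) + int(frac)


for ip in ["9", "6.1", "7.2"]:
    outs = innings_to_outs(ip)
    print(ip, "->", outs, "outs")
```

Dividing the summed outs by three gives true innings pitched for rate statistics.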


## Load in our game-level data

In [45]:
df=pd.read_csv('df_bp3.csv')



In [46]:
start_pitchers_h = df.pitcher_start_id_h.unique()
start_pitchers_v = df.pitcher_start_id_v.unique()
len(start_pitchers_h), len(start_pitchers_v)

(2785, 2800)

In [47]:
start_pitchers_all = np.union1d(start_pitchers_h, start_pitchers_v)
len(start_pitchers_all), start_pitchers_all[:10]

(3015,
 array(['aased001', 'abadf001', 'abboc001', 'abbog001', 'abboj001',
        'abbok001', 'abbop001', 'abrej001', 'aceva001', 'acevj001'],
       dtype=object))

In [48]:
# demo on the first two pitchers only; remove the [:2] slice to run the whole list
# (slow: every request is preceded by a one-second pause)
for p_id in start_pitchers_all[:2]:
    print(p_id)
    df_temp = get_full_pitching_data(p_id)
    # may want to modify below to save to a dedicated folder
    fname_out = 'pitching_data_'+p_id+'.csv'
    df_temp.to_csv(fname_out, index=False)

aased001
abadf001
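
With thousands of pitchers, a long run is likely to be interrupted or to hit a page that does not parse. Here is a hedged sketch of a loop that skips pitchers whose file already exists and records failures instead of stopping. `scrape_all` and `fake_fetch` are hypothetical names. In the notebook the fetch step would be something like `get_full_pitching_data(p_id).to_csv(index=False)`, and the id `zzzzz999` and its failure are made up to exercise the error path.

```python
import tempfile
from pathlib import Path


def scrape_all(pitcher_ids, fetch, out_dir):
    """Write one CSV per pitcher; skip existing files and collect failures."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    done, failed = [], []
    for p_id in pitcher_ids:
        target = out_dir / f"pitching_data_{p_id}.csv"
        if target.exists():
            continue  # already downloaded on an earlier run
        try:
            csv_text = fetch(p_id)
        except Exception as err:  # keep going; report the problem afterwards
            failed.append((p_id, str(err)))
            continue
        target.write_text(csv_text)
        done.append(p_id)
    return done, failed


def fake_fetch(p_id):
    """Stand-in for the real scraper so the sketch runs offline."""
    if p_id == "zzzzz999":
        raise IndexError("no 'Pitching Record' section found")
    return "Date,IP\n4- 9-1985,6\n"


ids = ["aased001", "abadf001", "zzzzz999"]
with tempfile.TemporaryDirectory() as tmp:
    first_run = scrape_all(ids, fake_fetch, tmp)
    second_run = scrape_all(ids, fake_fetch, tmp)

print(first_run)
print(second_run)
```

On the second pass nothing is downloaded again, and only the failing id is retried.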
