# MLB Game Outcome Model - Live Data Wrangling
**This is the second notebook that we run when we use our MLB Game Outcome model.**<br>

The goal of this notebook is to save a dataframe with all the features needed to make predictions for the winner of that day's Major League Baseball games.<br>

This notebook will:<br>

+ Scrape RotoGrinders to identify the starting pitchers for today's MLB games.<br>
+ Scrape FanGraphs to collect data for our model's features relating to each starting pitcher.<br>
+ Merge those two dataframes so we have the name, team, opponent, ballpark, date and all of the features for each starting pitcher.<br>
+ Derive our Avg_Outs feature, the mean number of outs each pitcher records per start.<br>
+ Scrape FanGraphs for team bullpen and hitting features over the last 30 days and merge with our main dataframe.<br>
+ Temporarily create separate home and away dataframes so we can add 'A_' and 'H_' suffixes to each feature, which will indicate home or away team.<br>
+ Read in the '2023_Game_Scores' CSV, which includes every score this season, and use it to derive total wins, wins in last 10 games, last 30 games, Pythagorean Wins and Run Differential.<br>
+ Add the Park Factors feature, which measures how favorable the ballpark is for hitters.<br>
+ Engineer all the features from the perspective of the home team so that the model will predict the answer to the question: Will the home team win the game? (1 is a home win, 0 is an away win). So for example Win_Diff is how many more wins the home team has than the visiting team. Run_Diff is the home team's run differential in its games compared to the away team's run differential in its games. A positive number means the home team's run differential is better than the visiting team's.

In [332]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import datetime as dt
from datetime import timedelta
import time
import papermill as pm

In [333]:
def clean_unnamed(df):
    if 'Unnamed: 0' in df.columns:
        df = df.drop(columns = ['Unnamed: 0'])
        return df
    print("Dataframe does not have 'Unnamed: 0' column.")

In [334]:
today = dt.date.today()

In [335]:
today_str = str(today)

In [336]:
yesterday = dt.date.today() - timedelta(days = 1)

In [337]:
yesterday_str = str(yesterday)

# Scraping RotoGrinders data using Selenium
We take today's starting pitcher data from the RotoGrinders MLB First Look page. Since the page we need isn't on the home page of that site, we use Selenium to get to that page.

In [338]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [339]:
driver = webdriver.Chrome()
driver.get("https://rotogrinders.com/daily-fantasy-baseball")

In [340]:
link_element = WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "MLB DFS First Look")))


# Click on the link
driver.execute_script("arguments[0].click();", link_element)

In [341]:
first_look_url = driver.current_url

In [342]:
first_look_url

'https://rotogrinders.com/articles/mlb-dfs-first-look-tuesday-august-15th-sortable-odds-salaries-stats-3923892'

# Now Beautiful Soup
Now that we have the url we need, we can use Beautiful Soup for the rest of the scraping. We can extract the tables from the soup object. We only need the second of those tables. We run three functions to create the dataframe.

In [397]:
#Scraping RotoGrinders MLB First Look tables
rg_string = first_look_url
rg_page = requests.get(rg_string)
rg_html_doc = rg_page.text
rg_soup_obj = BeautifulSoup(rg_html_doc)

In [398]:
tables = rg_soup_obj.find_all('table')

In [399]:
#There are 4 tables on the page we scraped
len(tables)

4

In [400]:
#List of strings that we will use to name our CSVs
var_name_str = ['odds', 'pitcher', 'offense', 'hitter']

In [401]:
# today = dt.date.today()

In [402]:
#Turning today's date into string and replacing hyphen with underscore
today_str_u = str(today).replace('-', '_')

In [403]:
#Extracting all table header tags
def get_headers(table):
    """
    Extracting all table header tags
    """
    headers = table.find_all('th')
    cols = []
    for header in headers:
        cols.append(header.get_text().strip('\t'))
    return cols

In [404]:
#Extracting all table data tags
def get_content(table):
    """
    Extracting all table data tags
    """
    content = table.find_all('td')
    table_content = []
    for item in content:
        table_content.append(item.get_text().strip(' '))
    table_content = [content.strip('\t') for content in table_content]
    return table_content

In [405]:
#Turning table data into rows, using number of table headers to divide and get the right number of rows
def get_rows(cols, content):
    """
    Turning table data into rows, using number of table headers to divide and get the right number of rows
    """
    num_rows = len(content)/len(cols)
    content = iter(content)
    rows = []
    for i in range(int(num_rows)):
        new_row = []
        for j in range(len(cols)):
            new_row.append(next(content))
        rows.append(new_row)
    return rows

In [406]:
headers = get_headers(tables[1])
content = get_content(tables[1])
rows = get_rows(headers, content)
todays_SPs = pd.DataFrame(rows, columns = headers)
# todays_SPs
# filepath = r'C:\Users\Owner\FantasySports\MLB_DFS_2023\{}_{}.csv'.format(var_name_str[i], today_str)
# temp_df.to_csv(filepath)

# Park column
We use the Opponent column to derive a column that indicates the ballpark in which the game is being played.

In [407]:
todays_SPs['Park'] = np.where(todays_SPs.Opponent.str[:2] == 'at',\
                                       todays_SPs.Opponent.str[-3:], todays_SPs['Team'])

In [408]:
todays_SPs = todays_SPs.rename(columns = {'Pitcher': 'Name'})

In [355]:
driver.quit()

In [409]:
pitch_url = 'https://www.fangraphs.com/leaders.aspx?pos=all&stats=pit&lg=all&qual=0&type=c,8,43,45,62,59,42,110,122,123,113,55&season=2023&month=1000&season1=2023&ind=0&team=0&rost=0&age=0&filter=&players=0&startdate=2023-03-30&enddate='

# Starting pitching features
The following code will scrape the following features from FanGraphs for each day's starting pitcher:

**FIP:** Fielding Independent Pitching, similar to ERA except only home runs, unintentional walks, strikeouts and hit batters are taken into account. It removes from the equation all instances of the ball being put in play.<br>
**xFIP:** Home runs allowed is replaced in the equation with fly balls/league average of home runs allowed per fly ball. It penalizes pitchers a little more for allowing fly balls.<br>
**BABIP:** Batting average on balls in play. If a pitcher has a low BABIP against him, it suggests that when batters do make contact, it's not strong contact.<br>
**WAR:** Wins Above Replacement. Measures a player's value in wins over a replacement-level player at his position.<br>
**WHIP:** Walks and hits per innings pitched<br>
**Contact%:** Total pitches where contact was made / Total swings<br>
**SIERA:** Skills-interactive ERA. Scale similar to ERA. Unlike FIP and xFIP, batted balls are taken into a account<br>
**RS/9:** Run support per nine innings. How many runs does a pitcher's team score for him while he's in the game?<br>
**SwStr%:** Swings and misses divided by total pitchers thrown by a pitcher.<br>



In [410]:
url_string_page1 = pitch_url + yesterday_str
r_page1 = requests.get(url_string_page1)
html_doc_page1 = r_page1.text
soup_obj_page1 = BeautifulSoup(html_doc_page1)
#print(soup_obj_page1.prettify())
#Finding number of pages to scrape since there are only 30 rows per page
num_pages = int(soup_obj_page1.find_all('strong')[1].get_text()) + 1
#We fill in the column names with the th tags of the rgHeader class
col_names = []
headers = soup_obj_page1.find_all('th', class_ = 'rgHeader')
for header in headers:
    col_names.append(header.get_text())
all_data = []
data = soup_obj_page1.find_all('td', class_ = 'grid_line_regular')
for item in data:
    all_data.append(item.get_text())
if num_pages > 2:
    for j in range(2, num_pages):
        temp_url_string = pitch_url + yesterday_str + '&page=' + str(j) + '_30'
        temp_r = requests.get(temp_url_string)
        temp_html_doc = temp_r.text
        temp_soup_obj = BeautifulSoup(temp_html_doc)
        temp_data = temp_soup_obj.find_all('td', class_ = 'grid_line_regular')
        for entry in temp_data:
            all_data.append(entry.get_text())
#Turning the list of data into an iterator before dividing it into rows. Determining number of rows by dividing
#number of column names by length of data list
data_iter = iter(all_data)
num_rows = int(len(all_data)/len(col_names))
data_lists = []
for k in range(num_rows):
    temp_list = []
    for l in range(len(headers)):
        temp_list.append(next(data_iter))
    data_lists.append(temp_list)
SP_live_df = pd.DataFrame(data_lists, columns = col_names)
#Adding one day to the date so that the data accounts for every day through the previous day.
#For example, if we're predicting a game on June 30, the data goes through June 29
date_plus_1 = today_str
date_plus_1 = pd.to_datetime(date_plus_1, format = '%Y-%m-%d')
SP_live_df['Date'] = date_plus_1
#list_of_dfs.append(SP_live_df)
#print(f"{current_date_str} done {date_plus_1}")
#Sleeping loop for 30 seconds so we don't get caught for scraping too frequently.
#time.sleep(30)


In [411]:
SP_live_df.head()

Unnamed: 0,#,Name,Team,GS,BABIP,FIP,xFIP,WAR,WHIP,Contact%,SIERA,RS/9,SwStr%,Start-IP,Date
0,1,Josh Donaldson,NYY,0,0.0,3.27,6.54,0.0,0.0,100.0%,7.56,27.0,0.0%,,2023-08-15
1,2,Brandon Crawford,SFG,0,0.25,6.27,11.18,0.0,2.0,87.5%,10.33,0.0,5.0%,,2023-08-15
2,3,Eduardo Escobar,LAA,0,0.2,3.27,4.91,0.0,1.0,100.0%,4.54,0.0,0.0%,,2023-08-15
3,4,Daniel Hudson,LAD,0,0.333,2.94,4.57,0.0,1.67,67.7%,4.57,0.0,15.6%,,2023-08-15
4,5,Chris Owings,PIT,0,0.333,3.27,8.18,0.0,2.0,100.0%,7.56,9.0,0.0%,,2023-08-15


In [412]:
len(SP_live_df)

796

In [413]:
# SP_live_df[SP_live_df['Name'] == 'Touki Toussaint']

# Starting pitchers
We'll only need the pitcher's name, the pitcher's team, the park in which he's pitching and his team's opponent. We'll need to be on the lookout for games that aren't in regular home parks, like the games in Europe.<br>

We're also going to replace every occurrence of 'WAS' with 'WSN' (Washington Nationals) in the RotoGrinders data so that it matches FanGraphs.<br>

In [414]:
# todays_SPs = pd.read_csv('C:\\Users\Owner\Tableau_Projects\Live MLB DFS\Tableau_Pitching_Salaries.csv')

In [415]:
todays_SPs = todays_SPs[['Name', 'Team', 'Park', 'Opponent']]

In [416]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent
0,Logan Allen,CLE,CIN,at CIN
1,Graham Ashcraft,CIN,CIN,vs CLE
2,Cristian Javier,HOU,MIA,at MIA
3,Johnny Cueto,MIA,MIA,vs HOU
4,Nick Pivetta,BOS,WAS,at WAS
5,Josiah Gray,WAS,WAS,vs BOS
6,Zack Wheeler,PHI,TOR,at TOR
7,Yusei Kikuchi,TOR,TOR,vs PHI
8,Bailey Falter,PIT,NYM,at NYM
9,David Peterson,NYM,NYM,vs PIT


In [417]:
# todays_SPs.loc[3045, 'Name'] = 'Jake Bird'

In [418]:
len(todays_SPs)

30

In [419]:
#Slicing the 'at ' or 'vs ' off the Opponent string
todays_SPs['Opponent'] = todays_SPs['Opponent'].str[-3:]

In [420]:
todays_SPs['Team'] = todays_SPs['Team'].replace('WAS', 'WSN')
todays_SPs['Park'] = todays_SPs['Park'].replace('WAS', 'WSN')
todays_SPs['Opponent'] = todays_SPs['Opponent'].replace('WAS', 'WSN')

# Boiling FanGraphs data down to today's SPs
The data we scrape from FanGraphs includes every player who has pitched this season, starters, reliever and position players. That's more than 750 rows (and growing as the season goes along). Here, we match it with the names of today's starting pitchers that we scraped for RG. It gives us a chance to detect potential missing values.

In [421]:
todays_starters = SP_live_df[SP_live_df['Name'].isin(list(todays_SPs['Name']))]

In [422]:
todays_starters

Unnamed: 0,#,Name,Team,GS,BABIP,FIP,xFIP,WAR,WHIP,Contact%,SIERA,RS/9,SwStr%,Start-IP,Date
67,68,Emerson Hancock,SEA,1,0.143,3.87,5.18,0.1,1.0,77.3%,5.75,1.8,11.5%,5.0,2023-08-15
142,143,Michael Wacha,SDP,15,0.251,3.73,4.63,1.9,1.07,78.0%,4.53,5.67,10.5%,85.2,2023-08-15
235,236,Jordan Montgomery,- - -,23,0.299,3.78,4.1,2.7,1.25,77.8%,4.26,4.67,10.4%,133.0,2023-08-15
238,239,Bailey Ober,MIN,19,0.297,3.79,4.31,2.0,1.11,74.7%,3.95,2.9,13.5%,108.2,2023-08-15
249,250,Yusei Kikuchi,TOR,23,0.291,4.52,3.96,1.4,1.23,74.2%,3.99,5.3,12.5%,122.1,2023-08-15
252,253,Logan Allen,CLE,17,0.307,4.09,4.28,1.4,1.37,77.7%,4.45,5.12,10.8%,91.1,2023-08-15
262,263,Bryce Elder,ATL,23,0.276,4.3,4.32,1.5,1.24,78.8%,4.58,5.56,9.8%,131.0,2023-08-15
272,273,Josiah Gray,WSN,23,0.296,4.74,5.01,1.7,1.44,75.8%,5.01,5.12,11.3%,126.2,2023-08-15
281,282,Zack Wheeler,PHI,23,0.307,3.06,3.45,4.3,1.11,75.3%,3.44,5.52,12.3%,137.0,2023-08-15
307,308,Bobby Miller,LAD,13,0.303,3.57,4.1,1.6,1.24,78.8%,4.18,5.97,9.9%,69.1,2023-08-15


In [423]:
len(todays_starters)

29

# Fixing aggregate teams

We'll write the fix_agg_team function to address situations in which a pitcher has pitched for more than one team. In those cases, his team will be indicated as '---' and that won't match our RotoGrinders data. If there are aggregate teams, we'll be prompted to input each pitcher's name and his current team.

In [424]:
def fix_agg_team(convert_dict):
    """
    takes a dictionary with the pitcher's name as a key and his current team as a value and adjusts the dataframe.
    """
    for k, v in convert_dict.items():
        SP_live_df.loc[SP_live_df['Name'] == k, 'Team'] = v

In [425]:
if '- - -' in list(todays_starters['Team']):
    fix_agg_dict = {}
    num_composite = todays_starters['Team'].value_counts().loc['- - -']
    print(f"There are {num_composite} pitchers with aggregate teams")
    temp_df = todays_starters[todays_starters['Team'] == '- - -']
    for i in range(len(temp_df)):
        comp_pitch = temp_df.iloc[i, :]['Name']#input("Enter pitcher's name: ")
        comp_team = input(f"Enter {comp_pitch}'s current team: ")
        fix_agg_dict.update({comp_pitch: comp_team})
    fix_agg_team(fix_agg_dict)
        

There are 6 pitchers with aggregate teams
Enter Jordan Montgomery's current team: TEX
Enter Zack Littell's current team: TBR
Enter Touki Toussaint's current team: CHW
Enter Jack Flaherty's current team: BAL
Enter Lucas Giolito's current team: LAA
Enter Bailey Falter's current team: PIT


In [426]:
# SP_live_df.loc[SP_live_df['Name'] == 'Lance Lynn', 'Team'] = 'LAD'
# SP_live_df.loc[SP_live_df['Name'] == 'Erasmo Ramirez', 'Team'] = 'TBR'
# SP_live_df.loc[SP_live_df['Name'] == 'Rich Hill', 'Team'] = 'SDP'
# SP_live_df.loc[SP_live_df['Name'] == 'Bailey Falter', 'Team'] = 'PIT'
# SP_live_df.loc[SP_live_df['Name'] == 'Ryan Yarbrough', 'Team'] = 'LAD'

In [427]:
# SP_live_df.loc[SP_live_df['Name'] == 'Touki Toussiaint', 'Team'] = 'CHW'
# SP_live_df.loc[SP_live_df['Name'] == 'Lucas Giolito', 'Team'] = 'LAA'
# SP_live_df.loc[SP_live_df['Name'] == 'Yonny Chirinos', 'Team'] = 'ATL'

In [428]:
# SP_live_df.loc[SP_live_df['Name'] == 'Zack Thompson', 'GS'] = 1
# SP_live_df.loc[SP_live_df['Name'] == 'Zack Thompson', 'Start-IP'] = 5
# SP_live_df.loc[SP_live_df['Name'] == 'Erasmo Ramirez', 'GS'] = 2
# SP_live_df.loc[SP_live_df['Name'] == 'Erasmo Ramirez', 'Start-IP'] = 6

In [429]:
todays_SPs = pd.merge(todays_SPs, SP_live_df, on = ['Name', 'Team'], how = 'left')

In [430]:
todays_SPs = todays_SPs.drop(['#', 'Contact%', 'SwStr%'], axis = 1)

In [431]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,GS,BABIP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Start-IP,Date
0,Logan Allen,CLE,CIN,CIN,17.0,0.307,4.09,4.28,1.4,1.37,4.45,5.12,91.1,2023-08-15
1,Graham Ashcraft,CIN,CIN,CLE,22.0,0.292,5.13,4.71,1.1,1.41,4.92,5.25,120.0,2023-08-15
2,Cristian Javier,HOU,MIA,MIA,22.0,0.265,4.59,5.17,1.4,1.22,4.77,6.04,117.2,2023-08-15
3,Johnny Cueto,MIA,MIA,HOU,5.0,0.194,5.83,4.82,-0.1,1.0,4.53,1.67,24.0,2023-08-15
4,Nick Pivetta,BOS,WSN,WSN,10.0,0.26,4.37,3.92,0.7,1.17,3.61,5.18,52.1,2023-08-15
5,Josiah Gray,WSN,WSN,BOS,23.0,0.296,4.74,5.01,1.7,1.44,5.01,5.12,126.2,2023-08-15
6,Zack Wheeler,PHI,TOR,TOR,23.0,0.307,3.06,3.45,4.3,1.11,3.44,5.52,137.0,2023-08-15
7,Yusei Kikuchi,TOR,TOR,PHI,23.0,0.291,4.52,3.96,1.4,1.23,3.99,5.3,122.1,2023-08-15
8,Bailey Falter,PIT,NYM,NYM,9.0,0.342,5.01,4.72,0.5,1.55,4.85,2.98,44.0,2023-08-15
9,David Peterson,NYM,NYM,PIT,13.0,0.374,4.45,3.6,0.5,1.61,3.98,4.43,61.0,2023-08-15


In [432]:
any_missing = todays_SPs.isna().any().any()

In [433]:
any_missing

True

In [434]:
missing_rows = todays_SPs[todays_SPs.isna().any(axis=1)]

In [435]:
missing_rows

Unnamed: 0,Name,Team,Park,Opponent,GS,BABIP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Start-IP,Date
14,Spenser Watkins,OAK,STL,STL,,,,,,,,,,NaT


In [436]:
missing_rows_opp = list(missing_rows['Opponent'])

# Last call for missing data
The two other most likely reasons that we would still have missing data is if a pitcher is making his first start this season but he has pitched in relief. In that case we could look at Baseball Reference and fill in the GS and Start-IP columns based off the previous season, his entire career or as a last resort we could just use his total appearances and total IP if he's never started before.<br>

There also might be cases where we don't have FanGraphs data on a pitcher because he's a rookie and hasn't pitched before. In those cases, we don't necessarily know if this pitcher is a hot prospect or some scrub being called up from the minors to make a spot start for an overworked pitching staff. We just won't try to predict games in these situations because we don't know enough about the starting pitcher.<br>

The create_start_data function will be prompted when we don't have GP and Start-IP values, and we have a conditional statement that will drop rows with missing data as well as the rows with the opponent's data so that the entire game is dropped from our dataframe.

In [437]:
def create_start_data(name, games, IP):
    """
    Fills in data for pitchers who haven't started this year but have all the other FanGraphs data
    """
    todays_SPs.loc[todays_SPs['Name'] == name, 'GS'] = games
    todays_SPs.loc[todays_SPs['Name'] == name, 'Start-IP'] = IP

In [438]:
# def drop_game(team1, team2, df):
#     """
#     Drops games from the dataframe
#     """
#     df = df.drop(df[(df['Team'] == team1) | (df['Team'] == team2)].index)
#     return df

In [439]:
if '0' in list(todays_SPs['GS']):
    num_zero = todays_SPs['GS'].value_counts().loc['0']
    print(f"There are {num_zero} pitchers with no starts.")
    temp_df = todays_SPs[todays_SPs['GS'] == '0']
    for i in range(len(temp_df)):
        comp_pitch = temp_df.iloc[i, :]['Name']#input("Enter pitcher's name: ")
        gs_val = int(input(f"Enter a GS value for {comp_pitch}: "))
        ip_val = input(f"Enter a Start-IP value for {comp_pitch}: ") 
        create_start_data(comp_pitch, gs_val, ip_val)

There are 1 pitchers with no starts.
Enter a GS value for Joe Mantiply: 18
Enter a Start-IP value for Joe Mantiply: 20


In [444]:
if any_missing:
    print("Dropping games in which pitchers have missing FanGraphs data.")
    todays_SPs = todays_SPs.dropna()
    for opp in missing_rows_opp:
        if opp in list(todays_SPs['Team']):
            todays_SPs = todays_SPs[todays_SPs['Team'] != opp]

Dropping games in which pitchers have missing FanGraphs data.
29
29
STL
STL
28


In case we need to fill in starting data or dop a game, we can uncomment the function calls below.

In [445]:
#create_start_data('Joe Mantiply', 18, 20)

In [446]:
# todays_SPs = drop_game('OAK', 'STL', todays_SPs)

In [447]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,GS,BABIP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Start-IP,Date
0,Logan Allen,CLE,CIN,CIN,17,0.307,4.09,4.28,1.4,1.37,4.45,5.12,91.1,2023-08-15
1,Graham Ashcraft,CIN,CIN,CLE,22,0.292,5.13,4.71,1.1,1.41,4.92,5.25,120.0,2023-08-15
2,Cristian Javier,HOU,MIA,MIA,22,0.265,4.59,5.17,1.4,1.22,4.77,6.04,117.2,2023-08-15
3,Johnny Cueto,MIA,MIA,HOU,5,0.194,5.83,4.82,-0.1,1.0,4.53,1.67,24.0,2023-08-15
4,Nick Pivetta,BOS,WSN,WSN,10,0.26,4.37,3.92,0.7,1.17,3.61,5.18,52.1,2023-08-15
5,Josiah Gray,WSN,WSN,BOS,23,0.296,4.74,5.01,1.7,1.44,5.01,5.12,126.2,2023-08-15
6,Zack Wheeler,PHI,TOR,TOR,23,0.307,3.06,3.45,4.3,1.11,3.44,5.52,137.0,2023-08-15
7,Yusei Kikuchi,TOR,TOR,PHI,23,0.291,4.52,3.96,1.4,1.23,3.99,5.3,122.1,2023-08-15
8,Bailey Falter,PIT,NYM,NYM,9,0.342,5.01,4.72,0.5,1.55,4.85,2.98,44.0,2023-08-15
9,David Peterson,NYM,NYM,PIT,13,0.374,4.45,3.6,0.5,1.61,3.98,4.43,61.0,2023-08-15


In [69]:
# todays_SPs.loc[todays_SPs['Name'] == 'Beau Brieske', 'GS'] = 15 
# todays_SPs.loc[todays_SPs['Name'] == 'Beau Brieske', 'Start-IP'] = 81.2
# todays_SPs.loc[todays_SPs['Name'] == 'Ty Blach', 'GS'] = 6
# todays_SPs.loc[todays_SPs['Name'] == 'Ty Blach', 'Start-IP'] = 23.2

               

Changing columns to numeric data type

In [70]:
todays_SPs_numcols = ['GS', 'BABIP', 'FIP', 'xFIP', 'WAR', 'WHIP', 'SIERA', 'RS/9', 'Start-IP']

In [71]:
todays_SPs[todays_SPs_numcols] = todays_SPs[todays_SPs_numcols].apply(pd.to_numeric, errors = 'coerce')

In [72]:
todays_SPs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28 entries, 0 to 29
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Name      28 non-null     object        
 1   Team      28 non-null     object        
 2   Park      28 non-null     object        
 3   Opponent  28 non-null     object        
 4   GS        28 non-null     int64         
 5   BABIP     28 non-null     float64       
 6   FIP       28 non-null     float64       
 7   xFIP      28 non-null     float64       
 8   WAR       28 non-null     float64       
 9   WHIP      28 non-null     float64       
 10  SIERA     28 non-null     float64       
 11  RS/9      28 non-null     float64       
 12  Start-IP  28 non-null     float64       
 13  Date      28 non-null     datetime64[ns]
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 3.3+ KB


In [73]:
#todays_SPs = todays_SPs.fillna(method = 'ffill')

# Avg_Outs
Here we can derive the Avg_Outs variable, the average outs a pitcher records per start. It's an absolute godsend that FanGraphs has a 'Start-IP' column. That saves us from writing a lot of code.

In [74]:
todays_SPs['Part-IP'] = (todays_SPs['Start-IP'] % 1)

In [75]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,GS,BABIP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Start-IP,Date,Part-IP
0,Logan Allen,CLE,CIN,CIN,17,0.307,4.09,4.28,1.4,1.37,4.45,5.12,91.1,2023-08-15,0.1
1,Graham Ashcraft,CIN,CIN,CLE,22,0.292,5.13,4.71,1.1,1.41,4.92,5.25,120.0,2023-08-15,0.0
2,Cristian Javier,HOU,MIA,MIA,22,0.265,4.59,5.17,1.4,1.22,4.77,6.04,117.2,2023-08-15,0.2
3,Johnny Cueto,MIA,MIA,HOU,5,0.194,5.83,4.82,-0.1,1.0,4.53,1.67,24.0,2023-08-15,0.0
4,Nick Pivetta,BOS,WSN,WSN,10,0.26,4.37,3.92,0.7,1.17,3.61,5.18,52.1,2023-08-15,0.1
5,Josiah Gray,WSN,WSN,BOS,23,0.296,4.74,5.01,1.7,1.44,5.01,5.12,126.2,2023-08-15,0.2
6,Zack Wheeler,PHI,TOR,TOR,23,0.307,3.06,3.45,4.3,1.11,3.44,5.52,137.0,2023-08-15,0.0
7,Yusei Kikuchi,TOR,TOR,PHI,23,0.291,4.52,3.96,1.4,1.23,3.99,5.3,122.1,2023-08-15,0.1
8,Bailey Falter,PIT,NYM,NYM,9,0.342,5.01,4.72,0.5,1.55,4.85,2.98,44.0,2023-08-15,0.0
9,David Peterson,NYM,NYM,PIT,13,0.374,4.45,3.6,0.5,1.61,3.98,4.43,61.0,2023-08-15,0.0


In [76]:
todays_SPs['Start-IP'] = todays_SPs['Start-IP'] - todays_SPs['Part-IP']

In [77]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,GS,BABIP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Start-IP,Date,Part-IP
0,Logan Allen,CLE,CIN,CIN,17,0.307,4.09,4.28,1.4,1.37,4.45,5.12,91.0,2023-08-15,0.1
1,Graham Ashcraft,CIN,CIN,CLE,22,0.292,5.13,4.71,1.1,1.41,4.92,5.25,120.0,2023-08-15,0.0
2,Cristian Javier,HOU,MIA,MIA,22,0.265,4.59,5.17,1.4,1.22,4.77,6.04,117.0,2023-08-15,0.2
3,Johnny Cueto,MIA,MIA,HOU,5,0.194,5.83,4.82,-0.1,1.0,4.53,1.67,24.0,2023-08-15,0.0
4,Nick Pivetta,BOS,WSN,WSN,10,0.26,4.37,3.92,0.7,1.17,3.61,5.18,52.0,2023-08-15,0.1
5,Josiah Gray,WSN,WSN,BOS,23,0.296,4.74,5.01,1.7,1.44,5.01,5.12,126.0,2023-08-15,0.2
6,Zack Wheeler,PHI,TOR,TOR,23,0.307,3.06,3.45,4.3,1.11,3.44,5.52,137.0,2023-08-15,0.0
7,Yusei Kikuchi,TOR,TOR,PHI,23,0.291,4.52,3.96,1.4,1.23,3.99,5.3,122.0,2023-08-15,0.1
8,Bailey Falter,PIT,NYM,NYM,9,0.342,5.01,4.72,0.5,1.55,4.85,2.98,44.0,2023-08-15,0.0
9,David Peterson,NYM,NYM,PIT,13,0.374,4.45,3.6,0.5,1.61,3.98,4.43,61.0,2023-08-15,0.0


# Multiply by 10

The tricky thing here is that FanGraphs indicates IP as decimals. For example, 5.1. Since there are three outs in an inning and not 10, we need to address that.

In [78]:
todays_SPs['Part-IP'] = todays_SPs['Part-IP'] * 10

In [79]:
todays_SPs['Outs'] = (todays_SPs['Start-IP'] * 3) + todays_SPs['Part-IP']

In [80]:
todays_SPs['Avg_Outs'] = np.round(todays_SPs['Outs']/todays_SPs['GS'], 2)

In [81]:
todays_SPs = todays_SPs.drop(['Start-IP', 'Part-IP', 'Outs', 'GS'], axis = 1)

# Rename BABIP
We have to change BABIP to BABIP_SP here to indicate it's the BABIP against the starting pitcher. This is because we'll be getting team hitting BABIP and team bullpen BABIP later in the notebook.

In [82]:
todays_SPs = todays_SPs.rename(columns = {'BABIP': 'BABIP_SP'})

In [83]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,BABIP_SP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Date,Avg_Outs
0,Logan Allen,CLE,CIN,CIN,0.307,4.09,4.28,1.4,1.37,4.45,5.12,2023-08-15,16.12
1,Graham Ashcraft,CIN,CIN,CLE,0.292,5.13,4.71,1.1,1.41,4.92,5.25,2023-08-15,16.36
2,Cristian Javier,HOU,MIA,MIA,0.265,4.59,5.17,1.4,1.22,4.77,6.04,2023-08-15,16.05
3,Johnny Cueto,MIA,MIA,HOU,0.194,5.83,4.82,-0.1,1.0,4.53,1.67,2023-08-15,14.4
4,Nick Pivetta,BOS,WSN,WSN,0.26,4.37,3.92,0.7,1.17,3.61,5.18,2023-08-15,15.7
5,Josiah Gray,WSN,WSN,BOS,0.296,4.74,5.01,1.7,1.44,5.01,5.12,2023-08-15,16.52
6,Zack Wheeler,PHI,TOR,TOR,0.307,3.06,3.45,4.3,1.11,3.44,5.52,2023-08-15,17.87
7,Yusei Kikuchi,TOR,TOR,PHI,0.291,4.52,3.96,1.4,1.23,3.99,5.3,2023-08-15,15.96
8,Bailey Falter,PIT,NYM,NYM,0.342,5.01,4.72,0.5,1.55,4.85,2.98,2023-08-15,14.67
9,David Peterson,NYM,NYM,PIT,0.374,4.45,3.6,0.5,1.61,3.98,4.43,2023-08-15,14.08


In [84]:
#condition = (todays_SPs['Team'] == 'PIT') | (todays_SPs['Team'] == 'CLE')

In [85]:
#todays_SPs = todays_SPs.drop(todays_SPs[condition].index)

# Bullpen data
This should go a lot smoother than the previous merge. Bullpen data and hitting data both are team-wise statistics, so the dataframes we create will only be 30 rows.

In [86]:
#Actually for bullpens we should go last 30 days
BP_url_1 = 'https://www.fangraphs.com/leaders.aspx?pos=all&stats=rel&lg=all&qual=0&type=c,43,45,62,59,42,110,113,122,123&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate='
BP_url_2 = '&enddate='

In [87]:
today

datetime.date(2023, 8, 15)

In [88]:
yesterday

datetime.date(2023, 8, 14)

In [89]:
start_date = yesterday - timedelta(days = 30)

In [90]:
start_date

datetime.date(2023, 7, 15)

In [91]:
start_date_str = str(start_date)

In [92]:
start_date_str

'2023-07-15'

In [93]:
end_date_str = yesterday_str
start_date_str = start_date_str
url_string_page1 = BP_url_1 + start_date_str + BP_url_2 + end_date_str
r_page1 = requests.get(url_string_page1)
html_doc_page1 = r_page1.text
soup_obj_page1 = BeautifulSoup(html_doc_page1)
#print(soup_obj_page1.prettify())
#Finding number of pages to scrape since there are only 30 rows per page
#num_pages = int(soup_obj_page1.find_all('strong')[1].get_text()) + 1
col_names = []
headers = soup_obj_page1.find_all('th', class_ = 'rgHeader')
for header in headers:
    col_names.append(header.get_text())
all_data = []
data = soup_obj_page1.find_all('td', class_ = 'grid_line_regular')
for item in data:
    all_data.append(item.get_text())
#     if num_pages > 2:
#         for j in range(2, num_pages):
#             temp_url_string = pitch_url + current_date_str + '&page=' + str(j) + '_30'
#             temp_r = requests.get(temp_url_string)
#             temp_html_doc = temp_r.text
#             temp_soup_obj = BeautifulSoup(temp_html_doc)
#             temp_data = temp_soup_obj.find_all('td', class_ = 'grid_line_regular')
#             for entry in temp_data:
#                 all_data.append(entry.get_text())
#Turning the list of data into an iterator before dividing it into rows. Determining number of rows by dividing
#number of column names by length of data list
data_iter = iter(all_data)
num_rows = int(len(all_data)/len(col_names))
data_lists = []
for k in range(num_rows):
    temp_list = []
    for l in range(len(headers)):
        temp_list.append(next(data_iter))
    data_lists.append(temp_list)
BP_live_df = pd.DataFrame(data_lists, columns = col_names)
#date_plus_1 = str(date + timedelta(days = 1))
date_plus_1 = pd.to_datetime(today_str, format = '%Y-%m-%d')
BP_live_df['Date'] = date_plus_1
#     list_of_dfs.append(SP_live_df)
#     print(f"{start_date_str} - {end_date_str}")
#    time.sleep(25)


In [94]:
BP_live_df

Unnamed: 0,#,Team,BABIP,FIP,xFIP,WAR,WHIP,Contact%,SwStr%,SIERA,RS/9,Date
0,1,SEA,0.284,3.36,3.4,1.6,1.18,71.2%,13.5%,3.32,5.34,2023-08-15
1,2,LAD,0.245,4.09,4.65,1.0,1.15,75.9%,11.6%,4.29,5.85,2023-08-15
2,3,MIL,0.254,3.87,3.63,1.1,1.06,73.3%,12.7%,3.4,4.42,2023-08-15
3,4,TOR,0.278,3.85,4.51,1.1,1.23,75.2%,11.4%,4.21,3.99,2023-08-15
4,5,BAL,0.287,3.4,4.05,1.3,1.23,72.7%,13.4%,3.87,3.13,2023-08-15
5,6,PHI,0.27,4.31,4.65,0.6,1.3,75.9%,11.8%,4.32,5.11,2023-08-15
6,7,SFG,0.303,3.4,4.18,2.1,1.21,76.1%,11.7%,3.83,3.6,2023-08-15
7,8,NYY,0.26,4.18,3.98,0.1,1.21,71.2%,13.4%,3.78,3.51,2023-08-15
8,9,ATL,0.292,4.05,4.01,0.7,1.31,74.6%,12.6%,3.93,6.22,2023-08-15
9,10,CHC,0.244,4.64,4.41,0.2,1.21,74.8%,11.3%,4.13,6.94,2023-08-15


In [95]:
BP_live_df = BP_live_df.drop(['#', 'WAR', 'Contact%', 'SwStr%', 'RS/9', 'Date'], axis = 1)

In [96]:
#Adding '_BP' to each column name so we can distinguish from SP data
BP_live_df = BP_live_df.rename(columns = {'BABIP': 'BABIP_BP', 'FIP': 'FIP_BP', 'xFIP': 'xFIP_BP', 'WHIP': 'WHIP_BP', 'SIERA': 'SIERA_BP'})

In [97]:
BP_live_df

Unnamed: 0,Team,BABIP_BP,FIP_BP,xFIP_BP,WHIP_BP,SIERA_BP
0,SEA,0.284,3.36,3.4,1.18,3.32
1,LAD,0.245,4.09,4.65,1.15,4.29
2,MIL,0.254,3.87,3.63,1.06,3.4
3,TOR,0.278,3.85,4.51,1.23,4.21
4,BAL,0.287,3.4,4.05,1.23,3.87
5,PHI,0.27,4.31,4.65,1.3,4.32
6,SFG,0.303,3.4,4.18,1.21,3.83
7,NYY,0.26,4.18,3.98,1.21,3.78
8,ATL,0.292,4.05,4.01,1.31,3.93
9,CHC,0.244,4.64,4.41,1.21,4.13


In [98]:
BP_cols = list(BP_live_df.columns)[1:]

In [99]:
BP_cols

['BABIP_BP', 'FIP_BP', 'xFIP_BP', 'WHIP_BP', 'SIERA_BP']

In [100]:
BP_live_df[BP_cols] = BP_live_df[BP_cols].apply(pd.to_numeric, errors = 'coerce')

In [101]:
todays_SPs = pd.merge(todays_SPs, BP_live_df, how = 'left', on = 'Team')

In [102]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,BABIP_SP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Date,Avg_Outs,BABIP_BP,FIP_BP,xFIP_BP,WHIP_BP,SIERA_BP
0,Logan Allen,CLE,CIN,CIN,0.307,4.09,4.28,1.4,1.37,4.45,5.12,2023-08-15,16.12,0.333,3.77,3.89,1.48,3.83
1,Graham Ashcraft,CIN,CIN,CLE,0.292,5.13,4.71,1.1,1.41,4.92,5.25,2023-08-15,16.36,0.279,5.12,4.55,1.38,4.12
2,Cristian Javier,HOU,MIA,MIA,0.265,4.59,5.17,1.4,1.22,4.77,6.04,2023-08-15,16.05,0.257,5.77,4.97,1.45,4.61
3,Johnny Cueto,MIA,MIA,HOU,0.194,5.83,4.82,-0.1,1.0,4.53,1.67,2023-08-15,14.4,0.312,3.81,4.1,1.36,3.72
4,Nick Pivetta,BOS,WSN,WSN,0.26,4.37,3.92,0.7,1.17,3.61,5.18,2023-08-15,15.7,0.339,3.86,4.13,1.42,3.82
5,Josiah Gray,WSN,WSN,BOS,0.296,4.74,5.01,1.7,1.44,5.01,5.12,2023-08-15,16.52,0.317,4.39,4.44,1.39,4.06
6,Zack Wheeler,PHI,TOR,TOR,0.307,3.06,3.45,4.3,1.11,3.44,5.52,2023-08-15,17.87,0.27,4.31,4.65,1.3,4.32
7,Yusei Kikuchi,TOR,TOR,PHI,0.291,4.52,3.96,1.4,1.23,3.99,5.3,2023-08-15,15.96,0.278,3.85,4.51,1.23,4.21
8,Bailey Falter,PIT,NYM,NYM,0.342,5.01,4.72,0.5,1.55,4.85,2.98,2023-08-15,14.67,0.3,3.72,4.09,1.26,3.6
9,David Peterson,NYM,NYM,PIT,0.374,4.45,3.6,0.5,1.61,3.98,4.43,2023-08-15,14.08,0.307,5.06,4.91,1.52,4.42


In [103]:
todays_SPs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28 entries, 0 to 27
Data columns (total 18 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Name      28 non-null     object        
 1   Team      28 non-null     object        
 2   Park      28 non-null     object        
 3   Opponent  28 non-null     object        
 4   BABIP_SP  28 non-null     float64       
 5   FIP       28 non-null     float64       
 6   xFIP      28 non-null     float64       
 7   WAR       28 non-null     float64       
 8   WHIP      28 non-null     float64       
 9   SIERA     28 non-null     float64       
 10  RS/9      28 non-null     float64       
 11  Date      28 non-null     datetime64[ns]
 12  Avg_Outs  28 non-null     float64       
 13  BABIP_BP  28 non-null     float64       
 14  FIP_BP    28 non-null     float64       
 15  xFIP_BP   28 non-null     float64       
 16  WHIP_BP   28 non-null     float64       
 17  SIERA_BP  28 non-n

# Hitting Data

In [104]:
Hit_url_1 = 'https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=c,39,41,50,58,40&season=2022&month=1000&season1=2021&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate='
Hit_url_2 = '&enddate='

In [105]:
end_date_str = yesterday_str
url_string_page1 = Hit_url_1 + start_date_str + Hit_url_2 + end_date_str
r_page1 = requests.get(url_string_page1)
html_doc_page1 = r_page1.text
soup_obj_page1 = BeautifulSoup(html_doc_page1)
#print(soup_obj_page1.prettify())
#Finding number of pages to scrape since there are only 30 rows per page
#num_pages = int(soup_obj_page1.find_all('strong')[1].get_text()) + 1
col_names = []
headers = soup_obj_page1.find_all('th', class_ = 'rgHeader')
for header in headers:
    col_names.append(header.get_text())
all_data = []
data = soup_obj_page1.find_all('td', class_ = 'grid_line_regular')
for item in data:
    all_data.append(item.get_text())
#     if num_pages > 2:
#         for j in range(2, num_pages):
#             temp_url_string = pitch_url + current_date_str + '&page=' + str(j) + '_30'
#             temp_r = requests.get(temp_url_string)
#             temp_html_doc = temp_r.text
#             temp_soup_obj = BeautifulSoup(temp_html_doc)
#             temp_data = temp_soup_obj.find_all('td', class_ = 'grid_line_regular')
#             for entry in temp_data:
#                 all_data.append(entry.get_text())
#Turning the list of data into an iterator before dividing it into rows. Determining number of rows by dividing
#number of column names by length of data list
data_iter = iter(all_data)
num_rows = int(len(all_data)/len(col_names))
data_lists = []
for k in range(num_rows):
    temp_list = []
    for l in range(len(headers)):
        temp_list.append(next(data_iter))
    data_lists.append(temp_list)
Hit_live_df = pd.DataFrame(data_lists, columns = col_names)
#     date_plus_1 = str(date + timedelta(days = 1))
#     date_plus_1 = pd.to_datetime(date_plus_1, format = '%Y-%m-%d')
SP_live_df['Date'] = today
#list_of_dfs.append(SP_live_df)
#print(f"{end_date_str} done {date_plus_1}")
#time.sleep(25)


In [106]:
Hit_live_df = Hit_live_df.drop(['WAR', '#'], axis = 1)

In [107]:
Hit_live_df

Unnamed: 0,Team,OPS,BABIP,wOBA,ISO
0,ATL,0.902,0.321,0.381,0.248
1,CHC,0.848,0.329,0.364,0.207
2,LAD,0.831,0.315,0.358,0.185
3,STL,0.807,0.308,0.349,0.184
4,MIN,0.8,0.314,0.344,0.212
5,HOU,0.792,0.3,0.343,0.184
6,TEX,0.794,0.31,0.342,0.198
7,KCR,0.781,0.31,0.333,0.191
8,TOR,0.758,0.313,0.331,0.156
9,PHI,0.755,0.302,0.329,0.171


In [108]:
Hit_live_df[['OPS', 'BABIP', 'wOBA', 'ISO']] = Hit_live_df[['OPS', 'BABIP', 'wOBA', 'ISO']].apply(pd.to_numeric, errors = 'coerce')

In [109]:
todays_SPs = pd.merge(todays_SPs, Hit_live_df, how = 'left', on = 'Team')

In [110]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,BABIP_SP,FIP,xFIP,WAR,WHIP,SIERA,...,Avg_Outs,BABIP_BP,FIP_BP,xFIP_BP,WHIP_BP,SIERA_BP,OPS,BABIP,wOBA,ISO
0,Logan Allen,CLE,CIN,CIN,0.307,4.09,4.28,1.4,1.37,4.45,...,16.12,0.333,3.77,3.89,1.48,3.83,0.71,0.287,0.308,0.149
1,Graham Ashcraft,CIN,CIN,CLE,0.292,5.13,4.71,1.1,1.41,4.92,...,16.36,0.279,5.12,4.55,1.38,4.12,0.729,0.29,0.313,0.191
2,Cristian Javier,HOU,MIA,MIA,0.265,4.59,5.17,1.4,1.22,4.77,...,16.05,0.257,5.77,4.97,1.45,4.61,0.792,0.3,0.343,0.184
3,Johnny Cueto,MIA,MIA,HOU,0.194,5.83,4.82,-0.1,1.0,4.53,...,14.4,0.312,3.81,4.1,1.36,3.72,0.703,0.3,0.302,0.15
4,Nick Pivetta,BOS,WSN,WSN,0.26,4.37,3.92,0.7,1.17,3.61,...,15.7,0.339,3.86,4.13,1.42,3.82,0.739,0.294,0.317,0.177
5,Josiah Gray,WSN,WSN,BOS,0.296,4.74,5.01,1.7,1.44,5.01,...,16.52,0.317,4.39,4.44,1.39,4.06,0.742,0.295,0.323,0.158
6,Zack Wheeler,PHI,TOR,TOR,0.307,3.06,3.45,4.3,1.11,3.44,...,17.87,0.27,4.31,4.65,1.3,4.32,0.755,0.302,0.329,0.171
7,Yusei Kikuchi,TOR,TOR,PHI,0.291,4.52,3.96,1.4,1.23,3.99,...,15.96,0.278,3.85,4.51,1.23,4.21,0.758,0.313,0.331,0.156
8,Bailey Falter,PIT,NYM,NYM,0.342,5.01,4.72,0.5,1.55,4.85,...,14.67,0.3,3.72,4.09,1.26,3.6,0.693,0.273,0.302,0.169
9,David Peterson,NYM,NYM,PIT,0.374,4.45,3.6,0.5,1.61,3.98,...,14.08,0.307,5.06,4.91,1.52,4.42,0.696,0.26,0.306,0.158


# Breaking dataframe into home and away
Now we create a dataframe with just away team rows and one with just home team rows, based on whether or not the team name matches the park.

In [111]:
away_df = todays_SPs[todays_SPs['Team'] != todays_SPs['Park']]

In [112]:
away_df

Unnamed: 0,Name,Team,Park,Opponent,BABIP_SP,FIP,xFIP,WAR,WHIP,SIERA,...,Avg_Outs,BABIP_BP,FIP_BP,xFIP_BP,WHIP_BP,SIERA_BP,OPS,BABIP,wOBA,ISO
0,Logan Allen,CLE,CIN,CIN,0.307,4.09,4.28,1.4,1.37,4.45,...,16.12,0.333,3.77,3.89,1.48,3.83,0.71,0.287,0.308,0.149
2,Cristian Javier,HOU,MIA,MIA,0.265,4.59,5.17,1.4,1.22,4.77,...,16.05,0.257,5.77,4.97,1.45,4.61,0.792,0.3,0.343,0.184
4,Nick Pivetta,BOS,WSN,WSN,0.26,4.37,3.92,0.7,1.17,3.61,...,15.7,0.339,3.86,4.13,1.42,3.82,0.739,0.294,0.317,0.177
6,Zack Wheeler,PHI,TOR,TOR,0.307,3.06,3.45,4.3,1.11,3.44,...,17.87,0.27,4.31,4.65,1.3,4.32,0.755,0.302,0.329,0.171
8,Bailey Falter,PIT,NYM,NYM,0.342,5.01,4.72,0.5,1.55,4.85,...,14.67,0.3,3.72,4.09,1.26,3.6,0.693,0.273,0.302,0.169
10,Luis Severino,NYY,ATL,ATL,0.359,6.65,5.13,-0.7,1.88,5.01,...,14.23,0.26,4.18,3.98,1.21,3.78,0.717,0.289,0.316,0.149
12,Alex Faedo,DET,MIN,MIN,0.22,5.18,4.43,0.1,1.07,4.36,...,15.13,0.308,3.98,4.39,1.37,4.0,0.691,0.304,0.3,0.144
14,Lucas Giolito,LAA,TEX,TEX,0.28,4.79,4.43,1.4,1.26,4.18,...,17.0,0.332,5.31,4.72,1.66,4.29,0.731,0.29,0.311,0.193
16,Touki Toussaint,CHW,CHC,CHC,0.252,4.63,4.67,0.3,1.4,5.11,...,14.43,0.306,4.48,4.44,1.36,4.0,0.669,0.3,0.292,0.132
18,Emerson Hancock,SEA,KCR,KCR,0.143,3.87,5.18,0.1,1.0,5.75,...,15.0,0.284,3.36,3.4,1.18,3.32,0.754,0.312,0.326,0.184


In [113]:
home_df = todays_SPs[todays_SPs['Team'] == todays_SPs['Park']]

In [114]:
home_df

Unnamed: 0,Name,Team,Park,Opponent,BABIP_SP,FIP,xFIP,WAR,WHIP,SIERA,...,Avg_Outs,BABIP_BP,FIP_BP,xFIP_BP,WHIP_BP,SIERA_BP,OPS,BABIP,wOBA,ISO
1,Graham Ashcraft,CIN,CIN,CLE,0.292,5.13,4.71,1.1,1.41,4.92,...,16.36,0.279,5.12,4.55,1.38,4.12,0.729,0.29,0.313,0.191
3,Johnny Cueto,MIA,MIA,HOU,0.194,5.83,4.82,-0.1,1.0,4.53,...,14.4,0.312,3.81,4.1,1.36,3.72,0.703,0.3,0.302,0.15
5,Josiah Gray,WSN,WSN,BOS,0.296,4.74,5.01,1.7,1.44,5.01,...,16.52,0.317,4.39,4.44,1.39,4.06,0.742,0.295,0.323,0.158
7,Yusei Kikuchi,TOR,TOR,PHI,0.291,4.52,3.96,1.4,1.23,3.99,...,15.96,0.278,3.85,4.51,1.23,4.21,0.758,0.313,0.331,0.156
9,David Peterson,NYM,NYM,PIT,0.374,4.45,3.6,0.5,1.61,3.98,...,14.08,0.307,5.06,4.91,1.52,4.42,0.696,0.26,0.306,0.158
11,Bryce Elder,ATL,ATL,NYY,0.276,4.3,4.32,1.5,1.24,4.58,...,17.09,0.292,4.05,4.01,1.31,3.93,0.902,0.321,0.381,0.248
13,Bailey Ober,MIN,MIN,DET,0.297,3.79,4.31,2.0,1.11,3.95,...,17.16,0.306,4.64,4.32,1.34,3.85,0.8,0.314,0.344,0.212
15,Jordan Montgomery,TEX,TEX,LAA,0.299,3.78,4.1,2.7,1.25,4.26,...,17.35,0.258,4.71,4.23,1.15,3.64,0.794,0.31,0.342,0.198
17,Kyle Hendricks,CHC,CHC,CHW,0.266,4.2,4.53,1.3,1.12,4.61,...,17.27,0.244,4.64,4.41,1.21,4.13,0.848,0.329,0.364,0.207
19,Jordan Lyles,KCR,KCR,SEA,0.265,5.25,5.29,0.5,1.27,5.12,...,17.23,0.291,5.45,5.18,1.48,4.89,0.781,0.31,0.333,0.191


Adding _A and _H to column names

In [115]:
away_df.columns

Index(['Name', 'Team', 'Park', 'Opponent', 'BABIP_SP', 'FIP', 'xFIP', 'WAR',
       'WHIP', 'SIERA', 'RS/9', 'Date', 'Avg_Outs', 'BABIP_BP', 'FIP_BP',
       'xFIP_BP', 'WHIP_BP', 'SIERA_BP', 'OPS', 'BABIP', 'wOBA', 'ISO'],
      dtype='object')

In [116]:
cols_to_change = ['Team', 'BABIP_SP', 'FIP', 'xFIP', 'WAR',
       'WHIP', 'SIERA', 'RS/9', 'Date', 'Avg_Outs', 'BABIP_BP', 'FIP_BP',
       'xFIP_BP', 'WHIP_BP', 'SIERA_BP', '#', 'OPS', 'BABIP', 'wOBA', 'ISO']

In [117]:
away_col_names = []
home_col_names = []

In [118]:
for i in range(len(cols_to_change)):
    away_col_names.append('A_' + cols_to_change[i])
    home_col_names.append('H_' + cols_to_change[i])

In [119]:
away_change_dict = dict(zip(cols_to_change, away_col_names))

In [120]:
home_change_dict = dict(zip(cols_to_change, home_col_names))

In [121]:
away_change_dict

{'Team': 'A_Team',
 'BABIP_SP': 'A_BABIP_SP',
 'FIP': 'A_FIP',
 'xFIP': 'A_xFIP',
 'WAR': 'A_WAR',
 'WHIP': 'A_WHIP',
 'SIERA': 'A_SIERA',
 'RS/9': 'A_RS/9',
 'Date': 'A_Date',
 'Avg_Outs': 'A_Avg_Outs',
 'BABIP_BP': 'A_BABIP_BP',
 'FIP_BP': 'A_FIP_BP',
 'xFIP_BP': 'A_xFIP_BP',
 'WHIP_BP': 'A_WHIP_BP',
 'SIERA_BP': 'A_SIERA_BP',
 '#': 'A_#',
 'OPS': 'A_OPS',
 'BABIP': 'A_BABIP',
 'wOBA': 'A_wOBA',
 'ISO': 'A_ISO'}

In [122]:
away_df = away_df.rename(columns = away_change_dict)

In [123]:
home_df = home_df.rename(columns = home_change_dict)

In [124]:
away_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 0 to 26
Data columns (total 22 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Name        14 non-null     object        
 1   A_Team      14 non-null     object        
 2   Park        14 non-null     object        
 3   Opponent    14 non-null     object        
 4   A_BABIP_SP  14 non-null     float64       
 5   A_FIP       14 non-null     float64       
 6   A_xFIP      14 non-null     float64       
 7   A_WAR       14 non-null     float64       
 8   A_WHIP      14 non-null     float64       
 9   A_SIERA     14 non-null     float64       
 10  A_RS/9      14 non-null     float64       
 11  A_Date      14 non-null     datetime64[ns]
 12  A_Avg_Outs  14 non-null     float64       
 13  A_BABIP_BP  14 non-null     float64       
 14  A_FIP_BP    14 non-null     float64       
 15  A_xFIP_BP   14 non-null     float64       
 16  A_WHIP_BP   14 non-null     

In [125]:
home_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 1 to 27
Data columns (total 22 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Name        14 non-null     object        
 1   H_Team      14 non-null     object        
 2   Park        14 non-null     object        
 3   Opponent    14 non-null     object        
 4   H_BABIP_SP  14 non-null     float64       
 5   H_FIP       14 non-null     float64       
 6   H_xFIP      14 non-null     float64       
 7   H_WAR       14 non-null     float64       
 8   H_WHIP      14 non-null     float64       
 9   H_SIERA     14 non-null     float64       
 10  H_RS/9      14 non-null     float64       
 11  H_Date      14 non-null     datetime64[ns]
 12  H_Avg_Outs  14 non-null     float64       
 13  H_BABIP_BP  14 non-null     float64       
 14  H_FIP_BP    14 non-null     float64       
 15  H_xFIP_BP   14 non-null     float64       
 16  H_WHIP_BP   14 non-null     

# Win variables
Previously, we created the dataframe for all the season's scores here. But now it's created in our Scoresheet notebook as we keep track of our model's performance. So we read in that CSV here.

In [126]:
scores_df = pd.read_csv('2023_Game_Scores.csv')

In [127]:
scores_df = clean_unnamed(scores_df)

In [128]:
scores_df = scores_df.reset_index()

In [129]:
scores_df.head()

Unnamed: 0,index,Away_Team,Away_Score,Home_Team,Home_Score,Date,Home_Win,Away_Win
0,0,BAL,10,BOS,9,2023-03-30,0,1
1,1,MIL,0,CHC,4,2023-03-30,1,0
2,2,PIT,5,CIN,4,2023-03-30,0,1
3,3,CHW,3,HOU,2,2023-03-30,0,1
4,4,MIN,2,KCR,0,2023-03-30,0,1


In [130]:
scores_df.tail()

Unnamed: 0,index,Away_Team,Away_Score,Home_Team,Home_Score,Date,Home_Win,Away_Win
1781,1781,PIT,2,NYM,7,2023-08-14,1,0
1782,1782,BAL,4,SDP,1,2023-08-14,0,1
1783,1783,TBR,10,SFG,2,2023-08-14,0,1
1784,1784,OAK,5,STL,7,2023-08-14,1,0
1785,1785,LAA,0,TEX,12,2023-08-14,1,0


In [131]:
scores_df['index'].nunique()

1786

In [132]:
scores_df.rename(columns = {'index':'game_id'}, inplace = True)

In [133]:
scores_df.head()

Unnamed: 0,game_id,Away_Team,Away_Score,Home_Team,Home_Score,Date,Home_Win,Away_Win
0,0,BAL,10,BOS,9,2023-03-30,0,1
1,1,MIL,0,CHC,4,2023-03-30,1,0
2,2,PIT,5,CIN,4,2023-03-30,0,1
3,3,CHW,3,HOU,2,2023-03-30,0,1
4,4,MIN,2,KCR,0,2023-03-30,0,1


In [134]:
scores_df['Away_Host'] = 0
scores_df['Home_Host'] = 1

In [135]:
# scores_df['Home_Win'] = np.where(scores_df['Home_Score'] > scores_df['Away_Score'], 1, 0)
# scores_df['Away_Win'] = np.where(scores_df['Away_Score'] > scores_df['Home_Score'], 1, 0)

In [136]:
# scores_df = scores_df.rename(columns = {'Away':'Away_Team', 'Home': 'Home_Team'})

In [137]:
scores_df_A = scores_df[['Date', 'game_id', 'Away_Team', 'Away_Score', 'Away_Win', 'Away_Host']]
scores_df_H = scores_df[['Date', 'game_id', 'Home_Team', 'Home_Score', 'Home_Win', 'Home_Host']]

In [138]:
scores_df_A.head()

Unnamed: 0,Date,game_id,Away_Team,Away_Score,Away_Win,Away_Host
0,2023-03-30,0,BAL,10,1,0
1,2023-03-30,1,MIL,0,0,0
2,2023-03-30,2,PIT,5,1,0
3,2023-03-30,3,CHW,3,1,0
4,2023-03-30,4,MIN,2,1,0


In [139]:
scores_df_H.head()

Unnamed: 0,Date,game_id,Home_Team,Home_Score,Home_Win,Home_Host
0,2023-03-30,0,BOS,9,0,1
1,2023-03-30,1,CHC,4,1,1
2,2023-03-30,2,CIN,4,0,1
3,2023-03-30,3,HOU,2,0,1
4,2023-03-30,4,KCR,0,0,1


In [140]:
scores_df_A = scores_df_A.rename(columns = {'Away_Team': 'Team', 'Away_Score': 'Score', 'Away_Win': 'Win', 'Away_Host': 'Host'})
scores_df_H = scores_df_H.rename(columns = {'Home_Team': 'Team', 'Home_Score': 'Score', 'Home_Win': 'Win', 'Home_Host': 'Host'})

# Re-assembling home and away rows
We concatenate our home and away dataframes and srt by Date, game_id and the Host variable we created. That way each individual game is represented on consecutive rows and the row pertaining to the away team is the one on the top.

In [141]:
wins_df = pd.concat([scores_df_A, scores_df_H]).sort_values(['Date', 'game_id', 'Host'])

In [142]:
wins_df.tail(50)

Unnamed: 0,Date,game_id,Team,Score,Win,Host
1761,2023-08-12,1761,OAK,2,0,0
1761,2023-08-12,1761,WSN,3,1,1
1762,2023-08-13,1762,SDP,4,0,0
1762,2023-08-13,1762,ARI,5,1,1
1763,2023-08-13,1763,DET,3,0,0
1763,2023-08-13,1763,BOS,6,1,1
1764,2023-08-13,1764,MIL,7,1,0
1764,2023-08-13,1764,CHW,3,0,1
1765,2023-08-13,1765,LAA,2,1,0
1765,2023-08-13,1765,HOU,1,0,1


In [143]:
wins_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3572 entries, 0 to 1785
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Date     3572 non-null   object
 1   game_id  3572 non-null   int64 
 2   Team     3572 non-null   object
 3   Score    3572 non-null   int64 
 4   Win      3572 non-null   int64 
 5   Host     3572 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 195.3+ KB


# Adding a Team_Wins column
We group by Team and Win cumsum.

In [144]:
wins_df['Team_Wins'] = wins_df.groupby('Team')['Win'].cumsum()

In [145]:
wins_df.tail(31)

Unnamed: 0,Date,game_id,Team,Score,Win,Host,Team_Wins
1770,2023-08-13,1770,PIT,4,1,1,53
1771,2023-08-13,1771,CIN,6,1,0,62
1771,2023-08-13,1771,PIT,5,0,1,53
1772,2023-08-13,1772,BAL,5,1,0,73
1772,2023-08-13,1772,SEA,3,0,1,63
1773,2023-08-13,1773,TEX,2,0,0,70
1773,2023-08-13,1773,SFG,3,1,1,63
1774,2023-08-13,1774,CLE,9,1,0,57
1774,2023-08-13,1774,TBR,2,0,1,71
1775,2023-08-13,1775,CHC,4,0,0,61


In [146]:
# pd.set_option('display.max_rows', None)

# Last 10
Now we need a wins-in-last-10-games variable. This for loop creates a temporary dataframe for each team and subtracts Team_Wins from 10 rows before from Team_Wins in the current row of the loop. We need to reset the index of each temp_df so we can use the index when we use .loc.<br>

Then we concat the list of temp_dfs and sort by Date and game_id.<br>

There are two for loops within the main for loop because for the first 10 rows, wins in last 10 games is just total wins.

In [147]:
list_of_dfs = []
for team in wins_df['Team'].unique():
    temp_df = wins_df[wins_df['Team'] == team]
    temp_df = temp_df.reset_index(drop=True)
    temp_df['Wins_L10'] = 0
    for i in range(10):
        temp_df.loc[i, 'Wins_L10'] = temp_df.loc[i, 'Team_Wins']
    for j in range(10, (len(temp_df))):
        temp_df.loc[j, 'Wins_L10'] = temp_df.loc[j, 'Team_Wins'] - temp_df.loc[j-10, 'Team_Wins']
    list_of_dfs.append(temp_df)

In [148]:
wins_df = pd.concat(list_of_dfs).sort_values(['Date', 'game_id', 'Host'])

In [149]:
wins_df.tail()

Unnamed: 0,Date,game_id,Team,Score,Win,Host,Team_Wins,Wins_L10
118,2023-08-14,1783,SFG,2,0,1,63,3
118,2023-08-14,1784,OAK,5,0,0,33,3
118,2023-08-14,1784,STL,7,1,1,53,5
119,2023-08-14,1785,LAA,0,0,0,59,3
118,2023-08-14,1785,TEX,12,1,1,71,8


# Last 30

In [150]:
list_of_dfs = []
for team in wins_df['Team'].unique():
    temp_df = wins_df[wins_df['Team'] == team]
    temp_df = temp_df.reset_index(drop=True)
    temp_df['Wins_L30'] = 0
    for i in range(30):
        temp_df.loc[i, 'Wins_L30'] = temp_df.loc[i, 'Team_Wins']
    for j in range(30, (len(temp_df))):
        temp_df.loc[j, 'Wins_L30'] = temp_df.loc[j, 'Team_Wins'] - temp_df.loc[j-30, 'Team_Wins']
    list_of_dfs.append(temp_df)



In [151]:
wins_df = pd.concat(list_of_dfs).sort_values(['Date', 'game_id'])



In [152]:
wins_df.head(50)



Unnamed: 0,Date,game_id,Team,Score,Win,Host,Team_Wins,Wins_L10,Wins_L30
0,2023-03-30,0,BAL,10,1,0,1,1,1
0,2023-03-30,0,BOS,9,0,1,0,0,0
0,2023-03-30,1,MIL,0,0,0,0,0,0
0,2023-03-30,1,CHC,4,1,1,1,1,1
0,2023-03-30,2,PIT,5,1,0,1,1,1
0,2023-03-30,2,CIN,4,0,1,0,0,0
0,2023-03-30,3,CHW,3,1,0,1,1,1
0,2023-03-30,3,HOU,2,0,1,0,0,0
0,2023-03-30,4,MIN,2,1,0,1,1,1
0,2023-03-30,4,KCR,0,0,1,0,0,0


# Pythagorean wins
Now let's add a Pythagorean Wins variable. According to <a href='https://www.mlb.com/glossary/advanced-stats/pythagorean-winning-percentage'>MLB.com</a>, this is a formula created by noted baseball statistician Bill James to measure how many games a team "should have" won, based on runs scored and runs allowed.<br>

**Formula:** Runs Scored to the 1.83 power/((Runs Scored to the 1.83 power) + (Runs Allowed to the 1.83 power))<br>

Because that formula outputs the winning percentage a team "should" have, we multiply by games played to convert to wins.<br>

We'll start by creating a variable that counts each team's games played, then cumulative Runs_Scored and Runs_Allowed variables.<br>



In [153]:
wins_df = wins_df.reset_index(drop=True)

In [154]:
wins_df['Games_Played'] = wins_df.groupby('Team').cumcount() + 1

In [155]:
wins_df.tail(10)

Unnamed: 0,Date,game_id,Team,Score,Win,Host,Team_Wins,Wins_L10,Wins_L30,Games_Played
3562,2023-08-14,1781,PIT,2,0,0,53,4,13,119
3563,2023-08-14,1781,NYM,7,1,1,54,4,12,119
3564,2023-08-14,1782,BAL,4,1,0,74,7,20,119
3565,2023-08-14,1782,SDP,1,0,1,56,2,14,119
3566,2023-08-14,1783,SFG,2,0,1,63,3,15,119
3567,2023-08-14,1783,TBR,10,1,0,72,6,15,121
3568,2023-08-14,1784,OAK,5,0,0,33,3,8,119
3569,2023-08-14,1784,STL,7,1,1,53,5,16,119
3570,2023-08-14,1785,LAA,0,0,0,59,3,14,120
3571,2023-08-14,1785,TEX,12,1,1,71,8,19,119


Getting runs scored and runs allowed.

In [156]:
wins_df['Opp_Score'] = 0

In [157]:
for i in range(len(wins_df)):
    if i % 2 == 0:
        wins_df.loc[i, 'Opp_Score'] = wins_df.loc[i+1, 'Score']
    else:
        wins_df.loc[i, 'Opp_Score'] = wins_df.loc[i-1, 'Score']

In [158]:
wins_df.head()

Unnamed: 0,Date,game_id,Team,Score,Win,Host,Team_Wins,Wins_L10,Wins_L30,Games_Played,Opp_Score
0,2023-03-30,0,BAL,10,1,0,1,1,1,1,9
1,2023-03-30,0,BOS,9,0,1,0,0,0,1,10
2,2023-03-30,1,MIL,0,0,0,0,0,0,1,4
3,2023-03-30,1,CHC,4,1,1,1,1,1,1,0
4,2023-03-30,2,PIT,5,1,0,1,1,1,1,4


In [159]:
wins_df['Runs_Scored'] = wins_df.groupby('Team')['Score'].cumsum()

In [160]:
wins_df['Runs_Allowed'] = wins_df.groupby('Team')['Opp_Score'].cumsum()

In [161]:
wins_df['Py_Wins'] = np.round(wins_df['Runs_Scored']**1.83/((wins_df['Runs_Scored']**1.83)+(wins_df['Runs_Allowed']**1.83))\
*wins_df['Games_Played'], 0)

In [162]:
wins_df.tail()

Unnamed: 0,Date,game_id,Team,Score,Win,Host,Team_Wins,Wins_L10,Wins_L30,Games_Played,Opp_Score,Runs_Scored,Runs_Allowed,Py_Wins
3567,2023-08-14,1783,TBR,10,1,0,72,6,15,121,2,630,483,75.0
3568,2023-08-14,1784,OAK,5,0,0,33,3,8,119,7,425,711,33.0
3569,2023-08-14,1784,STL,7,1,1,53,5,16,119,5,558,584,57.0
3570,2023-08-14,1785,LAA,0,0,0,59,3,14,120,12,575,591,58.0
3571,2023-08-14,1785,TEX,12,1,1,71,8,19,119,0,684,491,77.0


Now we derive Run Differential, keep only the columns we need and then keep only the latest row for each team before we save to CSV.

In [163]:
wins_df['Run_Diff'] = wins_df['Runs_Scored'] - wins_df['Runs_Allowed']

In [164]:
wins_df = wins_df.drop(columns = ['Games_Played', 'Opp_Score', 'Runs_Scored', 'Runs_Allowed'])

In [165]:
wins_df = wins_df.drop(columns = ['game_id', 'Score', 'Win', 'Host'])

In [166]:
wins_df['Date'] = today

In [167]:
wins_df['Date'] = pd.to_datetime(wins_df['Date'])

In [168]:
wins_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3572 entries, 0 to 3571
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       3572 non-null   datetime64[ns]
 1   Team       3572 non-null   object        
 2   Team_Wins  3572 non-null   int64         
 3   Wins_L10   3572 non-null   int64         
 4   Wins_L30   3572 non-null   int64         
 5   Py_Wins    3572 non-null   float64       
 6   Run_Diff   3572 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 195.5+ KB


In [169]:
wins_df = wins_df.sort_values(by = ['Team', 'Date'], ascending = ['True', 'False'])

In [170]:
wins_df = wins_df.drop_duplicates(subset = 'Team', keep = 'last')

In [171]:
filepath = r'C:\Users\Owner\Sports Betting\MLB_Game_Outcome\2023_Win_Features_' + today_str + '.csv'
wins_df.to_csv(filepath)

In [172]:
wins_df

Unnamed: 0,Date,Team,Team_Wins,Wins_L10,Wins_L30,Py_Wins,Run_Diff
3556,2023-08-15,ARI,59,2,8,57.0,-24
3555,2023-08-15,ATL,76,6,16,77.0,201
3564,2023-08-15,BAL,74,7,20,66.0,68
3526,2023-08-15,BOS,62,5,17,62.0,29
3550,2023-08-15,CHC,61,6,20,65.0,64
3529,2023-08-15,CHW,47,4,10,50.0,-95
3543,2023-08-15,CIN,62,3,12,58.0,-22
3548,2023-08-15,CLE,57,4,12,60.0,0
3557,2023-08-15,COL,46,3,12,43.0,-178
3527,2023-08-15,DET,53,5,14,49.0,-100


# Back to home and away dataframes
New we still have our separate home and away dataframes even though we concatenated to make the wins_df. We'll merge those now with our wins_df.

In [173]:
away_df = away_df.rename(columns = {'A_Team': 'Team'})
home_df = home_df.rename(columns = {'H_Team': 'Team'})

In [174]:
wins_df = wins_df.drop('Date', axis = 1)

In [175]:
away_df = pd.merge(away_df, wins_df, on = ['Team'], how = 'left')

In [176]:
home_df = pd.merge(home_df, wins_df, on = ['Team'], how = 'left')

In [177]:
away_df

Unnamed: 0,Name,Team,Park,Opponent,A_BABIP_SP,A_FIP,A_xFIP,A_WAR,A_WHIP,A_SIERA,...,A_SIERA_BP,A_OPS,A_BABIP,A_wOBA,A_ISO,Team_Wins,Wins_L10,Wins_L30,Py_Wins,Run_Diff
0,Logan Allen,CLE,CIN,CIN,0.307,4.09,4.28,1.4,1.37,4.45,...,3.83,0.71,0.287,0.308,0.149,57,4,12,60.0,0
1,Cristian Javier,HOU,MIA,MIA,0.265,4.59,5.17,1.4,1.22,4.77,...,4.61,0.792,0.3,0.343,0.184,68,6,18,68.0,82
2,Nick Pivetta,BOS,WSN,WSN,0.26,4.37,3.92,0.7,1.17,3.61,...,3.82,0.739,0.294,0.317,0.177,62,5,17,62.0,29
3,Zack Wheeler,PHI,TOR,TOR,0.307,3.06,3.45,4.3,1.11,3.44,...,4.32,0.755,0.302,0.329,0.171,65,6,17,63.0,33
4,Bailey Falter,PIT,NYM,NYM,0.342,5.01,4.72,0.5,1.55,4.85,...,3.6,0.693,0.273,0.302,0.169,53,4,13,51.0,-89
5,Luis Severino,NYY,ATL,ATL,0.359,6.65,5.13,-0.7,1.88,5.01,...,3.78,0.717,0.289,0.316,0.149,60,3,12,59.0,-7
6,Alex Faedo,DET,MIN,MIN,0.22,5.18,4.43,0.1,1.07,4.36,...,4.0,0.691,0.304,0.3,0.144,53,5,14,49.0,-100
7,Lucas Giolito,LAA,TEX,TEX,0.28,4.79,4.43,1.4,1.26,4.18,...,4.29,0.731,0.29,0.311,0.193,59,3,14,58.0,-16
8,Touki Toussaint,CHW,CHC,CHC,0.252,4.63,4.67,0.3,1.4,5.11,...,4.0,0.669,0.3,0.292,0.132,47,4,10,50.0,-95
9,Emerson Hancock,SEA,KCR,KCR,0.143,3.87,5.18,0.1,1.0,5.75,...,3.32,0.754,0.312,0.326,0.184,63,7,19,65.0,57


In [178]:
home_df

Unnamed: 0,Name,Team,Park,Opponent,H_BABIP_SP,H_FIP,H_xFIP,H_WAR,H_WHIP,H_SIERA,...,H_SIERA_BP,H_OPS,H_BABIP,H_wOBA,H_ISO,Team_Wins,Wins_L10,Wins_L30,Py_Wins,Run_Diff
0,Graham Ashcraft,CIN,CIN,CLE,0.292,5.13,4.71,1.1,1.41,4.92,...,4.12,0.729,0.29,0.313,0.191,62,3,12,58.0,-22
1,Johnny Cueto,MIA,MIA,HOU,0.194,5.83,4.82,-0.1,1.0,4.53,...,3.72,0.703,0.3,0.302,0.15,63,5,12,56.0,-36
2,Josiah Gray,WSN,WSN,BOS,0.296,4.74,5.01,1.7,1.44,5.01,...,4.06,0.742,0.295,0.323,0.158,53,7,18,51.0,-89
3,Yusei Kikuchi,TOR,TOR,PHI,0.291,4.52,3.96,1.4,1.23,3.99,...,4.21,0.758,0.313,0.331,0.156,66,6,17,66.0,59
4,David Peterson,NYM,NYM,PIT,0.374,4.45,3.6,0.5,1.61,3.98,...,4.42,0.696,0.26,0.306,0.158,54,4,12,54.0,-51
5,Bryce Elder,ATL,ATL,NYY,0.276,4.3,4.32,1.5,1.24,4.58,...,3.93,0.902,0.321,0.381,0.248,76,6,16,77.0,201
6,Bailey Ober,MIN,MIN,DET,0.297,3.79,4.31,2.0,1.11,3.95,...,3.85,0.8,0.314,0.344,0.212,62,6,17,64.0,39
7,Jordan Montgomery,TEX,TEX,LAA,0.299,3.78,4.1,2.7,1.25,4.26,...,3.64,0.794,0.31,0.342,0.198,71,8,19,77.0,193
8,Kyle Hendricks,CHC,CHC,CHW,0.266,4.2,4.53,1.3,1.12,4.61,...,4.13,0.848,0.329,0.364,0.207,61,6,20,65.0,64
9,Jordan Lyles,KCR,KCR,SEA,0.265,5.25,5.29,0.5,1.27,5.12,...,4.89,0.781,0.31,0.333,0.191,39,4,14,45.0,-159


In [179]:
away_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 0 to 13
Data columns (total 27 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Name        14 non-null     object        
 1   Team        14 non-null     object        
 2   Park        14 non-null     object        
 3   Opponent    14 non-null     object        
 4   A_BABIP_SP  14 non-null     float64       
 5   A_FIP       14 non-null     float64       
 6   A_xFIP      14 non-null     float64       
 7   A_WAR       14 non-null     float64       
 8   A_WHIP      14 non-null     float64       
 9   A_SIERA     14 non-null     float64       
 10  A_RS/9      14 non-null     float64       
 11  A_Date      14 non-null     datetime64[ns]
 12  A_Avg_Outs  14 non-null     float64       
 13  A_BABIP_BP  14 non-null     float64       
 14  A_FIP_BP    14 non-null     float64       
 15  A_xFIP_BP   14 non-null     float64       
 16  A_WHIP_BP   14 non-null     

In [180]:
away_df = away_df.rename(columns = {'Name': 'A_Starter', 'Team': 'A_Team', 'Opponent': 'A_Opponent', 'Team_Wins': 'A_Team_Wins',\
                                   'Wins_L10': 'A_Wins_L10', 'Wins_L30': 'A_Wins_L30', 'Py_Wins': 'A_Py_Wins', 'Run_Diff': 'A_Run_Diff'})

In [181]:
# away_df = away_df.drop(['Name'], axis = 1)

In [182]:
home_df = home_df.rename(columns = {'Name': 'H_Starter', 'Team': 'H_Team', 'Opponent': 'H_Opponent', 'Team_Wins': 'H_Team_Wins',\
                                   'Wins_L10': 'H_Wins_L10', 'Wins_L30': 'H_Wins_L30', 'Py_Wins': 'H_Py_Wins', 'Run_Diff': 'H_Run_Diff'})

In [183]:
# home_df = home_df.drop(['Name'], axis = 1)

In [184]:
away_df

Unnamed: 0,A_Starter,A_Team,Park,A_Opponent,A_BABIP_SP,A_FIP,A_xFIP,A_WAR,A_WHIP,A_SIERA,...,A_SIERA_BP,A_OPS,A_BABIP,A_wOBA,A_ISO,A_Team_Wins,A_Wins_L10,A_Wins_L30,A_Py_Wins,A_Run_Diff
0,Logan Allen,CLE,CIN,CIN,0.307,4.09,4.28,1.4,1.37,4.45,...,3.83,0.71,0.287,0.308,0.149,57,4,12,60.0,0
1,Cristian Javier,HOU,MIA,MIA,0.265,4.59,5.17,1.4,1.22,4.77,...,4.61,0.792,0.3,0.343,0.184,68,6,18,68.0,82
2,Nick Pivetta,BOS,WSN,WSN,0.26,4.37,3.92,0.7,1.17,3.61,...,3.82,0.739,0.294,0.317,0.177,62,5,17,62.0,29
3,Zack Wheeler,PHI,TOR,TOR,0.307,3.06,3.45,4.3,1.11,3.44,...,4.32,0.755,0.302,0.329,0.171,65,6,17,63.0,33
4,Bailey Falter,PIT,NYM,NYM,0.342,5.01,4.72,0.5,1.55,4.85,...,3.6,0.693,0.273,0.302,0.169,53,4,13,51.0,-89
5,Luis Severino,NYY,ATL,ATL,0.359,6.65,5.13,-0.7,1.88,5.01,...,3.78,0.717,0.289,0.316,0.149,60,3,12,59.0,-7
6,Alex Faedo,DET,MIN,MIN,0.22,5.18,4.43,0.1,1.07,4.36,...,4.0,0.691,0.304,0.3,0.144,53,5,14,49.0,-100
7,Lucas Giolito,LAA,TEX,TEX,0.28,4.79,4.43,1.4,1.26,4.18,...,4.29,0.731,0.29,0.311,0.193,59,3,14,58.0,-16
8,Touki Toussaint,CHW,CHC,CHC,0.252,4.63,4.67,0.3,1.4,5.11,...,4.0,0.669,0.3,0.292,0.132,47,4,10,50.0,-95
9,Emerson Hancock,SEA,KCR,KCR,0.143,3.87,5.18,0.1,1.0,5.75,...,3.32,0.754,0.312,0.326,0.184,63,7,19,65.0,57


In [185]:
away_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 0 to 13
Data columns (total 27 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   A_Starter    14 non-null     object        
 1   A_Team       14 non-null     object        
 2   Park         14 non-null     object        
 3   A_Opponent   14 non-null     object        
 4   A_BABIP_SP   14 non-null     float64       
 5   A_FIP        14 non-null     float64       
 6   A_xFIP       14 non-null     float64       
 7   A_WAR        14 non-null     float64       
 8   A_WHIP       14 non-null     float64       
 9   A_SIERA      14 non-null     float64       
 10  A_RS/9       14 non-null     float64       
 11  A_Date       14 non-null     datetime64[ns]
 12  A_Avg_Outs   14 non-null     float64       
 13  A_BABIP_BP   14 non-null     float64       
 14  A_FIP_BP     14 non-null     float64       
 15  A_xFIP_BP    14 non-null     float64       
 16  A_WHIP_BP 

In [186]:
home_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 0 to 13
Data columns (total 27 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   H_Starter    14 non-null     object        
 1   H_Team       14 non-null     object        
 2   Park         14 non-null     object        
 3   H_Opponent   14 non-null     object        
 4   H_BABIP_SP   14 non-null     float64       
 5   H_FIP        14 non-null     float64       
 6   H_xFIP       14 non-null     float64       
 7   H_WAR        14 non-null     float64       
 8   H_WHIP       14 non-null     float64       
 9   H_SIERA      14 non-null     float64       
 10  H_RS/9       14 non-null     float64       
 11  H_Date       14 non-null     datetime64[ns]
 12  H_Avg_Outs   14 non-null     float64       
 13  H_BABIP_BP   14 non-null     float64       
 14  H_FIP_BP     14 non-null     float64       
 15  H_xFIP_BP    14 non-null     float64       
 16  H_WHIP_BP 

In [187]:
away_df = away_df.rename(columns = {'A_Date': 'Date'})

In [188]:
home_df = home_df.rename(columns = {'H_Date': 'Date'})

In [189]:
main_df = pd.merge(away_df, home_df, on = ['Park', 'Date'], how = 'left')

In [190]:
main_df

Unnamed: 0,A_Starter,A_Team,Park,A_Opponent,A_BABIP_SP,A_FIP,A_xFIP,A_WAR,A_WHIP,A_SIERA,...,H_SIERA_BP,H_OPS,H_BABIP,H_wOBA,H_ISO,H_Team_Wins,H_Wins_L10,H_Wins_L30,H_Py_Wins,H_Run_Diff
0,Logan Allen,CLE,CIN,CIN,0.307,4.09,4.28,1.4,1.37,4.45,...,4.12,0.729,0.29,0.313,0.191,62,3,12,58.0,-22
1,Cristian Javier,HOU,MIA,MIA,0.265,4.59,5.17,1.4,1.22,4.77,...,3.72,0.703,0.3,0.302,0.15,63,5,12,56.0,-36
2,Nick Pivetta,BOS,WSN,WSN,0.26,4.37,3.92,0.7,1.17,3.61,...,4.06,0.742,0.295,0.323,0.158,53,7,18,51.0,-89
3,Zack Wheeler,PHI,TOR,TOR,0.307,3.06,3.45,4.3,1.11,3.44,...,4.21,0.758,0.313,0.331,0.156,66,6,17,66.0,59
4,Bailey Falter,PIT,NYM,NYM,0.342,5.01,4.72,0.5,1.55,4.85,...,4.42,0.696,0.26,0.306,0.158,54,4,12,54.0,-51
5,Luis Severino,NYY,ATL,ATL,0.359,6.65,5.13,-0.7,1.88,5.01,...,3.93,0.902,0.321,0.381,0.248,76,6,16,77.0,201
6,Alex Faedo,DET,MIN,MIN,0.22,5.18,4.43,0.1,1.07,4.36,...,3.85,0.8,0.314,0.344,0.212,62,6,17,64.0,39
7,Lucas Giolito,LAA,TEX,TEX,0.28,4.79,4.43,1.4,1.26,4.18,...,3.64,0.794,0.31,0.342,0.198,71,8,19,77.0,193
8,Touki Toussaint,CHW,CHC,CHC,0.252,4.63,4.67,0.3,1.4,5.11,...,4.13,0.848,0.329,0.364,0.207,61,6,20,65.0,64
9,Emerson Hancock,SEA,KCR,KCR,0.143,3.87,5.18,0.1,1.0,5.75,...,4.89,0.781,0.31,0.333,0.191,39,4,14,45.0,-159


In [191]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 0 to 13
Data columns (total 52 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   A_Starter    14 non-null     object        
 1   A_Team       14 non-null     object        
 2   Park         14 non-null     object        
 3   A_Opponent   14 non-null     object        
 4   A_BABIP_SP   14 non-null     float64       
 5   A_FIP        14 non-null     float64       
 6   A_xFIP       14 non-null     float64       
 7   A_WAR        14 non-null     float64       
 8   A_WHIP       14 non-null     float64       
 9   A_SIERA      14 non-null     float64       
 10  A_RS/9       14 non-null     float64       
 11  Date         14 non-null     datetime64[ns]
 12  A_Avg_Outs   14 non-null     float64       
 13  A_BABIP_BP   14 non-null     float64       
 14  A_FIP_BP     14 non-null     float64       
 15  A_xFIP_BP    14 non-null     float64       
 16  A_WHIP_BP 

# 2023 Park Factors
Reading the CSV we pasted from MLB.com.<br>

It might be a little clunky but we just copy and paste the spreadsheet from MLB.com, and we do this periodically because the factors change slightly over the course of the season.

In [192]:
ballparks_df = pd.read_csv('ParkFactors2023.csv')

In [193]:
ballparks_df = clean_unnamed(ballparks_df)

In [194]:
ballparks_df = ballparks_df.iloc[1:, [0, 3]]

In [195]:
col_names = ['Team', 'Park_Factor']

In [196]:
ballparks_df = ballparks_df.rename(columns = {'Unnamed: 1': 'Team', 'Unnamed: 4': 'Park_Factor'})

In [197]:
ballparks_df = ballparks_df.dropna()

In [198]:
park_teams = ballparks_df['Team'].tolist()

In [199]:
park_teams = sorted(park_teams)

In [200]:
park_teams

['Angels',
 'Astros',
 'Athletics',
 'Blue Jays',
 'Braves',
 'Brewers',
 'Cardinals',
 'Cubs',
 'D-backs',
 'Dodgers',
 'Giants',
 'Guardians',
 'Mariners',
 'Marlins',
 'Mets',
 'Nationals',
 'Orioles',
 'Padres',
 'Phillies',
 'Pirates',
 'Rangers',
 'Rays',
 'Red Sox',
 'Reds',
 'Rockies',
 'Royals',
 'Tigers',
 'Twins',
 'White Sox',
 'Yankees']

In [201]:
team_codes = ['LAA', 'HOU', 'OAK', 'TOR', 'ATL', 'MIL', 'STL', 'CHC', 'ARI', 'LAD', 'SFG', 'CLE', 'SEA', 'MIA', 'NYM', 'WSN',\
             'BAL', 'SDP', 'PHI', 'PIT', 'TEX', 'TBR', 'BOS', 'CIN', 'COL', 'KCR', 'DET', 'MIN', 'CHW', 'NYY']

In [202]:
park_dict = dict(zip(park_teams, team_codes))

In [203]:
park_dict

{'Angels': 'LAA',
 'Astros': 'HOU',
 'Athletics': 'OAK',
 'Blue Jays': 'TOR',
 'Braves': 'ATL',
 'Brewers': 'MIL',
 'Cardinals': 'STL',
 'Cubs': 'CHC',
 'D-backs': 'ARI',
 'Dodgers': 'LAD',
 'Giants': 'SFG',
 'Guardians': 'CLE',
 'Mariners': 'SEA',
 'Marlins': 'MIA',
 'Mets': 'NYM',
 'Nationals': 'WSN',
 'Orioles': 'BAL',
 'Padres': 'SDP',
 'Phillies': 'PHI',
 'Pirates': 'PIT',
 'Rangers': 'TEX',
 'Rays': 'TBR',
 'Red Sox': 'BOS',
 'Reds': 'CIN',
 'Rockies': 'COL',
 'Royals': 'KCR',
 'Tigers': 'DET',
 'Twins': 'MIN',
 'White Sox': 'CHW',
 'Yankees': 'NYY'}

In [204]:
ballparks_df['Team'] = ballparks_df['Team'].replace(park_dict)

In [205]:
ballparks_df

Unnamed: 0,Team,Park_Factor
1,COL,111
2,BOS,109
3,CIN,107
5,BAL,103
6,KCR,103
7,CHC,102
8,WSN,102
9,LAA,101
10,ARI,101
11,PIT,101


In [206]:
ballparks_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 1 to 31
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Team         30 non-null     object
 1   Park_Factor  30 non-null     object
dtypes: object(2)
memory usage: 720.0+ bytes


In [207]:
ballparks_df['Park_Factor'] = ballparks_df['Park_Factor'].astype('int')

In [208]:
ballparks_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 1 to 31
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Team         30 non-null     object
 1   Park_Factor  30 non-null     int32 
dtypes: int32(1), object(1)
memory usage: 600.0+ bytes


In [209]:
park_factor_dict = ballparks_df.set_index('Team')['Park_Factor'].to_dict()

In [210]:
park_factor_dict

{'COL': 111,
 'BOS': 109,
 'CIN': 107,
 'BAL': 103,
 'KCR': 103,
 'CHC': 102,
 'WSN': 102,
 'LAA': 101,
 'ARI': 101,
 'PIT': 101,
 'ATL': 101,
 'PHI': 101,
 'TEX': 101,
 'MIN': 101,
 'CHW': 100,
 'TOR': 100,
 'LAD': 100,
 'HOU': 99,
 'STL': 98,
 'SFG': 98,
 'NYY': 98,
 'CLE': 97,
 'MIL': 97,
 'MIA': 97,
 'DET': 97,
 'OAK': 96,
 'NYM': 96,
 'TBR': 95,
 'SDP': 95,
 'SEA': 93}

In [211]:
main_df['Park'] = main_df['Park'].replace(park_factor_dict)

In [212]:
main_df

Unnamed: 0,A_Starter,A_Team,Park,A_Opponent,A_BABIP_SP,A_FIP,A_xFIP,A_WAR,A_WHIP,A_SIERA,...,H_SIERA_BP,H_OPS,H_BABIP,H_wOBA,H_ISO,H_Team_Wins,H_Wins_L10,H_Wins_L30,H_Py_Wins,H_Run_Diff
0,Logan Allen,CLE,107,CIN,0.307,4.09,4.28,1.4,1.37,4.45,...,4.12,0.729,0.29,0.313,0.191,62,3,12,58.0,-22
1,Cristian Javier,HOU,97,MIA,0.265,4.59,5.17,1.4,1.22,4.77,...,3.72,0.703,0.3,0.302,0.15,63,5,12,56.0,-36
2,Nick Pivetta,BOS,102,WSN,0.26,4.37,3.92,0.7,1.17,3.61,...,4.06,0.742,0.295,0.323,0.158,53,7,18,51.0,-89
3,Zack Wheeler,PHI,100,TOR,0.307,3.06,3.45,4.3,1.11,3.44,...,4.21,0.758,0.313,0.331,0.156,66,6,17,66.0,59
4,Bailey Falter,PIT,96,NYM,0.342,5.01,4.72,0.5,1.55,4.85,...,4.42,0.696,0.26,0.306,0.158,54,4,12,54.0,-51
5,Luis Severino,NYY,101,ATL,0.359,6.65,5.13,-0.7,1.88,5.01,...,3.93,0.902,0.321,0.381,0.248,76,6,16,77.0,201
6,Alex Faedo,DET,101,MIN,0.22,5.18,4.43,0.1,1.07,4.36,...,3.85,0.8,0.314,0.344,0.212,62,6,17,64.0,39
7,Lucas Giolito,LAA,101,TEX,0.28,4.79,4.43,1.4,1.26,4.18,...,3.64,0.794,0.31,0.342,0.198,71,8,19,77.0,193
8,Touki Toussaint,CHW,102,CHC,0.252,4.63,4.67,0.3,1.4,5.11,...,4.13,0.848,0.329,0.364,0.207,61,6,20,65.0,64
9,Emerson Hancock,SEA,103,KCR,0.143,3.87,5.18,0.1,1.0,5.75,...,4.89,0.781,0.31,0.333,0.191,39,4,14,45.0,-159


In [213]:
main_df = main_df.rename(columns = {'A_Team': 'Away','H_Team': 'Home'})

# 5-column index
We need to include home and away teams, home and away starters and game date in the index so that we can identify the game. For our model to run, the columns must be numerical and not strings.

In [214]:
main_df = main_df.set_index(['Away', 'Home', 'A_Starter', 'H_Starter', 'Date'])

In [215]:
main_df[['H_Opponent']]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,H_Opponent
Away,Home,A_Starter,H_Starter,Date,Unnamed: 5_level_1
CLE,CIN,Logan Allen,Graham Ashcraft,2023-08-15,CLE
HOU,MIA,Cristian Javier,Johnny Cueto,2023-08-15,HOU
BOS,WSN,Nick Pivetta,Josiah Gray,2023-08-15,BOS
PHI,TOR,Zack Wheeler,Yusei Kikuchi,2023-08-15,PHI
PIT,NYM,Bailey Falter,David Peterson,2023-08-15,PIT
NYY,ATL,Luis Severino,Bryce Elder,2023-08-15,NYY
DET,MIN,Alex Faedo,Bailey Ober,2023-08-15,DET
LAA,TEX,Lucas Giolito,Jordan Montgomery,2023-08-15,LAA
CHW,CHC,Touki Toussaint,Kyle Hendricks,2023-08-15,CHW
SEA,KCR,Emerson Hancock,Jordan Lyles,2023-08-15,SEA


In [216]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 14 entries, ('CLE', 'CIN', 'Logan Allen', 'Graham Ashcraft', Timestamp('2023-08-15 00:00:00')) to ('MIL', 'LAD', 'Adrian Houser', 'Bobby Miller', Timestamp('2023-08-15 00:00:00'))
Data columns (total 47 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Park         14 non-null     int64  
 1   A_Opponent   14 non-null     object 
 2   A_BABIP_SP   14 non-null     float64
 3   A_FIP        14 non-null     float64
 4   A_xFIP       14 non-null     float64
 5   A_WAR        14 non-null     float64
 6   A_WHIP       14 non-null     float64
 7   A_SIERA      14 non-null     float64
 8   A_RS/9       14 non-null     float64
 9   A_Avg_Outs   14 non-null     float64
 10  A_BABIP_BP   14 non-null     float64
 11  A_FIP_BP     14 non-null     float64
 12  A_xFIP_BP    14 non-null     float64
 13  A_WHIP_BP    14 non-null     float64
 14  A_SIERA_BP   14 non-null     float64
 15  A_OPS        14 no

In [217]:
main_df = main_df.drop(['A_Opponent', 'H_Opponent'], axis = 1)

In [218]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 14 entries, ('CLE', 'CIN', 'Logan Allen', 'Graham Ashcraft', Timestamp('2023-08-15 00:00:00')) to ('MIL', 'LAD', 'Adrian Houser', 'Bobby Miller', Timestamp('2023-08-15 00:00:00'))
Data columns (total 45 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Park         14 non-null     int64  
 1   A_BABIP_SP   14 non-null     float64
 2   A_FIP        14 non-null     float64
 3   A_xFIP       14 non-null     float64
 4   A_WAR        14 non-null     float64
 5   A_WHIP       14 non-null     float64
 6   A_SIERA      14 non-null     float64
 7   A_RS/9       14 non-null     float64
 8   A_Avg_Outs   14 non-null     float64
 9   A_BABIP_BP   14 non-null     float64
 10  A_FIP_BP     14 non-null     float64
 11  A_xFIP_BP    14 non-null     float64
 12  A_WHIP_BP    14 non-null     float64
 13  A_SIERA_BP   14 non-null     float64
 14  A_OPS        14 non-null     float64
 15  A_BABIP      14 no

# Feature Engineering
The Major League Baseball season is 162 games, starting in late March and ending in the first week of October. The best teams win close to 100 games, and sometimes even more than 100 games over the course of a season. So a team with 50 wins in June is a good team, but a team with 50 wins in September is a bad team. That's why raw win totals are not an effective feature. So we need to derive differentials, as in how many more wins does a team have over its opponent? We're trying to predict winners of a single game, so we need to find how much better a team is than its opponent.<br>

We're going to engineer these differentials for all features from the perspective of the home team so our model can predict the answer to the question: Will the home team win? (1 = home win, 0 = road win). For example, Py_Diff is how many more Pythagorean Wins the home team has, and if the home team has fewer Pythagorean Wins the number will be negative. ISO_Diff indicates how many more points of isolated power the home team has over the away team. 

In [219]:
main_df['A_Win_Diff'] = main_df['A_Team_Wins'] - main_df['H_Team_Wins']

In [220]:
main_df['H_Win_Diff'] = main_df['H_Team_Wins'] - main_df['A_Team_Wins']

In [221]:
main_df['A_Py_Diff'] = main_df['A_Py_Wins'] - main_df['H_Py_Wins']

In [222]:
main_df['H_Py_Diff'] = main_df['H_Py_Wins'] - main_df['A_Py_Wins']

In [223]:
main_df = main_df.drop(columns = ['A_Team_Wins', 'H_Team_Wins', 'A_Py_Wins', 'H_Py_Wins'])

In [224]:
main_df['A_Py_Diff'].sum(), main_df['H_Py_Diff'].sum()

(-10.0, 10.0)

In [225]:
main_df['A_Win_Diff'].sum(), main_df['H_Win_Diff'].sum()

(14, -14)

In [226]:
main_df = main_df.drop(columns = ['A_Win_Diff', 'A_Py_Diff'])

In [227]:
main_df = main_df.rename({'H_Win_Diff': 'Win_Diff', 'H_Py_Diff': 'Py_Diff'})

In [228]:
main_df['Run_Diff'] = main_df['H_Run_Diff'] - main_df['A_Run_Diff'] 

In [229]:
main_df = main_df.drop(columns = ['H_Run_Diff', 'A_Run_Diff'])

In [230]:
main_df['W_L10_Diff'] = main_df['H_Wins_L10'] - main_df['A_Wins_L10']
main_df['W_L30_Diff'] = main_df['H_Wins_L30'] - main_df['A_Wins_L30']

In [231]:
main_df = main_df.drop(columns = ['H_Wins_L10', 'A_Wins_L10', 'H_Wins_L30', 'A_Wins_L30'])

# BABIP
BABIP is the only feature we have for starting pitchers, team hitting and team bullpens. Most of our features indicate one starting pitcher's WAR, for example, against the other starting pitcher's WAR. But pitchers don't oppose each other directly.<br>

These BABIP features provide us with our only opportunity to derive "matchup" data that compares hitters to pitchers.
So these features indicate:<br>

Home team's hitting BABIP compared to the BABIP against the away team's starting pitchers<br>
Home team's hitting BABIP compared to the BABIP against the away team's bullpen<br>
Home team's starting pitcher's BABIP against the away team's hitters' BABIP<br>
Home team's bullpen BABIP against the away team's hitting BABIP<br>

In [232]:
main_df['BABIP_vs_SP'] = main_df['H_BABIP'] - main_df['A_BABIP_SP']
main_df['BABIP_vs_BP'] = main_df['H_BABIP'] - main_df['A_BABIP_BP']
main_df['BABIP_SP_vs_Hit'] = main_df['H_BABIP_SP'] - main_df['A_BABIP']
main_df['BABIP_BP_vs_Hit'] = main_df['H_BABIP_BP'] - main_df['A_BABIP']

In [233]:
main_df = main_df.drop(columns = ['A_BABIP', 'A_BABIP_SP', 'A_BABIP_BP', 'H_BABIP', 'H_BABIP_SP', 'H_BABIP_BP'])

In [234]:
main_df['Avg_Outs_Diff'] = main_df['H_Avg_Outs'] - main_df['A_Avg_Outs']
main_df['FIP_Diff'] = main_df['H_FIP'] - main_df['A_FIP']
main_df['FIP_BP_Diff'] = main_df['H_FIP_BP'] - main_df['A_FIP_BP']
main_df['ISO_Diff'] = main_df['H_ISO'] - main_df['A_ISO']
main_df['OPS_Diff'] = main_df['H_OPS'] - main_df['A_OPS']
main_df['RS/9_Diff'] = main_df['H_RS/9'] - main_df['A_RS/9']
main_df['SIERA_Diff'] = main_df['H_SIERA'] - main_df['A_SIERA']
main_df['SIERA_BP_Diff'] = main_df['H_SIERA_BP'] - main_df['A_SIERA_BP']
main_df['WAR_Diff'] = main_df['H_WAR'] - main_df['A_WAR']
main_df['WHIP_Diff'] = main_df['H_WHIP'] - main_df['A_WHIP']
main_df['WHIP_BP_Diff'] = main_df['H_WHIP_BP'] - main_df['A_WHIP_BP']
main_df['wOBA_Diff'] = main_df['H_wOBA'] - main_df['A_wOBA']
main_df['xFIP_Diff'] = main_df['H_xFIP'] - main_df['A_xFIP']
main_df['xFIP_BP_Diff'] = main_df['H_xFIP_BP'] - main_df['A_xFIP_BP']

In [235]:
main_df = main_df.drop(columns = ['A_Avg_Outs', 'H_Avg_Outs', 'A_FIP', 'H_FIP', 'A_FIP_BP', 'H_FIP_BP', 'A_ISO', 'H_ISO',\
                                 'A_OPS', 'H_OPS', 'A_RS/9', 'H_RS/9', 'A_SIERA', 'H_SIERA', 'A_SIERA_BP', 'H_SIERA_BP',\
                                 'A_WAR', 'H_WAR', 'A_WHIP', 'H_WHIP', 'A_WHIP_BP', 'H_WHIP_BP', 'A_wOBA', 'H_wOBA',\
                                 'A_xFIP', 'H_xFIP', 'A_xFIP_BP', 'H_xFIP_BP'])

In [236]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 14 entries, ('CLE', 'CIN', 'Logan Allen', 'Graham Ashcraft', Timestamp('2023-08-15 00:00:00')) to ('MIL', 'LAD', 'Adrian Houser', 'Bobby Miller', Timestamp('2023-08-15 00:00:00'))
Data columns (total 24 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Park             14 non-null     int64  
 1   H_Win_Diff       14 non-null     int64  
 2   H_Py_Diff        14 non-null     float64
 3   Run_Diff         14 non-null     int64  
 4   W_L10_Diff       14 non-null     int64  
 5   W_L30_Diff       14 non-null     int64  
 6   BABIP_vs_SP      14 non-null     float64
 7   BABIP_vs_BP      14 non-null     float64
 8   BABIP_SP_vs_Hit  14 non-null     float64
 9   BABIP_BP_vs_Hit  14 non-null     float64
 10  Avg_Outs_Diff    14 non-null     float64
 11  FIP_Diff         14 non-null     float64
 12  FIP_BP_Diff      14 non-null     float64
 13  ISO_Diff         14 non-null     float64
 

In [237]:
main_df_cols = list(main_df.columns)

In [238]:
main_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Park,H_Win_Diff,H_Py_Diff,Run_Diff,W_L10_Diff,W_L30_Diff,BABIP_vs_SP,BABIP_vs_BP,BABIP_SP_vs_Hit,BABIP_BP_vs_Hit,...,OPS_Diff,RS/9_Diff,SIERA_Diff,SIERA_BP_Diff,WAR_Diff,WHIP_Diff,WHIP_BP_Diff,wOBA_Diff,xFIP_Diff,xFIP_BP_Diff
Away,Home,A_Starter,H_Starter,Date,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1
CLE,CIN,Logan Allen,Graham Ashcraft,2023-08-15,107,5,-2.0,-22,-1,0,-0.017,-0.043,0.005,-0.008,...,0.019,0.13,0.47,0.29,-0.3,0.04,-0.1,0.005,0.43,0.66
HOU,MIA,Cristian Javier,Johnny Cueto,2023-08-15,97,-5,-12.0,-118,-1,-6,0.035,0.043,-0.106,0.012,...,-0.089,-4.37,-0.24,-0.89,-1.5,-0.22,-0.09,-0.041,-0.35,-0.87
BOS,WSN,Nick Pivetta,Josiah Gray,2023-08-15,102,-9,-11.0,-118,2,1,0.035,-0.044,0.002,0.023,...,0.003,-0.06,1.4,0.24,1.0,0.27,-0.03,0.006,1.09,0.31
PHI,TOR,Zack Wheeler,Yusei Kikuchi,2023-08-15,100,1,3.0,26,0,0,0.006,0.043,-0.011,-0.024,...,0.003,-0.22,0.55,-0.11,-2.9,0.12,-0.07,0.002,0.51,-0.14
PIT,NYM,Bailey Falter,David Peterson,2023-08-15,96,1,3.0,38,0,-1,-0.082,-0.04,0.101,0.034,...,0.003,1.45,-0.87,0.82,0.0,0.06,0.26,0.004,-1.12,0.82
NYY,ATL,Luis Severino,Bryce Elder,2023-08-15,101,16,18.0,208,3,4,-0.038,0.061,-0.013,0.003,...,0.185,1.32,-0.43,0.15,2.2,-0.64,0.1,0.065,-0.81,0.03
DET,MIN,Alex Faedo,Bailey Ober,2023-08-15,101,9,15.0,139,1,3,0.094,0.006,-0.007,0.002,...,0.109,-1.56,-0.41,-0.15,1.9,0.04,-0.03,0.044,-0.12,-0.07
LAA,TEX,Lucas Giolito,Jordan Montgomery,2023-08-15,101,12,19.0,209,5,5,0.03,-0.022,0.009,-0.032,...,0.063,-0.23,0.08,-0.65,1.3,-0.01,-0.51,0.031,-0.33,-0.49
CHW,CHC,Touki Toussaint,Kyle Hendricks,2023-08-15,102,14,15.0,159,2,10,0.077,0.023,-0.034,-0.056,...,0.179,0.63,-0.5,0.13,1.0,-0.28,-0.15,0.072,-0.14,-0.03
SEA,KCR,Emerson Hancock,Jordan Lyles,2023-08-15,103,-24,-20.0,-216,-3,-5,0.167,0.026,-0.047,-0.021,...,0.027,2.69,-0.63,1.57,0.4,0.27,0.3,0.007,0.11,1.78


In [239]:
filepath = r'C:\Users\Owner\Sports Betting\MLB_Game_Outcome\Live_Game_Features_' + today_str + '.csv'
main_df.to_csv(filepath)

In [240]:
import os

csv_file_paths = ['Live_Game_Features_' + yesterday_str + '.csv', '2023_Win_Features_' + yesterday_str + '.csv']

for file in csv_file_paths:
    try:
        os.remove(file)
        print(f"CSV file '{file}' deleted successfully.")
    except FileNotFoundError:
        print(f"CSV file '{file}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

CSV file 'Live_Game_Features_2023-08-14.csv' deleted successfully.
CSV file '2023_Win_Features_2023-08-14.csv' deleted successfully.
