# MLB Game Outcome Model - Live Data Wrangling
**This is the second notebook that we run when we use our MLB Game Outcome model.**<br>

The goal of this notebook is to save a dataframe with all the features needed to make predictions for the winner of that day's Major League Baseball games.<br>

This notebook will:<br>

+ Scrape RotoGrinders to identify the starting pitchers for today's MLB games.<br>
+ Scrape FanGraphs to collect data for our model's features relating to each starting pitcher.<br>
+ Merge those two dataframes so we have the name, team, opponent, ballpark, date and all of the features for each starting pitcher.<br>
+ Derive our Avg_Outs feature, the mean number of outs each pitcher records per start.<br>
+ Scrape FanGraphs for team bullpen and hitting features over the last 30 days and merge with our main dataframe.<br>
+ Temporarily create separate home and away dataframes so we can add 'A_' and 'H_' suffixes to each feature, which will indicate home or away team.<br>
+ Read in the '2023_Game_Scores' CSV, which includes every score this season, and use it to derive total wins, wins in last 10 games, last 30 games, Pythagorean Wins and Run Differential.<br>
+ Add the Park Factors feature, which measures how favorable the ballpark is for hitters.<br>
+ Engineer all the features from the perspective of the home team, so that the model answers the question: will the home team win the game? (1 is a home win, 0 is an away win.) For example, Win_Diff is how many more wins the home team has than the visiting team, and Run_Diff is the home team's run differential minus the away team's run differential. A positive Run_Diff means the home team's run differential is better than the visiting team's.

<div style="background-color: yellow; padding: 10px;">
    

# Temporary workaround
Pending a Selenium webdriver update, we currently have to go to RotoGrinders ourselves, copy the MLB First Look url, and set the first_look_url variable to that value. The Selenium scraping code is commented out for the time being.
    
</div>

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import datetime as dt
from datetime import timedelta
import time
import papermill as pm

In [2]:
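def clean_unnamed(df):
    """Drop the 'Unnamed: 0' column if present; always return a dataframe."""
    if 'Unnamed: 0' in df.columns:
        return df.drop(columns = ['Unnamed: 0'])
    print("Dataframe does not have 'Unnamed: 0' column.")
    #Returning the original dataframe so the caller never receives None
    return df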

In [3]:
today = dt.date.today()

In [4]:
today_str = str(today)

In [5]:
yesterday = dt.date.today() - timedelta(days = 1)

In [6]:
yesterday_str = str(yesterday)

# Scraping RotoGrinders data using Selenium
We take today's starting pitcher data from the RotoGrinders MLB First Look page. Since the page we need isn't on the home page of that site, we use Selenium to get to that page.

In [7]:
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC

In [8]:
# driver = webdriver.Chrome()
# driver.get("https://rotogrinders.com/daily-fantasy-baseball")

In [9]:
# link_element = WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "MLB DFS First Look")))


# # Click on the link
# driver.execute_script("arguments[0].click();", link_element)

In [10]:
# first_look_url = driver.current_url

In [11]:
first_look_url = 'https://rotogrinders.com/articles/mlb-dfs-first-look-saturday-august-19th-sortable-odds-salaries-stats-3924697'

# Now Beautiful Soup
Now that we have the url we need, we can use Beautiful Soup for the rest of the scraping. We can extract the tables from the soup object. We only need the second of those tables. We run three functions to create the dataframe.

In [12]:
#Scraping RotoGrinders MLB First Look tables
rg_string = first_look_url
rg_page = requests.get(rg_string)
rg_html_doc = rg_page.text
rg_soup_obj = BeautifulSoup(rg_html_doc)

In [13]:
tables = rg_soup_obj.find_all('table')

In [14]:
#There are 4 tables on the page we scraped
len(tables)

4

In [15]:
#List of strings that we will use to name our CSVs
var_name_str = ['odds', 'pitcher', 'offense', 'hitter']

In [16]:
# today = dt.date.today()

In [17]:
#Turning today's date into string and replacing hyphen with underscore
today_str_u = str(today).replace('-', '_')

In [18]:
#Extracting all table header tags
def get_headers(table):
    """
    Extracting all table header tags
    """
    headers = table.find_all('th')
    cols = []
    for header in headers:
        cols.append(header.get_text().strip('\t'))
    return cols

In [19]:
#Extracting all table data tags
def get_content(table):
    """
    Extracting all table data tags
    """
    content = table.find_all('td')
    table_content = []
    for item in content:
        table_content.append(item.get_text().strip(' '))
    table_content = [content.strip('\t') for content in table_content]
    return table_content

In [20]:
#Turning table data into rows, using number of table headers to divide and get the right number of rows
def get_rows(cols, content):
    """
    Turning table data into rows, using number of table headers to divide and get the right number of rows
    """
    num_rows = len(content)/len(cols)
    content = iter(content)
    rows = []
    for i in range(int(num_rows)):
        new_row = []
        for j in range(len(cols)):
            new_row.append(next(content))
        rows.append(new_row)
    return rows
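
The row-splitting step above depends on the number of scraped cells being an exact multiple of the number of headers. A minimal sketch of the same idea, with a guard for ragged input, might look like this. `chunk_rows` is a hypothetical helper, not part of the notebook, and the sample values are illustrative.

```python
def chunk_rows(cols, cells):
    """Split a flat list of cell values into rows of len(cols) items each."""
    width = len(cols)
    if width == 0 or len(cells) % width != 0:
        # A ragged table usually means the page layout changed.
        raise ValueError(f"{len(cells)} cells cannot be split into rows of {width}")
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# Illustrative values only
example_cols = ['Pitcher', 'Team', 'Opponent']
example_cells = ['Gerrit Cole', 'NYY', 'vs BOS', 'Kutter Crawford', 'BOS', 'at NYY']
example_rows = chunk_rows(example_cols, example_cells)
```

Raising on a ragged table is a design choice: it fails loudly instead of silently truncating the last row, which is what integer division would do.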

In [21]:
headers = get_headers(tables[1])
content = get_content(tables[1])
rows = get_rows(headers, content)
todays_SPs = pd.DataFrame(rows, columns = headers)
# todays_SPs
# filepath = r'C:\Users\Owner\FantasySports\MLB_DFS_2023\{}_{}.csv'.format(var_name_str[i], today_str)
# temp_df.to_csv(filepath)

# Park column
We use the Opponent column to derive a column that indicates the ballpark in which the game is being played.

In [22]:
todays_SPs['Park'] = np.where(todays_SPs.Opponent.str[:2] == 'at',\
                                       todays_SPs.Opponent.str[-3:], todays_SPs['Team'])
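
As a small illustration of the rule above, the sketch below applies the same `np.where` logic to a two-row dataframe. The sample rows are illustrative, and the approach assumes every team abbreviation is exactly three characters.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'Team': ['BOS', 'NYY'],
    'Opponent': ['at NYY', 'vs BOS'],
})

# 'at XXX' means the pitcher is on the road, so the park is the opponent's;
# otherwise the pitcher is at home and the park is his own team's.
is_away = sample['Opponent'].str.startswith('at ')
sample['Park'] = np.where(is_away, sample['Opponent'].str[-3:], sample['Team'])
```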

In [23]:
todays_SPs = todays_SPs.rename(columns = {'Pitcher': 'Name'})

In [24]:
# driver.quit()

In [25]:
pitch_url = 'https://www.fangraphs.com/leaders-legacy.aspx?pos=all&stats=pit&lg=all&qual=0&type=c,8,43,45,62,59,42,110,122,123,113,55&season=2023&month=1000&season1=2023&ind=0&team=0&rost=0&age=0&filter=&players=0&startdate=2023-03-30&enddate='
#pitch_url = 'https://www.fangraphs.com/leaders/major-league?pos=all&stats=pit&lg=all&qual=0&type=c%2c8%2c43%2c45%2c62%2c59%2c42%2c110%2c122%2c123%2c113%2c55&season=2023&month=1000&season1=2023&ind=0&team=0&rost=0&age=0&filter=&players=0&startdate=2023-03-30&enddate='

# Starting pitching features
The following code will scrape the following features from FanGraphs for each day's starting pitcher:

**FIP:** Fielding Independent Pitching, similar to ERA except only home runs, unintentional walks, strikeouts and hit batters are taken into account. It removes from the equation all instances of the ball being put in play.<br>
**xFIP:** Expected FIP. The pitcher's actual home runs allowed are replaced in the FIP equation with an estimate: his fly balls allowed multiplied by the league-average rate of home runs per fly ball. It penalizes pitchers a little more for allowing fly balls.<br>
**BABIP:** Batting average on balls in play. If a pitcher has a low BABIP against him, it suggests that when batters do make contact, it's not strong contact.<br>
**WAR:** Wins Above Replacement. Measures a player's value in wins over a replacement-level player at his position.<br>
**WHIP:** Walks and hits per inning pitched.<br>
**Contact%:** Total pitches where contact was made / Total swings<br>
**SIERA:** Skill-interactive ERA. Scale similar to ERA. Unlike FIP and xFIP, batted balls are taken into account.<br>
**RS/9:** Run support per nine innings. How many runs does a pitcher's team score for him while he's in the game?<br>
**SwStr%:** Swings and misses divided by total pitches thrown by a pitcher.<br>
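
To make two of the definitions above concrete, here is a rough sketch of how FIP and WHIP are computed from counting stats. It follows the commonly published form of the FIP formula, which uses total walks. The FIP constant changes every season, so the default of 3.10 below is a placeholder assumption, not the 2023 value. Innings must be true innings (for example 5 + 1/3), not the 5.1 notation FanGraphs displays.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching from counting stats.

    `constant` is a placeholder; the real value varies by season.
    `ip` must be true innings (e.g. 5 + 1/3), not '5.1' notation.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def whip(bb, hits, ip):
    """Walks plus hits per inning pitched."""
    return (bb + hits) / ip


# Made-up season line, for illustration only
example_fip = round(fip(hr=15, bb=40, hbp=5, k=180, ip=150.0), 2)
example_whip = round(whip(bb=40, hits=120, ip=150.0), 2)
```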



In [26]:
url_string_page1 = pitch_url + yesterday_str
r_page1 = requests.get(url_string_page1)
html_doc_page1 = r_page1.text
soup_obj_page1 = BeautifulSoup(html_doc_page1)
print(soup_obj_page1.prettify())
#Finding number of pages to scrape since there are only 30 rows per page
num_pages = int(soup_obj_page1.find_all('strong')[1].get_text()) + 1
#We fill in the column names with the th tags of the rgHeader class
col_names = []
headers = soup_obj_page1.find_all('th', class_ = 'rgHeader')
for header in headers:
    col_names.append(header.get_text())
all_data = []
data = soup_obj_page1.find_all('td', class_ = 'grid_line_regular')
for item in data:
    all_data.append(item.get_text())
if num_pages > 2:
    for j in range(2, num_pages):
        temp_url_string = pitch_url + yesterday_str + '&page=' + str(j) + '_30'
        temp_r = requests.get(temp_url_string)
        temp_html_doc = temp_r.text
        temp_soup_obj = BeautifulSoup(temp_html_doc)
        temp_data = temp_soup_obj.find_all('td', class_ = 'grid_line_regular')
        for entry in temp_data:
            all_data.append(entry.get_text())
#Turning the list of data into an iterator before dividing it into rows. Determining number of rows by dividing
#the length of the data list by the number of column names
data_iter = iter(all_data)
num_rows = int(len(all_data)/len(col_names))
data_lists = []
for k in range(num_rows):
    temp_list = []
    for l in range(len(headers)):
        temp_list.append(next(data_iter))
    data_lists.append(temp_list)
SP_live_df = pd.DataFrame(data_lists, columns = col_names)
#Adding one day to the date so that the data accounts for every day through the previous day.
#For example, if we're predicting a game on June 30, the data goes through June 29
date_plus_1 = today_str
date_plus_1 = pd.to_datetime(date_plus_1, format = '%Y-%m-%d')
SP_live_df['Date'] = date_plus_1
#list_of_dfs.append(SP_live_df)
#print(f"{current_date_str} done {date_plus_1}")
#Sleeping loop for 30 seconds so we don't get caught for scraping too frequently.
#time.sleep(30)



In [27]:
# soup_obj_page1.find_all('span')

In [28]:
SP_live_df.head()

Unnamed: 0,#,Name,Team,GS,BABIP,FIP,xFIP,WAR,WHIP,Contact%,SIERA,RS/9,SwStr%,Start-IP,Date
0,1,Josh Donaldson,NYY,0,0.0,3.26,6.54,0.0,0.0,100.0%,7.55,27.0,0.0%,,2023-08-19
1,2,Brandon Crawford,SFG,0,0.25,6.26,11.18,0.0,2.0,87.5%,10.32,0.0,5.0%,,2023-08-19
2,3,Eduardo Escobar,LAA,0,0.2,3.26,4.9,0.0,1.0,100.0%,4.53,0.0,0.0%,,2023-08-19
3,4,Daniel Hudson,LAD,0,0.333,2.93,4.57,0.0,1.67,67.7%,4.57,0.0,15.6%,,2023-08-19
4,5,Chris Owings,PIT,0,0.333,3.26,8.18,0.0,2.0,100.0%,7.55,9.0,0.0%,,2023-08-19


In [29]:
len(SP_live_df)

803

In [30]:
# SP_live_df[SP_live_df['Name'] == 'Touki Toussaint']

# Starting pitchers
We'll only need the pitcher's name, the pitcher's team, the park in which he's pitching and his team's opponent. We'll need to be on the lookout for games that aren't in regular home parks, like the games in Europe.<br>

We're also going to replace every occurrence of 'WAS' with 'WSN' (Washington Nationals) in the RotoGrinders data so that it matches FanGraphs.<br>

In [31]:
# todays_SPs = pd.read_csv('C:\\Users\Owner\Tableau_Projects\Live MLB DFS\Tableau_Pitching_Salaries.csv')

In [32]:
todays_SPs = todays_SPs[['Name', 'Team', 'Park', 'Opponent']]

In [33]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent
0,Kutter Crawford,BOS,NYY,at NYY
1,Gerrit Cole,NYY,NYY,vs BOS
2,Brady Singer,KCR,CHC,at CHC
3,Justin Steele,CHC,CHC,vs KCR
4,Cristopher Sanchez,PHI,WAS,at WAS
5,Jake Irvin,WAS,WAS,vs PHI
6,Freddy Peralta,MIL,TEX,at TEX
7,Dane Dunning,TEX,TEX,vs MIL
8,Chris Bassitt,TOR,CIN,at CIN
9,Brandon Williamson,CIN,CIN,vs TOR


In [34]:
# todays_SPs.loc[3045, 'Name'] = 'Jake Bird'

In [35]:
len(todays_SPs)

30

In [36]:
#Slicing the 'at ' or 'vs ' off the Opponent string
todays_SPs['Opponent'] = todays_SPs['Opponent'].str[-3:]

In [37]:
todays_SPs['Team'] = todays_SPs['Team'].replace('WAS', 'WSN')
todays_SPs['Park'] = todays_SPs['Park'].replace('WAS', 'WSN')
todays_SPs['Opponent'] = todays_SPs['Opponent'].replace('WAS', 'WSN')
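
The three replacements above can also be expressed as one mapping applied to all three columns, which makes it easier to add further abbreviation mismatches if any turn up. This is only a sketch: `ABBREV_FIXES` is a hypothetical name, and 'WAS' to 'WSN' is the only mismatch this notebook identifies.

```python
import pandas as pd

# Hypothetical mapping from RotoGrinders abbreviations to FanGraphs ones.
ABBREV_FIXES = {'WAS': 'WSN'}

sample = pd.DataFrame({
    'Team': ['PHI', 'WAS'],
    'Park': ['WAS', 'WAS'],
    'Opponent': ['WAS', 'PHI'],
})

team_cols = ['Team', 'Park', 'Opponent']
# DataFrame.replace with a dict swaps whole-cell matches only.
sample[team_cols] = sample[team_cols].replace(ABBREV_FIXES)
```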

# Boiling FanGraphs data down to today's SPs
The data we scrape from FanGraphs includes every player who has pitched this season: starters, relievers and position players. That's more than 750 rows (and growing as the season goes along). Here, we match it against the names of today's starting pitchers that we scraped from RotoGrinders, which gives us a chance to detect potential missing values.

In [38]:
todays_starters = SP_live_df[SP_live_df['Name'].isin(list(todays_SPs['Name']))]
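
One way to surface the missing values mentioned above is to list the names that appear in the RotoGrinders table but not in the FanGraphs table, which often points to a spelling difference between the two sites. A minimal sketch with made-up frames follows. `find_unmatched` is a hypothetical helper, and the accent mismatch is invented for illustration.

```python
import pandas as pd

def find_unmatched(starters, stats, key='Name'):
    """Return values of `key` in `starters` that never appear in `stats`."""
    mask = ~starters[key].isin(stats[key])
    return sorted(starters.loc[mask, key].unique())


# Invented example: the same pitcher spelled with and without an accent
rg = pd.DataFrame({'Name': ['Gerrit Cole', 'Cristopher Sanchez']})
fg = pd.DataFrame({'Name': ['Gerrit Cole', 'Cristopher Sánchez']})
unmatched = find_unmatched(rg, fg)
```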

In [39]:
todays_starters

Unnamed: 0,#,Name,Team,GS,BABIP,FIP,xFIP,WAR,WHIP,Contact%,SIERA,RS/9,SwStr%,Start-IP,Date
136,137,Gerrit Cole,NYY,25,0.272,3.33,3.68,3.5,1.05,76.3%,3.74,4.78,11.7%,156.1,2023-08-19
138,139,Justin Steele,CHC,22,0.306,3.2,3.68,3.2,1.17,78.0%,3.86,6.64,11.5%,126.0,2023-08-19
148,149,Tanner Bibee,CLE,19,0.292,3.61,4.35,2.3,1.21,76.1%,4.25,5.3,11.1%,108.2,2023-08-19
173,174,Sonny Gray,MIN,24,0.306,2.79,3.61,4.0,1.22,75.1%,3.98,3.83,11.5%,136.1,2023-08-19
176,177,Merrill Kelly,ARI,21,0.272,3.91,3.76,2.2,1.16,74.6%,4.01,4.65,12.4%,124.0,2023-08-19
183,184,Dane Dunning,TEX,18,0.272,3.93,4.3,1.9,1.14,80.7%,4.44,5.22,9.1%,107.1,2023-08-19
197,198,Jesse Scholtens,CHW,5,0.279,4.36,4.9,0.5,1.24,80.4%,4.75,3.67,9.2%,27.0,2023-08-19
210,211,Logan Webb,SFG,25,0.3,3.24,2.95,3.5,1.07,80.8%,3.15,3.15,9.0%,163.0,2023-08-19
217,218,Kodai Senga,NYM,22,0.292,3.54,3.76,2.5,1.29,71.4%,4.05,4.7,12.6%,122.2,2023-08-19
220,221,Framber Valdez,HOU,23,0.28,3.43,3.29,3.4,1.08,75.5%,3.57,4.21,11.3%,149.2,2023-08-19


In [40]:
len(todays_starters)

30

# Fixing aggregate teams

We'll write the fix_agg_team function to address situations in which a pitcher has pitched for more than one team. In those cases, his team will be indicated as '- - -', which won't match our RotoGrinders data. If there are aggregate teams, we'll be prompted to input each affected pitcher's current team.

In [41]:
def fix_agg_team(convert_dict):
    """
    takes a dictionary with the pitcher's name as a key and his current team as a value and adjusts the dataframe.
    """
    for k, v in convert_dict.items():
        SP_live_df.loc[SP_live_df['Name'] == k, 'Team'] = v

In [42]:
if '- - -' in list(todays_starters['Team']):
    fix_agg_dict = {}
    num_composite = todays_starters['Team'].value_counts().loc['- - -']
    print(f"There are {num_composite} pitchers with aggregate teams")
    temp_df = todays_starters[todays_starters['Team'] == '- - -']
    for i in range(len(temp_df)):
        comp_pitch = temp_df.iloc[i, :]['Name']#input("Enter pitcher's name: ")
        comp_team = input(f"Enter {comp_pitch}'s current team: ")
        fix_agg_dict.update({comp_pitch: comp_team})
    fix_agg_team(fix_agg_dict)
        

There are 1 pitchers with aggregate teams
Enter Yonny Chirinos's current team: ATL
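
Because `fix_agg_team` edits the global `SP_live_df` and the cell relies on `input()`, it cannot run unattended. A non-interactive variant could take the dataframe as an argument and only touch rows still marked '- - -'. This is a sketch under that assumption, and `apply_team_overrides` is a hypothetical name.

```python
import pandas as pd

def apply_team_overrides(df, overrides, placeholder='- - -'):
    """Return a copy of df with placeholder teams replaced from overrides."""
    out = df.copy()
    for name, team in overrides.items():
        # Only rows that still carry the aggregate-team placeholder are changed.
        mask = (out['Name'] == name) & (out['Team'] == placeholder)
        out.loc[mask, 'Team'] = team
    return out


sample = pd.DataFrame({
    'Name': ['Yonny Chirinos', 'Gerrit Cole'],
    'Team': ['- - -', 'NYY'],
})
fixed = apply_team_overrides(sample, {'Yonny Chirinos': 'ATL'})
```

Returning a copy keeps the scraped data intact, so the override step can be rerun without scraping again.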


In [43]:
# SP_live_df.loc[SP_live_df['Name'] == 'Lance Lynn', 'Team'] = 'LAD'
# SP_live_df.loc[SP_live_df['Name'] == 'Erasmo Ramirez', 'Team'] = 'TBR'
# SP_live_df.loc[SP_live_df['Name'] == 'Rich Hill', 'Team'] = 'SDP'
# SP_live_df.loc[SP_live_df['Name'] == 'Bailey Falter', 'Team'] = 'PIT'
# SP_live_df.loc[SP_live_df['Name'] == 'Ryan Yarbrough', 'Team'] = 'LAD'

In [44]:
# SP_live_df.loc[SP_live_df['Name'] == 'Touki Toussiaint', 'Team'] = 'CHW'
# SP_live_df.loc[SP_live_df['Name'] == 'Lucas Giolito', 'Team'] = 'LAA'
# SP_live_df.loc[SP_live_df['Name'] == 'Yonny Chirinos', 'Team'] = 'ATL'

In [45]:
# SP_live_df.loc[SP_live_df['Name'] == 'Zack Thompson', 'GS'] = 1
# SP_live_df.loc[SP_live_df['Name'] == 'Zack Thompson', 'Start-IP'] = 5
# SP_live_df.loc[SP_live_df['Name'] == 'Erasmo Ramirez', 'GS'] = 2
# SP_live_df.loc[SP_live_df['Name'] == 'Erasmo Ramirez', 'Start-IP'] = 6

In [46]:
todays_SPs = pd.merge(todays_SPs, SP_live_df, on = ['Name', 'Team'], how = 'left')
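
A left merge silently leaves NaN in the stat columns when a name and team pair has no match. Passing `indicator=True` to `pd.merge` adds a `_merge` column that makes those rows easy to find. The frames below are illustrative.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Gerrit Cole', 'Jake Irvin'],
                     'Team': ['NYY', 'WSN']})
right = pd.DataFrame({'Name': ['Gerrit Cole'],
                      'Team': ['NYY'],
                      'FIP': [3.33]})

merged = pd.merge(left, right, on=['Name', 'Team'], how='left', indicator=True)
# Rows tagged 'left_only' had no FanGraphs match and will carry NaN stats.
no_match = merged.loc[merged['_merge'] == 'left_only', 'Name'].tolist()
```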

In [47]:
todays_SPs = todays_SPs.drop(['#', 'Contact%', 'SwStr%'], axis = 1)

In [48]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,GS,BABIP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Start-IP,Date
0,Kutter Crawford,BOS,NYY,NYY,15,0.276,4.17,4.35,1.2,1.13,3.95,4.8,68.1,2023-08-19
1,Gerrit Cole,NYY,NYY,BOS,25,0.272,3.33,3.68,3.5,1.05,3.74,4.78,156.1,2023-08-19
2,Brady Singer,KCR,CHC,CHC,24,0.309,3.88,4.19,2.1,1.32,4.36,5.64,135.2,2023-08-19
3,Justin Steele,CHC,CHC,KCR,22,0.306,3.2,3.68,3.2,1.17,3.86,6.64,126.0,2023-08-19
4,Cristopher Sanchez,PHI,WSN,WSN,11,0.237,4.67,3.54,0.6,1.01,3.63,4.63,58.1,2023-08-19
5,Jake Irvin,WSN,WSN,PHI,18,0.283,5.46,5.21,0.5,1.41,4.97,4.37,90.2,2023-08-19
6,Freddy Peralta,MIL,TEX,TEX,23,0.28,4.01,3.75,2.1,1.19,3.71,5.06,128.0,2023-08-19
7,Dane Dunning,TEX,TEX,MIL,18,0.272,3.93,4.3,1.9,1.14,4.44,5.22,107.1,2023-08-19
8,Chris Bassitt,TOR,CIN,CIN,25,0.277,4.56,4.4,1.5,1.24,4.35,5.07,145.2,2023-08-19
9,Brandon Williamson,CIN,CIN,TOR,16,0.252,4.81,4.78,1.0,1.22,4.77,5.22,81.0,2023-08-19


In [49]:
any_missing = todays_SPs.isna().any().any()

In [50]:
any_missing

False

In [51]:
missing_rows = todays_SPs[todays_SPs.isna().any(axis=1)]

In [52]:
missing_rows

Unnamed: 0,Name,Team,Park,Opponent,GS,BABIP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Start-IP,Date


In [53]:
missing_rows_opp = list(missing_rows['Opponent'])

# Last call for missing data
There are two other likely reasons we would still have missing data. The first is a pitcher making his first start of the season after pitching only in relief. In that case we can look at Baseball Reference and fill in the GS and Start-IP columns based on the previous season or his entire career. As a last resort, if he has never started before, we can use his total appearances and total IP.<br>

There also might be cases where we don't have FanGraphs data on a pitcher because he's a rookie and hasn't pitched before. In those cases, we don't necessarily know if this pitcher is a hot prospect or some scrub being called up from the minors to make a spot start for an overworked pitching staff. We just won't try to predict games in these situations because we don't know enough about the starting pitcher.<br>

The create_start_data function is called when a pitcher has no GS and Start-IP values. A separate conditional statement drops any row with missing data along with the opposing pitcher's row, so that the entire game is removed from our dataframe.

In [54]:
def create_start_data(name, games, IP):
    """
    Fills in data for pitchers who haven't started this year but have all the other FanGraphs data
    """
    todays_SPs.loc[todays_SPs['Name'] == name, 'GS'] = games
    todays_SPs.loc[todays_SPs['Name'] == name, 'Start-IP'] = IP

In [55]:
# def drop_game(team1, team2, df):
#     """
#     Drops games from the dataframe
#     """
#     df = df.drop(df[(df['Team'] == team1) | (df['Team'] == team2)].index)
#     return df

In [56]:
if '0' in list(todays_SPs['GS']):
    num_zero = todays_SPs['GS'].value_counts().loc['0']
    print(f"There are {num_zero} pitchers with no starts.")
    temp_df = todays_SPs[todays_SPs['GS'] == '0']
    for i in range(len(temp_df)):
        comp_pitch = temp_df.iloc[i, :]['Name']#input("Enter pitcher's name: ")
        gs_val = int(input(f"Enter a GS value for {comp_pitch}: "))
        ip_val = input(f"Enter a Start-IP value for {comp_pitch}: ") 
        create_start_data(comp_pitch, gs_val, ip_val)

In [57]:
if any_missing:
    print("Dropping games in which pitchers have missing FanGraphs data.")
    todays_SPs = todays_SPs.dropna()
    for opp in missing_rows_opp:
        if opp in list(todays_SPs['Team']):
            todays_SPs = todays_SPs[todays_SPs['Team'] != opp]
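
The drop logic above can be written as a small function that removes both sides of any game where either starter has missing data. This is a sketch that assumes `Team` and `Opponent` already hold bare three-letter abbreviations. `drop_incomplete_games` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def drop_incomplete_games(df):
    """Drop every game in which at least one starter has a missing value."""
    bad = df[df.isna().any(axis=1)]
    # Teams involved in an incomplete game: the pitcher's team and its opponent.
    teams = set(bad['Team']) | set(bad['Opponent'])
    return df[~df['Team'].isin(teams)].reset_index(drop=True)


sample = pd.DataFrame({
    'Team': ['BOS', 'NYY', 'KCR', 'CHC'],
    'Opponent': ['NYY', 'BOS', 'CHC', 'KCR'],
    'FIP': [4.17, 3.33, np.nan, 3.20],
})
complete = drop_incomplete_games(sample)
```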

In case we need to fill in starting data or drop a game, we can uncomment the function calls below.

In [58]:
#create_start_data('Joe Mantiply', 18, 20)

In [59]:
# todays_SPs = drop_game('OAK', 'STL', todays_SPs)

In [60]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,GS,BABIP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Start-IP,Date
0,Kutter Crawford,BOS,NYY,NYY,15,0.276,4.17,4.35,1.2,1.13,3.95,4.8,68.1,2023-08-19
1,Gerrit Cole,NYY,NYY,BOS,25,0.272,3.33,3.68,3.5,1.05,3.74,4.78,156.1,2023-08-19
2,Brady Singer,KCR,CHC,CHC,24,0.309,3.88,4.19,2.1,1.32,4.36,5.64,135.2,2023-08-19
3,Justin Steele,CHC,CHC,KCR,22,0.306,3.2,3.68,3.2,1.17,3.86,6.64,126.0,2023-08-19
4,Cristopher Sanchez,PHI,WSN,WSN,11,0.237,4.67,3.54,0.6,1.01,3.63,4.63,58.1,2023-08-19
5,Jake Irvin,WSN,WSN,PHI,18,0.283,5.46,5.21,0.5,1.41,4.97,4.37,90.2,2023-08-19
6,Freddy Peralta,MIL,TEX,TEX,23,0.28,4.01,3.75,2.1,1.19,3.71,5.06,128.0,2023-08-19
7,Dane Dunning,TEX,TEX,MIL,18,0.272,3.93,4.3,1.9,1.14,4.44,5.22,107.1,2023-08-19
8,Chris Bassitt,TOR,CIN,CIN,25,0.277,4.56,4.4,1.5,1.24,4.35,5.07,145.2,2023-08-19
9,Brandon Williamson,CIN,CIN,TOR,16,0.252,4.81,4.78,1.0,1.22,4.77,5.22,81.0,2023-08-19


In [61]:
# todays_SPs.loc[todays_SPs['Name'] == 'Beau Brieske', 'GS'] = 15 
# todays_SPs.loc[todays_SPs['Name'] == 'Beau Brieske', 'Start-IP'] = 81.2
# todays_SPs.loc[todays_SPs['Name'] == 'Ty Blach', 'GS'] = 6
# todays_SPs.loc[todays_SPs['Name'] == 'Ty Blach', 'Start-IP'] = 23.2

               

Changing columns to numeric data type

In [62]:
todays_SPs_numcols = ['GS', 'BABIP', 'FIP', 'xFIP', 'WAR', 'WHIP', 'SIERA', 'RS/9', 'Start-IP']

In [63]:
todays_SPs[todays_SPs_numcols] = todays_SPs[todays_SPs_numcols].apply(pd.to_numeric, errors = 'coerce')

In [64]:
todays_SPs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Name      30 non-null     object        
 1   Team      30 non-null     object        
 2   Park      30 non-null     object        
 3   Opponent  30 non-null     object        
 4   GS        30 non-null     int64         
 5   BABIP     30 non-null     float64       
 6   FIP       30 non-null     float64       
 7   xFIP      30 non-null     float64       
 8   WAR       30 non-null     float64       
 9   WHIP      30 non-null     float64       
 10  SIERA     30 non-null     float64       
 11  RS/9      30 non-null     float64       
 12  Start-IP  30 non-null     float64       
 13  Date      30 non-null     datetime64[ns]
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 3.5+ KB


In [65]:
#todays_SPs = todays_SPs.fillna(method = 'ffill')

# Avg_Outs
Here we can derive the Avg_Outs variable, the average outs a pitcher records per start. It's an absolute godsend that FanGraphs has a 'Start-IP' column. That saves us from writing a lot of code.

In [66]:
todays_SPs['Part-IP'] = (todays_SPs['Start-IP'] % 1)

In [67]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,GS,BABIP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Start-IP,Date,Part-IP
0,Kutter Crawford,BOS,NYY,NYY,15,0.276,4.17,4.35,1.2,1.13,3.95,4.8,68.1,2023-08-19,0.1
1,Gerrit Cole,NYY,NYY,BOS,25,0.272,3.33,3.68,3.5,1.05,3.74,4.78,156.1,2023-08-19,0.1
2,Brady Singer,KCR,CHC,CHC,24,0.309,3.88,4.19,2.1,1.32,4.36,5.64,135.2,2023-08-19,0.2
3,Justin Steele,CHC,CHC,KCR,22,0.306,3.2,3.68,3.2,1.17,3.86,6.64,126.0,2023-08-19,0.0
4,Cristopher Sanchez,PHI,WSN,WSN,11,0.237,4.67,3.54,0.6,1.01,3.63,4.63,58.1,2023-08-19,0.1
5,Jake Irvin,WSN,WSN,PHI,18,0.283,5.46,5.21,0.5,1.41,4.97,4.37,90.2,2023-08-19,0.2
6,Freddy Peralta,MIL,TEX,TEX,23,0.28,4.01,3.75,2.1,1.19,3.71,5.06,128.0,2023-08-19,0.0
7,Dane Dunning,TEX,TEX,MIL,18,0.272,3.93,4.3,1.9,1.14,4.44,5.22,107.1,2023-08-19,0.1
8,Chris Bassitt,TOR,CIN,CIN,25,0.277,4.56,4.4,1.5,1.24,4.35,5.07,145.2,2023-08-19,0.2
9,Brandon Williamson,CIN,CIN,TOR,16,0.252,4.81,4.78,1.0,1.22,4.77,5.22,81.0,2023-08-19,0.0


In [68]:
todays_SPs['Start-IP'] = todays_SPs['Start-IP'] - todays_SPs['Part-IP']

In [69]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,GS,BABIP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Start-IP,Date,Part-IP
0,Kutter Crawford,BOS,NYY,NYY,15,0.276,4.17,4.35,1.2,1.13,3.95,4.8,68.0,2023-08-19,0.1
1,Gerrit Cole,NYY,NYY,BOS,25,0.272,3.33,3.68,3.5,1.05,3.74,4.78,156.0,2023-08-19,0.1
2,Brady Singer,KCR,CHC,CHC,24,0.309,3.88,4.19,2.1,1.32,4.36,5.64,135.0,2023-08-19,0.2
3,Justin Steele,CHC,CHC,KCR,22,0.306,3.2,3.68,3.2,1.17,3.86,6.64,126.0,2023-08-19,0.0
4,Cristopher Sanchez,PHI,WSN,WSN,11,0.237,4.67,3.54,0.6,1.01,3.63,4.63,58.0,2023-08-19,0.1
5,Jake Irvin,WSN,WSN,PHI,18,0.283,5.46,5.21,0.5,1.41,4.97,4.37,90.0,2023-08-19,0.2
6,Freddy Peralta,MIL,TEX,TEX,23,0.28,4.01,3.75,2.1,1.19,3.71,5.06,128.0,2023-08-19,0.0
7,Dane Dunning,TEX,TEX,MIL,18,0.272,3.93,4.3,1.9,1.14,4.44,5.22,107.0,2023-08-19,0.1
8,Chris Bassitt,TOR,CIN,CIN,25,0.277,4.56,4.4,1.5,1.24,4.35,5.07,145.0,2023-08-19,0.2
9,Brandon Williamson,CIN,CIN,TOR,16,0.252,4.81,4.78,1.0,1.22,4.77,5.22,81.0,2023-08-19,0.0


# Multiply by 10

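The tricky thing here is that FanGraphs records partial innings with a pseudo-decimal: the digit after the point is a count of outs, so 5.1 means five innings plus one out (5 1/3 innings), not 5.1 innings. Since there are three outs in an inning and not 10, we multiply the fractional part by 10 to recover the extra outs and multiply the whole innings by 3.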

In [70]:
todays_SPs['Part-IP'] = todays_SPs['Part-IP'] * 10

In [71]:
todays_SPs['Outs'] = (todays_SPs['Start-IP'] * 3) + todays_SPs['Part-IP']

In [72]:
todays_SPs['Avg_Outs'] = np.round(todays_SPs['Outs']/todays_SPs['GS'], 2)
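
The conversion in the cells above can be condensed into one helper. The sketch below assumes the FanGraphs convention that the digit after the decimal point is a count of outs (0, 1 or 2), and it rounds before splitting the number to avoid floating-point noise from the modulo operation. `ip_to_outs` is a hypothetical name.

```python
def ip_to_outs(ip):
    """Convert innings pitched in 'X.Y' notation (Y = extra outs) to total outs."""
    tenths = round(ip * 10)  # e.g. 156.1 -> 1561, avoiding 0.0999... artifacts
    whole, extra = divmod(tenths, 10)
    if extra > 2:
        raise ValueError(f"{ip} is not valid innings-pitched notation")
    return whole * 3 + extra


# Gerrit Cole's line from the table above: 156.1 IP over 25 starts
cole_outs = ip_to_outs(156.1)
cole_avg_outs = round(cole_outs / 25, 2)
```

For that line the helper gives 469 outs and an average of 18.76 outs per start.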

In [73]:
todays_SPs = todays_SPs.drop(['Start-IP', 'Part-IP', 'Outs', 'GS'], axis = 1)

# Rename BABIP
We have to change BABIP to BABIP_SP here to indicate it's the BABIP against the starting pitcher. This is because we'll be getting team hitting BABIP and team bullpen BABIP later in the notebook.

In [74]:
todays_SPs = todays_SPs.rename(columns = {'BABIP': 'BABIP_SP'})

In [75]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,BABIP_SP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Date,Avg_Outs
0,Kutter Crawford,BOS,NYY,NYY,0.276,4.17,4.35,1.2,1.13,3.95,4.8,2023-08-19,13.67
1,Gerrit Cole,NYY,NYY,BOS,0.272,3.33,3.68,3.5,1.05,3.74,4.78,2023-08-19,18.76
2,Brady Singer,KCR,CHC,CHC,0.309,3.88,4.19,2.1,1.32,4.36,5.64,2023-08-19,16.96
3,Justin Steele,CHC,CHC,KCR,0.306,3.2,3.68,3.2,1.17,3.86,6.64,2023-08-19,17.18
4,Cristopher Sanchez,PHI,WSN,WSN,0.237,4.67,3.54,0.6,1.01,3.63,4.63,2023-08-19,15.91
5,Jake Irvin,WSN,WSN,PHI,0.283,5.46,5.21,0.5,1.41,4.97,4.37,2023-08-19,15.11
6,Freddy Peralta,MIL,TEX,TEX,0.28,4.01,3.75,2.1,1.19,3.71,5.06,2023-08-19,16.7
7,Dane Dunning,TEX,TEX,MIL,0.272,3.93,4.3,1.9,1.14,4.44,5.22,2023-08-19,17.89
8,Chris Bassitt,TOR,CIN,CIN,0.277,4.56,4.4,1.5,1.24,4.35,5.07,2023-08-19,17.48
9,Brandon Williamson,CIN,CIN,TOR,0.252,4.81,4.78,1.0,1.22,4.77,5.22,2023-08-19,15.19


In [76]:
#condition = (todays_SPs['Team'] == 'PIT') | (todays_SPs['Team'] == 'CLE')

In [77]:
#todays_SPs = todays_SPs.drop(todays_SPs[condition].index)

# Bullpen data
This should go a lot smoother than the previous merge. Bullpen data and hitting data are both team-level statistics, so each dataframe we create will have only 30 rows, one per team.

In [78]:
#Actually for bullpens we should go last 30 days
BP_url_1 = 'https://www.fangraphs.com/leaders-legacy.aspx?pos=all&stats=rel&lg=all&qual=0&type=c,43,45,62,59,42,110,113,122,123&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate='
BP_url_2 = '&enddate='

In [79]:
today

datetime.date(2023, 8, 19)

In [80]:
yesterday

datetime.date(2023, 8, 18)

In [81]:
start_date = yesterday - timedelta(days = 30)

In [82]:
start_date

datetime.date(2023, 7, 19)

In [83]:
start_date_str = str(start_date)

In [84]:
start_date_str

'2023-07-19'
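
The date arithmetic spread over the cells above amounts to a window that ends yesterday and starts 30 days before that. A compact sketch follows, where `stats_window` is a hypothetical helper.

```python
import datetime as dt

def stats_window(today, days=30):
    """Return (start, end) ISO date strings for the window ending yesterday."""
    end = today - dt.timedelta(days=1)
    start = end - dt.timedelta(days=days)
    return start.isoformat(), end.isoformat()


window_start, window_end = stats_window(dt.date(2023, 8, 19))
```

Passing the date in as an argument, rather than calling `dt.date.today()` inside the function, makes it possible to rebuild the window for a past date.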

In [85]:
end_date_str = yesterday_str
url_string_page1 = BP_url_1 + start_date_str + BP_url_2 + end_date_str
r_page1 = requests.get(url_string_page1)
html_doc_page1 = r_page1.text
soup_obj_page1 = BeautifulSoup(html_doc_page1)
#print(soup_obj_page1.prettify())
#Finding number of pages to scrape since there are only 30 rows per page
#num_pages = int(soup_obj_page1.find_all('strong')[1].get_text()) + 1
col_names = []
headers = soup_obj_page1.find_all('th', class_ = 'rgHeader')
for header in headers:
    col_names.append(header.get_text())
all_data = []
data = soup_obj_page1.find_all('td', class_ = 'grid_line_regular')
for item in data:
    all_data.append(item.get_text())
#     if num_pages > 2:
#         for j in range(2, num_pages):
#             temp_url_string = pitch_url + current_date_str + '&page=' + str(j) + '_30'
#             temp_r = requests.get(temp_url_string)
#             temp_html_doc = temp_r.text
#             temp_soup_obj = BeautifulSoup(temp_html_doc)
#             temp_data = temp_soup_obj.find_all('td', class_ = 'grid_line_regular')
#             for entry in temp_data:
#                 all_data.append(entry.get_text())
#Turning the list of data into an iterator before dividing it into rows. Determining number of rows by dividing
#the length of the data list by the number of column names
data_iter = iter(all_data)
num_rows = int(len(all_data)/len(col_names))
data_lists = []
for k in range(num_rows):
    temp_list = []
    for l in range(len(headers)):
        temp_list.append(next(data_iter))
    data_lists.append(temp_list)
BP_live_df = pd.DataFrame(data_lists, columns = col_names)
#date_plus_1 = str(date + timedelta(days = 1))
date_plus_1 = pd.to_datetime(today_str, format = '%Y-%m-%d')
BP_live_df['Date'] = date_plus_1
#     list_of_dfs.append(SP_live_df)
#     print(f"{start_date_str} - {end_date_str}")
#    time.sleep(25)


In [86]:
BP_live_df

Unnamed: 0,#,Team,BABIP,FIP,xFIP,WAR,WHIP,Contact%,SwStr%,SIERA,RS/9,Date
0,1,SEA,0.283,3.36,3.83,1.3,1.21,73.4%,12.3%,3.79,6.15,2023-08-19
1,2,LAD,0.223,4.23,4.61,0.6,1.09,74.7%,12.1%,4.22,5.6,2023-08-19
2,3,ATL,0.276,3.8,3.93,0.9,1.24,74.2%,12.5%,3.88,5.89,2023-08-19
3,4,TOR,0.278,3.67,4.33,1.3,1.17,74.2%,12.2%,3.96,3.59,2023-08-19
4,5,NYY,0.248,3.56,3.75,0.9,1.09,71.1%,13.6%,3.58,3.36,2023-08-19
5,6,HOU,0.228,5.06,4.71,-0.2,1.26,71.1%,13.4%,4.39,5.29,2023-08-19
6,7,BAL,0.299,3.22,3.81,1.6,1.2,71.5%,13.7%,3.61,3.22,2023-08-19
7,8,CHC,0.239,4.4,4.28,0.5,1.17,74.7%,11.4%,4.04,5.33,2023-08-19
8,9,PHI,0.284,4.05,4.44,0.8,1.3,75.3%,12.2%,4.16,4.48,2023-08-19
9,10,SFG,0.309,3.4,4.06,1.9,1.22,76.8%,11.5%,3.73,3.17,2023-08-19


In [87]:
BP_live_df = BP_live_df.drop(['#', 'WAR', 'Contact%', 'SwStr%', 'RS/9', 'Date'], axis = 1)

In [88]:
#Adding '_BP' to each column name so we can distinguish from SP data
BP_live_df = BP_live_df.rename(columns = {'BABIP': 'BABIP_BP', 'FIP': 'FIP_BP', 'xFIP': 'xFIP_BP', 'WHIP': 'WHIP_BP', 'SIERA': 'SIERA_BP'})
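
Instead of listing each column by hand, the suffix can be applied programmatically. Here is a sketch using `DataFrame.add_suffix` on every column except `Team`. The sample values come from the first row of the bullpen table above.

```python
import pandas as pd

bp = pd.DataFrame({'Team': ['SEA'], 'BABIP': ['0.283'], 'FIP': ['3.36']})

# Suffix only the stat columns, then put the Team key back in front.
stats = bp.drop(columns=['Team']).add_suffix('_BP')
bp_renamed = pd.concat([bp[['Team']], stats], axis=1)
```

Keeping `Team` unsuffixed matters because it is the merge key used in the next step.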

In [89]:
BP_live_df

Unnamed: 0,Team,BABIP_BP,FIP_BP,xFIP_BP,WHIP_BP,SIERA_BP
0,SEA,0.283,3.36,3.83,1.21,3.79
1,LAD,0.223,4.23,4.61,1.09,4.22
2,ATL,0.276,3.8,3.93,1.24,3.88
3,TOR,0.278,3.67,4.33,1.17,3.96
4,NYY,0.248,3.56,3.75,1.09,3.58
5,HOU,0.228,5.06,4.71,1.26,4.39
6,BAL,0.299,3.22,3.81,1.2,3.61
7,CHC,0.239,4.4,4.28,1.17,4.04
8,PHI,0.284,4.05,4.44,1.3,4.16
9,SFG,0.309,3.4,4.06,1.22,3.73


In [90]:
BP_cols = list(BP_live_df.columns)[1:]

In [91]:
BP_cols

['BABIP_BP', 'FIP_BP', 'xFIP_BP', 'WHIP_BP', 'SIERA_BP']

In [92]:
BP_live_df[BP_cols] = BP_live_df[BP_cols].apply(pd.to_numeric, errors = 'coerce')

In [93]:
todays_SPs = pd.merge(todays_SPs, BP_live_df, how = 'left', on = 'Team')

In [94]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,BABIP_SP,FIP,xFIP,WAR,WHIP,SIERA,RS/9,Date,Avg_Outs,BABIP_BP,FIP_BP,xFIP_BP,WHIP_BP,SIERA_BP
0,Kutter Crawford,BOS,NYY,NYY,0.276,4.17,4.35,1.2,1.13,3.95,4.8,2023-08-19,13.67,0.342,4.3,4.22,1.46,3.91
1,Gerrit Cole,NYY,NYY,BOS,0.272,3.33,3.68,3.5,1.05,3.74,4.78,2023-08-19,18.76,0.248,3.56,3.75,1.09,3.58
2,Brady Singer,KCR,CHC,CHC,0.309,3.88,4.19,2.1,1.32,4.36,5.64,2023-08-19,16.96,0.295,5.52,5.28,1.49,4.84
3,Justin Steele,CHC,CHC,KCR,0.306,3.2,3.68,3.2,1.17,3.86,6.64,2023-08-19,17.18,0.239,4.4,4.28,1.17,4.04
4,Cristopher Sanchez,PHI,WSN,WSN,0.237,4.67,3.54,0.6,1.01,3.63,4.63,2023-08-19,15.91,0.284,4.05,4.44,1.3,4.16
5,Jake Irvin,WSN,WSN,PHI,0.283,5.46,5.21,0.5,1.41,4.97,4.37,2023-08-19,15.11,0.27,4.19,4.04,1.16,3.69
6,Freddy Peralta,MIL,TEX,TEX,0.28,4.01,3.75,2.1,1.19,3.71,5.06,2023-08-19,16.7,0.264,4.28,3.9,1.16,3.63
7,Dane Dunning,TEX,TEX,MIL,0.272,3.93,4.3,1.9,1.14,4.44,5.22,2023-08-19,17.89,0.262,4.93,4.32,1.22,3.83
8,Chris Bassitt,TOR,CIN,CIN,0.277,4.56,4.4,1.5,1.24,4.35,5.07,2023-08-19,17.48,0.278,3.67,4.33,1.17,3.96
9,Brandon Williamson,CIN,CIN,TOR,0.252,4.81,4.78,1.0,1.22,4.77,5.22,2023-08-19,15.19,0.25,4.94,4.34,1.19,3.84


In [95]:
todays_SPs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 18 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Name      30 non-null     object        
 1   Team      30 non-null     object        
 2   Park      30 non-null     object        
 3   Opponent  30 non-null     object        
 4   BABIP_SP  30 non-null     float64       
 5   FIP       30 non-null     float64       
 6   xFIP      30 non-null     float64       
 7   WAR       30 non-null     float64       
 8   WHIP      30 non-null     float64       
 9   SIERA     30 non-null     float64       
 10  RS/9      30 non-null     float64       
 11  Date      30 non-null     datetime64[ns]
 12  Avg_Outs  30 non-null     float64       
 13  BABIP_BP  30 non-null     float64       
 14  FIP_BP    30 non-null     float64       
 15  xFIP_BP   30 non-null     float64       
 16  WHIP_BP   30 non-null     float64       
 17  SIERA_BP  30 non-n

# Hitting Data

In [96]:
Hit_url_1 = 'https://www.fangraphs.com/leaders-legacy.aspx?pos=all&stats=bat&lg=all&qual=0&type=c,39,41,50,58,40&season=2022&month=1000&season1=2021&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate='
Hit_url_2 = '&enddate='

In [97]:
end_date_str = yesterday_str
url_string_page1 = Hit_url_1 + start_date_str + Hit_url_2 + end_date_str
r_page1 = requests.get(url_string_page1)
html_doc_page1 = r_page1.text
soup_obj_page1 = BeautifulSoup(html_doc_page1)
#print(soup_obj_page1.prettify())
#Finding number of pages to scrape since there are only 30 rows per page
#num_pages = int(soup_obj_page1.find_all('strong')[1].get_text()) + 1
col_names = []
headers = soup_obj_page1.find_all('th', class_ = 'rgHeader')
for header in headers:
    col_names.append(header.get_text())
all_data = []
data = soup_obj_page1.find_all('td', class_ = 'grid_line_regular')
for item in data:
    all_data.append(item.get_text())
#     if num_pages > 2:
#         for j in range(2, num_pages):
#             temp_url_string = pitch_url + current_date_str + '&page=' + str(j) + '_30'
#             temp_r = requests.get(temp_url_string)
#             temp_html_doc = temp_r.text
#             temp_soup_obj = BeautifulSoup(temp_html_doc)
#             temp_data = temp_soup_obj.find_all('td', class_ = 'grid_line_regular')
#             for entry in temp_data:
#                 all_data.append(entry.get_text())
#Turning the list of data into an iterator before dividing it into rows. Determining number of rows by dividing
#length of data list by number of column names
data_iter = iter(all_data)
num_rows = int(len(all_data)/len(col_names))
data_lists = []
for k in range(num_rows):
    temp_list = []
    for l in range(len(headers)):
        temp_list.append(next(data_iter))
    data_lists.append(temp_list)
Hit_live_df = pd.DataFrame(data_lists, columns = col_names)
#     date_plus_1 = str(date + timedelta(days = 1))
#     date_plus_1 = pd.to_datetime(date_plus_1, format = '%Y-%m-%d')
SP_live_df['Date'] = today
#list_of_dfs.append(SP_live_df)
#print(f"{end_date_str} done {date_plus_1}")
#time.sleep(25)
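
The row-building loop in the cell above turns a flat list of scraped cell text into one list per table row. As a rough sketch of the same idea using list slicing, with made-up `col_names` and `all_data` values standing in for the scraped header and cell text (they are assumptions, not real FanGraphs output):

```python
import pandas as pd

# Hypothetical stand-ins for the scraped header cells and flat cell text.
col_names = ['#', 'Team', 'OPS']
all_data = ['1', 'ATL', '.890', '2', 'LAD', '.833']

n_cols = len(col_names)

# Guard against a partial row, which would suggest the scrape missed cells.
assert len(all_data) % n_cols == 0

# Slice the flat list into consecutive chunks of n_cols items, one per row.
rows = [all_data[i:i + n_cols] for i in range(0, len(all_data), n_cols)]
df = pd.DataFrame(rows, columns=col_names)
```

The values are still strings at this point, which is why the notebook converts them with `pd.to_numeric` afterwards.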


In [98]:
Hit_live_df = Hit_live_df.drop(['WAR', '#'], axis = 1)

In [99]:
Hit_live_df

Unnamed: 0,Team,OPS,BABIP,wOBA,ISO
0,ATL,0.89,0.318,0.377,0.24
1,LAD,0.833,0.311,0.357,0.193
2,CHC,0.826,0.312,0.355,0.209
3,TEX,0.807,0.305,0.346,0.208
4,SEA,0.801,0.329,0.345,0.201
5,MIN,0.783,0.316,0.338,0.198
6,STL,0.778,0.296,0.337,0.178
7,HOU,0.772,0.283,0.336,0.184
8,KCR,0.785,0.311,0.335,0.186
9,SDP,0.748,0.294,0.326,0.163


In [100]:
Hit_live_df[['OPS', 'BABIP', 'wOBA', 'ISO']] = Hit_live_df[['OPS', 'BABIP', 'wOBA', 'ISO']].apply(pd.to_numeric, errors = 'coerce')

In [101]:
todays_SPs = pd.merge(todays_SPs, Hit_live_df, how = 'left', on = 'Team')

In [102]:
todays_SPs

Unnamed: 0,Name,Team,Park,Opponent,BABIP_SP,FIP,xFIP,WAR,WHIP,SIERA,...,Avg_Outs,BABIP_BP,FIP_BP,xFIP_BP,WHIP_BP,SIERA_BP,OPS,BABIP,wOBA,ISO
0,Kutter Crawford,BOS,NYY,NYY,0.276,4.17,4.35,1.2,1.13,3.95,...,13.67,0.342,4.3,4.22,1.46,3.91,0.735,0.296,0.316,0.173
1,Gerrit Cole,NYY,NYY,BOS,0.272,3.33,3.68,3.5,1.05,3.74,...,18.76,0.248,3.56,3.75,1.09,3.58,0.711,0.289,0.313,0.15
2,Brady Singer,KCR,CHC,CHC,0.309,3.88,4.19,2.1,1.32,4.36,...,16.96,0.295,5.52,5.28,1.49,4.84,0.785,0.311,0.335,0.186
3,Justin Steele,CHC,CHC,KCR,0.306,3.2,3.68,3.2,1.17,3.86,...,17.18,0.239,4.4,4.28,1.17,4.04,0.826,0.312,0.355,0.209
4,Cristopher Sanchez,PHI,WSN,WSN,0.237,4.67,3.54,0.6,1.01,3.63,...,15.91,0.284,4.05,4.44,1.3,4.16,0.74,0.291,0.322,0.175
5,Jake Irvin,WSN,WSN,PHI,0.283,5.46,5.21,0.5,1.41,4.97,...,15.11,0.27,4.19,4.04,1.16,3.69,0.744,0.297,0.324,0.153
6,Freddy Peralta,MIL,TEX,TEX,0.28,4.01,3.75,2.1,1.19,3.71,...,16.7,0.264,4.28,3.9,1.16,3.63,0.698,0.283,0.305,0.14
7,Dane Dunning,TEX,TEX,MIL,0.272,3.93,4.3,1.9,1.14,4.44,...,17.89,0.262,4.93,4.32,1.22,3.83,0.807,0.305,0.346,0.208
8,Chris Bassitt,TOR,CIN,CIN,0.277,4.56,4.4,1.5,1.24,4.35,...,17.48,0.278,3.67,4.33,1.17,3.96,0.728,0.297,0.32,0.149
9,Brandon Williamson,CIN,CIN,TOR,0.252,4.81,4.78,1.0,1.22,4.77,...,15.19,0.25,4.94,4.34,1.19,3.84,0.73,0.295,0.315,0.184


# Breaking dataframe into home and away
Now we create a dataframe with just away team rows and one with just home team rows, based on whether or not the team name matches the park.

In [103]:
away_df = todays_SPs[todays_SPs['Team'] != todays_SPs['Park']]

In [104]:
away_df

Unnamed: 0,Name,Team,Park,Opponent,BABIP_SP,FIP,xFIP,WAR,WHIP,SIERA,...,Avg_Outs,BABIP_BP,FIP_BP,xFIP_BP,WHIP_BP,SIERA_BP,OPS,BABIP,wOBA,ISO
0,Kutter Crawford,BOS,NYY,NYY,0.276,4.17,4.35,1.2,1.13,3.95,...,13.67,0.342,4.3,4.22,1.46,3.91,0.735,0.296,0.316,0.173
2,Brady Singer,KCR,CHC,CHC,0.309,3.88,4.19,2.1,1.32,4.36,...,16.96,0.295,5.52,5.28,1.49,4.84,0.785,0.311,0.335,0.186
4,Cristopher Sanchez,PHI,WSN,WSN,0.237,4.67,3.54,0.6,1.01,3.63,...,15.91,0.284,4.05,4.44,1.3,4.16,0.74,0.291,0.322,0.175
6,Freddy Peralta,MIL,TEX,TEX,0.28,4.01,3.75,2.1,1.19,3.71,...,16.7,0.264,4.28,3.9,1.16,3.63,0.698,0.283,0.305,0.14
8,Chris Bassitt,TOR,CIN,CIN,0.277,4.56,4.4,1.5,1.24,4.35,...,17.48,0.278,3.67,4.33,1.17,3.96,0.728,0.297,0.32,0.149
10,Matt Manning,DET,CLE,CLE,0.236,5.24,5.62,0.1,1.16,5.19,...,16.0,0.315,4.45,4.59,1.46,4.18,0.711,0.305,0.307,0.159
12,Logan Gilbert,SEA,HOU,HOU,0.276,3.58,3.65,2.8,1.06,3.68,...,17.75,0.283,3.36,3.83,1.21,3.79,0.801,0.329,0.345,0.201
14,Mitch Keller,PIT,MIN,MIN,0.311,3.86,3.82,2.4,1.28,3.9,...,17.96,0.3,4.01,4.16,1.29,3.62,0.727,0.286,0.317,0.179
16,Kodai Senga,NYM,STL,STL,0.292,3.54,3.76,2.5,1.29,4.05,...,16.73,0.288,4.96,4.86,1.43,4.38,0.718,0.264,0.314,0.167
18,Logan Webb,SFG,ATL,ATL,0.3,3.24,2.95,3.5,1.07,3.15,...,19.56,0.309,3.4,4.06,1.22,3.73,0.593,0.254,0.262,0.117


In [105]:
home_df = todays_SPs[todays_SPs['Team'] == todays_SPs['Park']]

In [106]:
home_df

Unnamed: 0,Name,Team,Park,Opponent,BABIP_SP,FIP,xFIP,WAR,WHIP,SIERA,...,Avg_Outs,BABIP_BP,FIP_BP,xFIP_BP,WHIP_BP,SIERA_BP,OPS,BABIP,wOBA,ISO
1,Gerrit Cole,NYY,NYY,BOS,0.272,3.33,3.68,3.5,1.05,3.74,...,18.76,0.248,3.56,3.75,1.09,3.58,0.711,0.289,0.313,0.15
3,Justin Steele,CHC,CHC,KCR,0.306,3.2,3.68,3.2,1.17,3.86,...,17.18,0.239,4.4,4.28,1.17,4.04,0.826,0.312,0.355,0.209
5,Jake Irvin,WSN,WSN,PHI,0.283,5.46,5.21,0.5,1.41,4.97,...,15.11,0.27,4.19,4.04,1.16,3.69,0.744,0.297,0.324,0.153
7,Dane Dunning,TEX,TEX,MIL,0.272,3.93,4.3,1.9,1.14,4.44,...,17.89,0.262,4.93,4.32,1.22,3.83,0.807,0.305,0.346,0.208
9,Brandon Williamson,CIN,CIN,TOR,0.252,4.81,4.78,1.0,1.22,4.77,...,15.19,0.25,4.94,4.34,1.19,3.84,0.73,0.295,0.315,0.184
11,Tanner Bibee,CLE,CLE,DET,0.292,3.61,4.35,2.3,1.21,4.25,...,17.16,0.325,3.79,4.02,1.47,3.88,0.666,0.285,0.291,0.124
13,Framber Valdez,HOU,HOU,SEA,0.28,3.43,3.29,3.4,1.08,3.57,...,19.52,0.228,5.06,4.71,1.26,4.39,0.772,0.283,0.336,0.184
15,Sonny Gray,MIN,MIN,PIT,0.306,2.79,3.61,4.0,1.22,3.98,...,17.04,0.313,4.9,4.3,1.38,3.81,0.783,0.316,0.338,0.198
17,Miles Mikolas,STL,STL,NYM,0.309,3.85,4.64,2.8,1.28,4.72,...,17.04,0.327,4.07,4.21,1.4,3.9,0.778,0.296,0.337,0.178
19,Yonny Chirinos,ATL,ATL,SFG,0.276,5.45,5.02,-0.3,1.36,4.99,...,14.75,0.276,3.8,3.93,1.24,3.88,0.89,0.318,0.377,0.24


Adding 'A_' and 'H_' prefixes to column names
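
For illustration only, the prefixing step can also be sketched with a dict comprehension passed to `rename`. The `toy` frame and the `keep_as_is` list are hypothetical; the notebook itself builds `cols_to_change` by hand in the cells that follow.

```python
import pandas as pd

# Hypothetical one-row frame standing in for away_df.
toy = pd.DataFrame({'Name': ['Pitcher X'], 'Park': ['NYY'], 'FIP': [4.17], 'WHIP': [1.13]})

# Hypothetical list of columns that should keep their original names.
keep_as_is = ['Name', 'Park']

# Map every other column to its 'A_'-prefixed name and rename in one call.
away_toy = toy.rename(columns={c: 'A_' + c for c in toy.columns if c not in keep_as_is})
```

Swapping `'A_'` for `'H_'` gives the home version.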

In [107]:
away_df.columns

Index(['Name', 'Team', 'Park', 'Opponent', 'BABIP_SP', 'FIP', 'xFIP', 'WAR',
       'WHIP', 'SIERA', 'RS/9', 'Date', 'Avg_Outs', 'BABIP_BP', 'FIP_BP',
       'xFIP_BP', 'WHIP_BP', 'SIERA_BP', 'OPS', 'BABIP', 'wOBA', 'ISO'],
      dtype='object')

In [108]:
cols_to_change = ['Team', 'BABIP_SP', 'FIP', 'xFIP', 'WAR',
       'WHIP', 'SIERA', 'RS/9', 'Date', 'Avg_Outs', 'BABIP_BP', 'FIP_BP',
       'xFIP_BP', 'WHIP_BP', 'SIERA_BP', '#', 'OPS', 'BABIP', 'wOBA', 'ISO']

In [109]:
away_col_names = []
home_col_names = []

In [110]:
for i in range(len(cols_to_change)):
    away_col_names.append('A_' + cols_to_change[i])
    home_col_names.append('H_' + cols_to_change[i])

In [111]:
away_change_dict = dict(zip(cols_to_change, away_col_names))

In [112]:
home_change_dict = dict(zip(cols_to_change, home_col_names))

In [113]:
away_change_dict

{'Team': 'A_Team',
 'BABIP_SP': 'A_BABIP_SP',
 'FIP': 'A_FIP',
 'xFIP': 'A_xFIP',
 'WAR': 'A_WAR',
 'WHIP': 'A_WHIP',
 'SIERA': 'A_SIERA',
 'RS/9': 'A_RS/9',
 'Date': 'A_Date',
 'Avg_Outs': 'A_Avg_Outs',
 'BABIP_BP': 'A_BABIP_BP',
 'FIP_BP': 'A_FIP_BP',
 'xFIP_BP': 'A_xFIP_BP',
 'WHIP_BP': 'A_WHIP_BP',
 'SIERA_BP': 'A_SIERA_BP',
 '#': 'A_#',
 'OPS': 'A_OPS',
 'BABIP': 'A_BABIP',
 'wOBA': 'A_wOBA',
 'ISO': 'A_ISO'}

In [114]:
away_df = away_df.rename(columns = away_change_dict)

In [115]:
home_df = home_df.rename(columns = home_change_dict)

In [116]:
away_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 28
Data columns (total 22 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Name        15 non-null     object        
 1   A_Team      15 non-null     object        
 2   Park        15 non-null     object        
 3   Opponent    15 non-null     object        
 4   A_BABIP_SP  15 non-null     float64       
 5   A_FIP       15 non-null     float64       
 6   A_xFIP      15 non-null     float64       
 7   A_WAR       15 non-null     float64       
 8   A_WHIP      15 non-null     float64       
 9   A_SIERA     15 non-null     float64       
 10  A_RS/9      15 non-null     float64       
 11  A_Date      15 non-null     datetime64[ns]
 12  A_Avg_Outs  15 non-null     float64       
 13  A_BABIP_BP  15 non-null     float64       
 14  A_FIP_BP    15 non-null     float64       
 15  A_xFIP_BP   15 non-null     float64       
 16  A_WHIP_BP   15 non-null     

In [117]:
home_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 1 to 29
Data columns (total 22 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Name        15 non-null     object        
 1   H_Team      15 non-null     object        
 2   Park        15 non-null     object        
 3   Opponent    15 non-null     object        
 4   H_BABIP_SP  15 non-null     float64       
 5   H_FIP       15 non-null     float64       
 6   H_xFIP      15 non-null     float64       
 7   H_WAR       15 non-null     float64       
 8   H_WHIP      15 non-null     float64       
 9   H_SIERA     15 non-null     float64       
 10  H_RS/9      15 non-null     float64       
 11  H_Date      15 non-null     datetime64[ns]
 12  H_Avg_Outs  15 non-null     float64       
 13  H_BABIP_BP  15 non-null     float64       
 14  H_FIP_BP    15 non-null     float64       
 15  H_xFIP_BP   15 non-null     float64       
 16  H_WHIP_BP   15 non-null     

# Win variables
Previously, we created the dataframe for all the season's scores here. But now it's created in our Scoresheet notebook as we keep track of our model's performance. So we read in that CSV here.

In [118]:
scores_df = pd.read_csv('2023_Game_Scores.csv')

In [119]:
scores_df = clean_unnamed(scores_df)

In [120]:
scores_df = scores_df.reset_index()

In [121]:
scores_df.head()

Unnamed: 0,index,Away_Team,Away_Score,Home_Team,Home_Score,Date,Home_Win,Away_Win
0,0,BAL,10,BOS,9,2023-03-30,0,1
1,1,MIL,0,CHC,4,2023-03-30,1,0
2,2,PIT,5,CIN,4,2023-03-30,0,1
3,3,CHW,3,HOU,2,2023-03-30,0,1
4,4,MIN,2,KCR,0,2023-03-30,0,1


In [122]:
scores_df.tail()

Unnamed: 0,index,Away_Team,Away_Score,Home_Team,Home_Score,Date,Home_Win,Away_Win
1832,1832,BAL,9,OAK,4,2023-08-18,0,1
1833,1833,ARI,0,SDP,4,2023-08-18,1,0
1834,1834,NYM,7,STL,1,2023-08-18,0,1
1835,1835,MIL,9,TEX,8,2023-08-18,0,1
1836,1836,PHI,7,WSN,8,2023-08-18,1,0


In [123]:
scores_df['index'].nunique()

1837

In [124]:
scores_df.rename(columns = {'index':'game_id'}, inplace = True)

In [125]:
scores_df.head()

Unnamed: 0,game_id,Away_Team,Away_Score,Home_Team,Home_Score,Date,Home_Win,Away_Win
0,0,BAL,10,BOS,9,2023-03-30,0,1
1,1,MIL,0,CHC,4,2023-03-30,1,0
2,2,PIT,5,CIN,4,2023-03-30,0,1
3,3,CHW,3,HOU,2,2023-03-30,0,1
4,4,MIN,2,KCR,0,2023-03-30,0,1


In [126]:
scores_df['Away_Host'] = 0
scores_df['Home_Host'] = 1

In [127]:
# scores_df['Home_Win'] = np.where(scores_df['Home_Score'] > scores_df['Away_Score'], 1, 0)
# scores_df['Away_Win'] = np.where(scores_df['Away_Score'] > scores_df['Home_Score'], 1, 0)

In [128]:
# scores_df = scores_df.rename(columns = {'Away':'Away_Team', 'Home': 'Home_Team'})

In [129]:
scores_df_A = scores_df[['Date', 'game_id', 'Away_Team', 'Away_Score', 'Away_Win', 'Away_Host']]
scores_df_H = scores_df[['Date', 'game_id', 'Home_Team', 'Home_Score', 'Home_Win', 'Home_Host']]

In [130]:
scores_df_A.head()

Unnamed: 0,Date,game_id,Away_Team,Away_Score,Away_Win,Away_Host
0,2023-03-30,0,BAL,10,1,0
1,2023-03-30,1,MIL,0,0,0
2,2023-03-30,2,PIT,5,1,0
3,2023-03-30,3,CHW,3,1,0
4,2023-03-30,4,MIN,2,1,0


In [131]:
scores_df_H.head()

Unnamed: 0,Date,game_id,Home_Team,Home_Score,Home_Win,Home_Host
0,2023-03-30,0,BOS,9,0,1
1,2023-03-30,1,CHC,4,1,1
2,2023-03-30,2,CIN,4,0,1
3,2023-03-30,3,HOU,2,0,1
4,2023-03-30,4,KCR,0,0,1


In [132]:
scores_df_A = scores_df_A.rename(columns = {'Away_Team': 'Team', 'Away_Score': 'Score', 'Away_Win': 'Win', 'Away_Host': 'Host'})
scores_df_H = scores_df_H.rename(columns = {'Home_Team': 'Team', 'Home_Score': 'Score', 'Home_Win': 'Win', 'Home_Host': 'Host'})

# Re-assembling home and away rows
We concatenate our home and away dataframes and sort by Date, game_id and the Host variable we created. That way each game occupies two consecutive rows, with the away team's row on top.

In [133]:
wins_df = pd.concat([scores_df_A, scores_df_H]).sort_values(['Date', 'game_id', 'Host'])

In [134]:
wins_df.tail(50)

Unnamed: 0,Date,game_id,Team,Score,Win,Host
1812,2023-08-16,1812,OAK,8,1,0
1812,2023-08-16,1812,STL,0,0,1
1813,2023-08-16,1813,LAA,2,1,0
1813,2023-08-16,1813,TEX,0,0,1
1814,2023-08-16,1814,PHI,9,1,0
1814,2023-08-16,1814,TOR,4,0,1
1815,2023-08-16,1815,BOS,2,0,0
1815,2023-08-16,1815,WSN,6,1,1
1816,2023-08-17,1816,SEA,6,1,0
1816,2023-08-17,1816,KCR,4,0,1


In [135]:
wins_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3674 entries, 0 to 1836
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Date     3674 non-null   object
 1   game_id  3674 non-null   int64 
 2   Team     3674 non-null   object
 3   Score    3674 non-null   int64 
 4   Win      3674 non-null   int64 
 5   Host     3674 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 200.9+ KB


# Adding a Team_Wins column
We group by Team and take the cumulative sum of Win, so each row holds that team's win total through that game.

In [136]:
wins_df['Team_Wins'] = wins_df.groupby('Team')['Win'].cumsum()

In [137]:
wins_df.tail(31)

Unnamed: 0,Date,game_id,Team,Score,Win,Host,Team_Wins
1821,2023-08-18,1821,LAA,6,0,1,60
1822,2023-08-18,1822,SFG,0,0,0,64
1822,2023-08-18,1822,ATL,4,1,1,79
1823,2023-08-18,1823,KCR,4,1,0,40
1823,2023-08-18,1823,CHC,3,0,1,62
1824,2023-08-18,1824,TOR,0,0,0,67
1824,2023-08-18,1824,CIN,1,1,1,64
1825,2023-08-18,1825,DET,4,1,0,55
1825,2023-08-18,1825,CLE,2,0,1,58
1826,2023-08-18,1826,DET,1,0,0,55


In [138]:
# pd.set_option('display.max_rows', None)

# Last 10
Now we need a wins-in-last-10-games variable. This for loop creates a temporary dataframe for each team and subtracts the Team_Wins value from 10 rows earlier from the Team_Wins value in the current row. We need to reset the index of each temp_df so we can use the index when we use .loc.<br>

Then we concat the list of temp_dfs and sort by Date and game_id.<br>

There are two for loops within the main for loop because for the first 10 rows, wins in last 10 games is just total wins.
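
As a cross-check on the loop in the next cell, the same quantity can be sketched with a grouped rolling sum. This is a minimal sketch rather than the notebook's method: `toy` is made-up data, and it assumes the rows are already in chronological order within each team.

```python
import pandas as pd

# Hypothetical toy data: one team, 12 games in chronological order.
toy = pd.DataFrame({
    'Team': ['AAA'] * 12,
    'Win':  [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1],
})

# Cumulative wins, as in the Team_Wins step.
toy['Team_Wins'] = toy.groupby('Team')['Win'].cumsum()

# Wins over the most recent 10 games. min_periods=1 means the first
# nine rows simply count every game played so far.
toy['Wins_L10'] = (
    toy.groupby('Team')['Win']
       .transform(lambda s: s.rolling(10, min_periods=1).sum())
       .astype(int)
)
```

Changing the window from 10 to 30 gives the last-30 version.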

In [139]:
list_of_dfs = []
for team in wins_df['Team'].unique():
    temp_df = wins_df[wins_df['Team'] == team]
    temp_df = temp_df.reset_index(drop=True)
    temp_df['Wins_L10'] = 0
    for i in range(10):
        temp_df.loc[i, 'Wins_L10'] = temp_df.loc[i, 'Team_Wins']
    for j in range(10, (len(temp_df))):
        temp_df.loc[j, 'Wins_L10'] = temp_df.loc[j, 'Team_Wins'] - temp_df.loc[j-10, 'Team_Wins']
    list_of_dfs.append(temp_df)

In [140]:
wins_df = pd.concat(list_of_dfs).sort_values(['Date', 'game_id', 'Host'])

In [141]:
wins_df.tail()

Unnamed: 0,Date,game_id,Team,Score,Win,Host,Team_Wins,Wins_L10
122,2023-08-18,1834,STL,1,0,1,54,5
122,2023-08-18,1835,MIL,9,1,0,66,6
121,2023-08-18,1835,TEX,8,0,1,72,6
121,2023-08-18,1836,PHI,7,0,0,66,5
122,2023-08-18,1836,WSN,8,1,1,56,7


# Last 30

In [142]:
list_of_dfs = []
for team in wins_df['Team'].unique():
    temp_df = wins_df[wins_df['Team'] == team]
    temp_df = temp_df.reset_index(drop=True)
    temp_df['Wins_L30'] = 0
    for i in range(30):
        temp_df.loc[i, 'Wins_L30'] = temp_df.loc[i, 'Team_Wins']
    for j in range(30, (len(temp_df))):
        temp_df.loc[j, 'Wins_L30'] = temp_df.loc[j, 'Team_Wins'] - temp_df.loc[j-30, 'Team_Wins']
    list_of_dfs.append(temp_df)



In [143]:
wins_df = pd.concat(list_of_dfs).sort_values(['Date', 'game_id', 'Host'])



In [144]:
wins_df.head(50)



Unnamed: 0,Date,game_id,Team,Score,Win,Host,Team_Wins,Wins_L10,Wins_L30
0,2023-03-30,0,BAL,10,1,0,1,1,1
0,2023-03-30,0,BOS,9,0,1,0,0,0
0,2023-03-30,1,MIL,0,0,0,0,0,0
0,2023-03-30,1,CHC,4,1,1,1,1,1
0,2023-03-30,2,PIT,5,1,0,1,1,1
0,2023-03-30,2,CIN,4,0,1,0,0,0
0,2023-03-30,3,CHW,3,1,0,1,1,1
0,2023-03-30,3,HOU,2,0,1,0,0,0
0,2023-03-30,4,MIN,2,1,0,1,1,1
0,2023-03-30,4,KCR,0,0,1,0,0,0


# Pythagorean wins
Now let's add a Pythagorean Wins variable. According to <a href='https://www.mlb.com/glossary/advanced-stats/pythagorean-winning-percentage'>MLB.com</a>, this is a formula created by noted baseball statistician Bill James to measure how many games a team "should have" won, based on runs scored and runs allowed.<br>

**Formula:** Runs Scored to the 1.83 power/((Runs Scored to the 1.83 power) + (Runs Allowed to the 1.83 power))<br>

Because that formula outputs the winning percentage a team "should" have, we multiply by games played to convert to wins.<br>

We'll start by creating a variable that counts each team's games played, then cumulative Runs_Scored and Runs_Allowed variables.<br>
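
To make the formula concrete, here is a small sketch as a plain function. The function name and signature are our own for illustration; the notebook applies the same arithmetic to whole columns with NumPy. The Texas figures come from the `wins_df.tail()` output later in this notebook.

```python
def pythagorean_wins(runs_scored, runs_allowed, games_played, exponent=1.83):
    """Expected wins from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    # Expected winning percentage, then scaled to a win count.
    win_pct = rs / (rs + ra)
    return round(win_pct * games_played)

# Texas through 2023-08-18: 699 runs scored, 505 allowed, 122 games played.
print(pythagorean_wins(699, 505, 122))
```

A team that scores exactly as many runs as it allows comes out at a .500 expected record.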



In [145]:
wins_df = wins_df.reset_index(drop=True)

In [146]:
wins_df['Games_Played'] = wins_df.groupby('Team').cumcount() + 1

In [147]:
wins_df.tail(10)

Unnamed: 0,Date,game_id,Team,Score,Win,Host,Team_Wins,Wins_L10,Wins_L30,Games_Played
3664,2023-08-18,1832,BAL,9,1,0,75,5,18,122
3665,2023-08-18,1832,OAK,4,0,1,34,2,9,122
3666,2023-08-18,1833,ARI,0,0,0,62,5,10,123
3667,2023-08-18,1833,SDP,4,1,1,59,4,15,123
3668,2023-08-18,1834,NYM,7,1,0,57,6,14,123
3669,2023-08-18,1834,STL,1,0,1,54,5,14,123
3670,2023-08-18,1835,MIL,9,1,0,66,6,15,123
3671,2023-08-18,1835,TEX,8,0,1,72,6,19,122
3672,2023-08-18,1836,PHI,7,0,0,66,5,16,122
3673,2023-08-18,1836,WSN,8,1,1,56,7,19,123


Getting runs scored and runs allowed.
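
The next cells fill Opp_Score by looking at the neighbouring row, which relies on each game sitting on two consecutive rows. As an alternative sketch (not what the notebook runs), the opponent's score can be derived per game_id without depending on row order. The `toy` frame is made-up data in the same shape as wins_df.

```python
import pandas as pd

# Hypothetical toy data: two games, two rows per game.
toy = pd.DataFrame({
    'game_id': [0, 0, 1, 1],
    'Team':    ['BAL', 'BOS', 'MIL', 'CHC'],
    'Score':   [10, 9, 0, 4],
})

# Each game has exactly two rows, so the opponent's score is the
# game's total runs minus the team's own score.
toy['Opp_Score'] = toy.groupby('game_id')['Score'].transform('sum') - toy['Score']
```

This assumes every game_id appears on exactly two rows; a game with a missing row would produce a wrong value instead of an error.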

In [148]:
wins_df['Opp_Score'] = 0

In [149]:
for i in range(len(wins_df)):
    if i % 2 == 0:
        wins_df.loc[i, 'Opp_Score'] = wins_df.loc[i+1, 'Score']
    else:
        wins_df.loc[i, 'Opp_Score'] = wins_df.loc[i-1, 'Score']

In [150]:
wins_df.head()

Unnamed: 0,Date,game_id,Team,Score,Win,Host,Team_Wins,Wins_L10,Wins_L30,Games_Played,Opp_Score
0,2023-03-30,0,BAL,10,1,0,1,1,1,1,9
1,2023-03-30,0,BOS,9,0,1,0,0,0,1,10
2,2023-03-30,1,MIL,0,0,0,0,0,0,1,4
3,2023-03-30,1,CHC,4,1,1,1,1,1,1,0
4,2023-03-30,2,PIT,5,1,0,1,1,1,1,4


In [151]:
wins_df['Runs_Scored'] = wins_df.groupby('Team')['Score'].cumsum()

In [152]:
wins_df['Runs_Allowed'] = wins_df.groupby('Team')['Opp_Score'].cumsum()

In [153]:
wins_df['Py_Wins'] = np.round(wins_df['Runs_Scored']**1.83/((wins_df['Runs_Scored']**1.83)+(wins_df['Runs_Allowed']**1.83))\
*wins_df['Games_Played'], 0)

In [154]:
wins_df.tail()

Unnamed: 0,Date,game_id,Team,Score,Win,Host,Team_Wins,Wins_L10,Wins_L30,Games_Played,Opp_Score,Runs_Scored,Runs_Allowed,Py_Wins
3669,2023-08-18,1834,STL,1,0,1,54,5,14,123,7,567,605,58.0
3670,2023-08-18,1835,MIL,9,1,0,66,6,15,123,8,525,529,61.0
3671,2023-08-18,1835,TEX,8,0,1,72,6,19,122,9,699,505,79.0
3672,2023-08-18,1836,PHI,7,0,0,66,5,16,122,8,567,531,65.0
3673,2023-08-18,1836,WSN,8,1,1,56,7,19,123,7,545,627,54.0


Now we derive Run Differential, keep only the columns we need and then keep only the latest row for each team before we save to CSV.
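
A minimal sketch of the "latest row per team" step on made-up data. One assumption worth flagging: here the sort uses the real game date, whereas the notebook overwrites Date with today's date first and so depends on the rows already being in chronological order.

```python
import pandas as pd

# Hypothetical toy data: two teams, two game dates each.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2023-08-17', '2023-08-18', '2023-08-17', '2023-08-18']),
    'Team': ['ATL', 'ATL', 'BOS', 'BOS'],
    'Team_Wins': [78, 79, 64, 64],
})

# Sort oldest to newest within each team, then keep each team's final row.
latest = (
    toy.sort_values(['Team', 'Date'])
       .drop_duplicates(subset='Team', keep='last')
       .reset_index(drop=True)
)
```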

In [155]:
wins_df['Run_Diff'] = wins_df['Runs_Scored'] - wins_df['Runs_Allowed']

In [156]:
wins_df = wins_df.drop(columns = ['Games_Played', 'Opp_Score', 'Runs_Scored', 'Runs_Allowed'])

In [157]:
wins_df = wins_df.drop(columns = ['game_id', 'Score', 'Win', 'Host'])

In [158]:
wins_df['Date'] = today

In [159]:
wins_df['Date'] = pd.to_datetime(wins_df['Date'])

In [160]:
wins_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3674 entries, 0 to 3673
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       3674 non-null   datetime64[ns]
 1   Team       3674 non-null   object        
 2   Team_Wins  3674 non-null   int64         
 3   Wins_L10   3674 non-null   int64         
 4   Wins_L30   3674 non-null   int64         
 5   Py_Wins    3674 non-null   float64       
 6   Run_Diff   3674 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 201.0+ KB


In [161]:
# ascending expects booleans, not the strings 'True'/'False'
wins_df = wins_df.sort_values(by = ['Team', 'Date'], ascending = [True, True])

In [162]:
wins_df = wins_df.drop_duplicates(subset = 'Team', keep = 'last')

In [163]:
filepath = r'C:\Users\Owner\Sports Betting\MLB_Game_Outcome\2023_Win_Features_' + today_str + '.csv'
wins_df.to_csv(filepath)

In [164]:
wins_df

Unnamed: 0,Date,Team,Team_Wins,Wins_L10,Wins_L30,Py_Wins,Run_Diff
3666,2023-08-19,ARI,62,5,10,59.0,-21
3645,2023-08-19,ATL,79,8,18,80.0,212
3664,2023-08-19,BAL,75,5,18,67.0,63
3662,2023-08-19,BOS,64,6,15,64.0,28
3646,2023-08-19,CHC,62,5,19,66.0,62
3654,2023-08-19,CHW,48,4,10,50.0,-107
3648,2023-08-19,CIN,64,5,14,60.0,-19
3652,2023-08-19,CLE,59,5,14,61.0,-1
3655,2023-08-19,COL,47,3,12,46.0,-170
3653,2023-08-19,DET,55,6,14,50.0,-102


# Back to home and away dataframes
We still have our separate home and away dataframes from earlier; the concatenation above only built wins_df. We'll now merge each of them with wins_df.
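
Since wins_df should hold exactly one row per team at this point, a left merge should not change the number of rows. As a hedged sketch on made-up frames, pandas can enforce that assumption with the `validate` argument of `pd.merge`; the notebook's own merge cells below do not use it.

```python
import pandas as pd

# Hypothetical stand-ins for away_df and wins_df.
games = pd.DataFrame({'Name': ['Starter 1', 'Starter 2'], 'Team': ['BOS', 'KCR']})
team_wins = pd.DataFrame({'Team': ['BOS', 'KCR', 'NYY'], 'Team_Wins': [64, 40, 60]})

# validate='many_to_one' raises MergeError if a team appears more than
# once in the right-hand frame, instead of silently duplicating rows.
merged = pd.merge(games, team_wins, on='Team', how='left', validate='many_to_one')
```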

In [165]:
away_df = away_df.rename(columns = {'A_Team': 'Team'})
home_df = home_df.rename(columns = {'H_Team': 'Team'})

In [166]:
wins_df = wins_df.drop('Date', axis = 1)

In [167]:
away_df = pd.merge(away_df, wins_df, on = ['Team'], how = 'left')

In [168]:
home_df = pd.merge(home_df, wins_df, on = ['Team'], how = 'left')

In [169]:
away_df

Unnamed: 0,Name,Team,Park,Opponent,A_BABIP_SP,A_FIP,A_xFIP,A_WAR,A_WHIP,A_SIERA,...,A_SIERA_BP,A_OPS,A_BABIP,A_wOBA,A_ISO,Team_Wins,Wins_L10,Wins_L30,Py_Wins,Run_Diff
0,Kutter Crawford,BOS,NYY,NYY,0.276,4.17,4.35,1.2,1.13,3.95,...,3.91,0.735,0.296,0.316,0.173,64,6,15,64.0,28
1,Brady Singer,KCR,CHC,CHC,0.309,3.88,4.19,2.1,1.32,4.36,...,4.84,0.785,0.311,0.335,0.186,40,4,13,46.0,-163
2,Cristopher Sanchez,PHI,WSN,WSN,0.237,4.67,3.54,0.6,1.01,3.63,...,4.16,0.74,0.291,0.322,0.175,66,5,16,65.0,36
3,Freddy Peralta,MIL,TEX,TEX,0.28,4.01,3.75,2.1,1.19,3.71,...,3.63,0.698,0.283,0.305,0.14,66,6,15,61.0,-4
4,Chris Bassitt,TOR,CIN,CIN,0.277,4.56,4.4,1.5,1.24,4.35,...,3.96,0.728,0.297,0.32,0.149,67,4,15,67.0,54
5,Matt Manning,DET,CLE,CLE,0.236,5.24,5.62,0.1,1.16,5.19,...,4.18,0.711,0.305,0.307,0.159,55,6,14,50.0,-102
6,Logan Gilbert,SEA,HOU,HOU,0.276,3.58,3.65,2.8,1.06,3.68,...,3.79,0.801,0.329,0.345,0.201,67,7,21,68.0,64
7,Mitch Keller,PIT,MIN,MIN,0.311,3.86,3.82,2.4,1.28,3.9,...,3.62,0.727,0.286,0.317,0.179,54,3,13,51.0,-95
8,Kodai Senga,NYM,STL,STL,0.292,3.54,3.76,2.5,1.29,4.05,...,4.38,0.718,0.264,0.314,0.167,57,6,14,57.0,-41
9,Logan Webb,SFG,ATL,ATL,0.3,3.24,2.95,3.5,1.07,3.15,...,3.73,0.593,0.254,0.262,0.117,64,3,13,62.0,8


In [170]:
home_df

Unnamed: 0,Name,Team,Park,Opponent,H_BABIP_SP,H_FIP,H_xFIP,H_WAR,H_WHIP,H_SIERA,...,H_SIERA_BP,H_OPS,H_BABIP,H_wOBA,H_ISO,Team_Wins,Wins_L10,Wins_L30,Py_Wins,Run_Diff
0,Gerrit Cole,NYY,NYY,BOS,0.272,3.33,3.68,3.5,1.05,3.74,...,3.58,0.711,0.289,0.313,0.15,60,2,11,59.0,-19
1,Justin Steele,CHC,CHC,KCR,0.306,3.2,3.68,3.2,1.17,3.86,...,4.04,0.826,0.312,0.355,0.209,62,5,19,66.0,62
2,Jake Irvin,WSN,WSN,PHI,0.283,5.46,5.21,0.5,1.41,4.97,...,3.69,0.744,0.297,0.324,0.153,56,7,19,54.0,-82
3,Dane Dunning,TEX,TEX,MIL,0.272,3.93,4.3,1.9,1.14,4.44,...,3.83,0.807,0.305,0.346,0.208,72,6,19,79.0,194
4,Brandon Williamson,CIN,CIN,TOR,0.252,4.81,4.78,1.0,1.22,4.77,...,3.84,0.73,0.295,0.315,0.184,64,5,14,60.0,-19
5,Tanner Bibee,CLE,CLE,DET,0.292,3.61,4.35,2.3,1.21,4.25,...,3.88,0.666,0.285,0.291,0.124,59,5,14,61.0,-1
6,Framber Valdez,HOU,HOU,SEA,0.28,3.43,3.29,3.4,1.08,3.57,...,4.39,0.772,0.283,0.336,0.184,70,6,19,70.0,88
7,Sonny Gray,MIN,MIN,PIT,0.306,2.79,3.61,4.0,1.22,3.98,...,3.81,0.783,0.316,0.338,0.198,64,5,17,66.0,44
8,Miles Mikolas,STL,STL,NYM,0.309,3.85,4.64,2.8,1.28,4.72,...,3.9,0.778,0.296,0.337,0.178,54,5,14,58.0,-38
9,Yonny Chirinos,ATL,ATL,SFG,0.276,5.45,5.02,-0.3,1.36,4.99,...,3.88,0.89,0.318,0.377,0.24,79,8,18,80.0,212


In [171]:
away_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 27 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Name        15 non-null     object        
 1   Team        15 non-null     object        
 2   Park        15 non-null     object        
 3   Opponent    15 non-null     object        
 4   A_BABIP_SP  15 non-null     float64       
 5   A_FIP       15 non-null     float64       
 6   A_xFIP      15 non-null     float64       
 7   A_WAR       15 non-null     float64       
 8   A_WHIP      15 non-null     float64       
 9   A_SIERA     15 non-null     float64       
 10  A_RS/9      15 non-null     float64       
 11  A_Date      15 non-null     datetime64[ns]
 12  A_Avg_Outs  15 non-null     float64       
 13  A_BABIP_BP  15 non-null     float64       
 14  A_FIP_BP    15 non-null     float64       
 15  A_xFIP_BP   15 non-null     float64       
 16  A_WHIP_BP   15 non-null     

In [172]:
away_df = away_df.rename(columns = {'Name': 'A_Starter', 'Team': 'A_Team', 'Opponent': 'A_Opponent', 'Team_Wins': 'A_Team_Wins',\
                                   'Wins_L10': 'A_Wins_L10', 'Wins_L30': 'A_Wins_L30', 'Py_Wins': 'A_Py_Wins', 'Run_Diff': 'A_Run_Diff'})

In [173]:
# away_df = away_df.drop(['Name'], axis = 1)

In [174]:
home_df = home_df.rename(columns = {'Name': 'H_Starter', 'Team': 'H_Team', 'Opponent': 'H_Opponent', 'Team_Wins': 'H_Team_Wins',\
                                   'Wins_L10': 'H_Wins_L10', 'Wins_L30': 'H_Wins_L30', 'Py_Wins': 'H_Py_Wins', 'Run_Diff': 'H_Run_Diff'})

In [175]:
# home_df = home_df.drop(['Name'], axis = 1)

In [176]:
away_df

Unnamed: 0,A_Starter,A_Team,Park,A_Opponent,A_BABIP_SP,A_FIP,A_xFIP,A_WAR,A_WHIP,A_SIERA,...,A_SIERA_BP,A_OPS,A_BABIP,A_wOBA,A_ISO,A_Team_Wins,A_Wins_L10,A_Wins_L30,A_Py_Wins,A_Run_Diff
0,Kutter Crawford,BOS,NYY,NYY,0.276,4.17,4.35,1.2,1.13,3.95,...,3.91,0.735,0.296,0.316,0.173,64,6,15,64.0,28
1,Brady Singer,KCR,CHC,CHC,0.309,3.88,4.19,2.1,1.32,4.36,...,4.84,0.785,0.311,0.335,0.186,40,4,13,46.0,-163
2,Cristopher Sanchez,PHI,WSN,WSN,0.237,4.67,3.54,0.6,1.01,3.63,...,4.16,0.74,0.291,0.322,0.175,66,5,16,65.0,36
3,Freddy Peralta,MIL,TEX,TEX,0.28,4.01,3.75,2.1,1.19,3.71,...,3.63,0.698,0.283,0.305,0.14,66,6,15,61.0,-4
4,Chris Bassitt,TOR,CIN,CIN,0.277,4.56,4.4,1.5,1.24,4.35,...,3.96,0.728,0.297,0.32,0.149,67,4,15,67.0,54
5,Matt Manning,DET,CLE,CLE,0.236,5.24,5.62,0.1,1.16,5.19,...,4.18,0.711,0.305,0.307,0.159,55,6,14,50.0,-102
6,Logan Gilbert,SEA,HOU,HOU,0.276,3.58,3.65,2.8,1.06,3.68,...,3.79,0.801,0.329,0.345,0.201,67,7,21,68.0,64
7,Mitch Keller,PIT,MIN,MIN,0.311,3.86,3.82,2.4,1.28,3.9,...,3.62,0.727,0.286,0.317,0.179,54,3,13,51.0,-95
8,Kodai Senga,NYM,STL,STL,0.292,3.54,3.76,2.5,1.29,4.05,...,4.38,0.718,0.264,0.314,0.167,57,6,14,57.0,-41
9,Logan Webb,SFG,ATL,ATL,0.3,3.24,2.95,3.5,1.07,3.15,...,3.73,0.593,0.254,0.262,0.117,64,3,13,62.0,8


In [177]:
away_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 27 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   A_Starter    15 non-null     object        
 1   A_Team       15 non-null     object        
 2   Park         15 non-null     object        
 3   A_Opponent   15 non-null     object        
 4   A_BABIP_SP   15 non-null     float64       
 5   A_FIP        15 non-null     float64       
 6   A_xFIP       15 non-null     float64       
 7   A_WAR        15 non-null     float64       
 8   A_WHIP       15 non-null     float64       
 9   A_SIERA      15 non-null     float64       
 10  A_RS/9       15 non-null     float64       
 11  A_Date       15 non-null     datetime64[ns]
 12  A_Avg_Outs   15 non-null     float64       
 13  A_BABIP_BP   15 non-null     float64       
 14  A_FIP_BP     15 non-null     float64       
 15  A_xFIP_BP    15 non-null     float64       
 16  A_WHIP_BP 

In [178]:
home_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 27 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   H_Starter    15 non-null     object        
 1   H_Team       15 non-null     object        
 2   Park         15 non-null     object        
 3   H_Opponent   15 non-null     object        
 4   H_BABIP_SP   15 non-null     float64       
 5   H_FIP        15 non-null     float64       
 6   H_xFIP       15 non-null     float64       
 7   H_WAR        15 non-null     float64       
 8   H_WHIP       15 non-null     float64       
 9   H_SIERA      15 non-null     float64       
 10  H_RS/9       15 non-null     float64       
 11  H_Date       15 non-null     datetime64[ns]
 12  H_Avg_Outs   15 non-null     float64       
 13  H_BABIP_BP   15 non-null     float64       
 14  H_FIP_BP     15 non-null     float64       
 15  H_xFIP_BP    15 non-null     float64       
 16  H_WHIP_BP 

In [179]:
away_df = away_df.rename(columns = {'A_Date': 'Date'})

In [180]:
home_df = home_df.rename(columns = {'H_Date': 'Date'})

In [181]:
main_df = pd.merge(away_df, home_df, on = ['Park', 'Date'], how = 'left')
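
Because the join keys are only `Park` and `Date`, a doubleheader (two games at one park on one date) would pair every away row with every home row for that park. Below is a minimal sketch of how the `validate` argument of `pd.merge` could surface that case. The toy frames and their values are assumptions for illustration, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for away_df and home_df (illustrative rows only).
away = pd.DataFrame({
    'Park': ['NYY', 'CHC'],
    'Date': pd.to_datetime(['2023-08-19', '2023-08-19']),
    'A_Team': ['BOS', 'KCR'],
})
home = pd.DataFrame({
    'Park': ['NYY', 'CHC'],
    'Date': pd.to_datetime(['2023-08-19', '2023-08-19']),
    'H_Team': ['NYY', 'CHC'],
})

# validate='one_to_one' raises MergeError if a (Park, Date) pair repeats on either side.
merged = pd.merge(away, home, on=['Park', 'Date'], how='left', validate='one_to_one')

# Simulate a doubleheader by repeating the first game on both sides.
away_dh = pd.concat([away, away.iloc[[0]]], ignore_index=True)
home_dh = pd.concat([home, home.iloc[[0]]], ignore_index=True)
try:
    pd.merge(away_dh, home_dh, on=['Park', 'Date'], how='left', validate='one_to_one')
    doubleheader_flagged = False
except pd.errors.MergeError:
    doubleheader_flagged = True
```

Without `validate`, the doubleheader case would silently produce four rows for two games.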

In [182]:
main_df

Unnamed: 0,A_Starter,A_Team,Park,A_Opponent,A_BABIP_SP,A_FIP,A_xFIP,A_WAR,A_WHIP,A_SIERA,...,H_SIERA_BP,H_OPS,H_BABIP,H_wOBA,H_ISO,H_Team_Wins,H_Wins_L10,H_Wins_L30,H_Py_Wins,H_Run_Diff
0,Kutter Crawford,BOS,NYY,NYY,0.276,4.17,4.35,1.2,1.13,3.95,...,3.58,0.711,0.289,0.313,0.15,60,2,11,59.0,-19
1,Brady Singer,KCR,CHC,CHC,0.309,3.88,4.19,2.1,1.32,4.36,...,4.04,0.826,0.312,0.355,0.209,62,5,19,66.0,62
2,Cristopher Sanchez,PHI,WSN,WSN,0.237,4.67,3.54,0.6,1.01,3.63,...,3.69,0.744,0.297,0.324,0.153,56,7,19,54.0,-82
3,Freddy Peralta,MIL,TEX,TEX,0.28,4.01,3.75,2.1,1.19,3.71,...,3.83,0.807,0.305,0.346,0.208,72,6,19,79.0,194
4,Chris Bassitt,TOR,CIN,CIN,0.277,4.56,4.4,1.5,1.24,4.35,...,3.84,0.73,0.295,0.315,0.184,64,5,14,60.0,-19
5,Matt Manning,DET,CLE,CLE,0.236,5.24,5.62,0.1,1.16,5.19,...,3.88,0.666,0.285,0.291,0.124,59,5,14,61.0,-1
6,Logan Gilbert,SEA,HOU,HOU,0.276,3.58,3.65,2.8,1.06,3.68,...,4.39,0.772,0.283,0.336,0.184,70,6,19,70.0,88
7,Mitch Keller,PIT,MIN,MIN,0.311,3.86,3.82,2.4,1.28,3.9,...,3.81,0.783,0.316,0.338,0.198,64,5,17,66.0,44
8,Kodai Senga,NYM,STL,STL,0.292,3.54,3.76,2.5,1.29,4.05,...,3.9,0.778,0.296,0.337,0.178,54,5,14,58.0,-38
9,Logan Webb,SFG,ATL,ATL,0.3,3.24,2.95,3.5,1.07,3.15,...,3.88,0.89,0.318,0.377,0.24,79,8,18,80.0,212


In [183]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 52 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   A_Starter    15 non-null     object        
 1   A_Team       15 non-null     object        
 2   Park         15 non-null     object        
 3   A_Opponent   15 non-null     object        
 4   A_BABIP_SP   15 non-null     float64       
 5   A_FIP        15 non-null     float64       
 6   A_xFIP       15 non-null     float64       
 7   A_WAR        15 non-null     float64       
 8   A_WHIP       15 non-null     float64       
 9   A_SIERA      15 non-null     float64       
 10  A_RS/9       15 non-null     float64       
 11  Date         15 non-null     datetime64[ns]
 12  A_Avg_Outs   15 non-null     float64       
 13  A_BABIP_BP   15 non-null     float64       
 14  A_FIP_BP     15 non-null     float64       
 15  A_xFIP_BP    15 non-null     float64       
 16  A_WHIP_BP 

# 2023 Park Factors
We read the park factors from a CSV that we paste from MLB.com.<br>

The process is a little clunky: we copy the spreadsheet from MLB.com by hand, and we repeat this periodically because the factors change slightly over the course of the season.

In [184]:
ballparks_df = pd.read_csv('ParkFactors2023.csv')

In [185]:
ballparks_df = clean_unnamed(ballparks_df)

In [186]:
ballparks_df = ballparks_df.iloc[1:, [0, 3]]

In [187]:
col_names = ['Team', 'Park_Factor']

In [188]:
ballparks_df = ballparks_df.rename(columns = {'Unnamed: 1': 'Team', 'Unnamed: 4': 'Park_Factor'})

In [189]:
ballparks_df = ballparks_df.dropna()

In [190]:
park_teams = ballparks_df['Team'].tolist()

In [191]:
park_teams = sorted(park_teams)

In [192]:
park_teams

['Angels',
 'Astros',
 'Athletics',
 'Blue Jays',
 'Braves',
 'Brewers',
 'Cardinals',
 'Cubs',
 'D-backs',
 'Dodgers',
 'Giants',
 'Guardians',
 'Mariners',
 'Marlins',
 'Mets',
 'Nationals',
 'Orioles',
 'Padres',
 'Phillies',
 'Pirates',
 'Rangers',
 'Rays',
 'Red Sox',
 'Reds',
 'Rockies',
 'Royals',
 'Tigers',
 'Twins',
 'White Sox',
 'Yankees']

In [193]:
team_codes = ['LAA', 'HOU', 'OAK', 'TOR', 'ATL', 'MIL', 'STL', 'CHC', 'ARI', 'LAD', 'SFG', 'CLE', 'SEA', 'MIA', 'NYM', 'WSN',
              'BAL', 'SDP', 'PHI', 'PIT', 'TEX', 'TBR', 'BOS', 'CIN', 'COL', 'KCR', 'DET', 'MIN', 'CHW', 'NYY']

In [194]:
park_dict = dict(zip(park_teams, team_codes))

In [195]:
park_dict

{'Angels': 'LAA',
 'Astros': 'HOU',
 'Athletics': 'OAK',
 'Blue Jays': 'TOR',
 'Braves': 'ATL',
 'Brewers': 'MIL',
 'Cardinals': 'STL',
 'Cubs': 'CHC',
 'D-backs': 'ARI',
 'Dodgers': 'LAD',
 'Giants': 'SFG',
 'Guardians': 'CLE',
 'Mariners': 'SEA',
 'Marlins': 'MIA',
 'Mets': 'NYM',
 'Nationals': 'WSN',
 'Orioles': 'BAL',
 'Padres': 'SDP',
 'Phillies': 'PHI',
 'Pirates': 'PIT',
 'Rangers': 'TEX',
 'Rays': 'TBR',
 'Red Sox': 'BOS',
 'Reds': 'CIN',
 'Rockies': 'COL',
 'Royals': 'KCR',
 'Tigers': 'DET',
 'Twins': 'MIN',
 'White Sox': 'CHW',
 'Yankees': 'NYY'}
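
The dictionary above is built by position: the i-th sorted nickname is paired with the i-th code, so it stays correct only while both lists keep the same order and length. Below is a small sketch of a guard one might add, shown on a three-team subset. The helper name `build_park_dict` is hypothetical. It catches length mismatches and duplicates, not a misordered list, so the printed dictionary still needs a visual check.

```python
def build_park_dict(names, codes):
    """Pair sorted nicknames with codes, refusing mismatched lengths or duplicates."""
    names = sorted(names)
    if len(names) != len(codes):
        raise ValueError(f"{len(names)} names but {len(codes)} codes")
    if len(set(names)) != len(names) or len(set(codes)) != len(codes):
        raise ValueError("duplicate name or code")
    return dict(zip(names, codes))

# Subset of the notebook's lists; the codes are given in sorted-nickname order.
subset = build_park_dict(['Yankees', 'Angels', 'Cubs'], ['LAA', 'CHC', 'NYY'])
```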

In [196]:
ballparks_df['Team'] = ballparks_df['Team'].replace(park_dict)

In [197]:
ballparks_df

Unnamed: 0,Team,Park_Factor
1,COL,111
2,BOS,109
3,CIN,107
5,BAL,103
6,KCR,103
7,CHC,102
8,WSN,102
9,LAA,101
10,ARI,101
11,PIT,101


In [198]:
ballparks_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 1 to 31
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Team         30 non-null     object
 1   Park_Factor  30 non-null     object
dtypes: object(2)
memory usage: 720.0+ bytes


In [199]:
ballparks_df['Park_Factor'] = ballparks_df['Park_Factor'].astype('int')

In [200]:
ballparks_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 1 to 31
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Team         30 non-null     object
 1   Park_Factor  30 non-null     int32 
dtypes: int32(1), object(1)
memory usage: 600.0+ bytes


In [201]:
park_factor_dict = ballparks_df.set_index('Team')['Park_Factor'].to_dict()

In [202]:
park_factor_dict

{'COL': 111,
 'BOS': 109,
 'CIN': 107,
 'BAL': 103,
 'KCR': 103,
 'CHC': 102,
 'WSN': 102,
 'LAA': 101,
 'ARI': 101,
 'PIT': 101,
 'ATL': 101,
 'PHI': 101,
 'TEX': 101,
 'MIN': 101,
 'CHW': 100,
 'TOR': 100,
 'LAD': 100,
 'HOU': 99,
 'STL': 98,
 'SFG': 98,
 'NYY': 98,
 'CLE': 97,
 'MIL': 97,
 'MIA': 97,
 'DET': 97,
 'OAK': 96,
 'NYM': 96,
 'TBR': 95,
 'SDP': 95,
 'SEA': 93}

In [203]:
main_df['Park'] = main_df['Park'].replace(park_factor_dict)
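
`replace` leaves any value it cannot find in the dictionary untouched, so a park code missing from `park_factor_dict` would stay a string in `Park`. The sketch below uses a deliberately fake code, `'XXX'`, to show how `Series.map` makes such a miss visible as `NaN`.

```python
import pandas as pd

park_factors = {'NYY': 98, 'CHC': 102}                     # two entries from the dictionary above
parks = pd.Series(['NYY', 'CHC', 'XXX'], dtype=object)     # 'XXX' is a made-up code with no factor

replaced = parks.replace(park_factors)      # unmatched 'XXX' passes through unchanged
mapped = parks.map(park_factors)            # unmatched 'XXX' becomes NaN

unmapped = parks[mapped.isna()].tolist()    # codes that still need a park factor
```

An empty `unmapped` list would confirm that every park received a numeric factor.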

In [204]:
main_df

Unnamed: 0,A_Starter,A_Team,Park,A_Opponent,A_BABIP_SP,A_FIP,A_xFIP,A_WAR,A_WHIP,A_SIERA,...,H_SIERA_BP,H_OPS,H_BABIP,H_wOBA,H_ISO,H_Team_Wins,H_Wins_L10,H_Wins_L30,H_Py_Wins,H_Run_Diff
0,Kutter Crawford,BOS,98,NYY,0.276,4.17,4.35,1.2,1.13,3.95,...,3.58,0.711,0.289,0.313,0.15,60,2,11,59.0,-19
1,Brady Singer,KCR,102,CHC,0.309,3.88,4.19,2.1,1.32,4.36,...,4.04,0.826,0.312,0.355,0.209,62,5,19,66.0,62
2,Cristopher Sanchez,PHI,102,WSN,0.237,4.67,3.54,0.6,1.01,3.63,...,3.69,0.744,0.297,0.324,0.153,56,7,19,54.0,-82
3,Freddy Peralta,MIL,101,TEX,0.28,4.01,3.75,2.1,1.19,3.71,...,3.83,0.807,0.305,0.346,0.208,72,6,19,79.0,194
4,Chris Bassitt,TOR,107,CIN,0.277,4.56,4.4,1.5,1.24,4.35,...,3.84,0.73,0.295,0.315,0.184,64,5,14,60.0,-19
5,Matt Manning,DET,97,CLE,0.236,5.24,5.62,0.1,1.16,5.19,...,3.88,0.666,0.285,0.291,0.124,59,5,14,61.0,-1
6,Logan Gilbert,SEA,99,HOU,0.276,3.58,3.65,2.8,1.06,3.68,...,4.39,0.772,0.283,0.336,0.184,70,6,19,70.0,88
7,Mitch Keller,PIT,101,MIN,0.311,3.86,3.82,2.4,1.28,3.9,...,3.81,0.783,0.316,0.338,0.198,64,5,17,66.0,44
8,Kodai Senga,NYM,98,STL,0.292,3.54,3.76,2.5,1.29,4.05,...,3.9,0.778,0.296,0.337,0.178,54,5,14,58.0,-38
9,Logan Webb,SFG,101,ATL,0.3,3.24,2.95,3.5,1.07,3.15,...,3.88,0.89,0.318,0.377,0.24,79,8,18,80.0,212


In [205]:
main_df = main_df.rename(columns = {'A_Team': 'Away','H_Team': 'Home'})

# 5-column index
We move the home and away teams, the home and away starters and the game date into the index so that each row still identifies its game. The model needs every remaining column to be numeric, so these identifiers cannot stay as regular columns.

In [206]:
main_df = main_df.set_index(['Away', 'Home', 'A_Starter', 'H_Starter', 'Date'])
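
One way to confirm that only numeric columns remain once the identifiers are in the index is `select_dtypes`. The sketch below uses a toy frame with a few values from the first game above. The frame itself is an illustration, not the notebook's `main_df`.

```python
import pandas as pd

toy = pd.DataFrame({
    'Away': ['BOS'], 'Home': ['NYY'],
    'A_Starter': ['Kutter Crawford'], 'H_Starter': ['Gerrit Cole'],
    'Date': pd.to_datetime(['2023-08-19']),
    'Park': [98], 'A_Opponent': ['NYY'], 'H_Opponent': ['BOS'], 'A_FIP': [4.17],
})
toy = toy.set_index(['Away', 'Home', 'A_Starter', 'H_Starter', 'Date'])

# Columns the model could not consume: anything that is not a number.
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
```

Here the check flags the two opponent columns, which are still strings after the index is set.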

In [207]:
main_df[['H_Opponent']]

Away,Home,A_Starter,H_Starter,Date,H_Opponent
BOS,NYY,Kutter Crawford,Gerrit Cole,2023-08-19,BOS
KCR,CHC,Brady Singer,Justin Steele,2023-08-19,KCR
PHI,WSN,Cristopher Sanchez,Jake Irvin,2023-08-19,PHI
MIL,TEX,Freddy Peralta,Dane Dunning,2023-08-19,MIL
TOR,CIN,Chris Bassitt,Brandon Williamson,2023-08-19,TOR
DET,CLE,Matt Manning,Tanner Bibee,2023-08-19,DET
SEA,HOU,Logan Gilbert,Framber Valdez,2023-08-19,SEA
PIT,MIN,Mitch Keller,Sonny Gray,2023-08-19,PIT
NYM,STL,Kodai Senga,Miles Mikolas,2023-08-19,NYM
SFG,ATL,Logan Webb,Yonny Chirinos,2023-08-19,SFG


In [208]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 15 entries, ('BOS', 'NYY', 'Kutter Crawford', 'Gerrit Cole', Timestamp('2023-08-19 00:00:00')) to ('MIA', 'LAD', 'Braxton Garrett', 'Julio Urias', Timestamp('2023-08-19 00:00:00'))
Data columns (total 47 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Park         15 non-null     int64  
 1   A_Opponent   15 non-null     object 
 2   A_BABIP_SP   15 non-null     float64
 3   A_FIP        15 non-null     float64
 4   A_xFIP       15 non-null     float64
 5   A_WAR        15 non-null     float64
 6   A_WHIP       15 non-null     float64
 7   A_SIERA      15 non-null     float64
 8   A_RS/9       15 non-null     float64
 9   A_Avg_Outs   15 non-null     float64
 10  A_BABIP_BP   15 non-null     float64
 11  A_FIP_BP     15 non-null     float64
 12  A_xFIP_BP    15 non-null     float64
 13  A_WHIP_BP    15 non-null     float64
 14  A_SIERA_BP   15 non-null     float64
 15  A_OPS        15 n

In [209]:
main_df = main_df.drop(['A_Opponent', 'H_Opponent'], axis = 1)

In [210]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 15 entries, ('BOS', 'NYY', 'Kutter Crawford', 'Gerrit Cole', Timestamp('2023-08-19 00:00:00')) to ('MIA', 'LAD', 'Braxton Garrett', 'Julio Urias', Timestamp('2023-08-19 00:00:00'))
Data columns (total 45 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Park         15 non-null     int64  
 1   A_BABIP_SP   15 non-null     float64
 2   A_FIP        15 non-null     float64
 3   A_xFIP       15 non-null     float64
 4   A_WAR        15 non-null     float64
 5   A_WHIP       15 non-null     float64
 6   A_SIERA      15 non-null     float64
 7   A_RS/9       15 non-null     float64
 8   A_Avg_Outs   15 non-null     float64
 9   A_BABIP_BP   15 non-null     float64
 10  A_FIP_BP     15 non-null     float64
 11  A_xFIP_BP    15 non-null     float64
 12  A_WHIP_BP    15 non-null     float64
 13  A_SIERA_BP   15 non-null     float64
 14  A_OPS        15 non-null     float64
 15  A_BABIP      15 n

# Feature Engineering
The Major League Baseball season is 162 games, starting in late March and ending in the first week of October. The best teams win close to 100 games, and sometimes more. A team with 50 wins in June is a good team, but a team with 50 wins in September is a bad one, so raw win totals are not an effective feature. Instead we derive differentials: how many more wins does a team have than its opponent? We're trying to predict the winner of a single game, so what matters is how much better one team is than the other.<br>

We engineer these differentials for all features from the perspective of the home team, so our model can answer the question: Will the home team win? (1 = home win, 0 = away win). For example, Py_Diff is how many more Pythagorean Wins the home team has than the away team; it is negative when the home team has fewer. ISO_Diff is how many more points of isolated power the home team has than the away team.

In [211]:
main_df['A_Win_Diff'] = main_df['A_Team_Wins'] - main_df['H_Team_Wins']

In [212]:
main_df['H_Win_Diff'] = main_df['H_Team_Wins'] - main_df['A_Team_Wins']

In [213]:
main_df['A_Py_Diff'] = main_df['A_Py_Wins'] - main_df['H_Py_Wins']

In [214]:
main_df['H_Py_Diff'] = main_df['H_Py_Wins'] - main_df['A_Py_Wins']

In [215]:
main_df = main_df.drop(columns = ['A_Team_Wins', 'H_Team_Wins', 'A_Py_Wins', 'H_Py_Wins'])

In [216]:
main_df['A_Py_Diff'].sum(), main_df['H_Py_Diff'].sum()

(-34.0, 34.0)

In [217]:
main_df['A_Win_Diff'].sum(), main_df['H_Win_Diff'].sum()

(9, -9)

In [218]:
main_df = main_df.drop(columns = ['A_Win_Diff', 'A_Py_Diff'])

In [219]:
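# Caution: without columns=, rename() matches this mapping against the row index,
# so the columns keep the names H_Win_Diff and H_Py_Diff (see the info() output below).
# Use main_df.rename(columns = {...}) if Win_Diff and Py_Diff are the intended names.
main_df = main_df.rename({'H_Win_Diff': 'Win_Diff', 'H_Py_Diff': 'Py_Diff'})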
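
`DataFrame.rename` applies a bare mapping to the row index by default, so column names change only when the mapping is passed as `columns=`. Below is a toy sketch of the difference. The two-row frame is illustrative, with values taken from the first two games.

```python
import pandas as pd

toy = pd.DataFrame({'H_Win_Diff': [-4, 22], 'H_Py_Diff': [-5.0, 20.0]})
mapping = {'H_Win_Diff': 'Win_Diff', 'H_Py_Diff': 'Py_Diff'}

index_rename = toy.rename(mapping)            # looks for these labels in the index: no match
column_rename = toy.rename(columns=mapping)   # renames the columns
```

The first call returns the frame unchanged and raises no error, which is why the mismatch is easy to miss.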

In [220]:
main_df['Run_Diff'] = main_df['H_Run_Diff'] - main_df['A_Run_Diff'] 

In [221]:
main_df = main_df.drop(columns = ['H_Run_Diff', 'A_Run_Diff'])

In [222]:
main_df['W_L10_Diff'] = main_df['H_Wins_L10'] - main_df['A_Wins_L10']
main_df['W_L30_Diff'] = main_df['H_Wins_L30'] - main_df['A_Wins_L30']

In [223]:
main_df = main_df.drop(columns = ['H_Wins_L10', 'A_Wins_L10', 'H_Wins_L30', 'A_Wins_L30'])

# BABIP
BABIP is the only feature we have for all three groups: starting pitchers, team hitting and team bullpens. Most of our features compare like with like, for example one starting pitcher's WAR against the other starting pitcher's WAR. But pitchers don't oppose each other directly.<br>

These BABIP features provide us with our only opportunity to derive "matchup" data that compares hitters to pitchers. They indicate:<br>

+ Home team's hitting BABIP minus the BABIP allowed by the away team's starting pitcher<br>
+ Home team's hitting BABIP minus the BABIP allowed by the away team's bullpen<br>
+ BABIP allowed by the home team's starting pitcher minus the away team's hitting BABIP<br>
+ BABIP allowed by the home team's bullpen minus the away team's hitting BABIP<br>
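
As a worked check using the first game above (BOS at NYY): the Yankees' hitting BABIP is 0.289 and the BABIP allowed by Kutter Crawford is 0.276, so the home hitters' figure exceeds the away starter's by 0.013. The variable names below are for illustration only.

```python
h_babip = 0.289      # NYY team hitting BABIP (H_BABIP)
a_babip_sp = 0.276   # BABIP allowed by Kutter Crawford (A_BABIP_SP)

# Positive means the home hitters' BABIP exceeds what the away starter allows.
babip_vs_sp = round(h_babip - a_babip_sp, 3)
```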

In [224]:
main_df['BABIP_vs_SP'] = main_df['H_BABIP'] - main_df['A_BABIP_SP']
main_df['BABIP_vs_BP'] = main_df['H_BABIP'] - main_df['A_BABIP_BP']
main_df['BABIP_SP_vs_Hit'] = main_df['H_BABIP_SP'] - main_df['A_BABIP']
main_df['BABIP_BP_vs_Hit'] = main_df['H_BABIP_BP'] - main_df['A_BABIP']

In [225]:
main_df = main_df.drop(columns = ['A_BABIP', 'A_BABIP_SP', 'A_BABIP_BP', 'H_BABIP', 'H_BABIP_SP', 'H_BABIP_BP'])

In [226]:
main_df['Avg_Outs_Diff'] = main_df['H_Avg_Outs'] - main_df['A_Avg_Outs']
main_df['FIP_Diff'] = main_df['H_FIP'] - main_df['A_FIP']
main_df['FIP_BP_Diff'] = main_df['H_FIP_BP'] - main_df['A_FIP_BP']
main_df['ISO_Diff'] = main_df['H_ISO'] - main_df['A_ISO']
main_df['OPS_Diff'] = main_df['H_OPS'] - main_df['A_OPS']
main_df['RS/9_Diff'] = main_df['H_RS/9'] - main_df['A_RS/9']
main_df['SIERA_Diff'] = main_df['H_SIERA'] - main_df['A_SIERA']
main_df['SIERA_BP_Diff'] = main_df['H_SIERA_BP'] - main_df['A_SIERA_BP']
main_df['WAR_Diff'] = main_df['H_WAR'] - main_df['A_WAR']
main_df['WHIP_Diff'] = main_df['H_WHIP'] - main_df['A_WHIP']
main_df['WHIP_BP_Diff'] = main_df['H_WHIP_BP'] - main_df['A_WHIP_BP']
main_df['wOBA_Diff'] = main_df['H_wOBA'] - main_df['A_wOBA']
main_df['xFIP_Diff'] = main_df['H_xFIP'] - main_df['A_xFIP']
main_df['xFIP_BP_Diff'] = main_df['H_xFIP_BP'] - main_df['A_xFIP_BP']

In [227]:
main_df = main_df.drop(columns = ['A_Avg_Outs', 'H_Avg_Outs', 'A_FIP', 'H_FIP', 'A_FIP_BP', 'H_FIP_BP', 'A_ISO', 'H_ISO',
                                  'A_OPS', 'H_OPS', 'A_RS/9', 'H_RS/9', 'A_SIERA', 'H_SIERA', 'A_SIERA_BP', 'H_SIERA_BP',
                                  'A_WAR', 'H_WAR', 'A_WHIP', 'H_WHIP', 'A_WHIP_BP', 'H_WHIP_BP', 'A_wOBA', 'H_wOBA',
                                  'A_xFIP', 'H_xFIP', 'A_xFIP_BP', 'H_xFIP_BP'])
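
The fourteen home-minus-away columns follow one pattern, so they could also be generated in a loop. Below is a minimal sketch on a toy frame. The helper name `add_home_diffs` is hypothetical and not part of the notebook, and the input values are illustrative.

```python
import pandas as pd

def add_home_diffs(df, stats):
    """Return a copy with <stat>_Diff = H_<stat> - A_<stat>, dropping the raw columns."""
    out = df.copy()
    for stat in stats:
        out[f'{stat}_Diff'] = out[f'H_{stat}'] - out[f'A_{stat}']
    return out.drop(columns=[f'{side}_{stat}' for stat in stats for side in ('A', 'H')])

# Illustrative values for a single game.
toy = pd.DataFrame({'A_WAR': [1.2], 'H_WAR': [3.5], 'A_WHIP': [1.13], 'H_WHIP': [1.05]})
diffs = add_home_diffs(toy, ['WAR', 'WHIP'])
```

Keeping the stat names in one list means the derive step and the drop step cannot fall out of sync.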

In [228]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 15 entries, ('BOS', 'NYY', 'Kutter Crawford', 'Gerrit Cole', Timestamp('2023-08-19 00:00:00')) to ('MIA', 'LAD', 'Braxton Garrett', 'Julio Urias', Timestamp('2023-08-19 00:00:00'))
Data columns (total 24 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Park             15 non-null     int64  
 1   H_Win_Diff       15 non-null     int64  
 2   H_Py_Diff        15 non-null     float64
 3   Run_Diff         15 non-null     int64  
 4   W_L10_Diff       15 non-null     int64  
 5   W_L30_Diff       15 non-null     int64  
 6   BABIP_vs_SP      15 non-null     float64
 7   BABIP_vs_BP      15 non-null     float64
 8   BABIP_SP_vs_Hit  15 non-null     float64
 9   BABIP_BP_vs_Hit  15 non-null     float64
 10  Avg_Outs_Diff    15 non-null     float64
 11  FIP_Diff         15 non-null     float64
 12  FIP_BP_Diff      15 non-null     float64
 13  ISO_Diff         15 non-null     float64


In [229]:
main_df_cols = list(main_df.columns)

In [230]:
main_df

Away,Home,A_Starter,H_Starter,Date,Park,H_Win_Diff,H_Py_Diff,Run_Diff,W_L10_Diff,W_L30_Diff,BABIP_vs_SP,BABIP_vs_BP,BABIP_SP_vs_Hit,BABIP_BP_vs_Hit,...,OPS_Diff,RS/9_Diff,SIERA_Diff,SIERA_BP_Diff,WAR_Diff,WHIP_Diff,WHIP_BP_Diff,wOBA_Diff,xFIP_Diff,xFIP_BP_Diff
BOS,NYY,Kutter Crawford,Gerrit Cole,2023-08-19,98,-4,-5.0,-47,-4,-4,0.013,-0.053,-0.024,-0.048,...,-0.024,-0.02,-0.21,-0.33,2.3,-0.08,-0.37,-0.003,-0.67,-0.47
KCR,CHC,Brady Singer,Justin Steele,2023-08-19,102,22,20.0,225,1,6,0.003,0.017,-0.005,-0.072,...,0.041,1.0,-0.5,-0.8,1.1,-0.15,-0.32,0.02,-0.51,-1.0
PHI,WSN,Cristopher Sanchez,Jake Irvin,2023-08-19,102,-10,-11.0,-118,2,3,0.06,0.013,-0.008,-0.021,...,0.004,-0.26,1.34,-0.47,-0.1,0.4,-0.14,0.002,1.67,-0.4
MIL,TEX,Freddy Peralta,Dane Dunning,2023-08-19,101,6,18.0,198,0,4,0.025,0.041,-0.011,-0.021,...,0.109,0.16,0.73,0.2,-0.2,-0.05,0.06,0.041,0.55,0.42
TOR,CIN,Chris Bassitt,Brandon Williamson,2023-08-19,107,-3,-7.0,-73,1,-1,0.018,0.017,-0.045,-0.047,...,0.002,0.15,0.42,-0.12,-0.5,-0.02,0.02,-0.005,0.38,0.01
DET,CLE,Matt Manning,Tanner Bibee,2023-08-19,97,4,11.0,101,-1,0,0.049,-0.03,-0.013,0.02,...,-0.045,0.39,-0.94,-0.3,2.2,0.05,0.01,-0.016,-1.27,-0.57
SEA,HOU,Logan Gilbert,Framber Valdez,2023-08-19,99,3,2.0,24,-1,-2,0.007,0.0,-0.049,-0.101,...,-0.029,-0.67,-0.11,0.6,0.6,0.02,0.05,-0.009,-0.36,0.88
PIT,MIN,Mitch Keller,Sonny Gray,2023-08-19,101,10,15.0,139,2,4,0.005,0.016,0.02,0.027,...,0.056,-0.74,0.08,0.19,1.6,-0.06,0.09,0.021,-0.21,0.14
NYM,STL,Kodai Senga,Miles Mikolas,2023-08-19,98,-3,1.0,3,-1,0,0.004,0.008,0.045,0.063,...,0.06,-1.04,0.67,-0.48,0.3,-0.01,-0.03,0.023,0.88,-0.65
SFG,ATL,Logan Webb,Yonny Chirinos,2023-08-19,101,15,18.0,204,5,5,0.018,0.009,0.022,0.022,...,0.297,3.29,1.84,0.15,-3.8,0.29,0.02,0.115,2.07,-0.13


In [231]:
filepath = r'C:\Users\Owner\Sports Betting\MLB_Game_Outcome\Live_Game_Features_' + today_str + '.csv'
main_df.to_csv(filepath)

In [232]:
import os

csv_file_paths = ['Live_Game_Features_' + yesterday_str + '.csv', '2023_Win_Features_' + yesterday_str + '.csv']

for file in csv_file_paths:
    try:
        os.remove(file)
        print(f"CSV file '{file}' deleted successfully.")
    except FileNotFoundError:
        print(f"CSV file '{file}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

CSV file 'Live_Game_Features_2023-08-18.csv' deleted successfully.
CSV file '2023_Win_Features_2023-08-18.csv' deleted successfully.
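
The save step uses an absolute path while the cleanup uses bare filenames, so the two point at the same folder only when the notebook's working directory is the output folder. Below is a sketch of building both from one base directory with `pathlib`. `base_dir` and the date strings are placeholders, and a temporary directory stands in for the real folder.

```python
import tempfile
from pathlib import Path

base_dir = Path(tempfile.mkdtemp())   # placeholder for the project's output folder
today_str, yesterday_str = '2023-08-19', '2023-08-18'

today_file = base_dir / f'Live_Game_Features_{today_str}.csv'
stale_file = base_dir / f'Live_Game_Features_{yesterday_str}.csv'
today_file.write_text('placeholder')
stale_file.write_text('placeholder')

# missing_ok=True (Python 3.8+) makes the delete a no-op if the file is already gone.
stale_file.unlink(missing_ok=True)
stale_file.unlink(missing_ok=True)
```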
