# CMDA 3654 Project - Money Ball 1

Marcos Fassio Bazzi - marcosfassiob18

Peter Do - kyungwan

Grant Collier - gcollier24

Ajay Kanjoor - ajkanjoor03

# Question: Which MLB teams will make the 2023 World Series?

Our goal for this project is to predict the 2023 MLB regular season and determine which teams will make the World Series, based on data from [Baseball Reference](https://www.baseball-reference.com/) and [Spotrac](https://www.spotrac.com/mlb/payroll/2022/), websites dedicated to baseball statistics and team payrolls, respectively.

## How does the MLB Regular Season work and how do teams make the World Series?

Major League Baseball is split into two leagues: the National League and the American League. Each league has three divisions (East, Central, and West) of five teams each, for a total of 30 teams. In the 2022 season, which our data covers, each team played the four other teams in its division 19 times (76 games), the 10 other teams in its league either 6 or 7 times (66 games), and 20 interleague games against teams from the other league, for a 162-game season.

The playoff format has changed over the years, but this year a total of 12 teams make the playoffs, six from each league. The team with the best record in each of the six divisions wins its division and qualifies. In addition, the three teams in each league with the best records among those that did not win their division qualify as wild cards. The top two seeds in each league receive a bye past the first round (the Wild Card Series) directly into the Division Series, so those teams have an advantage. The two teams that win their league's playoff bracket meet in the World Series.
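The qualification rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of our analysis: the team names and win totals are made up, the league is shrunk to three teams per division, and ties are broken alphabetically for determinism (MLB uses its own tiebreaker rules).

```python
def rank(teams, records):
    """Sort teams by wins (descending), then by name for determinism."""
    return sorted(teams, key=lambda t: (-records[t][1], t))


def playoff_field(records):
    """Return the six seeds and the two bye teams for one league.

    `records` maps team name -> (division, wins).
    """
    divisions = {}
    for team, (division, _) in records.items():
        divisions.setdefault(division, []).append(team)

    # Division winners: the best record in each division.
    winners = [rank(teams, records)[0] for teams in divisions.values()]

    # Wild cards: the three best records among the remaining teams.
    others = [t for t in records if t not in winners]
    wild_cards = rank(others, records)[:3]

    # Division winners are seeded 1-3, wild cards 4-6.
    seeds = rank(winners, records) + wild_cards
    return {"seeds": seeds, "byes": seeds[:2]}


# Hypothetical records for a toy league (three teams per division).
toy_league = {
    "East-A": ("East", 101), "East-B": ("East", 99), "East-C": ("East", 80),
    "Central-A": ("Central", 92), "Central-B": ("Central", 81), "Central-C": ("Central", 70),
    "West-A": ("West", 111), "West-B": ("West", 89), "West-C": ("West", 87),
}

field = playoff_field(toy_league)
print(field["seeds"])
print(field["byes"])
```

Note that a wild card team (here `East-B` with 99 wins) can have a better record than a division winner and still be seeded below it.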

# Scraping, cleaning, grouping and visualizing the data

Exporting the data from Baseball Reference is straightforward: the site has an option to share and export each table as a `.csv` file. The same cannot be said for Spotrac, so we used the Python libraries `requests` and Beautiful Soup (`bs4`) to fetch and scrape its payroll data.

## Importing necessary libraries

In [25]:
import requests
import bs4
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

## Scraping and cleaning the 2022 payroll data

In [24]:
# fetch link and get page
url = "https://www.spotrac.com/mlb/payroll/2022/"
header = { "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36" }
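link = requests.get(url, headers=header)
link.raise_for_status() # stop early if the request failed
# the "html5lib" parser is a separate package: pip install html5lib
soup = bs4.BeautifulSoup(link.content, "html5lib")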

# get table and parse headers
table_data = soup.find('table', { "class": "datatable" })
headers = [i.text.strip() for i in table_data.find_all("th")]

# parse table body
table_body = soup.find("tbody")
team_data = []
for row_data in table_body.find_all("tr"):
    team_data.append([entry.text.strip().replace("\n", "").replace("\t", "") for entry in row_data.find_all("td")])

# clean team names
for i in range(len(team_data)):
    team_data[i][1] = team_data[i][1][:-3] # last three characters are the team's abbreviation - we don't want this
team_data[4][1] = "San Diego Padres" # for whatever reason this entry's abbreviation was two characters instead of three
team_data.pop(14)
team_data.pop(14) # get rid of league averages in table

# turn data into dataframe
dirty_payroll_data_2022 = pd.DataFrame(team_data, columns=headers)

# cleaning 2022 payroll data
payroll_data_2022 = (
    dirty_payroll_data_2022
    .replace('0-', '0')              # cells scraped as '0-' mean zero
    .replace(r'\$', '', regex=True)  # raw string: '$' must be escaped in a regex
    .replace(',', '', regex=True)
)
payroll_data_2022 = payroll_data_2022.astype({
    'Rank': 'int', 'Win%': 'float', 'Roster': 'int', '26-Man Payroll': 'int',
    'Injured Reserve': 'int', 'Retained': 'int', 'Buried': 'int', 'Suspended': 'int',
    '2022 Total Payroll': 'int',
})
payroll_data_2022.head()


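FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?

This cell failed because the `html5lib` parser was not installed in the environment it was run in. Installing it (`pip install html5lib`) and rerunning the cell resolves this error.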
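The cleaning step in the cell above turns scraped strings such as dollar amounts into integers. A minimal sketch of that step on a small hand-made frame follows. The team names, column names, and amounts here are hypothetical placeholders, not Spotrac data. The key detail is that `$` is a regex metacharacter (an end-of-string anchor), so it has to be escaped or placed inside a character class before it will match a literal dollar sign.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: every value is a string.
raw = pd.DataFrame({
    "Team": ["Team A", "Team B"],
    "Total Payroll": ["$1,234,567", "$987,654"],
    "Retained": ["$50,000", "0-"],
})

money_cols = ["Total Payroll", "Retained"]
cleaned = raw.copy()
for col in money_cols:
    cleaned[col] = (
        raw[col]
        .replace("0-", "0")                     # whole-cell match: '0-' means zero
        .str.replace(r"[$,]", "", regex=True)   # drop literal '$' and ',' characters
        .astype(int)
    )

print(cleaned)
```

Converting the columns one at a time also makes it easier to see which column still contains an unexpected string if `astype(int)` raises.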

## Scraping and cleaning the 2023 payroll data

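The process is nearly identical. The differences are that the 2023 table has three league-average rows to remove instead of two, and that only the `2023 Total Payroll` column is converted to an integer.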

In [19]:
# fetch link and get page
url = "https://www.spotrac.com/mlb/payroll/2023/"
header = { "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36" }
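link = requests.get(url, headers=header)
link.raise_for_status() # stop early if the request failed
# the "html5lib" parser is a separate package: pip install html5lib
soup = bs4.BeautifulSoup(link.content, "html5lib")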

# get table and parse headers
table_data = soup.find('table', { "class": "datatable" })
headers = [i.text.strip() for i in table_data.find_all("th")]

# parse table body
table_body = soup.find("tbody")
team_data = []
for row_data in table_body.find_all("tr"):
    team_data.append([entry.text.strip().replace("\n", "").replace("\t", "") for entry in row_data.find_all("td")])

# clean team names
for i in range(len(team_data)):
    team_data[i][1] = team_data[i][1][:-3] # last three characters are the team's abbreviation - we don't want this
team_data[4][1] = "San Diego Padres" # for whatever reason this entry's abbreviation was two characters instead of three
# get rid of league averages in table (three rows, each shifts into index 14 in turn)
team_data.pop(14)
team_data.pop(14)
team_data.pop(14)

# turn data into dataframe
dirty_payroll_data_2023 = pd.DataFrame(team_data, columns=headers)

# cleaning 2023 payroll data
# note: an unescaped '$' is a regex end-of-string anchor, so it must be written as r'\$'
payroll_data_2023 = dirty_payroll_data_2023.replace(r'\$', '', regex=True).replace(',', '', regex=True)
payroll_data_2023 = payroll_data_2023.astype({ '2023 Total Payroll': 'int' })


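FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?

This is the same missing-`html5lib` error as in the 2022 cell: the parser package was not installed when the cell was run.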
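The team-name cleanup in the scraping cells slices off the last three characters and then patches San Diego by hand, because its abbreviation is only two characters long. A possible alternative is to strip a trailing run of two or three uppercase letters with a regular expression. The sketch below assumes the scraped cell text is the team name immediately followed by its abbreviation (for example `San Diego PadresSD`), which is inferred from the comments in the code above rather than verified against the live page. The helper name `strip_abbreviation` is ours.

```python
import re


def strip_abbreviation(cell_text):
    """Remove a trailing 2-3 letter uppercase abbreviation from a team cell."""
    return re.sub(r"[A-Z]{2,3}$", "", cell_text).strip()


# Assumed shape of the scraped cells: full name followed by the abbreviation.
samples = ["Los Angeles DodgersLAD", "San Diego PadresSD", "Boston Red SoxBOS"]
names = [strip_abbreviation(s) for s in samples]
print(names)
```

This works because every team name ends in a lowercase letter, so the only uppercase run at the end of the string is the abbreviation itself.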

## Fetching the 2022 MLB team data from Baseball Reference

First, we exported the batting and pitching tables as `.csv` files. Then we dropped the league average rows at the bottom of each table (index labels 30 and 31) and renamed the "Tm" column to "Team" so that the tables can be joined with the payroll data on a common key.

In [11]:
batting_data_2022_raw = pd.read_csv("data/batting_data_2022.csv")
pitching_data_2022_raw = pd.read_csv("data/pitching_data_2022.csv")

In [13]:
batting_data_2022 = batting_data_2022_raw.drop([30, 31]).rename(columns={ "Tm": "Team" })
pitching_data_2022 = pitching_data_2022_raw.drop([30, 31]).rename(columns={ "Tm": "Team" })

batting_data_2022

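| Unnamed: 0 | Team | #Bat | BatAge | R/G | G | PA | AB | R | H | 2B | ... | SLG | OPS | OPS+ | TB | GDP | HBP | SH | SF | IBB | LOB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arizona Diamondbacks | 57 | 26.5 | 4.33 | 162 | 6027 | 5351 | 702 | 1232 | 262 | ... | 0.385 | 0.689 | 95 | 2061 | 97 | 60 | 31 | 50 | 14 | 1039 |
| 1 | Atlanta Braves | 53 | 27.5 | 4.87 | 162 | 6082 | 5509 | 789 | 1394 | 298 | ... | 0.443 | 0.761 | 111 | 2443 | 103 | 66 | 1 | 36 | 13 | 1030 |
| 2 | Baltimore Orioles | 58 | 27.0 | 4.16 | 162 | 6049 | 5429 | 674 | 1281 | 275 | ... | 0.39 | 0.695 | 97 | 2119 | 95 | 83 | 12 | 43 | 10 | 1095 |
| 3 | Boston Red Sox | 54 | 28.8 | 4.54 | 162 | 6144 | 5539 | 735 | 1427 | 352 | ... | 0.409 | 0.731 | 102 | 2268 | 131 | 63 | 12 | 50 | 23 | 1133 |
| 4 | Chicago Cubs | 64 | 27.9 | 4.06 | 162 | 6072 | 5425 | 657 | 1293 | 265 | ... | 0.387 | 0.698 | 96 | 2097 | 130 | 84 | 19 | 36 | 16 | 1100 |
| 5 | Chicago White Sox | 44 | 29.3 | 4.23 | 162 | 6123 | 5611 | 686 | 1435 | 272 | ... | 0.387 | 0.698 | 97 | 2172 | 127 | 73 | 16 | 35 | 9 | 1117 |
| 6 | Cincinnati Reds | 66 | 29.4 | 4.0 | 162 | 5978 | 5380 | 648 | 1264 | 235 | ... | 0.372 | 0.676 | 83 | 2003 | 127 | 92 | 12 | 33 | 6 | 1020 |
| 7 | Cleveland Guardians | 50 | 25.9 | 4.31 | 162 | 6163 | 5558 | 698 | 1410 | 273 | ... | 0.383 | 0.699 | 102 | 2126 | 119 | 81 | 22 | 52 | 36 | 1156 |
| 8 | Colorado Rockies | 43 | 29.1 | 4.31 | 162 | 6105 | 5540 | 698 | 1408 | 280 | ... | 0.398 | 0.713 | 90 | 2203 | 139 | 61 | 10 | 40 | 10 | 1113 |
| 9 | Detroit Tigers | 53 | 27.9 | 3.44 | 162 | 5870 | 5378 | 557 | 1240 | 235 | ... | 0.346 | 0.632 | 84 | 1859 | 108 | 58 | 10 | 44 | 8 | 1015 |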


## Grouping the 2022 payroll and baseball data

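After cleaning the data above, we combined the three tables with two one-to-one joins on the team name: first batting with pitching, then the result with payroll. Each team appears exactly once in each table, so every join should produce one row per team.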

In [9]:
baseball_data_2022 = pd.merge(batting_data_2022, pitching_data_2022, on=["Team"]) # merge baseball reference stats
baseball_data_2022 = pd.merge(baseball_data_2022, payroll_data_2022, on=["Team"]) # merge payroll stats

baseball_data_2022.head()

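NameError: name 'payroll_data_2022' is not defined

The merge failed because `payroll_data_2022` is created by the 2022 payroll scraping cell, which had not completed successfully when this cell was run. That cell must run without errors first.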
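Because the joins rely on team names matching exactly across two different websites, it can help to have pandas check the join itself. The sketch below uses small hypothetical frames (the team names and numbers are placeholders) and three real `pd.merge` options: `validate="one_to_one"` raises an error if a team appears more than once in either table, `suffixes` labels columns that exist in both the batting and pitching tables, and `indicator=True` adds a `_merge` column that shows which rows found no payroll match.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
batting = pd.DataFrame({"Team": ["Team A", "Team B", "Team C"], "R": [700, 650, 720]})
pitching = pd.DataFrame({"Team": ["Team A", "Team B", "Team C"], "R": [640, 700, 610]})
payroll = pd.DataFrame({"Team": ["Team A", "Team B"],
                        "Total Payroll": [150_000_000, 90_000_000]})

# Both stat tables have an "R" column (runs scored vs. runs allowed),
# so explicit suffixes keep the two apart after the merge.
stats = pd.merge(batting, pitching, on="Team",
                 suffixes=("_bat", "_pit"), validate="one_to_one")

# A left join with an indicator keeps every team and flags missing payroll rows.
merged = pd.merge(stats, payroll, on="Team", how="left",
                  validate="one_to_one", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Team"].tolist()
print(unmatched)
```

An unmatched team usually points to a spelling difference between the two sources, which is easier to fix once it is listed explicitly.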