Import the libraries needed for this project

In [12]:
import pandas as pd
import numpy as np

Merge the cleaned dataframes into one all-encompassing dataframe

First, I will read in the csv files created in the previous data-cleansing steps.

In [13]:
#read in data files
total_payroll = pd.read_csv(r'cleansed_data\total_payroll.csv')
mlb_win_totals = pd.read_csv(r'cleansed_data\mlb_win_totals.csv')
total_attendance = pd.read_csv(r'cleansed_data\total_attendance.csv')

I will now run a quick check on the data to make sure it is set up properly.

In [14]:
#check payroll data
total_payroll.head()

Unnamed: 0,Year,Team,Total Payroll
0,2021,Los Angeles Dodgers,235412876
1,2021,New York Yankees,191205631
2,2021,Boston Red Sox,180261996
3,2021,Los Angeles Angels,177353000
4,2021,Philadelphia Phillies,174009000


Payroll looks good
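
Beyond eyeballing head(), a few automated checks can back up the "looks good" call. This is a minimal sketch on three sample rows copied from the output above; the helper name basic_checks is hypothetical and is not part of the notebook or of pandas.

```python
import pandas as pd

# Small stand-in for total_payroll, using rows shown in the head() output
payroll_sample = pd.DataFrame({
    "Year": [2021, 2021, 2021],
    "Team": ["Los Angeles Dodgers", "New York Yankees", "Boston Red Sox"],
    "Total Payroll": [235412876, 191205631, 180261996],
})

def basic_checks(df, key_cols):
    """Hypothetical helper: count rows, missing cells, and repeated keys."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_keys": int(df.duplicated(subset=key_cols).sum()),
    }

report = basic_checks(payroll_sample, ["Year", "Team"])
print(report)
```

On the real data the same call would be basic_checks(total_payroll, ['Year', 'Team']); a non-zero duplicate_keys count would be worth resolving before any merge on those columns.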

I see the wins DF will need some updates. I'll need to replace the team abbreviations with full team names and update the format to match the other files.

In [15]:
#check team win data
mlb_win_totals.head()

Unnamed: 0,Year,G,ARI,ATL,BAL,BOS,CHC,CHW,CIN,CLE,...,PHI,PIT,SDP,SFG,SEA,STL,TBR,TEX,TOR,WSN
0,2023,162,84,104,101,78,83,61,82,76,...,90,76,82,79,88,71,99,90,89,71
1,2022,162,74,101,83,78,74,81,62,92,...,87,62,89,81,90,93,86,68,92,55
2,2021,162,52,88,52,92,71,93,83,80,...,82,61,79,107,90,90,100,60,91,65


In [16]:
#need to check all team name values in mlb_win_totals
print(mlb_win_totals.columns.to_numpy().tolist())

['Year', 'G', 'ARI', 'ATL', 'BAL', 'BOS', 'CHC', 'CHW', 'CIN', 'CLE', 'COL', 'DET', 'HOU', 'KCR', 'LAA', 'LAD', 'MIA', 'MIL', 'MIN', 'NYM', 'NYY', 'OAK', 'PHI', 'PIT', 'SDP', 'SFG', 'SEA', 'STL', 'TBR', 'TEX', 'TOR', 'WSN']


Here I will rename the mlb_win_totals columns to the matching team names used in the other DFs, then check the result to verify the changes were made properly

In [17]:
#rename mlb_win_totals columns from team abbreviations to the full team names used in total_payroll
mlb_win_totals = mlb_win_totals.rename(columns={'ARI': 'Arizona Diamondbacks', 
                                                'ATL': 'Atlanta Braves', 
                                                'BAL': 'Baltimore Orioles', 
                                                'BOS': 'Boston Red Sox', 
                                                'CHC': 'Chicago Cubs', 
                                                'CHW': 'Chicago White Sox', 
                                                'CIN': 'Cincinnati Reds', 
                                                'CLE': 'Cleveland Guardians', 
                                                'COL': 'Colorado Rockies', 
                                                'DET': 'Detroit Tigers', 
                                                'HOU': 'Houston Astros', 
                                                'KCR': 'Kansas City Royals', 
                                                'LAA': 'Los Angeles Angels', 
                                                'LAD': 'Los Angeles Dodgers', 
                                                'MIA': 'Miami Marlins', 
                                                'MIL': 'Milwaukee Brewers', 
                                                'MIN': 'Minnesota Twins', 
                                                'NYM': 'New York Mets', 
                                                'NYY': 'New York Yankees', 
                                                'OAK': 'Oakland Athletics', 
                                                'PHI': 'Philadelphia Phillies', 
                                                'PIT': 'Pittsburgh Pirates', 
                                                'SDP': 'San Diego Padres', 
                                                'SFG': 'San Francisco Giants', 
                                                'SEA': 'Seattle Mariners', 
                                                'STL': 'St. Louis Cardinals', 
                                                'TBR': 'Tampa Bay Rays', 
                                                'TEX': 'Texas Rangers', 
                                                'TOR': 'Toronto Blue Jays', 
                                                'WSN': 'Washington Nationals'})

mlb_win_totals.head()

Unnamed: 0,Year,G,Arizona Diamondbacks,Atlanta Braves,Baltimore Orioles,Boston Red Sox,Chicago Cubs,Chicago White Sox,Cincinnati Reds,Cleveland Guardians,...,Philadelphia Phillies,Pittsburgh Pirates,San Diego Padres,San Francisco Giants,Seattle Mariners,St. Louis Cardinals,Tampa Bay Rays,Texas Rangers,Toronto Blue Jays,Washington Nationals
0,2023,162,84,104,101,78,83,61,82,76,...,90,76,82,79,88,71,99,90,89,71
1,2022,162,74,101,83,78,74,81,62,92,...,87,62,89,81,90,93,86,68,92,55
2,2021,162,52,88,52,92,71,93,83,80,...,82,61,79,107,90,90,100,60,91,65
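
A quick way to confirm that the renamed columns line up with the names in another DF is a set comparison. The sketch below runs on toy frames: the payroll figures are placeholders and the "St Louis Cardinals" spelling is an invented mismatch for illustration, not something observed in the real files.

```python
import pandas as pd

# Toy wide wins frame (win totals taken from the 2023 row above)
wins_wide = pd.DataFrame({
    "Year": [2023],
    "G": [162],
    "Boston Red Sox": [78],
    "St. Louis Cardinals": [71],
})

# Toy payroll frame; payroll values are placeholders, and the second
# team name deliberately lacks the period to simulate a mismatch
payroll_long = pd.DataFrame({
    "Year": [2023, 2023],
    "Team": ["Boston Red Sox", "St Louis Cardinals"],
    "Total Payroll": [1, 2],
})

team_cols = set(wins_wide.columns) - {"Year", "G"}
payroll_teams = set(payroll_long["Team"])

# Names present on one side only would fail to match in a later merge
missing_from_payroll = sorted(team_cols - payroll_teams)
missing_from_wins = sorted(payroll_teams - team_cols)

print(missing_from_payroll)
print(missing_from_wins)
```

Two empty lists on the real data would mean every team name has a counterpart in the other DF.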


The mlb_win_totals DF is set up differently from the other DFs: each team is a column header instead of a value in the rows.
To fix this, I will use the melt function to reshape the DF into a long format matching the other dataframes.

In [18]:
#use melt to arrange mlb_win_totals in a more usable format
mlb_win_totals=mlb_win_totals.melt(id_vars=['Year', 'G'], var_name='Team', value_name='Wins')

mlb_win_totals.head()

Unnamed: 0,Year,G,Team,Wins
0,2023,162,Arizona Diamondbacks,84
1,2022,162,Arizona Diamondbacks,74
2,2021,162,Arizona Diamondbacks,52
3,2023,162,Atlanta Braves,104
4,2022,162,Atlanta Braves,101
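
To sanity-check a reshape like this, the long result should have one row per year and team, and each value should still be retrievable by its (Year, Team) pair. A minimal sketch on two teams and two seasons, with values copied from the outputs above:

```python
import pandas as pd

# Two-team, two-season slice of the wide wins table
wide = pd.DataFrame({
    "Year": [2023, 2022],
    "G": [162, 162],
    "Arizona Diamondbacks": [84, 74],
    "Atlanta Braves": [104, 101],
})

# Same melt call as in the notebook cell
long = wide.melt(id_vars=["Year", "G"], var_name="Team", value_name="Wins")

# Expected row count: number of seasons times number of team columns
n_team_cols = len(wide.columns) - 2
expected_rows = len(wide) * n_team_cols

# Look one value up by its (Year, Team) key to confirm nothing shifted
braves_2022 = long.set_index(["Year", "Team"]).loc[(2022, "Atlanta Braves"), "Wins"]

print(len(long), expected_rows, braves_2022)
```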


With the mlb_win_totals and total_payroll DFs complete, I can now merge them into one DF

In [19]:
#merge wins and payroll totals into one DF, matching on both Year and Team values
payroll_wins = pd.merge(mlb_win_totals, total_payroll, on=['Year', 'Team'])

payroll_wins

Unnamed: 0,Year,G,Team,Wins,Total Payroll
0,2023,162,Arizona Diamondbacks,84,112763571
1,2022,162,Arizona Diamondbacks,74,75993333
2,2021,162,Arizona Diamondbacks,52,89077233
3,2023,162,Atlanta Braves,104,199727500
4,2022,162,Atlanta Braves,101,173935000
...,...,...,...,...,...
85,2022,162,Toronto Blue Jays,92,168070905
86,2021,162,Toronto Blue Jays,91,137133333
87,2023,162,Washington Nationals,71,79983095
88,2022,162,Washington Nationals,55,114623095
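
Because pd.merge defaults to an inner join, any Year/Team pair that appears in only one DF is dropped without warning. One way to surface such rows is an outer merge with indicator=True, sketched below on toy data. "Hypothetical Team" and its payroll value are invented for illustration; the Braves figures come from the output above.

```python
import pandas as pd

wins = pd.DataFrame({
    "Year": [2023, 2022],
    "Team": ["Atlanta Braves", "Atlanta Braves"],
    "Wins": [104, 101],
})

# Second row is an invented team with a placeholder payroll
payroll = pd.DataFrame({
    "Year": [2023, 2023],
    "Team": ["Atlanta Braves", "Hypothetical Team"],
    "Total Payroll": [199727500, 1],
})

# how="outer" keeps unmatched rows, indicator=True adds a _merge column,
# and validate raises if a Year/Team key repeats on either side
checked = pd.merge(
    wins, payroll,
    on=["Year", "Team"],
    how="outer",
    indicator=True,
    validate="one_to_one",
)

# Rows that an inner join would have dropped
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["Year", "Team", "_merge"]])
```

If unmatched is empty on the real DFs, the inner merge kept every row.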


Now I will move the G column so it follows Team, and rename it to Games Played for clarity

In [20]:
#move G column to position 2 so it follows Team
games = payroll_wins.pop("G")
payroll_wins.insert(2, "G", games)

#rename Games column for clarity
payroll_wins=payroll_wins.rename(columns={'G': 'Games Played'})

payroll_wins

Unnamed: 0,Year,Team,Games Played,Wins,Total Payroll
0,2023,Arizona Diamondbacks,162,84,112763571
1,2022,Arizona Diamondbacks,162,74,75993333
2,2021,Arizona Diamondbacks,162,52,89077233
3,2023,Atlanta Braves,162,104,199727500
4,2022,Atlanta Braves,162,101,173935000
...,...,...,...,...,...
85,2022,Toronto Blue Jays,162,92,168070905
86,2021,Toronto Blue Jays,162,91,137133333
87,2023,Washington Nationals,162,71,79983095
88,2022,Washington Nationals,162,55,114623095
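
The same reordering can also be written as a rename followed by an explicit column list, which some readers find easier to scan than pop and insert. A minimal sketch on a single row taken from the output above, offered as an alternative rather than a replacement:

```python
import pandas as pd

# One-row frame with the column order produced by the merge
df = pd.DataFrame({
    "Year": [2023],
    "G": [162],
    "Team": ["Atlanta Braves"],
    "Wins": [104],
    "Total Payroll": [199727500],
})

# Rename first, then select columns in the desired order
column_order = ["Year", "Team", "Games Played", "Wins", "Total Payroll"]
tidy = df.rename(columns={"G": "Games Played"})[column_order]

print(tidy.columns.tolist())
```

Listing the columns explicitly also fails loudly with a KeyError if an expected column is missing.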


The payroll_wins DF and total_attendance DF can now be merged on matching 'Year' and 'Team' rows

In [21]:
#merge attendance data into payroll_wins
attend_wins_payroll = pd.merge(payroll_wins, total_attendance, on=['Year', 'Team'])

attend_wins_payroll

Unnamed: 0,Year,Team,Games Played,Wins,Total Payroll,Attendance,Avg Att
0,2023,Arizona Diamondbacks,162,84,112763571,1961182,24212
1,2022,Arizona Diamondbacks,162,74,75993333,1605199,19817
2,2021,Arizona Diamondbacks,162,52,89077233,1043010,12876
3,2023,Atlanta Braves,162,104,199727500,3191505,39401
4,2022,Atlanta Braves,162,101,173935000,3129931,38641
...,...,...,...,...,...,...,...
85,2022,Toronto Blue Jays,162,92,168070905,2653830,32763
86,2021,Toronto Blue Jays,162,91,137133333,809557,10119
87,2023,Washington Nationals,162,71,79983095,1865832,23034
88,2022,Washington Nationals,162,55,114623095,2026401,25017


With the final DF complete, it can be written to csv for analysis and plotting

In [22]:
#write attend_wins_payroll to csv, leaving out the index column
attend_wins_payroll.to_csv(r'cleansed_data\attend_wins_payroll.csv', index=False, header=True)
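
The backslash in these paths is a directory separator only on Windows. A portable variant can build the path with pathlib, as sketched below. The temporary directory stands in for the project folder so the example runs anywhere, and the single data row is copied from the output above.

```python
from pathlib import Path
import tempfile

import pandas as pd

final = pd.DataFrame({
    "Year": [2023],
    "Team": ["Atlanta Braves"],
    "Games Played": [162],
    "Wins": [104],
    "Total Payroll": [199727500],
    "Attendance": [3191505],
    "Avg Att": [39401],
})

with tempfile.TemporaryDirectory() as tmp:
    # Path joins with the right separator for the current OS
    out_dir = Path(tmp) / "cleansed_data"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "attend_wins_payroll.csv"

    # index=False keeps the row index out of the file
    final.to_csv(out_path, index=False)

    # Read the file back to confirm the columns survive the round trip
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```

Writing without the index means a later read_csv does not pick up an extra "Unnamed: 0" column.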