# Data Wrangling & EDA

##### Kimberly Liu & Isaac Tabor

### March Madness Data

**Part 1: What is our data?**

We believe the information and variables highlighted from the following datasets will help us build a simple prediction model:

- *MTeams.csv* and *WTeams.csv* contain **Team ID** and **Team Names**

- *MNCAATourneySeeds.csv* and *WNCAATourneySeeds.csv* contain **tournament seeds since the 1984-85 season**. Key to note: we will not know which 68 teams are in the tournament, or what their seeds are, until Selection Sunday on March 16, 2025.

- *MRegularSeasonCompactResults.csv* and *WRegularSeasonCompactResults.csv* contain **final scores of all regular season, conference tournament, and NCAA® tournament games since the 1984-85 season**

- *MSeasons.csv* and *WSeasons.csv* contain **Season-level details including dates and region names**

In the end, we plan to generate our predictions from a machine learning model in the format outlined in *SampleSubmissionStage1.csv*.



**Part 2: How will these data be useful for studying the phenomenon we're interested in?**

We have collected a large amount of historical data on NCAA basketball games and teams going back many years. We intend to use it to build a machine learning model to predict March Madness outcomes.

We currently have data for both men's and women's basketball: files starting with M contain only men's data, and files starting with W contain only women's data (e.g. MCities, WConferences). MTeamSpellings and WTeamSpellings will help us map each TeamID to its team.
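The M/W naming convention described above lends itself to a small path helper. The sketch below is only an illustration: `league_csv` and the `data` folder name are hypothetical and not part of the project code.

```python
from pathlib import Path

# Hypothetical helper (not in the project code): build the CSV path for a table
# using the "M" (men's) / "W" (women's) filename prefix described above.
def league_csv(data_dir: str, league: str, table: str) -> Path:
    if league not in ("M", "W"):
        raise ValueError("league must be 'M' (men's) or 'W' (women's)")
    return Path(data_dir) / f"{league}{table}.csv"

# "data" is an assumed folder name used only for this example.
men_teams = league_csv("data", "M", "Teams")
women_seeds = league_csv("data", "W", "NCAATourneySeeds")
print(men_teams.name, women_seeds.name)
```

Keeping the prefix as a parameter would make it straightforward to run the same loading code for either league.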

All of the files are currently complete through January 28th of the current season. This data was compiled into a Kaggle dataset for a March Madness ML competition largely from Kenneth Massey and Jeff Sonas of Sonas Consulting.


**Part 3: What are the challenges we've resolved or expect to face in using them?**

The dataset we downloaded contained more than 36 separate CSV files, each covering a different aspect of the data. To keep things manageable, we merged the essential variables from several files into one large dataset, saved as <code>'merged_data.csv'</code>.

In this dataset, each observation is one game, starting from the 2003 season, and covers both regular season and NCAA tournament games. Each observation includes the result and team box scores, along with other supplementary data that may provide useful context for building our model.

In [1]:
import pandas as pd

# df = pd.read_csv("/Users/kimberlyliu/Downloads/DS 3001/DS3001-Project/data/merged_data.csv")
df = pd.read_csv("/content/DS3001-Project/data/merged_data.csv")

df.head()

print(df.shape[0])


657977


Our data has 657977 observations and 42 variables. The variables are as follows:
Season,


First clone the GitHub repo:

In [2]:
! git clone https://github.com/kimberlyyliuu/DS3001-Project/

fatal: destination path 'DS3001-Project' already exists and is not an empty directory.
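The `fatal` message above appears whenever the cell is re-run after the repo has already been cloned. One way to avoid it is to check for the folder first. This is a minimal sketch; `clone_command` is a hypothetical helper, and the actual `git` call is left as a comment so the example has no side effects.

```python
import tempfile
from pathlib import Path

REPO_URL = "https://github.com/kimberlyyliuu/DS3001-Project/"

# Hypothetical helper: return the git command to run, or None when the
# destination folder already exists and cloning should be skipped.
def clone_command(repo_dir: Path, url: str):
    if repo_dir.exists():
        return None
    return ["git", "clone", url, str(repo_dir)]

with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp)                      # this folder exists
    missing = existing / "DS3001-Project"     # this one does not
    print(clone_command(existing, REPO_URL))  # None, so nothing to do
    print(clone_command(missing, REPO_URL))   # the command that would be run

# To actually clone, pass the returned list to subprocess.run(cmd, check=True).
```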


Next, load and merge the basic data. You may have to adjust the file paths to match your own setup:

In [3]:
import pandas as pd

MTeams = pd.read_csv("/content/DS3001-Project/data/MTeams.csv")
MNCAATournamentSeeds = pd.read_csv("/content/DS3001-Project/data/MNCAATourneySeeds.csv")
MRegularSeasonCompactResults = pd.read_csv("/content/DS3001-Project/data/MRegularSeasonCompactResults.csv")
MSeasons = pd.read_csv("/content/DS3001-Project/data/MSeasons.csv")

In [4]:
MTeams.head()

Unnamed: 0,TeamID,TeamName,FirstD1Season,LastD1Season
0,1101,Abilene Chr,2014,2025
1,1102,Air Force,1985,2025
2,1103,Akron,1985,2025
3,1104,Alabama,1985,2025
4,1105,Alabama A&M,2000,2025


In [5]:
MNCAATournamentSeeds.head()

Unnamed: 0,Season,Seed,TeamID
0,1985,W01,1207
1,1985,W02,1210
2,1985,W03,1228
3,1985,W04,1260
4,1985,W05,1374


In [6]:
MRegularSeasonCompactResults.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0


In [7]:
MSeasons.head()

Unnamed: 0,Season,DayZero,RegionW,RegionX,RegionY,RegionZ
0,1985,10/29/1984,East,West,Midwest,Southeast
1,1986,10/28/1985,East,Midwest,Southeast,West
2,1987,10/27/1986,East,Southeast,Midwest,West
3,1988,11/02/1987,East,Midwest,Southeast,West
4,1989,10/31/1988,East,West,Midwest,Southeast


In [8]:
# Assuming you have loaded the DataFrames as MTeams, MNCAATournamentSeeds,
# MRegularSeasonCompactResults, and MSeasons

# 1. Merge MTeams for winning teams
merged_df = pd.merge(MRegularSeasonCompactResults, MTeams, left_on='WTeamID', right_on='TeamID', suffixes=('', '_winner'))

# 2. Merge MTeams for losing teams
merged_df = pd.merge(merged_df, MTeams, left_on='LTeamID', right_on='TeamID', suffixes=('', '_loser'))

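# 3. Merge MNCAATournamentSeeds. At this point 'TeamID' is the winning team's ID,
#    so 'Seed' is the winner's seed only (NaN when the winner was not seeded that season).
merged_df = pd.merge(merged_df, MNCAATournamentSeeds, on=['Season', 'TeamID'], how='left')  # Left merge to keep all regular season games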

# 4. Merge MSeasons
merged_df = pd.merge(merged_df, MSeasons, on='Season')

# Now 'merged_df' contains all the combined information
merged_df.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,TeamID,TeamName,...,TeamID_loser,TeamName_loser,FirstD1Season_loser,LastD1Season_loser,Seed,DayZero,RegionW,RegionX,RegionY,RegionZ
0,1985,20,1228,81,1328,64,N,0,1228,Illinois,...,1328,Oklahoma,1985,2025,W03,10/29/1984,East,West,Midwest,Southeast
1,1985,25,1106,77,1354,70,H,0,1106,Alabama St,...,1354,S Carolina St,1985,2025,,10/29/1984,East,West,Midwest,Southeast
2,1985,25,1112,63,1223,56,H,0,1112,Arizona,...,1223,Houston Chr,1985,2025,X10,10/29/1984,East,West,Midwest,Southeast
3,1985,25,1165,70,1432,54,H,0,1165,Cornell,...,1432,Utica,1985,1987,,10/29/1984,East,West,Midwest,Southeast
4,1985,25,1192,86,1447,74,H,0,1192,F Dickinson,...,1447,Wagner,1985,2025,Z16,10/29/1984,East,West,Midwest,Southeast


Now, we have a basic merged dataset to start with. Kimberly, if you have a better one, we can use that or merge those too.
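In the merge above, `Seed` is joined on the winner's `TeamID`, so the losing team's seed is not attached. A minimal sketch of one way to attach both seeds is shown below, using toy data. The helper `add_seed`, the column names `WSeed`/`LSeed`, and the loser's seed value are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the results and seeds tables. The Illinois row (1228, W03)
# mirrors the output above; the seed for team 1328 is illustrative only.
games = pd.DataFrame({
    "Season": [1985, 1985],
    "WTeamID": [1228, 1106],
    "LTeamID": [1328, 1354],
})
seeds = pd.DataFrame({
    "Season": [1985, 1985],
    "TeamID": [1228, 1328],
    "Seed": ["W03", "Y01"],
})

# Hypothetical helper: left-join the seed for whichever team column is named.
def add_seed(df, seeds, team_col, seed_col):
    renamed = seeds.rename(columns={"TeamID": team_col, "Seed": seed_col})
    return df.merge(renamed, on=["Season", team_col], how="left")

games = add_seed(games, seeds, "WTeamID", "WSeed")
games = add_seed(games, seeds, "LTeamID", "LSeed")
print(games)
```

Renaming the seed columns before each merge avoids the suffix handling needed when the same table is joined twice.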

In [9]:
from pathlib import Path

# Folder holding the Kaggle CSVs. Change this one line when running outside Colab.
DATA_DIR = Path("/content/DS3001-Project/data")
# DATA_DIR = Path("/Users/kimberlyliu/Downloads/DS 3001/DS3001-Project/data")  # local copy

# Section 1
MTeams_df = pd.read_csv(DATA_DIR / "MTeams.csv")
MSeasons_df = pd.read_csv(DATA_DIR / "MSeasons.csv")
MTourneySeeds_df = pd.read_csv(DATA_DIR / "MNCAATourneySeeds.csv")
MRegularSeasonCompactResults_df = pd.read_csv(DATA_DIR / "MRegularSeasonCompactResults.csv")

# Section 2
MRegularSeasonDetailedResults_df = pd.read_csv(DATA_DIR / "MRegularSeasonDetailedResults.csv")
MNCAATourneyDetailedResults_df = pd.read_csv(DATA_DIR / "MNCAATourneyDetailedResults.csv")
MTeamConferences_df = pd.read_csv(DATA_DIR / "MTeamConferences.csv")
MGameCities_df = pd.read_csv(DATA_DIR / "MGameCities.csv")
MConferenceTourneyGames_df = pd.read_csv(DATA_DIR / "MConferenceTourneyGames.csv")
df = pd.read_csv(DATA_DIR / "MMasseyOrdinals.csv")  # Massey ordinal rankings; later cells refer to this as `df`
MNCAATourneySlots_df = pd.read_csv(DATA_DIR / "MNCAATourneySlots.csv")


In [10]:
# Drop teams that have not been D1 since 2003
MTeams_df = MTeams_df[MTeams_df['LastD1Season'] >= 2003]

# Stack MRegularSeasonDetailedResults on top of MNCAATourneyDetailedResults
merged_df = pd.concat([MRegularSeasonDetailedResults_df, MNCAATourneyDetailedResults_df], ignore_index=True)

# Show games from oldest to newest. sort_values returns a sorted copy, so this only
# affects the display below; merged_df itself keeps its original row order.
merged_df.sort_values(by=['Season', 'DayNum'], ascending=[True, True]) # side note: why is 2003 starting at daynum = 10?

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF
0,2003,10,1104,68,1328,62,N,0,27,58,...,10,16,22,10,22,8,18,9,2,20
1,2003,10,1272,70,1393,63,N,0,26,62,...,24,9,20,20,25,7,12,8,6,16
2,2003,11,1266,73,1437,61,N,0,24,58,...,26,14,23,31,22,9,12,2,5,23
3,2003,11,1296,56,1457,50,N,0,18,38,...,22,8,15,17,20,9,19,4,3,23
4,2003,11,1400,77,1208,71,N,0,30,61,...,16,17,27,21,15,12,10,7,1,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117743,2025,106,1461,69,1102,62,H,0,25,54,...,23,9,17,2,24,12,8,3,3,24
117744,2025,106,1462,76,1139,63,H,0,29,68,...,23,9,14,4,31,12,20,5,2,12
117745,2025,106,1466,80,1480,62,H,0,28,55,...,18,4,8,6,23,13,13,2,2,18
117746,2025,106,1468,94,1122,68,H,0,36,58,...,32,17,22,7,22,12,10,2,5,17
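On the side note about `DayNum`: if `DayNum` is a day offset from each season's `DayZero` in *MSeasons.csv* (an assumption here, though it matches the column names), then a first game at `DayNum = 10` would simply be one played ten days after `DayZero`. A minimal sketch of converting `DayNum` to a calendar date, using the 1985 `DayZero` shown earlier:

```python
import pandas as pd

# DayZero for 1985 is taken from the MSeasons output above; the two games
# use DayNum values that appear in the compact results output.
seasons = pd.DataFrame({"Season": [1985], "DayZero": ["10/29/1984"]})
games = pd.DataFrame({"Season": [1985, 1985], "DayNum": [20, 25]})

# Assumption: DayNum counts days elapsed since DayZero.
seasons["DayZero"] = pd.to_datetime(seasons["DayZero"], format="%m/%d/%Y")
games = games.merge(seasons, on="Season", how="left")
games["GameDate"] = games["DayZero"] + pd.to_timedelta(games["DayNum"], unit="D")
print(games[["Season", "DayNum", "GameDate"]])
```

A calendar date column can make it easier to sanity-check game ordering across seasons.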


In [11]:
# Add team names for the winning and losing teams
team_names = MTeams_df.set_index('TeamID')['TeamName']
merged_df['WTeamName'] = merged_df['WTeamID'].map(team_names)
merged_df['LTeamName'] = merged_df['LTeamID'].map(team_names)

# Add game type (CRType) and city ID
merged_df = merged_df.merge(
    MGameCities_df[['Season', 'DayNum', 'WTeamID', 'LTeamID', 'CRType', 'CityID']],
    on=['Season', 'DayNum', 'WTeamID', 'LTeamID'],
    how='left'
)

merged_df.columns

Index(['Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc',
       'NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR',
       'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3',
       'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF',
       'WTeamName', 'LTeamName', 'CRType', 'CityID'],
      dtype='object')

In [12]:
merged_df

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LDR,LAst,LTO,LStl,LBlk,LPF,WTeamName,LTeamName,CRType,CityID
0,2003,10,1104,68,1328,62,N,0,27,58,...,22,8,18,9,2,20,Alabama,Oklahoma,,
1,2003,10,1272,70,1393,63,N,0,26,62,...,25,7,12,8,6,16,Memphis,Syracuse,,
2,2003,11,1266,73,1437,61,N,0,24,58,...,22,9,12,2,5,23,Marquette,Villanova,,
3,2003,11,1296,56,1457,50,N,0,18,38,...,20,9,19,4,3,23,N Illinois,Winthrop,,
4,2003,11,1400,77,1208,71,N,0,30,61,...,15,12,10,7,1,14,Texas,Georgia,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119125,2024,146,1301,76,1181,64,N,0,28,60,...,27,11,9,4,5,23,NC State,Duke,NCAA,4088.0
119126,2024,146,1345,72,1397,66,N,0,24,53,...,17,17,6,8,4,25,Purdue,Tennessee,NCAA,4098.0
119127,2024,152,1163,86,1104,72,N,0,31,62,...,21,9,7,2,5,15,Connecticut,Alabama,NCAA,4130.0
119128,2024,152,1345,63,1301,50,N,0,22,55,...,22,10,11,8,3,13,Purdue,NC State,NCAA,4130.0
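In the output above, `CRType` and `CityID` are missing for the 2003 rows but present for the 2024 rows, so the city data does not cover every game. A minimal sketch of measuring that coverage by season with the `indicator` option of `merge` follows. The toy frames reuse a few values from the output above; the variable names are illustrative.

```python
import pandas as pd

# Toy stand-ins: one early game with no city record, one recent game with one.
games = pd.DataFrame({
    "Season": [2003, 2024],
    "DayNum": [10, 152],
    "WTeamID": [1104, 1163],
    "LTeamID": [1328, 1104],
})
cities = pd.DataFrame({
    "Season": [2024], "DayNum": [152], "WTeamID": [1163], "LTeamID": [1104],
    "CRType": ["NCAA"], "CityID": [4130],
})

keys = ["Season", "DayNum", "WTeamID", "LTeamID"]
# indicator=True adds a "_merge" column: "both" when a city row matched,
# "left_only" when the game had no match.
checked = games.merge(cities, on=keys, how="left", indicator=True)
coverage = (checked["_merge"] == "both").groupby(checked["Season"]).mean()
print(coverage)
```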


In [13]:
# Add conferences of winning and losing teams.

# Winning team: merge to bring in the conference abbreviation as ConfAbbrev
merged_df = merged_df.merge(
    MTeamConferences_df[['Season', 'TeamID', 'ConfAbbrev']],
    left_on=['Season', 'WTeamID'],
    right_on=['Season', 'TeamID'],
    how='left'
)

# Rename the imported column to WConf and drop the duplicate TeamID column from the merge
merged_df.rename(columns={'ConfAbbrev': 'WConf'}, inplace=True)
merged_df.drop(columns=['TeamID'], inplace=True)


# Losing team: same merge, with the imported column renamed to LConf below
merged_df = merged_df.merge(
    MTeamConferences_df[['Season', 'TeamID', 'ConfAbbrev']],
    left_on=['Season', 'LTeamID'],
    right_on=['Season', 'TeamID'],
    how='left'
)

merged_df.rename(columns={'ConfAbbrev': 'LConf'}, inplace=True)
merged_df.drop(columns=['TeamID'], inplace=True)

len(merged_df)

119130
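Checking `len(merged_df)` after the merges confirms that no rows were duplicated. The same guarantee can be requested from pandas directly through the `validate` argument of `merge`. Below is a minimal sketch on toy data; `add_winner_conf` is a hypothetical helper, and the two conference values mirror the output shown later in this notebook.

```python
import pandas as pd

games = pd.DataFrame({"Season": [2003, 2003], "WTeamID": [1104, 1272]})
conferences = pd.DataFrame({
    "Season": [2003, 2003],
    "TeamID": [1104, 1272],
    "ConfAbbrev": ["sec", "cusa"],
})

# Hypothetical helper: left-join the winner's conference and require that each
# (Season, TeamID) pair appears at most once on the right-hand side.
def add_winner_conf(games, conferences):
    return games.merge(
        conferences,
        left_on=["Season", "WTeamID"],
        right_on=["Season", "TeamID"],
        how="left",
        validate="many_to_one",
    )

merged = add_winner_conf(games, conferences)
same_length = len(merged) == len(games)

# A duplicated conference row would silently add game rows without validate;
# with it, pandas raises MergeError instead.
duplicated = pd.concat([conferences, conferences.iloc[[0]]], ignore_index=True)
try:
    add_winner_conf(games, duplicated)
    raised = False
except pd.errors.MergeError:
    raised = True
print(same_length, raised)
```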

In [14]:
df.sort_values(by=['Season', 'RankingDayNum','TeamID','SystemName'], ascending=[True, True,True,True])

Unnamed: 0,Season,RankingDayNum,SystemName,TeamID,OrdinalRank
0,2003,35,SEL,1102,159
1,2003,35,SEL,1103,229
2,2003,35,SEL,1104,12
3,2003,35,SEL,1105,314
4,2003,35,SEL,1106,260
...,...,...,...,...,...
5487635,2025,107,TRK,1480,333
5487999,2025,107,TRP,1480,342
5488388,2025,107,WIL,1480,336
5488752,2025,107,WLK,1480,345


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5489117 entries, 0 to 5489116
Data columns (total 5 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   Season         int64 
 1   RankingDayNum  int64 
 2   SystemName     object
 3   TeamID         int64 
 4   OrdinalRank    int64 
dtypes: int64(4), object(1)
memory usage: 209.4+ MB


In [24]:
len(df['SystemName'].unique())

192

In [21]:
df_pivot = df.pivot(index=["Season", "RankingDayNum", "TeamID"],
                     columns="SystemName",
                     values="OrdinalRank").reset_index()
df_pivot.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261990 entries, 0 to 261989
Columns: 195 entries, Season to ZAM
dtypes: float64(192), int64(3)
memory usage: 389.8 MB
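The pivot above creates one column for each of the 192 ranking systems, most of them empty for any given row. If only a handful of systems are needed, filtering before pivoting keeps the wide table small. A minimal sketch on toy data follows; the TRK and WIL values come from the output above, while the NET and POM values for this team are illustrative.

```python
import pandas as pd

ordinals = pd.DataFrame({
    "Season": [2025] * 4,
    "RankingDayNum": [107] * 4,
    "SystemName": ["NET", "POM", "TRK", "WIL"],
    "TeamID": [1480] * 4,
    "OrdinalRank": [340, 349, 333, 336],  # NET/POM values are made up
})

# Keep only the systems of interest, then pivot long -> wide.
keep = ["NET", "POM"]
subset = ordinals[ordinals["SystemName"].isin(keep)]
wide = subset.pivot(
    index=["Season", "RankingDayNum", "TeamID"],
    columns="SystemName",
    values="OrdinalRank",
).reset_index()
wide.columns.name = None  # drop the leftover "SystemName" axis label
print(wide)
```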


In [30]:
pd.set_option("display.max_columns", None)


In [34]:
df_pivot

SystemName,Season,RankingDayNum,TeamID,7OT,ACU,ADE,AP,ARG,ATP,AUS,AWS,BAR,BBT,BCM,BD,BIH,BKM,BLS,BMN,BNM,BNT,BNZ,BOB,BOW,BP5,BPI,BRZ,BUR,BWE,CBR,CJB,CMV,CNG,COL,COX,CPA,CPR,CRO,CRW,CTL,CWL,D1A,DAV,DC,DC2,DCI,DDB,DES,DII,DOK,DOL,DP,DUN,DWH,EBB,EBP,ECK,EMK,ENT,ERD,ESR,FAS,FDM,FMG,FSH,GC,GRN,GRS,HAS,HAT,HER,HKB,HKS,HOL,HRN,IMS,INC,INP,ISR,JCI,JEN,JJK,JNG,JON,JRT,KBM,KCX,KEL,KLK,KMV,KOS,KPI,KPK,KRA,LAB,LAW,LEF,LMC,LOG,LYD,LYN,MAS,MB,MCL,MGS,MGY,MIC,MKV,MMG,MOR,MPI,MSX,MUZ,MvG,NET,NOL,NOR,OCT,OMN,OMY,PAC,PEQ,PGH,PH,PIG,PIR,PKL,PMC,POM,PPR,PRR,PTS,RAG,REI,REN,REW,RIS,RM,RME,RMS,ROG,ROH,RPI,RSE,RSL,RT,RTB,RTH,RTP,RTR,RWP,SAG,SAP,SAU,SCR,SE,SEL,SFX,SGR,SIM,SMN,SMS,SP,SPR,SPW,SRS,STF,STH,STM,STR,STS,STY,TBD,TMR,TOL,TPR,TRK,TRP,TRX,TS,TSR,TW,UCS,UPS,USA,WIL,WLK,WLS,WMR,WMV,WOB,WOL,WTE,YAG,ZAM
0,2003,35,1102,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,159.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2003,35,1103,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,229.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2003,35,1104,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,12.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2003,35,1105,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,314.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2003,35,1106,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,260.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261985,2025,107,1476,275.0,,,,,,,,,,,,289.0,,,273.0,,,303.0,,,,,,,310.0,,,,,270.0,,,,,,,,,,,,324.0,,,303.0,318.0,286.0,310.0,286.0,,,318.0,,310.0,,,250.0,283.0,,,,,,,299.0,,,,,,,,319.0,,,,,275.0,230.0,,,,,,,,,,284.0,,272.0,,,,290.0,,,328.0,297.0,,312.0,,,,305.0,289.0,,,,,293.0,296.0,,,,302.0,298.0,,288.0,,,308.0,,,303.0,,,,,,,296.0,,,,292.0,,,300.0,,,306.0,,297.0,,,,,,,,,,,,,,299.0,,302.0,,307.0,,,,,,294.0,,,,,305.0,315.0,,,,,,,,288.0,295.0,,,,,282.0,,,
261986,2025,107,1477,335.0,,,,,,,,,,,,343.0,,,353.0,,,326.0,,,,,,,334.0,,,,,345.0,,,,,,,,,,,,339.0,,,326.0,331.0,347.0,327.0,358.0,,,347.0,,333.0,,,358.0,346.0,,,,,,,334.0,,,,,,,,338.0,,,,,347.0,354.0,,,,,,,,,,342.0,,336.0,,,,338.0,,,351.0,350.0,,347.0,,,,325.0,310.0,,,,,332.0,354.0,,,,329.0,344.0,,343.0,,,333.0,,,329.0,,,,,,,,,,,343.0,,,340.0,,,,,342.0,,,,,,,,,,,,,,347.0,,347.0,,323.0,,,,,,333.0,,,,,331.0,339.0,,,,,,,,341.0,339.0,,,,,346.0,,,
261987,2025,107,1478,326.0,,,,,,,,,,,,332.0,,,347.0,,,352.0,,,,,,,350.0,,,,,336.0,,,,,,,,,,,,354.0,,,345.0,349.0,336.0,351.0,333.0,,,341.0,,338.0,,,301.0,335.0,,,,,,,355.0,,,,,,,,356.0,,,,,330.0,331.0,,,,,,,,,,332.0,,340.0,,,,346.0,,,352.0,337.0,,358.0,,,,349.0,341.0,,,,,349.0,338.0,,,,351.0,347.0,,339.0,,,352.0,,,354.0,,,,,,,332.0,,,,339.0,,,336.0,,,351.0,,347.0,,,,,,,,,,,,,,351.0,,346.0,,353.0,,,,,,349.0,,,,,345.0,344.0,,,,,,,,341.0,344.0,,,,,325.0,,,
261988,2025,107,1479,334.0,,,,,,,,,,,,321.0,,,324.0,,,340.0,,,,,,,337.0,,,,,318.0,,,,,,,,,,,,326.0,,,352.0,345.0,316.0,341.0,306.0,,,349.0,,336.0,,,306.0,323.0,,,,,,,344.0,,,,,,,,354.0,,,,,319.0,310.0,,,,,,,,,,320.0,,319.0,,,,329.0,,,340.0,317.0,,352.0,,,,350.0,340.0,,,,,340.0,318.0,,,,340.0,340.0,,323.0,,,346.0,,,349.0,,,,,,,,,,,326.0,,,311.0,,,341.0,,335.0,,,,,,,,,,,,,,335.0,,333.0,,348.0,,,,,,327.0,,,,,337.0,348.0,,,,,,,,318.0,327.0,,,,,306.0,,,


In [39]:
df = df_pivot[["Season", "RankingDayNum","TeamID","NET","POM","RPI","USA","AP"]] # NET started in 2018 and replaced RPI in terms of what NCAA uses for evaluating quality of CBB teams.
df

SystemName,Season,RankingDayNum,TeamID,NET,POM,RPI,USA,AP
0,2003,35,1102,,,,,
1,2003,35,1103,,,,,
2,2003,35,1104,,,,,
3,2003,35,1105,,,,,
4,2003,35,1106,,,,,
...,...,...,...,...,...,...,...,...
261985,2025,107,1476,293.0,303.0,300.0,,
261986,2025,107,1477,332.0,329.0,340.0,,
261987,2025,107,1478,349.0,354.0,336.0,,
261988,2025,107,1479,340.0,349.0,311.0,,
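Because each system covers different seasons (NET values are absent in the 2003 rows above, for example), it can help to check what share of rows each system fills per season before relying on it. A minimal sketch on toy data, with illustrative values:

```python
import numpy as np
import pandas as pd

# Toy rankings: values are illustrative, chosen so the two systems have
# different coverage in each season.
ranks = pd.DataFrame({
    "Season": [2003, 2003, 2025, 2025],
    "NET": [np.nan, np.nan, 293.0, 332.0],
    "RPI": [159.0, 229.0, 300.0, np.nan],
})

# Share of non-missing values per system, by season.
coverage = ranks.set_index("Season").notna().groupby(level=0).mean()
print(coverage)
```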


In [40]:
""" Merge ranking information for the winning team"""
merged_df = merged_df.merge(
    df[['Season', 'RankingDayNum', 'TeamID', 'NET','POM', 'RPI','USA','AP']],
    left_on=['Season', 'DayNum', 'WTeamID'],
    right_on=['Season', 'RankingDayNum', 'TeamID'],
    how='left'
)

# Renaming columns in merged_df
merged_df.rename(columns={
    'NET': 'W_NET',
    'POM': 'W_POM',
    'RPI': 'W_RPI',
    'USA': 'W_USA',
    'AP': 'W_AP'
}, inplace=True)
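merged_df.drop(columns=['RankingDayNum', 'TeamID'], inplace=True)
# Note: the merge matches DayNum to RankingDayNum exactly, so any game played on a
# day without a published ranking (including pre-season games) gets NaN.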
merged_df

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,WFGM3,WFGA3,WFTM,WFTA,WOR,WDR,WAst,WTO,WStl,WBlk,WPF,LFGM,LFGA,LFGM3,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF,WTeamName,LTeamName,CRType,CityID,WConf,LConf,W_NET,W_POM,W_RPI,W_USA,W_AP
0,2003,10,1104,68,1328,62,N,0,27,58,3,14,11,18,14,24,13,23,7,1,22,22,53,2,10,16,22,10,22,8,18,9,2,20,Alabama,Oklahoma,,,sec,big_twelve,,,,,
1,2003,10,1272,70,1393,63,N,0,26,62,8,20,10,19,15,28,16,13,4,4,18,24,67,6,24,9,20,20,25,7,12,8,6,16,Memphis,Syracuse,,,cusa,big_east,,,,,
2,2003,11,1266,73,1437,61,N,0,24,58,8,18,17,29,17,26,15,10,5,2,25,22,73,3,26,14,23,31,22,9,12,2,5,23,Marquette,Villanova,,,cusa,big_east,,,,,
3,2003,11,1296,56,1457,50,N,0,18,38,3,9,17,31,6,19,11,12,14,2,18,18,49,6,22,8,15,17,20,9,19,4,3,23,N Illinois,Winthrop,,,mac,big_south,,,,,
4,2003,11,1400,77,1208,71,N,0,30,61,6,14,11,13,17,22,12,14,4,4,20,24,62,6,16,17,27,21,15,12,10,7,1,14,Texas,Georgia,,,big_twelve,sec,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119125,2024,146,1301,76,1181,64,N,0,28,60,3,13,17,23,8,27,16,4,4,6,16,19,59,5,20,21,26,10,27,11,9,4,5,23,NC State,Duke,NCAA,4088.0,acc,acc,,,,,
119126,2024,146,1345,72,1397,66,N,0,24,53,3,15,21,33,8,32,16,10,5,2,12,24,62,11,26,7,11,6,17,17,6,8,4,25,Purdue,Tennessee,NCAA,4098.0,big_ten,sec,,,,,
119127,2024,152,1163,86,1104,72,N,0,31,62,10,25,14,18,10,25,20,4,4,8,17,26,58,11,23,9,11,7,21,9,7,2,5,15,Connecticut,Alabama,NCAA,4130.0,big_east,sec,,,,,
119128,2024,152,1345,63,1301,50,N,0,22,55,10,25,9,10,10,28,13,14,5,2,8,21,57,5,19,3,4,6,22,10,11,8,3,13,Purdue,NC State,NCAA,4130.0,big_ten,acc,,,,,
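The ranking columns are NaN in every row shown above, including the 2024 tournament games, because the merge requires `DayNum` to equal `RankingDayNum` exactly. One alternative is `pd.merge_asof`, which takes the most recent ranking on or before each game day. This is a sketch of that swapped-in technique on toy data, not the notebook's method. The day-107 NET values come from the output above; the day-100 value and the game rows are illustrative.

```python
import pandas as pd

games = pd.DataFrame({
    "Season": [2025, 2025, 2025],
    "DayNum": [30, 106, 110],
    "WTeamID": [1476, 1476, 1477],
})
ranks = pd.DataFrame({
    "Season": [2025, 2025, 2025],
    "RankingDayNum": [100, 107, 107],
    "TeamID": [1476, 1476, 1477],
    "NET": [290.0, 293.0, 332.0],  # the day-100 value is made up
})

# merge_asof needs both frames sorted by the "on" keys; renaming TeamID lets
# the same "by" columns be used on both sides.
left = games.sort_values("DayNum")
right = (ranks.rename(columns={"TeamID": "WTeamID", "NET": "W_NET"})
              .sort_values("RankingDayNum"))

with_rank = pd.merge_asof(
    left, right,
    left_on="DayNum", right_on="RankingDayNum",
    by=["Season", "WTeamID"],
    direction="backward",  # latest ranking with RankingDayNum <= DayNum
)
print(with_rank)
```

Games played before a team's first ranking still come back as NaN. Whether a ranking published on the game day itself should count is a modelling choice; `allow_exact_matches=False` would exclude it.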
