# Data Wrangling & EDA

##### Kimberly Liu & Isaac Tabor

### March Madness Data

**Part 1: What is our data?**

We believe the information and variables highlighted from the following datasets will help us build a simple prediction model:

- *MTeams.csv* and *WTeams.csv* contain **Team ID** and **Team Names**

- *MNCAATourneySeeds.csv* and *WNCAATourneySeeds.csv* contain **tournament seeds since the 1984-85 season**. Key to note: we will not know which 68 teams are in the tournament, or what their seeds are, until Selection Sunday on March 16, 2025.

- *MRegularSeasonCompactResults.csv* and *WRegularSeasonCompactResults.csv* contain **Final scores of all regular season, conference tournament, and NCAA® tournament games since the 1984-85 season**

- *MSeasons.csv* and *WSeasons.csv* contain **Season-level details including dates and region names**

In the end, we plan to generate our predictions from a machine learning model in the format outlined in *SampleSubmissionStage1.csv*.
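To make the target format concrete, here is a minimal sketch of building submission rows. It assumes the sample file has an `ID` column of the form `Season_LowerTeamID_HigherTeamID` and a `Pred` column holding the win probability of the lower-ID team; that layout is an assumption and should be checked against *SampleSubmissionStage1.csv* itself. The helper name `make_matchup_id` and the 0.5 placeholder predictions are ours.

```python
import pandas as pd

def make_matchup_id(season: int, team_a: int, team_b: int) -> str:
    """Build an ID string with the lower TeamID first (assumed format)."""
    low, high = sorted((team_a, team_b))
    return f"{season}_{low}_{high}"

# Placeholder predictions: 0.5 for every matchup until a model is trained.
submission = pd.DataFrame({
    "ID": [make_matchup_id(2025, 1104, 1101), make_matchup_id(2025, 1102, 1103)],
    "Pred": [0.5, 0.5],
})
print(submission)
```

Sorting the two TeamIDs inside the helper means the caller does not need to remember which team goes first.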



**Part 2: How will these data be useful for studying the phenomenon we're interested in?**

We have collected a large amount of historical data on NCAA basketball games and teams, going back to the 1984-85 season for the files listed above. We intend to use it to build a machine learning model that predicts March Madness outcomes.

We currently have both men's and women's data: files starting with M contain only men's data, and files starting with W contain only women's data (e.g. MCities, WConferences). MTeamSpellings and WTeamSpellings will help us map team name spellings to a TeamID.

All of the files are currently complete through January 28th of the current season. The data were compiled into a Kaggle dataset for a March Madness ML competition, largely from Kenneth Massey and Jeff Sonas of Sonas Consulting.


**Part 3: What are the challenges we've resolved or expect to face in using them?**

First, clone the GitHub repo:

In [5]:
! git clone https://github.com/kimberlyyliuu/DS3001-Project/

fatal: destination path 'DS3001-Project' already exists and is not an empty directory.


Next, load and merge the basic data. You may have to adjust the file paths to match where the repo lives on your machine:

In [6]:
import pandas as pd

# Adjust DATA_DIR to wherever the cloned repo's data folder lives
DATA_DIR = "/content/DS3001-Project/data"

MTeams = pd.read_csv(f"{DATA_DIR}/MTeams.csv")
MNCAATournamentSeeds = pd.read_csv(f"{DATA_DIR}/MNCAATourneySeeds.csv")
MRegularSeasonCompactResults = pd.read_csv(f"{DATA_DIR}/MRegularSeasonCompactResults.csv")
MSeasons = pd.read_csv(f"{DATA_DIR}/MSeasons.csv")

FileNotFoundError: [Errno 2] No such file or directory: '/content/DS3001-Project/data/MTeams.csv'
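
Since the error above comes from a path that does not exist on the current machine, one option is to check for all the files up front. The following is a minimal sketch, not the loader we actually used; `load_tables` is a hypothetical helper name, and it assumes every table is stored as `<name>.csv` in a single folder.

```python
from pathlib import Path

import pandas as pd

def load_tables(data_dir, names):
    """Read `<name>.csv` for each name from `data_dir` into a dict of DataFrames."""
    data_dir = Path(data_dir)
    # Report every missing file at once instead of failing on the first one.
    missing = [n for n in names if not (data_dir / f"{n}.csv").is_file()]
    if missing:
        raise FileNotFoundError(f"Missing in {data_dir}: {missing}")
    return {n: pd.read_csv(data_dir / f"{n}.csv") for n in names}

# Example call (path is the Colab location used above):
# tables = load_tables("/content/DS3001-Project/data", ["MTeams", "MSeasons"])
```

Listing all missing files in one message makes it easier to tell a wrong folder (everything missing) from a single misnamed file.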

In [None]:
MTeams.head()

| | TeamID | TeamName | FirstD1Season | LastD1Season |
|---|---|---|---|---|
| 0 | 1101 | Abilene Chr | 2014 | 2025 |
| 1 | 1102 | Air Force | 1985 | 2025 |
| 2 | 1103 | Akron | 1985 | 2025 |
| 3 | 1104 | Alabama | 1985 | 2025 |
| 4 | 1105 | Alabama A&M | 2000 | 2025 |


In [None]:
MNCAATournamentSeeds.head()

| | Season | Seed | TeamID |
|---|---|---|---|
| 0 | 1985 | W01 | 1207 |
| 1 | 1985 | W02 | 1210 |
| 2 | 1985 | W03 | 1228 |
| 3 | 1985 | W04 | 1260 |
| 4 | 1985 | W05 | 1374 |
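
The `Seed` values above combine a region letter with a two-digit seed number, which is awkward to use directly as a model feature. Below is a minimal sketch of splitting them apart on toy data. It assumes region letters are W, X, Y, or Z (matching the region columns in MSeasons) and that play-in seeds may carry a trailing `a` or `b`; the `X16a` row and TeamID 9999 are made-up placeholders.

```python
import pandas as pd

seeds = pd.DataFrame({
    "Season": [1985, 1985, 1985],
    "Seed": ["W01", "W02", "X16a"],  # "X16a" is a made-up play-in style seed
    "TeamID": [1207, 1210, 9999],    # 9999 is a placeholder TeamID
})

# Named groups become column names in the extracted DataFrame.
parts = seeds["Seed"].str.extract(
    r"^(?P<Region>[WXYZ])(?P<SeedNum>\d{2})(?P<PlayIn>[ab]?)$"
)
seeds["Region"] = parts["Region"]
seeds["SeedNum"] = parts["SeedNum"].astype(int)
print(seeds)
```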


In [None]:
MRegularSeasonCompactResults.head()

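| | Season | DayNum | WTeamID | WScore | LTeamID | LScore | WLoc | NumOT |
|---|---|---|---|---|---|---|---|---|
| 0 | 1985 | 20 | 1228 | 81 | 1328 | 64 | N | 0 |
| 1 | 1985 | 25 | 1106 | 77 | 1354 | 70 | H | 0 |
| 2 | 1985 | 25 | 1112 | 63 | 1223 | 56 | H | 0 |
| 3 | 1985 | 25 | 1165 | 70 | 1432 | 54 | H | 0 |
| 4 | 1985 | 25 | 1192 | 86 | 1447 | 74 | H | 0 |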


In [None]:
MSeasons.head()

| | Season | DayZero | RegionW | RegionX | RegionY | RegionZ |
|---|---|---|---|---|---|---|
| 0 | 1985 | 10/29/1984 | East | West | Midwest | Southeast |
| 1 | 1986 | 10/28/1985 | East | Midwest | Southeast | West |
| 2 | 1987 | 10/27/1986 | East | Southeast | Midwest | West |
| 3 | 1988 | 11/02/1987 | East | Midwest | Southeast | West |
| 4 | 1989 | 10/31/1988 | East | West | Midwest | Southeast |
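
The `DayZero` column lets us turn the `DayNum` values in the results files into calendar dates. The sketch below assumes `DayNum` counts days elapsed since that season's `DayZero`, which is our reading of the data and worth confirming against the competition's data description. It uses toy rows copied from the outputs above.

```python
import pandas as pd

seasons = pd.DataFrame({"Season": [1985], "DayZero": ["10/29/1984"]})
games = pd.DataFrame({"Season": [1985, 1985], "DayNum": [20, 25]})

# Parse DayZero as a date, attach it to each game, then add DayNum days.
seasons["DayZero"] = pd.to_datetime(seasons["DayZero"], format="%m/%d/%Y")
games = games.merge(seasons, on="Season", how="left", validate="many_to_one")
games["GameDate"] = games["DayZero"] + pd.to_timedelta(games["DayNum"], unit="D")
print(games[["Season", "DayNum", "GameDate"]])
```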


In [None]:
# Assumes the DataFrames MTeams, MNCAATournamentSeeds,
# MRegularSeasonCompactResults, and MSeasons are loaded as above.

# 1. Add team details for the winning team. No column names collide here,
#    so the '_winner' suffix is never applied and the winner's columns keep
#    their plain names (TeamID, TeamName, ...).
merged_df = pd.merge(MRegularSeasonCompactResults, MTeams, left_on='WTeamID', right_on='TeamID', suffixes=('', '_winner'))

# 2. Add team details for the losing team; colliding columns get '_loser'
merged_df = pd.merge(merged_df, MTeams, left_on='LTeamID', right_on='TeamID', suffixes=('', '_loser'))

# 3. Add tournament seeds. 'TeamID' is the winner's ID at this point, so 'Seed'
#    is the winning team's seed. The left merge keeps games whose winner was unseeded.
merged_df = pd.merge(merged_df, MNCAATournamentSeeds, on=['Season', 'TeamID'], how='left')

# 4. Add season-level details (DayZero and region names)
merged_df = pd.merge(merged_df, MSeasons, on='Season')

# 'merged_df' now contains all the combined information
merged_df.head()

| | Season | DayNum | WTeamID | WScore | LTeamID | LScore | WLoc | NumOT | TeamID | TeamName | ... | TeamID_loser | TeamName_loser | FirstD1Season_loser | LastD1Season_loser | Seed | DayZero | RegionW | RegionX | RegionY | RegionZ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1985 | 20 | 1228 | 81 | 1328 | 64 | N | 0 | 1228 | Illinois | ... | 1328 | Oklahoma | 1985 | 2025 | W03 | 10/29/1984 | East | West | Midwest | Southeast |
| 1 | 1985 | 25 | 1106 | 77 | 1354 | 70 | H | 0 | 1106 | Alabama St | ... | 1354 | S Carolina St | 1985 | 2025 | NaN | 10/29/1984 | East | West | Midwest | Southeast |
| 2 | 1985 | 25 | 1112 | 63 | 1223 | 56 | H | 0 | 1112 | Arizona | ... | 1223 | Houston Chr | 1985 | 2025 | X10 | 10/29/1984 | East | West | Midwest | Southeast |
| 3 | 1985 | 25 | 1165 | 70 | 1432 | 54 | H | 0 | 1165 | Cornell | ... | 1432 | Utica | 1985 | 1987 | NaN | 10/29/1984 | East | West | Midwest | Southeast |
| 4 | 1985 | 25 | 1192 | 86 | 1447 | 74 | H | 0 | 1192 | F Dickinson | ... | 1447 | Wagner | 1985 | 2025 | Z16 | 10/29/1984 | East | West | Midwest | Southeast |


Now, we have a basic merged dataset to start with. Kimberly, if you have a better one, we can use that or merge those too.
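
One thing to watch in the merge above is that only the winning team's seed is attached. A possible refinement, sketched here on toy data, is to attach both teams' seeds under explicit names and to guard against accidental row duplication. `add_seed`, `WSeed`, and `LSeed` are hypothetical names, and the `Y01` seed for team 1328 is a made-up value.

```python
import pandas as pd

results = pd.DataFrame({
    "Season": [1985, 1985],
    "WTeamID": [1228, 1106],
    "LTeamID": [1328, 1354],
})
seeds = pd.DataFrame({
    "Season": [1985, 1985],
    "Seed": ["W03", "Y01"],  # "Y01" for team 1328 is a made-up value
    "TeamID": [1228, 1328],
})

def add_seed(games, seeds, team_col, seed_col):
    """Left-join seeds onto games for the team in `team_col`, stored as `seed_col`."""
    lookup = seeds.rename(columns={"TeamID": team_col, "Seed": seed_col})
    # validate raises if a (Season, team) pair appears more than once in seeds.
    return games.merge(lookup, on=["Season", team_col], how="left",
                       validate="many_to_one")

merged = add_seed(results, seeds, "WTeamID", "WSeed")
merged = add_seed(merged, seeds, "LTeamID", "LSeed")
assert len(merged) == len(results)  # a left join on unique keys adds no rows
print(merged)
```

Teams that did not make the tournament simply get a missing value in `WSeed` or `LSeed`.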

In [14]:
import pandas as pd

# Adjust DATA_DIR to your local copy of the repo's data folder
DATA_DIR = "/Users/kimberlyliu/Downloads/DS 3001/DS3001-Project/data"

# Section 1
MTeams_df = pd.read_csv(f"{DATA_DIR}/MTeams.csv")
MSeasons_df = pd.read_csv(f"{DATA_DIR}/MSeasons.csv")
MTourneySeeds_df = pd.read_csv(f"{DATA_DIR}/MNCAATourneySeeds.csv")
MRegularSeasonCompactResults_df = pd.read_csv(f"{DATA_DIR}/MRegularSeasonCompactResults.csv")

# Section 2
MRegularSeasonDetailedResults_df = pd.read_csv(f"{DATA_DIR}/MRegularSeasonDetailedResults.csv")
MNCAATourneyDetailedResults_df = pd.read_csv(f"{DATA_DIR}/MNCAATourneyDetailedResults.csv")
MTeamConferences_df = pd.read_csv(f"{DATA_DIR}/MTeamConferences.csv")
MGameCities_df = pd.read_csv(f"{DATA_DIR}/MGameCities.csv")
MConferenceTourneyGames_df = pd.read_csv(f"{DATA_DIR}/MConferenceTourneyGames.csv")
# Massey ordinal rankings; referred to as `df` in the ranking merge below
df = pd.read_csv(f"{DATA_DIR}/MMasseyOrdinals.csv")
MNCAATourneySlots_df = pd.read_csv(f"{DATA_DIR}/MNCAATourneySlots.csv")

## Drop teams whose last D1 season was before 2003
MTeams_df = MTeams_df[MTeams_df['LastD1Season'] >= 2003]

## Stack regular season and NCAA tournament detailed results, then sort chronologically
merged_df = pd.concat([MRegularSeasonDetailedResults_df, MNCAATourneyDetailedResults_df], ignore_index=True)
# sort_values returns a new DataFrame, so the result must be assigned back
merged_df = merged_df.sort_values(['Season', 'DayNum'], ignore_index=True)

## Add Team Name for Win and Lose 
merged_df['WTeamName'] = merged_df['WTeamID'].map(MTeams_df.set_index('TeamID')['TeamName'])
merged_df['LTeamName'] = merged_df['LTeamID'].map(MTeams_df.set_index('TeamID')['TeamName'])

## add game type and city id 
merged_df = merged_df.merge(
    MGameCities_df[['Season', 'DayNum', 'WTeamID', 'LTeamID', 'CRType', 'CityID']],
    on=['Season', 'DayNum', 'WTeamID', 'LTeamID'],
    how='left'
)


## Issue: game city data is only valid from 2010 onward, so CRType and CityID are NaN for earlier games

# Bring in the winning team's conference abbreviation (ConfAbbrev)
merged_df = merged_df.merge(
    MTeamConferences_df[['Season', 'TeamID', 'ConfAbbrev']],
    left_on=['Season', 'WTeamID'],
    right_on=['Season', 'TeamID'],
    how='left'
)

# Rename the imported column to WConf and drop the duplicate TeamID column from the merge
merged_df.rename(columns={'ConfAbbrev': 'WConf'}, inplace=True)
merged_df.drop(columns=['TeamID'], inplace=True)


# Repeat for the losing team's conference abbreviation, stored as LConf
merged_df = merged_df.merge(
    MTeamConferences_df[['Season', 'TeamID', 'ConfAbbrev']],
    left_on=['Season', 'LTeamID'],
    right_on=['Season', 'TeamID'],
    how='left'
)

merged_df.rename(columns={'ConfAbbrev': 'LConf'}, inplace=True)
merged_df.drop(columns=['TeamID'], inplace=True)

# Merge ranking information for the winning team.
# Caveats: this only matches rankings published on the game's own DayNum, and
# MMasseyOrdinals has one row per ranking system, so a game can appear once per
# matching system after the merge.
merged_df = merged_df.merge(
    df[['Season', 'RankingDayNum', 'TeamID', 'SystemName', 'OrdinalRank']],
    left_on=['Season', 'DayNum', 'WTeamID'],
    right_on=['Season', 'RankingDayNum', 'TeamID'],
    how='left'
)

merged_df.rename(columns={'OrdinalRank': 'WOrdinalRank'}, inplace=True)
merged_df.drop(columns=['RankingDayNum', 'TeamID'], inplace=True)
# Note: pre-season games do not have rankings
merged_df.head()


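| | Season | DayNum | WTeamID | WScore | LTeamID | LScore | WLoc | NumOT | WFGM | WFGA | ... | LBlk | LPF | WTeamName | LTeamName | CRType | CityID | WConf | LConf | SystemName | WOrdinalRank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | 10 | 1104 | 68 | 1328 | 62 | N | 0 | 27 | 58 | ... | 2 | 20 | Alabama | Oklahoma | NaN | NaN | sec | big_twelve | NaN | NaN |
| 1 | 2003 | 10 | 1272 | 70 | 1393 | 63 | N | 0 | 26 | 62 | ... | 6 | 16 | Memphis | Syracuse | NaN | NaN | cusa | big_east | NaN | NaN |
| 2 | 2003 | 11 | 1266 | 73 | 1437 | 61 | N | 0 | 24 | 58 | ... | 5 | 23 | Marquette | Villanova | NaN | NaN | cusa | big_east | NaN | NaN |
| 3 | 2003 | 11 | 1296 | 56 | 1457 | 50 | N | 0 | 18 | 38 | ... | 3 | 23 | N Illinois | Winthrop | NaN | NaN | mac | big_south | NaN | NaN |
| 4 | 2003 | 11 | 1400 | 77 | 1208 | 71 | N | 0 | 30 | 61 | ... | 1 | 14 | Texas | Georgia | NaN | NaN | sec | big_twelve | NaN | NaN |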
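
The empty `SystemName` and `WOrdinalRank` columns above reflect that the ranking merge only matches when a ranking was published on the exact game day. One alternative, which is not what the cell above does, is to pick a single ranking system and use `pd.merge_asof` to attach the most recent ranking on or before each game. The sketch below uses toy data; the system names `AAA` and `BBB` and all rank values are made up.

```python
import pandas as pd

games = pd.DataFrame({
    "Season": [2003, 2003, 2003],
    "DayNum": [10, 40, 60],
    "WTeamID": [1104, 1104, 1272],
})
ordinals = pd.DataFrame({
    "Season": [2003, 2003, 2003, 2003],
    "RankingDayNum": [35, 35, 56, 35],
    "SystemName": ["AAA", "BBB", "AAA", "AAA"],  # made-up system names
    "TeamID": [1104, 1104, 1272, 1272],
    "OrdinalRank": [12, 15, 30, 41],             # made-up ranks
})

# Keep one system so each (Season, team, day) has at most one ranking.
one_system = ordinals[ordinals["SystemName"] == "AAA"].drop(columns="SystemName")
ranks = one_system.rename(columns={"TeamID": "WTeamID", "OrdinalRank": "WOrdinalRank"})

# merge_asof needs both frames sorted by their "on" keys.
with_ranks = pd.merge_asof(
    games.sort_values("DayNum"),
    ranks.sort_values("RankingDayNum"),
    left_on="DayNum",
    right_on="RankingDayNum",
    by=["Season", "WTeamID"],
    direction="backward",  # latest ranking with RankingDayNum <= DayNum
)
print(with_ranks)
```

Games played before a team's first ranking still get a missing value, which matches the note above about pre-season games, while the row count stays equal to the number of games.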
