# Data Parsing for Raw March Madness Tournament Data 1985 - 2019

Data for each March Madness tournament from 1985 to 2019 was compiled manually from www.sports-reference.com and imported into this notebook to be parsed into a more suitable format for analysis.

Source: [Historical NCAA March Madness statistics](https://www.sports-reference.com/cbb/postseason/)


## Import packages and raw data

The raw CSV data is imported as a data frame and manipulated with the `pandas` Python package.

The raw data is unorganized but consistently formatted - there are no descriptive row names, but each row position holds the same field in every year's column (e.g. the first row is a 1 seed, followed by the respective school name, followed by its first-round score, etc.). Much of this data is irrelevant to the analysis and will be removed - we will only be looking at top 4 seeds and their locations.

In [1]:
import pandas as pd

In [2]:
# import raw copy/pasted data with first round sites - mm for 'March Madness'
mm = pd.read_csv('../data/raw/mm-85-19-raw.csv')

# check imported data
mm

Unnamed: 0,2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,...,1994,1993,1992,1991,1990,1989,1988,1987,1986,1985
0,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,Duke,Villanova,Villanova,UNC,Villanova,Wichita State,Indiana,UNC,Ohio State,Duke,...,UNC,UNC,Duke,UNC,UConn,Georgetown,Temple,UNC,Duke,Georgetown
2,85,87,76,83,93,64,83,77,75,73,...,71,85,82,101,76,50,87,113,85,68
3,16,16,16,16,16,16,16,16,16,16,...,16,16,16,16,16,16,16,16,16,16
4,North Dakota State,Radford,Mount St. Mary's,Florida Gulf Coast,Lafayette,Cal Poly,James Madison,Vermont,UTSA,Arkansas-Pine Bluff,...,Liberty,East Carolina,Campbell,Northeastern,Boston University,Princeton,Lehigh,Penn,Mississippi Valley State,Lehigh
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
215,Tennessee,Cincinnati,Arizona,Michigan State,Arizona,Wisconsin,Georgetown,Ohio State,San Diego State,Kansas State,...,Arizona,Arizona,Indiana,Arizona,Arizona,Indiana,UNC,Iowa,Louisville,VCU
216,77,68,100,81,93,75,68,78,68,82,...,81,61,94,93,79,99,83,99,93,81
217,15,15,15,15,15,15,15,15,15,15,...,15,15,15,15,15,15,15,15,15,15
218,Colgate,Georgia State,North Dakota,Middle Tennessee,Texas Southern,American,Florida Gulf Coast,Loyola (MD),Northern Colorado,North Texas,...,Loyola (MD),Santa Clara,Eastern Illinois,Saint Francis (PA),South Florida,George Mason,North Texas,Santa Clara,Drexel,Marshall


## Loop through raw data and pull out relevant fields

The fields we are interested in retaining are (1) seed, (2) school, (3) host site, and (4) year. We also want to add a unique ID to each school for each year to associate with its host site in a separate row of data. Because the raw data has no row labels, each field is located by its fixed position relative to the seed row. We primarily use list comprehensions to iterate through the raw data frame.

In [3]:
# instantiate empty lists to push new data to
seedsAll = []
schoolsAll = []
sitesAll = []
yearsAll = []

# seeds to keep, in the order their rows are appended within each year
topSeeds = ['1', '2', '3', '4']

# loop through each column of data (one column per year)
for year in mm:
    # each row in column
    yearData = list(mm[year])
    for seed in topSeeds:
        # row positions where this seed appears
        seedRows = [i for i in range(len(yearData)) if yearData[i] == seed]
        # site is 1 row above a 2 seed and 6 rows below a 1, 3, or 4 seed
        siteOffset = -1 if seed == '2' else 6
        # all are in [] = [] + [list comprehension] format to append each row's data to the same list
        seedsAll = seedsAll + [yearData[i] for i in seedRows]
        # associated school, which is one row below the seed number
        schoolsAll = schoolsAll + [yearData[i+1] for i in seedRows]
        # remove 'at ' prefix - splitting only once protects site names that contain 'at '
        sitesAll = sitesAll + [yearData[i+siteOffset].split('at ', 1)[1] for i in seedRows]
        # year for each row
        yearsAll = yearsAll + [year for i in seedRows]
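
To make the positional offsets concrete, here is a minimal sketch on a made-up column. The values and the exact row layout of `toyColumn` are illustrative assumptions, not rows from the raw file, and `extract` is a hypothetical helper; only the offsets (school at `i+1`, site at `i+6` for 1, 3, and 4 seeds, site at `i-1` for 2 seeds) come from the notebook.

```python
# hypothetical column mimicking the raw layout (all values are made up)
toyColumn = [
    '1', 'School A', '80', '16', 'School P', '60', 'at City One, AA',
    'at City Two, BB', '2', 'School B', '75', '15', 'School O', '58',
]

def extract(column, seed):
    # site sits 1 row above a 2 seed and 6 rows below any other top seed
    siteOffset = -1 if seed == '2' else 6
    rows = [i for i, value in enumerate(column) if value == seed]
    # split only once so a place name containing 'at ' is not cut short
    return [
        (column[i], column[i + 1], column[i + siteOffset].split('at ', 1)[1])
        for i in rows
    ]

oneSeeds = extract(toyColumn, '1')
twoSeeds = extract(toyColumn, '2')
print(oneSeeds)
print(twoSeeds)
```

Because the lookup is purely positional, any year whose column deviates from this layout would silently return the wrong school or site, so spot-checking a few years against the source pages is worthwhile.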


## Reformat data from lists to single master data frame

Each field of data was saved as a list through our list comprehension loop above. We ultimately want all of our data in a single data frame. We first add all of the lists to a dictionary with named keys, then use `pandas` to convert that object to a dataframe.

In [4]:
# put lists in a dictionary to apply column names
objAll = {'seed': seedsAll, 'school': schoolsAll, 'site': sitesAll, 'year': yearsAll}

# convert to data frame - dictionary keys are now columns
dfSchools = pd.DataFrame(objAll)

## Create unique IDs

Although each row of data includes both the school and its first-round site, we will ultimately want to split these fields into two rows to have different geometries for the school and site locations. We still want these rows to be associated together, so we create a unique ID for each row of data by concatenating its year and index.

In [5]:
# apply unique id to each row by pasting year and index
dfSchools['id'] = [dfSchools.year[d] + str(d) for d in list(dfSchools.index.values)]

# check data
dfSchools.tail()

Unnamed: 0,seed,school,site,year,id
555,3,NC State,"Albuquerque, NM",1985,1985555
556,4,Loyola (IL),"Hartford, CT",1985,1985556
557,4,Ohio State,"Tulsa, OK",1985,1985557
558,4,LSU,"Dayton, OH",1985,1985558
559,4,UNLV,"Salt Lake City, UT",1985,1985559
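
A quick sanity check can catch offset mistakes at this stage. The sketch below runs on a small stand-in frame (`dfToy`, hypothetical data) rather than on `dfSchools`. The assumption it encodes is that every tournament has four regions with one each of seeds 1 through 4, so 16 rows per year, which is consistent with the 560 rows (35 tournaments, index 0 to 559) shown above.

```python
import pandas as pd

# hypothetical stand-in for dfSchools: two years, four regions, seeds 1-4
rows = [
    {'seed': str(seed), 'school': f'School {year}-{region}-{seed}', 'year': year}
    for year in ['2019', '2018']
    for seed in range(1, 5)
    for region in range(4)
]
dfToy = pd.DataFrame(rows)

# same id scheme as the notebook: year concatenated with the row index
dfToy['id'] = [dfToy.year[d] + str(d) for d in dfToy.index]

# each (year, seed) pair should appear exactly four times, once per region
counts = dfToy.groupby(['year', 'seed']).size()
allFour = bool((counts == 4).all())

# ids should never repeat, because the row index is unique
idsUnique = bool(dfToy['id'].is_unique)

print(counts)
print(allFour, idsUnique)
```

Running the same two checks against `dfSchools` should flag any year where a seed row was missed or matched twice.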


## Create new rows for each site

Again, we want to separate each school location from its site but keep the two associated by a unique id. Using the same formatting as above, we pull out each site and id and create new rows of data that keep only those fields.

In [6]:
# copy all sites and ids
sites = [site for site in dfSchools.site]
ids = [uid for uid in dfSchools.id]

# new dictionary with only matching sites and ids
siteDict = {'seed': None, 'school': None, 'site': sites, 'year': None, 'id': ids}

# new data frame with each site in its own row
siteDf = pd.DataFrame(siteDict)

# append all rows in single dataframe (DataFrame.append was removed in pandas 2.0, so use pd.concat)
dfAll = pd.concat([dfSchools, siteDf], ignore_index=True)

# check!
dfAll.tail()

Unnamed: 0,seed,school,site,year,id
1115,,,"Albuquerque, NM",,1985555
1116,,,"Hartford, CT",,1985556
1117,,,"Tulsa, OK",,1985557
1118,,,"Dayton, OH",,1985558
1119,,,"Salt Lake City, UT",,1985559
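
The pairing of school rows and site rows can be checked on a small example. This is a sketch on hypothetical data (`dfToy` and `siteToy` are stand-ins, not the notebook's frames), built with `reindex` and `pd.concat` as one possible way to create the site-only rows.

```python
import pandas as pd

# hypothetical two-row stand-in for dfSchools
dfToy = pd.DataFrame({
    'seed': ['1', '2'],
    'school': ['School A', 'School B'],
    'site': ['City One, AA', 'City Two, BB'],
    'year': ['2019', '2019'],
    'id': ['20190', '20191'],
})

# site-only rows: keep site and id, leave every other column empty
siteToy = dfToy[['site', 'id']].reindex(columns=dfToy.columns)

# stack school rows on top of site rows
toyAll = pd.concat([dfToy, siteToy], ignore_index=True)

# every id should now appear twice: once with a school, once without
idCounts = toyAll['id'].value_counts()
print(toyAll)
print(idCounts)
```

If any id appears more or fewer than two times in the combined frame, a school row has lost its matching site row or gained an extra one.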


## Write to CSV

For now, we just want our data in CSV format for analysis. Once all data is finalized, we will convert it to JSON for web mapping.

In [7]:
# write to csv
dfAll.to_csv('../data/cleaned/mm-85-19-cleaned.csv', index=False)
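
The later JSON conversion is not part of this notebook. As a rough sketch of one possible shape, `DataFrame.to_json` with `orient='records'` produces one JSON object per row. The frame below is hypothetical stand-in data, and the structure the web map will actually need (for example GeoJSON with coordinates) is an open assumption.

```python
import json
import pandas as pd

# hypothetical stand-in for dfAll: one school row and its matching site row
dfToy = pd.DataFrame({
    'seed': ['1', None],
    'school': ['School A', None],
    'site': ['City One, AA', 'City One, AA'],
    'year': ['2019', None],
    'id': ['20190', '20190'],
})

# orient='records' gives a list of row objects; missing values become null
jsonText = dfToy.to_json(orient='records')
records = json.loads(jsonText)
print(records[1])
```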