# Kaggle March Madness (Stage 1) - Data Understanding

This section is used to gather information on the data provided for the competition. In subsequent notebooks we will move forward with data analysis and exploration.  

In [3]:
# !kaggle competitions download -c mens-march-mania-2022
# !ls
# cd ncaa_basketball/
# ls
# !unzip mens-march-mania-2022.zip

In [1]:
import pandas as pd

In [2]:
cd ncaa_basketball/MDataFiles_Stage1/

/home/kyle/Desktop/kaggle/ncaa_basketball/MDataFiles_Stage1


Data Description:

Each season there are thousands of NCAA basketball games played between Division I men's teams, culminating in March Madness®, the 68-team national championship that starts in the middle of March. We have provided a large amount of historical data about college basketball games and teams, going back many years. Armed with this historical data, you can explore it and develop your own distinctive ways of predicting March Madness® game outcomes. You can even evaluate and compare different approaches by seeing which of them would have done best at predicting tournament games from the past.

If you are unfamiliar with the format and intricacies of the NCAA® tournament, we encourage reading the [wikipedia](https://en.wikipedia.org/wiki/NCAA_Division_I_Men%27s_Basketball_Tournament) page before diving into the data.  The data description and schema may seem daunting at first, but is not as complicated as it appears.

As a reminder, you are encouraged to incorporate your own sources of data. We have provided extensive historical data to jump-start the modeling process, and this data is self-consistent (for instance, dates and team ID's are always treated the same way). Nevertheless, you may also be able to make productive use of external data. If you head down this path, please be forewarned that many sources have their own distinctive way of identifying the names of teams, and this can make it challenging to link up with our data. The MTeamSpellings file, which is listed in the bottom section below, may help you map external team references into our own Team ID structure, and you may also need to understand exactly how dates work in our data.

We extend our gratitude to [Kenneth Massey](https://masseyratings.com/) for providing much of the historical data.

Special acknowledgment to Jeff Sonas of Sonas Consulting for his support in assembling the dataset for this competition.

### What to predict

Stage 1 - You should submit predicted probabilities for every possible matchup in the past 5 NCAA® tournaments (2016-2019 and 2021). Note that there was no tournament held in 2020.

Stage 2 - You should submit predicted probabilities for every possible matchup before the 2022 tournament begins.

Refer to the Timeline page for specific dates. In both stages, the sample submission will tell you which games to predict.

### File descriptions

Below we describe the format and fields of the contest data files. All of the files are complete through February 7th of the current season. At the start of Stage 2, we will provide updates to these files to incorporate data from the remaining weeks of the current season.

### Data Section 1 - The Basics

This section provides everything you need to build a simple prediction model and submit predictions.

   * Team ID's and Team Names
   * Tournament seeds since 1984-85 season
   * Final scores of all regular season, conference tournament, and NCAA® tournament games since 1984-85 season
   * Season-level details including dates and region names
   * Example submission file for stage 1

Special note about "Season" numbers: the college basketball season lasts from early November until the national championship tournament that starts in the middle of March. For instance, this year the first men’s Division I games were played on November 9th, 2021 and the men’s national championship game will be played on April 4th, 2022. Because a basketball season spans two calendar years like this, it can be confusing to refer to the year of the season. By convention, when we identify a particular season, we will reference the year that the season ends in, not the year that it starts in. So for instance, the current season will be identified in our data as the 2022 season, not the 2021 season or the 2021-22 season or the 2021-2022 season, though you may see any of these in everyday use outside of our data.

# Summary

Project Goals: submit predicted probabilities that the team with the lower id beats the team with the higher id.

* Seasons are denoted by their ending year 
    * I.E. season 2021-2022 will show 2022
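
As a minimal sketch of the project goal, the helper below converts one game result into the "lower ID vs. higher ID" form that the predictions are expressed in. The function name `to_lower_id_target` is hypothetical; the two team IDs are taken from the first row of the regular season compact results shown later in this notebook.

```python
from typing import Tuple


def to_lower_id_target(w_team_id: int, l_team_id: int) -> Tuple[int, int, int]:
    """Return (lower_id, higher_id, label); label is 1 if the lower-ID team won."""
    lower_id = min(w_team_id, l_team_id)
    higher_id = max(w_team_id, l_team_id)
    label = 1 if w_team_id == lower_id else 0
    return lower_id, higher_id, label


# Team 1228 beat team 1328, so the lower-ID team won and the label is 1.
example = to_lower_id_target(1228, 1328)
print(example)
```

A model would then be trained to predict the probability that this label equals 1.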
    
Section 1: Teams

* 378 teams in dataset 
    * current 358 teams in division 1
        * does this need to be reconciled?

Teams Dataframe: 
* 4 Columns
    * TeamID,TeamName,FirstD1Season,LastD1Season
        * Team ID = number between 1000-1999
        * TeamName = abbreviated team name (MAX len = 16)
        * FirstD1Season = first season of play in Division I (1985 at the earliest)
        * LastD1Season = last season of play in Division I
             * Note: This will help us exclude, or assign a probability of 0 to, a team that is no longer in Division I


Seasons Dataframe:
* 6 Columns
    * Season,DayZero,RegionW,RegionX,RegionY,RegionZ
        * Season = Season Year
        * DayZero = date corresponding to DayNum=0 for the season
        * RegionW,X,Y,Z - names of the four tournament regions
            * Whichever region name comes first alphabetically is assigned to W
                * Whichever region plays W in the national semifinals is assigned to X
            * Of the remaining two regions, whichever comes first alphabetically is assigned to Y
                * The other region is assigned to Z
                
Tourney Seeds Dataframe:
* 3 Columns:
    * Season,Seed,TeamID
        * Season = year tournament was played
        * Seed = typically denoted using W,X,Y,Z followed by an integer 1-16
            * W, X, Y, Z denotes region as shown on the Season Dataframe
            * integer denotes rank within that region
                * Example: W01 is the #1 seed in region W for the given year (Season column)
            * alternatively, if a team is a "play-in" team:
                * the two teams that played an additional game to get into the tournament will have seeds ending in 'a' or 'b'
        * TeamID - identifies the team, as listed in the Teams Dataframe
        
Regular Season Compact Results Dataframe:
* 8 Columns:
    * Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
        * Season = season year (associated with Seasons)
        * DayNum = day number (ranging from 0-132)
            * Note: Regular season ends at day 132
            * Note: As previously mentioned day zero is the start of the season
        * WTeamID = winning team's ID
        * WScore = winning team's score
        * LTeamID = losing team's ID
        * LScore = losing team's score
        * WLoc = location of the winning team (H,A,N)
            * H = Home
            * A = Away
            * N = Neutral
        * NumOT = number of overtime periods (0-n)
        
Tournament Compact Results
* 8 Columns:
    * Data is identical to Regular Season Compact Results Dataframe
        * Useful Info
            * DayNum=134 or 135 (Tue/Wed) - play-in games to get the tournament field down to the final 64 teams
            * DayNum=136 or 137 (Thu/Fri) - Round 1, to bring the tournament field from 64 teams to 32 teams
            * DayNum=138 or 139 (Sat/Sun) - Round 2, to bring the tournament field from 32 teams to 16 teams
            * DayNum=143 or 144 (Thu/Fri) - Round 3, otherwise known as "Sweet Sixteen", to bring the tournament field from 16 teams to 8 teams
            * DayNum=145 or 146 (Sat/Sun) - Round 4, otherwise known as "Elite Eight" or "regional finals", to bring the tournament field from 8 teams to 4 teams
            * DayNum=152 (Sat) - Round 5, otherwise known as "Final Four" or "national semifinals", to bring the tournament field from 4 teams to 2 teams
            * DayNum=154 (Mon) - Round 6, otherwise known as "national final" or "national championship", to bring the tournament field from 2 teams to 1 champion team

Regular Season Detailed Results
* 34 Columns:
    * First 8 columns are identical to the Regular Season Compact Results Dataframe
        * Provides data from the 2003 season onward

Tournament Detailed Results
* 34 Columns:
    * First 8 columns are identical to the Regular Season Compact Results Dataframe
        * Provides data from the 2003 season onward

Detailed Results include:
   * WFGM - field goals made (by the winning team)
   * WFGA - field goals attempted (by the winning team)
   * WFGM3 - three pointers made (by the winning team)
   * WFGA3 - three pointers attempted (by the winning team)
   * WFTM - free throws made (by the winning team)
   * WFTA - free throws attempted (by the winning team)
   * WOR - offensive rebounds (pulled by the winning team)
   * WDR - defensive rebounds (pulled by the winning team)
   * WAst - assists (by the winning team)
   * WTO - turnovers committed (by the winning team)
   * WStl - steals (accomplished by the winning team)
   * WBlk - blocks (accomplished by the winning team)
   * WPF - personal fouls committed (by the winning team)
   * The same 13 stats are repeated with an L prefix for the losing team (LFGM through LPF)

Cities
* 3 Columns:
    * CityID,City,State
        * CityID = four digit ID 
        * City = city name
        * State = state city is located

Game Cities
* 6 Columns
    * Season, DayNum, WTeamID, LTeamID, CRType, CityID
        * Season, DayNum, WTeamID, LTeamID = identical to past dataframes
        * CRType = Regular, NCAA, Secondary
            * Regular = Regular Season
            * NCAA = Tournament
            * Secondary = Secondary Tournament (alternative to NCAA)
                * See below notes 
        * CityID = ID of city where game was played
        
Massey Ordinals
* Note: these are external rankings and will add complexity to the model. Build a foundation for the model first and consider incorporating these rankings at a later date.


Coaches
* 5 Columns
    * Season,TeamID,FirstDayNum,LastDayNum,CoachName
        * Season = season year
        * TeamID = team ID 
        * FirstDayNum = coach's first day (DayNum)
        * LastDayNum = coach's last day (DayNum)
            * Note: FirstDayNum/LastDayNum: 
                * A coach who serves a full season will have 0 for FirstDayNum and 154 for LastDayNum
                * Coaches who started/stopped coaching during the season will have the applicable start/end day numbers
        * CoachName = coach's name
        
Conferences
* 2 Columns
    * ConfAbbrev, Description
        * ConfAbbrev = conference abbreviation
        * Description = full conference name
            * Note: Data goes back to the 1985 season. The dataset does not take into account conference mergers. 
            
Team Conferences
* 3 Columns
    * Season,TeamID,ConfAbbrev
        * Season = year
        * TeamID = team ID
        * ConfAbbrev = conference abbreviation
            * Note: This should help clarify conference changes, since it shows each team's conference by season. 
            
Conference Tournament Games
* 5 Columns
    * ConfAbbrev, Season, DayNum, WTeamID, LTeamID
        * ConfAbbrev = conference abbreviation
        * Season = season year
        * DayNum = day number, counted from day 0
        * WTeamID = winning team ID
        * LTeamID = losing team ID
            * Note: All of these games are already included in the Regular Season Compact and Detailed Results
                * Note: This file identifies which of those games occurred during a conference tournament

Secondary Tournament Teams & Secondary Tourney Compact Results
* Note: These datasets cover alternative postseason tournaments that are not part of the NCAA tournament. 
    * To Do: Get initial structure in place and possibly incorporate this data to see if model improves
    
Team Spellings
* Note: Includes alternative spellings of team names. Could be used for aligning external datasets
    * Come back to this if you decide to include external data
    
Tournament Seed Round Slots 
* 5 Columns
    * Seed,GameRound,GameSlot,EarlyDayNum,LateDayNum
       * Seed - this is the tournament seed of the team.
       * GameRound - round during the tournament that the game would occur in, where Round 0 (zero) is for the play-in games, Rounds 1/2 are for the first weekend, Rounds 3/4 are for the second weekend, and Rounds 5/6 are the national semifinals and finals.
       * GameSlot - this is the game slot that the team would be playing in, during the given GameRound. The naming convention for slots is described above, in the definition of the MNCAATourneySlots file.
       * EarlyDayNum, LateDayNum - these fields describe the earliest possible, and latest possible, DayNums that the game might be played on.

### Teams

Data Section 1 file: MTeams.csv

This file identifies the different college teams present in the dataset. Each school is uniquely identified by a 4 digit id number. You will not see games present for all teams in all seasons, because the games listing is only for matchups where both teams are Division-I teams. There are 358 teams currently in Division-I, and an overall total of 372 teams in our team listing (each year, some teams might start being Division-I programs, and others might stop being Division-I programs).

  * TeamID - a 4 digit id number, from 1000-1999, uniquely identifying each NCAA® men's team. A school's TeamID does not change from one year to the next, so for instance the Duke men's TeamID is 1181 for all seasons. To avoid possible confusion between the men's data and the women's data, all of the men's team ID's range from 1000-1999, whereas all of the women's team ID's range from 3000-3999.
  * TeamName - a compact spelling of the team's college name, 16 characters or fewer. There are no commas or double-quotes in the team names, but you will see some characters that are not letters or spaces, e.g., Texas A&M, St Mary's CA, TAM C. Christi, and Bethune-Cookman.
  *  FirstD1Season - the first season in our dataset that the school was a Division-I school. For instance, FL Gulf Coast (famously) was not a Division-I school until the 2008 season, despite their two wins just five years later in the 2013 NCAA® tourney. Of course, many schools were Division-I far earlier than 1985, but since we don't have any data included prior to 1985, all such teams are listed with a FirstD1Season of 1985.
  *  LastD1Season - the last season in our dataset that the school was a Division-I school. For any teams that are currently Division-I, they will be listed with LastD1Season=2022, and you can confirm there are 358 such teams.
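
The team counts quoted above can be checked directly from the file. The sketch below uses a tiny in-memory stand-in for MTeams.csv rather than the real file: the first two rows appear in this notebook's output, while team 1999 ("Retired U") is a hypothetical row added only to illustrate a former Division-I program.

```python
import io

import pandas as pd

# Toy stand-in for MTeams.csv; the last row is hypothetical.
sample_csv = io.StringIO(
    "TeamID,TeamName,FirstD1Season,LastD1Season\n"
    "1101,Abilene Chr,2014,2022\n"
    "1102,Air Force,1985,2022\n"
    "1999,Retired U,1985,1990\n"
)
teams_sample = pd.read_csv(sample_csv)

# Total number of teams listed, and those still in Division I in 2022.
total_teams = len(teams_sample)
current_d1 = teams_sample[teams_sample["LastD1Season"] == 2022]
print(total_teams, len(current_d1))
```

Running the same two lines against the real `teams` DataFrame would show how many teams are listed overall and how many are currently Division I.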


In [3]:
teams = pd.read_csv('MTeams.csv')

teams.head()

Unnamed: 0,TeamID,TeamName,FirstD1Season,LastD1Season
0,1101,Abilene Chr,2014,2022
1,1102,Air Force,1985,2022
2,1103,Akron,1985,2022
3,1104,Alabama,1985,2022
4,1105,Alabama A&M,2000,2022


### Seasons

Data Section 1 file: MSeasons.csv

This file identifies the different seasons included in the historical data, along with certain season-level properties.

   * Season - indicates the year in which the tournament was played. Remember that the current season counts as 2022.
   * DayZero - tells you the date corresponding to DayNum=0 during that season. All game dates have been aligned upon a common scale so that (each year) the Monday championship game of the men's tournament is on DayNum=154. Working backward, the national semifinals are always on DayNum=152, the "play-in" games are on days 134 and 135, Selection Sunday is on day 132, the final day of the regular season is also day 132, and so on. All game data includes the day number in order to make it easier to perform date calculations. If you need to know the exact date a game was played on, you can combine the game's "DayNum" with the season's "DayZero". For instance, since day zero during the 2011-2012 season was 10/31/2011, if we know that the earliest regular season games that year were played on DayNum=7, they were therefore played on 11/07/2011.
   * RegionW, RegionX, Region Y, Region Z - by our contests' convention, each of the four regions in the final tournament is assigned a letter of W, X, Y, or Z. Whichever region's name comes first alphabetically, that region will be Region W. And whichever Region plays against Region W in the national semifinals, that will be Region X. For the other two regions, whichever region's name comes first alphabetically, that region will be Region Y, and the other will be Region Z. This allows us to identify the regions and brackets in a standardized way in other files, even if the region names change from year to year. For instance, during the 2012 tournament, the four regions were East, Midwest, South, and West. Being the first alphabetically, East becomes W. Since the East regional champion (Ohio State) played against the Midwest regional champion (Kansas) in the national semifinals, that makes Midwest be region X. For the other two (South and West), since South comes first alphabetically, that makes South Y and therefore West is Z. So for that season, the W/X/Y/Z are East,Midwest,South,West. And so for instance, Ohio State, the #2 seed in the East, is listed in the MNCAATourneySeeds file that year with a seed of W02, meaning they were the #2 seed in the W region (the East region). We will not know the final W/X/Y/Z designations until Selection Sunday, because the national semifinal pairings in the Final Four will depend upon the overall ranks of the four #1 seeds.
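
The DayZero arithmetic described above can be sketched with the standard library. The helper name `game_date` is hypothetical; the example reuses the 2011-2012 figures from the DayZero description (day zero of 10/31/2011 and DayNum=7).

```python
from datetime import date, timedelta


def game_date(day_zero: date, day_num: int) -> date:
    """Return the calendar date of a game played DayNum days after DayZero."""
    return day_zero + timedelta(days=day_num)


# Day zero of the 2012 season plus DayNum=7.
example_date = game_date(date(2011, 10, 31), 7)
print(example_date.isoformat())
```

With pandas, the same idea would apply column-wise by parsing DayZero as a datetime and adding DayNum as a number of days.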

In [4]:
seasons = pd.read_csv('MSeasons.csv')

seasons.head()

Unnamed: 0,Season,DayZero,RegionW,RegionX,RegionY,RegionZ
0,1985,1984-10-29 00:00:00,East,West,Midwest,Southeast
1,1986,1985-10-28 00:00:00,East,Midwest,Southeast,West
2,1987,1986-10-27 00:00:00,East,Southeast,Midwest,West
3,1988,1987-11-02 00:00:00,East,Midwest,Southeast,West
4,1989,1988-10-31 00:00:00,East,West,Midwest,Southeast


### Tournament Seeds

Data Section 1 file: MNCAATourneySeeds.csv

This file identifies the seeds for all teams in each NCAA® tournament, for all seasons of historical data. Thus, there are between 64-68 rows for each year, depending on whether there were any play-in games and how many there were. In recent years the structure has settled at 68 total teams, with four "play-in" games leading to the final field of 64 teams entering Round 1 on Thursday of the first week (by definition, that is DayNum=136 each season). We will not know the seeds of the respective tournament teams, or even exactly which 68 teams it will be, until Selection Sunday on March 13, 2022 (DayNum=132).

   * Season - the year that the tournament was played in
   * Seed - this is a 3/4-character identifier of the seed, where the first character is either W, X, Y, or Z (identifying the region the team was in) and the next two digits (either 01, 02, ..., 15, or 16) tell you the seed within the region. For play-in teams, there is a fourth character (a or b) to further distinguish the seeds, since teams that face each other in the play-in games will have seeds with the same first three characters. The "a" and "b" are assigned based on which Team ID is lower numerically. As an example of the format of the seed, the first record in the file is seed W01 from 1985, which means we are looking at the #1 seed in the W region (which we can see from the "MSeasons.csv" file was the East region).
   * TeamID - this identifies the id number of the team, as specified in the MTeams.csv file
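
Because the Seed string packs up to three pieces of information, it can help to split it before modeling. The following is a minimal sketch; `ParsedSeed` and `parse_seed` are hypothetical names, and the pattern simply encodes the format described above (region letter, two digits, optional a/b).

```python
import re
from typing import NamedTuple, Optional


class ParsedSeed(NamedTuple):
    region: str             # W, X, Y, or Z
    number: int             # seed within the region, 1-16
    play_in: Optional[str]  # 'a' or 'b' for play-in teams, otherwise None


SEED_PATTERN = re.compile(r"^([WXYZ])(\d{2})([ab])?$")


def parse_seed(seed: str) -> ParsedSeed:
    """Split a seed such as 'W01' or 'X16b' into its components."""
    match = SEED_PATTERN.match(seed)
    if match is None:
        raise ValueError(f"Unrecognized seed format: {seed!r}")
    region, number, play_in = match.groups()
    return ParsedSeed(region, int(number), play_in)


print(parse_seed("W01"))
print(parse_seed("X16b"))
```

The numeric part is often the most useful feature, for example as the difference in seed number between two teams.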

In [5]:
tourney_seeds = pd.read_csv('MNCAATourneySeeds.csv')

tourney_seeds.head()

Unnamed: 0,Season,Seed,TeamID
0,1985,W01,1207
1,1985,W02,1210
2,1985,W03,1228
3,1985,W04,1260
4,1985,W05,1374


### Regular Season Compact Results

Data Section 1 file: MRegularSeasonCompactResults.csv

This file identifies the game-by-game results for many seasons of historical data, starting with the 1985 season (the first year the NCAA® had a 64-team tournament). For each season, the file includes all games played from DayNum 0 through 132. It is important to realize that the "Regular Season" games are simply defined to be all games played on DayNum=132 or earlier (DayNum=132 is Selection Sunday, and there are always a few conference tournament finals actually played early in the day on Selection Sunday itself). Thus a game played on or before Selection Sunday will show up here whether it was a pre-season tournament, a non-conference game, a regular conference game, a conference tournament game, or whatever.

   * Season - this is the year of the associated entry in MSeasons.csv (the year in which the final tournament occurs). For example, during the 2016 season, there were regular season games played between November 2015 and March 2016, and all of those games will show up with a Season of 2016.
   * DayNum - this integer always ranges from 0 to 132, and tells you what day the game was played on. It represents an offset from the "DayZero" date in the "MSeasons.csv" file. For example, the first game in the file was DayNum=20. Combined with the fact from the "MSeasons.csv" file that day zero was 10/29/1984 that year, this means the first game was played 20 days later, or 11/18/1984. There are no teams that ever played more than one game on a given date, so you can use this fact if you need a unique key (combining Season and DayNum and WTeamID). In order to accomplish this uniqueness, we had to adjust one game's date. In March 2008, the SEC postseason tournament had to reschedule one game (Georgia-Kentucky) to a subsequent day because of a tornado, so Georgia had to actually play two games on the same day. In order to enforce this uniqueness, we moved the game date for the Georgia-Kentucky game back to its original scheduled date.
   * WTeamID - this identifies the id number of the team that won the game, as listed in the "MTeams.csv" file. No matter whether the game was won by the home team or visiting team, or if it was a neutral-site game, the "WTeamID" always identifies the winning team.
   * WScore - this identifies the number of points scored by the winning team.
   * LTeamID - this identifies the id number of the team that lost the game.
   * LScore - this identifies the number of points scored by the losing team. Thus you can be confident that WScore will be greater than LScore for all games listed.
   * WLoc - this identifies the "location" of the winning team. If the winning team was the home team, this value will be "H". If the winning team was the visiting team, this value will be "A". If it was played on a neutral court, then this value will be "N". Sometimes it is unclear whether the site should be considered neutral, since it is near one team's home court, or even on their court during a tournament, but for this determination we have simply used the Kenneth Massey data in its current state, where the "@" sign is either listed with the winning team, the losing team, or neither team. If you would like to investigate this factor more closely, we invite you to explore Data Section 3, which provides the city that each game was played in, irrespective of whether it was considered to be a neutral site.
   * NumOT - this indicates the number of overtime periods in the game, an integer 0 or higher.
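
Two of the guarantees stated above (the Season/DayNum/WTeamID key is unique, and WScore always exceeds LScore) are easy to verify. The sketch below uses only the first three rows shown in this notebook's output, so it illustrates the check rather than proving it for the full file.

```python
import pandas as pd

# First three rows of MRegularSeasonCompactResults.csv as shown in this notebook.
games_sample = pd.DataFrame(
    {
        "Season": [1985, 1985, 1985],
        "DayNum": [20, 25, 25],
        "WTeamID": [1228, 1106, 1112],
        "WScore": [81, 77, 63],
        "LTeamID": [1328, 1354, 1223],
        "LScore": [64, 70, 56],
    }
)

# A team never plays twice on the same day, so this triple should be unique.
key = ["Season", "DayNum", "WTeamID"]
has_duplicates = bool(games_sample.duplicated(subset=key).any())

# The winner's score should always be strictly greater than the loser's.
scores_ok = bool((games_sample["WScore"] > games_sample["LScore"]).all())

print(has_duplicates, scores_ok)
```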

In [6]:
reg_season_compact_results = pd.read_csv('MRegularSeasonCompactResults.csv')

reg_season_compact_results.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0


### Tournament Compact Results

Data Section 1 file: MNCAATourneyCompactResults.csv

This file identifies the game-by-game NCAA® tournament results for all seasons of historical data. The data is formatted exactly like the MRegularSeasonCompactResults data. All games will show up as neutral site (so WLoc is always N). Note that this tournament game data also includes the play-in games (which always occurred on day 134/135) for those years that had play-in games. Thus each season you will see between 63 and 67 games listed, depending on how many play-in games there were.

Because of the consistent structure of the NCAA® tournament schedule, you can actually tell what round a game was, depending on the exact DayNum. Thus:

    DayNum=134 or 135 (Tue/Wed) - play-in games to get the tournament field down to the final 64 teams
    DayNum=136 or 137 (Thu/Fri) - Round 1, to bring the tournament field from 64 teams to 32 teams
    DayNum=138 or 139 (Sat/Sun) - Round 2, to bring the tournament field from 32 teams to 16 teams
    DayNum=143 or 144 (Thu/Fri) - Round 3, otherwise known as "Sweet Sixteen", to bring the tournament field from 16 teams to 8 teams
    DayNum=145 or 146 (Sat/Sun) - Round 4, otherwise known as "Elite Eight" or "regional finals", to bring the tournament field from 8 teams to 4 teams
    DayNum=152 (Sat) - Round 5, otherwise known as "Final Four" or "national semifinals", to bring the tournament field from 4 teams to 2 teams
    DayNum=154 (Mon) - Round 6, otherwise known as "national final" or "national championship", to bring the tournament field from 2 teams to 1 champion team
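
The DayNum-to-round list above translates naturally into a lookup table. This is a sketch that assumes the standard schedule exactly as listed; a season with an unusual schedule would need its own mapping. The names `DAYNUM_TO_ROUND` and `tourney_round` are hypothetical, and Round 0 is used for the play-in games.

```python
# DayNum -> round number, following the schedule listed above.
DAYNUM_TO_ROUND = {
    134: 0, 135: 0,  # play-in games
    136: 1, 137: 1,  # Round 1 (64 -> 32)
    138: 2, 139: 2,  # Round 2 (32 -> 16)
    143: 3, 144: 3,  # Sweet Sixteen
    145: 4, 146: 4,  # Elite Eight
    152: 5,          # Final Four
    154: 6,          # national championship
}


def tourney_round(day_num: int) -> int:
    """Return the round number (0 = play-in, 6 = national final) for a DayNum."""
    if day_num not in DAYNUM_TO_ROUND:
        raise ValueError(f"DayNum {day_num} is not a standard tournament day")
    return DAYNUM_TO_ROUND[day_num]


print(tourney_round(136), tourney_round(154))
```

Raising on unknown day numbers, rather than returning a default, makes off-schedule games visible instead of silently mislabeling them.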

Special note: Each year, there are also going to be other games that happened after Selection Sunday, which are not part of the NCAA® Tournament. This includes tournaments like the postseason NIT, the CBI, the CIT, and the Vegas 16. Such games are not listed in the Regular Season or the NCAA® Tourney files; they can be found in the "Secondary Tourney" data files within Data Section 6. Although they would not be games you would ever be predicting directly for the NCAA® tournament, and they would not be games you would have data from at the time of predicting NCAA® tournament outcomes, you may nevertheless wish to make use of these games for model optimization, depending on your methodology. The more games that you can test your predictions against, the better your optimized model might eventually become, depending on how applicable all those games are. A similar argument might be advanced in favor of optimizing your predictions against conference tournament games, which might be viewed as reasonable proxies for NCAA® tournament games.

In [7]:
tourney_compact_results = pd.read_csv('MNCAATourneyCompactResults.csv')

tourney_compact_results.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,136,1116,63,1234,54,N,0
1,1985,136,1120,59,1345,58,N,0
2,1985,136,1207,68,1250,43,N,0
3,1985,136,1229,58,1425,55,N,0
4,1985,136,1242,49,1325,38,N,0


### Regular Season Detailed Results

Data Section 2 file: MRegularSeasonDetailedResults.csv

This file provides team-level box scores for many regular seasons of historical data, starting with the 2003 season. All games listed in the MRegularSeasonCompactResults file since the 2003 season should exactly be present in the MRegularSeasonDetailedResults file.
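
The claim that every compact game since 2003 also appears in the detailed file can be checked with a left merge and pandas' `indicator` flag. The two small frames below are entirely hypothetical stand-ins for the real files; only the column names come from the data description.

```python
import pandas as pd

key = ["Season", "DayNum", "WTeamID", "LTeamID"]

# Hypothetical compact results: one game before 2003, two from 2003.
compact_sample = pd.DataFrame(
    {
        "Season": [2002, 2003, 2003],
        "DayNum": [10, 10, 11],
        "WTeamID": [1101, 1102, 1103],
        "LTeamID": [1104, 1105, 1101],
    }
)

# Hypothetical detailed results covering only the 2003 games.
detailed_sample = pd.DataFrame(
    {
        "Season": [2003, 2003],
        "DayNum": [10, 11],
        "WTeamID": [1102, 1103],
        "LTeamID": [1105, 1101],
        "WFGM": [25, 30],
    }
)

# Keep compact games from 2003 onward and flag any without a detailed match.
recent = compact_sample[compact_sample["Season"] >= 2003]
merged = recent.merge(detailed_sample[key], on=key, how="left", indicator=True)
missing = merged[merged["_merge"] == "left_only"]
print(len(recent), len(missing))
```

An empty `missing` frame on the real data would confirm that the two files line up for the 2003 season onward.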


In [12]:
reg_season_detailed_results = pd.read_csv('MRegularSeasonDetailedResults.csv')

reg_season_detailed_results.head()


### Tournament Detailed Results

Data Section 2 file: MNCAATourneyDetailedResults.csv

This file provides team-level box scores for many NCAA® tournaments, starting with the 2003 season. All games listed in the MNCAATourneyCompactResults file since the 2003 season should exactly be present in the MNCAATourneyDetailedResults file.

In [11]:
tourney_detailed_results = pd.read_csv('MNCAATourneyDetailedResults.csv')

tourney_detailed_results.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF
0,2003,134,1421,92,1411,84,N,1,32,69,...,31,14,31,17,28,16,15,5,0,22
1,2003,136,1112,80,1436,51,N,0,31,66,...,16,7,7,8,26,12,17,10,3,15
2,2003,136,1113,84,1272,71,N,0,31,59,...,28,14,21,20,22,11,12,2,5,18
3,2003,136,1141,79,1166,73,N,0,29,53,...,17,12,17,14,17,20,21,6,6,21
4,2003,136,1143,76,1301,74,N,1,27,64,...,21,15,20,10,26,16,14,5,8,19


### Geography

Data Section 3 - Geography

This section provides city locations of all regular season, conference tournament, and NCAA® tournament games since the 2009-10 season

### Cities

Data Section 3 file: Cities.csv

This file provides a master list of cities that have been locations for games played. Please notice that the Cities and Conferences files are the only two that don't start with an M; this is because the data files are identical between men's and women's data, so you don't need to maintain separate listings of cities or conferences across the two contests. Also note that if you created any supplemental data last year on cities (latitude/longitude, altitude, etc.), the CityID's match between last year and this year, so you should be able to re-use that information.

   * CityID - a four-digit ID number uniquely identifying a city.
   * City - the text name of the city.
   * State - the state abbreviation of the state that the city is in. In a few rare cases, the game location is not inside one of the 50 U.S. states and so other abbreviations are used. For instance Cancun, Mexico has a state abbreviation of MX.

In [10]:
cities = pd.read_csv('Cities.csv')

cities.head()

Unnamed: 0,CityID,City,State
0,4001,Abilene,TX
1,4002,Akron,OH
2,4003,Albany,NY
3,4004,Albuquerque,NM
4,4005,Allentown,PA


### Game Cities

Data Section 3 file: MGameCities.csv

This file identifies all games, starting with the 2010 season, along with the city that the game was played in. Games from the regular season, the NCAA® tourney, and other post-season tournaments, are all listed together. There should be no games since the 2010 season where the CityID is not known. Games from the 2009 season and before are not listed in this file.

   * Season, DayNum, WTeamID, LTeamID - these four columns are sufficient to uniquely identify each game. Additional data, such as the score of the game and other stats, can be found in the corresponding Compact Results and/or Detailed Results file.
   * CRType - this can be either Regular or NCAA or Secondary. If it is Regular, you can find more about the game in the MRegularSeasonCompactResults.csv and MRegularSeasonDetailedResults.csv files. If it is NCAA, you can find more about the game in the MNCAATourneyCompactResults.csv and MNCAATourneyDetailedResults.csv files. If it is Secondary, you can find more about the game in the MSecondaryTourneyCompactResults file.
   * CityID - the ID of the city where the game was played, as specified by the CityID column in the Cities.csv file.
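
Since CityID links the two geography files, a merge attaches the city name and state to each game. In this sketch the two city rows are taken from the Cities.csv output in this notebook, while the game rows (including which CityID each game uses) are hypothetical.

```python
import pandas as pd

# Hypothetical game rows; only the column names follow MGameCities.csv.
games_sample = pd.DataFrame(
    {
        "Season": [2010, 2010],
        "DayNum": [7, 9],
        "WTeamID": [1101, 1103],
        "LTeamID": [1102, 1104],
        "CRType": ["Regular", "Regular"],
        "CityID": [4001, 4002],
    }
)

# Two rows from Cities.csv as shown in this notebook.
cities_sample = pd.DataFrame(
    {
        "CityID": [4001, 4002],
        "City": ["Abilene", "Akron"],
        "State": ["TX", "OH"],
    }
)

# Many games map to one city, so validate the relationship while merging.
games_with_city = games_sample.merge(
    cities_sample, on="CityID", how="left", validate="many_to_one"
)
print(games_with_city[["Season", "DayNum", "City", "State"]])
```

The `validate` argument makes pandas raise an error if a CityID is duplicated in the cities table, which guards against accidentally multiplying game rows.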

In [21]:
game_cities = pd.read_csv('MGameCities.csv')

game_cities.head()

Unnamed: 0,Season,DayNum,WTeamID,LTeamID,CRType,CityID
0,2010,7,1143,1293,Regular,4027
1,2010,7,1314,1198,Regular,4061
2,2010,7,1326,1108,Regular,4080
3,2010,7,1393,1107,Regular,4340
4,2010,9,1143,1178,Regular,4027


### Data Section 4 - Public Rankings

This section provides weekly team rankings for dozens of top rating systems - Pomeroy, Sagarin, RPI, ESPN, etc., since the 2002-2003 season

### Massey Ordinals

Data Section 4 file: MMasseyOrdinals.csv

This file lists out rankings (e.g. #1, #2, #3, ..., #N) of teams going back to the 2002-2003 season, under a large number of different ranking system methodologies. The information was gathered by Kenneth Massey and provided on his College Basketball Ranking Composite page.

Note that a rating system is more precise than a ranking system, because a rating system can provide insight about the strength gap between two adjacently-ranked teams. A ranking system will just tell you who is #1 or who is #2, but a rating system might tell you whether the gap between #1 and #2 is large or small. Nevertheless, it can be hard to compare two different rating systems that are expressed in different scales, so it can be very useful to express all the systems in terms of their ordinal ranking (1, 2, 3, ..., N) of teams.

   * Season - this is the year of the associated entry in MSeasons.csv (the year in which the final tournament occurs)
   * RankingDayNum - this integer always ranges from 0 to 133, and is expressed in the same terms as a game's DayNum (where DayZero is found in the MSeasons.csv file). The RankingDayNum is intended to tell you the first day that it is appropriate to use the rankings for predicting games. For example, if RankingDayNum is 110, then the rankings ought to be based upon game outcomes up through DayNum=109, and so you can use the rankings to make predictions of games on DayNum=110 or later. The final pre-tournament rankings each year have a RankingDayNum of 133, and can thus be used to make predictions of the games from the NCAA® tournament, which start on DayNum=134 (the Tuesday after Selection Sunday).
   * SystemName - this is the (usually) 3-letter abbreviation for each distinct ranking system. These systems may evolve from year to year, but as a general rule they retain their meaning across the years. Near the top of the Massey composite page, you can find slightly longer labels describing each system, along with links to the underlying pages where the latest rankings are provided (and sometimes the calculation is described).
   * TeamID - this is the ID of the team being ranked, as described in MTeams.csv.
   * OrdinalRank - this is the overall ranking of the team in the underlying system. Most systems provide a complete ranking from #1 through #351, but rankings for the most recent seasons go higher because additional teams have since been added to Division I.
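
The RankingDayNum rule can be sketched as a small filter: for a game on a given DayNum, keep only rankings whose RankingDayNum is less than or equal to that day, then take the latest snapshot per system and team. This is one possible reading of the rule, not an official utility. The function name `rankings_available_for` is made up, and only the first row of the sample matches the real file; the other two rows are invented for illustration.

```python
import pandas as pd

# First row is from MMasseyOrdinals.csv; the later two snapshots are hypothetical.
ordinals = pd.DataFrame({
    "Season":        [2003, 2003, 2003],
    "RankingDayNum": [35, 110, 133],
    "SystemName":    ["SEL", "SEL", "SEL"],
    "TeamID":        [1104, 1104, 1104],
    "OrdinalRank":   [12, 9, 7],
})

def rankings_available_for(ordinals, season, game_day_num):
    """Latest ranking per (SystemName, TeamID) that may be used on game_day_num."""
    usable = ordinals[(ordinals["Season"] == season)
                      & (ordinals["RankingDayNum"] <= game_day_num)]
    # Sort by day so that keep="last" retains the most recent snapshot.
    latest = (usable.sort_values("RankingDayNum")
                    .drop_duplicates(["SystemName", "TeamID"], keep="last"))
    return latest.reset_index(drop=True)

# A game on DayNum=120 can use the DayNum=110 snapshot but not the 133 one.
print(rankings_available_for(ordinals, 2003, 120))
```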

Disclaimer: you ought to be careful about your methodology when using or evaluating these ranking systems. They are presented on a weekly basis, and given a consistent date on the Massey Composite page that typically is a Sunday; that is how the ranking systems can be compared against each other on this page. However, these systems each follow their own timeline and some systems may be released on a Sunday and others on a Saturday or Monday or even Tuesday. You should remember that if a ranking is released on a Tuesday, and was calculated based on games played through Monday, it will make the system look unusually good at predicting if you use that system to forecast the very games played on Monday that already inform the rankings. To avoid this methodological trap, we have typically used a conservative RankingDayNum of Wednesday to represent the rankings that were released at approximately the end of the weekend, a few days before, even though those rankings are represented on the composite page as being on a Sunday. For some of the older years, a more precise timestamp was known for each ranking system that allowed a more precise assignment of a RankingDayNum. By convention, the final pre-tournament rankings are always expressed as RankingDayNum=133, even though sometimes the rankings for individual systems are not released until Tuesday (DayNum=134) or even Wednesday or Thursday. If you decide to use some rankings from these Massey Ordinals to inform your predictions, be forewarned that we have no control over when they are released, and not all systems may turn out to be available in time to make pre-tournament predictions by our submission deadline. In such a situation, you may wish to use the rankings from DayNum=128 or you may need to dig into the details of the actual source of the rankings, by following the respective links on the Massey Composite Page. 
We may also be able to provide partial releases of the final pre-tournament Massey Ordinals on the forums, so that as systems come in on Monday or Tuesday you can use them right away.

### Data Section 5 - Supplements

This section contains additional supporting information, including coaches, conference affiliations, alternative team name spellings, bracket structure, and game results for NIT and other postseason tournaments.


In [22]:
massey_ordinals = pd.read_csv('MMasseyOrdinals.csv')

massey_ordinals.head()

Unnamed: 0,Season,RankingDayNum,SystemName,TeamID,OrdinalRank
0,2003,35,SEL,1102,159
1,2003,35,SEL,1103,229
2,2003,35,SEL,1104,12
3,2003,35,SEL,1105,314
4,2003,35,SEL,1106,260


### Data Section 5 file: MTeamCoaches.csv

### Team Coaches

This file indicates the head coach for each team in each season, including a start/finish range of DayNum's to indicate a mid-season coaching change. For scenarios where a team had the same head coach the entire season, they will be listed with a DayNum range of 0 to 154 for that season. For head coaches whose term lasted many seasons, there will be many rows listed, most of which have a DayNum range of 0 to 154 for the corresponding season.

   * Season - this is the year of the associated entry in MSeasons.csv (the calendar year in which the final tournament occurs)
   * TeamID - this is the TeamID of the team that was coached, as described in MTeams.csv.
   * FirstDayNum, LastDayNum - this defines a continuous range of days within the season, during which the indicated coach was the head coach of the team. In most cases, a data row will either have FirstDayNum=0 (meaning they started the year as head coach) and/or LastDayNum=154 (meaning they ended the year as head coach), but in some cases there were multiple new coaches during a team's season, or a head coach who went on leave and then returned (in which case there would be multiple records in that season for that coach, indicating the continuous ranges of days when they were the head coach).
   * CoachName - this is a text representation of the coach's full name, in the format first_last, with underscores substituted in for spaces.

In [23]:
team_coaches = pd.read_csv('MTeamCoaches.csv')

team_coaches.head()

Unnamed: 0,Season,TeamID,FirstDayNum,LastDayNum,CoachName
0,1985,1102,0,154,reggie_minton
1,1985,1103,0,154,bob_huggins
2,1985,1104,0,154,wimp_sanderson
3,1985,1106,0,154,james_oliver
4,1985,1108,0,154,davey_whitney
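
To illustrate the FirstDayNum/LastDayNum ranges, here is a small sketch that looks up who was head coach of a team on a particular day. The helper `coach_on_day` is a made-up name. The first row is taken from the output above, while the two 2020 rows (`first_coach`, `second_coach`) are invented to show a mid-season change.

```python
import pandas as pd

coaches = pd.DataFrame({
    "Season":      [1985, 2020, 2020],
    "TeamID":      [1103, 1103, 1103],
    "FirstDayNum": [0, 0, 61],
    "LastDayNum":  [154, 60, 154],
    # The 2020 names are hypothetical placeholders, not real data.
    "CoachName":   ["bob_huggins", "first_coach", "second_coach"],
})

def coach_on_day(coaches, season, team_id, day_num):
    """Return the coach name(s) whose DayNum range covers day_num."""
    mask = ((coaches["Season"] == season)
            & (coaches["TeamID"] == team_id)
            & (coaches["FirstDayNum"] <= day_num)
            & (coaches["LastDayNum"] >= day_num))
    return coaches.loc[mask, "CoachName"].tolist()

print(coach_on_day(coaches, 1985, 1103, 100))
print(coach_on_day(coaches, 2020, 1103, 61))
```

Both ends of the range are treated as inclusive here, which matches the 0 to 154 convention for a full season.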


### Data Section 5 file: Conferences.csv

### Conferences

This file indicates the Division I conferences that have existed over the years since 1985. Each conference is listed with an abbreviation and a longer name. There has been no attempt to link up conferences who merged with other conferences, or whose names changed over time. Thus you will see, for instance, a "Pacific-10" conference up through the 2011 season, and then a "Pacific-12" conference starting in the 2012 season, and these look like different conferences in the data, even though it was still mostly the same teams. Please notice that the Cities and Conferences files are the only two that don't start with an M; this is because the data files are identical between men's and women's data, so you don't need to maintain separate listings of cities or conferences across the two contests. However, the Team Conferences data differs slightly between men's and women's, so those files do have the prefixes.

   * ConfAbbrev - this is a short abbreviation for each conference; the abbreviation is used in some other files to indicate the parent conference of a team or of a conference tournament.
   * Description - this is a longer text name for the conference.

In [24]:
conf = pd.read_csv('Conferences.csv')

conf.head()

Unnamed: 0,ConfAbbrev,Description
0,a_sun,Atlantic Sun Conference
1,a_ten,Atlantic 10 Conference
2,aac,American Athletic Conference
3,acc,Atlantic Coast Conference
4,aec,America East Conference


### Data Section 5 file: MTeamConferences.csv

### Team Conferences 

This file indicates the conference affiliations for each team during each season. Some conferences have changed their names from year to year, and/or changed which teams are part of the conference. This file tracks this information historically.

   * Season - this is the year of the associated entry in MSeasons.csv (the year in which the final tournament occurs)
   * TeamID - this identifies the TeamID (as described in MTeams.csv).
   * ConfAbbrev - this identifies the conference (as described in Conferences.csv).


In [25]:
team_conf = pd.read_csv('MTeamConferences.csv')

team_conf.head()

Unnamed: 0,Season,TeamID,ConfAbbrev
0,1985,1102,wac
1,1985,1103,ovc
2,1985,1104,sec
3,1985,1106,swac
4,1985,1108,swac
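
One way to attach the longer conference name to each team-season is a left merge on ConfAbbrev. The sketch below uses two team rows from the output above. The `acc` description comes from Conferences.csv as displayed earlier, but the `sec` description is an assumption about what the file contains, and `wac` is deliberately left out to show how unmatched abbreviations surface as missing values.

```python
import pandas as pd

team_conf_sample = pd.DataFrame({
    "Season": [1985, 1985],
    "TeamID": [1102, 1104],
    "ConfAbbrev": ["wac", "sec"],
})

conf_sample = pd.DataFrame({
    "ConfAbbrev": ["acc", "sec"],
    # The "sec" description is assumed for illustration.
    "Description": ["Atlantic Coast Conference", "Southeastern Conference"],
})

# Left merge keeps every team-season row, even without a matching conference.
merged = team_conf_sample.merge(conf_sample, on="ConfAbbrev", how="left")

# Rows with no match end up with a missing Description.
unmatched = merged[merged["Description"].isna()]
print(merged)
print(unmatched["ConfAbbrev"].tolist())
```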


### Data Section 5 file: MConferenceTourneyGames.csv

### Conference Tournament Games 

This file indicates which games were part of each year's post-season conference tournaments (all of which finished on Selection Sunday or earlier), starting from the 2001 season. Many of these conference tournament games are held on neutral sites, and many of the games are played by tournament-caliber teams just a few days before the NCAA® tournament. Thus these games could be considered as very similar to NCAA® tournament games, and (depending on your methodology) may be of use in optimizing your predictions. However, this is NOT a new listing of games; these games are already present within the MRegularSeasonCompactResults and MRegularSeasonDetailedResults files. So this file simply helps you to identify which of the "regular season" games since the 2001 season were actually conference tournament games, in case that is useful information.

   * ConfAbbrev - this identifies the conference (as described in Conferences.csv) that the tournament was for.
   * Season, DayNum, WTeamID, LTeamID - these four columns are sufficient to uniquely identify each game. Further details about the game, such as the final score and other stats, can be found in the associated data row of the MRegularSeasonCompactResults and/or MRegularSeasonDetailedResults files.

In [26]:
conf_tourney_games = pd.read_csv('MConferenceTourneyGames.csv')

conf_tourney_games.head()

Unnamed: 0,Season,ConfAbbrev,DayNum,WTeamID,LTeamID
0,2001,a_sun,121,1194,1144
1,2001,a_sun,121,1416,1240
2,2001,a_sun,122,1209,1194
3,2001,a_sun,122,1359,1239
4,2001,a_sun,122,1391,1273
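
Since these games already appear in the regular season results, a natural use of this file is to flag them. The sketch below does that with a left merge on the four key columns. The two conference tournament rows are taken from the output above, while the `regular` table, including its scores, is made up for illustration.

```python
import pandas as pd

GAME_KEY = ["Season", "DayNum", "WTeamID", "LTeamID"]

# Two rows from MConferenceTourneyGames.csv as displayed above.
conf_games = pd.DataFrame({
    "Season": [2001, 2001],
    "ConfAbbrev": ["a_sun", "a_sun"],
    "DayNum": [121, 121],
    "WTeamID": [1194, 1416],
    "LTeamID": [1144, 1240],
})

# Hypothetical regular season results: one ordinary game plus the two games above.
regular = pd.DataFrame({
    "Season":  [2001, 2001, 2001],
    "DayNum":  [50, 121, 121],
    "WTeamID": [1194, 1194, 1416],
    "WScore":  [70, 65, 80],   # scores are invented
    "LTeamID": [1240, 1144, 1240],
    "LScore":  [60, 60, 75],   # scores are invented
})

# Games without a match keep a missing ConfAbbrev after the left merge.
flagged = regular.merge(conf_games, on=GAME_KEY, how="left")
flagged["IsConfTourney"] = flagged["ConfAbbrev"].notna()
print(flagged[GAME_KEY + ["IsConfTourney"]])
```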


### Data Section 5 file: MSecondaryTourneyTeams.csv

### Secondary Tournament Teams

This file identifies the teams that participated in post-season tournaments other than the NCAA® Tournament (such events would run in parallel with the NCAA® Tournament). These are teams that were not invited to the NCAA® Tournament and instead were invited to some other tournament, of which the NIT is the most prominent tournament, but there have also been the CBI, CIT, and Vegas 16 (V16) at various points in recent years. Depending on your methodology, you might find it useful to have these additional game results, above and beyond what is available from the NCAA® Tournament results. Many of these teams, especially in the NIT, were "bubble" teams of comparable strength to several NCAA® Tournament invitees, and so these games may be of use in model optimization for predicting NCAA® Tournament results. Also note that this information could be determined just from inspecting the MSecondaryTourneyCompactResults file, but is presented in this file as well, for your convenience.

   * Season - this is the year of the associated entry in MSeasons.csv (the year in which the post-season tournament was played)
   * SecondaryTourney - this is the abbreviation of the tournament, either NIT, CBI, CIT, or V16 (which stands for Vegas 16).
   * TeamID - this identifies the TeamID that participated in the tournament (as described in MTeams.csv).

In [27]:
secondary_tourney_teams = pd.read_csv('MSecondaryTourneyTeams.csv')

secondary_tourney_teams.head()

Unnamed: 0,Season,SecondaryTourney,TeamID
0,1985,NIT,1108
1,1985,NIT,1133
2,1985,NIT,1139
3,1985,NIT,1145
4,1985,NIT,1151


### Data Section 5 file: MSecondaryTourneyCompactResults.csv

### Secondary Tourney Compact Results

This file indicates the final scores for the tournament games of "secondary" post-season tournaments: the NIT, CBI, CIT, and Vegas 16. The detailed results (team box scores) have not been assembled for these games. For the most part, this file is exactly like other Compact Results listings, although it also has a column for Secondary Tourney. Also note that because these games are played after DayNum=132, they are NOT listed in the MRegularSeasonCompactResults file.

   * SecondaryTourney - this is the abbreviation of the tournament, either NIT, CBI, CIT, or V16 (which stands for Vegas 16).

In [28]:
secondary_tourney_compact_results = pd.read_csv('MSecondaryTourneyCompactResults.csv')

secondary_tourney_compact_results.head()


### Data Section 5 file: MTeamSpellings.csv

### Team Spelling

This file indicates alternative spellings of many team names. It is intended for use in associating external spellings against our own TeamID numbers, thereby helping to relate the external data properly with our datasets. Over the years we have identified various external spellings of different team names (as an example, for Ball State we have seen "ball st", and "ball st.", and "ball state", and "ball-st", and "ball-state"). Other teams have had more significant changes to their names over the years; for example, "Texas Pan-American" and "Texas-Rio Grande Valley" are actually the same school. The current list is obviously not exhaustive, and we encourage participants to identify additional mappings and upload extended versions of this file to the forums.

   * TeamNameSpelling - this is the spelling of the team name. It is always expressed in all lowercase letters - e.g. "ball state" rather than "Ball State" - in order to emphasize that any comparisons should be case-insensitive when matching.
   * TeamID - this identifies the TeamID for the team that has the alternative spelling (as described in MTeams.csv).

In [29]:
team_spellings = pd.read_csv('MTeamSpellings.csv', encoding='cp1252')

team_spellings.head()

Unnamed: 0,TeamNameSpelling,TeamID
0,a&m-corpus chris,1394
1,a&m-corpus christi,1394
2,abilene chr,1101
3,abilene christian,1101
4,abilene-christian,1101
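
A case-insensitive lookup from an external team name to a TeamID might look like the sketch below. It uses only the five spellings shown above. The helper `lookup_team_id` is a hypothetical name, and returning `None` for unknown spellings is a design choice made here so that gaps in the mapping are easy to spot.

```python
import pandas as pd

# The five MTeamSpellings.csv rows displayed above.
team_spellings_sample = pd.DataFrame({
    "TeamNameSpelling": ["a&m-corpus chris", "a&m-corpus christi", "abilene chr",
                         "abilene christian", "abilene-christian"],
    "TeamID": [1394, 1394, 1101, 1101, 1101],
})

# Plain dict for fast lookups: spelling -> TeamID.
spelling_to_id = dict(zip(team_spellings_sample["TeamNameSpelling"],
                          team_spellings_sample["TeamID"]))

def lookup_team_id(external_name):
    """Map an external team name to a TeamID, or None if the spelling is unknown."""
    key = external_name.strip().lower()  # spellings in the file are all lowercase
    team_id = spelling_to_id.get(key)
    return None if team_id is None else int(team_id)

print(lookup_team_id("  Abilene Christian "))
print(lookup_team_id("Unknown University"))
```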


### Data Section 5 file: MNCAATourneySlots.csv

### Tournament Slots 

This file identifies the mechanism by which teams are paired against each other, depending upon their seeds, as the tournament proceeds through its rounds. It can be of use in identifying, for a given historical game, what round it occurred in, and what the seeds/slots were for the two teams (the meaning of "slots" is described below). Because of the existence of play-in games for particular seed numbers, the pairings have small differences from year to year. You may need to know these specifics if you are trying to represent/simulate the exact workings of the tournament bracket.

   * Season - this is the year of the associated entry in MSeasons.csv (the year in which the final tournament occurs)
   * Slot - this uniquely identifies one of the tournament games. For play-in games, it is a three-character string identifying the seed fulfilled by the winning team, such as W16 or Z13. For regular tournament games, it is a four-character string, where the first two characters tell you which round the game is (R1, R2, R3, R4, R5, or R6) and the second two characters tell you the expected seed of the favored team. Thus the first row is R1W1, identifying the Round 1 game played in the W bracket, where the favored team is the 1 seed. As a further example, the R2W1 slot indicates the Round 2 game that would have the 1 seed from the W bracket, assuming that all favored teams have won up to that point. Even if that R2W1 slot were actually a game between the W09 and W16 teams, it is still considered to be the R2W1 slot. The slot names are different for the final two rounds, where R5WX identifies the national semifinal game between the winners of regions W and X, and R5YZ identifies the national semifinal game between the winners of regions Y and Z, and R6CH identifies the championship game. The "slot" value is used in other columns in order to represent the advancement and pairings of winners of previous games.
   * StrongSeed - this indicates the expected stronger-seeded team that plays in this game. For Round 1 games, a team seed is identified in this column (as listed in the "Seed" column in the MNCAATourneySeeds.csv file), whereas for subsequent games, a slot is identified in this column. In the first record of this file (slot R1W1), we see that seed W01 is the "StrongSeed", which during the 1985 tournament would have been Georgetown. For games from Round 2 or later, we instead see a "slot" referenced in this column: the 33rd record of this file (slot R2W1) tells us that the winners of slots R1W1 and R1W8 will face each other in Round 2. Of course, in the last few games of the tournament - the national semifinals and finals - it's not really meaningful to talk about a "strong seed" or "weak seed", since you would have #1 seeds favored to face each other, but those games are nevertheless represented in the same format for the sake of consistency.
   * WeakSeed - this indicates the expected weaker-seeded team that plays in this game, assuming all favored teams have won so far. For Round 1 games, a team seed is identified in this column (as listed in the "Seed" column in the MNCAATourneySeeds.csv file), whereas for subsequent games, a slot is identified in this column.
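
The slot naming convention can be sketched as a small parser. This is only an illustration of the rules described for the Slot column; the function name `parse_slot` and the shape of the returned dictionary are assumptions, and the sketch does not validate malformed input.

```python
def parse_slot(slot):
    """Split a slot name into round, region, and expected favored seed."""
    if len(slot) == 3:
        # Play-in game such as "W16": region letter plus the seed the winner fills.
        return {"round": 0, "region": slot[0], "seed": int(slot[1:])}

    round_num = int(slot[1])   # "R1W1" -> 1
    tail = slot[2:]            # "R1W1" -> "W1"

    if round_num == 6:
        # "R6CH" is the championship game; no region or seed applies.
        return {"round": 6, "region": None, "seed": None}
    if round_num == 5:
        # "R5WX" / "R5YZ": semifinal between the winners of two regions.
        return {"round": 5, "region": tail, "seed": None}
    # Rounds 1 to 4: region letter followed by the favored seed.
    return {"round": round_num, "region": tail[0], "seed": int(tail[1])}

for name in ["W16", "R1W1", "R2W1", "R5WX", "R6CH"]:
    print(name, parse_slot(name))
```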

In [30]:
tourney_slots = pd.read_csv('MNCAATourneySlots.csv')

tourney_slots.head()


Unnamed: 0,Season,Slot,StrongSeed,WeakSeed
0,1985,R1W1,W01,W16
1,1985,R1W2,W02,W15
2,1985,R1W3,W03,W14
3,1985,R1W4,W04,W13
4,1985,R1W5,W05,W12


### Data Section 5 file: MNCAATourneySeedRoundSlots.csv

### Tournament Seed Round Slots 

This file helps to represent the bracket structure in any given year. No matter where the play-in seeds are located, we can always know, for a given tournament seed, exactly what bracket slot they would be playing in, on each possible game round, and what the possible DayNum values would be for that round. Thus, if we know when a historical game was played, and what the team's seed was, we can identify the slot for that game. This can be useful in representing or simulating the tournament bracket structure.

   * Seed - this is the tournament seed of the team.
   * GameRound - this is the round during the tournament that the game would occur in, where Round 0 (zero) is for the play-in games, Rounds 1/2 are for the first weekend, Rounds 3/4 are for the second weekend, and Rounds 5/6 are the national semifinals and finals.
   * GameSlot - this is the game slot that the team would be playing in, during the given GameRound. The naming convention for slots is described above, in the definition of the MNCAATourneySlots file.
   * EarlyDayNum, LateDayNum - these fields describe the earliest possible, and latest possible, DayNums that the game might be played on.

In [32]:
tourney_seed_round_slots = pd.read_csv('MNCAATourneySeedRoundSlots.csv')

tourney_seed_round_slots.head()

Unnamed: 0,Seed,GameRound,GameSlot,EarlyDayNum,LateDayNum
0,W01,1,R1W1,136,137
1,W01,2,R2W1,138,139
2,W01,3,R3W1,143,144
3,W01,4,R4W1,145,146
4,W01,5,R5WX,152,152
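
Given a team's seed and the DayNum of a historical game, the slot can be found by filtering on the EarlyDayNum/LateDayNum window. The sketch below uses only the five W01 rows shown above; `slot_for_game` is a hypothetical helper name.

```python
import pandas as pd

# The five MNCAATourneySeedRoundSlots.csv rows displayed above.
seed_round_slots_sample = pd.DataFrame({
    "Seed":        ["W01"] * 5,
    "GameRound":   [1, 2, 3, 4, 5],
    "GameSlot":    ["R1W1", "R2W1", "R3W1", "R4W1", "R5WX"],
    "EarlyDayNum": [136, 138, 143, 145, 152],
    "LateDayNum":  [137, 139, 144, 146, 152],
})

def slot_for_game(table, seed, day_num):
    """Return the slot(s) a seed could occupy on day_num (empty list if none)."""
    match = table[(table["Seed"] == seed)
                  & (table["EarlyDayNum"] <= day_num)
                  & (table["LateDayNum"] >= day_num)]
    return match["GameSlot"].tolist()

print(slot_for_game(seed_round_slots_sample, "W01", 139))
print(slot_for_game(seed_round_slots_sample, "W01", 140))
```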


### Sample Submission Stage 1

Data Section 1 file: MSampleSubmissionStage1.csv

This file illustrates the submission file format for Stage 1. It is the simplest possible submission: a 50% winning percentage is predicted for each possible matchup.

A submission file lists every possible matchup between tournament teams for one or more years. During Stage 1, you are asked to make predictions for all possible matchups from the past five NCAA® tournaments (seasons 2016, 2017, 2018, 2019, and 2021). In Stage 2, you will be asked to make predictions for all possible matchups from the current NCAA® tournament (season 2022).

When there are 68 teams in the tournament, there are 68*67/2=2,278 predictions to make for that year, so a Stage 1 submission file will have 2,278*5=11,390 data rows.
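
The row counts above can be double-checked with a few lines of Python. The team IDs used for the explicit enumeration are placeholders (0 to 67), not real TeamIDs, and the variable names are illustrative.

```python
from itertools import combinations
from math import comb

SEASONS = [2016, 2017, 2018, 2019, 2021]
N_TEAMS = 68

# Number of unordered pairs among 68 teams: 68 * 67 / 2.
pairs_per_season = comb(N_TEAMS, 2)

# Enumerating the pairs explicitly should give the same count.
enumerated = sum(1 for _ in combinations(range(N_TEAMS), 2))

total_rows = pairs_per_season * len(SEASONS)
print(pairs_per_season, enumerated, total_rows)
```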

   * ID - this is a 14-character string of the format SSSS_XXXX_YYYY, where SSSS is the four digit season number, XXXX is the four-digit TeamID of the lower-ID team, and YYYY is the four-digit TeamID of the higher-ID team.
   * Pred - this contains the predicted winning percentage for the first team identified in the ID field, the one represented above by XXXX.

Example #1: You want to make a prediction for Duke (TeamID=1181) against Arizona (TeamID=1112) in the 2017 tournament, with Duke given a 53% chance to win and Arizona given a 47% chance to win. In this case, Arizona has the lower numerical ID so they would be listed first, and the winning percentage would be expressed from Arizona's perspective (47%):

2017_1112_1181,0.47

Example #2: You want to make a prediction for Duke (TeamID=1181) against North Carolina (TeamID=1314) in the 2018 tournament, with Duke given a 51.6% chance to win and North Carolina given a 48.4% chance to win. In this case, Duke has the lower numerical ID so they would be listed first, and the winning percentage would be expressed from Duke's perspective (51.6%):

2018_1181_1314,0.516

Also note that a single prediction row serves as a prediction for each of the two teams' winning chances. So for instance, in Example #1, the submission row of "2017_1112_1181,0.47" specifically gives a 47% chance for Arizona to win, and doesn't explicitly mention Duke's 53% chance to win. However, our evaluation utility will automatically infer the winning percentage in the other direction, so a 47% prediction for Arizona to win also means a 53% prediction for Duke to win. And similarly, because the submission row in Example #2 gives Duke a 51.6% chance to beat North Carolina, we will automatically figure out that this also means North Carolina has a 48.4% chance to beat Duke.
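
The ordering rule in the two examples can be sketched as a helper that accepts the teams in any order and returns the ID and Pred from the lower-ID team's perspective. The name `submission_row` and its argument order are assumptions made for illustration; rounding to six decimals is only there to avoid floating point noise.

```python
def submission_row(season, team_a, prob_a_wins, team_b):
    """Build (ID, Pred) for a matchup, given team_a's win probability."""
    low, high = sorted((team_a, team_b))
    # Pred is always expressed for the lower-ID team.
    pred = prob_a_wins if team_a == low else 1.0 - prob_a_wins
    return f"{season}_{low}_{high}", round(pred, 6)

# Example #1: Duke (1181) at 53% against Arizona (1112) in 2017.
print(submission_row(2017, 1181, 0.53, 1112))

# Example #2: Duke (1181) at 51.6% against North Carolina (1314) in 2018.
print(submission_row(2018, 1181, 0.516, 1314))
```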

### Data Section 2 - Team Box Scores

This section provides game-by-game stats at a team level (free throws attempted, defensive rebounds, turnovers, etc.) for all regular season, conference tournament, and NCAA® tournament games since the 2002-03 season.

Team Box Scores are provided in "Detailed Results" files rather than "Compact Results" files. However, the two files are strongly related.

In a Detailed Results file, the first eight columns (Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, and NumOT) are exactly the same as a Compact Results file. However, in a Detailed Results file, there are many additional columns. The column names should be self-explanatory to basketball fans (as above, "W" or "L" refers to the winning or losing team):

   * WFGM - field goals made (by the winning team)
   * WFGA - field goals attempted (by the winning team)
   * WFGM3 - three pointers made (by the winning team)
   * WFGA3 - three pointers attempted (by the winning team)
   * WFTM - free throws made (by the winning team)
   * WFTA - free throws attempted (by the winning team)
   * WOR - offensive rebounds (pulled by the winning team)
   * WDR - defensive rebounds (pulled by the winning team)
   * WAst - assists (by the winning team)
   * WTO - turnovers committed (by the winning team)
   * WStl - steals (accomplished by the winning team)
   * WBlk - blocks (accomplished by the winning team)
   * WPF - personal fouls committed (by the winning team)

(and then the same set of stats from the perspective of the losing team: LFGM is the number of field goals made by the losing team, and so on up to LPF).

Note: by convention, "field goals made" (either WFGM or LFGM) refers to the total number of field goals made by a team, a combination of both two-point field goals and three-point field goals. And "three point field goals made" (either WFGM3 or LFGM3) is just the three-point field goals made, of course. So if you want to know specifically about two-point field goals, you have to subtract one from the other (e.g., WFGM - WFGM3). And the total number of points scored is most simply expressed as 2*FGM + FGM3 + FTM.
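
The arithmetic in the note can be checked on a single box score row. The numbers below are invented for illustration and do not come from the Detailed Results files; the derived column names `WFGM2` and `WPointsFromStats` are also made up.

```python
import pandas as pd

# Hypothetical winning-team box score: 27 field goals (7 of them threes), 12 free throws.
box = pd.DataFrame({"WFGM": [27], "WFGM3": [7], "WFTM": [12], "WScore": [73]})

# Two-point field goals are total field goals minus three-pointers.
box["WFGM2"] = box["WFGM"] - box["WFGM3"]

# Every field goal is worth at least 2; each three adds one more; free throws add 1.
box["WPointsFromStats"] = 2 * box["WFGM"] + box["WFGM3"] + box["WFTM"]

# Equivalent form: 2 * twos + 3 * threes + free throws.
box["WPointsAlt"] = 2 * box["WFGM2"] + 3 * box["WFGM3"] + box["WFTM"]

print(box)
```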

In [13]:
sample_sub_stage_1 = pd.read_csv('MSampleSubmissionStage1.csv')

sample_sub_stage_1

Unnamed: 0,ID,Pred
0,2016_1112_1114,0.5
1,2016_1112_1122,0.5
2,2016_1112_1124,0.5
3,2016_1112_1138,0.5
4,2016_1112_1139,0.5
...,...,...
11385,2021_1452_1457,0.5
11386,2021_1452_1458,0.5
11387,2021_1455_1457,0.5
11388,2021_1455_1458,0.5

