Here is a [link](https://run.unl.pt/bitstream/10362/33864/1/TGI0135.pdf) to a research paper on March Madness. I want to bring attention to the model evaluation results shown below, which can be found on page 35. They represent the accuracy scores of the training data using 10-fold CV.

| Technique           | 1999 - 2016 | 2017 |
|---------------------|-------------|------|
| Decision Tree       | 67.2        | 50   |   
| Logistic Regression | 70          | 73   |  
| KNN                 | 67.1        | 64   |  
| Random Forest       | 68.2        | 50   |   
| SGD                 | 69          | 69.1 |   
| SVC                 | 71.5        | 85.1 |  


We see that SVC gave the best accuracy, and we will showcase this with a 2017 March Madness prediction from the Kaggle tournament. The GitHub repository can be found [here](https://github.com/adeshpande3/March-Madness-2017) as well as the full downloaded work here. We will look into the March Madness 2017 Jupyter Notebook first.

# The DataFrames

In [3]:
# Not every import from the original file is listed here, and not every import listed here is actually used.

from __future__ import division
import sklearn
import pandas as pd
import numpy as np
import collections
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import linear_model
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
import sys
from sklearn.ensemble import GradientBoostingRegressor
import math
import csv
%matplotlib inline
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import classification_report
import urllib
from sklearn.svm import LinearSVC

There will be a lot of dataframes to list here, so we will just give quick explanations.

In [4]:
# Scores of all regular season games from 1985 - 2015.
reg_season_compact_pd = pd.read_csv('Data/RegularSeasonCompactResults.csv')
reg_season_compact_pd.head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0


In [5]:
# All the relevant Basketball stats from regular season games 2003 - 2016. 
reg_season_detailed_pd = pd.read_csv('Data/RegularSeasonDetailedResults.csv')
print(reg_season_detailed_pd.columns)
reg_season_detailed_pd.tail()

Index(['Season', 'Daynum', 'Wteam', 'Wscore', 'Lteam', 'Lscore', 'Wloc',
       'Numot', 'Wfgm', 'Wfga', 'Wfgm3', 'Wfga3', 'Wftm', 'Wfta', 'Wor', 'Wdr',
       'Wast', 'Wto', 'Wstl', 'Wblk', 'Wpf', 'Lfgm', 'Lfga', 'Lfgm3', 'Lfga3',
       'Lftm', 'Lfta', 'Lor', 'Ldr', 'Last', 'Lto', 'Lstl', 'Lblk', 'Lpf'],
      dtype='object')


Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,Wfgm,Wfga,...,Lfga3,Lftm,Lfta,Lor,Ldr,Last,Lto,Lstl,Lblk,Lpf
71236,2016,132,1114,70,1419,50,N,0,26,52,...,18,4,9,6,20,13,13,6,3,12
71237,2016,132,1163,72,1272,58,N,0,22,48,...,16,11,17,5,21,10,6,3,0,20
71238,2016,132,1246,82,1401,77,N,1,28,58,...,23,15,22,17,23,11,13,5,4,20
71239,2016,132,1277,66,1345,62,N,0,25,60,...,15,17,21,5,22,10,5,4,4,14
71240,2016,132,1386,87,1433,74,N,0,35,54,...,29,11,16,12,21,12,9,5,5,21


In [6]:
# List of all Teams that participated in March Madness since 1985 and their Team_ID
teams_pd = pd.read_csv('Data/Teams.csv')
teamList = teams_pd['Team_Name'].tolist()
teams_pd.tail()

Unnamed: 0,Team_Id,Team_Name
359,1460,Wright St
360,1461,Wyoming
361,1462,Xavier
362,1463,Yale
363,1464,Youngstown St


In [7]:
# Tourney Results from 1985 - 2015. Like the First Dataframe, only the teams and scores are listed here.
tourney_compact_pd = pd.read_csv('Data/TourneyCompactResults.csv')
tourney_compact_pd.head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,136,1116,63,1234,54,N,0
1,1985,136,1120,59,1345,58,N,0
2,1985,136,1207,68,1250,43,N,0
3,1985,136,1229,58,1425,55,N,0
4,1985,136,1242,49,1325,38,N,0


In [8]:
# Tourney results from 2003 onwards, with more detailed stats.
tourney_detailed_pd = pd.read_csv('Data/TourneyDetailedResults.csv')
tourney_detailed_pd.head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,Wfgm,Wfga,...,Lfga3,Lftm,Lfta,Lor,Ldr,Last,Lto,Lstl,Lblk,Lpf
0,2003,134,1421,92,1411,84,N,1,32,69,...,31,14,31,17,28,16,15,5,0,22
1,2003,136,1112,80,1436,51,N,0,31,66,...,16,7,7,8,26,12,17,10,3,15
2,2003,136,1113,84,1272,71,N,0,31,59,...,28,14,21,20,22,11,12,2,5,18
3,2003,136,1141,79,1166,73,N,0,29,53,...,17,12,17,14,17,20,21,6,6,21
4,2003,136,1143,76,1301,74,N,1,27,64,...,21,15,20,10,26,16,14,5,8,19


In [9]:
# List of seeds in a given year
tourney_seeds_pd = pd.read_csv('Data/TourneySeeds.csv')
tourney_seeds_pd.head()

Unnamed: 0,Season,Seed,Team
0,1985,W01,1207
1,1985,W02,1210
2,1985,W03,1228
3,1985,W04,1260
4,1985,W05,1374


In [10]:
# Conference summaries by year (record, SRS, SOS, and champions)
conference_pd = pd.read_csv('Data/Conference.csv')
conference_pd.head()

Unnamed: 0,Year,Conference,Schls,W,L,W-L%,SRS,SOS,AP,NCAA,FF,Regular Season Champ,Tournament Champ
0,2003,Southeastern Conference,12,220,145,0.603,12.95,8.58,4,6,0,Kentucky (East) Mississippi St (West),Kentucky
1,2003,Big 12 Conference,12,238,146,0.62,12.94,8.21,4,6,2,Kansas,Oklahoma
2,2003,Atlantic Coast Conference,9,170,111,0.605,12.75,7.83,3,4,0,Wake Forest,Duke
3,2003,Big East Conference,14,262,179,0.594,11.03,7.13,4,4,1,Boston College (East) Connecticut (East) Pitts...,Pittsburgh
4,2003,Big Ten Conference,11,200,146,0.578,10.68,7.66,2,5,0,Wisconsin,Illinois


In [11]:
tourney_results_pd = pd.read_csv('Data/TourneyResults.csv')
NCAAChampionsList = tourney_results_pd['NCAA Champion'].tolist()
NCAAChampionsList

[nan,
 'Villanova',
 'Duke',
 'Connecticut',
 'Louisville',
 'Kentucky',
 'Connecticut',
 'Duke',
 'North Carolina',
 'Kansas',
 'Florida',
 'Florida',
 'North Carolina',
 'Connecticut',
 'Syracuse',
 'Maryland',
 'Duke',
 'Michigan State',
 'Connecticut',
 'Kentucky',
 'Arizona',
 'Kentucky',
 'UCLA',
 'Arkansas',
 'North Carolina',
 'Duke',
 'Duke',
 'UNLV',
 'Michigan',
 'Kansas',
 'Indiana',
 'Louisville',
 'Villanova']

The first thing to notice about the dataframes is that we have the scores of every regular season and tournament game since 1985, but detailed results only start in 2003. We have to decide whether to include the older data in our models at the expense of missing detailed stats, or forgo it completely and leave a chunk of the total dataset unused.

The other thing is the IDs assigned to schools. Not every dataset came from the same person, so there are bound to be different naming conventions. Take Michigan State University, for example: some datasets may call it MSU and others Michigan State. It looks like using the IDs will be the way to go.

# Feature Selection

Here, I will copy the list of features this person used:

* Regular Season Wins 
* Points per game season average 
* Points per game allowed season average
* **Whether or not in Power 6 conference (ACC, Big Ten, Big 12, SEC, Pac 12, Big East) - Binary label**
* Number of 3's per game
* Turnovers per game average
* Assists per game average
* **Conference Championship - binary label**
* **Conference Tournament Championship - binary label**
* Tournament Seed
* Strength of Schedule
* Simple Rating System
* Rebounds per game average
* Steals per game average
* Number of NCAA appearances since 1985
* Whether the team is home or away or neutral (labels -1, 0, and 1)
* **Number of NCAA championships**

The ones in bold require some helper functions as well as cleaning in order to be used. First, we deal with the naming conventions which unfortunately must be hard coded.

In [12]:
def handleDifferentCSV(df):
    df['School'] = df['School'].replace('(State)', 'St', regex=True) 
    df['School'] = df['School'].replace('Albany (NY)', 'Albany NY') 
    df['School'] = df['School'].replace('Boston University', 'Boston Univ')
    df['School'] = df['School'].replace('Central Michigan', 'C Michigan')
    df['School'] = df['School'].replace('(Eastern)', 'E', regex=True)
    df['School'] = df['School'].replace('Louisiana St', 'LSU')
    df['School'] = df['School'].replace('North Carolina St', 'NC State')
    df['School'] = df['School'].replace('Southern California', 'USC')
    df['School'] = df['School'].replace('University of California', 'California', regex=True) 
    df['School'] = df['School'].replace('American', 'American Univ')
    df['School'] = df['School'].replace('Arkansas-Little Rock', 'Ark Little Rock')
    df['School'] = df['School'].replace('Arkansas-Pine Bluff', 'Ark Pine Bluff')
    df['School'] = df['School'].replace('Bowling Green St', 'Bowling Green')
    df['School'] = df['School'].replace('Brigham Young', 'BYU')
    df['School'] = df['School'].replace('Cal Poly', 'Cal Poly SLO')
    df['School'] = df['School'].replace('Centenary (LA)', 'Centenary')
    df['School'] = df['School'].replace('Central Connecticut St', 'Central Conn')
    df['School'] = df['School'].replace('Charleston Southern', 'Charleston So')
    df['School'] = df['School'].replace('Coastal Carolina', 'Coastal Car')
    df['School'] = df['School'].replace('College of Charleston', 'Col Charleston')
    df['School'] = df['School'].replace('Cal St Fullerton', 'CS Fullerton')
    df['School'] = df['School'].replace('Cal St Sacramento', 'CS Sacramento')
    df['School'] = df['School'].replace('Cal St Bakersfield', 'CS Bakersfield')
    df['School'] = df['School'].replace('Cal St Northridge', 'CS Northridge')
    df['School'] = df['School'].replace('East Tennessee St', 'ETSU')
    df['School'] = df['School'].replace('Detroit Mercy', 'Detroit')
    df['School'] = df['School'].replace('Fairleigh Dickinson', 'F Dickinson')
    df['School'] = df['School'].replace('Florida Atlantic', 'FL Atlantic')
    df['School'] = df['School'].replace('Florida Gulf Coast', 'FL Gulf Coast')
    df['School'] = df['School'].replace('Florida International', 'Florida Intl')
    df['School'] = df['School'].replace('George Washington', 'G Washington')
    df['School'] = df['School'].replace('Georgia Southern', 'Ga Southern')
    df['School'] = df['School'].replace('Gardner-Webb', 'Gardner Webb')
    df['School'] = df['School'].replace('Illinois-Chicago', 'IL Chicago')
    df['School'] = df['School'].replace('Kent St', 'Kent')
    df['School'] = df['School'].replace('Long Island University', 'Long Island')
    df['School'] = df['School'].replace('Loyola Marymount', 'Loy Marymount')
    df['School'] = df['School'].replace('Loyola (MD)', 'Loyola MD')
    df['School'] = df['School'].replace('Loyola (IL)', 'Loyola-Chicago')
    df['School'] = df['School'].replace('Massachusetts', 'MA Lowell')
    df['School'] = df['School'].replace('Maryland-Eastern Shore', 'MD E Shore')
    df['School'] = df['School'].replace('Miami (FL)', 'Miami FL')
    df['School'] = df['School'].replace('Miami (OH)', 'Miami OH')
    df['School'] = df['School'].replace('Missouri-Kansas City', 'Missouri KC')
    df['School'] = df['School'].replace('Monmouth', 'Monmouth NJ')
    df['School'] = df['School'].replace('Mississippi Valley St', 'MS Valley St')
    df['School'] = df['School'].replace('Montana St', 'MTSU')
    df['School'] = df['School'].replace('Northern Colorado', 'N Colorado')
    df['School'] = df['School'].replace('North Dakota St', 'N Dakota St')
    df['School'] = df['School'].replace('Northern Illinois', 'N Illinois')
    df['School'] = df['School'].replace('Northern Kentucky', 'N Kentucky')
    df['School'] = df['School'].replace('North Carolina A&T', 'NC A&T')
    df['School'] = df['School'].replace('North Carolina Central', 'NC Central')
    df['School'] = df['School'].replace('Pennsylvania', 'Penn')
    df['School'] = df['School'].replace('South Carolina St', 'S Carolina St')
    df['School'] = df['School'].replace('Southern Illinois', 'S Illinois')
    df['School'] = df['School'].replace('UC-Santa Barbara', 'Santa Barbara')
    df['School'] = df['School'].replace('Southeastern Louisiana', 'SE Louisiana')
    df['School'] = df['School'].replace('Southeast Missouri St', 'SE Missouri St')
    df['School'] = df['School'].replace('Stephen F. Austin', 'SF Austin')
    df['School'] = df['School'].replace('Southern Methodist', 'SMU')
    df['School'] = df['School'].replace('Southern Mississippi', 'Southern Miss')
    df['School'] = df['School'].replace('Southern', 'Southern Univ')
    df['School'] = df['School'].replace('St. Bonaventure', 'St Bonaventure')
    df['School'] = df['School'].replace('St. Francis (NY)', 'St Francis NY')
    df['School'] = df['School'].replace('Saint Francis (PA)', 'St Francis PA')
    df['School'] = df['School'].replace('St. John\'s (NY)', 'St John\'s')
    df['School'] = df['School'].replace('Saint Joseph\'s', 'St Joseph\'s PA')
    df['School'] = df['School'].replace('Saint Louis', 'St Louis')
    df['School'] = df['School'].replace('Saint Mary\'s (CA)', 'St Mary\'s CA')
    df['School'] = df['School'].replace('Mount Saint Mary\'s', 'Mt St Mary\'s')
    df['School'] = df['School'].replace('Saint Peter\'s', 'St Peter\'s')
    df['School'] = df['School'].replace('Texas A&M-Corpus Christian', 'TAM C. Christian')
    df['School'] = df['School'].replace('Texas Christian', 'TCU')
    df['School'] = df['School'].replace('Tennessee-Martin', 'TN Martin')
    df['School'] = df['School'].replace('Texas-Rio Grande Valley', 'UTRGV')
    df['School'] = df['School'].replace('Texas Southern', 'TX Southern')
    df['School'] = df['School'].replace('Alabama-Birmingham', 'UAB')
    df['School'] = df['School'].replace('UC-Davis', 'UC Davis')
    df['School'] = df['School'].replace('UC-Irvine', 'UC Irvine')
    df['School'] = df['School'].replace('UC-Riverside', 'UC Riverside')
    df['School'] = df['School'].replace('Central Florida', 'UCF')
    df['School'] = df['School'].replace('Louisiana-Lafayette', 'ULL')
    df['School'] = df['School'].replace('Louisiana-Monroe', 'ULM')
    df['School'] = df['School'].replace('Maryland-Baltimore County', 'UMBC')
    df['School'] = df['School'].replace('North Carolina-Asheville', 'UNC Asheville')
    df['School'] = df['School'].replace('North Carolina-Greensboro', 'UNC Greensboro')
    df['School'] = df['School'].replace('North Carolina-Wilmington', 'UNC Wilmington')
    df['School'] = df['School'].replace('Nevada-Las Vegas', 'UNLV')
    df['School'] = df['School'].replace('Texas-Arlington', 'UT Arlington')
    df['School'] = df['School'].replace('Texas-San Antonio', 'UT San Antonio')
    df['School'] = df['School'].replace('Texas-El Paso', 'UTEP')
    df['School'] = df['School'].replace('Virginia Commonwealth', 'VA Commonwealth')
    df['School'] = df['School'].replace('Western Carolina', 'W Carolina')
    df['School'] = df['School'].replace('Western Illinois', 'W Illinois')
    df['School'] = df['School'].replace('Western Kentucky', 'WKU')
    df['School'] = df['School'].replace('Western Michigan', 'W Michigan')
    df['School'] = df['School'].replace('Abilene Christian', 'Abilene Chr')
    df['School'] = df['School'].replace('Montana State', 'Montana St')
    df['School'] = df['School'].replace('Central Arkansas', 'Cent Arkansas')
    df['School'] = df['School'].replace('Houston Baptist', 'Houston Bap')
    df['School'] = df['School'].replace('South Dakota St', 'S Dakota St')
    return df
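
The long chain of `replace` calls above can also be written as a lookup table. Below is a minimal sketch of that idea in plain Python, assuming the names arrive as a list of strings. `SCHOOL_ALIASES` and `normalize_school` are hypothetical names, and only a handful of the mappings from `handleDifferentCSV` are reproduced.

```python
import re

# Hypothetical alias table: a few of the exact-match renames from handleDifferentCSV.
SCHOOL_ALIASES = {
    'Louisiana St': 'LSU',
    'Brigham Young': 'BYU',
    'Kent St': 'Kent',
    'Southern California': 'USC',
    'Miami (FL)': 'Miami FL',
}

def normalize_school(name):
    """Map one school name from the stats CSVs to the Teams.csv spelling."""
    # Substring rule first, mirroring the regex=True replacement of 'State' with 'St'.
    name = re.sub('State', 'St', name)
    # Then exact-match renames; unknown names pass through unchanged.
    return SCHOOL_ALIASES.get(name, name)

names = ['Louisiana State', 'Kent State', 'Duke', 'Miami (FL)']
print([normalize_school(n) for n in names])
```

Keeping the renames in one dictionary makes duplicates and conflicting entries easier to spot than in a long list of separate calls. With pandas, a dictionary like this could probably be passed to `Series.replace` directly, though that is not what the original notebook does.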

We also deal with other edge cases, such as a standalone 'St' or 'FL' token that has to be joined back onto the word before it.

In [13]:
def handleCases(arr):
    indices = []
    listLen = len(arr)
    for i in range(listLen):
        if (arr[i] == 'St' or arr[i] == 'FL'):
            indices.append(i)
    for p in indices:
        arr[p-1] = arr[p-1] + ' ' + arr[p]
    for i in range(len(indices)): 
        arr.remove(arr[indices[i] - i])
    return arr
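
To see what `handleCases` is for, here is a small self-contained sketch of the same token-merging idea. `merge_suffix_tokens` is a hypothetical name, and the sample string is the 2003 Southeastern Conference entry from the `Regular Season Champ` column shown earlier.

```python
def merge_suffix_tokens(tokens, suffixes=('St', 'FL')):
    """Join a standalone suffix token (e.g. 'St') onto the token before it."""
    merged = []
    for tok in tokens:
        if tok in suffixes and merged:
            # Glue the suffix onto the previous token instead of keeping it separate.
            merged[-1] = merged[-1] + ' ' + tok
        else:
            merged.append(tok)
    return merged

champs = 'Kentucky (East) Mississippi St (West)'
tokens = merge_suffix_tokens(champs.split())
print(tokens)
print('Mississippi St' in tokens)
```

Note that in this sketch multi-word names without one of the suffixes (for example `Wake Forest`) stay split into separate tokens, so a whole-name membership test will not find them.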

Now we deal with the Power 6 Conferences. Unfortunately, they also had to be hardcoded.

In [14]:
# All Power 6 Conferences
listACCteams = ['North Carolina','Virginia','Florida St','Louisville','Notre Dame','Syracuse','Duke','Virginia Tech','Georgia Tech','Miami','Wake Forest','Clemson','NC State','Boston College','Pittsburgh']
listPac12teams = ['Arizona','Oregon','UCLA','California','USC','Utah','Washington St','Stanford','Arizona St','Colorado','Washington','Oregon St']
listSECteams = ['Kentucky','South Carolina','Florida','Arkansas','Alabama','Tennessee','Mississippi St','Georgia','Ole Miss','Vanderbilt','Auburn','Texas A&M','LSU','Missouri']
listBig10teams = ['Maryland','Wisconsin','Purdue','Northwestern','Michigan St','Indiana','Iowa','Michigan','Penn St','Nebraska','Minnesota','Illinois','Ohio St','Rutgers']
listBig12teams = ['Kansas','Baylor','West Virginia','Iowa St','TCU','Kansas St','Texas Tech','Oklahoma St','Texas','Oklahoma']
listBigEastteams = ['Butler','Creighton','DePaul','Georgetown','Marquette','Providence','Seton Hall','St John\'s','Villanova','Xavier']

def checkPower6Conference(team_id):
    teamName = teams_pd.values[team_id-1101][1]
    if (teamName in listACCteams or teamName in listBig10teams or teamName in listBig12teams
       or teamName in listSECteams or teamName in listPac12teams or teamName in listBigEastteams):
        return 1
    else:
        return 0
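
The membership test can also be expressed with a single set. This is a sketch under assumptions: the conference sets hold only a few names each for brevity, and `TEAM_NAMES` is a hypothetical ID-to-name dictionary standing in for `teams_pd`. Looking names up by ID avoids relying on the IDs being contiguous from 1101.

```python
# A few teams per conference, for illustration only (not the full lists).
ACC = {'North Carolina', 'Duke', 'Virginia'}
BIG12 = {'Kansas', 'Baylor', 'West Virginia'}
SEC = {'Kentucky', 'Florida', 'LSU'}
POWER6_TEAMS = ACC | BIG12 | SEC

# Hypothetical stand-in for teams_pd: Team_Id -> Team_Name.
TEAM_NAMES = {1452: 'West Virginia', 1422: 'UNC Greensboro'}

def check_power6(team_id):
    """Return 1 if the team is in one of the listed conferences, else 0."""
    name = TEAM_NAMES.get(team_id)  # None for unknown IDs
    return int(name in POWER6_TEAMS)

print(check_power6(1452), check_power6(1422))
```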

To make things easier, let us add helpers that convert a team_id to a college name and back.

In [15]:
def getTeamID(name):
    return teams_pd[teams_pd['Team_Name'] == name].values[0][0]

def getTeamName(team_id):
    return teams_pd[teams_pd['Team_Id'] == team_id].values[0][1]

print("ID for West Virginia is", getTeamID("West Virginia"))
print("Name for ID 1422 is", getTeamName(1422))

# Check if these two teams are in the Power 6
if checkPower6Conference(1452) == 1:
    print("West Virginia is a Power 6 School")
else:
    print("West Virginia is not a Power 6 School")

if checkPower6Conference(1422) == 1:
    print("UNC Greensboro is a Power 6 School")
else:
    print("UNC Greensboro is not a Power 6 School")

ID for West Virginia is 1452
Name for ID 1422 is UNC Greensboro
West Virginia is a Power 6 School
UNC Greensboro is not a Power 6 School


Here are functions to get the number of championships and tourney appearances.

In [16]:
def getNumChampionships(team_id):
    name = getTeamName(team_id)
    return NCAAChampionsList.count(name)

def getTourneyAppearances(team_id):
    return len(tourney_seeds_pd[tourney_seeds_pd['Team'] == team_id].index)

print(getNumChampionships(1452))
getTourneyAppearances(1452)

0


14

Now we include whether or not a given team won their conference (most wins) and/or won their conference tourney in a given year.

In [17]:
def checkConferenceChamp(team_id, year):
    year_conf_pd = conference_pd[conference_pd['Year'] == year]
    champs = year_conf_pd['Regular Season Champ'].tolist()
    # For handling cases where there is more than one champion
    champs_separated = [words for segments in champs for words in segments.split()]
    name = getTeamName(team_id)
    champs_separated = handleCases(champs_separated)
    if (name in champs_separated):
        return 1
    else:
        return 0
    
def checkConferenceTourneyChamp(team_id, year):
    year_conf_pd = conference_pd[conference_pd['Year'] == year]
    champs = year_conf_pd['Tournament Champ'].tolist()
    name = getTeamName(team_id)
    if (name in champs):
        return 1
    else:
        return 0
    
id1 = getTeamID("Purdue")
id2 = getTeamID("Michigan St")

if checkConferenceChamp(id1, 2009) == 1:
    print("In 2009, Purdue won the Big Ten Conference Championship")
else:
    print("In 2009, Purdue did not win the Big Ten Conference Championship")
    
if checkConferenceChamp(id2, 2009) == 1:
    print("In 2009, Michigan St won the Big Ten Conference Championship")
else:
    print("In 2009, Michigan St did not win the Big Ten Conference Championship")
    
if checkConferenceTourneyChamp(id1, 2009) == 1:
    print("In 2009, Purdue won the Big Ten Conference Championship Tourney")
else:
    print("In 2009, Purdue did not win the Big Ten Conference Championship Tourney")
    
if checkConferenceTourneyChamp(id2, 2009) == 1:
    print("In 2009, Michigan St won the Big Ten Conference Championship Tourney")
else:
    print("In 2009, Michigan St did not win the Big Ten Conference Championship Tourney")

In 2009, Purdue did not win the Big Ten Conference Championship
In 2009, Michigan St won the Big Ten Conference Championship
In 2009, Purdue won the Big Ten Conference Championship Tourney
In 2009, Michigan St did not win the Big Ten Conference Championship Tourney


Finally, let us get the URLs for each team.

In [18]:
def getListForURL(team_list):
    team_list = [x.lower() for x in team_list]
    team_list = [t.replace(' ', '-') for t in team_list]
    team_list = [t.replace('st', 'state') for t in team_list]
    team_list = [t.replace('northern-dakota', 'north-dakota') for t in team_list]
    team_list = [t.replace('nc-', 'north-carolina-') for t in team_list]
    team_list = [t.replace('fl-', 'florida-') for t in team_list]
    team_list = [t.replace('ga-', 'georgia-') for t in team_list]
    team_list = [t.replace('lsu', 'louisiana-state') for t in team_list]
    team_list = [t.replace('maristate', 'marist') for t in team_list]
    team_list = [t.replace('stateate', 'state') for t in team_list]
    team_list = [t.replace('northernorthern', 'northern') for t in team_list]
    team_list = [t.replace('usc', 'southern-california') for t in team_list]
    base = 'http://www.sports-reference.com/cbb/schools/'
    # Build and return one sports-reference URL per team
    return [base + team + '/' for team in team_list]
getListForURL(teamList);
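
The chain of `str.replace` calls works, but a bare `'st'` to `'state'` replacement also hits names that merely contain those letters, which is why follow-up fixes such as `'maristate'` to `'marist'` are needed. Below is a hedged sketch of an alternative that only expands whole words. `slugify_team` and `WORD_EXPANSIONS` are hypothetical names, and the resulting URLs are not checked against the actual site.

```python
# Hypothetical whole-word expansions, taken from the replacements above.
WORD_EXPANSIONS = {'st': 'state', 'nc': 'north-carolina', 'fl': 'florida', 'ga': 'georgia'}
BASE = 'http://www.sports-reference.com/cbb/schools/'

def slugify_team(name):
    """Lower-case a team name, hyphenate it, and expand whole-word abbreviations."""
    words = name.lower().split()
    return '-'.join(WORD_EXPANSIONS.get(w, w) for w in words)

for team in ['Michigan St', 'Boston College', 'Marist', 'NC State']:
    print(BASE + slugify_team(team) + '/')
```

A leading `St` that means Saint (as in `St John's`) would still be expanded incorrectly here, so some special cases remain either way.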

A lot of these features are very straightforward, like regular season wins and points per game. We seem to have a label on assists per game and not rebounds per game, but I think that is an independence issue. Well, on to building the vectors.

# Season Vectors

In [19]:
def getSeasonData(team_id, year):
    # The data frame below holds stats for every single game in the given year
    year_data_pd = reg_season_compact_pd[reg_season_compact_pd['Season'] == year]
    # Finding number of points per game
    gamesWon = year_data_pd[year_data_pd.Wteam == team_id] 
    totalPointsScored = gamesWon['Wscore'].sum()
    gamesLost = year_data_pd[year_data_pd.Lteam == team_id] 
    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    totalGames = pd.concat([gamesWon, gamesLost])
    numGames = len(totalGames.index)
    totalPointsScored += gamesLost['Lscore'].sum()
    
    # Finding number of points per game allowed
    totalPointsAllowed = gamesWon['Lscore'].sum()
    totalPointsAllowed += gamesLost['Wscore'].sum()
    
    stats_SOS_pd = pd.read_csv('Data/MMStats/MMStats_'+str(year)+'.csv')
    stats_SOS_pd = handleDifferentCSV(stats_SOS_pd)
    ratings_pd = pd.read_csv('Data/RatingStats/RatingStats_'+str(year)+'.csv')
    ratings_pd = handleDifferentCSV(ratings_pd)
    
    name = getTeamName(team_id)
    team = stats_SOS_pd[stats_SOS_pd['School'] == name]
    team_rating = ratings_pd[ratings_pd['School'] == name]
    if (len(team.index) == 0 or len(team_rating.index) == 0): #Can't find the team
        total3sMade = 0
        totalTurnovers = 0
        totalAssists = 0
        sos = 0
        totalRebounds = 0
        srs = 0
        totalSteals = 0
    else:
        total3sMade = team['X3P'].values[0]
        totalTurnovers = team['TOV'].values[0]
        if (math.isnan(totalTurnovers)):
            totalTurnovers = 0
        totalAssists = team['AST'].values[0]
        if (math.isnan(totalAssists)):
            totalAssists = 0
        sos = team['SOS'].values[0]
        srs = team['SRS'].values[0]
        totalRebounds = team['TRB'].values[0]
        if (math.isnan(totalRebounds)):
            totalRebounds = 0
        totalSteals = team['STL'].values[0]
        if (math.isnan(totalSteals)):
            totalSteals = 0
    
    #Finding tournament seed for that year
    tourneyYear = tourney_seeds_pd[tourney_seeds_pd['Season'] == year]
    seed = tourneyYear[tourneyYear['Team'] == team_id]
    if (len(seed.index) != 0):
        seed = seed.values[0][1]
        tournamentSeed = int(seed[1:3])
    else:
        tournamentSeed = 25 #Not sure how to represent if a team didn't make the tourney
    
    # Finding number of wins and losses
    numWins = len(gamesWon.index)
    # There are some teams who may have dropped to Division 2, so they won't have games 
    # a certain year. In this case, we don't want to divide by 0, so we'll just set the
    # averages to 0 instead
    if numGames == 0:
        avgPointsScored = 0
        avgPointsAllowed = 0
        avg3sMade = 0
        avgTurnovers = 0
        avgAssists = 0
        avgRebounds = 0
        avgSteals = 0
    else:
        avgPointsScored = totalPointsScored/numGames
        avgPointsAllowed = totalPointsAllowed/numGames
        avg3sMade = total3sMade/numGames
        avgTurnovers = totalTurnovers/numGames
        avgAssists = totalAssists/numGames
        avgRebounds = totalRebounds/numGames
        avgSteals = totalSteals/numGames
    #return [numWins, sos, srs]
    #return [numWins, avgPointsScored, avgPointsAllowed, checkPower6Conference(team_id), avg3sMade, avg3sAllowed, avgTurnovers,
    #        tournamentSeed, getStrengthOfSchedule(team_id, year), getTourneyAppearances(team_id)]
    return [numWins, avgPointsScored, avgPointsAllowed, checkPower6Conference(team_id), avg3sMade, avgAssists, avgTurnovers,
           checkConferenceChamp(team_id, year), checkConferenceTourneyChamp(team_id, year), tournamentSeed,
            sos, srs, avgRebounds, avgSteals, getTourneyAppearances(team_id), getNumChampionships(team_id)]

getSeasonData(getTeamID('Kentucky'), 2016)

[26,
 79.67647058823529,
 68.26470588235294,
 1,
 7.176470588235294,
 15.058823529411764,
 11.823529411764707,
 1,
 1,
 4,
 8.84,
 20.23,
 41.029411764705884,
 6.0588235294117645,
 27,
 3]

Here is how the vector is constructed:

* Regular Season Wins 
* Points per game season average 
* Points per game allowed season average
* Whether or not in Power 6 conference (ACC, Big Ten, Big 12, SEC, Pac 12, Big East) - Binary label
* Number of 3's per game
* Assists per game average
* Turnovers per game average
* Conference Championship - binary label
* Conference Tournament Championship - binary label
* Tournament Seed
* Strength of Schedule
* Simple Rating System
* Rebounds per game average
* Steals per game average
* Number of NCAA appearances since 1985
* Number of NCAA championships
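
To make the positions explicit, the sketch below pairs each slot with a label, following the order of the `return` statement in `getSeasonData`. `FEATURE_NAMES` is a hypothetical list, and the values are the Kentucky 2016 output above, rounded to two decimals for readability.

```python
# Hypothetical labels, in the order of getSeasonData's return statement.
FEATURE_NAMES = [
    'Wins', 'PPG', 'PPG allowed', 'Power 6', '3s per game',
    'Assists per game', 'Turnovers per game', 'Conference champ',
    'Conference tourney champ', 'Seed', 'SOS', 'SRS',
    'Rebounds per game', 'Steals per game',
    'Tourney appearances', 'National championships',
]

# Kentucky 2016 vector from the output above, rounded to two decimals.
kentucky_2016 = [26, 79.68, 68.26, 1, 7.18, 15.06, 11.82, 1, 1, 4,
                 8.84, 20.23, 41.03, 6.06, 27, 3]

# Guard against the labels and the vector drifting out of sync.
assert len(FEATURE_NAMES) == len(kentucky_2016)
labeled = dict(zip(FEATURE_NAMES, kentucky_2016))
for name, value in labeled.items():
    print(f'{name:>26}: {value}')
```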

# Testing the SVC and Other Models

| Technique           | 1999 - 2016 | 2017 |
|---------------------|-------------|------|
| Decision Tree       | 67.2        | 50   |   
| Logistic Regression | 70          | 73   |  
| KNN                 | 67.1        | 64   |  
| Random Forest       | 68.2        | 50   |   
| SGD                 | 69          | 69.1 |   
| SVC                 | 71.5        | 85.1 |  

Now according to the paper, SVC had the best accuracy overall. Let us see it and the other models in action. The training data takes a really long time to build, so below we load the premade matrices.

In [20]:
xTrain = np.load('PrecomputedMatrices/xTrain.npy')
yTrain = np.load('PrecomputedMatrices/yTrain.npy')
X_train, X_test, Y_train, Y_test = train_test_split(xTrain, yTrain)

categories=['Wins','PPG','PPGA','PowerConf','3PG', 'APG','TOP','Conference Champ','Tourney Conference Champ',
           'Seed','SOS','SRS', 'RPG', 'SPG', 'Tourney Appearances','National Championships','Location']

To print out the accuracy scores of some models, I added some of my own code, mainly to showcase how long each model actually took to run. (Looking at you, SVC!)

In [38]:
import time

X_train, X_test, Y_train, Y_test = train_test_split(xTrain, yTrain)

'''
My code: Make Dictionaries containing all the models in the form model: [a, b, c, d] where
- a is the model
- b is the accuracy score
- c is how long it took to run
- d is whether it was in seconds, ms, or ns
'''
d = {'Decision Tree': [tree.DecisionTreeClassifier(), 0, 0, ''],
     'Logistic Regression': [linear_model.LogisticRegression(), 0, 0, ''],
     'KNN 39': [KNeighborsClassifier(n_neighbors=39), 0, 0, ''],
     'Random Forest 64': [RandomForestClassifier(n_estimators=64), 0, 0, ''],
     'SVC': [svm.SVC(), 0, 0, '']}

for i in d.keys():
    start = time.time_ns()
    
    # I want to measure how long this takes
    model = d[i][0]
    results = model.fit(X_train, Y_train)
    preds = model.predict(X_test)
    
    end = time.time_ns()
    diff = end - start
    diff = '%.3f'%(diff) # To look better
    diff = float(diff)
    
    if diff > 1e9: # Took Longer than a second, print in seconds
        d[i][2] = diff / 1e9
        d[i][3] = 'seconds'
    elif diff > 1e6: # Took longer than a millisecond, print in ms
        d[i][2] = diff / 1e6
        d[i][3] = 'ms'
    else: # Took less than a millisecond, print in ns
        d[i][2] = diff 
        d[i][3] = 'ns'
    
    # Continue on
    preds[preds < .5] = 0
    preds[preds >= .5] = 1
    
    accuracy = np.mean(preds == Y_test) * 100 
    accuracy = '%.3f'%(accuracy) # Also truncate
    accuracy = float(accuracy) 
    d[i][1] = accuracy

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [39]:
# Printing out the Results
for i in d:
    print('Model:', i, 'Accuracy:', d[i][1], 'Time:', d[i][2], d[i][3])

Model: Decision Tree Accuracy: 65.023 Time: 1.469822 seconds
Model: Logistic Regression Accuracy: 76.226 Time: 1.118918 seconds
Model: KNN 39 Accuracy: 75.218 Time: 19.076413 seconds
Model: Random Forest 64 Accuracy: 72.233 Time: 15.223764 seconds
Model: SVC Accuracy: 76.205 Time: 317.002601 seconds


So SVC did not have the best accuracy, but Logistic Regression did. However, LR was a close second according to the table, and here it only achieved a better accuracy by less than a tenth of a percentage point. Decision Tree accuracy is bad, so we will not use that (or we will to shame it). KNN did pretty w