**Predict NCAA Basketball 2018**

Here we use logistic regression to predict the outcome of every possible matchup in the 2018 March Madness basketball tournament.  For each team, the classifier uses 14 features: a rating called Elo ([Link #1](https://en.wikipedia.org/wiki/Elo_rating_system), [Link #2](https://fivethirtyeight.com/features/how-we-calculate-nba-elo-ratings/)) and 13 traditional basketball statistics averaged over recent games (described below), which gives 28 inputs per matchup.  Note that many functions are adapted from [this solution](https://github.com/harvitronix/kaggle-march-madness-machine-learning) from 2016.

*Step 1: Import Libraries*

In [1]:
import pandas as pd
import numpy
import math
import csv
import random
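from sklearn import linear_model, model_selection
# sklearn.cross_validation was removed in scikit-learn 0.20; alias the old name
# to model_selection, which provides the same cross_val_score function.
cross_validation = model_selection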
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV



*Step 2: Load Data*

In [2]:
folder = '../input'
season_data = pd.read_csv(folder + '/mens-machine-learning-competition-2018/RegularSeasonDetailedResults.csv')
tourney_data = pd.read_csv(folder + '/mens-machine-learning-competition-2018/NCAATourneyDetailedResults.csv')
seeds = pd.read_csv(folder + '/mens-machine-learning-competition-2018/NCAATourneySeeds.csv')
frames = [season_data, tourney_data]
all_data = pd.concat(frames)
stat_fields = ['score', 'fga', 'fgp', 'fga3', '3pp', 'ftp', 'or', 'dr',
                   'ast', 'to', 'stl', 'blk', 'pf']
prediction_year = 2018
base_elo = 1600
team_elos = {}
team_stats = {}
X = []
y = []
submission_data = []
def initialize_data():
    for i in range(1985, prediction_year+1):
        team_elos[i] = {}
        team_stats[i] = {}
initialize_data()

*Step 3: Explore Data*

In [3]:
all_data.head(10)

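|  | Season | DayNum | WTeamID | WScore | LTeamID | LScore | WLoc | NumOT | WFGM | WFGA | ... | LFGA3 | LFTM | LFTA | LOR | LDR | LAst | LTO | LStl | LBlk | LPF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | 10 | 1104 | 68 | 1328 | 62 | N | 0 | 27 | 58 | ... | 10 | 16 | 22 | 10 | 22 | 8 | 18 | 9 | 2 | 20 |
| 1 | 2003 | 10 | 1272 | 70 | 1393 | 63 | N | 0 | 26 | 62 | ... | 24 | 9 | 20 | 20 | 25 | 7 | 12 | 8 | 6 | 16 |
| 2 | 2003 | 11 | 1266 | 73 | 1437 | 61 | N | 0 | 24 | 58 | ... | 26 | 14 | 23 | 31 | 22 | 9 | 12 | 2 | 5 | 23 |
| 3 | 2003 | 11 | 1296 | 56 | 1457 | 50 | N | 0 | 18 | 38 | ... | 22 | 8 | 15 | 17 | 20 | 9 | 19 | 4 | 3 | 23 |
| 4 | 2003 | 11 | 1400 | 77 | 1208 | 71 | N | 0 | 30 | 61 | ... | 16 | 17 | 27 | 21 | 15 | 12 | 10 | 7 | 1 | 14 |
| 5 | 2003 | 11 | 1458 | 81 | 1186 | 55 | H | 0 | 26 | 57 | ... | 11 | 12 | 17 | 6 | 22 | 8 | 19 | 4 | 3 | 25 |
| 6 | 2003 | 12 | 1161 | 80 | 1236 | 62 | H | 0 | 23 | 55 | ... | 15 | 20 | 28 | 9 | 21 | 11 | 30 | 10 | 4 | 28 |
| 7 | 2003 | 12 | 1186 | 75 | 1457 | 61 | N | 0 | 28 | 62 | ... | 17 | 17 | 23 | 8 | 25 | 10 | 15 | 14 | 8 | 18 |
| 8 | 2003 | 12 | 1194 | 71 | 1156 | 66 | N | 0 | 28 | 58 | ... | 18 | 12 | 27 | 13 | 26 | 13 | 25 | 8 | 2 | 18 |
| 9 | 2003 | 12 | 1458 | 84 | 1296 | 56 | H | 0 | 32 | 67 | ... | 14 | 7 | 12 | 9 | 23 | 10 | 18 | 1 | 3 | 18 |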


*Step 4: Define Helper Functions*

In [4]:
def get_elo(season, team):
    try:
        return team_elos[season][team]
    except KeyError:
        try:
            # Get the previous season's ending value.
            team_elos[season][team] = team_elos[season-1][team]
            return team_elos[season][team]
        except KeyError:
            # No earlier rating exists, so start from the base elo.
            team_elos[season][team] = base_elo
            return team_elos[season][team]

def calc_elo(win_team, lose_team, season):
    winner_rank = get_elo(season, win_team)
    loser_rank = get_elo(season, lose_team)
    rank_diff = winner_rank - loser_rank
    exp = (rank_diff * -1) / 400
    odds = 1 / (1 + math.pow(10, exp))
    if winner_rank < 2100:
        k = 32
    elif winner_rank < 2400:
        k = 24
    else:
        k = 16
    new_winner_rank = round(winner_rank + (k * (1 - odds)))
    new_rank_diff = new_winner_rank - winner_rank
    new_loser_rank = loser_rank - new_rank_diff
    return new_winner_rank, new_loser_rank

def get_stat(season, team, field):
    try:
        values = team_stats[season][team][field]
        return sum(values) / float(len(values))
    except (KeyError, ZeroDivisionError):
        # No games recorded yet for this team and field.
        return 0
    
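def update_stats(season, team, fields):
    if team not in team_stats[season]:
        team_stats[season][team] = {}
    for key, value in fields.items():
        # Make sure we have the field.
        if key not in team_stats[season][team]:
            team_stats[season][team][key] = []
        # Cap each list at 9 values. Note that pop() removes the most recently
        # added value, so a full list keeps its first 8 games plus the newest
        # one; pop(0) would instead keep the 9 most recent games.
        if len(team_stats[season][team][key]) >= 9:
            team_stats[season][team][key].pop()
        team_stats[season][team][key].append(value)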
        
def predict_winner(team_1, team_2, model, season, stat_fields):
    features = []
    # Team 1
    features.append(get_elo(season, team_1))
    for stat in stat_fields:
        features.append(get_stat(season, team_1, stat))
    # Team 2
    features.append(get_elo(season, team_2))
    for stat in stat_fields:
        features.append(get_stat(season, team_2, stat))
    return model.predict_proba([features])
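
To make the Elo arithmetic in `calc_elo` concrete, here is a minimal standalone sketch of the same update rule. The names `expected_score` and `elo_update` are illustrative and not part of the notebook, and `k` is fixed at 32, which is the value `calc_elo` uses when the winner is rated below 2100.

```python
import math

def expected_score(rating_a, rating_b):
    """Win probability of A against B under the Elo model."""
    return 1 / (1 + math.pow(10, (rating_b - rating_a) / 400))

def elo_update(winner_rating, loser_rating, k=32):
    """Return (new_winner, new_loser) after one game.

    Mirrors calc_elo: the winner's new rating is rounded, and the loser
    gives up exactly the points the winner gains.
    """
    odds = expected_score(winner_rating, loser_rating)
    new_winner = round(winner_rating + k * (1 - odds))
    gain = new_winner - winner_rating
    return new_winner, loser_rating - gain

even = elo_update(1600, 1600)      # evenly matched teams
upset = elo_update(1500, 1700)     # lower-rated team wins
favorite = elo_update(1700, 1500)  # higher-rated team wins
print(even, upset, favorite)
```

An upset moves both ratings much further than an expected result does, which is what lets the rating react quickly when a team outperforms its record.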

*Step 5: Feature Selection and Feature Engineering*

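For each team, the model uses an Elo rating ([Link #1](https://en.wikipedia.org/wiki/Elo_rating_system), [Link #2](https://fivethirtyeight.com/features/how-we-calculate-nba-elo-ratings/)) plus the 13 statistics named in `stat_fields`: ten raw per-game values (points scored, field goal and three-point attempts, rebounds, assists, turnovers, steals, blocks, and fouls) and three engineered shooting percentages.  The made-shot and free-throw-attempt columns are used only to compute those percentages.  The box-score columns and engineered features are described below: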

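Features (the `w` prefix marks the winning team's value; each column has an `l`-prefixed counterpart for the losing team):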

            wfgm :  field goals made
            wfga :  field goals attempted
            wfgm3 :  three pointers made
            wfga3 :  three pointers attempted
            wftm :  free throws made
            wfta :  free throws attempted
            wor :  offensive rebounds
            wdr :  defensive rebounds
            wast :  assists
            wto :  turnovers
            wstl :  steals
            wblk :  blocks
            wpf :  personal fouls

Engineered Features:

            fgp :  field goal percentage
            3pp :  three point percentage
            ftp :  free throw percentage
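
As a rough illustration of how the three engineered features are derived from made and attempted shots, consider the sketch below. The helper `shooting_percentages` is hypothetical, the input numbers are made up, and returning 0.0 when there are no attempts is an assumption of this sketch (the notebook instead skips the stat update for games with zero free throw attempts).

```python
def shooting_percentages(fgm, fga, fgm3, fga3, ftm, fta):
    """Return fgp, 3pp and ftp as percentages on a 0-100 scale."""
    def pct(made, attempted):
        # Assumption of this sketch: report 0.0 when nothing was attempted.
        return made / attempted * 100 if attempted else 0.0

    return {
        'fgp': pct(fgm, fga),    # field goal percentage
        '3pp': pct(fgm3, fga3),  # three point percentage
        'ftp': pct(ftm, fta),    # free throw percentage
    }

# Made-up box score for a single game.
example = shooting_percentages(fgm=27, fga=58, fgm3=3, fga3=14, ftm=11, fta=18)
print({name: round(value, 1) for name, value in example.items()})
```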

In [5]:
def build_season_data(all_data):
    # Calculate the elo for every game for every team, each season.
    # Store the elo per season so we can retrieve their end elo
    # later in order to predict the tournaments without having to
    # inject the prediction into this loop.
    for index, row in all_data.iterrows():
        # Used to skip matchups where we don't have usable stats yet.
        skip = 0
        # Get starter or previous elos.
        team_1_elo = get_elo(row['Season'], row['WTeamID'])
        team_2_elo = get_elo(row['Season'], row['LTeamID'])
        # Add 100 Elo points to the home team (value taken from Nate Silver's analysis).
        if row['WLoc'] == 'H':
            team_1_elo += 100
        elif row['WLoc'] == 'A':
            team_2_elo += 100         
        # We'll create some arrays to use later.
        team_1_features = [team_1_elo]
        team_2_features = [team_2_elo]
        # Build arrays out of the stats we're tracking.
        for field in stat_fields:
            team_1_stat = get_stat(row['Season'], row['WTeamID'], field)
            team_2_stat = get_stat(row['Season'], row['LTeamID'], field)
            # get_stat returns 0 when a team has no recorded games yet.
            if team_1_stat != 0 and team_2_stat != 0:
                team_1_features.append(team_1_stat)
                team_2_features.append(team_2_stat)
            else:
                skip = 1
        if skip == 0:  # Make sure we have stats.
            # Randomly place the winner on the left (label 0) or on the right
            # (label 1) so that both classes appear in the training data.
            if random.random() > 0.5:
                X.append(team_1_features + team_2_features)
                y.append(0)
            else:
                X.append(team_2_features + team_1_features)
                y.append(1)
        # AFTER we add the current stuff to the prediction, update for
        # next time. Order here is key so we don't fit on data from the
        # same game we're trying to predict.
        if row['WFTA'] != 0 and row['LFTA'] != 0:
            stat_1_fields = {
                'score': row['WScore'],
                'fgp': row['WFGM'] / row['WFGA'] * 100,
                'fga': row['WFGA'],
                'fga3': row['WFGA3'],
                '3pp': row['WFGM3'] / row['WFGA3'] * 100,
                'ftp': row['WFTM'] / row['WFTA'] * 100,
                'or': row['WOR'],
                'dr': row['WDR'],
                'ast': row['WAst'],
                'to': row['WTO'],
                'stl': row['WStl'],
                'blk': row['WBlk'],
                'pf': row['WPF']
            }            
            stat_2_fields = {
                'score': row['LScore'],
                'fgp': row['LFGM'] / row['LFGA'] * 100,
                'fga': row['LFGA'],
                'fga3': row['LFGA3'],
                '3pp': row['LFGM3'] / row['LFGA3'] * 100,
                'ftp': row['LFTM'] / row['LFTA'] * 100,
                'or': row['LOR'],
                'dr': row['LDR'],
                'ast': row['LAst'],
                'to': row['LTO'],
                'stl': row['LStl'],
                'blk': row['LBlk'],
                'pf': row['LPF']
            }
            update_stats(row['Season'], row['WTeamID'], stat_1_fields)
            update_stats(row['Season'], row['LTeamID'], stat_2_fields)
        # Now that we've added them, calc the new elo.
        new_winner_rank, new_loser_rank = calc_elo(
            row['WTeamID'], row['LTeamID'], row['Season'])
        team_elos[row['Season']][row['WTeamID']] = new_winner_rank
        team_elos[row['Season']][row['LTeamID']] = new_loser_rank
    return X, y
X, y = build_season_data(all_data)
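
The per-team statistics fed to the model are averages over a capped history of games. One way to keep a strict "last 9 games" window is `collections.deque` with `maxlen`, sketched below. This is an alternative illustration, not the notebook's list-based implementation, and the names `WINDOW`, `record` and `average` are hypothetical.

```python
from collections import deque

WINDOW = 9  # same cap that update_stats applies to each stat list

def record(history, field, value):
    """Append a game value; the deque silently drops the oldest entry when full."""
    history.setdefault(field, deque(maxlen=WINDOW)).append(value)

def average(history, field):
    """Mean of the stored values, or 0 when nothing has been recorded."""
    values = history.get(field)
    return sum(values) / len(values) if values else 0

history = {}
for points in range(60, 72):  # twelve made-up games scoring 60..71 points
    record(history, 'score', points)

print(list(history['score']), average(history, 'score'))
```

Only the nine most recent scores survive, so the average tracks current form instead of the start of the season.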

*Step 6: Use Logistic Regression To Predict Game Outcomes*

In [6]:
model = linear_model.LogisticRegression()
print("Let's hope to be correct 75% of the time")
print(model_selection.cross_val_score(model, numpy.array(X), numpy.array(y), cv=10, scoring='accuracy', n_jobs=-1).mean())
model.fit(X, y)
tourney_teams = []
for index, row in seeds.iterrows():
    if row['Season'] == prediction_year:
        tourney_teams.append(row['TeamID'])
tourney_teams.sort()
for team_1 in tourney_teams:
    for team_2 in tourney_teams:
        if team_1 < team_2:
            prediction = predict_winner(
                team_1, team_2, model, prediction_year, stat_fields)
            label = str(prediction_year) + '_' + str(team_1) + '_' + \
                str(team_2)
            submission_data.append([label, prediction[0][0]])

Let's hope to be correct 75% of the time
0.7260128073395251
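
The nested loop above emits one prediction per unordered pair of tournament teams, with the lower team ID first in the label. A compact sketch of the same enumeration using `itertools.combinations` follows; the helper `matchup_labels` and the team IDs are illustrative placeholders.

```python
from itertools import combinations

def matchup_labels(year, team_ids):
    """Build 'year_lowID_highID' labels for every unordered pair of teams."""
    return ['%d_%d_%d' % (year, a, b)
            for a, b in combinations(sorted(team_ids), 2)]

labels = matchup_labels(2018, [1113, 1104, 1112])
print(labels)

# A 68-team field yields 68 * 67 / 2 pairs (placeholder IDs 1101..1168).
n_pairs = len(matchup_labels(2018, range(1101, 1169)))
print(n_pairs)
```

With 68 teams this gives 2278 pairs, which matches the number of rows the notebook reports writing.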


*Step 7: Submit Results*

In [7]:
print("Writing %d results." % len(submission_data))
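# Name the columns to match the competition's submission format (ID, Pred).
submission_data2 = pd.DataFrame(submission_data, columns=['ID', 'Pred'])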
submission_data2.to_csv("submission1.csv", index=False)
def build_team_dict():
    team_ids = pd.read_csv(folder + '/mens-machine-learning-competition-2018/Teams.csv')
    team_id_map = {}
    for index, row in team_ids.iterrows():
        team_id_map[row['TeamID']] = row['TeamName']
    return team_id_map
team_id_map = build_team_dict()
readable = []
less_readable = []  # A version that's easy to look up.
for pred in submission_data:
    parts = pred[0].split('_')
    less_readable.append(
        [team_id_map[int(parts[1])], team_id_map[int(parts[2])], pred[1]])
    # Order them properly.
    if pred[1] > 0.5:
        winning = int(parts[1])
        losing = int(parts[2])
        proba = pred[1]
    else:
        winning = int(parts[2])
        losing = int(parts[1])
        proba = 1 - pred[1]
    readable.append(
        [
            '%s beats %s: %f' %
            (team_id_map[winning], team_id_map[losing], proba)
        ]
    )
readable

Writing 2278 results.


[['Arizona beats Alabama: 0.782418'],
 ['Arizona St beats Alabama: 0.518115'],
 ['Arkansas beats Alabama: 0.563638'],
 ['Auburn beats Alabama: 0.590357'],
 ['Alabama beats Bucknell: 0.713181'],
 ['Alabama beats Buffalo: 0.651474'],
 ['Butler beats Alabama: 0.576445'],
 ['Cincinnati beats Alabama: 0.786580'],
 ['Clemson beats Alabama: 0.612985'],
 ['Alabama beats Col Charleston: 0.710605'],
 ['Creighton beats Alabama: 0.606804'],
 ['Alabama beats CS Fullerton: 0.865036'],
 ['Alabama beats Davidson: 0.526432'],
 ['Duke beats Alabama: 0.814096'],
 ['Florida beats Alabama: 0.615163'],
 ['Florida St beats Alabama: 0.685759'],
 ['Alabama beats Georgia St: 0.802545'],
 ['Gonzaga beats Alabama: 0.784289'],
 ['Houston beats Alabama: 0.607369'],
 ['Alabama beats Iona: 0.791874'],
 ['Kansas beats Alabama: 0.822081'],
 ['Kansas St beats Alabama: 0.686178'],
 ['Kentucky beats Alabama: 0.709947'],
 ['Alabama beats Lipscomb: 0.862446'],
 ['Alabama beats Long Island: 0.933204'],
 ['Alabama beats Loyol

In [8]:
Finalpredictions=pd.DataFrame(readable)
Finalpredictions.to_csv("Finalpredictions.csv", index=False)