March Madness Bracket Prediction Data Crunch
===================

## Overview
Use machine learning and statistical methods to predict NCAA Men's Basketball game outcomes and the championship winner based on season performance, tournament seed, and other stats.

## What's special about this method

Instead of analyzing the stats of each team separately, transform each feature into the **Difference** and **Quotient** between the **two teams in each of the 63 games**, then predict the win probability using logistic regression and other machine learning methods.
<br><br>
\begin{equation*}
\text{Difference} = \text{Team 1 Feature } N - \text{Team 2 Feature } N
\end{equation*}
<br>
\begin{equation*}
\text{Quotient} = \frac{\text{Team 1 Feature } N}{\text{Team 2 Feature } N}
\end{equation*}
<br>
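
As a minimal sketch of the transformation above, consider a toy table with a single feature per team. The values are made up for illustration and are not taken from the dataset; only the `team1_`/`team2_` and `d_`/`q_` naming pattern follows this notebook.

```python
import pandas as pd

# Toy matchups; the numbers are illustrative only.
games = pd.DataFrame({
    "team1_adjoe": [118.0, 105.0],
    "team2_adjoe": [100.0, 110.0],
})

# Difference: team 1's value minus team 2's value.
games["d_adjoe"] = games["team1_adjoe"] - games["team2_adjoe"]
# Quotient: team 1's value divided by team 2's value.
games["q_adjoe"] = games["team1_adjoe"] / games["team2_adjoe"]
```

A positive difference (or a quotient above 1) means team 1 has the larger value for that feature.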

## Pipeline
- Feature Engineering
- Data Preprocessing
- Feature Selection
- Model Comparison
- Prediction

## Result
The accuracy of the model is **74.89%** with a log loss of **0.53**.
<br>
***

## Data Preprocessing

In [1]:
%load_ext blackcellmagic

In [210]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.style.use("ggplot")

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    auc,
    classification_report,
    roc_auc_score,
    accuracy_score,
    f1_score,
    log_loss,
    roc_curve,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.preprocessing import StandardScaler
from math import sin, cos, sqrt, atan2, radians
import random
import statsmodels.api as sm

Load the tournament data from the 2002 through 2018 seasons.

In [211]:
ncaa_tour = pd.read_csv('testncaa.csv')

In [212]:
ncaa_tour.columns

Index(['team1_score', 'team2_score', 'team1_seed', 'team2_seed', 'season',
       'host_lat', 'host_long', 'team1_lat', 'team1_long', 'team2_lat',
       'team2_long', 'team1_pt_school_ncaa', 'team1_pt_overall_ncaa',
       'team1_pt_school_s16', 'team1_pt_overall_s16', 'team1_pt_school_ff',
       'team1_pt_overall_ff', 'team1_pt_career_school_wins',
       'team1_pt_career_school_losses', 'team1_pt_career_overall_wins',
       'team1_pt_career_overall_losses', 'team1_pt_team_season_wins',
       'team1_pt_team_season_losses', 'team1_pt_coach_season_wins',
       'team1_pt_coach_season_losses', 'team2_pt_school_ncaa',
       'team2_pt_overall_ncaa', 'team2_pt_school_s16', 'team2_pt_overall_s16',
       'team2_pt_school_ff', 'team2_pt_overall_ff',
       'team2_pt_career_school_wins', 'team2_pt_career_school_losses',
       'team2_pt_career_overall_wins', 'team2_pt_career_overall_losses',
       'team2_pt_team_season_wins', 'team2_pt_team_season_losses',
       'team2_pt_coach_season_w

In [213]:
pd.set_option('display.max_columns', None) #Show all the columns and their values
ncaa_tour.head()

Unnamed: 0,team1_score,team2_score,team1_seed,team2_seed,season,host_lat,host_long,team1_lat,team1_long,team2_lat,team2_long,team1_pt_school_ncaa,team1_pt_overall_ncaa,team1_pt_school_s16,team1_pt_overall_s16,team1_pt_school_ff,team1_pt_overall_ff,team1_pt_career_school_wins,team1_pt_career_school_losses,team1_pt_career_overall_wins,team1_pt_career_overall_losses,team1_pt_team_season_wins,team1_pt_team_season_losses,team1_pt_coach_season_wins,team1_pt_coach_season_losses,team2_pt_school_ncaa,team2_pt_overall_ncaa,team2_pt_school_s16,team2_pt_overall_s16,team2_pt_school_ff,team2_pt_overall_ff,team2_pt_career_school_wins,team2_pt_career_school_losses,team2_pt_career_overall_wins,team2_pt_career_overall_losses,team2_pt_team_season_wins,team2_pt_team_season_losses,team2_pt_coach_season_wins,team2_pt_coach_season_losses,team1_ap_final,team1_ap_preseason,team1_coaches_before_final,team1_coaches_preseason,team2_ap_final,team2_ap_preseason,team2_coaches_before_final,team2_coaches_preseason,team1_fg2pct,team1_fg3pct,team1_ftpct,team1_blockpct,team1_oppfg2pct,team1_oppfg3pct,team1_oppftpct,team1_oppblockpct,team1_f3grate,team1_oppf3grate,team1_arate,team1_opparate,team1_stlrate,team1_oppstlrate,team2_fg2pct,team2_fg3pct,team2_ftpct,team2_blockpct,team2_oppfg2pct,team2_oppfg3pct,team2_oppftpct,team2_oppblockpct,team2_f3grate,team2_oppf3grate,team2_arate,team2_opparate,team2_stlrate,team2_oppstlrate,team1_tempo,team1_adjtempo,team1_oe,team1_adjoe,team1_de,team1_adjde,team2_tempo,team2_adjtempo,team2_oe,team2_adjoe,team2_de,team2_adjde,game_id
0,81,77,16,16,2002,39.7594,-84.1917,42.718586,-73.75153,31.877216,-91.142854,0,0,0,0,0,0,16,18,16,18,16,18,16,18,6,6,0,0,0,0,317,210,317,210,20,9,20,9,,,,,,,,,45.6127,34.626,74.4032,11.3308,42.6616,33.6399,69.6471,10.5144,35.3229,36.657,54.5135,57.8947,0.0778,0.102,48.9138,37.5556,64.2105,9.434,47.6981,34.1071,70.6505,6.4471,23.9744,29.7082,54.3253,56.8651,0.108,0.1049,68.8425,67.7359,98.693,98.6102,96.3688,99.6885,76.0664,73.8504,103.4213,99.8665,99.637,106.45,2002-1373-1108
1,86,78,2,15,2002,35.6017,-77.3725,33.2144,-87.545766,26.372536,-80.102293,0,2,0,0,0,0,81,49,149,73,26,7,26,7,0,0,0,0,0,0,28,63,28,63,19,11,19,11,8.0,24.0,8.0,,,,,,50.6579,30.8677,73.5782,8.0468,45.2085,31.7841,68.2422,7.1637,33.945,32.7925,50.5495,46.988,0.1053,0.0881,45.0355,33.9114,66.7785,8.1871,48.538,34.3925,68.8172,9.9291,26.9051,30.8891,56.5968,56.3399,0.1282,0.1076,69.8636,69.9001,108.4361,111.4954,95.2313,93.877,71.2357,71.2446,100.2897,96.8669,98.4183,99.9263,2002-1104-1194
2,86,81,3,14,2002,35.1107,-106.61,32.232071,-110.950769,34.415462,-119.848071,17,22,8,10,4,5,467,144,656,239,22,9,22,9,0,0,0,0,0,0,62,52,62,52,20,10,20,10,7.0,,8.0,,,,,,50.0,37.6093,73.0769,7.2512,47.8576,34.7826,70.1258,8.4934,33.6275,31.9731,56.4706,54.8255,0.0835,0.0839,47.0414,40.8247,75.8117,8.7185,43.1723,32.0968,69.3679,6.7061,32.3549,39.4402,65.7778,58.5246,0.0868,0.0778,74.1462,72.8207,111.0077,117.3877,104.0411,96.9262,63.2345,64.7948,105.2163,105.4534,96.0965,97.6704,2002-1112-1364
3,84,37,1,16,2002,35.6017,-77.3725,36.00159,-78.94226,34.93851,-81.028663,17,17,12,12,9,9,562,167,635,226,29,3,29,3,3,3,0,0,0,0,79,41,79,41,19,11,19,11,1.0,1.0,1.0,,,,,,57.4329,36.2651,68.9845,7.2251,46.2916,30.2703,68.8693,7.4692,37.5736,26.1916,57.1821,49.5516,0.1297,0.0919,47.642,30.5603,65.5914,11.6301,45.6625,30.4104,68.8372,9.8171,36.1794,33.817,49.037,51.8692,0.1108,0.0987,77.0734,75.1685,116.3726,118.5999,90.2544,87.7504,68.2128,67.4639,100.2384,95.3632,94.6587,99.9754,2002-1181-1457
4,75,56,5,12,2002,38.5556,-121.4689,39.166383,-86.526904,40.762484,-111.846044,1,1,0,0,0,0,41,24,41,24,20,11,20,11,8,9,4,4,1,1,283,81,382,133,21,8,21,8,,22.0,,,,,,,48.3696,40.0,69.4561,12.5086,42.8866,35.5839,66.4013,10.0155,33.0213,27.359,58.4949,48.2295,0.1045,0.1075,50.306,40.4719,67.9443,7.3961,49.1388,31.8302,70.6406,7.3439,40.2778,27.6393,63.7224,51.5702,0.0922,0.0927,65.7442,66.5739,105.3974,109.6136,94.4987,89.9224,62.6181,63.3547,107.7123,108.6231,97.4318,95.6459,2002-1231-1428


In [214]:
ncaa_tour.describe()

Unnamed: 0,team1_score,team2_score,team1_seed,team2_seed,season,host_lat,host_long,team1_lat,team1_long,team2_lat,team2_long,team1_pt_school_ncaa,team1_pt_overall_ncaa,team1_pt_school_s16,team1_pt_overall_s16,team1_pt_school_ff,team1_pt_overall_ff,team1_pt_career_school_wins,team1_pt_career_school_losses,team1_pt_career_overall_wins,team1_pt_career_overall_losses,team1_pt_team_season_wins,team1_pt_team_season_losses,team1_pt_coach_season_wins,team1_pt_coach_season_losses,team2_pt_school_ncaa,team2_pt_overall_ncaa,team2_pt_school_s16,team2_pt_overall_s16,team2_pt_school_ff,team2_pt_overall_ff,team2_pt_career_school_wins,team2_pt_career_school_losses,team2_pt_career_overall_wins,team2_pt_career_overall_losses,team2_pt_team_season_wins,team2_pt_team_season_losses,team2_pt_coach_season_wins,team2_pt_coach_season_losses,team1_ap_final,team1_ap_preseason,team1_coaches_before_final,team1_coaches_preseason,team2_ap_final,team2_ap_preseason,team2_coaches_before_final,team2_coaches_preseason,team1_fg2pct,team1_fg3pct,team1_ftpct,team1_blockpct,team1_oppfg2pct,team1_oppfg3pct,team1_oppftpct,team1_oppblockpct,team1_f3grate,team1_oppf3grate,team1_arate,team1_opparate,team1_stlrate,team1_oppstlrate,team2_fg2pct,team2_fg3pct,team2_ftpct,team2_blockpct,team2_oppfg2pct,team2_oppfg3pct,team2_oppftpct,team2_oppblockpct,team2_f3grate,team2_oppf3grate,team2_arate,team2_opparate,team2_stlrate,team2_oppstlrate,team1_tempo,team1_adjtempo,team1_oe,team1_adjoe,team1_de,team1_adjde,team2_tempo,team2_adjtempo,team2_oe,team2_adjoe,team2_de,team2_adjde
count,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,759.0,651.0,764.0,613.0,403.0,342.0,383.0,324.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0,1112.0
mean,75.408273,64.0,5.151079,8.813849,2010.097122,38.069318,-92.201793,38.141361,-88.920553,37.94273,-90.0941,5.545863,8.681655,2.882194,4.178058,1.063849,1.466727,209.513489,84.36241,346.713129,153.535072,27.510791,7.635791,25.118705,7.123201,3.26259,5.383094,1.401079,2.116007,0.477518,0.67446,153.866906,78.747302,252.680755,135.42536,25.779676,8.891187,23.428058,8.593525,10.011858,9.840246,9.740838,9.652529,13.173697,12.581871,13.336815,12.490741,51.295981,36.426244,70.907142,11.124787,45.148329,32.799256,69.070052,8.365315,32.993649,33.060414,55.915376,51.716273,0.103637,0.088147,50.550596,35.988968,70.634399,10.239078,46.01354,33.229407,69.211347,8.530778,33.442078,33.282912,55.730459,52.134852,0.10137,0.090135,67.641909,67.319748,110.98746,112.914521,95.36306,93.965913,67.435028,67.154363,108.633711,109.629406,96.992843,96.413722
std,10.715982,10.417133,3.992863,4.626117,4.903526,4.719764,14.532305,4.158697,12.983528,4.353578,13.886183,6.696834,7.634619,4.304726,4.938209,2.162704,2.494056,196.170639,65.609371,237.159495,92.383396,34.184782,7.1937,3.854735,3.086821,5.015487,6.295799,3.098026,3.712197,1.508881,1.746838,148.078865,61.366468,199.142495,89.680159,38.31901,5.555694,3.536138,3.09645,6.747118,7.096195,6.871092,6.938689,7.150799,7.248664,7.087625,7.173863,2.873669,2.565706,3.576111,3.108575,2.658248,2.072762,2.258669,1.488797,5.456393,3.811987,4.843289,4.881479,0.017204,0.011957,2.807329,2.604955,3.568777,2.90014,2.774358,2.113396,2.36706,1.569232,5.339403,3.920087,5.231826,4.886377,0.018371,0.012677,3.586542,3.320256,4.845619,5.139765,4.272568,4.14059,3.412821,3.201833,4.987599,5.725011,4.453553,4.787527
min,47.0,29.0,1.0,1.0,2002.0,25.7877,-122.6819,21.292648,-157.816607,21.292648,-157.816607,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,12.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,43.6834,27.0096,60.4527,3.775,38.41983,27.3684,63.4615,4.3143,18.6339,20.8375,39.2182,38.5629,0.05969,0.053693,39.8305,25.7928,59.3443,1.7257,37.7622,26.979,61.9938,4.3143,18.6339,20.8375,39.2182,38.0495,0.0542,0.053693,57.0722,57.6154,92.0001,90.1584,83.446,81.1277,53.996,53.5435,90.7727,90.1584,83.446,81.1277
25%,68.0,57.0,2.0,5.0,2006.0,35.1174,-98.5,35.912165,-95.24587,35.115864,-96.581077,1.0,2.0,0.0,0.0,0.0,0.0,76.0,36.0,149.0,81.0,23.0,5.0,23.0,5.0,0.0,1.0,0.0,0.0,0.0,0.0,60.0,36.0,101.75,62.75,21.0,7.0,21.0,7.0,4.0,4.0,4.0,4.0,7.0,6.0,7.0,6.0,49.1748,34.7009,68.4555,8.952703,43.3597,31.32365,67.524375,7.192475,29.2943,30.56205,52.57755,48.582913,0.0914,0.08065,48.733275,34.1185,68.176364,8.197016,44.088425,31.708725,67.59985,7.383925,30.12275,30.6986,52.034525,48.9308,0.088476,0.0816,65.2376,65.091525,107.8079,109.654625,92.26415,90.8202,65.13925,65.005225,105.40265,106.14945,93.87875,93.218825
50%,75.0,64.0,4.0,9.0,2010.0,38.8951,-87.6847,38.957351,-84.510698,38.89865,-86.289848,3.0,7.0,1.0,2.0,0.0,0.0,145.0,67.5,304.0,143.5,26.0,7.0,25.5,7.0,1.0,3.0,0.0,0.0,0.0,0.0,110.0,64.0,199.5,121.0,24.0,9.0,24.0,8.0,9.0,8.0,8.0,8.0,13.0,12.0,13.0,12.0,51.114105,36.3636,70.7819,10.66895,45.10185,32.782369,68.9369,8.2966,33.0213,32.90775,55.7305,51.6171,0.1027,0.0872,50.54375,36.0377,70.6092,9.95745,45.8702,33.21585,69.2966,8.4408,33.227178,33.179264,55.69135,52.1172,0.10035,0.0899,67.519,67.18025,111.2415,113.0406,95.1165,93.8292,67.45745,67.1535,108.6206,109.8982,97.00815,96.04135
75%,82.0,71.0,8.0,13.0,2014.0,41.25,-81.6614,40.762484,-79.050969,40.927107,-79.9545,8.0,13.0,4.0,7.0,1.0,2.0,273.5,115.0,498.0,224.0,27.0,9.0,27.0,9.0,4.0,8.0,1.0,3.0,0.0,0.0,193.25,105.0,350.25,193.0,26.0,11.0,26.0,11.0,15.0,15.0,15.0,14.0,19.0,19.0,19.0,18.25,52.9597,38.193344,73.5549,13.041278,46.849824,34.098875,70.67505,9.34355,36.3679,35.8276,59.037328,54.7208,0.1133,0.0959,52.2515,37.807175,73.104525,12.07585,47.947925,34.7531,70.76575,9.62695,36.921472,35.86675,59.1417,55.2419,0.1128,0.0987,70.0699,69.402875,114.18315,116.5324,98.0531,96.5026,69.819175,69.30825,112.008425,113.4002,99.918475,99.47355
max,121.0,105.0,16.0,16.0,2018.0,47.6589,-71.0589,47.668144,-71.118177,47.92188,-71.088782,33.0,33.0,23.0,23.0,12.0,12.0,1024.0,370.0,1097.0,417.0,713.0,117.0,34.0,19.0,33.0,33.0,23.0,23.0,12.0,12.0,1024.0,390.0,1097.0,435.0,915.0,117.0,34.0,20.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,59.713518,43.2815,81.8011,20.4309,53.7572,38.8193,76.082863,13.0435,51.4065,43.75,70.2069,73.961219,0.1697,0.13,59.754,43.3566,81.8011,20.4309,56.7104,40.6383,76.082863,13.4305,53.2072,49.3409,72.1728,73.961219,0.1697,0.1335,79.0371,79.1217,127.4,125.6888,112.9,114.058,79.0371,79.1217,123.2,125.6888,112.9,114.058


In [215]:
pd.set_option('display.max_rows', None)  # Show all rows in dataframe
ncaa_tour.isnull().sum()  # Check for missing values

team1_score                         0
team2_score                         0
team1_seed                          0
team2_seed                          0
season                              0
host_lat                            0
host_long                           0
team1_lat                           0
team1_long                          0
team2_lat                           0
team2_long                          0
team1_pt_school_ncaa                0
team1_pt_overall_ncaa               0
team1_pt_school_s16                 0
team1_pt_overall_s16                0
team1_pt_school_ff                  0
team1_pt_overall_ff                 0
team1_pt_career_school_wins         0
team1_pt_career_school_losses       0
team1_pt_career_overall_wins        0
team1_pt_career_overall_losses      0
team1_pt_team_season_wins           0
team1_pt_team_season_losses         0
team1_pt_coach_season_wins          0
team1_pt_coach_season_losses        0
team2_pt_school_ncaa                0
team2_pt_ove

We can see that the following features have missing values:
- team1_ap_final          
- team1_ap_preseason                
- team1_coaches_before_final        
- team1_coaches_preseason      
- team2_ap_final                    
- team2_ap_preseason               
- team2_coaches_before_final       
- team2_coaches_preseason        

**These values are missing because the AP Poll rankings (preseason and final) and the Coaches Poll rankings (preseason and before-final) are only available for the top 25 teams.** <br>Therefore, we replace the NaN rankings with the sentinel value 45, which lies outside the 1 to 25 range of real rankings and is distinct from all other values.

In [216]:
ncaa_tour.fillna(45.0, inplace=True)

Each team has 10 variables that contain 0 values. Replace 0 with 0.01 so that these values are not treated as null in further statistical analysis and do not cause division by zero when the quotient features are computed later.

In [218]:
for i in list(ncaa_tour.columns[:-2]):
    ncaa_tour[i] = ncaa_tour[i].astype(float).replace(0, 0.01)
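
To illustrate why zeros are worth handling before ratios are taken, here is a small sketch on made-up values. The column names are borrowed from the dataset, but the numbers are not. In pandas, a float quotient with a zero denominator becomes `inf`, and 0/0 becomes `NaN`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "team1_pt_school_ff": [4.0, 0.0, 2.0],
    "team2_pt_school_ff": [0.0, 0.0, 1.0],
})

# Raw quotient: 4/0 gives inf, 0/0 gives NaN, 2/1 gives 2.0.
raw_q = toy["team1_pt_school_ff"] / toy["team2_pt_school_ff"]

# Same quotient after replacing 0 with 0.01, mirroring the cell above.
fixed = toy.replace(0, 0.01)
fixed_q = fixed["team1_pt_school_ff"] / fixed["team2_pt_school_ff"]
```

After the replacement every quotient is finite, although a small denominator such as 0.01 can still produce very large ratios.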

## Split the dataset

The dataset is set up as follows:
- rows are ordered by season
- the winning team is always assigned as team 1.

Therefore, we're going to:
- shuffle the rows into random order
- swap team 1 and team 2 in a random half of the rows, so that team 1 won in one half and team 2 won in the other half.

In [219]:
def shuffle(df):
    # Randomly permute the row order, then reset the index
    df = df.reindex(np.random.permutation(df.index)).copy()
    return df.reset_index(drop=True)


ncaa_tour = shuffle(ncaa_tour)

In [220]:
my_randoms = random.sample(range(len(ncaa_tour)), round(len(ncaa_tour)/2))
ncaa_tour_1 = ncaa_tour[ncaa_tour.index.isin(
    my_randoms)].reset_index(drop=True)
ncaa_tour_2 = ncaa_tour[~ncaa_tour.index.isin(
    my_randoms)].reset_index(drop=True)

In [221]:
ncaa_tour_1.shape

(556, 88)

In [222]:
ncaa_tour_2.shape

(556, 88)

In [223]:
#Renaming team 1 as team 2 and vice versa
ncaa_tour_2.columns = ['team2_score', 'team1_score', 'team2_seed', 'team1_seed', 'season',
       'host_lat', 'host_long', 'team2_lat', 'team2_long', 'team1_lat',
       'team1_long', 'team2_pt_school_ncaa', 'team2_pt_overall_ncaa',
       'team2_pt_school_s16', 'team2_pt_overall_s16', 'team2_pt_school_ff',
       'team2_pt_overall_ff', 'team2_pt_career_school_wins',
       'team2_pt_career_school_losses', 'team2_pt_career_overall_wins',
       'team2_pt_career_overall_losses', 'team2_pt_team_season_wins',
       'team2_pt_team_season_losses', 'team2_pt_coach_season_wins',
       'team2_pt_coach_season_losses', 'team1_pt_school_ncaa',
       'team1_pt_overall_ncaa', 'team1_pt_school_s16', 'team1_pt_overall_s16',
       'team1_pt_school_ff', 'team1_pt_overall_ff',
       'team1_pt_career_school_wins', 'team1_pt_career_school_losses',
       'team1_pt_career_overall_wins', 'team1_pt_career_overall_losses',
       'team1_pt_team_season_wins', 'team1_pt_team_season_losses',
       'team1_pt_coach_season_wins', 'team1_pt_coach_season_losses',
       'team2_ap_final', 'team2_ap_preseason', 'team2_coaches_before_final',
       'team2_coaches_preseason', 'team1_ap_final', 'team1_ap_preseason',
       'team1_coaches_before_final', 'team1_coaches_preseason', 'team2_fg2pct',
       'team2_fg3pct', 'team2_ftpct', 'team2_blockpct', 'team2_oppfg2pct',
       'team2_oppfg3pct', 'team2_oppftpct', 'team2_oppblockpct',
       'team2_f3grate', 'team2_oppf3grate', 'team2_arate', 'team2_opparate',
       'team2_stlrate', 'team2_oppstlrate', 'team1_fg2pct', 'team1_fg3pct',
       'team1_ftpct', 'team1_blockpct', 'team1_oppfg2pct', 'team1_oppfg3pct',
       'team1_oppftpct', 'team1_oppblockpct', 'team1_f3grate',
       'team1_oppf3grate', 'team1_arate', 'team1_opparate', 'team1_stlrate',
       'team1_oppstlrate', 'team2_tempo', 'team2_adjtempo', 'team2_oe',
       'team2_adjoe', 'team2_de', 'team2_adjde', 'team1_tempo',
       'team1_adjtempo', 'team1_oe', 'team1_adjoe', 'team1_de', 'team1_adjde',
       'game_id']

In [224]:
#Switching the order of renamed columns so that it matches with the original dataframe
ncaa_tour_2 = ncaa_tour_2[['team1_score', 'team2_score', 'team1_seed', 'team2_seed', 'season',
       'host_lat', 'host_long', 'team1_lat', 'team1_long', 'team2_lat',
       'team2_long', 'team1_pt_school_ncaa', 'team1_pt_overall_ncaa',
       'team1_pt_school_s16', 'team1_pt_overall_s16', 'team1_pt_school_ff',
       'team1_pt_overall_ff', 'team1_pt_career_school_wins',
       'team1_pt_career_school_losses', 'team1_pt_career_overall_wins',
       'team1_pt_career_overall_losses', 'team1_pt_team_season_wins',
       'team1_pt_team_season_losses', 'team1_pt_coach_season_wins',
       'team1_pt_coach_season_losses', 'team2_pt_school_ncaa',
       'team2_pt_overall_ncaa', 'team2_pt_school_s16', 'team2_pt_overall_s16',
       'team2_pt_school_ff', 'team2_pt_overall_ff',
       'team2_pt_career_school_wins', 'team2_pt_career_school_losses',
       'team2_pt_career_overall_wins', 'team2_pt_career_overall_losses',
       'team2_pt_team_season_wins', 'team2_pt_team_season_losses',
       'team2_pt_coach_season_wins', 'team2_pt_coach_season_losses',
       'team1_ap_final', 'team1_ap_preseason', 'team1_coaches_before_final',
       'team1_coaches_preseason', 'team2_ap_final', 'team2_ap_preseason',
       'team2_coaches_before_final', 'team2_coaches_preseason', 'team1_fg2pct',
       'team1_fg3pct', 'team1_ftpct', 'team1_blockpct', 'team1_oppfg2pct',
       'team1_oppfg3pct', 'team1_oppftpct', 'team1_oppblockpct',
       'team1_f3grate', 'team1_oppf3grate', 'team1_arate', 'team1_opparate',
       'team1_stlrate', 'team1_oppstlrate', 'team2_fg2pct', 'team2_fg3pct',
       'team2_ftpct', 'team2_blockpct', 'team2_oppfg2pct', 'team2_oppfg3pct',
       'team2_oppftpct', 'team2_oppblockpct', 'team2_f3grate',
       'team2_oppf3grate', 'team2_arate', 'team2_opparate', 'team2_stlrate',
       'team2_oppstlrate', 'team1_tempo', 'team1_adjtempo', 'team1_oe',
       'team1_adjoe', 'team1_de', 'team1_adjde', 'team2_tempo',
       'team2_adjtempo', 'team2_oe', 'team2_adjoe', 'team2_de', 'team2_adjde',
       'game_id']]

In [225]:
#Adding outcome feature
ncaa_tour_1['outcome'] = 1
ncaa_tour_2['outcome'] = 0

In [227]:
# Merging the two dataframes back together
ncaa_shuffle = pd.concat([ncaa_tour_1, ncaa_tour_2], ignore_index=True)
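
The two explicit column lists above can also be produced programmatically. The following is a sketch rather than the notebook's own code. It assumes every team-specific column starts with `team1_` or `team2_` and has a counterpart for the other team; the helper name `swap_teams` and the toy data are hypothetical.

```python
import pandas as pd


def swap_teams(df):
    """Return a copy of df with the team1_* and team2_* columns exchanged."""
    mapping = {}
    for col in df.columns:
        if col.startswith("team1_"):
            mapping[col] = "team2_" + col[len("team1_"):]
        elif col.startswith("team2_"):
            mapping[col] = "team1_" + col[len("team2_"):]
    # Rename, then restore the original column order.
    return df.rename(columns=mapping)[list(df.columns)]


toy = pd.DataFrame({
    "team1_score": [81, 86],
    "team2_score": [77, 78],
    "team1_seed": [16, 2],
    "team2_seed": [16, 15],
    "season": [2002, 2002],
})
swapped = swap_teams(toy)
```

Columns without a team prefix, such as `season`, pass through unchanged.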

Engineer a new feature: the distance between each team's location and the host's location.

In [230]:
# Calculate Distance between Host and Team's location
def distance(lat1, lon1, lat2, lon2):

    # approximate radius of earth in km
    R = 6373.0

    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    return distance

In [231]:
ncaa_shuffle['dist1'] = ncaa_shuffle.apply(lambda row: distance(
    row['host_lat'], row['host_long'], row['team1_lat'], row['team1_long']), axis=1)
ncaa_shuffle['dist2'] = ncaa_shuffle.apply(lambda row: distance(
    row['host_lat'], row['host_long'], row['team2_lat'], row['team2_long']), axis=1)
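
Row-wise `apply` is fine for about a thousand games. As an alternative sketch, the same haversine formula can be written with NumPy so that it operates on whole columns at once. The function name `haversine_km` is hypothetical; the Earth radius matches the value used above.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6373.0):
    """Great-circle distance in km; accepts scalars or arrays given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# Host and team 1 coordinates from the first row of the head() output above.
d = haversine_km(39.7594, -84.1917, 42.718586, -73.75153)

# Array inputs give one distance per pair of points.
many = haversine_km(np.array([0.0, 0.0]), np.array([0.0, 0.0]),
                    np.array([0.0, 0.0]), np.array([0.0, 90.0]))
```

Under these assumptions, passing the DataFrame columns directly (for example `ncaa_shuffle['host_lat']`) should give the same values as the `apply` version.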

In [232]:
ncaa_new = pd.DataFrame()

In [233]:
ncaa_new['d_team_seed'] = ncaa_shuffle['team1_seed'] - \
    ncaa_shuffle['team2_seed']
ncaa_new['q_team_seed'] = ncaa_shuffle['team1_seed'] / \
    ncaa_shuffle['team2_seed']

ncaa_new['d_dist'] = ncaa_shuffle['dist1'] - ncaa_shuffle['dist2']
ncaa_new['q_dist'] = ncaa_shuffle['dist1'] / ncaa_shuffle['dist2']
ncaa_new['d_pt_school_ncaa'] = ncaa_shuffle['team1_pt_school_ncaa'] - \
    ncaa_shuffle['team2_pt_school_ncaa']
ncaa_new['q_pt_school_ncaa'] = ncaa_shuffle['team1_pt_school_ncaa'] / \
    ncaa_shuffle['team2_pt_school_ncaa']
ncaa_new['d_pt_overall_ncaa'] = ncaa_shuffle['team1_pt_overall_ncaa'] - \
    ncaa_shuffle['team2_pt_overall_ncaa']
ncaa_new['q_pt_overall_ncaa'] = ncaa_shuffle['team1_pt_overall_ncaa'] / \
    ncaa_shuffle['team2_pt_overall_ncaa']
ncaa_new['d_pt_school_s16'] = ncaa_shuffle['team1_pt_school_s16'] - \
    ncaa_shuffle['team2_pt_school_s16']
ncaa_new['q_pt_school_s16'] = ncaa_shuffle['team1_pt_school_s16'] / \
    ncaa_shuffle['team2_pt_school_s16']
ncaa_new['d_pt_overall_s16'] = ncaa_shuffle['team1_pt_overall_s16'] - \
    ncaa_shuffle['team2_pt_overall_s16']
ncaa_new['q_pt_overall_s16'] = ncaa_shuffle['team1_pt_overall_s16'] / \
    ncaa_shuffle['team2_pt_overall_s16']
ncaa_new['d_pt_school_ff'] = ncaa_shuffle['team1_pt_school_ff'] - \
    ncaa_shuffle['team2_pt_school_ff']
ncaa_new['q_pt_school_ff'] = ncaa_shuffle['team1_pt_school_ff'] / \
    ncaa_shuffle['team2_pt_school_ff']
ncaa_new['d_pt_overall_ff'] = ncaa_shuffle['team1_pt_overall_ff'] - \
    ncaa_shuffle['team2_pt_overall_ff']
ncaa_new['q_pt_overall_ff'] = ncaa_shuffle['team1_pt_overall_ff'] / \
    ncaa_shuffle['team2_pt_overall_ff']
ncaa_new['d_pt_career_school_wins'] = ncaa_shuffle['team1_pt_career_school_wins'] - \
    ncaa_shuffle['team2_pt_career_school_wins']
ncaa_new['q_pt_career_school_wins'] = ncaa_shuffle['team1_pt_career_school_wins'] / \
    ncaa_shuffle['team2_pt_career_school_wins']
ncaa_new['d_pt_career_school_losses'] = ncaa_shuffle['team1_pt_career_school_losses'] - \
    ncaa_shuffle['team2_pt_career_school_losses']
ncaa_new['q_pt_career_school_losses'] = ncaa_shuffle['team1_pt_career_school_losses'] / \
    ncaa_shuffle['team2_pt_career_school_losses']
ncaa_new['d_pt_career_overall_wins'] = ncaa_shuffle['team1_pt_career_overall_wins'] - \
    ncaa_shuffle['team2_pt_career_overall_wins']
ncaa_new['q_pt_career_overall_wins'] = ncaa_shuffle['team1_pt_career_overall_wins'] / \
    ncaa_shuffle['team2_pt_career_overall_wins']
ncaa_new['d_pt_career_overall_losses'] = ncaa_shuffle['team1_pt_career_overall_losses'] - \
    ncaa_shuffle['team2_pt_career_overall_losses']
ncaa_new['q_pt_career_overall_losses'] = ncaa_shuffle['team1_pt_career_overall_losses'] / \
    ncaa_shuffle['team2_pt_career_overall_losses']
ncaa_new['d_pt_team_season_wins'] = ncaa_shuffle['team1_pt_team_season_wins'] - \
    ncaa_shuffle['team2_pt_team_season_wins']
ncaa_new['q_pt_team_season_wins'] = ncaa_shuffle['team1_pt_team_season_wins'] / \
    ncaa_shuffle['team2_pt_team_season_wins']
ncaa_new['d_pt_team_season_losses'] = ncaa_shuffle['team1_pt_team_season_losses'] - \
    ncaa_shuffle['team2_pt_team_season_losses']
ncaa_new['q_pt_team_season_losses'] = ncaa_shuffle['team1_pt_team_season_losses'] / \
    ncaa_shuffle['team2_pt_team_season_losses']
ncaa_new['d_pt_coach_season_wins'] = ncaa_shuffle['team1_pt_coach_season_wins'] - \
    ncaa_shuffle['team2_pt_coach_season_wins']
ncaa_new['q_pt_coach_season_wins'] = ncaa_shuffle['team1_pt_coach_season_wins'] / \
    ncaa_shuffle['team2_pt_coach_season_wins']
ncaa_new['d_pt_coach_season_losses'] = ncaa_shuffle['team1_pt_coach_season_losses'] - \
    ncaa_shuffle['team2_pt_coach_season_losses']
ncaa_new['q_pt_coach_season_losses'] = ncaa_shuffle['team1_pt_coach_season_losses'] / \
    ncaa_shuffle['team2_pt_coach_season_losses']

In [234]:
ncaa_new['d_ap_final'] = ncaa_shuffle['team1_ap_final'] - \
    ncaa_shuffle['team2_ap_final']
ncaa_new['q_ap_final'] = ncaa_shuffle['team1_ap_final'] / \
    ncaa_shuffle['team2_ap_final']
ncaa_new['d_ap_preseason'] = ncaa_shuffle['team1_ap_preseason'] - \
    ncaa_shuffle['team2_ap_preseason']
ncaa_new['q_ap_preseason'] = ncaa_shuffle['team1_ap_preseason'] / \
    ncaa_shuffle['team2_ap_preseason']
ncaa_new['d_coaches_before_final'] = ncaa_shuffle['team1_coaches_before_final'] - \
    ncaa_shuffle['team2_coaches_before_final']
ncaa_new['q_coaches_before_final'] = ncaa_shuffle['team1_coaches_before_final'] / \
    ncaa_shuffle['team2_coaches_before_final']
ncaa_new['d_coaches_preseason'] = ncaa_shuffle['team1_coaches_preseason'] - \
    ncaa_shuffle['team2_coaches_preseason']
ncaa_new['q_coaches_preseason'] = ncaa_shuffle['team1_coaches_preseason'] / \
    ncaa_shuffle['team2_coaches_preseason']
ncaa_new['d_fg2pct'] = ncaa_shuffle['team1_fg2pct'] - \
    ncaa_shuffle['team2_fg2pct']
ncaa_new['q_fg2pct'] = ncaa_shuffle['team1_fg2pct'] / \
    ncaa_shuffle['team2_fg2pct']
ncaa_new['d_fg3pct'] = ncaa_shuffle['team1_fg3pct'] - \
    ncaa_shuffle['team2_fg3pct']
ncaa_new['q_fg3pct'] = ncaa_shuffle['team1_fg3pct'] / \
    ncaa_shuffle['team2_fg3pct']
ncaa_new['d_ftpct'] = ncaa_shuffle['team1_ftpct'] - ncaa_shuffle['team2_ftpct']
ncaa_new['q_ftpct'] = ncaa_shuffle['team1_ftpct'] / ncaa_shuffle['team2_ftpct']
ncaa_new['d_blockpct'] = ncaa_shuffle['team1_blockpct'] - \
    ncaa_shuffle['team2_blockpct']
ncaa_new['q_blockpct'] = ncaa_shuffle['team1_blockpct'] / \
    ncaa_shuffle['team2_blockpct']
ncaa_new['d_oppfg2pct'] = ncaa_shuffle['team1_oppfg2pct'] - \
    ncaa_shuffle['team2_oppfg2pct']
ncaa_new['q_oppfg2pct'] = ncaa_shuffle['team1_oppfg2pct'] / \
    ncaa_shuffle['team2_oppfg2pct']
ncaa_new['d_oppfg3pct'] = ncaa_shuffle['team1_oppfg3pct'] - \
    ncaa_shuffle['team2_oppfg3pct']
ncaa_new['q_oppfg3pct'] = ncaa_shuffle['team1_oppfg3pct'] / \
    ncaa_shuffle['team2_oppfg3pct']
ncaa_new['d_oppftpct'] = ncaa_shuffle['team1_oppftpct'] - \
    ncaa_shuffle['team2_oppftpct']
ncaa_new['q_oppftpct'] = ncaa_shuffle['team1_oppftpct'] / \
    ncaa_shuffle['team2_oppftpct']
ncaa_new['d_oppblockpct'] = ncaa_shuffle['team1_oppblockpct'] - \
    ncaa_shuffle['team2_oppblockpct']
ncaa_new['q_oppblockpct'] = ncaa_shuffle['team1_oppblockpct'] / \
    ncaa_shuffle['team2_oppblockpct']
ncaa_new['d_f3grate'] = ncaa_shuffle['team1_f3grate'] - \
    ncaa_shuffle['team2_f3grate']
ncaa_new['q_f3grate'] = ncaa_shuffle['team1_f3grate'] / \
    ncaa_shuffle['team2_f3grate']
ncaa_new['d_oppf3grate'] = ncaa_shuffle['team1_oppf3grate'] - \
    ncaa_shuffle['team2_oppf3grate']
ncaa_new['q_oppf3grate'] = ncaa_shuffle['team1_oppf3grate'] / \
    ncaa_shuffle['team2_oppf3grate']
ncaa_new['d_arate'] = ncaa_shuffle['team1_arate'] - ncaa_shuffle['team2_arate']
ncaa_new['q_arate'] = ncaa_shuffle['team1_arate'] / ncaa_shuffle['team2_arate']
ncaa_new['d_opparate'] = ncaa_shuffle['team1_opparate'] - \
    ncaa_shuffle['team2_opparate']
ncaa_new['q_opparate'] = ncaa_shuffle['team1_opparate'] / \
    ncaa_shuffle['team2_opparate']
ncaa_new['d_stlrate'] = ncaa_shuffle['team1_stlrate'] - \
    ncaa_shuffle['team2_stlrate']
ncaa_new['q_stlrate'] = ncaa_shuffle['team1_stlrate'] / \
    ncaa_shuffle['team2_stlrate']
ncaa_new['d_oppstlrate'] = ncaa_shuffle['team1_oppstlrate'] - \
    ncaa_shuffle['team2_oppstlrate']
ncaa_new['q_oppstlrate'] = ncaa_shuffle['team1_oppstlrate'] / \
    ncaa_shuffle['team2_oppstlrate']
ncaa_new['d_tempo'] = ncaa_shuffle['team1_tempo'] - ncaa_shuffle['team2_tempo']
ncaa_new['q_tempo'] = ncaa_shuffle['team1_tempo'] / ncaa_shuffle['team2_tempo']
ncaa_new['d_adjtempo'] = ncaa_shuffle['team1_adjtempo'] - \
    ncaa_shuffle['team2_adjtempo']
ncaa_new['q_adjtempo'] = ncaa_shuffle['team1_adjtempo'] / \
    ncaa_shuffle['team2_adjtempo']
ncaa_new['d_oe'] = ncaa_shuffle['team1_oe'] - ncaa_shuffle['team2_oe']
ncaa_new['q_oe'] = ncaa_shuffle['team1_oe'] / ncaa_shuffle['team2_oe']
ncaa_new['d_adjoe'] = ncaa_shuffle['team1_adjoe'] - ncaa_shuffle['team2_adjoe']
ncaa_new['q_adjoe'] = ncaa_shuffle['team1_adjoe'] / ncaa_shuffle['team2_adjoe']
ncaa_new['d_de'] = ncaa_shuffle['team1_de'] - ncaa_shuffle['team2_de']
ncaa_new['q_de'] = ncaa_shuffle['team1_de'] / ncaa_shuffle['team2_de']
ncaa_new['d_adjde'] = ncaa_shuffle['team1_adjde'] - ncaa_shuffle['team2_adjde']
ncaa_new['q_adjde'] = ncaa_shuffle['team1_adjde'] / ncaa_shuffle['team2_adjde']
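
The cells above write out every difference and quotient by hand. A more compact sketch is shown below, assuming each feature appears as a `team1_<name>` / `team2_<name>` pair. The helper `pairwise_features` and the toy values are hypothetical.

```python
import pandas as pd


def pairwise_features(df, names):
    """Build d_<name> (difference) and q_<name> (quotient) columns for each name."""
    out = pd.DataFrame(index=df.index)
    for name in names:
        t1 = df[f"team1_{name}"]
        t2 = df[f"team2_{name}"]
        out[f"d_{name}"] = t1 - t2
        out[f"q_{name}"] = t1 / t2
    return out


toy = pd.DataFrame({
    "team1_fg2pct": [50.0, 48.0],
    "team2_fg2pct": [40.0, 48.0],
    "team1_tempo": [70.0, 66.0],
    "team2_tempo": [68.0, 60.0],
})
features = pairwise_features(toy, ["fg2pct", "tempo"])
```

Two features in this notebook do not follow the pattern exactly and would need separate handling: `d_team_seed` is built from `team1_seed`/`team2_seed`, and `d_dist` from `dist1`/`dist2`.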

In [235]:
ncaa_new['game_id'] = ncaa_shuffle['game_id']
ncaa_new['season'] = ncaa_shuffle['season']
ncaa_new['outcome'] = ncaa_shuffle['outcome']

In [236]:
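# game_id is a string column, so restrict the correlation matrix to numeric columns
corr = ncaa_new.select_dtypes(include="number").corr()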
# corr.style.background_gradient(cmap='coolwarm')

In [380]:
corr.to_csv('corr2.csv')

In [238]:
# Delete variables that have a correlation higher than 0.9 with another variable
ncaa_new = ncaa_new.drop(columns=[
    'd_ap_final', 'd_pt_school_s16', 'd_pt_career_school_wins',
    'd_pt_overall_s16', 'd_pt_career_overall_wins', 'd_pt_overall_ff',
    'q_pt_coach_season_wins', 'd_coaches_preseason', 'q_coaches_preseason',
    'q_fg2pct', 'q_fg3pct', 'q_ftpct', 'q_blockpct', 'q_oppfg3pct',
    'q_oppfg2pct', 'q_oppftpct', 'q_oppblockpct', 'q_f3grate',
    'q_oppf3grate', 'q_arate', 'q_opparate', 'q_stlrate', 'q_oppstlrate',
    'd_tempo', 'q_tempo', 'q_adjtempo', 'q_oe', 'q_adjoe', 'q_de', 'q_adjde',
])
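
The list above was presumably compiled by inspecting the correlation matrix. One way such a list could be derived automatically is sketched below on synthetic data. The 0.9 threshold follows the comment above; the helper name `highly_correlated`, the use of absolute correlation, and the rule of dropping the later column of each pair are assumptions, so the result need not match the hand-picked list.

```python
import numpy as np
import pandas as pd


def highly_correlated(df, threshold=0.9):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + 0.01 * rng.normal(size=200),  # nearly a linear copy of "a"
    "c": rng.normal(size=200),                 # independent noise
})
to_drop = highly_correlated(toy)
pruned = toy.drop(columns=to_drop)
```

Here `b` is flagged because it is almost perfectly correlated with `a`, while `a` and `c` are kept.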

In [239]:
ncaa_new.to_csv('correlated_removed.csv')

In [240]:
ncaa_new.columns

Index(['d_team_seed', 'q_team_seed', 'season', 'd_dist', 'q_dist',
       'd_pt_school_ncaa', 'q_pt_school_ncaa', 'd_pt_overall_ncaa',
       'q_pt_overall_ncaa', 'q_pt_school_s16', 'q_pt_overall_s16',
       'd_pt_school_ff', 'q_pt_school_ff', 'q_pt_overall_ff',
       'q_pt_career_school_wins', 'd_pt_career_school_losses',
       'q_pt_career_school_losses', 'q_pt_career_overall_wins',
       'd_pt_career_overall_losses', 'q_pt_career_overall_losses',
       'd_pt_team_season_wins', 'q_pt_team_season_wins',
       'd_pt_team_season_losses', 'q_pt_team_season_losses',
       'd_pt_coach_season_wins', 'd_pt_coach_season_losses',
       'q_pt_coach_season_losses', 'q_ap_final', 'd_ap_preseason',
       'q_ap_preseason', 'd_coaches_before_final', 'q_coaches_before_final',
       'd_fg2pct', 'd_fg3pct', 'd_ftpct', 'd_blockpct', 'd_oppfg2pct',
       'd_oppfg3pct', 'd_oppftpct', 'd_oppblockpct', 'd_f3grate',
       'd_oppf3grate', 'd_arate', 'd_opparate', 'd_stlrate', 'd_oppstlrate',
     

In [241]:
rf_train1 = ncaa_new[['d_team_seed', 'q_team_seed', 'd_dist', 'q_dist','d_pt_school_ncaa',
       'q_pt_school_ncaa', 'd_pt_overall_ncaa', 'q_pt_overall_ncaa',
       'q_pt_school_s16', 'q_pt_overall_s16', 'd_pt_school_ff',
       'q_pt_school_ff', 'q_pt_overall_ff', 'q_pt_career_school_wins',
       'd_pt_career_school_losses', 'q_pt_career_school_losses',
       'q_pt_career_overall_wins', 'd_pt_career_overall_losses',
       'q_pt_career_overall_losses', 'd_pt_team_season_wins',
       'q_pt_team_season_wins', 'd_pt_team_season_losses',
       'q_pt_team_season_losses', 'd_pt_coach_season_wins',
       'd_pt_coach_season_losses', 'q_pt_coach_season_losses', 'q_ap_final',
       'd_ap_preseason', 'q_ap_preseason', 'd_coaches_before_final',
       'q_coaches_before_final', 'd_fg2pct', 'd_fg3pct', 'd_ftpct',
       'd_blockpct', 'd_oppfg2pct', 'd_oppfg3pct', 'd_oppftpct',
       'd_oppblockpct', 'd_f3grate', 'd_oppf3grate', 'd_arate', 'd_opparate',
       'd_stlrate', 'd_oppstlrate', 'd_adjtempo', 'd_oe', 'd_adjoe', 'd_de',
       'd_adjde']]

In [243]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

In [244]:
sel = SelectFromModel(RandomForestClassifier(n_estimators=1000))

In [245]:
sel.fit(rf_train1, ncaa_new['outcome'])

SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
        norm_order=1, prefit=False, threshold=None)

In [246]:
selected_feat = rf_train1.columns[(sel.get_support())]
len(selected_feat)

11
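
`SelectFromModel` was fit with `threshold=None`, which for a random forest means a feature is kept when its importance is at least the mean importance. A small sketch on synthetic data (not the tournament features) showing how the threshold and importances can be inspected. The data set and the fixed seed are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for rf_train1 and the outcome column.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

sel_demo = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
sel_demo.fit(X, y)

importances = sel_demo.estimator_.feature_importances_
threshold = sel_demo.threshold_  # mean importance when threshold=None
kept = sel_demo.get_support()

print("threshold:", threshold)
print("kept feature indices:", np.flatnonzero(kept))
```

Because the forest in the notebook is built with `random_state=None`, the selected set can change from run to run. Passing a fixed `random_state` makes the selection reproducible.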

In [247]:
print(selected_feat)

Index(['d_team_seed', 'q_team_seed', 'q_ap_final', 'd_ap_preseason',
       'q_ap_preseason', 'd_coaches_before_final', 'q_coaches_before_final',
       'd_oppfg3pct', 'd_oe', 'd_adjoe', 'd_adjde'],
      dtype='object')


In [390]:
# Take an explicit copy so the assignments below modify a new frame
# rather than a slice of ncaa_new (avoids SettingWithCopyWarning).
ncaa_selected = ncaa_new[selected_feat].copy()
ncaa_selected['game_id'] = ncaa_new['game_id']
ncaa_selected['season'] = ncaa_new['season']
ncaa_selected['outcome'] = ncaa_new['outcome']


In [250]:
ncaa_selected_train = ncaa_selected[ncaa_selected['season'] != 2018].reset_index(
    drop=True)
ncaa_selected_testing = ncaa_selected[ncaa_selected['season'] == 2018].reset_index(
    drop=True)
input_train = ncaa_selected_train[selected_feat]
input_test = ncaa_selected_testing[selected_feat]
input_all = ncaa_selected[selected_feat]
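
The split above holds out a single season (2018), so the scores that follow rest on one tournament. One way to get a broader picture, sketched here under the assumption that each season is treated as a group, is leave-one-season-out cross-validation with `LeaveOneGroupOut`. The data below is synthetic. With the notebook's data, `input_all`, `ncaa_selected['outcome']`, and `ncaa_selected['season']` would take the place of `X`, `y`, and `seasons`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

X, y = make_classification(n_samples=400, n_features=11, random_state=0)
# Hypothetical season labels: 8 seasons of 50 games each.
seasons = np.repeat(np.arange(2011, 2019), 50)

scores = cross_val_score(
    LogisticRegression(solver="liblinear", random_state=0),
    X,
    y,
    groups=seasons,
    cv=LeaveOneGroupOut(),
    scoring="neg_log_loss",
)
per_season_log_loss = -scores  # one value per held-out season
print(per_season_log_loss.mean())
```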

In [251]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

In [252]:
log_model = LogisticRegression(
    random_state=0, solver='liblinear', multi_class='ovr')

In [253]:
log_model.fit(input_train, ncaa_selected_train['outcome'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [254]:
ncaa_selected_testing['log_predict'] = log_model.predict(input_test)

In [255]:
accuracy_score(ncaa_selected_testing['outcome'], log_model.predict(input_test))

0.746268656716418

In [256]:
log_loss(ncaa_selected_testing['outcome'], log_model.predict_proba(input_test))

0.5707599419013116
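
For reference, log loss is the mean negative log-likelihood of the true outcomes under the predicted probabilities. A short sketch with made-up labels and probabilities (not values from the notebook) that computes it by hand and compares it with `sklearn.metrics.log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up outcomes and predicted P(outcome = 1).
y_true = np.array([1, 0, 1, 1, 0])
p_win = np.array([0.9, 0.2, 0.6, 0.4, 0.5])

manual = -np.mean(y_true * np.log(p_win) + (1 - y_true) * np.log(1 - p_win))
from_sklearn = log_loss(y_true, p_win)
print(manual, from_sklearn)
```

A model that always predicts 0.5 scores -ln(0.5) ≈ 0.693, which is a useful baseline when reading the log loss values in this notebook.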

In [257]:
from sklearn.svm import SVC

In [258]:
svm_model = SVC(probability=True)
svm_model.fit(input_train, ncaa_selected_train['outcome'])

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [259]:
ncaa_selected_testing['predict'] = svm_model.predict(input_test)

In [260]:
accuracy_score(ncaa_selected_testing['outcome'], svm_model.predict(input_test))

0.5074626865671642

In [261]:
log_loss(ncaa_selected_testing['outcome'],svm_model.predict_proba(input_test))

0.7137041660428382
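
The SVM's accuracy of about 0.51 is close to chance, and its log loss is above the 0.693 of a constant 0.5 prediction. One possible contributor, not verified here, is that the RBF kernel is sensitive to feature scale and the inputs do not appear to be standardized (`StandardScaler` is imported at the top of the notebook but not used in this section). A minimal sketch of scaling inside a pipeline, on synthetic data with deliberately mismatched scales:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=11, random_state=0)
# Inflate a few columns to mimic features on very different scales.
X[:, :3] *= 1000.0

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# The scaler is fit on the training rows only, then applied to the test rows.
svm_scaled = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
svm_scaled.fit(X_train, y_train)

proba = svm_scaled.predict_proba(X_test)
print(svm_scaled.score(X_test, y_test))
```

Whether this helps on the tournament features would need to be checked against the 2018 hold-out.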

In [262]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [263]:
rf_model = RandomForestClassifier(
    n_estimators=100, max_depth=2, random_state=0)

In [264]:
rf_model.fit(input_train, ncaa_selected_train['outcome'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [265]:
ncaa_selected_testing['predict'] = rf_model.predict(input_test)

In [266]:
accuracy_score(ncaa_selected_testing['outcome'], rf_model.predict(input_test))

0.7910447761194029

In [267]:
log_loss(ncaa_selected_testing['outcome'], rf_model.predict_proba(input_test))

0.529188293292902
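
`GridSearchCV` is imported above, but the forest is fit with fixed settings. A hedged sketch of how depth and tree count could be tuned on log loss. The grid values and the synthetic data are assumptions for illustration, not settings from the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=11, random_state=0)

# Small illustrative grid; real searches would likely cover more values.
param_grid = {"max_depth": [2, 4], "n_estimators": [50, 100]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="neg_log_loss",
    cv=3,
)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```

With the notebook's data, the search would be run on `input_train` only, so that the 2018 games stay untouched for the final evaluation.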

In [268]:
test_2019 = pd.read_csv('NCAA_Tourney_2019.csv')

In [269]:
test_2019_selected = pd.DataFrame()

In [270]:
test_2019.columns

Index(['game_id', 'team1_id', 'team2_id', 'season', 'team1_seed', 'team2_seed',
       'strongseed', 'weakseed', 'host_lat', 'host_long',
       'team1_pt_school_ncaa', 'team1_pt_overall_ncaa', 'team1_pt_school_s16',
       'team1_pt_overall_s16', 'team1_pt_school_ff', 'team1_pt_overall_ff',
       'team1_pt_career_school_wins', 'team1_pt_career_school_losses',
       'team1_pt_career_overall_losses', 'team1_pt_team_season_wins',
       'team1_pt_team_season_losses', 'team1_pt_coach_season_wins',
       'team1_pt_coach_season_losses', 'team1_pt_career_overall_wins',
       'team2_pt_school_ncaa', 'team2_pt_overall_ncaa', 'team2_pt_school_s16',
       'team2_pt_overall_s16', 'team2_pt_school_ff', 'team2_pt_overall_ff',
       'team2_pt_career_school_wins', 'team2_pt_career_school_losses',
       'team2_pt_career_overall_losses', 'team2_pt_team_season_wins',
       'team2_pt_team_season_losses', 'team2_pt_coach_season_wins',
       'team2_pt_coach_season_losses', 'team2_pt_career_overall

In [271]:
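# Fill missing values with 45.0 and replace exact zeros with 0.01, which keeps
# the quotient features below from dividing by zero. game_id (column 0) is skipped.
for col in test_2019.columns[1:]:
    test_2019[col] = test_2019[col].fillna(45.0).astype(float).replace(0, 0.01)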

'd_team_seed', 'q_team_seed', 'q_pt_overall_s16', 'q_ap_final',
       'd_ap_preseason', 'q_ap_preseason', 'd_coaches_before_final',
       'q_coaches_before_final', 'd_oppfg3pct', 'd_oe', 'd_adjoe', 'd_adjde'

In [272]:
test_2019_selected['d_team_seed'] = test_2019['team1_seed'] - \
    test_2019['team2_seed']
test_2019_selected['q_team_seed'] = test_2019['team1_seed'] / \
    test_2019['team2_seed']
test_2019_selected['q_ap_final'] = test_2019['team1_ap_final'] / \
    test_2019['team2_ap_final']
test_2019_selected['d_ap_preseason'] = test_2019['team1_ap_preseason'] - \
    test_2019['team2_ap_preseason']
test_2019_selected['q_ap_preseason'] = test_2019['team1_ap_preseason'] / \
    test_2019['team2_ap_preseason']
test_2019_selected['d_coaches_before_final'] = test_2019['team1_coaches_before_final'] - \
    test_2019['team2_coaches_before_final']
test_2019_selected['q_coaches_before_final'] = test_2019['team1_coaches_before_final'] / \
    test_2019['team2_coaches_before_final']
test_2019_selected['d_oppfg3pct'] = test_2019['team1_oppfg3pct'] - \
    test_2019['team2_oppfg3pct']
test_2019_selected['d_adjoe'] = test_2019['team1_adjoe'] - \
    test_2019['team2_adjoe']
test_2019_selected['d_oe'] = test_2019['team1_oe'] - test_2019['team2_oe']
test_2019_selected['d_adjde'] = test_2019['team1_adjde'] - \
    test_2019['team2_adjde']
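
The assignments above repeat one pattern per feature and fix the column order by hand. A sketch of a helper that derives each column from its `d_`/`q_` prefix and returns the columns in a requested order (for example, the order of `selected_feat`). `build_matchup_features` and its `aliases` argument are hypothetical names, and the toy frame stands in for `test_2019`.

```python
import pandas as pd


def build_matchup_features(games, feature_names, aliases=None):
    """Build d_<stat> (difference) and q_<stat> (quotient) columns in the given order."""
    aliases = aliases or {}
    out = pd.DataFrame(index=games.index)
    for name in feature_names:
        kind, stat = name.split("_", 1)
        # Map feature stems to raw column stems where they differ,
        # e.g. 'team_seed' -> 'seed' for team1_seed / team2_seed.
        stat = aliases.get(stat, stat)
        team1, team2 = games["team1_" + stat], games["team2_" + stat]
        if kind == "d":
            out[name] = team1 - team2
        elif kind == "q":
            out[name] = team1 / team2
        else:
            raise ValueError("unknown feature prefix: " + name)
    return out


toy_games = pd.DataFrame({
    "team1_seed": [1.0, 8.0],
    "team2_seed": [16.0, 9.0],
    "team1_oe": [110.0, 102.0],
    "team2_oe": [100.0, 104.0],
})
features = build_matchup_features(
    toy_games, ["d_team_seed", "d_oe", "q_oe"], aliases={"team_seed": "seed"}
)
print(features)
```

Building the frame from the training feature list keeps the columns in the same order the model saw during fitting.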

In [273]:
test_2019_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2278 entries, 0 to 2277
Data columns (total 11 columns):
d_team_seed               2278 non-null float64
q_team_seed               2278 non-null float64
q_ap_final                2278 non-null float64
d_ap_preseason            2278 non-null float64
q_ap_preseason            2278 non-null float64
d_coaches_before_final    2278 non-null float64
q_coaches_before_final    2278 non-null float64
d_oppfg3pct               2278 non-null float64
d_adjoe                   2278 non-null float64
d_oe                      2278 non-null float64
d_adjde                   2278 non-null float64
dtypes: float64(11)
memory usage: 195.8 KB


In [274]:
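# Index by selected_feat so the columns are in the order the model was trained on
# (test_2019_selected was built with d_adjoe before d_oe).
test_2019_selected['prediction'] = log_model.predict(
    test_2019_selected[selected_feat])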

In [275]:
test_2019_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2278 entries, 0 to 2277
Data columns (total 12 columns):
d_team_seed               2278 non-null float64
q_team_seed               2278 non-null float64
q_ap_final                2278 non-null float64
d_ap_preseason            2278 non-null float64
q_ap_preseason            2278 non-null float64
d_coaches_before_final    2278 non-null float64
q_coaches_before_final    2278 non-null float64
d_oppfg3pct               2278 non-null float64
d_adjoe                   2278 non-null float64
d_oe                      2278 non-null float64
d_adjde                   2278 non-null float64
prediction                2278 non-null int64
dtypes: float64(11), int64(1)
memory usage: 213.6 KB


In [276]:
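# Index by selected_feat to match the training column order; column 1 of
# predict_proba is the probability of the second class in log_model.classes_.
test_2019_selected['probability'] = log_model.predict_proba(
    test_2019_selected[selected_feat])[:, 1]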

In [277]:
test_2019_selected['game_id'] = test_2019['game_id']

In [278]:
test_2019_selected.to_csv('2019_prediction.csv')