# Naive Bayes Classification
1. Pull data with Google's BigQuery
    1. Save a local copy of the data
1. Clean data
    1. Eliminate unnecessary columns
    1. Separate pre and post season data
    1. Compute stats of regular season for each team each season
    1. Make sure win percentage is included (is this needed?)
    1. Find the difference between two teams in a season <-- This will be what we train on
    1. Extract just the match ups and outcomes of post season
1. Apply Gaussian Naive-Bayes
    1. Split post season data into train and test
    1. Find precision, recall and F1 scores for attributes

In [354]:
import pandas as pd
# !pip install google-cloud-bigquery
from google.cloud import bigquery
from google.oauth2 import service_account
# !pip install db_dtypes 
import db_dtypes 
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_recall_fscore_support

## Pull data with Google's BigQuery

Initialize the BigQuery client

In [2]:
project_id = 'dataanalyticscs5850'
credentials = service_account.Credentials.from_service_account_file('serviceUserKey.json')
client = bigquery.Client(credentials= credentials, project= project_id)

Pull the dataset and store a reference

In [3]:
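# client.dataset() is deprecated in recent google-cloud-bigquery releases; build the reference directly
dataset_ref = bigquery.DatasetReference('bigquery-public-data', 'ncaa_basketball')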
ncaa_dataset = client.get_dataset(dataset_ref)

query="""SELECT *
FROM `bigquery-public-data.ncaa_basketball.mbb_games_sr` """

# Set up the query
query_job = client.query(query)
# API request - run the query, and return a pandas DataFrame
gamesData = query_job.to_dataframe()
display(gamesData.head(5))

Unnamed: 0,game_id,season,status,coverage,neutral_site,scheduled_date,gametime,conference_game,tournament,tournament_type,...,a_fast_break_pts,a_second_chance_pts,a_team_turnovers,a_points_off_turnovers,a_team_rebounds,a_flagrant_fouls,a_player_tech_fouls,a_team_tech_fouls,a_coach_tech_fouls,created
0,b4451a02-26c5-4005-9ac8-b06c1f71e661,2015,closed,full,,2015-11-24,2015-11-24 21:30:00+00:00,,,,...,36,17,0,31.0,5,0,0,0,0,2018-02-20 15:48:58+00:00
1,b2f579ca-9eff-4b2b-a747-81169399c2e8,2015,closed,full,,2015-11-24,2015-11-24 02:00:00+00:00,,,,...,16,25,0,25.0,1,0,0,0,0,2018-02-20 15:48:53+00:00
2,571be71c-a5bf-446e-bf21-30eb6c54ac5e,2015,closed,full,,2015-11-25,2015-11-25 19:30:00+00:00,,,,...,6,6,0,12.0,0,0,0,0,0,2018-02-20 15:48:58+00:00
3,d6617923-0b23-49e4-af9b-9e4d0243e45c,2015,closed,full,,2015-12-19,2015-12-19 04:00:00+00:00,,,,...,2,13,1,15.0,3,0,0,0,0,2018-02-20 15:48:53+00:00
4,ffb463a4-dd3c-4ed9-b503-311b95ef0295,2015,closed,full,,2015-12-20,2015-12-20 04:00:00+00:00,,,,...,6,7,0,,0,0,0,0,0,2018-02-20 15:48:53+00:00


Uncomment this cell to save a zipped CSV copy of the data

In [4]:
# compression_opts = dict(method='zip',
#                         archive_name='ncaaBasketball.csv')  
# gamesData.to_csv('ncaaBasketball.zip', index=False,
#           compression=compression_opts)  
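As a minimal sketch of the save-and-reload round trip, the cell below writes a zipped CSV and reads it straight back from the archive. The toy `games` frame and the temporary directory are stand-ins for illustration only; they are not the real `gamesData` or the notebook's actual paths.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for gamesData; the values are illustrative only
games = pd.DataFrame({"season": [2015, 2016], "h_points_game": [73, 82]})

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "ncaaBasketball.zip")
    # Write one CSV member named ncaaBasketball.csv inside the zip archive
    games.to_csv(
        zip_path,
        index=False,
        compression={"method": "zip", "archive_name": "ncaaBasketball.csv"},
    )
    # read_csv infers zip compression from the .zip extension
    restored = pd.read_csv(zip_path)

print(restored.equals(games))
```

Reading directly from the `.zip` avoids having to unpack the archive by hand before loading it.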

Read the CSV file into a pandas DataFrame

In [312]:
gamesData = pd.read_csv('ncaaBasketball/ncaaBasketball.csv')



## Clean data

In [315]:
dropMe=['status','coverage','tournament_type','tournament_round','tournament_game_no', 'scheduled_date',
        'attendance','created','possession_arrow','venue_id','venue_city','venue_state','venue_address',
       'venue_zip','venue_country','venue_name','venue_capacity','h_league_id','h_league_name','h_league_alias',
        'h_conf_id','h_conf_name','h_conf_alias','h_division_id','h_division_name','h_division_alias',
        'h_logo_large','h_logo_medium','h_logo_small','a_league_id','a_league_name','a_league_alias',
        'a_conf_id','a_conf_name','a_conf_alias','a_division_id','a_division_name','a_division_alias',
        'a_logo_large','a_logo_medium','a_logo_small','h_name','a_name','h_market','a_market', 'h_id','a_id',
       'game_id','gametime','conference_game','neutral_site','h_minutes','a_minutes','h_rank','a_rank',
       'lead_changes','times_tied','periods']

gamesData = gamesData.drop(columns = dropMe)

We are most interested in who won each game; this is the column we want to predict.

In [316]:
gamesData['h_won'] = gamesData['h_points_game'] > gamesData['a_points_game']
gamesData['a_won'] = gamesData['h_points_game'] < gamesData['a_points_game']

Split the data into postseason games (the NCAA tournament) and all other games, which the code calls `preseason`

In [317]:
postseason = gamesData[gamesData.tournament == 'NCAA']
postseason = postseason.drop(columns = ['tournament'])
preseason = gamesData[gamesData.tournament != 'NCAA']
preseason = preseason.drop(columns = ['tournament'])
display(preseason.head(5))

Unnamed: 0,season,h_alias,h_points_game,h_field_goals_made,h_field_goals_att,h_field_goals_pct,h_three_points_made,h_three_points_att,h_three_points_pct,h_two_points_made,...,a_second_chance_pts,a_team_turnovers,a_points_off_turnovers,a_team_rebounds,a_flagrant_fouls,a_player_tech_fouls,a_team_tech_fouls,a_coach_tech_fouls,h_won,a_won
0,2015,CHA,73,26.0,61.0,42.6,10.0,19.0,52.6,16.0,...,17.0,0.0,31.0,5.0,0.0,0.0,0.0,0.0,False,True
1,2015,CHA,72,24.0,69.0,34.8,8.0,33.0,24.2,16.0,...,25.0,0.0,25.0,1.0,0.0,0.0,0.0,0.0,False,True
2,2015,CHA,93,34.0,67.0,50.7,13.0,29.0,44.8,21.0,...,6.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,False,True
3,2015,ORST,82,26.0,54.0,48.1,8.0,20.0,40.0,18.0,...,13.0,1.0,15.0,3.0,0.0,0.0,0.0,0.0,True,False
4,2015,ORST,76,22.0,54.0,40.7,7.0,18.0,38.9,15.0,...,7.0,0.0,,0.0,0.0,0.0,0.0,0.0,True,False


Compute each team's regular-season stats for each season. Missing values (NaN) are assumed to be 0.

In [349]:
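def getTeamStatsPerSeason(data,team,year):
    """Return the mean of every stat for `team` in `year`, over home and away games."""
    # Home games: keep the h_* columns and strip the prefix
    home = data.query('season == @year and h_alias == @team').copy()
    home = home.filter(regex = '^h_')
    home = home.rename(columns=lambda x: x.replace('h_', '',1))
    
    # Away games: keep the a_* columns and strip the prefix
    away = data.query('season == @year and a_alias == @team').copy()
    away = away.filter(regex = '^a_')
    away = away.rename(columns=lambda x: x.replace('a_', '',1))
    
    # Stack both sides, treat missing stats as 0, and average each column
    temp = pd.concat([home,away])
    temp = temp.drop(columns = ['alias']).fillna(0)
    return temp.mean()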
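Calling the function above once per team and season re-filters the whole table each time. One possible alternative, sketched here under assumptions, computes every (season, team) average in a single `groupby`. The helper name `season_team_means` and the toy team aliases are hypothetical; the NaN-as-0 assumption is carried over unchanged.

```python
import pandas as pd


def season_team_means(games: pd.DataFrame) -> pd.DataFrame:
    """Average every h_*/a_* stat per (season, team), treating NaN as 0."""

    def side(prefix: str) -> pd.DataFrame:
        # Keep one side's columns and strip the prefix so both sides line up
        cols = [c for c in games.columns if c.startswith(prefix)]
        part = games[["season"] + cols].copy()
        part.columns = ["season"] + [c[len(prefix):] for c in cols]
        return part

    stacked = pd.concat([side("h_"), side("a_")], ignore_index=True)
    stacked = stacked.fillna(0)
    return stacked.groupby(["season", "alias"]).mean()


# Toy games with made-up aliases; None marks a missing stat
toy = pd.DataFrame({
    "season": [2015, 2015, 2015],
    "h_alias": ["AAA", "BBB", "AAA"],
    "a_alias": ["BBB", "AAA", "CCC"],
    "h_points_game": [70, 60, 80],
    "a_points_game": [65, 75, None],
})
means = season_team_means(toy)
print(means)
```

The result is indexed by `(season, alias)`, so a single team's averages can be looked up with `means.loc[(2015, "AAA")]`.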

Compute the difference between two given teams' average stats in a season

In [319]:
def getTeamDifferences(data,team1,team2,year):
    team1Mean = getTeamStatsPerSeason(data,team1,year)
    team2Mean = getTeamStatsPerSeason(data,team2,year)
    
    return team1Mean - team2Mean

Extract just the matchups and outcomes of the postseason. This will become our training and testing set.

In [320]:
postseasonData = postseason.filter(items = ['season','h_alias','a_alias'])
postseasonTarget = postseason.filter(items = ['h_won'])

## Apply Naive Bayes
Split post season data into train and test

In [352]:
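def getMatchUp(dataToSplit,dataSource):
    """Build one stat-difference row (team1 minus team2) per game in dataToSplit."""
    matchUps = []
    # dataToSplit must hold exactly the columns season, h_alias, a_alias, in that order
    for year, team1, team2 in dataToSplit.itertuples(index=False, name=None):
        matchUps.append(getTeamDifferences(dataSource,team1, team2, year))
        
    return matchUps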
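If per-team averages are already available as a table indexed by `(season, alias)`, the matchup rows can be built without a Python loop. The sketch below assumes such a table; `matchup_features`, the toy `means` values, and the team aliases are all hypothetical.

```python
import pandas as pd

# Assumed input: one row of season averages per (season, team)
means = pd.DataFrame(
    {"points_game": [75.0, 62.5], "rebounds": [30.0, 35.0]},
    index=pd.MultiIndex.from_tuples(
        [(2015, "AAA"), (2015, "BBB")], names=["season", "alias"]
    ),
)

matchups = pd.DataFrame({
    "season": [2015, 2015, 2015],
    "h_alias": ["AAA", "BBB", "AAA"],
    "a_alias": ["BBB", "AAA", "ZZZ"],  # ZZZ has no season averages
})


def matchup_features(matchups: pd.DataFrame, means: pd.DataFrame) -> pd.DataFrame:
    """Return home-minus-away stat differences, one row per matchup."""
    home_idx = pd.MultiIndex.from_frame(matchups[["season", "h_alias"]])
    away_idx = pd.MultiIndex.from_frame(matchups[["season", "a_alias"]])
    # reindex looks up each (season, team); unknown teams become NaN rows
    home = means.reindex(home_idx).to_numpy()
    away = means.reindex(away_idx).to_numpy()
    return pd.DataFrame(home - away, columns=means.columns, index=matchups.index)


features = matchup_features(matchups, means)
print(features)
```

Rows containing NaN flag teams with no games in the source data; they would need to be dropped or filled before fitting, since `GaussianNB` does not accept missing values.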

In [371]:
X_train_temp, X_test_temp, y_train, y_test = train_test_split(postseasonData, postseasonTarget, test_size=0.25)

X_train = getMatchUp(X_train_temp,preseason)
X_test = getMatchUp(X_test_temp,preseason)

Unnamed: 0,season,h_alias,a_alias
17374,2015,SEA,IDHO
5255,2014,MSU,UGA
27884,2016,GMU,L-MD
11326,2013,MURR,NEOM
9907,2017,SMC,UTAH
...,...,...,...
28362,2017,WOF,CMU
9908,2017,OKST,WKU
8000,2016,CAMP,UTM
13590,2013,OHIO,VMI
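The split above draws a different random sample on every run, so scores will vary between executions. A minimal sketch of a reproducible, class-balanced split follows; the toy frame, the seed value, and the use of a 1-D `Series` for the target are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy matchups with made-up aliases
games = pd.DataFrame({
    "season": [2015] * 8,
    "h_alias": list("ABCDEFGH"),
    "a_alias": list("IJKLMNOP"),
})
won = pd.Series([True, False] * 4, name="h_won")


def split(seed: int = 0):
    # random_state fixes the shuffle; stratify keeps the win/loss ratio in both parts
    return train_test_split(
        games, won, test_size=0.25, random_state=seed, stratify=won
    )


X_train, X_test, y_train, y_test = split()
print(X_test)
print(y_test)
```

Calling `split()` again with the same seed returns the same rows, which makes later metric comparisons meaningful.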


Finally train!!

In [360]:
clf = GaussianNB()
clf.fit(X_train, y_train.values.ravel())  # pass a 1-D label array to avoid a DataConversionWarning


GaussianNB()
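To make the fit-and-score step concrete, here is a small self-contained sketch on synthetic data. The numbers are randomly generated stand-ins for stat differences, not NCAA results, and `make_games` is a hypothetical helper. It also shows the two ways `precision_recall_fscore_support` can report scores.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)


def make_games(n: int):
    """Synthetic stat differences: home wins centred at +1, home losses at -1."""
    wins = rng.normal(loc=1.0, scale=1.0, size=(n, 3))
    losses = rng.normal(loc=-1.0, scale=1.0, size=(n, 3))
    X = np.vstack([wins, losses])
    y = np.array([1] * n + [0] * n)  # 1 = home team won
    return X, y


X_train, y_train = make_games(100)
X_test, y_test = make_games(25)

clf = GaussianNB().fit(X_train, y_train)  # y_train is already 1-D
y_pred = clf.predict(X_test)

# Default (average=None): one value per class, in the order of clf.classes_
per_class = precision_recall_fscore_support(y_test, y_pred)
# average="binary": a single score for the positive class (label 1)
p, r, f, _ = precision_recall_fscore_support(y_test, y_pred, average="binary")
print(clf.classes_, p, r, f)
```

With the default setting each returned array has one entry per class, which is why printing precision, recall and F1 yields pairs of numbers rather than single values.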

In [396]:
y_pred = clf.predict(X_test)
print(y_test['h_won'].iloc[0])
for game in range(len(y_pred)):
    [season,team1,team2] = X_test_temp.iloc[game]
    print('Game:', team1,'beats', team2, 'in', season,'Predict:',y_pred[game], 'Actual:',y_test['h_won'].iloc[game])
p,r,f,s = precision_recall_fscore_support(y_test, y_pred)
print(p, r, f)

True
Game: UTAH beats FRES in 2015 Predict: False Actual: True
Game: GONZ beats XAV in 2016 Predict: False Actual: True
Game: BSU beats WASH in 2017 Predict: False Actual: False
Game: SCAR beats GT in 2015 Predict: False Actual: False
Game: OKLA beats TXAM in 2015 Predict: False Actual: True
Game: VILL beats TTU in 2017 Predict: False Actual: True
Game: UNC beats ARK in 2016 Predict: False Actual: True
Game: TEX beats UNI in 2015 Predict: False Actual: False
Game: USD beats PRST in 2017 Predict: True Actual: True
Game: TENN beats IOWA in 2013 Predict: False Actual: True
Game: WIS beats ORE in 2014 Predict: False Actual: True
Game: TXAM beats WYO in 2013 Predict: True Actual: True
Game: COLO beats WEBB in 2014 Predict: True Actual: True
Game: VILL beats RAD in 2017 Predict: True Actual: True
Game: UVA beats MSU in 2013 Predict: False Actual: False
Game: NCST beats LSU in 2014 Predict: False Actual: True
Game: VALP beats TXSO in 2015 Predict: False Actual: True
Game: ULL beats EVAN in 20