In this notebook I explore the NBADATA.csv file and try to identify the statistics most relevant for modelling. The study is split into two sections.


1. Which stats are the most correlated with victory? 
    Each box score records the game's final point differential (+/-). Using it as the output and the remaining box score statistics as inputs, we can infer which statistics are most correlated with it, along with other useful takeaways noted along the way. 
    

2. How many games are sufficient for lookback? 
    We obviously cannot use a game's own in-game data to predict that game, so some sort of lookback proxy must be used. To capture each team's expected performance in the game being predicted, a lookback window over the team's previous x games serves as that team's forecasted stats. I will try to identify how many games of lookback work best for this purpose. 
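
The lookback idea can be sketched in a few lines. This is a minimal illustration and not the notebook's actual feature pipeline: the function name `rolling_lookback`, the use of a plain mean, the window `x`, and the toy point totals are all assumptions. The detail that matters is that the value for a game uses only games played before it, so no in-game data leaks into the features.

```python
from collections import deque

def rolling_lookback(values, x):
    """Mean of the previous x values before each game (None until x games exist)."""
    window = deque(maxlen=x)   # holds at most the last x completed games
    features = []
    for v in values:
        # The forecast for this game uses only games already played.
        features.append(sum(window) / x if len(window) == x else None)
        window.append(v)       # the current game becomes history afterwards
    return features

points = [100, 110, 90, 120, 105]      # hypothetical per-game point totals
lookback = rolling_lookback(points, x=3)
print(lookback)
```

The first x games have no forecast because there is not yet a full window; a real pipeline would have to decide whether to drop those rows or use a shorter window.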

In [1]:
# Import dependencies and dataset.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline



In [2]:
df = pd.read_csv('30_game_rolling_stats.csv',dtype=np.float32)

In [3]:
outcome = df['PLUS_MINUS'] #df['home_SPREAD'] + df['PLUS_MINUS'] 

In [4]:
y = []
for val in outcome:
    if val>0: 
        y.append(1) #home team wins. 
    else:
        y.append(0)

In [5]:
del df['PLUS_MINUS']

In [6]:
X = df

In [7]:
del X['Unnamed: 0']

In [8]:
sum(y)/len(y)  



0.5903896103896104

The home team wins roughly 59% of the time. 
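
That home-win rate is also the bar any model has to clear: a predictor that always picks the home team is right exactly as often as the home team wins. A small sketch with made-up labels (the `toy_y` list and the function name are assumptions, not the notebook's data):

```python
def always_home_accuracy(labels):
    """Accuracy of predicting 1 (home win) for every game."""
    predictions = [1] * len(labels)
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)

toy_y = [1, 0, 1, 1, 0]                 # hypothetical outcomes, 1 = home win
baseline = always_home_accuracy(toy_y)
print(baseline)                         # equals sum(toy_y) / len(toy_y)
```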

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [10]:
scaler = MinMaxScaler()
X_train,X_test,y_train,y_test = train_test_split(X,y,shuffle=False,test_size = .25)

X_train, X_val, y_train, y_val = train_test_split(X_train,y_train,shuffle=False,test_size = .25)

In [11]:
type(X_test.values[0][0])

numpy.float32

Before normalizing, save the home-team spreads for the test games in a separate variable. 

In [12]:
test_spreads = X_test['home_SPREAD']

In [13]:
scaler.fit(X_train.values)
X_train = scaler.transform(X_train.values)
X_val = scaler.transform(X_val.values)
X_test = scaler.transform(X_test.values)
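
The reason for saving the spreads first is that min-max scaling maps every column into [0, 1], so the sign of the spread (which marks favorite versus underdog) is no longer visible afterwards. The sketch below re-implements the scaling formula by hand on made-up spreads; `min_max_scale` and the values are illustrative assumptions, not part of scikit-learn.

```python
def min_max_scale(column, lo, hi):
    """Scale values with a min/max learned elsewhere (e.g. from the training set)."""
    return [(v - lo) / (hi - lo) for v in column]

train_spreads = [-8.0, -3.5, 2.0, 6.0]  # hypothetical raw home spreads
lo, hi = min(train_spreads), max(train_spreads)

scaled = min_max_scale(train_spreads, lo, hi)
print(scaled)  # all values now lie in [0, 1]

# After scaling, "spread > 0" no longer separates underdogs from favorites.
raw_underdogs = [s > 0 for s in train_spreads]
scaled_positive = [s > 0 for s in scaled]
print(raw_underdogs, scaled_positive)
```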

# Now apply tensorflow model here!

In [14]:
import tensorflow as tf  # TensorFlow 1.x API (placeholders, sessions)
tf.set_random_seed(456)
from sklearn.metrics import accuracy_score

In [15]:
X_test.shape[0]

963

In [16]:
# Generate tensorflow graph
d = X_train.shape[1]
n_hidden = 100
learning_rate = .001
n_epochs = 100
batch_size = 200
dropout_prob = .5  # fed to keep_prob below, i.e. the probability of keeping a unit

In [17]:
type(X_train[0][0])

numpy.float32

In [18]:
with tf.name_scope("placeholders"):
    x = tf.placeholder(tf.float32, (None, d))
    y = tf.placeholder(tf.float32, (None,))
    keep_prob = tf.placeholder(tf.float32)
with tf.name_scope("hidden-layer-1"):
    W = tf.Variable(tf.random_normal((d, n_hidden)))
    b = tf.Variable(tf.random_normal((n_hidden,)))
    x_hidden_1 = tf.nn.sigmoid(tf.matmul(x, W) + b)
    # Apply dropout
    x_hidden_1 = tf.nn.dropout(x_hidden_1, keep_prob)

with tf.name_scope("hidden-layer-2"):
    W = tf.Variable(tf.random_normal((n_hidden, n_hidden)))
    b = tf.Variable(tf.random_normal((n_hidden,)))
    x_hidden_2 = tf.nn.sigmoid(tf.matmul(x_hidden_1, W) + b)
    # Apply dropout
    x_hidden_2 = tf.nn.dropout(x_hidden_2, keep_prob)

with tf.name_scope("output"):
    W = tf.Variable(tf.random_normal((n_hidden, 1)))
    b = tf.Variable(tf.random_normal((1,)))
    y_logit = tf.matmul(x_hidden_2, W) + b
    # The sigmoid gives the probability of class 1
    y_one_prob = tf.sigmoid(y_logit)
    # Rounding P(y=1) gives the predicted class.
    y_pred = tf.round(y_one_prob)
with tf.name_scope("loss"):
    # Compute the cross-entropy term for each datapoint
    y_expand = tf.expand_dims(y, 1)
    entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits=y_logit, labels=y_expand)
    # Sum all contributions
    l = tf.reduce_sum(entropy)

with tf.name_scope("optim"):
    train_op = tf.train.AdamOptimizer(learning_rate).minimize(l)

with tf.name_scope("summaries"):
    tf.summary.scalar("loss", l)
    merged = tf.summary.merge_all()

In [19]:
train_writer = tf.summary.FileWriter('/tmp/nba-train',tf.get_default_graph())


In [20]:
N = X_train.shape[0]
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    step = 0
    for epoch in range(n_epochs):
        pos = 0
        while pos < N:
            batch_X = X_train[pos:pos+batch_size]
            batch_y = y_train[pos:pos+batch_size]
            feed_dict = {x: batch_X, y: batch_y, keep_prob: dropout_prob}
            _, summary, loss = sess.run([train_op, merged, l], feed_dict=feed_dict)
            train_writer.add_summary(summary, step)

            step += 1
            pos += batch_size
        # Make predictions (set keep_prob to 1.0 for predictions)
        train_y_pred = sess.run(y_pred, feed_dict={x: X_train, keep_prob: 1.0})
        val_y_pred = sess.run(y_pred, feed_dict={x: X_val, keep_prob: 1.0})
        print("epoch %d, step %d, loss: %f" % (epoch, step, loss))
        print("training acc. ", accuracy_score(y_train,train_y_pred), "%")
        print("validation acc. ", accuracy_score(y_val,val_y_pred), "%")
    test_y_pred = sess.run(y_pred, feed_dict={x: X_test, keep_prob: 1.0})

# No sample weights are used, so these are plain (unweighted) accuracies.
train_score = accuracy_score(y_train, train_y_pred)
print("Train Classification Accuracy: %f" % train_score)
val_score = accuracy_score(y_val, val_y_pred)
print("Valid Classification Accuracy: %f" % val_score)
test_score = accuracy_score(y_test, test_y_pred)
print("Test Classification Accuracy: %f" % test_score)

epoch 0, step 11, loss: 463.394836
training acc.  0.641108545035 %
validation acc.  0.59972299169 %
epoch 1, step 22, loss: 410.276398
training acc.  0.65311778291 %
validation acc.  0.612188365651 %
epoch 2, step 33, loss: 498.833069
training acc.  0.653579676674 %
validation acc.  0.6108033241 %
epoch 3, step 44, loss: 474.791046
training acc.  0.657736720554 %
validation acc.  0.628808864266 %
epoch 4, step 55, loss: 351.564423
training acc.  0.656812933025 %
validation acc.  0.630193905817 %
epoch 5, step 66, loss: 352.825531
training acc.  0.658660508083 %
validation acc.  0.624653739612 %
epoch 6, step 77, loss: 362.442932
training acc.  0.664203233256 %
validation acc.  0.62188365651 %
epoch 7, step 88, loss: 274.934845
training acc.  0.666974595843 %
validation acc.  0.624653739612 %
epoch 8, step 99, loss: 348.102417
training acc.  0.669284064665 %
validation acc.  0.63296398892 %
epoch 9, step 110, loss: 317.629211
training acc.  0.668822170901 %
validation acc.  0.6440443213

epoch 82, step 913, loss: 132.122452
training acc.  0.677598152425 %
validation acc.  0.627423822715 %
epoch 83, step 924, loss: 128.200439
training acc.  0.677136258661 %
validation acc.  0.626038781163 %
epoch 84, step 935, loss: 128.853683
training acc.  0.676674364896 %
validation acc.  0.628808864266 %
epoch 85, step 946, loss: 116.685287
training acc.  0.676674364896 %
validation acc.  0.628808864266 %
epoch 86, step 957, loss: 116.290558
training acc.  0.675288683603 %
validation acc.  0.631578947368 %
epoch 87, step 968, loss: 121.276199
training acc.  0.675750577367 %
validation acc.  0.634349030471 %
epoch 88, step 979, loss: 110.818008
training acc.  0.675750577367 %
validation acc.  0.631578947368 %
epoch 89, step 990, loss: 115.948158
training acc.  0.676212471132 %
validation acc.  0.634349030471 %
epoch 90, step 1001, loss: 123.473068
training acc.  0.675750577367 %
validation acc.  0.638504155125 %
epoch 91, step 1012, loss: 117.554947
training acc.  0.672979214781 %
va

In [None]:
!tensorboard --logdir=/tmp/nba-train 

In [None]:
from sklearn.neural_network import MLPClassifier

model = MLPClassifier()
model.fit(X_train,y_train)



In [None]:
model.score(X_train,y_train)

In [None]:
model.score(X_test,y_test)

Now compare the model's picks against the spread: 

In [None]:
predictions = model.predict(X_test)


for i,prediction in enumerate(predictions):
   # print("Prediction: ", prediction)
    #print("Spread of the home team for this game: ", test_spreads.values[i])
    
    if test_spreads.values[i] > 0 and prediction ==1:
        print("Picked a home team underdog")
        
    if test_spreads.values[i] < 0 and prediction ==0:
        print("Picked an away team underdog")

This is a good time to note that the model seems to pick underdogs very rarely, and when it does, it is more often the road team. Update: the model now picks the opposite, so let's come back to this later, once a proper payout calculator has been made. 
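
Printing one line per game makes this hard to judge at a glance. A hedged alternative is to count the underdog picks. The function below is a sketch that assumes, as in the loop above, that a positive home spread means the home team is the underdog and that a prediction of 1 means a home win. The name `count_underdog_picks` and the sample inputs are hypothetical.

```python
from collections import Counter

def count_underdog_picks(spreads, predictions):
    """Tally how often the predicted winner is the underdog, by venue."""
    counts = Counter()
    for spread, pred in zip(spreads, predictions):
        if spread > 0 and pred == 1:
            counts["home_underdog"] += 1
        elif spread < 0 and pred == 0:
            counts["away_underdog"] += 1
        else:
            counts["favorite_or_pickem"] += 1
    return counts

toy_spreads = [3.5, -6.0, -2.0, 1.0]     # hypothetical home spreads
toy_preds = [1, 1, 0, 0]                 # hypothetical model output
picks = count_underdog_picks(toy_spreads, toy_preds)
print(dict(picks))
```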

In [None]:
def spread2ML(spread):
    """Convert a spread into a moneyline value.

    Uses the piecewise cubic fit I derived in another notebook.
    """
    if spread <= 1.5:
        ML = 1.71409498 * spread**3 + 10.90008433 * spread**2 + 22.40247106 * spread - 138.20112341
    else:
        ML = 1.66494668 * spread**3 - 20.03302374 * spread**2 + 101.20347437 * spread - 34.68833849

    return ML


def ML2Payout(ML,bet,win=True):
    """Convert moneyline odds to a payout: the profit if the bet wins, -bet if it loses.
    """
    if not win:
        # A losing bet costs the stake regardless of the odds.
        return -bet

    if ML < 0:
        # Favorite: PAYOUT = BET AMOUNT / (-1 * MONEYLINE ODDS / 100),
        # e.g. at -190 you put down 19 to win 10.
        payout = bet / (-1*ML/100)
    elif ML > 0:
        # Underdog: PAYOUT = BET AMOUNT * ODDS / 100
        payout = (bet * ML) / 100
    else:
        payout = bet

    return payout
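
As a quick sanity check on the two winning-bet formulas in `ML2Payout`, the sketch below computes the profit for one favorite line and one underdog line. The function name `profit_if_win` and the example lines are assumptions for illustration; the arithmetic follows the usual American-odds convention (risk the absolute value of a negative line to win 100, risk 100 to win a positive line).

```python
def profit_if_win(moneyline, stake):
    """Profit (excluding the returned stake) on a winning American-odds bet."""
    if moneyline < 0:
        # Favorite: risk |moneyline| to win 100.
        return stake * 100 / -moneyline
    # Underdog: risk 100 to win moneyline.
    return stake * moneyline / 100

favorite_profit = profit_if_win(-190, 19)   # risk 19 at -190
underdog_profit = profit_if_win(150, 10)    # risk 10 at +150
print(favorite_profit, underdog_profit)
```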

In [None]:
spread2ML(1)

In [None]:
def risk2payout(ML,bet,win=True):
    """Return (risk, payout) for a bet of size `bet`.

    The risk/reward formula depends on the sign of the moneyline.
    """
    if ML < 0: # betting on a favorite: risk more than `bet` to win `bet`.
        risk = bet * -ML / 100
        reward = bet
    else: # betting on an underdog: risk `bet` to win more than `bet`.
        risk = bet
        reward = bet * ML / 100

    if win:
        payout = reward # profit on top of the returned stake.
    else:
        payout = -risk # the amount risked is gone.

    return risk, payout
        
        
        
        
        

In [None]:
risk2payout(-105,10,win=False)

In [None]:
money_made = 0
acc_count = 0
total_winings = 0
total_losings = 0
init_bet = 10

for i,prediction in enumerate(predictions):
    spread = test_spreads.values[i]
    # NOTE: the odds are derived from the home spread, even when the pick is the away team.
    ML_odds = spread2ML(spread)
    print()
    print("Odds of Game: ", ML_odds)
    print("Spread of Game: ",spread)

    if y_test[i] == prediction:
        acc_count+=1
        risk , winnings = risk2payout(ML_odds,init_bet,win=True)
        print("Correct! Win $", winnings)
        print("You risked $",risk)

        money_made += winnings
        total_winings += winnings
    else:
        _ , losings = risk2payout(ML_odds,init_bet,win=False)
        print("Wrong! Lose $", -losings)

        money_made += losings
        total_losings += losings

In [None]:
money_made

In [None]:
total_winings

In [None]:
total_losings

# Considering the spread along with these rolling statistics, the model picks the winner of NBA games with approximately 69% accuracy. Once moneyline odds converted from the spread are taken into account, how much money do you win? 
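
One way to frame that question: accuracy alone does not determine profit, because a favorite pays less than the amount risked. A hedged sketch of the break-even win rate for a single bet at a given moneyline follows; the helper name `break_even_win_rate` and the example lines are hypothetical, and the formula is the standard American-odds conversion rather than anything derived in this notebook.

```python
def break_even_win_rate(moneyline):
    """Win probability at which a bet at these American odds has zero expected profit."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)

rates = {line: break_even_win_rate(line) for line in (-190, -105, 150)}  # hypothetical lines
for line, rate in rates.items():
    print(line, round(rate, 3))
```

A model that mostly picks heavy favorites therefore needs an accuracy well above 50% on those games just to break even.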

In [None]:
len(X_test)

# Quick Review

# 1. Identified stats most correlated to point differential in a game.

# 2. Identified ngamesplits most correlated to performance in said game. 

# 3. Trained a model on these statistics and found it can pick winners with ~68% accuracy, while picks against the spread are still barely over 50%. 

# What's up for next time? Refactor this code and turn it into a model that can be updated as more games come in from the scrapers, and provide a fast application for guess and check. Also build a converter from the traditional spread to money made. 