# FanDuel Analysis

This notebook is used to predict whether a player will score points on FanDuel for a game. This is simpler version of predicting how many total points they may get on a day. The goal is to create a model that will perform better than a basic model, and then visualize the results to understand what factors drive a player's chance that they will score points on FanDuel.


* Step 1 - read in data and create any more features
* Step 2- build basic model to compare future models to
* Step 3 - create visualization section
* Step 4 - once better model is built email guy from phillies to stay in contact

### Step 1 

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

In [5]:
df = pd.read_csv('../data/all_batters_with_extra_feats.csv')

In [6]:
df.columns

Index(['Unnamed: 0', 'player_id', 'high_home_runs', 'low_home_runs',
       'med_home_runs', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'Gcar', 'Gtm',
       'Date', 'Tm', 'Opp', 'Rslt', 'Inngs', 'PA', 'AB', 'R', 'H', '2B', '3B',
       'HR', 'RBI', 'BB', 'IBB', 'SO', 'HBP', 'SH', 'SF', 'ROE', 'GDP', 'SB',
       'CS', 'BA', 'OBP', 'SLG', 'OPS', 'BOP', 'aLI', 'WPA', 'acLI', 'cWPA',
       'RE24', 'DFS(DK)', 'DFS(FD)', 'Pos', 'is_home', 'got_hit',
       'got_hit_prev_day', 'hit_streak', 'prev_points', 'points_ma',
       'above_avg_points', 'above_avg_streak'],
      dtype='object')

### Select Features & Standardize Data 

In [28]:
def get_X_and_y(df):
    feats = ['hit_streak', 'prev_points', 'points_ma', 
             'above_avg_points', 'above_avg_streak']
    X = df[feats]
    df['positive_points'] = df['DFS(FD)'] > 0
    y = df['positive_points']
    return X, y

In [15]:
X, y = get_X_and_y(df)

In [16]:
hhr = df[df['high_home_runs'] == 1].copy()
hhr_X, hhr_y = get_X_and_y(hhr)

mhr = df[df['med_home_runs'] == 1].copy()
mhr_X, mhr_y = get_X_and_y(mhr)

lhr = df[df['low_home_runs'] == 1].copy()
lhr_X, lhr_y = get_X_and_y(lhr)

### Step 2 

In [17]:
# what is the baseline accuracy (if always guessed positive how many what would it be)
# baseline is 72.9%
df['positive_points'].sum()/df.shape[0]

0.7386626105571517

In [23]:
def run_classify_model(df, X, y):
    results = {'baseline': df['positive_points'].sum()/df.shape[0]}
    models = ['LOG', 'TREE', 'KNN', 'ADA', 'NN']
    for model in models:
        if model == 'LOG':
            clf = LogisticRegression(solver='saga', max_iter=10000)
        elif model == 'TREE':
            clf = DecisionTreeClassifier(max_depth=10)
        elif model == 'KNN':
            clf = KNeighborsClassifier()
        elif model == 'ADA':
            clf = AdaBoostClassifier()
        elif model == 'NN':
            clf = MLPClassifier(random_state=1, max_iter=300)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
        scaler = StandardScaler().fit(X_train)
        X_train_scaled = scaler.transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        clf = clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test_scaled)
#         scores = cross_val_score(clf,X, y, cv=10)
        acc = accuracy_score(y_pred, y_test)
        results[model] = round(acc, 3) * 100
    return results

In [24]:
run_classify_model(df, X, y)

{'baseline': 0.7386626105571517,
 'LOG': 73.0,
 'TREE': 72.7,
 'KNN': 64.1,
 'ADA': 73.0,
 'NN': 72.89999999999999}

In [25]:
run_classify_model(hhr,hhr_X_scaled, hhr_y)

{'baseline': 0.8091487428052105,
 'LOG': 79.80000000000001,
 'TREE': 75.2,
 'KNN': 76.6,
 'ADA': 79.7,
 'NN': 79.80000000000001}

In [26]:
run_classify_model(mhr,mhr_X_scaled, mhr_y)

{'baseline': 0.761917883714522,
 'LOG': 77.0,
 'TREE': 76.0,
 'KNN': 73.9,
 'ADA': 76.9,
 'NN': 77.0}

In [27]:
run_classify_model(lhr,lhr_X_scaled, lhr_y)

{'baseline': 0.6828420467185762,
 'LOG': 67.9,
 'TREE': 64.5,
 'KNN': 61.3,
 'ADA': 67.5,
 'NN': 67.9}