# Senior Project 1 Presentation
Authors: Ismail Conze (Leader/Analyzer), Kalyn Matthews (Document Facilitator), Nick Chowa (Data Interpreter)

## Introduction

Our live online statistical analysis will serve as a modest NBA game predictor that uses reliable basketball statistics and historical results to heighten the fantasy league experience. Betting has long relied on the ability to make well-founded predictions about favored teams and their opponents. My partners and I set out to support that need through machine learning and careful feature selection.

Beyond the research already conducted, my partners and I followed an independent approach that emphasizes recognizing and filtering outliers in each team's data.

## Methodology

The data for our study came from a publicly available SQLite database on Kaggle. The dataset is titled “Basketball Dataset” and contains stats pulled directly from the NBA API, covering 1946 through the 2020 season. To size the test sets for this dataset, we followed the common guideline of holding out 20-25% of the total observations.
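
The effect of the 20-25% guideline on set sizes can be checked directly. The sketch below is illustrative only: it uses 411 row positions as a stand-in for one team's games (the count printed for "MIN" later in this notebook), and `random_state=0` is an assumption added so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for one team's games: 411 row positions, not real game data.
games = np.arange(411)

# Hold out 25% of the observations for testing.
train_25, test_25 = train_test_split(games, test_size=0.25, random_state=0)

# Hold out 20% of the observations for testing.
train_20, test_20 = train_test_split(games, test_size=0.20, random_state=0)

print(len(train_25), len(test_25))  # 308 103
print(len(train_20), len(test_20))  # 328 83
```

The 103-game test set matches the `support` total reported for the single-team experiments.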

## Preprocessing

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pyplot
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, cross_val_predict

### Functions

Takes in a dataframe and team name, returns a dataframe containing all of the games for the requested team.

In [4]:
def pull_team(team, games):
    teams_games = games.loc[(games['TEAM_ABBREVIATION_HOME'] == team) |
                            (games['TEAM_ABBREVIATION_AWAY'] == team)]
    print('Number of games')
    print(len(teams_games))
    return teams_games
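
As a small illustration of the selection logic in `pull_team`, the sketch below applies the same home-or-away mask to a toy table. The four rows are made up for the example; only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for the games table; only the two abbreviation columns matter here.
games = pd.DataFrame({
    "GAME_ID": [1, 2, 3, 4],
    "TEAM_ABBREVIATION_HOME": ["MIN", "GSW", "LAL", "BOS"],
    "TEAM_ABBREVIATION_AWAY": ["GSW", "MIN", "BOS", "LAL"],
})

team = "MIN"
# Keep a row when the team appears on either side of the matchup.
mask = (games["TEAM_ABBREVIATION_HOME"] == team) | (games["TEAM_ABBREVIATION_AWAY"] == team)
team_games = games.loc[mask]

print(team_games["GAME_ID"].tolist())  # [1, 2]
```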


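Returns the indices and values of all elements considered outliers by the Tukey method, that is, values more than 1.5 times the interquartile range below the first quartile or above the third quartile.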

In [5]:
def find_outliers(x):
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    outlier_indices = list(x.index[(x < floor) | (x > ceiling)])
    outlier_values = list(x[outlier_indices])
    return outlier_indices, outlier_values
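
To make the Tukey fences concrete, here is a worked example on a toy series. The six values are invented for illustration; the arithmetic follows the same steps as `find_outliers`.

```python
import numpy as np
import pandas as pd

# Toy series of a single statistic; the last value is far from the rest.
points = pd.Series([10, 12, 11, 13, 12, 50])

q1 = np.percentile(points, 25)   # 11.25
q3 = np.percentile(points, 75)   # 12.75
iqr = q3 - q1                    # 1.5
floor = q1 - 1.5 * iqr           # 9.0
ceiling = q3 + 1.5 * iqr         # 15.0

# Anything outside [floor, ceiling] is flagged.
outliers = points[(points < floor) | (points > ceiling)]
print(outliers.index.tolist(), outliers.tolist())  # [5] [50]
```

Note that 10 is the smallest value but still lies above the floor of 9.0, so only the value 50 is flagged.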

Drops the games (rows) that contain an outlier in any numerical statistical category.

In [6]:
def remove_outliers(x):
    indices = []
    for c in x.columns:
        if not x[c].map(type).eq(str).any():
            # identifier columns are not statistics, so they are never checked for outliers
            if c not in ("GAME_ID", "GAME_DATE"):
                indices += find_outliers(x[c])[0]
    x = x.drop(indices)
    return x
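
The same row-dropping idea can also be written without an explicit loop. The sketch below is an alternative formulation, not the notebook's implementation: `drop_tukey_outlier_rows` is a hypothetical helper, and the toy table is invented for the example.

```python
import pandas as pd

def drop_tukey_outlier_rows(frame, skip=("GAME_ID",)):
    """Hypothetical helper: drop rows with a Tukey outlier in any numeric column."""
    numeric = frame.select_dtypes(include="number").drop(columns=list(skip), errors="ignore")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # True wherever a cell falls outside its own column's fences.
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return frame.loc[~outside.any(axis=1)]

# Toy table: the last game has an unusually high home score.
games = pd.DataFrame({
    "GAME_ID": [1, 2, 3, 4, 5, 6],
    "TEAM_ABBREVIATION_HOME": ["MIN"] * 6,
    "PTS_HOME": [100, 102, 98, 101, 99, 160],
    "AST_HOME": [20, 22, 21, 23, 22, 21],
})

kept = drop_tukey_outlier_rows(games)
print(kept.index.tolist())  # [0, 1, 2, 3, 4]
```

A row is removed as soon as one of its statistics is an outlier, so filtering on many columns at once can discard a large share of the games.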

Separates a team's columns into numerical and string columns, replaces missing numerical values with the mean of that statistic, and joins the two groups of columns back together.

In [7]:
def clean_team(x):
    # separate numerical features and categorical features
    categorical_columns = []
    numeric_columns = []
    for c in x.columns:
        if x[c].map(type).eq(str).any():
            categorical_columns.append(c)
        else:
            numeric_columns.append(c)

    # create two dataframes to hold the two types
    data_numeric = x[numeric_columns]
    data_categorical = pd.DataFrame(x[categorical_columns])

    # replace missing values in numerical columns with the column mean and then add the two types back together
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    data_numeric = pd.DataFrame(imp.fit_transform(data_numeric), columns=data_numeric.columns, index=data_numeric.index)
    x = pd.concat([data_numeric, data_categorical], axis=1)
    return x
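
The choice of imputation strategy matters when a column contains extreme values. The toy comparison below (values invented for illustration) shows how `SimpleImputer` fills the same gap under `strategy="mean"` and `strategy="median"`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One toy statistic with a missing entry and one unusually large value.
column = np.array([[10.0], [12.0], [np.nan], [50.0]])

mean_filled = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(column)
median_filled = SimpleImputer(missing_values=np.nan, strategy="median").fit_transform(column)

print(mean_filled.ravel().tolist())    # [10.0, 12.0, 24.0, 50.0]
print(median_filled.ravel().tolist())  # [10.0, 12.0, 12.0, 50.0]
```

The mean is pulled toward the large value, while the median is not, which is worth keeping in mind when imputation runs before outlier removal.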

### Process

Read in the CSV file containing game data and convert the home win/loss column to a binary label (0 for a loss, 1 otherwise).

In [8]:
df = pd.read_csv('games.csv')
df['WL_HOME'] = [0 if x == 'L' else 1 for x in df['WL_HOME']]
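
The same encoding can be expressed as a vectorized comparison. This is a sketch on a toy series standing in for `WL_HOME`, not a change to the cell above.

```python
import pandas as pd

# Toy stand-in for the WL_HOME column.
wl_home = pd.Series(["W", "L", "L", "W"])

# Same rule as the list comprehension: 0 for 'L', 1 for anything else.
encoded = (wl_home != "L").astype(int)

print(encoded.tolist())  # [1, 0, 0, 1]
```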

Here we create a subset of the dataset containing all of the games for the selected team. We then check for missing values in the dataframe and replace them with that team's average for that category. Afterward we check again to confirm that no missing values remain, and then remove the games that contain outliers.

In [9]:
x = pull_team("MIN", df)
print('Missing Values', x.isnull().sum())
x = clean_team(x)
print('Missing Values', x.isnull().sum())
x = remove_outliers(x)

Number of games
411
Missing Values GAME_ID                     0
GAME_DATE                   0
TEAM_ABBREVIATION_HOME      0
TEAM_ABBREVIATION_HOME.1    0
FGM_HOME                    0
FGA_HOME                    0
FG_PCT_HOME                 0
FG3M_HOME                   0
FG3A_HOME                   0
FG3_PCT_HOME                0
FTM_HOME                    0
FTA_HOME                    0
FT_PCT_HOME                 0
OREB_HOME                   0
DREB_HOME                   0
REB_HOME                    0
AST_HOME                    0
STL_HOME                    0
BLK_HOME                    0
TOV_HOME                    0
PF_HOME                     0
PTS_HOME                    0
PTS_2ND_CHANCE_HOME         1
PTS_PAINT_HOME              1
TEAM_ABBREVIATION_AWAY      0
FGM_AWAY                    0
FGA_AWAY                    0
FG_PCT_AWAY                 0
FG3M_AWAY                   0
FG3A_AWAY                   0
FG3_PCT_AWAY                0
FTM_AWAY                    0
FTA_A

Drops the identifying columns that do not affect the outcome and separates the outcome into its own variable.

In [10]:
teamIF = x.drop(['WL_HOME', 'GAME_ID', 'GAME_DATE', 'TEAM_ABBREVIATION_HOME.1', 'TEAM_ABBREVIATION_HOME',
                     'TEAM_ABBREVIATION_AWAY'], axis=1)
teamOF = x.WL_HOME

In [11]:
print(teamIF)
print(teamOF)

      FGM_HOME  FGA_HOME  FG_PCT_HOME  FG3M_HOME  FG3A_HOME  FG3_PCT_HOME  \
14        43.0      82.0        0.524        4.0       15.0         0.267   
19        37.0      76.0        0.487        4.0       10.0         0.400   
33        38.0      85.0        0.447        4.0       12.0         0.333   
59        36.0      82.0        0.439        8.0       21.0         0.381   
85        41.0      78.0        0.526        5.0       13.0         0.385   
...        ...       ...          ...        ...        ...           ...   
6096      43.0      94.0        0.457        7.0       39.0         0.179   
6113      43.0      86.0        0.500        8.0       24.0         0.333   
6130      46.0      85.0        0.541       10.0       23.0         0.435   
6140      38.0      91.0        0.418       13.0       42.0         0.310   
6152      39.0      87.0        0.448       10.0       33.0         0.303   

      FTM_HOME  FTA_HOME  FT_PCT_HOME  OREB_HOME  ...  AST_AWAY  STL_AWAY  

**Trained on full league, tested on single team (without outlier removal)**

In [12]:
league = clean_team(df)
leagueIF = league.drop(['WL_HOME', 'GAME_ID', 'GAME_DATE', 'TEAM_ABBREVIATION_HOME.1', 'TEAM_ABBREVIATION_HOME',
                            'TEAM_ABBREVIATION_AWAY'], axis=1)
leagueOF = league.WL_HOME

In [13]:
leagueIF_train, leagueIF_test, leagueOF_train, leagueOF_test = train_test_split(leagueIF, leagueOF, test_size=.20)
scale = MinMaxScaler()
scale.fit(leagueIF_train)
leagueIF_train_scale = scale.transform(leagueIF_train)
leagueIF_test_scale = scale.transform(leagueIF_test)
league_svm = SVC()
league_svm.fit(leagueIF_train_scale, leagueOF_train)
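
As a brief illustration of what the scaling step does, `MinMaxScaler` learns each column's minimum and maximum from the rows it is fitted on and maps values with (x - min) / (max - min). The numbers below are toy values, not game data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[80.0], [100.0], [120.0]])   # toy training scores
test = np.array([[90.0], [130.0]])             # toy held-out scores

scaler = MinMaxScaler()
scaler.fit(train)   # learns min=80 and max=120 from the training rows only

train_scaled = np.round(scaler.transform(train).ravel(), 2).tolist()
test_scaled = np.round(scaler.transform(test).ravel(), 2).tolist()

print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [0.25, 1.25]
```

Held-out values can fall outside the 0 to 1 range, and a model expects new data to be transformed by the same scaler it was trained with.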

In [14]:
x = pull_team("MIN", df)
x = clean_team(x)
teamIF = x.drop(['WL_HOME', 'GAME_ID', 'GAME_DATE', 'TEAM_ABBREVIATION_HOME.1', 'TEAM_ABBREVIATION_HOME',
                     'TEAM_ABBREVIATION_AWAY'], axis=1)
teamOF = x.WL_HOME

Number of games
411


In [15]:
teamIF_train, teamIF_test, teamOF_train, teamOF_test = train_test_split(teamIF, teamOF, test_size=.25)
scale = MinMaxScaler()
scale.fit(teamIF_train)
teamIF_train_scale = scale.transform(teamIF_train)
teamIF_test_scale = scale.transform(teamIF_test)

In [16]:
pred = league_svm.predict(teamIF_test_scale)

In [17]:
scores = cross_val_score(league_svm, teamIF, teamOF, cv=5)
print(classification_report(teamOF_test, pred))
print(league_svm.score(teamIF_test_scale, teamOF_test))

              precision    recall  f1-score   support

         0.0       0.95      1.00      0.98        42
         1.0       1.00      0.97      0.98        61

    accuracy                           0.98       103
   macro avg       0.98      0.98      0.98       103
weighted avg       0.98      0.98      0.98       103

0.9805825242718447


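Here we train our SVM on the entire league and use it to predict the outcome for a single team; this is our baseline. In the run shown above the accuracy is about 98%, and no outliers have been removed at this stage. The league data also includes this team's games, so some of the test games may already have been seen during training.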

**Trained on single team tested on single team (without outlier removal)**

In [18]:
team_svm = SVC()
team_svm.fit(teamIF_train_scale, teamOF_train)
pred = team_svm.predict(teamIF_test_scale)

In [19]:
scores = cross_val_score(team_svm, teamIF, teamOF, cv=5)
print("score has %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

score has 0.57 accuracy with a standard deviation of 0.04
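
The `cross_val_score` call above receives the unscaled `teamIF`, whereas `team_svm` was fitted on scaled features. One way to cross-validate with scaling applied inside every fold is to wrap the scaler and the classifier in a pipeline. The sketch below is a minimal illustration under assumptions: it uses synthetic data from `make_classification` in place of the team features, so its scores say nothing about the NBA data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the team features and win/loss labels (not NBA data).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# The scaler is re-fit on the training part of every fold and then applied to
# the held-out part, so the SVC always sees scaled inputs.
model = make_pipeline(MinMaxScaler(), SVC())
scores = cross_val_score(model, X, y, cv=5)

print("score has %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
```

Passing a pipeline also avoids fitting the scaler on rows that later serve as validation data.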


In [20]:
print(classification_report(teamOF_test, pred))
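# note: team_svm was fitted on scaled features, but this call passes the unscaled teamIF_test,
# which is why the score printed below differs from the accuracy in the report
print(team_svm.score(teamIF_test, teamOF_test))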

              precision    recall  f1-score   support

         0.0       1.00      0.81      0.89        42
         1.0       0.88      1.00      0.94        61

    accuracy                           0.92       103
   macro avg       0.94      0.90      0.92       103
weighted avg       0.93      0.92      0.92       103

0.5922330097087378




Here we train the SVM on a single team and then test on that same team. The weighted F1 is 0.92 and the weighted precision is about 0.93. We attribute the strong scores to testing on the team we trained on, on the assumption that the team will continue to perform in a similar way.
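
The weighted averages in the classification report above can be reproduced by hand from the per-class rows. The sketch below copies the rounded per-class values from that report; classes 0 and 1 are the encoded `WL_HOME` labels.

```python
# Per-class values copied from the classification report above.
support = {0: 42, 1: 61}
precision = {0: 1.00, 1: 0.88}
f1 = {0: 0.89, 1: 0.94}

total = sum(support.values())  # 103 test games

# A weighted average weights each class by its share of the test set.
weighted_precision = sum(precision[c] * support[c] for c in support) / total
weighted_f1 = sum(f1[c] * support[c] for c in support) / total

print(round(weighted_precision, 2), round(weighted_f1, 2))  # 0.93 0.92
```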

**Trained on single team tested on another team (without outlier removal)**

In [21]:
g = pull_team('GSW',df)
g = clean_team(g)
gswIF = g.drop(['WL_HOME', 'GAME_ID', 'GAME_DATE', 'TEAM_ABBREVIATION_HOME.1', 'TEAM_ABBREVIATION_HOME',
                    'TEAM_ABBREVIATION_AWAY'], axis=1)
gswOF = g.WL_HOME

Number of games
410


In [22]:
gswIF_train, gswIF_test, gswOF_train, gswOF_test = train_test_split(gswIF, gswOF, test_size=.25)
scale = MinMaxScaler()
scale.fit(gswIF_train)
gswIF_train_scale = scale.transform(gswIF_train)
gswIF_test_scale = scale.transform(gswIF_test)

In [23]:
pred = team_svm.predict(gswIF_test_scale)
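# note: this cross-validates on the MIN data (teamIF, teamOF), not on the GSW data
scores = cross_val_score(team_svm, teamIF, teamOF, cv=5)
print(classification_report(gswOF_test, pred))
# note: gswIF_test is unscaled here, so this score differs from the accuracy in the report
print(team_svm.score(gswIF_test, gswOF_test))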
print("score has %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

              precision    recall  f1-score   support

         0.0       1.00      0.85      0.92        47
         1.0       0.89      1.00      0.94        56

    accuracy                           0.93       103
   macro avg       0.94      0.93      0.93       103
weighted avg       0.94      0.93      0.93       103

0.5436893203883495
score has 0.57 accuracy with a standard deviation of 0.04




This time we test on a different team (GSW) from the one we trained on (MIN). In the run shown above, the weighted F1 (0.93), precision (0.94), and recall (0.93) are each within about one point of the same-team results, so the model trained on one team's games carried over to the other team with little change in this split.

**Trained on full league, tested on single team (with outlier removal)**

In [25]:
league = remove_outliers(league)
league_outlier_svm = SVC()

**Trained on single team tested on single team (with outlier removal)**

In [26]:
x = remove_outliers(x)
teamIF = x.drop(['WL_HOME', 'GAME_ID', 'GAME_DATE', 'TEAM_ABBREVIATION_HOME.1', 'TEAM_ABBREVIATION_HOME',
                     'TEAM_ABBREVIATION_AWAY'], axis=1)
teamOF = x.WL_HOME
teamIF_train, teamIF_test, teamOF_train, teamOF_test = train_test_split(teamIF, teamOF, test_size=.25)
scale = MinMaxScaler()
scale.fit(teamIF_train)
teamIF_train_scale = scale.transform(teamIF_train)
teamIF_test_scale = scale.transform(teamIF_test)
temp = SVC()
temp.fit(teamIF_train_scale, teamOF_train)
#pred = team_svm.predict(teamIF_test_scale)
pred = temp.predict(teamIF_test_scale)
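scores = cross_val_score(team_svm, teamIF, teamOF, cv=5)
print(classification_report(teamOF_test, pred))
# note: this scores the earlier team_svm on the unscaled teamIF_test, not temp on
# teamIF_test_scale, so it differs from the accuracy in the report
print(team_svm.score(teamIF_test, teamOF_test))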
print("score has %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

              precision    recall  f1-score   support

         0.0       0.86      1.00      0.92        30
         1.0       1.00      0.89      0.94        44

    accuracy                           0.93        74
   macro avg       0.93      0.94      0.93        74
weighted avg       0.94      0.93      0.93        74

0.5945945945945946
score has 0.53 accuracy with a standard deviation of 0.07




With outliers removed, the weighted precision, recall, and F1 (0.94, 0.93, 0.93) are each about one point higher than before removal (0.93, 0.92, 0.92). Removing the outliers appears to give the SVM cleaner and more reliable data to work with.

**Trained on single team tested on another team (with outlier removal)**

In [27]:
g = remove_outliers(g)
gswIF = g.drop(['WL_HOME', 'GAME_ID', 'GAME_DATE', 'TEAM_ABBREVIATION_HOME.1', 'TEAM_ABBREVIATION_HOME',
                    'TEAM_ABBREVIATION_AWAY'], axis=1)
gswOF = g.WL_HOME
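# note: gswIF_test_scale and gswOF_test still come from the split made before
# remove_outliers(g), so only the cross-validation score reflects the filtered data
pred = team_svm.predict(gswIF_test_scale)
scores = cross_val_score(team_svm, gswIF, gswOF, cv=5)
print(classification_report(gswOF_test, pred))
print(team_svm.score(gswIF_test, gswOF_test))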
print("score has %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

              precision    recall  f1-score   support

         0.0       1.00      0.85      0.92        47
         1.0       0.89      1.00      0.94        56

    accuracy                           0.93       103
   macro avg       0.94      0.93      0.93       103
weighted avg       0.94      0.93      0.93       103

0.5436893203883495
score has 0.63 accuracy with a standard deviation of 0.01




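With outlier removal, testing on a different team from the one used for training again gives a weighted F1 of 0.93, with a weighted precision of 0.94 and recall of 0.93. The report is identical to the one without outlier removal because the GSW test split was not regenerated after the outliers were dropped. The cross-validated accuracy on the filtered GSW data is 0.63 with a standard deviation of 0.01.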
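
Because `remove_outliers` drops rows, any train/test split made before the call still contains those rows. The sketch below illustrates regenerating the split after filtering; the toy table, the dropped row, and `random_state=0` are all assumptions for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for one team's games; row 5 plays the role of an outlier game.
games = pd.DataFrame({"PTS_HOME": [100, 102, 98, 101, 99, 160, 103, 97],
                      "WL_HOME":  [1, 1, 0, 1, 0, 1, 1, 0]})

# Split made before filtering: the outlier row lands in one of the two parts.
old_train, old_test = train_test_split(games, test_size=0.25, random_state=0)

# Drop the outlier row, then split again so both parts come from the filtered data.
filtered = games.drop(index=[5])
new_train, new_test = train_test_split(filtered, test_size=0.25, random_state=0)

in_old_split = 5 in old_train.index or 5 in old_test.index
in_new_split = 5 in new_train.index or 5 in new_test.index
print(in_old_split, in_new_split)  # True False
```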

## Results

Comparing the models in the runs shown above, the league-trained SVM reached 0.98 accuracy on the single-team test set, while the team-trained SVMs reached 0.92 to 0.93; part of that gap may come from the single-team models having less data to work with. Another interesting (though not surprising) result is the small improvement in precision, recall, and F1 after outliers were removed. Having cleaner data to work with and, more importantly, more data to work with should help us get more accurate results. We will continue to tune what we can to make the SVM as good as possible.

## Plans for Senior Project 2

Algorithm for estimating stats of nth future game
<br>
RSM ensemble
<br>
Front and back end of website for users
<br>
Database
<br>
Pipeline to pull stats from new games