## Modeling NBA Data

#### Overview:
- In `src/eda.ipynb`, several variables appear to have a strong relationship with `fpts`, the target variable I would like to predict.
- However, most of these variables are recorded alongside `fpts` and are not actually known prior to the game.
- While I could use medians/means of these variables as inputs, that causes a slew of other problems, mainly stemming from various forms of bias.
- Instead, I will create a classification model to determine likely outcomes without having to predict exact values.

#### Goals:
- The first part of this notebook covers cleaning the data to isolate the features I want and discard any features I do not think are necessary.
- This will be an iterative process, so I may return to include more features later on; therefore, I will write functions to accommodate this.
</br>

- Next, I will classify the outcomes for player performances.
- I will classify each performance into 3 bins to start, measured against that player's own distribution of `fpts`:
    - ***Bust*: Performance below the player's 40th-percentile outcome.**
    - ***Neutral*: Performance between the player's 40th- and 75th-percentile outcomes (inclusive).**
    - ***Boom*: Performance above the player's 75th-percentile outcome.**
- *Notes:*
    - *The reason for doing it this way is that, in fantasy sports, you would really like to predict when players will have "Boom" outcomes in order to perform better than other contestants.*
    - *These three categories are not evenly distributed. I don't think this will be an issue, but I may need to revisit.*
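
To make the three bins concrete, here is a minimal sketch of how one player's games could be labeled from percentile cutoffs. The function name `label_outcomes` and the point totals are hypothetical; only the 40%/75% defaults come from the description above.

```python
import numpy as np

def label_outcomes(fpts, bust=40, boom=75):
    """Label each game relative to the player's own percentile cutoffs."""
    fpts = np.asarray(fpts, dtype=float)
    # Upper boundary for a bust, lower boundary for a boom
    bust_upper, boom_lower = np.percentile(fpts, [bust, boom])
    return np.select(
        [fpts < bust_upper, fpts > boom_lower],
        ['bust', 'boom'],
        default='neutral',
    )

# Hypothetical fantasy point totals for a single player
games = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = label_outcomes(games)
print(list(zip(games, labels)))
```

Because the cutoffs are computed per player, a 30-point game can be a boom for a bench player and a bust for a star.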
  

In [1]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Custom settings for libraries used
import settings.custom

# Custom functions for dealing with local files
import utilities

#### Helper Functions

##### Functions that are not the main part of the code but are used to achieve the desired result.

In [18]:
from typing import Any, Callable, Sequence

def flatten(seq: Sequence[Sequence[Any]]) -> list[Any]:
    """
    Converts a nested or 2d sequence of any kind into a 1d list
    Example: flatten([('foo', 'bar'), ('baz', 'qux')]) -> ['foo', 'bar', 'baz', 'qux']
    """
    return [element for inner_seq in seq for element in inner_seq]

def percentile(n: int) -> Callable[[Sequence[float]], float]:
    """
    Returns a function that calculates the n% outcome for players, designed for use in .agg
    Example: df.groupby('name')['fpts'].agg([percentile(0), percentile(50), percentile(100)])
         -> Returns 3 columns, indexed by name, corresponding to the following outcomes: 0% (minimum), 50% (median), 100% (maximum)
         -> Common usage will be 25%, which roughly corresponds to floor, and 75%, which roughly corresponds to ceiling
    """
    def percentile_(arr):
        return np.percentile(arr, n)

    # Can create custom labels if wanted, as of now decided against
    # label = {25: 'floor', 50: 'median', 75: 'ceiling'}.get(n, f'{n}%')
    # percentile_.__name__ = label
    percentile_.__name__ = f'{n}%'
    return percentile_
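
As a quick illustration of why `percentile_.__name__` is set, the sketch below runs the factory through `.agg` on a toy DataFrame. The data is made up; the point is that the resulting column labels are `'0%'`, `'50%'`, and so on, which is what later lookups of the form `f'{n}%'` rely on.

```python
import numpy as np
import pandas as pd

def percentile(n):
    """Same factory as above, repeated so this block runs on its own."""
    def percentile_(arr):
        return np.percentile(arr, n)
    percentile_.__name__ = f'{n}%'
    return percentile_

# Hypothetical data: two players, three games each
toy = pd.DataFrame({
    'name': ['A', 'A', 'A', 'B', 'B', 'B'],
    'fpts': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

cutoffs = (toy
           .groupby('name')
           ['fpts']
           .agg(['count', percentile(0), percentile(50), percentile(100)])
          )
print(cutoffs)
```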

In [19]:
def prepare_dataset_for_model(season: str, **kwargs) -> None:
    """
    This function will do some basic cleaning and perform the classification as indicated above.
    Defaults to DraftKings Fantasy Points, but simply need to pass kwarg specifying site='fanduel' to change.
    In order to accomplish this, will use the custom function created in the above cell to calculate % outcomes for each player.
    As of right now, will not worry too much about exceptions for odd lineups or missing players, etc. But may revisit to adjust later.
    Instead of returning pandas DataFrame, I will save this as a new .csv file labeled clean.
    """

    site = kwargs.get('site', 'draftkings')
    df: pd.DataFrame = utilities.load_raw_dataset(season, site=site)

    # Will not specify target at this point, just all features I want included
    # To start, will only include information one would know before a game.
    # Will also include fantasy points (fpts), minutes played (mp), and usage (usg), as they are also important and could be used with fantasy points
    features: list[str] = [
        'name', 'starter', 'team', 'opp', 'home', 'mp', 'usg', 'fpts',
        *kwargs.get('features', [])
    ]

    # Truncate DataFrame to only include above features
    # Will also create a new column: fantasy points per minute (fppm) as another possible metric to use.
    # Maps well to player's fantasy point production / efficiency in cases where minutes played are irregular.
    df = (df
          [features]
          .assign(fppm=lambda df_: df_.fpts / df_.mp)
         )


    # In order to classify outcomes, first need to make data structure containing cutoffs for each player's outcomes
    # As of right now, will also get rid of players who have not played at least 5 games of 8 or more minutes. (Tweak later)

    

    bust = kwargs.get('bust', 40)
    boom = kwargs.get('boom', 75)
    
    outcomes = (df
                .loc[df['mp'] >= 8.0]
                .groupby('name')
                ['fpts']
                .agg(['count', percentile(bust), percentile(boom)])
               )

    # Get rid of players who don't meet sample size requirements
    drop_names = flatten([
        # Have played fewer than 5 games of at least 8 minutes
        list(outcomes.loc[outcomes['count'] < 5].index),
        # Have not played any games of at least 8 minutes
        [name_ for name_ in df['name'].drop_duplicates() if name_ not in outcomes.index]
    ])

    # .copy() so the outcome assignments below modify an independent DataFrame, not a view of the original
    df = df.loc[~df['name'].isin(drop_names)].copy()

    for name in outcomes.index:
        # Upper boundary for bust outcome, lower boundary for boom outcome
        # Only need these two values
        bust_upper, boom_lower = [outcomes.loc[name, f'{result}%'] for result in (bust, boom)]
        df.loc[(df['name'] == name) & (df['fpts'] < bust_upper), 'outcome'] = 'bust'
        df.loc[(df['name'] == name) & (df['fpts'] >= bust_upper) & (df['fpts'] <= boom_lower), 'outcome'] = 'neutral'
        df.loc[(df['name'] == name) & (df['fpts'] > boom_lower), 'outcome'] = 'boom'

    utilities.save_clean_dataset(df, season, site)

    return
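
The per-player loop above filters the full DataFrame three times for every name. One possible alternative, sketched here under assumptions, maps each row to its player's cutoffs and labels every row in one pass. The helper name `label_from_cutoffs` and the toy data are hypothetical; the `'40%'`/`'75%'` column labels mirror what `percentile` produces.

```python
import numpy as np
import pandas as pd

def label_from_cutoffs(df, outcomes, bust=40, boom=75):
    """Attach an 'outcome' column using per-player cutoffs indexed by name."""
    # Look up each row's cutoffs by player name
    bust_upper = df['name'].map(outcomes[f'{bust}%'])
    boom_lower = df['name'].map(outcomes[f'{boom}%'])
    labels = np.select(
        [df['fpts'] < bust_upper, df['fpts'] > boom_lower],
        ['bust', 'boom'],
        default='neutral',
    )
    return df.assign(outcome=labels)

# Hypothetical games and cutoffs for two players
games = pd.DataFrame({
    'name': ['A', 'A', 'A', 'B', 'B', 'B'],
    'fpts': [5.0, 20.0, 40.0, 5.0, 20.0, 40.0],
})
outcomes = pd.DataFrame(
    {'40%': [10.0, 25.0], '75%': [30.0, 50.0]},
    index=pd.Index(['A', 'B'], name='name'),
)

labeled = label_from_cutoffs(games, outcomes)
print(labeled)
```

Note that a player missing from `outcomes` would get `NaN` cutoffs and fall through to `'neutral'`, so this sketch assumes such players have already been dropped, as the function above does.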

In [20]:
SEASON = '2023-2024'
SITE = 'draftkings'

In [21]:
prepare_dataset_for_model(SEASON)

#### Quick test to view results. Looks good, so I can now move on to building the model.

In [27]:
test = utilities.load_clean_dataset(SEASON, SITE)

In [28]:
test.sample(10)

Unnamed: 0,name,starter,team,opp,home,mp,fpts,fppm,outcome
7478,Christian Braun,0,DEN,OKC,0,21.167,36.25,1.713,boom
3205,Isaiah Stewart,1,DET,HOU,1,28.65,26.25,0.916,neutral
1237,Nikola Jokic,1,DEN,PHI,1,36.667,60.5,1.65,neutral
7043,Kevin Durant,1,PHO,CHA,1,38.683,53.0,1.37,neutral
6788,Trey Murphy,0,NO,MEM,1,35.933,24.75,0.689,neutral
15702,Scottie Barnes,1,TOR,ATL,0,37.6,42.5,1.13,neutral
2374,Dennis Smith,0,BKN,ORL,1,20.7,39.75,1.92,boom
2133,Stanley Umude,0,DET,LAL,1,20.183,18.75,0.929,neutral
11745,Toumani Camara,1,POR,SA,1,21.067,9.25,0.439,bust
6378,Franz Wagner,1,ORL,BOS,0,34.317,34.25,0.998,bust


##### Model Imports

In [55]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold #, GridSearchCV
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    mean_squared_error
)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    MinMaxScaler,
    Normalizer,
    OneHotEncoder,
    PolynomialFeatures,
    StandardScaler
)

import xgboost as xgb

In [48]:
def parse_confusion_matrix(cm: np.ndarray) -> dict[str, int]:
    """
    Confusion matrix has form (rows are actual outcomes, columns are predicted outcomes):
    
    bust perceived as bust         bust perceived as neutral         bust perceived as boom
    neutral perceived as bust      neutral perceived as neutral      neutral perceived as boom
    boom perceived as bust         boom perceived as neutral         boom perceived as boom
    
    """

    # Order matches target_mapping in run_model: bust -> 0, neutral -> 1, boom -> 2
    labels = ('Bust', 'Neutral', 'Boom')

    # One entry per cell of the 3x3 matrix, e.g. 'Bust perceived as Neutral'
    return {
        f'{actual} perceived as {predicted}': int(cm[row, col])
        for row, actual in enumerate(labels)
        for col, predicted in enumerate(labels)
    }
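
For reading a 3x3 matrix laid out as in the docstring above, the sketch below derives per-class recall and precision directly from the counts. The counts are invented purely for illustration and are not results from this model.

```python
import numpy as np

labels = ('bust', 'neutral', 'boom')

# Hypothetical counts: rows are actual outcomes, columns are predicted outcomes
cm = np.array([
    [30,  8,  2],   # actual bust
    [15, 10,  5],   # actual neutral
    [10,  6,  4],   # actual boom
])

# Recall: of the games that truly were class k, how many were predicted as k
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: of the games predicted as class k, how many truly were k
precision = cm.diagonal() / cm.sum(axis=0)

for label, r, p in zip(labels, recall, precision):
    print(f'{label:>8}: recall={r:.2f}  precision={p:.2f}')
```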

In [66]:
def run_model(season: str, site: str, **kwargs):
    """
    Runs an XGBClassifier model on the dataset to predict players' outcomes.
    Will build a class with more fleshed-out functionality once the basic model is working.
    """

    df = utilities.load_clean_dataset(season, site)
    
    target = 'outcome'
    features = kwargs.get(
        'features',
        [column for column in df if column not in ('name', 'mp', 'fpts', 'fppm', target)]
    )

    cat_features = list(df[features].select_dtypes(exclude='number').columns)
    num_features = list(df[features].select_dtypes(include='number').columns)
    bin_features = [feature for feature in num_features if df[feature].value_counts().shape[0] == 2]

    for feature in bin_features:
        df[feature] = df[feature].astype('uint8')
        # if feature in num_features:
        #     num_features.remove(feature)

    target_mapping = {'bust': 0, 'neutral': 1, 'boom': 2}

    df[target] = df[target].map(lambda oc: target_mapping[oc])

    X = df[features]
    y = df[target]

    # Column Transformers
    ct = ColumnTransformer(
        transformers=[
            ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_features),
            ('ss', StandardScaler(), num_features),
            ('polynomial', PolynomialFeatures(include_bias=False), num_features) # Better training data score but worse overall,
        ],
        remainder='passthrough'
    )

    X_transformed = ct.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.2, train_size=0.8, random_state=42)

    clf = xgb.XGBClassifier()

    clf.fit(X_train, y_train)

    # Output model results
    print(f'\nModel score for training data for {target}: {clf.score(X_train, y_train):.4f}')
    print(f'Model score for testing data for {target}: {clf.score(X_test, y_test):.4f}\n') # --> returns identical result to sklearn.metrics.accuracy_score(y_test, clf.predict(X_test))

    cv_scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f'Mean cross-validation score: {cv_scores.mean():.4f}')

    # Similar to the above, but with 10 shuffled folds instead of 5 --> ideally want pretty similar result
    kfold = KFold(n_splits=10, shuffle=True)
    kf_cv_scores = cross_val_score(clf, X_train, y_train, cv=kfold)
    print(f'K-fold CV average score: {kf_cv_scores.mean():.4f}')

    y_pred = clf.predict(X_test)

    print('\nClassification Report:')
    print(classification_report(y_test, y_pred, target_names=target_mapping.keys()))
    
    return None
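
One thing to be aware of in `run_model` is that `ct.fit_transform(X)` is called before the train/test split, so the scaler sees test rows while fitting. A possible way to avoid this, sketched here under assumptions, is to wrap the transformer and classifier in a scikit-learn `Pipeline` so that preprocessing is fit on training data only. The random toy data and the `LogisticRegression` stand-in (used in place of `XGBClassifier` to keep the sketch dependency-light) are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical random data with the same kinds of columns as the clean dataset
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'team': rng.choice(['DEN', 'DET', 'ORL'], size=n),
    'starter': rng.integers(0, 2, size=n),
    'home': rng.integers(0, 2, size=n),
})
y = rng.integers(0, 3, size=n)  # 0 = bust, 1 = neutral, 2 = boom

preprocess = ColumnTransformer([
    ('ohe', OneHotEncoder(handle_unknown='ignore'), ['team']),
    ('ss', StandardScaler(), ['starter', 'home']),
])

# Stand-in classifier; any estimator with fit/predict could take its place
model = Pipeline([
    ('preprocess', preprocess),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Split first, then fit: the transformers only ever see training rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f'Test accuracy on random toy data: {accuracy:.4f}')
```

Passing a pipeline like this to `cross_val_score` would also refit the preprocessing inside each fold.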

In [67]:
run_model(SEASON, SITE)


Model score for training data for outcome: 0.5498
Model score for testing data for outcome: 0.4442

Mean cross-validation score: 0.4427
K-fold CV average score: 0.4438

Classification Report:
              precision    recall  f1-score   support

        bust       0.49      0.75      0.59      1519
     neutral       0.36      0.23      0.28      1127
        boom       0.32      0.15      0.20       740

    accuracy                           0.44      3386
   macro avg       0.39      0.37      0.36      3386
weighted avg       0.41      0.44      0.40      3386



In [63]:
run_model(SEASON, SITE)


Model score for training data for outcome: 0.5422
Model score for testing data for outcome: 0.4421

Mean cross-validation score: 0.4424
K-fold CV average score: 0.4397

Classification Report:
              precision    recall  f1-score   support

        bust       0.48      0.74      0.59      1519
     neutral       0.36      0.24      0.29      1127
        boom       0.32      0.14      0.19       740

    accuracy                           0.44      3386
   macro avg       0.39      0.37      0.36      3386
weighted avg       0.41      0.44      0.40      3386
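
For context on the accuracy figures above, it can help to compare them against a majority-class baseline, i.e. always predicting the most common outcome. The sketch below uses only the support counts printed in the classification report.

```python
# Support counts taken from the classification report above
support = {'bust': 1519, 'neutral': 1127, 'boom': 740}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline = support[majority_class] / total

print(f'Always predicting "{majority_class}" scores {baseline:.4f} on this test split')
```

On this split the baseline works out to roughly 0.4486, so test accuracies of 0.4442 and 0.4421 are about level with always predicting "bust". This suggests the current features may carry little signal, and that boom recall may be a more informative metric than overall accuracy.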

