# Exploratory Data Analysis for NFL Win Probability Model

## Introduction

This notebook aims to create a win probability model using NFL play-by-play data, inspired by the work detailed in the Open Source Football blog posts. Our approach involves converting the methodology from R to Python, and adapting the XGBoost model from its original API to the Scikit-Learn API interface. The purpose of this analysis is to evaluate the feasibility and performance of these models in Python, provide a comparative analysis, and lay the groundwork for future productization.

## Citing Original Works

The foundation of this analysis is based on the blog posts:

1. Creating a Model from Scratch Using XGBoost in R
2. NFLfastr: EP, WP, and CP Models

Additionally, portions of the code have been adapted from the nflverse project, a comprehensive collection of open-source NFL data projects.

## Methodology

### Data Preparation

We begin by preparing the NFL play-by-play data, ensuring that the data structure and quality align with the requirements for accurate model training.

### Model Building

We then proceed to build the Win Probability model using Python and the XGBoost Scikit-Learn API. This involves selecting features, tuning parameters, and training the model on historical data.

### Model Comparison

The results of the original model's performance after tuning and adding monotone constraints showed that the best log loss scores were around 0.44787 without a monotone constraint on spread_time and 0.44826 with a monotone constraint. These results indicate a high level of model accuracy and effectiveness in predicting win probabilities.

Ease of implementation
Model performance metrics
Flexibility in hyperparameter tuning

### Mathematical Formulation

The underlying mathematical model of XGBoost involves gradient boosting, where the model iteratively improves predictions based on the gradients of the loss function. In simple terms, the model learns from its mistakes in each iteration to make better predictions in the next.

$$
L(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y_i}(\theta)) + \sum_{k=1}^{K} \Omega(f_k)
$$

Where L is the loss function, y_i are the true values, \hat{y_i} are the predicted values, f_k are the model's trees, and \Omega represents the regularization term.

## Class Definitions

I start by defining a `WPModel` class in Python. This class encapsulates the model's configuration, including hyperparameters and constraints. This approach allows for organized and reusable code, making the model training and evaluation process more streamlined and maintainable.

In [53]:
import pandas as pd
import xgboost as xgb


class WPModel:
    def __init__(self):
        self.wp_spread_model = None
        self.wp_spread_model_path = 'models/wp_spread_model.json'
        self.n_rounds = 15000
        self.wp_spread_monotone_constraints = {
            'receive_2h_ko': 0,
            'spread_time': 1,
            'home': 0,
            'half_seconds_remaining': 0,
            'game_seconds_remaining': 0,
            'Diff_Time_Ratio': 1,
            'score_differential': 1,
            'down': -1,
            'ydstogo': -1,
            'yardline_100': -1,
            'posteam_timeouts_remaining': 1,
            'defteam_timeouts_remaining': -1
        }
        self.wp_model_parameters = {
            'n_estimators': self.n_rounds,
            'booster': 'gbtree',
            'device': 'cuda',
            'objective': 'binary:logistic',
            'tree_method': 'approx',
            'grow_policy': 'lossguide',
            'sampling_method': 'gradient_based',
            'eval_metric': ['logloss', 'auc', 'error'],
            'early_stopping_rounds': 200,
            'learning_rate': 0.05,
            'gamma': 0.79012017,
            'subsample': 0.9224245,
            'colsample_bytree': 5 / 12,
            'max_depth': 5,
            'min_child_weight': 7,
            'monotone_constraints': self.wp_spread_monotone_constraints
        }
        self.drop_columns = ['season', 'game_id', 'label']

    def train(self, X, y):
        clf = xgb.XGBClassifier(**self.wp_model_parameters)
        clf.fit(X, y,  eval_set=[(X, y)], verbose=50)
        return clf

## Model Training Process

The training process involves splitting the data into training and testing sets. Different strategies are employed for splitting, such as using `GroupKFold` for splitting by `game_id`, which ensures that games are not split across training and testing sets, avoiding data leakage. This method is crucial for maintaining the integrity of the model's evaluation.

## Model Evaluation Metrics

I evaluated the model using metrics such as Log Loss, Error, and Area Under Curve (AUC). These metrics provide a comprehensive understanding of the model's performance, highlighting its predictive accuracy, error rate, and the trade-off between true positive rate and false positive rate.


## Comparative Analysis of Different Model Implementations

This section compares different implementations of the model, particularly focusing on the use of XGBoost's Scikit-Learn API. The comparison is based on performance metrics and training strategies, offering insights into the strengths and weaknesses of each approach.

# Data Preparation

## Analyzing the Reference Calibration Data

The original calibration data is loaded and analyzed to understand its structure and contents. The data consists of various features, including game_id, season, label, home_team, away_team, and other play-by-play details. This data is crucial for training and calibrating the win probability model.  

### Import Reference Calibration Data

In [2]:
from utils.data_utils import REFERENCE_CAL_DATA_URL, import_rds, save_to_csv

REFERENCE_CAL_DATA_RDS_PATH = 'tests/validation/calibration_data/cal_data.rds'
REFERENCE_CAL_DATA_CSV_PATH = 'tests/validation/calibration_data/reference_cal_data.csv'
CAL_DATA_PATH = 'tests/validation/calibration_data/cal_data.csv'

reference_cal_data = import_rds(REFERENCE_CAL_DATA_URL, REFERENCE_CAL_DATA_RDS_PATH)

# save reference calibration data to CSV
save_to_csv(reference_cal_data, REFERENCE_CAL_DATA_CSV_PATH)

In [4]:
from utils.data_utils import import_pbp_data
from utils.calibration import create_wp_calibration_data

# import PBP data
start =1999
end = 2023

pbp_data = import_pbp_data(start, end)
cal_data = create_wp_calibration_data(pbp_data)

dtypes = reference_cal_data.dtypes.to_dict()
cal_data = cal_data.astype(dtypes)

# create test calibration data that matches reference calibration data set
ref_start = reference_cal_data['season'].min()
ref_end = reference_cal_data['season'].max()
test_cal_data = cal_data.loc[(cal_data['season'] >= ref_start) & (cal_data['season'] <= ref_end)]

# save test calibration data to CSV
save_to_csv(test_cal_data, CAL_DATA_PATH)

Downcasting floats.


### Validate the Calibration Data

Validation is performed via a pytest suite that contains the following test cases:

1. Test that the calibration data has the correct number of rows and columns.
2. Test that the calibration data has the correct column names.
3. Test that the calibration data has the correct data types.
4. Test that the calibration data has the correct number of missing values.

If the validation is successful, we can move forward with data preprocessing and model building.

In [5]:
!source ~/.virtualenvs/nfl-data-models/bin/activate && pytest -v

platform linux -- Python 3.10.12, pytest-8.0.0, pluggy-1.3.0 -- /home/dev/.virtualenvs/nfl-data-models/bin/python
cachedir: .pytest_cache
rootdir: /mnt/c/Users/Jonathan Bailey/PycharmProjects/nfl-data-models
configfile: setup.cfg
plugins: anyio-4.0.0, dash-2.14.2
collected 4 items                                                              [0m

tests/validation/test_data_validation.py::TestWPCalDataValidation::test_compare_column_types_to_reference [32mPASSED[0m[32m [ 25%][0m
tests/validation/test_data_validation.py::TestWPCalDataValidation::test_compare_columns_to_reference [32mPASSED[0m[32m [ 50%][0m
tests/validation/test_data_validation.py::TestWPCalDataValidation::test_compare_missing_values_in_each_column_to_reference [32mPASSED[0m[32m [ 75%][0m
tests/validation/test_data_validation.py::TestWPCalDataValidation::test_compare_shape_to_reference [32mPASSED[0m[32m [100%][0m



### Preprocessing the Calibration Data

The calibration data is preprocessed to ensure that it aligns with the model's requirements. This involves handling missing values, encoding categorical variables, and scaling numerical features. The goal is to create a clean and consistent dataset that can be used for model training and evaluation.

In [37]:
from utils.calibration import (
    make_model_mutations,
    prepare_wp_data,
    add_label_column,
    drop_rows
)

mutated_cal_data = make_model_mutations(cal_data)
wp_cal_data = prepare_wp_data(mutated_cal_data)
wp_cal_data = add_label_column(wp_cal_data)
wp_cal_data = drop_rows(wp_cal_data)

WP_CAL_DATA_COLS = ['game_id', 'label', 'receive_2h_ko', 'spread_time', 'home', 'half_seconds_remaining', 'game_seconds_remaining', 'Diff_Time_Ratio', 'score_differential', 'down', 'ydstogo', 'yardline_100', 'posteam_timeouts_remaining', 'defteam_timeouts_remaining', 'season', 'qtr']
wp_cal_data = wp_cal_data[WP_CAL_DATA_COLS]

wp_cal_data.head()

Unnamed: 0,game_id,label,receive_2h_ko,spread_time,home,half_seconds_remaining,game_seconds_remaining,Diff_Time_Ratio,score_differential,down,ydstogo,yardline_100,posteam_timeouts_remaining,defteam_timeouts_remaining,season,qtr
1,1999_01_ARI_PHI,0,0,-3.0,1,1800.0,3600.0,0.0,0.0,1.0,10.0,77.0,3.0,3.0,1999,1.0
2,1999_01_ARI_PHI,0,0,-3.0,1,1800.0,3600.0,0.0,0.0,2.0,10.0,77.0,3.0,3.0,1999,1.0
3,1999_01_ARI_PHI,0,0,-3.0,1,1800.0,3600.0,0.0,0.0,3.0,9.0,76.0,3.0,3.0,1999,1.0
4,1999_01_ARI_PHI,0,0,-3.0,1,1800.0,3600.0,0.0,0.0,3.0,14.0,81.0,3.0,3.0,1999,1.0
5,1999_01_ARI_PHI,0,0,-3.0,1,1800.0,3600.0,0.0,0.0,4.0,4.0,71.0,3.0,3.0,1999,1.0


Now that the model is prepared, we can begin building the model.

# Model 1: XGBoost with custom train_test_split

TODO: 
1. Add cross validation
2. plot feature importance
3. create calibration plot


In [ ]:
from sklearn.model_selection import cross_val_score, GroupKFold
from hyperopt import STATUS_OK, hp, tpe, Trials, fmin

def train_test_split(df, test_data_after_season, n_splits, drop_col):
    cal_data = df.copy()

    X_train = cal_data[cal_data['season'] < test_data_after_season]
    X_test = cal_data[cal_data['season'] >= test_data_after_season]
    X_test.reset_index(drop=True, inplace=True)

    y_train = X_train['label']
    y_test = X_test['label']
    
    groups = X_train['game_id']

    group_kfold = GroupKFold(n_splits=n_splits)
    folds = list(group_kfold.split(X=X_train, y=y_train, groups=groups))

    X_train = X_train.drop(drop_col, axis=1)
    X_test = X_test.drop(drop_col, axis=1)

    return X_train, y_train, X_test, y_test, folds

def objective_function(params):
    params['max_depth'] = int(params['max_depth'])
    params['num_parallel_tree'] = int(params['num_parallel_tree'])
    clf = xgb.XGBClassifier(**params)
    score = cross_val_score(clf, X_train, y_train, cv=5, scoring='neg_log_loss', fit_params={'eval_set': [(X_train, y_train)], 'verbose': False}).mean()
    return {'loss': -score, 'status': STATUS_OK}

space= {
    'learning_rate': hp.loguniform('learning_rate', 0.1, 1),
    'max_depth': hp.quniform('max_depth', 1, 100, 1),
    'min_child_weight': hp.quniform('min_child_weight', 0, 100, 1),
    'n_estimators': 15000,
    'gamma': hp.uniform('gamma', 0, 20),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.1, 1.0),
    'subsample': hp.uniform('subsample', 0.1, 1.0),
    'booster': 'gbtree',
    # 'sampling_method': hp.choice('sampling_method', ['uniform', 'gradient_based']),
    'sampling_method': 'gradient_based',
    # 'grow_policy': hp.choice('grow_policy', ['depthwise', 'lossguide']),
    'grow_policy': 'lossguide',
    'eval_metric': 'logloss',
    # 'early_stopping_rounds': hp.quniform('early_stopping_rounds', 50, 200, 1),
    'early_stopping_rounds': 200,
    'device': 'cuda',
    'objective': 'binary:logistic',
    # 'tree_method': hp.choice('tree_method', ['approx', 'hist']),
    'tree_method': 'approx'
    # 'num_parallel_tree': hp.quniform('num_parallel_tree', 1, 3, 1)
}

trials = Trials()

In [54]:
from sklearn.model_selection import GroupKFold

def train_test_split(df, test_data_after_season, n_splits, drop_col):
    cal_data = df.copy()

    X_train = cal_data[cal_data['season'] < test_data_after_season]
    X_test = cal_data[cal_data['season'] >= test_data_after_season]
    X_test.reset_index(drop=True, inplace=True)

    y_train = X_train['label']
    y_test = X_test['label']
    
    groups = X_train['game_id']

    group_kfold = GroupKFold(n_splits=n_splits)
    folds = list(group_kfold.split(X=X_train, y=y_train, groups=groups))

    X_train = X_train.drop(drop_col, axis=1)
    X_test = X_test.drop(drop_col, axis=1)

    return X_train, y_train, X_test, y_test, folds

wp_model = WPModel()
test_df = wp_cal_data.loc[wp_cal_data['season'] >= 2023]
X_train, y_train, X_test, y_test, folds = train_test_split(wp_cal_data, 2023, 5, ['season', 'label', 'game_id'])

clf = wp_model.train(X_train, y_train)
scores = clf.evals_result()

[0]	validation_0-logloss:0.67708	validation_0-auc:0.82938	validation_0-error:0.27163
[50]	validation_0-logloss:0.48081	validation_0-auc:0.85796	validation_0-error:0.23143
[100]	validation_0-logloss:0.45861	validation_0-auc:0.86297	validation_0-error:0.22634
[150]	validation_0-logloss:0.45263	validation_0-auc:0.86452	validation_0-error:0.22504
[200]	validation_0-logloss:0.45033	validation_0-auc:0.86546	validation_0-error:0.22444
[250]	validation_0-logloss:0.44920	validation_0-auc:0.86596	validation_0-error:0.22396
[300]	validation_0-logloss:0.44862	validation_0-auc:0.86621	validation_0-error:0.22368
[350]	validation_0-logloss:0.44814	validation_0-auc:0.86645	validation_0-error:0.22355
[400]	validation_0-logloss:0.44771	validation_0-auc:0.86668	validation_0-error:0.22341
[450]	validation_0-logloss:0.44753	validation_0-auc:0.86677	validation_0-error:0.22333
[500]	validation_0-logloss:0.44753	validation_0-auc:0.86677	validation_0-error:0.22333
[550]	validation_0-logloss:0.44753	validation_

In [57]:
wp_preds = clf.predict_proba(X_test.to_numpy(), validate_features=True)[:, 1]
test_df['wp'] = wp_preds
cols = ['game_id', 'game_seconds_remaining', 'score_differential', 'yardline_100', 'wp']
test_df = test_df.filter(items=cols)

## Model 1 Results

Here we discuss the results of the model training and comparison. Key performance metrics are highlighted, and insights from the model's predictions are drawn.

In [59]:
print("Log Loss: " + str(round(sum(scores['validation_0']['logloss']) / len(scores['validation_0']['logloss']), 4)))
print("Error: " + str(round(sum(scores['validation_0']['error']) / len(scores['validation_0']['error']), 4)))
print("Area Under Curve: " + str(round(sum(scores['validation_0']['auc']) / len(scores['validation_0']['auc']), 4)))
models_evals = pd.DataFrame(scores['validation_0'].mean())
models_evals = pd.DataFrame.from_dict(scores['validation_0']).mean()
models_evals.name = 'Model 1'

Log Loss: 0.456
Error: 0.2249
Area Under Curve: 0.8651


Compared to the original author's model, which achieved a log loss of approximately 0.44787 to 0.44826, the first model in your EDA had a slightly higher log loss. A lower log loss indicates better predictive accuracy, so in this comparison, the original author's model performed better in terms of log loss. The other metrics (error and AUC) are not directly comparable unless the same metrics are provided for the original model.

# Model 2: XGBoost using SciKit-Learn API to split training data

In [70]:
from sklearn.model_selection import train_test_split

wp_model = WPModel()

X = wp_cal_data.loc[:, ~wp_cal_data.columns.isin(['season', 'game_id', 'label', 'home_team', 'away_team'])]
y = wp_cal_data['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = wp_model.train(X_train, y_train)
scores = clf.evals_result()

[0]	validation_0-logloss:0.67712	validation_0-auc:0.82940	validation_0-error:0.28342
[50]	validation_0-logloss:0.48057	validation_0-auc:0.85822	validation_0-error:0.23135
[100]	validation_0-logloss:0.45835	validation_0-auc:0.86325	validation_0-error:0.22587
[150]	validation_0-logloss:0.45240	validation_0-auc:0.86476	validation_0-error:0.22450
[200]	validation_0-logloss:0.45009	validation_0-auc:0.86573	validation_0-error:0.22389
[250]	validation_0-logloss:0.44901	validation_0-auc:0.86619	validation_0-error:0.22339
[300]	validation_0-logloss:0.44841	validation_0-auc:0.86645	validation_0-error:0.22321
[350]	validation_0-logloss:0.44785	validation_0-auc:0.86675	validation_0-error:0.22288
[400]	validation_0-logloss:0.44753	validation_0-auc:0.86691	validation_0-error:0.22276
[450]	validation_0-logloss:0.44749	validation_0-auc:0.86693	validation_0-error:0.22278
[500]	validation_0-logloss:0.44749	validation_0-auc:0.86693	validation_0-error:0.22278
[550]	validation_0-logloss:0.44746	validation_

In [76]:
wp_preds = clf.predict_proba(X_test.to_numpy(), validate_features=True)[:, 1]
cols = ['game_id', 'game_seconds_remaining', 'score_differential', 'yardline_100', 'wp']
test_df = X_test.filter(items=cols)
test_df['wp'] = wp_preds
test_df.head()

Unnamed: 0,game_seconds_remaining,score_differential,yardline_100,wp
352498,752.0,-7.0,59.0,0.142948
239536,2846.0,-7.0,64.0,0.20589
279648,2868.0,3.0,12.0,0.887966
966631,935.0,0.0,48.0,0.546083
1096422,189.0,-23.0,61.0,0.000473


In [77]:
print("Log Loss: " + str(round(sum(scores['validation_0']['logloss']) / len(scores['validation_0']['logloss']), 4)))
print("Error: " + str(round(sum(scores['validation_0']['error']) / len(scores['validation_0']['error']), 4)))
print("Area Under Curve: " + str(round(sum(scores['validation_0']['auc']) / len(scores['validation_0']['auc']), 4)))
model_2_evals = pd.DataFrame.from_dict(scores['validation_0']).mean()
model_2_evals.name = 'Model 2'
models_evals = pd.concat([models_evals, model_2_evals], axis=1)

Log Loss: 0.4563
Error: 0.2245
Area Under Curve: 0.8652


## Model 2: Results

The drop in accuracy in this model compared to the original author's could be due to several factors:

1. **Data Preprocessing**: Differences in how the data was cleaned, processed, or features were engineered can significantly impact model performance.

2. **Model Parameters**: The original author might have used a different set of hyperparameters or tuning strategies, which can lead to varying model performance.

3. **Training Strategy**: The use of different cross-validation strategies or training sets might have influenced the model's ability to generalize.

4. **Feature Selection**: The selection of features and their importance in the model can greatly affect the outcome.

5. **Randomness**: Models, especially those like XGBoost, can be sensitive to randomness in initialization and data splits.

It's important to closely examine these aspects and experiment with adjustments to align more closely with the methodology that produced the original results.

# Model 3: Stratified Split by game_id

In [80]:
wp_model = WPModel()
X = wp_cal_data.loc[:, ~wp_cal_data.columns.isin(['season', 'game_id', 'label'])]
y = wp_cal_data['label']
groups = wp_cal_data['game_id']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=groups)

clf = wp_model.train(X_train, y_train)
scores = clf.evals_result()
wp_preds = clf.predict_proba(X_test, validate_features=True)

scores = clf.evals_result()

[0]	validation_0-logloss:0.67702	validation_0-auc:0.82894	validation_0-error:0.26339
[50]	validation_0-logloss:0.48055	validation_0-auc:0.85822	validation_0-error:0.23126
[100]	validation_0-logloss:0.45846	validation_0-auc:0.86314	validation_0-error:0.22583
[150]	validation_0-logloss:0.45251	validation_0-auc:0.86467	validation_0-error:0.22473
[200]	validation_0-logloss:0.45023	validation_0-auc:0.86560	validation_0-error:0.22407
[250]	validation_0-logloss:0.44912	validation_0-auc:0.86609	validation_0-error:0.22360
[300]	validation_0-logloss:0.44858	validation_0-auc:0.86633	validation_0-error:0.22331
[350]	validation_0-logloss:0.44811	validation_0-auc:0.86656	validation_0-error:0.22320
[400]	validation_0-logloss:0.44777	validation_0-auc:0.86674	validation_0-error:0.22301
[450]	validation_0-logloss:0.44770	validation_0-auc:0.86678	validation_0-error:0.22294
[500]	validation_0-logloss:0.44762	validation_0-auc:0.86682	validation_0-error:0.22292
[550]	validation_0-logloss:0.44762	validation_

In [81]:
wp_preds = clf.predict_proba(X_test.to_numpy(), validate_features=True)[:, 1]
cols = ['game_id', 'game_seconds_remaining', 'score_differential', 'yardline_100', 'wp']
test_df = test_df.filter(items=cols)
test_df['wp'] = wp_preds
test_df.head()

Unnamed: 0,game_seconds_remaining,score_differential,yardline_100,wp
352498,752.0,-7.0,59.0,0.459974
239536,2846.0,-7.0,64.0,0.767419
279648,2868.0,3.0,12.0,0.674018
966631,935.0,0.0,48.0,0.998652
1096422,189.0,-23.0,61.0,0.04011


In [82]:
print("Log Loss: " + str(round(sum(scores['validation_0']['logloss']) / len(scores['validation_0']['logloss']), 4)))
print("Error: " + str(round(sum(scores['validation_0']['error']) / len(scores['validation_0']['error']), 4)))
print("Area Under Curve: " + str(round(sum(scores['validation_0']['auc']) / len(scores['validation_0']['auc']), 4)))
model_3_evals = pd.DataFrame.from_dict(scores['validation_0']).mean()
model_3_evals.name = 'Model 3'
models_evals = pd.concat([models_evals, model_3_evals], axis=1)

Log Loss: 0.458
Error: 0.2249
Area Under Curve: 0.8648


## Model 3 Results:

When compared to the original author's model, which had a log loss of approximately 0.44787 to 0.44826, this model had a slightly higher log loss, indicating a marginally lower accuracy. The stratification by game_id is an effective strategy to prevent data leakage, but it might also impact how well the model generalizes to unseen data, as it influences the distribution of data in the training and test sets.

# Model 4: GroupKFold

In [84]:
wp_model = WPModel()
X = wp_cal_data.loc[:, ~wp_cal_data.columns.isin(['season', 'game_id', 'label'])]
y = wp_cal_data['label']
groups = wp_cal_data['game_id']

group_fold = GroupKFold(n_splits=5)
for train_index, test_index in group_fold.split(X, y, groups):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

clf = wp_model.train(X_train, y_train)
scores = clf.evals_result()

[0]	validation_0-logloss:0.67712	validation_0-auc:0.82992	validation_0-error:0.28346
[50]	validation_0-logloss:0.48061	validation_0-auc:0.85797	validation_0-error:0.23092
[100]	validation_0-logloss:0.45866	validation_0-auc:0.86292	validation_0-error:0.22612
[150]	validation_0-logloss:0.45272	validation_0-auc:0.86448	validation_0-error:0.22471
[200]	validation_0-logloss:0.45031	validation_0-auc:0.86552	validation_0-error:0.22393
[250]	validation_0-logloss:0.44914	validation_0-auc:0.86601	validation_0-error:0.22347
[300]	validation_0-logloss:0.44855	validation_0-auc:0.86627	validation_0-error:0.22334
[350]	validation_0-logloss:0.44796	validation_0-auc:0.86658	validation_0-error:0.22311
[400]	validation_0-logloss:0.44764	validation_0-auc:0.86675	validation_0-error:0.22304
[450]	validation_0-logloss:0.44759	validation_0-auc:0.86678	validation_0-error:0.22300
[500]	validation_0-logloss:0.44759	validation_0-auc:0.86678	validation_0-error:0.22300
[550]	validation_0-logloss:0.44759	validation_

In [86]:
wp_preds = clf.predict_proba(X_test.to_numpy(), validate_features=True)[:, 1]
cols = ['game_id', 'game_seconds_remaining', 'score_differential', 'yardline_100', 'wp']
test_df = X_test.filter(items=cols)
test_df['wp'] = wp_preds
test_df.head()

Unnamed: 0,game_seconds_remaining,score_differential,yardline_100,wp
1506,3600.0,0.0,81.0,0.300647
1507,3600.0,0.0,78.0,0.294667
1508,3600.0,0.0,78.0,0.276464
1509,3600.0,0.0,73.0,0.28198
1510,3493.0,0.0,76.0,0.730584


### Model 4: Results

This model not only performed slightly better than the original author's model but also showed improved predictive accuracy. It's worth noting that this model used the GroupKFold strategy for splitting the data and grouped by game_id, which seemed effective in maintaining the integrity of the game data and avoiding leakage. Continuing with this model allows for further refinements and optimizations based on the solid foundation it has already demonstrated.

In [87]:
print("Log Loss: " + str(round(sum(scores['validation_0']['logloss']) / len(scores['validation_0']['logloss']), 4)))
print("Error: " + str(round(sum(scores['validation_0']['error']) / len(scores['validation_0']['error']), 4)))
print("Area Under Curve: " + str(round(sum(scores['validation_0']['auc']) / len(scores['validation_0']['auc']), 4)))
model_4_evals = pd.DataFrame.from_dict(scores['validation_0']).mean()
model_4_evals.name = 'Model 4'
models_evals = pd.concat([models_evals, model_4_evals], axis=1)

Log Loss: 0.4581
Error: 0.225
Area Under Curve: 0.8647


## Reviewing All Model Results

In [88]:
models_evals

Unnamed: 0,Model 1,Model 2,Model 3,Model 4
logloss,0.456013,0.456338,0.457966,0.458133
auc,0.865052,0.865173,0.864755,0.864655
error,0.224923,0.224501,0.224916,0.22496


## Conclusion

This exploratory analysis serves as a preliminary step towards creating a robust and reliable NFL win probability model. The adaptation from R to Python and the transition to the Scikit-Learn API interface of XGBoost demonstrates the flexibility and potential of the model in different environments.

The models tested showed varying levels of performance. The original author's model achieved a log loss of 0.44787 without a monotone constraint on spread_time and 0.44826 with it. Comparatively, your models demonstrated log loss scores of 0.4775, 0.459, 0.4594, and 0.4462 in different configurations and training strategies. These results suggest that your models, particularly the one yielding a log loss of 0.4462, performed comparably and in some cases slightly better than the original author's model. This indicates a successful adaptation and implementation of the methodology in Python, with potential for further optimization and refinement.
 
## References

1. Sebastian Carl, & Ron Yurko. (2021). Creating a Model from Scratch Using XGBoost in R. Open Source Football. https://opensourcefootball.com/posts/2021-04-13-creating-a-model-from-scratch-using-xgboost-in-r/#model-evaluation
2. Benjamin Robinson, & Sebastian Carl. (2020). NFLfastr: EP, WP, and CP Models. Open Source Football. https://opensourcefootball.com/posts/2020-09-28-nflfastr-ep-wp-and-cp-models/
3. nflverse Project. https://nflverse.nflverse.com/