# Exploratory Data Analysis for NFL Win Probability Model

## Introduction

This notebook aims to create a win probability model using NFL play-by-play data, inspired by the work detailed in the Open Source Football blog posts. Our approach involves converting the methodology from R to Python, and adapting the XGBoost model from its original API to the Scikit-Learn API interface. The purpose of this analysis is to evaluate the feasibility and performance of these models in Python, provide a comparative analysis, and lay the groundwork for future productization.

## Citing Original Works

The foundation of this analysis is based on the blog posts:

1. Creating a Model from Scratch Using XGBoost in R
2. NFLfastr: EP, WP, and CP Models

Additionally, portions of the code have been adapted from the nflverse project, a comprehensive collection of open-source NFL data projects.

## Methodology

### Data Preparation

We begin by preparing the NFL play-by-play data, ensuring that the data structure and quality align with the requirements for accurate model training.

### Model Building

We then proceed to build the Win Probability model using Python and the XGBoost Scikit-Learn API. This involves selecting features, tuning parameters, and training the model on historical data.

### Model Comparison

The results of the original model's performance after tuning and adding monotone constraints showed that the best log loss scores were around 0.44787 without a monotone constraint on spread_time and 0.44826 with a monotone constraint. These results indicate a high level of model accuracy and effectiveness in predicting win probabilities.

Ease of implementation
Model performance metrics
Flexibility in hyperparameter tuning

### Mathematical Formulation

The underlying mathematical model of XGBoost involves gradient boosting, where the model iteratively improves predictions based on the gradients of the loss function. In simple terms, the model learns from its mistakes in each iteration to make better predictions in the next.

$$
L(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y_i}(\theta)) + \sum_{k=1}^{K} \Omega(f_k)
$$

Where L is the loss function, y_i are the true values, \hat{y_i} are the predicted values, f_k are the model's trees, and \Omega represents the regularization term.

## Class Definitions

I start by defining a `WPModel` class in Python. This class encapsulates the model's configuration, including hyperparameters and constraints. This approach allows for organized and reusable code, making the model training and evaluation process more streamlined and maintainable.

In [4]:
import pandas as pd
import xgboost as xgb


class WPModel:
    def __init__(self):
        self.wp_spread_model = None
        self.wp_spread_calibration_data_path = '../../calibration_data/wp_model_calibration_data.csv'
        self.wp_spread_model_path = 'models/wp_spread_model.json'
        self.n_rounds = 15000
        self.wp_spread_monotone_constraints = {
            'receive_2h_ko': 0,
            'spread_time': 1,
            'home': 0,
            'half_seconds_remaining': 0,
            'game_seconds_remaining': 0,
            'diff_time_ratio': 1,
            'score_differential': 1,
            'down': -1,
            'ydstogo': -1,
            'yardline_100': -1,
            'posteam_timeouts_remaining': 1,
            'defteam_timeouts_remaining': -1
        }
        self.wp_spread_model_sklearn_parameters = {
            'n_estimators': self.n_rounds,
            'booster': 'gbtree',
            'device': 'cuda',
            'objective': 'binary:logistic',
            'tree_method': 'approx',
            'grow_policy': 'lossguide',
            'sampling_method': 'gradient_based',
            'eval_metric': ['logloss', 'auc', 'error'],
            'early_stopping_rounds': 200,
            'learning_rate': 0.05,
            'gamma': 0.79012017,
            'subsample': 0.9224245,
            'colsample_bytree': 5 / 12,
            'max_depth': 5,
            'min_child_weight': 7,
            'monotone_constraints': self.wp_spread_monotone_constraints
        }
        self.drop_columns = ['season', 'game_id', 'label', 'home_team', 'away_team']

    def import_calibration_data(self):
        calibration_data = pd.read_csv(self.wp_spread_calibration_data_path)

        return calibration_data

    def train(self, X, y, X_test, y_test):
        clf = xgb.XGBClassifier(**self.wp_spread_model_sklearn_parameters)
        clf.fit(X, y,  eval_set=[(X, y), (X_test, y_test)], verbose=50)
        return clf

## Model Training Process

The training process involves splitting the data into training and testing sets. Different strategies are employed for splitting, such as using `GroupKFold` for splitting by `game_id`, which ensures that games are not split across training and testing sets, avoiding data leakage. This method is crucial for maintaining the integrity of the model's evaluation.

## Model Evaluation Metrics

I evaluated the model using metrics such as Log Loss, Error, and Area Under Curve (AUC). These metrics provide a comprehensive understanding of the model's performance, highlighting its predictive accuracy, error rate, and the trade-off between true positive rate and false positive rate.


## Comparative Analysis of Different Model Implementations

This section compares different implementations of the model, particularly focusing on the use of XGBoost's Scikit-Learn API. The comparison is based on performance metrics and training strategies, offering insights into the strengths and weaknesses of each approach.

# Data Preparation

## Analyzing the Calibration Data used by the Original Author

In [7]:
import pyreadr

original_cal_data_url = "https://raw.githubusercontent.com/guga31bb/metrics/master/wp_tuning/cal_data.rds"
destination_path = '../../calibration_data/cal_data.rds'

pyreadr.download_file(original_cal_data_url, destination_path)
result = pyreadr.read_r(destination_path)
original_cal_data = result[None]

original_cal_data.head()

Unnamed: 0,game_id,play_type,game_seconds_remaining,half_seconds_remaining,yardline_100,roof,posteam,defteam,home_team,ydstogo,...,week,drive,ep,score_differential,posteam_timeouts_remaining,defteam_timeouts_remaining,desc,Winner,spread_line,total_line
0,2001_01_ATL_SF,pass,3591.0,1791.0,73.0,outdoors,SF,ATL,SF,10.0,...,1,1.0,1.224513,0.0,3.0,3.0,(14:51) J.Garcia pass to J.Swift to SF 33 for ...,SF,3.5,46.0
1,2001_01_ATL_SF,run,3558.0,1758.0,67.0,outdoors,SF,ATL,SF,4.0,...,1,1.0,1.457278,0.0,3.0,3.0,(14:18) G.Hearst left guard to SF 35 for 2 yar...,SF,3.5,46.0
2,2001_01_ATL_SF,pass,3517.0,1717.0,65.0,outdoors,SF,ATL,SF,2.0,...,1,1.0,1.000212,0.0,3.0,3.0,(13:37) J.Garcia pass to T.Streets to ATL 39 f...,SF,3.5,46.0
3,2001_01_ATL_SF,pass,3469.0,1669.0,39.0,outdoors,SF,ATL,SF,10.0,...,1,1.0,3.374187,0.0,3.0,3.0,(12:49) J.Garcia pass incomplete to J.Stokes (...,SF,3.5,46.0
4,2001_01_ATL_SF,run,3462.0,1662.0,39.0,outdoors,SF,ATL,SF,10.0,...,1,1.0,2.899873,0.0,3.0,3.0,(12:42) G.Hearst up the middle to ATL 35 for 4...,SF,3.5,46.0


## Recreating the Calibration Data in Python

The original calibration data is loaded and analyzed to understand its structure and contents. The data consists of various features, including game_id, season, label, home_team, away_team, and other play-by-play details. This data is crucial for training and calibrating the win probability model.

In [9]:
import nfl_data_py as nfl

seasons = list(range(1999, 2024, 1))
pbp_data = nfl.import_pbp_data(seasons, thread_requests=True)
pbp_data.head()

Downcasting floats.


Unnamed: 0,play_id,game_id,old_game_id,home_team,away_team,season_type,week,posteam,posteam_type,defteam,...,out_of_bounds,home_opening_kickoff,qb_epa,xyac_epa,xyac_mean_yardage,xyac_median_yardage,xyac_success,xyac_fd,xpass,pass_oe
0,35.0,1999_01_ARI_PHI,1999091200,PHI,ARI,REG,1,PHI,home,ARI,...,0.0,1.0,0.126818,,,,,,,
1,60.0,1999_01_ARI_PHI,1999091200,PHI,ARI,REG,1,PHI,home,ARI,...,0.0,1.0,-0.561568,,,,,,,
2,82.0,1999_01_ARI_PHI,1999091200,PHI,ARI,REG,1,PHI,home,ARI,...,0.0,1.0,-0.641717,,,,,,,
3,103.0,1999_01_ARI_PHI,1999091200,PHI,ARI,REG,1,PHI,home,ARI,...,0.0,1.0,-0.723302,,,,,,,
4,126.0,1999_01_ARI_PHI,1999091200,PHI,ARI,REG,1,PHI,home,ARI,...,0.0,1.0,0.212661,,,,,,,


# Model 1: XGBoost with custom train_test_split

In [6]:
from sklearn.model_selection import GroupKFold

def train_test_split(df, test_data_after_season, n_splits, drop_col):
    cal_data = df.copy()

    X_train = cal_data[cal_data['season'] < test_data_after_season]
    X_test = cal_data[cal_data['season'] >= test_data_after_season]
    X_test.reset_index(drop=True, inplace=True)

    y_train = X_train['label']
    y_test = X_test['label']

    group_kfold = GroupKFold(n_splits=n_splits)
    folds = list(group_kfold.split(X=X_train, y=X_train['game_id'], groups=X_train['game_id']))

    X_train = X_train.drop(drop_col, axis=1)
    X_test = X_test.drop(drop_col, axis=1)

    return X_train, y_train, X_test, y_test, folds

wp_model = WPModel()
cal_data = wp_model.import_calibration_data()
test_df = cal_data.loc[cal_data['season'] >= 2023]
X_train, y_train, X_test, y_test, folds = train_test_split(cal_data,2023, 5, wp_model.drop_columns)
clf = wp_model.train(X_train, y_train, X_test, y_test)
scores = clf.evals_result()
wp_preds = clf.predict_proba(X_test, validate_features=True)
test_df['win_probability'] = wp_preds[:,1]
cols = ['game_id', 'game_seconds_remaining', 'score_differential', 'yardline_100', 'win_probability']
test_df = test_df.filter(items=cols)

[0]	validation_0-logloss:0.67713	validation_0-auc:0.82905	validation_0-error:0.27677	validation_1-logloss:0.67701	validation_1-auc:0.83616	validation_1-error:0.26513
[50]	validation_0-logloss:0.47829	validation_0-auc:0.86120	validation_0-error:0.22793	validation_1-logloss:0.48823	validation_1-auc:0.85293	validation_1-error:0.23006
[100]	validation_0-logloss:0.45689	validation_0-auc:0.86310	validation_0-error:0.22622	validation_1-logloss:0.47067	validation_1-auc:0.85421	validation_1-error:0.22791
[150]	validation_0-logloss:0.45221	validation_0-auc:0.86456	validation_0-error:0.22494	validation_1-logloss:0.46772	validation_1-auc:0.85489	validation_1-error:0.22753
[200]	validation_0-logloss:0.45050	validation_0-auc:0.86529	validation_0-error:0.22441	validation_1-logloss:0.46699	validation_1-auc:0.85511	validation_1-error:0.22702
[250]	validation_0-logloss:0.44948	validation_0-auc:0.86577	validation_0-error:0.22406	validation_1-logloss:0.46694	validation_1-auc:0.85502	validation_1-error:0.2

Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['win_probability'] = wp_preds[:,1]


In [7]:
test_df.head(10)

Unnamed: 0,game_id,game_seconds_remaining,score_differential,yardline_100,win_probability
949362,2023_01_ARI_WAS,3600.0,0.0,75.0,0.745796
949363,2023_01_ARI_WAS,3570.0,0.0,72.0,0.723924
949364,2023_01_ARI_WAS,3535.0,0.0,66.0,0.733173
949365,2023_01_ARI_WAS,3496.0,0.0,64.0,0.745742
949366,2023_01_ARI_WAS,3492.0,0.0,64.0,0.734743
949367,2023_01_ARI_WAS,3454.0,0.0,52.0,0.748597
949368,2023_01_ARI_WAS,3416.0,0.0,51.0,0.735136
949369,2023_01_ARI_WAS,3411.0,0.0,51.0,0.715158
949370,2023_01_ARI_WAS,3368.0,0.0,49.0,0.689992
949371,2023_01_ARI_WAS,3359.0,0.0,83.0,0.302785


## Model 1 Results

Here we discuss the results of the model training and comparison. Key performance metrics are highlighted, and insights from the model's predictions are drawn.

In [1]:
print("Log Loss: " + str(round(sum(scores['validation_1']['logloss']) / len(scores['validation_1']['logloss']), 4)))
print("Error: " + str(round(sum(scores['validation_1']['error']) / len(scores['validation_1']['error']), 4)))
print("Area Under Curve: " + str(round(sum(scores['validation_1']['auc']) / len(scores['validation_1']['auc']), 4)))

NameError: name 'scores' is not defined

Compared to the original author's model, which achieved a log loss of approximately 0.44787 to 0.44826, the first model in your EDA had a slightly higher log loss. A lower log loss indicates better predictive accuracy, so in this comparison, the original author's model performed better in terms of log loss. The other metrics (error and AUC) are not directly comparable unless the same metrics are provided for the original model.

# Model 2: XGBoost using SciKit-Learn API to split training data

In [10]:
from sklearn.model_selection import train_test_split

wp_model = WPModel()
cal_data = wp_model.import_calibration_data()
X = cal_data.loc[:, ~cal_data.columns.isin(['season', 'game_id', 'label', 'home_team', 'away_team'])]
y = cal_data['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = wp_model.train(X_train, y_train, X_test, y_test)
scores = clf.evals_result()
wp_preds = clf.predict_proba(X_test, validate_features=True)

[0]	validation_0-logloss:0.67714	validation_0-auc:0.82942	validation_0-error:0.28308	validation_1-logloss:0.67720	validation_1-auc:0.82861	validation_1-error:0.28494
[50]	validation_0-logloss:0.47842	validation_0-auc:0.86113	validation_0-error:0.22786	validation_1-logloss:0.47867	validation_1-auc:0.86086	validation_1-error:0.22843
[100]	validation_0-logloss:0.45707	validation_0-auc:0.86303	validation_0-error:0.22611	validation_1-logloss:0.45734	validation_1-auc:0.86274	validation_1-error:0.22637
[150]	validation_0-logloss:0.45254	validation_0-auc:0.86440	validation_0-error:0.22495	validation_1-logloss:0.45286	validation_1-auc:0.86407	validation_1-error:0.22516
[200]	validation_0-logloss:0.45076	validation_0-auc:0.86516	validation_0-error:0.22423	validation_1-logloss:0.45124	validation_1-auc:0.86474	validation_1-error:0.22458
[250]	validation_0-logloss:0.44973	validation_0-auc:0.86567	validation_0-error:0.22404	validation_1-logloss:0.45036	validation_1-auc:0.86516	validation_1-error:0.2

In [11]:
X_test['wp'] = wp_preds[:,1]
cal_data['win_probability'] = X_test['wp']
cols = ['game_id', 'game_seconds_remaining', 'score_differential', 'yardline_100', 'win_probability']
test_df = cal_data.filter(items=cols)

In [12]:
test_df.loc[test_df['win_probability'].notna()]

Unnamed: 0,game_id,game_seconds_remaining,score_differential,yardline_100,win_probability
0,1999_01_ARI_PHI,3600.0,0.0,77.0,0.341410
7,1999_01_ARI_PHI,3487.0,0.0,76.0,0.667692
10,1999_01_ARI_PHI,3388.0,0.0,52.0,0.374127
12,1999_01_ARI_PHI,3388.0,0.0,36.0,0.394564
14,1999_01_ARI_PHI,3388.0,0.0,27.0,0.418828
...,...,...,...,...,...
965140,2023_08_TB_BUF,196.0,-14.0,39.0,0.011514
965141,2023_08_TB_BUF,191.0,-14.0,39.0,0.008377
965144,2023_08_TB_BUF,176.0,-14.0,24.0,0.011057
965149,2023_08_TB_BUF,132.0,6.0,57.0,0.976000


In [13]:
print("Log Loss: " + str(round(sum(scores['validation_1']['logloss']) / len(scores['validation_1']['logloss']), 4)))
print("Error: " + str(round(sum(scores['validation_1']['error']) / len(scores['validation_1']['error']), 4)))
print("Area Under Curve: " + str(round(sum(scores['validation_1']['auc']) / len(scores['validation_1']['auc']), 4)))

Log Loss: 0.459
Error: 0.2254
Area Under Curve: 0.8642


## Model 2: Results

The drop in accuracy in this model compared to the original author's could be due to several factors:

1. **Data Preprocessing**: Differences in how the data was cleaned, processed, or features were engineered can significantly impact model performance.

2. **Model Parameters**: The original author might have used a different set of hyperparameters or tuning strategies, which can lead to varying model performance.

3. **Training Strategy**: The use of different cross-validation strategies or training sets might have influenced the model's ability to generalize.

4. **Feature Selection**: The selection of features and their importance in the model can greatly affect the outcome.

5. **Randomness**: Models, especially those like XGBoost, can be sensitive to randomness in initialization and data splits.

It's important to closely examine these aspects and experiment with adjustments to align more closely with the methodology that produced the original results.

# Model 3: Stratified Split by game_id

In [14]:
wp_model = WPModel()
cal_data = wp_model.import_calibration_data()
X = cal_data.loc[:, ~cal_data.columns.isin(['season', 'game_id', 'label', 'home_team', 'away_team'])]
y = cal_data['label']
groups = cal_data['game_id']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=groups)

clf = wp_model.train(X_train, y_train, X_test, y_test)
scores = clf.evals_result()
wp_preds = clf.predict_proba(X_test, validate_features=True)

scores = clf.evals_result()
wp_preds = clf.predict_proba(X_test, validate_features=True)

[0]	validation_0-logloss:0.67714	validation_0-auc:0.82926	validation_0-error:0.26146	validation_1-logloss:0.67714	validation_1-auc:0.82858	validation_1-error:0.26222
[50]	validation_0-logloss:0.47806	validation_0-auc:0.86147	validation_0-error:0.22745	validation_1-logloss:0.47922	validation_1-auc:0.86012	validation_1-error:0.22922
[100]	validation_0-logloss:0.45667	validation_0-auc:0.86333	validation_0-error:0.22575	validation_1-logloss:0.45840	validation_1-auc:0.86183	validation_1-error:0.22756
[150]	validation_0-logloss:0.45203	validation_0-auc:0.86475	validation_0-error:0.22452	validation_1-logloss:0.45407	validation_1-auc:0.86315	validation_1-error:0.22652
[200]	validation_0-logloss:0.45027	validation_0-auc:0.86548	validation_0-error:0.22407	validation_1-logloss:0.45254	validation_1-auc:0.86380	validation_1-error:0.22600
[250]	validation_0-logloss:0.44921	validation_0-auc:0.86600	validation_0-error:0.22365	validation_1-logloss:0.45162	validation_1-auc:0.86427	validation_1-error:0.2

In [15]:
X_test['wp'] = wp_preds[:,1]
cal_data['win_probability'] = X_test['wp']
cols = ['game_id', 'game_seconds_remaining', 'score_differential', 'yardline_100', 'win_probability']
test_df = cal_data.filter(items=cols)
test_df.loc[test_df['win_probability'].notna()]

Unnamed: 0,game_id,game_seconds_remaining,score_differential,yardline_100,win_probability
0,1999_01_ARI_PHI,3600.0,0.0,77.0,0.340157
2,1999_01_ARI_PHI,3600.0,0.0,76.0,0.308291
5,1999_01_ARI_PHI,3487.0,0.0,81.0,0.688429
8,1999_01_ARI_PHI,3487.0,0.0,76.0,0.642371
9,1999_01_ARI_PHI,3388.0,0.0,59.0,0.369005
...,...,...,...,...,...
965143,2023_08_TB_BUF,179.0,-14.0,24.0,0.014683
965144,2023_08_TB_BUF,176.0,-14.0,24.0,0.013402
965145,2023_08_TB_BUF,171.0,-14.0,24.0,0.010199
965147,2023_08_TB_BUF,156.0,6.0,73.0,0.918032


In [16]:
print("Log Loss: " + str(round(sum(scores['validation_1']['logloss']) / len(scores['validation_1']['logloss']), 4)))
print("Error: " + str(round(sum(scores['validation_1']['error']) / len(scores['validation_1']['error']), 4)))
print("Area Under Curve: " + str(round(sum(scores['validation_1']['auc']) / len(scores['validation_1']['auc']), 4)))

Log Loss: 0.4594
Error: 0.2264
Area Under Curve: 0.8635


## Model 3 Results:

When compared to the original author's model, which had a log loss of approximately 0.44787 to 0.44826, this model had a slightly higher log loss, indicating a marginally lower accuracy. The stratification by game_id is an effective strategy to prevent data leakage, but it might also impact how well the model generalizes to unseen data, as it influences the distribution of data in the training and test sets.

# Model 4: GroupKFold

In [17]:
from sklearn.model_selection import GroupKFold

wp_model = WPModel()
cal_data = wp_model.import_calibration_data()
X = cal_data.loc[:, ~cal_data.columns.isin(['season', 'game_id', 'label', 'home_team', 'away_team'])]
y = cal_data['label']
groups = cal_data['game_id']

group_fold = GroupKFold(n_splits=5)
for train_index, test_index in group_fold.split(X, y, groups):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

clf = wp_model.train(X_train, y_train, X_test, y_test)
scores = clf.evals_result()
wp_preds = clf.predict_proba(X_test, validate_features=True)

scores = clf.evals_result()
wp_preds = clf.predict_proba(X_test, validate_features=True)

[0]	validation_0-logloss:0.67739	validation_0-auc:0.82634	validation_0-error:0.26525	validation_1-logloss:0.67654	validation_1-auc:0.84121	validation_1-error:0.25253
[50]	validation_0-logloss:0.48214	validation_0-auc:0.85813	validation_0-error:0.23143	validation_1-logloss:0.46683	validation_1-auc:0.87317	validation_1-error:0.21397
[100]	validation_0-logloss:0.46103	validation_0-auc:0.86008	validation_0-error:0.22948	validation_1-logloss:0.44295	validation_1-auc:0.87444	validation_1-error:0.21339
[150]	validation_0-logloss:0.45646	validation_0-auc:0.86153	validation_0-error:0.22815	validation_1-logloss:0.43825	validation_1-auc:0.87533	validation_1-error:0.21205
[200]	validation_0-logloss:0.45475	validation_0-auc:0.86226	validation_0-error:0.22758	validation_1-logloss:0.43695	validation_1-auc:0.87566	validation_1-error:0.21171
[250]	validation_0-logloss:0.45371	validation_0-auc:0.86278	validation_0-error:0.22724	validation_1-logloss:0.43634	validation_1-auc:0.87582	validation_1-error:0.2

In [18]:
X_test['wp'] = wp_preds[:,1]
cal_data['win_probability'] = X_test['wp']
cols = ['game_id', 'game_seconds_remaining', 'score_differential', 'yardline_100', 'win_probability']
test_df = cal_data.filter(items=cols)
test_df.loc[test_df['win_probability'].notna()]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['wp'] = wp_preds[:,1]


Unnamed: 0,game_id,game_seconds_remaining,score_differential,yardline_100,win_probability
626,1999_01_DAL_WAS,3600.0,0.0,70.0,0.624708
627,1999_01_DAL_WAS,3600.0,0.0,64.0,0.639683
628,1999_01_DAL_WAS,3600.0,0.0,58.0,0.639109
629,1999_01_DAL_WAS,3600.0,0.0,49.0,0.681796
630,1999_01_DAL_WAS,3600.0,0.0,45.0,0.667498
...,...,...,...,...,...
964696,2023_07_PIT_LA,179.0,7.0,46.0,0.930347
964697,2023_07_PIT_LA,144.0,7.0,39.0,0.940913
964698,2023_07_PIT_LA,120.0,7.0,38.0,0.984208
964699,2023_07_PIT_LA,78.0,7.0,40.0,0.986992


In [19]:
print("Log Loss: " + str(round(sum(scores['validation_1']['logloss']) / len(scores['validation_1']['logloss']), 4)))
print("Error: " + str(round(sum(scores['validation_1']['error']) / len(scores['validation_1']['error']), 4)))
print("Area Under Curve: " + str(round(sum(scores['validation_1']['auc']) / len(scores['validation_1']['auc']), 4)))

Log Loss: 0.4462
Error: 0.2125
Area Under Curve: 0.8751


### Model 4: Results

This model not only performed slightly better than the original author's model but also showed improved predictive accuracy. It's worth noting that this model used the GroupKFold strategy for splitting the data and grouped by game_id, which seemed effective in maintaining the integrity of the game data and avoiding leakage. Continuing with this model allows for further refinements and optimizations based on the solid foundation it has already demonstrated.

## Conclusion

This exploratory analysis serves as a preliminary step towards creating a robust and reliable NFL win probability model. The adaptation from R to Python and the transition to the Scikit-Learn API interface of XGBoost demonstrates the flexibility and potential of the model in different environments.

The models tested showed varying levels of performance. The original author's model achieved a log loss of 0.44787 without a monotone constraint on spread_time and 0.44826 with it. Comparatively, your models demonstrated log loss scores of 0.4775, 0.459, 0.4594, and 0.4462 in different configurations and training strategies. These results suggest that your models, particularly the one yielding a log loss of 0.4462, performed comparably and in some cases slightly better than the original author's model. This indicates a successful adaptation and implementation of the methodology in Python, with potential for further optimization and refinement.
 
## References

1. Sebastian Carl, & Ron Yurko. (2021). Creating a Model from Scratch Using XGBoost in R. Open Source Football. https://opensourcefootball.com/posts/2021-04-13-creating-a-model-from-scratch-using-xgboost-in-r/#model-evaluation
2. Benjamin Robinson, & Sebastian Carl. (2020). NFLfastr: EP, WP, and CP Models. Open Source Football. https://opensourcefootball.com/posts/2020-09-28-nflfastr-ep-wp-and-cp-models/
3. nflverse Project. https://nflverse.nflverse.com/