<a href="https://colab.research.google.com/github/kkettip/game_outcome_prediction/blob/main/SBU_Football_Game_Outcome_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install mljar-supervised

Collecting mljar-supervised
  Downloading mljar-supervised-1.1.9.tar.gz (127 kB)
  Preparing metadata (setup.py) ... done
Collecting scipy<=1.11.4,>=1.6.1 (from mljar-supervised)
  Downloading scipy-1.11.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting catboost>=0.24.4 (from mljar-supervised)
  Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Collecting dtreeviz>=2.2.2 (from mljar-supervised)
  Downloading dtreeviz-2.2.2-py3-none-any.whl.metadata (2.4 kB)
Collecting shap>=0.42.1 (from mljar-supervised)
  Downloading shap-0.46.0-cp310-cp310-manyli

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from supervised.automl import AutoML

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



In [3]:
import supervised
print(supervised.__version__)

1.1.9


In [4]:
#load SBU football metric and game outcome data (the game dates shown below fall in 2022)
d = pd.read_csv(
    "/content/AI_ML Project (For temp ML).csv",
)


# **EDA**

In [5]:
#checking for column names
d.columns

Index(['Metrics', 'Metric Value', 'Metric Date', 'SBUID', 'Game Date',
       'Opponent', 'Game Outcome'],
      dtype='object')

In [6]:
d.head(5)

Unnamed: 0,Metrics,Metric Value,Metric Date,SBUID,Game Date,Opponent,Game Outcome
0,accel_load_accum,1009.296762,2022-08-28 20:00:00,62780237.0,09/01/2022,RHODE ISLAND,L
1,accel_load_accum,578.703132,2022-08-30 17:30:45,13562003.0,09/01/2022,RHODE ISLAND,L
2,accel_load_accum,1552.911852,2022-08-25 20:30:00,85248416.0,09/01/2022,RHODE ISLAND,L
3,accel_load_accum,344.091141,2022-08-26 19:00:29,57639645.0,09/01/2022,RHODE ISLAND,L
4,accel_load_accum,954.471083,2022-08-27 16:30:32,86064087.0,09/01/2022,RHODE ISLAND,L


In [7]:
#checking the data
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 970 entries, 0 to 969
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Metrics       212 non-null    object 
 1   Metric Value  212 non-null    float64
 2   Metric Date   212 non-null    object 
 3   SBUID         212 non-null    float64
 4   Game Date     212 non-null    object 
 5   Opponent      212 non-null    object 
 6   Game Outcome  212 non-null    object 
dtypes: float64(2), object(5)
memory usage: 53.2+ KB


In [8]:
#Checking dataset for the percentage of L (loss) and W (win)
print (f'{round(d["Game Outcome"].value_counts(normalize=True)*100,2)}')

Game Outcome
L    74.53
W    25.47
Name: proportion, dtype: float64


In [9]:
#checking for null values
d.isna().sum()

Unnamed: 0,0
Metrics,758
Metric Value,758
Metric Date,758
SBUID,758
Game Date,758
Opponent,758
Game Outcome,758


In [10]:
# checking for duplicates
d.duplicated().sum()

757

In [11]:
#drop rows with NaN
df = d.dropna()

In [12]:
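#drop duplicates
#NOTE: this starts from d again, so the dropna() result above is discarded; the all-NaN rows collapse into a single row that stays in df (df.drop_duplicates() would remove it as well)
df = d.drop_duplicates()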

In [13]:
#Checking dataset for the percentage of L (loss) and W (win) after dropping rows with NaN
print (f'{round(df["Game Outcome"].value_counts(normalize=True)*100,2)}')

Game Outcome
L    74.53
W    25.47
Name: proportion, dtype: float64


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 213 entries, 0 to 212
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Metrics       212 non-null    object 
 1   Metric Value  212 non-null    float64
 2   Metric Date   212 non-null    object 
 3   SBUID         212 non-null    float64
 4   Game Date     212 non-null    object 
 5   Opponent      212 non-null    object 
 6   Game Outcome  212 non-null    object 
dtypes: float64(2), object(5)
memory usage: 13.3+ KB
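
The `df.info()` output above lists 213 entries but only 212 non-null values per column, which suggests that one all-NaN row survived the cleaning steps. The sketch below uses made-up data (the frame `raw` and its values are assumptions for illustration, not the project CSV) to show how restarting from the raw frame keeps one empty row, while chaining the two steps removes it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw CSV: two real rows followed by three empty rows.
raw = pd.DataFrame(
    {
        "Metrics": ["speed_avg", "physio_load", np.nan, np.nan, np.nan],
        "Metric Value": [1.5, 800.0, np.nan, np.nan, np.nan],
        "Game Outcome": ["W", "L", np.nan, np.nan, np.nan],
    }
)

# Restarting from `raw` discards the dropna() result: the empty rows are
# duplicates of each other, so exactly one of them is kept.
overwritten = raw.dropna()
overwritten = raw.drop_duplicates()

# Chaining both steps removes the empty rows as well as the duplicates.
chained = raw.dropna().drop_duplicates().reset_index(drop=True)

print(len(overwritten), len(chained))
```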


In [15]:
# Convert dates to a common format (e.g., 'YYYY-MM-DD')
df['Metric Date'] = pd.to_datetime(df['Metric Date'])
df['Game Date'] = pd.to_datetime(df['Game Date'])


df['Metric Date'] = df['Metric Date'].dt.strftime('%Y-%m-%d')
df['Game Date'] = df['Game Date'].dt.strftime('%Y-%m-%d')
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Unnamed: 0,Metrics,Metric Value,Metric Date,SBUID,Game Date,Opponent,Game Outcome
0,accel_load_accum,1009.296762,2022-08-28,62780237.0,2022-09-01,RHODE ISLAND,L
1,accel_load_accum,578.703132,2022-08-30,13562003.0,2022-09-01,RHODE ISLAND,L
2,accel_load_accum,1552.911852,2022-08-25,85248416.0,2022-09-01,RHODE ISLAND,L
3,accel_load_accum,344.091141,2022-08-26,57639645.0,2022-09-01,RHODE ISLAND,L
4,accel_load_accum,954.471083,2022-08-27,86064087.0,2022-09-01,RHODE ISLAND,L
...,...,...,...,...,...,...,...
208,speed_avg,620.024008,2022-11-01,35468721.0,2022-11-05,MORGAN ST.,W
209,speed_avg,501.249084,2022-11-01,41254397.0,2022-11-05,MORGAN ST.,W
210,speed_avg,829.257278,2022-11-04,91865664.0,2022-11-05,MORGAN ST.,W
211,speed_avg,365.350000,2022-11-02,21362276.0,2022-11-05,MORGAN ST.,W


In [16]:
df.head(5)

Unnamed: 0,Metrics,Metric Value,Metric Date,SBUID,Game Date,Opponent,Game Outcome
0,accel_load_accum,1009.296762,2022-08-28,62780237.0,2022-09-01,RHODE ISLAND,L
1,accel_load_accum,578.703132,2022-08-30,13562003.0,2022-09-01,RHODE ISLAND,L
2,accel_load_accum,1552.911852,2022-08-25,85248416.0,2022-09-01,RHODE ISLAND,L
3,accel_load_accum,344.091141,2022-08-26,57639645.0,2022-09-01,RHODE ISLAND,L
4,accel_load_accum,954.471083,2022-08-27,86064087.0,2022-09-01,RHODE ISLAND,L
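
The date normalization in cell 15 can be sketched on a toy frame as follows. The two input formats are taken from the `head()` output earlier; the explicit `format` strings, the `.copy()` call and the `Days Before Game` column are assumptions added for illustration and are not part of the original notebook. Working on an explicit copy is one common way to avoid the `SettingWithCopyWarning` shown above.

```python
import pandas as pd

# Toy rows in the two raw formats seen in the CSV (values copied from head()).
toy = pd.DataFrame(
    {
        "Metric Date": ["2022-08-28 20:00:00", "2022-08-30 17:30:45"],
        "Game Date": ["09/01/2022", "09/01/2022"],
    }
)

# In the notebook, df is derived from another frame; an explicit copy makes
# the column assignments below unambiguous.
toy = toy.copy()

# An explicit format avoids any ambiguity about month/day order.
toy["Metric Date"] = pd.to_datetime(toy["Metric Date"], format="%Y-%m-%d %H:%M:%S")
toy["Game Date"] = pd.to_datetime(toy["Game Date"], format="%m/%d/%Y")

# Hypothetical extra feature: whole days between the measurement and the game.
toy["Days Before Game"] = (toy["Game Date"] - toy["Metric Date"].dt.normalize()).dt.days

# Format as 'YYYY-MM-DD' strings only at the end, for display.
toy["Metric Date"] = toy["Metric Date"].dt.strftime("%Y-%m-%d")
toy["Game Date"] = toy["Game Date"].dt.strftime("%Y-%m-%d")
print(toy)
```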


# **Model 1**
Variable excluded from X: Game Outcome

In [17]:
# defining X and Y
X = df[df.columns[:-1]] # Include all columns except the last one that is game outcome
y = df["Game Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # Hold out 20% of the rows as the test set (no random_state, so the split differs on every run)

In [18]:
# viewing data in X
X_train.head()

Unnamed: 0,Metrics,Metric Value,Metric Date,SBUID,Game Date,Opponent
174,physio_intensity,6.551718,2022-10-18,31095653.0,2022-10-22,MAINE
85,physio_load,861.032739,2022-09-19,68390914.0,2022-09-24,at Richmond
126,accel_load_accum,654.453088,2022-09-26,33257042.0,2022-10-08,at New Hampshire
130,distance_total,6864.769512,2022-09-26,43395033.0,2022-10-08,at New Hampshire
199,physio_intensity,50.523722,2022-10-30,33440256.0,2022-11-05,MORGAN ST.


In [19]:
X_test.head()

Unnamed: 0,Metrics,Metric Value,Metric Date,SBUID,Game Date,Opponent
90,speed_avg,0.8113,2022-09-22,63207963.0,2022-09-24,at Richmond
110,physio_intensity,6.799728,2022-09-26,88087868.0,2022-10-01,WILLIAM & MARY
97,accel_load_accum,379.128362,2022-09-24,32265668.0,2022-10-01,WILLIAM & MARY
135,metabolic_power_avg,295.254003,2022-09-25,46115948.0,2022-10-08,at New Hampshire
181,physio_load,797.430693,2022-10-21,75364006.0,2022-10-22,MAINE


In [20]:
# viewing data in Y
y_train.head()

Unnamed: 0,Game Outcome
174,W
85,L
126,L
130,L
199,W


In [21]:
y_test.head()

Unnamed: 0,Game Outcome
90,L
110,L
97,L
135,L
181,W


In [22]:
#checking the shape
X_train.shape, X_test.shape

((170, 6), (43, 6))
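
Because the split in cell 17 sets no `random_state`, each run produces a different test set, and with a 75/25 class imbalance the share of wins in the test set can drift. A minimal sketch of a reproducible, stratified split on toy data (the frame below is made up; `random_state=42` and `stratify=y` are suggestions, not what the notebook ran):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with roughly the loss/win imbalance seen in the notebook.
toy = pd.DataFrame(
    {
        "Metric Value": range(40),
        "Game Outcome": ["L"] * 30 + ["W"] * 10,
    }
)
X = toy[["Metric Value"]]
y = toy["Game Outcome"]

# random_state fixes the shuffle; stratify=y keeps the L/W ratio the same
# in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```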

In [23]:
#training the model
automl = AutoML()
automl.fit(X_train, y_train)



AutoML directory: AutoML_1
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 3 models
1_Baseline logloss 0.592287 trained in 1.19 seconds
log_loss_eps() got an unexpected keyword argument 'response_method'
Problem during computing permutation importance. Skipping ...




2_DecisionTree logloss 0.370753 trained in 7.07 seconds
log_loss_eps() got an unexpected keyword argument 'response_method'
Problem during computing permutation importance. Skipping ...
3_Linear logloss 0.021416 trained in 12.42 seconds
* Step default_algorithms will try to check up to 3 models
log_loss_eps() got an unexpected keyword argument 'response_method'
Problem during computing permutation importance. Skipping ...
4_Default_Xgboost logloss 0.096781 trained in 1.68 seconds
log_loss_eps() got an unexpected keyword argument 'response_method'
Problem during computing permutation importance. Skipping ...
5_Default_NeuralNetwork logloss 0.121347 trained in 2.05 seconds
log_loss_eps() got an unexpected keyword argument 'response_method'
Problem during computing permutation importance. Skipping ...
6_Default_RandomForest logloss 0.0 trained in 2.73 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 0.0 trained in 1.88 seconds
AutoML fit time: 53.4 seconds
AutoML b

In [24]:
# predicting game win or loss on test set
prediction = automl.predict(X_test)

In [25]:
# Convert the NumPy array to a Pandas Series for easier inspection
prediction_series = pd.Series(prediction)

# Display the first few elements of the Series
print(prediction_series.head())

0    L
1    L
2    L
3    L
4    W
dtype: object


In [26]:
# Convert the NumPy array to a Pandas Series
prediction_series = pd.Series(prediction)

# Display the last few elements of the Series
print(prediction_series.tail())

38    L
39    L
40    L
41    L
42    L
dtype: object


Checking the shape

In [27]:
X_test.shape

(43, 6)

In [28]:
prediction.shape

(43,)

In [29]:
y_test.shape

(43,)

In [30]:
# Evaluating the model

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, prediction)
print(accuracy)

1.0


In [31]:
#Evaluating the model
from sklearn.metrics import classification_report
# Classes are reported in sorted label order ('L', then 'W'), so the first name must describe 'L' (an SBU loss)
report = classification_report(y_test, prediction, target_names=['Opponent Win', 'SBU Win'])

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)

Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support

Opponent Win       1.00      1.00      1.00        35
     SBU Win       1.00      1.00      1.00         8

    accuracy                           1.00        43
   macro avg       1.00      1.00      1.00        43
weighted avg       1.00      1.00      1.00        43
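
`classification_report` lists classes in sorted label order and applies `target_names` positionally, so with the labels `L` and `W` the first name is attached to `L` (an SBU loss). A minimal sketch on toy labels (not the notebook's predictions) that makes the mapping explicit by passing `labels` as well:

```python
from sklearn.metrics import classification_report

# Toy labels using the notebook's encoding: 'L' = SBU loss, 'W' = SBU win.
y_true = ["L", "L", "L", "W"]
y_pred = ["L", "L", "L", "W"]

# Passing `labels` pins the class order, and `target_names` follows it
# position by position: 'L' -> 'Opponent Win', 'W' -> 'SBU Win'.
report = classification_report(
    y_true,
    y_pred,
    labels=["L", "W"],
    target_names=["Opponent Win", "SBU Win"],
    output_dict=True,
)

print(report["Opponent Win"]["support"])
print(report["SBU Win"]["support"])
```

With `output_dict=True` the per-class rows can be looked up by name, which makes a swapped label easy to spot by comparing the support with `value_counts()`.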



# **Model 2**
Variables excluded from X: Game Outcome, SBUID, Game Date, Metric Date and Opponent.

In [32]:
# df is already a DataFrame, so no conversion is needed

# Keep only 'Metrics' and 'Metric Value' as features
X = df.drop(['Game Outcome', 'SBUID', 'Game Date', 'Metric Date', 'Opponent'], axis=1)
y = df['Game Outcome']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create an AutoML instance
automl = AutoML()

# Fit the model
automl.fit(X_train, y_train)

# Predict on the test set
prediction = automl.predict(X_test)



AutoML directory: AutoML_2
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 3 models




1_Baseline logloss 0.542655 trained in 0.86 seconds
log_loss_eps() got an unexpected keyword argument 'response_method'
Problem during computing permutation importance. Skipping ...




2_DecisionTree logloss 1.903914 trained in 6.16 seconds
log_loss_eps() got an unexpected keyword argument 'response_method'
Problem during computing permutation importance. Skipping ...
3_Linear logloss 0.540943 trained in 4.59 seconds
* Step default_algorithms will try to check up to 3 models
log_loss_eps() got an unexpected keyword argument 'response_method'
Problem during computing permutation importance. Skipping ...
4_Default_Xgboost logloss 0.506942 trained in 2.11 seconds
log_loss_eps() got an unexpected keyword argument 'response_method'
Problem during computing permutation importance. Skipping ...
5_Default_NeuralNetwork logloss 0.54516 trained in 2.68 seconds
log_loss_eps() got an unexpected keyword argument 'response_method'
Problem during computing permutation importance. Skipping ...
6_Default_RandomForest logloss 0.54066 trained in 3.12 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 0.502255 trained in 2.18 seconds
AutoML fit time: 40.57 seconds
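
To put the logloss values in the training logs in context: a model that always predicts the class proportions has a logloss equal to the entropy of the class distribution. The sketch below computes that value from the 74.53% / 25.47% split reported in cell 8. It is a reference point only; the Baseline rows in the logs are computed on AutoML's own validation split, so they are not expected to match it exactly.

```python
import math

# Class proportions reported by value_counts(normalize=True) in cell 8.
p_loss = 0.7453
p_win = 0.2547

# Logloss of a model that always predicts these proportions
# (natural logarithm, the usual convention for logloss).
prior_logloss = -(p_loss * math.log(p_loss) + p_win * math.log(p_win))

print(f"logloss of the class-prior predictor: {prior_logloss:.3f}")
```

A model whose logloss is only slightly below this value has learned little beyond the class imbalance.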


In [33]:
# Evaluate the model
accuracy = accuracy_score(y_test, prediction)

# Classes are reported in sorted label order ('L', then 'W'), so the first name must describe 'L' (an SBU loss)
report = classification_report(y_test, prediction, target_names=['Opponent Win', 'SBU Win'])

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)

Accuracy: 0.67
Classification Report:
              precision    recall  f1-score   support

Opponent Win       0.66      1.00      0.79        27
     SBU Win       1.00      0.12      0.22        16

    accuracy                           0.67        43
   macro avg       0.83      0.56      0.51        43
weighted avg       0.79      0.67      0.58        43



# **Checking SBU football game outcome data**

In [34]:
import pandas as pd

In [35]:
#Year 2023

df = pd.read_html("https://www.espn.com/college-football/team/schedule/_/id/2619/season/2023")

In [36]:
df

[                 0                 1               2               3  \
 0   Regular Season    Regular Season  Regular Season  Regular Season   
 1             DATE          OPPONENT          RESULT      W-L (CONF)   
 2      Thu, Aug 31       vs Delaware          L37-13       0-1 (0-1)   
 3       Fri, Sep 8    @ Rhode Island          L35-14       0-2 (0-2)   
 4      Sat, Sep 16  @ Arkansas State           L31-7       0-3 (0-2)   
 5      Sat, Sep 23       vs Richmond          L20-19       0-4 (0-3)   
 6      Sat, Sep 30           @ Maine          L56-28       0-5 (0-4)   
 7       Sat, Oct 7    @ Morgan State        Canceled        Canceled   
 8      Sat, Oct 14        vs Fordham           L26-7       0-6 (0-4)   
 9      Sat, Oct 21  vs New Hampshire          L45-14       0-7 (0-5)   
 10     Sat, Oct 28       @ Villanova          L48-13       0-8 (0-6)   
 11      Sat, Nov 4        @ Monmouth          L56-17       0-9 (0-7)   
 12     Sat, Nov 11        vs UAlbany          L38-

In [37]:
type(df)

list

In [38]:
df[0]

Unnamed: 0,0,1,2,3,4,5,6,7
0,Regular Season,Regular Season,Regular Season,Regular Season,Regular Season,Regular Season,Regular Season,Regular Season
1,DATE,OPPONENT,RESULT,W-L (CONF),HI PASS,HI RUSH,HI REC,
2,"Thu, Aug 31",vs Delaware,L37-13,0-1 (0-1),Case 163,Martin 63,Freeman 62,
3,"Fri, Sep 8",@ Rhode Island,L35-14,0-2 (0-2),Case 139,Martin 42,Johnson 54,
4,"Sat, Sep 16",@ Arkansas State,L31-7,0-3 (0-2),Case 221,Carson 45,Johnson 127,
5,"Sat, Sep 23",vs Richmond,L20-19,0-4 (0-3),Case 154,Carson 40,Johnson 49,
6,"Sat, Sep 30",@ Maine,L56-28,0-5 (0-4),Case 348,Carson 59,Cook 117,
7,"Sat, Oct 7",@ Morgan State,Canceled,Canceled,Canceled,Canceled,Canceled,
8,"Sat, Oct 14",vs Fordham,L26-7,0-6 (0-4),Case 243,Dempster 36,Johnson 87,
9,"Sat, Oct 21",vs New Hampshire,L45-14,0-7 (0-5),Case 255,Turner 32,Freeman 115,


In [39]:
#Year 2021

df = pd.read_html("https://www.espn.com/college-football/team/schedule/_/id/2619/season/2021")

In [40]:
outcome = df[0]
outcome

Unnamed: 0,0,1,2,3,4,5,6,7
0,Regular Season,Regular Season,Regular Season,Regular Season,Regular Season,Regular Season,Regular Season,Regular Season
1,DATE,OPPONENT,RESULT,W-L (CONF),HI PASS,HI RUSH,HI REC,
2,"Thu, Sep 2",vs New Hampshire,L27-21,0-1 (0-1),Fields 255,Fields 54,Newton 76,
3,"Sat, Sep 11",@ Colgate,W24-3,1-1 (0-1),Fields 140,Lawton 134,Harris Jr. 64,
4,"Sat, Sep 18",@ 4 Oregon,L48-7,1-2 (0-1),Fields 131,Lawton 53,Harris Jr. 67,
5,"Sat, Sep 25",vs Fordham,L31-14,1-3 (0-1),Fields 253,Nekhet 49,Newton 77,
6,"Sat, Oct 2",@ Rhode Island,L27-20 OT,1-4 (0-2),Fields 101,Lawton 154,Hellams Jr. 66,
7,"Sat, Oct 9",@ Towson,L21-14,1-5 (0-3),Fields 220,Lawton 119,Harris Jr. 109,
8,"Sat, Oct 16",vs Delaware,W34-17,2-5 (1-3),Fields 133,Lawton 192,Newton 44,
9,"Sat, Oct 23",vs Richmond,W27-14,3-5 (2-3),Fields 226,Lawton 51,Harris Jr. 71,
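
The table returned by `pd.read_html` carries a section title in row 0 and the real column names in row 1. A minimal sketch of turning it into a tidy schedule with a W/L column, using a hand-copied excerpt of the 2021 table above instead of a live request (the `Game Outcome` column name and the regular expression are assumptions for illustration):

```python
import pandas as pd

# Excerpt of the 2021 table shown above, typed in by hand (no network access).
raw = pd.DataFrame(
    [
        ["Regular Season", "Regular Season", "Regular Season", "Regular Season"],
        ["DATE", "OPPONENT", "RESULT", "W-L (CONF)"],
        ["Thu, Sep 2", "vs New Hampshire", "L27-21", "0-1 (0-1)"],
        ["Sat, Sep 11", "@ Colgate", "W24-3", "1-1 (0-1)"],
        ["Sat, Oct 2", "@ Rhode Island", "L27-20 OT", "1-4 (0-2)"],
    ]
)

# Row 0 is a section title and row 1 holds the real column names.
schedule = raw.iloc[2:].reset_index(drop=True).copy()
schedule.columns = raw.iloc[1].tolist()

# RESULT starts with W or L followed by the score; anything else
# (for example "Canceled") does not match and becomes NaN.
schedule["Game Outcome"] = schedule["RESULT"].str.extract(r"^([WL])\d", expand=False)

print(schedule[["DATE", "OPPONENT", "Game Outcome"]])
```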


# Summary, Findings and Future Work:

# game_outcome_prediction

## Aim:
To predict SBU football game outcomes from athletes' physical metrics and past SBU football game outcome data

## Data information:
A CSV file containing SBU football athletes' physical movement metrics and game outcomes

File name: AI_ML Project (For temp ML).csv

Target variable: game outcome

Columns: 'Metrics', 'Metric Value', 'Metric Date', 'SBUID', 'Game Date', 'Opponent', 'Game Outcome'

Metrics: accel_load_accum, distance_total, metabolic_power_avg, speed_avg, physio_load, physio_intensity, metabolic_work

## Approach:
Used AutoML from MLJAR (the `mljar-supervised` package) to generate the prediction models

Used MLJAR's Income Classification example as a starting point for the game outcome prediction model

Link to example:
https://github.com/mljar/mljar-examples/blob/master/Income_classification/Income_classification.ipynb

## Steps:

1. EDA: Drop rows with NaN values and duplicate rows, then convert Metric Date and Game Date to the same date format.

2. Define X and Y:

Model 1: All columns except Game Outcome are used for X; Game Outcome is used for y.

```
X = df[df.columns[:-1]] # Include all columns except the last one that is game outcome
y = df["Game Outcome"]
```

Model 2: X excludes 'Game Outcome', 'SBUID', 'Game Date', 'Metric Date' and 'Opponent', which leaves 'Metrics' and 'Metric Value'; Game Outcome is again used for y.

```
X = df.drop(['Game Outcome', 'SBUID', 'Game Date', 'Metric Date', 'Opponent'], axis=1)
y = df['Game Outcome']
```

3. Split data into train and test sets.  

4. Train the model

5. Run the model on the test set to make a prediction

6. Evaluate the model’s performance


## Findings:

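The choice of columns included in X has a large impact on the model's performance.

Model 1:
After training Model 1, AutoML selected 6_Default_RandomForest as the best model.

Model Evaluation:
Accuracy is the ratio of correctly predicted observations to the total number of observations.

An accuracy of 1 means that every prediction on the test set is correct. This could be due to the small data set used to train the model. Another likely cause is target leakage: X includes Game Date and Opponent, every game has a single outcome, and rows from the same game appear in both the training and test sets, so the model can look up the outcome from the game itself rather than learn it from the metrics.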


```
AutoML best model: 6_Default_RandomForest
Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support

Opponent Win       1.00      1.00      1.00        32
     SBU Win       1.00      1.00      1.00        11

    accuracy                           1.00        43
   macro avg       1.00      1.00      1.00        43
weighted avg       1.00      1.00      1.00        43

```
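
One way to check whether a perfect score comes from leakage is to test whether a feature such as Game Date determines the outcome on its own, and then to split by game so that no game contributes rows to both the training and test sets. The sketch below uses made-up long-format data; `GroupShuffleSplit` is a suggested alternative to the row-wise split, not the method used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy long-format data: several metric rows per game, one outcome per game.
toy = pd.DataFrame(
    {
        "Metric Value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
        "Game Date": ["2022-09-01"] * 2 + ["2022-09-24"] * 2
        + ["2022-10-22"] * 2 + ["2022-11-05"] * 2,
        "Game Outcome": ["L", "L", "L", "L", "W", "W", "W", "W"],
    }
)

# If every game date maps to exactly one outcome, the date alone reveals the target.
outcomes_per_game = toy.groupby("Game Date")["Game Outcome"].nunique()
print(outcomes_per_game.eq(1).all())

# Group-aware split: all rows of a game go to either train or test, never both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["Game Date"]))

train_games = set(toy["Game Date"].iloc[train_idx])
test_games = set(toy["Game Date"].iloc[test_idx])
print(train_games & test_games)
```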

Model 2:
After training Model 2, AutoML selected the Ensemble as the best model.

The accuracy is 0.67, which means that 67% of the test-set predictions are correct.

```
AutoML best model: Ensemble
Accuracy: 0.67
Classification Report:
              precision    recall  f1-score   support

Opponent Win       0.66      1.00      0.79        27
     SBU Win       1.00      0.12      0.22        16

    accuracy                           0.67        43
   macro avg       0.83      0.56      0.51        43
weighted avg       0.79      0.67      0.58        43


```
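
The accuracy in the Model 2 report can be reproduced from the per-class support and recall, which also shows how it compares with always predicting the larger class. This is a worked arithmetic sketch using only the numbers in the report above, read in row order (first row, then second row).

```python
# Per-class support and recall copied from the Model 2 report, in row order.
support = [27, 16]
recall = [1.00, 0.12]

# Correct predictions per class = support * recall, rounded to whole samples.
correct = [round(s * r) for s, r in zip(support, recall)]
total = sum(support)

accuracy = sum(correct) / total
# Accuracy obtained by always predicting the larger class in this test set.
majority_baseline = max(support) / total

print(f"correct per class: {correct}")
print(f"accuracy: {accuracy:.2f}")
print(f"majority-class baseline: {majority_baseline:.2f}")
```

Only about 2 of the 16 games in the second class are identified, so the model is only slightly better than predicting the first class every time.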

## Future work:
1. To better predict SBU football game outcomes, a larger data set should be used to train the model. The training data should also not be skewed towards losses or wins.  

2. Determine which combination of athletes' physical metrics best predicts game outcome. This could be done by training the model on different combinations of metrics and comparing its performance. Once the most useful metrics are identified, data collection can focus on them.

3. Predict the points scored by the SBU football team relative to its opponent in each game, to identify the scoring levels that lead to a loss. With this knowledge, the football team can develop strategies to maximize the number of points gained per game.
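
For item 2, one possible way to compare combinations of metrics is to reshape the long-format data into one row per game with one column per metric, so that any subset of columns can be passed to the model. The sketch below uses made-up values in the same column layout as the CSV; aggregating with the mean over athletes and sessions is an assumption, not something the notebook does.

```python
import pandas as pd

# Toy long-format rows in the same layout as the CSV (values are made up).
long_df = pd.DataFrame(
    {
        "Game Date": ["2022-09-01"] * 4 + ["2022-11-05"] * 4,
        "Metrics": ["speed_avg", "speed_avg", "physio_load", "physio_load"] * 2,
        "Metric Value": [1.0, 3.0, 800.0, 900.0, 2.0, 4.0, 700.0, 750.0],
        "Game Outcome": ["L"] * 4 + ["W"] * 4,
    }
)

# One row per game, one column per metric (mean over athletes and sessions).
features = long_df.pivot_table(
    index="Game Date", columns="Metrics", values="Metric Value", aggfunc="mean"
)

# One outcome per game, aligned on the same index.
target = long_df.groupby("Game Date")["Game Outcome"].first()

# Any subset of metric columns can now be tried as a candidate feature set.
candidate = features[["speed_avg"]]
print(features.join(target))
```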

