# AutoGluon for Sports Analytics: Baseball Pitch Prediction

Name: John Hodge

Date: 04/19/24

## Introduction

Welcome to this tutorial on using AutoGluon for sports analytics, specifically for predicting baseball pitch types. AutoGluon is an automated machine learning (AutoML) toolkit that simplifies model building, making it accessible even to those with limited machine learning expertise. We will walk through installing dependencies, preparing the dataset, defining the prediction task, training the model, evaluating it, making predictions, and inspecting feature importance. The goal is to give you practical experience applying AutoGluon to real-world sports data.

## Install Dependencies

In [5]:
!pip install autogluon



In [6]:
import pandas as pd
import torch
from autogluon.tabular import TabularDataset, TabularPredictor

Check to see whether a CUDA-based GPU is available.

In [7]:
# Check whether a CUDA-capable GPU is available
gpu_available = torch.cuda.is_available()
print("GPU is available" if gpu_available else "GPU is not available")

GPU is not available


## Load and Prepare Data

We start by loading pitch data from a CSV file. This example assumes the file has PitchType, Balls, Strikes, and Outcome columns; adjust the names to match your data. PreviousPitchType is not read from the file: the code below derives it by shifting PitchType down one row, and the Outcome column is dropped because it is not used in this analysis.

Data preparation is a crucial step in any machine learning pipeline. Here it is deliberately light: the first row has no previous pitch, so it is skipped, and any remaining rows with missing values are dropped. AutoGluon handles categorical encoding and basic feature preprocessing itself, so no manual encoding or scaling is needed. The cleaned data is then split 80/20 into training and test sets.

In [8]:
data_path = 'data/baseball_pitch_data.csv'

# Load the dataset
df = pd.read_csv(data_path)

# Create a new column 'PreviousPitchType' with the 'PitchType' from the previous row
df['PreviousPitchType'] = df['PitchType'].shift(1)

# Drop the 'Outcome' column for this analysis
df = df.drop('Outcome', axis=1)

# Save the updated DataFrame back to CSV if needed
# df.to_csv('/path/to/your/updated_baseball_pitch_data.csv', index=False)

# Display the updated DataFrame to verify
# print(df.head(10))

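# Skip the first row (it has no previous pitch) and optionally truncate the DataFrame.
# Note: with -1 the slice df[1:-1] also drops the final row; use None to keep every remaining row.
num_truncation_rows = -1  # Exclusive end position of the slice
df_truncated = df[1:num_truncation_rows]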

# Load data
data = TabularDataset(df_truncated)

# Preview data
print(data.head(10))

# Drop rows with missing values in the columns used for modeling
data = data.dropna(subset=['PitchType', 'Balls', 'Strikes', 'PreviousPitchType'])

# Split data into training and testing datasets
train_data = data.sample(frac=0.8, random_state=42)  # 80% for training
test_data = data.drop(train_data.index)  # Remaining 20% for testing


    Balls  Strikes  PitchType PreviousPitchType
1       0        1   Changeup          Fastball
2       1        1     Slider          Changeup
3       0        0   Fastball            Slider
4       0        1   Changeup          Fastball
5       0        2   Fastball          Changeup
6       0        0   Fastball          Fastball
7       0        0   Fastball          Fastball
8       0        1     Slider          Fastball
9       0        2  Curveball            Slider
10      1        2  Curveball         Curveball
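The shift(1) call above treats the whole file as one continuous sequence, so the first pitch of a new game or pitcher inherits the last pitch of the previous one. If your data has grouping columns, shifting within each group avoids that. The following is a minimal sketch on toy data; the `GameID` and `PitcherID` columns are hypothetical and do not appear in the dataset shown in this tutorial.

```python
import pandas as pd

# Toy data; 'GameID' and 'PitcherID' are hypothetical columns used for illustration.
pitches = pd.DataFrame({
    "GameID":    [1, 1, 1, 2, 2],
    "PitcherID": [10, 10, 10, 20, 20],
    "PitchType": ["Fastball", "Changeup", "Slider", "Curveball", "Fastball"],
})

# Shift within each (game, pitcher) group so a sequence never borrows
# the last pitch of a different game or pitcher.
pitches["PreviousPitchType"] = (
    pitches.groupby(["GameID", "PitcherID"])["PitchType"].shift(1)
)

print(pitches)
```

The first pitch of each group ends up with a missing PreviousPitchType, which the dropna step would then remove.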


## Define the Prediction Task

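With AutoGluon, you specify which column to predict; here it is PitchType. Every other column is treated as an input feature, and AutoGluon handles feature processing and infers the problem type (here, multiclass classification) from the label values. Specifying the target correctly is crucial in supervised learning because the model learns to predict this variable from the other features. The cell below also prints summary statistics for the numeric training columns.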

In [9]:
label = 'PitchType'
print("Summary statistics of training data:")
print(train_data.describe())

Summary statistics of training data:
               Balls        Strikes
count  302558.000000  302558.000000
mean        0.811121       0.874097
std         0.942289       0.806010
min         0.000000       0.000000
25%         0.000000       0.000000
50%         1.000000       1.000000
75%         1.000000       2.000000
max         3.000000       2.000000
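Because describe() summarizes only the numeric columns, the output above says nothing about the label itself. Looking at the class frequencies of the target is a useful complement, since it shows how far a model could get by always guessing the most common pitch. The sketch below uses a toy frame with made-up values; with the tutorial's data, the same call would be applied to `train_data[label]`.

```python
import pandas as pd

# Toy stand-in for train_data; the values are illustrative, not from the tutorial's dataset.
toy_train = pd.DataFrame({
    "PitchType": ["Fastball", "Fastball", "Slider", "Fastball", "Changeup", "Curveball"],
})

# Relative frequency of each class, most common first.
class_share = toy_train["PitchType"].value_counts(normalize=True)
print(class_share)

# Share of the most common class: the accuracy of always predicting it.
majority_share = class_share.iloc[0]
print(f"Majority-class share: {majority_share:.2f}")
```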


## Train the Model

Now train a model with AutoGluon. It handles categorical features and missing values automatically, but it still helps to understand what happens during training and which options are available. AutoGluon preprocesses the features, trains several model families, and combines them with bagging and stacked ensembling to produce robust predictions. The preset controls the trade-off between accuracy, training time, and inference cost.

More information on [AutoGluon presets](https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets).

In [10]:
MODELS_DIR = 'autogluon_pitchtype_models'  # Specifies folder to store trained models
MODEL_PRESET = 'good_quality'  # Preset for training models

# Request one GPU only when one was detected; otherwise use AutoGluon's defaults
fit_kwargs = {'num_gpus': 1} if gpu_available else {}
predictor = TabularPredictor(label=label, path=MODELS_DIR).fit(
    train_data, presets=MODEL_PRESET, **fit_kwargs
)

Presets specified: ['good_quality']
Setting dynamic_stacking from 'auto' to True. Reason: Enable dynamic_stacking when use_bag_holdout is disabled. (use_bag_holdout=False)
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
Note: `save_bag_folds=False`! This will greatly reduce peak disk usage during fit (by ~8x), but runs the risk of an out-of-memory error during model refit if memory is small relative to the data size.
	You can avoid this risk by setting `save_bag_folds=True`.
Dynamic stacking is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
Detecting stacked overfitting by sub-fitting AutoGluon on the input data. That is, copies of AutoGluon will be sub-fit on subset(s) of the data. Then, the holdout validation data is used to detect stacked overfitting.
Sub-fit(s) time limit is: 3600 seconds.
Starting holdout-based sub-

## Evaluate the Model

After training, evaluate the model on the held-out test set, which was not used during training, to check how well it generalizes to unseen data. For a multiclass problem such as this one, the evaluation reports accuracy, balanced accuracy, and the Matthews correlation coefficient (MCC):

In [11]:
performance = predictor.evaluate(test_data)
print(performance)

{'accuracy': 0.5046007403490217, 'balanced_accuracy': 0.2585244559827297, 'mcc': 0.046267345277066455}
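Balanced accuracy is the mean of the per-class recall, so a model that always predicts a single class out of four scores 0.25 no matter how common that class is. The balanced accuracy of about 0.26 reported above, next to an accuracy of about 0.50, is therefore worth comparing against such a baseline. The sketch below computes both metrics by hand for a majority-class baseline on toy labels (made-up values, not the tutorial's data), assuming the usual mean-recall definition of balanced accuracy.

```python
import pandas as pd

# Toy true labels; illustrative only.
y_true = pd.Series(["Fastball"] * 5 + ["Slider"] * 3 + ["Changeup", "Curveball"])

# Baseline "model": always predict the most common class.
majority_class = y_true.mode().iloc[0]
y_pred = pd.Series([majority_class] * len(y_true))

# Plain accuracy: fraction of correct predictions.
accuracy = (y_true == y_pred).mean()

# Balanced accuracy: recall computed per true class, then averaged.
per_class_recall = (y_true == y_pred).groupby(y_true).mean()
balanced_accuracy = per_class_recall.mean()

print(f"accuracy={accuracy:.2f}, balanced_accuracy={balanced_accuracy:.2f}")
```

On these toy labels the baseline reaches 0.50 accuracy but only 0.25 balanced accuracy, which is the pattern to watch for when one pitch type dominates.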


## View Leaderboard

In [12]:
predictor.leaderboard(test_data)

| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CatBoost_BAG_L1_FULL | 0.506253 | | accuracy | 0.028682 | | 0.267513 | 0.028682 | | 0.267513 | 1 | True | 13 |
| 1 | ExtraTreesEntr_BAG_L1 | 0.506253 | 0.506918 | accuracy | 0.090633 | 4.234753 | 1.447731 | 0.090633 | 4.234753 | 1.447731 | 1 | True | 6 |
| 2 | RandomForestEntr_BAG_L1_FULL | 0.506253 | | accuracy | 0.092513 | 4.192037 | 1.582029 | 0.092513 | 4.192037 | 1.582029 | 1 | True | 12 |
| 3 | ExtraTreesGini_BAG_L1_FULL | 0.506253 | | accuracy | 0.093715 | 4.200601 | 1.426428 | 0.093715 | 4.200601 | 1.426428 | 1 | True | 14 |
| 4 | ExtraTreesGini_BAG_L1 | 0.506253 | 0.506918 | accuracy | 0.09434 | 4.200601 | 1.426428 | 0.09434 | 4.200601 | 1.426428 | 1 | True | 5 |
| 5 | RandomForestGini_BAG_L1_FULL | 0.506253 | | accuracy | 0.094638 | 4.204879 | 1.713279 | 0.094638 | 4.204879 | 1.713279 | 1 | True | 11 |
| 6 | RandomForestEntr_BAG_L1 | 0.506253 | 0.506918 | accuracy | 0.096177 | 4.192037 | 1.582029 | 0.096177 | 4.192037 | 1.582029 | 1 | True | 3 |
| 7 | ExtraTreesEntr_BAG_L1_FULL | 0.506253 | | accuracy | 0.097938 | 4.234753 | 1.447731 | 0.097938 | 4.234753 | 1.447731 | 1 | True | 15 |
| 8 | XGBoost_BAG_L1_FULL | 0.506253 | | accuracy | 0.108012 | | 0.575301 | 0.108012 | | 0.575301 | 1 | True | 16 |
| 9 | RandomForestGini_BAG_L1 | 0.506253 | 0.506918 | accuracy | 0.133376 | 4.204879 | 1.713279 | 0.133376 | 4.204879 | 1.713279 | 1 | True | 2 |


## Make Predictions

Once you are satisfied with the model's performance, the final step is to use it to make predictions on new data. The first call below returns the most likely pitch type for each row, and the second returns the probability assigned to each class:

In [13]:
predictions = predictor.predict(test_data)
print(predictions.head())

# To view the probability of each class
probabilities = predictor.predict_proba(test_data)
print(probabilities.head())

6     Fastball
14    Fastball
16    Fastball
19    Fastball
20    Fastball
Name: PitchType, dtype: object
    Changeup  Curveball  Fastball    Slider
6   0.093924   0.136875  0.494357  0.274844
14  0.102742   0.144563  0.438784  0.313910
16  0.102742   0.144563  0.438784  0.313910
19  0.091051   0.089735  0.556712  0.262501
20  0.093067   0.135629  0.493959  0.277346
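The two outputs above are closely related: for each row, the predicted label is the class with the highest probability. The sketch below rebuilds the five probability rows printed above and recovers both the top class and the runner-up, assuming the default decision rule of picking the most probable class.

```python
import pandas as pd

# Probabilities copied from the predict_proba output above.
probabilities = pd.DataFrame(
    {
        "Changeup":  [0.093924, 0.102742, 0.102742, 0.091051, 0.093067],
        "Curveball": [0.136875, 0.144563, 0.144563, 0.089735, 0.135629],
        "Fastball":  [0.494357, 0.438784, 0.438784, 0.556712, 0.493959],
        "Slider":    [0.274844, 0.313910, 0.313910, 0.262501, 0.277346],
    },
    index=[6, 14, 16, 19, 20],
)

# The predicted label is the column with the highest probability in each row.
top_class = probabilities.idxmax(axis=1)

# The runner-up: mask each row's maximum, then take the next highest column.
runner_up = probabilities.where(
    probabilities.lt(probabilities.max(axis=1), axis=0)
).idxmax(axis=1)

print(pd.DataFrame({"top": top_class, "runner_up": runner_up}))
```

For all five rows the top class is Fastball and the runner-up is Slider, so the probabilities carry more information than the labels alone.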


## Insights and Feature Importance

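AutoGluon can estimate how much each feature contributes to the predictions using permutation importance: it shuffles one feature's values and measures how much the evaluation score drops. The cell below computes this on the training data; held-out data such as test_data generally gives a less optimistic estimate.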

In [14]:
feature_importance = predictor.feature_importance(data=train_data)
print(feature_importance)

Computing feature importance via permutation shuffling for 3 features using 5000 rows with 5 shuffle sets...
	0.63s	= Expected runtime (0.13s per shuffle set)
	0.46s	= Actual runtime (Completed 5 of 5 shuffle sets)


                   importance    stddev   p_value  n  p99_high   p99_low
Strikes               0.01784  0.006085  0.001400  5  0.030369  0.005311
Balls                 0.01712  0.006123  0.001669  5  0.029727  0.004513
PreviousPitchType     0.00156  0.001740  0.057753  5  0.005143 -0.002023
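The interval columns in the table can be reproduced from the importance, stddev, and n columns. The sketch below assumes a Student's t interval around the mean importance across the 5 shuffle sets; that is an assumption about how the table is built, although the results match the printed values to the shown precision. The critical value is hard-coded to keep the example dependency-free.

```python
import math

# Rows copied from the feature importance table above: (importance, stddev, n).
rows = {
    "Strikes": (0.01784, 0.006085, 5),
    "Balls": (0.01712, 0.006123, 5),
    "PreviousPitchType": (0.00156, 0.001740, 5),
}

# Two-sided 99% critical value of Student's t with n - 1 = 4 degrees of freedom.
T_CRIT_99_DF4 = 4.604095

intervals = {}
for feature, (importance, stddev, n) in rows.items():
    # Margin of error: critical value times the standard error of the mean.
    margin = T_CRIT_99_DF4 * stddev / math.sqrt(n)
    intervals[feature] = (importance - margin, importance + margin)
    low, high = intervals[feature]
    print(f"{feature:<18} p99_low={low:.6f} p99_high={high:.6f}")
```

The interval for PreviousPitchType includes zero, and its p_value of about 0.058 is above the common 0.05 threshold, so its contribution is not clearly distinguishable from noise in this run.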


## Conclusion

In this tutorial, we used AutoGluon to predict baseball pitch types, walking through the full workflow: data preparation, model training, evaluation, prediction, and feature importance. With only three features (Balls, Strikes, and PreviousPitchType), the model reached about 50% accuracy on the test set, so there is clear room for improvement. We encourage you to apply these techniques to your own datasets and to keep experimenting: try adding more informative features and adjusting AutoGluon's presets, hyperparameters, and model configurations to improve performance.