<a href="https://www.kaggle.com/code/haleyparmley/nfl-big-data-bowl-2025-predicting-receiver-routes?scriptVersionId=215875638" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## 🏈 **Introduction: The Art and Science of Route Prediction in Football** 🏈

Football is a game of strategy, precision, and execution, where even the smallest details can define success or failure. Wide receiver routes form the backbone of offensive plays, acting as orchestrated patterns designed to exploit gaps in the defense, create separation, or divert attention to set up teammates. Each route—whether it's a "Go," "Slant," "Flat," or "Angle"—carries unique characteristics that make it effective in specific situations.

### **What Determines Which Route is Run?**  
The decision behind a wide receiver's route is influenced by a variety of contextual and positional factors. Understanding these elements can shed light on how offensive strategies are crafted and executed:  
- **Pre-Snap Alignment**: The receiver's position on the field is one of the strongest indicators of the route to be run. For instance:  
  - **Outside Alignments**: Often signal deeper routes like the "Go" or "Post" to stretch the field.  
  - **Slot Alignments**: Commonly set up shorter, quicker routes like the "Slant" or "Out."  
- **Receiver Motion**: Pre-snap motion provides valuable context for route prediction. Receivers in motion often aim to create mismatches, suggesting the likelihood of quick-breaking routes such as "Flat" or "Angle."  
- **Game Situation**: Down and distance play a critical role in determining route selection:  
  - **Short Yardage**: Routes like the "Hitch" or "Flat" are preferred to gain small, manageable yardage.  
  - **Third-and-Long**: Calls for routes like the "Go" or "Post" to maximize field coverage and stretch the defense vertically.  
- **Receiver Skillset**: Certain players are specialized in specific routes based on their physical attributes:  
  - **Tall, Physical Receivers**: More likely to run "Post" or "Go" routes, where height is an advantage.  
  - **Quick, Agile Receivers**: Often employed for routes like the "Slant" or "Screen" to capitalize on speed and acceleration.  
- **Route Combinations**: Plays are designed with complementary route combinations, such as pairing a deep "Go" route with a shorter "Flat," to create spacing and confuse defenders.

By quantifying and analyzing these elements, this project uses machine learning to predict routes from pre-snap information, with the aim of providing useful insights to players and coaches.

### **How Can This Model Be Used?**  
The ability to predict wide receiver routes from pre-snap data unlocks new dimensions in preparation and performance analysis for football teams.  

#### **For Coaches**:  
- **Game Planning**: By understanding route tendencies for different receivers and situations, coaches can design plays to maximize offensive efficiency or counter specific defensive setups.  
- **Real-Time Adjustments**: During a game, live predictions can inform split-second adjustments in play-calling to exploit favorable matchups.  

#### **For Players**:  
- **Receiver Training**: Predictive models can help receivers identify their own tendencies, enabling them to add variety to their route trees and become less predictable.  
- **Quarterback Insights**: Quarterbacks can use this data to anticipate how routes will develop, improving timing and reducing miscommunication.  

#### **For Analysts and Scouts**:  
- **Scouting Reports**: Detailed insights into route tendencies provide a competitive edge in evaluating opponents or potential recruits.  
- **Player Evaluation**: The model can uncover subtle tendencies in route-running, helping teams identify players who excel in specific areas or roles.

### **The Vision**  
This project goes beyond simple route prediction. It bridges the knowledge of football with the power of data science, aiming to create actionable insights for decision-making on and off the field. The ability to anticipate a receiver's route based on alignment, motion, and game context could redefine offensive and defensive strategies, leading to smarter preparation, sharper execution, and ultimately, better outcomes.

By focusing on the nuances that dictate route selection, this model offers a tool that combines the art of football with the precision of science—an indispensable resource for modern teams striving for an edge on the gridiron.


## **Objective**
The goal of this project is to predict the route a receiver will run using pre-snap data. This analysis focuses on leveraging player tracking data, game statistics, and team tendencies to develop robust predictive models.

---

## **Data and Approach**
We will:
1. Merge player, play, and tracking data to create a comprehensive dataset.
2. Preprocess the data for training machine learning models.
3. Use classification techniques to predict the `routeRan` variable, representing the receiver's route.

---


## **Introducing the Random Forest Model for the NFL Big Data Bowl 2025**

In this competition, the goal is to predict the route a receiver will run based on pre-snap data. To tackle this problem, we use a **Random Forest Classifier**, a robust and interpretable machine learning model.

#### Why Random Forest?
- **Interpretability**: Random Forest provides insights into feature importance, helping us understand which pre-snap features influence receiver routes.
- **Versatility**: It handles a mix of numerical and categorical data effectively, making it suitable for this dataset.
- **Imbalance Handling**: The `class_weight` parameter helps address class imbalance, ensuring fair predictions for all route types.
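
As a rough illustration of what `class_weight='balanced'` does, the sketch below reproduces scikit-learn's documented heuristic, `n_samples / (n_classes * class_count)`, in plain Python. The helper name `balanced_class_weights` and the route counts are made up for this example and are not taken from the competition data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical, imbalanced route labels (assumption for illustration only)
routes = ["GO"] * 6 + ["SLANT"] * 3 + ["FLAT"] * 1
weights = balanced_class_weights(routes)

# Rare routes receive larger weights, frequent routes smaller ones
for route, weight in sorted(weights.items()):
    print(f"{route}: {weight:.2f}")
```

Rare classes get proportionally larger weights, so errors on an infrequent route count for more during training than errors on a common one.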

#### Key Steps:
1. Preprocessing:
   - Encoding categorical features.
   - Scaling numerical features for consistency.
2. Training:
   - The Random Forest model is trained on historical data, learning patterns from pre-snap features.
3. Evaluation:
   - The model's accuracy, F1 score, and classification report are analyzed to assess performance.
4. Visualization:
   - Feature importance is visualized to identify the most influential pre-snap features.

This model serves as a baseline for predicting receiver routes, offering interpretability and reliable performance while paving the way for further optimization and enhancements.


## **Step 1: Importing Libraries and Initial Setup**

This step sets up the essential libraries and configurations required for the data preprocessing and splitting process:

1. **Importing Libraries**:
   - **Pandas**: Used for data manipulation and analysis, including reading and merging datasets.
   - **Scikit-learn**:
     - `train_test_split` for splitting the dataset into training and testing sets for machine learning.
   - **OS**: Facilitates file path management and access for loading and saving datasets.

2. **Configure Logging**:
   - Logging is configured to provide a clear and structured output of execution progress.
   - Logs include timestamps, logging levels (e.g., INFO, ERROR), and messages for both successful operations and debugging.

This foundational setup ensures the environment is ready for handling data and splitting it into subsets for further processing.


In [1]:
import logging
import os

import numpy as np  # needed by the feature engineering in Step 2 (np.sqrt, np.abs, np.minimum)
import pandas as pd
from sklearn.model_selection import train_test_split

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

## **Step 2: Loading and Preparing Data**

This function processes and consolidates data from multiple CSV files, performs feature engineering, and prepares the data for modeling. The key steps are:

1. **Load Datasets**:
   - Reads data from several CSV files using `pandas.read_csv`:
     - **Player-Play Data**: Includes key features such as `routeRan`, `inMotionAtBallSnap`, and unique identifiers like `nflId`, `playId`, and `gameId`.
     - **Play Data**: Contains contextual play information such as `quarter`, `yardsToGo`, and team possession details.
     - **Player Data**: Provides player attributes such as `height`, `weight`, `collegeName`, and `position`.
     - **Game Data**: Includes game-level metadata.
     - **Tracking Data**: Combines tracking files for all weeks and filters the data for "BEFORE_SNAP" frames.

2. **Filter Valid Tracking Data**:
   - Removes rows with null values in the `event` column to ensure clean tracking data.

3. **Merge Datasets**:
   - Merges the datasets step-by-step using unique identifiers (`nflId`, `playId`, and `gameId`):
     - Player data is merged with player-play data.
     - Play data is merged with game data.
     - Tracking data is integrated into the consolidated dataset.

4. **Feature Engineering**:
   - **Distance from QB**: Calculates the Euclidean distance between each player and the quarterback using their x and y coordinates.
   - **Distance from Line of Scrimmage**: Computes the absolute difference between a player's x-coordinate and the `absoluteYardlineNumber`.
   - **Distance from Sidelines**: Measures the minimum distance between a player and the nearest sideline (field width = 53.3 yards).
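   - **Seconds Left in Game**: Converts the `gameClock` string (`MM:SS`) into seconds. Because `gameClock` counts down within the current quarter, the resulting `seconds_left_in_game` value is the time remaining in the quarter, not in the whole game.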

5. **Prepare Features and Target**:
   - Drops identifiers and intermediate columns (`gameId`, `playId`, `qb_x`, `qb_y`, `gameClock`, `frameType`) so the model does not train on them.
   - Splits the dataset into features (`x`) and the target variable (`y`), where `y` represents the `routeRan`.

6. **Error Handling**:
   - Logs errors and raises exceptions if issues occur during data loading or processing.

The final output is a tuple containing `x` (features) and `y` (target variable), which are ready for preprocessing and modeling.
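
The feature-engineering formulas above can be checked on a tiny hand-made frame. This is a minimal sketch: the coordinates, the `toy` DataFrame, and the `clock_seconds` column name are assumptions for illustration, while the formulas mirror the ones described in the list.

```python
import numpy as np
import pandas as pd

FIELD_WIDTH = 53.3  # yards, as stated in the text above

# Hypothetical pre-snap rows for two receivers on the same play
toy = pd.DataFrame({
    "x": [33.0, 34.0],
    "y": [22.0, 45.0],
    "qb_x": [30.0, 30.0],
    "qb_y": [26.0, 26.0],
    "absoluteYardlineNumber": [35, 35],
    "gameClock": ["12:34", "12:34"],
})

# Euclidean distance between the receiver and the quarterback
toy["distance_from_qb"] = np.sqrt(
    (toy["x"] - toy["qb_x"]) ** 2 + (toy["y"] - toy["qb_y"]) ** 2
)

# Absolute distance along the field from the line of scrimmage
toy["distance_from_los"] = (toy["x"] - toy["absoluteYardlineNumber"]).abs()

# Distance to whichever sideline is closer
toy["distance_from_sideline"] = np.minimum(toy["y"], FIELD_WIDTH - toy["y"])


def clock_to_seconds(clock: str) -> int:
    """Convert an 'MM:SS' clock string to seconds."""
    minutes, seconds = map(int, clock.split(":"))
    return minutes * 60 + seconds


toy["clock_seconds"] = toy["gameClock"].apply(clock_to_seconds)

print(toy[["distance_from_qb", "distance_from_los",
           "distance_from_sideline", "clock_seconds"]])
```

Running the formulas on values that can be verified by hand is a cheap way to catch sign or axis mix-ups before applying them to the full tracking data.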


In [2]:
def load_and_prepare_data():
    """
    Loads, prepares, and merges data from various CSV files.

    Returns:
        tuple: x (features) and y (target, the `routeRan` column).
    """
    
    try:
        logging.info("Loading datasets...")

        player_play_data = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2025/player_play.csv")[[
            "routeRan", "nflId", "playId", "gameId", "inMotionAtBallSnap"
        ]].dropna(subset=["routeRan"])

        play_data = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2025/plays.csv")[[
            "quarter", "down", "yardsToGo", "possessionTeam", "gameClock", "preSnapHomeScore",
            "preSnapVisitorScore", "absoluteYardlineNumber", "preSnapHomeTeamWinProbability", "preSnapVisitorTeamWinProbability",
            "expectedPoints", "offenseFormation", "receiverAlignment", "gameId", "playId"
        ]]

        player_data = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2025/players.csv")[[
            "height", "weight", "collegeName", "nflId", "position"
        ]]

        game_data = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2025/games.csv").drop(
            ["season", "homeFinalScore", "visitorFinalScore"], axis=1
        )

        tracking_files = [(f"/kaggle/input/nfl-big-data-bowl-2025/tracking_week_{i}.csv") for i in range(1, 10)]
        tracking_data_combined = pd.concat(
            [
                pd.read_csv(file)[
                    ["gameId", "playId", "nflId", "playDirection", "x", "y", "frameType", "event"]
                ].query("frameType == 'BEFORE_SNAP'")
                for file in tracking_files
            ],
            ignore_index=True
        )

        tracking_data_combined = tracking_data_combined[tracking_data_combined["event"].notnull()]

        logging.info("Merging datasets...")
        player_play_merged = pd.merge(player_play_data, player_data, on="nflId")
        play_game_merged = pd.merge(play_data, game_data, on="gameId")

        final_data = pd.merge(player_play_merged, play_game_merged, on=["playId", "gameId"])
        final_data = pd.merge(final_data, tracking_data_combined, on=["gameId", "playId", "nflId"])

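        # Calculate distance from QB.
        # final_data holds only players with a recorded route, so it contains few if any
        # QB rows; take QB coordinates from the tracking data instead (one row per play and event).
        qb_ids = player_data.loc[player_data["position"] == "QB", "nflId"]
        qb_positions = tracking_data_combined[tracking_data_combined["nflId"].isin(qb_ids)][
            ["gameId", "playId", "event", "x", "y"]
        ].rename(columns={"x": "qb_x", "y": "qb_y"}).drop_duplicates(subset=["gameId", "playId", "event"])
        final_data = pd.merge(final_data, qb_positions, on=["gameId", "playId", "event"], how="left")
        final_data["distance_from_qb"] = np.sqrt(
            (final_data["x"] - final_data["qb_x"])**2 + (final_data["y"] - final_data["qb_y"])**2
        )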

        # Calculate distance from line of scrimmage
        final_data["distance_from_los"] = np.abs(final_data["x"] - final_data["absoluteYardlineNumber"])

        # Calculate distance from sidelines (field width = 53.3 yards)
        final_data["distance_from_sideline"] = np.minimum(final_data["y"], 53.3 - final_data["y"])

        # Convert the MM:SS game clock (time left in the current quarter) to seconds
        def game_clock_to_seconds(clock):
            minutes, seconds = map(int, clock.split(":"))
            return minutes * 60 + seconds

        final_data["seconds_left_in_game"] = final_data["gameClock"].apply(game_clock_to_seconds)

        #Drop unnecessary columns
        final_data = final_data.drop(columns=["gameId", "playId", "qb_x", "qb_y", "gameClock", "frameType"])

        # Create features and target
        x = final_data.drop(columns=["routeRan"])
        y = final_data["routeRan"]

        return x, y
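    except Exception as e:
        # Log through the configured logger, as described above, instead of printing
        logging.error(f"An error occurred while loading or preparing data: {e}")
        raise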

## **Step 3: Splitting and Saving Data**

This function divides the dataset into training and testing sets and saves them as CSV files for later use. The key steps include:

1. **Split Data**:
   - Uses `train_test_split` from `sklearn` to split the features (`x`) and target (`y`) into:
     - **Training Set**: Used for training the model.
     - **Testing Set**: Used for evaluating the model's performance on unseen data.
   - Ensures the split maintains the distribution of the target classes by setting `stratify=y`.
   - The `test_size` parameter specifies the proportion of the data reserved for testing (default: 20%).

2. **Save Data**:
   - Writes the resulting training and testing datasets (`x_train`, `x_test`, `y_train`, `y_test`) to CSV files in the `/kaggle/working` directory for later access during training and evaluation.

3. **Logging and Error Handling**:
   - Logs messages at each step for transparency and troubleshooting.
   - Catches and logs any exceptions encountered during the splitting or saving process.

The function returns the training and testing datasets as `x_train`, `x_test`, `y_train`, and `y_test`, ensuring the data is ready for preprocessing and modeling.
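
To see what `stratify=y` buys, here is a small sketch on a made-up, imbalanced target. The column name `yardsToGo` is used only as a placeholder feature, and the 80/20 class mix is an assumption for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 "GO" routes and 20 "SLANT" routes
y = pd.Series(["GO"] * 80 + ["SLANT"] * 20, name="routeRan")
x = pd.DataFrame({"yardsToGo": range(100)})

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

# Both subsets keep the original 80/20 class proportions
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Without stratification, a rare route could end up under-represented, or missing entirely, in the test set, which would distort the per-class metrics.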


In [3]:
def split_and_save_data(x, y, test_size=0.2, random_state=42):
    """Splits the data into training and test sets and saves them to CSV files."""
    try:
        logging.info("Splitting data into training and test sets...")
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=test_size, random_state=random_state, stratify=y
        )

        logging.info("Saving training and testing datasets...")
        output_dir = "/kaggle/working"
        x_train.to_csv(os.path.join(output_dir, "x_train.csv"), index=False)
        x_test.to_csv(os.path.join(output_dir, "x_test.csv"), index=False)
        y_train.to_csv(os.path.join(output_dir, "y_train.csv"), index=False)
        y_test.to_csv(os.path.join(output_dir, "y_test.csv"), index=False)
    except Exception as e:
        logging.error(f"An error occurred: {e}")
        raise

    return x_train, x_test, y_train, y_test

## **Step 4: Main Execution Function**

The `main` function orchestrates the key steps of the data preparation process, ensuring a streamlined and organized workflow. The key steps include:

1. **Load and Prepare Data**:
   - Calls the `load_and_prepare_data` function to read, filter, and merge datasets, resulting in `x` (features) and `y` (target).

2. **Split and Save Data**:
   - Invokes the `split_and_save_data` function to divide the dataset into training and testing sets.
   - Saves the resulting subsets (`x_train`, `x_test`, `y_train`, `y_test`) to CSV files for later use.

3. **Logging and Error Handling**:
   - Logs progress to ensure visibility into the execution process.
   - Captures and logs any exceptions, providing detailed error messages for debugging.

4. **Execution Context**:
   - Ensures the function runs only when executed as a script (using the `if __name__ == "__main__":` block).

This function acts as the entry point for the data preparation process, ensuring the data is properly processed and saved for subsequent steps in the workflow.


In [4]:
def main():
    """Main execution function."""
    try:
        x, y = load_and_prepare_data()
        x_train, x_test, y_train, y_test = split_and_save_data(x, y)
    
        logging.info("Program execution completed successfully.")
    except Exception as e:
        logging.error(f"Program terminated with an exception: {e}")


if __name__ == "__main__":
    main()



## **Step 5: Importing Libraries and Configuring the Environment**

This step sets up the required libraries and configurations for the project:

1. **Importing Libraries**:
   - **Pandas**: For data manipulation and analysis.
   - **Scikit-learn**:
     - `RandomForestClassifier` for building the classification model.
     - Metrics such as `accuracy_score`, `classification_report`, `roc_auc_score`, and `f1_score` for evaluating model performance.
     - Preprocessing tools like `OneHotEncoder` and `StandardScaler` for feature transformation.
     - `Pipeline` and `ColumnTransformer` for building an end-to-end processing and training workflow.
     - `SimpleImputer` for handling missing data.
     - `train_test_split` for splitting the dataset into training and testing subsets.
   - **Joblib**: For saving and loading the trained model pipeline.
   - **Matplotlib and Seaborn**: For data visualization and graphical analysis.

2. **Configure Logging**:
   - Logging is configured to provide real-time feedback on the script's execution, including info-level messages for successful operations and error messages for debugging.

This foundational setup ensures that all tools and configurations are in place for the subsequent steps in the machine learning workflow.


In [5]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, f1_score, confusion_matrix
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
import joblib
import logging
import matplotlib.pyplot as plt
import seaborn as sns

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

## **Step 6: Loading and Preprocessing Data**

This function performs the following tasks:

1. **Data Loading**:
   - Reads the training and testing datasets from pre-saved CSV files in the `/kaggle/working/` directory.
   - Converts the single-column target files into Series with `.squeeze()`.
   - Outputs `x_train`, `x_test`, `y_train`, and `y_test`.

2. **Preprocessing Is Deferred to the Pipeline**:
   - This function does not encode or scale anything itself.
   - Categorical encoding (`OneHotEncoder`) and numerical scaling (`StandardScaler`) are applied inside the pipeline built in Step 7, so the transformations are fitted on the training data only.

The returned datasets are ready to be passed to the pipeline for training and evaluation.
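
The target round trip through CSV can be sketched as follows. An in-memory buffer stands in for the files under `/kaggle/working/`, and the three route labels are invented for the example. Passing `"columns"` to `squeeze` is a slightly more defensive variant of a bare `.squeeze()`, which would return a scalar if the file happened to contain a single row.

```python
import io

import pandas as pd

# Hypothetical target values (assumption for illustration only)
y = pd.Series(["GO", "SLANT", "FLAT"], name="routeRan")

buffer = io.StringIO()
y.to_csv(buffer, index=False)  # writes a header line plus one value per row
buffer.seek(0)

loaded = pd.read_csv(buffer)          # a DataFrame with a single column
restored = loaded.squeeze("columns")  # collapse that column into a Series

print(type(loaded).__name__, loaded.shape)
print(type(restored).__name__, restored.name)
```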


In [6]:
def load_and_preprocess_data():
    """Loads and returns the training and testing datasets saved in Step 3."""
    try:
        logging.info("Loading datasets...")
        x_train = pd.read_csv("/kaggle/working/x_train.csv")
        x_test = pd.read_csv("/kaggle/working/x_test.csv")
        y_train = pd.read_csv("/kaggle/working/y_train.csv").squeeze()
        y_test = pd.read_csv("/kaggle/working/y_test.csv").squeeze()

        logging.info("Data loaded successfully.")
        return x_train, x_test, y_train, y_test
    
    except Exception as e:
        logging.error(f"Error loading data: {e}")
        raise

## **Step 7: Creating the Preprocessing and Training Pipeline**

This function defines and returns a complete preprocessing and training pipeline. The pipeline is composed of the following steps:

1. **Identify Column Types**:
   - Separates columns into **numerical** and **categorical** categories.

2. **Define Preprocessing for Numerical Features**:
   - **Imputation**: Fills missing values with the mean of the column using `SimpleImputer`.
   - **Scaling**: Standardizes numerical features using `StandardScaler`.

3. **Define Preprocessing for Categorical Features**:
   - **Imputation**: Fills missing values with the most frequent category using `SimpleImputer`.
   - **One-Hot Encoding**: Encodes categorical features into binary (one-hot) vectors using `OneHotEncoder`.

4. **Combine Preprocessing Steps**:
   - Uses `ColumnTransformer` to apply the appropriate transformations to numerical and categorical columns.

5. **Build the Full Pipeline**:
   - Adds a `RandomForestClassifier` as the classifier to the pipeline with pre-defined hyperparameters (`n_estimators=100`, `max_depth=20`, `class_weight='balanced'`, `random_state=42`).
   - Ensures the entire workflow, from preprocessing to training, can be executed in a single pipeline.

The output pipeline can be directly used to preprocess data and train the classifier in one step.


In [7]:
def create_pipeline(x_train):
    """Creates a preprocessing pipeline and returns it."""
    try:
        logging.info("Creating preprocessing pipeline...")
        numeric_columns = x_train.select_dtypes(include=['float64', 'int64']).columns
        categorical_columns = x_train.select_dtypes(include=['object', 'bool']).columns

        # Define preprocessing steps
        numeric_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ])

        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ])

        # Column transformer
        preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, numeric_columns),
                ('cat', categorical_transformer, categorical_columns)
            ]
        )

        # Pipeline
        pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('classifier', RandomForestClassifier(random_state=42, class_weight='balanced', n_estimators=100, max_depth=20))
        ])

        return pipeline
        
    except Exception as e:
        logging.error(f"Error creating pipeline: {e}")
        raise

## **Visualizing the Effect of `max_depth` on Model Accuracy**

This function demonstrates the impact of varying the `max_depth` parameter on the accuracy of the RandomForestClassifier. By testing different tree depths, we can identify the optimal depth that balances both train and test accuracy.

#### Key Steps:
1. **Define Depth Range**:
   - The function evaluates tree depths `[3, 5, 10, 15, 20]`.

2. **Pipeline Setup**:
   - A pipeline is created for each depth value using `create_pipeline()`.
   - The `max_depth` parameter of the classifier is adjusted dynamically.

3. **Train and Evaluate**:
   - The model is trained on the training set for each depth value.
   - Accuracy is computed for both the training and testing datasets.

4. **Visualization**:
   - Results are plotted with tree depth on the x-axis and accuracy on the y-axis.
   - Separate lines are plotted for **Train Accuracy** and **Test Accuracy**.

#### Output:
The chart illustrates the trade-off between underfitting (low depth) and overfitting (high depth). A suitable `max_depth` is the point where test accuracy plateaus or peaks while the gap to train accuracy stays small.

Use this visualization as a guide when choosing `max_depth` for your RandomForest model. Because the depths are compared on the test set here, the test accuracy at the chosen depth is an optimistic estimate; a separate validation split or cross-validation gives a cleaner basis for tuning.
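
One way to compare depths without touching the test set is cross-validation on the training data. The sketch below uses scikit-learn's `validation_curve` on synthetic data from `make_classification`; the dataset, the depth grid, and the small forest size are assumptions chosen to keep the example fast, not settings from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the training set (not the competition data)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

depths = [3, 5, 10]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=20, random_state=42),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=3,
    scoring="accuracy",
)

# Each row corresponds to one depth, each column to one cross-validation fold
mean_val = val_scores.mean(axis=1)
best_depth = depths[int(np.argmax(mean_val))]
print("Mean validation accuracy per depth:", dict(zip(depths, mean_val.round(3))))
print("Depth with the highest mean validation accuracy:", best_depth)
```

The held-out test set then stays untouched until the final evaluation, so its accuracy remains an honest estimate.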


In [8]:
def visualize_optimal_depth(x_train, y_train, x_test, y_test):
    """
    Visualizes the effect of max_depth on model accuracy to find the optimal depth.
    Args:
        x_train: Training features.
        y_train: Training labels.
        x_test: Testing features.
        y_test: Testing labels.
    """
    try:
        print("Visualizing the effect of max_depth on model accuracy...")

        depths = [3, 5, 10, 15, 20]  # Depth values to test
        train_scores = []
        test_scores = []

        for depth in depths:
            # Create a pipeline with the current depth
            pipeline = create_pipeline(x_train)
            pipeline.named_steps['classifier'].set_params(max_depth=depth)
            
            # Train the model
            pipeline.fit(x_train, y_train)
            
            # Evaluate on train and test sets
            train_scores.append(pipeline.score(x_train, y_train))
            test_scores.append(pipeline.score(x_test, y_test))

        # Plot the results
        plt.figure(figsize=(10, 6))
        plt.plot(depths, train_scores, label='Train Accuracy', marker='o')
        plt.plot(depths, test_scores, label='Test Accuracy', marker='o')
        plt.title("Effect of max_depth on Model Accuracy")
        plt.xlabel("Tree Depth")
        plt.ylabel("Accuracy")
        plt.legend()
        plt.grid()
        plt.tight_layout()
        plt.show()

    except Exception as e:
        logging.error(f"Error visualizing optimal depth: {e}")
        raise


## **Step 8: Training and Evaluating the Model**

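This function trains the `RandomForestClassifier` using the provided pipeline, prunes low-importance features, retrains, and evaluates performance on the test dataset. The key steps are:

1. **Training the Model**:
   - The pipeline, which includes both preprocessing and the classifier, is trained on the `x_train` and `y_train` datasets.

2. **Dropping Low-Importance Features**:
   - Feature importances are read from the fitted classifier, and features below `importance_threshold` (default: 0.01) are dropped from `x_train` and `x_test`.
   - Importances are reported per transformed feature, so only names that match an original column (in practice, the numerical ones) are removed; low-importance one-hot categories stay in place because their names do not match any input column.
   - A fresh pipeline is then created and trained on the reduced feature set.

3. **Model Evaluation**:
   - The retrained model makes predictions (`y_pred`) on the test dataset (`x_test`).
   - Calculates performance metrics:
     - **Accuracy**: The proportion of correct predictions over the total predictions.
     - **F1 Score**: The harmonic mean of precision and recall, computed per class and averaged with weights proportional to class support (`average='weighted'`).

4. **Generate Classification Report**:
   - Displays precision, recall, F1-score, and support for each class, providing a detailed breakdown of the model's performance.

5. **Logging and Output**:
   - Logs the accuracy and F1 score for tracking.
   - Returns the retrained pipeline for further use.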

This function ensures the model's performance is quantified and provides insights into its effectiveness in predicting the target variable.
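
The weighted F1 score used here can be reproduced by hand: compute F1 per class, then average with weights equal to each class's support. The labels below are invented for the example.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical true and predicted routes (assumption for illustration only)
y_true = ["GO", "GO", "GO", "GO", "SLANT", "SLANT", "FLAT", "FLAT"]
y_pred = ["GO", "GO", "GO", "SLANT", "SLANT", "GO", "FLAT", "SLANT"]
labels = ["FLAT", "GO", "SLANT"]

# Per-class F1 (harmonic mean of that class's precision and recall)
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)

# Support = number of true samples in each class
support = np.array([y_true.count(label) for label in labels])

manual_weighted = np.average(per_class_f1, weights=support)
sklearn_weighted = f1_score(y_true, y_pred, labels=labels, average="weighted")

print("Per-class F1:", dict(zip(labels, per_class_f1.round(3))))
print("Manual weighted F1:", round(manual_weighted, 4))
print("sklearn weighted F1:", round(sklearn_weighted, 4))
```

Since the weights follow class frequency, frequent routes dominate this number; `average='macro'` is the usual alternative when every route should count equally.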


In [9]:
def train_and_evaluate_model(pipeline, x_train, x_test, y_train, y_test, importance_threshold=0.01):
    """
    Trains and evaluates a RandomForestClassifier model, dropping low-importance features.
    Args:
        pipeline: The preprocessing and training pipeline.
        x_train: Training features.
        x_test: Testing features.
        y_train: Training labels.
        y_test: Testing labels.
        importance_threshold: Minimum importance value for a feature to be retained.
    """
    try:
        logging.info("Training RandomForestClassifier model...")
        pipeline.fit(x_train, y_train)

        # Extract feature importances
        rf_model = pipeline.named_steps['classifier']
        feature_importances = rf_model.feature_importances_
        feature_names = (
            pipeline.named_steps['preprocessor'].transformers_[0][2].tolist() +
            pipeline.named_steps['preprocessor'].transformers_[1][1].get_feature_names_out().tolist()
        )
        importance_df = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})
        importance_df = importance_df.sort_values('importance', ascending=False)

        # Drop low-importance features
        low_importance_features = importance_df[importance_df['importance'] < importance_threshold]['feature'].tolist()
        logging.info(f"Dropping {len(low_importance_features)} low-importance features: {low_importance_features}")

        # Update datasets
        x_train = x_train.drop(columns=low_importance_features, errors='ignore')
        x_test = x_test.drop(columns=low_importance_features, errors='ignore')

        # Recreate and train pipeline
        pipeline = create_pipeline(x_train)
        pipeline.fit(x_train, y_train)

        logging.info("Evaluating model on the testing set...")
        y_pred = pipeline.predict(x_test)

        # Evaluate performance
        test_accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        logging.info(f"Testing Set Accuracy: {test_accuracy}")
        logging.info(f"F1 Score: {f1}")

        print("Testing Set Accuracy:", test_accuracy)
        print("F1 Score:", f1)
        print("\nClassification Report:")
        print(classification_report(y_test, y_pred))

        return pipeline

    except Exception as e:
        logging.error(f"Error training or evaluating model: {e}")
        raise


## **Step 9: Visualizing Model Results and Performance**

This function generates detailed visualizations to evaluate the model's predictions, understand feature importance, and uncover key insights relevant to NFL data analysis. The key steps include:

### 1. **Confusion Matrix**:
   - A heatmap displays the confusion matrix, highlighting the comparison between **true labels** and **predicted labels**.
   - This helps identify which routes are commonly misclassified by the model.

### 2. **Classification Report Heatmap**:
   - A heatmap visualization of the classification report, including **precision**, **recall**, and **f1-score** for each route.
   - Offers a detailed breakdown of the model's performance on each class (route).

### 3. **Feature Importance**:
   - Extracts the `RandomForestClassifier` from the pipeline using `pipeline.named_steps['classifier']`.
   - Computes the importance of each feature in the model.
   - Combines feature names from the preprocessing pipeline (numerical and one-hot encoded categorical features) with their corresponding importance values.
   - Plots the top `n` features (default: 10) in descending order of importance to identify the most influential factors for predicting receiver routes.

### 4. **Distribution of Predicted Routes**:
   - Displays the frequency of predicted routes using a count plot.
   - Highlights the balance (or imbalance) in the predicted route classes, providing insights into model predictions.

### 5. **Log and Handle Errors**:
   - Logs the visualization process for better transparency and debugging.
   - Catches and logs any exceptions that may occur during the visualization process.

### **Insights**:
- These visualizations allow us to:
  - **Evaluate Model Performance**: Understand how well the model predicts each route and where it might struggle.
  - **Feature Insights**: Identify key features that drive the model's predictions, which can inform future feature engineering.
  - **Class Distribution**: Check for any biases in predictions across different routes.
  - **Guide Improvements**: Use the results to refine the model and address any weaknesses.

This step is essential for interpreting the model's results and ensuring its predictions align with the goals of the NFL Big Data Bowl project.
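
One detail worth checking when labelling a confusion-matrix heatmap: `confusion_matrix` orders its rows and columns by sorted label unless `labels` is passed, whereas `Series.unique()` returns values in order of appearance. The sketch below uses invented labels and skips the plotting to show how an explicit label list keeps the two aligned.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted routes (assumption for illustration only)
y_test = pd.Series(["SLANT", "GO", "GO", "FLAT"])
y_pred = ["SLANT", "GO", "FLAT", "FLAT"]

# One explicit label order, shared by the matrix and its tick labels
labels = sorted(set(y_test) | set(y_pred))
cm = confusion_matrix(y_test, y_pred, labels=labels)

# Rows are true labels, columns are predicted labels
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)

# Order of first appearance, which differs from the sorted order above
appearance_order = list(y_test.unique())
print("Matrix order:    ", labels)
print("Appearance order:", appearance_order)
```

Passing the same `labels` list to both `confusion_matrix` and the heatmap's `xticklabels`/`yticklabels` avoids silently mislabelled rows and columns.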


In [10]:
def visualize_model_results(x_train, y_test, y_pred, pipeline, top_n_features=10):
    """
    Generates visualizations of model performance, feature importance, and NFL-relevant data insights.
    Args:
        x_train: Training features used in the model (not used by the current plots; kept for a consistent call signature).
        y_test: True labels for the test set.
        y_pred: Predicted labels from the model.
        pipeline: Trained machine learning pipeline.
        top_n_features: Number of top features to display in feature importance plots.
    """
    try:
        logging.info("Generating visualizations...")

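        # Confusion Matrix
        # confusion_matrix sorts labels by default, while y_test.unique() returns them
        # in order of appearance, so fix one label order and reuse it for the ticks.
        labels = sorted(set(y_test) | set(y_pred))
        plt.figure(figsize=(10, 7))
        cm = confusion_matrix(y_test, y_pred, labels=labels)
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)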
        plt.title("Confusion Matrix")
        plt.xlabel("Predicted Labels")
        plt.ylabel("True Labels")
        plt.tight_layout()
        plt.show()

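        # Classification Report Heatmap
        report = classification_report(y_test, y_pred, output_dict=True)
        report_df = pd.DataFrame(report).transpose()
        # Drop the scalar 'accuracy' row (it would repeat one value in every column)
        # and the 'support' column, whose scale differs from the 0-1 metrics.
        heatmap_df = report_df.drop(index="accuracy", errors="ignore").drop(columns="support")
        plt.figure(figsize=(10, 6))
        sns.heatmap(heatmap_df, annot=True, cmap="YlGnBu", fmt=".2f")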
        plt.title("Classification Report Heatmap")
        plt.tight_layout()
        plt.show()

        # Feature Importance
        rf_model = pipeline.named_steps['classifier']
        feature_importances = rf_model.feature_importances_
        feature_names = (
            pipeline.named_steps['preprocessor'].transformers_[0][2].tolist() +
            pipeline.named_steps['preprocessor'].transformers_[1][1].get_feature_names_out().tolist()
        )
        importance_df = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})
        importance_df = importance_df.sort_values('importance', ascending=False)

        plt.figure(figsize=(12, 8))
        sns.barplot(x="importance", y="feature", data=importance_df.head(top_n_features), palette="viridis")
        plt.title("Top Feature Importance")
        plt.xlabel("Feature Importance")
        plt.ylabel("Feature Names")
        plt.tight_layout()
        plt.show()

        # Distribution of Predicted Routes
        plt.figure(figsize=(10, 6))
        sns.countplot(x=y_pred, palette="muted")
        plt.title("Distribution of Predicted Routes")
        plt.xlabel("Predicted Route")
        plt.ylabel("Frequency")
        plt.tight_layout()
        plt.show()

    except Exception as e:
        logging.error(f"Error during visualization: {e}")
        raise
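
One detail to watch when labelling a confusion-matrix heatmap: `confusion_matrix` orders rows and columns by sorted label value unless `labels` is passed, whereas `Series.unique()` returns values in order of first appearance. The sketch below, on made-up route labels, shows how the two orders can differ and how fixing one label list keeps the matrix and its tick labels in agreement.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels; "OUT" appears before "GO" and "FLAT" in y_true.
y_true = pd.Series(["OUT", "GO", "GO", "FLAT", "OUT"])
y_hat = pd.Series(["OUT", "GO", "FLAT", "FLAT", "GO"])

# Fix the label order explicitly and reuse it for rows, columns, and ticks.
labels = sorted(set(y_true) | set(y_hat))
cm = confusion_matrix(y_true, y_hat, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)

# Order of first appearance, which is what unique() would give as tick labels.
appearance_order = list(y_true.unique())
print("sorted order:    ", labels)
print("appearance order:", appearance_order)
```

Wrapping the matrix in a labelled `DataFrame` before plotting also lets `sns.heatmap` pick up the tick labels directly from the index and columns.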


## **Step 10: Main Execution Function**

The `main` function orchestrates the complete workflow of the machine learning pipeline, ensuring each component integrates seamlessly. The key steps include:

### 1. **Load and Preprocess Data**:
   - Calls `load_and_preprocess_data` to load the training and testing datasets.
   - Handles preprocessing, including encoding categorical variables and scaling numerical features.

### 2. **Create the Pipeline**:
   - Utilizes `create_pipeline` to define the machine learning pipeline, combining feature preprocessing steps with a `RandomForestClassifier`.

### 3. **Train and Evaluate the Model**:
   - Invokes `train_and_evaluate_model` to train the pipeline on the training data and evaluate its performance on the test data.
   - Measures key performance metrics such as accuracy, precision, recall, and F1-score.

### 4. **Visualize Model Results**:
   - Executes `visualize_model_results` to generate the diagnostic plots:
     - **Confusion Matrix**: Highlights misclassifications.
     - **Classification Report Heatmap**: Summarizes precision, recall, and F1-scores.
     - **Feature Importance Plot**: Identifies the top features driving predictions.
     - **Predicted Route Distribution**: Shows the balance of route predictions.

### 5. **Visualize Optimal Depth**:
   - Runs `visualize_optimal_depth` after the result plots to analyze the impact of the `max_depth` parameter on model accuracy.
   - Identifies the depth value that balances model complexity and performance.

### 6. **Error Handling and Logging**:
   - Logs the progress of each step and captures any exceptions encountered during execution, ensuring traceability and debugging support.

### **Insights**:
This function ties the earlier steps into a single workflow: it trains the pipeline, evaluates it on the test set, and plots both the results and the effect of `max_depth`. Keeping the orchestration in one place makes it straightforward to rerun the whole experiment after changing a feature or a parameter.
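
The body of `visualize_optimal_depth` is not shown in this section. As a rough sketch of what a `max_depth` analysis of this kind might look like, the example below sweeps a few depths on synthetic data; the dataset, the depth grid, and the variable names are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the route data (4 classes).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

depth_scores = {}
for depth in [2, 4, 8, 16, None]:  # None lets each tree grow fully
    model = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=0
    )
    model.fit(X_tr, y_tr)
    depth_scores[depth] = accuracy_score(y_te, model.predict(X_te))

# Depth with the highest held-out accuracy in this sweep.
best_depth = max(depth_scores, key=depth_scores.get)
print(depth_scores)
print("best max_depth:", best_depth)
```

Choosing the depth on the same split that is later reported as test accuracy gives an optimistic estimate; a separate validation split or cross-validation avoids that.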


In [11]:
def main():
    """Main execution function."""
    try:
        x_train, x_test, y_train, y_test = load_and_preprocess_data()
        pipeline = create_pipeline(x_train)

        # Train and evaluate model
        pipeline = train_and_evaluate_model(pipeline, x_train, x_test, y_train, y_test)

        # Visualize model results
        visualize_model_results(
            x_train=x_train,                  # Training features
            y_test=y_test,                    # True labels for the test set
            y_pred=pipeline.predict(x_test),  # Predicted labels from the pipeline
            pipeline=pipeline,                # The trained pipeline object
            top_n_features=10,                # Number of top features in the importance plot
        )

        # Visualize optimal depth
        visualize_optimal_depth(x_train, x_test, y_train, y_test)

        logging.info("Program execution completed successfully.")

    except Exception as e:
        logging.error(f"Program terminated with an exception: {e}")

if __name__ == "__main__":
    main()

## **Updated Final Summary and Insights**

This project implemented a machine learning pipeline, utilizing a Random Forest Classifier, to predict NFL receiver routes using data from the NFL Big Data Bowl 2025. The integration of multiple datasets on players, games, and tracking data provided comprehensive insights into route prediction. Below are the key takeaways and updates based on the latest results.

---

### **Performance Metrics**
- **Testing Set Accuracy**: 57.54%
- **F1 Score**: 57.55%
- **Detailed Results**:
  - **Top-performing routes**:
    - **WHEEL**: Precision (92%), Recall (65%), F1 (76%) – highest-performing route despite limited data.
    - **FLAT**: Precision (80%), Recall (61%), F1 (69%) – solid overall performance.
  - **Moderate-performing routes**:
    - **CROSS**: Precision (55%), Recall (67%), F1 (60%).
    - **GO**: Balanced performance with F1 (60%).
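  - **Routes with lopsided precision and recall**:
    - **ANGLE**: Precision (46%), Recall (89%), F1 (60%) – over-predicted: most true ANGLE routes are caught, but more than half of the ANGLE predictions are wrong.
    - **OUT**: Precision (65%), Recall (39%), F1 (48%) – under-predicted: the model misses about six in ten true OUT routes.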
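
The F1 values above follow from precision and recall as their harmonic mean, F1 = 2PR / (P + R). The short check below recomputes them from the rounded percentages listed in this summary; the helper name `f1_from` is introduced only for this example.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) as reported in the summary above.
reported = {
    "WHEEL": (0.92, 0.65),
    "FLAT": (0.80, 0.61),
    "CROSS": (0.55, 0.67),
    "ANGLE": (0.46, 0.89),
    "OUT": (0.65, 0.39),
}

f1_by_route = {route: f1_from(p, r) for route, (p, r) in reported.items()}
for route, f1 in f1_by_route.items():
    print(f"{route}: F1 = {f1:.3f}")
```

Because the inputs are already rounded to whole percentages, the recomputed values can differ from the reported F1 scores by up to about one percentage point.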

---

### **Key Visualizations**
1. **Confusion Matrix**:
   - Revealed consistent challenges in distinguishing between similar routes, such as **OUT** vs. **CROSS**.
2. **Feature Importance**:
   - Key influential features:
     - **Distance from Line of Scrimmage (LOS)**
     - **Seconds Left in Game**
     - **Receiver Alignment**
   - Dropping low-importance features helped mitigate overfitting and marginally improved accuracy.
3. **Classification Report Heatmap**:
   - Highlighted varying levels of precision and recall across different routes.
4. **Predicted Route Distribution**:
   - Showed a reasonably balanced prediction distribution, though some routes remain underrepresented.

---

### **Key Challenges**
- **Class Imbalance**:
  - Underrepresented routes (e.g., **WHEEL**, **SCREEN**) have few examples, so their per-class scores are less stable; WHEEL's 92% precision, for example, comes with only 65% recall.
- **Feature Overlap**:
  - Routes with similar feature profiles, such as **POST** and **GO**, continued to confuse the model.
- **Performance Plateau**:
  - Accuracy and F1 score improvements are incremental, suggesting diminishing returns with the current model setup.

---

### **Improvements and Next Steps**
1. **Enhanced Feature Engineering**:
   - Include advanced features such as player acceleration, proximity to defenders, and motion patterns.
2. **Alternative Models**:
   - Test gradient boosting methods (e.g., XGBoost, LightGBM) for improved decision boundaries.
3. **Data Augmentation**:
   - Generate synthetic samples for underrepresented classes like **WHEEL** and **SCREEN**.
4. **Domain Knowledge Integration**:
   - Leverage football-specific features such as route clusters, defensive coverage type, and quarterback tendencies.
5. **Hyperparameter Optimization**:
   - Conduct a more granular grid search to fine-tune key parameters such as `max_depth` and `n_estimators`.
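
As a starting point for the hyperparameter search listed above, the sketch below runs `GridSearchCV` over `max_depth` and `n_estimators` on synthetic, imbalanced data. The dataset and the grid values are assumptions for illustration. It also tries `class_weight="balanced"`, which is a lighter-weight alternative to the synthetic-sample augmentation proposed in the list, not an implementation of it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced synthetic stand-in for the route data (3 classes).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

param_grid = {
    "max_depth": [4, 8, None],
    "n_estimators": [25, 50],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",  # macro-F1 gives rare classes the same weight as common ones
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

When the classifier sits inside a `Pipeline` step named `classifier`, as in this notebook, the grid keys would need the step prefix, for example `classifier__max_depth`.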

---

### **Relevance to NFL Teams**
This model demonstrates practical utility for route prediction, offering insights into:
- **Game Strategy**: Identifying tendencies and counter-strategies for opponent receiver routes.
- **Player Performance Evaluation**: Assessing route execution and tendencies under different conditions.
- **In-Game Decision Support**: Enhancing real-time analytics to inform strategic adjustments during games.

---
