# Tennis Ball Tracking with Vision Transformer

In this notebook, we will outline the steps to fine-tune a Vision Transformer (ViT) model for the task of tracking a tennis ball in video frames. The goal is to predict the x, y coordinates of the tennis ball and the event type (flying, bouncing, being hit) in each frame.

The dataset we will use is provided by TrackNet and contains broadcast TV tennis match videos along with accompanying .csv files that annotate the x, y location of the tennis ball and the event type in each frame.

The steps we will follow are:

1. **Data Preparation:** Extract frames from the videos and save them as individual images. The labels for each image (the x, y coordinates of the tennis ball and the event type) will be extracted from the accompanying .csv file.

2. **Data Preprocessing:** Preprocess the images to be in the format expected by the Vision Transformer model. This typically involves resizing the images to the expected input size of the model (224x224 for the base Vision Transformer model), and normalizing the pixel values.

3. **Model Preparation:** Load the pre-trained Vision Transformer model and add one output layer per task on top of it. For the x, y coordinate prediction task, we will add a fully connected layer with 2 output units (for the x and y coordinates). For the event type prediction task, we will add a fully connected layer with 3 output units (for the 3 event types); a softmax converts these outputs into class probabilities.

4. **Training Loop:** Define a training loop where we feed the preprocessed images to the model, compute the loss for both tasks (using a suitable loss function for each task), and update the model's weights based on the total loss. The total loss will be a weighted sum of the two individual losses, where the weights reflect the importance of each task.

5. **Evaluation:** After training the model for a certain number of epochs, we will evaluate its performance on a validation set. We will compute the loss and accuracy for each task, and adjust the model's hyperparameters or the training process as needed to improve its performance.

6. **Inference:** Once we are satisfied with the model's performance, we can use it to predict the x, y coordinates and event type of the tennis ball in new video frames.

Let's get started!

## Step 1: Data Preparation

In this step, we will extract the frames from the videos and save them as individual images. The labels for each image (the x, y coordinates of the tennis ball and the event type) will be extracted from the accompanying .csv file.

We will use the OpenCV library to read the video files and extract the frames. The pandas library will be used to read the .csv file and extract the labels.

### Input normalization
When using a pre-trained model such as the Vision Transformer `google/vit-base-patch16-224`, it is important to match the preprocessing that was applied to the data during the model's original training. According to its model card, this checkpoint expects images normalized across the RGB channels with a mean of (0.5, 0.5, 0.5) and a standard deviation of (0.5, 0.5, 0.5). This differs from the ImageNet per-channel statistics (mean (0.485, 0.456, 0.406), standard deviation (0.229, 0.224, 0.225)) that many other pre-trained vision models use.

So, while scaling the pixel values to the range [0, 1] is not wrong per se, it might not yield the best results with this specific pre-trained model. The model is likely to perform better if the input images are normalized the same way its training data was.

The preprocessing below therefore scales the pixels to [0, 1] and then applies the checkpoint's mean and standard deviation:

In [5]:
# Import necessary libraries
import cv2
import numpy as np
import pandas as pd
from pathlib import Path
from tqdm import tqdm


In [1]:

# Define the size to resize the images to
image_size = (224, 224)

# Define the normalization mean and standard deviation expected by google/vit-base-patch16-224
mean = np.array([0.5, 0.5, 0.5])
std = np.array([0.5, 0.5, 0.5])

# Define the path to the dataset directory
dataset_dir = Path('Dataset/Dataset')

def process_image(image_file):
    try:
        # Read the image
        image = cv2.imread(str(image_file))

        # Resize the image
        image = cv2.resize(image, image_size)

        # Convert the image to RGB and normalize the pixel values
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) / 255.0
        image = (image - mean) / std

        # Save the preprocessed image
        np.save(image_file.with_suffix('.npy'), image)
    except Exception as e:
        print(f'Error processing image {image_file}: {e}')

def process_labels(csv_file):
    try:
        labels_df = pd.read_csv(csv_file)

        # Adjust the x, y coordinates to match the new image size
        original_image_size = cv2.imread(str(next(csv_file.parent.glob('*.jpg')))).shape[:2][::-1]
        labels_df['x-coordinate'] *= image_size[0] / original_image_size[0]
        labels_df['y-coordinate'] *= image_size[1] / original_image_size[1]

        # Save the preprocessed labels for each image individually 
        for index, row in labels_df.iterrows():
            label_data = row[['visibility', 'x-coordinate', 'y-coordinate', 'status']].to_numpy()
            np.save(csv_file.parent / f'{row["file name"]}_labels.npy', label_data.astype(np.float32))

        # Save the updated dataframe
        labels_df.to_csv(csv_file.with_name(f'{csv_file.stem}_updated.csv'), index=False)
    except Exception as e:
        print(f'Error processing labels {csv_file}: {e}')


# Loop over the game directories
for game_dir in tqdm(list(dataset_dir.glob('game*')), desc='Processing games'):
    # Loop over the clip directories in each game directory
    for clip_dir in game_dir.glob('Clip*'):
        # Loop over the image files in each clip directory
        for image_file in clip_dir.glob('*.jpg'):
            process_image(image_file)

        # Read the .csv file in each clip directory
        csv_files = list(clip_dir.glob('Label.csv'))
        if csv_files:
            process_labels(csv_files[0])

# Now, we have the preprocessed images and labels.
# We can proceed to the next step.


Processing games:   0%|                                                                              | 0/10 [00:24<?, ?it/s]


KeyboardInterrupt: 

## Step 2: Create PyTorch Datasets and DataLoaders

In this step, we will create PyTorch `Dataset` objects for the training and validation sets. A `Dataset` is a PyTorch abstraction that allows us to encapsulate our data and provide a way to access it. We will also create `DataLoader` objects, which allow us to load data in batches during training, shuffle the data, and parallelize the data loading process.

In [1]:
# Import necessary libraries
from torch.utils.data import Dataset, DataLoader
import numpy as np
import torch
import os
import glob
from torch import tensor

class TennisDataset(Dataset):
    def __init__(self, dataframe, root_dir, transform=None):
        self.dataframe = dataframe
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
    
        row = self.dataframe.iloc[idx]
    
        # Extract the base file name from the file name and the directories
        base_file_name = row["file name"].split('.jpg_labels')[0]
        clip_directory = row["clip directory"]
        game_directory = row["game directory"]
    
        # Construct the path to the image and labels
        image_path = self.root_dir / game_directory / clip_directory / f'{base_file_name}.npy'
        labels_path = self.root_dir / game_directory / clip_directory / f'{base_file_name}.jpg_labels.npy'
    
        # Load the preprocessed image (H, W, C) and its labels
        # (visibility, x-coordinate, y-coordinate, status) saved in Step 1.
        # The labels were stored as float32, so no text-to-number mapping is needed.
        image = np.load(image_path)
        labels = np.load(labels_path)

        # Convert the labels to a PyTorch tensor
        labels = tensor(labels, dtype=torch.float32)

        # Reorder the image from (H, W, C) to (C, H, W), the layout PyTorch models expect
        image = image.transpose((2, 0, 1))

        # Convert the image to a PyTorch tensor
        image = tensor(image, dtype=torch.float32)

        if self.transform:
            image = self.transform(image)
    
        return image, labels
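
To check what a `DataLoader` hands to the model without touching the disk, a stand-in dataset can be batched the same way. This is only a sketch: `RandomFrames` is a hypothetical class that yields random tensors with the shapes `TennisDataset` is expected to produce (a 3x224x224 image and 4 label values).

```python
import torch
from torch.utils.data import Dataset, DataLoader


class RandomFrames(Dataset):
    """Hypothetical stand-in that yields random tensors shaped like the real samples."""

    def __init__(self, num_samples=10):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        image = torch.randn(3, 224, 224)                 # (C, H, W)
        labels = torch.tensor([1.0, 112.0, 112.0, 0.0])  # visibility, x, y, status
        return image, labels


# Batches of 4 samples; the last batch is smaller because 10 is not a multiple of 4
loader = DataLoader(RandomFrames(), batch_size=4, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)
```

The default collate function stacks the individual samples along a new first dimension, which is why the model receives a batch of shape (batch, 3, 224, 224).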




## Step 3: Split the Dataset into Training and Validation Sets

In this step, we will split the dataset into training and validation sets. The training set is used to train the model, while the validation set is used to evaluate the model's performance during training. This helps us to monitor the model for overfitting, which occurs when the model performs well on the training data but poorly on new, unseen data.

In [2]:
from sklearn.model_selection import train_test_split
from pathlib import Path
import pandas as pd

# Define the path to the dataset directory
dataset_dir = Path('Dataset/Dataset')

# Create a list to store the file names and their parent directories
data = []

# Loop over the game directories
for game_dir in dataset_dir.glob('game*'):
    # Loop over the clip directories in each game directory
    for clip_dir in game_dir.glob('Clip*'):
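        # Loop over the preprocessed image files in each clip directory
        for image_file in clip_dir.glob('*.npy'):
            # Skip the label files, which are also stored as .npy;
            # otherwise every frame would be listed twice
            if image_file.name.endswith('_labels.npy'):
                continue
            # Append the base file name and its parent directories to the list
            data.append({'file name': os.path.splitext(image_file.name)[0], 'clip directory': clip_dir.name, 'game directory': game_dir.name})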

# Create a DataFrame from the list
df = pd.DataFrame(data)

# Split the DataFrame into training and validation sets
train_df, val_df = train_test_split(df, test_size=0.2)

# Now you can use train_df and val_df to create your datasets
train_dataset = TennisDataset(train_df, root_dir=dataset_dir)
val_dataset = TennisDataset(val_df, root_dir=dataset_dir)

# And create your DataLoaders
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=32)
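
A frame-level random split places neighbouring, nearly identical frames of the same clip in both sets, which can make the validation loss look better than it is. One possible alternative, sketched here under the assumption that the DataFrame has the `game directory` and `clip directory` columns built above, is to split by clip with scikit-learn's `GroupShuffleSplit`. The helper name `split_by_clip` and the toy data are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_clip(df, test_size=0.2, seed=0):
    """Split frames so that every clip ends up entirely in one of the two sets."""
    # One group label per frame: the (game, clip) pair it belongs to
    groups = df['game directory'] + '/' + df['clip directory']
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, val_idx = next(splitter.split(df, groups=groups))
    return df.iloc[train_idx], df.iloc[val_idx]


# Toy frame index with the same columns as the DataFrame built above
toy_df = pd.DataFrame([
    {'file name': f'{frame:04d}', 'clip directory': f'Clip{clip}', 'game directory': f'game{game}'}
    for game in (1, 2) for clip in (1, 2, 3) for frame in range(4)
])
train_part, val_part = split_by_clip(toy_df)
```

With this split, `test_size` refers to the fraction of clips rather than the fraction of frames, so the two sets may differ somewhat from an exact 80/20 split in frame count.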


### Check one of the image .npy files

In [3]:
! pwd

/mnt/d/tennis/tracknet


In [6]:
test_path = "Dataset/Dataset/game1/Clip1/0000.npy"
data = np.load(test_path, allow_pickle=True)
print(data.shape, data.dtype)


(224, 224, 3) float64
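
The saved arrays are normalized floats, so they cannot be displayed directly. To inspect a frame visually, the normalization has to be undone first. The sketch below assumes the array was produced as `(pixels / 255 - mean) / std`, as in Step 1; `denormalize` is a hypothetical helper, and the `mean` and `std` values in the round trip are placeholders, so pass whichever values were used during preprocessing.

```python
import numpy as np


def denormalize(image, mean, std):
    """Undo (image - mean) / std and return an 8-bit RGB array for display."""
    restored = image * np.asarray(std) + np.asarray(mean)
    return (np.clip(restored, 0.0, 1.0) * 255).round().astype(np.uint8)


# Round trip on a synthetic frame (placeholder statistics, not the notebook's values)
mean, std = (0.4, 0.5, 0.6), (0.2, 0.25, 0.3)
original = np.random.default_rng(0).integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
normalized = (original / 255.0 - mean) / std
recovered = denormalize(normalized, mean, std)
```

The recovered array is in RGB channel order, so it can be passed to a plotting library as is.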


### Step 3B: Data Augmentation (Optional)
The code below sketches horizontal-flip augmentation. When a frame is mirrored, the x-coordinate of the ball must be mirrored as well, so the flip has to happen where the labels are available. torchvision's `RandomHorizontalFlip` does not report whether it flipped a given image, so the flip is done inside the dataset instead: a subclass of `TennisDataset` draws a random number for each sample and, with probability 0.5, mirrors both the image and the x-coordinate.

Please note that this is a simple example and might not work perfectly for your specific use case. It assumes the label layout saved in Step 1 (visibility, x-coordinate, y-coordinate, status), with the x-coordinate expressed in pixels of the resized 224x224 frame. If your data is different, you would need to adjust the code accordingly. The code is wrapped in a string so that it does not run by default.

In [7]:
'''
class AugmentedTennisDataset(TennisDataset):
    """TennisDataset that randomly mirrors each frame together with its x label."""

    def __init__(self, dataframe, root_dir, flip_prob=0.5):
        super().__init__(dataframe, root_dir)
        self.flip_prob = flip_prob

    def __getitem__(self, idx):
        # image: (3, H, W); labels: visibility, x-coordinate, y-coordinate, status
        image, labels = super().__getitem__(idx)

        # Draw a fresh random number per sample so the flip decision is known here
        if torch.rand(1).item() < self.flip_prob:
            image = torch.flip(image, dims=[2])  # mirror along the width axis
            # If the image was flipped, flip the x-coordinate of the ball
            labels[1] = (image.shape[2] - 1) - labels[1]

        return image, labels

# Create the datasets; only the training set is augmented
#train_dataset = AugmentedTennisDataset(train_df, root_dir=dataset_dir)
#val_dataset = TennisDataset(val_df, root_dir=dataset_dir)
'''

## Step 4: Model Preparation

In this step, we will load the pre-trained Vision Transformer model and add one output head per task on top of it. For the x, y coordinate prediction task, we add a fully connected layer with 2 output units (for the x and y coordinates). For the event type prediction task, we add a fully connected layer with 3 output units (for the 3 event types). The event head outputs raw logits: `nn.CrossEntropyLoss` applies the softmax internally during training, so no softmax layer is added to the model.

We will use the Hugging Face Transformers library to load the pre-trained Vision Transformer model.

In [8]:
# Import necessary libraries
from transformers import ViTModel
import torch
import torch.nn as nn

class TennisViT(nn.Module):
    """Pre-trained ViT backbone with one output head per task."""

    def __init__(self, checkpoint='google/vit-base-patch16-224', num_events=3):
        super().__init__()
        # from_pretrained loads the trained weights; ViTModel(config) alone
        # would build the same architecture with random weights
        self.backbone = ViTModel.from_pretrained(checkpoint)
        hidden_size = self.backbone.config.hidden_size
        self.coord_head = nn.Linear(hidden_size, 2)           # x, y coordinates
        self.event_head = nn.Linear(hidden_size, num_events)  # event type logits

    def forward(self, images):
        # Use the final hidden state of the [CLS] token as the frame representation
        features = self.backbone(pixel_values=images).last_hidden_state[:, 0]
        return self.coord_head(features), self.event_head(features)

model = TennisViT()

# Move the model to the GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Now, we have the modified Vision Transformer model.
# We can proceed to the next step.


## Step 4B: Define the Loss Function and Optimizer

In this step, we will define the loss functions and the optimizer. A loss function measures how well the model's predictions match the actual values. Because the two tasks differ, each gets its own loss: mean squared error for the coordinates (regression) and cross-entropy for the event type (classification). The optimizer is used to update the model's parameters based on the gradients of the total loss with respect to the parameters.

In [9]:
# Import necessary libraries
import torch.optim as optim

# Define one loss function per task
coord_criterion = nn.MSELoss()
event_criterion = nn.CrossEntropyLoss()

# Relative weight of the event loss in the total loss (a value to tune)
event_weight = 1.0

def compute_loss(coord_pred, event_logits, labels):
    """Weighted sum of the two task losses; returns None if no frame in the batch is usable."""
    # Label columns saved in Step 1: visibility, x-coordinate, y-coordinate, status.
    # Keep only frames with a visible ball and complete labels.
    usable = (labels[:, 0] > 0) & torch.isfinite(labels[:, 1:]).all(dim=1)
    if not usable.any():
        return None
    coord_loss = coord_criterion(coord_pred[usable], labels[usable, 1:3])
    event_loss = event_criterion(event_logits[usable], labels[usable, 3].long())
    return coord_loss + event_weight * event_loss

# Define the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Now, we have the loss functions and optimizer.
# We can proceed to the next step.


## Step 5: Training Loop

In this step, we will define a training loop where we feed the preprocessed images to the model, compute the loss for both tasks (using a suitable loss function for each task), and update the model's weights based on the total loss. The total loss is a weighted sum of the two individual losses, where the weights reflect the importance of each task.

We will use the PyTorch library to define the training loop.

In [10]:
# Import necessary libraries
from tqdm import tqdm

# Define the number of epochs
num_epochs = 10

# Loop over the epochs
for epoch in tqdm(range(num_epochs), desc='Epochs'):
    # Train
    model.train()
    train_loss, train_count = 0.0, 0
    for images, labels in tqdm(train_dataloader, desc='Training', leave=False):
        # Move the images and labels to the GPU if available
        images = images.to(device)
        labels = labels.to(device)

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass: the model returns one output per task
        coord_pred, event_logits = model(images)
        loss = compute_loss(coord_pred, event_logits, labels)
        if loss is None:  # no usable frame in this batch
            continue

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

        train_loss += loss.item() * images.size(0)
        train_count += images.size(0)

    train_loss /= max(train_count, 1)

    # Validate
    model.eval()
    val_loss, val_count = 0.0, 0
    with torch.no_grad():
        for images, labels in tqdm(val_dataloader, desc='Validation', leave=False):
            # Move the images and labels to the GPU if available
            images = images.to(device)
            labels = labels.to(device)

            # Forward pass
            coord_pred, event_logits = model(images)
            loss = compute_loss(coord_pred, event_logits, labels)
            if loss is None:
                continue

            val_loss += loss.item() * images.size(0)
            val_count += images.size(0)

    val_loss /= max(val_count, 1)

    print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')

## Step 6: Evaluation

After training the model for a certain number of epochs, we will evaluate its performance on a validation set. We will compute the loss and accuracy for each task, and adjust the model's hyperparameters or the training process as needed to improve its performance.

We will use the PyTorch library to evaluate the model.

In [None]:
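# Import necessary libraries
import torch.nn.functional as F

# Define the DataLoader for the validation data
val_loader = val_dataloader  # This is the DataLoader object containing your validation data

# Evaluation loop
model.eval()  # Set the model to evaluation mode
coord_loss_sum, event_loss_sum, correct, count = 0.0, 0.0, 0, 0
with torch.no_grad():  # Do not compute gradients
    for images, labels in val_loader:
        # Move the data to the GPU if available
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass; the model is expected to return (coordinates, event logits)
        coord_pred, event_logits = model(images)

        # Label columns saved in Step 1: visibility, x-coordinate, y-coordinate, status.
        # Keep only frames with a visible ball and complete labels.
        usable = (labels[:, 0] > 0) & torch.isfinite(labels[:, 1:]).all(dim=1)
        if not usable.any():
            continue
        coords, events = labels[usable, 1:3], labels[usable, 3].long()

        # Accumulate the per-task losses and the number of correct event predictions
        coord_loss_sum += F.mse_loss(coord_pred[usable], coords, reduction='sum').item()
        event_loss_sum += F.cross_entropy(event_logits[usable], events, reduction='sum').item()
        correct += (event_logits[usable].argmax(dim=1) == events).sum().item()
        count += int(usable.sum())

# Compute the averages over all usable frames
count = max(count, 1)
print(f'Validation coordinate loss (MSE): {coord_loss_sum / (2 * count):.4f}')
print(f'Validation event loss (cross-entropy): {event_loss_sum / count:.4f}')
print(f'Validation event accuracy: {correct / count:.4f}')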

# Now, we have evaluated the model.
# We can proceed to the next step.


## Step 7: Inference

Once we are satisfied with the model's performance, we can use it to predict the x, y coordinates and event type of the tennis ball in new video frames.

We will use the PyTorch library to perform inference with the model.

In [None]:
# Define the DataLoader for the test data
test_data = ...  # This should be a PyTorch Dataset object that yields preprocessed image tensors
test_loader = DataLoader(test_data, batch_size=32)

# Inference loop
model.eval()  # Set the model to evaluation mode
with torch.no_grad():  # Do not compute gradients
    for images in test_loader:
        # Move the data to the GPU if available
        images = images.to(device)

        # Forward pass; the model is expected to return (coordinates, event logits)
        coord_pred, event_logits = model(images)

        # Predicted ball position, in pixels of the resized 224x224 frame
        xy = coord_pred.cpu().numpy()

        # Predicted event type: the class with the highest logit
        event_type = event_logits.argmax(dim=1).cpu().numpy()

        # Here, you can do whatever you want with the predictions.
        # For example, you can visualize the predictions on the images.

# Now, we have performed inference with the model.
# This is the end of the process.
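
Because the labels were rescaled to the 224x224 frame in Step 1, predicted coordinates live in that resized space. To draw them on the original video frames they need to be scaled back, which is the inverse of the Step 1 adjustment. The sketch below shows one way to do this; `to_original_pixels` is a hypothetical helper, and the 1280x720 frame size is an assumed example, not a value taken from the dataset.

```python
import numpy as np


def to_original_pixels(xy, original_size, model_size=(224, 224)):
    """Map (x, y) predictions from the resized frame back to the original frame.

    original_size and model_size are (width, height) pairs.
    """
    xy = np.asarray(xy, dtype=np.float64)
    scale = np.array([original_size[0] / model_size[0],
                      original_size[1] / model_size[1]])
    return xy * scale


# Two example predictions in the 224x224 frame (made-up values)
predicted = np.array([[112.0, 56.0], [224.0, 224.0]])
restored = to_original_pixels(predicted, original_size=(1280, 720))
```

The x and y axes are scaled independently because resizing to a square frame changes the aspect ratio.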
