# Predicting Finishing Times of Greyhounds in UK Races

The goal of this project is to predict the finishing time of dogs in UK greyhound races based on data from their previous races. The objective is to minimize the mean squared error on our prediction data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder
import torch
from transformers import RobertaTokenizer, RobertaModel
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [None]:
!pip install transformers torch

In [None]:
!pip install tqdm

## Understanding the Data

In [None]:
df = pd.read_csv("df.csv")
unseendf = pd.read_csv("unseendf_example.csv")

In [None]:
df.head()

## Data Preprocessing

In [None]:
# Finding the number of null values in each column
num_null_count = df.isnull().sum()
print(num_null_count)

The output shows that 158 rows have no comment; these are filled with a placeholder later on.

In [None]:
df['birthdate'] = pd.to_datetime(df['birthdate'])
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
df['days_since_last_race'] = (df['date2'] - df['date1']).dt.days  # .dt.days converts the timedelta values to integer days
display(df['days_since_last_race'])

We have added a column that measures the number of days since the previous race. This is especially important because we want to know whether a dog has had ample rest before the following race.
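As a small illustration of the date arithmetic above, here is a sketch on two invented rows. The column names mirror the notebook, but the dates are made up for the example.

```python
import pandas as pd

# Toy rows: column names mirror the notebook, the dates are invented
toy = pd.DataFrame({
    "date1": ["2023-01-01", "2023-02-10"],
    "date2": ["2023-01-08", "2023-03-01"],
})
toy["date1"] = pd.to_datetime(toy["date1"])
toy["date2"] = pd.to_datetime(toy["date2"])

# Subtracting two datetime columns yields Timedelta values;
# .dt.days extracts the whole-day component as integers
toy["days_since_last_race"] = (toy["date2"] - toy["date1"]).dt.days
print(toy["days_since_last_race"].tolist())
```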

In [None]:
df['speed'] = df['distance1'] / df['time1'] # Adding a speed column
df['distance_ratio'] = df['distance2'] / df['distance1']
df.head()

In [None]:
# Changing Categorical to Numerical Variables
# One-hot encoding for 'stadium'
df = pd.get_dummies(df, columns=['stadium'], drop_first=True)

# Check the result
df.head()

Next, we convert the trap values to numerical data using the Ordinal Encoder. This preserves the order of the traps, which suits our Random Forest model.

In [None]:
encoder = OrdinalEncoder()
df[['trap1', 'trap2']] = encoder.fit_transform(df[['trap1', 'trap2']])
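To see what the encoder does to each column, here is a minimal sketch on invented trap numbers. It assumes the traps are stored as integers, which the notebook does not confirm. Codes are assigned per column from the sorted values seen during fitting, so the same trap number can receive different codes in `trap1` and `trap2` if a column does not contain every trap.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented trap numbers; the real columns may hold different values
toy = pd.DataFrame({"trap1": [3, 1, 6], "trap2": [2, 2, 5]})

enc = OrdinalEncoder()
encoded = enc.fit_transform(toy[["trap1", "trap2"]])

# categories_ lists, per column, the sorted values that map to codes 0, 1, 2, ...
print(enc.categories_)
print(encoded)
```

When every trap appears in both columns, the encoding simply shifts the trap numbers to start at 0.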

The birthdate column and the two race-date columns are no longer needed, so we drop them.

In [None]:
df = df.drop(columns=['birthdate', 'date1', 'date2'])

In [None]:
df['comment1'] = df['comment1'].fillna("No comment")

To use the race comments as features, we need to turn the text into numbers that capture its meaning, including its tone. We do this with a RoBERTa model. Since we have many data points (~533,000), we take a random sample of 10%, which still leaves plenty of data for building our model.

In [None]:
df_sample = df.sample(frac=0.1, random_state=42)

We load RoBERTa through the transformers library. The `roberta-base` checkpoint is a strong general-purpose language model; here we use it to extract features from the comments rather than to classify sentiment directly.

In [None]:
# Load tokenizer and model
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

Now we pass the tokenized data through the RoBERTa model to extract embeddings. RoBERTa's output consists of:

    last_hidden_state: Contains embeddings for all tokens in the sequence.
    pooler_output: The hidden state of the first token (`<s>`, which plays the role of BERT's [CLS]) after an extra dense layer and tanh activation.
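The shape bookkeeping behind taking the first-token embedding can be sketched without loading the model. The array below is a random stand-in for `last_hidden_state`, not real model output; the batch size and sequence length are arbitrary, and 768 is the hidden size of `roberta-base`.

```python
import numpy as np

# Stand-in for outputs.last_hidden_state: (batch, sequence length, hidden size).
# Random numbers replace real model output in this sketch.
batch_size, seq_len, hidden_size = 4, 12, 768
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Position 0 along the sequence axis is the first token of every sequence
first_token_embeddings = last_hidden_state[:, 0, :]
print(first_token_embeddings.shape)  # (4, 768)
```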

### NLP on the Comments


In [None]:
def process_comments_in_batches(comments, batch_size=16, max_length=128):
    embeddings = []
    for i in range(0, len(comments), batch_size):
        batch = comments[i:i + batch_size]
        tokens = tokenizer(
            batch.tolist(),
            max_length=max_length,
            padding=True,
            truncation=True,
            return_tensors="pt"
        )
        tokens = {key: value.to(device) for key, value in tokens.items()}  # Move tokens to our device
        with torch.no_grad():
            outputs = model(**tokens)
        batch_embeddings = outputs.last_hidden_state[:, 0, :]  # Extracting the CLS token, which summarizes the entire comment
        embeddings.append(batch_embeddings) # Adding that token to our embeddings

    return torch.vstack(embeddings).cpu().numpy()  # Stack all batches into one tensor, move it to the CPU, and convert it to a NumPy array

# Using our Function
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Process the comments in batches and save embeddings
comment_embeddings = process_comments_in_batches(df_sample['comment1'], batch_size=16)


We need to create a DataFrame for the Embeddings. The comment_embeddings variable is a NumPy array containing the embeddings extracted from the comments. Each row represents a comment, and each column corresponds to a specific dimension of the embedding (768 dimensions of RoBERTa).


In [None]:
# Create a DataFrame for the embeddings
embedding_columns = [f'comment_embedding_{i}' for i in range(comment_embeddings.shape[1])]
embeddings_df = pd.DataFrame(comment_embeddings, columns=embedding_columns)

# Reset index of the DataFrame to align with embeddings
df_sample = df_sample.reset_index(drop=True)

# Concatenate the embeddings with our original DataFrame
df_sample = pd.concat([df_sample, embeddings_df], axis=1)
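The index reset matters because `pd.concat` with `axis=1` aligns rows by index label, not by position. The sketch below uses invented values and a two-column stand-in for the embeddings to show what happens with and without the reset.

```python
import numpy as np
import pandas as pd

# A sampled frame keeps its original (shuffled) index labels, as after df.sample
sample = pd.DataFrame({"time1": [29.5, 30.1, 28.9]}, index=[17, 4, 42])
# A frame built from a NumPy array gets a fresh 0..n-1 index
emb = pd.DataFrame(np.ones((3, 2)), columns=["comment_embedding_0", "comment_embedding_1"])

# Without the reset, no labels match: rows are unioned and padded with NaN
misaligned = pd.concat([sample, emb], axis=1)
# With the reset, rows pair up by position as intended
aligned = pd.concat([sample.reset_index(drop=True), emb], axis=1)

print(misaligned.shape, aligned.shape)
```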

In [None]:
# Define features and target
X = df_sample.drop(columns=['time2', 'comment1'])
y = df_sample['time2']  # Our Target variable

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(f"X_train shape: {X_train.shape}")  # Rows and features
print(f"y_train shape: {y_train.shape}")  # Target size

In [None]:
# Apply PCA
pca = PCA(n_components=100, random_state=42)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train the Random Forest Regressor on PCA-reduced data
regression_model = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42, n_jobs=-1)
regression_model.fit(X_train_pca, y_train)

# Make predictions on the PCA-reduced test data
y_pred = regression_model.predict(X_test_pca)

# Compare Actual vs. Predicted
comparison_df = pd.DataFrame({
    'Actual': y_test.reset_index(drop=True),  # Reset index for alignment
    'Predicted': y_pred
})

print(comparison_df.head())
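One way to judge whether a given number of components is enough is to look at the cumulative explained variance. The sketch below runs on synthetic data (three latent factors plus small noise), not on the race data, and `pca_demo` is a name introduced only for this example. Keep in mind that PCA is sensitive to feature scale, so unscaled columns with large ranges can dominate the leading components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic data: 200 rows, 20 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(200, 20))

pca_demo = PCA(n_components=5, random_state=42)
pca_demo.fit(X_demo)

# Cumulative share of total variance captured by the first k components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(cumulative)
```

On the notebook's fitted `pca` object, the same attribute (`pca.explained_variance_ratio_`) would show how much variance the 100 retained components keep.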


In [None]:
# Make predictions on the test set
y_pred = regression_model.predict(X_test_pca)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Optional: Calculate and print Root Mean Squared Error (RMSE)
rmse = mse ** 0.5
print(f"Root Mean Squared Error: {rmse}")

In [None]:
print(df['time2'].var())

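Because the variance of the target is much greater than the mean squared error, the model explains a large share of the variation in finishing times, so we can conclude that it is fairly accurate. Note that the variance here is computed on the full dataset while the MSE comes from the test split of the 10% sample, so the comparison is approximate.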
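The comparison between variance and MSE can be made precise with the coefficient of determination, R² = 1 - MSE / Var(y), where both quantities are computed on the same rows and the variance is the population variance. The sketch below uses invented times to show that this matches `r2_score` from scikit-learn; pandas' `.var()` defaults to the sample variance (`ddof=1`), so it differs slightly.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented finishing times and predictions, for illustration only
y_true = np.array([29.4, 30.1, 28.7, 31.2, 29.9])
y_hat = np.array([29.6, 29.9, 28.9, 30.8, 30.0])

mse_demo = mean_squared_error(y_true, y_hat)
var_demo = y_true.var()  # population variance (ddof=0) on the same rows as the MSE

# R^2 is the share of target variance explained by the predictions
r2_manual = 1 - mse_demo / var_demo
print(r2_manual, r2_score(y_true, y_hat))
```

In the notebook, `r2_score(y_test, y_pred)` would give the equivalent figure for the test split.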

## Working with the Unseen Dataset for Races

In [None]:
# Load unseen data
unseendf = pd.read_csv("unseendf.csv")

# Preprocess unseen data
unseendf['date1'] = pd.to_datetime(unseendf['date1'])
unseendf['date2'] = pd.to_datetime(unseendf['date2'])
unseendf['days_since_last_race'] = (unseendf['date2'] - unseendf['date1']).dt.days
unseendf['speed'] = unseendf['distance1'] / unseendf['time1']
unseendf['distance_ratio'] = unseendf['distance2'] / unseendf['distance1']

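# One-hot encode the 'stadium' column.
# drop_first is omitted here: it would drop whichever stadium sorts first in the
# unseen data, which may not be the one dropped during training. The baseline
# stadium column is discarded later when the columns are aligned with training.
unseendf = pd.get_dummies(unseendf, columns=['stadium'])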

# Ensure all expected stadium columns are present
expected_stadium_columns = [
    'stadium_Crayford', 'stadium_Doncaster', 'stadium_Harlow',
       'stadium_Henlow', 'stadium_Hove', 'stadium_Kinsley', 'stadium_Monmore',
       'stadium_Newcastle', 'stadium_Nottingham', 'stadium_Oxford',
       'stadium_Pelaw Grange', 'stadium_Perry Barr', 'stadium_Romford',
       'stadium_Sheffield', 'stadium_Suffolk Downs', 'stadium_Sunderland',
       'stadium_Swindon', 'stadium_Towcester', 'stadium_Yarmouth' # Include all stadiums seen during training
]

for col in expected_stadium_columns:
    if col not in unseendf.columns:
        unseendf[col] = 0

# Encode the traps with the encoder fitted on the training data
# (assumes the unseen file has the same trap1/trap2 columns and values)
unseendf[['trap1', 'trap2']] = encoder.transform(unseendf[['trap1', 'trap2']])

# Embed the unseen comments the same way as the training comments,
# instead of leaving the embedding columns filled with zeros
# (assumes the unseen file has a comment1 column)
unseendf['comment1'] = unseendf['comment1'].fillna("No comment")
unseen_embeddings = process_comments_in_batches(unseendf['comment1'], batch_size=16)
unseen_embeddings_df = pd.DataFrame(unseen_embeddings, columns=embedding_columns, index=unseendf.index)
unseendf = pd.concat([unseendf, unseen_embeddings_df], axis=1)

# Align with the exact columns (and order) the PCA was fitted on;
# a hard-coded list can drift from X_train and break pca.transform.
# Any column still missing from the unseen data is filled with 0.
X_unseen = unseendf.reindex(columns=X_train.columns, fill_value=0)

# Apply PCA and predict
X_unseen_pca = pca.transform(X_unseen)
unseendf['predtime'] = regression_model.predict(X_unseen_pca)

# Save predictions
unseendf.to_csv("mypred.csv", index=False)
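The column-alignment step can be sketched on a toy example. The stadium names come from the notebook, while the speeds and the `train_X` and `unseen_X` names are invented for illustration. The idea is to one-hot encode the unseen data in full and then reindex it to the training columns, so that stadiums missing from the unseen data become all-zero columns and the column order matches.

```python
import pandas as pd

# Invented rows; only the stadium names are taken from the notebook
train = pd.DataFrame({"speed": [16.1, 15.8, 16.4], "stadium": ["Hove", "Romford", "Crayford"]})
unseen = pd.DataFrame({"speed": [16.0, 15.9], "stadium": ["Romford", "Romford"]})

# Training encoding drops the first category (Crayford) as the baseline
train_X = pd.get_dummies(train, columns=["stadium"], drop_first=True)

# Encode every stadium present in the unseen data, then align to the training columns
unseen_X = pd.get_dummies(unseen, columns=["stadium"])
unseen_X = unseen_X.reindex(columns=train_X.columns, fill_value=0)

print(list(unseen_X.columns))
```

Had `drop_first=True` been used on the unseen frame, its only stadium (Romford) would have been dropped and every row would have looked like the baseline stadium.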

