This section contains the final code used to evaluate the model on the test data. The pipeline is modularized into functions, one for each step that improved model performance during experimentation. The input features and hyperparameters used here are the most effective combinations observed in those experiments. Hyperparameter tuning was carried out, but its improvement over the baseline model was marginal. The focus therefore stayed on well-engineered, race-relative features and appropriate probability calibration rather than on extensive tuning, which risks overfitting.

## IMPORT LIBRARIES

In [65]:
# -----------------------------
# IMPORT ALL THE LIBRARIES
# -----------------------------

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from scipy.special import softmax
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb

from sklearn.metrics import log_loss, brier_score_loss
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import classification_report
from tabulate import tabulate

## FUNCTIONS TO EXECUTE MODELLING TASKS

In [66]:
# -----------------------------
# FUNCTION TO LOAD AND PREPROCESS DATA
# -----------------------------

def load_preprocess_data(mode,path):
  data = pd.read_csv(path)
  try:
    # Input features: several candidates were created and tested during feature engineering; the ones below improved model performance.

    # Relative speed
    data['SpeedRel1'] = data.groupby('Race_ID')['Speed_PreviousRun'].transform(lambda x: (x - x.mean()) / x.std())
    data['SpeedRel2'] = data.groupby('Race_ID')['Speed_2ndPreviousRun'].transform(lambda x: (x - x.mean()) / x.std())

    # Rank of MarketOdds within race (lower = more favored)
    data['OddsRank1'] = data.groupby('Race_ID')['MarketOdds_PreviousRun'].rank(method='min', ascending=True)
    data['OddsRank2'] = data.groupby('Race_ID')['MarketOdds_2ndPreviousRun'].rank(method='min', ascending=True)

    # Z-score of TrainerRating within each race
    data['TrainerRating_rel'] = data.groupby('Race_ID')['TrainerRating'].transform(lambda x: (x - x.mean()) / x.std())

    # Relative Damsire rating
    data['DamsireRel'] = data.groupby('Race_ID')['DamsireRating'].transform(lambda x: (x - x.mean()) / x.std())

    # In train mode the target is derived from 'Position'; in test mode the data only has input features
    if mode == 'train':
      data['Win'] = (data['Position'] == 1).astype(int)

    # -------------------------------------------------------------------------------------------------------------------

    # Handling Missing Values:
    # We avoid blindly imputing global means or fixed values, because most columns containing NaN values (e.g., speed, sire rating, damsire rating) are horse-specific.
    # Missing values are imputed with the per-horse median, which better preserves each horse's individual performance trends and characteristics.
    # TODO: revisit this for columns that are not really horse-specific.

    cols_with_na = data.columns[data.isna().sum() > 0].tolist()

    for col in cols_with_na:
      data[col] = data.groupby('Horse')[col].transform(lambda x: x.fillna(x.median()))

    # Drop incomplete races (those where missing values still persist after imputation)
    incomplete_races = data[data[cols_with_na].isna().any(axis=1)]['Race_ID'].unique()

    if len(incomplete_races) > 0:
      data = data[~data['Race_ID'].isin(incomplete_races)].reset_index(drop=True)



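  except Exception as e:
    # Report the underlying error and re-raise, so half-processed data is never returned silently
    print(f'OOPS! Wrong data: {e}')
    raise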

  return data
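
The race-relative features above can be illustrated on a toy frame. This is a minimal sketch, not the pipeline itself: the helper name `race_zscore` and the toy values are assumptions made for illustration. It also adds a guard the original lambdas do not have, returning 0.0 when a race has no spread (standard deviation of zero or undefined), which would otherwise produce NaN or infinite values.

```python
import numpy as np
import pandas as pd

def race_zscore(values: pd.Series) -> pd.Series:
    """Z-score within one race; hypothetical helper that returns 0.0 when the race has no spread."""
    std = values.std()  # sample standard deviation (ddof=1), as in the pipeline's lambdas
    if pd.isna(std) or std == 0:
        return pd.Series(0.0, index=values.index)
    return (values - values.mean()) / std

# Toy data: two races, the second with identical speeds (zero variance)
toy = pd.DataFrame({
    'Race_ID': [1, 1, 1, 2, 2],
    'Speed_PreviousRun': [10.0, 12.0, 14.0, 9.0, 9.0],
    'MarketOdds_PreviousRun': [3.0, 5.0, 3.0, 2.0, 8.0],
})

# Relative speed within each race
toy['SpeedRel1'] = toy.groupby('Race_ID')['Speed_PreviousRun'].transform(race_zscore)

# Rank of market odds within each race (lower = more favored, ties share the lowest rank)
toy['OddsRank1'] = toy.groupby('Race_ID')['MarketOdds_PreviousRun'].rank(method='min', ascending=True)

print(toy)
```

Whether 0.0 is the right fill for a zero-variance race is a design choice; dropping such races, as the pipeline does for rows that remain missing, is an equally defensible alternative.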

In [67]:
# -----------------------------
# FUNCTION TO EXTRACT AND TRANSFORM FEATURES
# -----------------------------

def handle_features(data,mode='train'):

  # Selected Features based on testing during feature engineering
  INP = [
    # Historical form / fitness
    'SpeedRel1','SpeedRel2','OddsRank1','OddsRank2','daysSinceLastRun',
    # Ratings (aggregated skill indicators)
    'TrainerRating_rel','JockeyRating','SireRating','DamsireRel',
    # Demographics
    'Age',
    # Race configuration (distance = performance factor)
    'distanceYards','Going'
  ]

  # Map each 'Going' category to a rank (keeps the category information as an ordered number)
  going_rank = {
    'Firm': 1,
    'Good To Firm': 2,
    'Good': 3,
    'Good To Soft': 4,
    'Soft': 5,
    'Heavy': 6,
    'Standard': 7
  }

  X = data[INP].copy()  # work on a copy so the assignments below do not trigger SettingWithCopyWarning

  # Encode 'Going'
  X['Going'] = X['Going'].map(going_rank)

  # Scale numeric features (note: the scaler is refit on every call, so train and test data are standardized independently)
  scaler = StandardScaler()

  numeric_features = [col for col in INP if col != 'Going']
  X[numeric_features] = scaler.fit_transform(X[numeric_features])

  if mode == 'test':
    return X

  y = data['Win'].values

  return X,y
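
Because `handle_features` fits a fresh `StandardScaler` on whatever frame it receives, the training and test sets end up on slightly different scales. A common alternative, sketched here under assumed toy data and not taken from the original notebook, is to fit the scaler once on the training features and reuse it to transform the test features.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the numeric feature columns (values are illustrative assumptions)
train = pd.DataFrame({'Age': [3.0, 5.0, 7.0], 'distanceYards': [1000.0, 1200.0, 1400.0]})
test = pd.DataFrame({'Age': [5.0, 9.0], 'distanceYards': [1200.0, 1600.0]})

scaler = StandardScaler()

# Learn mean and standard deviation from the training data only
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns, index=train.index)

# Apply the same training statistics to the test data (no refit)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train_scaled)
print(test_scaled)
```

In this sketch a test horse with the training-mean age of 5 maps to exactly 0, which would not hold if the scaler were refit on the test frame.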


In [68]:
# -----------------------------
# TRAIN THE MODEL
# -----------------------------

def model_fit(X,y):
  final_model = LogisticRegression(
    C=100,
    penalty='l2',
    solver='saga',
    max_iter=1000,
    random_state=42,
    class_weight='balanced')
  final_model.fit(X,y)
  return final_model
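
The `class_weight='balanced'` setting matters here because winners are rare: only one horse per race is labelled 1. As a small illustration on assumed toy labels, scikit-learn's `compute_class_weight` shows the weights this setting implies, namely `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: one winner among ten runners (illustrative assumption)
y_toy = np.array([0] * 9 + [1])

weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_toy)
class_weights = dict(zip([0, 1], weights))

# Losers get 10 / (2 * 9), the single winner gets 10 / (2 * 1)
print(class_weights)
```

Up-weighting the minority class in this way tends to raise recall on winners at the cost of inflated raw probabilities, which is one reason to renormalize predictions within each race afterwards.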

In [69]:
# -----------------------------
# EVALUATION METRICS TABLE
# -----------------------------

def tabulate_result_metrics(val_data,y_val,y_pred):
  # Confusion matrix
  tn, fp, fn, tp = confusion_matrix(y_val, y_pred, labels=[0, 1]).ravel()  # labels fixed so a 2x2 matrix is always returned

  sensitivity = tp / (tp + fn) if (tp + fn) else 0  # Recall
  specificity = tn / (tn + fp) if (tn + fp) else 0

  # Log Loss / Brier using softmax-normalized probs
  true_labels = val_data['Win'].values
  softmax_probs_pred = val_data['softmax_prob'].values

  logloss = log_loss(true_labels, softmax_probs_pred)
  brier = brier_score_loss(true_labels, softmax_probs_pred)
  bal_acc = balanced_accuracy_score(true_labels, y_pred)

  # Format as a pretty table
  results = [
    ['Balanced Accuracy', f'{bal_acc*100:.4f}%'],
    ['Sensitivity (Recall)', f'{sensitivity*100:.4f}%'],
    ['Specificity', f'{specificity*100:.4f}%'],
    ['Log Loss (Softmax)', f'{logloss:.4f}'],
    ['Brier Score (Softmax)', f'{brier:.4f}']
  ]

  return results
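
To make the metric definitions concrete, the following sketch recomputes them on a handful of made-up labels and probabilities (all values are illustrative assumptions, not results from the model). It shows that balanced accuracy is the mean of sensitivity and specificity, and that the Brier score is the mean squared difference between predicted probability and outcome.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, brier_score_loss,
                             confusion_matrix, log_loss)

# Toy labels and predicted win probabilities
y_true = np.array([1, 0, 0, 0, 1, 0])
probs = np.array([0.7, 0.2, 0.6, 0.1, 0.4, 0.3])
y_hat = (probs >= 0.5).astype(int)  # hard predictions at the 0.5 threshold

# Confusion matrix entries, with labels fixed so the shape is always 2x2
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # share of actual winners that were predicted as winners
specificity = tn / (tn + fp)  # share of actual losers that were predicted as losers
bal_acc = balanced_accuracy_score(y_true, y_hat)

# Probability-based metrics use the probabilities, not the hard predictions
logloss = log_loss(y_true, probs)
brier = brier_score_loss(y_true, probs)

print(f'sensitivity={sensitivity:.3f} specificity={specificity:.3f} balanced_accuracy={bal_acc:.3f}')
print(f'log_loss={logloss:.4f} brier={brier:.4f}')
```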

In [70]:
# -----------------------------
# SOFTMAX FUNCTION FOR NORMALIZED RACE PROBS
# -----------------------------

def softmax_probs(x):
    e_x = np.exp(x - np.max(x))  # numerical stability
    return e_x / e_x.sum()

# -----------------------------
# FUNCTION TO TEST MODEL
# -----------------------------

def test_model(model,test_data_path):
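  # Prepare test data (mode='train' is passed because the 'Position' column is needed to build the 'Win' labels for evaluation)
  test_data = load_preprocess_data(mode='train', path=test_data_path)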

  X_test,y_test = handle_features(test_data,mode='train')

  # Evaluate on test set
  test_probs = model.predict_proba(X_test)[:, 1]

  # Attach to test set
  test_data['raw_prob'] = test_probs

  # Apply softmax per race
  test_data['softmax_prob'] = test_data.groupby('Race_ID')['raw_prob'].transform(lambda x: softmax_probs(x.values))

  # Predicted class from raw (not softmax)
  y_pred = (test_data['raw_prob'] >= 0.5).astype(int)

  # Print results
  results = tabulate_result_metrics(test_data,y_test,y_pred)
  print(tabulate(results, headers=['Metric', 'Value'], tablefmt='grid'))

  # Ensure column is named correctly
  test_data['Predicted_Probability'] = test_data['softmax_prob']

  # Select only required columns
  submission_df = test_data[['Race_ID', 'Position', 'Predicted_Probability']].copy()

  # Save to CSV
  submission_df.to_csv("test_predictions.csv", index=False)

  print("Saved to test_predictions.csv")
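
The per-race normalization step can be checked in isolation. The sketch below uses assumed toy probabilities and verifies that the softmax output sums to 1 within each race. For comparison only, it also computes a simple sum-normalization (`sum_norm_prob`, a hypothetical column that is not part of the original pipeline): because softmax is applied to values already confined to [0, 1], it tends to compress the gap between favourite and outsiders more than dividing by the race total does.

```python
import numpy as np
import pandas as pd

def softmax_probs(x):
    e_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e_x / e_x.sum()

# Toy raw model probabilities for two races (illustrative values)
toy = pd.DataFrame({
    'Race_ID': [1, 1, 1, 2, 2],
    'raw_prob': [0.8, 0.3, 0.1, 0.6, 0.4],
})

# Softmax within each race, as in test_model
toy['softmax_prob'] = toy.groupby('Race_ID')['raw_prob'].transform(lambda x: softmax_probs(x.values))

# Alternative for comparison: divide each raw probability by its race total
toy['sum_norm_prob'] = toy.groupby('Race_ID')['raw_prob'].transform(lambda x: x / x.sum())

# Both columns should total 1.0 within every race
race_totals = toy.groupby('Race_ID')[['softmax_prob', 'sum_norm_prob']].sum()

print(toy)
print(race_totals)
```

Which normalization gives better log loss is an empirical question for the actual data; the sketch only shows that the two behave differently.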

## MAIN EXECUTION

In [71]:
# -----------------------------
# MAIN EXECUTION
# -----------------------------

data = load_preprocess_data(mode='train',path='/content/trainData.csv')
X,y = handle_features(data,mode='train')
model = model_fit(X,y)
test_model(model,'/content/testData.csv')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X['Going'] = X['Going'].map(going_rank)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[numeric_features] = scaler.fit_transform(X[numeric_features])

+-----------------------+----------+
| Metric                | Value    |
+=======================+==========+
| Balanced Accuracy     | 62.8455% |
+-----------------------+----------+
| Sensitivity (Recall)  | 66.5543% |
+-----------------------+----------+
| Specificity           | 59.1366% |
+-----------------------+----------+
| Log Loss (Softmax)    | 0.3310   |
+-----------------------+----------+
| Brier Score (Softmax) | 0.0943   |
+-----------------------+----------+
Saved to test_predictions.csv
