**CSCI4050U - Machine Learning - Final Project**

Muhammad Jibran Khan - 100877086

**Problem Formulation:**

We study whether latent "fighter style" embeddings, learned from round-level performance statistics, improve the prediction of UFC fight outcomes and methods of victory.

Using fight_performance, round_stats, strike_breakdown, and strike_stats, we construct per-fight vectors describing a fighter's striking and grappling behavior (e.g., knockdowns, submission attempts, control time, and detailed strike distributions). For each fighter, we order their fights by event.date (via fight and event) and feed the last N fights into a recurrent encoder to produce a time-dependent style embedding *S_i(t)* before each bout.

For each bout, we combine the two fighters' style embeddings with metadata from fighters and career_stats (height, reach, stance, career per-minute stats, etc.), and use a Siamese neural network to predict (1) the probability that fighter A wins and (2) a distribution over method of victory (KO/TKO, Submission, Decision), using labels from fight_performance.result and fight.method. All features are computed using only data from fights with event.date strictly earlier than the target bout, and train/validation/test sets are split by date to avoid information leakage.
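The date-based split described above can be sketched as follows. This is a minimal illustration rather than the project's final pipeline: the helper name `chronological_split`, the split fractions, and the toy `bouts` frame are assumptions; only the `fight_id` and `date` column names come from the fight table.

```python
import pandas as pd

def chronological_split(bouts, date_col="date", train_frac=0.7, val_frac=0.15):
    """Split bouts into train/val/test by date (oldest fights go to train).

    Hypothetical helper: the fractions and column name are illustrative.
    """
    # Parse dates, then order rows from oldest to newest.
    ordered = bouts.assign(**{date_col: pd.to_datetime(bouts[date_col])})
    ordered = ordered.sort_values(date_col, kind="stable").reset_index(drop=True)

    n = len(ordered)
    train_end = int(train_frac * n)
    val_end = int((train_frac + val_frac) * n)
    return ordered.iloc[:train_end], ordered.iloc[train_end:val_end], ordered.iloc[val_end:]

# Toy data standing in for the real fight table (deliberately stored newest-first).
bouts = pd.DataFrame({
    "fight_id": range(1, 11),
    "date": pd.date_range("2020-01-01", periods=10, freq="30D")[::-1],
})
train, val, test = chronological_split(bouts)
```

Fights that share an event date can straddle a boundary with this row-count cut; splitting on explicit date thresholds instead would keep each event in a single set.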

To quantify the value of style embeddings, we compare a metadata-only baseline and an engineered-stat baseline against our embedding-enhanced model, evaluating accuracy, log loss, Brier score, ROC-AUC, and calibration. This allows us to determine whether style representations learned from per-fight statistics meaningfully improve UFC bout prediction.
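As a rough sketch of how the win-probability metrics listed above could be computed, the snippet below implements accuracy, log loss, Brier score, and ROC-AUC in NumPy. The function name `binary_metrics` and the toy predictions are assumptions for illustration; calibration is not covered here, and the pairwise AUC is only practical for small evaluation sets.

```python
import numpy as np

def binary_metrics(y_true, p_win, eps=1e-12):
    """Accuracy, log loss, Brier score and ROC-AUC for predicted win probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities so the logarithms below stay finite.
    p = np.clip(np.asarray(p_win, dtype=float), eps, 1 - eps)

    accuracy = float(np.mean((p >= 0.5) == (y == 1)))
    log_loss = float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    brier = float(np.mean((p - y) ** 2))

    # ROC-AUC as the probability that a random positive is scored above
    # a random negative (ties count as one half).
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    auc = float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg)))

    return {"accuracy": accuracy, "log_loss": log_loss, "brier": brier, "roc_auc": auc}

# Toy labels and predictions, not project results.
y_true = [1, 0, 1, 0]
p_win = [0.9, 0.2, 0.6, 0.7]
metrics = binary_metrics(y_true, p_win)
```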

Imports 

*Make sure to update requirements.txt*

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import torch.nn.functional as F
import torch.optim as optim
from collections import Counter

Load Datasets

In [2]:
DATA_DIR = Path("ufc_datasets/")
fighters          = pd.read_csv(DATA_DIR / "fighters.csv")
round_stats       = pd.read_csv(DATA_DIR / "round_stats.csv")
career_stats      = pd.read_csv(DATA_DIR / "career_stats.csv")
fight_performance = pd.read_csv(DATA_DIR / "fight_performance.csv")
fight             = pd.read_csv(DATA_DIR / "fight.csv")
event             = pd.read_csv(DATA_DIR / "event.csv")
strike_breakdown  = pd.read_csv(DATA_DIR / "strike_breakdown.csv")
strike_stats      = pd.read_csv(DATA_DIR / "strike_stats.csv")

Shape of Datasets

In [3]:
for name, df in [
    ("fighters", fighters),
    ("round_stats", round_stats),
    ("career_stats", career_stats),
    ("fight_performance", fight_performance),
    ("fight", fight),
    ("event", event),
    ("strike_breakdown", strike_breakdown),
    ("strike_stats", strike_stats),
]:
    print(f"{name:17s} shape={df.shape}")
    print("-" * 60)


fighters          shape=(4266, 15)
------------------------------------------------------------
round_stats       shape=(36582, 9)
------------------------------------------------------------
career_stats      shape=(4266, 10)
------------------------------------------------------------
fight_performance shape=(15378, 5)
------------------------------------------------------------
fight             shape=(7689, 9)
------------------------------------------------------------
event             shape=(697, 4)
------------------------------------------------------------
strike_breakdown  shape=(36582, 9)
------------------------------------------------------------
strike_stats      shape=(329238, 4)
------------------------------------------------------------


Columns

In [4]:
print("round_stats columns:\n", round_stats.columns.tolist(), "\n")
print("strike_breakdown columns:\n", strike_breakdown.columns.tolist(), "\n")
print("strike_stats columns:\n", strike_stats.columns.tolist(), "\n")
print("fight_performance columns:\n", fight_performance.columns.tolist(), "\n")
print("fight columns:\n", fight.columns.tolist(), "\n")


round_stats columns:
 ['round_id', 'round_number', 'knockdowns', 'submission_attempts', 'reversals', 'control_time', 'fightp_id', 'takedown_stats_id', 'strike_breakdown_id'] 

strike_breakdown columns:
 ['strike_breakdown_id', 'significant_strikes_id', 'total_strikes_id', 'head_strikes_id', 'body_strikes_id', 'leg_strikes_id', 'distance_strikes_id', 'clinch_strikes_id', 'ground_strikes_id'] 

strike_stats columns:
 ['strike_stats_id', 'landed', 'attempted', 'percentage'] 

fight_performance columns:
 ['fightp_id', 'fighter_name', 'fighters_id', 'fight_id', 'result'] 

fight columns:
 ['fight_id', 'event_name', 'date', 'title_bout', 'method', 'round_ended', 'time_ended', 'referee', 'event_id'] 



In [5]:
# We'll only use landed & attempted
strike_stats_small = strike_stats[["strike_stats_id", "landed", "attempted"]].copy()

sb = strike_breakdown.copy()

# Map categories to id columns
categories = {
    "sig_strike": "significant_strikes_id",
    "total_strike": "total_strikes_id",
    "head_strike": "head_strikes_id",
    "body_strike": "body_strikes_id",
    "leg_strike": "leg_strikes_id",
    "distance_strike": "distance_strikes_id",
    "clinch_strike": "clinch_strikes_id",
    "ground_strike": "ground_strikes_id",
}

for prefix, id_col in categories.items():
    sb = sb.merge(
        strike_stats_small.add_prefix(f"{prefix}_"),
        left_on=id_col,
        right_on=f"{prefix}_strike_stats_id",
        how="left"
    )

drop_cols = [f"{p}_strike_stats_id" for p in categories.keys()]
sb_wide = sb.drop(columns=drop_cols)

print("sb_wide shape:", sb_wide.shape)
print(sb_wide.head(2))


sb_wide shape: (36582, 25)
   strike_breakdown_id  significant_strikes_id  total_strikes_id  \
0                    1                       2                 3   
1                    2                      11                12   

   head_strikes_id  body_strikes_id  leg_strikes_id  distance_strikes_id  \
0                4                5               6                    7   
1               13               14              15                   16   

   clinch_strikes_id  ground_strikes_id  sig_strike_landed  ...  \
0                  8                  9                 13  ...   
1                 17                 18                 29  ...   

   body_strike_landed  body_strike_attempted  leg_strike_landed  \
0                   3                      5                  7   
1                   3                      5                  8   

   leg_strike_attempted  distance_strike_landed  distance_strike_attempted  \
0                     7                      13          

In [6]:
round_full = round_stats.merge(
    sb_wide,
    on="strike_breakdown_id",
    how="left"
)

td_stats = strike_stats_small.add_prefix("td_")

round_full = round_full.merge(
    td_stats,
    left_on="takedown_stats_id",
    right_on="td_strike_stats_id",
    how="left"
).drop(columns=["td_strike_stats_id"])

print("round_full shape:", round_full.shape)
print(round_full.head(2))


round_full shape: (36582, 35)
   round_id  round_number  knockdowns  submission_attempts  reversals  \
0         1             1           0                    0          0   
1         2             2           0                    0          0   

   control_time  fightp_id  takedown_stats_id  strike_breakdown_id  \
0             0          1                  1                    1   
1             0          1                 10                    2   

   significant_strikes_id  ...  leg_strike_landed  leg_strike_attempted  \
0                       2  ...                  7                     7   
1                      11  ...                  8                     8   

   distance_strike_landed  distance_strike_attempted  clinch_strike_landed  \
0                      13                         25                     0   
1                      29                         54                     0   

   clinch_strike_attempted  ground_strike_landed  ground_strike_attempted  \
0

Now we need to aggregate the data.

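Each row of round_full describes one fighter in one round. To get a single fixed-length vector per fight performance (fightp_id), we sum every numeric statistic over all rounds of the fight (the rs_total_* columns), so three-round and five-round fights produce vectors of the same size. We also sum the same statistics separately over an early phase (rounds 1 and 2) and a late phase (round 3 onward), giving rs_*_early and rs_*_late columns that capture how a fighter's output changes as the fight goes on. A fight that ends before round 3 gets zeros for the late phase.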

In [7]:
FIGHTP_ID_COL = "fightp_id"
ROUND_COL     = "round_number"

def build_round_agg(df, id_col, round_col, prefix):
    df = df.copy()

    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

    id_like_cols = {
        id_col,
        round_col,
        "round_id",
        "strike_stats_id",
        "strike_breakdown_id",
    }
    numeric_cols = [c for c in numeric_cols if c not in id_like_cols]

    # Total over all rounds
    total = (
        df.groupby(id_col)[numeric_cols]
          .sum()
          .add_prefix(f"{prefix}_total_")
    )

    # Early vs late
    df["phase"] = np.where(df[round_col] <= 2, "early", "late")
    phase = (
        df.groupby([id_col, "phase"])[numeric_cols]
          .sum()
          .unstack("phase", fill_value=0)
    )

    phase.columns = [
        f"{prefix}_{stat}_{phase_name}"
        for stat, phase_name in phase.columns.to_flat_index()
    ]

    agg = total.join(phase, how="left")
    return agg


In [8]:
round_stats_agg = build_round_agg(
    round_full,
    id_col=FIGHTP_ID_COL,
    round_col=ROUND_COL,
    prefix="rs"
)

print("round_stats_agg shape:", round_stats_agg.shape)
round_stats_agg.head()


round_stats_agg shape: (15378, 93)


fightp_id,rs_total_knockdowns,rs_total_submission_attempts,rs_total_reversals,rs_total_control_time,rs_total_takedown_stats_id,rs_total_significant_strikes_id,rs_total_total_strikes_id,rs_total_head_strikes_id,rs_total_body_strikes_id,rs_total_leg_strikes_id,...,rs_clinch_strike_attempted_early,rs_clinch_strike_attempted_late,rs_ground_strike_landed_early,rs_ground_strike_landed_late,rs_ground_strike_attempted_early,rs_ground_strike_attempted_late,rs_td_landed_early,rs_td_landed_late,rs_td_attempted_early,rs_td_attempted_late
1,0,0,0,0,30,33,36,39,42,45,...,1,0,0,0,0,0,0,0,3,0
2,0,0,0,0,111,114,117,120,123,126,...,2,0,0,0,0,0,0,0,1,0
3,0,1,3,178,192,195,198,201,204,207,...,0,2,2,0,2,0,0,0,0,1
4,0,0,2,622,273,276,279,282,285,288,...,0,1,15,3,18,6,2,2,3,5
5,0,1,0,78,354,357,360,363,366,369,...,2,12,0,0,0,0,0,0,0,0
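The rs_total_*_id columns in the output above (for example rs_total_takedown_stats_id) are sums of foreign keys, because build_round_agg only excludes a fixed list of id names. Those values reflect row numbering, not fight behavior. One possible guard, shown here as a sketch, is to treat every numeric column whose name ends in `_id` as a key. The helper name `stat_columns` and the toy frame are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def stat_columns(df, extra_exclude=()):
    """Return numeric columns that are neither keys (names ending in '_id') nor explicitly excluded."""
    numeric = df.select_dtypes(include=[np.number]).columns
    excluded = set(extra_exclude)
    return [c for c in numeric if not c.endswith("_id") and c not in excluded]

# Toy frame with the same kind of columns as round_full.
toy = pd.DataFrame({
    "round_id": [1, 2],
    "round_number": [1, 2],
    "fightp_id": [1, 1],
    "takedown_stats_id": [1, 10],
    "knockdowns": [0, 1],
    "control_time": [0, 35],
})
cols = stat_columns(toy, extra_exclude=["round_number"])
```

Using a list like `cols` in place of the `numeric_cols` filter would reduce the aggregated feature count, so the shapes printed in this notebook would change accordingly.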


In [9]:
fightp_base = fight_performance.set_index(FIGHTP_ID_COL)

fightp_features = (
    fightp_base
        .join(round_stats_agg, how="left")
        .reset_index()
)

print("fightp_features shape:", fightp_features.shape)
fightp_features.head()


fightp_features shape: (15378, 98)


Unnamed: 0,fightp_id,fighter_name,fighters_id,fight_id,result,rs_total_knockdowns,rs_total_submission_attempts,rs_total_reversals,rs_total_control_time,rs_total_takedown_stats_id,...,rs_clinch_strike_attempted_early,rs_clinch_strike_attempted_late,rs_ground_strike_landed_early,rs_ground_strike_landed_late,rs_ground_strike_attempted_early,rs_ground_strike_attempted_late,rs_td_landed_early,rs_td_landed_late,rs_td_attempted_early,rs_td_attempted_late
0,1,Song Yadong,3631,1,w,0,0,0,0,30,...,1,0,0,0,0,0,0,0,3,0
1,2,Henry Cejudo,629,1,l,0,0,0,0,111,...,2,0,0,0,0,0,0,0,1,0
2,3,Anthony Hernandez,1586,2,w,0,1,3,178,192,...,0,2,2,0,2,0,0,0,0,1
3,4,Brendan Allen,84,2,l,0,0,2,622,273,...,0,1,15,3,18,6,2,2,3,5
4,5,Rob Font,1191,3,w,0,1,0,78,354,...,2,12,0,0,0,0,0,0,0,0


In [10]:
fight_meta = fight.copy()

def parse_time_to_seconds(s):
    if pd.isna(s):
        return np.nan
    try:
        m, sec = str(s).split(":")
        return int(m) * 60 + int(sec)
    except ValueError:
        return np.nan

fight_meta["fight_ended_round"] = fight_meta["round_ended"].fillna(0).astype(int)
fight_meta["fight_duration_sec"] = fight_meta["time_ended"].apply(parse_time_to_seconds)

def to_bool(x):
    if isinstance(x, str):
        return x.strip().lower() in ["true", "1", "t", "yes", "y"]
    return bool(x)

fight_meta["is_title_fight"] = fight_meta["title_bout"].apply(to_bool).astype(int)

fight_meta["ended_early"] = fight_meta["method"].apply(
    lambda m: 0 if isinstance(m, str) and "dec" in m.lower() else 1
).astype(int)

fight_meta_small = fight_meta[[
    "fight_id",
    "fight_ended_round",
    "fight_duration_sec",
    "ended_early",
    "is_title_fight",
    "method"
]]

print("fight_meta_small shape:", fight_meta_small.shape)
fight_meta_small.head()


fight_meta_small shape: (7689, 6)


Unnamed: 0,fight_id,fight_ended_round,fight_duration_sec,ended_early,is_title_fight,method
0,1,3,,0,0,Decision - Unanimous
1,2,3,,0,0,Decision - Unanimous
2,3,3,,0,0,Decision - Split
3,4,1,,1,0,KO/TKO
4,5,3,,0,0,Decision - Split
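In the five rows shown above, fight_duration_sec is NaN every time, which suggests that these time_ended values are either missing or not in the `M:SS` form the parser expects. The sketch below is one way to make the parsing more tolerant and to turn the final-round clock into total elapsed fight time. It rests on two assumptions that should be checked against the data: that time_ended may appear as `M:SS` or `H:MM:SS`, and that every round lasts five minutes. The names `clock_to_seconds`, `elapsed_seconds`, and `ROUND_SECONDS` are hypothetical.

```python
import math

ROUND_SECONDS = 5 * 60  # assumption: five-minute rounds

def clock_to_seconds(value):
    """Parse 'M:SS' or 'H:MM:SS' into seconds; return NaN when the value cannot be parsed."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return math.nan
    parts = str(value).strip().split(":")
    if len(parts) not in (2, 3):
        return math.nan
    try:
        numbers = [int(p) for p in parts]
    except ValueError:
        return math.nan
    seconds = 0
    for n in numbers:
        # Each further field is a finer unit: hours -> minutes -> seconds.
        seconds = seconds * 60 + n
    return seconds

def elapsed_seconds(round_ended, time_ended):
    """Total fight time: completed rounds plus the clock in the final round."""
    clock = clock_to_seconds(time_ended)
    if math.isnan(clock) or round_ended < 1:
        return math.nan
    return (int(round_ended) - 1) * ROUND_SECONDS + clock
```

Inspecting `fight["time_ended"].unique()` first would show which formats actually occur before settling on a parser.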


In [11]:
fightp_all = fightp_features.merge(
    fight_meta_small,
    on="fight_id",
    how="left"
)

print("fightp_all shape:", fightp_all.shape)
fightp_all.head()


fightp_all shape: (15378, 103)


Unnamed: 0,fightp_id,fighter_name,fighters_id,fight_id,result,rs_total_knockdowns,rs_total_submission_attempts,rs_total_reversals,rs_total_control_time,rs_total_takedown_stats_id,...,rs_ground_strike_attempted_late,rs_td_landed_early,rs_td_landed_late,rs_td_attempted_early,rs_td_attempted_late,fight_ended_round,fight_duration_sec,ended_early,is_title_fight,method
0,1,Song Yadong,3631,1,w,0,0,0,0,30,...,0,0,0,3,0,3,,0,0,Decision - Unanimous
1,2,Henry Cejudo,629,1,l,0,0,0,0,111,...,0,0,0,1,0,3,,0,0,Decision - Unanimous
2,3,Anthony Hernandez,1586,2,w,0,1,3,178,192,...,0,0,0,0,1,3,,0,0,Decision - Unanimous
3,4,Brendan Allen,84,2,l,0,0,2,622,273,...,6,2,2,3,5,3,,0,0,Decision - Unanimous
4,5,Rob Font,1191,3,w,0,1,0,78,354,...,0,0,0,0,0,3,,0,0,Decision - Split


In [12]:
df = fightp_all 

def add_accuracy_features(df):
    df = df.copy()
    cols = df.columns

    prefix = "rs_total_"
    landed_suffix = "_landed"
    attempted_suffix = "_attempted"

    for col in cols:
        if col.startswith(prefix) and col.endswith(landed_suffix):
            base = col[len(prefix):-len(landed_suffix)]  # e.g. "sig_strike", "head_strike", "td", ...

            landed_col = col
            attempted_col = f"{prefix}{base}{attempted_suffix}"

            if attempted_col not in cols:
                continue  # skip if there's no matching attempted column

            acc_col = f"{base}_acc" 

            df[acc_col] = np.where(
                df[attempted_col] > 0,
                df[landed_col] / df[attempted_col],
                0.0
            )

    return df

fightp_all = add_accuracy_features(fightp_all)

print("fightp_all shape with acc features:", fightp_all.shape)

acc_cols = [c for c in fightp_all.columns if c.endswith("_acc")]
print("accuracy columns (first 15):")
for c in acc_cols[:15]:
    print("  ", c)

fightp_all.head()


fightp_all shape with acc features: (15378, 112)
accuracy columns (first 15):
   sig_strike_acc
   total_strike_acc
   head_strike_acc
   body_strike_acc
   leg_strike_acc
   distance_strike_acc
   clinch_strike_acc
   ground_strike_acc
   td_acc


Unnamed: 0,fightp_id,fighter_name,fighters_id,fight_id,result,rs_total_knockdowns,rs_total_submission_attempts,rs_total_reversals,rs_total_control_time,rs_total_takedown_stats_id,...,method,sig_strike_acc,total_strike_acc,head_strike_acc,body_strike_acc,leg_strike_acc,distance_strike_acc,clinch_strike_acc,ground_strike_acc,td_acc
0,1,Song Yadong,3631,1,w,0,0,0,0,30,...,Decision - Unanimous,0.503759,0.503759,0.4,0.666667,1.0,0.507576,0.0,0.0,0.0
1,2,Henry Cejudo,629,1,l,0,0,0,0,111,...,Decision - Unanimous,0.47191,0.47191,0.345865,0.666667,1.0,0.465909,1.0,0.0,0.0
2,3,Anthony Hernandez,1586,2,w,0,1,3,178,192,...,Decision - Unanimous,0.538462,0.746479,0.421053,1.0,0.75,0.454545,1.0,1.0,0.0
3,4,Brendan Allen,84,2,l,0,0,2,622,273,...,Decision - Unanimous,0.714286,0.791209,0.714286,0.0,0.0,0.705882,0.0,0.75,0.5
4,5,Rob Font,1191,3,w,0,1,0,78,354,...,Decision - Split,0.555556,0.591623,0.522293,0.928571,0.0,0.541401,0.714286,0.0,0.0


In [13]:
df = fightp_all.copy()

# convert result string ("w" / "l" in this dataset) to a numeric label
df["result_label"] = df["result"].apply(lambda x: 1 if str(x).strip().lower() in ("w", "win") else 0)

y = df["result_label"].values


In [14]:
df = fightp_all.copy()

# 1) Build label y from "result" with a robust parser ----
def result_to_label(x):
    s = str(x).strip().lower()
    if "win" in s or s == "w":
        return 1
    if "loss" in s or s == "l":
        return 0
    # draw / NC / weird stuff -> 0
    return 0

df["result_label"] = df["result"].apply(result_to_label)

print("result value counts:")
print(df["result"].value_counts().head())
print("\nlabel distribution:")
print(df["result_label"].value_counts())

# 2) Drop non-feature columns 
drop_cols = [
    "fightp_id",
    "fighters_id",
    "fight_id",
    "fighter_name",
    "result",
    "method",
    "result_label",  # target, not feature
]

y_np = df["result_label"].values.astype(np.int64)

df_features = df.drop(columns=drop_cols)

# Keep only numeric cols and FILL NaNs with 0
X_df = (
    df_features
    .select_dtypes(include=["number"])
    .astype("float32")
    .fillna(0.0)
)

print("X_df shape:", X_df.shape)
print("y_np shape:", y_np.shape)
print("Any NaNs in X_df?", X_df.isna().any().any())
X_df.head()


result value counts:
result
w    7689
l    7689
Name: count, dtype: int64

label distribution:
result_label
1    7689
0    7689
Name: count, dtype: int64
X_df shape: (15378, 106)
y_np shape: (15378,)
Any NaNs in X_df? False


Unnamed: 0,rs_total_knockdowns,rs_total_submission_attempts,rs_total_reversals,rs_total_control_time,rs_total_takedown_stats_id,rs_total_significant_strikes_id,rs_total_total_strikes_id,rs_total_head_strikes_id,rs_total_body_strikes_id,rs_total_leg_strikes_id,...,is_title_fight,sig_strike_acc,total_strike_acc,head_strike_acc,body_strike_acc,leg_strike_acc,distance_strike_acc,clinch_strike_acc,ground_strike_acc,td_acc
0,0.0,0.0,0.0,0.0,30.0,33.0,36.0,39.0,42.0,45.0,...,0.0,0.503759,0.503759,0.4,0.666667,1.0,0.507576,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,111.0,114.0,117.0,120.0,123.0,126.0,...,0.0,0.47191,0.47191,0.345865,0.666667,1.0,0.465909,1.0,0.0,0.0
2,0.0,1.0,3.0,178.0,192.0,195.0,198.0,201.0,204.0,207.0,...,0.0,0.538462,0.746479,0.421053,1.0,0.75,0.454545,1.0,1.0,0.0
3,0.0,0.0,2.0,622.0,273.0,276.0,279.0,282.0,285.0,288.0,...,0.0,0.714286,0.791209,0.714286,0.0,0.0,0.705882,0.0,0.75,0.5
4,0.0,1.0,0.0,78.0,354.0,357.0,360.0,363.0,366.0,369.0,...,0.0,0.555556,0.591623,0.522293,0.928571,0.0,0.541401,0.714286,0.0,0.0


In [15]:
X_tensor = torch.tensor(X_df.values, dtype=torch.float32)
y_tensor = torch.tensor(y_np, dtype=torch.long)

print("X_tensor shape:", X_tensor.shape)
print("y_tensor shape:", y_tensor.shape)

# Compute mean/std per feature
mean = X_tensor.mean(dim=0, keepdim=True)
std  = X_tensor.std(dim=0, unbiased=False, keepdim=True)

# Guard against zero std
std[std == 0] = 1.0

X_scaled = (X_tensor - mean) / std

print("Any NaNs in X_scaled?", torch.isnan(X_scaled).any().item())


X_tensor shape: torch.Size([15378, 106])
y_tensor shape: torch.Size([15378])
Any NaNs in X_scaled? False
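The mean and standard deviation above are computed over every row, so once the data is split, validation rows have already influenced the scaling. A common alternative is to fit the statistics on the training rows only and reuse them for validation. The NumPy sketch below illustrates the idea on random toy data; `fit_scaler` and `apply_scaler` are hypothetical helper names.

```python
import numpy as np

def fit_scaler(X_train):
    """Compute per-feature mean and population std from training rows only."""
    mean = X_train.mean(axis=0, keepdims=True)
    std = X_train.std(axis=0, keepdims=True)  # ddof=0, matching unbiased=False in torch
    std[std == 0] = 1.0                       # guard against constant features
    return mean, std

def apply_scaler(X, mean, std):
    """Standardize any split with statistics fitted on the training split."""
    return (X - mean) / std

# Toy data: 100 rows, 3 features, the last one constant.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X[:, 2] = 1.0
X_train, X_val = X[:80], X[80:]

mean, std = fit_scaler(X_train)
X_train_s = apply_scaler(X_train, mean, std)
X_val_s = apply_scaler(X_val, mean, std)
```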


In [16]:
N = X_scaled.shape[0]
indices = torch.randperm(N)

train_ratio = 0.8
train_size = int(train_ratio * N)

train_idx = indices[:train_size]
val_idx   = indices[train_size:]

X_train = X_scaled[train_idx]
y_train = y_tensor[train_idx]

X_val = X_scaled[val_idx]
y_val = y_tensor[val_idx]

train_dataset = TensorDataset(X_train, y_train)
val_dataset   = TensorDataset(X_val, y_val)

batch_size = 128

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader   = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

print("Train size:", len(train_dataset))
print("Val size:", len(val_dataset))


Train size: 12302
Val size: 3076


In [17]:

class FightEmbeddingNet(nn.Module):
    def __init__(self, input_dim, embedding_dim=32):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.embedding = nn.Linear(64, embedding_dim)  # fight embedding layer
        self.classifier = nn.Linear(embedding_dim, 2)  # 2 classes: loss, win

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        emb = self.embedding(x)        # shape: (batch, embedding_dim)
        emb = F.relu(emb)
        logits = self.classifier(emb)  # shape: (batch, 2)
        return logits, emb


In [18]:
input_dim = X_scaled.shape[1]
embedding_dim = 32 

model = FightEmbeddingNet(input_dim=input_dim, embedding_dim=embedding_dim)
print(model)


FightEmbeddingNet(
  (fc1): Linear(in_features=106, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (embedding): Linear(in_features=64, out_features=32, bias=True)
  (classifier): Linear(in_features=32, out_features=2, bias=True)
)


In [19]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

num_epochs = 25

for epoch in range(1, num_epochs + 1):
    # train
    model.train()
    total_loss = 0.0
    correct = 0
    total = 0

    for xb, yb in train_loader:
        xb = xb.to(device)
        yb = yb.to(device)

        optimizer.zero_grad()
        logits, emb = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * xb.size(0)
        preds = logits.argmax(dim=1)
        correct += (preds == yb).sum().item()
        total += yb.size(0)

    train_loss = total_loss / total
    train_acc = correct / total

    # validation
    model.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0

    with torch.no_grad():
        for xb, yb in val_loader:
            xb = xb.to(device)
            yb = yb.to(device)

            logits, emb = model(xb)
            loss = criterion(logits, yb)

            val_loss += loss.item() * xb.size(0)
            preds = logits.argmax(dim=1)
            val_correct += (preds == yb).sum().item()
            val_total += yb.size(0)

    val_loss /= val_total
    val_acc = val_correct / val_total

    print(f"Epoch {epoch:02d} | "
          f"train_loss={train_loss:.4f} acc={train_acc:.3f} | "
          f"val_loss={val_loss:.4f} acc={val_acc:.3f}")


Epoch 01 | train_loss=0.6742 acc=0.575 | val_loss=0.6673 acc=0.586
Epoch 02 | train_loss=0.6616 acc=0.593 | val_loss=0.6659 acc=0.581
Epoch 03 | train_loss=0.6584 acc=0.597 | val_loss=0.6606 acc=0.595
Epoch 04 | train_loss=0.6550 acc=0.602 | val_loss=0.6614 acc=0.595
Epoch 05 | train_loss=0.6524 acc=0.608 | val_loss=0.6600 acc=0.589
Epoch 06 | train_loss=0.6478 acc=0.609 | val_loss=0.6594 acc=0.593
Epoch 07 | train_loss=0.6443 acc=0.618 | val_loss=0.6605 acc=0.585
Epoch 08 | train_loss=0.6418 acc=0.619 | val_loss=0.6626 acc=0.585
Epoch 09 | train_loss=0.6382 acc=0.618 | val_loss=0.6613 acc=0.574
Epoch 10 | train_loss=0.6334 acc=0.626 | val_loss=0.6682 acc=0.585
Epoch 11 | train_loss=0.6290 acc=0.630 | val_loss=0.6648 acc=0.588
Epoch 12 | train_loss=0.6235 acc=0.637 | val_loss=0.6739 acc=0.583
Epoch 13 | train_loss=0.6181 acc=0.643 | val_loss=0.6703 acc=0.586
Epoch 14 | train_loss=0.6154 acc=0.644 | val_loss=0.6681 acc=0.587
Epoch 15 | train_loss=0.6054 acc=0.657 | val_loss=0.6810 acc=0
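In the log above, the training loss falls every epoch while the validation loss reaches its lowest value (0.6594) at epoch 6 and then drifts upward, which is the usual sign of overfitting. A patience-based early-stopping rule is one way to react to that. The sketch below is an illustration only; the class name `EarlyStopper` and the patience of 3 are assumptions, and the loss list is copied from the first ten epochs of the log.

```python
class EarlyStopper:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Validation losses from epochs 1-10 of the log above.
val_losses = [0.6673, 0.6659, 0.6606, 0.6614, 0.6600, 0.6594, 0.6605, 0.6626, 0.6613, 0.6682]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
```

Inside the real training loop, a copy of `model.state_dict()` could be kept whenever the best loss improves and reloaded once the stopper fires, so the embeddings come from the best epoch and not the last one.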

In [20]:
model.eval()
with torch.no_grad():
    X_all = X_scaled.to(device)
    logits_all, emb_all = model(X_all)   # emb_all shape: (15378, embedding_dim)

fight_embeddings = emb_all.cpu().numpy()
fight_embeddings.shape


(15378, 32)

In [21]:
# align indices with fightp_all (same order used for X_df / X_scaled)
fight_ids_np    = fightp_all["fightp_id"].values
fighter_ids_np  = fightp_all["fighters_id"].values
fighter_names_np = fightp_all["fighter_name"].values

print(fight_embeddings.shape, fight_ids_np.shape, fighter_ids_np.shape)


(15378, 32) (15378,) (15378,)


In [22]:
fight_emb_df = pd.DataFrame(
    fight_embeddings,
    columns=[f"emb_{i}" for i in range(fight_embeddings.shape[1])]
)
fight_emb_df["fightp_id"] = fight_ids_np
fight_emb_df["fighters_id"] = fighter_ids_np
fight_emb_df["fighter_name"] = fighter_names_np

fight_emb_df.head()


Unnamed: 0,emb_0,emb_1,emb_2,emb_3,emb_4,emb_5,emb_6,emb_7,emb_8,emb_9,...,emb_25,emb_26,emb_27,emb_28,emb_29,emb_30,emb_31,fightp_id,fighters_id,fighter_name
0,0.0,0.0,0.0,0.0,0.0,0.0,1.310109,0.0,0.0,0.0,...,0.241875,0.084523,0.0,0.0,0.0,0.0,0.313347,1,3631,Song Yadong
1,0.0,0.0,0.140375,0.0,0.0,0.0,1.138295,0.0,0.0,0.0,...,0.145416,0.150053,0.0,0.0,0.0,0.246494,0.0,2,629,Henry Cejudo
2,0.0,0.0,0.0,0.0,0.0,0.504026,1.855722,0.0,0.0,0.0,...,1.000856,1.572763,1.689963,0.247396,0.0,0.0,1.089312,3,1586,Anthony Hernandez
3,0.0,0.0,0.0,0.187272,0.0,0.0,0.77466,0.0,0.0,0.0,...,0.124142,1.376907,1.387786,0.0,0.0,0.0,0.0,4,84,Brendan Allen
4,0.0,0.0,0.361046,0.069936,0.0,0.0,1.613611,0.0,0.0,0.0,...,0.0,0.240322,0.643345,0.0,0.0,0.0,0.0,5,1191,Rob Font


In [23]:
emb_cols = [c for c in fight_emb_df.columns if c.startswith("emb_")]

fighter_emb_df = (
    fight_emb_df
    .groupby("fighters_id")[emb_cols]
    .mean()
    .reset_index()
)

name_map = (
    fight_emb_df.groupby("fighters_id")["fighter_name"]
    .first()
    .to_dict()
)
fighter_emb_df["fighter_name"] = fighter_emb_df["fighters_id"].map(name_map)

fighter_emb_df.head()


Unnamed: 0,fighters_id,emb_0,emb_1,emb_2,emb_3,emb_4,emb_5,emb_6,emb_7,emb_8,...,emb_23,emb_24,emb_25,emb_26,emb_27,emb_28,emb_29,emb_30,emb_31,fighter_name
0,2,0.096713,0.113253,0.28891,0.0,0.215452,0.697377,0.027624,0.102373,0.758372,...,0.160234,0.693466,0.713407,0.143579,0.114896,0.224058,0.456728,0.099536,1.006372,Danny Abbadi
1,4,0.0,0.0,0.274548,0.0,0.0,0.828158,0.068962,0.191915,1.031428,...,0.022104,1.133325,0.999403,0.0,0.0,0.0,0.441926,0.0,1.619728,David Abbott
2,5,0.0,0.0,0.22964,0.0,0.0,0.0,1.260942,0.0,0.0,...,0.0,0.449514,0.531604,0.398297,0.262824,0.0,0.0,0.0,0.420968,Hamdy Abdelwahab
3,6,0.0,0.0,0.0,0.0,0.0,0.0,1.174773,0.0,0.0,...,0.0,0.05012,0.05765,0.029823,0.502179,0.0,0.0,0.003308,0.0,Mansur Abdul-Malik
4,7,0.033721,0.00507,0.080861,0.047395,0.030537,0.101368,0.542457,0.053025,0.175817,...,0.181172,0.108325,0.109949,0.056336,0.199439,0.014859,0.119792,0.247382,0.198635,Shamil Abdurakhimov


In [24]:
fp = fightp_all.copy()

def result_to_label(x):
    s = str(x).strip().lower()
    if "win" in s or s == "w":
        return 1
    if "loss" in s or s == "l":
        return 0
    # draw / NC / weird stuff -> 0
    return 0

fp["result_label"] = fp["result"].apply(result_to_label)

# Keep only the columns we need to form matchups
fp_small = fp[["fight_id", "fighters_id", "fighter_name", "result_label"]]

fight_counts = fp_small["fight_id"].value_counts()
two_fighter_fights = fight_counts[fight_counts == 2].index

fp_pairs = fp_small[fp_small["fight_id"].isin(two_fighter_fights)].copy()

print("Total fight performances:", len(fp_small))
print("Two-fighter fights:", len(two_fighter_fights))
fp_pairs.head()


Total fight performances: 15378
Two-fighter fights: 7689


Unnamed: 0,fight_id,fighters_id,fighter_name,result_label
0,1,3631,Song Yadong,1
1,1,629,Henry Cejudo,0
2,2,1586,Anthony Hernandez,1
3,2,84,Brendan Allen,0
4,3,1191,Rob Font,1


In [25]:
emb_cols = [c for c in fighter_emb_df.columns if c.startswith("emb_")]

# Index by fighters_id for quick lookup
emb_by_id = fighter_emb_df.set_index("fighters_id")[emb_cols]

print("Embedding dim:", len(emb_cols))
print("Number of fighters with embeddings:", emb_by_id.shape[0])


Embedding dim: 32
Number of fighters with embeddings: 2412


In [26]:
match_rows = []
XA_list = []
XB_list = []
y_match_list = []

for fight_id, g in fp_pairs.groupby("fight_id"):
    g = g.reset_index(drop=True)
    if len(g) != 2:
        continue

    f1 = g.iloc[0]
    f2 = g.iloc[1]

    fid1 = f1["fighters_id"]
    fid2 = f2["fighters_id"]

    # Skip if we don't have embeddings for both fighters
    if fid1 not in emb_by_id.index or fid2 not in emb_by_id.index:
        continue

    emb1 = emb_by_id.loc[fid1].values
    emb2 = emb_by_id.loc[fid2].values

    res1 = int(f1["result_label"])
    res2 = int(f2["result_label"])

    # Skip weird cases (draws, NC, etc.) where both results are same
    if res1 == res2:
        continue

    # Row 1: A = fighter 1, B = fighter 2
    XA_list.append(emb1)
    XB_list.append(emb2)
    y_match_list.append(1 if res1 == 1 else 0)

    match_rows.append({
        "fight_id": fight_id,
        "fighterA_id": fid1,
        "fighterB_id": fid2,
        "fighterA_name": f1["fighter_name"],
        "fighterB_name": f2["fighter_name"],
        "label_A_wins": 1 if res1 == 1 else 0,
    })

    # Row 2: A = fighter 2, B = fighter 1
    XA_list.append(emb2)
    XB_list.append(emb1)
    y_match_list.append(1 if res2 == 1 else 0)

    match_rows.append({
        "fight_id": fight_id,
        "fighterA_id": fid2,
        "fighterB_id": fid1,
        "fighterA_name": f2["fighter_name"],
        "fighterB_name": f1["fighter_name"],
        "label_A_wins": 1 if res2 == 1 else 0,
    })

# Convert to arrays / DataFrame
XA = np.vstack(XA_list)
XB = np.vstack(XB_list)
y_match = np.array(y_match_list, dtype=np.int64)

matchups_meta_df = pd.DataFrame(match_rows)

print("XA shape:", XA.shape)
print("XB shape:", XB.shape)
print("y_match shape:", y_match.shape)
print("\nLabel distribution (0=A loses, 1=A wins):")
print(pd.Series(y_match).value_counts())

matchups_meta_df.head()


XA shape: (15378, 32)
XB shape: (15378, 32)
y_match shape: (15378,)

Label distribution (0=A loses, 1=A wins):
1    7689
0    7689
Name: count, dtype: int64


Unnamed: 0,fight_id,fighterA_id,fighterB_id,fighterA_name,fighterB_name,label_A_wins
0,1,3631,629,Song Yadong,Henry Cejudo,1
1,1,629,3631,Henry Cejudo,Song Yadong,0
2,2,1586,84,Anthony Hernandez,Brendan Allen,1
3,2,84,1586,Brendan Allen,Anthony Hernandez,0
4,3,1191,2376,Rob Font,Jean Matsumoto,1


In [27]:
X_match = np.concatenate(
    [XA, XB, np.abs(XA - XB)],
    axis=1
)

print("X_match shape:", X_match.shape)  # (num_matchups, 32*3 = 96 if emb_dim=32)

X_match shape: (15378, 96)


In [28]:
# X_match: (num_matchups, 96) if emb_dim=32
# y_match: (num_matchups,)
print("X_match shape:", X_match.shape)
print("y_match shape:", y_match.shape)

X_match_tensor = torch.tensor(X_match, dtype=torch.float32)
y_match_tensor = torch.tensor(y_match, dtype=torch.long)

# Standardize features
mean_match = X_match_tensor.mean(dim=0, keepdim=True)
std_match  = X_match_tensor.std(dim=0, unbiased=False, keepdim=True)
std_match[std_match == 0] = 1.0

X_match_scaled = (X_match_tensor - mean_match) / std_match


X_match shape: (15378, 96)
y_match shape: (15378,)


In [29]:
N = X_match_scaled.shape[0]
indices = torch.randperm(N)

train_ratio = 0.8
train_size = int(train_ratio * N)

train_idx = indices[:train_size]
val_idx   = indices[train_size:]

X_train_match = X_match_scaled[train_idx]
y_train_match = y_match_tensor[train_idx]

X_val_match = X_match_scaled[val_idx]
y_val_match = y_match_tensor[val_idx]

train_dataset_match = TensorDataset(X_train_match, y_train_match)
val_dataset_match   = TensorDataset(X_val_match, y_val_match)

batch_size = 128
train_loader_match = DataLoader(train_dataset_match, batch_size=batch_size, shuffle=True)
val_loader_match   = DataLoader(val_dataset_match, batch_size=batch_size, shuffle=False)

print("Train matchups:", len(train_dataset_match))
print("Val matchups:", len(val_dataset_match))
print("Label distribution (0=A loses, 1=A wins):")
unique, counts = np.unique(y_match, return_counts=True)
print(dict(zip(unique, counts)))


Train matchups: 12302
Val matchups: 3076
Label distribution (0=A loses, 1=A wins):
{np.int64(0): np.int64(7689), np.int64(1): np.int64(7689)}
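Every fight contributes two mirrored rows, (A, B) and (B, A), and the row-level random permutation above can place one in the training set and the other in validation. The model then sees the same bout on both sides of the split. Splitting on fight_id keeps both rows together. The sketch below shows one way to do that in NumPy; `group_split` is a hypothetical helper and the ids are toy values.

```python
import numpy as np

def group_split(groups, val_frac=0.2, seed=0):
    """Return (train_idx, val_idx) so that all rows of a group land in the same split."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique)

    n_val = max(1, int(round(val_frac * len(unique))))
    val_groups = unique[:n_val]

    val_mask = np.isin(groups, val_groups)
    return np.flatnonzero(~val_mask), np.flatnonzero(val_mask)

# Toy ids: ten fights, two mirrored rows each.
fight_ids = np.repeat(np.arange(1, 11), 2)
train_idx, val_idx = group_split(fight_ids, val_frac=0.2, seed=0)
```

In this notebook the group labels could come from `matchups_meta_df["fight_id"].values`, whose row order matches X_match.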


In [30]:
class OutcomeNet(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.out = nn.Linear(64, 2)  # 2 classes: A loses (0), A wins (1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        logits = self.out(x)
        return logits

input_dim_match = X_match_scaled.shape[1]
outcome_model = OutcomeNet(input_dim_match)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
outcome_model = outcome_model.to(device)
print(outcome_model)


OutcomeNet(
  (fc1): Linear(in_features=96, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (out): Linear(in_features=64, out_features=2, bias=True)
)


In [31]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(outcome_model.parameters(), lr=1e-3)

num_epochs = 15

for epoch in range(1, num_epochs + 1):
    # train
    outcome_model.train()
    total_loss = 0.0
    correct = 0
    total = 0

    for xb, yb in train_loader_match:
        xb = xb.to(device)
        yb = yb.to(device)

        optimizer.zero_grad()
        logits = outcome_model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * xb.size(0)
        preds = logits.argmax(dim=1)
        correct += (preds == yb).sum().item()
        total += yb.size(0)

    train_loss = total_loss / total
    train_acc = correct / total

    # validation
    outcome_model.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0

    with torch.no_grad():
        for xb, yb in val_loader_match:
            xb = xb.to(device)
            yb = yb.to(device)

            logits = outcome_model(xb)
            loss = criterion(logits, yb)

            val_loss += loss.item() * xb.size(0)
            preds = logits.argmax(dim=1)
            val_correct += (preds == yb).sum().item()
            val_total += yb.size(0)

    val_loss /= val_total
    val_acc = val_correct / val_total

    print(f"Epoch {epoch:02d} | "
          f"train_loss={train_loss:.4f} acc={train_acc:.3f} | "
          f"val_loss={val_loss:.4f} acc={val_acc:.3f}")


Epoch 01 | train_loss=0.6196 acc=0.646 | val_loss=0.5880 acc=0.673
Epoch 02 | train_loss=0.5857 acc=0.683 | val_loss=0.5782 acc=0.689
Epoch 03 | train_loss=0.5715 acc=0.696 | val_loss=0.5806 acc=0.689
Epoch 04 | train_loss=0.5622 acc=0.701 | val_loss=0.5760 acc=0.690
Epoch 05 | train_loss=0.5519 acc=0.709 | val_loss=0.5764 acc=0.697
Epoch 06 | train_loss=0.5414 acc=0.717 | val_loss=0.5772 acc=0.691
Epoch 07 | train_loss=0.5331 acc=0.722 | val_loss=0.5774 acc=0.690
Epoch 08 | train_loss=0.5220 acc=0.734 | val_loss=0.5942 acc=0.686
Epoch 09 | train_loss=0.5166 acc=0.734 | val_loss=0.5844 acc=0.688
Epoch 10 | train_loss=0.5016 acc=0.747 | val_loss=0.5962 acc=0.679
Epoch 11 | train_loss=0.4912 acc=0.753 | val_loss=0.6070 acc=0.682
Epoch 12 | train_loss=0.4832 acc=0.761 | val_loss=0.6080 acc=0.683
Epoch 13 | train_loss=0.4692 acc=0.769 | val_loss=0.6183 acc=0.676
Epoch 14 | train_loss=0.4572 acc=0.775 | val_loss=0.6390 acc=0.667
Epoch 15 | train_loss=0.4462 acc=0.782 | val_loss=0.6259 acc=0.668
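
In the log above, `val_loss` reaches its minimum of 0.5760 at epoch 04 and rises afterward while `train_loss` keeps falling, which suggests overfitting. Early stopping is one common response. The sketch below is a small, framework-free illustration replayed on the logged validation losses. The `EarlyStopping` class and the `patience=3` setting are assumptions for illustration, not part of the notebook.

```python
class EarlyStopping:
    """Track the best validation loss and signal a stop after `patience` bad epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Validation losses copied from the training log above (epochs 1 to 15)
logged_val_losses = [
    0.5880, 0.5782, 0.5806, 0.5760, 0.5764, 0.5772, 0.5774, 0.5942,
    0.5844, 0.5962, 0.6070, 0.6080, 0.6183, 0.6390, 0.6259,
]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(logged_val_losses, start=1):
    # In a real training loop, one would also copy model.state_dict()
    # whenever stopper.best_epoch == epoch after this call.
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print("best epoch:", stopper.best_epoch, "stopped at:", stopped_at)
```

With these settings the replay keeps epoch 4 as the best checkpoint and stops after epoch 7.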

In [32]:
outcome_model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for xb, yb in val_loader_match:
        xb = xb.to(device)
        logits = outcome_model(xb)
        preds = logits.argmax(dim=1).cpu().numpy()
        all_preds.append(preds)
        all_labels.append(yb.cpu().numpy())

all_preds = np.concatenate(all_preds)
all_labels = np.concatenate(all_labels)

val_acc = (all_preds == all_labels).mean()
print("Final val accuracy:", val_acc)

# Simple confusion matrix
cm = Counter(zip(all_labels, all_preds))
print("Confusion matrix (true, pred) counts:", cm)


Final val accuracy: 0.6684005201560468
Confusion matrix (true, pred) counts: Counter({(np.int64(1), np.int64(1)): 1120, (np.int64(0), np.int64(0)): 936, (np.int64(0), np.int64(1)): 614, (np.int64(1), np.int64(0)): 406})
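
Accuracy and the confusion matrix only use hard predictions, while the problem formulation also lists log loss and Brier score. A minimal NumPy sketch of those two metrics is given below, assuming the model's two-class logits have been collected into an array. The function names and the toy logits are hypothetical and do not come from the notebook.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def binary_log_loss(p_win: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> float:
    """Mean negative log-likelihood of the true label under P(A wins)."""
    p = np.clip(p_win, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def brier_score(p_win: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error between P(A wins) and the 0/1 outcome."""
    return float(np.mean((p_win - y) ** 2))

# Toy logits for 4 matchups, columns = [A loses, A wins] (not real model output)
logits_demo = np.array([[0.0, 0.0], [2.0, -1.0], [-0.5, 1.5], [1.0, 1.0]])
labels_demo = np.array([1, 0, 1, 0])

p_win_demo = softmax(logits_demo)[:, 1]
print("log loss:", binary_log_loss(p_win_demo, labels_demo))
print("Brier score:", brier_score(p_win_demo, labels_demo))
```

For reference, a constant prediction of 0.5 scores a log loss of ln 2 (about 0.693) and a Brier score of 0.25, which is the natural baseline on this balanced label set.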
