# March Machine Learning Mania 2026

Forecast the 2026 NCAA Men's and Women's basketball tournaments: load data, build team strength, predict win probabilities for every matchup, and create the submission file.

## 1. Setup & Input Data

Kaggle mounts the competition data under `/kaggle/input/`. We list every file there and build the data path from the competition slug.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

import numpy as np
import pandas as pd
import os

# Input data files are available in the read-only "../input/" directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Competition data path (update if your competition slug differs)
COMPETITION = 'march-machine-learning-mania-2026'
DATA_DIR = f'/kaggle/input/{COMPETITION}'

# Seasons used to build the team-strength cache (most recent available)
STAGE1_SEASONS = [2022, 2023, 2024, 2025]  # 519,144 rows required
# Fallback order (newest first) used by predict_one when no season is passed
RECENT_SEASONS = [2025, 2024, 2023]
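
If the slug is uncertain, the data folder can also be located by searching for a file the notebook is known to read. A minimal sketch, assuming a hypothetical helper `find_data_dir` (not part of the original notebook) and a temporary directory standing in for `/kaggle/input`:

```python
import os
import tempfile

def find_data_dir(root, marker_file):
    """Return the first directory under `root` containing `marker_file`, else None.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    for dirname, _, filenames in os.walk(root):
        if marker_file in filenames:
            return dirname
    return None

# Demo on a throwaway directory tree (the folder name is made up)
with tempfile.TemporaryDirectory() as root:
    comp_dir = os.path.join(root, 'some-competition-slug')
    os.makedirs(comp_dir)
    open(os.path.join(comp_dir, 'MRegularSeasonCompactResults.csv'), 'w').close()

    found = find_data_dir(root, 'MRegularSeasonCompactResults.csv')
    missing = find_data_dir(root, 'NoSuchFile.csv')
    found_name = os.path.basename(found)

print(found_name)  # some-competition-slug
print(missing)     # None
```

On Kaggle, the same call with `root='/kaggle/input'` would return a path that could be assigned to `DATA_DIR`.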

## 2. Load Competition Data

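We load compact results (regular season + NCAA tourney) and a sample submission listing the matchups to predict, and define helpers to parse IDs and detect men's vs women's teams. Note that the cell below reads the Stage 1 sample file; to predict the 2026 matchups, point it at the Stage 2 sample file and make sure the strength cache covers the 2026 season.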

In [None]:
from functools import lru_cache

@lru_cache(maxsize=None)
def load_compact_results(gender):
    """Regular season + NCAA tourney results for M or W.

    Cached per gender so the CSVs are read once; treat the result as read-only.
    """
    reg = pd.read_csv(f'{DATA_DIR}/{gender}RegularSeasonCompactResults.csv')
    tourney = pd.read_csv(f'{DATA_DIR}/{gender}NCAATourneyCompactResults.csv')
    return pd.concat([reg, tourney], ignore_index=True)

def parse_submission_id(id_str):
    """Parse 'SSSS_XXXX_YYYY' -> (season, team_low, team_high)."""
    s, low, high = id_str.split('_')
    return int(s), int(low), int(high)

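def gender_from_team_id(team_id):
    """Men 1000-1999 -> 'M', Women 3000-3999 -> 'W'."""
    if 1000 <= team_id < 2000:
        return 'M'
    if 3000 <= team_id < 4000:
        return 'W'
    # Fail loudly instead of silently treating unknown IDs as women's teams
    raise ValueError(f'TeamID {team_id} is outside the men/women ID ranges')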

# Load the sample submission (required format: ID, Pred).
# This reads the Stage 1 file; switch to the Stage 2 file for the 2026 matchups.
sample = pd.read_csv(f'{DATA_DIR}/SampleSubmissionStage1.csv')
print('Sample submission shape:', sample.shape)
sample.head()
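
To illustrate the ID format, the sketch below parses a few made-up IDs and checks two properties the rest of the notebook relies on: the first team ID is the lower one, and both teams fall in the same ID range. `id_problems` is a hypothetical helper, not part of the original notebook, and the team IDs are invented for the example.

```python
def parse_submission_id(id_str):
    """Same parsing as in the notebook: 'SSSS_XXXX_YYYY' -> (season, low, high)."""
    s, low, high = id_str.split('_')
    return int(s), int(low), int(high)

def id_problems(id_str):
    """Hypothetical check: list the problems found in one submission ID."""
    season, low, high = parse_submission_id(id_str)
    problems = []
    if low >= high:
        problems.append('team IDs not in ascending order')
    # Men's IDs are below 2000, women's start at 3000
    if (low < 2000) != (high < 2000):
        problems.append("mixes men's and women's ID ranges")
    return problems

parsed = parse_submission_id('2026_1101_1102')
ok = id_problems('2026_1101_1102')
swapped = id_problems('2026_1102_1101')
mixed = id_problems('2026_1101_3101')

print(parsed)   # (2026, 1101, 1102)
print(ok)       # []
print(swapped)  # ['team IDs not in ascending order']
print(mixed)    # ["mixes men's and women's ID ranges"]
```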

## 3. Feature Engineering: Team Strength

For each team and season we compute **wins**, **losses**, **win%**, **points for/against per game**, and **point differential per game** from the combined regular-season and tournament results. We then stack the seasons in `STAGE1_SEASONS` to form a strength cache.

In [None]:
def team_season_stats(gender, season):
    """Per-team stats for one gender and season: wins, win%, pts for/against, point diff."""
    df = load_compact_results(gender)
    df = df[df['Season'] == season]
    if df.empty:
        return pd.DataFrame(columns=['TeamID', 'Season', 'Wins', 'Losses', 'Games',
                                       'WinPct', 'PtsForPerGame', 'PtsAgainstPerGame', 'PointDiffPerGame'])

    wins = df.groupby('WTeamID').agg(
        Wins=('WTeamID', 'count'),
        PtsFor=('WScore', 'sum'),
        PtsAgainst=('LScore', 'sum'),
    ).rename_axis('TeamID').reset_index()
    losses = df.groupby('LTeamID').agg(
        Losses=('LTeamID', 'count'),
        PtsFor=('LScore', 'sum'),
        PtsAgainst=('WScore', 'sum'),
    ).rename_axis('TeamID').reset_index()

    teams = wins[['TeamID']].merge(losses[['TeamID']], on='TeamID', how='outer').fillna(0)
    wins_agg = wins.set_index('TeamID')
    losses_agg = losses.set_index('TeamID')

    def wins_cnt(tid):
        return wins_agg.loc[tid, 'Wins'] if tid in wins_agg.index else 0
    def losses_cnt(tid):
        return losses_agg.loc[tid, 'Losses'] if tid in losses_agg.index else 0
    def pts_for(tid):
        w = wins_agg.loc[tid, 'PtsFor'] if tid in wins_agg.index else 0
        l = losses_agg.loc[tid, 'PtsFor'] if tid in losses_agg.index else 0
        return w + l
    def pts_against(tid):
        w = wins_agg.loc[tid, 'PtsAgainst'] if tid in wins_agg.index else 0
        l = losses_agg.loc[tid, 'PtsAgainst'] if tid in losses_agg.index else 0
        return w + l

    teams['Wins'] = teams['TeamID'].map(wins_cnt)
    teams['Losses'] = teams['TeamID'].map(losses_cnt)
    teams['Games'] = teams['Wins'] + teams['Losses']
    teams['PtsFor'] = teams['TeamID'].map(pts_for)
    teams['PtsAgainst'] = teams['TeamID'].map(pts_against)
    teams['Season'] = season
    g = teams['Games'].replace(0, 1)
    teams['WinPct'] = teams['Wins'] / g
    teams['PtsForPerGame'] = teams['PtsFor'] / g
    teams['PtsAgainstPerGame'] = teams['PtsAgainst'] / g
    teams['PointDiffPerGame'] = teams['PtsForPerGame'] - teams['PtsAgainstPerGame']
    return teams

def build_strength_cache(gender, seasons):
    """Stack team-season stats for multiple seasons."""
    return pd.concat([team_season_stats(gender, s) for s in seasons], ignore_index=True)

cache_m = build_strength_cache('M', STAGE1_SEASONS)
cache_w = build_strength_cache('W', STAGE1_SEASONS)
print('Men cache shape:', cache_m.shape)
print('Women cache shape:', cache_w.shape)
cache_m.head(10)
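
To make the aggregation concrete, here is a small self-contained sketch on three made-up games. It reshapes the results into one row per team per game before grouping, which is an alternative formulation and not the notebook's own code; the team IDs and scores are invented.

```python
import pandas as pd

# Three invented games in the compact-results layout
games = pd.DataFrame({
    'Season':  [2025, 2025, 2025],
    'WTeamID': [1101, 1101, 1102],
    'WScore':  [80, 70, 65],
    'LTeamID': [1102, 1103, 1103],
    'LScore':  [70, 60, 60],
})

# One row per team per game: winners first, then losers
winners = pd.DataFrame({'TeamID': games['WTeamID'], 'Win': 1,
                        'PtsFor': games['WScore'], 'PtsAgainst': games['LScore']})
losers = pd.DataFrame({'TeamID': games['LTeamID'], 'Win': 0,
                       'PtsFor': games['LScore'], 'PtsAgainst': games['WScore']})
long = pd.concat([winners, losers], ignore_index=True)

stats = long.groupby('TeamID').agg(
    Wins=('Win', 'sum'),
    Games=('Win', 'count'),
    PtsFor=('PtsFor', 'sum'),
    PtsAgainst=('PtsAgainst', 'sum'),
).reset_index()
stats['Losses'] = stats['Games'] - stats['Wins']
stats['WinPct'] = stats['Wins'] / stats['Games']
stats['PointDiffPerGame'] = (stats['PtsFor'] - stats['PtsAgainst']) / stats['Games']

print(stats[['TeamID', 'Wins', 'Losses', 'WinPct', 'PointDiffPerGame']])
```

Team 1101 wins both of its games by 10 points, so it ends with a win% of 1.0 and a point differential of +10 per game; team 1103 loses both and ends at 0.0 and -7.5.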

## 4. Predict Probabilities

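For each matchup we look up both teams in the season given by the matchup ID. When no season is passed, `predict_one` falls back through `RECENT_SEASONS`, newest first, and uses the first season where both teams have data. Strength = win% + 0.002 × point differential per game; we predict **P(lower TeamID wins)** with a logistic function of the strength difference. Matchups with a missing team get 0.5.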
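
As a worked example of this formula, the sketch below scores two hypothetical teams (the win percentages and point differentials are made up). Because win% lies in [0, 1] and the point-differential weight is small, strength gaps stay modest and so do the resulting probabilities.

```python
import math

def strength(win_pct, pt_diff):
    # Win% plus a small weight (0.002) on point differential per game
    return win_pct + 0.002 * pt_diff

def win_prob(s_low, s_high):
    # Logistic on the strength difference: P(lower TeamID wins)
    return 1.0 / (1.0 + math.exp(-(s_low - s_high)))

s_low = strength(0.80, 10.0)   # about 0.82
s_high = strength(0.50, 0.0)   # 0.50
p = win_prob(s_low, s_high)
print(round(p, 4))  # 0.5793

# An unbeaten team against a winless one, with equal point differentials
p_extreme = win_prob(strength(1.0, 0.0), strength(0.0, 0.0))
print(round(p_extreme, 4))  # 0.7311
```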

In [None]:
def strength_score(row):
    win_pct = row.get('WinPct', 0.5)
    pt_diff = row.get('PointDiffPerGame', 0.0)
    return win_pct + 0.002 * pt_diff

def predict_one(team_low, team_high, cache_m, cache_w, season=None, default_pred=0.5):
    """P(team_low beats team_high); returns default_pred if either team has no data."""
    gender = gender_from_team_id(team_low)
    cache = cache_m if gender == 'M' else cache_w
    seasons = [season] if season is not None else RECENT_SEASONS
    for s in seasons:
        sub = cache[cache['Season'] == s]
        low_row = sub[sub['TeamID'] == team_low]
        high_row = sub[sub['TeamID'] == team_high]
        if low_row.empty or high_row.empty:
            continue
        low_row, high_row = low_row.iloc[0], high_row.iloc[0]
        s_low = strength_score(low_row)
        s_high = strength_score(high_row)
        logit = s_low - s_high
        pred = 1.0 / (1.0 + np.exp(-np.clip(logit, -10, 10)))
        return float(np.clip(pred, 0.01, 0.99))
    return default_pred

preds = []
for id_str in sample['ID']:
    matchup_season, team_low, team_high = parse_submission_id(id_str)
    preds.append(predict_one(team_low, team_high, cache_m, cache_w, season=matchup_season))

submission = sample[['ID']].copy()
submission['Pred'] = preds
print('Predictions range:', submission['Pred'].min().round(4), '-', submission['Pred'].max().round(4))
submission.head(10)
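
The loop above filters the cache once per matchup, which can be slow on a large submission file. One possible alternative, sketched here on made-up data and not the notebook's own method, splits the IDs into columns and merges a strength table in twice. Column names follow the cache; the values are invented, and only the matchup's own season is consulted.

```python
import numpy as np
import pandas as pd

# Toy strength cache and sample submission (values invented)
cache = pd.DataFrame({
    'Season': [2025, 2025, 2025],
    'TeamID': [1101, 1102, 1103],
    'WinPct': [1.0, 0.5, 0.0],
    'PointDiffPerGame': [10.0, -2.5, -7.5],
})
sample = pd.DataFrame({'ID': ['2025_1101_1102', '2025_1102_1103', '2025_1101_1999']})

# Split 'SSSS_XXXX_YYYY' into integer columns
parts = sample['ID'].str.split('_', expand=True).astype(int)
parts.columns = ['Season', 'TeamLow', 'TeamHigh']

# One strength value per (Season, TeamID)
strength = cache[['Season', 'TeamID']].copy()
strength['Strength'] = cache['WinPct'] + 0.002 * cache['PointDiffPerGame']

# Left merges keep the row order of the sample submission
m = parts.merge(strength.rename(columns={'TeamID': 'TeamLow', 'Strength': 'SLow'}),
                on=['Season', 'TeamLow'], how='left')
m = m.merge(strength.rename(columns={'TeamID': 'TeamHigh', 'Strength': 'SHigh'}),
            on=['Season', 'TeamHigh'], how='left')

logit = (m['SLow'] - m['SHigh']).clip(-10, 10)
pred = (1.0 / (1.0 + np.exp(-logit))).clip(0.01, 0.99)

# Matchups with a missing team come out as NaN and get the default 0.5
sample['Pred'] = pred.fillna(0.5).to_numpy()
print(sample)
```

The third row involves a team that is absent from the toy cache, so it receives the default prediction of 0.5.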

## 5. Create Submission File

Save the submission to **`/kaggle/working/submission.csv`**. This file is preserved when you use **Save & Run All** and can be submitted to the competition.

In [None]:
out_path = '/kaggle/working/submission.csv'
submission.to_csv(out_path, index=False)
print(f'Saved {len(submission)} rows to {out_path}')
submission.head(20)
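
Before submitting, a few quick checks can catch formatting problems. The sketch below uses a hypothetical `validate_submission` helper (not part of the original notebook) and made-up rows; the checks reflect the `ID`, `Pred` format used above and are not an official validator.

```python
import pandas as pd

def validate_submission(df):
    """Hypothetical helper: return a list of problems found in a submission frame."""
    problems = []
    if list(df.columns) != ['ID', 'Pred']:
        problems.append(f'unexpected columns: {list(df.columns)}')
        return problems
    if df['ID'].duplicated().any():
        problems.append('duplicate IDs')
    if df['Pred'].isna().any():
        problems.append('missing predictions')
    if not df['Pred'].dropna().between(0.0, 1.0).all():
        problems.append('predictions outside [0, 1]')
    return problems

good = pd.DataFrame({'ID': ['2026_1101_1102', '2026_3101_3102'], 'Pred': [0.58, 0.5]})
bad = pd.DataFrame({'ID': ['2026_1101_1102', '2026_1101_1102'], 'Pred': [1.2, None]})

good_report = validate_submission(good)
bad_report = validate_submission(bad)
print(good_report)  # []
print(bad_report)   # ['duplicate IDs', 'missing predictions', 'predictions outside [0, 1]']
```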