# FraudScoreCFB — College Football Fraud Detection

**Short project summary:**  
This notebook builds an interpretable “Fraud Score” that identifies college football teams that were **ranked highly in polls** (AP Top 12) despite **not meeting objective performance thresholds**. The pipeline is end-to-end: data import → cleaning → feature engineering → thresholding → labeling → classification → fraud score assignment.

**Key outputs:**
- `final_merged` / `merged_with_sos` — merged season-level dataset (Sports Reference + team-season aggregates).
- `final_with_thresholds` — dataset annotated with per-feature Elite/Borderline flags and counts.
- `labeled_df` — dataset with `FraudLabel` (0=contender, 1=fraud, 2=irrelevant) and final `FraudScore` (0–100).

**A higher final `FraudScore` (e.g., 90–100) indicates a higher likelihood that a team is a fraud; a lower score indicates a higher likelihood that it is a true contender.**

**How to run:**  
Run cells sequentially. Update `base_dir` at the top if your files are not in the same folder as this notebook. Expected files:
- `Main data source/cfbd_games_2014_2025.csv`
- `srcfb/ratings_YYYY.csv` (2014–2024)
- `srcfb/standings_YYYY.csv` (2014–2024)

**Notebook structure (high level):**
1. Setup & imports  
2. Load & clean data (games, ratings, standings)  
3. Build team-season aggregates and merge datasets  
4. Compute Strength of Schedule (SOS) and other derived features  
5. Define Elite/Borderline thresholds and apply to teams  
6. Create fraud labels and train a Random Forest to compute FraudScore  
7. Appendix: selected exploratory analyses (PCA + clustering)


In [1]:
# --- Setup & Imports ---
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GroupKFold
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, accuracy_score

# Base directory (where notebook + data folders live)
base_dir = "."

# File paths
games_file = f"{base_dir}/Main data source/cfbd_games_2014_2025.csv"
ratings_files = [f"{base_dir}/srcfb/ratings_{year}.csv" for year in range(2014, 2025)]
standings_files = [f"{base_dir}/srcfb/standings_{year}.csv" for year in range(2014, 2025)]




## 1 — Primary data: Game-level CFBD dataset

This section loads the CFBD game-level CSV. We derive team-season metrics (points for/against, wins, Elo summaries, Elo deltas, and average `excitementIndex`) from this file.

**Important columns used downstream:**
- `season`, `homeTeam`, `awayTeam`  
- `homePoints`, `awayPoints`  
- `homePregameElo`, `awayPregameElo`, `homePostgameElo`, `awayPostgameElo`  
- `excitementIndex`

We create `homeWon` / `awayWon` booleans here to support later aggregation.

In [2]:
# --- Load Games Data ---
df = pd.read_csv(games_file)

# Create win/loss columns
df["homeWon"] = (df["homePoints"] > df["awayPoints"]).astype(int)
df["awayWon"] = (df["awayPoints"] > df["homePoints"]).astype(int)

print("Games shape:", df.shape)
print(df[["season", "homeTeam", "awayTeam", "homePoints", "awayPoints"]].head())

Games shape: (27873, 35)
   season            homeTeam        awayTeam  homePoints  awayPoints
0    2014  Eastern Washington     Sam Houston        56.0        35.0
1    2014           Dartmouth         Harvard        12.0        23.0
2    2014            Michigan         Indiana        34.0        10.0
3    2014              Wagner    Sacred Heart         7.0        23.0
4    2014       James Madison  William & Mary        31.0        24.0




## 2 — Sports Reference: Ratings (season-level)

This section imports Sports Reference "ratings" CSVs (one per year) and standardizes the headers (some of the files have a two-row header). The `clean_ratings` helper:
- flattens the multi-row header,
- creates derived columns `WinPct` and `MOV` (margin of victory),
- sets a `Season` column.

**Key team-level fields used later**: `School` (team name), `Overall_W`, `Overall_L`, `Scoring_Off`, `Scoring_Def`, `AP_Rank`, and the `SRS_*` metrics. (`AP_Pre` and `AP_High` are not in the ratings files; they are joined from the standings data in Section 5.)
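
The header flattening can be illustrated in isolation. The sketch below is a simplified stand-in for the logic inside `clean_ratings`, not the notebook's exact code: `flatten_header` is a hypothetical helper, and the `pairs` list is a hand-written imitation of the `(top, bottom)` column tuples that a two-row header typically yields, where blank header cells show up with `Unnamed: ...` labels.

```python
def flatten_header(pairs):
    """Collapse (top, bottom) header pairs into single names such as 'Overall_W'."""
    flat, current_top = [], None
    for top, bottom in pairs:
        # Blank header cells carry 'Unnamed: ...' labels; treat them as missing.
        top = None if str(top).startswith("Unnamed") else top
        bottom = None if str(bottom).startswith("Unnamed") else bottom
        # Forward-fill the group name across the columns it spans.
        current_top = top if top is not None else current_top
        if current_top is None:
            name = bottom
        elif bottom is None:
            name = current_top
        else:
            name = f"{current_top}_{bottom}"
        flat.append(str(name).replace(" ", "_"))
    return flat


# Hand-written example pairs (illustrative layout, not read from a real file).
pairs = [
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "School"),
    ("Overall", "W"),
    ("Unnamed: 3_level_0", "L"),
    ("Scoring", "Off"),
    ("Unnamed: 5_level_0", "Def"),
]
print(flatten_header(pairs))
```

Columns that appear before the first group name (`Rk`, `School`) keep their bare names, while later columns inherit the most recent group name as a prefix.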


In [3]:
# --- Clean Ratings Files ---
def clean_ratings(path, season):
    df = pd.read_csv(path, header=[0, 1])

    lvl0 = pd.Series([None if isinstance(a, str) and a.startswith("Unnamed") else a for a, _ in df.columns]).ffill()
    lvl1 = [None if (b is None or (isinstance(b, str) and b.startswith("Unnamed"))) else b for _, b in df.columns]

    df.columns = [
        (a if (b is None or b == "" or b == "None") else f"{a}_{b}" if not (a is None or a == "") else b).replace(" ", "_")
        for a, b in zip(lvl0, lvl1)
    ]

    df["WinPct"] = df["Overall_W"] / (df["Overall_W"] + df["Overall_L"])
    df["MOV"]    = df["Scoring_Off"] - df["Scoring_Def"]
    df["Season"] = int(season)

    return df

ratings_all = pd.concat([clean_ratings(f, int(f.split("_")[-1].split(".")[0])) for f in ratings_files], ignore_index=True)

print("Ratings shape:", ratings_all.shape)
ratings_all.head()

Ratings shape: (1432, 20)


Unnamed: 0,Rk,School,Conf,AP_Rank,Overall_W,Overall_L,SRS_OSRS,SRS_DSRS,SRS_SRS,Scoring_Off,Scoring_Def,Passing_Off,Passing_Def,Rushing_Off,Rushing_Def,Total_Off,Total_Def,WinPct,MOV,Season
0,1,Oregon,Pac-12 (North),2.0,13,2,14.36,7.86,22.22,61.16,5.13,11.41,5.1,6.57,3.39,8.58,4.13,0.866667,56.03,2014
1,2,Ohio State,Big Ten (East),1.0,14,1,13.44,7.0,20.44,62.75,5.78,10.71,4.44,6.98,2.61,8.38,3.52,0.933333,56.97,2014
2,3,Alabama,SEC (West),4.0,12,2,9.46,10.88,20.34,57.12,-0.11,10.71,4.5,6.35,1.96,8.23,3.4,0.857143,57.23,2014
3,4,Texas Christian,Big 12,3.0,12,1,11.2,7.76,18.96,61.91,4.27,9.52,5.31,6.22,1.96,7.88,3.51,0.923077,57.64,2014
4,5,Georgia,SEC (East),9.0,10,3,10.88,7.96,18.84,59.13,3.93,9.85,4.09,7.3,3.12,8.27,3.6,0.769231,55.2,2014


## 3 — Sports Reference: Standings (season-level)

This section imports the yearly standings CSVs and computes small derived metrics:
- `ConfStrengthDiff` = conference win pct − overall win pct  
- `PollRankGap` = preseason AP rank − final AP rank

These fields capture perception vs. performance and are used when deciding which teams might have benefitted from poll biases.
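
As a worked example of the two formulas, the sketch below recomputes both metrics for single rows. `perception_gaps` is a hypothetical helper written only for illustration; the input numbers are the 2014 Florida State, Clemson, and Louisville rows from the standings output shown further down.

```python
def perception_gaps(conf_pct, overall_pct, ap_pre, ap_final):
    """Return (ConfStrengthDiff, PollRankGap) for one team-season."""
    conf_strength_diff = round(conf_pct - overall_pct, 3)
    # A team unranked at either end has no defined gap (NaN in the DataFrame).
    if ap_pre is None or ap_final is None:
        poll_rank_gap = None
    else:
        poll_rank_gap = ap_pre - ap_final
    return conf_strength_diff, poll_rank_gap


print(perception_gaps(1.0, 0.929, 1, 5))        # Florida State 2014
print(perception_gaps(0.75, 0.769, 16, 15))     # Clemson 2014
print(perception_gaps(0.625, 0.692, None, 24))  # Louisville 2014 (unranked preseason)
```

A negative `PollRankGap` means the team finished with a worse (numerically larger) rank than it started with; a positive value means it climbed.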


In [4]:
# --- Clean Standings Files ---
def clean_standings(path, season):
    df = pd.read_csv(path, header=[0, 1])

    lvl0 = pd.Series([None if isinstance(a, str) and a.startswith("Unnamed") else a for a, _ in df.columns]).ffill()
    lvl1 = [None if (b is None or (isinstance(b, str) and b.startswith("Unnamed"))) else b for _, b in df.columns]

    df.columns = [
        (a if (b is None or b == "" or b == "None") else f"{a}_{b}" if not (a is None or a == "") else b).replace(" ", "_")
        for a, b in zip(lvl0, lvl1)
    ]

    df["ConfStrengthDiff"] = df["Conference_Pct"] - df["Overall_Pct"]
    df["PollRankGap"] = df["Polls_AP_Pre"] - df["Polls_AP_Rank"]
    df["Season"] = int(season)

    return df

standings_all = pd.concat([clean_standings(f, int(f.split("_")[-1].split(".")[0])) for f in standings_files], ignore_index=True)

print("Standings shape:", standings_all.shape)
standings_all.head()

Standings shape: (1432, 20)


Unnamed: 0,Rk,School,Conf,Overall_W,Overall_L,Overall_Pct,Conference_W,Conference_L,Conference_Pct,Points_Per_Game_Off,Points_Per_Game_Def,SRS_SRS,SRS_SOS,Polls_AP_Pre,Polls_AP_High,Polls_AP_Rank,Polls_Notes,ConfStrengthDiff,PollRankGap,Season
0,1,Florida State,ACC (Atlantic),13,1,0.929,8.0,0.0,1.0,33.7,25.6,14.48,5.13,1.0,1.0,5.0,,0.071,-4.0,2014
1,2,Clemson,ACC (Atlantic),10,3,0.769,6.0,2.0,0.75,30.8,16.7,11.63,2.86,16.0,15.0,15.0,,-0.019,1.0,2014
2,3,Louisville,ACC (Atlantic),9,4,0.692,5.0,3.0,0.625,31.2,21.8,10.52,3.22,,20.0,24.0,,-0.067,,2014
3,4,Boston College,ACC (Atlantic),7,6,0.538,4.0,4.0,0.5,26.2,21.3,6.04,2.35,,,,,-0.038,,2014
4,5,North Carolina State,ACC (Atlantic),8,5,0.615,3.0,5.0,0.375,30.2,27.0,4.17,1.25,,,,,-0.24,,2014


## 4 — Build team-season aggregates from game-level data

We convert the game-level rows into **team-season** rows:
- split home/away into unified team rows,
- aggregate `Points_For`, `Points_Against`, `Total_Wins`,
- compute `PregameElo_Mean`, `PregameElo_Start`, `PostgameElo_End`,
- compute `Elo_Delta` = `PostgameElo_End` − `PregameElo_Start`.

Rationale: some performance metrics (points, Elo movement) require game-level aggregation and are not present in Sports Reference season-level tables.
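
Because `PregameElo_Start` and `PostgameElo_End` are taken with `first`/`last`, they only mean "season start" and "season end" if each team's rows are in chronological order. The sketch below shows one way to enforce that. It assumes a game-date column, here called `startDate`, which is a hypothetical name not confirmed by the source, and it uses made-up Elo values.

```python
import pandas as pd

# Three games for one team, deliberately out of order (illustrative values).
team_games = pd.DataFrame({
    "season": [2014, 2014, 2014],
    "Team": ["Akron", "Akron", "Akron"],
    "startDate": ["2014-09-06", "2014-08-30", "2014-11-28"],  # hypothetical column
    "PregameElo": [1275.0, 1269.0, 1250.0],
    "PostgameElo": [1262.0, 1275.0, 1258.0],
})

# Sort so that "first" is the opener and "last" is the final game of the season.
ordered = team_games.sort_values(["season", "Team", "startDate"])

elo = ordered.groupby(["season", "Team"]).agg(
    PregameElo_Start=("PregameElo", "first"),
    PostgameElo_End=("PostgameElo", "last"),
).reset_index()
elo["Elo_Delta"] = elo["PostgameElo_End"] - elo["PregameElo_Start"]
print(elo)
```

ISO-formatted date strings sort correctly as text; other date formats would need `pd.to_datetime` before sorting.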

In [5]:
# --- Build Team-Season Stats from Games ---
def build_team_season_from_games(games_df):
    home_df = games_df[[
        "season", "homeTeam", "homePoints", "awayPoints",
        "homePregameElo", "homePostgameElo", "homeWon", "excitementIndex"
    ]].copy()
    home_df.rename(columns={
        "homeTeam": "Team", "homePoints": "Points_For", "awayPoints": "Points_Against",
        "homePregameElo": "PregameElo", "homePostgameElo": "PostgameElo",
        "homeWon": "Win", "excitementIndex": "Excitement"
    }, inplace=True)

    away_df = games_df[[
        "season", "awayTeam", "awayPoints", "homePoints",
        "awayPregameElo", "awayPostgameElo", "awayWon", "excitementIndex"
    ]].copy()
    away_df.rename(columns={
        "awayTeam": "Team", "awayPoints": "Points_For", "homePoints": "Points_Against",
        "awayPregameElo": "PregameElo", "awayPostgameElo": "PostgameElo",
        "awayWon": "Win", "excitementIndex": "Excitement"
    }, inplace=True)

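    # NOTE: rows are stacked home games first, then away games, so the "first"/"last"
    # aggregations below follow that stacked order rather than game dates. Sort
    # team_games chronologically first if the file provides a date column.
    team_games = pd.concat([home_df, away_df], ignore_index=True)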

    team_season = team_games.groupby(["season", "Team"]).agg({
        "Points_For": "sum",
        "Points_Against": "sum",
        "Win": "sum",
        "PregameElo": ["mean", "first"],
        "PostgameElo": "last",
        "Excitement": "mean"
    }).reset_index()

    team_season.columns = [
        "Season", "Team", "Points_For", "Points_Against", "Total_Wins",
        "PregameElo_Mean", "PregameElo_Start", "PostgameElo_End", "Excitement"
    ]
    team_season["Elo_Delta"] = team_season["PostgameElo_End"] - team_season["PregameElo_Start"]
    return team_season

team_season_stats = build_team_season_from_games(df)
print(team_season_stats.head())

   Season               Team  Points_For  Points_Against  Total_Wins  \
0    2014  Abilene Christian       371.0           325.0           5   
1    2014          Air Force       409.0           314.0          10   
2    2014              Akron       271.0           277.0           5   
3    2014            Alabama       517.0           258.0          12   
4    2014        Alabama A&M       314.0           392.0           4   

   PregameElo_Mean  PregameElo_Start  PostgameElo_End  Excitement  Elo_Delta  
0              NaN               NaN              NaN    7.363467        NaN  
1      1323.666667            1320.0           1320.0    5.111368        0.0  
2      1243.636364            1269.0           1258.0    4.287397      -11.0  
3      2021.230769            1952.0           1952.0    4.157294        0.0  
4              NaN               NaN              NaN    0.976413        NaN  


## 5 — Merge datasets to create a single season-level table

Steps:
1. Inner-join Sports Reference `ratings_all` with `team_season_stats` (on `School`/`Team`, `Season`).
2. Left-join selected `standings_all` columns (`Polls_AP_Pre`, `Polls_AP_High`, `ConfStrengthDiff`, `PollRankGap`, `SRS_SOS`) so we have perception metrics (AP polls) alongside objective performance metrics. The poll and SOS columns are then renamed to `AP_Pre`, `AP_High`, and `SOS`.

**Note on merges:** team name mismatches are the most common source of lost rows — if your merge returns unexpectedly few rows, inspect `School` vs `Team` naming and consider a name-mapping step.
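
One way to find the mismatched names is a left join with `indicator=True`, which marks rows that found no partner. The sketch below uses two tiny made-up frames; `unmatched_schools` is a hypothetical helper, and the `name_map` entry is an assumed example of a naming difference between the two sources, not a verified mapping.

```python
import pandas as pd

ratings = pd.DataFrame({"School": ["Oregon", "Texas Christian"], "Season": [2014, 2014]})
team_stats = pd.DataFrame({"Team": ["Oregon", "TCU"], "Season": [2014, 2014],
                           "Total_Wins": [13, 12]})


def unmatched_schools(ratings, team_stats, name_map=None):
    """List ratings schools that have no team-season row after optional renaming."""
    mapping = name_map or {}
    left = ratings.copy()
    left["Team"] = left["School"].map(lambda s: mapping.get(s, s))
    check = left.merge(team_stats, on=["Team", "Season"], how="left", indicator=True)
    return check.loc[check["_merge"] == "left_only", "School"].tolist()


print(unmatched_schools(ratings, team_stats))           # names lost by the join
name_map = {"Texas Christian": "TCU"}                   # assumed mapping, for illustration
print(unmatched_schools(ratings, team_stats, name_map)) # after applying the mapping
```

Running the same check against `ratings_all` and `team_season_stats` would list the schools dropped by the inner join, which is a starting point for building a real mapping.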

In [6]:
# --- Merge Ratings + Team Stats + Standings ---
merged_df = pd.merge(ratings_all, team_season_stats, left_on=["School", "Season"], right_on=["Team", "Season"], how="inner")

standings_keep = standings_all[["School", "Season", "Polls_AP_Pre", "Polls_AP_High", "ConfStrengthDiff", "PollRankGap", "SRS_SOS"]]

final_merged = pd.merge(merged_df, standings_keep, on=["School", "Season"], how="left")
final_merged = final_merged.rename(columns={"Polls_AP_Pre": "AP_Pre", "Polls_AP_High": "AP_High", "SRS_SOS": "SOS"})


print("Final merged shape:", final_merged.shape)
print("ratings_all rows:", len(ratings_all))
print("team_season_stats rows:", len(team_season_stats))
print("after merge (ratings x team):", len(merged_df))
print("after final merge (with standings):", len(final_merged))
final_merged.head()

Final merged shape: (1299, 34)
ratings_all rows: 1432
team_season_stats rows: 5522
after merge (ratings x team): 1299
after final merge (with standings): 1299


Unnamed: 0,Rk,School,Conf,AP_Rank,Overall_W,Overall_L,SRS_OSRS,SRS_DSRS,SRS_SRS,Scoring_Off,...,PregameElo_Mean,PregameElo_Start,PostgameElo_End,Excitement,Elo_Delta,AP_Pre,AP_High,ConfStrengthDiff,PollRankGap,SOS
0,1,Oregon,Pac-12 (North),2.0,13,2,14.36,7.86,22.22,61.16,...,1957.857143,1910.0,1880.0,2.887519,-30.0,3.0,2.0,0.022,1.0,6.02
1,2,Ohio State,Big Ten (East),1.0,14,1,13.44,7.0,20.44,62.75,...,2005.866667,2021.0,1991.0,3.548213,-30.0,5.0,1.0,0.067,4.0,5.17
2,3,Alabama,SEC (West),4.0,12,2,9.46,10.88,20.34,57.12,...,2021.230769,1952.0,1952.0,4.157294,0.0,2.0,1.0,0.018,-2.0,7.27
3,5,Georgia,SEC (East),9.0,10,3,10.88,7.96,18.84,59.13,...,1926.916667,2015.0,2020.0,3.306849,5.0,12.0,6.0,-0.019,3.0,5.07
4,6,Ole Miss,SEC (West),17.0,9,4,0.87,15.93,16.8,47.62,...,1837.166667,1941.0,1898.0,3.656559,-43.0,18.0,3.0,-0.067,1.0,8.11


## 6 — Thresholds: Defining "Elite" vs "Borderline"

**Goal:** create objective rules for identifying truly elite performance for each season (so that we can detect teams that were highly ranked by polls but not truly elite by performance).  

**Process:**
- Convert unranked AP fields to a sentinel (−1) so missing rankings are explicit.
- Focus on **pure-play** metrics (e.g., SRS, MOV, WinPct, Points_For/Against, Elo) — avoid perception-based columns (AP poll) when defining performance thresholds.
- Compute season-level top-10 statistics for each metric to help establish sensible fixed thresholds and "borderline" ranges.
- Apply two categories:
  - **Elite** — meets a conservative fixed cutoff.
  - **Borderline** — within a reasonable window (e.g., within ~2σ of top-performing means).

These flags are later summed per team-season to get `EliteHits` and `BorderlineHits`, which feed fraud labeling.
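
The per-season top-10 statistics reduce to "sort in the right direction, keep the best N, summarize". A minimal pure-Python sketch of that step is below; `top_n_summary` is a hypothetical helper, and it uses the sample standard deviation, matching the `ddof=1` choice in the notebook code. The input values are taken from the 2014 ratings rows shown earlier, with `n` shrunk for readability.

```python
import statistics


def top_n_summary(values, n=10, higher_is_better=True):
    """Summarize the best n values of one metric for one season."""
    ranked = sorted(values, reverse=higher_is_better)[:n]
    return {
        "mean": statistics.mean(ranked),
        # Sample standard deviation (ddof=1); undefined for a single value.
        "std": statistics.stdev(ranked) if len(ranked) > 1 else 0.0,
        "min": min(ranked),
        "max": max(ranked),
    }


srs = [22.22, 20.44, 20.34, 18.96, 18.84]   # higher is better
scoring_def = [5.13, 5.78, -0.11, 4.27]     # lower is better

best_srs = top_n_summary(srs, n=3)
best_def = top_n_summary(scoring_def, n=2, higher_is_better=False)
print(best_srs)
print(best_def)
```

For lower-is-better metrics the "top" rows are the smallest values, so `min` is the best result and `max` is the weakest value that still made the cut.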

In [7]:
# --- 1) AP cleaning: set unranked teams to -1 ---
for col in ["AP_Pre", "AP_Rank", "AP_High"]:
    if col in final_merged.columns:
        final_merged[col] = final_merged[col].fillna(-1)

print("AP cleaning done. Example AP values:")
print(final_merged[["Season","Team","AP_Pre","AP_Rank","AP_High"]].head(10))

# --- 2) Define pure-play performance features (drop perception + redundant ones) ---
pure_play_features = [
    "SRS_OSRS","SRS_DSRS","SRS_SRS",
    "Scoring_Off","Scoring_Def",
    "Passing_Off","Passing_Def",
    "Rushing_Off","Rushing_Def",
    "Total_Off","Total_Def",
    "WinPct","MOV","Points_For","Points_Against",
    "Total_Wins","PostgameElo_End","Elo_Delta","SOS"
]

print("\nPure-play features (cleaned):")
print(pure_play_features)

# --- 3) Direction rules (lower is better for defense/allowed) ---
lower_is_better = {
    "Scoring_Def", "Passing_Def","Rushing_Def","Total_Def","Points_Against"
}

def higher_is_better(col):
    return col not in lower_is_better

print("\nDirection check:")
for f in pure_play_features:
    print(f"{f:20s} -> {'higher' if higher_is_better(f) else 'lower'} is better")

# --- 4) Build top-10 summary per season & feature ---
rows = []

for season, group in final_merged.groupby("Season"):
    for feat in pure_play_features:
        if feat not in group.columns:
            continue
        vals = group[feat].dropna()
        if vals.empty:
            continue

        if higher_is_better(feat):
            top = group.sort_values(feat, ascending=False).head(10)
        else:
            top = group.sort_values(feat, ascending=True).head(10)

        teams = top["Team"].astype(str).tolist()
        arr = top[feat].values

        rows.append({
            "Season": int(season),
            "Feature": feat,
            "Direction": "higher" if higher_is_better(feat) else "lower",
            "Top10_mean": float(np.mean(arr)),
            "Top10_std": float(np.std(arr, ddof=1)) if len(arr)>1 else 0.0,
            "Top10_min": float(np.min(arr)),
            "Top10_max": float(np.max(arr)),
            "Top10_count": len(arr),
            "Top10_teams": "; ".join(teams)
        })

top10_summary = pd.DataFrame(rows)

# --- 5) Aggregated summary across seasons ---
agg_summary = top10_summary.groupby("Feature").agg({
    "Top10_mean":["mean","std"],
    "Top10_min":"mean",
    "Top10_max":"mean"
})
agg_summary.columns = ["Top10_mean_mean","Top10_mean_std","Top10_min_mean","Top10_max_mean"]
agg_summary = agg_summary.reset_index().sort_values("Top10_mean_mean", ascending=False)

print("\nAggregated summary across seasons (threshold reference):")
display(agg_summary.round(3))

# --- 6) Helper function for inspection ---
def show_top10_for(season, feature, topn=10):
    df_s = final_merged[final_merged["Season"]==season].copy()
    if feature not in df_s.columns:
        print("Feature not found:", feature)
        return
    if higher_is_better(feature):
        top = df_s.sort_values(feature, ascending=False).head(topn)
    else:
        top = df_s.sort_values(feature, ascending=True).head(topn)
    display(top[["Team",feature,"AP_Pre","AP_Rank"]])

# Example usage:
print("\nExample: 2019 top 10 SOS")
show_top10_for(2019,"SOS")

AP cleaning done. Example AP values:
   Season               Team  AP_Pre  AP_Rank  AP_High
0    2014             Oregon     3.0      2.0      2.0
1    2014         Ohio State     5.0      1.0      1.0
2    2014            Alabama     2.0      4.0      1.0
3    2014            Georgia    12.0      9.0      6.0
4    2014           Ole Miss    18.0     17.0      3.0
5    2014     Michigan State     8.0      5.0      5.0
6    2014       Georgia Tech    -1.0      8.0      8.0
7    2014  Mississippi State    -1.0     11.0      1.0
8    2014             Auburn     6.0     22.0      2.0
9    2014             Baylor    10.0      7.0      4.0

Pure-play features (cleaned):
['SRS_OSRS', 'SRS_DSRS', 'SRS_SRS', 'Scoring_Off', 'Scoring_Def', 'Passing_Off', 'Passing_Def', 'Rushing_Off', 'Rushing_Def', 'Total_Off', 'Total_Def', 'WinPct', 'MOV', 'Points_For', 'Points_Against', 'Total_Wins', 'PostgameElo_End', 'Elo_Delta', 'SOS']

Direction check:
SRS_OSRS             -> higher is better
SRS_DSRS      

Unnamed: 0,Feature,Top10_mean_mean,Top10_mean_std,Top10_min_mean,Top10_max_mean
6,PostgameElo_End,1964.5,32.085,1858.0,2165.727
5,Points_For,563.227,34.453,509.182,640.636
4,Points_Against,201.073,36.995,161.273,230.818
0,Elo_Delta,141.636,23.35,96.818,236.091
14,Scoring_Off,57.977,1.726,54.517,63.857
1,MOV,54.803,4.538,49.472,63.301
12,SRS_SRS,18.566,0.997,15.055,25.326
11,SRS_OSRS,12.099,1.248,9.652,16.512
17,Total_Wins,12.055,0.662,10.909,14.182
10,SRS_DSRS,11.981,0.926,9.774,16.055



Example: 2019 top 10 SOS


Unnamed: 0,Team,SOS,AP_Pre,AP_Rank
595,Michigan,7.88,7.0,18.0
588,Wisconsin,7.73,19.0,11.0
593,Auburn,7.72,16.0,14.0
638,South Carolina,7.34,-1.0,-1.0
584,Ohio State,7.32,5.0,3.0
585,LSU,6.6,6.0,1.0
623,Mississippi State,5.95,-1.0,-1.0
602,Texas,5.8,10.0,25.0
658,Stanford,5.7,25.0,-1.0
611,USC,5.64,-1.0,-1.0


### 6.1 — Strength of Schedule Alternative: Opponent Average Elo

We compute an empirical SOS by averaging opponents' **pregame Elo** across all games for a team-season (`Opp_Avg_Elo`). This captures how strong the schedule was, independent of poll perception.
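
The calculation is easy to verify by hand on a toy schedule. The sketch below is a pure-Python stand-in for the pandas version, using made-up teams and ratings; `opponent_avg_elo` is a hypothetical helper, and the key names mirror the CFBD column names used in this notebook.

```python
from collections import defaultdict


def opponent_avg_elo(games):
    """Map (season, team) -> mean pregame Elo of that team's opponents."""
    faced = defaultdict(list)
    for g in games:
        # The home team faced the away team's rating, and vice versa.
        pairs = [(g["homeTeam"], g["awayPregameElo"]),
                 (g["awayTeam"], g["homePregameElo"])]
        for team, opp_elo in pairs:
            if opp_elo is not None:  # skip missing ratings, as a NaN-skipping mean would
                faced[(g["season"], team)].append(opp_elo)
    return {key: sum(vals) / len(vals) for key, vals in faced.items()}


# Made-up schedule: team A plays B and D at home, and C on the road.
games = [
    {"season": 2014, "homeTeam": "A", "awayTeam": "B", "homePregameElo": 1900, "awayPregameElo": 1600},
    {"season": 2014, "homeTeam": "C", "awayTeam": "A", "homePregameElo": 1800, "awayPregameElo": 1910},
    {"season": 2014, "homeTeam": "A", "awayTeam": "D", "homePregameElo": 1920, "awayPregameElo": None},
]
sos = opponent_avg_elo(games)
print(sos[(2014, "A")])
```

Team A's average covers only the two opponents with a known rating (1600 and 1800); the opponent with a missing Elo is excluded rather than counted as zero.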

In [8]:
# --- Add Strength of Schedule (Opponent Average Elo) ---
def add_strength_of_schedule_to_merged(games_df, merged_df):
    home_opps = games_df[["season", "homeTeam", "awayPregameElo"]].rename(columns={"homeTeam": "Team", "awayPregameElo": "OpponentElo"})
    away_opps = games_df[["season", "awayTeam", "homePregameElo"]].rename(columns={"awayTeam": "Team", "homePregameElo": "OpponentElo"})
    opps = pd.concat([home_opps, away_opps], ignore_index=True)

    sos = opps.groupby(["season", "Team"])["OpponentElo"].mean().reset_index()
    sos.rename(columns={"OpponentElo": "Opp_Avg_Elo"}, inplace=True)

    merged_with_sos = pd.merge(merged_df, sos, left_on=["Season", "School"], right_on=["season", "Team"], how="left")
    merged_with_sos = merged_with_sos.drop(columns=[c for c in ["season", "Team"] if c in merged_with_sos.columns])
    return merged_with_sos

merged_with_sos = add_strength_of_schedule_to_merged(df, final_merged)
print(merged_with_sos[["School", "Season", "Total_Wins", "Opp_Avg_Elo"]].head())

       School  Season  Total_Wins  Opp_Avg_Elo
0      Oregon    2014          13  1670.000000
1  Ohio State    2014          14  1673.933333
2     Alabama    2014          12  1695.384615
3     Georgia    2014          10  1649.250000
4    Ole Miss    2014           9  1746.166667


### 6.2 — Fixed Elite Thresholds and borderline references

This cell defines:
- `elite_thresholds`: a dictionary of conservative numeric cutoffs per metric (e.g., `WinPct >= 0.85`).
- `borderline_ref`: historical top-team means and standard deviations used to define borderline windows.

`add_threshold_features()` creates binary `Elite_<feat>` and `Borderline_<feat>` columns and computes the counters `EliteHits`, `BorderlineHits`, and `TotalThresholdHits`.
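
For a single value, the Elite/Borderline rule works as sketched below. `classify` is a hypothetical scalar version of the masks built in `add_threshold_features()`; the cutoffs and reference values are the `SRS_SRS` and `Scoring_Def` entries from the dictionaries in the next cell.

```python
def classify(value, cutoff, direction, mu, sigma):
    """Return 'elite', 'borderline', or 'neither' for one metric value."""
    if direction == "ge":  # higher is better
        if value >= cutoff:
            return "elite"
        # Borderline: below the cutoff but within 2 sigma of the top-team mean.
        return "borderline" if value >= mu - 2 * sigma else "neither"
    if direction == "le":  # lower is better
        if value <= cutoff:
            return "elite"
        return "borderline" if value <= mu + 2 * sigma else "neither"
    raise ValueError(f"unknown direction: {direction!r}")


# SRS_SRS: cutoff 18 ("ge"), reference (18.6, 1.0) -> borderline window [16.6, 18)
print(classify(20.44, 18, "ge", 18.6, 1.0))
print(classify(16.80, 18, "ge", 18.6, 1.0))
print(classify(15.04, 18, "ge", 18.6, 1.0))

# Scoring_Def: cutoff 1.0 ("le"), reference (-0.75, 2.88) -> borderline window (1.0, 5.01]
print(classify(3.93, 1.0, "le", -0.75, 2.88))
print(classify(5.13, 1.0, "le", -0.75, 2.88))
```

Elite and Borderline are mutually exclusive for a given metric, so a team contributes at most one hit per feature to `TotalThresholdHits`.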

In [9]:
# --- Define Thresholds & Helpers ---

# Elite thresholds
elite_thresholds = {
    "SRS_SRS": (18, "ge"),
    "MOV": (50, "ge"),
    "WinPct": (0.85, "ge"),
    "Total_Wins": (11, "ge"),
    "PostgameElo_End": (1950, "ge"),
    "Elo_Delta": (120, "ge"),
    "SOS": (7.0, "ge"),

    # Offense
    "SRS_OSRS": (12, "ge"),
    "Scoring_Off": (55, "ge"),
    "Points_For": (520, "ge"),
    "Passing_Off": (11.0, "ge"),
    "Rushing_Off": (6.9, "ge"),
    "Total_Off": (8.3, "ge"),

    # Defense
    "SRS_DSRS": (12, "ge"),
    "Scoring_Def": (1.0, "le"),
    "Points_Against": (230, "le"),
    "Passing_Def": (4.2, "le"),
    "Rushing_Def": (2.1, "le"),
    "Total_Def": (3.3, "le"),
}

# Borderline reference (means + stds from top teams across seasons)
borderline_ref = {
    "SRS_SRS": (18.6, 1.0),
    "MOV": (54.8, 4.5),
    "WinPct": (0.879, 0.017),
    "Total_Wins": (12.1, 0.66),
    "PostgameElo_End": (1964.5, 32.1),
    "Elo_Delta": (141.6, 23.4),
    "SOS": (7.07, 1.09),
    "SRS_OSRS": (12.1, 1.25),
    "Scoring_Off": (58.0, 1.7),
    "Points_For": (563.2, 34.5),
    "Passing_Off": (11.35, 0.55),
    "Rushing_Off": (7.06, 0.27),
    "Total_Off": (8.51, 0.21),
    "SRS_DSRS": (12.0, 0.93),
    "Scoring_Def": (-0.75, 2.88),
    "Points_Against": (201.1, 37.0),
    "Passing_Def": (4.07, 0.46),
    "Rushing_Def": (1.80, 0.25),
    "Total_Def": (3.11, 0.30),
}

# --- Apply thresholds ---
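def add_threshold_features(df, elite_thresholds, borderline_ref=None):
    df_out = df.copy()
    # Treat a missing reference as "no borderline windows" so `feat in borderline_ref` is safe.
    borderline_ref = borderline_ref or {}
    elite_cols, borderline_cols = [], []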

    for feat, (cutoff, direction) in elite_thresholds.items():
        if feat not in df_out.columns:
            df_out[f"Elite_{feat}"] = 0
            df_out[f"Borderline_{feat}"] = 0
            continue

        col = df_out[feat]
        direction = direction.lower()

        if direction == "ge":
            elite_mask = col >= cutoff
            if feat in borderline_ref:
                mu, sigma = borderline_ref[feat]
                borderline_mask = (col >= mu - 2*sigma) & (col < cutoff)
            else:
                borderline_mask = pd.Series(False, index=df_out.index)

        elif direction == "le":
            elite_mask = col <= cutoff
            if feat in borderline_ref:
                mu, sigma = borderline_ref[feat]
                borderline_mask = (col > cutoff) & (col <= mu + 2*sigma)
            else:
                borderline_mask = pd.Series(False, index=df_out.index)
        else:
            continue

        df_out[f"Elite_{feat}"] = elite_mask.fillna(False).astype(int)
        df_out[f"Borderline_{feat}"] = borderline_mask.fillna(False).astype(int)

    df_out["EliteHits"] = df_out.filter(like="Elite_").sum(axis=1)
    df_out["BorderlineHits"] = df_out.filter(like="Borderline_").sum(axis=1)
    df_out["TotalThresholdHits"] = df_out["EliteHits"] + df_out["BorderlineHits"]

    return df_out

final_with_thresholds = add_threshold_features(merged_with_sos, elite_thresholds, borderline_ref)
print(final_with_thresholds[["School","Season","EliteHits","BorderlineHits"]].head())


       School  Season  EliteHits  BorderlineHits
0      Oregon    2014          9               2
1  Ohio State    2014         10               4
2     Alabama    2014          9               7
3     Georgia    2014          7               6
4    Ole Miss    2014          7               2


### 6.3 — Inspect threshold results and identify candidates

This quick check prints:
- Top teams by `EliteHits` (who meet many elite cutoffs), and
- Candidate “frauds”: teams with fewer than 6 `EliteHits` whose preseason or peak AP rank was in the Top 12. (The formal labels in Section 7 use the slightly looser cutoff `EliteHits <= 6`.)

Use these outputs to sanity-check the thresholds and tweak the `elite_thresholds` or `borderline_ref` values if needed.

In [10]:
# Check seasons 2014, 2019, 2022, 2024
for yr in [2014, 2019, 2022, 2024]:
    s = final_with_thresholds[final_with_thresholds["Season"] == yr].copy()
    print(f"\nTop Schools by EliteHits in {yr}:")
    display(
        s.sort_values("EliteHits", ascending=False)[
            ["School", "EliteHits", "BorderlineHits", "TotalThresholdHits",
             "AP_Rank","AP_Pre", "AP_High", "SRS_SRS", "WinPct", "SOS"]
        ].head(15)
    )

    print(f"\nFraud Candidates in {yr} (<6 EliteHits & Top 12 AP at some point):")
    frauds = s[(s["EliteHits"] < 6) &
               (((s["AP_Pre"] > 0) & (s["AP_Pre"] <= 12)) |
                ((s["AP_High"] > 0) & (s["AP_High"] <= 12)))]
    display(
        frauds.sort_values("AP_Pre")[
            ["School", "AP_Rank", "AP_Pre", "AP_High", "EliteHits", "BorderlineHits",
             "TotalThresholdHits", "SRS_SRS", "WinPct", "SOS"]
        ]
    )


Top Schools by EliteHits in 2014:


Unnamed: 0,School,EliteHits,BorderlineHits,TotalThresholdHits,AP_Rank,AP_Pre,AP_High,SRS_SRS,WinPct,SOS
1,Ohio State,10,4,14,1.0,5.0,1.0,20.44,0.933333,5.17
2,Alabama,9,7,16,4.0,2.0,1.0,20.34,0.857143,7.27
0,Oregon,9,2,11,2.0,3.0,2.0,22.22,0.866667,6.02
3,Georgia,7,6,13,9.0,12.0,6.0,18.84,0.769231,5.07
4,Ole Miss,7,2,9,17.0,18.0,3.0,16.8,0.692308,8.11
9,Baylor,6,2,8,7.0,10.0,4.0,14.97,0.846154,0.59
21,Clemson,6,0,6,15.0,16.0,15.0,11.63,0.769231,2.86
18,Stanford,5,1,6,-1.0,11.0,11.0,12.42,0.615385,4.65
5,Michigan State,5,2,7,5.0,8.0,5.0,16.07,0.846154,1.68
6,Georgia Tech,5,3,8,8.0,-1.0,8.0,15.92,0.785714,4.78



Fraud Candidates in 2014 (<6 EliteHits & Top 12 AP at some point):


Unnamed: 0,School,AP_Rank,AP_Pre,AP_High,EliteHits,BorderlineHits,TotalThresholdHits,SRS_SRS,WinPct,SOS
6,Georgia Tech,8.0,-1.0,8.0,5,3,8,15.92,0.785714,4.78
7,Mississippi State,11.0,-1.0,1.0,2,2,4,15.04,0.769231,4.81
20,Arizona,19.0,-1.0,8.0,0,1,1,11.9,0.714286,6.12
11,Florida State,5.0,1.0,1.0,3,2,5,14.48,0.928571,5.13
26,Oklahoma,-1.0,4.0,4.0,1,1,2,10.21,0.615385,3.9
8,Auburn,22.0,6.0,2.0,4,5,9,14.99,0.615385,9.06
12,UCLA,10.0,7.0,7.0,1,2,3,14.4,0.769231,9.1
5,Michigan State,5.0,8.0,5.0,5,2,7,16.07,0.846154,1.68
36,South Carolina,-1.0,9.0,9.0,0,1,1,6.7,0.538462,4.93
18,Stanford,-1.0,11.0,11.0,5,1,6,12.42,0.615385,4.65



Top Schools by EliteHits in 2019:


Unnamed: 0,School,EliteHits,BorderlineHits,TotalThresholdHits,AP_Rank,AP_Pre,AP_High,SRS_SRS,WinPct,SOS
584,Ohio State,18,0,18,3.0,5.0,2.0,27.39,0.928571,7.32
587,Clemson,15,1,16,2.0,1.0,1.0,21.04,0.933333,2.7
585,LSU,11,4,15,1.0,6.0,1.0,25.8,1.0,6.6
586,Alabama,10,6,16,8.0,2.0,1.0,21.11,0.846154,2.81
588,Wisconsin,10,3,13,11.0,19.0,6.0,17.73,0.714286,7.73
594,Oklahoma,10,0,10,7.0,4.0,4.0,15.79,0.857143,4.36
589,Georgia,10,3,13,4.0,3.0,3.0,17.64,0.857143,5.21
597,Utah,8,1,9,16.0,14.0,5.0,14.86,0.785714,2.29
599,Memphis,7,1,8,17.0,-1.0,15.0,13.66,0.857143,2.09
590,Penn State,6,6,12,9.0,15.0,5.0,17.34,0.846154,4.88



Fraud Candidates in 2019 (<6 EliteHits & Top 12 AP at some point):


Unnamed: 0,School,AP_Rank,AP_Pre,AP_High,EliteHits,BorderlineHits,TotalThresholdHits,SRS_SRS,WinPct,SOS
603,Baylor,13.0,-1.0,8.0,3,3,6,12.12,0.785714,2.62
604,Minnesota,10.0,-1.0,7.0,2,2,4,11.73,0.846154,2.11
595,Michigan,18.0,7.0,7.0,5,3,8,15.5,0.692308,7.88
591,Notre Dame,12.0,9.0,7.0,5,4,9,17.22,0.846154,3.3
602,Texas,25.0,10.0,9.0,0,5,5,12.18,0.615385,5.8
608,Texas A&M,-1.0,12.0,12.0,0,6,6,10.02,0.615385,5.09



Top Schools by EliteHits in 2022:


Unnamed: 0,School,EliteHits,BorderlineHits,TotalThresholdHits,AP_Rank,AP_Pre,AP_High,SRS_SRS,WinPct,SOS
937,Georgia,15,3,18,1.0,3.0,1.0,25.48,1.0,6.28
940,Alabama,11,5,16,5.0,1.0,1.0,19.66,0.846154,4.58
939,Ohio State,10,4,14,4.0,2.0,2.0,19.82,0.846154,5.2
938,Michigan,9,7,16,3.0,8.0,2.0,20.59,0.928571,2.73
941,Tennessee,8,3,11,6.0,-1.0,2.0,17.9,0.846154,4.98
942,Penn State,6,3,9,7.0,-1.0,7.0,17.56,0.846154,4.02
944,Texas,6,1,7,25.0,-1.0,18.0,15.92,0.615385,8.16
951,USC,6,1,7,12.0,14.0,4.0,12.13,0.785714,3.49
964,Iowa,6,0,6,-1.0,-1.0,-1.0,7.91,0.615385,2.29
957,Illinois,4,1,5,-1.0,-1.0,14.0,10.01,0.615385,1.17



Fraud Candidates in 2022 (<6 EliteHits & Top 12 AP at some point):


Unnamed: 0,School,AP_Rank,AP_Pre,AP_High,EliteHits,BorderlineHits,TotalThresholdHits,SRS_SRS,WinPct,SOS
943,Kansas State,14.0,-1.0,11.0,3,2,5,16.16,0.714286,7.95
967,Florida,-1.0,-1.0,12.0,1,2,3,6.92,0.461538,7.0
954,Tulane,9.0,-1.0,9.0,2,5,7,11.53,0.857143,0.39
953,Washington,8.0,-1.0,8.0,1,4,5,11.56,0.846154,0.64
949,LSU,16.0,-1.0,6.0,1,4,5,13.29,0.714286,6.14
960,UCLA,21.0,-1.0,9.0,0,3,3,8.88,0.692308,0.73
946,Florida State,11.0,-1.0,11.0,1,7,8,14.77,0.769231,3.38
947,Clemson,13.0,4.0,4.0,1,5,6,13.59,0.785714,3.3
955,Notre Dame,18.0,5.0,5.0,0,2,2,10.8,0.692308,3.8
994,Texas A&M,-1.0,6.0,6.0,1,3,4,3.21,0.416667,3.13



Top Schools by EliteHits in 2024:


Unnamed: 0,School,EliteHits,BorderlineHits,TotalThresholdHits,AP_Rank,AP_Pre,AP_High,SRS_SRS,WinPct,SOS
1177,Ohio State,16,2,18,1.0,2.0,1.0,25.21,0.875,8.96
1178,Notre Dame,13,2,15,2.0,7.0,2.0,21.69,0.875,7.57
1181,Texas,12,3,15,4.0,4.0,1.0,19.3,0.8125,7.37
1184,Indiana,11,1,12,10.0,-1.0,5.0,16.33,0.846154,1.95
1179,Oregon,10,4,14,3.0,3.0,1.0,20.04,0.928571,5.83
1180,Penn State,9,5,14,5.0,8.0,3.0,19.6,0.8125,6.91
1182,Ole Miss,8,4,12,11.0,6.0,5.0,16.57,0.769231,2.03
1183,Georgia,8,1,9,6.0,1.0,1.0,16.37,0.785714,7.37
1195,Tennessee,7,0,7,9.0,15.0,4.0,10.89,0.769231,1.89
1185,Alabama,7,3,10,17.0,5.0,1.0,15.33,0.692308,5.33



Fraud Candidates in 2024 (<6 EliteHits & Top 12 AP at some point):


Unnamed: 0,School,AP_Rank,AP_Pre,AP_High,EliteHits,BorderlineHits,TotalThresholdHits,SRS_SRS,WinPct,SOS
1187,SMU,12.0,-1.0,8.0,3,6,9,14.19,0.785714,2.76
1189,BYU,13.0,-1.0,7.0,3,3,6,13.09,0.846154,1.47
1192,Arizona State,7.0,-1.0,7.0,2,5,7,11.85,0.785714,3.28
1196,Iowa State,15.0,-1.0,9.0,1,2,3,10.8,0.785714,3.01
1201,Boise State,8.0,-1.0,8.0,4,0,4,10.5,0.857143,-1.85
1197,Michigan,-1.0,9.0,9.0,5,2,7,10.76,0.615385,9.38
1268,Florida State,-1.0,10.0,10.0,0,1,1,-5.14,0.166667,6.2
1205,Missouri,22.0,11.0,6.0,0,3,3,9.18,0.769231,3.18
1235,Utah,-1.0,12.0,10.0,0,4,4,1.63,0.416667,1.63
1188,LSU,-1.0,13.0,8.0,0,4,4,13.36,0.692308,6.82


## 7 — Fraud labeling: formalize target classes

**Label definitions:**
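- **Fraud (1)**: `0 < AP_High <= 12` (ranked in the AP Top 12 at some point) **AND** `EliteHits <= 6`: strong poll presence with little objective elite support.
- **Contender (0)**: `0 < AP_High <= 12` **AND** `EliteHits > 6`: poll rank supported by many elite metrics.
- **Irrelevant (2)**: never ranked in the AP Top 25 (`AP_High <= 0` or missing; unranked teams carry the −1 sentinel).

Teams whose best AP rank was 13–25 match none of these rules, so their `FraudLabel` stays `NaN`.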

These labels are intentionally conservative so the classifier learns to separate perception (polls) from objective performance.
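
The labeling rules can be written as a single scalar function, which makes the edge cases easy to check. `fraud_label` below is a hypothetical helper that mirrors the masks in the next cell; the example inputs are `AP_High` and `EliteHits` values from the 2014 outputs above.

```python
def fraud_label(ap_high, elite_hits):
    """Return 1 (fraud), 0 (contender), 2 (irrelevant), or None (unlabeled)."""
    # Unranked teams carry the -1 sentinel (or a missing value).
    if ap_high is None or ap_high <= 0:
        return 2
    if ap_high <= 12:
        return 1 if elite_hits <= 6 else 0
    # Peak rank of 13-25: not covered by any rule, so no label is assigned.
    return None


print(fraud_label(1.0, 10))   # Ohio State 2014
print(fraud_label(1.0, 2))    # Mississippi State 2014
print(fraud_label(-1.0, 0))   # Nevada 2014
print(fraud_label(15.0, 6))   # Clemson 2014
```

The last case shows the gap in the rules: a team that peaked between 13 and 25 receives no label and would need to be dropped or handled separately before training a classifier.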

In [11]:
# --- Fraud Label Creation ---

# Copy dataset
labeled_df = final_with_thresholds.copy()

# Initialize FraudLabel as NaN for clarity
labeled_df["FraudLabel"] = np.nan

# Fraud: Top 12 AP at some point AND <= 6 elite hits
fraud_mask = (labeled_df["AP_High"] <= 12) & (labeled_df["EliteHits"] <= 6) & (labeled_df["AP_High"] > 0)
labeled_df.loc[fraud_mask, "FraudLabel"] = 1

# Contender: Top 12 AP at some point AND > 6 elite hits
contender_mask = (labeled_df["AP_High"] <= 12) & (labeled_df["EliteHits"] > 6) & (labeled_df["AP_High"] > 0)
labeled_df.loc[contender_mask, "FraudLabel"] = 0

# Irrelevant: Never ranked in AP Top 25 (AP_High <= 0 or missing)
irrelevant_mask = (labeled_df["AP_High"] <= 0) | (labeled_df["AP_High"].isna())
labeled_df.loc[irrelevant_mask, "FraudLabel"] = 2

# --- Quick sanity checks ---
for yr in [2014, 2019, 2022]:
    subset = labeled_df[labeled_df["Season"] == yr]
    print(f"\nFraud Labels in {yr}:")
    print(
        subset[["School", "AP_Rank", "AP_High", "EliteHits", "FraudLabel"]]
        .sort_values("FraudLabel", ascending=False)
        .head(15)
    )


Fraud Labels in 2014:
              School  AP_Rank  AP_High  EliteHits  FraudLabel
66            Nevada     -1.0     -1.0          0         2.0
91        Vanderbilt     -1.0     -1.0          0         2.0
89            Kansas     -1.0     -1.0          0         2.0
88           Wyoming     -1.0     -1.0          0         2.0
87     Bowling Green     -1.0     -1.0          0         2.0
86       Texas State     -1.0     -1.0          0         2.0
85            Purdue     -1.0     -1.0          0         2.0
84      Old Dominion     -1.0     -1.0          0         2.0
83  Central Michigan     -1.0     -1.0          0         2.0
82      Fresno State     -1.0     -1.0          0         2.0
81           Indiana     -1.0     -1.0          1         2.0
80          Syracuse     -1.0     -1.0          0         2.0
79    Arkansas State     -1.0     -1.0          0         2.0
78          Colorado     -1.0     -1.0          0         2.0
77        Texas Tech     -1.0     -1.0         
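The sanity check above only surfaces rows labeled `2`, so it can help to look at label coverage directly. The following is a minimal sketch on a toy frame (the school names and values are hypothetical, not taken from the dataset). It applies the same three rules with `np.select` and lists the rows that remain unlabeled.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) covering each labeling case.
toy = pd.DataFrame({
    "School": ["A", "B", "C", "D", "E"],
    "AP_High": [3.0, 8.0, 18.0, -1.0, np.nan],
    "EliteHits": [10, 2, 4, 0, 0],
})

# Same rules as the labeling cell: Top 12 at some point, or never ranked.
ranked_top12 = (toy["AP_High"] > 0) & (toy["AP_High"] <= 12)
never_ranked = (toy["AP_High"] <= 0) | toy["AP_High"].isna()

conditions = [
    ranked_top12 & (toy["EliteHits"] <= 6),  # fraud
    ranked_top12 & (toy["EliteHits"] > 6),   # contender
    never_ranked,                            # irrelevant
]
toy["FraudLabel"] = np.select(conditions, [1.0, 0.0, 2.0], default=np.nan)

# Rows that match no rule (best rank 13-25) keep a NaN label.
unlabeled = toy[toy["FraudLabel"].isna()]
print(toy["FraudLabel"].value_counts(dropna=False))
print(unlabeled[["School", "AP_High"]])
```

Running the same `value_counts(dropna=False)` on `labeled_df["FraudLabel"]` would show how many team-seasons fall into the unlabeled 13–25 band.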

## 8 — Prepare training data (features and labels)

**Design choices:**
- We only train on `Fraud` (1) vs `Contender` (0) — exclude `Irrelevant` (2).
- We avoid using threshold-derived features (e.g., Elite_* flags) as predictors to prevent label leakage. Instead we use season-level numeric performance features:
  - `SRS_*`, `Scoring_*`, `Passing_*`, `Rushing_*`, `WinPct`, `MOV`, `Points_For`, `Points_Against`, `Total_Wins`, `PostgameElo_End`, `Elo_Delta`, `SOS`.
- Check class balance (`Fraud` vs `Contender`) and use class-weighting in the model if necessary.

In [12]:
# --- Prepare training dataset ---

# Keep only teams labeled as fraud (1) or contender (0)
train_df = labeled_df[labeled_df["FraudLabel"].isin([0, 1])].copy()

# Report how many irrelevant teams (FraudLabel == 2) the filter above excluded
print("Dropped irrelevant schools:", labeled_df[labeled_df["FraudLabel"] == 2].shape[0])

# Select features for training (exclude threshold-based ones to avoid leakage)
feature_cols = [
    "SRS_OSRS", "SRS_DSRS", "SRS_SRS",
    "Scoring_Off", "Scoring_Def", "Passing_Off", "Passing_Def",
    "Rushing_Off", "Rushing_Def", "Total_Off", "Total_Def",
    "WinPct", "MOV", "Points_For", "Points_Against",
    "Total_Wins", "PostgameElo_End", "Elo_Delta", "SOS"
]

X = train_df[feature_cols]
y = train_df["FraudLabel"]

print("Training set shape:", X.shape)
print("Fraud/Contender distribution:\n", y.value_counts())
print(X.columns)

Dropped irrelevant schools: 814
Training set shape: (260, 19)
Fraud/Contender distribution:
 FraudLabel
1.0    192
0.0     68
Name: count, dtype: int64
Index(['SRS_OSRS', 'SRS_DSRS', 'SRS_SRS', 'Scoring_Off', 'Scoring_Def',
       'Passing_Off', 'Passing_Def', 'Rushing_Off', 'Rushing_Def', 'Total_Off',
       'Total_Def', 'WinPct', 'MOV', 'Points_For', 'Points_Against',
       'Total_Wins', 'PostgameElo_End', 'Elo_Delta', 'SOS'],
      dtype='object')
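The class counts above (192 frauds vs 68 contenders) are what motivate class weighting. As a rough illustration, the sketch below reproduces the weights implied by scikit-learn's documented `"balanced"` heuristic, `n_samples / (n_classes * count)`, for these counts. It uses the full 260 rows; inside a train/test split the counts, and therefore the weights, differ slightly.

```python
# Class counts taken from the training-set output above.
counts = {0: 68, 1: 192}  # 0 = contender, 1 = fraud

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, w in weights.items():
    print(f"class {label}: weight = {w:.3f}")
```

Each contender row therefore counts roughly 2.8 times as much as a fraud row when the trees are grown.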


## 9 — Model training & evaluation

We train a **Random Forest** classifier with:
- `class_weight="balanced"` to handle class imbalance,
- `n_estimators=500` for stability.

We evaluate with:
- Classification report (precision/recall/F1),
- Confusion matrix,
- ROC-AUC score,
- 5-fold cross-validation (AUC and accuracy) for robustness.

After validation, we retrain on the full labeled dataset and generate the final FraudScore probabilities.

In [13]:
# --- Train/Test Split ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify preserves the fraud/contender ratio in both splits
)

# --- Model ---
rf_clf = RandomForestClassifier(
    n_estimators=500,
    max_depth=None,
    min_samples_split=2,
    random_state=42,
    class_weight="balanced"  # handles imbalance (192 frauds vs 68 contenders)
)
rf_clf.fit(X_train, y_train)

# --- Predictions ---
y_pred = rf_clf.predict(X_test)
y_proba = rf_clf.predict_proba(X_test)[:, 1]  # probability of being a fraud

# --- Evaluation ---
print("Classification Report:\n", classification_report(y_test, y_pred, digits=3))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))

# --- Cross-Validation (5-fold) ---
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(rf_clf, X, y, cv=kf, scoring="roc_auc")
cv_acc = cross_val_score(rf_clf, X, y, cv=kf, scoring="accuracy")

print("\nCross-validated AUC scores:", cv_auc)
print("Mean AUC:", cv_auc.mean())
print("Cross-validated Accuracy scores:", cv_acc)
print("Mean Accuracy:", cv_acc.mean())

# --- Feature Importance ---
importances = pd.Series(rf_clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print("\nTop Feature Importances:")
print(importances.head(15))

Classification Report:
               precision    recall  f1-score   support

         0.0      0.923     0.857     0.889        14
         1.0      0.949     0.974     0.961        38

    accuracy                          0.942        52
   macro avg      0.936     0.915     0.925        52
weighted avg      0.942     0.942     0.942        52

Confusion Matrix:
 [[12  2]
 [ 1 37]]
ROC-AUC: 0.9868421052631579

Cross-validated AUC scores: [0.99761905 0.96666667 0.97368421 0.98165869 0.99459459]
Mean AUC: 0.9828446423183266
Cross-validated Accuracy scores: [0.98076923 0.92307692 0.90384615 0.90384615 0.98076923]
Mean Accuracy: 0.9384615384615383

Top Feature Importances:
MOV                0.245352
SRS_SRS            0.149832
PostgameElo_End    0.090839
Points_For         0.073414
Total_Off          0.071443
Scoring_Off        0.069919
Scoring_Def        0.052664
Total_Def          0.042717
Passing_Off        0.038934
WinPct             0.027942
Total_Wins         0.021605
Rushing_De
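To make the classification report easier to audit, the short sketch below recomputes its headline numbers from the confusion matrix printed above. It assumes scikit-learn's layout: rows are true classes, columns are predicted classes, in the order `[0, 1]`.

```python
# Confusion matrix from the held-out test set above.
# Rows = true class, columns = predicted class, order [contender (0), fraud (1)].
cm = [[12, 2],
      [1, 37]]

tn, fp = cm[0]  # true contenders: predicted contender / predicted fraud
fn, tp = cm[1]  # true frauds:     predicted contender / predicted fraud

precision_fraud = tp / (tp + fp)
recall_fraud = tp / (tp + fn)
f1_fraud = 2 * tp / (2 * tp + fp + fn)

precision_contender = tn / (tn + fn)
recall_contender = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"fraud:     precision={precision_fraud:.3f} recall={recall_fraud:.3f} f1={f1_fraud:.3f}")
print(f"contender: precision={precision_contender:.3f} recall={recall_contender:.3f}")
print(f"accuracy:  {accuracy:.3f}")
```

In words: two contenders were flagged as frauds and one fraud was missed, out of 52 test rows.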

### 9.1 — Season-aware cross validation (GroupKFold)

To avoid leakage within a season (teams from the same season appearing in both the train and test splits), we also evaluate with `GroupKFold` using `groups = train_df["Season"]`, so each season falls entirely into either the training folds or the test fold. This gives a sense of how the model generalizes to seasons it has not seen.

In [14]:
# Define groups (seasons)
groups = train_df["Season"]

# Model
rf_clf = RandomForestClassifier(
    n_estimators=500,
    random_state=42,
    class_weight="balanced"
)

# GroupKFold: 5 splits, season-aware
gkf = GroupKFold(n_splits=5)

auc_scores = []
acc_scores = []

for train_idx, test_idx in gkf.split(X, y, groups=groups):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    rf_clf.fit(X_train, y_train)
    y_pred = rf_clf.predict(X_test)
    y_proba = rf_clf.predict_proba(X_test)[:, 1]

    auc_scores.append(roc_auc_score(y_test, y_proba))
    acc_scores.append(accuracy_score(y_test, y_pred))

print("GroupKFold Season-Aware AUC Scores:", auc_scores)
print("Mean AUC:", np.mean(auc_scores))
print("GroupKFold Season-Aware Accuracy:", acc_scores)
print("Mean Accuracy:", np.mean(acc_scores))

GroupKFold Season-Aware AUC Scores: [0.95703125, 0.9940476190476191, 1.0, 0.9651567944250872, 0.9927536231884058]
Mean AUC: 0.9817978573322226
GroupKFold Season-Aware Accuracy: [0.875, 0.96, 0.9795918367346939, 0.8958333333333334, 0.9230769230769231]
Mean Accuracy: 0.9267004186289901
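One way to confirm that the splits really are season-disjoint is to inspect the groups on each side of every fold. The sketch below does this on a hypothetical stand-in for `train_df["Season"]` (11 seasons with 20 rows each, which is not the real distribution); the same loop could be pointed at the actual `groups` series.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical stand-in for train_df["Season"]: 11 seasons, 20 rows each.
seasons = np.repeat(np.arange(2014, 2025), 20)
X_dummy = np.zeros((len(seasons), 1))  # feature values do not affect the split

gkf = GroupKFold(n_splits=5)
overlaps = []
for fold, (train_idx, test_idx) in enumerate(gkf.split(X_dummy, groups=seasons)):
    train_seasons = set(seasons[train_idx].tolist())
    test_seasons = set(seasons[test_idx].tolist())
    shared = train_seasons & test_seasons
    overlaps.append(len(shared))
    print(f"fold {fold}: test seasons = {sorted(test_seasons)}, shared with train = {len(shared)}")
```

Every fold should report zero shared seasons. Note that `GroupKFold` holds out whole seasons but does not order them in time, so a fold can train on later seasons and test on earlier ones.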


## 10 — Final model: retrain & assign FraudScore

After validation, we retrain the Random Forest on the complete fraud-vs-contender dataset, compute fraud probabilities, and scale them to a `FraudScore` (0–100). These scores are in-sample: the model scores the same rows it was trained on, so they tend to be pushed toward 0 or 100. Irrelevant teams (never ranked) are assigned `FraudScore = 100` by convention here (they are not evaluated by the classifier).

In [15]:
# --- Retrain on all fraud vs contender teams ---
rf_clf.fit(X, y)

# --- Predict probabilities for fraud (class 1) ---
fraud_proba = rf_clf.predict_proba(X)[:, 1]  # probability of fraud
labeled_df.loc[labeled_df["FraudLabel"].isin([0, 1]), "FraudScore"] = fraud_proba * 100

# Assign irrelevant teams (FraudLabel == 2) a FraudScore of 100 automatically
labeled_df.loc[labeled_df["FraudLabel"] == 2, "FraudScore"] = 100.0
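Because the cell above scores the same rows the model was fitted on, a possible alternative is to build the score from out-of-fold probabilities, where each row is scored by a model that never saw its season. This is a hedged sketch on synthetic data, not the notebook's method: `X_demo`, `y_demo` and `seasons_demo` are placeholders for `X`, `y` and `train_df["Season"]`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_predict

# Synthetic stand-ins for X, y and train_df["Season"] (not the notebook's data).
X_demo, y_demo = make_classification(n_samples=220, n_features=8, random_state=0)
seasons_demo = np.repeat(np.arange(2014, 2025), 20)

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)

# Each row's probability comes from a fold whose training data excludes its season.
oof_proba = cross_val_predict(
    model, X_demo, y_demo,
    groups=seasons_demo,
    cv=GroupKFold(n_splits=5),
    method="predict_proba",
)[:, 1]

oof_score = oof_proba * 100  # same 0-100 scaling as FraudScore
print(oof_score[:5].round(1))
```

Out-of-fold scores are usually less extreme than in-sample ones, which can make the ranking within a season more informative.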

## 11 — Results: Top predicted frauds by season

This prints the top predicted frauds (by `FraudScore`) per season. Use these tables to:
- Inspect surprising predictions,
- Verify whether high-scoring frauds match intuition (check their AP rankings and `EliteHits`),
- Export/visualize top cases for reports.

In [16]:
# --- Check top predicted frauds per season (excluding irrelevant teams) ---
for season in range(2014, 2025):  # seasons 2014 through 2024
    print(f"\nTop Predicted Frauds in {season}:")
    subset = labeled_df[(labeled_df["Season"] == season) & (labeled_df["FraudLabel"].isin([0, 1]))].copy()
    top_frauds = subset.sort_values("FraudScore", ascending=False).head(25)
    print(top_frauds[["School", "FraudScore", "FraudLabel", "AP_Rank", "AP_High", "EliteHits", "WinPct", "SRS_SRS"]])


Top Predicted Frauds in 2014:
               School  FraudScore  FraudLabel  AP_Rank  AP_High  EliteHits  \
13                USC       100.0         1.0     20.0      9.0          0   
12               UCLA       100.0         1.0     10.0      7.0          1   
29         Notre Dame       100.0         1.0     -1.0      5.0          0   
27          Texas A&M       100.0         1.0     -1.0      6.0          1   
25           Nebraska       100.0         1.0     -1.0     11.0          0   
20            Arizona       100.0         1.0     19.0      8.0          0   
15       Kansas State       100.0         1.0     18.0      9.0          0   
14      Arizona State       100.0         1.0     12.0      7.0          0   
36     South Carolina       100.0         1.0     -1.0      9.0          0   
7   Mississippi State       100.0         1.0     11.0      1.0          2   
26           Oklahoma        99.8         1.0     -1.0      4.0          1   
22                LSU        99.4

              School  FraudScore  FraudLabel  AP_Rank  AP_High  EliteHits  \
1093  North Carolina       100.0         1.0     -1.0     10.0          0   
1087            Utah       100.0         1.0     -1.0     10.0          2   
1078    Oregon State       100.0         1.0     -1.0     10.0          0   
1070         Arizona        99.8         1.0     11.0     11.0          1   
1075         Clemson        99.8         1.0     20.0      9.0          2   
1073      Louisville        99.8         1.0     19.0      9.0          1   
1072       Tennessee        99.8         1.0     17.0      9.0          1   
1065   Florida State        98.8         1.0      6.0      3.0          3   
1071        Ole Miss        97.8         1.0      9.0      9.0          2   
1066        Missouri        97.8         1.0      8.0      8.0          3   
1077             USC        92.8         1.0     -1.0      5.0          5   
1067        Oklahoma        80.6         1.0     15.0      5.0          3   
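For the export step mentioned in this section, a per-season top-N table can be built in one pass instead of printing inside a loop. The sketch below uses a toy frame with hypothetical values in place of `labeled_df`; `TOP_N` is an illustrative parameter.

```python
import pandas as pd

# Toy stand-in for labeled_df (hypothetical values).
demo = pd.DataFrame({
    "Season": [2023, 2023, 2023, 2024, 2024, 2024],
    "School": ["A", "B", "C", "D", "E", "F"],
    "FraudScore": [99.8, 12.0, 80.6, 45.0, 100.0, 3.2],
    "FraudLabel": [1, 0, 1, 0, 1, 0],
})

TOP_N = 2  # hypothetical cutoff; the printing cell above uses 25

top_per_season = (
    demo[demo["FraudLabel"].isin([0, 1])]          # skip irrelevant teams
    .sort_values(["Season", "FraudScore"], ascending=[True, False])
    .groupby("Season")
    .head(TOP_N)                                   # keep the N highest scores per season
    .reset_index(drop=True)
)

csv_text = top_per_season.to_csv(index=False)  # pass a file path to write to disk instead
print(csv_text)
```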

# Executive Summary

This project develops a methodology for identifying “fraudulent” teams in college football: teams whose national rankings are not supported by their performance metrics. Using game-level data, historical team ratings, and standings from 2014–2024, we constructed a merged dataset capturing both performance-based features (e.g., scoring efficiency, defensive strength, Elo rating changes, strength of schedule) and perception-based indicators (e.g., AP rankings). From this foundation, we established statistical thresholds to classify elite and borderline-elite performance levels across multiple metrics.

We then applied these thresholds to flag teams as contenders, frauds, or irrelevant, and trained a Random Forest classifier to assign a continuous “Fraud Score” (0–100) reflecting the likelihood that a team is overrated. In season-aware cross-validation (`GroupKFold` by season), the model reached a mean ROC-AUC of about 0.98 and a mean accuracy of about 0.93, highlighting mismatches between rankings and statistical performance. This framework provides a transparent, data-driven approach to evaluate ranking integrity and team quality in college football.