## 📐 Statistical Tests

### Parametric Test
- **t-test**: Do home teams score significantly more goals than away teams?
  - *Hypothesis*: Home advantage leads to higher goal counts

### Non-parametric Test
- **Mann-Whitney U test**: xG comparison between top 5 and bottom 5 teams
  - *Hypothesis*: Top teams generate significantly higher xG
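
Both hypotheses are directional (home > away, top > bottom), so a one-sided alternative is one reasonable way to encode them. The sketch below uses made-up numbers, not the Serie A data, and assumes paired per-match goal counts and SciPy 1.6 or newer, where `ttest_rel` accepts the `alternative` keyword. All variable names ending in `_toy` are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data (NOT from the dataset): goals per match for both sides.
home_goals_toy = np.array([2, 1, 3, 0, 2, 1, 2, 3], dtype=float)
away_goals_toy = np.array([1, 1, 0, 1, 2, 0, 1, 1], dtype=float)

# Two-sided vs one-sided paired t-test (H1: home mean > away mean).
_, p_t_two = stats.ttest_rel(home_goals_toy, away_goals_toy)
_, p_t_one = stats.ttest_rel(home_goals_toy, away_goals_toy, alternative='greater')

# Hypothetical per-match xG for a "top" and a "bottom" group.
xg_top_toy = np.array([1.9, 2.3, 1.4, 1.7, 2.0, 1.6])
xg_bottom_toy = np.array([0.8, 1.1, 1.3, 0.6, 1.0, 1.5])

# Two-sided vs one-sided Mann-Whitney U (H1: top xG stochastically greater).
_, p_u_two = stats.mannwhitneyu(xg_top_toy, xg_bottom_toy, alternative='two-sided')
_, p_u_one = stats.mannwhitneyu(xg_top_toy, xg_bottom_toy, alternative='greater')

print(f"t-test       two-sided p={p_t_two:.4f}, one-sided p={p_t_one:.4f}")
print(f"Mann-Whitney two-sided p={p_u_two:.4f}, one-sided p={p_u_one:.4f}")
```

When the observed effect points in the hypothesised direction, the one-sided t-test p-value is half the two-sided one, so a two-sided test is the more conservative choice.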

---

In [1]:
import pandas as pd

df = pd.read_parquet('data/serie_a_matches_processed.parquet')

df.tail()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,pkatt,season,team,opp captain,opp sh,opp sot,opp dist,opp fk,opp pk,opp pkatt
1946,2020-09-20,18:00,Serie A,1,Sun,Home,D,1,1,Cagliari,...,0,2020,Sassuolo,João Pedro,8,4,15.9,1,0,0
1947,2020-09-20,15:00,Serie A,1,Sun,Home,W,4,1,Crotone,...,0,2020,Genoa,Alex Cordaz,13,3,18.8,0,0,0
1948,2020-09-20,12:30,Serie A,1,Sun,Home,L,0,2,Napoli,...,0,2020,Parma,Lorenzo Insigne,17,6,19.4,0,0,0
1949,2020-09-19,20:45,Serie A,1,Sat,Home,W,3,0,Roma,...,0,2020,Hellas Verona,Lorenzo Pellegrini,21,4,,0,0,0
1950,2020-09-19,18:00,Serie A,1,Sat,Home,W,1,0,Torino,...,0,2020,Fiorentina,Andrea Belotti,6,3,18.7,1,0,0


## ✅ Implementation (season-aware statistical tests)
Below we implement both tests using the `df` dataset.
We compute results **overall** and **per season** to keep the analysis season-based and interpretable.

### 🔍 Data Diagnostic Check
First, let's verify the data structure to ensure we have both home and away matches.

In [6]:
# Diagnostic: Check venue column values and distribution
print("=== Data Diagnostics ===")
print(f"Total rows in df: {len(df)}")
print(f"\nUnique values in 'venue' column:")
print(df['venue'].value_counts())
print(f"\nNull values in key columns:")
print(df[['venue', 'gf', 'xg', 'season', 'team']].isnull().sum())
print(f"\nData types:")
print(df[['venue', 'gf', 'xg', 'season', 'team']].dtypes)

=== Data Diagnostics ===
Total rows in df: 1951

Unique values in 'venue' column:
venue
Home    1951
Name: count, dtype: int64

Null values in key columns:
venue     0
gf        0
xg        1
season    0
team      0
dtype: int64

Data types:
venue         str
gf          int64
xg        float64
season      int64
team          str
dtype: object


### ⚠️ Issue Identified
The preprocessed dataset only contains **Home** venue entries. Each row represents a match from the home team's perspective with opponent stats included.

To properly test home vs away performance, we need to compare:
- **Home performance**: `gf` (goals for - home team)
- **Away performance**: `ga` (goals against - which are the away team's goals)

Both values come from the same match, so the observations are paired. Let's redesign the tests accordingly.
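
A paired t-test on two columns is equivalent to a one-sample t-test on their row-wise difference, which is what makes it suitable when both values describe the same match. A minimal sketch on made-up numbers (not the Serie A data; `home` and `away` are hypothetical stand-ins for `gf` and `ga`):

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: six matches, one row per match (NOT from the dataset).
home = np.array([2, 1, 3, 0, 2, 1], dtype=float)  # plays the role of `gf`
away = np.array([1, 1, 0, 1, 2, 0], dtype=float)  # plays the role of `ga`

# Paired t-test on the two columns.
t_rel, p_rel = stats.ttest_rel(home, away)

# One-sample t-test on the per-match goal difference against zero.
t_one, p_one = stats.ttest_1samp(home - away, popmean=0.0)

print(f"paired:     t={t_rel:.4f}, p={p_rel:.4f}")
print(f"one-sample: t={t_one:.4f}, p={p_one:.4f}")
```

The two lines print the same statistic and p-value, so the test is really asking whether the mean goal difference per match is zero.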

In [7]:
import numpy as np
from scipy import stats

# ------------------------------------------------------------
# Data Preparation (CORRECTED)
# ------------------------------------------------------------
# The dataset only contains 'Home' venue entries where each row represents
# a match from the home team's perspective.
# - gf = goals scored by HOME team
# - ga = goals scored by AWAY team (opponent)
# We'll compare these directly.

# Convert numeric columns
numeric_cols = ['gf', 'ga', 'xg', 'xga', 'season']
for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Create clean dataset
clean_df = df.dropna(subset=['gf', 'ga', 'xg', 'xga', 'season', 'team']).copy()
clean_df['season'] = pd.to_numeric(clean_df['season'], errors='coerce')
clean_df = clean_df.dropna(subset=['season'])

print(f"Clean dataset: {len(clean_df)} matches")

# ------------------------------------------------------------
# 1) Parametric test: t-test (Home goals vs Away goals)
# ------------------------------------------------------------
# Hypothesis: Home teams score more goals than away teams
# Since each row is a match, we compare gf (home) vs ga (away) for all matches

home_goals = clean_df['gf']  # Goals scored by home team
away_goals = clean_df['ga']  # Goals scored by away team

# Paired t-test (same matches, paired observations)
t_stat_paired, p_val_paired = stats.ttest_rel(home_goals, away_goals, nan_policy='omit')

print("=== Paired t-test (Home vs Away goals) - Overall ===")
print(f"Home goals mean: {home_goals.mean():.3f}")
print(f"Away goals mean: {away_goals.mean():.3f}")
print(f"Difference mean: {(home_goals - away_goals).mean():.3f}")
print(f"t-statistic: {t_stat_paired:.3f}")
print(f"p-value: {p_val_paired:.5f}")

# Season-based results
season_results_ttest = []
for season, grp in clean_df.groupby('season'):
    h = grp['gf']
    a = grp['ga']
    if len(h) > 5 and len(a) > 5:
        t_stat, p_val = stats.ttest_rel(h, a, nan_policy='omit')
        season_results_ttest.append({
            'season': int(season),
            'home_mean': h.mean(),
            'away_mean': a.mean(),
            'diff_mean': (h - a).mean(),
            't_stat': t_stat,
            'p_value': p_val
        })

season_results_ttest_df = pd.DataFrame(season_results_ttest)
if not season_results_ttest_df.empty:
    season_results_ttest_df = season_results_ttest_df.sort_values('season')

print("\n=== Paired t-test by season ===")
if not season_results_ttest_df.empty:
    print(season_results_ttest_df)
else:
    print("No season-level results (insufficient data).")

# Interpretation
alpha = 0.05
print("\n=== Significance (overall) ===")
if not np.isnan(p_val_paired):
    is_sig = p_val_paired < alpha
    print(f"Result: {'✅ SIGNIFICANT' if is_sig else '❌ NOT SIGNIFICANT'} at α={alpha}")
    if is_sig:
        print(f"Conclusion: Home teams score significantly {'MORE' if home_goals.mean() > away_goals.mean() else 'FEWER'} goals than away teams.")
else:
    print("❌ Test failed (insufficient data)")

if not season_results_ttest_df.empty:
    season_results_ttest_df['significant'] = season_results_ttest_df['p_value'] < alpha
    print("\n=== Significance by season (t-test) ===")
    print(season_results_ttest_df[['season', 'home_mean', 'away_mean', 'p_value', 'significant']])

Clean dataset: 1950 matches
=== Paired t-test (Home vs Away goals) - Overall ===
Home goals mean: 1.458
Away goals mean: 1.264
Difference mean: 0.194
t-statistic: 4.929
p-value: 0.00000

=== Paired t-test by season ===
   season  home_mean  away_mean  diff_mean    t_stat   p_value
0    2020   1.630607   1.430079   0.200528  2.130837  0.033747
1    2021   1.502632   1.363158   0.139474  1.521883  0.128872
2    2022   1.412073   1.154856   0.257218  2.963359  0.003235
3    2023   1.434211   1.176316   0.257895  2.990988  0.002962
4    2024   1.339474   1.221053   0.118421  1.355217  0.176155
5    2025   1.240000   1.080000   0.160000  0.687440  0.495047

=== Significance (overall) ===
Result: ✅ SIGNIFICANT at α=0.05
Conclusion: Home teams score significantly MORE goals than away teams.

=== Significance by season (t-test) ===
   season  home_mean  away_mean   p_value  significant
0    2020   1.630607   1.430079  0.033747         True
1    2021   1.502632   1.363158  0.128872        Fa

In [8]:
# ------------------------------------------------------------
# 2) Non-parametric test: Mann-Whitney U (xG top 5 vs bottom 5)
# ------------------------------------------------------------
# Hypothesis: Top teams generate higher xG than bottom teams.
# Approach:
# - Build season table by summing points from results.
# - Select top 5 and bottom 5 teams per season.
# - Compare xG distributions for those groups (home matches only, because
#   every row in the dataset is recorded from the home team's side).

def build_points_table(season_df: pd.DataFrame) -> pd.DataFrame:
    """Build league table based on match results."""
    teams = season_df['team'].unique()
    points_dict = {team: 0 for team in teams}
    
    # Award points to both sides of each match: 3 for a win, 1 each for a draw
    for _, row in season_df.iterrows():
        home = row['team']
        away = row['opponent']
        
        if row['gf'] > row['ga']:  # Home win
            points_dict[home] = points_dict.get(home, 0) + 3
        elif row['gf'] < row['ga']:  # Away win
            points_dict[away] = points_dict.get(away, 0) + 3
        else:  # Draw
            points_dict[home] = points_dict.get(home, 0) + 1
            points_dict[away] = points_dict.get(away, 0) + 1
    
    points_df = pd.DataFrame(list(points_dict.items()), columns=['team', 'points'])
    return points_df.sort_values('points', ascending=False)

# Mann-Whitney U per season
mw_results = []

for season, season_df in clean_df.groupby('season'):
    # Skip if not enough teams
    if season_df['team'].nunique() < 10:
        continue

    table = build_points_table(season_df)
    top_5 = table.head(5)['team'].tolist()
    bottom_5 = table.tail(5)['team'].tolist()

    # Collect xG values for those teams (only when they play at home)
    xg_top = season_df[season_df['team'].isin(top_5)]['xg']
    xg_bottom = season_df[season_df['team'].isin(bottom_5)]['xg']

    # Mann-Whitney U test (two-sided)
    if len(xg_top) > 5 and len(xg_bottom) > 5:
        u_stat, p_val = stats.mannwhitneyu(xg_top, xg_bottom, alternative='two-sided')
        mw_results.append({
            'season': int(season),
            'top5_xg_mean': xg_top.mean(),
            'top5_n': len(xg_top),
            'bottom5_xg_mean': xg_bottom.mean(),
            'bottom5_n': len(xg_bottom),
            'u_stat': u_stat,
            'p_value': p_val
        })

mw_results_df = pd.DataFrame(mw_results)
if not mw_results_df.empty:
    mw_results_df = mw_results_df.sort_values('season')

print("=== Mann-Whitney U test by season (Top 5 vs Bottom 5 xG) ===")
if not mw_results_df.empty:
    print(mw_results_df)
else:
    print("No season-level results (insufficient data).")

# Overall Mann-Whitney U across all seasons combined
overall_table = build_points_table(clean_df)
top_5_all = overall_table.head(5)['team'].tolist()
bottom_5_all = overall_table.tail(5)['team'].tolist()

print(f"\nTop 5 teams overall: {top_5_all}")
print(f"Bottom 5 teams overall: {bottom_5_all}")

xg_top_all = clean_df[clean_df['team'].isin(top_5_all)]['xg']
xg_bottom_all = clean_df[clean_df['team'].isin(bottom_5_all)]['xg']

u_stat_all, p_val_all = stats.mannwhitneyu(xg_top_all, xg_bottom_all, alternative='two-sided')

print("\n=== Mann-Whitney U (Overall) ===")
print(f"Top 5 teams - Mean xG: {xg_top_all.mean():.3f} (n={len(xg_top_all)} matches)")
print(f"Bottom 5 teams - Mean xG: {xg_bottom_all.mean():.3f} (n={len(xg_bottom_all)} matches)")
print(f"U-statistic: {u_stat_all:.3f}")
print(f"p-value: {p_val_all:.5f}")

# Significance interpretation
alpha = 0.05
print("\n=== Significance (overall) ===")
if not np.isnan(p_val_all):
    is_sig = p_val_all < alpha
    print(f"Result: {'✅ SIGNIFICANT' if is_sig else '❌ NOT SIGNIFICANT'} at α={alpha}")
    if is_sig:
        print(f"Conclusion: Top 5 teams generate significantly {'HIGHER' if xg_top_all.mean() > xg_bottom_all.mean() else 'LOWER'} xG than bottom 5 teams.")
else:
    print("❌ Test failed (insufficient data)")

if not mw_results_df.empty:
    mw_results_df['significant'] = mw_results_df['p_value'] < alpha
    print("\n=== Significance by season (Mann-Whitney U) ===")
    print(mw_results_df[['season', 'top5_xg_mean', 'bottom5_xg_mean', 'p_value', 'significant']])

=== Mann-Whitney U test by season (Top 5 vs Bottom 5 xG) ===
   season  top5_xg_mean  top5_n  bottom5_xg_mean  bottom5_n  u_stat  \
0    2020      1.912632      95         1.243158         95  6658.0   
1    2021      1.720000      95         1.096842         95  6472.5   
2    2022      1.633684      95         1.065625         96  6659.0   
3    2023      1.758947      95         1.128421         95  6588.0   
4    2024      1.591579      95         0.983158         95  6705.0   
5    2025      1.400000      13         0.953846         13   126.5   

        p_value  
0  1.468838e-08  
1  2.268895e-07  
2  3.796238e-08  
3  4.241179e-08  
4  7.029652e-09  
5  3.247901e-02  

Top 5 teams overall: ['Internazionale', 'Napoli', 'Milan', 'Juventus', 'Atalanta']
Bottom 5 teams overall: ['Cremonese', 'Frosinone', 'Benevento', 'Crotone', 'Pisa']

=== Mann-Whitney U (Overall) ===
Top 5 teams - Mean xG: 1.763 (n=488 matches)
Bottom 5 teams - Mean xG: 1.111 (n=81 matches)
U-statistic: 29498.500
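
The U statistic is easier to read once scaled by the number of pairs: U / (n1 · n2) estimates the probability that a randomly chosen match from the first group has a higher xG than one from the second, with ties counted as half. The sketch below checks this on made-up numbers and then applies it to the overall figures printed above. The helper name `prob_superiority` is hypothetical, and the sketch assumes SciPy 1.7 or newer, where `mannwhitneyu` returns the U statistic of the first sample.

```python
import numpy as np
from scipy import stats

def prob_superiority(u_stat: float, n1: int, n2: int) -> float:
    """Share of (x, y) pairs in which x beats y, ties counted as half."""
    return u_stat / (n1 * n2)

# Hypothetical toy xG values (NOT from the dataset).
x = np.array([1.8, 2.1, 1.2, 1.5, 0.9])
y = np.array([0.7, 1.2, 0.4, 1.0])

u_toy, _ = stats.mannwhitneyu(x, y, alternative='two-sided')

# Brute-force count over all pairs for comparison.
wins = (x[:, None] > y[None, :]).sum()
ties = (x[:, None] == y[None, :]).sum()
brute = (wins + 0.5 * ties) / (len(x) * len(y))

print(f"toy: from U = {prob_superiority(u_toy, len(x), len(y)):.3f}, brute force = {brute:.3f}")

# Applied to the overall output above: U = 29498.5, n = 488 and 81.
print(f"overall: {prob_superiority(29498.5, 488, 81):.3f}")
```

On the reported overall numbers this comes to roughly 0.75: a top-5 home match out-generates a bottom-5 home match in about three out of four pairings.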