
# Submission 3 – Machine Learning Analysis

This notebook applies supervised machine learning methods to investigate
whether player performance metrics can distinguish between the pre– and
post–Sport Republic periods at Göztepe.

The goal is not prediction accuracy alone, but to identify which performance
features best characterize the post-acquisition era.

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report

# FBref data (already cleaned Excel files)

std_2122 = pd.read_excel("Goztepe2122StandardStats.xlsx")
std_2425 = pd.read_excel("Goztepe2425StandardStats.xlsx")
std_2526 = pd.read_excel("Goztepe_2025_2026 Standard Stats Clean.xlsx")

misc_2122 = pd.read_excel("Goztepe2122MiscStatsClean.xlsx")
misc_2425 = pd.read_excel("Goztepe2425MiscStatsClean.xlsx")
misc_2526 = pd.read_excel("Goztepe_2025_2026_Misc_Stats_Clean.xlsx")

In [4]:
# Label seasons
seasons = ["2021-2022", "2024-2025", "2025-2026"]
std_frames = [std_2122, std_2425, std_2526]
misc_frames = [misc_2122, misc_2425, misc_2526]

for std, misc, season in zip(std_frames, misc_frames, seasons):
    std["season"] = season
    misc["season"] = season

standard_df = pd.concat([std_2122, std_2425, std_2526], ignore_index=True)
misc_df = pd.concat([misc_2122, misc_2425, misc_2526], ignore_index=True)

def period_label(season):
    return "pre" if season == "2021-2022" else "post"

standard_df["period"] = standard_df["season"].apply(period_label)
misc_df["period"] = misc_df["season"].apply(period_label)

In [12]:
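# Merge performance metrics (inner join: only players found in both tables are kept)
ml_df = standard_df.merge(
    misc_df,
    on=["Player", "season", "period"],
    how="inner",
    suffixes=("_std", "_misc")
)

# Target: "pre" or "post"
y = ml_df["period"]

# Features: every numeric column. Columns present in both tables
# (e.g. Age) appear twice, as *_std and *_misc.
X = ml_df.select_dtypes(include=[np.number]).copy()

# Replace infinities and missing values with 0 (a simple imputation choice)
X = X.replace([np.inf, -np.inf], np.nan).fillna(0)
y = y.loc[X.index]  # keep y aligned with X

X.shape, y.value_counts()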

((66, 12),
 period
 post    43
 pre     23
 Name: count, dtype: int64)
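
The merge above silently drops any player row that appears in only one of the two tables, and it keeps shared columns twice under the `_std` and `_misc` suffixes. The following is a minimal sketch of how both effects could be checked. The toy frames `std_toy` and `misc_toy` and all their values are made up for illustration; they are not the Göztepe data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for standard_df and misc_df (hypothetical values).
std_toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "season": ["2021-2022", "2024-2025", "2024-2025"],
    "period": ["pre", "post", "post"],
    "Age": [24, 29, 31],
    "Minutes": [900, 1800, 450],
})
misc_toy = pd.DataFrame({
    "Player": ["A", "B", "D"],
    "season": ["2021-2022", "2024-2025", "2024-2025"],
    "period": ["pre", "post", "post"],
    "Age": [24, 29, 22],
    "Fouls": [10, 25, 3],
})

# An outer merge with indicator=True shows which rows an inner merge would drop.
merged = std_toy.merge(
    misc_toy,
    on=["Player", "season", "period"],
    how="outer",
    suffixes=("_std", "_misc"),
    indicator=True,
)
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["Player", "season", "_merge"]])

# Keep the matched rows only, as the inner merge does.
inner = merged[merged["_merge"] == "both"].drop(columns="_merge")
num = inner.select_dtypes(include=[np.number])

# Find *_misc columns that are exact copies of their *_std counterpart.
dup_cols = []
for col in num.columns:
    if col.endswith("_misc"):
        std_name = col[: -len("_misc")] + "_std"
        if std_name in num.columns and np.array_equal(
            num[col].to_numpy(), num[std_name].to_numpy()
        ):
            dup_cols.append(col)

features = num.drop(columns=dup_cols)
print("duplicate columns:", dup_cols)
```

Dropping exact duplicates before modelling is one option, since a duplicated column adds no information and can split feature importance between its two copies.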

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

logreg = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000))
])

logreg.fit(X_train, y_train)
pred_lr = logreg.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred_lr))
print("F1:", f1_score(y_test, pred_lr, pos_label="post"))
print(classification_report(y_test, pred_lr))

Accuracy: 0.8235294117647058
F1: 0.8571428571428571
              precision    recall  f1-score   support

        post       0.90      0.82      0.86        11
         pre       0.71      0.83      0.77         6

    accuracy                           0.82        17
   macro avg       0.81      0.83      0.81        17
weighted avg       0.83      0.82      0.83        17
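
With only 17 test rows, a single misclassified player shifts accuracy by about six percentage points, so a single train/test split gives a noisy estimate. Stratified k-fold cross-validation is one way to get a steadier figure. The sketch below uses a synthetic dataset from `make_classification` with the same shape as `X` (66 rows, 12 features); the data and the names `X_demo` and `y_demo` are assumptions for illustration, not the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 66 rows, 12 features, roughly 35/65 class balance.
X_demo, y_demo = make_classification(
    n_samples=66,
    n_features=12,
    n_informative=4,
    weights=[0.35, 0.65],
    random_state=42,
)

# Same pipeline structure as the logistic regression cell above.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000)),
])

# Stratified folds preserve the class ratio in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"fold accuracies: {scores.round(3)}")
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Because the scaler sits inside the pipeline, it is refitted on each training fold, which avoids leaking test-fold statistics into the model.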



In [14]:
rf = RandomForestClassifier(
    n_estimators=400,
    random_state=42,
    class_weight="balanced"
)

rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

print("RF Accuracy:", accuracy_score(y_test, pred_rf))
print("RF F1:", f1_score(y_test, pred_rf, pos_label="post"))

importances = pd.DataFrame({
    "feature": X.columns,
    "importance": rf.feature_importances_
}).sort_values("importance", ascending=False)

importances.head(15)

RF Accuracy: 0.8823529411764706
RF F1: 0.9166666666666666


          feature  importance
11    Tackles Won    0.394102
8     Fouls Drawn    0.085262
5        Age_misc    0.078735
0         Age_std    0.074885
10  Interceptions    0.073499
6             90s    0.057603
1         Minutes    0.054608
4      G+A per 90    0.048879
7           Fouls    0.047984
9         Crosses    0.042379
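
The values in `feature_importances_` are impurity-based and computed on the training data, so they can favour features with many distinct values and can be split across correlated columns. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data in which only the hypothetical column `signal` determines the label; none of the names or values come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on "signal" only (illustrative assumption).
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y_demo = np.where(X_demo["signal"] > 0, "post", "pre")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

rf_demo = RandomForestClassifier(n_estimators=200, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the drop in accuracy.
result = permutation_importance(
    rf_demo, X_te, y_te, n_repeats=20, random_state=42
)
perm = pd.Series(result.importances_mean, index=X_demo.columns)
perm = perm.sort_values(ascending=False)
print(perm)
```

Applied to the notebook's model, the equivalent call would pass `rf`, `X_test`, and `y_test`. With a test set of 17 rows the resulting estimates would themselves be noisy.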


## Interpretation

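The Random Forest model ranks Tackles Won far ahead of every other feature
(importance ≈ 0.39, against ≈ 0.09 for Fouls Drawn in second place), and
Interceptions also appears in the top five. Defensive-related metrics are
therefore among the most important features distinguishing the post–Sport
Republic period. Age appears twice (Age_std and Age_misc), presumably because
both source tables carry it, so its importance is split across two columns.
These importances measure how much a feature helps separate the two periods,
not the direction in which it changed.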
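
Because importances carry no sign, a per-period comparison of means is one simple way to see in which direction each metric moved. The sketch below uses a made-up frame `toy`; the numbers are illustrative and are not Göztepe results.

```python
import pandas as pd

# Hypothetical per-player values, three players per period.
toy = pd.DataFrame({
    "period": ["pre", "pre", "pre", "post", "post", "post"],
    "Tackles Won": [10, 12, 14, 20, 22, 24],
    "Interceptions": [8, 9, 10, 8, 9, 10],
})

# Mean of each metric within each period.
means = toy.groupby("period").mean(numeric_only=True)

# Positive values mean the metric is higher in the post period.
diff = means.loc["post"] - means.loc["pre"]

print(means)
print(diff)
```

On the notebook's data the same pattern would be `ml_df.groupby("period").mean(numeric_only=True)`. Raw totals depend on playing time, so per-90 versions of the metrics may be the fairer comparison.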

This aligns with earlier EDA findings. The main attacking output metric,
G+A per 90, ranks near the bottom of the table (≈ 0.05), which suggests that
the sporting recovery observed after 2022 is associated more strongly with
defensive stability than with offensive output.

Such defensive improvements are consistent with improved league standings,
as reduced goals conceded typically translate into more points over a season.