# Football Player Analysis - Machine Learning

This notebook loads the dataset, processes features, and trains two models:
1. **Linear Regression**: To predict the current `overall_rating`.
2. **Logistic Regression**: To classify the player's future development (`future_class`).

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, classification_report

# Set plot style (set_theme is seaborn's preferred name for the older sns.set alias)
sns.set_theme(style="whitegrid")

## 1. Data Loading

We build the dataset path relative to the notebook's working directory, which is assumed to be `src/`.

In [None]:
# Determine the project root directory (assuming notebook is in src/)
PROJECT_ROOT = os.path.abspath("..")
DATA_PATH = os.path.join(PROJECT_ROOT, "data", "fifa_players.csv")

print(f"Loading data from: {DATA_PATH}")

try:
    df = pd.read_csv(DATA_PATH)
    print("Data loaded successfully.")
except FileNotFoundError:
    # Report the path, then stop: every later cell depends on `df`
    print(f"Error: File not found at {DATA_PATH}")
    raise

### Data Preview

In [None]:
print("=== Data Preview ===")
display(df.head())

print("\n=== Column Info ===")
df.info()

## 2. Feature Selection and Cleaning

We select specific physical and technical attributes to use as features for our models.

In [None]:
# Define targets and features
target_overall = "overall_rating"
target_potential = "potential"
col_age = "age"

feature_cols = [
    "age",
    "height_cm",
    "weight_kgs",
    "finishing",
    "dribbling",
    "short_passing",
    "acceleration",
    "sprint_speed",
    "stamina",
    "strength",
]

# Check that every required column exists (de-duplicated, since "age" is already a feature)
required_cols = list(dict.fromkeys(feature_cols + [target_overall, target_potential, col_age]))
missing_cols = [c for c in required_cols if c not in df.columns]
if missing_cols:
    raise ValueError(f"Missing columns in CSV: {missing_cols}")

# Drop rows with missing values in the required columns
df_clean = df.dropna(subset=required_cols).copy()

print(f"Number of players after cleaning: {len(df_clean)}")

## 3. Regression Model: Predicting `overall_rating`

We use Linear Regression to estimate a player's current overall rating based on their attributes.

In [None]:
X = df_clean[feature_cols]
y = df_clean[target_overall]

# Split data (80% train, 20% test), matching the split used for the classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)

# Predict
y_pred = reg_model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("=== Regression Model Results (overall_rating) ===")
print(f"MSE : {mse:.2f}")
print(f"R²  : {r2:.3f}")

print("\nCoefficients (change in predicted rating per one-unit change in each feature):")
print(f"Intercept : {reg_model.intercept_:.3f}")
for feat, coef in zip(feature_cols, reg_model.coef_):
    print(f"{feat:15s} -> {coef:.3f}")
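
The raw coefficients above are expressed per unit of each feature, so a coefficient for `height_cm` is not directly comparable with one for `finishing`. The sketch below shows one common way to make the magnitudes comparable: standardize the features before fitting. It runs on synthetic data rather than the FIFA dataset; the `demo` frame, its two columns, and the numbers used to generate it are assumptions made purely for illustration, and the RMSE is computed in-sample.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "height_cm": rng.normal(180, 7, 500),
    "finishing": rng.normal(55, 18, 500),
})
demo_y = 40 + 0.05 * demo["height_cm"] + 0.4 * demo["finishing"] + rng.normal(0, 2, 500)

# Scale each feature to zero mean and unit variance, then fit the regression
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(demo, demo_y)

# Each coefficient is now the change in the target per one standard deviation of the feature
std_coefs = pd.Series(pipe.named_steps["linearregression"].coef_, index=demo.columns)
order = std_coefs.abs().sort_values(ascending=False).index
print(std_coefs[order])

# RMSE is in the same units as the target, which is easier to read than MSE
rmse = float(np.sqrt(np.mean((demo_y - pipe.predict(demo)) ** 2)))
print(f"In-sample RMSE: {rmse:.2f}")
```

The same pipeline could be fitted on `X_train` and `y_train` in place of the synthetic frame; the predictions would be unchanged, only the scale of the coefficients differs.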

### Visualization: True vs Predicted Ratings

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.3, color='blue')
plt.xlabel("True Ratings (overall_rating)")
plt.ylabel("Predicted Ratings")
plt.title("Regression: True vs Predicted (overall_rating)")

# Plot perfect prediction line
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], "r--", lw=2)

plt.show()

## 4. Classification Model: Predicting Future Development

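We create a new target variable `future_class` from the gap between `potential` and `overall_rating` (`gap = potential - overall_rating`) and the player's `age`.

**Rules** (checked in order; the first match wins):
- `high_growth`: gap >= 10 and age <= 23
- `likely_improve`: gap >= 4 (this includes players with gap >= 10 who are older than 23)
- `stable`: -2 <= gap < 4
- `decline`: otherwise (gap < -2)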

In [None]:
def build_future_label(row):
    """
    Creates a 'future_class' label based on potential, overall_rating, and age.
    """
    overall = row["overall_rating"]
    potential = row["potential"]
    age = row["age"]

    gap = potential - overall

    if gap >= 10 and age <= 23:
        return "high_growth"
    elif gap >= 4:
        return "likely_improve"
    elif gap >= -2:
        return "stable"
    else:
        return "decline"

# Apply the function
df_clean["future_class"] = df_clean.apply(build_future_label, axis=1)

print("Future Class Distribution:")
print(df_clean["future_class"].value_counts())
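
Row-wise `apply` is easy to read but can be slow on large tables. As a possible alternative, the same rules can be written in vectorized form with `numpy.select`, which evaluates the conditions in order and takes the first one that matches. The helper name `build_future_labels` and the `toy` frame are hypothetical and exist only for this sketch.

```python
import numpy as np
import pandas as pd

def build_future_labels(frame):
    """Vectorized version of the labelling rules (first matching condition wins)."""
    gap = frame["potential"] - frame["overall_rating"]
    conditions = [
        (gap >= 10) & (frame["age"] <= 23),  # high_growth
        gap >= 4,                            # likely_improve
        gap >= -2,                           # stable
    ]
    choices = ["high_growth", "likely_improve", "stable"]
    labels = np.select(conditions, choices, default="decline")
    return pd.Series(labels, index=frame.index)

# Toy rows chosen to hit every branch, including the gap == -2 boundary
toy = pd.DataFrame({
    "overall_rating": [65, 70, 80, 78, 85],
    "potential":      [80, 82, 81, 76, 80],
    "age":            [19, 27, 29, 33, 35],
})
toy_labels = build_future_labels(toy)
print(toy_labels.tolist())
```

The second toy player has a gap of 12 but is 27 years old, so the label falls through to `likely_improve` rather than `high_growth`.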

In [None]:
X_cls = df_clean[feature_cols]
y_cls = df_clean["future_class"]

# Stratified split so train and test keep the same class proportions
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42, stratify=y_cls
)

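# Train Logistic Regression
# With the default lbfgs solver a multiclass target is fitted as a multinomial
# (softmax) model, so the `multi_class` argument, deprecated in recent
# scikit-learn releases, is omitted here.
clf = LogisticRegression(max_iter=1000)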
clf.fit(Xc_train, yc_train)

# Predict
yc_pred = clf.predict(Xc_test)

print("=== Classification Model Results (future_class) ===")
print(classification_report(yc_test, yc_pred))
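
The features mix centimetres, kilograms, and 0 to 100 skill ratings, and logistic regression tends to converge more reliably when inputs share a scale. The sketch below wraps a `StandardScaler` and the classifier in one pipeline and prints a labelled confusion matrix, which shows which classes get confused with each other. It uses synthetic data: the `demo_X` columns and the age-based `young` / `prime` / `veteran` labels are assumptions for illustration only, not the notebook's `future_class`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the real dataset)
rng = np.random.default_rng(42)
n = 600
demo_X = pd.DataFrame({
    "age": rng.integers(17, 38, n),
    "height_cm": rng.normal(181, 7, n),
    "stamina": rng.normal(65, 12, n),
})
# Illustrative three-class label derived from age alone
demo_y = pd.cut(
    demo_X["age"], bins=[0, 22, 29, 100], labels=["young", "prime", "veteran"]
).astype(str)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)

# The scaler is fitted on the training split only, then reused at predict time
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(Xd_train, yd_train)

# Rows are true classes, columns are predicted classes
labels = scaled_clf.classes_
cm = pd.DataFrame(
    confusion_matrix(yd_test, scaled_clf.predict(Xd_test), labels=labels),
    index=[f"true_{c}" for c in labels],
    columns=[f"pred_{c}" for c in labels],
)
print(cm)
```

Keeping the scaler inside the pipeline means a later `predict` or `predict_proba` call on new rows applies the same scaling automatically.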

## 5. Example Prediction

We create a synthetic player to test our models.

In [None]:
example_player = {
    "age": 20,
    "height_cm": 175.0,
    "weight_kgs": 70.0,
    "finishing": 78,
    "dribbling": 85,
    "short_passing": 82,
    "acceleration": 88,
    "sprint_speed": 90,
    "stamina": 80,
    "strength": 65,
}

example_df = pd.DataFrame([example_player])

# Predict using both models
overall_pred_example = reg_model.predict(example_df[feature_cols])[0]
future_class_example = clf.predict(example_df[feature_cols])[0]
future_proba_example = clf.predict_proba(example_df[feature_cols])[0]
classes = clf.classes_

print("=== Example Player Analysis ===")
print("Player Stats:")
for k, v in example_player.items():
    print(f"  {k:15s} = {v}")

print(f"\nPredicted Overall Rating: {overall_pred_example:.1f}")
print(f"Predicted Future Class:   {future_class_example}")
print("Class Probabilities:")
for cls, proba in zip(classes, future_proba_example):
    print(f"  {cls:15s} -> {proba:.3f}")