---
# _Logistic Regression to Predict PL Home Team Outcome_
---

# Data Preparation
## Data Loading and Exploration
---
- First, we load the dataset and convert the 'Date' column to datetime format. This ensures we can easily filter matches by date for chronological splitting. We use pandas' to_datetime for the conversion.
---

In [4]:
import pandas as pd

# Load the dataset
df = pd.read_csv('PL_integrated_dataset_10years.csv')

# Preview relevant columns for first five matches
print(df[['HomeTeam', 'AwayTeam', 'HomePoints', 'Home_xG', 'Away_xG', 'ELO_Difference', 'xG_Difference']].head())


datetime64[ns]


--- 
- The "HomePoints" column indicate whether the "HomeTeam" received a positive match outcome or not (e.g.: HomePoints = 3 -> The HomeTeam got a win on that match and vice versa)
- A positive ELO_Difference means the home team is stronger per Elo ratings
- A positive xG_Difference means the home team created more scoring chances in the match.
---

## Cleaning and Feature Engineering

Next, we prepare the data for modeling. We'll check for missing values in the key features and create the binary target. In this dataset, the Elo and xG features are already computed, so minimal cleaning is needed. We confirm there are no missing values in Home_xG, Away_xG, Home_ELO, or Away_ELO. Then we define our target/response variable HomeWin and select the features of interest:

---
- ELO_Difference (home team Elo – away team Elo)
- Home_xG (home expected goals)
- Away_xG (away expected goals)
---

In [12]:
# Create binary target: 1 if home team wins (HomePoints=3), else 0
df['HomeWin'] = (df['HomePoints'] == 3).astype(int)
df['HomenotWin'] = (df['HomePoints'] == 0).astype(int)
df['HomeDraw'] = (df['HomePoints'] == 1).astype(int)

# Check class distribution for HomeWin
# Create a HomeWin class
print(df['HomeWin'].value_counts())
print(df['HomenotWin'].value_counts())
print(df['HomeDraw'].value_counts())

print("Proportion of HomeWin=1 (home wins):", df['HomeWin'].mean())
print("Proportion of HomeWin=0 (home not wins):", df['HomenotWin'].mean())
print("Proportion of HomeWin=2 (home draw):", df['HomeDraw'].mean())


HomeWin
0    2098
1    1676
Name: count, dtype: int64
HomenotWin
0    2557
1    1217
Name: count, dtype: int64
HomeDraw
0    2893
1     881
Name: count, dtype: int64
Proportion of HomeWin=1 (home wins): 0.4440911499735029
Proportion of HomeWin=0 (home not wins): 0.3224695283518813
Proportion of HomeWin=2 (home draw): 0.23343932167461579


---
- There are 3,774 matches in total, with about 44.4% home wins (1,676 matches) and 55.6% non-wins (draws or away wins). The classes are somewhat imbalanced but not severely skewed. We will keep this in mind when evaluating performance (hence using F1-score in addition to accuracy).
---

---
Now we select our features and split the data. We use a
- 75% training
- 15% validation
- 10% test split

The validation set will be used to tune the hyperparameter (regularization strength) of the logistic regression, and the test set will evaluate final performance. We stratify the split by HomeWin to maintain the same class ratio in each subset

---

In [21]:
from sklearn.model_selection import train_test_split

# Define feature set X and target y
features = ['ELO_Difference', 'Home_xG', 'Away_xG']
X = df[features]
y = df['HomeWin']

# Split off 15% of remaining for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.1667, stratify=y_temp, random_state=42)

# Split off the final 10% for testing
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

print("Total samples:", len(X))
print("Train size:", len(X_train))
print("Validation size:", len(X_val))
print("Test size:", len(X_test))
print("Train HomeWin fraction:", round(y_train.mean(), 3))
print("Validation HomeWin fraction:", round(y_val.mean(), 3))
print("Test HomeWin fraction:", round(y_test.mean(), 3))

Total samples: 3774
Train size: 2829
Validation size: 567
Test size: 378
Train HomeWin fraction: 0.444
Validation HomeWin fraction: 0.444
Test HomeWin fraction: 0.444


# Model Development

## Logistic Regression Model Training


--- 
- We will use logistic regression (a linear classification model) to predict the probability of a home win. Logistic regression produces a weighted combination of features passed through a logistic function to output a probability between 0 and 1. It’s suitable for binary classification and provides interpretable coefficients for each feature.
- Before training, we apply feature scaling. Elo differences range in the hundreds and xG values around 0–5, so scaling ensures the features are on a comparable scale for regularization to be effective. We'll use standardization (subtract mean, divide by standard deviation) based on the training set.
- We then train logistic regression models on the training set for various regularization strengths C (inverse of regularization strength λ). A smaller C means heavier regularization (simpler model), and a larger C means less regularization (allowing more complex fit). We'll try a range of C values and evaluate on the validation set to choose the best. We use L2 regularization by default in scikit-learn’s LogisticRegression.
---

In [23]:
# Let's perform scaling and hyperparameter tuning
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Scale features based on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled  = scaler.transform(X_val)

# Try multiple values for regularization strength C
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
for C in C_values:
    model = LogisticRegression(C=C, max_iter=500, random_state=42)
    model.fit(X_train_scaled, y_train)
    y_val_pred = model.predict(X_val_scaled)
    print(f"C={C:<6} | Validation Accuracy = {accuracy_score(y_val, y_val_pred):.3f} | Validation F1 = {f1_score(y_val, y_val_pred):.3f}")

C=0.001  | Validation Accuracy = 0.744 | Validation F1 = 0.665
C=0.01   | Validation Accuracy = 0.757 | Validation F1 = 0.713
C=0.1    | Validation Accuracy = 0.753 | Validation F1 = 0.717
C=1      | Validation Accuracy = 0.753 | Validation F1 = 0.717
C=10     | Validation Accuracy = 0.753 | Validation F1 = 0.717
C=100    | Validation Accuracy = 0.753 | Validation F1 = 0.717


## Analysis

---
- Analysis: The model with very strong regularization (C=0.001) underfits, giving lower F1 (0.665). Performance improves as C increases. Around C = 0.1, the F1-score peaks at ~0.717. Beyond that, the accuracy and F1 on validation plateau (in fact, models for C = 0.1 up through 100 produce the same validation metrics to three decimals). This suggests that beyond a certain point, reducing regularization doesn’t yield further improvement — the model might be as complex as needed to fit the data. To avoid overfitting with unnecessarily large C, we select C = 0.1 as the best hyperparameter (it achieves the highest F1 with the simplest model). We will use C=0.1 for our final model.

Note: We focused on F1-score to select C because our classes are imbalanced (we want a model that performs well on predicting wins). The chosen model at C=0.1 also has high accuracy, indicating a good overall fit.

---

# Final Model Training

- Now we train the final logistic regression using the chosen regularization strength (C=0.1) on the full training set. We will then evaluate it on the held-out test set to assess how well it generalizes to unseen matches. We use the StandardScaler fitted earlier on the training data to transform the test features as well.


In [27]:
# Train final Logistic Regression with best C on the training data
best_C = 0.1
final_model = LogisticRegression(C=best_C, max_iter=500, random_state=42)
final_model.fit(X_train_scaled, y_train)

# Evaluate on the test set
X_test_scaled = scaler.transform(X_test)
y_pred_test = final_model.predict(X_test_scaled)

# Calculate performance metrics on test set
test_accuracy = accuracy_score(y_test, y_pred_test)
test_f1 = f1_score(y_test, y_pred_test)
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test F1-Score: {test_f1:.4f}")

# Display confusion matrix for detailed error analysis
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred_test)
print("Confusion Matrix:\n", cm)

Test Accuracy: 0.7751
Test F1-Score: 0.7401
Confusion Matrix:
 [[172  38]
 [ 47 121]]


---
- Test Performance: The model achieves about 77.5% accuracy on the test set, meaning it correctly predicts the match outcome (home win vs not win) in 3 out of 4 cases. The F1-score is 0.740, which is high and in line with validation, indicating good balance between precision and recall for the positive class. For comparison, predicting the majority class (no home win) every time would only give 55.6% accuracy and a much lower F1 (roughly 57% for wins), so our model provides a significant improvement.
---

| Actual \ Predicted | Not Win (0) | Win (1) |
| ------------------ | ----------: | ------: |
| **Not Win (0)**    |         172 |      38 |
| **Win (1)**        |          47 |     121 |


1) True Negatives (TN) = 172: The home team did not win and the model correctly predicted 0 (no win).

2) False Positives (FP) = 38: The home team did not win, but the model incorrectly predicted a win.

3) False Negatives (FN) = 47: The home team actually won, but the model missed it (predicted no win).

4) True Positives (TP) = 121: The home team won and the model correctly predicted a win.

---
From this, we can derive the precision and recall for "home win":

Precision = TP / (TP+FP) = 121 / (121+38) ≈ 0.761 – when the model predicts a home win, it’s correct about 76% of the time.

Recall = TP / (TP+FN) = 121 / (121+47) ≈ 0.720 – the model catches about 72% of all actual home wins.

The F1-score of 0.740 is the harmonic mean of these (indicating a good balance). The model makes some mistakes, typically predicting wins that don't happen (38 cases) or missing wins (47 cases), but overall performance is strong given the limited feature set.

---

# Feature Influence and Coefficient Interpretation
One advantage of logistic regression is interpretability. We can examine the learned coefficients to understand the influence of each feature on the probability of a home win. Let's retrieve the model coefficients for our three features:

In [30]:
feature_names = ['ELO_Difference', 'Home_xG', 'Away_xG']
coefficients = final_model.coef_[0]  # coefficients for the features
intercept = final_model.intercept_[0]
for name, coef in zip(feature_names, coefficients):
    print(f"{name}: {coef:.3f}")
print(f"Intercept: {intercept:.3f}")


ELO_Difference: 0.280
Home_xG: 0.855
Away_xG: -0.855
Intercept: -0.326


These coefficients are based on standardized features (mean 0, std 1). A positive coefficient means an increase in that feature raises the log-odds of a home win, while a negative coefficient decreases it. Key observations:

- Home_xG (0.855): This is the largest positive coefficient. It implies that higher expected goals for the home team strongly increase the chance of a win. In practical terms, if the home team generates more scoring opportunities (higher xG), they are far more likely to win the match. Among our features, home xG has the strongest influence.

- Away_xG (-0.855): This is almost exactly the negative of the home xG coefficient. A higher away team xG significantly decreases the home team’s win probability. This makes intuitive sense: if the away side creates many good chances, the home side is less likely to win. The equal magnitude to home_xG suggests the model effectively uses the difference in xG (home_xG minus away_xG) as a crucial predictor. In fact, the model learned to weight home and away xG equally (one positively, one negatively), meaning a 1-unit swing in xG difference (say, home xG minus away xG increases by 1) has a substantial impact on the odds of winning.

- ELO_Difference (0.280): This coefficient is positive, indicating that if the home team’s Elo rating is higher than the away team’s by one standard deviation (about 164 Elo points in our data), the log-odds of a home win increase by ~0.28. In simpler terms, stronger home teams (higher Elo) are more likely to win, as expected. However, the effect size is smaller than that of a comparable change in xG. For example, a one standard deviation increase in xG difference (≈1.4 xG) has about three times the impact on the log-odds (0.855) than a one std dev Elo advantage (0.280). This suggests that the immediate match performance (xG) is a more powerful predictor of the result than the pre-match team strength rating, which aligns with intuition — what happens on the pitch (chances created) largely decides the outcome, though team quality provides a helpful baseline.

- Intercept (-0.326): This corresponds to the baseline log-odds of a home win when all features are at their zero mean. A negative intercept indicates that in a scenario where the home team and away team are evenly matched (Elo difference 0) and both have average expected goals (home_xG = away_xG at average levels), the model predicts the home team to have less than 50% chance of winning. In fact, the intercept of -0.326 implies about a 42% predicted win probability in such an even scenario. This is reasonable because overall home win rate is ~44%, and if teams are equal in strength and chances, a win is not the majority outcome. The intercept also effectively captures the home-field advantage baseline. If there were no difference in Elo or xG, the slight tilt close to 50% (but not over) suggests that being the home team alone doesn't guarantee a win but gives some advantage that's reflected when combined with other factors (if home xG equals away xG, the home might still draw or not always win).

# Model Evaluation and Conclusion

__Performance__: The logistic regression model performed well in predicting home wins using just Elo ratings and expected goals features. With ~77.5% accuracy and an F1-score of 0.74 on the test set, it substantially outperforms a naive baseline. This indicates that the features chosen carry significant predictive signal: Elo difference captures team strength disparities, and xG captures the match dynamics. Especially, the xG features (which come from the match events) were very strong predictors — which is expected since they essentially quantify how many good chances each team had to score in that match.

__Insights__: By examining the coefficients, we gained insight that in-match performance (xG) was more influential than pre-match ratings (Elo) for determining the outcome. The model essentially emphasizes that if a home team creates considerably more chances than their opponent, they are very likely to win, regardless of the Elo ratings. However, having a higher Elo (a stronger team) does tilt the odds in your favor somewhat, all else being equal.

__Use of Metrics__: We used accuracy and F1-score to evaluate this binary classifier. Accuracy was appropriate to gauge overall correctness, while F1-score gave a better sense of how well the model predicts the positive class (home wins) considering precision and recall. This was important given the slight class imbalance (44% wins vs 56% non-wins). The chosen model balanced these well (precision ~76%, recall ~72%), showing it does not overly favor one class at the expense of the other. If the class imbalance were more severe, we might have placed even more emphasis on F1 or even looked at precision/recall separately or adjusted the decision threshold to suit specific objectives (e.g., if predicting wins had a different cost than missing wins).

__Limitations__: It’s worth noting that the expected goals (xG) features we used are post-match statistics (they come from analyzing the chances in the match). This means our model, as constructed, is not truly predictive before a match is played – rather, it’s descriptive of how match performance leads to wins. In a real pre-match prediction scenario, we wouldn't have xG available. We included xG here to follow the assignment focus, but in practice, xG is more useful for live win probability models or post-game analysis. For a genuine pre-game prediction model, we would rely on pre-match features (like team Elo ratings, form, player stats, etc.) and possibly historical average xG values. The Elo rating difference is one such pre-game feature and does provide some predictive power, but by itself it would yield a less accurate model (we can infer that using Elo alone would not reach 77% accuracy).

__Model Choice__: Logistic regression, being a linear model, handled this task well. The decision boundary is linear in terms of these features (effectively a plane in the 3D feature space of Elo diff, home xG, away xG). This was sufficient given the strong linear separability induced by xG differences. If we had more complex nonlinear interactions or wanted to include many other features, we could consider more complex models. However, the interpretability of logistic regression is a big plus – we can clearly see how each factor contributes to winning probability.

__Potential Improvements__: To further improve or generalize the model, we could:

Include more features (e.g., home advantage indicator explicitly, team form, head-to-head history, etc.) for a pre-match model.

Address any remaining class imbalance by adjusting the decision threshold or using class weights if we were more concerned about a particular error type (for example, if predicting wins correctly is more important than false alarms, we might prioritize recall).

Evaluate the model with cross-validation on training data to ensure stability of results, given we had a relatively small dataset (3774 matches is moderate, but more data or seasons could help).

In conclusion, our logistic regression model provides a clear and reasonably accurate prediction of home wins in Premier League matches using Elo rating differences and expected goals for both teams. It highlights that while team quality matters, the actual performance on match day (as captured by chances created) is the dominant factor in determining match outcomes. The model met the expectations of the assignment, demonstrating the full workflow: from data preparation and feature engineering to model training, hyperparameter tuning, and evaluation with appropriate metrics. We also gained interpretable insights from the model coefficients, reinforcing our understanding of what drives a soccer team’s chances of victory at home.