With this code, we aim to find a well-performing model that predicts the number of goals a player can score during a match.

## Importing the data

In [120]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [121]:
# Load the dataset
df = pd.read_csv(r"C:\Users\HP\Downloads\DATASET_Combiner.csv")

Next, we keep only the numeric columns (int, float) so that we can compute the correlations between them.

In [122]:
df_numeric = df.select_dtypes(include=['number'])
df_numeric

Unnamed: 0,Age,Matches_Played,Matches_Started,Minutes_Played,Full_Match_Equivalents,Goals,Assists,Goals_Assists,Non_Penalty_Goals,Penalty_Goals,...,Goals/90,Assists/90,Goals_Assists/90,Non_Penalty_Goals/90,Goals_Assists_NoPK/90,xG/90,xAG/90,xG_plus_xAG/90,Non_Penalty_xG/90,npxG_plus_xAG/90
0,27.0,1,1,59,0.7,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,20.0,1,0,1,0.0,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,35.0,17,10,823,9.1,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.021871,0.142163,0.164034,0.021871,0.164034
3,25.0,4,3,279,3.1,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,35.0,6,0,109,1.2,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.165138,0.082569,0.247706,0.165138,0.247706
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4849,34.0,34,34,2994,33.3,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4850,24.0,7,5,444,4.9,1,0,1,1,0,...,0.202703,0.000000,0.202703,0.202703,0.202703,0.060811,0.040541,0.101351,0.060811,0.101351
4851,21.0,3,0,24,0.3,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.375000,0.000000,0.375000,0.375000,0.375000
4852,30.0,10,10,911,10.1,0,1,1,0,0,...,0.000000,0.098793,0.098793,0.000000,0.098793,0.019759,0.088913,0.108672,0.019759,0.108672
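
To make the effect of `select_dtypes(include=['number'])` concrete, here is a minimal sketch on a toy DataFrame. The column values are made up for the illustration and do not come from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): two text columns, two numeric columns
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Position": ["FW", "MF", "DF"],
    "Age": [27.0, 20.0, 35.0],
    "Goals": [3, 0, 1],
})

# Keep only int/float columns; text columns are dropped
toy_numeric = toy.select_dtypes(include=["number"])
print(list(toy_numeric.columns))
```

Text columns such as Position disappear at this step, which is why the notebook encodes Position separately.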


We also encode the Position column so that it can be included in the correlation analysis.

In [123]:
from sklearn.preprocessing import LabelEncoder

# Create an encoder
le = LabelEncoder()

# Encode the Position column (from the original DataFrame)
position_encoded = le.fit_transform(df['Position'])
df_numeric['Position_Code'] = position_encoded
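
`LabelEncoder` assigns integer codes following the sorted order of the labels, so the codes carry no footballing meaning, and a Pearson correlation involving `Position_Code` should be read with that in mind. The sketch below uses hypothetical position labels (the actual values of the Position column are not shown in this notebook) to display the mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical position labels, for illustration only
positions = ["FW", "MF", "DF", "GK", "MF", "FW"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(positions)

# classes_ holds the sorted unique labels; each code is an index into it
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(codes.tolist())
```

On the real data, `le.classes_` gives the same mapping for the fitted encoder `le`.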

In [124]:
df_numeric.corr()

Unnamed: 0,Age,Matches_Played,Matches_Started,Minutes_Played,Full_Match_Equivalents,Goals,Assists,Goals_Assists,Non_Penalty_Goals,Penalty_Goals,...,Assists/90,Goals_Assists/90,Non_Penalty_Goals/90,Goals_Assists_NoPK/90,xG/90,xAG/90,xG_plus_xAG/90,Non_Penalty_xG/90,npxG_plus_xAG/90,Position_Code
Age,1.0,0.159544,0.182933,0.183834,0.183848,0.046534,0.06502,0.059678,0.03289,0.084155,...,-0.00892,-0.014131,-0.016616,-0.018404,-0.035893,-0.027017,-0.042851,-0.044621,-0.050547,-0.03232
Matches_Played,0.159544,1.0,0.91568,0.933952,0.933965,0.500877,0.576127,0.589776,0.513445,0.223511,...,0.077371,0.041872,0.01392,0.037506,0.018171,0.054473,0.039364,0.00948,0.032178,0.000587
Matches_Started,0.182933,0.91568,1.0,0.994893,0.994891,0.468991,0.552281,0.557683,0.476346,0.228171,...,0.04706,0.013457,-0.005663,0.009559,-0.03057,0.006941,-0.023522,-0.038815,-0.030577,-0.067159
Minutes_Played,0.183834,0.933952,0.994893,1.0,0.999996,0.465446,0.548528,0.553646,0.472997,0.225364,...,0.047728,0.014059,-0.00525,0.010158,-0.028479,0.008218,-0.021154,-0.036693,-0.028177,-0.067562
Full_Match_Equivalents,0.183848,0.933965,0.994891,0.999996,1.0,0.465449,0.548544,0.553655,0.472997,0.225383,...,0.047769,0.014054,-0.005271,0.010151,-0.028483,0.008273,-0.021134,-0.036697,-0.028157,-0.067521
Goals,0.046534,0.500877,0.468991,0.465446,0.465449,1.0,0.589601,0.939259,0.983237,0.625368,...,0.126236,0.197988,0.149646,0.180258,0.235007,0.13601,0.262854,0.200876,0.233882,0.066217
Assists,0.06502,0.576127,0.552281,0.548528,0.548544,0.589601,1.0,0.830997,0.589438,0.327115,...,0.335758,0.158353,0.047239,0.15041,0.093388,0.245434,0.187356,0.077072,0.174752,0.086174
Goals_Assists,0.059678,0.589776,0.557683,0.553646,0.553655,0.939259,0.830997,1.0,0.927645,0.569704,...,0.229613,0.203647,0.123138,0.188061,0.201538,0.197964,0.260645,0.171098,0.235336,0.082223
Non_Penalty_Goals,0.03289,0.513445,0.476346,0.472997,0.472997,0.983237,0.589438,0.927645,1.0,0.472606,...,0.125972,0.199546,0.158189,0.188187,0.229857,0.137649,0.259094,0.206968,0.239882,0.067269
Penalty_Goals,0.084155,0.223511,0.228171,0.225364,0.225383,0.625368,0.327115,0.569704,0.472606,1.0,...,0.071018,0.102943,0.046288,0.06586,0.152146,0.068285,0.161611,0.085138,0.103808,0.032158


The following columns are strongly correlated (> 0.5) with the Goals column:

Matches_Played (0.500877)

Assists (0.589601)

Goals_Assists (0.939259)

Non_Penalty_Goals (0.983237)

Penalty_Goals (0.625368)

Penalties_Attempted (0.654164)

Expected_Goals (xG) (0.942491)

Non_Penalty_xG (0.922844)

Expected_Assisted_Goals (xAG) (0.638174)

Non_Penalty_xG_plus_xAG (npxG_plus_xAG) (0.882262)

Progressive_Receptions (0.692918)
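
The list above can also be obtained programmatically instead of being read off the correlation matrix. The following is a sketch on synthetic data: the `demo` frame and its values are stand-ins for `df_numeric`, not real statistics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for df_numeric (random values, not real data)
xg = rng.gamma(2.0, 2.0, n)
demo = pd.DataFrame({
    "Expected_Goals": xg,
    "Goals": np.round(xg + rng.normal(0, 0.5, n)).clip(min=0),
    "Age": rng.integers(18, 38, n).astype(float),
})

# Correlation of every column with the target, target itself excluded
corr_with_goals = demo.corr()["Goals"].drop("Goals")

# Keep only the columns above the 0.5 threshold, strongest first
strong = corr_with_goals[corr_with_goals > 0.5].sort_values(ascending=False)
print(strong)
```

Applied to the notebook's data, the same two lines would be `df_numeric.corr()["Goals"].drop("Goals")` followed by the threshold filter.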

# Splitting the data

Having identified the columns that are strongly correlated with the target column, we use them as features. We then split the dataset into training and test sets.

In [125]:
# Select the features and the target
features = [
    'Matches_Played',
    'Assists',
    'Goals_Assists',
    'Non_Penalty_Goals',
    'Penalty_Goals',
    'Penalties_Attempted',
    'Non_Penalty_xG',
    'Non_Penalty_xG_plus_xAG',
    'Progressive_Receptions'
]
target = "Goals"

X = df[features]
y = df[target]

# Preprocessing pipeline
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), features)
])

# Train/test split (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Choosing the prediction model

## 1.1. Random Forest Regressor

The first model we evaluate is the Random Forest regressor.

In [126]:
# Full pipeline with Random Forest
model1 = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=100, random_state=42))
])

In [127]:
# Train the model
model1.fit(X_train, y_train)

# Predictions
y_pred = model1.predict(X_test)

# Evaluation (RMSE computed as the square root of the MSE)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("Random Forest Regressor")
print("---------------------")
print(f"RMSE: {rmse:.3f}")
print(f"R2 Score: {r2:.3f}")

Random Forest Regressor
---------------------
RMSE: 0.406
R2 Score: 0.985


## 1.2. XGBRegressor

The second model we evaluate is XGBoost.

In [128]:
# Pipeline with preprocessing + XGBoost
from xgboost import XGBRegressor

model2 = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor(objective='reg:squarederror', random_state=42))
])

In [129]:
# Train the model
model2.fit(X_train, y_train)

# Predictions
y_pred = model2.predict(X_test)

# Evaluation (RMSE computed as the square root of the MSE)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("XGBRegressor")
print("---------------------")
print(f"RMSE: {rmse:.3f}")
print(f"R2 Score: {r2:.3f}")

XGBRegressor
---------------------
RMSE: 0.357
R2 Score: 0.988


## 1.3. Gradient Boosting Regressor

The third model we evaluate is the Gradient Boosting Regressor.

In [130]:
from sklearn.ensemble import GradientBoostingRegressor
model_gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)

In [131]:
# Train the model
model_gb.fit(X_train, y_train)

# Predictions
y_pred = model_gb.predict(X_test)

# Evaluation (RMSE computed as the square root of the MSE)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("Gradient Boosting Regressor")
print("---------------------")
print(f"RMSE: {rmse:.3f}")
print(f"R2 Score: {r2:.3f}")

Gradient Boosting Regressor
---------------------
RMSE: 0.310
R2 Score: 0.991


## 1.4. Linear Regression

The fourth model we evaluate is linear regression.

In [138]:
# Uses all features in `features`.
from sklearn.linear_model import LinearRegression

# Scale the data (the scaler is fitted on the training set only).
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create the model and fit it to the scaled training data.
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions.
y_pred = model.predict(X_test_scaled)

print("Linear Regression")
print("---------------------")

# Calculate and print the errors.
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.4f}")

mse = mean_squared_error(y_test, y_pred)
print(f"Mean squared error: {mse:.4f}")

rmse = mse ** 0.5
print(f"Root mean squared error: {rmse:.4f}")

Linear Regression
---------------------
R-squared: 1.0000
Mean squared error: 0.0000
Root mean squared error: 0.0000
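
An R² of exactly 1.0000 is worth a sanity check. One plausible explanation, stated here as an assumption about the dataset rather than something the notebook verifies, is that Goals equals Non_Penalty_Goals + Penalty_Goals, both of which are in `features`; a linear model can then reproduce the target exactly. The sketch below shows such a check on synthetic data (the `demo` frame is hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic stand-in where the target is, by construction, a sum of two features
demo = pd.DataFrame({
    "Non_Penalty_Goals": rng.poisson(2.0, 300),
    "Penalty_Goals": rng.poisson(0.3, 300),
})
demo["Goals"] = demo["Non_Penalty_Goals"] + demo["Penalty_Goals"]

# Check 1: is the target an exact sum of the two features on every row?
is_exact_sum = bool(
    (demo["Goals"] == demo["Non_Penalty_Goals"] + demo["Penalty_Goals"]).all()
)

# Check 2: a linear fit on those two features recovers coefficients close to 1
X_demo = demo[["Non_Penalty_Goals", "Penalty_Goals"]]
lin = LinearRegression().fit(X_demo, demo["Goals"])
r2_demo = lin.score(X_demo, demo["Goals"])
print(is_exact_sum, lin.coef_.round(3), round(r2_demo, 4))
```

Running the first check on `df` itself would show whether this explanation applies to the real data.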


# Evaluating the models on an external dataset

The goal of this step is to check the results obtained during training and testing, and to verify whether our models can predict goal counts (Goals) accurately on data they have not seen.

In [133]:
df_evaluation= pd.read_csv(r"C:\Users\HP\Downloads\Projet Analyse du web\Resultat_final\final_dataset.csv")

In [134]:
X_eval = df_evaluation[features]
y_eval= df_evaluation[target]

In [135]:

# Models and their names
models = {
    "Random Forest": model1,
    "XGBoost": model2,
    "Gradient Boosting": model_gb,
    "Linear Regression": model
}

# Dictionary to store the results
results = {}

# Evaluate each model
# Note: `model` (Linear Regression) was fitted on scaled features,
# whereas X_eval is passed here unscaled.
for name, mdl in models.items():
    y_pred = mdl.predict(X_eval)  # or X_new when testing on another dataset
    rmse = np.sqrt(mean_squared_error(y_eval, y_pred))
    r2 = r2_score(y_eval, y_pred)
    results[name] = {"RMSE": rmse, "R2 Score": r2}

# Display the results
print(f"{'Model':<20} {'RMSE':<10} {'R2 Score':<10}")
print("-" * 40)
for name, metrics in results.items():
    print(f"{name:<20} {metrics['RMSE']:<10.3f} {metrics['R2 Score']:<10.3f}")


Model                RMSE       R2 Score  
----------------------------------------
Random Forest        0.149      0.996     
XGBoost              0.113      0.998     
Gradient Boosting    0.113      0.998     
Linear Regression    5.971      -4.926    
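
For side-by-side comparison, the results dictionary can be turned into a sorted table. The sketch below copies the four rows printed above into a literal dictionary so that it runs on its own; in the notebook, the `results` variable from the loop would be used directly.

```python
import pandas as pd

# Values copied from the printed output above
results_demo = {
    "Random Forest": {"RMSE": 0.149, "R2 Score": 0.996},
    "XGBoost": {"RMSE": 0.113, "R2 Score": 0.998},
    "Gradient Boosting": {"RMSE": 0.113, "R2 Score": 0.998},
    "Linear Regression": {"RMSE": 5.971, "R2 Score": -4.926},
}

# One row per model, sorted from lowest to highest RMSE
table = pd.DataFrame(results_demo).T.sort_values("RMSE")
print(table)
```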




Comparative analysis of model performance on df_evaluation:

On the external data, Random Forest, XGBoost and Gradient Boosting Regressor performed at least as well as on the test set: RMSE stayed low (0.113 to 0.149) and R² stayed high (0.996 to 0.998), which indicates good generalization.

In contrast, the linear regression model, which scored a perfect R² of 1.0000 on the test set, failed on the new data (RMSE 5.971, R² -4.926). The most likely cause is a preprocessing mismatch rather than overfitting or underfitting: the model was fitted on features standardized with StandardScaler, but the evaluation loop passes X_eval to it unscaled, whereas model1 and model2 apply their scaler inside a Pipeline.

Among the three well-performing models, Gradient Boosting Regressor and XGBoost tie on the external data (RMSE 0.113, R² 0.998), and Gradient Boosting had the lowest RMSE on the test set (0.310). We therefore selected it as the final model for predicting goal counts (Goals).
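
Regarding the linear regression result: the model was fitted on `X_train_scaled`, while the evaluation loop calls `predict` on the raw `X_eval`. A minimal sketch of how placing the scaler inside a `Pipeline` avoids that mismatch is shown below. It uses synthetic data, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (stand-ins for X_train / X_eval)
X_tr = rng.normal(loc=[20.0, 500.0], scale=[5.0, 200.0], size=(300, 2))
X_ev = rng.normal(loc=[20.0, 500.0], scale=[5.0, 200.0], size=(100, 2))
true_coef = np.array([0.3, 0.01])
y_tr = X_tr @ true_coef + rng.normal(0, 0.1, 300)
y_ev = X_ev @ true_coef + rng.normal(0, 0.1, 100)

# Variant 1: scaler kept outside the model and not applied at prediction time
scaler_demo = StandardScaler().fit(X_tr)
bare = LinearRegression().fit(scaler_demo.transform(X_tr), y_tr)
r2_unscaled = r2_score(y_ev, bare.predict(X_ev))  # raw inputs: mismatch

# Variant 2: scaler inside the Pipeline, applied automatically by predict()
piped = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
piped.fit(X_tr, y_tr)
r2_piped = r2_score(y_ev, piped.predict(X_ev))

print(f"R2 without scaling at predict time: {r2_unscaled:.3f}")
print(f"R2 with the scaler in the pipeline: {r2_piped:.3f}")
```

The first variant produces a strongly negative R², the same symptom as in the table above, while the second behaves as expected.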

In [136]:
import joblib
joblib.dump(model_gb, 'gradient_boosting_model.pkl')
print("Model saved successfully in .pkl format")

Model saved successfully in .pkl format
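
A saved model can be reloaded with `joblib.load` and used for prediction right away. The sketch below trains a small stand-in model on synthetic data and writes it to a temporary directory, so the file name and data are hypothetical and the real `gradient_boosting_model.pkl` is not touched.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Small synthetic training set (stand-in for X_train / y_train)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(0, 0.1, 100)
gb_demo = GradientBoostingRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")  # hypothetical file name
    joblib.dump(gb_demo, path)
    reloaded = joblib.load(path)

# The reloaded model should give the same predictions as the original
same = bool(np.allclose(gb_demo.predict(X_demo), reloaded.predict(X_demo)))
print(same)
```

Pickled scikit-learn models are best reloaded with the same scikit-learn version that produced them.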
