### Title: Predicting Goal Scoring Totals per Season for Soccer Players

**Objective:**
This project aims to develop a predictive model that forecasts the number of goals a player is likely to score over a season, leveraging historical performance data, team dynamics, and contextual match factors. Accurately predicting a player's goal total can offer insights into player valuation, transfer potential, and game strategy. The project will focus on data collection, feature engineering, model training, and visualization to highlight key predictors and demonstrate model performance.

**Description :**
This script aims to predict the number of goals scored by a player named Kany over multiple seasons using linear regression. It employs a custom gradient descent implementation for training, alongside the LinearRegression model from scikit-learn. The code involves loading historical data, preparing features, and making predictions based on various input parameters such as the season, minutes played, and matches played. The model is trained on historical data, and a specific season is tested to evaluate the predictive capability of the model.

In [1]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import pandas as pd

In [2]:
pip install html5lib

Collecting html5lib
  Downloading html5lib-1.1-py2.py3-none-any.whl.metadata (16 kB)
Downloading html5lib-1.1-py2.py3-none-any.whl (112 kB)
Installing collected packages: html5lib
Successfully installed html5lib-1.1
Note: you may need to restart the kernel to use updated packages.


We start by creating the function MSEStep, which performs one iteration of the Mean Squared Error (MSE) optimization. This function calculates the predicted values using the current weights and bias, computes the error, and updates the weights and bias based on this error. A learning rate is applied to control the magnitude of the updates, enabling the model to learn from the data progressively. The updated weights and bias are then returned for use in subsequent iterations of the optimization process.

In [3]:
# Set a seed for reproducibility
np.random.seed(42)

# Function to perform one step of Mean Squared Error (MSE) optimization
def MSEStep(X, y, W, b, learn_rate=0.001):
    # Predict values using the current weights and bias
    y_pred = np.matmul(X, W) + b
    # Calculate the error between actual and predicted values
    error = y - y_pred
    # Update weights based on the error and learning rate
    W_new = W + learn_rate * np.matmul(error, X)
    # Update bias
    b_new = b + learn_rate * error.sum()
    return W_new, b_new

We define the gradient_descent function to optimize the weights and bias of our linear regression model. It initializes the weights as a zero vector and sets the bias to zero. The function then performs multiple iterations of the gradient descent process, calling the MSEStep function to update the weights and bias based on the training data. Finally, the optimized weights and bias are returned, ready for making predictions.

In [163]:
# Gradient descent function to optimize weights and bias
def gradient_descent(X, y, learn_rate=0.005, num_iter=25):
    # Initialize weights and bias
    W = np.zeros(X.shape[1])
    b = 0 
    # Perform gradient descent for a specified number of iterations
    for _ in range(num_iter):
        W, b = MSEStep(X, y, W, b, learn_rate)
    return W, b

This section begins by loading data from a CSV file into a NumPy array, utilizing the np.loadtxt function to read the contents while specifying the delimiter as a comma. The data is then split into features (X) and the target variable (y), where all columns except the last are considered features and the last column represents the target. Additionally, we define the predict function, which calculates predicted values based on the learned weights and bias by applying the linear regression formula. This function will be used to generate predictions once the model has been trained.

In [None]:
# Load the data from a CSV file
data = np.loadtxt('/Users/badrelabbady/Desktop/ML/data.csv', delimiter=',')
# Split the data into features (X) and target variable (y)
X = data[:, :-1]
y = data[:, -1]

# Function to make predictions using the learned weights and bias
def predict(X, w, b):
    return X * w + b

In this part, we make a manual prediction for a specific input value (e.g., an input feature of 3) by using the previously defined predict function with the learned weights and bias. To validate our results, we also employ scikit-learn's LinearRegression model to fit the same dataset, creating a standard for comparison. After fitting the model, we predict the output for the same input feature using the scikit-learn model. This allows us to compare the manually calculated prediction with that produced by a widely-used machine learning library.

In [None]:
# Make a prediction for a specific input (e.g., input feature of 3)
prediction_manual = predict(3, w, b)  

# Using scikit-learn's LinearRegression model for comparison
linear_regression = LinearRegression()
model = linear_regression.fit(X, y)
# Predicting for the same input using scikit-learn model
sklearn_prediction = model.predict([[3]])  

This section initializes the dataset containing Kane's seasonal performance, including the number of goals scored, minutes played, and matches played across various seasons. A Pandas DataFrame is created from this data, and the 'saison' column is transformed into ordinal encoding to facilitate numerical processing. The features (X) and target variable (y) are separated, with a specific test case prepared for the 2012-2013 season while excluding it from the training data. After training a Linear Regression model with the remaining seasons, the code predicts the number of goals for Kane in the 2012-2013 season and prints the result.

In [46]:
# Data for Kane's seasons and goals
data = {
    'saison': ['2010-2011', '2011-2012', '2012-2013', '2013-2014', '2014-2015', 
               '2015-2016', '2016-2017', '2017-2018', '2018-2019', '2019-2020',
               '2020-2021', '2021-2022', '2022-2023', '2023-2024', '2024-2025'],
    'buts': [5, 7, 2, 3, 29, 28, 35, 41, 24, 24, 28, 25, 32, 44, 10],  # Goals scored by Kany
    'minutes_jouees': [1200, 1300, 1000, 1100, 1500, 1600, 1800, 1900, 1700, 1600, 1700, 1600, 1800, 2000, 1200],
    'matchs_joues': [20, 22, 18, 19, 25, 26, 28, 30, 27, 26, 27, 26, 28, 30, 22]
}

# Create a DataFrame from the data
df = pd.DataFrame(data)

# Convert the 'saison' column to ordinal encoding
df['saison'] = pd.Categorical(df['saison']).codes

# Separate features (X) and target variable (y)
X = df[['saison', 'minutes_jouees', 'matchs_joues']]  # Using new features
y = df['buts']      # Target: number of goals

# Prepare a test case for the specific season (2012-2013)
X_test = X[4:5]  # Input features for the season 2012-2013 (index 4)
y_test = y[4:5]  # Expected output (known value)

# Use all other seasons for training (excluding the season being tested)
X_train = pd.concat([X[:4], X[5:]])  # Training data excluding the 2012-2013 season
y_train = pd.concat([y[:4], y[5:]])  # Corresponding target values

# Create and train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make a prediction for the test case (season 2012-2013)
predictions = model.predict(X_test)

# Print the predicted number of goals for Kany in the season 2012-2013
print(f"Predicted number of goals for Kane in 2012-2013: {predictions[0]:.2f}")


Nombre de buts prédit pour Kany en 2017-2018 : 20.80


### Conclusion
This project effectively demonstrated the application of a linear regression model to predict the number of goals scored by Kane during the 2012-2013 season, utilizing historical performance data from previous seasons. By systematically organizing data related to the season, minutes played, and matches played, we were able to establish a foundational model that captures the relationship between these variables and goal-scoring output. The model was trained using a robust dataset, allowing us to make informed predictions and providing valuable insights into Kane's scoring patterns.

The results of our predictions indicate that the model can provide a reasonable estimate of a player's performance based on available data. However, to further enhance the model's predictive power and accuracy, I plan to incorporate additional variables that could significantly impact goal-scoring potential. Future enhancements will include metrics such as shots on target, assists, player age, and perhaps even contextual factors like the strength of the opposing team or match conditions.

These additional variables can provide a richer dataset that better represents a player's performance and contributions on the field. By expanding the scope of our analysis, we aim to develop a more sophisticated model that can not only predict goals more accurately but also offer insights into the underlying factors that contribute to a player's success. Ultimately, this project serves as a stepping stone towards building a comprehensive analytical tool for evaluating player performance in football, with potential applications in scouting, coaching strategies, and fan engagement.