#**Premier League Predictor Model**

###**Import libraries** <br/>

This section imports necessary libraries and modules to run the model:

- `from sklearn.linear_model import LinearRegression`: This is the algorithm used for creating the predictive model. <br/>

- `from sklearn.model_selection import train_test_split`: This function splits the dataset into training and testing subsets. <br/>

- `from sklearn.preprocessing import StandardScaler`: This standardizes each feature to zero mean and unit variance, so that all features are on a comparable scale. <br/>

- `from sklearn.metrics import mean_absolute_error`: This is used to calculate the model's error by comparing predicted values to the true values. <br/>

- `import pandas as pd`: This is the library used for data manipulation, particularly for working with datasets in tabular form. <br/>

- `from google.colab import files`: This is used to upload the dataset files to Google Colab.

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
import pandas as pd
from google.colab import files

###**Upload .csv Files** <br/>

This code uploads the .csv files containing Premier League data for each season. Running it in Google Colab prompts you to select the required files. Once uploaded, the files can be loaded into pandas DataFrames for further analysis.

In [3]:
uploaded = files.upload()

Saving Premier_League_2016_2017.csv to Premier_League_2016_2017.csv
Saving Premier_League_2017_2018.csv to Premier_League_2017_2018.csv
Saving Premier_League_2018_2019.csv to Premier_League_2018_2019.csv
Saving Premier_League_2019_2020.csv to Premier_League_2019_2020.csv
Saving Premier_League_2020_2021.csv to Premier_League_2020_2021.csv
Saving Premier_League_2021_2022.csv to Premier_League_2021_2022.csv
Saving Premier_League_2022_2023.csv to Premier_League_2022_2023.csv
Saving Premier_League_2023_2024.csv to Premier_League_2023_2024.csv
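
In Colab, `files.upload()` returns a dictionary that maps each uploaded filename to its raw bytes, in addition to saving the files to the working directory. The sketch below shows one way to parse such a dictionary directly. The helper name `frames_from_upload`, the stand-in dictionary, and its column names and values are assumptions made for illustration, not part of the notebook.

```python
import io
import pandas as pd

def frames_from_upload(uploaded):
    """Parse a {filename: raw CSV bytes} mapping into {filename: DataFrame}."""
    return {name: pd.read_csv(io.BytesIO(content)) for name, content in uploaded.items()}

# Stand-in for the dict returned by files.upload(); contents are illustrative only.
fake_upload = {
    "Premier_League_2016_2017.csv": b"Team,Points\nTeam A,90\nTeam B,80\n",
}

frames = frames_from_upload(fake_upload)
print(frames["Premier_League_2016_2017.csv"])
```

Reading from the returned bytes avoids depending on the working directory, although `pd.read_csv('<filename>')` works equally well once the files have been saved.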


###**Loading Datasets** <br/>

This section loads each season's dataset into its own pandas DataFrame. Each DataFrame represents the Premier League standings for a given season. The DataFrames are then combined into a single DataFrame (`df`) using `pd.concat()`. The call `df.fillna(0, inplace=True)` replaces every missing value with zero so that model training does not fail on NaN values. After this step, a filled-in zero cannot be told apart from a genuine value of 0.

In [4]:
# Load datasets
df_16_17 = pd.read_csv('Premier_League_2016_2017.csv')
df_17_18 = pd.read_csv('Premier_League_2017_2018.csv')
df_18_19 = pd.read_csv('Premier_League_2018_2019.csv')
df_19_20 = pd.read_csv('Premier_League_2019_2020.csv')
df_20_21 = pd.read_csv('Premier_League_2020_2021.csv')
df_21_22 = pd.read_csv('Premier_League_2021_2022.csv')
df_22_23 = pd.read_csv('Premier_League_2022_2023.csv')
df_23_24 = pd.read_csv('Premier_League_2023_2024.csv')

# Combine all datasets into one DataFrame
df = pd.concat([df_16_17, df_17_18, df_18_19, df_19_20, df_20_21, df_21_22, df_22_23, df_23_24], ignore_index=True)
df.fillna(0, inplace=True)
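
The eight `pd.read_csv` calls above could also be written as a loop. The sketch below is one possible variant run on in-memory stand-ins: the helper `load_seasons`, the extra `Season` column, and all values are assumptions for illustration and do not appear in the original data.

```python
import io
import pandas as pd

def load_seasons(sources):
    """Read one CSV per season and stack them, tagging each row with its season label."""
    frames = []
    for season, source in sources.items():
        frame = pd.read_csv(source)
        frame["Season"] = season  # hypothetical helper column, not in the original files
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# In-memory stand-ins for the uploaded files; one Points value is deliberately missing.
sources = {
    "2016-17": io.StringIO("Team,Points\nTeam A,90\nTeam B,40\n"),
    "2017-18": io.StringIO("Team,Points\nTeam A,85\nTeam C,\n"),
}

combined = load_seasons(sources)
missing_before = int(combined["Points"].isna().sum())  # count NaNs before filling
combined = combined.fillna(0)
print(f"Missing values filled: {missing_before}")
print(combined)
```

Counting missing values before calling `fillna(0)` shows how much of the data is being replaced by zeros.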

###**Feature defining**

Here, the features (predictor variables) and target (the variable to predict) for the model are defined:

- `features`: A list of statistics from the dataset that will be used to predict the number of points a team will earn. <br/>
- `Recent_Form`: A new feature computed as the rolling mean of the `Win` column over a window of 5 rows (the current row and up to four before it). Each row of `df` is one team's season and the window is not grouped by team, so as written this averages season win totals across adjacent rows rather than a team's last 5 games.
- `Last_Year_Points`: The points a team earned in its previous season in the dataset, which may be a useful predictor of the team's performance in the current season. Teams with no earlier season get 0. As written, this column is added to `df` but not appended to `features`, so the model does not use it.
- The target variable `target` is defined as `'Points'`, which is the number of points each team earned in the league.
- The list `teams_to_predict` contains the teams for which the model will predict the points for the upcoming season.

In [5]:
# Define features and target for the model
features = ['Win', 'Draw', 'Loss', 'Goals Scored', 'Goals Conceded',
            'Goal Difference', 'Expected Goals Scored', 'Expected Goals Conceded',
            'Expected Goal Difference']
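# Rolling mean of 'Win' over 5 consecutive rows of df (rows are team-seasons, not grouped by team)
df['Recent_Form'] = df['Win'].rolling(window=5, min_periods=1).mean()
features.append('Recent_Form')

# Compute 'Last_Year_Points' (not appended to `features`, so the model does not use it)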
df['Last_Year_Points'] = df.groupby('Team')['Points'].shift(1)
df.fillna({'Last_Year_Points': 0}, inplace=True)

target = 'Points'

# Teams to predict for the upcoming season
teams_to_predict = ['Arsenal', 'Aston Villa', 'Bournemouth', 'Brentford', 'Brighton & Hove Albion', 'Chelsea',
                    'Crystal Palace', 'Everton', 'Fulham', 'Ipswich', 'Leicester City', 'Liverpool', 'Manchester City',
                    'Manchester United', 'Newcastle United', 'Nottingham Forest', 'Southampton', 'Tottenham',
                    'West Ham', 'Wolves']
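
To make the two derived columns concrete, the sketch below applies `groupby('Team')` with `shift(1)` to a toy table, and also shows a per-team rolling mean as one possible alternative to the ungrouped rolling window. The toy values, the `Season` column, the `Team_Form` name, and the window of 2 are assumptions for illustration only.

```python
import pandas as pd

# Toy team-season table in chronological order; all values are illustrative.
toy = pd.DataFrame({
    "Season": ["2021-22", "2021-22", "2022-23", "2022-23", "2023-24", "2023-24"],
    "Team":   ["Team A", "Team B", "Team A", "Team B", "Team A", "Team B"],
    "Win":    [20, 10, 24, 8, 22, 12],
    "Points": [70, 40, 80, 35, 75, 45],
})

# Points from the same team's previous row; a team's first row has no history, so it gets 0.
toy["Last_Year_Points"] = toy.groupby("Team")["Points"].shift(1).fillna(0)

# Hypothetical per-team form: mean wins over each team's current and previous season.
toy["Team_Form"] = toy.groupby("Team")["Win"].transform(
    lambda wins: wins.rolling(window=2, min_periods=1).mean()
)
print(toy)
```

Because `shift(1)` works on row order within each group, the result is only the previous season's points if the rows are in chronological order, as they are after concatenating the seasons oldest first.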

### **Model Training** <br/>
This section splits the data into training and testing sets:

- `df_train`: All rows of the dataset that do not correspond to the teams we want to predict.
- `train_test_split()`: Splits the data into training and testing sets (70% for training, 30% for testing) using the features and target variables.

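Then, the features are scaled using `StandardScaler()`, which standardizes the data by removing the mean and scaling to unit variance. The scaler is fitted on the training data only and then applied to the test data, so no test-set statistics leak into training. For ordinary least squares, scaling does not change the predictions, but it puts the coefficients on a comparable scale.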
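
The fit-on-train, transform-on-test pattern can be sketched on synthetic data as follows. The example also checks that a plain `LinearRegression` gives the same predictions with and without standardization. The data, the random seed, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales; values are illustrative only.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[50.0, 1.5], scale=[10.0, 0.3], size=(40, 2))
y_train = 2.0 * X_train[:, 0] - 5.0 * X_train[:, 1] + rng.normal(size=40)
X_test = rng.normal(loc=[50.0, 1.5], scale=[10.0, 0.3], size=(10, 2))

# Fit the scaler on the training data only, then reuse it for the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit one model on raw features and one on standardized features.
raw_pred = LinearRegression().fit(X_train, y_train).predict(X_test)
scaled_pred = LinearRegression().fit(X_train_scaled, y_train).predict(X_test_scaled)

print("Train means after scaling:", X_train_scaled.mean(axis=0).round(6))
print("Train stds after scaling: ", X_train_scaled.std(axis=0).round(6))
print("Same predictions:", np.allclose(raw_pred, scaled_pred))
```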

The `LinearRegression` model is initialized, and the model is trained on the scaled training data (`X_train` and `y_train`). After training, predictions are made on the test data (`X_test`), and the model’s performance is evaluated using `mean_absolute_error`, which calculates the average absolute difference between predicted and actual points.

In [6]:
# Split into training and test data
df_train = df[~df['Team'].isin(teams_to_predict)]
X_train, X_test, y_train, y_test = train_test_split(df_train[features], df_train[target], test_size=0.3, random_state=42)

# Scale features using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize the Linear Regression model
model = LinearRegression()

# Fit the model (after scaling, X_train and X_test are already NumPy arrays)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae:.2f}')

Mean Absolute Error: 0.07
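
One plausible reason for an error this small is that the target is almost fully determined by the features. In the Premier League a win is worth 3 points and a draw 1, so `Points = 3 * Win + Draw`, and both `Win` and `Draw` are in `features`. The sketch below illustrates this on synthetic season records, not on the notebook's data; the record generator and its probabilities are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic 38-game seasons split into wins, draws and losses; values are illustrative only.
rng = np.random.default_rng(42)
records = rng.multinomial(38, [0.4, 0.25, 0.35], size=60)
wins, draws = records[:, 0], records[:, 1]
points = 3 * wins + draws  # league scoring rule: 3 per win, 1 per draw

# Regress points on wins and draws only.
X = np.column_stack([wins, draws])
leak_model = LinearRegression().fit(X, points)
leak_mae = mean_absolute_error(points, leak_model.predict(X))

print("Coefficients:", leak_model.coef_.round(3))
print(f"MAE: {leak_mae:.6f}")
```

If the goal is to forecast a season before it is played, features that are only known at the end of that season (such as its win and draw totals) would not be available at prediction time.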


###**Predictions** <br/>
This section prepares the data to make predictions for the specified teams (`teams_to_predict`):

- The rows for the teams to predict are extracted from the dataset, and each feature is averaged per team across all available seasons using `groupby()` and `mean()`. Teams in `teams_to_predict` with no matching name in the `Team` column are silently dropped, which is why the output lists 17 of the 20 teams.
- The features are then scaled using the same `StandardScaler` that was fitted on the training data, so the prediction data is processed the same way.
- The model predicts the points for each team, and the predictions are stored in the `Predicted_Points` column.
- The teams are ranked by their predicted points in descending order, with the highest predicted points ranked first. With `method='min'`, tied teams share the lowest rank in their group and the next rank is skipped. This ranking is stored in the `Rank` column.
- Finally, the results are printed, showing each team's rank, name, and predicted points for the upcoming season.

In [7]:
# Prepare prediction data for teams to predict
df_predict = df[df['Team'].isin(teams_to_predict)].groupby('Team', as_index=False)[features].mean()

# Scale prediction features
df_predict[features] = scaler.transform(df_predict[features])

# Convert to numpy arrays to avoid feature name warnings
X_predict = df_predict[features].to_numpy()

# Predict the points for the teams
df_predict['Predicted_Points'] = model.predict(X_predict)

# Sort by predicted points and rank the teams
df_predict['Rank'] = df_predict['Predicted_Points'].rank(ascending=False, method='min')

# Display the predictions sorted by predicted points
results = df_predict[['Rank', 'Team', 'Predicted_Points']].sort_values(by='Rank')
print("Predicted Points for the New Season:")
print(results)

Predicted Points for the New Season:
    Rank               Team  Predicted_Points
10   1.0    Manchester City         89.500000
9    2.0          Liverpool         82.125000
0    3.0            Arsenal         70.875000
14   4.0          Tottenham         69.000000
11   5.0  Manchester United         68.625000
4    5.0            Chelsea         68.625000
1    7.0        Aston Villa         53.017501
8    8.0     Leicester City         51.000000
16   9.0             Wolves         49.833333
6   10.0            Everton         49.375000
3   11.0          Brentford         48.000000
15  12.0           West Ham         47.857143
5   13.0     Crystal Palace         45.375000
2   14.0        Bournemouth         42.666667
13  15.0        Southampton         40.142857
7   16.0             Fulham         38.250000
12  17.0  Nottingham Forest         37.000000
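
Two details of this output can be checked with a short sketch: how `rank(method='min')` treats ties (two teams share rank 5 above, and rank 6 is skipped), and which requested teams have no matching rows. The team names and values below are placeholders, and `known_teams` stands in for the distinct values of `df['Team']`.

```python
import pandas as pd

# Illustrative predictions with one tie.
predicted = pd.Series(
    [89.5, 68.625, 68.625, 53.0],
    index=["Team A", "Team B", "Team C", "Team D"],
)

# Tied values share the lowest rank of their group; the following rank is skipped.
ranks = predicted.rank(ascending=False, method="min").astype(int)
print(ranks)

# Requested names that do not occur in the data would be dropped by isin().
requested = ["Team A", "Team B", "Team E"]   # placeholder for teams_to_predict
known_teams = set(predicted.index)           # placeholder for set(df['Team'])
missing = sorted(set(requested) - known_teams)
print("Teams with no matching rows:", missing)
```

Running a check like this before predicting makes naming mismatches between the list and the data visible.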
