**Premier League Predictor Model**

**Import .csv files**

The standings for the past eight Premier League seasons (2016/17 to 2023/24) are uploaded to Google Colab as .csv files, one file per season.

In [1]:
from google.colab import files
uploaded = files.upload()

Saving Premier_League_2016_2017.csv to Premier_League_2016_2017.csv
Saving Premier_League_2017_2018.csv to Premier_League_2017_2018.csv
Saving Premier_League_2018_2019.csv to Premier_League_2018_2019.csv
Saving Premier_League_2019_2020.csv to Premier_League_2019_2020.csv
Saving Premier_League_2020_2021.csv to Premier_League_2020_2021.csv
Saving Premier_League_2021_2022.csv to Premier_League_2021_2022.csv
Saving Premier_League_2022_2023.csv to Premier_League_2022_2023.csv
Saving Premier_League_2023_2024.csv to Premier_League_2023_2024.csv


**Loading Datasets**

Each .csv file is loaded into a DataFrame named after the season it covers. The eight DataFrames are then combined into one and saved as a single .csv file.

In [2]:
import pandas as pd

# Load each season's dataset
df_16_17 = pd.read_csv('Premier_League_2016_2017.csv')
df_17_18 = pd.read_csv('Premier_League_2017_2018.csv')
df_18_19 = pd.read_csv('Premier_League_2018_2019.csv')
df_19_20 = pd.read_csv('Premier_League_2019_2020.csv')
df_20_21 = pd.read_csv('Premier_League_2020_2021.csv')
df_21_22 = pd.read_csv('Premier_League_2021_2022.csv')
df_22_23 = pd.read_csv('Premier_League_2022_2023.csv')
df_23_24 = pd.read_csv('Premier_League_2023_2024.csv')

# Combine all datasets
df = pd.concat([
    df_16_17, df_17_18, df_18_19, df_19_20,
    df_20_21, df_21_22, df_22_23, df_23_24
], ignore_index=True)
df.to_csv('Combined_Premier_League_Standings.csv', index=False)
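
The eight near-identical `read_csv` calls could also be written as a loop. The sketch below is one possible variant, not the notebook's own method. It adds a `Season` column so that rows stay distinguishable after concatenation; this column is an assumption, since the source files may already contain one. The helper name `load_seasons` and the two tiny demo files are hypothetical and exist only so the example runs on its own.

```python
import os
import tempfile

import pandas as pd


def load_seasons(directory, first_start_year, last_start_year):
    """Read Premier_League_<start>_<end>.csv files and stack them into one DataFrame."""
    frames = []
    for start in range(first_start_year, last_start_year + 1):
        path = os.path.join(directory, f"Premier_League_{start}_{start + 1}.csv")
        season_df = pd.read_csv(path)
        # Assumed extra column: tag every row with the season it came from.
        season_df["Season"] = f"{start}/{start + 1}"
        frames.append(season_df)
    return pd.concat(frames, ignore_index=True)


# Demo with two made-up files so the sketch runs without the real uploads.
demo_dir = tempfile.mkdtemp()
for start in (2016, 2017):
    pd.DataFrame({"Team": ["Team A", "Team B"], "Points": [50, 40]}).to_csv(
        os.path.join(demo_dir, f"Premier_League_{start}_{start + 1}.csv"), index=False
    )

combined = load_seasons(demo_dir, 2016, 2017)
print(combined)
```

With the real uploads in the working directory, a call such as `load_seasons('.', 2016, 2023)` would be expected to read the same eight files as the cell above.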


**Defining Features**

The season statistics listed below are the features used to train the model. The target to predict is the points total. The teams for which predictions are wanted are stored in a list.

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Define features and target for the model
features = [
    'Win', 'Draw', 'Loss', 'Goals Scored', 'Goals Conceded',
    'Goal Difference', 'Expected Goals Scored', 'Expected Goals Conceded',
    'Expected Goal Difference'
]
target = 'Points'

# Teams to predict
teams_to_predict = ['Arsenal', 'Aston Villa', 'Bournemouth', 'Brentford', 'Brighton & Hove Albion', 'Chelsea', 'Crystal Palace',
                    'Everton', 'Fulham', 'Ipswich', 'Leicester City', 'Liverpool', 'Manchester City', 'Manchester United',
                    'Newcastle United', 'Nottingham Forest', 'Southampton', 'Tottenham', 'West Ham', 'Wolves']
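
Because the model cells index the DataFrame by these exact column names, a quick check that every name exists can catch spelling mismatches early. This is a minimal sketch on made-up data; the helper `missing_columns` is hypothetical and not part of the notebook.

```python
import pandas as pd


def missing_columns(frame, required):
    """Return the names in `required` that are not columns of `frame`."""
    return [col for col in required if col not in frame.columns]


# Made-up one-row table that lacks a 'Loss' column.
demo = pd.DataFrame({"Team": ["Team A"], "Win": [20], "Draw": [10], "Points": [70]})
required = ["Team", "Win", "Draw", "Loss", "Points"]

print("Missing columns:", missing_columns(demo, required))
```

In the notebook itself the equivalent call would be `missing_columns(df, features + [target, 'Team'])`, which should return an empty list if the .csv headers match the names above.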


**Model Training**

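X holds the statistics named in `features`, and y holds the points each team obtained in the corresponding season. The training rows are the season records of teams that are not in `teams_to_predict`; the listed teams are held out for the prediction step.
A linear regression model is fitted to this data.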

In [None]:
# Prepare training data: keep only rows for teams NOT in teams_to_predict (~ negates the mask)
df_train = df[~df['Team'].isin(teams_to_predict)]
X_train = df_train[features]
y_train = df_train[target]

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
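
To see what a fitted linear regression looks like, the sketch below trains on a reduced, made-up table with only two features. It assumes the usual league scoring of three points per win and one per draw, which the real `Points` column may not follow exactly (for example where points deductions apply). Under that assumption the target is an exact linear function of the inputs, so the fit on the training rows is essentially perfect. The names `toy` and `toy_model` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up season records; Points follows the assumed 3-1-0 scoring rule.
toy = pd.DataFrame({
    "Win": [28, 20, 12, 7, 15],
    "Draw": [5, 9, 10, 8, 12],
})
toy["Points"] = 3 * toy["Win"] + toy["Draw"]

toy_model = LinearRegression()
toy_model.fit(toy[["Win", "Draw"]], toy["Points"])

# Inspect the learned weights and the R^2 score on the training rows.
print("Coefficients:", dict(zip(["Win", "Draw"], np.round(toy_model.coef_, 3))))
print("Intercept:", round(float(toy_model.intercept_), 3))
print("R^2:", round(toy_model.score(toy[["Win", "Draw"]], toy["Points"]), 3))
```

The same attributes (`coef_`, `intercept_`, `score`) are available on `model` above. With the full feature list the individual coefficients are harder to interpret, because several features are linear combinations of others (goal difference, for instance, is goals scored minus goals conceded).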

**Predictions**

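Each team in the list receives a points prediction for the upcoming season. The model input for a team is the mean of its feature values across all of its past seasons in the dataset. The results are printed in descending order of predicted points, so the team expected to finish with the most points appears first.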

In [None]:
# Aggregate past data to generate input for new predictions
df_predict = (
    df[df['Team'].isin(teams_to_predict)]
    .groupby('Team', as_index=False)[features]
    .mean()
)

# Predict points for the new season
df_predict['Predicted_Points'] = model.predict(df_predict[features])

# Display results sorted by predicted points in descending order
results = df_predict[['Team', 'Predicted_Points']].sort_values(by='Predicted_Points', ascending=False)

print("Predicted Points for the New Season:")
print(results)
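
The aggregation step above can be illustrated on a small made-up table. The sketch also shows one consequence of the approach: a team in the list that has no rows in the historical data (for example, a club that did not play in any of the loaded seasons) produces no averaged row and therefore no prediction. The names `history`, `wanted`, and `no_history` are hypothetical.

```python
import pandas as pd

# Made-up history: Team A has two seasons, Team B has one, Team C has none.
history = pd.DataFrame({
    "Team": ["Team A", "Team A", "Team B"],
    "Win": [20, 24, 10],
    "Draw": [8, 6, 12],
})
wanted = ["Team A", "Team B", "Team C"]

# Same pattern as the prediction cell: filter, group by team, average the features.
averages = (
    history[history["Team"].isin(wanted)]
    .groupby("Team", as_index=False)[["Win", "Draw"]]
    .mean()
)

# Teams that were requested but have no past rows to average.
no_history = sorted(set(wanted) - set(averages["Team"]))

print(averages)
print("No past data:", no_history)
```

Running the same set difference against `df_predict['Team']` and `teams_to_predict` would list any teams that are silently absent from the printed results.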