## Data Preprocessing

In this phase, we will implement the same preprocessing methods utilized in previous iterations to ensure that our dataset achieves a consistent format and is ready for modeling. This process involves several key steps to guarantee that the data is clean, well-organized, and optimized for analysis.

### Steps for Data Preprocessing

1. **Combining League Data**:  
   The first step is to merge all league data into a single comprehensive dataframe. This involves concatenating the individual league datasets, ensuring that all relevant matches are included. Once combined, we will sort the dataset chronologically by match date, which will facilitate further analysis and modeling.

2. **Appending H2H and Elo Columns**:  
   After consolidating the league data, we will append the historical head-to-head (H2H) statistics and updated Elo ratings to the main dataframe. These additional columns provide valuable insights into team performance and rivalry dynamics, which are crucial for predicting match outcomes.

3. **Removing Unnecessary Columns**:  
   With the dataset now enriched with relevant features, we will identify and remove any columns that do not contribute meaningfully to our analysis or modeling efforts. This helps to streamline the dataset, reducing dimensionality and improving computational efficiency.

4. **Handling Missing Values**:  
   Missing values are filled before modeling, because many modeling algorithms cannot handle missing entries and gaps can distort predictions. In this notebook, gaps left by the concatenation are set to 0, missing H2H goals and points are set to 0, and missing Elo ratings are set to a default of 1400. Statistical fills (e.g., mean, median) or more sophisticated imputation methods are alternatives, depending on the nature of the data.

5. **Standardizing the Data**:  
   To prepare the dataset for modeling, we rescale the numeric features. Common options are min-max scaling and z-score normalization; this notebook applies min-max scaling, which maps each numeric column to the [0, 1] range. Putting all features on a comparable scale is particularly important for algorithms that rely on distance measurements or gradient descent.

6. **Saving the Processed Dataframe**:  
   Finally, we will save the resulting standardized dataframe to a file, typically in CSV format. This file will serve as the input for our modeling efforts, once all preprocessing steps have been completed and the data is ready to move into the modeling phase.
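
Step 1 depends on the sort being truly chronological. If `Date` is stored as text, a plain sort is alphabetical and only matches calendar order for ISO-style strings. The following is a minimal sketch on made-up rows, assuming day-first date strings (the actual format of the league files is not shown here):

```python
import pandas as pd

# Two toy league frames; column names mirror the notebook, the rows are made up.
league_a = pd.DataFrame({"Date": ["02/01/2021", "15/12/2020"], "Div": ["A", "A"]})
league_b = pd.DataFrame({"Date": ["10/11/2020", "01/02/2021"], "Div": ["B", "B"]})

combined = pd.concat([league_a, league_b], ignore_index=True)

# Assumption: dates are day-first strings. Parsing them first makes the sort
# chronological rather than alphabetical.
combined["Date"] = pd.to_datetime(combined["Date"], format="%d/%m/%Y")
combined = combined.sort_values("Date").reset_index(drop=True)
print(combined)
```

If the files already store dates as `YYYY-MM-DD`, the parsing step changes nothing about the order and can be skipped.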


In [1]:
import pandas as pd
import numpy as np
import warnings
from Dataset_functions import *
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress all warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)

In [26]:
premier_league_data = pd.read_csv('premier_league_data_v1.csv')
championship_data = pd.read_csv('championship_data.csv')
spanish1_data = pd.read_csv('spanish1_data_v1.csv')
spanish2_data = pd.read_csv('spanish2_data.csv')
italian1_data = pd.read_csv('italian1_data_v1.csv')
italian2_data = pd.read_csv('italian2_data.csv')
german1_data = pd.read_csv('german1_data_v1.csv')
german2_data = pd.read_csv('german2_data.csv')

In [27]:
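# Despite its name, england_data holds the matches from all eight leagues
england_data = pd.concat([championship_data,spanish1_data,spanish2_data,italian1_data,italian2_data,german1_data,german2_data, premier_league_data])
# Columns that a league lacks (such as other leagues' team dummies) come out of the concat as NaN; set them to 0
england_data.fillna(0, inplace = True)
england_data.sort_values(['Date'], ascending = True, inplace = True)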

# Step 1: Dedummify HomeTeam columns
home_team_columns = [col for col in england_data.columns if col.startswith('HomeTeam_')]
england_data['HomeTeam'] = england_data[home_team_columns].idxmax(axis=1).str.replace('HomeTeam_', '')

# Step 2: Dedummify AwayTeam columns
away_team_columns = [col for col in england_data.columns if col.startswith('AwayTeam_')]
england_data['AwayTeam'] = england_data[away_team_columns].idxmax(axis=1).str.replace('AwayTeam_', '')
england_data.reset_index(inplace = True, drop = True)
england_data[['HomeTeam','AwayTeam']].to_csv('home_away_v1.csv', index = False)
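
The de-dummify step above relies on `idxmax(axis=1)` returning the label of the first maximum in each row. The toy sketch below (team names are illustrative only) shows that behaviour, and also shows that a row with no flag set would still receive the first column's label unless it is masked:

```python
import pandas as pd

# Toy one-hot frame; the third row has no team flagged.
dummies = pd.DataFrame({
    "HomeTeam_Arsenal": [1, 0, 0],
    "HomeTeam_Chelsea": [0, 1, 0],
})

home_cols = [c for c in dummies.columns if c.startswith("HomeTeam_")]

# idxmax returns the column label holding the first maximum of each row.
recovered = dummies[home_cols].idxmax(axis=1).str.replace("HomeTeam_", "", regex=False)

# A row of all zeros would otherwise be labelled with the first column, so mask it.
has_team = dummies[home_cols].sum(axis=1) > 0
recovered = recovered.where(has_team)
print(recovered.tolist())
```

Whether such all-zero rows occur in the real data is not known from this notebook; the mask is only a precaution.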

In [29]:
h2h = pd.read_csv('h2h_data_v1.csv')
# Drop repeated fixtures so the merge keys are unique on both sides
england_data.drop_duplicates(['Date', 'HomeTeam', 'AwayTeam'], inplace = True)
h2h.drop_duplicates(['Date', 'HomeTeam', 'AwayTeam'], inplace = True)
england_data.reset_index(inplace = True, drop = True)
h2h.reset_index(inplace = True, drop = True)
# Left merge keeps every match, even when h2h has no row for it
h2h_feature_columns = ['Date', 'HomeTeam', 'AwayTeam', 'Home_h2h_Goals', 'Home_h2h_Points', 'Away_h2h_Goals', 'Away_h2h_Points', 'home_elo', 'away_elo', 'month', 'year', 'Avg>2.5', 'Avg<2.5']
england_data = pd.merge(england_data, h2h[h2h_feature_columns], on = ['Date', 'HomeTeam', 'AwayTeam'], how = 'left')

england_data[['HomeTeam', 'AwayTeam', 'Date']].to_csv('next_fixture_teams.csv', index = False)
england_data.drop(['HomeTeam', 'AwayTeam'], axis = 1, inplace = True)
# Bucket kick-off times into three coarse windows
def assign_time_bucket(time_str):
    # 'HH:MM' -> hour as an integer
    hour = int(time_str.split(':')[0])
    
    if hour < 15:
        return 'morning_afternoon'
    elif 15 <= hour < 19:
        return 'afternoon_evening'
    else:
        return 'evening_night'
# Apply the bucketing function to the 'Time' column
england_data['TimeBucket'] = england_data['Time'].apply(assign_time_bucket)
england_data.drop(['Date', 'Home xG', 'Away xG', 'Time', 'ovr25_per_game'], axis = 1, inplace= True,errors = 'ignore')
england_data = pd.get_dummies(england_data, columns=['Div', 'year', 'month','TimeBucket'])
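
Because the H2H and Elo columns are attached with a left merge on `Date`, `HomeTeam` and `AwayTeam`, any fixture without a matching row ends up with NaN in those columns. One way to see how many fixtures are affected is pandas' `indicator` option, sketched here on made-up rows (`matches` and `h2h_toy` are hypothetical stand-ins for the real frames):

```python
import pandas as pd

matches = pd.DataFrame({
    "Date": ["2021-01-02", "2021-01-09"],
    "HomeTeam": ["Arsenal", "Chelsea"],
    "AwayTeam": ["Chelsea", "Arsenal"],
})
h2h_toy = pd.DataFrame({
    "Date": ["2021-01-02"],
    "HomeTeam": ["Arsenal"],
    "AwayTeam": ["Chelsea"],
    "home_elo": [1520.0],
})

keys = ["Date", "HomeTeam", "AwayTeam"]

# validate raises if the keys are not unique on both sides;
# indicator adds a _merge column saying where each row came from.
merged = matches.merge(h2h_toy, on=keys, how="left", validate="one_to_one", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} fixture(s) without H2H/Elo data")

merged = merged.drop(columns="_merge")
```

The `validate="one_to_one"` check complements the `drop_duplicates` calls, since it fails loudly if duplicate keys slip through.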

In [30]:
targets = england_data[['FTHG', 'FTAG']]
england_data.drop(['FTHG', 'FTAG'], inplace= True, axis = 1)

In [36]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may act on a temporary copy in recent pandas versions.
h2h_stat_columns = ['Away_h2h_Points', 'Away_h2h_Goals', 'Home_h2h_Points', 'Home_h2h_Goals']
england_data[h2h_stat_columns] = england_data[h2h_stat_columns].fillna(0)
# Matches without an Elo rating get a default of 1400
england_data[['home_elo', 'away_elo']] = england_data[['home_elo', 'away_elo']].fillna(1400)
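
The cell above fills gaps with fixed constants. Step 4 also mentions statistical fills such as the median; the sketch below shows what that alternative could look like on a made-up Elo column. It is not what this notebook does, only an illustration of the option:

```python
import numpy as np
import pandas as pd

# Toy column with one missing rating; the values are made up.
toy = pd.DataFrame({"home_elo": [1500.0, np.nan, 1600.0, 1450.0]})

# Series.median() skips NaN by default.
median_elo = toy["home_elo"].median()

# Keep the original column and write the filled values next to it for comparison.
toy["home_elo_filled"] = toy["home_elo"].fillna(median_elo)
print(toy)
```

A constant such as 1400 encodes "unknown team, assume a baseline rating", whereas a median fill assumes the missing team is typical of the observed ones; which is more appropriate depends on why the value is missing.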

In [38]:
h2h[['HomeTeam', 'AwayTeam', 'Avg>2.5', 'Avg<2.5']].to_csv('2_5_goals.csv', index = False)

In [40]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
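# Rescale every numeric column to the [0, 1] range
scaler_minmax = MinMaxScaler()
numeric_columns = england_data.select_dtypes(include=['float64', 'int64']).columns
# Reuse the original index so the assignment below aligns row by row
england_data_n = pd.DataFrame(scaler_minmax.fit_transform(england_data[numeric_columns]), columns=numeric_columns, index=england_data.index)
england_data[numeric_columns] = england_data_n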
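
To make the scaling step concrete: min-max scaling maps each value `x` in a column to `(x - min) / (max - min)`. The sketch below applies that formula with plain pandas on toy data; `min_max_scale` is a hypothetical helper written for illustration, not part of the notebook or of scikit-learn:

```python
import pandas as pd


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # Avoid dividing by zero for constant columns; they come out as all zeros.
    col_range = col_range.replace(0, 1)
    return (frame - col_min) / col_range


toy = pd.DataFrame({
    "home_elo": [1400.0, 1500.0, 1600.0],
    "constant": [3.0, 3.0, 3.0],
})
scaled = min_max_scale(toy)
print(scaled)
```

Note that the minimum and maximum are taken over whatever rows are passed in, so the scaled values depend on which matches are present when the scaler is fitted.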

In [42]:
england_data.to_csv('default_data_all_variables_v1.csv',index = False)

In [43]:
targets.to_csv('targets_v1.csv',index = False)
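
Features and targets are written to separate CSV files without an index, so they stay linked only by row position. A quick round-trip check like the sketch below can confirm that both files have the same number of rows; the file names and data here are placeholders written to a temporary directory, not the notebook's real outputs:

```python
import os
import tempfile

import pandas as pd

# Stand-ins for the scaled features and the FTHG/FTAG targets.
features = pd.DataFrame({"home_elo": [0.0, 0.5, 1.0]})
targets_toy = pd.DataFrame({"FTHG": [1, 0, 2], "FTAG": [0, 0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    features_path = os.path.join(tmp, "features.csv")
    targets_path = os.path.join(tmp, "targets.csv")
    features.to_csv(features_path, index=False)
    targets_toy.to_csv(targets_path, index=False)

    # Read both files back the way the modeling notebook would.
    reloaded_x = pd.read_csv(features_path)
    reloaded_y = pd.read_csv(targets_path)

# Row i of the features must describe the same match as row i of the targets.
same_length = len(reloaded_x) == len(reloaded_y)
print(same_length)
```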

### Summary

These preprocessing steps leave the dataset in a consistent, model-ready form. Combining the league data with H2H statistics and Elo ratings, removing unnecessary columns, and handling missing values are intended to improve the quality of our predictions. Scaling the numeric features puts them on a comparable range, so we can proceed to the modeling phase.