# Data Preprocessing for Premier League 2022/23 Dataset

This notebook documents the steps used to clean and preprocess the Premier League dataset for the 2022/23 season.  
The dataset (`Premier_League.csv`) contains match statistics including attendance, goals scored, possession percentages, and shot counts.

**After loading the dataset, this notebook covers:**
1. Cleaning missing data (especially in the attendance column)
2. Creating a target variable for match outcomes
3. Feature engineering (calculating differences in possession and shots)
4. Selecting the final features
5. Saving the cleaned dataset

Let's get started!

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Load the dataset
df = pd.read_csv('Premier_League.csv')

# Display the first few rows to understand the data structure
df.head()
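
Before cleaning, it can help to check how missing values are actually encoded, because string placeholders such as "nan" are not counted by `isna()`. The sketch below uses a small made-up frame (`sample`, with invented stadium names and attendance figures) rather than the real file, so the names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical three-row sample; the values are invented for illustration.
sample = pd.DataFrame({
    "stadium": ["Stadium A", None, "Stadium C"],
    "attendance": ["25,286", "nan", "41,000"],
})

# True missing values per column (None/NaN only).
missing_counts = sample.isna().sum()

# String placeholders that look like missing values but are ordinary strings.
placeholder_count = int(sample["attendance"].isin(["Nan", "NaN", "nan"]).sum())

print(missing_counts)
print(f"String placeholders in attendance: {placeholder_count}")
```

On this sample, `isna()` reports no missing attendance values even though one entry is the string "nan", which is why the placeholders have to be replaced explicitly.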

## 1. Clean Missing Data

In this step, we address missing or improperly formatted data:

- **Attendance Column:**  
  The attendance column might contain string representations of missing values ("Nan", "NaN", "nan"). We replace these with actual `np.nan`.

- **Drop Missing Rows:**  
  We drop rows where critical columns (i.e., 'stadium' and 'attendance') have missing values.

- **Clean Attendance Values:**  
  Remove commas from the attendance strings and convert them to integers.

In [None]:
# Replace string representations of missing values in the attendance column with np.nan
df['attendance'] = df['attendance'].replace(['Nan', 'NaN', 'nan'], np.nan)

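# Drop rows with missing values in critical columns: 'stadium' and 'attendance'
# (.copy() makes the result an independent DataFrame, so the assignment below is unambiguous)
df = df.dropna(subset=['stadium', 'attendance']).copy()

# Remove commas (literal match, not a regex) and convert the attendance column to integers
df['attendance'] = df['attendance'].astype(str).str.replace(',', '', regex=False).astype(int)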
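
As an alternative to listing every placeholder spelling, the same cleaning can be sketched with `pd.to_numeric(..., errors="coerce")`, which turns anything non-numeric into `NaN` in one pass. This is a variation on the step above, not the notebook's method; the `raw` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw values, invented for illustration.
raw = pd.DataFrame({
    "stadium": ["Stadium A", "Stadium B", None, "Stadium D"],
    "attendance": ["25,286", "NaN", "41,000", " 9,500 "],
})

cleaned = raw.copy()

# Strip commas and surrounding whitespace, then coerce non-numeric entries to NaN.
cleaned["attendance"] = pd.to_numeric(
    cleaned["attendance"].str.replace(",", "", regex=False).str.strip(),
    errors="coerce",
)

# Drop rows missing either critical column, then store attendance as integers.
cleaned = cleaned.dropna(subset=["stadium", "attendance"]).copy()
cleaned["attendance"] = cleaned["attendance"].astype(int)

print(cleaned)
```

The trade-off is that coercion silently discards any unexpected text, so it is worth comparing row counts before and after when using it on real data.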

## 2. Create Target Variable

In [None]:
# Define conditions for match outcomes (Home Win, Away Win, Draw)
conditions = [
    df['Goals Home'] > df['Away Goals'],
    df['Goals Home'] < df['Away Goals'],
    df['Goals Home'] == df['Away Goals']
]
choices = ['Home Win', 'Away Win', 'Draw']

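# Create a new column 'outcome' based on these conditions.
# np.select defaults to 0 for rows matching no condition, which does not fit a column of
# string labels (and raises a TypeError on recent NumPy versions), so we pass a string default.
# The 'Unknown' label is our own choice for rows with a missing score.
df['outcome'] = np.select(conditions, choices, default='Unknown')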
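
To see how `np.select` labels each row, including one where no condition holds, here is a minimal sketch on made-up scorelines. The `toy` frame and the `'Unknown'` fallback label are illustrative assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical scorelines; the last row has a missing home score.
toy = pd.DataFrame({
    "Goals Home": [2, 0, 1, np.nan],
    "Away Goals": [1, 3, 1, 2],
})

conditions = [
    toy["Goals Home"] > toy["Away Goals"],
    toy["Goals Home"] < toy["Away Goals"],
    toy["Goals Home"] == toy["Away Goals"],
]
choices = ["Home Win", "Away Win", "Draw"]

# Comparisons with NaN are all False, so that row receives the string default.
toy["outcome"] = np.select(conditions, choices, default="Unknown")

print(toy["outcome"].tolist())
# → ['Home Win', 'Away Win', 'Draw', 'Unknown']
```

A string default keeps the column a single type, and any `'Unknown'` rows in `value_counts()` point to scores that need checking.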

## 3. Feature Engineering

In [None]:
# Calculate the difference in possession percentages between the home and away teams
df['possession_difference'] = df['home_possessions'] - df['away_possessions']

# Calculate the difference in the number of shots taken by the home and away teams
df['shot_difference'] = df['home_shots'] - df['away_shots']
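
The sign convention of these features can be checked on a tiny made-up example: positive values mean the home team had more of the ball or took more shots, negative values favor the away team. The numbers in `toy` are invented, and the sketch assumes both possession columns are already numeric.

```python
import pandas as pd

# Hypothetical match statistics, invented for illustration.
toy = pd.DataFrame({
    "home_possessions": [62.0, 45.5],
    "away_possessions": [38.0, 54.5],
    "home_shots": [15, 8],
    "away_shots": [7, 12],
})

# Home minus away: positive favors the home team, negative the away team.
toy["possession_difference"] = toy["home_possessions"] - toy["away_possessions"]
toy["shot_difference"] = toy["home_shots"] - toy["away_shots"]

print(toy[["possession_difference", "shot_difference"]])
```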

## 4. Select Final Features

In [None]:
# Define the columns we want to keep for further analysis
final_columns = [
    'possession_difference',
    'shot_difference',
    'attendance',
    'outcome'
]

# Subset the DataFrame to keep only these final columns
df = df[final_columns]

## 5. Save the Cleaned Dataset

In [None]:
df.to_csv('premier_league_cleaned.csv', index=False)

# Print confirmation and display some summary information
print("Dataset cleaned successfully!")
print(f"Final dataset shape: {df.shape}")
print("Outcome distribution:")
print(df['outcome'].value_counts())
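
A quick round-trip check can confirm that the saved file reloads with the expected columns and no missing values. The sketch below writes to an in-memory buffer instead of disk and uses a made-up two-row frame (`cleaned`), so it illustrates the check rather than reproducing the real output.

```python
import io

import pandas as pd

# Hypothetical cleaned rows, invented for illustration.
cleaned = pd.DataFrame({
    "possession_difference": [24.0, -9.0],
    "shot_difference": [8, -4],
    "attendance": [25286, 9500],
    "outcome": ["Home Win", "Away Win"],
})

# Write to an in-memory buffer so the sketch leaves no files behind.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded frame should keep the same columns and contain no missing values.
columns_match = list(reloaded.columns) == list(cleaned.columns)
has_missing = bool(reloaded.isna().any().any())

print(f"Columns match: {columns_match}, missing values: {has_missing}")
```

For the real file, the same check would call `pd.read_csv('premier_league_cleaned.csv')` in place of the buffer.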

## Conclusion

In this notebook, we have:
- Loaded the `Premier_League.csv` dataset.
- Performed data cleaning, including handling missing values and fixing the attendance column.
- Created a target variable for match outcomes.
- Engineered new features (possession and shot differences).
- Selected final features and saved the cleaned data.

**Next Steps:**  
You can use this cleaned dataset to perform further exploratory data analysis, build predictive models, or develop visualizations that provide insights into team performances over the season.