# 🛠️ Data Preprocessing

This notebook is dedicated to **preprocessing the soccer player dataset** for our project **xGenius – Machine Learning for Soccer Player Performance Prediction**.  

The preprocessing steps are performed using the **`DataPreprocessor` class** from the `src` folder, which handles missing values, encodes categorical features, scales numeric features, and optionally drops the target column. All preprocessing steps are logged to ensure **traceability and reproducibility**.

---

## ⚡ What’s included in this notebook

1. **Loading the dataset**  
   Datasets from multiple leagues and seasons are combined using the `DataLoader` class; this notebook reads the merged result from `df_merged.csv`.

2. **Handling missing values**  
   - Numeric columns → filled with **mean**  
   - Categorical columns → filled with **mode**

3. **Encoding categorical columns**  
   - Using **Ordinal Encoding**  
   - Unknown values handled safely

4. **Scaling numeric columns**  
   - Applied **MinMax scaling** to normalize numeric features

5. **Dropping the target column**  
   - Optional, depending on whether you want features only for model training  
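
Steps 2 to 4 above can be illustrated with a minimal pandas-only sketch. This is not the `DataPreprocessor` implementation, whose internals are not shown in this notebook; the toy DataFrame, its values, and the choice of `-1` as the code for unknown categories are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; the real notebook loads df_merged.csv instead.
toy = pd.DataFrame({
    "goals": [2.0, None, 6.0, 4.0],
    "position": ["F", "M", None, "F"],
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.columns.difference(numeric_cols)

# Step 2: fill numeric gaps with the column mean, categorical gaps with the mode.
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())
for col in categorical_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

# Step 3: ordinal-encode each categorical column against a fixed category list.
# Values outside that list are mapped to -1 instead of raising an error.
categories = {col: sorted(toy[col].unique()) for col in categorical_cols}
for col in categorical_cols:
    toy[col] = pd.Categorical(toy[col], categories=categories[col]).codes

# An unseen category ("GK") is encoded as -1 (assumed convention for this sketch).
unknown_codes = pd.Categorical(["F", "GK"], categories=categories["position"]).codes

# Step 4: min-max scale the numeric columns to the [0, 1] range.
col_min = toy[numeric_cols].min()
col_range = (toy[numeric_cols].max() - col_min).replace(0, 1)  # avoid division by zero
toy[numeric_cols] = (toy[numeric_cols] - col_min) / col_range
```

Keeping the fitted category list around is what makes unknown values safe to handle: new data is encoded against the same list, so an unseen label gets a sentinel code instead of breaking the pipeline.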

---

### Why We Drop the Target Column Before Training/Preprocessing

In machine learning, the target column (the variable we want to predict) must be separated from the input features for a few important reasons:

**1. Avoid Data Leakage**  
If the target column remains in the feature set, the model could “see” the answer during training. This can cause the model to perform unrealistically well on training data but fail on new data.

**2. Feature-Target Separation**  
Input features (X) and target (y) must be separate so that training algorithms know what to predict. Features are the independent variables, and the target is the dependent variable.

**3. Preprocessing Safety**  
Many preprocessing steps (like normalization, encoding, scaling) should only be applied to features. Keeping the target in the feature set can distort these transformations.
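
The separation described in this section can be sketched in a few lines of pandas. The toy values are hypothetical, and the manual min-max formula stands in for whatever scaler the project actually uses; only the column name `xG` is taken from this notebook.

```python
import pandas as pd

# Hypothetical toy data; "xG" plays the role of the target column, as in this notebook.
toy = pd.DataFrame({
    "shots": [10, 20, 30],
    "key_passes": [5, 15, 10],
    "xG": [1.2, 2.9, 4.1],
})

target_column = "xG"

# Separate the target (y) from the input features (X).
y = toy[target_column]
X = toy.drop(columns=[target_column])

# Scale only the features; the target keeps its original units.
X_scaled = (X - X.min()) / (X.max() - X.min())
```

Because `X` no longer contains `xG`, a model trained on `X_scaled` cannot read the answer from its inputs, and the scaling statistics are computed from features alone.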

---

6. **Logging**  
   - All steps are recorded in a log file:  
   [02_data_preprocessing.log](../logs/02_data_preprocessing.log)
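
The project's `Logger` class lives in `logger_setup` and its interface is not shown here. As a rough idea of what file-based step logging can look like, here is a minimal sketch built on Python's standard `logging` module; the helper name `make_file_logger`, the log format, and the temporary log path are all hypothetical.

```python
import logging
import os
import tempfile


def make_file_logger(name, log_path):
    """Return a logger that appends timestamped messages to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers when a notebook cell is re-run
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Hypothetical log location in a temporary directory, used only for this demo.
log_path = os.path.join(tempfile.mkdtemp(), "preprocessing_demo.log")
demo_logger = make_file_logger("preprocessing_demo", log_path)
demo_logger.info("Filled missing numeric values with the column mean")
demo_logger.info("Scaled numeric columns with min-max scaling")
```

Writing one line per preprocessing step gives a timestamped trail that can be compared across runs, which is the traceability the notebook aims for.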

     

---

In [1]:
import os
import sys

import pandas as pd

# Make the project's `src` folder importable (absolute path on the author's machine)
sys.path.append(r"C:\Users\davro\OneDrive\Desktop\xGenius-Machine-Learning-for-Soccer-Player-Performance-Prediction\src")
from dp_02_data_preprocessing import DataPreprocessor
from logger_setup import Logger




In [2]:
# Log file that records every preprocessing step
log_file = r"C:\Users\davro\OneDrive\Desktop\xGenius-Machine-Learning-for-Soccer-Player-Performance-Prediction\logs\02_data_preprocessing.log"
logger = Logger(log_file)

# Load the merged dataset
df = pd.read_csv(r"C:\Users\davro\OneDrive\Desktop\xGenius-Machine-Learning-for-Soccer-Player-Performance-Prediction\Data\merged\df_merged.csv")



In [3]:
# Initialize preprocessor with logger
pre = DataPreprocessor(df, target_column="xG", logger=logger)

# Run preprocessing
clean_df = pre.process()

# Optionally drop target
clean_df = pre.drop_target()

In [4]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20926 entries, 0 to 20925
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            20926 non-null  float64
 1   player_name   20926 non-null  float64
 2   games         20926 non-null  float64
 3   time          20926 non-null  float64
 4   goals         20926 non-null  float64
 5   assists       20926 non-null  float64
 6   xA            20926 non-null  float64
 7   shots         20926 non-null  float64
 8   key_passes    20926 non-null  float64
 9   yellow_cards  20926 non-null  float64
 10  red_cards     20926 non-null  float64
 11  position      20926 non-null  float64
 12  team_title    20926 non-null  float64
 13  npg           20926 non-null  float64
 14  npxG          20926 non-null  float64
 15  xGChain       20926 non-null  float64
 16  xGBuildup     20926 non-null  float64
 17  league        20926 non-null  float64
 18  season        20926 non-nu

In [5]:
# Save preprocessed dataset for later use
clean_df.to_csv(r"C:\Users\davro\OneDrive\Desktop\xGenius-Machine-Learning-for-Soccer-Player-Performance-Prediction\Data\preprocessed\df_preprocessed.csv", index=False)