# Data Preprocessing and Feature Engineering

Data preprocessing and feature engineering are critical steps in the machine learning pipeline. They involve preparing raw data and transforming it into a format that machine learning algorithms can use effectively. How is this different from EDA? EDA (Exploratory Data Analysis) focuses on understanding the data, identifying patterns, and uncovering insights through visualization and statistical analysis. In contrast, data preprocessing and feature engineering are about cleaning the data, standardizing or normalizing it, and creating new features to improve model performance.

EDA -> Understanding the data, identifying patterns, uncovering insights.

Data Preprocessing and Feature Engineering -> Cleaning the data, standardizing/normalizing, creating new features.

## Data Preprocessing Steps

##### Step 1: Handling Missing Values

- Identify missing values in the dataset.
- Decide on a strategy to handle them (e.g., removal, imputation).
- If imputing, choose an appropriate method (mean, median, mode, or predictive modeling). As a rule of thumb, use the median for numerical data that is skewed or contains outliers, the mean for numerical data that is roughly normally distributed, and the mode for categorical data.

```python
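import pandas as pd

df = pd.read_csv('data.csv')

# Check for missing values in each column
print(df.isnull().sum())

# fillna() returns a new Series, so assign the result back to the column
# Numerical data: impute with the median
df['column_having_missing_values'] = df['column_having_missing_values'].fillna(df['column_having_missing_values'].median())

# Categorical data: impute with the mode (most frequent value)
df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])

# Check again to confirm that the missing values are gone
print(df.isnull().sum())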
```
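
The strategies listed above (removal versus imputation) can be compared side by side on a small example. This is a minimal sketch on a made-up DataFrame; `df_demo` and its column names `age` and `city` are assumptions for illustration, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df_demo = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, 120.0],           # numeric, with one outlier
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],  # categorical
})

# 1. Identify: count and share of missing values per column
missing_count = df_demo.isnull().sum()
missing_share = df_demo.isnull().mean()

# 2a. Removal: drop every row that has at least one missing value
df_dropped = df_demo.dropna()

# 2b. Imputation: median for the numeric column, mode for the categorical one
df_imputed = df_demo.copy()
df_imputed["age"] = df_imputed["age"].fillna(df_imputed["age"].median())
df_imputed["city"] = df_imputed["city"].fillna(df_imputed["city"].mode()[0])

print(missing_count)
print(missing_share)
print(df_dropped.shape)
print(df_imputed)
```

Removal is simple but shrinks the dataset (here from 5 rows to 3), while imputation keeps every row at the cost of introducing estimated values. The median is used for `age` because the outlier (120) would pull the mean upward.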

##### Step 2: Encoding Categorical Variables

- Convert categorical variables into numerical format.
- Choose the technique based on the nature of the categorical data: one-hot encoding for non-hierarchical (unordered) categories, label encoding for binary (yes/no) categories, and ordinal encoding for hierarchical (ordered) categories.

```python
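from sklearn.preprocessing import LabelEncoder

# Columns to encode (placeholder name; replace with your own columns)
categorical_columns = ['categorical_column']

# Option 1: Label Encoding
# Use a fresh encoder per column and keep it, so the mapping can be reversed later
df_label = df.copy()
label_encoders = {}
for col in categorical_columns:
    label_encoders[col] = LabelEncoder()
    df_label[col] = label_encoders[col].fit_transform(df_label[col])

# Option 2: One-Hot Encoding
# get_dummies() returns a new DataFrame, so assign the result
# By default it generates True/False columns; dtype=int converts them to 0/1
df_onehot = pd.get_dummies(df, columns=categorical_columns, dtype=int)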
```
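
Ordinal encoding is mentioned above but not shown. A minimal sketch, assuming a hypothetical `priority` column whose categories have a natural order (`low < medium < high`), could use an ordered pandas `Categorical`:

```python
import pandas as pd

# Hypothetical column with a natural order: low < medium < high
df_demo = pd.DataFrame({"priority": ["low", "high", "medium", "low"]})

# State the order explicitly; otherwise categories are sorted alphabetically
order = ["low", "medium", "high"]
df_demo["priority_encoded"] = pd.Categorical(
    df_demo["priority"], categories=order, ordered=True
).codes

print(df_demo)
```

Passing the order explicitly matters: with alphabetical sorting, `high` would receive a smaller code than `low`, which would reverse the intended ranking.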

##### Step 3: Scaling and Normalization

- Check whether the features in your dataset have very different ranges of values, then decide whether scaling is required.
- If it is, apply a scaling technique such as Min-Max Scaling or Standardization (Z-score normalization) so that features with large ranges do not dominate features with small ranges.
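
One quick way to make that check is to compare the minimum and maximum of each numeric feature. This is a minimal sketch on a made-up DataFrame; `df_demo` and the columns `age` and `income` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with two features on very different scales
df_demo = pd.DataFrame({
    "age": [22, 35, 58, 41],
    "income": [25000, 48000, 120000, 67000],
})

# Min, max and range of every numeric column, one row per feature
numeric = df_demo.select_dtypes(include="number")
ranges = numeric.agg(["min", "max"]).T
ranges["range"] = ranges["max"] - ranges["min"]

print(ranges)
```

Here `income` spans tens of thousands while `age` spans a few dozen, which is the kind of gap where scaling usually helps distance-based and gradient-based models.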

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# Separate the features from the target
target = 'target_column'
X = df.drop(columns=[target])   # Select all columns except the target column
y = df[target]                  # Select only the target column

# Split the data into training and testing sets BEFORE scaling to avoid data leakage:
# the scaler must learn its parameters from the training data only, so that no
# information from the test set influences training.
# test_size=0.25  -> 25% of the data is used for testing and 75% for training.
# random_state=42 -> fixes the random seed, so every run with the same value
#                    produces the same split (reproducibility).
# Why 42? It is just a commonly used arbitrary number; any fixed integer works.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit to training data and transform it
X_test_scaled = scaler.transform(X_test)        # Transform test data

# fit_transform() is used on the training data to compute the mean and standard
# deviation and then transform the data. transform() is used on the test data to
# apply the same transformation with the parameters learned from the training data.

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)  # Fit to training data and transform it
X_test_minmax = min_max_scaler.transform(X_test)        # Transform test data

# Convert scaled arrays back to DataFrames for easier handling
# Standardization 
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

# Min-Max Scaling
X_train_minmax = pd.DataFrame(X_train_minmax, columns=X_train.columns, index=X_train.index)
X_test_minmax = pd.DataFrame(X_test_minmax, columns=X_test.columns, index=X_test.index)

# Display the first few rows of the scaled data
# (display() is available in Jupyter/IPython; use print() in a plain script)
display(X_train_scaled.head())
display(X_test_scaled.head())
display(X_train_minmax.head())
display(X_test_minmax.head())
```
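
To see what the two scalers compute, the same transformations can be written out by hand with NumPy. This is a minimal sketch on made-up one-dimensional arrays (`train` and `test` are assumptions for illustration); it uses the formulas z = (x - mean) / std for standardization and (x - min) / (max - min) for min-max scaling, with all statistics taken from the training data only.

```python
import numpy as np

# Hypothetical single feature, already split into train and test
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([25.0, 50.0])

# Standardization: statistics come from the training data only
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses
mean, std = train.mean(), train.std()
train_z = (train - mean) / std
test_z = (test - mean) / std

# Min-Max scaling: min and max also come from the training data only
lo, hi = train.min(), train.max()
train_mm = (train - lo) / (hi - lo)
test_mm = (test - lo) / (hi - lo)

print(train_z, test_z)
print(train_mm, test_mm)
```

Because the test data reuses the training statistics, scaled test values are not guaranteed to have zero mean or to stay inside [0, 1]; the test value 50 lies above the training maximum and so maps to a value greater than 1. That is expected and is not a sign of a bug.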