# 🧹 Preprocessing Techniques in Machine Learning
Data preprocessing turns raw data into a form a model can learn from. Filling gaps, encoding categories as numbers, and putting features on comparable scales improves data quality and, in turn, model performance.

## 📌 Common Preprocessing Techniques
1. Handling Missing Values
2. Encoding Categorical Variables
3. Feature Scaling (Normalization & Standardization)
4. Feature Engineering
5. Outlier Handling
6. Train/Test Split

## 1. Handling Missing Values

In [None]:
import pandas as pd
import numpy as np

# Sample data
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, np.nan, 30, 28],
        'Salary': [50000, 60000, np.nan, 58000]}
df = pd.DataFrame(data)

# Fill missing ages with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill missing salaries with the column median (less sensitive to extreme values)
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

print(df)
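
Before choosing a fill strategy, it usually helps to check how many values are actually missing. The sketch below reuses the sample `Age` and `Salary` columns from above; the names `missing_counts` and `filled` are illustrative, and filling every numeric column with its median in one call is just one possible approach.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, np.nan, 30, 28],
    "Salary": [50000, 60000, np.nan, 58000],
})

# Count missing values per column before deciding how to fill them
missing_counts = df.isna().sum()
print(missing_counts)

# Fill each numeric column with its own median in a single call
filled = df.fillna(df.median(numeric_only=True))
print(filled)
```

`fillna` returns a new DataFrame here, so the original `df` keeps its missing values for comparison.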

## 2. Encoding Categorical Variables

In [None]:
df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male'],
                   'City': ['New York', 'Paris', 'Paris', 'London']})

# Label encoding: map each category to an integer code
df['Gender_encoded'] = df['Gender'].map({'Male': 0, 'Female': 1})

# One-hot encoding: one indicator column per distinct city
df_encoded = pd.get_dummies(df, columns=['City'])

print(df_encoded)
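
One-hot encoding with `k` categories produces `k` columns, one of which is redundant because it can be inferred from the others. As a minimal sketch, assuming integer indicators are preferred over booleans, `pd.get_dummies` accepts `drop_first=True` and `dtype=int`; the variable name `encoded` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"City": ["New York", "Paris", "Paris", "London"]})

# drop_first=True drops the first category in sorted order ("London"),
# so a row of all zeros stands for London
encoded = pd.get_dummies(df, columns=["City"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one column is mainly useful for linear models, where fully redundant columns can cause collinearity; tree-based models generally do not need it.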

## 3. Feature Scaling (Normalization & Standardization)

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({'Score': [10, 20, 30, 40, 50]})

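# Normalization (Min-Max Scaling): rescales values to the range [0, 1]
minmax = MinMaxScaler()
# fit_transform returns a 2D array of shape (n, 1); ravel() flattens it to one column
df['Score_Normalized'] = minmax.fit_transform(df[['Score']]).ravel()

# Standardization (Z-score Scaling): rescales to mean 0 and standard deviation 1
scaler = StandardScaler()
df['Score_Standardized'] = scaler.fit_transform(df[['Score']]).ravel()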

print(df)
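
To make clear what the two scalers compute, the sketch below reproduces them by hand on the same `Score` values and compares the results with scikit-learn. The `manual_*` and `sk_*` names are illustrative. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas `.std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scores = pd.Series([10, 20, 30, 40, 50], dtype=float, name="Score")

# Min-max: (x - min) / (max - min)
manual_minmax = (scores - scores.min()) / (scores.max() - scores.min())

# Z-score: (x - mean) / std, using the population standard deviation
manual_z = (scores - scores.mean()) / scores.std(ddof=0)

# Same transformations via scikit-learn, flattened to 1D for comparison
sk_minmax = MinMaxScaler().fit_transform(scores.to_frame()).ravel()
sk_z = StandardScaler().fit_transform(scores.to_frame()).ravel()

print(np.allclose(manual_minmax, sk_minmax))
print(np.allclose(manual_z, sk_z))
```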

## 4. Feature Engineering Example

In [None]:
df = pd.DataFrame({'FirstName': ['Alice', 'Bob'], 'LastName': ['Smith', 'Jones']})

# Combine columns
df['FullName'] = df['FirstName'] + ' ' + df['LastName']

print(df)

## 5. Outlier Detection & Removal

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot for outlier detection
data = pd.DataFrame({'Income': [30, 32, 34, 33, 31, 150]})
sns.boxplot(x=data['Income'])
plt.title('Boxplot of Income')
plt.show()

# Remove outliers using the IQR rule: keep values within 1.5 * IQR of Q1 and Q3
Q1 = data['Income'].quantile(0.25)
Q3 = data['Income'].quantile(0.75)
IQR = Q3 - Q1
filtered = data[(data['Income'] >= Q1 - 1.5 * IQR) & (data['Income'] <= Q3 + 1.5 * IQR)]

print(filtered)
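
Removing rows is not the only way to handle outliers. As a sketch of one alternative, the same IQR bounds can be used to cap extreme values instead of dropping them, which keeps the row count unchanged. The names `lower`, `upper`, and `capped` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({"Income": [30, 32, 34, 33, 31, 150]})

q1 = data["Income"].quantile(0.25)
q3 = data["Income"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Cap values outside [lower, upper] at the nearest bound instead of removing them
capped = data["Income"].clip(lower=lower, upper=upper)
print(capped)
```

Whether to remove or cap depends on the data: capping preserves the sample size, but it also alters the values, so it is a judgment call rather than a default.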

## 6. Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

X = pd.DataFrame({'feature': [1, 2, 3, 4, 5]})
y = pd.Series([0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train:\n", X_train)
print("X_test:\n", X_test)
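
Although the split is listed last here, in practice it usually comes before any step that learns statistics from the data, such as scaling. Fitting a scaler on the full dataset lets information from the test rows leak into training. The sketch below, reusing the toy `X` and `y` from above, fits `StandardScaler` on the training rows only and reuses those statistics for the test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"feature": [1, 2, 3, 4, 5]})
y = pd.Series([0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only
X_train_scaled = scaler.fit_transform(X_train)
# Apply the training statistics to the test rows; do not refit
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The same pattern applies to imputation and encoding: fit on the training set, then call `transform` on the test set.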