# Data Preprocessing Demo
This notebook shows a small end-to-end preprocessing demo: creating a small *dirty* dataset, cleaning and imputing, scaling numeric features, encoding categorical features, and visualizing _before vs. after_.
Run cells sequentially.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from IPython.display import display
np.random.seed(42)

## 1) Create a small 'dirty' dataset
We'll intentionally introduce problems: missing values, an implausible age (999), a negative salary (-100), an extreme salary (30000), and an unusual category ('Other').

In [None]:
data = {
    'Age': [25, np.nan, 47, 51, 23, 999, 35, np.nan, 28],
    'Salary': [5000, 6000, np.nan, 8000, 12000, 30000, -100, 7000, 6500],
    'Gender': ['M', 'F', 'F', np.nan, 'M', 'Other', 'F', 'F', 'M'],
    'City': ['SP', 'RJ', 'SP', 'MG', np.nan, 'RJ', 'RS', 'SP', 'RJ']
}

df = pd.DataFrame(data)
print('🔹 Original dirty data (first rows):')
display(df)

### Visual: Missing values heatmap (before)
With matplotlib's default colormap, bright (yellow) cells are missing and dark cells are present. We'll use matplotlib's `imshow` for a clean look.

In [None]:
plt.figure(figsize=(6,2.5))
plt.imshow(df.isnull().T, aspect='auto', interpolation='nearest')
plt.yticks(range(df.shape[1]), df.columns)
plt.xticks(range(df.shape[0]), range(df.shape[0]))
plt.title('Missing values (before)')
plt.xlabel('Row index')
plt.colorbar(label='missing (True=1, False=0)')
plt.show()
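
A numeric count per column is a useful cross-check on the heatmap. The sketch below rebuilds the same values so it runs on its own; `toy` and `missing_per_column` are names introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same values as the demo dataset, rebuilt so this cell is self-contained.
toy = pd.DataFrame({
    'Age': [25, np.nan, 47, 51, 23, 999, 35, np.nan, 28],
    'Salary': [5000, 6000, np.nan, 8000, 12000, 30000, -100, 7000, 6500],
    'Gender': ['M', 'F', 'F', np.nan, 'M', 'Other', 'F', 'F', 'M'],
    'City': ['SP', 'RJ', 'SP', 'MG', np.nan, 'RJ', 'RS', 'SP', 'RJ'],
})

# isnull() marks each missing cell as True; summing counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Age has two missing cells and the other columns one each. The outliers (999, 30000, -100) are not counted here because they are still ordinary numbers at this stage.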

## 2) Cleaning & Imputation
Steps:
- Fix obvious outliers (Age > 100 → NaN)
- Treat negative or extremely large Salary values (< 0 or > 20000) as missing (→ NaN)
- Impute numeric features (Age with the median, Salary with the mean)
- Impute categorical features (Gender with the mode, City with an 'Unknown' placeholder)

In [None]:
df_clean = df.copy()
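# Values above 100 are implausible ages; mask() turns them into NaN.
df_clean['Age'] = df_clean['Age'].mask(df_clean['Age'] > 100)
# Negative or very large salaries are treated as data-entry errors.
df_clean['Salary'] = df_clean['Salary'].mask((df_clean['Salary'] < 0) | (df_clean['Salary'] > 20000))

# Assign the result of fillna(): calling it with inplace=True on a selected
# column warns in pandas 2.x and may not modify df_clean under copy-on-write.
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())
df_clean['Salary'] = df_clean['Salary'].fillna(df_clean['Salary'].mean())
df_clean['Gender'] = df_clean['Gender'].fillna(df_clean['Gender'].mode()[0])
df_clean['City'] = df_clean['City'].fillna('Unknown')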

print('After cleaning & imputation:')
display(df_clean)
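
To see which values the imputation fills in, the statistics can be computed by hand. This is a minimal sketch on standalone copies of the three columns, using the same thresholds as above; the variable names (`age_valid`, `fill_age`, and so on) are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

age = pd.Series([25, np.nan, 47, 51, 23, 999, 35, np.nan, 28])
salary = pd.Series([5000, 6000, np.nan, 8000, 12000, 30000, -100, 7000, 6500])
gender = pd.Series(['M', 'F', 'F', np.nan, 'M', 'Other', 'F', 'F', 'M'])

# Turn out-of-range values into NaN first, so they do not distort the statistics.
age_valid = age.mask(age > 100)
salary_valid = salary.mask((salary < 0) | (salary > 20000))

# median(), mean() and mode() all skip NaN by default.
fill_age = age_valid.median()        # middle of 23, 25, 28, 35, 47, 51
fill_salary = salary_valid.mean()    # 44500 / 6
fill_gender = gender.mode()[0]       # most frequent label

print(fill_age, round(fill_salary, 2), fill_gender)
```

With these inputs the fill values are 31.5 for Age, about 7416.67 for Salary, and 'F' for Gender. The median is less sensitive to any extreme values that survive the outlier step, which is a common reason to prefer it over the mean.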

### Visual: Missing values heatmap (after)
All NaNs should now be handled, so the heatmap should show a single uniform color (no missing cells).

In [None]:
plt.figure(figsize=(6,2.5))
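# Fix the color range to 0..1 so this plot uses the same scale as the "before" heatmap.
plt.imshow(df_clean.isnull().T, aspect='auto', interpolation='nearest', vmin=0, vmax=1)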
plt.yticks(range(df_clean.shape[1]), df_clean.columns)
plt.xticks(range(df_clean.shape[0]), range(df_clean.shape[0]))
plt.title('Missing values (after)')
plt.xlabel('Row index')
plt.colorbar(label='missing (True=1, False=0)')
plt.show()

## 3) Scaling numeric features
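We'll standardize Age and Salary (mean = 0, std = 1) using `StandardScaler`. Standardization matters for scale-sensitive models such as k-nearest neighbors, SVMs, and models trained with gradient descent; tree-based models are largely unaffected by it.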

In [None]:
scaler = StandardScaler()
df_scaled = df_clean.copy()
df_scaled[['Age','Salary']] = scaler.fit_transform(df_scaled[['Age','Salary']])

print('📏 After scaling (first rows):')
display(df_scaled)
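
Standardization is the transform z = (x - mean) / std, applied per column. The sketch below checks this against `StandardScaler` on a small illustrative array (`x`, `z_sklearn`, and `z_manual` are names made up for this example). Note that `StandardScaler` uses the population standard deviation (`ddof=0`), not the sample one that pandas' `Series.std()` uses by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature column with a few illustrative ages.
x = np.array([[25.0], [47.0], [51.0], [23.0], [35.0], [28.0]])

# Library result.
z_sklearn = StandardScaler().fit_transform(x)

# Manual result: subtract the column mean, divide by the population std (ddof=0).
z_manual = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

print(np.allclose(z_sklearn, z_manual))
```

In a real modeling workflow the scaler would typically be fit on the training split only and then applied to the test split with `transform`, so that test statistics do not leak into training.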

## 4) Encoding categorical features
We'll convert `Gender` and `City` to numeric columns using one-hot encoding (`drop_first=True` drops one dummy column per feature, since it is implied by the others).

In [None]:
df_final = pd.get_dummies(df_scaled, columns=['Gender','City'], drop_first=True)
print('Final dataset ready for modeling (first rows):')
display(df_final)
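
The effect of `drop_first=True` is easiest to see on a tiny example. This is a minimal sketch with a made-up one-column frame (`toy`, `full`, and `reduced` are illustrative names), comparing the encoded columns with and without the option.

```python
import pandas as pd

toy = pd.DataFrame({'Gender': ['M', 'F', 'Other', 'F']})

# One dummy column per category.
full = pd.get_dummies(toy, columns=['Gender'])

# Drops the first category in sorted order ('F'), which becomes the baseline.
reduced = pd.get_dummies(toy, columns=['Gender'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In `reduced`, a row where both `Gender_M` and `Gender_Other` are false represents the dropped category 'F', so no information is lost.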

## 5) Before vs After distributions
We'll show histograms of numeric columns before cleaning and after scaling to highlight the change.

In [None]:
plt.figure(figsize=(12,4))

plt.subplot(1,2,1)
plt.hist(df['Age'].dropna(), bins=6)
plt.title('Age (original)')

plt.subplot(1,2,2)
plt.hist(df['Salary'].dropna(), bins=6)
plt.title('Salary (original)')
plt.show()

plt.figure(figsize=(12,4))

plt.subplot(1,2,1)
plt.hist(df_scaled['Age'], bins=6)
plt.title('Age (scaled)')

plt.subplot(1,2,2)
plt.hist(df_scaled['Salary'], bins=6)
plt.title('Salary (scaled)')
plt.show()