# Class 2: Data Preprocessing Techniques

**Week 4: Intermediate Python and Data Preprocessing**

## Objectives
- Encode categorical variables (e.g., one-hot encoding, label encoding).
- Scale numerical features using normalization and standardization.
- Detect and handle outliers to clean datasets.
- Understand why preprocessing is critical for AI models.

## Dataset
We'll continue using the Titanic dataset (`titanic.csv`) with columns like `PassengerId`, `Pclass`, `Name`, `Sex`, `Age`, `Fare`, `Embarked`, and `Survived`. This dataset includes categorical and numerical features, ideal for practicing preprocessing.

## Instructions
- Run the setup cell to load libraries and the dataset.
- Complete the exercises by filling in the code cells.
- Use the hints if you're stuck.
- Save your notebook and submit it if required.

## Setup
Run the cell below to import libraries and load the Titanic dataset.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Load the Titanic dataset
titanic = pd.read_csv('data/titanic.csv')

# Display the first few rows
print('Titanic dataset:')
print(titanic.head())

# Check data types and missing values
print('\nData Info:')
titanic.info()  # info() prints its report directly and returns None

## Exercise 1: Feature Encoding

**Goal**: Convert categorical variables into numerical formats for machine learning.

**Task**: Encode the `Sex` and `Embarked` columns:
- Use **one-hot encoding** for `Sex` (creates binary columns like `Sex_male`, `Sex_female`).
- Use **label encoding** for `Embarked` (maps categories like 'C', 'Q', 'S' to numbers).

**Steps**:
1. Use `pd.get_dummies()` for one-hot encoding `Sex`.
2. Create a dictionary to map `Embarked` values (e.g., `{'C': 0, 'Q': 1, 'S': 2}`) and apply it using `map()`.
3. Display the first 5 rows of the modified dataset.

**Hint**: Handle missing values in `Embarked` by filling with the most common value (mode) before encoding.
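
To make the difference between the two encodings concrete, here is a minimal sketch on a made-up `color` column (the toy data and the `color_mapping` dictionary are assumptions for illustration, not part of the Titanic dataset). It shows the same two calls the steps above refer to, `pd.get_dummies()` and `map()`.

```python
import pandas as pd

# Toy data (hypothetical, not taken from titanic.csv)
toy = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy, columns=['color'])

# Label encoding: map each category to an integer via a dictionary
color_mapping = {'blue': 0, 'green': 1, 'red': 2}
toy['color_label'] = toy['color'].map(color_mapping)

print(one_hot.columns.tolist())
print(toy['color_label'].tolist())
```

Note that `map()` returns `NaN` for any value that is not a key in the dictionary, which is one reason to fill missing values before encoding.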

In [None]:
# Your code here

# Handle missing values in Embarked
titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])

# One-hot encode Sex
titanic = # YOUR CODE

# Label encode Embarked
embarked_mapping = {'C': 0, 'Q': 1, 'S': 2}
titanic['Embarked'] = # YOUR CODE

# Display the result
print('Dataset after encoding:')
print(titanic[['Sex_male', 'Sex_female', 'Embarked']].head())

## Solution (Instructor Reference)

Uncomment and run the cell below to check your work. Try to complete the exercise yourself first!

```python
# titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])
# titanic = pd.get_dummies(titanic, columns=['Sex'], drop_first=False)
# embarked_mapping = {'C': 0, 'Q': 1, 'S': 2}
# titanic['Embarked'] = titanic['Embarked'].map(embarked_mapping)
# print('Dataset after encoding:')
# print(titanic[['Sex_male', 'Sex_female', 'Embarked']].head())
```

## Exercise 2: Normalization

**Goal**: Scale numerical features to the [0, 1] range using min-max normalization.

**Task**: Normalize the `Fare` column using scikit-learn's `MinMaxScaler`.
- Fit the scaler to `Fare`.
- Transform `Fare` and add it as a new column `Fare_normalized`.
- Verify the min and max values of `Fare_normalized`.

**Steps**:
1. Create a `MinMaxScaler` object.
2. Reshape `Fare` to a 2D array (required by scikit-learn) using `.values.reshape(-1, 1)`.
3. Fit and transform `Fare`, then add the result to the DataFrame.
4. Check min and max with `.min()` and `.max()`.

**Hint**: Use `scaler.fit_transform()` to do both steps at once.
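
As a sanity check on what `MinMaxScaler` computes, the sketch below applies the min-max formula, (x - min) / (max - min), by hand and compares it with the scaler's output. The `price` column is a made-up example, so the numbers say nothing about `Fare`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical)
toy = pd.DataFrame({'price': [10.0, 20.0, 40.0, 50.0]})

# Manual min-max scaling: (x - min) / (max - min)
price = toy['price']
manual = (price - price.min()) / (price.max() - price.min())

# Same computation with scikit-learn; the input must be 2D
scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy['price'].values.reshape(-1, 1)).ravel()

print(manual.tolist())
print(scaled.tolist())
print(np.allclose(manual, scaled))
```

Selecting with double brackets, `toy[['price']]`, also yields a 2D input and is an alternative to reshaping.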

In [None]:
# Your code here

# Initialize MinMaxScaler
scaler = # YOUR CODE

# Normalize Fare
titanic['Fare_normalized'] = # YOUR CODE

# Verify min and max
print('Fare_normalized min:', titanic['Fare_normalized'].min())
print('Fare_normalized max:', titanic['Fare_normalized'].max())
print(titanic[['Fare', 'Fare_normalized']].head())

## Solution (Instructor Reference)

Uncomment and run the cell below to check your work.

```python
# scaler = MinMaxScaler()
# titanic['Fare_normalized'] = scaler.fit_transform(titanic['Fare'].values.reshape(-1, 1)).ravel()
# print('Fare_normalized min:', titanic['Fare_normalized'].min())
# print('Fare_normalized max:', titanic['Fare_normalized'].max())
# print(titanic[['Fare', 'Fare_normalized']].head())
```

## Exercise 3: Standardization

**Goal**: Scale numerical features to have mean=0 and standard deviation=1 using standardization.

**Task**: Standardize the `Age` column using `StandardScaler`.
- Handle missing `Age` values by filling with the median.
- Fit the scaler to `Age`.
- Transform `Age` and add it as `Age_standardized`.
- Verify the mean and std (should be ~0 and ~1; pandas `.std()` uses the sample formula with `ddof=1`, so expect a value slightly above 1).

**Steps**:
1. Fill missing `Age` values with `titanic['Age'].median()`.
2. Use `StandardScaler` to standardize `Age`.
3. Check mean and std with `.mean()` and `.std()`.

**Hint**: Use `scaler.fit_transform()` and reshape `Age` like in Exercise 2.
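
The following sketch, on a made-up `height` column, compares `StandardScaler` with a hand-computed z-score, (x - mean) / std. It also illustrates why the verification step may print a standard deviation slightly above 1: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to the sample version (`ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical)
toy = pd.DataFrame({'height': [150.0, 160.0, 170.0, 180.0]})

# Standardize with scikit-learn; the input must be 2D
scaler = StandardScaler()
z = scaler.fit_transform(toy['height'].values.reshape(-1, 1)).ravel()

# Manual z-score using the population standard deviation (ddof=0)
height = toy['height']
manual = (height - height.mean()) / height.std(ddof=0)

print('mean:', z.mean())
print('std with ddof=0 (NumPy default):', z.std())
print('std with ddof=1 (pandas default):', pd.Series(z).std())
```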

In [None]:
# Your code here

# Handle missing Age values
titanic['Age'] = # YOUR CODE

# Initialize StandardScaler
scaler = # YOUR CODE

# Standardize Age
titanic['Age_standardized'] = # YOUR CODE

# Verify mean and std
print('Age_standardized mean:', titanic['Age_standardized'].mean())
print('Age_standardized std:', titanic['Age_standardized'].std())
print(titanic[['Age', 'Age_standardized']].head())

## Solution (Instructor Reference)

Uncomment and run the cell below to check your work.

```python
# titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
# scaler = StandardScaler()
# titanic['Age_standardized'] = scaler.fit_transform(titanic['Age'].values.reshape(-1, 1)).ravel()
# print('Age_standardized mean:', titanic['Age_standardized'].mean())
# print('Age_standardized std:', titanic['Age_standardized'].std())
# print(titanic[['Age', 'Age_standardized']].head())
```

## Exercise 4: Outlier Detection

**Goal**: Identify and handle outliers to improve data quality.

**Task**: Detect outliers in `Fare` using the Interquartile Range (IQR) method.
- Calculate Q1, Q3, and IQR for `Fare`.
- Identify outliers (values below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`).
- Print the number of outliers and display a few example rows.

**Steps**:
1. Compute Q1 and Q3 using `quantile(0.25)` and `quantile(0.75)`.
2. Calculate IQR = Q3 - Q1.
3. Define outlier bounds and filter the DataFrame.
4. Count and display outliers.

**Hint**: Use boolean indexing to select outliers.
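
The IQR steps above can be sketched on a small made-up Series before applying them to `Fare`. The values are chosen so that one point is obviously extreme; the variable names are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical): eight ordinary values and one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Outlier bounds
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean indexing: keep only values outside the bounds
outliers = values[(values < lower) | (values > upper)]

print('Q1:', q1, 'Q3:', q3, 'IQR:', iqr)
print('Bounds:', lower, upper)
print('Outliers:', outliers.tolist())
```

The parentheses around each comparison are required because `|` binds more tightly than `<` and `>` in Python.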

In [None]:
# Your code here

# Calculate Q1, Q3, and IQR for Fare
Q1 = # YOUR CODE
Q3 = # YOUR CODE
IQR = # YOUR CODE

# Define outlier bounds
lower_bound = # YOUR CODE
upper_bound = # YOUR CODE

# Identify outliers
outliers = # YOUR CODE

# Print results
print('Number of outliers in Fare:', len(outliers))
print('Example outliers:')
print(outliers[['Fare']].head())

## Solution (Instructor Reference)

Uncomment and run the cell below to check your work.

```python
# Q1 = titanic['Fare'].quantile(0.25)
# Q3 = titanic['Fare'].quantile(0.75)
# IQR = Q3 - Q1
# lower_bound = Q1 - 1.5 * IQR
# upper_bound = Q3 + 1.5 * IQR
# outliers = titanic[(titanic['Fare'] < lower_bound) | (titanic['Fare'] > upper_bound)]
# print('Number of outliers in Fare:', len(outliers))
# print('Example outliers:')
# print(outliers[['Fare']].head())
```
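
The exercise stops at detection, while the goal also mentions handling outliers. As one possible illustration, assuming the same IQR bounds, the sketch below shows two common options on toy data: capping values at the bounds with `Series.clip()` and dropping rows outside the bounds. Which option suits `Fare`, if any, is a judgment call that this sketch does not settle.

```python
import pandas as pd

# Toy data (hypothetical)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# IQR bounds, computed as in the exercise
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: cap extreme values at the bounds (keeps every row)
capped = values.clip(lower=lower, upper=upper)

# Option 2: drop values outside the bounds (loses rows)
dropped = values[values.between(lower, upper)]

print('Capped max:', capped.max())
print('Rows kept after dropping:', len(dropped))
```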

## Bonus Challenge

**Task**: Apply one-hot encoding to `Pclass` and normalize both `Age` and `Fare` with a single `MinMaxScaler`.
- Create a new DataFrame with encoded `Pclass`, normalized `Age`, and normalized `Fare`.
- Display the first 5 rows.

**Hint**: Use `scaler.fit_transform()` on a DataFrame with multiple columns.
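
To illustrate the hint without solving the challenge, here is a sketch of the pattern on made-up columns (`grade`, `height`, and `price` are hypothetical stand-ins). A single `MinMaxScaler` scales each column independently, and `pd.concat()` is one way to combine the pieces into a new DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical)
toy = pd.DataFrame({
    'grade': [1, 2, 3, 1],
    'height': [150.0, 160.0, 170.0, 180.0],
    'price': [10.0, 20.0, 40.0, 50.0],
})

# One-hot encode the categorical column
encoded = pd.get_dummies(toy[['grade']], columns=['grade'])

# Scale two numeric columns with one scaler; each column gets its own min and max
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy[['height', 'price']]),
    columns=['height_normalized', 'price_normalized'],
    index=toy.index,
)

# Combine into a new DataFrame
result = pd.concat([encoded, scaled], axis=1)
print(result.head())
```

Passing `index=toy.index` keeps the rows aligned when the pieces are concatenated.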

In [None]:
# Your code here

# One-hot encode Pclass and normalize Age and Fare
# YOUR CODE

print('Bonus Result:')
# print YOUR RESULT

## Discussion Questions
1. Why is one-hot encoding preferred over label encoding for non-ordinal categories?
2. When should you use normalization vs. standardization?
3. How do outliers affect machine learning models, and what are other ways to handle them?

Feel free to jot down your thoughts in a new markdown cell below!

## Your Notes

(Add your thoughts here)