# Class 4: Introduction to OOP and Mini-Project Completion

**Week 4: Intermediate Python and Data Preprocessing**

## Objectives
- Understand basic object-oriented programming (OOP) concepts: classes, objects, methods, and attributes.
- Build a simple class to organize preprocessing tasks.
- Complete the Week 4 mini-project by preprocessing the Titanic dataset end-to-end.
- Produce a clean, AI-ready dataset with proper documentation.
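
As a quick orientation before the exercises, the four terms above can be seen together in a few lines. This is a minimal sketch using a hypothetical `Passenger` class and made-up values; it is not part of the mini-project.

```python
class Passenger:
    """Hypothetical toy class used only to illustrate the vocabulary."""

    def __init__(self, name, fare):
        # Attributes: data stored on each individual object
        self.name = name
        self.fare = fare

    def add_surcharge(self, amount):
        # Method: a function that reads and updates the object's own attributes
        self.fare += amount
        return self.fare


# Objects: two independent instances created from the same class
alice = Passenger("Alice", 50.0)
bob = Passenger("Bob", 8.0)

alice.add_surcharge(5.0)  # changes alice only
print(alice.name, alice.fare)
print(bob.name, bob.fare)
```

Each object carries its own copy of the attributes, which is why updating `alice` leaves `bob` unchanged.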

## Dataset
- **Titanic dataset** (`titanic.csv`): Contains columns like `PassengerId`, `Pclass`, `Name`, `Sex`, `Age`, `Fare`, `Embarked`, `Survived`.
- Optional: `titanic_preprocessed.csv` (from Class 3) if you want to start from your previous work.

## Instructions
- Run the setup cell to load libraries and the dataset.
- Complete the exercises by filling in the code cells.
- Use the hints if you're stuck.
- Finalize the mini-project in Exercise 3 and save your results.
- Save your notebook and submit it as part of the mini-project.

## Setup
Run the cell below to import libraries and load the Titanic dataset.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Load the Titanic dataset
try:
    titanic = pd.read_csv('data/titanic.csv')
    print('Titanic dataset loaded successfully.')
    print(titanic.head())
except FileNotFoundError:
    print('Error: titanic.csv not found in data/ folder. Fix the path and rerun this cell before continuing.')

# Optional: Load preprocessed data from Class 3
# titanic = pd.read_csv('data/titanic_preprocessed.csv')

## Exercise 1: Creating a Simple Class

**Goal**: Understand OOP by building a basic class for data preprocessing.

**Task**:
- Define a `DataPreprocessor` class with:
  - An attribute to store a DataFrame (`self.data`).
  - A method `fill_missing` to fill missing values (e.g., median for numerical columns).
  - A method `get_summary` to return basic statistics (e.g., mean, min, max).
- Create an object and test the methods on a subset of the Titanic dataset.

**Steps**:
1. Define the class with `__init__` to store the DataFrame.
2. Implement `fill_missing` to handle missing numerical values.
3. Implement `get_summary` using `describe()`.
4. Test with `titanic[['Age', 'Fare']]`, filling missing `Age` and printing stats.

**Hint**: Use `self.data.fillna()` for missing values and `self.data.describe()` for stats.
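
The two calls in the hint can be tried on a tiny DataFrame first. The sketch below uses made-up values (not the Titanic data) and plain pandas, without a class, to show that `fillna` returns a new Series that has to be assigned back.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Age/Fare subset (values are made up)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Fare": [10.0, 20.0, 30.0, 40.0],
})

# median() skips NaN by default, so it is computed from 20, 30 and 40
age_median = toy["Age"].median()

# fillna returns a new Series; assign it back to keep the change
toy["Age"] = toy["Age"].fillna(age_median)

# describe() reports count, mean, std, min, quartiles and max per numeric column
summary = toy.describe()
print(summary.loc[["mean", "min", "max"]])
```

Inside the class, the same logic would operate on `self.data` instead of `toy`.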

In [None]:
# Your code here

# Define the DataPreprocessor class
class DataPreprocessor:
    def __init__(self, data):
        # YOUR CODE (store data)
        pass
    
    def fill_missing(self, column):
        # YOUR CODE (fill missing values in column with median)
        pass
    
    def get_summary(self):
        # YOUR CODE (return describe())
        pass

# Test the class
subset = titanic[['Age', 'Fare']].copy()
preprocessor = None  # YOUR CODE (replace None with a DataPreprocessor object)
preprocessor.fill_missing('Age')  # Fill missing Age
print('Summary after filling missing values:')
print(preprocessor.get_summary())

## Solution (Instructor Reference)

Uncomment and run the cell below to check your work. Try to complete the exercise yourself first!

```python
# class DataPreprocessor:
#     def __init__(self, data):
#         self.data = data.copy()
#     
#     def fill_missing(self, column):
#         self.data[column] = self.data[column].fillna(self.data[column].median())
#     
#     def get_summary(self):
#         return self.data.describe()
#
# subset = titanic[['Age', 'Fare']].copy()
# preprocessor = DataPreprocessor(subset)
# preprocessor.fill_missing('Age')
# print('Summary after filling missing values:')
# print(preprocessor.get_summary())
```

## Exercise 2: Enhancing the Class

**Goal**: Add preprocessing methods to the `DataPreprocessor` class.

**Task**:
- Extend `DataPreprocessor` with two new methods:
  - `encode_categorical`: One-hot encode a specified column.
  - `normalize`: Normalize a numerical column using `MinMaxScaler`.
- Test the methods on `titanic` by encoding `Sex` and normalizing `Fare`.
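
With its default `feature_range=(0, 1)`, `MinMaxScaler` rescales each value as (x - min) / (max - min). The sketch below applies that formula by hand with pandas on made-up fares, only to show what the scaler is expected to produce; it does not handle a constant column, where the denominator would be zero.

```python
import pandas as pd

fares = pd.Series([10.0, 20.0, 50.0])  # made-up values

# Min-max scaling: the smallest value maps to 0 and the largest to 1
scaled = (fares - fares.min()) / (fares.max() - fares.min())
print(scaled.tolist())
```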

**Steps**:
1. Add `encode_categorical` using `pd.get_dummies()`.
2. Add `normalize` using `MinMaxScaler`.
3. Create a new object with the full Titanic dataset.
4. Apply both methods and display the updated DataFrame.

**Hint**: Update `self.data` in each method to store changes.
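
The hint matters because `pd.get_dummies()` returns a new DataFrame rather than modifying its input. A small sketch on made-up data (the `toy` and `encoded` names are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Fare": [10.0, 20.0, 30.0],
})

# Calling get_dummies without assigning the result leaves toy untouched
pd.get_dummies(toy, columns=["Sex"])
print(list(toy.columns))

# Assigning the result keeps the encoded columns
encoded = pd.get_dummies(toy, columns=["Sex"])
print(list(encoded.columns))
```

In the class, the equivalent of the second call is an assignment back to `self.data`.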

In [None]:
# Your code here

# Enhanced DataPreprocessor class
class DataPreprocessor:
    def __init__(self, data):
        self.data = data.copy()
    
    def fill_missing(self, column):
        self.data[column] = self.data[column].fillna(self.data[column].median())
    
    def get_summary(self):
        return self.data.describe()
    
    def encode_categorical(self, column):
        # YOUR CODE (one-hot encode column)
        pass
    
    def normalize(self, column):
        # YOUR CODE (normalize column)
        pass

# Test the enhanced class
preprocessor = None  # YOUR CODE (replace None with a DataPreprocessor built from titanic)
preprocessor.encode_categorical('Sex')
preprocessor.normalize('Fare')
print('DataFrame after encoding and normalization:')
print(preprocessor.data.head())

## Solution (Instructor Reference)

Uncomment and run the cell below to check your work.

```python
# class DataPreprocessor:
#     def __init__(self, data):
#         self.data = data.copy()
#     
#     def fill_missing(self, column):
#         self.data[column] = self.data[column].fillna(self.data[column].median())
#     
#     def get_summary(self):
#         return self.data.describe()
#     
#     def encode_categorical(self, column):
#         self.data = pd.get_dummies(self.data, columns=[column], drop_first=False)
#     
#     def normalize(self, column):
#         scaler = MinMaxScaler()
#         self.data[f'{column}_normalized'] = scaler.fit_transform(self.data[[column]].values.reshape(-1, 1))
#
# preprocessor = DataPreprocessor(titanic)
# preprocessor.encode_categorical('Sex')
# preprocessor.normalize('Fare')
# print('DataFrame after encoding and normalization:')
# print(preprocessor.data.head())
```

## Exercise 3: Mini-Project Completion

**Goal**: Finalize the Week 4 mini-project by preprocessing the Titanic dataset end-to-end.

**Task**:
- Preprocess the Titanic dataset (either with `DataPreprocessor` or standard pandas/sklearn):
  - Handle missing values: Fill `Age` with median, `Embarked` with mode.
  - Encode categorical variables: One-hot encode `Sex`, `Embarked`, and `Pclass`.
  - Normalize `Age` and `Fare`.
  - Detect and report outliers in `Fare` (e.g., using IQR, but don’t remove them unless justified).
- Save the final dataset as `titanic_final.csv`.
- Document your steps with comments or a markdown cell.
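
For the outlier step, the IQR rule flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. Here is a minimal sketch on made-up fares, assuming pandas' default linear interpolation for quantiles:

```python
import pandas as pd

# Made-up fares with one obviously extreme value
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 500.0])

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is reported as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]

print(f"Fences: [{lower}, {upper}]")
print(outliers.tolist())
```

Reporting the flagged rows, rather than dropping them, matches the task: a high fare may be a genuine first-class ticket and not a data error.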

**Steps**:
1. Start with `titanic.csv` (or `titanic_preprocessed.csv` from Class 3).
2. Apply all preprocessing steps (reuse Class 1–3 techniques or `DataPreprocessor`).
3. Check the final dataset for remaining missing values and for correct data types.
4. Save with `to_csv()`.
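
Steps 3 and 4 can be sketched on a tiny stand-in DataFrame. The file name `toy_final.csv` and the temporary folder are assumptions made for this illustration; the mini-project itself saves `titanic_final.csv`.

```python
import os
import tempfile

import pandas as pd

final = pd.DataFrame({"Age": [22.0, 38.0], "Sex_male": [1, 0]})  # made-up values

# Step 3: count missing values per column and inspect the data types
missing = final.isna().sum()
print(missing)
print(final.dtypes)

# Step 4: write to a temporary folder, then read the file back as a sanity check
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_final.csv")
    final.to_csv(path, index=False)  # index=False avoids an extra unnamed index column
    reloaded = pd.read_csv(path)

print(reloaded.shape)
```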

**Hint**: If using `DataPreprocessor`, extend it with methods as needed (e.g., for `Pclass` encoding).

In [None]:
# Your code here

# Option 1: Use DataPreprocessor
class DataPreprocessor:
    def __init__(self, data):
        self.data = data.copy()
    
    def fill_missing(self, column, method='median'):
        if method == 'median':
            self.data[column] = self.data[column].fillna(self.data[column].median())
        elif method == 'mode':
            self.data[column] = self.data[column].fillna(self.data[column].mode()[0])
    
    def encode_categorical(self, column):
        self.data = pd.get_dummies(self.data, columns=[column], drop_first=False)
    
    def normalize(self, column):
        scaler = MinMaxScaler()
        self.data[f'{column}_normalized'] = scaler.fit_transform(self.data[[column]].values.reshape(-1, 1))
    
    def detect_outliers(self, column):
        Q1 = self.data[column].quantile(0.25)
        Q3 = self.data[column].quantile(0.75)
        IQR = Q3 - Q1
        outliers = self.data[(self.data[column] < Q1 - 1.5 * IQR) | (self.data[column] > Q3 + 1.5 * IQR)]
        return len(outliers), outliers

# Initialize preprocessor
preprocessor = None  # YOUR CODE (replace None with a DataPreprocessor object)

# Preprocess the dataset
# YOUR CODE (fill missing, encode, normalize, detect outliers)

# Check the final dataset
print('Final dataset info:')
preprocessor.data.info()  # info() prints directly and returns None, so no print() is needed
print('\nMissing values:')
print(preprocessor.data.isna().sum())

# Report outliers
outlier_count, outliers = None, None  # YOUR CODE (replace with a detect_outliers call)
print(f'\nNumber of outliers in Fare: {outlier_count}')
print('Example outliers:')
print(outliers[['Fare']].head())

# Save the final dataset
# YOUR CODE

# Display the first few rows
print('\nFinal DataFrame:')
print(preprocessor.data.head())

## Mini-Project Documentation

**Steps Performed**:
- (List your steps here, e.g., filled missing Age with median, encoded Sex, etc.)
- 

**Challenges Faced**:
- (Note any issues, e.g., handling missing values, choosing encoding methods)
- 

**Why This Preprocessing**:
- (Explain why you chose these steps, e.g., normalization for model compatibility)
- 

## Solution (Instructor Reference)

Uncomment and run the cell below to check your work.

```python
# class DataPreprocessor:
#     def __init__(self, data):
#         self.data = data.copy()
#     
#     def fill_missing(self, column, method='median'):
#         if method == 'median':
#             self.data[column] = self.data[column].fillna(self.data[column].median())
#         elif method == 'mode':
#             self.data[column] = self.data[column].fillna(self.data[column].mode()[0])
#     
#     def encode_categorical(self, column):
#         self.data = pd.get_dummies(self.data, columns=[column], drop_first=False)
#     
#     def normalize(self, column):
#         scaler = MinMaxScaler()
#         self.data[f'{column}_normalized'] = scaler.fit_transform(self.data[[column]].values.reshape(-1, 1))
#     
#     def detect_outliers(self, column):
#         Q1 = self.data[column].quantile(0.25)
#         Q3 = self.data[column].quantile(0.75)
#         IQR = Q3 - Q1
#         outliers = self.data[(self.data[column] < Q1 - 1.5 * IQR) | (self.data[column] > Q3 + 1.5 * IQR)]
#         return len(outliers), outliers
#
# preprocessor = DataPreprocessor(titanic)
# preprocessor.fill_missing('Age', method='median')
# preprocessor.fill_missing('Embarked', method='mode')
# preprocessor.encode_categorical('Sex')
# preprocessor.encode_categorical('Embarked')
# preprocessor.encode_categorical('Pclass')
# preprocessor.normalize('Age')
# preprocessor.normalize('Fare')
# print('Final dataset info:')
# preprocessor.data.info()  # info() prints directly and returns None
# print('\nMissing values:')
# print(preprocessor.data.isna().sum())
# outlier_count, outliers = preprocessor.detect_outliers('Fare')
# print(f'\nNumber of outliers in Fare: {outlier_count}')
# print('Example outliers:')
# print(outliers[['Fare']].head())
# preprocessor.data.to_csv('data/titanic_final.csv', index=False)
# print('\nFinal DataFrame:')
# print(preprocessor.data.head())
```

## Bonus Challenge

**Task**: Add a `standardize` method to `DataPreprocessor` to standardize a column (mean=0, std=1).
- Standardize `Age` instead of normalizing it in the mini-project.
- Verify the mean and std of the standardized column.

**Hint**: Use `StandardScaler` and follow the `normalize` method’s structure.
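
One detail is worth knowing when verifying the result: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below reproduces the z-score formula by hand on made-up ages to show the difference; it is an illustration, not the bonus solution.

```python
import pandas as pd

ages = pd.Series([20.0, 30.0, 40.0])  # made-up values

# z-score with the population standard deviation (ddof=0)
z = (ages - ages.mean()) / ages.std(ddof=0)

print("mean:", z.mean())
print("std (ddof=0):", z.std(ddof=0))
print("std (pandas default, ddof=1):", z.std())
```

On a small sample the default pandas `.std()` is noticeably above 1; on a dataset with hundreds of rows the gap shrinks but does not vanish, so pass `ddof=0` when checking for exactly 1.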

In [None]:
# Your code here

# Extend DataPreprocessor with standardize method
class DataPreprocessor:
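    # (Paste the methods from the Exercise 3 definition here. Redefining the
    # class with only standardize would drop the earlier methods.)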
    def standardize(self, column):
        # YOUR CODE
        pass

# Test standardization
# YOUR CODE

print('Standardized Age mean:', None)  # YOUR CODE (replace None)
print('Standardized Age std:', None)  # YOUR CODE (replace None)

## Discussion Questions
1. How does OOP help organize preprocessing tasks compared to standalone functions?
2. Why is documentation important in a data preprocessing pipeline?
3. What other preprocessing steps might be useful for the Titanic dataset (e.g., feature engineering)?

Feel free to jot down your thoughts in a new markdown cell below!

## Your Notes

(Add your thoughts here)