# Tutorial 2: Feature Engineering and ML Preparation

Welcome back! In the previous tutorial, we learned how to load and clean data. Now we'll take the next step: **transforming our clean data into features that machine learning models can learn from effectively**.

## What is Feature Engineering?

Feature engineering is the art and science of creating new features from existing data to improve model performance. It's often said that:

> *"Better features beat better algorithms."*

A simple model with great features often outperforms a complex model with poor features!

## Why Does This Matter for ML?

- **Better Predictions**: Well-engineered features help models find patterns more easily
- **Reduced Complexity**: Good features can make simple models work as well as complex ones
- **Domain Knowledge**: Feature engineering lets you incorporate what you know about the problem
- **Model Performance**: Can dramatically improve accuracy, precision, and other metrics

## What We'll Learn

1. **Feature Scaling & Normalization** - Making features comparable
2. **Encoding Techniques** - Advanced categorical encoding
3. **Feature Creation** - Building new features from existing ones
4. **Feature Selection** - Choosing the most important features
5. **Train-Test Splitting** - Properly dividing data for ML
6. **Handling Imbalanced Data** - Dealing with unequal class distributions

Let's get started!

## Import Libraries and Load Data

We'll start by importing our essential libraries and loading the Titanic dataset.

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# ML preprocessing libraries
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 3)

### Load and Prepare the Titanic Dataset

Let's load the data and perform the basic cleaning steps we learned in the previous tutorial.

In [None]:
# Load the dataset
df = sns.load_dataset('titanic')

# Fill missing values (using recommended pandas approach)
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Drop 'deck' (mostly missing) and 'embark_town' (duplicates 'embarked')
df = df.drop(columns=['deck', 'embark_town'])

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print("\nFirst few rows:")
df.head()

## Feature Scaling and Normalization

### Why Scale Features?

Different features often have different scales:
- **Age**: 0-80 years
- **Fare**: 0-512 pounds
- **Pclass**: 1, 2, or 3

Many ML algorithms (like SVM, KNN, Neural Networks) are sensitive to feature scales. Features with larger ranges can dominate the learning process!

### Two Main Approaches:

1. **Standardization (Z-score normalization)**: Rescales each feature to mean=0, std=1
   - Formula: `(x - mean) / std`
   - Use when: The algorithm works best with centered features on a comparable scale (SVM, Logistic Regression), or the data is roughly bell-shaped

2. **Min-Max Scaling**: Scales data to a fixed range [0, 1]
   - Formula: `(x - min) / (max - min)`
   - Use when: You need bounded values, or the data doesn't follow a normal distribution. Keep in mind that a single extreme outlier squeezes all other values into a narrow band
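
The two formulas above can be checked by hand before trusting the library. The sketch below uses a handful of made-up fare values (an assumption for illustration, not rows taken from the dataset) and compares the manual calculation with scikit-learn's scalers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical fare values, chosen only for illustration
x = np.array([[7.25], [8.05], [26.0], [71.28], [512.33]])

# Standardization: (x - mean) / std, using the population std (ddof=0)
z_manual = (x - x.mean()) / x.std()

# Min-max scaling: (x - min) / (max - min)
mm_manual = (x - x.min()) / (x.max() - x.min())

# The same transformations via scikit-learn
z_sklearn = StandardScaler().fit_transform(x)
mm_sklearn = MinMaxScaler().fit_transform(x)

print("Standardization matches:", np.allclose(z_manual, z_sklearn))
print("Min-max scaling matches:", np.allclose(mm_manual, mm_sklearn))
```

One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). That is why a standardized column can show a std slightly above 1 in `describe()` output.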

Let's see both in action!

In [None]:
# Select numerical features to scale
numerical_features = ['age', 'fare', 'sibsp', 'parch']

# Before scaling - look at the distributions
print("BEFORE SCALING:")
print(df[numerical_features].describe())
print("\nNotice the different scales: age (0-80), fare (0-512), etc.")

### StandardScaler (Standardization)

In [None]:
# Create a copy for standardization
df_standardized = df.copy()

# Initialize the scaler
scaler_standard = StandardScaler()

# Fit and transform the data
df_standardized[numerical_features] = scaler_standard.fit_transform(df[numerical_features])

print("AFTER STANDARDIZATION (StandardScaler):")
print(df_standardized[numerical_features].describe())
print("\nNotice: mean ≈ 0, std ≈ 1 for all features!")

### MinMaxScaler (Normalization)

In [None]:
# Create a copy for min-max scaling
df_normalized = df.copy()

# Initialize the scaler
scaler_minmax = MinMaxScaler()

# Fit and transform the data
df_normalized[numerical_features] = scaler_minmax.fit_transform(df[numerical_features])

print("AFTER MIN-MAX SCALING:")
print(df_normalized[numerical_features].describe())
print("\nNotice: all values are between 0 and 1!")

### Visualizing the Difference

In [None]:
# Compare age distribution across different scaling methods
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Original
df['age'].hist(bins=30, ax=axes[0], color='skyblue', edgecolor='black')
axes[0].set_title('Original Age Distribution')
axes[0].set_xlabel('Age (years)')

# Standardized
df_standardized['age'].hist(bins=30, ax=axes[1], color='lightgreen', edgecolor='black')
axes[1].set_title('Standardized Age (StandardScaler)')
axes[1].set_xlabel('Age (z-score)')

# Normalized
df_normalized['age'].hist(bins=30, ax=axes[2], color='coral', edgecolor='black')
axes[2].set_title('Normalized Age (MinMaxScaler)')
axes[2].set_xlabel('Age (0-1 scale)')

plt.tight_layout()
plt.show()

print("Key Insight: Shape stays the same, but the scale changes!")

## Categorical Encoding

In the previous tutorial, we used simple encoding. Now let's explore more sophisticated techniques!

### Encoding Methods:
1. **Label Encoding**: Convert to integers (0, 1, 2, ...)
2. **One-Hot Encoding**: Create binary columns
3. **Frequency Encoding**: Replace with category frequency


### Label Encoding (for binary or ordinal features)

In [None]:
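# Start from the unscaled dataframe: the features engineered later
# (family_size, is_child, fare_per_person) only make sense in original units.
# Scaling is applied after the train-test split.
df_encoded = df.copy()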

# Label encode 'sex' (binary: male/female)
label_encoder = LabelEncoder()
df_encoded['sex_encoded'] = label_encoder.fit_transform(df_encoded['sex'])

print("Sex encoding:")
print(df_encoded[['sex', 'sex_encoded']].drop_duplicates().sort_values('sex_encoded'))
print(f"\nMapping: female={label_encoder.transform(['female'])[0]}, male={label_encoder.transform(['male'])[0]}")

### One-Hot Encoding (for nominal features)

In [None]:
# One-hot encode 'embarked' (nominal: no natural order)
# dtype=int keeps the dummy columns numeric (newer pandas versions default to bool)
embarked_dummies = pd.get_dummies(df_encoded['embarked'], prefix='embarked', drop_first=True, dtype=int)

# Add to dataframe
df_encoded = pd.concat([df_encoded, embarked_dummies], axis=1)

print("One-hot encoding for 'embarked':")
print(df_encoded[['embarked', 'embarked_Q', 'embarked_S']].head(10))
print("\nNote: We dropped the first category ('embarked_C') to avoid perfect multicollinearity:")
print("with all three columns present, any one could be computed exactly from the other two.")
print("If embarked_Q=0 and embarked_S=0, then embarked='C'")

### Frequency Encoding (useful for high-cardinality features)

In [None]:
# Frequency encode 'class' - how often does each class appear?
class_freq = df_encoded['class'].value_counts(normalize=True)
# Cast to float: mapping a categorical column can return a categorical result
df_encoded['class_frequency'] = df_encoded['class'].map(class_freq).astype(float)

print("Frequency encoding for 'class':")
print(df_encoded[['class', 'class_frequency']].drop_duplicates().sort_values('class'))
print("\nThis tells us: Third class is most common (~55%), Second is least (~21%)")
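
The cell above computes frequencies on the full dataset, which is fine for a demonstration. In a real workflow the frequencies would usually be learned from the training rows only and then applied to the test rows, in keeping with the leakage rule discussed later in this tutorial. Here is a minimal sketch of that idea; the `cabin_area` column and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'cabin_area' is an invented column for illustration
train = pd.DataFrame({"cabin_area": ["A", "B", "B", "C", "B", "A"]})
test = pd.DataFrame({"cabin_area": ["B", "A", "D"]})  # 'D' never appears in train

# Learn category frequencies from the training rows only
freq = train["cabin_area"].value_counts(normalize=True)

# Apply the learned mapping to both sets; categories unseen in training become 0.0
train["cabin_area_freq"] = train["cabin_area"].map(freq)
test["cabin_area_freq"] = test["cabin_area"].map(freq).fillna(0.0)

print(test)
```

Filling unseen categories with 0.0 is one possible choice; another is to use the smallest frequency seen in training.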

## Feature Engineering Revisited

Now let's create some new features from existing ones! This is where domain knowledge really shines.

### Domain Knowledge Applied:
- **Families** might have different survival rates
- **Age groups** (children, adults, elderly) had different priorities
- **Wealth indicators** like fare per person can be more informative than total fare

In [None]:
# 1. Family Size Features
# Combine siblings/spouses and parents/children, then add 1 for the person themselves
df_encoded['family_size'] = df_encoded['sibsp'] + df_encoded['parch'] + 1

# Is traveling alone?
df_encoded['is_alone'] = (df_encoded['family_size'] == 1).astype(int)

print("Family-based features created:")
print(df_encoded[['sibsp', 'parch', 'family_size', 'is_alone']].head(10))
print(f"\nSurvival rate - Alone: {df_encoded[df_encoded['is_alone']==1]['survived'].mean():.2%}")
print(f"Survival rate - With family: {df_encoded[df_encoded['is_alone']==0]['survived'].mean():.2%}")

In [None]:
# 2. Age-based Features
# Is child? (under 18 - "children first" policy)
df_encoded['is_child'] = (df_encoded['age'] < 18).astype(int)

# Is elderly? (over 60)
df_encoded['is_elderly'] = (df_encoded['age'] > 60).astype(int)

print("Age-based features created:")
print(df_encoded[['age', 'is_child', 'is_elderly']].head(15))
print(f"\nSurvival rate - Children: {df_encoded[df_encoded['is_child']==1]['survived'].mean():.2%}")
print(f"Survival rate - Adults: {df_encoded[(df_encoded['is_child']==0) & (df_encoded['is_elderly']==0)]['survived'].mean():.2%}")
print(f"Survival rate - Elderly: {df_encoded[df_encoded['is_elderly']==1]['survived'].mean():.2%}")

In [None]:
# 3. Fare per Person (Wealth Indicator)
# Total fare might be for a whole family, so divide by family size
df_encoded['fare_per_person'] = df_encoded['fare'] / df_encoded['family_size']

print("Fare-based features created:")
print(df_encoded[['fare', 'family_size', 'fare_per_person', 'pclass']].head(10))

# Compare fare per person by class
print("\nAverage fare per person by class:")
print(df_encoded.groupby('pclass')['fare_per_person'].mean())

print("\nAll new features created successfully!")

## Feature Selection

We've created many features! But more features ≠ better model. We need to select the most important ones.

### Why Feature Selection?
- **Reduces overfitting**: Fewer features = simpler, more generalizable model
- **Faster training**: Less data to process
- **Better interpretability**: Easier to understand what drives predictions
- **Removes noise**: Irrelevant features can hurt performance

### Methods:
1. **Statistical Tests**: F-score, chi-squared
2. **Correlation Analysis**: Which features correlate with target?
3. **Feature Importance**: From tree-based models (a bonus topic for you to explore)
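
As a starting point for the third method, here is a rough sketch of reading feature importances from a random forest. It runs on synthetic data from `make_classification` rather than the Titanic features, and the column names are made up, so treat it as an illustration of the API rather than a result about this dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: with shuffle=False the 3 informative columns
# come first, followed by 3 pure-noise columns
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42,
)
cols = [f"informative_{i}" for i in range(3)] + [f"noise_{i}" for i in range(3)]
X_demo = pd.DataFrame(X_demo, columns=cols)

# Fit a random forest and read its impurity-based importances
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_demo, y_demo)

importances = pd.Series(forest.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances sum to 1, so each value can be read as that feature's share of the forest's total impurity reduction.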

### Correlation Analysis

In [None]:
# Select only numerical features for correlation
numerical_cols = df_encoded.select_dtypes(include=[np.number]).columns.tolist()

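# Calculate correlation with target (survived) and rank by strength.
# Sorting by absolute value keeps strong negative correlations near the top.
correlations = df_encoded[numerical_cols].corr()['survived']
correlations = correlations.sort_values(key=abs, ascending=False)

print("Features most strongly correlated with survival (positive or negative):")
print(correlations.head(15))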

### Visualize Feature Correlations

In [None]:
# Top correlated features
top_features = correlations.head(10).index.tolist()
top_features.remove('survived')  # Remove target itself

# Create correlation heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = df_encoded[top_features + ['survived']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            fmt='.2f', square=True, linewidths=1, yticklabels=True)
plt.title('Correlation Heatmap: Top Features vs Survival')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

### SelectKBest (Statistical Feature Selection)

In [None]:
# Prepare features and target
X = df_encoded[numerical_cols].drop('survived', axis=1)
y = df_encoded['survived']

# Fill any remaining NaN values with 0
X = X.fillna(0)

# Select the top 6 features using the F-score
selector = SelectKBest(f_classif, k=6)
selector.fit(X, y)

# Get selected features
selected_features = X.columns[selector.get_support()].tolist()

print("Top features selected by F-score:")
for i, feature in enumerate(selected_features, 1):
    score = selector.scores_[X.columns.get_loc(feature)]
    print(f"{i}. {feature}: {score:.2f}")

# Visualize the scores
plt.figure(figsize=(10, 6))
feature_scores = [selector.scores_[X.columns.get_loc(feature)] for feature in selected_features]
plt.barh(selected_features, feature_scores, color='lightblue', edgecolor='navy')
plt.xlabel('F-score')
plt.title('Top 6 Features Selected by F-score')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
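
We also imported `mutual_info_classif` at the top of the notebook. It can be plugged into `SelectKBest` in place of `f_classif` and can pick up non-linear relationships that the F-score misses. The sketch below uses synthetic data with made-up column names, and wraps the scoring function with `functools.partial` to fix its `random_state`, which is one way (an assumption of this sketch) to make the scores reproducible.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data with hypothetical column names f0..f7
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(8)])

# The mutual information estimate is randomized, so fix its seed
mi_score = partial(mutual_info_classif, random_state=0)

# Keep the 3 features with the highest mutual information with the target
mi_selector = SelectKBest(score_func=mi_score, k=3)
mi_selector.fit(X_demo, y_demo)

mi_selected = X_demo.columns[mi_selector.get_support()].tolist()
print("Selected by mutual information:", mi_selected)
```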

## Train-Test Split

### Critical Concept: Never Test on Training Data!

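If we evaluate a model on the same data it was trained on, it can look nearly perfect and still fail on new data. A model that memorizes its training set instead of learning general patterns is **overfitting**, and a held-out test set is how we detect it.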

### The Solution: Train-Test Split
- **Training Set** (typically 70-80%): Used to train the model
- **Test Set** (typically 20-30%): Used to evaluate the model

The model NEVER sees the test set during training!
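
To see why this matters, here is a small sketch on synthetic, deliberately noisy data (not the Titanic data). An unconstrained decision tree is free to memorize every training row, so its training accuracy says very little about how it does on unseen rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 randomly reassigns about 20% of labels as noise
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# No depth limit: the tree can keep splitting until it fits every training row
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f"Accuracy on training data: {train_acc:.2%}")
print(f"Accuracy on held-out data: {test_acc:.2%}")
```

The gap between the two numbers is the part of the model's apparent skill that comes from memorization.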

### Important: When to Split?
- Split AFTER cleaning
- Split BEFORE feature scaling (to avoid data leakage)
- Use stratification for imbalanced datasets
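
One way to make the "split before scaling" rule hard to break is to bundle the scaler and the model into a scikit-learn pipeline. The sketch below is an illustration on synthetic data, with logistic regression standing in for whatever model you use later.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=1
)

# Scaler and model travel together as a single estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# For each of the 5 folds, the whole pipeline is refit on that fold's training part
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.3f}")
```

Because the scaler is refit inside every fold, its mean and standard deviation never include the rows being used for evaluation.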

In [None]:
# Select our final features (using top correlated features)
final_features = ['sex_encoded', 'pclass', 'fare', 'age', 'family_size', 
                  'fare_per_person', 'is_child', 'is_alone', 'embarked_Q', 'embarked_S']

X_final = df_encoded[final_features].fillna(0)
y_final = df_encoded['survived']

# Perform the split
# stratify=y_final ensures the same proportion of survived/died in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X_final, y_final, 
    test_size=0.2,      # 20% for testing
    random_state=42,    # For reproducibility
    stratify=y_final    # Maintain class distribution
)

print("Dataset split completed!")
print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nFeatures: {X_train.shape[1]}")

print("\nClass distribution in training set:")
print(y_train.value_counts(normalize=True))

print("\nClass distribution in test set:")
print(y_test.value_counts(normalize=True))
print("\nDistributions are similar - stratification worked!")

### Scaling AFTER Split (Critical!)

In [None]:
# Initialize scaler
scaler = StandardScaler()

# Fit ONLY on training data
scaler.fit(X_train)

# Transform both sets using the same fitted scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames for easier viewing
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("Scaling completed!")
print("\nTraining set statistics (after scaling):")
print(X_train_scaled.describe())

print("\nWhy fit only on training data?")
print("To prevent DATA LEAKAGE: Test data information shouldn't influence training!")

## Handling Imbalanced Data

Let's check if our classes are balanced:

In [None]:
# Check class balance
# sort_index() fixes the order to 0, 1 so it matches the bar labels below
class_distribution = y_train.value_counts().sort_index()
class_percentages = y_train.value_counts(normalize=True) * 100

print("Class Distribution:")
print(f"Died (0): {class_distribution[0]} ({class_percentages[0]:.1f}%)")
print(f"Survived (1): {class_distribution[1]} ({class_percentages[1]:.1f}%)")

imbalance_ratio = class_distribution[0] / class_distribution[1]
print(f"\nImbalance ratio: {imbalance_ratio:.2f}:1")

if imbalance_ratio > 1.5:
    print("\nClasses are somewhat imbalanced!")
    print("This is moderate imbalance. Consider:")
    print("- Using class_weight='balanced' in models")
    print("- Focusing on precision/recall instead of just accuracy")
    print("- Using stratified sampling (already done!)")
else:
    print("\nClasses are reasonably balanced!")

# Visualize
plt.figure(figsize=(8, 5))
plt.bar(['Died', 'Survived'], class_distribution.values, color=['coral', 'skyblue'])
plt.title('Class Distribution in Training Set')
plt.ylabel('Count')
plt.grid(axis='y', alpha=0.3)
for i, v in enumerate(class_distribution.values):
    plt.text(i, v + 10, str(v), ha='center', fontweight='bold')
plt.tight_layout()
plt.show()
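
The `class_weight='balanced'` suggestion above can be made concrete. The sketch below computes the balanced weights for a made-up label vector with a 62/38 split (chosen only for illustration) using scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 62/38 split, chosen only for illustration
y_demo = np.array([0] * 62 + [1] * 38)

# 'balanced' weights follow n_samples / (n_classes * count_of_class)
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
weight_by_class = {int(c): float(w) for c, w in zip(classes, weights)}

print("Balanced class weights:", weight_by_class)
```

The minority class gets the larger weight, so each of its samples counts for more during training. Passing `class_weight='balanced'` to an estimator such as `LogisticRegression` applies the same formula for you.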

## Final Data Summary

Let's review what we've prepared for machine learning!

In [None]:
print("="*60)
print("FINAL ML-READY DATASET SUMMARY")
print("="*60)

print("\nDataset Shape:")
print(f"   Training: {X_train_scaled.shape}")
print(f"   Testing:  {X_test_scaled.shape}")

print("\nTarget Variable:")
print(f"   Name: 'survived' (binary: 0=died, 1=survived)")
print(f"   Training samples: {len(y_train)}")
print(f"   Test samples: {len(y_test)}")

print("\nFeatures Used:")
for i, feature in enumerate(final_features, 1):
    print(f"   {i}. {feature}")

print("\nPreprocessing Applied:")
print("   - Missing values handled (median/mode imputation)")
print("   - Features scaled (StandardScaler)")
print("   - Categorical variables encoded (Label + One-Hot)")
print("   - New features engineered (family_size, fare_per_person, etc.)")
print("   - Train-test split performed (80-20)")
print("   - Stratified sampling to maintain class balance")

print("\nData is ready for machine learning models!")
print("="*60)

### Additional Techniques to Explore:

- **PCA** (Principal Component Analysis) - Dimensionality reduction
- **Polynomial Features** - Create interaction terms automatically
- **Target Encoding** - Use target statistics for encoding
- **SMOTE** - Synthetic data generation for imbalanced classes
- **Feature Importance** - From tree-based models

##### Never stop learning! If you stop learning, you'll get left behind in this industry.

## Practice Exercise: Apply to Iris Dataset

Now it's your turn! Apply these techniques to the Iris dataset from the previous tutorial.

**Tasks:**
1. Load the Iris dataset from `data/Iris.csv`
2. Encode the Species column
3. Split into train-test sets (80-20)
4. Scale the features using StandardScaler
5. Check if the classes are balanced

In [None]:
# Load the Iris dataset

# Separate features and target
X = ...
y = ...

# Encode species using LabelEncoder
le = LabelEncoder()
y = ...

# Perform train-test split
X_train, X_test, y_train, y_test = ...

# Print the 4 dataset shapes

# Scale features using StandardScaler (fit_transform for training vs transform for test)
scaler = StandardScaler()
X_train = ...
X_test = ...

# Show the mean and std of scaled features
print("\nScaled feature statistics (training set):")
print(f"Mean (should be ~0): {np.mean(X_train, axis=0)}")
print(f"Std (should be ~1): {np.std(X_train, axis=0)}\n")

# Visualize class distribution
unique, counts = np.unique(y_train, return_counts=True)
class_distribution = dict(zip(le.inverse_transform(unique), counts))
classes = class_distribution.keys()
counts = class_distribution.values()

plt.figure(figsize=(8, 5))
# Matplotlib bar chart goes here - make each bar a different color :)
plt.show()
