# Pandas for Machine Learning

**Learning Objectives:**
- Master DataFrame operations essential for ML data preprocessing
- Learn feature engineering techniques using Pandas
- Understand data cleaning and preparation workflows
- Practice converting Pandas data to formats suitable for PyTorch and TensorFlow

**Prerequisites:** NumPy essentials, basic Python

**Estimated Time:** 45 minutes

---

Pandas is the go-to library for data manipulation and analysis in Python. In ML workflows, Pandas is typically used for:
- Loading and exploring datasets
- Data cleaning and preprocessing
- Feature engineering and selection
- Preparing data for ML frameworks

While PyTorch and TensorFlow work with tensors, most real-world data starts in Pandas DataFrames.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os
from datetime import datetime, timedelta

# Add src to path for our utilities
sys.path.append(os.path.join('..', '..', 'src'))

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## 1. DataFrame Creation and Basic Operations

Understanding how to create and manipulate DataFrames is fundamental to ML data preprocessing.
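
As a small illustration of the operations this section relies on, the sketch below builds a toy DataFrame and applies label-based selection, a boolean filter, and a derived column. The column values and the `income_k` name are made up for this example and are not part of the dataset used in the notebook.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "age": [25, 41, 33, 58],
    "income": [32000.0, 54000.0, 47000.0, 61000.0],
    "is_premium": [0, 1, 0, 1],
})

# Label-based selection: .loc slices are inclusive, so 0:1 returns two rows
subset = toy.loc[0:1, ["age", "income"]]

# Boolean filtering: keep rows where the condition holds
older = toy[toy["age"] > 30]

# Derived column via vectorized arithmetic (no Python loop needed)
toy["income_k"] = toy["income"] / 1000

print(subset.shape)
print(len(older))
print(toy["income_k"].tolist())
```

Vectorized column arithmetic like this is usually preferable to row-by-row loops, both for speed and readability.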

In [None]:
# Create sample ML dataset
np.random.seed(42)
n_samples = 1000

# Generate synthetic customer data for ML
data = {
    'customer_id': range(1, n_samples + 1),
    'age': np.clip(np.random.normal(35, 12, n_samples), 18, 90).astype(int),  # clip: a raw normal draw can produce ages <= 0
    'income': np.random.lognormal(10, 0.5, n_samples),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples, p=[0.3, 0.4, 0.2, 0.1]),
    'experience_years': np.random.exponential(5, n_samples),
    'num_purchases': np.random.poisson(3, n_samples),
    'satisfaction_score': np.random.uniform(1, 5, n_samples),
    'is_premium': np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_samples),
    'signup_date': pd.date_range('2020-01-01', periods=n_samples, freq='D')
}

df = pd.DataFrame(data)

# Introduce some missing values (realistic scenario)
missing_indices = np.random.choice(df.index, size=int(0.05 * len(df)), replace=False)
df.loc[missing_indices[:20], 'income'] = np.nan
df.loc[missing_indices[20:40], 'satisfaction_score'] = np.nan

print("Sample ML Dataset:")
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

In [None]:
# Basic DataFrame information (essential for ML)
print("Dataset Information:")
df.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"

print("\nData Types:")
print(df.dtypes)

print("\nMissing Values:")
print(df.isnull().sum())

print("\nBasic Statistics:")
print(df.describe())

## 2. Data Exploration and Analysis

Understanding your data is crucial before building ML models.
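
One exploration step that often pays off early is checking how the target rate varies across a categorical feature. The sketch below uses a toy frame with made-up values; the `region` and `is_premium` names mirror the notebook's dataset, but the numbers are purely illustrative.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South", "East"],
    "is_premium": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 target per category is the positive rate in that category
premium_rate = toy.groupby("region")["is_premium"].mean()

# Row-normalized cross-tabulation: each row sums to 1
rate_table = pd.crosstab(toy["region"], toy["is_premium"], normalize="index")

print(premium_rate)
print(rate_table)
```

Rates computed from very small groups (such as the single "East" row here) are noisy, so it helps to look at group sizes alongside them.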

In [None]:
# Categorical data analysis
print("Categorical Data Analysis:")

# Value counts for categorical features
print("Education distribution:")
print(df['education'].value_counts())
print(f"\nEducation percentages:")
print(df['education'].value_counts(normalize=True) * 100)

print("\nRegion distribution:")
print(df['region'].value_counts())

print("\nPremium customers:")
print(df['is_premium'].value_counts())
print(f"Premium rate: {df['is_premium'].mean():.2%}")

In [None]:
# Numerical data analysis
print("Numerical Data Analysis:")

# Select numerical columns (drop the ID, which is an arbitrary label, not a feature)
numerical_cols = df.select_dtypes(include=[np.number]).columns.drop('customer_id')
print(f"Numerical columns: {list(numerical_cols)}")

# Correlation analysis (important for feature selection)
correlation_matrix = df[numerical_cols].corr()
print("\nCorrelation with target (is_premium):")
# Drop the target's trivial correlation with itself (always 1.0)
target_corr = correlation_matrix['is_premium'].drop('is_premium').sort_values(ascending=False)
print(target_corr)

# Identify highly correlated features (multicollinearity)
print("\nHighly correlated feature pairs (|correlation| > 0.5):")
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_val = correlation_matrix.iloc[i, j]
        if abs(corr_val) > 0.5:
            high_corr_pairs.append((correlation_matrix.columns[i], correlation_matrix.columns[j], corr_val))

for col1, col2, corr in high_corr_pairs:
    print(f"{col1} - {col2}: {corr:.3f}")

In [None]:
# Groupby analysis (understanding patterns)
print("Group Analysis:")

# Analyze by education level
education_analysis = df.groupby('education').agg({
    'age': ['mean', 'std'],
    'income': ['mean', 'median'],
    'satisfaction_score': 'mean',
    'is_premium': 'mean',
    'customer_id': 'count'
}).round(2)

print("Analysis by Education Level:")
print(education_analysis)

# Analyze by region
print("\nAnalysis by Region:")
region_analysis = df.groupby('region').agg({
    'income': 'mean',
    'is_premium': 'mean',
    'satisfaction_score': 'mean'
}).round(2)
print(region_analysis)

## 3. Data Cleaning and Preprocessing

Essential steps before feeding data to ML models.
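
Before imputing, it can be useful to record which values were missing, since missingness itself sometimes carries signal. The following is a minimal sketch on toy data; the `income_was_missing` column name is a hypothetical choice for this example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({"income": [30000.0, np.nan, 50000.0, 70000.0, np.nan]})

# Record where values were missing before filling them in
toy["income_was_missing"] = toy["income"].isna().astype(int)

# Median imputation; assigning the result back avoids chained-assignment pitfalls
median_income = toy["income"].median()
toy["income"] = toy["income"].fillna(median_income)

print(toy)
```

In a train/test setting, the median would be computed on the training rows only and reused for other splits.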

In [None]:
# Handle missing values
print("Handling Missing Values:")
print(f"Missing values before cleaning:")
print(df.isnull().sum())

# Create a copy for cleaning
df_clean = df.copy()

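# Strategy 1: Fill numerical missing values with a central statistic
# (median for the skewed income, mean for the roughly symmetric satisfaction score).
# Assign the result back: inplace=True on a column selection triggers a warning in
# recent pandas versions and does not update the DataFrame under copy-on-write.
df_clean['income'] = df_clean['income'].fillna(df_clean['income'].median())
df_clean['satisfaction_score'] = df_clean['satisfaction_score'].fillna(df_clean['satisfaction_score'].mean())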

print(f"\nMissing values after cleaning:")
print(df_clean.isnull().sum())

# Alternative strategies
print("\nAlternative missing value strategies:")
print("1. Forward fill: df.ffill()")
print("2. Backward fill: df.bfill()")
print("3. Interpolation: df.interpolate()")
print("4. Drop rows: df.dropna()")
print("5. Drop columns: df.dropna(axis=1)")

In [None]:
# Handle outliers
print("Outlier Detection and Handling:")

# Identify outliers using IQR method
def detect_outliers_iqr(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (series < lower_bound) | (series > upper_bound)

# Check for outliers in income
income_outliers = detect_outliers_iqr(df_clean['income'])
print(f"Income outliers: {income_outliers.sum()} ({income_outliers.mean():.1%})")

# Summarize the income distribution (printed statistics; no plot is drawn)
print("Income statistics:")
print(f"Mean: ${df_clean['income'].mean():.0f}")
print(f"Median: ${df_clean['income'].median():.0f}")
print(f"95th percentile: ${df_clean['income'].quantile(0.95):.0f}")
print(f"99th percentile: ${df_clean['income'].quantile(0.99):.0f}")
print(f"Max: ${df_clean['income'].max():.0f}")

# Handle outliers (cap at 95th percentile)
income_cap = df_clean['income'].quantile(0.95)
df_clean['income_capped'] = df_clean['income'].clip(upper=income_cap)

print(f"\nAfter capping at 95th percentile:")
print(f"Max income: ${df_clean['income_capped'].max():.0f}")

In [None]:
# Data type optimization (important for large datasets)
print("Data Type Optimization:")
# Measure the same DataFrame before and after, so the reduction is a fair comparison
memory_before = df_clean.memory_usage(deep=True).sum()
print(f"Memory usage before optimization: {memory_before / 1024:.2f} KB")

# Optimize integer columns
int_cols = df_clean.select_dtypes(include=['int64']).columns
for col in int_cols:
    if col != 'customer_id':  # Keep ID as int64
        df_clean[col] = pd.to_numeric(df_clean[col], downcast='integer')

# Optimize float columns
float_cols = df_clean.select_dtypes(include=['float64']).columns
for col in float_cols:
    df_clean[col] = pd.to_numeric(df_clean[col], downcast='float')

# Convert categorical columns to category dtype
categorical_cols = ['education', 'region']
for col in categorical_cols:
    df_clean[col] = df_clean[col].astype('category')

memory_after = df_clean.memory_usage(deep=True).sum()
print(f"Memory usage after optimization: {memory_after / 1024:.2f} KB")
print(f"Memory reduction: {(1 - memory_after / memory_before) * 100:.1f}%")

print("\nOptimized data types:")
print(df_clean.dtypes)

## 4. Feature Engineering

Creating new features that can improve ML model performance.
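
Group-relative features compare each row with a statistic of its own group. As an alternative to aggregating and merging back, `groupby(...).transform(...)` returns one value per original row. The sketch below uses toy values; the column names are chosen for this example.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "income": [40000.0, 60000.0, 30000.0, 90000.0],
})

# transform returns one value per original row, aligned to the original index
toy["region_income_mean"] = toy.groupby("region")["income"].transform("mean")

# Relative feature: how a row compares with its own group
toy["income_vs_region_mean"] = toy["income"] / toy["region_income_mean"]

print(toy)
```

Because the result is already aligned to the original index, no merge step or column renaming is needed.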

In [None]:
# Feature engineering examples
print("Feature Engineering:")

# 1. Binning continuous variables
df_clean['age_group'] = pd.cut(df_clean['age'], 
                              bins=[0, 25, 35, 50, 100], 
                              labels=['Young', 'Adult', 'Middle-aged', 'Senior'])

df_clean['income_tier'] = pd.qcut(df_clean['income_capped'], 
                                 q=4, 
                                 labels=['Low', 'Medium', 'High', 'Very High'])

print("Age group distribution:")
print(df_clean['age_group'].value_counts())

print("\nIncome tier distribution:")
print(df_clean['income_tier'].value_counts())

# 2. Mathematical transformations
df_clean['log_income'] = np.log1p(df_clean['income_capped'])  # log(1+x) to handle zeros
df_clean['income_per_purchase'] = df_clean['income_capped'] / (df_clean['num_purchases'] + 1)
df_clean['satisfaction_squared'] = df_clean['satisfaction_score'] ** 2

# 3. Date-based features
df_clean['signup_year'] = df_clean['signup_date'].dt.year
df_clean['signup_month'] = df_clean['signup_date'].dt.month
df_clean['signup_dayofweek'] = df_clean['signup_date'].dt.dayofweek
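# Use a fixed reference date (the latest signup) so the feature is reproducible across runs;
# datetime.now() would give different values every time the notebook is executed
reference_date = df_clean['signup_date'].max()
df_clean['days_since_signup'] = (reference_date - df_clean['signup_date']).dt.days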

print("\nNew features created:")
new_features = ['age_group', 'income_tier', 'log_income', 'income_per_purchase', 
                'satisfaction_squared', 'signup_year', 'signup_month', 
                'signup_dayofweek', 'days_since_signup']
print(df_clean[new_features].head())

In [None]:
# Interaction features
print("Interaction Features:")

# Create interaction between important features
df_clean['age_income_interaction'] = df_clean['age'] * df_clean['log_income']
df_clean['experience_satisfaction'] = df_clean['experience_years'] * df_clean['satisfaction_score']

# Boolean combinations
df_clean['high_income_high_satisfaction'] = (
    (df_clean['income_tier'] == 'Very High') & 
    (df_clean['satisfaction_score'] > 4)
).astype(int)

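# Caution: this feature is built from the target (is_premium). It is shown only to
# illustrate boolean combinations and must be excluded from the model inputs,
# otherwise it leaks the label.
df_clean['experienced_premium'] = (
    (df_clean['experience_years'] > 5) & 
    (df_clean['is_premium'] == 1)
).astype(int)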

print("Interaction features:")
interaction_features = ['age_income_interaction', 'experience_satisfaction', 
                       'high_income_high_satisfaction', 'experienced_premium']
print(df_clean[interaction_features].describe())

In [None]:
# Aggregation features (useful for time series or grouped data)
print("Aggregation Features:")

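# Features based on region
# observed=True restricts the result to categories present in the data, since
# 'region' is a categorical column. Note that the is_premium mean is derived from
# the target: in a real pipeline, compute it on the training split only.
region_stats = df_clean.groupby('region', observed=True).agg({
    'income_capped': ['mean', 'std'],
    'satisfaction_score': 'mean',
    'is_premium': 'mean'
}).round(3)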

# Flatten column names
region_stats.columns = ['_'.join(col).strip() for col in region_stats.columns]
region_stats = region_stats.add_prefix('region_')

# Merge back to main dataframe
df_clean = df_clean.merge(region_stats, left_on='region', right_index=True, how='left')

print("Region-based features:")
region_features = [col for col in df_clean.columns if col.startswith('region_')]
print(df_clean[['region'] + region_features].head())

# Relative features (compare individual to group)
df_clean['income_vs_region_mean'] = df_clean['income_capped'] / df_clean['region_income_capped_mean']
df_clean['satisfaction_vs_region_mean'] = df_clean['satisfaction_score'] / df_clean['region_satisfaction_score_mean']

print("\nRelative features:")
print(df_clean[['income_vs_region_mean', 'satisfaction_vs_region_mean']].describe())

## 5. Categorical Encoding

Converting categorical variables to numerical format for ML models.
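
A practical concern with one-hot encoding is keeping the encoded columns identical between training data and data seen later. The sketch below shows one way to do this with `reindex`; the `train` and `new` frames are toy examples, not part of the notebook's dataset.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
train = pd.DataFrame({"region": ["North", "South", "East"]})
new = pd.DataFrame({"region": ["South", "West"]})  # "West" never appears in training

train_enc = pd.get_dummies(train, columns=["region"], dtype=int)

# Encode the new data, then force it onto the training columns:
# unseen categories are dropped, absent ones are filled with 0
new_enc = pd.get_dummies(new, columns=["region"], dtype=int)
new_enc = new_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(new_enc)
```

With this approach, a category that was never seen in training ends up as an all-zero row, which is one reasonable convention among several.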

In [None]:
# One-hot encoding
print("One-Hot Encoding:")

# Select categorical columns for encoding
categorical_cols = ['education', 'region', 'age_group', 'income_tier']

# One-hot encode (dtype=int gives 0/1 columns; recent pandas versions default to bool)
df_encoded = pd.get_dummies(df_clean, columns=categorical_cols, prefix=categorical_cols, drop_first=True, dtype=int)

print(f"Shape before encoding: {df_clean.shape}")
print(f"Shape after encoding: {df_encoded.shape}")

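# Show new columns (present after encoding but not before); a substring match on the
# prefixes would also pick up unrelated columns such as 'region_income_capped_mean'
new_cols = [col for col in df_encoded.columns if col not in df_clean.columns]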
print(f"\nNew encoded columns ({len(new_cols)}):")
for col in new_cols[:10]:  # Show first 10
    print(f"  {col}")
if len(new_cols) > 10:
    print(f"  ... and {len(new_cols) - 10} more")

In [None]:
# Label encoding (for ordinal variables)
print("Label Encoding:")

from sklearn.preprocessing import LabelEncoder

# Create a copy for label encoding
df_label_encoded = df_clean.copy()

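# Education has natural ordering
education_order = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
# .map() on a categorical column returns a categorical, so cast to int explicitly
df_label_encoded['education_encoded'] = df_label_encoded['education'].map(education_order).astype(int)

# For non-ordinal categories, LabelEncoder assigns arbitrary integer codes.
# The codes imply an order that does not exist, so prefer one-hot encoding for
# models that treat inputs as magnitudes (e.g. linear models, neural networks).
le_region = LabelEncoder()
df_label_encoded['region_encoded'] = le_region.fit_transform(df_label_encoded['region'])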

print("Education encoding:")
print(df_label_encoded[['education', 'education_encoded']].drop_duplicates().sort_values('education_encoded'))

print("\nRegion encoding:")
print(df_label_encoded[['region', 'region_encoded']].drop_duplicates().sort_values('region_encoded'))

# Show encoding mapping
print(f"\nRegion encoding mapping:")
for i, region in enumerate(le_region.classes_):
    print(f"  {region}: {i}")

In [None]:
# Target encoding (advanced technique)
print("Target Encoding:")

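# Calculate mean target value for each category
def target_encode(df, categorical_col, target_col, smoothing=1):
    """
    Target encoding with smoothing to reduce overfitting on rare categories.

    Note: the statistics should come from training data only; computing them
    on the full dataset (as in this demonstration) leaks the target.
    """
    # Calculate global mean
    global_mean = df[target_col].mean()
    
    # Calculate category means and counts (observed=True skips unused categories)
    category_stats = df.groupby(categorical_col, observed=True)[target_col].agg(['mean', 'count'])
    
    # Apply smoothing: shrink each category mean toward the global mean,
    # with `smoothing` acting as a pseudo-count of global-mean observations
    smoothed_means = (
        (category_stats['mean'] * category_stats['count'] + global_mean * smoothing) /
        (category_stats['count'] + smoothing)
    )
    
    return smoothed_means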

# Target encode education based on premium rate
education_target_encoding = target_encode(df_clean, 'education', 'is_premium')
# Cast to float: .map() on a categorical column would otherwise return a categorical
df_clean['education_target_encoded'] = df_clean['education'].map(education_target_encoding).astype(float)

print("Education target encoding (premium rate):")
print(education_target_encoding.sort_values(ascending=False))

# Target encode region
region_target_encoding = target_encode(df_clean, 'region', 'is_premium')
# Cast to float: .map() on a categorical column would otherwise return a categorical
df_clean['region_target_encoded'] = df_clean['region'].map(region_target_encoding).astype(float)

print("\nRegion target encoding (premium rate):")
print(region_target_encoding.sort_values(ascending=False))

## 6. Feature Scaling and Normalization

Preparing numerical features for ML algorithms that are sensitive to scale.
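
To make the three scalers used in this section less of a black box, the sketch below reproduces their core arithmetic with plain pandas on a toy Series. It assumes the default settings of scikit-learn's scalers (population standard deviation for standard scaling, the 25th to 75th percentile range for robust scaling); the values are made up.

```python
import pandas as pd

# Toy data with one large value (hypothetical, for illustration only)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

# Standard (z-score): subtract the mean, divide by the population std (ddof=0)
standard = (x - x.mean()) / x.std(ddof=0)

# Min-max: map the minimum to 0 and the maximum to 1
minmax = (x - x.min()) / (x.max() - x.min())

# Robust: center on the median, divide by the interquartile range
iqr = x.quantile(0.75) - x.quantile(0.25)
robust = (x - x.median()) / iqr

print(pd.DataFrame({"standard": standard, "minmax": minmax, "robust": robust}))
```

Note that pandas' `.std()` defaults to `ddof=1`, so the explicit `ddof=0` matters when comparing against a z-score computed with the population standard deviation.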

In [None]:
# Feature scaling
print("Feature Scaling:")

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Select numerical features for scaling
numerical_features = ['age', 'income_capped', 'experience_years', 'satisfaction_score', 
                     'log_income', 'days_since_signup']

print("Original feature statistics:")
print(df_clean[numerical_features].describe())

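# Standard scaling (z-score normalization)
# Demonstration only: the scaler is fit on the full dataset here. In a real
# pipeline, fit it on the training split and reuse it for validation and test.
scaler_standard = StandardScaler()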
df_standard_scaled = df_clean.copy()
df_standard_scaled[numerical_features] = scaler_standard.fit_transform(df_clean[numerical_features])

print("\nAfter Standard Scaling (mean=0, std=1):")
print(df_standard_scaled[numerical_features].describe())

# Min-Max scaling (0-1 range)
scaler_minmax = MinMaxScaler()
df_minmax_scaled = df_clean.copy()
df_minmax_scaled[numerical_features] = scaler_minmax.fit_transform(df_clean[numerical_features])

print("\nAfter Min-Max Scaling (range 0-1):")
print(df_minmax_scaled[numerical_features].describe())

In [None]:
# Robust scaling (less sensitive to outliers)
scaler_robust = RobustScaler()
df_robust_scaled = df_clean.copy()
df_robust_scaled[numerical_features] = scaler_robust.fit_transform(df_clean[numerical_features])

print("After Robust Scaling (median=0, IQR=1):")
print(df_robust_scaled[numerical_features].describe())

# Compare scaling methods visually
print("\nScaling Comparison for 'income_capped':")
comparison_df = pd.DataFrame({
    'Original': df_clean['income_capped'],
    'Standard': df_standard_scaled['income_capped'],
    'MinMax': df_minmax_scaled['income_capped'],
    'Robust': df_robust_scaled['income_capped']
})

print(comparison_df.describe())

## 7. Data Splitting and Sampling

Preparing data for training, validation, and testing.
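
The order of operations matters here: statistics used for scaling should be learned from the training rows only and then applied to the other splits. The sketch below illustrates this with pandas alone, using a per-class `sample` as a simple stand-in for a stratified split; the data and the 20% hold-out fraction are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "income": rng.normal(50000, 10000, 100),
    "is_premium": [1] * 30 + [0] * 70,
})

# Stratified hold-out: sample 20% within each class
test = toy.groupby("is_premium").sample(frac=0.2, random_state=0)
train = toy.drop(test.index)

# Fit the scaling statistics on the training rows only...
mu = train["income"].mean()
sigma = train["income"].std(ddof=0)

# ...then apply the same statistics to both splits
train_scaled = (train["income"] - mu) / sigma
test_scaled = (test["income"] - mu) / sigma

print(len(train), len(test))
print(test["is_premium"].mean())
```

The test split will generally not have exactly zero mean after scaling, which is expected: it is transformed with the training statistics, not its own.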

In [None]:
# Train-validation-test split
print("Data Splitting:")

from sklearn.model_selection import train_test_split

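# Prepare features and target.
# Exclude identifiers, the raw date, and columns derived from the target
# ('experienced_premium', 'region_is_premium_mean'), which would leak the label.
leaky_columns = ['experienced_premium', 'region_is_premium_mean']
feature_columns = [col for col in df_encoded.columns 
                  if col not in ['customer_id', 'is_premium', 'signup_date', 'education', 'region'] + leaky_columns]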

X = df_encoded[feature_columns]
y = df_encoded['is_premium']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Target distribution: {y.value_counts().to_dict()}")

# First split: separate test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: separate train and validation (25% of the remaining 80% goes to validation)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp  # 0.25 * 0.8 = 0.2 of total
)

print(f"\nSplit sizes:")
print(f"Train: {X_train.shape[0]} ({X_train.shape[0]/len(X):.1%})")
print(f"Validation: {X_val.shape[0]} ({X_val.shape[0]/len(X):.1%})")
print(f"Test: {X_test.shape[0]} ({X_test.shape[0]/len(X):.1%})")

# Check target distribution in each split
print(f"\nTarget distribution:")
print(f"Train: {y_train.mean():.3f}")
print(f"Validation: {y_val.mean():.3f}")
print(f"Test: {y_test.mean():.3f}")

In [None]:
# Handling imbalanced data
print("Handling Imbalanced Data:")

# Check class imbalance
class_counts = y_train.value_counts()
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"Class distribution: {class_counts.to_dict()}")
print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")

if imbalance_ratio > 2:  # If significantly imbalanced
    print("\nDataset is imbalanced. Strategies to consider:")
    
    # 1. Undersampling majority class
    majority_class = y_train.value_counts().index[0]
    minority_class = y_train.value_counts().index[1]
    
    majority_indices = y_train[y_train == majority_class].index
    minority_indices = y_train[y_train == minority_class].index
    
    # Random undersample majority class
    undersampled_majority = np.random.choice(majority_indices, size=len(minority_indices), replace=False)
    balanced_indices = np.concatenate([undersampled_majority, minority_indices])
    
    X_train_balanced = X_train.loc[balanced_indices]
    y_train_balanced = y_train.loc[balanced_indices]
    
    print(f"1. Undersampling - New size: {len(X_train_balanced)}")
    print(f"   New distribution: {y_train_balanced.value_counts().to_dict()}")
    
    # 2. Class weights (for algorithms that support it)
    from sklearn.utils.class_weight import compute_class_weight
    
    class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
    class_weight_dict = dict(zip(np.unique(y_train), class_weights))
    
    print(f"2. Class weights: {class_weight_dict}")
    
else:
    print("Dataset is reasonably balanced.")

## 8. Converting to ML Framework Formats

Preparing Pandas data for PyTorch and TensorFlow.
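
Both frameworks ultimately expect a dense numeric array, so a common intermediate step is a single `float32` matrix. The sketch below shows one way to get there from a mixed-dtype frame; `X_toy` and its columns are hypothetical stand-ins for the feature matrix built earlier.

```python
import numpy as np
import pandas as pd

# Toy feature frame with int, float, and bool columns (hypothetical values)
X_toy = pd.DataFrame({
    "age": [25, 41, 33],
    "income": [32000.0, 54000.0, 47000.0],
    "region_South": [True, False, True],  # one-hot column stored as bool
})

# Fail early if anything is still missing; NaN would propagate silently in training
assert not X_toy.isna().any().any()

# Cast every column to float32, then extract the underlying array
X_np = X_toy.astype(np.float32).to_numpy()

print(X_np.dtype, X_np.shape)
```

Casting the whole frame at once keeps bool one-hot columns in the result, which a filter on numeric dtypes alone would drop.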

In [None]:
# Convert to NumPy (intermediate step)
print("Converting to NumPy:")

# First, identify numeric columns (bool one-hot columns count as numeric here;
# np.number alone would silently drop them)
numeric_columns = X_train.select_dtypes(include=[np.number, 'bool']).columns
categorical_columns = X_train.select_dtypes(include=['object', 'category']).columns

print(f"Numeric columns ({len(numeric_columns)}): {list(numeric_columns)}")
print(f"Categorical columns ({len(categorical_columns)}): {list(categorical_columns)}")

# Convert only numeric columns to NumPy arrays
X_train_np = X_train[numeric_columns].values.astype(np.float32)
y_train_np = y_train.values.astype(np.int64)

X_val_np = X_val[numeric_columns].values.astype(np.float32)
y_val_np = y_val.values.astype(np.int64)

X_test_np = X_test[numeric_columns].values.astype(np.float32)
y_test_np = y_test.values.astype(np.int64)

print(f"\nTraining data shape: {X_train_np.shape}")
print(f"Training data type: {X_train_np.dtype}")
print(f"Training labels shape: {y_train_np.shape}")
print(f"Training labels type: {y_train_np.dtype}")

print(f"\n⚠️  Note: Only numeric columns converted to NumPy.")
print(f"   Categorical columns need encoding before ML model training.")

In [None]:
# PyTorch format (conceptual - would need PyTorch installed)
print("PyTorch Format (Conceptual):")
print("""
# Convert to PyTorch tensors
import torch
from torch.utils.data import TensorDataset, DataLoader

# Create tensors
X_train_torch = torch.from_numpy(X_train_np)
y_train_torch = torch.from_numpy(y_train_np)

# Create dataset
train_dataset = TensorDataset(X_train_torch, y_train_torch)

# Create data loader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Usage in training loop
for batch_X, batch_y in train_loader:
    # batch_X shape: (batch_size, num_features)
    # batch_y shape: (batch_size,)
    pass
""")

print(f"Expected tensor shapes:")
print(f"  X_train_torch: {X_train_np.shape}")
print(f"  y_train_torch: {y_train_np.shape}")
print(f"  Batch X: (32, {X_train_np.shape[1]})")
print(f"  Batch y: (32,)")

In [None]:
# TensorFlow format (conceptual - would need TensorFlow installed)
print("TensorFlow Format (Conceptual):")
print("""
# Convert to TensorFlow tensors
import tensorflow as tf

# Create tensors
X_train_tf = tf.constant(X_train_np)
y_train_tf = tf.constant(y_train_np)

# Create dataset
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_tf, y_train_tf))

# Shuffle, then batch (shuffling after batching would only reorder whole batches)
train_dataset = train_dataset.shuffle(1000).batch(32)

# Usage in training
for batch_X, batch_y in train_dataset:
    # batch_X shape: (batch_size, num_features)
    # batch_y shape: (batch_size,)
    pass
""")

print(f"Expected tensor shapes:")
print(f"  X_train_tf: {X_train_np.shape}")
print(f"  y_train_tf: {y_train_np.shape}")
print(f"  Batch X: (32, {X_train_np.shape[1]})")
print(f"  Batch y: (32,)")

In [None]:
# Save processed data for later use
print("Saving Processed Data:")

# Create a summary of the preprocessing pipeline
preprocessing_summary = {
    'original_shape': df.shape,
    'final_shape': (X_train_np.shape[0] + X_val_np.shape[0] + X_test_np.shape[0], X_train_np.shape[1]),
    'num_features': len(feature_columns),
    'feature_names': feature_columns,
    'target_distribution': y_train.value_counts().to_dict(),
    'split_sizes': {
        'train': X_train_np.shape[0],
        'val': X_val_np.shape[0],
        'test': X_test_np.shape[0]
    },
    'preprocessing_steps': [
        'Missing value imputation',
        'Outlier capping',
        'Feature engineering',
        'Categorical encoding',
        'Feature scaling (StandardScaler, demonstrated on df_clean; not applied to X)',
        'Train/val/test split'
    ]
}

print("Preprocessing Summary:")
for key, value in preprocessing_summary.items():
    if key != 'feature_names':  # Too long to print
        print(f"  {key}: {value}")

# In a real project, you would save the data:
print("\nTo save the processed data:")
print("""
# Save NumPy arrays
np.save('data/processed/X_train.npy', X_train_np)
np.save('data/processed/y_train.npy', y_train_np)
np.save('data/processed/X_val.npy', X_val_np)
np.save('data/processed/y_val.npy', y_val_np)
np.save('data/processed/X_test.npy', X_test_np)
np.save('data/processed/y_test.npy', y_test_np)

# Save feature names and preprocessing info
import pickle
with open('data/processed/preprocessing_info.pkl', 'wb') as f:
    pickle.dump(preprocessing_summary, f)

# Save scalers for future use
with open('data/processed/scaler.pkl', 'wb') as f:
    pickle.dump(scaler_standard, f)
""")

## Summary and Key Takeaways

**What we've learned:**

1. **Data Exploration**: Understanding your data through statistics, distributions, and correlations
2. **Data Cleaning**: Handling missing values, outliers, and data type optimization
3. **Feature Engineering**: Creating new features through binning, transformations, and interactions
4. **Categorical Encoding**: One-hot, label, and target encoding techniques
5. **Feature Scaling**: StandardScaler, MinMaxScaler, and RobustScaler
6. **Data Splitting**: Train/validation/test splits and handling imbalanced data
7. **Framework Conversion**: Preparing data for PyTorch and TensorFlow

**Key Pandas Patterns for ML:**
- Use `.info()`, `.describe()`, and `.value_counts()` for data exploration
- Handle missing values before feature engineering
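- Create interaction features where domain knowledge suggests combined effects, and check on validation data that they help
- Split data before fitting scalers or encoders, and fit them on the training split only, to prevent data leakage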
- Convert to appropriate data types (float32, int64) for ML frameworks
- Save preprocessing steps and scalers for production use

**Best Practices:**
- Explore data thoroughly before preprocessing
- Document all preprocessing steps
- Use stratified splits for classification problems
- Consider class imbalance in your target variable
- Validate that preprocessing doesn't introduce data leakage
- Keep feature names and preprocessing metadata
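
As one concrete instance of the leakage point above, the sketch below learns a target encoding from a training split and applies it to a validation split, falling back to the global rate for unseen categories. The frames, values, and the `region_target_encoded` name are hypothetical, and smoothing is omitted to keep the example short.

```python
import pandas as pd

# Toy splits (hypothetical values, for illustration only)
train = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "is_premium": [1, 0, 1, 1, 0],
})
valid = pd.DataFrame({"region": ["North", "South", "West"]})  # "West" is unseen

# Learn the encoding from the training split only
global_rate = train["is_premium"].mean()
region_rate = train.groupby("region")["is_premium"].mean()

# Apply it elsewhere; categories absent from training fall back to the global rate
valid["region_target_encoded"] = valid["region"].map(region_rate).fillna(global_rate)

print(valid)
```

The validation labels are never touched, so the encoded feature cannot carry information about them.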

**Next Steps:**
- Learn how to integrate this preprocessing with PyTorch DataLoaders
- Understand TensorFlow's tf.data API for efficient data pipelines
- Explore advanced feature engineering techniques
- Learn about feature selection methods

Pandas is the bridge between raw data and ML-ready datasets. Master these preprocessing techniques to build robust ML pipelines!