# Women Risk Predictor - Feature Engineering

This notebook covers the feature engineering pipeline for the Women Risk Predictor, a project that predicts harassment risk for women.

## Overview
This notebook includes:
1. **Import Required Libraries** - Load the packages used throughout the notebook
2. **Load Cleaned Data** - Import the cleaned dataset
3. **Correlation Analysis** - Analyze feature correlations with the target variable
4. **Create New Features** - Engineer new features from existing ones
5. **Scale Numeric Features** - Standardize numerical features
6. **Split Features and Target** - Separate features from the target variable
7. **Save Processed Data** - Export the engineered dataset for model training

---

## 1. Import Required Libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import joblib
import os
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

print("✅ All libraries imported successfully!")

## 2. Load Cleaned Data

Load the cleaned dataset from the previous data preparation step.

In [None]:
# Load cleaned data
data_path = "../data/women_risk_cleaned.csv"

print("=" * 60)
print("LOADING CLEANED DATASET")
print("=" * 60)

data = pd.read_csv(data_path)

print("\n✅ Dataset loaded successfully!")
print(f"Shape: {data.shape}")
print(f"Number of Rows: {data.shape[0]}")
print(f"Number of Columns: {data.shape[1]}")

# Display first few rows
print("\nFirst 5 rows:")
data.head()

## 3. Correlation Analysis

Analyze the correlation between features and the target variable.

In [None]:
# Correlation analysis
print("=" * 60)
print("CORRELATION ANALYSIS")
print("=" * 60)

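# Calculate the correlation matrix over numeric columns only,
# since corr() fails on string columns in pandas 2.x
corr_matrix = data.select_dtypes(include=[np.number]).corr()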

# Visualize correlation heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, 
            fmt='.2f', square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Heatmap', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n✅ Correlation heatmap displayed")

In [None]:
# Correlation with target variable
target_col = 'risk'

if target_col in corr_matrix.columns:
    print(f"\nCorrelation with target variable ('{target_col}'):")
    target_corr = corr_matrix[target_col].sort_values(ascending=False)
    print(target_corr)
    
    # Bar plot of correlations with target
    plt.figure(figsize=(10, 6))
    target_corr_filtered = target_corr[target_corr.index != target_col]
    target_corr_filtered.plot(kind='barh', color='steelblue')
    plt.title(f'Feature Correlation with {target_col}', fontsize=14, fontweight='bold')
    plt.xlabel('Correlation Coefficient', fontsize=12)
    plt.ylabel('Features', fontsize=12)
    plt.grid(axis='x', alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print(f"\n⚠️ Warning: Target column '{target_col}' not found!")
    print(f"Available columns: {data.columns.tolist()}")
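
Correlation with the target is only half the picture: features that are strongly correlated with each other carry largely redundant information. The following is a minimal sketch of listing such pairs. The helper `top_correlated_pairs`, the toy data, and the 0.9 threshold are all assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["feature_a", "feature_b", "abs_corr"]
    # NaN entries (masked cells) never satisfy the comparison, so they drop out here
    pairs = pairs[pairs["abs_corr"] >= threshold]
    return pairs.sort_values("abs_corr", ascending=False).reset_index(drop=True)


# Hypothetical toy data, not the project's dataset
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],               # exact multiple of "a"
    "c": [5, 3, 4, 1, 2],
    "label": ["x", "y", "x", "y", "x"],  # non-numeric, ignored
})
pairs = top_correlated_pairs(demo, threshold=0.9)
print(pairs)
```

In the notebook itself, the call would be `top_correlated_pairs(data.drop(columns=[target_col]))`, assuming the target column exists.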

## 4. Create New Features

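Engineer new features from existing ones. The interaction features below are examples: each is created only if its source columns exist in the dataset, so adapt the column names to your data.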

In [None]:
# Create new features
print("=" * 60)
print("CREATING NEW FEATURES")
print("=" * 60)

initial_features = data.shape[1]

# Example: Create interaction features if columns exist
# Customize based on your actual dataset columns

if 'age' in data.columns and 'past_incidents' in data.columns:
    data['risk_score'] = data['age'] * data['past_incidents']
    print("\n✅ Created 'risk_score' = age * past_incidents")

if 'public_transport_usage' in data.columns and 'time_of_day' in data.columns:
    data['transport_time_interaction'] = data['public_transport_usage'] * data['time_of_day']
    print("✅ Created 'transport_time_interaction' = public_transport_usage * time_of_day")

# Add more custom feature engineering based on domain knowledge

final_features = data.shape[1]
new_features = final_features - initial_features

print(f"\n📊 New features created: {new_features}")
print(f"📊 Total features now: {final_features}")

if new_features > 0:
    print("\nUpdated dataset shape:", data.shape)
else:
    print("\n⚠️ No new features were created (columns may not exist)")
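
The guarded pattern above (create a product feature only when both source columns exist) can be wrapped in a small helper, and the result compared against the target to see whether the new feature adds any signal. This is a sketch under stated assumptions: `add_interaction` is a hypothetical helper, and the toy values are invented for illustration.

```python
import pandas as pd


def add_interaction(df: pd.DataFrame, col_a: str, col_b: str, new_col: str) -> bool:
    """Add new_col = col_a * col_b in place; return False if a source column is missing."""
    if col_a not in df.columns or col_b not in df.columns:
        return False
    df[new_col] = df[col_a] * df[col_b]
    return True


# Hypothetical toy data, not the project's dataset
demo = pd.DataFrame({
    "age": [22, 35, 41, 19, 28],
    "past_incidents": [0, 2, 1, 3, 0],
    "risk": [0, 1, 0, 1, 0],
})

created = add_interaction(demo, "age", "past_incidents", "age_x_past_incidents")
skipped = add_interaction(demo, "age", "no_such_column", "unused")

# Compare how strongly the parents and the engineered feature track the target
feature_cols = ["age", "past_incidents", "age_x_past_incidents"]
print(demo[feature_cols].corrwith(demo["risk"]))
```

An engineered feature that correlates no better with the target than its parents may still help a non-linear model, so this check is a quick screen rather than a verdict.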

## 5. Scale Numeric Features

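Standardize numeric features with `StandardScaler` so that each has zero mean and unit variance. The target column is excluded from scaling, and the fitted scaler is saved for reuse at prediction time.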

In [None]:
# Scale numeric features
print("=" * 60)
print("SCALING NUMERIC FEATURES")
print("=" * 60)

target_col = 'risk'

# Identify numeric columns (excluding target)
numeric_cols = data.select_dtypes(include=[np.number]).columns.tolist()

if target_col in numeric_cols:
    numeric_cols.remove(target_col)

if numeric_cols:
    print(f"\n📋 Numeric columns to scale ({len(numeric_cols)}): {numeric_cols}")
    
    scaler = StandardScaler()
    data_scaled = data.copy()
    data_scaled[numeric_cols] = scaler.fit_transform(data[numeric_cols])
    
    # Save the scaler
    os.makedirs('../models', exist_ok=True)
    joblib.dump(scaler, '../models/scaler.pkl')
    print("\n✅ Scaler saved to '../models/scaler.pkl'")
    
    # Show statistics before and after scaling
    print("\n--- Statistics Before Scaling ---")
    print(data[numeric_cols].describe())
    
    print("\n--- Statistics After Scaling ---")
    print(data_scaled[numeric_cols].describe())
    
    # Update data with scaled values
    data = data_scaled
else:
    print("\n⚠️ No numeric columns found to scale!")
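
The cell above fits the scaler on every row. If the processed file is later divided into training and test sets, test-set statistics will already have influenced the scaling. A hedged sketch of the alternative, splitting first and fitting on the training rows only, is shown here; the toy columns, the 80/20 split, and the random seed are assumptions and not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, not the project's dataset
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "age": rng.integers(18, 60, size=100),
    "past_incidents": rng.integers(0, 5, size=100),
    "risk": rng.integers(0, 2, size=100),
})

X = demo.drop(columns="risk")
y = demo["risk"]

# Split first, so the test rows never influence the scaler's mean and variance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
# Reuse the training statistics on the test rows (transform only, no fit)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(X_train_scaled.describe().loc[["mean", "std"]])
```

With this ordering the scaled test set will generally not have exactly zero mean, which is expected.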

## 6. Split Features and Target

Separate features (X) from the target variable (y) to verify their shapes and inspect the class balance of the target.

In [None]:
# Split features and target
print("=" * 60)
print("SPLITTING FEATURES AND TARGET")
print("=" * 60)

target_col = 'risk'

if target_col in data.columns:
    X = data.drop(target_col, axis=1)
    y = data[target_col]
    
    print(f"\n✅ Features shape: {X.shape}")
    print(f"✅ Target shape: {y.shape}")
    
    print("\n📊 Target distribution:")
    print(y.value_counts())
    
    print("\n📊 Target distribution (%):")
    print(y.value_counts(normalize=True) * 100)
else:
    print(f"\n❌ Error: Target column '{target_col}' not found!")
    print(f"Available columns: {data.columns.tolist()}")
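
If the target distribution printed above turns out to be imbalanced, a stratified split keeps the class proportions the same in the training and test sets. A minimal sketch, assuming a hypothetical 80/20 class ratio and toy data that are not taken from the project:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 rows of class 0, 20 rows of class 1
demo = pd.DataFrame({"feature": range(100), "risk": [0] * 80 + [1] * 20})
X = demo.drop(columns="risk")
y = demo["risk"]

# stratify=y asks scikit-learn to preserve the class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```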

## 7. Save Processed Data

Save the feature-engineered dataset for model training.

In [None]:
# Save processed data
output_path = "../data/women_risk_processed.csv"

print("=" * 60)
print("SAVING PROCESSED DATA")
print("=" * 60)

data.to_csv(output_path, index=False)

print(f"\n✅ Processed data saved to: {output_path}")
print(f"✅ Shape: {data.shape}")
print(f"✅ Rows: {data.shape[0]}")
print(f"✅ Columns: {data.shape[1]}")

print("\n" + "=" * 60)
print("FEATURE ENGINEERING COMPLETED SUCCESSFULLY!")
print("=" * 60)

print("\n📌 Next Steps:")
print("   1. Review the processed data")
print("   2. Proceed to model training")
print("   3. Use the saved scaler (../models/scaler.pkl) for predictions")