# Women Risk Predictor - Data Preparation

This notebook covers the complete data preparation pipeline for the women's harassment risk prediction project.

## Overview
This notebook includes:
1. **Import Required Libraries** - Load pandas, NumPy, scikit-learn, and joblib
2. **Load Dataset** - Import raw data from the CSV file
3. **Data Exploration** - Understand the structure and content of the dataset
4. **Check for Missing Values** - Identify missing data and drop the affected rows
5. **Remove Duplicates** - Drop duplicate rows
6. **Encode Categorical Variables** - Convert categorical features to numerical format
7. **Save Cleaned Data** - Export the cleaned dataset for the next steps

---

## 1. Import Required Libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import joblib
import os
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")

## 2. Load Dataset

Load the raw dataset from the CSV file.

In [None]:
# Load the dataset
data_path = "../data/women_risk.csv"

print("=" * 60)
print("LOADING DATASET")
print("=" * 60)

data = pd.read_csv(data_path)

print("\nDataset loaded successfully!")
print(f"Shape: {data.shape}")
print(f"Number of Rows: {data.shape[0]}")
print(f"Number of Columns: {data.shape[1]}")
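
If the CSV path is wrong, `pd.read_csv` fails with a traceback that can be hard to read in a long notebook run. The following is a minimal sketch of a loader that fails early with a clearer message. The helper name `load_dataset` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_dataset(csv_path):
    """Read a CSV into a DataFrame, failing early with a clear message."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path.resolve()}")
    frame = pd.read_csv(path)
    if frame.empty:
        # A header-only file parses fine but leaves nothing to clean
        raise ValueError(f"Dataset at {path} contains no rows")
    return frame
```

In this notebook it could stand in for the direct call, for example `data = load_dataset(data_path)`.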

## 3. Data Exploration

Explore the dataset to understand its structure, content, and data types.

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
data.head()

In [None]:
# Dataset information
print("Dataset Info:")
data.info()  # info() prints its report directly and returns None
print("\n" + "=" * 60)

# Statistical summary
print("\nStatistical Summary:")
data.describe()

In [None]:
# Display column names
print("Column Names:")
for i, col in enumerate(data.columns, 1):
    print(f"{i}. {col}")
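
The exploration cells above report dtypes, statistics, and column names separately. As a possible complement, the sketch below gathers the dtype, unique-value count, and missing-value count of each column into one table. The helper `summarize_columns` and the toy column names are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd


def summarize_columns(frame):
    """Return one row per column: dtype, unique count, and missing count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),        # NaN is not counted as a value
        "n_missing": frame.isnull().sum(),
    })


# Toy frame with hypothetical column names, for illustration only
sample = pd.DataFrame({
    "area": ["north", "south", "north", None],
    "hour": [22, 9, 22, 14],
})
print(summarize_columns(sample))
```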

## 4. Check for Missing Values

Count the missing values in each column and drop any rows that contain them.

In [None]:
# Check for missing values
print("=" * 60)
print("CHECKING MISSING DATA")
print("=" * 60)

missing = data.isnull().sum()
print("\nMissing values per column:")
print(missing)

if missing.sum() > 0:
    print(f"\n⚠️ Total missing values: {missing.sum()}")
    print("\nDropping rows with missing values...")
    data = data.dropna()
    print(f"✅ New shape after dropping missing values: {data.shape}")
else:
    print("\n✅ No missing values found!")
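
Dropping rows is the approach this notebook takes, but it discards every other value in the affected rows. If too many rows would be lost, one common alternative is imputation. The sketch below fills numeric columns with the median and other columns with the most frequent value. It is an illustration under those assumptions, not the notebook's method, and the helper name `impute_missing` is hypothetical.

```python
import pandas as pd


def impute_missing(frame):
    """Fill numeric gaps with the median and other gaps with the mode."""
    filled = frame.copy()
    for col in filled.columns:
        if not filled[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            modes = filled[col].mode()
            if not modes.empty:  # an all-missing column has no mode
                filled[col] = filled[col].fillna(modes.iloc[0])
    return filled


# Toy frame with hypothetical column names, for illustration only
sample = pd.DataFrame({
    "age": [21.0, None, 35.0],
    "area": ["north", None, "north"],
})
print(impute_missing(sample))
```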

## 5. Remove Duplicates

Remove any duplicate rows from the dataset.

In [None]:
# Remove duplicates
print("=" * 60)
print("REMOVING DUPLICATES")
print("=" * 60)

initial_rows = len(data)
data = data.drop_duplicates()
final_rows = len(data)

duplicates_removed = initial_rows - final_rows

print(f"\n📊 Initial rows: {initial_rows}")
print(f"📊 Duplicates removed: {duplicates_removed}")
print(f"✅ Final shape: {data.shape}")
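
Before removing duplicates it can help to look at them, since repeated rows sometimes reflect genuine repeated observations rather than data-entry errors. The sketch below uses a toy frame with hypothetical column names to show how `duplicated(keep=False)` lists every member of a duplicate group.

```python
import pandas as pd

# Toy frame with hypothetical column names, for illustration only
sample = pd.DataFrame({
    "area": ["north", "south", "north", "east"],
    "hour": [22, 9, 22, 14],
})

# keep=False marks every member of a duplicate group, not just the repeats
duplicate_rows = sample[sample.duplicated(keep=False)]
print(duplicate_rows)

# drop_duplicates keeps the first occurrence of each group by default
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(f"Rows before: {len(sample)}, after: {len(deduplicated)}")
```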

## 6. Encode Categorical Variables

Convert each object-typed (categorical) column to integer codes using Label Encoding, and save the fitted encoders so the same mapping can be reused later.

In [None]:
# Encode categorical variables
print("=" * 60)
print("ENCODING CATEGORICAL VARIABLES")
print("=" * 60)

# Identify categorical columns
categorical_cols = data.select_dtypes(include=['object']).columns.tolist()

if categorical_cols:
    print(f"\n📋 Categorical columns found: {categorical_cols}")
    
    label_encoders = {}
    
    for col in categorical_cols:
        print(f"\n🔄 Encoding '{col}'...")
        print(f"   Unique values before encoding: {data[col].nunique()}")
        
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])
        label_encoders[col] = le
        
        print(f"   ✅ Encoding completed for '{col}'")
    
    # Save label encoders for later use
    os.makedirs('../models', exist_ok=True)
    joblib.dump(label_encoders, '../models/label_encoders.pkl')
    print("\n✅ Label encoders saved to '../models/label_encoders.pkl'")
else:
    print("\n✅ No categorical columns found!")
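
`LabelEncoder` assigns integer codes in sorted order of the labels, so the codes carry no meaning about severity or rank. The sketch below uses hypothetical label values to show the mapping a fitted encoder holds and how `inverse_transform` recovers the original labels, which is why the encoders are worth saving.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values, for illustration only
levels = ["low", "high", "medium", "low"]

encoder = LabelEncoder()
codes = encoder.fit_transform(levels)

# classes_ is sorted, so codes follow alphabetical order, not severity
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
restored = [str(label) for label in encoder.inverse_transform(codes)]
print(restored)
```

If a feature has a natural order, an explicit mapping chosen by hand may represent it better than these alphabetical codes.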

## 7. Save Cleaned Data

Save the cleaned and prepared dataset for the next stage of the pipeline.

In [None]:
# Save cleaned data
output_path = "../data/women_risk_cleaned.csv"

print("=" * 60)
print("SAVING CLEANED DATA")
print("=" * 60)

data.to_csv(output_path, index=False)

print(f"\n✅ Cleaned data saved to: {output_path}")
print(f"✅ Final shape: {data.shape}")
print(f"✅ Rows: {data.shape[0]}")
print(f"✅ Columns: {data.shape[1]}")

print("\n" + "=" * 60)
print("DATA PREPARATION COMPLETED SUCCESSFULLY!")
print("=" * 60)
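
As an optional sanity check after saving, the exported file can be read back and compared with the in-memory frame. The sketch below shows the idea on a toy frame written to a temporary directory. The column names and file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned, fully numeric dataset
cleaned = pd.DataFrame({"area": [0, 1, 0], "hour": [22, 9, 14]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "cleaned.csv")
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# equals() compares shape, column labels, dtypes, and values
round_trip_ok = cleaned.equals(reloaded)
print(f"Round trip OK: {round_trip_ok}")
```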