# Women Risk Predictor - Data Preparation

This notebook covers the complete data preparation pipeline for the Women Risk Predictor project, which predicts harassment risk from survey responses.

## Overview
This notebook includes:
1. **Import Required Libraries** - Load pandas, NumPy, scikit-learn, and joblib
2. **Load Dataset** - Import raw data from the CSV file
3. **Data Exploration** - Understand the structure and content of the dataset
4. **Check for Missing Values** - Count missing values and drop any affected rows
5. **Remove Duplicates** - Drop duplicate rows
6. **Encode Categorical Variables** - Convert categorical features to numerical format
7. **Save Cleaned Data** - Export the cleaned dataset for the next steps

---

## 1. Import Required Libraries

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import joblib
import os
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")

All libraries imported successfully!


## 2. Load Dataset

Load the raw dataset from the CSV file.

In [2]:
# Load the dataset
data_path = "../data/women_risk.csv"

print("=" * 60)
print("LOADING DATASET")
print("=" * 60)

data = pd.read_csv(data_path)

print("\nDataset loaded successfully!")
print(f"Shape: {data.shape}")
print(f"Number of Rows: {data.shape[0]}")
print(f"Number of Columns: {data.shape[1]}")

LOADING DATASET

Dataset loaded successfully!
Shape: (115, 13)
Number of Rows: 115
Number of Columns: 13
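
The shape alone does not show how each column was typed on load. The following is a minimal sketch, assuming the `Timestamp` column keeps the `YYYY-MM-DD HH:MM:SS` format seen in the first rows. The two-row inline CSV and the name `sample` are stand-ins for `../data/women_risk.csv`, used only so the snippet runs on its own.

```python
import io

import pandas as pd

# Two-row stand-in for ../data/women_risk.csv (illustration only).
sample_csv = io.StringIO(
    "Timestamp,1. What is your age group?\n"
    "2026-01-29 22:19:51,18-25\n"
    "2026-01-30 22:06:27,26-35\n"
)

# parse_dates converts the column to a datetime dtype at load time.
sample = pd.read_csv(sample_csv, parse_dates=["Timestamp"])

print(sample.dtypes)
print(sample["Timestamp"].dt.hour.tolist())
```

A column parsed this way is no longer text, so a later step that selects text columns for encoding would leave it alone.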


## 3. Data Exploration

Explore the dataset to understand its structure, content, and data types.

In [3]:
# Display first few rows
print("First 5 rows of the dataset:")
data.head()

First 5 rows of the dataset:


| | Timestamp | 1. What is your age group? | 2. What is your occupation? | 3. At what time of day did the incident occur? | 4. Where did the incident occur? | 5. How crowded was the location at the time of the incident? | 6. What was the lighting condition in the area? | 7. Was any form of security present at the location? | 8. Were you familiar with the area where the incident occurred? | 9. What type of harassment did you experience? | 10. How often have you experienced harassment in similar situations? | 11. How safe did you feel during the incident? | 12. Overall, how would you rate the risk level of harassment in that situation? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2026-01-29 22:19:51 | 18-25 | Student | Evening | Street/Public place | The location was slightly crowded | The lighting was moderate | There was no security at all | I mostly knew the area | Physical harassment (touching, grabbing, assault) | I have sometimes experienced harassment in sim... | I felt somewhat safe | There was no risk of harassment at all |
| 1 | 2026-01-29 22:20:52 | 18-25 | Student | Evening | Public transport | The location was extremely crowded | The lighting was poor | There was no security at all | I somewhat knew the area | Physical harassment (touching, grabbing, assault) | I have never experienced harassment in similar... | Option I felt very unsafe1 | There was a very high risk of harassment |
| 2 | 2026-01-29 23:27:42 | 18-25 | Student | Night | Online platform | The location was not crowded at all | The lighting was very poor | There was no security at all | I did not know the area at all | Online harassment (messages, social media, calls) | I have sometimes experienced harassment in sim... | I felt very unsafe | There was a very high risk of harassment |
| 3 | 2026-01-30 20:28:06 | 18-25 | Student | Night | Public transport | The location was extremely crowded | The lighting was poor | There was no security at all | I mostly knew the area | Physical harassment (touching, grabbing, assault) | I have rarely experienced harassment in simila... | I felt very unsafe | There was a high risk of harassment |
| 4 | 2026-01-30 22:06:27 | 26-35 | Self-employed | Early Morning | Street/Public place | The location was mostly crowded | The lighting was moderate | There was some security | I knew the area completely | Verbal harassment (unwanted comments, remarks) | I have sometimes experienced harassment in sim... | I felt very unsafe | There was a high risk of harassment |


In [4]:
# Dataset information
print("Dataset Info:")
data.info()  # info() prints its report directly and returns None
print("\n" + "=" * 60)

# Statistical summary
print("\nStatistical Summary:")
data.describe()

Dataset Info:
<class 'pandas.DataFrame'>
RangeIndex: 115 entries, 0 to 114
Data columns (total 13 columns):
 #   Column                                                                           Non-Null Count  Dtype
---  ------                                                                           --------------  -----
 0   Timestamp                                                                        115 non-null    str  
 1   1. What is your age group?                                                       115 non-null    str  
 2   2. What is your occupation?                                                      115 non-null    str  
 3   3. At what time of day did the incident occur?                                   115 non-null    str  
 4   4. Where did the incident occur?                                                 115 non-null    str  
 5   5. How crowded was the location at the time of the incident?                     115 non-null    str  
 6   6. What was the lightin

| | Timestamp | 1. What is your age group? | 2. What is your occupation? | 3. At what time of day did the incident occur? | 4. Where did the incident occur? | 5. How crowded was the location at the time of the incident? | 6. What was the lighting condition in the area? | 7. Was any form of security present at the location? | 8. Were you familiar with the area where the incident occurred? | 9. What type of harassment did you experience? | 10. How often have you experienced harassment in similar situations? | 11. How safe did you feel during the incident? | 12. Overall, how would you rate the risk level of harassment in that situation? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 115 | 115 | 115 | 115 | 115 | 115 | 115 | 115 | 115 | 115 | 115 | 115 | 115 |
| unique | 115 | 5 | 5 | 5 | 5 | 5 | 5 | 4 | 5 | 5 | 5 | 6 | 5 |
| top | 2026-01-29 22:19:51 | 18-25 | Student | Evening | Street/Public place | The location was moderately crowded | The lighting was moderate | Security was fully present | I somewhat knew the area | Physical harassment (touching, grabbing, assault) | I have rarely experienced harassment in simila... | I felt somewhat safe | There was a moderate risk of harassment |
| freq | 1 | 37 | 33 | 31 | 34 | 46 | 33 | 35 | 34 | 34 | 45 | 39 | 42 |
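
The summary reports 6 unique values for question 11, where most questions have 5, and the first rows contain the label `Option I felt very unsafe1`. That label looks like a malformed copy of `I felt very unsafe`, although the notebook does not confirm this. The sketch below shows one way to normalize it under that assumption; the frame `sample` and the mapping `fixes` are illustrative names, not part of the original pipeline.

```python
import pandas as pd

safety_col = "11. How safe did you feel during the incident?"

# Small stand-in frame using values seen in the first rows of the dataset.
sample = pd.DataFrame({
    safety_col: [
        "I felt somewhat safe",
        "Option I felt very unsafe1",
        "I felt very unsafe",
    ]
})

# Assumption: the odd label is a form-export glitch for "I felt very unsafe".
fixes = {"Option I felt very unsafe1": "I felt very unsafe"}
sample[safety_col] = sample[safety_col].replace(fixes)

print(sample[safety_col].value_counts())
```

Applying a fix like this before encoding would keep the stray label from receiving its own integer code.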


In [5]:
# Display column names
print("Column Names:")
for i, col in enumerate(data.columns, 1):
    print(f"{i}. {col}")

Column Names:
1. Timestamp
2. 1. What is your age group?
3. 2. What is your occupation?
4. 3. At what time of day did the incident occur?
5. 4. Where did the incident occur?
6. 5. How crowded was the location at the time of the incident?
7. 6. What was the lighting condition in the area?
8. 7. Was any form of security present at the location?
9. 8. Were you familiar with the area where the incident occurred?
10. 9. What type of harassment did you experience?
11. 10. How often have you experienced harassment in similar situations?
12. 11. How safe did you feel during the incident?
13. 12. Overall, how would you rate the risk level of harassment in that situation?
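
The column names are full survey questions, which makes later code verbose. A minimal sketch of renaming them follows; the short names in `short_names` are hypothetical choices made for illustration, and only the first four questions are shown.

```python
import pandas as pd

# Hypothetical short names; the keys are survey questions from the list above.
short_names = {
    "1. What is your age group?": "age_group",
    "2. What is your occupation?": "occupation",
    "3. At what time of day did the incident occur?": "time_of_day",
    "4. Where did the incident occur?": "location",
}

# One-row stand-in frame with the original question headers.
sample = pd.DataFrame(
    [["18-25", "Student", "Evening", "Public transport"]],
    columns=list(short_names),
)

# rename leaves any column that is missing from the mapping unchanged.
sample = sample.rename(columns=short_names)
print(sample.columns.tolist())
```

If the columns are renamed, any saved encoders would be keyed by the new names, so the same mapping would have to be applied wherever the encoders are reused.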


## 4. Check for Missing Values

Count the missing values in each column and drop any rows that contain them.

In [7]:
# Check for missing values
print("=" * 60)
print("CHECKING MISSING DATA")
print("=" * 60)

missing = data.isnull().sum()
print("\nMissing values per column:")
print(missing)

if missing.sum() > 0:
    print(f"\nTotal missing values: {missing.sum()}")
    print("\nDropping rows with missing values...")
    data = data.dropna()
    print(f"New shape after dropping missing values: {data.shape}")
else:
    print("\nNo missing values found!")

CHECKING MISSING DATA

Missing values per column:
Timestamp                                                                          0
1. What is your age group?                                                         0
2. What is your occupation?                                                        0
3. At what time of day did the incident occur?                                     0
4. Where did the incident occur?                                                   0
5. How crowded was the location at the time of the incident?                       0
6. What was the lighting condition in the area?                                    0
7. Was any form of security present at the location?                               0
8. Were you familiar with the area where the incident occurred?                    0
9. What type of harassment did you experience?                                     0
10. How often have you experienced harassment in similar situations?               0
11. How safe di
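
The check above found no missing values, so the drop branch never ran. With only 115 rows, dropping rows could be costly if a later export does contain gaps. The sketch below shows one possible alternative for categorical answers, filling each gap with the column's most frequent value. The column name `lighting` and the frame `sample` are hypothetical.

```python
import pandas as pd

# Stand-in frame with one missing answer.
sample = pd.DataFrame({
    "lighting": [
        "The lighting was poor",
        None,
        "The lighting was poor",
        "The lighting was moderate",
    ],
})

# Fill each column's gaps with its most frequent value (the mode).
for col in sample.columns:
    if sample[col].isnull().any():
        most_common = sample[col].mode().iloc[0]
        sample[col] = sample[col].fillna(most_common)

print(sample["lighting"].tolist())
```

Mode imputation keeps every row but nudges the distribution toward the dominant answer, so whether it is preferable to dropping rows depends on how many values are missing.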

## 5. Remove Duplicates

Remove any duplicate rows from the dataset.

In [8]:
# Remove duplicates
print("=" * 60)
print("REMOVING DUPLICATES")
print("=" * 60)

initial_rows = len(data)
data = data.drop_duplicates()
final_rows = len(data)

duplicates_removed = initial_rows - final_rows

print(f"\nInitial rows: {initial_rows}")
print(f"Duplicates removed: {duplicates_removed}")
print(f"Final shape: {data.shape}")

REMOVING DUPLICATES

Initial rows: 115
Duplicates removed: 0
Final shape: (115, 13)
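
Every `Timestamp` value is unique (115 unique values in the summary), so a full-row comparison cannot flag two submissions that contain the same answers. The sketch below compares rows on the answer columns only; the three-row frame `sample` is a stand-in for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "Timestamp": [
        "2026-01-29 22:19:51",
        "2026-01-29 22:20:52",
        "2026-01-30 20:28:06",
    ],
    "1. What is your age group?": ["18-25", "18-25", "26-35"],
    "2. What is your occupation?": ["Student", "Student", "Self-employed"],
})

# Compare rows on every column except the per-submission timestamp.
answer_cols = [c for c in sample.columns if c != "Timestamp"]
repeat_mask = sample.duplicated(subset=answer_cols, keep="first")

print(f"Full-row duplicates: {sample.duplicated().sum()}")
print(f"Repeated answer sets: {repeat_mask.sum()}")
```

Whether repeated answer sets should be dropped is a judgment call, since different respondents can legitimately give identical answers.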


## 6. Encode Categorical Variables

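Convert categorical variables to numerical format using Label Encoding. `LabelEncoder` numbers the labels in sorted (alphabetical) order, so the codes carry no ordinal meaning. The cell below encodes every text column, including `Timestamp` and the risk rating in question 12.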

In [9]:
# Encode categorical variables
print("=" * 60)
print("ENCODING CATEGORICAL VARIABLES")
print("=" * 60)

# Identify categorical columns
categorical_cols = data.select_dtypes(include=['object']).columns.tolist()

if categorical_cols:
    print(f"\nCategorical columns found: {categorical_cols}")
    
    label_encoders = {}
    
    for col in categorical_cols:
        print(f"\nEncoding '{col}'...")
        print(f"   Unique values before encoding: {data[col].nunique()}")
        
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])
        label_encoders[col] = le
        
        print(f"   Encoding completed for '{col}'")
    
    # Save label encoders for later use
    os.makedirs('../models', exist_ok=True)
    joblib.dump(label_encoders, '../models/label_encoders.pkl')
    print("\nLabel encoders saved to '../models/label_encoders.pkl'")
else:
    print("\nNo categorical columns found!")

ENCODING CATEGORICAL VARIABLES

Categorical columns found: ['Timestamp', '1. What is your age group?', '2. What is your occupation?', '3. At what time of day did the incident occur?', '4. Where did the incident occur?', '5. How crowded was the location at the time of the incident?', '6. What was the lighting condition in the area?', '7. Was any form of security present at the location?', '8. Were you familiar with the area where the incident occurred?', '9. What type of harassment did you experience?', '10. How often have you experienced harassment in similar situations?', '11. How safe did you feel during the incident?', '12. Overall, how would you rate the risk level of harassment in that situation?']

Encoding 'Timestamp'...
   Unique values before encoding: 115
   Encoding completed for '{col}'

Encoding '1. What is your age group?'...
   Unique values before encoding: 5
   Encoding completed for '{col}'

Encoding '2. What is your occupation?'...
   Unique values before encoding: 5
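
Several answers, such as the risk rating, have a natural order that alphabetical label codes do not preserve. The sketch below contrasts the two approaches for the four risk labels visible in the outputs above. The summary reports 5 unique values for this question, so one level is not shown here and would need to be added. The integer codes in `risk_order` are an assumed ordering, not the author's.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

risk_labels = pd.Series([
    "There was no risk of harassment at all",
    "There was a very high risk of harassment",
    "There was a high risk of harassment",
    "There was a moderate risk of harassment",
])

# LabelEncoder sorts the labels alphabetically before numbering them.
alphabetical = LabelEncoder().fit(risk_labels)
print(dict(zip(alphabetical.classes_, range(len(alphabetical.classes_)))))

# Assumed ordering, lowest risk first; extend it with any level not listed.
risk_order = {
    "There was no risk of harassment at all": 0,
    "There was a moderate risk of harassment": 1,
    "There was a high risk of harassment": 2,
    "There was a very high risk of harassment": 3,
}
encoded = risk_labels.map(risk_order)

# map() yields NaN for unknown labels, so fail loudly instead of passing them on.
if encoded.isnull().any():
    raise ValueError("Found risk labels missing from risk_order")
print(encoded.tolist())
```

With the alphabetical codes, "high" sorts before "moderate" and "no risk" sorts last, which is why an explicit mapping can suit ordinal answers better.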

## 7. Save Cleaned Data

Save the cleaned and prepared dataset for the next stage of the pipeline.

In [10]:
# Save cleaned data
output_path = "../data/women_risk_cleaned.csv"

print("=" * 60)
print("SAVING CLEANED DATA")
print("=" * 60)

data.to_csv(output_path, index=False)

print(f"\nCleaned data saved to: {output_path}")
print(f"Final shape: {data.shape}")
print(f"Rows: {data.shape[0]}")
print(f"Columns: {data.shape[1]}")

print("\n" + "=" * 60)
print("DATA PREPARATION COMPLETED SUCCESSFULLY!")
print("=" * 60)

SAVING CLEANED DATA

Cleaned data saved to: ../data/women_risk_cleaned.csv
Final shape: (115, 13)
Rows: 115
Columns: 13

DATA PREPARATION COMPLETED SUCCESSFULLY!
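
A later stage will need to read the cleaned CSV and the saved encoders together. The sketch below is a round-trip check on toy data, assuming the encoders are stored as a dictionary keyed by column name, as in the encoding step. The column `occupation`, the frame `raw`, and the temporary file paths are illustrative stand-ins for the real files.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real data; files go to a temporary directory.
raw = pd.DataFrame({"occupation": ["Student", "Self-employed", "Student"]})

encoder = LabelEncoder()
encoded = raw.copy()
encoded["occupation"] = encoder.fit_transform(raw["occupation"])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "cleaned.csv")
    enc_path = os.path.join(tmp_dir, "label_encoders.pkl")
    encoded.to_csv(csv_path, index=False)
    joblib.dump({"occupation": encoder}, enc_path)

    # Reload both artifacts, as a later notebook would.
    reloaded = pd.read_csv(csv_path)
    encoders = joblib.load(enc_path)

# inverse_transform maps the integer codes back to the original labels.
decoded = encoders["occupation"].inverse_transform(reloaded["occupation"])
print(decoded.tolist())
```

If the decoded labels match the raw ones, the CSV and the encoder file are consistent with each other.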
