# Feature Engineering for Predictive Safety Risk Classifier

This notebook performs feature engineering on the transformed Chicago Crime Dataset to prepare it for modeling. We first split the data into train, validation, and test sets to avoid data leakage, then engineer features intended to improve our safety risk predictions.

## Dependencies
- `pandas` and `numpy`: For data manipulation.
- `scikit-learn`: For splitting the dataset.

In [10]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Step 1: Load the transformed dataset
# - File: chicago_crimes_latest_transformed.csv from the ETL step
df = pd.read_csv("chicago_crimes_latest_transformed.csv")
print(f"Initial shape: {df.shape}")
print(df.head())

Initial shape: (34432, 5)
    latitude  longitude  CrimeCount  ViolentCount  Risk
0  41.644604 -87.610728           1             0     0
1  41.644608 -87.598848           1             0     0
2  41.645378 -87.540022           1             0     0
3  41.646123 -87.542896           1             0     0
4  41.647038 -87.616003           1             1     0


## Step 1: Train-Test-Validation Split
Before feature engineering, we split the data into train (70%), validation (15%), and test (15%) sets to prevent leakage of test information into the training process.

In [11]:
# Train-Test-Validation Split
# Step 1: Define features (X) and target (y)
X = df.drop(columns=["Risk"])  # Features: latitude, longitude, CrimeCount, ViolentCount
y = df["Risk"]

# Step 2: First split: 70% train, 30% temp (val + test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Step 3: Second split: halve temp into 15% validation and 15% test
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# Step 4: Verify split sizes
print(f"Train shape: {X_train.shape}, Validation shape: {X_val.shape}, Test shape: {X_test.shape}")
print(f"Train risk distribution:\n{y_train.value_counts(normalize=True)}")
print(f"Validation risk distribution:\n{y_val.value_counts(normalize=True)}")
print(f"Test risk distribution:\n{y_test.value_counts(normalize=True)}")

Train shape: (24102, 4), Validation shape: (5165, 4), Test shape: (5165, 4)
Train risk distribution:
Risk
0    0.803336
1    0.196664
Name: proportion, dtype: float64
Validation risk distribution:
Risk
0    0.803291
1    0.196709
Name: proportion, dtype: float64
Test risk distribution:
Risk
0    0.803485
1    0.196515
Name: proportion, dtype: float64
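
The leakage argument rests on the three splits being disjoint and together covering every row. A minimal sketch of how that could be checked, using a small synthetic frame in place of the real data (the `demo` frame and its values are hypothetical, chosen only to mimic a roughly 20% positive class):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (hypothetical values, ~20% positive class)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "CrimeCount": rng.integers(1, 5, size=1000),
    "Risk": (rng.random(1000) < 0.2).astype(int),
})
X_demo, y_demo = demo.drop(columns=["Risk"]), demo["Risk"]

# Same two-stage split as above: 70% train, then the remaining 30% halved
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

# The three index sets should be pairwise disjoint and cover every row
idx_tr, idx_va, idx_te = set(X_tr.index), set(X_va.index), set(X_te.index)
no_overlap = not (idx_tr & idx_va or idx_tr & idx_te or idx_va & idx_te)
full_cover = (idx_tr | idx_va | idx_te) == set(demo.index)
print(f"No overlap: {no_overlap}, full coverage: {full_cover}")
print(f"Sizes: {len(X_tr)}, {len(X_va)}, {len(X_te)}")
```

The same index-set comparison should work on `X_train`, `X_val`, and `X_test`, since `train_test_split` preserves the original row labels.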


## Step 2: Feature Engineering
We create new features to improve model performance. Each feature is computed row by row from that row's own columns, so the same function can be applied to the training, validation, and test sets separately without passing information between splits.

In [12]:
# Feature Engineering
# Step 1: Define a function to engineer features
def engineer_features(df):
    # Feature 1: Crime Density proxy (CrimeCount divided by |latitude * longitude|)
    # Note: the denominator is a product of coordinates, not an area or a count of locations
    df["CrimeDensity"] = df["CrimeCount"] / (df["latitude"] * df["longitude"]).abs()

    # Feature 2: Violent Crime Ratio
    df["ViolentRatio"] = df["ViolentCount"] / df["CrimeCount"].replace(0, 1) # Avoid division by zero

    # Feature 3: Distance from City Center (Chicago: 41.8781° N, 87.6298° W)
    # Euclidean distance in degrees, not kilometers
    city_center_lat, city_center_lon = 41.8781, -87.6298
    df["DistanceFromCenter"] = np.sqrt(
        (df["latitude"] - city_center_lat)**2 + (df["longitude"] - city_center_lon)**2
    )

    return df

# Step 2: Apply to train, validation, and test sets
X_train_fe = engineer_features(X_train.copy())
X_val_fe = engineer_features(X_val.copy())
X_test_fe = engineer_features(X_test.copy())

# Step 3: Verify new features
print("Training set with new features:")
print(X_train_fe.head())
print(X_train_fe.describe())

Training set with new features:
        latitude  longitude  CrimeCount  ViolentCount  CrimeDensity  \
8215   41.766610 -87.571418           1             0      0.000273   
2682   41.719901 -87.563490           2             2      0.000547   
20512  41.880992 -87.704487           2             0      0.000544   
4882   41.745807 -87.604266           2             1      0.000547   
25661  41.909922 -87.723248           1             0      0.000272   

       ViolentRatio  DistanceFromCenter  
8215            0.0            0.125851  
2682            1.0            0.171534  
20512           0.0            0.074743  
4882            0.5            0.134734  
25661           0.0            0.098718  
           latitude     longitude    CrimeCount  ViolentCount  CrimeDensity  \
count  24102.000000  24102.000000  24102.000000  24102.000000  24102.000000   
mean      41.844080    -87.671596      1.447930      0.451332      0.000395   
std        0.087877      0.058380      1.822483     
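
`DistanceFromCenter` above is a Euclidean distance in raw degrees. At Chicago's latitude, one degree of longitude covers only about 74% of the ground distance of one degree of latitude, so east-west offsets are overweighted. As one possible alternative (not the method used in this notebook), a haversine distance in kilometers could be computed instead. The sketch below assumes a mean Earth radius of 6371 km; `haversine_km` and the `DistanceFromCenterKm` column are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, a common approximation

def haversine_km(lat, lon, center_lat=41.8781, center_lon=-87.6298):
    """Great-circle distance in km from each (lat, lon) to the reference point."""
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(center_lat), np.radians(center_lon)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two sample points: the city center itself and one training row from the output above
demo = pd.DataFrame({
    "latitude": [41.8781, 41.766610],
    "longitude": [-87.6298, -87.571418],
})
demo["DistanceFromCenterKm"] = haversine_km(demo["latitude"], demo["longitude"])
print(demo)
```

Whether this helps depends on the model: tree-based models are insensitive to monotonic rescaling of a single feature, but the haversine form also changes the relative weight of latitude and longitude offsets, which can reorder points.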

## Step 3: Save the Engineered Datasets
We recombine features with targets and save the split datasets for modeling.

In [14]:
# Save the Engineered Datasets
# Step 1: Recombine features and targets
# Features and targets still share the original row index, so concat aligns them row by row.
# (Resetting the index on only one side would misalign rows and introduce NaNs.)
train_df = pd.concat([X_train_fe, y_train], axis=1)
val_df = pd.concat([X_val_fe, y_val], axis=1)
test_df = pd.concat([X_test_fe, y_test], axis=1)

# Step 2: Export to CSV
train_df.to_csv("train_engineered.csv", index=False)
val_df.to_csv("val_engineered.csv", index=False)
test_df.to_csv("test_engineered.csv", index=False)

# Step 3: Confirm completion
print("Engineered datasets saved: train_engineered.csv, val_engineered.csv, test_engineered.csv")

Engineered datasets saved: train_engineered.csv, val_engineered.csv, test_engineered.csv
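
Because `pd.concat(..., axis=1)` aligns rows by index label rather than by position, the recombination step is sensitive to whether features and targets carry the same index. A small sketch with hypothetical values (`X_demo` and `y_demo` are illustrative, not taken from the dataset) shows the difference:

```python
import pandas as pd

# Hypothetical features and labels that still carry their shuffled post-split index
X_demo = pd.DataFrame({"CrimeCount": [1, 2, 2]}, index=[8215, 2682, 20512])
y_demo = pd.Series([0, 1, 0], index=[8215, 2682, 20512], name="Risk")

# Matching index labels: rows stay paired
aligned = pd.concat([X_demo, y_demo], axis=1)

# Resetting only one side: labels no longer match, so pandas takes the union of both indexes
misaligned = pd.concat([X_demo, y_demo.reset_index(drop=True)], axis=1)

print(aligned.shape, int(aligned.isna().sum().sum()))
print(misaligned.shape, int(misaligned.isna().sum().sum()))
```

A quick guard before export would be to confirm that each recombined frame has the same number of rows as its feature frame and contains no missing values.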
