# Feature Engineering for Predictive Safety Risk Classifier

This notebook performs feature engineering on the transformed Chicago Crime Dataset to prepare it for modeling. We first split the data into training, validation, and test sets to avoid data leakage, then derive new features intended to improve the safety risk classifier.

**Overview:**
- **Dependencies:** pandas, numpy, and scikit-learn.
- **Steps:**
  1. Load the transformed dataset.
  2. Split the data into train (70%), validation (15%), and test (15%) sets.
  3. Engineer new features on the training set and replicate these transformations on the validation and test sets.
  4. Save the engineered datasets for modeling.

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Step 1: Load the transformed dataset
# File generated in the ETL step
df = pd.read_csv("chicago_crimes_latest_transformed.csv")
print(f"Initial dataset shape: {df.shape}")
print(df.head())

Initial dataset shape: (34432, 5)
    latitude  longitude  CrimeCount  ViolentCount  Risk
0  41.644604 -87.610728           1             0     0
1  41.644608 -87.598848           1             0     0
2  41.645378 -87.540022           1             0     0
3  41.646123 -87.542896           1             0     0
4  41.647038 -87.616003           1             1     0
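
Before splitting, it can help to confirm that the loaded frame has the columns the later cells rely on. The sketch below is one possible check, not part of the original notebook: `validate_schema` is a hypothetical helper, the rule that `ViolentCount` should not exceed `CrimeCount` is an assumption about the data, and the sample rows are copied from the `df.head()` output above.

```python
import pandas as pd

REQUIRED_COLUMNS = ["latitude", "longitude", "CrimeCount", "ViolentCount", "Risk"]

def validate_schema(df):
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need these columns
    # Assumption: violent crimes are a subset of all crimes at a location.
    if (df["ViolentCount"] > df["CrimeCount"]).any():
        problems.append("ViolentCount exceeds CrimeCount in at least one row")
    # The target printed above is binary (0 or 1).
    if not df["Risk"].dropna().isin([0, 1]).all():
        problems.append("Risk contains values other than 0 and 1")
    return problems

# Rows copied from the df.head() output above.
sample = pd.DataFrame({
    "latitude": [41.644604, 41.644608, 41.645378, 41.646123, 41.647038],
    "longitude": [-87.610728, -87.598848, -87.540022, -87.542896, -87.616003],
    "CrimeCount": [1, 1, 1, 1, 1],
    "ViolentCount": [0, 0, 0, 0, 1],
    "Risk": [0, 0, 0, 0, 0],
})
print(validate_schema(sample))  # prints []
```

Returning a list of messages instead of raising lets the caller decide whether a problem should stop the notebook or only be logged.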


## Step 1: Train–Test–Validation Split

Before applying feature engineering, we split the data into training, validation, and test sets so that no test information leaks into the training process. Both splits are stratified on `Risk` to keep the class balance the same across the three sets.

In [2]:
# Define features (X) and target (y)
X = df.drop(columns=["Risk"])  # Features: latitude, longitude, CrimeCount, ViolentCount
y = df["Risk"]

# Ensure that only rows with non-null target values are retained
print("NaNs in target before split:", y.isnull().sum())
X = X[y.notnull()]
y = y[y.notnull()]

# First split: train (70%) and a temporary set (30%, divided into validation and test below)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Second split: halve the temporary set into validation (15%) and test (15%)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# Verify split sizes and target distribution
print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
print(f"Test set shape: {X_test.shape}\n")
print("Training risk distribution:\n", y_train.value_counts(normalize=True))
print("Validation risk distribution:\n", y_val.value_counts(normalize=True))
print("Test risk distribution:\n", y_test.value_counts(normalize=True))

NaNs in target before split: 0
Training set shape: (24102, 4)
Validation set shape: (5165, 4)
Test set shape: (5165, 4)

Training risk distribution:
 Risk
0    0.803336
1    0.196664
Name: proportion, dtype: float64
Validation risk distribution:
 Risk
0    0.803291
1    0.196709
Name: proportion, dtype: float64
Test risk distribution:
 Risk
0    0.803485
1    0.196515
Name: proportion, dtype: float64
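
The shapes printed above can be reproduced by hand, which is a useful check that the two-stage split really yields 70/15/15. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the other part receives the remainder; `split_sizes` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (kept, held_out) row counts for a fractional test_size.

    Assumption: the held-out count is rounded up and the kept part gets the rest.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 34432  # rows in the loaded dataset
n_train, n_temp = split_sizes(n_total, 0.3)   # first split: 70% / 30%
n_val, n_test = split_sizes(n_temp, 0.5)      # second split: halve the 30%

print(n_train, n_val, n_test)  # prints 24102 5165 5165
print(round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
```

Because the second split uses `test_size=0.5` on the 30% remainder, each of the validation and test sets ends up with 15% of the original rows.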


## Step 2: Feature Engineering

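We now create new features that may help improve model performance. All transformations are defined once and applied identically to the training, validation, and test sets. The features below are row-wise functions of existing columns, so they need no statistics from any split and cannot leak information between them.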
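
A transformation that does learn parameters from data, such as standardization, would need them fitted on the training split only and reused on the other splits. The notebook does not scale its features, so the following is only a minimal sketch with made-up numbers; `fit_standardizer` and `apply_standardizer` are hypothetical helpers.

```python
import pandas as pd

def fit_standardizer(train, columns):
    """Estimate mean and standard deviation from the training split only."""
    std = train[columns].std(ddof=0).replace(0, 1)  # guard against constant columns
    return {"mean": train[columns].mean(), "std": std}

def apply_standardizer(df, params, columns):
    """Scale a split with previously fitted parameters; the input is not modified."""
    out = df.copy()
    out[columns] = (df[columns] - params["mean"]) / params["std"]
    return out

# Made-up values for illustration only.
train = pd.DataFrame({"CrimeCount": [1, 2, 3, 2]})
val = pd.DataFrame({"CrimeCount": [2, 5]})
cols = ["CrimeCount"]

params = fit_standardizer(train, cols)               # fitted on train only
train_scaled = apply_standardizer(train, params, cols)
val_scaled = apply_standardizer(val, params, cols)   # reuses the train statistics
print(val_scaled)
```

The validation values are scaled with the training mean and standard deviation, so nothing about the validation distribution feeds back into the transformation.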

In [3]:
def engineer_features(df):
    """
    Engineer new features for the safety risk classifier.

    Features added:
    - CrimeDensity: CrimeCount divided by |latitude * longitude|, with zero denominators
      replaced by 1. This is a heuristic scaling of CrimeCount, not a count per unit area.
    - ViolentRatio: ViolentCount / CrimeCount, with a zero CrimeCount replaced by 1 to
      avoid division by zero.
    - DistanceFromCenter: Euclidean distance, in decimal degrees, from the Chicago city
      center (41.8781, -87.6298). A missing coordinate contributes 0 to the distance.

    Parameters:
        df (pd.DataFrame): DataFrame containing at least the columns 'latitude', 'longitude',
                           'CrimeCount', and 'ViolentCount'.

    Returns:
        pd.DataFrame: The same DataFrame, modified in place, with the new feature columns.
    """
    # Feature 1: Crime Density
    # Avoid division by zero by replacing any zero denominators with 1.
    denominator = (df["latitude"] * df["longitude"]).abs()
    df["CrimeDensity"] = df["CrimeCount"] / denominator.replace(0, 1)
    df["CrimeDensity"] = df["CrimeDensity"].fillna(0)

    # Feature 2: Violent Crime Ratio
    df["ViolentRatio"] = df["ViolentCount"] / df["CrimeCount"].replace(0, 1)
    df["ViolentRatio"] = df["ViolentRatio"].fillna(0)

    # Feature 3: Distance from City Center (Chicago: 41.8781° N, 87.6298° W)
    city_center_lat, city_center_lon = 41.8781, -87.6298
    df["DistanceFromCenter"] = np.sqrt(
        (df["latitude"].fillna(city_center_lat) - city_center_lat) ** 2 +
        (df["longitude"].fillna(city_center_lon) - city_center_lon) ** 2
    )
    return df

# Apply feature engineering to train, validation, and test sets
X_train_fe = engineer_features(X_train.copy())
X_val_fe = engineer_features(X_val.copy())
X_test_fe = engineer_features(X_test.copy())

# Verify that new features have been added
print("Training set with new features:")
print(X_train_fe.head())
print("\nStatistical summary of engineered features:")
print(X_train_fe.describe())

Training set with new features:
        latitude  longitude  CrimeCount  ViolentCount  CrimeDensity  \
8215   41.766610 -87.571418           1             0      0.000273   
2682   41.719901 -87.563490           2             2      0.000547   
20512  41.880992 -87.704487           2             0      0.000544   
4882   41.745807 -87.604266           2             1      0.000547   
25661  41.909922 -87.723248           1             0      0.000272   

       ViolentRatio  DistanceFromCenter  
8215            0.0            0.125851  
2682            1.0            0.171534  
20512           0.0            0.074743  
4882            0.5            0.134734  
25661           0.0            0.098718  

Statistical summary of engineered features:
           latitude     longitude    CrimeCount  ViolentCount  CrimeDensity  \
count  24102.000000  24102.000000  24102.000000  24102.000000  24102.000000   
mean      41.844080    -87.671596      1.447930      0.451332      0.000395   
[remaining output truncated]
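
`DistanceFromCenter` is measured in decimal degrees, and at Chicago's latitude one degree of longitude covers less ground than one degree of latitude. If distances in kilometers are preferred, the haversine great-circle formula is one common alternative. The sketch below is not what the notebook uses: `haversine_km` is a hypothetical helper, and the Earth radius of 6371 km is an assumption of the example. The two sample points are taken from the training-set output above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def haversine_km(lat, lon, center_lat=41.8781, center_lon=-87.6298):
    """Great-circle distance in km from (lat, lon) to the center.

    Accepts scalars, NumPy arrays, or pandas Series of decimal degrees.
    """
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(center_lat), np.radians(center_lon)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two locations from the training-set head shown above.
lats = np.array([41.766610, 41.719901])
lons = np.array([-87.571418, -87.563490])
print(haversine_km(lats, lons))

# One degree north versus one degree east of the center.
print(haversine_km(42.8781, -87.6298), haversine_km(41.8781, -86.6298))
```

The last line illustrates the asymmetry: the eastward degree is noticeably shorter than the northward one, which the Euclidean distance in degrees treats as equal.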

## Step 3: Save the Engineered Datasets

After engineering the features, we recombine the features with their respective targets and save the datasets. These files will be used in the modeling step.

In [4]:
# Recombine features and targets for each split
train_df = pd.concat([X_train_fe.reset_index(drop=True), y_train.reset_index(drop=True)], axis=1)
val_df = pd.concat([X_val_fe.reset_index(drop=True), y_val.reset_index(drop=True)], axis=1)
test_df = pd.concat([X_test_fe.reset_index(drop=True), y_test.reset_index(drop=True)], axis=1)

# Check for any remaining NaNs in the validation set (as an example)
print("NaN counts in validation set before saving:")
print(val_df.isnull().sum())

# Export the engineered datasets to CSV
train_df.to_csv("../train_engineered.csv", index=False)
val_df.to_csv("../val_engineered.csv", index=False)
test_df.to_csv("../test_engineered.csv", index=False)

print("Engineered datasets saved successfully as:")
print("- train_engineered.csv")
print("- val_engineered.csv")
print("- test_engineered.csv")

NaN counts in validation set before saving:
latitude              0
longitude             0
CrimeCount            0
ViolentCount          0
CrimeDensity          0
ViolentRatio          0
DistanceFromCenter    0
Risk                  0
dtype: int64
Engineered datasets saved successfully as:
- train_engineered.csv
- val_engineered.csv
- test_engineered.csv
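
When the modeling step reloads these files, a quick round-trip check can confirm that the columns and row counts survived the export. The sketch below writes to a temporary directory and uses a two-row stand-in frame rather than the real files; `check_round_trip` is a hypothetical helper, and the `Risk` values in the stand-in are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

EXPECTED_COLUMNS = [
    "latitude", "longitude", "CrimeCount", "ViolentCount",
    "CrimeDensity", "ViolentRatio", "DistanceFromCenter", "Risk",
]

def check_round_trip(df, path):
    """Write df to CSV, read it back, and verify columns, shape, and absence of NaNs."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    assert list(reloaded.columns) == EXPECTED_COLUMNS, "unexpected columns"
    assert reloaded.shape == df.shape, "row or column count changed"
    assert not reloaded.isnull().any().any(), "NaNs after reload"
    return reloaded

# Stand-in rows: feature values from the output above, Risk values made up.
stand_in = pd.DataFrame(
    [[41.766610, -87.571418, 1, 0, 0.000273, 0.0, 0.125851, 0],
     [41.719901, -87.563490, 2, 2, 0.000547, 1.0, 0.171534, 1]],
    columns=EXPECTED_COLUMNS,
)

with tempfile.TemporaryDirectory() as tmp:
    reloaded = check_round_trip(stand_in, Path(tmp) / "train_engineered.csv")

print(reloaded.shape)  # prints (2, 8)
```

Saving with `index=False`, as the notebook does, is what keeps the reloaded column list identical to the one written.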
