# ETL: Extract, Transform, and Load for a Predictive Safety Risk Classifier

This notebook fetches the latest Chicago Crime Dataset via the Socrata API, transforms it using pandas, and saves the result for further analysis.  
**Overview:**  
- **Extract:** Retrieve up to 50,000 records from the Chicago Crime Dataset.  
- **Transform:** Clean data, extract time features, and aggregate by location to generate our target variable (Risk).  
- **Load:** Save the transformed dataset as a CSV for downstream processing.

## Step 0: Install Dependencies

Ensure the required library `sodapy` is installed.  
*Tip: In production, consider using a requirements.txt or environment management tool (e.g., conda) instead of inline pip commands.*

In [1]:
# Install sodapy into the active kernel's environment (only needed once)
%pip install sodapy

print("sodapy installed successfully. Please restart the kernel before proceeding.")

sodapy installed successfully. Please restart the kernel before proceeding.
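
To avoid reinstalling on every run, the notebook could first check whether `sodapy` is already importable. The following is a minimal sketch using only the standard library; the helper name `is_installed` and the printed messages are illustrative and not part of the original notebook.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# `is_installed` is a hypothetical helper; only the check itself matters here.
if is_installed("sodapy"):
    print("sodapy is available; the install cell can be skipped.")
else:
    print("sodapy is missing; run the install cell, then restart the kernel.")
```

A check like this only reports what the current kernel can see, so a restart is still needed after a fresh install.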


## Step 1: Restart the Kernel

After installing sodapy, restart the kernel (Kernel > Restart) and continue from Step 2. The install cell does not need to be run again.

In [2]:
# Step 1: Placeholder for kernel restart confirmation
# - This cell does nothing but confirm you've restarted the kernel when you run it
print("Kernel restarted successfully. Now we can proceed with imports.")

Kernel restarted successfully. Now we can proceed with imports.


## Step 2: Import Libraries and Initialize Socrata Client

We import the necessary libraries and set up the Socrata API client.  
*Note: Public datasets can be read without an app token (the `None` argument to `Socrata`), but requests made without one are throttled more strictly.*

In [1]:
import pandas as pd
import numpy as np
from sodapy import Socrata
from datetime import datetime

# Initialize Socrata client for the Chicago data portal
client = Socrata("data.cityofchicago.org", None)

# Define the date range
start_date = "2024-09-01"  # about six months before the end date
end_date = "2025-03-01"    # 7 days before the run date (per data policy)
print(f"Data will be fetched from {start_date} to {end_date}.")



Data will be fetched from 2024-09-01 to 2025-03-01.
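
The two dates are hard-coded, so they go stale whenever the notebook is re-run later. As a sketch, the same window can be derived from the run date. The function name `date_window` and its parameters are assumptions for illustration; the 7-day lag and the roughly six-month span (181 days here) mirror the comments in the cell above.

```python
from datetime import date, timedelta


def date_window(today: date, lag_days: int = 7, window_days: int = 181) -> tuple[str, str]:
    """Return (start, end) ISO dates: end lags `today`, start precedes end."""
    end = today - timedelta(days=lag_days)
    start = end - timedelta(days=window_days)
    return start.isoformat(), end.isoformat()


# Using 8 March 2025 as the run date reproduces the hard-coded values.
start_date, end_date = date_window(date(2025, 3, 8))
print(f"Data will be fetched from {start_date} to {end_date}.")
```

In a live run, `date.today()` would replace the fixed date.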


## Step 3: Extract Data

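We query the Chicago Crime Dataset (ID: `ijzp-q8t2`) for records in the date range, capping the result at 50,000 rows for manageability. The query specifies no ordering, so if more than 50,000 records fall in the range, the server decides which ones are returned.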

In [2]:
# Define and execute the query to extract data
query = f"date between '{start_date}' and '{end_date}'"
try:
    results = client.get("ijzp-q8t2", where=query, limit=50000)
    print("Data extraction successful.")
except Exception as e:
    # Stop here: an empty result would only fail later, in the cleaning step
    print("Error during data extraction:", e)
    raise

# Convert results to DataFrame
df = pd.DataFrame.from_records(results)
print(f"Initial dataset shape: {df.shape}")
print(df.head())

Data extraction successful.
Initial dataset shape: (50000, 22)
         id case_number                     date                  block  iucr  \
0  13701180    JH550927  2024-09-01T00:00:00.000       0000X E 117TH PL  1310   
1  13700621    JH553061  2024-09-01T00:00:00.000  057XX N MAPLEWOOD AVE  2820   
2  13703231    JH556313  2024-09-01T00:00:00.000        033XX W 84TH ST  2825   
3  13704073    JH557391  2024-09-01T00:00:00.000  054XX N EAST RIVER RD  1320   
4  13707285    JH561154  2024-09-01T00:00:00.000      081XX S DAMEN AVE  0810   

      primary_type              description location_description  arrest  \
0  CRIMINAL DAMAGE              TO PROPERTY            APARTMENT   False   
1    OTHER OFFENSE         TELEPHONE THREAT            RESIDENCE   False   
2    OTHER OFFENSE  HARASSMENT BY TELEPHONE            RESIDENCE   False   
3  CRIMINAL DAMAGE               TO VEHICLE            APARTMENT   False   
4            THEFT                OVER $500            RESIDENCE   Fal
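
The cell above asks for at most 50,000 rows in a single request. If the full window is needed, one option is to page through the results with an explicit sort. The sketch below shows only the paging loop; `fetch_all` and `get_page` are hypothetical names, and the loop is exercised against an in-memory stand-in rather than the live API. With sodapy, `get_page` could wrap `client.get(...)` using its `order`, `limit`, and `offset` keyword arguments, as shown in the comment.

```python
from typing import Callable

Page = list[dict]


def fetch_all(get_page: Callable[[int, int], Page],
              page_size: int = 10000,
              max_rows: int = 50000) -> Page:
    """Collect pages until a short page arrives or `max_rows` is reached."""
    rows: Page = []
    offset = 0
    while len(rows) < max_rows:
        limit = min(page_size, max_rows - len(rows))
        page = get_page(limit, offset)
        rows.extend(page)
        if len(page) < limit:  # a short page means the data is exhausted
            break
        offset += len(page)
    return rows


# Hypothetical wiring to the notebook's client (not executed here):
# results = fetch_all(lambda limit, offset: client.get(
#     "ijzp-q8t2", where=query, order="date, id", limit=limit, offset=offset))

# Offline stand-in for the API: 25 fake rows served in slices.
fake_rows = [{"id": i} for i in range(25)]


def fake_get_page(limit: int, offset: int) -> Page:
    return fake_rows[offset:offset + limit]


sample = fetch_all(fake_get_page, page_size=10, max_rows=22)
print(len(sample))
```

Sorting on a stable key keeps pages from overlapping or skipping rows between requests.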

## Step 4: Explore the Raw Data

Before transforming, we inspect the data structure and check for missing values.

In [3]:
# Display available columns and missing values
print("Columns in raw data:", df.columns.tolist())
print("Missing values in raw data:\n", df.isnull().sum())

Columns in raw data: ['id', 'case_number', 'date', 'block', 'iucr', 'primary_type', 'description', 'location_description', 'arrest', 'domestic', 'beat', 'district', 'ward', 'community_area', 'fbi_code', 'x_coordinate', 'y_coordinate', 'year', 'updated_on', 'latitude', 'longitude', 'location']
Missing values in raw data:
 id                        0
case_number               0
date                      0
block                     0
iucr                      0
primary_type              0
description               0
location_description    153
arrest                    0
domestic                  0
beat                      0
district                  0
ward                      0
community_area            1
fbi_code                  0
x_coordinate             10
y_coordinate             10
year                      0
updated_on                0
latitude                 10
longitude                10
location                 10
dtype: int64
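
Raw counts are easier to judge next to the share of rows they represent. As a small sketch on made-up sample data (the values below are not taken from the dataset), the per-column missing percentage can be computed like this:

```python
import pandas as pd

# Illustrative sample only; column names follow the raw data.
sample = pd.DataFrame({
    "latitude": ["41.68", None, "41.74", "41.97"],
    "location_description": ["APARTMENT", "RESIDENCE", None, None],
})

# isnull() gives booleans; their mean is the fraction of missing values.
missing_pct = sample.isnull().mean().mul(100).round(1)
print(missing_pct)
```

Applied to the notebook's `df`, the same expression would show each column's missing share of the 50,000 rows.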


## Step 5: Clean Data and Extract Features

We clean the dataset by:  
- Selecting relevant columns  
- Converting latitude/longitude to numeric values  
- Dropping rows with missing coordinates  
- Converting date strings to datetime and extracting the hour and day of the week  
- Creating an 'IsViolent' indicator for specific crime types

In [4]:
# Select only the necessary columns
columns = ["date", "latitude", "longitude", "primary_type"]
df = df[columns].copy()

# Convert coordinates to numeric and drop rows with missing values
df["latitude"] = pd.to_numeric(df["latitude"], errors="coerce")
df["longitude"] = pd.to_numeric(df["longitude"], errors="coerce")
df = df.dropna(subset=["latitude", "longitude"])
print(f"Shape after dropping missing coordinates: {df.shape}")

# Convert 'date' to datetime and extract additional features
df["date"] = pd.to_datetime(df["date"])
df["Hour"] = df["date"].dt.hour
df["DayOfWeek"] = df["date"].dt.dayofweek  # Monday=0, Sunday=6

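# Create an indicator for violent crimes.
# Assumption: the dataset labels sexual assault as either "CRIM SEXUAL ASSAULT"
# or "CRIMINAL SEXUAL ASSAULT" depending on the record, so both are listed.
violent_crimes = [
    "HOMICIDE", "ASSAULT", "BATTERY", "ROBBERY",
    "CRIM SEXUAL ASSAULT", "CRIMINAL SEXUAL ASSAULT",
]
df["IsViolent"] = df["primary_type"].isin(violent_crimes).astype(int)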

# Confirm transformations
print(df.head())
df.info()  # info() prints its summary itself and returns None

Shape after dropping missing coordinates: (49990, 4)
        date   latitude  longitude     primary_type  Hour  DayOfWeek  \
0 2024-09-01  41.680758 -87.621727  CRIMINAL DAMAGE     0          6   
1 2024-09-01  41.985759 -87.693169    OTHER OFFENSE     0          6   
2 2024-09-01  41.740621 -87.705239    OTHER OFFENSE     0          6   
3 2024-09-01  41.977609 -87.846467  CRIMINAL DAMAGE     0          6   
4 2024-09-01  41.745877 -87.673029            THEFT     0          6   

   IsViolent  
0          0  
1          0  
2          0  
3          0  
4          0  
<class 'pandas.core.frame.DataFrame'>
Index: 49990 entries, 0 to 49999
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          49990 non-null  datetime64[ns]
 1   latitude      49990 non-null  float64       
 2   longitude     49990 non-null  float64       
 3   primary_type  49990 non-null  object        
 4   Hour          49
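
The time features can be checked on a couple of known timestamps. The sketch below uses the same `.dt.hour` and `.dt.dayofweek` accessors as the cell above; the sample strings are illustrative, and `errors="coerce"` is an addition here (the notebook's cell does not use it) that turns unparseable values into `NaT` instead of raising.

```python
import pandas as pd

# Two valid timestamps in the API's format, plus one deliberately bad value.
raw = pd.Series([
    "2024-09-01T00:00:00.000",  # a Sunday
    "2024-09-02T13:45:00.000",  # a Monday
    "not a date",
])

parsed = pd.to_datetime(raw, errors="coerce")
features = pd.DataFrame({
    "Hour": parsed.dt.hour,
    "DayOfWeek": parsed.dt.dayofweek,  # Monday=0, Sunday=6
})
print(features)
```

The first row matches the notebook's output, where 2024-09-01 yields `Hour` 0 and `DayOfWeek` 6.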

## Step 6: Aggregate Data and Define Risk

We aggregate the crime records by exact coordinate pair (latitude, longitude) and calculate:  
- **CrimeCount:** Total crimes at that location  
- **ViolentCount:** Total violent crimes at that location  

The target variable **Risk** is defined as:  
- **High-risk (1):** Locations with a CrimeCount above the median  
- **Low-risk (0):** Locations with a CrimeCount at or below the median
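
A median split does not guarantee two equal classes: every location tied at the median lands in the low-risk class. The toy series below (made-up counts, not dataset values) illustrates the rule as defined above.

```python
import pandas as pd

# Most locations have a single crime, so the median itself is 1.
counts = pd.Series([1, 1, 1, 1, 2, 3, 5], name="CrimeCount")

median_count = counts.median()
risk = (counts > median_count).astype(int)  # strictly above the median -> 1

print("Median:", median_count)
print("Risk labels:", risk.tolist())
```

Here four of seven locations are labeled low-risk even though the threshold is the median, because all four share the median value.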

In [5]:
# Aggregate data by location
location_counts = df.groupby(["latitude", "longitude"]).agg(
    CrimeCount=("primary_type", "count"),
    ViolentCount=("IsViolent", "sum")
).reset_index()

# Define the target variable 'Risk'
median_count = location_counts["CrimeCount"].median()
location_counts["Risk"] = (location_counts["CrimeCount"] > median_count).astype(int)

# Display aggregated results and target distribution
print(location_counts.head())
print("Risk distribution:\n", location_counts["Risk"].value_counts())

    latitude  longitude  CrimeCount  ViolentCount  Risk
0  41.644604 -87.610728           1             0     0
1  41.644608 -87.598848           1             0     0
2  41.645378 -87.540022           1             0     0
3  41.646123 -87.542896           1             0     0
4  41.647038 -87.616003           1             1     0
Risk distribution:
 Risk
0    27669
1     6772
Name: count, dtype: int64
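
Grouping on exact coordinates treats points a few meters apart as separate locations, which may contribute to the many locations with a `CrimeCount` of 1. One alternative, not used in this notebook, is to round coordinates to a coarser grid before aggregating. The sketch below assumes three decimal places (roughly 100 m in latitude); the `PRECISION` constant, the `lat_bin`/`lon_bin` column names, and the sample points are illustrative.

```python
import pandas as pd

# Made-up sample: the first two points are only a few meters apart.
points = pd.DataFrame({
    "latitude": [41.644604, 41.644608, 41.645378, 41.985759],
    "longitude": [-87.610728, -87.610731, -87.540022, -87.693169],
    "IsViolent": [0, 1, 0, 0],
})

PRECISION = 3  # assumed grid size: decimal places kept per coordinate

binned = points.assign(
    lat_bin=points["latitude"].round(PRECISION),
    lon_bin=points["longitude"].round(PRECISION),
)

# Same aggregation idea as above, but per grid cell instead of per exact point.
cell_counts = binned.groupby(["lat_bin", "lon_bin"]).agg(
    CrimeCount=("IsViolent", "size"),
    ViolentCount=("IsViolent", "sum"),
).reset_index()
print(cell_counts)
```

The grid size trades spatial detail for denser counts, so it would need to be chosen with the downstream model in mind.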


## Step 7: Save the Transformed Data

The final aggregated dataset is saved as a CSV file for use in subsequent feature engineering and modeling steps.

In [6]:
# Save the transformed dataset to CSV
output_file = "../chicago_crimes_latest_transformed.csv"
location_counts.to_csv(output_file, index=False)
print(f"Transformed dataset saved as '{output_file}'.")

Transformed dataset saved as '../chicago_crimes_latest_transformed.csv'.
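
A quick read-back can confirm that the file was written with the expected columns. The sketch below writes a two-row stand-in for `location_counts` to a temporary directory (so it does not touch the real output path), creates the parent folder if needed, and reloads the file. The sample values and the `out` subfolder are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the aggregated frame; values are made up.
location_counts = pd.DataFrame({
    "latitude": [41.644604, 41.985759],
    "longitude": [-87.610728, -87.693169],
    "CrimeCount": [1, 4],
    "ViolentCount": [0, 2],
    "Risk": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    output_file = Path(tmp) / "out" / "chicago_crimes_latest_transformed.csv"
    output_file.parent.mkdir(parents=True, exist_ok=True)  # ensure the folder exists
    location_counts.to_csv(output_file, index=False)
    reloaded = pd.read_csv(output_file)

print(reloaded.shape, list(reloaded.columns))
```

Comparing shape and column names catches the most common problems, such as an accidental index column.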
