# Discovering Spatio-Temporal Crime Patterns Using Unsupervised Learning


*by Justin Kim and Srishti Rajpurohit*

**Dataset:** [Seattle Police Department Crime Incident Data (2008–Present)](https://data.seattle.gov/Public-Safety/SPD-Crime-Data-2008-Present/tazs-3rd5/about_data)

This notebook applies unsupervised learning methods, specifically clustering, to analyze spatial and temporal patterns in Seattle crime incident data. The analysis includes preprocessing, exploratory data analysis, and feature engineering to prepare the dataset for pattern discovery.


## 0. Setup

In [15]:
# Import statements
import pandas as pd
import numpy as np

## 1. Data Overview

### 1.1 Load Data

In [16]:
# Load the dataset into a pandas DataFrame
df = pd.read_csv('data/SPD_Crime_Data__2008-Present_20260210.csv')

### 1.2 Basic Inspection

In [17]:
# Display the first few rows of the DataFrame to verify it loaded correctly
df.head()

| | Report Number | Report DateTime | Offense ID | Offense Date | NIBRS Group AB | NIBRS Crime Against Category | Offense Sub Category | Shooting Type Group | Block Address | Latitude | Longitude | Beat | Precinct | Sector | Neighborhood | Reporting Area | Offense Category | NIBRS Offense Code Description | NIBRS_offense_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-190826 | 2015 Jun 08 09:39:00 AM | 7655587915 | 2015 May 24 02:00:00 PM | A | PROPERTY | LARCENY-THEFT | - | 26XX BLOCK OF W MARINA PL | 47.63103937 | -122.391970808268 | Q1 | West | Q | - | 7089 | PROPERTY CRIME | All Other Larceny | 23H |
| 1 | 2008-479747 | 2008 Dec 28 10:14:00 PM | 7639775836 | 2008 Dec 28 10:14:00 PM | A | PROPERTY | LARCENY-THEFT | - | - | -1.0 | -1.0 | D2 | West | D | - | 3700 | PROPERTY CRIME | Theft From Motor Vehicle | 23F |
| 2 | 2014-158003 | 2014 May 20 05:42:00 PM | 7668842409 | 2014 May 20 05:42:00 PM | A | PROPERTY | LARCENY-THEFT | - | - | -1.0 | -1.0 | F3 | Southwest | F | - | 4197 | PROPERTY CRIME | All Other Larceny | 23H |
| 3 | 2012-380870 | 2012 Nov 06 09:40:00 AM | 7649760707 | 2012 Nov 04 08:00:00 PM | A | PROPERTY | BURGLARY | - | 30XX BLOCK OF 29TH AVE W | 47.64773751 | -122.394242455682 | Q1 | West | Q | - | 7024 | PROPERTY CRIME | Burglary/Breaking & Entering | 220 |
| 4 | 2014-041879 | 2014 Feb 07 10:47:00 PM | 7628705100 | 2014 Feb 07 09:00:00 PM | A | PROPERTY | LARCENY-THEFT | - | 7XX BLOCK OF S DEARBORN ST | 47.59583224 | -122.323111156883 | K3 | West | K | - | 1502 | PROPERTY CRIME | Theft From Motor Vehicle | 23F |


In [18]:
# Display the column names to understand the structure of the dataset
df.columns

Index(['Report Number', 'Report DateTime', 'Offense ID', 'Offense Date',
       'NIBRS Group AB', 'NIBRS Crime Against Category',
       'Offense Sub Category', 'Shooting Type Group', 'Block Address',
       'Latitude', 'Longitude', 'Beat', 'Precinct', 'Sector', 'Neighborhood',
       'Reporting Area', 'Offense Category', 'NIBRS Offense Code Description',
       'NIBRS_offense_code'],
      dtype='object')

In [19]:
# Display shape of the DataFrame to understand how many rows and columns it contains
df.shape

(1514383, 19)

In [20]:
# Display summary statistics of the DataFrame to understand the distribution of numerical columns
df.describe()

| | Offense ID |
|---|---|
| count | 1514383.0 |
| mean | 19034750000.0 |
| std | 18999560000.0 |
| min | 7624429000.0 |
| 25% | 7659382000.0 |
| 50% | 7689669000.0 |
| 75% | 26401750000.0 |
| max | 68492600000.0 |


In [21]:
# Display information about the DataFrame to understand data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1514383 entries, 0 to 1514382
Data columns (total 19 columns):
 #   Column                          Non-Null Count    Dtype 
---  ------                          --------------    ----- 
 0   Report Number                   1514383 non-null  object
 1   Report DateTime                 1514383 non-null  object
 2   Offense ID                      1514383 non-null  int64 
 3   Offense Date                    1514383 non-null  object
 4   NIBRS Group AB                  1514383 non-null  object
 5   NIBRS Crime Against Category    1514383 non-null  object
 6   Offense Sub Category            1514383 non-null  object
 7   Shooting Type Group             1514383 non-null  object
 8   Block Address                   1514383 non-null  object
 9   Latitude                        1514383 non-null  object
 10  Longitude                       1514383 non-null  object
 11  Beat                            1514383 non-null  object
 12  Precinct      

**Notable observations post-inspection**
- The dataset contains 1,514,383 records and 19 variables, representing reported crime incidents in Seattle from 2008 onward. At roughly 219 MB in memory, it is large enough to support meaningful exploratory and clustering analysis.

- Although `df.info()` reports no null values, placeholder values such as `-1.0` in Latitude and Longitude and `-` in categorical fields (for example, Block Address and Neighborhood) mark implicitly missing data. These placeholders must be converted to true missing values or otherwise handled during preprocessing.

- Both temporal fields (Report DateTime and Offense Date) are stored as object (string) types and must be converted to datetime before temporal features can be extracted. Latitude and Longitude are also stored as object rather than numeric types, which suggests that some entries are non-numeric and must be cleaned before conversion.

- The dataset is predominantly categorical. The only numeric column, Offense ID, serves as a unique identifier and carries no analytical value, which is why `df.describe()` returns statistics for that column alone. Significant feature engineering and encoding will therefore be required before clustering.

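- The crime classification variables are hierarchical: for example, Offense Category, Offense Sub Category, and NIBRS Offense Code Description run from broad to specific. This enables analysis at multiple levels of crime categorization.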
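
The cleaning steps implied by these observations could be sketched as follows. This is a minimal illustration, not the notebook's actual preprocessing: the helper name `clean_placeholders`, the `text_cols` parameter, and the two-row `sample` frame are hypothetical, and the datetime format string is an assumption inferred from the values shown by `df.head()`.

```python
import pandas as pd

# Assumed format of the raw timestamp strings, e.g. "2015 Jun 08 09:39:00 AM"
DATE_FORMAT = "%Y %b %d %I:%M:%S %p"

# Tiny synthetic sample mimicking a few raw columns (illustrative values only)
sample = pd.DataFrame({
    "Report DateTime": ["2015 Jun 08 09:39:00 AM", "2008 Dec 28 10:14:00 PM"],
    "Offense Date": ["2015 May 24 02:00:00 PM", "2008 Dec 28 10:14:00 PM"],
    "Latitude": ["47.63103937", "-1.0"],
    "Longitude": ["-122.391970808268", "-1.0"],
    "Block Address": ["26XX BLOCK OF W MARINA PL", "-"],
    "Neighborhood": ["-", "-"],
})


def clean_placeholders(frame, text_cols):
    """Return a copy with placeholders turned into missing values and types fixed."""
    out = frame.copy()

    # Coordinates: parse to float (unparseable strings become NaN),
    # then treat the -1.0 placeholder as missing as well
    for col in ["Latitude", "Longitude"]:
        numeric = pd.to_numeric(out[col], errors="coerce")
        out[col] = numeric.mask(numeric == -1.0)

    # Timestamps: parse with the assumed format; failures become NaT
    for col in ["Report DateTime", "Offense Date"]:
        out[col] = pd.to_datetime(out[col], format=DATE_FORMAT, errors="coerce")

    # Categorical/text fields: treat the "-" placeholder as missing
    for col in text_cols:
        out[col] = out[col].mask(out[col] == "-")

    return out


cleaned = clean_placeholders(sample, text_cols=["Block Address", "Neighborhood"])
print(cleaned.dtypes)
print(cleaned.isna().sum())
```

On the full dataset, the same function would be called on `df` with whichever text columns use the `-` placeholder. Counting missing values afterwards with `isna().sum()` shows how much of each column was implicitly missing, which `df.info()` cannot reveal on the raw data.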