# Predictive Crime Analytics: From Clusters to District-Level Action Plans

# Purpose
This notebook aims to analyze crime data to identify patterns and trends, cluster similar crime incidents, and develop actionable insights for district-level crime prevention strategies using recommender systems.

# Background
Crime analytics involves the use of data analysis techniques to understand crime patterns and trends. By clustering similar crime incidents, we can identify hotspots and common factors contributing to criminal activities. This information can be used to develop targeted action plans for different districts, enhancing public safety and resource allocation. This project uses real crime data reported in Raleigh, NC, to demonstrate the application of clustering algorithms and recommender systems in crime analytics.

# Methodology
1. Data Collection: Gather crime data from reliable sources, including incident reports, locations, times, and types of crimes.
2. Data Preprocessing: Clean and preprocess the data to handle missing values, outliers, and inconsistencies.
3. Exploratory Data Analysis (EDA): Analyze the data to identify trends, patterns, and correlations.
4. Clustering: Apply clustering algorithms (e.g., K-Means, DBSCAN) to group similar crime incidents based on features such as location, time, and type of crime.
5. Non-negative Matrix Factorization (NMF): Use NMF to reduce dimensionality and extract latent features from the clustered data.
6. Recommender System: Develop a recommender system to suggest district-level action plans based on the identified clusters.
7. Prediction and Evaluation: Test the recommender system's effectiveness in predicting crime trends and evaluate its performance using appropriate metrics.
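
Steps 4 through 6 above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the pipeline used later in this notebook: the incidents are randomly generated, the district and category labels are placeholders, and the choices of two clusters and two NMF components are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

rng = np.random.default_rng(42)

# Synthetic incidents (illustrative only): two well-separated spatial hotspots
n = 200
incidents = pd.DataFrame({
    'Latitude': np.concatenate([rng.normal(35.78, 0.01, n // 2),
                                rng.normal(35.85, 0.01, n // 2)]),
    'Longitude': np.concatenate([rng.normal(-78.64, 0.01, n // 2),
                                 rng.normal(-78.70, 0.01, n // 2)]),
    # Placeholder labels, not the values found in the real dataset
    'District': rng.choice(['North', 'Downtown', 'Southeast'], n),
    'Crime_Category': rng.choice(['LARCENY', 'ASSAULT', 'BURGLARY', 'FRAUD'], n),
})

# Step 4: group incidents by location with K-Means
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
incidents['cluster'] = kmeans.fit_predict(incidents[['Latitude', 'Longitude']])

# Step 5: factorize a non-negative district x category count matrix
counts = pd.crosstab(incidents['District'], incidents['Crime_Category'])
nmf = NMF(n_components=2, init='nndsvda', random_state=42, max_iter=500)
W = nmf.fit_transform(counts.to_numpy(dtype=float))  # district loadings
H = nmf.components_                                  # category loadings

# Step 6: the reconstructed matrix scores each category for each district
scores = pd.DataFrame(W @ H, index=counts.index, columns=counts.columns)
top_category = scores.idxmax(axis=1)
print(top_category)
```

Here the highest-scoring category per district stands in for a recommendation. Mapping such scores to concrete action plans is a separate design decision that this sketch does not cover.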

# Ethical Considerations

## Bias and Fairness
This dataset contains crime incidents reported in the Raleigh, NC area. Although it holds no direct identifying information about individuals, the incident report numbers and location data could potentially be used to identify people involved in the incidents. The data may also carry embedded biases related to demographics. It is crucial to ensure that the analysis does not perpetuate existing biases or lead to unfair treatment of certain communities.

## Privacy
When working with crime data, it is essential to respect the privacy of individuals involved in the incidents. Analysis will be conducted in a manner that protects personal information and adheres to data protection regulations.

## Responsible Use
The insights derived from this analysis should be used responsibly to enhance public safety without infringing on individual rights. Recommendations should focus on community engagement and preventive measures rather than punitive actions.

# Dataset Description
The dataset used in this analysis is the "Raleigh Police Incidents (NIBRS)" dataset, which contains detailed records of crime incidents reported to the Raleigh Police Department. The dataset includes various attributes such as incident type, location, date, time, and other relevant details. The data is sourced from Open Data Raleigh and is publicly available for analysis (City of Raleigh, 2014–present). The data dictionary for this dataset is as follows:

| Column Name            |                                               Description                                               | Data Type  | Feature Type       |
|:-----------------------|:-------------------------------------------------------------------------------------------------------:|:-----------|:-------------------|
| Case Number            |  A unique identifier assigned to each police case. Used for tracking and referencing specific reports.  | String     | Unique Identifier  |
| Crime_Category         |                                         Type of crime committed                                         | String     | Categorical        |
| Crime Code             |          A standardized code representing the type of crime based on internal classification.           | String     | Categorical        |
| Crime Description      |                               A brief description of the crime incident.                                | String     | Categorical        |
| Crime Type             |                      A broader classification of the crime (e.g., Theft, Assault).                      | String     | Categorical        |
| Reported Block Address |                       The anonymized block address where the crime was reported.                        | String     | Categorical        |
| City of Incident       |                               The city where the crime incident occurred.                               | String     | Categorical        |
| District               |                            The police district where the crime was reported.                            | String     | Categorical        |
| Reported Date          |                                  The date when the crime was reported.                                  | Date       | Temporal           |
| Reported Year          |                                  The year when the crime was reported.                                  | Integer    | Categorical        |
| Reported Month         |                                 The month when the crime was reported.                                  | Integer    | Categorical        |
| Reported Day           |                            The day of the month when the crime was reported.                            | Integer    | Categorical        |
| Reported Hour          |                            The hour of the day when the crime was reported.                             | Integer    | Categorical        |
| Reported Day of Week   |                            The day of the week when the crime was reported.                             | String     | Categorical        |
| Latitude               |                         The latitude coordinate of the crime incident location.                         | Float      | Numerical          |
| Longitude              |                        The longitude coordinate of the crime incident location.                         | Float      | Numerical          |
| Agency                 |                      The law enforcement agency that reported the crime incident.                       | String     | Categorical        |
| Updated_Date           |                        The date when the crime incident record was last updated.                        | Date       | Temporal           |

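The hosting website updates this dataset daily. The data used in this analysis was downloaded on November 14, 2025. The file is too large to include directly in this repository but can be accessed via the provided download link. The downloaded file contains 23 columns: the 18 described above plus `OBJECTID`, `GlobalID`, `City`, `x`, and `y`, which the data dictionary does not document.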

# Global Actions

## Importing Libraries

In [34]:

import warnings

import gdown
import pandas as pd
import seaborn as sns

sns.set_theme(style='ticks', palette='pastel')
warnings.filterwarnings('ignore')

## Notebook-Level Constants

In [35]:
random_seed = 42
dataset_url = 'https://drive.google.com/uc?id=1xNzeXy8IXL21qgmVym5OV14PZ3kmNhox&export=download'
file_id = '1xNzeXy8IXL21qgmVym5OV14PZ3kmNhox'
output_file = 'crime_data.csv'

## Download the Dataset

In [36]:
# Download the file using gdown (handles large files with virus scan warnings)
gdown.download(f'https://drive.google.com/uc?id={file_id}', output_file, quiet=True)

# Define data types for columns
dtype_spec = {
    'Case Number': 'string',
    'Crime_Category': 'string',
    'Crime Code': 'string',
    'Crime Description': 'string',
    'Crime Type': 'string',
    'Reported Block Address': 'string',
    'City of Incident': 'string',
    'District': 'string',
    'Reported Year': 'Int64',
    'Reported Month': 'Int64',
    'Reported Day': 'Int64',
    'Reported Hour': 'Int64',
    'Reported Day of Week': 'string',
    'Latitude': 'float64',
    'Longitude': 'float64',
    'Agency': 'string',
    'x': 'float64',
    'y': 'float64'
}

# Read the downloaded CSV file with the specified dtypes
# Note: the two date columns are parsed in the same call via parse_dates
df = pd.read_csv(
    output_file,
    dtype=dtype_spec,
    parse_dates=['Reported Date', 'Updated_Date']
)

# Exploratory Data Analysis (EDA)

## Basic Information

View the basic information about the dataset. This is a large dataset with 595,963 observations and 23 features. There are several features with missing values, which will need to be addressed during data cleaning. Since we want to create a rich dataset for clustering and recommendation, we will attempt to impute missing values where possible. Dropping observations with missing values may lead to loss of valuable information and will be considered as a last resort.

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595963 entries, 0 to 595962
Data columns (total 23 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   OBJECTID                595963 non-null  int64         
 1   GlobalID                595963 non-null  object        
 2   Case Number             445478 non-null  string        
 3   Crime_Category          595963 non-null  string        
 4   Crime Code              595963 non-null  string        
 5   Crime Description       595963 non-null  string        
 6   Crime Type              386964 non-null  string        
 7   Reported Block Address  445357 non-null  string        
 8   City of Incident        445352 non-null  string        
 9   City                    595957 non-null  object        
 10  District                595963 non-null  string        
 11  Reported Date           595963 non-null  datetime64[ns]
 12  Reported Year           595963
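
The missing values noted above can be quantified per column before deciding what to impute. The helper below is a minimal sketch: `missing_summary` is a hypothetical name, and the toy frame uses made-up values in place of the real data.

```python
import pandas as pd

def missing_summary(frame):
    """Return missing count and percentage per column, worst first."""
    missing = frame.isna().sum()
    summary = pd.DataFrame({
        'missing': missing,
        'percent': (missing / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Toy frame with made-up values; in the notebook this would be `df`
toy = pd.DataFrame({
    'Crime_Category': ['MISCELLANEOUS'] * 4,
    'Crime Type': [None, 'Theft', None, None],
    'Case Number': ['P1', None, 'P3', 'P4'],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, the `df.info()` output above shows that `Crime Type` has 386,964 non-null values out of 595,963, so roughly 35% of that column would need to be imputed or otherwise handled.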

View the first few rows of the dataset.

In [42]:
df.head()

|   | OBJECTID | GlobalID | Case Number | Crime_Category | Crime Code | Crime Description | Crime Type | Reported Block Address | City of Incident | City | ... | Reported Month | Reported Day | Reported Hour | Reported Day of Week | Latitude | Longitude | Agency | Updated_Date | x | y |
|:--|:---------|:---------|:------------|:---------------|:-----------|:------------------|:-----------|:-----------------------|:-----------------|:-----|:----|:---------------|:-------------|:--------------|:---------------------|:---------|:----------|:-------|:-------------|:--|:--|
| 0 | 12001 | 9cdee08d-11c8-4789-864b-6965a1b2e620 |  | MISCELLANEOUS | 81H | Miscellaneous/Missing Person (18 & over) |  |  |  | RALEIGH | ... | 1 | 14 | 22 | Saturday | 0.0 | 0.0 | RPD | 2017-01-19 20:11:28 |  |  |
| 1 | 12002 | 6f6731f4-dd64-44c7-895c-555de2703c8a |  | MISCELLANEOUS | 81A | Miscellaneous/All Other Non-Offenses |  |  |  | RALEIGH | ... | 7 | 29 | 8 | Saturday | 0.0 | 0.0 | RPD | 2017-08-01 14:06:50 |  |  |
| 2 | 12003 | f0fd0e92-448e-4ca8-86c9-e6594564318b |  | MISCELLANEOUS | 81F | Miscellaneous/Mental Commitment |  |  |  | RALEIGH | ... | 3 | 6 | 22 | Sunday | 0.0 | 0.0 | RPD | 2016-04-14 14:43:38 |  |  |
| 3 | 12004 | 8a212e84-7b53-478a-b225-c212aa25d7fd |  | MISCELLANEOUS | 81A | Miscellaneous/All Other Non-Offenses |  |  |  | RALEIGH | ... | 3 | 24 | 0 | Tuesday | 0.0 | 0.0 | RPD | 2015-03-25 19:24:28 |  |  |
| 4 | 12005 | 01614b98-48f5-4374-a561-17c4b29d8857 |  | MISCELLANEOUS | 81A | Miscellaneous/All Other Non-Offenses |  |  |  | RALEIGH | ... | 12 | 22 | 19 | Tuesday | 0.0 | 0.0 | RPD | 2016-01-13 19:29:51 |  |  |


## Data Cleaning
Several features are not useful for clustering and recommendation and will be dropped:
- `OBJECTID`
- `GlobalID`
- `Case Number`
- `Updated_Date`
- `x`
- `y`

These columns are unique identifiers, carry no useful information, or are redundant for the analysis.

In [58]:
# errors='ignore' drops whichever of these columns are present,
# so the cell stays safe to re-run after the columns are gone
df = df.drop(
    columns=['OBJECTID', 'GlobalID', 'Case Number', 'Updated_Date', 'x', 'y'],
    errors='ignore'
)

We can also remove the `Agency` column since all the records are from the same agency, Raleigh Police Department.

In [56]:
df['Agency'].value_counts()

Agency
RPD    595963
Name: count, dtype: Int64

In [57]:
df = df.drop(columns=['Agency'], errors='ignore')
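
The same single-value check can be applied to every column at once instead of inspecting `value_counts()` one column at a time. A minimal sketch follows; `constant_columns` is a hypothetical helper and the toy values are placeholders.

```python
import pandas as pd

def constant_columns(frame):
    """Return names of columns holding a single distinct value (NaN counts as a value)."""
    n_unique = frame.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()

# Toy frame with placeholder values; in the notebook this would be `df`
toy = pd.DataFrame({
    'Agency': ['RPD', 'RPD', 'RPD'],
    'District': ['North', 'Downtown', 'North'],
})
constant = constant_columns(toy)
toy = toy.drop(columns=constant)
print(constant)
```

Counting `NaN` as a distinct value is a deliberate choice here: a column that mixes one value with missing entries is not treated as constant, so it will not be dropped by accident.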

The `Reported Date` column is redundant since we have the year, month, day, hour, and day of the week in separate columns. We can drop this column as well.
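
Before dropping the column, the redundancy claim can be verified by comparing each derived column with the timestamp. This is a minimal sketch on toy values: `date_parts_match` is a hypothetical helper, and it assumes `Reported Date` has already been parsed as a datetime column.

```python
import pandas as pd

def date_parts_match(frame, date_col='Reported Date'):
    """Check each derived column against the timestamp it should come from."""
    stamp = frame[date_col].dt
    checks = pd.DataFrame({
        'Reported Year': stamp.year == frame['Reported Year'],
        'Reported Month': stamp.month == frame['Reported Month'],
        'Reported Day': stamp.day == frame['Reported Day'],
        'Reported Hour': stamp.hour == frame['Reported Hour'],
        'Reported Day of Week': stamp.day_name() == frame['Reported Day of Week'],
    })
    # One boolean per derived column: True only if every row agrees
    return checks.all()

# Toy rows with illustrative values; in the notebook this would be `df`
toy = pd.DataFrame({
    'Reported Date': pd.to_datetime(['2017-01-14 22:15:00', '2017-07-29 08:00:00']),
    'Reported Year': [2017, 2017],
    'Reported Month': [1, 7],
    'Reported Day': [14, 29],
    'Reported Hour': [22, 8],
    'Reported Day of Week': ['Saturday', 'Saturday'],
})
result = date_parts_match(toy)
print(result)
```

If any entry comes back `False` on the real data (for example, because of a time zone offset between the timestamp and the derived columns), the column is not fully redundant and dropping it would discard information.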

In [62]:
df = df.drop(columns=['Reported Date'], errors='ignore')

# Works Cited
- City of Raleigh. (2014–present). Raleigh Police Incidents (NIBRS) [Data set]. Open Data Raleigh. https://data-ral.opendata.arcgis.com/datasets/ral::raleigh-police-incidents-nibrs/about
