## Predicting Crime Types in Los Angeles (2020–Present)

**Team Members:**  
- Alex Sautereau  
- Bradley Russell 
****



### Source and Format
We are using the **Crime Data from 2020 to Present** dataset, obtained from Data.gov.

- **Source:** https://catalog.data.gov/dataset/crime-data-from-2020-to-present 
- **Format:** CSV  
- **File Used:** `Crime_Data_from_2020_to_Present.csv`  
- **Description:** Each row represents one reported crime incident, including date, location, victim demographics, offense classification codes, and investigation status.

### Dataset Attributes
The dataset contains the following columns:

- **DR_NO** – Division of Records number (unique incident identifier)  
- **Date Rptd** – Date the crime was reported  
- **DATE OCC** – Date the crime occurred  
- **TIME OCC** – Time of occurrence (24-hour HHMM format)  
- **AREA** – LAPD basic command area code  
- **AREA NAME** – Name of the LAPD area  
- **Rpt Dist No** – Reporting district number  
- **Part 1-2** – Whether the crime is Part 1 or Part 2 (FBI classification)  
- **Crm Cd** – Numeric crime code  
- **Crm Cd Desc** – Description of the crime (textual category)  
- **Mocodes** – Modus operandi codes  
- **Vict Age** – Age of the victim  
- **Vict Sex** – Victim sex  
- **Vict Descent** – Victim race/ethnicity  
- **Premis Cd** – Coded type of property/premises  
- **Premis Desc** – Description of type of property/premises  
- **Weapon Used Cd** – Numeric weapon code  
- **Weapon Desc** – Description of weapon used  
- **Status** – Case status code  
- **Status Desc** – Case status description  
- **Crm Cd 1–4** – Additional crime codes if multiple offenses occurred  
- **LOCATION** – Street-level address information  
- **Cross Street** – Cross-street detail (may be missing)  
- **LAT** – Latitude coordinate  
- **LON** – Longitude coordinate  

### Class Attribute (Prediction Target)
The target variable for our classification task will be:

**`Crm Cd Desc` — the textual description of the crime type.**

This is a **multi-class classification** problem with dozens of possible labels (*THEFT OF IDENTITY*, *BATTERY – SIMPLE ASSAULT*, *BURGLARY*, etc.).


****
## Implementation / Technical Merit

Our project will be implemented in Python using the tools and techniques covered in CPSC 322. The goal is to build a complete end-to-end classification pipeline operating on a large, real-world dataset. The implementation will include the following components:

### **1. Data Cleaning and Preprocessing**
Because crime data contains inconsistencies and missing values, we expect to perform several preprocessing steps:

- Handling missing values in:  
  - `Vict Sex`, `Vict Descent`, `Vict Age`  
  - `Weapon Desc`, `Weapon Used Cd`  
  - `Cross Street` and some location fields  
- Converting `DATE OCC` and `TIME OCC` into usable features  
- Normalizing or scaling numerical attributes (`TIME OCC`, `Vict Age`)  
- Removing or encoding high-cardinality text fields such as `Mocodes`  
- Dropping purely administrative fields such as `DR_NO`, and excluding `Crm Cd` and `Crm Cd 1–4` from the inputs because they encode the target directly  
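
A few of the preprocessing steps above could be sketched roughly as follows, using only the standard library. The helper names (`parse_time_occ`, `clean_row`), the `"UNKNOWN"` placeholder, the choice to treat non-positive ages as missing, and the two made-up sample rows (including their date layout) are assumptions for illustration, not facts taken from the dataset documentation.

```python
import csv
import io

# Tiny in-memory stand-in for Crime_Data_from_2020_to_Present.csv.
# Both rows are invented for this sketch.
SAMPLE_CSV = """DR_NO,DATE OCC,TIME OCC,Vict Age,Vict Sex,Weapon Desc,Crm Cd Desc
1,03/01/2020 12:00:00 AM,2130,36,F,,BATTERY - SIMPLE ASSAULT
2,02/08/2020 12:00:00 AM,45,0,,,THEFT OF IDENTITY
"""

MISSING = "UNKNOWN"       # assumed placeholder for missing categorical values
DROP_COLUMNS = {"DR_NO"}  # administrative identifier, not a predictor


def parse_time_occ(raw):
    """Convert an HHMM value such as '45' or '2130' into (hour, minute)."""
    digits = str(raw).strip().zfill(4)  # '45' -> '0045'
    return int(digits[:2]), int(digits[2:])


def clean_row(row):
    """Return a cleaned copy of one CSV row (a dict of strings)."""
    cleaned = {k: v.strip() for k, v in row.items() if k not in DROP_COLUMNS}
    # Replace empty categorical cells with an explicit placeholder.
    for col in ("Vict Sex", "Weapon Desc"):
        if not cleaned[col]:
            cleaned[col] = MISSING
    # Assumption: an age of zero or below means the age was not recorded.
    age = int(cleaned["Vict Age"] or 0)
    cleaned["Vict Age"] = age if age > 0 else None
    # Split the HHMM time into separate numeric fields.
    cleaned["hour"], cleaned["minute"] = parse_time_occ(cleaned.pop("TIME OCC"))
    return cleaned


rows = [clean_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```

Keeping an explicit placeholder rather than discarding rows means a missing weapon or victim field can itself act as a signal, which may or may not help depending on the classifier.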

### **2. Feature Engineering**
To create meaningful features for classification, we will engineer new attributes such as:

- **Temporal features**  
  - Day of week  
  - Month  
  - Hour of day (derived from `TIME OCC`)  
  - Weekend vs. weekday  

- **Location-based features**  
  - Using `AREA` or `AREA NAME` as categorical features  
  - Including latitude/longitude (possibly binned into zones)

- **Victim-related features**  
  - Age groups (e.g., 0–17, 18–30, 31–45, 46–65, 66+)  
  - Encoding sex and descent categories  
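
As a rough illustration of the engineered features listed above, the sketch below derives temporal features, a coarse location zone, and an age group. The function names, the `%m/%d/%Y` date layout, the 0.05-degree grid cell, and the exact age boundaries (one non-overlapping reading of the groups above) are all assumptions chosen for the example.

```python
import math
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Inclusive upper bound and label for each age group (illustrative bounds).
AGE_BINS = [(17, "0-17"), (30, "18-30"), (45, "31-45"), (65, "46-65")]


def age_group(age):
    """Map a victim age to a coarse group; None means the age is missing."""
    if age is None:
        return "UNKNOWN"
    for upper, label in AGE_BINS:
        if age <= upper:
            return label
    return "66+"


def temporal_features(date_occ, time_occ, date_format="%m/%d/%Y"):
    """Derive day of week, month, hour, and a weekend flag."""
    # Keep only the date part in case a clock time is appended to the field.
    day = datetime.strptime(date_occ.split()[0], date_format)
    hour = int(str(time_occ).strip().zfill(4)[:2])
    return {
        "day_of_week": DAY_NAMES[day.weekday()],
        "month": day.month,
        "hour": hour,
        "is_weekend": day.weekday() >= 5,  # Saturday = 5, Sunday = 6
    }


def location_zone(lat, lon, cell=0.05):
    """Bin coordinates into a square grid cell identified by two integers."""
    return math.floor(lat / cell), math.floor(lon / cell)


features = temporal_features("03/01/2020 12:00:00 AM", "2130")
zone = location_zone(34.0141, -118.2978)
```

Grid binning is only one way to discretize coordinates; `AREA` already provides a coarser, ready-made zoning that may be sufficient.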

### **3. Classifiers to be Implemented**
We will evaluate **three different classifiers**, consistent with course requirements:

1. **Decision Tree Classifier**  
2. **k-Nearest Neighbors (k-NN)**  
3. **Naive Bayes**  

Additional classifiers (like Random Forest) may be explored if time allows, but our minimum deliverable is three.
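
To give a sense of what one of these classifiers involves, here is a minimal sketch of Naive Bayes for categorical features with add-one (Laplace) smoothing. The class name `CategoricalNaiveBayes`, its method names, and the toy rows and labels are hypothetical; this is not the course's reference implementation and says nothing about how the final project will structure its classifiers.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with add-one smoothing."""

    def fit(self, X, y):
        self.class_counts = Counter(y)
        n_features = len(X[0])
        # value_counts[j][label][value] = times `value` appeared in
        # feature j among training rows with class `label`.
        self.value_counts = [defaultdict(Counter) for _ in range(n_features)]
        self.feature_values = [set() for _ in range(n_features)]
        for row, label in zip(X, y):
            for j, value in enumerate(row):
                self.value_counts[j][label][value] += 1
                self.feature_values[j].add(value)
        return self

    def predict_one(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Work in log space to avoid underflow with many features.
            score = math.log(count / total)
            for j, value in enumerate(row):
                seen = self.value_counts[j][label][value]
                n_values = len(self.feature_values[j])
                score += math.log((seen + 1) / (count + n_values))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, X):
        return [self.predict_one(row) for row in X]


# Toy data: (time of day, premises) -> crime type. Invented for illustration.
X_train = [("NIGHT", "STREET"), ("NIGHT", "STREET"), ("DAY", "RESIDENCE"),
           ("DAY", "RESIDENCE"), ("DAY", "STREET")]
y_train = ["ROBBERY", "ROBBERY", "BURGLARY", "BURGLARY", "ROBBERY"]

model = CategoricalNaiveBayes().fit(X_train, y_train)
predictions = model.predict([("NIGHT", "STREET"), ("DAY", "RESIDENCE")])
```

Smoothing matters here because, with many rare crime labels, plenty of feature/label combinations will never appear in the training split, and an unsmoothed estimate would assign them zero probability.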

### **4. Feature Selection**
Several attributes, such as `Premis Desc` and `Weapon Desc`, are high-cardinality categorical features, so one-hot encoding them can sharply increase dimensionality. The target `Crm Cd Desc` is also high-cardinality, but it is the label rather than an input feature.

To manage this, we may explore:

- Mutual Information scoring  
- Decision-tree feature importance  
- Filtering out low-frequency categories  
- Dimensionality reduction (PCA) if numeric feature space grows too large  
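
Two of the options above, mutual information scoring and filtering of low-frequency categories, can be sketched in plain Python as follows. The function names, the `"OTHER"` bucket, and the toy values are assumptions for illustration; the mutual information estimate is the standard plug-in estimate from observed counts.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Estimate I(X; Y) in bits from paired categorical observations."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), expressed with raw counts.
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


def collapse_rare(values, min_count, other="OTHER"):
    """Replace categories seen fewer than min_count times with one bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


labels = ["X", "X", "Y", "Y"]
informative = mutual_information(["A", "A", "B", "B"], labels)    # tracks the label
uninformative = mutual_information(["A", "B", "A", "B"], labels)  # independent of it
collapsed = collapse_rare(["STREET", "STREET", "PARK", "STREET"], min_count=2)
```

Features could then be ranked by their score against `Crm Cd Desc`, keeping in mind that plug-in estimates tend to overstate the score of features with many rare values, which is one reason to collapse rare categories first.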

### **5. Evaluation**
We will evaluate classifier performance on stratified train/validation/test splits using:

- Accuracy  
- Confusion matrices  
- Precision and recall (macro-averaged due to class imbalance)  

These assessments will help determine which model performs best on multi-class crime prediction using real-world, imbalanced data.
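
The evaluation ideas above can be sketched as follows: a stratified split that preserves each class's share, and macro-averaged precision and recall. The function names, the 25% test fraction, the fixed seed, and the toy labels are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test instance per class; note that a class with a
        # single instance therefore ends up only in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def macro_precision_recall(y_true, y_pred):
    """Average per-class precision and recall, weighting classes equally."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


all_labels = ["A"] * 8 + ["B"] * 4
train_idx, test_idx = stratified_split(all_labels)

y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
macro_p, macro_r = macro_precision_recall(y_true, y_pred)
```

Macro averaging gives a rare crime type the same weight as a common one, so a model that ignores minority classes is penalized even when its overall accuracy looks high.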
