# Machine Learning and Predictive Analytics: Solving a Real World Problem with Machine Learning

## Datasets: Los Angeles Crime Data 2010-19 and 2020-25
2010-19 dataset: https://catalog.data.gov/dataset/crime-data-from-2010-to-2019      
2020-25 dataset (accessed up to 25/05/2025): https://catalog.data.gov/dataset/crime-data-from-2020-to-present

## Research Question: How can we protect children from being victims of crime in Los Angeles?

The model will predict the risk level of a child becoming a victim of crime, based on demographic factors (such as age, sex, and descent) in combination with spatial and temporal variables (such as location, time of day, and day of the week).

Real-world interventions can be based on the model's predictions. For example, if the model predicts a high risk of Black children being victimised in Central LA during weekday evenings, a local youth centre could run targeted outreach programmes during those hours, offering safe spaces, support services, or structured activities.

## Methodology Plan

- Combine and clean two datasets
    - ~~Rename columns for clarity~~
    - Rename/group low frequency values in `vict_sex` and `vict_descent`
    - Convert messy dates and times
    - Categorise children as those under 18 and over 0 (`vict_age` contains many 0s, possibly for crimes without a known or human victim, e.g. vandalism)
- Decide on which features to include
- Where are children most likely to be victims of crime?
    - Split `location` into longitude and latitude
- When are children most likely to be victims of crime? - time of year, day of week, hour of day
    - Issue with logging of dates - crimes disproportionately logged on 1st Jan or first of month - can use day of week as a proxy
    - Look at metadata to understand times - is there enough accuracy to use this variable?
- How are demographics associated with crime and is this a useful factor to include?
- Make heatmap 
- One-hot encode categorical variables
- Split data into training and test sets
- Group crime types to reduce dimensionality
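
A minimal sketch of three of the cleaning steps listed above (grouping low-frequency values, flagging children, and one-hot encoding), run on invented toy rows rather than the real data. The helper `group_rare`, the `min_count` threshold, and the `"Other"` label are hypothetical choices for illustration; only the column names `vict_age` and `vict_sex` come from the plan.

```python
import pandas as pd

# Toy rows, invented for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "vict_age": [0, 7, 16, 18, 34, 12],
    "vict_sex": ["F", "M", "X", "M", "H", "F"],
})

def group_rare(series, min_count, other="Other"):
    """Replace values seen fewer than min_count times with one shared label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)

# ASSUMPTION: a threshold of 2 is only sensible for this toy example
toy["vict_sex"] = group_rare(toy["vict_sex"], min_count=2)

# Children: older than 0 and younger than 18 (ages are whole numbers here)
toy["is_child"] = toy["vict_age"].between(1, 17)

# One-hot encode the grouped categorical column for the child rows only
children = toy[toy["is_child"]]
encoded = pd.get_dummies(children, columns=["vict_sex"])
```

On the real data the threshold would need to be chosen after inspecting `value_counts()` for each column.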

**Problems with temporal variables:**
- Dates are likely to contain inaccuracies - crimes being logged on 1st Jan or first of month
- Times are stored as text, but 24-hour (military) times can start with 0s, so some values contain errors
    - Cleaning is possible to a certain extent but would require assumptions (e.g. is '4' supposed to be 4am, 4pm, or a typo?)
- Times may be inaccurate if officers wait until their shift ends to log crimes (may explain peaks around midday, 6pm etc)
- Temporal data is not stored as datetime
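
One possible way to handle the issues above, sketched on invented toy values. It rests on two assumptions that would need checking against the dataset's metadata: that short time strings have simply lost their leading zeros (so `'4'` is read as 00:04, which may well be wrong), and that dates follow a month/day/year pattern with a 12-hour clock. The columns `hour`, `date_parsed`, `day_of_week`, and `first_of_month` are hypothetical names.

```python
import pandas as pd

# Toy rows, invented for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "date": ["01/01/2015 12:00:00 AM", "03/14/2021 12:00:00 AM", "not a date"],
    "time": ["2230", "45", "4"],
})

# ASSUMPTION: short values lost their leading zeros, so "45" means 00:45
padded = toy["time"].astype(str).str.strip().str.zfill(4)
hour = pd.to_numeric(padded.str[:2], errors="coerce")
minute = pd.to_numeric(padded.str[2:], errors="coerce")

# Keep the hour only when both parts fall in a valid clock range
valid = hour.between(0, 23) & minute.between(0, 59)
toy["hour"] = hour.where(valid)

# ASSUMPTION: month/day/year with a 12-hour clock; unparseable values become NaT
toy["date_parsed"] = pd.to_datetime(
    toy["date"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)
toy["day_of_week"] = toy["date_parsed"].dt.day_name()

# Flag dates on the first of the month, which may be placeholder dates
toy["first_of_month"] = toy["date_parsed"].dt.day.eq(1)
```

Flagging first-of-month rows rather than dropping them keeps the option of comparing results with and without those records.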


## Setup and Pre-processing

In [8]:
#import libraries
import pandas as pd
import numpy as np
import janitor  #pyjanitor: adds the clean_names() method to DataFrames

In [9]:
#get 2010-19 data from csv
df1 = pd.read_csv("la_crimes_2010-19.csv")

In [10]:
#get 2020-25 data from csv
df2 = pd.read_csv("la_crimes_2020-25.csv")
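
If any leading zeros in the time column are being lost at load time (an assumption, not something confirmed here), reading that column as text may help. The sketch below uses an in-memory CSV with a hypothetical raw header `TIME OCC` to show the difference between the default type inference and an explicit `dtype`.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file; the header names are hypothetical
raw = io.StringIO("TIME OCC,AREA NAME\n0045,Central\n2230,Pacific\n")

# Default inference reads the column as integers, dropping the leading zeros
default = pd.read_csv(raw)

# Rewind and read again, forcing the time column to stay as text
raw.seek(0)
as_text = pd.read_csv(raw, dtype={"TIME OCC": str})
```

With the real files, the same `dtype` argument could be passed to the `pd.read_csv` calls above, using whatever the raw column is actually called.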

In [None]:
#clean variable names
#shared mapping from cleaned source names to clearer names
rename_map = {
    "date_occ": "date",
    "time_occ": "time",
    "area_name": "area",
    "crm_cd": "crime_code",
    "crm_cd_desc": "crime_type",
    "premis_cd": "premises_code",
    "premis_desc": "premises_type",
    "weapon_used_cd": "weapon_code",
    "weapon_desc": "weapon_type",
}

df1 = df1.clean_names().rename(columns=rename_map)

#df2 also maps "area" to "area_" so it does not clash with the renamed "area_name"
df2 = df2.clean_names().rename(columns={**rename_map, "area": "area_"})

In [None]:
#join dataframes
df = pd.concat([df1, df2], ignore_index=True)
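
Because `pd.concat` fills any column that exists in only one frame with missing values, it can be worth comparing the column sets before relying on the combined data. A small sketch with invented frames follows; the names `a`, `b`, `record_id`, and `extra` are hypothetical.

```python
import pandas as pd

# Two toy frames, invented for illustration; "extra" exists only in b
a = pd.DataFrame({"record_id": [1], "area_": [3], "area": ["Central"]})
b = pd.DataFrame({"record_id": [2], "area_": [7], "area": ["Pacific"], "extra": ["x"]})

# Columns present in one frame but not the other
only_in_a = sorted(set(a.columns) - set(b.columns))
only_in_b = sorted(set(b.columns) - set(a.columns))

# Concatenation keeps the union of columns and fills the gaps with NaN
combined = pd.concat([a, b], ignore_index=True)
missing_extra = combined["extra"].isna().sum()
```

The same set comparison applied to `df1.columns` and `df2.columns` would show whether the renaming step left the two datasets with matching columns.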