# Machine Learning and Predictive Analytics: Solving a Real World Problem with Machine Learning

## Datasets: Los Angeles Crime Data 2010-19 and 2020-25
2010-19 dataset: https://catalog.data.gov/dataset/crime-data-from-2010-to-2019      
2020-25 dataset (accessed up to 25/05/2025): https://catalog.data.gov/dataset/crime-data-from-2020-to-present

## Research Question: How can we protect children from being victims of crime in Los Angeles?

The model will predict the risk level of a child becoming a victim of crime, based on demographic factors (such as age, sex, and descent) in combination with spatial and temporal variables (such as location, time of day, and day of the week).

Real-world interventions can be based on the predictions of the model. For example, if the model predicts that there is a high risk level for Black children being victimised in Central LA during weekday evenings, a local youth centre could implement targeted outreach programmes during those hours — offering safe spaces, support services, or structured activities.


## Setup and Pre-processing

In [None]:
#import libraries
import pandas as pd
import numpy as np
import janitor 
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
import pprint

In [None]:
#get 2010-19 data from csv
df1 = pd.read_csv("la_crimes_2010-19.csv")

#get 2020-25 data from csv
df2 = pd.read_csv("la_crimes_2020-25.csv")

#clean variable names
df1 = (
    df1.clean_names()
    .rename(columns={"date_occ":"date", "time_occ":"time", "area_name":"area", "crm_cd":"crime_code", "crm_cd_desc":"crime_type", "premis_cd":"premises_code", "premis_desc":"premises_type", "weapon_used_cd":"weapon_code", "weapon_desc":"weapon_type"})
    )

df2 = (
    df2.clean_names()
    .rename(columns={"date_occ":"date", "time_occ":"time", "area":"area_", "area_name":"area", "crm_cd":"crime_code", "crm_cd_desc":"crime_type", "premis_cd":"premises_code", "premis_desc":"premises_type", "weapon_used_cd":"weapon_code", "weapon_desc":"weapon_type"})
    )

#join dataframes and view all columns
df = pd.concat([df1, df2], ignore_index=True)
pd.set_option('display.max_columns', None)
df.head(40)
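
Because `pd.concat` aligns on column names and fills any unmatched column with missing values, it can be worth confirming that both frames share the same columns after renaming. The following is a minimal sketch using toy frames; the column names and values are illustrative stand-ins, not rows from the LA datasets.

```python
import pandas as pd

# Toy stand-ins for the two cleaned frames (illustrative values only)
left = pd.DataFrame({"date": ["01/01/2010"], "area": ["Central"], "vict_age": [12]})
right = pd.DataFrame({"date": ["01/01/2020"], "area_name": ["Central"], "vict_age": [15]})

# Columns present in only one frame would be NaN-filled after concatenation
mismatched = set(left.columns) ^ set(right.columns)
print(sorted(mismatched))

# Renaming so the names line up removes the mismatch
right = right.rename(columns={"area_name": "area"})
combined = pd.concat([left, right], ignore_index=True)
print(combined.isna().sum().sum())
```

Running the same symmetric-difference check on `df1.columns` and `df2.columns` before the concatenation would flag any column that the two rename dictionaries treat differently.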

### Data Cleaning

First I will remove true duplicates, as each instance should have a unique identifier (according to the metadata, this is `dr_no` - Division of Records number). Any row that is a complete duplicate is therefore likely to be attributable to a data entry error. 

Next, I will view the unique values that appear in the demographic variables to check whether they need cleaning, using the metadata to support my understanding and decisions.

In [None]:
#remove duplicate rows
df = df.drop_duplicates()
df.shape
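
Dropping exact duplicates does not by itself guarantee that the identifier is unique, since two rows could share an ID while differing elsewhere. As a hedged sketch on toy data (the values are invented for illustration), the check below lists identifiers that still repeat after de-duplication.

```python
import pandas as pd

# Toy records: rows 0 and 1 are exact duplicates; rows 2 and 3 share an ID but differ
toy = pd.DataFrame({
    "dr_no": [101, 101, 102, 102, 103],
    "vict_age": [12, 12, 30, 31, 8],
})

deduped = toy.drop_duplicates()

# IDs that still occur more than once after removing exact duplicates
repeated_ids = deduped.loc[deduped["dr_no"].duplicated(keep=False), "dr_no"].unique()
print(len(deduped), repeated_ids.tolist())
```

An empty result for the real `dr_no` column would support the assumption that complete duplicates were the only repeated records.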

In [None]:
#get demographic values and counts
vict_sex_values = df['vict_sex'].value_counts(dropna=False).to_dict()
vict_descent_values = df['vict_descent'].value_counts(dropna=False).to_dict()

print(f"Victim Sex Values:\n{vict_sex_values}\n\nVictim Descent Values:\n{vict_descent_values}")

In [None]:
#tidy victim sex variable
vict_sex_map = {
    "M": "Male",
    "F": "Female",
    "X": "Other/Unknown",
    "H": "Other/Unknown",
    "N": "Other/Unknown",
    "-": "Other/Unknown"
}
df["vict_sex"] = df["vict_sex"].map(vict_sex_map).fillna("Other/Unknown")

#tidy victim descent variable
vict_descent_map = {
    "A": "Other Asian",
    "B": "Black",
    "C": "Chinese",
    "D": "Cambodian",
    "F": "Filipino",
    "G": "Guamanian",
    "H": "Hispanic/Latin/Mexican",
    "I": "American Indian/Alaskan Native",
    "J": "Japanese",
    "K": "Korean",
    "L": "Laotian",
    "O": "Other",
    "P": "Pacific Islander",
    "S": "Samoan",
    "U": "Hawaiian",
    "V": "Vietnamese",
    "W": "White",
    "X": "Unknown",
    "Z": "Asian Indian",
    "-": "Unknown"
}
df["vict_descent"] = df["vict_descent"].map(vict_descent_map).fillna("Unknown")
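
The `.map(...).fillna(...)` pattern used above handles two cases at once: values that are missing, and codes that are not in the dictionary. A small sketch with an invented series (the `"?"` code is hypothetical) shows the behaviour.

```python
import pandas as pd

codes = pd.Series(["M", "F", "X", None, "?"])
sex_map = {"M": "Male", "F": "Female", "X": "Other/Unknown"}

# .map() yields NaN for missing values and for codes absent from the dict ("?"),
# so .fillna() sends both cases to the fallback label
tidy = codes.map(sex_map).fillna("Other/Unknown")
print(tidy.tolist())
```

One consequence is that an unexpected code is absorbed silently, so inspecting `value_counts(dropna=False)` beforehand, as done above, remains the main safeguard.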

In [None]:
#convert dates
df["date_rptd"] = pd.to_datetime(df["date_rptd"], format="%m/%d/%Y %I:%M:%S %p").dt.normalize()
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y %I:%M:%S %p").dt.normalize()

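#convert times: keep values above 99 (this drops 00:00-00:59), then zero-pad to HHMM
df = df[df["time"] > 99].copy()  #.copy() avoids SettingWithCopyWarning on the assignments below
df["time"] = df["time"].astype(str).str.zfill(4)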

#get datetime column
df["datetime_str"] = df["date"].dt.strftime("%Y-%m-%d") + " " + df["time"].str[:2] + ":" + df["time"].str[2:]
df["datetime"] = pd.to_datetime(df["datetime_str"], format="%Y-%m-%d %H:%M")
df.drop(columns="datetime_str", inplace=True)
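
To illustrate how an integer military time combines with a date, here is a minimal sketch on toy values. It uses timedeltas rather than the string concatenation above, which is an alternative route to the same result and not the method used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-03-01", "2015-03-01", "2015-03-02"]),
    "time": [5, 930, 2215],  # military time stored as integers
})

# Left-pad to four digits so 5 -> "0005" and 930 -> "0930"
hhmm = toy["time"].astype(str).str.zfill(4)

# Add hours and minutes to the date as timedeltas
toy["datetime"] = (
    toy["date"]
    + pd.to_timedelta(hhmm.str[:2].astype(int), unit="h")
    + pd.to_timedelta(hhmm.str[2:].astype(int), unit="min")
)
print(toy["datetime"].dt.strftime("%Y-%m-%d %H:%M").tolist())
```

The sketch also shows that a value such as 5 pads cleanly to 00:05; whether rows with times below 100 should be kept is a separate modelling decision.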

I will drop the columns that will not be used for the model, then drop rows with missing victim age, as this variable is essential for building my model. I will also drop rows where the age is zero or less: `vict_age` contains many zeros and negative numbers, possibly because some crimes (e.g. vandalism) have no known or human victim.

In [None]:
#drop columns that won't be used for the model
df = df.drop(columns=["dr_no", "date_rptd", "area_", "rpt_dist_no", "part_1_2", "crime_code", "mocodes", "premises_code", "weapon_code", "status", "status_desc", "crm_cd_1", "crm_cd_2", "crm_cd_3", "crm_cd_4", "location", "cross_street"])

#drop rows with missing victim age
df = df.dropna(subset=["vict_age"])

#drop rows where age is zero or less
df = df[df["vict_age"] > 0]

In [None]:
df.head(4)

### Grouping Categorical Data

#### Demographics

The `vict_sex` variable requires no grouping, as it has only three categories, none of which is too low in frequency.

However, there are several low-frequency values in `vict_descent`. Knowing that this will be a feature in my model, and to avoid anonymisation issues, I will combine the lower-frequency groups, which I define as those with fewer than around 23,000 records (1% of the dataset), into broader categories. This approach will help protect the validity of future tests and models; multiple low-frequency groups in this context would represent too small a sample size for each value to draw meaningful conclusions (Krishnan, 2011).

I have chosen to group individuals of various Asian descents into a single broader "Asian" category, and to combine the remaining low-frequency groups into "Other/Unknown". This decision was made with care and awareness of the sensitivities involved, as different Asian subgroups (such as Chinese, Filipino, Korean, Vietnamese, and Asian Indian) have distinct cultural, socioeconomic, and historical backgrounds. However, the decision was based on methodological needs rather than sociopolitical assumptions: aggregating was necessary to ensure a sufficient sample size for model training and to avoid introducing noise or skew into the predictive model through sparsity. The primary objective of the analysis is to explore child victimisation patterns using machine learning, and for this purpose overly granular ethnic categories could lead to unreliable or misleading outputs. While disaggregation may be more appropriate in some social or policy contexts, grouping was deemed the best option here to maintain model robustness and analytical clarity, in line with statistical practices outlined by the UK government when appropriate justifications exist for doing so (UK Government Ethnicity Data Guidance, 2020).

In [None]:
vict_descent_map_2 = {
    "Other Asian": "Asian",
    "Chinese": "Asian",
    "Cambodian": "Asian",
    "Filipino": "Asian",
    "Guamanian": "Other/Unknown",
    "American Indian/Alaskan Native": "Other/Unknown",
    "Japanese": "Asian",
    "Korean": "Asian",
    "Laotian": "Asian",
    "Pacific Islander": "Other/Unknown",
    "Samoan": "Other/Unknown",
    "Hawaiian": "Other/Unknown",
    "Vietnamese": "Asian",
    "Asian Indian": "Asian",
    "Other": "Other/Unknown",
    "Unknown": "Other/Unknown"
}
df["vict_descent"] = df["vict_descent"].map(vict_descent_map_2).combine_first(df["vict_descent"])
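
The mapping above was defined by hand. As a hedged sketch of how low-frequency groups could be identified programmatically, the example below applies a 1% share threshold to an invented series; unlike the manual mapping, it sends every rare group to a single fallback label and does not create a combined "Asian" category.

```python
import pandas as pd

# Toy descent column: 200 rows with deliberately uneven group sizes
descent = pd.Series(
    ["Hispanic"] * 120 + ["White"] * 50 + ["Black"] * 28 + ["Samoan"] * 1 + ["Guamanian"] * 1
)

# Share of rows per group; anything under the threshold counts as low-frequency
threshold = 0.01  # 1% of rows
shares = descent.value_counts(normalize=True)
rare = shares[shares < threshold].index

# Keep frequent labels and replace rare ones with the fallback
grouped = descent.where(~descent.isin(rare), "Other/Unknown")
print(grouped.value_counts().to_dict())
```

A check like this is mainly useful for confirming which groups fall under the cut-off before deciding how to merge them.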

#### Weapon Type

There are 80 unique weapon types in the dataset. I felt that this was a small enough number to handle mostly manually, so I defined categories that I felt to be logical. For efficiency, I asked ChatGPT to categorise each weapon type into my pre-defined categories (OpenAI, 2025). I then edited the resulting dictionary where I disagreed with ChatGPT's decisions, to ensure that the categorisation was logical, e.g. changing "SYRINGE" from "Burning/Toxic Substance" to "Knife/Blade/Sharp Object".

In [None]:
#define weapon categories
weapon_map = {
    "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)": "Bodily Force",
    "UNKNOWN WEAPON/OTHER WEAPON": "Other/Unknown/No Weapon Used",
    "VERBAL THREAT": "Verbal Threat",
    "HAND GUN": "Gun/Firearm",
    "KNIFE WITH BLADE 6INCHES OR LESS": "Knife/Blade/Sharp Object",
    "SEMI-AUTOMATIC PISTOL": "Gun/Firearm",
    "OTHER KNIFE": "Knife/Blade/Sharp Object",
    "UNKNOWN FIREARM": "Gun/Firearm",
    "VEHICLE": "Vehicle",
    "MACE/PEPPER SPRAY": "Burning/Toxic Substance",
    "BOTTLE": "Blunt/Hitting Object",
    "STICK": "Blunt/Hitting Object",
    "ROCK/THROWN OBJECT": "Blunt/Hitting Object",
    "CLUB/BAT": "Blunt/Hitting Object",
    "FOLDING KNIFE": "Knife/Blade/Sharp Object",
    "REVOLVER": "Gun/Firearm",
    "KITCHEN KNIFE": "Knife/Blade/Sharp Object",
    "BLUNT INSTRUMENT": "Blunt/Hitting Object",
    "KNIFE WITH BLADE OVER 6 INCHES IN LENGTH": "Knife/Blade/Sharp Object",
    "PIPE/METAL PIPE": "Blunt/Hitting Object",
    "AIR PISTOL/REVOLVER/RIFLE/BB GUN": "Gun/Firearm",
    "SIMULATED GUN": "Gun/Firearm",
    "BELT FLAILING INSTRUMENT/CHAIN": "Blunt/Hitting Object",
    "OTHER CUTTING INSTRUMENT": "Knife/Blade/Sharp Object",
    "HAMMER": "Blunt/Hitting Object",
    "PHYSICAL PRESENCE": "Bodily Force",
    "SCREWDRIVER": "Knife/Blade/Sharp Object",
    "MACHETE": "Knife/Blade/Sharp Object",
    "UNKNOWN TYPE CUTTING INSTRUMENT": "Knife/Blade/Sharp Object",
    "SCISSORS": "Knife/Blade/Sharp Object",
    "OTHER FIREARM": "Gun/Firearm",
    "CONCRETE BLOCK/BRICK": "Blunt/Hitting Object",
    "SHOTGUN": "Gun/Firearm",
    "RIFLE": "Gun/Firearm",
    "FIXED OBJECT": "Blunt/Hitting Object",
    "STUN GUN": "Gun/Firearm",
    "BOARD": "Blunt/Hitting Object",
    "FIRE": "Burning/Toxic Substance",
    "GLASS": "Blunt/Hitting Object",
    "SWITCH BLADE": "Knife/Blade/Sharp Object",
    "CAUSTIC CHEMICAL/POISON": "Burning/Toxic Substance",
    "BRASS KNUCKLES": "Blunt/Hitting Object",
    "AXE": "Knife/Blade/Sharp Object",
    "TIRE IRON": "Blunt/Hitting Object",
    "SCALDING LIQUID": "Burning/Toxic Substance",
    "TOY GUN": "Gun/Firearm",
    "RAZOR BLADE": "Knife/Blade/Sharp Object",
    "SWORD": "Knife/Blade/Sharp Object",
    "BOMB THREAT": "Verbal Threat",
    "RAZOR": "Knife/Blade/Sharp Object",
    "ICE PICK": "Knife/Blade/Sharp Object",
    "HECKLER & KOCH 93 SEMIAUTOMATIC ASSAULT RIFLE": "Gun/Firearm",
    "ASSAULT WEAPON/UZI/AK47/ETC": "Gun/Firearm",
    "DIRK/DAGGER": "Knife/Blade/Sharp Object",
    "LIQUOR/DRUGS": "Other/Unknown/No Weapon Used",
    "EXPLOXIVE DEVICE": "Burning/Toxic Substance",
    "AUTOMATIC WEAPON/SUB-MACHINE GUN": "Gun/Firearm",
    "SAWED OFF RIFLE/SHOTGUN": "Gun/Firearm",
    "STARTER PISTOL/REVOLVER": "Gun/Firearm",
    "ROPE/LIGATURE": "Other/Unknown/No Weapon Used",
    "SEMI-AUTOMATIC RIFLE": "Gun/Firearm",
    "CLEAVER": "Knife/Blade/Sharp Object",
    "BOWIE KNIFE": "Knife/Blade/Sharp Object",
    "DOG/ANIMAL (SIC ANIMAL ON)": "Other/Unknown/No Weapon Used",
    "DEMAND NOTE": "Verbal Threat",
    "STRAIGHT RAZOR": "Knife/Blade/Sharp Object",
    "BLACKJACK": "Blunt/Hitting Object",
    "SYRINGE": "Knife/Blade/Sharp Object",
    "BOW AND ARROW": "Other/Unknown/No Weapon Used",
    "MARTIAL ARTS WEAPONS": "Blunt/Hitting Object",
    "UNK TYPE SEMIAUTOMATIC ASSAULT RIFLE": "Gun/Firearm",
    "UZI SEMIAUTOMATIC ASSAULT RIFLE": "Gun/Firearm",
    "RELIC FIREARM": "Gun/Firearm",
    "HECKLER & KOCH 91 SEMIAUTOMATIC ASSAULT RIFLE": "Gun/Firearm",
    "ANTIQUE FIREARM": "Gun/Firearm",
    "MAC-10 SEMIAUTOMATIC ASSAULT WEAPON": "Gun/Firearm",
    "MAC-11 SEMIAUTOMATIC ASSAULT WEAPON": "Gun/Firearm",
    "M1-1 SEMIAUTOMATIC ASSAULT RIFLE": "Gun/Firearm",
    "M-14 SEMIAUTOMATIC ASSAULT RIFLE": "Gun/Firearm"
}

#map to dataframe, drop original column
df["weapon_group"] = df["weapon_type"].map(weapon_map).fillna("Other/Unknown/No Weapon Used")
df.drop("weapon_type", axis=1, inplace=True)
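
Because `.fillna()` assigns the fallback label to anything the dictionary does not cover, a weapon type missing from the dictionary would be grouped silently. The sketch below, using an invented series and a two-entry dictionary ("LASER POINTER" is a hypothetical value), lists non-null values that have no mapping.

```python
import pandas as pd

toy_weapons = pd.Series(["HAND GUN", "BOTTLE", "LASER POINTER", None, "HAND GUN"])
toy_map = {"HAND GUN": "Gun/Firearm", "BOTTLE": "Blunt/Hitting Object"}

# Non-null values with no dictionary entry would be absorbed by the fallback label
observed = toy_weapons.dropna().unique()
unmapped = sorted(set(observed) - set(toy_map))
print(unmapped)
```

Applied to the real `weapon_type` column and `weapon_map` before the original column is dropped, an empty list would confirm that only missing values reach the fallback category.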

#### Crime Type

There are 142 crime types in the dataset, which I felt was too many to group entirely by hand. However, many of them share keywords (e.g. "THEFT"), so I wrote a function to group them by keyword. The checks run in order, so each crime type is assigned to the first group whose keyword it contains.

In [None]:
#define function to group crimes
def crime_grouping(crime):
    if pd.isna(crime):
        return "Other"
    crime = crime.upper()
    if any(word in crime for word in ["CHILD"]):
        return "Offense Against a Child"
    elif any(word in crime for word in ["HOMICIDE", "MANSLAUGHTER", "LYNCHING"]):
        return "Murder/Manslaughter"
    elif any(word in crime for word in ["ASSAULT", "BRANDISH", "SHOTS", "BATTERY", "BOMB"]):
        return "Assault/Violence"
    elif any(word in crime for word in ["THEFT", "BURGLARY", "ROBBERY", "STOLEN", "EXTORTION", "PICKPOCKET", "SNATCHING", "BUNCO", "FRAUD", "COUNTERFEIT"]):
        return "Theft-Related"
    elif any(word in crime for word in ["VANDALISM", "ARSON"]):
        return "Property Damage"
    elif any(word in crime for word in ["VIOLATION", "TRESPASSING", "DISTURBING", "CONTEMPT", "THROWING", "RESISTING", "STALKING", "PROWLER", "THREAT"]):
        return "Public Order/Threatening Behaviour"
    elif any(word in crime for word in ["LEWD", "SEX", "RAPE", "PENETRATION", "INDECENT", "COPULATION", "PEEPING", "PIMPING"]):
        return "Sexual Offence"
    elif any(word in crime for word in ["KIDNAPPING", "IMPRISONMENT", "TRAFFICKING"]):
        return "Kidnapping/Trafficking"
    else:
        return "Other"

#apply function to dataframe, drop original column
df["crime_group"] = df["crime_type"].apply(crime_grouping).fillna("Other")
df.drop("crime_type", axis=1, inplace=True)
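
With first-match keyword rules, the order of the checks determines the outcome whenever a description contains keywords from more than one group. The following simplified sketch uses two rules and an illustrative description; `first_match` and `RULES` are hypothetical names, not part of the function above.

```python
# Ordered (keywords, label) rules; the first rule with a matching keyword wins
RULES = [
    (["ASSAULT", "BATTERY"], "Assault/Violence"),
    (["SEX", "RAPE"], "Sexual Offence"),
]

def first_match(description, rules=RULES, default="Other"):
    """Return the label of the first rule whose keyword appears in the description."""
    text = description.upper()
    for keywords, label in rules:
        if any(word in text for word in keywords):
            return label
    return default

example = "BATTERY WITH SEXUAL CONTACT"
in_order = first_match(example)
reversed_order = first_match(example, rules=RULES[::-1])
print(in_order, "|", reversed_order)
```

The same description lands in a different group once the rules are reversed, so it may be worth reviewing descriptions that match several keyword lists.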

#### Premises Type

There are 319 premises types, with very little possibility for grouping using the same methods as above, as the vast majority have unique names with few repeating words. As such, I used a hybrid approach as a preprocessing step, combining semantic similarity clustering with manual corrections.

First, I used a sentence transformer model (all-MiniLM-L6-v2) to encode both the unique premises types and 12 predefined categories that I felt reflected the nature of the data. Each premises type was then assigned to the most semantically similar category, based on the cosine similarity between their embeddings. However, given the complexity and variety of the premises descriptions, this automated step produced several misclassifications. To address this, I wrote a correction function that uses keyword matching to reassign premises types containing specific identifying words to their appropriate categories. For example, any premises containing 'BANK' was reassigned to 'Financial', while those containing words like 'HOME', 'DRIVEWAY', or 'GARAGE' were moved to 'Residence/Private Outdoor Space'.

This two-stage process combines the efficiency of automated semantic clustering with the accuracy of manual corrections. It reduces 319 unique premises types to 12 categories, creating a more manageable and interpretable feature for predicting when and where children are most likely to be victims of crime.

In [None]:
#define preferred clusters
categories = ["Residence/Private Outdoor Space", "Street/Public Outdoor Space", "Transport Hub/Vehicle", "Restaurant/Eatery", "Store/Mall/Business", "Education", "Public Services/Healthcare", "Place of Worship", "Leisure/Entertainment/Sport", "Online", "Financial", "Other"]

model = SentenceTransformer("all-MiniLM-L6-v2")
unique_premises = df["premises_type"].dropna().unique().tolist()

#encode to unit-length vectors so that the dot product below equals cosine similarity
premises_embeddings = model.encode(unique_premises, normalize_embeddings=True)
category_embeddings = model.encode(categories, normalize_embeddings=True)

#assign each premises type to its most similar category
type_clusters = {}
for i, premise in enumerate(unique_premises):
    similarities = np.dot(premises_embeddings[i], category_embeddings.T)
    best_category = categories[np.argmax(similarities)]
    type_clusters[premise] = best_category

df["premises_group"] = df["premises_type"].map(type_clusters)

#define function to regroup incorrect clusters
def premises_grouping(premises, current_group):
    if pd.isna(premises):
        return "Other"
    premises = premises.upper()
    if any(word in premises for word in ["BANK"]):
        return "Financial"
    elif any(word in premises for word in ["PUBLIC STORAGE", "DIY", "VALET", "OFFICE", "RADIO", "FACTORY", "MARKET", "OTHER BUSINESS", "CONNECTION", "SALES", "BMW", "CAR WASH", "GROVE", "EQUIPMENT", "COURIER"]):
        return "Store/Mall/Business"
    elif any(word in premises for word in ["HOME", "DRIVEWAY", "PATIO", "PORCH", "FOSTER", "GARAGE", "MOBILE", "BALCONY", "PROJECT"]):
        return "Residence/Private Outdoor Space"
    elif any(word in premises for word in ["FIRE", "SEWAGE", "CLINIC", "LIBRARY", "HOSPITAL", "MORTUARY", "HOSPICE", "ENERGY", "CARE", "WATER", "JAIL", "POLICE", "DENTAL", "RECYCLING"]):
        return "Public Services/Healthcare"
    elif any(word in premises for word in ["HARBOR", "LINE", "PARKING", "TRAM", "AIRCRAFT", "CHARTER", "MTA"]):
        return "Transport Hub/Vehicle"
    elif any(word in premises for word in ["RINK", "BASKETBALL", "ARCADE", "COCKTAIL", "MUSEUM", "STAPLES", "STADIUM", "BEVERLY", "VACATION", "HOTEL", "MOTEL", "BOWLING"]):
        return "Leisure/Entertainment/Sport"
    elif any(word in premises for word in ["ALLEY", "TRASH", "TUNNEL", "PAYPHONE", "FREEWAY", "GATHERING", "TRANSIENT", "BEACH", "RESERVOIR", "RIVER", "BRIDGE", "OTHER/OUTSIDE"]):
        return "Street/Public Outdoor Space"
    elif any(word in premises for word in ["COFFEE"]):
        return "Restaurant/Eatery"
    elif any(word in premises for word in ["SWAP", "ESCALATOR", "STAIR", "ELEVATOR", "ABATEMENT", "TACTICAL", "RETIRED", "SHED"]):
        return "Other"
    else:
        return current_group

#apply function to dataframe, drop original column
df["premises_group"] = df.apply(lambda row: premises_grouping(row["premises_type"], row["premises_group"]), axis=1).fillna("Other")
df.drop("premises_type", axis=1, inplace=True)
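
The assignment step can be illustrated without downloading a model. The sketch below uses made-up three-dimensional vectors in place of real sentence embeddings (all names and numbers are assumptions for illustration) and picks, for each premises type, the category with the highest cosine similarity.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for sentence-transformer output
category_names = ["Education", "Financial"]
category_vecs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
premise_names = ["HIGH SCHOOL", "CREDIT UNION"]
premise_vecs = np.array([[3.0, 1.0, 0.0], [0.5, 2.0, 0.5]])

def l2_normalise(matrix):
    # Scale each row to unit length so a dot product equals cosine similarity
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

# Rows are premises types, columns are categories
similarity = l2_normalise(premise_vecs) @ l2_normalise(category_vecs).T
assigned = {
    name: category_names[idx]
    for name, idx in zip(premise_names, similarity.argmax(axis=1))
}
print(assigned)
```

Computing the full similarity matrix in one matrix product gives the same assignments as looping over premises types one at a time, and keeps the scores available for inspecting borderline cases.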

## Encoding Categorical Variables

## Exploratory Data Analysis