# Machine Learning and Predictive Analytics: Solving a Real World Problem with Machine Learning

## Datasets: Los Angeles Crime Data 2010-19 and 2020-25
2010-19 dataset: https://catalog.data.gov/dataset/crime-data-from-2010-to-2019      
2020-25 dataset (accessed up to 25/05/2025): https://catalog.data.gov/dataset/crime-data-from-2020-to-present

## Research Question: How can we protect children from being victims of crime in Los Angeles?

The model will predict the risk level of a child becoming a victim of crime, based on demographic factors (such as age, sex, and descent) in combination with spatial and temporal variables (such as location, time of day, and day of the week).

Real-world interventions can be based on the predictions of the model. For example, if the model predicts that there is a high risk level for Black children being victimised in Central LA during weekday evenings, a local youth centre could implement targeted outreach programmes during those hours — offering safe spaces, support services, or structured activities.

## Data Preparation and Exploration

### Setup

In [1]:
#import libraries
import pandas as pd
import numpy as np
import janitor 
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

First, I will read in my CSV files, forcing the `time_occ` column to be read as a string. This is necessary because the time is stored as 24-hour (military) time in text format, so if I allow pandas to convert it to int64, it will remove the leading 0s (e.g. 0800, meaning 8am, would become 800).

I will clean the variable names to make them uniform and join the two datasets together.

Then I will remove true duplicates, as each instance should have a unique identifier (according to the metadata, this is `dr_no` - Division of Records number). Any row that is a complete duplicate is therefore likely to be attributable to a data entry error. 
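
The effect of the `dtype` argument can be seen on a small in-memory CSV. This is a minimal sketch: the three rows are made up for illustration, and only the column name `TIME OCC` is taken from the real files.

```python
import io

import pandas as pd

# Made-up sample rows; only the "TIME OCC" column name comes from the real data.
csv_text = "DR_NO,TIME OCC\n1,0800\n2,0045\n3,1350\n"

# Default parsing infers an integer column, so leading zeros are lost.
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the four-character time intact.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"TIME OCC": str})

print(as_int["TIME OCC"].tolist())
print(as_str["TIME OCC"].tolist())
```

Keeping the text form means the first two characters can always be read as the hour and the last two as the minute.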

In [2]:
#get 2010-19 data from csv
df1 = pd.read_csv("la_crimes_2010-19.csv", dtype={"TIME OCC": str})

#get 2020-25 data from csv
df2 = pd.read_csv("la_crimes_2020-25.csv", dtype={"TIME OCC": str})

#clean variable names
df1 = (
    df1.clean_names()
    .rename(columns={"date_occ":"date", "time_occ":"time_str", "area_name":"area", "crm_cd":"crime_code", "crm_cd_desc":"crime_type", "premis_cd":"premises_code", "premis_desc":"premises_type", "weapon_used_cd":"weapon_code", "weapon_desc":"weapon_type"})
    )

df2 = (
    df2.clean_names()
    .rename(columns={"date_occ":"date", "time_occ":"time_str", "area":"area_", "area_name":"area", "crm_cd":"crime_code", "crm_cd_desc":"crime_type", "premis_cd":"premises_code", "premis_desc":"premises_type", "weapon_used_cd":"weapon_code", "weapon_desc":"weapon_type"})
    )

#join dataframes and view all columns
df = pd.concat([df1, df2], ignore_index=True)
pd.set_option('display.max_columns', None)
df.head(4)

Unnamed: 0,dr_no,date_rptd,date,time_str,area_,area,rpt_dist_no,part_1_2,crime_code,crime_type,mocodes,vict_age,vict_sex,vict_descent,premises_code,premises_type,weapon_code,weapon_type,status,status_desc,crm_cd_1,crm_cd_2,crm_cd_3,crm_cd_4,location,cross_street,lat,lon
0,1307355,02/20/2010 12:00:00 AM,02/20/2010 12:00:00 AM,1350,13,Newton,1385,2,900,VIOLATION OF COURT ORDER,0913 1814 2000,48,M,H,501.0,SINGLE FAMILY DWELLING,,,AA,Adult Arrest,900.0,,,,300 E GAGE AV,,33.9825,-118.2695
1,11401303,09/13/2010 12:00:00 AM,09/12/2010 12:00:00 AM,45,14,Pacific,1485,2,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329,0,M,W,101.0,STREET,,,IC,Invest Cont,740.0,,,,SEPULVEDA BL,MANCHESTER AV,33.9599,-118.3962
2,70309629,08/09/2010 12:00:00 AM,08/09/2010 12:00:00 AM,1515,13,Newton,1324,2,946,OTHER MISCELLANEOUS CRIME,0344,0,M,H,103.0,ALLEY,,,IC,Invest Cont,946.0,,,,1300 E 21ST ST,,34.0224,-118.2524
3,90631215,01/05/2010 12:00:00 AM,01/05/2010 12:00:00 AM,150,6,Hollywood,646,2,900,VIOLATION OF COURT ORDER,1100 0400 1402,47,F,W,101.0,STREET,102.0,HAND GUN,IC,Invest Cont,900.0,998.0,,,CAHUENGA BL,HOLLYWOOD BL,34.1016,-118.3295


In [3]:
#remove duplicate rows and check size
df = df.drop_duplicates()
df.shape

(3078018, 28)
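
`drop_duplicates()` only removes rows that match in every column. As an optional follow-up that is not part of the original notebook, the sketch below checks on toy data whether any identifier still appears more than once after exact duplicates are removed. The toy values are invented.

```python
import pandas as pd

# Toy data: dr_no 1 is an exact duplicate, dr_no 3 is repeated with different details.
toy = pd.DataFrame({
    "dr_no": [1, 1, 2, 3, 3],
    "crime_type": ["A", "A", "B", "C", "D"],
})

# Remove rows that are identical in every column.
deduped = toy.drop_duplicates()

# Rows that still share an identifier differ somewhere else and need a closer look.
clashes = deduped[deduped["dr_no"].duplicated(keep=False)]

print(len(deduped))
print(clashes)
```

Any rows in `clashes` would be conflicting records rather than simple data entry repeats, so they would need a separate decision.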

### Exploring and cleaning variables

#### Demographics

I will view the unique values that appear in the demographic variables to check whether they need cleaning, using the metadata to support my understanding and decisions.

In [4]:
#get demographic values and counts
vict_sex_values = df['vict_sex'].value_counts(dropna=False).to_dict()
vict_descent_values = df['vict_descent'].value_counts(dropna=False).to_dict()

print(f"Victim Sex Values:\n{vict_sex_values}\n\nVictim Descent Values:\n{vict_descent_values}")

Victim Sex Values:
{'M': 1359082, 'F': 1229587, nan: 337219, 'X': 151926, 'H': 185, 'N': 17, '-': 2}

Victim Descent Values:
{'H': 1006084, 'W': 704253, 'B': 464311, nan: 337277, 'O': 276859, 'X': 183028, 'A': 71246, 'K': 14333, 'F': 7368, 'C': 5693, 'J': 1999, 'I': 1959, 'V': 1401, 'Z': 717, 'P': 624, 'U': 408, 'G': 152, 'D': 115, 'L': 97, 'S': 89, '-': 5}


In [5]:
#tidy victim sex variable
vict_sex_map = {
    "M": "Male",
    "F": "Female",
    "X": "Other/Unknown",
    "H": "Other/Unknown",
    "N": "Other/Unknown",
    "-": "Other/Unknown"
}
df["vict_sex"] = df["vict_sex"].map(vict_sex_map).fillna("Other/Unknown")

#tidy victim descent variable
vict_descent_map = {
    "A": "Other Asian",
    "B": "Black",
    "C": "Chinese",
    "D": "Cambodian",
    "F": "Filipino",
    "G": "Guamanian",
    "H": "Hispanic/Latin/Mexican",
    "I": "American Indian/Alaskan Native",
    "J": "Japanese",
    "K": "Korean",
    "L": "Laotian",
    "O": "Other",
    "P": "Pacific Islander",
    "S": "Samoan",
    "U": "Hawaiian",
    "V": "Vietnamese",
    "W": "White",
    "X": "Unknown",
    "Z": "Asian Indian",
    "-": "Unknown"
}
df["vict_descent"] = df["vict_descent"].map(vict_descent_map).fillna("Unknown")

#### Dates and times

I will clean the date and time variables by converting them into datetime objects. This will allow me to extract granular features later, including day of the week and hour of the day.


In [6]:
#convert dates
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y %I:%M:%S %p").dt.normalize()

#convert times
df["time"] = pd.to_datetime(df["time_str"], format="%H%M", errors="coerce").dt.time

#get datetime column
df["datetime_str"] = df["date"].dt.strftime("%Y-%m-%d") + " " + df["time_str"].str[:2] + ":" + df["time_str"].str[2:]
df["datetime"] = pd.to_datetime(df["datetime_str"], format="%Y-%m-%d %H:%M")
df.drop(columns="datetime_str", inplace=True)
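
The string slicing above relies on every `time_str` value being exactly four characters long. As a defensive sketch (an assumption, not a step in the original notebook), shorter strings can be zero-padded before parsing. The sample values are invented.

```python
import pandas as pd

# Invented sample: some values have lost their zero padding, one is missing.
times = pd.Series(["1350", "150", "45", "0800", None])

# Pad to four characters so that "150" is read as 01:50.
padded = times.str.zfill(4)

# Anything that still cannot be parsed becomes NaT instead of raising an error.
parsed = pd.to_datetime(padded, format="%H%M", errors="coerce")

print(padded.tolist())
print(parsed.dt.strftime("%H:%M").tolist())
```

Missing values pass through as missing, so they can be counted and handled separately.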

#### Removing unnecessary data

First, I will drop the columns that will not be used for the model. I will then drop rows with missing victim age, as this variable is essential for building my model. I will also drop rows where the age is zero or less: `vict_age` contains many 0s and negative numbers, possibly because these crimes have no known or human victim (e.g. vandalism).

In [8]:
#drop columns that won't be used for the model
df = df.drop(columns=["dr_no", "date_rptd", "area_", "rpt_dist_no", "part_1_2", "crime_code", "mocodes", "premises_code", "weapon_code", "status", "status_desc", "crm_cd_1", "crm_cd_2", "crm_cd_3", "crm_cd_4", "location", "cross_street"])

#drop rows with missing victim age
df = df.dropna(subset=["vict_age"])

#drop rows where age is zero or less
df = df[df["vict_age"] > 0]

In [9]:
df.head(4)

Unnamed: 0,date,time_str,area,crime_type,vict_age,vict_sex,vict_descent,premises_type,weapon_type,lat,lon,time,datetime
0,2010-02-20,1350,Newton,VIOLATION OF COURT ORDER,48,Male,Hispanic/Latin/Mexican,SINGLE FAMILY DWELLING,,33.9825,-118.2695,13:50:00,2010-02-20 13:50:00
3,2010-01-05,150,Hollywood,VIOLATION OF COURT ORDER,47,Female,White,STREET,HAND GUN,34.1016,-118.3295,01:50:00,2010-01-05 01:50:00
4,2010-01-02,2100,Central,"RAPE, ATTEMPTED",47,Female,Hispanic/Latin/Mexican,ALLEY,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",34.0387,-118.2488,21:00:00,2010-01-02 21:00:00
5,2010-01-04,1650,Central,SHOPLIFTING - PETTY THEFT ($950 & UNDER),23,Male,Black,DEPARTMENT STORE,,34.048,-118.2577,16:50:00,2010-01-04 16:50:00


### Grouping Categorical Data

#### Demographics

The `vict_sex` variable requires no grouping, as it has only three categories and none of them is too infrequent.

However, there are several low-frequency values in `vict_descent`. Knowing that this will be a feature in my model, and to avoid anonymisation issues, I will combine some of the lower-frequency ethnicities. I chose to combine all groups of Asian descent into "Asian", and to reassign the remaining ethnicities (all of which represented less than 1% of the dataset) to "Other/Unknown". This decision was made with sensitivity and recognition that different Asian subgroups, such as Indian, Vietnamese and Korean, have distinct cultural/socioeconomic backgrounds (UK Gov, 2020). For another type of project, this aggregation would potentially be inappropriate, but I deemed that grouping was most suitable to ensure the robustness of my model and avoid overly granular categories that could lead to unreliable output.

In [10]:
#define descent categories
vict_descent_map_2 = {
    "Other Asian": "Asian",
    "Chinese": "Asian",
    "Cambodian": "Asian",
    "Filipino": "Asian",
    "Guamanian": "Other/Unknown",
    "American Indian/Alaskan Native": "Other/Unknown",
    "Japanese": "Asian",
    "Korean": "Asian",
    "Laotian": "Asian",
    "Pacific Islander": "Other/Unknown",
    "Samoan": "Other/Unknown",
    "Hawaiian": "Other/Unknown",
    "Vietnamese": "Asian",
    "Asian Indian": "Asian",
    "Other": "Other/Unknown",
    "Unknown": "Other/Unknown"
}

#map to dataframe
df["vict_descent"] = df["vict_descent"].map(vict_descent_map_2).combine_first(df["vict_descent"])
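
The "less than 1% of the dataset" criterion can be checked from relative frequencies. The sketch below uses invented counts and a generic threshold rule. The notebook itself uses the hand-written mapping above, so this is only a way of finding candidate categories.

```python
import pandas as pd

# Invented counts, for illustration only.
descent = pd.Series(
    ["Hispanic/Latin/Mexican"] * 600 + ["White"] * 250 + ["Black"] * 145 + ["Samoan"] * 5
)

# Share of each category in the column.
shares = descent.value_counts(normalize=True)

# Categories below the 1% threshold are candidates for merging.
rare = shares[shares < 0.01].index

# Replace rare categories with a single label and leave the rest unchanged.
grouped = descent.where(~descent.isin(rare), "Other/Unknown")

print(shares.round(3).to_dict())
print(grouped.value_counts().to_dict())
```

A purely automatic rule like this would not keep the Asian subgroups together, which is why a manual mapping is still needed for that decision.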

#### Weapon Type

There are 80 unique weapon types in the dataset. I felt that this was a small enough number to handle mostly manually, so I defined categories that I felt to be logical. For efficiency, I asked ChatGPT to categorise each of the weapon types into my pre-defined categories (OpenAI, 2025). I then edited the dictionary slightly to tweak the decisions made by ChatGPT and ensure that the categorisation was logical, e.g. changing "SYRINGE" from "Burning/Toxic Substance" to "Knife/Blade/Sharp Object".

In [11]:
#define weapon categories
weapon_map = {
    "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)": "Bodily Force",
    "UNKNOWN WEAPON/OTHER WEAPON": "Other/Unknown/No Weapon Used",
    "VERBAL THREAT": "Verbal Threat",
    "HAND GUN": "Gun/Firearm",
    "KNIFE WITH BLADE 6INCHES OR LESS": "Knife/Blade/Sharp Object",
    "SEMI-AUTOMATIC PISTOL": "Gun/Firearm",
    "OTHER KNIFE": "Knife/Blade/Sharp Object",
    "UNKNOWN FIREARM": "Gun/Firearm",
    "VEHICLE": "Vehicle",
    "MACE/PEPPER SPRAY": "Burning/Toxic Substance",
    "BOTTLE": "Blunt/Hitting Object",
    "STICK": "Blunt/Hitting Object",
    "ROCK/THROWN OBJECT": "Blunt/Hitting Object",
    "CLUB/BAT": "Blunt/Hitting Object",
    "FOLDING KNIFE": "Knife/Blade/Sharp Object",
    "REVOLVER": "Gun/Firearm",
    "KITCHEN KNIFE": "Knife/Blade/Sharp Object",
    "BLUNT INSTRUMENT": "Blunt/Hitting Object",
    "KNIFE WITH BLADE OVER 6 INCHES IN LENGTH": "Knife/Blade/Sharp Object",
    "PIPE/METAL PIPE": "Blunt/Hitting Object",
    "AIR PISTOL/REVOLVER/RIFLE/BB GUN": "Gun/Firearm",
    "SIMULATED GUN": "Gun/Firearm",
    "BELT FLAILING INSTRUMENT/CHAIN": "Blunt/Hitting Object",
    "OTHER CUTTING INSTRUMENT": "Knife/Blade/Sharp Object",
    "HAMMER": "Blunt/Hitting Object",
    "PHYSICAL PRESENCE": "Bodily Force",
    "SCREWDRIVER": "Knife/Blade/Sharp Object",
    "MACHETE": "Knife/Blade/Sharp Object",
    "UNKNOWN TYPE CUTTING INSTRUMENT": "Knife/Blade/Sharp Object",
    "SCISSORS": "Knife/Blade/Sharp Object",
    "OTHER FIREARM": "Gun/Firearm",
    "CONCRETE BLOCK/BRICK": "Blunt/Hitting Object",
    "SHOTGUN": "Gun/Firearm",
    "RIFLE": "Gun/Firearm",
    "FIXED OBJECT": "Blunt/Hitting Object",
    "STUN GUN": "Gun/Firearm",
    "BOARD": "Blunt/Hitting Object",
    "FIRE": "Burning/Toxic Substance",
    "GLASS": "Blunt/Hitting Object",
    "SWITCH BLADE": "Knife/Blade/Sharp Object",
    "CAUSTIC CHEMICAL/POISON": "Burning/Toxic Substance",
    "BRASS KNUCKLES": "Blunt/Hitting Object",
    "AXE": "Knife/Blade/Sharp Object",
    "TIRE IRON": "Blunt/Hitting Object",
    "SCALDING LIQUID": "Burning/Toxic Substance",
    "TOY GUN": "Gun/Firearm",
    "RAZOR BLADE": "Knife/Blade/Sharp Object",
    "SWORD": "Knife/Blade/Sharp Object",
    "BOMB THREAT": "Verbal Threat",
    "RAZOR": "Knife/Blade/Sharp Object",
    "ICE PICK": "Knife/Blade/Sharp Object",
    "HECKLER & KOCH 93 SEMIAUTOMATIC ASSAULT RIFLE": "Gun/Firearm",
    "ASSAULT WEAPON/UZI/AK47/ETC": "Gun/Firearm",
    "DIRK/DAGGER": "Knife/Blade/Sharp Object",
    "LIQUOR/DRUGS": "Other/Unknown/No Weapon Used",
    "EXPLOXIVE DEVICE": "Burning/Toxic Substance",
    "AUTOMATIC WEAPON/SUB-MACHINE GUN": "Gun/Firearm",
    "SAWED OFF RIFLE/SHOTGUN": "Gun/Firearm",
    "STARTER PISTOL/REVOLVER": "Gun/Firearm",
    "ROPE/LIGATURE": "Other/Unknown/No Weapon Used",
    "SEMI-AUTOMATIC RIFLE": "Gun/Firearm",
    "CLEAVER": "Knife/Blade/Sharp Object",
    "BOWIE KNIFE": "Knife/Blade/Sharp Object",
    "DOG/ANIMAL (SIC ANIMAL ON)": "Other/Unknown/No Weapon Used",
    "DEMAND NOTE": "Verbal Threat",
    "STRAIGHT RAZOR": "Knife/Blade/Sharp Object",
    "BLACKJACK": "Blunt/Hitting Object",
    "SYRINGE": "Knife/Blade/Sharp Object",
    "BOW AND ARROW": "Other/Unknown/No Weapon Used",
    "MARTIAL ARTS WEAPONS": "Blunt/Hitting Object",
    "UNK TYPE SEMIAUTOMATIC ASSAULT RIFLE": "Gun/Firearm",
    "UZI SEMIAUTOMATIC ASSAULT RIFLE": "Gun/Firearm",
    "RELIC FIREARM": "Gun/Firearm",
    "HECKLER & KOCH 91 SEMIAUTOMATIC ASSAULT RIFLE": "Gun/Firearm",
    "ANTIQUE FIREARM": "Gun/Firearm",
    "MAC-10 SEMIAUTOMATIC ASSAULT WEAPON": "Gun/Firearm",
    "MAC-11 SEMIAUTOMATIC ASSAULT WEAPON": "Gun/Firearm",
    "M1-1 SEMIAUTOMATIC ASSAULT RIFLE": "Gun/Firearm",
    "M-14 SEMIAUTOMATIC ASSAULT RIFLE": "Gun/Firearm"
}

#map to dataframe
df["weapon_group"] = df["weapon_type"].map(weapon_map).fillna("Other/Unknown/No Weapon Used")
df.drop("weapon_type", axis=1, inplace=True)
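
Because `.fillna("Other/Unknown/No Weapon Used")` also absorbs any weapon type that is missing from the dictionary, a coverage check before mapping can be useful. This is a sketch on toy data: `weapon_map_demo` is a shortened stand-in for `weapon_map`, and "LASER POINTER" is a hypothetical label that does not come from the dataset.

```python
import pandas as pd

# Shortened stand-in for the full weapon_map dictionary.
weapon_map_demo = {
    "HAND GUN": "Gun/Firearm",
    "CLUB/BAT": "Blunt/Hitting Object",
}

# Toy column; "LASER POINTER" is a hypothetical value with no dictionary entry.
weapon_type = pd.Series(["HAND GUN", "CLUB/BAT", "LASER POINTER", None])

# Labels present in the data but absent from the dictionary.
observed = set(weapon_type.dropna().unique())
unmapped = sorted(observed - set(weapon_map_demo))

print(unmapped)
```

An empty list would confirm that only genuinely missing values fall into the default category.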

#### Crime Type

There are 142 crime types in the dataset, which I felt was too many to deal with completely manually, but I noticed that many of them had repeating words (e.g. "THEFT"), so I wrote a function to group them by keyword. The checks run in order, so each crime type is assigned to the first group whose keywords it matches.

In [12]:
#define function to group crimes
def crime_grouping(crime):
    if pd.isna(crime):
        return "Other"
    crime = crime.upper()
    if any(word in crime for word in ["CHILD"]):
        return "Offense Against a Child"
    elif any(word in crime for word in ["HOMICIDE", "MANSLAUGHTER", "LYNCHING"]):
        return "Murder/Manslaughter"
    elif any(word in crime for word in ["ASSAULT", "BRANDISH", "SHOTS", "BATTERY", "BOMB"]):
        return "Assault/Violence"
    elif any(word in crime for word in ["THEFT", "BURGLARY", "ROBBERY", "STOLEN", "EXTORTION", "PICKPOCKET", "SNATCHING", "BUNCO", "FRAUD", "COUNTERFEIT"]):
        return "Theft-Related"
    elif any(word in crime for word in ["VANDALISM", "ARSON"]):
        return "Property Damage"
    elif any(word in crime for word in ["VIOLATION", "TRESPASSING", "DISTURBING", "CONTEMPT", "THROWING", "RESISTING", "STALKING", "PROWLER", "THREAT"]):
        return "Public Order/Threatening Behaviour"
    elif any(word in crime for word in ["LEWD", "SEX", "RAPE", "PENETRATION", "INDECENT", "COPULATION", "PEEPING", "PIMPING"]):
        return "Sexual Offence"
    elif any(word in crime for word in ["KIDNAPPING", "IMPRISONMENT", "TRAFFICKING"]):
        return "Kidnapping/Trafficking"
    else:
        return "Other"

#apply function to dataframe
df["crime_group"] = df["crime_type"].apply(crime_grouping).fillna("Other")
df.drop("crime_type", axis=1, inplace=True)
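
With keyword grouping of this kind, the order of the checks decides the outcome whenever a label contains keywords from more than one group. The sketch below shows this with a shortened, data-driven rule list. `RULES`, `first_match` and the sample labels are illustrative assumptions, not the notebook's full function.

```python
import pandas as pd

# Shortened rule list: earlier entries take priority over later ones.
RULES = [
    ("Offense Against a Child", ["CHILD"]),
    ("Assault/Violence", ["ASSAULT", "BATTERY"]),
    ("Sexual Offence", ["SEX", "RAPE"]),
]

def first_match(label, rules=RULES, default="Other"):
    """Return the first group whose keywords appear in the label."""
    if pd.isna(label):
        return default
    label = label.upper()
    for group, keywords in rules:
        if any(keyword in label for keyword in keywords):
            return group
    return default

# Illustrative labels that contain keywords from more than one group.
labels = pd.Series([
    "CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT",
    "BATTERY WITH SEXUAL CONTACT",
    "RAPE, ATTEMPTED",
    None,
])
groups = labels.apply(first_match)

print(dict(zip(labels.fillna("<missing>"), groups)))
```

Printing the label-to-group pairs for all unique crime types is a quick way to audit whether the rule order gives the intended result.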

#### Premises Type

There are 319 premises types, with very little possibility for grouping using the same methods as above, as the vast majority have unique names with few repeating words. As such, I decided to use semantic similarity to create meaningful categories for the model. I used a sentence transformer model (all-MiniLM-L6-v2) to assign each premises type to the most similar of 12 predefined categories. The model did an acceptable job, but I followed up with manual corrections to ensure that the final groupings were logical and appropriate. For these corrections, I used the same keyword approach as I did to categorise `crime_type`, which allowed me to retain human oversight whilst saving time.

In [None]:
#define preferred clusters
categories = ["Residence/Private Outdoor Space", "Street/Public Outdoor Space", "Transport Hub/Vehicle", "Restaurant/Eatery", "Store/Mall/Business", "Education", "Public Services/Healthcare", "Place of Worship", "Leisure/Entertainment/Sport", "Online", "Financial", "Other"]

model = SentenceTransformer("all-MiniLM-L6-v2")
unique_premises = df["premises_type"].dropna().unique()

premises_embeddings = model.encode(unique_premises)
category_embeddings = model.encode(categories)

type_clusters = {}
for i, premise in enumerate(unique_premises):
    similarities = np.dot(premises_embeddings[i], category_embeddings.T)
    best_category = categories[np.argmax(similarities)]
    type_clusters[premise] = best_category

df["premises_group"] = df["premises_type"].map(type_clusters)

#define function to regroup incorrect clusters
def premises_grouping(premises, current_group):
    if pd.isna(premises):
        return "Other"
    premises = premises.upper()
    if any(word in premises for word in ["BANK"]):
        return "Financial"
    elif any(word in premises for word in ["PUBLIC STORAGE", "DIY", "VALET", "OFFICE", "RADIO", "FACTORY", "MARKET", "OTHER BUSINESS", "CONNECTION", "SALES", "BMW", "CAR WASH", "GROVE", "EQUIPMENT", "COURIER"]):
        return "Store/Mall/Business"
    elif any(word in premises for word in ["HOME", "DRIVEWAY", "PATIO", "PORCH", "FOSTER", "GARAGE", "MOBILE", "BALCONY", "PROJECT"]):
        return "Residence/Private Outdoor Space"
    elif any(word in premises for word in ["FIRE", "SEWAGE", "CLINIC", "LIBRARY", "HOSPITAL", "MORTUARY", "HOSPICE", "ENERGY", "CARE", "WATER", "JAIL", "POLICE", "DENTAL", "RECYCLING"]):
        return "Public Services/Healthcare"
    elif any(word in premises for word in ["HARBOR", "LINE", "PARKING", "TRAM", "AIRCRAFT", "CHARTER", "MTA"]):
        return "Transport Hub/Vehicle"
    elif any(word in premises for word in ["RINK", "BASKETBALL", "ARCADE", "COCKTAIL", "MUSEUM", "STAPLES", "STADIUM", "BEVERLY", "VACATION", "HOTEL", "MOTEL", "BOWLING"]):
        return "Leisure/Entertainment/Sport"
    elif any(word in premises for word in ["ALLEY", "TRASH", "TUNNEL", "PAYPHONE", "FREEWAY", "GATHERING", "TRANSIENT", "BEACH", "RESERVOIR", "RIVER", "BRIDGE", "OTHER/OUTSIDE"]):
        return "Street/Public Outdoor Space"
    elif any(word in premises for word in ["COFFEE"]):
        return "Restaurant/Eatery"
    elif any(word in premises for word in ["SWAP", "ESCALATOR", "STAIR", "ELEVATOR", "ABATEMENT", "TACTICAL", "RETIRED", "SHED"]):
        return "Other"
    else:
        return current_group

#apply function to dataframe
df["premises_group"] = df.apply(lambda row: premises_grouping(row["premises_type"], row["premises_group"]), axis=1).fillna("Other")
df.drop("premises_type", axis=1, inplace=True)
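
The assignment step above picks the category with the largest dot product, which matches cosine similarity only when the embeddings have unit length. The sketch below normalises explicitly and flags weak matches for manual review. The 2-D vectors stand in for real embeddings, and the 0.8 threshold is an arbitrary assumption.

```python
import numpy as np

def nearest_category(item_vecs, cat_vecs):
    """Return the index and cosine similarity of the closest category per item."""
    item_unit = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    cat_unit = cat_vecs / np.linalg.norm(cat_vecs, axis=1, keepdims=True)
    sims = item_unit @ cat_unit.T
    return sims.argmax(axis=1), sims.max(axis=1)

# Made-up 2-D vectors standing in for sentence embeddings.
categories = ["Education", "Financial"]
cat_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
items = ["HIGH SCHOOL", "BANK", "ELEVATOR"]
item_vecs = np.array([[0.9, 0.1], [0.2, 2.0], [0.5, 0.5]])

best_idx, best_sim = nearest_category(item_vecs, cat_vecs)
assigned = {item: categories[i] for item, i in zip(items, best_idx)}

# Items whose best similarity is below the (arbitrary) threshold need a manual check.
REVIEW_THRESHOLD = 0.8
needs_review = [item for item, sim in zip(items, best_sim) if sim < REVIEW_THRESHOLD]

print(assigned)
print(needs_review)
```

Sorting the assignments by similarity in this way can help focus the manual corrections on the premises types the model was least sure about.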

### Feature Engineering

I have already removed several features that I do not deem to be important for my model, and I will drop the remaining features that I don't intend to use now that I have finished my EDA.

I also need to extract some features: temporal features from the datetime column, to explore the temporal factors around child victimisation, and an indicator from `vict_age` to identify whether the victim is a child. I will first extract the temporal features as categorical variables, which can then be explored and encoded. I will apply cyclical encoding to the hour of the day (`hour`) and the day of the week (`day`).

I will handle the 12pm concentration bias problem (an artificial spike in records timed at exactly 12pm, discussed in the exploratory notes below) by converting a sample of 12pm records to NaN and then using hot deck imputation.

In [None]:
#get temporal features
df["hour"] = df["datetime"].dt.hour
df["minute"] = df["datetime"].dt.minute
df["day"] = df["datetime"].dt.dayofweek

#drop irrelevant features
df = df.drop(columns=["date", "time_str", "lat", "lon", "time", "datetime"])

#indicate if victim is a child
df["is_child"] = df["vict_age"] < 18
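
Cyclical encoding maps a periodic value onto a circle using a sine and cosine pair, so that 23:00 and 00:00 end up close together. The sketch below is a minimal version on toy data. The helper name `add_cyclical` and the output column names are assumptions.

```python
import numpy as np
import pandas as pd

def add_cyclical(df, col, period):
    """Add sine and cosine columns that place `col` on a circle of length `period`."""
    angle = 2 * np.pi * df[col] / period
    out = df.copy()
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out

# Toy rows using the same column names as the notebook.
toy = pd.DataFrame({"hour": [0, 6, 12, 23], "day": [0, 3, 6, 6]})
toy = add_cyclical(toy, "hour", 24)
toy = add_cyclical(toy, "day", 7)

# Distance between encoded hours: 23 -> 0 should be far smaller than 0 -> 12.
features = toy[["hour_sin", "hour_cos"]].to_numpy()
gap_23_to_0 = np.linalg.norm(features[3] - features[0])
gap_0_to_12 = np.linalg.norm(features[2] - features[0])

print(round(gap_23_to_0, 3), round(gap_0_to_12, 3))
```

With a plain integer encoding, 23 and 0 would be treated as the two most distant hours, which is the problem this encoding avoids.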

In [None]:
#reimport df following hot deck imputation
#df = pd.read_csv("df_after_imputation.csv")

## Encoding Categorical Variables

## Exploratory Data Analysis

- 2,390 rows have 0.0 for both `lat` and `lon`. I will keep these rows, as `area` will be used for location. Visualisations (i.e. the heatmap) will still be useful, as this affects only about 0.1% of the data.

**Demographic Factors:**
- More girls were victimised than children of any other sex category. This is true across all age groups, but is more pronounced for teenage girls. There doesn't appear to be a major difference between the sexes in temporal terms. Twice as many girls as boys were victimised in a private residence.
- Hispanic/Latin/Mexican is the most victimised descent group, at 62%. This group represents 47% of the City of LA's population.
- Overall, white people are the second most victimised group, but among children, Black children are the second most victimised group.
- The number of children victimised at each age rises steadily by c. 1,000 per year of age until the age of 10, after which the increase accelerates to c. 2,000 per year of age (approximate figures).
- From the visualisation, it appears that the proportion of each descent group does not differ substantially across ages.

**Temporal Factors:**
- The month-of-year pattern is surprising, as there is a dip in July, which goes against other findings in criminological research. This could be attributed to the successful "Summer Night Lights" programme in LA. The graph shows that child victimisation in July has decreased since 2010. However, I will not use month as a feature, as I think a more granular approach will be more insightful.
- There are two peaks during the day that are likely to be attributable to the times when children are travelling to/from school (8am and 3pm): when they are in public, on transport, moving between home and school, and potentially unsupervised by adults. These peaks are not noticeable upon removing the `is_child` filter. There is also an enormous peak in the middle of the day, which is almost certainly due to officers logging a crime at midday when the actual time is not known, or due to legacy errors created as historical records were digitised or transferred to new systems (source: FBI).
- To deal with the 12pm peak, I considered taking the advice of the FBI book and removing all 12pm records. However, this would create a 12pm "safe spot" where no crimes appear to happen, which would be very unwise. Instead, I calculated the average of the 11am and 1pm counts to gauge how many crimes would reasonably be expected at 12pm. I retained this number of 12pm instances and converted the remainder to NaN. I was then able to conduct hot deck imputation on the "missing" data, which neutralised the part of the 12pm peak that was due to concentration bias.
- There is a peak in victimisation on Fridays, but it may not be significant.
- I will not use "day of month" as a feature, as there appears to be a huge influx of crimes logged on the first of the month: 200% more than on the next busiest day, the 15th, which again may be due to officers choosing the middle of the month if the exact day is unknown. In future, it may be useful to look at `date_rptd` as well as `date`, because that may be a more reliable measure of fluctuations in crime over time. I think it is acceptable not to consider day of month, as it is less likely to have an impact on victimisation than other temporal fluctuations (e.g. day of week and time of day).
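
The 12pm adjustment described in the notes above can be sketched as two steps: thin the 12pm records down to the expected count, then fill the blanked values from observed "donor" rows. This is a sketch under assumptions. The toy data, the helper names `thin_noon_peak` and `hot_deck_fill`, and the choice of `area` as the donor class are illustrative, and the notebook's actual imputation may differ.

```python
import numpy as np
import pandas as pd

def thin_noon_peak(hours, rng):
    """Keep the expected number of 12pm records and set the excess to NaN."""
    counts = hours.value_counts()
    # Expected 12pm count: the average of the 11am and 1pm counts.
    expected = int(round((counts.get(11, 0) + counts.get(13, 0)) / 2))
    noon_idx = hours.index[hours == 12].to_numpy()
    out = hours.astype("float64").copy()
    n_excess = max(len(noon_idx) - expected, 0)
    if n_excess > 0:
        to_blank = rng.choice(noon_idx, size=n_excess, replace=False)
        out.loc[to_blank] = np.nan
    return out

def hot_deck_fill(df, target, by, rng):
    """Random hot deck: fill each NaN with a value drawn from a donor in the same class."""
    out = df[target].copy()
    for _, idx in df.groupby(by).groups.items():
        group = out.loc[idx]
        donors = group.dropna().to_numpy()
        missing = group.index[group.isna()]
        if len(donors) > 0 and len(missing) > 0:
            out.loc[missing] = rng.choice(donors, size=len(missing), replace=True)
    return out

rng = np.random.default_rng(0)

# Toy data with an inflated 12pm count; all values are invented.
toy = pd.DataFrame({
    "area": np.tile(["Central", "Newton"], 40),
    "hour": [11] * 10 + [13] * 14 + [12] * 50 + [8] * 6,
})

toy["hour_thinned"] = thin_noon_peak(toy["hour"], rng)
toy["hour_filled"] = hot_deck_fill(toy, "hour_thinned", "area", rng)

print(toy["hour"].value_counts().to_dict())
print(toy["hour_filled"].value_counts().to_dict())
```

Because the retained 12pm rows remain in the donor pool, some imputed values can still be 12, but the excess is now spread across the observed hours instead of being concentrated at midday.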

**Spatial Factors:**
- There is almost nothing to note in terms of premises vs day of week or hour of day. Education premises see fewer incidents at the weekend and at night, as expected, and private residences see more at weekends and in the evening/night.
- The top 3 most dangerous areas are 77th Street, Southwest and Southeast. There doesn't appear to be a pattern across the week: no areas stand out as more dangerous on a particular day. Area vs descent does seem to differ, but this is likely because of the neighbourhoods where different demographics live (e.g. West LA has a disproportionate number of white child victims, but it is a relatively white area).


**Visualisations:**
- NB heatmap: area pins are centroids, taken as the average lat and lon for each area.

## Useful bits of code

In [None]:
# #look at unique values for different columns
# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_colwidth', None)

# df['hour_less_12'].value_counts().to_frame()

In [None]:
#df.to_csv("df_for_imputation.csv", index=False, encoding='utf-8')

In [None]:
#restrict df to children
#df = df[(df["vict_age"] < 18) & (df["vict_age"] > 0)]