# Dealing with Over-Representation of 12pm Data

## Setup

In [1]:
#import libraries
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

In [2]:
df = pd.read_csv("../data/df_after_grouping.csv", dtype={"time_str": str}, parse_dates=["date", "datetime"])

## Feature Engineering

I have already removed several features that I do not consider important for my model. Now that the EDA is finished, I will drop the remaining ones that I do not intend to use.

I also need to derive some features: `hour`, `day` (day of the week) and `month` from the `datetime` column, to explore the temporal factors around child victimisation, and `is_child` from `vict_age`, to identify whether the victim is a child. I will extract the temporal features as categorical variables, which can then be explored and encoded; I will apply cyclical encoding to `hour`, `day` and `month`.

There is a notable spike at 12pm, suggesting that officers may use 12pm as a placeholder when they are unsure of the exact time a crime took place. This is supported by Kaplan (2025), who states that the same phenomenon sometimes applies to the first of the month, which I also noted in this dataset. I will handle the 12pm concentration bias by setting the hour of a random sample of 12pm records to NaN and then filling those values using hot deck imputation. The size of the sample is the excess of the 12pm count over the average of the 11am and 1pm counts. However, I will drop `day_of_month` rather than imputing it or looking for another approach, as I do not think it is likely to be as useful a feature.
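As a rough illustration of how the size of the 12pm excess can be estimated, the sketch below builds a small synthetic dataset (not the real data; the `demo` frame and its counts are made up) and compares the noon count with the mean of its two neighbouring hours.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: 100 records in every hour,
# plus 150 extra "placeholder" records stamped at 12.
hours = np.concatenate([np.repeat(np.arange(24), 100), np.full(150, 12)])
demo = pd.DataFrame({"hour": hours})

# Count records per hour, ordered by hour
counts = demo["hour"].value_counts().sort_index()

# Assume the true noon count is close to the mean of the 11am and 1pm counts
expected_noon = (counts.loc[11] + counts.loc[13]) / 2
excess_noon = counts.loc[12] - expected_noon

print(counts.loc[11:13])
print(f"expected at noon: {expected_noon:.0f}, excess: {excess_noon:.0f}")
```

Using only the two adjacent hours is a simple assumption; a wider window or a smoothed curve would be an alternative if the neighbouring hours are themselves noisy.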

In [3]:
#get temporal features
df["hour"] = df["datetime"].dt.hour
df["day"] = df["datetime"].dt.dayofweek
df["month"] = df["datetime"].dt.month

#indicate if victim is a child
df["is_child"] = df["vict_age"] < 18

## Hot Deck Imputation

In [4]:
#estimate the likely true number of crimes at noon as the mean of the 11am and 1pm counts
noon_avg = int(((df["hour"] == 11).sum() + (df["hour"] == 13).sum()) / 2)

#get number of excess noon records to blank out (never negative)
all_noons = df[df["hour"] == 12]
drop_len = max(len(all_noons) - noon_avg, 0)

#pick random noons; random_state makes the sample reproducible
drop_noons = all_noons.sample(n=drop_len, random_state=42).index

#replace selected noons with nans (cast to float so the column can hold NaN)
df["hour_less_12"] = df["hour"].astype(float)
df.loc[drop_noons, "hour_less_12"] = np.nan

In [5]:
#instantiate encoders for selected categorical features
area_le = LabelEncoder()
crime_le = LabelEncoder()
day_le = LabelEncoder()

#encode categorical features
df["crime_encoded"] = crime_le.fit_transform(df["crime_group"])
df["day_encoded"] = day_le.fit_transform(df["day"])
df["area_encoded"] = area_le.fit_transform(df["area"])

#get feature matrix (note: the imputer treats the label codes as numeric when measuring distance)
imputer_features = df[["crime_encoded", "day_encoded", "area_encoded"]].values

#impute missing hours: each NaN becomes the mean hour of its 3 nearest neighbours
imputer = KNNImputer(n_neighbors=3)
imputed = imputer.fit_transform(np.column_stack([imputer_features, df["hour_less_12"].values]))
df["hour_imputed"] = np.round(imputed[:, -1]).astype(int)
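The cell above uses `KNNImputer`, which fills each gap with the mean of its neighbours' values, whereas a classical hot deck copies a single observed "donor" value. For comparison, here is a minimal sketch of a random within-class hot deck. The helper `hot_deck_impute` and the toy `demo` frame are hypothetical and not part of this notebook, and the sketch assumes a unique index.

```python
import numpy as np
import pandas as pd

def hot_deck_impute(frame, target, by, seed=42):
    """Fill missing `target` values with a randomly drawn observed value
    from the same `by` group (random within-class hot deck)."""
    rng = np.random.default_rng(seed)
    filled = frame[target].copy()
    for _, group in frame.groupby(by)[target]:
        donors = group.dropna().to_numpy()
        missing = group.index[group.isna()]
        # Groups with no observed donor stay NaN and need a fallback step
        if len(donors) > 0 and len(missing) > 0:
            filled.loc[missing] = rng.choice(donors, size=len(missing))
    return filled

# Toy data for illustration only
demo = pd.DataFrame({
    "crime_group": ["a", "a", "a", "b", "b", "b"],
    "area": ["x", "x", "x", "y", "y", "y"],
    "hour": [9.0, np.nan, 17.0, 22.0, 23.0, np.nan],
})
demo["hour_imputed"] = hot_deck_impute(demo, "hour", ["crime_group", "area"])
print(demo)
```

Because every imputed value is copied from a real record, this approach never produces an hour that was not observed in the class, at the cost of needing at least one donor per class.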

In [6]:
#drop and rename columns
df = (
    df.drop(columns=["hour", "hour_less_12", "crime_encoded", "day_encoded", "area_encoded"])
    .rename(columns={"hour_imputed": "hour"})
)

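In hindsight, it may have been better to cyclically encode `hour` before imputing. The imputer averages its neighbours' hours on a linear scale, so it ignores the cyclical nature of `hour`: neighbours at 23 and 1, for example, average to 12 rather than 0.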
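To make the concern concrete, the sketch below contrasts a linear mean of two hours with a mean taken on sine/cosine-encoded hours. The helpers `encode_hour` and `decode_hour` are hypothetical names introduced for illustration; this is one possible way to encode the hour, not the approach used in this notebook.

```python
import numpy as np

def encode_hour(hour):
    """Map hour (0-23) to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.sin(angle), np.cos(angle)

def decode_hour(sin_val, cos_val):
    """Map a (sin, cos) pair back to an integer hour in 0-23."""
    angle = np.arctan2(sin_val, cos_val)
    return int(np.round(angle * 24 / (2 * np.pi))) % 24

# Two neighbours either side of midnight
neighbour_hours = [23, 1]

# Linear mean lands on noon, the opposite side of the clock
linear_mean = int(np.round(np.mean(neighbour_hours)))

# Mean of the encoded points, decoded back to an hour, lands on midnight
sin_vals, cos_vals = encode_hour(neighbour_hours)
circular_mean = decode_hour(sin_vals.mean(), cos_vals.mean())

print(linear_mean, circular_mean)  # 12 0
```

Imputing the sine and cosine columns and decoding afterwards would keep 23:00 and 00:00 adjacent, which matters most for crimes clustered around midnight.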

In [7]:
#final check for nulls
df.isna().sum()

date              0
time_str          0
area              0
vict_age          0
vict_sex          0
vict_descent      0
lat               0
lon               0
time              0
datetime          0
weapon_group      0
crime_group       0
premises_group    0
day               0
month             0
is_child          0
hour              0
dtype: int64

## Dataframe Export

In [8]:
# df.to_csv("../data/df_after_imputation.csv", index=False)