# 📄 Crime Data Preprocessing Documentation

This notebook documents the full preprocessing pipeline for combining, cleaning, and engineering features from UK crime data.

### Key objectives:
- Combine monthly crime data across all regions
- Merge with deprivation and civic infrastructure metadata
- Handle missing values, duplicates, and data types
- Engineer new variables for analysis
- Prepare for time series modelling and EDA

---

## 🔁 High-Level Preprocessing Flowchart (Mermaid)

```mermaid
flowchart LR
    subgraph LOAD_AND_MERGE[Data Loading and Merging]
        A[Loop through month folders]
        B[Loop through CSV files in each folder]
        C[Extract Force and Month]
        D[Append to list]
        E[Concatenate to one DataFrame]
        F[Load deprivation and civic CSVs]
        G[Rename LSOA columns]
        H[Merge deprivation + civic data]
        I[Merge with crime data]
        A --> B --> C --> D --> E
        E --> F --> G --> H --> I
    end

    subgraph CLEANING[Initial Cleaning]
        J[Drop irrelevant columns]
        K[Loop: Convert object columns to numeric]
        L[Convert Month + categoricals]
        M[Rename columns]
        N[Drop missing LSOA]
        I --> J --> K --> L --> M --> N
    end

    subgraph IDS_AND_OUTCOMES[IDs and Outcome Handling]
        O[Loop: Generate UUIDs]
        P[Drop row duplicates]
        Q[New unique ID per row]
        R[Fill 'Last outcome' nulls]
        N --> O --> P --> Q --> R
    end

    subgraph IMPUTATION[Handling Missing Values]
        S[Loop: Impute by LSOA]
        T[Check LSOA completeness]
        U[Loop: Fallback by Force]
        V[Drop null-heavy cols]
        R --> S --> T --> U --> V
    end

    subgraph FEATURE_ENGINEERING[Feature Engineering]
        W[Create service access score]
        X[Create crime count per LSOA]
        Y[Sort by Force + Month]
        Z[Create lagged crime count]
        AA[Map crime types to categories]
        AB[Reverse rank/decile scales]
        V --> W --> X --> Y --> Z --> AA --> AB
    end

    AB --> AC[Save final CSV]

    classDef loop stroke:#000,stroke-width:2px,stroke-dasharray: 5 5;
    B:::loop
    K:::loop
    O:::loop
    S:::loop
    U:::loop
```

In [1]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\hanam\AppData\Roaming\Python\Python312\site-packages\ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "C:\Users\hanam\AppData\Roaming\Python\Python312\site-packages\traitlets\config\application.py", line 1075, in launch_instance
    app.start()
  File "C:\Users\hanam\AppData\Roaming\Python\Python312\site-packages\ipykernel\kernelapp.py", line 739, in start
    self.io_lo

ImportError: 
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.





AttributeError: _ARRAY_API not found





In [2]:
import pandas as pd
from pathlib import Path

# Use current working directory
base_path = Path(".")

street_dfs = []

for month_folder in sorted(base_path.iterdir()):
    if not month_folder.is_dir():
        continue

    month = month_folder.name 

    for file in month_folder.glob("*.csv"):
        filename = file.stem.lower()
        parts = filename.split("-")

        force = "-".join(parts[2:-1])  
        df = pd.read_csv(file)
        df['Month'] = month
        df['Force'] = force

        street_dfs.append(df)

# Combine everything
all_street_df = pd.concat(street_dfs, ignore_index=True)

# Check
print("✅ Combined shape:", all_street_df.shape)
all_street_df.head()

✅ Combined shape: (3763863, 13)


Unnamed: 0,Crime ID,Month,Reported by,Falls within,Longitude,Latitude,Location,LSOA code,LSOA name,Crime type,Last outcome category,Context,Force
0,50adc6e18bdf475cdae2ef77fec2e4dc29c97ec97cb0d0...,2023-02,City of London Police,City of London Police,-0.115997,51.527254,On or near Cubitt Street,E01000936,Camden 024A,Other crime,Status update unavailable,,city-of-london
1,26c9f8b125c7a15d6515bd4f95d679693d5a883b619f02...,2023-02,City of London Police,City of London Police,-0.11035,51.51809,On or near Holborn,E01000917,Camden 027C,Other theft,Investigation complete; no suspect identified,,city-of-london
2,5f6a5bf218eaad8e135fdb9d87c4dba04f4084944aab18...,2023-02,City of London Police,City of London Police,-0.111596,51.518281,On or near Chancery Lane,E01000914,Camden 028B,Other theft,Investigation complete; no suspect identified,,city-of-london
3,54d2a39fab2279b6d0e852844c16ec46c1f6948a386adc...,2023-02,City of London Police,City of London Police,-0.111596,51.518281,On or near Chancery Lane,E01000914,Camden 028B,Other theft,Status update unavailable,,city-of-london
4,00959f6007b7e60d2a377c39ae837a20803a9a6d52d3fc...,2023-02,City of London Police,City of London Police,-0.112096,51.515942,On or near,E01000914,Camden 028B,Other theft,Investigation complete; no suspect identified,,city-of-london
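
The loading loop takes the force name from each file stem. The sketch below isolates that step. It assumes the stems follow a `<year>-<month>-<force>-street` pattern, which is inferred from the slice `parts[2:-1]` and the `city-of-london` value in the output rather than stated anywhere in the notebook. The helper name `extract_force` is hypothetical.

```python
def extract_force(stem: str) -> str:
    """Return the force slug from a stem such as '2023-02-city-of-london-street'."""
    parts = stem.lower().split("-")
    # parts[0] and parts[1] are the year and month; parts[-1] is the dataset suffix
    return "-".join(parts[2:-1])


print(extract_force("2023-02-city-of-london-street"))  # city-of-london
```

If a file did not follow this pattern, the slice would return a wrong or empty force name without raising an error, so a quick look at `all_street_df['Force'].unique()` may be worthwhile.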


In [3]:
deprivation_df = pd.read_csv("deprivation LSOA.csv")
civic_df = pd.read_csv("Civic infrastructure LSOA.csv")  # loading additional data to merge with the stacked data
# this data includes service access, income, education, living environment, and homelessness data, to be related to crime

In [4]:
# Rename for consistent join key
deprivation_df.rename(columns={'LSOA code (2011)': 'LSOA code'}, inplace=True)
civic_df.rename(columns={'LSOA code (2011)': 'LSOA code'}, inplace=True)

# Merge deprivation and infrastructure into one LSOA metadata table
combined_lsoa_data = pd.merge(deprivation_df, civic_df, on='LSOA code', how='outer')

# Now merge that with the crime data using LSOA code
final_df = all_street_df.merge(combined_lsoa_data, on='LSOA code', how='left')
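
A left merge keeps every crime row, whether or not its LSOA code has metadata, so it can help to measure how many rows actually found a match. The sketch below uses invented toy tables, not the real files, and relies on the `validate` and `indicator` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the crime table and the LSOA metadata table
crimes = pd.DataFrame({
    "LSOA code": ["E01", "E01", "E02", None],
    "Crime type": ["Drugs", "Robbery", "Burglary", "Drugs"],
})
lsoa_meta = pd.DataFrame({
    "LSOA code": ["E01", "E03"],
    "Income Decile": [3, 8],
})

checked = crimes.merge(
    lsoa_meta,
    on="LSOA code",
    how="left",
    validate="many_to_one",  # raises MergeError if lsoa_meta repeats an LSOA code
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows that kept their crime data but received no metadata
unmatched = int((checked["_merge"] == "left_only").sum())
print(checked["_merge"].value_counts())
print("rows without metadata:", unmatched)
```

On the real data, a check like this would show up front how many crime rows have no deprivation or civic values, instead of discovering it at the missing-data step.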

In [5]:
final_df.to_csv("merged_raw_data.csv", index=False)
# The file is too large for me to upload to SharePoint, so the user needs to download it

In [6]:
df = pd.read_csv("merged_raw_data.csv") #loading final stacked/merged dataset

In [7]:
df.columns #inspecting data to do further cleaning

Index(['Crime ID', 'Month', 'Reported by', 'Falls within', 'Longitude',
       'Latitude', 'Location', 'LSOA code', 'LSOA name', 'Crime type',
       'Last outcome category', 'Context', 'Force', 'LSOA name_x',
       'Local Authority District code (2019)_x',
       'Local Authority District name (2019)_x',
       'Index of Multiple Deprivation (IMD) Rank (where 1 is most deprived)',
       'Index of Multiple Deprivation (IMD) Decile (where 1 is most deprived 10% of LSOAs)',
       'Income Rank (where 1 is most deprived)',
       'Income Decile (where 1 is most deprived 10% of LSOAs)',
       'Employment Rank (where 1 is most deprived)',
       'Employment Decile (where 1 is most deprived 10% of LSOAs)',
       'Education, Skills and Training Rank (where 1 is most deprived)',
       'Education, Skills and Training Decile (where 1 is most deprived 10% of LSOAs)',
       'Health Deprivation and Disability Rank (where 1 is most deprived)',
       'Health Deprivation and Disability Decile (

## **Dropping columns**

I dropped longitude and latitude because the raw coordinates are not useful for EDA.

The Context column was full of N/A values, so I removed it along with the repeated location-based columns. I also dropped housing affordability and similar columns, which don't have strong ties to crime; income, education, and living environment act as stronger predictors than housing affordability, so I opted for those instead.

The Local Authority District columns were not joined on and had no ties to the original crime/police data, hence I removed them. The Index of Multiple Deprivation lacked granularity compared to the other index/rank data.

In [8]:
# Dropping unwanted columns
columns_to_drop = [
    'Longitude', 'Latitude', 'Reported by', 'Context', 'Falls within',
    'Local Authority District code (2019)_x', 'Local Authority District name (2019)_x',
    'Owner-occupation affordability (component of housing affordability indicator)',
    'Private rental affordability (component of housing affordability indicator)',
    'Housing affordability indicator',
    'Local Authority District code (2019)_y', 'Local Authority District name (2019)_y',
    'Household overcrowding indicator',
    'Index of Multiple Deprivation (IMD) Rank (where 1 is most deprived)',
    'Index of Multiple Deprivation (IMD) Decile (where 1 is most deprived 10% of LSOAs)',
    'Health Deprivation and Disability Rank (where 1 is most deprived)',
    'Health Deprivation and Disability Decile (where 1 is most deprived 10% of LSOAs)'
]

df.drop(columns=columns_to_drop, inplace=True)

In [9]:
df.dtypes #ensuring data is in correct format

Crime ID                                                                            object
Month                                                                               object
Location                                                                            object
LSOA code                                                                           object
LSOA name                                                                           object
Crime type                                                                          object
Last outcome category                                                               object
Force                                                                               object
LSOA name_x                                                                         object
Income Rank (where 1 is most deprived)                                              object
Income Decile (where 1 is most deprived 10% of LSOAs)                              float64

In [10]:
#converting numerical columns from object to numeric
columns_to_convert = [
    'Income Rank (where 1 is most deprived)',
    'Employment Rank (where 1 is most deprived)',
    'Education, Skills and Training Rank (where 1 is most deprived)',
    'Barriers to Housing and Services Rank (where 1 is most deprived)',
    'Living Environment Rank (where 1 is most deprived)',
    'Crime Rank (where 1 is most deprived)'
]

#converting each object in the list to numeric type using a for loop 
for col in columns_to_convert:
    df[col] = pd.to_numeric(df[col], errors='coerce')

#converting categories into categorical data 
df['Crime type'] = df['Crime type'].astype('category')
df['Force'] = df['Force'].astype('category')

#converting month/year column into date 
df['Month'] = pd.to_datetime(df['Month'], format='%Y-%m')
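
`errors='coerce'` silently turns every string it cannot parse into `NaN`, so it may be worth looking at the raw strings before coercing. The null counts reported later in this notebook are much higher for the rank columns than for the decile columns, which suggests many rank strings failed to parse. The sketch below assumes, without confirmation from the source files, that the cause is thousands separators such as `"12,345"`. The sample values are invented for illustration.

```python
import pandas as pd

# Invented sample of rank strings; the real column contents are not shown in the notebook
raw_rank = pd.Series(["12,345", "987", None, "1,002"])

# Plain coercion: strings containing a comma cannot be parsed and become NaN
coerced = pd.to_numeric(raw_rank, errors="coerce")

# Removing the separator first keeps those values
cleaned = pd.to_numeric(raw_rank.str.replace(",", "", regex=False), errors="coerce")

print("NaN after plain coercion:", int(coerced.isna().sum()))
print("NaN after stripping commas:", int(cleaned.isna().sum()))
```

If this assumption holds for the real files, passing `thousands=","` to `pd.read_csv` when loading the deprivation data would be another way to get numeric ranks.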

In [11]:
#renaming decile and rank columns as shorter/ easier column names

rename_dict = {
    'Income Rank (where 1 is most deprived)': 'Income Deprivation Rank',
    'Income Decile (where 1 is most deprived 10% of LSOAs)': 'Income Deprivation Decile',
    'Employment Rank (where 1 is most deprived)': 'Employment Deprivation Rank',
    'Employment Decile (where 1 is most deprived 10% of LSOAs)': 'Employment Deprivation Decile',
    'Education, Skills and Training Rank (where 1 is most deprived)': 'Education Deprivation Rank',
    'Education, Skills and Training Decile (where 1 is most deprived 10% of LSOAs)': 'Education Deprivation Decile',
    'Barriers to Housing and Services Rank (where 1 is most deprived)': 'Housing Barrier Rank',
    'Barriers to Housing and Services Decile (where 1 is most deprived 10% of LSOAs)': 'Housing Barrier Decile',
    'Living Environment Rank (where 1 is most deprived)': 'Environment Deprivation Rank',
    'Living Environment Decile (where 1 is most deprived 10% of LSOAs)': 'Environment Deprivation Decile',
    'Road distance to a post office indicator (km)': 'Distance to Post Office (km)',
    'Road distance to a primary school indicator (km)': 'Distance to Primary School (km)',
    'Road distance to general store or supermarket indicator (km)': 'Distance to Supermarket (km)',
    'Road distance to a GP surgery indicator (km)': 'Distance to GP (km)',
    'Homelessness indicator (rate per 1000 households)': 'Homelessness Rate', 
    'Crime Rank (where 1 is most deprived)': 'Crime Deprivation Rank',
    'Crime Decile (where 1 is most deprived 10% of LSOAs)': 'Crime Deprivation Decile'
}

df.rename(columns=rename_dict, inplace=True)

In [12]:
df.columns #inspecting final columns

Index(['Crime ID', 'Month', 'Location', 'LSOA code', 'LSOA name', 'Crime type',
       'Last outcome category', 'Force', 'LSOA name_x',
       'Income Deprivation Rank', 'Income Deprivation Decile',
       'Employment Deprivation Rank', 'Employment Deprivation Decile',
       'Education Deprivation Rank', 'Education Deprivation Decile',
       'Crime Deprivation Rank', 'Crime Deprivation Decile',
       'Housing Barrier Rank', 'Housing Barrier Decile',
       'Environment Deprivation Rank', 'Environment Deprivation Decile',
       'LSOA name_y', 'Distance to Post Office (km)',
       'Distance to Primary School (km)', 'Distance to Supermarket (km)',
       'Distance to GP (km)', 'Homelessness Rate'],
      dtype='object')

### **Missing data identification**

Now that I've dropped redundant columns and cast the rest to more accurate data types, I can deal with missing data. First I will identify where values are missing, then remove or impute them. I will also remove duplicates.

In [13]:
#missing data identification
df.isnull().sum()

Crime ID                            584573
Month                                    0
Location                                 0
LSOA code                            32486
LSOA name                            32486
Crime type                               0
Last outcome category               584573
Force                                    0
LSOA name_x                         330222
Income Deprivation Rank            3632656
Income Deprivation Decile           330222
Employment Deprivation Rank        3675329
Employment Deprivation Decile       330222
Education Deprivation Rank         3649030
Education Deprivation Decile        330222
Crime Deprivation Rank             3512139
Crime Deprivation Decile            330222
Housing Barrier Rank               3479927
Housing Barrier Decile              330222
Environment Deprivation Rank       3491631
Environment Deprivation Decile      330222
LSOA name_y                         330222
Distance to Post Office (km)        330222
Distance to

In [14]:
df.shape #3763863 rows and 27 columns

(3763863, 27)

Columns that are fully observed: **Month, Location, Crime type, Force**.

Data that needs mean imputation: **decile and rank data**.

**Crime ID** needs randomly generated IDs for its missing values so that it can act as the dataset's primary key.

**'Last outcome category'** needs its nulls replaced with **'Unidentified outcome'**.

Nulls I will consider dropping: **LSOA code**. It acts as the geographic anchor for mean imputation, so rows without an LSOA code cannot be imputed by LSOA group. A common threshold for deleting missing data is < 5%, and LSOA code is null in 0.86% of the dataset.

The **original dataset** is already saved as a CSV within the folder, following best practice.
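
The 0.86% figure can be reproduced from the counts printed above. This is a small arithmetic check, using only the row total and the LSOA code null count from the earlier outputs.

```python
# Figures taken from the df.shape and df.isnull().sum() outputs above
total_rows = 3_763_863
missing_lsoa = 32_486

share_missing = missing_lsoa / total_rows * 100
rows_after_drop = total_rows - missing_lsoa

print(f"{share_missing:.2f}% of rows have no LSOA code")  # 0.86%
print("rows left after dropping them:", rows_after_drop)   # 3731377
```

On the DataFrame itself, `df['LSOA code'].isna().mean() * 100` would give the same percentage.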

In [15]:
df = df[df['LSOA code'].notna()].copy()
#overwriting df without the rows that have a missing LSOA code; .copy() makes later column assignments act on an independent DataFrame

In [16]:
df['LSOA code'].isna().sum() #confirming no n/a left

np.int64(0)

### **Mean imputation/ filling in missing values**

In [17]:
#there are a lot of missing crime IDs, which are important for looking at the frequency of crime per area,
#so we're going to generate unique IDs for the nulls in the crime ID column
import uuid

# Function to generate a UUID if Crime ID is missing
def generate_crime_id(row):
    if pd.isna(row['Crime ID']):
        return str(uuid.uuid4())
    else:
        return row['Crime ID']

# Apply row-wise and overwrite missing Crime IDs
df['Crime ID'] = df.apply(generate_crime_id, axis=1)

KeyboardInterrupt: 
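
The `KeyboardInterrupt` above shows that the row-wise `apply` was stopped before it finished. One alternative, offered as a sketch and not as the notebook's own method, is to generate UUIDs only for the rows that need them and assign them in a single step. The toy frame below stands in for the real data.

```python
import uuid

import pandas as pd

# Toy frame standing in for the crime data
toy = pd.DataFrame({"Crime ID": ["abc", None, "def", None]})

# Boolean mask of the rows that need an ID
missing = toy["Crime ID"].isna()

# Generate exactly one UUID per missing row and assign them all at once
toy.loc[missing, "Crime ID"] = [str(uuid.uuid4()) for _ in range(int(missing.sum()))]

print(toy["Crime ID"].isna().sum())
```

This avoids calling a Python function once per row for rows that already have an ID, which is likely to matter on a dataset of several million rows.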

In [None]:
df.shape #checking the row count, so it can be compared with the number of unique crime IDs

In [None]:
#validating that each crime ID is unique
df['Crime ID'].nunique() #should be 3731377, one per row

This suggests there are duplicates, which could skew our crime frequency analysis. To measure crime frequency accurately, every row needs to be a unique record so that each occurrence of crime is counted exactly once. I will now run further sanity checks to see whether there are duplicates and where they are.

In [None]:
df['Crime ID'].duplicated().sum()

There are 38495 duplicated values in the Crime ID column.

In [None]:
df[df['Crime ID'].duplicated()].head()

This is an error within the original data. As you can see, these duplicates can't simply be dropped; they need reassigning, because each row is a distinct record even though the ID is reused. Police forces sometimes bundle minor or nearby incidents under the same ID for reporting, which can misrepresent crime frequency.
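
The difference between a reused ID and a fully duplicated row can be seen on a toy example. The values below are invented; only the column names come from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Crime ID": ["a1", "a1", "b2", "b2", "c3"],
    "Location": [
        "On or near Holborn", "On or near Chancery Lane",
        "On or near Holborn", "On or near Holborn", "On or near Holborn",
    ],
})

id_repeats = int(toy["Crime ID"].duplicated().sum())  # same ID, whatever the other columns hold
full_row_repeats = int(toy.duplicated().sum())        # identical in every column

print("repeated IDs:", id_repeats)
print("fully identical rows:", full_row_repeats)

# IDs that still repeat once identical rows are removed: distinct records sharing an ID
reused = toy.drop_duplicates()
reused = reused[reused["Crime ID"].duplicated(keep=False)]
print(reused)
```

Here `b2` is a true duplicate that `drop_duplicates()` removes, while `a1` labels two different records and needs a new identifier instead.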

In [None]:
df.duplicated().sum() #counting entire-row duplicates within the whole dataframe

In [None]:
df = df.drop_duplicates() #removing entire row duplicates

In [None]:
df['Crime ID'].duplicated().sum() #this has slightly reduced the number of duplicates in crime ID, but further unique identification is needed

Now I will create a new column in which every row has a unique identifier, to be used in place of the Crime ID column.

In [None]:
df['Crime unique ID'] = [str(uuid.uuid4()) for _ in range(len(df))] 

Next we must address the **Last outcome category** column. I will simply assign a new label, 'Unidentified outcome', in place of the nulls.

In [None]:
df['Last outcome category'] = df['Last outcome category'].fillna('Unidentified outcome')

Now I will impute every missing decile/rank value with the mean of its LSOA code group. I used LSOA code for the grouping because that is the key the datasets were joined on.

In [None]:
columns_to_impute = [
    'Income Deprivation Decile', 'Employment Deprivation Decile', 'Education Deprivation Decile',
    'Housing Barrier Decile', 'Environment Deprivation Decile',
    'Income Deprivation Rank', 'Employment Deprivation Rank', 'Education Deprivation Rank',
    'Housing Barrier Rank', 'Environment Deprivation Rank', 'Crime Deprivation Rank', 'Crime Deprivation Decile'
]

for col in columns_to_impute:
    df[col] = df.groupby('LSOA code')[col].transform(lambda x: x.fillna(x.mean()))

In [None]:
df[columns_to_impute].isnull().sum() #mean imputation did not work. This signals an issue with data incompleteness --> not enough data to compute mean

In [None]:
for col in columns_to_impute:
    missing_lsoas = df[df[col].isna()]['LSOA code'].nunique()
    print(f"{col}: {missing_lsoas} LSOA codes have all missing values")
#this output shows data completeness for each rank/decile column by LSOA
#and is most likely the diagnostic for why the mean imputation is not working as intended

This highlights the severity of the data incompleteness. Many LSOA code groups have no data at all from which to compute a mean, because the original dataset lacked information on many LSOAs. Instead, I will redo the mean imputation with a broader grouping, namely Force. Though this reduces the granularity of our data, it is better than no data at all.
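
The two-stage logic described here can be sketched on a toy frame: fill from the LSOA mean where one exists, then fall back to the Force mean for whatever is left. The values are invented, and `transform("mean")` is used in place of the lambda form in this notebook's cells; both compute the group mean.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Force": ["kent", "kent", "kent", "kent", "essex", "essex"],
    "LSOA code": ["E01", "E01", "E02", "E02", "E03", "E03"],
    "Income Deprivation Decile": [4.0, np.nan, np.nan, np.nan, 7.0, 9.0],
})
col = "Income Deprivation Decile"

# Stage 1: fill from the mean of the same LSOA (E02 has no values, so it stays NaN)
lsoa_mean = toy.groupby("LSOA code")[col].transform("mean")
toy[col] = toy[col].fillna(lsoa_mean)

# Stage 2: fall back to the mean of the same Force for whatever is still missing
force_mean = toy.groupby("Force")[col].transform("mean")
toy[col] = toy[col].fillna(force_mean)

print(toy)
```

Rows in `E02` end up with the Kent-wide mean, which illustrates the loss of granularity mentioned above: every unmatched LSOA within a force receives the same value.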

In [None]:
# Impute missing values first by Force group mean
for col in ['Income Deprivation Decile', 'Employment Deprivation Decile', 'Education Deprivation Decile', 'Housing Barrier Decile', 'Environment Deprivation Decile',
            'Income Deprivation Rank', 'Employment Deprivation Rank', 'Education Deprivation Rank', 'Housing Barrier Rank', 'Environment Deprivation Rank',
            'Distance to Post Office (km)', 'Distance to Primary School (km)',
            'Distance to Supermarket (km)', 'Distance to GP (km)', 'Homelessness Rate', 'Crime Deprivation Rank', 'Crime Deprivation Decile']:

    # Fallback: Fill remaining missing values using Force-level mean
    df[col] = df.groupby('Force')[col].transform(lambda x: x.fillna(x.mean()))

In [None]:
df.isnull().sum()

This improved the mean imputation significantly. The columns that still have missing values will be dropped, as they are too incomplete to be useful for EDA.

In [None]:
columns_to_drop_final = [
    'LSOA name_x', 'LSOA name_y',
    'Income Deprivation Rank', 'Employment Deprivation Rank', 'Education Deprivation Rank',
    'Housing Barrier Rank', 'Environment Deprivation Rank'
]
df.drop(columns=columns_to_drop_final, inplace=True)

In [None]:
df.isnull().sum()

The data is now fully complete, so we can move on to feature engineering.

### **Feature engineering**

Here I am creating a variable called **service access** that represents access to essential services such as schools, supermarkets, and GPs. It can inform real estate decisions based on 'service deserts' and their relation to crime. I also want to reverse the rank/decile scales, because lower values currently represent more deprivation (worse outcomes), which can be misleading and hinders interpretability. Lastly, I want to make a monthly time-lagged crime_count variable. Time-lagged variables are pivotal for machine learning models that analyse incremental temporal changes in crime, and can also be used to predict future increases or decreases in crime levels.

In [None]:
df['service_access'] = df[
    ['Distance to Post Office (km)', 'Distance to Primary School (km)', 
     'Distance to Supermarket (km)', 'Distance to GP (km)']
].mean(axis=1)

#created my service access variable by averaging distance columns to services

In [None]:
df['crime_count'] = df['LSOA code'].map(df['LSOA code'].value_counts())
#this counts the recorded crimes in each LSOA across the whole dataset; every row is one crime, so the row count per LSOA code is the crime count

In [None]:
#creating my time-lagged variable, to predict increases in crime frequency using ML time series models
df.sort_values(by=['Force', 'Month'], inplace=True)
# sorting data by force and date, necessary for machine learning time series models

df['lagged_crime_count'] = df.groupby('Force')['crime_count'].shift(1)
# shift(1) takes the value from the previous row within each Force. Each row here is a single crime,
# so this is the previous record's crime_count, not a count from the previous month.
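
For a lag that steps back one month and not one row, the data would first need one row per Force and Month. The sketch below shows that aggregation on invented toy data. The column name `monthly_crime_count` is hypothetical, and the approach assumes every force has a row for every month; a gap in the months would make `shift(1)` skip over it.

```python
import pandas as pd

toy = pd.DataFrame({
    "Force": ["kent"] * 5 + ["essex"] * 3,
    "Month": pd.to_datetime(
        ["2023-01", "2023-01", "2023-02", "2023-03", "2023-03",
         "2023-01", "2023-02", "2023-02"],
        format="%Y-%m",
    ),
})

# One row per Force and Month, holding the number of crimes recorded that month
monthly = (
    toy.groupby(["Force", "Month"])
       .size()
       .reset_index(name="monthly_crime_count")
       .sort_values(["Force", "Month"])
)

# Previous month's count within the same Force; the first month has no prior value
monthly["lagged_crime_count"] = monthly.groupby("Force")["monthly_crime_count"].shift(1)

print(monthly)
```

The resulting table could be merged back onto the crime rows by `Force` and `Month` if a per-row feature is still wanted.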

In [None]:
#creating crime group column to do categorical analysis on in EDA. Creating broader groups. 
def map_crime_type(x):
    if x in ['Violence and sexual offences', 'Robbery', 'Possession of weapons']:
        return 'Violent'
    elif x in ['Burglary', 'Vehicle crime', 'Other theft', 'Shoplifting', 'Theft from the person']:
        return 'Property'
    elif x in ['Anti-social behaviour', 'Public order']:
        return 'Anti-social'
    elif x == 'Drugs':
        return 'Drugs'
    else:
        return 'Other'

df['crime_category'] = df['Crime type'].apply(map_crime_type) #using apply method to apply custom function
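
The same grouping can also be written as a dictionary lookup, which keeps the mapping in one place and avoids calling a Python function per row. This is an alternative sketch, not the notebook's own approach; the name `CRIME_GROUPS` is hypothetical, and the groups are copied from the function above.

```python
import pandas as pd

CRIME_GROUPS = {
    "Violence and sexual offences": "Violent",
    "Robbery": "Violent",
    "Possession of weapons": "Violent",
    "Burglary": "Property",
    "Vehicle crime": "Property",
    "Other theft": "Property",
    "Shoplifting": "Property",
    "Theft from the person": "Property",
    "Anti-social behaviour": "Anti-social",
    "Public order": "Anti-social",
    "Drugs": "Drugs",
}

# Toy sample of crime types
crime_type = pd.Series(["Robbery", "Drugs", "Bicycle theft", "Shoplifting"])

# Look each type up in the dictionary; anything not listed becomes 'Other'
crime_category = crime_type.astype(object).map(CRIME_GROUPS).fillna("Other")

print(crime_category.tolist())  # ['Violent', 'Drugs', 'Other', 'Property']
```

The `astype(object)` call is there because `Crime type` was converted to a categorical column earlier, and filling a categorical result with a label outside its categories would raise an error.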

In [None]:
#reversing index/decile data to make it more readable (higher values indicate worse outcome)
# Reverse RANKS
df['Crime Deprivation Rank'] = (df['Crime Deprivation Rank'].max() + 1) - df['Crime Deprivation Rank']

# Reverse DECILES
df['Income Deprivation Decile'] = 11 - df['Income Deprivation Decile']
df['Employment Deprivation Decile'] = 11 - df['Employment Deprivation Decile']
df['Education Deprivation Decile'] = 11 - df['Education Deprivation Decile']
df['Crime Deprivation Decile'] = 11 - df['Crime Deprivation Decile']
df['Environment Deprivation Decile'] = 11 - df['Environment Deprivation Decile']
df['Housing Barrier Decile'] = 11 - df['Housing Barrier Decile']

In [None]:
df.to_csv("final_data.csv", index=False)
#saving final cleaned dataset to the directory