# Identifying High-risk Areas for Theft in London using Machine Learning

***
#### Preparation

- See the project on GitHub: [GitHub link](https://github.com/meimao76/006assessment)

- Number of words: ***

- Runtime: *** hours (*Memory 10 GB, CPU Intel i7-10700 CPU @2.90GHz*)

- Coding environment: VS Code + Python 3.12 (Windows 11)

- License: this notebook is made available under the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/).

- Additional library *[libraries not included in SDS Docker or not used in this module]*:
    - **watermark**: A Jupyter Notebook extension for printing timestamps, version numbers, and hardware information.
    - ......

***

## Table of contents

1. [Introduction](#Introduction)

1. [Literature review](#literature-review)

1. [Research questions](#Research-questions)

1. [Methodology](#Methodology)

1. [Data](#Data)

1. [Results](#Results)

1. [Discussion and conclusion](#discussion-and-conclusion)

1. [References](#References)

***

## Introduction

Urban safety remains a fundamental concern for cities worldwide, with crime posing persistent challenges to social stability and quality of life. According to Office for National Statistics (ONS) crime data covering the 12 months to September 2024, London, the capital of the UK, has one of the lowest rates of violent crime and of violent crime with injury, but also the highest overall crime rate per 1,000 population. A main contributor to this is the high volume of theft offences (Hill, 2025).

To better protect citizens' property, it is therefore essential to identify and predict when and where theft is most likely to occur across different parts of London.

This research focuses on identifying a machine learning model that predicts theft risk across London more accurately than a traditional statistical baseline. Using open data provided by the Metropolitan Police Service (MPS) and the ONS, it also seeks to uncover the key socioeconomic factors that influence the distribution of high-risk areas.

[[ go back to the top ]](#Table-of-contents)

***

## Literature review

Crime has long been a key research topic, and many studies have shown that it is not randomly distributed but shaped by underlying socio-economic conditions and spatial processes. Classical theories such as Routine Activity Theory (Cohen and Felson, 1979) have long guided statistical modelling of crime, often through regression models linking crime rates to factors such as population, housing, education and deprivation (Glasson and Cozens, 2011; Hojman, 2004).

Machine learning (ML) offers a more effective way of predicting crime. Models such as XGBoost and random forest can capture non-linear and high-dimensional relationships, usually outperforming traditional models in predictive power (Yin, 2022; Yunus and Loo, 2024). Despite their accuracy, these "black-box" models have been criticized for lacking interpretability, which makes them difficult to apply in policy-making and urban governance (Mandalapu et al., 2023).

To address this, Zhang et al. (2022) combined XGBoost with SHAP (SHapley Additive exPlanations) to predict crime rates, showing that the proportion of non-local residents and age group contribute the most to the predictions.

[[ go back to the top ]](#Table-of-contents)

***

## Research questions

Building on previous studies, this study focuses on theft in London and investigates whether interpretable machine learning methods can help to identify and predict crime hotspots and explain the key drivers of the crime rate.

To achieve this goal, the study is divided into three research questions:

1. Compared to traditional statistical methods, do machine learning models significantly improve the accuracy of predicting the risk of theft across different areas of London?

2. Which factors are most strongly associated with theft risk?

3. How can the contributions of these factors be interpreted in different areas of London using a machine learning model?

[[ go back to the top ]](#Table-of-contents)

***

## Methodology

To compare the effectiveness of traditional statistical and machine learning models in predicting theft risk in London, this research uses Ordinary Least Squares (OLS) regression and eXtreme Gradient Boosting (XGBoost) as the representative of each approach.

### Crime Rate Prediction

Traditional statistical methods remain popular for crime prediction and analysis (Yue and Chen, 2025). With a straightforward mathematical foundation and the ability to assign interpretable weights to different variables, OLS is considered one of the simplest models for parameter estimation in data-driven crime analysis (Yin, 2022).

In recent years, machine learning models have been widely used in crime analysis. Among the various algorithms, XGBoost was selected for its strong performance on tabular data and its capacity to handle nonlinear relationships. Compared with Random Forest, XGBoost offers more efficient training through gradient boosting. Other ML methods, such as neural networks or K-nearest neighbors, were not chosen because of their relatively high data requirements and low interpretability.
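The contrast between the two model families can be illustrated on synthetic data. The sketch below is not this study's pipeline: it fits OLS with NumPy and a deliberately small gradient-boosting regressor built from depth-1 trees, used here as a simplified stand-in for XGBoost (which adds regularisation, second-order gradients and many other refinements). The data-generating process, function names and hyperparameters are all assumptions made for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def design(x):
    """Prepend an intercept column for OLS."""
    return np.column_stack([np.ones(len(x)), x])


def fit_stump(x, residual):
    """Return the depth-1 tree (feature, threshold, left mean, right mean)
    that minimises squared error on the residuals."""
    best_sse, best_stump = np.inf, None
    for j in range(x.shape[1]):
        # Candidate thresholds: a coarse grid of quantiles of feature j.
        for t in np.quantile(x[:, j], np.linspace(0.05, 0.95, 19)):
            left = x[:, j] <= t
            left_mean, right_mean = residual[left].mean(), residual[~left].mean()
            sse = ((residual[left] - left_mean) ** 2).sum()
            sse += ((residual[~left] - right_mean) ** 2).sum()
            if sse < best_sse:
                best_sse, best_stump = sse, (j, t, left_mean, right_mean)
    return best_stump


def predict_stump(stump, x):
    j, t, left_mean, right_mean = stump
    return np.where(x[:, j] <= t, left_mean, right_mean)


def fit_boosting(x, y, n_rounds=200, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = float(y.mean())
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps


def predict_boosting(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Synthetic target (assumption): a quadratic effect plus a step effect plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(400, 2))
y = x[:, 0] ** 2 + 2.0 * (x[:, 1] > 0) + rng.normal(0, 0.2, size=400)
x_train, x_test, y_train, y_test = x[:300], x[300:], y[:300], y[300:]

# OLS baseline via least squares.
beta, *_ = np.linalg.lstsq(design(x_train), y_train, rcond=None)
ols_rmse = rmse(y_test, design(x_test) @ beta)

# Boosted stumps.
boosted = fit_boosting(x_train, y_train)
boost_rmse = rmse(y_test, predict_boosting(boosted, x_test))

print(f"OLS test RMSE:      {ols_rmse:.3f}")
print(f"Boosting test RMSE: {boost_rmse:.3f}")
```

Because the synthetic target is nonlinear, the linear model cannot represent the quadratic term, whereas the boosted stumps approximate it piecewise. On real data the gap depends on how nonlinear the underlying relationships are.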

### Feature Interpretation

In the OLS model, feature interpretation is reflected through regression coefficients. Each coefficient represents the marginal effect of a one-unit change in the corresponding variable on the predicted theft rate. Significance tests (p-values) were used to assess the statistical reliability of each coefficient.
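The coefficient and significance logic described above can be sketched with NumPy alone. The example uses synthetic data, and the variable names (`deprivation`, `density`, `unrelated`) are hypothetical rather than the study's actual features. The p-values use a normal approximation to the t-distribution, which is reasonable only for large samples; a statistics library such as statsmodels reports exact values.

```python
import math

import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hypothetical standardised predictors (assumptions for illustration).
deprivation = rng.normal(0, 1, n)
density = rng.normal(0, 1, n)
unrelated = rng.normal(0, 1, n)  # has no true effect on the outcome

# Synthetic theft rate with known coefficients 2.0 and 0.5.
theft_rate = 5.0 + 2.0 * deprivation + 0.5 * density + rng.normal(0, 1.0, n)

names = ["intercept", "deprivation", "density", "unrelated"]
X = np.column_stack([np.ones(n), deprivation, density, unrelated])

# OLS estimate: the beta minimising the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, theft_rate, rcond=None)

# Standard errors from the residual variance and (X'X)^-1.
residuals = theft_rate - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistics and two-sided p-values (normal approximation).
t_stats = beta / std_err
p_values = [math.erfc(abs(t) / math.sqrt(2)) for t in t_stats]

for name, b, se, t, p in zip(names, beta, std_err, t_stats, p_values):
    print(f"{name:12s} coef={b:7.3f}  se={se:.3f}  t={t:7.2f}  p={p:.4f}")
```

Each coefficient reads as the expected change in the outcome for a one-unit increase in that predictor, holding the others fixed.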

To interpret feature importance in the XGBoost model, this study employs SHAP (SHapley Additive exPlanations). SHAP values support both global analysis, highlighting the most influential features across all observations, and local interpretation, explaining why specific areas have higher or lower predicted theft rates.
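To make the global and local readings concrete, the sketch below computes exact Shapley values by brute-force enumeration for a small hypothetical model. It does not use the `shap` library or XGBoost: the `model` function, the feature names and the background data are assumptions chosen so that the example stays self-contained. "Missing" features are averaged over a background sample, which is one common convention.

```python
import itertools
import math

import numpy as np

feature_names = ["deprivation", "density", "employment"]  # hypothetical


def model(x):
    """Stand-in for a fitted model: nonlinear, with one interaction."""
    return 3.0 * x[:, 0] + x[:, 1] ** 2 + x[:, 0] * x[:, 2]


def value(subset, instance, background):
    """Mean prediction when features in `subset` take the instance's values
    and the remaining features are drawn from the background sample."""
    data = background.copy()
    cols = list(subset)
    if cols:
        data[:, cols] = instance[cols]
    return model(data).mean()


def shapley_values(instance, background):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over all subsets of the other features."""
    n = len(instance)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = (
                math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            )
            for subset in itertools.combinations(others, size):
                gain = value(subset + (i,), instance, background)
                gain -= value(subset, instance, background)
                phi[i] += weight * gain
    return phi


rng = np.random.default_rng(1)
background = rng.normal(size=(200, 3))
x_explain = background[:20]

# Local interpretation: one row of contributions per explained observation.
phi = np.array([shapley_values(row, background) for row in x_explain])
print("Contributions for the first observation:")
for name, contribution in zip(feature_names, phi[0]):
    print(f"  {name:12s} {contribution:+.3f}")

# Global importance: mean absolute contribution per feature.
global_importance = np.abs(phi).mean(axis=0)
for name, importance in zip(feature_names, global_importance):
    print(f"{name:12s} mean |SHAP| = {importance:.3f}")
```

For each observation the contributions sum to the difference between its prediction and the average prediction over the background sample, which is what makes the local explanations additive. Exact enumeration scales exponentially with the number of features, so practical tools rely on model-specific or sampling-based approximations.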

![image.png](./flowmapdraft.drawio2.png)

[[ go back to the top ]](#Table-of-contents)


## Data

[[ go back to the top ]](#Table-of-contents)

| Variable                           | Type        | Description                                                      | Notes  |
|------------------------------------|-------------|------------------------------------------------------------------|--------|
| Burglary crime rate                | Numeric     | The burglary rate of MSOAs. Used as the dependent variable.      | &nbsp; |
| Temperature                        | Numeric     | The daytime temperature                                          | &nbsp; |
| Indicator of Inner or Outer London | Categorical | Whether the MSOA is in Inner London                              | &nbsp; |

## Results

[[ go back to the top ]](#Table-of-contents)

## Discussion and conclusion

[[ go back to the top ]](#Table-of-contents)

## References

[[ go back to the top ]](#Table-of-contents)