<h1>Additional Data Collection through Police API</h1>
In this notebook, we perform an initial analysis of the dataset downloaded from the London Borough of Camden, available <a href='https://opendata.camden.gov.uk/Crime-and-Criminal-Justice/On-Street-Crime-In-Camden/qeje-7ve7'>here</a>. We first identify the neighbourhood IDs [ward codes and names] of the Camden neighbourhoods so that we can retrieve the <strong>street crimes</strong> data for those areas directly from the Police API.<br/>
Subsequently, we check whether the data in the stored Camden dataset [CSV] matches the data retrieved from the API; one such sample check can be found in this notebook <a href='verify_police_api_and_camden_stored_dataset.ipynb'>here</a>. Some reflections can be found in the Conclusions section.<br/>
We also retrieve additional data that is available through the Police API but does not exist in the Camden dataset. For example, the Camden dataset only contains data up to August 2019, whereas the Police API data for neighbourhoods is updated monthly. At the time of writing, crime data can be retrieved up to December 2020. Therefore, we retrieve additional data from the API for September 2019 to December 2020.
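
As a small illustration of the date range described above, the months from September 2019 to December 2020 can be generated rather than typed by hand. This is only a sketch: it assumes the API accepts dates as 'YYYY-MM' strings, as in the retrieval cell of this notebook, and uses pandas' `period_range`.

```python
import pandas as pd

# Months missing from the Camden CSV: September 2019 through December 2020 (inclusive).
months = pd.period_range(start="2019-09", end="2020-12", freq="M")

# Monthly periods render as 'YYYY-MM' strings, the format used for the date parameter.
month_year = [str(m) for m in months]

print(len(month_year))                # 16
print(month_year[0], month_year[-1])  # 2019-09 2020-12
```

Generating the list this way avoids skipped or duplicated months when the range is extended later.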

The data retrieval below uses the Police API Client implementation in Python, documented <a href="https://police-api-client-python.readthedocs.io/en/latest/">here</a>. Beyond working through the Police API carefully, considerable data manipulation was required to make the shape of this additional dataset consistent with the available Camden dataset.
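
One part of keeping the two datasets consistent is giving the retrieved records the same columns, in the same order, as the Camden CSV. The sketch below shows one possible way to do this with `DataFrame.reindex`. The column subset and the `extra_field` column are hypothetical and used for illustration only; the actual conversion in this notebook is done by the helper in helper_functions.py, whose internals are not shown here.

```python
import pandas as pd

# Hypothetical subset of the Camden column names, for illustration only.
camden_columns = ["Category", "Street ID", "Street Name",
                  "Outcome Category", "Ward Code", "Ward Name"]

# Hypothetical records as they might look after flattening API results.
api_records = pd.DataFrame({
    "Street Name": ["On or near Wellesley Place"],
    "Category": ["Anti-social behaviour"],
    "Street ID": [960522],
    "extra_field": ["not in the Camden CSV"],
})

# reindex orders the columns like the Camden CSV, drops extra columns,
# and fills columns absent from the API records with NaN.
aligned = api_records.reindex(columns=camden_columns)
print(list(aligned.columns))
```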

In [1]:
# library imports: numpy, pandas packages
import numpy as np
import pandas as pd

In [2]:
# loading the dataset retrieved from: https://opendata.camden.gov.uk/Crime-and-Criminal-Justice/On-Street-Crime-In-Camden/qeje-7ve7
# The data is clipped to the London Borough of Camden boundary and was created using the Police API [Crime] for this Borough
# More details about this API, which is also used to retrieve additional data, follow later in this notebook
df = pd.read_csv('data/On_Street_Crime_In_Camden.csv')

In [3]:
print('Dataset shape: {}'.format(df.shape)) # rows and columns of the dataset
print('Different Ward Names ({}): {}'.format(len(df['Ward Name'].unique()),sorted(df['Ward Name'].unique()))) # different ward/neighbourhood names within London Borough of Camden
under_stn_lst = df[df['Location Subtype']=='London Underground Station']['Street ID'].unique() # underground station list within London Borough of Camden
#print('Different Street IDs related to London Underground Station within this area: {}'.format(under_stn_lst))
print('Different Street Names related to London Underground Station within this area:\n{}'.format(df[df['Location Subtype']=='London Underground Station']['Street Name'].unique()))
df.head(2)

Dataset shape: (210979, 20)
Different Ward Names (18): ['Belsize', 'Bloomsbury', 'Camden Town with Primrose Hill', 'Cantelowes', 'Fortune Green', 'Frognal and Fitzjohns', 'Gospel Oak', 'Hampstead Town', 'Haverstock', 'Highgate', 'Holborn and Covent Garden', 'Kentish Town', 'Kilburn', "King's Cross", "Regent's Park", 'St Pancras and Somers Town', 'Swiss Cottage', 'West Hampstead']
Different Street Names related to London Underground Station within this area:
['Kings Cross St Pancras (underground)' 'Euston (station)'
 'Chalk Farm (lu Station)' 'Chancery Lane (lu Station)'
 'Euston (underground)' 'Holborn (lu Station)' 'Finchley Road' 'Hampstead'
 'Belsize Park (lu Station)' 'Russell Square (lu Station)'
 'Kings Cross St Pancras (lu Station)' 'Chalk Farm'
 'Kings Cross St Pancras' 'Goodge Street' 'Holborn' 'Euston Square'
 'Kentish Town (underground)' 'Camden Town' 'West Hampstead (underground)'
 'Russell Square' 'Belsize Park' 'Finchley Road (lu Station)'
 'Warren Street (lu Station)' 'G

Unnamed: 0,Category,Street ID,Street Name,Context,Outcome Category,Outcome Date,Service,Location Subtype,ID,Persistent ID,Epoch,Ward Code,Ward Name,Easting,Northing,Longitude,Latitude,Spatial Accuracy,Last Uploaded,Location
0,Other theft,1489515,Kings Cross (station),,Status update unavailable,08/01/2017 12:00:00 AM,British Transport Police,Station,64777250,,04/01/2017 12:00:00 AM,E05000143,St Pancras and Somers Town,530277.37,183101.39,-0.123189,51.5318,This is only an approximation of where the cri...,11/07/2018,"(51.5318, -0.123189)"
1,Anti-social behaviour,960522,On or near Wellesley Place,,,,Police Force,,51520755,,09/01/2016 12:00:00 AM,E05000143,St Pancras and Somers Town,529707.23,182682.77,-0.131558,51.528169,This is only an approximation of where the cri...,11/07/2018,"(51.528169, -0.131558)"


In [4]:
d1 = dict(zip(list(df.columns), list(df.dtypes))) # datatypes indexed by column name
d2 = dict(zip(list(df.columns), list(df.isnull().sum()))) # number of null values indexed by column names
temp_df = pd.DataFrame([d1, d2], index=['data type', 'null values']) # creating a pandas dataframe: just for visualisation
temp_df

Unnamed: 0,Category,Street ID,Street Name,Context,Outcome Category,Outcome Date,Service,Location Subtype,ID,Persistent ID,Epoch,Ward Code,Ward Name,Easting,Northing,Longitude,Latitude,Spatial Accuracy,Last Uploaded,Location
data type,object,int64,object,float64,object,object,object,object,int64,object,object,object,object,float64,float64,float64,float64,object,object,object
null values,0,0,0,210979,45429,45429,0,199431,0,56507,0,0,0,0,0,0,0,0,0,0
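
The same kind of summary can also be built directly from the two pandas Series, without going through dictionaries. This is a minimal sketch on a toy dataframe (made-up rows, not the Camden data):

```python
import pandas as pd

# Toy frame standing in for the Camden dataset.
toy = pd.DataFrame({
    "Category": ["Other theft", "Anti-social behaviour"],
    "Context": [None, None],
    "Outcome Category": ["Status update unavailable", None],
})

# One row for the dtype and one for the null count of each column.
summary = pd.DataFrame({
    "data type": toy.dtypes,
    "null values": toy.isna().sum(),
}).T
print(summary)
```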


In [5]:
# installing the police-api-client: REQUIRED, the rest of the notebook won't work if this fails
!pip install police-api-client
# a custom package used to convert lat/lon into easting/northing: used to keep the same format as the Camden CSV dataset
!pip install convertbng



In [6]:
# About data.police.uk (data available from UK Police): https://data.police.uk/about/
# police API Client: https://police-api-client-python.readthedocs.io/en/latest/
# Python API Client on GitHub: https://github.com/rkhleics/police-api-client-python/

from police_api import PoliceAPI
from police_api.forces import Force

In [7]:
pol_api = PoliceAPI()
print('The latest date recorded data that can be accessed via this API: {}'.format(pol_api.get_latest_date()))
print('Police forces across UK:\n********************\n{}'.format(pol_api.get_forces()))

The latest date recorded data that can be accessed via this API: 2020-12
Police forces across UK:
********************
[<Force> Avon and Somerset Constabulary, <Force> Bedfordshire Police, <Force> Cambridgeshire Constabulary, <Force> Cheshire Constabulary, <Force> City of London Police, <Force> Cleveland Police, <Force> Cumbria Constabulary, <Force> Derbyshire Constabulary, <Force> Devon & Cornwall Police, <Force> Dorset Police, <Force> Durham Constabulary, <Force> Dyfed-Powys Police, <Force> Essex Police, <Force> Gloucestershire Constabulary, <Force> Greater Manchester Police, <Force> Gwent Police, <Force> Hampshire Constabulary, <Force> Hertfordshire Constabulary, <Force> Humberside Police, <Force> Kent Police, <Force> Lancashire Constabulary, <Force> Leicestershire Police, <Force> Lincolnshire Police, <Force> Merseyside Police, <Force> Metropolitan Police Service, <Force> Norfolk Constabulary, <Force> North Wales Police, <Force> North Yorkshire Police, <Force> Northamptonshire Polic

In [8]:
force = Force(pol_api, id='metropolitan') # Camden is under Metropolitan Police Service
# creating a new dataframe using all the neighbourhoods looked after by the Metropolitan Police Service
#neighbourhood_df = pd.DataFrame({'Neighbourhood ID':[x.id for x in force.neighbourhoods], 'Neighbourhood Name':[x.name for x in force.neighbourhoods]})

# neighbourhood dataframe created from the existing Camden's ward name/code information
# we will be retrieving the new crime data of these neighbourhoods through the Police API
neighbourhood_df = pd.DataFrame({'Neighbourhood ID':df['Ward Code'].unique(), 'Neighbourhood Name':df['Ward Name'].unique()})
# Here we find the neighbourhood IDs that have changed in the current Police API relative to the Camden dataset version
# we noticed that some of them differ from the IDs that appear inside the Camden dataset

# the following creates a name-to-ID dictionary for the neighbourhoods
# not absolutely ideal since there are some duplicate names: does not affect the 'Camden' dataset though
n_id_name_dict = {}
n_name_lst = list(neighbourhood_df['Neighbourhood Name']) # list of neighbourhood (ward) names inside the Camden dataset
for n in force.neighbourhoods: # traverse all the neighbourhoods of the force that looks after Camden
    if n.name in n_name_lst: # only select the neighbourhoods that are inside the 'Camden' CSV dataset
        n_id_name_dict[n.name] = n.id

# replacing the name list with IDs
n_name_lst = [n_id_name_dict[name] for name in n_name_lst]

# creating a new column inside the neighbourhood dataframe with the current ID seen from Police API
neighbourhood_df['Current Neighbourhood ID (Police API)'] = n_name_lst
neighbourhood_df

Unnamed: 0,Neighbourhood ID,Neighbourhood Name,Current Neighbourhood ID (Police API)
0,E05000143,St Pancras and Somers Town,E05000143
1,E05000144,Swiss Cottage,00AG03N
2,E05000141,King's Cross,E05000141
3,E05000137,Highgate,E05000272
4,E05000130,Camden Town with Primrose Hill,00AG02N
5,E05000136,Haverstock,E05000136
6,E05000133,Frognal and Fitzjohns,E05000133
7,E05000128,Belsize,E05000128
8,E05000129,Bloomsbury,E05000129
9,E05000131,Cantelowes,E05000131
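
The name-to-ID lookup above can be tried offline on a toy example. In the sketch below, the list of (id, name) pairs is a hypothetical stand-in for `force.neighbourhoods`, and the entry `E05000999` is made up. The remaining IDs are taken from the table above. Using `Series.map` also makes it easy to list the wards whose identifier changed.

```python
import pandas as pd

# Hypothetical (id, name) pairs standing in for force.neighbourhoods.
api_neighbourhoods = [
    ("E05000143", "St Pancras and Somers Town"),
    ("00AG03N", "Swiss Cottage"),
    ("E05000272", "Highgate"),
    ("E05000999", "Some Other Ward"),  # made-up entry, not in the Camden CSV
]

# Ward codes and names as they appear in the Camden CSV.
camden = pd.DataFrame({
    "Neighbourhood ID": ["E05000143", "E05000144", "E05000137"],
    "Neighbourhood Name": ["St Pancras and Somers Town", "Swiss Cottage", "Highgate"],
})

# Look up the current API identifier by ward name.
name_to_id = {name: n_id for n_id, name in api_neighbourhoods}
camden["Current Neighbourhood ID (Police API)"] = camden["Neighbourhood Name"].map(name_to_id)

# Wards whose identifier differs between the CSV and the API.
changed = camden[camden["Neighbourhood ID"] != camden["Current Neighbourhood ID (Police API)"]]
print(changed["Neighbourhood Name"].tolist())  # ['Swiss Cottage', 'Highgate']
```

A name missing from the lookup would show up as NaN in the new column rather than raising a `KeyError`, which may be easier to inspect.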


In [9]:
from time import sleep
from helper_functions import create_df_from_crime_data

# IN THIS PART:
# traverse through the current Police API (ID), and first retrieve the dataset for a single month
# then change it to incrementally increase the month parameter and retrieve for all months across all wards
# merge this dataset with the existing one [perform checks if correctly done for a small dataset first]
# save the dataset.....note down the new retrieved data size and the existing size (210979)
#month_year = ['2019-09', '2019-10','2019-11', '2019-12','2020-01', '2020-02','2020-03', '2020-04','2020-05', '2020-06','2020-07', '2020-08', '2020-09', '2020-10','2020-11', '2020-12'] 
month_year = ['2019-09', '2019-10', '2019-11', '2019-12'] # this is just a sample; the full run used the list of all the months above

# creating an empty dataframe where the additional data will be saved
df1 = pd.DataFrame(columns=df.columns)

# Retrieve data through the Police API for Sep 2019 up to Dec 2020, which is not available inside the Camden dataset [CSV file]
for month in month_year:
    for _, row in neighbourhood_df.iterrows():
        n_id = row['Current Neighbourhood ID (Police API)']
        neighbourhood = force.get_neighbourhood(n_id)
        crime_data = pol_api.get_crimes_area(neighbourhood.boundary, date=month) # retrieving the crime data of a particular month/year
        # create_df_from_crime_data is defined inside helper_functions.py: it returns a pandas dataframe of street crime information following the same structure as the Camden dataset
        temp_df = create_df_from_crime_data(crime_data, n_id, row['Neighbourhood Name'], under_stn_lst)
        df1 = pd.concat([df1, temp_df], sort=False) # concatenate the retrieved dataframe with the existing one
        sleep(30) # sleep for 30 seconds between requests
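
A common alternative to concatenating inside the loop is to collect the per-ward frames in a list and concatenate once at the end. The sketch below illustrates that pattern only. `fetch_month_for_ward` is a hypothetical stub standing in for the API call plus `create_df_from_crime_data`, and it returns made-up rows.

```python
import pandas as pd

def fetch_month_for_ward(month, ward_name):
    """Hypothetical stand-in for the API call plus create_df_from_crime_data."""
    return pd.DataFrame({
        "Epoch": [month],
        "Ward Name": [ward_name],
        "Category": ["Other theft"],
    })

months = ["2019-09", "2019-10"]
wards = ["Highgate", "Swiss Cottage"]

# Collect one frame per (month, ward) pair, then concatenate a single time.
frames = []
for month in months:
    for ward in wards:
        frames.append(fetch_month_for_ward(month, ward))

combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined.shape)  # (4, 3)
```

Note that `ignore_index=True` renumbers the rows, which differs from the loop above, where each retrieved frame keeps its own index.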

In [10]:
print('The additional data shape retrieved through Police API dataset: ',df1.shape)

The additional data shape retrieved through Police API dataset:  (16019, 20)


<h1>Conclusions</h1>

1. In the Camden dataset [CSV], records whose crime Outcome Category and Outcome Date are NaN were discarded. The Police API returned such records for previous years that were not present in the CSV dataset. This came to light while checking whether the Police API data for previous years matches the CSV dataset. Apart from those entries, the two were largely similar.<br/>
2. The check also revealed that the crime outcomes in the Camden dataset do not include the latest updates, which can be obtained through the Police API. For example, many crimes recorded as <strong>under investigation</strong> in the CSV dataset were later changed to <strong>status update unavailable</strong>. This change is not captured in the CSV dataset.<br/>
3. The way the Police API identifies some neighbourhoods [e.g., Swiss Cottage, Camden Town with Primrose Hill, Highgate and Regent's Park] has changed since the Camden dataset was created. Some additional preprocessing steps were required to make the two datasets [CSV dataset and additional data retrieved through the Police API] consistent.<br/>
4. Analysis of the Police API data shows that the Epoch field is the date of reporting. Since only the month is reported and the day is withheld [anonymisation], this may be why the term Epoch is used as the column name rather than reporting date.
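
The kind of comparison behind points 1 and 2 can be sketched with an outer merge on the crime ID. The frames below contain made-up rows, and the approach is only one possible way to run such a check; the actual verification is in the separate notebook linked in the introduction.

```python
import pandas as pd

# Made-up rows standing in for the stored CSV and a fresh API retrieval.
csv_df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Outcome Category": ["Under investigation", "Under investigation",
                         "Offender given a caution"],
})
api_df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Outcome Category": ["Status update unavailable", "Under investigation",
                         "Offender given a caution", None],
})

# indicator=True adds a '_merge' column saying which side each ID came from.
merged = csv_df.merge(api_df, on="ID", how="outer",
                      suffixes=("_csv", "_api"), indicator=True)

# Records present only in the API retrieval.
api_only = merged[merged["_merge"] == "right_only"]

# Records present in both sources whose outcome differs.
both = merged[merged["_merge"] == "both"]
changed = both[both["Outcome Category_csv"] != both["Outcome Category_api"]]

print(api_only["ID"].tolist())  # [4]
print(changed["ID"].tolist())   # [1]
```

One caveat: NaN never compares equal to NaN, so rows with a missing outcome on both sides would be flagged as changed unless they are handled separately.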

In [11]:
# saving the additional dataset retrieved through the API
df1.to_csv('data/police-api-dataset-dec.csv', index=False)

In [12]:
df1['Ward Name'].unique()

array(['St Pancras and Somers Town', 'Swiss Cottage', "King's Cross",
       'Highgate', 'Camden Town with Primrose Hill', 'Haverstock',
       'Frognal and Fitzjohns', 'Belsize', 'Bloomsbury', 'Cantelowes',
       'Kilburn', 'Kentish Town', 'West Hampstead', 'Fortune Green',
       'Holborn and Covent Garden', "Regent's Park", 'Gospel Oak',
       'Hampstead Town'], dtype=object)