# Data Programming In Python


# Coursework 1

# Exploratory Data Analysis and Crime Prediction across London Boroughs



1. Introduction
       1.1 Motivation
       1.2 Aims and Objectives
       1.3 Problem Formulation
1. EDA
        2.1 Import Python Modules
        2.2 Dataset Import
        2.3 Combining Datasets
        2.4 Remove Redundant Columns
        2.5 Remove Duplicates
        2.6 Remove wrong data (Outliers)
        2.7 Handle Missing Values
1. Exploratory Data Analysis
1. Limitations and Challenges
1. Conclusion
1. Future Work

## 1.1 Motivation


One of the main factors determining a city's quality of life is the crime rate. A city with a high crime rate is often seen as unsafe and uninviting, making it difficult to enjoy life there. On the other hand, a city with a low crime rate is often seen as more desirable, making it easier to enjoy life in the city. Of course, other factors can affect the quality of life in a city, but the crime rate is one of the most important.

There's no question that the crime rate in a city directly impacts the quality of life of its residents. A high crime rate means more fear and insecurity, which can lead to all sorts of problems. It's hard to feel safe in your own home if you're constantly worried about being burglarized or attacked. And, of course, there is the stress that comes with living in a dangerous place.

But it's not just the fear and stress that can make life in a high-crime city difficult. A high crime rate can also make everyday tasks like getting around town or running errands a lot more complicated and time-consuming. It takes much longer to get places and do things if you're always looking out for danger.

In short, a high crime rate can have a serious impact on the quality of life in a city. If you're considering moving to a new city, be sure to do your research and make sure you're moving to a place where you'll feel safe and secure.

A high crime rate can have a number of consequences for a community. It can make it more difficult for people to feel safe in their homes and in public spaces. It can also lead to a higher cost for insurance, as well as higher taxes to pay for law enforcement and other services needed to combat crime. Additionally, a high crime rate can have a negative impact on the economy, as businesses may be less likely to locate in a community with a high crime rate.

## 1.2 Aims and Objectives

The crime rates in London in 2022 are a cause for concern. According to the Metropolitan Police Service, the crime rate has increased by 20% in the past year. The most common crimes are vandalism, theft, and assault. There has been a recent spike in knife crime, and the murder rate is the highest it has been in a decade. Mayor Sadiq Khan has pledged to take action to reduce crime, but it is unclear what measures he will take.

## 1.3 Problem Formulation

The problem being tackled in this research can be best explained in two distinct parts:

        1.3.1 Performing exploratory analysis of the data to mine patterns in crime

The first step in determining the safety of different areas in the city is analyzing the spread and impact of crime. The data provided by London Government will be used to perform exploratory analysis and observe existing patterns in crime across all the boroughs of London.

The aim is to study the crime spread in London based on the geographical location of each crime, the possible areas of victimization on the streets, seasonal changes in the crime rate and the type, and the hourly variations in crime.

        1.3.2. Building a prediction model to predict the type of crime that can take place in

After observing the patterns of crime from the historical data in this coursework, the next thing is to predict the crimes that can occur in the future. The goal is to build a prediction model that treats this problem as a multiclass classification problem, by classifying the unseen data into one of the crime categories (classes) thereby predicting the crime that can occur. Such data is expected to help the police plan their patrol and effectively contribute to building a smarter city.

For the first part of the coursework, various Python libraries and data analytics tools will be used for initial data preprocessing, to analyze the spread of the crime in the city. For the second part of the coursework, in order to build a prediction model, the result of existing models will be improved by experimenting with different types of algorithms. 

## 2. EDA
        
        


## 2.1 Import Python Modules

The modules required for the first part of the coursework were imported from various Python libraries.




In [1]:
#Import libraries and modules

# Basic Analysis and Visualization
import numpy as np
import pandas as pd
import glob

# parse HTML documents
import requests
from bs4 import BeautifulSoup 


## 2.2 Data Collection

London is policed by two territorial forces. The [City of London Police](https://www.cityoflondon.police.uk/) is responsible for law enforcement within the City of London, whereas the [Metropolitan Police](https://www.met.police.uk/), a much larger and separate organisation, is responsible for law enforcement in the remainder of the London region outside the City, also known as Greater London.


#### 2.2.1 Crime rate (Primary Dataset)
To analyse the crime rates in London, the data was taken from [https://data.police.uk/data/](https://data.police.uk/data/). From the list of all police forces across the United Kingdom, the boxes for both City of London Police and Metropolitan Police Service were selected. Of the three remaining options, only ‘crime data’ and ‘Stop and Search’ data were included. The ‘outcomes data’ option was left out because it generates a separate file for data that is already included in the crime data CSV. This step generated separate CSV files of monthly crime data for each police force: City of London Police and Metropolitan Police Service.

In [2]:
# list all the csv files in the relative path
for f in glob.glob("*.csv"):
    print(f)

2019-10-city-of-london-street.csv
2019-10-metropolitan-street.csv
2019-11-city-of-london-street.csv
2019-11-metropolitan-street.csv
2019-12-city-of-london-street.csv
2019-12-metropolitan-street.csv
2020-01-city-of-london-street.csv
2020-01-metropolitan-street.csv
2020-02-city-of-london-street.csv
2020-02-metropolitan-street.csv
2020-03-city-of-london-street.csv
2020-03-metropolitan-street.csv
2020-04-city-of-london-street.csv
2020-04-metropolitan-street.csv
2020-05-city-of-london-street.csv
2020-05-metropolitan-street.csv
2020-06-city-of-london-street.csv
2020-06-metropolitan-street.csv
2020-07-city-of-london-street.csv
2020-07-metropolitan-street.csv
2020-08-city-of-london-street.csv
2020-08-metropolitan-street.csv
2020-09-city-of-london-street.csv
2020-09-metropolitan-street.csv
2020-10-city-of-london-street.csv
2020-10-metropolitan-street.csv
2020-11-city-of-london-street.csv
2020-11-metropolitan-street.csv
2020-12-city-of-london-street.csv
2020-12-metropolitan-street.csv
2021-01-ci

## 2.3 Combining Datasets

71 separate CSV files were downloaded as the primary dataset, containing all police-reported crimes between October 2019 and September 2022.

Data from both police forces will be concatenated into a single dataframe for pre-processing and analysis.

In [3]:
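# Read every monthly CSV in the working directory into its own dataframe
frames = []

for one_file in glob.glob("*.csv"):
    new_df = pd.read_csv(one_file)
    frames.append(new_df)

# Stack the monthly dataframes into a single dataframe
df = pd.concat(frames)
print(df)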

                                                Crime ID    Month  \
0                                                    NaN  2019-10   
1      347cacb0d02bdb756998b54d7487407bcf6f171ba72f6c...  2019-10   
2      a53440d933613bf927349c9af506c2a2e49bb72786fefd...  2019-10   
3      24a8afd5512e09b44b272db7bee560cb458fc718a11ed7...  2019-10   
4                                                    NaN  2019-10   
...                                                  ...      ...   
85823  8c7516073095ad86f7fa2eed88033e565627546c560134...  2022-09   
85824  e3f86085fc533b5bebeaa5b5fdfae51f3a4c879a4c9231...  2022-09   
85825  acb271fa590e8fa5319a51618e94710026b02d73e31ebd...  2022-09   
85826  9ea2a138ddf72acdaeab4365eda3f4fa86de6bd2fed466...  2022-09   
85827  dac466cf34a0ad5341ab52e72eeb4de204b42f4d25c352...  2022-09   

                       Reported by                 Falls within  Longitude  \
0            City of London Police        City of London Police  -0.111497   
1            Ci

In [4]:
#Shape of the dataset
df.shape

(3371273, 12)

In [5]:
df.head()

Unnamed: 0,Crime ID,Month,Reported by,Falls within,Longitude,Latitude,Location,LSOA code,LSOA name,Crime type,Last outcome category,Context
0,,2019-10,City of London Police,City of London Police,-0.111497,51.518226,On or near Pedestrian Subway,E01000914,Camden 028B,Anti-social behaviour,,
1,347cacb0d02bdb756998b54d7487407bcf6f171ba72f6c...,2019-10,City of London Police,City of London Police,-0.112422,51.515381,On or near Star Yard,E01000914,Camden 028B,Other theft,Investigation complete; no suspect identified,
2,a53440d933613bf927349c9af506c2a2e49bb72786fefd...,2019-10,City of London Police,City of London Police,-0.111497,51.518226,On or near Pedestrian Subway,E01000914,Camden 028B,Other theft,Investigation complete; no suspect identified,
3,24a8afd5512e09b44b272db7bee560cb458fc718a11ed7...,2019-10,City of London Police,City of London Police,-0.111962,51.518494,On or near Nightclub,E01000914,Camden 028B,Violence and sexual offences,Local resolution,
4,,2019-10,City of London Police,City of London Police,-0.095914,51.520348,On or near Beech Street,E01000001,City of London 001A,Anti-social behaviour,,


In [6]:
#datatype of all columns of the dataset
df.dtypes

Crime ID                  object
Month                     object
Reported by               object
Falls within              object
Longitude                float64
Latitude                 float64
Location                  object
LSOA code                 object
LSOA name                 object
Crime type                object
Last outcome category     object
Context                  float64
dtype: object

Each value in the 'LSOA name' column consists of a borough name followed by a more specific area identifier (for example 'Camden 028B'), which isn't necessary for the analysis in this project. The identifier (the last five characters, including the space) will be stripped from every row so that crimes can be plotted by borough more easily.
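The stripping step can be illustrated on a few sample values. This is a minimal sketch on made-up rows in the same format as the dataset; the `sample` series and the regular-expression variant are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Sample values in the "<borough> <4-character suffix>" format; None stands for a missing name
sample = pd.Series(["Camden 028B", "City of London 001A", None])

# Slice off the last five characters (the space plus the 4-character suffix)
by_slice = sample.str[:-5]

# Alternative: remove a trailing " 028B"-style suffix only when it is actually present
by_regex = sample.str.replace(r"\s\d{3}[A-Z]$", "", regex=True)

print(by_slice.tolist())
print(by_regex.tolist())
```

Fixed-width slicing is the shorter option, but it would silently truncate any value that does not end in such a suffix; the pattern-based variant leaves those values untouched.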

In [7]:
df['LSOA name'] = df['LSOA name'].str[:-5]

In [8]:
df.head()

Unnamed: 0,Crime ID,Month,Reported by,Falls within,Longitude,Latitude,Location,LSOA code,LSOA name,Crime type,Last outcome category,Context
0,,2019-10,City of London Police,City of London Police,-0.111497,51.518226,On or near Pedestrian Subway,E01000914,Camden,Anti-social behaviour,,
1,347cacb0d02bdb756998b54d7487407bcf6f171ba72f6c...,2019-10,City of London Police,City of London Police,-0.112422,51.515381,On or near Star Yard,E01000914,Camden,Other theft,Investigation complete; no suspect identified,
2,a53440d933613bf927349c9af506c2a2e49bb72786fefd...,2019-10,City of London Police,City of London Police,-0.111497,51.518226,On or near Pedestrian Subway,E01000914,Camden,Other theft,Investigation complete; no suspect identified,
3,24a8afd5512e09b44b272db7bee560cb458fc718a11ed7...,2019-10,City of London Police,City of London Police,-0.111962,51.518494,On or near Nightclub,E01000914,Camden,Violence and sexual offences,Local resolution,
4,,2019-10,City of London Police,City of London Police,-0.095914,51.520348,On or near Beech Street,E01000001,City of London,Anti-social behaviour,,


## 2.4 Remove Redundant Columns



In [9]:
#Finding the sum of null values in each column
df.isnull().sum()

Crime ID                  963365
Month                          0
Reported by                    0
Falls within                   0
Longitude                  44406
Latitude                   44406
Location                       0
LSOA code                  44406
LSOA name                  44406
Crime type                     0
Last outcome category     963365
Context                  3371273
dtype: int64

On manually analyzing the above table, it can be seen that the 'Context' column is empty for the entire dataset. Since the column is empty, it makes sense to remove it from the dataset. 
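The same check can be done programmatically instead of by reading the table. The sketch below uses a small made-up frame (`toy`) rather than the real data, and is only one possible way to find columns that are entirely null.

```python
import numpy as np
import pandas as pd

# Toy frame: 'Context' is entirely empty, 'Crime ID' is only partly empty
toy = pd.DataFrame({
    "Crime ID": ["a1", None, "c3"],
    "Crime type": ["Other theft", "Anti-social behaviour", "Burglary"],
    "Context": [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is null
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# Drop every column that is entirely null, without naming it
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```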

In [10]:
#Dropping Context Column
df = df.drop('Context', axis=1)

The values in the 'Reported by' and 'Falls within' columns are the same, i.e. the cases are reported by the force in whose jurisdiction they fall. Because this information isn't useful for analysis, both columns will be removed from the dataframe.
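Whether two columns really agree on every row can be confirmed before dropping them. The following is a minimal sketch on a made-up frame (`toy`); on the real data the same comparison would be applied to `df`.

```python
import pandas as pd

# Toy frame with the two columns under comparison
toy = pd.DataFrame({
    "Reported by": ["City of London Police", "Metropolitan Police Service"],
    "Falls within": ["City of London Police", "Metropolitan Police Service"],
})

# Number of rows where the two columns disagree
mismatches = int((toy["Reported by"] != toy["Falls within"]).sum())

# True only if the columns agree on every row
columns_match = mismatches == 0
print(mismatches, columns_match)
```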

In [11]:
df = df.drop(['Reported by','Falls within'], axis=1)

In [12]:
# checking the result by displaying columns 
df.columns

Index(['Crime ID', 'Month', 'Longitude', 'Latitude', 'Location', 'LSOA code',
       'LSOA name', 'Crime type', 'Last outcome category'],
      dtype='object')

After removing redundant columns, there are 9 columns left:

- Crime ID: Unique ID of the reported crime
- Month: the month and year of the crime
- Longitude
- Latitude
- Location: Specifies the nearest landmark
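- LSOA code: code of the Lower Layer Super Output Area (a small statistical geography) in which the crime was recorded
- LSOA name: the London borough to which the record relates (after stripping the area identifier)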
- Crime type: crime category
- Last outcome category: Outcome of the reported case

## 2.5 Remove Duplicates

In [13]:
#Will get rid of any duplicate cases reported in the table
df = df.drop_duplicates()

In [14]:
df.shape

(2834789, 9)
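Before dropping duplicates it can be useful to count how many rows are affected and how many of them lack a 'Crime ID'. Rows without an ID that agree on every other column are treated as duplicates by `drop_duplicates()`, even though they may be distinct incidents. The sketch below shows this on a made-up frame (`toy`); it is an illustration, not a step from the original notebook.

```python
import pandas as pd

# Toy frame: one repeated case with an ID, and two ID-less rows that look identical
toy = pd.DataFrame({
    "Crime ID": ["a1", "a1", None, None],
    "Month": ["2019-10"] * 4,
    "Location": ["On or near Star Yard"] * 2 + ["On or near Beech Street"] * 2,
    "Crime type": ["Other theft"] * 2 + ["Anti-social behaviour"] * 2,
})

# Rows flagged as repeats of an earlier, fully identical row
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# Flagged rows that have no Crime ID to confirm they are the same case
n_without_id = int((dup_mask & toy["Crime ID"].isna()).sum())

deduplicated = toy.drop_duplicates()
print(n_duplicates, n_without_id, deduplicated.shape)
```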

## 2.6 Remove wrong data

Most files from the Metropolitan Police contain data for areas that don't fall within London. London consists of 32 boroughs plus the City of London. Outliers (crimes at locations outside London) will be removed by web scraping a list of London boroughs and comparing it to the 'LSOA name' column. Any row that doesn't fall within a London borough will be removed from the dataframe (df).

In [15]:
# The requests library fetches the page; Beautiful Soup will parse the returned HTML

London_council = "https://en.wikipedia.org/wiki/List_of_London_boroughs"
table_class="wikitable sortable jquery-tablesorter"

# Check that the request succeeded (status code 200 means OK)
response=requests.get(London_council)
print(response.status_code)

200


In [16]:
# parse data from the html into a beautifulsoup object

soup = BeautifulSoup(response.text, 'html.parser')
Borough_table=soup.find('table',{'class':"wikitable"})

In [17]:
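from io import StringIO

# read_html returns a list of dataframes; the HTML is wrapped in StringIO
# because newer pandas versions deprecate passing a literal HTML string
df1=pd.read_html(StringIO(str(Borough_table)))

# take the first table from the list
df1=df1[0]
print(df1.head())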

                        Borough Inner Status  \
0  Barking and Dagenham[note 1]   NaN    NaN   
1                        Barnet   NaN    NaN   
2                        Bexley   NaN    NaN   
3                         Brent   NaN    NaN   
4                       Bromley   NaN    NaN   

                               Local authority Political control  \
0  Barking and Dagenham London Borough Council            Labour   
1                Barnet London Borough Council            Labour   
2                Bexley London Borough Council      Conservative   
3                 Brent London Borough Council            Labour   
4               Bromley London Borough Council      Conservative   

                                Headquarters  Area (sq mi)  \
0                   Town Hall, 1 Town Square         13.93   
1  Barnet House, 2 Bristol Avenue, Colindale         33.49   
2            Civic Offices, 2 Watling Street         23.38   
3          Brent Civic Centre, Engineers Way         1

In [18]:
#Dropping all other columns since they aren't of any use for this EDA. 
df1 = df1.drop(['Inner', 'Status', 'Local authority', 'Political control',
       'Headquarters', 'Area (sq mi)', 'Population(2019 est)', 'Co-ordinates',
       'Nr. inmap'], axis=1)

In [19]:
#Remove redundant data from specific indexes

for index in df1.index:
    if df1.loc[index,'Borough']=='Barking and Dagenham[note 1]':
        df1.loc[index,'Borough'] = 'Barking and Dagenham'
    elif df1.loc[index,'Borough']=='Greenwich[note 2]':
        df1.loc[index,'Borough'] = 'Greenwich'
    elif df1.loc[index,'Borough']=='Hammersmith and Fulham[note 4]':
        df1.loc[index,'Borough'] = 'Hammersmith and Fulham'
        
print(df1)

                   Borough
0     Barking and Dagenham
1                   Barnet
2                   Bexley
3                    Brent
4                  Bromley
5                   Camden
6                  Croydon
7                   Ealing
8                  Enfield
9                Greenwich
10                 Hackney
11  Hammersmith and Fulham
12                Haringey
13                  Harrow
14                Havering
15              Hillingdon
16                Hounslow
17               Islington
18  Kensington and Chelsea
19    Kingston upon Thames
20                 Lambeth
21                Lewisham
22                  Merton
23                  Newham
24               Redbridge
25    Richmond upon Thames
26               Southwark
27                  Sutton
28           Tower Hamlets
29          Waltham Forest
30              Wandsworth
31             Westminster


London has 32 boroughs and a separate City of London Corporation, which is part of London but isn't a borough. The City of London is therefore absent from 'df1' but will be included during analysis.
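The comparison against the borough list can be sketched as follows. The frames `crimes` and `borough_list` are made-up stand-ins for `df` and `df1`, and keeping rows with a missing 'LSOA name' at this stage is an assumption made here so that they can be dealt with together with the other missing values.

```python
import pandas as pd

# Hypothetical stand-ins for the crime dataframe and the scraped borough list
crimes = pd.DataFrame({
    "LSOA name": ["Camden", "City of London", "Epping Forest", None],
    "Crime type": ["Other theft", "Burglary", "Robbery", "Drugs"],
})
borough_list = pd.DataFrame({"Borough": ["Camden", "Barnet"]})

# Valid areas: the scraped boroughs plus the City of London
london_areas = set(borough_list["Borough"]) | {"City of London"}

# Keep rows inside London, and rows whose LSOA name is missing
in_london = crimes["LSOA name"].isin(london_areas)
london_crimes = crimes[in_london | crimes["LSOA name"].isna()]
print(london_crimes)
```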

## 2.7 Handle Missing Values

In [20]:
df.isnull().sum()

Crime ID                 462207
Month                         0
Longitude                 43417
Latitude                  43417
Location                      0
LSOA code                 43417
LSOA name                 43417
Crime type                    0
Last outcome category    462207
dtype: int64

'Crime ID' and 'Last outcome category' are missing data in 462,207 rows and location specific data (Longitude, Latitude, LSOA name, and LSOA code) is missing for 43,417 cases.

However, where 'Location' data is missing, the primary dataset marks it as 'No Location' rather than leaving a null value. Any instance of 'No Location' will be changed to a null value using the NumPy library to improve the quality of the data and allow better analysis.


In [21]:
# Passing null value in case of 'No location' data in 'Location' column

df['Location'] = df['Location'].replace(to_replace="No Location",
           value=np.nan)

# Checking if missing 'Location' data has been updated as a null value
df.isnull().sum()

Crime ID                 462207
Month                         0
Longitude                 43417
Latitude                  43417
Location                  43417
LSOA code                 43417
LSOA name                 43417
Crime type                    0
Last outcome category    462207
dtype: int64

Cases with no information regarding their location will not be useful to determine the frequency of crimes across boroughs. However, the rows with missing location information do have details about the crime type which might be useful for further analysis. 

Thus, a new dataframe will be created where rows with missing data in these ('Location', 'Longitude', 'Latitude', 'LSOA code','LSOA name') columns will be dropped. EDA will be done on this data for coursework 1. 

In [22]:
# Creating a new dataframe while dropping cases with null location values

df2 = df.dropna(subset=['Location', 'Longitude', 'Latitude', 'LSOA code','LSOA name'], how='all')

df2.isnull().sum()

Crime ID                 462171
Month                         0
Longitude                     0
Latitude                      0
Location                      0
LSOA code                     0
LSOA name                     0
Crime type                    0
Last outcome category    462171
dtype: int64
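The `how` argument of `dropna` decides which rows are removed: `how='all'` drops a row only when every listed column is null, while `how='any'` drops it when at least one is null. The sketch below contrasts the two on a made-up frame (`toy`) with a reduced, illustrative column list.

```python
import numpy as np
import pandas as pd

location_cols = ["Location", "Longitude", "Latitude"]

# Toy frame: row 1 has no location data at all, row 2 is missing only the longitude
toy = pd.DataFrame({
    "Location": ["On or near Star Yard", np.nan, "On or near Nightclub"],
    "Longitude": [-0.112422, np.nan, np.nan],
    "Latitude": [51.515381, np.nan, 51.518494],
    "Crime type": ["Other theft", "Drugs", "Robbery"],
})

# how='all': drop a row only if every listed column is null
dropped_all = toy.dropna(subset=location_cols, how="all")

# how='any': drop a row if at least one listed column is null
dropped_any = toy.dropna(subset=location_cols, how="any")

print(len(dropped_all), len(dropped_any))
```

In the output above no nulls remain in the location columns after `how='all'`, which suggests that these columns are missing together in this dataset, so both settings would give the same result here.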

A quick view of the missing data shows that the number of missing 'Crime ID' values is the same as the number of null values in 'Last outcome category'.
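Equal counts do not by themselves show that the same rows are affected. A row-by-row comparison of the two null masks would confirm it; the sketch below does this on a made-up frame (`toy`) and is an illustration only.

```python
import pandas as pd

# Toy frame in which both columns are missing on the same rows
toy = pd.DataFrame({
    "Crime ID": ["a1", None, "c3", None],
    "Last outcome category": ["Local resolution", None, "Under investigation", None],
})

id_missing = toy["Crime ID"].isna()
outcome_missing = toy["Last outcome category"].isna()

# True only if the two columns are null on exactly the same rows
same_rows = bool((id_missing == outcome_missing).all())
print(same_rows)
```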


## 4. Limitations

One drawback of this dataset is that the file for June 2022 for the City of London is missing. I'll try to web scrape this data to fill in the missing information. 
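A gap like this can be detected from the file names alone by comparing the files found against every expected month and force combination. The sketch below uses a short, made-up list of names in the same pattern as the downloaded files; the `files`, `months` and `forces` values are illustrative assumptions, and on the real data the list would come from `glob.glob("*.csv")`.

```python
import itertools
import re

import pandas as pd

# Illustrative subset of file names in the "<YYYY-MM>-<force>-street.csv" pattern
files = [
    "2022-05-city-of-london-street.csv", "2022-05-metropolitan-street.csv",
    "2022-06-metropolitan-street.csv",
    "2022-07-city-of-london-street.csv", "2022-07-metropolitan-street.csv",
]

# Extract (month, force) from each file name
pattern = re.compile(r"^(\d{4}-\d{2})-(.+)-street\.csv$")
found = {pattern.match(name).groups() for name in files}

# Every (month, force) pair that should exist for the chosen period
months = pd.period_range("2022-05", "2022-07", freq="M").strftime("%Y-%m")
forces = ["city-of-london", "metropolitan"]
expected = set(itertools.product(months, forces))

missing = sorted(expected - found)
print(missing)
```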


## 5. Future work

#### 5.1 Unemployment Rate by Ethnic Group, Nationality, and Borough
The second dataset, Unemployment Rate by Ethnic Group & Nationality, Borough, was obtained from data.london.gov.uk. It contains unemployment rates broken down by ethnic group from 2020 to 2022. The data is taken from the Annual Population Survey, produced by the Office for National Statistics. So that this analysis is effective and matches the other datasets, the years 2020 to 2022 will be selected.

#### 5.2 Homelessness
The third dataset covers homelessness by borough and was obtained from ([www.data.london.gov.uk](http://www.data.london.gov.uk/)). It contains homelessness rates broken down by ethnic group from 2020 to 2022 and is sourced from the DCLG P1E Homelessness returns (quarterly). The datasets need to cover the same period for the analysis, which is why the years 2020 to 2022 were selected.

#### 5.3 Employees earning below the London Living Wage
The fourth and final dataset is Employees earning below the London Living Wage, obtained from ([www.data.london.gov.uk](http://www.data.london.gov.uk/)). Data is also provided by borough from 2020 to 2022, including employees earning below the UK Living Wage by region in London. As with the other datasets, the years 2020 to 2022 were selected.