# <b> Data Science Capstone - The Best District in Inner London to Live in

Importing libraries

In [124]:
import pandas as pd
import numpy as np
import geocoder
import folium
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
from geopy import distance
import sklearn.neighbors
import string
import requests

<b> 1. Introduction/Business Problem

1.1 Business Problem

You have long wondered what it would be like to live in England, and you have settled on Inner London as your preferred area, perhaps because you are a big football fan and the city has many teams to follow. That is the only choice you have made so far: you are not sure which district best suits your needs. Relevant criteria include proximity to everyday necessities, whether the area is a crime hot spot, and its general atmosphere. Many other criteria are possible, but in this project we consider only these factors to determine the ideal place for you.

1.2 Audience

The target audience of this project is anyone who is considering a move to Inner London and is actively looking for the area that best suits their needs.

<b> 2. Data

2.1. Location 

First, we need every postcode district in Inner London together with its district name. We scrape these from Wikipedia.

In [2]:
url='https://en.wikipedia.org/wiki/London_postal_district'
response=pd.read_html(url)
res=response[1]
df=pd.DataFrame()
res_1=res
res_1=res_1.drop([0,1,5,8],axis=0).reset_index(drop=True)
for let in res_1[0].unique():
    for dis in res_1[res_1[0]==let][1]:
        area=res[res[0]==let][res[res[0]==let][1]==dis][2].reset_index(drop=True)[0]
        zips=[]
        districts=[]
        #making a list of zip codes
        for j in range(0,50):
            code=let+str(j)+' '
            if area.find(code)!=-1:
                zips.append(code)  
        #making a list of districts
        for i in range(0,len(zips)):
            if i+1<len(zips):
                districts.append(area[area.find(zips[i]):area.find(zips[i+1])])
            else:
                districts.append(area[area.find(zips[i]):])
        for i in range(0,len(districts)):
            for j in zips:
                if districts[i].find('Head district')==-1:
                    districts[i]=districts[i].replace(j,'')
        for i in range(0,len(zips)):
            zips[i]=zips[i].replace(' ','')
        zips=pd.DataFrame(zips,columns=['Postcode'])
        districts=pd.DataFrame(districts,columns=['District'])
        df_area=pd.concat([zips,districts],axis=1)
        df=pd.concat([df,df_area],axis=0).reset_index(drop=True)
df

Unnamed: 0,Postcode,District
0,E1,E1 Head district
1,E2,Bethnal Green
2,E3,Bow
3,E4,Chingford
4,E5,Clapton
...,...,...
115,W12,Shepherds Bush
116,W13,West Ealing
117,W14,West Kensington
118,WC1,WC1 Head district
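
The loop above finds each postcode token inside the scraped cell text and slices out the name that follows it. The same idea can be expressed more compactly with a regular expression. The sketch below is illustrative only: `split_postcodes` is a hypothetical helper, the sample string is made up to resemble the scraped text, and unlike the notebook it always strips the postcode from the name (the notebook keeps it for head districts).

```python
import re

def split_postcodes(area_text, prefix):
    """Split text like 'E1 Head district, E2 Bethnal Green' into (postcode, name) pairs."""
    # A postcode token is the area prefix followed by one or two digits.
    pattern = re.compile(r'\b(' + re.escape(prefix) + r'\d{1,2})\s+')
    matches = list(pattern.finditer(area_text))
    rows = []
    for k, m in enumerate(matches):
        # The district name runs up to the next postcode token (or the end of the text).
        end = matches[k + 1].start() if k + 1 < len(matches) else len(area_text)
        rows.append((m.group(1), area_text[m.end():end].strip(' ,;')))
    return rows

sample = 'E1 Head district, E2 Bethnal Green, E3 Bow'  # made-up sample text
parsed = split_postcodes(sample, 'E')
print(parsed)
```

The resulting list of pairs could be passed to `pd.DataFrame(parsed, columns=['Postcode', 'District'])` to obtain a table shaped like the one above.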


We use the geocoder library (with its ArcGIS provider) to retrieve the geographical coordinates of these postcodes.

In [5]:
lats=[]
longs=[]
for i in df['Postcode']:
    lat_lng_coords=None
    #retry until the ArcGIS geocoder returns coordinates for this postcode
    while(lat_lng_coords is None):
        g=geocoder.arcgis('{}, London'.format(i))
        lat_lng_coords=g.latlng
    latitude=lat_lng_coords[0]
    longitude=lat_lng_coords[1]
    lats.append(latitude)
    longs.append(longitude)

df_lat=pd.DataFrame(lats,columns=['Latitude'])
df_long=pd.DataFrame(longs,columns=['Longitude'])
df_ll=pd.concat([df,df_lat,df_long],axis=1)
df_ll

Unnamed: 0,Postcode,District,Latitude,Longitude
0,E1,E1 Head district,51.52022,-0.05431
1,E2,Bethnal Green,51.52669,-0.06257
2,E3,Bow,51.52702,-0.02594
3,E4,Chingford,51.61780,-0.00934
4,E5,Clapton,51.55897,-0.05323
...,...,...,...,...
115,W12,Shepherds Bush,51.50645,-0.23691
116,W13,West Ealing,51.51453,-0.31951
117,W14,West Kensington,51.49568,-0.20993
118,WC1,WC1 Head district,51.52450,-0.12273
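
The geocoding loop above retries until a result comes back, so it would never finish if a postcode could not be resolved. A bounded retry is one possible safeguard. The sketch below is a hedged illustration, not the notebook's method: `geocode_with_retry`, `max_tries`, and `wait_seconds` are hypothetical names, and a stand-in lookup replaces the real network call so the example runs offline.

```python
import time

def geocode_with_retry(query, lookup, max_tries=3, wait_seconds=0.0):
    """Call lookup(query) until it returns coordinates or the tries run out."""
    for attempt in range(max_tries):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]  # (latitude, longitude)
        time.sleep(wait_seconds)  # brief pause before the next attempt
    return None  # the caller decides how to handle an unresolved postcode

# Stand-in lookup: fails once, then returns the E1 coordinates from the table above.
_calls = {'n': 0}
def fake_lookup(query):
    _calls['n'] += 1
    return None if _calls['n'] < 2 else [51.52022, -0.05431]

result = geocode_with_retry('E1, London', fake_lookup)
print(result)
```

In the notebook, the lookup could be `lambda q: geocoder.arcgis(q).latlng`, which matches the call used in the cell above.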


As stated in the introduction, we consider crime and venues as factors. Each of these has a required dataset.

2.2. Crime

Ideally, we want to live in a crime-free place. However, such places are hard to come by anywhere in the world, let alone in the capital of England. Hence, we only choose districts with low crime rates relative to the rest of the city. In this project, we count crimes over roughly the last two years of available data. As of writing, the latest data on the UK Crime Stats website (ukcrimestats.com) covers November 2020, so we use the 23 months from January 2019 to November 2020.

In [6]:
crimes_count=[]
for i in range(0,len(df_ll)):
    district=df_ll['Postcode'].iloc[i]
    url='https://www.ukcrimestats.com/Postcode_Districts/{}/'.format(district)
    response=pd.read_html(url)
    #sum the first 23 monthly rows, covering January 2019 to November 2020
    crimes=response[0]['Total'].iloc[0:23].sum()
    crimes_count.append(crimes)
    print("{} crimes counted in postcode {}".format(crimes,district))
crimes_count=pd.DataFrame(crimes_count,columns=['Crime Count'])
crimes_count.to_csv('crimes_count.csv',index=False)
df2=pd.concat([df_ll,crimes_count],axis=1)

33567 crimes counted in postcode E1
18982 crimes counted in postcode E2
19294 crimes counted in postcode E3
13461 crimes counted in postcode E4
13736 crimes counted in postcode E5
20208 crimes counted in postcode E6
11696 crimes counted in postcode E7
18746 crimes counted in postcode E8
12368 crimes counted in postcode E9
12028 crimes counted in postcode E10
12174 crimes counted in postcode E11
7947 crimes counted in postcode E12
11813 crimes counted in postcode E13
27232 crimes counted in postcode E14
20342 crimes counted in postcode E15
13841 crimes counted in postcode E16
26820 crimes counted in postcode E17
3218 crimes counted in postcode E18
3456 crimes counted in postcode E20
0 crimes counted in postcode EC1
0 crimes counted in postcode EC2
0 crimes counted in postcode EC3
0 crimes counted in postcode EC4
33161 crimes counted in postcode N1
3737 crimes counted in postcode N2
4375 crimes counted in postcode N3
17365 crimes counted in postcode N4
6578 crimes counted in postcode N5


After running the code above, some districts still show zero crimes. This happens because, for such districts, the website only provides data for their subdistricts (for example, EC1A). We retrieve these and sum them to get the desired values.

In [12]:
crimes_count=[]
for i in df2[df2['Crime Count']==0]['Postcode']:
    crimes=0
    for j in list(string.ascii_uppercase):
        district=i+j
        url='https://www.ukcrimestats.com/Postcode_Districts/{}/'.format(district)
        response=pd.read_html(url)
        crimes+=response[0]['Total'].iloc[0:23].sum()
    crimes_count.append(crimes)
    print("{} crimes counted in postcode {}".format(crimes,i))
df2.loc[df2['Crime Count']==0,'Crime Count']=crimes_count

13970 crimes counted in postcode EC1
9199 crimes counted in postcode EC2
4294 crimes counted in postcode EC3
4304 crimes counted in postcode EC4
31142 crimes counted in postcode SW1
64420 crimes counted in postcode W1
18515 crimes counted in postcode WC1
25083 crimes counted in postcode WC2
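
The subdistrict fallback above appends each letter from A to Z to the postcode and adds up whatever totals are found. The sketch below isolates that aggregation so it can be checked without network access. It is an illustration under assumptions: `sum_subdistricts` and `fetch_total` are hypothetical names, the totals are made up, and unlike the cell above it skips any subdistrict whose lookup fails.

```python
import string

def sum_subdistricts(postcode, fetch_total):
    """Sum fetch_total over postcode+'A' ... postcode+'Z', skipping failed lookups."""
    total = 0
    for letter in string.ascii_uppercase:
        try:
            total += fetch_total(postcode + letter)
        except Exception:
            # No page or no table for this subdistrict: treat it as absent.
            continue
    return total

# Made-up totals for illustration; real values would come from the website.
fake_totals = {'EC1A': 100, 'EC1M': 250, 'EC1V': 400}
ec1_total = sum_subdistricts('EC1', lambda code: fake_totals[code])
print(ec1_total)  # 750
```

In the notebook, `fetch_total` would wrap the `pd.read_html` call and the 23-row sum used in the cell above.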


Of course, the raw number of crimes isn't enough for our analysis, because more populous districts tend to have more crimes. What we need instead is each district's crime rate: the number of crimes divided by the district's population. Crime rates are conventionally expressed per 100,000 persons, so we multiply by 100,000. Note that the counts cover the full 23-month window, so the resulting rates are per 100,000 persons over that period rather than per year. Population data is based on the 2011 census and can be gathered from https://postal-codes.cybo.com/

In [29]:
population=[]
for i in range(0,len(df2)):
    post=df2['Postcode'].iloc[i]
    url='https://postal-codes.cybo.com/united-kingdom/{}/'.format(post)
    response=pd.read_html(url)
    try:
        pop=int(response[5]['Population'].iloc[0])
        print('Population of {} is {}'.format(post,pop))
    except:
        print('No Data Available for {}'.format(post))
        pop=0
    population.append(pop)
population=pd.DataFrame(population,columns=['Population'])
df2=pd.concat([df2,population['Population']],axis=1)

Population of E1 is 71450
Population of E2 is 49223
Population of E3 is 58412
Population of E4 is 63468
Population of E5 is 50762
Population of E6 is 87076
Population of E7 is 59885
Population of E8 is 43436
Population of E9 is 42850
Population of E10 is 45733
Population of E11 is 58374
Population of E12 is 46139
Population of E13 is 49158
Population of E14 is 82345
Population of E15 is 58308
Population of E16 is 48754
Population of E17 is 110756
Population of E18 is 20065
No Data Available for E20
No Data Available for EC1
No Data Available for EC2
No Data Available for EC3
No Data Available for EC4
Population of N1 is 92515
Population of N2 is 24905
Population of N3 is 27857
Population of N4 is 52343
Population of N5 is 22286
Population of N6 is 21603
Population of N7 is 52208
Population of N8 is 42803
Population of N9 is 55723
Population of N10 is 28425
Population of N11 is 30412
Population of N12 is 30074
Population of N13 is 32685
Population of N14 is 31505
Population of N15 is 46

As with the crime data, some districts only have population data available for their subdistricts, so we sum the subdistrict values instead.

In [20]:
population=[]
for i in df2[df2['Population']==0]['Postcode']:
    pops=0
    for j in list(string.ascii_uppercase):
        post=i+j
        url='https://postal-codes.cybo.com/united-kingdom/{}/'.format(post)
        try:
            response=pd.read_html(url)
            pops+=int(response[5]['Population'].iloc[0])
        except:
            pops+=0
    population.append(pops)
    print('Population of {} is {}'.format(i,pops))
missing_pop=pd.DataFrame(population,columns=['Population'])
df2.loc[df2['Population']==0,'Population']=population
df2

Population of E20 is 0
Population of EC1 is 42435
Population of EC2 is 11210
Population of EC3 is 5617
Population of EC4 is 5366
Population of SW1 is 54818
Population of W1 is 56613
Population of WC1 is 34114
Population of WC2 is 9503


Unnamed: 0,Postcode,District,Latitude,Longitude,Crime Count,Population
0,E1,E1 Head district,51.52022,-0.05431,33567,71450
1,E2,Bethnal Green,51.52669,-0.06257,18982,49223
2,E3,Bow,51.52702,-0.02594,19294,58412
3,E4,Chingford,51.61780,-0.00934,13461,63468
4,E5,Clapton,51.55897,-0.05323,13736,50762
...,...,...,...,...,...,...
115,W12,Shepherds Bush,51.50645,-0.23691,17494,51698
116,W13,West Ealing,51.51453,-0.31951,7147,32864
117,W14,West Kensington,51.49568,-0.20993,7941,34338
118,WC1,WC1 Head district,51.52450,-0.12273,18515,34114



The population of E20 is not available from the postal-codes.cybo website, so we enter it manually. According to http://www.eastvillagelondon.co.uk/, the population of E20 is about 6,000 people.

In [36]:
population=6000
df2.loc[df2['Population']==0,'Population']=population
df2

Unnamed: 0,Postcode,District,Latitude,Longitude,Crime Count,Population
0,E1,E1 Head district,51.52022,-0.05431,33567,71450
1,E2,Bethnal Green,51.52669,-0.06257,18982,49223
2,E3,Bow,51.52702,-0.02594,19294,58412
3,E4,Chingford,51.61780,-0.00934,13461,63468
4,E5,Clapton,51.55897,-0.05323,13736,50762
...,...,...,...,...,...,...
115,W12,Shepherds Bush,51.50645,-0.23691,17494,51698
116,W13,West Ealing,51.51453,-0.31951,7147,32864
117,W14,West Kensington,51.49568,-0.20993,7941,34338
118,WC1,WC1 Head district,51.52450,-0.12273,18515,34114


The code below calculates the crime rates.

In [38]:
crime_rate=[]
for i in range(0,len(df2)):
    crime_rate.append((df2['Crime Count'].iloc[i]/df2['Population'].iloc[i])*100000)
crime_rate=pd.DataFrame(crime_rate,columns=['Crime Rate'])
df3=pd.concat([df2,crime_rate],axis=1)
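
As a quick check of the formula, the sketch below recomputes the rate for a single district in plain Python. `crime_rate` is a hypothetical helper, and the guard for a zero population is an added assumption, motivated by districts such as E20 whose population was initially recorded as 0.

```python
def crime_rate(crime_count, population, per=100_000):
    """Crimes per `per` residents; None when the population is unknown (0)."""
    if population <= 0:
        return None
    return crime_count / population * per

e1 = crime_rate(33567, 71450)  # E1 crime count and population from the table above
print(round(e1, 2))  # 46979.71
```

With pandas, the same calculation can be written for all districts at once as the column expression `df2['Crime Count'] / df2['Population'] * 100000`.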

Below is our transformed crime and location dataset.

In [39]:
df3

Unnamed: 0,Postcode,District,Latitude,Longitude,Crime Count,Population,Crime Rate
0,E1,E1 Head district,51.52022,-0.05431,33567,71450,46979.706088
1,E2,Bethnal Green,51.52669,-0.06257,18982,49223,38563.273267
2,E3,Bow,51.52702,-0.02594,19294,58412,33030.884065
3,E4,Chingford,51.61780,-0.00934,13461,63468,21209.113254
4,E5,Clapton,51.55897,-0.05323,13736,50762,27059.611520
...,...,...,...,...,...,...,...
115,W12,Shepherds Bush,51.50645,-0.23691,17494,51698,33838.833224
116,W13,West Ealing,51.51453,-0.31951,7147,32864,21747.200584
117,W14,West Kensington,51.49568,-0.20993,7941,34338,23125.982876
118,WC1,WC1 Head district,51.52450,-0.12273,18515,34114,54273.905142


2.3 Venues

One of the factors in choosing where to live is convenience. Is there a supermarket? Is there a train station nearby? How about a pharmacy in case you need medicine urgently? Personal taste also comes into play. What is the ideal atmosphere of the place you live in? Do you want to live in a hot spot for coffee shops, restaurants, bars, or recreation? All of these questions can be answered with venue data obtained from the Foursquare API. We retrieve this data later, once we have analysed the crime data and narrowed down the viable districts.