# Capstone Project: Evaluating Food Restaurant Feasibility in London, United Kingdom using k-Means Clustering

## Table of Contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

Suppose a client wants to expand his Jollybuzz food corporation into Europe and plans to open his first store in London, United Kingdom. However, he has little knowledge of the city's areas and neighbourhoods. He also wants to know where his competitors are located and which areas have the least competition.

In addition, we will evaluate which value of k gives the best-performing k-means model for identifying clusters of neighbourhoods in London. After determining the target neighbourhood cluster, we will profile the cluster by its demographics and predict the two-year business survival rate within it.
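Because k-means is unsupervised, comparing values of k relies on an internal measure rather than on accuracy against known labels. The sketch below assumes the mean silhouette score as that measure, which is a common choice but not necessarily the one used later in this project. The feature matrix `X` is synthetic and the helper `silhouette_by_k` is a hypothetical name introduced for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a neighbourhood-by-feature matrix (hypothetical data):
# three well-separated blobs in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

def silhouette_by_k(X, k_values, random_state=0):
    """Fit k-means for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

scores = silhouette_by_k(X, range(2, 7))
best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
print(scores)
print("best k:", best_k)
```

On real venue data the scores are usually much closer together than on this toy example, so it may help to inspect the whole curve instead of relying only on the maximum.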

## Data <a name="data"></a>

We will need to get information on:
* List of neighbourhoods in London, United Kingdom
* Postcodes and locations of the neighbourhoods
* Venues in each neighbourhood
* London Borough Profiles

There will be four sources of data:

1. List of neighbourhoods (areas) in London - https://en.wikipedia.org/wiki/List_of_areas_of_London
    This contains the Location column, which will be the neighbourhood, and the London borough column, which gives the borough.
2. Postcodes and location data - https://www.doogal.co.uk/AdministrativeAreas.php
    This contains the list of administrative areas, including the London boroughs, with latitude and longitude values.
3. Venues in the neighbourhood - Foursquare API
    To be extracted from the Foursquare API.
4. London Borough Profiles - https://data.london.gov.uk/dataset/london-borough-profiles
    Compute the average two-year business survival rate in London and create a new variable that equals 1 if a borough's two-year business survival rate is above average and 0 otherwise.
    Use this as the dependent variable in a decision tree and a logistic regression to identify the factors that significantly affect business survival in the target cluster.
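The above-average indicator described in item 4 can be sketched as follows. The column name `two_year_survival_rate`, the borough labels, and the rates are hypothetical placeholders, not figures taken from the London Borough Profiles.

```python
import pandas as pd

# Hypothetical values for illustration only
profiles = pd.DataFrame({
    "Borough": ["A", "B", "C", "D"],
    "two_year_survival_rate": [70.0, 75.0, 80.0, 71.0],
})

# Average across all boroughs in the table
mean_rate = profiles["two_year_survival_rate"].mean()

# 1 if the borough's rate is strictly above the average, 0 otherwise
profiles["above_avg_survival"] = (
    profiles["two_year_survival_rate"] > mean_rate
).astype(int)

print(profiles)
```

A column built this way can then serve as the dependent variable for a classifier such as a decision tree or a logistic regression.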
 

### Importing Libraries

In [259]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform a JSON file into a pandas dataframe (top-level import since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

import re

print('Libraries imported.')

Libraries imported.


### Webscraping List of Neighbourhoods in London

In [277]:
url = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'
df_list = pd.read_html(url)  # returns one DataFrame per HTML table on the page

# the table at index 1 holds the list of neighbourhoods
df_list[1]

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,020,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",020,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,020,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,020,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",020,TQ478728
5,Aldborough Hatch,Redbridge[9],ILFORD,IG2,020,TQ455895
6,Aldgate,City[10],LONDON,EC3,020,TQ334813
7,Aldwych,Westminster[10],LONDON,WC2,020,TQ307810
8,Alperton,Brent[11],WEMBLEY,HA0,020,TQ185835
9,Anerley,Bromley[11],LONDON,SE20,020,TQ345695


In [278]:
colnames = ['Neighborhood', 'Borough', 'Post_town', 'Postcode', 'Dial code']
ldn = pd.DataFrame(df_list[1])
# drop the OS grid reference, then rename the remaining five columns
ldn.drop(columns=["OS grid ref"], inplace=True)
ldn.columns = colnames
# the dial code is not needed either
ldn.drop(columns=["Dial code"], inplace=True)

ldn.head()


Unnamed: 0,Neighborhood,Borough,Post_town,Postcode
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Addington,Croydon[8],CROYDON,CR0
3,Addiscombe,Croydon[8],CROYDON,CR0
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14"


In [279]:
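# remove footnote markers such as [7] from the Borough column
# regex=True must be explicit: since pandas 2.0 the pattern is otherwise treated as a literal string
ldn['Borough'] = ldn.Borough.str.replace('[^a-zA-Z, ]', "", regex=True)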
ldn.head()

Unnamed: 0,Neighborhood,Borough,Post_town,Postcode
0,Abbey Wood,"Bexley, Greenwich",LONDON,SE2
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
2,Addington,Croydon,CROYDON,CR0
3,Addiscombe,Croydon,CROYDON,CR0
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
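The character class used above removes every character that is not a letter, a comma, or a space, so it would also strip hyphens, apostrophes, or ampersands if a borough name contained them. A narrower alternative, sketched here on a few sample strings from the table, targets only bracketed numeric footnote markers such as `[7]`. The helper name `strip_footnotes` is an assumption introduced for illustration.

```python
import pandas as pd

def strip_footnotes(s: pd.Series) -> pd.Series:
    """Remove bracketed numeric markers like '[7]' and trim surrounding whitespace."""
    return s.str.replace(r"\s*\[\d+\]", "", regex=True).str.strip()

sample = pd.Series([
    "Bexley, Greenwich [7]",
    "Ealing, Hammersmith and Fulham[8]",
    "Croydon[8]",
    "Bexley",
])

cleaned = strip_footnotes(sample)
print(cleaned.tolist())
```

Trimming with `str.strip()` also removes the trailing space that is left behind when a marker is preceded by a space, as in `"Bexley, Greenwich [7]"`.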


In [280]:
# split the Borough column on commas so that each borough can later get its own row
# (only the first two boroughs of each entry are kept)
boroughs = ldn['Borough'].str.split(',', expand=True)
ldn['Borough1'] = boroughs[0]
ldn['Borough2'] = boroughs[1]
ldn.head()

Unnamed: 0,Neighborhood,Borough,Post_town,Postcode,Borough1,Borough2
0,Abbey Wood,"Bexley, Greenwich",LONDON,SE2,Bexley,Greenwich
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",Ealing,Hammersmith and Fulham
2,Addington,Croydon,CROYDON,CR0,Croydon,
3,Addiscombe,Croydon,CROYDON,CR0,Croydon,
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",Bexley,


In [310]:
# stack the two borough columns so that each row holds one neighbourhood-borough pair
ldn1 = ldn[['Neighborhood', 'Borough1']].rename(columns={'Borough1': 'Borough'})
ldn2 = ldn[['Neighborhood', 'Borough2']].rename(columns={'Borough2': 'Borough'})
ldnf = pd.concat([ldn1, ldn2])  # DataFrame.append was removed in pandas 2.0
ldnf = ldnf[ldnf['Borough'].notnull()]
ldnf = ldnf.sort_values(by='Borough')
ldnf

Unnamed: 0,Neighborhood,Borough
85,Chadwell Heath,Barking and Dagenham
121,Cricklewood,Brent
379,Queensbury,Brent
479,Upper Ruxley,Bromley
454,"Sydenham (also Lower Sydenham, Upper Sydenham)",Bromley
392,Ruxley,Bromley
470,Tufnell Park,Camden
268,Kilburn,Camden
358,Park Royal,Ealing
99,Chiswick,Ealing
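Two details of the split above are worth checking. Splitting on `','` leaves a leading space on every borough after the first (for example `' Greenwich'`), and only the first two boroughs are kept, so a neighbourhood listed under three boroughs loses one. A possible alternative, sketched on illustrative sample rows and assuming pandas 0.25 or later for `DataFrame.explode`, splits on the comma, produces one row per borough, and trims whitespace.

```python
import pandas as pd

# Illustrative sample rows, not the full scraped table
sample = pd.DataFrame({
    "Neighborhood": ["Abbey Wood", "Addington", "Cricklewood"],
    "Borough": ["Bexley, Greenwich", "Croydon", "Barnet, Brent, Camden"],
})

long_form = (
    sample.assign(Borough=sample["Borough"].str.split(","))  # string -> list of boroughs
          .explode("Borough")                                # one row per list element
          .assign(Borough=lambda d: d["Borough"].str.strip())  # drop leading/trailing spaces
          .reset_index(drop=True)
)

print(long_form)
```

Trimming matters for any later join on the borough name, because `' Brent'` and `'Brent'` are different keys.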


### Webscraping Location Data

In [290]:
url  = 'https://www.doogal.co.uk/AdministrativeAreas.php'
df_list2 = pd.read_html(url)
df_list2[0]

Unnamed: 0,Administrative area,County,Latitude,Longitude,Postcodes,Active postcodes,Population,Households
0,Aberdeen City,,57.1495,-2.13294,14319,6228,222599,103302
1,Aberdeenshire,,57.3539,-2.32244,21432,9722,252588,104594
2,Adur,West Sussex,50.8332,-0.284415,2964,1743,61167,26952
3,Allerdale,Cumbria,54.7132,-3.36148,5508,4073,96471,42364
4,Amber Valley,Derbyshire,53.0377,-1.42431,4941,2805,122339,52604
5,Angus,,56.6195,-2.74947,4547,3972,115791,51537
6,Antrim and Newtownabbey,,54.6982,-6.05146,4160,3321,137494,53565
7,Ards and North Down,,54.6072,-5.66706,5454,4056,155261,64004
8,Argyll and Bute,,56.0324,-5.21336,4141,3409,87912,40013
9,"Armagh City, Banbridge and Craigavon",,54.3849,-6.42147,5926,5144,196845,74376


Convert the scraped data into a pandas DataFrame, keeping only the area name, latitude, and longitude.

In [311]:
#convert to data frame
ll = pd.DataFrame(df_list2[0])
ll = ll.filter(items = ['Administrative area',  'Latitude', 'Longitude'])
ll.columns = ['Borough', 'Latitude', 'Longitude']
ll.head()

Unnamed: 0,Borough,Latitude,Longitude
0,Aberdeen City,57.1495,-2.13294
1,Aberdeenshire,57.3539,-2.32244
2,Adur,50.8332,-0.284415
3,Allerdale,54.7132,-3.36148
4,Amber Valley,53.0377,-1.42431


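Merge the location data with the neighbourhood data on the Borough column, so that each neighbourhood takes the latitude and longitude of its borough.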

In [323]:
##combine location data and neighborhood data
ldnf_ll = pd.merge(ldnf, ll, on = "Borough")
ldnf_ll

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude
0,Barking,Barking and Dagenham,51.5465,0.124976
1,Rush Green,Barking and Dagenham,51.5465,0.124976
2,Dagenham,Barking and Dagenham,51.5465,0.124976
3,Becontree Heath,Barking and Dagenham,51.5465,0.124976
4,Becontree,Barking and Dagenham,51.5465,0.124976
5,Castle Green,Barking and Dagenham,51.5465,0.124976
6,Creekmouth,Barking and Dagenham,51.5465,0.124976
7,Marks Gate,Barking and Dagenham,51.5465,0.124976
8,Hendon,Barnet,51.6055,-0.207728
9,Hampstead Garden Suburb,Barnet,51.6055,-0.207728
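`pd.merge` defaults to an inner join, so neighbourhoods whose borough name has no exact match in the location table are dropped without warning. A minimal sketch for surfacing such rows uses `how='left'` together with `indicator=True`. The frames below are made-up samples: the first two coordinate pairs are copied from the output above, while the Brent coordinates are placeholders and the `' Brent'` key illustrates a whitespace mismatch.

```python
import pandas as pd

neigh = pd.DataFrame({
    "Neighborhood": ["Barking", "Hendon", "Cricklewood"],
    "Borough": ["Barking and Dagenham", "Barnet", " Brent"],  # note the leading space
})
loc = pd.DataFrame({
    "Borough": ["Barking and Dagenham", "Barnet", "Brent"],
    "Latitude": [51.5465, 51.6055, 0.0],      # last value is a placeholder
    "Longitude": [0.124976, -0.207728, 0.0],  # last value is a placeholder
})

# A left join keeps every neighbourhood; _merge records whether a match was found
checked = neigh.merge(loc, on="Borough", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", ["Neighborhood", "Borough"]]
print(unmatched)

# After normalising whitespace, every row finds a match;
# validate= raises if a borough appears more than once in loc
neigh["Borough"] = neigh["Borough"].str.strip()
fixed = neigh.merge(loc, on="Borough", how="inner", validate="many_to_one")
print(len(fixed))
```

Comparing the row count before and after the merge is a quick additional check that no neighbourhood was lost.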

