# Capstone Project: Evaluating Food Restaurant Feasibility in London, United Kingdom using k-Means Clustering

## Table of Contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

Suppose a client wants to expand his Jollybuzz food corporation into Europe and plans to open his first store in London, United Kingdom. However, he has little knowledge of the city's areas and neighbourhoods. He also wants to know where his competitors are located and which areas have the least competition.

In addition, we will evaluate which value of k yields the best-performing model for identifying clusters of neighbourhoods in London. After determining the target neighbourhood cluster, we will profile the cluster by its demographics and predict the two-year business survival rate in the cluster.
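Because k-means is unsupervised, there is no ground-truth label to compute accuracy against, so comparing values of k needs a proxy metric. The sketch below assumes the silhouette score as that proxy (the text above does not name a metric) and runs on synthetic two-dimensional points rather than the London data; `X`, `scores`, and `best_k` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the neighbourhood feature matrix (assumption):
# three well-separated blobs of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit k-means for several candidate k and record the silhouette score of each fit
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

The same loop could be run on the neighbourhood venue features once they are built; the elbow method on `inertia_` is another common choice.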

## Data <a name="data"></a>

We will need to get information on:
* List of neighbourhoods in London, United Kingdom
* Post codes and location of the neighbourhoods
* Venues in the neighbourhood
* London Borough Profile

There will be four sources of data:

1. List of neighbourhoods (areas) in London - https://en.wikipedia.org/wiki/List_of_areas_of_London
    This contains the Location column, used as the neighbourhood, and the borough each location belongs to.
2. Postcodes and location data - https://www.doogal.co.uk/AdministrativeAreas.php
    This contains the borough list with latitude and longitude values.
3. Venues in the neighbourhood - Foursquare API
    To be extracted from the Foursquare API.
4. London Borough Profiles - https://data.london.gov.uk/dataset/london-borough-profiles
    Compute the average two-year business survival rate across London and create a new variable equal to 1 if a borough's two-year business survival rate is above average and 0 otherwise.
    Use this as the dependent variable in a decision tree and a logistic regression to identify the factors that significantly affect business survival in the target cluster.
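The binary target described in item 4 can be sketched as follows. This is a minimal illustration on made-up numbers: the borough labels, the rates, and the column names `two_year_survival_rate` and `above_avg_survival` are assumptions, not the actual fields of the London Borough Profiles file.

```python
import pandas as pd

# Hypothetical borough-level data; names and values are placeholders
profiles = pd.DataFrame({
    "Borough": ["A", "B", "C", "D"],
    "two_year_survival_rate": [70.0, 75.0, 80.0, 73.0],
})

# Average two-year survival rate across all boroughs
avg_rate = profiles["two_year_survival_rate"].mean()

# 1 if the borough is above the average, 0 otherwise
profiles["above_avg_survival"] = (
    profiles["two_year_survival_rate"] > avg_rate
).astype(int)

print(profiles)
```

The resulting 0/1 column can then serve as the dependent variable for classifiers such as `sklearn.tree.DecisionTreeClassifier` or `sklearn.linear_model.LogisticRegression`.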
 

### Importing Libraries

In [None]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import works in pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

import re

print('Libraries imported.')

### Webscraping List of Neighbourhoods in London

In [None]:
url  = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'
df_list = pd.read_html(url)

# the table at index 1 contains the list of neighbourhoods
df_list[1]

In [None]:
colnames = ['Neighborhood', 'Borough', 'Post_town', 'Postcode', 'Dial code']
ldn = pd.DataFrame(df_list[1])
# drop the grid reference, rename the remaining columns, then drop the dial code
ldn.drop(columns=["OS grid ref"], inplace=True)
ldn.columns = colnames
ldn.drop(columns=["Dial code"], inplace=True)

ldn.head()


In [None]:
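#remove footnote markers in Borough column
#regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default
ldn['Borough'] = ldn['Borough'].str.replace(r'[^a-zA-Z, ]', '', regex=True).str.strip()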
ldn.head()
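The pattern `[^a-zA-Z, ]` matches every character that is not a letter, a comma, or a space, so bracketed footnote numbers are removed while multi-borough entries keep their comma separators. A small sketch on hand-typed sample values (not the scraped table), assuming `regex=True` is passed explicitly:

```python
import pandas as pd

# Hand-typed sample values imitating footnote markers in a scraped column
raw = pd.Series(["Bexley, Greenwich [7]", "Ealing[10]", "Barnet"])

# Drop everything except letters, commas and spaces, then trim leftover whitespace
cleaned = raw.str.replace(r"[^a-zA-Z, ]", "", regex=True).str.strip()

print(cleaned.tolist())
```

Note that the pattern would also strip hyphens, ampersands, and apostrophes, so borough names containing such characters would need a wider character class.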

In [None]:
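#split Boroughs and expand so that, one borough = one line
borough_parts = ldn['Borough'].str.split(',', expand=True)
#strip the space that follows each comma so borough names match exactly when merging
ldn['Borough1'] = borough_parts[0].str.strip()
ldn['Borough2'] = borough_parts[1].str.strip()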
ldn.head()

In [None]:
#stack the first and second borough columns so each neighbourhood-borough pair is one row
ldn1 = ldn[['Neighborhood', 'Borough1']].rename(columns={'Borough1': 'Borough'})
ldn2 = ldn[['Neighborhood', 'Borough2']].rename(columns={'Borough2': 'Borough'})
ldnf = pd.concat([ldn1, ldn2])  # DataFrame.append was removed in pandas 2.0
ldnf = ldnf[ldnf['Borough'].notnull()].sort_values(by='Borough')
ldnf
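The "one borough = one line" reshaping can also be written with `Series.str.split` followed by `DataFrame.explode`, which handles any number of comma-separated boroughs per neighbourhood rather than only the first two. The sketch below is an alternative under that assumption, not the approach used in the cells above; `ldn_demo` and `long_form` are illustrative names and the rows are a small hand-typed sample.

```python
import pandas as pd

# Small illustrative sample in the same shape as the cleaned table
ldn_demo = pd.DataFrame({
    "Neighborhood": ["Abbey Wood", "Acton", "Barnet"],
    "Borough": ["Bexley, Greenwich", "Ealing, Hammersmith and Fulham", "Barnet"],
})

# Turn each comma-separated string into a list, then emit one row per list element
long_form = (
    ldn_demo.assign(Borough=ldn_demo["Borough"].str.split(","))
    .explode("Borough")
)

# Remove the space left after each comma and sort by borough
long_form["Borough"] = long_form["Borough"].str.strip()
long_form = long_form.sort_values("Borough").reset_index(drop=True)

print(long_form)
```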

### Webscraping Location Data

In [None]:
url  = 'https://www.doogal.co.uk/AdministrativeAreas.php'
df_list2 = pd.read_html(url)
df_list2[0]

Convert the webscraped data into a pandas data frame.

In [None]:
#convert to data frame
ll = pd.DataFrame(df_list2[0])
ll = ll.filter(items = ['Administrative area',  'Latitude', 'Longitude'])
ll.columns = ['Borough', 'Latitude', 'Longitude']
ll.head()

Merge Location data and Neighborhood data. 

In [None]:
##combine location data and neighborhood data
ldnf_ll = pd.merge(ldnf, ll, on = "Borough")
ldnf_ll
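`pd.merge` performs an inner join by default, so any neighbourhood whose borough name has no exact match in the location table is silently dropped. One way to check for such rows, sketched here on placeholder data (the frames, the "Unknown Borough" row, and the coordinate values are made up for illustration), is a left join with `indicator=True`:

```python
import pandas as pd

# Placeholder frames in the same shape as the neighbourhood and location tables
neighborhoods = pd.DataFrame({
    "Neighborhood": ["Abbey Wood", "Acton", "Somewhere"],
    "Borough": ["Bexley", "Ealing", "Unknown Borough"],
})
coords = pd.DataFrame({
    "Borough": ["Bexley", "Ealing"],
    "Latitude": [51.45, 51.51],    # placeholder values, not real coordinates
    "Longitude": [0.15, -0.30],
})

# Left join keeps every neighbourhood; the _merge column records where each row matched
checked = pd.merge(neighborhoods, coords, on="Borough", how="left", indicator=True)

# Boroughs present in the neighbourhood table but missing from the location table
unmatched = checked.loc[checked["_merge"] == "left_only", "Borough"].tolist()
print(unmatched)
```

An empty `unmatched` list would suggest that the inner join loses no neighbourhoods.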

## Data <a name="Data"></a>