# Opening a South Indian restaurant in Leeds

### This notebook analyses restaurant competition in Leeds, England, to determine the best neighbourhood in which to open a South Indian restaurant for a relative. It forms part of the IBM Data Science Capstone course on Coursera.

#### 1. Import libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print("Libraries imported.")

Libraries imported.


#### 2. Scrape place name data from the Wikipedia article

In [2]:
data = requests.get('https://en.wikipedia.org/wiki/List_of_places_in_Leeds').text # send the GET request

soup = BeautifulSoup(data, 'html.parser') # parse data

# set up empty arrays for the 2 columns of interest
placeNameList = [] 
postTownList = [] 

# append the data from the wikipedia table
for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 5: # skip header rows and rows without a post town column
        placeNameList.append(cells[0].text.rstrip('\n'))
        postTownList.append(cells[5].text.rstrip('\n'))

# store in a dataframe
leeds_df = pd.DataFrame({"PlaceName": placeNameList, "PostTown": postTownList})

leeds_df.head()

Unnamed: 0,PlaceName,PostTown
0,Aberford,LEEDS
1,Adel,LEEDS
2,Adwalton,BRADFORD
3,Ainsty,WETHERBY
4,Aireborough,LEEDS
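
The row-parsing loop above can be exercised offline before sending a request to Wikipedia. The sketch below applies the same logic to a small made-up HTML table; `SAMPLE_HTML` and the `parse_places` helper are illustrative names that do not appear in the notebook, and the column positions (0 for the place name, 5 for the post town) are taken from the cell above.

```python
from bs4 import BeautifulSoup

# Made-up sample with the column layout the notebook assumes:
# place name in column 0 and post town in column 5.
SAMPLE_HTML = """
<table>
  <tr><th>Place</th><th>a</th><th>b</th><th>c</th><th>d</th><th>Post town</th></tr>
  <tr><td>Adel
</td><td>-</td><td>-</td><td>-</td><td>-</td><td>LEEDS
</td></tr>
  <tr><td>Adwalton</td><td>-</td><td>-</td><td>-</td><td>-</td><td>BRADFORD</td></tr>
  <tr><td>Broken row</td></tr>
</table>
"""


def parse_places(html, name_col=0, town_col=5):
    """Return (place, post_town) pairs from the first table in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    pairs = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) <= max(name_col, town_col):
            continue  # header row or malformed row: not enough <td> cells
        pairs.append((cells[name_col].get_text(strip=True),
                      cells[town_col].get_text(strip=True)))
    return pairs


pairs = parse_places(SAMPLE_HTML)
print(pairs)
```

Rows with too few cells are skipped rather than raising an `IndexError`, which is the main difference from indexing `cells[5]` directly.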


In [3]:
# keep only places whose post town is Leeds, then manually remove places with troublesome data
leeds_df = leeds_df[leeds_df.PostTown == 'LEEDS'].reset_index(drop=True)
leeds_df = leeds_df[leeds_df.PlaceName != 'Arena Quarter'].reset_index(drop=True)
leeds_df = leeds_df[leeds_df.PlaceName != 'Yeadon'].reset_index(drop=True)
leeds_df = leeds_df[leeds_df.PlaceName != 'Hawksworth'].reset_index(drop=True)

# remove unnecessary posttown column now
leeds_df.drop(['PostTown'], axis=1, inplace=True)
leeds_df.head()

Unnamed: 0,PlaceName
0,Aberford
1,Adel
2,Aireborough
3,Alwoodley
4,Armley
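
The chained filters above could also be written as a single boolean mask. This is a minimal sketch on made-up rows with the same two columns; the `excluded` list simply collects the three place names removed in the cell above.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped table.
places = pd.DataFrame({
    "PlaceName": ["Adel", "Adwalton", "Arena Quarter", "Armley", "Yeadon"],
    "PostTown": ["LEEDS", "BRADFORD", "LEEDS", "LEEDS", "LEEDS"],
})

# Place names removed by hand in the notebook.
excluded = ["Arena Quarter", "Yeadon", "Hawksworth"]

# Keep Leeds post towns and drop the excluded names in one step.
mask = (places["PostTown"] == "LEEDS") & ~places["PlaceName"].isin(excluded)
filtered = places.loc[mask, ["PlaceName"]].reset_index(drop=True)
print(filtered)
```

Keeping the exclusions in one list makes it easier to see, and later revisit, which places were dropped.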


In [4]:
leeds_df.shape[0] # check number of rows (places)

109

#### 3. Geocode the places in Leeds and store in the dataframe

In [5]:
# function to geocode a given place
def get_latlng(place):
    locator = Nominatim(user_agent="myGeocoder")
    location = locator.geocode("{}, Leeds, UK".format(place))
    coords = [location.latitude,location.longitude]
    return coords

In [6]:
coords =  [get_latlng(row) for row in leeds_df["PlaceName"].tolist()] # pass placenames to geocode

#store coordinates in main dataframe
coords_df = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
leeds_df['Latitude'] = coords_df['Latitude']
leeds_df['Longitude'] = coords_df['Longitude']

leeds_df.head()

Unnamed: 0,PlaceName,Latitude,Longitude
0,Aberford,53.843233,-1.354424
1,Adel,53.847787,-1.583762
2,Aireborough,53.866616,-1.684758
3,Alwoodley,53.858879,-1.557492
4,Armley,53.797691,-1.588387
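
`get_latlng` raises an `AttributeError` whenever the geocoder returns `None` for a place it cannot match, which is probably why some places had to be removed by hand earlier. The sketch below shows one way to make that case explicit. `geocode_places`, `FakeLocation` and `fake_geocode` are hypothetical names used only for illustration; the fake geocoder stands in for `Nominatim.geocode` so the example runs without network access.

```python
import time


def geocode_places(places, geocode, delay_seconds=0.0):
    """Geocode each place; return (lat, lng), or (None, None) when not found.

    `geocode` is any callable that takes a query string and returns an object
    with `.latitude` and `.longitude`, or None when nothing matches.
    """
    coords = []
    for place in places:
        location = geocode("{}, Leeds, UK".format(place))
        if location is None:
            coords.append((None, None))  # flag the miss instead of crashing
        else:
            coords.append((location.latitude, location.longitude))
        if delay_seconds:
            time.sleep(delay_seconds)  # optional pause between requests
    return coords


class FakeLocation:
    """Stand-in for a geocoder result (illustrative only)."""

    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(query):
    """Stand-in geocoder that knows a single place."""
    known = {"Adel, Leeds, UK": FakeLocation(53.847787, -1.583762)}
    return known.get(query)


result = geocode_places(["Adel", "Nowhere"], fake_geocode)
print(result)
```

With the real service, `locator.geocode` would be passed in place of `fake_geocode`, with a delay of about one second between calls to respect the public Nominatim usage policy. geopy also provides `geopy.extra.rate_limiter.RateLimiter` for the same purpose.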


In [7]:
# save dataframe
leeds_df.to_csv("leeds_df.csv", index=False)

#### 4. Create a map of Leeds with places superimposed


In [8]:
# locate Leeds
locator = Nominatim(user_agent="myGeocoder")
location = locator.geocode("Leeds, UK")
latitude = location.latitude
longitude = location.longitude

# create map of Leeds using latitude and longitude values
map_leeds = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, place in zip(leeds_df['Latitude'], leeds_df['Longitude'], leeds_df['PlaceName']):
    label = '{}'.format(place)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_leeds)
    
map_leeds

#### 5. Explore venues in Leeds with Foursquare

In [9]:
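import os

# credentials are read from environment variables instead of being hard-coded
# (the variable names below are placeholders; set them before running this cell)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version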

In [10]:
# get the top 100 venues that are within a radius of 2 kilometers
radius = 2000
LIMIT = 100

venues = []

for lat, long, place in zip(leeds_df['Latitude'], leeds_df['Longitude'], leeds_df['PlaceName']):
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    for venue in results:
        venues.append((
            place,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
        

# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['PlaceName', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(4599, 7)


Unnamed: 0,PlaceName,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Aberford,53.843233,-1.354424,Jays International Airport For Blind Pilots.,53.846027,-1.343407,Airport Terminal
1,Aberford,53.843233,-1.354424,askham bar,53.840057,-1.3419,Bus Station
2,Aberford,53.843233,-1.354424,Jays Kitchen,53.836636,-1.340028,Tea Room
3,Aberford,53.843233,-1.354424,Barwick Tennis Club,53.855987,-1.343654,Tennis Court
4,Aberford,53.843233,-1.354424,White Swan Pub,53.8295,-1.34331,Pub
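
The request loop above indexes straight into the JSON response, so a place with no results or a venue with no category would raise an exception part-way through. The following sketch separates the parsing into a function that tolerates both cases. `extract_venues` and `sample_payload` are illustrative; the payload only mimics the nesting the notebook relies on (`response` → `groups[0]` → `items` → `venue`) and is not real API output.

```python
def extract_venues(place, lat, lng, payload):
    """Flatten one explore-style response into a list of venue tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue carries no category instead of failing on [0].
        category = categories[0]["name"] if categories else None
        rows.append((place, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


# Made-up payload with the same nesting as the response used above.
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Cafe A",
                           "location": {"lat": 53.80, "lng": -1.55},
                           "categories": [{"name": "Café"}]}},
                {"venue": {"name": "Unlabelled B",
                           "location": {"lat": 53.81, "lng": -1.54},
                           "categories": []}},
            ]
        }]
    }
}

rows = extract_venues("Sample Place", 53.8, -1.5, sample_payload)
empty = extract_venues("Sample Place", 53.8, -1.5, {"response": {}})
print(rows, empty)
```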


In [11]:
venues_df.groupby(["PlaceName"]).count() #count the number of venues per place

PlaceName,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Aberford,6,6,6,6,6,6
Adel,26,26,26,26,26,26
Aireborough,35,35,35,35,35,35
Alwoodley,13,13,13,13,13,13
Armley,48,48,48,48,48,48
Austhorpe,30,30,30,30,30,30
Bardsey,4,4,4,4,4,4
Bardsey cum Rigton,4,4,4,4,4,4
Barwick-in-Elmet,4,4,4,4,4,4
Beck Hill,20,20,20,20,20,20


In [12]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique()))) # count of unique venue categories
print('There are {} unique places.'.format(len(venues_df['PlaceName'].unique()))) 

There are 192 unique categories.
There are 109 unique places.


In [13]:
venues_df['VenueCategory'].unique() #see all types of unique venues, note the many types of restaurants

array(['Airport Terminal', 'Bus Station', 'Tea Room', 'Tennis Court',
       'Pub', 'Paintball Field', 'Park', 'Hotel', 'Gym / Fitness Center',
       'Grocery Store', 'Athletics & Sports', 'Gastropub', 'Supermarket',
       'Hardware Store', 'Café', 'Cemetery', 'Italian Restaurant',
       'Indoor Play Area', 'College Stadium', 'Golf Course', 'Bar',
       'Shopping Mall', 'Fast Food Restaurant', 'Fish & Chips Shop',
       'Rental Car Location', 'Furniture / Home Store', 'Clothing Store',
       'Coffee Shop', 'Airport Lounge', 'Pharmacy', 'Electronics Store',
       'Indian Restaurant', 'Bookstore', 'Restaurant',
       'Chinese Restaurant', 'Duty-free Shop', 'Pet Store',
       'Currency Exchange', 'Airport Gate', 'Sports Club',
       'Warehouse Store', 'Burger Joint', 'Portuguese Restaurant',
       'Middle Eastern Restaurant', 'Soccer Field', 'Nightclub',
       'Caribbean Restaurant', 'Sandwich Place', 'Rock Climbing Spot',
       'Gym', 'Pizza Place', 'Cuban Restaurant', 'Pool

#### 6. Clustering analysis for restaurants in Leeds

In [14]:
# one hot encoding
leeds_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")
leeds_onehot = leeds_onehot[leeds_onehot.filter(like='Restaurant').columns]

# add place column back to dataframe
leeds_onehot['PlaceName'] = venues_df['PlaceName'] 

# move place column to the first column
fixed_columns = [leeds_onehot.columns[-1]] + list(leeds_onehot.columns[:-1])
leeds_onehot = leeds_onehot[fixed_columns]

print(leeds_onehot.shape)
leeds_onehot.head()

(4599, 31)


Unnamed: 0,PlaceName,American Restaurant,Argentinian Restaurant,Asian Restaurant,Brazilian Restaurant,Caribbean Restaurant,Chinese Restaurant,Cuban Restaurant,English Restaurant,Fast Food Restaurant,French Restaurant,Greek Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Jewish Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Molecular Gastronomy Restaurant,Moroccan Restaurant,Paella Restaurant,Portuguese Restaurant,Restaurant,Scandinavian Restaurant,Seafood Restaurant,Sushi Restaurant,Tapas Restaurant,Thai Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,Aberford,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Aberford,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Aberford,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Aberford,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Aberford,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [15]:
leeds_grouped = leeds_onehot.groupby(["PlaceName"]).mean().reset_index()

leeds_grouped2 = leeds_onehot.groupby(["PlaceName"]).sum().reset_index()
leeds_grouped2["Total Restaurants"] = leeds_grouped2.sum(axis=1, numeric_only=True) # sum the restaurant counts only, ignoring the PlaceName column

leeds_grouped_sums = leeds_grouped2[['PlaceName','Indian Restaurant','Total Restaurants']]

print(leeds_grouped.shape)

leeds_grouped.head()

(109, 31)


Unnamed: 0,PlaceName,American Restaurant,Argentinian Restaurant,Asian Restaurant,Brazilian Restaurant,Caribbean Restaurant,Chinese Restaurant,Cuban Restaurant,English Restaurant,Fast Food Restaurant,French Restaurant,Greek Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Jewish Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Molecular Gastronomy Restaurant,Moroccan Restaurant,Paella Restaurant,Portuguese Restaurant,Restaurant,Scandinavian Restaurant,Seafood Restaurant,Sushi Restaurant,Tapas Restaurant,Thai Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,Aberford,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Adel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Aireborough,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.028571,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Alwoodley,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Armley,0.0,0.0,0.0,0.0,0.020833,0.0,0.020833,0.020833,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
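
To make the two aggregations concrete: the grouped mean is the share of a place's venues that fall in each restaurant category, while the grouped sum is the raw count. The sketch below reproduces the encoding steps on a tiny made-up venue table. Passing `dtype=int` to `get_dummies` is an addition here so that the result is the same across pandas versions.

```python
import pandas as pd

# Made-up venues: place A has 4 venues (2 Indian restaurants),
# place B has 2 venues (1 Thai restaurant).
venues = pd.DataFrame({
    "PlaceName": ["A", "A", "A", "A", "B", "B"],
    "VenueCategory": ["Pub", "Indian Restaurant", "Indian Restaurant", "Park",
                      "Thai Restaurant", "Pub"],
})

# One-hot encode the categories and keep only the restaurant columns.
onehot = pd.get_dummies(venues[["VenueCategory"]], prefix="", prefix_sep="", dtype=int)
onehot = onehot[onehot.filter(like="Restaurant").columns]
onehot.insert(0, "PlaceName", venues["PlaceName"])

shares = onehot.groupby("PlaceName").mean()  # fraction of all venues in the place
counts = onehot.groupby("PlaceName").sum()   # number of restaurants per category

print(shares)
print(counts)
```

Because the mean is taken over all venues rather than only restaurants, a place with many non-restaurant venues gets small shares even when it has several restaurants.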


In [16]:
# create new dataframe with the most common venue category for each place
num_top_venues = 1

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
areaColumns = ['PlaceName']
freqColumns = []
for ind in np.arange(num_top_venues):
    try:
        freqColumns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no suffix defined beyond 'rd', so fall back to 'th'
        freqColumns.append('{}th Most Common Venue'.format(ind+1))
columns = areaColumns+freqColumns

# create a new dataframe
places_venues_sorted = pd.DataFrame(columns=columns)
places_venues_sorted['PlaceName'] = leeds_grouped['PlaceName']

for ind in np.arange(leeds_grouped.shape[0]):
    row_categories = leeds_grouped.iloc[ind, :].iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    places_venues_sorted.iloc[ind, 1:] = row_categories_sorted.index.values[0:num_top_venues]

# places_venues_sorted.sort_values(freqColumns, inplace=True)
print(places_venues_sorted.shape)
places_venues_sorted

(109, 2)


Unnamed: 0,PlaceName,1st Most Common Venue
0,Aberford,Vietnamese Restaurant
1,Adel,Fast Food Restaurant
2,Aireborough,Chinese Restaurant
3,Alwoodley,Chinese Restaurant
4,Armley,Caribbean Restaurant
5,Austhorpe,Fast Food Restaurant
6,Bardsey,Vietnamese Restaurant
7,Bardsey cum Rigton,Vietnamese Restaurant
8,Barwick-in-Elmet,Vietnamese Restaurant
9,Beck Hill,Indian Restaurant


In [17]:
# set number of clusters
kclusters = 5

leeds_grouped_clustering = leeds_grouped.drop(columns=["PlaceName"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(leeds_grouped_clustering)

# create a new dataframe that includes the cluster label as well as the most common venue category for each place
leeds_merged = leeds_df.copy()
a = leeds_merged['PlaceName'].isin(leeds_grouped["PlaceName"].tolist())
leeds_merged = leeds_merged[a].reset_index(drop=True)

# add clustering labels
leeds_merged["Cluster Labels"] = kmeans.labels_
leeds_merged = leeds_merged.join(leeds_grouped_sums.set_index("PlaceName"), on="PlaceName")
leeds_merged = leeds_merged.join(places_venues_sorted.set_index("PlaceName"), on="PlaceName")

# sort the results by Cluster Labels
print(leeds_merged.shape)
leeds_merged.sort_values(["Cluster Labels"], inplace=True)
leeds_merged

(109, 7)


Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue
37,Gipton,53.812326,-1.497997,0,0,7,Fast Food Restaurant
25,Colton,53.793724,-1.438171,0,0,6,Fast Food Restaurant
34,Farnley,53.786978,-1.615634,0,0,2,Fast Food Restaurant
93,Stourton,53.772986,-1.511919,0,0,1,Fast Food Restaurant
90,Seacroft,53.822119,-1.457986,0,0,3,Fast Food Restaurant
42,Halton Moor,53.795393,-1.4823,0,0,3,Fast Food Restaurant
16,Bramley,53.811876,-1.62519,0,0,3,Fast Food Restaurant
56,Knowsthorpe,53.774992,-1.501473,0,0,2,Fast Food Restaurant
96,Swinnow,53.799738,-1.641321,0,1,3,Fast Food Restaurant
13,Belle Isle,53.764502,-1.528332,0,0,2,Fast Food Restaurant
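
One assumption worth noting in the cell above: `kmeans.labels_` follows the row order of `leeds_grouped` (sorted alphabetically by `groupby`), and assigning it directly to `leeds_merged` is only correct if `leeds_df` is in the same order. A minimal sketch of attaching labels by place name instead, on made-up data, with `labels` standing in for `kmeans.labels_`:

```python
import numpy as np
import pandas as pd

# Made-up tables: `places` is deliberately NOT in alphabetical order.
places = pd.DataFrame({"PlaceName": ["Beeston", "Adel", "Armley"],
                       "Latitude": [53.769, 53.848, 53.798]})
grouped = pd.DataFrame({"PlaceName": ["Adel", "Armley", "Beeston"]})

# Stand-in for kmeans.labels_: one label per row of `grouped`.
labels = np.array([2, 0, 1])

# Pair each label with its place name, then merge on the key.
label_table = grouped[["PlaceName"]].assign(**{"Cluster Labels": labels})
merged = places.merge(label_table, on="PlaceName", how="inner")
print(merged)
```

Merging on `PlaceName` keeps each label with the right place regardless of row order, and the inner join also drops places that returned no venues.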


In [18]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, place, cluster in zip(leeds_merged['Latitude'], leeds_merged['Longitude'], leeds_merged['PlaceName'], leeds_merged['Cluster Labels']):
    label = folium.Popup('{} - Cluster {}'.format(place, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
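
A note on the colour scheme: `x` and `ys` are only used for their length, and `rainbow[cluster-1]` maps cluster 0 to the last colour through negative indexing. The colours are still distinct, but a more direct version may be easier to read. This is a sketch; `palette` and `cluster_colour` are illustrative names.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One hex colour per cluster, evenly spaced along the rainbow colormap.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]


def cluster_colour(cluster):
    """Return the hex colour for a cluster label (0 .. kclusters-1)."""
    return palette[int(cluster)]


print(palette)
```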

In [19]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 0]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue
37,Gipton,53.812326,-1.497997,0,0,7,Fast Food Restaurant
25,Colton,53.793724,-1.438171,0,0,6,Fast Food Restaurant
34,Farnley,53.786978,-1.615634,0,0,2,Fast Food Restaurant
93,Stourton,53.772986,-1.511919,0,0,1,Fast Food Restaurant
90,Seacroft,53.822119,-1.457986,0,0,3,Fast Food Restaurant
42,Halton Moor,53.795393,-1.4823,0,0,3,Fast Food Restaurant
16,Bramley,53.811876,-1.62519,0,0,3,Fast Food Restaurant
56,Knowsthorpe,53.774992,-1.501473,0,0,2,Fast Food Restaurant
96,Swinnow,53.799738,-1.641321,0,1,3,Fast Food Restaurant
13,Belle Isle,53.764502,-1.528332,0,0,2,Fast Food Restaurant


In [20]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 1]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue
85,Rothwell,53.749384,-1.478151,1,0,5,Restaurant
45,Headingley,53.821008,-1.577774,1,2,21,Italian Restaurant
78,Oulton,53.74927,-1.45412,1,0,3,Restaurant
79,Potternewton,53.82427,-1.541891,1,1,19,Fast Food Restaurant
38,Gledhow,53.829192,-1.517898,1,1,13,Italian Restaurant
75,Morley,53.744075,-1.59886,1,0,3,Restaurant
70,Miles Hill,53.826168,-1.549337,1,2,19,Italian Restaurant
71,Moor Allerton,53.844229,-1.553848,1,1,9,Paella Restaurant
76,Oakwood,53.824284,-1.49318,1,0,4,Restaurant
33,Far Headingley,53.82899,-1.583399,1,1,14,Italian Restaurant


In [21]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 2]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue
55,Kirkstall,53.815555,-1.600191,2,2,14,Thai Restaurant
82,Rawdon,53.84712,-1.679133,2,1,2,Asian Restaurant
81,Quarry Hill,53.798338,-1.532478,2,4,12,Indian Restaurant
59,Leeds city centre,53.794414,-1.548621,2,3,14,Indian Restaurant
61,Little London,53.808725,-1.540924,2,2,15,Thai Restaurant
62,Lovell Park,53.804422,-1.538352,2,2,14,Thai Restaurant
30,Cross Green,53.789407,-1.517575,2,5,18,Indian Restaurant
14,Blenheim,53.797418,-1.543794,2,3,13,Indian Restaurant
83,Richmond Hill,53.793118,-1.522554,2,2,11,Thai Restaurant
9,Beck Hill,53.789888,-1.684341,2,2,3,Indian Restaurant


In [23]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 3]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue
80,Potterton,53.844777,-1.385929,3,0,0,Vietnamese Restaurant
89,Scott Hall,53.768218,-1.419377,3,0,0,Vietnamese Restaurant
91,Shadwell,53.850898,-1.482831,3,0,2,Asian Restaurant
99,Weardley,53.895913,-1.549757,3,0,1,Chinese Restaurant
87,Scarcroft,53.863675,-1.449724,3,0,0,Vietnamese Restaurant
84,Rodley,53.820888,-1.660059,3,1,5,English Restaurant
104,Wike,53.874242,-1.488179,3,0,0,Vietnamese Restaurant
98,Tinshill,53.84546,-1.618913,3,0,0,Vietnamese Restaurant
97,Thorner,53.857668,-1.423376,3,0,0,Vietnamese Restaurant
95,Swillington,53.767538,-1.422421,3,0,0,Vietnamese Restaurant


In [22]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 4]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue
66,Methley,53.739204,-1.396397,4,0,1,Restaurant
68,Mickletown,53.739446,-1.384764,4,0,1,Restaurant


#### Observations

Some brief observations:
1. There is not enough data on restaurants. Even with the venue search radius increased to 2 km (quite a large allowance for a town centre), some places have only 3 or 4 restaurants. As a result, some clusters contain only 1 or 2 restaurants, so it is hard to derive strong insights.
2. Cluster 0 appears to be fast-food dominant, cluster 1 Italian dominant and cluster 2 Thai/Indian dominant. Cluster 3 appears to be Vietnamese dominant, but because of the ordering convention this actually indicates a lack of restaurants in these places. Cluster 4 has only one restaurant of unspecified cuisine per place.
3. Cluster 2 is of most interest to an Indian restaurant owner and is based centrally on the map. There may therefore be a lot of competition for an Indian restaurant in central Leeds, and perhaps more of a market niche in other areas (particularly cluster 3, where there are few or no restaurants). On the other hand, this could mean that there is only a taste for Indian food in central Leeds.
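
The misleading "Vietnamese" label in observation 2 comes from sorting a row in which every frequency is zero. One possible way to avoid it is to rank only the categories with a positive value and use an explicit placeholder otherwise. The helper name `top_categories` and the label `"No restaurants"` are assumptions made for this sketch.

```python
import pandas as pd


def top_categories(row, n=1, empty_label="No restaurants"):
    """Return the n most frequent categories, padding with `empty_label`."""
    positive = row[row > 0].sort_values(ascending=False, kind="stable")
    names = list(positive.index[:n])
    # Pad when fewer than n categories have a non-zero frequency.
    return names + [empty_label] * (n - len(names))


empty_place = pd.Series({"Indian Restaurant": 0.0, "Vietnamese Restaurant": 0.0})
busy_place = pd.Series({"Indian Restaurant": 0.1, "Thai Restaurant": 0.3,
                        "Vietnamese Restaurant": 0.0})

print(top_categories(empty_place))
print(top_categories(busy_place, n=2))
```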

#### 7. Clustering analysis redone, but including a custom created feature which counts the total number of restaurants

In [24]:
# create new dataframe with the two most common venue categories for each place
num_top_venues = 2

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
areaColumns = ['PlaceName']
freqColumns = []
for ind in np.arange(num_top_venues):
    try:
        freqColumns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no suffix defined beyond 'rd', so fall back to 'th'
        freqColumns.append('{}th Most Common Venue'.format(ind+1))
columns = areaColumns+freqColumns

# create a new dataframe
places_venues_sorted = pd.DataFrame(columns=columns)
places_venues_sorted['PlaceName'] = leeds_grouped2['PlaceName']

for ind in np.arange(leeds_grouped2.shape[0]):
    row_categories = leeds_grouped2.iloc[ind, :].iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    places_venues_sorted.iloc[ind, 1:] = row_categories_sorted.index.values[0:num_top_venues]

# places_venues_sorted.sort_values(freqColumns, inplace=True)
print(places_venues_sorted.shape)
places_venues_sorted

# set number of clusters
kclusters = 5

leeds_grouped_clustering = leeds_grouped2.drop(columns=["PlaceName"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(leeds_grouped_clustering)

# create a new dataframe that includes the cluster label as well as the top venue categories for each place
leeds_merged = leeds_df.copy()
a = leeds_merged['PlaceName'].isin(leeds_grouped2["PlaceName"].tolist())
leeds_merged = leeds_merged[a].reset_index(drop=True)

# add clustering labels
leeds_merged["Cluster Labels"] = kmeans.labels_
leeds_merged = leeds_merged.join(leeds_grouped_sums.set_index("PlaceName"), on="PlaceName")
leeds_merged = leeds_merged.join(places_venues_sorted.set_index("PlaceName"), on="PlaceName")

# sort the results by Cluster Labels
print(leeds_merged.shape)
leeds_merged.sort_values(["Cluster Labels"], inplace=True)
leeds_merged

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, place, cluster in zip(leeds_merged['Latitude'], leeds_merged['Longitude'], leeds_merged['PlaceName'], leeds_merged['Cluster Labels']):
    label = folium.Popup('{} - Cluster {}'.format(place, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
leeds_merged



(109, 3)
(109, 8)


Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue,2nd Most Common Venue
108,Wykebeck,53.826948,-1.489172,0,0,3,Total Restaurants,Restaurant
28,Cranmer Bank,53.846916,-1.556024,0,1,4,Total Restaurants,English Restaurant
96,Swinnow,53.799738,-1.641321,0,1,3,Total Restaurants,Fast Food Restaurant
36,Gildersome,53.756539,-1.625127,0,0,7,Total Restaurants,American Restaurant
37,Gipton,53.812326,-1.497997,0,0,7,Total Restaurants,Fast Food Restaurant
42,Halton Moor,53.795393,-1.4823,0,0,3,Total Restaurants,Fast Food Restaurant
49,Horsforth,53.83894,-1.639964,0,0,3,Total Restaurants,English Restaurant
25,Colton,53.793724,-1.438171,0,0,6,Total Restaurants,Fast Food Restaurant
90,Seacroft,53.822119,-1.457986,0,0,3,Total Restaurants,Fast Food Restaurant
85,Rothwell,53.749384,-1.478151,0,0,5,Total Restaurants,Restaurant
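
In the table above, "1st Most Common Venue" is always "Total Restaurants", because the engineered total is ranked alongside the cuisine columns and is never smaller than any of them. A minimal sketch of ranking the cuisines alone follows, using made-up counts for two hypothetical places; the total column is dropped only for the ranking and can still be kept as a clustering feature.

```python
import pandas as pd

# Made-up counts for two hypothetical places.
counts = pd.DataFrame({
    "PlaceName": ["Place A", "Place B"],
    "Indian Restaurant": [5, 0],
    "Fast Food Restaurant": [2, 4],
    "Total Restaurants": [7, 4],
})

# Rank cuisines only: drop the engineered total before taking the row maximum.
cuisine_counts = counts.drop(columns=["Total Restaurants"]).set_index("PlaceName")
top_cuisine = cuisine_counts.idxmax(axis=1)
print(top_cuisine)
```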


In [25]:
map_clusters

In [26]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 0]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue,2nd Most Common Venue
108,Wykebeck,53.826948,-1.489172,0,0,3,Total Restaurants,Restaurant
28,Cranmer Bank,53.846916,-1.556024,0,1,4,Total Restaurants,English Restaurant
96,Swinnow,53.799738,-1.641321,0,1,3,Total Restaurants,Fast Food Restaurant
36,Gildersome,53.756539,-1.625127,0,0,7,Total Restaurants,American Restaurant
37,Gipton,53.812326,-1.497997,0,0,7,Total Restaurants,Fast Food Restaurant
42,Halton Moor,53.795393,-1.4823,0,0,3,Total Restaurants,Fast Food Restaurant
49,Horsforth,53.83894,-1.639964,0,0,3,Total Restaurants,English Restaurant
25,Colton,53.793724,-1.438171,0,0,6,Total Restaurants,Fast Food Restaurant
90,Seacroft,53.822119,-1.457986,0,0,3,Total Restaurants,Fast Food Restaurant
85,Rothwell,53.749384,-1.478151,0,0,5,Total Restaurants,Restaurant


In [27]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 1]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue,2nd Most Common Venue
70,Miles Hill,53.826168,-1.549337,1,2,19,Total Restaurants,Italian Restaurant
51,Hyde Park,53.812431,-1.567857,1,2,20,Total Restaurants,Thai Restaurant
45,Headingley,53.821008,-1.577774,1,2,21,Total Restaurants,Italian Restaurant
65,Meanwood,53.828168,-1.568247,1,1,19,Total Restaurants,Italian Restaurant
79,Potternewton,53.82427,-1.541891,1,1,19,Total Restaurants,Fast Food Restaurant
18,Burley,53.810831,-1.58392,1,2,19,Total Restaurants,Thai Restaurant
20,Buslingthorpe,53.814866,-1.544179,1,2,19,Total Restaurants,Thai Restaurant
23,Chapeltown,53.816667,-1.531175,1,2,18,Total Restaurants,Thai Restaurant


In [28]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 2]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue,2nd Most Common Venue
30,Cross Green,53.789407,-1.517575,2,5,18,Total Restaurants,Indian Restaurant
19,Burmantofts,53.801803,-1.519672,2,3,13,Total Restaurants,Indian Restaurant
63,Mabgate,53.799786,-1.533669,2,3,11,Total Restaurants,Thai Restaurant
62,Lovell Park,53.804422,-1.538352,2,2,14,Total Restaurants,Thai Restaurant
61,Little London,53.808725,-1.540924,2,2,15,Total Restaurants,Thai Restaurant
105,Woodhouse,53.809926,-1.548821,2,2,16,Total Restaurants,Thai Restaurant
59,Leeds city centre,53.794414,-1.548621,2,3,14,Total Restaurants,Thai Restaurant
55,Kirkstall,53.815555,-1.600191,2,2,14,Total Restaurants,Thai Restaurant
50,Hunslet,53.783439,-1.535932,2,3,16,Total Restaurants,Restaurant
31,East End Park,53.793941,-1.514391,2,4,17,Total Restaurants,Indian Restaurant


In [29]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 3]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue,2nd Most Common Venue
74,Moortown,53.842358,-1.53308,3,2,9,Total Restaurants,Italian Restaurant
12,Beeston Hill,53.776883,-1.550946,3,1,8,Total Restaurants,Fast Food Restaurant
11,Beeston,53.769057,-1.569957,3,0,11,Total Restaurants,Fast Food Restaurant
38,Gledhow,53.829192,-1.517898,3,1,13,Total Restaurants,Fast Food Restaurant
83,Richmond Hill,53.793118,-1.522554,3,2,11,Total Restaurants,Thai Restaurant
101,West Park,53.834574,-1.597194,3,1,9,Total Restaurants,Thai Restaurant
107,Wortley,53.790319,-1.583114,3,1,11,Total Restaurants,Middle Eastern Restaurant
22,Chapel Allerton,53.829035,-1.538322,3,1,13,Total Restaurants,Fast Food Restaurant
71,Moor Allerton,53.844229,-1.553848,3,1,9,Total Restaurants,Restaurant
24,Churwell,53.76131,-1.590844,3,0,8,Total Restaurants,Fast Food Restaurant


In [30]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 4]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue,2nd Most Common Venue
82,Rawdon,53.84712,-1.679133,4,1,2,Total Restaurants,Asian Restaurant
80,Potterton,53.844777,-1.385929,4,0,0,Total Restaurants,Jewish Restaurant
106,Woodlesford,53.757064,-1.443565,4,0,2,Total Restaurants,Restaurant
91,Shadwell,53.850898,-1.482831,4,0,2,Total Restaurants,Asian Restaurant
87,Scarcroft,53.863675,-1.449724,4,0,0,Total Restaurants,Jewish Restaurant
88,Scholes,53.826067,-1.427651,4,0,0,Total Restaurants,Jewish Restaurant
89,Scott Hall,53.768218,-1.419377,4,0,0,Total Restaurants,Jewish Restaurant
93,Stourton,53.772986,-1.511919,4,0,1,Total Restaurants,Fast Food Restaurant
94,Swarcliffe,53.820119,-1.447767,4,0,1,Total Restaurants,Fast Food Restaurant
102,Whinmoor,53.836052,-1.45605,4,0,1,Total Restaurants,Fast Food Restaurant


#### Observations
This analysis incorporates the total number of restaurants by engineering a new feature column and including it in the cluster analysis.
1. Cluster 0 has 3-7 restaurants per place of varying cuisine; cluster 1 has 18-21 per place, mainly Thai/Italian; cluster 2 has 11-18 per place, mainly Thai/Indian; cluster 3 has 8-13 per place of varying cuisine; and cluster 4 appears to be mainly Jewish, which by the ordering convention actually means it has 0-2 restaurants per place.
2. Because the total-restaurants feature has a much larger, unnormalised value range than the individual cuisine counts, it is overweighted in the clustering, so the clusters depend mainly on this new feature.
3. The bulk of the Indian-dominated places lie in clusters 1 and 2. These areas are in central Leeds, which reinforces the conclusions drawn from the first analysis.

In future, I would normalise all features to reduce the overweighting of 'Total Restaurants'.
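
One point to keep in mind when choosing a normalisation is whether it rescales each row (place) or each column (feature). The NumPy sketch below contrasts the two on made-up counts. The row-wise version mirrors what scikit-learn's `Normalizer(norm='l1')` computes per sample, and the column-wise version mirrors per-feature min-max scaling as done by `MinMaxScaler`. The data and variable names are illustrative.

```python
import numpy as np

# Made-up counts: rows are places, columns are two cuisines.
counts = np.array([[2.0, 1.0],
                   [0.0, 5.0],
                   [0.0, 0.0]])
total = counts.sum(axis=1, keepdims=True)
features = np.hstack([counts, total])  # last column plays the role of 'Total Restaurants'

# Row-wise L1 normalisation: each place's values are divided by their sum.
row_sums = np.abs(features).sum(axis=1, keepdims=True)
row_sums[row_sums == 0] = 1.0          # leave all-zero rows unchanged
row_normalised = features / row_sums

# Column-wise min-max scaling: each feature is mapped onto [0, 1].
col_min = features.min(axis=0)
col_range = features.max(axis=0) - col_min
col_range[col_range == 0] = 1.0        # guard against constant columns
col_scaled = (features - col_min) / col_range

print(row_normalised[:, -1])
print(col_scaled[:, -1])
```

Because the total equals the sum of the cuisine counts, row-wise L1 normalisation turns it into 0.5 for every place with at least one restaurant, so it no longer distinguishes busy places from quiet ones. Column-wise scaling keeps that information while putting all features on a comparable range.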

#### 8. Clustering analysis redone, with the features normalised to reduce the overweighting caused by the new 'total restaurants' feature

In [44]:
from sklearn.preprocessing import Normalizer

leeds_grouped2 = leeds_onehot.groupby(["PlaceName"]).sum().reset_index()
leeds_grouped2["Total Restaurants"] = leeds_grouped2.sum(axis=1, numeric_only=True)

# L1-normalise each row (place) so that its feature values sum to 1
feature_cols = leeds_grouped2.columns[1:]
leeds_grouped2[feature_cols] = Normalizer(norm='l1').fit_transform(leeds_grouped2[feature_cols])

# create new dataframe with the two most common venue categories for each place
num_top_venues = 2

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
areaColumns = ['PlaceName']
freqColumns = []
for ind in np.arange(num_top_venues):
    try:
        freqColumns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no suffix defined beyond 'rd', so fall back to 'th'
        freqColumns.append('{}th Most Common Venue'.format(ind+1))
columns = areaColumns+freqColumns

# create a new dataframe
places_venues_sorted = pd.DataFrame(columns=columns)
places_venues_sorted['PlaceName'] = leeds_grouped2['PlaceName']

for ind in np.arange(leeds_grouped2.shape[0]):
    row_categories = leeds_grouped2.iloc[ind, :].iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    places_venues_sorted.iloc[ind, 1:] = row_categories_sorted.index.values[0:num_top_venues]

# places_venues_sorted.sort_values(freqColumns, inplace=True)
print(places_venues_sorted.shape)
places_venues_sorted

# set number of clusters
kclusters = 5

leeds_grouped_clustering = leeds_grouped2.drop(columns=["PlaceName"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(leeds_grouped_clustering)

# create a new dataframe that includes the cluster label as well as the top venue categories for each place
leeds_merged = leeds_df.copy()
a = leeds_merged['PlaceName'].isin(leeds_grouped2["PlaceName"].tolist())
leeds_merged = leeds_merged[a].reset_index(drop=True)

# add clustering labels
leeds_merged["Cluster Labels"] = kmeans.labels_
leeds_merged = leeds_merged.join(leeds_grouped_sums.set_index("PlaceName"), on="PlaceName")
leeds_merged = leeds_merged.join(places_venues_sorted.set_index("PlaceName"), on="PlaceName")

# sort the results by Cluster Labels
print(leeds_merged.shape)
leeds_merged.sort_values(["Cluster Labels"], inplace=True)
leeds_merged

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, place, cluster in zip(leeds_merged['Latitude'], leeds_merged['Longitude'], leeds_merged['PlaceName'], leeds_merged['Cluster Labels']):
    label = folium.Popup('{} - Cluster {}'.format(place, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run 0..kclusters-1, so index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
leeds_merged

(109, 3)
(109, 8)


Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue,2nd Most Common Venue
16,Bramley,53.811876,-1.62519,0,0,3,Total Restaurants,Fast Food Restaurant
40,Guiseley,53.87483,-1.707511,0,0,2,Total Restaurants,Chinese Restaurant
29,Cross Gates,53.807558,-1.454076,0,0,2,Total Restaurants,Portuguese Restaurant
41,Halton,53.798605,-1.462026,0,0,1,Total Restaurants,Fast Food Restaurant
42,Halton Moor,53.795393,-1.4823,0,0,3,Total Restaurants,Fast Food Restaurant
53,Killingbeck,53.808823,-1.478916,0,0,1,Total Restaurants,Fast Food Restaurant
56,Knowsthorpe,53.774992,-1.501473,0,0,2,Total Restaurants,Fast Food Restaurant
77,Osmondthorpe,53.798067,-1.497851,0,0,2,Total Restaurants,Fast Food Restaurant
13,Belle Isle,53.764502,-1.528332,0,0,2,Total Restaurants,Fast Food Restaurant
90,Seacroft,53.822119,-1.457986,0,0,3,Total Restaurants,Fast Food Restaurant


In [50]:
map_clusters

In [51]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 0]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue,2nd Most Common Venue
16,Bramley,53.811876,-1.62519,0,0,3,Total Restaurants,Fast Food Restaurant
40,Guiseley,53.87483,-1.707511,0,0,2,Total Restaurants,Chinese Restaurant
29,Cross Gates,53.807558,-1.454076,0,0,2,Total Restaurants,Portuguese Restaurant
41,Halton,53.798605,-1.462026,0,0,1,Total Restaurants,Fast Food Restaurant
42,Halton Moor,53.795393,-1.4823,0,0,3,Total Restaurants,Fast Food Restaurant
53,Killingbeck,53.808823,-1.478916,0,0,1,Total Restaurants,Fast Food Restaurant
56,Knowsthorpe,53.774992,-1.501473,0,0,2,Total Restaurants,Fast Food Restaurant
77,Osmondthorpe,53.798067,-1.497851,0,0,2,Total Restaurants,Fast Food Restaurant
13,Belle Isle,53.764502,-1.528332,0,0,2,Total Restaurants,Fast Food Restaurant
90,Seacroft,53.822119,-1.457986,0,0,3,Total Restaurants,Fast Food Restaurant


In [46]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 1]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue,2nd Most Common Venue
104,Wike,53.874242,-1.488179,1,0,0,Total Restaurants,Jewish Restaurant
44,Harewood,53.885347,-1.51427,1,0,0,Total Restaurants,Jewish Restaurant
98,Tinshill,53.84546,-1.618913,1,0,0,Total Restaurants,Jewish Restaurant
97,Thorner,53.857668,-1.423376,1,0,0,Total Restaurants,Jewish Restaurant
58,Ledsham,53.759318,-1.310876,1,0,0,Total Restaurants,Jewish Restaurant
67,Micklefield,53.795982,-1.32938,1,0,0,Total Restaurants,Jewish Restaurant
72,Moor Grange,53.837129,-1.607866,1,0,0,Total Restaurants,Jewish Restaurant
73,Moorside,53.775736,-1.621857,1,0,0,Total Restaurants,Jewish Restaurant
80,Potterton,53.844777,-1.385929,1,0,0,Total Restaurants,Jewish Restaurant
39,Great Preston,53.761192,-1.39156,1,0,0,Total Restaurants,Jewish Restaurant


In [47]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 2]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue,2nd Most Common Venue
74,Moortown,53.842358,-1.53308,2,2,9,Total Restaurants,Italian Restaurant
76,Oakwood,53.824284,-1.49318,2,0,4,Total Restaurants,Restaurant
18,Burley,53.810831,-1.58392,2,2,19,Total Restaurants,Thai Restaurant
79,Potternewton,53.82427,-1.541891,2,1,19,Total Restaurants,Fast Food Restaurant
86,Roundhay,53.8402,-1.51002,2,0,6,Total Restaurants,Chinese Restaurant
20,Buslingthorpe,53.814866,-1.544179,2,2,19,Total Restaurants,Thai Restaurant
33,Far Headingley,53.82899,-1.583399,2,1,14,Total Restaurants,Thai Restaurant
17,Bramstan,53.81254,-1.645599,2,2,4,Total Restaurants,Indian Restaurant
81,Quarry Hill,53.798338,-1.532478,2,4,12,Total Restaurants,Indian Restaurant
82,Rawdon,53.84712,-1.679133,2,1,2,Total Restaurants,Asian Restaurant


In [48]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 3]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue,2nd Most Common Venue
52,Ireland Wood,53.850253,-1.608143,3,0,1,Total Restaurants,Italian Restaurant
48,Holt Park,53.858538,-1.600583,3,0,1,Total Restaurants,Italian Restaurant
57,Lawnswood,53.847334,-1.597112,3,0,1,Total Restaurants,Italian Restaurant


In [49]:
leeds_merged.loc[leeds_merged['Cluster Labels'] == 4]

Unnamed: 0,PlaceName,Latitude,Longitude,Cluster Labels,Indian Restaurant,Total Restaurants,1st Most Common Venue,2nd Most Common Venue
3,Alwoodley,53.858879,-1.557492,4,0,3,Total Restaurants,Chinese Restaurant
106,Woodlesford,53.757064,-1.443565,4,0,2,Total Restaurants,Restaurant
21,Carlton (LS19),53.880739,-1.668939,4,1,4,Total Restaurants,Restaurant
68,Mickletown,53.739446,-1.384764,4,0,1,Total Restaurants,Restaurant
85,Rothwell,53.749384,-1.478151,4,0,5,Total Restaurants,Restaurant
78,Oulton,53.74927,-1.45412,4,0,3,Total Restaurants,Restaurant
75,Morley,53.744075,-1.59886,4,0,3,Total Restaurants,Restaurant
66,Methley,53.739204,-1.396397,4,0,1,Total Restaurants,Restaurant
26,Cookridge,53.854422,-1.618655,4,0,2,Total Restaurants,Restaurant
108,Wykebeck,53.826948,-1.489172,4,0,3,Total Restaurants,Restaurant
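
The per-cluster tables above can be condensed into a single summary, which makes the clusters easier to compare. The following is a minimal sketch, assuming only pandas; it uses a handful of rows copied from the tables above rather than the full `leeds_merged` dataframe, and the names `sample` and `cluster_summary` are illustrative and not part of the original notebook.

```python
import pandas as pd

# A few rows copied from the cluster tables above (not the full 109 places).
sample = pd.DataFrame(
    {
        "PlaceName": ["Bramley", "Halton", "Wike", "Harewood",
                      "Moortown", "Quarry Hill", "Holt Park", "Rothwell"],
        "Cluster Labels": [0, 0, 1, 1, 2, 2, 3, 4],
        "Indian Restaurant": [0, 0, 0, 0, 2, 4, 0, 0],
        "Total Restaurants": [3, 1, 0, 0, 9, 12, 1, 5],
    }
)

# One row per cluster: number of places and average restaurant counts.
cluster_summary = sample.groupby("Cluster Labels").agg(
    places=("PlaceName", "count"),
    mean_indian=("Indian Restaurant", "mean"),
    mean_total=("Total Restaurants", "mean"),
)

print(cluster_summary)
```

Applied to the real data, the same `groupby` would be run on `leeds_merged` instead of `sample`.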


#### Observations
In this final analysis, feature normalisation using scikit-learn has been performed.
1. Cluster 0 has 1-3 restaurants per place, primarily fast food. Cluster 1 has no restaurants in any place. Cluster 2 has a variety of cuisines and is hard to characterise as a cluster. Cluster 3 has only one Italian restaurant per place. Cluster 4 is dominated by the unspecified 'Restaurant' type.
2. These clusters don't give much useful insight into Indian restaurants as such.
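
As an illustration of the feature normalisation mentioned above, the sketch below rescales two count columns to the 0-1 range before running k-means. It assumes `MinMaxScaler` as the scaler, which may not be the one used in this notebook, and it uses a few rows copied from the tables above with a hypothetical `n_clusters=2` rather than the full feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# A few rows copied from the cluster tables above (illustrative subset only).
features = pd.DataFrame(
    {
        "Indian Restaurant": [0, 0, 0, 0, 2, 4, 0, 0],
        "Total Restaurants": [3, 1, 0, 0, 9, 12, 1, 5],
    }
)

# Assumption: min-max scaling, so each column ends up in the range [0, 1]
# and the larger "Total Restaurants" counts do not dominate the distances.
scaled = MinMaxScaler().fit_transform(features)

# Cluster the scaled features; n_clusters=2 is chosen only to suit this tiny sample.
kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

print(np.round(scaled, 2))
print(kmeans_demo.labels_)
```

Without scaling, the column with the widest range contributes most to the Euclidean distances that k-means minimises.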

If more restaurants existed, clustering could be more insightful. What is clear is that there is a lack of restaurants in many areas of outer Leeds, or perhaps just a low number of restaurants registering their location close to the town centre. Perhaps there is indeed a market niche in outer Leeds town centres!