# A Tale of Two Cities
## Clustering the Neighbourhoods of New York and London: Comparison

## 1. Introduction

The current world allows us to travel and move between the cities, countries and even continents. The world became more global than ever and the distance does not matter that much anymore. Having in mind the freedom of movement I have decided to look into two cities, which are as similar as different. Global, almost equally large, important and iconic cities. In the same time these two cities represent very different culture and history. The cities which I have chosen are:

- New York
- London

My aim is to help understand the venues landscape in each of the cities.


## 2. Business Problem

The aim is to help tourists choose their destinations depending on the experiences that the neighbourhoods have to offer and what they would want to have. This also helps people make decisions if they are thinking about migrating to London or New York or even if they want to relocate neighbourhoods within the city. The project findings will help stakeholders make informed decisions and address any concerns they have including the different kinds of cuisines, activity spots, grocery stores and what the city has to offer.

## 3. Data

We require geographical location data for both London and Paris. Postal codes in each city serve as a starting point. Using Postal codes we use can find out the neighbourhoods, boroughs, venues and their most popular venue categories.

### New York Data

Data Source I (json file): https://cocl.us/new_york_dataset

Geo-spacial data of the New York to get a better understanding of the neighbourhoods in it and their corresponding locations in the Folium map would make certain things clear for the Project. This will be achieved using the acquired data and visualize the same using Choropleth maps.
Data Source II (json file):
https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json



### London Data

To derive our solution, We scrape our data from: https://en.wikipedia.org/wiki/List_of_areas_of_London.
This Wikipedia page has information about all the neighbourhoods, we limit it to London only.

- a) _borough_: Name of Neighbourhood
- b) _town_: Name of borough
- c) _post_code_: Postal codes for London

This Wikipedia page lacks information about the geographical locations. To solve this problem we use ArcGIS API

### ArcGIS API
ArcGIS Online enables you to connect people, locations, and data using interactive maps. Work with smart, data-driven styles and intuitive analysis tools that deliver location intelligence. Share your insights with the world or specific groups.
More specifically, we use ArcGIS to get the geo locations of the neighbourhoods of London. The following columns are added to our initial dataset which prepares our data.
- a) _latitude_: Latitude for Neighbourhood
- b) _longitude_: Longitude for Neighbourhood


### Foursquare API Data

The data is going to be collected/acquired from the Foursquare API about the various venues in each neighbourhood of New York city. Foursquare is a location data provider with information about all manner of venues and events within an area of interest. Such information includes venue names, locations, menus and even photos. As such, the foursquare location platform will be used as the sole data source since all the stated required information can be obtained through the API.
After finding the list of neighbourhoods, we then connect to the Foursquare API to gather information about venues inside each and every neighbourhood. For each neighbourhood, we have chosen the radius to be 500 meters.

The data retrieved from Foursquare contained information of venues within a specified distance of the longitude and latitude of the postcodes. The information obtained per venue as follows:

1. *Neighbourhood* : Name of the Neighbourhood
2. *Neighbourhood Latitude* : Latitude of the Neighbourhood
3. *Neighbourhood Longitude* : Longitude of the Neighbourhood
4. *Venue* : Name of the Venue
5. *Venue Latitude* : Latitude of Venue
6. *Venue Longitude* : Longitude of Venue
7. *Venue Category* : Category of Venue


Based on all the information collected for both London and New York city, we have sufficient data to build our model. We cluster the neighbourhoods together based on similar venue categories. We then present our observations and findings. Using this data to our stakeholders, so they can take the necessary decision.

### Importing Libraries

In [141]:
# Library to work on dataframe:
import pandas as pd

# Python HTTP library:
import requests

# Library to work with arrays:
import numpy as np

# Visualisation Library:
import matplotlib.cm as cm
import matplotlib.colors as colors

# Library to work on the map:
import folium

# Importing k-means for the clustering stage:
from sklearn.cluster import KMeans

# Python package which can be used to work with JSON data:
import json

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

In [10]:
# Might be required to istall wget to download data
# pip install wget

Note: you may need to restart the kernel to use updated packages.


## Data Collection

### New York

Neighborhood has a total of 5 boroughs and 306 neighborhoods. In order to segement the neighborhoods and explore them, we will essentially need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough as well as the the latitude and logitude coordinates of each neighborhood. 

In [142]:
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
print('Data downloaded!')

Data downloaded!


'wget' is not recognized as an internal or external command,
operable program or batch file.


#### Load and explore the data

In [144]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

All the relevant data is in the _features_ key format, which is basically a list of the neighborhoods.

In [145]:
# define a new variable that includes this data
neighborhoods_data = newyork_data['features']

# Let's take a look at the first item in this list.
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

### Tranform the data into a pandas dataframe: New York

In [146]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


In [147]:
for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)
    
# check if data is uploaded correctly:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [148]:
neighborhoods.shape

(306, 4)

In [149]:
# And make sure that the dataset has all 5 boroughs and 306 neighborhoods.
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


#### Use geopy library to get the latitude and longitude values of New York City.

In [150]:
# In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent ny_explorer, as shown below.

address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 40.7127281, -74.0060152.


In [151]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [91]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [152]:
# Let's get the geographical coordinates of Manhattan.

address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Manhattan are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Manhattan are 40.7896239, -73.9598939.


In [153]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

#### Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

In [154]:
CLIENT_ID = 'SOGAXFXQW3EJN3PJCCXQI25KMRYZ0TQYFD2SAFSRCOYMJCOB' # your Foursquare ID
CLIENT_SECRET = 'MSOC40F5JAK2R4WVVNTTDCKCD1IZYIVKEC3CZAAHR0BMRP2J' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: SOGAXFXQW3EJN3PJCCXQI25KMRYZ0TQYFD2SAFSRCOYMJCOB
CLIENT_SECRET:MSOC40F5JAK2R4WVVNTTDCKCD1IZYIVKEC3CZAAHR0BMRP2J


#### Let's explore the first neighborhood in our dataframe.
Get the neighborhood's name.

In [155]:
manhattan_data.loc[0, 'Neighborhood']

'Marble Hill'

#### Get the neighborhood's latitude and longitude values.

In [156]:
neighborhood_latitude = manhattan_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = manhattan_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = manhattan_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Marble Hill are 40.87655077879964, -73.91065965862981.


#### Now, let's get the top 100 venues that are in Marble Hill within a radius of 500 meters.

In [157]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    neighborhood_latitude,
    neighborhood_longitude,
    radius,
    LIMIT)

url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=SOGAXFXQW3EJN3PJCCXQI25KMRYZ0TQYFD2SAFSRCOYMJCOB&client_secret=MSOC40F5JAK2R4WVVNTTDCKCD1IZYIVKEC3CZAAHR0BMRP2J&v=20180605&ll=40.87655077879964,-73.91065965862981&radius=500&limit=100'

Send the GET request and examine the resutls

In [158]:
results = requests.get(url).json()
# results

From the Foursquare lab in the previous module, we know that all the information is in the _items_ key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.


In [159]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a _pandas_ dataframe.


In [160]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

# nearby_venues[['categories']]

# nearby_venues.categories.unique()


  nearby_venues = json_normalize(venues) # flatten JSON


Unnamed: 0,name,categories,lat,lng
0,Arturo's,Pizza Place,40.874412,-73.910271
1,Bikram Yoga,Yoga Studio,40.876844,-73.906204
2,Tibbett Diner,Diner,40.880404,-73.908937
3,Starbucks,Coffee Shop,40.877531,-73.905582
4,Dunkin',Donut Shop,40.877136,-73.906666


And how many venues were returned by Foursquare?

In [161]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

25 venues were returned by Foursquare.


## 2. Explore Neighborhoods in Manhattan

#### Let's create a function to repeat the same process to all the neighborhoods in Manhattan

In [162]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

Now write the code to run the above function on each neighborhood and create a new dataframe called _manhattan_venues_.

In [163]:
# type your answer here
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


### Let's check the size of the resulting dataframe

In [165]:
print(manhattan_venues.shape)
manhattan_venues.head()

(3172, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop
4,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop


Let's check how many venues were returned for each neighborhood

In [166]:
manhattan_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Battery Park City,76,76,76,76,76,76
Carnegie Hill,86,86,86,86,86,86
Central Harlem,44,44,44,44,44,44
Chelsea,100,100,100,100,100,100
Chinatown,100,100,100,100,100,100
Civic Center,100,100,100,100,100,100
Clinton,100,100,100,100,100,100
East Harlem,37,37,37,37,37,37
East Village,100,100,100,100,100,100
Financial District,100,100,100,100,100,100


#### Let's find out how many unique categories can be curated from all the returned venues

In [167]:
print('There are {} uniques categories.'.format(len(manhattan_venues['Venue Category'].unique())))

There are 333 uniques categories.


## 3. Analyze Each Neighborhood

In [168]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [169]:
manhattan_onehot.shape

(3172, 334)

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [170]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Battery Park City,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.013158,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.0,0.011628,0.0,0.0,0.011628,0.0,...,0.0,0.011628,0.0,0.0,0.0,0.011628,0.034884,0.0,0.011628,0.034884
2,Central Harlem,0.0,0.0,0.0,0.068182,0.045455,0.0,0.0,0.0,0.045455,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Chelsea,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.01
4,Chinatown,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Civic Center,0.0,0.0,0.0,0.0,0.03,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.01,0.0,0.03
6,Clinton,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.03,0.0,0.0,0.0
7,East Harlem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,East Village,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.01,...,0.0,0.02,0.0,0.0,0.0,0.03,0.01,0.0,0.0,0.0
9,Financial District,0.01,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0


#### Let's confirm the new size

In [171]:
manhattan_grouped.shape

(40, 334)

#### Let's print each neighborhood along with the top 5 most common venues

In [172]:
num_top_venues = 5

for hood in manhattan_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = manhattan_grouped[manhattan_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Battery Park City----
            venue  freq
0           Hotel  0.07
1     Coffee Shop  0.07
2             Gym  0.05
3  Clothing Store  0.05
4            Park  0.05


----Carnegie Hill----
         venue  freq
0  Coffee Shop  0.08
1         Café  0.05
2  Yoga Studio  0.03
3    Wine Shop  0.03
4  Pizza Place  0.03


----Central Harlem----
                  venue  freq
0    African Restaurant  0.07
1                   Bar  0.05
2    Seafood Restaurant  0.05
3  Gym / Fitness Center  0.05
4   American Restaurant  0.05


----Chelsea----
                 venue  freq
0          Coffee Shop  0.06
1               Bakery  0.05
2  American Restaurant  0.04
3          Art Gallery  0.04
4                Hotel  0.03


----Chinatown----
                 venue  freq
0               Bakery  0.08
1   Chinese Restaurant  0.07
2         Dessert Shop  0.04
3  American Restaurant  0.04
4    Hotpot Restaurant  0.04


----Civic Center----
                  venue  freq
0           Coffee Shop  0.08
1     

#### Let's put that into a _pandas_ dataframe

First, let's write a function to sort the venues in descending order.

In [173]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [174]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,Hotel,Coffee Shop,Park,Gym,Clothing Store,Memorial Site,Gourmet Shop,Sandwich Place,Boat or Ferry,Food Court
1,Carnegie Hill,Coffee Shop,Café,Yoga Studio,Pizza Place,Wine Shop,Gym,Bookstore,Cosmetics Shop,French Restaurant,Japanese Restaurant
2,Central Harlem,African Restaurant,Chinese Restaurant,Art Gallery,Gym / Fitness Center,Seafood Restaurant,Bar,American Restaurant,French Restaurant,Fried Chicken Joint,Bookstore
3,Chelsea,Coffee Shop,Bakery,American Restaurant,Art Gallery,Nightclub,Hotel,Seafood Restaurant,French Restaurant,Park,Market
4,Chinatown,Bakery,Chinese Restaurant,Hotpot Restaurant,American Restaurant,Cocktail Bar,Dessert Shop,Optical Shop,Salon / Barbershop,Spa,Noodle House


## 4. Cluster Neighborhoods

Run _k_-means to cluster the neighborhood into 5 clusters.

In [175]:
# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 4, 0, 0, 0, 1, 1, 3, 0, 1])

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [176]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged = manhattan_data

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Manhattan,Marble Hill,40.876551,-73.91066,1,Discount Store,Coffee Shop,American Restaurant,Sandwich Place,Gym,Supplement Shop,Steakhouse,Shopping Mall,Seafood Restaurant,Yoga Studio
1,Manhattan,Chinatown,40.715618,-73.994279,0,Bakery,Chinese Restaurant,Hotpot Restaurant,American Restaurant,Cocktail Bar,Dessert Shop,Optical Shop,Salon / Barbershop,Spa,Noodle House
2,Manhattan,Washington Heights,40.851903,-73.9369,0,Café,Bakery,Mobile Phone Shop,Deli / Bodega,Bank,Pizza Place,Grocery Store,Tapas Restaurant,New American Restaurant,Park
3,Manhattan,Inwood,40.867684,-73.92121,3,Café,Mexican Restaurant,Restaurant,Lounge,Park,Bakery,Caribbean Restaurant,Chinese Restaurant,Pizza Place,Wine Bar
4,Manhattan,Hamilton Heights,40.823604,-73.949688,3,Pizza Place,Coffee Shop,Deli / Bodega,Café,Mexican Restaurant,Yoga Studio,Latin American Restaurant,Liquor Store,Park,Cocktail Bar


Finally, let's visualize the resulting clusters

In [177]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters
Now, each cluster can be examined and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can then assign a name to each cluster.

### Cluster 1 - Leisure Time (Cafe & Restaurants)

In [178]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Chinatown,Bakery,Chinese Restaurant,Hotpot Restaurant,American Restaurant,Cocktail Bar,Dessert Shop,Optical Shop,Salon / Barbershop,Spa,Noodle House
2,Washington Heights,Café,Bakery,Mobile Phone Shop,Deli / Bodega,Bank,Pizza Place,Grocery Store,Tapas Restaurant,New American Restaurant,Park
6,Central Harlem,African Restaurant,Chinese Restaurant,Art Gallery,Gym / Fitness Center,Seafood Restaurant,Bar,American Restaurant,French Restaurant,Fried Chicken Joint,Bookstore
12,Upper West Side,Café,Italian Restaurant,Bakery,Bar,Indian Restaurant,Coffee Shop,Mediterranean Restaurant,Wine Bar,Bagel Shop,Seafood Restaurant
13,Lincoln Square,Plaza,Café,Gym / Fitness Center,Concert Hall,Performing Arts Venue,Theater,Indie Movie Theater,Italian Restaurant,Bakery,Wine Shop
17,Chelsea,Coffee Shop,Bakery,American Restaurant,Art Gallery,Nightclub,Hotel,Seafood Restaurant,French Restaurant,Park,Market
19,East Village,Bar,Pizza Place,Italian Restaurant,Salon / Barbershop,Mexican Restaurant,Cocktail Bar,Coffee Shop,Korean Restaurant,Wine Bar,Vietnamese Restaurant
20,Lower East Side,Chinese Restaurant,Pizza Place,Bakery,Café,Art Gallery,Latin American Restaurant,Yoga Studio,Filipino Restaurant,Speakeasy,Mediterranean Restaurant
22,Little Italy,Bakery,Chinese Restaurant,Café,Bubble Tea Shop,Cocktail Bar,Coffee Shop,Hotel,Mediterranean Restaurant,Sandwich Place,Jewelry Store
25,Manhattan Valley,Mexican Restaurant,Pizza Place,Bar,Indian Restaurant,Coffee Shop,Furniture / Home Store,Peruvian Restaurant,Malay Restaurant,Korean Restaurant,Ice Cream Shop


### Cluster 2 - Tourists and Wellbeing Areas

In [179]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 1, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Marble Hill,Discount Store,Coffee Shop,American Restaurant,Sandwich Place,Gym,Supplement Shop,Steakhouse,Shopping Mall,Seafood Restaurant,Yoga Studio
14,Clinton,Italian Restaurant,Gym / Fitness Center,Theater,American Restaurant,Sandwich Place,Coffee Shop,Wine Shop,Cocktail Bar,Hotel,Gym
15,Midtown,Hotel,Theater,Sporting Goods Shop,Coffee Shop,Clothing Store,Bakery,American Restaurant,Sandwich Place,Bookstore,Steakhouse
16,Murray Hill,Coffee Shop,Hotel,Japanese Restaurant,Sandwich Place,Bar,Burger Joint,Gym / Fitness Center,Grocery Store,Steakhouse,Pub
28,Battery Park City,Hotel,Coffee Shop,Park,Gym,Clothing Store,Memorial Site,Gourmet Shop,Sandwich Place,Boat or Ferry,Food Court
29,Financial District,Coffee Shop,Italian Restaurant,Bar,Cocktail Bar,Pizza Place,Café,Park,Salad Place,American Restaurant,Monument / Landmark
32,Civic Center,Coffee Shop,Spa,Cocktail Bar,Gym / Fitness Center,Hotel,Yoga Studio,American Restaurant,French Restaurant,Park,Italian Restaurant
33,Midtown South,Korean Restaurant,Hotel,Japanese Restaurant,Hotel Bar,Coffee Shop,Dessert Shop,Gym / Fitness Center,Salad Place,Café,Cosmetics Shop
35,Turtle Bay,Italian Restaurant,Coffee Shop,Sushi Restaurant,Hotel,Park,Ramen Restaurant,Seafood Restaurant,Japanese Restaurant,Deli / Bodega,Steakhouse
39,Hudson Yards,Gym / Fitness Center,Italian Restaurant,American Restaurant,Hotel,Café,Bar,Coffee Shop,Park,Dog Run,Gym


### Cluster 3 - Park Cluster

In [180]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 2, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,Stuyvesant Town,Park,Bar,Coffee Shop,Boat or Ferry,Pet Service,German Restaurant,Cocktail Bar,Fountain,Harbor / Marina,Heliport


### Cluster 4 - Diverse Areas

In [181]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 3, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Inwood,Café,Mexican Restaurant,Restaurant,Lounge,Park,Bakery,Caribbean Restaurant,Chinese Restaurant,Pizza Place,Wine Bar
4,Hamilton Heights,Pizza Place,Coffee Shop,Deli / Bodega,Café,Mexican Restaurant,Yoga Studio,Latin American Restaurant,Liquor Store,Park,Cocktail Bar
5,Manhattanville,Coffee Shop,Seafood Restaurant,Italian Restaurant,Deli / Bodega,Sushi Restaurant,Mexican Restaurant,Chinese Restaurant,Lounge,Boutique,Supermarket
7,East Harlem,Mexican Restaurant,Bakery,Thai Restaurant,Latin American Restaurant,Sandwich Place,Deli / Bodega,Gym,Grocery Store,Café,Liquor Store
11,Roosevelt Island,Park,Dry Cleaner,Kosher Restaurant,Gym / Fitness Center,Gym,Coffee Shop,Liquor Store,Sandwich Place,Outdoors & Recreation,Greek Restaurant
26,Morningside Heights,Park,American Restaurant,Bookstore,Coffee Shop,Burger Joint,Deli / Bodega,Café,New American Restaurant,Liquor Store,Mediterranean Restaurant
36,Tudor City,Park,Mexican Restaurant,Café,Diner,Deli / Bodega,Gym,Coffee Shop,Greek Restaurant,Pizza Place,Sushi Restaurant


### Cluster 5 - Italian Restaurant Cluster

In [182]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 4, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Upper East Side,Exhibit,Italian Restaurant,Coffee Shop,Bakery,Gym / Fitness Center,American Restaurant,Spa,French Restaurant,Hotel,Juice Bar
9,Yorkville,Italian Restaurant,Gym,Coffee Shop,Sushi Restaurant,Bar,Wine Shop,Deli / Bodega,Japanese Restaurant,Mexican Restaurant,Ice Cream Shop
10,Lenox Hill,Italian Restaurant,Pizza Place,Coffee Shop,Cocktail Bar,Sushi Restaurant,Gym / Fitness Center,Café,Gym,Burger Joint,Thai Restaurant
18,Greenwich Village,Italian Restaurant,Clothing Store,Sushi Restaurant,American Restaurant,Bakery,Indian Restaurant,Gym,Boutique,French Restaurant,Café
21,Tribeca,Park,Italian Restaurant,American Restaurant,Café,Coffee Shop,Spa,Wine Bar,Playground,Poke Place,Hotel
23,Soho,Clothing Store,Italian Restaurant,Coffee Shop,Mediterranean Restaurant,Boutique,Bakery,Women's Store,Café,Pizza Place,Cocktail Bar
24,West Village,Italian Restaurant,Cocktail Bar,New American Restaurant,Park,American Restaurant,Ice Cream Shop,Wine Bar,Theater,Coffee Shop,French Restaurant
27,Gramercy,Bar,Italian Restaurant,Bagel Shop,Pizza Place,Mexican Restaurant,American Restaurant,Wine Shop,Playground,Coffee Shop,Thrift / Vintage Store
30,Carnegie Hill,Coffee Shop,Café,Yoga Studio,Pizza Place,Wine Shop,Gym,Bookstore,Cosmetics Shop,French Restaurant,Japanese Restaurant
31,Noho,Italian Restaurant,Art Gallery,Pizza Place,Coffee Shop,Mexican Restaurant,French Restaurant,Sushi Restaurant,Grocery Store,Café,Bookstore


# London

Let's start with data (download and processing)

In [183]:
url_london = "https://en.wikipedia.org/wiki/List_of_areas_of_London"
wiki_london_url = requests.get(url_london)

wiki_london_data = pd.read_html(wiki_london_url.text)

wiki_london_data = wiki_london_data[1]
wiki_london_data

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,020,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",020,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,020,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,020,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",020,TQ478728
...,...,...,...,...,...,...
526,Woolwich,Greenwich,LONDON,SE18,020,TQ435795
527,Worcester Park,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4,020,TQ225655
528,Wormwood Scrubs,Hammersmith and Fulham,LONDON,W12,020,TQ225815
529,Yeading,Hillingdon,HAYES,UB4,020,TQ115825


## Data Preprocessing
the spaces in the column titles are removed and then we add _ between words.

In [184]:
wiki_london_data.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
wiki_london_data

Unnamed: 0,Location,London borough,Post_town,Postcode district,Dial code,OS_grid_ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,020,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",020,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,020,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,020,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",020,TQ478728
...,...,...,...,...,...,...
526,Woolwich,Greenwich,LONDON,SE18,020,TQ435795
527,Worcester Park,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4,020,TQ225655
528,Wormwood Scrubs,Hammersmith and Fulham,LONDON,W12,020,TQ225815
529,Yeading,Hillingdon,HAYES,UB4,020,TQ115825


We see that few columns have no '_' between the words despite applying our function meaning that there are special characters

## Feature Selection

We need only the boroughs, Postal codes, Post town for further steps. We can drop the locations, dial codes and OS grid.

In [185]:
df1 = wiki_london_data.drop( [ wiki_london_data.columns[0], wiki_london_data.columns[4], wiki_london_data.columns[5] ], axis=1)
df1.head()

Unnamed: 0,London borough,Post_town,Postcode district
0,"Bexley, Greenwich [7]",LONDON,SE2
1,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Croydon[8],CROYDON,CR0
3,Croydon[8],CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"


let's rename the Postcode district column and the london borough to something simpler

In [186]:
df1.columns = ['borough','town','post_code']
df1

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich [7]",LONDON,SE2
1,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Croydon[8],CROYDON,CR0
3,Croydon[8],CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
...,...,...,...
526,Greenwich,LONDON,SE18
527,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4
528,Hammersmith and Fulham,LONDON,W12
529,Hillingdon,HAYES,UB4


Let's remove the Square brackets [ ] and numbers from the borough column

In [187]:
df1['borough'] = df1['borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))
df1

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
2,Croydon,CROYDON,CR0
3,Croydon,CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
...,...,...,...
526,Greenwich,LONDON,SE18
527,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4
528,Hammersmith and Fulham,LONDON,W12
529,Hillingdon,HAYES,UB4


Take the dimension of the dataframe

In [188]:
df1.shape

(531, 3)

We currently have 533 records and 3 columns of our data. It's time to perform Feature Engineering

## Feature Engineering
We can only focusing on the neighbourhoods of London, so performing the changes

In [189]:
df1 = df1[df1['town'].str.contains('LONDON')]
df1

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
6,City,LONDON,EC3
7,Westminster,LONDON,WC2
9,Bromley,LONDON,SE20
...,...,...,...
521,Redbridge,LONDON,"IG8, E18"
522,"Redbridge, Waltham Forest","LONDON, WOODFORD GREEN",IG8
525,Barnet,LONDON,N12
526,Greenwich,LONDON,SE18


In [190]:
df1.shape

(308, 3)

We now have only 310 rows. We can proceed with our further steps. Getting some descriptive statistics

In [191]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 308 entries, 0 to 528
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   borough    308 non-null    object
 1   town       308 non-null    object
 2   post_code  308 non-null    object
dtypes: object(3)
memory usage: 9.6+ KB


## Geolocations of the London Neighbourhoods

### ArcGis API

We need to get the geographical co-ordinates for the neighbourhoods to plot out map. We will use the arcgis package to do so. 

Arcgis doesn't have a limitation on the number of API calls made so it fits our use case perfectly.

In [130]:
# To install if needed:
# pip install arcgis

In [192]:
from arcgis.geocoding import geocode
from arcgis.gis import GIS
gis = GIS()

Defining London arcgis geocode function to return latitude and longitude

In [193]:
def get_x_y_uk(address1):
   lat_coords = 0
   lng_coords = 0
   g = geocode(address='{}, London, England, GBR'.format(address1))[0]
   lng_coords = g['location']['x']
   lat_coords = g['location']['y']
   return str(lat_coords) +","+ str(lng_coords)

In [194]:
c = get_x_y_uk('SE2')
c

'51.492450000000076,0.12127000000003818'

Looks good, We Copy over the postal codes of london to pass it into the geolocator function that we just defined above

In [195]:
geo_coordinates_uk = df1['post_code']    
geo_coordinates_uk

0           SE2
1        W3, W4
6           EC3
7           WC2
9          SE20
         ...   
521    IG8, E18
522         IG8
525         N12
526        SE18
528         W12
Name: post_code, Length: 308, dtype: object

Passing postal codes of london to get the geographical co-ordinates

In [196]:
coordinates_latlng_uk = geo_coordinates_uk.apply(lambda x: get_x_y_uk(x))
coordinates_latlng_uk

0       51.492450000000076,0.12127000000003818
1        51.51324000000005,-0.2674599999999714
6       51.51200000000006,-0.08057999999994081
7       51.51651000000004,-0.11967999999995982
9       51.41009000000008,-0.05682999999993399
                        ...                   
521    51.589770000000044,0.030520000000024083
522      51.50642000000005,-0.1272099999999341
525     51.615920000000074,-0.1767399999999384
526      51.48207000000008,0.07143000000002075
528      51.50645000000003,-0.2369099999999662
Name: post_code, Length: 308, dtype: object

### Latitude

Extracting the latitude from our previously collected coordinates

In [197]:
lat_uk = coordinates_latlng_uk.apply(lambda x: x.split(',')[0])
lat_uk

0      51.492450000000076
1       51.51324000000005
6       51.51200000000006
7       51.51651000000004
9       51.41009000000008
              ...        
521    51.589770000000044
522     51.50642000000005
525    51.615920000000074
526     51.48207000000008
528     51.50645000000003
Name: post_code, Length: 308, dtype: object

### Longitude

Extracting the Longitude from our previously collected coordinates

In [198]:
lng_uk = coordinates_latlng_uk.apply(lambda x: x.split(',')[1])
lng_uk

0       0.12127000000003818
1       -0.2674599999999714
6      -0.08057999999994081
7      -0.11967999999995982
9      -0.05682999999993399
               ...         
521    0.030520000000024083
522     -0.1272099999999341
525     -0.1767399999999384
526     0.07143000000002075
528     -0.2369099999999662
Name: post_code, Length: 308, dtype: object

We now have the geographical co-ordinates of the London Neighbourhoods.

We proceed with Merging our source data with the geographical co-ordinates to make our dataset ready for the next stage

In [199]:
london_merged = pd.concat([df1,lat_uk.astype(float), lng_uk.astype(float)], axis=1)
london_merged.columns= ['borough','town','post_code','latitude','longitude']
london_merged

Unnamed: 0,borough,town,post_code,latitude,longitude
0,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",51.51324,-0.26746
6,City,LONDON,EC3,51.51200,-0.08058
7,Westminster,LONDON,WC2,51.51651,-0.11968
9,Bromley,LONDON,SE20,51.41009,-0.05683
...,...,...,...,...,...
521,Redbridge,LONDON,"IG8, E18",51.58977,0.03052
522,"Redbridge, Waltham Forest","LONDON, WOODFORD GREEN",IG8,51.50642,-0.12721
525,Barnet,LONDON,N12,51.61592,-0.17674
526,Greenwich,LONDON,SE18,51.48207,0.07143


In [200]:
london_merged.dtypes

borough       object
town          object
post_code     object
latitude     float64
longitude    float64
dtype: object

### Co-ordinates for London

Getting the geocode for London to help visualize it on the map

In [201]:
london = geocode(address='London, England, GBR')[0]
london_lng_coords = london['location']['x']
london_lat_coords = london['location']['y']
london_lng_coords

-0.1272099999999341

In [202]:
london_lat_coords

51.50642000000005

### Visualize the Map of London

To help visualize the Map of London and the neighbourhoods in London, we make use of the folium package.

In [203]:
# Creating the map of London
map_London = folium.Map(location=[london_lat_coords, london_lng_coords], zoom_start=12)
map_London

# adding markers to map
for latitude, longitude, borough, town in zip(london_merged['latitude'], london_merged['longitude'], london_merged['borough'], london_merged['town']):
    label = '{}, {}'.format(town, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='red',
        fill=True
        ).add_to(map_London)  
    
map_London

## Venues in London

To proceed with the next part, we need to define Foursquare API credentials.

Using Foursquare API, we are able to get the venue and venue categories around each neighbourhood in London.

In [204]:
CLIENT_ID = 'SOGAXFXQW3EJN3PJCCXQI25KMRYZ0TQYFD2SAFSRCOYMJCOB' # your Foursquare ID
CLIENT_SECRET = 'MSOC40F5JAK2R4WVVNTTDCKCD1IZYIVKEC3CZAAHR0BMRP2J' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

Defining a function to get the neraby venues in the neighbourhood. This will help us get venue categories which is important for our analysis

In [205]:
LIMIT=100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)

Getting the venues in London

In [206]:
venues_in_London = getNearbyVenues(london_merged['borough'], london_merged['latitude'], london_merged['longitude'])

Bexley, Greenwich 
Ealing, Hammersmith and Fulham
City
Westminster
Bromley
Islington
Islington
Barnet
Enfield
Wandsworth
Southwark
City
Richmond upon Thames
Barnet
Islington
Wandsworth
Westminster
Bromley
Newham
Ealing
Westminster
Lewisham
Camden
Southwark
Tower Hamlets
Bexley
City
Lewisham
Greenwich
Tower Hamlets
Camden
Haringey
Tower Hamlets
Haringey
Barnet
Brent
Lambeth
Lewisham
Tower Hamlets
Kensington and Chelsea, Hammersmith and Fulham
Brent
Barnet
Barnet
Southwark
Tower Hamlets
Camden
Tower Hamlets
Waltham Forest
Newham
Islington
Richmond upon Thames
Lewisham
Camden
Westminster
Greenwich
Kensington and Chelsea
Barnet
Westminster
Lewisham
Waltham Forest
Hounslow, Ealing, Hammersmith and Fulham
Brent
Barnet
Lambeth, Wandsworth
Islington
Barnet
Merton
Barnet
Westminster
Barnet, Brent, Camden
Lewisham
Bexley
Haringey
Bromley
Tower Hamlets
Newham
Hackney
Islington
Southwark
Lewisham
Brent
Southwark
Ealing
Kensington and Chelsea
Wandsworth
Southwark
Barnet
Newham
Richmond upon Thames


Sampling our data

In [207]:
venues_in_London.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,"Bexley, Greenwich",51.49245,0.12127,Lesnes Abbey,Historic Site
1,"Bexley, Greenwich",51.49245,0.12127,Sainsbury's,Supermarket
2,"Bexley, Greenwich",51.49245,0.12127,Lidl,Supermarket
3,"Bexley, Greenwich",51.49245,0.12127,Abbey Wood Railway Station (ABW),Train Station
4,"Bexley, Greenwich",51.49245,0.12127,Bean @ Work,Coffee Shop


In [208]:
venues_in_London.shape

(10441, 5)

Wow, we have scraped together 10276 records for venues. This will definitely make the clustering interesting.

## Grouping by Venue Categories
We need to now see how many Venue Categories are there for further processing

In [209]:
venues_in_London.groupby('Venue Category').max()

Unnamed: 0_level_0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Accessories Store,Westminster,51.51656,-0.14770,Balenciaga
Adult Boutique,Islington,51.52969,-0.08697,Sh! Women's Erotic Emporium
African Restaurant,Westminster,51.52587,-0.08808,Red Sea Restaurant
American Restaurant,Waltham Forest,51.61780,0.02795,Spielburger
Antique Shop,Westminster,51.51651,-0.11968,The London Silver Vaults
...,...,...,...,...
Wings Joint,Hammersmith and Fulham,51.54187,-0.19795,Wingmans
Women's Store,Westminster,51.55457,-0.11478,Vivien of Holloway
Xinjiang Restaurant,Southwark,51.47480,-0.09313,Silk Road
Yoga Studio,Westminster,51.55457,-0.03558,yogahaven


We can see 297 records, just goes to show how diverse and interesting the place is.

## One Hot Encoding 
We need to Encode our venue categories to get a better result for our clustering

In [210]:
London_venue_cat = pd.get_dummies(venues_in_London[['Venue Category']], prefix="", prefix_sep="")
London_venue_cat

Unnamed: 0,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10436,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10437,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10438,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10439,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Adding Neighbourhood into the mix.

In [211]:
London_venue_cat['Neighbourhood'] = venues_in_London['Neighbourhood'] 

# moving neighborhood column to the first column
fixed_columns = [London_venue_cat.columns[-1]] + list(London_venue_cat.columns[:-1])
London_venue_cat = London_venue_cat[fixed_columns]

London_venue_cat.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Venue categories mean value
We will group the Neighbourhoods and calculate the mean venue categories value in each Neighbourhood

In [212]:
London_grouped = London_venue_cat.groupby('Neighbourhood').mean().reset_index()
London_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,Barnet,0.0,0.0,0.0,0.001757,0.0,0.0,0.0,0.00703,0.0,...,0.001757,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Barnet, Brent, Camden",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bexley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Bexley, Greenwich",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bexley, Greenwich",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's make a function to get the top most common venue categories

In [213]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

There are way too many venue categories, we can take the top 10 to cluster the neighbourhoods.

Creating a function to label the columns of the venue correctly

In [214]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))


## Top venue categories

Getting the top venue categories in London

In [215]:
# create a new dataframe for London
neighborhoods_venues_sorted_london = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted_london['Neighbourhood'] = London_grouped['Neighbourhood']

for ind in np.arange(London_grouped.shape[0]):
    neighborhoods_venues_sorted_london.iloc[ind, 1:] = return_most_common_venues(London_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted_london.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Barnet,Coffee Shop,Café,Grocery Store,Pub,Bus Stop,Pharmacy,Supermarket,Italian Restaurant,Turkish Restaurant,Pizza Place
1,"Barnet, Brent, Camden",Clothing Store,Convenience Store,Music Store,Supermarket,Gym / Fitness Center,Bakery,Hardware Store,Bus Station,Doner Restaurant,Film Studio
2,Bexley,Supermarket,Historic Site,Platform,Coffee Shop,Convenience Store,Train Station,Bus Stop,Park,Golf Course,Construction & Landscaping
3,"Bexley, Greenwich",Bus Stop,Construction & Landscaping,Convenience Store,Golf Course,Historic Site,Park,Daycare,Food Stand,Food Court,Food & Drink Shop
4,"Bexley, Greenwich",Supermarket,Historic Site,Train Station,Coffee Shop,Platform,Convenience Store,Zoo Exhibit,Fast Food Restaurant,Filipino Restaurant,Film Studio


In [216]:
neighborhoods_venues_sorted_london

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Barnet,Coffee Shop,Café,Grocery Store,Pub,Bus Stop,Pharmacy,Supermarket,Italian Restaurant,Turkish Restaurant,Pizza Place
1,"Barnet, Brent, Camden",Clothing Store,Convenience Store,Music Store,Supermarket,Gym / Fitness Center,Bakery,Hardware Store,Bus Station,Doner Restaurant,Film Studio
2,Bexley,Supermarket,Historic Site,Platform,Coffee Shop,Convenience Store,Train Station,Bus Stop,Park,Golf Course,Construction & Landscaping
3,"Bexley, Greenwich",Bus Stop,Construction & Landscaping,Convenience Store,Golf Course,Historic Site,Park,Daycare,Food Stand,Food Court,Food & Drink Shop
4,"Bexley, Greenwich",Supermarket,Historic Site,Train Station,Coffee Shop,Platform,Convenience Store,Zoo Exhibit,Fast Food Restaurant,Filipino Restaurant,Film Studio
5,Brent,Café,Indian Restaurant,Pharmacy,Sandwich Place,Warehouse Store,Bus Stop,Fast Food Restaurant,Convenience Store,Chinese Restaurant,Pub
6,"Brent, Camden",Indian Restaurant,Pub,Brazilian Restaurant,Café,Supermarket,Portuguese Restaurant,Sandwich Place,Grocery Store,Theater,Coffee Shop
7,"Brent, Ealing",Fast Food Restaurant,Bus Stop,Chinese Restaurant,Café,Warehouse Store,Convenience Store,Sandwich Place,Pharmacy,Fishing Store,Filipino Restaurant
8,"Brent, Harrow",Hotel,Plaza,Monument / Landmark,Theater,Garden,Art Gallery,Japanese Restaurant,Boutique,Ramen Restaurant,Burger Joint
9,Bromley,Supermarket,Grocery Store,Convenience Store,Hotel,Fast Food Restaurant,Park,Golf Course,Bistro,Gastropub,Bus Stop


## k-Means Model Building
Let's cluster the city of london to roughly 5 to make it easier to analyze. 

We use the K Means clustering technique to do so.

In [223]:
# set number of clusters
k_num_clusters = 5

London_grouped_clustering = London_grouped.drop('Neighbourhood', 1)

# run k-means clustering
kmeans_london = KMeans(n_clusters=k_num_clusters, random_state=0).fit(London_grouped_clustering)
kmeans_london

KMeans(n_clusters=5, random_state=0)

## Labelling Clustered Data

In [224]:
kmeans_london.labels_

array([0, 2, 3, 2, 3, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0,
       0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 4, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])

So our model has labeled the city

In [226]:
# neighborhoods_venues_sorted_london.insert(0, 'Cluster Labels', kmeans_london.labels_ +1)

Join London_merged with our neighbourhood venues sorted to add latitude & longitude for each of the neighborhood to prepare it for plotting

In [227]:
london_data = london_merged

london_data = london_data.join(neighborhoods_venues_sorted_london.set_index('Neighbourhood'), on='borough')

london_data.head()

Unnamed: 0,borough,town,post_code,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127,4,Supermarket,Historic Site,Train Station,Coffee Shop,Platform,Convenience Store,Zoo Exhibit,Fast Food Restaurant,Filipino Restaurant,Film Studio
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",51.51324,-0.26746,3,Grocery Store,Park,Indian Restaurant,Breakfast Spot,Train Station,Hotel,Fishing Store,Fast Food Restaurant,Filipino Restaurant,Film Studio
6,City,LONDON,EC3,51.512,-0.08058,1,Coffee Shop,Hotel,Italian Restaurant,Gym / Fitness Center,Pub,Sandwich Place,Restaurant,Wine Bar,Scenic Lookout,Cocktail Bar
7,Westminster,LONDON,WC2,51.51651,-0.11968,1,Hotel,Coffee Shop,Sandwich Place,Café,Pub,Italian Restaurant,Theater,Restaurant,Hotel Bar,Burger Joint
9,Bromley,LONDON,SE20,51.41009,-0.05683,3,Supermarket,Grocery Store,Convenience Store,Hotel,Fast Food Restaurant,Park,Golf Course,Bistro,Gastropub,Bus Stop


Drop all the NaN values to prevent data skew

In [228]:
london_data_nonan = london_data.dropna(subset=['Cluster Labels'])

## Visualizing the clustered neighbourhood
Let's plot the clusters

In [229]:
map_clusters_london = folium.Map(location=[london_lat_coords, london_lng_coords], zoom_start=12)

# set color scheme for the clusters
x = np.arange(k_num_clusters)
ys = [i + x + (i*x)**2 for i in range(k_num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(london_data_nonan['latitude'], london_data_nonan['longitude'], london_data_nonan['borough'], london_data_nonan['Cluster Labels']):
    label = folium.Popup('Cluster ' + str(int(cluster) +1) + '\n' + str(poi) , parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)]
        ).add_to(map_clusters_london)
        
map_clusters_london

## Examining our Clusters

Cluster 1 - Coffee Shops and Pubs

In [245]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 1, london_data_nonan.columns[[0] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,City,1,Coffee Shop,Hotel,Italian Restaurant,Gym / Fitness Center,Pub,Sandwich Place,Restaurant,Wine Bar,Scenic Lookout,Cocktail Bar
7,Westminster,1,Hotel,Coffee Shop,Sandwich Place,Café,Pub,Italian Restaurant,Theater,Restaurant,Hotel Bar,Burger Joint
10,Islington,1,Coffee Shop,Pub,Café,Food Truck,Italian Restaurant,Vietnamese Restaurant,Breakfast Spot,Park,Cocktail Bar,Hotel
12,Islington,1,Coffee Shop,Pub,Café,Food Truck,Italian Restaurant,Vietnamese Restaurant,Breakfast Spot,Park,Cocktail Bar,Hotel
14,Barnet,1,Coffee Shop,Café,Grocery Store,Pub,Bus Stop,Pharmacy,Supermarket,Italian Restaurant,Turkish Restaurant,Pizza Place
...,...,...,...,...,...,...,...,...,...,...,...,...
521,Redbridge,1,Pub,Coffee Shop,Grocery Store,Café,Seafood Restaurant,Bar,Bakery,Restaurant,BBQ Joint,Park
522,"Redbridge, Waltham Forest",1,Hotel,Pub,Garden,Café,Plaza,Monument / Landmark,Theater,Art Gallery,Restaurant,Pharmacy
525,Barnet,1,Coffee Shop,Café,Grocery Store,Pub,Bus Stop,Pharmacy,Supermarket,Italian Restaurant,Turkish Restaurant,Pizza Place
526,Greenwich,1,Pub,Grocery Store,Coffee Shop,Bus Stop,Indian Restaurant,Turkish Restaurant,Chinese Restaurant,Construction & Landscaping,Pier,Park


Cluster 2 - Bakery

In [247]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 2, london_data_nonan.columns[[0] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
377,"Harrow, Brent",2,Bakery,Indian Restaurant,Gym,Metro Station,Zoo Exhibit,Flea Market,Filipino Restaurant,Film Studio,Fish & Chips Shop,Fish Market


Cluster 3 - Grocery Store & Indian Restaurant

In [248]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 3, london_data_nonan.columns[[0] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"Ealing, Hammersmith and Fulham",3,Grocery Store,Park,Indian Restaurant,Breakfast Spot,Train Station,Hotel,Fishing Store,Fast Food Restaurant,Filipino Restaurant,Film Studio
9,Bromley,3,Supermarket,Grocery Store,Convenience Store,Hotel,Fast Food Restaurant,Park,Golf Course,Bistro,Gastropub,Bus Stop
29,Bromley,3,Supermarket,Grocery Store,Convenience Store,Hotel,Fast Food Restaurant,Park,Golf Course,Bistro,Gastropub,Bus Stop
61,Brent,3,Café,Indian Restaurant,Pharmacy,Sandwich Place,Warehouse Store,Bus Stop,Fast Food Restaurant,Convenience Store,Chinese Restaurant,Pub
69,Brent,3,Café,Indian Restaurant,Pharmacy,Sandwich Place,Warehouse Store,Bus Stop,Fast Food Restaurant,Convenience Store,Chinese Restaurant,Pub
100,Brent,3,Café,Indian Restaurant,Pharmacy,Sandwich Place,Warehouse Store,Bus Stop,Fast Food Restaurant,Convenience Store,Chinese Restaurant,Pub
121,"Barnet, Brent, Camden",3,Clothing Store,Convenience Store,Music Store,Supermarket,Gym / Fitness Center,Bakery,Hardware Store,Bus Station,Doner Restaurant,Film Studio
127,Bromley,3,Supermarket,Grocery Store,Convenience Store,Hotel,Fast Food Restaurant,Park,Golf Course,Bistro,Gastropub,Bus Stop
137,Brent,3,Café,Indian Restaurant,Pharmacy,Sandwich Place,Warehouse Store,Bus Stop,Fast Food Restaurant,Convenience Store,Chinese Restaurant,Pub
167,"Bexley, Greenwich",3,Bus Stop,Construction & Landscaping,Convenience Store,Golf Course,Historic Site,Park,Daycare,Food Stand,Food Court,Food & Drink Shop


Cluster 4

In [251]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 4, london_data_nonan.columns[[0] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bexley, Greenwich",4,Supermarket,Historic Site,Train Station,Coffee Shop,Platform,Convenience Store,Zoo Exhibit,Fast Food Restaurant,Filipino Restaurant,Film Studio
45,Bexley,4,Supermarket,Historic Site,Platform,Coffee Shop,Convenience Store,Train Station,Bus Stop,Park,Golf Course,Construction & Landscaping
124,Bexley,4,Supermarket,Historic Site,Platform,Coffee Shop,Convenience Store,Train Station,Bus Stop,Park,Golf Course,Construction & Landscaping
291,Bexley,4,Supermarket,Historic Site,Platform,Coffee Shop,Convenience Store,Train Station,Bus Stop,Park,Golf Course,Construction & Landscaping
505,Bexley,4,Supermarket,Historic Site,Platform,Coffee Shop,Convenience Store,Train Station,Bus Stop,Park,Golf Course,Construction & Landscaping


Cluster 5

In [250]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 5, london_data_nonan.columns[[0] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
453,"Lewisham, Southwark",5,Pub,Flower Shop,Gym / Fitness Center,Tennis Court,Train Station,Restaurant,Park,Wine Shop,Fast Food Restaurant,Food Truck


# Discussion / Observations / Recommendations
One of the observations regarding the comparison is that cities are very different. Each city offers very different venues. London is more homogeneous, while, Manhattan is more diverse – the neighbourhoods represent very different sets of venues. Also, London might have more to offer, as the number of venues is about 3 times higher. Additionally, the most common venues for London are: Pubs (very British) and Coffee Shops – people love coffee apparently. For Manhattan, the most popular restaurant is Italian, which is expected, having in mind that New York loves Pizza.
Overall, if you are a tourist, and want to decide where to go, based on our model we can recommend New York for someone who likes diversity and Italian food. If you prefer more convenient city (lots of supermarkets) and you like to have a pint – choose London, its full of pubs.

# Conclusions
In this project, we were trying to find out the characteristics of the two cities, neighbourhoods and check how these neighbourhoods cluster.
k-Means model helped us to better understand the landscape of areas in each city, which can help tourists or new comers to choose best district for their taste.
Both cities have so much to offer, but based on the data, analysis and modelling we can conclude some facts around them.
If you are a fan of Italian cuisine and you like variety you should choose New York. If you prefer to spent leisure time in pub, or you value convenience and shop in supermarket – go with London.


