## Coursera - IBM Data Science Professional Certificate

### Capstone Assignment - The Battle of Neighborhoods - New York and Paris

# 1. Introduction

**Problem statement:**  A travel booking website wants to help its potential customers understand the similarities/dissimilarities between New York and Paris so that they can make an informed decision when choosing their holiday destination.

There are many famous tourist attractions in Paris, such as the Eiffel Tower, the Arc de Triomphe (Arch of Triumph), and the courtyard of the Louvre Museum with its pyramid, along with art galleries, theaters and antique stores.  Apart from these you can find kids' favourite parks, cinemas and museums; ladies' favourite shopping malls and street markets; and men's favourite cafes and other shops.

Similarly, you can find many famous tourist attractions in New York too, such as Midtown Manhattan, Times Square, the Unisphere, the Brooklyn Bridge, Lower Manhattan with One World Trade Center, Central Park, the headquarters of the United Nations, the Statue of Liberty and others.

**Approach:** The above information is published, and anyone can search for it and read it on Wikipedia.  But finding the similarities and dissimilarities between the cities this way takes a lot of time and research.

To understand the similarities between New York and Paris, we analyze the data from Foursquare, a popular local search-and-discovery application which provides search results for its users. The application provides personalized recommendations of places to go near a certain location.  Foursquare enables users to share their current location with friends, rate and comment on venues they visit and read reviews of venues that other users have provided on the application.

To compare New York and Paris, we will use geographical datasets of the two cities. We will consider venues within 500-1000 meters of each neighborhood's center location. We then analyze the venue recommendations from Foursquare through their API.  We create 5-10 clusters for each city, compare and analyze them against each other, and conclude how similar or dissimilar the neighborhoods are.

# 2. Data

#### Paris and New York Geographical Data

**Paris geographical data**


The city of Paris is divided into twenty arrondissements municipaux, administrative districts, more simply referred to as arrondissements. These are not to be confused with departmental arrondissements, which subdivide the 100 French départements. The word "arrondissement", when applied to Paris, refers almost always to the municipal arrondissements listed below. The number of the arrondissement is indicated by the last two digits in most Parisian postal codes (75001 up to 75020).

The twenty arrondissements are arranged in the form of a clockwise spiral (often likened to a snail shell), starting from the middle of the city, with the first on the Right Bank (north bank) of the Seine. Lyon and Marseille have, more recently, also been subdivided into arrondissements.

https://en.wikipedia.org/wiki/Arrondissements_of_Paris

In French, notably on street signs, the number is often given in Roman numerals. For example, the Eiffel Tower belongs to the VIIe arrondissement while Gare de l'Est is in the Xe arrondissement. In daily speech, people use only the ordinal number corresponding to the arrondissement, e.g. "Elle habite dans le sixième", "She lives in the 6th (arrondissement)".


We will extract the Paris municipal borough data from the open datasets available online at https://opendata.paris.fr/page/home/.  The URL below can be used to download the JSON data.

https://opendata.paris.fr/explore/dataset/arrondissements/download?format=json&timezone=Europe/Berlin&use_labels_for_header=true


In [1]:
!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab


Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geopy                     1.18.1                     py_0    conda-forge


In [2]:
!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab


Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge


In [3]:
import pandas as pd
import numpy as np
import urllib.request 
import json
from pandas.io.json import json_normalize
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans


In [4]:
# Get Paris geography data from https://opendata.paris.fr

with urllib.request.urlopen("https://opendata.paris.fr/explore/dataset/arrondissements/download?format=json&timezone=Europe/Berlin&use_labels_for_header=true") as paris_url:
   paris_data = json.loads(paris_url.read().decode('utf-8'))

### Below is a sample Paris dataset.

In [5]:
paris_df = json_normalize(paris_data)
paris_df.head()

Unnamed: 0,datasetid,fields.c_ar,fields.c_arinsee,fields.geom.coordinates,fields.geom.type,fields.geom_x_y,fields.l_ar,fields.l_aroff,fields.longueur,fields.n_sq_ar,fields.n_sq_co,fields.objectid,fields.perimetre,fields.surface,geometry.coordinates,geometry.type,record_timestamp,recordid
0,arrondissements,2,75102,"[[[2.351518483670821, 48.8644258050741], [2.35...",Polygon,"[48.86827922252252, 2.3428025468913636]",2ème Ardt,Bourse,4553.938764,750000002,750001537,2,4554.10436,991153.7,"[2.3428025468913636, 48.86827922252252]",Point,2019-03-01T00:00:31+01:00,fdcdd162efd8d445fdecb7b95ed7df1ff4c59f26
1,arrondissements,3,75103,"[[[2.363828096062925, 48.86750443060333], [2.3...",Polygon,"[48.86287238001689, 2.3600009858976927]",3ème Ardt,Temple,4519.071982,750000003,750001537,3,4519.263648,1170883.0,"[2.3600009858976927, 48.86287238001689]",Point,2019-03-01T00:00:31+01:00,469806e90b8b4676461b1845f113b25397cd5241
2,arrondissements,12,75112,"[[[2.413879624300607, 48.83357143972265], [2.4...",Polygon,"[48.83497438148051, 2.421324900784681]",12ème Ardt,Reuilly,24088.038922,750000012,750001537,12,24089.666298,16314780.0,"[2.421324900784681, 48.83497438148051]",Point,2019-03-01T00:00:31+01:00,e8ec3494fa75e33f9cc5308108db755f2bafbd7c
3,arrondissements,1,75101,"[[[2.328007329038849, 48.86991742140715], [2.3...",Polygon,"[48.86256270183605, 2.3364433620533847]",1er Ardt,Louvre,6054.680862,750000001,750001537,1,6054.936862,1824613.0,"[2.3364433620533847, 48.86256270183605]",Point,2019-03-01T00:00:31+01:00,fd746ffccedf5bb7893b6ec2d7c8daf24a6f1fb5
4,arrondissements,4,75104,"[[[2.368512371393433, 48.85573412813671], [2.3...",Polygon,"[48.854341426272896, 2.357629620324993]",4ème Ardt,Hôtel-de-Ville,5420.636779,750000004,750001537,4,5420.908434,1600586.0,"[2.357629620324993, 48.854341426272896]",Point,2019-03-01T00:00:31+01:00,437ce5d06deeb12a187baea9fbd3e15c2ae87852


In our exercise we will use the arrondissement number and name from the above table, together with the location (latitude and longitude) information.


### Foursquare Local search and recommendations API:

Foursquare lets users search for restaurants, nightlife spots, shops and other places of interest in their surrounding area. It is also possible to search other areas by entering the name of a remote location. The app displays personalized recommendations based on the time of day, displaying breakfast places in the morning, dinner places in the evening etc. Recommendations are personalized based on factors that include a user's check-in history, their "Tastes" and their venue ratings.

In our assignment we will use the Foursquare API to explore the top recommended venues near a particular neighborhood location. We will combine the Paris and New York geographical data with Foursquare venues.  Then we will use the data to cluster the neighborhoods and look for basic similarities/dissimilarities between the neighborhoods of Paris and New York.

# 3. Methodology



In [6]:
paris_df.columns

Index(['datasetid', 'fields.c_ar', 'fields.c_arinsee',
       'fields.geom.coordinates', 'fields.geom.type', 'fields.geom_x_y',
       'fields.l_ar', 'fields.l_aroff', 'fields.longueur', 'fields.n_sq_ar',
       'fields.n_sq_co', 'fields.objectid', 'fields.perimetre',
       'fields.surface', 'geometry.coordinates', 'geometry.type',
       'record_timestamp', 'recordid'],
      dtype='object')

In [7]:
paris_df.loc[:, ["fields.c_ar", "fields.l_aroff", "fields.geom_x_y"]]

Unnamed: 0,fields.c_ar,fields.l_aroff,fields.geom_x_y
0,2,Bourse,"[48.86827922252252, 2.3428025468913636]"
1,3,Temple,"[48.86287238001689, 2.3600009858976927]"
2,12,Reuilly,"[48.83497438148051, 2.421324900784681]"
3,1,Louvre,"[48.86256270183605, 2.3364433620533847]"
4,4,Hôtel-de-Ville,"[48.854341426272896, 2.357629620324993]"
5,8,Élysée,"[48.87272083743446, 2.3125540224020638]"
6,14,Observatoire,"[48.829244500489835, 2.3265420441989453]"
7,19,Buttes-Chaumont,"[48.887075996572506, 2.3848209601525143]"
8,20,Ménilmontant,"[48.86346057889556, 2.4011881292846864]"
9,6,Luxembourg,"[48.84913035858519, 2.3328979990533134]"


In [8]:
paris_data = pd.concat([paris_df.loc[:, ["fields.c_ar", "fields.l_aroff"]], pd.DataFrame(paris_df['fields.geom_x_y'].tolist(), columns = ["Latitude", "Longitude"])], axis = 1)
paris_data.columns = ["Arrondissement", "Neighborhood", "Latitude", "Longitude"]

paris_data = paris_data.sort_values(["Arrondissement"]).reset_index(drop = True)



In [9]:
paris_data.head()

Unnamed: 0,Arrondissement,Neighborhood,Latitude,Longitude
0,1,Louvre,48.862563,2.336443
1,2,Bourse,48.868279,2.342803
2,3,Temple,48.862872,2.360001
3,4,Hôtel-de-Ville,48.854341,2.35763
4,5,Panthéon,48.844443,2.350715


Use the geopy library to get the latitude and longitude values of Paris.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent pr_explorer, as shown below.

In [10]:
address = 'Paris, Fr'

geolocator = Nominatim(user_agent="pr_explorer")
location_paris = geolocator.geocode(address)
latitude_paris = location_paris.latitude
longitude_paris = location_paris.longitude
print('The geographical coordinates of Paris are {}, {}.'.format(latitude_paris, longitude_paris))

The geographical coordinates of Paris are 48.8566101, 2.3514992.
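
The geocoding cell above reads `.latitude` and `.longitude` straight from the result. geopy's `geocode` returns `None` when an address cannot be matched, so a small guard can be useful. The following is a minimal sketch, not part of the original notebook: `get_coordinates` and `fake_geocode` are hypothetical names, and the stand-in geocoder only exists so that the example runs without a network call.

```python
from types import SimpleNamespace


def get_coordinates(geocode, address):
    """Return (latitude, longitude) for an address, or None if nothing matched.

    `geocode` is any callable shaped like geolocator.geocode: it takes an
    address string and returns an object with .latitude and .longitude
    attributes, or None when the lookup finds no match.
    """
    location = geocode(address)
    if location is None:
        return None
    return location.latitude, location.longitude


# Stand-in geocoder for illustration only; the coordinates are the ones
# printed above for 'Paris, Fr'.
def fake_geocode(address):
    known = {"Paris, Fr": SimpleNamespace(latitude=48.8566101, longitude=2.3514992)}
    return known.get(address)


print(get_coordinates(fake_geocode, "Paris, Fr"))
print(get_coordinates(fake_geocode, "Nowhere"))
```

In the notebook itself, the same helper could be called as `get_coordinates(geolocator.geocode, address)`.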


#### Create a map of Paris with neighborhoods superimposed on top.

In [11]:
# create map of Paris using latitude and longitude values
map_paris = folium.Map(location=[latitude_paris, longitude_paris], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(paris_data['Latitude'], paris_data['Longitude'], paris_data['Arrondissement'], paris_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_paris)  
    
map_paris

**New York geographical data**

New York City has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Luckily, this dataset exists for free on the web, here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572

For our convenience, I will simply use the file that is already placed on the IBM server, so we can simply run a wget command and access the data.

In [12]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
    
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Below is a sample New York dataset.

In [13]:
ny_df = json_normalize(newyork_data)
ny_df.head()

Unnamed: 0,bbox,crs.properties.name,crs.type,features,totalFeatures,type
0,"[-74.2492599487305, 40.5033187866211, -73.7061...",urn:ogc:def:crs:EPSG::4326,name,"[{'geometry_name': 'geom', 'type': 'Feature', ...",306,FeatureCollection


Notice how all the relevant data is in the features key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.

In [14]:
neighborhoods_data = newyork_data['features']

#### Transform the data into a *pandas* dataframe

The next task is essentially transforming this data of nested Python dictionaries into a *pandas* dataframe. So let's start by creating an empty dataframe.

In [15]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Then let's loop through the data and fill the dataframe one row at a time.

In [16]:
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)
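
The loop above grows the dataframe with `DataFrame.append`, which was deprecated in pandas 1.4 and removed in pandas 2.0. On a newer pandas, one possible alternative is to collect plain dictionaries and build the dataframe once. This is a sketch under that assumption, using two hand-written features that mimic the parts of `newyork_data['features']` the loop reads; `sample_features` and `sample_neighborhoods` are illustrative names.

```python
import pandas as pd

# Two hand-written features with only the keys the loop above reads;
# the values match the first rows of the New York dataset.
sample_features = [
    {"properties": {"borough": "Bronx", "name": "Wakefield"},
     "geometry": {"coordinates": [-73.847201, 40.894705]}},
    {"properties": {"borough": "Bronx", "name": "Co-op City"},
     "geometry": {"coordinates": [-73.829939, 40.874294]}},
]

# Collect plain dicts first, then build the dataframe in one step.
rows = [
    {
        "Borough": feature["properties"]["borough"],
        "Neighborhood": feature["properties"]["name"],
        # coordinates are stored as [longitude, latitude]
        "Latitude": feature["geometry"]["coordinates"][1],
        "Longitude": feature["geometry"]["coordinates"][0],
    }
    for feature in sample_features
]
sample_neighborhoods = pd.DataFrame(
    rows, columns=["Borough", "Neighborhood", "Latitude", "Longitude"]
)
print(sample_neighborhoods)
```

Building the frame once also avoids copying the whole dataframe on every iteration.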

In [17]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [18]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


#### For our project we will only consider Manhattan borough to compare with Paris

In [19]:
manhattan_data = neighborhoods[neighborhoods["Borough"] == "Manhattan"].reset_index(drop = True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


Let's get the geographical coordinates of Manhattan.

In [20]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location_ny = geolocator.geocode(address)
latitude_ny = location_ny.latitude
longitude_ny = location_ny.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude_ny, longitude_ny))

The geographical coordinates of Manhattan are 40.7900869, -73.9598295.


In [21]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude_ny, longitude_ny], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Borough'], manhattan_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan) 
    
map_manhattan

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [65]:
CLIENT_ID = 'OIGLNO2RMC5HLTCAA1Q1GWL1BFCBOSG5SU5C1M4KK2PNKTCT' # your Foursquare ID
CLIENT_SECRET = '*********************************************' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: OIGLNO2RMC5HLTCAA1Q1GWL1BFCBOSG5SU5C1M4KK2PNKTCT
CLIENT_SECRET:*********************************************


In [23]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])

    nearby_venues.columns = ['Neighborhood', 
          'Neighborhood Latitude', 
          'Neighborhood Longitude', 
          'Venue', 
          'Venue Latitude', 
          'Venue Longitude', 
          'Venue Category']
    
    return(nearby_venues)
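
`getNearbyVenues` indexes directly into the response (`["response"]['groups'][0]['items']` and `categories[0]`), so an unexpected or empty response raises an exception partway through the loop. The sketch below shows one way the parsing step could be made more tolerant. `extract_venue_rows` is a hypothetical helper, the `"Uncategorized"` label is an assumption, and the payload is hand-written to mirror the layout the function above relies on; it is not a real API response.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore response into a list of row tuples.

    `payload` is the decoded JSON dict; its layout
    (response -> groups[0] -> items -> venue) is taken from the function above.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to a placeholder label instead of raising IndexError
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written payload for illustration; not a real API response.
fake_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Venue A",
                   "location": {"lat": 48.86, "lng": 2.33},
                   "categories": [{"name": "Art Museum"}]}},
        {"venue": {"name": "Venue B",
                   "location": {"lat": 48.87, "lng": 2.34},
                   "categories": []}},
    ]}]}
}

for row in extract_venue_rows(fake_payload, "Louvre", 48.862563, 2.336443):
    print(row)

# A payload without a "response" key yields no rows instead of a KeyError
print(extract_venue_rows({}, "Louvre", 48.862563, 2.336443))
```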

## Explore Neighborhoods in Paris

In [24]:
# Create a new dataframe called paris_venues that combines the recommended venues from Foursquare 
# and neighborhoods geo-information from the city dataset

paris_venues = getNearbyVenues(names = paris_data["Neighborhood"],
                               latitudes = paris_data["Latitude"],
                               longitudes = paris_data["Longitude"]
                               )

Louvre
Bourse
Temple
Hôtel-de-Ville
Panthéon
Luxembourg
Palais-Bourbon
Élysée
Opéra
Entrepôt
Popincourt
Reuilly
Gobelins
Observatoire
Vaugirard
Passy
Batignolles-Monceau
Buttes-Montmartre
Buttes-Chaumont
Ménilmontant


In [25]:
print(paris_venues.shape)
paris_venues.head()

(1834, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Louvre,48.862563,2.336443,Musée du Louvre,48.860847,2.33644,Art Museum
1,Louvre,48.862563,2.336443,Comédie-Française,48.863088,2.336612,Theater
2,Louvre,48.862563,2.336443,Palais Royal,48.863758,2.337121,Historic Site
3,Louvre,48.862563,2.336443,Place du Palais Royal,48.862523,2.336688,Plaza
4,Louvre,48.862563,2.336443,Les Arts Décoratifs,48.863077,2.333393,Art Museum


In [26]:
print(paris_venues.shape)
print("There are {} unique categories.".format(len(paris_venues["Venue Category"].unique())))

(1834, 7)
There are 227 unique categories.


Let's check how many venues were returned for each neighborhood in Paris

In [27]:
paris_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Batignolles-Monceau,100,100,100,100,100,100
Bourse,100,100,100,100,100,100
Buttes-Chaumont,100,100,100,100,100,100
Buttes-Montmartre,100,100,100,100,100,100
Entrepôt,100,100,100,100,100,100
Gobelins,100,100,100,100,100,100
Hôtel-de-Ville,100,100,100,100,100,100
Louvre,100,100,100,100,100,100
Luxembourg,100,100,100,100,100,100
Ménilmontant,53,53,53,53,53,53


#### Let's find out how many unique categories can be curated from all the returned venues

In [28]:
print('There are {} unique categories.'.format(len(paris_venues['Venue Category'].unique())))

There are 227 unique categories.


In [29]:
print('So the paris_venues dataframe contains {} venues in {} unique categories.'.format(paris_venues.shape[0], len(paris_venues['Venue Category'].unique())))

So the paris_venues dataframe contains 1834 venues in 227 unique categories.


In [30]:
manhattan_venues = getNearbyVenues(names = manhattan_data["Neighborhood"],
                               latitudes = manhattan_data["Latitude"],
                               longitudes = manhattan_data["Longitude"]
                               )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


In [31]:
print(manhattan_venues.shape)
manhattan_venues.head()

(3977, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Sam's Pizza,40.879435,-73.905859,Pizza Place
4,Marble Hill,40.876551,-73.91066,Loeser's Delicatessen,40.879242,-73.905471,Sandwich Place


Let's check how many venues were returned for each neighborhood in Manhattan

In [32]:
manhattan_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Battery Park City,100,100,100,100,100,100
Carnegie Hill,100,100,100,100,100,100
Central Harlem,100,100,100,100,100,100
Chelsea,100,100,100,100,100,100
Chinatown,100,100,100,100,100,100
Civic Center,100,100,100,100,100,100
Clinton,100,100,100,100,100,100
East Harlem,100,100,100,100,100,100
East Village,100,100,100,100,100,100
Financial District,100,100,100,100,100,100


#### Let's find out how many unique categories can be curated from all the returned venues

In [33]:
print('There are {} unique categories.'.format(len(manhattan_venues['Venue Category'].unique())))

There are 313 unique categories.


In [34]:
print('So the manhattan_venues dataframe contains {} venues in {} unique categories.'.format(manhattan_venues.shape[0], len(manhattan_venues['Venue Category'].unique())))

So the manhattan_venues dataframe contains 3977 venues in 313 unique categories.


### Data Model

Now we will build the clustering model.  After clustering the data we will try to identify possible similarities in clusters of venues between Paris and Manhattan. For this we will use the k-Means method provided by Scikit-Learn Machine Learning library.

We will achieve this with the following steps:

A. First we will explore each neighborhood from Paris and Manhattan.  We will consider the 10 most common categories of venues for each arrondissement of Paris and each neighborhood of Manhattan.

B. We will cluster the venues from Paris and Manhattan and then visualize each city on its own map.

C. Fit the data with the k-Means model.

D. Visualize the clusters on the Paris and Manhattan maps.
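
As a rough illustration of the fitting step listed above, the sketch below runs scikit-learn's `KMeans` on a made-up frequency table. The neighborhood names, the frequencies and the choice of `kclusters = 2` are assumptions for the toy data only; they are not results from this notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frequency table: one row per neighborhood, one column per venue category
toy_grouped = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "French Restaurant": [0.30, 0.25, 0.02, 0.00],
    "Coffee Shop": [0.02, 0.05, 0.20, 0.25],
    "Park": [0.05, 0.05, 0.05, 0.05],
})

kclusters = 2  # assumed value for the toy data
features = toy_grouped.drop(columns="Neighborhood")

# n_init and random_state are set explicitly so repeated runs agree
kmeans = KMeans(n_clusters=kclusters, n_init=10, random_state=0).fit(features)

# Attach the cluster label of each row back to its neighborhood
toy_clustered = toy_grouped.assign(**{"Cluster Labels": kmeans.labels_})
print(toy_clustered[["Neighborhood", "Cluster Labels"]])
```

The label numbers themselves are arbitrary; only which neighborhoods share a label is meaningful.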

#### Explore each neighborhood from Paris

In [35]:
# one hot encoding
paris_onehot = pd.get_dummies(paris_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
paris_onehot['Neighborhood'] = paris_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [paris_onehot.columns[-1]] + list(paris_onehot.columns[:-1])
paris_onehot = paris_onehot[fixed_columns]

paris_onehot.head()

Unnamed: 0,Neighborhood,African Restaurant,American Restaurant,Antique Shop,Aquarium,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Udon Restaurant,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Vietnamese Restaurant,Vineyard,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,Louvre,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,Louvre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Louvre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Louvre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Louvre,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [36]:
paris_onehot.shape

(1834, 228)

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [37]:
paris_grouped = paris_onehot.groupby('Neighborhood').mean().reset_index()
paris_grouped.head()

Unnamed: 0,Neighborhood,African Restaurant,American Restaurant,Antique Shop,Aquarium,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Udon Restaurant,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Vietnamese Restaurant,Vineyard,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,Batignolles-Monceau,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0
1,Bourse,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.06,0.01,0.01,0.0,0.0
2,Buttes-Chaumont,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0
3,Buttes-Montmartre,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.01,0.0,...,0.0,0.02,0.0,0.02,0.01,0.02,0.0,0.0,0.0,0.0
4,Entrepôt,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0


#### Let's confirm the new size

In [38]:
paris_grouped.shape

(20, 228)
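
To see why the mean of the one-hot columns can be read as a frequency, here is a small sketch on made-up venue rows. Each venue contributes a single 1 in its category column, so the per-neighborhood mean of a column is the share of that neighborhood's venues in that category. The names `toy_venues`, `toy_onehot` and `toy_grouped` are illustrative only.

```python
import pandas as pd

# Made-up venue rows with the two columns used for the encoding
toy_venues = pd.DataFrame({
    "Neighborhood": ["Louvre", "Louvre", "Louvre", "Louvre", "Bourse", "Bourse"],
    "Venue Category": ["Art Museum", "Art Museum", "Plaza", "Theater",
                       "Wine Bar", "Plaza"],
})

# One indicator column per category (dtype=float keeps the means numeric)
toy_onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)

# Average the indicators within each neighborhood
toy_grouped = toy_onehot.groupby(toy_venues["Neighborhood"]).mean()
print(toy_grouped)
```

Because every venue belongs to exactly one category here, the frequencies in each row add up to 1.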

#### Let's print each neighborhood along with the top 5 most common venues

In [39]:
num_top_venues = 5

for hood in paris_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = paris_grouped[paris_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Batignolles-Monceau----
                venue  freq
0   French Restaurant  0.23
1  Italian Restaurant  0.11
2               Hotel  0.10
3              Bakery  0.05
4          Restaurant  0.04


----Bourse----
                 venue  freq
0    French Restaurant  0.13
1             Wine Bar  0.06
2         Cocktail Bar  0.06
3                Hotel  0.05
4  Japanese Restaurant  0.04


----Buttes-Chaumont----
               venue  freq
0  French Restaurant  0.13
1                Bar  0.09
2               Café  0.05
3         Restaurant  0.04
4             Bistro  0.04


----Buttes-Montmartre----
                venue  freq
0   French Restaurant  0.17
1                 Bar  0.10
2  Italian Restaurant  0.05
3              Bistro  0.05
4         Pizza Place  0.05


----Entrepôt----
                venue  freq
0         Coffee Shop  0.08
1         Pizza Place  0.06
2   French Restaurant  0.06
3              Bistro  0.05
4  Italian Restaurant  0.05


----Gobelins----
                   venu

#### Let's put that into a *pandas* dataframe

In [40]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
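
A short usage sketch of the function above on a made-up row may help. The function is repeated so that the example runs on its own, and the frequencies are invented for illustration.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # Same logic as the function above: skip position 0, sort the rest
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Made-up row: neighborhood label first, then category frequencies
toy_row = pd.Series(
    {"Neighborhood": "Louvre", "Art Museum": 0.10, "Plaza": 0.25, "Theater": 0.05}
)
print(list(return_most_common_venues(toy_row, 2)))
```

Note that the function skips position 0 on its own, so it expects the neighborhood name to be the first element of the row. A row that has already had its label removed would lose its first category instead.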

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [41]:


num_top_venues = 10

indicators = ["st", "nd", "rd"]

# create columns according to number of top venues
columns = ["Neighborhood"]
for ind in np.arange(num_top_venues):
    try:
        columns.append("{}{} Most Common Venue".format(ind+1, indicators[ind]))
    except IndexError:
        columns.append("{}th Most Common Venue".format(ind+1))

# create a new dataframe
paris_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
paris_neighborhoods_venues_sorted[["Neighborhood"]] = paris_grouped[["Neighborhood"]]

for ind in np.arange(paris_grouped.shape[0]):
    # pass the full row: return_most_common_venues already skips position 0 (the neighborhood name)
    paris_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(paris_grouped.iloc[ind, :], num_top_venues)

paris_neighborhoods_venues_sorted.head()




Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Batignolles-Monceau,French Restaurant,Italian Restaurant,Hotel,Bakery,Restaurant,Bistro,Pastry Shop,Park,Bar,Indian Restaurant
1,Bourse,French Restaurant,Wine Bar,Cocktail Bar,Hotel,Japanese Restaurant,Restaurant,Boutique,New American Restaurant,Pedestrian Plaza,Pastry Shop
2,Buttes-Chaumont,French Restaurant,Bar,Café,Restaurant,Bistro,Concert Hall,Italian Restaurant,Beer Garden,Seafood Restaurant,Park
3,Buttes-Montmartre,French Restaurant,Bar,Pizza Place,Italian Restaurant,Bistro,Café,Middle Eastern Restaurant,Restaurant,Park,Plaza
4,Entrepôt,Coffee Shop,French Restaurant,Pizza Place,Italian Restaurant,Bistro,Bakery,Cocktail Bar,Seafood Restaurant,Japanese Restaurant,Breakfast Spot
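
The column names above rely on a three-element suffix list plus a fallback to "th", which is enough for 10 columns but would label a 21st column "21th". A possible generalization is sketched below; `ordinal` is a hypothetical helper and not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


num_top_venues = 10
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```

For `num_top_venues = 10` this produces the same column names as the cell above.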


#### Explore each neighborhood from Manhattan

In [42]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [43]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Battery Park City,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.04,0.0,0.01,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,...,0.01,0.01,0.0,0.0,0.0,0.01,0.03,0.0,0.01,0.03
2,Central Harlem,0.0,0.0,0.04,0.03,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.02
3,Chelsea,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0
4,Chinatown,0.0,0.0,0.0,0.04,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.04,0.01,0.0,0.0,0.0


#### Let's print each neighborhood along with the top 5 most common venues

In [44]:
num_top_venues = 5

for hood in manhattan_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = manhattan_grouped[manhattan_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Battery Park City----
                  venue  freq
0                  Park  0.07
1           Coffee Shop  0.07
2             Wine Shop  0.04
3                 Hotel  0.03
4  Gym / Fitness Center  0.03


----Carnegie Hill----
                venue  freq
0         Pizza Place  0.06
1         Coffee Shop  0.06
2                 Gym  0.04
3              Bakery  0.04
4  Italian Restaurant  0.04


----Central Harlem----
                             venue  freq
0  Southern / Soul Food Restaurant  0.05
1                             Café  0.05
2               African Restaurant  0.04
3                French Restaurant  0.04
4             Gym / Fitness Center  0.03


----Chelsea----
                 venue  freq
0          Art Gallery  0.11
1  American Restaurant  0.05
2   Seafood Restaurant  0.04
3                Hotel  0.04
4          Coffee Shop  0.04


----Chinatown----
                 venue  freq
0   Chinese Restaurant  0.07
1         Cocktail Bar  0.05
2                 Café  0.04
3  

#### Let's put that into a *pandas* dataframe

In [45]:


num_top_venues = 10

indicators = ["st", "nd", "rd"]

# create columns according to number of top venues
columns = ["Neighborhood"]
for ind in np.arange(num_top_venues):
    try:
        columns.append("{}{} Most Common Venue".format(ind+1, indicators[ind]))
    except IndexError:
        columns.append("{}th Most Common Venue".format(ind+1))

# create a new dataframe
manhattan_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
manhattan_neighborhoods_venues_sorted[["Neighborhood"]] =  manhattan_grouped[["Neighborhood"]]

for ind in np.arange(manhattan_grouped.shape[0]):
    manhattan_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, 1:], num_top_venues)

manhattan_neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,Park,Coffee Shop,Wine Shop,Hotel,Gym / Fitness Center,Gym,Plaza,Performing Arts Venue,BBQ Joint,Burger Joint
1,Carnegie Hill,Coffee Shop,Pizza Place,Bakery,Italian Restaurant,Gym,Yoga Studio,Gym / Fitness Center,Spa,Bookstore,Art Museum
2,Central Harlem,Southern / Soul Food Restaurant,Café,African Restaurant,French Restaurant,Sushi Restaurant,American Restaurant,Seafood Restaurant,Theater,Gym / Fitness Center,Lounge
3,Chelsea,Art Gallery,American Restaurant,Seafood Restaurant,Hotel,Coffee Shop,Italian Restaurant,Nightclub,Asian Restaurant,Cycle Studio,Theater
4,Chinatown,Chinese Restaurant,Cocktail Bar,Ice Cream Shop,Café,Wine Bar,American Restaurant,Sandwich Place,French Restaurant,Coffee Shop,Shoe Store
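
The cell above relies on the helper `return_most_common_venues`, which is defined elsewhere in the notebook. As a rough sketch of what such a helper could look like (the name `top_venue_categories` and the frequencies in `toy_row` are assumptions made for illustration, not the notebook's own code), sorting one row of frequencies and keeping the first few index labels is enough:

```python
import pandas as pd

def top_venue_categories(row, num_top_venues):
    """Return the names of the num_top_venues most frequent categories."""
    # Sort frequencies from highest to lowest; the index holds category names.
    ranked = row.astype(float).sort_values(ascending=False)
    return list(ranked.index[:num_top_venues])

# Frequencies in the style of one row of manhattan_grouped (values made up).
toy_row = pd.Series({"Hotel": 0.03, "Park": 0.07, "Wine Shop": 0.04, "Coffee Shop": 0.06})
top_three = top_venue_categories(toy_row, 3)
print(top_three)
```

When two categories have the same frequency, their relative order depends on the sort, so tied ranks in the table should not be over-interpreted.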


## Let's cluster the Paris neighbourhoods

Run *k*-means to group the Paris neighborhoods into 5 clusters.

In [46]:
# set number of clusters
kclusters = 5

paris_grouped_clustering = paris_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(paris_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 



array([1, 0, 4, 4, 0, 3, 0, 0, 0, 4], dtype=int32)

In [47]:
kmeans.labels_

array([1, 0, 4, 4, 0, 3, 0, 0, 0, 4, 1, 1, 1, 0, 1, 4, 2, 0, 1, 1], dtype=int32)
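
A quick tally of the label array shows how unevenly the neighborhoods are spread across the clusters. The sketch below copies the labels printed above into a plain list (the name `paris_labels` is introduced here for illustration only) and counts them:

```python
from collections import Counter

# Labels copied from the kmeans.labels_ output above (one per neighborhood).
paris_labels = [1, 0, 4, 4, 0, 3, 0, 0, 0, 4, 1, 1, 1, 0, 1, 4, 2, 0, 1, 1]

cluster_sizes = Counter(paris_labels)
for cluster in sorted(cluster_sizes):
    print(f"Cluster {cluster}: {cluster_sizes[cluster]} neighborhoods")
```

Clusters 2 and 3 each hold a single neighborhood, which is worth keeping in mind when reading the cluster tables later on.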

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [48]:


# add clustering labels
paris_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

paris_merged = paris_data

# join the sorted venues onto paris_data so each neighborhood keeps its latitude/longitude
paris_merged = paris_merged.join(paris_neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

paris_merged.head() # check the last columns!


Unnamed: 0,Arrondissement,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,Louvre,48.862563,2.336443,0,French Restaurant,Hotel,Japanese Restaurant,Café,Italian Restaurant,Plaza,Historic Site,Cocktail Bar,Art Museum,Restaurant
1,2,Bourse,48.868279,2.342803,0,French Restaurant,Wine Bar,Cocktail Bar,Hotel,Japanese Restaurant,Restaurant,Boutique,New American Restaurant,Pedestrian Plaza,Pastry Shop
2,3,Temple,48.862872,2.360001,0,Art Gallery,French Restaurant,Bistro,Coffee Shop,Clothing Store,Wine Bar,Sandwich Place,Cocktail Bar,Café,Burger Joint
3,4,Hôtel-de-Ville,48.854341,2.35763,0,French Restaurant,Ice Cream Shop,Plaza,Clothing Store,Wine Bar,Art Gallery,Pastry Shop,Cocktail Bar,Falafel Restaurant,Coffee Shop
4,5,Panthéon,48.844443,2.350715,0,French Restaurant,Bar,Bakery,Plaza,Wine Bar,Café,Greek Restaurant,Pub,Coffee Shop,Museum


Finally, let's visualize the resulting clusters

In [49]:
# create map
map_clusters = folium.Map(location=[latitude_paris, longitude_paris], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(paris_merged["Latitude"], paris_merged["Longitude"], paris_merged["Neighborhood"], paris_merged["Cluster Labels"]):
    label = folium.Popup(str(poi) + " Cluster " + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[cluster-1],
        fill = True,
        fill_color = rainbow[cluster-1],
        fill_opacity = 0.7).add_to(map_clusters)
       
map_clusters


## Let's cluster the Manhattan neighbourhoods

In [50]:
# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 1, 1, 3, 2, 1, 3, 4, 2, 0], dtype=int32)

In [51]:
# add clustering labels
manhattan_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged = manhattan_data

# join the sorted venues onto manhattan_data so each neighborhood keeps its latitude/longitude
manhattan_merged = manhattan_merged.join(manhattan_neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Manhattan,Marble Hill,40.876551,-73.91066,4,Park,Pizza Place,Spanish Restaurant,Café,Supermarket,Mexican Restaurant,Donut Shop,Coffee Shop,Bakery,Bar
1,Manhattan,Chinatown,40.715618,-73.994279,2,Chinese Restaurant,Cocktail Bar,Ice Cream Shop,Café,Wine Bar,American Restaurant,Sandwich Place,French Restaurant,Coffee Shop,Shoe Store
2,Manhattan,Washington Heights,40.851903,-73.9369,4,Pizza Place,Latin American Restaurant,Bakery,Café,Mexican Restaurant,Grocery Store,Bar,Park,Tapas Restaurant,Deli / Bodega
3,Manhattan,Inwood,40.867684,-73.92121,4,Latin American Restaurant,Café,Pizza Place,Deli / Bodega,Mexican Restaurant,Wine Bar,Spanish Restaurant,Restaurant,Bakery,Lounge
4,Manhattan,Hamilton Heights,40.823604,-73.949688,2,Coffee Shop,Bar,Mexican Restaurant,Café,Yoga Studio,Caribbean Restaurant,Chinese Restaurant,Sushi Restaurant,American Restaurant,Park


In [52]:
# create map
map_clusters = folium.Map(location=[latitude_ny, longitude_ny], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 4. Examine

### Paris Clusters

#### Cluster 0 - Red

In [53]:
paris_merged.loc[paris_merged['Cluster Labels'] == 0, paris_merged.columns[[1] + list(range(5, paris_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Louvre,French Restaurant,Hotel,Japanese Restaurant,Café,Italian Restaurant,Plaza,Historic Site,Cocktail Bar,Art Museum,Restaurant
1,Bourse,French Restaurant,Wine Bar,Cocktail Bar,Hotel,Japanese Restaurant,Restaurant,Boutique,New American Restaurant,Pedestrian Plaza,Pastry Shop
2,Temple,Art Gallery,French Restaurant,Bistro,Coffee Shop,Clothing Store,Wine Bar,Sandwich Place,Cocktail Bar,Café,Burger Joint
3,Hôtel-de-Ville,French Restaurant,Ice Cream Shop,Plaza,Clothing Store,Wine Bar,Art Gallery,Pastry Shop,Cocktail Bar,Falafel Restaurant,Coffee Shop
4,Panthéon,French Restaurant,Bar,Bakery,Plaza,Wine Bar,Café,Greek Restaurant,Pub,Coffee Shop,Museum
5,Luxembourg,French Restaurant,Hotel,Wine Bar,Italian Restaurant,Plaza,Chocolate Shop,Ice Cream Shop,Bookstore,Seafood Restaurant,Theater
9,Entrepôt,Coffee Shop,French Restaurant,Pizza Place,Italian Restaurant,Bistro,Bakery,Cocktail Bar,Seafood Restaurant,Japanese Restaurant,Breakfast Spot
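
The selection used in these cells keeps the rows of one cluster and, for the columns, position 1 (Neighborhood) plus everything from position 5 onward (the ranked venue columns). The sketch below applies the same expression to a small made-up frame with the same column layout; `toy_merged` and its values are hypothetical.

```python
import pandas as pd

# Made-up frame with the same column order as paris_merged (fewer venue columns).
toy_merged = pd.DataFrame({
    "Arrondissement": [1, 2, 3],
    "Neighborhood": ["A", "B", "C"],
    "Latitude": [48.86, 48.87, 48.85],
    "Longitude": [2.33, 2.34, 2.36],
    "Cluster Labels": [0, 1, 0],
    "1st Most Common Venue": ["Park", "Hotel", "Bar"],
})

# Column 1 is the neighborhood name; columns 5+ hold the ranked venues.
keep = [1] + list(range(5, toy_merged.shape[1]))
subset = toy_merged.loc[toy_merged["Cluster Labels"] == 0, toy_merged.columns[keep]]
print(subset)
```

Because the columns are picked by position, the expression silently selects the wrong columns if the layout of the merged frame changes, so selecting by name may be a safer choice.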


#### Cluster 1 - Purple

In [54]:
paris_merged.loc[paris_merged['Cluster Labels'] == 1, paris_merged.columns[[1] + list(range(5, paris_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Palais-Bourbon,French Restaurant,Hotel,Plaza,Café,Historic Site,Italian Restaurant,History Museum,Cocktail Bar,Garden,Cheese Shop
7,Élysée,French Restaurant,Hotel,Italian Restaurant,Art Gallery,Cosmetics Shop,Clothing Store,Café,Boutique,Coffee Shop,Steakhouse
8,Opéra,French Restaurant,Hotel,Italian Restaurant,Wine Bar,Cocktail Bar,Bistro,Pizza Place,Music Venue,Vegetarian / Vegan Restaurant,Bar
13,Observatoire,French Restaurant,Hotel,Italian Restaurant,Bistro,Bar,Japanese Restaurant,Pizza Place,Vietnamese Restaurant,Coffee Shop,Restaurant
14,Vaugirard,French Restaurant,Italian Restaurant,Hotel,Bakery,Persian Restaurant,Thai Restaurant,Coffee Shop,Japanese Restaurant,Lebanese Restaurant,Park
15,Passy,French Restaurant,Bakery,Italian Restaurant,Japanese Restaurant,Chinese Restaurant,Lake,Garden,Art Museum,Park,Plaza
16,Batignolles-Monceau,French Restaurant,Italian Restaurant,Hotel,Bakery,Restaurant,Bistro,Pastry Shop,Park,Bar,Indian Restaurant


#### Cluster 2 - Light Blue

In [55]:
paris_merged.loc[paris_merged['Cluster Labels'] == 2, paris_merged.columns[[1] + list(range(5, paris_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,Reuilly,Lake,Diner,French Restaurant,Bike Rental / Bike Share,Zoo,Hotel,Recreation Center,Park,Café,Exhibit


#### Cluster 3 - Light Green

In [56]:
paris_merged.loc[paris_merged['Cluster Labels'] == 3, paris_merged.columns[[1] + list(range(5, paris_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,Gobelins,Vietnamese Restaurant,Thai Restaurant,Asian Restaurant,Chinese Restaurant,French Restaurant,Hotel,Bakery,Bistro,Supermarket,Café


#### Cluster 4 - Orange

In [57]:
paris_merged.loc[paris_merged['Cluster Labels'] == 4, paris_merged.columns[[1] + list(range(5, paris_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Popincourt,French Restaurant,Bar,Cocktail Bar,Pizza Place,Bistro,Italian Restaurant,Restaurant,Beer Bar,Wine Bar,Vegetarian / Vegan Restaurant
17,Buttes-Montmartre,French Restaurant,Bar,Pizza Place,Italian Restaurant,Bistro,Café,Middle Eastern Restaurant,Restaurant,Park,Plaza
18,Buttes-Chaumont,French Restaurant,Bar,Café,Restaurant,Bistro,Concert Hall,Italian Restaurant,Beer Garden,Seafood Restaurant,Park
19,Ménilmontant,Bar,French Restaurant,Bistro,Bakery,Music Venue,Bookstore,Café,Theater,Korean Restaurant,Stadium


### Manhattan Clusters

#### Cluster 0 - Red

In [58]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,Murray Hill,Coffee Shop,Korean Restaurant,Gym / Fitness Center,Japanese Restaurant,Pizza Place,Sandwich Place,Gym,Chinese Restaurant,Gourmet Shop,Hotel
28,Battery Park City,Park,Coffee Shop,Wine Shop,Hotel,Gym / Fitness Center,Gym,Plaza,Performing Arts Venue,BBQ Joint,Burger Joint
29,Financial District,Coffee Shop,Hotel,Sandwich Place,Steakhouse,Pizza Place,Wine Shop,Cocktail Bar,Gym / Fitness Center,Park,Museum
33,Midtown South,Korean Restaurant,Coffee Shop,Gym / Fitness Center,Hotel,Yoga Studio,Japanese Restaurant,New American Restaurant,Café,Italian Restaurant,Pizza Place


#### Cluster 1 - Purple

In [59]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 1, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Central Harlem,Southern / Soul Food Restaurant,Café,African Restaurant,French Restaurant,Sushi Restaurant,American Restaurant,Seafood Restaurant,Theater,Gym / Fitness Center,Lounge
8,Upper East Side,Exhibit,Italian Restaurant,Bakery,Yoga Studio,Gym / Fitness Center,Coffee Shop,Hotel,Seafood Restaurant,American Restaurant,Spa
9,Yorkville,Italian Restaurant,Gym,Coffee Shop,Ice Cream Shop,Bar,Wine Shop,Deli / Bodega,Thai Restaurant,Mexican Restaurant,Japanese Restaurant
10,Lenox Hill,Italian Restaurant,Sushi Restaurant,Coffee Shop,Gym / Fitness Center,French Restaurant,Dessert Shop,Café,Bakery,Steakhouse,Pizza Place
13,Lincoln Square,Gym / Fitness Center,Coffee Shop,Italian Restaurant,Gym,French Restaurant,Jazz Club,Concert Hall,Wine Bar,Theater,Indie Movie Theater
18,Greenwich Village,Italian Restaurant,Coffee Shop,American Restaurant,Seafood Restaurant,Spa,Pizza Place,French Restaurant,Clothing Store,Yoga Studio,Ice Cream Shop
22,Little Italy,Clothing Store,Chinese Restaurant,Café,Men's Store,Italian Restaurant,American Restaurant,Cocktail Bar,Coffee Shop,Art Gallery,Boutique
23,Soho,Italian Restaurant,French Restaurant,Coffee Shop,Boutique,Café,Clothing Store,Shoe Store,American Restaurant,Men's Store,Women's Store
30,Carnegie Hill,Coffee Shop,Pizza Place,Bakery,Italian Restaurant,Gym,Yoga Studio,Gym / Fitness Center,Spa,Bookstore,Art Museum
32,Civic Center,French Restaurant,Bakery,Hotel,American Restaurant,Coffee Shop,Chinese Restaurant,Cocktail Bar,Gym / Fitness Center,Spa,Yoga Studio


#### Cluster 2 - Light Blue

In [60]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 2, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Chinatown,Chinese Restaurant,Cocktail Bar,Ice Cream Shop,Café,Wine Bar,American Restaurant,Sandwich Place,French Restaurant,Coffee Shop,Shoe Store
4,Hamilton Heights,Coffee Shop,Bar,Mexican Restaurant,Café,Yoga Studio,Caribbean Restaurant,Chinese Restaurant,Sushi Restaurant,American Restaurant,Park
5,Manhattanville,Mexican Restaurant,Park,American Restaurant,Italian Restaurant,Seafood Restaurant,Café,Tennis Court,Coffee Shop,Lounge,Indian Restaurant
11,Roosevelt Island,Park,Sushi Restaurant,Coffee Shop,Pizza Place,Deli / Bodega,Italian Restaurant,Greek Restaurant,Yoga Studio,Tennis Court,Café
12,Upper West Side,Italian Restaurant,Coffee Shop,American Restaurant,Wine Bar,Park,Bakery,Ice Cream Shop,Bar,Pub,Thai Restaurant
19,East Village,Cocktail Bar,Coffee Shop,Wine Bar,Ice Cream Shop,Japanese Restaurant,Korean Restaurant,Chinese Restaurant,Garden,Bagel Shop,Pizza Place
20,Lower East Side,Italian Restaurant,Mexican Restaurant,Coffee Shop,Wine Bar,Ice Cream Shop,Japanese Restaurant,Deli / Bodega,Boutique,Café,Shoe Store
21,Tribeca,Park,Coffee Shop,American Restaurant,Hotel,French Restaurant,Spa,Italian Restaurant,Bakery,Sushi Restaurant,Men's Store
24,West Village,Italian Restaurant,American Restaurant,Wine Bar,Jazz Club,Bakery,New American Restaurant,Coffee Shop,Ice Cream Shop,Park,Chinese Restaurant
25,Manhattan Valley,Park,Coffee Shop,Pizza Place,Grocery Store,Ice Cream Shop,Playground,Chinese Restaurant,Indian Restaurant,Mexican Restaurant,French Restaurant


#### Cluster 3 - Light Green

In [61]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 3, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Clinton,Theater,Italian Restaurant,American Restaurant,Hotel,Bakery,Burger Joint,Indie Theater,Gym / Fitness Center,Pizza Place,Bar
15,Midtown,Theater,Coffee Shop,Hotel,Gym,Plaza,Sandwich Place,Concert Hall,Cuban Restaurant,Park,Pizza Place
17,Chelsea,Art Gallery,American Restaurant,Seafood Restaurant,Hotel,Coffee Shop,Italian Restaurant,Nightclub,Asian Restaurant,Cycle Studio,Theater
27,Gramercy,American Restaurant,New American Restaurant,Mediterranean Restaurant,Hotel,Wine Shop,Cosmetics Shop,Gym,Restaurant,Pizza Place,Indian Restaurant
39,Hudson Yards,Theater,Dance Studio,Gym / Fitness Center,Coffee Shop,Italian Restaurant,Pizza Place,American Restaurant,Hotel,Gym,Indie Theater


#### Cluster 4 - Orange

In [62]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 4, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Marble Hill,Park,Pizza Place,Spanish Restaurant,Café,Supermarket,Mexican Restaurant,Donut Shop,Coffee Shop,Bakery,Bar
2,Washington Heights,Pizza Place,Latin American Restaurant,Bakery,Café,Mexican Restaurant,Grocery Store,Bar,Park,Tapas Restaurant,Deli / Bodega
3,Inwood,Latin American Restaurant,Café,Pizza Place,Deli / Bodega,Mexican Restaurant,Wine Bar,Spanish Restaurant,Restaurant,Bakery,Lounge
7,East Harlem,Mexican Restaurant,Bakery,Café,Pizza Place,Deli / Bodega,Latin American Restaurant,Plaza,Thai Restaurant,Italian Restaurant,Gym


## 5. Observation

Now that we have the contents of the clusters for both Paris and Manhattan, let's characterize each cluster by the venues that distinguish it.

We will use the results to discuss our observations in this section.

### Paris neighbourhood clusters

In [63]:
# create map
map_clusters = folium.Map(location=[latitude_paris, longitude_paris], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(paris_merged["Latitude"], paris_merged["Longitude"], paris_merged["Neighborhood"], paris_merged["Cluster Labels"]):
    label = folium.Popup(str(poi) + " Cluster " + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[cluster-1],
        fill = True,
        fill_color = rainbow[cluster-1],
        fill_opacity = 0.7).add_to(map_clusters)
       
map_clusters

We have five clusters in Paris, two of which (2 and 3) contain only one neighborhood each.

Below is the color coding used in the map above:

Cluster 0 - Red, 
Cluster 1 - Purple, 
Cluster 2 - Light Blue, 
Cluster 3 - Light Green, 
Cluster 4 - Orange 
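
The legend follows from how the map code indexes the palette with `rainbow[cluster-1]`. The sketch below reproduces that lookup with plain color names. The list `palette` is an assumption: it uses the names from the legend, in the order in which five evenly spaced samples of `cm.rainbow` appear to run (purple through red).

```python
# Color names as listed in the legend, in the assumed order of the sampled
# rainbow palette (names are approximate descriptions, not exact hex values).
palette = ["Purple", "Light Blue", "Light Green", "Orange", "Red"]

# The map code uses rainbow[cluster - 1]; for cluster 0 the index is -1,
# which Python resolves to the last element of the list.
legend = {cluster: palette[cluster - 1] for cluster in range(len(palette))}
print(legend)
```

This is why cluster 0 is drawn in red, the last color of the palette, rather than in the first one.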

Restaurants and other food and drink venues make up most of the top 10 venues in nearly every cluster. Paris neighbourhoods are full of restaurants, which is great news for tourists who like to explore restaurants and a variety of foods.

Unsurprisingly, "French Restaurant" is the most common venue in most neighbourhoods. The Light Blue and Light Green clusters are exceptions, but each contains only one neighbourhood, as we observed above, and "French Restaurant" still ranks 3rd and 5th most common in these clusters respectively.
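
The prominence of French restaurants can be checked directly against the cluster tables in section 4. The sketch below copies the "1st Most Common Venue" of each of the 20 Paris neighbourhoods, together with its cluster label, into a plain list (the variable names are introduced here for illustration) and counts them:

```python
from collections import Counter

# (cluster label, 1st most common venue) for the 20 Paris neighbourhoods,
# copied from the cluster tables in section 4.
first_venues = (
    [(0, "French Restaurant")] * 5 + [(0, "Art Gallery"), (0, "Coffee Shop")]
    + [(1, "French Restaurant")] * 7
    + [(2, "Lake")]
    + [(3, "Vietnamese Restaurant")]
    + [(4, "French Restaurant")] * 3 + [(4, "Bar")]
)

overall = Counter(venue for _, venue in first_venues)
per_cluster = Counter(first_venues)
print(overall.most_common(1))
print(per_cluster[(0, "French Restaurant")], per_cluster[(1, "French Restaurant")])
```

By this count, "French Restaurant" is the top venue in 15 of the 20 neighbourhoods, and in every neighbourhood of the Purple cluster.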

The top 4 most common venues in the Light Green cluster (Gobelins) are Asian restaurants, which suggests that this neighbourhood has a large Asian community. Tourists from Asia may feel at home here.

In most of the Purple cluster's neighbourhoods, "Hotel" is the 2nd most common venue. These neighbourhoods have a large concentration of hotels, so tourists can consider the Purple cluster when planning where to stay.

Kids may find less entertainment in Paris (other than Disneyland), as Park and Zoo rarely rank among the top venues. They appear in only a few neighbourhoods, and mostly outside the top 5. This may be mainly because the ratings and reviews on Foursquare are written by adults, so I think the recommendations are biased towards adult interests.

Most clusters have bars, bistros and other hangouts among their top 5 most common venues, so I think adults will definitely enjoy their stay in Paris.

Overall, tourists can find plenty of options in Paris, which is why it is one of the world's top tourist destinations: it has something for everyone.

### Manhattan neighbourhood clusters

In [64]:
# create map
map_clusters = folium.Map(location=[latitude_ny, longitude_ny], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

We have five clusters in Manhattan.

Below is the color coding used in the map above:

Cluster 0 - Red, 
Cluster 1 - Purple, 
Cluster 2 - Light Blue, 
Cluster 3 - Light Green, 
Cluster 4 - Orange 

Unsurprisingly, restaurants and other food places are the most common venues in Manhattan too.

I like the "Blue" cluster because it is not dominated by any one type of food. It is a good mix of American, Italian, Asian, Latin American and Mexican restaurants, among others. I think this cluster caters to the needs of every type of tourist.

In the "Light Blue" cluster, Park is often the most common venue.

The "Purple" cluster leans slightly towards Asian restaurants, though it has a good mix of other restaurants too.

The "Light Green" cluster stands out for having non-restaurant venues, "Theater" and "Art Gallery", as its most common venues. Tourists interested in theater and art galleries can visit the neighbourhoods in the "Light Green" cluster.

Every cluster has a good mix of restaurants, gyms, hotels and bars.

## 6. Conclusion

In this report, we looked for similarities between the neighborhoods of Paris and Manhattan using Foursquare venue recommendation data. We used only the frequencies of venue categories in each neighborhood for the comparison. Comparing the resulting clusters, we found many similarities between Paris and Manhattan.
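
The comparison in this report was made by reading the cluster tables. One simple way to put a number on such a comparison, offered here only as an illustrative sketch and not as the method used in this report, is the Jaccard overlap between two top-10 venue lists. The function `jaccard` and the choice of Gobelins and Chinatown as the pair to compare are assumptions made for the example; the venue lists themselves are copied from the tables in section 4.

```python
def jaccard(a, b):
    """Share of distinct categories that two top-venue lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Top-10 lists copied from the cluster tables in section 4.
gobelins = ["Vietnamese Restaurant", "Thai Restaurant", "Asian Restaurant",
            "Chinese Restaurant", "French Restaurant", "Hotel", "Bakery",
            "Bistro", "Supermarket", "Café"]
chinatown = ["Chinese Restaurant", "Cocktail Bar", "Ice Cream Shop", "Café",
             "Wine Bar", "American Restaurant", "Sandwich Place",
             "French Restaurant", "Coffee Shop", "Shoe Store"]

shared = sorted(set(gobelins) & set(chinatown))
score = jaccard(gobelins, chinatown)
print(shared, round(score, 2))
```

A score of 0 means the two lists share no categories and 1 means they are identical. The measure ignores rank order, so it is only a coarse indication of similarity.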

Tourists will find both Paris and Manhattan attractive and can plan their visits with confidence.

The travel booking site can present these similarities and findings to potential tourists to help them book with confidence.