# Capstone Project - The Battle of Neighborhoods (Final Project)

# Business Problem


We want to find the best location for opening a new restaurant in Paris.

## Background and Problem Statement

Paris has a large number of restaurants, and competition is stiff. Before opening a new restaurant, it is important to study the different districts. We want to open a Mexican restaurant, so we need to find the location most likely to attract customers.

## Objective

In this assignment, we want to find the most suitable of the 20 districts of Paris, France, in which to open a **Mexican Restaurant**.


## Audience

First, we want to reach the Mexican community in Paris in order to build a loyal clientele. We also want a busy location where there is little competition for this type of restaurant.

# Our Approach

First, we build the Paris neighborhood data (postcode, neighborhood).

Second, we build the coordinates of all districts in Paris, France.

Third, we explore and segment the neighborhoods, then use k-means to cluster them based on the top 10 venue categories of each district.

Finally, we analyze the clustering result and propose candidate districts in Paris for the new restaurant. We then give some perspectives for improving the analysis.

## Data Settings

To explore our problem, we need to build the Paris neighborhood data and the coordinates of each district.

For the Paris neighborhood data, we use the following references:

- Paris Arrondissements & Neighborhoods Map (https://parismap360.com/paris-arrondissement-map#.XfVpqtEo91l)
- Arrondissements in Paris, France (https://francetravelplanner.com/go/paris/areas/arrondismt.html)

For the coordinates (latitude, longitude) of each district in Paris, we use the geopy package to convert an address into latitude and longitude values.

In [10]:
import os
import pandas as pd

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from geopy.extra.rate_limiter import RateLimiter

In [20]:
COL_NAME_POSTCODE = "postcode"
COL_NAME_COUNTRY = "country"
COL_NAME_ADDRESS = "address"
COL_NAME_LOCATION = "location"
COL_NAME_POINT = "point"
COL_NAME_LATITUDE = "latitude"
COL_NAME_LONGITUDE = "longitude"
COL_NAME_ALTITUDE = "altitude"
COL_NAME_NEIGHBOURHOOD = "neighbourhood"

file_coordinate_path = "./var/Geospatial_Coordinates_Paris.csv"
file_neighbourhood_path = "./var/Paris_Neighbourhood.csv"

## Building the Paris neighborhood data

## Creating the Paris neighbourhood data or loading it from CSV

In [12]:
if os.path.exists(file_neighbourhood_path):
    print("Loading Paris neighbourhood data from file : %s" % file_neighbourhood_path)
    df_neighbourhood = pd.read_csv(file_neighbourhood_path, header=0)
else:
    # Neighbourhood data for Paris, built by hand from the information at:
    # https://parismap360.com/paris-arrondissement-map#.XfXp89Eo91m
    # https://francetravelplanner.com/go/paris/areas/arrondismt.html
    list_neighbourhood = [
    ["75001", "75002"], ["75001", "75003"], ["75001", "75004"], ["75001", "75005"], 
    ["75001", "75006"], ["75001", "75007"], ["75001", "75008"], ["75001", "75009"], 
    ["75002", "75001"], ["75002", "75003"], ["75002", "75009"], ["75002", "75010"],
    ["75003", "75001"], ["75003", "75002"], ["75003", "75004"], ["75003", "75010"],
    ["75003", "75011"], ["75004", "75001"], ["75004", "75003"], ["75004", "75005"],
    ["75004", "75006"], ["75004", "75011"], ["75004", "75012"], ["75005", "75001"],
    ["75005", "75004"], ["75005", "75006"], ["75005", "75012"], ["75005", "75013"],
    ["75005", "75014"], ["75006", "75001"], ["75006", "75004"], ["75006", "75005"],
    ["75006", "75007"], ["75006", "75014"], ["75006", "75015"], ["75007", "75001"],
    ["75007", "75006"], ["75007", "75008"], ["75007", "75015"], ["75007", "75016"],
    ["75008", "75001"], ["75008", "75007"], ["75008", "75009"], ["75008", "75016"],
    ["75008", "75017"], ["75008", "75018"], ["75009", "75001"], ["75009", "75002"],
    ["75009", "75008"], ["75009", "75010"], ["75009", "75017"], ["75009", "75018"],
    ["75010", "75002"], ["75010", "75003"], ["75010", "75009"], ["75010", "75011"],
    ["75010", "75018"], ["75010", "75019"], ["75010", "75020"], ["75011", "75003"],
    ["75011", "75004"], ["75011", "75010"], ["75011", "75012"], ["75011", "75019"],
    ["75011", "75020"], ["75012", "75004"], ["75012", "75005"], ["75012", "75011"],
    ["75012", "75013"], ["75012", "75020"], ["75013", "75005"], ["75013", "75012"],
    ["75013", "75014"], ["75014", "75005"], ["75014", "75006"], ["75014", "75013"],
    ["75014", "75015"], ["75015", "75006"], ["75015", "75007"], ["75015", "75014"],
    ["75015", "75016"], ["75016", "75007"], ["75016", "75008"], ["75016", "75015"],
    ["75016", "75017"], ["75017", "75008"], ["75017", "75009"], ["75017", "75016"],
    ["75017", "75018"], ["75018", "75008"], ["75018", "75009"], ["75018", "75010"],
    ["75018", "75017"], ["75018", "75019"], ["75019", "75010"], ["75019", "75011"],
    ["75019", "75018"], ["75019", "75020"], ["75020", "75010"], ["75020", "75011"],
    ["75020", "75012"], ["75020", "75019"]]

    df_neighbourhood = pd.DataFrame(data=list_neighbourhood, columns=[COL_NAME_POSTCODE, COL_NAME_NEIGHBOURHOOD])

    # Create the output directory first; without it, to_csv raises FileNotFoundError
    os.makedirs(os.path.dirname(file_neighbourhood_path), exist_ok=True)
    df_neighbourhood.to_csv(file_neighbourhood_path, header=True, index=False)

## Combining the neighborhoods that have the same Postcode

Each district of Paris appears on several rows, one per neighbourhood. We therefore combine all the neighbourhoods of each district into a single row.


In [13]:
# Quickly review the dataframe information
df_neighbourhood.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   postcode       102 non-null    object
 1   neighbourhood  102 non-null    object
dtypes: object(2)
memory usage: 1.7+ KB
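
Because the neighbourhood list above is typed by hand, a quick consistency check may be worthwhile: if district A lists B as a neighbour, B would be expected to list A. The following is a minimal sketch, assuming pairs in the same `[postcode, neighbour]` form as `list_neighbourhood`. The helper name `find_asymmetric_pairs` is hypothetical and not part of the notebook.

```python
def find_asymmetric_pairs(pairs):
    """Return the (postcode, neighbour) pairs whose reverse pair is missing."""
    seen = {(a, b) for a, b in pairs}
    return sorted((a, b) for a, b in seen if (b, a) not in seen)


# Small illustrative sample (not the full Paris list)
sample = [["75001", "75002"], ["75002", "75001"], ["75001", "75003"]]
print(find_asymmetric_pairs(sample))  # [('75001', '75003')]
```

Calling `find_asymmetric_pairs(list_neighbourhood)` on the full list would return an empty list if every pair has its mirror.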


In [14]:
# Convert all values in the dataframe to strings
df_neighbourhood = df_neighbourhood.astype(str)

In [15]:
df_neighbourhood.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   postcode       102 non-null    object
 1   neighbourhood  102 non-null    object
dtypes: object(2)
memory usage: 1.7+ KB


In [18]:
df_combined = df_neighbourhood.groupby(by=[COL_NAME_POSTCODE]).agg(lambda x: ",".join(x)).reset_index()
df_combined

|   | postcode | neighbourhood |
|---|----------|---------------|
| 0 | 75001 | 75002,75003,75004,75005,75006,75007,75008,75009 |
| 1 | 75002 | 75001,75003,75009,75010 |
| 2 | 75003 | 75001,75002,75004,75010,75011 |
| 3 | 75004 | 75001,75003,75005,75006,75011,75012 |
| 4 | 75005 | 75001,75004,75006,75012,75013,75014 |
| 5 | 75006 | 75001,75004,75005,75007,75014,75015 |
| 6 | 75007 | 75001,75006,75008,75015,75016 |
| 7 | 75008 | 75001,75007,75009,75016,75017,75018 |
| 8 | 75009 | 75001,75002,75008,75010,75017,75018 |
| 9 | 75010 | 75002,75003,75009,75011,75018,75019,75020 |


## Building the Coordinates of All Districts in Paris


In [19]:
if os.path.exists(file_coordinate_path):
    print("Loading file input : {}".format(file_coordinate_path))
    df_coordinates = pd.read_csv(file_coordinate_path, header=0)
else:
    # In Paris, France, there are 20 districts
    list_of_districts_in_Paris = ["750" + str(x).zfill(2) for x in range(1, 21)]
    
    # Create DataFrame with given list of districts of Paris
    df_coordinates = pd.DataFrame(data=list_of_districts_in_Paris, columns=[COL_NAME_POSTCODE])

    df_coordinates[COL_NAME_COUNTRY] = "FR"
    df_coordinates[COL_NAME_ADDRESS] = df_coordinates.apply(lambda row: str(row[COL_NAME_POSTCODE]) + ", " + row[COL_NAME_COUNTRY], axis=1)

    locator = Nominatim(user_agent="paris_explorer")

    # convenient function to delay between geocoding calls
    geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

    # create column "location"
    df_coordinates[COL_NAME_LOCATION] = df_coordinates[COL_NAME_ADDRESS].apply(geocode)

    # extract (latitude, longitude, altitude) from the location column (returns a tuple)
    df_coordinates[COL_NAME_POINT] = df_coordinates[COL_NAME_LOCATION].apply(lambda loc: tuple(loc.point) if loc else None)

    # split point column into latitude, longitude and altitude columns
    df_coordinates[[COL_NAME_LATITUDE, COL_NAME_LONGITUDE, COL_NAME_ALTITUDE]] = pd.DataFrame(df_coordinates[COL_NAME_POINT].tolist(), index=df_coordinates.index)
    
    # save to CSV, creating the output directory first if it does not exist
    os.makedirs(os.path.dirname(file_coordinate_path), exist_ok=True)
    df_coordinates.to_csv(file_coordinate_path, header=True, index=False)

In [None]:
df_coordinates.info()


In [None]:
# Removing the useless columns
df_coordinates.drop([COL_NAME_COUNTRY, COL_NAME_POINT, COL_NAME_ALTITUDE, COL_NAME_LOCATION], axis=1, inplace=True)


In [None]:
# Converting postcode to string
df_coordinates[COL_NAME_POSTCODE] = df_coordinates[COL_NAME_POSTCODE].astype(str)

In [None]:
df_coordinates.head()
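
A geocoder can return `None` for an address it cannot resolve, in which case the corresponding district ends up without coordinates. The sketch below shows one way to keep such failures visible instead of letting them surface later as errors. It is an illustration only: `fake_geocode` and `FakeLocation` are hypothetical offline stand-ins, and in the notebook the rate-limited `geocode` callable would be passed instead.

```python
from collections import namedtuple

# Stand-in for a geopy Location (only the attributes used here)
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def fake_geocode(address):
    """Hypothetical offline stub: resolves one address, fails on all others."""
    known = {"75001, FR": FakeLocation(48.86, 2.34)}  # illustrative values only
    return known.get(address)


def geocode_addresses(addresses, geocode):
    """Return one (address, latitude, longitude) row per address; None marks a failed lookup."""
    rows = []
    for address in addresses:
        location = geocode(address)
        if location is None:
            rows.append((address, None, None))
        else:
            rows.append((address, location.latitude, location.longitude))
    return rows


rows = geocode_addresses(["75001, FR", "unknown"], fake_geocode)
failed = [address for address, lat, _ in rows if lat is None]
print(failed)  # ['unknown']
```

Listing the failed addresses makes it easy to retry them or fill them in by hand before merging.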

## Let's review the coordinates of district 1

In [None]:
df_coordinates[df_coordinates[COL_NAME_POSTCODE]=="75001"]


## Let's review the coordinates of district 2

In [None]:
df_coordinates[df_coordinates[COL_NAME_POSTCODE]=="75002"]


## Merging two dataframes


In [None]:
# List of columns in dataframe df_combined
df_combined.columns

In [None]:
# List of columns in dataframe df_coordinates
df_coordinates.columns

In [None]:
df_combined.head(2)

In [None]:
df_coordinates.head(2)

In [None]:
df_merged = pd.merge(df_combined, df_coordinates, 
                     left_on=COL_NAME_POSTCODE, right_on=COL_NAME_POSTCODE,
                     how="inner")

In [None]:
df_merged
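
An inner join silently drops any postcode that is missing on one side, and a dtype mismatch between the two `postcode` columns (for example integers loaded from CSV versus strings) can leave the result empty. The sketch below, on toy data with illustrative values, uses the `validate` and `indicator` options of `pd.merge` to surface such problems. It is one possible check, not the notebook's own method.

```python
import pandas as pd

# Toy stand-ins for df_combined and df_coordinates (values are illustrative)
left = pd.DataFrame({"postcode": ["75001", "75002", "75003"],
                     "neighbourhood": ["75002,75003", "75001,75003", "75001,75002"]})
right = pd.DataFrame({"postcode": ["75001", "75002"],
                      "latitude": [48.86, 48.87],
                      "longitude": [2.34, 2.34]})

# validate raises MergeError if a postcode is duplicated on either side;
# indicator adds a "_merge" column showing where each row came from
checked = pd.merge(left, right, on="postcode", how="left",
                   validate="one_to_one", indicator=True)

missing = checked.loc[checked["_merge"] == "left_only", "postcode"].tolist()
print(missing)  # ['75003']
```

An empty `missing` list would indicate that every district found its coordinates.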


## Let's review general information of dataframe


In [None]:
df_merged.info()


In [None]:
df_merged.describe()


## Getting the size of merged dataframe


In [None]:
print("(row, column) = %s" % str(df_merged.shape))


## Exploring and clustering the neighborhoods in Paris


## Listing distinct districts


In [None]:
df_merged[COL_NAME_POSTCODE].unique()


Quickly examine the resulting dataframe.


In [None]:
print("(row, column) = %s" % str(df_merged.shape))


In [None]:
df_merged.info()


In [8]:
df_merged.head(3)



In [None]:
print('The dataframe has {} districts and {} neighborhoods.'.format(
      df_merged[COL_NAME_POSTCODE].nunique(),
      df_merged.shape[0]))

## Using geopy library to get the latitude and longitude values of Paris


In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent paris_explorer, as shown below.


In [None]:
# Get the coordinate of Paris, France
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

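def get_latitude_longitude(address=""):
    """Return (latitude, longitude) for an address, or (None, None) if it cannot be geocoded."""
    if not address:
        return None, None
    
    geolocator = Nominatim(user_agent="paris_explorer")
    location = geolocator.geocode(address)
    if location is None:  # geocode returns None when nothing matches
        return None, None
    return location.latitude, location.longitude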

def get_latitude_longitude_paris_fr():
    address = 'Paris, FR'
    return get_latitude_longitude(address)

latitude, longitude = get_latitude_longitude_paris_fr()
print('The geographical coordinates of Paris, FR are {}, {}.'.format(latitude, longitude))

## Creating a map of Paris with neighborhoods superimposed on top


In [None]:
import folium

# create map using latitude and longitude values
m = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, district, neighborhood in zip(df_merged[COL_NAME_LATITUDE], 
                                            df_merged[COL_NAME_LONGITUDE], 
                                            df_merged[COL_NAME_POSTCODE], 
                                            df_merged[COL_NAME_NEIGHBOURHOOD]):
    label = 'District:{}, Neighbourhood:{}'.format(district, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(m)  
    
m

## Defining Foursquare Credentials and Version


Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.


In [None]:
CLIENT_ID = 'XXX'     # Foursquare ID
CLIENT_SECRET = 'XXX' # Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

## Let's explore the first neighborhood in our dataframe


Get the neighborhood's name.


In [None]:
df_merged.loc[0, COL_NAME_NEIGHBOURHOOD]


Get the neighborhood's latitude and longitude values.


In [None]:
neighborhood_latitude = df_merged.loc[0, COL_NAME_LATITUDE]   # neighborhood latitude value
neighborhood_longitude = df_merged.loc[0, COL_NAME_LONGITUDE] # neighborhood longitude value

neighborhood_name = df_merged.loc[0, COL_NAME_NEIGHBOURHOOD] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))


## Now, let's get the top 100 venues that are in "75002,75003,75004,75005,75006,75007,75008,75009" within a radius of 500 meters.


First, let's create the GET request URL and store it in a variable named `url`.


In [None]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

url # display URL
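
As an alternative to formatting the query string by hand, the parameters can be collected in a dictionary and escaped with the standard library. The sketch below assumes the same endpoint and parameter names as the cell above. The helper `build_explore_url` and the constant `EXPLORE_ENDPOINT` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"  # same endpoint as above


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL, letting urlencode escape each parameter value."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials and illustrative coordinates
example_url = build_explore_url("XXX", "XXX", "20180604", 48.86, 2.34, 500, 100)
print(example_url)
```

Keeping the parameters in one dictionary also makes it easier to reuse them for every district in a loop.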

Send the GET request and examine the results.


In [None]:
import requests # library to handle requests

results = requests.get(url).json()
results

Let's borrow the get_category_type function from the Foursquare lab.


In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # fall back to the flattened column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a pandas dataframe.


In [None]:
# pd.json_normalize (pandas >= 1.0) transforms nested JSON into a flat pandas dataframe
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

And how many venues were returned by Foursquare?


In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))
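
To see what the flattening step does without calling the API, the same operations can be applied to a small mock response. The structure below mimics the `items` list used above. The venue names and coordinates are made up for illustration, and the empty `categories` list shows how a venue without a category is handled.

```python
import pandas as pd

# Mock of the "items" list in an explore response (illustrative values)
mock_items = [
    {"venue": {"name": "Venue A",
               "categories": [{"name": "French Restaurant"}],
               "location": {"lat": 48.861, "lng": 2.341}}},
    {"venue": {"name": "Venue B",
               "categories": [],  # a venue without any category
               "location": {"lat": 48.862, "lng": 2.342}}},
]

flat = pd.json_normalize(mock_items)  # nested keys become dotted column names
flat = flat[["venue.name", "venue.categories",
             "venue.location.lat", "venue.location.lng"]].copy()

# keep the first category name, or None when the list is empty
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None)

# keep only the last part of each dotted column name
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```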


## Let's create a function to repeat the same process to all the neighborhoods


In [None]:
COL_NAME_VENUE = "Venue"
COL_NAME_CATEGORY = "Category"

COL_NAME_NEIGHBOURHOOD_LATITUDE = COL_NAME_NEIGHBOURHOOD + " " + COL_NAME_LATITUDE
COL_NAME_NEIGHBOURHOOD_LONGITUDE = COL_NAME_NEIGHBOURHOOD + " " + COL_NAME_LONGITUDE
COL_NAME_VENUE_LATITUDE = COL_NAME_VENUE + " " + COL_NAME_LATITUDE
COL_NAME_VENUE_LONGITUDE = COL_NAME_VENUE + " " + COL_NAME_LONGITUDE
COL_NAME_VENUE_CATEGORY = COL_NAME_VENUE + " " + COL_NAME_CATEGORY


def get_near_by_venues(names, latitudes, longitudes, radius=500):    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        # (a venue without any category gets None instead of raising IndexError)
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = [COL_NAME_NEIGHBOURHOOD, 
                             COL_NAME_NEIGHBOURHOOD_LATITUDE,
                             COL_NAME_NEIGHBOURHOOD_LONGITUDE,
                             COL_NAME_VENUE,
                             COL_NAME_VENUE_LATITUDE,
                             COL_NAME_VENUE_LONGITUDE,
                             COL_NAME_VENUE_CATEGORY]
    return nearby_venues

## Getting dataframe that contains all the neighborhoods of Paris


In [None]:
venues_neighbourhoods = get_near_by_venues(
    names=df_merged[COL_NAME_NEIGHBOURHOOD],
    latitudes=df_merged[COL_NAME_LATITUDE],                           
    longitudes=df_merged[COL_NAME_LONGITUDE])

## Let's check the size of the resulting dataframe


In [None]:
print("(row, column) = %s" % str(venues_neighbourhoods.shape))
venues_neighbourhoods.head()

## Let's check how many venues were returned for each neighborhood


In [None]:
venues_neighbourhoods.groupby(COL_NAME_NEIGHBOURHOOD).count()


## Let's find out how many unique categories can be curated from all the returned venues


In [None]:
venues_neighbourhoods[COL_NAME_VENUE_CATEGORY].unique()


In [None]:
print('There are {} unique categories.'.format(
    len(venues_neighbourhoods[COL_NAME_VENUE_CATEGORY].unique())))

## Analyzing Each Neighborhood District in Paris


In [None]:
# one hot encoding
df_onehot = pd.get_dummies(venues_neighbourhoods[[COL_NAME_VENUE_CATEGORY]], 
                                        prefix="", 
                                        prefix_sep="")

# add neighborhood column back to dataframe
df_onehot[COL_NAME_NEIGHBOURHOOD] = venues_neighbourhoods[COL_NAME_NEIGHBOURHOOD] 

# move neighborhood column to the first column
fixed_columns = [df_onehot.columns[-1]] + list(df_onehot.columns[:-1])
df_onehot = df_onehot[fixed_columns]

df_onehot.head()

And let's examine the new dataframe size.


In [None]:
df_onehot.shape


## Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [None]:
df_grouped = df_onehot.groupby(COL_NAME_NEIGHBOURHOOD).mean().reset_index()
df_grouped.head()
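
The effect of one-hot encoding followed by a grouped mean is easiest to see on a tiny example. The sketch below uses made-up venues for two districts named A and B. The column names are illustrative and do not come from the Foursquare results.

```python
import pandas as pd

# Toy venue list (illustrative): two venues in district A, one in district B
toy_venues = pd.DataFrame({
    "neighbourhood": ["A", "A", "B"],
    "category": ["Bar", "French Restaurant", "French Restaurant"],
})

# One indicator column per category (dtype=float so the mean is a plain frequency)
toy_onehot = pd.get_dummies(toy_venues["category"], dtype=float)
toy_onehot.insert(0, "neighbourhood", toy_venues["neighbourhood"])

# Mean of the indicators = share of each category among a district's venues
toy_grouped = toy_onehot.groupby("neighbourhood").mean().reset_index()
print(toy_grouped)
```

District A has one bar and one French restaurant, so each category gets a frequency of 0.5. District B has only a French restaurant, so that category gets 1.0.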

## Let's confirm the new size


In [None]:
print("(row, column) = %s" % str(df_grouped.shape))


## Let's print each neighborhood along with the top 5 most common venues


In [None]:
num_top_venues = 5
COL_NAME_FREQUENCE = 'freq'

for hood in df_grouped[COL_NAME_NEIGHBOURHOOD]:
    print("----"+hood+"----")
    temp = df_grouped[df_grouped[COL_NAME_NEIGHBOURHOOD] == hood].T.reset_index()
    temp.columns = [COL_NAME_VENUE, COL_NAME_FREQUENCE]
    temp = temp.iloc[1:]
    temp[COL_NAME_FREQUENCE] = temp[COL_NAME_FREQUENCE].astype(float)
    temp = temp.round({COL_NAME_FREQUENCE: 2})
    print(temp.sort_values(COL_NAME_FREQUENCE, ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

## Let's put that into a pandas dataframe


First, let's write a function to sort the venues in descending order.


In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.


In [None]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = [COL_NAME_NEIGHBOURHOOD]
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # beyond 'st', 'nd', 'rd', use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted[COL_NAME_NEIGHBOURHOOD] = df_grouped[COL_NAME_NEIGHBOURHOOD]

for ind in np.arange(df_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_grouped.iloc[ind, :], 
                                                                          num_top_venues)

neighborhoods_venues_sorted.head()
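
The column labels above rely on an `IndexError` to pick the "th" suffix, which works for ten columns but would label a 21st column as "21th". If more columns were ever needed, a small helper could compute the suffix directly. This is a sketch only, and the name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


column_names = ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(column_names[:4])
```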

## Result and Analysis 

## Clustering Neighborhoods of Paris, France


Run k-means to cluster the neighborhood into 6 clusters.


In [None]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 6

clustering_grouped_paris = df_grouped.drop(columns=[COL_NAME_NEIGHBOURHOOD])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(clustering_grouped_paris)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
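
The number of clusters is fixed at 6 above. One common way to compare candidate values of k is the mean silhouette score, which is higher when clusters are compact and well separated. The sketch below runs on synthetic data so that it is self-contained. The helper `silhouette_by_k` is a hypothetical name, and this comparison is a suggestion rather than the procedure used to choose 6.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features, k_values, random_state=0):
    """Fit k-means for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores


# Synthetic stand-in data: three tight, well-separated groups of 20 points each
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
toy_features = np.vstack([center + rng.normal(scale=0.5, size=(20, 2))
                          for center in centers])

scores = silhouette_by_k(toy_features, range(2, 6))
best_k = max(scores, key=scores.get)
print(best_k)
```

With the notebook's data, the call would take `clustering_grouped_paris` as `features`. The silhouette score requires k to be at least 2 and smaller than the number of rows, and with only 20 districts the scores should be read as a rough guide.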

## Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

In [None]:
COL_NAME_CLUSTER_LABELS = 'Cluster Labels'

# add clustering labels
neighborhoods_venues_sorted.insert(0, COL_NAME_CLUSTER_LABELS, kmeans.labels_)

df_merged_paris = df_merged

df_merged_paris = df_merged_paris.join(neighborhoods_venues_sorted.set_index(COL_NAME_NEIGHBOURHOOD), 
                                                           on=COL_NAME_NEIGHBOURHOOD)

df_merged_paris.head() # check the last columns!

## Let's visualize the resulting clusters


In [None]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Let's get the geographical coordinates of Paris, France
latitude, longitude = get_latitude_longitude_paris_fr()
print('The geographical coordinates of Paris, FR are {}, {}.'.format(latitude, longitude))
# ------------------------------------------------------------------------------------------------

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, district, poi, cluster in zip(df_merged_paris[COL_NAME_LATITUDE], 
                                  df_merged_paris[COL_NAME_LONGITUDE],
                                  df_merged_paris[COL_NAME_POSTCODE],
                                  df_merged_paris[COL_NAME_NEIGHBOURHOOD], 
                                  df_merged_paris[COL_NAME_CLUSTER_LABELS]):
    label = 'District:{}, Neighbourhood:{}, Number of Cluster:{}'.format(district, poi, cluster+1)
    label = folium.Popup(label,
                         parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examining Clusters


Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster.


Based on the defining categories, we can then assign a name to each cluster.


## Cluster 1 

In [None]:
df_merged_paris.loc[df_merged_paris[COL_NAME_CLUSTER_LABELS] == 0, 
                    df_merged_paris.columns[[1] + list(range(5, df_merged_paris.shape[1]))]]

## Cluster 2

In [None]:
df_merged_paris.loc[df_merged_paris[COL_NAME_CLUSTER_LABELS] == 1, 
                    df_merged_paris.columns[[1] + list(range(5, df_merged_paris.shape[1]))]]

## Cluster 3

In [None]:
df_merged_paris.loc[df_merged_paris[COL_NAME_CLUSTER_LABELS] == 2, 
                    df_merged_paris.columns[[1] + list(range(5, df_merged_paris.shape[1]))]]

## Cluster 4

In [None]:
df_merged_paris.loc[df_merged_paris[COL_NAME_CLUSTER_LABELS] == 3, 
                    df_merged_paris.columns[[1] + list(range(5, df_merged_paris.shape[1]))]]

## Cluster 5

In [None]:
df_merged_paris.loc[df_merged_paris[COL_NAME_CLUSTER_LABELS] == 4, 
                    df_merged_paris.columns[[1] + list(range(5, df_merged_paris.shape[1]))]]

## Cluster 6

In [None]:
df_merged_paris.loc[df_merged_paris[COL_NAME_CLUSTER_LABELS] == 5, 
                    df_merged_paris.columns[[1] + list(range(5, df_merged_paris.shape[1]))]]
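
When reading the cluster tables above, it can help to look up directly how frequent one category is in each district and where it ranks among all categories. The sketch below does this on toy frequencies shaped like `df_grouped`. The helper `category_rank_by_district` is a hypothetical name, and the numbers are illustrative, not Foursquare results.

```python
import pandas as pd


def category_rank_by_district(grouped, category, name_col="neighbourhood"):
    """For each district, return the frequency of `category` and its rank among all categories."""
    freqs = grouped.set_index(name_col)
    # rank 1 = most frequent category in that district; ties share the best rank
    ranks = freqs.rank(axis=1, ascending=False, method="min")
    result = pd.DataFrame({"freq": freqs[category],
                           "rank": ranks[category].astype(int)})
    return result.sort_values("freq", ascending=False)


# Toy frequencies (illustrative only)
toy = pd.DataFrame({
    "neighbourhood": ["A", "B"],
    "French Restaurant": [0.5, 0.2],
    "Vietnamese Restaurant": [0.3, 0.6],
    "Bar": [0.2, 0.2],
})
print(category_rank_by_district(toy, "Vietnamese Restaurant"))
```

In the notebook, `df_grouped` and `COL_NAME_NEIGHBOURHOOD` would take the place of the toy frame and the default column name.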

## Conclusion and Perspective


## Conclusion


The results above cluster the districts based on the top 10 venue categories of each neighborhood.
French Restaurant is the most common venue category in most districts of Paris.
Reviewing the clusters, we find Vietnamese restaurants in cluster 5: Vietnamese Restaurant is the second most common venue category in District 13 and the 10th most common in District 2.
The recommendation therefore depends on the investors' requirements. To open a new Vietnamese restaurant in a district that already has many of them, we should choose District 13.
Alternatively, we could open it in District 2, which also has a suitable community for a Vietnamese restaurant.
Finally, if the investors would like a district similar to District 13, we could choose one of the other districts grouped in cluster 5, such as Districts 3, 4, or 16.

## Perspectives

To enrich the features of each district, we could add more relevant information, such as:

- transport information (public transport, parking, etc.),
- information on Asian communities,
- information on major tourist venues.

To improve the clustering, we could experiment with other algorithms, for instance:

- Fuzzy c-means
- DBSCAN (density-based clustering)
- Hierarchical k-means clustering
- HCPC (hierarchical clustering on principal components)
- Deep learning models; for more detail, see "A Survey of Clustering With Deep Learning: From the Perspective of Network Architecture" (2018), https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8412085

## References

- The tutorials in the course "Applied Data Science Capstone" (https://www.coursera.org/learn/applied-data-science-capstone/)
- Paris Arrondissements & Neighborhoods Map (https://parismap360.com/paris-arrondissement-map#.XfVpqtEo91l)
- Arrondissements in Paris, France (https://francetravelplanner.com/go/paris/areas/arrondismt.html)