<h1 align=center><font size = 5> Data science project - Best location for a new restaurant in Paris ? </font></h1>

<h2 align=center><font size = 4>Manda RAZAFIMANANTSOA </font></h2>


<h3 align=center><font size = 2> March 28, 2021 </font></h3>

## 1.	Introduction

### 1.1.	Problem and interest </br>

A multinational entrepreneur decided to open a new gastronomic restaurant in Paris. Paris is the capital and most populous city of France with an estimated population of 12 millions as of 2018. 

Creating a new restaurant in the capital is a big decision and requires much preparation and a strong business study from the entrepreneur. 

This project will help the entrepreneur to find the optimal location (or group of location) to build up his business.


### 1.2.	Methodology

The aim of the analysis is to hihglight the best gastronomical and touristic location in order to make valuable suggestion to the entrepreneur. To do that, we will use machine learning K-Means algorithm to divide Paris into several clusters and give a form of "rating" from the most to the least interesting regions where to start a restaurant business. 

But just prior to this, we will first explore Paris and its departments, get all venues informations (data) per department and neighborhood. This will help us to easily highlights the most visited and gastronomical area. 

## 2.	Data acquisition


All Geographical coordinates of Paris departments will be downloaded here: https://www.data.gouv.fr/fr/datasets/geofla-departements-idf/#_. We will create a table containing each district as row whereas longitud and latitud as columns. 


In the next step, we collect venues for each department (listed in the table mentionned above) and see which venues are the most common. In this step, we will use Foursquare API to collect all data regarding venues (name, category, GPS coordinates, ratings and even photos). After collecting the data and organising into a pandas dataframe, we will have a table that show the top 10 venues for each department. 

The last task is to use unsupervised machine learning techniques to cluster them according to the most common venues and visualize all clusers using Folio library (Folium.map). 







#### Let's create a table containing all departments in Paris

In [1]:
import json
import pandas as pd
import numpy as np 

import requests # library to handle requests

# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')


Libraries imported.


### 1 Data description

#### The data about Paris is available here 

In [20]:
file = open('data/code-insee-code-postal.json', "r")
text = file.read()
text = json.loads(text)

paris_bourough_df = pd.DataFrame(pd.json_normalize(text))
paris_bourough_df.head()


Unnamed: 0,datasetid,recordid,record_timestamp,fields.code_comm,fields.nom_dept,fields.statut,fields.z_moyen,fields.nom_region,fields.code_reg,fields.insee_com,...,fields.id_geofla,fields.code_cant,fields.geo_shape.type,fields.geo_shape.coordinates,fields.superficie,fields.nom_comm,fields.code_arr,fields.population,geometry.type,geometry.coordinates
0,correspondances-code-insee-code-postal,2bf36b38314b6c39dfbcd09225f97fa532b1fc45,2016-09-21T00:29:06.175+02:00,645,ESSONNE,Commune simple,121.0,ILE-DE-FRANCE,11,91645,...,16275,3,Polygon,"[[[2.238024349288764, 48.735565859837095], [2....",999.0,VERRIERES-LE-BUISSON,3,15.5,Point,"[2.251712972144151, 48.750443119964764]"
1,correspondances-code-insee-code-postal,7ee82e74e059b443df18bb79fc5a19b1f05e5a88,2016-09-21T00:29:06.175+02:00,133,SEINE-ET-MARNE,Commune simple,88.0,ILE-DE-FRANCE,11,77133,...,31428,20,Polygon,"[[[3.076046701822989, 48.397361878531605], [3....",1082.0,COURCELLES-EN-BASSEE,3,0.2,Point,"[3.052940505560729, 48.41256065214989]"
2,correspondances-code-insee-code-postal,e2cd3186f07286705ed482a10b6aebd9de633c81,2016-09-21T00:29:06.175+02:00,378,ESSONNE,Commune simple,150.0,ILE-DE-FRANCE,11,91378,...,30975,9,Polygon,"[[[2.203466690733517, 48.51655284725087], [2.1...",313.0,MAUCHAMPS,1,0.3,Point,"[2.19718165044305, 48.52726809075556]"
3,correspondances-code-insee-code-postal,868bf03527a1d0a9defe5cf4e6fa0a730d725699,2016-09-21T00:29:06.175+02:00,243,SEINE-ET-MARNE,Chef-lieu canton,71.0,ILE-DE-FRANCE,11,77243,...,17000,14,Polygon,"[[[2.727542158243183, 48.85975862454365], [2.7...",579.0,LAGNY-SUR-MARNE,5,20.2,Point,"[2.7097808131278462, 48.87307018579678]"
4,correspondances-code-insee-code-postal,21e809b1d4480333c8b6fe7addd8f3b06f343e2c,2016-09-21T00:29:06.175+02:00,3,VAL-DE-MARNE,Chef-lieu canton,70.0,ILE-DE-FRANCE,11,94003,...,32123,34,Polygon,"[[[2.34385114554979, 48.79766105911435], [2.32...",232.0,ARCUEIL,3,19.5,Point,"[2.333510249842654, 48.80588035965699]"


### 1.1 Explore Neighborhoods in Paris

In [21]:
data = [paris_bourough_df['fields.nom_dept'],paris_bourough_df['fields.code_dept'],paris_bourough_df['fields.nom_comm'],paris_bourough_df['fields.superficie']/100,paris_bourough_df['geometry.coordinates']]

headers = ["Borough", "ZIPcode", "Neighborhood", "Area", "LongLat"]

df = pd.concat(data, axis=1, keys=headers)

# Create new column Longitiud and Latitud

df['Latitude'] = df['LongLat'].apply(lambda x: float(x[1]))
df['Longitude'] = df['LongLat'].apply(lambda x: float(x[0]))
df.drop('LongLat', inplace=True, axis=1)
df.head()
df_temp = df.head(100)

#### Let's define a function that return all nearby venues in a neighborhood within a radius of 500m (we only get restaurant venues)

In [22]:
def getNearbyVenues(names, latitudes, longitudes, areas):
    venues_list=[]
    for name, lat, lng, area in zip(names, latitudes, longitudes, areas):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            np.sqrt(area/3.14) * 1000, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]["groups"][0]["items"]    
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_venues)


 #### Get all venues for each Neighborhood

#### Run the above function on each neighborhood 

In [24]:
CLIENT_ID = 'CIYM54ZOILTJMGIZYSX2FREJZ1ENG0OCP4U03OM1ZSTVFB5H' # your Foursquare ID
CLIENT_SECRET = 'KLWDZC0GN2HCV1NVJ2KHSTMTZHJYTKOC1Q5NNQ2TFALMPUGW' # your Foursquare Secret
VERSION = '20210501' # Foursquare API version
LIMIT = 50 # A default Foursquare API limit value

paris_venues = getNearbyVenues(df_temp['Neighborhood'], df_temp['Latitude'], df_temp['Longitude'], df_temp['Area'])

paris_venues.groupby('Neighborhood')['Venue'].count()


KeyError: 'groups'

#### Let's find out how many unique categories can be curated from all the returned venues

In [10]:
print('There are {} uniques categories.'.format(len(paris_venues['Venue Category'].unique())))

NameError: name 'paris_venues' is not defined

In [9]:
paris_venues.head()

NameError: name 'paris_venues' is not defined

### 1.2 Analyze each neighborhood

In [None]:
# one hot encoding
paris_onehot = pd.get_dummies(paris_venues[['Venue Category']], prefix="", prefix_sep="")


# Add neighborhood column back to dataframe
paris_onehot['Neighborhood'] = paris_venues['Neighborhood'] 


# move neighborhood column to the first column
first_column = paris_onehot.pop('Neighborhood')
paris_onehot.insert(0,'Neighborhood',first_column)

print (paris_onehot.shape) #New size
paris_onehot.head()

In [None]:
paris_grouped = paris_onehot.groupby('Neighborhood').mean().reset_index()
print (paris_grouped.shape) #New size
paris_grouped.head(20)

#### Let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category


In [None]:
num_top_venues = 5
for hood in paris_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = paris_grouped[paris_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


#### Let's put that into a panda dataframe

First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.


In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = paris_grouped['Neighborhood']

for ind in np.arange(paris_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(paris_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

## 3.	Cluster Neighborhood

Run _k_-means to cluster the neighborhood into 5 clusters.

Let's find first the best k using the Elbow method

In [None]:
# Function Calculate the Within-Cluster-Sum of Squared Errors (WSS) for different values of l 
def calculate_WSS(points, kmax):
  sse = []
  for k in range(1, kmax+1):
    kmeans = KMeans(n_clusters = k).fit(points)
    centroids = kmeans.cluster_centers_
    pred_clusters = kmeans.predict(points)
    curr_sse = 0
    
    # calculate square of Euclidean distance of each point from its cluster center and add to current WSS
    for i in range(len(points)):
      curr_center = centroids[pred_clusters[i]]
      curr_sse += (points.iloc[i, 0] - curr_center[0]) ** 2 + (points.iloc[i, 1] - curr_center[1]) ** 2
      
    sse.append(curr_sse)
    plt.plot(range(1,kmax),sse,'g')
    plt.tight_layout()
    plt.show()
  return sse

In [25]:
wcss = []
for i in range(1, 7):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(paris_grouped_clustering)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 7), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()


NameError: name 'paris_grouped_clustering' is not defined

#### As we can see on the plot. We can use k=4 to build our Kmeans clustering Algorithm 

In [None]:
# set number of clusters
kclusters = 4

paris_grouped_clustering = paris_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(paris_grouped_clustering)

# check cluster labels generated for each row in the dataframe
print(kmeans.labels_[0:10])


Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
paris_merged = df_temp

# merge paris_grouped with paris_data to add latitude/longitude for each neighborhood
paris_merged = paris_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
paris_merged.head(30) # check the last columns!


Let's drop rows with NaN values 

In [None]:
paris_merged = paris_merged.dropna()


In [None]:
!conda install -c conda-forge geopy --yes # uncomment this line to install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')
address = 'Paris, France'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Paris are {}, {}.'.format(latitude, longitude))




In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

paris_merged['Cluster Labels'] =paris_merged['Cluster Labels'].astype(int)

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(paris_merged['Latitude'], paris_merged['Longitude'], paris_merged['Neighborhood'], paris_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters