# Coursera/IBM 
# Applied Data Science Capstone Project 
## The Battle of Neighborhoods: Opening an Italian Restaurant in Paris


*****

## Table of contents
* [Introduction: Project & Background](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results & Discussion](#results)
* [Conclusions](#conclusions)

******

## Introduction: Project & Background <a name="introduction"></a>

In this notebook, we'll try to identify the most promising locations for opening an Italian restaurant in Paris.

Of course, this is no easy task, and the final decision should only be made after further on-site investigation. Still, with some data we can already narrow the search down to a handful of locations.

This analysis will be based on the following assumptions:

- Areas where the restaurant density is very low will not be considered good spots. One could argue that they represent a gap to fill (which might be true!), but they are most likely empty for a reason, e.g. prohibitive prices or protected or historical areas.


- It is better to open an Italian restaurant in neighborhoods where Italian restaurants are among the most popular venues, as this implies a demand for this type of food.

- However, the new Italian restaurant should be as far as possible from the existing ones, as clients might prefer the venues they are used to.


- Even though French and Italian cuisines are very distinct, they are usually enjoyed in a similar way: both French and Italian food lovers tend to sit for quite a while, taking their time to enjoy a good meal, as opposed to other types of food that can be eaten on the go. As a consequence, we will favour areas with many French restaurants, as they mainly attract clients who would enjoy Italian food as well.




******
## Data <a name="data"></a>

For this analysis, we will need data on the <i>arrondissements</i> of Paris (i.e. its neighborhoods):
- their shape
- their center

This will allow us to map our findings, gather data on nearby venues and frame potential clusters.

Then we will get data on Parisian venues, namely:
- their name
- their coordinates
- the category they belong to

Finally, we'll create a ```paris_restaurants``` dataframe with everything we need for the analysis.

### Getting a map of Paris' <i> Arrondissements </i>

A copy of the GeoJSON file we use is stored in the same repository as this notebook.


The original file can be found on the following page:
https://opendata.paris.fr/explore/dataset/arrondissements/export/?disjunctive.c_ar&disjunctive.c_arinsee&disjunctive.l_ar&location=13,48.85156,2.32327

In [1]:
import json

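# the context manager closes the file once it is loaded; UTF-8 is needed for accented labels
with open("/arrondissements.geojson", encoding="utf-8") as f:
    geo = json.load(f)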

From this dataset, we can easily plot the shape of each <i>arrondissement</i> (i.e. neighborhood):

In [2]:
import folium
paris_choropleth = folium.Map(location = [48.856578, 2.351828], zoom_start = 12)
# folium.Choropleth replaces the deprecated Map.choropleth() method
folium.Choropleth(geo_data=geo, fill_opacity=0.3, fill_color='blue').add_to(paris_choropleth)
paris_choropleth

But in order to request data about Parisian venues, we will need the coordinates of the center of each neighborhood:

In [3]:
import pandas as pd

paris_ardt = []
for arr in geo["features"]:
    prop = arr["properties"]
    paris_ardt.append([prop["l_ar"].split('è')[0].split('e')[0],prop["geom_x_y"][0],prop["geom_x_y"][1]])
paris_ardt_df= pd.DataFrame(paris_ardt,columns=['Ardt','Latitude','Longitude'])
paris_ardt_df['Ardt'] = paris_ardt_df['Ardt'].astype(int)
paris_ardt_df = paris_ardt_df.sort_values('Ardt').reset_index(drop=True)
paris_ardt_df

| | Ardt | Latitude | Longitude |
|---|---|---|---|
| 0 | 1 | 48.862563 | 2.336443 |
| 1 | 2 | 48.868279 | 2.342803 |
| 2 | 3 | 48.862872 | 2.360001 |
| 3 | 4 | 48.854341 | 2.35763 |
| 4 | 5 | 48.844443 | 2.350715 |
| 5 | 6 | 48.84913 | 2.332898 |
| 6 | 7 | 48.856174 | 2.312188 |
| 7 | 8 | 48.872721 | 2.312554 |
| 8 | 9 | 48.877164 | 2.337458 |
| 9 | 10 | 48.87613 | 2.360728 |
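
The chained `split` calls used to extract the neighborhood number depend on the exact wording of the `l_ar` label. As a minimal sketch, assuming labels look like `1er Ardt` or `12ème Ardt` (an assumption about the dataset, not checked here), a regular expression states the intent more directly. The helper name `parse_ardt_number` is hypothetical.

```python
import re

def parse_ardt_number(label):
    """Return the leading arrondissement number of a label such as '12ème Ardt'."""
    match = re.match(r"\s*(\d+)", label)
    if match is None:
        raise ValueError(f"no leading number in label: {label!r}")
    return int(match.group(1))

# Hypothetical labels, for illustration only
for label in ["1er Ardt", "2ème Ardt", "12ème Ardt"]:
    # Same result as the chained split used in the notebook
    assert parse_ardt_number(label) == int(label.split('è')[0].split('e')[0])
    print(label, "->", parse_ardt_number(label))
```

Unlike the chained split, this version raises an explicit error when a label does not start with a number, which makes an unexpected format easier to spot.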


We can then add these centers to the previous map:

In [4]:
for ardt, lat, lng in zip(paris_ardt_df['Ardt'], paris_ardt_df['Latitude'], paris_ardt_df['Longitude']):
    label = folium.Popup("Ardt n°"+ str(ardt), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='black',
        parse_html=False).add_to(paris_choropleth)

paris_choropleth



We can see, however, that the centers of the 12th and 16th arrondissements are quite off, because their boundaries include large parks (the Bois de Vincennes and the Bois de Boulogne), so we will correct them as follows:


In [5]:
corrections = [
    [12, 48.841, 2.388],
    [16, 48.863, 2.276]
]

corrections_df = pd.DataFrame(corrections,columns=['Ardt','Latitude','Longitude'])
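# DataFrame.append() was removed in pandas 2.0, so pd.concat() is used instead;
# keep='last' makes the corrected rows override the original ones
paris_ardt_df = pd.concat([paris_ardt_df, corrections_df]).drop_duplicates('Ardt',keep='last').sort_values('Ardt',ignore_index=True)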
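
The correction step works by adding the corrected rows last and keeping only the last occurrence of each neighborhood. A minimal sketch of that override pattern on toy data (the values and the names `centers`, `fixes` and `corrected` are made up for illustration):

```python
import pandas as pd

# Toy data: three centers, one of which (Ardt 2) needs a manual correction
centers = pd.DataFrame({"Ardt": [1, 2, 3],
                        "Latitude": [48.86, 48.83, 48.85],
                        "Longitude": [2.33, 2.42, 2.36]})
fixes = pd.DataFrame({"Ardt": [2], "Latitude": [48.84], "Longitude": [2.39]})

# Rows added last win, because only the last duplicate of each Ardt is kept
corrected = (
    pd.concat([centers, fixes])
      .drop_duplicates("Ardt", keep="last")
      .sort_values("Ardt", ignore_index=True)
)
print(corrected)
```

The same pattern scales to any number of manual corrections without touching the original rows one by one.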

In [6]:
paris_choropleth = folium.Map(location = [48.856578, 2.351828], zoom_start = 12)
# folium.Choropleth replaces the deprecated Map.choropleth() method
folium.Choropleth(geo_data=geo, fill_opacity=0.3, fill_color='blue').add_to(paris_choropleth)

for ardt, lat, lng in zip(paris_ardt_df['Ardt'], paris_ardt_df['Latitude'], paris_ardt_df['Longitude']):
    label = folium.Popup("Ardt n°"+ str(ardt), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='black',
        parse_html=False).add_to(paris_choropleth)

# adding red markers at the previous (uncorrected) centers of the 12th and 16th arrondissements
folium.CircleMarker(
        [48.834974, 2.421325],
        radius=2,
        color='red',
        parse_html=False).add_to(paris_choropleth)
folium.CircleMarker(
        [48.860392, 2.261971],
        radius=2,
        color='red',
        parse_html=False).add_to(paris_choropleth)



paris_choropleth

Looks much better!

### Downloading Venues' data using the Foursquare API

Now that we have the coordinates of the center of each neighborhood, we will use them to get data on nearby venues through the Foursquare API:

- First, we will need to input our Foursquare credentials:

In [7]:
CLIENT_ID = '################' # your Foursquare ID
CLIENT_SECRET = '################' # your Foursquare Secret
ACCESS_TOKEN = '################' # your Foursquare Access Token
VERSION = '20210411'
LIMIT = 100
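
Credentials pasted into a notebook are easy to publish by accident. One possible alternative, sketched here under the assumption that the keys are exported as environment variables before the notebook is started, is to read them at runtime. The variable names and the helper `load_foursquare_credentials` are hypothetical.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read credentials from environment variables (variable names are hypothetical)."""
    names = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]

# Usage, once both variables are set in the shell that starts the notebook:
# CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()
```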


- Then we'll create a function that requests the data for each neighborhood and stores it in a dataframe:

In [8]:
import requests

def getNearbyVenues(names, latitudes, longitudes, radius):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print('Ardt ' + str(name) + ' : Getting data...')
            
        # creating the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # making the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # returning only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        
        print('Done'+'\n')

    # flattening the per-neighborhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']

    return nearby_venues
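
The function above indexes straight into the JSON response, so a venue without a category or a response without groups would raise an exception. The following is a minimal sketch of a more tolerant extraction step, assuming the response has the same nested fields that `getNearbyVenues` reads. The helper `extract_venues` and the payload are made up for illustration and are not part of the Foursquare API.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten an 'explore' response into rows, tolerating missing fields."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Venues without a category get None instead of raising IndexError
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Made-up payload mimicking the fields read by getNearbyVenues
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 48.86, "lng": 2.33},
               "categories": [{"name": "Italian Restaurant"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 48.87, "lng": 2.34},
               "categories": []}},
]}]}}

print(extract_venues(fake_payload, 1, 48.862563, 2.336443))
print(extract_venues({}, 1, 48.862563, 2.336443))  # empty payload -> []
```

Rows with a `None` category are dropped automatically by a later filter on the category name, provided that filter treats missing values as non-matches.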

- Once that's done, we can call the function with the neighborhood centers defined above:

In [9]:
paris_venues = getNearbyVenues(names=paris_ardt_df['Ardt'],
                                   latitudes=paris_ardt_df['Latitude'],
                                   longitudes=paris_ardt_df['Longitude'],
                                   radius=1750
                                  )

Ardt 1 : Getting data...
Done

Ardt 2 : Getting data...
Done

Ardt 3 : Getting data...
Done

Ardt 4 : Getting data...
Done

Ardt 5 : Getting data...
Done

Ardt 6 : Getting data...
Done

Ardt 7 : Getting data...
Done

Ardt 8 : Getting data...
Done

Ardt 9 : Getting data...
Done

Ardt 10 : Getting data...
Done

Ardt 11 : Getting data...
Done

Ardt 12 : Getting data...
Done

Ardt 13 : Getting data...
Done

Ardt 14 : Getting data...
Done

Ardt 15 : Getting data...
Done

Ardt 16 : Getting data...
Done

Ardt 17 : Getting data...
Done

Ardt 18 : Getting data...
Done

Ardt 19 : Getting data...
Done

Ardt 20 : Getting data...
Done



- This has collected data for all categories of venues, so we will create a dataframe that only includes restaurants:

In [10]:
# Keeping only restaurants (.copy() avoids pandas' SettingWithCopyWarning when columns are added later)
paris_restaurants = paris_venues[paris_venues['Venue Category'].str.contains("Restaurant")].copy()
paris_restaurants.shape

(650, 7)
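
The filter keeps every venue whose category name contains the word "Restaurant". A small sketch on made-up category labels shows what this does and does not match. The labels below are illustrative and not taken from the dataset.

```python
import pandas as pd

# Illustrative category labels (not taken from the dataset)
categories = pd.Series(["French Restaurant", "Restaurant", "Pizza Place", "Bakery", None])

# regex=False matches the literal text; na=False treats missing categories as non-matches
is_restaurant = categories.str.contains("Restaurant", regex=False, na=False)
print(categories[is_restaurant].tolist())
```

One consequence worth keeping in mind: if the data contains food venues whose category does not include the word "Restaurant" (a label such as "Pizza Place", for instance), they are left out of the analysis.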

Now, since we collected data based on the proximity of each venue to the center of each neighborhood ('<i>arrondissement</i>' in French) and the search areas overlap, we end up with duplicates in our dataframe...

In [11]:
paris_restaurants.groupby(['Venue','Venue Latitude','Venue Longitude']).count().sort_values("Neighborhood",ascending=False).head()

| Venue | Venue Latitude | Venue Longitude | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue Category |
|---|---|---|---|---|---|---|
| Raviolis Chinois Nord-Est | 48.862851 | 2.349547 | 4 | 4 | 4 | 4 |
| Foodi Jia-Ba-Buay | 48.867894 | 2.348266 | 4 | 4 | 4 | 4 |
| Taing Song-Heng | 48.864701 | 2.356888 | 4 | 4 | 4 | 4 |
| Chez Le Libanais | 48.853285 | 2.341673 | 4 | 4 | 4 | 4 |
| Man'ouché | 48.861858 | 2.351093 | 4 | 4 | 4 | 4 |


So our first task will be to 'clean' this dataframe by removing all these duplicates...

To do so, we'll calculate the distance from each venue to the center of each neighborhood and assign the venue to the closest one:

- We start by generating a matrix that gives, for each venue, its distance to the center of every neighborhood in Paris (please note that ```Ardt``` stands for <i>Arrondissement</i>, or neighborhood):

In [12]:
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# converting the coordinates to radians, as expected by the haversine formula
paris_ardt_df[['lat_radians_A','long_radians_A']] = (
    np.radians(paris_ardt_df.loc[:,['Latitude','Longitude']])
)

paris_restaurants[['lat_radians_B','long_radians_B']] = (
    np.radians(paris_restaurants.loc[:,['Venue Latitude','Venue Longitude']])
)

# calculating the distances using the haversine formula
# (sklearn.neighbors.DistanceMetric was moved to sklearn.metrics in scikit-learn 1.0;
# haversine_distances computes the same pairwise distances)
dist_matrix_center = haversine_distances(
    paris_restaurants[['lat_radians_B','long_radians_B']].to_numpy(),
    paris_ardt_df[['lat_radians_A','long_radians_A']].to_numpy()
) * 6371
# Note that 6371 is the mean radius of the Earth in kilometers
df_dist_center_matrix = (
    pd.DataFrame(dist_matrix_center,index=paris_restaurants['Venue'],
                 columns=paris_ardt_df['Ardt'])
)

df_dist_center_matrix['Ardt'] = df_dist_center_matrix.idxmin(axis=1)

df_dist_center_matrix


| Venue | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | Ardt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sanukiya | 0.307293 | 0.768340 | 1.927121 | 2.089930 | 2.571143 | 1.734015 | 1.844557 | 1.791325 | 1.409938 | 2.342911 | ... | 4.761886 | 4.544699 | 3.979599 | 4.060695 | 4.232715 | 3.198376 | 3.270502 | 4.483616 | 4.930985 | 1 |
| Restaurant Kunitoraya | 0.395082 | 0.522141 | 1.758858 | 2.027537 | 2.625663 | 1.906649 | 2.092016 | 1.896974 | 1.230593 | 2.094925 | ... | 4.692050 | 4.600516 | 4.163717 | 4.309657 | 4.436618 | 3.205776 | 3.063274 | 4.235165 | 4.743442 | 1 |
| Boutique yam'Tcha | 0.444513 | 0.731135 | 1.295473 | 1.384174 | 2.014482 | 1.561353 | 2.292932 | 2.501703 | 1.755683 | 2.090969 | ... | 4.055209 | 3.980905 | 3.791460 | 4.350663 | 4.857939 | 3.859194 | 3.457352 | 4.194022 | 4.306303 | 1 |
| Enza & Famiglia | 0.534675 | 0.789590 | 1.225187 | 1.287015 | 1.936670 | 1.547370 | 2.354014 | 2.598143 | 1.829322 | 2.087299 | ... | 3.958101 | 3.898958 | 3.761564 | 4.384910 | 4.938120 | 3.954562 | 3.506077 | 4.176077 | 4.231280 | 1 |
| Au Vieux Comptoir | 0.817624 | 1.071672 | 1.107074 | 0.981898 | 1.641425 | 1.454484 | 2.501427 | 2.897499 | 2.128276 | 2.194071 | ... | 3.653076 | 3.591853 | 3.594883 | 4.425529 | 5.150478 | 4.275408 | 3.747561 | 4.222372 | 4.059680 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Khun Akorn | 4.724190 | 4.534347 | 3.140741 | 2.997939 | 3.509612 | 4.763998 | 6.319249 | 6.753254 | 5.377632 | 4.006405 | ... | 1.215570 | 3.531021 | 5.704168 | 7.770685 | 9.046721 | 7.873810 | 5.998258 | 4.263167 | 1.544565 | 12 |
| Café Lino | 4.440884 | 4.178482 | 2.788728 | 2.795951 | 3.480619 | 4.641880 | 6.123943 | 6.426529 | 4.968183 | 3.531383 | ... | 1.590155 | 3.789649 | 5.788497 | 7.703053 | 8.820271 | 7.476204 | 5.498695 | 3.728699 | 1.087272 | 20 |
| La Petite Fabrique | 4.678842 | 4.418760 | 3.029087 | 3.022655 | 3.671422 | 4.856581 | 6.352607 | 6.666795 | 5.204698 | 3.753644 | ... | 1.612060 | 3.879177 | 5.948873 | 7.907768 | 9.053758 | 7.713582 | 5.707269 | 3.863919 | 1.110106 | 20 |
| Les Mondes Bohèmes | 4.710890 | 4.445521 | 3.056969 | 3.059739 | 3.713847 | 4.896352 | 6.389219 | 6.695335 | 5.225984 | 3.766829 | ... | 1.650449 | 3.921512 | 5.992649 | 7.949007 | 9.088556 | 7.735574 | 5.714780 | 3.852468 | 1.082732 | 20 |
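
To make the distance computation less of a black box, here is a minimal standard-library sketch of the haversine formula for a single pair of points. The function name `haversine_km` is hypothetical. The coordinates are those listed in this notebook for Sanukiya and for the center of the 1st arrondissement, so the result should land close to the value of about 0.307 km shown in the first row of the matrix.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371  # same radius as used for the distance matrix

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Sanukiya to the center of the 1st arrondissement
d = haversine_km(48.864713, 2.333805, 48.862563, 2.336443)
print(round(d, 2))
```

At the scale of a single city, the curvature of the Earth barely matters, but the haversine formula keeps the computation correct for any pair of coordinates.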


In [13]:
venue_ardt = df_dist_center_matrix[['Ardt']].reset_index()
venue_ardt.drop_duplicates(subset=['Venue'],inplace=True)

paris_restaurants = pd.merge(paris_restaurants, venue_ardt, on=['Venue'], how='inner')
paris_restaurants = paris_restaurants.drop(['Neighborhood','Neighborhood Latitude','Neighborhood Longitude','lat_radians_B','long_radians_B'],axis=1).drop_duplicates(subset='Venue')
paris_restaurants.head()


| | Venue | Venue Latitude | Venue Longitude | Venue Category | Ardt |
|---|---|---|---|---|---|
| 0 | Sanukiya | 48.864713 | 2.333805 | Udon Restaurant | 1 |
| 2 | Restaurant Kunitoraya | 48.866116 | 2.336467 | Japanese Restaurant | 1 |
| 5 | Boutique yam'Tcha | 48.86171 | 2.34238 | Chinese Restaurant | 1 |
| 7 | Enza & Famiglia | 48.861191 | 2.343449 | Italian Restaurant | 1 |
| 9 | Au Vieux Comptoir | 48.858893 | 2.346129 | French Restaurant | 1 |
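
The merge above identifies venues by name only, so two different venues sharing a name would be collapsed into one. As a variation (not the method used in this notebook), the sketch below keeps, for each distinct venue identified by name and coordinates, the row coming from the closest querying center. The toy data and the `dist_km` column are hypothetical.

```python
import pandas as pd

# Toy data: "Venue A" was returned by two neighborhood queries
rows = pd.DataFrame({
    "Neighborhood": [1, 2, 2],
    "Venue": ["Venue A", "Venue A", "Venue B"],
    "Venue Latitude": [48.8620, 48.8620, 48.8690],
    "Venue Longitude": [2.3400, 2.3400, 2.3430],
    "dist_km": [0.45, 0.73, 0.10],  # hypothetical distance to the querying center
})

# Index label of the closest center for each distinct venue (name + coordinates)
closest = rows.groupby(["Venue", "Venue Latitude", "Venue Longitude"])["dist_km"].idxmin()
deduplicated = rows.loc[closest].reset_index(drop=True)
print(deduplicated[["Venue", "Neighborhood"]])
```

Note that this only compares the centers that actually returned the venue, whereas the distance matrix compares each venue with all 20 centers.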


In [14]:
# No more duplicates...

paris_restaurants.groupby(['Venue','Venue Latitude','Venue Longitude']).count().sort_values("Ardt",ascending=False).head(5)

| Venue | Venue Latitude | Venue Longitude | Venue Category | Ardt |
|---|---|---|---|---|
| 0 d'Attente | 48.837847 | 2.35512 | 1 | 1 |
| Le Temps des Cerises | 48.852554 | 2.364195 | 1 | 1 |
| Les Fauves | 48.841937 | 2.322581 | 1 | 1 |
| Les Chics Types | 48.883873 | 2.38044 | 1 | 1 |
| Les Canailles | 48.879281 | 2.33457 | 1 | 1 |


We can now plot all restaurants on the map to check whether each venue in the dataset has been assigned to the right neighborhood:

In [15]:
paris = folium.Map(location = [48.856578, 2.351828], zoom_start = 12)

import matplotlib.cm as cm
import matplotlib.colors as colors

# one color per neighborhood (there are 20 neighborhoods in Paris)
colors_array = cm.rainbow(np.linspace(0, 1, 20))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# folium.Choropleth replaces the deprecated Map.choropleth() method
folium.Choropleth(geo_data=geo, fill_opacity=0.3, fill_color='black').add_to(paris)
for lat, lng, label, ardt in zip(paris_restaurants['Venue Latitude'], paris_restaurants['Venue Longitude'], paris_restaurants['Venue'],paris_restaurants['Ardt']):
    label = folium.Popup(label + " ("+ str(ardt) +")", parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color=rainbow[ardt-1],
        fill=True,
        fill_color='red',
        fill_opacity=0.7,
        parse_html=False).add_to(paris)

for ardt, lat, lng in zip(paris_ardt_df['Ardt'], paris_ardt_df['Latitude'], paris_ardt_df['Longitude']):
    label = folium.Popup("Ardt n°"+ str(ardt), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='black',
        parse_html=False).add_to(paris)
    
paris

That looks about right! The data is almost ready to be analyzed!


We'll now take care of the ```Venue Category``` values:
- First, we'll replace <i>Restaurant</i> with <i>Unspecified</i> in ```paris_restaurants['Venue Category']```:

In [16]:
# only exact matches are replaced, so 'French Restaurant' and similar categories are left untouched
paris_restaurants['Venue Category'] = paris_restaurants['Venue Category'].replace('Restaurant','Unspecified')
paris_restaurants.groupby(['Venue Category']).count().sort_values('Venue',ascending=False).drop(['Venue Latitude','Venue Longitude','Ardt'],axis=1).head()

| Venue Category | Venue |
|---|---|
| French Restaurant | 192 |
| Italian Restaurant | 46 |
| Japanese Restaurant | 31 |
| Unspecified | 24 |
| Thai Restaurant | 20 |


- Then we will attribute a unique identifier to each category:

In [17]:
paris_restaurants.insert(4,'code',(pd.factorize(paris_restaurants['Venue Category'])[0]+1))
paris_restaurants.groupby(['Venue Category','code']).count().sort_values('Venue',ascending=False).drop(['Venue Latitude','Venue Longitude','Ardt'],axis=1).head()

| Venue Category | code | Venue |
|---|---|---|
| French Restaurant | 5 | 192 |
| Italian Restaurant | 4 | 46 |
| Japanese Restaurant | 2 | 31 |
| Unspecified | 6 | 24 |
| Thai Restaurant | 14 | 20 |
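
The codes may look arbitrary, but `pd.factorize` simply numbers categories in order of first appearance. A small sketch, using the first categories in the order they appear in the cleaned dataframe shown earlier, is consistent with the codes above (Japanese = 2, Italian = 4, French = 5):

```python
import pandas as pd

# First categories in order of appearance, plus one repeated value
categories = pd.Series(["Udon Restaurant", "Japanese Restaurant", "Chinese Restaurant",
                        "Italian Restaurant", "French Restaurant", "Japanese Restaurant"])

codes, uniques = pd.factorize(categories)
# Adding 1 makes the identifiers start at 1 instead of 0, as in the notebook
print(dict(zip(uniques, range(1, len(uniques) + 1))))
print((codes + 1).tolist())
```

Because the numbering depends on row order, the codes are only stable as long as the dataframe is built in the same order.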


- Finally, from this cleaned dataframe, we can plot the restaurants on the map with colors based on the category they belong to:

In [18]:
paris_cat = folium.Map(location = [48.856578, 2.351828], zoom_start = 12)

import matplotlib.cm as cm
import matplotlib.colors as colors

nb_cat = paris_restaurants['code'].nunique()

# one color per restaurant category
colors_array = cm.rainbow(np.linspace(0, 1, nb_cat))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# folium.Choropleth replaces the deprecated Map.choropleth() method
folium.Choropleth(geo_data=geo, fill_opacity=0.25, fill_color='blue').add_to(paris_cat)
# 'Ardt' is included in the zip so that each popup shows the venue's own neighborhood
for lat, lng, label, cat, group, ardt in zip(paris_restaurants['Venue Latitude'], paris_restaurants['Venue Longitude'], paris_restaurants['Venue'], paris_restaurants['code'], paris_restaurants['Venue Category'], paris_restaurants['Ardt']):
    label = folium.Popup(label + " (" + group + ") [" + str(ardt)+ "]", parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color=rainbow[cat-1],
        fill=True,
        fill_color='red',
        fill_opacity=0.7,
        parse_html=False).add_to(paris_cat)

for ardt, lat, lng in zip(paris_ardt_df['Ardt'], paris_ardt_df['Latitude'], paris_ardt_df['Longitude']):
    label = folium.Popup("Ardt n°"+ str(ardt), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='black',
        parse_html=False).add_to(paris_cat)
    
paris_cat

****
## Methodology <a name="methodology"></a>

In order to determine the best spots to open an Italian restaurant in Paris, we will take the following steps:

<b> 1. Verifying our assumptions  </b>
- Basic analysis of the data
- Comparing the popularity of French vs. Italian restaurants in each neighborhood

<b> 2. Density Analyses  </b>
- Mapping neighborhoods with an <i>Italian restaurant deficit</i>
- Mapping venue densities for French and Italian restaurants
- Isolating French restaurants that are far from Italian restaurants


<b> 3. Clustering & Cross-checking </b>
- Creating clusters using k-means
- Superimposing the analyses
- Listing of the results
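
As an illustration of the first step, the comparison of French and Italian restaurants per neighborhood could be sketched with a cross-tabulation. This is only a sketch on made-up rows shaped like `paris_restaurants`, not the analysis from the full report, and the names `toy`, `counts` and `comparison` are hypothetical.

```python
import pandas as pd

# Toy rows shaped like paris_restaurants (values are made up)
toy = pd.DataFrame({
    "Ardt": [1, 1, 1, 2, 2, 3],
    "Venue Category": ["French Restaurant", "Italian Restaurant", "French Restaurant",
                       "French Restaurant", "Thai Restaurant", "Italian Restaurant"],
})

# Count venues per neighborhood and category, then keep the two categories of interest
counts = pd.crosstab(toy["Ardt"], toy["Venue Category"])
comparison = counts.reindex(columns=["French Restaurant", "Italian Restaurant"], fill_value=0)
print(comparison)
```

Using `reindex` with `fill_value=0` keeps both columns even if one of the two categories is absent from the data.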

****

(see full report for complete analysis)