This notebook was created for the Coursera Capstone Project.

# What is the best arrondissement for a new healthy food spot in Paris?

### Extract the data

Let's install all the necessary libraries:

In [2]:
!conda install -c conda-forge folium=0.5.0 --yes 

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/DSX-Python35

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    folium-0.5.0               |             py_0          45 KB  conda-forge
    altair-2.2.2               |           py35_1         462 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    ca-certificates-2019.3.9   |       hecc5488_0         146 KB  conda-forge
    openssl-1.0.2r             |       h14c3975_0         3.1 MB  conda-forge
    certifi-2018.8.24          |        py35_1001         139 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         4.0 MB

The following NEW packages will

In [3]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from geopy.geocoders import Nominatim
import folium
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
import json

Download the data about the Paris districts (arrondissements) from opendata.paris.fr in JSON format:

In [4]:
!wget -q -O 'geo_data.json' 'https://opendata.paris.fr/explore/dataset/arrondissements/download/?format=json'
print('Data downloaded!')

Data downloaded!


In [5]:
with open('geo_data.json') as json_data:
    geo_data = json.load(json_data)

In [6]:
geo_data[0]

{'datasetid': 'arrondissements',
 'fields': {'c_ar': 2,
  'c_arinsee': 75102,
  'geom': {'coordinates': [[[2.351518483670821, 48.8644258050741],
     [2.350949105218923, 48.86340592861751],
     [2.346676032763327, 48.864430925901665],
     [2.346675453051013, 48.86443106483368],
     [2.345101655171463, 48.864809197959836],
     [2.341271025930368, 48.86572767724484],
     [2.34126849090564, 48.86572828653819],
     [2.341204510696185, 48.865743681005995],
     [2.341178272058699, 48.86574963323163],
     [2.341083555178273, 48.86577201721946],
     [2.337371969067098, 48.86664907439458],
     [2.335869691238243, 48.86699647535598],
     [2.335869054057415, 48.86699662650754],
     [2.333675321300195, 48.867516125009374],
     [2.33172601351949, 48.867954816599685],
     [2.331725629348361, 48.86795490259037],
     [2.330656733960091, 48.86819218066118],
     [2.330306795320876, 48.86835619167468],
     [2.329965588686572, 48.86851416917429],
     [2.328007329038849, 48.86991742140715

Create a new, empty dataframe to hold the data:

In [7]:
column_names = ['ArrNumber', 'ArrName', 'Arrondissement', 'Latitude', 'Longitude'] 
paris_data = pd.DataFrame(columns=column_names)
paris_data

| | ArrNumber | ArrName | Arrondissement | Latitude | Longitude |
|---|---|---|---|---|---|


Fill the dataframe with the data from the JSON file:

In [8]:
# collect one row per arrondissement, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in geo_data:
    # the 'geometry' coordinates are stored as [longitude, latitude]
    arr_latlon = data['geometry']['coordinates']
    rows.append({'ArrNumber': data['fields']['c_ar'],
                 'ArrName': data['fields']['l_ar'],
                 'Arrondissement': data['fields']['l_aroff'],
                 'Latitude': arr_latlon[1],
                 'Longitude': arr_latlon[0]})

paris_data = pd.DataFrame(rows, columns=column_names)

In [10]:
paris_data.sort_values(['ArrNumber'])

| | ArrNumber | ArrName | Arrondissement | Latitude | Longitude |
|---|---|---|---|---|---|
| 3 | 1 | 1er Ardt | Louvre | 48.862563 | 2.336443 |
| 0 | 2 | 2ème Ardt | Bourse | 48.868279 | 2.342803 |
| 1 | 3 | 3ème Ardt | Temple | 48.862872 | 2.360001 |
| 4 | 4 | 4ème Ardt | Hôtel-de-Ville | 48.854341 | 2.35763 |
| 13 | 5 | 5ème Ardt | Panthéon | 48.844443 | 2.350715 |
| 9 | 6 | 6ème Ardt | Luxembourg | 48.84913 | 2.332898 |
| 14 | 7 | 7ème Ardt | Palais-Bourbon | 48.856174 | 2.312188 |
| 5 | 8 | 8ème Ardt | Élysée | 48.872721 | 2.312554 |
| 10 | 9 | 9ème Ardt | Opéra | 48.877164 | 2.337458 |
| 15 | 10 | 10ème Ardt | Entrepôt | 48.87613 | 2.360728 |


Get the geographic coordinates of Paris:

In [11]:
address = 'Paris, France'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Paris are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Paris are 48.8566101, 2.3514992.


Create a map of Paris with a marker for each district:

In [12]:
# create map of Paris
map_paris = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, name, arr_name in zip(paris_data['Latitude'], paris_data['Longitude'], paris_data['Arrondissement'], paris_data['ArrName']):
    # popup text, e.g. "2ème Ardt, Bourse"
    label = '{}, {}'.format(arr_name, name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_paris)
    
map_paris

Check the number of rows and columns of the dataframe:

In [13]:
paris_data.shape

(20, 5)
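
The shape confirms that there are 20 rows, but not that every arrondissement appears exactly once. A minimal sketch of such a check follows; the helper `check_arrondissements` and the two toy frames are hypothetical and only illustrate the idea:

```python
import pandas as pd

def check_arrondissements(df, expected=20):
    """Return True when the numbers 1..expected each appear exactly once in 'ArrNumber'."""
    numbers = sorted(int(n) for n in df['ArrNumber'])
    return numbers == list(range(1, expected + 1))

# toy frames: one complete, one where number 19 is duplicated and 20 is missing
complete = pd.DataFrame({'ArrNumber': range(1, 21)})
with_duplicate = pd.DataFrame({'ArrNumber': list(range(1, 20)) + [19]})

print(check_arrondissements(complete), check_arrondissements(with_duplicate))
```

Calling `check_arrondissements(paris_data)` would apply the same test to the dataframe built above.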

### Find all the nearby venues

In [15]:
CLIENT_ID = 'YOUR_CLIENT_ID'  # your Foursquare client ID (do not publish real credentials)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'  # your Foursquare client secret
VERSION = '20190605'  # Foursquare API version date

Define a function that fetches the venues near the center of each arrondissement in Paris:

In [47]:
def getNearbyVenues(arrs, names, latitudes, longitudes, radius=500, categoryIds='', limit=100):
    """Return a dataframe with up to `limit` venues within `radius` metres of each arrondissement center."""
    # note: the radius was previously overwritten with 500 inside the loop,
    # so 500 is kept as the default to preserve the original results
    venues_list = []
    for arr, name, lat, lng in zip(arrs, names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            limit)

        # optionally restrict the search to a comma-separated list of category IDs
        if categoryIds != '':
            url += '&categoryId={}'.format(categoryIds)

        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            arr,
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-arrondissement lists into a single dataframe
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['ArrName',
                 'Arrondissement',
                 'Arrondissement Latitude',
                 'Arrondissement Longitude',
                 'Venue',
                 'Venue Latitude',
                 'Venue Longitude',
                 'Venue Category'])

    return nearby_venues
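
The function above assumes that every returned venue has at least one category: `v['venue']['categories'][0]` raises an `IndexError` otherwise. Below is a hedged sketch of a more defensive row extractor. The helper name `venue_row`, the `fallback` label and the sample items are hypothetical; the item layout is simply the one the function above already relies on.

```python
def venue_row(item, fallback='Unknown'):
    """Extract (name, lat, lng, category) from one item of an explore response."""
    venue = item['venue']
    categories = venue.get('categories') or []
    # use the first category when present, otherwise a placeholder label
    category = categories[0]['name'] if categories else fallback
    return (venue['name'], venue['location']['lat'], venue['location']['lng'], category)

# hand-written items that mimic the response structure
sample_items = [
    {'venue': {'name': 'Le Moderne',
               'location': {'lat': 48.868856, 'lng': 2.342142},
               'categories': [{'name': 'French Restaurant'}]}},
    {'venue': {'name': 'Venue without category',
               'location': {'lat': 48.868, 'lng': 2.343},
               'categories': []}},
]
sample_rows = [venue_row(item) for item in sample_items]
print(sample_rows)
```

Whether uncategorised venues should be kept under a placeholder or dropped is a design choice; for counting competitors, dropping them may be the safer option.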

Run the above function on each arrondissement and create a new dataframe containing the venues in Paris:

In [37]:
paris_venues = getNearbyVenues(arrs=paris_data['ArrName'],
                                   names=paris_data['Arrondissement'],
                                   latitudes=paris_data['Latitude'],
                                   longitudes=paris_data['Longitude']
                                  )

In [38]:
print(paris_venues.shape)
paris_venues.head()

(1841, 8)


| | ArrName | Arrondissement | Arrondissement Latitude | Arrondissement Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Pizzeria Popolare | 48.868252 | 2.343398 | Pizza Place |
| 1 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Le Silencio | 48.868998 | 2.343417 | Nightclub |
| 2 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | L'Appartement Sézane | 48.869574 | 2.34506 | Women's Store |
| 3 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Le Moderne | 48.868856 | 2.342142 | French Restaurant |
| 4 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Galerie Vivienne | 48.866731 | 2.3398 | Historic Site |
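
To check that the returned venues really lie within the search radius, the distance between an arrondissement center and a venue can be computed with the haversine formula. This is a minimal sketch, not part of the original analysis; the function name `haversine_m` is hypothetical and the coordinates are those of the first row above.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    earth_radius_m = 6371000  # mean Earth radius
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))

# Bourse center and the first venue in the table above (Pizzeria Popolare)
distance = haversine_m(48.868279, 2.342803, 48.868252, 2.343398)
print('{:.0f} m'.format(distance))
```

For this venue the result is well inside the 500 m radius used in the request.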


Check how many venues were returned for each arrondissement:

In [39]:
paris_venues.groupby('Arrondissement').count()

| Arrondissement | ArrName | Arrondissement Latitude | Arrondissement Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| Batignolles-Monceau | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Bourse | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Buttes-Chaumont | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Buttes-Montmartre | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Entrepôt | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Gobelins | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Hôtel-de-Ville | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Louvre | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Luxembourg | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Ménilmontant | 57 | 57 | 57 | 57 | 57 | 57 | 57 |
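
Many arrondissements return exactly 100 venues, which is the per-request limit, so their true venue counts are probably higher. A small sketch of how to flag such capped results follows; the series below is hand-written from three rows of the table above, and the variable names are hypothetical.

```python
import pandas as pd

LIMIT = 100  # per-request cap used in getNearbyVenues

# venue counts copied from three rows of the table above
venue_counts = pd.Series({'Bourse': 100, 'Louvre': 100, 'Ménilmontant': 57}, name='Venue')

# arrondissements whose count reaches the cap were probably truncated by the API limit
capped = venue_counts[venue_counts >= LIMIT].index.tolist()
print(capped)
```

With the full data, `paris_venues.groupby('Arrondissement')['Venue'].count()` would take the place of the hand-written series.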


Find out how many unique categories the returned venues cover:

In [None]:
print('There are {} unique categories.'.format(len(paris_venues['Venue Category'].unique())))

### Find potential competitors

Let's get the venues of our potential competitors. Foursquare has no separate 'Healthy food restaurant' category, so we'll use the categories that are likely to overlap with it:
- Salad Place 
- Sandwich Place 
- Soup Place
- Vegetarian / Vegan Restaurant
- Health Food Store 
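
The category list above can be kept as a small mapping from category name to Foursquare category ID and joined into the comma-separated string that the request expects. This is only a sketch: the IDs are the ones used in the next cell of this notebook, while the names `COMPETITOR_CATEGORIES` and `competitor_ids` are hypothetical.

```python
# Foursquare category IDs for the competitor categories listed above
COMPETITOR_CATEGORIES = {
    'Salad Place': '4bf58dd8d48988d1bd941735',
    'Sandwich Place': '4bf58dd8d48988d1c5941735',
    'Soup Place': '4bf58dd8d48988d1dd931735',
    'Vegetarian / Vegan Restaurant': '4bf58dd8d48988d1d3941735',
    'Health Food Store': '50aa9e744b90af0d42d5de0e',
}

# the request takes a single comma-separated string of IDs
competitor_ids = ','.join(COMPETITOR_CATEGORIES.values())
print(competitor_ids)
```

Keeping names and IDs together makes it easier to add or drop a category without editing a long string by hand.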

In [48]:
# We use the following category IDs to get only the venues that could be our potential competitors:
#(can be found on https://developer.foursquare.com/docs/resources/categories)
#Salad Place - 4bf58dd8d48988d1bd941735
#Sandwich Place - 4bf58dd8d48988d1c5941735
#Soup Place - 4bf58dd8d48988d1dd931735
#Vegetarian / Vegan Restaurant - 4bf58dd8d48988d1d3941735
#Health Food Store - 50aa9e744b90af0d42d5de0e 

paris_venues_healthy = getNearbyVenues(arrs=paris_data['ArrName'],
                                   names=paris_data['Arrondissement'],
                                   latitudes=paris_data['Latitude'],
                                   longitudes=paris_data['Longitude'],
                                   categoryIds='4bf58dd8d48988d1bd941735,4bf58dd8d48988d1c5941735,4bf58dd8d48988d1dd931735,4bf58dd8d48988d1d3941735,50aa9e744b90af0d42d5de0e'
                                  )



paris_venues_healthy.head()

| | ArrName | Arrondissement | Arrondissement Latitude | Arrondissement Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Label Ferme | 48.86923 | 2.343058 | Salad Place |
| 1 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Mûre | 48.870271 | 2.342203 | Salad Place |
| 2 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Mabel | 48.867544 | 2.34615 | Sandwich Place |
| 3 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Chez Philibert | 48.868918 | 2.345776 | Salad Place |
| 4 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Frenchie to Go | 48.867693 | 2.34774 | Sandwich Place |


In [49]:
paris_venues_healthy.shape

(391, 8)

Let's create a map of potential competitors:

In [50]:
# function to add markers for given venues to map
def addToMap(df, color, existingMap):
    for lat, lng, local, venue, venueCat in zip(df['Venue Latitude'], df['Venue Longitude'], df['Arrondissement'], df['Venue'], df['Venue Category']):
        label = '{} ({}) - {}'.format(venue, venueCat, local)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.7).add_to(existingMap)

In [51]:
map_paris_healthy = folium.Map(location=[latitude, longitude], zoom_start=12)
addToMap(paris_venues_healthy, 'green', map_paris_healthy)
map_paris_healthy

### Find target audience

Let's define the places where our target audience can be found: high schools, colleges and universities, and offices.

#####  High schools

In [52]:
# High School ID: 4bf58dd8d48988d13d941735 - see https://developer.foursquare.com/docs/resources/categories


paris_venues_schools = getNearbyVenues(arrs=paris_data['ArrName'],
                                   names=paris_data['Arrondissement'],
                                   latitudes=paris_data['Latitude'],
                                   longitudes=paris_data['Longitude'],
                                   categoryIds='4bf58dd8d48988d13d941735'
                                  )



paris_venues_schools.head()

| | ArrName | Arrondissement | Arrondissement Latitude | Arrondissement Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Lycée Jean-Baptiste Lulli | 48.869871 | 2.34347 | High School |
| 1 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Cours lafayette | 48.866667 | 2.342775 | High School |
| 2 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Lycée Privé Edgar Poe | 48.870815 | 2.34811 | High School |
| 3 | 3ème Ardt | Temple | 48.862872 | 2.360001 | Lycée François Truffaut | 48.861896 | 2.363554 | High School |
| 4 | 3ème Ardt | Temple | 48.862872 | 2.360001 | Lycée Simone Weil | 48.861239 | 2.36397 | High School |


In [53]:
paris_venues_schools.shape

(36, 8)

In [65]:
map_paris_schools = folium.Map(location=[latitude, longitude], zoom_start=12)
addToMap(paris_venues_schools, 'darkblue', map_paris_schools)
map_paris_schools

#####  Colleges and Universities

In [55]:
# IDs - see https://developer.foursquare.com/docs/resources/categories:
# College & University - 4d4b7105d754a06372d81259
# University - 4bf58dd8d48988d1ae941735


paris_venues_universities = getNearbyVenues(arrs=paris_data['ArrName'],
                                   names=paris_data['Arrondissement'],
                                   latitudes=paris_data['Latitude'],
                                   longitudes=paris_data['Longitude'],
                                   categoryIds='4d4b7105d754a06372d81259,4bf58dd8d48988d1ae941735'
                                  )



paris_venues_universities.head()

| | ArrName | Arrondissement | Arrondissement Latitude | Arrondissement Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Institut National d'Histoire de l'Art (INHA) | 48.86668 | 2.338986 | General College & University |
| 1 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | École de la Chambre Syndicale de la Couture | 48.868478 | 2.342351 | University |
| 2 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | EEMI | 48.868944 | 2.341118 | University |
| 3 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Esmod Isem | 48.867723 | 2.345213 | University |
| 4 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | EEMI Passage Des Panoramas | 48.870299 | 2.342846 | University |


In [56]:
paris_venues_universities.shape

(187, 8)

In [57]:
map_paris_universities = folium.Map(location=[latitude, longitude], zoom_start=12)
addToMap(paris_venues_universities, 'orange', map_paris_universities)
map_paris_universities

#####  Offices

In [58]:
#Office ID: 4bf58dd8d48988d124941735 - see https://developer.foursquare.com/docs/resources/categories


paris_venues_offices = getNearbyVenues(arrs=paris_data['ArrName'],
                                   names=paris_data['Arrondissement'],
                                   latitudes=paris_data['Latitude'],
                                   longitudes=paris_data['Longitude'],
                                   categoryIds='4bf58dd8d48988d124941735'
                                  )



paris_venues_offices.head()

| | ArrName | Arrondissement | Arrondissement Latitude | Arrondissement Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | FCINQ | 48.868473 | 2.341914 | Office |
| 1 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Webloyalty France \| webloyalty.fr | 48.86812 | 2.344235 | Office |
| 2 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Autorité des Marchés Financiers (AMF) | 48.86944 | 2.340892 | Office |
| 3 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | Red Bull HQ | 48.866644 | 2.342369 | Office |
| 4 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | M&C Saatchi Corporate | 48.86848 | 2.341972 | Office |


In [59]:
paris_venues_offices.shape

(381, 8)

In [63]:
map_paris_offices = folium.Map(location=[latitude, longitude], zoom_start=12)
addToMap(paris_venues_offices, 'purple', map_paris_offices)
map_paris_offices

In [66]:
def addColumn(sourcedf, columnTitle, newdf):
    # number of venues found in each arrondissement
    grouped = newdf.groupby('Arrondissement').count()

    for n in sourcedf['Arrondissement']:
        try:
            sourcedf.loc[sourcedf['Arrondissement'] == n, columnTitle] = grouped.loc[n, 'Venue']
        except KeyError:
            # no venue of this type was returned for the arrondissement
            sourcedf.loc[sourcedf['Arrondissement'] == n, columnTitle] = 0
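
The same per-arrondissement count can also be written without an explicit loop by mapping each arrondissement name to its venue count. This is a sketch of an alternative, not the notebook's own function: `add_count_column` is a hypothetical name, and the two toy frames use only a handful of rows for illustration.

```python
import pandas as pd

def add_count_column(source_df, column_title, venues_df):
    """Add a column with the number of venues per arrondissement (0 when none were found)."""
    counts = venues_df.groupby('Arrondissement')['Venue'].count()
    # map each arrondissement to its count; names without venues become NaN, then 0
    source_df[column_title] = source_df['Arrondissement'].map(counts).fillna(0)
    return source_df

# toy data: three districts and four school venues
districts = pd.DataFrame({'Arrondissement': ['Bourse', 'Temple', 'Louvre']})
schools = pd.DataFrame({
    'Arrondissement': ['Bourse', 'Bourse', 'Bourse', 'Temple'],
    'Venue': ['Lycée Jean-Baptiste Lulli', 'Cours lafayette',
              'Lycée Privé Edgar Poe', 'Lycée François Truffaut'],
})
add_count_column(districts, 'Schools', schools)
print(districts)
```

Like the loop version, this modifies the dataframe in place, so it should be applied to a copy of `paris_data`.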

In [72]:
df_data = paris_data.copy()
addColumn(df_data, 'Healthy Food Spots', paris_venues_healthy)
addColumn(df_data, 'Schools', paris_venues_schools)
addColumn(df_data, 'Universities', paris_venues_universities)
addColumn(df_data, 'Offices', paris_venues_offices)
df_data.sort_values(['ArrNumber'])

| | ArrNumber | ArrName | Arrondissement | Latitude | Longitude | Healthy Food Spots | Schools | Universities | Offices |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 1er Ardt | Louvre | 48.862563 | 2.336443 | 29.0 | 0.0 | 7.0 | 22.0 |
| 0 | 2 | 2ème Ardt | Bourse | 48.868279 | 2.342803 | 79.0 | 3.0 | 24.0 | 75.0 |
| 1 | 3 | 3ème Ardt | Temple | 48.862872 | 2.360001 | 41.0 | 4.0 | 4.0 | 22.0 |
| 4 | 4 | 4ème Ardt | Hôtel-de-Ville | 48.854341 | 2.35763 | 26.0 | 3.0 | 10.0 | 13.0 |
| 13 | 5 | 5ème Ardt | Panthéon | 48.844443 | 2.350715 | 12.0 | 2.0 | 34.0 | 8.0 |
| 9 | 6 | 6ème Ardt | Luxembourg | 48.84913 | 2.332898 | 20.0 | 4.0 | 15.0 | 13.0 |
| 14 | 7 | 7ème Ardt | Palais-Bourbon | 48.856174 | 2.312188 | 14.0 | 2.0 | 8.0 | 13.0 |
| 5 | 8 | 8ème Ardt | Élysée | 48.872721 | 2.312554 | 39.0 | 1.0 | 10.0 | 47.0 |
| 10 | 9 | 9ème Ardt | Opéra | 48.877164 | 2.337458 | 58.0 | 2.0 | 8.0 | 44.0 |
| 15 | 10 | 10ème Ardt | Entrepôt | 48.87613 | 2.360728 | 17.0 | 4.0 | 10.0 | 26.0 |


### Find the best district for our spot

In [75]:
# negative weight for our potential competitors
weight_competitors = -1

# positive weights for the target audiences, increasing with their assumed purchasing power
weight_schools = 1
weight_universities = 1.5
weight_offices = 2
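
As a worked example of how these weights combine, the score of one arrondissement is the weighted sum of its venue counts. The sketch below uses the counts for the 2nd arrondissement (Bourse) from the table above; the names `WEIGHTS` and `district_score` are hypothetical.

```python
# weights as defined above
WEIGHTS = {'Healthy Food Spots': -1, 'Schools': 1, 'Universities': 1.5, 'Offices': 2}

def district_score(counts, weights=WEIGHTS):
    """Weighted sum of the venue counts for one arrondissement."""
    return sum(weights[key] * counts[key] for key in weights)

# venue counts for Bourse, taken from the table above
bourse = {'Healthy Food Spots': 79, 'Schools': 3, 'Universities': 24, 'Offices': 75}
bourse_score = district_score(bourse)
print(bourse_score)  # -79 + 3 + 36 + 150 = 110.0
```

The weights themselves are a modelling assumption: changing them changes the ranking, so it is worth trying a few alternatives before committing to a location.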

In [83]:
df_weighted = df_data[['ArrName','Arrondissement']].copy()

In [84]:
df_weighted['Score'] = (df_data['Healthy Food Spots'] * weight_competitors
                        + df_data['Schools'] * weight_schools
                        + df_data['Universities'] * weight_universities
                        + df_data['Offices'] * weight_offices)
df_weighted = df_weighted.sort_values(by=['Score'], ascending=False)
df_weighted

| | ArrName | Arrondissement | Score |
|---|---|---|---|
| 0 | 2ème Ardt | Bourse | 110.0 |
| 5 | 8ème Ardt | Élysée | 71.0 |
| 11 | 17ème Ardt | Batignolles-Monceau | 64.0 |
| 13 | 5ème Ardt | Panthéon | 57.0 |
| 15 | 10ème Ardt | Entrepôt | 54.0 |
| 16 | 11ème Ardt | Popincourt | 46.5 |
| 19 | 18ème Ardt | Buttes-Montmartre | 45.5 |
| 10 | 9ème Ardt | Opéra | 44.0 |
| 9 | 6ème Ardt | Luxembourg | 32.5 |
| 14 | 7ème Ardt | Palais-Bourbon | 26.0 |


With the highest score (110), the 2nd arrondissement, **Bourse**, appears to be the best location for our new healthy food spot.

Let's plot all the venues used in the analysis for the 2nd arrondissement (Bourse) on the map:

In [80]:
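paris_best = paris_data[paris_data['Arrondissement'] == 'Bourse']

# center the map on the selected arrondissement so that its venues are visible at this zoom level
map_paris_result = folium.Map(location=[paris_best['Latitude'].iloc[0], paris_best['Longitude'].iloc[0]], zoom_start=15)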

for lat, lng, local in zip(paris_best['Latitude'], paris_best['Longitude'], paris_best['Arrondissement']):
    label = '{}'.format(local)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='pink',
        fill=True,
        fill_color='pink',
        fill_opacity=0.7).add_to(map_paris_result) 

addToMap(paris_venues_healthy[paris_venues_healthy['Arrondissement'] == 'Bourse'], 'green', map_paris_result)
addToMap(paris_venues_schools[paris_venues_schools['Arrondissement'] == 'Bourse'], 'darkblue', map_paris_result)
addToMap(paris_venues_universities[paris_venues_universities['Arrondissement'] == 'Bourse'], 'orange', map_paris_result)
addToMap(paris_venues_offices[paris_venues_offices['Arrondissement'] == 'Bourse'], 'purple', map_paris_result)

map_paris_result