# Capstone Project - The Battle of the Neighborhoods (Week 2)

## Applied Data Science Capstone by IBM / Coursera

_by Ludovic D'ALESSIO_

---
# Opening a traditional French bakery in Paris, France

<table><tr><td>
<img src='https://image.freepik.com/free-vector/paris-cityscape-illustration-cartoon-paris-landmarks-night-eiffel-tower_33099-291.jpg' width='470'/>
</td><td>
<img src='https://i0.wp.com/worldwideadventurers.com/wp-content/uploads/2012/04/baguette-tradition-paris.jpg?w=600' width='396'/>
</td></tr></table>

### Table of contents

* [Introduction - Business Problem](#Introduction:-Business-Problem)
* [Data](#Data)
* [Methodology](#Methodology)
* [Analysis](#Analysis)
* [Results](#Results)
* [Discussion](#Discussion)
* [Conclusion](#conclusion)


## Introduction: Business Problem ##

<p>In this project we will try to find suitable locations to open a <b>bakery</b> in <b>Paris</b>.</p>
<p>Let's put ourselves in the shoes of a young, gifted, traditional French baker who wants to set up shop in Paris. France is well known for its culinary wealth, and baking is no exception. A seemingly endless variety of breads, croissants and pastries plays a significant part in the <i><b>French Way of Life</b></i>. French people are very proud of their bakeries, and you can find them everywhere. Every district, every block has its own bakery, which plays an important part in the neighborhood's life.</p>

<p>So the question is: <b>how to find a suitable place to open a new bakery in a city already crowded with bakeries of all kinds?</b></p>
<p>Paris is a truly beautiful place, a mix of well-known landmarks, historical architecture, residential buildings and small local shops. Paris itself is quite small and homogeneous. Unlike large U.S. cities, for example, its main business district lies outside the city limits, and residential areas are found almost everywhere.</p>
<p>Many factors could be taken into consideration to determine whether an area is suitable for a new traditional, high-end bakery, but in this project we will concentrate on only the three criteria below:<ul>
    <li><i><u>Density of population</u></i> in the area: you typically don't want to take your car to buy your bread for the day or the chocolate croissant for your breakfast, so the area's attractiveness is directly linked to the number of Parisians living nearby.</li>
    <li><i><u>Number of bakeries already present in the area</u></i>: competition is good for customers, but for a shop owner fewer competitors means more market share.</li>
    <li><i><u>Distance to the closest "quality bakery"</u></i>: not all bakeries are alike, and you might be willing to walk a bit further, within a reasonable range, to find an exceptional, hand-made product; so an area where the closest top bakery is more than one kilometer away offers a true opportunity for a baker able and willing to provide this level of service.</li>
</ul></p>
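
The third criterion boils down to a distance computation. As a minimal sketch (not the method used later in this project), the haversine formula below estimates the distance from a candidate point to the closest top bakery. The function names, the coordinates and the use of a spherical Earth of radius 6,371 km are assumptions made for illustration only.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    r = 6371000.0  # mean Earth radius in meters (spherical approximation)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def distance_to_closest(point, bakeries):
    """Distance in meters from `point` to the nearest of `bakeries`, a list of (lat, lon)."""
    return min(haversine_m(point[0], point[1], b[0], b[1]) for b in bakeries)

# Hypothetical coordinates, for illustration only
candidate = (48.8566, 2.3522)
top_bakeries = [(48.8666, 2.3522), (48.8566, 2.3922)]

# The area is an "opportunity" if the closest top bakery is more than 1 km away
is_opportunity = distance_to_closest(candidate, top_bakeries) > 1000
```

At the scale of a city, the spherical approximation is more than accurate enough for a one-kilometer threshold.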

## Data ##

### Data sources overview

The following data sources will be used to generate the required information:
* **Paris population density** can be found on [Wikipedia](https://fr.wikipedia.org/wiki/Liste_des_quartiers_administratifs_de_Paris) per borough and per administrative district
* **Paris borough and district shapes**, in GeoJSON format, can be downloaded for free from the [opendata.paris.fr](https://opendata.paris.fr/explore/dataset/quartier_paris/export/?location=13,48.85879,2.34704&basemap=jawg.streets) website
* **The list and geolocation** of all the bakeries in Paris will be retrieved through [Foursquare API](https://developer.foursquare.com/developer/) standard requests
* **Bakery ratings**, which will be used to identify top bakeries, will be retrieved through **Foursquare API** premium requests
* The [folium](https://github.com/python-visualization/folium) and [geopy.geocoders](https://github.com/geopy/geopy) packages will be used respectively to visualize data on a map and to retrieve map coordinates from given addresses.

All of this data is freely available on the internet. The following sections describe each data source in detail, along with the data once retrieved.

### Population density

Paris is conveniently divided into 20 boroughs, called *arrondissements*, arranged in a spiral and numbered from 1 to 20 starting from the center:
<img src='http://www.paris-en-photos.fr/wp-content/uploads/2008/08/paris-arrondissements-300x216.png'/>
However, those boroughs are sometimes quite big and their population density is not homogeneous. Fortunately, each of them is also divided into 4 administrative districts, which makes a total of **80 districts** covering the whole city. The list of the districts and their characteristics can be found [here on Wikipedia](https://fr.wikipedia.org/wiki/Liste_des_quartiers_administratifs_de_Paris) (the page is in French, as the equivalent page in English does not directly show the density figures).<br>
We will also need the *geojson* coordinates of the districts, which can be found on the [opendata.paris.fr](https://opendata.paris.fr/explore/dataset/quartier_paris/export/?location=13,48.85879,2.34704&basemap=jawg.streets) website.
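
Since each borough holds exactly 4 districts, a district number can be mapped back to its borough with simple integer arithmetic. The sketch below assumes the districts are numbered consecutively from 1 to 80 in borough order (districts 1 to 4 in the 1st borough, 5 to 8 in the 2nd, and so on); `borough_of` is a hypothetical helper name, not something used elsewhere in this notebook.

```python
def borough_of(district_nb):
    """Return the borough (1 to 20) containing the given district (1 to 80).

    Assumes consecutive numbering with exactly 4 districts per borough.
    """
    if not 1 <= district_nb <= 80:
        raise ValueError('district number must be between 1 and 80')
    return (district_nb - 1) // 4 + 1

# District 2 (Halles) belongs to the 1st borough, district 80 to the 20th
examples = {nb: borough_of(nb) for nb in (1, 2, 4, 5, 80)}
```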

We can download the Wikipedia table directly into a Pandas dataframe.

In [1]:
import pandas as pd
df_districts = pd.read_html('https://fr.wikipedia.org/wiki/Liste_des_quartiers_administratifs_de_Paris')[0]
df_districts.head(2)

| | Arrondissement[1],[n 1] | Quartiers | Quartiers.1 | Population en1999 (hab.)[2] | Superficie(ha)[2] | Densitéhab/km2 | Plan |
|---|---|---|---|---|---|---|---|
| 0 | 1er arrondissementdit « du Louvre » | 1er | Saint-Germain-l'Auxerrois | 1 672 | 869 | 1 924 | NaN |
| 1 | 1er arrondissementdit « du Louvre » | 2e | Halles | 8 984 | 412 | 21 806 | NaN |


In [2]:
df_districts.dtypes

Arrondissement[1],[n 1]         object
Quartiers                       object
Quartiers.1                     object
Population en1999 (hab.)[2]     object
Superficie(ha)[2]                int64
Densitéhab/km2                  object
Plan                           float64
dtype: object

In [3]:
df_districts.shape

(80, 7)

We retrieved the expected 80 rows. However, we still need to drop the first and last columns, and change the names and types of the remaining ones:

In [4]:
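# drop the first (borough name) and last (map) columns
df_districts = df_districts.drop([df_districts.columns[0], df_districts.columns[-1]], axis=1)
# note: as the output below shows, 'Borough Nb' holds the number taken from the 'Quartiers' column (1er, 2e, ...)
df_districts.columns = ['Borough Nb', 'District Name', 'Population', 'Area', 'Density']
# strip every non-digit character (thousands separators, ordinal suffixes) before converting;
# regex=True is explicit because str.replace matches literally by default since pandas 2.0
df_districts['Population'] = pd.to_numeric(df_districts['Population'].str.replace('[^0-9]', '', regex=True))
df_districts['Density'] = pd.to_numeric(df_districts['Density'].str.replace('[^0-9]', '', regex=True))
df_districts['Borough Nb'] = pd.to_numeric(df_districts['Borough Nb'].str.replace('[^0-9]', '', regex=True))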
df_districts.head(2)

| | Borough Nb | District Name | Population | Area | Density |
|---|---|---|---|---|---|
| 0 | 1 | Saint-Germain-l'Auxerrois | 1672 | 869 | 1924 |
| 1 | 2 | Halles | 8984 | 412 | 21806 |
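
The cleaning step relies on a single regular expression, `[^0-9]`, which removes anything that is not a digit. The standalone sketch below shows its effect on a few sample strings using only the standard library. The helper name `to_int` is hypothetical, and the non-breaking space in the second sample is an assumption about what a scraped table may contain.

```python
import re

def to_int(raw):
    """Strip every non-digit character and convert what remains to an int."""
    digits = re.sub(r'[^0-9]', '', raw)
    if not digits:
        raise ValueError('no digits found in {!r}'.format(raw))
    return int(digits)

# A regular space, a non-breaking space (\u00a0) and an ordinal suffix
samples = ['1 672', '21\u00a0806', '1er']
cleaned = [to_int(s) for s in samples]
```

A character-class regex is convenient here because it does not need to know which kind of separator the page uses.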


### Districts shapes

Now let's display the districts as a choropleth map according to their population density. We downloaded the *geojson* files for the districts and the boroughs from [opendata.paris.fr](https://opendata.paris.fr/explore/dataset/quartier_paris/export/?location=13,48.85879,2.34704&basemap=jawg.streets) beforehand.

In [5]:
# imports
import json
import folium
from geopy.geocoders import Nominatim

# create a map around Paris 
geolocator = Nominatim(user_agent="capstone_explorer")
loc = geolocator.geocode("Paris, France")
paris_coord = [loc.latitude, loc.longitude]
map_paris = folium.Map(location=paris_coord, zoom_start=12)

# draw boroughs and districts
with open("boroughs_paris.geojson", encoding="utf-8") as f:
    geo_boroughs = json.load(f)
with open("districts_paris.geojson", encoding="utf-8") as f:
    geo_districts = json.load(f)
folium.Choropleth(geo_data=geo_districts, key_on='feature.properties.c_qu', data=df_districts,
    columns=['Borough Nb', 'Density'], fill_color='Oranges', fill_opacity=0.5, name="choro").add_to(map_paris)
folium.GeoJson(geo_boroughs, style_function=lambda x:{'color':'darkblue','weight':3,'fill':False}).add_to(map_paris)

# show map
map_paris
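
A choropleth only colors the shapes whose key (here the `c_qu` property referenced by `key_on`) matches a value in the first data column. The sketch below is one possible sanity check, run on a toy GeoJSON dictionary rather than the real file; `unmatched_keys` is a hypothetical helper and the toy values are made up.

```python
def unmatched_keys(geojson, key_property, data_keys):
    """Compare the keys found in the GeoJSON features with the keys of the data.

    Returns (keys only in the GeoJSON, keys only in the data), both sorted.
    """
    geo_keys = {feature['properties'][key_property] for feature in geojson['features']}
    data_keys = set(data_keys)
    return sorted(geo_keys - data_keys), sorted(data_keys - geo_keys)

# Toy feature collection with three districts (geometry omitted for brevity)
toy_geojson = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'c_qu': 1}, 'geometry': None},
        {'type': 'Feature', 'properties': {'c_qu': 2}, 'geometry': None},
        {'type': 'Feature', 'properties': {'c_qu': 3}, 'geometry': None},
    ],
}

# District 3 has no data row, and data row 4 has no shape
missing_in_data, missing_in_geo = unmatched_keys(toy_geojson, 'c_qu', [1, 2, 4])
```

With the real data, both lists would be expected to come back empty if all 80 districts match.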

### Complete list and geolocation of the bakeries

To retrieve the list of bakeries around a given point we use the Foursquare API with a verified account, which allows for 99,500 standard and 500 premium requests per day. The account configuration is shown below:

In [6]:
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

CLIENT_ID = 'X44GZE3TFHMZN4XSSRDPNQELI4O3WQXC2XBBI5VOWY52J3AE' # Foursquare ID
CLIENT_SECRET = 'BQSEQ22WUR5Z1P25YZN0GPBWK0T0PTKJ3C5HBW0YLWXP2A1C' # Foursquare Secret
VERSION = '20180605' # Foursquare API version
BAKERY_CATEGORY = '4bf58dd8d48988d16a941735' # Foursquare Category Id for bakeries
LIMIT = 1000 # Limit we will use as the maximum number of responses we want to get from Foursquare requests 
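
As an alternative to writing the keys directly in the notebook, the credentials could be read from environment variables. This is only a sketch of that approach: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are assumptions chosen for illustration, not something defined by Foursquare or used elsewhere in this project.

```python
import os

def load_credentials(env=None):
    """Read the Foursquare credentials from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    client_id = env.get('FOURSQUARE_CLIENT_ID')          # hypothetical variable name
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')  # hypothetical variable name
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret

# Usage with an explicit mapping, so the example runs without touching the real environment
demo_credentials = load_credentials({
    'FOURSQUARE_CLIENT_ID': 'my-id',
    'FOURSQUARE_CLIENT_SECRET': 'my-secret',
})
```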

To demonstrate how the Foursquare API works, let's retrieve the bakeries within 600 meters of the center of one district; for the sake of simplicity, we'll use the first district listed in the geojson file retrieved from [opendata.paris.fr](https://opendata.paris.fr/explore/dataset/quartier_paris/export/?location=13,48.85879,2.34704&basemap=jawg.streets).

In [7]:
# coordinates of the district center
center_coord = geo_districts['features'][0]['properties']['geom_x_y']
# area we consider around the center, in meters
radius = 600

# build the URL corresponding to the Foursquare request
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
        CLIENT_ID, CLIENT_SECRET, VERSION, center_coord[0], center_coord[1], radius, LIMIT, BAKERY_CATEGORY)

# determine if a venue is a bakery
def is_bakery(venue):
    for cat in venue['categories']:
        if cat['id'] == BAKERY_CATEGORY:
            return True
    # 'Bakery' category not found in the list of categories
    return False

# send request
results = requests.get(url).json()['response']['groups'][0]['items']

# store results in list
bakeries = [(item['venue']['id'],
           item['venue']['name'],
           (item['venue']['location']['lat'], item['venue']['location']['lng']),
           item['venue']['location']['formattedAddress']) for item in results if is_bakery(item['venue'])]

# transform to dataframe
df_bakeries = pd.DataFrame(bakeries, columns=['Id', 'Name', 'Coord', 'Address'])
df_bakeries.shape

(15, 4)

The request returned 15 bakeries within 600 meters of the district's center.
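
To make the request and the filtering easier to follow without calling the API, here is a minimal offline sketch. It builds the same kind of URL with `urllib.parse.urlencode` (which escapes the parameter values) and applies the category filter to a made-up payload that mimics the response structure used above. The helper names, the placeholder credentials and the mock venues are assumptions for illustration only.

```python
from urllib.parse import urlencode

BAKERY_CATEGORY = '4bf58dd8d48988d16a941735'  # category id used in this notebook

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
        'categoryId': BAKERY_CATEGORY,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

def extract_bakeries(payload):
    """Keep only the venues tagged with the bakery category."""
    rows = []
    for item in payload['response']['groups'][0]['items']:
        venue = item['venue']
        if any(cat['id'] == BAKERY_CATEGORY for cat in venue['categories']):
            loc = venue['location']
            rows.append((venue['id'], venue['name'], (loc['lat'], loc['lng'])))
    return rows

# Made-up payload with one bakery and one venue from another category
mock_payload = {'response': {'groups': [{'items': [
    {'venue': {'id': 'a1', 'name': 'Demo Bakery',
               'location': {'lat': 48.85, 'lng': 2.36},
               'categories': [{'id': BAKERY_CATEGORY}]}},
    {'venue': {'id': 'b2', 'name': 'Demo Cafe',
               'location': {'lat': 48.86, 'lng': 2.35},
               'categories': [{'id': 'other-category'}]}},
]}]}}

url_demo = build_explore_url('my-id', 'my-secret', '20180605', 48.85, 2.36, 600, 1000)
bakeries_demo = extract_bakeries(mock_payload)
```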

### Bakeries' ratings

To retrieve the bakeries' ratings we use another type of request from the Foursquare API, which provides detailed information on a specific venue. To illustrate how it works, we will try to retrieve the ratings of all 15 bakeries found in the previous section. It is important to note that not all venues in Foursquare are rated, so if the request returns no rating we will assign NaN as the bakery's rating.

In [8]:
import numpy as np
import math

# Fetch the rating from Foursquare using the bakery Id
def get_rating(row):
    try:
        url = 'https://api.foursquare.com/v2/venues/{}?&client_id={}&client_secret={}&v={}'.format(
            row['Id'], CLIENT_ID, CLIENT_SECRET, VERSION)
        return requests.get(url).json()['response']['venue']['rating']
    except Exception:  # no 'rating' key for unrated venues; request errors also end up here
        return np.nan
    
# Retrieve the rating for all the bakeries and add it as a new column in the dataframe
df_bakeries['Rating'] = df_bakeries.apply(get_rating, axis=1)
df_bakeries[['Name', 'Address', 'Rating']].head(8)

| | Name | Address | Rating |
|---|---|---|---|
| 0 | Boulangerie Saint-Antoine | [29 rue Saint-Antoine, 75004 Paris, France] | 8.3 |
| 1 | Miss Manon | [87 rue Saint-Antoine, 75004 Paris, France] | 7.9 |
| 2 | Maison Landemaine | [28 boulevard Beaumarchais (Rue du Pasteur Wag... | 7.7 |
| 3 | Paul | [Rue de Rivoli, 75001 Paris, France] | 7.1 |
| 4 | Maison Passos | [28 rue de la Roquette, 75011 Paris, France] | 6.8 |
| 5 | Aux Désirs de Manon | [129 rue Saint-Antoine, 75004 Paris, France] | 6.3 |
| 6 | Boulangerie Maison Hilaire | [11 rue Saint-Antoine, 75004 Paris, France] | NaN |
| 7 | Chambre Professionnelle des Artisans Boulanger... | [7 Quai d'Anjou, 75004 Paris, France] | NaN |
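
The fallback to NaN can be illustrated offline on mock payloads. The sketch below assumes, as the request code above does, that the rating sits under `response` then `venue` then `rating`, and that the key is simply absent for unrated venues. `rating_from_payload` is a hypothetical helper and the payloads are made up.

```python
import math

def rating_from_payload(payload):
    """Return the venue rating as a float, or NaN when it cannot be found."""
    try:
        return float(payload['response']['venue']['rating'])
    except (KeyError, TypeError, ValueError):
        # missing key, payload that is not a dict, or a non-numeric rating
        return math.nan

rated = {'response': {'venue': {'name': 'Demo Bakery', 'rating': 8.3}}}
unrated = {'response': {'venue': {'name': 'Unrated Bakery'}}}

ratings_demo = [rating_from_payload(rated), rating_from_payload(unrated)]
```

Catching only the expected exception types keeps unrelated errors visible instead of silently turning them into NaN.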


Finally, let's plot the bakeries we have discovered on a map centered around the district. We will plot the bakeries using different colors: in green the bakeries with a high rating (>= 8.0), in orange the other rated bakeries, and in black those that don't have any rating.

In [9]:
# Create the map
district_map = folium.Map(location=center_coord, zoom_start=15)

# Draw districts borders
folium.GeoJson(geo_districts, style_function=lambda x:{'color':'darkblue','weight':3,'fill':False}).add_to(district_map)

# Mark the center of the district and draw in red the area within 600 m of it
folium.Marker(center_coord).add_to(district_map)
folium.Circle(center_coord, radius=600, color="red", fill=True, fill_opacity=.1).add_to(district_map)

# Place the bakeries with the right color
for coord, name, rating in zip(df_bakeries['Coord'], df_bakeries['Name'], df_bakeries['Rating']):
    if math.isnan(rating):
        folium.CircleMarker(coord, radius=5, color='black', fill=True, fill_opacity=.7,
            popup='{} (no rating)'.format(name)).add_to(district_map)
    else:
        folium.CircleMarker(coord, radius=5, color=('green' if rating>=8 else 'orange'), fill=True, fill_opacity=.7,
            popup='{} ({})'.format(name, rating)).add_to(district_map)
        
# Show map
district_map
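
The color rule described in the text (green for ratings of 8.0 and above, orange for other rated bakeries, black for unrated ones) can also be isolated in a small function, which makes it easy to check on its own. This is a sketch; `marker_color` and its `threshold` parameter are hypothetical names.

```python
import math

def marker_color(rating, threshold=8.0):
    """Map a rating to a marker color: black if unrated, green if high, orange otherwise."""
    if rating is None or math.isnan(rating):
        return 'black'
    return 'green' if rating >= threshold else 'orange'

# Ratings taken from the table above, plus one unrated bakery
colors_demo = [marker_color(r) for r in (8.3, 7.9, math.nan)]
```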

<a id='#methodology'></a>

## Methodology ##

<a id='#analysis'></a>

## Analysis ##