# Capstone project – Data science : Neighborhood war in Paris

## 1. Introduction/Business Problem

Paris is the capital of France and one of the main economic centers of Europe. The city is a major hub for many businesses: banking, fashion, restaurants, theaters, and more. Paris is the most competitive market in France.
Because of this strong competition, XYZ Company, a business-to-business (B2B) firm specialized in concierge services for other companies, wants to open a new subsidiary in Paris. To maximize its profit, several factors need to be studied before deciding on a location, such as:
* the concentration of venues (the future clients)
* the presence of competitors in that location

The objective of this study is to recommend, to anyone who wants to open a general B2B company in Paris (a concierge company, outsourcing services, fast food for companies, ...), which neighborhood is the best place to start. The project succeeds if it produces a sound neighborhood recommendation based on the concentration of venues and their quality.

## 2. Data
Two datasets will be used:
* the neighborhoods of Paris
* the businesses in Paris from Foursquare

### 2.1 The neighborhood of Paris
The data comes from https://www.data.gouv.fr/fr/datasets/quartiers-administratifs/ . Paris has 20 boroughs, called arrondissements, and 80 neighborhoods. The dataset provides:
* l_qu : the name of the neighborhood
* geom_x_y : the latitude and longitude of the neighborhood
* geom : the geometric shape of the neighborhood
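
The two coordinate fields do not use the same order: geom_x_y lists latitude first, while the GeoJSON polygon in geom stores longitude first. The sketch below reads such a polygon with the standard `json` module; the shortened polygon `geom_text` is a hypothetical example, not a row of the dataset.

```python
import json

# Hypothetical, shortened polygon in the same GeoJSON layout as the 'Geometry' column
geom_text = (
    '{"type": "Polygon", "coordinates": '
    '[[[2.34, 48.87], [2.35, 48.87], [2.35, 48.86], [2.34, 48.87]]]}'
)

geom = json.loads(geom_text)
outer_ring = geom["coordinates"][0]   # first ring = outer boundary of the polygon
lons = [pt[0] for pt in outer_ring]   # GeoJSON stores longitude first
lats = [pt[1] for pt in outer_ring]   # and latitude second

# bounding box as (min latitude, min longitude, max latitude, max longitude)
bbox = (min(lats), min(lons), max(lats), max(lons))
print(geom["type"], bbox)
```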

In [1]:
!wget -q 'https://opendata.paris.fr/explore/dataset/quartier_paris/download?format=csv&timezone=Europe/Berlin&use_labels_for_header=True' -O quartier_paris
import numpy as np
import pandas as pd
neigh_paris= pd.read_csv('quartier_paris', sep=';')
neigh_paris.head()

| | N_SQ_QU | C_QU | C_QUINSEE | L_QU | C_AR | N_SQ_AR | PERIMETRE | SURFACE | Geometry X Y | Geometry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 750000006 | 6 | 7510202 | Vivienne | 2 | 750000002 | 2058.472959 | 243550.770623 | 48.8691001998,2.33946074375 | {"type": "Polygon", "coordinates": [[[2.341232... |
| 1 | 750000010 | 10 | 7510302 | Enfants-Rouges | 3 | 750000003 | 2139.625388 | 271750.323937 | 48.863887392,2.36312330099 | {"type": "Polygon", "coordinates": [[[2.367101... |
| 2 | 750000024 | 24 | 7510604 | Saint-Germain-des-Prés | 6 | 750000006 | 2565.899893 | 282279.939864 | 48.85528872,2.33365686809 | {"type": "Polygon", "coordinates": [[[2.336959... |
| 3 | 750000037 | 37 | 7511001 | Saint-Vincent-de-Paul | 10 | 750000010 | 4072.789633 | 926865.229776 | 48.8807352373,2.35747081045 | {"type": "Polygon", "coordinates": [[[2.360513... |
| 4 | 750000042 | 42 | 7511102 | Saint-Ambroise | 11 | 750000011 | 4052.567737 | 837992.921567 | 48.8623450235,2.37611805592 | {"type": "Polygon", "coordinates": [[[2.370939... |


### 2.2. Venues in Paris
This dataset is meant to cover the venues that already exist in Paris, along with their number of likes (used as a measure of the quality of a venue). The venue data from Foursquare will be used in this project.

## 3. Methodology
### 3.1 Business understanding
The main goal of this project is to identify locations suitable for opening a new general B2B business in Paris.
### 3.2 Analytic Approach
Paris has a total of 80 neighborhoods. In this project, the neighborhoods will be clustered using the features built during the exploratory data analysis in the next part.
### 3.3 Exploratory Data Analysis
#### 3.3.1. Data Neighborhood of Paris
We will use the CSV file presented in the Data section.

The data is cleaned to keep only the necessary information:
* we drop all the unnecessary columns
* we extract the latitude and longitude of each neighborhood

In [2]:
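# split the "latitude,longitude" string into two columns (n must be passed by keyword in recent pandas)
neigh_coord=pd.DataFrame(neigh_paris['Geometry X Y'].str.split(',', n=1).tolist(),columns = ['Latitude','Longitude'])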
neigh_coord.Latitude = pd.to_numeric(neigh_coord.Latitude, errors='coerce')
neigh_coord.Longitude = pd.to_numeric(neigh_coord.Longitude, errors='coerce')
neigh_paris.drop(columns=['N_SQ_QU','C_QU','C_QUINSEE','C_AR','N_SQ_AR','PERIMETRE','SURFACE','Geometry','Geometry X Y'], inplace=True)
neigh_paris=pd.concat ([neigh_paris,neigh_coord], axis=1).rename(columns={'L_QU':'Neighborhood'})
neigh_paris.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Vivienne | 48.8691 | 2.339461 |
| 1 | Enfants-Rouges | 48.863887 | 2.363123 |
| 2 | Saint-Germain-des-Prés | 48.855289 | 2.333657 |
| 3 | Saint-Vincent-de-Paul | 48.880735 | 2.357471 |
| 4 | Saint-Ambroise | 48.862345 | 2.376118 |


We will use Folium to map this data.

In [9]:
!pip install -q folium # install Folium if it is not already available
import folium # map rendering library

# create map
map_clusters = folium.Map(location=[48.8534 , 2.3488], zoom_start=13)

# add markers to the map
for lat, lon, poi in zip(neigh_paris['Latitude'], neigh_paris['Longitude'], neigh_paris['Neighborhood']):
    label = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


#### 3.3.2. Venues in Paris
The neighborhood data is used to explore all the venues within a radius of 500 m around the coordinates of each neighborhood.
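
To make the 500 m radius concrete, the distance between a neighborhood center and a venue can be checked with the haversine formula. The sketch below is only an illustration: the helper `haversine_m` and the Earth radius constant are assumptions made for this example, and the two points are the Vivienne center and the venue "Le Moderne" as reported in the outputs of this notebook.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

center = (48.8691, 2.339461)    # Vivienne
venue = (48.868856, 2.342142)   # Le Moderne
distance = haversine_m(*center, *venue)

# the venue should fall inside the 500 m search radius
print(round(distance), distance <= 500)
```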

A function is defined to explore every neighborhood. For each venue, it returns:
* the venue name
* the venue latitude and longitude
* the venue category


In [3]:
import requests # library to handle requests

# make call of API to search for neighborhood
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID (do not publish real credentials)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN' # your Foursquare access token
VERSION = '20210205' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value


def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a DataFrame of the venues found within `radius` meters of each neighborhood center."""

    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&oauth_token={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            ACCESS_TOKEN,
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['id'],
            v['venue']['name'],
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'id',  
                  'Venue',
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
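
The function above assumes that every response contains at least one group and that every venue has a category; a missing key would raise an exception in the middle of the loop. The sketch below shows one possible defensive way to flatten a payload. The helper `parse_explore_items` and the `fake_payload` dictionary are hypothetical; only the nesting (`response` → `groups` → `items` → `venue`) is taken from the code above.

```python
def parse_explore_items(name, lat, lng, payload):
    """Flatten one 'explore' JSON payload into rows, skipping incomplete venues."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        location = venue.get("location", {})
        if not categories or "lat" not in location or "lng" not in location:
            continue  # skip venues without a category or without coordinates
        rows.append((name, lat, lng, venue.get("id"), venue.get("name"),
                     location["lat"], location["lng"], categories[0]["name"]))
    return rows

# Hypothetical payload mimicking the structure read by getNearbyVenues
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"id": "a1", "name": "Hotel Example",
               "location": {"lat": 48.87, "lng": 2.34},
               "categories": [{"name": "Hotel"}]}},
    {"venue": {"id": "b2", "name": "No Category Place",
               "location": {"lat": 48.86, "lng": 2.33},
               "categories": []}},
]}]}}

rows = parse_explore_items("Vivienne", 48.8691, 2.339461, fake_payload)
empty = parse_explore_items("Vivienne", 48.8691, 2.339461, {"response": {}})
print(rows)   # only the venue that has a category is kept
print(empty)  # an empty response yields no rows instead of an exception
```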

We apply the ***getNearbyVenues*** function to all 80 neighborhoods.

In [4]:
Paris_venues = getNearbyVenues(names=neigh_paris['Neighborhood'],
                                   latitudes=neigh_paris['Latitude'],
                                   longitudes=neigh_paris['Longitude']
                                  )

In [5]:
print(Paris_venues.shape)
Paris_venues['Venue Latitude']=pd.to_numeric(Paris_venues['Venue Latitude'], errors='coerce')
Paris_venues['Venue Longitude']=pd.to_numeric(Paris_venues['Venue Longitude'], errors='coerce')
Paris_venues.head()


(6518, 8)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | id | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | Vivienne | 48.8691 | 2.339461 | 4bc324a874a9a5936c18d4f6 | Le Moderne | 48.868856 | 2.342142 | French Restaurant |
| 1 | Vivienne | 48.8691 | 2.339461 | 53ff2b53498ed3314ce29fa2 | A. Noste | 48.869122 | 2.339138 | Tapas Restaurant |
| 2 | Vivienne | 48.8691 | 2.339461 | 4b39f154f964a520b25f25e3 | Workshop Issé | 48.868895 | 2.337066 | Gourmet Shop |
| 3 | Vivienne | 48.8691 | 2.339461 | 5519894a498e3be5c5222550 | Dynamo Cycling | 48.868871 | 2.336581 | Cycle Studio |
| 4 | Vivienne | 48.8691 | 2.339461 | 59144a9ae97dfb02c92e3e3e | Karaage-Ya Bourse | 48.87015 | 2.341826 | Japanese Restaurant |


Not every venue is a potential client for a concierge company, so we keep only the following categories: Bank, Bookstore, Church, Cycle Studio, Dance Studio, Design Studio, Government Building, Hostel, Hotel, Office, Rental Car Location, Resort, School, Shopping Mall.

In [6]:
cat_filter=['Bank', 'Bookstore', 'Church', 'Cycle Studio', 'Dance Studio', 'Design Studio', 'Government Building', 'Hostel', 'Hotel', 'Office', 'Rental Car Location', 'Resort', 'School', 'Shopping Mall']
Paris_venues=Paris_venues[Paris_venues['Venue Category'].isin(cat_filter)]
Paris_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | id | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 3 | Vivienne | 48.8691 | 2.339461 | 5519894a498e3be5c5222550 | Dynamo Cycling | 48.868871 | 2.336581 | Cycle Studio |
| 15 | Vivienne | 48.8691 | 2.339461 | 4f606b63e4b006673ab72d57 | Hôtel La Maison Favart | 48.8708 | 2.33729 | Hotel |
| 29 | Vivienne | 48.8691 | 2.339461 | 4c28dda897d00f47f58540ea | BookOff | 48.868946 | 2.335462 | Bookstore |
| 40 | Vivienne | 48.8691 | 2.339461 | 4b488e63f964a520464f26e3 | Hotel Gramont Opera | 48.87055 | 2.336845 | Hotel |
| 77 | Vivienne | 48.8691 | 2.339461 | 4adcda01f964a520fe3021e3 | Hôtel Ascot Opéra | 48.868232 | 2.335154 | Hotel |


As with the neighborhood data, we will use Folium to map this data.

In [10]:
# create map
map_venues = folium.Map(location=[48.8534 , 2.3488], zoom_start=13)

# add markers to the map
for lat, lon, poi, cat in zip(Paris_venues['Venue Latitude'], Paris_venues['Venue Longitude'], Paris_venues['Venue'], Paris_venues['Venue Category']):
    label = folium.Popup(str(poi) + '-' + str(cat), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=2,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7).add_to(map_venues)
       
map_venues



Next, we build a dataset that gives, for each neighborhood, the number of venues per category.

In [11]:
# one hot encoding
Paris_onehot = pd.get_dummies(Paris_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Paris_onehot['Neighborhood'] = Paris_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [Paris_onehot.columns[-1]] + list(Paris_onehot.columns[:-1])
Paris_onehot = Paris_onehot[fixed_columns]
# note: count() returns the number of rows per neighborhood, so every category column
# holds the same total; sum() would give the number of venues per category instead
Paris_group = Paris_onehot.groupby('Neighborhood').count().reset_index()

print(Paris_group.shape)
Paris_group.head()

(75, 14)


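| | Neighborhood | Bank | Bookstore | Church | Cycle Studio | Dance Studio | Design Studio | Government Building | Hostel | Hotel | Office | Rental Car Location | Resort | Shopping Mall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Amérique | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | Archives | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 2 | Arsenal | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
| 3 | Arts-et-Métiers | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| 4 | Auteuil | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |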
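
Every category column in the output above holds the same number because `groupby(...).count()` counts the rows of each neighborhood, not the ones in each one-hot column. The sketch below contrasts `count()` with `sum()` on a small toy table; the data and variable names are hypothetical and serve only to show the difference.

```python
import pandas as pd

# Toy data (hypothetical): three venues in neighborhood A, one in neighborhood B
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Hotel", "Hotel", "Bank", "Bank"],
})

onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# count(): number of rows per neighborhood, repeated in every column
by_count = onehot.groupby("Neighborhood").count()
# sum(): number of venues per category and per neighborhood
by_sum = onehot.groupby("Neighborhood").sum()

print(by_count)
print(by_sum)
```

With `sum()`, each neighborhood gets one value per category, which matches the stated goal of counting venues by category; with `count()`, the clustering effectively relies on the total number of selected venues per neighborhood.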


### 3.4 Machine learning uses

To segment the neighborhoods of Paris based on existing venues and to recommend a good place to start a B2B business, we use a clustering algorithm, namely K-means. K-means is an unsupervised algorithm: it does not need previous recommendations (labels) to build a model. It is well suited to segmentation, since it divides the data into clusters without requiring labels or any internal cluster structure.

In [14]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 2

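# keep only the numeric features (the positional axis argument is no longer accepted by recent pandas)
Paris_group_cluster = Paris_group.drop(columns='Neighborhood')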

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Paris_group_cluster)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0], dtype=int32)

## 4. Results
To group the neighborhoods, we used the K-means clustering algorithm. K-means aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It uses an iterative refinement approach. In this project, 2 clusters are chosen, and the neighborhoods with no selected venue, which are absent from the clustering, form a third group:
* neighborhoods where we recommend starting a B2B business
* neighborhoods where we do not recommend starting a B2B business
* neighborhoods where a B2B business must not be started (there is no potential client within a 500 m radius)
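
The nearest-mean assignment and the iterative refinement mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard Lloyd iteration, not the scikit-learn implementation used in this notebook; the function name `kmeans_sketch` and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign each point to its nearest mean, then recompute the means."""
    rng = np.random.default_rng(seed)
    # start from k distinct observations picked at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: index of the nearest center for every observation
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of the points assigned to it
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # the centers no longer move: the iteration has converged
        centers = new_centers
    return labels, centers

# Hypothetical venue totals per neighborhood (one feature), with two obvious groups
X = np.array([[1.0], [2.0], [3.0], [9.0], [10.0], [11.0]])
labels, centers = kmeans_sketch(X, k=2)
print(labels, centers.ravel())
```

The label numbers themselves (0 or 1) are arbitrary; only the grouping matters, so the cluster that is "recommended" has to be identified by inspecting its content.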


The map below, built with the Folium library, shows the neighborhoods of Paris grouped into the two clusters.

In [19]:
check_group=Paris_venues[['Neighborhood','Venue']].groupby('Neighborhood').count().reset_index()
check_group.insert(0,'Cluster label',kmeans.labels_)

!wget -q 'https://www.data.gouv.fr/fr/datasets/r/a8748f53-5850-4a04-b8cc-9c9f5f72949f' -O Quartier_geojson

geojson_paris = r'Quartier_geojson' # geojson file

# create a map centered on Paris
final_map = folium.Map(location=[48.8534 , 2.3488], zoom_start=13)

# color each neighborhood by its cluster label (Map.choropleth is deprecated in favor of folium.Choropleth)
folium.Choropleth(
    geo_data=geojson_paris,
    data=check_group,
    columns=['Neighborhood', 'Cluster label'],
    key_on='feature.properties.l_qu',
    fill_color='RdBu', 
    fill_opacity=0.7, 
    line_opacity=0.7,
    legend_name='Cluster label'
).add_to(final_map)

final_map
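
Neighborhoods without any selected venue do not appear in the clustered table, so they carry no label on the map. One possible way to make this third group explicit is to left-join the labels onto the full list of neighborhoods. The sketch below uses small hypothetical stand-ins (`all_neigh`, `clustered`, `labels`) for the notebook's DataFrames, and the sentinel value -1 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-ins for the full neighborhood list, the clustered table and kmeans.labels_
all_neigh = pd.DataFrame({"Neighborhood": ["A", "B", "C", "D"]})
clustered = pd.DataFrame({"Neighborhood": ["A", "C", "D"]})  # "B" had no selected venue
labels = [0, 1, 0]

clustered = clustered.assign(**{"Cluster label": labels})

# the left join keeps every neighborhood; those missing from the clustering get the label -1
full = all_neigh.merge(clustered, on="Neighborhood", how="left")
full["Cluster label"] = full["Cluster label"].fillna(-1).astype(int)
print(full)
```

Joining on the neighborhood name also avoids relying on the row order of two separately grouped tables.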

## 5. Discussion
We can see that all the recommended neighborhoods are concentrated in the center of Paris. Adding another variable, such as the quality of the future clients measured by the number of likes of the venues, could lead to a better clustering.
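
As a rough sketch of that idea, the number of likes could be aggregated per neighborhood and standardized next to the venue count before clustering. The `Likes` column below is hypothetical: it was not collected in this project, and the values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical data: the 'Likes' column is NOT part of the dataset collected above
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "C", "C", "C"],
    "Venue": ["v1", "v2", "v3", "v4", "v5", "v6"],
    "Likes": [10, 30, 5, 100, 50, 60],
})

# one row per neighborhood: how many venues, and how well liked they are on average
features = venues.groupby("Neighborhood").agg(
    venue_count=("Venue", "count"),
    mean_likes=("Likes", "mean"),
)

# standardize so that neither feature dominates the Euclidean distance used by K-means
scaled = (features - features.mean()) / features.std(ddof=0)
print(features)
print(scaled.round(2))
```

The `scaled` table could then be passed to `KMeans` in place of the count-only features.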

## 6. Conclusion
If you want to start a concierge company (or another B2B business) in Paris, this project is a gift for you.