# Capstone Project - The Battle of the Neighborhoods
# Opening an Indonesian restaurant in Strasbourg (France)
### Applied Data Science Capstone by IBM/Coursera

## 1. Introduction and business problem

### 1.1. Background
I'm a French citizen and my hometown is Strasbourg, a mid-sized city in the north-east of France. I've been living in Indonesia for the past eight years and thoroughly enjoy Indonesian cuisine. During my trips home in recent years, I noticed a growing number of Asian restaurants, with new ones opening every year. However, I also noticed that there is no Indonesian restaurant in my hometown, whereas Chinese, Japanese, Vietnamese and Thai restaurants are flourishing.
So I decided to apply the skills learned during the Coursera Applied Data Science Specialization to find the best area in which to open an Indonesian restaurant in Strasbourg.

### 1.2. Problem to solve
The challenge is to find an area that fits the following characteristics:
* High-income area, so residents are more likely to eat out in restaurants
* Populated area, to ensure potential customers live close by
* Sufficient restaurant density, to ensure customer traffic in the area

### 1.3. Interest
This analysis can support the decision making of anyone looking to open a new restaurant in Strasbourg. It is also a good practical case for Data Science students and for anyone interested in using the Foursquare API or in working with French statistical data.



## 2. Data

In France, detailed population data is compiled by the INSEE, the French National Institute of Statistics, based on its census data.

### 2.1. Required data
Here is the data needed to perform the analysis:
* Proper breakdown of the city into relevant areas: for mid-size cities in France, unfortunately, there is no area breakdown by postcode (as there is, for example, in New York or Toronto). The INSEE has divided the whole French territory into "IRIS" areas, which are clusters sized for relevant statistical analysis and are the basis of most statistical analysis in France
* Mean income per household by area
* Population of the area
* Geographical coordinates of each area
* Restaurant data for each area

### 2.2. Data sources
As mentioned, the INSEE collects all the census data such as population and income. However, this detailed data is not easy to obtain from the INSEE directly: I found few free datasets on their website that are relevant for our analysis.
After some research I found that the INSEE data is accessible in a user-friendly manner on the Opendatasoft website, which offers lots of public data at the following URL: <https://public.opendatasoft.com/>
This data can easily be exported as a csv file, which is what we will do later on.


For restaurant data the Foursquare API will be used.

### 2.3. Data Preparation

#### Download all dependencies and libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
# import StandardScaler to normalize the data
from sklearn.preprocessing import StandardScaler

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


#### Prepare the statistical area data
In this section we will extract, clean and prepare the following statistical data:
* Area codes and names using the IRIS breakdown (as no postcode breakdown is possible in our case)
* Population per area
* Mean income per area
* Latitude and longitude per area

Let's start by importing the first dataset, which contains most of the data except income. It can be found on the Opendatasoft website, which gathers a lot of the INSEE's French statistical data. This first dataset contains the IRIS areas, their population and their geographical coordinates.
This is accessible here: <https://public.opendatasoft.com/explore/dataset/population-francaise-par-code-iris-en-2012/table/?q=com%3D67482>

In [2]:
#import the population and coordinates from csv file
df_coor = pd.read_csv('https://public.opendatasoft.com/explore/dataset/population-francaise-par-code-iris-en-2012/download/?format=csv&q=com%3D67482&timezone=Asia/Jakarta&lang=fr&use_labels_for_header=true&csv_separator=%3B', sep=';')
df_coor.head()

Unnamed: 0,IRIS,REG,REG2016,DEP,UU2010,COM,LIBCOM,TRIRIS,GRD_QUART,LIBIRIS,TYP_IRIS,MODIF_IRIS,LAB_IRIS,P12_POP,GEO_SHAPE,geo_point_2d
0,674822301,42,44,67,67701,67482,Strasbourg,670451,6748223,Neudorf Est Sud,H,0,1,2631.333272,"{""type"": ""Polygon"", ""coordinates"": [[[7.773217...","48.5640709256,7.77561938307"
1,674820502,42,44,67,67701,67482,Strasbourg,670391,6748205,Kable Sud Est,H,0,1,3208.286888,"{""type"": ""Polygon"", ""coordinates"": [[[7.757449...","48.5920899839,7.75275916149"
2,674821004,42,44,67,67701,67482,Strasbourg,670441,6748210,Esplanade Nord Est,H,0,1,3336.142677,"{""type"": ""Polygon"", ""coordinates"": [[[7.773472...","48.5791513563,7.77709301935"
3,674821201,42,44,67,67701,67482,Strasbourg,670481,6748212,Neudorf Ouest Sud-Est,H,0,1,1943.439005,"{""type"": ""Polygon"", ""coordinates"": [[[7.761680...","48.5675394053,7.75860850397"
4,674821204,42,44,67,67701,67482,Strasbourg,670491,6748212,Neudorf Ouest Centre-Ouest,H,0,1,2016.244874,"{""type"": ""Polygon"", ""coordinates"": [[[7.750557...","48.5687866641,7.74811742357"


Then we process this extract to keep only the relevant columns, rename them, and split the coordinates into two columns for latitude and longitude:

In [3]:
# Keeping the useful columns (copy so that the later assignments don't act on a view of the original frame)
df_coor=df_coor[['IRIS','LIBIRIS','P12_POP','geo_point_2d']].copy()
#splitting the latitude longitude in 2 different columns and drop the geo_point_2d column
df_coor[['lat','lon']]=df_coor.geo_point_2d.str.split(",",expand=True).astype(float)
df_coor.drop(['geo_point_2d'],axis=1,inplace=True)
# renaming the columns headers
df_coor.rename(columns={'IRIS': 'Area_code', 'LIBIRIS' : 'Area_name' , 'P12_POP' : 'Population'}, inplace = True)
df_coor.head()

Unnamed: 0,Area_code,Area_name,Population,lat,lon
0,674822301,Neudorf Est Sud,2631.333272,48.564071,7.775619
1,674820502,Kable Sud Est,3208.286888,48.59209,7.752759
2,674821004,Esplanade Nord Est,3336.142677,48.579151,7.777093
3,674821201,Neudorf Ouest Sud-Est,1943.439005,48.567539,7.758609
4,674821204,Neudorf Ouest Centre-Ouest,2016.244874,48.568787,7.748117


Next we proceed with the second import, which provides the missing income data. It comes from another dataset located here: <https://public.opendatasoft.com/explore/dataset/base-iris-sur-les-revenus-declares/table/?refine.com=67482>

In [4]:
#import the income from csv file
url_income = 'https://public.opendatasoft.com/explore/dataset/base-iris-sur-les-revenus-declares/download/?format=csv&refine.com=67482&timezone=Asia/Jakarta&lang=fr&use_labels_for_header=true&csv_separator=%3B'
df_income = pd.read_csv(url_income, sep=';')
df_income.head()

Unnamed: 0,IRIS,Libellé de l'IRIS,Commune ou ARM,LIBCOM,Part des ménages fiscaux imposés (%),Taux de bas revenus déclarés au seuil de 60 % (%),1er quartile(€),Médiane (€),3e quartile(€),Écart inter-quartile rapporté à la médiane,1er décile (€),2e décile (€),3e décile (€),4e décile (€),6e décile (€),7e décile (€),8e décile (€),8e décile (€).1,Rapport interdécile D9/D1,S80/S20,Indice de Gini,Part des revenus d'activités salariées (%),Part des indemnités de chômage (%),Part des revenus d'activités non salariées (%),"Part des pensions, retraites et rentes (%)",Part des autres revenus (%)
0,674822104,Robertsau Centre,67482,Strasbourg,73.758099,11.083499,18812.0,28728.0,42722.0,0.832289,11752.0,16978.0,21074.0,25178.0,33144.0,38670.0,63188.0,63188.0,5.376787,7.98797,0.385618,56.8,2.7,7.0,24.2,9.3
1,674821001,Esplanade Sud Est,67482,Strasbourg,47.286822,34.562212,9454.0,16056.0,23800.0,0.893498,4306.0,8114.0,10770.0,13656.0,18476.0,21658.0,33384.0,33384.0,7.752903,9.42266,0.373278,48.4,3.6,1.6,42.9,3.5
2,674823001,Stockfeld Est,67482,Strasbourg,57.305503,23.874755,12330.0,18924.0,25118.0,0.675756,6720.0,10782.0,14058.0,16522.0,21080.0,23564.0,33076.0,33076.0,4.922024,6.07462,0.302723,55.5,3.0,1.2,37.5,2.8
3,674822106,Robertsau Est,67482,Strasbourg,74.489796,11.794254,19376.0,29956.0,44342.0,0.833422,9794.0,17038.0,21434.0,26028.0,34158.0,40268.0,66968.0,66968.0,6.837656,8.771107,0.39365,54.1,2.1,6.4,28.7,8.7
4,674820202,Petite France Nord Ouest,67482,Strasbourg,60.792952,25.502426,11982.0,22341.0,33790.0,0.976143,5776.0,10238.0,14152.0,18314.0,26134.0,30956.0,47892.0,47892.0,8.291551,11.498074,0.408787,68.1,3.8,7.5,14.6,6.0


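We only keep the columns of interest (area code and income) and rename them. Note that the income column kept here is `Médiane (€)`, the median declared income of each area; it is named `Mean_income` in the rest of the notebook:

In [5]:
#Keeping the useful columns (copy so that the rename doesn't act on a view of the original frame)
df_income = df_income[['IRIS','Médiane (€)']].copy()
# renaming the columns headers ('Mean_income' holds the median declared income of each area)
df_income.rename(columns={'IRIS': 'Area_code', 'Médiane (€)' : 'Mean_income'}, inplace = True)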
df_income.head()

Unnamed: 0,Area_code,Mean_income
0,674822104,28728.0
1,674821001,16056.0
2,674823001,18924.0
3,674822106,29956.0
4,674820202,22341.0


Finally we merge both cleaned dataframes to get our expected statistical data:

In [6]:
# merge both dataframes on area_code
df=pd.merge(df_income,df_coor, on='Area_code')
df.head()

Unnamed: 0,Area_code,Mean_income,Area_name,Population,lat,lon
0,674822104,28728.0,Robertsau Centre,2146.842963,48.6073,7.783869
1,674821001,16056.0,Esplanade Sud Est,3365.658019,48.575999,7.770276
2,674823001,18924.0,Stockfeld Est,2267.72998,48.531956,7.771555
3,674822106,29956.0,Robertsau Est,2638.509955,48.603239,7.801769
4,674820202,22341.0,Petite France Nord Ouest,2292.148315,48.582471,7.741456
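
Because `pd.merge` performs an inner join by default, any area present in only one of the two datasets is dropped without warning. The following is a minimal sketch of how that could be checked, using small made-up frames (`income` and `coords` are hypothetical stand-ins for `df_income` and `df_coor`, and their values are illustrative only):

```python
import pandas as pd

# Toy stand-ins for df_income and df_coor (illustrative values, not the real data)
income = pd.DataFrame({"Area_code": [1, 2, 3],
                       "Mean_income": [28728.0, 16056.0, 18924.0]})
coords = pd.DataFrame({"Area_code": [2, 3, 4],
                       "Population": [3365.7, 2267.7, 2631.3]})

# An outer merge with indicator=True keeps every area and records its origin
check = pd.merge(income, coords, on="Area_code", how="outer", indicator=True)

# Rows not labelled "both" are the ones an inner merge would drop
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Area_code", "_merge"]])
```

If `unmatched` is empty on the real data, the inner merge used above loses nothing.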


Ok that looks good so far. Let's map these IRIS areas on the Strasbourg map.
Let's get the geographical coordinates of Strasbourg so we can center the map properly:

In [7]:
address = 'Strasbourg, FR'

geolocator = Nominatim(user_agent="strasbourg_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Strasbourg are {}, {}.'.format(latitude, longitude))


The geographical coordinates of Strasbourg are 48.584614, 7.7507127.


And display the map using folium. As Strasbourg is a mid-size city we use a zoom of 12:

In [26]:
# create map of Strasbourg using latitude and longitude values
map_strasbourg = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(df['lat'], df['lon'], df['Area_name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_strasbourg)  
    
map_strasbourg

Now that this is done, it's time to get the relevant restaurant data!

#### Prepare the restaurant data by area

For this we'll use the Foursquare API. Let's enter our credentials here (the real values have been removed for privacy reasons):

In [27]:
#Foursquare API
CLIENT_ID = 'abc' # your Foursquare ID
CLIENT_SECRET = 'def' # your Foursquare Secret
VERSION = '20200803'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: abc
CLIENT_SECRET:def


We get the food venues of each area within a radius of 500 meters (limited to a maximum of 100 venues per area) by creating a function that:
* extracts the name, latitude and longitude of each area from our dataset
* passes the client credentials, latitude and longitude to the Foursquare API to explore venues within a 500 m radius of the area's coordinates
* restricts the API call to the Food venues category, which according to Foursquare's API documentation has the category id '4d4b7105d754a06374d81259' (note: there is no specific category for restaurants only)
* retrieves the venue data from the JSON that the Foursquare API returns
* creates a dataframe of all the information required to analyze the restaurants

In [10]:
#Get the food venues of each area
radius = 500
LIMIT = 100
cat_id = '4d4b7105d754a06374d81259' #Food venues category

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            lat, 
            lng, 
            VERSION, 
            radius, 
            LIMIT,
            cat_id)

        # make the GET request and keep the list of venue items
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Area_name', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
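
The function above indexes straight into the JSON response, so a response without a `groups` entry or a venue without any category would raise an exception. Whether the API actually returns such responses is an assumption here; the sketch below only shows one defensive way to parse the same nesting. `extract_venue_rows` and the `sample` payload are hypothetical names and data introduced for illustration:

```python
def extract_venue_rows(area_name, payload):
    """Return (area, venue, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # Fall back to None when a venue carries no category
        category = categories[0]["name"] if categories else None
        rows.append((area_name, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows

# Hypothetical payload mimicking the nesting read by getNearbyVenues
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "La Vignette",
               "location": {"lat": 48.603874, "lng": 7.785202},
               "categories": [{"name": "French Restaurant"}]}},
    {"venue": {"name": "Unnamed stall",
               "location": {"lat": 48.6, "lng": 7.78},
               "categories": []}},
]}]}}

rows = extract_venue_rows("Robertsau Centre", sample)
empty = extract_venue_rows("Nowhere", {"meta": {"code": 429}})
print(rows)
print(empty)
```

With this shape, an area whose request fails simply contributes no rows instead of interrupting the whole loop.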
        

Now let's run our newly created function on our dataset:

In [11]:
strasbourg_venues = getNearbyVenues(names = df['Area_name'],
                                   latitudes = df['lat'],
                                   longitudes = df['lon']
                                  )

Robertsau Centre
Esplanade Sud Est
Stockfeld Est
Robertsau Est
Petite France Nord Ouest
Gare Nord Ouest
Kable Sud Est
Contades Nord
Contades Sud
Neudorf Sud Sud
Cronenbourg Est Nord-Ouest
Gare Sud Est
Cronenbourg Est Centre-Est
Gare Sud Ouest
Foret Noire Est
Esplanade Sud Ouest
Koenigshoffen Est Ouest
Krutenau Nord Ouest
Neudorf Sud Centre Ouest
Kable Sud Ouest
Montagne Verte Nord Ouest
Polygone Est
Neuhof Sud
Contades Centre
Robertsau Nord
Cronenbourg Ouest Ouest
Vauban Ouest
Orangerie Est
Polygone Sud
Neudorf Sud Sud Est
Neudorf Est Sud
Gare Nord Est
Cronenbourg Ouest Nord-Est
Vauban Sud
Neudorf Ouest Centre
Robertsau Ouest
Koenigshoffen Ouest Centre-Est
Elsau Ouest
Neuhof Nord
Stockfeld Ouest
Petite France Nord Est
Foret Noire Sud
Robertsau Sud Ouest
Montagne Verte Nord Est
Mairie Sud
Montagne Verte Sud Est
Montagne Verte Sud
Port du Rhin Centre Ouest
Krutenau Centre Ouest
Montagne Verte Centre Ouest
Krutenau Nord Est
Neudorf Ouest Centre-Ouest
Cronenbourg Ouest Est
Petite France Ce

And check the resulting dataframe:

In [12]:
strasbourg_venues.head()

Unnamed: 0,Area_name,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Robertsau Centre,48.6073,7.783869,La Vignette,48.603874,7.785202,French Restaurant
1,Robertsau Centre,48.6073,7.783869,Restaurant Au Coq Blanc,48.603501,7.783838,French Restaurant
2,Robertsau Centre,48.6073,7.783869,Le Joyeux Pécheur,48.605472,7.778098,French Restaurant
3,Robertsau Centre,48.6073,7.783869,Le Violon d'Ingres,48.603929,7.782247,Diner
4,Robertsau Centre,48.6073,7.783869,Boulangerie Patisserie Materne,48.604399,7.78204,Bakery


In [13]:
strasbourg_venues.shape

(1322, 7)
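
Each area is queried independently with a 500 m radius, so a venue close to several area centres can be returned more than once and the row count is not necessarily the number of distinct venues. A minimal sketch of how this could be checked, on a made-up frame that reuses the column names of `strasbourg_venues` (the rows themselves are illustrative):

```python
import pandas as pd

# Toy frame with the same venue columns as strasbourg_venues (illustrative rows)
venues = pd.DataFrame({
    "Area_name": ["Area A", "Area B", "Area B"],
    "Venue": ["La Vignette", "La Vignette", "Le Coq"],
    "Venue Latitude": [48.603874, 48.603874, 48.6035],
    "Venue Longitude": [7.785202, 7.785202, 7.7838],
})

# A venue is identified here by its name and coordinates (an assumption)
key = ["Venue", "Venue Latitude", "Venue Longitude"]
n_rows = len(venues)
n_unique = len(venues.drop_duplicates(subset=key))
print(n_rows, "rows,", n_unique, "distinct venues")
```

For the per-area counts used later, keeping the repeated rows is reasonable, since a venue near two areas is reachable from both.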

So the queries returned 1,322 food venue records for the areas in our dataset. Let's see what different types of food Strasbourg has to offer:

In [14]:
strasbourg_venues['Venue Category'].unique()

array(['French Restaurant', 'Diner', 'Bakery', 'Pizza Place',
       'Sandwich Place', 'Middle Eastern Restaurant', 'Taco Place',
       'Fast Food Restaurant', 'Gastropub', 'Doner Restaurant',
       'Brasserie', 'Steakhouse', 'Restaurant', 'Brazilian Restaurant',
       'Asian Restaurant', 'Belgian Restaurant', 'Café',
       'Italian Restaurant', 'Spanish Restaurant',
       'Vietnamese Restaurant', 'Alsatian Restaurant', 'Bistro',
       'Vegetarian / Vegan Restaurant', 'Mediterranean Restaurant',
       'Sushi Restaurant', 'Cigkofte Place', 'Bagel Shop',
       'Thai Restaurant', 'Japanese Restaurant', 'Tapas Restaurant',
       'German Restaurant', 'Trattoria/Osteria', 'Burger Joint',
       'Chinese Restaurant', 'Mexican Restaurant', 'Indian Restaurant',
       'Cafeteria', 'Food Truck', 'Comfort Food Restaurant',
       'Snack Place', 'Fried Chicken Joint', 'Dim Sum Restaurant',
       'Korean Restaurant', 'Lebanese Restaurant', 'Mac & Cheese Joint',
       'Deli / Bodega', 'Pa

We can see that no venue is categorized as an Indonesian restaurant, so we would be the first one!

## 3. Methodology

### 3.1. Exploratory data analysis

In this section we will explore our data. First I want to assign a flag (1 if True, 0 if False) to each venue, to know whether it serves Asian or non-Asian cuisine.

To do that I need to check whether each venue belongs to one of the following categories (handpicked from the list of unique categories above):
* Asian Restaurant
* Vietnamese Restaurant
* Sushi Restaurant
* Thai Restaurant
* Japanese Restaurant
* Chinese Restaurant
* Korean Restaurant


In [15]:
# Create a NumPy array containing all the food venue categories that serve Asian cuisine
Asian_chk = np.array(['Asian Restaurant','Vietnamese Restaurant','Sushi Restaurant','Thai Restaurant','Japanese Restaurant','Chinese Restaurant','Korean Restaurant'])

# Create two columns in our venues dataset to identify if the venues are Asian or Non Asian
strasbourg_venues['Asian'] = strasbourg_venues['Venue Category'].isin(Asian_chk).astype(int)
strasbourg_venues['Non Asian']= abs(1-strasbourg_venues['Asian'])

# Check our updated venues dataset
strasbourg_venues.head()

Unnamed: 0,Area_name,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Asian,Non Asian
0,Robertsau Centre,48.6073,7.783869,La Vignette,48.603874,7.785202,French Restaurant,0,1
1,Robertsau Centre,48.6073,7.783869,Restaurant Au Coq Blanc,48.603501,7.783838,French Restaurant,0,1
2,Robertsau Centre,48.6073,7.783869,Le Joyeux Pécheur,48.605472,7.778098,French Restaurant,0,1
3,Robertsau Centre,48.6073,7.783869,Le Violon d'Ingres,48.603929,7.782247,Diner,0,1
4,Robertsau Centre,48.6073,7.783869,Boulangerie Patisserie Materne,48.604399,7.78204,Bakery,0,1


Now that we identified the types of venues (Asian / Non Asian), let's summarize this data on area level in a new dataframe. This will let us know how many Asian and Non Asian food venues are in each area:

In [16]:
sxb = strasbourg_venues[['Area_name','Asian','Non Asian']].groupby('Area_name').sum()
sxb.head()

Area_name,Asian,Non Asian
Canardiere Est Est,1,3
Canardiere Ouest Est,1,4
Canardiere Ouest Ouest,1,3
Cite de l'Ill Est,0,3
Cite de l'Ill Ouest,0,1


The next step is to merge this dataframe, which holds the number of venues per category (Asian / Non Asian), with the statistical data (population, income, latitude, longitude):

In [17]:
sxb=pd.merge(sxb,df, on='Area_name')
sxb.head()

Unnamed: 0,Area_name,Asian,Non Asian,Area_code,Mean_income,Population,lat,lon
0,Canardiere Est Est,1,3,674822702,7052.0,3044.921472,48.548277,7.754681
1,Canardiere Ouest Est,1,4,674822602,13714.0,2167.715969,48.549848,7.751284
2,Canardiere Ouest Ouest,1,3,674822601,19106.0,3843.497615,48.54929,7.745445
3,Cite de l'Ill Est,0,3,674822002,12136.0,1845.866862,48.615211,7.779522
4,Cite de l'Ill Ouest,0,1,674822001,10358.0,2746.380994,48.615941,7.772018


Let's check the statistics of this area data:

In [18]:
sxb.describe()

Unnamed: 0,Asian,Non Asian,Area_code,Mean_income,Population,lat,lon
count,94.0,94.0,94.0,94.0,94.0,94.0,94.0
mean,1.265957,12.797872,674821500.0,19456.585106,2670.143222,48.578593,7.749775
std,2.16609,17.994372,807.1956,6979.877591,708.482747,0.017467,0.025801
min,0.0,1.0,674820100.0,5584.0,1365.319345,48.531956,7.692865
25%,0.0,4.0,674820800.0,13898.0,2141.375251,48.56742,7.731384
50%,0.0,5.0,674821400.0,19896.0,2562.127392,48.580072,7.75492
75%,1.0,12.0,674822100.0,23033.0,3161.780208,48.589633,7.768567
max,8.0,93.0,674823000.0,40438.0,5214.369785,48.625189,7.808822


From this dataset we can see that, across the 94 areas of Strasbourg:
* The average yearly income of an area (the mean of the `Mean_income` column) is 19,456 euros
* The mean population of an area is 2,670 inhabitants
* The mean number of Asian restaurants per area is 1.27
* The mean number of Non Asian restaurants per area is 12.8

### 3.2. Clustering of the areas

We will run k-means clustering with 5 clusters. But first we need to normalize the data using `StandardScaler`.

In [19]:
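# Normalization of the data
X = sxb.drop(['Area_name','Area_code','lat','lon'], axis= 1)
# Note: the next assignment overrides the previous one, so X holds every column
# except Area_name (Area_code, lat and lon included, hence the 7 columns in the output)
X = sxb.values[:,1:]
X = np.nan_to_num(X)
Clus_dataSet = StandardScaler().fit_transform(X)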
Clus_dataSet



array([[-1.23440611e-01, -5.47416068e-01,  1.52421347e+00,
        -1.78672161e+00,  5.31823535e-01, -1.74493673e+00,
         1.91158169e-01],
       [-1.23440611e-01, -4.91545155e-01,  1.39966350e+00,
        -8.27145834e-01, -7.12961966e-01, -1.65447723e+00,
         5.87783661e-02],
       [-1.23440611e-01, -5.47416068e-01,  1.39841800e+00,
        -5.04972943e-02,  1.66503121e+00, -1.68659505e+00,
        -1.68716577e-01],
       [-5.87577308e-01, -5.47416068e-01,  6.52363631e-01,
        -1.05443652e+00, -1.16967718e+00,  2.10760928e+00,
         1.15909964e+00],
       [-5.87577308e-01, -6.59157893e-01,  6.51118132e-01,
        -1.31053465e+00,  1.08184084e-01,  2.14966698e+00,
         8.66694721e-01],
       [-5.87577308e-01, -4.35674243e-01, -1.09133605e+00,
         2.42204514e+00,  9.21089309e-01,  6.76067209e-01,
         4.27252486e-01],
       [-5.87577308e-01, -3.79803330e-01, -1.09009055e+00,
         3.28320521e-01, -1.85159103e+00,  1.15745462e+00,
         5.8025261

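Next we run k-means clustering. Note that the cell below fits on `X`, the unscaled array, and not on `Clus_dataSet`, so the cluster labels used in the rest of this notebook come from unscaled features, where columns with large values weigh most in the distances: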

In [20]:
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(X)
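
The cell above passes `X` to `fit`. A variant that clusters on standardized feature columns only could look like the sketch below. It runs on synthetic data (`toy` is a random stand-in for `sxb`, not the Strasbourg data), and the choice of the four feature columns and of `n_init=10` are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for sxb (random values, not the Strasbourg data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Asian": rng.integers(0, 9, size=40),
    "Non Asian": rng.integers(1, 94, size=40),
    "Mean_income": rng.uniform(5000, 40000, size=40),
    "Population": rng.uniform(1300, 5200, size=40),
})

# Assumed feature set: identifiers and coordinates are left out
feature_cols = ["Asian", "Non Asian", "Mean_income", "Population"]

# Standardize each column to zero mean and unit variance before clustering
scaled = StandardScaler().fit_transform(toy[feature_cols])

# Fit k-means on the scaled array rather than on the raw values
kmeans_scaled = KMeans(n_clusters=5, random_state=0, n_init=10).fit(scaled)
toy["Cluster Labels"] = kmeans_scaled.labels_
print(toy["Cluster Labels"].value_counts())
```

Run on the real data, this variant would generally produce different labels from the ones reported below, so the cluster descriptions would have to be re-read accordingly.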

Let's add the cluster labels back to our main dataset defined in step 3.1:

In [21]:
# add clustering labels to our dataset created in step 3.1
sxb.insert(0, 'Cluster Labels', kmeans.labels_)
sxb.head()

Unnamed: 0,Cluster Labels,Area_name,Asian,Non Asian,Area_code,Mean_income,Population,lat,lon
0,1,Canardiere Est Est,1,3,674822702,7052.0,3044.921472,48.548277,7.754681
1,1,Canardiere Ouest Est,1,4,674822602,13714.0,2167.715969,48.549848,7.751284
2,3,Canardiere Ouest Ouest,1,3,674822601,19106.0,3843.497615,48.54929,7.745445
3,1,Cite de l'Ill Est,0,3,674822002,12136.0,1845.866862,48.615211,7.779522
4,1,Cite de l'Ill Ouest,0,1,674822001,10358.0,2746.380994,48.615941,7.772018


Now we can visualize each cluster on the map, using folium:

In [22]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sxb['lat'], sxb['lon'], sxb['Area_name'], sxb['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Let's make a summary of the data to see the characteristics of each cluster:

In [23]:
# create a dataframe to aggregate the clusters' data
sxb_summary = sxb.groupby('Cluster Labels').agg({'Area_name':'count','Asian':'sum', 'Non Asian':'sum','Mean_income':'mean','Population':'sum'})
sxb_summary

Cluster Labels,Area_name,Asian,Non Asian,Mean_income,Population
0,31,61,636,22713.935484,76350.318941
1,26,9,120,10723.730769,71045.257264
2,10,8,110,29032.5,24785.413978
3,24,40,317,18434.958333,69242.979112
4,3,1,20,37735.333333,9569.493565


What we can say about the clusters:
* **Cluster 0:** the most populated cluster, with 31 areas and 76,350 inhabitants; its income of 22,713 euros is above the city's mean income of 19,456 euros, and it has the highest number of restaurants **--> Good potential cluster**
* **Cluster 1:** the mean income is too low for our case
* **Cluster 2:** a smaller cluster with fewer than 25,000 people, but with a very high income of 29,032 euros, almost 10,000 euros higher than the mean income in the city **--> Good potential cluster**
* **Cluster 3:** the mean income is too low, as we are looking for clusters above the mean income of the city
* **Cluster 4:** a very small cluster with fewer than 10,000 inhabitants but the highest mean income in the city; there is also only one Asian restaurant in this cluster **--> Good potential cluster**

## 4. Results

Based on the exploratory analysis and the clustering, the best areas in which to open an Indonesian restaurant in Strasbourg have the following characteristics:
* Mean income higher than the city's mean income
* Population of the area higher than the mean area population
* Number of Non Asian restaurants in the area higher than the city's mean per area (to ensure there is enough customer traffic and to validate that the area's population eats out)
* Area located in cluster 0, 2 or 4

So let's list the areas that fulfill these criteria:

In [24]:
# list the best locations as defined (in cluster 0 or 2 or 4 AND population > mean population AND income > mean income AND Non Asian venues > mean Non Asian venues)

sxb_good = sxb.loc[(sxb['Mean_income'] > sxb['Mean_income'].mean()) 
                   & (sxb['Population'] > sxb['Population'].mean()) 
                   & (sxb['Non Asian'] > sxb['Non Asian'].mean())
                   & ((sxb['Cluster Labels']==0)|(sxb['Cluster Labels']==2)|(sxb['Cluster Labels']==4))
                  ]

sxb_good

Unnamed: 0,Cluster Labels,Area_name,Asian,Non Asian,Area_code,Mean_income,Population,lat,lon
18,0,Esplanade Nord Ouest,1,16,674821005,22360.0,2746.041336,48.579749,7.766852
34,0,Kable Sud Ouest,2,19,674820503,23270.0,2962.160086,48.590608,7.748234
46,0,Mairie Sud,5,69,674820102,24812.0,3741.745583,48.582155,7.752335
78,0,Poincare Est,4,15,674820402,22446.0,2975.528259,48.588116,7.745589
79,0,Poincare Ouest,6,26,674820401,23068.0,3754.31764,48.589735,7.741896
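
The cluster condition in the filter above can also be expressed with `isin`, which keeps the list of retained clusters in one place. A minimal sketch on a made-up frame (`toy` and its values are illustrative; only the column names come from `sxb`):

```python
import pandas as pd

# Toy frame reusing the column names of sxb (illustrative values)
toy = pd.DataFrame({
    "Area_name": ["A", "B", "C", "D"],
    "Cluster Labels": [0, 1, 2, 0],
    "Mean_income": [25000.0, 30000.0, 12000.0, 26000.0],
    "Population": [3000.0, 3500.0, 3200.0, 1500.0],
    "Non Asian": [30, 35, 25, 10],
})

target_clusters = [0, 2, 4]

# Same four criteria as above, with the cluster test written as isin
mask = (
    toy["Cluster Labels"].isin(target_clusters)
    & (toy["Mean_income"] > toy["Mean_income"].mean())
    & (toy["Population"] > toy["Population"].mean())
    & (toy["Non Asian"] > toy["Non Asian"].mean())
)
shortlist = toy.loc[mask, "Area_name"].tolist()
print(shortlist)
```

Only area "A" passes all four tests in this toy example: "B" is in the wrong cluster, "C" falls below the mean income and "D" below the mean population.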


We found 5 areas fulfilling our requirements. Let's map them out on the city map:

In [25]:
# create map of Strasbourg using latitude and longitude values
map_good = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(sxb_good['lat'], sxb_good['lon'], sxb_good['Area_name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_good)  
    
map_good

The 5 selected areas are pretty much downtown and in the nicer neighbourhoods of Strasbourg.

## 5. Discussion

Our analysis led us to identify the best potential areas in which to open a new type of restaurant (in our case Indonesian) in a mid-size city (in our case Strasbourg in France, approximately 277,000 inhabitants).

Unlike bigger cities, this city cannot be analyzed by simply using a postal code breakdown, as that data isn't granular enough, so we had to find a more appropriate segmentation. In our case this was the IRIS area breakdown defined by the INSEE, the French National Institute of Statistics, which has the advantage of coming with lots of interesting census data (such as population and income) that is accessible through a third-party website.

Based on the available data we first explored some basic characteristics of each area and ran a k-means clustering to reveal patterns that can't be identified by looking at the data manually. Based on the different clusters we then identified which areas fulfill all our criteria, and ended up with a shortlist of the 5 best areas in which to open an Indonesian restaurant in Strasbourg (France):
* Esplanade Nord Ouest
* Kable Sud Ouest
* Mairie Sud
* Poincare Est
* Poincare Ouest

## 6. Conclusion

The purpose of this project was to identify the best area in Strasbourg (France) in which to open an Indonesian restaurant, which would be the first one in the city. Based on our analysis we identified 5 potential areas fulfilling the criteria defined in the methodology part. This analysis is very helpful for pre-screening potential sites and speeds up the process of identifying the right location, as we have already identified the areas in which it's best to open an Indonesian restaurant.

This is a good starting point for exploring these pre-selected areas further. The decision on the precise location also needs to take into account other factors such as real estate availability, rental prices and the accessibility of the location.