# Clustering and segmenting Neighborhood in Rennes, France

## Introduction / Business Problem

Rennes is my birth city and was quite in advance on its time by giving open datas to the people since at least ten years. Now, there is lots of data from the city of Rennes and other contractors that have business in Rennes. Like for example the company that run the bus in Rennes has some API to get datas about them. Bus stop location, real time traffic, etc..

Our objectif in this project is to cluster Rennes Neighborhood using foursquare and using datas from the city of Rennes.

We use foursquare API but we also add data about transportation of Rennes city.

Then we try to analyse cluster predicted and assigned them to groups of people. 

Overall business objectif  is to direct people/businesses, that are looking to settle in Rennes, into the good neighborhood.

## Data

In order to complete out goals, two source of datas are going to be used : 
- Foursquare API
- Rennes' datasets

All datas are quite easily available. Biggest challenge is to find out the neighborhood and assign each venue, structure, etc.. to the correct neighborhood.


### Foursquare API

Foursquare is a social location service that allows users to explore the world around them. They are at this time able to review locations in which they come and go and give a notation and a comment about this place. 

The Foursquare API allows application developers to interact with the Foursquare platform. We can retrieve venues and all the details about it (notations, comments, users, etc..), but also details about users.

In order to cluster our neighborhood, we will use Foursquare API to have data about Rennes locations and venues.
We will then be able to link top 10 venues to each neighborhood.

### Rennes' datasets

In addition to the Foursquare venues, we will add transport informations to our neighborhood : 
- Number of Bus stops
- Number of charger for electric vehicles
- Number of kilometers of paid parking
- Number of bike supports
- Number of kilometers of bike ways
- Number of culture equipment

Each of these datasets are coming from Rennes Open Data Services. We downloaded some csv files containing informations about Rennes and others cities surrounding Rennes. Each equipment, location, etc. is coming with a Latitude and a Longitude. for simplicity purpose, we will use in this study only the informations concerning Rennes. 

In Rennes Open Datas Services we were also able to find on very important dataset, which is the one that cut Rennes into neighborhood.

Thanksfully Rennes is a quite small city compared to New York and Toronto, so we do not need to cut down the numbers of neighborhood.

In order to analyse our clusters, we also have datas about : 
- Sexe, age and nationality of population in Rennes
- Seniority of habitants of Rennes

We will use these datas to see if our analysis is correct at the end of the project. 
We will indeed be able to see if habitants of a neighbourhood are concording with the analysis we made about each clusters we created.

# Notebook

## Data import

First we import all the libraries needed

In [1]:
import pandas as pd
import numpy as np
import json # library to handle JSON files
from math import *

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values


from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geopy                     1.17.0                     py_0    conda-forge
Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge
Libraries imported.


After this we define this little function that will help us compute distance between two points defined with Longitude and Latitude. This will be very helpfull for us in order to define the appartenance of an object (bus stop, electric car charger, etc.) to a neighbourhood.

In [3]:
def haversine(lon1, lat1, lon2, lat2):

    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

## Neighbourhood

First we import the Neighbourhood of Rennes. We downloaded csv files from Rennes Open Data website into the Watson Studio. We use the code provided by IBM to import the data into the notebook. After This we only remove the unused column and we keep only the Latitude and Longitude (as floats), and of course the name of the neighbourhood.

In [4]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,nom,lat,long
0,Le Blosne,48.085013,-1.658945
1,Cleunay - Arsenal - Redon,48.095816,-1.722033
2,Saint Martin,48.126865,-1.683262
3,Villejean - Beauregard,48.129004,-1.711953
4,Bréquigny,48.086038,-1.685403


Let's see this on the map : 

In [5]:
latitude = 48.117266
longitude = -1.6777926

# create map of NToronto using latitude and longitude values
map_rennes = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, neighborhood in zip(rennes_neighbourhood['lat'], rennes_neighbourhood['long'], rennes_neighbourhood['nom']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_rennes)  
    
map_rennes

## Other Datas from Rennes Open Data Website :

In all the datas that we will get in this section, we will remove unused columns and keep a column with the equipment name, the longitude and the latitude as float and we will add a column full of Nan named Neighbourhood in order to class each object into a neighbourhood later. And then we will show the head.

Also for simplicity we only keep datas from the city of Rennes only, as in some data frames, their is data of suburb of Rennes that we remove.

### Electric Car Charger

In [188]:
body = client_2551b64066e74033992250268342bdad.get_object(Bucket='courseraproject-donotdelete-pr-pstz2scfjmnlsh',Key='bornes-de-recharge-dediees-aux-vehicules-electriques-sur-le-territoire-de-rennes.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

electric_car_charger = pd.read_csv(body,delimiter=';')
electric_car_charger['addr'], electric_car_charger['town'] = electric_car_charger['site_adr'].str.split(',').str
electric_car_charger = electric_car_charger.loc[electric_car_charger['town'] == ' Rennes']
electric_car_charger['lat'], electric_car_charger['long'] = electric_car_charger['Geo Point'].str.split(',').str
electric_car_charger = electric_car_charger[['lat','long']]
electric_car_charger['Equipment'] = 'electric car charger'
electric_car_charger['Neighbourhood'] = np.nan
electric_car_charger.head()

Unnamed: 0,lat,long,Equipment,Neighbourhood
2,48.106149,-1.677162,electric car charger,
3,48.111,-1.68363,electric car charger,
4,48.130528,-1.638323,electric car charger,
6,48.092414,-1.674211,electric car charger,
7,48.113523,-1.686233,electric car charger,


### Bus stops

In [187]:
body = client_2551b64066e74033992250268342bdad.get_object(Bucket='courseraproject-donotdelete-pr-pstz2scfjmnlsh',Key='equipement-accessibilite-arrets-bus.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

bus_stop = pd.read_csv(body,delimiter=';')
bus_stop = bus_stop.loc[bus_stop['Commune (nom)'] == 'Rennes']
bus_stop['lat'], bus_stop['long'] = bus_stop['Coordonnées'].str.split(',').str
bus_stop = bus_stop[['lat','long']]
bus_stop['Equipment'] = 'bus stop'
bus_stop['Neighbourhood'] = np.nan
bus_stop.head()

Unnamed: 0,lat,long,Equipment,Neighbourhood
0,48.127369,-1.640433,bus stop,
1,48.121446,-1.655036,bus stop,
2,48.119241,-1.667693,bus stop,
3,48.11605,-1.674245,bus stop,
4,48.11252,-1.680352,bus stop,


### Bike Stops

For this one we keep the number of support at each point in order to sum them up later.

In [189]:
body = client_2551b64066e74033992250268342bdad.get_object(Bucket='courseraproject-donotdelete-pr-pstz2scfjmnlsh',Key='supports-velos.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

bike_stops = pd.read_csv(body,delimiter=';')
bike_stops = bike_stops.loc[bike_stops['nom_commune'] == 'Rennes']
bike_stops['lat'], bike_stops['long'] = bike_stops['Geo Point'].str.split(',').str
bike_stops = bike_stops[['nombre_support', 'lat', 'long']]
bike_stops['Equipment'] = 'bike stops'
bike_stops['Neighbourhood'] = np.nan
bike_stops.head()

Unnamed: 0,nombre_support,lat,long,Equipment,Neighbourhood
0,4,48.1174872812,-1.6777579592,bike stops,
1,5,48.1098139465,-1.67522515707,bike stops,
2,8,48.1096012826,-1.67985481523,bike stops,
3,5,48.113188555,-1.67762495861,bike stops,
4,4,48.0859188837,-1.64220623748,bike stops,


### Cultural equipment

For this one, name of the neighbourhood is already included. We do not even need to keep Latitude and Longitude.

In [192]:
body = client_2551b64066e74033992250268342bdad.get_object(Bucket='courseraproject-donotdelete-pr-pstz2scfjmnlsh',Key='liste-des-equipements-et-organismes-culturels-de-rennes-metropole.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

cultural_equipment = pd.read_csv(body,delimiter=';')
cultural_equipment = cultural_equipment.loc[cultural_equipment['CommuneNom'] == 'Rennes']
cultural_equipment['Equipment'] = 'Cultural equipement'
cultural_equipment['Neighbourhood'] = cultural_equipment['QuarNom']
cultural_equipment = cultural_equipment[['Equipment','Neighbourhood']]

cultural_equipment.head()

Unnamed: 0,Equipment,Neighbourhood
0,Cultural equipement,Centre
1,Cultural equipement,Thabor - Saint-Hélier - Alphonse Guérin
2,Cultural equipement,Maurepas - Bellangerais
3,Cultural equipement,Maurepas - Bellangerais
4,Cultural equipement,Jeanne d'Arc - Longs Champs - Beaulieu


### Green Roads

In [193]:
body = client_2551b64066e74033992250268342bdad.get_object(Bucket='courseraproject-donotdelete-pr-pstz2scfjmnlsh',Key='amenagements-velo-et-zones-de-circulation-apaisee-sur-rennes-metropole.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

green_roads = pd.read_csv(body,delimiter=';')

green_roads = green_roads.loc[green_roads['c_insee'] == 35238.0]
green_roads['lat'], green_roads['long'] = green_roads['Geo Point'].str.split(',').str
green_roads = green_roads[['lat','long']]
green_roads['Equipment'] = 'Green Roads'
green_roads['Neighbourhood'] = np.nan
green_roads.head()

Unnamed: 0,lat,long,Equipment,Neighbourhood
0,48.126091,-1.633563,Green Roads,
1,48.113899,-1.679896,Green Roads,
2,48.112593,-1.68146,Green Roads,
3,48.101184,-1.677567,Green Roads,
4,48.090941,-1.667963,Green Roads,


### Paid Parking

In [194]:
body = client_2551b64066e74033992250268342bdad.get_object(Bucket='courseraproject-donotdelete-pr-pstz2scfjmnlsh',Key='portions-de-voies-en-stationnement-payant-sur-la-ville-de-rennes.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

paid_parking = pd.read_csv(body,delimiter=';')
paid_parking['lat'], paid_parking['long'] = paid_parking['Geo Point'].str.split(',').str
paid_parking = paid_parking[['lat','long']]
paid_parking['Equipment'] = 'Parking'
paid_parking['Neighbourhood'] = np.nan
paid_parking.head()

Unnamed: 0,lat,long,Equipment,Neighbourhood
0,48.112387,-1.661293,Parking,
1,48.104794,-1.692262,Parking,
2,48.109626,-1.672802,Parking,
3,48.104185,-1.690956,Parking,
4,48.106053,-1.690532,Parking,


Lets merge all of our Data Frames (except cultural equipment that we will add once we found the neigbourhood of each equipment).
We also make sure that latitude and longitude are floats.

In [200]:
frames = [paid_parking, green_roads, bike_stops, bus_stop, electric_car_charger]
equipments = pd.concat(frames)
equipments['lat'] = equipments['lat'].astype(float)
equipments['long'] = equipments['long'].astype(float)
equipments.head()

Unnamed: 0,Equipment,Neighbourhood,lat,long,nombre_support
0,Parking,,48.112387,-1.661293,
1,Parking,,48.104794,-1.692262,
2,Parking,,48.109626,-1.672802,
3,Parking,,48.104185,-1.690956,
4,Parking,,48.106053,-1.690532,


## Put equipment into Neighboorhood

For simplicity reason we will consider that each neighboor is round. So we will take the center of the neighbourhood and look for each equipment if the point is in a distance of less than 1km of a neighbourhood. If it is the case, we will assign the neighbourhood to the equipment.

For the computation of the distances we use the fonction defined at the begining.

In [None]:
for index, row in rennes_neighbourhood.iterrows():
    check_distance = equipments.copy()
    check_distance['Neighbourhood']=rennes_neighbourhood['nom'].iloc[index]
    check_distance['Neighbourhood_lat']=rennes_neighbourhood['lat'].iloc[index]
    check_distance['Neighbourhood_long']=rennes_neighbourhood['long'].iloc[index]
    check_distance['dist_from_Neighbourhood']= check_distance.apply(lambda row: haversine(row['Neighbourhood_long'], 
                                            row['Neighbourhood_lat'], 
                                            row['long'], 
                                            row['lat']), axis=1)
    
    for index, row in check_distance.iterrows():
        if row['dist_from_Neighbourhood'] < 1:
            if pd.isnull(equipments['Neighbourhood'].iloc[index]):
                equipments['Neighbourhood'].iloc[index] = row['Neighbourhood']
            else:
                equipments['Neighbourhood'].iloc[index] = equipments['Neighbourhood'].iloc[index] + ', ' + row['Neighbourhood']
                
            
equipments.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


As a result of choosing round neighbourhood for simplicity, some equipment are not assigned to a neighbourhood.
Let's get rid of them.

In [228]:
pd.isnull(equipments['Neighbourhood'].iloc[1])

False

In [218]:
equipments.head()

Unnamed: 0,Equipment,Neighbourhood,lat,long,nombre_support
0,Parking,CentreJeanne d'Arc - Longs Champs - BeaulieuJe...,48.112387,-1.661293,
1,Parking,CentreJeanne d'Arc - Longs Champs - BeaulieuCe...,48.104794,-1.692262,
2,Parking,CentreThabor - Saint-Hélier - Alphonse GuérinT...,48.109626,-1.672802,
3,Parking,CentreSud gareCentreCentreCentre,48.104185,-1.690956,
4,Parking,CentreLe BlosneSud gareCentreCentre,48.106053,-1.690532,


# Code pour l'ancienneté des habitant et les stats de pop

In [None]:
body = client_2551b64066e74033992250268342bdad.get_object(Bucket='courseraproject-donotdelete-pr-pstz2scfjmnlsh',Key='logement-anciennete.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

housing_seniority = pd.read_csv(body,delimiter=';')
housing_seniority = housing_seniority.loc[housing_seniority['Commune'] == 'Rennes']
housing_seniority = housing_seniority.drop(['Code INSEE', 'geolocalisation', 'Commune'], axis = 1)
housing_seniority.head()

body = client_2551b64066e74033992250268342bdad.get_object(Bucket='courseraproject-donotdelete-pr-pstz2scfjmnlsh',Key='population-par-sexe-age-et-nationalite-par-commune-2014.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

pop_stats = pd.read_csv(body,delimiter=';')
pop_stats = pop_stats.loc[pop_stats['libellé géographique'] == 'Rennes']
pop_stats = pop_stats.drop(['niveau géographique', 'code géographique', 'libellé géographique', 'CODE_DEPT', 'CODE_DEPT'], axis = 1)
pop_stats = pop_stats.drop(['NB', 'inter_codegeo1', 'EPCI', 'LIBEPCI', 'NATURE_EPCI', 'NOM_DEPT', 'CODE_REG', 'NOM_REG'], axis = 1)
pop_stats.head()