#### *IBM Data Science Certification Capstone Project - authored by Julien Girault, data scientist scholar*
# Find your Favourite Tube Station before Moving to London!

# Project introduction

Many French people move from Paris to London to improve their English and try a new way of life.

But what is the best place they should choose for settling down?

When moving to a new city, people often choose their new address based on public transport: they look for the lines that can take them quickly and directly from their new home to their office.

__The "London tube" and "Paris métro" networks are comparable in size, with approximately 400 stations__... So the next question is: how can newcomers know what kind of surroundings they are likely to find around each station? How can they decide where to settle down? Which station should they prefer, with so many possibilities?

This project gives French newcomers the opportunity to find the list of tube stations in London that best fit their taste, based on a comparison with something they already know: Paris's metro stations!

For each London tube station on the map, we show a list of similar Paris stations. The rest of this report describes how this is done.

You can also find this full report, more results and the corresponding source code in [my Jupyter notebook of the project on my GitHub account here](https://github.com/juliengirault52/Coursera_Capstone/blob/main/capstone_project.ipynb).


# Data Description

There are two data entities for this project. Here is a short description of each to give the reader a quick understanding:
### Station coordinates
For this project, I downloaded the station coordinates released by the public transport services of both London [1] and Paris [2].

### List of venues available around each station
To compare the stations, I used the data available via the Foursquare API for each station's coordinates, exploring the list of venues reachable within a five-minute walk (300 meters).
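
The Foursquare API applies the 300 meter radius itself. As a sanity check on what that radius means, here is a minimal sketch of the haversine distance between a station and one of its venues. The function name `haversine_m` is hypothetical and not part of the project; the coordinates are those of Temple station and the venue "Two Temple Place" as they appear in the venue table later in this notebook.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Temple station (London) and the venue "Two Temple Place"
distance = haversine_m(51.510474, -0.112644, 51.511523, -0.112236)
within_radius = distance <= 300  # the radius used for the venue search
```

With these coordinates the venue lies roughly 120 meters from the station, well inside the search radius.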

# Methodology

The main idea of the project was __to mix all the stations of both London and Paris and cluster them__. The comparison of the stations had to be based only on the services offered in their surroundings (i.e. without taking their location into account). Once the stations can be compared this way, they can be grouped into clusters of similar stations.

### Why do we need an AI algorithm for clustering?
For a two-dimensional problem, clusters can often be found simply by visualizing the data, and there is no need for AI. But in our case, the dataset had about 400 features (venue categories), which means about 400 dimensions! Even after simplifying the data, which is covered later in the report, there were still more than 200 features. This is why we needed a clustering algorithm to reach the objective.

### Data preparation
#### Categories adjustments
A manual job was necessary to improve the matching of venue categories between Paris and London. For instance, I needed to transform "Pizza Place", which surprisingly was only used in London, into "Italian Restaurant". Likewise, the venues called "Gym" in Paris were called "Gym / Fitness Center" in London. There was also very specific data such as "Auvergne Restaurant" that became "Restaurant". And of course the English "Pub" and the French "Bistro" needed to be turned into a more generic name: "Bar".

To do so, I created a "categories_to_merge.csv" file with an ordered list of the most common categories and identified which ones were unbalanced between London and Paris.
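
The merges described above amount to a lookup table from specific to generic category. Here is a minimal sketch in plain Python; the names `CATEGORY_MERGES` and `merge_category` are hypothetical, and the table only contains the examples quoted in this section, not the full list used in the project.

```python
# Hypothetical excerpt of the merge table: specific category -> generic category
CATEGORY_MERGES = {
    "Pizza Place": "Italian Restaurant",
    "Gym / Fitness Center": "Gym",
    "Auvergne Restaurant": "Restaurant",
    "Pub": "Bar",
    "Bistro": "Bar",
}

def merge_category(category):
    """Return the generic category, or the input unchanged when no merge applies."""
    return CATEGORY_MERGES.get(category, category)

venue_categories = ["Pub", "Bistro", "Pizza Place", "Park"]
merged = [merge_category(c) for c in venue_categories]
```

On a pandas column, passing the same dictionary to `Series.replace` gives the same result in a single call.
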
#### Data simplification
To reduce the number of features, I kept only the categories that appear among the 5 most common venue categories of at least one station. This left about 250 features instead of 402.
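
As an illustration of this reduction, here is a minimal sketch on made-up data: a category is kept when it is in the top `n` of at least one station. The station names, frequencies and the helper `top_categories` are assumptions for the example (with `n = 2` instead of 5 to keep it small).

```python
def top_categories(freqs, n):
    """Names of the n most frequent categories for one station."""
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    return ranked[:n]

# Made-up frequency vectors for three stations
stations = {
    "A": {"Bar": 0.5, "Hotel": 0.3, "Park": 0.2, "Gym": 0.0},
    "B": {"Bar": 0.1, "Hotel": 0.6, "Park": 0.0, "Gym": 0.3},
    "C": {"Bar": 0.4, "Hotel": 0.4, "Park": 0.2, "Gym": 0.0},
}

n = 2
kept = set()
for freqs in stations.values():
    kept.update(top_categories(freqs, n))

# Drop every category that is never in a station's top n
reduced = {
    name: {c: f for c, f in freqs.items() if c in kept}
    for name, freqs in stations.items()
}
```

Here "Park" is never in any station's top 2, so it is the only feature removed.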

### London-Paris "city-balance": an ad hoc quality score
One of the risks was that the clusters would group stations from only one city (Paris or London). To make sure that the resulting clusters would contain a balanced number of stations from each city, I computed a "balance score", which must be as low as possible. For each cluster, it is the distance to 50% of the ratio "number of Paris stations" / "total number of stations" (*score=abs(0.50-np/(np+nl))*, where np and nl are the numbers of Paris and London stations in the cluster). The score reported for a clustering is the mean of the scores of its clusters.
Example output:
```
Cluster Labels-3 cities-balance score: 0.3574861584519108 inertia: 89.97750663333983
Cluster Labels-4 cities-balance score: 0.34106974568562276 inertia: 86.00200658957111
Cluster Labels-5 cities-balance score: 0.34197384655531804 inertia: 82.58403862899996
Cluster Labels-6 cities-balance score: 0.37053549083395193 inertia: 79.11207334558907
Cluster Labels-7 cities-balance score: 0.37329728602870704 inertia: 77.13651635333981
```
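
To make the score concrete, here is a minimal sketch in plain Python on made-up labels. The function names `balance_score` and `mean_balance_score` are hypothetical; the notebook computes the same quantity directly on its pandas table.

```python
def balance_score(cities):
    """Distance to a 50/50 split for one cluster, given the city of each of its stations."""
    n_paris = sum(1 for c in cities if c == "Paris")
    return abs(0.5 - n_paris / len(cities))

def mean_balance_score(labels, cities):
    """Mean of the per-cluster balance scores for one clustering."""
    clusters = {}
    for label, city in zip(labels, cities):
        clusters.setdefault(label, []).append(city)
    scores = [balance_score(members) for members in clusters.values()]
    return sum(scores) / len(scores)

# Made-up example: cluster 0 is perfectly balanced, cluster 1 is 3 Paris / 1 London
labels = [0, 0, 0, 0, 1, 1, 1, 1]
cities = ["Paris", "London", "Paris", "London", "Paris", "Paris", "Paris", "London"]
score = mean_balance_score(labels, cities)
```

Cluster 0 scores 0 and cluster 1 scores 0.25, so the clustering scores 0.125. A cluster made of a single city scores 0.5, the worst possible value.
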
### Why use the k-means algorithm? (iterating and selecting the best score)
One of the advantages of the "k-means clustering" algorithm is that one can choose, at the beginning of the process, how many clusters will be created. This made it a good way to iterate in search of the best "city-balance" quality score. As a matter of fact, with this algorithm, the more clusters you create, the more similar the stations within each cluster will be. On the other hand, we wanted the best possible balance, so I ran the algorithm multiple times and kept the result with the best balance. This was a good way to find a compromise between similarity and "city balance".
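
The selection step itself is small once each run has a score. The sketch below uses the cities-balance scores logged by the run shown later in this notebook, rounded to four decimals; the entry for 14 clusters is left out because its logged line is truncated. The variable names are hypothetical.

```python
# Mean cities-balance score per number of clusters (rounded from the logged run)
balance_by_k = {
    3: 0.3528, 4: 0.3734, 5: 0.3556, 6: 0.3727, 7: 0.3465, 8: 0.3739,
    9: 0.3826, 10: 0.3918, 11: 0.3550, 12: 0.3450, 13: 0.3674,
}

# The lowest score is the best balance
best_k = min(balance_by_k, key=balance_by_k.get)
best_score = balance_by_k[best_k]
```

This picks 12 clusters, which matches the value printed by the notebook for that run.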

# Results
Here is the list of Paris stations corresponding to each London station.

# Discussion
Although we used k-means for the advantage explained above, it could be interesting to run a DBSCAN algorithm to see how many clusters it would compute. DBSCAN does not require the number of clusters to be specified: it derives it from the density of the data. It would then also be interesting to see what London-Paris "city-balance" score this would give.

Running the program numerous times brought me to the conclusion that, because k-means centroids are initialised randomly and the stations are very similar, the groups vary greatly from one run to the next. I think a good next step for the project could be to record all the Paris stations that were proposed in the same group as each London station, and maybe add a probability score.
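
One possible way to build such a probability score, sketched on made-up data: for each London station, count in how many runs each Paris station shares its cluster, then divide by the number of runs. The function `co_cluster_frequency` and the label lists are assumptions for illustration and are not part of the project.

```python
from collections import Counter, defaultdict

def co_cluster_frequency(runs, stations):
    """runs: one list of cluster labels per k-means run, aligned with stations.
    stations: list of (city, name) pairs.
    Returns {london_name: {paris_name: share of runs where both share a cluster}}."""
    counts = defaultdict(Counter)
    for labels in runs:
        for (city_a, name_a), label_a in zip(stations, labels):
            if city_a != "London":
                continue
            for (city_b, name_b), label_b in zip(stations, labels):
                if city_b == "Paris" and label_a == label_b:
                    counts[name_a][name_b] += 1
    return {
        lon: {par: n / len(runs) for par, n in c.items()}
        for lon, c in counts.items()
    }

stations = [("London", "Temple"), ("Paris", "Voltaire"), ("Paris", "Wagram")]
# Made-up labels for four runs
runs = [[0, 0, 1], [1, 1, 1], [0, 1, 1], [2, 2, 0]]
freq = co_cluster_frequency(runs, stations)
```

Only co-membership is counted, so the score does not depend on the arbitrary cluster numbers that k-means assigns in each run.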


# Conclusion
It is amazing how clustering can help to distinguish groups when data is very complex, but the results are a bit hard to interpret. Now let's hope that this London tube list will still help people moving to London to find the place where they will settle down!


## References

[1] [London Tube Stations List (CSV)](https://www.whatdotheyknow.com/request/512947/response/1238210/attach/3/Stations%2020180921.csv.txt?cookie_passthrough=1)

[2] [Paris Transports Stations List website](https://data.iledefrance-mobilites.fr/explore/dataset/emplacement-des-gares-idf/download/?format=json&refine.mode=Metro&timezone=Europe/Berlin&lang=fr)

[3] [Coursera Project Training Exercise "The Battle of Neighborhoods"](https://www.coursera.org/learn/applied-data-science-capstone)

In [4]:
import pandas as pd
import io
import requests

#London tube stations coordinates:
req=requests.get("https://www.whatdotheyknow.com/request/512947/response/1238210/attach/3/Stations%2020180921.csv.txt?cookie_passthrough=1").content
df_london=pd.read_csv(io.StringIO(req.decode('utf-8')))

#Paris metro stations coordinates:
req=requests.get("https://data.iledefrance-mobilites.fr/explore/dataset/emplacement-des-gares-idf/download/?format=json&refine.mode=Metro&timezone=Europe/Berlin&lang=fr").content
df_paris=pd.read_json(io.StringIO(req.decode('utf-8')))

In [5]:
df_paris.head(4)

Unnamed: 0,datasetid,recordid,fields,geometry,record_timestamp
0,emplacement-des-gares-idf,723289fe50c959f7e63d75b17870762aa8eaddd4,"{'res_stif': 110.0, 'cod_ligf': 14.0, 'nom_iv'...","{'type': 'Point', 'coordinates': [2.3891158073...",2020-01-15T11:22:48.576+01:00
1,emplacement-des-gares-idf,2f98f2e1ee73e414cf64bae428caa96ba114be23,"{'res_stif': 110.0, 'cod_ligf': 4.0, 'nom_iv':...","{'type': 'Point', 'coordinates': [2.3209981919...",2020-01-15T11:22:48.576+01:00
2,emplacement-des-gares-idf,dafc950d65ec51317aa65aaba7a12fb5a0cfc396,"{'res_stif': 110.0, 'cod_ligf': 15.0, 'nom_iv'...","{'type': 'Point', 'coordinates': [2.2781616712...",2020-01-15T11:22:48.576+01:00
3,emplacement-des-gares-idf,5bc1c5091428bb56801455343b0cd58fca8d4179,"{'res_stif': 110.0, 'cod_ligf': 1.0, 'nom_iv':...","{'type': 'Point', 'coordinates': [2.3693205849...",2020-01-15T11:22:48.576+01:00


In [6]:
df_london.head(2)

Unnamed: 0,FID,OBJECTID,NAME,EASTING,NORTHING,LINES,NETWORK,Zone,x,y
0,0,78,Temple,530959,180803,"District, Circle",London Underground,1,-0.112644,51.510474
1,1,79,Blackfriars,531694,180893,"District, Circle",London Underground,1,-0.10202,51.511114


We need to extract the station names and coordinates from the dictionary structures they are stored in, which is easy with pandas!

In [7]:
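# Replace each nested dictionary by the value we need.
# The columns are reused as-is here and renamed in the next cell.
df_paris["record_timestamp"]=df_paris["geometry"].apply(lambda geom: geom["coordinates"][1])  # latitude
df_paris["geometry"]=df_paris["geometry"].apply(lambda geom: geom["coordinates"][0])  # longitude
df_paris["fields"]=df_paris["fields"].apply(lambda station: station["nom_iv"])  # station name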

Let's regroup the stations in a common table with correct column names

In [8]:
df_paris.drop(columns=["datasetid"], inplace=True)
df_paris.rename(columns={"recordid":"City","fields": "Station", "geometry": "Longitude","record_timestamp": "Latitude"}, inplace=True)
df_london.rename(columns={"FID": "City","x": "Longitude", "y": "Latitude", "NAME": "Station"}, inplace=True)
df_london.drop(columns=['NORTHING', 'Zone', 'LINES', 'OBJECTID', 'EASTING', 'NETWORK'], inplace=True)
df_london["City"]="London"
df_paris["City"]="Paris"
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df=pd.concat([df_london, df_paris[df_london.columns]], ignore_index=True)

So here is what the merged data source looks like:

In [9]:
df

Unnamed: 0,City,Station,Longitude,Latitude
0,London,Temple,-0.112644,51.5105
1,London,Blackfriars,-0.10202,51.5111
2,London,Mansion House,-0.0924953,51.5113
3,London,Cannon Street,-0.088801,51.511
4,London,Monument,-0.0845023,51.5102
...,...,...,...,...
857,Paris,Voltaire,2.38072,48.8575
858,Paris,Wagram,2.30467,48.8838
859,Paris,Saint-Lazare,2.32426,48.8758
860,Paris,Trinité d'Estienne d'Orves,2.33254,48.8763


If all the stations had distinct names, we could have used "Station" alone as a primary key for the project, but one name appears in both cities...

In [10]:
#check how many stations in Paris and London have the same name:
res=df_paris.merge(df_london, on="Station", how="inner")
res

Unnamed: 0,City_x,Station,Longitude_x,Latitude_x,City_y,Longitude_y,Latitude_y
0,Paris,Temple,2.36154,48.8667,London,-0.112644,51.510474


Now that we have the stations, we can load the Foursquare credentials, define the getNearbyVenues function (borrowed from a previous lab) and get the venues around each station.

In [11]:
# the credentials file is closed automatically at the end of the with block
with open("/resources/IBM Capstone Project/Coursera_Capstone/credentials.txt","r") as f:
    lines=f.readlines()
API_id=lines[4][:-1]
API_secret=lines[7][:-1]
 

CLIENT_ID = API_id
CLIENT_SECRET = API_secret
VERSION = '20180605' # Foursquare API version
LIMIT = 500

def getNearbyVenues(stations_done_list, venues_list, names, cities, latitudes, longitudes, radius=300):
    
    print("Fetching NearbyVenues started!")
    for name, city, lat, lng in zip(names, cities, latitudes, longitudes):
###        print(name)
        if (city+name) not in stations_done_list:
            # create the API request URL
            url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                lat, 
                lng, 
                radius, 
                LIMIT)
                
            # make the GET request
            results = requests.get(url).json()["response"]['groups'][0]['items']
            # return only relevant information for each nearby venue
            venues_list.append([(
                city, 
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
            stations_done_list.append(city+name)        
    print("NearbyVenues are ready!")


    

<a id='item2'></a>


In [12]:
if 'venues_list' not in globals():
    venues_list=[]
    stations_done_list=[]
#force init:
#venues_list=[]
#stations_done_list=[]

In [13]:
#if you get an error, re-run this cell until it finishes without error: venues_list is kept between executions, so the data is loaded incrementally.
getNearbyVenues(stations_done_list, venues_list,names=df['Station'],cities=df['City'], 
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude'], radius=300
                                  )

Fetching NearbyVenues started!
NearbyVenues are ready!


In [14]:
nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
nearby_venues.columns = ['City', 'Station',
                  'Station Latitude', 
                  'Station Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

In [15]:
nearby_venues.head(100)

Unnamed: 0,City,Station,Station Latitude,Station Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,London,Temple,51.510474,-0.112644,Two Temple Place,51.511523,-0.112236,History Museum
1,London,Temple,51.510474,-0.112644,Temple Gardens,51.511154,-0.111472,Park
2,London,Temple,51.510474,-0.112644,The Southbank Observation Point,51.508297,-0.111180,Scenic Lookout
3,London,Temple,51.510474,-0.112644,HQS Wellington,51.510679,-0.112214,Boat or Ferry
4,London,Temple,51.510474,-0.112644,The Queen's Walk,51.508308,-0.110853,Scenic Lookout
...,...,...,...,...,...,...,...,...
95,London,Tower Hill,51.509434,-0.074914,Virgin Active,51.511804,-0.074589,Gym / Fitness Center
96,London,Aldgate,51.513982,-0.074236,Hotel Indigo,51.512740,-0.075920,Hotel
97,London,Aldgate,51.513982,-0.074236,Dorsett City London,51.514036,-0.075812,Hotel
98,London,Aldgate,51.513982,-0.074236,Treves & Hyde,51.514114,-0.070606,Restaurant


In [16]:
print(nearby_venues.shape)

(17309, 8)


### We now have a list of 17k venues!


# Data preparation

In [17]:

pd.set_option('display.max_rows', 100)
nearby_venues.loc[nearby_venues["Venue Category"]=="French Restaurant", "Venue Category"]="Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Brasserie", "Venue Category"]="Gastropub"
nearby_venues.loc[nearby_venues["Venue Category"]=="Gym / Fitness Center", "Venue Category"]="Gym"
nearby_venues.loc[nearby_venues["Venue Category"]=="Pub", "Venue Category"]="Bar"
nearby_venues.loc[nearby_venues["Venue Category"]=="Wine Bar", "Venue Category"]="Bar"
nearby_venues.loc[nearby_venues["Venue Category"]=="Bistro", "Venue Category"]="Bar"
nearby_venues.loc[nearby_venues["Venue Category"]=="Cocktail Bar", "Venue Category"]="Bar"
nearby_venues.loc[nearby_venues["Venue Category"]=="Coffee Shop", "Venue Category"]="Café"
nearby_venues.loc[nearby_venues["Venue Category"]=="Grocery Store", "Venue Category"]="Supermarket"
nearby_venues.loc[nearby_venues["Venue Category"]=="Pizza Place", "Venue Category"]="Italian Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Sushi Restaurant", "Venue Category"]="Japanese Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Creperie", "Venue Category"]="Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Chinese Restaurant", "Venue Category"]="Asian Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Vietnamese Restaurant", "Venue Category"]="Asian Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Thai Restaurant", "Venue Category"]="Asian Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Korean Restaurant", "Venue Category"]="Asian Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Cambodian Restaurant", "Venue Category"]="Asian Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Dim Sum Restaurant", "Venue Category"]="Asian Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Cantonese Restaurant", "Venue Category"]="Asian Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Szechuan Restaurant", "Venue Category"]="Asian Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Ramen Restaurant", "Venue Category"]="Asian Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Lebanese Restaurant", "Venue Category"]="Mediterranean Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Israeli Restaurant", "Venue Category"]="Mediterranean Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Corsican Restaurant", "Venue Category"]="Mediterranean Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Falafel Restaurant", "Venue Category"]="Mediterranean Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Greek Restaurant", "Venue Category"]="Mediterranean Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Turkish Restaurant", "Venue Category"]="Middle Eastern Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Moroccan Restaurant", "Venue Category"]="Middle Eastern Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Kebab Restaurant", "Venue Category"]="Middle Eastern Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Persian Restaurant", "Venue Category"]="Middle Eastern Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="English Restaurant", "Venue Category"]="Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Auvergne Restaurant", "Venue Category"]="Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Alsatian Restaurant", "Venue Category"]="Restaurant"
nearby_venues.loc[nearby_venues["Venue Category"]=="Tapas Restaurant", "Venue Category"]="Spanish Restaurant"
nearby_venues=nearby_venues[~nearby_venues["Venue Category"].str.contains('Station')] #drops bus, metro and gas stations
nearby_venues=nearby_venues[nearby_venues["Venue Category"]!="Platform"]
nearby_venues=nearby_venues[nearby_venues["Venue Category"]!="Bus Stop"]


nearby_venues.groupby(['Venue Category', 'City']).count().sort_values("Station", ascending=False).head(100)
#nearby_venues.groupby(['Venue Category', 'City']).count().sort_values("Station", ascending=False).to_csv("categories_to_merge")




Unnamed: 0_level_0,Unnamed: 1_level_0,Station,Station Latitude,Station Longitude,Venue,Venue Latitude,Venue Longitude
Venue Category,City,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Restaurant,Paris,1280,1280,1280,1280,1280,1280
Café,London,1081,1081,1081,1081,1081,1081
Bar,London,808,808,808,808,808,808
Bar,Paris,728,728,728,728,728,728
Hotel,Paris,664,664,664,664,664,664
Italian Restaurant,Paris,464,464,464,464,464,464
Asian Restaurant,Paris,412,412,412,412,412,412
Italian Restaurant,London,409,409,409,409,409,409
Supermarket,London,392,392,392,392,392,392
Café,Paris,352,352,352,352,352,352


In [18]:
venues=nearby_venues
venues.groupby(['Station','City']).count().head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Station Latitude,Station Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Station,City,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
ABBEY ROAD - DLR,London,1,1,1,1,1,1
ALL SAINTS - DLR,London,15,15,15,15,15,15
Abbesses,Paris,70,70,70,70,70,70
Acton Central,London,5,5,5,5,5,5
Acton Town,London,12,12,12,12,12,12


In [19]:
len(venues['Station'].unique())

752

In [20]:
# vs

len(df['Station'])

862

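<p style="color:red;">IMPORTANT: only 752 distinct station names have venues, against 862 rows in the station table. Part of this gap of 110 comes from stations that returned zero venues; the rest comes from station names that appear on several rows.</p>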
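
Comparing two counts does not say which stations are missing. A minimal sketch of a more direct check, on made-up data, is to compare sets of (City, Station) pairs; the variable names and the miniature tables below are assumptions for illustration.

```python
# Made-up miniature of the station table, one row per (City, Station);
# a station served by two lines can appear twice
all_stations = [
    ("London", "Temple"),
    ("Paris", "Temple"),
    ("Paris", "Voltaire"),
    ("Paris", "Voltaire"),
    ("London", "Abbey Road"),
]
# Pairs that have at least one venue
stations_with_venues = {("London", "Temple"), ("Paris", "Temple"), ("Paris", "Voltaire")}

unique_stations = set(all_stations)
without_venues = unique_stations - stations_with_venues
duplicate_rows = len(all_stations) - len(unique_stations)
```

Using the pair as the key keeps the two "Temple" stations apart and separates the rows that are mere duplicates from the stations that really have no venues.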

<a id='item3'></a>


Next, we analyze the typology of venues found around each station: we build a one-hot encoding, extract the top 5 most common venue categories around each station (to make the clustering results readable) and run the clustering algorithm.
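
Taking the mean of the one-hot rows of a station gives, for each category, its share among the venues of that station. Here is a minimal sketch of that idea in plain Python on made-up venues; the function `category_frequencies` is hypothetical, and the notebook does the same with `pd.get_dummies` followed by a grouped mean.

```python
from collections import Counter

def category_frequencies(venue_categories, all_categories):
    """Share of each category among one station's venues (the mean of its one-hot rows)."""
    counts = Counter(venue_categories)
    total = len(venue_categories)
    return [counts[c] / total for c in all_categories]

all_categories = ["Bar", "Hotel", "Park"]
station_venues = ["Park", "Bar", "Bar", "Park"]  # made-up venue categories for one station
vector = category_frequencies(station_venues, all_categories)
```

Each station becomes one vector of this kind, and these vectors are what k-means compares.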

In [26]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# one hot encoding
stations_onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
# add Station column back to dataframe
stations_onehot['City'] = venues['City'] 
stations_onehot['Station'] = venues['Station'] 

# move Station column to the first column
fixed_columns = [stations_onehot.columns[-1]] + list(stations_onehot.columns[:-1])
stations_onehot = stations_onehot[fixed_columns]
fixed_columns = [stations_onehot.columns[-1]] + list(stations_onehot.columns[:-1])
stations_onehot = stations_onehot[fixed_columns]

#stations_onehot.head()
#### Next, let's group rows by Station+City and by taking the mean of the frequency of occurrence of each category
stations_grouped = stations_onehot.groupby(['Station','City']).mean().reset_index()
print(stations_onehot.shape)
stations_grouped.head(3)

from statistics import mean
from sklearn.cluster import KMeans

import numpy as np
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)   
    return row_categories_sorted.index.values[0:num_top_venues]

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Station','City']

num_top_venues=5
kclusters_max = 14
random_seed=16
bestScore=1
#for num_top_venues in range(3,num_top_venues_max+1)

#First, let's write a function to sort the venues in descending order.
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Stations_venues_sorted = pd.DataFrame(columns=columns)
Stations_venues_sorted['Station'] = stations_grouped['Station']
Stations_venues_sorted['City'] = stations_grouped['City']


for ind in np.arange(stations_grouped.shape[0]):
    Stations_venues_sorted.iloc[ind, 2:] = return_most_common_venues(stations_grouped.iloc[ind, :], num_top_venues) 
    
    
Stations_venues_sorted.head(10)
#Keep only top common venues for k mean computing:

# remove venue category if not in top
print(stations_grouped.shape)
stations_grouped_idexRef=stations_grouped.copy()
# True for each column whose name appears in any of the "Most Common Venue" columns
isIn=stations_grouped.columns.isin(Stations_venues_sorted.iloc[:,-num_top_venues:].values.ravel())
for col_index, isCommonTop in enumerate(isIn):
#    print(stations_grouped_idexRef.columns[col_index])
    if not isCommonTop and col_index>1:
        stations_grouped.drop(stations_grouped_idexRef.columns[col_index], axis=1, inplace=True)
###        print(stations_grouped_idexRef.columns[col_index]+" was dropped")
print(stations_grouped.shape)


### 4. Cluster Stations
#### Run _k_-means to cluster the Station into X clusters.
for kclusters in range (3,kclusters_max+1):

    stations_grouped_clustering = stations_grouped.drop(columns=['Station','City'])

    # run k-means clustering
    kmeans = KMeans(n_clusters=kclusters, random_state=num_top_venues*kclusters*random_seed).fit(stations_grouped_clustering)

    # add clustering labels
    Stations_venues_sorted.insert(0, 'Cluster Labels'+'-'+str(kclusters), kmeans.labels_)

    # Evaluate the Paris / London balance for each cluster. Do it for the version top=x and K=Y first, then iterate.

    # collect the mean balance score: the mean of the cluster scores for this K-N
    #for each cluster
    clustScores=[]
    for clust in Stations_venues_sorted["Cluster Labels"+'-'+str(kclusters)].unique():
    #    compute sum of Paris / total and calculate distance to "50%" which is the best score possible
        clustScores.append(abs(0.50-Stations_venues_sorted[Stations_venues_sorted["Cluster Labels"+'-'+str(kclusters)]==clust].loc[Stations_venues_sorted["City"]=="Paris"].size /
                               Stations_venues_sorted[Stations_venues_sorted["Cluster Labels"+'-'+str(kclusters)]==clust].size))

    # keep this K-N if it is closer to 50% than the one saved so far
    print("Cluster Labels"+'-'+str(kclusters) + " cities-balance score: " + str(mean(clustScores)) + " inertia: "+ str(kmeans.inertia_))
    if mean(clustScores)<bestScore:
        bestScore=mean(clustScores)
        best_clustScores=clustScores
        bestLabel="Cluster Labels"+'-'+str(kclusters)
        best_kclusters_value=kclusters
        bestStations_venues_sorted=Stations_venues_sorted

(16583, 402)
(753, 402)
(753, 252)
Cluster Labels-3 cities-balance score: 0.3527849678841914 inertia: 95.07517380477258
Cluster Labels-4 cities-balance score: 0.37335087449906357 inertia: 91.0683894294145
Cluster Labels-5 cities-balance score: 0.35561662899858837 inertia: 87.73576312573124
Cluster Labels-6 cities-balance score: 0.3727151087170815 inertia: 84.19093036123198
Cluster Labels-7 cities-balance score: 0.3464779806761549 inertia: 82.783997429781
Cluster Labels-8 cities-balance score: 0.3738801729336637 inertia: 80.95399408242488
Cluster Labels-9 cities-balance score: 0.38261852670432134 inertia: 78.9028104728063
Cluster Labels-10 cities-balance score: 0.3918206617506518 inertia: 78.00923577763295
Cluster Labels-11 cities-balance score: 0.3550271922685412 inertia: 75.88376256667657
Cluster Labels-12 cities-balance score: 0.3450384532488241 inertia: 75.15501485480263
Cluster Labels-13 cities-balance score: 0.36741314005595643 inertia: 73.71867663017134
Cluster Labels-14 cities-b

In [27]:
print(best_clustScores)
print(best_kclusters_value)


[0.5, 0.43567251461988304, 0.43506493506493504, 0.3170731707317073, 0.05725190839694655, 0.33870967741935487, 0.25757575757575757, 0.17500000000000004, 0.16666666666666663, 0.5, 0.4574468085106383, 0.5]
12


<p style="color:red;">Saved interesting results and corresponding parameters:</p>
seed 16 : top=5 + cut
Cluster Labels-11 cities-balance score: 0.3225350616486323 inertia: 76.16409079368762

seed 14 : top=5 + cut    
Cluster Labels-10 cities-balance score: 0.33327133963241595 inertia: 78.31572410770997

seed 6 : top=5 + cut
Cluster Labels-10 cities-balance score: 0.3272758161898358 inertia: 78.09924142232632

seed 5 : top=5 + cut
Cluster Labels-9 cities-balance score: 0.31146976112435376 inertia: 79.35449856000717
    
seed 3 : top=5 + cut
Cluster Labels-8 cities-balance score: 0.33666339674171886 inertia: 81.02902898441722
    
seed 3 : top=5 no cut
Cluster Labels-12 cities-balance score: 0.3247983489825326 inertia: 75.99835332089556
    
seed 2 : top=5 + cut
Cluster Labels-14 cities-balance score: 0.33960713148509347 inertia: 73.12300157030559


In [28]:
# Keep only the labels column for the best number of clusters (12 in this run)
for k in range(3, kclusters_max+1):
    if k != best_kclusters_value:
        Stations_venues_sorted.drop(columns="Cluster Labels-"+str(k), inplace=True)

In [29]:
print(best_clustScores)
print(best_kclusters_value)
kclusters=best_kclusters_value
Stations_venues_sorted.rename(columns={"Cluster Labels-"+str(kclusters):"Cluster Labels"}, inplace=True)

[0.5, 0.43567251461988304, 0.43506493506493504, 0.3170731707317073, 0.05725190839694655, 0.33870967741935487, 0.25757575757575757, 0.17500000000000004, 0.16666666666666663, 0.5, 0.4574468085106383, 0.5]
12


In [31]:
Stations_venues_sorted.head(500).sort_values(["Cluster Labels","1st Most Common Venue","2nd Most Common Venue"]).head(30)

Unnamed: 0,Cluster Labels,Station,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
403,0,Maison Blanche,Paris,Asian Restaurant,Bakery,Gym,Park,Zoo Exhibit
54,0,Belleville,Paris,Asian Restaurant,Italian Restaurant,Vegetarian / Vegan Restaurant,Restaurant,Café
246,0,Gabriel Péri,Paris,Food & Drink Shop,Asian Restaurant,Playground,Zoo Exhibit,Farmers Market
93,0,Burnt Oak,London,Park,Asian Restaurant,Zoo Exhibit,Farmers Market,Empanada Restaurant
22,0,Arnos Grove,London,Park,Pool,Asian Restaurant,Beer Bar,Zoo Exhibit
166,0,Coombe Lane,London,Scenic Lookout,Asian Restaurant,Zoo Exhibit,Eastern European Restaurant,Embassy / Consulate
79,1,Bounds Green,London,Café,Bakery,Gourmet Shop,Bar,Noodle House
317,1,Hoxton,London,Café,Bakery,Record Shop,Bar,Spanish Restaurant
238,1,Forest Hill,London,Café,Bar,Gastropub,Gym,Bookstore
339,1,Kentish Town West,London,Café,Bar,Liquor Store,Asian Restaurant,Bookstore


In [32]:
for clusterID in range(0,kclusters):
    # to explore the resulting venues in each cluster, wrap the last expression in print()
    in_cluster=Stations_venues_sorted[Stations_venues_sorted["Cluster Labels"]==clusterID]
    # stack the 5 "Most Common Venue" columns and count each category
    pd.concat([in_cluster.iloc[:,-i] for i in range(1,6)]).value_counts()


In [72]:
#stations_merged[stations_merged['Cluster Labels']==np.nan].head()
# merge stations_grouped with stations_data to add latitude/longitude for each Station
stations_merged = df.join(Stations_venues_sorted.set_index(['Station','City']), on=['Station','City'])


stations_merged.head()

Unnamed: 0,City,Station,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,London,Temple,-0.112644,51.5105,11.0,Bar,Scenic Lookout,Park,Building,Boat or Ferry
1,London,Blackfriars,-0.10202,51.5111,1.0,Café,Mediterranean Restaurant,South American Restaurant,Bar,Modern European Restaurant
2,London,Mansion House,-0.0924953,51.5113,8.0,Bar,Asian Restaurant,Restaurant,Café,Gym
3,London,Cannon Street,-0.088801,51.511,2.0,Seafood Restaurant,Gym,Monument / Landmark,Trail,Pedestrian Plaza
4,London,Monument,-0.0845023,51.5102,2.0,Gym,Scenic Lookout,Middle Eastern Restaurant,South American Restaurant,Sandwich Place


Some stations have no venues. Since the join above brought them back, their cluster label is "nan". Let's put them in one last group, so that we now have kclusters+1 clusters.

In [73]:
stations_merged.replace({'Cluster Labels': np.nan},kclusters, inplace=True)
stations_merged=stations_merged.astype({'Cluster Labels': 'int32'}, copy=False)
kclusters=kclusters+1

In [74]:
stations_merged['Cluster explicit']=""

for clust in stations_merged["Cluster Labels"].unique():
    print("\nCluster n°"+str(clust)+": "+", ".join(stations_merged.loc[(stations_merged["Cluster Labels"]==clust)&(stations_merged["City"]=="Paris")]["Station"]))
    stations_merged.loc[(stations_merged["Cluster Labels"]==clust)&(stations_merged["City"]=="London"), ['Cluster explicit']] = "Cluster n°"+str(clust)+": "+", ".join(stations_merged.loc[(stations_merged["Cluster Labels"]==clust)&(stations_merged["City"]=="Paris")]["Station"])




Cluster n°11: Marcadet-Poissonniers, Pré Saint-Gervais, Réaumur-Sébastopol, Bonne Nouvelle, Jules Joffrin, Marcadet-Poissonniers, Simplon, Rue Saint-Maur, Bonne Nouvelle, Couronnes, Château Rouge, Ménilmontant, La Fourche, Pelleport, Philippe Auguste, Ledru-Rollin, Parmentier, Réaumur-Sébastopol

Cluster n°1: Saint-François-Xavier, Saint-Denis-Université

Cluster n°8: Colonel Fabien, Saint-Sébastien Froissart, Bibliothèque François Mitterrand, Chevaleret, Glacière, Havre-Caumartin, Riquet, Le Kremlin-Bicêtre, Palais Royal-Musée du Louvre, Palais Royal-Musée du Louvre, Porte de Vanves, Havre-Caumartin, Maraîchers

Cluster n°2: Créteil-L'Échat, Gare d'Austerlitz, Commerce, Hôtel de Ville, Esplanade de la Défense, Louis Blanc, Mairie de Montreuil, Place de Clichy, Porte de Charenton, Télégraphe, Porte de Pantin, Porte Dorée, République, République, Saint-Fargeau, Barbès-Rochechouart, Cour Saint-Émilion, Croix de Chavaux, Château d'Eau, Gambetta, Félix Faure, Jaurès, Malesherbes, Marcel S

In [75]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if needed
import folium # map rendering library
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map zoomed on London
map_clusters = folium.Map(location=[51.509865,-0.118092], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster, expl in zip(stations_merged['Latitude'], stations_merged['Longitude'], stations_merged['Station'], stations_merged['Cluster Labels'], stations_merged['Cluster explicit']):
    label = folium.Popup(str(str(poi).encode('raw_unicode_escape'))[2:-1] + " - " + str(expl.encode('raw_unicode_escape'))[2:-1], parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters