![INSA](https://gi.insa-lyon.fr/sites/all/themes/insa_satellites/logo.png)

# GI-5-DSC - Data Science: Clustering
***

The objective of this part of the tutorial is to continue the analysis of the Vélo'v data and to experiment with artificial intelligence methods for data clustering.



## 1. Set up the environment: import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import folium
import plotly
import plotly.express as px
import geopandas

import seaborn as sn

import sklearn.cluster

## 2. Getting the data

First, the datasets must be loaded: one contains the locations of the stations, the other the usage history.
The usage history was modified during the previous session. To avoid redoing all the processing, you can retrieve the `data-bikes-2.zip` archive directly.


All the data used in this tutorial is available on the [git repository](https://github.com/ludovicmoncla/insa-5gi-dsc-tutorials/tree/main/data) and on [Moodle](https://moodle.insa-lyon.fr/course/view.php?id=4628). 


* Download the datasets
1. data-stations.zip
2. data-bikes-2.zip


### 2.1. Loading the data

As last time, to load the data you just have to use the [read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv) function from the `pandas` library.
It takes as a parameter the path of the file you want to load. This file can be either a CSV file or a ZIP archive containing a single CSV file. In our case, it is therefore unnecessary to unzip the previously downloaded archives.


In [None]:
## We load the data from the stations into a dataframe
df_stations = pd.read_csv('data/data-stations.zip')

## We now load the dataframe with the history data
df_bikes = pd.read_csv('data/data-bikes-2.zip')
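
To illustrate how `read_csv()` reads a zipped CSV, here is a minimal sketch on toy data. The archive name `toy-stations.zip` and its columns are made up for the example and are not part of the tutorial data.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny ZIP archive holding a single CSV (toy data, hypothetical names)
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "toy-stations.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("toy-stations.csv", "id,name\n1,A\n2,B\n")

# read_csv infers the compression from the ".zip" extension
df_toy = pd.read_csv(zip_path)
print(df_toy)
```

The archive must contain exactly one data file for this to work.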

In [None]:
## Display the first rows
df_stations.head()

In [None]:
## Display the first rows
df_bikes.head()

In [None]:
# Reduce the size in memory
df_bikes['time'] = pd.to_datetime(df_bikes['time']) 
df_bikes[['year', 'daily_departure', 'daily_arrival']] = df_bikes[['year', 'daily_departure', 'daily_arrival']].astype('int16')
df_bikes[['month','day','hour','minute', 'bikes', 'bike_stands', 'departure30min','arrival30min']] = df_bikes[['month','day','hour','minute', 'bikes', 'bike_stands', 'departure30min','arrival30min']].astype('int8')
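
The effect of this kind of downcasting can be measured with `memory_usage()`. The sketch below uses a toy dataframe (the values are invented) and checks that the values fit in the `int8` range (-128 to 127) before converting, since larger values cannot be represented in that type.

```python
import numpy as np
import pandas as pd

# Toy frame stored as 64-bit integers (8 bytes per value)
df_toy = pd.DataFrame({"hour": np.arange(24), "bikes": np.arange(24) % 20})
df_toy = df_toy.astype("int64")
before = df_toy.memory_usage(index=False).sum()

# Check the range first: int8 only holds values from -128 to 127
assert df_toy.max().max() <= np.iinfo("int8").max
df_small = df_toy.astype("int8")
after = df_small.memory_usage(index=False).sum()

print(before, after)
```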

## 3. Clustering

Our objective in this part is to identify groups of "similar" stations. To do so, we will apply unsupervised learning methods: clustering.
The objective is not to group the stations by spatial proximity but by a similarity calculated from the historical data of the use of the stations (departures and arrivals).

### 3.1 Preparing the data

First we have to make sure that all stations are comparable. We therefore want to know whether they all have the same amount of data. Missing data can be due to bugs in the data collection process, or to stations that closed during the year.

In order to check if all stations have the same amount of data, we can group the dataframe rows according to the station name and then display the size of each group.


In [None]:
# We group the dataframe rows by station
g = *****

# Display the size of each group
print(g.size())

Not all stations are displayed, but we can already see that one station has less data than the others. We need to determine how many stations are in this situation so that we can remove them.
Use the plot() method to display a graph of the size (number of rows) of each station.

In [None]:
# Display the size of each group on a graph
*****

We notice that 6 stations have clearly less data than the others. For the rest of the processing we will remove these 6 stations from the dataset.

Identify the name of these 6 stations by using the functions count() and [nsmallest()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nsmallest.html).

In [None]:
# The list of the 6 stations with the least data is displayed
list_to_drop = *****
list_to_drop
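
As a hint, the pattern of counting rows per group and picking the smallest groups can be sketched on toy data. This is only an illustration under assumptions: the toy values are invented, and grouping on `id_velov` is one possible choice of station key.

```python
import pandas as pd

# Toy history: stations "a" and "b" have 3 rows each, "c" has only 1
df_toy = pd.DataFrame({
    "id_velov": ["a", "a", "a", "b", "b", "b", "c"],
    "bikes": [3, 4, 5, 1, 2, 0, 7],
})

# Number of rows per station
sizes = df_toy.groupby("id_velov").size()

# Keep the n stations with the fewest rows (n=1 here)
fewest = sizes.nsmallest(1)
print(fewest.index.tolist())
```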

In [None]:
# We check the size of the dataframe before deleting the rows of the stations concerned
df_bikes.shape

In [None]:
# Delete the 6 stations from the df_bikes dataframe

df_bikes.drop(df_bikes.loc[*****].index, inplace=True)


In [None]:
# We check the size of the dataframe after deletion
df_bikes.shape
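
Removing all rows that belong to a list of stations can also be written with a boolean mask. The sketch below is one possible approach on toy data, not necessarily the one expected in the exercise; the list `to_drop` is hypothetical.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "id_velov": ["a", "a", "b", "c", "c"],
    "bikes": [3, 4, 1, 7, 2],
})
to_drop = ["c"]  # hypothetical list of stations to remove

# isin() marks the rows of the listed stations; ~ inverts the mask
df_kept = df_toy[~df_toy["id_velov"].isin(to_drop)]
print(df_toy.shape, df_kept.shape)
```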

In order to be able to group the stations by similarity, we will add some variables (columns). In particular, we will be interested in the number of departures (and arrivals) normalized by day of the week. 

1. Use the function [to_datetime()](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) to convert the `time` column to a datetime type.
2. Then create a new column `day_of_week` using the method [day_name()](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.day_name.html).



In [None]:
# creates a new column with the name of the day of the week
df_bikes['day_of_week'] = df_bikes['time'].dt.day_name()

In [None]:
df_bikes.head()
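
For reference, here is what `day_name()` returns on two hand-picked timestamps (toy values, not taken from the dataset):

```python
import pandas as pd

# Two toy timestamps: 1 January 2024 and 6 January 2024
times = pd.Series(pd.to_datetime(["2024-01-01 08:00", "2024-01-06 17:30"]))

# The .dt accessor exposes datetime methods on a whole column
days = times.dt.day_name()
print(days.tolist())
```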

Now we want to add columns with the mean daily values of departures and arrivals.

In [None]:
# We calculate the average arrivals and departures (daily) per station and per day of the week
arrivals = *****
departures = *****


In [None]:
arrivals = arrivals.unstack(level=1) # pivot the second index level (day of week) into columns
arrivals = arrivals.fillna(0) # replace missing values (NaN) with 0
arrivals

In [None]:
departures = departures.unstack(level=1) # pivot the second index level (day of week) into columns
departures = departures.fillna(0) # replace missing values (NaN) with 0
departures
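
The combination of a two-level `groupby()`, `mean()` and `unstack()` can be sketched on toy data as follows. The values are invented, and the exact aggregation expected in the exercise may differ.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "id_velov": ["a", "a", "a", "b"],
    "day_of_week": ["Monday", "Monday", "Tuesday", "Monday"],
    "daily_departure": [10, 20, 6, 4],
})

# One mean per (station, day of week) pair: a Series with a 2-level index
means = df_toy.groupby(["id_velov", "day_of_week"])["daily_departure"].mean()

# Move the day level into columns; station "b" has no Tuesday, hence the fill
wide = means.unstack(level=1).fillna(0)
print(wide)
```

Each station becomes one row with one column per day of the week, which is the shape a clustering algorithm expects.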

In [None]:
# We combine these two datasets into a single one that will serve as a training set for the clustering algorithm

df_data = departures.merge(arrivals, how='inner', on=['id_velov'])
df_data = df_data.fillna(0)
df_data.head()

In [None]:
# We check if there are infinite values (not compatible with the clustering algo)
np.isinf(df_data.to_numpy()).any()

In [None]:
# We replace these values by the value 1
df_data.replace([np.inf, -np.inf], 1, inplace=True)
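
Note that `np.isfinite` and `np.isinf` answer opposite questions. A small sketch on toy data (invented values) shows the difference, along with the replacement step:

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({"x": [1.0, np.inf], "y": [2.0, 3.0]})

# True only if every value is finite
all_finite = np.isfinite(df_toy.to_numpy()).all()

# True as soon as one value is +inf or -inf
has_inf = np.isinf(df_toy.to_numpy()).any()

# Same replacement as in the tutorial, without modifying df_toy in place
cleaned = df_toy.replace([np.inf, -np.inf], 1)
print(all_finite, has_inf)
```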

In [None]:
df_data.head()

### 3.2 Training and use of the clustering model

Most machine learning methods are already implemented in the [Scikit-Learn](https://scikit-learn.org/stable/) library. This library includes a large number of unsupervised (clustering) and supervised (classification and regression) learning algorithms.


The [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) algorithm is a widely used clustering algorithm. Its principle is simple: group the data into k homogeneous and compact clusters. To do so, the algorithm relies on the distance between each data point and the centroid of each cluster: every point is assigned to its nearest centroid, and each centroid is then recomputed as the mean of the points assigned to it. These two steps are repeated until the assignments no longer change.
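
The principle can be sketched in a few lines of NumPy. The version below is the batch variant (often called Lloyd's algorithm), where all points are assigned before the centroids are recomputed. It is a simplified illustration, not the Scikit-Learn implementation: the function name `kmeans_sketch`, the fixed number of iterations and the choice of the first k rows as initial centroids are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=10):
    """Batch k-means: alternate assignment and centroid update."""
    # Deterministic start for the demo: the first k rows serve as centroids
    centroids = X[:k].astype(float).copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assignment step: index of the nearest centroid for each point
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```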

Our goal here is to determine if the stations can be grouped into 2 distinct clusters based on the similarity of their usage history.


In [None]:
# model declaration
model = *****

# training the model
*****

# use of the model to associate a cluster number to each row of the dataset
df_data["cluster"] = *****

In [None]:
df_data.head()
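
For reference, the usual declare, fit and predict sequence of Scikit-Learn looks like this on a toy array. The data and the parameter values (`n_init=10`, `random_state=0`) are choices made for the example, not values prescribed by the tutorial.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups of three points each
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# Declare the model, train it, then assign a cluster number to each row
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X)
labels = model.predict(X)
print(labels)
```

The cluster numbers themselves are arbitrary: only the grouping matters, not which group is called 0 or 1.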

In order to display these clusters on a map we need to add the latitude/longitude coordinates of the stations to our dataframe.

In [None]:
df_data = *****
df_data.head()
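
Attaching coordinates to the clustered stations is a join between two dataframes. The sketch below shows one way to do it with `merge()`; the column names `station_id`, `lat` and `lon` are hypothetical and must be adapted to the actual columns of `df_stations`.

```python
import pandas as pd

# Toy clustering result, indexed by station identifier
clusters = pd.DataFrame({"cluster": [0, 1]},
                        index=pd.Index(["a", "b"], name="id_velov"))

# Toy station table (hypothetical column names)
stations = pd.DataFrame({
    "station_id": ["a", "b", "c"],
    "lat": [45.76, 45.78, 45.75],
    "lon": [4.83, 4.87, 4.85],
})

# Left join: keep every clustered station and attach its coordinates
merged = clusters.reset_index().merge(
    stations, how="left", left_on="id_velov", right_on="station_id"
)
print(merged)
```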

### 3.3 Cluster mapping

In [None]:
gdf_stations = *****
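
A GeoDataFrame can be built from latitude/longitude columns with `points_from_xy()`. This is a minimal sketch on toy data: the column names `lat` and `lon` and the choice of the WGS 84 coordinate system (`EPSG:4326`) are assumptions for the example.

```python
import geopandas
import pandas as pd

df_toy = pd.DataFrame({
    "lat": [45.76, 45.78],
    "lon": [4.83, 4.87],
    "cluster": [0, 1],
})

# x is the longitude and y the latitude
gdf_toy = geopandas.GeoDataFrame(
    df_toy,
    geometry=geopandas.points_from_xy(df_toy["lon"], df_toy["lat"]),
    crs="EPSG:4326",
)
print(gdf_toy.head())
```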

In [None]:
## The geodataframe data is displayed directly on a map
## with the scatter_mapbox() function of the plotly.express library:
fig = *****

## We remove the margins around the map
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.update(layout_coloraxis_showscale=False) # remove the colorbar

## Display the map
fig.show()

We observe that the 2 clusters are also geographically distinct: one is located in the center and the other on the periphery.