# Analyzing the clusters
In the previous notebooks, we cleaned the data, extracted the weekday behaviour of each station, added geolocation data and finally grouped the stations into clusters. Now it is time to understand why the model formed these clusters and what the stations in each cluster have in common.

In [14]:
# Import the required libraries
import pandas as pd
import plotly.graph_objs as go
import plotly.express as px

In [4]:
clusters = pd.read_csv("..\\Dataset\\clusters.csv",encoding="utf_8",decimal=',', sep=';')

In [5]:
clusters.head()

Unnamed: 0,station_id,00:00,01:00,02:00,03:00,04:00,05:00,06:00,07:00,08:00,...,19:00,20:00,21:00,22:00,23:00,capacity,lat,long,DisttoCentre,labels
0,1,23.45,23.83,23.71,23.85,23.66,20.89,11.78,4.01,5.03,...,12.7,15.79,17.61,20.16,21.96,30,41.397952,2.180042,1.47,0
1,2,13.52,14.1,13.8,13.48,13.4,11.88,7.34,3.51,3.19,...,7.28,8.78,10.33,12.42,12.66,27,41.39553,2.17706,1.11,4
2,3,16.59,16.69,16.76,16.66,16.36,16.76,14.39,7.89,4.04,...,18.42,16.54,16.12,16.38,17.25,27,41.394055,2.181299,1.22,0
3,4,9.78,9.74,9.94,10.39,11.53,14.07,10.06,4.9,2.04,...,13.53,11.53,9.66,9.61,10.53,21,41.39348,2.181555,1.2,4
4,5,24.42,24.39,24.24,23.87,24.26,23.06,18.66,9.33,8.24,...,27.05,27.04,27.42,26.78,24.82,39,41.391075,2.180223,0.96,0


In [6]:
clusters.labels.value_counts()

1    117
4     88
2     69
0     69
3     67
Name: labels, dtype: int64

There are 5 clusters. Four of them are of similar size (67 to 88 stations) and one is larger, with 117 stations. Let's go through them one by one to understand what their stations have in common.

# Cluster 0
#### Summary drawn from the cells below:
* 69 stations
* Distance to the city centre is not relevant: stations lie between 0.62 km and 5.69 km from it

In [11]:
clusters[clusters.labels == 0].describe()

Unnamed: 0,station_id,00:00,01:00,02:00,03:00,04:00,05:00,06:00,07:00,08:00,...,19:00,20:00,21:00,22:00,23:00,capacity,lat,long,DisttoCentre,labels
count,69.0,69.0,69.0,69.0,69.0,69.0,69.0,69.0,69.0,69.0,...,69.0,69.0,69.0,69.0,69.0,69.0,69.0,69.0,69.0,69.0
mean,228.913043,22.898551,22.892319,22.801014,22.54058,21.854348,18.770725,12.166522,7.243913,6.61029,...,20.876087,21.391159,21.958261,22.631739,23.234928,27.84058,41.397026,2.177089,2.752609,0.0
std,125.204508,3.206499,3.260944,3.213845,3.363268,3.539526,4.002057,5.239381,5.274747,4.958315,...,3.639608,3.603538,3.472487,3.346404,3.220785,3.856329,0.020599,0.020316,1.429292,0.0
min,1.0,16.43,15.66,15.58,14.49,11.79,5.45,1.19,0.76,1.13,...,12.7,14.96,14.93,15.54,16.83,19.0,41.360654,2.129847,0.62,0.0
25%,132.0,21.0,21.14,21.1,21.06,20.18,16.12,8.24,3.19,2.7,...,18.29,18.51,19.57,20.41,21.28,27.0,41.377635,2.163404,1.43,0.0
50%,238.0,23.45,23.61,23.13,23.03,22.51,19.23,11.06,5.44,5.06,...,21.12,22.04,22.48,22.77,23.73,27.0,41.396623,2.181299,2.87,0.0
75%,346.0,24.78,24.79,24.88,24.66,24.12,21.84,16.82,11.73,9.14,...,23.62,23.98,24.17,24.9,24.94,30.0,41.416018,2.19115,3.85,0.0
max,427.0,32.21,31.49,29.66,29.07,28.13,25.34,23.58,23.54,23.28,...,29.46,31.9,32.65,33.41,33.03,39.0,41.43608,2.212658,5.69,0.0
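
The wide `DisttoCentre` range within cluster 0 can be put in context by looking at the same spread for every cluster at once. The sketch below is only an illustration: `toy` is a hypothetical stand-in for the `clusters` DataFrame, with made-up rows, and `dist_by_cluster` is a name introduced here.

```python
import pandas as pd

# Hypothetical stand-in for `clusters`; the rows are invented for illustration.
toy = pd.DataFrame({
    "labels":       [0, 0, 0, 1, 1, 1],
    "DisttoCentre": [0.62, 2.87, 5.69, 1.0, 2.0, 3.0],
})

# Minimum, median and maximum distance to the centre within each cluster.
dist_by_cluster = toy.groupby("labels")["DisttoCentre"].agg(["min", "median", "max"])
print(dist_by_cluster)
```

If the ranges of the real clusters overlap heavily, that would support the reading that distance to the centre does not separate them.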


In [13]:
clu_means = clusters.groupby('labels').mean()

In [16]:
clu_means.drop(columns=['station_id','capacity','lat','long','DisttoCentre'],inplace=True)

In [19]:
clu_means = clu_means.transpose()

In [20]:
clu_means.columns

Int64Index([0, 1, 2, 3, 4], dtype='int64', name='labels')
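
The group, drop and transpose steps above could also be written as one chain that keeps the hourly columns by name pattern instead of listing the columns to drop. This is a sketch under assumptions: `toy` is a hypothetical miniature of `clusters` with only two hourly columns, and `hourly_means` is a name introduced here.

```python
import pandas as pd

# Hypothetical miniature of `clusters`; values are invented for illustration.
toy = pd.DataFrame({
    "station_id": [1, 2, 3, 4],
    "00:00":      [20.0, 22.0, 2.0, 4.0],
    "08:00":      [4.0, 6.0, 10.0, 12.0],
    "capacity":   [30, 27, 27, 21],
    "labels":     [0, 0, 1, 1],
})

hourly_means = (
    toy.groupby("labels")
       .mean()
       .filter(regex=r"^\d{2}:\d{2}$")  # keep only columns named like HH:MM
       .transpose()                     # hours become rows, clusters become columns
)
print(hourly_means)
```

Selecting by pattern means a column added to the CSV later is left out without having to update the drop list.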

In [43]:
# One colour per cluster label
cluster_colors = {
    0: 'rgb(51,96,140)',
    1: 'rgb(195,106,156)',
    2: 'rgb(246,186,87)',
    3: 'rgb(228,96,72)',
    4: 'rgb(184,24,64)',
}

# One line per cluster showing its mean number of bikes at each hour
fig = go.Figure(
    data=[
        go.Scatter(x=clu_means.index,
                   y=clu_means[label],
                   mode='lines',
                   name=f'Cluster {label}',
                   line=dict(width=6, color=color))
        for label, color in cluster_colors.items()
    ],
    layout=go.Layout(
        title=dict(text='Clusters Workdays hourly behaviour'),
        xaxis=dict(title='Hours'),
        yaxis=dict(title='Number of bikes available')))


# Vertical black lines marking the rush hours
rush_hour_marks = [
    ('06:00', 'Morning rush hour start'),
    ('09:00', 'Morning rush hour finish'),
    ('17:00', 'Evening rush hour'),
]
for hour, name in rush_hour_marks:
    fig.add_trace(go.Scatter(x=[hour, hour], y=[0, 23], mode='lines', name=name,
                             line=dict(width=3, color='rgb(0,0,0)')))
fig.show()

Observations from the graph:
* Cluster 0: Stations behave like residential locations: they are full at night, empty out during the morning rush hour and then slowly fill up again over the rest of the day.
* Cluster 1: Stations tend to be empty all day.
* Cluster 2: Stations behave like work or educational locations: they fill up during the morning rush hour and then slowly empty towards the end of the day.
* Cluster 3: Stations tend to be half full for most of the day.
* Cluster 4: Stations behave similarly to those in cluster 0, but with fewer bikes and a less pronounced swing (hypothesis: maybe they are located in higher parts of the city than cluster 0?).
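
Since the graph shows absolute bike counts, part of the gap between clusters 0 and 4 could simply reflect station size. One way to check is to divide each hourly value by the station's `capacity` and compare the morning drop per cluster. The sketch below assumes `capacity` is the number of docks at a station; `toy` is a hypothetical stand-in for `clusters` with invented values, and `fill`, `fill_means` and `morning_drop` are names introduced here.

```python
import pandas as pd

# Hypothetical stand-in for `clusters`; values are invented for illustration.
toy = pd.DataFrame({
    "labels":   [0, 0, 4, 4],
    "05:00":    [24.0, 18.0, 12.0, 9.0],
    "08:00":    [6.0, 3.0, 6.0, 6.0],
    "capacity": [30, 30, 24, 18],
})

hour_cols = ["05:00", "08:00"]

# Fraction of each station's docks holding a bike at each hour
# (assumes `capacity` is the number of docks).
fill = toy[hour_cols].div(toy["capacity"], axis=0)
fill["labels"] = toy["labels"]

# Mean fill ratio per cluster, then the drop between 05:00 and 08:00.
fill_means = fill.groupby("labels").mean()
morning_drop = fill_means["05:00"] - fill_means["08:00"]
print(morning_drop)
```

On the real data, a clearly smaller drop for cluster 4 than for cluster 0 after this normalisation would suggest the difference is in behaviour rather than in station size.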
