# Data Analysis - Velib Project in [Python](https://www.python.org/) <a href="https://www.python.org/"><img src="https://s3.dualstack.us-east-2.amazonaws.com/pythondotorg-assets/media/community/logos/python-logo-only.png" style="max-width: 35px; display: inline" alt="Python"/></a>&nbsp;

---
_Authors:_ Amine Aziz Alaoui (<small>IRT St-Exupéry</small>), J. Chevallier (<small>INSA Toulouse</small>), J. Guérin (<small>ANITI</small>), Franck Kouassi (<small>INSA Toulouse</small>), O. Roustant (<small>INSA Toulouse</small>).

We consider the [velib](https://www.velib-metropole.fr/donnees-open-data-gbfs-du-service-velib-metropole) data set, which describes the bike sharing system of Paris. The data are loading profiles of the bike stations over one week, collected every hour during the period Monday 2nd Sept. - Sunday 7th Sept., 2014. The loading profile of a station, or simply its loading, is defined as the number of available bikes divided by the number of bike docks. A loading of 1 means that the station is fully loaded, i.e. every dock holds an available bike. A loading of 0 means that the station is empty, i.e. all bikes have been rented.

From the viewpoint of data analysis, the individuals are the stations. The variables are the 168 time steps (hours in the week). **The aim is to detect clusters in the data, corresponding to common customer usages.** This clustering should then be used to predict the loading profile.

---

The aim of this tutorial is to provide you with a _starting point for your project_.
Unsurprisingly, the first step is to get to grips with the dataset by exploring it through simple routines:
- How are the data coded? 
- How many stations are observed? 
- What is the dispersion of the data? 
- _etc._

You will find some suggested solutions in the "solutions" folder (we can certainly do better). _I can only urge you to first try to answer the questions yourself_, making sure you know which graph to use to answer the question, and then to look in the Python documentation to find out how to make a particular graph (there are lots of resources on the Internet for Python!). The counterpart to this tutorial, but in [R](https://plmlab.math.cnrs.fr/wikistat/Exploration/-/blob/master/Velib/TP_velib_R.ipynb), is also available on wikistat.

In [None]:
import pandas as pd
import numpy as np
import random as rd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

%matplotlib inline

## Preliminary: Load Data and Quality Assessment

##### <span style="color:purple"> **Todo:** Load the data</span>

- `velibLoading.csv` file.
- `velibCoord.csv` file
- Check that loading has gone smoothly by looking at the first rows of each data set.
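
As a hedged illustration of this step, here is a minimal sketch that reads a CSV and inspects the result. It uses an in-memory stand-in rather than the real files, because their path, separator and index column are not stated here; the names `csv_text` and `demo` and the column labels are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the real files may use
# a different separator, header or index layout.
csv_text = """station,h0,h1,h2
A,0.10,0.20,0.30
B,0.90,0.80,0.70
"""

# index_col=0 uses the first column as row labels (an assumption here).
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(demo.shape)   # (number of rows, number of columns)
print(demo.head())  # first rows, to check that parsing went as expected
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.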

In [None]:
### TO BE COMPLETED ### 
    
loading = ...

In [None]:
# %load solutions/Python/load_loading.py

In [None]:
### TO BE COMPLETED ### 
    
coord = ...

In [None]:
# %load solutions/Python/load_coord.py

##### <span style="color:purple"> **Question:** Do these data sets contain missing data?</span>
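
One possible way to check for missing values, sketched on a small hypothetical data frame (`toy` is not part of the project data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries.
toy = pd.DataFrame({"h0": [0.1, np.nan, 0.5], "h1": [0.2, 0.4, np.nan]})

# isna() gives a boolean frame; summing counts the True values per column.
n_missing_per_column = toy.isna().sum()
n_missing_total = int(n_missing_per_column.sum())
has_missing = bool(toy.isna().any().any())

print(n_missing_per_column)
print("Total missing values:", n_missing_total)
```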

In [None]:
### TO BE COMPLETED ### 

[...]

In [None]:
# %load solutions/Python/missing_value.py

##### <span style="color:purple"> **Question:** Do these data sets contain duplicated data?</span>
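
A minimal sketch of detecting duplicated rows with `DataFrame.duplicated()`, again on hypothetical toy data. Note that this method compares column values only, so two rows with different index labels but identical values count as duplicates.

```python
import pandas as pd

# Hypothetical toy data: rows "A" and "A_bis" hold identical values.
toy = pd.DataFrame(
    {"h0": [0.1, 0.1, 0.5], "h1": [0.2, 0.2, 0.3]},
    index=["A", "A_bis", "B"],
)

# True for every row that repeats an earlier row (first occurrence is False).
dup_mask = toy.duplicated()
n_dup = int(dup_mask.sum())

print(dup_mask)
print("Number of duplicated rows:", n_dup)
```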

In [None]:
### TO BE COMPLETED ### 

[...]

In [None]:
# %load solutions/Python/duplicated.py

##### <span style="color:purple"> **Question:** Are any stations present more than once in the data set?</span>

- You can use the [`value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) method to count the number of occurrences of each station name.
- Discuss this result in the light of the previous question. If the answer is yes, we could, for example, try to visualize the different entries for the same station.
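
To illustrate how `value_counts()` behaves, here is a small sketch on a hypothetical series of station names (the names are made up for the example):

```python
import pandas as pd

# Hypothetical station names, one of which appears twice.
names = pd.Series(["Bastille", "Nation", "Bastille", "Opera"], name="names")

# value_counts() returns the counts sorted in descending order.
counts = names.value_counts()

# Keep only the names that occur more than once.
repeated = counts[counts > 1]

print(counts)
print(repeated)
```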



In [None]:
### TO BE COMPLETED ### 

# Stations in descending order of occurrence
station_name = ...
print(station_name)

[...]

In [None]:
# %load solutions/Python/station_name.py

## First Insights into the Dataset

##### <span style="color:purple"> **Todo:** Plot the loading of a station</span>

- Plot the load evolution of the $i$-th station over time;
- Draw vertical lines to delimit the days (_**Hint:** How many days do we observe?_);
- Enter the station name in the figure title;
- Label the axes in the figure.
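
The plotting pattern asked for here can be sketched on a synthetic profile. Everything below (the sinusoidal `profile`, the title) is invented for illustration and assumes hourly observations starting at midnight of the first day.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

n_steps = 7 * 24             # one week of hourly observations
time = np.arange(n_steps)    # hour index since the start of the week

# Synthetic loading profile clipped to [0, 1]; purely illustrative.
noise = 0.05 * rng.standard_normal(n_steps)
profile = np.clip(0.5 + 0.3 * np.sin(2 * np.pi * time / 24) + noise, 0, 1)

# Hour index of each midnight, including the end of the last day.
day_starts = np.arange(0, n_steps + 1, 24)

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(time, profile, linewidth=1)
ax.vlines(day_starts, ymin=0, ymax=1, colors="gray", linestyles="dotted")
ax.set_xlabel("Time (hours since start of week)")
ax.set_ylabel("Loading")
ax.set_title("Hypothetical station")
```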

In [None]:
### TO BE COMPLETED ### 

i = ...
loading_data = loading.to_numpy()

n_steps = ...  # number of observed time steps
time    = ...  # observed time range

# --- #

plt.figure(figsize = (20, 6))

plt.plot(...)
plt.vlines(...)

[...]

In [None]:
# %load solutions/Python/plot_loading.py

> Comments?

##### <span style="color:purple"> **Question:** Does loading differ from one station to another?</span>

Draw a $4 \times 4$ matrix of plots corresponding to 16 stations of your choice. _Do not forget the vertical lines that delimit the days._

In [None]:
### TO BE COMPLETED ### 

[...]

In [None]:
# %load solutions/Python/plot_loading_16.py

> Comments?

##### <span style="color:purple"> **Todo:** Draw the boxplot of the variables, sorted in time order.</span>

1. What can you say about the distribution of the variables? 
2. Position, dispersion, symmetry? 
3. Can you see a difference between days?

_Hint:_ To change the graphical properties of boxplots, pass property dictionaries such as `medianprops` (for example, to set the thickness of the median) to the `plt.boxplot` function. The [`patch_artist = True`](https://python-charts.com/distribution/box-plot-matplotlib/) argument additionally lets you fill the boxes with a color.
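
As a hedged sketch of these styling options, the example below draws one box per column of a random array. The data and the chosen colors are arbitrary and only show how `patch_artist`, `boxprops` and `medianprops` fit together.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Synthetic data: 50 "stations" observed at 6 "time steps".
values = rng.uniform(0, 1, size=(50, 6))

fig, ax = plt.subplots(figsize=(8, 3))

# A 2D array yields one box per column.
bp = ax.boxplot(
    values,
    patch_artist=True,                              # boxes become fillable patches
    boxprops={"facecolor": "lightblue"},            # fill color of the boxes
    medianprops={"linewidth": 2.5, "color": "black"},  # thicker median line
)
ax.set_xlabel("Time step")
ax.set_ylabel("Loading")
```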

In [None]:
### TO BE COMPLETED ### 

[...]

In [None]:
# %load solutions/Python/plot_loading_disp.py

> Comments?

## Average Loading

##### <span style="color:purple"> **Question:** What is the average station fill rate?</span>

Which station is, on average, the fullest? The least full?
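
A minimal sketch of these computations on hypothetical toy data, with stations as rows and time steps as columns (the labels `A`, `B`, `C` are made up):

```python
import pandas as pd

# Hypothetical toy loadings: 3 stations, 2 time steps.
toy = pd.DataFrame(
    {"h0": [0.2, 0.9, 0.5], "h1": [0.4, 0.7, 0.3]},
    index=["A", "B", "C"],
)

# Average over time steps for each station (axis=1 runs along columns).
station_mean = toy.mean(axis=1)

# Mean of the station means; equals the mean of all values here because
# every station has the same number of observations and none is missing.
overall_mean = station_mean.mean()

fullest = station_mean.idxmax()    # label of the highest average
emptiest = station_mean.idxmin()   # label of the lowest average

print(overall_mean, fullest, emptiest)
```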

In [None]:
### TO BE COMPLETED ### 

print('--- Average fill rate ---')
[...]

# --- #
print('')

print('--- Least crowded station, on average ---')
[...]

# --- #
print('')

print('--- Fullest station, on average ---')
[...]

In [None]:
# %load solutions/Python/loading_mean.py

##### <span style="color:purple"> **Question:** Does the average load vary from one station to another?</span>

- Show the evolution of the average load for each station. 
- On the same graph, plot the average loading for the entire data set.

In [None]:
### TO BE COMPLETED ### 

[...]

In [None]:
# %load solutions/Python/plot_mean_stations.py

> Comments?

##### <span style="color:purple"> **Question:** Does the average load vary over the course of a day?</span>

Plot the average hourly loading for each day (on a single graph).
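
One way to organise this computation is to reshape the 168 hourly averages into a day-by-hour array. The sketch below uses random data and assumes that the columns are in chronological order, starting at midnight of the first day.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic loadings: 5 stations x 168 hourly time steps.
n_stations = 5
loading_demo = rng.uniform(0, 1, size=(n_stations, 7 * 24))

# Average over stations at each time step.
hourly_mean = loading_demo.mean(axis=0)    # shape (168,)

# Row d holds the 24 hourly averages of day d.
by_day = hourly_mean.reshape(7, 24)

print(by_day.shape)
```

Each row of `by_day` can then be drawn as one curve against the hours 0 to 23.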

In [None]:
### TO BE COMPLETED ### 

[...]

In [None]:
# %load solutions/Python/plot_mean_hours.py

> Comments?

## Velib Station Map

In [None]:
import matplotlib.cm as cm
import matplotlib.patches as mpatches
import plotly.express as px

##### <span style="color:purple"> **Question:** Where are the velib stations located?</span>

- Plot the stations coordinates on a 2D map (latitude _vs._ longitude)
- Use the average hourly loading as a color scale
- You can consider different times of day, for example 6am, 12pm, 11pm on Monday, or the average weekly load at 6am.
- You can consider different days at the same time, or the average load for each day.
- You can use the [`scatter_mapbox`](https://plotly.com/python/scattermapbox/) function of [`plotly.express`](https://plotly.com/python/plotly-express/) to load the map of Paris.
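
For the simple 2D representation, a scatter plot colored by loading could look like the sketch below. The coordinates and loadings are random numbers drawn in a rough bounding box, not the real station data, and the variable names are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# Synthetic coordinates and loadings for 40 hypothetical stations.
n = 40
lon = rng.uniform(2.25, 2.42, n)
lat = rng.uniform(48.82, 48.90, n)
load = rng.uniform(0, 1, n)

fig, ax = plt.subplots(figsize=(6, 5))

# c= maps each loading to a color; vmin/vmax fix the scale to [0, 1]
# so that several plots (different hours or days) stay comparable.
sc = ax.scatter(lon, lat, c=load, cmap="viridis", vmin=0, vmax=1, s=20)
fig.colorbar(sc, ax=ax, label="Loading")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```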

In [None]:
### TO BE COMPLETED ### 
## Simple 2D representation
# Monday at hour 6h, 12h, 23h

# Hours to be displayed
hours = ...

# --- #

[...]

In [None]:
# %load solutions/Python/plot_loading_2D_monday.py

> Comments?

In [None]:
### TO BE COMPLETED ### 
## Simple 2D representation
# Loading at 6pm, depending on the day of the week

[...]

In [None]:
# %load solutions/Python/plot_loading_2D_18h.py

> Comments?

In [None]:
### TO BE COMPLETED ### 
## Visualization on the Paris map

[...]

In [None]:
# %load solutions/Python/plot_loading_map.py

> Comments?

## Influence of Altitude Difference on Station Loading

##### <span style="color:purple"> **Question:** Does Paris have many hilltop stations?</span>

- Compare the number of hilltop stations with the others.
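
A hedged sketch of this comparison, assuming that the `bonus` column is a 0/1 indicator where 1 marks a hilltop station (the toy frame `coord_demo` is hypothetical):

```python
import pandas as pd

# Hypothetical indicator column: 1 = hilltop station, 0 = other.
coord_demo = pd.DataFrame({"bonus": [0, 1, 0, 0, 1, 0]})

# Number of stations in each group.
counts = coord_demo["bonus"].value_counts()

# For a 0/1 indicator, the mean is the proportion of hilltop stations.
share = coord_demo["bonus"].mean()

print(counts)
print("Share of hilltop stations:", share)
```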

In [None]:
loading_hill = ...

[...]

In [None]:
# %load solutions/Python/hilltop_stations.py

##### <span style="color:purple"> **Question:** Are hilltop stations more crowded than others?</span>

- Plot the stations coordinates on a 2D map (latitude _vs._ longitude), using a different color for stations which are located on a hill.
- Redo the initial study, but distinguish hilltop stations from others.
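
To redo an average-loading analysis per group, one option is to group the rows by the hill indicator. The sketch below uses invented numbers and assumes a 0/1 indicator aligned row by row with the loading data; it says nothing about what the real data show.

```python
import numpy as np
import pandas as pd

# Hypothetical loadings: 4 stations x 2 time steps.
loading_demo = pd.DataFrame(
    [[0.2, 0.4], [0.1, 0.3], [0.8, 0.6], [0.6, 0.8]],
    columns=["h0", "h1"],
)

# Hypothetical indicator, one value per row: 1 = hilltop, 0 = other.
is_hill = np.array([1, 1, 0, 0])

# Average loading per time step within each group.
mean_profile = loading_demo.groupby(is_hill).mean()

print(mean_profile)
```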

In [None]:
### TO BE COMPLETED ### 
## Simple 2D representation

[...]

In [None]:
# %load solutions/Python/hilltop_stations_2D.py

In [None]:
### TO BE COMPLETED ### 
## Visualization on the Paris map

coord['hill'] = coord['bonus'].astype('category') # convert to categorical

[...]

In [None]:
# %load solutions/Python/hilltop_stations_map.py