![INSA](https://gi.insa-lyon.fr/sites/all/themes/insa_satellites/logo.png)

# GI-5-DSC - Data Science: Data analysis and visualization
***


The objectives of this part of the tutorial are the following:


* Data analysis: query the dataset to draw graphs and maps






## 1. Set up the environment: import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timezone

import folium
import plotly
import plotly.express as px
import geopandas

## 2. Getting the data

First, the datasets must be loaded: one contains the locations of the stations, the other contains the usage history. 
The usage history was modified during the previous session. To avoid redoing all the processing, you can retrieve the `data-bikes-2.zip` archive directly.


All the data used in this tutorial is available on the [git repository](https://github.com/ludovicmoncla/insa-5gi-dsc-tutorials/tree/main/data) and on [Moodle](https://moodle.insa-lyon.fr/course/view.php?id=4628). 


* Download the datasets
1. data-stations.zip
2. data-bikes-2.zip


### 2.1. Loading the data

As last time, to load the data you just have to use the [read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv) function from the `Pandas` library. 
It takes as a parameter the path of the file you want to load. This file can be either a plain CSV file or a ZIP archive containing a single CSV file. In our case it is therefore unnecessary to unzip the previously downloaded archives.
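
As a quick check of this behaviour, the sketch below writes a tiny CSV into a ZIP archive in a temporary directory and reads it back with `read_csv()`. The file names and values are made up for illustration; pandas infers the compression from the `.zip` extension.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical miniature dataset, only for illustration
csv_text = "id_velov,bikes\n1001,5\n1002,12\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a ZIP archive that contains a single CSV file
    zip_path = os.path.join(tmp_dir, "toy-data.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("toy-data.csv", csv_text)

    # read_csv() opens the archive and parses the CSV it contains
    df_toy = pd.read_csv(zip_path)

print(df_toy)
```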

In [None]:
## We load the data from the stations into a dataframe
df_stations = pd.read_csv('data/data-stations.zip')

## We now load the dataframe with the history data
df_bikes = pd.read_csv('data/data-bikes-2.zip')

In [None]:
## Display the first rows
df_stations.head()

In [None]:
## Display the first rows
df_bikes.head()

### 2.2. First look at the history data

In [None]:
## Display information about the data
df_bikes.info()

In [None]:
# Reduce the size in memory

df_bikes[['year', 'daily_departure', 'daily_arrival']] = df_bikes[['year', 'daily_departure', 'daily_arrival']].astype('int16')
df_bikes[['month','day','hour','minute', 'bikes', 'bike_stands', 'departure30min','arrival30min']] = df_bikes[['month','day','hour','minute', 'bikes', 'bike_stands', 'departure30min','arrival30min']].astype('int8')


In [None]:
## Display information about the data
df_bikes.info()

## 3. Query and visualization of data

In the previous session, we saw how to prepare and manipulate data. Now we can focus on querying and visualizing the data.

In [None]:
## We start by making a copy of our DataFrame, to be able to return to the initial data if necessary
df_sampled = df_bikes.copy()

### 3.1 Graphs (time dimension)

To get a first view of the distribution of the data, we display on a graph the cumulative sum of the departures (or arrivals) by time of day, in 30-minute steps.

#### 3.1.1 Display of departures and arrivals (cumulative sum)

In [None]:
## We group the dataframe rows according to the 'hour' and 'minute' columns
## and sum the values of the 'departure30min' column
values_departure = df_sampled.groupby(['hour', 'minute'])['departure30min'].sum().values
values_arrival = df_sampled.groupby(['hour', 'minute'])['arrival30min'].sum().values

## We get the max value to limit the ordinate axis
y_max = max(max(values_departure), max(values_arrival))

## We set the visible values of the x-axis
## The default values are not suitable in our case
x_labels = [0,'','','','','',3,'','','','','',6,'','','','','',9,'','','','','',12,'','','','','',15,'','','','','',18,'','','','','',21,'','','','','']

## We create the figure which will contain the 2 graphs (departures and arrivals) (we use the matplotlib library)
fig = plt.figure(figsize=(12, 8))
## We use the subplot method to create the 2 graphs.
## subplot() takes 3 parameters: number of rows, number of columns and index of the plot
plt.subplot(2,1,1)
plt.bar(range(len(values_departure)), values_departure)
plt.ylabel('departures')
plt.xticks(range(len(values_departure)), x_labels, rotation='vertical')
plt.ylim([0, y_max])
plt.title('Cumulative sum of departures over the period')

## We create the second graph
plt.subplot(2,1,2)
plt.bar(range(len(values_arrival)), values_arrival)
plt.ylabel('arrivals')
plt.xticks(range(len(values_arrival)), x_labels, rotation='vertical')
plt.ylim([0, y_max])
plt.title('Cumulative sum of arrivals over the period')

## We add some space between the two graphs
plt.subplots_adjust(hspace=0.5)

## We display the graphs
plt.show()

#### 3.1.2 Weekday and weekend comparison

Using the previous code, propose a solution to compare weekdays and weekends.
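
One possible starting point, shown here on a tiny made-up table rather than on `df_sampled`: assemble a date from the `year`, `month` and `day` columns, then flag Saturdays and Sundays. The `is_weekend` column name and the toy values are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy history rows (made-up values): 2021-12-04 is a Saturday, 2021-12-06 a Monday
df_toy = pd.DataFrame({
    "year": [2021, 2021, 2021, 2021],
    "month": [12, 12, 12, 12],
    "day": [4, 4, 6, 6],
    "hour": [8, 8, 8, 8],
    "minute": [0, 30, 0, 30],
    "departure30min": [2, 3, 10, 7],
})

# pd.to_datetime() can assemble dates from columns named year, month and day
dates = pd.to_datetime(df_toy[["year", "month", "day"]])

# dayofweek counts from Monday=0 to Sunday=6, so 5 and 6 are the weekend
df_toy["is_weekend"] = dates.dt.dayofweek >= 5

# Same grouping as before, applied separately to each subset
weekend_sum = df_toy[df_toy["is_weekend"]].groupby(["hour", "minute"])["departure30min"].sum()
weekday_sum = df_toy[~df_toy["is_weekend"]].groupby(["hour", "minute"])["departure30min"].sum()

print(weekend_sum)
print(weekday_sum)
```

The two resulting series can then be plotted in two subplots, in the same way as the departures and arrivals above.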


In [None]:
#**** To be completed










#### 3.1.3 Comparison of school vacations, non-holidays, covid-19 lock-down, summer, etc....

We now wish to compare several weeks to analyze whether some constraints had an effect on the use of the Vélo'v stations:


* week of February 15 to 21: school vacations
* week of April 4 to 11: curfew + work from home
* week of August 2 to 8: summer
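
A sketch of how one such week could be selected, again on made-up rows: assemble a date column and keep the rows that fall between two bounds. The year 2021 and the `date` column name are assumptions here; adapt them to the period covered by your data.

```python
import pandas as pd

# Toy history rows (made-up values)
df_toy = pd.DataFrame({
    "year": [2021, 2021, 2021, 2021],
    "month": [2, 2, 2, 8],
    "day": [14, 15, 21, 2],
    "departure30min": [1, 4, 6, 9],
})
df_toy["date"] = pd.to_datetime(df_toy[["year", "month", "day"]])

# Series.between() includes both bounds by default
week_start = pd.Timestamp("2021-02-15")
week_end = pd.Timestamp("2021-02-21")
in_week = df_toy["date"].between(week_start, week_end)

df_vacation_week = df_toy[in_week]
total_departures = int(df_vacation_week["departure30min"].sum())
print(total_departures)
```

Repeating the same filter with other bounds gives one subset per week, which can then be grouped by `hour` and `minute` and plotted side by side.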


In [None]:
#**** To be completed
# We do it only for departures










Many other analyses are possible, for example studying the impact of the weather (rain, snow, temperature), a public transport strike, school vacations, holidays or New Year's Eve parties, etc.



#### 3.1.4 Cumulative departures per day over a month

In [None]:
#**** To be completed





### 3.2 Maps (spatial dimension)

For the analysis of geographical data it will be interesting to display them in the form of a map in order to visualize their distribution and allow a better interpretation.

There are several types of maps that can be generated depending on the type of information you want to display.

#### 3.2.1 Heatmap 

First, we propose to create a heatmap representing the density of the Vélo'v stations. This is just an example of how to create such a map.

In [None]:
## Get the list of station coordinates
df_stations[['latitude', 'longitude']].values

In [None]:
from folium.plugins import HeatMap

# We initialize the map with the Folium library (centered on Lyon)
Lyon = [45.76, 4.85]
m = folium.Map(location=Lyon, zoom_start = 13) 

# We get the list of lat/lon coordinates of the stations
heat_data = ****

# We call the HeatMap function of the folium library with the list of coordinates and we add it to the map
HeatMap(heat_data).add_to(m)

# We display the map
m

Now we want to create a map that would make more sense than just displaying the density of station locations. In particular, we want to display the localized densities of departures according to the different districts.


#### 3.2.2 Display geometries

We now want to distinguish the different districts of the metropolis and obtain a result like the image below.

![quartiers du grand lyon](https://perso.liris.cnrs.fr/lmoncla/GEONUM/fig/quartiers_lyon.png)

To do this, we will retrieve the geometry of the different districts of the city and display them on the map.
We are going to try two display methods, depending on the library used: 
1. GeoJSON layer displayed with the Folium library
2. GeoJSON layer displayed with GeoPandas and Plotly


Download the geometries of the districts of Lyon (https://www.data.gouv.fr/fr/datasets/quartiers-des-communes-de-la-metropole-de-lyon/).

The geojson file `adr_voie_lieu.adrquartier.json` is already on the [git repository](https://github.com/ludovicmoncla/insa-5gi-dsc-tutorials/tree/main/data) and on [Moodle](https://moodle.insa-lyon.fr/course/view.php?id=4628).



##### Display the geojson layer with the Folium library

 - Use the [documentation](https://python-visualization.github.io/folium/quickstart.html#GeoJSON/TopoJSON-Overlays) of the Folium library to add the GeoJSON layer to your map.

In [None]:
## We initialize the path to the geojson file
jsonfile = "data/adr_voie_lieu.adrquartier.json"

## We initialize the map with Folium
m = folium.Map(location = Lyon, zoom_start = 12, tiles = "CartoDB positron")

## We add the geojson layer from our file which contains the geometries of the districts
****

## We add the control layer for the interaction with the map
folium.LayerControl().add_to(m)

## We display the map
m

##### Display the geojson layer with GeoPandas and Plotly

In [None]:
## We initialize the path to the geojson file
jsonfile = "data/adr_voie_lieu.adrquartier.json"

## We load the GeoJSON file which contains the geometries in a geodataframe
gdf_districts_json = geopandas.read_file(jsonfile)

## We display the first rows
gdf_districts_json.head()

In [None]:
## We initialize a map with the choropleth_mapbox() method of the Plotly Express library.
## Refer to the documentation for the description of the parameters
## https://plotly.github.io/plotly.py-docs/generated/plotly.express.choropleth_mapbox.html

fig = px.choropleth_mapbox(gdf_districts_json, 
                           geojson=gdf_districts_json, 
                           locations=gdf_districts_json.index, 
                           mapbox_style="carto-positron",
                           zoom=12, center = {"lat": Lyon[0], "lon": Lyon[1]},
                           opacity=0.5
                          )

## Remove margins
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

## Display the map
fig.show()


#### 3.2.3 Colored thematic map (choropleth map)

Now that we have retrieved and displayed the districts we want to add information associated with these districts. To start with, we want to generate a map where the color of each polygon depends on the number of stations present in the area.

To do this, we need to assign to each station the identifier of the zone to which it belongs.

We use the [contains()](https://geopandas.org/reference.html#geopandas.GeoSeries.contains) method of the GeoPandas library to associate the id of the corresponding zone with each station.
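
To illustrate the underlying test on a single geometry, here is a minimal sketch using Shapely directly. The rectangular zone is hypothetical and is not one of the real district geometries. It also shows why the coordinate order matters: Shapely expects `(x, y)`, that is `(longitude, latitude)`.

```python
from shapely.geometry import Point, Polygon

# Hypothetical rectangular zone around central Lyon, as (longitude, latitude) pairs
zone = Polygon([(4.80, 45.70), (4.90, 45.70), (4.90, 45.80), (4.80, 45.80)])

# Correct order: longitude first, then latitude
inside = Point(4.85, 45.76)

# Swapped order: this point is far away from the zone
swapped = Point(45.76, 4.85)

print(zone.contains(inside))   # True
print(zone.contains(swapped))  # False
```

On a GeoDataFrame, calling `contains()` on the geometry column applies the same test to every row and returns a boolean Series, which can be used as a filter.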

In [None]:
## We import the Point class from the Shapely library, 
## which creates a Point object from lat/lng coordinates
from shapely.geometry import Point

## We define a function that takes as parameters the latitude and longitude of a station 
## and returns the identifier of the corresponding zone
def get_gid(latitude, longitude):
    
    ## We create a Point object from the coordinates
    pt = Point(float(longitude),float(latitude))
    
    ## We filter our dataframe of neighborhood geometries 
    ## to keep only the geometries that contain the point
    zone_found = ****
    
    ## If there is at least one district returned by the query then the function returns the identifier of the first
    if len(zone_found) > 0:
        return str(zone_found.iloc[0].gid)
    
    return None

In [None]:
df_stations.head()

In [None]:
## We add a column with the gid of the area to our stations dataframe
df_stations['gid'] = df_stations.apply(lambda row : get_gid(row.latitude, row.longitude), axis=1)

## Display the first rows to check the addition of the column
df_stations.head()

In [None]:
## Now we want to group the stations by zone
## and retrieve for each area, the number of stations it contains
nb_stations_per_zone = ****

nb_stations_per_zone.head()

All that remains is to display this data as a choropleth map. Take inspiration from the previous code. Refer to the [documentation](https://plotly.github.io/plotly.py-docs/generated/plotly.express.choropleth_mapbox.html) to know which parameters to modify or add.

In [None]:
# to be completed










# Exercises

## A.1 Thematic map showing number of departures

We want to reproduce the same type of map as before, but this time instead of simply displaying the number of stations, we want to display the cumulative number of departures for the stations in each neighborhood.


The problem is that we do not have the latitudes and longitudes in the history dataframe. We must therefore first make a [join](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) between our two dataframes to associate each row of the history with the location of the corresponding station.
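
The sketch below shows the mechanics of such a join on two made-up tables. It assumes that both tables share a station identifier column, called `id_velov` here (the name used in the animated-map cell further down); check the actual column names in `df_stations` and `df_bikes` before adapting it.

```python
import pandas as pd

# Toy stations table (made-up values)
df_stations_toy = pd.DataFrame({
    "id_velov": [1001, 1002],
    "latitude": [45.76, 45.77],
    "longitude": [4.85, 4.86],
})

# Toy history table; station 1003 has no entry in the stations table
df_bikes_toy = pd.DataFrame({
    "id_velov": [1001, 1001, 1002, 1003],
    "departure30min": [2, 1, 5, 4],
})

# how="left" keeps every history row; rows without a matching station get NaN coordinates
df_merged_toy = df_bikes_toy.merge(df_stations_toy, on="id_velov", how="left")

print(df_merged_toy)
```

With `how="inner"` instead, history rows whose station is missing from the stations table would be dropped.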

In [None]:
## We make the join between our 2 dataframes
df_hist_merged = ****

In [None]:
df_hist_merged.head()

In [None]:
## We calculate the sum of the departures of each zone for each step of 30 minutes (for the whole period considered)
df_hist_sum = pd.DataFrame(df_hist_merged.groupby([****])['departure30min'].sum().reset_index(name = "sum"))

## to be completed







## A.2 Thematic map showing the number of departures on a weekend day

In [None]:
## to be completed




## A.3 Thematic map showing the number of departures in the morning (before noon)

In [None]:
## to be completed




## A.4 Thematic map showing the number of afternoon departures (after noon)

In [None]:
## to be completed




### 3.3 Example of animated maps

We now wish to generate animated maps that show the spatial and temporal dimensions dynamically.

In the following cell we used [this example](https://towardsdatascience.com/how-to-animate-scatterplots-on-mapbox-using-plotly-express-3bb49fe6a5d) to build an animated map from the departure history for a specific day.

These resources are also useful to build such maps:
 - https://plotly.com/python/bubble-maps/
 - https://plotly.github.io/plotly.py-docs/generated/plotly.express.scatter_mapbox.html

In [None]:
df_tmp = df_hist_merged.loc[(df_hist_merged['day'] == 6) & (df_hist_merged['month'] == 12) & (df_hist_merged['year'] == 2021)]

fig = px.scatter_mapbox(df_tmp, lat="latitude", lon="longitude", color="departure30min",
                     hover_name="id_velov", size="departure30min",
                     center={'lat':45.76, 'lon':4.85}, zoom=11,
                     animation_frame="time", mapbox_style="carto-positron")
fig.update_layout(margin={"r":0,"l":0,"b":0})
fig.show()