# 1.a Data exploration
As a first step, we examine the data available for each street using the `pandas-profiling` tool.
This helps identify the first preprocessing steps, in particular for the date format.

In [1]:
from pandas_profiling import ProfileReport
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data_convention = pd.read_csv('../data/comptages-routiers-permanents-convention.csv', sep=";")
data_champs = pd.read_csv('../data/comptages-routiers-permanents-champs.csv', sep=";")
data_peres = pd.read_csv('../data/comptages-routiers-permanents-peres.csv', sep=";")

In [3]:
def preprocess(df: pd.DataFrame):
    """
    Drop the unneeded columns and parse the dates as datetimes (converted to UTC),
    then extract year, month, day, hour and day of the week from them.
    Sort the rows chronologically.
    """
    temp = df[["Débit horaire", "Taux d'occupation", "Etat arc"]].copy()
    temp["Date et heure de comptage"] = pd.to_datetime(df["Date et heure de comptage"], utc=True)
    temp = temp.sort_values("Date et heure de comptage")
    temp = temp.set_index("Date et heure de comptage")
    temp["datetime"] = temp.index
    temp["year"] = temp.index.year
    temp["month"] = temp.index.month
    temp["day"] = temp.index.day
    temp["hour"] = temp.index.hour
    # Day of the week: 0 -> Monday, 6 -> Sunday
    temp["dayofweek"] = temp.index.dayofweek
    return temp

Before any preprocessing, the data contain a lot of information that is not always easy to read and is often redundant.

In [4]:
data_convention.head()

Unnamed: 0,Identifiant arc,Libelle,Date et heure de comptage,Débit horaire,Taux d'occupation,Etat trafic,Identifiant noeud amont,Libelle noeud amont,Identifiant noeud aval,Libelle noeud aval,Etat arc,Date debut dispo data,Date fin dispo data,geo_point_2d,geo_shape
0,5671,Convention,2020-08-12T15:00:00+02:00,626.0,3.66,Fluide,2937,Lecourbe-Convention,2973,Convention-Blomet,Invalide,2005-01-01,2019-06-01,"48.8386343727,2.29320560272","{""type"": ""LineString"", ""coordinates"": [[2.2918..."
1,5671,Convention,2020-08-12T14:00:00+02:00,583.0,3.15056,Fluide,2937,Lecourbe-Convention,2973,Convention-Blomet,Invalide,2005-01-01,2019-06-01,"48.8386343727,2.29320560272","{""type"": ""LineString"", ""coordinates"": [[2.2918..."
2,5671,Convention,2020-08-12T11:00:00+02:00,558.0,3.84389,Fluide,2937,Lecourbe-Convention,2973,Convention-Blomet,Invalide,2005-01-01,2019-06-01,"48.8386343727,2.29320560272","{""type"": ""LineString"", ""coordinates"": [[2.2918..."
3,5671,Convention,2020-11-01T12:00:00+01:00,333.0,2.02889,Fluide,2937,Lecourbe-Convention,2973,Convention-Blomet,Invalide,2005-01-01,2019-06-01,"48.8386343727,2.29320560272","{""type"": ""LineString"", ""coordinates"": [[2.2918..."
4,5671,Convention,2020-10-01T06:00:00+02:00,146.0,0.87611,Fluide,2937,Lecourbe-Convention,2973,Convention-Blomet,Invalide,2005-01-01,2019-06-01,"48.8386343727,2.29320560272","{""type"": ""LineString"", ""coordinates"": [[2.2918..."


In [5]:
data_convention = preprocess(data_convention)
data_champs = preprocess(data_champs)
data_peres = preprocess(data_peres)

After these simple preprocessing steps, the redundant columns are gone and the date of each row is broken down into its components.

In [6]:
data_convention.head()

Date et heure de comptage,Débit horaire,Taux d'occupation,Etat arc,datetime,year,month,day,hour,dayofweek
2019-11-01 03:00:00+00:00,323.0,1.67722,Invalide,2019-11-01 03:00:00+00:00,2019,11,1,3,4
2019-11-01 04:00:00+00:00,272.0,1.41056,Invalide,2019-11-01 04:00:00+00:00,2019,11,1,4,4
2019-11-01 05:00:00+00:00,240.0,1.35667,Invalide,2019-11-01 05:00:00+00:00,2019,11,1,5,4
2019-11-01 06:00:00+00:00,216.0,1.14056,Invalide,2019-11-01 06:00:00+00:00,2019,11,1,6,4
2019-11-01 07:00:00+00:00,260.0,1.85722,Invalide,2019-11-01 07:00:00+00:00,2019,11,1,7,4
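
The timestamps above carry a `+00:00` offset because `preprocess` parses them with `utc=True`, so `hour` and `dayofweek` are expressed in UTC rather than in local time. The sketch below illustrates the shift on two timestamps taken from the raw rows shown earlier. The `Europe/Paris` time zone is an assumption based on the offsets in the raw data, and the variable names are illustrative only.

```python
import pandas as pd

# Two timestamps in the raw format (local time with an explicit offset),
# copied from the raw rows displayed before preprocessing.
raw = pd.Series(["2020-08-12T15:00:00+02:00", "2020-11-01T12:00:00+01:00"])

# Same call as in preprocess(): mixed offsets are normalised to UTC.
utc_index = pd.DatetimeIndex(pd.to_datetime(raw, utc=True))

# Assumption: the counters report Paris local time.
local_index = utc_index.tz_convert("Europe/Paris")

hours_utc = list(utc_index.hour)
hours_local = list(local_index.hour)
print(hours_utc, hours_local)
```

If hourly patterns such as rush hours matter for the analysis, extracting the components from the converted index may be preferable. This is a suggestion, not something the notebook does.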


In [7]:
data_convention["dayofweek"].value_counts()

6    1367
4    1365
5    1344
1    1344
0    1344
2    1343
3    1324
Name: dayofweek, dtype: int64
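
The counts differ slightly from one day of the week to another. One possible explanation, which is an assumption and not something verified here, is that some hourly readings are missing. A minimal sketch of how such gaps could be listed, using a hypothetical helper `missing_hours` and a small synthetic index:

```python
import pandas as pd

def missing_hours(index: pd.DatetimeIndex) -> pd.DatetimeIndex:
    """Return the hourly timestamps absent between the first and last observation."""
    # Complete hourly grid covering the observed period (time zone is inherited).
    full = pd.date_range(index.min(), index.max(), freq=pd.Timedelta(hours=1))
    return full.difference(index)

# Synthetic example: the 05:00 reading is missing.
idx = pd.to_datetime(
    ["2019-11-01 03:00", "2019-11-01 04:00", "2019-11-01 06:00"], utc=True
)
gaps = missing_hours(idx)
print(gaps)
```

On the real data, the call would be `missing_hours(data_convention.index)`.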

In [46]:
prof_convention = ProfileReport(data_convention)
prof_champs = ProfileReport(data_champs)
prof_peres = ProfileReport(data_peres)

In [47]:
prof_convention.to_file(output_file="convention1.html")
prof_champs.to_file(output_file="champs1.html")
prof_peres.to_file(output_file="peres1.html")



In [63]:
barre_champs = (data_champs["Etat arc"] == "Barré").astype(int).groupby([data_champs["month"], data_champs["day"]]).sum()

In [64]:
barre_champs.sort_values().tail(20)

month  day
5      4       0
       3       0
4      27      0
5      8       4
8      2       4
3      1       6
11     24      6
12     31      6
9      6       7
7      5       7
       14      7
11     1       8
1      24      8
11     3       8
6      7       9
1      5       9
2      2       9
11     11     10
       27     11
9      20     12
Name: Etat arc, dtype: int32

In [65]:
barre_convention = (data_convention["Etat arc"] == "Barré").astype(int).groupby([data_convention["month"], data_convention["day"]]).sum()

In [67]:
barre_convention.sort_values()

month  day
1      1      0
9      6      0
       5      0
       4      0
       3      0
             ..
4      28     0
       27     0
5      6      0
12     31     0
7      11     2
Name: Etat arc, Length: 365, dtype: int32

In [68]:
barre_peres = (data_peres["Etat arc"] == "Barré").astype(int).groupby([data_peres["month"], data_peres["day"]]).sum()

In [69]:
barre_peres.sort_values()

month  day
1      1      0
9      5      0
       4      0
       3      0
       2      0
             ..
4      28     0
       27     0
5      5      0
12     31     0
1      19     4
Name: Etat arc, Length: 365, dtype: int32
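
The same count is computed three times above, once per street. As a sketch, the logic could be factored into a single function. The name `count_state_per_day` and the demo data are hypothetical; the column names are the ones produced by `preprocess`.

```python
import pandas as pd

def count_state_per_day(df: pd.DataFrame, state: str = "Barré") -> pd.Series:
    """Count, for each (month, day) pair, the rows whose "Etat arc" equals `state`."""
    is_state = (df["Etat arc"] == state).astype(int)
    return is_state.groupby([df["month"], df["day"]]).sum()

# Small synthetic frame with the columns the function relies on.
demo = pd.DataFrame({
    "Etat arc": ["Barré", "Invalide", "Barré", "Invalide"],
    "month": [7, 7, 7, 1],
    "day": [11, 11, 11, 19],
})
counts = count_state_per_day(demo)
print(counts.sort_values())
```

Note that grouping by month and day alone merges the same calendar day across different years, as in the cells above.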