# Road accidents project

This project uses a dataset from the **Observatoire National Interministériel de la Sécurité Routière (ONISR)** that records road traffic injury accidents (fatal or not) in France.

***
## Opening and concatenating the characteristics (caracteristiques) files


According to the documentation provided, the files up to 2018 all contain the same variables. From 2019 onward, however, the *gps* variable no longer exists. Since the datasets already contain the latitude and longitude of each accident location, I drop the *gps* column.

Note that the yearly CSV files do not all use the same separator: semicolon, comma or tab, depending on the year.
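Because the separator changes from one year to the next, one option is to let pandas detect it. The sketch below is not what this notebook does; it assumes that passing `sep=None` together with `engine="python"` is acceptable here, in which case `read_csv` sniffs the delimiter from the first line. The helper name `read_any_sep` and the sample contents are hypothetical.

```python
import io
import pandas as pd

def read_any_sep(source):
    """Read a CSV whose delimiter is unknown (comma, semicolon or tab)."""
    # sep=None is only supported by the python engine; pandas then
    # detects the delimiter with Python's csv.Sniffer.
    return pd.read_csv(source, sep=None, engine="python")

# Hypothetical samples mimicking the three separators found in the yearly files.
samples = {
    "comma": "Num_Acc,an,mois\n1,5,1\n2,5,2\n",
    "semicolon": "Num_Acc;an;mois\n1;5;1\n2;5;2\n",
    "tab": "Num_Acc\tan\tmois\n1\t5\t1\n2\t5\t2\n",
}
frames = {name: read_any_sep(io.StringIO(text)) for name, text in samples.items()}

for name, df in frames.items():
    print(name, list(df.columns), df.shape)
```

The python engine is slower than the default C engine, so on files of this size an explicit `sep` per year, as used in the cells below, remains a reasonable choice.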

In [1]:
import pandas as pd

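liste_dataframes = []

# 2005-2008: comma-separated files (the pandas default separator)
for i in range(2005, 2009):
    liste_dataframes.append(pd.read_csv(f'caracteristiques_{i}.csv'))

# 2010-2018: comma-separated files as well
for i in range(2010, 2019):
    liste_dataframes.append(pd.read_csv(f'caracteristiques_{i}.csv', sep=','))

# 2009 is the tab-separated file; it is appended last, so its rows come after those of 2018
liste_dataframes.append(pd.read_csv('caracteristiques_2009.csv', sep='\t'))

# Stack all years into a single DataFrame with a fresh index
data_caracteristiques = pd.concat(liste_dataframes, axis=0, ignore_index=True)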



In [2]:
data_caracteristiques.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 958469 entries, 0 to 958468
Data columns (total 16 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   Num_Acc  958469 non-null  int64  
 1   an       958469 non-null  int64  
 2   mois     958469 non-null  int64  
 3   jour     958469 non-null  int64  
 4   hrmn     958469 non-null  int64  
 5   lum      958469 non-null  int64  
 6   agg      958469 non-null  int64  
 7   int      958469 non-null  int64  
 8   atm      958396 non-null  float64
 9   col      958450 non-null  float64
 10  com      958467 non-null  float64
 11  adr      816550 non-null  object 
 12  gps      480052 non-null  object 
 13  lat      471401 non-null  float64
 14  long     471397 non-null  object 
 15  dep      958469 non-null  int64  
dtypes: float64(4), int64(9), object(3)
memory usage: 117.0+ MB


In [3]:
data_caracteristiques.columns

Index(['Num_Acc', 'an', 'mois', 'jour', 'hrmn', 'lum', 'agg', 'int', 'atm',
       'col', 'com', 'adr', 'gps', 'lat', 'long', 'dep'],
      dtype='object')

In [4]:
liste_dataframes=[]
for i in range(2019,2022,1):
    liste_dataframes.append(pd.read_csv(f"caracteristiques_{i}.csv", sep=';'))
    
df_2019_2021 = pd.concat(liste_dataframes, axis=0, ignore_index=True)

In [5]:
df_2019_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163102 entries, 0 to 163101
Data columns (total 15 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   Num_Acc  163102 non-null  int64 
 1   jour     163102 non-null  int64 
 2   mois     163102 non-null  int64 
 3   an       163102 non-null  int64 
 4   hrmn     163102 non-null  object
 5   lum      163102 non-null  int64 
 6   dep      163102 non-null  object
 7   com      163102 non-null  object
 8   agg      163102 non-null  int64 
 9   int      163102 non-null  int64 
 10  atm      163102 non-null  int64 
 11  col      163102 non-null  int64 
 12  adr      161745 non-null  object
 13  lat      163102 non-null  object
 14  long     163102 non-null  object
dtypes: int64(9), object(6)
memory usage: 18.7+ MB
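Rather than comparing the two `info()` listings by eye, the column sets can be compared directly. This is a small sketch using the column names copied from the outputs above; the variable names are illustrative only.

```python
import pandas as pd

# Column names copied from the two info() outputs above.
cols_2005_2018 = ['Num_Acc', 'an', 'mois', 'jour', 'hrmn', 'lum', 'agg', 'int',
                  'atm', 'col', 'com', 'adr', 'gps', 'lat', 'long', 'dep']
cols_2019_2021 = ['Num_Acc', 'jour', 'mois', 'an', 'hrmn', 'lum', 'dep', 'com',
                  'agg', 'int', 'atm', 'col', 'adr', 'lat', 'long']

old, new = pd.Index(cols_2005_2018), pd.Index(cols_2019_2021)
only_old = old.difference(new).tolist()  # columns present in 2005-2018 only
only_new = new.difference(old).tolist()  # columns present in 2019-2021 only
print(only_old, only_new)  # ['gps'] []
```

In the notebook itself the same check would read `data_caracteristiques.columns.difference(df_2019_2021.columns)`.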


In [6]:
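# gps is absent from the 2019-2021 files, and lat/long already carry the location
data_caracteristiques.drop(columns='gps', inplace=True)
# Append the 2019-2021 rows to the 2005-2018 rows
data_caracteristiques = pd.concat([data_caracteristiques, df_2019_2021], axis=0, ignore_index=True)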
data_caracteristiques.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1121571 entries, 0 to 1121570
Data columns (total 15 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   Num_Acc  1121571 non-null  int64  
 1   an       1121571 non-null  int64  
 2   mois     1121571 non-null  int64  
 3   jour     1121571 non-null  int64  
 4   hrmn     1121571 non-null  object 
 5   lum      1121571 non-null  int64  
 6   agg      1121571 non-null  int64  
 7   int      1121571 non-null  int64  
 8   atm      1121498 non-null  float64
 9   col      1121552 non-null  float64
 10  com      1121569 non-null  object 
 11  adr      978295 non-null   object 
 12  lat      634503 non-null   object 
 13  long     634499 non-null   object 
 14  dep      1121571 non-null  object 
dtypes: float64(2), int64(7), object(6)
memory usage: 128.4+ MB
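In the combined frame, `lat` and `long` are reported as `object`, which suggests that numeric values and text values are mixed in those columns. Here is a minimal sketch of one way to bring such a column back to `float64`. It assumes that the text values use a comma as decimal separator, which this notebook does not confirm; the helper `to_float` and the sample values are hypothetical.

```python
import pandas as pd

def to_float(series):
    """Convert a mixed column (floats and strings such as '48,8566') to float64."""
    # Assumption: text values use ',' as the decimal separator.
    as_text = series.astype(str).str.strip().str.replace(",", ".", regex=False)
    # errors='coerce' turns anything unparsable (e.g. 'None') into NaN.
    return pd.to_numeric(as_text, errors="coerce")

# Hypothetical values: one float, two comma-decimal strings, one missing value.
demo = pd.DataFrame({"lat": [48.8566, "43,2965", None, "  45,7640"]})
demo["lat"] = to_float(demo["lat"])
print(demo.dtypes)
print(demo)
```

Whether the coordinates from the two periods are expressed in the same unit is a separate question that this sketch does not address.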


**Saving the data to a new CSV file**

In [7]:
data_caracteristiques.to_csv('caracteristiques_2005_2021.csv', sep=';', encoding='utf-8')
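One detail worth knowing about `to_csv`: by default it also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below illustrates the difference on a tiny hypothetical frame, using in-memory buffers instead of files; whether the index is wanted in the saved file is a choice left to the reader.

```python
import io
import pandas as pd

demo = pd.DataFrame({"Num_Acc": [1, 2], "an": [2019, 2020]})

# Default behaviour: the RangeIndex is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index, sep=";")
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index, sep=";")

# index=False writes only the data columns.
without_index = io.StringIO()
demo.to_csv(without_index, sep=";", index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index, sep=";")

print(list(reloaded_with_index.columns))     # ['Unnamed: 0', 'Num_Acc', 'an']
print(list(reloaded_without_index.columns))  # ['Num_Acc', 'an']
```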

***
## Opening and concatenating the location (lieux) files

In [8]:
liste_dataframes = []
for i in range(2005,2009,1):
    liste_dataframes.append(pd.read_csv(f'lieux_{i}.csv', dtype={'voie': 'str'}))
    
for i in range(2009,2019,1):
    liste_dataframes.append(pd.read_csv(f'lieux_{i}.csv', sep=',', dtype={'voie': 'str'}))

data_lieux = pd.concat(liste_dataframes, axis=0, ignore_index=True)

In [9]:
data_lieux.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 958469 entries, 0 to 958468
Data columns (total 18 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   Num_Acc  958469 non-null  int64  
 1   catr     958468 non-null  float64
 2   voie     869558 non-null  object 
 3   v1       333391 non-null  float64
 4   v2       39348 non-null   object 
 5   circ     956895 non-null  float64
 6   nbv      955738 non-null  float64
 7   pr       482985 non-null  float64
 8   pr1      481166 non-null  float64
 9   vosp     955708 non-null  float64
 10  prof     956520 non-null  float64
 11  plan     956188 non-null  float64
 12  lartpc   902271 non-null  float64
 13  larrout  904096 non-null  float64
 14  surf     956545 non-null  float64
 15  infra    953061 non-null  float64
 16  situ     953499 non-null  float64
 17  env1     953029 non-null  float64
dtypes: float64(15), int64(1), object(2)
memory usage: 131.6+ MB


In [10]:
liste_dataframes=[]
for i in range(2019,2022,1):
    liste_dataframes.append(pd.read_csv(f"lieux_{i}.csv", sep=';'))
    
df_2019_2021 = pd.concat(liste_dataframes, axis=0, ignore_index=True)

In [11]:
df_2019_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163102 entries, 0 to 163101
Data columns (total 18 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   Num_Acc  163102 non-null  int64  
 1   catr     163102 non-null  int64  
 2   voie     144833 non-null  object 
 3   v1       152356 non-null  float64
 4   v2       12021 non-null   object 
 5   circ     163102 non-null  int64  
 6   nbv      163102 non-null  int64  
 7   vosp     163102 non-null  int64  
 8   prof     163102 non-null  int64  
 9   pr       163102 non-null  object 
 10  pr1      163102 non-null  object 
 11  plan     163102 non-null  int64  
 12  lartpc   468 non-null     object 
 13  larrout  104634 non-null  object 
 14  surf     163102 non-null  int64  
 15  infra    163102 non-null  int64  
 16  situ     163102 non-null  int64  
 17  vma      163102 non-null  int64  
dtypes: float64(1), int64(11), object(6)
memory usage: 22.4+ MB


In [12]:
data_lieux=pd.concat([data_lieux,df_2019_2021],axis=0, ignore_index=True)
data_lieux.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1121571 entries, 0 to 1121570
Data columns (total 19 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   Num_Acc  1121571 non-null  int64  
 1   catr     1121570 non-null  float64
 2   voie     1014391 non-null  object 
 3   v1       485747 non-null   float64
 4   v2       51369 non-null    object 
 5   circ     1119997 non-null  float64
 6   nbv      1118840 non-null  float64
 7   pr       646087 non-null   object 
 8   pr1      644268 non-null   object 
 9   vosp     1118810 non-null  float64
 10  prof     1119622 non-null  float64
 11  plan     1119290 non-null  float64
 12  lartpc   902739 non-null   object 
 13  larrout  1008730 non-null  object 
 14  surf     1119647 non-null  float64
 15  infra    1116163 non-null  float64
 16  situ     1116601 non-null  float64
 17  env1     953029 non-null   float64
 18  vma      163102 non-null   float64
dtypes: float64(12), int64(1), object(6)
memory
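The non-null counts above follow from how `pd.concat` aligns columns: by default it keeps the union of the columns and fills the gaps with `NaN`. That is why `env1` keeps its 953029 values from 2005-2018, `vma` has only the 163102 values from 2019-2021, and `vma` changes from `int64` to `float64`. The miniature frames below are hypothetical and only reproduce that behaviour.

```python
import pandas as pd

# Hypothetical miniature versions of the two periods.
old = pd.DataFrame({"Num_Acc": [1, 2], "catr": [3.0, None], "env1": [0.0, 99.0]})
new = pd.DataFrame({"Num_Acc": [3], "catr": [4], "vma": [50]})

# Default join is 'outer': columns missing on one side are filled with NaN.
merged = pd.concat([old, new], axis=0, ignore_index=True)

print(merged.dtypes)        # vma is float64 because NaN cannot be stored in int64
print(merged.notna().sum()) # env1 has 2 values, vma has 1
```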

In [13]:
data_lieux.to_csv('lieux_2005_2021.csv', sep=';', encoding='utf-8')

***
## Opening and concatenating the user (usagers) files

In [14]:
liste_dataframes = []
for i in range(2005,2009,1):
    liste_dataframes.append(pd.read_csv(f'usagers_{i}.csv'))
    
for i in range(2009,2019,1):
    liste_dataframes.append(pd.read_csv(f'usagers_{i}.csv', sep=','))

data_usagers = pd.concat(liste_dataframes, axis=0, ignore_index=True)

In [15]:
data_usagers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2142195 entries, 0 to 2142194
Data columns (total 12 columns):
 #   Column   Dtype  
---  ------   -----  
 0   Num_Acc  int64  
 1   place    float64
 2   catu     int64  
 3   grav     int64  
 4   sexe     int64  
 5   trajet   float64
 6   secu     float64
 7   locp     float64
 8   actp     float64
 9   etatp    float64
 10  an_nais  float64
 11  num_veh  object 
dtypes: float64(7), int64(4), object(1)
memory usage: 196.1+ MB


In [16]:
liste_dataframes=[]
for i in range(2019,2022,1):
    liste_dataframes.append(pd.read_csv(f"usagers_{i}.csv", sep=';'))
    
df_2019_2021 = pd.concat(liste_dataframes, axis=0, ignore_index=True)

In [17]:
df_2019_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367425 entries, 0 to 367424
Data columns (total 15 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Num_Acc      367425 non-null  int64  
 1   id_vehicule  367425 non-null  object 
 2   num_veh      367425 non-null  object 
 3   place        367425 non-null  int64  
 4   catu         367425 non-null  int64  
 5   grav         367425 non-null  int64  
 6   sexe         367425 non-null  int64  
 7   an_nais      364358 non-null  float64
 8   trajet       367425 non-null  int64  
 9   secu1        367425 non-null  int64  
 10  secu2        367425 non-null  int64  
 11  secu3        367425 non-null  int64  
 12  locp         367425 non-null  int64  
 13  actp         367425 non-null  object 
 14  etatp        367425 non-null  int64  
dtypes: float64(1), int64(11), object(3)
memory usage: 42.0+ MB


In [18]:
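# 'secu' exists only in the 2005-2018 files
data_usagers.drop(columns='secu', inplace=True)

In [19]:
# secu1, secu2 and secu3 exist only in the 2019-2021 files
df_2019_2021.drop(columns=["secu1", "secu2", "secu3"], inplace=True)

In [20]:
# id_vehicule exists only in the 2019-2021 files
df_2019_2021.drop(columns=["id_vehicule"], inplace=True)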

In [21]:
data_usagers=pd.concat([data_usagers,df_2019_2021],axis=0, ignore_index=True)
data_usagers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2509620 entries, 0 to 2509619
Data columns (total 11 columns):
 #   Column   Dtype  
---  ------   -----  
 0   Num_Acc  int64  
 1   place    float64
 2   catu     int64  
 3   grav     int64  
 4   sexe     int64  
 5   trajet   float64
 6   locp     float64
 7   actp     object 
 8   etatp    float64
 9   an_nais  float64
 10  num_veh  object 
dtypes: float64(5), int64(4), object(2)
memory usage: 210.6+ MB
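The columns dropped by hand above are exactly those that exist in only one of the two periods. As an alternative sketch (not the approach used in this notebook), `pd.concat` can keep only the shared columns with `join="inner"`. The one-row frames below are hypothetical placeholders that reuse the column names from the `info()` outputs.

```python
import pandas as pd

# Column layouts copied from the two info() outputs of the usagers files.
old_cols = ["Num_Acc", "place", "catu", "grav", "sexe", "trajet", "secu",
            "locp", "actp", "etatp", "an_nais", "num_veh"]
new_cols = ["Num_Acc", "id_vehicule", "num_veh", "place", "catu", "grav",
            "sexe", "an_nais", "trajet", "secu1", "secu2", "secu3",
            "locp", "actp", "etatp"]

# Hypothetical one-row frames; the values are placeholders.
old = pd.DataFrame([[0] * len(old_cols)], columns=old_cols)
new = pd.DataFrame([[1] * len(new_cols)], columns=new_cols)

# join='inner' keeps only the columns present in every frame.
common = pd.concat([old, new], axis=0, join="inner", ignore_index=True)
print(sorted(common.columns))
```

The explicit `drop` calls make the removed columns visible to the reader, whereas `join="inner"` silently discards anything that is not shared, so the choice depends on how much of that bookkeeping should appear in the notebook.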


In [22]:
data_usagers.to_csv('usagers_2005_2021.csv', encoding='utf-8', sep=';')

***
## Opening and concatenating the vehicle (vehicules) files

In [23]:
liste_dataframes = []
for i in range(2005,2009,1):
    liste_dataframes.append(pd.read_csv(f'vehicules_{i}.csv'))
    
for i in range(2009,2019,1):
    liste_dataframes.append(pd.read_csv(f'vehicules_{i}.csv', sep=','))

data_vehicules = pd.concat(liste_dataframes, axis=0, ignore_index=True)

In [24]:
data_vehicules.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1635811 entries, 0 to 1635810
Data columns (total 9 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   Num_Acc  1635811 non-null  int64  
 1   senc     1635539 non-null  float64
 2   catv     1635811 non-null  int64  
 3   occutc   1635811 non-null  int64  
 4   obs      1634805 non-null  float64
 5   obsm     1635033 non-null  float64
 6   choc     1635414 non-null  float64
 7   manv     1635343 non-null  float64
 8   num_veh  1635811 non-null  object 
dtypes: float64(5), int64(3), object(1)
memory usage: 112.3+ MB


In [25]:
liste_dataframes=[]
for i in range(2019,2022,1):
    liste_dataframes.append(pd.read_csv(f"vehicules_{i}.csv", sep=';'))
    
df_2019_2021 = pd.concat(liste_dataframes, axis=0, ignore_index=True)

In [26]:
df_2019_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279091 entries, 0 to 279090
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Num_Acc      279091 non-null  int64  
 1   id_vehicule  279091 non-null  object 
 2   num_veh      279091 non-null  object 
 3   senc         279091 non-null  int64  
 4   catv         279091 non-null  int64  
 5   obs          279091 non-null  int64  
 6   obsm         279091 non-null  int64  
 7   choc         279091 non-null  int64  
 8   manv         279091 non-null  int64  
 9   motor        279091 non-null  int64  
 10  occutc       2257 non-null    float64
dtypes: float64(1), int64(8), object(2)
memory usage: 23.4+ MB


In [27]:
df_2019_2021.drop(columns=['id_vehicule','motor'], inplace=True)

In [28]:
data_vehicules = pd.concat([data_vehicules, df_2019_2021], axis=0, ignore_index=True)

In [29]:
data_vehicules.to_csv('vehicules_2005_2021.csv', encoding='utf-8', sep=';')
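The same load-and-stack pattern is repeated for each of the four file families. A possible refactoring is sketched below, under the assumption that every file is named `<prefix>_<year>.csv`; the helper `load_years`, its parameters and the demo files are hypothetical and do not come from this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_years(prefix, sep_by_year, folder=".", **read_csv_kwargs):
    """Read '<prefix>_<year>.csv' for each year and stack the results.

    sep_by_year maps each year to the separator used by that year's file.
    """
    frames = []
    for year, sep in sorted(sep_by_year.items()):
        path = Path(folder) / f"{prefix}_{year}.csv"
        frames.append(pd.read_csv(path, sep=sep, **read_csv_kwargs))
    return pd.concat(frames, axis=0, ignore_index=True)

# Self-contained demo on two tiny temporary files with different separators.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_2018.csv").write_text("Num_Acc,catv\n1,7\n", encoding="utf-8")
    Path(tmp, "demo_2019.csv").write_text("Num_Acc;catv\n2;10\n", encoding="utf-8")
    demo = load_years("demo", {2018: ",", 2019: ";"}, folder=tmp)

print(demo)
```

With the real files, a call such as `load_years("vehicules", {y: "," for y in range(2005, 2019)})` would replace the two loops of cell 23. The helper does not harmonise columns between periods, so the `drop` steps are still needed before the final concatenation.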