# Introduction

Air quality in Europe is assessed using the concentrations of 5 pollutants: NO2, O3, PM10, PM2.5 and SO2. The data source I used was [Geod'Air](https://www.geodair.fr/).

In this notebook you can see how I collected the initial dataset for pollutants.

# Downloading raw data

I used the [Advanced Export](https://www.geodair.fr/donnees/export-advanced) functionality on the Geod'Air site for the bulk download of data. Initially I wasn't sure which pollutants would be needed, so in addition to the 5 mentioned above I downloaded data for C6H6, CO and NOx as NO2.

First I identified the measuring stations in or close to Montpellier using the fields on the download page and the [map](https://www.geodair.fr/donnees/referentiel-mesure).

I then downloaded data for each pollutant separately into a folder of the form `data/Air Quality/Historical to 2022-08-29/{Pollutant}`. For some pollutants, the download was split into several files because Geod'Air limits the download size. In these cases, the files were combined manually into a single file using Excel. In each case the final file for a pollutant is simply named `{Pollutant}.csv`. So for example, the data for sulphur dioxide is found in a file with the path `data/Air Quality/Historical to 2022-08-29/SO2/SO2.csv`.
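The manual Excel step could also be scripted. The following is a minimal sketch, not what was done for this notebook. It assumes the partial downloads sit in the pollutant's folder under a hypothetical naming pattern such as `SO2 part 1.csv`, `SO2 part 2.csv`, and that all parts share the same columns. The function name `combine_parts` is likewise only illustrative.

```python
from pathlib import Path

import pandas as pd


def combine_parts(folder, pattern, output_name):
    """Concatenate partial CSV downloads in `folder` into one CSV file.

    `pattern` is a glob pattern (hypothetical naming, e.g. "SO2 part *.csv").
    """
    folder = Path(folder)
    # Lexicographic order: "part 10" sorts before "part 2"
    part_paths = sorted(folder.glob(pattern))
    if not part_paths:
        raise FileNotFoundError(f"No files matching {pattern!r} in {folder}")
    # Read each part with the first column as the index, as in the cells below
    parts = [pd.read_csv(path, index_col=0) for path in part_paths]
    combined = pd.concat(parts)
    combined.to_csv(folder / output_name)
    return combined
```

A call would then look like `combine_parts("data/Air Quality/Historical to 2022-08-29/SO2", "SO2 part *.csv", "SO2.csv")`. Because the parts are sorted by file name, rows may need re-sorting by timestamp if there are ten or more parts.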

# Processing data

First we merge the data in the separate pollutant files into a single dataframe.

In [1]:
import numpy as np
import pandas as pd

In [2]:
all_pollutants = ['C6H6', 'CO', 'NO2', 'NOx as NO2', 'O3', 'PM10', 'PM25', 'SO2']

In [3]:
pollutant_data_dfs = []
for pollutant in all_pollutants:
    pollutant_data_path = f"data/Air Quality/Historical to 2022-08-29/{pollutant}/{pollutant}.csv"
    pollutant_data_df = pd.read_csv(pollutant_data_path, index_col = 0)
    pollutant_data_dfs.append(pollutant_data_df)

In [4]:
pollutants_data_df = pd.concat(pollutant_data_dfs)

In [5]:
pollutants_data_df

Date de début,Date de fin,Organisme,code zas,Zas,code site,nom site,type d'implantation,Polluant,type d'influence,Réglementaire,...,valeur,valeur brute,unité de mesure,taux de saisie,couverture temporelle,couverture de données,code qualité,validité,Latitude,Longitude
2013-01-02 00:00:00,2013-01-02 01:00:00,ATMO SUD,FR02N30,PROVENCE-ALPES-COTE-D-AZUR-ZI,FR02001,Berre l'Etang,Périurbaine,C6H6,Industrielle,Oui,...,0.38,0.37500,µg-m3,,,,A,1,43.486234,5.171939
2013-01-02 01:00:00,2013-01-02 02:00:00,ATMO SUD,FR02N30,PROVENCE-ALPES-COTE-D-AZUR-ZI,FR02001,Berre l'Etang,Périurbaine,C6H6,Industrielle,Oui,...,0.00,0.00000,µg-m3,,,,A,1,43.486234,5.171939
2013-01-02 02:00:00,2013-01-02 03:00:00,ATMO SUD,FR02N30,PROVENCE-ALPES-COTE-D-AZUR-ZI,FR02001,Berre l'Etang,Périurbaine,C6H6,Industrielle,Oui,...,0.00,0.00000,µg-m3,,,,A,1,43.486234,5.171939
2013-01-02 03:00:00,2013-01-02 04:00:00,ATMO SUD,FR02N30,PROVENCE-ALPES-COTE-D-AZUR-ZI,FR02001,Berre l'Etang,Périurbaine,C6H6,Industrielle,Oui,...,0.00,0.00000,µg-m3,,,,A,1,43.486234,5.171939
2013-01-02 04:00:00,2013-01-02 05:00:00,ATMO SUD,FR02N30,PROVENCE-ALPES-COTE-D-AZUR-ZI,FR02001,Berre l'Etang,Périurbaine,C6H6,Industrielle,Oui,...,0.00,0.00000,µg-m3,,,,A,1,43.486234,5.171939
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-08-29 19:00:00,2022-08-29 20:00:00,ATMO SUD,FR93ZAG01,ZAG MARSEILLE-AIX,FR03043,MARSEILLE 5 AVENUES,Urbaine,SO2,Fond,Oui,...,0.70,0.67500,µg-m3,,,,A,1,43.305287,5.394716
2022-08-29 20:00:00,2022-08-29 21:00:00,ATMO SUD,FR93ZAG01,ZAG MARSEILLE-AIX,FR03043,MARSEILLE 5 AVENUES,Urbaine,SO2,Fond,Oui,...,-0.10,-0.07500,µg-m3,,,,A,1,43.305287,5.394716
2022-08-29 21:00:00,2022-08-29 22:00:00,ATMO SUD,FR93ZAG01,ZAG MARSEILLE-AIX,FR03043,MARSEILLE 5 AVENUES,Urbaine,SO2,Fond,Oui,...,0.50,0.52500,µg-m3,,,,A,1,43.305287,5.394716
2022-08-29 22:00:00,2022-08-29 23:00:00,ATMO SUD,FR93ZAG01,ZAG MARSEILLE-AIX,FR03043,MARSEILLE 5 AVENUES,Urbaine,SO2,Fond,Oui,...,0.60,0.63333,µg-m3,,,,R,1,43.305287,5.394716


## Duplicates

We check for duplicates...

In [6]:
print(f"{len(pollutants_data_df)} rows in original, {len(pollutants_data_df.drop_duplicates())} in deduplicated")

1437864 rows in original, 1437864 in deduplicated


...and we have none.
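Note that `drop_duplicates()` compares column values only and ignores the index, so the check above detects rows that are identical in every column. A stricter check is whether any station and pollutant has more than one record for the same hour. A minimal sketch on a toy frame, using the column names shown above (the values are made up for illustration):

```python
import pandas as pd

# Toy data in the same shape as the notebook's dataframe (values are made up)
toy = pd.DataFrame(
    {
        "nom site": ["Pompignane", "Pompignane", "Pompignane"],
        "Polluant": ["NO2", "NO2", "NO2"],
        "valeur": [10.0, 10.5, 12.0],
    },
    index=pd.Index(
        ["2013-01-02 00:00:00", "2013-01-02 00:00:00", "2013-01-02 01:00:00"],
        name="Date de début",
    ),
)

# Move the timestamp out of the index so it can be part of the key
keyed = toy.reset_index()
key_columns = ["Date de début", "nom site", "Polluant"]

# keep=False marks every row that shares its key with another row
conflicts = keyed[keyed.duplicated(subset=key_columns, keep=False)]
print(f"{len(conflicts)} rows share a timestamp, station and pollutant")
```

In the toy frame the first two rows conflict even though `drop_duplicates()` would keep both, because their `valeur` differs.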

## Clean locations

Let's look at the geographic coverage.

In [7]:
pollutants_data_df.pivot_table(
    index = ['Polluant', 'nom site']
).loc[:, ['Latitude', 'Longitude']]

Polluant,nom site,Latitude,Longitude
C6H6,Berre l'Etang,43.486234,5.171939
C6H6,Martigues Lavera,43.386564,5.026868
C6H6,VALLEE HUVEAUNE,43.283341,5.511384
CO,A7 SUD LYONNAIS,45.720024,4.818156
CO,Clermont-Esplanade Gare,45.775696,3.09625
CO,GARIBALDI,45.7681,4.8503
CO,Grenoble Boulevards,45.18069,5.720625
CO,Le Rondeau,45.158363,5.703764
CO,Lyon Périphérique,45.77482,4.898572
CO,RIVE DE GIER,45.533424,4.623939


We can see that `Chaptal` and `Saint Denis` are repeated as `Montpellier Chaptal` and `Montpellier St Denis` with the same location. Checking the data shows that the names were changed on 1 Jan 2021, so we rename them.

In [8]:
pollutants_data_df.loc[pollutants_data_df['nom site'] == 'Chaptal', 'nom site'] = 'Montpellier Chaptal'
pollutants_data_df.loc[pollutants_data_df['nom site'] == 'Saint Denis', 'nom site'] = 'Montpellier St Denis'
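
The date of a name change can be read from the first and last timestamp recorded under each name. A minimal sketch on a toy frame (made-up rows, same column names as above), which also shows the rename written as a single `replace` with a mapping:

```python
import pandas as pd

# Made-up rows around the change of name
toy = pd.DataFrame(
    {
        "nom site": ["Chaptal", "Chaptal", "Montpellier Chaptal", "Montpellier Chaptal"],
        "valeur": [20.0, 18.0, 17.0, 19.0],
    },
    index=pd.Index(
        [
            "2020-12-31 22:00:00",
            "2020-12-31 23:00:00",
            "2021-01-01 00:00:00",
            "2021-01-01 01:00:00",
        ],
        name="Date de début",
    ),
)

# First and last timestamp seen under each station name.
# The timestamps are ISO-formatted strings, so min/max sort chronologically.
name_ranges = (
    toy.reset_index().groupby("nom site")["Date de début"].agg(["min", "max"])
)
print(name_ranges)

# Apply all renames in one step
renames = {"Chaptal": "Montpellier Chaptal", "Saint Denis": "Montpellier St Denis"}
toy["nom site"] = toy["nom site"].replace(renames)
```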

We'll save the data in the current form including all pollutants as a backup.

In [9]:
pollutants_df_path = "data/Air Quality/Historical to 2022-08-29/All pollutants.gz"

In [10]:
pollutants_data_df.to_csv(pollutants_df_path)
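
pandas infers gzip compression from the `.gz` extension both when writing and when reading, so the backup can be loaded again with a plain `read_csv`. A minimal round-trip sketch, using a temporary folder and a made-up one-row frame in place of the real data:

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame(
    {"Polluant": ["SO2"], "valeur": [0.7]},
    index=pd.Index(["2022-08-29 19:00:00"], name="Date de début"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    backup_path = Path(tmp_dir) / "All pollutants.gz"
    toy.to_csv(backup_path)  # compression="infer" is the default
    # A gzip file starts with the two magic bytes 1f 8b
    is_gzip = backup_path.read_bytes()[:2] == b"\x1f\x8b"
    restored = pd.read_csv(backup_path, index_col=0)
```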

We also save the location data.

In [11]:
pollutants_data_df.pivot_table(
    index = ['Polluant', 'nom site'],
).to_csv('data/Air Quality/Stations with data.csv')
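
Called with only `index`, `pivot_table` aggregates with the mean by default, so the saved file holds per-station means of the numeric columns and not only the coordinates. If only the coordinates are wanted, a more explicit alternative is to group and take the first value. A sketch on made-up rows, not the code used for the saved file:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Polluant": ["NO2", "NO2", "O3"],
        "nom site": ["Pompignane", "Pompignane", "Montpellier Prés d'Arènes"],
        "valeur": [10.0, 12.0, 55.0],
        "Latitude": [43.6096, 43.6096, 43.5915],
        "Longitude": [3.89878, 3.89878, 3.88681],
    }
)

# One row per (pollutant, station) with that station's coordinates
station_locations = (
    toy.groupby(["Polluant", "nom site"])[["Latitude", "Longitude"]].first()
)

# Sanity check: every station name maps to exactly one coordinate pair
coords_per_station = toy.groupby("nom site")[["Latitude", "Longitude"]].nunique()
coordinates_consistent = bool((coords_per_station == 1).all().all())
```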

## Select locations, pollutants and data

We will keep only the 5 required pollutants, and only certain measuring stations: those inside Montpellier or, where no station in Montpellier measures a pollutant, the closest available station. I recorded the chosen stations in a table created from scratch in Excel and saved as a CSV file.

In [12]:
stations_chosen = pd.read_csv('data/Air Quality/Air pollutants measuring stations.csv', index_col = 0)

In [13]:
with pd.option_context('display.max_colwidth', 0, 'display.colheader_justify', 'left'):
    display(stations_chosen.style.set_properties(**{'text-align': 'left'}))

Pollutant,Station(s) at Montpellier?,Closest stations,Selected stations
SO2,No,"One in Marseille, Several in the industrial area near l’Étang de Berre",MARSEILLE 5 AVENUES
O3,Yes,,Montpellier Prés d'Arènes
NO2,Yes,,"Montpellier Prés d'Arènes, Pompignane, Montpellier Chaptal, Montpellier St Denis"
PM10,Yes,,"Montpellier Prés d'Arènes, Pompignane"
PM2.5,Yes,,"Montpellier Prés d'Arènes, Pompignane"


We select only the data that is likely to be useful. Non-urban stations are excluded because we are trying to get a forecast for the centre of Montpellier.

In [14]:
required_pollutants = ['NO2', 'O3', 'PM10', 'PM2.5', 'SO2']
excluded_stations = [
    'Périurbaine Nord',
    'Périurbaine Sud',
    'Montpellier Périurbaine Sud (Lattes)',
    'Montpellier St Gely'
]

pollutants = pollutants_data_df[
    (pollutants_data_df['Polluant'].isin(required_pollutants)) &
    (~pollutants_data_df['nom site'].isin(excluded_stations))
]
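
The filter above excludes stations by name. An alternative that follows the selection table directly is to expand its `Selected stations` column into (pollutant, station) pairs and keep only the matching rows. A minimal sketch on made-up data, assuming the station names in that column are separated by `, ` and match `nom site` exactly:

```python
import pandas as pd

# A cut-down copy of the selection table (two pollutants only)
stations_chosen = pd.DataFrame(
    {"Selected stations": ["MARSEILLE 5 AVENUES", "Montpellier Prés d'Arènes, Pompignane"]},
    index=pd.Index(["SO2", "PM10"], name="Pollutant"),
)

# One row per (pollutant, station) pair
selected_pairs = (
    stations_chosen["Selected stations"]
    .str.split(", ")
    .explode()
    .rename("nom site")
    .reset_index()
    .rename(columns={"Pollutant": "Polluant"})
)

# Made-up measurements, including one station that was not selected
measurements = pd.DataFrame(
    {
        "Polluant": ["PM10", "PM10", "SO2"],
        "nom site": ["Pompignane", "Périurbaine Nord", "MARSEILLE 5 AVENUES"],
        "valeur": [18.0, 12.0, 0.7],
    }
)

# An inner merge keeps only rows whose (pollutant, station) pair was selected
selected = measurements.merge(selected_pairs, on=["Polluant", "nom site"], how="inner")
```

`merge` discards the index of the left frame, so with the real dataframe the timestamp index would need a `reset_index()` before the merge and a `set_index('Date de début')` afterwards.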

In [15]:
pollutants.pivot_table(
    index = ['Polluant', 'nom site']
).loc[:, ['Latitude', 'Longitude']]

Polluant,nom site,Latitude,Longitude
NO2,Montpellier Chaptal,43.611302,3.86626
NO2,Montpellier Prés d'Arènes,43.5915,3.88681
NO2,Montpellier St Denis,43.6051,3.87464
NO2,Pompignane,43.6096,3.89878
O3,Montpellier Prés d'Arènes,43.5915,3.88681
PM10,Montpellier Prés d'Arènes,43.5915,3.88681
PM10,Pompignane,43.6096,3.89878
PM2.5,Montpellier Prés d'Arènes,43.5915,3.88681
PM2.5,Pompignane,43.6096,3.89878
SO2,MARSEILLE 5 AVENUES,43.305287,5.394716


Let's look at summary info for our data as that will help us exclude other elements.

In [16]:
pollutants.info()

<class 'pandas.core.frame.DataFrame'>
Index: 841920 entries, 2013-01-02 00:00:00 to 2022-08-29 23:00:00
Data columns (total 22 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Date de fin            841920 non-null  object 
 1   Organisme              841920 non-null  object 
 2   code zas               841920 non-null  object 
 3   Zas                    841920 non-null  object 
 4   code site              841920 non-null  object 
 5   nom site               841920 non-null  object 
 6   type d'implantation    841920 non-null  object 
 7   Polluant               841920 non-null  object 
 8   type d'influence       841920 non-null  object 
 9   Réglementaire          841920 non-null  object 
 10  type d'évaluation      841920 non-null  object 
 11  type de valeur         841920 non-null  object 
 12  valeur                 811823 non-null  float64
 13  valeur brute           811823 non-null  float64
 14  unité de m

In [17]:
with pd.option_context('display.max_columns', None):
    display(pollutants.describe(include = 'all'))

,Date de fin,Organisme,code zas,Zas,code site,nom site,type d'implantation,Polluant,type d'influence,Réglementaire,type d'évaluation,type de valeur,valeur,valeur brute,unité de mesure,taux de saisie,couverture temporelle,couverture de données,code qualité,validité,Latitude,Longitude
count,841920,841920,841920,841920,841920,841920,841920,841920,841920,841920,841920,841920,811823.0,811823.0,841920,0.0,0.0,0.0,841920,841920.0,841920.0,841920.0
unique,84648,2,4,4,5,5,1,5,2,1,2,2,,,2,,,,3,,,
top,2013-01-02 01:00:00,ATMO OCCITANIE,FR76ZAG02,ZAG MONTPELLIER,FR08016,Montpellier Prés d'Arènes,Urbaine,NO2,Fond,Oui,mesures fixes,moyenne horaire validée,,,µg-m3,,,,A,,,
freq,10,757560,443664,443664,337272,337272,841920,336264,504480,841920,818592,841680,,,717840,,,,708654,,,
mean,,,,,,,,,,,,,23.539493,23.52843,,,,,,0.928504,43.571573,4.038253
std,,,,,,,,,,,,,23.039819,23.039394,,,,,,0.371323,0.089256,0.452766
min,,,,,,,,,,,,,-4.1,-4.125,,,,,,-1.0,43.305287,3.86626
25%,,,,,,,,,,,,,7.1,7.125,,,,,,1.0,43.5915,3.88681
50%,,,,,,,,,,,,,15.8,15.8,,,,,,1.0,43.5915,3.88681
75%,,,,,,,,,,,,,32.0,32.025,,,,,,1.0,43.6096,3.89878


In addition to irrelevant columns, we can exclude:
- `unité de mesure` - because its 2 values are spelling variants of the same unit (`µg/m3` and `µg-m3`).
- `type d'implantation` and `Réglementaire` - because each has only one value (`Urbaine` and `Oui`).
- `taux de saisie`, `couverture temporelle` and `couverture de données` - because they have no non-null values.
- `Latitude` and `Longitude` - because we have them stored by station elsewhere.
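
The constant and empty columns in the list above were read off the `describe` output. They can also be found programmatically. A minimal sketch on a toy frame with made-up rows, using a few of the column names above:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "type d'implantation": ["Urbaine", "Urbaine", "Urbaine"],
        "taux de saisie": [float("nan")] * 3,
        "unité de mesure": ["µg-m3", "µg/m3", "µg-m3"],
        "valeur": [0.7, -0.1, 0.5],
    }
)

# Columns with no non-null values at all
all_null_columns = toy.columns[toy.isna().all()].tolist()

# Columns with exactly one distinct non-null value
distinct_counts = toy.nunique()
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()

# Spelling variants of the same unit collapse to one value once normalised
units = toy["unité de mesure"].str.replace("-", "/", regex=False).unique().tolist()
```

Here `all_null_columns` picks out `taux de saisie`, `constant_columns` picks out `type d'implantation`, and `units` is left with a single spelling.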

In [18]:
pollutants = pollutants.loc[:, [
    'Polluant',
    'nom site',
    "type d'influence",
    "type d'évaluation",
    'type de valeur',
    'valeur',
    'valeur brute'
]]

## Final data

In [19]:
pollutants

Date de début,Polluant,nom site,type d'influence,type d'évaluation,type de valeur,valeur,valeur brute
2013-01-02 00:00:00,NO2,Montpellier Chaptal,Fond,mesures fixes,moyenne horaire validée,,
2013-01-02 01:00:00,NO2,Montpellier Chaptal,Fond,mesures fixes,moyenne horaire validée,,
2013-01-02 02:00:00,NO2,Montpellier Chaptal,Fond,mesures fixes,moyenne horaire validée,,
2013-01-02 03:00:00,NO2,Montpellier Chaptal,Fond,mesures fixes,moyenne horaire validée,,
2013-01-02 04:00:00,NO2,Montpellier Chaptal,Fond,mesures fixes,moyenne horaire validée,0.7,0.66667
...,...,...,...,...,...,...,...
2022-08-29 19:00:00,SO2,MARSEILLE 5 AVENUES,Fond,mesures fixes,moyenne horaire brute,0.7,0.67500
2022-08-29 20:00:00,SO2,MARSEILLE 5 AVENUES,Fond,mesures fixes,moyenne horaire brute,-0.1,-0.07500
2022-08-29 21:00:00,SO2,MARSEILLE 5 AVENUES,Fond,mesures fixes,moyenne horaire brute,0.5,0.52500
2022-08-29 22:00:00,SO2,MARSEILLE 5 AVENUES,Fond,mesures fixes,moyenne horaire brute,0.6,0.63333


In [20]:
pollutants.to_csv('data/Air Quality/Historical to 2022-08-29/Pollutants data.gz')