# Project: Prediction of NO2 emissions in Zurich

## Data preparation

### 1) Introduction

This project predicts nitrogen dioxide (NO2) emissions one hour ahead at Stampfenbachstrasse in the city of Zurich, where an air quality and meteorological measurement station is located.

The initial idea was to use NOx (nitrogen oxides) as a target feature. However, one of the main findings in the exploratory data analysis is that NOx is a sum of NO and NO2. For this reason, NOx was dropped as a target feature and NO2 was used instead. 

Nitrogen dioxide (NO2) and nitrogen monoxide (NO) contribute to the formation of smog and acid rain, and affect tropospheric ozone. These gases are usually produced by the reaction between nitrogen and oxygen during the combustion of fuels at high temperatures in car engines.

The wider context of the problem is urban air pollution. It can cause diseases and allergies in humans, harm other living organisms, and damage the natural environment (climate change, ozone depletion) or the built environment (acid rain).

### 2) Collect and import the required datasets

The raw dataset obtained in this notebook corresponds to 5 years of data, from January 1, 2015 to December 31, 2019. It contains air quality and meteorological hourly measurements from the Stampfenbachstrasse station in Zurich, public holiday and school holiday information for Zurich and the day name feature.

The years 2020, 2021 and 2022 were excluded because the Covid-19 pandemic makes prediction difficult, and final air quality and meteorological data for 2023 were not yet available at the start of the project.

#### A) Data sources

##### Open data portal of canton Zurich

Air quality measurements for 5 years (2015-2019) were taken from the open data portal of canton Zurich:<br/>
https://data.stadt-zuerich.ch/dataset/ugz_luftschadstoffmessung_stundenwerte/

Meteorological measurements for 5 years (2015-2019) were taken from the open data portal of canton Zurich:<br/>
https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte/

##### Website: officeholidays.com

Public holiday information for 5 years (2015-2019) was taken from the officeholidays.com website:<br/>
https://www.officeholidays.com/countries/switzerland/zurich/

##### Website: schulferien.org

School holiday information for 5 years (2015-2019) was taken from the schulferien.org website:<br/>
https://www.schulferien.org/schweiz/ferien/

#### B) Import data

##### Air quality data

Air quality measurements from 2015 to 2019 were taken from the open portal of canton Zurich.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import datetime

In [2]:
# read csv file for 2015 into dataframe
df_air_2015 = pd.read_csv("https://data.stadt-zuerich.ch/dataset/ugz_luftschadstoffmessung_stundenwerte/\
download/ugz_ogd_air_h1_2015.csv")

In [3]:
# read csv file for 2016 into dataframe
df_air_2016 = pd.read_csv("https://data.stadt-zuerich.ch/dataset/ugz_luftschadstoffmessung_stundenwerte/\
download/ugz_ogd_air_h1_2016.csv")

In [4]:
# read csv file for 2017 into dataframe
df_air_2017 = pd.read_csv("https://data.stadt-zuerich.ch/dataset/ugz_luftschadstoffmessung_stundenwerte/\
download/ugz_ogd_air_h1_2017.csv")

In [5]:
# read csv file for 2018 into dataframe
df_air_2018 = pd.read_csv("https://data.stadt-zuerich.ch/dataset/ugz_luftschadstoffmessung_stundenwerte/\
download/ugz_ogd_air_h1_2018.csv")

In [6]:
# read csv file for 2019 into dataframe
df_air_2019 = pd.read_csv("https://data.stadt-zuerich.ch/dataset/ugz_luftschadstoffmessung_stundenwerte/\
download/ugz_ogd_air_h1_2019.csv")

In [7]:
# concatenate the yearly dataframes (DataFrame.append was removed in pandas 2.0)
df_air = pd.concat([df_air_2015, df_air_2016, df_air_2017, df_air_2018, df_air_2019])
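
The five yearly files differ only in the year suffix, so the downloads could also be written as a loop. A minimal sketch, assuming the file-name pattern seen in the cells above holds for every year; `yearly_urls` and `load_years` are hypothetical helper names that are not part of the notebook.

```python
import pandas as pd

# Assumption: every yearly file follows the URL pattern used in the cells above.
BASE_URL = "https://data.stadt-zuerich.ch/dataset"


def yearly_urls(dataset, prefix, years):
    """Build one download URL per year (hypothetical helper)."""
    return [f"{BASE_URL}/{dataset}/download/{prefix}_{year}.csv" for year in years]


def load_years(urls, reader=pd.read_csv):
    """Read every URL and stack the frames, keeping each file's own index."""
    return pd.concat([reader(url) for url in urls])


air_urls = yearly_urls(
    "ugz_luftschadstoffmessung_stundenwerte", "ugz_ogd_air_h1", range(2015, 2020)
)

# Requires network access, so it is left commented out here:
# df_air = load_years(air_urls)
```

The same two helpers would cover the meteorological files by passing the other dataset name and file prefix.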

In [8]:
# show first 5 entries
df_air.head()

Unnamed: 0,Datum,Standort,Parameter,Intervall,Einheit,Wert,Status
0,2015-01-01T00:00+0100,Zch_Heubeeribüel,NO2,h1,µg/m3,37.86,bereinigt
1,2015-01-01T00:00+0100,Zch_Heubeeribüel,NO,h1,µg/m3,2.62,bereinigt
2,2015-01-01T00:00+0100,Zch_Heubeeribüel,NOx,h1,ppb,21.9,bereinigt
3,2015-01-01T00:00+0100,Zch_Heubeeribüel,O3,h1,µg/m3,19.75,bereinigt
4,2015-01-01T00:00+0100,Zch_Rosengartenstrasse,NO2,h1,µg/m3,62.72,bereinigt


In [9]:
# show last 5 entries
df_air.tail()

Unnamed: 0,Datum,Standort,Parameter,Intervall,Einheit,Wert,Status
210235,2019-12-31T23:00+0100,Zch_Stampfenbachstrasse,O3,h1,µg/m3,1.87,bereinigt
210236,2019-12-31T23:00+0100,Zch_Stampfenbachstrasse,PM10,h1,µg/m3,27.47,bereinigt
210237,2019-12-31T23:00+0100,Zch_Stampfenbachstrasse,PM2.5,h1,µg/m3,23.71,bereinigt
210238,2019-12-31T23:00+0100,Zch_Stampfenbachstrasse,CO,h1,mg/m3,0.44,bereinigt
210239,2019-12-31T23:00+0100,Zch_Stampfenbachstrasse,SO2,h1,µg/m3,1.67,bereinigt


In [10]:
# check stations ('Standort')
df_air['Standort'].value_counts()

Zch_Stampfenbachstrasse    332840
Zch_Schimmelstrasse        245400
Zch_Rosengartenstrasse     235763
Zch_Heubeeribüel           174528
Name: Standort, dtype: int64

Air quality measurements took place at 4 different stations in Zurich: Stampfenbachstrasse, Schimmelstrasse, Rosengartenstrasse and Heubeeribüel.

The Stampfenbachstrasse station has the highest number of recorded values. These values are used in the project.

In [11]:
# extract data for the Stampfenbachstrasse station
df_air_stampfenbach = df_air[df_air['Standort'] == 'Zch_Stampfenbachstrasse']

# drop Standort, Intervall, Einheit and Status columns
df_air_stampfenbach = df_air_stampfenbach.drop(columns = ['Standort', 'Intervall', 
                                                          'Einheit', 'Status'])

# dataframe shape
df_air_stampfenbach.shape

(332840, 3)

The dataframe with measurements (2015-2019) from the Stampfenbachstrasse station has 332'840 rows and 3 columns.

In [12]:
# show first 5 entries
df_air_stampfenbach.head()

Unnamed: 0,Datum,Parameter,Wert
14,2015-01-01T00:00+0100,NO2,60.32
15,2015-01-01T00:00+0100,NO,34.11
16,2015-01-01T00:00+0100,NOx,58.89
17,2015-01-01T00:00+0100,O3,1.89
18,2015-01-01T00:00+0100,PM10,258.95


In [13]:
# measured parameters
df_air_stampfenbach['Parameter'].unique().tolist()

['NO2', 'NO', 'NOx', 'O3', 'PM10', 'CO', 'SO2', 'PM2.5']

8 air quality parameters were measured at the Stampfenbachstrasse station. These are:
- Nitrogen dioxide (NO2), µg/m3
- Nitrogen monoxide (NO), µg/m3
- Nitrogen oxides (NOx), ppb
- Ozone (O3), µg/m3
- Fine dust (PM10), µg/m3
- Carbon monoxide (CO), mg/m3
- Sulfur dioxide (SO2), µg/m3
- Fine dust (PM2.5), µg/m3
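
NOx is reported in ppb while NO and NO2 are reported in µg/m3, so the statement that NOx is the sum of NO and NO2 can only be checked after a unit conversion. The sketch below converts the first Stampfenbachstrasse row shown above using the ideal gas law. The reference conditions (20 °C, 1013.25 hPa) and the molar masses are assumptions of this sketch; the notebook does not state which conditions the station uses, and `ugm3_to_ppb` is a hypothetical helper.

```python
# Ideal gas law: molar volume V_m = R * T / p
R = 8.314462618    # gas constant, J/(mol K)
T_REF = 293.15     # assumed reference temperature (20 degC), K
P_REF = 101325.0   # assumed reference pressure, Pa
MOLAR_VOLUME_L = R * T_REF / P_REF * 1000.0  # litres per mol, about 24.06

# Molar masses in g/mol
MOLAR_MASS = {"NO": 30.006, "NO2": 46.0055}


def ugm3_to_ppb(value, gas):
    """Convert a mass concentration in µg/m3 to a mixing ratio in ppb."""
    return value * MOLAR_VOLUME_L / MOLAR_MASS[gas]


# First Stampfenbachstrasse row (2015-01-01T00:00+0100)
no2, no, nox_reported = 60.32, 34.11, 58.89

nox_estimated = ugm3_to_ppb(no2, "NO2") + ugm3_to_ppb(no, "NO")
print(f"estimated NOx: {nox_estimated:.2f} ppb, reported NOx: {nox_reported} ppb")
```

Under these assumptions the estimate lands within a few hundredths of a ppb of the reported value, which is consistent with NOx being the sum of the two components once both are expressed in ppb.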

##### Meteorological data

Meteorological data from 2015 to 2019 were taken from the open portal of canton Zurich.

In [14]:
# read csv file for 2015 into dataframe
df_meteo_2015 = pd.read_csv("https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte/\
download/ugz_ogd_meteo_h1_2015.csv")

In [15]:
# read csv file for 2016 into dataframe
df_meteo_2016 = pd.read_csv("https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte/\
download/ugz_ogd_meteo_h1_2016.csv")

In [16]:
# read csv file for 2017 into dataframe
df_meteo_2017 = pd.read_csv("https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte/\
download/ugz_ogd_meteo_h1_2017.csv")

In [17]:
# read csv file for 2018 into dataframe
df_meteo_2018 = pd.read_csv("https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte/\
download/ugz_ogd_meteo_h1_2018.csv")

In [18]:
# read csv file for 2019 into dataframe
df_meteo_2019 = pd.read_csv("https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte/\
download/ugz_ogd_meteo_h1_2019.csv")

In [19]:
# concatenate the yearly dataframes (DataFrame.append was removed in pandas 2.0)
df_meteo = pd.concat([df_meteo_2015, df_meteo_2016, df_meteo_2017, df_meteo_2018, df_meteo_2019])

In [20]:
# show first 5 entries
df_meteo.head()

Unnamed: 0,Datum,Standort,Parameter,Intervall,Einheit,Wert,Status
0,2015-01-01T00:00+0100,Zch_Rosengartenstrasse,Hr,h1,%Hr,89.73,bereinigt
1,2015-01-01T00:00+0100,Zch_Rosengartenstrasse,RainDur,h1,min,0.0,bereinigt
2,2015-01-01T00:00+0100,Zch_Rosengartenstrasse,T,h1,°C,-2.36,bereinigt
3,2015-01-01T00:00+0100,Zch_Rosengartenstrasse,p,h1,hPa,983.52,bereinigt
4,2015-01-01T00:00+0100,Zch_Schimmelstrasse,Hr,h1,%Hr,87.59,bereinigt


In [21]:
# show last 5 entries
df_meteo.tail()

Unnamed: 0,Datum,Standort,Parameter,Intervall,Einheit,Wert,Status
183955,2019-12-31T23:00+0100,Zch_Stampfenbachstrasse,WD,h1,°,37.95,bereinigt
183956,2019-12-31T23:00+0100,Zch_Stampfenbachstrasse,WVs,h1,m/s,1.8,bereinigt
183957,2019-12-31T23:00+0100,Zch_Stampfenbachstrasse,WVv,h1,m/s,1.68,bereinigt
183958,2019-12-31T23:00+0100,Zch_Stampfenbachstrasse,p,h1,hPa,982.21,bereinigt
183959,2019-12-31T23:00+0100,Zch_Stampfenbachstrasse,StrGlo,h1,W/m2,0.02,bereinigt


In [22]:
# check stations ('Standort')
df_meteo['Standort'].value_counts()

Zch_Stampfenbachstrasse    344360
Zch_Rosengartenstrasse     290357
Zch_Schimmelstrasse        262944
Name: Standort, dtype: int64

Meteorological measurements took place at 3 different stations in Zurich: Stampfenbachstrasse, Rosengartenstrasse and Schimmelstrasse.

The Stampfenbachstrasse station has the highest number of recorded values. These values are used in the project.

In [23]:
# extract data for the Stampfenbachstrasse station
df_meteo_stampfenbach = df_meteo[df_meteo['Standort'] == 'Zch_Stampfenbachstrasse']

# drop Standort, Intervall, Einheit and Status columns
df_meteo_stampfenbach = df_meteo_stampfenbach.drop(columns = ['Standort', 'Intervall', 
                                                              'Einheit', 'Status'])

# dataframe shape
df_meteo_stampfenbach.shape

(344360, 3)

The dataframe with measurements (2015-2019) from the Stampfenbachstrasse station has 344'360 rows and 3 columns.

In [24]:
# show first 5 entries
df_meteo_stampfenbach.head()

Unnamed: 0,Datum,Parameter,Wert
10,2015-01-01T00:00+0100,Hr,89.25
11,2015-01-01T00:00+0100,RainDur,0.0
12,2015-01-01T00:00+0100,T,-2.09
13,2015-01-01T00:00+0100,WD,20.41
14,2015-01-01T00:00+0100,WVv,1.4


In [25]:
# measured parameters
df_meteo_stampfenbach['Parameter'].unique().tolist()

['Hr', 'RainDur', 'T', 'WD', 'WVv', 'p', 'WVs', 'StrGlo']

8 meteorological parameters were measured at the Stampfenbachstrasse station. These are:
- Relative humidity (Hr), %
- Duration of precipitation (RainDur), min
- Temperature (T), °C
- Wind direction (WD), °
- Vector wind speed (WVv), m/s
- Air pressure (p), hPa
- Scalar wind speed (WVs), m/s
- Global radiation (StrGlo), W/m2

##### Merge air quality and meteorological data

The dataframes with air quality and meteorological measurements are merged by creating a dictionary with one dataframe for each of the 16 parameters. The dataframes from the dictionary are then merged on the basis of date ('Datum').
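
This kind of merge is a long-to-wide reshape, which pandas can also express with `DataFrame.pivot`. A minimal sketch on a few values from the Stampfenbachstrasse data, assuming each (Datum, Parameter) pair occurs only once (`pivot` raises an error otherwise); `long_df` and `wide_df` are illustrative names.

```python
import pandas as pd

# Long format: one row per (timestamp, parameter) pair, as in the source files
long_df = pd.DataFrame({
    "Datum": ["2015-01-01T00:00+0100"] * 3 + ["2015-01-01T01:00+0100"] * 2,
    "Parameter": ["NO2", "NO", "T", "NO2", "NO"],
    "Wert": [60.32, 34.11, -2.09, 65.27, 69.67],
})

# Wide format: one row per timestamp, one column per parameter
wide_df = (
    long_df.pivot(index="Datum", columns="Parameter", values="Wert")
    .reset_index()
    .rename_axis(columns=None)
)

print(wide_df)
```

A parameter that is missing for a timestamp (here `T` in the second hour) becomes NaN, as with an outer merge. Unlike a merge loop, `pivot` orders the parameter columns alphabetically.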

In [26]:
# concatenate air quality and meteorological dataframes (DataFrame.append was removed in pandas 2.0)
df_air_meteo = pd.concat([df_air_stampfenbach, df_meteo_stampfenbach])

In [27]:
# show first 5 entries
df_air_meteo.head(5)

Unnamed: 0,Datum,Parameter,Wert
14,2015-01-01T00:00+0100,NO2,60.32
15,2015-01-01T00:00+0100,NO,34.11
16,2015-01-01T00:00+0100,NOx,58.89
17,2015-01-01T00:00+0100,O3,1.89
18,2015-01-01T00:00+0100,PM10,258.95


In [28]:
# show last 5 entries
df_air_meteo.tail(5)

Unnamed: 0,Datum,Parameter,Wert
183955,2019-12-31T23:00+0100,WD,37.95
183956,2019-12-31T23:00+0100,WVs,1.8
183957,2019-12-31T23:00+0100,WVv,1.68
183958,2019-12-31T23:00+0100,p,982.21
183959,2019-12-31T23:00+0100,StrGlo,0.02


In [29]:
# list with parameters
parameters = df_air_meteo['Parameter'].unique().tolist()

# show list
parameters

['NO2',
 'NO',
 'NOx',
 'O3',
 'PM10',
 'CO',
 'SO2',
 'PM2.5',
 'Hr',
 'RainDur',
 'T',
 'WD',
 'WVv',
 'p',
 'WVs',
 'StrGlo']

In [30]:
# create dictionary of 16 dataframes (for each parameter)
d = dict(list(df_air_meteo.groupby('Parameter')))

In [31]:
# drop 'Parameter' column and change 'Wert' column name to parameter name
for p in parameters:
    d[p] = d[p].drop(columns='Parameter').rename(columns={'Wert':p})

In [32]:
# create empty dataframe with dates ('Datum')
df = pd.DataFrame()
dates = df_air_meteo['Datum'].unique().tolist()
df['Datum'] = dates

In [33]:
# merge 16 dataframes from the dictionary on the basis of date ('Datum')
for p in parameters:
    df = df.merge(d[p], on='Datum', how='outer')

In [34]:
# show first 5 entries
df.head()

Unnamed: 0,Datum,NO2,NO,NOx,O3,PM10,CO,SO2,PM2.5,Hr,RainDur,T,WD,WVv,p,WVs,StrGlo
0,2015-01-01T00:00+0100,60.32,34.11,58.89,1.89,258.95,0.62,10.75,,89.25,0.0,-2.09,20.41,1.4,982.8,1.4,0.02
1,2015-01-01T01:00+0100,65.27,69.67,89.99,1.66,249.51,0.7,9.8,,90.47,0.0,-2.48,353.85,0.6,982.64,0.61,0.01
2,2015-01-01T02:00+0100,64.36,56.56,79.0,1.4,227.11,0.67,7.14,,89.45,0.0,-2.46,21.48,1.31,983.0,1.31,0.02
3,2015-01-01T03:00+0100,57.08,35.51,58.32,1.19,127.39,0.6,5.52,,89.2,0.0,-2.63,12.22,1.66,982.93,1.7,0.02
4,2015-01-01T04:00+0100,53.96,36.88,57.78,1.11,93.83,0.57,4.91,,89.56,0.0,-2.77,8.3,1.21,983.03,1.23,0.02


In [35]:
# show last 5 entries
df.tail()

Unnamed: 0,Datum,NO2,NO,NOx,O3,PM10,CO,SO2,PM2.5,Hr,RainDur,T,WD,WVv,p,WVs,StrGlo
43819,2019-12-31T19:00+0100,45.56,52.54,65.95,1.96,34.26,0.5,1.87,28.86,89.25,0.0,1.59,31.31,0.96,981.84,1.18,0.02
43820,2019-12-31T20:00+0100,42.08,47.67,60.22,2.0,33.25,0.46,1.91,28.19,92.08,0.0,1.04,36.35,1.45,981.93,1.5,0.03
43821,2019-12-31T21:00+0100,37.0,35.62,47.9,2.18,30.88,0.47,1.85,26.47,92.34,0.0,0.88,26.59,2.23,981.98,2.28,0.03
43822,2019-12-31T22:00+0100,33.09,26.88,38.86,2.11,27.63,0.44,1.71,23.82,93.05,0.0,0.18,33.61,2.66,982.18,2.76,0.04
43823,2019-12-31T23:00+0100,31.61,23.92,35.71,1.87,27.47,0.44,1.67,23.71,91.33,0.0,-0.26,37.95,1.68,982.21,1.8,0.02


In [36]:
# size of the dataframe
df.shape

(43824, 17)

The dataframe with air quality and meteorological measurements has 43'824 rows and 17 columns. 43'824 is the number of hours in the 5 years (2015-2019): 1'826 days (2016 is a leap year) multiplied by 24 hours.
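
The row count can be checked with a few lines of standard-library arithmetic. This sketch assumes one row per hour with no gaps in the timestamps; because the timestamps carry a fixed +0100 offset, daylight saving time does not add or remove hours.

```python
from datetime import date

# Days from 2015-01-01 up to and including 2019-12-31
n_days = (date(2020, 1, 1) - date(2015, 1, 1)).days

# One measurement row per hour
n_hours = n_days * 24

print(n_days, n_hours)  # 1826 43824
```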

In [37]:
# dataframe information
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43824 entries, 0 to 43823
Data columns (total 17 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Datum    43824 non-null  object 
 1   NO2      43548 non-null  float64
 2   NO       43548 non-null  float64
 3   NOx      43548 non-null  float64
 4   O3       43574 non-null  float64
 5   PM10     42950 non-null  float64
 6   CO       43572 non-null  float64
 7   SO2      43456 non-null  float64
 8   PM2.5    25864 non-null  float64
 9   Hr       43716 non-null  float64
 10  RainDur  43732 non-null  float64
 11  T        43740 non-null  float64
 12  WD       43724 non-null  float64
 13  WVv      43724 non-null  float64
 14  p        43761 non-null  float64
 15  WVs      37506 non-null  float64
 16  StrGlo   43732 non-null  float64
dtypes: float64(16), object(1)
memory usage: 6.0+ MB


The dataframe has 16 float features and 1 object feature (the date). Of the 16 float features, 8 are air quality and 8 are meteorological features.

##### Public holidays

Public holidays from 2015 to 2019 were taken from the officeholidays.com website by reading the HTML table for each year and then saving the dates in a csv file.

In [38]:
# read csv file with public holidays
df_ph = pd.read_csv('df_ph.csv')

# show dataframe
df_ph

Unnamed: 0,Date
0,2015-01-01
1,2015-01-02
2,2015-04-03
3,2015-04-06
4,2015-04-13
...,...
60,2019-08-01
61,2019-09-09
62,2019-09-15
63,2019-12-25


In [39]:
# list with public holidays as strings
list_ph = df_ph['Date'].astype(str).tolist()

The list of public holidays from 2015 to 2019 has 65 entries (dates).

##### School holidays

School holidays from 2015 to 2019 were taken from the schulferien.org website by reading the HTML table for each year and then saving the dates in a csv file.

In [40]:
# read csv file with school holidays
df_sh = pd.read_csv('df_sh.csv')

# show dataframe
df_sh

Unnamed: 0,Date
0,2015-01-01
1,2015-01-02
2,2015-01-03
3,2015-02-07
4,2015-02-08
...,...
432,2019-12-27
433,2019-12-28
434,2019-12-29
435,2019-12-30


In [41]:
# list with school holidays as strings
list_sh = df_sh['Date'].astype(str).tolist()

The list of school holidays from 2015 to 2019 has 437 entries (dates).

#### C) Raw dataset

The raw dataset is a dataframe that contains all the data described above: air quality, meteorological, public holiday and school holiday features. One additional feature, the day name, is added to the raw dataset.

In [42]:
# add public holidays as a column to the dataframe

# empty list
ph = []

# list with 0 or 1 values for public holidays 
for i in range(len(df)):
    if df.loc[i]['Datum'][:10] in list_ph:
        ph.append(1)
    else:
        ph.append(0)
        
# add new column 'PH' with list values to the dataframe
df['PH'] = ph

In [43]:
# add school holidays as a column to the dataframe

# empty list
sh = []

# list with 0 or 1 values for school holidays 
for i in range(len(df)):
    if df.loc[i]['Datum'][:10] in list_sh:
        sh.append(1)
    else:
        sh.append(0)
        
# add new column 'SH' with list values to the dataframe
df['SH'] = sh
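
The row-by-row loops above can also be expressed as a single vectorised lookup, which is typically much faster on tens of thousands of rows. A minimal sketch on toy data, assuming 'Datum' still holds ISO strings whose first 10 characters are the date; `df_demo` and `holidays` are illustrative names.

```python
import pandas as pd

# Toy frame with the same timestamp string format as the 'Datum' column
df_demo = pd.DataFrame({
    "Datum": [
        "2015-01-01T00:00+0100",
        "2015-01-01T01:00+0100",
        "2015-01-05T00:00+0100",
    ]
})

# Holiday dates as 'YYYY-MM-DD' strings, like list_ph and list_sh
holidays = ["2015-01-01", "2015-01-02"]

# Slice the date part, test membership, and store the result as 0/1
df_demo["PH"] = df_demo["Datum"].str[:10].isin(holidays).astype(int)

print(df_demo)
```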

In [44]:
# dataframe information
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43824 entries, 0 to 43823
Data columns (total 19 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Datum    43824 non-null  object 
 1   NO2      43548 non-null  float64
 2   NO       43548 non-null  float64
 3   NOx      43548 non-null  float64
 4   O3       43574 non-null  float64
 5   PM10     42950 non-null  float64
 6   CO       43572 non-null  float64
 7   SO2      43456 non-null  float64
 8   PM2.5    25864 non-null  float64
 9   Hr       43716 non-null  float64
 10  RainDur  43732 non-null  float64
 11  T        43740 non-null  float64
 12  WD       43724 non-null  float64
 13  WVv      43724 non-null  float64
 14  p        43761 non-null  float64
 15  WVs      37506 non-null  float64
 16  StrGlo   43732 non-null  float64
 17  PH       43824 non-null  int64  
 18  SH       43824 non-null  int64  
dtypes: float64(16), int64(2), object(1)
memory usage: 7.7+ MB


In [45]:
# change the column name from 'Datum' to 'Timestamp'
df.rename(columns={'Datum': 'Timestamp'}, inplace=True)

# change the column format from object to datetime
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

In [46]:
# add new column 'Day' with the day name
df['Day'] = df['Timestamp'].dt.day_name()
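
The effect of the conversion can be seen on two timestamps from the data. A minimal sketch, assuming all strings share the same +0100 offset as in the source files; `stamps`, `parsed` and `days` are illustrative names.

```python
import pandas as pd

# First and last timestamp of the dataset, in the source string format
stamps = pd.Series(["2015-01-01T00:00+0100", "2019-12-31T23:00+0100"])

# Parsing keeps the UTC offset, so the result is timezone-aware
parsed = pd.to_datetime(stamps)

# Day names are derived from the local (+01:00) wall-clock time
days = parsed.dt.day_name()

print(parsed.dtype)
print(days.tolist())
```

Because the offset is preserved, the day name refers to local time at the station rather than to UTC.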

In [47]:
# dataframe information
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43824 entries, 0 to 43823
Data columns (total 20 columns):
 #   Column     Non-Null Count  Dtype                               
---  ------     --------------  -----                               
 0   Timestamp  43824 non-null  datetime64[ns, pytz.FixedOffset(60)]
 1   NO2        43548 non-null  float64                             
 2   NO         43548 non-null  float64                             
 3   NOx        43548 non-null  float64                             
 4   O3         43574 non-null  float64                             
 5   PM10       42950 non-null  float64                             
 6   CO         43572 non-null  float64                             
 7   SO2        43456 non-null  float64                             
 8   PM2.5      25864 non-null  float64                             
 9   Hr         43716 non-null  float64                             
 10  RainDur    43732 non-null  float64                        

The raw dataset contains 20 features. These are 16 float, 2 integer, 1 datetime (Timestamp) and 1 object feature (Day).

In [48]:
# show first 5 entries
df.head()

Unnamed: 0,Timestamp,NO2,NO,NOx,O3,PM10,CO,SO2,PM2.5,Hr,RainDur,T,WD,WVv,p,WVs,StrGlo,PH,SH,Day
0,2015-01-01 00:00:00+01:00,60.32,34.11,58.89,1.89,258.95,0.62,10.75,,89.25,0.0,-2.09,20.41,1.4,982.8,1.4,0.02,1,1,Thursday
1,2015-01-01 01:00:00+01:00,65.27,69.67,89.99,1.66,249.51,0.7,9.8,,90.47,0.0,-2.48,353.85,0.6,982.64,0.61,0.01,1,1,Thursday
2,2015-01-01 02:00:00+01:00,64.36,56.56,79.0,1.4,227.11,0.67,7.14,,89.45,0.0,-2.46,21.48,1.31,983.0,1.31,0.02,1,1,Thursday
3,2015-01-01 03:00:00+01:00,57.08,35.51,58.32,1.19,127.39,0.6,5.52,,89.2,0.0,-2.63,12.22,1.66,982.93,1.7,0.02,1,1,Thursday
4,2015-01-01 04:00:00+01:00,53.96,36.88,57.78,1.11,93.83,0.57,4.91,,89.56,0.0,-2.77,8.3,1.21,983.03,1.23,0.02,1,1,Thursday


In [49]:
# shape of the raw dataset
df.shape

(43824, 20)

The raw dataset has 43'824 rows and 20 features.

In [50]:
# save the raw dataset to a csv file
df.to_csv('raw-data.csv', index=False)

### 3) Summary

The raw dataset obtained in this notebook corresponds to 5 years of data, from January 1, 2015 to December 31, 2019. It contains air quality and meteorological hourly measurements from the Stampfenbachstrasse station in Zurich, public holidays and school holidays in Zurich and the day name feature.

Air quality and meteorological measurements were taken from the open data portal of canton Zurich. Public holidays were taken from the officeholidays.com website. School holidays were taken from the schulferien.org website. The day name was obtained from the timestamp.

Air quality and meteorological measurements were merged by creating a dictionary of 16 dataframes (one for each parameter). The 16 dataframes were then merged on the basis of date. Public holidays, school holidays and the day name were added as separate columns.

The raw dataset has 43'824 rows and 20 features. There are 16 float (air quality and meteorological features), 2 integer (public holiday, school holiday), 1 datetime (timestamp) and 1 object (day name) feature.

Feature information:

- Timestamp
- Nitrogen dioxide (NO2), µg/m3
- Nitrogen monoxide (NO), µg/m3
- Nitrogen oxides (NOx), ppb
- Ozone (O3), µg/m3
- Fine dust (PM10), µg/m3
- Carbon monoxide (CO), mg/m3
- Sulfur dioxide (SO2), µg/m3
- Fine dust (PM2.5), µg/m3
- Relative humidity (Hr), %
- Duration of precipitation (RainDur), min
- Temperature (T), °C
- Wind direction (WD), °
- Vector wind speed (WVv), m/s
- Air pressure (p), hPa
- Scalar wind speed (WVs), m/s
- Global radiation (StrGlo), W/m2
- Public holiday (0 or 1)
- School holiday (0 or 1)
- Day (Monday - Sunday)