The goal of this code is to build a model that infers the daily AQI (air quality index) from the number of cars alone, without measuring the air composition.
To do that, a column named AQI is first added to the fulldata_daily dataframe.
Using this dataframe, several algorithms are then tried, together with cross-validation, to find a model that classifies each observation into the correct AQI level based only on the vehicle count data.

In [1]:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

In [21]:
df = pd.read_csv('../processed_data/full_data.csv')
df.head()

Unnamed: 0,Jahr,Monat,Tag,Zeit,Datum,Zweirad,Personenwagen,Lastwagen,Hr,RainDur,...,WVs,StrGlo,p,NO2,NO,NOx,O3,CO,PM10,SO2
0,2007,1,1,00:00,2007-01-01T00:00,,,,65.26,0.0,...,3.79,1.48,975.65,19.8,1.7,11.71,45.31,0.3,53.27,7.88
1,2007,1,1,01:00,2007-01-01T01:00,,,,68.6,5.63,...,5.27,1.5,974.98,13.26,2.88,9.24,54.38,0.27,27.84,3.21
2,2007,1,1,02:00,2007-01-01T02:00,,,,73.04,26.47,...,4.4,1.51,974.43,14.07,1.95,8.92,52.51,0.26,13.06,3.01
3,2007,1,1,03:00,2007-01-01T03:00,,,,78.79,57.95,...,4.11,1.49,973.78,12.26,1.69,7.77,53.81,0.24,10.81,2.97
4,2007,1,1,04:00,2007-01-01T04:00,,,,83.82,41.25,...,2.82,1.48,973.3,32.6,4.9,20.98,25.57,0.33,25.29,3.71


The EAQI (European Air Quality Index) classifies stations as either 'Traffic stations' or 'Industrial and Background stations'. Stampfenbachstrasse is classified as a traffic station, and for traffic stations only NO2, PM10 and PM2.5 are considered when calculating the EAQI.
Since our dataset does not include PM2.5, we use only NO2 and PM10 to calculate the EAQI.

In [22]:
# Keep only the variables that are needed
df = df.iloc[:,[0,1,2,3,4,6,7,14,19]]
df.head()

Unnamed: 0,Jahr,Monat,Tag,Zeit,Datum,Personenwagen,Lastwagen,NO2,PM10
0,2007,1,1,00:00,2007-01-01T00:00,,,19.8,53.27
1,2007,1,1,01:00,2007-01-01T01:00,,,13.26,27.84
2,2007,1,1,02:00,2007-01-01T02:00,,,14.07,13.06
3,2007,1,1,03:00,2007-01-01T03:00,,,12.26,10.81
4,2007,1,1,04:00,2007-01-01T04:00,,,32.6,25.29


In [24]:
df = df.dropna(subset = ['Personenwagen', 'Lastwagen'])

In [28]:
# For NO2, O3 and SO2, hourly concentrations are fed into the index calculation.
# For PM10 and PM2.5, the running mean over the past 24 hours is used (a minimum of 18 hourly values is needed).

df['PM10_calc'] = df['PM10'].rolling(window=24, min_periods=18).mean()
df

Unnamed: 0,Jahr,Monat,Tag,Zeit,Datum,Personenwagen,Lastwagen,NO2,PM10,PM10_calc
7944,2007,11,28,00:00,2007-11-28T00:00,79.0,4.0,55.87,28.73,
7945,2007,11,28,01:00,2007-11-28T01:00,57.0,0.0,43.34,26.47,
7946,2007,11,28,02:00,2007-11-28T02:00,46.0,0.0,34.87,26.22,
7947,2007,11,28,03:00,2007-11-28T03:00,21.0,2.0,37.67,21.58,
7948,2007,11,28,04:00,2007-11-28T04:00,30.0,3.0,39.56,33.95,
...,...,...,...,...,...,...,...,...,...,...
131422,2021,12,28,23:00,2021-12-28T23:00,26.0,0.0,2.65,4.29,6.118333
131423,2021,12,29,00:00,2021-12-29T00:00,23.0,0.0,2.37,4.02,5.882917
131424,2021,12,29,01:00,2021-12-29T01:00,7.0,0.0,2.35,6.10,5.714583
131425,2021,12,29,02:00,2021-12-29T02:00,3.0,0.0,1.22,8.25,5.495833
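
Because rows without vehicle counts were dropped before this step, a window of 24 rows is not guaranteed to cover exactly the past 24 hours. The following is a minimal sketch of a time-based window, assuming `Datum` can be parsed as a timestamp. The small `demo` frame and the `PM10_rows` column are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data standing in for the real dataframe.
demo = pd.DataFrame({
    "Datum": pd.date_range("2021-01-01", periods=48, freq="h"),
    "PM10": np.arange(48, dtype=float),
})
# Simulate a 10-hour gap such as the one dropna() can leave behind.
demo = demo.drop(index=range(10, 20))

# Row-based window: counts the last 24 rows, whatever time span they cover.
demo["PM10_rows"] = demo["PM10"].rolling(window=24, min_periods=18).mean()

# Time-based window: index by timestamp and measure the window in hours.
demo = demo.set_index("Datum")
demo["PM10_calc"] = demo["PM10"].rolling("24h", min_periods=18).mean()

print(demo.loc["2021-01-02", ["PM10", "PM10_rows", "PM10_calc"]])
```

In this sketch the row-based mean becomes available earlier than the time-based one after the gap, because it reaches back across the missing hours to collect its 18 values.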


In [47]:
# Map concentrations to index levels according to the EAQI table

range_PM10 = [0, 20, 40, 50, 100, 150, 1200]
range_NO2 = [0, 40, 90, 120, 230, 340, 1000]

# With labels=False, pd.cut returns the bin number instead of an interval label.
# 0: Good, 1: Fair, 2: Moderate, 3: Poor, 4: Very Poor, 5: Extremely Poor

NO2_bins = pd.cut(df['NO2'], bins=range_NO2, labels=False, include_lowest=True)
df['NO2_AQI'] = NO2_bins

PM10_bins = pd.cut(df['PM10_calc'], bins=range_PM10, labels=False, include_lowest=True)
df['PM10_AQI'] = PM10_bins

# The AQI is the poorest (highest) index among the pollutants considered.
# np.fmax ignores NaN, so rows without PM10_calc fall back to the NO2 index.
df['AQI'] = np.fmax(df['NO2_AQI'], df['PM10_AQI'])
df = df.dropna(subset=['AQI'])

df

Unnamed: 0,Jahr,Monat,Tag,Zeit,Datum,Personenwagen,Lastwagen,NO2,PM10,PM10_calc,NO2_AQI,PM10_AQI,AQI
7944,2007,11,28,00:00,2007-11-28T00:00,79.0,4.0,55.87,28.73,,1.0,,1.0
7945,2007,11,28,01:00,2007-11-28T01:00,57.0,0.0,43.34,26.47,,1.0,,1.0
7946,2007,11,28,02:00,2007-11-28T02:00,46.0,0.0,34.87,26.22,,0.0,,0.0
7947,2007,11,28,03:00,2007-11-28T03:00,21.0,2.0,37.67,21.58,,0.0,,0.0
7948,2007,11,28,04:00,2007-11-28T04:00,30.0,3.0,39.56,33.95,,0.0,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
131422,2021,12,28,23:00,2021-12-28T23:00,26.0,0.0,2.65,4.29,6.118333,0.0,0.0,0.0
131423,2021,12,29,00:00,2021-12-29T00:00,23.0,0.0,2.37,4.02,5.882917,0.0,0.0,0.0
131424,2021,12,29,01:00,2021-12-29T01:00,7.0,0.0,2.35,6.10,5.714583,0.0,0.0,0.0
131425,2021,12,29,02:00,2021-12-29T02:00,3.0,0.0,1.22,8.25,5.495833,0.0,0.0,0.0
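
To make the binning step easier to check by hand, here is a small worked sketch that uses the same bin edges as above. The concentration values are hypothetical, chosen to land on a bin edge and to include a missing value.

```python
import numpy as np
import pandas as pd

range_PM10 = [0, 20, 40, 50, 100, 150, 1200]
range_NO2 = [0, 40, 90, 120, 230, 340, 1000]

# Hypothetical values: 0.0 tests include_lowest, 40.0 sits on a bin edge.
no2 = pd.Series([0.0, 40.0, 40.01, 95.0])
# Hypothetical values: the first entry mimics a missing 24-hour mean.
pm10 = pd.Series([np.nan, 10.0, 45.0, 20.0])

# Bins are closed on the right, so 40.0 still belongs to the first bin.
no2_idx = pd.cut(no2, bins=range_NO2, labels=False, include_lowest=True)
pm10_idx = pd.cut(pm10, bins=range_PM10, labels=False, include_lowest=True)

# Element-wise maximum that skips NaN when the other value is present.
aqi = np.fmax(no2_idx, pm10_idx)

print(pd.DataFrame({"NO2_AQI": no2_idx, "PM10_AQI": pm10_idx, "AQI": aqi}))
```

The first row gets its AQI from NO2 alone, the third row is driven by PM10 and the fourth by NO2. If `np.maximum` were used instead of `np.fmax`, the first row would come out as NaN and be removed by the later `dropna`.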


With the hourly vehicle counts and AQI levels now in one dataframe, we can apply classification algorithms.
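
As a rough sketch of what that classification step could look like, the example below holds out a test set and cross-validates a classifier on the rest. It rests on several assumptions: the data is synthetic and merely stands in for the `Personenwagen`, `Lastwagen` and `AQI` columns, and the decision tree is an arbitrary choice for illustration, not necessarily one of the algorithms used in this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataframe (not actual measurements).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Personenwagen": rng.integers(0, 500, size=200),
    "Lastwagen": rng.integers(0, 50, size=200),
})
# Hypothetical rule to produce two AQI levels for the example.
demo["AQI"] = (demo["Personenwagen"] > 250).astype(int)

X = demo[["Personenwagen", "Lastwagen"]]
y = demo["AQI"]

# Hold out a test set, keeping the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 5-fold cross-validation on the training part only.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training part and score once on the held-out data.
model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)

print("CV accuracy per fold:", scores)
print("Held-out accuracy:", test_acc)
```

To run this on the real data, `demo` would be replaced by the `df` built above, and other estimators can be swapped in for the decision tree without changing the rest of the code.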