# Data merging
The data for this project consists of open source data provided by [Sensor Community](https://sensor.community/en/) and [Deutscher Wetterdienst](https://opendata.dwd.de/climate_environment/CDC/observations_germany/). We focus on the following sensor positions and time spans:
| city | current mean PM pollution | longitude | latitude | time span |
|---|---|---|---|---|
| Bremen   | homogeneous, low  | 8.670000 - 8.933400 | 53.013000 - 53.145600 | Jan 2020 - Feb 2022 |
| Frankfurt | inhomogeneous, high | 8.430634 - 8.919868 | 50.030681 - 50.205692 | Jan 2020 - Feb 2022 |

These two cities provide a good variety: the particulate matter (PM) pollution in Bremen is homogeneously low, whereas in Frankfurt it is inhomogeneous and, on average, quite high.

The contributors to the Sensor Community use different sensors that are installed on private property. The different sensors measure the following values:
| sensor name | time stamp | temperature (°C) | PM2.5 (µg/m<sup>3</sup>) | PM10 (µg/m<sup>3</sup>) | air pressure | humidity |
|---|---|---|---|---|---|---|
| sds011 | x |   | x | x |   |   |
| bme280 | x | x |   |   | x | x |
| bmp280 | x | x |   |   | x |   |
| dht22  | x | x |   |   |   | x |

Because contributors use different sensor types, and because sensors of different contributors can be situated at the same longitude/latitude position, the data can comprise several measurements per site.

At about 20 measurements per hour, the measurement rate is higher than needed. To reduce the data size, we calculate mean values and standard deviations per hour and per longitude/latitude position.

Before calculating mean values, we check whether the sensors measure within the expected range and how many values are missing.
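
A minimal sketch of such a check on toy data, assuming the 0.0-999.9 µg/m<sup>3</sup> range that the SDS011 data sheet gives (see the section on inconsistencies). The frame `toy` and its values are made up for illustration and are not part of the project data.

```python
import numpy as np
import pandas as pd

# Toy PM10 readings (hypothetical values, not from the project data)
toy = pd.DataFrame({"P1": [12.3, 0.0, np.nan, 1500.0, 45.1, -1.0]})

# Assumed valid range, taken from the SDS011 data sheet
lower, upper = 0.0, 999.9

# between() is inclusive on both ends; missing values evaluate to False
in_range = toy["P1"].between(lower, upper)

# Count missing values and values that are present but outside the range
n_missing = toy["P1"].isna().sum()
n_out_of_range = (~in_range & toy["P1"].notna()).sum()

print(f"missing: {n_missing}, out of range: {n_out_of_range}")
```

Counting missing and out-of-range values separately keeps the two kinds of problems apart, since `between()` alone would treat a missing value like an out-of-range one.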

This notebook covers the following steps:
* extracting data from Sensor Community
* checking for inconsistencies 
* merging data into one DataFrame and calculating mean values per hour

## Importing libraries

In [None]:
# fundamentals
import numpy as np
import pandas as pd

# plotting
import seaborn as sns
import matplotlib.pyplot as plt

# find data files by path pattern
import glob

# plt.rcParams.update({'figure.facecolor':'white'})   

## Extracting data from Sensor Community
All data for one sensor type is loaded by means of a function. Unnecessary (empty) columns are dropped. Based on the time stamp, columns for date and hour are added.

In [None]:
def import_sensor_data(sensor):
    '''
    imports the data for a given sensor type (sds, bme280, bmp280, dht22)
    returns one DataFrame with the concatenated rows of all matching files
    '''
    path = r'../data/SensorCommunity' # use your path
    all_files = glob.glob(path + "/*.csv") # list with paths to data files

    li = []
    # select data files for chosen sensor, read it to DataFrame and save it in list
    for filename in all_files: 
        if sensor in filename:
            df = pd.read_csv(filename, index_col=None, header=0)
            li.append(df)

    return pd.concat(li, axis=0, ignore_index=True)


def process_timestamps(df):
    '''
    adds columns with date and hour and drops the timestamp column
    modifies the DataFrame in place and returns nothing
    '''
    df.timestamp = pd.to_datetime(df.timestamp)
    df['hour'] = df.timestamp.dt.hour
    df['date'] = pd.to_datetime(df.timestamp.dt.date)
    df.drop('timestamp', axis=1, inplace=True)

In [None]:
# load data and drop unnecessary columns
df_sds = import_sensor_data("sds").drop(['durP1', 'durP2', 'ratioP1', 'ratioP2', 'sensor_type', 'sensor_id', 'location'], axis=1)
df_bme280 = import_sensor_data("bme280").drop(['altitude', 'pressure_sealevel', 'sensor_type', 'sensor_id', 'location'], axis=1)
df_bmp280 = import_sensor_data("bmp280").drop(['altitude', 'pressure_sealevel', 'sensor_type', 'sensor_id', 'location'], axis=1)
df_dht22 = import_sensor_data("dht22").drop(['sensor_type', 'sensor_id', 'location'], axis=1)

# df_bmp180 = import_sensor_data("bmp180").drop(['altitude', 'pressure_sealevel', 'sensor_type', 'location'], axis=1)
# df_ds18b20 = import_sensor_data("ds18b20").drop(['sensor_type', 'location'], axis=1)

dataframes = [df_sds, df_bme280, df_bmp280, df_dht22]


In [None]:
# Make date and hour columns
for df in dataframes:
    process_timestamps(df)

## Checking for inconsistencies
For the particulate matter (PM) sensor sds011, the data sheet gives a measurement range of 0.0-999.9 μg/m<sup>3</sup>. Beyond 999.9 μg/m<sup>3</sup> the sensor still reports values, but their absolute magnitude is not trustworthy. More generally, a value that stays constant for a long time points to a measurement problem. We therefore first check for 0-measurements in the PM10 column P1.

In [None]:
# check for 0-measurements and sort descending 
df_sds.query("P1==0").groupby(['lat', 'lon']).count().sort_values('P1', ascending=False).head(10)

In [None]:
# check for PM2.5 measurements at the upper limit (999.9) and sort descending
df_sds.query("P2==999.9").groupby(['lat', 'lon']).count().sort_values('P2', ascending=False).head(10)

In [None]:
# check for PM10 measurements at the maximum reported value (1999.9) and sort descending
df_sds.query("P1==1999.9").groupby(['lat', 'lon']).count().sort_values('P1', ascending=False).head(10)

In [None]:
# check for sites where PM10 and PM2.5 are at their maximum values at the same time
df_sds.query("P1==1999.9 and P2==999.9").groupby(['lat', 'lon']).count().sort_values('P1', ascending=False).head(10)
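
The queries above look for specific suspicious values. Since a value that stays constant for a long time also hints at a measurement problem, runs of identical consecutive readings could be flagged as well. The following is a minimal sketch on a made-up series; the threshold `min_run` is a hypothetical choice, not a value from the data sheet.

```python
import pandas as pd

# Toy series of PM readings from one hypothetical sensor, in time order
values = pd.Series([5.1, 5.3, 999.9, 999.9, 999.9, 999.9, 4.8, 4.8, 5.0])

# A new run starts wherever a value differs from its predecessor
run_id = (values != values.shift()).cumsum()

# Length of the run that each reading belongs to
run_length = values.groupby(run_id).transform("size")

# Hypothetical threshold: flag runs of at least 4 identical readings
min_run = 4
suspicious = values[run_length >= min_run]
print(suspicious)
```

For real data this would have to be applied per sensor position after sorting by time, for example inside a `groupby(['lat', 'lon'])`.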

In [None]:
# as an example, count the total number of measurements at one site
df_sds.query("lat==50.08600 and lon==8.63400").count()

In [None]:
df_sds.head()

In [None]:
df_sds['date_hour'] = df_sds['date'].astype(str) + '_' + df_sds['hour'].astype(str)
print(df_sds.head())
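
The `date_hour` column built above is a string, so it sorts as text rather than chronologically (hour 10 comes before hour 2). If chronological order matters, a datetime column could be used instead. The sketch below illustrates the difference on toy data; the column name `date_hour_dt` is a hypothetical choice.

```python
import pandas as pd

# Toy frame with the same 'date' and 'hour' columns as df_sds (values are made up)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-01"]),
    "hour": [2, 10, 23],
})

# String key, built as in the cell above: sorts as text
toy["date_hour"] = toy["date"].astype(str) + "_" + toy["hour"].astype(str)

# Datetime key (hypothetical column name): sorts chronologically
toy["date_hour_dt"] = toy["date"] + pd.to_timedelta(toy["hour"], unit="h")

hours_by_string = toy.sort_values("date_hour")["hour"].tolist()
hours_by_datetime = toy.sort_values("date_hour_dt")["hour"].tolist()
print(hours_by_string)
print(hours_by_datetime)
```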

In [None]:

plt.figure(figsize=(25, 15))
ax = sns.lineplot(data=df_sds.query("lat==50.08600 and lon==8.63400")[::100], x='date_hour', y='P2')
plt.xticks(rotation=90)
plt.ylim(-2, 50);

In [None]:
plt.figure(figsize=(25, 15))
ax = sns.lineplot(data=df_sds.query("lat==53.068 and lon==8.818")[::100], x='date_hour', y='P2')
plt.xticks(rotation=90)
plt.ylim(-2, 50);

## Preprocessing

In [None]:

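# aggregate only the PM columns; the string column 'date_hour' cannot be averaged
df_sds_grouped = df_sds.groupby(['hour', 'date', 'lat', 'lon'])[['P1', 'P2']].mean().reset_index()
df_sds_grouped_std = df_sds.groupby(['hour', 'date', 'lat', 'lon'])[['P1', 'P2']].std().reset_index()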

df_sds_grouped_std.rename(columns={'P1': 'PM10_std', 'P2': 'PM2p5_std'}, inplace=True)
df_sds_grouped.rename(columns={'P1': 'PM10', 'P2': 'PM2p5'}, inplace=True)

df_sds_merged = df_sds_grouped.merge(df_sds_grouped_std, how='left', on=['hour', 'date', 'lat', 'lon'])

df_sds_merged.head()
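
The mean, standard deviation and merge steps above could also be written as a single named aggregation. This is a sketch of an alternative formulation on toy data, not the approach used in this notebook; the frame `toy` and its values are made up, while the output column names follow the ones used above.

```python
import pandas as pd

# Toy data with the df_sds column layout (values are made up)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01"] * 4),
    "hour": [0, 0, 1, 1],
    "lat": [53.068] * 4,
    "lon": [8.818] * 4,
    "P1": [10.0, 14.0, 20.0, 20.0],
    "P2": [4.0, 6.0, 8.0, 10.0],
})

keys = ["hour", "date", "lat", "lon"]

# Named aggregation: one output column per (input column, function) pair
hourly = (
    toy.groupby(keys)
    .agg(
        PM10=("P1", "mean"),
        PM10_std=("P1", "std"),
        PM2p5=("P2", "mean"),
        PM2p5_std=("P2", "std"),
    )
    .reset_index()
)
print(hourly)
```

Because mean and standard deviation come from the same `groupby`, no separate merge is needed and the rows cannot get out of step.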

In [None]:
df_environment = pd.concat([df_bme280, df_bmp280, df_dht22], axis=0)
df_environment.head()

In [None]:
df_environment.query("hour==0 and date=='2020-01-01' and lat==50.042000 and lon==8.436000")['temperature'].std()

In [None]:


df_environment_grouped = df_environment.groupby(['hour', 'date', 'lat', 'lon']).mean().reset_index()
df_environment_grouped_std = df_environment.groupby(['hour', 'date', 'lat', 'lon']).std().reset_index()

df_environment_grouped_std.rename(columns={'pressure': 'pressure_std', 'temperature': 'temperature_std', 'humidity': 'humidity_std'}, inplace=True)
df_environment_merged = df_environment_grouped.merge(df_environment_grouped_std, how='left', on=['hour', 'date', 'lat', 'lon'])
df_environment_merged.head()

In [None]:
df_environment_merged.info()

In [None]:
# number of missing values per column
df_environment_merged.isna().sum()

In [None]:
df = df_sds_merged.merge(df_environment_merged, how='left', on=['hour', 'date', 'lat', 'lon'])
df
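
With a left merge, PM rows without matching environment data are kept and filled with NaN. To see how many rows are affected, the `indicator` and `validate` parameters of `merge` can help. The following is a minimal sketch with toy stand-ins for `df_sds_merged` and `df_environment_merged`; the frames `pm` and `env` and their values are made up.

```python
import pandas as pd

keys = ["hour", "date", "lat", "lon"]

# Toy stand-ins for df_sds_merged and df_environment_merged (values are made up)
pm = pd.DataFrame({
    "hour": [0, 1, 2],
    "date": pd.to_datetime(["2020-01-01"] * 3),
    "lat": [53.068] * 3,
    "lon": [8.818] * 3,
    "PM10": [12.0, 20.0, 15.0],
})
env = pd.DataFrame({
    "hour": [0, 1],
    "date": pd.to_datetime(["2020-01-01"] * 2),
    "lat": [53.068] * 2,
    "lon": [8.818] * 2,
    "temperature": [3.5, 3.1],
})

# indicator=True adds a '_merge' column; validate raises if the keys are not unique
merged = pm.merge(env, how="left", on=keys, indicator=True, validate="one_to_one")

# Rows marked 'left_only' have PM data but no environment data
n_unmatched = (merged["_merge"] == "left_only").sum()
print(merged)
print(f"PM rows without environment data: {n_unmatched}")
```

Note that the merge keys include the floating point columns `lat` and `lon`, so rows only match if the coordinates are exactly equal in both frames.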

In [None]:
df.info()