# Data Aggregation Across Data Sources

We have three data sources:

1. Our sensor data: indoor air quality and indoor environmental data.

2. SINAICA: outdoor air quality monitoring data from the government.

3. OpenWeatherData: outdoor environmental data.

We need to make this data available to the models we plan to train. The following sections detail that process.

In [1]:
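# Standard library
import os
import gzip
import json
import re
import asyncio
import warnings

# Third-party
import stan
import dplython
import nest_asyncio
#nest_asyncio.apply()
from matplotlib import pyplot as plt

# Silence DeprecationWarnings raised from here on.
warnings.filterwarnings("ignore", category=DeprecationWarning)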
from dplython import (DplyFrame, X, diamonds, select, sift,
  sample_n, sample_frac, head, arrange, mutate, group_by,
  summarize, DelayFunction, dfilter)
import seaborn as sns
from plotnine import *
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import (mean_squared_error, 
                             r2_score,
                             mean_absolute_error)
import pandas as pd
import numpy as np
from IPython.display import display, Markdown, update_display



In [2]:
df = pd.read_pickle("data/sinaica/sinaica-imputated.pickle.gz")
df

Unnamed: 0,CO,NO,NO2,NOx,O3,PM10,PM2.5,SO2,month,day,hour,datetime,minute,temperature,pressure,humidity,gasResistance,IAQ
987,2.200000,0.205000,0.031000,0.207000,0.002000,45.000000,22.000000,0.004000,2,12,6,2021-02-12 06:05:35.846304417,35.0,21.51,777.41,44.04,152149.0,34.7
988,2.200000,0.205000,0.031000,0.207000,0.002000,45.000000,22.000000,0.004000,2,12,6,2021-02-12 06:05:38.837326527,34.0,21.51,777.41,43.98,152841.0,33.6
989,2.200000,0.205000,0.031000,0.207000,0.002000,45.000000,22.000000,0.004000,2,12,6,2021-02-12 06:05:47.812360048,32.0,21.54,777.41,43.73,153259.0,31.5
990,2.200000,0.205000,0.031000,0.207000,0.002000,45.000000,22.000000,0.004000,2,12,6,2021-02-12 06:05:50.803695202,32.0,21.53,777.41,43.70,152841.0,31.5
991,2.200000,0.205000,0.031000,0.207000,0.002000,45.000000,22.000000,0.004000,2,12,6,2021-02-12 06:05:53.795462847,30.0,21.52,777.41,43.70,153399.0,30.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1582081,0.765217,0.009174,0.027826,0.039043,0.015304,20.304348,16.391304,0.001304,9,18,0,2021-09-18 00:59:47.142104626,138.0,26.00,782.92,56.34,916837.0,138.2
1582082,0.765217,0.009174,0.027826,0.039043,0.015304,20.304348,16.391304,0.001304,9,18,0,2021-09-18 00:59:50.136709690,138.0,26.00,782.92,56.33,917462.0,137.7
1582083,0.765217,0.009174,0.027826,0.039043,0.015304,20.304348,16.391304,0.001304,9,18,0,2021-09-18 00:59:53.131285429,138.0,26.00,782.90,56.34,916837.0,137.6
1582084,0.765217,0.009174,0.027826,0.039043,0.015304,20.304348,16.391304,0.001304,9,18,0,2021-09-18 00:59:56.125959396,136.0,26.00,782.92,56.35,921233.0,136.0


In [47]:
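weather = pd.read_pickle("data/openweathermap/weather.pickle.gz")

# Split the timestamp into calendar parts to use as merge keys.
weather["year"] = [dt.year for dt in weather["dt"]]
weather["month"] = [dt.month for dt in weather["dt"]]
weather["day"] = [dt.day for dt in weather["dt"]]
weather["hour"] = [dt.hour for dt in weather["dt"]]
# Rename so that the merge suffix below yields `temperature_outdoor`.
weather.rename(columns={'temp': 'temperature'},
               inplace=True)
# Attach the hourly outdoor readings to every sensor record in the same
# month, day and hour. `year` is not a key because `df` has no such column.
# From here on, `weather` holds the merged frame, not only the weather data.
weather = pd.merge(df, weather, on=['month', 'day', 'hour'],
                  suffixes=('', '_outdoor'))
weather.drop('dt', axis=1, inplace=True)
weather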

Unnamed: 0,CO,NO,NO2,NOx,O3,PM10,PM2.5,SO2,month,day,...,pressure_outdoor,humidity_outdoor,wind_speed,wind_deg,rain_1h,rain_3h,clouds_all,weather_id,weather_main,year
0,2.500000,0.244000,0.035000,0.205000,0.002000,57.000000,25.000000,0.005000,2,12,...,1020,44,0.00,0,0.0,0.0,1,800,Clear,2021
1,2.500000,0.244000,0.035000,0.205000,0.002000,57.000000,25.000000,0.005000,2,12,...,1020,44,0.00,0,0.0,0.0,1,800,Clear,2021
2,2.500000,0.244000,0.035000,0.205000,0.002000,57.000000,25.000000,0.005000,2,12,...,1020,44,0.00,0,0.0,0.0,1,800,Clear,2021
3,2.500000,0.244000,0.035000,0.205000,0.002000,57.000000,25.000000,0.005000,2,12,...,1020,44,0.00,0,0.0,0.0,1,800,Clear,2021
4,2.500000,0.244000,0.035000,0.205000,0.002000,57.000000,25.000000,0.005000,2,12,...,1020,44,0.00,0,0.0,0.0,1,800,Clear,2021
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1605252,0.765217,0.009174,0.027826,0.039043,0.015304,20.304348,16.391304,0.001304,9,18,...,1015,93,1.37,199,1.0,0.0,0,500,Rain,2021
1605253,0.765217,0.009174,0.027826,0.039043,0.015304,20.304348,16.391304,0.001304,9,18,...,1015,93,1.37,199,1.0,0.0,0,500,Rain,2021
1605254,0.765217,0.009174,0.027826,0.039043,0.015304,20.304348,16.391304,0.001304,9,18,...,1015,93,1.37,199,1.0,0.0,0,500,Rain,2021
1605255,0.765217,0.009174,0.027826,0.039043,0.015304,20.304348,16.391304,0.001304,9,18,...,1015,93,1.37,199,1.0,0.0,0,500,Rain,2021
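
The merged frame's index runs to 1,605,255, while the index of `df` shown earlier ends at 1,582,084. This suggests, though the output alone does not prove it, that some `(month, day, hour)` keys match more than one weather row. The sketch below shows one way such duplicates could be detected and guarded against. The frames `sensor` and `hourly` are hypothetical miniature stand-ins, not the notebook's data, and keeping the first duplicate is an assumption rather than the authors' choice.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the sensor and hourly weather frames.
sensor = pd.DataFrame({
    "month": [2, 2, 2],
    "day": [12, 12, 12],
    "hour": [6, 6, 7],
    "IAQ": [34.7, 33.6, 31.5],
})
hourly = pd.DataFrame({
    "month": [2, 2, 2],
    "day": [12, 12, 12],
    "hour": [6, 6, 7],            # hour 6 appears twice on purpose
    "wind_speed": [0.0, 0.0, 1.37],
})

keys = ["month", "day", "hour"]

# Count hourly rows that share a merge key with at least one other row.
n_dup = int(hourly.duplicated(subset=keys, keep=False).sum())

# Assumption: keep the first row per key; averaging would be another option.
hourly_unique = hourly.drop_duplicates(subset=keys, keep="first")

# validate="many_to_one" makes pandas raise a MergeError if the right-hand
# keys are not unique, so the merge cannot silently multiply sensor rows.
merged = pd.merge(sensor, hourly_unique, on=keys,
                  suffixes=("", "_outdoor"), validate="many_to_one")

print(f"{n_dup} duplicated hourly rows; merged has {len(merged)} rows")
```

With unique keys on the hourly side, the merged frame has at most as many rows as the sensor frame.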


In [48]:
weather.dtypes

CO                            float64
NO                            float64
NO2                           float64
NOx                           float64
O3                            float64
PM10                          float64
PM2.5                         float64
SO2                           float64
month                           int64
day                             int64
hour                            int64
datetime               datetime64[ns]
minute                        float64
temperature                   float64
pressure                      float64
humidity                      float64
gasResistance                 float64
IAQ                           float64
temperature_outdoor           float64
feels_like                    float64
temp_min                      float64
temp_max                      float64
pressure_outdoor                int64
humidity_outdoor                int64
wind_speed                    float64
wind_deg                        int64
rain_1h     

## Interpolation of Hourly Data

The merged dataframe repeats the same values in the hourly columns (SINAICA government air quality monitoring and OpenWeatherData), because each hourly reading is copied onto every sensor record within that hour.

We think the repeated data can be an issue because the values jump abruptly between consecutive records, for example between a record at 10:57 and one at 11:00.

We propose an approach similar to the one used for the imputations: interpolating between the hourly values and adding noise, which could help avoid overfitting when training our machine learning and deep learning models.
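
A minimal sketch of what interpolation with noise could look like is given below. The notebook does not show the imputation procedure it refers to, so everything here is an assumption: the helper `interpolate_hourly_with_noise`, its `noise_scale` parameter, the use of time-weighted linear interpolation, and Gaussian noise scaled by the column's standard deviation are illustrative choices only.

```python
import numpy as np
import pandas as pd

def interpolate_hourly_with_noise(frame, column, time_col="datetime",
                                  noise_scale=0.05, seed=0):
    """Hypothetical helper: replace runs of repeated values in `column`
    with a time-weighted linear interpolation plus Gaussian noise."""
    out = frame.sort_values(time_col).set_index(time_col)
    values = out[column]

    # The first record of each run of repeated values is kept as an anchor.
    is_anchor = values.ne(values.shift())
    anchors = values.where(is_anchor)

    # Interpolate between anchors; records after the last anchor keep its value.
    smooth = anchors.interpolate(method="time").ffill()

    # Assumption: noise standard deviation is a fraction of the column's spread.
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_scale * values.std(), size=len(out))
    noise[is_anchor.to_numpy()] = 0.0   # leave the anchor values untouched

    out[column + "_interp"] = smooth + noise
    return out.reset_index()

# Toy data: one value per hour, repeated on records 20 minutes apart.
times = pd.date_range("2021-02-12 06:00", periods=9,
                      freq=pd.Timedelta(minutes=20))
demo = pd.DataFrame({"datetime": times,
                     "PM10": [45.0] * 3 + [57.0] * 3 + [51.0] * 3})

no_noise = interpolate_hourly_with_noise(demo, "PM10", noise_scale=0.0)
noisy = interpolate_hourly_with_noise(demo, "PM10", noise_scale=0.05)
print(noisy[["datetime", "PM10", "PM10_interp"]])
```

One limitation of this sketch: two consecutive hours with identical readings are treated as a single run, so the value stays flat across both hours.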

## Resampling

To reduce training time, we propose resampling the data to a coarser time resolution.

The following subsections create the resampled dataframes.
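
As a rough illustration, resampling to 1-, 2- and 3-minute bins could be done with pandas `resample`. The helper name `resample_numeric` and the choice of the mean as the aggregate are assumptions, and the toy frame stands in for the merged data. Non-numeric columns such as `weather_main` are dropped in this sketch, and averaged calendar columns like `month` or `hour` would need to be recomputed from the resampled timestamps.

```python
import pandas as pd

def resample_numeric(frame, rule, time_col="datetime"):
    """Hypothetical helper: average all numeric columns over bins of
    width `rule` (for example "1min"), dropping bins with no records."""
    numeric = frame.set_index(time_col).select_dtypes(include="number")
    return numeric.resample(rule).mean().dropna(how="all").reset_index()

# Toy data: one record every 3 seconds for 6 minutes.
times = pd.date_range("2021-02-12 06:00:00", periods=120,
                      freq=pd.Timedelta(seconds=3))
demo = pd.DataFrame({"datetime": times,
                     "IAQ": range(120),
                     "temperature": 21.5})

# One resampled dataframe per bin width.
resampled = {rule: resample_numeric(demo, rule)
             for rule in ("1min", "2min", "3min")}

for rule, frame in resampled.items():
    print(rule, len(frame), "rows")
```

Using the median instead of the mean would be less sensitive to outliers in the sensor readings; which aggregate suits the models is left open here.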

### 1 Minute Resampling

### 2 Minute Resampling

### 3 Minute Resampling
