# Creating Datasets: AirThings
Documentation of the source code used to generate the AirThings datasets.

In [47]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

import pandas as pd
import numpy as np

import sys
sys.path.append("../")
from src.data import make_purpleair_dataset, make_airthings_dataset

#import warnings
#warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2



---

# Table of Contents
1. [Creating a `Process` Object](#instantiating)
2. [Generating the Dataset](#generating_dataset)

---

<a id="instantiating"></a>

# Creating a `Process` Object
A `Process` object that pulls AirThings data is created as follows:

In [3]:
at_processor = make_airthings_dataset.Process(start_date="20220501",end_date="20220515",ip_filename=None)

where the three possible input parameters are:
* `start_date`: string in the format %Y%m%d (or yyyymmdd) specifying the first date of data to include
* `end_date`: string in the format %Y%m%d (or yyyymmdd) specifying the last date of data to include
* `ip_filename`: string specifying the name of the file in the /references/meta_data/ directory that contains the device IP addresses. Passing `None` selects the default file, "airthings_meta.csv".
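The date strings can be checked before the object is constructed. The sketch below uses only the standard library to parse the `%Y%m%d` format and confirm that the range is ordered. `parse_date_range` is a hypothetical helper written for illustration; it is not part of `make_airthings_dataset`.

```python
from datetime import datetime

def parse_date_range(start_date, end_date, fmt="%Y%m%d"):
    """Parse two yyyymmdd strings and check that start <= end (hypothetical helper)."""
    start = datetime.strptime(start_date, fmt)  # raises ValueError on a malformed string
    end = datetime.strptime(end_date, fmt)
    if start > end:
        raise ValueError(f"start_date {start_date} is after end_date {end_date}")
    return start, end

start, end = parse_date_range("20220501", "20220515")
n_days = (end - start).days + 1  # both the first and last date are included
print(start.date(), end.date(), n_days)
```

Validating up front gives an early, readable error for a string such as "2022-05-01", which does not match the expected format.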

---

<a id="generating_dataset"></a>

# Generating the Dataset
The `make_dataset()` method creates the aggregated dataset by running the methods described below in sequence.

## `download`
The `download` method takes a single IP address and downloads all available data from that device.

In [11]:
at_processor.download("at1730")

The downloaded data are stored in the /data/interim/DATA directory:

In [12]:
temp = pd.read_csv("../data/interim/DATA/2930041730-2022-05-10.csv",
            index_col=0,parse_dates=True,infer_datetime_format=True)
temp.head()

| timestamp | rh | radon_acute | radon_chronic | temperature | pressure | co2 | voc |
|---|---|---|---|---|---|---|---|
| 2022-05-10 00:14:27 | 52.5 | 9.0 | 9.0 | 23.48 | 994.16 | 547.0 | 61.0 |
| 2022-05-10 00:29:14 | 52.5 | 9.0 | 9.0 | 23.48 | 994.12 | 541.0 | 48.0 |
| 2022-05-10 00:44:14 | 53.0 | 9.0 | 9.0 | 23.48 | 994.22 | 537.0 | 48.0 |
| 2022-05-10 00:59:16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2022-05-10 01:14:14 | 53.0 | 7.0 | 7.0 | 23.49 | 994.38 | 579.0 | 52.0 |
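To read every file for one device rather than a single day, the files can be collected with a glob pattern. This is a minimal sketch that assumes the naming pattern `<serial>-<yyyy-mm-dd>.csv`, inferred from the single filename shown above; `load_device_files` is a hypothetical helper, not a method of `Process`.

```python
from pathlib import Path

import pandas as pd

def load_device_files(data_dir, serial):
    """Concatenate all CSVs for one device serial, sorted by timestamp.

    Assumes files are named "<serial>-<date>.csv" (an inference from one example).
    Returns an empty DataFrame when no file matches.
    """
    paths = sorted(Path(data_dir).glob(f"{serial}-*.csv"))
    frames = [pd.read_csv(p, index_col=0, parse_dates=True) for p in paths]
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, axis=0).sort_index()
```

For the device above, the call would look like `load_device_files("../data/interim/DATA", "2930041730")`.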


## `perform_quality_checks`
This method cleans the data by removing outlying values, which are identified by their z-score.
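As a rough illustration of z-score filtering, the sketch below replaces values whose absolute z-score exceeds a threshold with NaN. The threshold of 2.5, the per-column scope, and the function name `mask_by_zscore` are assumptions made for this example; the actual rules inside `perform_quality_checks` may differ.

```python
import numpy as np
import pandas as pd

def mask_by_zscore(df, columns, threshold=2.5):
    """Return a copy of df with values where |z| > threshold set to NaN.

    The threshold and the column-wide mean/std are illustrative assumptions.
    """
    out = df.copy()
    for col in columns:
        # z-score: distance from the column mean in units of standard deviation
        z = (out[col] - out[col].mean()) / out[col].std()
        out.loc[z.abs() > threshold, col] = np.nan
    return out

# Nine plausible CO2 readings and one obvious spike
demo = pd.DataFrame(
    {"co2": [540.0, 545.0, 550.0, 548.0, 542.0, 547.0, 551.0, 544.0, 546.0, 5000.0]}
)
cleaned = mask_by_zscore(demo, ["co2"])
print(cleaned["co2"].isna().sum())
```

Masking with NaN instead of dropping rows keeps the remaining measurements at that timestamp, which matches the use of `dropna(subset=...)` when counting observations later in this notebook.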

### Getting dummy data

In [34]:
# data from one device
at_processor.download("at1730")
temp1 = pd.read_csv("../data/interim/DATA/2930041730-2022-05-10.csv",
            index_col=0,parse_dates=True,infer_datetime_format=True)
temp1["device"] = "1730"
# data from another device
at_processor.download("at2168")
temp2 = pd.read_csv("../data/interim/DATA/2930042168-2022-05-10.csv",
            index_col=0,parse_dates=True,infer_datetime_format=True)
temp2["device"] = "2168"
raw_temp = pd.concat([temp1,temp2],axis=0)

### Filtering

In [39]:
qc_temp = at_processor.perform_quality_checks(raw_temp)

In [40]:
print("Observations without QC:\t",len(raw_temp.dropna(subset=["rh","temperature","co2","voc"])))
print("Observations with QC:\t",len(qc_temp.dropna(subset=["rh","temperature","co2","voc"])))

Observations without QC:	 1526
Observations with QC:	 1480


## `make_dataset`
This method combines the two previous methods: it downloads data from each device, combines the per-device datasets, and performs quality checks.
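The overall flow can be sketched as a small function that takes the download and quality-check steps as arguments. This mirrors the manual steps in the previous section; `build_dataset`, `fake_download`, and `fake_qc` are hypothetical stand-ins and do not reproduce the internals of `make_dataset`.

```python
import pandas as pd

def build_dataset(device_ids, download_fn, qc_fn):
    """Download each device, label and stack the frames, then run quality checks."""
    frames = []
    for device_id in device_ids:
        df = download_fn(device_id).copy()
        df["device"] = device_id  # keep track of which device each row came from
        frames.append(df)
    combined = pd.concat(frames, axis=0)
    return qc_fn(combined)

# Stand-ins so the sketch runs without network access (assumptions, not the real methods)
def fake_download(device_id):
    idx = pd.to_datetime(["2022-05-10 00:14:27", "2022-05-10 00:29:14"])
    return pd.DataFrame({"co2": [547.0, 541.0]}, index=idx)

def fake_qc(df):
    return df.dropna(subset=["co2"])

result = build_dataset(["1730", "2168"], fake_download, fake_qc)
print(result)
```

Passing the steps in as functions keeps the combining logic testable without contacting any device.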

_Note: `make_dataset` takes some time because it downloads data from each device._

In [48]:
at_processor.make_dataset()

Afterwards, the object has a new attribute, `processed`, which contains the processed AirThings data:

In [49]:
at_processor.processed.describe()

| | rh | radon_acute | radon_chronic | temperature | pressure | co2 | voc |
|---|---|---|---|---|---|---|---|
| count | 10838.0 | 10861.0 | 10861.0 | 10838.0 | 10861.0 | 10838.0 | 10842.0 |
| mean | 54.937858 | 6.500691 | 6.615229 | 22.790766 | 996.908868 | 550.040137 | 11535.195628 |
| std | 1.786493 | 6.631426 | 6.187427 | 0.410989 | 14.682295 | 47.434908 | 24868.884219 |
| min | 48.5 | 0.0 | 0.0 | 22.0 | 989.92 | 455.0 | 0.0 |
| 25% | 53.5 | 1.0 | 4.0 | 22.48 | 994.42 | 517.0 | 61.0 |
| 50% | 55.0 | 5.0 | 5.0 | 22.73 | 995.8 | 540.0 | 78.0 |
| 75% | 56.5 | 10.0 | 6.0 | 22.93 | 998.52 | 560.0 | 144.0 |
| max | 60.0 | 32.0 | 30.0 | 24.01 | 1310.7 | 750.0 | 65535.0 |


---