# Creating Datasets: PurpleAir
Documentation of the source code used to generate the PurpleAir datasets.

In [42]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

import pandas as pd
import numpy as np

import sys
sys.path.append("../")
from src.data import make_purpleair_dataset, make_airthings_dataset, make_dataset

#import warnings
#warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2



---

# Table of Contents
1. [Creating a `Process` Object](#instantiating)
2. [Generating the Dataset](#generating_dataset)

---

<a id="instantiating"></a>

# Creating a `Process` Object
A `Process` object that pulls PurpleAir data is created with the line below:

In [43]:
pa_processor = make_dataset.PurpleAir(start_date="20220501", end_date="20220517")

where the three possible input parameters are:
* `start_date`: string in the format `%Y%m%d` (yyyymmdd) specifying the first date of data to include
* `end_date`: string in the format `%Y%m%d` (yyyymmdd) specifying the last date of data to include
* `ip_filename`: string specifying the name of the file in the /references/meta_data/ directory that contains the IP addresses. `None` selects the file "airthings_meta.csv", which should be the default.
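
Because both dates are passed as plain strings, it can help to check them before creating the object. The snippet below is a minimal sketch that uses only the standard library; `parse_yyyymmdd` is a hypothetical helper and is not part of `make_dataset`.

```python
from datetime import datetime


def parse_yyyymmdd(date_string):
    """Parse a yyyymmdd string; raises ValueError if the string does not match."""
    return datetime.strptime(date_string, "%Y%m%d").date()


start = parse_yyyymmdd("20220501")
end = parse_yyyymmdd("20220517")

# Both endpoints are included, so add one to the difference in days
n_days = (end - start).days + 1
print(start, end, n_days)

# A string in another format, such as "2022-05-01", is rejected
try:
    parse_yyyymmdd("2022-05-01")
    rejected = False
except ValueError:
    rejected = True
```

Checking the strings up front surfaces a formatting mistake before any download is attempted.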

---

<a id="generating_dataset"></a>

# Generating the Dataset
The `make_dataset()` method creates the aggregated dataset by calling the other processing methods in turn.

## `download`
The `download` method is called without arguments. It requests the available data from each device and saves one file per device.

In [44]:
pa_processor.download()

Accessing 20 urls with async...Done
Download complete...
Processing PA_2C79...
Processing PA_FBAC...
Processing PA_05B7...
Processing PA_FC10...
Processing PA_36A4...
Processing PA_1D90...
Processing PA_FA97...
Processing PA_1A0A...
Processing PA_667E...
Processing PA_FD40...
Data saved to /Users/hagenfritz/Projects/bleed-orange-measure-iaq/data/interim/purpleair/


Data are stored in the `/data/interim/purpleair/` directory.

In [49]:
# Read one interim file, using the first column (created_at) as a datetime index
temp = pd.read_csv(
    "/Users/hagenfritz/Projects/bleed-orange-measure-iaq/data/interim/purpleair/PA_FBAC (inside) (30.288195 -97.735269) Primary Real Time 05_01_2022 05_15_2022.csv",
    index_col=0,
    parse_dates=True,
)
temp.head()

created_at,entry_id,PM1.0_CF1_ug/m3,PM2.5_CF1_ug/m3,PM10.0_CF1_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_ATM_ug/m3
2022-05-01 00:02:28+00:00,109261,4.76,7.25,7.27,11792.0,-61.0,80.0,37.0,7.25
2022-05-01 00:04:28+00:00,109262,4.64,6.55,7.04,11794.0,-62.0,80.0,37.0,6.55
2022-05-01 00:06:28+00:00,109263,4.15,5.85,5.94,11796.0,-61.0,80.0,37.0,5.85
2022-05-01 00:08:32+00:00,109264,4.24,5.88,6.24,11798.0,-62.0,80.0,37.0,5.88
2022-05-01 00:10:26+00:00,109265,4.78,6.07,6.07,11800.0,-61.0,80.0,37.0,6.07
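
The interim filename above appears to encode the device ID, placement, coordinates, feed type, and date range. The sketch below parses those fields with a regular expression. The pattern is an assumption inferred from the single filename shown here, so other exports may not match it; `parse_interim_filename` and the field names are hypothetical and not part of the source code.

```python
import re
from datetime import datetime

# Pattern inferred from one example filename; treat it as an assumption.
FILENAME_PATTERN = re.compile(
    r"^(?P<device>PA_\w+) "
    r"\((?P<placement>[^)]*)\) "
    r"\((?P<lat>-?\d+\.\d+) (?P<lon>-?\d+\.\d+)\) "
    r"(?P<feed>.+) "
    r"(?P<start>\d{2}_\d{2}_\d{4}) (?P<end>\d{2}_\d{2}_\d{4})\.csv$"
)


def parse_interim_filename(filename):
    """Return a dict of fields from the filename, or None if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    fields["lat"] = float(fields["lat"])
    fields["lon"] = float(fields["lon"])
    # Dates in the filename are written month_day_year
    for key in ("start", "end"):
        fields[key] = datetime.strptime(fields[key], "%m_%d_%Y").date()
    return fields


info = parse_interim_filename(
    "PA_FBAC (inside) (30.288195 -97.735269) Primary Real Time 05_01_2022 05_15_2022.csv"
)
print(info)
```

Returning `None` for a non-matching name lets a caller skip unrelated files in the directory instead of failing.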


## `make_dataset`
This method calls the preceding methods to download the data, combine the per-device datasets, and perform quality checks.

_This method takes some time because it downloads data from each device._
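
As a rough illustration of the combine and quality-check steps, the sketch below renames raw PurpleAir columns, stacks per-device frames, and drops implausible rows. It is not the implementation in `make_dataset`: the column mapping is an assumption inferred by comparing the raw and processed headers in this notebook, the filter is only an example of a quality check, and `combine_devices`, the `device` column, and `PA_DEMO` with its values are hypothetical.

```python
import pandas as pd

# Assumed mapping from raw to processed column names (inferred, not confirmed)
COLUMN_MAP = {
    "PM1.0_CF1_ug/m3": "pm1_mass-microgram_per_m3",
    "PM2.5_CF1_ug/m3": "pm2p5_mass-microgram_per_m3",
    "PM10.0_CF1_ug/m3": "pm10_mass-microgram_per_m3",
    "Temperature_F": "temperature-f",
    "Humidity_%": "rh-percent",
}
RAW_COLUMNS = list(COLUMN_MAP)


def combine_devices(frames):
    """Stack per-device frames (dict of device ID -> DataFrame) into one frame."""
    tidy = [
        df[RAW_COLUMNS].rename(columns=COLUMN_MAP).assign(device=device_id)
        for device_id, df in frames.items()
    ]
    combined = pd.concat(tidy).sort_index()
    # Example quality check: drop rows with missing or negative PM values
    pm_cols = [c for c in combined.columns if c.startswith("pm")]
    combined = combined.dropna(subset=pm_cols)
    return combined[(combined[pm_cols] >= 0).all(axis=1)]


# First two rows of the PA_FBAC file shown earlier
raw_fbac = pd.DataFrame(
    [[4.76, 7.25, 7.27, 80.0, 37.0], [4.64, 6.55, 7.04, 80.0, 37.0]],
    columns=RAW_COLUMNS,
    index=pd.to_datetime(["2022-05-01 00:02:28+00:00", "2022-05-01 00:04:28+00:00"]),
)
# Synthetic second device with one negative reading to exercise the filter
raw_demo = pd.DataFrame(
    [[3.0, 4.0, 4.5, 79.0, 40.0], [-1.0, 4.0, 4.5, 79.0, 40.0]],
    columns=RAW_COLUMNS,
    index=pd.to_datetime(["2022-05-01 00:03:00+00:00", "2022-05-01 00:05:00+00:00"]),
)

combined = combine_devices({"PA_FBAC": raw_fbac, "PA_DEMO": raw_demo})
print(combined)
```

Keeping a `device` column in the combined frame makes it possible to trace each row back to its sensor after the frames are stacked.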

In [53]:
pa_processor.make_dataset()

Afterwards, the object has a new attribute, `processed`, which contains the processed PurpleAir data:

In [54]:
pa_processor.processed.describe()

statistic,pm1_mass-microgram_per_m3,pm2p5_mass-microgram_per_m3,pm10_mass-microgram_per_m3,temperature-f,rh-percent
count,23463.0,23463.0,23463.0,23463.0,23463.0
mean,4.77703,6.556393,6.786053,80.33491,37.678899
std,3.893329,5.391626,5.56341,0.88274,1.539811
min,0.06,0.36,0.36,74.0,30.0
25%,2.02,2.78,2.91,80.0,37.0
50%,3.52,4.8,4.97,80.0,38.0
75%,6.57,8.875,9.14,81.0,39.0
max,25.79,37.83,39.22,86.0,46.0


---

# `run`
The source code can be run and the processed data saved through the `run()` method.

In [57]:
pa_processor.run()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value

---