# PurpleRain Tutorial

## 1. Import PurpleRain

Import the PurpleRain library. Your .ipynb or .py file should be in the same directory as PurpleRain.py.
Many common libraries are already imported with PurpleRain. Because so many libraries are used, Python may raise a
"ModuleNotFoundError" if one of them is not installed. If that happens, consult https://docs.anaconda.com/anaconda/user-guide/tasks/install-packages/ if you are using the Anaconda platform (includes Jupyter and Spyder) or https://docs.python.org/3/installing/index.html if you are using a different environment.

In [None]:
from PurpleRain import *
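
If the import fails, it can be quicker to see every missing dependency at once rather than fixing one error at a time. The sketch below uses the standard library's `importlib.util.find_spec` to test whether each module can be found. The helper name `missing_modules` and the module names listed are illustrative assumptions, not PurpleRain's actual dependency list, so substitute whatever your traceback names.

```python
from importlib.util import find_spec

def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]

# Hypothetical list for illustration; replace with the modules named in your error.
candidates = ['numpy', 'pandas', 'h5py', 'selenium']
print('Missing:', missing_modules(candidates))
```

Any names printed here can then be installed by following the links above.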

## 2. Use the pa_query function to access the Purple Air Database

### 2a. Variable Assignment:

Some of the data structures used in the pa_query function are reused in other PurpleRain functions. To keep things simple, first assign each one to a concise yet informative variable, then pass the variables to the function. The function requires:

* sensor_list: A list of strings containing the ***exact*** names of the sensors as they appear on the PA server. There is no need to include both the A and B sensors.

* driver_path: A string with the path to the chromedriver. **Be sure to enter the path to the chromedriver; if you don't replace 'your_path_here', the code won't run.**

* start_date:  A date string of the form - 'MM/DD/YYYY'

* end_date:    A date string of the form - 'MM/DD/YYYY' - must occur at *least* one day after the start date.

* tz_str:      A string containing the timezone information. Visit https://en.wikipedia.org/wiki/List_of_tz_database_time_zones and look at the column titled 'TZ Database Name'

In [None]:
sensor_list = ['UT Sensor 1','UT Sensor 4']
driver_path = 'your_path_here'
start_date  = '06/20/2019'
end_date    = '08/18/2019'
tz_str      = 'Asia/Calcutta'
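
Because the date strings must follow 'MM/DD/YYYY' and the end date must fall at least one day after the start date, it can be worth checking them before launching the webdriver. The following is a minimal sketch using only the standard library; `check_date_range` is a hypothetical helper written for this tutorial, not part of PurpleRain.

```python
from datetime import datetime, timedelta

def check_date_range(start_date, end_date):
    """Parse two 'MM/DD/YYYY' strings and confirm end is at least one day after start."""
    fmt = '%m/%d/%Y'
    start = datetime.strptime(start_date, fmt)  # raises ValueError on a malformed string
    end = datetime.strptime(end_date, fmt)
    if end - start < timedelta(days=1):
        raise ValueError('end_date must be at least one day after start_date')
    return start, end

start, end = check_date_range('06/20/2019', '08/18/2019')
print((end - start).days, 'days requested')
```

A malformed string or a reversed range raises a ValueError immediately instead of failing partway through a download.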

### 2b. [Optional] The sensors_from_csv function

Sometimes you may want to download a long list of sensors. For documentation purposes, it may be more practical to read the sensor list from a csv file. The csv file must have the sensor list in the first column, with a header. Check the sample csv file for formatting best practices. **Be sure to enter the path to the csv file; if you don't replace 'insert_your_path_here', the code won't run.**

In [None]:
sensor_list = sensors_from_csv('insert_your_path_here'+'sample_sensor_list.csv')
print(sensor_list)
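
To make the expected file layout concrete, here is a rough sketch of reading the first column of a csv file that has a header row, using only the standard library. It is an illustration of the format, not the actual implementation of sensors_from_csv; the helper name `first_column` and the sample contents are made up.

```python
import csv
import io

def first_column(csv_text):
    """Return the first-column values of CSV text, skipping the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # discard the header row
    # Keep names exactly as written, but skip blank rows.
    return [row[0] for row in reader if row and row[0].strip()]

# Made-up file contents for illustration.
sample = 'Sensor Name\nUT Sensor 1\nUT Sensor 4\n'
print(first_column(sample))
```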

### 2c. pa_query implementation

Finally, execute the function. Note the 0 passed before the previously assigned variables: it represents the query type. Currently only one query type is stable: type 0, the "Exact" name query, which is also the fastest and most efficient. Future releases will support two additional query types: Lat-Lon bounds and inexact sensor name queries. Both are **much slower** than type 0.

The function requires an internet connection and properly assigned values for all the variables above. Please make sure you have followed all guidelines and checked the documentation before reporting errors as bugs.

In [None]:
pa_query(0, driver_path, sensor_list, start_date, end_date, tz_str)

Allow the function to run. Don't close the webdriver window. All files will appear in your Downloads directory by default. Consult the documentation for more information on the pa_query function.

## 3. Organize files into a Hierarchical Database

### 3a. Variable Assignment

The only additional variables needed are the path to the downloads directory and the name of our local PA database. I have chosen the name "Tutorial" for the purposes of this file, but it can be any string. Note that all .h5 files within the same directory ***must*** have different names to avoid overwriting data. If you reuse a name, an OSError will be raised. The downloads_path is the directory where the PA csv files were downloaded in step 2c. If you have moved them from the default Downloads directory, use their new directory instead.

In [None]:
h5filename     = 'Tutorial'
downloads_path = 'insert_downloads_path_here'

### 3b. The downloaded_file_list function

Before building the database, run the downloaded_file_list function to build the list of files. The function returns the filenames needed by the next function, so be sure to assign its output to a variable.

In [None]:
names = downloaded_file_list(downloads_path, sensor_list)

### 3c. The build_hdf function

The build_hdf function will organize the csv files into one central file and calculate 1-hr mean, median, standard deviation, and max values for each timestep for each parameter measured by the PAs. The .h5 file will be saved in the same directory as this program, or wherever you are running PurpleRain. build_hdf returns the "keys", or levels, of the hdf file. They can be helpful for navigating the file, but it is not necessary to assign the output to a variable. Additionally, you can visualize the contents of the file with the NASA app Panoply: https://www.giss.nasa.gov/tools/panoply/download/.

Note that the HDF saves datetime in Julian date float format. Look here for more information: https://en.wikipedia.org/wiki/Julian_day.

In [None]:
keys = build_hdf(names, h5filename, tz_str)
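
If you need ordinary timestamps back from the stored Julian date floats, the conversion is a fixed offset in days. The sketch below assumes the file uses the standard astronomical Julian date described in the link above, in which 1970-01-01 00:00 corresponds to 2440587.5; whether the stored values are in UTC or local time is not stated here, so the result is a naive datetime on whatever time scale the file uses. `julian_to_datetime` is a hypothetical helper, not a PurpleRain function.

```python
from datetime import datetime, timedelta

UNIX_EPOCH_JD = 2440587.5  # Julian date of 1970-01-01 00:00

def julian_to_datetime(jd):
    """Convert a Julian date float to a naive datetime on the same time scale."""
    return datetime(1970, 1, 1) + timedelta(days=jd - UNIX_EPOCH_JD)

print(julian_to_datetime(2451545.0))  # 2000-01-01 12:00:00
```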

### 3d. Convert hdf file to mat file

You may be more comfortable with MATLAB. Although MATLAB has HDF support, it can be difficult to use and often rejects Python-generated HDF files. Simply enter the name of the HDF file, and the function will generate a mat file of the HDF in the same folder. If you have moved the HDF file out of this directory, put the directory path in front of the filename string. Contact me if you are interested in MATLAB visualizations.

In [None]:
hdf5_to_mat(h5filename + '.h5')

### 3e. Navigating the hdf file

To grab data arrays from the hdf file, you can use the syntax from the h5py library outlined here: http://docs.h5py.org/en/stable/high/dataset.html#reading-writing-data. Alternatively, you can use this module's h5file_query function. The function requires the h5 file name and a hierarchical query string. Sample usage is shown below.

By default Panoply replaces any space with an underscore, so the sensor named UT Sensor 1 will appear as UT_Sensor_1 in Panoply. In a query string, however, use the sensor name with its original spaces, followed by one additional space at the end.

In [None]:
sensor='UT Sensor 1'

A_PM25_1hr = h5file_query(h5filename+'.h5', sensor + ' /A/PM_CF/PM25/Subsampled/PM25_CF_Mean')
# Assumes the B channel mirrors the A channel's path, with /B/ in place of /A/
B_PM25_1hr = h5file_query(h5filename+'.h5', sensor + ' /B/PM_CF/PM25/Subsampled/PM25_CF_Mean')

Data = pd.DataFrame([A_PM25_1hr, B_PM25_1hr])
Data.head()

### 3f. [Optional] Import and Format Calibration Data

Calibration data can come from many sources, so there is no uniform format. To use a calibration source in this environment, convert it to a time-indexed pandas dataframe with a single column representing the 1-hour averaged PM2.5 concentration. Consult https://pandas.pydata.org/pandas-docs/stable/index.html for more information. Below is an example using data from a MetOne BAM-1022. **Be aware of your calibration data quality.** Many sensors will flag bad data but may not delete it or replace it with NaN by default.

Also keep in mind that the example below is not necessarily the most efficient or most robust way of building the necessary dataframe. Try experimenting with different arrangements to improve your Pandas skills.

In [None]:
BAM_df = pd.read_csv('6_13_2019_to_8_19_2019_UT_BAM.csv')
BAM_df = BAM_df.replace(99999, np.nan)  # replace the 99999 placeholder values with NaN
# The BAM-1022 defines the measurement hour differently than pandas, so shift back one hour
Time = pd.to_datetime(BAM_df['Time']) - pd.Timedelta(hours=1)
calibration_df = pd.DataFrame(BAM_df['ConcHR(ug/m3)'])
calibration_df.index = Time
calibration_df = calibration_df.resample('60min').mean()  # mean() skips NaN by default
calibration_df.head()

## 4. Some Visualizations 

### 4a. PA Intra-sensor Comparison: a_vs_b_plot function

a_vs_b_plot generates a figure showing the relationship between the A and B modules of the PA package. In theory they should have a perfect 1:1 relationship. Input variables are the HDF file, the sensor name (one at a time for now), and whether to show the plot (irrelevant for Jupyter, but useful for not overloading your screen with plots if you are using a different IDE or the command line). Each plot is saved in this directory as a png file named after the sensor. The function also returns the **slope**, **y-intercept**, **R**, **Pearson's r**, and **NRMSE**.

In [None]:
h5file = h5filename + '.h5'  # assumes the plot functions take the file name with its .h5 extension
m, b, R, r, nrmse = a_vs_b_plot(h5file, sensor, False)

print('Slope: '+str(m))
print('y-intercept: '+str(b))
print('R2: '+str(R**2))
print("Pearson's r: "+str(r))
print('NRMSE: '+str(nrmse))
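
For readers who want to see how statistics like these relate to the underlying data, the following standalone sketch computes a least-squares slope and intercept, Pearson's r, and a normalized RMSE for two series. It is not the code inside a_vs_b_plot. In particular, the NRMSE definition used here (RMSE of B against A, divided by the mean of A) is an assumption, since NRMSE can be normalized in several ways, and the data values are made up.

```python
import math

def linear_fit_stats(x, y):
    """Least-squares fit of y on x, plus Pearson's r and a mean-normalized RMSE."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    # Assumption: RMSE of y against x, normalized by the mean of x.
    rmse = math.sqrt(sum((yi - xi) ** 2 for xi, yi in zip(x, y)) / n)
    nrmse = rmse / mean_x
    return slope, intercept, r, nrmse

# Made-up hourly PM2.5 values for the A and B channels.
a = [10.0, 12.0, 15.0, 20.0]
b = [11.0, 12.5, 16.0, 21.0]
print(linear_fit_stats(a, b))
```

Two channels in perfect agreement would give a slope of 1, an intercept of 0, r of 1, and an NRMSE of 0.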

### 4b. PA Calibration: calibration_plot function

If you have imported the calibration data, calibration_plot can help you visualize the linear relationship between the PA and the calibration instrument. Input variables are the HDF file, the sensor name (one at a time for now), the calibration pandas dataframe, and whether to show the plot (irrelevant for Jupyter, but useful for not overloading your screen with plots if you are using a different IDE or the command line). Each plot is saved in this directory as a png file named after the sensor. The function also returns the **slope**, **y-intercept**, **R**, **Pearson's r**, and **NRMSE** for **both the A and B sensors**.

In [None]:
calibration_plot(h5file, sensor, calibration_df, False)

### 4c. Scripting multiple sensors

In this release, the plot functions can only handle one sensor at a time. However, you can use the sensor name list and a for loop to run them for multiple sensors. Sample usage is shown below.

In [None]:
m, b, R, r, nrmse = [], [], [], [], []

for s in sensor_list:
    output = a_vs_b_plot(h5file, s, False)
    m.append(output[0])
    b.append(output[1])
    R.append(output[2])
    r.append(output[3])
    nrmse.append(output[4])

print('Slope: ', m)
print('y-intercept: ', b)
print('R2: ', [x**2 for x in R])  # square each value; a list cannot be squared directly
print("Pearson's r: ", r)
print('NRMSE: ', nrmse)

## 5. [Optional] Download Meteorological Data

Meteorological data is vital to understanding low-cost sensor data, especially data from Purple Air sensors. Relative humidity, temperature, dew point, barometric pressure, wind speed, wind direction, and gust speed all play a significant role in calibration and causal analysis of PM2.5 data. In future releases, functions will be added for meteorological visualization and for including meteorology in calibration procedures.

### 5a. Variable Assignment

To use the downloader, you must specify the network and the weather station name. The network name is usually the common name of the country (such as "India" or "Democratic Republic of the Congo"). If the station is in the US, the network is the name of the state (such as "Iowa" or "New York"). The station name usually corresponds to the airport name (such as "JFK" for JFK Airport, or "AUS" for Austin-Bergstrom Airport) or to the met station name if it is not an airport ("NYC" for Central Park).
Check the following for network and station names: https://mesonet.agron.iastate.edu/request/download.phtml. Unlike in pa_query, the start_date and end_date here are datetime objects of the form datetime(year, month, day), with integer arguments.

In [None]:
network     = 'India'
station     = 'VOBG'
start_date  = datetime(2019,6,20)
end_date    = datetime(2019,8,19)

### 5b. Download Meteorological Data

The download_met_data function is similar to the pa_query function in that it uses an automated webdriver to download data. It also requires the driver_path, along with the aforementioned network, station, start_date, and end_date. Remember not to close the webdriver; it will exit automatically after the download is complete. The downloaded file will be a text file with the same name as the station, saved to the default download directory. If there are multiple downloads from the same station, later files will be named in the format station+(integer).txt.

In [None]:
download_met_data(driver_path, network, station, start_date, end_date)
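
Because repeated downloads from the same station produce several similarly named text files, it can be handy to pick out the most recent one automatically. The sketch below is one way to do that with the standard library. `newest_station_file` is a hypothetical helper, not part of PurpleRain, and it assumes every download for a station has a filename that begins with the station name and ends in .txt.

```python
from pathlib import Path

def newest_station_file(downloads_dir, station):
    """Return the most recently modified .txt file whose name starts with station."""
    matches = list(Path(downloads_dir).glob(station + '*.txt'))
    if not matches:
        raise FileNotFoundError('no download found for station ' + station)
    return max(matches, key=lambda p: p.stat().st_mtime)
```

Calling `newest_station_file(downloads_path, station)` would then return the path of the latest file for that station, which can be passed on to the formatting step.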

### 5c. Format Met Data

The native text file contains extraneous content and is in the UTC timezone. format_met_data reformats the text file into a 1-hr averaged dataframe in the local timezone and saves it as a csv file in the same folder as this file.

In [None]:
meteorology_df = format_met_data('insert_your_path_here' + station + '.txt', tz_str)