# Analysis of Local Aerosol Datasets

Goal: Continue building our familiarity with atmospheric datasets and how they can be used to examine research questions in atmospheric chemistry.
We will continue our exploration of data available via the EPA's Air Quality Data system and use it to explore variability in PM2.5 and other aerosol measurements over the past year. This lab will also serve as your introduction to remote sensing, using AERONET data collected on the roof of the building. The results will contribute to your first lab report.

As always, the first cell imports libraries. You will have to run it twice: once to install pyrsig, and again to import the newly installed library after restarting the kernel. If you prefer, you can download the notebook as a .py file and run it locally on your machine.

In [1]:
%matplotlib inline
%pip install --user pyrsig pycno pyproj netcdf4

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import datetime as dt
from dateutil.parser import parse
import pyrsig
from bs4 import BeautifulSoup      #reads data from website (web scraping)
import requests                    #useful for sending HTTP requests



First we will construct an RSIG query to load PM2.5 data from sites in the Los Angeles area over 2023.

In [2]:
rsigapi = pyrsig.RsigApi(
    bdate='2023-01-01', edate='2023-12-31',
    bbox=(-119, 33, -117, 35)
)

# Uncomment the next line to list all available datasets
# print([k for k in rsigapi.keys()])

# List available datasets for a specific parameter
print([k for k in rsigapi.keys() if 'pm25' in k])
# Extract the desired values to a data frame
df = rsigapi.to_dataframe('aqs.pm25', parse_dates=True, unit_keys=False)

['airnow.pm25', 'airnow2.pm25', 'aqs.pm25', 'aqs.pm25_daily_average', 'aqs.pm25_daily_filter', 'metar.snowCovernesdis.pm25', 'purpleair.pm25_corrected', 'purpleair.pm25_corrected_hourly', 'purpleair.pm25_corrected_daily', 'purpleair.pm25_corrected_monthly', 'purpleair.pm25_corrected_yearly']


Next, we will need to examine the available sites, and extract the desired sites to individual data frames. You will pick three sites to compare and modify the code cell below to extract them to their own dataframes for further analysis.

In [3]:
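# List all sites (print is needed because this is not the last line of the cell)
print(df.SITE_NAME.unique())
# Extract a specific site to its own data frame; strip() removes padding around the site name
anaheim = df[df.SITE_NAME.str.strip() == 'Anaheim']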

Using the techniques we learned in the previous labs, plot the data from your chosen sites and compare the data from the three sites. Questions to ponder:

    1. How does PM2.5 at each site compare? Compare each site with the others and with the current NAAQS for PM2.5.
    2. Are there time periods that jump out at you? What are some potential causes for elevated readings?
    3. Is there a seasonality to PM2.5? Does this vary between sites?
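
One way to approach these questions is to reduce each site to daily and monthly means before plotting. The block below is a minimal sketch on synthetic data, not the required solution. The column names `time`, `pm25`, and `SITE_NAME`, the helper `site_summaries`, and the site labels "Site B" and "Site C" are assumptions for illustration; check `list(df)` for the names in your own data frame. The thresholds are the 24-hour (35 µg/m³) and annual (9.0 µg/m³, 2024 revision) PM2.5 NAAQS. The official standards are evaluated with multi-year statistics, so counting days above 35 µg/m³ is only a rough screening.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data standing in for the RSIG data frame (column names are assumptions)
rng = np.random.default_rng(0)
times = pd.date_range("2023-01-01", "2023-12-31 23:00", freq="h")
demo = pd.concat(
    [
        pd.DataFrame({"time": times, "SITE_NAME": name, "pm25": rng.gamma(4.0, scale, len(times))})
        for name, scale in [("Anaheim  ", 2.5), ("Site B", 3.0), ("Site C", 2.0)]
    ],
    ignore_index=True,
)


def site_summaries(df, sites, time_col="time", value_col="pm25", site_col="SITE_NAME"):
    """Return (daily, monthly) mean tables with one column per requested site."""
    # Strip padding from site names so they match the requested labels
    tidy = df.assign(**{site_col: df[site_col].str.strip()})
    tidy = tidy[tidy[site_col].isin(sites)]
    # One column per site, indexed by time
    wide = tidy.pivot_table(index=time_col, columns=site_col, values=value_col, aggfunc="mean")
    return wide.resample("D").mean(), wide.resample("MS").mean()


daily, monthly = site_summaries(demo, ["Anaheim", "Site B", "Site C"])

NAAQS_24H = 35.0    # ug/m3, 24-hour standard
NAAQS_ANNUAL = 9.0  # ug/m3, annual standard (2024 revision)

days_above = (daily > NAAQS_24H).sum()  # rough count of days above the 24-hour level, per site
annual_mean = daily.mean()              # compare each value with NAAQS_ANNUAL
print(days_above)
print(annual_mean.round(1))
```

The `daily` table can be plotted with the techniques from the previous labs, and the `monthly` table gives a quick first look at seasonality.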

Next we will move to examining data from AERONET. More about this network can be found here: https://aeronet.gsfc.nasa.gov/new_web/index.html

In contrast with the in situ measurements we have worked with previously, these instruments measure the attenuation of light due to aerosols as it passes through the atmosphere.
![AERONET On top of building](https://pkpeterson.github.io/images/pandora_aeronet.jpg)
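
The attenuation measurement can be illustrated with the Beer-Lambert law, I = I0 · exp(−τ·m), where I is the signal at the surface, I0 is the signal at the top of the atmosphere, m is the optical air mass, and τ is the total optical depth. The sketch below inverts this relation with made-up numbers. The function name `total_optical_depth` is hypothetical, and this is a simplification: the operational AERONET retrieval also removes the contributions of molecular (Rayleigh) scattering and gas absorption to isolate the aerosol optical depth (AOD).

```python
import math


def total_optical_depth(signal, signal_toa, airmass):
    """Invert the Beer-Lambert law I = I0 * exp(-tau * m) for tau."""
    return -math.log(signal / signal_toa) / airmass


# Hypothetical values for illustration only
i0 = 1.0         # top-of-atmosphere signal (arbitrary units)
tau_true = 0.15  # optical depth used to generate the "measurement"
m = 2.0          # air mass (sun at about 60 degrees from zenith)

i = i0 * math.exp(-tau_true * m)      # simulated surface signal
tau = total_optical_depth(i, i0, m)   # recovers tau_true
print(round(tau, 3))
```

Because the instrument needs a direct view of the sun, this kind of measurement is only possible during the day and when the line of sight is cloud free.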

First we will import data from our site using code provided by Petar Grigorov and Pawan Gupta (NASA AERONET)
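
The query URL used in the next cell encodes the site, the date range, and the data product as query parameters. As a minimal sketch, the same URL can be assembled from a dictionary, which makes it easier to change the site or dates. The parameter meanings in the comments reflect my reading of the AERONET web service documentation and should be treated as assumptions; the names `BASE_URL` and `params` are illustrative.

```python
from urllib.parse import urlencode

BASE_URL = "https://aeronet.gsfc.nasa.gov/cgi-bin/print_web_data_v3"

params = {
    "site": "WC_Whittier_CA",                    # AERONET site name
    "year": 2023, "month": "05", "day": "01",    # start date
    "year2": 2024, "month2": "02", "day2": "01", # end date
    "AOD15": 1,  # assumed: request the Level 1.5 AOD product
    "AVG": 10,   # assumed: 10 = all points, 20 = daily averages
}

url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

Alternatively, `requests.get(BASE_URL, params=params)` builds the query string for you.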

In [15]:
url='https://aeronet.gsfc.nasa.gov/cgi-bin/print_web_data_v3?site=WC_Whittier_CA&year=2023&month=05&day=01&year2=2024&month2=02&day2=01&AOD15=1&AVG=10'

soup = BeautifulSoup(requests.get(url).text, 'html.parser') #web services contents are read here from URL; naming the parser avoids a warning

if len(soup) <= 1:                    #alerts the user if the data cannot be read due to improper parameter inputs
  print("\nThe link could not be generated due to issues with the input. Please try again.")

"""**Read and filter downloaded data as per user average type specification**"""

with open('time_series_WC.txt', 'w') as oFile:          #writes the scraped text to a file in the current working directory; the with block closes it
    oFile.write(str(soup.text))



#%%%%%
df = pd.read_csv('time_series_WC.txt', skiprows=5)     #loads the csv data from the same file into a Pandas dataframe
#list variables
print(list(df))
#%%
"""**Read and filter downloaded data to parse dates and select only 500 nm AOD**"""

df['Date']= pd.to_datetime(df['Date(dd:mm:yyyy)']+'-'+df['Time(hh:mm:ss)'],format='%d:%m:%Y-%H:%M:%S')
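# AERONET marks missing values with -999, so convert them to NaN before dropping
df['AOD_500nm'] = df['AOD_500nm'].replace(-999.0, np.nan)
df = df.dropna(subset=['AOD_500nm']).reset_index(drop=True)

# Preserves original resolution; you can resample if desired.
# Selecting the column before median() avoids aggregating non-numeric columns.
aeronet = df.groupby('Date')['AOD_500nm'].median()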



['AERONET_Site', 'Date(dd:mm:yyyy)', 'Time(hh:mm:ss)', 'Day_of_Year', 'Day_of_Year(Fraction)', 'AOD_1640nm', 'AOD_1020nm', 'AOD_870nm', 'AOD_865nm', 'AOD_779nm', 'AOD_675nm', 'AOD_667nm', 'AOD_620nm', 'AOD_560nm', 'AOD_555nm', 'AOD_551nm', 'AOD_532nm', 'AOD_531nm', 'AOD_510nm', 'AOD_500nm', 'AOD_490nm', 'AOD_443nm', 'AOD_440nm', 'AOD_412nm', 'AOD_400nm', 'AOD_380nm', 'AOD_340nm', 'Precipitable_Water(cm)', 'AOD_681nm', 'AOD_709nm', 'AOD_Empty', 'AOD_Empty.1', 'AOD_Empty.2', 'AOD_Empty.3', 'AOD_Empty.4', 'Triplet_Variability_1640', 'Triplet_Variability_1020', 'Triplet_Variability_870', 'Triplet_Variability_865', 'Triplet_Variability_779', 'Triplet_Variability_675', 'Triplet_Variability_667', 'Triplet_Variability_620', 'Triplet_Variability_560', 'Triplet_Variability_555', 'Triplet_Variability_551', 'Triplet_Variability_532', 'Triplet_Variability_531', 'Triplet_Variability_510', 'Triplet_Variability_500', 'Triplet_Variability_490', 'Triplet_Variability_443', 'Triplet_Variability_440', 'Tri



Next we want to plot the time series of these data and compare it with data from the closest AQS site (Anaheim). Do so in the cell below. Some things to think about:

    1. How do these measurements compare? Are they correlated?
    2. Would you expect these measurements to be correlated? Why or why not?
    3. The AERONET data have more gaps than the in situ measurements. Why do you think this is?
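
Because the two datasets are sampled at different times, a common first step is to aggregate both to daily values and keep only the days that have data from both instruments. The block below is a minimal sketch on synthetic series, so the correlation it prints says nothing about the real data. The helper `daily_pairs` and the synthetic values are assumptions for illustration; with the lab data you would pass in the Anaheim PM2.5 series (indexed by time) and the `aeronet` series.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
hours = pd.date_range("2023-05-01", "2023-05-31 23:00", freq="h")

# Synthetic hourly PM2.5, standing in for the in situ record
pm25 = pd.Series(rng.gamma(4.0, 2.5, len(hours)), index=hours, name="pm25")

# Synthetic AOD with gaps: daytime only, and every fourth day missing entirely
aod_times = hours[(hours.hour >= 8) & (hours.hour <= 16) & (hours.day % 4 != 0)]
aod = pd.Series(rng.uniform(0.05, 0.3, len(aod_times)), index=aod_times, name="AOD_500nm")


def daily_pairs(pm, aod):
    """Daily mean PM2.5 and daily median AOD, restricted to days with both."""
    both = pd.concat([pm.resample("D").mean(), aod.resample("D").median()], axis=1)
    return both.dropna()


pairs = daily_pairs(pm25, aod)
r = pairs["pm25"].corr(pairs["AOD_500nm"])  # Pearson correlation coefficient
print(len(pairs), "days with both measurements, r =", round(r, 2))
```

A scatter plot of the two columns of `pairs` is a useful complement to the time series when judging whether the measurements are related.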