# US EPA Air Quality System AQI Data Acquisition and Processing 
The code below illustrates how to request data from the US Environmental Protection Agency (EPA) Air Quality System (AQS) API. This is a historical API and does not provide real-time air quality data. The [documentation](https://aqs.epa.gov/aqsweb/documents/data_api.html) for the API provides definitions of the different call parameters and examples of the various calls that can be made to the API.

This notebook works systematically through calls, requesting an API key, using 'list' to get various IDs and parameter values, and using 'daily summary' to get summary data that meets specific conditions. 

The US EPA was created in the early 1970s. The EPA reports that it only started broad-based monitoring with standardized quality assurance procedures in the 1980s. Many counties will have data starting somewhere between 1983 and 1988. My county, Maricopa (for my assigned city, Mesa, AZ), has data starting from 1965. Some [additional information on the Air Quality System can be found in the EPA FAQ](https://www.epa.gov/outdoor-air-quality-data/frequent-questions-about-airdata) on the system.

The Air Quality Index (AQI) is meant to tell us something about how healthy or clean the air is on any given day. The AQI is actually a somewhat complex measure. When I started this example I looked up [how to calculate the AQI](https://www.airnow.gov/sites/default/files/2020-05/aqi-technical-assistance-document-sept2018.pdf) so that I would know roughly what goes into that value.
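
As a rough illustration of what goes into the value, the technical assistance document linked above computes each pollutant's index by linear interpolation between concentration breakpoints. The sketch below applies that interpolation to 24-hour PM2.5 only. The helper name `pm25_aqi` is hypothetical, and the breakpoint table is an assumption transcribed from the September 2018 document, so it should be checked against the document before being relied on.

```python
# Breakpoints for 24-hour PM2.5 (ug/m3): (conc_low, conc_high, aqi_low, aqi_high).
# Assumption: transcribed from the September 2018 technical assistance document.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]

def pm25_aqi(concentration):
    """Interpolate an AQI value from a PM2.5 concentration (hypothetical helper).

    The concentration is assumed to be already truncated to one decimal place.
    """
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    for conc_lo, conc_hi, aqi_lo, aqi_hi in PM25_BREAKPOINTS:
        if concentration <= conc_hi:
            # Linear interpolation within the matching breakpoint band
            slope = (aqi_hi - aqi_lo) / (conc_hi - conc_lo)
            return round(slope * (concentration - conc_lo) + aqi_lo)
    return None  # concentration is beyond the top of the table

print(pm25_aqi(35.9))
```

The AQI reported for a day is the highest of the individual pollutant indices, which is part of why the measure is more complex than a single sensor reading.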





### License
A lot of the code and markdown text below was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.1 - August 16, 2024.

I have made slight modifications to the code as required for my project, and also added some extra code for the creation of my dataset.

### Setting up the Google Colab Workspace

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
%cd "drive/MyDrive/Data 512/Project/"

/content/drive/MyDrive/Data 512/Project


### Preliminaries
First we start with some imports and some constant definitions.


You will need to install any dependencies you don't have. For this, you will require pip/pip3 if you do not already have it.
After installing pip, you can run: !pip install {package name} to install any required packages

In [1]:
#
#    These are standard python modules
#
import json, time
#
#    The 'requests' module is a distribution module for making web requests. If you do not have it already, you'll need to install it
import requests
import csv
import pandas as pd

In [2]:
#########
#
#    CONSTANTS
#

#
#    This is the root of all AQS API URLs
#
API_REQUEST_URL = 'https://aqs.epa.gov/data/api'

#
#    These are some of the 'actions' we can ask the API to take or requests that we can make of the API
#
#    Sign-up request - generally only performed once - unless you lose your key
API_ACTION_SIGNUP = '/signup?email={email}'
#
#    List actions provide information on API parameter values that are required by some other actions/requests
API_ACTION_LIST_CLASSES = '/list/classes?email={email}&key={key}'
API_ACTION_LIST_PARAMS = '/list/parametersByClass?email={email}&key={key}&pc={pclass}'
API_ACTION_LIST_SITES = '/list/sitesByCounty?email={email}&key={key}&state={state}&county={county}'
#
#    Monitor actions are requests for monitoring stations that meet specific criteria
API_ACTION_MONITORS_COUNTY = '/monitors/byCounty?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&state={state}&county={county}'
API_ACTION_MONITORS_BOX = '/monitors/byBox?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&minlat={minlat}&maxlat={maxlat}&minlon={minlon}&maxlon={maxlon}'
#
#    Summary actions are requests for summary data. These are for daily summaries
API_ACTION_DAILY_SUMMARY_COUNTY = '/dailyData/byCounty?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&state={state}&county={county}'
API_ACTION_DAILY_SUMMARY_BOX = '/dailyData/byBox?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&minlat={minlat}&maxlat={maxlat}&minlon={minlon}&maxlon={maxlon}'
#
#    It is always nice to be respectful of a free data resource.
#    We're going to observe a 100 requests per minute limit - which is fairly nice
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/100.0)-API_LATENCY_ASSUMED   # 60 seconds spread over 100 requests
#
#
#    This is a template that covers most of the parameters for the actions we might take, from the set of actions
#    above. In the examples below, most of the time parameters can either be supplied as individual values to a
#    function - or they can be set in a copy of the template and passed in with the template.
#
AQS_REQUEST_TEMPLATE = {
    "email":      "",
    "key":        "",
    "state":      "",     # the two digit state FIPS # as a string
    "county":     "",     # the three digit county FIPS # as a string
    "begin_date": "",     # the start of a time window in YYYYMMDD format
    "end_date":   "",     # the end of a time window in YYYYMMDD format, begin_date and end_date must be in the same year
    "minlat":    0.0,
    "maxlat":    0.0,
    "minlon":    0.0,
    "maxlon":    0.0,
    "param":     "",     # a list of comma separated 5 digit codes, max 5 codes requested
    "pclass":    ""      # parameter class is only used by the List calls
}
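
The throttle constants above turn a request budget into a pause before each call. As a small sketch of that arithmetic, the hypothetical helper below (not part of the AQS API or the original notebook) computes the pause for any budget and clamps it at zero. The 2 ms latency default is the same assumption the constants make.

```python
def throttle_wait(max_requests, per_seconds, assumed_latency=0.002):
    """Seconds to sleep before each request so that at most `max_requests`
    are sent every `per_seconds` seconds (hypothetical helper)."""
    if max_requests <= 0:
        raise ValueError("max_requests must be positive")
    spacing = per_seconds / max_requests          # ideal gap between requests
    # Latency already uses up part of the gap; never return a negative wait
    return max(0.0, spacing - assumed_latency)

# A budget of 100 requests per minute works out to roughly 0.6 seconds per call
print(throttle_wait(100, 60.0))
```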



**Step 1:** Making a sign-up request

Before we use the API you need to request a key. You will use an email address to make the request. The EPA then sends a confirmation email link and a 'key' that you use for all other requests.

You only need to sign-up once, unless you want to invalidate your current key (by getting a new key) or you lose your key.


In [3]:
#
#    This implements the sign-up request. The parameters are standardized so that this function definition matches
#    all of the others. However, the easiest way to call this is to simply call this function with your preferred
#    email address.
#
def request_signup(email_address = None,
                   endpoint_url = API_REQUEST_URL,
                   endpoint_action = API_ACTION_SIGNUP,
                   request_template = AQS_REQUEST_TEMPLATE,
                   headers = None):

    # Make sure we have a string - if you don't have access to this email address, things might go badly for you
    if email_address:
        request_template['email'] = email_address

    if not request_template['email']:
        raise Exception("Must supply an email address to call 'request_signup()'")

    if '@' not in request_template['email']:
        raise Exception(f"Must supply an email address to call 'request_signup()'. The string '{request_template['email']}' does not look like an email address.")

    # Compose the signup url - create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_action.format(**request_template)

    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response




In [None]:
#
#    A SIGNUP request is only to be done once, to request a key. A key is sent to that email address and needs to be confirmed with a click through
#    This code should probably be commented out after you've made your key request to make sure you don't accidentally make a new sign-up request
#
print("Requesting SIGNUP ...")
USERNAME = "gmihir@uw.edu" # Replace with your email address
response = request_signup(USERNAME)
print(json.dumps(response,indent=4))
#

Requesting SIGNUP ...
{
    "Header": [
        {
            "status": "Success",
            "request_time": "2024-10-29T16:40:10-04:00",
            "url": "https://aqs.epa.gov/data/api/signup?email=gmihir@uw.edu"
        }
    ],
    "Data": [
        "You should receive a registration confirmation email with a link for confirming your email shortly."
    ]
}


In [None]:
#
#   Once we have the signup email, we can define two constants:
#
#   USERNAME - This should be the email address you sent the EPA asking for access to the API during sign-up
#   APIKEY   - This should be the authorization key they sent you
#
#
USERNAME = "gmihir@uw.edu" # Replace with the email address you signed up with.
APIKEY = "" # Add the API key provided in your email. (Don't forget to confirm your account from the email)


**Step 2:** Making a list request

Once you have a key, the next thing is to get information about the different types of air quality monitoring (sensors) and the different places where we might find air quality stations. The monitoring system is complex and changes all the time. The EPA implementation allows an API user to find changes to monitoring sites and sensors by making requests - maybe monthly, or daily. This API approach is probably better than having the EPA publish documentation that may be out of date as soon as it hits a web page. The one problem here is that some of the responses rely on jargon or terms of art. That is, one needs to know a bit about the way atmospheric science works to understand some of the terms. Good thing we can use the web to search for terms we don't know!

In [5]:
#
#    This implements the list request. There are several versions of the list request that only require email and key.
#    This code sets the default action/requests to list the groups or parameter class descriptors. Having those descriptors
#    allows one to request the individual (proprietary) 5 digit codes for individual air quality measures by using the
#    param request. Some code in later cells will illustrate those requests.
#
def request_list_info(email_address = None, key = None,
                      endpoint_url = API_REQUEST_URL,
                      endpoint_action = API_ACTION_LIST_CLASSES,
                      request_template = AQS_REQUEST_TEMPLATE,
                      headers = None):

    #  Make sure we have email and key - at least
    #  This prioritizes the info from the call parameters - not what's already in the template
    if email_address:
        request_template['email'] = email_address
    if key:
        request_template['key'] = key

    # For the basic request we need an email address and a key
    if not request_template['email']:
        raise Exception("Must supply an email address to call 'request_list_info()'")
    if not request_template['key']:
        raise Exception("Must supply a key to call 'request_list_info()'")

    # compose the request
    request_url = endpoint_url+endpoint_action.format(**request_template)

    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response



In [6]:
#
#   The default should get us a list of the various groups or classes of sensors. These classes are user defined names for clusters of
#   sensors that might be part of a package or default air quality sensing station. We need a class name to start getting down to
#   a sensor ID. Each sensor type has an ID number. We'll eventually need those ID numbers to be able to request values that come from
#   that specific sensor.
#
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY

response = request_list_info(request_template=request_data)

if response["Header"][0]['status'] == "Success":
    print(json.dumps(response['Data'],indent=4))
else:
    print(json.dumps(response,indent=4))


[
    {
        "code": "AIRNOW MAPS",
        "value_represented": "The parameters represented on AirNow maps (88101, 88502, and 44201)"
    },
    {
        "code": "ALL",
        "value_represented": "Select all Parameters Available"
    },
    {
        "code": "AQI POLLUTANTS",
        "value_represented": "Pollutants that have an AQI Defined"
    },
    {
        "code": "CORE_HAPS",
        "value_represented": "Urban Air Toxic Pollutants"
    },
    {
        "code": "CRITERIA",
        "value_represented": "Criteria Pollutants"
    },
    {
        "code": "CSN DART",
        "value_represented": "List of CSN speciation parameters to populate the STI DART tool"
    },
    {
        "code": "FORECAST",
        "value_represented": "Parameters routinely extracted by AirNow (STI)"
    },
    {
        "code": "HAPS",
        "value_represented": "Hazardous Air Pollutants"
    },
    {
        "code": "IMPROVE CARBON",
        "value_represented": "IMPROVE Carbon Parameters"
    }

We're interested in getting to something that might be the Air Quality Index (AQI). You see this reported on the news - often in connection with smog, but also when there is smoke in the sky. The AQI is a complex measure of different gases and of the particles in the air (dust, dirt, ash ...).

From the list produced by our 'list/classes' request above, it looks like there is a class of sensors called "AQI POLLUTANTS". Let's try to get a list of those specific sensors and see what we can get from those.


In [7]:
#
#   Once we have a list of the classes or groups of possible sensors, we can find the sensor IDs that make up that class (group)
#   The one that looks to be associated with the Air Quality Index is "AQI POLLUTANTS"
#   We'll use that to make another list request.
#
AQI_PARAM_CLASS = "AQI POLLUTANTS"


In [8]:
#
#   Structure a request to get the sensor IDs associated with the AQI
#
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY
request_data['pclass'] = AQI_PARAM_CLASS  # here we specify that we want this 'pclass' or parameter class

response = request_list_info(request_template=request_data, endpoint_action=API_ACTION_LIST_PARAMS)

if response["Header"][0]['status'] == "Success":
    print(json.dumps(response['Data'],indent=4))
else:
    print(json.dumps(response,indent=4))


[
    {
        "code": "42101",
        "value_represented": "Carbon monoxide"
    },
    {
        "code": "42401",
        "value_represented": "Sulfur dioxide"
    },
    {
        "code": "42602",
        "value_represented": "Nitrogen dioxide (NO2)"
    },
    {
        "code": "44201",
        "value_represented": "Ozone"
    },
    {
        "code": "81102",
        "value_represented": "PM10 Total 0-10um STP"
    },
    {
        "code": "88101",
        "value_represented": "PM2.5 - Local Conditions"
    },
    {
        "code": "88502",
        "value_represented": "Acceptable PM2.5 AQI & Speciation Mass"
    }
]


We should now have (above) a response containing a set of sensor ID numbers. The list should include the sensor numbers as well as a description or name for each sensor.

The EPA AQS API has limits on some call parameters. Specifically, when we request data for sensors we can only specify a maximum of 5 different sensor values to return. This means we cannot get all of the Air Quality Index parameters in one request for data. We have to break it up.

What I did below was to break the request into two logical groups: the AQI sensors that sample gases and the AQI sensors that sample particles in the air.
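
The same five-code limit can also be handled mechanically when the list of codes is longer. The sketch below is one assumed way to do that; `chunk_params` is a hypothetical helper name, not something provided by the AQS API.

```python
def chunk_params(codes, max_per_request=5):
    """Group parameter codes into comma-separated strings holding at most
    `max_per_request` codes each (hypothetical helper)."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    return [",".join(codes[i:i + max_per_request])
            for i in range(0, len(codes), max_per_request)]

# The seven AQI pollutant codes returned by the list request above
aqi_codes = ["42101", "42401", "42602", "44201", "81102", "88101", "88502"]
for param in chunk_params(aqi_codes):
    print(param)
```

Chunking this way mixes gases and particulates in the first request, so the hand-made grouping is easier to work with when the two kinds of pollutant are analyzed separately.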

In [9]:
#
#   Given the set of sensor codes, now we can create a parameter list or 'param' value as defined by the AQS API spec.
#   It turns out that we want all of these measures for AQI, but we need to have two different param constants to get
#   all seven of the code types. We can only have a max of 5 sensors/values request per param.
#
#   Gaseous AQI pollutants CO, SO2, NO2, and O3
AQI_PARAMS_GASEOUS = "42101,42401,42602,44201"
#
#   Particulate AQI pollutants PM10, PM2.5, and Acceptable PM2.5
AQI_PARAMS_PARTICULATES = "81102,88101,88502"
#
#

Air quality monitoring stations are located all over the US at different locations.

The CITY_LOCATIONS constant below includes the [FIPS](https://www.census.gov/library/reference/code-lists/ansi.html) number for the state and county as a 5 digit string. This format, the 5 digit string, is an 'old' format that is still widely used. There are new codes that may eventually be adopted for the US government information systems. But FIPS is currently what the AQS uses, so that's what is in the constant.

For my assigned city, Mesa, AZ, the county is Maricopa, and the FIPS code is 04013.
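
The API takes the state and county parts of that code as separate parameters. A minimal sketch of the split, with `split_fips` as a hypothetical helper name:

```python
def split_fips(fips):
    """Split a 5-character county FIPS string into (state, county) codes."""
    if len(fips) != 5 or not fips.isdigit():
        raise ValueError(f"expected a 5-digit FIPS string, got {fips!r}")
    # First two characters are the state, last three are the county
    return fips[:2], fips[2:]

state, county = split_fips("04013")
print(state, county)   # state '04' (Arizona), county '013' (Maricopa)
```

Keeping the code as a string matters here, because converting it to an integer would drop the leading zero.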

In [10]:
#
#   We store the information about my assigned city in CITY_LOCATIONS
#
CITY_LOCATIONS = {
    'Mesa' :       {'city'   : 'Mesa',
                       'county' : 'Maricopa',
                       'state'  : 'Arizona',
                       'fips'   : '04013',
                       'latlon' : [33.40, -111.72] },
}


Given our CITY_LOCATIONS constant we can now find which monitoring locations are nearby. One option is to use the county to define the area we're interested in. You can get the EPA to list their monitoring stations by county. You can also get a set of monitoring stations by using a bounding box of latitude, longitude points. For my purpose, the county approach gave me enough stations and data, so I did not use the bounding box approach.
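
For reference, the bounding box alternative needs four values (`minlat`, `maxlat`, `minlon`, `maxlon`) instead of a state and county. The sketch below builds a rough box around a city's `latlon`. The helper name `bounding_box`, the 25-mile default, and the 69-miles-per-degree approximation are all assumptions made for illustration.

```python
import math

MILES_PER_DEGREE_LAT = 69.0   # rough approximation, assumed for this sketch

def bounding_box(latlon, miles=25.0):
    """Return minlat/maxlat/minlon/maxlon roughly `miles` away from a
    [latitude, longitude] point (hypothetical helper)."""
    lat, lon = latlon
    dlat = miles / MILES_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude
    dlon = miles / (MILES_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return {"minlat": lat - dlat, "maxlat": lat + dlat,
            "minlon": lon - dlon, "maxlon": lon + dlon}

box = bounding_box([33.40, -111.72])   # the 'latlon' value stored for Mesa
print(box)
```

The resulting values could be copied into a request template and used with `API_ACTION_MONITORS_BOX` or `API_ACTION_DAILY_SUMMARY_BOX`.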

In [11]:
#
#  This list request should give us a list of all the monitoring stations in the county specified by the
#  given city selected from the CITY_LOCATIONS dictionary
#
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY
request_data['state'] = CITY_LOCATIONS['Mesa']['fips'][:2]   # the first two digits (characters) of FIPS is the state code
request_data['county'] = CITY_LOCATIONS['Mesa']['fips'][2:]  # the last three digits (characters) of FIPS is the county code

response = request_list_info(request_template=request_data, endpoint_action=API_ACTION_LIST_SITES)

if response["Header"][0]['status'] == "Success":
    non_null_results = [item for item in response['Data'] if item.get('value_represented') is not None]

    # Print the filtered list and its length
    print(json.dumps(non_null_results, indent=4))
    print("Number of non-null 'value_represented' results:", len(non_null_results))
    #print(json.dumps(response['Data'],indent=4))
else:
    print(json.dumps(response,indent=4))


[
    {
        "code": "0013",
        "value_represented": "Old South Phoenix Site"
    },
    {
        "code": "0015",
        "value_represented": "Emergency Management"
    },
    {
        "code": "0016",
        "value_represented": "WEST INDIAN SCHOOL RD"
    },
    {
        "code": "0019",
        "value_represented": "WEST PHOENIX"
    },
    {
        "code": "0022",
        "value_represented": "GRAND AVE"
    },
    {
        "code": "1003",
        "value_represented": "MESA"
    },
    {
        "code": "1004",
        "value_represented": "NORTH PHOENIX"
    },
    {
        "code": "1010",
        "value_represented": "FALCON FIELD"
    },
    {
        "code": "2001",
        "value_represented": "GLENDALE"
    },
    {
        "code": "2004",
        "value_represented": "North Scottsdale"
    },
    {
        "code": "2005",
        "value_represented": "PINNACLE PEAK"
    },
    {
        "code": "3002",
        "value_represented": "CENTRAL PHOENIX"
    },
    {


The above response gives us a list of monitoring stations in Maricopa County. Each monitoring station has a unique "code", which is a number stored as a string, and, sometimes, a description. The description seems to say something about where the monitoring station is located.
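
The cells above index straight into `response["Header"]`, but each request function returns `None` when the HTTP call or the JSON decoding raises an exception, so that lookup can itself fail. A more defensive way to read the status might look like the sketch below. `response_status` is a hypothetical helper, and the sample dictionary only mimics the response shape seen in this notebook.

```python
def response_status(response):
    """Return the status string from an AQS-style response, or None when the
    response is missing or does not have the expected shape."""
    if not isinstance(response, dict):
        return None
    header = response.get("Header")
    if not isinstance(header, list) or not header:
        return None
    first = header[0]
    if not isinstance(first, dict):
        return None
    return first.get("status")

# Made-up sample that follows the Header/Data layout of the responses above
sample = {"Header": [{"status": "Success"}], "Data": []}
print(response_status(sample))   # Success
print(response_status(None))     # None
```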


**Step 3:** Making a daily summary request

The function below is designed to encapsulate requests to the EPA AQS API. When calling the function one should create/copy a parameter template, then initialize that template with values that won't change with each call. Then on each call simply pass in the parameters that need to change, like date ranges.

Another function below provides an example of extracting values and restructuring the response to make it a little more usable.

In [12]:
#
#    This implements the daily summary request. Daily summary provides a daily summary value for each sensor being requested
#    from the start date to the end date.
#
#    Like the two other functions, this can be called with a mixture of a defined parameter dictionary, or with function
#    parameters. If function parameters are provided, those take precedence over any parameters from the request template.
#
def request_daily_summary(email_address = None, key = None, param=None,
                          begin_date = None, end_date = None, fips = None,
                          endpoint_url = API_REQUEST_URL,
                          endpoint_action = API_ACTION_DAILY_SUMMARY_COUNTY,
                          request_template = AQS_REQUEST_TEMPLATE,
                          headers = None):

    #  This prioritizes the info from the call parameters - not what's already in the template
    if email_address:
        request_template['email'] = email_address
    if key:
        request_template['key'] = key
    if param:
        request_template['param'] = param
    if begin_date:
        request_template['begin_date'] = begin_date
    if end_date:
        request_template['end_date'] = end_date
    if fips and len(fips)==5:
        request_template['state'] = fips[:2]
        request_template['county'] = fips[2:]

    # Make sure there are values that allow us to make a call - these are always required
    if not request_template['email']:
        raise Exception("Must supply an email address to call 'request_daily_summary()'")
    if not request_template['key']:
        raise Exception("Must supply a key to call 'request_daily_summary()'")
    if not request_template['param']:
        raise Exception("Must supply param values to call 'request_daily_summary()'")
    if not request_template['begin_date']:
        raise Exception("Must supply a begin_date to call 'request_daily_summary()'")
    if not request_template['end_date']:
        raise Exception("Must supply an end_date to call 'request_daily_summary()'")
    # Note we're not validating FIPS fields because not all of the daily summary actions require the FIPS numbers

    # compose the request
    request_url = endpoint_url+endpoint_action.format(**request_template)

    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response



In [13]:

request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY
request_data['param'] = AQI_PARAMS_GASEOUS
request_data['state'] = CITY_LOCATIONS['Mesa']['fips'][:2]
request_data['county'] = CITY_LOCATIONS['Mesa']['fips'][2:]

# request daily summary data for May 01, 1964 to October 31st 2024 (the last 60 years). We put May and October based on the fire season
gaseous_aqi = request_daily_summary(request_template=request_data, begin_date="19640501", end_date="20241031")
print("Response for the gaseous pollutants ...")
#
if gaseous_aqi["Header"][0]['status'] == "Success":
    print(json.dumps(gaseous_aqi['Data'],indent=4))
elif gaseous_aqi["Header"][0]['status'].startswith("No data "):
    print("Looks like the response generated no data. You might take a closer look at your request and the response data.")
else:
    print(json.dumps(gaseous_aqi,indent=4))

request_data['param'] = AQI_PARAMS_PARTICULATES
# request daily summary data for the same May 01, 1964 to October 31, 2024 window
particulate_aqi = request_daily_summary(request_template=request_data, begin_date="19640501", end_date="20241031")
print("Response for the particulate pollutants ...")
#
if particulate_aqi["Header"][0]['status'] == "Success":
    print(json.dumps(particulate_aqi['Data'],indent=4))
elif particulate_aqi["Header"][0]['status'].startswith("No data "):
    print("Looks like the response generated no data. You might take a closer look at your request and the response data.")
else:
    print(json.dumps(particulate_aqi,indent=4))


Response for the gaseous pollutants ...
{
    "Header": [
        {
            "status": "Failed",
            "request_time": "2024-10-30T21:54:37.823-04:00",
            "url": "https://aqs.epa.gov/data/api/dailyData/byCounty?email=gmihir@uw.edu&key=mauvemouse74&param=42101,42401,42602,44201&bdate=19640501&edate=20241031&state=04&county=013",
            "error": [
                "bdate: 19640501, edate: 20241031, only 1 year of data is permitted."
            ]
        }
    ]
}
Response for the particulate pollutants ...
{
    "Header": [
        {
            "status": "Failed",
            "request_time": "2024-10-30T21:54:38.651-04:00",
            "url": "https://aqs.epa.gov/data/api/dailyData/byCounty?email=gmihir@uw.edu&key=mauvemouse74&param=81102,88101,88502&bdate=19640501&edate=20241031&state=04&county=013",
            "error": [
                "bdate: 19640501, edate: 20241031, only 1 year of data is permitted."
            ]
        }
    ]
}


When I ran a single request for all the required years, the API responded with a status of "Failed" and an error saying that only 1 year of data is permitted per request. This led me to create a loop that requests the fire season (May 1 to October 31) separately for each of the required years (1964-2024).

In [16]:
# Attribution: The code below draws on the professor's code.

# Helper function to format date for API
def get_date_str(year, month_day):
    return f"{year}{month_day}"

gaseous_data = []
particulate_data = []

# We define the start and end years
start_year = 1964
end_year = 2024

# In this loop, we go through the fire season of each year and request the API for that data
for year in range(start_year, end_year + 1):
    # Ensure that the begin date corresponds to May 1, and end to Oct 31 (fire season)
    begin_date = get_date_str(year, "0501")
    end_date = get_date_str(year, "1031")

    # First we request the api for gaseous pollutants data
    request_data['param'] = AQI_PARAMS_GASEOUS
    gaseous_aqi = request_daily_summary(request_template=request_data, begin_date=begin_date, end_date=end_date)

    # Confirm status and accumulate data
    if gaseous_aqi["Header"][0]['status'] == "Success":
        gaseous_data.extend(gaseous_aqi['Data'])
    elif gaseous_aqi["Header"][0]['status'].startswith("No data "):
        print(f"No data for gaseous pollutants in {year}.")
    else:
        print(json.dumps(gaseous_aqi, indent=4))

    # Next we request the api for particulate pollutants data
    request_data['param'] = AQI_PARAMS_PARTICULATES
    particulate_aqi = request_daily_summary(request_template=request_data, begin_date=begin_date, end_date=end_date)

    # Confirm status and accumulate data
    if particulate_aqi["Header"][0]['status'] == "Success":
        particulate_data.extend(particulate_aqi['Data'])
    elif particulate_aqi["Header"][0]['status'].startswith("No data "):
        print(f"No data for particulate pollutants in {year}.")
    else:
        print(json.dumps(particulate_aqi, indent=4))

# The function below helps us save the data to a csv file for both the gaseous and particulate pollutants
def save_to_csv(data, filename):
    if data:
        keys = data[0].keys()
        with open(filename, "w", newline="") as csv_file:
            dict_writer = csv.DictWriter(csv_file, fieldnames=keys)
            dict_writer.writeheader()
            dict_writer.writerows(data)
        print(f"Data saved to {filename}")
    else:
        print(f"No data to save for {filename}")


No data for gaseous pollutants in 1964.
No data for particulate pollutants in 1964.
No data for particulate pollutants in 1965.
No data for particulate pollutants in 1966.
No data for particulate pollutants in 1967.
No data for particulate pollutants in 1968.
No data for particulate pollutants in 1969.
No data for particulate pollutants in 1970.
No data for particulate pollutants in 1971.
No data for particulate pollutants in 1972.
No data for particulate pollutants in 1973.
No data for particulate pollutants in 1974.
No data for particulate pollutants in 1975.
No data for particulate pollutants in 1976.
No data for particulate pollutants in 1977.
No data for particulate pollutants in 1978.
No data for particulate pollutants in 1979.
No data for particulate pollutants in 1980.
No data for particulate pollutants in 1981.
No data for particulate pollutants in 1982.
No data for particulate pollutants in 1983.
No data for particulate pollutants in 1984.
No data for particulate pollutants i

In [18]:
# Save results
save_to_csv(gaseous_data, "../Processed Data/gaseous_pollutants.csv")
save_to_csv(particulate_data, "../Processed Data/particulate_pollutants.csv")

Data saved to ../Processed Data/gaseous_pollutants.csv
Data saved to ../Processed Data/particulate_pollutants.csv


## Process AQI Data

Now that the data is collected for the required time interval, I process it by removing any null AQI values. Then I combine the gaseous_pollutants and particulate_pollutants data to understand how much data I have. Based on the data, I believe I have sufficient information to calculate the average AQI for the day/month/year, so I perform that calculation by taking the mean. Finally, I save the processed data in AQI_data.csv, which will be used for further analysis.
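
The averaging step can be sketched directly on record dictionaries of the kind the API returns, before any CSV round trip. The helper name `daily_mean_aqi` and the sample records below are made up for illustration; only the `date_local`, `parameter_code`, and `aqi` field names come from the daily summary data.

```python
from collections import defaultdict
from statistics import mean

def daily_mean_aqi(records):
    """Average the non-null 'aqi' values per 'date_local' (hypothetical helper)."""
    by_day = defaultdict(list)
    for rec in records:
        if rec.get("aqi") is not None:       # skip rows without an AQI value
            by_day[rec["date_local"]].append(rec["aqi"])
    return {day: mean(values) for day, values in sorted(by_day.items())}

# Made-up sample rows mixing gaseous and particulate parameter codes
sample = [
    {"date_local": "2020-07-01", "parameter_code": "44201", "aqi": 80},
    {"date_local": "2020-07-01", "parameter_code": "88101", "aqi": 40},
    {"date_local": "2020-07-01", "parameter_code": "42101", "aqi": None},
    {"date_local": "2020-07-02", "parameter_code": "81102", "aqi": 55},
]
print(daily_mean_aqi(sample))
```

Grouping on the year or year-month prefix of `date_local` instead of the full date would give yearly or monthly averages in the same way.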

Here I read the gaseous_pollutants and particulate_pollutants data and check whether they have similar columns.

In [None]:
gaseous_df = pd.read_csv("./Processed Data/gaseous_pollutants.csv")
particulate_df = pd.read_csv("./Processed Data/particulate_pollutants.csv")

print("Columns in gaseous_df:", gaseous_df.columns.tolist())
print("Columns in particulate_df:", particulate_df.columns.tolist())

Columns in gaseous_df: ['state_code', 'county_code', 'site_number', 'parameter_code', 'poc', 'latitude', 'longitude', 'datum', 'parameter', 'sample_duration_code', 'sample_duration', 'pollutant_standard', 'date_local', 'units_of_measure', 'event_type', 'observation_count', 'observation_percent', 'validity_indicator', 'arithmetic_mean', 'first_max_value', 'first_max_hour', 'aqi', 'method_code', 'method', 'local_site_name', 'site_address', 'state', 'county', 'city', 'cbsa_code', 'cbsa', 'date_of_last_change']
Columns in particulate_df: ['state_code', 'county_code', 'site_number', 'parameter_code', 'poc', 'latitude', 'longitude', 'datum', 'parameter', 'sample_duration_code', 'sample_duration', 'pollutant_standard', 'date_local', 'units_of_measure', 'event_type', 'observation_count', 'observation_percent', 'validity_indicator', 'arithmetic_mean', 'first_max_value', 'first_max_hour', 'aqi', 'method_code', 'method', 'local_site_name', 'site_address', 'state', 'county', 'city', 'cbsa_code', '

Based on the output above, both dataframes share the same columns.
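Reading two long column lists by eye is error-prone, so the same check can be done programmatically. The following is a minimal sketch on two small stand-in frames (made-up values, not AQS data); `compare_columns`, `df_a`, and `df_b` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

def compare_columns(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Report columns that appear in only one of the two frames."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "only_in_left": sorted(left_cols - right_cols),
        "only_in_right": sorted(right_cols - left_cols),
        "same_order": list(left.columns) == list(right.columns),
    }

# Toy stand-ins for gaseous_df and particulate_df (values are made up)
df_a = pd.DataFrame({"date_local": ["2020-01-01"], "aqi": [41.0], "parameter": ["Ozone"]})
df_b = pd.DataFrame({"date_local": ["2020-01-01"], "aqi": [55.0], "parameter": ["PM2.5 - Local Conditions"]})

report = compare_columns(df_a, df_b)
print(report)
```

Calling `compare_columns(gaseous_df, particulate_df)` on the real frames would list any column present in only one of them.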

Next, I explore the columns of each dataframe.

In [17]:
print("Basic statistics - gaseous_df")
print(gaseous_df.describe())

Basic statistics - gaseous_df
       state_code  county_code    site_number  parameter_code            poc  \
count    864929.0     864929.0  864929.000000   864929.000000  864929.000000   
mean          4.0         13.0    4157.067377    43574.674844       1.221315   
std           0.0          0.0    3271.684067      900.169115       0.873470   
min           4.0         13.0       3.000000    42101.000000       1.000000   
25%           4.0         13.0    2001.000000    42602.000000       1.000000   
50%           4.0         13.0    3003.000000    44201.000000       1.000000   
75%           4.0         13.0    7020.000000    44201.000000       1.000000   
max           4.0         13.0    9998.000000    44201.000000       6.000000   

            latitude      longitude  observation_count  observation_percent  \
count  864929.000000  864929.000000      864929.000000        864929.000000   
mean       33.507747    -112.009782          22.140232            99.231559   
std         

In [18]:
print("Basic statistics - particulate_df")
print(particulate_df.describe())

Basic statistics - particulate_df
       state_code  county_code    site_number  parameter_code            poc  \
count    375551.0     375551.0  375551.000000   375551.000000  375551.000000   
mean          4.0         13.0    4533.737410    85644.735727       2.261632   
std           0.0          0.0    3482.777929     3344.745421       1.073024   
min           4.0         13.0      13.000000    81102.000000       1.000000   
25%           4.0         13.0    1004.000000    81102.000000       1.000000   
50%           4.0         13.0    4005.000000    88101.000000       3.000000   
75%           4.0         13.0    7022.000000    88101.000000       3.000000   
max           4.0         13.0    9998.000000    88502.000000       9.000000   

            latitude      longitude  observation_count  observation_percent  \
count  375551.000000  375551.000000      375551.000000        375551.000000   
mean       33.466012    -112.061539           6.009333           101.931455   
std     

Based on the output above, both dataframes have many null values in their aqi column, so I explore this further.

In [19]:
# Total number of records and percentage of nulls in 'aqi' for each dataframe
def analyze_aqi(df):
  total_records = len(df)
  null_count = df['aqi'].isnull().sum()
  percentage_nulls = (null_count / total_records) * 100 if total_records > 0 else 0
  return total_records, null_count, percentage_nulls

gaseous_total_records, gaseous_null_records, gaseous_percentage_nulls = analyze_aqi(gaseous_df)
particulate_total_records, particulate_null_records, particulate_percentage_nulls = analyze_aqi(particulate_df)

print("Total records in gaseous_df:", gaseous_total_records)
print("Null values in 'aqi' column of gaseous_df:", gaseous_null_records)
print("Percentage of nulls in 'aqi' column of gaseous_df:", gaseous_percentage_nulls)
print("-------------------------------------------------------------------------------------")
print("Total records in particulate_df:", particulate_total_records)
print("Null values in 'aqi' column of particulate_df:", particulate_null_records)
print("Percentage of nulls in 'aqi' column of particulate_df:", particulate_percentage_nulls)

Total records in gaseous_df: 864929
Null values in 'aqi' column of gaseous_df: 251103
Percentage of nulls in 'aqi' column of gaseous_df: 29.03163149807672
-------------------------------------------------------------------------------------
Total records in particulate_df: 375551
Null values in 'aqi' column of particulate_df: 83061
Percentage of nulls in 'aqi' column of particulate_df: 22.117102603907327
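The overall percentages above do not show which pollutants the missing AQI values come from. A minimal sketch of a per-pollutant breakdown is shown here on a toy frame with made-up values; the name `toy` and the numbers are assumptions for illustration only, while the `parameter` and `aqi` column names match the dataframes above.

```python
import numpy as np
import pandas as pd

# Toy frame with the same 'parameter' and 'aqi' columns as the real data (values are made up)
toy = pd.DataFrame({
    "parameter": ["Ozone", "Ozone", "Ozone", "Ozone", "Sulfur dioxide", "Sulfur dioxide"],
    "aqi": [44.0, np.nan, 30.0, 50.0, 12.0, np.nan],
})

# Share of null 'aqi' values per pollutant, as a percentage
null_pct = toy["aqi"].isna().groupby(toy["parameter"]).mean() * 100
print(null_pct)
```

Running the same expression on `gaseous_df` or `particulate_df` would show whether the nulls are concentrated in a few parameters or spread evenly.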


Based on the output above, I think I have enough data to make a reasonable estimate of the average AQI. Therefore, I drop the NaN values and combine the two pollutant datasets.

In [None]:
# Drop rows that contain NaN in any column (this includes rows with a null 'aqi')
gaseous_df_cleaned = gaseous_df.dropna()
particulate_df_cleaned = particulate_df.dropna()

# Concatenate the cleaned DataFrames
combined_df = pd.concat([gaseous_df_cleaned, particulate_df_cleaned], ignore_index=True)

# Remove exact duplicate rows; .copy() gives an independent frame for later column assignments
combined_df_no_duplicates = combined_df.drop_duplicates().copy()

combined_df_no_duplicates

Unnamed: 0,state_code,county_code,site_number,parameter_code,poc,latitude,longitude,datum,parameter,sample_duration_code,...,method_code,method,local_site_name,site_address,state,county,city,cbsa_code,cbsa,date_of_last_change
0,4,13,3002,42401,3,33.457970,-112.046590,NAD83,Sulfur dioxide,1,...,13.0,INSTRUMENTAL - CONDUCTIMETRIC,CENTRAL PHOENIX,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,38060,"Phoenix-Mesa-Scottsdale, AZ",2013-06-11
1,4,13,3002,42401,3,33.457970,-112.046590,NAD83,Sulfur dioxide,1,...,13.0,INSTRUMENTAL - CONDUCTIMETRIC,CENTRAL PHOENIX,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,38060,"Phoenix-Mesa-Scottsdale, AZ",2013-06-11
2,4,13,3002,42401,3,33.457970,-112.046590,NAD83,Sulfur dioxide,1,...,13.0,INSTRUMENTAL - CONDUCTIMETRIC,CENTRAL PHOENIX,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,38060,"Phoenix-Mesa-Scottsdale, AZ",2013-06-11
3,4,13,3002,42401,3,33.457970,-112.046590,NAD83,Sulfur dioxide,1,...,13.0,INSTRUMENTAL - CONDUCTIMETRIC,CENTRAL PHOENIX,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,38060,"Phoenix-Mesa-Scottsdale, AZ",2013-06-11
4,4,13,3002,42401,3,33.457970,-112.046590,NAD83,Sulfur dioxide,1,...,13.0,INSTRUMENTAL - CONDUCTIMETRIC,CENTRAL PHOENIX,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,38060,"Phoenix-Mesa-Scottsdale, AZ",2013-06-11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
895209,4,13,9997,88101,3,33.503833,-112.095767,WGS84,PM2.5 - Local Conditions,X,...,638.0,Teledyne T640X at 16.67 LPM w/Network Data Ali...,JLG SUPERSITE,4530 N 17TH AVENUE,Arizona,Maricopa,Phoenix,38060,"Phoenix-Mesa-Scottsdale, AZ",2024-10-17
895210,4,13,9997,88101,3,33.503833,-112.095767,WGS84,PM2.5 - Local Conditions,X,...,638.0,Teledyne T640X at 16.67 LPM w/Network Data Ali...,JLG SUPERSITE,4530 N 17TH AVENUE,Arizona,Maricopa,Phoenix,38060,"Phoenix-Mesa-Scottsdale, AZ",2024-10-17
895211,4,13,9997,88101,3,33.503833,-112.095767,WGS84,PM2.5 - Local Conditions,X,...,638.0,Teledyne T640X at 16.67 LPM w/Network Data Ali...,JLG SUPERSITE,4530 N 17TH AVENUE,Arizona,Maricopa,Phoenix,38060,"Phoenix-Mesa-Scottsdale, AZ",2024-10-17
895212,4,13,9997,88101,3,33.503833,-112.095767,WGS84,PM2.5 - Local Conditions,X,...,638.0,Teledyne T640X at 16.67 LPM w/Network Data Ali...,JLG SUPERSITE,4530 N 17TH AVENUE,Arizona,Maricopa,Phoenix,38060,"Phoenix-Mesa-Scottsdale, AZ",2024-10-17
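One detail worth keeping in mind: calling `dropna()` with no arguments removes a row when any column is null, not only `aqi`. The sketch below uses a toy frame with made-up values to contrast that with `dropna(subset=["aqi"])`, which would keep rows whose only missing value is in another column. It illustrates the pandas behavior and is not a change to the processing above.

```python
import numpy as np
import pandas as pd

# Toy frame: the last row has a valid aqi but a missing method_code (values are made up)
toy = pd.DataFrame({
    "aqi": [40.0, np.nan, 55.0],
    "method_code": [13.0, 13.0, np.nan],
})

dropped_any = toy.dropna()                # drops rows with a null in ANY column
dropped_aqi = toy.dropna(subset=["aqi"])  # drops only rows whose aqi is null

print(len(dropped_any), len(dropped_aqi))
```

Which variant is appropriate depends on whether rows with gaps in non-AQI columns should count toward the averages.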


Now that I have a cleaned, combined dataframe, I calculate the average AQI values to help with the comparison against the smoke estimate.
Note: I also calculate the monthly and daily average AQI in case I need them in later parts.

In [None]:
combined_df_no_duplicates['date_local'] = pd.to_datetime(combined_df_no_duplicates['date_local'])

# Here I extract the year, month, and day
combined_df_no_duplicates['year'] = combined_df_no_duplicates['date_local'].dt.year
combined_df_no_duplicates['month'] = combined_df_no_duplicates['date_local'].dt.month
combined_df_no_duplicates['day'] = combined_df_no_duplicates['date_local'].dt.day

# Here I am calculating the daily average AQI
daily_avg_aqi = combined_df_no_duplicates.groupby(['year', 'month', 'day'])['aqi'].mean().reset_index()
daily_avg_aqi = daily_avg_aqi.rename(columns={'aqi': 'daily_avg_aqi'})

# Here I am calculating the monthly average AQI
monthly_avg_aqi = combined_df_no_duplicates.groupby(['year', 'month'])['aqi'].mean().reset_index()
monthly_avg_aqi = monthly_avg_aqi.rename(columns={'aqi': 'monthly_avg_aqi'})

# Here I am calculating the yearly average AQI
yearly_avg_aqi = combined_df_no_duplicates.groupby(['year'])['aqi'].mean().reset_index()
yearly_avg_aqi = yearly_avg_aqi.rename(columns={'aqi': 'yearly_avg_aqi'})

# We merge the average AQI values back into the original DataFrame
combined_df_no_duplicates = pd.merge(combined_df_no_duplicates, daily_avg_aqi, on=['year', 'month', 'day'], how='left')
combined_df_no_duplicates = pd.merge(combined_df_no_duplicates, monthly_avg_aqi, on=['year', 'month'], how='left')
combined_df_no_duplicates = pd.merge(combined_df_no_duplicates, yearly_avg_aqi, on=['year'], how='left')

combined_df_no_duplicates.to_csv('./Processed Data/AQI_data.csv', index=False) # Save the final dataset
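The group-then-merge pattern above can be sanity-checked on a tiny frame. The sketch below uses made-up values and a hypothetical frame named `toy`; it also shows that `groupby(...).transform("mean")` produces the same per-row averages without the intermediate merges. This is an alternative formulation offered for comparison, not the method used in this notebook.

```python
import pandas as pd

# Toy records: two on Jan 1, one on Jan 2, one on Feb 1 (values are made up)
toy = pd.DataFrame({
    "date_local": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02", "2020-02-01"]),
    "aqi": [40.0, 60.0, 30.0, 80.0],
})
toy["year"] = toy["date_local"].dt.year
toy["month"] = toy["date_local"].dt.month
toy["day"] = toy["date_local"].dt.day

# Group-then-merge, as in the cell above
daily = (
    toy.groupby(["year", "month", "day"])["aqi"].mean()
    .reset_index()
    .rename(columns={"aqi": "daily_avg_aqi"})
)
merged = toy.merge(daily, on=["year", "month", "day"], how="left")

# Equivalent per-row averages with transform (no merge needed)
toy["daily_avg_aqi"] = toy.groupby(["year", "month", "day"])["aqi"].transform("mean")
toy["monthly_avg_aqi"] = toy.groupby(["year", "month"])["aqi"].transform("mean")
toy["yearly_avg_aqi"] = toy.groupby("year")["aqi"].transform("mean")

print(toy[["date_local", "aqi", "daily_avg_aqi", "monthly_avg_aqi", "yearly_avg_aqi"]])
```

In both formulations the mean is taken over rows, so every monitor record carries equal weight; a day with many records from one pollutant will lean toward that pollutant's AQI.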