# Course Project: Wildfire Analysis (Part 1: Common Analysis)

## Request data from the US Environmental Protection Agency (EPA) Air Quality Service (AQS) API.

More and more frequently summers in the western US have been characterized by wildfires with smoke billowing across multiple western states. There are many proposed causes for this: climate change, US Forestry policy, growing awareness, just to name a few. Regardless of the cause, the impact of wildland fires is widespread as wildfire smoke reduces the air quality of many cities. There is a growing body of work pointing to the negative impacts of smoke on health, tourism, property, and other aspects of society.

The Course Project is designed to analyze the impacts of wildfires on specific cities in the United States, particularly focusing on the effects of wildfire smoke and its implications for public health, air quality, and overall community well-being.

This is the first step in the project, where all the students conduct a base analysis using a shared dataset while focusing on a unique city. The aim is to create an understanding of wildfire impacts tailored to local contexts.

In this notebook, we will request data from the US Environmental Protection Agency (EPA) Air Quality Service (AQS) API.

This is a historical API and does not provide real-time air quality data. The [documentation](https://aqs.epa.gov/aqsweb/documents/data_api.html) for the API provides definitions of the different call parameter and examples of the various calls that can be made to the API.

This notebook works systematically, requesting an API key, using 'list' to get various IDs and parameter values, and using 'daily summary' to get summary data that meets specific condistions. Changing values to explore the results of the API is probably useful, but that will result in some explanations being out of sync with the outputs.

The US EPA was created in the early 1970's. The EPA reports that they only started broad based monitoring with standardized quality assurance procedures in the 1980's. Many counties will have data starting somewhere between 1983 and 1988. 

The end goal of this notebook is to get to some values that we might use for the Air Quality Index or AQI. The AQI index is meant to tell us something about how healthy or clean the air is on any day. The AQI is actually a somewhat complext measure. 

Other references:  [how to calculate the AQI](https://www.airnow.gov/sites/default/files/2020-05/aqi-technical-assistance-document-sept2018.pdf).

**License**:
Many of the snippets used here was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - August 16, 2024



## 1. Import libraries and required dependencies

In [30]:
# These are standard python modules
import os, json, time

# The 'requests' module is a distribution module for making web requests. If you do not have it already, you'll need to install it
import requests

# suppress the warnings
import warnings
warnings.filterwarnings("ignore")

# import pandas for the data processing step
import pandas as pd

In [19]:
#    CONSTANTS
#    This is the root of all AQS API URLs
API_REQUEST_URL = 'https://aqs.epa.gov/data/api'

#    These are some of the 'actions' we can ask the API to take or requests that we can make of the API
#    Sign-up request - generally only performed once - unless you lose your key
API_ACTION_SIGNUP = '/signup?email={email}'

#    List actions provide information on API parameter values that are required by some other actions/requests
API_ACTION_LIST_CLASSES = '/list/classes?email={email}&key={key}'
API_ACTION_LIST_PARAMS = '/list/parametersByClass?email={email}&key={key}&pc={pclass}'
API_ACTION_LIST_SITES = '/list/sitesByCounty?email={email}&key={key}&state={state}&county={county}'

#    Monitor actions are requests for monitoring stations that meet specific criteria
API_ACTION_MONITORS_COUNTY = '/monitors/byCounty?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&state={state}&county={county}'
API_ACTION_MONITORS_BOX = '/monitors/byBox?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&minlat={minlat}&maxlat={maxlat}&minlon={minlon}&maxlon={maxlon}'

#    Summary actions are requests for summary data. These are for daily summaries
API_ACTION_DAILY_SUMMARY_COUNTY = '/dailyData/byCounty?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&state={state}&county={county}'
API_ACTION_DAILY_SUMMARY_BOX = '/dailyData/byBox?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&minlat={minlat}&maxlat={maxlat}&minlon={minlon}&maxlon={maxlon}'

#    It is always nice to be respectful of a free data resource.
#    We're going to observe a 100 requests per minute limit - which is fairly nice
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

#    This is a template that covers most of the parameters for the actions we might take, from the set of actions
#    above. In the examples below, most of the time parameters can either be supplied as individual values to a
#    function - or they can be set in a copy of the template and passed in with the template.

AQS_REQUEST_TEMPLATE = {
    "email":      "",     
    "key":        "",      
    "state":      "",     # the two digit state FIPS # as a string
    "county":     "",     # the three digit county FIPS # as a string
    "begin_date": "",     # the start of a time window in YYYYMMDD format
    "end_date":   "",     # the end of a time window in YYYYMMDD format, begin_date and end_date must be in the same year
    "minlat":    0.0,
    "maxlat":    0.0,
    "minlon":    0.0,
    "maxlon":    0.0,
    "param":     "",     # a list of comma separated 5 digit codes, max 5 codes requested
    "pclass":    ""      # parameter class is only used by the List calls
}

# A dictionary of the city location from the US west coast state.
CITY_LOCATIONS = {
    'vancouver' :   {'city'   : 'Vancouver',
                    'county' : 'Clark',
                    'state'  : 'Washington',
                    'fips'   : '53011',
                    'latlon' : [45.64, -122.60] }
}

I have been assigned to analyse the city: Vancouver, WA for my analysis
- 2023 estimate: 196442
- 2020 census: 190915
- 2020 density: 3920
- Latitude/Longitude: 45.64°N 122.60°W

## 2. Making a sign-up request

Before using the API, we need to request a key. We will use an email address to make the request. The EPA then sends a confirmation email link and a 'key' that we need to use for all other requests.

You only need to sign-up once, unless you want to invalidate your current key (by getting a new key) or you lose your key.

In [4]:
#    This implements the sign-up request. The parameters are standardized so that this function definition matches
#    all of the others. However, the easiest way to call this is to simply call this function with your preferred
#    email address.
def request_signup(email_address = None,
                   endpoint_url = API_REQUEST_URL, 
                   endpoint_action = API_ACTION_SIGNUP, 
                   request_template = AQS_REQUEST_TEMPLATE,
                   headers = None):
    
    # Make sure we have a string - if you don't have access to this email addres, things might go badly for you
    if email_address:
        request_template['email'] = email_address        
    
    if not request_template['email']: 
        raise Exception("Must supply an email address to call 'request_signup()'")

    if '@' not in request_template['email']: 
        raise Exception(f"Must supply an email address to call 'request_signup()'. The string '{request_template['email']}' does not look like an email address.")

    # Compose the signup url - create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_action.format(**request_template)
        
    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


A SIGNUP request is only to be done once, to request a key. A key is sent to that email address and needs to be confirmed with a click through. This code should probably be commented out after you've made your key request to make sure you don't accidentally make a new sign-up request


In [5]:
# print("Requesting SIGNUP ...")
# USERNAME = "pj2901@uw.edu"
# response = request_signup(USERNAME)
# print(json.dumps(response,indent=4))

Requesting SIGNUP ...
{
    "Header": [
        {
            "status": "Success",
            "request_time": "2024-10-30T15:49:37-04:00",
            "url": "https://aqs.epa.gov/data/api/signup?email=pj2901@uw.edu"
        }
    ],
    "Data": [
        "You should receive a registration confirmation email with a link for confirming your email shortly."
    ]
}


Assign your email address to "USERNAME" and your key to "APIKEY" as constants and the remaining cells in the notbook should work.

In [7]:
# Once we have the signup email, we can define two constants:
#   USERNAME - This should be the email address you sent the EPA asking for access to the API during sign-up
#   APIKEY   - This should be the authorization key they sent you

# Specify the values as constants below. Just don't distribute the notebook without removing the
# constants or you'll be distributing your key too.

# USERNAME = "pj2901@uw.edu"
# APIKEY = <key>


## 3. Making a list request

Once we have a key, the next thing is to get information about the different types of air quality monitoring (sensors) and the different places where we might find air quality stations. The monitoring system is complex and changes all the time. The EPA implementation allows an API user to find changes to monitoring sites and sensors by making requests - maybe monthly, or daily. This API approach is probably better than having the EPA publish documentation that may be out of date as soon as it hits a web page. The one problem here is that some of the responses rely on jargon or terms-of-art. That is, one needs to know a bit about the way atmospheric sciece works to understand some of the terms. ... Good thing we can use the web to search for terms we don't know!

In [8]:
#    This implements the list request. There are several versions of the list request that only require email and key.
#    This code sets the default action/requests to list the groups or parameter class descriptors. Having those descriptors 
#    allows one to request the individual (proprietary) 5 digit codes for individual air quality measures by using the
#    param request. Some code in later cells will illustrate those requests.
def request_list_info(email_address = None, key = None,
                      endpoint_url = API_REQUEST_URL, 
                      endpoint_action = API_ACTION_LIST_CLASSES, 
                      request_template = AQS_REQUEST_TEMPLATE,
                      headers = None):
    
    #  Make sure we have email and key - at least
    #  This prioritizes the info from the call parameters - not what's already in the template
    if email_address:
        request_template['email'] = email_address
    if key:
        request_template['key'] = key
    
    # For the basic request we need an email address and a key
    if not request_template['email']:
        raise Exception("Must supply an email address to call 'request_list_info()'")
    if not request_template['key']: 
        raise Exception("Must supply a key to call 'request_list_info()'")

    # compose the request
    request_url = endpoint_url+endpoint_action.format(**request_template)
        
    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [9]:
#   The default should get us a list of the various groups or classes of sensors. These classes are user defined names for clustors of
#   sensors that might be part of a package or default air quality sensing station. We need a class name to start getting down to the
#   a sensor ID. Each sensor type has an ID number. We'll eventually need those ID numbers to be able to request values that come from
#   that specific sensor.
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY

response = request_list_info(request_template=request_data)

if response["Header"][0]['status'] == "Success":
    print(json.dumps(response['Data'],indent=4))
else:
    print(json.dumps(response,indent=4))

[
    {
        "code": "AIRNOW MAPS",
        "value_represented": "The parameters represented on AirNow maps (88101, 88502, and 44201)"
    },
    {
        "code": "ALL",
        "value_represented": "Select all Parameters Available"
    },
    {
        "code": "AQI POLLUTANTS",
        "value_represented": "Pollutants that have an AQI Defined"
    },
    {
        "code": "CORE_HAPS",
        "value_represented": "Urban Air Toxic Pollutants"
    },
    {
        "code": "CRITERIA",
        "value_represented": "Criteria Pollutants"
    },
    {
        "code": "CSN DART",
        "value_represented": "List of CSN speciation parameters to populate the STI DART tool"
    },
    {
        "code": "FORECAST",
        "value_represented": "Parameters routinely extracted by AirNow (STI)"
    },
    {
        "code": "HAPS",
        "value_represented": "Hazardous Air Pollutants"
    },
    {
        "code": "IMPROVE CARBON",
        "value_represented": "IMPROVE Carbon Parameters"
    }

We're interested in getting to something that might be the Air Quality Index (AQI). You see this reported on the news - often around smog values, but also when there is smoke in the sky. The AQI is a complex measure of different gasses and of the particles in the air (dust, dirt, ash ...).

From the list produced by our 'list/Classes' request above, it looks like there is a class of sensors called "AQI POLLUTANTS". Let's try to get a list of those specific sensors and see what we can get from those.

In [10]:
#
#   Once we have a list of the classes or groups of possible sensors, we can find the sensor IDs that make up that class (group)
#   The one that looks to be associated with the Air Quality Index is "AQI POLLUTANTS"
#   We'll use that to make another list request.
#
AQI_PARAM_CLASS = "AQI POLLUTANTS"

In [11]:
#   Structure a request to get the sensor IDs associated with the AQI
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY
request_data['pclass'] = AQI_PARAM_CLASS  # here we specify that we want this 'pclass' or parameter classs

response = request_list_info(request_template=request_data, endpoint_action=API_ACTION_LIST_PARAMS)

if response["Header"][0]['status'] == "Success":
    print(json.dumps(response['Data'],indent=4))
else:
    print(json.dumps(response,indent=4))

[
    {
        "code": "42101",
        "value_represented": "Carbon monoxide"
    },
    {
        "code": "42401",
        "value_represented": "Sulfur dioxide"
    },
    {
        "code": "42602",
        "value_represented": "Nitrogen dioxide (NO2)"
    },
    {
        "code": "44201",
        "value_represented": "Ozone"
    },
    {
        "code": "81102",
        "value_represented": "PM10 Total 0-10um STP"
    },
    {
        "code": "88101",
        "value_represented": "PM2.5 - Local Conditions"
    },
    {
        "code": "88502",
        "value_represented": "Acceptable PM2.5 AQI & Speciation Mass"
    }
]


We should now have (above) a response containing a set of sensor ID numbers. The list should include the sensor numbers as well as a description or name for each sensor. 

The EPA AQS API has limits on some call parameters. Specifically, when we request data for sensors we can only specify a maximum of 5 different sensor values to return. This means we cannot get all of the Air Quality Index parameters in one request for data. We have to break it up.

We broke up the request into two logical groups, the AQI sensors that sample gasses and the AQI sensors that sample particles in the air.

In [13]:
#   Given the set of sensor codes, now we can create a parameter list or 'param' value as defined by the AQS API spec.
#   It turns out that we want all of these measures for AQI, but we need to have two different param constants to get
#   all seven of the code types. We can only have a max of 5 sensors/values request per param.

#   Gaseous AQI pollutants CO, SO2, NO2, and O2
AQI_PARAMS_GASEOUS = "42101,42401,42602,44201"

#   Particulate AQI pollutants PM10, PM2.5, and Acceptable PM2.5
AQI_PARAMS_PARTICULATES = "81102,88101,88502"

Air quality monitoring stations are located all over the US at different locations. 

This list includes the [FIPS](https://www.census.gov/library/reference/code-lists/ansi.html) number for the state and county as a 5 digit string. This format, the 5 digit string, is a 'old' format that is still widely used. There are new codes that may eventually be adopted for the US government information systems. But FIPS is currently what the AQS uses, so that's what is in the list as the constant.

Given our CITY_LOCATIONS constant we can now find which monitoring locations are nearby. One option is to use the county to define the area we're interest in. You can get the EPA to list their monitoring stations by county. 

In [20]:
#
#  This list request should give us a list of all the monitoring stations in the county specified by the
#  given city selected from the CITY_LOCATIONS dictionary
#
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY
request_data['state'] = CITY_LOCATIONS['vancouver']['fips'][:2]   # the first two digits (characters) of FIPS is the state code
request_data['county'] = CITY_LOCATIONS['vancouver']['fips'][2:]  # the last three digits (characters) of FIPS is the county code

response = request_list_info(request_template=request_data, endpoint_action=API_ACTION_LIST_SITES)

if response["Header"][0]['status'] == "Success":
    print(json.dumps(response['Data'],indent=4))
else:
    print(json.dumps(response,indent=4))


[
    {
        "code": "0001",
        "value_represented": null
    },
    {
        "code": "0002",
        "value_represented": null
    },
    {
        "code": "0003",
        "value_represented": null
    },
    {
        "code": "0004",
        "value_represented": null
    },
    {
        "code": "0005",
        "value_represented": null
    },
    {
        "code": "0006",
        "value_represented": null
    },
    {
        "code": "0007",
        "value_represented": null
    },
    {
        "code": "0008",
        "value_represented": null
    },
    {
        "code": "0009",
        "value_represented": "19912 NE 164TH ST HOCKINSON SCHOOL IN BRUSH PRAIRIE, WA"
    },
    {
        "code": "0010",
        "value_represented": null
    },
    {
        "code": "0011",
        "value_represented": "VANCOUVER - BLAIRMONT DR"
    },
    {
        "code": "0012",
        "value_represented": null
    },
    {
        "code": "0013",
        "value_represented": "VANCOUVER -

The above response gives us a list of monitoring stations. Each monitoring station has a unique "code" which is a string number, and, sometimes, a description. The description seems to be something about where the monitoring station is located.


## 4. Making a daily summary request

The function below is designed to encapsulate requests to the EPA AQS API. When calling the function one should create/copy a parameter template, then initialize that template with values that won't change with each call. Then on each call simply pass in the parameters that need to change, like date ranges.

In [21]:
#    This implements the daily summary request. Daily summary provides a daily summary value for each sensor being requested
#    from the start date to the end date. 
#    Like the two other functions, this can be called with a mixture of a defined parameter dictionary, or with function
#    parameters. If function parameters are provided, those take precedence over any parameters from the request template.
def request_daily_summary(email_address = None, key = None, param=None,
                          begin_date = None, end_date = None, fips = None,
                          endpoint_url = API_REQUEST_URL, 
                          endpoint_action = API_ACTION_DAILY_SUMMARY_COUNTY, 
                          request_template = AQS_REQUEST_TEMPLATE,
                          headers = None):
    
    #  This prioritizes the info from the call parameters - not what's already in the template
    if email_address:
        request_template['email'] = email_address
    if key:
        request_template['key'] = key
    if param:
        request_template['param'] = param
    if begin_date:
        request_template['begin_date'] = begin_date
    if end_date:
        request_template['end_date'] = end_date
    if fips and len(fips)==5:
        request_template['state'] = fips[:2]
        request_template['county'] = fips[2:]            

    # Make sure there are values that allow us to make a call - these are always required
    if not request_template['email']:
        raise Exception("Must supply an email address to call 'request_daily_summary()'")
    if not request_template['key']: 
        raise Exception("Must supply a key to call 'request_daily_summary()'")
    if not request_template['param']: 
        raise Exception("Must supply param values to call 'request_daily_summary()'")
    if not request_template['begin_date']: 
        raise Exception("Must supply a begin_date to call 'request_daily_summary()'")
    if not request_template['end_date']: 
        raise Exception("Must supply an end_date to call 'request_daily_summary()'")
    # Note we're not validating FIPS fields because not all of the daily summary actions require the FIPS numbers
        
    # compose the request
    request_url = endpoint_url+endpoint_action.format(**request_template)
        
    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response



The form of the daily summary response is a bit verbose with lots of repeated values. What we'll do is create a data structure that relies on a hierarchical context to summarize the data.

The next function takes the response and a set of fields that should be extracted for their data values. The code assumes those fields are available. If there are missing values something could certainly go wrong. The function creates a summary for each monitoring site.

In [24]:
#    This is a list of field names - data - that will be extracted from each record
EXTRACTION_FIELDS = ['sample_duration','observation_count','arithmetic_mean','aqi']

#    The function creates a summary record
def extract_summary_from_response(r=None, fields=EXTRACTION_FIELDS):
    ## the result will be structured around monitoring site, parameter, and then date
    result = dict()
    data = r["Data"]
    for record in data:
        # make sure the record is set up
        site = record['site_number']
        param = record['parameter_code']
        #date = record['date_local']    # this version keeps the respnse value YYYY-
        date = record['date_local'].replace('-','') # this puts it in YYYYMMDD format
        if site not in result:
            result[site] = dict()
            result[site]['local_site_name'] = record['local_site_name']
            result[site]['site_address'] = record['site_address']
            result[site]['state'] = record['state']
            result[site]['county'] = record['county']
            result[site]['city'] = record['city']
            result[site]['pollutant_type'] = dict()
        if param not in result[site]['pollutant_type']:
            result[site]['pollutant_type'][param] = dict()
            result[site]['pollutant_type'][param]['parameter_name'] = record['parameter']
            result[site]['pollutant_type'][param]['units_of_measure'] = record['units_of_measure']
            result[site]['pollutant_type'][param]['method'] = record['method']
            result[site]['pollutant_type'][param]['data'] = dict()
        if date not in result[site]['pollutant_type'][param]['data']:
            result[site]['pollutant_type'][param]['data'][date] = list()
        
        # now extract the specified fields
        extract = dict()
        for k in fields:
            if str(k) in record:
                extract[str(k)] = record[k]
            else:
                # this makes sure we always have the requested fields, even if
                # we have a missing value for a given day/month
                extract[str(k)] = None
        
        # add this extraction to the list for the day
        result[site]['pollutant_type'][param]['data'][date].append(extract)
    
    return result


#### 4.1 Get AQI Particulates data


In [28]:
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY
request_data['param'] = AQI_PARAMS_PARTICULATES
request_data['state'] = CITY_LOCATIONS['vancouver']['fips'][:2]
request_data['county'] = CITY_LOCATIONS['vancouver']['fips'][2:]

# Initialize an empty list to store the results
particulate_data = []

# Define the years you want to retrieve data for
start_year = 1961
end_year = 2024

# Loop through the years and request daily summary data
for year in range(start_year, end_year + 1):
    begin_date = f"{year}0101"
    end_date = f"{year}1231"

    # Make the request for the current year
    particulate_aqi = request_daily_summary(request_template=request_data, begin_date=begin_date, end_date=end_date)
    # print(json.dumps(particulate_aqi['Data'],indent=4))

    # check the response
    if particulate_aqi["Header"][0]['status'] == "Success":
        extract_particulate = extract_summary_from_response(particulate_aqi)
        particulate_data.append(extract_particulate)

    elif particulate_aqi["Header"][0]['status'].startswith("No data "):
        print("No date for the year: ", year)
    
    else:
        print("Error in getting the data for year: ", year)
        # print(json.dumps(particulate_aqi,indent=4))

No date for the year:  1961
No date for the year:  1962
No date for the year:  1963
No date for the year:  1964
No date for the year:  1965
No date for the year:  1966
No date for the year:  1967
No date for the year:  1968
No date for the year:  1969
No date for the year:  1970
No date for the year:  1971
No date for the year:  1972
No date for the year:  1973
No date for the year:  1974
No date for the year:  1975
No date for the year:  1976
No date for the year:  1977
No date for the year:  1978
No date for the year:  1979
No date for the year:  1980
No date for the year:  1981
No date for the year:  1982
No date for the year:  1983
No date for the year:  1984
No date for the year:  1985
No date for the year:  1986
No date for the year:  1987
No date for the year:  1988
No date for the year:  1989


In [31]:
# Write the data stored in 'particulate_data' to the specified JSON file
output_file_path = "../intermediate_data/aqi_particulate_data_raw.json"
os.makedirs(os.path.dirname(output_file_path), exist_ok=True)

with open(output_file_path, 'w') as file:
    json.dump(particulate_data, file, indent=4)

print("Particulate data saved successfully!")

Particulate data saved successfully!


### 4.2 Get AQI Gaseous data

In [38]:
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY
request_data['param'] = AQI_PARAMS_GASEOUS
request_data['state'] = CITY_LOCATIONS['vancouver']['fips'][:2]
request_data['county'] = CITY_LOCATIONS['vancouver']['fips'][2:]

# Initialize an empty list to store the results
gaseous_data = []

# Define the years you want to retrieve data for
start_year = 1961
end_year = 2024

# Loop through the years and request daily summary data
for year in range(start_year, end_year + 1):
    begin_date = f"{year}0101"
    end_date = f"{year}1231"

    # Make the request for the current year
    gaseous_aqi = request_daily_summary(request_template=request_data, begin_date=begin_date, end_date=end_date)
    # print(json.dumps(particulate_aqi['Data'],indent=4))

    # check the response
    if gaseous_aqi["Header"][0]['status'] == "Success":
        extract_gaseous = extract_summary_from_response(gaseous_aqi)
        gaseous_data.append(extract_gaseous)

    elif gaseous_aqi["Header"][0]['status'].startswith("No data "):
        print("No date for the year: ", year)
    
    else:
        print("Error in getting the data for year: ", year)
        # print(json.dumps(particulate_aqi,indent=4))

No date for the year:  1961
No date for the year:  1962
No date for the year:  1963
No date for the year:  1964
No date for the year:  1965
No date for the year:  1966
No date for the year:  1967
No date for the year:  1968
No date for the year:  1969
No date for the year:  1970
No date for the year:  1971


In [39]:
# Write the data stored in 'gaseous_data' to the specified JSON file
output_file_path = "../data/intermediate_data/aqi_gaseous_data_raw.json"
os.makedirs(os.path.dirname(output_file_path), exist_ok=True)

with open(output_file_path, 'w') as file:
    json.dump(particulate_data, file, indent=4)

print("Particulate data saved successfully!")

Particulate data saved successfully!


### 4.3 Data Preprocessing

In [40]:
def extract_aqi(data_dict, additional_keys=None):
    if additional_keys is None:
        additional_keys = []

    aqi_data_list = []

    for item_key, item_value in data_dict.items():
        # create the current key path by appending the current key to the list of additional keys
        current_key_path = additional_keys + [item_key]

        # if the current value is a dictionary, call extract_aqi
        if isinstance(item_value, dict):
            aqi_data_list.extend(extract_aqi(item_value, current_key_path))
        
        # else, if the current value is a list, iterate through the list items
        elif isinstance(item_value, list):
            for sub_item in item_value:
                if isinstance(sub_item, dict):  # check if the list item is a dictionary
                    aqi_value = sub_item.get("aqi") # extract AQI valuea
                    sample_duration_value = sub_item.get("sample_duration") # get the sample duration values

                    # If AQI is present, add the data to the list
                    if aqi_value is not None:
                        aqi_data_list.append({
                            "keys": current_key_path,
                            "sample_duration": sample_duration_value,
                            "aqi": aqi_value
                        })

    return aqi_data_list


Now we will extract and process AQI data from a list of datasets related to particulate matter.

In [41]:
extracted_particulate_data = []

for dataset in particulate_data:
    extracted_particulate_data.extend(extract_aqi(dataset))

# process the extracted AQI data
for entry in extracted_particulate_data:
    key_hierarchy = " > ".join(entry["keys"])
    aqi_value = entry["aqi"]
    sample_duration_value = entry['sample_duration']
    # print(f"Keys: {key_hierarchy}, AQI: {aqi_value}, Sample Duration: {sample_duration_value}")

# print(extracted_particulate_data)


Now we will extract and process AQI data from a list of datasets related to gaseous matter.

In [42]:
extracted_gaseous_data = []

for dataset in gaseous_data:
    extracted_gaseous_data.extend(extract_aqi(dataset))

# process the extracted AQI data
for entry in extracted_gaseous_data:
    key_hierarchy = " > ".join(entry["keys"])
    aqi_value = entry["aqi"]
    sample_duration_value = entry['sample_duration']
    # print(f"Keys: {key_hierarchy}, AQI: {aqi_value}, Sample Duration: {sample_duration_value}")

# print(extracted_gaseous_data)

Now we will processes the extracted AQI data to create a dataFrame that organizes information by date, AQI value, sample duration, and pollutant type. I am more used to performing analysis using Pandas, so this helps me do further analysis.

First, we will do for the particulate matter.

In [43]:
aqi_data = []

for record in extracted_particulate_data:
    key = record['keys']
    date = key[-1]

    aqi = record['aqi']
    duration_of_sample = record['sample_duration']
    type_of_pollutant = key[-3]

    aqi_data.append({
        'date': date,
        'AQI': aqi,
        'sample_duration': duration_of_sample,
        'pollutant_type': type_of_pollutant
    })

particulate_df = pd.DataFrame(aqi_data)
particulate_df['date'] = pd.to_datetime(particulate_df['date'], format='%Y%m%d')
particulate_df.head()

Unnamed: 0,date,AQI,sample_duration,pollutant_type
0,1990-02-04,9,24 HOUR,81102
1,1990-02-10,6,24 HOUR,81102
2,1990-02-16,12,24 HOUR,81102
3,1990-02-22,19,24 HOUR,81102
4,1990-02-28,19,24 HOUR,81102


Now we will do the same for gaseous matter.

In [44]:
aqi_data = []

for record in extracted_gaseous_data:
    key = record['keys']
    date = key[-1]

    aqi = record['aqi']
    duration_of_sample = record['sample_duration']
    type_of_pollutant = key[-3]

    aqi_data.append({
        'date': date,
        'AQI': aqi,
        'sample_duration': duration_of_sample,
        'pollutant_type': type_of_pollutant
    })

gaseous_df = pd.DataFrame(aqi_data)
gaseous_df['date'] = pd.to_datetime(particulate_df['date'], format='%Y%m%d')
gaseous_df.head()

Unnamed: 0,date,AQI,sample_duration,pollutant_type
0,1990-02-04,29,1 HOUR,42401
1,1990-02-10,29,1 HOUR,42401
2,1990-02-16,29,1 HOUR,42401
3,1990-02-22,29,1 HOUR,42401
4,1990-02-28,14,1 HOUR,42401


In [46]:
# Save the cleaned files 
particulate_df.to_csv("../data/intermediate_data/aqi_particulate_data_clean.csv")
gaseous_df.to_csv("../data/intermediate_data/aqi_gaseous_data_clean.csv")

### 4.4 Combine the particulate and gaseous matter data to obtain the AQI dataset

I want to combine and analyze air quality data from particulate_df and gaseous_df. Since I want to highlight peak pollution values and understand the highest levels of air quality degradation associated with your smoke estimate in the next step, I will take sum of either of the two values. If suppose particulate data is not available, it will be filled with the dataset mean.

In [66]:
particulate_df = particulate_df.drop(['sample_duration', 'pollutant_type'], axis=1)
particulate_df_by_date = particulate_df.groupby('date', as_index=False).mean()
particulate_df_by_date.head()

Unnamed: 0,date,AQI
0,1990-02-04,9.0
1,1990-02-10,6.0
2,1990-02-16,12.0
3,1990-02-22,19.0
4,1990-02-28,19.0


In [67]:
gaseous_df = gaseous_df.drop(['sample_duration', 'pollutant_type'], axis=1)
gaseous_df_by_date = gaseous_df.groupby('date', as_index=False).mean()
gaseous_df_by_date.head()

Unnamed: 0,date,AQI
0,1990-02-04,29.0
1,1990-02-10,29.0
2,1990-02-16,29.0
3,1990-02-22,29.0
4,1990-02-28,14.0


In [76]:
# merge the dataframes on 'date' and obtain the sum of AQI
final_aqi_df_by_date = gaseous_df_by_date.merge(particulate_df_by_date, on='date', how='outer', suffixes=('_gaseous', '_particulate'))

# calculate the total AQI by summing the two types of AQI
final_aqi_df_by_date['AQI'] = final_aqi_df_by_date[['AQI_gaseous', 'AQI_particulate']].sum(axis=1)

# replace empty values with the mean of available AQI data
mean_aqi = final_aqi_df_by_date['AQI'].mean()
final_aqi_df_by_date['AQI'].fillna(mean_aqi, inplace=True)
final_aqi_df_by_date.drop(columns=['AQI_gaseous', 'AQI_particulate'], inplace=True)

final_aqi_df_by_date['date'] = pd.to_datetime(final_aqi_df_by_date['date'])
final_aqi_df_by_date.head()

Unnamed: 0,date,AQI
0,1990-02-04,38.0
1,1990-02-10,35.0
2,1990-02-16,41.0
3,1990-02-22,48.0
4,1990-02-28,33.0


In [77]:
# save the dataframe for future reference
final_aqi_df_by_date.to_csv("../data/intermediate_data/aqi_data_grouped_by_date.csv")

We will be comparing the AQI data with the smoke estimate we created in the previous step. Since we have grouped our smoke estimate data by year, we will do the same for AQI values as well. I am getting the mean AQI values. 

Further, I am saving the dataset for future reference.

In [78]:
final_aqi_df_by_date['year'] = final_aqi_df_by_date['date'].dt.year # get the year from date

final_aqi_df_by_year = final_aqi_df_by_date.groupby('year')['AQI'].mean().reset_index() # group by 'year' and calculate the mean AQI

final_aqi_df_by_year.head()

Unnamed: 0,year,AQI
0,1990,54.318966
1,1991,52.859375
2,1992,56.285714
3,1993,47.42623
4,1994,49.564516


In [79]:
# save the dataframe for future reference
final_aqi_df_by_year.to_csv("../data/intermediate_data/aqi_data_grouped_by_year.csv")