# Frame the Problem

1. **What is the objective?**  
   To understand and quantify the causal relationship between air quality and health outcomes. It is well established that deteriorating air quality can exacerbate conditions such as asthma and heart disease. For point-in-time events (e.g., forest fires, storms) that cause rapid declines in air quality, can we quantify their causal impact on health outcomes? For example, how many incremental emergency room visits for asthma attacks or heart attacks can be causally linked to the LA fires of early 2025?

2. **How will the solution be used?**  
   For now, this will be a standalone data analysis. In theory, cities could use insights from such an analysis to better prepare for spikes in emergency room visits during and after air quality events like fires or storms. Additionally, as the frequency and severity of these events increase due to climate change, this analysis could help forecast the public health costs of inaction on climate issues.

3. **What are current solutions?**  
   A brief search reveals a few similar analyses using machine learning to predict hospital visit volumes based on environmental data such as air quality:

   - [**Predictive Analytics for Emergency Department Visits Based on Local Short-Term Pollution and Weather Exposure**](https://www.researchgate.net/publication/379227522_Predictive_analytics_for_Emergency_Department_visits_based_on_local_short-term_pollution_and_weather_exposure):  
     This study applies machine learning techniques to forecast daily emergency department (ED) visits and hospitalizations for cardiovascular and respiratory conditions in Brescia, Italy (2018–2022). The researchers used Random Forest and ARIMA models, incorporating local air pollution and weather data. They found that the Random Forest model more accurately captured overall trends compared to ARIMA. The findings suggest that integrating environmental data with machine learning can improve ED resource planning and patient care during environmental events.

   - [**Peak Outpatient and Emergency Department Visit Forecasting for Patients With Chronic Respiratory Diseases Using Machine Learning Methods: Retrospective Cohort Study**](https://medinform.jmir.org/2020/3/e13075/):  
     This retrospective study evaluates machine learning models for predicting peak outpatient and emergency department (OED) visits among patients with chronic respiratory diseases. The models combined air quality and weather data, using methods such as adaptive boosting-based neural networks, tree bagging, and random forests. The Random Forest model achieved the highest performance, with an AUC of 0.809. The study concludes that such models can effectively anticipate patient surges, aiding in proactive medical resource allocation.

4. **How should the problem be framed?**  
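   This is primarily a regression task, using ML methods to predict continuous outcomes such as daily emergency room visit counts. The causal analysis builds on it: a model trained on pre-event data predicts what the affected region's outcomes would have been without the event (the counterfactual), and the causal effect is the gap between observed and predicted outcomes. I plan to use a geo experiment or matched-market framework to construct that counterfactual from comparable, unaffected regions. The accuracy of the causal estimates hinges on the quality of the counterfactual.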

5. **How should performance be measured?**  
   In geo experiments, performance is typically evaluated using Minimum Detectable Effect (MDE) and statistical power:
   - **MDE:** The smallest effect size the experiment can reliably detect at a given significance level and power. A smaller MDE implies greater sensitivity, enabling the detection of subtle effects. It is influenced by factors like KPI variability, the number of geographic units, and experiment duration.
   - **Power:** The probability of detecting a true effect when one exists. A standard threshold is 80%, meaning there is an 80% chance of detecting a real effect at least as large as the MDE. Higher power reduces the risk of Type II errors (false negatives).

6. **What are comparable problems?**  
   Geo experiments are commonly used in marketing to estimate the incremental effect of ad spend on downstream outcomes (e.g., sales). While at Pinterest, I frequently analyzed geo experiment data. Several third-party vendors (e.g., Measured, Nielsen) offer geo experiment platforms, and Google and Meta have released open-source tools: `matched_markets` and `GeoLift`, respectively.

7. **Have any assumptions been made?**  
   Yes. The analysis assumes that data on environmental factors (e.g., air quality, forest fires, storms) and health outcomes (e.g., hospital visits) is readily accessible. It also assumes that the geo experiment framework is applicable to this type of data.
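
As a rough illustration of items 4 and 5, the sketch below fits a one-predictor linear model on pre-event data, projects a counterfactual for the event window, and derives an approximate MDE from the pre-period residual noise. It is a simplification under stated assumptions, not how `matched_markets` or `GeoLift` work internally: it uses a single control series, treats daily residuals as independent, and ignores uncertainty in the fitted coefficients. The function names and the numbers are hypothetical.

```python
from statistics import NormalDist, mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def estimate_lift(control_pre, treated_pre, control_post, treated_post):
    """Return (cumulative lift, residual SD) for the treated region.

    The model is fit on the pre-event period only, then applied to the
    control series during the event window to build the counterfactual.
    """
    a, b = fit_line(control_pre, treated_pre)
    residuals = [y - (a + b * x) for x, y in zip(control_pre, treated_pre)]
    # Residual SD with n - 2 degrees of freedom (two fitted coefficients).
    resid_sd = (sum(r * r for r in residuals) / (len(residuals) - 2)) ** 0.5
    counterfactual = [a + b * x for x in control_post]
    lift = sum(obs - cf for obs, cf in zip(treated_post, counterfactual))
    return lift, resid_sd


def minimum_detectable_effect(resid_sd, n_post_days, alpha=0.05, power=0.80):
    """Approximate MDE for the cumulative effect over the event window.

    Assumes independent daily errors, so the standard error of the summed
    effect is resid_sd * sqrt(n_post_days).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * resid_sd * n_post_days ** 0.5


# Made-up daily visit counts, purely to exercise the functions.
control_pre = [100, 104, 98, 110, 107, 101, 95]
treated_pre = [52, 55, 50, 57, 56, 51, 49]
control_post = [103, 99, 106]
treated_post = [70, 66, 75]

lift, resid_sd = estimate_lift(control_pre, treated_pre, control_post, treated_post)
mde = minimum_detectable_effect(resid_sd, len(control_post))
print(f"Estimated incremental visits: {lift:.1f} (approximate MDE: {mde:.1f})")
```

An estimated lift smaller than the MDE would suggest the design is too noisy to separate the event's effect from ordinary day-to-day variation.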

# Get the Data

**What data is needed?**  

This analysis requires several types of data:

1. **Environmental factors** — including temperature, humidity, air quality, and related metrics.  
    - I will use the EPA Air Quality System (AQS) to obtain this data. The AQS provides daily summary data on pollutants and meteorological conditions. This data is available through [pre-generated data files](https://aqs.epa.gov/aqsweb/airdata/download_files.html) and a [REST API](https://aqs.epa.gov/aqsweb/documents/data_api.html). I will use the API, as it offers more flexibility and supports reproducible workflows.
2. **Health outcomes** — specifically hospital visits, with detailed information on the reasons for each visit.

Ideally, the data should be as granular as possible—preferably at the daily level—since geo experiments rely on time series data.

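```python
import os

import requests
from dotenv import load_dotenv

# Read EMAIL from a local .env file so the address stays out of the notebook.
load_dotenv()

email = os.getenv("EMAIL")
if not email:
    raise RuntimeError("EMAIL is not set; add it to your .env file.")

# The signup endpoint emails an API key to the given address.
url = "https://aqs.epa.gov/data/api/signup"
params = {"email": email}

response = requests.get(url, params=params, timeout=30)

if response.status_code == 200:
    print("Successfully signed up for the EPA AQS API.")
else:
    print(f"Failed to sign up for the EPA AQS API. Status code: {response.status_code}")
    print("Response:", response.text)
```

```
Successfully signed up for the EPA AQS API.
```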
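
Once the key arrives by email, daily summaries can be requested from the same API. The sketch below is one possible way to do that, not a tested pipeline: it assumes the `dailyData/byCounty` endpoint, parameter code `88101` (PM2.5), FIPS codes `06` and `037` for Los Angeles County, and the response fields `Data`, `date_local`, and `arithmetic_mean`. The helper names are hypothetical, and it uses the standard library's `urllib` rather than `requests` only to stay dependency-free.

```python
import json
from collections import defaultdict
from urllib.parse import urlencode
from urllib.request import urlopen

AQS_BASE = "https://aqs.epa.gov/data/api"


def daily_by_county_url(email, key, param, bdate, edate, state, county):
    """Build a dailyData/byCounty request URL (dates as YYYYMMDD strings)."""
    query = urlencode({
        "email": email,
        "key": key,
        "param": param,    # pollutant parameter code, e.g. 88101 for PM2.5
        "bdate": bdate,
        "edate": edate,
        "state": state,    # two-digit state FIPS code
        "county": county,  # three-digit county FIPS code
    })
    return f"{AQS_BASE}/dailyData/byCounty?{query}"


def fetch_rows(url, timeout=60):
    """Download one response and return its list of daily summary rows."""
    with urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload.get("Data", [])


def county_daily_mean(rows):
    """Average the reported daily means across all rows for each date.

    This is a crude aggregate: a county can have several monitors, and a
    monitor can report more than one row per day.
    """
    by_day = defaultdict(list)
    for row in rows:
        value = row.get("arithmetic_mean")
        if value is not None:
            by_day[row["date_local"]].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(by_day.items())}
```

A call might look like `county_daily_mean(fetch_rows(daily_by_county_url(email, key, "88101", "20250101", "20250131", "06", "037")))`, where `key` would be read from the environment in the same way as `EMAIL`.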
