# Prediction of Nighttime NO2

## Background

### Names and Acronyms
1) ASDC: [Atmospheric Science Data Center](https://asdc.larc.nasa.gov/about)
1) PGN (Pandora): [Pandonia Global Network](https://www.pandonia-global-network.org/) / [Pandora](https://pandora.gsfc.nasa.gov/About/)
    - **NOTE**: NASA's portion of the PGN is known as Pandora. Within the scope of this notebook, Pandora and PGN may be used interchangeably, as this project only uses data from NASA's PGN sites.
1) TEMPO: [Tropospheric Emissions: Monitoring of Pollution](https://science.nasa.gov/mission/tempo/)

### Resources
1) ASDC Data Processing Tool (Version 1)
    - This notebook was published by the ASDC and provides examples of how to correctly load and use Pandora and TEMPO data.
    - https://github.com/nasa/ASDC_Data_and_User_Services/blob/main/TEMPO/additional_drafts/ASDC_Data_Processing_ML_v1.2.ipynb
1) PGN Station Map
    - A map showing the locations of all PGN ground sites.
    - https://blickm.hetzner.pandonia-global-network.org/livemaps/pgn_stationsmap.png


This notebook borrows heavily from, and extends the functionality of, the NASA ASDC Data and User Services notebook found here:

https://github.com/nasa/ASDC_Data_and_User_Services/blob/main/TEMPO/additional_drafts/ASDC_Data_Processing_ML_v1.2.ipynb

This notebook tests the hypothesis that a model can be built from Pandora data to predict nighttime NO<sub>2</sub>, and that the model can then be applied to TEMPO daytime measurements to predict nighttime NO<sub>2</sub> for any location covered by TEMPO.



## 0. Environment Setup

### Environment Setup
Many tools, such as [poetry](https://python-poetry.org/) and [uv](https://docs.astral.sh/uv/), simplify and speed up environment setup. For simplicity, this guide covers only the method built into the Python standard library.
1) Install [Python 3.11](https://www.python.org/downloads/) (or higher)
1) (Recommended) Create a virtual environment (learn more [here](https://docs.python.org/3/library/venv.html))
1) Install the required packages by running the following command from the directory that contains `pyproject.toml`.<br>`% pip install .`
1) Select the newly created kernel in your notebook.
    - NOTE: this varies slightly between notebook tools, but in almost all tools you will be prompted to select a kernel upon running a cell.
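
Before running the rest of the notebook, it can help to confirm that the selected kernel meets the requirements listed above. The following is a minimal sketch, assuming the required packages match the modules imported by this notebook; the helper names `missing_packages` and `python_is_supported` are hypothetical.

```python
import importlib.util
import sys

# Import names used by this notebook (assumed to match the project's dependencies).
REQUIRED_MODULES = ["earthaccess", "netCDF4", "numpy", "pandas", "requests"]


def missing_packages(module_names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


def python_is_supported(minimum: tuple[int, int] = (3, 11)) -> bool:
    """Compare the running interpreter with the minimum version from the setup steps."""
    return sys.version_info[:2] >= minimum


print("Python version supported:", python_is_supported())
print("Missing packages:", missing_packages(REQUIRED_MODULES))
```

An empty list of missing packages suggests the kernel points at the environment where the install step was run.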

### Import required modules

In [32]:
import codecs
from datetime import datetime
from pathlib import Path

import earthaccess
import netCDF4 as nc
import numpy as np
from numpy.typing import NDArray
import pandas as pd
import requests

### Data Access
In order to access data, you will need an Earthdata Login account.  If you do not have an Earthdata Login account, you can create one here:<br>
https://urs.earthdata.nasa.gov/

The earthaccess module handles authentication. Multiple login options exist for providing your credentials; you can read more about them here:<br>
https://pypi.org/project/earthaccess/<br>
Unless another option is configured, your notebook will prompt you to enter your credentials.
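
As a rough illustration of how the login options relate, the sketch below checks which credentials are already configured and suggests a strategy name. It assumes the environment variable names `EARTHDATA_USERNAME` and `EARTHDATA_PASSWORD` and a `.netrc` file in the home directory; the helper `pick_login_strategy` is hypothetical and is not part of earthaccess.

```python
import netrc
import os
from pathlib import Path

EARTHDATA_HOST = "urs.earthdata.nasa.gov"


def pick_login_strategy(env: dict[str, str], netrc_path: Path) -> str:
    """Return 'environment', 'netrc', or 'interactive' based on what is configured."""
    if env.get("EARTHDATA_USERNAME") and env.get("EARTHDATA_PASSWORD"):
        return "environment"
    if netrc_path.is_file():
        try:
            if netrc.netrc(str(netrc_path)).authenticators(EARTHDATA_HOST):
                return "netrc"
        except (netrc.NetrcParseError, OSError):
            pass  # unreadable or malformed file: fall back to the prompt
    return "interactive"


strategy = pick_login_strategy(dict(os.environ), Path.home() / ".netrc")
print("Suggested strategy:", strategy)
# In the notebook, the result could then be passed on, for example:
# earthaccess.login(strategy=strategy)
```

If neither source is configured, the sketch falls back to the interactive prompt, which matches the default behavior described above.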

In [5]:
earthaccess.login()

<earthaccess.auth.Auth at 0x20d80072b10>

Notebook Settings

In [6]:
PGN_DATA_DIR = Path('pgn-data')
PGN_DATA_DIR.mkdir(mode=0o777, parents=True, exist_ok=True)
PGN_DATA_PATH = PGN_DATA_DIR.joinpath('pgn-data.csv')
TEMPO_DATA_DIR = Path('tempo-data')
TEMPO_DATA_DIR.mkdir(mode=0o777, parents=True, exist_ok=True)
TEMPO_DATA_PATH = TEMPO_DATA_DIR.joinpath('tempo-data.csv')

## 1. Data Preparation

### 1.1. Download Data

#### 1.1.1. Define Download Settings

In [7]:
start_date = datetime(2023, 7, 1)
end_date = datetime(2023, 7, 31)
temporal_range = [start_date, end_date]
spatial_range = []

sites = ['BronxNY', 'BuffaloNY', 'QueensNY']
sites_url = "https://data.pandonia-global-network.org"

# settings for various types of pgn files
all_pgn_formats = {
    "rnvh3p1-8.txt": {
        'no2_quality_flag_index': 52,
        'valid_quality_flags': [0, 10],
        'column_index': 61,
        'column_unc_index': 62,
    },
    "rnvm2p1-8.txt": {
        'no2_quality_flag_index': 35,
        'valid_quality_flags': [0, 1, 10, 11],
        'column_index': 38,
        'column_unc_index': 39,
    }
}

# formats to process
pgn_formats = {
    "rnvh3p1-8.txt": all_pgn_formats["rnvh3p1-8.txt"]
}

#### 1.1.2. Download Pandora Data

Get sites

In [8]:
def get_page_links(url: str):
    """
    A tool for getting all PGN links from a PGN data webpage (https://data.hetzner.pandonia-global-network.org/)

    ARGS:
        url (str): The URL of the page to extract links from
    """
    things_to_remove = [
        '<span class="name">',
        '</span>',
        '/</span>'
    ]

    response = requests.get(url)
    assert response.status_code==200, f"Download failed with code {response.status_code}"
    
    # get item name lines
    names = [l.strip() for l in response.text.splitlines()]
    names = [l for l in names if l.startswith('<span class="name">')]

    # get item names from name lines
    for thing_to_remove in things_to_remove:
        names = [l.replace(thing_to_remove, '') for l in names]
        names = [l.rstrip('/') for l in names]

    return names
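

The function above extracts names by string replacement. As an alternative sketch, the same names could be collected with the standard library's `html.parser`, which does not depend on each `<span class="name">` element sitting on its own line. The class `NameSpanParser`, the function `extract_names`, and the sample markup are hypothetical; the real listing pages may be structured differently.

```python
from html.parser import HTMLParser


class NameSpanParser(HTMLParser):
    """Collect the text of every <span class="name"> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_name_span = False
        self.names: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs
        if tag == "span" and ("class", "name") in attrs:
            self._in_name_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_name_span = False

    def handle_data(self, data):
        if self._in_name_span and data.strip():
            # drop the trailing slash that marks directory entries
            self.names.append(data.strip().rstrip("/"))


def extract_names(html: str) -> list[str]:
    parser = NameSpanParser()
    parser.feed(html)
    return parser.names


# Hypothetical listing snippet, shaped like the lines the function above looks for.
sample = """
<span class="name">Pandora147s1/</span>
<span class="name">Pandora180s1/</span>
<span class="size">-</span>
"""
print(extract_names(sample))  # ['Pandora147s1', 'Pandora180s1']
```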

Build file URLs

In [9]:
# get file URLs
print("Getting File URLs")
pgn_urls: list[str] = []
for i, site in enumerate(sites):
    site_url = f"{sites_url}/{site}"
    instruments = get_page_links(site_url)
    print(f"Site {i+1} of {len(sites)}:", site)
    for j, instrument in enumerate(instruments):
        print(f"\tInstrument {j+1} of {len(instruments)}:", instrument)
        for file_suffix in pgn_formats.keys():
            file_url = f"{site_url}/{instrument}/L2/{instrument}_{site}_L2_{file_suffix}"
            # verify file exists
            if not requests.head(file_url, allow_redirects=True).ok:
                print(f"\tFile does not exist (this may not be an issue): {file_url}")
                continue
            pgn_urls.append(file_url)

print("File URLs:")
for pgn_url in pgn_urls:
    print(f"\t{pgn_url}")


Getting File URLs
Site 1 of 3: BronxNY
	Instrument 1 of 2: Pandora147s1
	Instrument 2 of 2: Pandora180s1
Site 2 of 3: BuffaloNY
	Instrument 1 of 1: Pandora206s1
Site 3 of 3: QueensNY
	Instrument 1 of 1: Pandora55s1
File URLs:
	https://data.pandonia-global-network.org/BronxNY/Pandora147s1/L2/Pandora147s1_BronxNY_L2_rnvh3p1-8.txt
	https://data.pandonia-global-network.org/BronxNY/Pandora180s1/L2/Pandora180s1_BronxNY_L2_rnvh3p1-8.txt
	https://data.pandonia-global-network.org/BuffaloNY/Pandora206s1/L2/Pandora206s1_BuffaloNY_L2_rnvh3p1-8.txt
	https://data.pandonia-global-network.org/QueensNY/Pandora55s1/L2/Pandora55s1_QueensNY_L2_rnvh3p1-8.txt
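
The URL pattern used in the loop above can be isolated into a small function, which makes it easier to check against the listing. This is a minimal sketch; the function name `build_pgn_url` is hypothetical, and the pattern is copied from the cell above.

```python
def build_pgn_url(base_url: str, site: str, instrument: str, file_suffix: str) -> str:
    """Assemble the L2 file URL: <base>/<site>/<instrument>/L2/<instrument>_<site>_L2_<suffix>."""
    return f"{base_url}/{site}/{instrument}/L2/{instrument}_{site}_L2_{file_suffix}"


url = build_pgn_url(
    "https://data.pandonia-global-network.org",
    "BronxNY",
    "Pandora147s1",
    "rnvh3p1-8.txt",
)
print(url)
```

With these arguments the result matches the first file URL printed above.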


Download files (if not already downloaded)

In [10]:
# Download Files
pgn_paths: list[Path] = []
print(f"Downloading pgn files to {PGN_DATA_DIR}")
for i, pgn_url in enumerate(pgn_urls):
    file_name = Path(pgn_url).name
    file_path = PGN_DATA_DIR.joinpath(file_name)
    if file_path.exists():
        print(f"File {i+1} of {len(pgn_urls)} exists and will not be downloaded: {file_name}")
    
    else:
        print(f"\tDownloading file {i+1} of {len(pgn_urls)}: {file_name}")
        response = requests.get(pgn_url)
        response.raise_for_status()  # avoid writing an error page to disk
        file_path.write_bytes(response.content)

    pgn_paths.append(file_path)
print("Files downloaded")

Downloading pgn files to pgn-data
File 1 of 4 exists and will not be downloaded: Pandora147s1_BronxNY_L2_rnvh3p1-8.txt
File 2 of 4 exists and will not be downloaded: Pandora180s1_BronxNY_L2_rnvh3p1-8.txt
File 3 of 4 exists and will not be downloaded: Pandora206s1_BuffaloNY_L2_rnvh3p1-8.txt
File 4 of 4 exists and will not be downloaded: Pandora55s1_QueensNY_L2_rnvh3p1-8.txt
Files downloaded
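
Because the download cell skips any file that already exists, a download interrupted partway could leave a truncated file that is never refreshed. One possible way to guard against that is to write to a temporary file and rename it only once the write has finished. The sketch below assumes a caller-supplied `fetch` function that returns the file bytes; `download_if_missing` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def download_if_missing(url: str, dest_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Download url into dest_dir unless the file already exists."""
    dest = dest_dir / url.rsplit("/", 1)[-1]
    if dest.exists():
        return dest
    data = fetch(url)  # may raise; nothing has been written yet
    fd, tmp_name = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_name, dest)  # rename within the same directory
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)  # clean up the partial file
        raise
    return dest


# In the notebook, fetch could wrap requests, for example:
# def fetch(url: str) -> bytes:
#     response = requests.get(url, timeout=60)
#     response.raise_for_status()
#     return response.content
```

Passing the fetch function in as an argument also keeps the helper easy to exercise without network access.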


Build a DataFrame of all PGN observations with valid quality flags.

In [17]:
# PGN format settings (these should not change)
pgn_section_delim = f"{'-'*87}\n"
header_delim = ": "
pgn_loc_key = "Short location name"
pgn_lat_key = "Location latitude [deg]"
pgn_lon_key = "Location longitude [deg]"
# Avogadro constant divided by 10000: converts mol/m^2 to molecules/cm^2
no2_scale = 6.02214076E+19

# build the final dataset
pgn_data = pd.DataFrame()
print("PGN Loading Started")
for i, pgn_path in enumerate(pgn_paths):
    print(f"Loading file {i+1} of {len(pgn_paths)}: {pgn_path.name}")

    # get file format indices
    file_suffix = pgn_path.name.split('_')[-1]
    file_data = pgn_formats.get(file_suffix)
    if file_data is None:
        raise ValueError(f"Invalid suffix for {pgn_path}, handled suffixes: {list(pgn_formats)}")
    no2_quality_flag_index = file_data['no2_quality_flag_index']
    valid_quality_flags = file_data['valid_quality_flags']
    column_index = file_data['column_index']
    column_unc_index = file_data['column_unc_index']

    # get file sections as lines
    text = pgn_path.read_text()
    metadata_text, column_text, data_text = text.split(pgn_section_delim)
    metadata_lines = metadata_text.splitlines()
    column_lines = column_text.splitlines()
    data_lines = data_text.splitlines()

    # get metadata (split on the first delimiter only, so values may contain ": ")
    metadata = {}
    for line in metadata_lines:
        key, value = line.split(header_delim, 1)
        metadata[key] = value

    # get data
    rows = []
    for line in data_lines:
      values = line.split()

      # ignore rows before the start date (NOTE: end_date is not applied here)
      timestamp = datetime.fromisoformat(values[0]).replace(tzinfo=None)
      if not (start_date <= timestamp):
        continue

      # ignore row if its NO2 quality flag is not in the list of valid flags
      no2_quality_flag = int(values[no2_quality_flag_index])
      if no2_quality_flag not in valid_quality_flags:
        continue

      # Nitrogen dioxide tropospheric vertical column amount [moles per square meter]
      column = float(values[column_index])
      # Independent uncertainty of nitrogen dioxide tropospheric vertical column amount [moles per square meter]
      column_unc = float(values[column_unc_index])

      lat = float(metadata[pgn_lat_key])
      lon = float(metadata[pgn_lon_key])
      loc = metadata[pgn_loc_key]
      row = {
         'Timestamp': timestamp, 
         'Latitude': lat, 
         'Longitude': lon, 
         'Location': loc, 
         'Column': column*no2_scale, 
         'Uncertainty': column_unc*no2_scale
      }
      rows.append(row)
    df = pd.DataFrame(rows)
    if not rows:
      print(f"\tWARNING: No valid observations found (valid NO2 quality flags: {valid_quality_flags})")
    else:
      print(f"\tValid Observations: {len(rows)}")
    pgn_data = pd.concat([pgn_data, df])
print(f"PGN Loading Complete, found {pgn_data.shape} valid observations.")
print("Writing to", PGN_DATA_PATH)
pgn_data.to_csv(PGN_DATA_PATH, index=False)
pgn_data.head()


PGN Loading Started
Loading file 1 of 4: Pandora147s1_BronxNY_L2_rnvh3p1-8.txt
Loading file 2 of 4: Pandora180s1_BronxNY_L2_rnvh3p1-8.txt
	Valid Observations: 3208
Loading file 3 of 4: Pandora206s1_BuffaloNY_L2_rnvh3p1-8.txt
	Valid Observations: 4894
Loading file 4 of 4: Pandora55s1_QueensNY_L2_rnvh3p1-8.txt
	Valid Observations: 16168
PGN Loading Complete, found (24270, 6) valid observations.
Writing to pgn-data\pgn-data.csv


| | Timestamp | Latitude | Longitude | Location | Column | Uncertainty |
|---|---|---|---|---|---|---|
| 0 | 2023-07-01 12:09:22.300 | 40.8679 | -73.8781 | BronxNY | 1.825732e+16 | 174491500000000.0 |
| 1 | 2023-07-01 12:19:59.400 | 40.8679 | -73.8781 | BronxNY | 2.381636e+16 | 208655100000000.0 |
| 2 | 2023-07-01 12:43:43.400 | 40.8679 | -73.8781 | BronxNY | 2.6439e+16 | 199507500000000.0 |
| 3 | 2023-07-01 13:39:43.200 | 40.8679 | -73.8781 | BronxNY | 3.037989e+16 | 216303300000000.0 |
| 4 | 2023-07-01 13:57:10.500 | 40.8679 | -73.8781 | BronxNY | 2.223254e+16 | 148608400000000.0 |
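
The per-row logic in the loading cell (time filter, quality filter, unit conversion) can be sketched as a single function. Unlike the cell above, this sketch applies both the start and end bounds. The function name `parse_row`, the shortened format dictionary, and the sample row are hypothetical and do not reflect a real PGN column layout.

```python
from datetime import datetime

AVOGADRO = 6.02214076e23              # molecules per mole
MOL_M2_TO_MOLEC_CM2 = AVOGADRO / 1e4  # 1 m^2 = 1e4 cm^2


def parse_row(values, fmt, start, end):
    """Return a dict for one data row, or None if the row is filtered out."""
    timestamp = datetime.fromisoformat(values[0]).replace(tzinfo=None)
    if not (start <= timestamp <= end):
        return None  # outside the requested time window
    if int(values[fmt["no2_quality_flag_index"]]) not in fmt["valid_quality_flags"]:
        return None  # quality flag not accepted
    return {
        "Timestamp": timestamp,
        "Column": float(values[fmt["column_index"]]) * MOL_M2_TO_MOLEC_CM2,
        "Uncertainty": float(values[fmt["column_unc_index"]]) * MOL_M2_TO_MOLEC_CM2,
    }


# Synthetic, shortened row and format (indices are illustrative only).
fmt = {
    "no2_quality_flag_index": 1,
    "valid_quality_flags": [0, 10],
    "column_index": 2,
    "column_unc_index": 3,
}
row = ["2023-07-01T12:09:22.300+00:00", "10", "1.0e-4", "1.0e-6"]
parsed = parse_row(row, fmt, datetime(2023, 7, 1), datetime(2023, 7, 31))
print(parsed)
```

Note that `datetime(2023, 7, 31)` is midnight at the start of July 31, so an upper bound written this way excludes observations taken later that day.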


Get latitudes and longitudes (for use with TEMPO download)

In [19]:
pgn_sites_df = pgn_data[['Location', 'Latitude', 'Longitude']].drop_duplicates().set_index('Location')
pgn_sites_df.head()

| Location | Latitude | Longitude |
|---|---|---|
| BronxNY | 40.8679 | -73.8781 |
| BuffaloNY | 43.0015 | -78.7869 |
| QueensNY | 40.7361 | -73.8215 |
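
One way these coordinates could feed a TEMPO search is as a small bounding box around each site. The sketch below is an assumption about how the download might be set up, not the notebook's method; `bounding_box`, `site_coords`, and the default half-width of 0.1 degrees are hypothetical choices.

```python
def bounding_box(
    lat: float, lon: float, half_width_deg: float = 0.1
) -> tuple[float, float, float, float]:
    """Return (west, south, east, north) around a point, clamped to valid ranges."""
    west = max(lon - half_width_deg, -180.0)
    south = max(lat - half_width_deg, -90.0)
    east = min(lon + half_width_deg, 180.0)
    north = min(lat + half_width_deg, 90.0)
    return (west, south, east, north)


# Site coordinates copied from the table above: name -> (latitude, longitude).
site_coords = {
    "BronxNY": (40.8679, -73.8781),
    "BuffaloNY": (43.0015, -78.7869),
    "QueensNY": (40.7361, -73.8215),
}
boxes = {name: bounding_box(lat, lon) for name, (lat, lon) in site_coords.items()}
for name, box in boxes.items():
    print(name, box)

# A search could then look roughly like this (the short name is an assumption):
# results = earthaccess.search_data(
#     short_name="TEMPO_NO2_L2",
#     temporal=(start_date, end_date),
#     bounding_box=boxes["BronxNY"],
# )
```

The tuple order (west, south, east, north) follows the lower-left and upper-right corner convention that `earthaccess.search_data` uses for its `bounding_box` argument.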


#### 1.1.3. Download TEMPO Data

(123,)
(123, 2048)
(123, 2048)
(123, 2048)
(123, 2048)
[[59.290962 59.24963  59.20839  ... 17.34152  17.326826 17.312134]
 [59.277916 59.236637 59.195454 ... 17.34037  17.325676 17.310984]
 [59.26651  59.225285 59.18415  ... 17.339794 17.325102 17.310411]
 ...
 [58.302814 58.264565 58.228058 ... 17.289572 17.274948 17.260324]
 [58.295998 58.258343 58.220425 ... 17.28848  17.273857 17.259233]
 [58.292873 58.25554  58.21623  ... 17.289175 17.274553 17.25993 ]]


  no2_col = np.array(col_var)
  no2_unc = np.array(unc_var)
  lat = np.array(geo.variables['latitude'])  # read latitude from the geolocation group (/geolocation) into a numpy array
  lon = np.array(geo.variables['longitude'])  # read longitude from the geolocation group (/geolocation) into a numpy array
  time = np.array(geo.variables['time'])  # read time from the geolocation group (/geolocation) into a numpy array


### 1.2. Explore Data

### 1.3. Clean Data

### 1.4. Transform Data

## 2. Data Modeling

### 2.1 Train Model

### 2.2 Test Model

### 2.3. Model Visualization