# Prediction of Nighttime NO2

## Background

### Names and Acronyms
1) ASDC: [Atmospheric Science Data Center](https://asdc.larc.nasa.gov/about)
1) PGN (Pandora): [Pandonia Global Network](https://www.pandonia-global-network.org/) / [Pandora](https://pandora.gsfc.nasa.gov/About/)
    - **NOTE**: NASA's portion of the PGN is known as Pandora. Within the scope of this notebook, Pandora and PGN are used interchangeably because this project only uses NASA's PGN site data.
1) TEMPO: [Tropospheric Emissions: Monitoring of Pollution](https://science.nasa.gov/mission/tempo/)

### Resources
1) ASDC Data Processing Tool (Version 1)
    - This notebook was published by the ASDC and provides examples of how to correctly load and use Pandora and TEMPO data.
    - https://github.com/nasa/ASDC_Data_and_User_Services/blob/main/TEMPO/additional_drafts/ASDC_Data_Processing_ML_v1.2.ipynb
1) PGN Station Map
    - A map showing the location of all PGN ground sites.
    - https://blickm.hetzner.pandonia-global-network.org/livemaps/pgn_stationsmap.png


This notebook borrows heavily from, and extends the functionality of, the NASA ASDC Data and User Services notebook found here:

https://github.com/nasa/ASDC_Data_and_User_Services/blob/main/TEMPO/additional_drafts/ASDC_Data_Processing_ML_v1.2.ipynb

This notebook tests the hypothesis that a model can be built from Pandora data to predict nighttime NO<sub>2</sub>, and that the same model can be applied to TEMPO daytime measurements to predict NO<sub>2</sub> for any location covered by TEMPO.



## Environment Setup

### Environment Setup
There are many tools available, such as [poetry](https://python-poetry.org/) and [uv](https://docs.astral.sh/uv/), that simplify and speed up environment setup. For simplicity, this guide only covers the method built into the Python standard library.
1) Install [Python 3.11](https://www.python.org/downloads/) (or higher)
1) (Recommended) Create a virtual environment (learn more [here](https://docs.python.org/3/library/venv.html))
1) Install the required packages declared in `pyproject.toml` by running the following command from the project directory.<br>`% pip install .`
1) Select the newly created kernel in your notebook.
    - NOTE: this varies slightly between notebook tools, but in almost all tools you will be prompted to select a kernel upon running a cell.
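
The steps above can be sanity-checked from inside the notebook. The following is a minimal sketch, not part of the original workflow: the helper names `python_is_supported` and `in_virtual_environment` are hypothetical, and the check only confirms the interpreter version and whether a `venv`-style environment is active.

```python
import sys

MIN_PYTHON = (3, 11)  # minimum version named in the setup steps above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def in_virtual_environment() -> bool:
    """Return True when running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print("Python version supported:", python_is_supported())
print("Virtual environment active:", in_virtual_environment())
```

If either line prints `False`, the selected kernel is probably not the environment created in the steps above.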

### Import required modules

In [145]:
import codecs
from datetime import datetime
from pathlib import Path

import earthaccess
import numpy as np
from numpy.typing import NDArray
import pandas as pd
import requests

### Data Access
In order to access data, you will need an Earthdata Login account.  If you do not have an Earthdata Login account, you can create one here:<br>
https://urs.earthdata.nasa.gov/

The earthaccess module handles authentication. Multiple login options exist for providing your credentials; you can read more about them here:<br>
https://pypi.org/project/earthaccess/<br>
By default, unless another option is configured, your notebook will prompt you to enter your credentials.
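
As an illustration of the non-interactive options, the sketch below picks a login strategy before calling `earthaccess.login`. The helper names `choose_login_strategy` and `login_with_best_strategy` are hypothetical. The strategy names (`"environment"`, `"netrc"`, `"interactive"`) and the `EARTHDATA_USERNAME` / `EARTHDATA_PASSWORD` variables follow the earthaccess documentation; the selection order is an assumption made for this example.

```python
import os
from pathlib import Path


def choose_login_strategy(env=None, home=None) -> str:
    """Pick an earthaccess login strategy from what is configured locally."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Credentials supplied through environment variables take priority.
    if env.get("EARTHDATA_USERNAME") and env.get("EARTHDATA_PASSWORD"):
        return "environment"
    # A netrc file in the home directory (named _netrc on Windows).
    if (home / ".netrc").exists() or (home / "_netrc").exists():
        return "netrc"
    # Otherwise fall back to the interactive prompt.
    return "interactive"


def login_with_best_strategy():
    """Log in to Earthdata using the strategy chosen above."""
    import earthaccess  # imported lazily so the helper above works without it

    return earthaccess.login(strategy=choose_login_strategy())


print("Strategy that would be used:", choose_login_strategy())
```

Calling `earthaccess.login()` with no arguments, as in the next cell, tries the configured options in turn, so this helper is only useful when you want to force one specific behavior.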

In [119]:
earthaccess.login()

<earthaccess.auth.Auth at 0x14e4e776410>

Notebook Settings

In [129]:
PGN_DATA_PATH = Path('pgn-data')
PGN_DATA_PATH.mkdir(mode=0o777, parents=True, exist_ok=True)

## 1. Data Preparation

### 1.1. Download Data

#### 1.1.1. Define Download Settings

In [120]:
temporal_range = []
spatial_range = []

sites = ['BronxNY', 'BuffaloNY', 'QueensNY']
sites_url = "https://data.pandonia-global-network.org"
file_suffix = "rnvh3p1-8.txt"

#### 1.1.2. Download Pandora data

Get sites

In [121]:
def get_page_links(url: str):
    """
    A tool for getting all PGN links from a PGN data webpage (https://data.hetzner.pandonia-global-network.org/)

    ARGS:
        url (str): The URL of the page to extract links from
    """
    things_to_remove = [
        '<span class="name">',
        '</span>',
        '/</span>'
    ]

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise an HTTPError for any 4xx/5xx response
    
    # get item name lines
    names = [l.strip() for l in response.text.splitlines()]
    names = [l for l in names if l.startswith('<span class="name">')]

    # get item names from name lines
    for thing_to_remove in things_to_remove:
        names = [l.replace(thing_to_remove, '') for l in names]
        names = [l.rstrip('/') for l in names]

    return names
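
The string-replacement approach in `get_page_links` depends on each name sitting on its own line with exact markup. As an alternative, here is a minimal sketch using the standard library's `HTMLParser`. It assumes, as the function above does, that each entry name is wrapped in `<span class="name">`. The class `NameSpanParser`, the function `extract_names`, and the sample HTML are hypothetical and not taken from the PGN site.

```python
from html.parser import HTMLParser


class NameSpanParser(HTMLParser):
    """Collect the text of every <span class="name"> element."""

    def __init__(self):
        super().__init__()
        self._in_name_span = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "name":
            self._in_name_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_name_span = False

    def handle_data(self, data):
        # Keep only text found inside a name span, without a trailing slash.
        if self._in_name_span and data.strip():
            self.names.append(data.strip().rstrip("/"))


def extract_names(html_text: str) -> list[str]:
    """Return the entry names found in a directory-listing page."""
    parser = NameSpanParser()
    parser.feed(html_text)
    return parser.names


# Hypothetical listing fragment, for illustration only.
sample_html = """
<a href="Pandora147s1/"><span class="name">Pandora147s1/</span></a>
<a href="Pandora180s1/"><span class="name">Pandora180s1</span></a>
"""
print(extract_names(sample_html))
```

The parser does not care about line breaks or attribute order, so it should tolerate small layout changes in the page.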

In [None]:
# get file URLs
print("Getting File URLs")
pgn_urls: list[str] = []
for i, site in enumerate(sites):
    site_url = f"{sites_url}/{site}"
    instruments = get_page_links(site_url)
    print(f"Site {i+1} of {len(sites)}:", site)
    for j, instrument in enumerate(instruments):
        print(f"\tInstrument {j+1} of {len(instruments)}:", instrument)
        instrument_url = f"{site_url}/{instrument}/L2/{instrument}_{site}_L2_{file_suffix}"
        pgn_urls.append(instrument_url)

print("File URLs:")
for pgn_url in pgn_urls:
    print(f"\t{pgn_url}")


Getting File URLs
Site 1 of 3: BronxNY
	Instrument 1 of 2: Pandora147s1
	Instrument 2 of 2: Pandora180s1
Site 2 of 3: BuffaloNY
	Instrument 1 of 1: Pandora206s1
Site 3 of 3: QueensNY
	Instrument 1 of 1: Pandora55s1
File URLs:
	https://data.pandonia-global-network.org/BronxNY/Pandora147s1/L2/Pandora147s1_BronxNY_L2_rnvh3p1-8.txt
	https://data.pandonia-global-network.org/BronxNY/Pandora180s1/L2/Pandora180s1_BronxNY_L2_rnvh3p1-8.txt
	https://data.pandonia-global-network.org/BuffaloNY/Pandora206s1/L2/Pandora206s1_BuffaloNY_L2_rnvh3p1-8.txt
	https://data.pandonia-global-network.org/QueensNY/Pandora55s1/L2/Pandora55s1_QueensNY_L2_rnvh3p1-8.txt
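
The URL pattern used in the loop above can be isolated in a small helper, which makes it easier to check against the printed output. This is a sketch only; the function name `build_pgn_l2_url` is hypothetical, and the path layout is copied from the f-string in the cell above.

```python
def build_pgn_l2_url(base_url: str, site: str, instrument: str, suffix: str) -> str:
    """Build the URL of an L2 file for one instrument at one site."""
    # Layout: <base>/<site>/<instrument>/L2/<instrument>_<site>_L2_<suffix>
    return f"{base_url.rstrip('/')}/{site}/{instrument}/L2/{instrument}_{site}_L2_{suffix}"


url = build_pgn_l2_url(
    "https://data.pandonia-global-network.org",
    "BronxNY",
    "Pandora147s1",
    "rnvh3p1-8.txt",
)
print(url)
```

With these arguments, the helper reproduces the first file URL listed in the output above.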


In [None]:
# Download Files
pgn_paths: list[Path] = []
print(f"Downloading pgn files to {PGN_DATA_PATH}")
for i, pgn_url in enumerate(pgn_urls):
    file_name = Path(pgn_url).name
    print(f"Downloading file {i+1} of {len(pgn_urls)}: {file_name}")
    response = requests.get(pgn_url, timeout=60)
    response.raise_for_status()  # raise an HTTPError if the file could not be fetched

    file_path = PGN_DATA_PATH.joinpath(file_name)
    file_path.write_bytes(response.content)
    pgn_paths.append(file_path)
print("Files downloaded")

Downloading pgn files to pgn-data
Downloading file 1 of 4: Pandora147s1_BronxNY_L2_rnvh3p1-8.txt
Downloading file 2 of 4: Pandora180s1_BronxNY_L2_rnvh3p1-8.txt
Downloading file 3 of 4: Pandora206s1_BuffaloNY_L2_rnvh3p1-8.txt
Downloading file 4 of 4: Pandora55s1_QueensNY_L2_rnvh3p1-8.txt
Files downloaded
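
As written, the download cell fetches every file again each time it runs. The sketch below shows one way to skip files that already exist locally. The names `fetch_bytes` and `download_if_missing` are hypothetical, and the example uses the standard library's `urllib` rather than `requests` so that it stays self-contained; the `fetch` argument can be replaced with any callable that returns bytes.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url: str, timeout: float = 60.0) -> bytes:
    """Download a URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_if_missing(url: str, dest_dir, fetch=fetch_bytes) -> tuple[Path, bool]:
    """Download url into dest_dir unless a non-empty copy already exists.

    Returns the local path and a flag that is True when a download happened.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    file_path = dest_dir / url.rsplit("/", 1)[-1]  # file name is the last URL segment
    if file_path.exists() and file_path.stat().st_size > 0:
        return file_path, False
    file_path.write_bytes(fetch(url))
    return file_path, True
```

A cached copy is never refreshed by this helper, so delete the local file when newer measurements are needed.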


In [None]:
"""
The following PGN data extraction functions were copied from ASDC_Data_Processing_ML_v1.2.ipynb and will not be changed
"""
# function reading Pandora NO2 data files rnvh3p1-8
# function converting Pandora timestamp into a set of  year, month, day, hour, minute, and second
# function read_timestamp converts Pandora timestamp of the format
# 'yyyymmddThhmmssZ' into a set of 6 numbers:
# integer year, month, day, hour, minute, and real second.
def read_timestamp(timestamp):

  yyyy = int(timestamp[0:4])
  mm = int(timestamp[4:6])
  dd = int(timestamp[6:8])
  hh = int(timestamp[9:11])
  mn = int(timestamp[11:13])
  ss = int(timestamp[13:15])

  return yyyy, mm, dd, hh, mn, ss


# function reading Pandora NO2 data file rnvh3p1-8
#
# Below is the second version of function read_Pandora_NO2_rnvs3p1_8. It is to be used for the future validation efforts.
# The difference with the original version is that instead of discriminating negative values of the total NO2 column,
# it uses quality flags. It was previously found that QF == 0 does not occur often enough,
# so we will have to use QF == 10 (not-assured high quality).
#
# function read_Pandora_NO2_rnvh3p1_8 reads Pandora total NO2 column data files ending with rnvh3p1-8.
# Arguments:
# fname - name file to be read, string;
# start_date - beginning of the time interval of interest,
#              integer of the form YYYYMMDD;
# end_date -   end of the time interval of interest,
#              integer of the form YYYYMMDD.
#
# if start_date is greater than end_date, the function returns a numpy array
# with shape (0, 8), otherwise it returns an 8-column numpy array
# with columns being year, month, day, hour, minute, second of observation
# and retrieved total NO2 column along with its total uncertainty.
#
# NO2 column is in mol/m^2, so conversion to molecules/cm^2 is performed by
# multiplication by Avogadro constant, NA =  6.02214076E+23, and division by 1.E+4
def read_Pandora_NO2_rnvh3p1_8(fname, start_date, end_date):

  conversion_coeff = 6.02214076E+19 # Avogadro constant divided by 10000

  data = np.empty([0, 8])
  if start_date > end_date: return -999., -999., data

  with codecs.open(fname, 'r', encoding='utf-8', errors='ignore') as f:

    while True:
# Get next line from file
      line = f.readline()

      if line.find('Short location name:') >= 0:
        loc_name = line.split()[-1] # location name, to be used in the output file name
        print('location name ', loc_name)

      if line.find('Location latitude [deg]:') >= 0:
        lat = float(line.split()[-1]) # location latitude
        print('location latitude ', lat)

      if line.find('Location longitude [deg]:') >= 0:
        lon = float(line.split()[-1]) # location longitude
        print('location longitude ', lon)

      if line.find('--------') >= 0: break

    while True:
# Get next line from file
      line = f.readline()
      # print(line)
      if line.find('--------') >= 0: break

    while True:
# now reading line with data
      line = f.readline()
      
      if not line: break

      line_split = line.split()
     
      yyyy, mm, dd, hh, mn, ss = read_timestamp(line_split[0])
      date_stamp = yyyy*10000 + mm*100 + dd
      if date_stamp < start_date or date_stamp > end_date: continue

      QF = int(line_split[52]) # data quality flag (0 = assured high quality, 10 = not-assured high quality)

      if QF == 0 or QF == 10:
        column = float(line_split[61]) # Nitrogen dioxide tropospheric vertical column amount [moles per square meter]
        column_unc = float(line_split[62]) # Independent uncertainty of nitrogen dioxide tropospheric vertical column amount [moles per square meter]
        
        data = np.append(data, [[yyyy, mm, dd, hh, mn, ss\
                               , column*conversion_coeff\
                               , column_unc*conversion_coeff]], axis = 0)

  return lat, lon, loc_name, data

def read_Pandora_NO2_rnvm2p1_8(fname, start_date, end_date):  #####################LUNAR
  conversion_coeff = 6.02214076E+19 # Avogadro constant divided by 10000
  data = np.empty([0, 8])
  if start_date > end_date: return -999., -999., data

  # with codecs.open(fname, 'r', encoding='utf-8', errors='ignore') as f:
  with codecs.open(fname, 'r', encoding='utf-8', errors='ignore') as f:

    while True:
# Get next line from file
      line = f.readline()

      if line.find('Short location name:') >= 0:
        loc_name = line.split()[-1] # location name, to be used in the output file name
        # print('location name ', loc_name)

      if line.find('Location latitude [deg]:') >= 0:
        lat = float(line.split()[-1]) # location latitude
        # print('location latitude ', lat)

      if line.find('Location longitude [deg]:') >= 0:
        lon = float(line.split()[-1]) # location longitude
        # print('location longitude ', lon)

      if line.find('--------') >= 0: break

    while True:
# Get next line from file
      line = f.readline()
      # print(line)
      if line.find('--------') >= 0: break

    while True:
# now reading line with data
      line = f.readline()
      
      if not line: break

      line_split = line.split()
     
      yyyy, mm, dd, hh, mn, ss = read_timestamp(line_split[0])
      date_stamp = yyyy*10000 + mm*100 + dd
      if date_stamp < start_date or date_stamp > end_date: continue

      QF = int(line_split[35])
      
      if QF == 0 or QF == 10 or QF == 1 or QF ==11:
      # if QF:
        column = float(line_split[38])# - float(line_split[52]) # Nitrogen dioxide total vertical column amount [moles per square meter]
        column_unc = float(line_split[39]) # Independent uncertainty of nitrogen dioxide tropospheric vertical column amount [moles per square meter]
        
        data = np.append(data, [[yyyy, mm, dd, hh, mn, ss\
                               , column*conversion_coeff\
                               , column_unc*conversion_coeff]], axis = 0)

  return data
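
The timestamp handling and unit conversion described in the comments above can be checked in isolation. The following is a minimal sketch, assuming the `yyyymmddThhmmssZ` format stated above and treating the trailing `Z` as UTC. The function names are hypothetical, and dropping optional fractional seconds is an assumption that mirrors the fixed-width slicing in `read_timestamp`.

```python
from datetime import datetime, timezone

AVOGADRO = 6.02214076e23  # molecules per mole
CM2_PER_M2 = 1.0e4  # square centimeters per square meter


def parse_pgn_timestamp(timestamp: str) -> datetime:
    """Convert 'yyyymmddThhmmssZ' into a timezone-aware UTC datetime."""
    # Drop the trailing 'Z' and any fractional seconds before parsing.
    core = timestamp.rstrip("Z").split(".")[0]
    return datetime.strptime(core, "%Y%m%dT%H%M%S").replace(tzinfo=timezone.utc)


def mol_per_m2_to_molecules_per_cm2(column: float) -> float:
    """Convert a column amount from mol/m^2 to molecules/cm^2."""
    return column * AVOGADRO / CM2_PER_M2


print(parse_pgn_timestamp("20240131T235959Z"))
print(mol_per_m2_to_molecules_per_cm2(1.0e-4))
```

A single `datetime` per row, rather than six separate numeric columns, makes later filtering by time of day more direct, for example when separating daytime from nighttime measurements.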


pgn-data\Pandora147s1_BronxNY_L2_rnvh3p1-8.txt has 93 lines in the header.


ParserError: Error tokenizing data. C error: Expected 81 fields in line 98, saw 109
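
The `ParserError` above is what a tokenizer reports when it expects every row to have the same number of fields (81 here) and meets a longer one (109). The cell that raised it is not shown, so the following is only a sketch of one workaround: skip past the second dashed separator, as the reader functions above do, and split each data row on whitespace so that rows of different lengths are accepted. The function name `read_pgn_rows` and the sample file content are hypothetical.

```python
import tempfile
from pathlib import Path


def read_pgn_rows(path, separator="--------", encoding="utf-8"):
    """Return (header_line_count, rows) for a PGN-style text file.

    The header is everything up to and including the second separator line.
    Each row is a list of whitespace-separated strings of any length.
    """
    rows = []
    separators_seen = 0
    header_lines = 0
    with open(path, "r", encoding=encoding, errors="ignore") as f:
        for line in f:
            if separators_seen < 2:
                header_lines += 1
                if separator in line:
                    separators_seen += 1
                continue
            fields = line.split()
            if fields:  # ignore blank lines
                rows.append(fields)
    return header_lines, rows


# Hypothetical miniature file with ragged data rows, for illustration only.
sample = "\n".join([
    "Short location name: ExampleSite",
    "-" * 20,
    "Column 1: UT date and time",
    "-" * 20,
    "20240101T000000Z 1 2 3",
    "20240101T000100Z 1 2 3 4 5",
])
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "example_pgn_sample.txt"
    sample_path.write_text(sample, encoding="utf-8")
    header_lines, rows = read_pgn_rows(sample_path)

print(header_lines, [len(row) for row in rows])
```

Rows of differing length can then be truncated or padded to the columns of interest before building a `DataFrame`.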


#### 1.1.3. Download TEMPO Data

### 1.2. Explore Data

### 1.3. Clean Data

### 1.4. Transform Data

## 2. Data Modeling

### 2.1 Train Model

### 2.2 Test Model

### 2.3. Model Visualization