# SIF Data Exploration

This notebook will guide you through the steps involved in collecting solar-induced fluorescence (SIF) data from NASA's Goddard Earth Sciences Data and Information Services Center (GES DISC), an online archive that stores data from the Orbiting Carbon Observatory-3 (OCO-3) instrument, among others.

The first code block below will simply import some necessary helper functions for exploring and displaying the data.

In [None]:
import calendar
from datetime import datetime
from IPython.display import display, Markdown
from netCDF4 import Dataset
import numpy as np
import os
import sys
import textwrap

# Add src directory containing helper code to sys.path
sys.path.append(os.path.abspath("../src"))

from geosif import GesDiscDownloader, plot_samples, plot_gridded, create_gridded_raster

# an additional helper function for displaying long lists
def wrapped_markdown_list(my_list, width=160):
    wrapped_text = textwrap.fill(", ".join(my_list), width=width)
    display(Markdown(f"```\n{wrapped_text}\n```"))

## I. Getting granules from GES DISC

The GES DISC stores various datasets associated with the OCO-2 and OCO-3 instruments. In this training we focus on the "SIF Lite" datasets, which have already received L2 processing to extract chlorophyll fluorescence signatures in two spectral windows near the O2 A-band (757 nm and 771 nm). Data is served through an OPeNDAP interface that provides a browsing experience similar to looking at a directory tree. You can navigate this directory tree yourself here: [https://oco2.gesdisc.eosdis.nasa.gov/opendap/](https://oco2.gesdisc.eosdis.nasa.gov/opendap/)

A "granule" is an instrument data file, typically in netCDF (.nc or .nc4) format, containing a set of related variables from a time range of observations. Granules in general can be daily or subdaily in time cadence. In the case of the OCO3_L2_Lite_SIF.11r dataset we are looking at today, each netCDF file on GES DISC corresponds to a single day's worth of instrument observations.

The pydap module evaluates data in the archive lazily, without downloading it until needed, which lets us explore the variables in a given granule before we download it. You may see a pydap warning when loading the data; this is not a cause for concern.
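
To make the idea of lazy evaluation concrete, here is a toy sketch. It does not use pydap: `LazyVariable` and its `fetch` callback are hypothetical names invented for illustration, and the "remote" data is just a local list. The point is only that inspecting metadata costs nothing, while slicing is what triggers a transfer.

```python
from typing import Callable, List, Sequence


class LazyVariable:
    """Toy stand-in for a remote variable: metadata is local, values are fetched on demand."""

    def __init__(self, name: str, length: int, fetch: Callable[[slice], Sequence[float]]):
        self.name = name
        self.shape = (length,)
        self._fetch = fetch
        self.requests = 0  # counts simulated round trips

    def __getitem__(self, key: slice) -> List[float]:
        # Only slicing triggers the (simulated) request for values
        self.requests += 1
        return list(self._fetch(key))


# Simulated remote storage; a real archive would sit behind an HTTP request
remote_values = [0.1 * i for i in range(1000)]
lazy_sif = LazyVariable("Daily_SIF_757nm", len(remote_values), lambda key: remote_values[key])

# Looking at the name and shape does not fetch any values
print(lazy_sif.name, lazy_sif.shape, "requests so far:", lazy_sif.requests)

# Slicing fetches only the requested values
first_five = lazy_sif[:5]
print(first_five, "requests so far:", lazy_sif.requests)
```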

**If you want to look at a different dataset, change the value of the dataset variable and re-run this cell.**

In [None]:
print("Gathering datasets on GES DISC...")
dl = GesDiscDownloader()

dataset = "OCO3_L2_Lite_SIF.11r" # See Step V for an example with "OCO2_L2_Lite_FP.11.2r"
print(f"Getting time range for {dataset} data...")
timerange = dl.get_dataset_timerange(dataset)
print(
    f"{dataset} has time range {timerange[0].strftime('%Y-%m-%d')} to {timerange[1].strftime('%Y-%m-%d')}"
)

**If you want to look at a different date, change the date specified in `data_date` and re-run the cell below!**
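
Before requesting a granule for a different date, it can help to confirm that the date falls inside the dataset's time range. The sketch below assumes `timerange` is a `(start, end)` pair of `datetime` objects, as the indexing `timerange[0]` and `timerange[1]` in the previous cell suggests. `date_in_range` is a hypothetical helper, and the example range is made up for illustration.

```python
from datetime import datetime
from typing import Tuple


def date_in_range(date: datetime, timerange: Tuple[datetime, datetime]) -> bool:
    """Return True if date lies within the inclusive (start, end) range."""
    start, end = timerange
    return start <= date <= end


# Made-up range for illustration; use the value returned for your dataset instead
example_range = (datetime(2019, 8, 1), datetime(2023, 12, 31))

inside = date_in_range(datetime(2019, 12, 1), example_range)
outside = date_in_range(datetime(2018, 1, 1), example_range)
print(inside, outside)
```

A date inside the range can still have no granule, so this check only rules out dates that cannot possibly have data.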

In [None]:
# Get the OCO-3 SIF Lite V11r product from December 1, 2019
data_date = datetime(2019, 12, 1) # Replace with a different date if you'd like
granule = dl.get_granule_by_date(dataset, data_date)
print(f"\n\nThe {dataset} granule from {data_date.strftime('%d/%m/%Y')} has the following variables:")

wrapped_markdown_list(list(granule.keys()))

## II. Download Variables and Plot

As mentioned previously, the SIF granules contain many different variables needed for further analysis. To get a quick sense of where observations were acquired on this particular day, we can download the Latitude and Longitude coordinates alongside the Daily_SIF_757nm variable. The SIF values are colormapped using the viridis colormap by default.

The first time you run this code block, you will get a few warnings from cartopy notifying you that it is downloading public resources for displaying the map context. This is expected and not a problem. It may take 20-30 seconds to download all the data, so please be patient.

Note that some of the SIF samples have negative values. This is normal and expected.
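
One way to see why negative values should be kept: retrieval noise scatters individual samples around the true value, so where the true signal is small some samples land below zero. Discarding or clipping them biases any average upward. The sketch below uses synthetic numbers only (a made-up true value and noise level), not OCO-3 data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the example is repeatable

true_sif = 0.05    # hypothetical true signal
noise_sigma = 0.3  # hypothetical single-sample retrieval noise

# Simulate noisy retrievals scattered around the true value
samples = [random.gauss(true_sif, noise_sigma) for _ in range(10_000)]

# Clipping negatives to zero looks harmless but shifts the mean upward
clipped = [max(s, 0.0) for s in samples]

mean_all = statistics.fmean(samples)
mean_clipped = statistics.fmean(clipped)
n_negative = sum(s < 0 for s in samples)

print(f"negative samples: {n_negative} of {len(samples)}")
print(f"mean of all samples:         {mean_all:.3f}")
print(f"mean with negatives clipped: {mean_clipped:.3f}")
```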

In [None]:
def get_variable_array(variable: str):
    return np.array(granule[variable].data[:])

# Note: For some datasets, these variables may be called "latitude" or "longitude"
lat = get_variable_array("Latitude")
lon = get_variable_array("Longitude")
sif = get_variable_array("Daily_SIF_757nm")
# Setting vmax to 1.5 W/m^2/sr/μm improves the contrast of the colormapped samples
# and is based on a priori knowledge of the data range in this granule
# Remove the vmax keyword if you want matplotlib to set the data range automatically
plot_samples(
    sif, lat, lon,
    vmax=1.5,
    title=f"757nm SIF ({data_date.strftime('%Y-%m-%d')})",
    # Raw string so that "\m" in the LaTeX markup is not treated as an escape sequence
    label=r"SIF (W/$\mathrm{m}^2$/sr/μm)"
)

## III. Download a set of Data (Optional)

Now we can download a set of granules across a date range to perform analysis. After gathering the filesizes of the granules, the code will prompt you to confirm the amount of data you are about to download before proceeding. The following cell will download one month of data, but you can try different time ranges.

**Troubleshooting**: Some file downloads may fail. Retrying the operation will only download the files that you do not already have, and you can set `parallel=False` in the function call to improve your odds of success.
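
The retry behavior amounts to comparing the dates you asked for against the files already on disk. The following is a sketch of that idea, not the `geosif` implementation: `missing_dates` is a hypothetical helper, and it assumes each filename starts with a prefix followed by a six-digit `YYMMDD` date, which you should verify against your own downloads. The filenames in the example are invented.

```python
import calendar
from datetime import date
from typing import Iterable, List


def missing_dates(filenames: Iterable[str], prefix: str, year: int, month: int) -> List[date]:
    """Return the days of the month for which no matching filename exists."""
    names = list(filenames)
    _, num_days = calendar.monthrange(year, month)
    missing = []
    for day in range(1, num_days + 1):
        d = date(year, month, day)
        stamp = f"{prefix}_{d.strftime('%y%m%d')}_"  # assumed naming pattern
        if not any(name.startswith(stamp) for name in names):
            missing.append(d)
    return missing


# Pretend every day of June 2020 except the 3rd and 17th is already on disk
on_disk = [
    f"oco3_LtSIF_2006{day:02d}_example.nc4"
    for day in range(1, 31)
    if day not in (3, 17)
]

still_needed = missing_dates(on_disk, "oco3_LtSIF", 2020, 6)
print([d.isoformat() for d in still_needed])
```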

In [None]:
# Simply put in the year and month you are interested in and the timerange will be
# downloaded into a named directory for you (e.g., 2020-06)
year = 2020
month = 6
start_date = datetime(year, month, 1)
_, num_days = calendar.monthrange(year, month)
end_date = datetime(year, month, num_days)

dl_granules, dates_notfound, failed_dls = dl.download_timerange(
        dataset,
        start_date,
        end_date,
        outpath=f"data/{dataset}/{year}-{month:02d}",
        parallel=True,
    )

if dl_granules:
    print(f"Download summary: Downloaded {len(dl_granules)} out of {num_days} days")
    # Use a distinct loop name so the pydap `granule` from Step I is not overwritten
    for granule_path in sorted(dl_granules):
        print(granule_path.name)

if dates_notfound:
    print(f"No data found for {len(dates_notfound)} days:")
    for dt in sorted(dates_notfound):
        print(dt.strftime("%d/%m/%Y"))

if failed_dls:
    print(f"Failed to download {len(failed_dls)} granules:")
    for url in failed_dls:
        print(url)

## IV. Generate a Mean Daily SIF Product for One Month (PLEASE READ)

**Now let's create a gridded mean daily SIF raster from the granules for one month. There are two options for sourcing your data, and you only need to run one of the corresponding cells:**
- **Option A:** Use the locally downloaded files from step III.
- **Option B:** Use pydap to download data from GES DISC on the fly, saving you from having to store each day's worth of data.

The provided helper functions create a gridded netCDF for the `Daily_SIF_757nm` variable, but you can add to the list of variables if you wish. Note that the data is filtered to use only samples with a `Quality_Flag` of 0, which represents the best quality data. You may wish to also include points with a `Quality_Flag` of 1 (good).
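
To illustrate what such a filter does to the samples, the sketch below applies `(operator, value)` pairs to NumPy arrays. It mirrors the shape of the `filters` argument used in the cells that follow, but `apply_filters` and its operator table are assumptions made for illustration, not the `geosif` implementation, and the sample values are invented.

```python
import operator
from typing import Dict, Tuple

import numpy as np

# Map filter symbols to comparison functions (assumed set of operators)
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def apply_filters(variables: Dict[str, np.ndarray], filters: Dict[str, Tuple[str, float]]) -> np.ndarray:
    """Return a boolean mask that is True where every filter passes."""
    n = len(next(iter(variables.values())))
    keep = np.ones(n, dtype=bool)
    for name, (op, value) in filters.items():
        keep &= OPS[op](variables[name], value)
    return keep


samples = {
    "Daily_SIF_757nm": np.array([0.4, -0.1, 0.9, 0.2, 1.3]),
    "Quality_Flag": np.array([0, 1, 0, 2, 1]),
}

best_only = apply_filters(samples, {"Quality_Flag": ("=", 0)})
best_and_good = apply_filters(samples, {"Quality_Flag": ("<", 2)})

print(samples["Daily_SIF_757nm"][best_only])      # samples flagged 0
print(samples["Daily_SIF_757nm"][best_and_good])  # samples flagged 0 or 1
```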


Customizations:
- Change the start and end dates specified by the first two arguments to adjust the month referenced for the output product.
- Add additional variables to average in the output netCDF (ex: `Daily_SIF_771nm`, `Science_SIF_Relative_757nm`, `Science_daily_correction_factor`, etc.)
- Change the name of the output file by changing the value of the `daily_avg_file` variable.
- You can also specify a bounding box for the data, such as `lon_min=-130, lon_max=-65, lat_min=22, lat_max=50,` for the CONUS.
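
Conceptually, gridding assigns each sample to a latitude/longitude cell and averages the samples that fall in each cell. The sketch below shows one way to do this with NumPy for a bounding box and a 0.5° resolution. It is an illustration under stated assumptions, not the `geosif` code: `grid_mean` is a hypothetical function, empty cells are set to NaN here rather than a fill value, and the sample points are invented.

```python
import numpy as np


def grid_mean(lat, lon, values, lat_min, lat_max, lon_min, lon_max, res=0.5):
    """Average samples into a regular (n_lon, n_lat) grid; empty cells are NaN."""
    lat, lon, values = map(np.asarray, (lat, lon, values))
    n_lat = int(round((lat_max - lat_min) / res))
    n_lon = int(round((lon_max - lon_min) / res))

    # Drop samples outside the bounding box
    inside = (lat >= lat_min) & (lat < lat_max) & (lon >= lon_min) & (lon < lon_max)
    lat, lon, values = lat[inside], lon[inside], values[inside]

    # Integer cell index of each remaining sample
    i_lat = ((lat - lat_min) / res).astype(int)
    i_lon = ((lon - lon_min) / res).astype(int)

    # Accumulate sums and counts per cell; np.add.at handles repeated indices
    total = np.zeros((n_lon, n_lat))
    count = np.zeros((n_lon, n_lat))
    np.add.at(total, (i_lon, i_lat), values)
    np.add.at(count, (i_lon, i_lat), 1)

    with np.errstate(invalid="ignore"):
        return np.where(count > 0, total / count, np.nan)


grid = grid_mean(
    lat=[30.1, 30.2, 45.7, 60.0],
    lon=[-100.1, -100.3, -70.2, -100.0],
    values=[0.2, 0.4, 1.0, 9.9],  # the last point lies outside the box
    lat_min=22, lat_max=50, lon_min=-130, lon_max=-65,
)
print(grid.shape)
```

The first two samples fall in the same cell and are averaged together, while the sample at 60° latitude is discarded by the bounding box.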

In [None]:
# Run this cell regardless of which option you choose
year = 2020
month = 6
start_date = datetime(year, month, 1)
_, num_days = calendar.monthrange(year, month)
end_date = datetime(year, month, num_days)

daily_avg_file = f"data/{dataset}/{start_date.strftime('%b_%Y').lower()}_sif.nc4"

**Option A:** Use the files you downloaded in Step III to create the gridded raster. This will take about 30 seconds now that you already have the data locally.

In [None]:
# For this option, note that local_dataset has the value that is
# expected in the filename of your downloaded granules, rather than
# the name of the dataset on the DAAC.
local_dataset = "oco3_LtSIF"
create_gridded_raster(
    start_date,
    end_date,
    local_dataset,
    ["Daily_SIF_757nm"],
    daily_avg_file,
    lat_res = 0.5,
    lon_res = 0.5,
    # Uncomment to bound the data to the CONUS
    # lon_min=-130, lon_max=-65, lat_min=22, lat_max=50,
    local_dir=f"data/{dataset}/{year}-{month:02d}",
    filters={
        # Change Quality_Flag to ("<", 2) if you would like to use both best and good quality data
        "Quality_Flag": ("=", 0),
        # Filter to Nadir mode observations
        # "Metadata/MeasurementMode": ("=", 0)
    }
)

**Option B:** Use pydap to download data on the fly. This option uses less disk space but will take about 5-10 minutes to finish processing.

You can skip this block and proceed to plotting if you have already generated the `daily_avg_file` referenced in the block above.

In [None]:
create_gridded_raster(
    start_date,
    end_date,
    dataset,
    ["Daily_SIF_757nm"],
    daily_avg_file,
    lat_res = 0.5,
    lon_res = 0.5,
    # Uncomment to bound the data to the CONUS
    # lon_min=-130, lon_max=-65, lat_min=22, lat_max=50,
    filters={
        # Change Quality_Flag to ("<", 2) if you would like to use both best and good quality data
        "Quality_Flag": ("=", 0)
    }
)

### Display the Monthly Gridded Raster

If you used a different dataset or variable for your monthly gridded raster, be sure to modify the values of `title`, `label`, and `outfile` in the code block below.
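
The plotting cell relies on two NumPy behaviors that are easy to check in isolation: `np.meshgrid(lons, lats)` returns arrays of shape `(len(lats), len(lons))`, hence the transpose, and averaging a masked array skips the masked fill values. The sketch below uses a tiny invented grid and a made-up fill value.

```python
import numpy as np

fill_val = -9999.0  # made-up fill value for this example
lons = np.array([-100.0, -99.5, -99.0])
lats = np.array([30.0, 30.5])

lon2d, lat2d = np.meshgrid(lons, lats)
print(lon2d.shape)    # (len(lats), len(lons))
print(lon2d.T.shape)  # (len(lons), len(lats))

# Two "days" of data on a (time, lon, lat) grid with some unfilled cells
data = np.array([
    [[0.2, fill_val], [0.4, 0.6], [fill_val, 0.8]],
    [[0.4, fill_val], [fill_val, 0.2], [fill_val, 1.0]],
])

# Masking the fill value keeps it out of the average
masked = np.ma.masked_where(data == fill_val, data)
mean_over_time = masked.mean(axis=0)

# Without masking, the fill value contaminates the result
naive_mean = data.mean(axis=0)

print(mean_over_time)
print(naive_mean)
```

Cells that are filled on only one day keep that day's value, and cells that are never filled stay masked in the result.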

In [None]:
ds = Dataset(daily_avg_file, "r")
var_name = "Daily_SIF_757nm"
var = ds[var_name]
lats = ds["lat"][:]
lons = ds["lon"][:]
fill_val = var._FillValue

daily_avg_data = var[...]
ds.close()
lon2d, lat2d = np.meshgrid(lons, lats)
# Transpose the meshgrid result to shape (n_lon, n_lat), e.g. (720, 360) for a global 0.5° grid
lon2d = lon2d.T
lat2d = lat2d.T

# Create a masked array where the fill_val is masked
data_masked = np.ma.masked_where(daily_avg_data == fill_val, daily_avg_data)
# Average over axis 0 (the "time" dimension), producing a masked array of shape (n_lon, n_lat)
mean_data_masked = data_masked.mean(axis=0)

# Be sure to change the title and label if you change the monthly gridded raster you want to display
# To avoid overwriting the output image when making changes, you should also change the value of outfile
plot_gridded(
    mean_data_masked, lat2d, lon2d,
    vmax=0.8,
    vmin=0,
    # Uncomment to window the plot to the CONUS
    # extents=[-130, -65, 22, 50],
    title=f"OCO-3 {start_date.strftime('%B %Y')} Mean Daily SIF$_{{757}}$",
    # Raw string so that "\m" in the LaTeX markup is not treated as an escape sequence
    label=r"SIF (W/$\mathrm{m}^2$/sr/μm)",
    outfile=f"mean_oco3_{var_name.lower()}_{start_date.strftime('%B_%Y').lower()}.png"
)

## V. Create an OCO-2 XCO<sub>2</sub> Monthly Gridded Raster

Next, we will repeat the process of generating and plotting a monthly gridded raster in an abridged manner for an XCO<sub>2</sub> dataset. The process is largely the same; only the function arguments and plot labels change.
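
Because only a handful of values change between datasets, one option is to collect them in a small container and derive the paths from it. The sketch below is a suggestion rather than part of the helper library: `RasterConfig` is a hypothetical name, and the path patterns are copied from the cells in this notebook.

```python
import calendar
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RasterConfig:
    dataset: str        # dataset name on GES DISC
    local_dataset: str  # first part of the granule filenames
    var_name: str       # variable to grid
    year: int
    month: int

    @property
    def start_date(self) -> datetime:
        return datetime(self.year, self.month, 1)

    @property
    def end_date(self) -> datetime:
        _, num_days = calendar.monthrange(self.year, self.month)
        return datetime(self.year, self.month, num_days)

    @property
    def granule_dir(self) -> str:
        return f"data/{self.dataset}/{self.year}-{self.month:02d}"

    @property
    def daily_avg_file(self) -> str:
        stamp = self.start_date.strftime("%b_%Y").lower()
        return f"data/{self.dataset}/{stamp}_{self.var_name}.nc4"


xco2_cfg = RasterConfig("OCO2_L2_Lite_FP.11.2r", "oco2_LtCO2", "xco2", 2020, 6)
print(xco2_cfg.granule_dir)
print(xco2_cfg.daily_avg_file)
```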

In [None]:
# These are all the dataset-dependent variables
dataset = "OCO2_L2_Lite_FP.11.2r"
local_dataset = "oco2_LtCO2" # This is the first part of the filename
mission_name = "OCO-2"
year = 2020
month = 6
var_name = "xco2"
var_label = "XCO$_2$"
unit = "ppm"
start_date = datetime(year, month, 1)
_, num_days = calendar.monthrange(year, month)
end_date = datetime(year, month, num_days)

granule_dir = f"data/{dataset}/{year}-{month:02d}"
daily_avg_file = f"data/{dataset}/{start_date.strftime('%b_%Y').lower()}_{var_name}.nc4"

Download the granules, then generate the raster. XCO<sub>2</sub> granules are three times larger than SIF granules, so this will grab about 2 GB of data.

In [None]:
dl = GesDiscDownloader()
dl_granules, dates_notfound, failed_dls = dl.download_timerange(
        dataset,
        start_date,
        end_date,
        outpath=granule_dir,
        parallel=True,
    )

if dl_granules:
    print(f"Download summary: Downloaded {len(dl_granules)} out of {num_days} days")
    # Use a distinct loop name so the pydap `granule` from Step I is not overwritten
    for granule_path in sorted(dl_granules):
        print(granule_path.name)

if dates_notfound:
    print(f"No data found for {len(dates_notfound)} days:")
    for dt in sorted(dates_notfound):
        print(dt.strftime("%d/%m/%Y"))

if failed_dls:
    print(f"Failed to download {len(failed_dls)} granules:")
    for url in failed_dls:
        print(url)

create_gridded_raster(
    start_date,
    end_date,
    local_dataset,
    [var_name],
    daily_avg_file,
    lat_res = 0.5,
    lon_res = 0.5,
    # Uncomment to bound the data to the CONUS
    # lon_min=-130, lon_max=-65, lat_min=22, lat_max=50,
    local_dir=granule_dir,
)

In [None]:
ds = Dataset(daily_avg_file, "r")
var = ds[var_name]
lats = ds["lat"][:]
lons = ds["lon"][:]
fill_val = var._FillValue

daily_avg_data = var[...]
ds.close()
lon2d, lat2d = np.meshgrid(lons, lats)
# Transpose the meshgrid result to shape (n_lon, n_lat), e.g. (720, 360) for a global 0.5° grid
lon2d = lon2d.T
lat2d = lat2d.T

# Create a masked array where the fill_val is masked
data_masked = np.ma.masked_where(daily_avg_data == fill_val, daily_avg_data)
# Average over axis 0 (the "time" dimension), producing a masked array of shape (n_lon, n_lat)
mean_data_masked = data_masked.mean(axis=0)

# Be sure to change the title and label if you change the monthly gridded raster you want to display
# To avoid overwriting the output image when making changes, you should also change the value of outfile
plot_gridded(
    mean_data_masked, lat2d, lon2d,
    vmin=405,
    vmax=420,
    # Uncomment to window the plot to the CONUS
    # extents=[-130, -65, 22, 50],
    title=f"{mission_name} {start_date.strftime('%B %Y')} Mean Daily {var_label}",
    label=f"{var_label} ({unit})",
    outfile=f"mean_{mission_name.lower()}_{var_name.lower()}_{start_date.strftime('%B_%Y').lower()}.png"
)