# SIF Data Exploration

This notebook will guide you through the steps involved in collecting solar-induced fluorescence (SIF) data from NASA's Goddard Earth Sciences Data and Information Services Center (GES DISC), an online archive that stores data from the Orbiting Carbon Observatory-3 (OCO-3) instrument, among others.

The first code block below imports the libraries and helper functions needed to explore and display the data.

In [None]:
import cartopy.crs as ccrs
from datetime import datetime
from IPython.display import display, Markdown
import matplotlib.pyplot as plt
from netCDF4 import Dataset
import numpy as np
import os
import sys
import textwrap

# Add src directory containing helper code to sys.path
sys.path.append(os.path.abspath("../src"))


from geosif import GesDiscDownloader, plot_samples, plot_gridded, create_gridded_raster

# an additional helper function for displaying long lists
def wrapped_markdown_list(my_list, width=160):
    wrapped_text = textwrap.fill(", ".join(my_list), width=width)
    display(Markdown(f"```\n{wrapped_text}\n```"))

## I. Getting granules from GES DISC

The GES DISC stores various datasets associated with the OCO-2 and OCO-3 instruments. In this training we focus on the "SIF Lite" datasets, which have already received L2 processing to extract chlorophyll fluorescence signatures in two retrieval windows near the O2 A-band (757 nm and 771 nm). Data is served through an OpenDAP interface that provides a browsing experience similar to looking at a directory tree. You can navigate this directory tree yourself here: [https://oco2.gesdisc.eosdis.nasa.gov/opendap/](https://oco2.gesdisc.eosdis.nasa.gov/opendap/)

A "granule" is an instrument data file, typically in netCDF (.nc or .nc4) format, containing a set of related variables from a time range of observations. Granules in general can be daily or subdaily in time cadence. In the case of the OCO3_L2_Lite_SIF.11r dataset we are looking at today, each netCDF file on GES DISC corresponds to a single day's worth of instrument observations.

The pydap module can lazily evaluate data in the archive without downloading it until needed, which lets us explore the variables in a given granule before we download it. You may see a pydap warning when loading the data. This is not a cause for concern.
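
To make the idea of lazy evaluation concrete, here is a toy sketch in plain Python. It is not pydap's implementation: `LazyVariable` and `fake_remote_read` are hypothetical names invented for this illustration, and a real client would request only the selected slice from the server rather than reading everything and then indexing.

```python
class LazyVariable:
    """Toy stand-in for a remote variable: data is fetched only when sliced."""

    def __init__(self, name, fetch):
        self.name = name
        self._fetch = fetch      # callable that performs the (expensive) transfer
        self.fetch_count = 0     # number of transfers performed so far

    def __getitem__(self, key):
        self.fetch_count += 1
        return self._fetch()[key]


def fake_remote_read():
    # Hypothetical stand-in for a network request
    return [0.12, -0.03, 0.48, 0.91]


sif = LazyVariable("Daily_SIF_757nm", fake_remote_read)
print(sif.name, sif.fetch_count)   # the name is known, nothing transferred yet
values = sif[:]                    # slicing triggers the transfer
print(values, sif.fetch_count)
```

The point of the sketch is only that inspecting names and metadata costs nothing, while slicing is what moves data.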

**If you want to look at a different dataset, change the value of the `dataset` variable and re-run the cell below.**

In [None]:
print("Gathering datasets on GES DISC...")
dl = GesDiscDownloader()

dataset = "OCO3_L2_Lite_SIF.11r"
print(f"Getting time range for {dataset} data...")
timerange = dl.get_dataset_timerange(dataset)
print(
    f"{dataset} has time range {timerange[0].strftime('%Y-%m-%d')} to {timerange[1].strftime('%Y-%m-%d')}"
)

**If you want to look at a different date, change the date specified in `data_date` and re-run the cell below!**

In [None]:
# Get the OCO-3 SIF Lite V11r product from December 1, 2019
data_date = datetime(2019, 12, 1) # Replace with a different date if you'd like
granule = dl.get_granule_by_date(dataset, data_date)
print(f"\n\nThe {dataset} granule from {data_date.strftime('%Y-%m-%d')} has the following variables:")

wrapped_markdown_list(list(granule.keys()))

## II. Download Variables and Plot

As mentioned previously, the SIF granules contain many different variables needed for further analysis. To get a quick sense of where observations were acquired on this particular day, we can download the Latitude and Longitude coordinates alongside the Daily_SIF_757nm variable. SIF values are colormapped using the viridis colormap by default.

The first time you run this code block, you will get a few warnings from cartopy notifying you that it is downloading public resources for displaying the map context. This is expected and not a problem. It may take 20-30 seconds to download all the data, so please be patient.

Note that some of the SIF samples have negative values. This is normal and expected.
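
One common reason for keeping negative values is that individual retrievals are noisy, so a small true signal plus noise can land below zero. The sketch below illustrates this with simulated numbers. The true signal and noise level are assumptions chosen for illustration, not properties of the OCO-3 product.

```python
import random

rng = random.Random(0)          # fixed seed so the sketch is repeatable
true_sif = 0.1                  # assumed true signal, W/m^2/sr/μm
noise_sd = 0.5                  # assumed retrieval noise, same units

samples = [rng.gauss(true_sif, noise_sd) for _ in range(20000)]

mean_all = sum(samples) / len(samples)
# Discarding negative samples keeps only part of the noise distribution
positives = [s for s in samples if s >= 0]
mean_positive_only = sum(positives) / len(positives)

print(f"mean with negatives kept:    {mean_all:.3f}")
print(f"mean with negatives dropped: {mean_positive_only:.3f}")
```

Under these assumptions, the average that keeps negative samples stays close to the true value, while the average that drops them is biased high.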

In [None]:
def get_variable_array(variable: str):
    return np.array(granule[variable].data[:])

lat = get_variable_array("Latitude")
lon = get_variable_array("Longitude")
sif = get_variable_array("Daily_SIF_757nm")
# Setting vmax to 1.5 W/m^2/sr/μm improves the contrast of the colormapped samples
# and is based on a priori knowledge of the data range in this granule
# Remove the vmax keyword if you want matplotlib to set the data range automatically
plot_samples(
    sif, lat, lon,
    vmax=1.5,
    title=f"757nm SIF ({data_date.strftime('%Y-%m-%d')})",
    label="SIF (W/m^2/sr/μm)"
)

## III. Download a Set of Data (Optional)

Now we can download a set of granules across a date range to perform analysis. After gathering the file sizes of the granules, the code will prompt you to confirm the amount of data you are about to download before proceeding. The following cell will download one month of data, but you can try different time ranges.

**Troubleshooting**: Some file downloads may fail. If that happens, simply retry the operation: only the files you do not already have will be downloaded. Setting `parallel=False` in the function call may also improve your odds of success.
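
The retry behavior described above can be pictured with a minimal sketch. This is not the `geosif` implementation: `download_missing` and `fake_fetch` are hypothetical helpers, and the file names are made up.

```python
import os
import tempfile


def download_missing(filenames, outpath, fetch):
    """Fetch only the files that are not already present in outpath.

    `fetch` is a hypothetical callable(filename, destination) that writes one file.
    Returns the list of filenames that were actually fetched.
    """
    os.makedirs(outpath, exist_ok=True)
    fetched = []
    for name in filenames:
        dest = os.path.join(outpath, name)
        if os.path.exists(dest):
            continue  # already downloaded on a previous attempt
        fetch(name, dest)
        fetched.append(name)
    return fetched


def fake_fetch(name, dest):
    # Stand-in for a real network transfer
    with open(dest, "w") as f:
        f.write(f"contents of {name}")


with tempfile.TemporaryDirectory() as tmp:
    wanted = ["day1.nc4", "day2.nc4", "day3.nc4"]
    fake_fetch("day1.nc4", os.path.join(tmp, "day1.nc4"))  # simulate an earlier partial run
    retried = download_missing(wanted, tmp, fake_fetch)
    print(retried)
```

A check like this makes a retry cheap, because files from an earlier attempt are skipped rather than transferred again.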

In [None]:
# Download data from 2020-04-01 through 2020-04-30
dl.download_timerange(
    dataset,
    datetime(2020, 4, 1),
    datetime(2020, 4, 30),
    outpath="../data",
    parallel=True,
)

## IV. Generate a Mean Daily SIF Product for One Month (Step III not required!)

We can generate a gridded mean daily SIF product from the granules for one month. The provided helper function will do so for the Daily_SIF_757nm variable, but you can add to the list of variables if you wish. Data is downloaded from OpenDAP using pydap, so no intermediate netCDF files will be stored on your drive. 
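
As a rough picture of what "gridded mean" means, the sketch below bins a few made-up samples into a 1-degree global grid and averages the samples that share a cell. The 1-degree resolution is an assumption inferred from the (360, 180) array shape mentioned in the plotting cell later in this notebook, and this is not necessarily how `create_gridded_raster` works internally.

```python
import numpy as np

# Three made-up samples; two of them fall in the same 1-degree cell
lat = np.array([10.2, 10.7, -33.5])
lon = np.array([20.1, 20.9, 150.3])
sif = np.array([0.4, 0.6, 0.2])

n_lon, n_lat = 360, 180                      # assumed 1-degree global grid
lon_idx = np.clip(np.floor(lon + 180).astype(int), 0, n_lon - 1)
lat_idx = np.clip(np.floor(lat + 90).astype(int), 0, n_lat - 1)

total = np.zeros((n_lon, n_lat))
count = np.zeros((n_lon, n_lat))
np.add.at(total, (lon_idx, lat_idx), sif)    # accumulate, handling repeated cells
np.add.at(count, (lon_idx, lat_idx), 1)

# Cells with no samples stay masked instead of being reported as zero
grid_mean = np.ma.masked_where(count == 0, total / np.maximum(count, 1))

print(grid_mean[200, 100])   # cell containing the first two samples
print(grid_mean.count())     # number of cells that received data
```

`np.add.at` is used instead of plain fancy-index assignment so that several samples landing in the same cell are all accumulated.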

Customizations:
- Change the start and end dates specified by the first two arguments to adjust the month referenced for the output product.
- Add additional variables to average in the output netCDF (e.g., Daily_SIF_771nm, Science_SIF_Relative_757nm, Science_daily_correction_factor).
- Change the name of the output file by changing the value of the `daily_avg_file` variable.
- Specify a bounding box for the data by adding keywords such as lat_min=-90 and lat_max=90 after the filename argument.
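
To illustrate what a bounding box does to the samples, here is a small sketch with made-up coordinates. The names `lon_min` and `lon_max` are assumed by analogy with `lat_min` and `lat_max`, so check the helper's signature before relying on them. The helper's own filtering may also differ from this.

```python
import numpy as np

lat = np.array([34.1, 51.5, -23.5, 40.7])
lon = np.array([-118.2, -0.1, -46.6, -74.0])
sif = np.array([0.31, 0.12, 0.55, 0.20])

# Hypothetical bounding box roughly covering the contiguous United States
lat_min, lat_max = 24.0, 50.0
lon_min, lon_max = -125.0, -66.0

# Boolean mask that is True only for samples inside the box
inside = (lat >= lat_min) & (lat <= lat_max) & (lon >= lon_min) & (lon <= lon_max)
print(sif[inside])
```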

In [None]:
daily_avg_file = "../data/apr_2020_sif.nc4"

**This step will take about 5 to 10 minutes to finish processing.**

You can skip this block and proceed to plotting if you have already generated the `daily_avg_file` referenced in the block above.

In [None]:
create_gridded_raster(
    datetime(2020, 4, 1),
    datetime(2020, 4, 30),
    dataset,
    ["Daily_SIF_757nm"],
    daily_avg_file,
)

### Display the Monthly Gridded Raster

If you used different parameters, for example dates, for your monthly gridded raster, be sure to modify the values of `title`, `label` and `vmax` in the code block below.
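
The plotting cell masks fill values before averaging over the time dimension. The toy example below shows why that matters, using a tiny array. The fill value of -999 is an assumption for illustration. In the notebook it is read from the file.

```python
import numpy as np

fill_val = -999.0   # assumed fill value; the real one is read from the file
# Shape (time, cell): two days, three grid cells
stack = np.array([
    [0.2, fill_val, fill_val],
    [0.4, 0.6,      fill_val],
])

masked = np.ma.masked_where(stack == fill_val, stack)
mean_over_time = masked.mean(axis=0)   # masked entries are ignored per cell

print(mean_over_time)
```

A cell observed on only one day takes that day's value, and a cell never observed stays masked rather than being dragged toward the fill value.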

In [None]:

# Read everything we need inside a context manager so the file is always closed
with Dataset(daily_avg_file, "r") as ds:
    var = ds["Daily_SIF_757nm"]
    lats = ds["lat"][:]
    lons = ds["lon"][:]
    fill_val = var._FillValue
    daily_avg_data = var[...]
lon2d, lat2d = np.meshgrid(lons, lats)
# Transpose the meshgrid result to be of shape (360, 180)
lon2d = lon2d.T
lat2d = lat2d.T

# Create a masked array where the fill_val is masked
data_masked = np.ma.masked_where(daily_avg_data == fill_val, daily_avg_data)
# Average over axis 0 (the "time" dimension), produces a masked array of shape (360, 180)
mean_data_masked = data_masked.mean(axis=0)

# Be sure to change the title and label if you change the monthly gridded raster you want to display
plot_gridded(
    mean_data_masked, lon2d, lat2d,
    vmax=0.8,
    title="Daily Average SIF (757 nm) for April 2020",
    label="SIF (W/m^2/sr/μm)"
)