# Using concurrency to download data from an HTTP server

This demo notebook focuses on using the `httpx` and `asyncio` libraries to connect to the PRISM HTTP server, download zipped files for daily data, and extract the BIL files to a directory. To start, run the next code cell.

## Download daily PRISM data 

The download may take several minutes; a message is shown when it is complete. For this demo, we will start with 1988, the variable 'ppt', and the rest of the default settings.

In [None]:
%run -i ./async_prism_download.py
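
The download script itself is not reproduced in this notebook. As a minimal sketch of the extraction step only, assuming each response body is a zip archive held in memory as bytes, the standard-library `io` and `zipfile` modules can unpack it into a directory. The function name `extract_zip_bytes` is a hypothetical helper for illustration, not something taken from the script.

```python
import io
import zipfile
from pathlib import Path


def extract_zip_bytes(payload, out_dir):
    """Unpack an in-memory zip archive into out_dir and return the member names.

    `payload` is assumed to be the raw bytes of a zip file, for example the
    body of an HTTP response.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # BytesIO lets zipfile read the archive without writing the zip to disk first
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(out_dir)
        return sorted(archive.namelist())
```

Extracting straight from memory avoids leaving hundreds of intermediate zip files next to the extracted rasters.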

`httpx` and `asyncio` make use of async/await syntax to implement asynchronous programming. Async/await lets us write asynchronous code that reads like synchronous code: operations that would normally run one after another (download data for day 1, then download data for day 2) can instead overlap, so one request is in flight while another is still waiting. Asynchronous programming helps when the main bottleneck is waiting for external events, such as network I/O and timeouts.
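
To make the idea concrete, here is a small sketch that uses only `asyncio`. The network request is simulated with `asyncio.sleep`, and the names `fake_download`, `download_all`, and `run_demo` are invented for illustration; they are not part of the download script.

```python
import asyncio
import time


async def fake_download(day, delay=0.2):
    """Stand-in for a network request: wait, then return a label."""
    # while this coroutine waits, the event loop is free to run the others
    await asyncio.sleep(delay)
    return f"day {day} done"


async def download_all(days):
    """Start every simulated download at once and wait for all of them."""
    return await asyncio.gather(*(fake_download(day) for day in days))


def run_demo(days):
    """Run the downloads and report how long the whole batch took."""
    start = time.perf_counter()
    results = asyncio.run(download_all(days))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_demo(range(1, 6))
    print(results)
    print(f"elapsed: {elapsed:.2f} s")
```

Because the waits overlap, the batch takes roughly as long as one simulated request rather than the sum of all of them. Inside a notebook, where an event loop is already running, call `await download_all(...)` directly instead of `asyncio.run`.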

This differs from something such as multiprocessing or parallel processing, which is useful for CPU-bound tasks. For example, if you have a list of numbers and you want to square each number, you can use multiprocessing to split the list into chunks and have each core of your CPU square the numbers in one chunk.
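
The squaring example can be sketched with the standard-library `multiprocessing` module. The helper names and the default of two workers are arbitrary choices for illustration.

```python
from multiprocessing import Pool


def square_chunk(chunk):
    """Square every number in one chunk (runs inside a worker process)."""
    return [n * n for n in chunk]


def split_into_chunks(numbers, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(numbers) // n_chunks))  # ceiling division
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


def parallel_square(numbers, n_workers=2):
    """Square a list of numbers using a pool of worker processes."""
    chunks = split_into_chunks(numbers, n_workers)
    with Pool(processes=n_workers) as pool:
        squared_chunks = pool.map(square_chunk, chunks)
    # flatten the per-chunk results back into one list
    return [value for chunk in squared_chunks for value in chunk]


if __name__ == "__main__":
    print(parallel_square(list(range(10))))
```

For a task this small, the cost of starting processes outweighs the gain; the pattern pays off only when each chunk involves substantial computation.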

Now, onto the code. The first step is to import the libraries we will use.

Import libraries

In [None]:
from glob import glob
import io
import os
from pathlib import Path
from time import sleep                                          
import zipfile
from dask.distributed import LocalCluster, Client
from dask import config as cfg
import fsspec
import hvplot.xarray
import httpx
import pandas as pd
import rioxarray
import xarray as xr

cfg.set({'distributed.scheduler.worker-ttl': None})
hvplot.extension('bokeh')

## Setup dask client

In [None]:
cluster = LocalCluster()
cluster.adapt(minimum=1, maximum=6)

# connect a client so that dask computations run on this cluster
client = Client(cluster)

## Read all of the daily BIL files into a single xarray dataset

Before reading the files, we want to check two things. First, is there an existing zarr file for that variable? Second, is that year already present in the data?

In [None]:
# set up variables
year_str = '1988'
var = "ppt"

def date_range_list(year):
    """Create list of dates for a given year"""
    date_list = (pd.date_range(year + '-01-01', year + '-12-31')
                .strftime("%Y%m%d")
                .tolist())
    return date_list

prism_date = date_range_list(year_str)
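
One detail worth checking is the length of the date list: 1988 is a leap year, so it has 366 days rather than 365. The following sketch reproduces the same `YYYYMMDD` strings with only the standard library, which makes the count easy to verify. The helper names `days_in_year` and `date_strings` are illustrative.

```python
import calendar
from datetime import date, timedelta


def days_in_year(year):
    """Return 366 for leap years and 365 otherwise."""
    return 366 if calendar.isleap(int(year)) else 365


def date_strings(year):
    """Return every day of the year formatted as YYYYMMDD."""
    start = date(int(year), 1, 1)
    return [
        (start + timedelta(days=offset)).strftime("%Y%m%d")
        for offset in range(days_in_year(year))
    ]


if __name__ == "__main__":
    for year in ("1988", "1989"):
        print(year, len(date_strings(year)))
```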

In [None]:
zarr_path_base = Path("./zarr/")

zarr_filename = var + ".zarr"

zarr_path = zarr_path_base / zarr_filename

if zarr_path.exists():
    var_zarr = xr.open_zarr(zarr_path)
    if len(var_zarr.sel(time=slice(prism_date[0], prism_date[-1])).coords["time"]) > 0:
        print(f"Year already exists in dataset for variable {var}. Proceed to another year or variable.")
    else:
        print("Proceed with workflow.")
else:
    print("Zarr file does not exist. Proceed with workflow.")
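
The check above relies on the fact that selecting a time slice that lies outside the stored data returns an empty selection. Here is a toy illustration with an in-memory dataset instead of a zarr store; the dataset contents and the helper `year_is_present` are made up for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy dataset holding one value per day of 1988
times = pd.date_range("1988-01-01", "1988-12-31")
toy = xr.Dataset(
    {"ppt": ("time", np.zeros(len(times)))},
    coords={"time": times},
)


def year_is_present(ds, year):
    """Return True if the dataset holds at least one time step in `year`."""
    subset = ds.sel(time=slice(f"{year}-01-01", f"{year}-12-31"))
    return subset.sizes["time"] > 0


if __name__ == "__main__":
    print(year_is_present(toy, 1988), year_is_present(toy, 1989))
```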

Now we can move on to lazily reading all of the BIL files as `xarray` DataArrays using `rioxarray` and `dask`. Let's write a function to handle this in case we want to use it again (spoiler alert: we will).

In [None]:
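def create_da_list(year):
    """Lazily open every BIL file for a given year as a dask-backed DataArray"""
    da_list = []
    bil_path = f"download/*_{year}*_bil.bil"
    # glob returns files in arbitrary order; sort so that they line up
    # with the chronological date list (the date is part of the file name)
    bil_files_list = sorted(glob(bil_path))
    for file in bil_files_list:
        # the context manager closes the file handle on exit
        with rioxarray.open_rasterio(file, chunks={}) as f:
            da_list.append(f)
    return da_list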

In [None]:
pr_da_list = create_da_list(year_str)
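
Since the files are later paired with dates purely by position, it can help to confirm that every expected day has a file. The sketch below assumes each file name contains the date as an eight-digit `YYYYMMDD` token directly before `_bil.bil`, which matches the glob pattern used above but is otherwise an assumption. The helper names and sample paths are hypothetical.

```python
import re
from pathlib import Path

# assumed naming convention: ..._YYYYMMDD_bil.bil
DATE_TOKEN = re.compile(r"_(\d{8})_bil\.bil$")


def date_from_name(path):
    """Extract the YYYYMMDD token from a file name, or None if absent."""
    match = DATE_TOKEN.search(Path(path).name)
    return match.group(1) if match else None


def order_and_check(paths, expected_dates):
    """Order files by date and report which expected dates have no file."""
    by_date = {date_from_name(p): p for p in paths}
    ordered = [by_date[d] for d in expected_dates if d in by_date]
    missing = [d for d in expected_dates if d not in by_date]
    return ordered, missing


if __name__ == "__main__":
    sample = [
        "download/example_ppt_19880102_bil.bil",
        "download/example_ppt_19880101_bil.bil",
    ]
    print(order_and_check(sample, ["19880101", "19880102", "19880103"]))
```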

Next, we want to add the time dimension to each DataArray, convert each one to an `xarray.Dataset`, and then concatenate them into a single `xarray.Dataset`. Let's write a function to handle this in case we want to use it again (spoiler alert: we will).

In [None]:
def process_list_datarrays(da_list, date_list, var):
    """Add a time dimension to each DataArray and combine them into one Dataset"""
    # the DataArrays and dates are paired by position, so the lengths must match
    if len(da_list) != len(date_list):
        raise ValueError(f"Expected {len(date_list)} files but found {len(da_list)}.")

    # create a list to hold the datasets
    ds_list = []

    # add time dimension to each DataArray and convert it to a Dataset
    for i in range(len(date_list)):
        # get single day
        day = pd.date_range(date_list[i], periods=1)

        # convert to DataArray
        time_da = xr.DataArray(day, [('time', day)])

        # expand dims
        da_list[i] = da_list[i].expand_dims(time=time_da)

        # name the DataArray after the variable
        da_list[i].name = var

        # squeeze band dimension
        da_list[i] = da_list[i].squeeze("band", drop=True)

        # convert to dataset
        ds_list.append(da_list[i].to_dataset())

    # concatenate the daily datasets along the time dimension
    ds = xr.concat(ds_list, dim='time', combine_attrs='drop')

    return ds

In [None]:
pr_ds = process_list_datarrays(pr_da_list, prism_date, var)
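
To see what the function does to the array shapes, here is a toy version with three tiny synthetic rasters instead of real BIL files. The array sizes, values, and the helper `make_day` are invented for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_day(value):
    """Build a tiny (band, y, x) array shaped like a single-band raster."""
    return xr.DataArray(
        np.full((1, 2, 3), value, dtype="float32"),
        dims=("band", "y", "x"),
        coords={"band": [1], "y": [1.0, 0.0], "x": [0.0, 1.0, 2.0]},
    )


dates = pd.date_range("1988-01-01", periods=3)
daily = []
for day, value in zip(dates, [0.0, 1.5, 3.0]):
    # drop the length-one band dimension, then add a length-one time dimension
    da = make_day(value).squeeze("band", drop=True)
    da = da.expand_dims(time=pd.DatetimeIndex([day]))
    daily.append(da.rename("ppt").to_dataset())

# stack the single-day datasets along time
toy_ds = xr.concat(daily, dim="time")

if __name__ == "__main__":
    print(dict(toy_ds.sizes))
```

Each input starts as `(band, y, x)` and ends up as one slice of a `(time, y, x)` variable named `ppt`.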

Finally, a little tidying up before exporting the dataset to a zarr file.

In [None]:
# create list of attrs from pr_da_list[0]
attrs_list = list(pr_da_list[0].attrs.keys())[-3:]

# create dict of attrs
attrs = {k: pr_da_list[0].attrs[k] for k in attrs_list}

# add attrs to pr_ds
pr_ds.attrs = attrs

# create chunk dict
# time is a single chunk spanning the whole year; x and y are split into 281 x 207 tiles
chunk_dict = {'time': pr_ds.sizes['time'], 'x': 281, 'y': 207}

# rechunk
pr_ds_rechunk = pr_ds.chunk(chunk_dict)

# if the zarr file exists, append to it along the time dimension
if zarr_path.exists():
    pr_ds_rechunk.to_zarr(zarr_path, append_dim="time")
    print("Appending to existing zarr file.")
else:
    pr_ds_rechunk.to_zarr(zarr_path)
    print("Creating new zarr file.")
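
A quick calculation gives a feel for what this chunking means. The sketch assumes the grid is 1405 columns by 621 rows and that values are stored as 4-byte floats; neither figure is stated in this notebook, so treat both as assumptions. The function name `chunk_summary` is illustrative.

```python
import math


def chunk_summary(n_time, grid_x, grid_y, chunk_x, chunk_y, itemsize=4):
    """Summarise how a (time, y, x) array is tiled by the chosen chunk sizes."""
    spatial_chunks = math.ceil(grid_x / chunk_x) * math.ceil(grid_y / chunk_y)
    elements = n_time * chunk_x * chunk_y
    return {
        "spatial_chunks": spatial_chunks,
        "elements_per_chunk": elements,
        "mib_per_chunk": elements * itemsize / 2**20,
    }


if __name__ == "__main__":
    # assumed grid size: 1405 x 621; 366 daily steps for a leap year
    print(chunk_summary(366, 1405, 621, 281, 207))
```

Under these assumptions, 281 and 207 divide the grid evenly into 5 by 3 tiles, and each tile holds a full year of data for its area.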

Now let's see what our new data looks like when read from file!

In [None]:
newly_minted_zarr = xr.open_zarr(zarr_path, decode_coords="all")
newly_minted_zarr

But does it plot?

In [None]:
newly_minted_zarr.hvplot(x="x", y="y", rasterize=True)

## Cleaning up downloaded files 

Now that the individual days have been combined into a single dataset, we can delete the individual files. We can use the `glob` library to get a list of all of the files in the directory and then use `os.remove` to delete them.

In [None]:
def cleanup_downloads(year):
    """Cleanup downloads and use sleep function to wait for file to be released (if needed).
    Args:
        year (int, float, str): year to cleanup
    """
    files = glob(f"./download/*{year}*")

    print(f"Number of files to delete for {year}: {len(files)} files")

    print("Starting cleanup process...")

    def cleanup(file_list):
        # iterate through files and delete
        for file in file_list:
            # check if file exists
            if Path(file).exists():
                for i in range(10):
                    try:
                        os.remove(file)
                        break
                    except OSError:
                        # the file may still be in use; wait and retry
                        sleep(1)
                        continue

        # now a bit of recursion to check if files are still there
        file_list_updated = glob(f"./download/*{year}*")
        if len(file_list_updated) > 0:
            cleanup(file_list_updated)
        else:
            print("Cleanup complete.")

    cleanup(files)

In [None]:
cleanup_downloads(year_str)
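
If a file can never be released, a recursive cleanup would keep calling itself. As an alternative sketch, the loop below gives each file a bounded number of attempts and returns whatever could not be deleted, so the caller can decide what to do next. The function name `remove_with_retry` and its defaults are assumptions for illustration, and `Path.unlink(missing_ok=True)` requires Python 3.8 or later.

```python
import time
from pathlib import Path


def remove_with_retry(paths, attempts=10, wait=1.0):
    """Delete each path, retrying on OS errors; return paths that remain."""
    leftover = []
    for path in map(Path, paths):
        for _ in range(attempts):
            try:
                # missing_ok avoids an error if the file is already gone
                path.unlink(missing_ok=True)
                break
            except OSError:
                # the file may still be in use; wait before the next attempt
                time.sleep(wait)
        else:
            # every attempt failed, so report this path to the caller
            leftover.append(path)
    return leftover
```

Returning the leftovers keeps the function free of unbounded recursion and makes failures visible instead of silent.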

## Add to the zarr file

Let's add the next year, 1989, to our zarr file. We will go through the same process as above, reusing the functions we wrote. Just a reminder that the download script takes a couple of minutes to run, so feel free to grab a coffee, eat a donut, and/or do some push-ups.

In [None]:
%run -i ./async_prism_download.py

Set our new variables...

In [None]:
# set up variables
year_str = '1989'
var = "ppt"

# create date list
prism_date = date_range_list(year_str)

...create the list of DataArrays...

In [None]:
pr_da_list = create_da_list(year_str)

...create the DataSet...

In [None]:
pr_ds = process_list_datarrays(pr_da_list, prism_date, var)

...and add it to the zarr file.

In [None]:
# create list of attrs from pr_da_list[0]
attrs_list = list(pr_da_list[0].attrs.keys())[-3:]

# create dict of attrs
attrs = {k: pr_da_list[0].attrs[k] for k in attrs_list}

# add attrs to pr_ds
pr_ds.attrs = attrs

# create chunk dict
# time is a single chunk spanning the whole year; x and y are split into 281 x 207 tiles
chunk_dict = {'time': pr_ds.sizes['time'], 'x': 281, 'y': 207}

# rechunk
pr_ds_rechunk = pr_ds.chunk(chunk_dict)

# if the zarr file exists, append to it along the time dimension
if zarr_path.exists():
    pr_ds_rechunk.to_zarr(zarr_path, append_dim="time")
    print("Appending to existing zarr file.")
else:
    pr_ds_rechunk.to_zarr(zarr_path)
    print("Creating new zarr file.")

Let's check out the zarr file!

In [None]:
newer_newly_minted_zarr = xr.open_dataset(zarr_path, decode_coords="all", engine="zarr")
newer_newly_minted_zarr

And plot it!

In [None]:
newer_newly_minted_zarr.hvplot(x="x", y="y", rasterize=True)

Finally, there is just that little matter of cleaning up the downloaded files.

In [None]:
cleanup_downloads(year_str)

## Adding another variable

Let's say we want to work with another variable, this time `tmax`. Let's download the data for 1988 and 1989 and process it like we did for `ppt`.

In [None]:
# 1988 tmax
%run -i ./async_prism_download.py

In [None]:
# 1989 tmax
%run -i ./async_prism_download.py

Now create the zarr file for tmax

In [None]:
year_str_list = ["1988", "1989"]

var = "tmax"

zarr_path_base = Path("./zarr/")

zarr_filename = var + ".zarr"

zarr_path = zarr_path_base / zarr_filename

for year_str in year_str_list:

    # create date list first so the check below uses the current year
    prism_date = date_range_list(year_str)

    if zarr_path.exists():
        var_zarr = xr.open_zarr(zarr_path)
        if len(var_zarr.sel(time=slice(prism_date[0], prism_date[-1])).coords["time"]) > 0:
            print(f"Year already exists in dataset for variable {var}. Proceed to another year or variable.")
        else:
            print("Proceed with workflow.")
    else:
        print("Zarr file does not exist. Proceed with workflow.")

    pr_da_list = create_da_list(year_str)
    pr_ds = process_list_datarrays(pr_da_list, prism_date, var)
    # create list of attrs from pr_da_list[0]
    attrs_list = list(pr_da_list[0].attrs.keys())[-3:]

    # create dict of attrs
    attrs = {k: pr_da_list[0].attrs[k] for k in attrs_list}

    # add attrs to pr_ds
    pr_ds.attrs = attrs

    # create chunk dict
    # time is a single chunk spanning the whole year; x and y are split into 281 x 207 tiles
    chunk_dict = {'time': pr_ds.sizes['time'], 'x': 281, 'y': 207}

    # rechunk
    pr_ds_rechunk = pr_ds.chunk(chunk_dict)

    # if the zarr file exists, append to it along the time dimension
    if zarr_path.exists():
        pr_ds_rechunk.to_zarr(zarr_path, append_dim="time")
        print("Appending to existing zarr file.")
    else:
        pr_ds_rechunk.to_zarr(zarr_path)
        print("Creating new zarr file.")

Open up the tmax zarr file and check it out!

In [None]:
tmax_zarr = xr.open_dataset(zarr_path, decode_coords="all", engine="zarr")
tmax_zarr

## Now, can we combine ppt and tmax into a single zarr file?

In [None]:
ppt_zarr = xr.open_dataset("./zarr/ppt.zarr", decode_coords="all", engine="zarr")
ppt_zarr

Now create a single `xarray.Dataset`.

In [None]:
prism_zarr = xr.merge([tmax_zarr, ppt_zarr])
prism_zarr
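
As a small illustration of what `xr.merge` does here, the sketch below combines two toy datasets that share the same `time`, `y`, and `x` coordinates. The values and grid sizes are invented.

```python
import numpy as np
import pandas as pd
import xarray as xr

# shared coordinates for both toy datasets
times = pd.date_range("1988-01-01", periods=2)
coords = {"time": times, "y": [1.0, 0.0], "x": [0.0, 1.0]}
dims = ("time", "y", "x")

toy_ppt = xr.Dataset({"ppt": (dims, np.ones((2, 2, 2)))}, coords=coords)
toy_tmax = xr.Dataset({"tmax": (dims, np.full((2, 2, 2), 20.0))}, coords=coords)

# merge aligns the datasets on their coordinates and keeps both variables
toy_merged = xr.merge([toy_tmax, toy_ppt])

if __name__ == "__main__":
    print(sorted(toy_merged.data_vars))
```

Because both inputs share identical coordinates, the result has the same dimensions as each input and simply carries both data variables.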

And save out to a combined zarr file.

In [None]:
prism_zarr.to_zarr("./zarr/prism.zarr")

Mandatory file cleanup

In [None]:
for yr in year_str_list:
    cleanup_downloads(yr)

And shut down the `dask` cluster

In [None]:
cluster.close()

## Wrapping Up

So, let's recap what we did in this notebook:
1. Used async/await to download daily data from the PRISM HTTP server for multiple years
2. Used `rioxarray` and `dask` to read the BIL files into a single `xarray.Dataset`
3. Used `xarray` to save the `xarray.Dataset` to a zarr file
4. Used `glob` and `os` to delete the downloaded files
5. Created a single zarr file for multiple variables

This workflow could easily be adapted to:
* Download data for multiple variables for the same year, combine them all into a single `xarray.Dataset`, and save them to a single zarr file
* Download data for multiple variables for multiple years, combine them all into a single `xarray.Dataset`, and save them to a single zarr file