# Cloud Computing with JupyterHub and M²LInES

James Munroe, [2i2c](https://2i2c.org) Product and Community Lead

# Welcome and Introductions

Goals:
- Guided Tour of 2i2c's Managed JupyterHub
- Git / GitHub Workflow for Collaborating in Research
- Data on the Hub and in the Cloud
- Machine Learning Possibilities

Audience assumptions:
- Previous experience with Python and Jupyter Notebooks
- Working with Machine Learning (e.g. scikit-learn or PyTorch)
- Developing new advances in Earth Systems Modelling

## Jupyter Notebooks vs JupyterHub

Computational notebooks, such as Jupyter Notebooks (`.ipynb`), are ushering in a new wave of interactive, collaborative science. 

**Erdmann, C., S. Stall, B. Hanson, L. Lyon, B. Sedora, M. Giampoala, and M. Ricci (2022), [Notebooks Now! elevating computational notebooks](https://doi.org/10.1029/2022EO225024), Eos, 103, (18 Aug 2022):**

> - Researchers are increasingly using computational notebooks to share workflows and data analyses with others.
> - Research computing services often highlight their support of notebooks as a method to interact with their services and facilitate collaboration and sharing. 
> - As a result, notebooks are fast becoming ubiquitous in research workflows.
> - Notebooks fuse research narratives with data, visualization, and executable code to create an interactive experience.

Jupyter Notebook = Text + Code + Output + Plots + More

<img src="https://jupyter.org/assets/homepage/main-logo.svg" style="display:block; margin-left: auto; margin-right: auto;width:10%"/>

[Project Jupyter](https://jupyter.org/) supports an ecosystem of related products: Jupyter Notebooks, JupyterLab, JupyterHub


<img src="https://jupyter.org/assets/homepage/hublogo.svg" style="display:block; margin-left: auto; margin-right: auto;width:20%"/>

JupyterHub is a multi-user version of the notebook designed for companies, classrooms and research labs.



## Where is the Hub?


The JupyterHub is available at

https://m2lines.2i2c.cloud

Open infrastructure: All of the code to manage this cloud-based JupyterHub is available on GitHub (minus security keys/tokens). 2i2c supports the *Right to Replicate*: if at some point in the future you want to spin up your own hub, you are completely welcome to do so. See our infrastructure repo on [GitHub](https://github.com/2i2c-org/infrastructure/tree/master/config/clusters/m2lines).

Some technical details:
- Google Cloud Platform
- Region: us-central1 (Council Bluffs, Iowa), 115,000 sq. ft., 97% Carbon Free Energy
- N1 machine series: General-purpose machine series available on Skylake, Broadwell, Haswell, Sandy Bridge, and Ivy Bridge CPU platforms.
 - Support up to 96 vCPUs and 624 GB of memory
 - Optional: NVIDIA Tesla K80 GPUs (4992 CUDA cores)
 
<img src="https://lh3.googleusercontent.com/_T5y2eKUusOWBn44MkgTDc1EQVsiGkvWXDDgbNZxeKOp1aHKYpIMS56JhU3esg6F_V6sbmGmxmThuk5ugETygfPdv2ssbVRjHD3fcw=w1200-l100-sg-rj-c0xffffff" style="display:block; margin-left: auto; margin-right: auto;width:50%"/>

**Activity: Start your own Jupyter server**
- Open a web browser: https://m2lines.2i2c.cloud
- Click on "Log in to continue"
- Authentication is handled by membership in the m2lines GitHub organization
- Select the **Huge** instance to start your own, private, Jupyter server instance
- While this is getting started, let's chat about what is occurring behind the scenes:
  - Kubernetes: container orchestration
  - Containers vs Virtual Machines
  - 1. Auto-scaling cluster: if a node of the right size isn't available, it will be provisioned (takes a moment)
  - 2. Download container image (https://github.com/pangeo-data/pangeo-docker-images pangeo/pangeo-notebook+2022.06.02)
  - 3. Start Jupyter Server

<img src="JupyterHubEventLog.png" width=80%/>

### [2i2c's Shared Responsibility Model](https://docs.2i2c.org/en/latest/about/service/shared-responsibility.html)

2i2c shares responsibility for each hub with the communities we serve. We do this by defining the responsibilities that are a good fit for the skills and goals of each organization. This “Shared Responsibility Model” is a useful way to understand what actions communities are still expected to perform under a service agreement with 2i2c.

<img src="https://drive.google.com/uc?export=download&id=1SIhHrzPXSFBZ0yyVpxHm0WYs63k0SBRQ"/>

An overview of some categories of shared responsibility between the [Cloud Engineering Team](https://docs.2i2c.org/en/latest/about/service/team.html#term-Cloud-Engineering-Team) and the [Community Leadership Team](https://docs.2i2c.org/en/latest/about/service/team.html#term-Community-Leadership-Team).

<img src="https://drive.google.com/uc?export=download&id=1S6Y9TQcXXLkrGrhgXQc7kLzq7dxcuw9a"/>

An overview of some categories of shared responsibility between the [Community Support Team](https://docs.2i2c.org/en/latest/about/service/team.html#term-Community-Support-Team) and the [Community Leadership Team](https://docs.2i2c.org/en/latest/about/service/team.html#term-Community-Leadership-Team).

Think of **2i2c** as your data engineering and cloud infrastructure focused team member so you can concentrate on doing the science!

## Introduction to Jupyter and Git

Some suggestions made to me by Ryan Abernathey:

- Explain how the 2i2c service works, not teach them everything they might need to know to do their science. Spend time on things that seem obvious to you, like
  - starting and stopping your server
  - selecting a machine type
  - opening and closing notebooks
- Spend A LOT of time on the git / GitHub workflow. A good goal would be to make sure that all participants are able to push to GitHub via `gh-scoped-creds`
- Make them understand how the environment works, how it evolves, and how to customize it via pip / conda installs

Reference: https://github.com/fperez/demo-jupyter-git


### JupyterHub Activity: Starting and Stopping the Server

- You can stop your server using the Hub Control Panel https://m2lines.2i2c.cloud/hub/home
- Jupyter server stays running while you are not active
  - Independent of web browser. You can shift from a workstation to a laptop and keep on working.
- Python kernels are 'culled' after 1 hour of no activity
- Jupyter servers are not culled. Please shut down your server to conserve resources.
  - But this can be changed as needed.

### Git Activity: Fork and clone a repo

This tutorial is available on GitHub at https://github.com/jmunroe/2i2c-m2lines

1. Fork this repo into your own GitHub account
2. Clone your forked repo in the JupyterHub (address will be something like https://github.com/GITHUB_USERID/2i2c-m2lines )
3. Add a new file and push back to GitHub (use `gh-scoped-creds`)

Discuss Git/GitHub workflows.

### Conda Activity: Explore L96 Model Notebooks

JupyterBook: [Learning Machine Learning with Lorenz-96](https://m2lines.github.io/L96_demo)

GitHub repo: https://github.com/m2lines/L96_demo

Example: `Neural_network_for_Lorenz96.ipynb` 

How to get `torch` (or any other library) installed if it is not there?
- Option 1: Install the exact same environment that was used before
    - `conda create --prefix L96M2lines --file conda-linux-64.lock`
- Option 2: Use a container image that has PyTorch already installed
- Option 3: Manage your own environment
  - `conda env create --prefix myenv`
  - Also works with `pip install`

Managing your own environment is also great for testing packages at the "bleeding edge."
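
Before choosing one of the options above, it can help to confirm whether the library is already importable in the running environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this example and is not part of the hub's tooling.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

# "json" ships with Python; "torch" may or may not be in the current image.
for name in ("json", "torch"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If `torch` is reported as missing, one of the three options above is needed before the notebook will run.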

*Careful* `/home/jovyan` is limited to 10 GB.  It's designed for notebooks, analysis scripts, and small datasets (< 1GB).  Trying to manage multiple large conda environments could lead to storage issues.
- Solution: Create a community image with the packages, software, and libraries your community uses.

# [Files and Data in the Cloud](https://docs.2i2c.org/en/latest/user/storage.html)

- Use your home directory to store code, notebooks, and small data files (<1 GB) for personal use
- Use cloud object storage to store larger datasets and to share data across your team
- Consider whether your project would benefit from other cloud-native data storage solutions such as a database, data warehouse, or data lake

### The JupyterHub Filesystem

When you start a Jupyter server on the Hub, it is effectively a private Linux 'virtual machine'

To move files to and from JupyterHub:
- Drag and drop a file onto the file browser to upload
- Right-click to download back out

Terminal
- You can ssh/scp/ftp to a remote system
- However, you can't ssh in!


#### Your Home Directory

Your username is `jovyan`, and your home directory is `/home/jovyan`. This is the same for all users, but no one else can see or access the files in your home directory.

`/home/jovyan` is a persistent network-attached drive. Any files you put there will be there when you log out and log back into the JupyterHub.

The `/home/jovyan` space is typically limited to 10 GB. Consequently, your home directory is intended only for notebooks, analysis scripts, and small datasets (< 1 GB). It is not an appropriate place to store large datasets.

#### The `shared` Directory
All users have a directory called `shared` in their home directory. This is a read-only directory: anybody on the hub can access and read from the `shared` directory. The hub administrator may choose to distribute shared materials via this directory. The `shared` directory is not intended as a way for hub users to share data with each other.

#### The `/tmp` Directory
Any directory outside of `/home/jovyan` is ephemeral on Cloud-hosted JupyterHubs. This means if you add data or scripts under a writeable directory like `/tmp/myfile.txt` it will not be there when you log out and log back in.

Nevertheless, `/tmp` is a convenient location for storing data temporarily because it is a fast SSD drive. The space available depends on your server but will generally be much larger than `/home/jovyan` (50-100s of GB).

You can use the full path in your code or add a symlink from your home directory: `ln -s /tmp ~/tmp`
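
To see how much room each location actually has on your server, a short check like the one below can be run in a notebook. It is a sketch using only the standard library; `free_space_gb` is a helper name invented here. Note that for a network-attached home directory the number reported may describe the underlying shared volume rather than your personal quota.

```python
import os
import shutil

def free_space_gb(path: str) -> float:
    """Free space, in GB (10**9 bytes), on the filesystem that holds `path`."""
    return shutil.disk_usage(os.path.expanduser(path)).free / 1e9

# Compare the persistent home directory with the ephemeral /tmp directory.
for location in ("~", "/tmp"):
    print(f"{location}: {free_space_gb(location):.1f} GB free")
```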

### Using Git / GitHub

The recommended way to move code in and out of the hub is via git / GitHub. You should clone your project repo from the terminal and use `git pull` / `git push` to update and push changes. In order to push to GitHub from the hub, you will need to set up GitHub authentication. `gh-scoped-creds` should already be set up on your 2i2c-managed JupyterHub, and we shall use that to authenticate to GitHub for push / pull access.

Open a terminal in JupyterHub, run `gh-scoped-creds` and follow the prompts.

Alternatively, in a notebook, run the following code and follow the prompts:

```
import gh_scoped_creds
%ghscopedcreds
```

You should now be able to push to GitHub from the hub! These credentials will expire after 8 hours (or whenever your JupyterHub server stops), and you’ll have to repeat these steps to fetch a fresh set of credentials. Once you authenticate, you’ll be provided with a link to a GitHub App that you have to install on the repositories you want to be able to push to from this particular JupyterHub. You only need to do this once per JupyterHub, and can revoke access any time. You can always provide access to your own personal repositories, but might need approval from admins of GitHub organizations if you want to push to repos in that organization.

### Cloud Object Storage

Your hub lives in the cloud. The preferred way to store data in the cloud is using cloud object storage, such as Google Cloud Storage (GCS). Cloud object storage is essentially a key/value storage system. The keys are strings, and the values are bytes of data. Data is read and written using HTTP calls.

The performance of object storage is very different from file storage. On one hand, each individual read / write to object storage has a high overhead (10-100 ms), since it has to go over the network. On the other hand, object storage “scales out” nearly infinitely, meaning that we can make hundreds, thousands, or millions of concurrent reads / writes. This makes object storage well suited for distributed data analytics. However, data analysis software must be adapted to take advantage of these properties.
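
The trade-off between per-request overhead and concurrency can be illustrated without touching a real bucket. The sketch below is a simulation, not a benchmark: an in-memory dictionary stands in for the bucket, and the 20 ms delay is an assumed latency chosen only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY_S = 0.02  # assumed per-request overhead, for illustration only
# In-memory stand-in for a bucket: string keys mapped to bytes values.
fake_bucket = {f"chunk-{i}": bytes([i]) * 4 for i in range(20)}

def fetch(key: str) -> bytes:
    """Simulate one object read: wait for the 'network', then return the value."""
    time.sleep(LATENCY_S)
    return fake_bucket[key]

keys = list(fake_bucket)

# One request at a time: the latencies add up.
start = time.perf_counter()
sequential = [fetch(k) for k in keys]
sequential_s = time.perf_counter() - start

# All requests in flight at once: the latencies overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(keys)) as pool:
    concurrent = list(pool.map(fetch, keys))
concurrent_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f} s, concurrent: {concurrent_s:.2f} s")
```

Both approaches return the same bytes, but the concurrent version finishes in roughly the time of a single request. Libraries built for cloud data access rely on this pattern when they read many chunks at once.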

#### Scratch Bucket
The M2LInES 2i2c environment is configured with a “scratch bucket,” which allows you to temporarily store data (for example, when you need to store intermediate files during data transformations). Credentials to write to the scratch bucket are pre-loaded into your Hub’s user environment.

*Warning*: Any data in scratch buckets will be deleted once it is 7 days old. Do not use scratch buckets to store data permanently.

The location of your scratch bucket is contained in the environment variable `SCRATCH_BUCKET`.

A common set of credentials is currently used for accessing scratch buckets. This means users can read, and potentially remove / overwrite, each others’ data. You can avoid this problem by always using `SCRATCH_BUCKET` as a prefix. Still, you should not store any sensitive or mission-critical data in the scratch bucket.
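
One way to make the prefix habit hard to forget is to build every scratch path through a small helper. This is a sketch only: `scratch_path` and the fallback value are hypothetical names introduced for this example, and the fallback exists just so the code runs outside the hub.

```python
import os
import posixpath

# Hypothetical fallback so the sketch runs outside the hub; on the hub the
# real value comes from the SCRATCH_BUCKET environment variable.
FALLBACK_PREFIX = "gcs://example-scratch/your-username"

def scratch_path(*parts: str) -> str:
    """Build an object path that always starts with the SCRATCH_BUCKET prefix."""
    prefix = os.environ.get("SCRATCH_BUCKET", FALLBACK_PREFIX)
    return posixpath.join(prefix, *parts)

print(scratch_path("experiment-01", "intermediate.zarr"))
```

Passing the result of `scratch_path(...)` to any read or write call keeps your objects under your own prefix rather than at the top level of the shared bucket.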

#### Working with Object Storage

**Case Study**: Janni Yuval and Paul O’Gorman's work on "Parameterizations in Atmospheric Models"

Why did I choose this example? 
- Code available (GitHub)
- Data available (Google Drive)
- Talk available (YouTube)
- Nice example of Open Science! 

Observations:
- Fortran model generating the "high-resolution models" on HPC
- Matlab codes for model post-processing
- Python scripts for machine learning
- Jupyter notebooks

### Abstract data access with `fsspec`

`fsspec` is a Python library that abstracts the idea of a File System for *many* different cloud storage models.

In [None]:
import fsspec

In [None]:
fsspec.available_protocols()

Google Cloud Storage - GCS

In [None]:
import os
SCRATCH_BUCKET = os.environ['SCRATCH_BUCKET']
SCRATCH_BUCKET

In [None]:
gcs = fsspec.filesystem('gcs')

With this filesystem object, you can put a file (data) in the cloud and associate it with a key.

In [None]:
gcs.put('README.md', f'{SCRATCH_BUCKET}/README.md')

But since object storage is really just a key-value mapping, this works too:

In [None]:
scratch = gcs.get_mapper(SCRATCH_BUCKET)
scratch['MyNewFile'] = b'This is a Byte String'

In [None]:
gcs.ls(SCRATCH_BUCKET)

In [None]:
scratch['README.md']

I've already taken one of Janni's high resolution atmospheric model runs (found at [this google drive](https://drive.google.com/drive/folders/1TRPDL6JkcLjgTHJL9Ib_Z4XuPyvNVIyY)) and pushed it into Google Cloud Storage.

(Briefly, my approach was to use fsspec to connect to both Google Drive and Google Cloud Storage and copy the key-value pairs in a parallelized loop.  The Google Drive support in `fsspec` is experimental -- there are likely much better ways of pushing data directly into Google Cloud Storage with specialized tools.)

In [None]:
uri = 'gcs://m2lines-scratch/jmunroe/filesqobskm12x576'

In [None]:
nc_files = gcs.ls(uri)
nc_files[:10]

In [None]:
print(f'Total number of NetCDF4 files: {len(nc_files)}')
print(f'Size of dataset: {gcs.du(uri) / 2**30:.1f} GB')

Remember this scratch bucket will only keep files for 7 days -- not for permanent storage.

#### Transfer one file from the "cloud" to the local filesystem on the hub

In [None]:
%%time
gcs.get(nc_files[0], '~/') # this will copy the file into /home/jovyan with the same basename

In [None]:
filename = os.path.basename(nc_files[0])
filename

In [None]:
!ls -lh ~/{filename}

These are pretty decently sized files (880 MB) and it took ~10 seconds to download the file. 

#### Using Xarray for data access

To read NetCDF4 files, use the `xarray` library, and use `hvplot` from the [HoloViz ecosystem](https://hvplot.holoviz.org/) for visualization.

See [xarray Tutorial](https://tutorial.xarray.dev) for additional guidance.

In [None]:
import hvplot.xarray
import xarray as xr

In [None]:
ds = xr.open_dataset(f'~/{filename}')
ds

Let's explore what we find in this *Water World*

#### Water Vapour

In [None]:
ds.Q.hvplot(x='x', y='y', clim=(0, 15), width=800, height=400)

Humidity highest along the warm equator?

#### Non-precipitating Condensate (Water+Ice)

In [None]:
ds.QN.hvplot(x='x', y='y',  clim=(0,0.5), width=800, height=400, cmap='blues_r')

Clouds?

#### Precipitating Water (Rain + Snow)

In [None]:
ds.QP.hvplot(x='x', y='y', clim=(0,0.2), width=800, height=400, cmap='viridis')

Weather on Water World!

#### Temperature

In [None]:
ds.TABS.hvplot(x='x', y='y', clim=(270, 310), width=800, height=600, cmap='coolwarm')

Precipitation correlated with temperature fronts?

But that's only one time-step of this 4800 timestep dataset. At even 10s/timestep, it will take

In [None]:
print(f'{4800 * 10 / 3600:.1f} hours')

Downloading the entire dataset onto this Jupyter server would take more than half a day (and 4 TB of local storage). Not the right approach when working with cloud infrastructure.
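
The same back-of-the-envelope arithmetic can cover storage as well as time. The sketch below uses only the approximate figures quoted in this notebook (4800 time steps, about 880 MB and about 10 seconds per file); `download_estimate` is a helper name made up for the example.

```python
N_TIMESTEPS = 4800      # one file per time step
FILE_SIZE_MB = 880      # approximate size of one file
SECONDS_PER_FILE = 10   # approximate download time for one file

def download_estimate(n_files: int, size_mb: float, seconds_each: float):
    """Return (total size in TB, total time in hours) for a serial download."""
    total_tb = n_files * size_mb / 1e6
    total_hours = n_files * seconds_each / 3600
    return total_tb, total_hours

total_tb, total_hours = download_estimate(N_TIMESTEPS, FILE_SIZE_MB, SECONDS_PER_FILE)
print(f"{total_tb:.1f} TB, {total_hours:.1f} hours")  # prints "4.2 TB, 13.3 hours"
```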

## [Pangeo-Forge](https://pangeo-forge.org/)

A **Big Idea** of cloud computing is to avoid downloading data to be able to analyze it. Instead, bring your analysis to where the data is located.

Pangeo Forge is a framework for creating *Analysis Ready, Cloud Optimized* (ARCO) datasets in the cloud. This is a non-trivial task, and the effort of making those conversions can be shared across several communities.

<img src="https://pangeo-forge.org/pangeo-forge-diagram.png" width=800/>


Covered in a seminar two weeks ago? How did that go?

- (Ryan A) For existing ARCO datasets, use stuff from https://pangeo-forge.org/catalog and https://catalog.pangeo.io/. Explain that Pangeo Forge is the main pathway to getting data into the cloud.

#### [Global Precipitation Climatology Project](https://pangeo-forge.org/dashboard/feedstock/42)

daily, global 1x1-deg gridded fields of precipitation totals for 1996-2021 based on merged data sources

In [None]:
store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
ds = xr.open_dataset(store, engine='zarr', chunks={})
ds

In [None]:
ds.precip.hvplot(x='longitude', y='latitude', clim=(0, 80), width=800, height=400, cmap='viridis', widget_type="scrubber",
    widget_location="bottom")

##### Compute a climatology

In [None]:
ds.precip

In [None]:
# filter out flagged data
precip_climatology = ds.precip.where((ds.precip >= 0) & (ds.precip < 1000)).mean('time')
precip_climatology

The calculation is *lazy*.  It will only be completed when needed.

In [None]:
from dask.diagnostics import ProgressBar
with ProgressBar():
    precip_climatology.load() # force the calculation to occur

Reducing 2.2 GB of remote data in ~5 seconds.

In [None]:
precip_climatology.hvplot(x='longitude', y='latitude', width=800, height=400, cmap='viridis')

But our *Water World* dataset isn't (yet?) converted to Zarr or on Pangeo-Forge. Can we still make progress?

## Kerchunk

[Kerchunk](https://fsspec.github.io/kerchunk/) is a library for abstracting out chunked, compressed data and virtually aggregating it into an Analysis Ready, Cloud-Optimized dataset.

- Not quite as good performance as Zarr
- But there is **lots** of data on data providers using archival formats (like NetCDF4)
- Avoids having to have multiple copies of the same data in different formats
- Written by the same developers as `fsspec`

Brand new Medium article (Sept 11, 2022): https://medium.com/pangeo/accessing-netcdf-and-grib-file-collections-as-cloud-native-virtual-datasets-using-kerchunk-625a2d0a9191

In [None]:
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
import ujson

In [None]:
so = dict(mode='rb', default_fill_cache=False, default_cache_type='first')

def gen_json(u):
    """Write the Kerchunk reference file for one NetCDF file; return its local path."""
    # name the reference file after the NetCDF file: <basename>.json
    fname = os.path.splitext(os.path.basename(u))[0]
    outf = f'{fname}.json'

    # skip files that have already been processed
    if not os.path.exists(outf):
        with gcs.open(u, **so) as infile:
            # compute the chunk "offsets" into the NetCDF file
            h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
            with open(outf, 'wb') as f:
                f.write(ujson.dumps(h5chunks.translate()).encode())

    return outf

In [None]:
%%time
path = nc_files[1]  # we have not downloaded this file locally
ref_json = gen_json(path) # one-time cost

Discuss what is really in this 'reference' JSON file.
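
As a starting point for that discussion, the sketch below summarizes a hand-written toy reference in the style of a Kerchunk file. It assumes the version 1 layout, where a `refs` mapping holds either inline content (small metadata stored directly as a string) or `[url, offset, length]` triplets that point at byte ranges in the original file. The bucket name, offsets, and lengths are invented for illustration, and `summarize` is a hypothetical helper.

```python
import json

# Hypothetical, hand-written example in the style of a Kerchunk reference file.
toy_reference = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "U/.zarray": json.dumps({"shape": [2, 4], "chunks": [1, 4], "dtype": "<f4"}),
        "U/0.0": ["gcs://example-bucket/file.nc", 8192, 16],
        "U/1.0": ["gcs://example-bucket/file.nc", 8208, 16],
    },
}

def summarize(reference: dict) -> dict:
    """Count inline entries and byte-range entries, and total the referenced bytes."""
    inline = 0
    byte_ranges = 0
    referenced_bytes = 0
    for value in reference["refs"].values():
        if isinstance(value, list) and len(value) == 3:
            # [url, offset, length]: the chunk stays in the original file
            byte_ranges += 1
            referenced_bytes += value[2]
        else:
            # anything else is treated as content stored inside the JSON itself
            inline += 1
    return {"inline": inline, "byte_ranges": byte_ranges, "referenced_bytes": referenced_bytes}

summary = summarize(toy_reference)
print(summary)
```

The same function could be pointed at a real reference file by loading it with `json.load`, though real files may contain entry shapes this toy version does not distinguish.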

But with this file, we can lazily open the NetCDF file stored on Google Cloud Storage *as if* it were a Zarr file by using `fsspec`'s `reference` file system:

In [None]:
%%time
ds = xr.open_dataset("reference://", engine="zarr", chunks={},
                     backend_kwargs={
                        "consolidated": False,
                        "storage_options": {"fo": ref_json, 
                                            "remote_protocol": "gcs"}
                    })
ds

Let's plot the zonal wind profile:

In [None]:
%%time
ds.U.mean('x').hvplot.line(y='y', width=400, height=600,)

#### Combine multiple kerchunk'd datasets into a single logical aggregate dataset

Generate the reference JSON files for all of the NetCDF files. Since each NetCDF file is completely independent, we can parallelize this one-time operation with Dask:

```
import glob
import dask

results = dask.compute(*[dask.delayed(gen_json)(u) for u in nc_files], retries=10)

json_list = sorted(glob.glob('*.json'))

mzz = MultiZarrToZarr(json_list, remote_protocol='gcs',
        concat_dims=['time'], identical_dims = ['x', 'y', 'z'],
    )

mzz.translate('qobskm12x576.json')
```

It took about 35 minutes to preprocess the entire 4TB archive and produce this reference file. I put this file in the `shared` folder at `/home/jovyan/shared/jmunroe/qobskm12x576.json`.

In [None]:
ref_json = '/home/jovyan/shared/jmunroe/qobskm12x576.json'

In [None]:
!ls -lh {ref_json}

This is a 442 MB index file for the entire 4 TB dataset.

In [None]:
%%time
backend_args = { "consolidated": False,
                 "storage_options": { "fo": ref_json,
                 "remote_protocol": "gcs"  }}
ds = xr.open_dataset("reference://", engine="zarr",
                     chunks={},
                     backend_kwargs=backend_args)
ds

So now we can open up the entire dataset at once (still lazy access):

In [None]:
ds.QP.hvplot(x='x', y='y', clim=(0,0.2), width=800, height=400, cmap='viridis')

Back to the **case study**: a common step is to consider the vertically integrated precipitation:

In [None]:
QP = ds.QP.sum('z')
QP

In [None]:
QP.hvplot(x='x', y='y', clim=(0, 1), width=800, height=400, cmap='viridis', widget_type="scrubber",
    widget_location="bottom")

#### Compute a "10 day" climatology

In [None]:
QP_climatology = QP.isel(time=slice(0, 80)).mean('time')
QP_climatology

In [None]:
with ProgressBar():
    QP_climatology.load()

In [None]:
QP_climatology.hvplot(x='x', y='y', clim=(0, 1), width=800, height=400, cmap='viridis')

Functional, but we could also make the effort to convert to Zarr if the vertically integrated precip was going to be needed again:

```
# Takes about 35 mins of processing
QP = QP.chunk({'time':20, 'y':1440, 'x':576}) # make chunk sizes larger. Typically, ~100MB is recommended for Zarr
QP.to_dataset().to_zarr(f'gcs://m2lines-scratch/jmunroe/qobskm12x576_QP.zarr', consolidated=True)
```
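
The chunk shape chosen above can be checked against the ~100 MB guideline with a quick calculation. This sketch assumes the values are stored as 4-byte floats (float32), which is an assumption about this dataset; with 8-byte floats the result doubles. `chunk_nbytes` is a helper name made up for the example.

```python
from math import prod

def chunk_nbytes(chunks: dict, itemsize: int) -> int:
    """Uncompressed size in bytes of one chunk with the given shape."""
    return prod(chunks.values()) * itemsize

chunks = {"time": 20, "y": 1440, "x": 576}
nbytes = chunk_nbytes(chunks, itemsize=4)  # assumes 4-byte (float32) values
print(f"{nbytes / 1e6:.0f} MB per chunk (uncompressed)")  # prints "66 MB per chunk (uncompressed)"
```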

In [None]:
%%time
ds = xr.open_zarr(f'gcs://m2lines-scratch/jmunroe/qobskm12x576_QP.zarr', consolidated=True)
ds

In [None]:
ds.QP.hvplot(x='x', y='y', clim=(0,1), width=800, height=400, cmap='viridis',  widget_type="scrubber",
    widget_location="bottom",)

We can now interactively compute the climatology over the entire 4800-time-step sequence.

In [None]:
with ProgressBar():
    QP_mean = ds.QP.mean('time').compute()   # above we used .load() to replace a lazy array with its calculated version
                                             # we can also use .compute() to force computation

And we can plot both the mean and the zonal average

In [None]:
(
    QP_mean.hvplot(x='x', y='y', clim=(0,1), width=800, height=400, cmap='viridis') + 
     QP_mean.mean('x').hvplot()
).cols(1)

Continuing, we may want to compare this 'high-res' model (*truth*) to some coarsened representation (*observation*).

In [None]:
QP_coarse = ds.QP.coarsen(x=16, y=16).mean()

with ProgressBar():
    QP_coarse.load()
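
Conceptually, coarsening followed by a mean is block averaging: each group of adjacent grid cells is replaced by its average. The sketch below shows the idea in one dimension with plain Python lists. It is an illustration of the technique, not xarray's implementation, and `coarsen_mean` is a name invented here. Like xarray's default as I understand it, it refuses input whose length is not a multiple of the window.

```python
def coarsen_mean(values, window):
    """Replace each run of `window` consecutive values by its mean."""
    if len(values) % window != 0:
        raise ValueError("length must be a multiple of the window")
    return [
        sum(values[i:i + window]) / window
        for i in range(0, len(values), window)
    ]

fine = [1.0, 3.0, 5.0, 7.0, 2.0, 2.0, 4.0, 4.0]
coarse = coarsen_mean(fine, window=4)
print(coarse)  # prints [4.0, 3.0]
```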

In [None]:
options = dict(x='x', y='y', clim=(0,1), width=800, height=400, cmap='viridis')
(
    ds.QP.hvplot(**options) + 
    QP_coarse.hvplot(**options)
).cols(1)

# Next Steps

Where to get more information:
- Join the Pangeo Discourse [https://discourse.pangeo.io/](https://discourse.pangeo.io/) 
- GitHub issues are good for package-specific topics, but Discourse can cover science-domain, multi-package problems

What are your pain points? (computational or community related)

How might infrastructure for interactive cloud computing improve to make your research more **impactful**, **accessible**, and **delightful**?