# Cloud Data Transfer Speeds Benchmarking Workflow

Add overview of workflow

## Step 0: Load Required Setup Packages & Classes

Enter the following parameters to install packages to the correct conda instance and environment.


`jupyter_conda_path : str`
    - The path to the miniconda installation that this Jupyter notebook is running from. Do not include a trailing `/`.
    
`jupyter_conda_env : str`
    - The conda environment that this Jupyter notebook is running in

In [5]:
jupyter_conda_path = "/home/jgreen/.miniconda3c"
jupyter_conda_env = "jupyter"
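If you are unsure whether a path ends in `/`, a small helper can normalize it before it is used. This is a minimal sketch, not part of the workflow; `normalize_conda_path` and the example path are hypothetical.

```python
import os


def normalize_conda_path(path: str) -> str:
    """Expand '~' and drop trailing '/' characters from a conda install path."""
    expanded = os.path.expanduser(path.strip())
    # Keep a bare root ("/") intact; otherwise strip trailing separators.
    return expanded.rstrip("/") or "/"


# Hypothetical example path, not the one used in this notebook.
example_path = normalize_conda_path("/opt/miniconda3/")
print(example_path)
```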

The following cell installs the required workflow setup packages and loads the UI generation script. If any of the packages are missing from the specified environment, they will be installed for you. Note that if installation is required, this cell will take a few minutes to complete.

**NOTE: If you receive an import error for `jupyter-ui-poll`, you will have to manually install the package in a user container terminal with the following commands:**
```
source <jupyter_conda_path>/etc/profile.d/conda.sh
conda activate <jupyter_conda_env>
pip install jupyter-ui-poll
```

In [6]:
import os
import json
import sys

print('Checking conda environment for UI dependencies...')
os.system("bash " + os.getcwd() + f"/jupyter-helpers/install_ui_packages.sh {jupyter_conda_path} {jupyter_conda_env}")
print('All dependencies installed.')

sys.path.insert(0, os.getcwd() + '/jupyter-helpers')
import ui_helpers as ui
import pandas as pd

Checking conda environment for UI dependencies...
All dependencies installed.
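Because `os.system` only returns the exit status, the cell above prints its success message even if the install script fails. One possible alternative, sketched here under the assumption that you want a failure to stop the notebook, is to run the script through `subprocess.run` and check the return code. `run_checked` is a hypothetical helper, and a harmless stand-in command is used so the sketch runs anywhere.

```python
import subprocess
import sys


def run_checked(command: list[str]) -> None:
    """Run a command and raise RuntimeError if it exits with a non-zero status."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{command[0]} exited with status {result.returncode}:\n{result.stderr}"
        )


# Stand-in command. In this notebook the list would instead be something like
# ["bash", "jupyter-helpers/install_ui_packages.sh", jupyter_conda_path, jupyter_conda_env].
run_checked([sys.executable, "-c", "print('ok')"])
```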


## Step 1: Define Workflow Inputs

Run the following cells to generate interactive widgets allowing you to enter all workflow inputs. **All inputs must be filled out to proceed with the benchmarking process.**

### Cloud Resources

#### Compute Resources

Before anything else, define the resources you intend to benchmark. Currently, only resources defined in the Parallel Works platform may be used. The options below are passed to Dask: you control how many cores and how much memory each worker in your cluster uses, as well as how many nodes are active at one time.

These options are included so that you can make fair comparisons between different cloud service providers (CSPs). Different CSPs generally do not offer worker nodes with exactly the same specs, so to achieve a fair comparison between two CSPs, one cluster must be limited so that it does not exceed the computational power of the other.
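One simple way to pick comparable worker settings is to take the smallest core count and the smallest memory size across the node types being compared. The sketch below illustrates that arithmetic only; `common_worker_spec` and the node specs are hypothetical, and memory is assumed to be in GB.

```python
def common_worker_spec(node_specs: list[dict]) -> dict:
    """Return the largest worker spec that every listed node type can satisfy."""
    return {
        "CPUs": min(spec["CPUs"] for spec in node_specs),
        "Memory": min(spec["Memory"] for spec in node_specs),
    }


# Hypothetical node types on two providers (memory assumed to be in GB).
node_specs = [
    {"CSP": "GCP", "CPUs": 2, "Memory": 16.0},
    {"CSP": "AWS", "CPUs": 4, "Memory": 8.0},
]
limits = common_worker_spec(node_specs)
print(limits)
```

Entering the resulting values for every resource keeps any one cluster from using more cores or memory per worker than the others.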

<div class="alert alert-block alert-info">
For resources controlled from the Parallel Works platform, the <code>Resource name</code> box should be populated with the name found on the <b>RESOURCES</b>  tab.
    </div>

In [7]:
resource = ui.resourceWidgets()
resource.display()
resources = resource.processInput()
print(f'Your resource inputs:\n {resources}')

HBox(children=(Button(description='Add field', style=ButtonStyle()), Button(description='Remove field', style=…

Accordion(children=(VBox(children=(HBox(children=(Label(value='Resource Name: '), Text(value=''))), HBox(child…

Button(description='Submit', style=ButtonStyle())


-----------------------------------------------------------------------------
If you wish to change information about cloud resources, run this cell again.

Your resource inputs:
 [{'Name': 'gcptestnew', 'CSP': 'GCP', 'Dask': {'Scheduler': 'SLURM', 'Partition': 'compute', 'CPUs': 2, 'Memory': 16.0}}]
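To give a sense of how the `Dask` entry might be consumed, the sketch below turns it into keyword arguments of the kind accepted by `dask_jobqueue.SLURMCluster` (`queue`, `cores`, `memory`). This mapping is an assumption for illustration and is not taken from the workflow's own scripts; `slurm_cluster_kwargs` is a hypothetical helper and `Memory` is assumed to be in GB.

```python
def slurm_cluster_kwargs(resource: dict) -> dict:
    """Build SLURMCluster-style keyword arguments from one resource entry."""
    dask_options = resource["Dask"]
    if dask_options["Scheduler"] != "SLURM":
        raise ValueError(f"Unsupported scheduler: {dask_options['Scheduler']}")
    return {
        "queue": dask_options["Partition"],           # SLURM partition name
        "cores": int(dask_options["CPUs"]),           # cores per worker job
        "memory": f"{dask_options['Memory']:g}GB",    # assumes the value is in GB
    }


resource = {
    "Name": "gcptestnew",
    "CSP": "GCP",
    "Dask": {"Scheduler": "SLURM", "Partition": "compute", "CPUs": 2, "Memory": 16.0},
}
kwargs = slurm_cluster_kwargs(resource)
print(kwargs)
# With dask-jobqueue installed on the cluster, these could be passed as
# SLURMCluster(**kwargs); that call is omitted so the sketch runs anywhere.
```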


#### Object Stores

This set of inputs is where you enter the cloud object store Uniform Resource Identifiers (URIs). Both public and private buckets are supported. For the latter, ensure that you have access credentials with *at least* read, write, list, and put (copy from local storage to cloud) permissions, as format conversions will need to be made during the benchmarking process.
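As an illustration of what a storage URI contains, the sketch below splits one into its protocol, bucket, and object path using only the standard library. It also shows one way to keep secrets out of the notebook by reading them from the standard AWS environment variables; `split_store_uri` is a hypothetical helper, and the example bucket and file names are placeholders.

```python
import os
from urllib.parse import urlsplit


def split_store_uri(uri: str) -> tuple[str, str, str]:
    """Split an object store URI into (protocol, bucket, object path)."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path.lstrip("/")


protocol, bucket, key = split_store_uri("s3://example-bucket/data/file.csv")
print(protocol, bucket, key)

# Read credentials from the environment instead of typing them into a cell.
# The values are None when the variables are not set.
s3_credentials = {
    "anon": False,
    "key": os.environ.get("AWS_ACCESS_KEY_ID"),
    "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
}
```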

In [8]:
store = ui.storageWidgets()
store.display()
storage = store.processInput()
print(f'Your storage inputs:\n {storage}')

HBox(children=(Button(description='Add field', style=ButtonStyle()), Button(description='Remove field', style=…

Accordion(children=(VBox(children=(HBox(children=(Label(value='Storage URI: '), Text(value='', placeholder='gc…

Button(description='Submit', style=ButtonStyle())


--------------------------------------------------------------------------------------
If you wish to change information about cloud storage locations, run this cell again.

Your storage inputs:
 [{'Path': 'gs://cloud-data-benchmarks', 'Type': 'Private', 'CSP': 'GCP', 'Credentials': {'token': './.cloud-data-benchmarks.json'}}, {'Path': 's3://cloud-data-benchmarks', 'Type': 'Private', 'CSP': 'AWS', 'Credentials': {'anon': False, 'key': '<redacted>', 'secret': '<redacted>'}}]


### Datasets

#### User-Supplied Datasets

Below you can specify the datasets you want to test in the benchmarking. Each dataset may be a single file or multiple files, but it must match at least one of the supported formats. **Read the following input rules after running the UI cell below this one.**


1. Activate the checkbox if you desire to record your user-defined datasets. If it is not checked, none of your inputs will be recorded.

<br>

2. Input the full URI or absolute path of the data location (`<URI prefix>://bucket-name/path/to/file.extension` or `/path/to/file.extension`).
    - Use globstrings (`<URI prefix>://bucket-name/path/to/files/*` or `/path/to/files/*`) to specify datasets that are split up into multiple subfiles.
    - If using a globstring, ensure that *only* files that belong to the dataset exist in that directory. The workflow will take all files in the directory before the `*` and attempt to gather them into a single dataset.
    - **Globstrings are NOT supported for NetCDF files**
    
<br>

3. If you have a dataset stored in multiple cloud storage locations that will be used in the benchmarking, you must input the full URI of that dataset for each of these locations. That is, you must define each location of the data separately.
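The globstring rules above can be illustrated with a short sketch. It is not the workflow's own validation code: `check_dataset_path` and `files_in_dataset` are hypothetical helpers, and the bucket listing is made up. The example shows why unrelated files in the same directory are a problem, since a trailing `*` matches them too.

```python
from fnmatch import fnmatchcase


def check_dataset_path(data_format: str, path: str) -> None:
    """Reject globstrings for NetCDF inputs, mirroring the rule above."""
    if "*" in path and data_format.lower().startswith("netcdf"):
        raise ValueError("Globstrings are not supported for NetCDF files")


def files_in_dataset(globstring: str, listing: list[str]) -> list[str]:
    """Return every listed object that the globstring would pull into the dataset."""
    return [name for name in listing if fnmatchcase(name, globstring)]


# Made-up bucket listing for illustration.
listing = [
    "gs://example-bucket/data/part-0.csv",
    "gs://example-bucket/data/part-1.csv",
    "gs://example-bucket/data/README.txt",
    "gs://example-bucket/other/part-0.csv",
]
matched = files_in_dataset("gs://example-bucket/data/*", listing)
print(matched)  # includes README.txt, which does not belong to the dataset

check_dataset_path("CSV", "gs://example-bucket/data/*")  # passes silently
```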

In [10]:
userdata = ui.userdataWidgets(storage=storage)
userdata.display()
user_files = userdata.processInput()
print(f'Your data inputs:\n {user_files}')

Checkbox(value=False, description='Provide datasets to workflow?')

HBox(children=(Button(description='Add field', style=ButtonStyle()), Button(description='Remove field', style=…

Accordion(children=(VBox(children=(HBox(children=(Label(value='Data Format'), Dropdown(options=('CSV', 'NetCDF…

Button(description='Submit', style=ButtonStyle())


---------------------------------------------------------------------------
If you wish to change information about your input data, run this cell again.

Your data inputs:
 [{'Format': 'NetCDF4', 'SourcePath': 'gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.nc', 'Type': 'Private', 'CSP': 'GCP', 'Credentials': {'token': './.cloud-data-benchmarks.json'}}, {'Format': 'NetCDF4', 'SourcePath': 's3://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.nc', 'Type': 'Private', 'CSP': 'AWS', 'Credentials': {'anon': False, 'key': '<redacted>', 'secret': '<redacted>'}}]


#### Randomly-Generated Datasets

Another way to supply data to the benchmarking is to create randomly-generated datasets. Because these datasets are written in parallel, they can be made quite large, and they are a good option if you are new to cloud-native data formats. Two randomly-generated data formats are currently supported: CSV and NetCDF4. Since NetCDF4 is a gridded data format, an option to specify the number of coordinate axes is also included.

<div class="alert alert-block alert-info">
Randomly-generated NetCDF4 file sizes are limited by the available disk space in the cluster that generates the file. Ensure that you have adequate disk space in your cluster, or the file will not fully generate.
    </div>
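To get a feel for how a target size relates to grid dimensions, the sketch below estimates the axis length of a gridded dataset of a given size. It is a rough estimate only, under assumptions that are not taken from the workflow: every value is a 64-bit float, all coordinate axes have equal length, 1 GB is 10^9 bytes, and coordinate arrays, metadata, and compression are ignored. `grid_side_length` is a hypothetical helper.

```python
def grid_side_length(size_gb: float, n_variables: int, n_coords: int,
                     bytes_per_value: int = 8) -> int:
    """Largest equal axis length whose grid fits in the requested size."""
    total_values = int(size_gb * 1e9) // (n_variables * bytes_per_value)
    side = int(round(total_values ** (1.0 / n_coords)))
    # Correct for floating-point error in the root.
    while side ** n_coords > total_values:
        side -= 1
    while (side + 1) ** n_coords <= total_values:
        side += 1
    return side


# Options matching the example output above: 10 GB, 2 data variables, 3 coordinates.
side = grid_side_length(10.0, n_variables=2, n_coords=3)
print(side, "points per axis")
```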

In [11]:
randgen = ui.randgenWidgets(resources=resources)
randgen.display()
randfiles = randgen.processInput()
print(f'Your randomly-generated file options:\n {randfiles}')

Accordion(children=(HBox(children=(Label(value='Resource to write random files with: '), Dropdown(options=('gc…

Button(description='Submit', style=ButtonStyle())


-------------------------------------------------------------------------------
If you wish to change the randomly generated file options, run this cell again.

Your randomly-generated file options:
 [{'Format': 'CSV', 'Generate': False, 'SizeGB': 0.0}, {'Format': 'NetCDF4', 'Generate': True, 'SizeGB': 10.0, 'Data Variables': 2.0, 'Float Coords': 3.0}, {'Resource': 'gcptestnew'}]


### TODO: Cloud-Native Format Conversion Options

**Feature not ready**

## Step 2: Notebook Setup

Executing the following cell will write all of your inputs to `inputs.json`, install miniconda3 and the "cloud-data" Python environment to all resources, and write randomly-generated files to all cloud storage locations (if any files were specified). If writing randomly-generated files, especially large ones, the execution of this cell may take a while.

<div class="alert alert-block alert-info">
While randomly-generated files are written in parallel by default, if you wish to speed up the execution of this cell, consider creating/choosing a resource with more powerful worker nodes.
    </div>
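The `inputs.json` file written in this step has four top-level sections: `RESOURCES`, `STORAGE`, `USERFILES`, and `RANDFILES`. If you want to inspect or sanity-check the file afterwards, a minimal sketch such as the following could be used. `load_inputs` is a hypothetical helper, and the demonstration writes a placeholder file to a temporary directory so that it does not touch your real inputs.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("RESOURCES", "STORAGE", "USERFILES", "RANDFILES")


def load_inputs(path: str = "inputs.json") -> dict:
    """Load the workflow inputs and confirm all four sections are present."""
    with open(path) as infile:
        inputs = json.load(infile)
    missing = [key for key in REQUIRED_KEYS if key not in inputs]
    if missing:
        raise KeyError(f"{path} is missing sections: {missing}")
    return inputs


# Demonstration with a placeholder file containing empty sections.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "inputs.json")
    with open(demo_path, "w") as outfile:
        json.dump({key: [] for key in REQUIRED_KEYS}, outfile)
    demo_inputs = load_inputs(demo_path)

print(sorted(demo_inputs))
```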

In [None]:
print('Setting up workflow...')

user_input = json.dumps({"RESOURCES" : resources,
                         "STORAGE" : storage,
                         "USERFILES" : user_files,
                         "RANDFILES" : randfiles
                        })

with open('inputs.json', 'w') as outfile:
    outfile.write(user_input)

os.system("bash workflow_notebook_setup.sh")

print('Workflow setup complete.')

Setting up workflow...
Will install miniconda3 to "/home/jgreen/.miniconda3"
Installing Miniconda-latest on "gcptestnew"...
Miniconda is already installed in "/home/jgreen/.miniconda3"!
Finished installing Miniconda on "gcptestnew".

Building "cloud-data" environment on "gcptestnew"...
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /home/jgreen/.miniconda3/envs/cloud-data

  added / updated specs:
    - ipython
    - python


The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main 
  _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 
  asttokens          pkgs/main/noarch::asttokens-2.0.5-pyhd3eb1b0_0 
  backcall           pkgs/main/noarch::backcall-0.2.0-pyhd3eb1b0_0 
  bzip2              pkgs/main/linux-64::bzip2-1.0.8-h7b6447c_0 
  ca-certificates    pkgs/main/linux-64::ca-certificates-2023.05.30-h06a4308_0 




  current version: 23.5.2
  latest version: 23.7.2

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.7.2




done
Verifying transaction: ...working... done
Executing transaction: ...working... done
#
# To activate this environment, use
#
#     $ conda activate cloud-data
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /home/jgreen/.miniconda3/envs/cloud-data

  added / updated specs:
    - dask


The following NEW packages will be INSTALLED:

  abseil-cpp         conda-forge/linux-64::abseil-cpp-20211102.0-h27087fc_1 
  arrow-cpp          pkgs/main/linux-64::arrow-cpp-11.0.0-h374c478_1 
  aws-c-common       pkgs/main/linux-64::aws-c-common-0.6.8-h5eee18b_1 
  aws-c-event-stream pkgs/main/linux-64::aws-c-event-stream-0.1.6-h6a678d5_6 
  aws-checksums      pkgs/main/linux-64::aws-checksums-0.1.11-h5eee18b_2 
  aws-sdk-cpp        pkgs/main/linux-64::aws-sdk-cpp-1.8.185-h721c034_1 
  blas               pkgs/main/linux



done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done




## Package Plan ##

  environment location: /home/jgreen/.miniconda3/envs/cloud-data

  added / updated specs:
    - dask-jobqueue


The following NEW packages will be INSTALLED:

  dask-jobqueue      conda-forge/noarch::dask-jobqueue-0.8.2-pyhd8ed1ab_0 



Downloading and Extracting Packages

Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done




## Package Plan ##

  environment location: /home/jgreen/.miniconda3/envs/cloud-data

  added / updated specs:
    - xarray


The following NEW packages will be INSTALLED:

  xarray             conda-forge/noarch::xarray-2023.7.0-pyhd8ed1ab_0 



Downloading and Extracting Packages

Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done




# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done




## Package Plan ##

  environment location: /home/jgreen/.miniconda3/envs/cloud-data

  added / updated specs:
    - intake-xarray


The following NEW packages will be INSTALLED:

  appdirs            conda-forge/noarch::appdirs-1.4.4-pyh9f0ad1d_0 
  asciitree          conda-forge/noarch::asciitree-0.3.3-py_2 
  certifi            conda-forge/noarch::certifi-2023.7.22-pyhd8ed1ab_0 
  cftime             pkgs/main/linux-64::cftime-1.6.2-py311hbed6279_0 
  charset-normalizer conda-forge/noarch::charset-normalizer-3.2.0-pyhd8ed1ab_0 
  curl               pkgs/main/linux-64::curl-8.1.1-hdbd6064_2 
  entrypoints        conda-forge/noarch::entrypoints-0.4-pyhd8ed1ab_0 
  fasteners          conda-forge/noarch::fasteners-0.17.3-pyhd8ed1ab_0 
  hdf4               conda-forge/linux-64::hdf4-4.2.15-h10796ff_3 
  hdf5               conda-forge/linux-64::hdf5-1.10.6-nompi_h3c11f04_101 
  idna               conda-forge/noarch::idna-3.4-pyhd8ed1ab_0 
  intake             conda-forge/noarch::intake-0.



## Package Plan ##

  environment location: /home/jgreen/.miniconda3/envs/cloud-data

  added / updated specs:
    - fastparquet


The following NEW packages will be INSTALLED:

  cramjam            pkgs/main/linux-64::cramjam-2.6.2-py311h52d8a92_0 
  fastparquet        pkgs/main/linux-64::fastparquet-2023.4.0-py311hf4808d0_0 



Downloading and Extracting Packages

Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done




## Package Plan ##

  environment location: /home/jgreen/.miniconda3/envs/cloud-data

  added / updated specs:
    - h5netcdf


The following NEW packages will be INSTALLED:

  h5netcdf           pkgs/main/linux-64::h5netcdf-1.2.0-py311h06a4308_0 
  h5py               pkgs/main/linux-64::h5py-3.7.0-py311h021c08c_0 

The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates    conda-forge::ca-certificates-2023.7.2~ --> pkgs/main::ca-certificates-2023.05.30-h06a4308_0 
  certifi            conda-forge/noarch::certifi-2023.7.22~ --> pkgs/main/linux-64::certifi-2023.7.22-py311h06a4308_0 



Downloading and Extracting Packages

Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done




## Package Plan ##

  environment location: /home/jgreen/.miniconda3/envs/cloud-data

  added / updated specs:
    - gcsfs


The following NEW packages will be INSTALLED:

  aiohttp            pkgs/main/linux-64::aiohttp-3.8.3-py311h5eee18b_0 
  aiosignal          conda-forge/noarch::aiosignal-1.3.1-pyhd8ed1ab_0 
  async-timeout      conda-forge/noarch::async-timeout-4.0.2-pyhd8ed1ab_0 
  attrs              conda-forge/noarch::attrs-23.1.0-pyh71513ae_1 
  blinker            conda-forge/noarch::blinker-1.6.2-pyhd8ed1ab_0 
  cachetools         conda-forge/noarch::cachetools-4.2.4-pyhd8ed1ab_0 
  cffi               pkgs/main/linux-64::cffi-1.15.1-py311h5eee18b_3 
  cryptography       pkgs/main/linux-64::cryptography-41.0.2-py311h22a60cf_0 
  frozenlist         pkgs/main/linux-64::frozenlist-1.3.3-py311h5eee18b_0 
  gcsfs              conda-forge/noarch::gcsfs-2023.6.0-pyhd8ed1ab_0 
  google-api-core    conda-forge/noarch::google-api-core-2.8.1-pyhd8ed1ab_0 
  google-auth        conda-for



## Package Plan ##

  environment location: /home/jgreen/.miniconda3/envs/cloud-data

  added / updated specs:
    - s3fs


The following NEW packages will be INSTALLED:

  aiobotocore        conda-forge/noarch::aiobotocore-2.5.0-pyhd8ed1ab_0 
  aioitertools       conda-forge/noarch::aioitertools-0.11.0-pyhd8ed1ab_0 
  botocore           pkgs/main/linux-64::botocore-1.29.76-py311h06a4308_0 
  brotlipy           pkgs/main/linux-64::brotlipy-0.7.0-py311h5eee18b_1002 
  jmespath           conda-forge/noarch::jmespath-1.0.1-pyhd8ed1ab_0 
  s3fs               conda-forge/noarch::s3fs-2023.6.0-pyhd8ed1ab_0 
  wrapt              pkgs/main/linux-64::wrapt-1.14.1-py311h5eee18b_0 

The following packages will be DOWNGRADED:

  urllib3                                2.0.4-pyhd8ed1ab_0 --> 1.26.15-pyhd8ed1ab_0 



Downloading and Extracting Packages

Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Collecting package met



Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done




## Package Plan ##

  environment location: /home/jgreen/.miniconda3/envs/cloud-data

  added / updated specs:
    - matplotlib


The following NEW packages will be INSTALLED:

  brotli             conda-forge/linux-64::brotli-1.0.9-h166bdaf_7 
  brotli-bin         conda-forge/linux-64::brotli-bin-1.0.9-h166bdaf_7 
  contourpy          pkgs/main/linux-64::contourpy-1.0.5-py311hdb19cb5_0 
  cycler             conda-forge/noarch::cycler-0.11.0-pyhd8ed1ab_0 
  dbus               pkgs/main/linux-64::dbus-1.13.18-hb2f20db_0 
  expat              conda-forge/linux-64::expat-2.2.10-h9c3ff4c_0 
  fontconfig         pkgs/main/linux-64::fontconfig-2.14.1-hef1e5e3_0 
  fonttools          pkgs/main/noarch::fonttools-4.25.0-pyhd3eb1b0_0 
  glib               pkgs/main/linux-64::glib-2.69.1-he621ea3_2 
  gst-plugins-base   pkgs/main/linux-64::gst-plugins-base-1.14.1-h6a678d5_1 
  gstreamer          pkgs/main/linux-64::gstreamer-1.14.1-h5eee18b_1 
  kiwisolver         pkgs/main/linux-64::kiwisolver-



## Package Plan ##

  environment location: /home/jgreen/.miniconda3/envs/cloud-data

  added / updated specs:
    - ujson


The following NEW packages will be INSTALLED:

  ujson              pkgs/main/linux-64::ujson-5.4.0-py311h6a678d5_0 

The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates    conda-forge::ca-certificates-2023.7.2~ --> anaconda::ca-certificates-2023.01.10-h06a4308_0 
  certifi            conda-forge::certifi-2023.7.22-pyhd8e~ --> anaconda::certifi-2020.6.20-pyhd3eb1b0_3 



Downloading and Extracting Packages

Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Collecting scipy
  Obtaining dependency information for scipy from https://files.pythonhosted.org/packages/b8/46/1d255bb55e63de02f7b2f3a2f71b59b840db21d61ff7cd41edbfc2da448a/scipy-1.11.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached scipy-1.11.1-cp311-cp311-man

## Step 3: Run Benchmarking

### Convert File to Cloud-Native

Since one of the major goals of this benchmarking is testing legacy formats against cloud-native ones, we must convert your legacy-formatted data (CSV and NetCDF4) into their corresponding cloud-native formats. This cell will execute and time the conversion process, writing each new format in parallel. The conversion will be done using each cluster's full amount of resources, so be mindful of this feature when using clusters that are expensive to operate.

In [None]:
os.system("bash benchmarks-core/run_benchmark_step.sh \"convert-data.py\" \"conversions.csv\"")
df_conversions = pd.read_csv(os.getcwd() + '/results/csv-files/conversions.csv')
df_conversions
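The timing itself happens inside the benchmark scripts, whose contents are not shown here. As a general illustration of how an operation such as a format conversion can be timed, the sketch below uses a small context manager around a stand-in computation; `timed` and the label are hypothetical and not part of the workflow.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo-conversion", timings):
    sum(range(100_000))  # stand-in for a real conversion step

print(f"demo-conversion took {timings['demo-conversion']:.6f} s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock, which makes it better suited to measuring elapsed time than `time.time`.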

### File Reads

The last computation-intensive test in the benchmarking times file reads from cloud storage. This will give you an idea of what data transfer throughput you can expect when using cloud storage and different data formats in other workflows.

In [None]:
os.system("bash benchmarks-core/run_benchmark_step.sh \"read-data.py\" \"reads.csv\"")
df_reads = pd.read_csv(os.getcwd() + '/results/csv-files/reads.csv')
df_reads
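Throughput is the amount of data read divided by the time the read took. The columns of `reads.csv` are not described here, so the sketch below shows only the arithmetic on plain numbers; `throughput_mb_per_s` is a hypothetical helper and 1 MB is taken as 10^6 bytes.

```python
def throughput_mb_per_s(n_bytes: int, seconds: float) -> float:
    """Convert a byte count and an elapsed time into megabytes per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return n_bytes / 1e6 / seconds


# Example: a 500 MB file read in 4 seconds.
rate = throughput_mb_per_s(500_000_000, 4.0)
print(f"{rate} MB/s")
```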

## TODO: Step 4: Visualize Results

**Feature not ready**

## Step 5: Remove Benchmarking Files from Cloud Resources (Optional)

In [None]:
os.system("bash postprocessing/remove-benchmark-files.sh")