# Cloud Data Transfer Speeds Benchmarking Workflow

Add overview of workflow

## Step 0: Load Required Setup Packages & Classes

This cell installs the setup packages the workflow requires and loads the UI helper module. Any packages missing from your `base` environment are installed automatically. If installation is needed, the cell will take a few minutes to finish.

In [1]:
import os
import json
import sys

print('Checking conda environment for UI dependencies...')
os.system("bash " + os.getcwd() + "/jupyter-helpers/install_ui_packages.sh")
print('All dependencies installed.')

sys.path.insert(0, os.getcwd() + '/jupyter-helpers')
import ui_helpers as ui
import pandas as pd

Checking conda environment for UI dependencies...
All dependencies installed.
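
The check-then-install behaviour described above can be sketched in Python. This is only an illustration, not the contents of `install_ui_packages.sh`: the helper name `missing_packages` and the example package list are assumptions.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical package list; the real list is defined by install_ui_packages.sh.
required = ["json", "ipywidgets", "pandas"]
to_install = missing_packages(required)
print(f"Packages that would need installing: {to_install}")
```

Only the packages reported as missing would need to be installed, which is why the cell is fast on a second run.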


## Step 1: Define Workflow Inputs

Run the following cells to generate interactive widgets allowing you to enter all workflow inputs. **All inputs must be filled out to proceed with the benchmarking process.**

### Cloud Resources

#### Compute Resources

Before defining anything else, you must define the resources you intend to benchmark with. Currently, only resources defined in the Parallel Works platform may be used. Also note the options that are passed to Dask: you control how many cores and how much memory each worker in your cluster uses, as well as how many nodes are active at one time.

These options are included so that you can make fair comparisons between different cloud service providers (CSPs). Different CSPs generally won't offer worker nodes with exactly the same specs, so to compare two CSPs fairly, one cluster will have to be limited so that it does not exceed the computational power of the other.
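
One way to pick matching worker settings is to take the smallest CPU and memory values across all resources. The sketch below assumes the resource dictionaries have the shape printed by the cell in this section; the second resource (`awsexample`), the helper names, and the assumption that `Memory` is given in GB are all hypothetical. The keyword names `queue`, `cores`, and `memory` match parameters of `dask_jobqueue.SLURMCluster`, but whether the workflow builds its clusters that way is also an assumption.

```python
def common_worker_spec(resources):
    """Smallest per-worker CPU and memory settings across all resources."""
    cpus = min(r["Dask"]["CPUs"] for r in resources)
    memory = min(r["Dask"]["Memory"] for r in resources)
    return {"CPUs": cpus, "Memory": memory}


def slurm_kwargs(resource, spec):
    """Translate one resource entry into keyword arguments for a SLURM-backed Dask cluster."""
    return {
        "queue": resource["Dask"]["Partition"],
        "cores": spec["CPUs"],
        # Assumption: the 'Memory' input is expressed in GB.
        "memory": f"{spec['Memory']:g}GB",
    }


example_resources = [
    {"Name": "gcptestnew", "CSP": "GCP",
     "Dask": {"Scheduler": "SLURM", "Partition": "compute", "CPUs": 2, "Memory": 16.0}},
    # Hypothetical second resource, used only to illustrate the comparison.
    {"Name": "awsexample", "CSP": "AWS",
     "Dask": {"Scheduler": "SLURM", "Partition": "compute", "CPUs": 4, "Memory": 8.0}},
]

spec = common_worker_spec(example_resources)
for res in example_resources:
    print(res["Name"], slurm_kwargs(res, spec))
```

Limiting every cluster to the common spec keeps a larger node type from winning simply because it has more cores or memory per worker.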

<div class="alert alert-block alert-info">
For resources controlled from the Parallel Works platform, the <code>Resource name</code> box should be populated with the name found on the <b>RESOURCES</b>  tab.
    </div>

In [2]:
resource = ui.resourceWidgets()
resource.display()
resources = resource.processInput()
print(f'Your resource inputs:\n {resources}')

-----------------------------------------------------------------------------
If you wish to change information about cloud resources, run this cell again.

Your resource inputs:
 [{'Name': 'gcptestnew', 'CSP': 'GCP', 'Dask': {'Scheduler': 'SLURM', 'Partition': 'compute', 'CPUs': 2, 'Memory': 16.0}}]


#### Object Stores

This set of inputs is where you enter the Uniform Resource Identifiers (URIs) of your cloud object stores. Both public and private buckets are supported. For private buckets, ensure that your access credentials have both read and write permissions, because format conversions are written during the benchmarking process.
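
As an illustration of how a storage URI breaks down, the sketch below splits a URI into its provider, bucket, and prefix with the standard library. The scheme-to-CSP mapping and the helper name `describe_store` are assumptions made for this example, not part of `ui_helpers`.

```python
from urllib.parse import urlparse

# Assumed mapping from URI scheme to cloud service provider.
SCHEME_TO_CSP = {"gs": "GCP", "s3": "AWS"}


def describe_store(uri):
    """Split an object store URI into CSP, bucket name, and key prefix."""
    parsed = urlparse(uri)
    if parsed.scheme not in SCHEME_TO_CSP:
        raise ValueError(f"Unsupported URI scheme: {parsed.scheme!r}")
    return {
        "CSP": SCHEME_TO_CSP[parsed.scheme],
        "Bucket": parsed.netloc,
        "Prefix": parsed.path.lstrip("/"),
    }


print(describe_store("gs://cloud-data-benchmarks"))
print(describe_store("s3://cloud-data-benchmarks/some/prefix"))
```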

In [3]:
store = ui.storageWidgets()
store.display()
storage = store.processInput()
print(f'Your storage inputs:\n {storage}')

--------------------------------------------------------------------------------------
If you wish to change information about cloud storage locations, run this cell again.

Your storage inputs:
 [{'Path': 'gs://cloud-data-benchmarks', 'Type': 'Private', 'CSP': 'GCP', 'Credentials': './.cloud-data-benchmarks.json'}, {'Path': 's3://cloud-data-benchmarks', 'Type': 'Private', 'CSP': 'AWS', 'Credentials': 'default'}]


### Datasets

#### User-Supplied Datasets

Below you can specify datasets that you want tested in the benchmarking. You can enter either single files or multiple files that belong to a single dataset, but each dataset must match at least one of the supported formats. **Read the following input rules after running the UI cell below this one.**


1. Activate the checkbox if you desire to record your user-defined datasets. If it is not checked, none of your inputs will be recorded.

<br>

2. Input the full URI or absolute path of the data location (`<URI prefix>://bucket-name/path/to/file.extension` or `/path/to/file.extension`).
    - Use globstrings (`<URI prefix>://bucket-name/path/to/files/*` or `/path/to/files/*`) to specify datasets that are split into multiple subfiles.
    - If using a globstring, ensure that *only* files that belong to the dataset exist in that directory. The workflow takes all files in the directory before the `*` and attempts to gather them into a single dataset.
    
<br>

3. If you have a dataset stored in multiple cloud storage locations that will be used in the benchmarking, you must input the full URI of that dataset for each of these locations. That is, you must define each location of the data separately.

<br>

4. For single-file datasets > 5.4 GB (5 GiB) in size that may need to be transferred across clouds in the benchmarking, you must transfer these files to the desired locations *before* running the benchmarking.
    - This is because `gsutil`, the tool the benchmark uses to copy user-defined files from the original cloud storage location to other benchmarking locations, can only handle single-file transfers between CSPs that are smaller than 5 GiB.
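
Rules 2 and 4 can be checked before submitting the form. The sketch below is a minimal illustration with hypothetical helper names; it does not query any cloud API, so the file size must be supplied by the caller.

```python
FIVE_GIB = 5 * 1024**3  # 5,368,709,120 bytes, roughly 5.4 GB


def is_globstring(path):
    """True when the last path segment contains a wildcard."""
    return "*" in path.rsplit("/", 1)[-1]


def dataset_directory(path):
    """Directory whose files would be gathered for a globstring, else None."""
    if not is_globstring(path):
        return None
    return path.rsplit("/", 1)[0] + "/"


def needs_manual_transfer(size_bytes):
    """True when a single file exceeds the 5 GiB cross-cloud copy limit."""
    return size_bytes > FIVE_GIB


print(dataset_directory("gs://bucket-name/path/to/files/*"))
print(needs_manual_transfer(6 * 1024**3))
```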

In [4]:
userdata = ui.userdataWidgets(storage=storage)
userdata.display()
user_files = userdata.processInput()
print(f'Your data inputs:\n {user_files}')

---------------------------------------------------------------------------
If you wish to change information about your input data, run this cell again.

Your data inputs:
 [{'Format': 'NetCDF4', 'SourcePath': 'gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.nc', 'Type': 'Private', 'CSP': 'GCP', 'Credentials': './.cloud-data-benchmarks.json'}, {'Format': 'NetCDF4', 'SourcePath': 's3://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.nc', 'Type': 'Private', 'CSP': 'AWS', 'Credentials': 'default'}]


#### Randomly-Generated Datasets

Another option to supply data to the benchmarking is to create randomly-generated datasets. These sets can be as large as you want (as they are written in parallel), and provide a great option if you are new to the world of cloud-native data formats. There are currently two supported randomly-generated data formats: CSV and NetCDF4. Since NetCDF4 is a gridded data format, options to customize the types and numbers of dimensions are included.
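
To get a feel for how the size and dimension options interact, the sketch below estimates the length each dimension would need for a gridded dataset of a given size. It assumes 8-byte floating-point values, equal-length dimensions, and 1 GB = 10^9 bytes; none of these are confirmed by the workflow, and `grid_side_length` is a hypothetical helper.

```python
def grid_side_length(size_gb, data_vars, float_coords, time_coords, bytes_per_value=8):
    """Length of each (equal-sized) dimension needed to approach size_gb."""
    n_dims = int(float_coords + time_coords)
    total_values = size_gb * 1e9 / (bytes_per_value * data_vars)
    # Round down so the generated data does not exceed the requested size.
    return int(total_values ** (1.0 / n_dims))


# Options taken from the NetCDF4 entry of the example output in this section.
side = grid_side_length(size_gb=10.0, data_vars=1.0, float_coords=2.0, time_coords=1.0)
actual_gb = side**3 * 8 / 1e9
print(f"Each of the 3 dimensions would have {side} points ({actual_gb:.2f} GB)")
```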

In [5]:
randgen = ui.randgenWidgets(resources=resources)
randgen.display()
randfiles = randgen.processInput()
print(f'Your randomly-generated file options:\n {randfiles}')

-------------------------------------------------------------------------------
If you wish to change the randomly generated file options, run this cell again.

Your randomly-generated file options:
 [{'Format': 'CSV', 'Generate': False, 'SizeGB': 0.0}, {'Format': 'NetCDF4', 'Generate': True, 'SizeGB': 10.0, 'Data Variables': 1.0, 'Float Coords': 2.0, 'Time Coords': 1.0}, {'Resource': 'gcptestnew'}]


### Cloud-Native Format Conversion Options

You can also set which chunk sizes and compression algorithms to use in the benchmarking.
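
The effect of a chunk size choice can be estimated before running anything. The sketch below counts the chunks an array would be split into and the uncompressed size of each chunk; the array shape, chunk shape, and 8-byte value size are illustrative assumptions, and `chunk_layout` is a hypothetical helper.

```python
import math


def chunk_layout(shape, chunks, bytes_per_value=8):
    """Return (number of chunks, uncompressed bytes per full chunk)."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_bytes = math.prod(chunks) * bytes_per_value
    return n_chunks, chunk_bytes


n_chunks, chunk_bytes = chunk_layout(shape=(1077, 1077, 1077), chunks=(100, 100, 100))
print(f"{n_chunks} chunks of about {chunk_bytes / 1e6:.0f} MB each")
```

Smaller chunks mean more objects to read and write in the bucket, while larger chunks mean more data moved per request, so the chunk size is one of the main variables the benchmark can compare.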

## Step 2: Notebook Setup

Executing the following cell will write all of your inputs to `benchmark_info.json`, install miniconda3 and the "cloud-data" Python environment to all resources, and write randomly-generated files to all cloud storage locations (if any files were specified). If writing randomly-generated files, especially large ones, the execution of this cell may take a while.

<div class="alert alert-block alert-info">
While randomly-generated files are written in parallel by default, if you wish to speed up the execution of this cell, consider creating/choosing a resource with more powerful worker nodes.
    </div>
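
Because every input must be filled out before benchmarking, it can help to check the collected inputs before they are written to `benchmark_info.json`. The sketch below is an assumption about what such a check might look like. `validate_inputs` is a hypothetical helper, and treating only the resource and storage sections as mandatory is a guess based on the inputs described in Step 1.

```python
import json

REQUIRED_KEYS = ("RESOURCES", "STORAGE", "USERFILES", "RANDFILES")


def validate_inputs(info):
    """Return a list of problems; an empty list means the inputs look complete."""
    problems = [f"missing section: {key}" for key in REQUIRED_KEYS if key not in info]
    # Assumption: at least one resource and one storage location are required.
    for key in ("RESOURCES", "STORAGE"):
        if key in info and not info[key]:
            problems.append(f"{key} is empty")
    return problems


example = {
    "RESOURCES": [{"Name": "gcptestnew", "CSP": "GCP"}],
    "STORAGE": [{"Path": "gs://cloud-data-benchmarks", "CSP": "GCP"}],
    "USERFILES": [],
    "RANDFILES": [{"Format": "CSV", "Generate": False, "SizeGB": 0.0}],
}

# Round-trip through JSON, as the setup cell does when writing the file.
restored = json.loads(json.dumps(example))
print(validate_inputs(restored))
```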

In [8]:
print('Setting up workflow...')

# Collect every input gathered in Step 1 into one dictionary
user_input = {"RESOURCES": resources,
              "STORAGE": storage,
              "USERFILES": user_files,
              "RANDFILES": randfiles}

# Write the inputs to benchmark_info.json
with open('benchmark_info.json', 'w') as outfile:
    json.dump(user_input, outfile)

os.system("bash workflow_notebook_setup.sh")

print('Workflow setup complete.')

Setting up workflow...
Will install miniconda3 to "/home/jgreen/.miniconda3"
Installing Miniconda-latest on "gcptestnew"...
--2023-07-27 19:03:02--  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.130.3, 104.16.131.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.130.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 103219356 (98M) [application/x-sh]
Saving to: ‘/tmp/miniconda-1690484582-22031.sh’

Finished installing Miniconda on "gcptestnew".                                                                   

Building "cloud-data" environment on "gcptestnew"...
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

  current version: 23.5.2
  latest version: 23.7.1

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.7.1

Downloading and Extracting Packages
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Installing pip dependencies: ...working... Ran pip subprocess with arguments:
['/home/jgreen/.miniconda3/envs/cloud-data/bin/python', '-m', 'pip', 'install', '-U', '-r', '/home/jgreen/condaenv.s6id5cmw.requirements.txt', '--exists-action=b']
Pip subprocess output:
Collecting h5netcdf==1.2.0 (from -r /home/jgreen/condaenv.s6id5cmw.requirements.txt (line 1))
  Downloading h5netcdf-1.2.0-py3-none-any.whl (43 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.3/43.3 kB 2.6 MB/s eta 0:00:00
Collecting h5py==3.9.0 (from -r /home/jgreen/condaenv.s6id5cmw.requirements.txt (line 2))
  Downloading h5py-3.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.8/4.8 MB 53.5 MB/s eta 0:00:00
Collecting scipy==1.11.1 (from -r /home/jgreen/condaenv.s6id5cmw.requirements.txt (line 3))
  Downloading scipy-1.11.1-

2023-07-27 19:14:47,310 - distributed.deploy.adaptive_core - INFO - Adaptive stop
2023-07-27 19:14:47,805 - distributed.deploy.adaptive_core - INFO - Adaptive stop


Workflow setup complete.


## Step 3: Run Benchmarking

### Convert File to Cloud-Native

In [9]:
os.system("bash benchmarks-core/run_benchmark_step.sh \"convert-data.py\" \"conversions.csv\"")
df = pd.read_csv(os.getcwd() + '/results/csv-files/conversions.csv')
df

Converting files in "gs://cloud-data-benchmarks" with "gcptestnew"...
Converting ETOPO1_Ice_g_gmt4.nc to Zarr...
Written to "gs://cloud-data-benchmarks/cloud-data-transfer-benchmarking/cloudnativefiles/ETOPO1_Ice_g_gmt4.nc_zarr"
Converting random_10.0GB_NetCDF4 to Zarr...


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


Written to "gs://cloud-data-benchmarks/cloud-data-transfer-benchmarking/cloudnativefiles/random_10.0GB_NetCDF4_zarr"
Done converting files in "gs://cloud-data-benchmarks".
Converting files in "s3://cloud-data-benchmarks" with "gcptestnew"...
Converting ETOPO1_Ice_g_gmt4.nc to Zarr...
Written to "s3://cloud-data-benchmarks/cloud-data-transfer-benchmarking/cloudnativefiles/ETOPO1_Ice_g_gmt4.nc_zarr"
Converting random_10.0GB_NetCDF4 to Zarr...


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


Written to "s3://cloud-data-benchmarks/cloud-data-transfer-benchmarking/cloudnativefiles/random_10.0GB_NetCDF4_zarr"
Done converting files in "s3://cloud-data-benchmarks".


2023-07-27 19:25:41,283 - distributed.deploy.adaptive_core - INFO - Adaptive stop
2023-07-27 19:25:41,545 - distributed.deploy.adaptive_core - INFO - Adaptive stop


| Unnamed: 0 | resource | bucket | conversionType | dataset_name | Conversion Time |
|---|---|---|---|---|---|
| 0 | gcptestnew | gs://cloud-data-benchmarks | netcdf2zarr | ETOPO1_Ice_g_gmt4.nc | 168.593685 |
| 1 | gcptestnew | gs://cloud-data-benchmarks | netcdf2zarr | random_10.0GB_NetCDF4 | 129.873984 |
| 2 | gcptestnew | s3://cloud-data-benchmarks | netcdf2zarr | ETOPO1_Ice_g_gmt4.nc | 21.037287 |
| 3 | gcptestnew | s3://cloud-data-benchmarks | netcdf2zarr | random_10.0GB_NetCDF4 | 51.472581 |
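
The conversion results above can be summarized per bucket. The sketch below uses only the standard library and the four rows shown in the output; `mean_time_by_bucket` is a hypothetical helper, and the same summary could be produced from `df` with a pandas group-by.

```python
import csv
import io
from collections import defaultdict

# The rows shown in the conversion results above.
CSV_TEXT = """resource,bucket,conversionType,dataset_name,Conversion Time
gcptestnew,gs://cloud-data-benchmarks,netcdf2zarr,ETOPO1_Ice_g_gmt4.nc,168.593685
gcptestnew,gs://cloud-data-benchmarks,netcdf2zarr,random_10.0GB_NetCDF4,129.873984
gcptestnew,s3://cloud-data-benchmarks,netcdf2zarr,ETOPO1_Ice_g_gmt4.nc,21.037287
gcptestnew,s3://cloud-data-benchmarks,netcdf2zarr,random_10.0GB_NetCDF4,51.472581
"""


def mean_time_by_bucket(text):
    """Average the 'Conversion Time' column for each bucket."""
    times = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        times[row["bucket"]].append(float(row["Conversion Time"]))
    return {bucket: sum(vals) / len(vals) for bucket, vals in times.items()}


for bucket, mean_time in mean_time_by_bucket(CSV_TEXT).items():
    print(f"{bucket}: {mean_time:.2f}")
```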


In [None]:
os.system("bash benchmarks-core/run_benchmark_step.sh \"read-data.py\" \"reads.csv\"")
df = pd.read_csv(os.getcwd() + '/results/csv-files/reads.csv')
df
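
The benchmark cells call `os.system`, which returns the exit status without raising an error, so a failed step can go unnoticed until the CSV read fails. A possible alternative is sketched below with `subprocess.run`; the helper name `run_step` is hypothetical, and the demonstration command runs the Python interpreter rather than the real benchmark script.

```python
import subprocess
import sys


def run_step(command):
    """Run a command and raise if it exits with a non-zero status."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"step failed with exit code {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


# Demonstration with a harmless command. In the notebook the call might look like:
# run_step(["bash", "benchmarks-core/run_benchmark_step.sh", "read-data.py", "reads.csv"])
print(run_step([sys.executable, "-c", "print('ok')"]).strip())
```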

## Step 4: Visualize Results