# Cloud Data Transfer Speeds Benchmarking Workflow

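This notebook collects the inputs for a workflow that benchmarks data transfer speeds between cloud compute resources and cloud object stores. You define the compute resources, the object stores, and the randomly generated files used as test data, and the notebook then writes those choices to `user_input.json`.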

## Step 0: Load Required Setup Packages & Classes

This cell installs the packages required for workflow setup and loads the UI generation script. Any package missing from your `base` environment is installed automatically. Note that if installation is required, this cell will take a few minutes to finish executing.

In [None]:
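import json
import os
import subprocess
import sys

# Install any missing UI packages, then make the helper module importable.
ui_dir = os.path.join(os.getcwd(), "utils", "jupyter-user-input")
subprocess.run(["bash", os.path.join(ui_dir, "install-ui-packages.sh")])
sys.path.insert(0, ui_dir)
import ui_helpers as ui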
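
As a rough illustration of the "install only if missing" behavior described above, the following sketch checks which packages cannot be imported in the current environment. It is not taken from `install-ui-packages.sh`; the function name `missing_packages` and the package names passed to it are assumptions for illustration only.

```python
import importlib.util


def missing_packages(names):
    """Return the names in `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical package list; the real list is defined by the install script.
to_install = missing_packages(["json", "not_a_real_package_xyz"])
print(to_install)
```

Only the packages returned by such a check would need installing, which is why the cell runs quickly when everything is already present.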

## Step 1: Define Workflow Inputs

Run the following cells to generate interactive widgets, allowing you to enter all required workflow inputs:

**1. Cloud Compute Resources**

Before defining anything else, define the resources you intend to use for benchmarking. Currently, only resources defined in the Parallel Works platform may be used. This section also includes options that are passed to Dask: you have full control over how many cores and how much memory each worker in your cluster uses, as well as how many nodes may be active at one time.

These options are included so that you can make fair comparisons between different cloud service providers (CSPs). Different CSPs generally don't offer worker nodes with exactly the same specs, so to compare two CSPs fairly, one cluster must be limited so that it does not exceed the computational power of the other.

**2. Cloud Object Stores**

This set of inputs is where you enter each cloud object store's Uniform Resource Identifier (URI), or the absolute path to the cloud storage location if it is mounted into the filesystems of your head node and *all* worker nodes. Parallel Works provides an intuitive and easy-to-use [feature for creating mounted cloud object stores](https://docs.parallel.works/managing-storage/). Note that object stores requiring access credentials are not yet supported.

**3. Randomly Generated File Options**

The last set of inputs required by the workflow covers the randomly generated file options. The workflow does not currently support user-supplied files (this will be the first feature added once the first fully functioning version of the workflow is released). Instead, you can choose among several commonly used file formats to be randomly generated: CSV, NetCDF4, and plain binary. You may specify the desired file sizes and, for NetCDF4, the number of data variables and axes to create.

### Cloud Compute Resources

<div class="alert alert-block alert-info">
For resources controlled from the Parallel Works platform, the <code>Resource name</code> box should be populated with the name found on the <b>RESOURCES</b> tab.
</div>

In [None]:
resource = ui.resourceWidgets()
resource.display()
resources = resource.processInput()
resources
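
The fair-comparison idea described earlier can be sketched as taking, for each setting, the smallest value available across the clusters being compared. This is a minimal sketch only: the dictionary fields (`cores_per_worker`, `memory_gb_per_worker`, `max_nodes`) and the cluster names are hypothetical and do not reflect the structure returned by `processInput()`.

```python
def common_worker_limits(clusters):
    """Return worker settings that every cluster in `clusters` can satisfy."""
    return {
        "cores_per_worker": min(c["cores_per_worker"] for c in clusters),
        "memory_gb_per_worker": min(c["memory_gb_per_worker"] for c in clusters),
        "max_nodes": min(c["max_nodes"] for c in clusters),
    }


# Hypothetical specs for two clusters hosted on different CSPs.
clusters = [
    {"name": "csp_a", "cores_per_worker": 16, "memory_gb_per_worker": 64, "max_nodes": 8},
    {"name": "csp_b", "cores_per_worker": 8, "memory_gb_per_worker": 32, "max_nodes": 10},
]

limits = common_worker_limits(clusters)
print(limits)
```

Entering the same limits for both clusters keeps the more powerful one from skewing the benchmark.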

### Cloud Object Stores

In [None]:
store = ui.storageWidgets()
store.display()
storage = store.processInput()
storage
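
Because a storage location may be given either as a URI or as an absolute path to a mounted store, it can help to tell the two apart before using them. The sketch below shows one possible way to do so with the standard library. The function name `classify_location` and the example bucket and mount names are assumptions, and this is not how `ui_helpers` processes the input.

```python
import posixpath
from urllib.parse import urlparse


def classify_location(location):
    """Classify a storage location as 'uri', 'mounted_path', or 'invalid'."""
    parsed = urlparse(location)
    # A scheme followed by a bucket/host component (e.g. s3://bucket) is a URI.
    if parsed.scheme and parsed.netloc:
        return "uri"
    # Mounted stores are given as absolute POSIX paths on the cluster nodes.
    if posixpath.isabs(location):
        return "mounted_path"
    return "invalid"


# Hypothetical locations for illustration.
for loc in ["s3://example-bucket/data", "/mnt/example-store", "relative/path"]:
    print(loc, "->", classify_location(loc))
```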

### Randomly Generated File Options

In [None]:
randgen = ui.randgenWidgets(resources=resources)
randgen.display()
randfiles = randgen.processInput()
randfiles
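
To give a sense of what randomly generated files of a chosen size might look like, here is a minimal standard-library sketch for the plain binary and CSV formats. It is an illustration under assumptions, not the workflow's generator: the function names, the column naming scheme, and the chunk size are all hypothetical, and NetCDF4 is omitted because it requires a third-party library.

```python
import csv
import os
import random
import tempfile


def write_random_binary(path, size_bytes, chunk_bytes=1 << 20):
    """Write exactly `size_bytes` random bytes to `path`, one chunk at a time."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_bytes, remaining)
            f.write(os.urandom(n))
            remaining -= n


def write_random_csv(path, n_rows, n_cols, seed=0):
    """Write a header row plus `n_rows` rows of random floats to `path`."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"col{i}" for i in range(n_cols)])
        for _ in range(n_rows):
            writer.writerow([rng.random() for _ in range(n_cols)])


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "random.bin")
    csv_path = os.path.join(tmp, "random.csv")
    write_random_binary(bin_path, 4096)
    write_random_csv(csv_path, n_rows=10, n_cols=3)
    bin_size = os.path.getsize(bin_path)
    with open(csv_path, newline="") as f:
        csv_rows = list(csv.reader(f))

print(bin_size, len(csv_rows))
```

Writing the binary file in chunks keeps memory use bounded even when the requested size is large.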

## Step 2: Notebook Setup

In [None]:
print('Setting up workflow...')

# Serialize all widget inputs into a single JSON document
user_input = json.dumps({"RESOURCES": resources,
                         "STORAGE": storage,
                         "RANDFILES": randfiles})

# Write the inputs to disk
with open('user_input.json', 'w') as outfile:
    outfile.write(user_input)

# Setup script invocation (currently commented out)
#! chmod u+x workflow_notebook_setup.sh
#! ./workflow_notebook_setup.sh

print('Workflow setup complete.')
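
Before handing `user_input.json` to later steps, it may be worth confirming that all three top-level sections are present. The sketch below is one possible check. The key names come from the cell above, while the function name `missing_sections` and the empty sample values are assumptions for illustration.

```python
import json

REQUIRED_KEYS = ("RESOURCES", "STORAGE", "RANDFILES")


def missing_sections(text):
    """Return the required top-level keys that are absent from JSON `text`."""
    data = json.loads(text)
    return [key for key in REQUIRED_KEYS if key not in data]


# Placeholder documents; in the notebook the text would be read from user_input.json.
complete = json.dumps({"RESOURCES": [], "STORAGE": [], "RANDFILES": []})
partial = json.dumps({"RESOURCES": []})

print(missing_sections(complete))
print(missing_sections(partial))
```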