# Cloud Data Transfer Speeds Benchmarking Workflow

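This workflow benchmarks data transfer speeds between the cloud resources and cloud object stores that you specify. The first two steps define the workflow inputs and run the notebook setup script, `workflow_notebook_setup.sh`.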

## Step 1: Define Workflow Inputs

### Cloud Resource & Storage Information

- `resources (list) = ["<resource-1-name>", "<resource-2-name>", ..., "<resource-n-name>"]`
    
    Input the names of as many resources as you wish to run the benchmarks on. The name(s) of your resource(s) can be found on the **`RESOURCES`** tab in the Parallel Works platform.
    
    
- `storage (list) = ["<URI-1>", "<URI-2>", ..., "<URI-n>"]`

    This list determines which cloud object store(s) the benchmarks run against. You may enter any number of storage locations, each given as a URI.


In [2]:
resources = ["gcptest"]
storage = ["gs://cloud-data-benchmarks/"]
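
Before moving on, it can help to sanity-check the two lists. The following is a minimal sketch, not part of the workflow itself: `validate_inputs` is a hypothetical helper, and it assumes every storage entry is a URI with a scheme and a bucket name (as in `gs://cloud-data-benchmarks/`). It does not check that a resource actually exists on the platform.

```python
from urllib.parse import urlparse


def validate_inputs(resources, storage):
    """Return a list of problems found; an empty list means the inputs look usable.

    Hypothetical helper: it only checks the shape of the inputs, not whether
    the resources or buckets exist.
    """
    problems = []
    if not resources:
        problems.append("resources is empty")
    for name in resources:
        if not isinstance(name, str) or not name.strip():
            problems.append(f"invalid resource name: {name!r}")
    if not storage:
        problems.append("storage is empty")
    for uri in storage:
        parsed = urlparse(uri) if isinstance(uri, str) else None
        # Assumption: a bucket URI has a scheme (gs) and a bucket name (netloc)
        if parsed is None or not parsed.scheme or not parsed.netloc:
            problems.append(f"storage entry is not a URI: {uri!r}")
    return problems


print(validate_inputs(["gcptest"], ["gs://cloud-data-benchmarks/"]))
```

An empty list means nothing suspicious was found; any other result names the entries worth fixing before running the setup cell.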

### Randomly Generated File Options

Any randomly generated files you request are written to every cloud storage location specified in the **Cloud Resource & Storage Information** section. This ensures that all of the cloud object stores under test hold identical randomly generated files, for a fair comparison.

- `<filetype> (list) = [<Boolean>, <int>]`

   `<filetype>` is one of three formats: `csv`, `netcdf`, or `binary` (a general binary file). The first term in the list, a Boolean, sets whether the workflow randomly generates a file of that format. `True` creates the file, and `False` does not.
   
   The second term, an integer, is the file size in GB. If the first term is `False`, the second term can be set to any value. **NOTE: THE SECOND TERM MUST ALWAYS BE POPULATED. DO NOT LEAVE IT BLANK.**

In [3]:
csv = [True, 1]
netcdf = [False, 0]
binary = [False, 0]
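
The `[<Boolean>, <int>]` convention above can be checked before the setup cell runs. This is a rough sketch under stated assumptions: `check_filetype_option` is a hypothetical helper, and the rule that an enabled file type needs a positive size is an assumption, not something the workflow states.

```python
def check_filetype_option(name, option):
    """Raise ValueError unless option looks like [<Boolean>, <int>].

    Hypothetical helper; returns the (enabled, size_gb) pair when valid.
    """
    if not isinstance(option, (list, tuple)) or len(option) != 2:
        raise ValueError(f"{name} must have exactly two terms: [<Boolean>, <int>]")
    enabled, size_gb = option
    if not isinstance(enabled, bool):
        raise ValueError(f"{name}[0] must be True or False")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(size_gb, bool) or not isinstance(size_gb, int):
        raise ValueError(f"{name}[1] must be an integer size in GB")
    # Assumption: a file that will be generated needs a positive size
    if enabled and size_gb <= 0:
        raise ValueError(f"{name} is enabled, so its size must be positive")
    return enabled, size_gb


for option_name, option in [("csv", [True, 1]), ("netcdf", [False, 0]), ("binary", [False, 0])]:
    print(option_name, check_filetype_option(option_name, option))
```

Leaving the second term blank, for example `csv = [True]`, fails the length check with a clear message rather than surfacing later as an indexing error in the setup cell.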

## Step 2: Notebook Setup

In [11]:
print('Setting up workflow...')

# Import basic packages
import os
import json

# Set cloud resource & storage environment variables (JSON-encoded lists)
os.environ["resources"] = json.dumps(resources)
os.environ["benchmark_storage"] = json.dumps(storage)

# Set randomly generated file option environment variables.
# Flags and sizes are stored as strings, in the order: CSV, NetCDF, binary.
rand_filetype = list(map(str, [csv[0], netcdf[0], binary[0]]))
rand_filesize = list(map(str, [csv[1], netcdf[1], binary[1]]))
os.environ["randgen_files"] = json.dumps(rand_filetype)
os.environ["randgen_sizes"] = json.dumps(rand_filesize)

# Execute setup script
# Variables set through os.environ are inherited by the shell commands below,
# so no separate `export` step is needed. (Running `export` in a subprocess
# would only affect that short-lived child shell.)

! chmod u+x workflow_notebook_setup.sh
! ./workflow_notebook_setup.sh

print('Workflow setup complete.')

Setting up workflow...
Workflow setup complete.
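
The setup cell hands its inputs to the script as JSON-encoded environment variables. The sketch below illustrates how a child process could decode them; it is not the contents of `workflow_notebook_setup.sh`, which is not shown here. The inline consumer (`child_code`) is hypothetical, and the example values mirror the inputs from Step 1.

```python
import json
import os
import subprocess
import sys

# Mirror the setup cell: serialize the inputs into environment variables
os.environ["resources"] = json.dumps(["gcptest"])
os.environ["randgen_files"] = json.dumps(["True", "False", "False"])
os.environ["randgen_sizes"] = json.dumps(["1", "0", "0"])

# Hypothetical consumer, run as a separate process. It inherits the
# environment and converts the string-encoded values back to bool and int.
child_code = """
import json, os
resources = json.loads(os.environ["resources"])
flags = [s == "True" for s in json.loads(os.environ["randgen_files"])]
sizes = [int(s) for s in json.loads(os.environ["randgen_sizes"])]
print(json.dumps({"resources": resources, "flags": flags, "sizes": sizes}))
"""

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)
decoded = json.loads(result.stdout)
print(decoded)
```

Because the flags are stored as the strings `"True"` and `"False"`, a consumer has to compare against the string rather than rely on truthiness: any non-empty string, including `"False"`, is truthy in Python.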
