STYLE: Remove trailing whitespaces #346

Merged: 2 commits, May 18, 2022
6 changes: 3 additions & 3 deletions .devcontainer/devcontainer.json
@@ -2,7 +2,7 @@
// https://github.com/microsoft/vscode-dev-containers/tree/v0.187.0/containers/python-3-miniconda
{
"name": "hi-ml",
"build": {
"build": {
"context": "..",
"dockerfile": "Dockerfile",
"args": {
@@ -12,7 +12,7 @@
},

// Set *default* container specific settings.json values on container create.
"settings": {
"settings": {
"python.pythonPath": "/opt/conda/bin/python",
"python.languageServer": "Pylance",
"python.linting.enabled": true,
@@ -42,7 +42,7 @@

// Comment out connect as root instead. More info: https://aka.ms/vscode-remote/containers/non-root.
// "remoteUser": "vscode"

// Extra settings to start the docker container in order to use libfuse, required for locally mounting datasets.
// More info: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py#mount-mount-point-none----kwargs-
"runArgs": [
2 changes: 1 addition & 1 deletion .devcontainer/noop.txt
@@ -1,3 +1,3 @@
This file is copied into the container along with environment.yml* from the
parent folder. This is done to prevent the Dockerfile COPY instruction from
failing if no environment.yml is found.
2 changes: 1 addition & 1 deletion .github/workflows/check-pr-title.yml
@@ -1,5 +1,5 @@
name: 'Check PR Title'
on:
pull_request:
types: [edited, opened, synchronize, reopened]

4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -85,7 +85,7 @@ class Foo:
This is the class description.

The following block will be pretty-printed by Sphinx. Note the space between >>> and the code!

Usage example:
>>> from module import Foo
>>> foo = Foo(bar=1.23)
@@ -107,7 +107,7 @@ class Foo:
if enclosed in double backtick.

This method can raise a :exc:`ValueError`.

:param arg: This is a description for the method argument.
Long descriptions should be indented.
"""
2 changes: 1 addition & 1 deletion SECURITY.md
@@ -14,7 +14,7 @@ Instead, please report them to the Microsoft Security Response Center (MSRC) at

If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).

You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).

Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:

30 changes: 15 additions & 15 deletions docs/source/authentication.md
@@ -2,14 +2,14 @@

## Authentication

The `hi-ml` package supports two ways of authenticating with Azure.
The default is what is called "Interactive Authentication". When you submit a job to Azure via `hi-ml`, this will
use the credentials you used in the browser when last logging into Azure. If there are no credentials yet, you should
see instructions printed out to the console about how to log in using your browser.

We recommend using Interactive Authentication.

Alternatively, you can use a so-called Service Principal, for example within build pipelines.


## Service Principal Authentication
@@ -19,7 +19,7 @@ training runs from code, for example from within an Azure pipeline. You can find
[here](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals).

If you would like to use a Service Principal, you will need to create it in Azure first, and then store 3 pieces
of information in 3 environment variables (see the instructions below). When all 3 environment variables are in place,
your Azure submissions will automatically use the Service Principal to authenticate.


@@ -29,28 +29,28 @@ your Azure submissions will automatically use the Service Principal to authentic
1. Navigate to `App registrations` (use the top search bar to find it).
1. Click on `+ New registration` on the top left of the page.
1. Choose a name for your application, e.g. `MyServicePrincipal`, and click `Register`.
1. Once it is created you will see your application in the list appearing under `App registrations`. This step might take
a few minutes.
1. Click on the resource to access its properties. In particular, you will need the application ID.
You can find this ID in the `Overview` tab (accessible from the list on the left of the page).
1. Create an environment variable called `HIML_SERVICE_PRINCIPAL_ID`, and set its value to the application ID you
just saw.
1. You need to create an application secret to access the resources managed by this service principal.
On the pane on the left, find `Certificates & Secrets`. Click on `+ New client secret` (bottom of the page) and note down your token.
Warning: this token is only displayed once, when it is created; you will not be able to display it again later.
1. Create an environment variable called `HIML_SERVICE_PRINCIPAL_PASSWORD`, and set its value to the token you just
added.
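
A minimal sketch, assuming you prefer to set both variables from Python rather than in your shell (the values shown are placeholders, not real credentials):

```python
import os

# Placeholder values: substitute the application ID and the client secret
# noted down when registering the Service Principal.
os.environ["HIML_SERVICE_PRINCIPAL_ID"] = "00000000-0000-0000-0000-000000000000"
os.environ["HIML_SERVICE_PRINCIPAL_PASSWORD"] = "<your-client-secret>"
```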

### Providing permissions to the Service Principal
Now that your service principal is created, you need to give permission for it to access and manage your AzureML workspace.
To do so:
1. Go to your AzureML workspace. To find it you can type the name of your workspace in the search bar above.
1. On the `Overview` page, there is a link to the Resource Group that contains the workspace. Click on that.
1. When on the Resource Group, navigate to `Access control`. Then click on `+ Add` > `Add role assignment`. A pane will appear on
the right. Select `Role > Contributor`. In the `Select` field, type the name
of your Service Principal and select it. Finish by clicking `Save` at the bottom of the pane.


### Azure Tenant ID
The last remaining piece is the Azure tenant ID, which also needs to be available in an environment variable. To get
that ID:
28 changes: 14 additions & 14 deletions docs/source/commandline_tools.md
@@ -6,8 +6,8 @@ From the command line, run the command

```himl-tb```

specifying one of
`[--experiment] [--latest_run_file] [--run]`

This will start a TensorBoard session, by default running on port 6006. To use an alternative port, specify this with `--port`.

@@ -21,16 +21,16 @@ If you choose to specify `--experiment`, you can also specify `--num_runs` to vi

If your AML config path is not `ROOT_DIR/config.json`, you must also specify `--config_file`.
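
For example, a hypothetical invocation that shows the two most recent runs of an experiment on a non-default port (the experiment name is a placeholder):

```
himl-tb --experiment my_experiment --num_runs 2 --port 6007
```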

To see an example of how to create TensorBoard logs using PyTorch on AML, see the
[AML submitting script](examples/9/aml_sample.rst) which submits the following [pytorch sample script](examples/9/pytorch_sample.rst). Note that to run this, you'll need to create an environment with at least pytorch and tensorboard as dependencies. See an [example conda environment](examples/9/tensorboard_env.rst). This will create an experiment named 'tensorboard_test' on your Workspace, with a single run. Go to outputs + logs -> outputs to see the tensorboard events file.

## Download files from AML Runs

From the command line, run the command

```himl-download```

specifying one of
`[--experiment] [--latest_run_file] [--run]`

If `--experiment` is provided, the most recent Run from this experiment will be downloaded.
If `--latest_run_file` is provided, the script will expect to find a RunId in this file.
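
For example, a hypothetical invocation that downloads the most recent run of an experiment (the experiment name is a placeholder):

```
himl-download --experiment my_experiment
```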
@@ -46,29 +46,29 @@ If your AML config path is not `ROOT_DIR/config.json`, you must also specify `--
## Creating your own command line tools

When creating your own command line tools that interact with the Azure ML ecosystem, you may wish to use the
`AmlRunScriptConfig` class for argument parsing. This gives you a quick way to accept command line arguments that
specify the following:

- experiment: a string representing the name of an Experiment, from which to retrieve AML runs
- tags: to filter the runs within the given experiment
- num_runs: to define the number of most recent runs to return from the experiment
- run: to instead define one or more run ids from which to retrieve runs (also supports the older format of run recovery ids, although these are now obsolete)
- latest_run_file: to instead provide a path to a file containing the id of your latest run, for retrieval.
- config_path: to specify a config.json file in which your workspace settings are defined

You can extend this list of arguments by creating a child class that inherits from `AmlRunScriptConfig`, as sketched below.
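
A minimal sketch of such a child class, assuming `AmlRunScriptConfig` can be imported from `health_azure` and follows the `param`-based declaration style used elsewhere in this package (both details are assumptions):

```python
import param

from health_azure import AmlRunScriptConfig  # import path assumed


class DownloadScriptConfig(AmlRunScriptConfig):
    # Hypothetical extra argument: the folder into which files are downloaded.
    output_folder = param.String(default="outputs", doc="Folder to download files into")
```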

### Defining your own argument types

Additional arguments can have any of the following types: `bool`, `integer`, `float`, `string`, `list`, `class/class instance`,
with no additional work required. You can also define your own custom type by providing a custom class in your code that
inherits from `CustomTypeParam`. It must define 2 methods:
1. `_validate(self, x: Any)`: which should raise a `ValueError` if x is not of the type you expect, and should also make a call to
`super()._validate(val)`
2. `from_string(self, y: string)` which takes in the command line arg as a string (`y`) and returns an instance of the type
that you want. For example, if your custom type is a tuple, this method should create a tuple from the input string and return that.
An example of a custom type can be seen in our own `RunIdOrListParam`, which accepts a string representing one or more
run ids (or run recovery ids) and returns either a List or a single RunId object (or a RunRecoveryId object where appropriate).

### Example:
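
A minimal sketch of a custom type, assuming `CustomTypeParam` can be imported from `health_azure.utils` (import path assumed) and that subclasses implement the two methods described above:

```python
from typing import Any

from health_azure.utils import CustomTypeParam  # import path assumed


class EvenIntParam(CustomTypeParam):
    """Hypothetical custom type that accepts only even integers."""

    def _validate(self, val: Any) -> None:
        if not isinstance(val, int) or val % 2 != 0:
            raise ValueError(f"{val} is not an even integer")
        super()._validate(val)

    def from_string(self, x: str) -> int:
        # Convert the raw command line string into the target type.
        return int(x)
```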

30 changes: 15 additions & 15 deletions docs/source/datasets.md
@@ -11,12 +11,12 @@ to one dataset.


### AzureML Data Stores
Secondly, there are data stores. This is a concept coming from Azure Machine Learning, described
[here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data). Data stores provide access to
one blob storage account. They exist so that the credentials to access blob storage do not have to be passed around
in the code - rather, the credentials are stored in the data store once and for all.

You can view all data stores in your AzureML workspace by clicking on one of the bottom icons in the left-hand
navigation bar of the AzureML studio.

One of these data stores is designated as the default data store.
@@ -27,11 +27,11 @@ Thirdly, there are datasets. Again, this is a concept coming from Azure Machine
* A data store
* A set of files accessed through that data store

You can view all datasets in your AzureML workspace by clicking on one of the icons in the left-hand
navigation bar of the AzureML studio.

### Preparing data
To simplify usage, the `hi-ml` package creates AzureML datasets for you. All you need to do is to
* Create a blob storage account for your data, and within it, a container for your data.
* Create a data store that points to that storage account, and store the credentials for the blob storage account in it.

@@ -54,7 +54,7 @@ What will happen under the hood?
is no dataset of that name, it will create one from all the files in blob storage in folder "my_folder". The dataset
will be created using the data store provided, "my_datastore".
* Once the script runs in AzureML, it will download the dataset "my_folder" to a temporary folder.
* You can access this temporary location by `run_info.input_datasets[0]`, and read the files from it.
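
Putting the bullet points above together, a minimal sketch of the submission call, assuming the datastore is supplied via a `default_datastore` argument (the dataset and datastore names are the placeholders used above; other arguments are elided):

```python
from health_azure import submit_to_azure_if_needed

run_info = submit_to_azure_if_needed(...,  # other submission arguments elided
                                     input_datasets=["my_folder"],
                                     default_datastore="my_datastore")
# Inside the AzureML run, this points to the temporary folder holding the dataset files.
input_folder = run_info.input_datasets[0]
```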

More complicated setups are described below.

@@ -77,19 +77,19 @@ output_folder = run_info.output_datasets[0]
Your script can now read files from `input_folder`, transform them, and write them to `output_folder`. The latter
will be a folder on the temp file system of the machine. At the end of the script, the contents of that temp folder
will be uploaded to blob storage, and registered as a dataset.

### Mounting and downloading
An input dataset can be downloaded before the start of the actual script run, or it can be mounted. When mounted,
the files are accessed via the network when needed - this is very helpful for large datasets, where downloading would
create a long wait before the job starts.

Similarly, an output dataset can be uploaded at the end of the script, or it can be mounted. Mounting here means that
all files will be written to blob storage while the script runs (rather than at the end).

Note: If you are using mounted output datasets, you should NOT rename files in the output folder.

Mounting and downloading can be triggered by passing in `DatasetConfig` objects for the `input_datasets` argument,
like this:

@@ -105,14 +105,14 @@ output_folder = run_info.output_datasets[0]

### Local execution
For debugging, it is essential to have the ability to run a script on a local machine, outside of AzureML.
Clearly, your script needs to be able to access data in those runs too.

There are two ways of achieving that: Firstly, you can specify an equivalent local folder in the
`DatasetConfig` objects:
```python
from pathlib import Path
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
datastore="my_datastore",
local_folder=Path("/datasets/my_folder_local"))
run_info = submit_to_azure_if_needed(...,
```

@@ -134,8 +134,8 @@ AzureML has the capability to download/mount a dataset to such a fixed location.
trigger that behaviour via an additional option in the `DatasetConfig` objects:
```python
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
datastore="my_datastore",
use_mounting=True,
target_folder="/tmp/mnist")
run_info = submit_to_azure_if_needed(...,
```

@@ -147,12 +147,12 @@ input_folder = run_info.input_datasets[0]
This is also true when running locally - if `local_folder` is not specified and an AzureML workspace can be found, then the dataset will be downloaded or mounted to the `target_folder`.

### Dataset versions
AzureML datasets can have versions, starting at 1. You can view the different versions of a dataset in the AzureML
workspace. In the `hi-ml` toolbox, you would always use the latest version of a dataset unless specified otherwise.
If you do need a specific version, use the `version` argument in the `DatasetConfig` objects:
```python
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
datastore="my_datastore",
version=7)
run_info = submit_to_azure_if_needed(...,
```