# MDF Connect Client Tutorial

In [54]:
from mdf_connect_client import MDFConnectClient

import time  # 'time' is just for the example submission; it is not needed for regular use of the client.

## Table of contents

- [Overview/instantiation](#MDF-Connect-Client)
- [Mandatory inputs](#Mandatory-inputs)
- [Recommended inputs](#Recommended-inputs)
- [Optional inputs](#Optional-inputs)
- [Advanced inputs](#Advanced-inputs)
- [Submitting a dataset](#Submitting-a-dataset)
- [Checking submission status](#Checking-submission-status)
- [Curating a submission](#Curating-a-submission)

## MDF Connect Client
The MDF Connect Client (`MDFConnectClient`) is a class designed to help you submit datasets to MDF Connect using Python. When you instantiate the Client, it will attempt to authenticate you with Globus automatically. You cannot use MDF Connect anonymously.

Note: While you can access and modify the internal variables of a client (for example, `mdfcc.mdf`), it is recommended that you instead only use the helper functions. This tutorial accesses those variables only for display purposes.

**IMPORTANT**: To submit data to MDF Connect, you must have an account recognized by Globus Auth (including Google, ORCiD, many academic institutions, or a [free Globus ID](https://www.globusid.org/create)). Additionally, you must be in the [MDF Connect](https://www.globus.org/app/groups/cc192dca-3751-11e8-90c1-0a7c735d220a/about) Globus Group.

**Note:** You will see the comment `# NBVAL_SKIP` in some of these examples. It is an administrative flag. It does not affect the content of the example and can be safely ignored.

### MDFCC Constructor (`MDFConnectClient`)
It is recommended that you use the helper functions (detailed below) to construct your MDF Connect submission. However, if you have already assembled all or part of your submission, you can pre-load your client with the appropriate metadata.

Note: If you have the entire submission already prepared, you can skip to [Submitting a dataset](#Submitting-a-dataset) and pass your submission directly to `submit_dataset()`.

#### Optional arguments:
- `test` (boolean): When `True`, enables test mode. When `False`, disables test mode. For more information about test mode, see `set_test()`.

#### Advanced optional arguments (developer use only):
- `service_instance` (string): The instance of the MDF Connect service to use. Normal users should not alter this value.
- `authorizer` (Authorizer): A valid, authenticated Authorizer from the Globus SDK. Normal users do not need to change this value. Accepted Authorizers are:
    - globus_sdk.RefreshTokenAuthorizer
    - globus_sdk.ClientCredentialsAuthorizer
    - globus_sdk.NullAuthorizer (This Authorizer will fail authentication.)

The advanced arguments may cause issues, and are not recommended for normal users.

In [55]:
mdfcc = MDFConnectClient(test=True)

### `logout`
If you want to log out and invalidate your current login, you can call `logout()`. This method clears your current submission and invalidates your current authentication tokens.

Once this method is called, you must create a new MDF Connect Client in order to interact with MDF Connect.

In [56]:
# mdfcc.logout()

## Mandatory inputs

- [`create_dc_block`](#create_dc_block)
- [`add_data_source`](#add_data_source)

These helpers cover the mandatory inputs. You must supply every required argument for them; your submission will be rejected if this information is missing.

### `create_dc_block`
The `dc` (DataCite) block is mandatory for all submissions. `create_dc_block()` helps create the `dc` block for you.

#### Required arguments:
- `title` (string): The title of the dataset.
- `authors` (string or list of strings): The authors of the dataset, in one of these forms:
    - "Givenname Familyname"
    - "Familyname, Givenname"
    - "Familyname; Givenname"

#### Arguments with defaults:
- `publisher` (string): The publisher of this dataset (*not* an associated paper). The default is `"Materials Data Facility"`.
- `publication_year` (integer or string): The year the dataset was published. The default is the current year.
- `resource_type` (string): The type of resource. The default is `"Dataset"`; leave it unchanged unless you know that your submission needs a different value.

#### Optional arguments (not present by default):
- `affiliations` (string or list of strings): The affiliations of the authors, in the same order as the authors. If the number of affiliations differs from the number of authors, all affiliations will be applied to all authors. An author with multiple affiliations can be given a list of them. (See examples below for more details.)
- `description` (string): A description of the dataset.
- `dataset_doi` (string): The DOI for this dataset (*not* an associated paper).
- `related_dois` (string or list of strings): DOIs related to this dataset, such as an associated paper's DOI. This *does not* include a DOI for the dataset itself. 
- `subjects` (string or list of strings): Subjects (in DataCite terminology) or tags related to the dataset.


If you understand the DataCite schema, you can also add other keyword arguments corresponding to DataCite fields. Additional information on DataCite fields is available from the [official DataCite website](https://schema.datacite.org/meta/kernel-4.1/).

You cannot clear the `dc` block. You can overwrite the `dc` block by calling this method again.
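
The affiliation-matching rule described above can be sketched in plain Python. This is only an illustration of the documented behavior, not the client's actual implementation; `expand_affiliations` is a hypothetical helper name that does not exist in the client.

```python
def expand_affiliations(authors, affiliations):
    """Hypothetical helper: return one list of affiliations per author."""
    if isinstance(affiliations, str):
        affiliations = [affiliations]
    if len(affiliations) == len(authors):
        # Same count: match by position; an entry may be a string or a list.
        return [[entry] if isinstance(entry, str) else list(entry)
                for entry in affiliations]
    # Different count: every affiliation applies to every author,
    # so nested lists are rejected in this sketch.
    if any(not isinstance(entry, str) for entry in affiliations):
        raise ValueError("nested lists require one entry per author")
    return [list(affiliations) for _ in authors]


authors = ["Fromnist, Alice", "Fromnist; Bob", "Cathy Multiples"]
print(expand_affiliations(authors, "NIST"))
print(expand_affiliations(authors, ["NIST", "NIST", ["NIST", "UChicago"]]))
```

The first call gives every author the single affiliation, while the second assigns affiliations by position because the counts match.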

In [57]:
# Extra affiliations examples
# Assume we have three authors: Alice, Bob, and Cathy
authors = ["Fromnist, Alice", "Fromnist; Bob", "Cathy Multiples"]

# If all authors are from NIST:
affiliations = "NIST"
# Equivalent to ["NIST", "NIST", "NIST"]

# If all authors are from both NIST and UChicago:
affiliations = ["NIST", "UChicago"]
# Equivalent to [["NIST", "UChicago"], ["NIST", "UChicago"], ["NIST", "UChicago"]]

# If Alice and Bob are from NIST, Cathy is from NIST and UChicago:
affiliations = ["NIST", "NIST", ["NIST", "UChicago"]]
# This is the only way to express these affiliations.

# This is incorrect! There are 4 affiliations for 3 authors, so the whole list
# would be applied to every author, and such a list must not contain nested lists.
affiliations = ["NIST", ["NIST", "UChicago"], "Argonne", "Oak Ridge"]
# Do not use this format; it is incorrect for this number of authors.

In [58]:
mdfcc.create_dc_block(title="Sample Submission for Tutorial",
                      authors=["Foo Smith", "Smith, Bar", "Smith; Baz"],
                      affiliations=["The International Institute of Data"],
                      publisher="The Journal of Datasets",
                      publication_year=2000,
                      description="This is an example submission.",
                      dataset_doi="10.1234/dataset",
                      related_dois=["10.1234/paper1", "10.1234/paper2"],
                      subjects=["Examples", "Datasets"])
mdfcc.dc

{'titles': [{'title': 'Sample Submission for Tutorial'}],
 'creators': [{'creatorName': 'Smith, Foo',
   'familyName': 'Smith',
   'givenName': 'Foo',
   'affiliations': ['The International Institute of Data']},
  {'creatorName': 'Smith, Bar',
   'familyName': 'Smith',
   'givenName': 'Bar',
   'affiliations': ['The International Institute of Data']},
  {'creatorName': 'Baz, Smith;',
   'familyName': 'Baz',
   'givenName': 'Smith;',
   'affiliations': ['The International Institute of Data']}],
 'publisher': 'The Journal of Datasets',
 'publicationYear': '2000',
 'resourceType': {'resourceTypeGeneral': 'Dataset', 'resourceType': 'Dataset'},
 'descriptions': [{'description': 'This is an example submission.',
   'descriptionType': 'Other'}],
 'identifier': {'identifier': '10.1234/dataset', 'identifierType': 'DOI'},
 'relatedIdentifiers': [{'relatedIdentifier': '10.1234/paper1',
   'relatedIdentifierType': 'DOI',
   'relationType': 'IsPartOf'},
  {'relatedIdentifier': '10.1234/paper2',
   

In [59]:
mdfcc.create_dc_block(title="Tutorial Example Two",
                      authors=["Foo Smith", "Smith, Bar"],
                      affiliations=["Foo University", "The Bureau of Bar"])

mdfcc.dc

{'titles': [{'title': 'Tutorial Example Two'}],
 'creators': [{'creatorName': 'Smith, Foo',
   'familyName': 'Smith',
   'givenName': 'Foo',
   'affiliations': ['Foo University']},
  {'creatorName': 'Smith, Bar',
   'familyName': 'Smith',
   'givenName': 'Bar',
   'affiliations': ['The Bureau of Bar']}],
 'publisher': 'Materials Data Facility',
 'publicationYear': '2020',
 'resourceType': {'resourceTypeGeneral': 'Dataset', 'resourceType': 'Dataset'}}

### `add_data_source`
Some kind of data is mandatory for all submissions. `add_data_source()` will add a data location to your dataset. This action is cumulative, so each call adds more data. Subsequent calls do not overwrite.

#### Required arguments:
- `data_source` (string): The location of the data.

You can add data located at a Globus endpoint, HTTP(S) link, or on Google Drive. MDF Connect will extract data located in archives, including zip files.

- Globus endpoint: `globus://endpoint_id/path/to/data` or you can copy the "Get link" link from the Globus Web App. Your Globus account must have permission to read the data.
- HTTP(S): Copy the link to the data file (NOT a landing page) exactly. The data must be accessible without authentication.
- Google Drive: `googledrive:///path/from/shared/location`. You must share the data with materialsdatafacility@gmail.com.

To clear all data from the submission, call `clear_data_sources()`.
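
To make the three location formats concrete, here is a minimal sketch that classifies a data source string using only the standard library. It illustrates the URI shapes listed above and is not how MDF Connect itself parses locations; `describe_data_source` is a hypothetical name, and treating any HTTP(S) link with an `origin_id` query parameter as a Globus Web App link is an assumption based on the example link used later in this tutorial.

```python
from urllib.parse import urlparse, parse_qs


def describe_data_source(location):
    """Hypothetical helper: report which kind of location a string describes."""
    parsed = urlparse(location)
    if parsed.scheme == "globus":
        return {"kind": "globus", "endpoint": parsed.netloc, "path": parsed.path}
    if parsed.scheme == "googledrive":
        return {"kind": "googledrive", "path": parsed.path}
    if parsed.scheme in ("http", "https"):
        query = parse_qs(parsed.query)
        if "origin_id" in query:
            # Assumption: Globus Web App links carry the endpoint in origin_id.
            return {"kind": "globus",
                    "endpoint": query["origin_id"][0],
                    "path": query.get("origin_path", ["/"])[0]}
        return {"kind": "http", "url": location}
    raise ValueError("unrecognized data source: {}".format(location))


print(describe_data_source("globus://1a2b3c/data/"))
print(describe_data_source("googledrive:///mydata.zip"))
```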

In [60]:
mdfcc.add_data_source("https://dl.dropboxusercontent.com/u/12345/abcdef")
mdfcc.add_data_source(["googledrive:///mydata.zip", "globus://1a2b3c/data/"])
mdfcc.data_sources

['https://dl.dropboxusercontent.com/u/12345/abcdef',
 'googledrive:///mydata.zip',
 'globus://1a2b3c/data/']

In [61]:
mdfcc.clear_data_sources()
mdfcc.data_sources

[]

In [62]:
# This is the actual test data, using a Globus Web App link
mdfcc.add_data_source("https://www.globus.org/app/transfer?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=%2Fcitrine_mdf_demo%2Falloy.pbe%2FAlFe%2F")
mdfcc.data_sources

['https://www.globus.org/app/transfer?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=%2Fcitrine_mdf_demo%2Falloy.pbe%2FAlFe%2F']

## Recommended inputs

- [`add_tag`](#add_tag)
- [`add_index`](#add_index)
- [`add_service`](#add_service)
- [`set_test`](#set_test)
- [`add_organization`](#add_organization)

These helpers are for inputs that MDF recommends you provide or consider, but are not required.

### `add_tag`
`add_tag()` will add tags (also known as "subjects" in the DataCite schema, or "keywords") to your dataset.

`add_tag("tag")` is equivalent to `create_dc_block(..., subjects=["tag"])`. This method exists for convenience of managing tags.

#### Required arguments:
- `tag` (string or list of strings): The tag to add.

To clear all of the tags added with `add_tag()`, call `clear_tags()`. Note that `clear_tags()` does not remove tags set through `create_dc_block()`.

In [63]:
mdfcc.add_tag("example")
mdfcc.tags

['example']

In [64]:
mdfcc.clear_tags()
mdfcc.tags

[]

### `add_index`
To extract JSON, CSV, YAML, XML, or Excel files, MDF Connect requires a mapping that translates the file into MDF schema format. `add_index()` will add a mapping for a specific data type to your submission.

#### Required arguments:
- `data_type` (string): The type of data being mapped. Supported types include:
    - `json`
    - `csv`
    - `yaml`
    - `xml`
    - `excel`
    - `filename` (This type is special; see below.)
- `mapping` (dictionary of strings): The mapping of MDF fields to your data type's fields (see below).

#### Arguments with defaults:
- `delimiter` (string): For tabular data (e.g., CSV), the column delimiter. The default is "," (comma).
- `na_values` (string or list of strings): Values to treat as "data missing" entries. The default for tabular data (e.g., CSV) is blank and space, while the default for other types (e.g., JSON) is nothing (no values will be discarded).
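
As a rough picture of what `delimiter` and `na_values` mean for tabular data, the sketch below reads a small semicolon-delimited table with the standard `csv` module and drops cells whose value counts as missing. It is an assumption-laden illustration (the `read_rows` helper and the sample table are made up), not MDF Connect's extraction code.

```python
import csv
import io


def read_rows(text, delimiter=",", na_values=("", " ")):
    """Hypothetical helper: parse delimited text, discarding missing cells."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [
        {column: value for column, value in row.items() if value not in na_values}
        for row in reader
    ]


table = "formula;band_gap\nSi;1.1\nGe; \n"
print(read_rows(table, delimiter=";"))
```

In the second row the `band_gap` cell holds a single space, so it is treated as missing and left out of that row.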

#### About `mapping`:
Mappings must be dictionaries, where the key is the MDF schema field (expressed in dot notation) and the value is the data's field or column name. "Dot notation" means one string that uses a period between dictionary levels. For example, `block.field.subfield` is the dot notation equivalent of `my_dict["block"]["field"]["subfield"]` in Python.
The exception to this is `filename` mapping. To extract data from a file's name, create a regular expression that returns the correct information. The mapping field is still the associated MDF field in dot notation, but the mapping value is the regular expression you created.
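
A `filename` mapping might look like the following sketch. Two things here are assumptions rather than documented behavior: the sketch takes the whole regular-expression match as the extracted value (the text above does not say whether MDF Connect uses the full match or a capture group), and `extract_from_filename` is a hypothetical helper that only imitates the idea locally.

```python
import re


def extract_from_filename(filename, mapping):
    """Hypothetical helper: apply a mapping of MDF field -> regular expression."""
    extracted = {}
    for mdf_field, pattern in mapping.items():
        match = re.search(pattern, filename)
        if match:
            # Assumption: the whole match is the extracted value.
            extracted[mdf_field] = match.group(0)
    return extracted


# The pattern matches the leading run of letters and digits before an underscore.
filename_mapping = {"material.composition": r"^[A-Za-z0-9]+(?=_)"}
print(extract_from_filename("Al2O3_300K.json", filename_mapping))
```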

Fields with missing data will be ignored. If you have multiple schemas for one data type in one dataset, you can combine the mappings safely.
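
The sketch below shows one way to read dot notation and the "missing fields are ignored" rule. It is not the MDF Connect extractor; the helper names and the sample record are hypothetical, and the MDF field names are borrowed from the examples in this tutorial.

```python
def get_by_dot_path(data, dotted):
    """Follow a dotted path through nested dictionaries; None if absent."""
    current = data
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def set_by_dot_path(target, dotted, value):
    """Create nested dictionaries as needed and store value at the dotted path."""
    keys = dotted.split(".")
    for key in keys[:-1]:
        target = target.setdefault(key, {})
    target[keys[-1]] = value


def apply_mapping(record, mapping):
    """Build an MDF-style dictionary, skipping fields the record lacks."""
    result = {}
    for mdf_field, source_field in mapping.items():
        value = get_by_dot_path(record, source_field)
        if value is not None:
            set_by_dot_path(result, mdf_field, value)
    return result


record = {"sample": {"formula": "NaCl"}, "sg": 225}
mapping = {
    "material.composition": "sample.formula",
    "crystal_structure.space_group_number": "sg",
    "dft.converged": "calc.done",  # not present in the record, so it is skipped
}
print(apply_mapping(record, mapping))
```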

Each data type can only have one associated mapping, so multiple calls with the same data type will overwrite. Calls with different data types will not overwrite. To clear all the mappings, call `clear_index()`.

For more information on the MDF schemas, see the [official schema repository](https://github.com/materials-data-facility/data-schemas).

For the following example, assume we're submitting a dataset that contains a JSON file structured like this:
```json
{
    "my_data": {
        "mat": {
            "comp": "H"
        },
        "atom_num": 1
    },
    "space_grp": 10
}
```

In [65]:
# This is the mapping we would use to get the JSON file into MDF format.
mapping = {
    "material.composition": "my_data.mat.comp",
    "crystal_structure.number_of_atoms": "my_data.atom_num",
    "crystal_structure.space_group_number": "space_grp",
    # We could add another field here, if we had multiple JSON schemas.
    "dft.converged": "dft_info.conv"
    # This field would be ignored by MDF Connect in this submission because the field doesn't exist in the data.
}

In [66]:
mdfcc.add_index("json", mapping)
mdfcc.index

{'json': {'mapping': {'material.composition': 'my_data.mat.comp',
   'crystal_structure.number_of_atoms': 'my_data.atom_num',
   'crystal_structure.space_group_number': 'space_grp',
   'dft.converged': 'dft_info.conv'}}}

In [67]:
mdfcc.clear_index()
mdfcc.index

{}

### `add_service`
MDF Connect has integrations to submit data to other community services, as well as additional MDF-related options. To automatically submit your dataset to an integrated service, use `add_service()`.

#### Required arguments:
- `service` (string): One service to push your dataset to. Integrated services include:
    - `mdf_publish`, the MDF publication service with DOI minting
    - `citrine`, industry-partnered machine-learning specialists
    - `mrr`, the NIST Materials Resource Registry

#### Arguments with defaults:
- `parameters` (dictionary): Optional, service-specific parameters. Fields include:
    - For `mdf_publish`:
        - publication_location (string): The Globus Endpoint and path on which to save the published files. The default will publish onto MDF resources.
    - For `citrine`:
        - public (boolean): When `True`, the data will be made public. Otherwise, the data will be inaccessible. The default is `True`.

This action is cumulative, so subsequent calls add more services rather than overwriting previous ones.

To clear all the service selections from your submission, call `clear_services()`.

In [68]:
mdfcc.add_service("citrine")
mdfcc.services

{'citrine': True}

In [69]:
mdfcc.clear_services()
mdfcc.services

{}

### `set_test`
You can use `set_test()` to create a test submission as a dry-run for MDF Connect. The submission will go through the normal processing, but the results will not be submitted to the normal locations. This flag is a great way to tell if your submission will process the way you want it to.

#### Required arguments:
- `test` (boolean): When `True`, enables test mode. When `False`, disables test mode.

#### About test mode:
Test datasets are submitted to test/sandbox/temporary resources instead of "production" resources, including the following. These settings override all other parameters.
- Tests are ingested into the `mdf-test` search index
- Tests are given a sandbox DOI, which is not permanent (if `mdf_publish` is a requested service)
- Tests are not made public on Citrination (if `citrine` is a requested service)
- Tests are given a special `source_id` by prepending `_test_`
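
The `_test_` prefix can be pictured with a small sketch. The `source_name` plus version layout (`name_vX.Y`) is inferred from the example `source_id` values shown in this tutorial, so treat the exact format as an assumption; `build_source_id` is a hypothetical helper, and in practice MDF Connect assigns the `source_id` for you.

```python
def build_source_id(source_name, version, test=False):
    """Hypothetical helper: compose a source_id the way the examples appear."""
    prefix = "_test_" if test else ""
    return "{}{}_v{}".format(prefix, source_name, version)


def is_test_source_id(source_id):
    """Report whether a source_id carries the test prefix."""
    return source_id.startswith("_test_")


source_id = build_source_id("tutorial_submission_1580155419", "1.1", test=True)
print(source_id, is_test_source_id(source_id))
```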

In [70]:
mdfcc.set_test(False)
mdfcc.test

False

In [71]:
mdfcc.set_test(True)
mdfcc.test

True

### `add_organization`
`add_organization()` marks your submission for an organization. This action is cumulative, so each call adds more organizations. Subsequent calls do not overwrite. Organizations may modify the parameters of your submission, such as mandating curation. More information about specific organizations can be found using [MDF Forge](https://mdf-forge.readthedocs.io/en/master/mdf_forge.html#mdf_forge.Forge.describe_organization).

#### Required arguments:
- `organization` (string or list of strings): The organization to add.

Organizations automatically add their parent organizations. Organizations not registered with MDF will be discarded.

To clear your organizations, call `clear_organizations()`.

In [72]:
mdfcc.add_organization("CHiMaD")
mdfcc.mdf

{'organizations': ['CHiMaD']}

In [73]:
mdfcc.clear_organizations()
mdfcc.mdf

{}

## Optional inputs

- [`set_custom_block`](#set_custom_block)
- [`set_custom_descriptions`](#set_custom_descriptions)
- [`set_base_acl`](#set_base_acl)
- [`set_dataset_acl`](#set_dataset_acl)
- [`set_source_name`](#set_source_name)
- [`set_incremental_update`](#set_incremental_update)
- [`add_data_destination`](#add_data_destination)
- [`set_external_uri`](#set_external_uri)
- [`create_mrr_block`](#create_mrr_block)

These helpers are for inputs that are optional and can be skipped if you aren't interested in providing them.

### `set_custom_block`
The `custom` block is an area for you to add your own custom metadata, if it isn't covered by the MDF schema. It can be set by calling `set_custom_block()`.

#### Required arguments:
- `custom_fields` (dictionary): Custom field-value pairs for your dataset.

You are allowed ten keys in your custom dictionary. You may additionally add descriptions of your fields by creating a new field called "\[field\]\_desc" with the string description inside. You can also add descriptions by calling `set_custom_descriptions()`.

Note that, unlike the `index` mappings, you supply the actual values for the dataset-level `custom` block.

Subsequent calls will overwrite your `custom` block. You can clear the `custom` block by passing in an empty dictionary.

In [74]:
custom_values = {
        "quench_method": "water",
        "quench_method_desc": "The method of quenching"
}
mdfcc.set_custom_block(custom_values)
mdfcc.custom

{'quench_method': 'water', 'quench_method_desc': 'The method of quenching'}

In [75]:
mdfcc.set_custom_block({})
mdfcc.custom

{}

### `set_custom_descriptions`
To add descriptions for your `custom` block fields, you can call `set_custom_descriptions()`.

#### Required arguments:
- `custom_descriptions` (dictionary): The custom fields and descriptions. The dictionary fields must be the same as your `custom` block fields, and the values must be their descriptions.

Every field in `custom` can have a description, but descriptions are not allowed without a corresponding field.

Subsequent calls will overwrite the descriptions you provide. To clear descriptions, you have to use `set_custom_block()`.
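
The relationship between fields and their `_desc` companions can be sketched as follows. This mirrors the rules stated above (one optional description per field, no description without a field) but is not the client's code; `merge_descriptions` is a hypothetical name.

```python
def merge_descriptions(custom, descriptions):
    """Hypothetical helper: add "<field>_desc" keys for described fields."""
    unknown = [field for field in descriptions if field not in custom]
    if unknown:
        raise ValueError("no matching custom field for: {}".format(unknown))
    merged = dict(custom)
    for field, text in descriptions.items():
        merged[field + "_desc"] = text
    return merged


print(merge_descriptions({"quench_method": "water"},
                         {"quench_method": "The method of quenching"}))
```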

In [76]:
custom_values = {
        "quench_method": "water"
}
mdfcc.set_custom_block(custom_values)
mdfcc.custom

{'quench_method': 'water'}

In [77]:
custom_desc = {
        "quench_method": "The method of quenching"
}
mdfcc.set_custom_descriptions(custom_desc)
mdfcc.custom

{'quench_method': 'water', 'quench_method_desc': 'The method of quenching'}

In [78]:
mdfcc.set_custom_block({})
mdfcc.custom

{}

### `set_base_acl`
`set_base_acl()` sets the Access Control List for all the data in this submission. Anyone in this list can read the dataset entry, record entries, and files in the dataset.

#### Required arguments:
- `acl` (string or list of strings): The Access Control List. The ACL must contain either the Globus UUIDs of users and/or groups allowed to access the submission, or `"public"` to make the submission open to everyone. The default ACL is `"public"`.

You can reset the ACL to the default with `clear_base_acl()`.

#### *Warning*:
The identities listed in the `base_acl` of your submission can always see your submission, including dataset entry, even if they are not listed in the `dataset_acl`. This means that **if you do not specify a `base_acl`**, because it defaults to `"public"`, **your entire dataset will be public.**

MDF encourages you to make your data public, but if you do not want it public you must specify this value.

In [79]:
mdfcc.set_base_acl(["UUID1", "UUID2"])
mdfcc.mdf

{'acl': ['UUID1', 'UUID2']}

In [80]:
mdfcc.clear_base_acl()
mdfcc.mdf

{}

### `set_dataset_acl`
`set_dataset_acl()` sets the Access Control List for just the dataset entry in MDF Search. Anyone in this list can see that dataset entry but _not_ the record entries or files (unless they are also in the `base_acl`, see above).

#### Required arguments:
- `acl` (string or list of strings): The Access Control List. The ACL must contain either the Globus UUIDs of users and/or groups allowed to access the dataset entry, or `"public"` to make the dataset entry open to everyone. By default, the dataset ACL is empty (which means that the base ACL is the only effective permission list).

You can reset the ACL to the default with `clear_dataset_acl()`.
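
How the two ACLs combine can be summarized in a short sketch. It deliberately ignores Globus group membership and treats each ACL as a plain list of identities, so it is a simplified model of the rules above rather than MDF's access check; both function names are hypothetical.

```python
def can_read_records(identity, base_acl=("public",)):
    """Records and files are governed by the base ACL only."""
    return "public" in base_acl or identity in base_acl


def can_read_dataset_entry(identity, base_acl=("public",), dataset_acl=()):
    """The dataset entry is visible through either ACL."""
    if can_read_records(identity, base_acl):
        return True
    return "public" in dataset_acl or identity in dataset_acl


# A private base ACL with a public dataset ACL: anyone can find the dataset
# entry, but only UUID1 can read the records and files.
print(can_read_dataset_entry("UUID2", base_acl=["UUID1"], dataset_acl=["public"]))
print(can_read_records("UUID2", base_acl=["UUID1"]))
```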

In [81]:
mdfcc.set_dataset_acl(["public"])
mdfcc.dataset_acl

['public']

In [82]:
mdfcc.clear_dataset_acl()
mdfcc.dataset_acl

### `set_source_name`
`set_source_name()` sets the `source_name` of your dataset. By default, the `source_name` is generated based on the title of your dataset (as set in the `dc` block). If your title is long or otherwise unwieldy to type or remember, setting a custom `source_name` can help.

#### Required arguments:
- `source_name` (string): The desired `source_name`, which must be unique for new datasets.

Please note that your source name will be cleaned when submitted to MDF Connect, so the actual source_name may differ from this value. Additionally, the `source_id` (which is the `source_name` plus version) is required to fetch the status of a submission. `check_status()` can handle this for you.

You can reset the `source_name` to the default by calling `clear_source_name()`.

In [83]:
mdfcc.set_source_name("my_foobar_dataset")
mdfcc.mdf

{'source_name': 'my_foobar_dataset'}

In [84]:
mdfcc.clear_source_name()
mdfcc.mdf

{}

In [85]:
# Here we're setting a unique source_name, so the submission will create a new dataset.
mdfcc.set_source_name("tutorial_submission_{}".format(int(time.time())))

### `set_incremental_update`
`set_incremental_update()` makes this submission an incremental update of a previous submission. Incremental updates use the same submission metadata, except for whatever you specify in the new submission.

For example, if you submit an incremental update and only include a `data_source`, the submission will run as if you copied the DC block and other metadata into the submission, but used the new `data_source`. The new submission is processed normally, including data download and metadata extraction. (To update only the dataset metadata, use [`submit_dataset_metadata_update()`](#submit_dataset_metadata_update) when submitting.)

**Note:**
You must still set `update=True` when submitting an incremental update.

#### Required arguments:
- `source_id` (string): The `source_id` of the previous submission to update and resubmit.

You can unmark an incremental update by calling this method with the `source_id` set to `False`.
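
Conceptually, an incremental update behaves like a merge of the previous submission with the new one. The sketch below assumes that each top-level block in the new submission replaces the old block wholesale; whether MDF Connect merges at that level or more deeply is not stated here, so this is an illustration only, and `apply_incremental_update` is a hypothetical helper.

```python
def apply_incremental_update(previous, new):
    """Hypothetical helper: blocks in `new` replace those in `previous`."""
    merged = dict(previous)
    merged.update(new)
    return merged


previous = {
    "dc": {"titles": [{"title": "Tutorial Example Two"}]},
    "data_sources": ["globus://1a2b3c/data/"],
}
new = {"data_sources": ["globus://1a2b3c/new_data/"]}
print(apply_incremental_update(previous, new))
```

Here the DC block carries over unchanged, while the data sources come from the new submission.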

In [86]:
mdfcc.set_incremental_update("old_submission_v4.2")
mdfcc.incremental_update

'old_submission_v4.2'

In [87]:
mdfcc.set_incremental_update(False)
mdfcc.incremental_update

False

### `add_data_destination`
`add_data_destination()` will add secondary storage locations for your dataset. MDF Connect will automatically use Globus Transfer to send your data to all of the data destinations you list. Destinations must be Globus Endpoints and the path must be writable by `mdf_dataset_submission`, which is the MDF Connect Globus account. (This account will show up as `c17f27bb-f200-486a-b785-2a25e82af505@clients.auth.globus.org`.)

#### Required arguments:
- `data_destination` (string or list of strings): The Globus Endpoint for backing up the dataset. The destinations must be formatted as `globus://endpoint_id/path/to/destination` or you can copy the "Get link" link from the Globus Web App.

You can clear all data destinations by calling `clear_data_destinations()`.

In [88]:
mdfcc.add_data_destination("globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/my_data/mdf_submissions/")
mdfcc.data_destinations

['globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/my_data/mdf_submissions/']

In [89]:
mdfcc.clear_data_destinations()
mdfcc.data_destinations

[]

### `set_external_uri`
If your dataset is already hosted at another data repository, you can use `set_external_uri()` to point at it. This link will be added to the dataset entry in MDF Search.

#### Required arguments:
- `uri` (string): The link to the external landing page for this dataset.

You can clear the external URI by calling `clear_external_uri()`.

In [90]:
mdfcc.set_external_uri("https://example.com")
mdfcc.external_uri

'https://example.com'

In [91]:
mdfcc.clear_external_uri()
mdfcc.external_uri

### `create_mrr_block`
`create_mrr_block()` adds data for the NIST Materials Resource Registry into your submission.

Currently, you must build a dictionary with the appropriate fields yourself in order to attach MRR metadata. Helpful arguments, in line with `create_dc_block()`, are intended to be added in the future.

#### Required arguments:
- `mrr_data` (dictionary): The Materials Resource Registry metadata.

You can clear the `mrr` block by passing in an empty dictionary.

In [92]:
mdfcc.create_mrr_block({"dataOrigin": "experiment"})
mdfcc.mrr

{'dataOrigin': 'experiment'}

In [93]:
mdfcc.create_mrr_block({})
mdfcc.mrr

{}

## Advanced inputs

- [`set_passthrough`](#set_passthrough)
- [`set_project_block`](#set_project_block)
- [`set_curation`](#set_curation)
- [`set_extraction_config`](#set_extraction_config)

These helpers are for advanced inputs that most users don't need to worry about.

### `set_passthrough`
`set_passthrough()` sets the pass-through (or no-extract) flag for your submission.

Caution: When the pass-through flag is set, MDF Connect will not extract metadata from your dataset's files, so only high-level dataset metadata will be available in MDF Search. _This flag is only intended for datasets that cannot be extracted._

HTTP(S) data sources are not supported when the pass-through flag is set.

#### Required arguments:
- `passthrough` (boolean): When `False`, the dataset will be processed normally. When `True`, the metadata in the dataset files will not be extracted.

In [94]:
mdfcc.set_passthrough(True)
mdfcc.no_extract

True

In [95]:
mdfcc.set_passthrough(False)
mdfcc.no_extract

False

### `set_project_block`
`set_project_block()` sets project-specific metadata on your dataset entry. The project block is a special area for project-specific metadata and must be registered with the MDF. If you have a project block defined in the MDF schema, you can set that metadata for your dataset entry in this way.

#### Required arguments:
- `project` (string): The name of the project block in MDF.
- `data` (dictionary): The metadata for the project block.

You can clear this block by passing in an empty `data` argument.

In [96]:
mdfcc.set_project_block("example_project", {"field": "value"})
mdfcc.projects

{'example_project': {'field': 'value'}}

In [97]:
mdfcc.set_project_block("example_project", None)
mdfcc.projects

{}

### `set_curation`
To trigger curation of your submission, use `set_curation()`. An approved curator (see below) must accept your submission before it will be indexed in MDF Search (and published with MDF Publish or sent to any other services, if applicable). Normally, this flag is set by an organization's rules, and not by an end-user, but you can set it yourself if you like.

#### About approved curators:
If your organization has set the curation flag, the approved curators are the managers and admins of the organization's permission groups. If you manually set the curation flag, the approved curators are based on your Access Control List (see [`set_base_acl()`](#set_base_acl)); anyone you list directly in your ACL can curate, as well as managers and admins of any group you list.
If you set your ACL to "public" then anyone can curate your submission.

#### Required arguments:
- `curation` (boolean): When `False`, the dataset will be fully processed automatically and not require approval. When `True`, the dataset will go through metadata extraction and then require an approved curator to accept it before ingesting to any service (including MDF Search and MDF Publish).

Remember that your organization (set with [`add_organization()`](#add_organization)) may force curation. Setting the curation flag to `False` does not override your organization's rules.

In [98]:
mdfcc.set_curation(True)
mdfcc.curation

True

In [99]:
mdfcc.set_curation(False)
mdfcc.curation

False

### `set_extraction_config`
`set_extraction_config()` sets advanced configuration parameters for your submission in the `extraction_config` block. These options are intended for advanced users and/or special-case datasets. Most submissions do not need to worry about these parameters.

#### Required arguments:
- `config` (dictionary): The extraction configuration options.

You can clear this block by passing in an empty dictionary.

In [100]:
mdfcc.set_extraction_config({"group_by_dir": True})
mdfcc.extraction_config

{'group_by_dir': True}

In [101]:
mdfcc.set_extraction_config({})
mdfcc.extraction_config

{}

## Submitting a dataset
- [`get_submission`](#get_submission)
- [`reset_submission`](#reset_submission)
- [`submit_dataset`](#submit_dataset)
- [`submit_dataset_metadata_update`](#submit_dataset_metadata_update)

After you have created your submission with the above helpers, you can submit and check your submission with these helpers.

### `get_submission`
`get_submission()` shows you your current submission, as it will be sent to MDF Connect. This method is a great way to check for any errors.

#### Return value:
- A dictionary containing the current submission.

In [102]:
mdfcc.get_submission()

{'dc': {'titles': [{'title': 'Tutorial Example Two'}],
  'creators': [{'creatorName': 'Smith, Foo',
    'familyName': 'Smith',
    'givenName': 'Foo',
    'affiliations': ['Foo University']},
   {'creatorName': 'Smith, Bar',
    'familyName': 'Smith',
    'givenName': 'Bar',
    'affiliations': ['The Bureau of Bar']}],
  'publisher': 'Materials Data Facility',
  'publicationYear': '2020',
  'resourceType': {'resourceTypeGeneral': 'Dataset',
   'resourceType': 'Dataset'}},
 'data_sources': ['https://www.globus.org/app/transfer?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=%2Fcitrine_mdf_demo%2Falloy.pbe%2FAlFe%2F'],
 'test': True,
 'update': False,
 'mdf': {'source_name': 'tutorial_submission_1580155419'}}

### `reset_submission`
If you need to clear away your entire submission, call `reset_submission()`. This is irreversible.

Caution: This method will clear the current `source_id`, which means that you will have to keep track of any previous `source_id`s from other submissions to see their statuses.

In [103]:
# mdfcc.reset_submission()

### `submit_dataset`
`submit_dataset()` will send your dataset to MDF Connect for indexing. You will get back the `source_id` if the submission is successful. The `source_id` is the unique identifier for your specific submission, and can be used to check the status of your submission later. The `source_id` is also saved to the client.

#### Optional arguments:
- `update` (boolean): Set this to `True` to resubmit a dataset that you have already submitted. If this is the first submission, leave it `False`. The default is `False`.
- `submission` (dictionary): If you have assembled your own MDF Connect submission without this client, you can submit it by passing the dictionary in here. By default, the submission made in the client will be used.
- `reset` (boolean): If `True`, the submission will be cleared with `reset_submission()` after the submission attempt. The `test` flag will be preserved. The default is `False`. Caution: This flag will clear your `source_id`, which means that you will have to keep track of it manually to check your submission's status.

#### Return value:
- A dictionary with the following submission information:
    - `success` (boolean): `True` if the submission was successfully sent to MDF Connect. `False` otherwise.
    - `source_id` (string): The `source_id` of the submission, from MDF Connect. This value may be `None` or an old ID if the submission failed.
    - `error` (string): If the submission failed, the reason for failure. If the submission succeeded, this will be `None` instead.
    - `status_code` (integer): The HTTP status code of the response from MDF Connect (for example, `202` in the output below).

In [104]:
# NBVAL_SKIP

mdfcc.submit_dataset()

{'source_id': '_test_tutorial_submission_1580155419_v1.1',
 'success': True,
 'error': None,
 'status_code': 202}
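
In a script, it can help to check the `success` flag before using the `source_id`, because the `source_id` may be `None` or an old ID when the submission fails. The sketch below shows one way to do that. `extract_source_id` is a hypothetical helper, not part of the client, and the dictionary is copied from the example output above rather than from a live submission.

```python
def extract_source_id(result):
    """Return the source_id from a submit_dataset() result.

    Hypothetical helper, not part of MDFConnectClient.
    Raises RuntimeError with the reported error if the submission failed.
    """
    if not result.get("success"):
        raise RuntimeError(f"Submission failed: {result.get('error')}")
    return result["source_id"]


# Result dictionary copied from the example output above.
result = {
    "source_id": "_test_tutorial_submission_1580155419_v1.1",
    "success": True,
    "error": None,
    "status_code": 202,
}
print(extract_source_id(result))  # prints _test_tutorial_submission_1580155419_v1.1
```

With a live client, the same helper could wrap the call directly: `extract_source_id(mdfcc.submit_dataset())`.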

### `submit_dataset_metadata_update`
To submit only updates to an existing submission's dataset entry, use `submit_dataset_metadata_update()`. This includes updates to the DC block, such as the author list. You can create the metadata using the same helpers, but any helpers that set non-dataset metadata (such as [`add_data_source()`](#add_data_source)) will be ignored.

To be clear, this method submits an update to a dataset entry only, and NOT the data or record entries. The submission you are updating must have completed processing successfully at the time you submit the update; you cannot update a submission that is still processing or that failed processing.

#### Required arguments:
- `source_id` (string): The `source_id` of the dataset you wish to update. You must be the owner of the dataset.

#### Optional arguments:
- `metadata_update` (dictionary): If you have assembled the dataset metadata yourself, you can submit it here. This argument supersedes any metadata set through the helper methods. The default is `None`, which uses the metadata assembled with the helpers.
- `reset` (boolean): If `True`, will clear the old metadata from the client. The `test` flag will be preserved. If `False`, the metadata will be preserved. The default is `False`.

In [109]:
# NBVAL_SKIP

update_source_id = "author_updatable_example_submission_v1.1"
new_metadata = {
    "dc": {
        "titles": [{
            "title": "Updated Title"
        }]
    }
}
mdfcc.submit_dataset_metadata_update(update_source_id, metadata_update=new_metadata)

{'success': True, 'error': None, 'status_code': 200}
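
If you build `metadata_update` dictionaries often, a small builder keeps the nesting consistent. This is only a sketch: `build_dc_update` is a hypothetical helper, and the creator layout is an assumption copied from the `get_submission()` output earlier in this tutorial. It does not check what MDF Connect will accept.

```python
def build_dc_update(title=None, creators=None):
    """Build a metadata_update dictionary containing only a DC block.

    Hypothetical helper, not part of MDFConnectClient.
    `creators` is an optional list of (family_name, given_name) pairs.
    """
    dc = {}
    if title is not None:
        dc["titles"] = [{"title": title}]
    if creators:
        dc["creators"] = [
            {
                "creatorName": f"{family}, {given}",
                "familyName": family,
                "givenName": given,
            }
            for family, given in creators
        ]
    return {"dc": dc}


print(build_dc_update(title="Updated Title"))
# prints {'dc': {'titles': [{'title': 'Updated Title'}]}}
```

The result has the same shape as `new_metadata` in the cell above and could be passed as `metadata_update`.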

## Checking submission status
- [`check_status`](#check_status)
- [`check_all_submissions`](#check_all_submissions)

After submitting a dataset to MDF Connect, you can see the status of the submission's processing with these helpers.

### `check_status`
To see the progress your submission is making, use `check_status()`. If you haven't cleared the submission from the client, you can use it without arguments to check the most recent submission status.

#### Optional arguments:
- `source_id` (string): The `source_id` of the submission you want to check on. If you don't supply a `source_id`, the ID of the last submission you made with the client will be used instead (an error will result if you have not submitted a dataset with the client yet and also don't supply an ID).
- `short` (boolean): When `False`, a status summary will be printed for the submission. When `True`, an abbreviated summary containing only the minimum information will be printed (this is useful for checking many submissions at once). The default is `False`.
- `raw` (boolean): When `False`, a nicely-formatted status summary will be printed to standard output. When `True`, the full status result will be returned instead (the full result is not recommended for direct human consumption). The default is `False`.

#### Return value:
- A dictionary containing the full submission status (only when `raw` is `True`).

In [106]:
# NBVAL_SKIP

mdfcc.check_status()


Status of TEST submission _test_tutorial_submission_1580155419_v1.1 (Tutorial Example Two)
Submitted by Jonathon Gaff at 2020-01-27T20:03:52.207133Z

Submission initialization was successful.
Connect data download was successful: 21 files will be grouped and extracted (from 0 archives).
Data transfer to primary destination is in progress.
Metadata extraction has not started yet.
Dataset curation has not started yet.
MDF Search ingestion has not started yet.
Data transfer to secondary destinations has not started yet.
MDF Publish publication has not started yet.
Citrine upload has not started yet.
Materials Resource Registration has not started yet.
Post-processing cleanup has not started yet.

This submission is still processing.



In [107]:
mdfcc.check_status("_test_name_status_checking_example_v1.1")


Status of TEST submission _test_name_status_checking_example_v1.1 (Status Checking Example)
Submitted by Jonathon Gaff at 2019-08-08T20:26:36.263611Z

Submission initialization was successful.
Connect data download was successful: 12 files will be converted (0 archives extracted).
Data transfer to primary destination was successful.
Metadata extraction was successful: 4 records parsed out of 4 groups.
Dataset curation was not requested or required.
MDF Search ingestion was successful.
Data transfer to secondary destinations was not requested or required.
MDF Publish publication was not requested or required.
Citrine upload was not requested or required.
Materials Resource Registration was not requested or required.
Post-processing cleanup was successful.

This submission is no longer processing.



In [108]:
mdfcc.check_status("_test_name_status_checking_example_v1.1", short=True)

_test_name_status_checking_example_v1.1: This submission is no longer processing.
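
To wait for a submission to finish, you can call `check_status()` repeatedly with a pause between calls. The sketch below shows a generic polling loop under stated assumptions. `wait_until_done` is hypothetical, and because this tutorial does not document the layout of the `raw` status dictionary, the test for "still processing" is supplied by the caller. The `"active"` key in the demonstration is invented for the fake data only.

```python
import time


def wait_until_done(fetch_status, is_active, interval=30.0, max_checks=20,
                    sleep=time.sleep):
    """Poll until is_active(status) is False or max_checks is reached.

    Hypothetical helper, not part of MDFConnectClient.
    fetch_status: callable returning the current status.
    is_active: callable deciding whether a status means "still processing".
    Returns the last status fetched.
    """
    status = fetch_status()
    checks = 1
    while is_active(status) and checks < max_checks:
        sleep(interval)
        status = fetch_status()
        checks += 1
    return status


# Offline demonstration with fake statuses; "active" is an invented key.
fake_statuses = iter([{"active": True}, {"active": True}, {"active": False}])
final = wait_until_done(
    fetch_status=lambda: next(fake_statuses),
    is_active=lambda status: status["active"],
    interval=0,
)
print(final)  # prints {'active': False}
```

With a live client, `fetch_status` could be `lambda: mdfcc.check_status(source_id, raw=True)`, with `is_active` written after inspecting the dictionary that call returns.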


### `check_all_submissions`
If you want to see the status of all submissions you've made to MDF Connect, use `check_all_submissions()`. This method is helpful if you forget a submission's `source_id`, or if you have multiple submissions processing at once.

#### Optional arguments:
- `verbose` (boolean): When `False`, a basic summary of your submissions will be printed. When `True`, the full status summary of each submission will be printed, in the same form as `check_status()`. The default is `False`. (This argument has no effect if `raw` is `True`.)
- `active` (boolean): When `False`, all of your submissions will be shown. When `True`, only submissions that are still active will be shown. The default is `False`.
- `raw` (boolean): When `False`, the summary selected by `verbose` will be printed. When `True`, the full status result will be returned instead (the full result is not recommended for direct human consumption). The default is `False`.

#### Return value:
- A dictionary containing the full submission statuses (only when `raw` is `True`).

In [52]:
mdfcc.check_all_submissions(active=True)


_test_tutorial_submission_1579021617_v1.1: Processing - Not started
_test_abrehabiruk_virtual_db_v1.1: Processing - Processing
_test_smaller_example_dataset_submission_v3-2: Processing - Processing
_test_smaller_example_dataset_submission_v3-1: Processing - Processing
_test_name_curation_task_example_v1.4: Processing - Processing
_test_name_curation_rejecting_example_v1.4: Processing - Processing
_test_name_curation_accepting_example_v4.2: Processing - Processing


In [53]:
mdfcc.check_all_submissions(verbose=True, active=True)



Status of TEST submission _test_tutorial_submission_1579021617_v1.1 (Tutorial Example Two)
Submitted by Jonathon Gaff at 2020-01-14T17:07:00.063623Z

Submission initialization has not started yet.
Connect data download has not started yet.
Data transfer to primary destination has not started yet.
Metadata extraction has not started yet.
Dataset curation has not started yet.
MDF Search ingestion has not started yet.
Data transfer to secondary destinations has not started yet.
MDF Publish publication has not started yet.
Citrine upload has not started yet.
Materials Resource Registration has not started yet.
Post-processing cleanup has not started yet.
This submission is still processing.


Status of TEST submission _test_abrehabiruk_virtual_db_v1.1 (Virtual Excited State Reference for the Discovery of Electronic Materials (VERDE Materials DB))
Submitted by Jonathon Gaff at 2019-09-12T18:51:41.927599Z

Submission initialization was successful.
Connect data download was successful: 8417

## Curating a submission
- [`get_curation_task`](#get_curation_task)
- [`get_available_curation_tasks`](#get_available_curation_tasks)
- [`accept_curation_submission`](#accept_curation_submission)
- [`reject_curation_submission`](#reject_curation_submission)

When a submission has the curation flag set through [`set_curation()`](#set_curation) or through an organization's rules (see [`add_organization()`](#add_organization)), the dataset goes through metadata extraction but is temporarily stopped before ingestion to any other service (including MDF Search and MDF Publish, when applicable). An approved curator must review and accept the submission before it can proceed (see below). These helpers allow an approved curator to view and approve or reject submissions waiting for curation.

##### About approved curators:
If the dataset submitter's organization has set the curation flag, the approved curators are the managers and admins of the organization's permission groups. If the curation flag was manually set, the approved curators are based on the Access Control List (see [`set_acl()`](#set_acl)); anyone listed directly in the ACL can curate, as well as managers and admins of any group listed.
If the ACL is "public", anyone can curate the submission, but it will not show up in `get_available_curation_tasks()`.

### `get_curation_task`
To see the details of a specific dataset that needs curation, use `get_curation_task()`. You must have permission to curate any submission you wish to view.

#### Required arguments:
- `source_id` (string): The `source_id` of the submission you want to view. If you don't know the `source_id`, you can use [`get_available_curation_tasks`](#get_available_curation_tasks) or ask the dataset submitter.

#### Optional arguments:
- `summary` (boolean): When `False`, the entire curation task, including the dataset entry and sample records, will be printed. When `True`, only a summary of the curation task will be printed. The default is `False`.
- `raw` (boolean): When `False`, the curation task information selected by `summary` will be printed. When `True`, a dictionary containing the full curation task will be returned, regardless of `summary`. The default is `False`, which is recommended for direct human consumption.

#### Return value:
- A dictionary containing the full curation task (only when `raw` is `True`).

In [54]:
mdfcc.get_curation_task("_test_name_curation_task_example_v1.4")

{
    "allowed_curators": [
        "public"
    ],
    "curation_start_date": "2020-01-14 16:51:04.459251",
    "dataset": {
        "data": {
            "endpoint_path": "globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/MDF/mdf_connect/prod/data/_test_name_curation_task_example_v1.4/",
            "link": "https://app.globus.org/file-manager?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=/MDF/mdf_connect/prod/data/_test_name_curation_task_example_v1.4/"
        },
        "dc": {
            "creators": [
                {
                    "creatorName": "Name",
                    "familyName": "Name",
                    "givenName": ""
                }
            ],
            "publicationYear": "2020",
            "publisher": "Materials Data Facility",
            "resourceType": {
                "resourceType": "Dataset",
                "resourceTypeGeneral": "Dataset"
            },
            "titles": [
                {
                    "title": "Curat

In [55]:
mdfcc.get_curation_task("_test_name_curation_task_example_v1.4", summary=True)

_test_name_curation_task_example_v1.4 by Jonathon Gaff
Waiting since 2020-01-14 16:51:04.459251
4 records were extracted out of 4 groups from 12 files
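
The full task printed above lists `"public"` under `allowed_curators`, which is why that task would not appear in `get_available_curation_tasks()`. A check like the following sketch could flag such tasks. It assumes that the dictionary returned with `raw=True` uses the same `allowed_curators` key as the printed output, which this tutorial does not confirm. `is_publicly_curatable` is a hypothetical helper.

```python
def is_publicly_curatable(task):
    """Return True if "public" is listed among the task's allowed curators.

    Hypothetical helper, not part of MDFConnectClient. Assumes the task
    dictionary has an "allowed_curators" list, as in the printed output.
    """
    return "public" in task.get("allowed_curators", [])


# Hand-written fragment shaped like the printed task above.
sample_task = {"allowed_curators": ["public"]}
print(is_publicly_curatable(sample_task))  # prints True
```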



### `get_available_curation_tasks`
To see all of the submissions you have permission to curate (excluding "public" submissions, which anyone can curate), use `get_available_curation_tasks()`.

#### Optional arguments:
- `summary` (boolean): When `False`, the entirety of each curation task available to you, including the dataset entry and sample records, will be printed. When `True`, only a summary of each curation task will be printed. The default is `True`. The summaries are useful for getting an overview of your tasks, but it is recommended that you then use [`get_curation_task`](#get_curation_task) to view the details.
- `raw` (boolean): When `False`, the curation task information selected by `summary` will be printed. When `True`, a dictionary containing the full curation tasks will be returned, regardless of `summary`. The default is `False`, which is recommended for direct human consumption.

#### Return value:
- A dictionary containing the full curation tasks (only when `raw` is `True`).

**Note:** This helper does not show curation tasks that are "public", which anyone can curate, to minimize irrelevant results.

In [56]:
mdfcc.get_available_curation_tasks()

You have no open curation tasks.


### `accept_curation_submission`
After reviewing a curation task, you can accept the submission with `accept_curation_submission()`. You must have permission to curate any submission you wish to accept.

It is strongly recommended that you view submissions with [`get_curation_task`](#get_curation_task) before accepting or rejecting them.

After a submission is accepted, it will continue processing and will be ingested to MDF Search (and any other applicable services, such as MDF Publish).

#### Required arguments:
- `source_id` (string): The `source_id` of the submission you wish to accept.

#### Optional arguments:
- `reason` (string): The reason for accepting this submission. If a reason is not provided, a generic acceptance message will be used instead.
- `prompt` (boolean): When `True`, you will be prompted to confirm acceptance of the submission with a task summary. When `False`, confirmation will not be required. The default is `True`, and it is recommended to review this summary to avoid errors.
- `raw` (boolean): When `False`, the completion result will be printed. When `True`, a dictionary of the completion result will be returned. The default is `False`.

#### Return value:
- A dictionary containing the completion result (only when `raw` is `True`).

**Note:** This sample curation task was submitted outside of this notebook, to make a simple example of accepting a submission. It cannot be re-accepted.

In [57]:
# NBVAL_SKIP

mdfcc.accept_curation_submission("_test_name_curation_accepting_example_v4.2")

Are you sure you want to accept the following submission?
_test_name_curation_accepting_example_v4.2 by Jonathon Gaff
Waiting since 2020-01-14 16:51:05.598132
4 records were extracted out of 4 groups from 12 files


Confirm accepting submission [yes/no]: yes

What is the reason for accepting this submission?
	This dataset shows substantive results.

Submission accepted with reason: This dataset shows substantive results.


### `reject_curation_submission`
After reviewing a curation task, you can reject the submission with `reject_curation_submission()`. You must have permission to curate any submission you wish to reject.

It is strongly recommended that you view submissions with [`get_curation_task`](#get_curation_task) before accepting or rejecting them.

After a submission is rejected, it will fail and permanently stop processing. It will not be ingested to MDF Search or any other services (such as MDF Publish).

#### Required arguments:
- `source_id` (string): The `source_id` of the submission you wish to reject.

#### Optional arguments:
- `reason` (string): The reason for rejecting this submission. If a reason is not provided, a generic rejection message will be used instead. It is recommended to provide this argument to help the submitter understand why you rejected the submission.
- `prompt` (boolean): When `True`, you will be prompted to confirm rejection of the submission with a task summary. When `False`, confirmation will not be required. The default is `True`, and it is recommended to review this summary to avoid errors.
- `raw` (boolean): When `False`, the completion result will be printed. When `True`, a dictionary of the completion result will be returned. The default is `False`.

#### Return value:
- A dictionary containing the completion result (only when `raw` is `True`).

**Note:** This sample curation task was submitted outside of this notebook, to make a simple example of rejecting a submission. It cannot be re-rejected.

In [58]:
# NBVAL_SKIP

mdfcc.reject_curation_submission("_test_name_curation_rejecting_example_v1.4")

Are you sure you want to reject the following submission?
_test_name_curation_rejecting_example_v1.4 by Jonathon Gaff
Waiting since 2020-01-14 16:50:25.421009
4 records were extracted out of 4 groups from 12 files


Confirm rejecting submission [yes/no]: yes

What is the reason for rejecting this submission?
	This submission misused the analysis technique.

Submission rejected with reason: This submission misused the analysis technique.
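
When curating from a script rather than interactively, the `reason`, `prompt`, and `raw` arguments described above can be combined to skip the confirmation prompt. The sketch below is one way to wrap both methods. `decide_curation` and `FakeClient` are hypothetical: the fake client only stands in for `MDFConnectClient` so the example runs offline, and the dictionaries it returns are invented, not the real completion result. Since this disables the confirmation prompt, it is meant for tasks you have already reviewed with `get_curation_task()`.

```python
def decide_curation(client, source_id, accept, reason):
    """Accept or reject a curation task without the interactive prompt.

    Hypothetical helper, not part of MDFConnectClient. Requires a
    non-empty reason so the submitter can see why the decision was made.
    """
    if not reason or not reason.strip():
        raise ValueError("Please give a reason for this curation decision.")
    if accept:
        method = client.accept_curation_submission
    else:
        method = client.reject_curation_submission
    return method(source_id, reason=reason, prompt=False, raw=True)


class FakeClient:
    """Offline stand-in for MDFConnectClient; return values are invented."""

    def accept_curation_submission(self, source_id, reason=None,
                                   prompt=True, raw=False):
        return {"action": "accept", "source_id": source_id, "reason": reason}

    def reject_curation_submission(self, source_id, reason=None,
                                   prompt=True, raw=False):
        return {"action": "reject", "source_id": source_id, "reason": reason}


outcome = decide_curation(
    FakeClient(),
    "_test_name_curation_rejecting_example_v1.4",
    accept=False,
    reason="This submission misused the analysis technique.",
)
print(outcome["action"])  # prints reject
```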
