# Part 4 - General Helper Functions

In [1]:
from mdf_forge.forge import Forge

In [2]:
mdf = Forge()

## Generally Useful Help

### current_query
You can see the query you're currently building with `current_query()`.

In [3]:
mdf.match_field("mdf.source_name", "oqmd")
mdf.current_query()

'(mdf.source_name:oqmd)'

### reset_query
If you have a query in memory that you don't want, you can use `reset_query()` to start a new query. This method will clear the current query entirely.

In [4]:
mdf.reset_query()

In [5]:
mdf.current_query()

''

### Query info
We can build a query using `exclude_field()` and `match_field()` and execute it with `search()`. But if you are interested in knowing more about the query, including the actual query string that was made, you can use the `info=True` argument to `search()`.

In [6]:
mdf.exclude_field("mdf.source_name", "sluschi").match_field("material.elements", "Al").exclude_field("mdf.source_name", "oqmd")
res, info = mdf.search(limit=10, info=True)

When you use the `info=True` argument, `search()` will return a tuple instead of a list. The first element in the tuple will be the same list of results you're used to, but the second tuple element will be a dictionary of query info.

In [7]:
res[0]

{'crystal_structure': {'number_of_atoms': 128,
  'space_group_number': 1,
  'volume': 2103.1050159155607},
 'dft': {'converged': True,
  'cutoff_energy': 520.0,
  'exchange_correlation_functional': 'PAW_PBE'},
 'files': [{'data_type': 'ASCII text',
   'filename': 'OUTCAR_bcc_large',
   'globus': 'globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/MDF/mdf_connect/prod/data/khazana_vasp_v1/OUTCARS/OUTCAR_bcc_large',
   'length': 883373,
   'mime_type': 'text/plain',
   'sha512': 'e91555c23528363ce14413c25927835e16191ee7980685139cff9850d7e1ddd355fb231c0f181ae936a37d5c010389e63b70b760c298368a2399417359edfa9e',
   'url': 'https://e38ee745-6d04-11e5-ba46-22000b92c6ec.e.globus.org/MDF/mdf_connect/prod/data/khazana_vasp_v1/OUTCARS/OUTCAR_bcc_large'}],
 'material': {'composition': 'Al128', 'elements': ['Al']},
 'mdf': {'ingest_date': '2018-03-27T21:23:27.961187Z',
  'mdf_id': '5abab64f34a2263dfa3dbf4a',
  'parent_id': '5abab64f34a2263dfa3dbf49',
  'resource_type': 'record',
  'scroll_id': 1,
  'sour

In [8]:
info

{'index': 'mdf',
 'index_uuid': '1a57bbe5-5272-477f-9d31-343b8258b7a5',
 'query': {'advanced': True,
  'limit': 10,
  'offset': 0,
  'q': '( NOT mdf.source_name:sluschi AND material.elements:Al AND  NOT mdf.source_name:oqmd)'},
 'total_query_matches': 36149}
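
Because `info` is a plain dictionary, you can post-process it like any other Python data. The sketch below relies only on the keys visible in the output above (`query.q`, `query.limit`, and `total_query_matches`). `summarize_query_info` is a hypothetical helper written for this illustration, not part of `mdf_forge`, and the sample dictionary is copied from the output above so the example runs without a network connection.

```python
def summarize_query_info(info):
    """Return a one-line summary of a query-info dict (hypothetical helper)."""
    query = info["query"]
    total = info["total_query_matches"]
    # Report the requested limit next to the total number of matches.
    return "limit {} of {} total matches: {}".format(
        query["limit"], total, query["q"].strip()
    )


# Sample values copied from the `info` output shown above.
sample_info = {
    "index": "mdf",
    "query": {
        "advanced": True,
        "limit": 10,
        "offset": 0,
        "q": "( NOT mdf.source_name:sluschi AND material.elements:Al"
             " AND  NOT mdf.source_name:oqmd)",
    },
    "total_query_matches": 36149,
}

print(summarize_query_info(sample_info))
```

Comparing the limit with `total_query_matches` is a quick way to notice that a search returned only a small slice of what matched.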

### Repeat a query
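By default, `search()` clears the query from memory once it has run. Pass `reset_query=False` to keep the query so that you can run it again. In the cells below, the second `search()` still uses the stored query, and then clears it because `reset_query` is back to its default.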

In [9]:
mdf.match_field("mdf.source_name", "nist_xps_db")

<mdf_forge.forge.Forge at 0x7f50849d1cc0>

In [10]:
res, info = mdf.search(limit=10, info=True, reset_query=False)
info["query"]["q"]

'(mdf.source_name:nist_xps_db)'

In [11]:
res, info = mdf.search(limit=10, info=True)
info["query"]["q"]

'(mdf.source_name:nist_xps_db)'
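
One way to use this is to preview a query before running it for real. The sketch below uses only the `search()` arguments shown above (`limit`, `info`, `reset_query`). `preview_then_search` and `FakeForge` are hypothetical names: `FakeForge` is a small stand-in that imitates only the reset behaviour described in this section, so the example runs offline. It is not how `Forge` is implemented.

```python
def preview_then_search(forge, preview_limit=3, full_limit=10):
    """Run a small preview, then the real search, on the same stored query."""
    # Keep the query in memory so the second call can reuse it.
    preview, info = forge.search(limit=preview_limit, info=True, reset_query=False)
    # Default behaviour: this call clears the stored query afterwards.
    results = forge.search(limit=full_limit)
    return preview, info["total_query_matches"], results


class FakeForge:
    """Offline stand-in for Forge (assumption: mimics only query resetting)."""

    def __init__(self, query, records):
        self.query = query
        self.records = records

    def current_query(self):
        return self.query

    def search(self, limit=10, info=False, reset_query=True):
        results = self.records[:limit]
        meta = {
            "query": {"q": self.query, "limit": limit},
            "total_query_matches": len(self.records),
        }
        if reset_query:
            self.query = ""
        return (results, meta) if info else results


forge = FakeForge("(mdf.source_name:nist_xps_db)", [{"n": i} for i in range(25)])
preview, total, results = preview_then_search(forge)
print(len(preview), total, len(results), repr(forge.current_query()))
```

With a real `Forge` instance you would build the query with `match_field()` first and pass the instance in place of `FakeForge`.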

### show_fields
How do you know what fields there are to search on? Use `show_fields()` to find out. If you just call `show_fields()` by itself, it will show you all of the top-level blocks (such as "mdf").

In [12]:
mdf.show_fields()

{'cip_v1': 'object',
 'cip_v2': 'object',
 'crystal_structure': 'object',
 'dc': 'object',
 'dft': 'object',
 'files': 'object',
 'image': 'object',
 'material': 'object',
 'mdf': 'object',
 'nist_xps_db_v1': 'object',
 'oqmd_v3': 'object'}

If you give `show_fields()` a top-level block, it will show you the mapping for that block, including the expected datatypes.

In [13]:
mdf.show_fields("mdf")

{'mdf.ingest_date': 'date',
 'mdf.mdf_id': 'text',
 'mdf.parent_id': 'text',
 'mdf.resource_type': 'text',
 'mdf.scroll_id': 'long',
 'mdf.source_name': 'text',
 'mdf.version': 'long'}
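
The mapping returned by `show_fields()` is an ordinary dictionary of field names to datatypes, so you can filter it yourself. The sketch below picks out the fields of one datatype. `fields_of_type` is a hypothetical helper, and the sample mapping is copied from the output above rather than fetched live.

```python
def fields_of_type(field_map, datatype):
    """Return the sorted field names whose mapped datatype equals `datatype`."""
    return sorted(name for name, kind in field_map.items() if kind == datatype)


# Mapping copied from the `show_fields("mdf")` output shown above.
mdf_fields = {
    "mdf.ingest_date": "date",
    "mdf.mdf_id": "text",
    "mdf.parent_id": "text",
    "mdf.resource_type": "text",
    "mdf.scroll_id": "long",
    "mdf.source_name": "text",
    "mdf.version": "long",
}

print(fields_of_type(mdf_fields, "text"))
# ['mdf.mdf_id', 'mdf.parent_id', 'mdf.resource_type', 'mdf.source_name']
```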

## Fetching Datasets

### fetch_datasets_from_results
This method collects the datasets that the records returned by a search belong to. In other words, if you search for `material.elements:Al` and a _record_ from OQMD is returned, you can pass that record to `fetch_datasets_from_results()` and get the OQMD _dataset_ entry back.

In [14]:
records = mdf.search("dft.converged:true AND mdf.resource_type:record")

In [15]:
res = mdf.fetch_datasets_from_results(records)
res[0]

{'dc': {'contributors': [{'affiliations': ['University College London'],
    'contributorName': 'Schofield, Steven',
    'contributorType': 'ContactPerson',
    'familyName': 'Schofield',
    'givenName': 'Steven'}],
  'creators': [{'affiliations': ['Curtin University'],
    'creatorName': "O'Donnell, Kane",
    'familyName': "O'Donnell",
    'givenName': 'Kane'},
   {'affiliations': ['The Open University'],
    'creatorName': 'Hedgeland, Holly',
    'familyName': 'Hedgeland',
    'givenName': 'Holly'},
   {'affiliations': ['University College London'],
    'creatorName': 'Moore, Gareth',
    'familyName': 'Moore',
    'givenName': 'Gareth'},
   {'affiliations': ['University College London'],
    'creatorName': 'Suleman, Asif',
    'familyName': 'Suleman',
    'givenName': 'Asif'},
   {'affiliations': ['University College London'],
    'creatorName': 'Siegl, Manuel',
    'familyName': 'Siegl',
    'givenName': 'Manuel'},
   {'affiliations': ['The Australian Synchrotron'],
    'creatorN

If you don't need the records themselves, you can call `fetch_datasets_from_results()` without passing any results. It will execute the current query and use those results instead.

In [16]:
res = mdf.match_field("material.elements", "Al").fetch_datasets_from_results()
res[0]

{'dc': {'contributors': [{'affiliations': ['University of Illinois Urbana-Champaign'],
    'contributorName': 'Schleife, Andre',
    'contributorType': 'ContactPerson',
    'familyName': 'Schleife',
    'givenName': 'Andre'}],
  'creators': [{'affiliations': ['University of Illinois Urbana-Champaign'],
    'creatorName': 'Schleife, Andre',
    'familyName': 'Schleife',
    'givenName': 'Andre'}],
  'dates': [{'date': '2017-10-10T15:04:02.121824Z', 'dateType': 'Collected'}],
  'publicationYear': '2015',
  'publisher': 'MDF (placeholder)',
  'resourceType': {'resourceType': 'JSON', 'resourceTypeGeneral': 'Dataset'},
  'subjects': [{'subject': 'data_link'}],
  'titles': [{'title': 'Schleife 256 Al'}]},
 'mdf': {'ingest_date': '2018-03-28T16:53:29.457529Z',
  'mdf_id': '5abbc88934a2263dfa3e3555',
  'resource_type': 'dataset',
  'scroll_id': 0,
  'source_name': 'schleife_256_al_v1',
  'version': 1}}
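
Dataset entries are nested dictionaries, which can be hard to scan by eye. The sketch below reduces each entry to its source name and first title. It assumes only the keys visible in the output above (`mdf.source_name` and `dc.titles[0].title`). `summarize_datasets` is a hypothetical helper, and the sample entry is a trimmed copy of the output above.

```python
def summarize_datasets(datasets):
    """Return (source_name, title) pairs for a list of dataset entries."""
    summary = []
    for entry in datasets:
        source_name = entry.get("mdf", {}).get("source_name")
        # Fall back to an empty title block if the entry has no titles.
        titles = entry.get("dc", {}).get("titles") or [{}]
        summary.append((source_name, titles[0].get("title")))
    return summary


# Trimmed copy of the dataset entry shown above.
sample_datasets = [
    {
        "dc": {"titles": [{"title": "Schleife 256 Al"}]},
        "mdf": {"resource_type": "dataset", "source_name": "schleife_256_al_v1"},
    }
]

print(summarize_datasets(sample_datasets))
# [('schleife_256_al_v1', 'Schleife 256 Al')]
```

Missing keys yield `None` instead of raising, which is a design choice for this sketch and not something the entries are known to require.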

## Aggregations

### aggregate
Queries submitted with `search()` return at most 10,000 results. If you need more, you can use `aggregate()` to retrieve _all_ results from a query, no matter how many. Be careful with this function: it is easy to retrieve a very large number of results by accident. Consider running `search(your_query, limit=0, info=True)` (see above) first to find out how many results to expect.

For this example, we will see how many results the query will retrieve before aggregating.

In [17]:
mdf.match_field("mdf.source_name", "oqmd*").match_field("material.elements", "Pb").exclude_field("material.elements", "Al")
res, info = mdf.search(limit=0, info=True, reset_query=False)
print("Number of results:", info["total_query_matches"])

Number of results: 23290


Assuming we want all of these results, we can use `aggregate()` on the same query.

In [18]:
res = mdf.aggregate()
print("Number of results:", len(res))

100%|██████████| 23290/23290 [00:59<00:00, 382.96it/s]

Number of results: 23290
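
The count-then-aggregate pattern above can be wrapped in a small guard. The sketch below uses only calls that appear in this notebook (`search(limit=0, info=True, reset_query=False)` and `aggregate()`). `aggregate_if_small`, its `max_results` threshold, and `StubForge` are hypothetical: the stub stands in for a real `Forge` instance so the example runs offline.

```python
def aggregate_if_small(forge, max_results=50000):
    """Aggregate the stored query only if it matches at most `max_results`."""
    # limit=0 fetches no records, only the query info; keep the query stored.
    _, info = forge.search(limit=0, info=True, reset_query=False)
    total = info["total_query_matches"]
    if total > max_results:
        raise ValueError(
            "Query matches {} results, more than the allowed {}".format(total, max_results)
        )
    return forge.aggregate()


class StubForge:
    """Offline stand-in for Forge (assumption: returns canned counts)."""

    def __init__(self, n_matches):
        self.n_matches = n_matches

    def search(self, limit=10, info=False, reset_query=True):
        return [], {"total_query_matches": self.n_matches}

    def aggregate(self):
        return [{"id": i} for i in range(self.n_matches)]


res = aggregate_if_small(StubForge(5), max_results=10)
print("Number of results:", len(res))
```

If the guard raises, the query is still in memory, so you can refine it or clear it with `reset_query()`.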



