# Part 4 - General Helper Functions

In [1]:
from mdf_forge.forge import Forge

In [2]:
mdf = Forge()

## Generally Useful Help

### current_query
You can see the query you're currently building with `current_query()`.

Note that your query may be enclosed in parentheses automatically. This does not alter the results of the query.

In [3]:
mdf.match_field("mdf.source_name", "oqmd")
mdf.current_query()

'(mdf.source_name:oqmd)'

### reset_query
If you have a query in memory that you don't want, you can use `reset_query()` to start a new query. This method will clear the current query entirely.

In [4]:
mdf.reset_query()

In [5]:
mdf.current_query()

''

### Query info
You can build a query with `exclude_field()` and `match_field()` and execute it with `search()`. To learn more about the query, including the actual query string that was sent, pass `info=True` to `search()`.

In [6]:
mdf.exclude_field("mdf.source_name", "sluschi").match_field("material.elements", "Al").exclude_field("mdf.source_name", "oqmd")
res, info = mdf.search(limit=10, info=True)

When you use the `info=True` argument, `search()` will return a tuple instead of a list. The first element in the tuple will be the same list of results you're used to, but the second tuple element will be a dictionary of query info.

In [7]:
res[0]

{'cip': {'bv': '79.0',
  'energy': '-3.36',
  'forcefield': 'Al99.eam.alloy',
  'gv': '29.4',
  'mpid': 'mp-134',
  'totenergy': '-107.52'},
 'files': [{'data_type': 'ASCII text, with very long lines, with no line terminators',
   'filename': 'classical_interatomic_potentials.json',
   'globus': 'globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/MDF/mdf_connect/prod/data/cip_v1/classical_interatomic_potentials.json',
   'length': 1841203,
   'mime_type': 'text/plain',
   'sha512': '96635ee0c15d1d0187b18805653a02b1a6dfa5648db82153467045de18adcc08c753e2897d2b48a78a2167a442219e9aeff6b1103732c2158facac8fa4911b33',
   'url': 'https://e38ee745-6d04-11e5-ba46-22000b92c6ec.e.globus.org/MDF/mdf_connect/prod/data/cip_v1/classical_interatomic_potentials.json'}],
 'material': {'composition': 'Al32', 'elements': ['Al']},
 'mdf': {'ingest_date': '2018-10-29T17:47:57.468388Z',
  'mdf_id': '5bd747cf2ef3880b0f213904',
  'parent_id': '5bd747cd2ef3880b0f2135d1',
  'resource_type': 'record',
  'scroll_id': 81

In [8]:
info

{'advanced': True,
 'errors': [],
 'index_uuid': '1a57bbe5-5272-477f-9d31-343b8258b7a5',
 'limit': 10,
 'query': '( NOT mdf.source_name:sluschi AND material.elements:Al AND  NOT mdf.source_name:oqmd)',
 'retries': 0,
 'total_query_matches': 14885}
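
The query info is an ordinary dictionary, so it can be inspected with plain Python. The sketch below copies the `info` output shown above into a literal (so it runs without a connection) and condenses it with a hypothetical helper, `summarize_info()`, which is not part of Forge. It assumes that the number of returned results is at most the smaller of `limit` and `total_query_matches`.

```python
# Query info copied from the output above; in a live session this would be
# the second element returned by mdf.search(limit=10, info=True).
info = {
    "advanced": True,
    "errors": [],
    "index_uuid": "1a57bbe5-5272-477f-9d31-343b8258b7a5",
    "limit": 10,
    "query": "( NOT mdf.source_name:sluschi AND material.elements:Al AND  NOT mdf.source_name:oqmd)",
    "retries": 0,
    "total_query_matches": 14885,
}


def summarize_info(info):
    """Hypothetical helper (not a Forge method): one-line summary of a query."""
    # Assumption: a search returns no more than min(limit, total matches).
    returned = min(info["limit"], info["total_query_matches"])
    status = "ok" if not info["errors"] else f"{len(info['errors'])} error(s)"
    return f"up to {returned} of {info['total_query_matches']} matches returned ({status})"


summary = summarize_info(info)
print(summary)
```

Checking `info["errors"]` and `info["total_query_matches"]` in this way is a cheap sanity check before working with `res`.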

### Repeat a query
By default, `search()` clears the query from memory once it has run. Pass `reset_query=False` to keep the query, so that the same query can be run again.

In [9]:
mdf.match_field("mdf.source_name", "nist_xps_db")

<mdf_forge.forge.Forge at 0x7f181d77e5f8>

In [10]:
res, info = mdf.search(limit=10, info=True, reset_query=False)
info["query"]

'(mdf.source_name:nist_xps_db)'

In [11]:
res, info = mdf.search(limit=10, info=True)
info["query"]

'(mdf.source_name:nist_xps_db)'
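
To make the lifecycle of the in-memory query concrete, here is a toy stand-in written in plain Python. `ToyQuery` is hypothetical and is not how Forge is implemented; it only mimics the behavior described in this section: chained calls add terms, `current_query()` shows them, and `search()` clears them unless `reset_query=False` is given. Its `search()` returns the query string instead of contacting any index.

```python
class ToyQuery:
    """Toy model of query-building state; NOT Forge's implementation."""

    def __init__(self):
        self._terms = []

    def match_field(self, field, value):
        self._terms.append(f"{field}:{value}")
        return self  # returning self is what makes chaining possible

    def exclude_field(self, field, value):
        self._terms.append(f"NOT {field}:{value}")
        return self

    def current_query(self):
        # An empty query is shown as an empty string.
        return "(" + " AND ".join(self._terms) + ")" if self._terms else ""

    def reset_query(self):
        self._terms = []

    def search(self, reset_query=True):
        # No network call here: just report the query that would be sent.
        query = self.current_query()
        if reset_query:
            self.reset_query()
        return query


toy = ToyQuery()
toy.match_field("mdf.source_name", "nist_xps_db")
first = toy.search(reset_query=False)  # query is kept in memory
second = toy.search()                  # same query runs again, then is cleared
after = toy.current_query()
print(first, second, repr(after))
```

In this model, as in the cells above, the second search sees the same query as the first, and only afterwards is the query empty.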

### show_fields
How do you know which fields you can search on? Use `show_fields()` to find out. Called with no arguments, `show_fields()` lists every field in every top-level block (such as "mdf"), along with its datatype.

In [12]:
mdf.show_fields()

{'calphad.phases': 'text',
 'cip.bv': 'text',
 'cip.energy': 'text',
 'cip.forcefield': 'text',
 'cip.gv': 'text',
 'cip.mpid': 'text',
 'cip.totenergy': 'text',
 'crystal_structure.cross_reference.icsd': 'long',
 'crystal_structure.number_of_atoms': 'float',
 'crystal_structure.space_group_number': 'long',
 'crystal_structure.stoichiometry': 'text',
 'crystal_structure.volume': 'float',
 'custom.atom_fractions': 'text',
 'custom.plate_id': 'text',
 'custom.sample_id': 'text',
 'data.endpoint_path': 'text',
 'data.link': 'text',
 'dc.alternateIdentifiers.alternateIdentifier': 'text',
 'dc.alternateIdentifiers.alternateIdentifierType': 'text',
 'dc.contributors.affiliations': 'text',
 'dc.contributors.contributorName': 'text',
 'dc.contributors.contributorType': 'text',
 'dc.contributors.familyName': 'text',
 'dc.contributors.givenName': 'text',
 'dc.creators.affiliations': 'text',
 'dc.creators.creatorName': 'text',
 'dc.creators.familyName': 'text',
 'dc.creators.givenName': 'text',
 

If you give `show_fields()` a top-level block, it will show you the mapping for that block, including the expected datatypes.

In [13]:
mdf.show_fields("mdf")

{'mdf.ingest_date': 'date',
 'mdf.mdf_id': 'text',
 'mdf.parent_id': 'text',
 'mdf.repositories': 'text',
 'mdf.resource_type': 'text',
 'mdf.scroll_id': 'long',
 'mdf.source_id': 'text',
 'mdf.source_name': 'text',
 'mdf.version': 'long'}
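
Because the mapping is a plain dictionary of field names to datatypes, it is easy to filter. The sketch below copies the `show_fields("mdf")` output above into a literal and uses a hypothetical helper, `fields_of_type()` (not part of Forge), to pick out the numeric fields.

```python
# Mapping copied from the show_fields("mdf") output above.
mdf_fields = {
    "mdf.ingest_date": "date",
    "mdf.mdf_id": "text",
    "mdf.parent_id": "text",
    "mdf.repositories": "text",
    "mdf.resource_type": "text",
    "mdf.scroll_id": "long",
    "mdf.source_id": "text",
    "mdf.source_name": "text",
    "mdf.version": "long",
}


def fields_of_type(mapping, *types):
    """Hypothetical helper: names of fields whose datatype is in `types`."""
    return sorted(name for name, dtype in mapping.items() if dtype in types)


numeric_fields = fields_of_type(mdf_fields, "long", "float")
print(numeric_fields)  # ['mdf.scroll_id', 'mdf.version']
```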

## Fetching Datasets

### fetch_datasets_from_results
This method automatically collects the datasets that the records returned by a search belong to. In other words, if you search for `material.elements:Al` and a _record_ from OQMD is returned, you can pass that record to `fetch_datasets_from_results()` and get the OQMD _dataset_ entry back.

In [14]:
records = mdf.search("dft.converged:true AND mdf.resource_type:record")

In [15]:
res = mdf.fetch_datasets_from_results(records)
res[0]

{'data': {'endpoint_path': 'globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/MDF/mdf_connect/prod/data/mdr_item_772_v1/',
  'link': 'https://www.globus.org/app/transfer?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=/MDF/mdf_connect/prod/data/mdr_item_772_v1/'},
 'dc': {'alternateIdentifiers': [{'alternateIdentifier': 'http://hdl.handle.net/11115/163',
    'alternateIdentifierType': 'Handle'},
   {'alternateIdentifier': '772',
    'alternateIdentifierType': 'NIST DSpace ID'}],
  'creators': [{'creatorName': 'Lewandowski and A. Awadallah, J.J.',
    'familyName': 'Lewandowski and A. Awadallah',
    'givenName': 'J.J.'}],
  'publicationYear': '2013',
  'publisher': 'NIST Materials Data Repository',
  'resourceType': {'resourceType': 'Dataset',
   'resourceTypeGeneral': 'Dataset'},
  'titles': [{'title': 'Hydrostatic Extrusion of Metals and Alloys'}]},
 'mdf': {'ingest_date': '2018-11-15T19:06:23.862425Z',
  'mdf_id': '5bedc3af2ef38842b9953c09',
  'repositories': ['National Insti
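
As a rough illustration of the idea (many records collapse to a few datasets), the sketch below groups records by their `mdf.parent_id`. It assumes that a record's `parent_id` identifies its dataset entry, which this tutorial does not state, and it says nothing about how Forge performs the lookup internally. The records and IDs are made-up placeholders, and `unique_parent_ids()` is a hypothetical helper.

```python
# Placeholder records: the IDs are invented for illustration, and only the
# keys relevant to this sketch are included.
sample_records = [
    {"mdf": {"mdf_id": "rec-1", "parent_id": "ds-A", "resource_type": "record"}},
    {"mdf": {"mdf_id": "rec-2", "parent_id": "ds-B", "resource_type": "record"}},
    {"mdf": {"mdf_id": "rec-3", "parent_id": "ds-A", "resource_type": "record"}},
]


def unique_parent_ids(records):
    """Return each distinct mdf.parent_id once, in first-seen order."""
    seen = {}
    for record in records:
        parent_id = record.get("mdf", {}).get("parent_id")
        if parent_id is not None:
            seen[parent_id] = True  # dicts preserve insertion order
    return list(seen)


parent_ids = unique_parent_ids(sample_records)
print(parent_ids)  # ['ds-A', 'ds-B']
```

Three records yield two parent IDs here, which mirrors why the list returned by `fetch_datasets_from_results()` can be much shorter than the list of records passed in.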

If you don't need to keep the records themselves, you can skip the separate search: called with no arguments, `fetch_datasets_from_results()` executes the query you are currently building and uses those results.

In [16]:
res = mdf.match_field("material.elements", "Al").fetch_datasets_from_results()
res[0]

{'data': {'endpoint_path': 'globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/MDF/mdf_connect/prod/data/khazana_vasp_v4/',
  'link': 'https://www.globus.org/app/transfer?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=/MDF/mdf_connect/prod/data/khazana_vasp_v4/'},
 'dc': {'contributors': [{'affiliations': ['University of Connecticut'],
    'contributorName': 'Ramprasad, Rampi',
    'contributorType': 'ContactPerson',
    'familyName': 'Ramprasad',
    'givenName': 'Rampi'}],
  'creators': [{'affiliations': ['University of Connecticut'],
    'creatorName': 'Ramprasad, Rampi'}],
  'dates': [{'date': '2017-08-04T19:25:05.718973Z', 'dateType': 'Collected'}],
  'descriptions': [{'description': 'A computational materials knowledgebase',
    'descriptionType': 'Other'}],
  'publicationYear': '2016',
  'publisher': 'MDF (placeholder)',
  'resourceType': {'resourceType': 'JSON', 'resourceTypeGeneral': 'Dataset'},
  'subjects': [{'subject': 'DFT'}, {'subject': 'VASP'}],
  'titles': [{'tit

## Aggregations

### aggregate
Queries submitted with `search()` return at most 10,000 results. If this limit is too low, you can use `aggregate()` to retrieve _all_ results from a query, no matter how many. Be careful with this function, because it is easy to retrieve a very large number of results unintentionally. Consider running `search(your_query, limit=0, info=True)` (see above) first to find out how many results you will get.

For this example, we will see how many results the query will retrieve before aggregating.

In [17]:
mdf.match_field("mdf.source_name", "oqmd*").match_field("material.elements", "Pb").exclude_field("material.elements", "Al")
res, info = mdf.search(limit=0, info=True, reset_query=False)
print("Number of results:", info["total_query_matches"])

Number of results: 15057


Assuming we want all of these results, we can use `aggregate()` on the same query.

In [18]:
res = mdf.aggregate()
print("Number of results:", len(res))

Number of results: 15057
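
The count-first pattern can be wrapped in a small guard. The sketch below is plain Python with no Forge calls: `choose_retrieval()` and its `max_acceptable` parameter are hypothetical, and the only number taken from this tutorial is the 10,000-result cap on `search()`. In a live session, the count would come from `info["total_query_matches"]`.

```python
SEARCH_LIMIT = 10_000  # per-search cap stated in the text above


def choose_retrieval(total_matches, max_acceptable=50_000):
    """Hypothetical helper: pick a retrieval method from a match count.

    `max_acceptable` is an arbitrary safety threshold chosen for this sketch.
    """
    if total_matches > max_acceptable:
        raise ValueError(
            f"{total_matches} matches exceeds the safety threshold of {max_acceptable}"
        )
    return "search" if total_matches <= SEARCH_LIMIT else "aggregate"


method = choose_retrieval(15057)  # count taken from the output above
print(method)  # aggregate
```

Raising an error above the threshold forces a deliberate decision before a very large retrieval, instead of silently starting one.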
