# API demo

This short demo covers the main functionality of the matscholar API.

### Instantiate the rester

To request an API key, contact lweston@lbl.gov, jdagdelen@lbl.gov, or amalietrewartha@lbl.gov.

If an API key has already been obtained, the rester is instantiated as follows:

In [8]:
from matscholar.rest import Rester
rester = Rester(api_key="your-api-key", endpoint="http://api.matscholar.com")

### Resources

##### Searching documents

Our corpus of materials science abstracts can be searched by text matching (Elasticsearch) or by filtering on the named entities extracted from each document. Entity-based searches support the following entity types: material, property, application, descriptor, characterization, synthesis, phase. Prefixing an entity with "-" excludes documents that contain it.

To get the raw text of abstracts matching a given query:

In [9]:
# Match abstracts whose text mentions "light-emitting diode"
example_text = "light-emitting diode"
# Include material "GaN" and exclude "InN" (leading "-"); include descriptor "thin film"
example_entities = {"material": ["GaN", "-InN"], "descriptor": ["thin film"]}
docs = rester.search_text_with_ents(text=example_text, filters=example_entities, cutoff=10)

In [10]:
docs[0]

{'abstract': 'In this work, a nano-cavity patterned sapphire substrate (nc-PSS) is fabricated by using a self-formed meshed Pt thin film on a c-plane sapphire substrate. The light output power of a GaN-based light emitting diode on the nc-PSS is 45% greater than that of a control light emitting diode that was prepared on a flat c-plane sapphire substrate (f-SS) wafer. The GaN-based light emitting diode that was prepared on the nc-PSS exhibited much less drooping than a GaN-based light-emitting diode that was prepared on a commercial semi-sphere patterned sapphire substrate (r-PSS), mainly because the voids that formed at the cavities at the GaN/nc-PSS interface buffered the stress in the GaN epi-layers that was imposed by the sapphire substrate.',
 'authors': ['Huang, S.W.',
  'Chang, C.C.',
  'Lin, H.Y.',
  'Li, X.F.',
  'Lin, Y.C.',
  'Liu, C.Y.'],
 'doi': '10.1016/j.tsf.2017.03.015',
 'journal': 'Thin Solid Films',
 'link': 'https://www.sciencedirect.com/science/article/pii/S0040609
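
The filters dictionary above mixes included and excluded entities in one list per entity type. As a minimal sketch, the helper below assembles such a dictionary from separate include and exclude mappings. The name build_entity_filters is hypothetical and not part of the matscholar package; the only convention it relies on is the leading "-" shown in the example.

```python
def build_entity_filters(include=None, exclude=None):
    """Merge include/exclude entity lists into a single filters dict.

    Hypothetical helper (not part of matscholar). Excluded entities are
    written with a leading "-", mirroring the "-InN" entry in the demo.
    """
    filters = {}
    for ent_type, entities in (include or {}).items():
        filters.setdefault(ent_type, []).extend(entities)
    for ent_type, entities in (exclude or {}).items():
        filters.setdefault(ent_type, []).extend("-" + ent for ent in entities)
    return filters


filters = build_entity_filters(
    include={"material": ["GaN"], "descriptor": ["thin film"]},
    exclude={"material": ["InN"]},
)
print(filters)
```

The resulting dictionary has the same shape as example_entities and could be passed as the filters argument in the call shown above.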

##### Searching entities

We have extracted named entities from nearly 3.5 million materials science abstracts.

The extracted named entities for each document associated with a query are returned by the search_ents method. This method takes as input a dictionary with entity types as keys and a list of entities as values. For example, to find all of the entities that co-occur with the query below:

In [15]:
# Get entities for documents mentioning material "BaZrO3" and descriptor "single crystal"
docs = rester.search_ents(query={"material": ["BaZrO3"], "descriptor": ["single crystal"]})
docs[0]

{'doi': '10.1039/C1CC14166J',
 'MAT': ['Ba5O15Ti2Zr3',
  'Ba5O15TiZr4',
  'Ba10O30Ti7Zr3',
  'Ba5O15Ti3Zr2',
  'Ba2O6TiZr',
  'BaO3Ti',
  'BaO3Zr',
  'Ba10O30Ti3Zr7',
  'Ba10O30TiZr9',
  'Ba5O15Ti4Zr',
  'Ba10O30Ti9Zr'],
 'PRO': ['noncentrosymmetric', 'local structural distortion'],
 'APL': [],
 'DSC': ['single crystal',
  'as-prepared nanocrystals',
  'crystalline',
  'nanocrystalline',
  'single',
  'crystals'],
 'CMT': [],
 'SMT': ['vapor diffusion sol – gel'],
 'SPL': []}

This will return a list of dictionaries, one for each document matching the query. Each dictionary contains the DOI and every unique entity found in the corresponding abstract, grouped under entity tags (for example, MAT for materials and DSC for descriptors).
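
Because each result is a plain dictionary, the returned list can be aggregated with the standard library. The sketch below counts how many documents mention each entity under a given tag. It runs on a small hand-made list in the same shape as the output above; the second document and its DOI are placeholders invented for illustration, and count_entities is a hypothetical helper rather than part of the API.

```python
from collections import Counter


def count_entities(docs, tag):
    """Count the number of documents in which each entity of `tag` appears."""
    counts = Counter()
    for doc in docs:
        # set() guards against an entity being listed twice in one document
        counts.update(set(doc.get(tag, [])))
    return counts


# Sample data shaped like the search_ents output; the second entry is made up.
sample_docs = [
    {"doi": "10.1039/C1CC14166J",
     "MAT": ["BaO3Ti", "BaO3Zr"],
     "DSC": ["single crystal", "crystalline"]},
    {"doi": "placeholder-doi",
     "MAT": ["BaO3Zr"],
     "DSC": ["single crystal"]},
]

mat_counts = count_entities(sample_docs, "MAT")
print(mat_counts.most_common(1))
```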

A summary of the entities associated with a query can be generated using the get_summary method. To get statistics for entities co-occurring with GaN:

In [19]:
summary = rester.get_summary(query={"material": ["GaN"]})

This will return a dictionary with entity types as keys. Each value is a list of the top entities that occur in documents matching the query, and each item in that list has the form [entity, document count, fraction].

To show the synthesis methods from the summary:

In [20]:
summary["SMT"][:5]

[['metalorganic chemical vapor deposition', 1187, 0.1140688064578128],
 ['annealing', 970, 0.09321545262348645],
 ['molecular beam epitaxy', 651, 0.06256006150297905],
 ['metalorganic vapor phase epitaxy', 571, 0.05487218912166058],
 ['hydride vapor phase epitaxy', 408, 0.039208149144724196]]
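
Each row has the form [entity, document count, fraction], so it can be turned into a readable report with a few lines of Python. The following is only a sketch: format_summary is a hypothetical helper, and the rows are copied from the output above rather than fetched from the API.

```python
def format_summary(rows, top_n=3):
    """Render [entity, count, fraction] rows as human-readable strings."""
    return [
        f"{entity}: {count} docs ({fraction:.1%})"
        for entity, count, fraction in rows[:top_n]
    ]


# Rows copied from the summary["SMT"] output shown above.
smt_rows = [
    ["metalorganic chemical vapor deposition", 1187, 0.1140688064578128],
    ["annealing", 970, 0.09321545262348645],
    ["molecular beam epitaxy", 651, 0.06256006150297905],
]

report = format_summary(smt_rows)
for line in report:
    print(line)
```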

To perform a fast literature review, use the materials_search_ents method. For a chosen application, it returns a list of all materials that co-occur with that application in our corpus. For example, to see which materials co-occur with the word thermoelectric in a document:

In [13]:
mat_list = rester.materials_search_ents(["thermoelectric"], elements=["-Pb"], cutoff=None)
for mat, counts, dois in mat_list[:20]:
    print(mat, counts, dois[:3])

Bi2Te3 625 ['10.1149/1.1509459', '10.1149/1.2454653', '10.1149/1.3493591']
Sb 389 ['10.1149/1.1545458', '10.1039/C6QI00520A', '10.1039/C7QI00138J']
Si 389 ['10.1149/2.084204jes', '10.1149/2.0031410jss', '10.1149/2.0021710jss']
Te 385 ['10.1149/1.1509459', '10.1149/1.2454653', '10.1149/2.033202jes']
Cu 361 ['10.1149/1.2358840', '10.1039/C6QI00340K', '10.1039/C7QI00146K']
Bi 351 ['10.1149/1.1509459', '10.1149/1.1545458', '10.1149/1.2454653']
Ag 263 ['10.1149/1.1509459', '10.1039/C6QI00162A', '10.1039/C7QI00121E']
Ni 241 ['10.1149/1.2358840', '10.1149/1.3385154', '10.1039/C2CE25119A']
Co 237 ['10.1149/1.3385154', '10.1039/C5EE02979A', '10.1039/C7CP03527F']
Al 228 ['10.1039/C6CE02191C', '10.1039/C1EE02465E', '10.1039/B508998K']
Fe 225 ['10.1039/C5EE02979A', '10.1039/C6CP00819D', '10.1039/C3TC30481G']
Sn 215 ['10.1039/C2CE25119A', '10.1039/C3CE40956B', '10.1039/C2CP43946H']
Ge 202 ['10.1039/C6CE01405D', '10.1039/C5EE02600H', '10.1039/C5CP02174J']
Se 192 ['10.1039/C6QI00340K', '10.1039/C4CE0

The above search finds all materials that co-occur with thermoelectric and do not contain lead. The result is a list in which each element has the form [material, co-occurrence count, co-occurrence DOIs].
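
The returned list mixes compounds such as Bi2Te3 with single elements such as Sb and Si. One way to narrow it down is to keep only entries whose formula contains more than one element. The sketch below does this with a deliberately simple regular expression; element_symbols and compounds_only are hypothetical helpers, and the parsing assumes each material is written as a plain chemical formula (names without capitalised element symbols would be dropped).

```python
import re


def element_symbols(formula):
    """Return the set of element-like symbols (e.g. 'Bi', 'Te') in a formula.

    Simplistic: matches a capital letter optionally followed by a lowercase one.
    """
    return set(re.findall(r"[A-Z][a-z]?", formula))


def compounds_only(rows):
    """Keep rows whose material contains more than one distinct element."""
    return [row for row in rows if len(element_symbols(row[0])) > 1]


# Rows copied from the output above (DOI lists shortened to one entry each).
rows = [
    ["Bi2Te3", 625, ["10.1149/1.1509459"]],
    ["Sb", 389, ["10.1149/1.1545458"]],
    ["Si", 389, ["10.1149/2.084204jes"]],
]

compounds = compounds_only(rows)
print([material for material, _, _ in compounds])
```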

##### Word embeddings

The API provides materials science word embeddings trained using word2vec.

To get the word embedding for a given word:

In [26]:
embedding = rester.get_embedding("photovoltaics")
embedding.keys()

dict_keys(['embeddings', 'original_wordphrases', 'processed_wordphrases'])

This will return a dict containing the embedding. The word embedding will be a 200-dimensional array.

The rester also has a close_words method (based on cosine similarity of embeddings), which can be used to explore the semantic similarity of materials science terms. This approach can be used to discover materials for a new application, as outlined in reference [1].

To find words with an embedding similar to that of photovoltaics:

In [32]:
close_words = rester.close_words("photovoltaics", top_k=1000)
close_words["close_words"][:10]

['photovoltaic_cells',
 'photovoltaic_devices',
 'solar_cells',
 'photovoltaic',
 'photovoltaic_applications',
 'next_-_generation_photovoltaics',
 'optoelectronics',
 'solar_cell',
 'solar_cell_applications',
 'optoelectronic_devices']

This will return the 1000 closest words to photovoltaics. The result will be a dictionary containing the close words and their cosine similarity to the input word.
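
For readers unfamiliar with the ranking criterion, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below implements it in plain Python and ranks a few candidates against a query vector. It illustrates the metric only and is not the server-side implementation; the three-dimensional vectors and word labels are toy values, whereas the real embeddings are 200-dimensional.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidates):
    """Sort (word, similarity) pairs from most to least similar."""
    scored = [(word, cosine_similarity(query_vec, vec))
              for word, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, invented for illustration only.
query = [1.0, 0.0, 0.0]
candidates = {
    "word_a": [2.0, 0.0, 0.0],   # same direction as the query
    "word_b": [1.0, 1.0, 0.0],   # 45 degrees away
    "word_c": [0.0, 1.0, 0.0],   # orthogonal
}

ranked = rank_by_similarity(query, candidates)
print([word for word, _ in ranked])
```

Because the measure depends only on direction, word_a scores highest even though its vector is twice as long as the query.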

##### Named entity recognition

In addition to the pre-processed entities present in our corpus, users can perform named entity recognition (NER) on any raw materials science text.

The input should be a list of documents, each represented as a string:

In [37]:
doc_1 = "The bands gap of TiO2 is 3.2 eV. This was measured via photoluminescence."
doc_2 = "We deposit GaN thin films using MOCVD"
docs = [doc_1, doc_2] 
tagged_docs = rester.get_ner_tags(docs, return_type="concatenated")
tagged_docs[0]

[[['the', 'O'],
  ['bands gap', 'PRO'],
  ['of', 'O'],
  ['O2Ti', 'MAT'],
  ['is', 'O'],
  ['3.2', 'PVL'],
  ['eV', 'PUT'],
  ['.', 'O']],
 [['this', 'O'],
  ['was', 'O'],
  ['measured', 'O'],
  ['via', 'O'],
  ['photoluminescence', 'CMT'],
  ['.', 'O']]]

The argument return_type may be set to iob, concatenated, or normalized. The last option replaces entities with their most frequently occurring synonym. A list of tagged documents is returned: each document is a list of sentences, and each sentence is a list of [word, tag] pairs.
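
Given that nested structure, the tagged entities can be collected with two loops. The sketch below groups the words of one tagged document by tag and skips tokens tagged O. It uses the output shown above as input and assumes the concatenated return type, where multi-word entities already arrive as a single item; entities_by_tag is a hypothetical helper, not an API method.

```python
from collections import defaultdict


def entities_by_tag(tagged_doc):
    """Group entity strings by tag, ignoring tokens tagged 'O' (outside)."""
    found = defaultdict(list)
    for sentence in tagged_doc:
        for word, tag in sentence:
            if tag != "O":
                found[tag].append(word)
    return dict(found)


# Copied from the tagged_docs[0] output above.
tagged_doc = [
    [["the", "O"], ["bands gap", "PRO"], ["of", "O"], ["O2Ti", "MAT"],
     ["is", "O"], ["3.2", "PVL"], ["eV", "PUT"], [".", "O"]],
    [["this", "O"], ["was", "O"], ["measured", "O"], ["via", "O"],
     ["photoluminescence", "CMT"], [".", "O"]],
]

entities = entities_by_tag(tagged_doc)
print(entities)
```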

### References

If you use the matscholar API in your research, please cite the following papers:

[1] V. Tshitoyan et al., Nature 571, 95 (2019). 

[2] L. Weston et al., Submitted to J. Chem. Inf. Model., https://doi.org/10.26434/chemrxiv.8226068.v1