Public API for the Materials Scholar database

matscholar logo

matscholar (Materials Scholar) is a Python library for materials-focused natural language processing (NLP). It is maintained by a team of researchers at UC Berkeley and Lawrence Berkeley National Laboratory as part of a project funded by the Toyota Research Institute.

This library provides a Python interface for interacting with the Materials Scholar API, tools for performing basic NLP tasks on scientific text, and example notebooks showing how to use these tools for materials discovery and design.

Documentation for the API can be found in this README, as well as in the Jupyter notebook docs/demo.ipynb. If the notebook fails to render on GitHub, paste its link into nbviewer:

You can find our official support forum here, under the "Matscholar" category:


Installation and usage: the library does not work with Python 3.7 due to dependency problems, so use Python 3.6.

We highly recommend using a conda environment when working with materials scholar tools.

  1. Clone or download this repo
  2. Navigate to the root directory (matscholar)
  3. pip install -r requirements.txt
  4. pip install . or python setup.py install
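The steps above can be run as follows (a sketch assuming conda is available; the environment name and the clone URL placeholder are not part of the official instructions):

```shell
# create and activate an isolated Python 3.6 environment (see note above)
conda create -n matscholar python=3.6
conda activate matscholar

# clone this repo and install from its root directory
git clone <repo-url>
cd matscholar
pip install -r requirements.txt
pip install .
```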

Configuring Your API Key

The Materials Scholar API can only be accessed by providing an API key in the x-api-key request header field. API keys are currently only available to internal collaborators at LBNL, but will become more widely available soon. To request an API key for the Materials Scholar API (or to be added to the waiting list), please contact John Dagdelen at

API Usage

For convenience, the Materials Scholar API can be accessed via a Python wrapper.

Instantiating the Rester

If an API key has already been obtained, the rester is instantiated as follows:

from matscholar.rest import Rester

rester = Rester(api_key="your-api-key", endpoint="")

To avoid passing the API key and endpoint as arguments, set the following environment variables for ease of use: MATSCHOLAR_API_KEY, MATERIALS_SCHOLAR_ENDPOINT.
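For example, in a bash-like shell (the values shown are placeholders for your own key and the endpoint URL supplied with it):

```shell
export MATSCHOLAR_API_KEY="your-api-key"
export MATERIALS_SCHOLAR_ENDPOINT="<endpoint-url>"
```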


The methods of the Rester class can be used to access resources of the Materials Scholar API.

Searching documents

Our corpus of materials science abstracts can be searched by text matching (ElasticSearch) or by filtering on the named entities extracted from each document. Entity-based searches support the following entity types: material, property, application, descriptor, characterization, synthesis, phase.

To get the raw text of abstracts matching a given query:

# text match for "solid oxide fuel cells"
example_text = "solid oxide fuel cells"

# entity filters: include documents mentioning BaZrO3 and nanoparticles; 
# exclude documents mentioning thin films
example_entities = {"material": ["BaZrO3"], "descriptor": ["nanoparticle", "-thin film"]}

docs = rester.search_text_with_ents(text=example_text, filters=example_entities)

This will return a list of dictionaries containing the raw text of each abstract along with associated metadata.

Searching entities

We have extracted materials-science named entities from nearly 3.5 million materials science abstracts. Details on how this was performed can be found in Ref. [1].

The extracted named entities for each document associated with a query are returned by the search_ents method. This method takes as input a dictionary with entity types as keys and a list of entities as values. For example, to find all of the entities that co-occur with the material "GaN":

docs = rester.search_ents(query={"material": ["GaN"]})

This will return a list of dictionaries representing documents matching the query; each dict will contain the DOI as well as each unique entity found in the corresponding abstract.
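As a sketch of working with the result, suppose each returned dict has a doi key plus lists of entities keyed by type (the exact field names may differ from this mock data):

```python
from collections import Counter

# hypothetical response shape: one dict per matching document
docs = [
    {"doi": "10.1000/example.1", "material": ["GaN", "AlN"], "application": ["LED"]},
    {"doi": "10.1000/example.2", "material": ["GaN"], "application": ["laser diode", "LED"]},
]

# count how often each application co-occurs with the queried material
app_counts = Counter(app for doc in docs for app in doc.get("application", []))
print(app_counts.most_common())  # [('LED', 2), ('laser diode', 1)]
```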

A summary of the entities associated with a query can be generated using the get_summary method. To get statistics for entities co-occurring with GaN:

summary = rester.get_summary(query={"material": ["GaN"]})

This will return a dictionary with entity types as keys; the values are lists of the top entities occurring in documents matching the query, where each item is [entity, document count, fraction].
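For illustration, a summary in this shape could be processed as follows (the dictionary below is a mock matching the [entity, document count, fraction] layout described above, not real API output):

```python
# mock summary: entity type -> list of [entity, document count, fraction]
summary = {
    "property": [["band gap", 1200, 0.30], ["luminescence", 800, 0.20]],
    "application": [["LED", 1500, 0.37]],
}

# pull out the most common property co-occurring with the query
top_property, count, fraction = summary["property"][0]
print(top_property, count, fraction)  # band gap 1200 0.3
```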

To perform a fast literature review, the materials_search_ents method may be used. For a chosen application, this will return a list of all materials that co-occur with that application in our corpus. For example, to see which materials co-occur with the word "thermoelectric" in a document:

mat_list = rester.materials_search_ents(["thermoelectric"], elements=["-Pb"], cutoff=None)

The above search will find all materials co-occurring with thermoelectric that do not contain lead. The result will be a list, with each element being a list of [material, co-occurrence count, co-occurrence DOIs].
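The result can then be post-processed in the usual way; for example, sorting a mock result by co-occurrence count (field layout as described above; the materials and counts are placeholders):

```python
# mock result: [material, co-occurrence count, list of DOIs]
mat_list = [
    ["Bi2Te3", 40, ["10.1000/a"]],
    ["SnSe", 75, ["10.1000/b", "10.1000/c"]],
    ["CoSb3", 12, ["10.1000/d"]],
]

# rank candidate materials by how often they co-occur with "thermoelectric"
ranked = sorted(mat_list, key=lambda row: row[1], reverse=True)
print([row[0] for row in ranked])  # ['SnSe', 'Bi2Te3', 'CoSb3']
```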

Word embeddings

The Materials Scholar word embeddings were trained on materials science text using word2vec; details on how the embeddings were trained, and their application to materials discovery, can be found in Ref. [2].

To get the word embedding for a given word,

embedding = rester.get_embedding("photovoltaics")

This will return a dict containing the embedding. The word embedding will be a 200-dimensional array.
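Embeddings retrieved this way can be compared directly. A minimal cosine-similarity helper in pure Python (using short toy vectors in place of the real 200-dimensional arrays):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy 3-d stand-ins for the real 200-d embeddings
emb_a = [1.0, 0.0, 1.0]
emb_b = [1.0, 1.0, 0.0]
print(round(cosine_similarity(emb_a, emb_b), 3))  # 0.5
```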

The rester also has a close_words method (based on cosine similarity of embeddings) which can be used to explore the semantic similarity of materials science terms; this approach can be used to discover materials for a new application (as outlined in the reference above).

To find words with a similar embedding to photovoltaics:

close_words = rester.close_words("photovoltaics", top_k=1000)

This will return the 1000 closest words to photovoltaics. The result will be a dictionary containing the close words and their cosine similarity to the input word.

Named Entity Recognition

In addition to the pre-processed entities present in our corpus, users can perform Named Entity Recognition on any raw materials science text. The details of the model can be found in Ref. [1].

The input should be a list of documents with the text represented as a string:

doc_1 = "The band gap of TiO2 is 3.2 eV. This was measured via photoluminescence."
doc_2 = "We deposit GaN thin films using MOCVD"
docs = [doc_1, doc_2] 
tagged_docs = rester.get_ner_tags(docs, return_type="concatenated")

The argument return_type may be set to iob, concatenated, or normalized; the latter replaces entities with their most frequently occurring synonym. A list of tagged documents will be returned: each document is a list of sentences, and each sentence is a list of (word, tag) pairs.
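A tagged document in the (word, tag) format described above could be consumed as follows (a sketch using mock IOB-style tags; the tag labels "PRO" and "MAT" are illustrative, not the model's actual label set):

```python
# one document -> list of sentences -> list of (word, tag) pairs
tagged_doc = [
    [("The", "O"), ("band", "B-PRO"), ("gap", "I-PRO"),
     ("of", "O"), ("TiO2", "B-MAT"), ("is", "O"),
     ("3.2", "O"), ("eV", "O"), (".", "O")],
]

# stitch multi-token entities back together from B-/I- tags
entities = []
for sentence in tagged_doc:
    current = None
    for word, tag in sentence:
        if tag.startswith("B-"):
            current = [word, tag[2:]]
            entities.append(current)
        elif tag.startswith("I-") and current is not None:
            current[0] += " " + word
        else:
            current = None
print(entities)  # [['band gap', 'PRO'], ['TiO2', 'MAT']]
```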

Abstract Collection Tool and the Matscholar CLI

  1. Clone this repo with git clone
  2. Install matscholar’s dependencies with pip install -r requirements.txt in the repo directory.
  3. Install matscholar with python setup.py install in the repo directory.
  4. Get a text mining API key from Scopus.
  5. Run mscli configure on the command line and enter the following info:
Name: <your full name>
Scopus API Key: <the key you just got from Scopus>
Matscholar Hostname: <mongodb hostname> 
Matscholar Username: <mongodb username>
Matscholar Password: <mongodb password> 
  6. Set up a collection in your MongoDB database called "bulid_log" and fill it with entries like the following for each year/journal combination you'd like to collect from.
year: <year>
issn: <issn>
status: "incomplete"
num_articles: -1
  7. Run mscli contribute to start pulling abstracts from the Scopus API. Use the --count option to set the number of blocks you want to pull (blocks generally have fewer than 1000 abstracts).
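The year/journal log entries described above can be generated programmatically before being inserted into MongoDB; a sketch (the years and ISSNs below are placeholders):

```python
# build one "incomplete" log entry per (year, ISSN) pair
years = [2017, 2018]
issns = ["0000-0001", "0000-0002"]

entries = [
    {"year": year, "issn": issn, "status": "incomplete", "num_articles": -1}
    for year in years
    for issn in issns
]
print(len(entries))  # 4
# these dicts can then be inserted into the collection, e.g. with pymongo's insert_many
```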


If you use any of the API functionality in your research, please cite the following papers where relevant:

[1] L. Weston et al., Submitted to J. Chem. Inf. Model.

[2] V. Tshitoyan et al., Nature 571, 95 (2019).


@jdagdelen, @vtshitoyan, @lweston
