Materials Genome Initiative: NLP Informatics

This repo is for a joint project between Jordan Axelrod, Defne Çirci, Logan Cooper, Thomas Lilly, and Shota Miki. It attempts to use NLP techniques to assign sentence-level classifications to materials science papers, with the goal of making the data they contain more accessible to the scientific community.


Due to permissions concerns, the /data directory is empty. Please copy the provided test.tsv and train.tsv into the data directory as shown below.

│   │   test.tsv
│   │   train.tsv
│   │
│   └───raw
│       │   action.txt
│       │   constituent.txt
│       │   null.txt
│       │   property.txt
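
As a quick sanity check on the layout above, the following minimal sketch lists any expected file that is missing. The file names come from the listing. The helper name `missing_data_files` and the assumption that the script is run from the repository root are illustrative and not part of the project.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = [
    "test.tsv",
    "train.tsv",
    "raw/action.txt",
    "raw/constituent.txt",
    "raw/null.txt",
    "raw/property.txt",
]


def missing_data_files(data_dir):
    """Return the expected files that are absent from data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumes the current working directory is the repository root.
    missing = missing_data_files("data")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected data files are present.")
```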

The file can also deterministically turn raw labeled data into a pair of TSV files (one for training, one for testing). This step is not necessary to reproduce the paper's results.
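
To illustrate what a deterministic split of this kind can look like, here is a minimal sketch. It is not the project's actual script. It assumes that each raw file holds one sentence per line, that the label is the file's stem, that the TSV columns are sentence and label, and that an 80/20 split with a fixed seed is acceptable. All function names are hypothetical.

```python
import csv
import random
from pathlib import Path

# Label names taken from the files under data/raw.
LABELS = ["action", "constituent", "null", "property"]


def load_raw(raw_dir):
    """Read (sentence, label) pairs, assuming one sentence per line in each file."""
    rows = []
    for label in LABELS:
        path = Path(raw_dir) / f"{label}.txt"
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                sentence = line.strip()
                if sentence:  # skip blank lines
                    rows.append((sentence, label))
    return rows


def split_rows(rows, test_fraction=0.2, seed=0):
    """Return (train, test); the same input rows always give the same split."""
    # Sort first so the order in which rows were read cannot change the result.
    shuffled = sorted(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_tsv(rows, path):
    """Write rows as tab-separated sentence/label pairs (assumed column layout)."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
```

Calling `split_rows(load_raw("data/raw"))` and passing each half to `write_tsv` would produce the two files. Write them to a scratch location first so the provided train.tsv and test.tsv are not overwritten.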

Adding Dependencies

Dependencies can be installed using conda or a plain Python venv. Python 3.8 was used for this project. You may experience issues with newer versions of Python due to incompatibility with older packages.


You can install the required packages by running conda env create -n pollydarton --file env.yml. If you don't have conda installed, you can follow the instructions here to set it up. The project also requires Gensim version 3.8.1, which conda seems to have trouble installing, so within the new conda env, also run pip3 install gensim==3.8.1.

If you need to add any dependencies, make sure to update the environment.yml file so that everyone has access to them. After adding a dependency with conda install ..., you can do this by running conda env export --from-history > environment.yml and committing the new file to git.

Python venv

First, set the repo as your working directory.

$ cd MaterialsGenomeInitiative

Create a Python venv directory if you don't already have one.

$ python -m venv env

Activate the venv

$ source ./env/bin/activate

Install the necessary packages. The --no-binary argument is required for gensim because the required gensim version does not build with modern versions of clang.

$ pip install --no-binary gensim -r requirements.txt

Install a Jupyter kernel for the env, then select the env kernel when running notebooks.

$ ipython kernel install --user --name=env
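
After either setup path, a short check like the one below can confirm that the interpreter and gensim match the versions named in this README (Python 3.8 and gensim 3.8.1). This is a minimal sketch. The function name `check_environment` is hypothetical, and it only uses the standard library's `importlib.metadata`, which is available from Python 3.8 onward.

```python
import sys
from importlib import metadata


def check_environment(expected_python=(3, 8), expected_gensim="3.8.1"):
    """Return a list of human-readable mismatches (empty if everything matches)."""
    problems = []

    found_python = tuple(sys.version_info[:2])
    if found_python != tuple(expected_python):
        problems.append(
            "Python {}.{} found, {}.{} expected".format(*found_python, *expected_python)
        )

    try:
        found_gensim = metadata.version("gensim")
    except metadata.PackageNotFoundError:
        problems.append("gensim is not installed")
    else:
        if found_gensim != expected_gensim:
            problems.append(f"gensim {found_gensim} found, {expected_gensim} expected")

    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("Warning:", problem)
```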

Running Code

Reproduction Models

The basic reproduction can be done by running python3 from within the /src directory. Running the code for the preprocessing, tuning, and extra data segments requires changing branches.

  • Preprocessing: Run python3 from the preprocessing branch here.
  • Preprocessing + Hyperparameter Tuning: Run python3 from the tuning branch here.
  • Preprocessing + Extra Data: Run python3 from the extra-data branch here.

Testing the rule-based model

The code to test the rule-based model against the original data is in prototyping.ipynb in the section titled "Testing Rule Model". The code to run the rule-based model itself is in


Run all cells starting at "Imports" section in src/roberta.ipynb. Please note that this will take some time if not using a GPU or other high performance compute system.
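
Before starting a long run, it can help to see which device is available. The sketch below assumes the notebook is built on PyTorch, which this README does not state explicitly, and the helper name `describe_device` is hypothetical. It reports the situation without failing when torch is absent.

```python
import importlib.util


def describe_device():
    """Return "cuda", "cpu", or "torch not installed" (assumes a PyTorch-based notebook)."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check works without torch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Device:", describe_device())
```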


To use MatBERT, download these files into a folder and change the paths used by the model and the tokenizer in src/Sci_Bert_Models.ipynb.

$ export MODEL_PATH="Your path"

$ mkdir -p "$MODEL_PATH/matbert-base-cased" "$MODEL_PATH/matbert-base-uncased"

$ curl -# -o $MODEL_PATH/matbert-base-cased/config.json

Run all cells in src/Sci_Bert_Models.ipynb for train and test results.

Random Forest, XGBoost, LSTM

Run all cells in src/preprocessing_additional_models.ipynb. To avoid preprocessing, comment out cells under sections named "Preprocess".

