Awesome text mining ⛏️ for materials science

A collection of papers on text mining for materials science. Note: this is a work in progress; I will update this page regularly.

If you find an interesting paper and would like to add it here, please open a pull request.

Tools and codes

Plain text

spaCy: Fast NLP toolkit with pre-built deep learning models for tokenization, NER, POS tagging, dependency parsing, word vectors, etc.
textacy: Pre- and post-processing of text, used in conjunction with spaCy, such as text normalization, garbage text cleaning, and extraction of n-grams, entities, etc.
ChemDataExtractor: A full-fledged toolkit for sentence segmentation, tokenization, chemical NER, and extracting chemical information.
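
As a rough illustration of the kind of pre-processing mentioned for textacy above (text normalization and n-gram extraction), here is a minimal standard-library sketch. It does not use the spaCy or textacy APIs; the function names `normalize` and `ngrams` and the sample sentence are hypothetical and exist only for this example.

```python
import re
import unicodedata


def normalize(text):
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sentence with irregular spacing and a line break.
raw = "The  precursor was\ncalcined at 900 °C  for 12 h."
clean = normalize(raw)
tokens = clean.split()  # naive whitespace tokenization, for illustration only
bigrams = ngrams(tokens, 2)

print(clean)
print(bigrams[:3])
```

A real pipeline would replace the whitespace split with a proper tokenizer, since chemical formulas and units are easy to split incorrectly.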

PDF files

PDFMiner: A pure-Python PDF parser.
textract: A bundle of converters that turn various document formats, including PDF, into plain text.

OCR tools

tesseract: An open-source C++ OCR tool based on LSTM that supports many languages.
Google Cloud OCR: Highly accurate for books, but may recognize chemical/materials science symbols and equations poorly.

Image data extraction

ImageDataExtractor: A Tool To Extract and Quantify Data from Microscopy Images by Mukaddem et al: Extract information from microscopy images. The code homepage is http://www.imagedataextractor.org/.

Datasets/databases

On synthesis

Machine-learned and codified synthesis parameters of oxide materials by Kim et al: Dataset on syntheses of 30 oxide systems extracted from 76K articles.
Text-mined dataset of inorganic materials synthesis recipes by Kononova et al: 20K balanced inorganic synthesis reactions and metadata including experimental conditions extracted from 53K articles.
Annotating and Extracting Synthesis Process of All-Solid-State Batteries from Scientific Literature by Kuniyoshi et al: NER and dependencies annotated/trained on 243 all-solid-state battery articles.
Auto-generated materials database of Curie and Néel temperatures via semi-supervised relationship extraction by Court et al: 40K Curie and Néel temperatures extracted from 68K articles.
An open experimental database for exploring inorganic materials by Zakutayev et al: 140K entries on high throughput experimental materials (HTEM) including synthesis conditions, chemical composition, crystal structure, optoelectronic property measurements, etc.
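
To give a feel for what a text-mined property record can look like, the sketch below pulls (compound, temperature type, value) records out of sentences with a single regular expression. This is a deliberately simplified stand-in and not the semi-supervised relationship extraction method used by Court et al; the pattern, the function name `extract_transition_temperatures`, and the sample sentences are assumptions made up for this example, and the numbers are illustrative only.

```python
import re

# Hypothetical pattern: "<compound> has|shows|exhibits a Curie|Néel temperature of <number> K"
PATTERN = re.compile(
    r"(?P<compound>[A-Z][A-Za-z0-9]*)\s+(?:has|shows|exhibits)\s+a\s+"
    r"(?P<kind>Curie|Néel)\s+temperature\s+of\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*K\b"
)


def extract_transition_temperatures(text):
    """Return one dict per pattern match found in the text."""
    records = []
    for match in PATTERN.finditer(text):
        records.append(
            {
                "compound": match.group("compound"),
                "type": match.group("kind"),
                "value_K": float(match.group("value")),
            }
        )
    return records


# Invented sentences, not taken from any of the datasets listed above.
sample = (
    "Bulk Fe3O4 exhibits a Curie temperature of 858 K. "
    "In contrast, NiO has a Néel temperature of 525 K."
)
records = extract_transition_temperatures(sample)
for record in records:
    print(record)
```

A fixed pattern like this misses most real phrasings, which is the motivation for the learned extraction approaches used in the papers above.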

NLP annotations

The Materials Science Procedural Text Corpus: Annotating Materials Synthesis Procedures with Shallow Semantic Structures by Mysore et al: 230 annotated synthesis procedures: synthesis operations, arguments, and their relations.

NLP pipelines

Pipelines for Procedural Information Extraction from Scientific Literature: Towards Recipes using Machine Learning and Data Science by Yang et al: A computational information and knowledge management (CIKM) system that extracts preconditions, material inputs, operations, and outputs from literature.

Named Entity Recognition

Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature by Weston et al: Extract inorganic material mentions, sample descriptors, phase labels, material properties, applications, and synthesis/characterization methods from article abstracts.
Automated Extraction of Chemical Synthesis Actions from Experimental Procedures by Vaucher et al: Use rule-based and ML (Transformer) models to extract synthesis actions from experimental procedures.
Automatically Extracting Action Graphs from Materials Science Synthesis Procedures by Mysore et al: Extraction of synthesis action graphs by combining LSTM, dilated CNN, CRF, and rule-based heuristics.
Using Natural Language Processing Techniques to Extract Information on the Properties and Functionalities of Energetic Materials from Large Text Corpora by Elton et al: Use GloVe and word2vec embeddings to extract compounds and assign function/property words in energetic materials corpora.

Text classification/categorization

Semi-supervised machine-learning classification of materials synthesis procedures by Huo et al: Identify synthesis paragraphs using LDA and random forest.

Data analysis

Synthesis data analysis/planning

Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning by Kim et al: Analysis of synthesis conditions for titania nanotubes extracted from literature.
Virtual screening of inorganic materials synthesis parameters with deep learning by Kim et al: Use a variational autoencoder to encode text and represent synthesis conditions, especially for SrTiO3, TiO2, and MnO compounds.
Inorganic Materials Synthesis Planning with Literature-Trained Neural Networks by Kim et al: Conditional variational autoencoder learning of synthesis actions and predictions for perovskite compounds.

Chemical knowledge base/graph

Unsupervised word embeddings capture latent knowledge from materials science literature by Tshitoyan et al: Discover hidden chemistry information using unsupervised word embedding methods, with emphasis on thermoelectric materials.
A Relation Aware Search Engine for Materials Science by Shah et al: A search engine that indexes information tuples (object, property, value) from articles and allows relational search. (Their data are deposited at the NIST Materials Data Repository: https://materialsdata.nist.gov/handle/11256/950)
A Bayesian framework for materials knowledge systems by Kalidindi: A Bayesian framework for recommending experimental or simulation parameters using a knowledge base.
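
The idea of indexing (object, property, value) tuples and searching them relationally, as described for the search engine by Shah et al above, can be sketched with a small in-memory index. This is only a toy under stated assumptions and not the authors' implementation; the class `TupleIndex`, its methods, and the placeholder entries are hypothetical.

```python
from collections import defaultdict


class TupleIndex:
    """Toy in-memory index of (object, property, value) tuples."""

    def __init__(self):
        # Maps a property name to a list of (object, value) pairs.
        self._by_property = defaultdict(list)

    def add(self, obj, prop, value):
        self._by_property[prop].append((obj, value))

    def search(self, prop, min_value=None, max_value=None):
        """Return (object, value) pairs for `prop` within the given bounds, sorted by value."""
        hits = []
        for obj, value in self._by_property.get(prop, []):
            if min_value is not None and value < min_value:
                continue
            if max_value is not None and value > max_value:
                continue
            hits.append((obj, value))
        return sorted(hits, key=lambda pair: pair[1])


# Placeholder entries, not real measurements.
index = TupleIndex()
index.add("material A", "band gap (eV)", 3.2)
index.add("material B", "band gap (eV)", 1.1)
index.add("material A", "density (g/cm3)", 4.2)

results = index.search("band gap (eV)", min_value=2.0)
print(results)
```

Keying the index by property makes a query such as "band gap above 2 eV" a scan over one short list rather than over every stored tuple.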

hhaoyan/awesome-textmining-materials-science

Folders and files

Latest commit

History

Repository files navigation

Awesome text mining ⛏️ for materials science

Tools and codes

Plain text

PDF files

OCR tools

Image data extraction

Datasets/databases

On synthesis

NLP annotations

NLP pipelines

Named Entity Recognition

Text classification/categorization

Data analysis

Synthesis data analysis/planning

Chemical knowledge base/graph

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages