A collection of papers on text mining for materials science. Note: this is a work in progress; I will update this page regularly.
If you find an interesting paper and would like to add it here, please open a pull request.
- spaCy: Fast NLP toolkit with pre-built deep learning models for tokenization, NER, POS, dependency parsing, word2vec, etc.
- textacy: Pre- and post-processing of text, used in conjunction with spaCy: text normalization, garbage text cleaning, extraction of n-grams and entities, etc.
- ChemDataExtractor: A full-fledged toolkit for sentence segmentation, tokenization, chemical NER, and extracting chemical information.
- PDFMiner: A pure-Python PDF parser.
- textract: A bundle of document-to-plain-text converters, including one for PDF files.
- tesseract: An open-source C++ OCR tool based on LSTM that supports many languages.
- Google Cloud OCR: Highly accurate for books, but recognition accuracy may be poor for chemical/materials science symbols and equations.
- ImageDataExtractor: A Tool To Extract and Quantify Data from Microscopy Images by Mukaddem et al: Extract information from microscopy images. The code homepage is http://www.imagedataextractor.org/.
- Machine-learned and codified synthesis parameters of oxide materials by Kim et al: Dataset on syntheses of 30 oxide systems extracted from 76K articles.
- Text-mined dataset of inorganic materials synthesis recipes by Kononova et al: 20K balanced inorganic synthesis reactions and metadata including experimental conditions extracted from 53K articles.
- Annotating and Extracting Synthesis Process of All-Solid-State Batteries from Scientific Literature by Kuniyoshi et al: NER and dependencies annotated/trained on 243 all-solid-state battery articles.
- Auto-generated materials database of Curie and Néel temperatures via semi-supervised relationship extraction by Court et al: 40K Curie and Néel temperatures extracted from 68K articles.
- An open experimental database for exploring inorganic materials by Zakutayev et al: 140K entries on high throughput experimental materials (HTEM) including synthesis conditions, chemical composition, crystal structure, optoelectronic property measurements, etc.
- The Materials Science Procedural Text Corpus: Annotating Materials Synthesis Procedures with Shallow Semantic Structures by Mysore et al: 230 annotated synthesis procedures: synthesis operations, arguments, and their relations.
- Pipelines for Procedural Information Extraction from Scientific Literature: Towards Recipes using Machine Learning and Data Science by Yang et al: A computational information and knowledge management (CIKM) system that extracts preconditions, material inputs, operations, and outputs from literature.
- Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature by Weston et al: Extract inorganic material mentions, sample descriptors, phase labels, material properties, applications, and synthesis/characterization methods from abstracts.
- Automated Extraction of Chemical Synthesis Actions from Experimental Procedures by Vaucher et al: Use rule-based and ML (Transformer) models to extract synthesis actions from experimental procedures.
- Automatically Extracting Action Graphs from Materials Science Synthesis Procedures by Mysore et al: Extraction of synthesis action graphs by combining LSTM, dilated CNN, CRF, and rule-based heuristics.
- Using Natural Language Processing Techniques to Extract Information on the Properties and Functionalities of Energetic Materials from Large Text Corpora by Elton et al: Use GloVe vectors and word2vec model to extract compounds and assign function/property words for energetic materials corpora.
- Semi-supervised machine-learning classification of materials synthesis procedures by Huo et al: Identify synthesis paragraphs using LDA and random forest.
- Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning by Kim et al: Analysis of synthesis conditions for titania nanotubes extracted from literature.
- Virtual screening of inorganic materials synthesis parameters with deep learning by Kim et al: Use a variational autoencoder to encode text and represent synthesis conditions, especially for SrTiO3, TiO2, and MnO compounds.
- Inorganic Materials Synthesis Planning with Literature-Trained Neural Networks by Kim et al: Conditional variational autoencoder learning of synthesis actions and predictions for perovskite compounds.
- Unsupervised word embeddings capture latent knowledge from materials science literature by Tshitoyan et al: Discover hidden chemistry information using unsupervised word embedding methods, with emphasis on thermoelectric materials.
- A Relation Aware Search Engine for Materials Science by Shah et al: A search engine that indexes information tuples (object, property, value) from articles and allows relational search. Their data are deposited at the NIST Materials Data Repository: https://materialsdata.nist.gov/handle/11256/950
- A Bayesian framework for materials knowledge systems by Kalidindi: A Bayesian framework for recommending experimental or simulation parameters using a knowledge base.
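
To make the (object, property, value) tuple idea from the relation-aware search engine entry above more concrete, here is a minimal sketch of a tuple index with a simple relational query. It is not the implementation from Shah et al. The names `Fact`, `TupleIndex`, and `search`, the `unit` field, and all the entries are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    """One (object, property, value) tuple; field names are illustrative."""
    obj: str
    prop: str
    value: float
    unit: str


class TupleIndex:
    """Toy in-memory index that groups facts by property name."""

    def __init__(self):
        self._by_prop = defaultdict(list)

    def add(self, fact):
        # Property names are lower-cased so lookups are case-insensitive.
        self._by_prop[fact.prop.lower()].append(fact)

    def search(self, prop, min_value=None, max_value=None):
        """Return facts for `prop` within the optional bounds, sorted by value."""
        hits = []
        for fact in self._by_prop.get(prop.lower(), []):
            if min_value is not None and fact.value < min_value:
                continue
            if max_value is not None and fact.value > max_value:
                continue
            hits.append(fact)
        return sorted(hits, key=lambda f: f.value)


index = TupleIndex()
# Made-up placeholder entries, not data from any paper listed here.
index.add(Fact("material A", "band gap", 3.2, "eV"))
index.add(Fact("material B", "band gap", 1.1, "eV"))
index.add(Fact("material C", "melting point", 1200.0, "K"))

# Relational query: which objects have a band gap of at least 2.0?
for fact in index.search("band gap", min_value=2.0):
    print(fact.obj, fact.value, fact.unit)
```

A real system would also need to extract the tuples from text and normalize units and material names; the sketch covers only the indexing and querying step.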