hhaoyan/awesome-textmining-materials-science

Awesome text mining ⛏️ for materials science

A collection of papers on text mining for materials science. Note: this is a work in progress, and I will update this page regularly.

If you find an interesting paper and would like to add it here, please open a pull request.

Tools and codes

Plain text

  • spaCy: A fast NLP toolkit with pre-built deep learning models for tokenization, named entity recognition (NER), part-of-speech (POS) tagging, dependency parsing, word vectors, etc.
  • textacy: Pre- and post-processing of text, used in conjunction with spaCy: text normalization, garbage text cleaning, extraction of n-grams and entities, etc.
  • ChemDataExtractor: A full-fledged toolkit for sentence segmentation, tokenization, chemical NER, and extracting chemical information.
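
To make "text normalization" and "garbage text cleaning" concrete, here is a minimal sketch of the kind of cleanup that usually precedes tokenization and NER on text extracted from papers. It uses only the Python standard library and is not the API of textacy or any other tool listed above; the function name `normalize_text` and the sample sentence are made up for illustration.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Clean up text extracted from a paper (hypothetical helper, stdlib only)."""
    # NFKC folds compatibility characters into plain forms:
    # subscript digits (TiO₂ -> TiO2), ligatures (ﬁ -> fi), ℃ -> °C.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break ("cal-\ncined" -> "calcined").
    # Assumption: a hyphen right before a newline is a line-wrap hyphen. This
    # will wrongly merge real compounds such as "sol-\ngel".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of whitespace, including newlines, into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "The TiO₂ ﬁlm was cal-\ncined  at 500 ℃."
    print(normalize_text(sample))
```

Folding subscripts into plain digits makes formulas easier to match against dictionaries and databases, but it discards typographic information, so keep the raw text alongside the normalized version if that matters downstream.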

PDF files

  • PDFMiner: A PDF parser implemented in pure Python.
  • textract: A bundle of document-to-plain-text converters, including one for PDF files.

OCR tools

  • tesseract: An open-source OCR engine written in C++, based on LSTM networks, that supports many languages.
  • Google Cloud OCR: Highly accurate for books, but recognition accuracy may be poor for chemistry and materials science symbols and equations.

Image data extraction

Datasets/databases

On synthesis

NLP annotations

NLP pipelines

Named Entity Recognition

Text classification/categorization

Data analysis

Synthesis data analysis/planning

Chemical knowledge base/graph
