Text mining/preprocessing package for Spark. This package is purely a pet project.
Keyword extraction is the automatic identification of the terms that best describe the subject of a document.
Supported algorithms (implemented or on the roadmap):
- Rapid Automatic Keyword Extraction (RAKE), as described in Text Mining: Applications and Theory, edited by Michael W. Berry and Jacob Kogan.
- TextRank, as described in TextRank: Bringing Order into Texts by Rada Mihalcea and Paul Tarau.
- Maui, as described in Human-competitive automated topic indexing by Olena Medelyan.
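To make the idea of keyword extraction concrete, here is a minimal sketch of RAKE-style scoring in plain Python. It is not this package's implementation or API: the function names `candidate_phrases` and `rake`, the tiny stop-word list, and the punctuation rule are all assumptions chosen for illustration. Candidate phrases are the runs of words between stop words and punctuation; each word is scored as degree divided by frequency, and a phrase scores the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list (an assumption; a real run would use a much larger one).
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "for", "in",
              "is", "of", "on", "the", "to", "with"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    # Punctuation (anything that is not a word character, whitespace, or hyphen)
    # always ends a phrase.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs in candidates
    degree = defaultdict(int)  # co-occurrence count, including the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling `rake(document_text)` returns the candidate phrases ranked by score. Longer phrases made of words that rarely appear alone tend to rank highest, which is the behaviour RAKE relies on.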
[Word|Phrase] Embedding is a technique for mapping words or phrases from a vocabulary to vectors of real numbers in a space whose dimension is low relative to the vocabulary size.
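As a rough illustration of what "low-dimensional relative to the vocabulary size" means, the sketch below contrasts a one-hot vector (one dimension per vocabulary entry) with a dense lookup table. The names `one_hot` and `build_embedding_table` are hypothetical, and the dense vectors are random placeholders: an actual embedding model would learn them from text.

```python
import random


def one_hot(word, vocabulary):
    """Sparse representation: one dimension per vocabulary entry."""
    return [1.0 if w == word else 0.0 for w in vocabulary]


def build_embedding_table(vocabulary, dim, seed=0):
    """Dense representation: each word maps to `dim` real numbers.

    The values here are random placeholders (an assumption for illustration);
    a trained model would learn them so that related words end up close together.
    """
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for word in vocabulary}


vocabulary = ["spark", "keyword", "extraction", "text", "mining", "embedding"]
table = build_embedding_table(vocabulary, dim=3)

sparse = one_hot("spark", vocabulary)  # length grows with the vocabulary
dense = table["spark"]                 # length stays fixed at dim
```

With a realistic vocabulary of tens of thousands of words, the one-hot vector grows accordingly, while the dense vector typically stays at a few hundred dimensions.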
Supported algorithms (implemented or on the roadmap):