In [None]:
@file:Repository("https://jitpack.io")

In [None]:
@file:DependsOn("com.londogard:londogard-nlp-toolkit:clf-SNAPSHOT")

# Londogard NLP Toolkit
This is a simple toolkit for Natural Language Processing (NLP). It contains utilities that are handy while prototyping and that also suit production deployment because they use resources sparingly.  
The initial aim is not to solve problems end-to-end but rather to offer a small, simple dependency with great utilities.

## Supported Utilities

All completed utilities are listed below, each accompanied by a simple usage example. `LanguageSupport.<ISO_2_COUNTRY_CODE>` helps you figure out what is supported.

- [Word Embeddings](#WordEmbeddings) including basic Word Embeddings, 'Light Word Embeddings' & BytePairEmbeddings
    - The Embeddings include automated downloads of languages to simplify your life, unless a path is specified.
    - WordEmbeddings include 157 languages via `fastText`-embeddings ([fastText.cc](https://fasttext.cc/docs/en/crawl-vectors.html))
    - BPE-Embeddings include 275 languages via `BPEmb`-embeddings ([nlp.h-its.org](https://nlp.h-its.org/bpemb/))
    - All embeddings support ∞ languages through your own self-trained embeddings if a file-path is supplied!
- [Sentence Embeddings](#SentenceEmbeddings) including `AvgSentenceEmbeddings` & `USifEmbeddings`
- [Tokenizers](#Tokenizers) including Word, Char & Subword (SentencePiece) tokenizers (simple to add custom logic)
    - SentencePiece includes 275 languages via `BPEmb`-embeddings ([nlp.h-its.org](https://nlp.h-its.org/bpemb/)) with 8 vocab-sizes (1000, 3000, 5000, 10_000, 25_000, 50_000, 100_000, 200_000).
        - You can of course supply your own tokenizer if you have a path to a trained one!
- [Stopwords](#Stopwords) based on NLTK's list
    - Supporting: ar, az, da, de, el, en, es, fi, fr, hu, id, it, kk, ne, nl, no, pt, ro, ru, sl, sv, tg & tr
- [Word Frequencies](#WordFrequencies) based on `wordfreq.py` by [LuminosoInsight](https://github.com/LuminosoInsight/wordfreq/).
    - Supporting: ar, cs, de, en, es, fi, fr, it, ja, nl, pl, uk, pt, ru, zh, bg, bn, ca, da, el, fa, he, hi, hu, id, ko, lv, mk, ms, nb, ro, sh, sv & tr. Some of these support "Large Word Frequency", call `LanguageSupport.<LANG_CODE>.largestWordFrequency()` to see if `large` variant is available.
- [Stemmer](#Stemmer) based on [Snowball Stemmer](https://snowballstem.org/)
    - Supporting: sv, nl, en, fi, fr, de, hu, it, no, pt, ro, ru, es & tr
- [Trie](#Trie) - Just a basic utility that is to be used for a custom SubwordTokenizer in the future.
- [Vectorizer](#Vectorizer) - Vectorizers & Transformers. Include BM25, TF-IDF & BagOfWords (CountVectorizer).
- [Classifiers](#Classifiers) - Classifiers that predict a class (label), includes Logistic Regression & Naïve Bayes.
- [Regression](#Regression) - Linear Regression, a simple regression model.
- [Sequence Classifier](#SequenceClassifier) - Hidden Markov Model to predict sequences like Part of Speech (PoS).


### WordEmbeddings
Word Embeddings currently exist in three variants. 

All Embeddings (excluding SentenceEmbeddings) extend `Embeddings`, which has some good-to-know default methods:
```kotlin
interface Embeddings {
    fun contains(word: String): Boolean
    
    fun vector(word: String): SimpleMatrix?
    
    // OBS: Will return all possible vectors, not necessarily ALL
    fun traverseVectors(words: List<String>): List<SimpleMatrix>
    fun traverseVectorsOrNull(words: List<String>): List<SimpleMatrix>? // Returns null if any missing
    
    fun euclideanDistance(w1: String, w2: String): Double?
    fun cosineDistance(w1: String, w2: String): Double?
}
```
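The two distance methods can be pictured with a minimal sketch on plain `FloatArray`s. This is an illustration only: the toolkit itself works on `SimpleMatrix`, the function names here are hypothetical, and whether `cosineDistance` returns the similarity or `1 - similarity` is not something this sketch claims.

```kotlin
import kotlin.math.sqrt

// Toy stand-ins for embedding vectors; the toolkit returns SimpleMatrix instead.
fun dot(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var sum = 0f
    for (i in a.indices) sum += a[i] * b[i]
    return sum
}

// Straight-line distance between the two vectors.
fun euclideanDistance(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var sum = 0f
    for (i in a.indices) {
        val d = a[i] - b[i]
        sum += d * d
    }
    return sqrt(sum)
}

// Cosine of the angle between the vectors: 1 = same direction, 0 = orthogonal.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float =
    dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))
```

Cosine similarity ignores vector length, which is why it is a common choice for comparing word vectors.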

#### WordEmbedding

`WordEmbeddings` are the classical use case of embeddings, where each word maps to a vector of floats. There are some helper methods. Currently the embeddings must be available locally. 
Download functions for [fastText](https://fasttext.cc/) will come, but be warned they're large!

```kotlin
class WordEmbeddings(val delimeter: Char = ' ', val dimensions: Int, val filePath: Path) {
    // Returns the N nearest neighbours
    fun nearestNeighbour(vector: SimpleMatrix, N: Int): List<Pair<String, Double>>
    
    // Returns N nearest neighbours, for the average of all input
    fun distance(input: List<String>, N: Int): List<Pair<String, Double>>
    
    // w1 is to w2 what w3 is to ??. The N closest choices to ?? is selected
    fun analogy(w1: String, w2: String, w3: String, N: Int): List<Pair<String, Double>>?
    
    // Rank a set of words by their distance to word
    fun rank(word: String, set: Set<String>): List<Pair<String, Double>>
    
    // Pretty print the returned wordlist
    fun pprint(words: List<Pair<String, Double>>)
}
```
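To show the idea behind `nearestNeighbour` and `analogy`, here is a small sketch over a toy vocabulary held in a `Map`. It assumes cosine similarity as the ranking score and the usual vector arithmetic `w2 - w1 + w3` for analogies; the toolkit's actual scoring and return types may differ, and all names below are hypothetical.

```kotlin
import kotlin.math.sqrt

fun cosine(a: FloatArray, b: FloatArray): Float {
    var ab = 0f
    var aa = 0f
    var bb = 0f
    for (i in a.indices) {
        ab += a[i] * b[i]
        aa += a[i] * a[i]
        bb += b[i] * b[i]
    }
    return ab / (sqrt(aa) * sqrt(bb))
}

// Rank every word by similarity to the query and keep the n best.
fun nearestNeighbours(
    query: FloatArray,
    vocabulary: Map<String, FloatArray>,
    n: Int,
    exclude: Set<String> = emptySet()
): List<Pair<String, Float>> =
    vocabulary
        .filterKeys { it !in exclude }
        .map { (word, vector) -> word to cosine(query, vector) }
        .sortedByDescending { it.second }
        .take(n)

// "w1 is to w2 what w3 is to ??": search around w2 - w1 + w3.
// Returns null if any of the three words is missing from the vocabulary.
fun analogy(
    w1: String, w2: String, w3: String,
    vocabulary: Map<String, FloatArray>, n: Int
): List<Pair<String, Float>>? {
    val v1 = vocabulary[w1] ?: return null
    val v2 = vocabulary[w2] ?: return null
    val v3 = vocabulary[w3] ?: return null
    val query = FloatArray(v1.size) { i -> v2[i] - v1[i] + v3[i] }
    return nearestNeighbours(query, vocabulary, n, exclude = setOf(w1, w2, w3))
}
```

The input words are excluded from the candidates, since otherwise they tend to be their own nearest neighbours.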

#### LightWordEmbedding

`LightWordEmbeddings` is something we at Londogard created to allow our embeddings to be loaded onto a Raspberry Pi 3B+ (1GB RAM). What makes `LightWordEmbeddings` effective is that roughly 10 % of all words make up 90 % of all communication, so by keeping only a few embeddings (the most common ones) in memory we cover most cases and can load the rest when required.  
They don't come free: you can't call the functions unique to `WordEmbeddings`, such as `nearestNeighbour`, and they are a little more complicated to use.

```kotlin
class LightWordEmbeddings(val delimeter: Char = ' ', val dimensions: Int, val filePath: Path, val maxWordCount: Int = 1000) {
    
    // Add words you'd like to read into the embeddings.
    // Only the delta from already loaded words is added, e.g. if everything is loaded it won't touch the filesystem.
    // The `maxWordCount` most common words are preloaded as part of instantiation.
    fun addWords(words: Set<String>)
}
```
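The preload-then-load-the-delta behaviour described above can be sketched as follows. This is a hypothetical stand-in, not the toolkit's implementation: `lines` plays the role of an embedding file sorted by word frequency (one `word v1 v2 ...` entry per line), and `fileScans` exists only to make the "skip the filesystem" behaviour observable.

```kotlin
// Hypothetical sketch of lazy embedding loading.
class LazyEmbeddings(
    private val lines: List<String>,
    maxWordCount: Int,
    private val delimiter: Char = ' '
) {
    private val loaded = HashMap<String, FloatArray>()

    // Counts how often the "file" had to be scanned after construction.
    var fileScans = 0
        private set

    init {
        // Preload the most common words (the file is assumed to be frequency-sorted).
        lines.take(maxWordCount).forEach { store(it) }
    }

    private fun store(line: String) {
        val parts = line.split(delimiter)
        loaded[parts[0]] = FloatArray(parts.size - 1) { i -> parts[i + 1].toFloat() }
    }

    fun vector(word: String): FloatArray? = loaded[word]

    fun addWords(words: Set<String>) {
        val missing = words - loaded.keys
        if (missing.isEmpty()) return // nothing new, so the file is not touched
        fileScans++
        lines.filter { it.substringBefore(delimiter) in missing }.forEach { store(it) }
    }
}
```

With `maxWordCount = 2` and a three-line file, the third word only becomes available after `addWords(setOf(...))` asks for it.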

#### BytePairEncoding-Embedding (BPEmb)
Byte-Pair-Encoding embeddings are an approach from _BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages_ by Benjamin Heinzerling and Michael Strube. In this paper they show how an 11 MB embedding for English is on par with 6 GB embeddings from `fastText`!

By default the tokenizer splits a word into subwords and the subword embeddings are averaged. It's also possible to use the extended method `subwordVector` to retrieve an embedding directly if you have pretokenized the text. When using `subwordVector` it's important that you choose the same vocab-size for the tokenizer and the embeddings, otherwise you'll see a lot of misses!

```kotlin
class BpeEmbeddings {
    fun subwordVector(subword: String): SimpleMatrix?
}
```
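The default "tokenize, then average" behaviour can be sketched like this. The function name and the plain `Map` lookup table are hypothetical, and how the toolkit treats subwords that are missing from the table is an assumption (here they are skipped).

```kotlin
// Average the vectors of all subwords that exist in the lookup table.
// Returns null when none of the subwords is known.
fun averageSubwordVectors(
    subwords: List<String>,
    table: Map<String, FloatArray>
): FloatArray? {
    val vectors = subwords.mapNotNull { table[it] }
    if (vectors.isEmpty()) return null
    val dim = vectors.first().size
    val result = FloatArray(dim)
    for (vector in vectors) {
        for (i in 0 until dim) result[i] += vector[i]
    }
    for (i in 0 until dim) result[i] /= vectors.size
    return result
}
```

A mismatch in vocab-size shows up here as subwords that are not in `table`, which is what the "misses" above refer to.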

#### EmbeddingLoader
Finally, there is a utility that helps you load and download embeddings automatically and saves them on the filesystem. 

Let's see how we can use this!

In [3]:
import com.londogard.nlp.embeddings.*
import com.londogard.nlp.utils.LanguageSupport.*
import kotlin.system.measureTimeMillis

EmbeddingLoader.fromLanguageOrNull<WordEmbeddings>(sv) // WordEmbeddings
EmbeddingLoader.fromLanguageOrNull<BpeEmbeddings>(sv) // BpeEmbeddings (vocabSize: 10_000, dim: 50)
EmbeddingLoader.fromLanguageOrNull<LightWordEmbeddings>(sv) // LightWordEmbeddings (size: 1000)

measureTimeMillis {
    val embeddings = EmbeddingLoader.fromLanguageOrNull<LightWordEmbeddings>(sv)
    println(embeddings?.vector("Hej")?.cols(0, 10)) // Truncating 300 dimensions to 10 to make the print nicer
}.let { ms -> println("Loading LightWordEmbeddings + retrieving 'Hej' took $ms milliseconds") }

Type = FDRM , rows = 1 , cols = 10
-7.9100E-02 -4.8000E-02 -1.7440E-01  1.1110E-01 -6.3600E-02 -2.3520E-01 -5.4400E-02  1.1200E-01 -4.0000E-04 -1.5900E-02 

Loading LightWordEmbeddings + retrieving 'Hej' took 79 milliseconds


`LightWordEmbeddings` are really good for keeping memory requirements low and for booting up quickly from cold. `WordEmbeddings` are obviously faster once running "hot", but they might not be an option if your RAM is too low (e.g. running on embedded hardware, Raspberry Pi etc).

See how long `WordEmbeddings` take to boot up in comparison!

In [4]:
measureTimeMillis {
    val embeddings = EmbeddingLoader.fromLanguageOrNull<WordEmbeddings>(sv)
    println(embeddings?.vector("Hej")?.cols(0, 10)) // Truncating 300 dimensions to 10 to make the print nicer
}.let { ms -> println("Loading WordEmbeddings + retrieving 'Hej' took $ms milliseconds") }

Type = FDRM , rows = 1 , cols = 10
-7.9100E-02 -4.8000E-02 -1.7440E-01  1.1110E-01 -6.3600E-02 -2.3520E-01 -5.4400E-02  1.1200E-01 -4.0000E-04 -1.5900E-02 

Loading WordEmbeddings + retrieving 'Hej' took 63439 milliseconds


And finally we have `BpeEmbeddings`, which I believe are the best of all. They combine speed, low RAM usage and everything else into one great package. The boot-up is a little slower than for `LightWordEmbeddings`, but because the files are so small the full embedding can be kept in memory even on low-memory hardware, so lookups stay incredibly fast!

In [5]:
measureTimeMillis {
    val embeddings = EmbeddingLoader.fromLanguageOrNull<BpeEmbeddings>(sv)
    println(embeddings?.vector("Hej")?.cols(0, 10)) // Truncating to the first 10 dimensions to make the print nicer
}.let { ms -> println("Loading BpeEmbeddings + retrieving 'Hej' took $ms milliseconds") }

Type = FDRM , rows = 1 , cols = 10
 4.1733E-02  1.0365E-01  9.4801E-02  3.9651E-02  3.2153E-01  3.3000E-02 -2.7866E-01  1.6301E-01 -2.5834E-01  7.7778E-02 

Loading BpeEmbeddings + retrieving 'Hej' took 302 milliseconds


### SentenceEmbeddings

All Sentence Embeddings extend the same `interface`
```kotlin
interface SentenceEmbeddings {
    val tokenEmbeddings: Embeddings
    fun getSentenceEmbeddings(listOfSentences: List<List<String>>): List<SimpleMatrix>
    fun getSentenceEmbedding(sentence: List<String>): SimpleMatrix
}
```

The `fun getSentenceEmbeddings(listOfSentences: List<List<String>>)` exists because some Sentence Embeddings depend on the "global context".

Currently two variants are 'completed', but more are coming.

#### AvgSentenceEmbeddings
Just averages the Word Embeddings for a sentence.

In [6]:
import com.londogard.nlp.embeddings.sentence.*

val embeddings = EmbeddingLoader.fromLanguageOrNull<LightWordEmbeddings>(sv)!! // We know this exists.. :)
val sentEmbeddings = AverageSentenceEmbeddings(embeddings)

sentEmbeddings.getSentenceEmbedding(listOf("Hej", "där", "borta")).cols(0, 10) // once again reducing dimensions

Type = FDRM , rows = 1 , cols = 10
-2.3792E-02 -2.1282E-02 -4.2597E-02  3.0897E-02 -2.5323E-02 -9.7709E-02 -1.1211E-02  3.5199E-02  1.2124E-02 -8.6041E-03 


#### USifSentenceEmbeddings
This implementation is based on the paper _Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline_ by Kawin Ethayarajh, found on [aclweb.org](https://www.aclweb.org/anthology/W18-3012/).  
This paper bases its work on Smooth-Inverse-Frequency (SIF) Embeddings (paper _A Simple but Tough-to-Beat Baseline for Sentence Embeddings_ found [here](https://openreview.net/forum?id=SyK00v5xx)). The difference being that this approach is _unsupervised_ while remaining even stronger. See the abstract:

> Using a random walk model of text generation, Arora et al. (2017) proposed a strong baseline for computing sentence embeddings: take a weighted average of word embeddings and modify with SVD. This simple method even outperforms far more complex approaches such as LSTMs on textual similarity tasks. In this paper, we first show that word vector length has a confounding effect on the probability of a sentence being generated in Arora et al.’s model. We propose a random walk model that is robust to this confound, where the probability of word generation is inversely related to the angular distance between the word and sentence embeddings. Our approach beats Arora et al.’s by up to 44.4% on textual similarity tasks and is competitive with state-of-the-art methods. Unlike Arora et al.’s method, ours requires no hyperparameter tuning, which means it can be used when there is no labelled data.

Note that it requires more input than the other `SentenceEmbeddings`:

```kotlin
import com.londogard.nlp.embeddings.sentence.*

class USifSentenceEmbeddings(
    val tokenEmbeddings: Embeddings,
    private val wordProb: Map<String, Float>,
    randomWalkLength: Int, // = n, ~11
    val numCommonDiscourseVector: Int = 5 // = m, 0 should work. In practice max 5.
) {
    /** use it the same way as SentenceEmbeddings */
}
```

Where `wordProb` is simply taken through the `WordFrequencies` util. E.g. `WordFrequencies.getAllWordFrequenciesOrNull(sv): Map<String, Float>?`.
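To give a feel for why word probabilities are needed, here is a simplified SIF-style weighted average in which each word is weighted by `a / (a + p(w))`, so frequent words contribute less. This is deliberately not the uSIF algorithm: the fixed `a`, the missing common-component removal and all names are assumptions made for illustration.

```kotlin
// Simplified SIF-style sentence vector: rare words get weights close to 1,
// frequent words get weights close to 0. Words without a vector are skipped.
fun weightedSentenceVector(
    sentence: List<String>,
    vectors: Map<String, FloatArray>,
    wordProb: Map<String, Float>,
    a: Float = 1e-3f // smoothing constant, an assumed value
): FloatArray? {
    val known = sentence.filter { it in vectors }
    if (known.isEmpty()) return null
    val dim = vectors.getValue(known.first()).size
    val result = FloatArray(dim)
    for (word in known) {
        // Words without a known probability are treated as p(w) = 0, i.e. weight 1.
        val weight = a / (a + (wordProb[word] ?: 0f))
        val vector = vectors.getValue(word)
        for (i in 0 until dim) result[i] += weight * vector[i]
    }
    for (i in 0 until dim) result[i] /= known.size
    return result
}
```

With Swedish frequencies, a common word such as "och" would barely move the sentence vector while "hej" would dominate it.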

#### SentenceEmbeddings In Progress

`TfIdfEmbeddings` & `SifEmbeddings`; the latter might be scrapped as `USif` is such an amazing piece of work.

In [7]:
import com.londogard.nlp.wordfreq.WordFrequencies

val usifEmbeddings = USifSentenceEmbeddings(embeddings, WordFrequencies.getAllWordFrequenciesOrNull(sv) ?: emptyMap(), 11)
usifEmbeddings.getSentenceEmbedding(listOf("Hej", "där", "borta")).cols(0, 10) // Also truncating here

Type = FDRM , rows = 1 , cols = 10
-3.1692E-02 -2.7498E-02 -5.7967E-02  4.1469E-02 -3.2962E-02 -1.2680E-01 -1.5575E-02  4.6698E-02  1.4627E-02 -1.0986E-02 


### Tokenizers

There currently exist three types of tokenizers, all extending the same `interface`.

```kotlin
interface Tokenizer {
    fun split(text: String): List<String>
}
```
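Because the interface is a single method, adding custom logic is straightforward. The sketch below redeclares the interface so that it compiles on its own; `WhitespaceTokenizer` is a hypothetical example and not part of the toolkit.

```kotlin
// Mirrors the interface shown above so the sketch is self-contained.
interface Tokenizer {
    fun split(text: String): List<String>
}

// Hypothetical custom tokenizer: splits on runs of whitespace only,
// so punctuation stays attached to the neighbouring word.
class WhitespaceTokenizer : Tokenizer {
    private val whitespace = Regex("\\s+")

    override fun split(text: String): List<String> =
        text.trim().split(whitespace).filter { it.isNotEmpty() }
}
```

Anything that accepts a `Tokenizer` could then be handed a `WhitespaceTokenizer()` instead of one of the built-in variants.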

#### CharTokenizer

Tokenizes a string into the chars.

In [8]:
import com.londogard.nlp.tokenizer.*

CharTokenizer().split("hello, world!")

[h, e, l, l, o, ,,  , w, o, r, l, d, !]

#### SimpleTokenizer

This is a word-tokenizer which splits text into words; as the example below shows, punctuation marks become separate tokens.

In [9]:
SimpleTokenizer().split("hello, world!")

[hello, ,, world, !]

#### SentencePieceTokenizer
The SentencePiece tokenizer is a language-specific subword tokenizer; through [BPEmb](https://nlp.h-its.org/bpemb/) 275 languages are covered, trained on Wikipedia. Models exist for the following vocab-sizes: `1000, 3000, 5000, 10_000, 25_000, 50_000, 100_000 & 200_000`. The larger the vocabulary, the fewer words are split into subwords and the more are kept whole.

The SentencePiece model is the raw C++ from [Google](https://github.com/google/sentencepiece/) with a wrapper from [DJL (Amazon)](http://docs.djl.ai/extensions/sentencepiece/index.html). I'm usually very hesitant about adding native libraries (JNI) when working on JVM projects, but no one can deny the power of SentencePiece and I haven't had the time to implement the algorithm myself.  
This wrapper by DJL is very small (they provide a single dependency with only sentencepiece) and fits my philosophy of keeping dependencies and size low.

In [10]:
SentencePieceTokenizer.fromLanguageSupportOrNull(sv)?.split("hej där borta, hur mår ni?")

[▁he, j, ▁där, ▁bor, ta, ,, ▁hur, ▁mår, ▁ni, ?]

### Stopwords

The stopwords are taken from NLTK and are hosted directly on GitHub. The object looks as follows:

```kotlin
object Stopwords {
    fun isStopword(word: String, language: LanguageSupport): Boolean
    fun stopwords(language: LanguageSupport): Set<String> // Throws if language does not support stopwords
    fun stopwordsOrNull(language: LanguageSupport): Set<String>?
}
```

This object has an internal cache which saves the previously loaded language. Use `Stopwords.stopwords` to simply retrieve the Stopwords and use them yourself as a `Set<String>`.
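A cache that remembers only the previously loaded language can be sketched as below. This is an assumption about the general idea, not the toolkit's code; `SingleEntryCache` and its `loads` counter are hypothetical names used to make the behaviour visible.

```kotlin
// Keeps only the most recently loaded value; asking for another key replaces it.
class SingleEntryCache<K, V>(private val loader: (K) -> V) {
    private var cached: Pair<K, V>? = null

    // Counts how often the (potentially expensive) loader actually ran.
    var loads = 0
        private set

    fun get(key: K): V {
        val current = cached
        if (current != null && current.first == key) return current.second
        loads++
        return loader(key).also { cached = key to it }
    }
}
```

Repeated lookups for the same language are then cheap, while alternating between two languages reloads on every switch.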

In [11]:
import com.londogard.nlp.stopwords.Stopwords
import com.londogard.nlp.utils.LanguageSupport.*

val hej = Stopwords.isStopword("hej", sv)
val och = Stopwords.isStopword("och", sv)

"'hej' is a stopword: $hej\n'och' is a stopword: $och"

'hej' is a stopword: false
'och' is a stopword: true

In [12]:
Stopwords.stopwords(sv).take(3)

[och, det, att]

In [13]:
runCatching { Stopwords.stopwords(af) }.isSuccess // Stopwords does not support af

false

In [14]:
Stopwords.stopwordsOrNull(af) == null

true

### WordFrequencies

The Word Frequencies are taken from `wordfreq.py`, a library by [LuminosoInsight](https://github.com/LuminosoInsight/wordfreq/), and are hosted directly on GitHub. The object looks as follows:

```kotlin
object WordFrequencies {
   fun getAllWordFrequenciesOrNull(language: LanguageSupport, size: WordFrequencySize = WordFrequencySize.Largest): Map<String, Float>?
   
   fun wordFrequency(word: String, language: LanguageSupport, minimum: Float = 0f, size: WordFrequencySize): Float // Throws if language does not support wordfreq
   fun wordFrequencyOrNull(word: String, language: LanguageSupport, minimum: Float = 0f, size: WordFrequencySize): Float?
}
```

This object has an internal cache which saves the previously loaded language. Use `WordFrequencies.getAllWordFrequenciesOrNull` to simply retrieve the WordFrequencies and use them yourself as a `Map<String, Float>`.  
Methods to retrieve `zipfFrequencies` also exist.

In [15]:
import com.londogard.nlp.wordfreq.WordFrequencies

val hej = WordFrequencies.wordFrequency("hej", sv)
val och = WordFrequencies.wordFrequency("och", sv)

"WordFrequency of 'hej'=$hej and 'och'=$och"

WordFrequency of 'hej'=2.9512093E-4 and 'och'=0.025118863

In [16]:
val hej = WordFrequencies.zipfFrequency("hej", sv)
val och = WordFrequencies.zipfFrequency("och", sv)

"ZipfFrequency of 'hej'=$hej and 'och'=$och"

ZipfFrequency of 'hej'=5.4700003 and 'och'=7.4
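The Zipf scale used by `wordfreq` is the base-10 logarithm of a word's frequency per billion words, which for a relative frequency `f` is `log10(f) + 9`. The hypothetical helper below applies that formula; for the two frequencies printed earlier it gives values that agree with the Zipf output above to two decimals.

```kotlin
import kotlin.math.log10

// Zipf frequency from a relative word frequency (a proportion between 0 and 1).
// The helper name is illustrative and not part of the toolkit.
fun zipfFromFrequency(frequency: Float): Float = log10(frequency) + 9f
```

For example, `zipfFromFrequency(0.025118863f)` is roughly 7.4 and `zipfFromFrequency(2.9512093E-4f)` is roughly 5.47.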

In [17]:
val weird = WordFrequencies.wordFrequency("hraihaodjasmdiamo", sv)
val weirdOrNull = WordFrequencies.wordFrequencyOrNull("hraihaodjasmdiamo", sv)

"WordFrequency of 'hraihaodjasmdiamo' (non-word) using `wordFrequency` $weird and using `wordFrequencyOrNull` $weirdOrNull"

WordFrequency of 'hraihaodjasmdiamo' (non-word) using `wordFrequency` 0.0 and using `wordFrequencyOrNull` null

In [18]:
runCatching { WordFrequencies.wordFrequency("hello", af) }.isFailure // WordFrequencies does not support af

true

In [19]:
WordFrequencies.wordFrequencyOrNull("hello", af) == null

true

In [20]:
WordFrequencies.getAllWordFrequenciesOrNull(sv)?.entries?.take(3)

[är=0.037153527, det=0.031622775, att=0.026302677]

### Stemmer

The stemmer uses [Snowballstem](http://snowballstem.org/) which is a small dependency with a wrapper.  
If the stemmer is not supported by the called `LanguageSupport` it'll fall-back to the classic `PorterStemmer`.


There are currently two ways to call the stemmer.

In [21]:
import com.londogard.nlp.stemmer.Stemmer

val stemmer = Stemmer(sv)
stemmer.stem("katten")

katt

In [22]:
Stemmer.stem("katten", sv)

katt

### Trie

:warning: Work In Progress :warning:

Does work with a `vocab: Map<String, Int>`.
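As the toolkit's Trie is still work in progress, the following is only a generic sketch of how a trie built from a `vocab: Map<String, Int>` could serve a subword tokenizer through longest-prefix matching. The class layout and the `longestPrefix` method are assumptions, not the toolkit's API.

```kotlin
// Minimal character trie; every vocabulary entry stores its id at its last node.
class Trie(vocab: Map<String, Int>) {
    private class Node {
        val children = HashMap<Char, Node>()
        var id: Int? = null
    }

    private val root = Node()

    init {
        for ((token, id) in vocab) {
            var node = root
            for (ch in token) node = node.children.getOrPut(ch) { Node() }
            node.id = id
        }
    }

    // Longest vocabulary entry that starts at `start` in `text`, with its id, or null.
    fun longestPrefix(text: String, start: Int = 0): Pair<String, Int>? {
        var node = root
        var best: Pair<String, Int>? = null
        for (i in start until text.length) {
            node = node.children[text[i]] ?: break
            val id = node.id
            if (id != null) best = text.substring(start, i + 1) to id
        }
        return best
    }
}
```

A greedy subword tokenizer could call `longestPrefix` repeatedly, advancing `start` by the length of each match.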

### Vectorizer
Vectorizers and Transformers

A `Vectorizer` takes raw string input and outputs a vectorized format.
A `Transformer`, on the other hand, takes vectorized input and outputs a new vectorized format, _transformed_.

The interface looks like the following:
```kotlin
interface Vectorizer<INPUT: Number, OUTPUT: Number> {
    fun fit(input: List<List<String>>): Unit
    fun transform(input: List<List<String>>): D2FloatArray
    fun fitTransform(input: List<List<String>>): D2FloatArray
}

interface Transformer<INPUT: Number, OUTPUT: Number> {
    fun fit(input: MultiArray<Float, D2>): Unit
    fun transform(input: MultiArray<Float, D2>): MultiArray<Float, D2>
    fun fitTransform(input: MultiArray<Float, D2>): MultiArray<Float, D2>
}
```

Using this is very simple, and all of them are used the same way: simply swap in the vectorizer you wish to use. Here is a TF-IDF Vectorizer.

In [None]:
val simpleTok = SimpleTokenizer()
val simpleTexts = listOf("hejsan jag älskar sverige", "hej vad bra det är i sverige", "jag älskar sverige", "jag hatar norge", "norge hatar", "norge hatar", "norge hatar")
    .map(simpleTok::split)
val tfidf = TfIdfVectorizer<Float>()

val lhs = tfidf.fitTransform(simpleTexts)
println("Vectorized: $lhs")
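To make the `fit` / `transform` split concrete, here is a toy bag-of-words vectorizer. It is a sketch under simplifying assumptions: it returns `List<FloatArray>` instead of the toolkit's `D2FloatArray`, and `ToyCountVectorizer` is a hypothetical name.

```kotlin
// fit() learns the vocabulary, transform() counts tokens per document.
class ToyCountVectorizer {
    // Token -> column index, in order of first appearance.
    private val vocabulary = LinkedHashMap<String, Int>()

    fun fit(input: List<List<String>>) {
        for (tokens in input) {
            for (token in tokens) vocabulary.getOrPut(token) { vocabulary.size }
        }
    }

    // One row per document; tokens unseen during fit() are ignored.
    fun transform(input: List<List<String>>): List<FloatArray> =
        input.map { tokens ->
            val row = FloatArray(vocabulary.size)
            for (token in tokens) {
                val column = vocabulary[token]
                if (column != null) row[column] += 1f
            }
            row
        }

    fun fitTransform(input: List<List<String>>): List<FloatArray> {
        fit(input)
        return transform(input)
    }
}
```

TF-IDF and BM25 start from the same counts and then reweight each column by how rare the token is across documents.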

### Classifiers
Classifiers predict labels / classes based on a vectorized input.

Interface
```kotlin
interface Classifier: BasePredictor<Int> {
    fun fit(X: MultiArray<Float, D2>, y: D2Array<Int>)
    fun predict(X: MultiArray<Float, D2>): D2Array<Int>
}
```

TODO

### Regression
Regressors predict a continuous value based on a continuous input.

Interface
```kotlin
interface Regressor: BasePredictor<Int> {
    fun fit(X: MultiArray<Float, D2>, y: D2Array<Float>)
    fun predict(X: MultiArray<Float, D2>): D2Array<Float>
}
```

TODO

### SequenceClassifier
A Sequence Classifier predicts a sequence of labels from an input sequence, for example whether each word is a verb, a noun or something else.

Currently only Hidden Markov Model (HMM) is supported.
The interface is as follows:

```kotlin
interface SequenceClassifier<T : Number> {
    // Using List<> as the input can be of different sizes between examples
    fun fit(X: List<D1Array<T>>, y: List<D1Array<Int>>)
    fun predict(X: List<D1Array<T>>): List<D1Array<Int>>
}
```
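Hidden Markov Models are commonly decoded with the Viterbi algorithm, which finds the most likely label sequence for a sequence of observations. The sketch below is a generic textbook version over plain arrays; whether the toolkit's HMM is implemented this way is an assumption, and the parameter layout is hypothetical.

```kotlin
import kotlin.math.ln

// start[s]          = P(first state is s)
// transition[p][s]  = P(next state is s | current state is p)
// emission[s][o]    = P(observing o | state is s)
fun viterbi(
    observations: IntArray,
    start: DoubleArray,
    transition: Array<DoubleArray>,
    emission: Array<DoubleArray>
): IntArray {
    if (observations.isEmpty()) return IntArray(0)
    val steps = observations.size
    val states = start.size
    val score = Array(steps) { DoubleArray(states) } // best log-probability so far
    val back = Array(steps) { IntArray(states) }     // best previous state

    for (s in 0 until states) {
        score[0][s] = ln(start[s]) + ln(emission[s][observations[0]])
    }
    for (t in 1 until steps) {
        for (s in 0 until states) {
            var bestPrev = 0
            var bestScore = Double.NEGATIVE_INFINITY
            for (p in 0 until states) {
                val candidate = score[t - 1][p] + ln(transition[p][s])
                if (candidate > bestScore) {
                    bestScore = candidate
                    bestPrev = p
                }
            }
            score[t][s] = bestScore + ln(emission[s][observations[t]])
            back[t][s] = bestPrev
        }
    }

    // Pick the best final state, then follow the back-pointers.
    var last = 0
    for (s in 1 until states) if (score[steps - 1][s] > score[steps - 1][last]) last = s
    val path = IntArray(steps)
    path[steps - 1] = last
    for (t in steps - 1 downTo 1) path[t - 1] = back[t][path[t]]
    return path
}
```

Working in log space avoids numeric underflow when many small probabilities are multiplied along a long sequence.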

TODO