In [None]:
%use londogard-nlp-toolkit

# Londogard NLP Toolkit
This is a simple toolkit for Natural Language Processing (NLP). Its utilities are handy while prototyping and, thanks to careful resource usage, also suitable for production deployment.  
The initial aim is not to solve problems end-to-end but to offer a small dependency with great utilities.

## Supported Utilities

Every completed utility is listed below, accompanied by a simple usage example. `LanguageSupport.<ISO_2_COUNTRY_CODE>` helps you figure out what is supported for a given language.

- [Word Embeddings](#WordEmbeddings) including basic Word Embeddings, 'Light Word Embeddings' & BytePairEmbeddings
    - The Embeddings include automated downloads of languages to simplify your life, unless a path is specified.
    - WordEmbeddings include 157 languages via `fastText`-embeddings ([fastText.cc](https://fasttext.cc/docs/en/crawl-vectors.html))
    - BPE-Embeddings include 275 languages via `BPEmb`-embeddings ([nlp.h-its.org](https://nlp.h-its.org/bpemb/))
    - All embeddings support ∞ languages through your own self-trained embeddings if a file-path is supplied!
- [Sentence Embeddings](#SentenceEmbeddings) including `AvgSentenceEmbeddings` & `USifEmbeddings`
- [Tokenizers](#Tokenizers) including Word, Char & Subword (SentencePiece) tokenizers (simple to add custom logic)
    - SentencePiece includes 275 languages via `BPEmb`-embeddings ([nlp.h-its.org](https://nlp.h-its.org/bpemb/)) with 8 vocab-sizes (1000, 3000, 5000, 10_000, 25_000, 50_000, 100_000, 200_000).
        - You can also supply your own tokenizer if you have a path to a trained one!
- [Stopwords](#Stopwords) based on NLTK's list
    - Supporting: ar, az, da, de, el, en, es, fi, fr, hu, id, it, kk, ne, nl, no, pt, ro, ru, sl, sv, tg & tr
- [Word Frequencies](#WordFrequencies) based on `wordfreq.py` by [LuminosoInsight](https://github.com/LuminosoInsight/wordfreq/).
    - Supporting: ar, cs, de, en, es, fi, fr, it, ja, nl, pl, uk, pt, ru, zh, bg, bn, ca, da, el, fa, he, hi, hu, id, ko, lv, mk, ms, nb, ro, sh, sv & tr. Some of these support "Large Word Frequency"; call `LanguageSupport.<LANG_CODE>.largestWordFrequency()` to see if the `large` variant is available.
- [Stemmer](#Stemmer) based on [Snowball Stemmer](https://snowballstem.org/)
    - Supporting: sv, nl, en, fi, fr, de, hu, it, no, pt, ro, ru, es & tr
- [Trie](#Trie) - Just a basic utility that is to be used for a custom SubwordTokenizer in the future.
- [Vectorizer](#Vectorizer) - Vectorizers & Transformers. Includes BM25, TF-IDF & BagOfWords (CountVectorizer).
- [Classifiers](#Classifiers) - Classifiers that predict a class (label). Includes Logistic Regression & Naïve Bayes.
- [Regression](#Regression) - Linear Regression, a simple regression model.
- [Sequence Classifier](#SequenceClassifier) - Hidden Markov Model to predict sequences like Part of Speech (PoS).
- [Keyword Extraction](#KeywordExtraction) - Co-occurrence statistical keyword extraction.

### Trie

`TrieNode` is the entry point, with the generic signature `TrieNode<K, V>`.

In [None]:
val trie: TrieNode<Char, Int> = TrieNode.ofData(listOf(listOf('h', 'e', 'j') to 3, listOf('h', 'o', 'j') to 15))
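
To illustrate the underlying data structure, here is a minimal sketch of a generic trie in plain Kotlin. `SketchTrieNode`, `insert` and `find` are hypothetical names made up for this example; they are not the toolkit's API, and the real implementation may differ.

```kotlin
// Hypothetical stand-in for illustration; not the toolkit's trie implementation.
class SketchTrieNode<K, V>(var value: V? = null) {
    val children: MutableMap<K, SketchTrieNode<K, V>> = mutableMapOf()

    // Walk the key path, creating missing nodes, and store the value at the end.
    fun insert(key: List<K>, newValue: V) {
        var node = this
        for (part in key) {
            node = node.children.getOrPut(part) { SketchTrieNode() }
        }
        node.value = newValue
    }

    // Follow the key path; returns null if the path or the value is missing.
    fun find(key: List<K>): V? {
        var node = this
        for (part in key) {
            node = node.children[part] ?: return null
        }
        return node.value
    }
}

val sketchTrie = SketchTrieNode<Char, Int>().apply {
    insert("hej".toList(), 3)
    insert("hoj".toList(), 15)
}
val hejValue = sketchTrie.find("hej".toList())    // 3
val missingValue = sketchTrie.find("ha".toList()) // null, since no such path exists
```

Both keys share the `h` node, which is what makes a trie compact for prefix-heavy vocabularies such as subword units.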

### Vectorizer
Vectorizers and Transformers

A `Vectorizer` takes raw string input and outputs a vectorized format.
A `Transformer`, on the other hand, takes vectorized input and outputs a new vectorized format, _transformed_.

The interface looks like the following:
```kotlin
interface Vectorizer<INPUT: Number, OUTPUT: Number> {
    fun fit(input: List<List<String>>): Unit
    fun transform(input: List<List<String>>): D2FloatArray
    fun fitTransform(input: List<List<String>>): D2FloatArray
}

interface Transformer<INPUT: Number, OUTPUT: Number> {
    fun fit(input: MultiArray<Float, D2>): Unit
    fun transform(input: MultiArray<Float, D2>): MultiArray<Float, D2>
    fun fitTransform(input: MultiArray<Float, D2>): MultiArray<Float, D2>
}
```
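
The split between the two roles can be sketched in plain Kotlin. The classes below are hypothetical stand-ins that use `Array<FloatArray>` instead of the toolkit's matrix types, and the smoothed IDF formula is an assumption chosen for the example; the toolkit's own weighting may differ.

```kotlin
import kotlin.math.ln

// Hypothetical vectorizer: tokenized text in, count matrix out.
class SketchCountVectorizer {
    private val vocab = linkedMapOf<String, Int>()

    // Assign each unseen token the next free column index.
    fun fit(input: List<List<String>>) {
        input.flatten().forEach { token -> vocab.getOrPut(token) { vocab.size } }
    }

    // One row per document, one column per vocabulary token; unknown tokens are ignored.
    fun transform(input: List<List<String>>): Array<FloatArray> =
        Array(input.size) { row ->
            FloatArray(vocab.size).also { counts ->
                input[row].forEach { token -> vocab[token]?.let { col -> counts[col] += 1f } }
            }
        }
}

// Hypothetical transformer: count matrix in, re-weighted matrix out.
class SketchTfIdfTransformer {
    private var idf = FloatArray(0)

    // Assumed smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    fun fit(input: Array<FloatArray>) {
        val nDocs = input.size
        val nCols = input.firstOrNull()?.size ?: 0
        idf = FloatArray(nCols) { col ->
            val df = input.count { it[col] > 0f }
            ln((1f + nDocs) / (1f + df)) + 1f
        }
    }

    fun transform(input: Array<FloatArray>): Array<FloatArray> =
        Array(input.size) { row -> FloatArray(idf.size) { col -> input[row][col] * idf[col] } }
}

val sketchDocs = listOf(listOf("hello", "world"), listOf("hello", "again", "hello"))
val sketchCounter = SketchCountVectorizer().also { it.fit(sketchDocs) }
val sketchCounts = sketchCounter.transform(sketchDocs) // rows: [1, 1, 0] and [2, 0, 1]
val sketchWeighted = SketchTfIdfTransformer().also { it.fit(sketchCounts) }.transform(sketchCounts)
```

A token that appears in every document (`hello`) keeps its raw count, while rarer tokens are weighted up.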

All vectorizers are used in the same way, so switching is a matter of replacing the vectorizer with the one you wish to use. Here is a TF-IDF vectorizer:

In [None]:
val simpleTok = SimpleTokenizer()
val simpleTexts = listOf("hello world!", "this is a few sentences")
    .map(simpleTok::split)
val tfidf = TfIdfVectorizer<Float>()

val lhs = tfidf.fitTransform(simpleTexts)
println("Vectorized: $lhs")

### Classifiers
Classifiers predict labels / classes based on a vectorized input.

Interface
```kotlin
interface Classifier: BasePredictor<Int> {
    fun fit(X: MultiArray<Float, D2>, y: D2Array<Int>)
    fun predict(X: MultiArray<Float, D2>): D2Array<Int>
}
```
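
As an illustration of the `fit` / `predict` contract, here is a minimal sketch of multinomial Naïve Bayes with Laplace smoothing in plain Kotlin. `SketchNaiveBayes` is a hypothetical class that uses plain arrays instead of the toolkit's matrix types; it is not the toolkit's implementation.

```kotlin
import kotlin.math.ln

// Hypothetical sketch: plain arrays stand in for the toolkit's matrix types.
class SketchNaiveBayes(private val alpha: Float = 1f) {
    private var logPrior = DoubleArray(0)
    private var logLikelihood = arrayOf<DoubleArray>()

    // X holds one row of token counts per example, y one class index per example.
    fun fit(X: Array<FloatArray>, y: IntArray) {
        val nClasses = (y.maxOrNull() ?: -1) + 1
        val nFeatures = X[0].size
        val featureCounts = Array(nClasses) { DoubleArray(nFeatures) }
        val classCounts = DoubleArray(nClasses)
        for (i in X.indices) {
            classCounts[y[i]] += 1.0
            for (j in 0 until nFeatures) featureCounts[y[i]][j] += X[i][j].toDouble()
        }
        logPrior = DoubleArray(nClasses) { c -> ln(classCounts[c] / X.size) }
        // Laplace smoothing: add alpha to every count so unseen features keep a non-zero probability.
        logLikelihood = Array(nClasses) { c ->
            val total = featureCounts[c].sum() + alpha * nFeatures
            DoubleArray(nFeatures) { j -> ln((featureCounts[c][j] + alpha) / total) }
        }
    }

    // Pick the class with the highest log-posterior for each row.
    fun predict(X: Array<FloatArray>): IntArray = IntArray(X.size) { i ->
        logPrior.indices.maxByOrNull { c ->
            logPrior[c] + X[i].indices.sumOf { j -> X[i][j] * logLikelihood[c][j] }
        } ?: 0
    }
}

val nbTrainX = arrayOf(floatArrayOf(2f, 0f), floatArrayOf(3f, 1f), floatArrayOf(0f, 2f), floatArrayOf(1f, 3f))
val nbTrainY = intArrayOf(0, 0, 1, 1)
val sketchNb = SketchNaiveBayes().also { it.fit(nbTrainX, nbTrainY) }
val nbPredicted = sketchNb.predict(arrayOf(floatArrayOf(4f, 0f), floatArrayOf(0f, 4f))) // [0, 1]
```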

TODO

### Regression
Regressors predict a continuous value based on a continuous input.

Interface
```kotlin
interface Regressor: BasePredictor<Int> {
    fun fit(X: MultiArray<Float, D2>, y: D2Array<Float>)
    fun predict(X: MultiArray<Float, D2>): D2Array<Float>
}
```
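
To make the contract concrete, here is a minimal sketch of linear regression trained with batch gradient descent in plain Kotlin. `SketchLinearRegression`, its learning rate and its epoch count are assumptions made for this example, and plain arrays replace the toolkit's matrix types; the toolkit may fit its model differently.

```kotlin
// Hypothetical sketch: plain arrays stand in for the toolkit's matrix types.
class SketchLinearRegression(
    private val learningRate: Float = 0.05f,
    private val epochs: Int = 2000
) {
    private var weights = FloatArray(0)
    private var bias = 0f

    // Batch gradient descent on half the mean squared error.
    fun fit(X: Array<FloatArray>, y: FloatArray) {
        weights = FloatArray(X[0].size)
        bias = 0f
        repeat(epochs) {
            val gradW = FloatArray(weights.size)
            var gradB = 0f
            for (i in X.indices) {
                val error = predictOne(X[i]) - y[i]
                for (j in weights.indices) gradW[j] += error * X[i][j]
                gradB += error
            }
            for (j in weights.indices) weights[j] -= learningRate * gradW[j] / X.size
            bias -= learningRate * gradB / X.size
        }
    }

    fun predict(X: Array<FloatArray>): FloatArray = FloatArray(X.size) { i -> predictOne(X[i]) }

    // Dot product of one row with the weights, plus the bias.
    private fun predictOne(row: FloatArray): Float =
        bias + row.indices.sumOf { j -> (row[j] * weights[j]).toDouble() }.toFloat()
}

// Toy data following y = 2x + 1.
val regTrainX = arrayOf(floatArrayOf(0f), floatArrayOf(1f), floatArrayOf(2f), floatArrayOf(3f))
val regTrainY = floatArrayOf(1f, 3f, 5f, 7f)
val sketchRegressor = SketchLinearRegression().also { it.fit(regTrainX, regTrainY) }
val regPredicted = sketchRegressor.predict(arrayOf(floatArrayOf(4f))) // close to 9
```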

TODO

### SequenceClassifier
A Sequence Classifier predicts a sequence of labels from an input sequence. For example, it can predict whether each word in a sentence is a verb, a noun or something else.

Currently only Hidden Markov Model (HMM) is supported.
The interface is as follows:

```kotlin
interface SequenceClassifier<T : Number> {
    // Using List<> as the input can be of different sizes between examples
    fun fit(X: List<D1Array<T>>, y: List<D1Array<Int>>)
    fun predict(X: List<D1Array<T>>): List<D1Array<Int>>
}
```
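
Prediction with an HMM is commonly done with Viterbi decoding, which can be sketched in plain Kotlin as below. The function `viterbi`, its parameter layout and all probabilities are made up for this example; this is not the toolkit's implementation, and how the toolkit estimates or stores these tables is not shown here.

```kotlin
import kotlin.math.ln

// Hypothetical sketch of Viterbi decoding over integer-encoded states and observations.
fun viterbi(
    observations: IntArray,
    start: DoubleArray,              // start[s]: P(first state = s)
    transition: Array<DoubleArray>,  // transition[p][s]: P(next = s | previous = p)
    emission: Array<DoubleArray>     // emission[s][o]: P(observation = o | state = s)
): IntArray {
    if (observations.isEmpty()) return IntArray(0)
    val nStates = start.size
    val score = Array(observations.size) { DoubleArray(nStates) }
    val backPointer = Array(observations.size) { IntArray(nStates) }

    // Log-probabilities avoid numeric underflow on long sequences.
    for (s in 0 until nStates) score[0][s] = ln(start[s]) + ln(emission[s][observations[0]])
    for (t in 1 until observations.size) {
        for (s in 0 until nStates) {
            val best = (0 until nStates).maxByOrNull { p -> score[t - 1][p] + ln(transition[p][s]) } ?: 0
            backPointer[t][s] = best
            score[t][s] = score[t - 1][best] + ln(transition[best][s]) + ln(emission[s][observations[t]])
        }
    }

    // Trace the best path backwards from the highest-scoring final state.
    val path = IntArray(observations.size)
    path[path.lastIndex] = (0 until nStates).maxByOrNull { s -> score[observations.size - 1][s] } ?: 0
    for (t in path.lastIndex downTo 1) path[t - 1] = backPointer[t][path[t]]
    return path
}

// Toy model: states 0 = NOUN, 1 = VERB; observations 0 = "dogs", 1 = "run".
val hmmStart = doubleArrayOf(0.8, 0.2)
val hmmTransition = arrayOf(doubleArrayOf(0.3, 0.7), doubleArrayOf(0.6, 0.4))
val hmmEmission = arrayOf(doubleArrayOf(0.9, 0.1), doubleArrayOf(0.2, 0.8))
val hmmTags = viterbi(intArrayOf(0, 1), hmmStart, hmmTransition, hmmEmission) // [0, 1]: NOUN, VERB
```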

TODO

### KeywordExtraction
Keywords can be extracted from a text using co-occurrence statistics; `CooccurrenceKeywords` is the available implementation.

```kotlin
interface Keywords<T: Number> {
    fun keywords(
        text: String,
        top: Int = 10,
        languageSupport: LanguageSupport = LanguageSupport.en
    ): List<Pair<List<String>, T>>
}

object CooccurrenceKeywords: Keywords<Int>
```

The return value is a list of keywords (each a list of tokens), paired with their scores.
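
To give an idea of how co-occurrence scoring can produce a result of that shape, here is a RAKE-style sketch in plain Kotlin. `sketchKeywords`, the tiny stopword set and the degree-based scoring are assumptions made for this example; the toolkit's `CooccurrenceKeywords` may tokenize and score differently.

```kotlin
// Hypothetical RAKE-style sketch; the stopword set and scoring are assumptions, not the toolkit's.
fun sketchKeywords(
    text: String,
    top: Int = 10,
    stopwords: Set<String> = setOf("a", "and", "the", "is", "of", "for", "in", "to")
): List<Pair<List<String>, Int>> {
    val tokens = Regex("[\\p{L}\\p{N}]+|[.,;:!?]").findAll(text.lowercase()).map { it.value }.toList()

    // Split into candidate phrases wherever a stopword or punctuation mark occurs.
    val phrases = mutableListOf<List<String>>()
    var current = mutableListOf<String>()
    for (token in tokens + ".") {
        if (token in stopwords || !token[0].isLetterOrDigit()) {
            if (current.isNotEmpty()) phrases.add(current)
            current = mutableListOf()
        } else {
            current.add(token)
        }
    }

    // A word's score is its degree: the summed size of every phrase it occurs in.
    val degree = mutableMapOf<String, Int>()
    for (phrase in phrases) for (word in phrase) degree[word] = (degree[word] ?: 0) + phrase.size

    // A phrase's score is the sum of its word scores.
    return phrases.distinct()
        .map { phrase -> phrase to phrase.sumOf { degree.getValue(it) } }
        .sortedByDescending { it.second }
        .take(top)
}

val sketchTopKeywords = sketchKeywords(
    "Natural language processing is fun, and natural language toolkits help.",
    top = 2
)
// [([natural, language, toolkits, help], 22), ([natural, language, processing], 17)]
```

Words that appear in several long phrases (`natural`, `language`) accumulate a high degree and pull their phrases to the top.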