In [1]:
@file:DependsOn("com.londogard:nlp:1.2.0-BETA2")

### WordEmbeddings
Word embeddings currently exist in three variants.

All embeddings (excluding `SentenceEmbeddings`) extend `Embeddings`, which has some good-to-know default methods:
```kotlin
interface Embeddings {
    fun contains(word: String): Boolean = embeddings.contains(word)
    fun vector(word: String): D1Array<Float>? = embeddings[word]

    // NOTE: returns every vector that could be found, not necessarily one per input word
    fun traverseVectors(words: List<String>): List<D1Array<Float>>
    fun traverseVectorsOrNull(words: List<String>): List<D1Array<Float>>?
    
    fun euclideanDistance(w1: String, w2: String): Float?
    fun cosineDistance(w1: String, w2: String): Float?
}
```
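
To make the two distance methods concrete, here is a minimal sketch in plain Kotlin. It is not the library's implementation: `FloatArray` stands in for `D1Array<Float>`, the helper names are hypothetical, and "cosine distance" is assumed to mean `1 - cosine similarity`, which may differ from what the library returns.

```kotlin
import kotlin.math.sqrt

// Hypothetical helpers for illustration; FloatArray stands in for D1Array<Float>.
fun dot(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var sum = 0f
    for (i in a.indices) sum += a[i] * b[i]
    return sum
}

// Straight-line distance between two vectors.
fun euclideanDistance(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var sum = 0f
    for (i in a.indices) {
        val diff = a[i] - b[i]
        sum += diff * diff
    }
    return sqrt(sum)
}

// Assumption: cosine distance = 1 - cosine similarity (undefined for zero vectors).
fun cosineDistance(a: FloatArray, b: FloatArray): Float =
    1f - dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))

// Mirrors the nullable return type of the interface: null if either word is unknown.
fun cosineDistanceOrNull(vectors: Map<String, FloatArray>, w1: String, w2: String): Float? {
    val v1 = vectors[w1] ?: return null
    val v2 = vectors[w2] ?: return null
    return cosineDistance(v1, v2)
}
```

The nullable return type is what lets a caller distinguish "word not in the vocabulary" from a genuine distance value.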

#### WordEmbedding

`WordEmbeddings` is the classical use case for embeddings, where each word maps to a vector of floats. The class provides a few helper methods. It currently requires the embedding file to be available locally.
Download functions for [fastText](https://fasttext.cc/) will come, but be warned: the files are large!

```kotlin
class WordEmbeddings(val delimeter: Char = ' ', val dimensions: Int, val filePath: Path) {
    fun nearestNeighbour(vector: SimpleMatrix, N: Int): List<Pair<String, Double>>
    fun distance(input: List<String>, N: Int): List<Pair<String, Double>>
    fun analogy(w1: String, w2: String, w3: String, N: Int): List<Pair<String, Double>>?
    fun rank(word: String, set: Set<String>): List<Pair<String, Double>>
}
```
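
As a rough illustration of what helpers like `nearestNeighbour` and `analogy` typically compute, the sketch below ranks words by cosine similarity over a hand-made toy vocabulary. Everything here is an assumption for illustration: the vectors are invented, `FloatArray` replaces `SimpleMatrix`, and the analogy is taken to be `w1 - w2 + w3`, which may not match the library's argument order.

```kotlin
import kotlin.math.sqrt

// Toy 3-d vectors made up for this example; these are not real embeddings.
val toyVectors: Map<String, FloatArray> = mapOf(
    "king" to floatArrayOf(1f, 1f, 0f),
    "queen" to floatArrayOf(1f, 0f, 1f),
    "man" to floatArrayOf(0f, 1f, 0f),
    "woman" to floatArrayOf(0f, 0f, 1f),
    "child" to floatArrayOf(0f, 0.5f, 0.5f)
)

fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i].toDouble() * b[i]
        normA += a[i].toDouble() * a[i]
        normB += b[i].toDouble() * b[i]
    }
    return dot / sqrt(normA * normB)
}

// Rank every word (except the excluded ones) by similarity to the target vector.
fun nearestNeighbours(
    vectors: Map<String, FloatArray>,
    target: FloatArray,
    n: Int,
    exclude: Set<String> = emptySet()
): List<Pair<String, Double>> =
    vectors
        .filterKeys { it !in exclude }
        .map { (word, vector) -> word to cosineSimilarity(target, vector) }
        .sortedByDescending { it.second }
        .take(n)

// Assumed convention: target = w1 - w2 + w3; null if any word is unknown.
fun analogy(
    vectors: Map<String, FloatArray>,
    w1: String, w2: String, w3: String, n: Int
): List<Pair<String, Double>>? {
    val v1 = vectors[w1] ?: return null
    val v2 = vectors[w2] ?: return null
    val v3 = vectors[w3] ?: return null
    val target = FloatArray(v1.size) { i -> v1[i] - v2[i] + v3[i] }
    return nearestNeighbours(vectors, target, n, exclude = setOf(w1, w2, w3))
}
```

With these toy vectors, `analogy(toyVectors, "king", "man", "woman", 1)` ranks `"queen"` first, since the input words themselves are excluded from the candidates.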

#### LightWordEmbedding

`LightWordEmbeddings` is something we at Londogard created so that our embeddings could be loaded onto a Raspberry Pi 3B+ (1 GB RAM). What makes `LightWordEmbeddings` effective is that roughly 10% of all words make up 90% of all communication, i.e. we can load only a few words up front and load the rest ad hoc, which keeps RAM usage low and performance high.

```kotlin
class LightWordEmbeddings(val delimeter: Char = ' ', val dimensions: Int, val filePath: Path, val maxWordCount: Int = 1000) {
}
```
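
One way to picture the "load a few, fetch the rest ad hoc" idea is a bounded cache in front of a slower lookup. The sketch below is a hypothetical illustration, not how `LightWordEmbeddings` is actually implemented: `BoundedVectorCache` and `loadFromDisk` are invented names, and least-recently-used eviction is just one possible policy.

```kotlin
import java.util.LinkedHashMap

// Hypothetical sketch; the real LightWordEmbeddings may work differently.
class BoundedVectorCache(
    private val maxWordCount: Int,
    // Stand-in for an ad-hoc lookup in the embedding file.
    private val loadFromDisk: (String) -> FloatArray?
) {
    // accessOrder = true keeps entries ordered from least to most recently used.
    private val cache = object : LinkedHashMap<String, FloatArray>(16, 0.75f, true) {
        override fun removeEldestEntry(
            eldest: MutableMap.MutableEntry<String, FloatArray>?
        ): Boolean = size > maxWordCount
    }

    // Counts how often the slow path was taken.
    var diskLoads = 0
        private set

    fun vector(word: String): FloatArray? {
        cache[word]?.let { return it }              // fast path: already in memory
        val loaded = loadFromDisk(word) ?: return null
        diskLoads++
        cache[word] = loaded                        // may evict the least recently used word
        return loaded
    }

    fun cachedWords(): Set<String> = cache.keys.toSet()
}

// Usage with an in-memory map standing in for the file on disk.
val backing = mapOf(
    "a" to floatArrayOf(1f),
    "b" to floatArrayOf(2f),
    "c" to floatArrayOf(3f)
)
val lightCache = BoundedVectorCache(maxWordCount = 2) { word -> backing[word] }
listOf("a", "b", "a", "c").forEach { lightCache.vector(it) }
```

After these four lookups the cache holds `a` and `c`: `b` was the least recently used entry when `c` arrived, so it was evicted.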

#### Byte-Pair Encoding Embedding (BPEmb)
Byte-Pair Encoding embeddings are a new approach from _BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages_ by Benjamin Heinzerling and Michael Strube. In the paper they show that an 11 MB embedding for English is on par with 6 GB embeddings from `fastText`!

It's important to use the same tokenizer configuration that the embeddings were trained with!
```kotlin
class BpeEmbeddings {
    fun subwordVector(subword: String): SimpleMatrix?
}
```
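
The sketch below illustrates why the tokenizer configuration matters. It assumes, purely for illustration, that a word vector is formed by averaging the vectors of its subword pieces; the library may combine them differently. The pieces and vectors are invented, `averageSubwordVectors` is a hypothetical helper, and `▁` is the word-start marker used by SentencePiece-style tokenizers.

```kotlin
// Invented subword vocabulary; real BPEmb vectors have far more dimensions.
val subwordVectors: Map<String, FloatArray> = mapOf(
    "▁he" to floatArrayOf(1f, 3f),
    "j" to floatArrayOf(3f, 5f)
)

// Hypothetical helper: average the vectors of all pieces that are in the vocabulary.
// Returns null when none of the pieces is known.
fun averageSubwordVectors(
    pieces: List<String>,
    vectors: Map<String, FloatArray>
): FloatArray? {
    val found = pieces.mapNotNull { vectors[it] }
    if (found.isEmpty()) return null
    val dims = found.first().size
    return FloatArray(dims) { i ->
        (found.sumOf { it[i].toDouble() } / found.size).toFloat()
    }
}

// Pieces produced by the matching tokenizer configuration are all found.
val matching = averageSubwordVectors(listOf("▁he", "j"), subwordVectors)

// Pieces from a different configuration are missing from the vocabulary.
val mismatched = averageSubwordVectors(listOf("▁h", "ej"), subwordVectors)
```

Here `matching` is a usable vector while `mismatched` is `null`: a tokenizer that splits words differently produces pieces the embedding table has never seen.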

#### EmbeddingLoader
This utility helps us download embeddings automatically. Let's see how it's done!

In [2]:
import com.londogard.nlp.embeddings.*
import com.londogard.nlp.utils.LanguageSupport.*
import kotlin.system.measureTimeMillis
import org.jetbrains.kotlinx.multik.ndarray.data.get

// EmbeddingLoader.fromLanguageOrNull<WordEmbeddings>(sv) // WordEmbeddings
EmbeddingLoader.fromLanguageOrNull<BpeEmbeddings>(sv) // BpeEmbeddings (vocabSize: 10_000, dim: 50)
// EmbeddingLoader.fromLanguageOrNull<LightWordEmbeddings>(sv) // LightWordEmbeddings (size: 1000)

measureTimeMillis {
    val embeddings = EmbeddingLoader.fromLanguageOrNull<BpeEmbeddings>(sv)
    println(embeddings?.vector("Hej")?.get(0..10))
}.let { ms -> println("Loading BpeEmbeddings + retrieving 'Hej' took $ms milliseconds") }

[0.04173301, 0.103647575, 0.09480104, 0.039651364, 0.3215342, 0.032999773, -0.2786558, 0.16301134, -0.2583407, 0.07777843]
Loading BpeEmbeddings + retrieving 'Hej' took 472 milliseconds


**Using other Embeddings**

Simply change the type parameter of `fromLanguageOrNull<T>` to the embedding class you wish to use, e.g. `fromLanguageOrNull<WordEmbeddings>(sv)` or `fromLanguageOrNull<LightWordEmbeddings>(sv)`.