In [1]:
@file:DependsOn("com.londogard:nlp:1.2.0-BETA2")

### SentenceEmbeddings

All sentence embeddings implement the same `interface`:
```kotlin
interface SentenceEmbeddings {
    val tokenEmbeddings: Embeddings
    fun getSentenceEmbeddings(listOfSentences: List<List<String>>): List<D1Array<Float>>
    fun getSentenceEmbedding(sentence: List<String>): D1Array<Float>
}
```

The batch method `fun getSentenceEmbeddings(listOfSentences: List<List<String>>)` exists because some sentence embeddings depend on the "global context", i.e. on all the sentences that are embedded together.
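
To illustrate why a batch method is useful, here is a minimal sketch in plain Kotlin with no library calls. It adjusts every sentence vector by a statistic computed over the whole batch. Using the batch mean is an assumption made purely for illustration (the uSIF paper uses SVD instead), and `centerOnBatchMean` is a hypothetical helper, not part of `londogard-nlp`.

```kotlin
// Hypothetical helper (not part of londogard-nlp):
// subtracts the batch mean from every sentence vector.
fun centerOnBatchMean(sentenceVectors: List<FloatArray>): List<FloatArray> {
    require(sentenceVectors.isNotEmpty()) { "need at least one sentence vector" }
    val dim = sentenceVectors.first().size

    // The "global context": one vector computed from all sentences together.
    val mean = FloatArray(dim)
    for (vector in sentenceVectors) {
        for (i in 0 until dim) mean[i] += vector[i] / sentenceVectors.size
    }

    // Each result depends on the whole batch, not only on its own sentence.
    return sentenceVectors.map { vector -> FloatArray(dim) { i -> vector[i] - mean[i] } }
}
```

The output for one sentence changes when the other sentences in the batch change, which is the kind of dependency a per-sentence method cannot express.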

Currently there exist two variants:

1. `AverageSentenceEmbeddings` - a simple average of the token embeddings
2. `USifSentenceEmbeddings` - This implementation is based on the paper: [Unsupervised Random Walk Sentence Embeddings](https://www.aclweb.org/anthology/W18-3012/)

**`AverageSentenceEmbeddings`**

In [2]:
import com.londogard.nlp.embeddings.sentence.*
import com.londogard.nlp.embeddings.*
import com.londogard.nlp.utils.LanguageSupport.*

val embeddings = EmbeddingLoader.fromLanguageOrNull<BpeEmbeddings>(sv)!!
val sentEmbeddings = AverageSentenceEmbeddings(embeddings)

sentEmbeddings.getSentenceEmbedding(listOf("Hej", "där", "borta")) // once again reducing dimensions

[-0.021820415, -0.019619884, -0.19734028, -0.032928836, 0.10354518, 0.34151012, -0.24666698, 0.078181244, -0.08154075, 0.046468385, -0.04787719, -0.15527071, -0.12270149, -0.16071723, 0.12675813, 0.13959172, 0.0065205805, -0.07756825, -0.20538653, -0.17769676, -0.228355, 0.09312665, 0.101570636, 0.006850484, 0.004427894, 0.13834868, 0.10071627, -0.07887914, -0.14899217, -0.08259938, 0.13948171, 0.024621591, 0.18702246, -0.0011664453, 0.051780734, -0.2753331, 0.086110406, -0.19958574, -0.021845786, -0.034773823, -0.10437406, 0.005306831, -0.17182294, 0.13179131, -0.02846826, 0.05326647, 0.42738548, -0.026026608, -0.06756461, 0.07601924]
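
Conceptually, averaging means summing the token vectors and dividing by their count. The sketch below shows that idea in plain Kotlin. It is not the library's implementation: `averageTokens` and the `lookup` map are hypothetical stand-ins for the embedding lookup, and skipping unknown tokens is an assumption.

```kotlin
// Hypothetical sketch of token averaging; `lookup` stands in for an embedding table.
fun averageTokens(tokens: List<String>, lookup: Map<String, FloatArray>, dim: Int): FloatArray {
    // Assumption: tokens without a vector are skipped.
    val vectors = tokens.mapNotNull { lookup[it] }

    val sum = FloatArray(dim)
    for (vector in vectors) {
        for (i in 0 until dim) sum[i] += vector[i]
    }

    // With no known tokens the zero vector is returned.
    return if (vectors.isEmpty()) sum else FloatArray(dim) { i -> sum[i] / vectors.size }
}
```

Every token contributes equally here, which is the main difference from the weighted scheme used by `USifSentenceEmbeddings`.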

**`USifSentenceEmbeddings`**
> Using a random walk model of text generation, Arora et al. (2017) proposed a strong baseline for computing sentence embeddings: take a weighted average of word embeddings and modify with SVD. This simple method even outperforms far more complex approaches such as LSTMs on textual similarity tasks. In this paper, we first show that word vector length has a confounding effect on the probability of a sentence being generated in Arora et al.’s model. We propose a random walk model that is robust to this confound, where the probability of word generation is inversely related to the angular distance between the word and sentence embeddings. Our approach beats Arora et al.’s by up to 44.4% on textual similarity tasks and is competitive with state-of-the-art methods. Unlike Arora et al.’s method, ours requires no hyperparameter tuning, which means it can be used when there is no labelled data.

Note that it requires more input than the other `SentenceEmbeddings`:

```kotlin
import com.londogard.nlp.embeddings.sentence.*

class USifSentenceEmbeddings(
    val tokenEmbeddings: Embeddings,
    private val wordProb: Map<String, Float>,
    randomWalkLength: Int, // = n, ~11
    val numCommonDiscourseVector: Int = 5 // = m, 0 should work. In practice max 5.
) {
    /** use it the same way as SentenceEmbeddings */
}
```

`wordProb` can be obtained through the `WordFrequencies` util, e.g. `WordFrequencies.getAllWordFrequenciesOrNull(sv): Map<String, Float>?`.
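
To show how `wordProb` and `randomWalkLength` (n) might interact, here is a sketch of the uSIF word weights as the paper appears to describe them: a / (a/2 + p(w)), with a = (1 - α) / (α · Z), Z = |V| / 2, and α the share of words whose probability exceeds 1 - (1 - 1/|V|)^n. This is a reading of the paper, not the library's code; `usifWeights` is a hypothetical helper and the formulas should be checked against the paper before relying on them.

```kotlin
import kotlin.math.pow

// Hypothetical helper: uSIF-style word weights derived from unigram probabilities.
// Based on a reading of the paper, not on the londogard-nlp source.
fun usifWeights(wordProb: Map<String, Float>, randomWalkLength: Int = 11): Map<String, Double> {
    require(wordProb.isNotEmpty()) { "wordProb must not be empty" }
    val vocabSize = wordProb.size.toDouble()

    // Probability threshold derived from the vocabulary size and n.
    val threshold = 1.0 - (1.0 - 1.0 / vocabSize).pow(randomWalkLength)
    // Share of the vocabulary whose probability exceeds the threshold.
    val alpha = wordProb.values.count { it > threshold } / vocabSize
    require(alpha > 0.0) { "no word exceeds the threshold; vocabulary is too small for this n" }

    val z = vocabSize / 2.0
    val a = (1.0 - alpha) / (alpha * z)

    // Frequent words get small weights, rare words get large ones.
    return wordProb.mapValues { (_, p) -> a / (0.5 * a + p) }
}
```

With a realistic vocabulary the threshold is small. For a toy vocabulary of a handful of words it approaches 1 when n = 11, so the guard on `alpha` may fire.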

**Example**

In [3]:
import com.londogard.nlp.wordfreq.WordFrequencies
import org.jetbrains.kotlinx.multik.ndarray.data.get

val usifEmbeddings = USifSentenceEmbeddings(embeddings, WordFrequencies.getAllWordFrequenciesOrNull(sv) ?: emptyMap(), 11)
val sentenceEmbedded = usifEmbeddings.getSentenceEmbedding(listOf("Hej", "där", "borta"))

println("Sentence embedded: ${sentenceEmbedded::class}\nSentence Embedded (first 10 floats): ${sentenceEmbedded?.get(0..10)}")

Sentence embedded: class org.jetbrains.kotlinx.multik.ndarray.data.NDArray
Sentence Embedded (first 10 floats): [-0.03540576, -0.08672718, -0.2572643, -0.107719705, -0.06377515, 0.4695419, -0.15353657, -0.037024133, 0.102188826, 0.012642855]
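
Sentence embeddings are typically compared with cosine similarity, which is what the textual similarity tasks mentioned in the paper measure. The sketch below works on plain `FloatArray`s and uses no library calls; `cosineSimilarity` is a hypothetical helper, and converting a `D1Array<Float>` into a `FloatArray` is left out on purpose.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper: cosine similarity between two equally sized vectors.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "vectors must have the same dimension" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    val denominator = sqrt(normA) * sqrt(normB)
    // A zero vector has no direction; return 0 rather than dividing by zero.
    return if (denominator == 0f) 0f else dot / denominator
}
```

Values close to 1 indicate sentences pointing in a similar direction in embedding space, values near 0 indicate unrelated ones.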
