The previous two notebooks might have gotten your attention, but the response we usually get is:

> But what about BERT-embeddings? 

Let's explain how to get there, but first ... we should explain languages.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from whatlies import Embedding, EmbeddingSet
import spacy 
import matplotlib.pylab as plt

## Multi-Token Embeddings

We can also have embeddings that represent more than one token. If we do this via spaCy, we get the average of all the word embeddings in the text.
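
To make the averaging concrete, here is a minimal sketch with made-up three-dimensional token vectors (real spaCy vectors are NumPy arrays with many more dimensions). The helper `average_vectors` and the toy numbers are purely illustrative assumptions and are not part of spaCy or `whatlies`.

```python
def average_vectors(vectors):
    """Return the element-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


# Made-up vectors standing in for the word embeddings of two tokens.
token_vectors = {
    "happy": [1.0, 0.0, 2.0],
    "joy": [0.0, 3.0, 1.0],
}

# The "document" is a sequence of tokens; its vector is the mean of theirs.
tokens = ["happy", "happy", "joy"]
doc_vector = average_vectors([token_vectors[t] for t in tokens])
print(doc_vector)
```

One consequence of plain averaging is that word order is lost: `happy joy` and `joy happy` would end up with the same vector under this scheme.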

In [4]:
from whatlies.language import SpacyLanguage
from whatlies.transformers import Pca 

lang = SpacyLanguage("en_core_web_sm")

contexts = ["i am super duper happy",
            "happy happy joy joy",
            "programming is super fun!",
            "i am going crazy i hate it",
            "boo and hiss",]

emb = lang[contexts]
emb.transform(Pca(2)).plot_interactive('pca_0', 'pca_1').properties(width=400, height=400)

In [5]:
nlp = spacy.load("en_core_web_md")

contexts = ("this snake is a python",
            "i like to program in python",
            "programming is super fun!",
            "i go to the supermarket",
            "i like to code", 
            "i love animals")

emb = EmbeddingSet({k: Embedding(k, nlp(k).vector) for k in contexts})

In [6]:
x_str, y_str = "python is for programming", "snakes are slimy creatures"
x_axis = Embedding(x_str, nlp(x_str).vector)
y_axis = Embedding(y_str, nlp(y_str).vector)
emb.plot_interactive(x_axis=x_axis, y_axis=y_axis)

## Embeddings of Tokens with Context

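But maybe we'd like to use BERT-style models. These models work differently: the embedding of a token depends on the text around it. Luckily, spaCy also supports these models these days.

Note that you'll need to download and install this model first. The model name below belongs to the spaCy 2.x generation of `spacy-transformers`. You can install it by running: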

```
pip install spacy-transformers
python -m spacy download en_trf_robertabase_lg
```

In [7]:
nlp = spacy.load("en_trf_robertabase_lg")

contexts = ("this snake is a python",
            "i like to program in python",
            "programming is super fun!",
            "i go to the supermarket",
            "i like to code", 
            "i love animals")

t = EmbeddingSet({k: Embedding(k, nlp(k).vector) for k in contexts})

x_str, y_str = "python is for programming", "dogs are cool"
x_axis = Embedding(x_str, nlp(x_str).vector)
y_axis = Embedding(y_str, nlp(y_str).vector)
t.plot_interactive(x_axis=x_axis, y_axis=y_axis)

We can go a step further too. If we have the sentence `this snake is a python`, then an algorithm like BERT will not look up a separate, fixed word embedding for each token. Rather, the model first processes the entire document and only then assigns a representation to each separate token. If you are interested in the BERT representation of a word given the context that it is in, you can get it with a special syntax: wrap the word (or words) of interest in square brackets.

In [8]:
contexts = ("i put my money on the [bank]",
            "i put my money on the bank",
            "the water flows on the river [bank]",
            "the water flows on the river bank",
            "i really like [to swim] in water",
            "i want to be so rich that i am [drowning] in money",
            "i have plenty of [cash] on me",
            "money is important to my [cash] flow", 
            "a beach is next to the ocean", 
            "google gives me a wealth of information",
            "that banker person is very wealthy", 
            "i like cats and dogs")

But to make use of this syntax we need a new object: the `Language` object. This is a tool that lets `whatlies` grab the appropriate word embeddings on your behalf. It handles the context and can also be seen as a lazy `EmbeddingSet`.
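
One way to read the bracket convention is "embed the whole sentence, but return the vector that belongs to the bracketed part". The sketch below only illustrates that reading with a hypothetical helper, `split_context`. It is an assumption for illustration and not how `whatlies` implements the lookup internally.

```python
import re


def split_context(text):
    """Split 'a [b c] d' into the bracket-free text and the bracketed part.

    Returns (clean_text, target); target is None when there are no brackets.
    """
    match = re.search(r"\[(.*?)\]", text)
    if match is None:
        return text, None
    clean_text = text.replace("[", "").replace("]", "")
    return clean_text, match.group(1)


print(split_context("i put my money on the [bank]"))
print(split_context("i really like [to swim] in water"))
print(split_context("i like cats and dogs"))
```

The first element is what the model would see as context; the second is the span whose representation we are asking for. Without brackets there is no target, which corresponds to asking for an embedding of the whole text.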

In [9]:
import numpy as np
from whatlies.language import SpacyLanguage

lang = SpacyLanguage("en_trf_robertabase_lg")

In [10]:
lang['red'].vector[:10]

array([-0.10710682,  0.02091791,  0.06176348, -0.21725698,  0.6821515 ,
       -0.39927804, -0.05537288,  0.14486104, -0.00605936, -0.07999073],
      dtype=float32)

Note that these embeddings are kind of special: they depend on the context around the token of interest!

In [11]:
np.array_equal(lang['Going to the [store]'].vector, 
               lang['[store] this in the drawer please.'].vector)

False
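
`np.array_equal` only tells us that the two vectors are not identical. To get a feel for how far apart two vectors are, cosine similarity is a common choice. Here is a minimal sketch, using a hypothetical helper and made-up vectors rather than real model output:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up stand-ins for the two context-dependent "store" vectors.
store_as_noun = [0.9, 0.1, 0.3]
store_as_verb = [0.2, 0.8, 0.4]

print(cosine_similarity(store_as_noun, store_as_noun))
print(cosine_similarity(store_as_noun, store_as_verb))
```

With the objects from this notebook, the two `.vector` attributes from the cell above could be passed in place of the toy lists.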

But we can also use the `EmbeddingSet` again. 

In [13]:
from whatlies.transformers import Umap

t = EmbeddingSet({k: lang[k] for k in contexts}).transform(Umap(2))

  "n_neighbors is larger than the dataset size; truncating to "


In [14]:
p1 = t.plot_interactive("i like cats and dogs", "i put my money on the [bank]")
p2 = t.plot_interactive("i like cats and dogs", "i put my money on the bank")
p1 | p2