## BERT Embeddings Serverless Function
This notebook presents the deployment, as a serverless function, of a pretrained BERT model that outputs embeddings for given text sequences. Embeddings are meaningful, contextual representations of text in the form of ndarrays, and they are frequently used as input to various learning tasks in NLP.

## Embeddings without BERT

[One-Hot Encoding](https://en.wikipedia.org/wiki/One-hot) is a general method that can vectorize any categorical feature. The vectorization is simple and fast to create and update.<br>
In the case of <b>text</b> embeddings, each <b>row</b> is a <b>sentence</b> and each <b>column</b> is a <b>word/char/[n-gram](https://en.wikipedia.org/wiki/N-gram)</b>.

In [1]:
# some sentences to examine
sentences = ['the quick brown fox jumps over the lazy dog',
              'Hello I am Jacob',
              'Daniel visited Tel-Aviv last month']

Let's see the difference between BERT embeddings and one-hot encoding, starting with one-hot encoding.

In [2]:
# constructing a list of all the words (will be our columns) - make sure no duplicate words are added
tokens = []
for sentence in sentences:
    for word in sentence.split():
        if word not in tokens:
            tokens.append(word)
print(tokens)

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog', 'Hello', 'I', 'am', 'Jacob', 'Daniel', 'visited', 'Tel-Aviv', 'last', 'month']


In [3]:
# constructing the one-hot vectors
import pandas as pd
import numpy as np

# building one encoding row per sentence
rows = []
for sentence in sentences:
    vector = np.zeros(len(tokens))
    for word in sentence.split():
        vector[tokens.index(word)] = 1
    rows.append(vector)
# DataFrame.append was removed in pandas 2.0, so the frame is built from all rows at once
one_hot = pd.DataFrame(rows, columns=tokens)

In [4]:
one_hot

Unnamed: 0,the,quick,brown,fox,jumps,over,lazy,dog,Hello,I,am,Jacob,Daniel,visited,Tel-Aviv,last,month
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0


The table above represents the one-hot encoding of our sentences: each row is a sentence and each column is a word.
This representation is very sparse and carries no information about word meaning, order, or context, so it makes for a very weak learning dataset.
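
One way to see why this representation is weak is to measure how similar the encoded sentences are to each other. The sketch below is a minimal, pure-Python illustration and not part of the original notebook: the helper names `one_hot_row` and `cosine_similarity` are made up for this example. It rebuilds the same word-level encoding and computes the cosine similarity between every pair of rows, along with the share of zero entries.

```python
import math

sentences = [
    "the quick brown fox jumps over the lazy dog",
    "Hello I am Jacob",
    "Daniel visited Tel-Aviv last month",
]

# Unique whitespace-separated words, kept in first-seen order (the columns)
vocab = list(dict.fromkeys(word for s in sentences for word in s.split()))


def one_hot_row(sentence):
    """Binary vector marking which vocabulary words occur in the sentence."""
    words = set(sentence.split())
    return [1.0 if token in words else 0.0 for token in vocab]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rows = [one_hot_row(s) for s in sentences]

# Pairwise similarity matrix between the three encoded sentences
similarities = [[cosine_similarity(a, b) for b in rows] for a in rows]

# Fraction of entries in the table that are zero
sparsity = sum(v == 0.0 for row in rows for v in row) / (len(rows) * len(vocab))

for row in similarities:
    print([round(v, 3) for v in row])
print(f"share of zero entries: {sparsity:.2f}")
```

Because the three sentences share no words, every off-diagonal similarity is exactly zero: the encoding cannot express that any two sentences are related, however close their meaning might be.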

## Introducing BERT embeddings

In [5]:
from mlrun import import_function, auto_mount

In [6]:
# importing the function from the hub
fn = import_function("hub://bert_embeddings").apply(auto_mount())

In [7]:
# deploying the function
addr = fn.deploy()

> 2023-02-02 09:29:59,002 [info] Starting remote function deploy
2023-02-02 09:29:59  (info) Deploying function
2023-02-02 09:29:59  (info) Building
2023-02-02 09:29:59  (info) Staging files and preparing base images
2023-02-02 09:29:59  (info) Building processor image
2023-02-02 09:32:09  (info) Build complete
2023-02-02 09:32:35  (info) Function deploy complete
> 2023-02-02 09:32:36,059 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-default-bert-embeddings.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['default-bert-embeddings-default.default-tenant.app.cto-office.iguazio-cd1.com/']}


In [8]:
import requests
import json
# sending a request to the function endpoint to get the sentences' embeddings
resp = requests.post(addr, json=json.dumps(sentences))

In [9]:
import pickle
output_embeddings = pickle.loads(resp.content)

In [10]:
print(f'embeddings per token shape: {output_embeddings[0].shape}, pooled embeddings shape: {output_embeddings[1].shape}')

embeddings per token shape: (3, 11, 768), pooled embeddings shape: (3, 768)


In [11]:
pd.DataFrame(output_embeddings[1])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,-0.733322,-0.22354,0.342462,0.383463,-0.164796,0.040522,0.802845,0.152842,0.331639,-0.999779,...,0.206564,0.231415,0.196433,0.797908,0.435175,0.74937,0.246098,0.427603,-0.577384,0.842063
1,-0.953005,-0.535132,-0.743822,0.893934,0.646276,-0.279388,0.943513,0.275504,-0.555109,-0.999992,...,0.582386,-0.004614,0.976079,0.931517,-0.391442,0.530384,0.675933,-0.682721,-0.746339,0.957809
2,-0.843678,-0.453405,-0.826011,0.650805,0.494036,-0.154117,0.821642,0.349507,-0.650629,-0.999978,...,0.618286,-0.3367,0.936262,0.857577,-0.787489,0.246137,0.676243,-0.612532,-0.708786,0.840879


We can see that the first dimension of both outputs has size three, since we passed in three sequences. The intermediate dimension of the first output is the maximal number of tokens across all input sequences; sequences with fewer tokens are padded with zero values.<br>
Note that the first output has an intermediate dimension of size 11, which corresponds to the number of tokens in the longest input sequence after the tokenizer adds two special tokens marking the beginning and end of the sequence.<br>
The last dimension of both outputs has size 768, which is the embedding dimension for this default configuration of BERT.<br>
Now you tell me: which encoding are you going to use in your project?
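
To make the shape arithmetic above concrete, here is a rough sketch of how the intermediate dimension of 11 arises. It is a simplification under stated assumptions, not the deployed function's code: `simple_tokenize` and `pad_batch` are hypothetical helpers, whitespace splitting stands in for the real BERT tokenizer (which may split a word into several sub-tokens), and the `[CLS]`, `[SEP]`, and `[PAD]` strings are used only as illustrative markers.

```python
sentences = [
    "the quick brown fox jumps over the lazy dog",
    "Hello I am Jacob",
    "Daniel visited Tel-Aviv last month",
]

# Illustrative markers for the begin/end special tokens and the padding slot
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def simple_tokenize(sentence):
    """Whitespace tokenization wrapped in begin/end markers (a simplification)."""
    return [CLS] + sentence.split() + [SEP]


def pad_batch(token_lists):
    """Right-pad every token list to the length of the longest one."""
    max_len = max(len(tokens) for tokens in token_lists)
    return [tokens + [PAD] * (max_len - len(tokens)) for tokens in token_lists]


batch = pad_batch([simple_tokenize(s) for s in sentences])

hidden_size = 768  # embedding width reported in the output above

# Shapes implied by the padded batch
per_token_shape = (len(batch), len(batch[0]), hidden_size)
pooled_shape = (len(batch), hidden_size)

print(per_token_shape, pooled_shape)
```

The longest sentence has nine words, so adding the two special tokens gives 11 positions, and the two shorter sentences are padded up to that length. With the real tokenizer the per-sentence token counts can differ from this whitespace approximation.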