<a href="https://colab.research.google.com/github/pragmalingu/experiments/blob/master/02_Embeddings/Experiment/BERT_Embeddings_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bidirectional Encoder Representations from Transformers (BERT)

BERT is an approach that uses large pretrained neural networks to turn texts into vectors (embeddings). We can compare these vectors with a similarity metric such as cosine similarity to estimate how close the meanings of the texts are.
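
To make the similarity metric concrete, here is a minimal sketch of cosine similarity in plain Python. The function name and the toy 3-dimensional vectors are made up for illustration; real BERT embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, not real embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal -> 0.0
```

Only the direction of the vectors matters here, not their length, which is why the first pair scores 1.0 even though one vector is twice as long.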

(By the way, these networks are frequently used as a backbone, or as part of an ensemble of models, to solve NLP tasks such as Question Answering, Ranking, Named Entity Recognition, etc.)

["I'm brave enough to read the paper on BERT"](https://arxiv.org/abs/1810.04805)


### How do we plan to use it?

 - get embeddings (vector representations) of the documents using BERT
 - index them using the k-NN algorithm included in Elasticsearch
 - get embeddings of the queries using BERT (there are relevance labels for query-document pairs)
 - use the relevance labels and the ranking API from Elasticsearch to calculate metrics and compare them with classical approaches
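
As a rough illustration of the last step, the sketch below computes two common ranking metrics from relevance labels in plain Python. It does not use the Elasticsearch ranking API; the function names and document ids are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k returned documents that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical search result (best first) and relevance labels for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(ranked, relevant, k=3))  # 1 relevant doc in the top 3
print(reciprocal_rank(ranked, relevant))      # first relevant doc is at rank 2
```

Averaging such per-query values over all labelled queries gives one number per retrieval approach, which is what makes the comparison with classical approaches possible.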

## Basic Demonstration

### Set Up an Elasticsearch Instance in Google Colab

Everything needed to connect to Elasticsearch is below; for a detailed explanation see [this Notebook.](https://)

Download:

In [None]:
import os
from subprocess import Popen, PIPE, STDOUT
# download elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.1


Start a local server:

In [None]:
# start server
es_server = Popen(['elasticsearch-7.9.1/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
# client-side
!pip install elasticsearch -q
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch(["localhost:9200/"])
# wait a bit so the server can finish starting before we ping it
import time
time.sleep(30)
es.ping()  # got True

[?25l[K     |█                               | 10kB 5.2MB/s eta 0:00:01[K     |██                              | 20kB 8.9MB/s eta 0:00:01[K     |███                             | 30kB 6.0MB/s eta 0:00:01[K     |████                            | 40kB 5.5MB/s eta 0:00:01[K     |█████                           | 51kB 5.5MB/s eta 0:00:01[K     |██████                          | 61kB 5.8MB/s eta 0:00:01[K     |███████▏                        | 71kB 3.8MB/s eta 0:00:01[K     |████████▏                       | 81kB 4.2MB/s eta 0:00:01[K     |█████████▏                      | 92kB 4.7MB/s eta 0:00:01[K     |██████████▏                     | 102kB 5.2MB/s eta 0:00:01[K     |███████████▏                    | 112kB 5.2MB/s eta 0:00:01[K     |████████████▏                   | 122kB 5.2MB/s eta 0:00:01[K     |█████████████▎                  | 133kB 5.2MB/s eta 0:00:01[K     |██████████████▎                 | 143kB 5.2MB/s eta 0:00:01[K     |███████████████▎           

True
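
A fixed 30-second sleep can be too short on a slow machine and wastes time on a fast one. One possible alternative, sketched below, is to poll until the server answers. `wait_until` is a hypothetical helper, not part of the Elasticsearch client.

```python
import time

def wait_until(check, timeout=60.0, interval=1.0):
    """Call check() repeatedly until it returns a truthy value or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # treat errors (e.g. connection refused) as "not ready yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# In this notebook the check would be the client's ping:
# wait_until(es.ping, timeout=60.0)
```

The helper returns `False` instead of raising when the timeout is reached, so the caller decides how to handle a server that never came up.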

In [None]:
# print the current list of indices
create_response = es.cat.indices()
print(create_response)




### Download pretrained BERT model

In [None]:
!pip install -U sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer('bert-base-nli-mean-tokens')

# use the GPU to speed up inference if one is available
if torch.cuda.is_available():
  model.to('cuda')
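
The `mean-tokens` part of the model name refers to mean pooling: the sentence vector is the average of the token vectors. The sketch below illustrates that idea on toy numbers in plain Python. It is not the library's actual implementation, and `mean_pool` is a hypothetical name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding tokens are skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Three toy 2-dimensional token vectors; the last one is padding (mask = 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

Whatever the number of tokens, the pooled vector has the same dimension as a single token vector, which is why every sentence ends up with an embedding of the same length.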

In [None]:
print('Max Sequence Length:', model.max_seq_length)

# Change the limit to the maximum length that fits (based on GPU memory)
model.max_seq_length = 364

print('Max Sequence Length:', model.max_seq_length)

In [None]:
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

In [None]:
for sentence, embedding in zip(sentences, sentence_embeddings):
    print('Sentence:', sentence)
    print('Embedding:', list(embedding[:5]) + ['...'])
    print("Embedding's length:", len(embedding))
    print('')

### Indexing

In [None]:
settings = {
  "settings": {
    "index": {
      "knn": True,
      "knn.space_type": "cosinesimil"
    }
  },
  "mappings": {
    "properties": {
      "bert_vector": {
        "type": "knn_vector",
        "dimension": 768
      }
    }
  }
}

#create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
toy_index = "bert-toy_index"
es.indices.delete(index=toy_index, ignore=[400, 404])
es.indices.create(toy_index, body=settings)

{'acknowledged': True, 'index': 'bert-toy_index', 'shards_acknowledged': True}
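
The `dimension` in the mapping has to match the length of the vectors the model produces (the embedding length printed earlier). A small sketch of building the same body from a variable instead of a hard-coded 768 could look like this; `build_knn_index_body` is a hypothetical helper, and the setting and type names are simply copied from the cell above.

```python
def build_knn_index_body(dimension, field="bert_vector"):
    """Return an index body like the one above, for vectors of the given dimension."""
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    return {
        "settings": {
            "index": {
                "knn": True,
                "knn.space_type": "cosinesimil",
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "knn_vector", "dimension": dimension}
            }
        },
    }

# In this notebook the dimension could come from an encoded sentence,
# e.g. build_knn_index_body(len(sentence_embeddings[0])).
body = build_knn_index_body(768)
print(body["mappings"]["properties"]["bert_vector"])
```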

In [None]:
from tqdm.notebook import tqdm

model.eval()

for i, sentence in tqdm(enumerate(sentences), total=len(sentences)):
  with torch.no_grad():
    if torch.cuda.is_available():
      torch.cuda.ipc_collect()
      torch.cuda.empty_cache()
    es.index(
      index=toy_index, 
      id=i, 
      body={
          'bert_vector': model.encode(sentence).tolist(),  # plain list of floats for JSON
          'text': sentence
          }
    )
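
The loop above sends one request per sentence, which is fine for three sentences. For a larger collection, one option is to encode everything in a batch and send the documents through the client's bulk helper. The sketch below only builds the bulk actions; `bulk_actions` is a hypothetical helper, and the commented call assumes `elasticsearch.helpers.bulk` is used to send them.

```python
def bulk_actions(index_name, texts, vectors):
    """Yield one bulk action per (text, vector) pair, using the position as the id."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    for doc_id, (text, vector) in enumerate(zip(texts, vectors)):
        yield {
            "_index": index_name,
            "_id": doc_id,
            "_source": {
                "bert_vector": [float(x) for x in vector],  # plain floats for JSON
                "text": text,
            },
        }

# Toy vectors stand in for model.encode(sentences) here.
for action in bulk_actions("bert-toy_index", ["first", "second"], [[0.1, 0.2], [0.3, 0.4]]):
    print(action["_id"], action["_source"]["text"])

# In this notebook, something like:
# from elasticsearch import helpers
# helpers.bulk(es, bulk_actions(toy_index, sentences, model.encode(sentences)))
```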





### Searching

In [None]:
test_str = "Where is the fox?"

# test query: k-NN search for the 3 nearest neighbours of the query vector
res = es.search(
    index=toy_index,
    body={
        "query": {
            "knn": {
                "bert_vector": {
                    "vector": list(model.encode(test_str).astype(float)),
                    "k": 3
                    }}}})

print("Got %d Hits:\n\n" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
  print(f"Cosine similarity score: {hit['_score']}  \nid: {hit['_id']}\ntext: {hit['_source']['text']}\n\n")


Got 3 Hits:


Cosine similarity score: 0.6346718  
id: 2
text: The quick brown fox jumps over the lazy dog.


Cosine similarity score: 0.5445194  
id: 0
text: This framework generates embeddings for each input sentence


Cosine similarity score: 0.5397003  
id: 1
text: Sentences are passed as a list of string.
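
When the same search is run for many queries, it can help to separate building the request from reading the response. The sketch below does that with two hypothetical helpers. The query and response shapes are copied from the cell above, and the example response is a hand-written dict that mimics the output shown, not a live result.

```python
def build_knn_query(vector, k, field="bert_vector"):
    """Return a search body with the same k-NN query shape as the cell above."""
    return {
        "query": {
            "knn": {
                field: {"vector": [float(x) for x in vector], "k": k}
            }
        }
    }

def summarize_hits(response):
    """Turn a search response into a list of (id, score, text) tuples, best first."""
    return [
        (hit["_id"], hit["_score"], hit["_source"]["text"])
        for hit in response["hits"]["hits"]
    ]

# Hand-written response with the same structure as the search result above.
fake_response = {
    "hits": {
        "total": {"value": 1},
        "hits": [
            {
                "_id": "2",
                "_score": 0.6346718,
                "_source": {"text": "The quick brown fox jumps over the lazy dog."},
            }
        ],
    }
}
print(build_knn_query([0.1, 0.2], k=3))
print(summarize_hits(fake_response))
```

The list of ids returned by `summarize_hits` is in ranked order, so it can be compared directly against relevance labels.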




In [None]:
# delete the toy index
es.indices.delete(index=toy_index, ignore=[400, 404])

{'acknowledged': True}