# Embedding Based Retrieval: Text

# About

In this notebook we will do the following:
- Create a text corpus by using captions
- Use existing embedding model `sentence-transformers/all-MiniLM-L6-v2`
- Explore the retrieval results of the model 

# About the Embedding model

About the model `sentence-transformers/all-MiniLM-L6-v2`


From the [Docs](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

It maps sentences & paragraphs to a `384` dimensional dense vector space and can be used for tasks like clustering or semantic search.

The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised contrastive learning objective.    

We fine-tuned it on a 1B sentence pairs dataset.

We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
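The objective described above can be illustrated with a small NumPy sketch. This is not the authors' training code: the function name `in_batch_contrastive_loss`, the `scale` value of 20, and the random stand-in embeddings are all assumptions made for illustration. Each row of `anchors` is scored against every row of `positives`, and a cross-entropy loss rewards the true partner (the diagonal) over the other sentences in the batch.

```python
import numpy as np

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch candidates: row i of `anchors` should pick row i of `positives`."""
    # L2-normalize so that dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    # (batch, batch) matrix of scaled similarities; `scale` is an assumed temperature
    logits = scale * (a @ p.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Log-softmax over each row; the correct partner sits on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 384))                    # stand-in sentence embeddings
aligned = anchors + 0.01 * rng.normal(size=(8, 384))   # near-duplicates: correct partners
shuffled = aligned[::-1]                               # every row now has the wrong partner

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_shuffled = in_batch_contrastive_loss(anchors, shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is small when each sentence is closest to its true partner and large when the pairing is scrambled, which is the signal the model is trained on.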






| **Dataset**                                                                                                                      | **Number of training tuples** |
| -------------------------------------------------------------------------------------------------------------------------------- | ----------------------------- |
| [Reddit comments (2015-2018)](https://github.com/PolyAI-LDN/conversational-datasets/tree/master/reddit)                          | 726,484,430                   |
| [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Abstracts)                                                             | 116,288,806                   |
| [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs                                         | 77,427,422                    |
| [PAQ](https://github.com/facebookresearch/PAQ) (Question, Answer) pairs                                                          | 64,371,441                    |
| [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Titles)                                                                | 52,603,982                    |
| [S2ORC](https://github.com/allenai/s2orc) (Title, Abstract)                                                                      | 41,769,185                    |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs                 | 25,316,456                    |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title+Body, Answer) pairs          | 21,396,559                    |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs               | 21,396,559                    |
| [MS MARCO](https://microsoft.github.io/msmarco/) triplets                                                                        | 9,144,553                     |
| [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq)                                     | 3,012,496                     |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer)                                      | 1,198,260                     |
| [Code Search](https://huggingface.co/datasets/code_search_net)                                                                   | 1,151,414                     |
| [COCO](https://cocodataset.org/#home) Image captions                                                                             | 828,395                       |
| [SPECTER](https://github.com/allenai/specter) citation triplets                                                                  | 684,100                       |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer)                                   | 681,164                       |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question)                                    | 659,896                       |
| [SearchQA](https://huggingface.co/datasets/search_qa)                                                                            | 582,261                       |
| [Eli5](https://huggingface.co/datasets/eli5)                                                                                     | 325,475                       |
| [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/)                                                                   | 317,695                       |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles)        | 304,525                       |
| AllNLI ([SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/))                 | 277,230                       |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies)        | 250,519                       |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies) | 250,460                       |
| [Sentence Compression](https://github.com/google-research-datasets/sentence-compression)                                         | 180,000                       |
| [Wikihow](https://github.com/pvl/wikihow_pairs_dataset)                                                                          | 128,542                       |
| [Altlex](https://github.com/chridey/altlex/)                                                                                     | 112,696                       |
| [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs)                                | 103,663                       |
| [Simple Wikipedia](https://cs.pomona.edu/~dkauchak/simplification/)                                                              | 102,225                       |
| [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions)                                                        | 100,231                       |
| [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/)                                                                          | 87,599                        |
| [TriviaQA](https://huggingface.co/datasets/trivia_qa)                                                                            | 73,346                        |
| **Total**                                                                                                                        | **1,170,060,424**             |

Other models listed on this page are also good candidates:

https://www.sbert.net/docs/pretrained_models.html#model-overview

## Setup

In [1]:
from sentence_transformers import SentenceTransformer

import datasets
import rich
from IPython.display import Image, JSON
from IPython.core.display import HTML
import numpy as np

from transformers import AutoTokenizer
import ipywidgets as widgets
from ipywidgets import interact
import ipyplot
import os

In [2]:
dset = datasets.load_from_disk("../data/processed")

In [3]:
dset

Dataset({
    features: ['photo_id', 'photo_url', 'photo_image_url', 'photo_submitted_at', 'photo_featured', 'photo_width', 'photo_height', 'photo_aspect_ratio', 'photo_description', 'photographer_username', 'photographer_first_name', 'photographer_last_name', 'exif_camera_make', 'exif_camera_model', 'exif_iso', 'exif_aperture_value', 'exif_focal_length', 'exif_exposure_time', 'photo_location_name', 'photo_location_latitude', 'photo_location_longitude', 'photo_location_country', 'photo_location_city', 'stats_views', 'stats_downloads', 'ai_description', 'ai_primary_landmark_name', 'ai_primary_landmark_latitude', 'ai_primary_landmark_longitude', 'ai_primary_landmark_confidence', 'blur_hash', 'description_final', 'image', 'image_path_local'],
    num_rows: 24992
})

In [4]:
rich.print ( dset[0] )

## Model

In [5]:
model_name = 'sentence-transformers/all-MiniLM-L6-v2'

In [6]:

# Initialize retriever with SentenceTransformer model 
model = SentenceTransformer(model_name)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

## Tokenizer

In [46]:
text = "A man and his pet dog Perrito playdingd catch in the beach."

In [47]:
resp = model.encode(text, output_value=None)
rich.print(resp)

Number of characters in the text (`len(text)` counts characters, not words; splitting on whitespace would give 12 words)

In [48]:
len(text)

59

Number of tokens according to the model

In [49]:
resp['input_ids'].shape 

torch.Size([18])

Each token has its own `384`-dimensional embedding

In [50]:
resp['token_embeddings'].shape


torch.Size([18, 384])

All the token embeddings are pooled (mean pooling, as shown in the model summary) to generate a single sentence embedding

In [51]:
resp['sentence_embedding'].shape

torch.Size([384])
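To make the pooling step concrete, here is a minimal NumPy sketch of masked mean pooling followed by L2 normalization, mirroring the `Pooling` (mean tokens) and `Normalize` modules in the model summary. The helper `mean_pool` and the random token embeddings are hypothetical stand-ins, not the library's internal implementation.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of real (non-padding) tokens, then L2-normalize."""
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (tokens, 1)
    summed = (token_embeddings * mask).sum(axis=0)                 # padding rows contribute 0
    count = max(mask.sum(), 1e-9)                                  # avoid division by zero
    pooled = summed / count
    return pooled / np.linalg.norm(pooled)

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(18, 384))   # stand-in for resp['token_embeddings']
attention_mask = np.ones(18, dtype=int)         # all 18 tokens are real tokens

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding.shape)  # (384,)
```

The attention mask matters once sentences are batched and padded: padded positions carry a mask of 0 and are excluded from the average.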

### Explore the tokenizer more

In [52]:
tokenizer = AutoTokenizer.from_pretrained(model_name)


The number of unique tokens (words and sub-word pieces) in the tokenizer's vocabulary

In [53]:
tokenizer.vocab_size

30522

In [56]:
resp = tokenizer(text)
resp

{'input_ids': [101, 1037, 2158, 1998, 2010, 9004, 3899, 2566, 28414, 2377, 4667, 2094, 4608, 1999, 1996, 3509, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [55]:
tokens = tokenizer.tokenize(text,add_special_tokens=True)
rich.print ( tokens)

In [57]:
decoded_string = tokenizer.decode(resp['input_ids'])
decoded_string

'[CLS] a man and his pet dog perrito playdingd catch in the beach. [SEP]'
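Words that are not in the vocabulary are split into sub-word pieces, which is why the 12 whitespace-separated words above become 18 ids (including `[CLS]` and `[SEP]`). The sketch below illustrates the greedy longest-match-first idea behind WordPiece-style tokenizers. The function `wordpiece` and the tiny `toy_vocab` are hypothetical and much simpler than the real tokenizer, which also lowercases text and splits punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"play", "##ding", "##d", "catch", "per", "##rito"}

print(wordpiece("playdingd", toy_vocab))  # ['play', '##ding', '##d']
print(wordpiece("perrito", toy_vocab))    # ['per', '##rito']
```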

What happens if the input text is longer than the model's maximum sequence length?

In [None]:
# movie text of Puss in Boots https://en.wikipedia.org/wiki/Puss_in_Boots:_The_Last_Wish
text = """

Puss in Boots: The Last Wish is a 2022 American computer-animated adventure film produced by DreamWorks Animation and distributed by Universal Pictures. 

The sequel to the spin-off film Puss in Boots (2011) and the sixth installment in the Shrek franchise, the film was directed by Joel Crawford and co-directed by Januel Mercado. 

Based on the character from Shrek 2 (2004) and inspired by the eponymous fairy tale, the film's screenplay was written by Paul Fisher and Tommy Swerdlow, with a story by Swerdlow and Tom Wheeler (the latter of whom wrote the 2011 film). 

The voice cast of Puss in Boots: The Last Wish includes Antonio Banderas and Salma Hayek Pinault reprising their respective roles as the titular character and Kitty Softpaws, and are joined by Harvey Guillén, Florence Pugh, Olivia Colman, Ray Winstone, Samson Kayo, John Mulaney, Wagner Moura, Da'Vine Joy Randolph, and Anthony Mendez, who voice new characters introduced in the film. 

In the film, after the events of Shrek Forever After (2010), Puss in Boots teams up with Kitty and Perrito (Guillén) to find the mystical Last Wish for the Wishing Star to restore the first eight of his nine lives, before Goldilocks and her Three Bears Crime Family (Pugh, Winstone, Colman, and Kayo), and "Big" Jack Horner (Mulaney) do, while attempting to avoid a mysterious hooded wolf (Moura).


Plans for a sequel to Puss in Boots began in November 2012, when executive producer Guillermo del Toro shared plans to take the titular character on an adventure to a "very exotic locale". Work on a sequel began in April 2014, according to Banderas. After being stuck in development hell, the project was revived in November 2018, with Illumination founder and CEO Chris Meledandri confirmed to be an executive producer. It was announced that the film would be helmed by Bob Persichetti, the head of story of the first film and one of the three directors of Sony Pictures Animation's Spider-Man: Into the Spider-Verse (2018), in February 2019. Crawford was later announced as the new director in March 2021, along with Mercado. The story draws inspiration from Spaghetti Western films, with The Good, the Bad and the Ugly (1966) being cited as a particular influence. As with DreamWorks' previous film The Bad Guys (2022), the film has a stylized animation style, inspired by Spider-Man: Into the Spider-Verse. With new technology, the team was able to give the film a painterly style to resemble a fairy-tale story, rather than utilizing the realistic visual style of every previous installment of the Shrek franchise.

Puss in Boots: The Last Wish premiered at Lincoln Center in New York City on December 13, 2022, and was theatrically released in the United States on December 21, 2022, after being delayed due to restructuring at DreamWorks. The film received positive reviews from critics and was a commercial success, grossing $483 million worldwide on a production budget of $90–110 million, becoming the tenth-highest-grossing film of 2022. At the 95th Academy Awards it was nominated for Best Animated Feature, as well as receiving nominations at the Golden Globes, Critics' Choice Awards, and British Academy Film Awards.
"""



In [39]:
tokens = tokenizer.tokenize(text) 

In [40]:
len(tokens)

710

In [41]:
len(text),  tokenizer.model_max_length

(3194, 512)

In [42]:
print (" ".join( tokens ))

pu ##ss in boots : the last wish is a 202 ##2 american computer - animated adventure film produced by dream ##works animation and distributed by universal pictures . the sequel to the spin - off film pu ##ss in boots ( 2011 ) and the sixth installment in the sh ##rek franchise , the film was directed by joel crawford and co - directed by jan ##uel mer ##ca ##do . based on the character from sh ##rek 2 ( 2004 ) and inspired by the eponymous fairy tale , the film ' s screenplay was written by paul fisher and tommy sw ##er ##dlow , with a story by sw ##er ##dlow and tom wheeler ( the latter of whom wrote the 2011 film ) . the voice cast of pu ##ss in boots : the last wish includes antonio band ##eras and sal ##ma hay ##ek pin ##ault rep ##ris ##ing their respective roles as the titular character and kitty soft ##pa ##ws , and are joined by harvey gui ##llen , florence pu ##gh , olivia col ##man , ray winston ##e , samson kay ##o , john mu ##lane ##y , wagner mo ##ura , da ' vine joy rando

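Tokens beyond the maximum sequence length are truncated and ignored by the model. Note that the tokenizer reports a `model_max_length` of `512`, but the `SentenceTransformer` model shown earlier is configured with `max_seq_length` of `256`, so `model.encode` truncates at 256 tokens. Either way, most of this 710-token text never reaches the model.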
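A rough sketch of what truncation means for the 710-token text above, assuming the limit is the `max_seq_length` of 256 from the model summary (the tokenizer itself reports 512) and that two positions are reserved for `[CLS]` and `[SEP]`. The helper `truncate_for_model` and the placeholder tokens are hypothetical, used only to show the arithmetic.

```python
def truncate_for_model(tokens, max_seq_length, cls="[CLS]", sep="[SEP]"):
    """Keep as many tokens as fit once the two special tokens are accounted for."""
    budget = max_seq_length - 2          # room left after [CLS] and [SEP]
    kept = tokens[:budget]
    dropped = tokens[budget:]
    return [cls] + kept + [sep], dropped

# Placeholder tokens standing in for the 710 tokens produced above
tokens_demo = [f"tok{i}" for i in range(710)]

model_input, dropped = truncate_for_model(tokens_demo, max_seq_length=256)
print(len(model_input), len(dropped))  # 256 456
```

Under these assumptions, well over half of the text contributes nothing to the embedding, so long documents are usually split into shorter passages before encoding.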


# Embeddings

Compute the embeddings of the entire corpus.

In [54]:
?model.encode

[0;31mSignature:[0m
[0mmodel[0m[0;34m.[0m[0mencode[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0msentences[0m[0;34m:[0m [0mUnion[0m[0;34m[[0m[0mstr[0m[0;34m,[0m [0mList[0m[0;34m[[0m[0mstr[0m[0;34m][0m[0;34m][0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mbatch_size[0m[0;34m:[0m [0mint[0m [0;34m=[0m [0;36m32[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mshow_progress_bar[0m[0;34m:[0m [0mbool[0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0moutput_value[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;34m'sentence_embedding'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mconvert_to_numpy[0m[0;34m:[0m [0mbool[0m [0;34m=[0m [0;32mTrue[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mconvert_to_tensor[0m[0;34m:[0m [0mbool[0m [0;34m=[0m [0;32mFalse[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mdevice[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mnormalize_embeddings[0

In [56]:
corpus = model.encode(dset['description_final'], normalize_embeddings=True)

In [69]:
def find_results(query: str, k=5, explain=False, image_path_local_dir="../data/raw/images"):
    # Embed the query (L2-normalized, so a dot product equals cosine similarity)
    query_features = model.encode(query, normalize_embeddings=True)

    # Cosine similarity between the query and every document in the corpus
    doc_scores = query_features @ corpus.T

    # Indices of the k highest-scoring documents, best first
    top_items_idx = doc_scores.argsort()[-k:][::-1]

    tokens = tokenizer.tokenize(query)
    debug_info = {
        "query_original": query,
        "query_processed": query_features,
        "tokens": tokens,
        "doc_scores": doc_scores,
        "top_items": top_items_idx,
    }

    display(HTML(f"<h4>Query: {query} </h4>"))

    if explain:
        rich.print(debug_info)

    images = []
    labels = []

    print(tokens)

    # Iterate over the top k results
    for idx, photo_data in enumerate(dset.select(top_items_idx)):
        doc_idx = top_items_idx[idx]
        photo_id = photo_data["photo_id"]

        # Prefer the local copy of the image; fall back to the remote URL
        image_path_local = f"{image_path_local_dir}/{photo_id}.jpg"
        if os.path.exists(image_path_local):
            image_path = image_path_local
        else:
            image_path = photo_data["photo_image_url"]

        images.append(image_path)
        score = "{:.5f}".format(doc_scores[doc_idx])

        # The score is a cosine similarity (higher is better), not a distance
        labels.append(f"""
                     Photo title: {photo_data["description_final"]}   <br/>
                     Cosine similarity: {score}
                     """)

    ipyplot.plot_images(images=images, labels=labels, img_width=200)
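The scoring step relies on the fact that, for L2-normalized vectors, a plain dot product is the cosine similarity. The sketch below checks this on random stand-in data; `corpus_demo` and `query_demo` are hypothetical placeholders for the real `corpus` and query embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in corpus and query embeddings, normalized to unit length
corpus_demo = rng.normal(size=(1000, 384))
corpus_demo /= np.linalg.norm(corpus_demo, axis=1, keepdims=True)
query_demo = rng.normal(size=384)
query_demo /= np.linalg.norm(query_demo)

# Dot product, as used in find_results
dot_scores = query_demo @ corpus_demo.T

# Explicit cosine similarity: dot product divided by the product of the norms
cos_scores = (corpus_demo @ query_demo) / (
    np.linalg.norm(corpus_demo, axis=1) * np.linalg.norm(query_demo)
)

# Top-k selection: sort ascending, take the last k, reverse to get best first
k = 5
top_idx = dot_scores.argsort()[-k:][::-1]

print(np.allclose(dot_scores, cos_scores))  # True
print(dot_scores[top_idx])
```

This is why both the corpus and the query are encoded with `normalize_embeddings=True`: the matrix product alone then yields scores in the range [-1, 1].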

        

# More Examples

In [70]:
find_results( "Two dogs playing in the snow", explain=False, k =5)

['two', 'dogs', 'playing', 'in', 'the', 'snow']


In [71]:
find_results( "boy and girl on a beach")

['boy', 'and', 'girl', 'on', 'a', 'beach']


In [72]:
find_results( "image of a man in a desert")

['image', 'of', 'a', 'man', 'in', 'a', 'desert']


In [73]:
sample_queries = [

"person on top of mountain"
, "picture of a man in a desert"
, "person in a desert"    

, "the boy and girl on a beach"
, "children in beach"    

, "Two dogs playing in the snow"

, "light at the end of the tunnel"
, "seven wonders of the world"
    
    
, "water droplets on a leaf"
    
, "ripley's aquarium of canada, toronto, canada"
, "the butterfly atrium at hershey gardens"
    
, "salar de uyuni uyuni bolivia"
, "沙漠青蛙 沙漠青蛙" # (Chinese for "desert frog", repeated)
, "por do sol no mar" # (Portuguese for "sunset at sea")
, "conhece te a ti mesmo" # (Portuguese for "know thyself")


, "there is no planet b"


, "nova scotia duck tolling retriever"
    
]

In [74]:
@interact
def interact_find_results(query=sample_queries, k =5):
    find_results( query, k =k)


1) How do the results for `person on top of mountain` compare to the BM25 approach? Are they better or about the same?

2) How do the results of `picture of a man in a desert` and `person in a desert` compare?

3) How do `the boy and girl on a beach` and `children in beach` compare?

4) How does the top result for `Two dogs playing in the snow` compare?

5) How do the results of `light at the end of the tunnel` look?

6) Are the results of `seven wonders of the world` relevant?

# Performance

In [75]:
%%timeit
search_query = "Two dogs playing in the snow"
k=5
text_features = model.encode(search_query, normalize_embeddings=True)
doc_scores = text_features @ corpus.T

top_items = doc_scores.argsort()[-k:][::-1]


19.6 ms ± 587 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


With the BM25 approach, we were able to get results in ~3 ms, so embedding-based retrieval is roughly 6x slower per query in this setup.
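One possible micro-optimization of the top-k step is `np.argpartition`, which avoids fully sorting all scores. This is only a sketch on random stand-in scores: the timing above does not break down how much of the 19.6 ms is spent in `model.encode` versus scoring and sorting, so the real gain is not measured here. The helper names `top_k_argsort` and `top_k_argpartition` are hypothetical.

```python
import numpy as np

def top_k_argsort(scores, k):
    """Full sort, as used in the notebook: O(n log n)."""
    return scores.argsort()[-k:][::-1]

def top_k_argpartition(scores, k):
    """Partial selection: find the k largest in O(n), then sort only those k."""
    candidates = np.argpartition(scores, -k)[-k:]              # unordered top k
    return candidates[np.argsort(scores[candidates])[::-1]]    # best first

rng = np.random.default_rng(0)
scores_demo = rng.random(24992)  # one stand-in score per document in the corpus

a = top_k_argsort(scores_demo, k=5)
b = top_k_argpartition(scores_demo, k=5)
print(np.array_equal(a, b))
```

Both functions return the same indices in the same order when scores are distinct; with tied scores the ordering among ties may differ.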