## Semantic Search 

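- ### Easy way: use the Sentence-BERT package (`sentence-transformers`)
- Works well when you have a small dataset
- Model selection is very important and depends on your task: symmetric search (query and corpus entries are similar in length and content) and asymmetric search (a short query retrieves longer passages) call for different models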
- https://www.sbert.net/examples/applications/semantic-search/README.html
- https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models

In [1]:
from sentence_transformers import SentenceTransformer, util
import torch,os
import config

In [None]:
## Download an embedding model
# all-mpnet-base-v2 is listed as the best-quality general-purpose model,
# but all-MiniLM-L6-v2 achieves similar results and is much faster
embedder = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# Corpus with example sentences
corpus = ['It may be one of the most familiar words in economics. Inflation has plunged countries into long periods of instability. ',
          'Inflation is the rate of increase in prices over a given period of time. Inflation is typically a broad measure, such as the overall increase in prices or the increase in the cost of living in a country. ',
          'Consumers’ cost of living depends on the prices of many goods and services and the share of each in the household budget. ',
          'First, will borrowing remain cheap for the entire horizon relevant for fiscal planning? Since that horizon seems to be the indefinite future, our answer here would be “no.” ',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
# Query sentences:
queries = ['Inflation is here to stay for a while', 'Fiscal policy must be supportive']

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
query_embeddings = embedder.encode(queries, convert_to_tensor=True)


- Use cosine similarity to find the nearest neighbors

In [None]:
top_k = 2
# Use cosine similarity and torch.topk to find the top_k highest scores for each query
cos_scores = util.cos_sim(query_embeddings, corpus_embeddings)  # shape: (num_queries, corpus_size)
print('results size: {}'.format(cos_scores.size()))
top_results = torch.topk(cos_scores, k=top_k)
print('top results indices : {}'.format(top_results.indices))
# Best match for the first query, then for the second query
print(corpus[top_results.indices[0][0]])
print(corpus[top_results.indices[1][0]])
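
For intuition, the cosine-similarity and top-k steps can be sketched in plain Python. This is only an illustration of the arithmetic, not how `util.cos_sim` or `torch.topk` are implemented: the 3-dimensional vectors are made-up toy values rather than real model embeddings, and `cosine_similarity` and `top_k_indices` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_indices(query, corpus_vectors, k):
    """Indices of the k corpus vectors most similar to the query, best first."""
    scores = [cosine_similarity(query, v) for v in corpus_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy embeddings (hypothetical values chosen for illustration)
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.2, 0.0]

best = top_k_indices(toy_query, toy_corpus, k=2)
print(best)  # indices of the two closest toy vectors
```

Cosine similarity ignores vector length, which is why the second toy vector (pointing in nearly the same direction as the query) ranks ahead of the first.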

- #### To do it more efficiently
- Use `util.semantic_search` with normalized embeddings, so that the cheaper dot product gives the same scores as cosine similarity

In [None]:
### Optionally move the embeddings to the GPU first
#corpus_embeddings = corpus_embeddings.to('cuda') ## for GPU
corpus_embeddings = util.normalize_embeddings(corpus_embeddings)

#query_embeddings = query_embeddings.to('cuda') ## for GPU
query_embeddings = util.normalize_embeddings(query_embeddings)
hits = util.semantic_search(query_embeddings, corpus_embeddings, score_function=util.dot_score, top_k=2)
print(hits)
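
Why normalization makes `dot_score` a valid replacement for cosine similarity can be checked by hand. The sketch below uses two made-up 2-dimensional vectors and hypothetical helper names (`l2_normalize`, `dot`). It only demonstrates the identity, not the library internals.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy vectors (hypothetical values)
a = [3.0, 4.0]
b = [4.0, 3.0]

# Cosine similarity on the raw vectors
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Dot product after normalizing both vectors to unit length
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))

print(cosine, dot_of_normalized)  # both are approximately 0.96
```

Since the norms are divided out once up front, each query-to-corpus comparison afterwards needs only a dot product.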

- ### More efficient way for large datasets
- https://huggingface.co/course/chapter5/6?fw=tf

- #### Load a large dataset 

In [2]:
#from huggingface_hub import hf_hub_url
from datasets import load_dataset,load_from_disk
from datasets import Dataset

In [None]:
## load data 
issues_dataset = load_dataset('lewtun/github-issues', split="train")
print(issues_dataset)
## filter and process data 
issues_dataset = issues_dataset.filter(
    lambda x: (not x["is_pull_request"] and len(x["comments"]) > 0)
)
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
print(issues_dataset)
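
A note on the column selection: `symmetric_difference` gives the right answer here only because every name in `columns_to_keep` exists in the dataset. The sketch below compares it with a plain set difference on a hypothetical, shortened column list (the real dataset has many more columns), and uses a deliberately misspelled `"bodyy"` to show where the two diverge.

```python
# Hypothetical, shortened column list; the real dataset has many more columns
all_columns = ["url", "title", "body", "html_url", "comments", "is_pull_request"]
columns_to_keep = ["title", "body", "html_url", "comments"]

# Both expressions agree when columns_to_keep is a subset of all_columns
via_symmetric_difference = set(columns_to_keep).symmetric_difference(all_columns)
via_difference = set(all_columns) - set(columns_to_keep)
print(sorted(via_difference))

# With a misspelled name, symmetric_difference also returns the missing name,
# which would then be passed on as a column to remove
with_typo = set(["title", "bodyy"]).symmetric_difference(all_columns)
print("bodyy" in with_typo)
```

A plain difference (`set(all_columns) - set(columns_to_keep)`) expresses the intent more directly and never returns a name the dataset does not have.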

In [None]:
## further data processing 
issues_dataset.set_format("pandas")
df = issues_dataset[:]
comments_df = df.explode("comments", ignore_index=True)  ## turn each element of the list column into a separate row
comments_df.head(4)
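
What `explode` does to the `comments` column can be sketched in plain Python. The rows below are made up, and `explode_rows` is a hypothetical helper, not a pandas function. Unlike pandas, it drops rows whose list is empty, but the earlier filter already removed issues without comments.

```python
def explode_rows(rows, column):
    """Return one row per element of the list stored in `column`."""
    exploded = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)      # copy the other fields unchanged
            new_row[column] = item   # replace the list with a single element
            exploded.append(new_row)
    return exploded

# Hypothetical issues with made-up comments
toy_issues = [
    {"title": "Issue A", "comments": ["first comment", "second comment"]},
    {"title": "Issue B", "comments": ["only comment"]},
]

toy_comment_rows = explode_rows(toy_issues, "comments")
for row in toy_comment_rows:
    print(row["title"], "->", row["comments"])
```

Each comment now carries its own copy of the issue title, which is what makes the later per-comment concatenation possible.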

In [None]:
## Filter by length (keep comments with more than 15 words)
comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
## let's concatenate all content in one field
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }

comments_dataset = comments_dataset.map(concatenate_text)

- ### Create text embeddings
- Here we are doing asymmetric search (short question, longer documents), so we use a model pretrained on question-answer pairs

In [3]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"  ## previously we used 'all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

- Define the function that computes embeddings

In [4]:
## Use the embedding of the first ([CLS]) token as the sentence embedding
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
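    # Convert the tokenizer output to a plain dict of tensors
    # (to run on a GPU, this is where each tensor would be moved to the device)
    encoded_input = {k: v for k, v in encoded_input.items()}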
    model_output = model(**encoded_input)
    return cls_pooling(model_output)
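
The indexing in `cls_pooling` (`[:, 0]`) selects the first token's vector from every sequence in the batch. A nested-list stand-in for the tensor makes the shape change visible. The values are made up, and `cls_pooling_lists` is a hypothetical name used only for this illustration.

```python
def cls_pooling_lists(last_hidden_state):
    """Pick the first token's vector from each sequence (nested lists stand in for a tensor)."""
    return [sequence[0] for sequence in last_hidden_state]

# Hypothetical hidden states: batch of 2 sequences, 3 tokens each, hidden size 2
toy_hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]

pooled = cls_pooling_lists(toy_hidden)
print(pooled)  # one vector per sequence: shape (batch, hidden)
```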

- Compute the embeddings

In [5]:
## Compute the embeddings once and cache the dataset on disk
dataset_cache_dir = os.path.join(config.data_folder,'Semantic_Search_cache','QA')
if os.path.exists(dataset_cache_dir):
    embeddings_dataset = load_from_disk(dataset_cache_dir)
else: 
    embeddings_dataset = comments_dataset.map(
        lambda x: {"embeddings": get_embeddings(x["text"]).detach().numpy()[0]}
    )
    embeddings_dataset.save_to_disk(dataset_cache_dir)
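
The load-or-compute pattern used for the embeddings can be shown in isolation. This sketch assumes JSON files and a temporary directory instead of `save_to_disk` / `load_from_disk`. `load_or_compute` and `expensive_embedding_step` are hypothetical names, and the returned numbers are made up.

```python
import json
import os
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Return cached results if present; otherwise compute, save, and return them."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    result = compute_fn()
    with open(cache_path, "w") as f:
        json.dump(result, f)
    return result

call_count = {"n": 0}

def expensive_embedding_step():
    # Stand-in for the slow embedding pass; returns made-up values
    call_count["n"] += 1
    return {"embeddings": [[0.1, 0.2], [0.3, 0.4]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, "embeddings.json")
    first = load_or_compute(cache_file, expensive_embedding_step)   # computes and saves
    second = load_or_compute(cache_file, expensive_embedding_step)  # loads from disk

print(call_count["n"])  # the expensive step ran only once
```

One caveat of this pattern: the cache is keyed only by its path, so after switching the model or changing the preprocessing, the cached folder has to be deleted or given a different name.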

- ### Use FAISS for efficient similarity search
- refer to https://huggingface.co/course/chapter5/6?fw=tf#using-faiss-for-efficient-similarity-search
- another example here : https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search_quora_faiss.py


In [6]:
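### add_faiss_index does not appear to normalize the embeddings, and by default it seems to build an L2 (Euclidean) index;
### it may be a good idea to normalize the embeddings ourselves if cosine-style ranking is wanted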
embeddings_dataset.add_faiss_index(column="embeddings")

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 2175
})
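
On the normalization question: to my understanding, `add_faiss_index` builds an L2 index unless a different metric or a custom index is passed, so it is worth seeing when L2 and dot-product rankings agree. The sketch below uses made-up 2-dimensional vectors and hypothetical helper names. It shows that the two rankings can disagree on raw vectors and agree once everything is scaled to unit length.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Toy vectors (hypothetical values): doc_long points the same way as the query but is much longer
query = [1.0, 0.0]
docs = {"doc_long": [10.0, 0.0], "doc_near": [1.0, 0.5]}

best_by_l2 = min(docs, key=lambda name: squared_l2(query, docs[name]))
best_by_dot = max(docs, key=lambda name: dot(query, docs[name]))

unit_query = normalize(query)
unit_docs = {name: normalize(v) for name, v in docs.items()}
best_by_l2_unit = min(unit_docs, key=lambda name: squared_l2(unit_query, unit_docs[name]))
best_by_dot_unit = max(unit_docs, key=lambda name: dot(unit_query, unit_docs[name]))

print(best_by_l2, best_by_dot)            # rankings disagree on raw vectors
print(best_by_l2_unit, best_by_dot_unit)  # rankings agree on unit vectors
```

For unit vectors the squared L2 distance equals 2 minus twice the dot product, so the nearest neighbor by L2 is also the best match by dot product and by cosine similarity.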

In [7]:
question = "How can I load a dataset offline?"

question_embedding = get_embeddings([question]).detach().numpy()
type(question_embedding)

numpy.ndarray

In [8]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=2
)
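
`get_nearest_examples` returns the scores and a column-wise dict of the matching rows, which is awkward to read directly. Below is a minimal sketch of turning that into one record per hit. `toy_scores` and `toy_samples` are made-up stand-ins with the same shape as `scores` and `samples`, and the ascending sort assumes an L2 index, where a smaller score means a closer match.

```python
# Hypothetical return values shaped like the output of get_nearest_examples:
# a sequence of scores and a dict mapping column names to lists
toy_scores = [22.4, 22.9]
toy_samples = {
    "title": ["Example issue title 1", "Example issue title 2"],
    "html_url": ["https://example.com/issues/1", "https://example.com/issues/2"],
    "comments": ["made-up comment text 1", "made-up comment text 2"],
}

# Re-shape the column-wise dict into one record per hit
hit_records = [
    {"score": score, **{column: values[i] for column, values in toy_samples.items()}}
    for i, score in enumerate(toy_scores)
]

# Assuming an L2 index, a smaller score means a closer match
for hit in sorted(hit_records, key=lambda h: h["score"]):
    print(f"{hit['score']:.2f}  {hit['title']}  {hit['html_url']}")
```

With the real results, the same loop would run over `scores` and `samples` from the cell above. If the index were built with an inner-product metric instead, the sort order would need to be reversed.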