Using a **GPU** is highly recommended. You can get one for free by uploading this notebook to [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb).
<table align="center">
 <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ezponda/intro_deep_learning/blob/main/class/NLP/semantic_search_QA_clustering.ipynb">
        <img src="https://colab.research.google.com/img/colab_favicon_256px.png"  width="50" height="50" style="padding-bottom:5px;" />Run in Google Colab</a></td>
  <td align="center"><a target="_blank" href="https://github.com/ezponda/intro_deep_learning/blob/main/class/NLP/semantic_search_QA_clustering.ipynb">
        <img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png"  width="50" height="50" style="padding-bottom:5px;" />View Source on GitHub</a></td>
</table>

# Semantic search & QA

In this notebook, we'll introduce semantic search and question-answering using [`sentence-transformers`](https://www.sbert.net/), a Python library for state-of-the-art sentence, text and image embeddings. These embeddings are useful for semantic similarity tasks, such as information retrieval and question-answering systems.

In [None]:
# Install the sentence-transformers library
#!pip install -U sentence-transformers

In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd
import time
import gzip
import os

We'll use a pre-trained Sentence Transformer model to generate sentence embeddings. Many pre-trained models are available [here](https://www.sbert.net/docs/pretrained_models.html).

In [None]:
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

For our semantic search and question-answering task, we need a list of documents or paragraphs to search through for relevant information.

In [None]:
# Sample paragraphs
paragraphs = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
    "The Statue of Liberty is a colossal neoclassical sculpture on Liberty Island in New York Harbor within New York City, in the United States.",
    "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials, generally built along an east-to-west line across the historical northern borders of China.",
    "The Colosseum, also known as the Flavian Amphitheatre, is an oval amphitheatre in the centre of the city of Rome, Italy.",
    "The Taj Mahal is an ivory-white marble mausoleum on the southern bank of the river Yamuna in the Indian city of Agra."
]

paragraphs = np.array(paragraphs)

In [None]:
# Generate embeddings for paragraphs
corpus_embeddings = model.encode(paragraphs)
print(corpus_embeddings.shape)

Now, let's define a function to perform semantic search, given a query and a list of paragraph embeddings.

In [None]:
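def semantic_search(query, model, corpus_embeddings, paragraphs, top_k=2):
    """Print the top_k paragraphs most similar to the query (cosine similarity)."""
    query_embedding = model.encode([query])[0]
    similarities = cosine_similarity([query_embedding], corpus_embeddings)[0]
    top_k = min(top_k, len(similarities))  # argpartition fails if top_k exceeds the corpus size
    # argpartition selects the top_k highest scores (unordered); argsort then orders them best-first
    indexes = np.argpartition(similarities, -top_k)[-top_k:]
    indexes = indexes[np.argsort(-similarities[indexes])]
    print(f"Input query: {query}")
    print()
    for text, sim in zip(list(paragraphs[indexes]), similarities[indexes].tolist()):
        print(f"{sim:.3f}\t{text}")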


semantic_search('Where is the Colosseum', model, corpus_embeddings, paragraphs, top_k=2)
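
To make the ranking step in `semantic_search` easier to follow, here is a minimal sketch that reproduces it with NumPy alone. The 3-dimensional vectors are made up for illustration (real sentence embeddings have hundreds of dimensions), and the helper names `cosine_sim_matrix` and `top_k_sorted` are hypothetical, not part of any library.

```python
import numpy as np

def cosine_sim_matrix(queries, corpus):
    """Cosine similarity between each row of `queries` and each row of `corpus`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return q @ c.T

def top_k_sorted(scores, k):
    """Indices of the k largest scores, best first."""
    k = min(k, len(scores))                 # argpartition fails if k > len(scores)
    idx = np.argpartition(scores, -k)[-k:]  # the k best indices, in no particular order
    return idx[np.argsort(-scores[idx])]    # order those k indices by descending score

# Toy "embeddings", invented for this example
toy_corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([[2.0, 1.0, 0.0]])

toy_scores = cosine_sim_matrix(toy_query, toy_corpus)[0]
toy_top = top_k_sorted(toy_scores, k=2)
print(toy_top, toy_scores[toy_top])
```

`np.argpartition` avoids sorting the whole score vector, which matters little for five paragraphs but helps once the corpus grows to thousands of passages.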

## Multilingual models


In [None]:
# Let's try a query in another language (Spanish)
semantic_search('¿Dónde está el Coliseo?', model, corpus_embeddings, paragraphs, top_k=2)

Multilingual models are available [here](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models).

In [None]:
# we can use multilingual models
model_name = 'paraphrase-multilingual-MiniLM-L12-v2'
multi_model = SentenceTransformer(model_name)

In [None]:
multi_corpus_embeddings = multi_model.encode(paragraphs)
print(multi_corpus_embeddings.shape)

In [None]:
semantic_search('¿Dónde está el Coliseo?', multi_model, multi_corpus_embeddings, paragraphs, top_k=2)

## Wikipedia semantic search

As a dataset, we use Simple English Wikipedia. Compared to the full English Wikipedia, it has only
about 170k articles. We split these articles into paragraphs.

In [None]:
wikipedia_filepath = 'data/simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        for paragraph in data['paragraphs']:
            # Store each passage as a single string: "<title>:  <paragraph>"
            passages.append(data['title']+':  '+ paragraph)

# Optionally limit the number of passages to use (the next cell keeps the first 5,000)
print("Passages:", len(passages))
print(passages[0])
print(passages[1])
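
The loading loop above can be tried without downloading anything. The sketch below runs the same parsing logic on two made-up records that mimic the layout the loop relies on (one JSON object per line, with `title` and `paragraphs` keys). The function name `lines_to_passages` and its `max_passages` argument are hypothetical additions for illustration.

```python
import json

# Two invented records in the assumed JSON Lines layout
sample_lines = [
    '{"title": "Paris", "paragraphs": ["Paris is the capital of France.", "It lies on the Seine."]}',
    '{"title": "Rome", "paragraphs": ["Rome is the capital of Italy."]}',
]

def lines_to_passages(lines, max_passages=None):
    """Turn JSON Lines records into 'title: paragraph' strings, optionally capped."""
    passages = []
    for line in lines:
        record = json.loads(line.strip())
        for paragraph in record["paragraphs"]:
            if max_passages is not None and len(passages) >= max_passages:
                return passages  # stop early once the cap is reached
            passages.append(f'{record["title"]}: {paragraph}')
    return passages

sample_passages = lines_to_passages(sample_lines)
print(len(sample_passages))
print(sample_passages[0])
```

Prefixing each paragraph with its article title gives the embedding model some context for paragraphs that never mention their subject by name.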

In [None]:
reduced_passages = np.array(passages[:5000])
reduced_passages.shape

In [None]:
corpus_embeddings = model.encode(reduced_passages, show_progress_bar=True)

In [None]:
semantic_search('Best american actor', model, corpus_embeddings, reduced_passages, top_k=2)

In [None]:
semantic_search('Number countries Europe', model, corpus_embeddings, reduced_passages, top_k=2)

### Question 1: Load a different pre-trained Sentence Transformer model and compare its performance to the previous model on the same set of paragraphs and queries. Which model performs better?

In [None]:
# Load a different pre-trained model, generate embeddings, and test with the same queries
model_name = 'distiluse-base-multilingual-cased-v2'
new_model = SentenceTransformer(model_name)

In [None]:
corpus_embeddings = new_model.encode(reduced_passages, show_progress_bar=True)

In [None]:
semantic_search('Best american actor', new_model, corpus_embeddings, reduced_passages, top_k=5)

In [None]:
semantic_search('Number countries Europe', new_model, corpus_embeddings, reduced_passages, top_k=5)
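
Reading printed results side by side is subjective, so one possible way to make the comparison more concrete is to measure how much the two models' top-k lists overlap and, if you label a few relevant passages by hand, to compute recall@k. The sketch below is only an illustration: the score arrays and the `relevant_ids` labels are invented, and the helper names are hypothetical.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    return np.argsort(-scores)[:k]

def overlap_at_k(scores_a, scores_b, k):
    """Fraction of top-k results that two models share for one query."""
    a = set(top_k_indices(scores_a, k).tolist())
    b = set(top_k_indices(scores_b, k).tolist())
    return len(a & b) / k

def recall_at_k(scores, relevant, k):
    """Fraction of hand-labelled relevant passages found in the top k."""
    top = set(top_k_indices(scores, k).tolist())
    return len(top & set(relevant)) / len(relevant)

# Made-up similarity scores of one query against five passages
scores_model_a = np.array([0.9, 0.1, 0.8, 0.3, 0.2])
scores_model_b = np.array([0.7, 0.6, 0.2, 0.1, 0.65])
relevant_ids = [0, 2]  # hypothetical manual relevance labels

print(overlap_at_k(scores_model_a, scores_model_b, k=2))
print(recall_at_k(scores_model_a, relevant_ids, k=2))
print(recall_at_k(scores_model_b, relevant_ids, k=2))
```

With real models, the score arrays would come from `cosine_similarity` on each model's own query and corpus embeddings. Raw similarity values from different models are not directly comparable, which is why this sketch compares rankings.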

## Question 2: Find text duplicates

Try to find duplicate or near-duplicate texts in a given corpus based on their semantic similarity using sentence-transformers.

In [None]:
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox leaps over the lazy dog.",
    "The sky is blue, and the grass is green.",
    "The grass is green, and the sky is blue.",
    "It's a sunny day today.",
    "The weather is sunny today.",
    "She was wearing a beautiful red dress.",
    "She had on a gorgeous red dress.",
    "I'm going to the supermarket to buy some groceries.",
    "I'm heading to the supermarket to purchase some groceries.",
    "He didn't like the movie because it was too long.",
    "He disliked the movie as it was too lengthy.",
    "The train was delayed due to technical issues.",
    "Technical issues caused the train to be delayed.",
    "I'll have a cup of coffee with milk and sugar, please.",
    "Can I get a coffee with milk and sugar, please?",
    "The conference was very informative and interesting.",
    "The conference turned out to be interesting and informative.",
    "He enjoys listening to classical music in his free time.",
    "In his leisure time, he likes to listen to classical music.",
    "Please make sure you turn off the lights before leaving.",
    "Before leaving, ensure that you switch off the lights.",
    "The boy was delighted with the gift he received.",
    "Receiving the present made the young lad ecstatic.",
    "She has a preference for Italian cuisine.",
    "Her favorite type of food is from Italy.",
    "The software engineer resolved the issue by modifying the code.",
    "By altering the programming, the tech expert fixed the problem.",
    "Due to the inclement weather, the baseball game was postponed.",
    "The baseball match was rescheduled because of bad weather conditions.",
    "The house was engulfed in a raging fire.",
    "Flames rapidly consumed the residence.",
    "He is constantly browsing the internet for the latest news.",
    "He frequently scours the web to stay updated on current events.",
    "The puppy was playing with a toy in the garden.",
    "In the yard, the young dog was frolicking with its plaything.",
    "The artist painted a beautiful landscape on the canvas.",
]

In [None]:
# Step 1: Initialize the SentenceTransformer model
model_name = 'paraphrase-multilingual-MiniLM-L12-v2'
model = SentenceTransformer(model_name)

In [None]:
# Step 2: Obtain corpus embeddings
# embeddings = ...
embeddings = model.encode(corpus, show_progress_bar=True)

In [None]:
# Step 3: Calculate similarity and find duplicates

# TODO: Define a similarity threshold
similarity_threshold = 0.85

# TODO: Iterate over each pair of embeddings in the corpus
# Calculate the cosine similarity between the embeddings
# If the similarity is above the threshold, add the sentences to the duplicates list
duplicates = []

for i, emb1 in enumerate(embeddings):
    for j, emb2 in enumerate(embeddings[i + 1:]):
        similarity = cosine_similarity([emb1], [emb2])[0][0]
        if similarity > similarity_threshold:
            duplicates.append((corpus[i], corpus[i + j + 1], similarity))

In [None]:
print("Duplicate sentences:")
for sent1, sent2, sim in duplicates:
    print(f"{sent1} | {sent2} | Similarity: {sim:.2f}")
    print()
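
The nested loop above calls `cosine_similarity` once per pair, which is easy to read but slow for larger corpora. As a sketch of an alternative, the whole similarity matrix can be computed in one matrix product and the upper triangle filtered with NumPy. The function name `find_near_duplicates` is hypothetical and the 2-dimensional vectors are invented for illustration.

```python
import numpy as np

def find_near_duplicates(embeddings, threshold):
    """Return (i, j, similarity) for every pair i < j whose cosine similarity exceeds threshold."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-length rows
    sim = emb @ emb.T                                        # all pairwise cosine similarities
    rows, cols = np.triu_indices(len(emb), k=1)              # each pair once, no self-pairs
    mask = sim[rows, cols] > threshold
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows[mask], cols[mask])]

# Toy embeddings: rows 0 and 1 point almost the same way, as do rows 2 and 3
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.99, 0.1],
    [0.0, 1.0],
    [0.1, 0.98],
])

toy_pairs = find_near_duplicates(toy_embeddings, threshold=0.85)
for i, j, sim in toy_pairs:
    print(i, j, round(sim, 3))
```

The full matrix needs memory proportional to the square of the corpus size, so for very large corpora a batched approach would be more appropriate.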

# Document Clustering

K-means clustering is a popular unsupervised machine learning algorithm that groups data points into k clusters based on their similarity. In our case, we want to group documents based on their semantic similarity. The algorithm requires us to specify the number of clusters k in advance.

In [None]:
corpus = [
    "The apple is a sweet fruit",
    "Oranges are citrus fruits",
    "Bananas are rich in potassium",
    "Strawberries are red fruits",
    "Dogs are domesticated animals",
    "Cats are also pets",
    "Elephants are the largest land mammals",
    "Cows provide us with milk",
    "Sharks are marine predators",
    "Whales are the largest marine mammals",
    "Dolphins are very intelligent",
    "Artificial intelligence is the future",
    "Machine learning is a subset of AI",
    "Deep learning is a part of machine learning",
    "Neural networks are used in deep learning",
]

df = pd.DataFrame({'documents': corpus})

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the documents in the corpus
document_embeddings = model.encode(corpus)

In [None]:
from sklearn.cluster import KMeans
num_clusters = 3
clustering_model = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=300, n_init=10)
clustering_model.fit(document_embeddings)
cluster_assignment = clustering_model.labels_

df['cluster'] = cluster_assignment

In [None]:
for i in range(num_clusters):
    print(f"Cluster {i}:")
    print(df[df['cluster'] == i]['documents'].values, "\n")
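
To make the k-means step less of a black box, here is a minimal NumPy sketch of the basic assign-and-update loop (Lloyd's algorithm). It is a simplification for illustration only: it uses plain random initialisation rather than the `k-means++` initialisation and multiple restarts that scikit-learn's `KMeans` performs, and the function name `kmeans_sketch` and the toy points are made up.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate between assigning points and updating centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it in place if it has none)
        new_centroids = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilised
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs in 2D
toy_points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
```

Cluster numbers are arbitrary: which blob ends up as cluster 0 depends on the initialisation, so only the grouping itself is meaningful.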

# Community Detection with Sentence Transformers

The sentence_transformers library provides a utility for community detection which applies a threshold on the cosine similarity score to identify distinct communities of sentences that are semantically similar. This method can be particularly helpful for organizing a large corpus of text into meaningful groups.




The [`community_detection`](https://www.sbert.net/docs/package_reference/util.html#sentence_transformers.util.community_detection) function in the Sentence Transformers library is a useful utility for finding clusters or communities of semantically similar sentences. Here are the details of the function parameters:

- `embeddings` (the `document_embeddings` variable in this notebook): This is the set of embeddings for the documents in your corpus. The embeddings can be created using any of the Sentence Transformer models. The embeddings should be in the form of a 2D tensor or a list of 1D tensors. Each embedding should be a fixed-length vector that represents the semantic meaning of a document.

- `threshold`: This is a float value between 0 and 1 that sets the cosine-similarity cutoff for placing two documents in the same community (the library default is 0.75). Documents whose embeddings have a cosine similarity of at least the threshold are candidates for the same community. The higher the threshold, the more similar the documents in each community will be, but the communities also become smaller, so more of them may fall below `min_community_size` and be discarded.

- `min_community_size`: This is the minimum number of documents that a community must have. If a community has fewer than this number of documents, it will be discarded. The library default is 10, which is too large for a small corpus like the one here, so we set it to 2 below. A higher value can help filter out noise and find more meaningful communities.

- `batch_size`: The function computes cosine similarities between all pairs of documents, which can consume a significant amount of memory for a large corpus. To manage this, the computations are done in batches: `batch_size` is the number of embeddings compared against the whole corpus at a time. Larger batch sizes can be faster but consume more memory, while smaller batch sizes are slower but more memory-efficient.

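The function returns a list of communities, where each community is a list of indices in the original list of documents. Each community represents a group of semantically similar documents based on the provided threshold. Communities are returned in decreasing order of size, and the first index in each list is the community's central point.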
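
To give an intuition for how threshold-based community detection can work, the following is a simplified NumPy sketch of the idea. It is not the library's implementation (it ignores batching and works on plain arrays rather than tensors), and the function name `community_detection_sketch` and the toy vectors are invented for illustration.

```python
import numpy as np

def community_detection_sketch(embeddings, threshold=0.75, min_community_size=2):
    """Greedy, threshold-based grouping of embeddings by cosine similarity."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T

    # Each row proposes a candidate community: everything similar enough to it
    candidates = []
    for i in range(len(emb)):
        members = np.flatnonzero(sim[i] >= threshold)
        if len(members) >= min_community_size:
            members = members[np.argsort(-sim[i, members])]  # centre first, then by similarity
            candidates.append(members.tolist())

    # Prefer the largest candidates, and never assign a document twice
    candidates.sort(key=len, reverse=True)
    communities, assigned = [], set()
    for members in candidates:
        remaining = [m for m in members if m not in assigned]
        if len(remaining) >= min_community_size:
            communities.append(remaining)
            assigned.update(remaining)
    return communities

# Toy embeddings: three vectors near the x-axis, two near the y-axis, one outlier
toy_vectors = np.array([
    [1.0, 0.0], [0.95, 0.1], [0.9, 0.2],
    [0.0, 1.0], [0.1, 0.95],
    [-1.0, -1.0],
])

toy_communities = community_detection_sketch(toy_vectors, threshold=0.9, min_community_size=2)
print(toy_communities)
```

The outlier is similar to nothing but itself, so its candidate community has size 1 and is dropped by `min_community_size=2`.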

In [None]:
from sentence_transformers.util import community_detection
document_embeddings = model.encode(
    corpus, show_progress_bar=True, convert_to_tensor=True
)
communities = community_detection(
    document_embeddings, threshold=0.5, min_community_size=2, batch_size=1024
)
for i, comm in enumerate(communities):
    print('_'*50)
    print(f'community: {i}, size: {len(comm)}')
    print('\n'.join([corpus[ind] for ind in comm]))
    print()

In the output, we will see the communities of semantically similar sentences. Note that the choice of the threshold value can greatly affect the results: a lower threshold will result in larger but less cohesive communities, while a higher threshold will result in smaller but more tightly-knit communities.
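
The effect of the threshold can be illustrated without running a model. The sketch below counts, for a few thresholds, how many documents fall within the similarity cutoff of each document (including itself). The helper `neighbourhood_sizes` and the toy vectors are hypothetical, chosen only to show the trend.

```python
import numpy as np

def neighbourhood_sizes(embeddings, threshold):
    """For each embedding, count how many embeddings (itself included) reach the threshold."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    return (sim >= threshold).sum(axis=1)

# Toy embeddings: three vectors near the x-axis, two near the y-axis, one outlier
toy_vectors = np.array([
    [1.0, 0.0], [0.95, 0.1], [0.9, 0.2],
    [0.0, 1.0], [0.1, 0.95],
    [-1.0, -1.0],
])

sizes_by_threshold = {
    t: neighbourhood_sizes(toy_vectors, t) for t in (0.15, 0.9, 0.995)
}
for t, sizes in sizes_by_threshold.items():
    print(t, sizes.tolist())
```

At the lowest threshold the two groups start to bleed into each other, at the middle one they separate cleanly, and at the highest one every document is left on its own.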

The community_detection function is a fast and efficient way to group similar sentences together, but keep in mind that it's a rather simple method based on thresholding the cosine similarity, and more sophisticated community detection methods might yield better results for certain tasks or datasets.

This function is a great way to explore the semantic structure of your corpus and to get a high-level understanding of the main themes or topics in your text data.