# Semantic Search Engine using Sentence-BERT (SBERT)

## Introduction

In this notebook, we will build a semantic search engine using the Quora Question Pairs dataset. We will leverage the Sentence-BERT (SBERT) model to generate sentence embeddings and perform efficient similarity searches. The goal is to allow users to input queries and retrieve relevant questions based on their semantic meaning.

## Objectives

- Understand and implement the SBERT model for text embeddings.
- Develop a semantic search functionality.
- Evaluate the performance of the search engine.

## Dataset

We will use the [Quora Question Pairs](https://www.kaggle.com/c/quora-question-pairs) dataset, which contains pairs of questions that may be semantically similar.


In [2]:
### Import Libraries
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re

## Data Cleaning Function
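The `clean_text` function lowercases the text, removes every character other than the letters a-z and whitespace (so digits and punctuation are dropped too), and collapses runs of whitespace into a single space. This standardizes the questions before they are embedded.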

In [None]:
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove non-alphabetic characters
    text = re.sub(r'[^a-z\s]', '', text)
    # Replace multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text  # Return the cleaned text
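
As a quick check of what this cleaning does, the sketch below repeats the function and runs it on two made-up strings (they are illustrations, not rows from the dataset). The second call shows the cost of the rule: symbols that carry meaning are removed along with ordinary punctuation.

```python
import re

def clean_text(text):
    # Same steps as the notebook's function: lowercase, keep a-z and whitespace, collapse spaces
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

sample = "What's the BEST way to learn Python 3?"  # hypothetical input
cleaned = clean_text(sample)
print(cleaned)                    # whats the best way to learn python
print(clean_text("C++ vs. C#"))   # c vs c
```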

## Load the Dataset
We load the dataset and perform cleaning operations. This includes removing duplicate entries and handling missing values, which helps maintain the integrity of our data.

In [None]:
# Load your data from a CSV file
data = pd.read_csv('./data/questions.csv')

# Remove rows with NaN values first: clean_text calls .lower() and fails on non-string values
data.dropna(inplace=True)

# Clean the text for both question columns
data['question1'] = data['question1'].apply(clean_text)
data['question2'] = data['question2'].apply(clean_text)

# Remove duplicate rows
data.drop_duplicates(inplace=True)

## Load the SBERT Model

In [3]:
# Load the pre-trained Sentence-BERT model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')




## Batch Processing for Encoding
Encoding questions in batches helps manage memory usage and speeds up processing. The encode_questions function generates embeddings for all questions by processing them in smaller groups.

In [None]:
# Define batch size for processing
batch_size = 32

# Function to encode questions in batches
def encode_questions(questions):
    embeddings = []  # List to store embeddings
    # Iterate over the questions in batches
    for i in range(0, len(questions), batch_size):
        batch = list(questions[i:i + batch_size])  # Get a batch of questions as a plain list
        # Encode the batch and extend the embeddings list
        embeddings.extend(model.encode(batch))
    return np.array(embeddings)  # Return embeddings as a numpy array

# Collect the questions from both columns and drop the repeated sentences.
# dict.fromkeys keeps first-seen order, so row i of q_embeddings corresponds to questions[i]
questions = list(dict.fromkeys(list(data['question1']) + list(data['question2'])))
q_embeddings = encode_questions(questions)
q_embeddings = normalize(q_embeddings)  # L2-normalize each row

## User Query and Similarity Calculation
Once we have the embeddings, we can compute the similarity between a user query and the questions in the dataset using cosine similarity. This metric is the cosine of the angle between two vectors, so it compares their direction in the embedding space and ignores their length. Because the embeddings are L2-normalized, it equals their dot product.

In [None]:
# Define a user input query
user_query = "What is there"

# Generate embedding for the user query
query_embedding = model.encode(user_query)
# L2-normalize the query as one row of shape (1, dim); reshaping to (-1, 1)
# would normalize each component separately and reduce the vector to its signs
query_embedding = normalize(query_embedding.reshape(1, -1))[0]

# Calculate cosine similarities between the query embedding and all question embeddings
similarities = cosine_similarity([query_embedding], q_embeddings)[0]

# Specify the number of top similar questions to retrieve
top_n = 5
# Get indices of the top_n most similar questions, best match first
most_similar_indices = similarities.argsort()[-top_n:][::-1]

# Display the results
print("Top similar questions:")
for index in most_similar_indices:
    print(questions[index])  # Look up the text in the same list that was encoded
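
To make the ranking step concrete, here is a minimal, library-free sketch on made-up 2-D vectors (real SBERT embeddings have hundreds of dimensions; the texts, vectors, and helper names `l2_normalize` and `cosine` are all hypothetical). It illustrates two points, under those assumptions: after scaling each vector as a whole to unit length, the dot product equals the cosine similarity, and the indices returned by the ranking only make sense when used on the same list, in the same order, that produced the vectors.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(a, b):
    """Cosine similarity computed directly from its definition."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy corpus: texts[i] is described by vectors[i] (values invented for illustration)
texts = ["how do i learn python", "best way to study python", "how to bake bread"]
vectors = [[3.0, 4.0], [4.0, 3.0], [4.0, -3.0]]
query = [1.5, 2.0]

unit_vectors = [l2_normalize(v) for v in vectors]
unit_query = l2_normalize(query)

# For unit vectors, the plain dot product is the cosine similarity
scores = [sum(q * x for q, x in zip(unit_query, v)) for v in unit_vectors]

# Rank indices by score, best first, and look the texts up in the same list
top_n = 2
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_n]
top_texts = [texts[i] for i in ranked]
print(top_texts)
```

In this toy case the query points in the same direction as the first vector, so that text ranks first even though the two vectors differ in length.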


## Conclusion
In this notebook, we implemented a semantic search engine using SBERT on the Quora Question Pairs dataset. By encoding the questions into embeddings, we can retrieve semantically similar questions for a user query by ranking them on cosine similarity.

### Future Work
- **Explore Other Models**: Investigate the performance of different models for embedding generation.
- **Advanced Ranking Algorithms**: Implement more sophisticated ranking algorithms to improve search results.
- **User Feedback Loop**: Incorporate user feedback to continuously refine and enhance the system.



