# Sentence Similarity on Quora

This notebook explores sentence similarity using pre-trained sentence embeddings and clustering techniques. We leverage the Sentence Transformer library to convert sentences from the Quora Duplicate Questions Dataset into numerical representations. Subsequently, we employ community detection to group similar sentences based on their embeddings, potentially revealing thematic clusters within the dataset.

In [1]:
# Load the packages and the model used for sentence similarity
from sentence_transformers import SentenceTransformer, util
import os
import csv
import time


"""
The 'sentence_transformers' library provides tools for text embedding and sentence similarity tasks.
We import the SentenceTransformer class to create a sentence embedding model and the util module for utility functions.
We use 'all-MiniLM-L6-v2', a pre-trained model from the Sentence Transformers library that is known to be effective for detecting similar questions.
Two parameters control the clustering:
'threshold': the cosine similarity cutoff for what counts as similar. A high threshold finds only extremely similar sentences; a lower threshold finds more sentences that are less similar.
'min_community_size': only communities with at least this many sentences are returned.
The method for finding the communities is fast: the library's fast clustering example quotes about 5 seconds for 50k sentences (plus embedding computation), but the actual time depends on hardware.
"""

# Model for computing sentence embeddings.
model = SentenceTransformer('all-MiniLM-L6-v2')

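The similarity threshold described in the cell above is a cutoff on cosine similarity between embedding vectors. As a minimal sketch of what that comparison means, the snippet below computes cosine similarity in plain Python on made-up 3-dimensional vectors (they are not model output; real `all-MiniLM-L6-v2` embeddings are much longer). The helper name `cosine_similarity` is an assumption for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" with made-up values, chosen so the arithmetic is easy to follow.
q1 = [1.0, 0.0, 0.0]
q2 = [0.8, 0.6, 0.0]  # points in nearly the same direction as q1
q3 = [0.0, 0.0, 1.0]  # orthogonal to q1

threshold = 0.75
sim_12 = cosine_similarity(q1, q2)
sim_13 = cosine_similarity(q1, q3)

# A pair counts as "similar" only if its score reaches the threshold.
print("q1 vs q2 similar:", sim_12 >= threshold)
print("q1 vs q3 similar:", sim_13 >= threshold)
```

In the notebook itself, `util.cos_sim` from `sentence_transformers` performs the same computation on tensors, so this hand-written version is only meant to make the threshold concrete.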

In [2]:

# We download the Quora Duplicate Questions Dataset (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)
# and find similar questions in it
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 50000  # We limit our corpus to only the first 50k questions


# Download the dataset if it is not already present locally
if not os.path.exists(dataset_path):
    print("Download dataset")
    util.http_get(url, dataset_path)

In [3]:
# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding='utf8') as fIn:
    
    # Create a CSV reader object, treating the first row as header and using tab delimiters with minimal quoting.
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)

    # Add both question1 and question2 from each row to the corpus_sentences set, ensuring uniqueness.
    for row in reader:
        corpus_sentences.add(row['question1'])
        corpus_sentences.add(row['question2'])

        # Stop reading if the maximum corpus size is reached.
        if len(corpus_sentences) >= max_corpus_size:
            break

# Convert the set of unique sentences back into a list for compatibility with the model's encoding function.
corpus_sentences = list(corpus_sentences)
print("Encode the corpus. This might take a while")

# Use the encode method of the model to generate numerical embeddings for each sentence in corpus_sentences.
# Return the embeddings as PyTorch tensors, suitable for further computations
corpus_embeddings = model.encode(corpus_sentences, 
                                 batch_size=64, 
                                 show_progress_bar=True, 
                                 convert_to_tensor=True)

Encode the corpus. This might take a while


Batches:   0%|          | 0/782 [00:00<?, ?it/s]
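Two details of the reading loop above are easy to miss. First, the size check runs only after both questions of a row have been added, so the corpus can end up one sentence larger than `max_corpus_size`. Second, a `set` does not preserve insertion order, and string hashing is randomized per Python process by default, so slices such as `corpus_sentences[5:25]` can differ between runs. The sketch below shows one possible order-preserving variant that uses a `dict` instead. The helper name `read_unique_questions` and the three-row sample are assumptions made for illustration, not part of the notebook or the dataset.

```python
import csv
import io


def read_unique_questions(lines, max_size):
    """Collect unique questions in first-seen order, stopping once max_size is reached."""
    seen = {}  # a dict keeps insertion order, unlike a set
    reader = csv.DictReader(lines, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        for column in ('question1', 'question2'):
            seen.setdefault(row[column], None)
        # Same check as in the notebook: it runs once per row, after both
        # questions were added, so the result may exceed max_size by one.
        if len(seen) >= max_size:
            break
    return list(seen)


# Made-up sample in the same tab-separated layout (only the columns used here).
sample_tsv = io.StringIO(
    "id\tquestion1\tquestion2\n"
    "0\tHow do I learn Python?\tWhat is the best way to learn Python?\n"
    "1\tHow do I learn Python?\tHow can I learn Python quickly?\n"
    "2\tWhat is a tensor?\tWhat is a matrix?\n"
)

questions = read_unique_questions(sample_tsv, max_size=4)
print(len(questions))
print(questions[:3])
```

With `max_size=4`, the third row pushes the count from 3 to 5 before the check fires, which illustrates the overshoot. Whether a stable order matters depends on whether the sample printouts need to be reproducible.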

The next cells preview a sample of the corpus and then run a clustering algorithm on the sentence embeddings, grouping sentences by semantic similarity. The clustering cell exposes two tunable parameters, times the run, and prints the first and last three sentences of each cluster.
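To make the grouping idea concrete, here is a simplified pure-Python sketch of threshold-based community detection. It is not the library's implementation: `util.community_detection` works on tensors and is optimized for large corpora, whereas this version is a quadratic toy. The function name `greedy_communities` and the 2-dimensional vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def greedy_communities(embeddings, threshold=0.75, min_community_size=2):
    """Group indices whose similarity to a shared center reaches the threshold."""
    n = len(embeddings)
    # Candidate community for each sentence i: all sentences similar enough to i.
    candidates = []
    for i in range(n):
        members = [j for j in range(n)
                   if cosine_similarity(embeddings[i], embeddings[j]) >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)

    # Prefer large communities, then drop members already claimed by an earlier one.
    candidates.sort(key=len, reverse=True)
    assigned = set()
    communities = []
    for members in candidates:
        remaining = [j for j in members if j not in assigned]
        if len(remaining) >= min_community_size:
            communities.append(remaining)
            assigned.update(remaining)
    return communities


# Made-up vectors: three point roughly along x, two roughly along y, one is isolated.
toy_embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
    [0.0, 1.0], [0.1, 0.9],
    [-1.0, 0.0],
]
toy_clusters = greedy_communities(toy_embeddings, threshold=0.75, min_community_size=2)
print(toy_clusters)
```

The isolated vector ends up in no cluster because its candidate community is smaller than `min_community_size`, which mirrors how small groups are filtered out in the notebook.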

In [6]:
# Print sample sentences on which we want to calculate similarity
corpus_sentences[5:25]

['What are the things that are constant in life?',
 'What is the difference between speed and velocity in physics? What are some examples?',
 'How can you find the molar mass of deuterium?',
 'How do you overcome the politics in the workplace?',
 'What are important things for people intending to major in education to know about?',
 'Why do I feel sad when I see a beautiful girl?',
 'How do you know when you love somebody?',
 'How did Airbnb make its initial traction?',
 'What are the major differences between Indonesia and Malaysia?',
 'Was Pharaoh Akhenaten really a woman?',
 'How can I get free bitcoins?',
 'How do I start a grocery shop online?',
 'What is the difference between 多少 and 几 in Mandarin Chinese?',
 'Was giving Nobel Prize to Malala a complete joke?',
 'I trust people very quickly and as a result always have been betrayed? What do I do not to trust people and still hold the relation simple?',
 'How do I organize JavaScript code?',
 'Where can I find materials on design 

In [7]:

print("Start clustering")
start_time = time.time()

"""
Use the community_detection function from the sentence_transformers.util module to identify clusters of similar sentences within the embeddings.

Parameters to tune:
min_community_size=25: Minimum number of sentences a cluster must contain; smaller groups are discarded.
threshold=0.75: Cosine similarity threshold for cluster membership. Sentences whose cosine similarity meets this threshold are considered similar and grouped together.
"""
clusters = util.community_detection(corpus_embeddings, min_community_size=25, threshold=0.75)

# Log the time taken for clustering
print("Clustering done after {:.2f} sec".format(time.time() - start_time))

# For each cluster, print the first 3 and last 3 elements
for i, cluster in enumerate(clusters):
    print("\nCluster {}, #{} Elements ".format(i+1, len(cluster)))
    for sentence_id in cluster[0:3]:
        print("\t", corpus_sentences[sentence_id])
    print("\t", "...")
    for sentence_id in cluster[-3:]:
        print("\t", corpus_sentences[sentence_id])


Start clustering
Clustering done after 29.85 sec

Cluster 1, #103 Elements 
	 How can I improve my spoken English?
	 How will I improve my spoken English?
	 What should I do to improve my spoken English?
	 ...
	 How can I increase my knowledge in English language?
	 How do I improve my English writing and speaking skills?
	 What should I do to speak English fluently and not face any problem with vocabulary?

Cluster 2, #86 Elements 
	 How can one make money online?
	 How could I make money online?
	 How do I to make money online?
	 ...
	 How can an apprentice programmer make money online?
	 How do I earned big money even online without investment?
	 What are the ways to make money working from home?

Cluster 3, #82 Elements 
	 What are the economic implications of banning 500 and 1000 rupee notes?
	 What will be the implications of banning 500 and 1000 rupees currency notes on Indian economy?
	 How will the ban of 1000 and 500 rupee notes affect the Indian economy?
	 ...
	 What are you