### Fast Clustering on Quora Questions Set

Fast clustering excels at large datasets. Agglomerative clustering performs slowly on large datasets and is only practical for a few thousand sentences.

Fast clustering can cluster 50k sentences within a few seconds. A similarity threshold defines how similar two sentences must be to land in the same cluster.
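
To make the role of the threshold concrete, here is a minimal pure-Python sketch of threshold-based community detection on toy 2-D vectors. It is a simplified illustration, not the library's implementation: `cosine` and `toy_community_detection` are hypothetical helpers, and the greedy "largest group first, no overlaps" rule is an assumption chosen for clarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def toy_community_detection(vectors, threshold=0.5, min_community_size=2):
    """Simplified illustration only (hypothetical helper, not the library code)."""
    n = len(vectors)
    # Candidate community per point: every point at least `threshold` similar to it.
    candidates = []
    for i in range(n):
        members = [j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)
    # Accept the largest candidates first; skip any that overlap an accepted one.
    candidates.sort(key=len, reverse=True)
    communities, assigned = [], set()
    for members in candidates:
        if assigned.isdisjoint(members):
            communities.append(sorted(members))
            assigned.update(members)
    return communities

toy_vectors = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],  # three vectors pointing roughly along x
    [0.0, 1.0], [0.1, 0.9],                # two vectors pointing roughly along y
    [-1.0, -1.0],                          # an outlier with no close neighbour
]
toy_clusters = toy_community_detection(toy_vectors, threshold=0.9, min_community_size=2)
print(toy_clusters)
```

Raising `threshold` keeps only tighter groups, while raising `min_community_size` discards small ones; the outlier never reaches the minimum size, so it is left out of the result.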

Dataset: Quora Duplicate Questions, https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs

In [1]:
from sentence_transformers import SentenceTransformer, util

import pandas as pd
import torch
import time

In [2]:
model = SentenceTransformer("all-MiniLM-L6-v2")

In [3]:
df = pd.read_csv("quora_duplicate_questions.tsv", sep='\t')
df.shape

(404290, 6)

In [4]:
df.head(10)

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
6,6,13,14,Should I buy tiago?,What keeps childern active and far from phone ...,0
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
8,8,17,18,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,9,19,20,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0


Take the first 1,000 questions from the `question1` column.

In [5]:
sentences = df['question1'].tolist()[:1000]
len(sentences)

1000
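
If the column contains missing values, pandas represents them as NaN floats, and exact duplicate questions would all be encoded separately. The sketch below shows one way to filter the list before encoding; `clean_sentences` is a hypothetical helper, not part of the original notebook, and the sample strings are taken from the preview above.

```python
def clean_sentences(raw, limit=None):
    """Keep non-empty strings, drop exact duplicates, preserve the original order."""
    seen = set()
    cleaned = []
    for item in raw:
        if not isinstance(item, str):
            continue  # e.g. NaN floats that pandas uses for missing cells
        text = item.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
            if limit is not None and len(cleaned) == limit:
                break
    return cleaned

sample = [
    "How can I be a good geologist?",
    float("nan"),
    "   ",
    "Should I buy tiago?",
    "How can I be a good geologist?",
]
print(clean_sentences(sample))
```

Note that after cleaning, cluster indices refer to positions in the cleaned list rather than to rows of `df`.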

In [6]:
corpus_embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)

In [7]:
# model.encode() returns a NumPy array by default; convert it to a float32 tensor
embeddings_tensor = torch.tensor(corpus_embeddings, dtype=torch.float32)

# Move the tensor to the GPU if one is available
if torch.cuda.is_available():
    embeddings_tensor = embeddings_tensor.cuda()

# threshold is the minimum cosine similarity for two questions to count as similar;
# communities with fewer than min_community_size members are not returned
clusters = util.community_detection(embeddings_tensor, min_community_size=5, threshold=0.5)
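
The notebook imports `time` but never uses it. A small wrapper like the one below could be used to check the "few seconds" claim on your own hardware. `timed` is a hypothetical helper, and the `sum` call is only a stand-in workload so the sketch runs on its own.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload. In the notebook the call could instead be:
# clusters, seconds = timed(util.community_detection, embeddings_tensor,
#                           min_community_size=5, threshold=0.5)
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, took {seconds:.4f} s")
```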


In [8]:
clusters

[[92, 103, 304, 607, 688, 723, 777, 870, 919, 978],
 [105, 199, 295, 321, 439, 675, 689, 877, 907],
 [28, 78, 273, 284, 564, 647, 784, 945],
 [79, 299, 549, 590, 725, 726, 733],
 [100, 140, 287, 598, 618, 669],
 [93, 263, 401, 544, 930, 957],
 [72, 198, 364, 644, 686, 969],
 [384, 722, 734, 752, 895, 973],
 [49, 302, 566, 591, 967],
 [3, 63, 115, 218, 910],
 [233, 333, 419, 422, 425],
 [317, 502, 532, 608, 852],
 [219, 540, 703, 742, 858],
 [175, 612, 796, 926, 996]]

In [9]:
for i, cluster in enumerate(clusters):
    print("\nCluster {}, #{} Questions".format(i+1, len(cluster)))

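    # Print the first three questions of each cluster (avoid shadowing the built-in id)
    for sentence_idx in cluster[0:3]:
        print("\t", sentences[sentence_idx])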
    print("\t", "...")


Cluster 1, #10 Questions
	 What are some of the best romantic movies in English?
	 Which is the best fiction novel of 2016?
	 Which are the best Hollywood thriller movies?
	 ...

Cluster 2, #9 Questions
	 Will the recent demonetisation results in higher GDP? If so how much?
	 What are the effects of demonitization of 500 and 1000 rupees notes on real estate sector?
	 What will be the effect of banning 500 and 1000 notes on stock markets in India?
	 ...

Cluster 3, #8 Questions
	 What is best way to make money online?
	 How can I make money through the Internet?
	 What is the best way to get traffic on your website?
	 ...

Cluster 4, #7 Questions
	 What is purpose of life?
	 What the meaning of this all life?
	 What is the best lesson in life?
	 ...

Cluster 5, #6 Questions
	 Will there really be any war between India and Pakistan over the Uri attack? What will be its effects?
	 What is our stance against Pakistan?
	 If there will be a war between India and Pakistan who will win?
	 ...
