<a href="https://colab.research.google.com/github/itsdivya1309/Machine-Learning/blob/main/LLM/Text%20Clustering%20and%20Topic%20Modelling/Text_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Clustering

Text clustering aims to group similar texts based on their semantic content, meaning, and relationships.

# arXiv Articles

We will explore articles from arXiv's quantitative biology (`q-bio`) section, which has 10 categories:
* `q-bio.BM`: Biomolecules
* `q-bio.CB`: Cell Behavior
* `q-bio.GN`: Genomics
* `q-bio.MN`: Molecular Networks
* `q-bio.NC`: Neurons and Cognition
* `q-bio.OT`: Other Quantitative Biology
* `q-bio.PE`: Populations and Evolution
* `q-bio.QM`: Quantitative Methods
* `q-bio.SC`: Subcellular Processes
* `q-bio.TO`: Tissues and Organs

The filter used below targets the first 9 of these (`q-bio.TO` is not in the filter set), which leaves 45,658 articles to cluster.

In [1]:
! pip install kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json



In [2]:
!kaggle datasets download -d Cornell-University/arxiv

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
Downloading arxiv.zip to /content
 99% 1.38G/1.40G [00:15<00:00, 111MB/s] 
100% 1.40G/1.40G [00:15<00:00, 94.0MB/s]


In [3]:
!unzip arxiv.zip -d arxiv_data

Archive:  arxiv.zip
  inflating: arxiv_data/arxiv-metadata-oai-snapshot.json  


In [4]:
import pandas as pd
import json

# Load JSON file
json_file = "arxiv_data/arxiv-metadata-oai-snapshot.json"

# The categories of interest
bio_categories = {
    'q-bio.BM', 'q-bio.CB', 'q-bio.GN', 'q-bio.MN',
    'q-bio.NC', 'q-bio.OT', 'q-bio.PE', 'q-bio.QM', 'q-bio.SC'
}

# Prepare a list to store filtered data
filtered_data = []

# Process the file line by line (the snapshot is stored as JSON Lines)
with open(json_file, "r", encoding="utf-8") as f:
    for line in f:
        paper = json.loads(line)  # Parse one JSON record

        # `categories` is a space-separated string, so split it before matching
        if bio_categories.intersection(paper['categories'].split()):
            # Extract only the required fields
            filtered_data.append({
                "title": paper.get("title", ""),
                "authors": paper.get("authors", ""),
                "abstract": paper.get("abstract", ""),
                # "N/A" applies only if the key is absent; a null DOI stays None
                "doi": paper.get("doi", "N/A")
            })
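
The filtering step relies on two details of the data: each line of the snapshot is one JSON record, and `categories` holds all of a paper's categories in a single space-separated string. The sketch below illustrates both on three made-up records; the sample records, the `targets` set, and the `matches` helper are hypothetical and only stand in for the real file.

```python
import io
import json

# Hypothetical records that mimic the snapshot's JSON Lines layout
sample = io.StringIO(
    '{"title": "Paper A", "categories": "q-bio.BM cond-mat.soft", "doi": null}\n'
    '{"title": "Paper B", "categories": "hep-th", "doi": "10.0000/example"}\n'
    '{"title": "Paper C", "categories": "cs.LG q-bio.QM", "doi": null}\n'
)

targets = {"q-bio.BM", "q-bio.QM"}

def matches(paper, targets):
    """Return True if any whitespace-separated category is in targets."""
    return not targets.isdisjoint(paper["categories"].split())

# Iterating over the file object yields one JSON record per line
records = [json.loads(line) for line in sample]
kept = [p["title"] for p in records if matches(p, targets)]
print(kept)

# A key that is present with a null value does not trigger the .get() default
doi_of_first = records[0].get("doi", "N/A")
print(doi_of_first)
```

The last two lines show why papers without a DOI can appear with an empty `doi` cell rather than `"N/A"`: the default in `dict.get` is used only when the key is missing entirely.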

In [5]:
# Convert to a DataFrame
papers = pd.DataFrame(filtered_data)

# Display the first few rows
papers.head()

| | title | authors | abstract | doi |
|---|---|---|---|---|
| 0 | Molecular Synchronization Waves in Arrays of A... | Vanessa Casagrande, Yuichi Togashi, Alexander ... | Spatiotemporal pattern formation in a produc... | 10.1103/PhysRevLett.99.048301 |
| 1 | Origin of adaptive mutants: a quantum measurem... | Vasily Ogryzko | This is a supplement to the paper arXiv:q-bi... | |
| 2 | A remark on the number of steady states in a m... | Liming Wang and Eduardo D. Sontag | The multisite phosphorylation-dephosphorylat... | |
| 3 | Complexities of Human Promoter Sequences | Fangcui Zhao, Huijie Yang, and Binghong Wang | By means of the diffusion entropy approach, ... | 10.1016/j.jtbi.2007.03.035 |
| 4 | Intricate Knots in Proteins: Function and Evol... | Peter Virnau (1), Leonid A. Mirny (1,2), Mehra... | A number of recently discovered protein stru... | |


In [6]:
# Extract metadata
abstracts = papers['abstract']
titles = papers['title']
abstracts.shape

(45658,)

# A common pipeline for text clustering

1. Convert the text documents to embeddings using an *embedding model*.
2. Reduce the dimensionality of the embeddings with a *dimensionality reduction model*.
3. Find groups of semantically similar documents with a *clustering model*.

---
## Embedding Documents

Choosing an embedding model optimized for semantic similarity is especially important for clustering, because the goal is to find groups of semantically similar documents. Fortunately, most general-purpose embedding models are trained with semantic similarity in mind.
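
To make "semantically similar" concrete: similarity between two embeddings is commonly measured with cosine similarity, the cosine of the angle between the vectors. The sketch below is a minimal illustration on made-up 4-dimensional vectors; the vectors and their names are hypothetical and were not produced by any embedding model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" for three abstracts
protein_folding = [0.9, 0.1, 0.0, 0.2]
protein_structure = [0.8, 0.2, 0.1, 0.3]
population_genetics = [0.1, 0.9, 0.7, 0.0]

sim_related = cosine_similarity(protein_folding, protein_structure)
sim_unrelated = cosine_similarity(protein_folding, population_genetics)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while vectors with little overlap score much lower. A good embedding model places abstracts on related topics in the first situation.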

Here, we'll use the `thenlper/gte-small` embedding model, chosen for its balance between score on clustering tasks and run time.

In [7]:
from sentence_transformers import SentenceTransformer

# Load the embedding model
embedding_model = SentenceTransformer('thenlper/gte-small')

# Create an embedding for each abstract (pass a plain list of strings)
embeddings = embedding_model.encode(abstracts.tolist(), show_progress_bar=True)



In [8]:
# The dimensions of the resulting embeddings
embeddings.shape

(45658, 384)

## Reducing the Dimensionality of Embeddings

As the number of dimensions increases, the volume of the space grows exponentially, so the data points become sparse and the distances between them become less informative. This is often called the curse of dimensionality. As a result, high-dimensional data can be troublesome for many clustering techniques.
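
One way to see this effect is to measure how much the nearest and farthest neighbours of a point differ as the dimension grows. The sketch below uses uniformly random points rather than real embeddings, so it is only an illustration under that assumption; the `relative_contrast` helper is hypothetical.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of distances from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)

# Compare a low-dimensional space with the 5 and 384 dimensions used here
contrasts = {dim: relative_contrast(dim) for dim in (2, 5, 384)}
for dim, value in contrasts.items():
    print(f"{dim:>3} dims: relative contrast = {value:.2f}")
```

In low dimensions the farthest point is many times farther away than the nearest one. In 384 dimensions the two distances are nearly the same, which gives distance-based and density-based clustering methods little to work with.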

Well-known dimensionality reduction techniques:
* For clustering raw text data: LSA (Latent Semantic Analysis), NMF (Non-negative Matrix Factorization) or feature selection techniques
* For clustering embeddings: PCA, t-SNE, UMAP
* Deep Learning based approaches: Autoencoders

For this project, we'll go with UMAP (Uniform Manifold Approximation and Projection), as it tends to preserve local structure and much of the global structure while remaining computationally efficient.


In [9]:
! pip install umap-learn

Collecting umap-learn
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading umap_learn-0.5.7-py3-none-any.whl (88 kB)
Downloading pynndescent-0.5.13-py3-none-any.whl (56 kB)
Installing collected packages: pynndescent, umap-learn
Successfully installed pynndescent-0.5.13 umap-learn-0.5.7


In [10]:
import umap

# UMAP model to reduce 384 dimensions to 5
umap_model = umap.UMAP(n_components=5, min_dist=0.1, metric='cosine', random_state=42)

# Fit and transform the embeddings
reduced_embeddings = umap_model.fit_transform(embeddings)



In [11]:
reduced_embeddings.shape

(45658, 5)
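
Besides helping the clustering step, the reduction also shrinks the data considerably. The arithmetic below is a rough estimate that assumes both arrays are dense and stored as 32-bit floats, which is not confirmed by the output above; the `array_mib` helper is hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: both arrays are stored as float32

def array_mib(rows, cols, bytes_per_value=BYTES_PER_FLOAT32):
    """Size of a dense rows x cols array in mebibytes."""
    return rows * cols * bytes_per_value / 2**20

full_mib = array_mib(45658, 384)    # shape of `embeddings`
reduced_mib = array_mib(45658, 5)   # shape of `reduced_embeddings`
print(f"full: {full_mib:.1f} MiB, reduced: {reduced_mib:.2f} MiB")
```

Under that assumption the reduced array is 384 / 5, or roughly 77 times, smaller than the original embeddings.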

## Cluster the Reduced Embeddings