
# Document Clustering with LLM Embeddings

This notebook demonstrates how to cluster a set of documents using embeddings from a pre-trained Large Language Model (LLM) and the K-Means clustering algorithm.

Dataset: https://www.kaggle.com/datasets/szymonjanowski/internet-articles-data-with-users-engagement



In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).



## Step 1: Preprocessing the Data

First, we preprocess the data by cleaning and preparing the text for embedding. This involves dropping rows whose content is missing and combining the title, description, and content fields into a single string for a richer representation.


In [5]:
import pandas as pd

# Load the first 100 rows of the dataset
file_path = '/content/drive/MyDrive/DM-Assignment-Dataset/articles_data.csv'  # Replace with your dataset path
articles_df = pd.read_csv(file_path, nrows=100)

articles_df.head()

| | Unnamed: 0 | source_id | source_name | author | title | description | url | url_to_image | published_at | content | top_article | engagement_reaction_count | engagement_comment_count | engagement_share_count | engagement_comment_plugin_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | reuters | Reuters | Reuters Editorial | NTSB says Autopilot engaged in 2018 California... | The National Transportation Safety Board said ... | https://www.reuters.com/article/us-tesla-crash... | https://s4.reutersmedia.net/resources/r/?m=02&... | 2019-09-03T16:22:20Z | WASHINGTON (Reuters) - The National Transporta... | 0.0 | 0.0 | 0.0 | 2528.0 | 0.0 |
| 1 | 1 | the-irish-times | The Irish Times | Eoin Burke-Kennedy | Unemployment falls to post-crash low of 5.2% | Latest monthly figures reflect continued growt... | https://www.irishtimes.com/business/economy/un... | https://www.irishtimes.com/image-creator/?id=1... | 2019-09-03T10:32:28Z | The States jobless rate fell to 5.2 per cent l... | 0.0 | 6.0 | 10.0 | 2.0 | 0.0 |
| 2 | 2 | the-irish-times | The Irish Times | Deirdre McQuillan | Louise Kennedy AW2019: Long coats, sparkling t... | Autumn-winter collection features designer’s g... | https://www.irishtimes.com/\t\t\t\t\t\t\t/life... | https://www.irishtimes.com/image-creator/?id=1... | 2019-09-03T14:40:00Z | Louise Kennedy is showing off her autumn-winte... | 1.0 | NaN | NaN | NaN | NaN |
| 3 | 3 | al-jazeera-english | Al Jazeera English | Al Jazeera | North Korean footballer Han joins Italian gian... | Han is the first North Korean player in the Se... | https://www.aljazeera.com/news/2019/09/north-k... | https://www.aljazeera.com/mritems/Images/2019/... | 2019-09-03T17:25:39Z | Han Kwang Song, the first North Korean footbal... | 0.0 | 0.0 | 0.0 | 7.0 | 0.0 |
| 4 | 4 | bbc-news | BBC News | BBC News | UK government lawyer says proroguing parliamen... | The UK government's lawyer, David Johnston arg... | https://www.bbc.co.uk/news/av/uk-scotland-4956... | https://ichef.bbci.co.uk/news/1024/branded_new... | 2019-09-03T14:39:21Z | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [6]:
# Drop rows where 'content' is missing; .copy() avoids SettingWithCopyWarning on the assignment below
articles_df = articles_df.dropna(subset=['content']).copy()

# Combine 'title', 'description', and 'content'; fill missing values with '' so one NaN does not blank the whole string
text_cols = ['title', 'description', 'content']
articles_df['combined_text'] = articles_df[text_cols].fillna('').agg(' '.join, axis=1)
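
A note on missing values when combining text fields: adding pandas string columns with `+` propagates `NaN`, so a row with a missing description ends up with a missing combined text even when its title and content are present. The sketch below illustrates this on a small made-up DataFrame (the rows and the `text_cols` name are hypothetical, not taken from the dataset) and shows one possible way to guard against it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has no description
toy_df = pd.DataFrame({
    'title': ['A', 'B'],
    'description': ['x', np.nan],
    'content': ['c1', 'c2'],
})

# Plain concatenation: any NaN field makes the whole result NaN
naive = toy_df['title'] + ' ' + toy_df['description'] + ' ' + toy_df['content']

# Guarded version: replace NaN with '' before joining, then collapse repeated spaces
text_cols = ['title', 'description', 'content']
safe = toy_df[text_cols].fillna('').agg(' '.join, axis=1)
safe = safe.str.split().str.join(' ')

print(naive.isna().tolist())
print(safe.tolist())
```

Whether an empty string is the right substitute depends on the data; dropping such rows is an equally reasonable choice.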



## Step 2: Text Embedding Using a Large Language Model (LLM)

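Next, we convert the preprocessed text into embeddings using a pre-trained transformer model, BERT (Bidirectional Encoder Representations from Transformers), in its `bert-base-uncased` variant. Each text is truncated to 512 tokens, and the token vectors of the last hidden layer are averaged into a single 768-dimensional vector per document.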


In [7]:

from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained model tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to encode text to get embeddings
def get_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze()

# Applying the function to the combined_text column
embeddings = articles_df['combined_text'].apply(get_embedding)


In [8]:
embeddings

0     [tensor(-0.3337), tensor(-0.3285), tensor(0.26...
1     [tensor(-0.4249), tensor(-0.1933), tensor(0.54...
2     [tensor(0.0348), tensor(-0.2962), tensor(0.716...
3     [tensor(-0.1814), tensor(-0.1089), tensor(0.43...
5     [tensor(-0.3022), tensor(0.1923), tensor(0.172...
                            ...                        
95    [tensor(-0.1174), tensor(-0.2068), tensor(0.27...
96    [tensor(-0.2994), tensor(-0.2044), tensor(0.70...
97    [tensor(-0.0264), tensor(0.0639), tensor(0.307...
98    [tensor(-0.1923), tensor(-0.0821), tensor(0.12...
99    [tensor(-0.1410), tensor(-0.2782), tensor(0.36...
Name: combined_text, Length: 82, dtype: object
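
Because `get_embedding` processes one text at a time, there is no padding and a plain mean over the token axis is fine. If the texts were instead embedded in padded batches, the padding positions would need to be left out of the average. The following is a minimal NumPy sketch of that masked mean, using made-up arrays in place of real model output; `masked_mean_pool` is a hypothetical helper, and in a real batch the mask would correspond to the `attention_mask` returned by the tokenizer.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors, ignoring padded positions.

    hidden: array of shape (batch, seq_len, dim)
    mask:   array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = mask[:, :, None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)              # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

# Made-up example: the third token of the first sequence is padding
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
plain = hidden.mean(axis=1)  # includes the padding vector in the first row
print(pooled)
print(plain)
```

Comparing `pooled` with `plain` shows how much a single padding vector can distort the unmasked average.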


## Step 3: Clustering the Documents

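We use the K-Means algorithm to cluster the documents based on their embeddings. The number of clusters is set to 5 by hand, and `random_state` is fixed so that the assignments are reproducible.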
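
To make the idea concrete, here is a rough NumPy sketch of the basic K-Means loop (Lloyd's algorithm): assign each point to its nearest centre, then move each centre to the mean of its points, and repeat. It is an illustration only, run on made-up 2-D points; `kmeans_sketch` is a hypothetical function, and scikit-learn's `KMeans` uses a more careful initialisation (k-means++) and other refinements.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Very small K-Means: random initial centres, then assign/update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centre for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre moves to the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Made-up data: two obvious groups
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centers = kmeans_sketch(points, k=2)
print(toy_labels)
```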


In [9]:

from sklearn.cluster import KMeans
import numpy as np

# Convert the list of embeddings into a numpy array
embeddings_array = np.array([emb.numpy() for emb in embeddings])

# Define the K-Means model
kmeans = KMeans(n_clusters=5, random_state=42)

# Fit the model
kmeans.fit(embeddings_array)

# Get cluster assignments for each document
cluster_assignments = kmeans.labels_

# Add the cluster assignments to our dataframe
articles_df['cluster'] = cluster_assignments
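
The value `n_clusters=5` above is a fixed choice. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic data from `make_blobs`, which stands in for `embeddings_array` (an assumption made so the example runs on its own); the range of k values and `n_init=10` are likewise illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for embeddings_array: three well-separated groups (assumption)
X, _ = make_blobs(
    n_samples=150,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit K-Means for several k and record the mean silhouette score of each
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real embeddings the scores are usually much closer together than on this toy data, so the result is better read as a hint than as a definitive answer.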




In [16]:
articles_df[['cluster','description','content']]

| | cluster | description | content |
|---|---|---|---|
| 0 | 2 | The National Transportation Safety Board said ... | WASHINGTON (Reuters) - The National Transporta... |
| 1 | 0 | Latest monthly figures reflect continued growt... | The States jobless rate fell to 5.2 per cent l... |
| 2 | 1 | Autumn-winter collection features designer’s g... | Louise Kennedy is showing off her autumn-winte... |
| 3 | 3 | Han is the first North Korean player in the Se... | Han Kwang Song, the first North Korean footbal... |
| 5 | 1 | "This Tender Land" by William Kent Krueger is ... | "This Tender Land: a Novel" (Atria Books), by ... |
| ... | ... | ... | ... |
| 95 | 1 | A woman dies at the hands of her partner every... | Mrs. Douib left her partner six months before ... |
| 96 | 4 | Experts say calm conditions above the storm ar... | Hurricane Dorian is wreaking more havoc on the... |
| 97 | 1 | Making the industry's 80 billion garments per ... | The biggest fashion trend in recent years is "... |
| 98 | 2 | Federal immigration authorities have at least ... | Federal immigration authorities have at least ... |
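
A table of cluster labels is hard to interpret on its own. One possible way to get a feel for each cluster is to look at the document whose embedding lies closest to the cluster centre. The sketch below shows the distance computation on made-up 2-D vectors; `closest_to_centroids` is a hypothetical helper, and in the notebook the inputs would be `embeddings_array` and `kmeans.cluster_centers_`.

```python
import numpy as np

def closest_to_centroids(X, centers):
    """Return, for each centre, the row index of the nearest point in X."""
    # dists[i, j] = Euclidean distance from point i to centre j
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=0)

# Made-up embeddings and centres
toy_embeddings = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.5, 4.5]])
toy_centroids = np.array([[0.05, 0.0], [5.1, 5.0]])

representatives = closest_to_centroids(toy_embeddings, toy_centroids)
print(representatives)
```

In the notebook, the returned positions could be passed to `articles_df.iloc[...]` to read the titles of those representative articles.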
