GPT created a synthetic dataset with titles and descriptions of news articles.

In [1]:
import pandas as pd

# Creating a synthetic dataset of news articles
titles = [
    "Global Climate Change Crisis",
    "Advancements in Artificial Intelligence",
    "Political Turmoil in Country X",
    "Breakthrough in Renewable Energy",
    "Economic Trends Post-Pandemic",
    "New Discoveries in Space Exploration",
    "Celebrity Scandal Shocks Fans",
    "Revolutionary Medical Treatment Unveiled",
    "Historic Peace Agreement Signed",
    "Major Cybersecurity Breach Reported"
]

descriptions = [
    "Experts discuss the severe impacts of climate change and potential solutions.",
    "A look into the future of AI and its impact on society.",
    "Country X faces political instability amid recent elections.",
    "Innovations in solar and wind energy could change the future of power.",
    "Analysis of global economic recovery following the pandemic.",
    "NASA announces a groundbreaking discovery in outer space.",
    "Famous celebrity involved in a major scandal, fans react on social media.",
    "A new drug promises to revolutionize healthcare.",
    "Two countries sign a historic peace treaty, ending years of conflict.",
    "A massive data breach exposes sensitive information of millions."
]

# Generating a DataFrame
data = {'title': titles, 'description': descriptions}
df = pd.DataFrame(data)
df.head()  # Display the first few rows of the dataframe


|   | title | description |
|---|-------|-------------|
| 0 | Global Climate Change Crisis | Experts discuss the severe impacts of climate ... |
| 1 | Advancements in Artificial Intelligence | A look into the future of AI and its impact on... |
| 2 | Political Turmoil in Country X | Country X faces political instability amid rec... |
| 3 | Breakthrough in Renewable Energy | Innovations in solar and wind energy could cha... |
| 4 | Economic Trends Post-Pandemic | Analysis of global economic recovery following... |


Next, we simulate the generation of LLM embeddings for each document. Rather than calling an actual LLM, this demonstration uses a simpler technique available in this environment: TF-IDF vectors reduced with truncated SVD. We will treat the resulting dense vectors as a stand-in for LLM embeddings.

Let's proceed with generating the embeddings:

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Combining title and description for each article
combined_texts = df['title'] + ". " + df['description']

# Using TF-IDF Vectorizer to convert text to numerical data
vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = vectorizer.fit_transform(combined_texts)

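# Using Truncated SVD to reduce dimensionality (simulating dense embeddings).
# Note: 50 components are requested, but with only 10 documents the output
# below ends up with just 10 dimensions.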
svd = TruncatedSVD(n_components=50, random_state=42)
X_reduced = svd.fit_transform(X_tfidf)

# Displaying the shape of the reduced data
X_reduced.shape


(10, 10)
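The shape above is (10, 10) even though 50 components were requested, presumably because the corpus contains only 10 documents. A minimal sketch of making that cap explicit is shown here. The mini-corpus `sample_texts`, the variable names, and the rule of keeping `n_components` below both matrix dimensions are assumptions made for illustration, not part of the original notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, used only to keep this sketch self-contained
sample_texts = [
    "Experts discuss climate change and potential solutions.",
    "Innovations in solar and wind energy could change power.",
    "NASA announces a discovery in outer space.",
    "A massive data breach exposes sensitive information.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(sample_texts)
n_docs, n_terms = tfidf.shape

# Assumed rule of thumb: keep the requested size below both matrix dimensions
requested = 50
n_components = min(requested, n_docs - 1, n_terms - 1)

svd_small = TruncatedSVD(n_components=n_components, random_state=42)
dense = svd_small.fit_transform(tfidf)

# One row per document, one column per retained component
print(dense.shape)
```

Choosing the component count this way makes the embedding size predictable instead of relying on the library to truncate silently.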

Next, apply a clustering algorithm (here, K-Means) to these embeddings and analyze the resulting clusters:

1. Cluster the Embeddings: We will use K-Means clustering.
2. Analyze Clusters: Examine the clusters to see how the documents are grouped.
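K-Means needs the number of clusters to be fixed in advance. One common way to sanity-check that choice is to compare silhouette scores for several candidate values. The sketch below is only an illustration: it uses toy 2-D points from `make_blobs` as a stand-in for the document embeddings, and the candidate range 2 to 5 is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: toy 2-D points stand in for the document embeddings
points, _ = make_blobs(
    n_samples=30,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Fit K-Means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

With only ten short documents, silhouette scores on real embeddings are likely to be noisy, so they are best read as a rough guide rather than a definitive answer.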

In [3]:
from sklearn.cluster import KMeans

# Number of clusters - this can be adjusted
num_clusters = 3

# Applying K-Means clustering
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_reduced)

# Adding cluster information to the DataFrame
df['cluster'] = clusters

# Displaying articles with their assigned cluster
df.sort_values('cluster')




|   | title | description | cluster |
|---|-------|-------------|---------|
| 1 | Advancements in Artificial Intelligence | A look into the future of AI and its impact on... | 0 |
| 2 | Political Turmoil in Country X | Country X faces political instability amid rec... | 0 |
| 6 | Celebrity Scandal Shocks Fans | Famous celebrity involved in a major scandal, ... | 0 |
| 8 | Historic Peace Agreement Signed | Two countries sign a historic peace treaty, en... | 0 |
| 9 | Major Cybersecurity Breach Reported | A massive data breach exposes sensitive inform... | 0 |
| 0 | Global Climate Change Crisis | Experts discuss the severe impacts of climate ... | 1 |
| 3 | Breakthrough in Renewable Energy | Innovations in solar and wind energy could cha... | 1 |
| 4 | Economic Trends Post-Pandemic | Analysis of global economic recovery following... | 1 |
| 5 | New Discoveries in Space Exploration | NASA announces a groundbreaking discovery in o... | 2 |
| 7 | Revolutionary Medical Treatment Unveiled | A new drug promises to revolutionize healthcare. | 2 |


K-Means has grouped the ten synthetic news articles into three clusters. Here's a summary of how the documents are distributed across these clusters:

* Cluster 0: Includes articles about advancements in artificial intelligence, political turmoil, celebrity scandal, a peace agreement, and a cybersecurity breach.
* Cluster 1: Contains articles related to climate change, renewable energy, and post-pandemic economic trends.
* Cluster 2: Comprises articles about space exploration and a revolutionary medical treatment.