# PCA, Clustering and Embeddings

Today we'll analyse the BBC News Articles dataset: 2,225 articles, each labelled with one of five categories.


In [209]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Remember to run pip install sentence-transformers
from sentence_transformers import SentenceTransformer


In [210]:
df = pd.read_csv("https://huggingface.co/datasets/SetFit/bbc-news/resolve/main/bbc-text.csv")

In [172]:
df.value_counts("category")

category
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64

In [173]:
df.sample()["text"].values[0]

'vera drake leads uk oscar hopes mike leigh s film vera drake will lead british hopes at this year s academy awards after getting three nominations.  imelda staunton was nominated for best actress for her role in the abortion drama  while leigh received nods for best director and original screenplay. kate winslet was also nominated in the best actress category for her role in eternal sunshine of the spotless mind. and clive owen and sophie okonedo both got nominated for supporting roles in closer and hotel rwanda respectively. owen has already been made bookmakers  favourite for best supporting actor for the role in closer that has already clinched him a golden globe award.  and it is the first nomination for actress okonedo  chosen for her performance in hotel rwanda  about the 1994 rwandan genocide. it is also a debut nomination for staunton  49  who told bbc news 24 she had not thought the film would appeal to academy voters.  it was an extraordinary time making the film and i can t

## Embeddings

In [174]:
model = SentenceTransformer('all-MiniLM-L6-v2')


In [175]:
embeddings = model.encode(df['text'].tolist(), show_progress_bar=True)

emb_df = pd.DataFrame(embeddings)



In [176]:
emb_df

,0,1,2,3,4,5,6,7,8,9,...,374,375,376,377,378,379,380,381,382,383
0,-0.001554,-0.067275,0.011174,-0.097146,0.054098,0.042254,-0.034863,-0.017126,0.062448,-0.024317,...,0.083276,-0.012365,0.099596,-0.002854,0.030708,0.069352,-0.008165,-0.015212,-0.063724,0.084557
1,-0.083547,0.059484,-0.013196,-0.011908,0.011631,0.002616,0.117316,0.002310,0.013166,0.028018,...,0.056850,-0.037205,0.039960,0.014939,-0.075912,-0.010551,0.011098,-0.093159,-0.002457,0.021243
2,-0.055962,-0.008481,-0.025843,-0.056737,-0.045265,-0.021684,0.030081,-0.013983,0.030562,-0.003697,...,0.014019,-0.006896,-0.006589,-0.018533,-0.056650,-0.000594,0.039833,-0.056027,0.043895,-0.046996
3,0.015004,-0.125696,-0.028034,-0.040649,0.080588,0.052048,0.025743,-0.012288,0.033973,0.002043,...,0.004305,-0.004015,-0.057729,-0.021071,0.009793,0.056244,-0.032304,-0.020664,-0.025317,0.056185
4,-0.018273,-0.018895,-0.047966,-0.070612,-0.006405,0.059953,-0.119612,0.026853,0.050197,-0.011960,...,0.051296,0.027649,-0.022006,-0.028721,-0.003901,0.082542,-0.065968,-0.073110,-0.035203,0.043847
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2220,-0.027352,-0.026011,0.054015,0.025434,-0.014118,0.073263,-0.022070,0.068582,-0.033815,-0.002269,...,-0.010152,-0.027655,-0.055435,-0.022257,-0.029578,-0.062166,-0.029384,-0.129902,-0.041675,0.080504
2221,-0.017127,0.009932,0.017573,-0.024224,0.050684,0.034838,0.082775,-0.099431,-0.101438,0.020063,...,0.045735,-0.004506,-0.019301,-0.011101,-0.026871,0.096571,-0.026657,0.012036,-0.049906,0.012583
2222,0.027212,-0.136341,0.026965,-0.063657,-0.002635,0.035392,0.036003,-0.031810,0.021682,0.062556,...,0.032426,0.031709,-0.021834,0.052545,0.000720,0.055665,-0.050568,-0.092031,-0.077371,-0.007813
2223,0.046371,-0.035961,0.065271,-0.032036,0.030133,0.061473,-0.000475,-0.001458,0.036582,0.081157,...,0.069589,0.028942,0.002779,-0.033583,-0.061675,0.092261,0.012805,0.045767,-0.016248,0.048731


In [177]:
# If the encoding cell above takes too long to run, load the precomputed embeddings instead
emb_df = pd.read_csv("https://raw.githubusercontent.com/harismck/ism-data-science-2025/master/week9/bbc_embeddings.csv", index_col=0)


In [178]:
emb_df.shape

(2225, 384)

## Tasks

### Task 1

- Reduce the dimensions of the embeddings dataset to two principal components. 
- Assign the principal component values to the original dataframe `df`.
- Plot the result on a scatterplot. 
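
One possible approach to Task 1 is sketched below. It assumes scikit-learn is installed (the notebook does not import it so far). The synthetic `emb_df` and `df` are stand-ins so the snippet runs on its own, and the column names `pc1`, `pc2` and the helper `plot_components` are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Assumption: synthetic stand-ins for the notebook's `emb_df` and `df`,
# so the sketch runs on its own. Skip these three lines in the notebook.
X, blob_ids = make_blobs(n_samples=300, n_features=384, centers=5, random_state=0)
emb_df = pd.DataFrame(X)
df = pd.DataFrame({"category": blob_ids.astype(str)})

# Project the 384-dimensional embeddings onto the first two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(emb_df)  # array of shape (n_rows, 2)

# `emb_df` and `df` share the same row order, so assign by position.
df["pc1"] = components[:, 0]
df["pc2"] = components[:, 1]

# Share of the total variance captured by each component.
print(pca.explained_variance_ratio_)


def plot_components(frame):
    """Scatterplot of the two components, coloured by article category."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.scatterplot(data=frame, x="pc1", y="pc2", hue="category", alpha=0.7)
    plt.title("Embeddings projected onto two principal components")
    plt.show()
```

Calling `plot_components(df)` draws the scatterplot. Two components usually capture only part of the variance of a 384-dimensional embedding, so the printed ratios are worth checking before reading too much into the plot.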


### Task 2

Print out several articles that are close to one another on the scatterplot above. Is their content similar?
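
Rather than picking neighbours by eye, one option is to look them up in the two-dimensional space. The sketch below uses scikit-learn's `NearestNeighbors`; it assumes `df` already holds the component columns from Task 1 (named `pc1` and `pc2` here as an assumption) and uses a synthetic stand-in so it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Assumption: synthetic stand-in for `df` after Task 1, with hypothetical
# column names `pc1` and `pc2`. Skip this block in the notebook.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pc1": rng.normal(size=200),
    "pc2": rng.normal(size=200),
    "category": rng.choice(["sport", "tech", "politics"], size=200),
    "text": [f"placeholder article {i}" for i in range(200)],
})

coords = df[["pc1", "pc2"]]
nn = NearestNeighbors(n_neighbors=4).fit(coords)

# Pick one article and find its nearest neighbours on the scatterplot.
# The query article is returned as its own nearest neighbour (distance 0).
query_row = 0
distances, indices = nn.kneighbors(coords.iloc[[query_row]])

for dist, i in zip(distances[0], indices[0]):
    row = df.iloc[i]
    print(f"{dist:.3f}  [{row['category']}]  {row['text'][:200]}")
```

Changing `query_row` (or sampling it at random) gives a different neighbourhood to inspect. Closeness in two dimensions is only an approximation of closeness in the full embedding space.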


### Task 3

- Cluster the articles into three clusters using k-means clustering.
- Visualize the clusters on a scatterplot.
- Do the clusters have something in common in terms of article category?
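
A minimal sketch of Task 3 follows. It clusters the full embeddings and uses PCA only for plotting; clustering the two components instead is an equally valid reading of the task. The stand-in data, the column names and the helper `plot_clusters` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Assumption: synthetic stand-ins for `emb_df` and `df`. Skip in the notebook.
names = ["sport", "business", "politics", "tech", "entertainment"]
X, blob_ids = make_blobs(n_samples=300, n_features=384, centers=5, random_state=0)
emb_df = pd.DataFrame(X)
df = pd.DataFrame({"category": [names[i] for i in blob_ids]})

# Fit k-means with three clusters on the full embeddings.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(emb_df)

# Two principal components, used only as plotting coordinates.
components = PCA(n_components=2).fit_transform(emb_df)
df["pc1"] = components[:, 0]
df["pc2"] = components[:, 1]

# How many articles of each category land in each cluster?
counts = pd.crosstab(df["cluster"], df["category"])
print(counts)


def plot_clusters(frame):
    """Scatterplot coloured by cluster, with marker style showing category."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.scatterplot(data=frame, x="pc1", y="pc2",
                    hue="cluster", style="category", palette="deep")
    plt.show()
```

With five categories and only three clusters, at least one cluster has to mix categories, and the crosstab shows which ones end up together.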


### Task 4

Find the optimal number of clusters by fitting k-means multiple times with increasing values of `n_clusters`.
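
Two common heuristics for this are the elbow in the inertia curve and the silhouette score. The sketch below computes both for `n_clusters` from 2 to 10; the range, the stand-in data and the variable names are assumptions, and neither heuristic is guaranteed to give a clear answer on real embeddings.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: synthetic stand-in for `emb_df`. Skip in the notebook.
X, _ = make_blobs(n_samples=300, n_features=384, centers=5, random_state=0)
emb_df = pd.DataFrame(X)

rows = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb_df)
    rows.append({
        "k": k,
        "inertia": km.inertia_,  # within-cluster sum of squared distances
        "silhouette": silhouette_score(emb_df, km.labels_),
    })

scores = pd.DataFrame(rows).set_index("k")
print(scores)

# Inertia tends to fall as k grows, so look for the "elbow" where the drop
# flattens. The silhouette score can be compared across k directly.
best_k = int(scores["silhouette"].idxmax())
print("k with the highest silhouette score:", best_k)
```

In the notebook, `scores["inertia"].plot(marker="o")` draws the elbow curve.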


### Task 5

Fit k-means with a different value of `n_clusters` and visualize the clusters. Show which article categories fall into each cluster.
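
One way to show the category make-up of each cluster is a row-normalised crosstab. The sketch below uses five clusters as an example value (an assumption; use whatever Task 4 suggests) and the same kind of synthetic stand-in data as a substitute for the real frames.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumption: synthetic stand-ins for `emb_df` and `df`. Skip in the notebook.
names = ["sport", "business", "politics", "tech", "entertainment"]
X, blob_ids = make_blobs(n_samples=300, n_features=384, centers=5, random_state=0)
emb_df = pd.DataFrame(X)
df = pd.DataFrame({"category": [names[i] for i in blob_ids]})

# Example value only: five clusters, one per category if clustering is clean.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(emb_df)

# Each row sums to 1: the share of every category within that cluster.
shares = pd.crosstab(df["cluster"], df["category"], normalize="index")
print(shares.round(2))

# The most common category in each cluster.
print(shares.idxmax(axis=1))
```

In the notebook, `sns.heatmap(shares, annot=True)` gives a compact visual of the same table, and the scatterplot from Task 3 can be reused with the new `cluster` column.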


### Task 6

Train a classifier (RandomForest or LogisticRegression) to predict whether an article belongs to the politics category. The model `pipeline` should take the full embedding as input and apply PCA before passing the result to the remaining steps. Tune at least one hyperparameter.

As the last step, report the performance of the model.
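
A minimal sketch of such a pipeline, assuming scikit-learn's `Pipeline` and `GridSearchCV` with a logistic regression classifier. The step names `pca` and `clf`, the grid values, the F1 scoring choice and the stand-in data are all illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Assumption: synthetic stand-ins for `emb_df` and `df`. Skip in the notebook.
names = ["sport", "business", "politics", "tech", "entertainment"]
X, blob_ids = make_blobs(n_samples=500, n_features=384, centers=5, random_state=0)
emb_df = pd.DataFrame(X)
df = pd.DataFrame({"category": [names[i] for i in blob_ids]})

# Binary target: 1 for politics articles, 0 for everything else.
y = (df["category"] == "politics").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    emb_df, y, test_size=0.2, stratify=y, random_state=0
)

# PCA runs inside the pipeline, so it is refit on each training fold only.
pipeline = Pipeline([
    ("pca", PCA(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid: number of components and regularisation strength.
param_grid = {
    "pca__n_components": [5, 20, 50],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)

# Report performance on the held-out test set.
y_pred = search.predict(X_test)
test_f1 = f1_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
```

Politics is a minority class here, so plain accuracy can look good even for a model that never predicts it; that is why the sketch scores on F1 and prints the full classification report.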


### Task 7

Calculate feature importances of the principal components.
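
With a random forest as the final step, the importances of the principal components can be read from the fitted classifier's `feature_importances_` attribute. The sketch below assumes a pipeline with steps named `pca` and `clf` and ten components; these names and values, and the stand-in data, are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Assumption: synthetic stand-ins for `emb_df` and `df`. Skip in the notebook.
names = ["sport", "business", "politics", "tech", "entertainment"]
X, blob_ids = make_blobs(n_samples=300, n_features=384, centers=5, random_state=0)
emb_df = pd.DataFrame(X)
df = pd.DataFrame({"category": [names[i] for i in blob_ids]})
y = (df["category"] == "politics").astype(int)

n_components = 10  # illustrative value
pipeline = Pipeline([
    ("pca", PCA(n_components=n_components, random_state=0)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
pipeline.fit(emb_df, y)

# The forest sees the principal components as its input features, so its
# impurity-based importances are per component, not per embedding dimension.
importances = pd.Series(
    pipeline.named_steps["clf"].feature_importances_,
    index=[f"PC{i + 1}" for i in range(n_components)],
).sort_values(ascending=False)
print(importances)
```

If the model was tuned with `GridSearchCV`, the fitted classifier is available as `search.best_estimator_.named_steps["clf"]`. For a logistic regression, the absolute values of `coef_` play a similar role, although they depend on the scale of each component.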
