# Solution

The purpose of this exercise is to use LSA (latent semantic analysis) to run unsupervised topic extraction on a set of texts and to compare the results with the target variable. The target variable is not used to train a model; it only serves to assess whether the topics found by LSA resemble the classes that would have been used for supervised classification.

Let's begin by importing the libraries we will be using.

In [22]:
import pandas as pd
import numpy as np
import en_core_web_sm
import spacy

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import plotly.figure_factory as ff

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import confusion_matrix
from sklearn.cluster import KMeans
from spacy.lang.en.stop_words import STOP_WORDS

In [3]:
news = fetch_20newsgroups()

In [4]:
print(news.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

Classes                     20
Samples total            18846
Dimensionality               1
Features                  text

Store the object news.data in a DataFrame and call the column text. Extract a sample of 5000 rows to begin with. Add the target variable to this dataframe in order to run analysis later.

In [7]:
data = pd.DataFrame(news.data, columns =["text"])
data = data.sample(5000, random_state=42)
data["target"] = news.target[data.index]
data.head()

Unnamed: 0,text,target
7492,From: rrn@po.CWRU.Edu (Robert R. Novitskey)\nS...,4
3546,From: ardie@ux1.cso.uiuc.edu (Ardie Mack)\nSub...,2
5582,From: tsa@cellar.org (The Silent Assassin)\nSu...,6
4793,From: guy@idacom.hp.com (Guy M. Trotter)\nSubj...,16
3813,From: jwodzia@fadel.uucp (john wodziak)\nSubje...,10


Create a column text_clean containing only alphanumeric characters and spaces, with all characters changed to lowercase. Also keep only the text that comes after the string "Subject:".

In [8]:
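# Keep everything after the first "Subject:" (maxsplit=1 so a later "Subject:" does not truncate the text)
data['text_clean'] = data['text'].apply(lambda x: x.split("Subject:", 1)[1])
# Keep only alphanumeric characters and spaces
data['text_clean'] = data['text_clean'].apply(lambda x: ''.join(ch for ch in x if ch.isalnum() or ch==" "))
# Lowercase everything
data['text_clean'] = data['text_clean'].fillna('').apply(lambda x: x.lower())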
data.head()

Unnamed: 0,text,target,text_clean
7492,From: rrn@po.CWRU.Edu (Robert R. Novitskey)\nS...,4,cyclone and tempestarticleid usenet1pskavqtur...
3546,From: ardie@ux1.cso.uiuc.edu (Ardie Mack)\nSub...,2,re does dos6 defragmentarticleid ux1ardie2727...
5582,From: tsa@cellar.org (The Silent Assassin)\nSu...,6,for sale misc ibm stufforganization the cell...
4793,From: guy@idacom.hp.com (Guy M. Trotter)\nSubj...,16,re guns in backcountry no thanksorganization ...
3813,From: jwodzia@fadel.uucp (john wodziak)\nSubje...,10,re goalie masksreplyto jwodziafadeluucp john ...
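One side effect of the cleaning step is visible in the output above: tokens such as `tempestarticleid` appear because newline characters are dropped rather than replaced by a space, so the last word of one header line fuses with the first word of the next. A minimal sketch of an alternative follows, assuming a hypothetical helper `clean_text` that is not part of the original solution. It replaces every run of non-alphanumeric characters with a single space before lowercasing. The sample string is made up for illustration.

```python
import re


def clean_text(raw: str) -> str:
    """Hypothetical helper: keep what follows the first 'Subject:' and normalize it."""
    # Keep everything after the first "Subject:".
    body = raw.split("Subject:", 1)[1]
    # Replace each run of non-alphanumeric characters (newlines included) with one space.
    body = re.sub(r"[\W_]+", " ", body)
    return body.strip().lower()


# Made-up header in the style of the newsgroup posts (illustrative only).
sample = "From: a@b.c\nSubject: Cyclone and Tempest\nArticle-I.D.: usenet.1\n"
cleaned = clean_text(sample)
print(cleaned)
```

The trade-off is that punctuation inside a word now also splits it (`Article-I.D.` becomes `article i d`), whereas the original approach glues such pieces together. Rerunning the notebook with this variant would change the vocabulary and therefore the numbers in the later outputs.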


Create an nlp object with en_core_web_sm.load()

In [9]:
nlp = en_core_web_sm.load()

Tokenize the cleaned texts, lemmatize the tokens, and remove English stop words

In [15]:
# tokenisation
data["text_tokenized"] = data["text_clean"].apply(lambda x: [token.lemma_ for token in nlp(x) if token.text not in STOP_WORDS])
data.head()

Unnamed: 0,text,target,text_clean,text_tokenized
7492,From: rrn@po.CWRU.Edu (Robert R. Novitskey)\nS...,4,cyclone and tempestarticleid usenet1pskavqtur...,"[ , cyclone, tempestarticleid, usenet1pskavqtu..."
3546,From: ardie@ux1.cso.uiuc.edu (Ardie Mack)\nSub...,2,re does dos6 defragmentarticleid ux1ardie2727...,"[ , dos6, defragmentarticleid, ux1ardie2727340..."
5582,From: tsa@cellar.org (The Silent Assassin)\nSu...,6,for sale misc ibm stufforganization the cell...,"[ , sale, , misc, ibm, stufforganization, cel..."
4793,From: guy@idacom.hp.com (Guy M. Trotter)\nSubj...,16,re guns in backcountry no thanksorganization ...,"[ , gun, backcountry, thanksorganization, idac..."
3813,From: jwodzia@fadel.uucp (john wodziak)\nSubje...,10,re goalie masksreplyto jwodziafadeluucp john ...,"[ , goalie, masksreplyto, jwodziafadeluucp, jo..."


Detokenize the tokenized sentences and store them in an nlp_ready column

In [17]:
# Detokenization: join the tokens of each document back into a single string
data['nlp_ready'] = data['text_tokenized'].apply(' '.join)
data.head()

Unnamed: 0,text,target,text_clean,text_tokenized,nlp_ready
7492,From: rrn@po.CWRU.Edu (Robert R. Novitskey)\nS...,4,cyclone and tempestarticleid usenet1pskavqtur...,"[ , cyclone, tempestarticleid, usenet1pskavqtu...",cyclone tempestarticleid usenet1pskavqturepl...
3546,From: ardie@ux1.cso.uiuc.edu (Ardie Mack)\nSub...,2,re does dos6 defragmentarticleid ux1ardie2727...,"[ , dos6, defragmentarticleid, ux1ardie2727340...",dos6 defragmentarticleid ux1ardie27273409793...
5582,From: tsa@cellar.org (The Silent Assassin)\nSu...,6,for sale misc ibm stufforganization the cell...,"[ , sale, , misc, ibm, stufforganization, cel...",sale misc ibm stufforganization cellar bbs...
4793,From: guy@idacom.hp.com (Guy M. Trotter)\nSubj...,16,re guns in backcountry no thanksorganization ...,"[ , gun, backcountry, thanksorganization, idac...",gun backcountry thanksorganization idacom di...
3813,From: jwodzia@fadel.uucp (john wodziak)\nSubje...,10,re goalie masksreplyto jwodziafadeluucp john ...,"[ , goalie, masksreplyto, jwodziafadeluucp, jo...",goalie masksreplyto jwodziafadeluucp john wo...


Use sklearn to compute the TF-IDF matrix

In [18]:
vectorizer = TfidfVectorizer(stop_words='english', smooth_idf=True)
X = vectorizer.fit_transform(data['nlp_ready'])
X

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 456197 stored elements and shape (5000, 116452)>

Use the TruncatedSVD model to create a topic model with 20 topics

In [19]:
# Initialize the truncated SVD model
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=42)

# Apply SVD to reduce the dimensionality of the TF-IDF matrix
lsa = svd_model.fit_transform(X)

# Build a DataFrame holding the topic encoding of each document
topic_encoded_df = pd.DataFrame(
    lsa,
    columns=[f"topic_{i+1}" for i in range(lsa.shape[1])],  # Generate the column names automatically
    index=data.index  # Keep the index aligned with `data`
)

# Add the cleaned, detokenized text to the DataFrame
topic_encoded_df["text"] = data["nlp_ready"]

# Display the resulting DataFrame
topic_encoded_df.head()

Unnamed: 0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,...,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19,topic_20,text
7492,0.051849,-0.010821,-0.009423,-0.014433,-0.008042,0.002312,0.011048,-0.019142,-0.003941,-0.050777,...,0.020446,-0.040042,-0.04019,-0.054096,0.021121,-0.028638,-0.009876,-0.048188,-0.03362,cyclone tempestarticleid usenet1pskavqturepl...
3546,0.018149,-0.003031,0.000131,-0.007123,0.004234,0.003212,0.005269,-0.010027,-0.002491,-0.007888,...,0.009107,0.004405,0.003686,-0.003512,0.016173,0.012867,0.000324,-0.002977,-0.008221,dos6 defragmentarticleid ux1ardie27273409793...
5582,0.050858,-0.054776,-0.013459,0.03619,-0.039906,-0.031323,-0.02528,-0.004076,-0.011453,-0.019108,...,-0.011651,-0.026889,-0.013882,0.002497,-0.025401,-0.034682,-0.009145,0.015117,0.015439,sale misc ibm stufforganization cellar bbs...
4793,0.090677,0.02815,0.033529,-0.014257,-0.012785,0.043759,0.017277,-0.008308,0.111087,-0.00393,...,-0.068336,-0.074656,0.0159,-0.006194,0.043516,-0.008742,0.078084,0.098187,0.046375,gun backcountry thanksorganization idacom di...
3813,0.08117,0.00657,-0.003259,-0.041566,0.019661,-0.018651,0.002282,-0.004774,-0.010225,-0.007944,...,-0.000242,-0.002872,-0.011874,-0.004732,0.0086,-0.016942,0.004866,-0.019257,-0.023645,goalie masksreplyto jwodziafadeluucp john wo...


Assign each document to the topic it is most strongly linked to. Note that class_pred is zero-based, so class_pred 0 corresponds to the column topic_1.

In [20]:
# Add the class_pred column: index of the topic with the largest loading
topic_encoded_df["class_pred"] = np.argmax(lsa, axis=1)

# Count the documents assigned to each topic
topic_counts = topic_encoded_df["class_pred"].value_counts()

# Print the result
print(topic_counts)

class_pred
0     3705
4      192
2      159
10     135
13     120
14     105
11      86
1       78
6       78
15      74
5       56
16      52
9       43
18      36
12      34
8       25
19      14
7        5
17       2
3        1
Name: count, dtype: int64
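The counts above show a single component absorbing 3705 of the 5000 documents. Two things plausibly contribute. With non-negative TF-IDF input, the first singular direction typically loads with the same sign on every document and behaves like an "average document". In addition, `np.argmax` on raw values ignores strongly negative loadings on later components. The sketch below compares three assignment rules on made-up loadings. The two variants are assumptions for illustration, not part of the original solution.

```python
import numpy as np

# Toy document-topic loadings (illustrative values, not from the dataset).
toy_lsa = np.array([
    [0.09,  0.02, -0.11],
    [0.05, -0.01,  0.03],
    [0.08,  0.10, -0.02],
])

# Rule used in the solution: largest raw loading.
raw_assignment = np.argmax(toy_lsa, axis=1)

# Variant 1: largest absolute loading, so strong negative loadings count too.
abs_assignment = np.argmax(np.abs(toy_lsa), axis=1)

# Variant 2: same, but ignoring component 0; add 1 to restore the original indices.
no_first_assignment = np.argmax(np.abs(toy_lsa[:, 1:]), axis=1) + 1

print(raw_assignment, abs_assignment, no_first_assignment)
```

Whether either variant gives topics closer to the target classes on the real data is an empirical question that this sketch does not answer.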


Add the target variable to the topic model dataframe and print the confusion matrix for the topic against the target variable:

In [21]:
topic_encoded_df["target"] = news.target[data.index]
topic_encoded_df.head()

Unnamed: 0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,...,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19,topic_20,text,class_pred,target
7492,0.051849,-0.010821,-0.009423,-0.014433,-0.008042,0.002312,0.011048,-0.019142,-0.003941,-0.050777,...,-0.04019,-0.054096,0.021121,-0.028638,-0.009876,-0.048188,-0.03362,cyclone tempestarticleid usenet1pskavqturepl...,0,4
3546,0.018149,-0.003031,0.000131,-0.007123,0.004234,0.003212,0.005269,-0.010027,-0.002491,-0.007888,...,0.003686,-0.003512,0.016173,0.012867,0.000324,-0.002977,-0.008221,dos6 defragmentarticleid ux1ardie27273409793...,0,2
5582,0.050858,-0.054776,-0.013459,0.03619,-0.039906,-0.031323,-0.02528,-0.004076,-0.011453,-0.019108,...,-0.013882,0.002497,-0.025401,-0.034682,-0.009145,0.015117,0.015439,sale misc ibm stufforganization cellar bbs...,0,6
4793,0.090677,0.02815,0.033529,-0.014257,-0.012785,0.043759,0.017277,-0.008308,0.111087,-0.00393,...,0.0159,-0.006194,0.043516,-0.008742,0.078084,0.098187,0.046375,gun backcountry thanksorganization idacom di...,8,16
3813,0.08117,0.00657,-0.003259,-0.041566,0.019661,-0.018651,0.002282,-0.004774,-0.010225,-0.007944,...,-0.011874,-0.004732,0.0086,-0.016942,0.004866,-0.019257,-0.023645,goalie masksreplyto jwodziafadeluucp john wo...,0,10


In [23]:

# Compute the confusion matrix and normalize each row by the number of documents in that true class
cm = confusion_matrix(y_true=topic_encoded_df["target"], y_pred=topic_encoded_df["class_pred"])
cm_normalized = cm / cm.sum(axis=1, keepdims=True)  # Row-wise normalization

# Build an annotated heatmap
fig = ff.create_annotated_heatmap(
    z=cm_normalized.round(2),  # Normalized matrix, rounded to 2 decimals
    x=[f"Topic {i}" for i in range(cm.shape[1])],  # Column labels (predicted topics)
    y=[f"Class {i}" for i in range(cm.shape[0])],  # Row labels (true target classes)
    colorscale="Viridis",  # Color palette
    showscale=True  # Show the color scale
)

# Customize the layout
fig.update_layout(
    title="Confusion Matrix Heatmap",
    xaxis=dict(title="Predicted Topics"),
    yaxis=dict(title="True Classes"),
    width=1000,  # Figure width
    height=1000,  # Figure height
    font=dict(size=12)  # Font size
)

# Display the heatmap
fig.show()
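Row-wise normalization turns each row of the confusion matrix into the share of a true class that lands in each predicted topic, so every non-empty row sums to 1. A small sketch on a made-up matrix follows. It also guards against a class with no documents, a case that a plain division would turn into NaN. The matrix values and variable names are assumptions for illustration.

```python
import numpy as np

# Toy confusion matrix: rows are true classes, columns are predicted topics.
toy_cm = np.array([
    [8, 2, 0],
    [5, 5, 0],
    [0, 0, 0],  # a class with no documents in the sample
])

row_totals = toy_cm.sum(axis=1, keepdims=True)

# Divide only where the row total is non-zero; empty rows stay at 0.
toy_cm_normalized = np.divide(
    toy_cm,
    row_totals,
    out=np.zeros(toy_cm.shape, dtype=float),
    where=row_totals != 0,
)

print(toy_cm_normalized)
```

In the toy matrix, predicted topic 0 takes a large share of both non-empty classes, which is the same pattern a dominant column produces in the heatmap.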

Conclusion: the topics found by LSA are very different from the target! Topic 0 is by far the most frequent (3705 of the 5000 documents) and spans across many of the target categories. LSA is a convenient way to find some structure in a text corpus, but it usually produces topics that differ markedly from the categories a human would have defined.

Reminder: unlike supervised classification and unsupervised clustering, LSA is based on the hypothesis that a given document can be related to several topics. This makes the model's output harder to interpret, but it yields topic models that are more realistic, because in real life a document is often related to several topics.
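The topic table earlier in this solution already hints at this: document 4793 has comparable loadings on topic_1 (0.0907), topic_9 (0.1111), and topic_19 (0.0982), yet the argmax rule keeps only one of them. A minimal sketch of a multi-topic view follows, assuming a hypothetical helper `top_k_topics` and made-up loadings. It returns the k topics with the largest absolute loading for each document.

```python
import numpy as np

# Toy document-topic loadings (illustrative values, not from the dataset).
toy_lsa = np.array([
    [0.09,  0.03, 0.11, -0.07],
    [0.05, -0.01, 0.00,  0.02],
])


def top_k_topics(loadings, k=2):
    """Hypothetical helper: zero-based indices of the k largest |loadings| per document."""
    # Sorting the negated absolute values gives a descending order.
    order = np.argsort(-np.abs(loadings), axis=1)
    return order[:, :k]


toy_top_topics = top_k_topics(toy_lsa, k=2)
print(toy_top_topics)
```

Applied to the real `lsa` array, this would give each document a short list of topics instead of a single label, at the cost of no longer fitting into a one-label confusion matrix.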