To cluster our corpus, we can choose from several algorithms, including non-negative matrix factorization (NMF), sparse principal component analysis (sparse PCA), and latent Dirichlet allocation (LDA). We’ll focus on LDA because it is widely used in the scientific community and has produced good results in social media, medical science, political science, and software engineering.

LDA is a model for unsupervised topic decomposition: it groups texts based on the words they contain and the probability of a word belonging to a certain topic. The LDA algorithm outputs the topic-word distribution. With this information, we can define the main topics based on the words that are most likely associated with them. Once we have identified the main topics and their associated words, we can determine which topic or topics apply to each text.

![](https://bs-uploads.toptal.io/blackfish-uploads/public-files/image_0-259d7a671398a16dc7cdfe05d89d4880.png)

In [12]:
["I like Harry Potter",
"I like Star Wars"]

['I like Harry Potter', 'I like Star Wars']

In [13]:
[[1,1,1,1,0,0],
[1,1,0,0,1,1]]

[[1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 1, 1]]
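
The two cells above appear to show a bag-of-words encoding: each column stands for one distinct word, and each row marks which words a sentence contains. The following is a minimal sketch of that idea in plain Python, assuming the columns follow the order in which words first appear (the names `sentences`, `vocabulary`, and `matrix` are illustrative and not part of the original notebook):

```python
# Bag-of-words sketch: one column per distinct word, in order of first appearance.
sentences = ["I like Harry Potter", "I like Star Wars"]

# Collect each distinct word once, preserving first-appearance order.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# 1 if the word occurs in the sentence, 0 otherwise.
matrix = [[1 if word in sentence.split() else 0 for word in vocabulary]
          for sentence in sentences]

print(vocabulary)  # ['I', 'like', 'Harry', 'Potter', 'Star', 'Wars']
print(matrix)      # [[1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 1, 1]]
```

The first two columns ("I" and "like") are shared by both sentences, while the remaining columns separate them.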

![](https://bs-uploads.toptal.io/blackfish-uploads/public-files/image_1-0e8ed3e4c4e3de798d821211ae2c0537.png)

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from nltk.corpus import stopwords

Consider the following corpus composed of five short sentences (all taken from New York Times headlines). The algorithm should clearly identify one topic related to politics and coronavirus, and a second one related to Nadal and tennis.

In [5]:
corpus = ["Rafael Nadal Joins Roger Federer in Missing U.S. Open",
          "Rafael Nadal Is Out of the Australian Open",
          "Biden Announces Virus Measures",
          "Biden's Virus Plans Meet Reality",
          "Where Biden's Virus Plan Stands"]

For best results, it’s necessary to use multiple preprocessing techniques. Here are some of the most frequently used:

- Lowercase letters. Make all words lowercase. The meaning of a word does not change regardless of its position in the sentence.
- n-grams. Consider all groups of n words in a row as new terms, called n-grams. This way, cases such as “white house” will be taken into account and added to the vocabulary list.
- Stemming. Identify the prefixes and suffixes of words to isolate them from their root. This way, words like “play,” “played,” or “player” are represented by the word “play.” Stemming can be useful to reduce the number of words in the vocabulary list while preserving their meaning, but it slows preprocessing considerably because it must be applied to each word in the corpus.
- Stop words. Do not take into account groups of words lacking in meaning or utility. These include articles and prepositions but may also include words that are not useful for our specific case study, such as certain common verbs.
- Term frequency–inverse document frequency (tf–idf). Use the coefficient of tf–idf instead of noting the frequency of each word within each cell of the matrix. It consists of two numbers, multiplied:

  - tf—the frequency of a given term or word in a text, and
  - idf—the logarithm of the total number of documents divided by the number of documents that contain that given term.

  tf–idf measures how characteristic a word is of a given text: it is high for words that appear often in that text but rarely in the rest of the corpus. To be able to subdivide words into groups, it is important to understand not only which words appear in each text, but also which words appear frequently in one text but not at all in others.
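
The tf–idf definition above can be worked through by hand. The following is a minimal sketch of the textbook formula only, using hypothetical tokenized documents loosely based on the headlines above; the helper names `tf`, `idf`, and `tf_idf` are illustrative. Note that scikit-learn's `TfidfTransformer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

# Hypothetical tokenized documents, loosely based on the headlines above.
docs = [
    ["biden", "announces", "virus", "measures"],
    ["biden", "virus", "plans", "meet", "reality"],
    ["nadal", "out", "of", "australian", "open"],
]

def tf(term, doc):
    # Raw count of the term in one document.
    return doc.count(term)

def idf(term, docs):
    # log(total documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("biden", docs[0], docs))      # term in 2 of 3 documents: log(3/2)
print(tf_idf("announces", docs[0], docs))  # term in 1 of 3 documents: log(3)
```

The rarer word ("announces") receives the larger weight, which is the behavior the idf factor is meant to produce.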


Using CountVectorizer(), we generate the matrix that denotes the frequency of each word in each text. Note that CountVectorizer allows for preprocessing if you include parameters such as stop_words to remove stop words, ngram_range to include n-grams, or lowercase=True to convert all characters to lowercase.

In [6]:
count_vect = CountVectorizer(stop_words=stopwords.words('english'), lowercase=True)
x_counts = count_vect.fit_transform(corpus)
x_counts.todense()

matrix([[0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1]])

In [8]:
count_vect.get_feature_names_out()

array(['announces', 'australian', 'biden', 'federer', 'joins', 'measures',
       'meet', 'missing', 'nadal', 'open', 'plan', 'plans', 'rafael',
       'reality', 'roger', 'stands', 'virus'], dtype=object)

In [9]:
tfidf_transformer = TfidfTransformer()
x_tfidf = tfidf_transformer.fit_transform(x_counts)

To perform the LDA decomposition, we have to define the number of topics. In this simple case, we know there are two topics or “dimensions.” In general, though, the number of topics is a hyperparameter that needs tuning, which can be done with techniques like random search or grid search:

In [10]:
dimension = 2
lda = LDA(n_components = dimension)
lda_array = lda.fit_transform(x_tfidf)
lda_array

array([[0.15689346, 0.84310654],
       [0.27535307, 0.72464693],
       [0.81253354, 0.18746646],
       [0.82590169, 0.17409831],
       [0.32353416, 0.67646584]])

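LDA is a probabilistic method. Here we can see the probability of each of the five headlines belonging to each of the two topics. In this run, the first two headlines (about Nadal) are most likely to belong to the second topic, and the third and fourth (about Biden) to the first. The fifth headline, also about Biden, leans toward the second topic, a reminder that LDA is unreliable on such a tiny corpus and that results vary between runs unless random_state is fixed.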
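
To turn these probabilities into a single topic per headline, we can pick the most probable topic in each row. The following is a minimal sketch that copies the values printed above into a NumPy array; the name `assignments` is illustrative, and topic indices are zero-based, so 0 refers to the first column:

```python
import numpy as np

# Document-topic probabilities copied from the output above (one row per headline).
lda_array = np.array([[0.15689346, 0.84310654],
                      [0.27535307, 0.72464693],
                      [0.81253354, 0.18746646],
                      [0.82590169, 0.17409831],
                      [0.32353416, 0.67646584]])

# Index of the most probable topic for each headline (0 = first column).
assignments = lda_array.argmax(axis=1)
print(assignments)  # [1 1 0 0 1]
```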

Finally, if we want to understand what these two topics are about, we can see the most important words in each topic:

In [11]:
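# One row of word weights per topic.
components = list(lda.components_)
# get_feature_names() was removed in scikit-learn 1.2; convert to a list so .index() is available.
features = list(count_vect.get_feature_names_out())
# For each topic, keep the three words with the highest weights.
important_words = [sorted(features, key=lambda x: components[j][features.index(x)], reverse=True)[:3] for j in range(len(components))]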
important_words

[['virus', 'biden', 'announces'], ['rafael', 'open', 'nadal']]

As expected, LDA assigned words related to Biden and the virus to the first topic and words related to tennis tournaments and Nadal to the second topic.
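
When the number of topics is not known in advance, it can be tuned with a grid search, as mentioned earlier. The following is a hedged sketch rather than a recommended setup: it assumes scikit-learn's `GridSearchCV`, which falls back on the estimator's own `score` method (an approximate log-likelihood for LDA) when no scoring function is given. The candidate values, `cv=2`, `random_state=0`, and the use of the built-in `'english'` stop word list are all arbitrary choices for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

corpus = ["Rafael Nadal Joins Roger Federer in Missing U.S. Open",
          "Rafael Nadal Is Out of the Australian Open",
          "Biden Announces Virus Measures",
          "Biden's Virus Plans Meet Reality",
          "Where Biden's Virus Plan Stands"]

# Word counts for the five headlines.
counts = CountVectorizer(stop_words="english").fit_transform(corpus)

# Try several topic counts; each candidate is scored on held-out headlines
# with LDA's approximate log-likelihood.
search = GridSearchCV(
    LatentDirichletAllocation(random_state=0),
    param_grid={"n_components": [2, 3, 4]},
    cv=2,
)
search.fit(counts)

best_k = search.best_params_["n_components"]
print(best_k)
```

On a corpus of five headlines the scores carry little meaning, so the selected value should not be trusted here; the same pattern becomes more useful on a corpus with hundreds or thousands of documents.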

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [17]:
# Under the hood: lowercasing, removing special characters, and removing stop words.
vectorizer = TfidfVectorizer(stop_words='english', smooth_idf=True)
input_matrix = vectorizer.fit_transform(corpus).todense()

In [20]:
input_matrix

matrix([[0.        , 0.        , 0.        , 0.40986539, 0.40986539,
         0.        , 0.        , 0.40986539, 0.33067681, 0.33067681,
         0.        , 0.        , 0.33067681, 0.        , 0.40986539,
         0.        , 0.        ],
        [0.        , 0.5819515 , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.4695148 , 0.4695148 ,
         0.        , 0.        , 0.4695148 , 0.        , 0.        ,
         0.        , 0.        ],
        [0.58752141, 0.        , 0.39346994, 0.        , 0.        ,
         0.58752141, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.39346994],
        [0.        , 0.        , 0.33925099, 0.        , 0.        ,
         0.        , 0.50656277, 0.        , 0.        , 0.        ,
         0.        , 0.50656277, 0.        , 0.50656277, 0.        ,
         0.        , 0.33925099],
        [0.        , 0.        , 0.3

In [24]:
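import numpy as np

svd_modeling = TruncatedSVD(n_components=2, algorithm='randomized', n_iter=100, random_state=122)
# Recent scikit-learn versions reject np.matrix input, so convert to a plain array first.
svd_modeling.fit(np.asarray(input_matrix))
components = svd_modeling.components_
vocab = vectorizer.get_feature_names_out()
vocab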

array(['announces', 'australian', 'biden', 'federer', 'joins', 'measures',
       'meet', 'missing', 'nadal', 'open', 'plan', 'plans', 'rafael',
       'reality', 'roger', 'stands', 'virus'], dtype=object)

In [30]:
def get_topics(components, n_terms=7):
    # Build one string per topic holding its n_terms highest-weighted words.
    topic_word_list = []
    for comp in components:
        terms_comp = zip(vocab, comp)
        sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:n_terms]
        # Append once per topic, after all of its terms have been collected.
        topic_word_list.append(" ".join(term for term, _ in sorted_terms))
    return topic_word_list

topic_word_list = get_topics(components)
topic_word_list

['biden virus measures announces plan stands meet',
 'nadal open rafael australian federer joins missing']

In [None]:
!pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Draw one word cloud per topic.
for i, topic in enumerate(topic_word_list):
    wc = WordCloud(width=1000, height=600, margin=3, prefer_horizontal=0.7, scale=1, background_color='black', relative_scaling=0).generate(topic)
    plt.imshow(wc)
    plt.title(f"Topic {i+1}")
    plt.axis("off")
    plt.show()
