# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics in a dataset of news headlines.

Begin by importing the needed libraries:

In [2]:
# TODO: import needed libraries
import nltk
import numpy as np
import pandas as pd

Load the data from the file `random_headlines.csv`:

In [4]:
# TODO: load the dataset
df = pd.read_csv("random_headlines.csv")
print(df.shape)
df.head(5)

(20000, 2)


| | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset before modelling it.

In [3]:
# TODO: Perform a short EDA
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
publish_date     20000 non-null int64
headline_text    20000 non-null object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
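
`df.info()` mainly confirms that there are no missing values. A few more checks can be informative before preprocessing. The sketch below uses a tiny hypothetical frame with the same two columns (three headlines taken from the sample rows, plus one duplicate added for illustration) and assumes `publish_date` is a `YYYYMMDD` integer, as it appears in the sample.

```python
import pandas as pd

# Hypothetical stand-in for random_headlines.csv (same two columns).
sample = pd.DataFrame({
    "publish_date": [20120305, 20081128, 20090325, 20090325],
    "headline_text": [
        "ute driver hurt in intersection crash",
        "6yo dies in cycling accident",
        "bumper olive harvest expected",
        "bumper olive harvest expected",  # duplicate added for illustration
    ],
})

# Parse the integer dates so that headlines can be counted per year.
dates = pd.to_datetime(sample["publish_date"].astype(str), format="%Y%m%d")
per_year = dates.dt.year.value_counts().sort_index()

# Headline length in words, and number of exact duplicate headlines.
n_words = sample["headline_text"].str.split().str.len()
n_duplicates = int(sample["headline_text"].duplicated().sum())

print(per_year)
print(n_words.describe())
print("duplicates:", n_duplicates)
```

On the real data, the same three lines of analysis can be applied to `df` directly.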


Now perform all the needed preprocessing on those headlines: lower-casing, tokenization, punctuation removal, stopword removal, and stemming or lemmatization.

In [10]:
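# TODO: Preprocess the input data
# Assumes the NLTK tokenizer and stopword data are already downloaded (see nltk.download)
# Lower-case and tokenize each headline
df['tokens'] = df['headline_text'].apply(lambda row: nltk.word_tokenize(row.lower()))
# Drop punctuation tokens; isalnum() must be called, and it keeps tokens such as "6yo"
df['alphanumeric'] = df['tokens'].apply(lambda row: [word for word in row if word.isalnum()])
# Remove English stopwords (the corpus file id is lower-case)
stop = set(nltk.corpus.stopwords.words('english'))
df['nostop'] = df['alphanumeric'].apply(lambda row: [word for word in row if word not in stop])
# Stem the remaining tokens with the Porter stemmer
stemmer = nltk.PorterStemmer()
df['stemmed'] = df['nostop'].apply(lambda row: [stemmer.stem(word) for word in row])
df['stemmed'].head()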

0    [ute, driver, hurt, intersect, crash]
1                  [6yo, die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stemmed, dtype: object
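
The preprocessing steps can be traced on a single headline. The sketch below uses only the standard library: a regular expression stands in for `nltk.word_tokenize`, `TINY_STOPWORDS` is a hypothetical three-word list rather than NLTK's English list, and stemming is left out, so its output is not identical to the `stemmed` column.

```python
import re

TINY_STOPWORDS = {"in", "the", "of"}  # hypothetical; NLTK's list is much longer

def preprocess(headline):
    """Lower-case, tokenize, drop punctuation tokens, drop stopwords."""
    # Words become one token each; every punctuation mark becomes its own token.
    tokens = re.findall(r"\w+|[^\w\s]", headline.lower())
    # str.isalnum must be *called*: a bare `tok.isalnum` is a method object,
    # which is always truthy and would filter nothing.
    alnum = [tok for tok in tokens if tok.isalnum()]
    return [tok for tok in alnum if tok not in TINY_STOPWORDS]

print(preprocess("6yo dies in Cycling accident!"))
# ['6yo', 'dies', 'cycling', 'accident']
```

Filtering with `isalpha()` instead would also drop `6yo`, since that token contains a digit.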

In [12]:
!pip install --upgrade gensim


Now use Gensim to compute a bag-of-words (BOW) representation of the headlines.

In [13]:
# TODO: Compute the BOW using Gensim
from gensim.corpora import Dictionary
dictionary = Dictionary(df['stemmed'])
corpus = [dictionary.doc2bow(line) for line in df['stemmed']]
# print(np.shape(corpus))
corpus[0:2]


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1), (8, 1)]]
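
Each document in `corpus` is a list of `(token_id, count)` pairs. A minimal pure-Python sketch of the same idea follows. It assumes ids are handed out in order of first appearance, which is a simplification (Gensim's `Dictionary` may assign ids differently), and the third document is hypothetical, added to show a count greater than one.

```python
from collections import Counter

docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["6yo", "die", "cycl", "accid"],
    ["man", "charg", "man"],  # hypothetical document with a repeated token
]

# Build the vocabulary: one integer id per distinct token.
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    """Return the sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

bow = [to_bow(doc) for doc in docs]
print(bow[2])  # [(9, 2), (10, 1)]
```

Word order is discarded: only which tokens occur, and how often, survives in the representation.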

Compute the TF-IDF weights using Gensim.

In [15]:
# TODO: Compute TF-IDF
from gensim.models import TfidfModel
tfidf_model = TfidfModel(corpus)
tf_idf = tfidf_model[corpus]
# print(np.shape(tf_idf))
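
With its default settings, `TfidfModel` multiplies each term count by `log2(N / df)`, where `N` is the number of documents and `df` the number of documents containing the term, and then scales each document vector to unit Euclidean length. The sketch below reproduces that arithmetic on a hypothetical toy corpus in plain Python. It illustrates the formula under those default-setting assumptions and is not Gensim's implementation.

```python
import math

# Hypothetical BOW corpus: lists of (term_id, count) pairs.
toy_corpus = [
    [(0, 1), (1, 2)],  # doc 0: term 0 once, term 1 twice
    [(0, 1), (2, 1)],  # doc 1
    [(2, 3)],          # doc 2
]
n_docs = len(toy_corpus)

# Document frequency: in how many documents each term id occurs.
doc_freq = {}
for doc in toy_corpus:
    for term_id, _ in doc:
        doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

def tfidf(doc):
    """Weight counts by log2(N / df), then L2-normalize the vector."""
    weights = [(t, tf * math.log2(n_docs / doc_freq[t])) for t, tf in doc]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(t, w / norm) for t, w in weights] if norm else weights

for doc in toy_corpus:
    print([(t, round(w, 3)) for t, w in tfidf(doc)])
```

In document 0, term 1 ends up with a much larger weight than term 0, because it is both more frequent in that document and rarer across the corpus.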

Finally, compute the **LSA** (also called LSI) using Gensim, for a number of topics of your choosing.

In [16]:
# TODO: Compute LSA
from gensim.models import LsiModel
# Fitted on the raw BOW counts; pass corpus=tf_idf to use the TF-IDF weights instead
Lsa = LsiModel(corpus=corpus, num_topics=4, id2word=dictionary)


For each topic, show the most significant words.

In [19]:
# TODO: Print the 3 or 4 most significant words of each topic
Lsa.print_topics(num_words = 3)

[(0, '0.751*"polic" + 0.404*"man" + 0.208*"charg"'),
 (1, '0.670*"man" + -0.575*"polic" + 0.327*"charg"'),
 (2, '-0.653*"new" + -0.297*"plan" + 0.243*"man"'),
 (3, '-0.704*"new" + 0.341*"say" + 0.333*"plan"')]

What do you think about those results?
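
When interpreting them, it helps to remember that LSA is a truncated singular value decomposition (SVD) of the term-document matrix, which is why topic weights can be negative. The NumPy sketch below shows the mechanics on a hypothetical matrix of four terms and four documents. Gensim's `LsiModel` uses a more scalable algorithm, so this illustrates the idea rather than its implementation.

```python
import numpy as np

terms = ["polic", "man", "new", "plan"]
# Rows are terms, columns are documents (hypothetical counts).
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
], dtype=float)

# Columns of U are the topics, expressed as weights over the terms.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of topics to keep
for topic in range(k):
    order = np.argsort(-np.abs(U[:, topic]))  # most significant terms first
    desc = " + ".join(f'{U[i, topic]:.3f}*"{terms[i]}"' for i in order[:2])
    print(topic, desc)

# Rank-k approximation of the original matrix from the k strongest topics.
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
```

The sign of each singular vector is arbitrary: flipping the sign of a whole topic gives an equally valid decomposition, so only the relative signs within a topic carry meaning.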

Now let's try LDA instead of LSA, again using Gensim.

In [20]:
# TODO: Compute LDA
from gensim.models import LdaModel
# LDA training is randomized; pass random_state=<int> for repeatable topics
Lda = LdaModel(corpus=corpus, num_topics=4, id2word=dictionary)

In [26]:
# TODO: print the most frequent words of each topic
Lda.print_topics(num_words = 3)

[(0, '0.010*"govt" + 0.009*"council" + 0.009*"plan"'),
 (1, '0.015*"polic" + 0.006*"fire" + 0.006*"man"'),
 (2, '0.012*"interview" + 0.007*"group" + 0.005*"work"'),
 (3, '0.011*"man" + 0.011*"charg" + 0.009*"polic"')]

How well does LDA work by comparison?

Let's make some visualization of the LDA results using pyLDAvis.

In [28]:
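# TODO: show visualization results of the LDA
import pyLDAvis
try:
    import pyLDAvis.gensim_models as gensimvis  # module name in pyLDAvis 3.3 and later
except ImportError:
    import pyLDAvis.gensim as gensimvis  # module name in older pyLDAvis releases
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(Lda, corpus, dictionary)
vis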

Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, and so on), and compare your results with those of others.