# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics of a dataset of news headlines.

Begin by importing the needed libraries:

In [55]:
# TODO: import needed libraries
import pandas as pd

Load the data in the file `random_headlines.csv`

In [56]:
# TODO: load the dataset
df = pd.read_csv("random_headlines.csv")

print(df.shape)
df.head()


(20000, 2)


|   | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset before modelling it.

In [57]:
# TODO: Perform a short EDA
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
None

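For text data, a short EDA can go a little further than `df.info()`. The sketch below assumes the two columns shown above and uses only the five sample rows printed earlier, not the full file. It checks for duplicated headlines, measures headline length in words, and parses `publish_date` into a real date. The names `sample`, `n_words`, and `date` are illustrative.

```python
import pandas as pd

# Five rows copied from the df.head() output above; the notebook itself would use df.
sample = pd.DataFrame({
    "publish_date": [20120305, 20081128, 20090325, 20100201, 20080225],
    "headline_text": [
        "ute driver hurt in intersection crash",
        "6yo dies in cycling accident",
        "bumper olive harvest expected",
        "replica replaces northernmost sign",
        "woods targets perfect season",
    ],
})

# Duplicated headlines would inflate some word counts later on.
n_duplicates = int(sample["headline_text"].duplicated().sum())

# Headline length in whitespace-separated words.
sample["n_words"] = sample["headline_text"].str.split().str.len()

# publish_date is stored as an integer such as 20120305 (YYYYMMDD).
sample["date"] = pd.to_datetime(sample["publish_date"].astype(str), format="%Y%m%d")

print(n_duplicates)
print(sample["n_words"].describe())
print(sample["date"].dt.year.value_counts().sort_index())
```

On the full dataset, the same three checks give a quick feel for how short the documents are, which matters because very short texts give topic models little co-occurrence information to work with.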

Now perform all the needed preprocessing on those headlines: lowercasing, tokenization, punctuation removal, stopword removal, and stemming or lemmatization.

In [58]:
# TODO: Preprocess the input data
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# The NLTK tokenizer and stopword data must be downloaded once with nltk.download().
stop_words = set(stopwords.words("english"))  # set membership tests are faster than list ones
stemmer = PorterStemmer()  # stemming is used here; lemmatization would be an alternative

def clean_data(quote):
    """Lowercase, tokenize, drop non-alphabetic tokens and stopwords, then stem."""
    tokens = word_tokenize(quote.lower())
    # Keep purely alphabetic tokens: drops punctuation and tokens with digits such as "6yo".
    alpha_tokens = [t for t in tokens if t.isalpha()]
    content_tokens = [t for t in alpha_tokens if t not in stop_words]
    return [stemmer.stem(t) for t in content_tokens]

df["token"] = df["headline_text"].apply(clean_data)
print(df.head())

   publish_date                          headline_text  \
0      20120305  ute driver hurt in intersection crash   
1      20081128           6yo dies in cycling accident   
2      20090325          bumper olive harvest expected   
3      20100201     replica replaces northernmost sign   
4      20080225           woods targets perfect season   

                                   token  
0  [ute, driver, hurt, intersect, crash]  
1                     [die, cycl, accid]  
2        [bumper, oliv, harvest, expect]  
3  [replica, replac, northernmost, sign]  
4        [wood, target, perfect, season]  


Now use Gensim to compute a bag-of-words (BOW) representation of each headline:

In [59]:
# TODO: Compute the BOW using Gensim
from gensim.corpora import Dictionary

dictionary = Dictionary(df["token"])
bow_corpus = [dictionary.doc2bow(doc) for doc in df["token"]]

print(len(bow_corpus))
print(bow_corpus[:2])

20000
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]

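Conceptually, `doc2bow` replaces each token with an integer id and counts how often each id occurs in the document. Here is a minimal plain-Python sketch of that idea, using the first two token lists shown earlier. The helper `to_bow` and the first-seen id assignment are assumptions for illustration; Gensim's own id assignment may differ.

```python
from collections import Counter

# Token lists copied from the first two rows of the "token" column above.
docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["die", "cycl", "accid"],
]

# Assign each new token the next free integer id (illustrative ordering).
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

bow = [to_bow(doc) for doc in docs]
print(bow)
```

Headlines are so short that almost every count is 1, so the representation mostly records which words are present rather than how often.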

Compute the TF-IDF using Gensim

In [60]:
# TODO: Compute TF-IDF
from gensim.models import TfidfModel

tfidf_model = TfidfModel(bow_corpus)
tfidf_corpus = tfidf_model[bow_corpus]

print(len(tfidf_corpus))
print(tfidf_corpus[:1])


20000
<gensim.interfaces.TransformedCorpus object at 0x000002300FF3A550>

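Printing a slice of `tfidf_corpus` only shows a lazy wrapper object, because Gensim applies the transformation on the fly. Applying the model to a single document, for example `tfidf_model[bow_corpus[0]]`, returns the actual `(id, weight)` pairs. The weighting itself can be sketched by hand. The code below assumes the scheme described as the default in Gensim's documentation (raw term frequency times `log2(N / df)`, followed by L2 normalization); treat the exact formula as an assumption. The three-document toy corpus is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-stemmed tokens.
docs = [["man", "charg"], ["man", "court"], ["weather"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df_counts = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized tf * log2(N / df) weights for one document."""
    tf = Counter(doc)
    raw = {t: tf[t] * math.log2(n_docs / df_counts[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw

weights = [tfidf(doc) for doc in docs]
for doc_weights in weights:
    print({t: round(w, 3) for t, w in doc_weights.items()})
```

In the first document the rarer term `charg` receives a larger weight than `man`, which appears in two of the three documents.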

Finally, compute the **LSA** (also called LSI) using Gensim, for a number of topics that you choose yourself.

In [61]:
# TODO: Compute LSA
from gensim.models import LsiModel

num_topics = 10  
lsi_model = LsiModel(tfidf_corpus, num_topics=num_topics, id2word=dictionary)
lsi_corpus = lsi_model[tfidf_corpus]

print(len(lsi_corpus))
print(lsi_corpus[:1])

20000
<gensim.interfaces.TransformedCorpus object at 0x000002300FF3A400>

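LSA is, in essence, a truncated singular value decomposition (SVD) of the term-document matrix. The NumPy sketch below uses a small hypothetical matrix (terms as rows, documents as columns) rather than the real corpus, and keeps `k = 2` components. All names and values are illustrative.

```python
import numpy as np

# Hypothetical term-document weights: 4 terms (rows) x 4 documents (columns).
terms = ["polic", "charg", "weather", "rain"]
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.6],
    [0.0, 0.0, 0.3, 0.9],
])

# Full (thin) SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of topics to keep
topic_term = U[:, :k].T                  # shape (k, n_terms): term loadings per topic
doc_topic = (np.diag(s[:k]) @ Vt[:k]).T  # shape (n_docs, k): documents in topic space

# Show the two terms with the largest absolute loading for each topic.
for i, row in enumerate(topic_term):
    top = np.argsort(-np.abs(row))[:2]
    print(i, [(terms[j], round(float(row[j]), 3)) for j in top])
```

Singular vectors are only defined up to sign, so a topic's loadings can be negative. This is why LSA topics are usually read by the absolute size of the weights.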

For each topic, show the most significant words.

In [62]:
# TODO: Print the 3 or 4 most significant words of each topic

for i in range(num_topics):
    topic = lsi_model.show_topic(i, topn=4)
    print(f"{topic}")

[('man', 0.4550001121212468), ('polic', 0.387733200896211), ('charg', 0.32172455420856033), ('court', 0.14561801867663987)]
[('second', -0.4007561954703861), ('abc', -0.3396814439012171), ('news', -0.3245184561427584), ('man', 0.3034497303426008)]
[('second', 0.3658808368800599), ('man', 0.31999055683957156), ('abc', 0.290042302041209), ('news', 0.2603194617487637)]
[('polic', -0.7706380089932393), ('man', 0.2288412006547506), ('charg', 0.21809714226835972), ('investig', -0.15008879927102062)]
[('kill', -0.3660580916859539), ('crash', -0.3352594029426541), ('fire', -0.20952207110990656), ('charg', 0.20434912253248508)]
[('news', -0.41893317319425427), ('rural', -0.3825007264418176), ('nation', -0.3549991209618057), ('weather', 0.2983474239045055)]
[('new', 0.7457785482586742), ('abc', -0.2652892358866868), ('second', 0.21831072603324336), ('plan', -0.2135975833206534)]
[('interview', 0.8948183990576116), ('plan', -0.11430823023186251), ('court', 0.10928992564439571), ('michael', 0.0980

What do you think about those results?

Looking at the output, the topics loosely correspond to news categories such as crime, accidents, and weather. The most significant words for the first topic are "man", "polic", "charg", and "court", which suggests news about crime or law enforcement. Similarly, the most significant words for the sixth topic (index 5) are "news", "rural", "nation", and "weather", which suggests rural or weather news. The topics overlap considerably, however: "man", "abc", "news", and "second" recur across several topics, and the mixed signs of the weights make individual topics harder to read.

Now let's try LDA instead of LSA, again using Gensim.

In [63]:
# TODO: Compute LDA
from gensim.models import LdaModel

num_topics = 10 
lda_model = LdaModel(tfidf_corpus, num_topics=num_topics, id2word=dictionary, passes=10)


In [64]:
# TODO: print the most frequent words of each topic

for i in range(num_topics):
    topic = lda_model.show_topic(i, topn=10)
    words = ", ".join([word for word, _ in topic])
    print(f"[{words}]")

[health, budget, rate, cut, fund, market, announc, hospit, futur, worker]
[resid, region, industri, miner, world, cup, tourism, big, loss, demand]
[charg, murder, man, interview, polic, gold, court, jail, woman, teen]
[second, rural, news, alleg, weather, iraq, us, nation, rail, author]
[found, dead, bodi, hunt, clear, defenc, land, hold, despit, review]
[countri, hour, commun, melbourn, concern, east, bid, blaze, violenc, farm]
[polic, water, ban, safeti, shoot, sport, hit, abc, run, probe]
[miss, assault, public, storm, still, adelaid, approv, perth, test, tour]
[opposit, govt, studi, closer, act, claim, poll, dump, sex, defend]
[busi, power, chang, fatal, accid, arrest, live, two, tax, brisban]


Now, how does it work with LDA?

The topics again loosely resemble news categories such as health and the economy, crime, and politics. The most frequent words for topic 0 are "health", "budget", "rate", and "cut", which suggests news about health funding and the economy. The third topic (index 2) contains "charg", "murder", "polic", "court", and "jail", which suggests crime and court reporting, and the ninth topic (index 8) contains "opposit", "govt", and "poll", which suggests politics. Other topics mix unrelated words and are harder to label. Note that LDA is randomly initialized, so the topics change from run to run unless `random_state` is fixed.

In comparison to LSA, LDA is a probabilistic model that assumes each document in the corpus is a mixture of several topics, with each topic being a probability distribution over a fixed vocabulary of words. LDA infers the topics of the corpus by iteratively estimating the topic distribution for each document and the word distribution for each topic. Because the resulting weights are non-negative probabilities rather than signed loadings, LDA topics are often easier to interpret than LSA topics, although this is not guaranteed, especially on very short texts such as headlines.

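The generative story that LDA assumes can be simulated directly, which may make the description above more concrete. This NumPy sketch is not Gensim's inference procedure; it only draws one synthetic document from hypothetical topic and word distributions. The vocabulary, the Dirichlet parameters `alpha` and `eta`, and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["polic", "charg", "court", "weather", "rain", "storm"]
n_topics, doc_len = 2, 8
alpha, eta = 0.5, 0.5  # hypothetical Dirichlet concentration parameters

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet([eta] * len(vocab), size=n_topics)  # shape (n_topics, V)

# Each document is a probability distribution over topics.
doc_topic = rng.dirichlet([alpha] * n_topics)                  # shape (n_topics,)

# Generate the document word by word: pick a topic, then a word from that topic.
words = []
for _ in range(doc_len):
    z = rng.choice(n_topics, p=doc_topic)
    w = rng.choice(len(vocab), p=topic_word[z])
    words.append(vocab[w])

print(np.round(doc_topic, 3))
print(words)
```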
Let's visualize the LDA results using pyLDAvis.

In [None]:
# TODO: show visualization results of the LDA
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

lda_vis = gensimvis.prepare(lda_model, tfidf_corpus, dictionary)

pyLDAvis.display(lda_vis)

Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, and so on) and compare your results with those of others.