# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics of a dataset of news headlines.

Begin by importing the needed libraries:

In [7]:
# TODO: import needed libraries
import nltk
import numpy as np
import pandas as pd

Load the data from the file `random_headlines.csv`:

In [8]:
# TODO: load the dataset
df=pd.read_csv('random_headlines.csv')
print(df.shape)
df.head()

(20000, 2)


|   | publish_date | headline_text |
|---|--------------|---------------|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset before modelling it.

In [11]:
# TODO: Perform a short EDA
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
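Beyond `df.info()`, a short EDA could also look at headline lengths, the date range, and duplicates. The sketch below is only an illustration: it rebuilds the five rows shown by `df.head()` in a hypothetical `sample` frame so that it runs on its own, and the column names `date` and `n_words` are introduced here, not part of the dataset.

```python
import pandas as pd

# Illustrative sample: the five rows shown by df.head() earlier in the notebook.
sample = pd.DataFrame({
    "publish_date": [20120305, 20081128, 20090325, 20100201, 20080225],
    "headline_text": [
        "ute driver hurt in intersection crash",
        "6yo dies in cycling accident",
        "bumper olive harvest expected",
        "replica replaces northernmost sign",
        "woods targets perfect season",
    ],
})

# publish_date is stored as an integer such as 20120305; parse it as a date.
sample["date"] = pd.to_datetime(sample["publish_date"].astype(str), format="%Y%m%d")

# Number of whitespace-separated words per headline.
sample["n_words"] = sample["headline_text"].str.split().str.len()

print(sample["n_words"].describe())
print(sample["date"].dt.year.value_counts().sort_index())
print("duplicated headlines:", sample["headline_text"].duplicated().sum())
```

The same three checks can be run on the full `df` by replacing `sample` with `df`.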


Now perform all the needed preprocessing on those headlines: lowercasing, tokenization, punctuation removal, stop-word removal, and stemming or lemmatization.

In [13]:
# TODO: Preprocess the input data
import string
# If NLTK resources are missing, download them first,
# e.g. nltk.download('punkt') and nltk.download('stopwords')
# Build the stop-word set once instead of reloading the list for every token
stop_words = set(nltk.corpus.stopwords.words('english'))
# Tokenize
df['headline_text'] = df['headline_text'].apply(nltk.word_tokenize)
# Remove punctuation tokens
df['headline_text'] = df['headline_text'].apply(lambda tokens: [token for token in tokens if token not in string.punctuation])
# Remove stop words (case-insensitive match)
df['headline_text'] = df['headline_text'].apply(lambda tokens: [token for token in tokens if token.lower() not in stop_words])
# Stem each remaining token
stemmer = nltk.stem.PorterStemmer()
df['headline_text'] = df['headline_text'].apply(lambda tokens: [stemmer.stem(token) for token in tokens])
df.head()

|   | publish_date | headline_text |
|---|--------------|---------------|
| 0 | 20120305 | [ute, driver, hurt, intersect, crash] |
| 1 | 20081128 | [6yo, die, cycl, accid] |
| 2 | 20090325 | [bumper, oliv, harvest, expect] |
| 3 | 20100201 | [replica, replac, northernmost, sign] |
| 4 | 20080225 | [wood, target, perfect, season] |
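The same steps can also be wrapped in a single function, which makes them easier to reuse and test. This is a minimal sketch under stated assumptions, not the cell's actual code: it swaps `nltk.word_tokenize` for a simple regular expression and uses a tiny hand-written stop-word set (`STOP_WORDS`, hypothetical) so that it runs without downloading NLTK resources. Only `PorterStemmer` comes from NLTK.

```python
import re

from nltk.stem import PorterStemmer

# Tiny illustrative stop-word list (an assumption of this sketch); the
# notebook uses nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "on", "the", "to"})

# Keep runs of lowercase letters and digits; this also drops punctuation.
TOKEN_RE = re.compile(r"[a-z0-9]+")

stemmer = PorterStemmer()


def preprocess(headline):
    """Lowercase, tokenize, drop stop words, and stem one headline."""
    tokens = TOKEN_RE.findall(headline.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in STOP_WORDS]


print(preprocess("ute driver hurt in intersection crash"))
print(preprocess("6yo dies in cycling accident"))
```

On the full dataset this would be applied with `df['headline_text'].apply(preprocess)` on the raw strings, before any other preprocessing step has run.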


Now use Gensim to compute a bag-of-words (BOW) representation:

In [14]:
# TODO: Compute the BOW using Gensim
from gensim import corpora
dic=corpora.Dictionary(df['headline_text'])
bow=[dic.doc2bow(doc) for doc in df['headline_text']]
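To see what the dictionary and `doc2bow` produce, here is a plain-Python sketch of the same idea: every distinct token gets an integer id, and each document becomes a sorted list of `(token_id, count)` pairs. The names `toy_docs`, `token2id` and `to_bow` are hypothetical, and the order in which ids are assigned is a choice of this sketch that may differ from Gensim's.

```python
from collections import Counter

# Three made-up, already preprocessed documents.
toy_docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["man", "charg", "crash"],
    ["polic", "charg", "man", "man"],
]

# Assign an integer id to each distinct token, in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return the document as a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


toy_bow = [to_bow(doc) for doc in toy_docs]
for row in toy_bow:
    print(row)
```

Word order is discarded: only which tokens occur, and how often, survives in the BOW.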

Compute the TF-IDF using Gensim

In [15]:
# TODO: Compute TF-IDF
from gensim.models import TfidfModel
tfidf_model=TfidfModel(bow)
tfidf=tfidf_model[bow]
print(len(tfidf))
print(tfidf[0])

20000
[(0, 0.30725466582280214), (1, 0.35289437816784547), (2, 0.42129048115131124), (3, 0.5992666854471201), (4, 0.49442279315598586)]
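The weights printed above can be understood from a small hand computation. The sketch below assumes the weighting that `TfidfModel` uses by default as far as I understand it: raw term frequency times `log2(N / df)`, followed by L2 normalisation of each document vector, with zero-weight terms dropped. The names `toy_bow`, `doc_freq` and `tfidf_weights` are hypothetical.

```python
import math

# A made-up BOW corpus: each document is a list of (token_id, count) pairs.
toy_bow = [
    [(0, 1), (1, 1), (2, 1)],
    [(2, 1), (3, 2)],
    [(2, 1), (3, 1), (4, 1)],
]
n_docs = len(toy_bow)

# Document frequency: in how many documents each token id appears.
doc_freq = {}
for doc in toy_bow:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def tfidf_weights(doc):
    """tf * log2(N / df), then scale the document vector to unit L2 norm."""
    raw = [(t, tf * math.log2(n_docs / doc_freq[t])) for t, tf in doc]
    raw = [(t, w) for t, w in raw if w > 0.0]  # tokens found in every document get weight 0
    norm = math.sqrt(sum(w * w for _, w in raw))
    return [(t, w / norm) for t, w in raw] if norm else []


for doc in toy_bow:
    print(tfidf_weights(doc))
```

Token `2` occurs in every document, so its inverse document frequency is zero and it disappears from all three vectors. The unit-norm scaling explains why the five weights of `tfidf[0]` above are all below 1.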


Finally, compute the **LSA** (also called LSI) using Gensim, for a number of topics that you choose yourself.

In [16]:
# TODO: Compute LSA
from gensim.models import LsiModel
lsi = LsiModel(tfidf, id2word=dic, num_topics=5)
print(lsi.print_topics())


[(0, '0.457*"man" + 0.388*"polic" + 0.314*"charg" + 0.149*"court" + 0.141*"murder" + 0.127*"face" + 0.110*"crash" + 0.110*"new" + 0.109*"miss" + 0.102*"death"'), (1, '0.433*"second" + 0.408*"90" + 0.335*"abc" + 0.295*"news" + 0.287*"weather" + 0.240*"busi" + -0.236*"man" + 0.185*"sport" + -0.167*"charg" + 0.104*"plan"'), (2, '0.378*"man" + 0.273*"second" + 0.269*"charg" + 0.262*"90" + -0.212*"plan" + -0.195*"govt" + -0.194*"council" + -0.176*"new" + 0.169*"abc" + 0.159*"weather"'), (3, '-0.770*"polic" + 0.244*"man" + 0.219*"charg" + -0.163*"investig" + -0.147*"probe" + 0.140*"council" + 0.133*"plan" + 0.114*"face" + 0.113*"court" + 0.107*"govt"'), (4, '0.717*"abc" + -0.439*"second" + -0.383*"90" + 0.147*"sport" + 0.137*"market" + 0.124*"entertain" + 0.114*"busi" + 0.099*"weather" + 0.082*"analysi" + -0.074*"council"')]


For each topic, show the most significant words.

In [17]:
# TODO: Print the 3 or 4 most significant words of each topic
print(lsi.show_topics(num_words=4))

[(0, '0.457*"man" + 0.388*"polic" + 0.314*"charg" + 0.149*"court"'), (1, '0.433*"second" + 0.408*"90" + 0.335*"abc" + 0.295*"news"'), (2, '0.378*"man" + 0.273*"second" + 0.269*"charg" + 0.262*"90"'), (3, '-0.770*"polic" + 0.244*"man" + 0.219*"charg" + -0.163*"investig"'), (4, '0.717*"abc" + -0.439*"second" + -0.383*"90" + 0.147*"sport"')]
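The negative weights in these topics come from the way LSA works: it is a truncated singular value decomposition (SVD) of the term-document matrix, and the sign of each singular vector is arbitrary. The sketch below shows the idea with NumPy on a made-up matrix; the terms and the weights in `X` are illustrative, not taken from the dataset, and this is not how Gensim computes the decomposition internally.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); the values
# are made-up weights standing in for TF-IDF scores.
terms = ["polic", "man", "charg", "weather", "abc"]
X = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.7, 0.9, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.1, 0.9, 1.0],
])

# Thin SVD: X = U @ diag(s) @ Vt, with singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Each column of U is one "topic"; rank its terms by absolute weight,
# because the sign of a singular vector carries no meaning on its own.
k = 2
for topic in range(k):
    order = np.argsort(-np.abs(U[:, topic]))[:3]
    print(topic, [(terms[i], round(float(U[i, topic]), 3)) for i in order])
```

When reading the Gensim output, it is therefore the magnitude of a weight, and which words share the same sign within a topic, that matters.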


What do you think about those results?

Now let's try LDA instead of LSA, again using Gensim.

In [18]:
# TODO: Compute LDA
from gensim.models import LdaModel
lda = LdaModel(tfidf, id2word=dic, num_topics=5)
print(lda.print_topics())

[(0, '0.004*"murder" + 0.004*"us" + 0.003*"fall" + 0.003*"court" + 0.003*"guilti" + 0.003*"man" + 0.003*"call" + 0.003*"world" + 0.003*"australia" + 0.002*"media"'), (1, '0.003*"busi" + 0.003*"rate" + 0.003*"plan" + 0.003*"new" + 0.003*"offer" + 0.002*"escap" + 0.002*"defenc" + 0.002*"sale" + 0.002*"council" + 0.002*"critic"'), (2, '0.005*"second" + 0.004*"weather" + 0.004*"polic" + 0.004*"abc" + 0.003*"assault" + 0.003*"90" + 0.003*"man" + 0.003*"review" + 0.003*"news" + 0.003*"sport"'), (3, '0.004*"polic" + 0.004*"interview" + 0.003*"alleg" + 0.003*"man" + 0.003*"arrest" + 0.003*"charg" + 0.003*"start" + 0.003*"action" + 0.003*"say" + 0.003*"hit"'), (4, '0.004*"crash" + 0.004*"polic" + 0.004*"road" + 0.003*"new" + 0.003*"urg" + 0.003*"state" + 0.003*"miss" + 0.003*"nsw" + 0.003*"probe" + 0.003*"govt"')]


In [19]:
# TODO: print the most frequent words of each topic
print(lda.show_topics(num_words=3))

[(0, '0.004*"murder" + 0.004*"us" + 0.003*"fall"'), (1, '0.003*"busi" + 0.003*"rate" + 0.003*"plan"'), (2, '0.005*"second" + 0.004*"weather" + 0.004*"polic"'), (3, '0.004*"polic" + 0.004*"interview" + 0.003*"alleg"'), (4, '0.004*"crash" + 0.004*"polic" + 0.004*"road"')]


How do the LDA topics compare with the LSA ones?

Let's visualize the LDA results using pyLDAvis.

In [20]:
!pip install pyLDAvis --user

Looking in links: /usr/share/pip-wheels


In [21]:
# TODO: show visualization results of the LDA
import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda, tfidf, dic)
vis

Depending on your results, you can try to fine-tune the algorithm: the number of topics, the other hyperparameters, and so on.
Then compare your results with those of others.