# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise models the main topics in a dataset of news headlines.

Begin by importing the needed libraries:

In [103]:
# TODO: import needed libraries

import numpy as np
import pandas as pd



Load the data in the file `random_headlines.csv`

In [104]:
# TODO: load the dataset
# The path is relative to the notebook; adjust it to where the CSV is stored
df = pd.read_csv('random_headlines.csv')
print(df.shape)
df.head()

(20000, 2)


|   | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset...

In [105]:
# TODO: Perform a short EDA
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
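
The `info()` output shows that `publish_date` is stored as an `int64` in `YYYYMMDD` form, so a natural next EDA step is to parse it into real dates and look at headline lengths. The sketch below does this with the standard library only, on the five rows shown by `df.head()`; the helper name `parse_publish_date` is an illustrative assumption, not part of the exercise.

```python
from collections import Counter
from datetime import date, datetime

# The five rows displayed by df.head() above
sample = [
    (20120305, "ute driver hurt in intersection crash"),
    (20081128, "6yo dies in cycling accident"),
    (20090325, "bumper olive harvest expected"),
    (20100201, "replica replaces northernmost sign"),
    (20080225, "woods targets perfect season"),
]


def parse_publish_date(value: int) -> date:
    """Turn an integer such as 20120305 into a datetime.date (hypothetical helper)."""
    return datetime.strptime(str(value), "%Y%m%d").date()


dates = [parse_publish_date(d) for d, _ in sample]

# Number of headlines per publication year
per_year = Counter(d.year for d in dates)

# Number of whitespace-separated words in each headline
word_counts = [len(text.split()) for _, text in sample]

print(sorted(per_year.items()))  # [(2008, 2), (2009, 1), (2010, 1), (2012, 1)]
print(word_counts)               # [6, 5, 4, 4, 4]
```

On the full DataFrame, the same parsing can be done with `pd.to_datetime(df['publish_date'].astype(str), format='%Y%m%d')`.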


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [109]:
# TODO: Preprocess the input data

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# word_tokenize and stopwords need the NLTK 'punkt' and 'stopwords' resources
# (one-time download via nltk.download)

# Lowercase the headlines
df['headline_text'] = df['headline_text'].str.lower()

# Tokenize the headlines
df['tokenized'] = df['headline_text'].apply(word_tokenize)

# Remove punctuation tokens
punct = string.punctuation
df['no_punct'] = df['tokenized'].apply(lambda x: [word for word in x if word not in punct])

# Remove stopwords (a set makes the membership test fast)
stop_words = set(stopwords.words('english'))
df['no_stopwords'] = df['no_punct'].apply(lambda x: [word for word in x if word not in stop_words])

# Stem each remaining token with the Porter stemmer
stemmer = PorterStemmer()
df['stemmed'] = df['no_stopwords'].apply(lambda x: [stemmer.stem(word) for word in x])

print(df['stemmed'].head())



0    [ute, driver, hurt, intersect, crash]
1                  [6yo, die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stemmed, dtype: object
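
One detail of the punctuation step deserves care: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test, not a per-character lookup. The sketch below compares three filters on a made-up token list (the tokens are illustrative, not taken from the dataset).

```python
import string

punct_string = string.punctuation    # one long str of punctuation characters
punct_set = set(string.punctuation)  # the same characters as a set

# Hypothetical tokens chosen to expose the differences
tokens = ["council", ",", "()", "", "plan", "..."]

# Substring test: drops ",", "()" and "" but keeps "..."
kept_with_string = [t for t in tokens if t not in punct_string]

# Set membership: drops only single punctuation characters
kept_with_set = [t for t in tokens if t not in punct_set]

# Stricter rule: drop any token made up entirely of punctuation characters
kept_strict = [t for t in tokens if not all(ch in punct_set for ch in t)]

print(kept_with_string)  # ['council', 'plan', '...']
print(kept_with_set)     # ['council', '()', '', 'plan', '...']
print(kept_strict)       # ['council', 'plan']
```

For short news headlines with little punctuation the three variants will mostly agree, but the stricter rule is the safer default for messier text.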


Now use Gensim to compute a bag-of-words (BOW) representation.

In [113]:
# TODO: Compute the BOW using Gensim

from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.models import LsiModel
from pprint import pprint

# Each document is a list of stemmed tokens
corpus = df['stemmed']
# Map every unique token to an integer id
id2word = Dictionary(corpus)
print(id2word[0])
# Convert each document to a list of (token_id, count) pairs
bow = [id2word.doc2bow(line) for line in corpus]
print(len(bow))
print(bow[:2])


crash
20000
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1), (8, 1)]]
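
To make the `(token_id, count)` pairs above less opaque, here is a minimal pure-Python sketch that reproduces them for the first two stemmed headlines. It only mimics the observable behaviour (new tokens receive ids in sorted order within each document, which is consistent with `crash` getting id 0); it is not Gensim's implementation, and `to_bow` is a hypothetical helper.

```python
from collections import Counter

# The first two stemmed headlines from the preprocessing output
docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["6yo", "die", "cycl", "accid"],
]

token2id = {}


def to_bow(tokens):
    """Return sorted (token_id, count) pairs, registering unseen tokens on the fly."""
    # Unseen tokens get the next free id, in sorted order within the document
    for token in sorted(set(tokens)):
        if token not in token2id:
            token2id[token] = len(token2id)
    counts = Counter(token2id[t] for t in tokens)
    return sorted(counts.items())


bow_sketch = [to_bow(doc) for doc in docs]

print(token2id["crash"])  # 0
print(bow_sketch)
# [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1), (8, 1)]]
```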


Compute the TF-IDF using Gensim

In [83]:
# TODO: Compute TF-IDF

tfidf_model = TfidfModel(bow)
tfidf_corpus = tfidf_model[bow]

print(len(tfidf_corpus))
print(tfidf_corpus)


20000
<gensim.interfaces.TransformedCorpus object at 0x7f7cc8ddba90>
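
Printing the transformed corpus only shows the wrapper object, so it helps to see what the weights look like. The sketch below computes TF-IDF by hand on a made-up three-document corpus. It assumes the weighting that, to the best of my knowledge, matches Gensim's defaults (raw term frequency times `log2(N / df)`, then L2 normalisation); treat that as an assumption and check the `TfidfModel` documentation for your version.

```python
import math
from collections import Counter

# Toy corpus of stemmed tokens (illustrative, not from the dataset)
docs = [
    ["polic", "charg", "man"],
    ["man", "crash"],
    ["council", "plan", "man"],
]

n_docs = len(docs)
# Document frequency: in how many documents each token occurs
doc_freq = Counter(token for doc in docs for token in set(doc))


def tfidf(doc):
    """Return {token: weight} with tf * log2(N / df), L2-normalised."""
    tf = Counter(doc)
    weights = {t: tf[t] * math.log2(n_docs / doc_freq[t]) for t in tf}
    # Tokens present in every document get weight 0 and are dropped
    weights = {t: w for t, w in weights.items() if w > 0}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


for doc in docs:
    print({t: round(w, 3) for t, w in tfidf(doc).items()})
```

In this toy corpus `man` appears in every document, so it carries no information and disappears from every vector. To inspect the real values in the notebook, index a single document, for example `tfidf_model[bow[0]]`.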


Finally, compute the **LSA** (also called LSI) using Gensim, for a number of topics of your choice.

In [114]:
# TODO: Compute LSA

from gensim.models import LsiModel

lsi = LsiModel(tfidf_corpus, id2word=id2word, num_topics=5)
pprint(lsi.print_topics())




[(0,
  '0.458*"man" + 0.390*"polic" + 0.315*"charg" + 0.147*"court" + '
  '0.144*"murder" + 0.129*"face" + 0.116*"crash" + 0.113*"new" + 0.108*"miss" '
  '+ 0.104*"death"'),
 (1,
  '-0.435*"second" + -0.411*"90" + -0.336*"abc" + -0.298*"news" + '
  '-0.293*"weather" + 0.244*"man" + -0.230*"busi" + -0.183*"sport" + '
  '0.161*"charg" + -0.106*"plan"'),
 (2,
  '-0.377*"man" + -0.270*"second" + -0.264*"charg" + -0.258*"90" + '
  '0.217*"plan" + 0.193*"council" + 0.189*"govt" + 0.168*"new" + '
  '-0.168*"weather" + -0.165*"abc"'),
 (3,
  '-0.777*"polic" + 0.239*"man" + 0.215*"charg" + -0.169*"investig" + '
  '-0.146*"probe" + 0.138*"council" + 0.130*"court" + 0.124*"plan" + '
  '0.114*"new" + 0.106*"face"'),
 (4,
  '0.718*"abc" + -0.434*"second" + -0.380*"90" + 0.162*"sport" + '
  '0.147*"market" + 0.123*"entertain" + 0.098*"busi" + 0.089*"weather" + '
  '0.080*"analysi" + 0.066*"news"')]
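
The mix of positive and negative weights above comes from how LSA works: it is a truncated singular value decomposition (SVD) of the term-document matrix, and the sign of each topic vector is arbitrary. The NumPy sketch below illustrates this on a made-up 5x4 matrix; the terms and values are assumptions for illustration, and Gensim's `LsiModel` uses its own SVD implementation.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents (values made up)
terms = ["man", "polic", "charg", "council", "plan"]
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

# Columns of U are topics expressed over terms; S holds the topic strengths
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of topics to keep
for topic in range(k):
    # Rank terms by absolute weight, since the sign carries no meaning on its own
    top = np.argsort(-np.abs(U[:, topic]))[:3]
    print(topic, [(terms[i], round(float(U[i, topic]), 3)) for i in top])

# Flipping the sign of a topic (and of its document coordinates) changes nothing
U_flipped, Vt_flipped = U.copy(), Vt.copy()
U_flipped[:, 0] *= -1
Vt_flipped[0, :] *= -1
same = np.allclose(U @ np.diag(S) @ Vt, U_flipped @ np.diag(S) @ Vt_flipped)
print(same)  # True
```

When reading LSA topics, what matters is therefore which words share a sign and which words oppose each other within a topic, not whether a weight is positive or negative.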


For each topic, show the most significant words.

What do you think about those results?

In [86]:
# TODO: Print the 3 or 4 most significant words of each topic
# num_topics=4 limits the output to the first four of the five LSA topics
topics = lsi.print_topics(num_topics=4, num_words=3)
for topic_id, topic_str in topics:
    # Each term looks like '0.456*"man"'; split it into weight and word
    terms = [term.split('*') for term in topic_str.split(' + ')]
    formatted = ['{}*{}'.format(round(float(weight), 3), word) for weight, word in terms]
    print('({}, {})'.format(topic_id, ' * '.join(formatted)))




(0, 0.456*"man" * 0.388*"polic" * 0.316*"charg")
(1, 0.436*"second" * 0.412*"90" * 0.34*"abc")
(2, 0.381*"man" * 0.276*"charg" * 0.267*"second")
(3, -0.77*"polic" * 0.241*"man" * 0.216*"charg")
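
Parsing the formatted topic strings works, but it is easier to reason about `(word, weight)` pairs directly; Gensim's `lsi.show_topic(topic_id, topn=...)` returns such pairs. The sketch below hard-codes the pairs of topic 1 from the LSA output above so that it runs on its own; `top_words` and `format_topic` are hypothetical helpers.

```python
# (word, weight) pairs of LSA topic 1, copied from the print_topics output above
topic_1 = [
    ("second", -0.435), ("90", -0.411), ("abc", -0.336), ("news", -0.298),
    ("weather", -0.293), ("man", 0.244), ("busi", -0.230), ("sport", -0.183),
    ("charg", 0.161), ("plan", -0.106),
]


def top_words(pairs, n=3):
    """Keep the n pairs with the largest absolute weight."""
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)[:n]


def format_topic(pairs):
    """Format pairs the way Gensim prints them, joined with ' + '."""
    return " + ".join(f'{weight:.3f}*"{word}"' for word, weight in pairs)


print(format_topic(top_words(topic_1)))
# -0.435*"second" + -0.411*"90" + -0.336*"abc"

# Splitting by sign shows the two word groups that the topic opposes
positive = [word for word, weight in topic_1 if weight > 0]
negative = [word for word, weight in topic_1 if weight < 0]
print(positive)  # ['man', 'charg']
print(negative)
```

Read this way, topic 1 contrasts crime-related words (`man`, `charg`) with words that look like recurring news-bulletin titles (`second`, `90`, `abc`, `news`, `weather`).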


Now let's try LDA instead of LSA, again using Gensim.

In [102]:
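# TODO: Compute LDA
from gensim.models import LdaModel

# Note: the model is fitted on the TF-IDF corpus here; LDA is usually fitted on
# raw bag-of-words counts (bow), which is worth trying as a comparison.
lda1 = LdaModel(corpus=tfidf_corpus, num_topics=5, id2word=id2word, passes=10)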



In [73]:
# TODO: print the most frequent words of each topic
print(lda1.print_topics())

[(0, '0.004*"resid" + 0.003*"strike" + 0.003*"polic" + 0.003*"sentenc" + 0.003*"appeal" + 0.003*"liber" + 0.002*"author" + 0.002*"start" + 0.002*"escap" + 0.002*"brisban"'), (1, '0.007*"miss" + 0.003*"found" + 0.003*"budget" + 0.003*"search" + 0.003*"countri" + 0.003*"hour" + 0.003*"interview" + 0.003*"hit" + 0.003*"first" + 0.003*"high"'), (2, '0.008*"polic" + 0.006*"man" + 0.005*"charg" + 0.005*"kill" + 0.005*"crash" + 0.003*"death" + 0.003*"court" + 0.003*"driver" + 0.003*"car" + 0.003*"fire"'), (3, '0.005*"abc" + 0.005*"second" + 0.005*"news" + 0.005*"rate" + 0.004*"weather" + 0.004*"busi" + 0.004*"council" + 0.003*"90" + 0.003*"rural" + 0.003*"plan"'), (4, '0.004*"new" + 0.004*"chang" + 0.003*"test" + 0.003*"arrest" + 0.003*"alleg" + 0.003*"guilti" + 0.003*"court" + 0.003*"face" + 0.003*"blaze" + 0.003*"flood"')]


Now, how does it work with LDA?
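
One simple way to answer this is to check how distinct the LDA topics are from each other. The sketch below computes the Jaccard overlap between the top-10 word lists printed above (copied by hand so the snippet is self-contained); the `jaccard` helper is an illustrative assumption, not a Gensim function.

```python
from itertools import combinations

# Top-10 words per LDA topic, copied from the print_topics output above
lda_top_words = {
    0: ["resid", "strike", "polic", "sentenc", "appeal",
        "liber", "author", "start", "escap", "brisban"],
    1: ["miss", "found", "budget", "search", "countri",
        "hour", "interview", "hit", "first", "high"],
    2: ["polic", "man", "charg", "kill", "crash",
        "death", "court", "driver", "car", "fire"],
    3: ["abc", "second", "news", "rate", "weather",
        "busi", "council", "90", "rural", "plan"],
    4: ["new", "chang", "test", "arrest", "alleg",
        "guilti", "court", "face", "blaze", "flood"],
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


overlaps = {}
for i, j in combinations(lda_top_words, 2):
    shared = set(lda_top_words[i]) & set(lda_top_words[j])
    if shared:
        overlaps[(i, j)] = sorted(shared)
        print(i, j, sorted(shared), round(jaccard(lda_top_words[i], lda_top_words[j]), 3))
# 0 2 ['polic'] 0.053
# 2 4 ['court'] 0.053
```

The word lists barely overlap, but the probabilities are all very small (0.002 to 0.008), so no word dominates its topic. Topics 2 (crime and accidents) and 3 (news bulletins, council) are the easiest to interpret.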

Let's make some visualization of the LDA results using pyLDAvis.

In [89]:
# TODO: show visualization results of the LDA
!pip install pyLDAvis
!pip install pandas --upgrade



Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, ...) and compare your results with those of others.

In [100]:
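import pyLDAvis
import pyLDAvis.gensim  # named pyLDAvis.gensim_models in pyLDAvis 3.3 and later

pyLDAvis.enable_notebook()
# With the default n_jobs=-1 (joblib worker processes) this call failed with
# "BrokenProcessPool: A task has failed to un-serialize. Please ensure that the
# arguments of the function are all picklable."; n_jobs=1 computes in-process instead.
vis = pyLDAvis.gensim.prepare(topic_model=lda1, corpus=bow, dictionary=id2word, n_jobs=1)
vis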