# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise models the main topics in a dataset of news headlines.

Begin by importing the needed libraries:

In [2]:
# TODO: import needed libraries
import nltk
import numpy as np
import pandas as pd
import gensim

Load the data in the file `random_headlines.csv`

In [3]:
# TODO: load the dataset
df = pd.read_csv('random_headlines.csv')

In [4]:
print(df.shape)
df.head()

(20000, 2)


| | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset.

In [5]:
# TODO: Perform a short EDA
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
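
Beyond `df.info()`, a few extra checks are often useful before modelling. The following is a minimal sketch, assuming the column names shown above; it runs on the five sample rows from the `df.head()` output so that it is self-contained, and the variable names (`sample`, `headlines_per_year`, `words_per_headline`, `n_duplicates`) are illustrative only.

```python
import pandas as pd

# Five sample rows copied from the df.head() output above; in the notebook,
# run the same steps on the full `df` instead.
sample = pd.DataFrame({
    "publish_date": [20120305, 20081128, 20090325, 20100201, 20080225],
    "headline_text": [
        "ute driver hurt in intersection crash",
        "6yo dies in cycling accident",
        "bumper olive harvest expected",
        "replica replaces northernmost sign",
        "woods targets perfect season",
    ],
})

# Parse the integer YYYYMMDD dates so headlines can be counted per year
dates = pd.to_datetime(sample["publish_date"].astype(str), format="%Y%m%d")
headlines_per_year = dates.dt.year.value_counts().sort_index()

# Headline length in words, and number of exact duplicate headlines
words_per_headline = sample["headline_text"].str.split().str.len()
n_duplicates = int(sample["headline_text"].duplicated().sum())

print(headlines_per_year)
print(words_per_headline.describe())
print("duplicates:", n_duplicates)
```

Headlines are very short documents, so the word-count summary gives a feel for how little context each one offers a topic model.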


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [6]:
# TODO: Preprocess the input data
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))  # set membership is faster than scanning a list

# Lowercase and tokenize each headline
df['headline_stemmed'] = df['headline_text'].str.lower().apply(word_tokenize)
# Keep purely alphabetic tokens (drops punctuation and tokens such as "6yo")
df['headline_stemmed'] = df['headline_stemmed'].apply(lambda x: [item for item in x if item.isalpha()])
# Remove English stopwords
df['headline_stemmed'] = df['headline_stemmed'].apply(lambda x: [item for item in x if item not in stop])
# Stem each remaining token
df['headline_stemmed'] = df['headline_stemmed'].apply(lambda x: [stemmer.stem(item) for item in x])
df['headline_stemmed'].head()

0    [ute, driver, hurt, intersect, crash]
1                       [die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: headline_stemmed, dtype: object
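
To see what each preprocessing step does to a single headline, the pipeline can be sketched as one function. This is a simplified illustration, not the notebook's exact code: it splits on whitespace instead of calling `word_tokenize`, and `STOPWORDS` is a tiny hand-picked set (an assumption) standing in for NLTK's full English list, so no NLTK data download is needed.

```python
from nltk.stem import PorterStemmer

# Tiny hand-picked stopword set, for illustration only (an assumption);
# the notebook uses nltk's full English stopword list instead.
STOPWORDS = {"in", "the", "of", "to", "a"}
stemmer = PorterStemmer()

def preprocess(headline):
    """Lowercase, tokenize, drop non-alphabetic tokens and stopwords, then stem."""
    tokens = headline.lower().split()                    # simple whitespace tokenization
    tokens = [t for t in tokens if t.isalpha()]          # drops tokens such as "6yo"
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [stemmer.stem(t) for t in tokens]             # Porter stemming

print(preprocess("Ute driver hurt in intersection crash"))
print(preprocess("6yo dies in cycling accident"))
```

Unlike `word_tokenize`, whitespace splitting does not separate punctuation from words, so a token such as `crash.` would be dropped entirely by the `isalpha()` filter rather than kept as `crash`.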

Now use Gensim to compute a bag-of-words (BOW) representation.

In [7]:
# TODO: Compute the BOW using Gensim
from gensim.corpora import Dictionary

gensim_dict = Dictionary(df['headline_stemmed'])
corpus = [gensim_dict.doc2bow(line) for line in df['headline_stemmed']]

print(len(corpus))
print(corpus[0:2])

20000
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]


Compute the TF-IDF using Gensim

In [8]:
# TODO: Compute TF-IDF
from gensim.models import TfidfModel

tfidf_model = TfidfModel(corpus)
tf_idf = tfidf_model[corpus]
print(len(tf_idf))


20000
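
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF in plain Python. It assumes what I understand to be the default scheme of Gensim's `TfidfModel`: raw term frequency, an inverse document frequency of log2(N / df), removal of zero weights, and scaling of each document to unit L2 norm. The toy documents are hypothetical, and the code is an illustration of the formula rather than Gensim's implementation.

```python
import math
from collections import Counter

# Toy stemmed documents (hypothetical, for illustration only)
docs = [
    ["polic", "charg", "man"],
    ["polic", "seek", "man"],
    ["polic", "plan", "new", "new"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: weight} using tf * log2(N / df), scaled to unit L2 norm."""
    counts = Counter(doc)
    raw = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in counts.items()}
    raw = {t: w for t, w in raw.items() if w > 0}    # terms present in every doc get weight 0
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = [tfidf(doc) for doc in docs]
print(weights[2])
```

Here "polic" occurs in every document, so its weight is zero and it disappears, while "new" (twice in one document, absent elsewhere) receives the largest weight.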


Finally, compute **LSA** (also called LSI) using Gensim, with a number of topics of your choice.

In [14]:
# TODO: Compute LSA
from gensim.models import LsiModel

# The number of topics is a free choice; 4 is used here.
# Note: the model is fit on the raw BOW counts (`corpus`), not on `tf_idf`.
lsi_model = LsiModel(corpus=corpus, num_topics=4, id2word=gensim_dict)



For each topic, show the most significant words.

In [15]:
# TODO: Print the 3 or 4 most significant words of each topic
lsi_model.print_topics(num_words=3)

[(0, '-0.752*"polic" + -0.404*"man" + -0.208*"charg"'),
 (1, '0.669*"man" + -0.575*"polic" + 0.329*"charg"'),
 (2, '0.655*"new" + 0.298*"plan" + 0.242*"say"'),
 (3, '0.703*"new" + -0.337*"plan" + -0.336*"say"')]

What do you think about those results?

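The top words give a rough idea of what each topic captures.
Topic 0 (-0.752 "polic" + -0.404 "man" + -0.208 "charg") seems to capture news related to crime or police activity. All three weights are negative, but the overall sign of an LSA topic is arbitrary, so only the relative signs and magnitudes matter.
Topic 1 (0.669 "man" + -0.575 "polic" + 0.329 "charg") uses the same three stems but contrasts "man" and "charg" with "polic", which hints at a focus on individuals (possibly criminals or victims) in news involving police.
Topic 2 (0.655 "new" + 0.298 "plan" + 0.242 "say") seems to involve new plans or policies being discussed or announced.
Topic 3 (0.703 "new" + -0.337 "plan" + -0.336 "say") reuses the same stems as topic 2 with opposite signs on "plan" and "say", so it splits that theme rather than adding a new one.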
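
The negative weights in the LSA topics are easier to accept once LSA is seen as a truncated SVD of the term-document matrix. The NumPy sketch below, on a hypothetical toy matrix, shows that flipping the sign of a topic's term weights (together with the matching document weights) leaves the low-rank reconstruction unchanged, so the sign carries no meaning by itself.

```python
import numpy as np

# Toy term-document count matrix (hypothetical): rows are terms, columns are documents
terms = ["polic", "man", "charg", "new", "plan"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# LSA keeps the top-k singular triplets of X
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Flip the sign of the first topic: the term weights change sign,
# but the rank-k reconstruction is identical
U_flipped, Vt_flipped = U.copy(), Vt.copy()
U_flipped[:, 0] *= -1
Vt_flipped[0, :] *= -1
X_k_flipped = U_flipped[:, :k] @ np.diag(s[:k]) @ Vt_flipped[:k, :]

print(dict(zip(terms, U[:, 0].round(3))))
print(np.allclose(X_k, X_k_flipped))
```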

The four topics are built from only two distinct sets of stems, so adjusting the number of topics could separate the themes better.

Now let's try to use LDA instead of LSA using Gensim

In [16]:
# TODO: Compute LDA
from gensim.models import LdaModel
lda_model = LdaModel(corpus=corpus, num_topics=4, id2word=gensim_dict, random_state=0, chunksize=512, passes=5)

In [17]:
# TODO: print the most frequent words of each topic
lda_model.print_topics(num_words=3)

[(0, '0.016*"report" + 0.009*"back" + 0.009*"may"'),
 (1, '0.012*"mine" + 0.011*"polic" + 0.009*"elect"'),
 (2, '0.013*"question" + 0.010*"council" + 0.010*"fund"'),
 (3, '0.012*"sydney" + 0.012*"charg" + 0.011*"australian"')]

Now, how does it work with LDA?

LDA explains each document (here, each headline) as a mixture of unobserved (latent) topics, and each topic as a probability distribution over words.

Topic 0 ("report", "back", "may") might represent general discussion or reporting on various events.
Topic 1 ("mine", "polic", "elect") mixes the mining industry, law enforcement, and possibly politics or elections.
Topic 2 ("question", "council", "fund") seems to involve queries or discussions about local government and financing.
Topic 3 ("sydney", "charg", "australian") appears to be geographically specific, perhaps dealing with legal matters in Australia, with a focus on Sydney.

The top words of the four LDA topics are very different from those found by LSA, so the two models suggest different themes. The LDA weights are also all small (around 0.01), which means no handful of stems dominates a topic.

The top words of each topic are the central terms around which the topic is built, but they do not provide a full picture. The real interpretative work lies in looking at more words for each topic and considering them in the context of the documents they come from.

Let's visualize the LDA results using pyLDAvis.

In [13]:
# TODO: show visualization results of the LDA
import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, gensim_dict)
vis

Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, and so on) and compare your results with others'.