# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics in a dataset of news headlines.

Begin by importing the needed libraries:

In [8]:
!pip install --upgrade gensim



In [1]:
# TODO: import needed libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from gensim.models import TfidfModel
from gensim.models import LsiModel
from gensim.models import LdaModel
from gensim.corpora import Dictionary

Load the data in the file `random_headlines.csv`

In [2]:
# TODO: load the dataset
df = pd.read_csv("random_headlines.csv")
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB


| | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset...

In [3]:
# TODO: Perform a short EDA
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
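Beyond `df.info()`, a short EDA could look at publication years, headline lengths and duplicates. The following is a minimal sketch on the five rows shown above; the `sample` frame and the `date` and `n_words` columns are illustrative names, not part of the exercise.

```python
import pandas as pd

# Small stand-in for df, built from the five rows displayed by df.head()
sample = pd.DataFrame({
    "publish_date": [20120305, 20081128, 20090325, 20100201, 20080225],
    "headline_text": [
        "ute driver hurt in intersection crash",
        "6yo dies in cycling accident",
        "bumper olive harvest expected",
        "replica replaces northernmost sign",
        "woods targets perfect season",
    ],
})

# publish_date is stored as an integer such as 20120305: parse it as a date
sample["date"] = pd.to_datetime(sample["publish_date"].astype(str), format="%Y%m%d")

# Number of whitespace-separated words per headline
sample["n_words"] = sample["headline_text"].str.split().str.len()

print(sample["date"].dt.year.value_counts().sort_index())  # headlines per year
print(sample["n_words"].describe())                        # length distribution
print(sample["headline_text"].duplicated().sum())          # repeated headlines
```

The same three checks should work unchanged on the full `df` by replacing `sample` with `df`.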


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [4]:
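# TODO: Preprocess the input data
# Assumes the NLTK tokenizer and stopword resources are already downloaded
# Lowercase and tokenize each headline
df["tokenize"] = df["headline_text"].apply(lambda row: word_tokenize(row.lower()))
# Keep alphabetic tokens only (drops punctuation and tokens such as "6yo")
df["alpha"] = df["tokenize"].apply(lambda row: [word for word in row if word.isalpha()])

# Remove English stopwords
stop_words = set(stopwords.words('english'))
df["stop"] = df["alpha"].apply(lambda row: [word for word in row if word not in stop_words])

# Reduce each remaining word to its stem
stemmer = PorterStemmer()
df["stem"] = df["stop"].apply(lambda row: [stemmer.stem(word) for word in row])
df["stem"].head()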

0    [ute, driver, hurt, intersect, crash]
1                       [die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stem, dtype: object
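The lowercasing, tokenization and filtering steps can also be written as a single function, which makes them easier to test on one headline at a time. This is a standard-library sketch only: `TOY_STOPWORDS` and `preprocess` are hypothetical names, the stopword list is far shorter than NLTK's, the regex tokenizer is a simplification of `word_tokenize`, and stemming is left out.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK's English list is much longer)
TOY_STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "for", "and"}

def preprocess(headline, stop_words=TOY_STOPWORDS):
    """Lowercase, tokenize, keep alphabetic tokens, remove stopwords."""
    # Lowercase, then split on anything that is not a letter or a digit
    tokens = re.findall(r"[a-z0-9]+", headline.lower())
    # Keep purely alphabetic tokens (drops tokens such as "6yo")
    tokens = [t for t in tokens if t.isalpha()]
    # Remove stopwords
    return [t for t in tokens if t not in stop_words]

cleaned = preprocess("6yo dies in Cycling Accident!")
print(cleaned)
```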

Now use Gensim to compute a bag-of-words (BOW) representation

In [5]:
# TODO: Compute the BOW using Gensim
dct = Dictionary(df["stem"])
df['BOW'] = df["stem"].apply(lambda document: dct.doc2bow(document)) 
df['BOW']

0                 [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
1                                 [(5, 1), (6, 1), (7, 1)]
2                       [(8, 1), (9, 1), (10, 1), (11, 1)]
3                     [(12, 1), (13, 1), (14, 1), (15, 1)]
4                     [(16, 1), (17, 1), (18, 1), (19, 1)]
                               ...                        
19995    [(121, 1), (743, 1), (983, 1), (9722, 1), (122...
19996    [(378, 1), (610, 1), (1442, 1), (1663, 1), (22...
19997                       [(154, 1), (446, 1), (535, 1)]
19998         [(163, 1), (3943, 1), (6310, 1), (12209, 1)]
19999    [(948, 1), (1741, 1), (1993, 1), (2717, 1), (3...
Name: BOW, Length: 20000, dtype: object

In [6]:
corpus_bow = [dct.doc2bow(document) for document in df["stem"]]
# The documents have different lengths, so count them with len() instead of np.shape()
print(len(corpus_bow))
corpus_bow[:2]

20000


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]
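To make the `(token_id, count)` pairs above less opaque, here is a rough standard-library sketch of the idea behind a dictionary plus `doc2bow`. The helpers `build_vocab` and `to_bow` are hypothetical, and the ids they assign will not necessarily match the ones Gensim assigns.

```python
from collections import Counter

def build_vocab(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

docs = [["die", "cycl", "accid"], ["cycl", "crash", "cycl"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Each document becomes a sparse list that only mentions the tokens it contains, which is why short headlines produce short BOW vectors.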

Compute the TF-IDF using Gensim

In [7]:
# TODO: Compute TF-IDF
tfidf_model = TfidfModel(df['BOW'].to_list())
tfidf = tfidf_model[df['BOW'].to_list()]

# tfidf is a lazily transformed corpus: count its documents with len()
print(len(tfidf))
tfidf

20000


<gensim.interfaces.TransformedCorpus at 0x7fd687596df0>
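For intuition about what the TF-IDF weights represent, the sketch below recomputes them by hand on a toy corpus. It assumes what I understand to be Gensim's default scheme (raw term frequency, base-2 logarithm of the inverse document frequency, then L2 normalization of each document); `tfidf_transform` and `toy_bow` are illustrative names.

```python
import math
from collections import Counter

def tfidf_transform(corpus_bow):
    """Weight each (token_id, count) pair by tf * log2(N / df), then L2-normalize."""
    n_docs = len(corpus_bow)
    # Document frequency: number of documents that contain each token id
    doc_freq = Counter(token_id for doc in corpus_bow for token_id, _ in doc)
    result = []
    for doc in corpus_bow:
        weights = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in doc]
        # Tokens present in every document get a zero weight and are dropped
        weights = [(tid, w) for tid, w in weights if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        result.append([(tid, w / norm) for tid, w in weights])
    return result

# Token 0 appears in three of the four documents, the other tokens in one each
toy_bow = [[(0, 1), (1, 1)], [(0, 1), (2, 1)], [(0, 1), (3, 2)], [(4, 1)]]
toy_tfidf = tfidf_transform(toy_bow)
print(toy_tfidf[0])
```

Within a document, a rare token ends up with a larger weight than a frequent one, and the squared weights of each document sum to 1.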

Finally, compute **LSA** (also called LSI) using Gensim, with a number of topics of your choice.

In [40]:
tfidf[0]

[(0, 0.30725466582280214),
 (1, 0.3528943781678455),
 (2, 0.42129048115131124),
 (3, 0.5992666854471201),
 (4, 0.49442279315598586)]

In [8]:
# TODO: Compute LSA
lsi_model = LsiModel(df['BOW'].to_list(),id2word=dct, num_topics = 4)
lsi_model

<gensim.models.lsimodel.LsiModel at 0x7fd687596c40>

For each topic, show the most significant words.

In [9]:
# TODO: Print the 3 or 4 most significant words of each topic
lsi_model.print_topics(num_words=3)

[(0, '0.752*"polic" + 0.405*"man" + 0.208*"charg"'),
 (1, '-0.670*"man" + 0.575*"polic" + -0.329*"charg"'),
 (2, '0.656*"new" + 0.296*"plan" + 0.241*"say"'),
 (3, '0.702*"new" + -0.339*"say" + -0.333*"plan"')]

What do you think about those results?

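There is only a slight difference between the topics: topics 0 and 1 are both built from "polic", "man" and "charg", and topics 2 and 3 from "new", "plan" and "say". Within each pair, only the weights and signs differ.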

Now let's use Gensim to try LDA instead of LSA

In [10]:
# TODO: Compute LDA
lda_model = LdaModel(df['BOW'].to_list(),id2word=dct, num_topics = 4)
lda_model

<gensim.models.ldamodel.LdaModel at 0x7fd687eef5e0>

In [11]:
# TODO: print the most frequent words of each topic
lda_model.print_topics(num_words=3)

[(0, '0.010*"kill" + 0.008*"interview" + 0.007*"win"'),
 (1, '0.013*"charg" + 0.013*"man" + 0.013*"polic"'),
 (2, '0.009*"polic" + 0.009*"new" + 0.007*"council"'),
 (3, '0.005*"fire" + 0.005*"report" + 0.005*"water"')]

Now, how does it work with LDA?

Let's make some visualization of the LDA results using pyLDAvis.

In [49]:
pip install pyLDAvis


In [14]:
# TODO: show visualization results of the LDA
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Render the interactive visualization inline in the notebook
pyLDAvis.enable_notebook()

vis = gensimvis.prepare(lda_model, df['BOW'].to_list(), dct)
vis



Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, ...) and compare your results with others'.