# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise models the main topics in a dataset of news headlines.

Begin by importing the needed libraries:

In [1]:
# TODO: import needed libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dinhthingocha/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dinhthingocha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Load the data in the file `random_headlines.csv`

In [2]:
# TODO: load the dataset
df = pd.read_csv("random_headlines.csv")
df.head(5)

Unnamed: 0,publish_date,headline_text
0,20120305,ute driver hurt in intersection crash
1,20081128,6yo dies in cycling accident
2,20090325,bumper olive harvest expected
3,20100201,replica replaces northernmost sign
4,20080225,woods targets perfect season


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset.

In [3]:
# TODO: Perform a short EDA
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
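
Beyond `df.info()`, a couple of derived columns can make the EDA more informative. The sketch below is a minimal example, assuming `publish_date` is an integer in `YYYYMMDD` form as in the rows shown above. The small `sample` frame and the column names `date` and `n_words` are introduced here for illustration only.

```python
import pandas as pd

# Five rows copied from the df.head() output above, used as a stand-in for the full file.
sample = pd.DataFrame({
    "publish_date": [20120305, 20081128, 20090325, 20100201, 20080225],
    "headline_text": [
        "ute driver hurt in intersection crash",
        "6yo dies in cycling accident",
        "bumper olive harvest expected",
        "replica replaces northernmost sign",
        "woods targets perfect season",
    ],
})

# Parse the integer dates into real timestamps (assumes YYYYMMDD).
sample["date"] = pd.to_datetime(sample["publish_date"].astype(str), format="%Y%m%d")

# Number of whitespace-separated words per headline.
sample["n_words"] = sample["headline_text"].str.split().str.len()

print(sample["date"].dt.year.value_counts().sort_index())
print(sample["n_words"].describe())
print("duplicated headlines:", sample["headline_text"].duplicated().sum())
```

On the full dataset, the same lines can be applied to `df` directly.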


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [4]:
# TODO: Preprocess the input data



In [5]:
# Lowercase, then split each headline into tokens
df['tokens'] = df['headline_text'].str.lower().apply(word_tokenize)
# Keep purely alphabetic tokens (drops punctuation and numbers)
df['alpha'] = df['tokens'].apply(lambda x: [item for item in x if item.isalpha()])
# Remove English stop words (a set makes the membership test fast)
stop_words = set(stopwords.words('english'))
df['stop'] = df['alpha'].apply(lambda x: [item for item in x if item not in stop_words])
# Reduce each remaining word to its stem
stemmer = PorterStemmer()
df['stemmed'] = df['stop'].apply(lambda x: [stemmer.stem(item) for item in x])
df['stemmed'].head(5)

0    [ute, driver, hurt, intersect, crash]
1                       [die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stemmed, dtype: object

In [6]:
! pip install gensim



Now use Gensim to compute a bag-of-words (BOW) representation.

In [7]:
# TODO: Compute the BOW using Gensim
from gensim.corpora import Dictionary
dictionary = Dictionary(df['stemmed'])
corpus = [dictionary.doc2bow(line) for line in df['stemmed']]
print(len(corpus))
corpus[0:2]

20000


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]
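
Each document in `corpus` is a list of `(token_id, count)` pairs. The pure-Python sketch below imitates that representation on the first two stemmed headlines shown earlier. It is a simplified stand-in, not Gensim's implementation: the helper `build_bow` is hypothetical, and assigning ids to new tokens in alphabetical order within each document is an assumption made for this sketch.

```python
from collections import Counter

def build_bow(documents):
    """Return (token2id, corpus) where corpus holds sorted (token_id, count) pairs."""
    token2id = {}
    corpus = []
    for tokens in documents:
        counts = Counter(tokens)
        # Give every unseen token the next free integer id.
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
        corpus.append(sorted((token2id[t], c) for t, c in counts.items()))
    return token2id, corpus

docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["die", "cycl", "accid"],
]
token2id, bow = build_bow(docs)
print(bow)
print(token2id["crash"], token2id["ute"])
```

Reading a document back is then a matter of inverting `token2id`, which is the role `dictionary` plays in the notebook.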

Compute the TF-IDF weights using Gensim.

In [8]:
# TODO: Compute TF-IDF
from gensim.models import TfidfModel
tfidf_model = TfidfModel(corpus)
tf_idf = tfidf_model[corpus]
print(len(tf_idf))

20000
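
TF-IDF down-weights words that occur in many documents. The sketch below computes it by hand on a toy BOW corpus, using the raw count as term frequency, `log2(N / df)` as inverse document frequency, and L2 normalisation of each document vector. This should correspond to the default settings of `TfidfModel`, but treat that as an assumption; the helper `tfidf` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter

def tfidf(bow_corpus):
    """Weight (token_id, count) pairs by count * log2(N / df), then L2-normalise."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents each token id appears.
    doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    weighted = []
    for doc in bow_corpus:
        vec = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        # Tokens present in every document get weight 0 and are dropped.
        vec = [(tid, w) for tid, w in vec if w > 0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

# Toy corpus: token 0 is in every document, token 1 in two, tokens 2 and 3 in one.
toy_corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (1, 1), (2, 1)],
    [(0, 2), (3, 1)],
]
for doc in tfidf(toy_corpus):
    print([(tid, round(w, 3)) for tid, w in doc])
```

Note that the LSA cell in this notebook is fitted on the raw counts in `corpus`; fitting it on `tf_idf` instead is a variant worth trying.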


Finally, compute the **LSA** (also called LSI) using Gensim, for a number of topics that you choose yourself.

In [9]:
# TODO: Compute LSA
from gensim.models import LsiModel

lsi = LsiModel(corpus=corpus, num_topics=4, id2word=dictionary)

For each topic, show the most significant words.

In [10]:
# TODO: Print the 3 or 4 most significant words of each topic
lsi.print_topics(num_words=3)

[(0, '0.752*"polic" + 0.404*"man" + 0.208*"charg"'),
 (1, '-0.669*"man" + 0.575*"polic" + -0.329*"charg"'),
 (2, '-0.656*"new" + -0.292*"plan" + 0.242*"man"'),
 (3, '0.701*"new" + -0.346*"say" + -0.332*"plan"')]

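What do you think about those results? Topics 0 and 1 describe the same thing: both are dominated by "polic", "man" and "charg", only with different weights and signs. Topics 2 and 3 likewise share "new" and "plan", which may relate to politicians announcing new plans. These words are probably also among the most common in the headlines, so the four topics overlap heavily rather than separating distinct themes.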
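
The negative weights in the LSA output are expected: LSA is a truncated singular value decomposition (SVD), and its topics are orthogonal directions, so later topics often reuse the same words with mixed signs. The NumPy sketch below illustrates this on a toy term-document matrix. The counts are invented for illustration, and it calls `numpy.linalg.svd` directly rather than Gensim's `LsiModel`.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# Terms and counts are made up for illustration only.
terms = ["polic", "man", "charg", "new", "plan"]
X = np.array([
    [2, 1, 0, 0],   # polic
    [1, 2, 0, 0],   # man
    [1, 1, 0, 0],   # charg
    [0, 0, 3, 1],   # new
    [0, 0, 1, 3],   # plan
], dtype=float)

# Full SVD, then keep the k strongest directions as "topics".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
topics = U[:, :k]

for t in range(k):
    # Rank terms by absolute weight, as print_topics does with signed weights.
    order = np.argsort(-np.abs(topics[:, t]))[:3]
    print(t, [(terms[i], round(float(topics[i, t]), 3)) for i in order])
```

In this toy case one topic gives "new" and "plan" the same sign while a later one gives them opposite signs, much like topics 2 and 3 above. The overall sign of each topic is arbitrary: flipping every sign in a topic, together with the matching document vector, yields the same decomposition.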

Now let's try LDA instead of LSA, again using Gensim.

In [11]:
# TODO: Compute LDA
from gensim.models import LdaModel
LDA = LdaModel(corpus=corpus, num_topics = 4, id2word=dictionary, random_state=0, chunksize=512, passes=5)

In [12]:
# TODO: print the most frequent words of each topic
LDA.print_topics(num_words=3)

[(0, '0.016*"report" + 0.009*"back" + 0.009*"may"'),
 (1, '0.012*"mine" + 0.011*"polic" + 0.009*"elect"'),
 (2, '0.013*"question" + 0.010*"council" + 0.010*"fund"'),
 (3, '0.012*"sydney" + 0.012*"charg" + 0.011*"australian"')]

Now, how does it work with LDA? The results differ from LSA: no topic repeats the words of another, but the topics are still hard to interpret.

Let's make some visualization of the LDA results using pyLDAvis.

In [13]:
# TODO: show visualization results of the LDA
! pip install pyldavis



In [14]:
! pip install --upgrade joblib



In [15]:
! pip install --upgrade pandas



In [16]:
import pickle
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

In [17]:
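import pyLDAvis
# Recent pyLDAvis releases expose the Gensim helper as pyLDAvis.gensim_models
# (older releases called it pyLDAvis.gensim), so only that module is imported.
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(LDA, corpus, dictionary)
vis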



Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, and so on) and compare your results with those of others.
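
One way to compare different numbers of topics is a coherence score: topics whose top words tend to co-occur in the same documents score higher. The sketch below hand-rolls a UMass-style coherence, the average of `log((D(w_i, w_j) + 1) / D(w_j))` over word pairs, on a toy corpus. It is a stand-in for illustration, not the `CoherenceModel` imported earlier; the function name `umass_coherence` and the toy documents are assumptions.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Average UMass pair score for a topic's top words (most probable first)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    # combinations() preserves order, so `earlier` always ranks above `later`.
    for earlier, later in combinations(top_words, 2):
        denom = doc_count(earlier)
        if denom == 0:
            continue  # word never occurs in the corpus; skip the pair
        scores.append(math.log((doc_count(later, earlier) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")

toy_docs = [
    ["polic", "man", "charg"],
    ["polic", "charg", "man"],
    ["new", "plan", "council"],
    ["plan", "council", "fund"],
    ["polic", "man"],
]
coherent = umass_coherence(["polic", "man", "charg"], toy_docs)
mixed = umass_coherence(["polic", "plan", "fund"], toy_docs)
print(round(coherent, 3), round(mixed, 3))
```

With Gensim, the same idea would be applied by training one `LdaModel` per candidate `num_topics` and keeping the value with the best coherence.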