# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics of a dataset of news headlines.

Begin by importing the needed libraries:

In [81]:
# TODO: import needed libraries
import pandas as pd
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import gensim
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models import LsiModel
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

Load the data in the file `random_headlines.csv`

In [82]:
# TODO: load the dataset
df = pd.read_csv('random_headlines.csv')
print(df.shape)
df.head()

(20000, 2)


| | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset before modelling it.

In [83]:
# TODO: Perform a short EDA
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
None
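
Beyond `df.info()`, a short EDA could also look at headline lengths, publication years and duplicates. The sketch below is only an illustration: it works on a three-row `sample` frame copied from the head of the dataset, the column names `date` and `n_words` are made up here, and it assumes `publish_date` is encoded as `YYYYMMDD`, which the values shown above suggest.

```python
import pandas as pd

# Three rows copied from df.head() so the sketch runs without the CSV file
sample = pd.DataFrame({
    "publish_date": [20120305, 20081128, 20090325],
    "headline_text": [
        "ute driver hurt in intersection crash",
        "6yo dies in cycling accident",
        "bumper olive harvest expected",
    ],
})

# Assumption: the integer dates follow the YYYYMMDD pattern
sample["date"] = pd.to_datetime(sample["publish_date"].astype(str), format="%Y%m%d")

# Number of whitespace-separated words per headline
sample["n_words"] = sample["headline_text"].str.split().str.len()

# Headlines per year and number of exact duplicate headlines
headlines_per_year = sample["date"].dt.year.value_counts().sort_index()
n_duplicates = int(sample["headline_text"].duplicated().sum())

print(sample["n_words"].describe())
print(headlines_per_year)
print(f"duplicate headlines: {n_duplicates}")
```

The same three steps can be applied to the full `df` by replacing `sample` with `df`.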


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [84]:
# TODO: Preprocess the input data
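lemmatizer = WordNetLemmatizer()
# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

# Function to preprocess the text
def preprocess_text(text):
    text = text.lower()
    tokens = word_tokenize(text)
    # Drop tokens that are punctuation characters
    tokens = [token for token in tokens if token not in string.punctuation]
    tokens = [token for token in tokens if token not in stop_words]
    # Without a POS tag, lemmatize() treats every token as a noun,
    # which is why "dies" becomes "dy" in the output below
    lemmatized_text = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemmatized_text)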

# Apply the preprocessing function to the headline_text column
df['headline_text'] = df['headline_text'].astype(str)
df['processed_headlines'] = df['headline_text'].apply(preprocess_text)
df[['headline_text', 'processed_headlines']].head()


| | headline_text | processed_headlines |
|---|---|---|
| 0 | ute driver hurt in intersection crash | ute driver hurt intersection crash |
| 1 | 6yo dies in cycling accident | 6yo dy cycling accident |
| 2 | bumper olive harvest expected | bumper olive harvest expected |
| 3 | replica replaces northernmost sign | replica replaces northernmost sign |
| 4 | woods targets perfect season | wood target perfect season |


Now use Gensim to compute a bag-of-words (BOW) representation.

In [85]:
# TODO: Compute the BOW using Gensim
tokenized_headlines = df['processed_headlines'].apply(gensim.utils.simple_preprocess)

# Create a Gensim dictionary
dictionary = Dictionary(tokenized_headlines)

# Convert to a BoW format
bow_corpus = [dictionary.doc2bow(text) for text in tokenized_headlines]

print(f"({len(bow_corpus)},)")
print(bow_corpus[:2])

(20000,)
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1), (8, 1)]]
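
To make the `(token_id, count)` pairs above less opaque, here is a minimal pure-Python sketch of the bag-of-words idea. The helpers `build_vocabulary` and `to_bow` and the toy documents are made up for illustration; Gensim's `Dictionary` may assign ids in a different order, so only the structure of the output is comparable.

```python
from collections import Counter

def build_vocabulary(docs):
    """Give each distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Toy tokenised headlines (hypothetical)
toy_docs = [
    ["police", "charge", "man"],
    ["man", "bites", "man"],
]

toy_vocab = build_vocabulary(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_bow)
```

Word order is discarded: each document is reduced to which tokens occur and how often.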


Compute the TF-IDF using Gensim

In [86]:
# TODO: Compute TF-IDF
tfidf = TfidfModel(bow_corpus)

# Apply the TF-IDF model to the whole BOW corpus
tfidf_corpus = tfidf[bow_corpus]

print(f"({len(tfidf_corpus)},)")
print(tfidf_corpus)

(20000,)
<gensim.interfaces.TransformedCorpus object at 0x000001C10F38A050>
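
Printing the transformed corpus only shows a lazy wrapper object; iterating over it (for example `tfidf_corpus[0]`) yields the weighted vectors. As a rough sketch of what those weights are, the function below assumes Gensim's default settings as I understand them: raw term frequency multiplied by `log2(N / document_frequency)`, followed by L2 normalisation of each document. The function name `tfidf_weights` and the toy input are hypothetical.

```python
import math

def tfidf_weights(bow_docs):
    """Weight (term_id, count) pairs by tf * log2(N / df), then L2-normalise."""
    n_docs = len(bow_docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = {}
    for bow in bow_docs:
        for term_id, _ in bow:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

    weighted_docs = []
    for bow in bow_docs:
        raw = [(term_id, tf * math.log2(n_docs / doc_freq[term_id]))
               for term_id, tf in bow]
        # Terms present in every document get weight 0 and are dropped
        raw = [(term_id, w) for term_id, w in raw if w > 0]
        norm = math.sqrt(sum(w * w for _, w in raw))
        weighted_docs.append([(term_id, w / norm) for term_id, w in raw] if norm else [])
    return weighted_docs

# Toy BOW corpus: term 0 occurs in both documents, terms 1 and 2 in one each
toy_bow_docs = [[(0, 1), (1, 1)], [(0, 1), (2, 2)]]
toy_tfidf = tfidf_weights(toy_bow_docs)
print(toy_tfidf)
```

In this toy corpus term 0 carries no weight at all, because a word that appears in every document does not help to tell documents apart.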


Next, compute **LSA** (also called LSI) using Gensim, for a number of topics that you choose yourself.

In [87]:
# TODO: Compute LSA
num_topics = 5

# Create the LSI model from the TF-IDF corpus
lsi_model = LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_topics=num_topics)

# Apply the LSI model to the TF-IDF corpus to create a topic distribution for each document
lsi_corpus = lsi_model[tfidf_corpus]



For each topic, show the most significant words.

In [88]:
# TODO: Print the 3 or 4 most significant words of each topic
topics = lsi_model.print_topics(num_topics=num_topics)

for topic_num, topic in enumerate(topics):
    print(f"Topic {topic_num + 1}: {topic}")

Topic 1: (0, '0.466*"man" + 0.422*"police" + 0.222*"charged" + 0.160*"court" + 0.132*"murder" + 0.122*"new" + 0.121*"missing" + 0.120*"face" + 0.116*"death" + 0.115*"crash"')
Topic 2: (1, '0.532*"second" + 0.436*"abc" + 0.414*"news" + 0.367*"weather" + 0.265*"business" + 0.217*"sport" + -0.148*"man" + 0.098*"rural" + 0.093*"national" + -0.091*"police"')
Topic 3: (2, '-0.473*"man" + -0.247*"charged" + 0.234*"council" + 0.217*"new" + 0.213*"govt" + 0.189*"plan" + 0.135*"say" + -0.131*"second" + 0.127*"call" + -0.113*"murder"')
Topic 4: (3, '-0.761*"police" + 0.290*"man" + 0.148*"charged" + 0.141*"council" + 0.141*"court" + -0.138*"probe" + -0.128*"investigate" + 0.122*"new" + 0.116*"plan" + -0.108*"search"')
Topic 5: (4, '-0.510*"abc" + 0.385*"news" + -0.366*"interview" + 0.304*"rural" + 0.269*"national" + 0.254*"second" + -0.252*"weather" + -0.168*"sport" + -0.151*"new" + 0.102*"qld"')


What do you think about those results?
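
One thing that stands out in the LSA topics above is the negative weights. LSA is essentially a truncated singular value decomposition (SVD) of the term-document matrix, and singular vectors are only defined up to sign, so the magnitude of a weight matters more than its sign. The sketch below illustrates this on a tiny made-up matrix with plain NumPy; Gensim's `LsiModel` uses its own incremental SVD implementation, so this is a conceptual illustration rather than what the library runs internally.

```python
import numpy as np

terms = ["man", "police", "court", "council", "plan"]

# Toy term-document matrix: rows are terms, columns are documents (made-up counts)
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

# Full SVD, then keep only the k strongest components ("topics")
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
X_k = U_k @ np.diag(s_k) @ Vt_k  # rank-k approximation of X

# Flipping the sign of both factors gives exactly the same approximation
X_flipped = (-U_k) @ np.diag(s_k) @ (-Vt_k)
print("same approximation after sign flip:", np.allclose(X_k, X_flipped))

# Show the three terms with the largest absolute weight per topic
for topic_idx in range(k):
    order = np.argsort(-np.abs(U_k[:, topic_idx]))
    summary = " + ".join(f'{U_k[i, topic_idx]:.3f}*"{terms[i]}"' for i in order[:3])
    print(f"Topic {topic_idx}: {summary}")
```

Because each component is orthogonal to the previous ones, later LSA topics often mix positive and negative terms, which makes them harder to read as a single theme.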

Now let's try LDA instead of LSA, again using Gensim.

In [89]:
# TODO: Compute LDA
# Create the LDA model from the BOW corpus
lda_model = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=num_topics, passes=15, random_state=100)
topics = lda_model.print_topics(num_topics=num_topics)

In [90]:
# TODO: print the most frequent words of each topic
for topic_num, topic in enumerate(topics):
    print(f"Topic {topic_num + 1}: {topic}")

Topic 1: (0, '0.007*"rise" + 0.006*"rate" + 0.005*"may" + 0.005*"market" + 0.005*"cut" + 0.005*"price" + 0.004*"former" + 0.004*"share" + 0.004*"australian" + 0.004*"power"')
Topic 2: (1, '0.011*"interview" + 0.011*"crash" + 0.007*"missing" + 0.006*"police" + 0.006*"win" + 0.006*"driver" + 0.005*"china" + 0.005*"search" + 0.005*"qld" + 0.004*"killed"')
Topic 3: (2, '0.007*"change" + 0.007*"govt" + 0.007*"say" + 0.006*"new" + 0.005*"council" + 0.005*"set" + 0.005*"green" + 0.004*"group" + 0.004*"time" + 0.004*"sex"')
Topic 4: (3, '0.010*"new" + 0.010*"water" + 0.008*"plan" + 0.008*"council" + 0.006*"boost" + 0.006*"govt" + 0.005*"mine" + 0.005*"call" + 0.005*"mayor" + 0.005*"review"')
Topic 5: (4, '0.021*"man" + 0.019*"police" + 0.010*"death" + 0.009*"charged" + 0.009*"court" + 0.008*"face" + 0.008*"murder" + 0.007*"fire" + 0.007*"woman" + 0.006*"two"')


How do the results look with LDA?

Let's make some visualization of the LDA results using pyLDAvis.

In [91]:
# TODO: show visualization results of the LDA
lda_display = gensimvis.prepare(lda_model, bow_corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, and so on) and compare your results with those of others.