Natural Language Processing & Text Analytics - Session 07 - March 30 2022

## Exercise 7.0* Topic Modeling Using Gensim

This exercise walks through setting up an LDA (Latent Dirichlet Allocation) model. If you need some clarifying examples, see this article on LDA: https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2

**The Data**

The data set we’ll use is a list of over one million news headlines published over a period of 15 years and can be downloaded from [Kaggle](https://www.kaggle.com/therohk/million-headlines/data?login=true#abcnews-date-text.csv).

In [None]:
# This is only needed when you are running the code on Google Colab.
from google.colab import files
uploaded = files.upload()

# !pip install pandas==1.5.3 # THIS VERSION OF PANDAS MUST BE INSTALLED TO BE COMPATIBLE WITH pyLDAvis IN COLAB

Saving abcnews-date-text (2).csv to abcnews-date-text (2).csv


In [None]:
import warnings

import pandas as pd

warnings.filterwarnings("ignore", category=DeprecationWarning)

data = pd.read_csv('abcnews-date-text (2).csv')
# .copy() avoids pandas' SettingWithCopyWarning when the 'index' column is added
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text

In [None]:
len(documents)

1226258

In [None]:
documents[:5]

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


**Data Preprocessing**

Write a function that performs the following steps:

1. Tokenization: split the text into sentences and the sentences into words.
2. Lowercase the words and remove punctuation.
3. Remove words that have fewer than 3 characters.
4. Remove all stopwords.
5. Lemmatize the remaining words.
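
As a rough illustration of how these steps chain together, here is a minimal standard-library sketch. The stopword set and the suffix-stripping "lemmatizer" are deliberately crude stand-ins (assumptions made for this example only); in the exercise you would use gensim's `simple_preprocess` and `STOPWORDS` together with nltk's `WordNetLemmatizer`. The names `TOY_STOPWORDS`, `toy_lemmatize`, and `toy_preprocess` are hypothetical.

```python
import re

# Illustrative stand-in for a real stopword list such as gensim's STOPWORDS.
TOY_STOPWORDS = {"the", "for", "and", "with", "from", "that", "this"}


def toy_lemmatize(token):
    """Crude stand-in for a lemmatizer: strip a trailing plural/verb 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def toy_preprocess(text, stopwords=TOY_STOPWORDS, min_len=3):
    """Lowercase, tokenize, drop short words and stopwords, then lemmatize."""
    # Lowercasing plus a letters-only pattern also removes punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    result = []
    for token in tokens:
        if len(token) < min_len or token in stopwords:
            continue
        result.append(toy_lemmatize(token))
    return result


sample = "Ratepayers group wants compulsory local govt voting"
print(toy_preprocess(sample))
```

The design point to carry over is the order of operations: filter on length and stopwords first, then lemmatize what is left.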


In [None]:
#!pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer
import numpy as np
np.random.seed(2018)

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

### Lemmatize example

In [None]:
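nltk.download('omw-1.4')
# lemmatize() treats the word as a noun by default (pos='n'), so 'went' is
# returned unchanged; pass pos='v' to lemmatize it as a verb ('go').
print(WordNetLemmatizer().lemmatize('went'))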

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


went


Write a function that performs the lemmatization and preprocessing steps on the data.

In [None]:
# Write function here

**Test function on doc 4310 here**

In [None]:
doc_sample = documents[documents['index'] == 4310].values[0][0]

print('original document: ')
print(doc_sample.split(' '))

print('\n\n Preprocessed document')
# Test the document here

original document: 
['ratepayers', 'group', 'wants', 'compulsory', 'local', 'govt', 'voting']


 Lemmatized document
['ratepayer', 'group', 'want', 'compulsory', 'local', 'govt', 'voting']




Preprocess the headline text of every document and save the results as ‘processed_docs’.

In [None]:
#

In [None]:
# Print the first 10 documents of processed_docs

**Bag of Words on the Data set**

Create a dictionary from ‘processed_docs’ containing the number of times a word appears in the training set.
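
To make concrete what such a dictionary holds, here is a small standard-library sketch that builds a token-to-id mapping and counts in how many documents each token occurs. It is only a conceptual model with hypothetical names (`toy_build_dictionary`, `toy_docs`); the id order that `gensim.corpora.Dictionary` assigns may differ.

```python
from collections import Counter


def toy_build_dictionary(processed_docs):
    """Return (token2id, doc_freqs) for a list of tokenized documents."""
    token2id = {}
    doc_freqs = Counter()
    for doc in processed_docs:
        for token in doc:
            # Assign the next free integer id to each unseen token.
            if token not in token2id:
                token2id[token] = len(token2id)
        # set(doc) counts each token at most once per document.
        doc_freqs.update(set(doc))
    return token2id, doc_freqs


toy_docs = [
    ["air", "staff", "strike"],
    ["air", "strike", "affect", "traveller"],
]
token2id, doc_freqs = toy_build_dictionary(toy_docs)
print(token2id)
print(doc_freqs["air"], doc_freqs["staff"])
```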

In [None]:
# Hint: use gensim.corpora.Dictionary()

In [None]:
# Print the number of items in the dictionary

7999


In [None]:
# View the first ten items. Replace 'dictionary' if you named it something else
for i, (k, v) in enumerate(dictionary.iteritems()):
    print(k, v)
    if i > 9:
        break

0 community
1 witness
2 call
3 summit
4 aust
5 rise
6 staff
7 strike
8 australian
9 jump
10 win
11 record


**Gensim filter_extremes**

Filter out tokens that appear in

- fewer than 15 documents (an absolute number), or
- more than 50% of the documents (0.5 is a fraction of the total corpus size, not an absolute number).

After these two steps, keep only the 100000 most frequent tokens.
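
The three rules can be sketched in plain Python as follows. This is a conceptual model rather than gensim's code: `toy_filter_extremes` and the toy frequencies are hypothetical, the parameter names mirror those of `filter_extremes` (`no_below`, `no_above`, `keep_n`), and ranking "most frequent" by document frequency is an assumption of this sketch.

```python
def toy_filter_extremes(doc_freqs, num_docs, no_below=15, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive the three filtering rules."""
    # Rules 1 and 2: drop tokens that are too rare or too common.
    kept = {
        token: df
        for token, df in doc_freqs.items()
        if no_below <= df <= no_above * num_docs
    }
    # Rule 3: keep the keep_n tokens with the highest document frequency
    # (ties broken alphabetically so the result is deterministic).
    ranked = sorted(kept, key=lambda token: (-kept[token], token))
    return set(ranked[:keep_n])


# Toy document frequencies for a corpus of 100 documents.
toy_doc_freqs = {"police": 40, "the": 95, "zebra": 2, "court": 30, "rain": 15}
survivors = toy_filter_extremes(toy_doc_freqs, num_docs=100, keep_n=2)
print(sorted(survivors))
```

Here "the" is removed for being too common, "zebra" for being too rare, and "rain" passes both thresholds but is cut by `keep_n=2`.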

In [None]:
# Hint: All steps can be performed with the filter_extremes method

**Gensim doc2bow**

For each document, create a bag-of-words representation: a list of (word id, count) pairs reporting which words appear and how many times each one appears. Save this to ‘bow_corpus’, then check the document we selected earlier (4310).
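
For intuition, the bag-of-words conversion can be modeled with `collections.Counter`. The sketch below is a simplified stand-in with hypothetical names (`toy_doc2bow`, `toy_token2id`); like the dictionary's `doc2bow` method, it ignores tokens that are not in the dictionary and returns (word id, count) pairs.

```python
from collections import Counter


def toy_doc2bow(tokens, token2id):
    """Convert a token list to sorted (token_id, count) pairs."""
    # Tokens missing from the dictionary are silently skipped.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


toy_token2id = {"group": 0, "local": 1, "voting": 2}
bow = toy_doc2bow(["local", "group", "local", "unknown"], toy_token2id)
print(bow)
```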

In [None]:
# Hint: Use doc2bow on every document in processed_docs

bow_corpus =

docnr = 4310

print(bow_corpus[docnr]) # format is [(word_number, count), ...]


Preview Bag Of Words for our sample preprocessed document.

In [None]:
bow_doc_4310 = bow_corpus[docnr]

# Each entry of the BOW vector is a (token_id, count) pair
for token_id, count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     count))

**TF-IDF**

Create a tf-idf model object using models.TfidfModel on ‘bow_corpus’ and save it to ‘tfidf’, then apply the transformation to the entire corpus and call the result ‘corpus_tfidf’. Finally, preview the TF-IDF scores for our sample document (4310).
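
To see what the transformation does to a bag-of-words vector, here is a hand-rolled sketch. It assumes one common formulation: weight = count × log2(N / document frequency), followed by scaling each document to unit length, with zero-weight terms dropped. Treat this as an assumption about the weighting scheme, not a description of what `TfidfModel` is guaranteed to compute; `toy_tfidf` and `toy_corpus` are hypothetical names.

```python
import math
from collections import Counter


def toy_tfidf(bow_corpus):
    """Apply count * log2(N / df) weighting and L2-normalize each document."""
    num_docs = len(bow_corpus)
    # Each token id appears at most once per BOW vector, so this counts documents.
    doc_freqs = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    weighted_corpus = []
    for doc in bow_corpus:
        raw = [
            (token_id, count * math.log2(num_docs / doc_freqs[token_id]))
            for token_id, count in doc
        ]
        # Terms that occur in every document get weight 0 and are dropped.
        raw = [(token_id, w) for token_id, w in raw if w > 0]
        norm = math.sqrt(sum(w * w for _, w in raw))
        weighted_corpus.append([(token_id, w / norm) for token_id, w in raw])
    return weighted_corpus


toy_corpus = [[(0, 1), (1, 1)], [(0, 2)], [(2, 1)]]
toy_weights = toy_tfidf(toy_corpus)
for doc in toy_weights:
    print(doc)
```

In the first toy document, token 1 occurs in only one document and therefore receives a higher weight than token 0, which occurs in two.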



In [None]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

In [None]:
corpus_tfidf = tfidf[bow_corpus]

In [None]:
print(corpus_tfidf[4310])

### Running LDA using Bag of Words

Train an LDA model using gensim.models.LdaMulticore (or gensim.models.ldamodel.LdaModel) and save it to 'lda_model'. Use 10 topics.


In [None]:
lda_model =

In [None]:
for idx, topic in lda_model.print_topics(-1): # -1 corresponds to all topics
  print('Topic: {} \nWords: {}'.format(idx, topic))

Cool! Can you distinguish different topics using the words in each topic and their corresponding weights?
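
When comparing topics, it can help to turn a printed topic string into (word, weight) pairs. The sketch below parses the string format that `print_topics` produces, using one topic string copied from this notebook's output; `parse_topic` is a hypothetical helper. If you have the model object at hand, `lda_model.show_topic(topic_id)` should give you such pairs directly.

```python
import re


def parse_topic(topic_string):
    """Parse '0.085*"govt" + 0.062*"claim"' into [('govt', 0.085), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


topic = '0.085*"govt" + 0.062*"claim" + 0.027*"home" + 0.026*"fund" + 0.026*"case"'
for word, weight in parse_topic(topic):
    print("{:<6} {:.3f}".format(word, weight))
```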

## Visualization

There is a nice way to visualize the LDA model you built using the package pyLDAvis:

In [None]:
# !pip install pyLDAvis
# !pip install pyLDAvis.gensim
# You might also need !pip install pandas==1.5.3

In [None]:
%matplotlib inline
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis


vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)
pyLDAvis.display(vis)


**Allocating topics to documents**

In [None]:
processed_docs[4310]

In [None]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))

## Testing model on unseen document

In [None]:
unseen_document = '' # Write some text here
unseen_document_preprocessed = # Perform the preprocessing here
bow_vector =  # Use dictionary.doc2bow to create BOW vector here

print('bow_vector: ', bow_vector)

# Sort the (topic, score) pairs by the score (tup[1]). Negating the score
# reverses the sort order, so the highest-scoring topic comes first.
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

bow_vector:  [(55, 1), (56, 1), (392, 1), (497, 1), (927, 1)]
Score: 0.1833622306585312	 Topic: 0.085*"govt" + 0.062*"claim" + 0.027*"home" + 0.026*"fund" + 0.026*"case"
Score: 0.18333686888217926	 Topic: 0.153*"iraq" + 0.040*"sars" + 0.038*"world" + 0.030*"woman" + 0.022*"denies"
Score: 0.18332484364509583	 Topic: 0.078*"iraqi" + 0.056*"plan" + 0.035*"attack" + 0.033*"protest" + 0.024*"open"
Score: 0.18331703543663025	 Topic: 0.099*"police" + 0.027*"coalition" + 0.027*"missing" + 0.024*"election" + 0.021*"probe"
Score: 0.18331320583820343	 Topic: 0.034*"north" + 0.032*"hope" + 0.031*"australian" + 0.027*"missile" + 0.026*"coast"
Score: 0.016669167205691338	 Topic: 0.052*"say" + 0.038*"report" + 0.036*"force" + 0.027*"hospital" + 0.025*"saddam"
Score: 0.016669156029820442	 Topic: 0.042*"killed" + 0.028*"crash" + 0.028*"dead" + 0.021*"title" + 0.021*"return"
Score: 0.016669156029820442	 Topic: 0.075*"baghdad" + 0.036*"water" + 0.032*"australia" + 0.028*"urged" + 0.027*"boost"
Score: 0.0



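The topic listed first is the one our test document most probably belongs to. Note, however, that the top five scores in this output are nearly identical (about 0.183 each), so the model does not clearly favor a single topic for this document.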
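
One way to check how decisive the top topic is: rank the (topic id, score) pairs and compare the winner with the runner-up. In this sketch the scores are rounded from the output above, while the topic ids are hypothetical because the printed output does not show them; `rank_topics` is also a hypothetical helper.

```python
def rank_topics(topic_scores):
    """Sort (topic_id, score) pairs from highest to lowest score."""
    return sorted(topic_scores, key=lambda pair: pair[1], reverse=True)


# Scores rounded from the notebook output; the topic ids are made up.
toy_scores = [(3, 0.0167), (7, 0.1834), (1, 0.1833), (5, 0.0167)]

ranked = rank_topics(toy_scores)
best_id, best_score = ranked[0]
margin = best_score - ranked[1][1]
print("best topic:", best_id)
print("margin over runner-up: {:.4f}".format(margin))
```

A margin this small suggests reading the top few topics together rather than assigning the document to a single one.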






Create a new LDA model on the tf-idf representation (‘corpus_tfidf’) and save it to 'lda_model_tfidf'.

In [None]:
lda_model_tfidf =

In [None]:
# Print the topics
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

In [None]:
processed_docs[4310]

In [None]:
# Get topic assignments for our sample doc
for index, score in sorted(lda_model_tfidf[corpus_tfidf[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))
