
# Introduction to Notebook
This notebook is a hands-on session on topic modelling with Gensim, a popular Python library. It walks through importing a dataset through an API, applying the Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) algorithms, and assessing the quality of a topic model with the coherence score.



# Dataset
The sample dataset in this notebook is the [Twenty Newsgroups Data Set](https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups) from the open-source UCI Machine Learning Repository. For this exercise, a cleaned version of the dataset is imported through the [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html?highlight=newsgroup#sklearn.datasets.fetch_20newsgroups) API.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint
import pandas as pd

In [2]:
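# remove= strips headers, footers, and quoted replies; random_state has no effect while shuffle=False
# The outputs recorded below still show quoted lines such as "... writes:"
dataset = fetch_20newsgroups(subset='all', shuffle=False, random_state=32, remove=('headers', 'footers', 'quotes'))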

In [3]:
dataset.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [4]:
dataset['data']

["gajarsky@pilot.njin.net writes:\n\nmorgan and guzman will have era's 1 run higher than last year, and\n the cubs will be idiots and not pitch harkey as much as hibbard.\n castillo won't be good (i think he's a stud pitcher)",
 'Well, I just got my Centris 610 yesterday.  It took just over two \nweeks from placing the order.  The dealer (Rutgers computer store) \nappologized because Apple made a substitution on my order.  I ordered\nthe one without ethernet, but they substituted one _with_ ethernet.\nHe wanted to know if that would be "alright with me"!!!  They must\nbe backlogged on Centri w/out ethernet so they\'re just shipping them\nwith!  \n\n\tAnyway, I\'m very happy with the 610 with a few exceptions.  \nBeing nosy, I decided to open it up _before_ powering it on for the first\ntime.  The SCSI cable to the hard drive was only partially connected\n(must have come loose in shipping).  No big deal, but I would have been\npissed if I tried to boot it and it wouldn\'t come up!\n\tTh

In [5]:
pprint(dataset.target_names)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [6]:
len(dataset.target_names)

20

In [7]:
dataset.target.shape

(18846,)

In [8]:
dataset.filenames.shape

(18846,)

In [9]:
dataset_df = pd.DataFrame({'News':dataset.data, 'Label' : dataset.target})

In [10]:
dataset_df

| | News | Label |
|---|---|---|
| 0 | gajarsky@pilot.njin.net writes:\n\nmorgan and ... | 9 |
| 1 | Well, I just got my Centris 610 yesterday. It... | 4 |
| 2 | Archive-name: cryptography-faq/part10\nLast-mo... | 11 |
| 3 | > ATTENTION: Mac Quadra owners: Many storage i... | 4 |
| 4 | bobbe@vice.ICO.TEK.COM (Robert Beauchaine) wri... | 0 |
| ... | ... | ... |
| 18841 | \nWhy are circuit boards green? The material ... | 12 |
| 18842 | In article <1r941o$3tu@menudo.uh.edu> inde7wv@... | 8 |
| 18843 | We were told that the resolution on the 5FGe c... | 4 |
| 18844 | CAD Setup For Sale:\n\nG486PLB Local Bus Mothe... | 6 |


In [11]:
dataset_df['Label_name'] = dataset_df['Label'].apply(lambda x: dataset.target_names[x])

In [12]:
dataset_df

| | News | Label | Label_name |
|---|---|---|---|
| 0 | gajarsky@pilot.njin.net writes:\n\nmorgan and ... | 9 | rec.sport.baseball |
| 1 | Well, I just got my Centris 610 yesterday. It... | 4 | comp.sys.mac.hardware |
| 2 | Archive-name: cryptography-faq/part10\nLast-mo... | 11 | sci.crypt |
| 3 | > ATTENTION: Mac Quadra owners: Many storage i... | 4 | comp.sys.mac.hardware |
| 4 | bobbe@vice.ICO.TEK.COM (Robert Beauchaine) wri... | 0 | alt.atheism |
| ... | ... | ... | ... |
| 18841 | \nWhy are circuit boards green? The material ... | 12 | sci.electronics |
| 18842 | In article <1r941o$3tu@menudo.uh.edu> inde7wv@... | 8 | rec.motorcycles |
| 18843 | We were told that the resolution on the 5FGe c... | 4 | comp.sys.mac.hardware |
| 18844 | CAD Setup For Sale:\n\nG486PLB Local Bus Mothe... | 6 | misc.forsale |


# Dataset preprocessing

In [13]:
%%capture
!pip install -U gensim

In [14]:
from gensim.utils import tokenize
from gensim.parsing.preprocessing import preprocess_string,strip_tags,strip_punctuation,strip_numeric,remove_stopwords,strip_short
from gensim.corpora.dictionary import Dictionary
from gensim import models

In [15]:
help(preprocess_string)

Help on function preprocess_string in module gensim.parsing.preprocessing:

preprocess_string(s, filters=[<function <lambda> at 0x7e3f92f43b50>, <function strip_tags at 0x7e3f92f435b0>, <function strip_punctuation at 0x7e3f92f43520>, <function strip_multiple_whitespaces at 0x7e3f92f43880>, <function strip_numeric at 0x7e3f92f43760>, <function remove_stopwords at 0x7e3f92f43400>, <function strip_short at 0x7e3f92f43640>, <function stem_text at 0x7e3f92f439a0>])
    Apply list of chosen filters to `s`.
    
    Default list of filters:
    
    * :func:`~gensim.parsing.preprocessing.strip_tags`,
    * :func:`~gensim.parsing.preprocessing.strip_punctuation`,
    * :func:`~gensim.parsing.preprocessing.strip_multiple_whitespaces`,
    * :func:`~gensim.parsing.preprocessing.strip_numeric`,
    * :func:`~gensim.parsing.preprocessing.remove_stopwords`,
    * :func:`~gensim.parsing.preprocessing.strip_short`,
    * :func:`~gensim.parsing.preprocessing.stem_text`.
    
    Parameters
    -------

In [16]:
dataset_df['Clean_news'] = dataset_df['News'].apply(preprocess_string)

In [17]:
dataset_df

| | News | Label | Label_name | Clean_news |
|---|---|---|---|---|
| 0 | gajarsky@pilot.njin.net writes:\n\nmorgan and ... | 9 | rec.sport.baseball | [gajarski, pilot, njin, net, write, morgan, gu... |
| 1 | Well, I just got my Centris 610 yesterday. It... | 4 | comp.sys.mac.hardware | [got, centri, yesterdai, took, week, place, or... |
| 2 | Archive-name: cryptography-faq/part10\nLast-mo... | 11 | sci.crypt | [archiv, cryptographi, faq, modifi, faq, sci, ... |
| 3 | > ATTENTION: Mac Quadra owners: Many storage i... | 4 | comp.sys.mac.hardware | [attent, mac, quadra, owner, storag, industri,... |
| 4 | bobbe@vice.ICO.TEK.COM (Robert Beauchaine) wri... | 0 | alt.atheism | [bobb, vice, ico, tek, com, robert, beauchain,... |
| ... | ... | ... | ... | ... |
| 18841 | \nWhy are circuit boards green? The material ... | 12 | sci.electronics | [circuit, board, green, materi, goe, name, cir... |
| 18842 | In article <1r941o$3tu@menudo.uh.edu> inde7wv@... | 8 | rec.motorcycles | [articl, indewv, rosi, edu, write, bike, lucki... |
| 18843 | We were told that the resolution on the 5FGe c... | 4 | comp.sys.mac.hardware | [told, resolut, fge, anybodi, tri, run, higher... |
| 18844 | CAD Setup For Sale:\n\nG486PLB Local Bus Mothe... | 6 | misc.forsale | [cad, setup, sale, gplb, local, bu, motherboar... |


In [18]:
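# Custom filter list: the defaults minus whitespace collapsing and stemming, so tokens stay whole words
filters = [lambda x: x.lower(), strip_tags, strip_punctuation, strip_numeric, remove_stopwords, strip_short]
dataset_df['Clean_news1'] = dataset_df['News'].apply(lambda x: preprocess_string(x, filters))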

In [19]:
dataset_df

| | News | Label | Label_name | Clean_news | Clean_news1 |
|---|---|---|---|---|---|
| 0 | gajarsky@pilot.njin.net writes:\n\nmorgan and ... | 9 | rec.sport.baseball | [gajarski, pilot, njin, net, write, morgan, gu... | [gajarsky, pilot, njin, net, writes, morgan, g... |
| 1 | Well, I just got my Centris 610 yesterday. It... | 4 | comp.sys.mac.hardware | [got, centri, yesterdai, took, week, place, or... | [got, centris, yesterday, took, weeks, placing... |
| 2 | Archive-name: cryptography-faq/part10\nLast-mo... | 11 | sci.crypt | [archiv, cryptographi, faq, modifi, faq, sci, ... | [archive, cryptography, faq, modified, faq, sc... |
| 3 | > ATTENTION: Mac Quadra owners: Many storage i... | 4 | comp.sys.mac.hardware | [attent, mac, quadra, owner, storag, industri,... | [attention, mac, quadra, owners, storage, indu... |
| 4 | bobbe@vice.ICO.TEK.COM (Robert Beauchaine) wri... | 0 | alt.atheism | [bobb, vice, ico, tek, com, robert, beauchain,... | [bobbe, vice, ico, tek, com, robert, beauchain... |
| ... | ... | ... | ... | ... | ... |
| 18841 | \nWhy are circuit boards green? The material ... | 12 | sci.electronics | [circuit, board, green, materi, goe, name, cir... | [circuit, boards, green, material, goes, names... |
| 18842 | In article <1r941o$3tu@menudo.uh.edu> inde7wv@... | 8 | rec.motorcycles | [articl, indewv, rosi, edu, write, bike, lucki... | [article, indewv, rosie, edu, writes, bike, lu... |
| 18843 | We were told that the resolution on the 5FGe c... | 4 | comp.sys.mac.hardware | [told, resolut, fge, anybodi, tri, run, higher... | [told, resolution, fge, anybody, tried, runnin... |
| 18844 | CAD Setup For Sale:\n\nG486PLB Local Bus Mothe... | 6 | misc.forsale | [cad, setup, sale, gplb, local, bu, motherboar... | [cad, setup, sale, gplb, local, bus, motherboa... |
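
The two cleaning passes differ only in their filter lists: the default list ends with `stem_text`, which is why `Clean_news` holds stems such as `yesterdai`, while the custom list keeps whole words. The following is a minimal pure-Python sketch of that chaining idea, assuming `preprocess_string` applies each filter to the string in order and then splits on whitespace. The helper names (`apply_filters`, `drop_punctuation`, `drop_digits`, `drop_stopwords`, `drop_short`) and the tiny stopword set are hypothetical stand-ins, not Gensim functions.

```python
import re

# Hypothetical, deliberately tiny stopword set; Gensim ships a much longer list.
TOY_STOPWORDS = {"well", "i", "just", "my"}

def drop_punctuation(text):
    # Replace every character that is not a word character or whitespace with a space.
    return re.sub(r"[^\w\s]", " ", text)

def drop_digits(text):
    # Delete digit runs, so a token like "5fge" becomes "fge".
    return re.sub(r"\d+", "", text)

def drop_stopwords(text):
    return " ".join(tok for tok in text.split() if tok not in TOY_STOPWORDS)

def drop_short(text, minsize=3):
    return " ".join(tok for tok in text.split() if len(tok) >= minsize)

def apply_filters(text, filters):
    # Each filter maps a string to a string; the final string is split into tokens.
    for f in filters:
        text = f(text)
    return text.split()

toy_filters = [str.lower, drop_punctuation, drop_digits, drop_stopwords, drop_short]
tokens = apply_filters("Well, I just got my Centris 610 yesterday.", toy_filters)
print(tokens)
```

Because every filter has the same string-in, string-out shape, adding or removing a step (stemming, for instance) is only a change to the list.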


In [20]:
dataset_dictionary = Dictionary(dataset_df['Clean_news1'])

In [21]:
len(dataset_dictionary)

96459

In [22]:
print(dataset_dictionary.token2id)



In [23]:
# Bag-of-words corpus: each document becomes a list of (token_id, count) pairs
dataset_corpus_bow = [dataset_dictionary.doc2bow(text) for text in dataset_df['Clean_news1']]

In [24]:
len(dataset_corpus_bow)

18846

In [25]:
print(dataset_corpus_bow[1])

[(22, 1), (23, 1), (24, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 2), (44, 3), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 2), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 2), (66, 1), (67, 1), (68, 2), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 2), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 2), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1)]
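
Each pair in the output above is `(token_id, count)`: the id comes from the dictionary, and the count is how often that token occurs in the document. A rough sketch of the same idea in plain Python follows. The helpers `build_token2id` and `to_bow` are hypothetical, and ids are assigned here in order of first appearance, which is an assumption and may not match how Gensim's `Dictionary` numbers tokens.

```python
from collections import Counter

def build_token2id(docs):
    # Give every distinct token the next free integer id.
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    # Count occurrences per id and return (token_id, count) pairs sorted by id.
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_docs = [["drive", "scsi", "drive"], ["god", "jesus"], ["scsi", "card"]]
toy_token2id = build_token2id(toy_docs)
toy_bow = [to_bow(doc, toy_token2id) for doc in toy_docs]
print(toy_bow[0])
```

Word order is discarded in this representation; only the identity of each token and its frequency survive.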


In [26]:
tfidf = models.TfidfModel(dataset_corpus_bow)
dataset_corpus_tfidf = tfidf[dataset_corpus_bow]

In [27]:
len(dataset_corpus_tfidf)

18846

In [28]:
print(dataset_corpus_tfidf[1])

[(22, 0.12794312043780054), (23, 0.1032933602529823), (24, 0.1437906445046912), (25, 0.19446130648981633), (26, 0.09972437101248886), (27, 0.19446130648981633), (28, 0.056593976938038), (29, 0.09712742378308543), (30, 0.11391287593244794), (31, 0.08578843198010519), (32, 0.19446130648981633), (33, 0.10142837384435295), (34, 0.09687548915310225), (35, 0.10125120507490903), (36, 0.09133742977605598), (37, 0.061307837508357034), (38, 0.10891119725810347), (39, 0.06606330836365701), (40, 0.08855334717656915), (41, 0.07391023272465086), (42, 0.19446130648981633), (43, 0.11381832135728932), (44, 0.3110653713924378), (45, 0.11566217615948503), (46, 0.0969847936661337), (47, 0.0694228263237942), (48, 0.09383842074796034), (49, 0.09300779209433964), (50, 0.0455655162646588), (51, 0.10391838009790301), (52, 0.07696272067182856), (53, 0.10763740852872361), (54, 0.05222277834588352), (55, 0.05433687654103351), (56, 0.05404373403325169), (57, 0.15338363279035241), (58, 0.08979736303341844), (59, 0.
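
The TF-IDF output keeps the same token ids as the bag-of-words vector but replaces the integer counts with fractional weights. The sketch below reproduces the general recipe on a toy corpus, assuming Gensim's default settings (raw term frequency, `idf = log2(N / df)`, and L2 normalization of each document vector). The function name `tfidf_weights` is hypothetical.

```python
import math

def tfidf_weights(bow_corpus):
    n_docs = len(bow_corpus)

    # Document frequency: in how many documents does each token id occur?
    doc_freq = {}
    for bow in bow_corpus:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted = []
    for bow in bow_corpus:
        # Term frequency times inverse document frequency.
        raw = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in bow]
        # Scale each document vector to unit length; drop zero weights.
        norm = math.sqrt(sum(w * w for _, w in raw))
        weighted.append([(tid, w / norm) for tid, w in raw if norm > 0 and w > 0])
    return weighted

toy_bow = [[(0, 2), (1, 1)], [(2, 1), (3, 1)], [(1, 1), (4, 1)]]
toy_tfidf = tfidf_weights(toy_bow)
print(toy_tfidf[0])
```

In the first toy document, token 0 occurs twice and appears in no other document, so it receives a much larger weight than token 1, which is shared with another document.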

# Topic Modelling with Latent Dirichlet Allocation (LDA)

In [29]:
from gensim.models.ldamodel import LdaModel

In [30]:
lda_bow = LdaModel(dataset_corpus_bow,num_topics=20,id2word=dataset_dictionary,random_state=0)



In [31]:
lda_topics_bow = lda_bow.print_topics(num_words=8)
for topic in lda_topics_bow:
  print(topic)

(0, '0.013*"writes" + 0.011*"edu" + 0.009*"drive" + 0.009*"article" + 0.007*"like" + 0.007*"gamma" + 0.006*"battery" + 0.006*"time"')
(1, '0.031*"god" + 0.011*"jesus" + 0.009*"christ" + 0.008*"church" + 0.007*"sin" + 0.007*"bible" + 0.006*"paul" + 0.006*"christian"')
(2, '0.013*"government" + 0.012*"key" + 0.008*"chip" + 0.008*"clipper" + 0.008*"encryption" + 0.007*"law" + 0.007*"use" + 0.006*"keys"')
(3, '0.012*"space" + 0.008*"nasa" + 0.007*"writes" + 0.006*"article" + 0.006*"shuttle" + 0.006*"earth" + 0.005*"hst" + 0.005*"edu"')
(4, '0.012*"file" + 0.012*"edu" + 0.011*"image" + 0.010*"graphics" + 0.008*"ftp" + 0.008*"files" + 0.008*"available" + 0.007*"program"')
(5, '0.018*"cramer" + 0.017*"men" + 0.015*"gay" + 0.014*"homosexual" + 0.010*"sex" + 0.010*"clayton" + 0.009*"sexual" + 0.009*"kinsey"')
(6, '0.012*"conference" + 0.009*"echo" + 0.007*"xdm" + 0.007*"page" + 0.006*"incoming" + 0.005*"paris" + 0.004*"title" + 0.004*"perlman"')
(7, '0.010*"like" + 0.009*"know" + 0.008*"good" +
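
Each printed line is a `(topic_id, string)` pair in which the string encodes `weight*"word"` terms. If the terms are needed as data rather than text, one option is a small parser. The sketch below uses a hypothetical helper, `parse_topic`, on a shortened copy of topic 1 from the output above. Gensim can also return the pairs directly through `show_topics(formatted=False)`, which is probably the cleaner route inside the notebook.

```python
import re

# Matches terms such as 0.031*"god" or -0.113*"people".
TERM_PATTERN = re.compile(r'(-?\d+\.\d+)\*"([^"]+)"')

def parse_topic(topic_string):
    # Return (word, weight) pairs in the order they appear in the string.
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

sample = '0.031*"god" + 0.011*"jesus" + 0.009*"christ"'
parsed = parse_topic(sample)
print(parsed)
```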

In [32]:
lda_tfidf = LdaModel(dataset_corpus_tfidf, id2word=dataset_dictionary, num_topics=20)



In [33]:
lda_topics_tfidf = lda_tfidf.print_topics(num_words=8)
for topic in lda_topics_tfidf:
  print(topic)

(0, '0.003*"game" + 0.003*"games" + 0.002*"edu" + 0.002*"people" + 0.002*"writes" + 0.002*"article" + 0.002*"think" + 0.002*"right"')
(1, '0.003*"god" + 0.002*"people" + 0.002*"article" + 0.002*"edu" + 0.002*"like" + 0.002*"writes" + 0.002*"think" + 0.002*"know"')
(2, '0.003*"sabres" + 0.002*"chronic" + 0.002*"pitcher" + 0.002*"gant" + 0.002*"bike" + 0.002*"photography" + 0.002*"eisa" + 0.002*"bus"')
(3, '0.001*"pcx" + 0.001*"italy" + 0.001*"motherboards" + 0.001*"nyi" + 0.001*"icons" + 0.001*"sweden" + 0.001*"waikato" + 0.001*"stanza"')
(4, '0.002*"maynard" + 0.002*"lcd" + 0.001*"gauge" + 0.001*"ward" + 0.001*"laurentian" + 0.001*"yuan" + 0.001*"screws" + 0.001*"ramsey"')
(5, '0.001*"batter" + 0.001*"targa" + 0.001*"quicktime" + 0.001*"ncsl" + 0.001*"snake" + 0.001*"patents" + 0.001*"gui" + 0.001*"mccullou"')
(6, '0.003*"modems" + 0.002*"irq" + 0.002*"bios" + 0.002*"motherboard" + 0.002*"diamond" + 0.002*"cursor" + 0.002*"tiff" + 0.002*"ports"')
(7, '0.002*"hacker" + 0.002*"kirlian" +
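
The bag-of-words topics read as recognizable themes, while several TF-IDF topics mix unrelated words. The introduction names the coherence score as the way to quantify that difference. Gensim provides a `CoherenceModel` class for this purpose; the code below is not that implementation but a simplified UMass-style sketch in plain Python, under the assumption that a topic is more coherent when its top words tend to occur in the same documents. The function name `umass_coherence` and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    # Pairs keep the ranking order of top_words: (higher-ranked, lower-ranked).
    for earlier, later in combinations(top_words, 2):
        base = doc_count(earlier)
        if base == 0:
            continue  # skip words that never occur in the reference documents
        score += math.log((doc_count(earlier, later) + 1) / base)
    return score

toy_docs = [
    ["god", "jesus", "bible"],
    ["god", "jesus", "church"],
    ["drive", "scsi", "card"],
    ["drive", "scsi", "god"],
]
coherent = umass_coherence(["god", "jesus", "bible"], toy_docs)
mixed = umass_coherence(["god", "scsi", "bible"], toy_docs)
print(coherent, mixed)
```

Scores of this kind are at most slightly above zero and usually negative; values closer to zero indicate top words that co-occur more often, so the first toy topic scores higher than the mixed one.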

# Topic Modelling with Latent Semantic Analysis/Indexing (LSA/LSI)

In [34]:
from gensim.models.lsimodel import LsiModel

In [35]:
lsi_bow = LsiModel(corpus=dataset_corpus_bow,id2word=dataset_dictionary,num_topics=20)

In [36]:
lsi_topics_bow = lsi_bow.print_topics(num_words=8)
for topic in lsi_topics_bow:
  print(topic)

(0, '0.994*"max" + 0.069*"giz" + 0.068*"bhj" + 0.025*"qax" + 0.015*"biz" + 0.014*"nrhj" + 0.014*"bxn" + 0.012*"nuy"')
(1, '0.255*"jpeg" + 0.253*"file" + 0.219*"edu" + 0.204*"image" + 0.204*"dos" + 0.164*"use" + 0.153*"available" + 0.137*"ftp"')
(2, '0.772*"dos" + 0.277*"windows" + 0.147*"microsoft" + -0.113*"people" + 0.109*"tcp" + -0.090*"jpeg" + 0.090*"mouse" + -0.083*"know"')
(3, '0.435*"jpeg" + 0.231*"image" + -0.216*"people" + -0.174*"said" + -0.171*"dos" + 0.167*"file" + 0.166*"gif" + -0.166*"know"')
(4, '-0.442*"jpeg" + 0.297*"edu" + -0.190*"dos" + 0.171*"pub" + -0.158*"gif" + -0.140*"people" + 0.138*"com" + 0.138*"ftp"')
(5, '0.439*"god" + 0.366*"jehovah" + 0.297*"lord" + 0.275*"elohim" + -0.200*"file" + 0.183*"christ" + 0.177*"jesus" + 0.141*"father"')
(6, '0.640*"file" + 0.204*"gun" + -0.164*"edu" + -0.162*"image" + 0.134*"jehovah" + 0.132*"god" + 0.115*"output" + 0.110*"control"')
(7, '-0.350*"stephanopoulos" + -0.267*"president" + 0.221*"use" + -0.221*"file" + -0.190*"graph

In [37]:
lsi_tfidf = LsiModel(dataset_corpus_tfidf, id2word=dataset_dictionary, num_topics=20)

In [38]:
lsi_topics_tfidf = lsi_tfidf.print_topics(num_words=8)
for topic in lsi_topics_tfidf:
  print(topic)

(0, '0.127*"people" + 0.122*"god" + 0.110*"know" + 0.109*"like" + 0.101*"think" + 0.099*"windows" + 0.094*"edu" + 0.093*"use"')
(1, '0.278*"god" + -0.244*"windows" + -0.193*"drive" + -0.169*"card" + -0.164*"scsi" + -0.152*"dos" + -0.124*"thanks" + 0.123*"jesus"')
(2, '-0.483*"god" + -0.199*"jesus" + 0.193*"game" + 0.137*"team" + -0.128*"bible" + -0.123*"windows" + -0.121*"scsi" + 0.121*"games"')
(3, '0.256*"game" + -0.244*"key" + 0.216*"drive" + 0.194*"scsi" + -0.170*"encryption" + -0.167*"clipper" + -0.163*"chip" + 0.161*"team"')
(4, '-0.383*"scsi" + -0.363*"drive" + 0.315*"windows" + -0.197*"ide" + 0.145*"file" + -0.138*"controller" + -0.128*"drives" + -0.128*"chip"')
(5, '0.275*"key" + 0.273*"god" + 0.220*"chip" + -0.204*"israel" + 0.197*"game" + 0.173*"clipper" + 0.169*"encryption" + 0.139*"keys"')
(6, '-0.304*"car" + 0.241*"windows" + 0.209*"israel" + 0.204*"scsi" + 0.190*"dos" + 0.144*"game" + -0.134*"bike" + 0.124*"israeli"')
(7, '0.443*"card" + -0.267*"drive" + 0.222*"video" + 
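
Unlike the LDA output, the LSI topics carry negative weights. LSI factorizes the term-document matrix with a truncated singular value decomposition, so each topic is a direction in term space rather than a probability distribution, and its components can take either sign. The sketch below illustrates this on a toy matrix using power iteration with deflation in plain Python. It is only an illustration of the idea, not the algorithm `LsiModel` uses, and the function names and the toy matrix are hypothetical.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def lsi_directions(term_doc, num_topics=2, iterations=200):
    """Leading left singular vectors of a term-by-document matrix."""
    n_terms, n_docs = len(term_doc), len(term_doc[0])
    found = []
    for _topic in range(num_topics):
        # Asymmetric start vector so it overlaps with every direction.
        vec = normalize([float(i + 1) for i in range(n_terms)])
        for _step in range(iterations):
            # Deflation: remove components along directions already found.
            for prev in found:
                dot = sum(a * b for a, b in zip(vec, prev))
                vec = [a - dot * b for a, b in zip(vec, prev)]
            # Multiply by A A^T: terms -> documents -> terms.
            doc_scores = [sum(term_doc[t][d] * vec[t] for t in range(n_terms))
                          for d in range(n_docs)]
            vec = normalize([sum(term_doc[t][d] * doc_scores[d] for d in range(n_docs))
                             for t in range(n_terms)])
        found.append(vec)
    return found

# Rows are terms, columns are documents (toy counts).
toy_terms = ["edu", "god", "drive"]
toy_matrix = [
    [1, 1, 1, 1],  # "edu" appears everywhere
    [2, 1, 0, 0],  # "god" only in the first two documents
    [0, 0, 1, 2],  # "drive" only in the last two
]
topics = lsi_directions(toy_matrix)
for vec in topics:
    print([(term, round(w, 3)) for term, w in zip(toy_terms, vec)])
```

The first direction has only positive weights, much like topic 0 in both LSI outputs above. The second direction separates "god" from "drive" by giving them opposite signs, which is the pattern seen in the later LSI topics.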

# Topic Modelling Visualization with pyLDAvis

In [39]:
%%capture
!pip install pyLDAvis

In [40]:
import pyLDAvis
import pyLDAvis.gensim_models

In [41]:
pyLDAvis.enable_notebook()


In [42]:
vis_bow = pyLDAvis.gensim_models.prepare(lda_bow, dataset_corpus_bow, dataset_dictionary)
vis_bow


In [43]:
vis_tfidf = pyLDAvis.gensim_models.prepare(lda_tfidf, dataset_corpus_tfidf, dataset_dictionary)
vis_tfidf

