# News Modeling

Topic modeling involves **extracting features from document terms** and using
mathematical structures and frameworks such as matrix factorization and SVD to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** across documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** in which each **document is assumed to be a combination of topics**, similar to a probabilistic latent semantic indexing model.

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
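
The two distributions above can be read as a recipe for generating a document. The following is a minimal sketch of that generative story using only the standard library; the two topics, their word probabilities, and the `alpha` values are hypothetical and chosen only for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Pick a topic mixture for the document, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # this document's distribution over topics
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        vocab, probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two hypothetical topics, each a distribution over words
TOPIC_WORD = [
    {"game": 0.5, "team": 0.3, "score": 0.2},
    {"engine": 0.5, "car": 0.4, "door": 0.1},
]

rng = random.Random(0)
theta, words = generate_document(TOPIC_WORD, alpha=[0.5, 0.5], length=8, rng=rng)
print([round(p, 3) for p in theta], words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions.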

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand).
2. For each document D, go through each word W and, for each topic T, compute:
    - **P(T | D)**, the proportion of words in D currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to a topic T** with probability P(T | D) × P(W | T), taking all other words and their topic assignments as given.
4. Repeat steps 2 and 3 until the assignments stabilize.
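
The reassignment loop can be sketched as a small Gibbs-style sampler. This is an illustrative toy under stated assumptions, not the variational inference that gensim's `LdaModel` uses: the smoothing constants `alpha` and `beta`, the iteration count, and the tiny corpus are all hypothetical.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler: repeatedly reassign each word's topic given all other words."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Random initial topic for every word occurrence
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(zs) for zs in assignments]       # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics
    for doc, zs in zip(docs, assignments):
        for w, z in zip(doc, zs):
            topic_word[z][w] += 1
            topic_total[z] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assignments[d][i]
                # Remove this word so the counts describe "all other words"
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                # Weight per topic: P(T | D) * P(W | T), smoothed by alpha and beta.
                # P(T | D) is left unnormalized; its denominator is the same for all T.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
    return assignments, topic_word

docs = [["game", "team", "score", "team"],
        ["engine", "car", "door", "car"],
        ["team", "game", "car"]]
assignments, topic_word = lda_gibbs(docs, num_topics=2)
for t, counts in enumerate(topic_word):
    print(t, (+counts).most_common(3))  # unary + drops zero counts
```

The smoothing terms keep every weight positive, so a topic that currently has no words in a document can still be chosen.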

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
! pip install pyLDAvis gensim spacy





### Import the libraries

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from pprint import pprint
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora, models
import gensim

### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics
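
The notebook does not show the download command itself. One possible way to fetch the file from Python is sketched below; the helper name `download` and the skip-if-present behavior are assumptions made for this example.

```python
import os
import urllib.request

DATA_URL = "https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json"

def download(url, dest):
    """Fetch url into dest unless the file is already present; return dest."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest

# Uncomment to fetch the file (roughly 22 MB) into the working directory:
# download(DATA_URL, "newsgroups.json")
```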


### Load the dataset

In [2]:
df = pd.read_json("newsgroups.json")
df

| | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |
| ... | ... | ... | ... |
| 11309 | From: jim.zisfein@factory.com (Jim Zisfein) \n... | 13 | sci.med |
| 11310 | From: ebodin@pearl.tufts.edu\nSubject: Screen ... | 4 | comp.sys.mac.hardware |
| 11311 | From: westes@netcom.com (Will Estes)\nSubject:... | 3 | comp.sys.ibm.pc.hardware |
| 11312 | From: steve@hcrlgw (Steven Collins)\nSubject: ... | 1 | comp.graphics |


In [6]:
df['content'][0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [11]:
# Series.iteritems() was removed in pandas 2.0; items() is the replacement
for _, content in df['content'].items():
    text = content.lower()
    print(text)


from: lmarsha@cms.cc.wayne.edu (laurie marshall)
subject: re: if you were pat burns ...
organization: wayne state university, detroit mi  u.s.a.
lines: 30
nntp-posting-host: cms.cc.wayne.edu

in article <1r1chb$5l2@jethro.corp.sun.com>
jake@rambler.eng.sun.com (jason cockroft) writes:
 
>suggestions:  clarke-anderson-gilmour vs. sheppard-yserbeart-??
>              andreychuck-borchevsy-??  vs. detroit checking line
>              toronto's checking line  vs. yzerman-fedorov-probert (pray lots)
>
 
 well, i'm a wings fan and i think the first thing that you should do is to
get the opponent's line combinations correct before you try to match up anyone
with them.  there is no yzerman-fedorov-probert line, except for maybe on a
powerplay.  these three players usually play on three different lines.
which would mean that toronto's checking line would have to pull a triple
shift.
the wings' lines usually look like this:
 
                      gallant-yzerman-ciccarelli
 
                   

In [32]:
text = df.content.values.tolist()
text

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

### Preprocess the data

### Email Removal

In [35]:
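import re
# Raw string avoids invalid-escape warnings; the pattern drops any whitespace-delimited token containing '@'
text1 = [re.sub(r'\S+@\S+', '', sent) for sent in text]
text1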

["From:  (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From:  (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n

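
To see exactly what the `\S+@\S+` pattern removes, here is a small check on a made-up string (the `example.org` address is hypothetical). Because the pattern matches the whole whitespace-delimited token, surrounding punctuation such as angle brackets and trailing commas disappears along with the address.

```python
import re

EMAIL_RE = re.compile(r"\S+@\S+")

sample = "From: lerxst@wam.umd.edu (where's my thing)\nreply to <il@example.org>, thanks"
cleaned = EMAIL_RE.sub("", sample)
print(repr(cleaned))
```

The same behavior means that message IDs containing `@`, which appear in quoted newsgroup replies, are removed as well.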
### Newline Removal

In [46]:
text2 = [re.sub('\n', r'', sent) for sent in text1]
text2

["From:  (where's my thing)Subject: WHAT car is this!?Nntp-Posting-Host: rac3.wam.umd.eduOrganization: University of Maryland, College ParkLines: 15 I was wondering if anyone out there could enlighten me on this car I sawthe other day. It was a 2-door sports car, looked to be from the late 60s/early 70s. It was called a Bricklin. The doors were really small. In addition,the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, yearsof production, where this car is made, history, or whatever info youhave on this funky looking car, please e-mail.Thanks,- IL   ---- brought to you by your neighborhood Lerxst ----",
 "From:  (Guy Kuo)Subject: SI Clock Poll - Final CallSummary: Final call for SI clock reportsKeywords: SI,acceleration,clock,upgradeArticle-I.D.: shelley.1qvfo9INNc3sOrganization: University of WashingtonLines: 11NNTP-Posting-Host: carson.u.washington.eduA fair number of brave souls who upgraded their SI clock o

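
Note that deleting `\n` without inserting a space fuses the last word of one line with the first word of the next, which is why tokens such as `sawthe` and `eduOrganization` appear in the output above. A possible alternative, sketched here with a hypothetical helper `collapse_newlines`, replaces every run of whitespace with a single space. It is not what the notebook ran, and it also collapses repeated spaces.

```python
import re

def collapse_newlines(text):
    """Replace each run of whitespace (including newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

sample = "this car I saw\nthe other day.\n\nThanks,\n- IL\n"
fused = re.sub("\n", "", sample)      # newline deleted: 'saw' and 'the' fuse
spaced = collapse_newlines(sample)    # newline replaced by a space
print(fused)
print(spaced)
```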
### Single Quotes Removal

In [47]:
text3 = [re.sub("\'", "", sent) for sent in text2]
text3

['From:  (wheres my thing)Subject: WHAT car is this!?Nntp-Posting-Host: rac3.wam.umd.eduOrganization: University of Maryland, College ParkLines: 15 I was wondering if anyone out there could enlighten me on this car I sawthe other day. It was a 2-door sports car, looked to be from the late 60s/early 70s. It was called a Bricklin. The doors were really small. In addition,the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, yearsof production, where this car is made, history, or whatever info youhave on this funky looking car, please e-mail.Thanks,- IL   ---- brought to you by your neighborhood Lerxst ----',
 'From:  (Guy Kuo)Subject: SI Clock Poll - Final CallSummary: Final call for SI clock reportsKeywords: SI,acceleration,clock,upgradeArticle-I.D.: shelley.1qvfo9INNc3sOrganization: University of WashingtonLines: 11NNTP-Posting-Host: carson.u.washington.eduA fair number of brave souls who upgraded their SI clock os

In [48]:
print(text3[:1])

['From:  (wheres my thing)Subject: WHAT car is this!?Nntp-Posting-Host: rac3.wam.umd.eduOrganization: University of Maryland, College ParkLines: 15 I was wondering if anyone out there could enlighten me on this car I sawthe other day. It was a 2-door sports car, looked to be from the late 60s/early 70s. It was called a Bricklin. The doors were really small. In addition,the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, yearsof production, where this car is made, history, or whatever info youhave on this funky looking car, please e-mail.Thanks,- IL   ---- brought to you by your neighborhood Lerxst ----']


### Tokenize
- Create **sent_to_words()** 
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [49]:
def sent_to_words(sents):
    for s in sents:
        # deacc=True strips accent marks; tokenization itself already drops punctuation
        yield gensim.utils.simple_preprocess(str(s), deacc=True)

word3 = list(sent_to_words(text3))

print(word3[0])

['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'eduorganization', 'university', 'of', 'maryland', 'college', 'parklines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'sawthe', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'yearsof', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'youhave', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']


### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [50]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

#### remove_stopwords( )

In [54]:
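def remove_stopwords(texts):
    # Call simple_preprocess through the already-imported gensim package to avoid a NameError
    return [[word for word in gensim.utils.simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]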
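
Since the documents are already tokenized at this point, stop-word filtering can also be done directly on the token lists. The sketch below assumes a miniature, hypothetical stop list in place of NLTK's English list (which needs `nltk.download('stopwords')` once before first use) and stores it in a `frozenset` so that each membership test is a constant-time lookup rather than a list scan.

```python
# Hypothetical miniature stop list; the notebook uses NLTK's English list plus five extra words
STOP_WORDS = frozenset(
    ["the", "is", "this", "what", "my", "from", "subject", "re", "edu", "use"]
)

def remove_stopwords_tokens(docs, stop_words=STOP_WORDS):
    """Drop stop words from already-tokenized documents (lists of lowercase tokens)."""
    return [[tok for tok in doc if tok not in stop_words] for doc in docs]

docs = [["from", "wheres", "my", "thing", "subject", "what", "car", "is", "this"]]
filtered = remove_stopwords_tokens(docs)
print(filtered)
```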

### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

In [52]:
bigram = gensim.models.Phrases(word3, threshold=100)

In [58]:
bigram1 = gensim.models.phrases.Phraser(bigram)

In [59]:
print(bigram1[word3[0]])

['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting', 'host', 'rac_wam', 'umd_eduorganization', 'university', 'of', 'maryland_college', 'parklines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'sawthe', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'yearsof', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'youhave', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']


['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting_host', 'rac_wam_umd_edu', 'organization', 'university', 'of', 'maryland_college_park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']




#### make_bigrams( )

In [63]:
def make_bigrams(texts):
    return [bigram1[doc] for doc in texts]
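
The `threshold` passed to `Phrases` is compared against a score computed for each adjacent word pair. The sketch below approximates the default scoring formula from Mikolov et al., `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`; the helper name, the toy documents, and the way `vocab_size` is counted here are assumptions, and gensim's internal bookkeeping may differ in detail.

```python
from collections import Counter

def bigram_scores(docs, min_count=1):
    """Score adjacent word pairs; higher scores suggest a genuine phrase."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # assumption: distinct single words only
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

docs = [["front", "bumper", "was", "bent"],
        ["front", "bumper", "was", "new"],
        ["car", "was", "old"],
        ["door", "was", "small"]]
scores = bigram_scores(docs)
best = max(scores, key=scores.get)
print(best, scores[best])
```

A pair made of two words that mostly occur together scores higher than a pair involving a very common word, so raising the threshold keeps fewer, more distinctive phrases.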

In [66]:
from gensim.utils import simple_preprocess
import spacy

In [64]:
word3_stop_removed = remove_stopwords(word3)
word3_bigram1 = make_bigrams(word3_stop_removed)

### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [53]:
! python -m spacy download en_core_web_sm


In [68]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

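#### lemmatization( )

In [69]:
def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize tokenized documents, keeping only tokens whose POS tag is allowed.

    POS tag reference: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        # spaCy expects a string, so rejoin the tokens before tagging
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out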

In [71]:
data_lemmatized = lemmatization(word3_bigram1, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [119]:
print(data_lemmatized[:1])

[['where', 'thing', 'car', 'nntp_poste', 'host', 'parkline', 'wonder', 'could', 'enlighten', 'car', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'model', 'engine', 'yearsof', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


In [24]:
print(data_lemmatized[:1])

[['where', 'thing', 'car', 'nntp_poste', 'host', 'park', 'line', 'wonder', 'could', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]




### Create a Dictionary

In [150]:
dictionary = Dictionary(data_lemmatized)

### Create Corpus

In [151]:
texts = data_lemmatized
corpus = [dictionary.doc2bow(text) for text in texts]
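
Each corpus entry is a bag of words: a list of `(token_id, count)` pairs for one document. The sketch below reproduces that idea with the standard library; the helper names are hypothetical, and the ids gensim assigns to tokens will generally differ from the ones produced here.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

docs = [["car", "door", "car"], ["engine", "car"]]
token2id = build_vocab(docs)
bows = [to_bow(d, token2id) for d in docs]
print(token2id)
print(bows)
```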

### Filter low-frequency words

In [152]:
dictionary.filter_extremes(no_below=1, no_above=0.5)
dictionary[71900]

'willow'

In [153]:
corpus = [dictionary.doc2bow(text) for text in texts]
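
`filter_extremes` works on document frequency: `no_below` is an absolute number of documents and `no_above` is a fraction of the corpus. The sketch below imitates that rule on token lists (the function name and toy documents are hypothetical, and gensim's rounding of the upper bound may differ slightly). It also shows that `no_below=1`, as used above, keeps every rare token, so only the `no_above` bound has any effect.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5):
    """Keep tokens found in at least no_below docs and at most no_above * len(docs) docs."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    keep = {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[tok for tok in doc if tok in keep] for doc in docs]

docs = [["car", "engine", "would"],
        ["car", "door", "would"],
        ["game", "team", "would"],
        ["game", "score", "car"]]
loose = filter_extremes_sketch(docs, no_below=1, no_above=0.5)
strict = filter_extremes_sketch(docs, no_below=2, no_above=0.5)
print(loose)
print(strict)
```

Raising `no_below` is the usual way to drop tokens that appear in only one or two documents.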

### Create an index-to-word dictionary

In [154]:
temp = dictionary[0]  # accessing any item forces gensim to build the lazy id2token mapping
index2word = dictionary.id2token

### Build a News Topic Model

#### LdaModel
- **num_topics**: the number of topics, which you must choose beforehand
- **chunksize**: the number of documents used in each training chunk
- **alpha**: the prior on per-document topic distributions, which affects how sparse each document's topic mixture is; `'auto'` learns an asymmetric prior from the data
- **passes**: the total number of passes through the corpus during training

In [155]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=index2word,
                                           num_topics=10, 
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto')

### Print the keywords of the 10 topics

In [156]:
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.064*"team" + 0.054*"game" + 0.032*"play" + 0.031*"player" + 0.024*"win" + '
  '0.021*"hockey" + 0.018*"score" + 0.015*"wing" + 0.013*"playoff" + '
  '0.013*"goal"'),
 (1,
  '0.015*"line" + 0.015*"problem" + 0.014*"use" + 0.014*"work" + 0.013*"need" '
  '+ 0.013*"host" + 0.011*"thank" + 0.011*"also" + 0.010*"help" + 0.009*"get"'),
 (2,
  '0.014*"people" + 0.012*"say" + 0.011*"believe" + 0.010*"evidence" + '
  '0.009*"may" + 0.009*"reason" + 0.007*"law" + 0.007*"claim" + 0.007*"sense" '
  '+ 0.006*"mean"'),
 (3,
  '0.020*"would" + 0.014*"go" + 0.012*"think" + 0.011*"article" + 0.011*"be" + '
  '0.011*"say" + 0.010*"make" + 0.010*"know" + 0.010*"time" + 0.009*"see"'),
 (4,
  '0.027*"village" + 0.023*"family" + 0.019*"turkish" + 0.018*"community" + '
  '0.017*"occupy" + 0.014*"greek" + 0.013*"turk" + 0.013*"armenian" + '
  '0.010*"overall" + 0.010*"liar"'),
 (5,
  '0.029*"sale" + 0.023*"internet" + 0.022*"season" + 0.020*"break" + '
  '0.019*"fan" + 0.019*"tape" + 0.018*"compare"

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

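Model perplexity is a measurement of **how well** a **probability distribution** or probability model **predicts a sample**; lower perplexity indicates a better fit. Note that gensim's `log_perplexity` returns a per-word likelihood bound (a negative number), not the perplexity itself.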

In [167]:
print('Perplexity: ', lda_model.log_perplexity(corpus))

Perplexity:  -9.117856920027336
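
The gensim documentation describes the perplexity as `2^(-bound)`, where `bound` is the per-word value returned by `log_perplexity`. A short sketch of that conversion, using the number printed above:

```python
bound = -9.117856920027336  # value printed by lda_model.log_perplexity(corpus)

# Convert the per-word likelihood bound into a perplexity estimate
perplexity = 2 ** (-bound)
print(f"Perplexity estimate: {perplexity:.1f}")
```

Because this value was computed on the training corpus, it is best used to compare models trained on the same data rather than read as an absolute measure of quality.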


### Topic Coherence
Topic coherence measures score a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in that topic.

In [159]:
from gensim.models import CoherenceModel

In [166]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  0.5198316861037526


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [170]:
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis 3.3 and later

In [173]:
pyLDAvis.enable_notebook()



In [174]:
pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)

