# News Modeling

Topic modeling involves **extracting features from document terms** and using
mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** in various documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** in which each **document is assumed to have a combination of topics**, similar to a probabilistic latent semantic indexing model.

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand).
2. For each document D, go through each word W and compute, for each topic T:
    - **P(T | D)**, the proportion of words in D that are currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments.
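
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how gensim trains `LdaModel` (gensim uses online variational Bayes). The function name `lda_sketch`, the toy documents, and the smoothing constants `alpha` and `beta` are hypothetical; the constants are added only so that no probability is ever exactly zero.

```python
import random
from collections import Counter


def lda_sketch(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy version of the reassignment loop described above (assumed names)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign every word occurrence to one of the K topics.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables that make P(T | D) and P(W | T) cheap to compute.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current word out so only the *other* words count.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                # Steps 2 and 3: weight of topic t is proportional to
                # P(T | D) * P(W | T), both smoothed.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1

    return assignments, topic_word


toy_docs = [
    ["game", "team", "win"],
    ["team", "player", "game"],
    ["space", "orbit", "launch"],
    ["launch", "space", "nasa"],
]
toy_assignments, toy_topic_word = lda_sketch(toy_docs, num_topics=2)
print(toy_assignments)
```

On a corpus this small the result depends heavily on the random seed; the point is only to show where P(T | D) and P(W | T) enter the update.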

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
! pip install pyLDAvis gensim spacy





### Import the libraries

In [2]:
import nltk
# ! nltk.download('stopwords')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jlira\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# !pip install wasabi==0.9.1

In [4]:
import spacy
# If `import spacy` fails with the latest wasabi release (see explosion/wasabi#25),
# the suggested workaround is to downgrade wasabi:
# pip install wasabi==0.9.1


In [5]:
# from spacy import load
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from pprint import pprint  
# The pprint module provides a capability to “pretty-print” arbitrary Python data structures in a form which can 
# be used as input to the interpreter. 
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora, models
import gensim

### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

In [6]:
# ! wget https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

### Load the dataset

In [7]:
df = pd.read_json("newsgroups.json")
df.head(10)

Unnamed: 0,content,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space
5,From: dfo@vttoulu.tko.vtt.fi (Foxvog Douglas)\...,16,talk.politics.guns
6,From: bmdelane@quads.uchicago.edu (brian manni...,13,sci.med
7,From: bgrubb@dante.nmsu.edu (GRUBB)\nSubject: ...,3,comp.sys.ibm.pc.hardware
8,From: holmes7000@iscsvax.uni.edu\nSubject: WIn...,2,comp.os.ms-windows.misc
9,From: kerr@ux1.cso.uiuc.edu (Stan Kerr)\nSubje...,4,comp.sys.mac.hardware


In [8]:
df.shape

(11314, 3)

### Preprocess the data

### Email Removal

In [9]:
# Assign the result back instead of using inplace=True on a selected column,
# which recent pandas versions no longer apply to the original DataFrame.
df['content'] = df['content'].replace(to_replace=r'\S*@\S*\s?',
                                      value='',
                                      regex=True)
df['content']
# \S* : match as many non-space characters as possible
# @   : then an @
# \S* : then another run of non-space characters
# \s? : and optionally one whitespace character. The '?' makes the space optional, so an address
# at the end of a line is still matched; because '?' is greedy, a space is consumed whenever there is one.
# With only \S*@\S*, the words around a deleted address would be left separated by two spaces.
# Adding \s? removes one space together with each address.

# Other solutions:
# required_output=re.sub(r'[A-Za-z0-9]*@[A-Za-z]*\.?[A-Za-z0-9]*', "", text)
# ' '.join([i for i in inp.split() if '@' not in i])
# df['content'].replace(to_replace=r'[A-Za-z0-9]*@[A-Za-z]*\.?[A-Za-z0-9]*', value='', regex=True)


0        From: (where's my thing)\nSubject: WHAT car is...
1        From: (Guy Kuo)\nSubject: SI Clock Poll - Fina...
2        From: (Thomas E Willis)\nSubject: PB questions...
3        From: (Joe Green)\nSubject: Re: Weitek P9000 ?...
4        From: (Jonathan McDowell)\nSubject: Re: Shuttl...
                               ...                        
11309    From: (Jim Zisfein) \nSubject: Re: Migraines a...
11310    From: Subject: Screen Death: Mac Plus/512\nLin...
11311    From: (Will Estes)\nSubject: Mounting CPU Cool...
11312    From: (Steven Collins)\nSubject: Re: Sphere fr...
11313    From: (Kevin J. Gunning)\nSubject: stolen CBR9...
Name: content, Length: 11314, dtype: object
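
The pattern used above can be checked on a single string with the standard `re` module before applying it to the whole column. This is a small sketch; the helper name `strip_emails` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the cell above: anything glued to an '@', plus one optional space.
EMAIL_PATTERN = re.compile(r"\S*@\S*\s?")


def strip_emails(text: str) -> str:
    """Remove every whitespace-delimited chunk that contains an '@'."""
    return EMAIL_PATTERN.sub("", text)


sample = "From: lerxst@wam.umd.edu (where's my thing)"
print(strip_emails(sample))
```

Because `\S*` is greedy on both sides of the `@`, any punctuation attached to an address (for example surrounding angle brackets) is removed along with it.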

### Newline Removal

In [10]:
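# Note: replacing '\n' with '' glues the last word of one line to the first word of the next
# (see tokens such as 'eduorganization' and 'parklines' further down); value=' ' would keep them apart.
df['content'] = df['content'].replace(to_replace=r'\n',
                                      value='',
                                      regex=True)
df['content']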

0        From: (where's my thing)Subject: WHAT car is t...
1        From: (Guy Kuo)Subject: SI Clock Poll - Final ...
2        From: (Thomas E Willis)Subject: PB questions.....
3        From: (Joe Green)Subject: Re: Weitek P9000 ?Or...
4        From: (Jonathan McDowell)Subject: Re: Shuttle ...
                               ...                        
11309    From: (Jim Zisfein) Subject: Re: Migraines and...
11310    From: Subject: Screen Death: Mac Plus/512Lines...
11311    From: (Will Estes)Subject: Mounting CPU Cooler...
11312    From: (Steven Collins)Subject: Re: Sphere from...
11313    From: (Kevin J. Gunning)Subject: stolen CBR900...
Name: content, Length: 11314, dtype: object

### Single Quotes Removal

In [11]:
df['content'] = df['content'].replace(to_replace=r"'",
                                      value='',
                                      regex=True)
df['content']

0        From: (wheres my thing)Subject: WHAT car is th...
1        From: (Guy Kuo)Subject: SI Clock Poll - Final ...
2        From: (Thomas E Willis)Subject: PB questions.....
3        From: (Joe Green)Subject: Re: Weitek P9000 ?Or...
4        From: (Jonathan McDowell)Subject: Re: Shuttle ...
                               ...                        
11309    From: (Jim Zisfein) Subject: Re: Migraines and...
11310    From: Subject: Screen Death: Mac Plus/512Lines...
11311    From: (Will Estes)Subject: Mounting CPU Cooler...
11312    From: (Steven Collins)Subject: Re: Sphere from...
11313    From: (Kevin J. Gunning)Subject: stolen CBR900...
Name: content, Length: 11314, dtype: object

### Tokenize
- Create **sent_to_words()** 
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function
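
To show what a generator buys here, the sketch below tokenizes documents lazily, one at a time, using only the standard library. It only approximates `gensim.utils.simple_preprocess` (lowercasing, letters-only tokens, length limits); the name `iter_tokens` and the regular expression are assumptions, not gensim's implementation.

```python
import re
from typing import Iterable, Iterator, List

# Runs of letters only (no digits, no underscores): an approximation.
LETTERS = re.compile(r"[^\W\d_]+")


def iter_tokens(documents: Iterable[str],
                min_len: int = 2,
                max_len: int = 15) -> Iterator[List[str]]:
    """Yield one list of lowercase tokens per document, lazily."""
    for document in documents:
        yield [
            token
            for token in LETTERS.findall(document.lower())
            if min_len <= len(token) <= max_len
        ]


raw_docs = [
    "From: (wheres my thing) Subject: WHAT car is this!?",
    "I saw a 2-door car",
]
for tokens in iter_tokens(raw_docs):
    print(tokens)
```

Nothing is tokenized until the loop asks for the next document, so memory use stays flat even for a large corpus; wrap the call in `list(...)` when a full list is needed.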

In [12]:
from gensim.utils import simple_preprocess

In [13]:
simple_preprocess(df['content'][0], 
                  deacc=False, 
                  min_len=2, 
                  max_len=15)
# Convert a document into a list of lowercase tokens, ignoring tokens that are too short or too long.

['from',
 'wheres',
 'my',
 'thing',
 'subject',
 'what',
 'car',
 'is',
 'this',
 'nntp',
 'posting',
 'host',
 'rac',
 'wam',
 'umd',
 'eduorganization',
 'university',
 'of',
 'maryland',
 'college',
 'parklines',
 'was',
 'wondering',
 'if',
 'anyone',
 'out',
 'there',
 'could',
 'enlighten',
 'me',
 'on',
 'this',
 'car',
 'sawthe',
 'other',
 'day',
 'it',
 'was',
 'door',
 'sports',
 'car',
 'looked',
 'to',
 'be',
 'from',
 'the',
 'late',
 'early',
 'it',
 'was',
 'called',
 'bricklin',
 'the',
 'doors',
 'were',
 'really',
 'small',
 'in',
 'addition',
 'the',
 'front',
 'bumper',
 'was',
 'separate',
 'from',
 'the',
 'rest',
 'of',
 'the',
 'body',
 'this',
 'is',
 'all',
 'know',
 'if',
 'anyone',
 'can',
 'tellme',
 'model',
 'name',
 'engine',
 'specs',
 'yearsof',
 'production',
 'where',
 'this',
 'car',
 'is',
 'made',
 'history',
 'or',
 'whatever',
 'info',
 'youhave',
 'on',
 'this',
 'funky',
 'looking',
 'car',
 'please',
 'mail',
 'thanks',
 'il',
 'brought',
 

In [14]:
# sent_to_words() generator: lazily yields one token list per document
def sent_to_words(documents):
    for sentence in documents:
        yield simple_preprocess(sentence,
                                deacc=False,
                                min_len=2,
                                max_len=15)

# From the command prompt, launch Jupyter like this:
# jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
# to avoid the following error:
# IOPub data rate exceeded.
# The notebook server will temporarily stop sending output
# to the client in order to avoid crashing it.

In [15]:
# print(next(sent_to_words(df['content'])))  # tokens of the first document only

In [16]:
# Materialize the generator into a list of token lists (one per document)
text = list(sent_to_words(df['content']))
text

[['from',
  'wheres',
  'my',
  'thing',
  'subject',
  'what',
  'car',
  'is',
  'this',
  'nntp',
  'posting',
  'host',
  'rac',
  'wam',
  'umd',
  'eduorganization',
  'university',
  'of',
  'maryland',
  'college',
  'parklines',
  'was',
  'wondering',
  'if',
  'anyone',
  'out',
  'there',
  'could',
  'enlighten',
  'me',
  'on',
  'this',
  'car',
  'sawthe',
  'other',
  'day',
  'it',
  'was',
  'door',
  'sports',
  'car',
  'looked',
  'to',
  'be',
  'from',
  'the',
  'late',
  'early',
  'it',
  'was',
  'called',
  'bricklin',
  'the',
  'doors',
  'were',
  'really',
  'small',
  'in',
  'addition',
  'the',
  'front',
  'bumper',
  'was',
  'separate',
  'from',
  'the',
  'rest',
  'of',
  'the',
  'body',
  'this',
  'is',
  'all',
  'know',
  'if',
  'anyone',
  'can',
  'tellme',
  'model',
  'name',
  'engine',
  'specs',
  'yearsof',
  'production',
  'where',
  'this',
  'car',
  'is',
  'made',
  'history',
  'or',
  'whatever',
  'info',
  'youhave',
  '

In [17]:
len(text)

11314

### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [18]:
stop_words = stopwords.words('english')
print("Size of stop words (original from library): {}".format(len(stop_words)))
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
print("Size of stop words (after extending it): {}".format(len(stop_words)))

Size of stop words (original from library): 179
Size of stop words (after extending it): 184


In [19]:
print("Size of stop words after (dropping duplicates): {}".format(len(set(stop_words))))

Size of stop words after (dropping duplicates): 182


**remove_stopwords( )** function

In [20]:
# remove stop words from tokens
stop_words_set = set(stop_words)  # build the set once; membership tests are then O(1)

def remove_stopwords(texts):
    # keep every token that is not a stop word, document by document
    return [[token for token in doc if token not in stop_words_set] for doc in texts]

In [21]:
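%time
# Note: `%time` on a line by itself times an empty statement, hence the 0 ns reported below.
# Put `%%time` on the first line of a cell to time the whole cell; the same applies to the
# other `%time` cells in this notebook.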
tokens_without_stopwords = remove_stopwords(text)
tokens_without_stopwords

CPU times: total: 0 ns
Wall time: 0 ns


[['wheres',
  'thing',
  'car',
  'nntp',
  'posting',
  'host',
  'rac',
  'wam',
  'umd',
  'eduorganization',
  'university',
  'maryland',
  'college',
  'parklines',
  'wondering',
  'anyone',
  'could',
  'enlighten',
  'car',
  'sawthe',
  'day',
  'door',
  'sports',
  'car',
  'looked',
  'late',
  'early',
  'called',
  'bricklin',
  'doors',
  'really',
  'small',
  'addition',
  'front',
  'bumper',
  'separate',
  'rest',
  'body',
  'know',
  'anyone',
  'tellme',
  'model',
  'name',
  'engine',
  'specs',
  'yearsof',
  'production',
  'car',
  'made',
  'history',
  'whatever',
  'info',
  'youhave',
  'funky',
  'looking',
  'car',
  'please',
  'mail',
  'thanks',
  'il',
  'brought',
  'neighborhood',
  'lerxst'],
 ['guy',
  'kuo',
  'si',
  'clock',
  'poll',
  'final',
  'callsummary',
  'final',
  'call',
  'si',
  'clock',
  'reportskeywords',
  'si',
  'acceleration',
  'clock',
  'upgradearticle',
  'shelley',
  'qvfo',
  'innc',
  'sorganization',
  'universi

In [22]:
len(tokens_without_stopwords)

11314

In [23]:
# tokens_without_stopwords[0]

In [24]:
import string
# string_text= string_text.translate(str.maketrans('','',string.punctuation))

### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

In [25]:
from gensim.models.phrases import Phrases, Phraser, ENGLISH_CONNECTOR_WORDS

In [26]:
#Building Bigram & Trigram Models

bigram = gensim.models.Phrases(tokens_without_stopwords, 
                               min_count=5, 
                               threshold=100)
trigram = gensim.models.Phrases(bigram[tokens_without_stopwords], 
                                threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

In [27]:
bigram

<gensim.models.phrases.Phrases at 0x239abf55a30>

In [28]:
# sent = [row.split() for row in df['content']]

In [29]:
# sent[0]

In [30]:
# models.phrases – Phrase (collocation) detection
# Automatically detect common phrases – aka multi-word expressions, word n-gram collocations – from a stream of sentences.
phrases = Phrases(tokens_without_stopwords, 
                  min_count=30, 
                  progress_per=10000,
                  threshold=100,
                  connector_words=ENGLISH_CONNECTOR_WORDS)

In [31]:
# bigram = Phraser(phrases)

In [32]:
print(phrases)

Phrases<1143359 vocab, min_count=30, threshold=100, max_vocab_size=40000000>


In [33]:
for phrase, score in phrases.find_phrases(tokens_without_stopwords).items():
    print(phrase, score)

nntp_posting 206.29230713270192
maryland_college 157.70468965517242
distribution_worldnntp 581.5026397137995
posting_host 199.92121277986243
comx_newsreader 558.0763880414887
tin_version 586.8596016972352
second_amendment 304.02966608062366
keep_bear 185.1747513744441
investors_packet 549.691826923077
ibm_pc 107.4425837216011
university_illinois 155.91664069196602
communications_services 139.6783734371314
youve_got 141.2470666033641
hewlett_packard 5603.731476676469
newsreader_tin 1770.8193082085697
version_pl 236.7470852463633
space_station 114.71466697313312
software_vax 370.7913835970697
vms_vnews 3791.7517857142857
space_shuttle 140.40971384172852
new_york 256.7095843949045
political_atheists 291.0131412224934
keith_ryan 285.9640822096564
edux_newsreader 597.2338704571356
hope_helps 447.9025923150293
years_ago 224.1294432828019
san_jose 1160.3683011955743
mountain_view 226.17880692808157
greatly_appreciated 974.6944773906597
tcp_ip 2431.8163771712157
keith_allan 1355.7616600790516
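
The scores printed above can be related to raw counts. The sketch below assumes the default scorer of `Phrases`, which follows the formula from Mikolov et al.: `(count(a, b) - min_count) / count(a) / count(b) * vocab_size`. The function name `phrase_score` and the numbers are made up for illustration.

```python
def phrase_score(count_a: int, count_b: int, count_ab: int,
                 vocab_size: int, min_count: int) -> float:
    """Collocation score of the pair (a, b) under the assumed default formula."""
    return (count_ab - min_count) / count_a / count_b * vocab_size


# Hypothetical counts: the pair occurs 6 times, its words 8 and 4 times.
score = phrase_score(count_a=8, count_b=4, count_ab=6, vocab_size=64, min_count=2)
print(score)
```

A pair is kept as a phrase when its score exceeds `threshold`. The score grows with the vocabulary size, which helps explain why values in the hundreds or thousands appear on a corpus with over a million vocabulary entries.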


In [34]:
# phrases[tokens_without_stopwords]

In [35]:
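# find_phrases expects an iterable of token lists. A single token list is iterated
# string by string, so each "sentence" is one word split into characters, which is
# likely why the result is empty. Pass [tokens_without_stopwords[1]] to query one document.
phrases.find_phrases(tokens_without_stopwords[1])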

{}

In [36]:
# # // to create the bigrams
# bigram_model = Phrases(tokens_without_stopwords)

# # // apply the trained model to a sentence
# for unigram_sentence in tokens_without_stopwords:                
#             bigram_sentence = u' '.join(bigram_model[tokens_without_stopwords])

# # // get a trigram model out of the bigram
# trigram_model = Phrases(bigram_sentences)

#### make_bigrams( )

In [37]:
def make_bigrams(texts):
    return None

In [38]:
#function to create bigrams

def create_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

In [39]:
#function to create trigrams

def create_trigrams(texts):
    # apply the bigram model first, then the trigram model, and return the result
    return [trigram_mod[bigram_mod[doc]] for doc in texts]
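
What applying a phrase model to a document amounts to can be sketched without gensim: adjacent tokens that form a known phrase are joined with an underscore. The function `merge_bigrams` and the phrase set are hypothetical, and gensim's own procedure may differ in detail.

```python
from typing import List, Set, Tuple


def merge_bigrams(tokens: List[str],
                  phrases: Set[Tuple[str, str]],
                  delimiter: str = "_") -> List[str]:
    """Greedy left-to-right merge of adjacent tokens that form a known phrase."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # both tokens are consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


known_phrases = {("nntp", "posting"), ("posting", "host"), ("new", "york")}
print(merge_bigrams(["nntp", "posting", "host", "rac"], known_phrases))
```

Running a second model on the merged output is what allows trigrams to form, since a merged bigram can then pair with the token that follows it.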

### Lemmatization
- Use spaCy
    - Download the spaCy English model `en_core_web_sm` (if you have not done that before)
    - Load the spaCy model

In [40]:
# ! python -m spacy download en_core_web_sm

In [41]:
nlp = spacy.load('en_core_web_sm', 
                 disable=['parser', 'ner'])

In [42]:
#function for lemmatization

def lemmatize(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_op = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_op.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_op

In [43]:
%time
data_bigrams = create_bigrams(tokens_without_stopwords)

data_lemmatized = lemmatize(data_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

CPU times: total: 0 ns
Wall time: 0 ns


#### lemmatization( )

In [44]:
# print(data_lemmatized[:1])

### Create a Dictionary

In [45]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

### Create Corpus

In [46]:
# Create Corpus
texts = data_lemmatized

### Filter low-frequency words

In [47]:
id2word.filter_extremes(no_below=10, no_above=0.5)
# convert tokenized documents into a document-term matrix
# corpus = [dictionary.doc2bow(text) for text in data_lemmatized]
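
The effect of `no_below` and `no_above` can be illustrated with document frequencies computed by hand. This is a sketch of the filtering rule only; the function name `filter_extremes_sketch` and the toy documents are assumptions, and gensim additionally caps the vocabulary size with `keep_n`.

```python
from collections import Counter
from typing import List, Set


def filter_extremes_sketch(docs: List[List[str]],
                           no_below: int,
                           no_above: float) -> Set[str]:
    """Keep tokens found in at least `no_below` documents and
    in at most a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    n_docs = len(docs)
    return {
        token
        for token, freq in doc_freq.items()
        if freq >= no_below and freq / n_docs <= no_above
    }


small_docs = [
    ["the", "car", "engine"],
    ["the", "car", "door"],
    ["the", "space", "shuttle"],
    ["the", "orbit"],
]
print(filter_extremes_sketch(small_docs, no_below=2, no_above=0.5))
```

Here "the" is dropped for appearing in every document and the one-off words are dropped for being too rare, which mirrors the intent of `no_below=10, no_above=0.5` in the cell above.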

In [48]:
# Term Document Frequency
# convert tokenized documents into a document-term matrix
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 2), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1)]]
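
Each pair in the output above is `(token id, count within the document)`. A stdlib sketch of the same bag-of-words idea follows; the helper names `build_vocab` and `doc_to_bow` are hypothetical, and ids are assigned in first-seen order here, which is an assumption and not necessarily the order gensim uses.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_vocab(docs: List[List[str]]) -> Dict[str, int]:
    """Map every distinct token to an integer id (first-seen order)."""
    token2id: Dict[str, int] = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc: List[str], token2id: Dict[str, int]) -> List[Tuple[int, int]]:
    """Return sorted (token id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


bow_docs = [["car", "engine", "car"], ["engine", "door"]]
vocab = build_vocab(bow_docs)
print(doc_to_bow(bow_docs[0], vocab))
```

Tokens removed from the dictionary by filtering simply disappear from the bag-of-words vectors, because they no longer have an id.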


### Create an index-to-word dictionary

In [49]:
# temp = dictionary[0]  # This is only to "load" the dictionary.
# id2word = dictionary.id2token

In [50]:
# import gensim.corpora as corpora

### Build a News Topic Model

#### LdaModel
- **num_topics** : the number of topics, which must be chosen beforehand
- **chunksize** : the number of documents used in each training chunk
- **alpha** : hyperparameter (prior) that controls the sparsity of the document-topic distributions
- **passes** : total number of training passes over the corpus

In [51]:
%time
ldamodel = LdaModel(corpus, 
                    num_topics=15, 
                    id2word = id2word, 
                    passes=20)

CPU times: total: 0 ns
Wall time: 0 ns


### Print the top 10 keywords of each topic

In [52]:
pprint(ldamodel.top_topics(corpus,topn=10))

[([(0.017876577, 'go'),
   (0.016535735, 'say'),
   (0.01046355, 'get'),
   (0.0096915085, 'see'),
   (0.009458377, 'come'),
   (0.009042963, 'people'),
   (0.00839252, 'time'),
   (0.0080830455, 'take'),
   (0.007660046, 'think'),
   (0.00684271, 'know')],
  -0.9654992793696301),
 ([(0.026778689, 'get'),
   (0.025136758, 'know'),
   (0.02427268, 'article'),
   (0.022452034, 'm'),
   (0.021416303, 'think'),
   (0.015935907, 'go'),
   (0.014099669, 'nntp_poste'),
   (0.013910925, 's'),
   (0.013267949, 'organization'),
   (0.013072959, 'want')],
  -1.21328858707001),
 ([(0.015135925, 'say'),
   (0.011142106, 'think'),
   (0.010849505, 'believe'),
   (0.009744869, 'exist'),
   (0.009475797, 'mean'),
   (0.009234496, 'point'),
   (0.008665446, 'evidence'),
   (0.008034923, 'claim'),
   (0.00784586, 'well'),
   (0.0077291504, 'article')],
  -1.3416099023398647),
 ([(0.05369118, 'game'),
   (0.04481111, 'team'),
   (0.030245317, 'play'),
   (0.02734093, 'player'),
   (0.02223531, 'win'),
  

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

Model perplexity is a measurement of **how well** a **probability distribution** or probability model **predicts a sample**

In [53]:
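# log_perplexity returns the per-word likelihood bound (a log-scale value), not the perplexity itself
perplexity = ldamodel.log_perplexity(corpus)
print(perplexity)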

-7.2150138323821
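
The negative number above is a per-word log-likelihood bound. According to the gensim documentation, the perplexity is obtained as `2 ** (-bound)`. A small sketch using the value printed above (the variable names are made up):

```python
# Per-word likelihood bound as printed by log_perplexity above.
per_word_bound = -7.2150138323821

# Convert the bound to a perplexity; lower perplexity means a better fit.
model_perplexity = 2 ** (-per_word_bound)
print(round(model_perplexity, 2))
```

Perplexity measured on the training corpus mainly helps to compare models trained on the same data; it says little on its own about how interpretable the topics are.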


### Topic Coherence
Topic Coherence measures score a single topic by measuring the **degree of semantic similarity** between **high scoring words** in the topic.

In [54]:
%time
from gensim.models import CoherenceModel
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=ldamodel, 
                                     texts=data_lemmatized, 
                                     dictionary=id2word, 
                                     coherence='c_v')


CPU times: total: 0 ns
Wall time: 0 ns


In [55]:
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.5560462285142508
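
The `c_v` measure used above is fairly involved. To illustrate the general idea of coherence, the sketch below swaps in the simpler UMass measure, which only needs document co-occurrence counts: words that tend to appear in the same documents score higher. The function `umass_coherence` and the toy corpus are assumptions, and gensim's own `u_mass` implementation may aggregate the pair scores differently.

```python
import math
from typing import List


def umass_coherence(top_words: List[str], docs: List[List[str]]) -> float:
    """Sum of log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of top words.

    Every word in `top_words` must occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words: str) -> int:
        # number of documents that contain all the given words
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            joint = doc_count(top_words[i], top_words[j])
            score += math.log((joint + 1) / doc_count(top_words[j]))
    return score


coherence_docs = [
    ["game", "team", "win"],
    ["game", "team", "player"],
    ["space", "orbit", "launch"],
    ["space", "launch", "nasa"],
]
coherent = umass_coherence(["game", "team"], coherence_docs)
incoherent = umass_coherence(["game", "space"], coherence_docs)
print(coherent > incoherent)
```

Comparing such scores across models trained with different `num_topics` values is a common way to choose the number of topics.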


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [56]:

import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

In [57]:
pyLDAvis.gensim_models.prepare(ldamodel, corpus, id2word)

