# News Modeling

Topic modeling involves **extracting features from document terms** and using
mathematical structures and frameworks like matrix factorization and SVD to generate **clusters or groups of terms** that are distinguishable from each other and these clusters of words form topics or concepts

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data

These concepts can be used to interpret the main **themes** of a corpus and also make **semantic connections among words that co-occur together** frequently in various documents

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
The latent Dirichlet allocation (LDA) technique is a **generative probabilistic model** where each **document is assumed to have a combination of topics** similar to a probabilistic latent semantic indexing model

In simple words, the idea behind LDA is that of two folds:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**

### LDA Algorithm

- 1. For each document, **randomly initialize each word to one of the K topics** (k is chosen beforehand)
- 2. For each document D, go through each word w and compute:
    - **P(T |D)** , which is a proportion of words in D assigned to topic T
    - **P(W |T )** , which is a proportion of assignments to topic T over all documents having the word W
- **Reassign word W with topic T** with probability P(T |D)´ P(W |T ) considering all other words and their topic assignments

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
!pip install pyLDAvis gensim spacy

### Import the libraries

In [2]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
!pwd
import os
os.chdir('/content/drive/My Drive/Colab Notebooks')
!pwd

/content
/content/drive/My Drive/Colab Notebooks


### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

In [5]:
import json

# Open the file
with open('newsgroups.json', 'r') as file:
    # Load the data
    data = json.load(file)


### Load the dataset

In [None]:
data

### Preprocess the data

### Email Removal

In [6]:
import re

# Define a regular expression to match email addresses
email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'

# Access the content dictionary
content_dict = data['content']

# Iterate through the items in content_dict and remove emails
for key, value in content_dict.items():
    # Remove emails from each value
    content_dict[key] = re.sub(email_regex, '', value)

### Newline Removal

In [8]:
# Access the content dictionary
content_dict = data['content']

# Iterate through the items in content_dict and remove newlines
for key, value in content_dict.items():
    # Remove newlines from each value
    content_dict[key] = value.replace('\n', '')

### Single Quotes Removal

In [9]:
# Access the content dictionary
content_dict = data['content']

# Iterate through the items in content_dict and remove single quotes
for key, value in content_dict.items():
    # Remove single quotes from each value
    content_dict[key] = value.replace("'", "")

### Tokenize
- Create **sent_to_words()**
    - Use **gensim.utils.simple_preprocess**
    - Use **generator** instead of an usual function

In [10]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations

# Access the content dictionary
content_dict = data['content']

# Create a list of sentences from the content dictionary
sentences = list(content_dict.values())

# Use the generator function to tokenize the sentences
words = list(sent_to_words(sentences))

### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [11]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
from nltk.corpus import stopwords

# Load the NLTK stop words
stop_words = stopwords.words('english')

# Extend the stop words with your own list
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

#### remove_stopwords( )

In [14]:
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

# Access the content dictionary
content_dict = data['content']

# Create a list of sentences from the content dictionary
sentences = list(content_dict.values())

# Use the generator function to tokenize the sentences
words = list(sent_to_words(sentences))

In [15]:
words = remove_stopwords(words)

### Bigrams
- Use **gensim.models.Phrases**
- 100 as threshold

In [16]:
from gensim.models import Phrases

# Build the bigram model
bigram = Phrases(words, min_count=1, threshold=100)  # higher threshold fewer phrases

# Faster way to get a sentence clubbed as a bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)


#### make_bigrams( )

In [17]:
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

# Now you can use the make_bigrams function to add bigrams to your tokenized words
words = make_bigrams(words)

['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting_host', 'rac_wam_umd_edu', 'organization', 'university', 'of', 'maryland_college_park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']


  and should_run_async(code)


### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [None]:
#! python -m spacy download en
import spacy

In [23]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

#### lemmatizaton( )

In [24]:
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [26]:
data_lemmatized = lemmatization(words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [27]:
print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntp_poste', 'wam_umd', 'eduorganization_university', 'maryland_college', 'parkline', 'enlighten', 'car', 'sawthe', 'day', 'door_sport', 'car', 'look', 'late_early', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'name', 'engine', 'spec', 'yearsof', 'production', 'car', 'make', 'history', 'info', 'youhave', 'funky_looking', 'car', 'mail', 'thank', 'lerxst']]


### Create a Dictionary

In [28]:
from gensim.corpora import Dictionary

# Create a Dictionary from the lemmatized data
id2word = Dictionary(data_lemmatized)

### Create Corpus

In [29]:
# Create Corpus
corpus = [id2word.doc2bow(text) for text in data_lemmatized]

### Filter low-frequency words

In [30]:
# Filter out words that occur less than 20 documents, or more than 50% of the documents.
id2word.filter_extremes(no_below=20, no_above=0.5)

# Create Corpus
corpus = [id2word.doc2bow(text) for text in data_lemmatized]

### Create Index 2 word dictionary

In [31]:
# Create index to word dictionary
index2word = {v: k for k, v in id2word.token2id.items()}

### Build a News Topic Model

#### LdaModel
- **num_topics** : this is the number of topics you need to define beforehand
- **chunksize** : the number of documents to be used in each training chunk
- **alpha** : this is the hyperparameters that affect the sparsity of the topics
- **passess** : total number of training assess

In [32]:
from gensim.models import LdaModel

# Define the number of topics
num_topics = 10

# Define the number of documents to be used in each training chunk
chunksize = 100

# Define the alpha parameter
alpha = 'auto'

# Define the number of passes
passes = 20

# Build the LDA model
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics,
                     chunksize=chunksize, alpha=alpha, passes=passes)

### Print the Keyword in the 10 topics

In [34]:
topics = lda_model.print_topics()
for topic in topics:
    print(topic)

(0, '0.025*"believe" + 0.022*"reason" + 0.022*"evidence" + 0.018*"say" + 0.016*"people" + 0.016*"claim" + 0.013*"question" + 0.013*"exist" + 0.013*"mean" + 0.012*"man"')
(1, '0.017*"also" + 0.016*"include" + 0.015*"number" + 0.015*"information" + 0.014*"send" + 0.013*"new" + 0.011*"list" + 0.011*"available" + 0.010*"source" + 0.010*"group"')
(2, '0.049*"system" + 0.044*"power" + 0.025*"value" + 0.024*"light" + 0.017*"replace" + 0.017*"black" + 0.015*"picture" + 0.015*"remove" + 0.015*"white" + 0.015*"unit"')
(3, '0.069*"space" + 0.035*"sale" + 0.035*"player" + 0.020*"item" + 0.020*"tape" + 0.017*"com" + 0.017*"launch" + 0.014*"project" + 0.014*"orbit" + 0.013*"build"')
(4, '0.799*"ax" + 0.055*"max" + 0.013*"_" + 0.008*"responsible" + 0.007*"door" + 0.006*"motto" + 0.006*"tm" + 0.005*"pm" + 0.004*"remind" + 0.004*"trace"')
(5, '0.034*"year" + 0.023*"good" + 0.018*"team" + 0.017*"game" + 0.015*"first" + 0.015*"point" + 0.013*"wrong" + 0.012*"play" + 0.011*"word" + 0.010*"well"')
(6, '0.0

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

Model perplexity is a measurement of **how well** a **probability distribution** or probability model **predicts a sample**

In [35]:
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # a measure of how good the model is. lower the better.



Perplexity:  -6.68570305831323


### Topic Coherence
Topic Coherence measures score a single topic by measuring the **degree of semantic similarity** between **high scoring words** in the topic.

In [36]:
from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.5223305884342762


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [37]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Prepare the visualization
vis = gensimvis.prepare(lda_model, corpus, id2word)

# Display the visualization
pyLDAvis.display(vis)