# News Modeling

Topic modeling involves **extracting features from document terms** and using mathematical techniques such as matrix factorization and singular value decomposition (SVD) to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These concepts can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that frequently co-occur** across documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
Latent Dirichlet allocation (LDA) is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to the probabilistic latent semantic indexing (pLSI) model.

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
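
To make these two distributions concrete, here is a minimal sketch of the generative story. The topic names, vocabularies, and probabilities are made-up assumptions for illustration, not values learned from the dataset, and `generate_document` is a hypothetical helper.

```python
import random

# Hypothetical topics: each topic is a probability distribution over words.
topic_word = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},
    "space": {"orbit": 0.5, "launch": 0.3, "moon": 0.2},
}

# Hypothetical document: a probability distribution over topics.
doc_topic = {"autos": 0.8, "space": 0.2}


def generate_document(doc_topic, topic_word, n_words, seed=0):
    """Sample words by first drawing a topic, then a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        # Draw a topic according to the document's topic distribution
        topic = rng.choices(list(doc_topic), weights=list(doc_topic.values()))[0]
        # Draw a word according to that topic's word distribution
        dist = topic_word[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words


sample_doc = generate_document(doc_topic, topic_word, n_words=10)
print(sample_doc)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the two sets of distributions.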

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand).
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D currently assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments.

Steps 2 and 3 are repeated many times until the assignments stabilize.
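
The steps above can be sketched as a toy Gibbs-style sampler in plain Python. This is an illustrative sketch, not what Gensim runs internally: the function name `gibbs_lda`, the smoothing constants `alpha` and `beta`, and the toy documents are all assumptions added for the example.

```python
import random
from collections import defaultdict


def gibbs_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler that follows the three steps described above."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]                # words in D assigned to T
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # times W is assigned to T
    topic_total = [0] * num_topics                              # all words assigned to T
    assignments = []

    # Step 1: give every word a random initial topic
    for d, doc in enumerate(docs):
        topics = [rng.randrange(num_topics) for _ in doc]
        assignments.append(topics)
        for w, t in zip(doc, topics):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment so only the *other* words count
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1

                # Step 2: smoothed, unnormalized P(T|D) * P(W|T) for each topic
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]

                # Step 3: reassign the word by sampling from those weights
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    return assignments, doc_topic


toy_docs = [
    ["car", "engine", "car", "door"],
    ["orbit", "moon", "launch", "orbit"],
    ["car", "moon"],
]
toy_assignments, toy_doc_topic = gibbs_lda(toy_docs, num_topics=2, iterations=20)
print(toy_doc_topic)  # per-document counts of words assigned to each topic
```

The smoothing terms keep every topic possible for every word, so a topic that currently has no words in a document can still be sampled.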

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [6]:
! pip install pyLDAvis gensim spacy

### Import the libraries

In [7]:
import numpy as np
import pandas as pd
import json
import re
import nltk
from nltk.corpus import stopwords
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import requests

### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics
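
One possible way to fetch the file is sketched below using only the standard library (`requests`, imported earlier, would work equally well). The helper name `download_dataset` and the default destination `newsgroups.json` are assumptions chosen to match the load step that follows.

```python
import urllib.request
from pathlib import Path

DATASET_URL = "https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json"


def download_dataset(url=DATASET_URL, dest="newsgroups.json"):
    """Download the file at `url` to `dest` unless it already exists."""
    dest_path = Path(dest)
    if not dest_path.exists():
        # Read the whole response and write it to disk in one go
        with urllib.request.urlopen(url) as response:
            dest_path.write_bytes(response.read())
    return dest_path
```

Calling `download_dataset()` once before loading places `newsgroups.json` in the working directory; later calls reuse the existing file.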

### Load the dataset

In [15]:


with open('newsgroups.json', 'r') as f:
    data = json.load(f)

print(type(data))  
print(data.keys() if isinstance(data, dict) else "Not a dictionary")  

<class 'dict'>
dict_keys(['content', 'target', 'target_names'])


In [17]:
with open('newsgroups.json', 'r') as f:
    data = json.load(f)

print(f"Keys in data: {data.keys()}")
if isinstance(data["content"], list):
    data_text = data["content"]
elif isinstance(data["content"], dict):
    data_text = list(data["content"].values())
else:
    raise ValueError("Unexpected format for 'content' key")

print(f"Number of documents: {len(data_text)}")
print(data_text[0][:500])

Keys in data: dict_keys(['content', 'target', 'target_names'])
Number of documents: 11314
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a m


### Preprocess the data

### Email Removal

In [18]:
def remove_emails(texts):
    return [re.sub(r'\S*@\S*\s?', '', text) for text in texts]

data_text = remove_emails(data_text)

### Newline Removal

In [19]:
def remove_newlines(texts):
    return [re.sub(r'\n', ' ', text) for text in texts]

data_text = remove_newlines(data_text)

### Single Quotes Removal

In [20]:
def remove_single_quotes(texts):
    return [re.sub(r"\'", "", text) for text in texts]

data_text = remove_single_quotes(data_text)
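
To see what the three cleaning steps do, the sketch below applies the same regular expressions to a single made-up string modeled on the first post (the variable names are illustrative).

```python
import re

sample = "From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?"

step1 = re.sub(r'\S*@\S*\s?', '', sample)  # drop e-mail addresses and one trailing space
step2 = re.sub(r'\n', ' ', step1)          # turn newlines into spaces
step3 = re.sub(r"\'", "", step2)           # drop single quotes

print(step3)
# From: (wheres my thing) Subject: WHAT car is this!?
```

Removing the quote is why the token `wheres` appears later instead of `where` and `s`.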

### Tokenize
- Create **sent_to_words()**
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [21]:
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases, tokenizes, and drops punctuation and very short or very long tokens
        yield simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks

data_words = list(sent_to_words(data_text))
print(data_words[0][:30])  

['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could']


### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

In [22]:
nltk.download('stopwords')
# A set gives constant-time membership checks
stop_words = set(stopwords.words('english'))
stop_words.update(['from', 'subject', 're', 'edu', 'use'])

def remove_stopwords(texts):
    # texts is already tokenized, so filter the tokens directly
    return [[word for word in doc if word not in stop_words] for doc in texts]

data_words_nostops = remove_stopwords(data_words)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/omisesan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.



### Bigrams
- Use **gensim.models.Phrases**
- Set `min_count=2` and `threshold=10` (a higher threshold produces fewer phrases)

In [32]:
bigram = gensim.models.Phrases(data_words_nostops, min_count=2, threshold=10)  
bigram_mod = gensim.models.phrases.Phraser(bigram)

def make_bigrams(texts):
    # Merge detected word pairs into single tokens such as 'front_bumper'
    return [bigram_mod[doc] for doc in texts]

data_words_bigrams = make_bigrams(data_words_nostops)

print(data_words_bigrams[0][:100])

['wheres', 'thing', 'car', 'nntp_posting', 'host_rac', 'wam_umd', 'organization_university', 'maryland_college', 'park_lines', 'wondering_anyone', 'could_enlighten', 'car', 'saw', 'day', 'door_sports', 'car', 'looked', 'late_early', 'called', 'bricklin', 'doors', 'really', 'small', 'addition', 'front_bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky_looking', 'car', 'please_mail', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']
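
The role of `min_count` and `threshold` can be illustrated with a simplified, plain-Python version of the kind of scoring used for phrase detection: a pair scores highly when it occurs together more often than its individual word counts would suggest. The function `score_bigrams`, the toy documents, and the use of the number of distinct words as the vocabulary size are assumptions for this sketch; Gensim's own bookkeeping may differ in detail.

```python
from collections import Counter


def score_bigrams(docs, min_count=2):
    """Score each adjacent word pair: (count(a, b) - min_count) * N / (count(a) * count(b))."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # N: number of distinct words (a simplification)
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }


toy_docs = [
    ["front", "bumper", "rust"],
    ["front", "bumper", "dent"],
    ["front", "bumper", "paint"],
    ["paint", "rust", "dent"],
]
scores = score_bigrams(toy_docs, min_count=2)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

Only pairs whose score exceeds the threshold would be merged into a single token, which is why raising the threshold yields fewer phrases.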



### Lemmatization
- Use spaCy
    - Download the spaCy English model `en_core_web_sm` (if you have not done that before)
    - Load the spaCy model

In [37]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

In [40]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#### lemmatization( )

In [41]:
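def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is allowed.

    POS tag reference: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        # Re-join the tokens so spaCy can tag them in context
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out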

In [42]:
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [43]:
print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntp_poste', 'wam_umd', 'could_enlighten', 'car', 'see', 'day', 'door_sport', 'car', 'look', 'late_early', 'call', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky_looke', 'car', 'please_mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


### Create a Dictionary

In [44]:
id2word = corpora.Dictionary(data_lemmatized)
print(f"Number of unique words in the dictionary: {len(id2word)}")

Number of unique words in the dictionary: 80969


### Create Corpus

In [45]:
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(f"Number of documents in corpus: {len(corpus)}")
print(corpus[0][:20])  # Print first 20 word-id pairs of first document

Number of documents in corpus: 11314
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)]
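
Each pair in the output above is `(word id, count in this document)`. A rough plain-Python equivalent of the dictionary and bag-of-words steps is sketched below; `build_token2id` and `to_bow` are hypothetical helpers, and ids are assigned here in order of first appearance, which may differ from Gensim's numbering.

```python
from collections import Counter


def build_token2id(docs):
    """Assign an integer id to every distinct token."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [["car", "door", "car"], ["engine", "car"]]
toy_token2id = build_token2id(toy_docs)
toy_corpus = [to_bow(doc, toy_token2id) for doc in toy_docs]
print(toy_token2id)  # {'car': 0, 'door': 1, 'engine': 2}
print(toy_corpus)    # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Word order is discarded in this representation; only the counts per document remain.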


### Filter low-frequency words

In [46]:
# Filter out words that occur in less than 5 documents, or more than 50% of the documents
id2word.filter_extremes(no_below=5, no_above=0.5)
corpus = [id2word.doc2bow(text) for text in texts]
print(f"Number of unique words after filtering: {len(id2word)}")

Number of unique words after filtering: 17699
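
The filtering rule is based on document frequency: the number of documents a word appears in, not its total count. The sketch below mirrors that idea in plain Python; `filter_by_document_frequency` is a hypothetical helper, and details such as rounding and the cap on dictionary size in Gensim's `filter_extremes` are not reproduced.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below=5, no_above=0.5):
    """Keep tokens that appear in at least `no_below` documents
    and in no more than a `no_above` fraction of all documents."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {token for token, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [
    ["car", "engine"],
    ["car", "door"],
    ["car", "engine", "wheel"],
    ["car", "moon"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(kept)  # {'engine'}
```

Here `car` is dropped for being too common (it appears in every document) and `door`, `wheel`, and `moon` are dropped for being too rare.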


### Create Index 2 word dictionary

In [47]:
id2word_dict = {token_id: word for word, token_id in id2word.token2id.items()}
print(f"Sample of index to word mapping: {list(id2word_dict.items())[:5]}")

Sample of index to word mapping: [(0, 'addition'), (1, 'body'), (2, 'bring'), (3, 'call'), (4, 'car')]


### Build a News Topic Model

#### LdaModel
- **num_topics**: the number of topics, which must be chosen beforehand
- **chunksize**: the number of documents used in each training chunk
- **alpha**: the hyperparameter (Dirichlet prior) that controls the sparsity of the per-document topic distributions; `'auto'` learns it from the corpus
- **passes**: the total number of training passes over the corpus

In [48]:
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=10,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True
)

### Print the keywords for the 10 topics

In [49]:
print("Top 10 Keywords per Topic:")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

Top 10 Keywords per Topic:
Topic 0: 0.016*"science" + 0.012*"truth" + 0.011*"research" + 0.011*"accept" + 0.010*"study" + 0.010*"earth" + 0.009*"world" + 0.009*"report" + 0.009*"describe" + 0.008*"reference"
Topic 1: 0.033*"write" + 0.025*"get" + 0.022*"think" + 0.021*"say" + 0.020*"know" + 0.018*"go" + 0.016*"see" + 0.015*"make" + 0.014*"organization" + 0.013*"time"
Topic 2: 0.028*"people" + 0.015*"evidence" + 0.014*"state" + 0.011*"fact" + 0.010*"say" + 0.010*"case" + 0.010*"law" + 0.009*"right" + 0.009*"issue" + 0.009*"person"
Topic 3: 0.018*"year" + 0.012*"team" + 0.011*"game" + 0.010*"good" + 0.010*"first" + 0.009*"point" + 0.009*"line" + 0.007*"play" + 0.007*"_" + 0.007*"well"
Topic 4: 0.020*"use" + 0.016*"line" + 0.016*"system" + 0.014*"program" + 0.010*"problem" + 0.010*"file" + 0.010*"also" + 0.010*"window" + 0.009*"need" + 0.009*"thank"
Topic 5: 0.027*"key" + 0.016*"chip" + 0.014*"sale" + 0.012*"bike" + 0.010*"tape" + 0.010*"ripem" + 0.009*"use" + 0.009*"com" + 0.009*"system"

## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

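Model perplexity measures **how well** a **probability distribution** or probability model **predicts a sample**. Lower perplexity indicates a better fit.

Gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself; the perplexity is 2 raised to the negative of that bound.

In [50]:
log_perplexity_bound = lda_model.log_perplexity(corpus)
print(f"Per-word likelihood bound: {log_perplexity_bound}")

Per-word likelihood bound: -8.761123945294319


In [51]:
# perplexity = 2^(-bound)
actual_perplexity = np.exp2(-log_perplexity_bound)
print(f"Perplexity: {actual_perplexity:.1f}")

Perplexity: 433.9
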


### Topic Coherence
Topic coherence scores a single topic by measuring the **degree of semantic similarity** between the **high-scoring words** in that topic. Higher values indicate topics that are easier to interpret.

In [53]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print(f"Coherence Score: {coherence_lda}")

Coherence Score: 0.40410622495073856


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [56]:
# Prepare the visualization data
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, n_jobs=1)

# Optional: save the visualization to HTML
pyLDAvis.save_html(vis, 'Abdulrasaq_lda_visualization.html')

# Keep display() as the last expression so the notebook renders it inline
pyLDAvis.display(vis)