###Assignment 2: Extracting Topics from the Documents
####Objective
This assignment aims to help you understand the fundamentals of topic modeling,
preprocessing text for topic modeling, and evaluating the generated topics.
####Instructions
Complete the tasks below. Each task specifies the marks assigned. Submit your code,
outputs, and a brief explanation for each step.
* Dataset: [text_docs](https://docs.google.com/spreadsheets/d/1LvkaY8hjimc24qtLn0pXsBSQtvRwS_WFyUdbjNpexu0/edit?gid=611723605#gid=611723605)

In [34]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

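# Import libraries for text preprocessing
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources used below (newer NLTK releases need 'punkt_tab' for word_tokenize)
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet'):
  nltk.download(resource, quiet=True)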

# Import libraries for topic modeling
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

In [35]:
# Load the dataset
url = "https://docs.google.com/spreadsheets/d/1LvkaY8hjimc24qtLn0pXsBSQtvRwS_WFyUdbjNpexu0/export?format=csv&gid=611723605"
df = pd.read_csv(url)
# Display the first few rows of the dataset
df.head(5)

Unnamed: 0,document_id,text
0,1,The stock market has been experiencing volatil...
1,2,"The economy is growing, and businesses are opt..."
2,3,Climate change is a critical issue that needs ...
3,4,Advances in artificial intelligence have revol...
4,5,The rise of electric vehicles is shaping the f...


In [36]:
# Display the basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   document_id  10 non-null     int64 
 1   text         10 non-null     object
dtypes: int64(1), object(1)
memory usage: 292.0+ bytes


In [37]:
print("Total Number of Documents:", len(df))
print("Number of Unique Document IDs:", df['document_id'].nunique())

Total Number of Documents: 10
Number of Unique Document IDs: 10


In [38]:
# Count fully duplicated rows
df.duplicated().sum()

0

In [39]:
# Basic text statistics: character length and whitespace-delimited word count
df['text_length'] = df['text'].apply(len)
df['word_count'] = df['text'].apply(lambda x: len(x.split()))

In [40]:
df.head(5)

Unnamed: 0,document_id,text,text_length,word_count
0,1,The stock market has been experiencing volatil...,82,12
1,2,"The economy is growing, and businesses are opt...",71,11
2,3,Climate change is a critical issue that needs ...,73,11
3,4,Advances in artificial intelligence have revol...,77,8
4,5,The rise of electric vehicles is shaping the f...,79,13


In [41]:
df.describe()

Unnamed: 0,document_id,text_length,word_count
count,10.0,10.0,10.0
mean,5.5,75.7,10.7
std,3.02765,4.620005,1.418136
min,1.0,68.0,8.0
25%,3.25,72.25,10.25
50%,5.5,76.0,11.0
75%,7.75,79.75,11.0
max,10.0,82.0,13.0


####Preprocessing Text

In [42]:
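# Build these once rather than on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
  """Lowercase, strip non-letters, tokenize, drop stopwords, and lemmatize."""
  text = text.lower()
  # Keep only letters and whitespace (removes punctuation and digits)
  text = re.sub(r'[^a-zA-Z\s]', '', text)
  tokens = word_tokenize(text)
  tokens = [word for word in tokens if word not in stop_words]
  # The default lemmatizer treats every token as a noun, e.g. 'businesses' -> 'business'
  tokens = [lemmatizer.lemmatize(word) for word in tokens]
  return tokens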

In [43]:
df['tokens'] = df['text'].apply(preprocess)

print("Original Text:\n", df['text'][1])
print("Clean Text:\n", df['tokens'][1])

Original Text:
 The economy is growing, and businesses are optimistic about the future.
Clean Text:
 ['economy', 'growing', 'business', 'optimistic', 'future']
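
To see what each cleaning step contributes, the sketch below reproduces the lowercasing, character stripping, and stopword stages with the standard library only. The `TINY_STOPWORDS` set and the `basic_clean` helper are hypothetical stand-ins, not NLTK's stopword list or tokenizer, and no lemmatization is performed, which is why `businesses` stays plural here.

```python
import re

# Hypothetical, deliberately tiny stopword set (NLTK's English list is much longer)
TINY_STOPWORDS = {"the", "is", "and", "are", "about"}

def basic_clean(text):
    """Lowercase, drop non-letters, split on whitespace, remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return [tok for tok in text.split() if tok not in TINY_STOPWORDS]

sample = "The economy is growing, and businesses are optimistic about the future."
cleaned = basic_clean(sample)
print(cleaned)
```

Comparing this result with the notebook's `Clean Text` output isolates the effect of the lemmatizer, which is the only step that maps `businesses` to `business`.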


####Create a Document-Term Matrix (for LDA)

In [44]:
dictionary = corpora.Dictionary(df['tokens'])
# Create a corpus: one bag-of-words (BoW) vector of (token_id, count) pairs per document
corpus = [dictionary.doc2bow(text) for text in df['tokens']]
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
 [(7, 1), (8, 1), (9, 1), (10, 1), (11, 1)],
 [(12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)],
 [(20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1)],
 [(9, 1), (22, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1)],
 [(31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1)],
 [(22, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1)],
 [(43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1)],
 [(49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1)],
 [(37, 1), (39, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1)]]
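
Each inner list above is one document expressed as `(token_id, count)` pairs. A minimal sketch of the same idea in plain Python follows. The helpers `build_vocab` and `to_bow` are hypothetical, the two documents are toy data, and assigning ids in order of first appearance is an assumption that need not match the ids gensim produces.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for tokens in docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[tok] for tok in tokens if tok in vocab)
    return sorted(counts.items())

# Toy tokenized documents; the second repeats a token to show a count above 1
docs = [
    ["economy", "growing", "business", "optimistic", "future"],
    ["future", "energy", "energy", "project"],
]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
print(bows)
```

In the notebook's corpus every count is 1 because no cleaned document repeats a token, and a shared id such as 22 appearing in several rows marks a term that occurs in more than one document.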

####Create a Latent Dirichlet Allocation (LDA) Model

In [45]:
# Train the LDA model (no random_state is set, so the topics can differ between runs)
lda_model = gensim.models.LdaMulticore(corpus, num_topics=5, id2word=dictionary, passes=10)

In [46]:
# Display top 5 words per topic
for idx, topic in lda_model.print_topics(num_words=5):
  print(f"Topic #{idx+1}: {topic}")

Topic #1: 0.016*"industry" + 0.016*"future" + 0.016*"platform" + 0.016*"digital" + 0.016*"medium"
Topic #2: 0.050*"future" + 0.050*"energy" + 0.050*"investing" + 0.050*"project" + 0.050*"renewable"
Topic #3: 0.048*"digital" + 0.048*"platform" + 0.048*"integrated" + 0.048*"concern" + 0.048*"become"
Topic #4: 0.037*"need" + 0.037*"immediate" + 0.037*"change" + 0.037*"issue" + 0.037*"attention"
Topic #5: 0.066*"industry" + 0.036*"stock" + 0.036*"experiencing" + 0.036*"vehicle" + 0.036*"due"
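
Every weight in Topic #1 is 0.016, which is roughly 1/61, the uniform weight over the 61 dictionary terms (ids 0 to 60). This suggests that topic captured almost nothing from the 10 short documents. One hedged way to compare topics numerically is a UMass-style coherence score, which rewards top words that co-occur in the same documents. The sketch below is an illustration under assumptions: `umass_coherence` is a hypothetical helper using add-one smoothing, and the documents are toy data rather than the assignment dataset.

```python
import math

def umass_coherence(top_words, docs):
    """UMass-style coherence: sum of log((D(wi, wj) + 1) / D(wj)) over ordered word pairs."""
    doc_sets = [set(d) for d in docs]
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            d_j = sum(1 for s in doc_sets if wj in s)
            d_ij = sum(1 for s in doc_sets if wi in s and wj in s)
            if d_j > 0:  # skip words that never occur in the corpus
                score += math.log((d_ij + 1) / d_j)
    return score

# Toy tokenized documents (not the assignment dataset)
toy_docs = [
    ["energy", "renewable", "project"],
    ["energy", "renewable", "future"],
    ["stock", "market", "volatility"],
]
coherent = umass_coherence(["energy", "renewable"], toy_docs)
mixed = umass_coherence(["energy", "stock"], toy_docs)
print(coherent, mixed)
```

Higher (less negative) scores indicate word lists that tend to appear together. With only 10 documents any such score is noisy, so it is best read alongside a manual inspection of the topics.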


####Topic Modeling with NMF (Non-negative Matrix Factorization)

In [51]:
# NMF for Topic Modeling (with TF-IDF)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [52]:
# Prepare text: join the already preprocessed tokens back into one string per document
df['clean_text'] = df['tokens'].apply(lambda x:' '.join(x))
df['clean_text'][1]

'economy growing business optimistic future'

In [55]:
# Vectorize the preprocessed documents using TF-IDF
tfidf_v = TfidfVectorizer()
X = tfidf_v.fit_transform(df['clean_text'])
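
The weights in `X` can be sketched by hand. The example below assumes the `TfidfVectorizer` defaults (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row). `tfidf_rows` is a hypothetical helper and the two documents are toy data.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed TF-IDF with L2-normalized rows, one dict of term weights per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for d in docs for term in set(d))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for d in docs:
        counts = Counter(d)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # Fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

# Toy tokenized documents: 'future' appears in both, so it gets the lower IDF
toy_docs = [["economy", "future"], ["energy", "future"]]
rows = tfidf_rows(toy_docs)
print(rows[0])
```

A term shared by every document, such as `future` here, ends up with a smaller weight than a term unique to one document, which is what lets NMF separate documents by their distinctive vocabulary.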

In [57]:
# Apply NMF for topic modeling
n_topics = 5
nmf_model = NMF(n_components=n_topics, random_state=42)
W = nmf_model.fit_transform(X)  # Document-topic matrix, shape (n_documents, n_topics)
H = nmf_model.components_       # Topic-word matrix, shape (n_topics, n_terms)

In [59]:
# Display the top 5 words for each NMF topic
feature_names = tfidf_v.get_feature_names_out()
for topic_idx, topic in enumerate(H):
  # argsort is ascending, so the highest-weighted word is printed last
  top_words = [feature_names[i] for i in topic.argsort()[-5:]]
  print(f"Topic {topic_idx+1}: {', '.join(top_words)}")

Topic 1: streaming, entertainment, towards, platform, digital
Topic 2: vehicle, automobile, rise, industry, future
Topic 3: global, immediate, attention, need, critical
Topic 4: introduction, evolving, new, technology, healthcare
Topic 5: investing, world, government, energy, around


In [60]:
# Assign the most likely topic to each document
# (argmax is 0-based, so nmf_topic 0 corresponds to "Topic 1" in the listing above)
df['nmf_topic'] = W.argmax(axis=1)
df[['document_id', 'nmf_topic']].head()

Unnamed: 0,document_id,nmf_topic
0,1,2
1,2,1
2,3,2
3,4,1
4,5,1
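
The `nmf_topic` values are 0-based column indices of `W`, while the topic listing is printed with 1-based labels, so an `nmf_topic` of 2 refers to "Topic 3". The sketch below shows the assignment and relabeling on made-up weights. `dominant_topics` and `toy_W` are hypothetical and use plain lists instead of a NumPy array.

```python
def dominant_topics(doc_topic_rows):
    """Return the 0-based index of the largest weight in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic_rows]

# Made-up document-topic weights for three documents and three topics
toy_W = [
    [0.10, 0.00, 0.85],
    [0.00, 0.70, 0.05],
    [0.40, 0.40, 0.00],  # tie: the first maximum wins, as with argmax
]
assigned = dominant_topics(toy_W)
labels = [f"Topic {k + 1}" for k in assigned]
print(assigned, labels)
```

Storing the 1-based label next to the raw index makes it harder to misread the table against the printed topics.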


In [49]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

# Visualize the LDA topics interactively in the notebook
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, dictionary)
vis