**TOPIC MODELING**

**Muhamed Hisham bin Mohamed Bahurudeen (IS01081947)**  
**Muhammad Afiq Fikri Bin Ahmad Sabri (IS01082516)**

*The LDA model, trained with 4 topics, achieved a c_v coherence score of 0.561, which suggests that the topics are reasonably interpretable.*  
*The model uncovered some meaningful themes in the dataset, though further refinement could make the topics clearer.*
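
To give a feel for what a coherence score measures, here is a minimal pure-Python sketch of a simplified UMass-style coherence. It is not the `c_v` measure reported above, so its values are not comparable with 0.561. The function name `umass_coherence` and the toy documents are assumptions made for illustration only.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Average log((D(w_hi, w_lo) + 1) / D(w_hi)) over pairs of top words.

    D(...) counts the documents that contain all the given words, and
    w_hi is the higher-ranked word of each pair. This is a simplified
    UMass-style score, not gensim's c_v.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        w_hi, w_lo = top_words[i], top_words[j]  # i < j, so w_hi ranks higher
        d_hi = doc_freq(w_hi)
        if d_hi == 0:
            continue  # skip words that never occur in the documents
        scores.append(math.log((doc_freq(w_hi, w_lo) + 1) / d_hi))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy tokenized documents (hypothetical, not taken from the dataset)
toy_docs = [
    ["key", "encryption", "security"],
    ["key", "encryption", "privacy"],
    ["game", "team", "player"],
    ["game", "team", "season"],
]

coherent = umass_coherence(["key", "encryption"], toy_docs)
mixed = umass_coherence(["key", "game"], toy_docs)
print(f"{coherent:.3f}")  # 0.405
print(f"{mixed:.3f}")     # -0.693
```

Words that tend to appear in the same documents score higher than words that never co-occur, which is the intuition behind reading a higher coherence score as a more interpretable topic.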

In [4]:
# For text preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# For topic modeling
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel
import pandas as pd

# Download NLTK resources (run once)
# Note: newer NLTK releases may also require nltk.download('punkt_tab') for word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Load the cleaned dataset
df = pd.read_csv("Processed_News_Lemmatized.csv")

# Extract documents from the 'lemmatized' column, skipping missing values
documents = df['lemmatized'].dropna().tolist()

# Preprocess function
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Lowercase, tokenize, drop non-alphanumeric tokens and stop words, then lemmatize."""
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalnum()]
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

# Preprocess all documents
preprocessed_documents = [preprocess_text(doc) for doc in documents]

# Create dictionary and corpus
dictionary = corpora.Dictionary(preprocessed_documents)
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_documents]

# Train LDA model with 4 topics
# (no random_state is set, so topics and scores can vary between runs)
lda_model = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=15)

# Assign dominant topic to each document (reusing the bag-of-words corpus built above)
article_labels = []
for bow in corpus:
    topics = lda_model.get_document_topics(bow)
    dominant_topic = max(topics, key=lambda x: x[1])[0]
    article_labels.append(dominant_topic)

# Create DataFrame with results
df_result = pd.DataFrame({"Article": documents, "Topic": article_labels})
print("Table with Articles and Topic:")
print(df_result.head())

# Show top terms for each topic
print("\nTop Terms for Each Topic:")
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx}:")
    terms = [term.strip() for term in topic.split("+")]
    for term in terms:
        weight, word = term.split("*")
        print(f"- {word.strip()} (weight: {weight.strip()})")
    print()

# Evaluate model using the c_v coherence score
coherence_model = CoherenceModel(
    model=lda_model,
    texts=preprocessed_documents,
    dictionary=dictionary,
    coherence='c_v',
)
coherence_score = coherence_model.get_coherence()
print(f"Coherence Score: {coherence_score}")




Table with Articles and Topic:
                                             Article  Topic
0  wonder anyone could enlighten car saw day door...      3
1  recently post article ask kind rate single mal...      0
2  depend priority lot people put high priority g...      0
3  excellent automatic find subaru legacy switch ...      3
4  ford automobile need information whether ford ...      0
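
The `Topic` column in the table above holds, for each article, the topic with the highest probability. The selection step can be sketched on its own in plain Python, assuming the input is a list of `(topic_id, probability)` pairs, which is the shape `get_document_topics` returns. The helper name `pick_dominant_topic` and the toy probabilities are hypothetical.

```python
def pick_dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability.

    Returns None for an empty list, so callers can handle documents
    for which no topic was reported.
    """
    if not topic_probs:
        return None
    return max(topic_probs, key=lambda pair: pair[1])


# Hypothetical topic distribution for one article
toy_distribution = [(0, 0.62), (1, 0.05), (3, 0.33)]
topic_id, probability = pick_dominant_topic(toy_distribution)
print(topic_id, probability)  # 0 0.62
```

Keeping the probability next to the label, rather than only the topic id as in the notebook, may help flag articles that are split between two topics.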

Top Terms for Each Topic:
Topic 0:
- "would" (weight: 0.011)
- "one" (weight: 0.008)
- "say" (weight: 0.008)
- "people" (weight: 0.008)
- "go" (weight: 0.006)
- "think" (weight: 0.006)
- "get" (weight: 0.006)
- "know" (weight: 0.006)
- "make" (weight: 0.006)
- "u" (weight: 0.005)
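
Topic 0 is dominated by generic words (`would`, `one`, `say`, `get`) and a single-character token (`u`), which is one place where further refinement could help. A minimal sketch of an extra filtering step follows. The word list `EXTRA_STOP_WORDS`, the helper `drop_generic_tokens`, and the `min_length` threshold are assumptions for illustration, not part of the original pipeline.

```python
# Hypothetical extra stop words, chosen from the Topic 0 terms above
EXTRA_STOP_WORDS = {"would", "one", "say", "go", "think", "get", "know", "make"}


def drop_generic_tokens(tokens, extra_stop_words=EXTRA_STOP_WORDS, min_length=2):
    """Remove very short tokens and tokens listed in extra_stop_words."""
    return [
        token
        for token in tokens
        if len(token) >= min_length and token not in extra_stop_words
    ]


sample = ["would", "u", "ford", "car", "x", "know", "engine"]
print(drop_generic_tokens(sample))  # ['ford', 'car', 'engine']
```

Such a filter could be applied to the output of `preprocess_text` before the dictionary is built. gensim's `Dictionary.filter_extremes` offers a frequency-based alternative that avoids maintaining a word list by hand.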

Topic 1:
- "x" (weight: 0.016)
- "key" (weight: 0.013)
- "use" (weight: 0.010)
- "encryption" (weight: 0.009)
- "system" (weight: 0.007)
- "information" (weight: 0.006)
- "file" (weight: 0.006)
- "privacy" (weight: 0.005)
- "clipper" (weight: 0.005)
- "security" (weight: 0.005)

Topic 2:
- "game" (weight: 0.014)
- "team" (we