## Modeling Natural Language Data 

The assignment uses the 20 newsgroups text dataset, which comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training and one for testing. The split between the train and test sets is based on whether a message was posted before or after a specific date. This notebook works with the training subset only.

In this assignment, you will complete the following text-processing pipeline using Python:

1. Clean and preprocess the text (tokenization, noise removal, normalization)
2. Create Bag-of-Words (BoW) and TF-IDF representations
3. Extend BoW and TF-IDF with Bigrams
4. Perform topic modeling with 10 topics and visualize using pyLDAvis
5. Use GloVe embeddings to check document similarity

In [40]:
# Importing required packages
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

import requests
import pandas as pd
import numpy as np
import re

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import warnings 
warnings.filterwarnings('ignore')

In [41]:
import nltk

# Explicitly set the path to your NLTK data (adjust this to your own machine)
NLTK_DATA_DIR = "/Users/neelaropp/nltk_data"
nltk.data.path.append(NLTK_DATA_DIR)

# Ensure all necessary resources are downloaded
# (newer NLTK releases may additionally need 'punkt_tab' for word_tokenize)
nltk.download('punkt', download_dir=NLTK_DATA_DIR)
nltk.download('stopwords', download_dir=NLTK_DATA_DIR)
nltk.download('wordnet', download_dir=NLTK_DATA_DIR)


[nltk_data] Downloading package punkt to /Users/neelaropp/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/neelaropp/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/neelaropp/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [42]:

# Fetching 20 Newsgroups data
newsgroups_data = fetch_20newsgroups(subset='train')
documents = newsgroups_data.data
print("Number of articles:", len(documents))


Number of articles: 11314


### *Part A*: Text Preprocessing

Write a function `preprocess_text(text)` that performs all of the steps below and returns a list of clean tokens. Apply this function to every article in your dataset:
- Tokenize the articles (you may use nltk.word_tokenize, spacy, or any other tokenizer).
- Remove Noise:
    - Filter out non-alphabetic tokens (numbers, punctuation, etc.).
    - Remove stopwords.
    - Optionally remove or handle repeated characters, HTML tags, etc.
- Normalize:
    - Convert text to lowercase.
    - Apply lemmatization.

**Additional Instructions**:
- Show sample output for the first 2 articles/documents


In [43]:
# Build the stopword set and lemmatizer once instead of once per article
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    """
    Preprocess the input text by tokenizing, removing noise, and normalizing.
    Returns a list of clean tokens.
    """
    # Convert text to lowercase
    text = text.lower()

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove punctuation and non-alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()]

    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Apply lemmatization
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    return tokens

# Apply preprocessing to all documents
preprocessed_documents = [preprocess_text(doc) for doc in documents]

# Show sample output for the first 2 articles
print("Sample Processed Article 1:", preprocessed_documents[0])
print("Sample Processed Article 2:", preprocessed_documents[1])



Sample Processed Article 1: ['lerxst', 'thing', 'subject', 'car', 'organization', 'university', 'maryland', 'college', 'park', 'line', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'sport', 'car', 'looked', 'late', 'early', 'called', 'bricklin', 'door', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'thanks', 'il', 'brought', 'neighborhood', 'lerxst']
Sample Processed Article 2: ['guykuo', 'guy', 'kuo', 'subject', 'si', 'clock', 'poll', 'final', 'call', 'summary', 'final', 'call', 'si', 'clock', 'report', 'keywords', 'si', 'acceleration', 'clock', 'upgrade', 'organization', 'university', 'washington', 'line', 'fair', 'number', 'brave', 'soul', 'upgraded', 'si', 'clock', 'oscillator', 'shared', 'experience', 'poll', 'please', 'send', 'brief', 'message', 'detailing', 'expe

### *Part B*: Bag-of-Words Vectorization

- Use CountVectorizer from sklearn.feature_extraction.text to transform your entire corpus into a BoW representation.
- Print the vocabulary size (i.e., the number of unique words after preprocessing).
- Show the BoW vector for one sample document.


In [44]:

# Convert tokenized documents back to string format
preprocessed_texts = [" ".join(doc) for doc in preprocessed_documents]

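# Initialize CountVectorizer
# (its default token_pattern keeps only tokens of two or more characters,
#  so single-letter tokens are not counted in the vocabulary)
vectorizer = CountVectorizer()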

# Fit and transform the text data
bow_matrix = vectorizer.fit_transform(preprocessed_texts)

# Get vocabulary size
vocab_size = len(vectorizer.vocabulary_)
print("Vocabulary Size:", vocab_size)


# Convert BoW matrix to an array and display the first document's vector
bow_vector_sample = bow_matrix[0].toarray()
print("BoW Vector for First Document:\n", bow_vector_sample)


Vocabulary Size: 66828
BoW Vector for First Document:
 [[0 0 0 ... 0 0 0]]
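
The overview lists TF-IDF alongside BoW, but the cell above builds only the count matrix. The following is a minimal sketch of what the TF-IDF step could look like with `TfidfVectorizer`, assuming the same space-joined input format as `preprocessed_texts`. The corpus `toy_texts` and the variable names are illustrative stand-ins, not part of the assignment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for `preprocessed_texts` (hypothetical data, for illustration only)
toy_texts = [
    "car engine car door",
    "car clock upgrade",
    "god people christian",
]

# Default settings: smoothed IDF and L2-normalized rows
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(toy_texts)

print("TF-IDF matrix shape:", tfidf_matrix.shape)

# Look up the weights of a few terms in the first document
vocab = tfidf_vectorizer.vocabulary_
row0 = tfidf_matrix[0].toarray().ravel()
for term in ("car", "engine", "door"):
    print(term, round(float(row0[vocab[term]]), 4))
```

In the first toy document, "car" occurs twice and therefore receives a larger weight than "engine" or "door", even though it also appears in a second document and so has a lower IDF. On the real corpus, the same two lines (`TfidfVectorizer()` and `fit_transform(preprocessed_texts)`) would apply.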


### *Part C*: Add Bigrams

- Extend your vectorization to include bigrams
    - Use the `bigrams` function from nltk (imported below) to generate bigrams
    - Then combine the original unigrams with the new bigram tokens to extend the vocabulary
    - Recreate the BoW vectors with the new vocabulary (only this recreated BoW is used in the following parts)
- Compare the resulting vocabulary size with the one created in the previous question


In [45]:
from nltk import bigrams

In [46]:
def generate_bigrams(text_tokens):
    """
    Generate bigrams from a list of tokens and return a combined list of unigrams and bigrams.
    """
    bigram_tokens = list(bigrams(text_tokens))  # Create bigrams as tuples
    bigram_tokens = ["_".join(bigram) for bigram in bigram_tokens]  # Convert tuples to strings
    return text_tokens + bigram_tokens  # Combine unigrams and bigrams

# Apply bigram generation to all preprocessed documents
preprocessed_with_bigrams = [generate_bigrams(doc) for doc in preprocessed_documents]

# Convert tokenized documents (with bigrams) back to string format
preprocessed_bigram_texts = [" ".join(doc) for doc in preprocessed_with_bigrams]

# Initialize CountVectorizer for unigrams + bigrams
vectorizer_bigram = CountVectorizer()

# Fit and transform the text data
bow_bigram_matrix = vectorizer_bigram.fit_transform(preprocessed_bigram_texts)

# Get vocabulary size with bigrams
vocab_size_bigram = len(vectorizer_bigram.vocabulary_)

# Print comparison
print("Original Vocabulary Size (Unigrams Only):", vocab_size)
print("Extended Vocabulary Size (Unigrams + Bigrams):", vocab_size_bigram)


Original Vocabulary Size (Unigrams Only): 66828
Extended Vocabulary Size (Unigrams + Bigrams): 936412
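
For comparison, `CountVectorizer` can also generate n-grams itself through its `ngram_range` parameter. This is not the approach the assignment asks for (it requires nltk's `bigrams`), and the bigrams it produces are space-joined rather than underscore-joined, but it is a quick way to sanity-check the vocabulary growth. The sketch below uses a hypothetical two-document corpus, `toy_texts`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical data, for illustration only)
toy_texts = ["sport car door", "sport car engine"]

# Unigrams only
unigram_vectorizer = CountVectorizer()
unigram_vectorizer.fit(toy_texts)

# Unigrams and bigrams in one vocabulary
uni_bi_vectorizer = CountVectorizer(ngram_range=(1, 2))
uni_bi_vectorizer.fit(toy_texts)

print("Unigram vocabulary size:", len(unigram_vectorizer.vocabulary_))
print("Unigram + bigram vocabulary size:", len(uni_bi_vectorizer.vocabulary_))
print(sorted(uni_bi_vectorizer.vocabulary_))
```

Even on this tiny corpus the vocabulary nearly doubles once bigrams are included, which is the same effect seen at scale in the comparison above.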


### *Part D*: Topic Modeling with LDA

Use the BoW representation you created in the previous question for topic modeling.

**Steps**:
- Utilize Gensim’s LdaModel to learn 10 topics
    - Convert the BoW representation into Gensim’s compatible format: you can do so by creating a Dictionary and a corpus
    - Train `LdaModel(num_topics=10, ...)`
    
Refer to the [Gensim LDAModel Documentation](https://radimrehurek.com/gensim/models/ldamodel.html) for this step.

In [47]:
import gensim
from gensim import corpora
from gensim.models import LdaModel


In [48]:
# Create a dictionary from the processed documents with bigrams
dictionary = corpora.Dictionary(preprocessed_with_bigrams)

# Filter out rare and very frequent words
dictionary.filter_extremes(no_below=5, no_above=0.5)

# Print dictionary size
print("Dictionary Size:", len(dictionary))

# Convert the preprocessed documents into a bag-of-words corpus
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_with_bigrams]

# Print sample corpus entry
print("Sample Corpus Entry:", corpus[0])

# Train LDA model with 10 topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, random_state=42, passes=10)

# Print the top words for each topic
for idx, topic in lda_model.print_topics(num_words=10):
    print(f"Topic {idx}: {topic}")



Dictionary Size: 56879
Sample Corpus Entry: [(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 5), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1)]
Topic 0: 0.010*"god" + 0.008*"one" + 0.007*"people" + 0.007*"would" + 0.005*"say" + 0.005*"think" + 0.005*"christian" + 0.004*"know" + 0.004*"article" + 0.004*"jesus"
Topic 1: 0.009*"car" + 0.009*"article" + 0.006*"would" + 0.005*"people" + 0.004*"get" + 0.004*"like" + 0.004*"one" + 0.004*"line_article" + 0.003*"think" + 0.003*"go"
Topic 2: 0.052*"max" + 0.039*"max_max" + 0.037*"r" + 0.

### Visualizing the topics created with pyLDAvis

- No coding required in this step

In [49]:
# Visualize with pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
lda_vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(lda_vis_data)


### *Part E*: Word Embeddings with GloVe

- Load GloVe Vectors:
    - Download pretrained GloVe embeddings from [this link](https://nlp.stanford.edu/data/glove.6B.zip).
    - From the zip file, use only glove.6B.50d.txt.
    - Load them into your notebook.

- Create Document Embeddings:
    - For each article, compute a “document embedding” by averaging the GloVe embeddings of all words in that document (or any other strategy you might prefer, e.g., using only nouns, etc.)
    - If an embedding doesn't exist for a token and you get an error, skip the token


In [50]:
import numpy as np
glove_path = "glove.6B.50d.txt"  

# Load the GloVe embeddings into a dictionary
glove_embeddings = {}

with open(glove_path, "r", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]  # The word itself
        vector = np.array(values[1:], dtype=np.float32)  # The embedding
        glove_embeddings[word] = vector

print("Loaded GloVe embeddings:", len(glove_embeddings), "words.")


Loaded GloVe embeddings: 400000 words.


In [51]:
def compute_document_embedding(tokens):
    """
    Compute the document embedding by averaging the GloVe embeddings of all words in the document.
    If a word is not in GloVe, it is skipped.
    """
    vectors = [glove_embeddings[word] for word in tokens if word in glove_embeddings]
    
    if len(vectors) == 0:
        return np.zeros(50)  # Return a zero vector if no valid words found
    
    return np.mean(vectors, axis=0)  # Average the word vectors

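# Compute document embeddings for all articles.
# Use the unigram tokens: GloVe is a word-level vocabulary, so the underscore-joined
# bigram tokens are not expected to have embeddings and would only be skipped.
document_embeddings = np.array([compute_document_embedding(doc) for doc in preprocessed_documents])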

# Print shape of document embeddings
print("Document Embeddings Shape:", document_embeddings.shape)


Document Embeddings Shape: (11314, 50)
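
Because unknown tokens are silently skipped, it can help to check how much of a document is actually represented in its averaged vector. The sketch below uses a hypothetical helper, `embedding_coverage`, and a toy three-dimensional lookup table, `toy_embeddings`, standing in for `glove_embeddings`. Neither is part of the assignment.

```python
import numpy as np

# Toy stand-in for `glove_embeddings` (hypothetical 3-dimensional vectors)
toy_embeddings = {
    "car": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "engine": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}

def embedding_coverage(tokens, embeddings):
    """Return the fraction of tokens that have an embedding (0.0 for an empty document)."""
    if not tokens:
        return 0.0
    found = sum(1 for token in tokens if token in embeddings)
    return found / len(tokens)

# Two known words, one bigram-style token, and one rare word
tokens = ["car", "engine", "car_engine", "bricklin"]
coverage = embedding_coverage(tokens, toy_embeddings)

# Average only the vectors that exist, as compute_document_embedding does
known = [toy_embeddings[t] for t in tokens if t in toy_embeddings]
doc_vector = np.mean(known, axis=0)

print("Coverage:", coverage)
print("Document vector:", doc_vector)
```

A low coverage value means the document embedding is driven by only a few words, which is worth keeping in mind when interpreting similarity scores.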


### *Part F*: Check Document Similarity

Since an embedding gives a semantic representation of a document, we can use cosine similarity to measure how similar two documents are.

- Compute cosine similarity between any two documents’ embeddings.
- Print out the resulting similarity score for each pair of documents.
- (Bonus points) Find the most similar document to a given query document in your corpus.

In [52]:
from sklearn.metrics.pairwise import cosine_similarity

# Select two document indices for comparison
doc_idx1, doc_idx2 = 0, 1 

# Compute cosine similarity between the two selected document embeddings
similarity_score = cosine_similarity(
    document_embeddings[doc_idx1].reshape(1, -1), 
    document_embeddings[doc_idx2].reshape(1, -1)
)[0][0]

# Print the similarity score
print(f"Cosine Similarity between Document {doc_idx1} and Document {doc_idx2}: {similarity_score:.4f}")


Cosine Similarity between Document 0 and Document 1: 0.8913
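
Cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. A minimal NumPy sketch of that formula is given below. The function name `cosine` is illustrative, and returning 0.0 when either vector is all zeros is a choice made here so that the zero-vector fallback for empty documents does not cause a division by zero.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D vectors; 0.0 if either has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine([1.0, 1.0], [1.0, 0.0]))  # 45 degrees apart
```

The score depends only on the angle between the vectors, not on their lengths, so long and short documents with similar content can still score highly.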


In [53]:
def find_most_similar_document(query_idx, document_embeddings):
    """
    Find the most similar document to the given query document based on cosine similarity.
    """
    # Compute cosine similarities with all other documents
    similarities = cosine_similarity(
        document_embeddings[query_idx].reshape(1, -1), 
        document_embeddings
    )[0]

    # Ignore self-comparison by setting the similarity of the query document to -1
    similarities[query_idx] = -1  

    # Find the index of the most similar document
    most_similar_idx = np.argmax(similarities)

    return most_similar_idx, similarities[most_similar_idx]

# Choose a document as the query
query_doc_idx = 5  # Change this index as needed

# Find the most similar document
most_similar_doc_idx, similarity_score = find_most_similar_document(query_doc_idx, document_embeddings)

# Print results
print(f"Document {query_doc_idx} is most similar to Document {most_similar_doc_idx} with similarity score {similarity_score:.4f}")


Document 5 is most similar to Document 198 with similarity score 0.9976
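
The single-best lookup generalizes to a ranked list of neighbors. The sketch below is one possible variant, using a hypothetical helper `top_k_similar` and a toy embedding matrix. It normalizes the rows once so that a matrix-vector product yields all cosine similarities.

```python
import numpy as np

def top_k_similar(query_idx, embeddings, k=3):
    """Return up to k (index, similarity) pairs, most similar first, excluding the query."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0          # leave zero vectors as zeros instead of dividing by 0
    unit = embeddings / norms          # rows now have unit length (or are all zeros)
    sims = unit @ unit[query_idx]      # cosine similarity to every document
    sims[query_idx] = -np.inf          # exclude the query itself
    k = min(k, len(sims) - 1)          # never return more than the other documents
    top = np.argsort(sims)[::-1][:k]   # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in top]

# Toy embeddings (hypothetical data, for illustration only)
toy = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

result = top_k_similar(0, toy, k=2)
print(result)
```

Applied to the real data, the call would be `top_k_similar(query_doc_idx, document_embeddings, k=5)`. Inspecting several neighbors rather than one makes it easier to judge whether the similarity scores are meaningful.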
