## HW-4: Modeling Natural Language Data (50 marks)

The assignment uses the 20 newsgroups text dataset. The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics, split into two subsets: one for training and the other for testing. The split between the train and test sets is based on whether a message was posted before or after a specific date.

In this assignment, you will complete the following text-processing pipeline using Python:

1. Clean and preprocess the text (tokenization, noise removal, normalization)
2. Create Bag-of-Words (BoW) and TF-IDF representations
3. Extend BoW and TF-IDF with Bigrams
4. Perform topic modeling with 10 topics and visualize using pyLDAvis
5. Use GloVe embeddings to check document similarity

In [26]:
# Importing required packages
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

import requests
import pandas as pd
import numpy as np
import re

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import warnings
warnings.filterwarnings('ignore')

In [27]:
# Explicitly set the path to your NLTK data (adjust this for your machine)
nltk_data_dir = "/Users/neelaropp/nltk_data"
nltk.data.path.append(nltk_data_dir)

# Ensure all necessary resources are downloaded
nltk.download('punkt', download_dir=nltk_data_dir)
nltk.download('stopwords', download_dir=nltk_data_dir)
nltk.download('wordnet', download_dir=nltk_data_dir)


[nltk_data] Downloading package punkt to /Users/neelaropp/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/neelaropp/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/neelaropp/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [28]:
# Fetching 20 Newsgroups data
newsgroups_data = fetch_20newsgroups(subset='train')
documents = newsgroups_data.data
print("Number of articles:", len(documents))

Number of articles: 11314


### *Part A*: Text Preprocessing

Write a function `preprocess_text(text)` that performs all of the steps below and returns a list of clean tokens. Apply this function to every article in your dataset:
- Tokenize the articles (you may use nltk.word_tokenize, spacy, or any other tokenizer).
- Remove Noise:
    - Filter out non-alphabetic tokens (numbers, punctuation, etc.).
    - Remove stopwords.
    - Optionally remove or handle repeated characters, HTML tags, etc.
- Normalize:
    - Convert text to lowercase.
    - Apply lemmatization.

**Additional Instructions**:
- Show sample output for the first 2 articles/documents
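
The optional "repeated characters" step above can be handled with a single regular expression. The following is a minimal sketch, not part of the required solution; the helper name `collapse_repeats` and the `max_run` parameter are hypothetical. It shortens any run of the same character to at most two copies, so real double letters survive.

```python
import re

def collapse_repeats(token: str, max_run: int = 2) -> str:
    """Shorten any run of the same character to at most `max_run` copies."""
    # (.) captures one character; \1{max_run,} matches max_run or more repeats of it
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, token)

print(collapse_repeats("soooo"))  # soo
print(collapse_repeats("coool"))  # cool
print(collapse_repeats("book"))   # book (a run of two is left alone)
```

Such a helper would most naturally run on each token before the stopword check. Note that it does not restore the dictionary spelling ("soooo" becomes "soo", not "so"), so lemmatization may still leave these tokens unchanged.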


In [29]:
# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # Tokenize text
    tokens = word_tokenize(text)
    
    # Keep alphabetic, non-stopword tokens and lemmatize each one
    clean_tokens = [
        lemmatizer.lemmatize(word)
        for word in tokens
        if word.isalpha() and word not in stop_words
    ]
    
    return clean_tokens

# Apply preprocessing to first two documents
processed_docs = [preprocess_text(doc) for doc in documents[:2]]

# Display sample output for first 2 documents
for i, doc in enumerate(processed_docs):
    print(f"Processed Document {i+1}: \n", doc[:50], "...\n")

Processed Document 1: 
 ['lerxst', 'thing', 'subject', 'car', 'organization', 'university', 'maryland', 'college', 'park', 'line', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'saw', 'day', 'sport', 'car', 'looked', 'late', 'early', 'called', 'bricklin', 'door', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car'] ...

Processed Document 2: 
 ['guykuo', 'guy', 'kuo', 'subject', 'si', 'clock', 'poll', 'final', 'call', 'summary', 'final', 'call', 'si', 'clock', 'report', 'keywords', 'si', 'acceleration', 'clock', 'upgrade', 'organization', 'university', 'washington', 'line', 'fair', 'number', 'brave', 'soul', 'upgraded', 'si', 'clock', 'oscillator', 'shared', 'experience', 'poll', 'please', 'send', 'brief', 'message', 'detailing', 'experience', 'procedure', 'top', 'speed', 'attained', 'cpu', 'rated', 

### *Part B*: Bag-of-Words Vectorization

- Use CountVectorizer from sklearn.feature_extraction.text to transform your entire corpus into a BoW representation.
- Print the vocabulary size (i.e., the number of unique words after preprocessing).
- Show the BoW vector for one sample document.


In [30]:
# Initialize CountVectorizer using the pre-defined preprocess_text function
vectorizer = CountVectorizer(tokenizer=preprocess_text)

# Fit and transform the documents into a BoW representation
bow_matrix = vectorizer.fit_transform(documents)

# Get the vocabulary size
vocab_size = len(vectorizer.vocabulary_)

# Convert BoW representation to array for a sample document
sample_bow_vector = bow_matrix[0].toarray()

# Display results
print("Vocabulary size:", vocab_size)
print("BoW vector for first document:\n", sample_bow_vector)

Vocabulary size: 66857
BoW vector for first document:
 [[0 0 0 ... 0 0 0]]
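
Step 2 of the pipeline also names TF-IDF. As a rough illustration of how TF-IDF reweights raw counts, the sketch below computes it by hand on a toy corpus. It assumes the smoothed IDF formula and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the function name `tfidf` and the toy documents are made up for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0  # guard against empty docs
        rows.append([w / norm for w in row])
    return vocab, rows

toy_docs = [["car", "engine", "car"], ["game", "team"], ["car", "game"]]
vocab, rows = tfidf(toy_docs)
print(vocab)  # ['car', 'engine', 'game', 'team']
for row in rows:
    print([round(w, 3) for w in row])
```

In the second toy document, "team" receives a higher weight than "game" because it occurs in fewer documents. On the real corpus, the same `preprocess_text` tokenizer could presumably be passed to the already imported `TfidfVectorizer`.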


### *Part C*: Add Bigrams

- Extend your vectorization to include bigrams
    - Use the `bigrams` function from nltk (imported below) to generate bigrams
    - Then combine the original unigrams with the new bigram tokens to extend the vocabulary
    - Recreate the BoW vectors with the new vocabulary (only this recreated BoW is used from here on)
- Compare the resulting vocabulary size with the one created in the previous question
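
To make the effect of the bigram extension concrete, here is a small sketch in plain Python. The function name `unigrams_plus_bigrams` is hypothetical; it pairs adjacent tokens with `zip`, which should produce the same pairs as nltk's `bigrams` on a list of tokens.

```python
def unigrams_plus_bigrams(tokens):
    """Return the unigrams followed by underscore-joined adjacent pairs."""
    pairs = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + pairs

sample = ["sport", "car", "looked", "late"]
print(unigrams_plus_bigrams(sample))
# ['sport', 'car', 'looked', 'late', 'sport_car', 'car_looked', 'looked_late']
```

A document with n tokens gains n - 1 bigram tokens. Most word pairs are rare, so the number of distinct bigrams grows much faster than the number of distinct unigrams, which is why the vocabulary is expected to grow sharply.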


In [31]:
from nltk import bigrams

In [32]:
# Function to generate bigrams and extend vocabulary
def generate_bigrams(tokens):
    # nltk's bigrams() yields adjacent token pairs; join each pair with "_"
    bigram_tokens = ['_'.join(pair) for pair in bigrams(tokens)]
    return tokens + bigram_tokens  # Combine unigrams with bigrams

# Modified tokenizer to include bigrams
def preprocess_with_bigrams(text):
    tokens = preprocess_text(text)  # Apply previous preprocessing
    return generate_bigrams(tokens)  # Extend with bigrams

# Initialize CountVectorizer with bigrams
vectorizer_bigrams = CountVectorizer(tokenizer=preprocess_with_bigrams)

# Fit and transform the documents into a BoW representation with bigrams
bow_matrix_bigrams = vectorizer_bigrams.fit_transform(documents)

# Get the new vocabulary size
vocab_size_bigrams = len(vectorizer_bigrams.vocabulary_)

# Compare vocabulary sizes
print("Vocabulary size without bigrams:", vocab_size)
print("Vocabulary size with bigrams:", vocab_size_bigrams)

Vocabulary size without bigrams: 66857
Vocabulary size with bigrams: 936441


### *Part D*: Topic Modeling with LDA

Use the BoW representation you created in the previous question for topic modeling.

**Steps**:
- Utilize Gensim’s LdaModel to learn 10 topics
    - Convert the BoW representation into a Gensim-compatible format by creating a Dictionary and a corpus
    - Train `LdaModel(num_topics=10, ...)`
    
Refer to the [Gensim LDAModel Documentation](https://radimrehurek.com/gensim/models/ldamodel.html) for this step.
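
The "Dictionary and corpus" format mentioned above boils down to a token-to-id mapping plus, for each document, a list of `(token_id, count)` pairs. The sketch below imitates that shape in plain Python for illustration only. The helpers `build_token_ids` and `to_bow` are hypothetical stand-ins, and Gensim's own `Dictionary` may assign ids differently.

```python
from collections import Counter

def build_token_ids(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens that have an id."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_docs = [["car", "engine", "car"], ["game", "team", "game"]]
token2id = build_token_ids(toy_docs)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)    # {'car': 0, 'engine': 1, 'game': 2, 'team': 3}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(2, 2), (3, 1)]]
```

Only non-zero counts are stored, which keeps the corpus compact even when the vocabulary is very large.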

In [33]:
import gensim
from gensim import corpora

In [34]:
from gensim.models import LdaModel

# Tokenize each document once (unigrams + bigrams) and reuse the result
tokenized_docs = [preprocess_with_bigrams(doc) for doc in documents]

# Create a dictionary from the processed documents
dictionary = corpora.Dictionary(tokenized_docs)

# Create a bag-of-words corpus
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]

# Train the LDA model with 10 topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10, random_state=42)

# Display the top words in each topic
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)


(0, '0.003*"db" + 0.003*"db_db" + 0.001*"baalke" + 0.001*"jupiter" + 0.001*"mov" + 0.001*"ron_baalke" + 0.001*"comet" + 0.001*"tower" + 0.001*"cooling" + 0.001*"cooling_tower"')
(1, '0.001*"dyer" + 0.001*"men" + 0.001*"homosexual" + 0.001*"providence" + 0.000*"seizure" + 0.000*"gay" + 0.000*"sexual" + 0.000*"master_slave" + 0.000*"male" + 0.000*"stafford"')
(2, '0.012*"window" + 0.008*"drive" + 0.006*"card" + 0.004*"driver" + 0.004*"disk" + 0.003*"do" + 0.003*"scsi" + 0.003*"problem" + 0.003*"controller" + 0.003*"file"')
(3, '0.007*"line" + 0.007*"subject" + 0.006*"organization" + 0.006*"would" + 0.005*"one" + 0.004*"writes" + 0.004*"get" + 0.004*"like" + 0.004*"article" + 0.003*"use"')
(4, '0.006*"game" + 0.006*"line" + 0.005*"team" + 0.005*"organization" + 0.005*"subject" + 0.004*"writes" + 0.004*"year" + 0.003*"article" + 0.003*"player" + 0.002*"university"')
(5, '0.001*"steveh" + 0.001*"revolver" + 0.000*"hendricks" + 0.000*"steve_hendricks" + 0.000*"sehari" + 0.000*"tmc" + 0.000*"

### Visualizing the topics created with pyLDAvis

- No coding required in this step

In [35]:
# Visualize with pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
lda_vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(lda_vis_data)


### *Part E*: Word Embeddings with GloVe

- Load GloVe Vectors:
    - Download pretrained GloVe embeddings from [this link](https://nlp.stanford.edu/data/glove.6B.zip).
    - From the zip file, use only glove.6B.50d.txt.
    - Load them into your notebook.

- Create Document Embeddings:
    - For each article, compute a “document embedding” by averaging the GloVe embeddings of all words in that document (or any other strategy you might prefer, e.g., using only nouns, etc.)
    - If an embedding doesn't exist for a token and you get an error, skip the token
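
The averaging strategy described above can be checked by hand on a toy example. This is a minimal sketch with made-up 3-dimensional vectors standing in for the 50-dimensional GloVe vectors; the function name `average_embedding` and the toy values are assumptions for illustration.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(component) / len(vectors) for component in zip(*vectors)]

toy_embeddings = {
    "car": [1.0, 0.0, 2.0],
    "engine": [3.0, 2.0, 0.0],
}
# "bricklin" has no vector here, so it is skipped rather than raising an error
print(average_embedding(["car", "engine", "bricklin"], toy_embeddings, dim=3))
# [2.0, 1.0, 1.0]
```

Checking membership before the lookup avoids the `KeyError` mentioned above. The all-zero fallback keeps every document embedding the same length even when no token is covered.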


In [36]:
# Path to GloVe file (update this to the location of the extracted file)
glove_file_path = "glove.6B.50d.txt"

# Load GloVe embeddings
def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype="float32")
            embeddings[word] = vector
    return embeddings

# Load the embeddings
glove_embeddings = load_glove_embeddings(glove_file_path)

# Function to compute document embedding by averaging word vectors
def compute_document_embedding(tokens, embeddings, embedding_dim=50):
    valid_vectors = [embeddings[word] for word in tokens if word in embeddings]
    
    if valid_vectors:
        return np.mean(valid_vectors, axis=0)
    else:
        return np.zeros(embedding_dim)  # Return a zero vector if no valid words are found

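# Compute document embeddings for all articles.
# Underscore-joined bigram tokens (e.g. 'sport_car') are unlikely to appear in the
# GloVe vocabulary, so in practice only the unigrams contribute to the average.
document_embeddings = np.array([compute_document_embedding(preprocess_with_bigrams(doc), glove_embeddings) for doc in documents])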

# Display shape of document embeddings matrix
print("Document embeddings shape:", document_embeddings.shape)

Document embeddings shape: (11314, 50)


### *Part F*: Check Document Similarity

Since an embedding can give a semantic representation of a document, we can use cosine similarity to measure how similar two documents are.

- Compute cosine similarity between any two documents’ embeddings.
- Print the resulting similarity score for each pair of documents.
- (Bonus points) Find the most similar document to a given query document in your corpus.
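
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out in plain Python and uses it for a nearest-neighbour lookup on toy 2-dimensional vectors. The names `cosine` and `most_similar` are hypothetical, and returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(query_index, vectors):
    """Return (index, score) of the vector closest to vectors[query_index]."""
    scores = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index  # skip the query itself
    ]
    best_score, best_index = max(scores)
    return best_index, best_score

toy_vectors = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0]]
print(cosine(toy_vectors[0], toy_vectors[2]))  # 0.0 (orthogonal vectors)
print(most_similar(0, toy_vectors))            # index 1 is the closest
```

Because cosine similarity ignores vector length, `[1.0, 0.0]` and `[2.0, 0.1]` score close to 1 even though their magnitudes differ.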

In [37]:
from sklearn.metrics.pairwise import cosine_similarity

# Function to compute cosine similarity between two document embeddings
def compute_similarity(doc1_index, doc2_index):
    return cosine_similarity(
        document_embeddings[doc1_index].reshape(1, -1), 
        document_embeddings[doc2_index].reshape(1, -1)
    )[0][0]

# Compute similarity between the first two documents
similarity_score = compute_similarity(0, 1)

# Find the most similar document to a given query document in the corpus
def find_most_similar(query_doc_index):
    similarities = cosine_similarity(
        document_embeddings[query_doc_index].reshape(1, -1), 
        document_embeddings
    )[0]
    
    # Ignore self-comparison by setting similarity with itself to -1
    similarities[query_doc_index] = -1
    most_similar_index = np.argmax(similarities)
    
    return most_similar_index, similarities[most_similar_index]

# Find most similar document to the first document
most_similar_doc_index, most_similar_score = find_most_similar(0)

# Display results
print("Cosine similarity between first two documents:", similarity_score)
print(f"Most similar document to document 0 is document {most_similar_doc_index} with similarity score {most_similar_score}")

Cosine similarity between first two documents: 0.8913003
Most similar document to document 0 is document 958 with similarity score 0.9874471426010132
