## HW-4: Modeling Natural Language Data (50 marks)

The assignment uses the 20 newsgroups text dataset. The 20 newsgroups dataset comprises around 12000 newsgroups posts on 20 topics split in two subsets: one for training and the other one for testing. The split between the train and test set is based upon a messages posted before and after a specific date.

In this assignment, you will complete the following text-processing Pipeline using Python:

1. Clean and preprocess the text (tokenization, noise removal, normalization)
2. Create Bag-of-Words (BoW) and TF-IDF representations
3. Extend BoW and TF-IDF with Bigrams
4. Perform topic modeling with 10 topics and visualize using pyLDAvis
5. Use GloVe embeddings to check document similarity

In [1]:
# Importing required packages
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

import requests
import pandas as pd
import numpy as np
import re

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
import nltk
import os

# Explicitly set the path to your NLTK data
nltk.data.path.append("/Users/neelaropp/nltk_data")

# Ensure all necessary resources are downloaded
nltk.download('punkt', download_dir="/Users/neelaropp/nltk_data")
nltk.download('stopwords', download_dir="/Users/neelaropp/nltk_data")
nltk.download('wordnet', download_dir="/Users/neelaropp/nltk_data")


[nltk_data] Downloading package punkt to /Users/neelaropp/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/neelaropp/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/neelaropp/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:

# Fetching 20 Newsgroups data
newsgroups_data = fetch_20newsgroups(subset='train')
documents = newsgroups_data.data
print("Number of articles:", len(documents))


Number of articles: 11314


### *Part A*: Text Preprocessing

Write a function preprocess_text(text) that performs all of the below steps and returns a list of clean tokens. Apply this function to every article in your dataset:
- Tokenize the articles (you may use nltk.word_tokenize, spacy, or any other tokenizer).
- Remove Noise:
    - Filter out non-alphabetic tokens (numbers, punctuation, etc.).
    - Remove stopwords.
    - Optionally remove or handle repeated characters, HTML tags, etc.
- Normalize:
    - Convert text to lowercase.
    - Apply lemmatization.

**Additional Instructions**:
- Show sample output for the first 2 articles/documents


In [4]:
# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # Tokenize text
    tokens = word_tokenize(text)
    
    # Remove non-alphabetic tokens and stopwords
    clean_tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalpha() and word not in stop_words]
    
    return clean_tokens

# Apply preprocessing to first two documents
processed_docs = [preprocess_text(doc) for doc in documents[:2]]

# Display sample output for first 2 documents
for i, doc in enumerate(processed_docs):
    print(f"Processed Document {i+1}: \n", doc[:50], "...\n")


LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/Users/neelaropp/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.12/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.12/share/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.12/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Users/neelaropp/nltk_data'
**********************************************************************


### *Part B*: Bag-of-Words Vectorization:

- Use CountVectorizer from sklearn.feature_extraction.text to transform your entire corpus into a BoW representation.
- Print the vocabulary size (i.e., the number of unique words after preprocessing).
- Show the BoW vector for one sample document.


In [4]:
# Write your code below


### *Part C*: Add Bigrams

- Extend your vectorization to include bigrams
    - Use bigrams function from nltk (imported below) to generate bigrams
    - Then combine original unigrams with the new bigram tokens to extend the vocabulary
    - Recreate the BoW vectors with the new vocabulary (we will be only using this recreated BoW ahead)
- Compare the resulting vocabulary size with the one created in the previous question


In [5]:
from nltk import bigrams

In [6]:
# Write your code below


### *Part D*: Topic Modeling with LDA

Use the BoW representation you created in the previous question for topic modeling.

**Steps**:
- Utilize Gensim’s LdaModel to learn 10 topics
    - Convert the BoW representation into Gensim’s compatible format: you can do so by creating a Dictionary and a corpus
    - Train `LdaModel(num_topics=10, ...)`
    
Take assistance from the [Gensim LDAModel Documentation](https://radimrehurek.com/gensim/models/ldamodel.html) for this step

In [7]:
import gensim
from gensim import corpora

In [8]:
# Write your code below

# Create a dictionary from the processed data

# Create a bag-of-words corpus

# Train the LDA model with 10 topics

### Visualizing the topics created with pyLDAvis

- No coding required in this step

In [9]:
# Visualize with pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
lda_vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(lda_vis_data)


ModuleNotFoundError: No module named 'pyLDAvis'

### *Part E*: Word Embeddings with GloVe

- Load GloVe Vectors:
    - Download pretrained GloVe embeddings from [this link](https://nlp.stanford.edu/data/glove.6B.zip).
    - From the zip file, use only glove.6B.50d.txt.
    - Load them into your notebook.

- Create Document Embeddings:
    - For each article, compute a “document embedding” by averaging the GloVe embeddings of all words in that document (or any other strategy you might prefer, e.g., using only nouns, etc.)
    - If an embedding doesn't exist for a token and you get an error, skip the token


In [None]:
# Write your code below

# Load GloVe Embeddings

# Create document embeddings by averaging

### *Part F*: Check Document Similarity:

Since an embedding can give semantic representation of a document, we can use cosine similarity to see how similar a pair of documents are.

- Compute cosine similarity between any two documents’ embeddings.
- Print out the resulting similarity score for the pairs of documents.
- (Bonus points) Find the most similar document to a given query document in your corpus.

In [None]:
# Write your code below