# Lab 04: Extracting topics from the NYT comments section

# Part I: Data preprocessing

### Load dataset

Data source: https://georgetown.box.com/s/wda8q83elj0khd6khsvdzzyaumpisrg2

*Note: This data consists of the comments section from the New York Times in Fall 2020. Some of its content may be offensive and toxic.*

In [None]:
import pandas as pd

fp = "../../.local/nyt-comments-part8.csv.zip"
df = pd.read_csv(fp)
df = df.sample(frac=1).reset_index(drop=True)

### Dataset size

In [None]:
df.shape

### Data format

In [None]:
df.iloc[0]

### Example comment

In [None]:
df['commentBody'][22]

### Reuse the spaCy pipeline from Lab-02 for text normalization & preprocessing

In [None]:
import re
import spacy
from spacy.language import Language

M = 2500  # number of (shuffled) comments to preprocess

pipeline = spacy.load('en_core_web_sm')

# http://emailregex.com/
email_re = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

# replace = [ (pattern-to-replace, replacement),  ...]
# Rules are applied in order, so emails are mapped before "@" and "." are stripped.
replace = [
    (r"<a[^>]*>(.*?)</a>", r"\1"),  # Strip HTML anchor tags, keeping the link text
    (email_re, "email"),            # Map email addresses to the token "email"
    (r"(?<=\d),(?=\d)", ""),        # Remove commas inside numbers
    (r"\d+", "number"),             # Map runs of digits to the token "number"
    (r"[\t\n\r\*\.\@\,\-\/]", " "), # Replace selected punctuation and other junk with spaces
    (r"\s+", " ")                   # Collapse extra whitespace
]

comments = []
for comment in df['commentBody'][:M]:
    for pattern, replacement in replace:
        comment = re.sub(pattern, replacement, comment)
    comments.append(comment)


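# Custom final step: drop stop words and punctuation, lemmatize and lowercase
# the remaining tokens, and return them joined as a single string.
# Because it returns a string rather than a Doc, it must stay last in the pipeline.
@Language.component("nytComments")
def ng20_preprocess(doc):
    tokens = [token for token in doc
              if not any((token.is_stop, token.is_punct))]
    tokens = [token.lemma_.lower().strip() for token in tokens]
    tokens = [token for token in tokens if token]  # discard empty strings
    return " ".join(tokens)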


pipeline.add_pipe("nytComments")
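To see what the substitution rules do before spaCy gets involved, here is a minimal sketch that applies a simplified copy of them to a made-up string. The `rules` list, the `normalize` helper, and the sample text are illustrative assumptions; the long email pattern is left out for brevity.

```python
import re

# Simplified copy of the substitution rules (email pattern omitted).
rules = [
    (r"<a[^>]*>(.*?)</a>", r"\1"),     # keep link text, drop the anchor tag
    (r"(?<=\d),(?=\d)", ""),           # 1,250 -> 1250
    (r"\d+", "number"),                # any run of digits -> "number"
    (r"[\t\n\r\*\.\@\,\-\/]", " "),    # selected punctuation -> space
    (r"\s+", " "),                     # collapse whitespace
]


def normalize(text):
    """Apply each (pattern, replacement) rule in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


# Made-up sample comment, not taken from the dataset.
sample = 'See <a href="https://example.com">this link</a>: 1,250 cases...\nWow!'
cleaned = normalize(sample)
print(cleaned)
```

Note that characters outside the punctuation class (such as `:` and `!`) survive this step; they are only removed later by the `token.is_punct` filter in the spaCy component.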

### Pass data through our spaCy pipeline

In [None]:
docs = []
for comment in comments[:M]:
    docs.append(pipeline(comment))

### Compute number of unique words (vocabulary size)

In [None]:
vocab_size = len(set(" ".join(docs).split(" ")))
vocab_size

# Part II: Build Features

### Build the term-document matrix (i.e., BOW features)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tf_vectorizer = CountVectorizer(max_features=vocab_size, max_df=0.95, min_df=0.005, stop_words='english')
# tf_vectorizer = TfidfVectorizer(max_features=vocab_size, max_df=0.95, stop_words='english')
X = tf_vectorizer.fit_transform(docs)
type(X), X.shape

### Create an index-to-word map

In [None]:
idx2word = {idx: word for word, idx in tf_vectorizer.vocabulary_.items()}

### Check X for correct counts

In [None]:
# tf_vectorizer.vocabulary_
docs[0]

In [None]:
idx = tf_vectorizer.vocabulary_['covid']
X[0, idx]

### Plotting subroutine to visualize words

In [None]:
import math

import matplotlib.pyplot as plt


def plot_top_words(model, feature_names, n_top_words, title):
    """Draw one horizontal bar chart of the top-weighted words per topic."""
    # Size the grid from the fitted model (five panels per row) instead of
    # relying on a global K, which is only defined later in the notebook.
    n_topics = model.components_.shape[0]
    n_rows = math.ceil(n_topics / 5)
    fig, axes = plt.subplots(n_rows, 5, figsize=(30, 15), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[::-1][:n_top_words]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]
        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f'Topic {topic_idx + 1}',
                     fontdict={'fontsize': 30})
        ax.invert_yaxis()
        ax.tick_params(axis='both', which='major', labelsize=20)
        for side in 'top right left'.split():
            ax.spines[side].set_visible(False)
    # Hide any unused panels in the last row.
    for ax in axes[n_topics:]:
        ax.set_visible(False)
    fig.suptitle(title, fontsize=40)
    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
    plt.show()

### (10 pts) Task 1: Build an LSA model

In this task we are going to build an LSA topic model from scratch. In lecture-04, we learned about LSA from the perspective of document retrieval. For document retrieval, you'll recall that we computed a truncated SVD by choosing some number of dimensions $K \ll N$. This gave us the left singular column vectors, $\mathbf{U} \in \mathbb{R}^{N \times K}$, and the diagonal singular value matrix, $\boldsymbol{\Sigma} \in \mathbb{R}^{K \times K}$, that we needed in order to project our queries, $\mathbf{q} \in \mathbb{R}^{N}$, and documents, $\mathbf{d} \in \mathbb{R}^{N}$, into $\mathbb{R}^{K}$. Recall that the operation to do that was:

$$\hat{\mathbf{q}} = \mathbf{q}\mathbf{U}\boldsymbol{\Sigma}^{-1}$$

In this task, we're going to evaluate the singular values, $\sigma_{j}$, on the diagonal of $\boldsymbol{\Sigma}$, and their corresponding basis vectors, $\mathbf{u}^{(j)}$, in $\mathbf{U}$, to extract the principal themes in the data. Execute the following subtasks.

1. For each column vector, print out the top 10 most relevant words.
2. Visualize the top 10 words using the `plot_top_words()` function provided above.
3. What effect does the hyperparameter $K$ have on the result?
4. Is there a principled way to determine the best $K$?
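As a reminder of how the projection recalled above works, here is a minimal sketch on a toy matrix. It assumes the lecture's orientation (terms as rows, documents as columns); the matrix values, the `project` helper, and the query vector are made up for illustration.

```python
import numpy as np

# Toy term-document matrix A (N = 5 terms, M = 4 documents); values are made up.
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
    [1, 1, 1, 1],
], dtype=float)

K = 2  # number of latent dimensions to keep (K << N in practice)

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :K]                   # (N, K) leading left singular vectors
S_k_inv = np.diag(1.0 / s[:K])   # (K, K) inverse of the truncated Sigma


def project(vec):
    """Map an N-dimensional term vector into the K-dimensional latent space."""
    return vec @ U_k @ S_k_inv


d_hat = project(A[:, 0])                                  # first document
q_hat = project(np.array([1, 1, 0, 0, 0], dtype=float))   # hypothetical query
print(d_hat, q_hat)
```

A useful sanity check: projecting a column of `A` this way recovers the first `K` entries of the matching column of `Vt`, since $\mathbf{d}\mathbf{U}\boldsymbol{\Sigma}^{-1}$ undoes the $\mathbf{U}\boldsymbol{\Sigma}$ factor of the decomposition. Note that sklearn's vectorizers return the transposed layout (documents as rows), so the roles of the left and right singular vectors swap there.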

In [None]:
from sklearn.decomposition import TruncatedSVD

In [None]:
K = 20 # number of topics (feel free to adjust)
# Your code goes here

### (5 pts) Task 2: Compute distances between topics

In this task, we're going to compute distances between topics. The right singular vectors produced from SVD are the $K$ most relevant directions in our original $N$-dimensional space. Each topic, or theme, is nothing more than a weighted sum over the words. Compute the cosine distance between each of the topic vectors. Are these distances consistent (in an intuitive sense) with the top words in each topic?
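For reference, cosine distance is one minus the cosine similarity. The sketch below computes the full pairwise matrix for a handful of made-up topic vectors; the `topics` array and the `cosine_distance_matrix` helper are illustrative assumptions, not output from any model in this lab.

```python
import numpy as np

# Hypothetical topic-word weights: 3 topics over a 4-word vocabulary.
topics = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
])


def cosine_distance_matrix(vectors):
    """Return the matrix of pairwise cosine distances (1 - cosine similarity)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms          # scale every row to unit length
    return 1.0 - unit @ unit.T


D = cosine_distance_matrix(topics)
print(np.round(D, 3))
```

In this toy case the first two topics put their weight on the same words and end up close together, while the third shares no words with them and sits at distance 1.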

In [None]:
# Your code goes here

### (5 pts) Task 3: Perform topic extraction using the NMF and LDA models from sklearn

In this task we perform topic extraction using the Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) models provided by sklearn. 

1. Fit the NMF and LDA models in the provided cells below.
2. Visualize the results using the `plot_top_words` function.
3. How do the results compare to your home-spun LSA topic model?
4. What are the differences between these models that might give rise to these results?

In [None]:
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [None]:
# Your code goes here

### (5 pts extra credit) Task 4: Fit all three of these topic models on a new dataset

Choose a new dataset, perhaps one that naturally breaks down into topics (unlike the comments section of the NYT in Fall 2020), and fit these models using the same approach. Summarize any conclusions that you make with a plot.

In [None]:
# Your code goes here