<a href="https://colab.research.google.com/github/nlahri/dsba6211-summer2024/blob/main/notebooks/dsba6211_summer2024_lab05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install umap-learn
!pip install bertopic

# Load Data

In [None]:
import pandas as pd

# import csv
data_url = "https://raw.githubusercontent.com/sultanawar321/reviews_text_classification/main/data/reviews.csv"

df = pd.read_csv(data_url)
df.head()

In [None]:
# let's keep only the raw text
docs = df['ReviewBody'].tolist()

len(docs)

In [None]:
docs[0]

In [None]:
# matplotlib histogram of the character length of each string in the docs list
import matplotlib.pyplot as plt

plt.hist([len(doc) for doc in docs], bins=100)
plt.show()

## scikit-learn pipeline

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

import umap

# Create a TF-IDF pipeline
tfidf_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=50, random_state=42)),
    ("umap", umap.UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine', random_state=42))
])

# Fit the pipeline to the text data
tfidf_pipeline.fit(docs)

# Transform the text data into the UMAP representation
umap_representations = tfidf_pipeline.transform(docs)

## TF-IDF

In [None]:
# Get the TF-IDF vectorizer from the pipeline
tfidf_vectorizer = tfidf_pipeline.named_steps["tfidf"]

# Get the vocabulary from the TF-IDF vectorizer
vocabulary = tfidf_vectorizer.vocabulary_

doc_idx = 0
print(docs[doc_idx])

In [None]:
# Get the TF-IDF weights
tfidf_weights = tfidf_vectorizer.transform(docs)

# Get for sample document
print(tfidf_weights[0])

In [None]:
import numpy as np

# Get the index of the term with the highest TF-IDF weight in the document
# (widen the slice, e.g. [-5:], to see more terms)
top_words = np.argsort(tfidf_weights[0].toarray())[0][-1:]

# Map each column index back to its term and print it
feature_names = tfidf_vectorizer.get_feature_names_out()
for word_idx in top_words:
    print(feature_names[word_idx])

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert sparse array to dense array for easier calculation
dense_weights = tfidf_weights.toarray()

# Get the feature names (words) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Explore the top 20 words with the highest average TF-IDF weights
# (axis=0 averages over documents, giving one value per word)
avg_weights = np.mean(dense_weights, axis=0)
top_idx = np.argsort(-avg_weights)[:20]
print("Top 20 words with highest average TF-IDF weights:")
for idx in top_idx:
    print(f"{feature_names[idx]}: {avg_weights[idx]:.4f}")

# Explore the distribution of average TF-IDF weights across all words
plt.hist(avg_weights, bins=50)
plt.xlabel("Average TF-IDF weight")
plt.ylabel("Frequency")
plt.title("Distribution of average TF-IDF weights")
plt.show()

In [None]:
# Find the words with the highest total TF-IDF weight across all documents
common_words = np.argsort(-np.sum(dense_weights, axis=0))[:10]
print("Words with the highest total TF-IDF weight across all documents:")
for idx in common_words:
    print(f"{feature_names[idx]}: {np.sum(dense_weights[:, idx]):.4f}")

See [this demo](https://pair-code.github.io/understanding-umap/) for more intuition on UMAP.

In [None]:
import numpy as np

# Plot the UMAP representation of the documents with matplotlib
plt.scatter(umap_representations[:, 0], umap_representations[:, 1], alpha=0.5)
plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")
plt.title("UMAP Representation of Documents")
plt.show()

Let's now cluster the UMAP representation using DBSCAN.

For more details on DBSCAN, watch [this](https://youtu.be/RDZUdRSDOok?feature=shared) StatQuest video.
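
As a small illustration of how DBSCAN behaves, the sketch below runs it on hypothetical 2-D points (made up for this example, not the UMAP output). Points in dense regions are grouped into numbered clusters, while isolated points receive the label `-1`, which scikit-learn uses for noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points: two tight groups plus one isolated point
points = np.array([
    [0.0, 0.0], [0.05, 0.0], [0.0, 0.05], [0.05, 0.05],
    [1.0, 1.0], [1.05, 1.0], [1.0, 1.05], [1.05, 1.05],
    [5.0, 5.0],
])

# eps is the neighborhood radius; min_samples is how many points (including
# the point itself) must fall within eps for a point to count as a core point
toy_labels = DBSCAN(eps=0.1, min_samples=3).fit_predict(points)
print(toy_labels)

# Count clusters, excluding the noise label -1
n_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
n_noise = int(np.sum(toy_labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

Because `eps` is a distance in the input space, a value that works on this toy data will not necessarily suit the UMAP coordinates; it usually needs tuning per dataset.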

In [None]:
import altair as alt
from sklearn.cluster import DBSCAN

# Run DBSCAN clustering on the UMAP representation
dbscan = DBSCAN(eps=0.1, min_samples=5)
dbscan_labels = dbscan.fit_predict(umap_representations)

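# Build the frame column by column so x and y stay numeric;
# cluster labels are stored as strings so they can be matched by name (e.g., "77")
df = pd.DataFrame({
    "x": umap_representations[:, 0],
    "y": umap_representations[:, 1],
    "dbscan_cats": dbscan_labels.astype(str),
    "text": docs,
})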

alt.Chart(df).mark_circle().encode(
    x="x:Q",
    y="y:Q",
    color = "dbscan_cats:N",
    tooltip=["x", "y", "dbscan_cats", "text"]
).interactive()

In [None]:
# view examples
from ipywidgets import interact

def get_examples(index, cat):
    # Return the index-th document assigned to DBSCAN cluster `cat`
    return df.loc[df["dbscan_cats"] == cat, "text"].iloc[index]

interact(get_examples, index=(0, 18), cat="77")

# Topic Modeling

This is just one implementation of topic modeling: Non-negative Matrix Factorization (NMF), a close cousin of LDA, the most popular topic modeling technique.

It's important to remember that topic modeling is a task, not a model itself. There are a variety of algorithms, many of which have recently been replaced by approaches based on word embeddings (e.g., see [BERTopic](https://maartengr.github.io/BERTopic/index.html)).

Very important points to remember about topic modeling:

1. Running the model is the start, not the end. It requires human interpretation after running the model. Do not forget this (#1 mistake by new data scientists)!

2. Validation is critical; don't just accept the model. It's unsupervised so you need to use knowledge to check the model.

3. Evaluation is hard and not as clear as supervised ML. There are metrics (e.g., Perplexity) but these are more theoretical and in practice hard to justify.

4. Topic modeling is all about exploration: "learning what you don't know". Very rarely are topic models put into production.

In [None]:
from sklearn.decomposition import NMF
# Run NMF topic modeling
nmf = NMF(n_components=20, random_state=42)
# Reuse the TF-IDF matrix computed above instead of refitting the vectorizer
nmf_topics = nmf.fit_transform(tfidf_weights)

In [None]:
def plot_top_words(model, feature_names, n_top_words, title):
    fig, axes = plt.subplots(4, 5, figsize=(30, 15), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[-n_top_words:]
        top_features = feature_names[top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 30})
        ax.tick_params(axis="both", which="major", labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)
    # Set the figure title once, after all topic panels are drawn
    fig.suptitle(title, fontsize=40)
    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
    plt.show()

tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()


plot_top_words(
    nmf, tfidf_feature_names, 10, "Topics in NMF model"
)

# Lab Questions

## Question 1

Write a Python function that uses the `similarity_matrix` and the index of a document (e.g., document 0) and returns the k-most similar documents.
- Include a 2nd optional parameter to set k, which by default should be 10.
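
Before writing the function, it may help to see what a cosine similarity matrix looks like. The sketch below uses a hypothetical 3-by-3 weight matrix (not the lab's `dense_weights`) chosen so the result is easy to verify by eye.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 3-document, 3-term weight matrix (rows are documents)
toy_weights = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],   # same direction as row 0, so similarity 1
    [0.0, 0.0, 3.0],   # shares no terms with row 0, so similarity 0
])

toy_similarity = cosine_similarity(toy_weights)
print(toy_similarity.round(2))

# Row i holds document i's similarity to every document, itself included
print(toy_similarity[0])
```

The matrix is symmetric and its diagonal is all ones, since every document is identical to itself. Cosine similarity also ignores vector length, which is why rows 0 and 1 score 1 despite different magnitudes.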

In [None]:
# Calculate the document similarity using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(dense_weights)
print("Document similarity matrix:")
print(similarity_matrix)

In [None]:
# find the 10 most similar documents to this document
docs[0]


### Answer this:

- Write a Python function/code to find the 10 most similar documents to document `0` (provide a list of the indices). You may use ChatGPT but you **must** provide your prompt (e.g., provide a [shared link](https://help.openai.com/en/articles/7925741-chatgpt-shared-links-faq)). If you don't, you'll lose points.

`!<--- ADD CODE SNIPPET BELOW AND THEN ANSWER HERE -->!`

- Read through the 10 most similar documents. What is a recurring theme in the documents? Which topic in the NMF topic model is most similar?

`!<--- WRITE ANSWER HERE -->!`

- Critique this search. How could you improve its accuracy, speed, and memory use (feel free to use outside references)?

`!<--- WRITE ANSWER HERE -->!`

In [None]:
# write code here

## Question 2

For this question, use the code below to build a UMAP visualization that color-codes documents as follows:

- Query: Document 0 from Q1 (blue)
- Results: Top 100 documents most similar to Document 0 (red)
- Other: All other documents (in light grey)

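1. Uncomment the `text_type` lines and the `color` encoding in the code below, then replace the `#######` placeholder in `df.loc[#######, "text_type"]` with your answer from Q1, changed for k=100.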
2. Run the code below and analyze the plot (e.g., zoom in and find some of the examples)

### Answer this:

- Explain why the top 100 most similar documents to the query are **not** the nearest neighbors on the plot.

`!<--- WRITE ANSWER HERE -->!`

In [None]:
# generate altair code to plot umap representations
import altair as alt

# Build the frame column by column so x and y stay numeric
df = pd.DataFrame({
    "x": umap_representations[:, 0],
    "y": umap_representations[:, 1],
    "text": docs,
})
# add column for document type; add in your function from Q1 to #######
# df["text_type"] = "Other"
# df.loc[#######, "text_type"] = "Results"
# df.loc[0, "text_type"] = "Query"

alt.Chart(df).mark_circle().encode(
    x="x:Q",
    y="y:Q",
    # color = alt.Color('text_type',
    #                   scale=alt.Scale(domain=['Other', 'Query', 'Results'],
    #                   range=['lightgray', 'blue', 'red'])),
    tooltip=["x", "y", "text"]
).interactive()


## Question 3: BERTopic

[BERTopic](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#code-overview) is a more modern approach to topic modeling.

Instead of learning word representations from scratch, BERTopic typically leverages pre-trained embeddings to provide a deeper knowledge of word meanings.

Then, you may reduce and cluster those representations (e.g., with UMAP and DBSCAN). Typically, this leads to better topics (clusters), but at the cost of more memory and computation (not too big of a problem unless you have massive data).

For these questions, you will likely need to search the [BERTopic](https://maartengr.github.io/BERTopic/index.html) website.

Simply saying "I don't know" will not earn points; the point of this exercise is to improve your skills at searching documentation and making an argument.

### What to do:

- Run the code below (`.fit_transform()` and then `.get_topic_info()`).

- Your boss asks about this dataset: "What are the most important topics that customers are talking about?" Answer this question.

`!<--- WRITE ANSWER HERE -->!`

- Your boss then says: "There are too many topics; I want fewer topics." Describe to your boss what options there are to reduce the number of topics. Rerun (add code) a topic model to accomplish this goal.

`!<--- WRITE ANSWER HERE -->!`

- Your boss says: "Hmm... there are a lot of stop words. Should we include them or remove them from the analysis? What other options do we have?" Answer this question. You may run an example, although it is not required.

`!<--- WRITE ANSWER HERE -->!`

- Your boss asks about saving the model. Generate code to save the model as a pickle file. As a check, reload your model again.

`!<--- CREATE A NEW CODE SNIPPET -->!`

- Your boss mentions that the model will be run in a slightly different production environment than Colab. What safeguards can you put in place to make sure it works and behaves consistently?

`!<--- WRITE ANSWER HERE -->!`

In [None]:
# step will take 5+ minutes
from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

In [None]:
topic_model.get_topic_info()