# Installation Instructions and References

## Installation

- BERTopic can be installed with pip (`pip install bertopic`).
- GPU acceleration for UMAP and HDBSCAN requires RAPIDS cuML.
  - For installation details, see https://docs.rapids.ai/install
    - Install the build that matches your CUDA version (11.6 on Cheaha).
    - Install the matching cupy package (cupy-cuda11x on Cheaha).
  - CPU-only versions of UMAP and HDBSCAN can be used instead if no GPU is available.
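
A guarded check is one way to let the same notebook run with or without RAPIDS. The sketch below is an assumption about how this could be arranged, not part of the original notebook: `pick_backend` is a hypothetical helper that only tests whether the `cuml` package is installed.

```python
import importlib.util


def pick_backend():
    """Return "cuml" if RAPIDS cuML is installed, otherwise "cpu"."""
    # find_spec looks the package up without importing it.
    found = importlib.util.find_spec("cuml") is not None
    return "cuml" if found else "cpu"


backend = pick_backend()
print(f"UMAP/HDBSCAN backend: {backend}")

# The result could then drive the imports, for example:
# if backend == "cuml":
#     from cuml.manifold import UMAP
#     from cuml.cluster import HDBSCAN
# else:
#     from umap import UMAP
#     from hdbscan import HDBSCAN
```

Checking for the package up front keeps the choice of backend in one place, so the rest of the notebook can use `UMAP` and `HDBSCAN` without caring which implementation was imported.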

## References

- https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html
- https://medium.com/rapids-ai/faster-topic-modeling-with-bertopic-and-rapids-cuml-5c7559aba898
- https://towardsdatascience.com/topic-modeling-with-lsa-plsa-lda-nmf-bertopic-top2vec-a-comparison-5e6ce4b1e4a5
- https://hdbscan.readthedocs.io/en/latest/parameter_selection.html
- https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/

In [2]:
# installs
# !pip install bertopic 

"""
NOTE: to use GPU-accelerated UMAP and HDBSCAN, you need to install RAPIDS cuML.
Ensure you have the proper cupy version (cupy-cuda11x for Cheaha)
For more installation info, see: https://docs.rapids.ai/install
"""

# imports
import os
import gzip
import json
import torch
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech

# GPU accelerated
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN

# not GPU accelerated
# from umap import UMAP
# from hdbscan import HDBSCAN
    
print("imports complete")


imports complete


### Download and decompress data (run once to fetch the data)

- NOTE: Each dataset downloaded by this cell is saved as a `json` file that can be reused, so you do not need to run the cell again. To download only one dataset, comment out the others. Some of these datasets are large, so the download may take some time.
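
One way to make the cell safe to re-run is to skip any dataset whose decompressed file already exists. The sketch below is an illustration rather than the notebook's own code: `decompress_if_missing` is a hypothetical helper, and it uses `shutil.copyfileobj` to stream the archive in chunks.

```python
import gzip
import os
import shutil


def decompress_if_missing(gz_path, out_path, chunk_size=5 * 1024 * 1024):
    """Decompress gz_path to out_path unless out_path already exists.

    Returns True if the file was decompressed, False if it was skipped.
    """
    if os.path.exists(out_path):
        return False
    with gzip.open(gz_path, "rb") as f_in, open(out_path, "wb") as f_out:
        # Copy in fixed-size chunks so the whole file never sits in memory.
        shutil.copyfileobj(f_in, f_out, chunk_size)
    return True
```

A call such as `decompress_if_missing("Books_5.json.gz", "Books_5.json")` would then do nothing on a second run. Note that the existence check cannot tell a complete file from one left behind by an interrupted run, so a partial file would need to be deleted by hand.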

In [2]:
## Download and decompress data

base_url = "https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFilesSmall"

# Comment out any dataset you do not need
datasets = [
    "Books_5",
    "Home_and_Kitchen_5",
    "Sports_and_Outdoors_5",
    "Electronics_5",
]

chunk_size = 5 * 1024 * 1024  # decompress 5 MiB at a time

for name in datasets:
    !wget {base_url}/{name}.json.gz

    # Stream the archive to disk in chunks so it never has to fit in memory
    with gzip.open(f"{name}.json.gz", 'rb') as f:
        with open(f"{name}.json", 'wb') as f_out:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                f_out.write(chunk)

print("done")

done


### Read data into lists, concatenating `summary` and `reviewText` for each review

In [3]:
%%time

training_data_books = []
training_data_outdoors = []
training_data_electronics = []
training_data_kitchen=[]
with open('Books_5.json') as f:
    for review in f:
        review = json.loads(review)  # parse each line once
        text = review.get("reviewText", "").strip()
        summary = review.get("summary", "").strip()
        review = summary + " " + text
        if review.strip():
            training_data_books.append(review)
print("Books done")
    
with open('Home_and_Kitchen_5.json') as f:
    for review in f:
        review = json.loads(review)
        text = review.get("reviewText", "").strip()
        summary = review.get("summary", "").strip()
        review = summary + " " + text
        if review.strip():
            training_data_kitchen.append(review)
            
print("kitchen done")

with open('Sports_and_Outdoors_5.json') as f:
    for review in f:
        review = json.loads(review)
        text = review.get("reviewText", "").strip()
        summary = review.get("summary", "").strip()
        review = summary + " " + text
        if review.strip():
            training_data_outdoors.append(review)
            
print("outdoors done")

with open('Electronics_5.json') as f:
    for review in f:
        review = json.loads(review)
        text = review.get("reviewText", "").strip()
        summary = review.get("summary", "").strip()
        review = summary + " " + text
        if review.strip():
            training_data_electronics.append(review)
print("electronics done")


Books done
kitchen done
outdoors done
electronics done
CPU times: user 6min 4s, sys: 28.4 s, total: 6min 33s
Wall time: 6min 33s
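
The four loops above do the same work on different files, so they could be folded into a single function. The following is a minimal sketch under that assumption: `load_reviews` is a hypothetical name, and it expects one JSON object per line, as the loops above do. The explicit UTF-8 encoding is also an assumption.

```python
import json


def load_reviews(path):
    """Return "summary reviewText" strings for each non-empty review in a JSON-lines file."""
    reviews = []
    with open(path, encoding="utf-8") as f:  # assumes UTF-8 input
        for line in f:
            record = json.loads(line)
            text = record.get("reviewText", "").strip()
            summary = record.get("summary", "").strip()
            combined = summary + " " + text
            # Skip reviews where both fields are missing or blank.
            if combined.strip():
                reviews.append(combined)
    return reviews
```

With this helper, each dataset would load in one line, for example `training_data_books = load_reviews("Books_5.json")`.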


### Select the number of reviews to consider from each topic dataset

In [4]:
num_book_reviews = 0
num_electronics_reviews = 0
num_outdoors_reviews = 500000
num_kitchen_reviews = 500000

training_data=[]
training_data.extend(training_data_books[:num_book_reviews])
training_data.extend(training_data_electronics[:num_electronics_reviews])
training_data.extend(training_data_outdoors[:num_outdoors_reviews])
training_data.extend(training_data_kitchen[:num_kitchen_reviews])

print(f"Sample review: {training_data[0]}")    

Sample review: Five Stars What a spectacular tutu! Very slimming.
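
Slicing with `[:n]` takes the first `n` reviews in file order. If that order is not random, a seeded random sample may be more representative. This is a sketch of an alternative, not what the notebook does, and `sample_reviews` is a hypothetical helper.

```python
import random


def sample_reviews(reviews, n, seed=42):
    """Return up to n reviews chosen uniformly at random, reproducibly."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    return rng.sample(reviews, min(n, len(reviews)))
```

For example, `training_data.extend(sample_reviews(training_data_outdoors, num_outdoors_reviews))` would replace the corresponding slice.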


### Pre-calculate embeddings

In [5]:
%%time

# Generate embeddings for data, using a GPU if available

if torch.cuda.is_available():
    device = 'cuda' 
    print("using gpu")
else:
    device = 'cpu'

model = SentenceTransformer('all-MiniLM-L6-v2').to(device)

embeddings = model.encode(training_data, device=device, show_progress_bar=True)



using gpu


Batches: 100%|██████████| 31250/31250 [05:05<00:00, 102.22it/s]


CPU times: user 11min 42s, sys: 6min 11s, total: 17min 54s
Wall time: 5min 54s
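
Because encoding this many reviews takes several minutes, it may be worth caching the embeddings on disk. The sketch below assumes a hypothetical helper, `load_or_compute`; `compute` stands in for a call such as `lambda: model.encode(training_data, device=device)`, and the file name is chosen only for illustration.

```python
import os

import numpy as np


def load_or_compute(path, compute):
    """Load an array from path if it exists; otherwise compute, save, and return it.

    path should end in ".npy", because np.save appends that suffix when it is missing.
    """
    if os.path.exists(path):
        return np.load(path)
    array = np.asarray(compute())
    np.save(path, array)
    return array
```

A call might look like `embeddings = load_or_compute("embeddings.npy", lambda: model.encode(training_data, device=device))`. The cache is keyed only on the file name, so it has to be deleted whenever `training_data` or the embedding model changes.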


### Train and Save BERTopic model

In [6]:
%%time

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

print(f"training data amount: {len(training_data)}")

# UMAP reduces the dimensionality of the embeddings before clustering
umap_model = UMAP(random_state=42)

# HDBSCAN clusters the reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=300, min_samples=50)

# Tokenize into unigrams and bigrams, drop English stop words,
# and apply a minimum document frequency of 5
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=5)

# Additional topic representations
keybert_model = KeyBERTInspired()
pos_model = PartOfSpeech("en_core_web_sm")
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# Representation models
representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
    "POS": pos_model
}

bert_model = BERTopic(
    embedding_model=model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=10,
    nr_topics=500,
    verbose=True
)

topics, probs = bert_model.fit_transform(training_data, embeddings)

bert_model.save("bertopic_model_sports_outdoors_1M")

print("model trained")

training data amount: 1000000


2023-11-30 22:50:02,561 - BERTopic - Reduced dimensionality
2023-11-30 22:50:51,241 - BERTopic - Clustered reduced embeddings
2023-11-30 22:53:08,970 - BERTopic - Reduced number of topics from 244 to 244


model trained
CPU times: user 6min 24s, sys: 2min 6s, total: 8min 30s
Wall time: 7min 41s
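
BERTopic assigns topic `-1` to documents that HDBSCAN leaves unclustered, so the share of `-1` entries in `topics` is a quick check on how much of the data the model actually grouped. A minimal sketch, with `outlier_fraction` as a hypothetical helper:

```python
from collections import Counter


def outlier_fraction(topics):
    """Return the fraction of documents assigned to the outlier topic (-1)."""
    if len(topics) == 0:
        return 0.0
    counts = Counter(topics)
    return counts[-1] / len(topics)
```

Calling `outlier_fraction(topics)` after `fit_transform` gives a single number to compare across runs, for example when trying different `min_cluster_size` or `min_samples` values.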


### Save visualizations of results, print info on each topic and its representations

In [5]:
bert_model.visualize_topics().write_html("./intertopic_dist_sports_outdoors_1M.html")
bert_model.visualize_barchart(top_n_topics = 25).write_html("./barchart_sports_outdoors_1M.html")
bert_model.visualize_hierarchy().write_html("./hierarchy_sports_outdoors_1M.html")

print(bert_model.get_topic_info())

     Topic   Count                                         Name  \
0       -1  532506                      -1_stars_great_good_use   
1        0    1574        0_great stars_stars great_stars_great   
2        1     869           1_good stars_stars good_stars_good   
3        2     331           2_good stars_stars good_stars_good   
4        3     976  3_great stars_stars works_works great_works   
..     ...     ...                                          ...   
239    238     382              238_sturdy_rack_clothes_garment   
240    239   32226               239_product_quality_good_price   
241    240     380                 240_washer_dryer_shelf_color   
242    241    2716                      241_cake_pan_bacon_pans   
243    242   10625                  242_pan_cast iron_cast_iron   

                                        Representation  \
0    [stars, great, good, use, just, like, product,...   
1    [great stars, stars great, stars, great, , , ,...   
2    [good stars, sta