# Installation Instructions and References

## Installation

- BERTopic can be installed using pip.
- To use GPU acceleration for UMAP and HDBSCAN, you need to install RAPIDS cuML.
  - For detailed installation instructions, see: https://docs.rapids.ai/install
    - Make sure you install the build that matches your CUDA version (11.6 on Cheaha).
    - Make sure you install the matching cupy package (cupy-cuda11x for Cheaha).
  - Non-GPU-accelerated versions of these packages are available if needed.
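
The GPU/CPU choice in the list above can also be made at import time. The following is a minimal sketch, assuming the CPU packages are importable as `umap` and `hdbscan` (as in the commented-out imports later in this notebook); the `backend` variable is a hypothetical name used only for illustration.

```python
# Prefer the GPU-accelerated cuML implementations; fall back to the CPU packages.
try:
    from cuml.manifold import UMAP
    from cuml.cluster import HDBSCAN
    backend = "gpu"
except ImportError:
    try:
        from umap import UMAP
        from hdbscan import HDBSCAN
        backend = "cpu"
    except ImportError:
        # Neither implementation is installed (assumption: handled by the caller).
        UMAP = HDBSCAN = None
        backend = None

print(f"UMAP/HDBSCAN backend: {backend}")
```

With this pattern the rest of the notebook can use `UMAP` and `HDBSCAN` without caring which package provided them, although the two implementations do not necessarily accept identical parameters.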

## References

- https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html
- https://medium.com/rapids-ai/faster-topic-modeling-with-bertopic-and-rapids-cuml-5c7559aba898
- https://towardsdatascience.com/topic-modeling-with-lsa-plsa-lda-nmf-bertopic-top2vec-a-comparison-5e6ce4b1e4a5
- https://hdbscan.readthedocs.io/en/latest/parameter_selection.html
- https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/ 

In [1]:
# installs
!pip install bertopic 

"""
NOTE: to use GPU-accelerated UMAP and HDBSCAN, you need to install RAPIDS cuML.
Ensure you have the proper cupy version (cupy-cuda11x for Cheaha)
For more installation info, see: https://docs.rapids.ai/install
"""

# imports
import os
import gzip
import json
import torch
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech

# GPU accelerated
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN

# not GPU accelerated
# from umap import UMAP
# from hdbscan import HDBSCAN
    
print("imports complete")


  from .autonotebook import tqdm as notebook_tqdm


imports complete


### Download and decompress data - run to initially get data

- NOTE: this cell saves the decompressed dataset as a `json` file. If you have already run it once, reuse that file and skip this cell.

In [2]:
# download the compressed Books review dataset
!wget https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFilesSmall/Books_5.json.gz

# decompress in 5 MiB chunks so the whole file is never held in memory
chunk_size = 5 * 1024 * 1024

with gzip.open('Books_5.json.gz', 'rb') as f:
    with open("Books_5.json", 'wb') as f_out:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            f_out.write(chunk)

print("done")

done
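
The chunked copy loop in this cell can also be written with the standard library's `shutil.copyfileobj`, which performs the same read/write loop internally. This is a sketch of an equivalent alternative, not the cell's original code; the helper name `gunzip` is hypothetical.

```python
import gzip
import shutil


def gunzip(src_path, dst_path, chunk_size=5 * 1024 * 1024):
    """Decompress src_path to dst_path, copying chunk_size bytes at a time."""
    with gzip.open(src_path, "rb") as f_in, open(dst_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out, chunk_size)
```

For the dataset used here the call would be `gunzip("Books_5.json.gz", "Books_5.json")`.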


### Read data into list, concatenating the relevant items `summary` and `reviewText` for each review. Set the max number of reviews to consider.

In [7]:
%%time

# maximum number of reviews to consider
max_len = 1000000

training_data_books = []

with open('Books_5.json') as f:
    for review in f:
        review = json.loads(review)
        text = review.get("reviewText", "").strip()
        summary = review.get("summary", "").strip()
        review = summary + " " + text
        if review.strip():
            training_data_books.append(review)
        
training_data = training_data_books[:max_len]
            
print("done")


done
CPU times: user 2min 43s, sys: 22.1 s, total: 3min 6s
Wall time: 3min 6s
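
The cell above parses every line of the file and only then truncates the list to `max_len`. A possible variant, sketched below, stops reading once `max_len` non-empty reviews have been collected. The function names `iter_review_texts` and `load_reviews` are hypothetical, and unlike the original cell this version also strips the combined string, so a review with an empty summary does not start with a space.

```python
import json
from itertools import islice


def iter_review_texts(lines):
    """Yield "summary reviewText" strings from JSON-lines input, skipping empty ones."""
    for line in lines:
        record = json.loads(line)
        summary = (record.get("summary") or "").strip()
        text = (record.get("reviewText") or "").strip()
        combined = f"{summary} {text}".strip()
        if combined:
            yield combined


def load_reviews(path, max_len):
    """Read at most max_len non-empty reviews from a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        return list(islice(iter_review_texts(f), max_len))
```

Usage would be `training_data = load_reviews("Books_5.json", max_len)`.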


### Pre-calculate embeddings

In [8]:
%%time

# Generate embeddings for data, using a GPU if available

if torch.cuda.is_available():
    device = 'cuda' 
    print("using gpu")
else:
    device = 'cpu'

model = SentenceTransformer('all-MiniLM-L6-v2').to(device)

embeddings = model.encode(training_data, device=device, show_progress_bar=True)



using gpu


Batches: 100%|██████████| 31250/31250 [11:48<00:00, 44.11it/s] 


CPU times: user 11min 22s, sys: 1min 7s, total: 12min 29s
Wall time: 12min 20s
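
The batch count in the progress bar can be checked with a little arithmetic. This sketch assumes `encode` ran with its default `batch_size` of 32, since the cell above does not set one; the elapsed time is read off the progress bar.

```python
import math

num_reviews = 1_000_000      # len(training_data) in this run
batch_size = 32              # assumption: encode() default, not set in the cell above
num_batches = math.ceil(num_reviews / batch_size)
print(num_batches)           # 31250, matching the progress bar

elapsed_seconds = 11 * 60 + 48   # 11:48 from the progress bar
reviews_per_second = num_reviews / elapsed_seconds
print(f"{reviews_per_second:.0f} reviews/s")
```

This works out to roughly 1,400 reviews encoded per second on the GPU used for this run.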


### Train and Save BERTopic model

In [21]:
%%time

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

print(f"training data amount: {len(training_data)}")

# reduces dimensionality
umap_model = UMAP(random_state=42)

# does clustering
hdbscan_model = HDBSCAN(min_cluster_size=300, min_samples=50)

# tokenize into unigrams and bigrams, remove English stop words, ignore terms with document frequency below 5
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=5)


keybert_model = KeyBERTInspired()
pos_model = PartOfSpeech("en_core_web_sm")
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# Representation models
representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
    "POS": pos_model
}

bert_model = BERTopic(
    embedding_model=model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=10,
    nr_topics=500,
    verbose=True
)

topics, probs = bert_model.fit_transform(training_data, embeddings)

bert_model.save("bertopic_model_books_1M_3")

print("model trained")

training data amount: 1000000


2023-11-26 23:43:54,155 - BERTopic - Reduced dimensionality
2023-11-26 23:44:47,139 - BERTopic - Clustered reduced embeddings
2023-11-26 23:49:20,045 - BERTopic - Reduced number of topics from 246 to 246


model trained
CPU times: user 8min 39s, sys: 2min 10s, total: 10min 49s
Wall time: 10min 6s
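
Once the model has been saved, later sessions can reload it instead of retraining. The sketch below is one way to do that with `BERTopic.load`; the wrapper name `load_saved_model` is hypothetical. It assumes the model was written with the default `save` settings used above, in which case the environment that loads it presumably needs the same packages (including cuML) that were used for training.

```python
import os


def load_saved_model(path="bertopic_model_books_1M_3"):
    """Load a BERTopic model previously written by bert_model.save(path)."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"no saved model at {path!r}; run the training cell first")
    # Imported lazily so the path check works even where bertopic is not installed.
    from bertopic import BERTopic
    return BERTopic.load(path)
```

A later session could then call `bert_model = load_saved_model()` and go straight to the visualization cell.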


### Save visualizations of results, print info on each topic and its representations

In [22]:
bert_model.visualize_topics().write_html("./intertopic_dist_model_800K.html")
bert_model.visualize_barchart(top_n_topics=25).write_html("./barchart_model_800K.html")
bert_model.visualize_hierarchy().write_html("./hierarchy_model_800K.html")

print(bert_model.get_topic_info())

     Topic   Count                                               Name  \
0       -1  393764                            -1_book_read_story_like   
1        0     469          0_book stars_stars great_great book_stars   
2        1     824          1_book stars_stars great_great book_stars   
3        2     323  2_excellent stars_stars excellent_excellent_stars   
4        3     649  3_excellent stars_stars excellent_stars good_b...   
..     ...     ...                                                ...   
241    240     319                240_minor_winchester_dictionary_oed   
242    241  248561                           241_read_book_story_good   
243    242     562                     242_trump_history_browne_urras   
244    243    4210                  243_condition_arrived_cover_print   
245    244     304          244_christina_wyeth_andrew wyeth_painting   

                                        Representation  \
0    [book, read, story, like, good, just, life, gr...   
1    [b