# Amazon Review Data

Download the [Amazon Review Data](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) with the AWS CLI. For convenience, a sample dataset is included in the `data` folder.

```
# list all datasets
aws s3 ls s3://amazon-reviews-pds/tsv/

# download the sample data
aws s3 cp s3://amazon-reviews-pds/tsv/sample_us.tsv .
```

In [27]:
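import re

import nltk
import pandas as pd
from nltk.corpus import stopwords

# Fetch the NLTK stop-word list (a no-op if it is already cached)
nltk.download('stopwords')

# Load the tab-separated sample reviews
review_data = pd.read_csv("data/sample_us.tsv", sep='\t')

stop = stopwords.words('english')
# pat1 matches runs of anything other than ASCII letters and spaces
pat1 = re.compile(r"[^a-zA-Z ]+")
# pat2 matches whole-word stop words; the match is case-sensitive, so
# capitalized forms such as "The" are left in place
pat2 = re.compile(r'\b(?:{})\b'.format('|'.join(stop)))
# regex=True is passed explicitly: pandas 2.0 changed the default to
# regex=False, which rejects compiled patterns
review_body = (
    review_data.review_body.astype(str)
    .str.replace(pat1, " ", regex=True)
    .str.replace(pat2, " ", regex=True)
    .str.strip()
)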

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
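
The stop-word pattern in the cell above matches lowercase forms only, so a word like "The" at the start of a sentence passes through. Below is a minimal sketch of a case-insensitive variant. The short `STOP_WORDS` list and the review strings are illustrative assumptions (not the NLTK list or rows from the sample data), chosen so the example runs without any downloads.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); kept short so the
# example runs without downloading NLTK data.
STOP_WORDS = ["the", "of", "they", "a", "is", "and"]

# Runs of anything other than ASCII letters and spaces
NON_LETTERS = re.compile(r"[^a-zA-Z ]+")
# Whole-word stop words, matched regardless of case
STOP_PATTERN = re.compile(
    r"\b(?:{})\b".format("|".join(map(re.escape, STOP_WORDS))),
    flags=re.IGNORECASE,
)
# Two or more consecutive spaces left behind by the substitutions
EXTRA_SPACES = re.compile(r" {2,}")


def clean_review(text: str) -> str:
    """Strip non-letters, drop stop words regardless of case, tidy spaces."""
    text = NON_LETTERS.sub(" ", text)
    text = STOP_PATTERN.sub(" ", text)
    return EXTRA_SPACES.sub(" ", text).strip()


# Made-up review text, not taken from the dataset
cleaned = clean_review("The batteries died. They lasted 2 days!")
print(cleaned)
```

With pandas, the same function could be applied through `review_data.review_body.astype(str).map(clean_review)`, assuming the column layout used above.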


In [28]:
%%time
from bertopic import BERTopic
from cuml.cluster import HDBSCAN as HDBSCAN_gpu
from cuml.manifold import UMAP as UMAP_gpu

# Use the GPU-accelerated cuML implementations of HDBSCAN and UMAP
topic_model = BERTopic(hdbscan_model=HDBSCAN_gpu(), umap_model=UMAP_gpu())
topics, probs = topic_model.fit_transform(review_body)

Label prop iterations: 6
Label prop iterations: 3
Iterations: 2
301,49,30,7,63,180
CPU times: user 6.43 s, sys: 377 ms, total: 6.8 s
Wall time: 375 ms


In [29]:
topic_model.get_topic_info()

|   | Topic | Count | Name |
|---|-------|-------|------|
| 0 | -1 | 31 | -1_calendar_toys_the_really |
| 1 | 0 | 13 | 0_the_of_yoga_same |
| 2 | 1 | 5 | 1_lego_brand_generic_class |
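
In BERTopic, topic `-1` collects the documents that were not assigned to any cluster (outliers), and the `Name` values above appear to follow a `<topic id>_<top words>` pattern. The sketch below, which hard-codes the three rows shown above, splits each name into its parts and computes the share of outlier documents. The helper `split_topic_name` is hypothetical and assumes the top words themselves contain no underscores.

```python
# Rows copied from the get_topic_info() output above: (Topic, Count, Name)
topic_rows = [
    (-1, 31, "-1_calendar_toys_the_really"),
    (0, 13, "0_the_of_yoga_same"),
    (1, 5, "1_lego_brand_generic_class"),
]


def split_topic_name(name: str) -> tuple[int, list[str]]:
    """Split a name such as '0_the_of_yoga_same' into (0, ['the', ...])."""
    topic_id, *words = name.split("_")
    return int(topic_id), words


# Total documents and the fraction that landed in the outlier topic
total_docs = sum(count for _, count, _ in topic_rows)
outlier_docs = sum(count for topic, count, _ in topic_rows if topic == -1)
outlier_share = outlier_docs / total_docs

for topic, count, name in topic_rows:
    _, words = split_topic_name(name)
    label = "outliers" if topic == -1 else f"topic {topic}"
    print(f"{label}: {count} docs, top words: {', '.join(words)}")

print(f"outlier share: {outlier_share:.1%}")
```

On this sample, 31 of the 49 documents fall into the outlier topic, which suggests the sample may be too small for the clustering step to separate many topics.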


In [30]:
topic_model.get_topic(0)

[('the', 0.08574705315990165),
 ('of', 0.052810614640541965),
 ('yoga', 0.052810614640541965),
 ('same', 0.052810614640541965),
 ('track', 0.052810614640541965),
 ('play', 0.052810614640541965),
 ('battery', 0.052810614640541965),
 ('batteries', 0.052810614640541965),
 ('great', 0.04606789477712114),
 ('they', 0.044470440895415526)]
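
Several of the top words for topic 0 are function words ("the", "of", "same", "they"). As a rough sketch, the list returned by `get_topic` can be post-filtered to keep only content words. The pairs below are copied from the output above with scores rounded to four decimals, and `STOP_WORDS` is a small hypothetical subset used for illustration, not the full NLTK list.

```python
# (word, score) pairs from the get_topic(0) output above, scores rounded
topic_0 = [
    ("the", 0.0857),
    ("of", 0.0528),
    ("yoga", 0.0528),
    ("same", 0.0528),
    ("track", 0.0528),
    ("play", 0.0528),
    ("battery", 0.0528),
    ("batteries", 0.0528),
    ("great", 0.0461),
    ("they", 0.0445),
]

# Hypothetical stop-word subset, for illustration only
STOP_WORDS = {"the", "of", "same", "they"}

# Keep the original ordering, dropping any word found in the stop list
content_words = [(word, score) for word, score in topic_0 if word not in STOP_WORDS]

for word, score in content_words:
    print(f"{word}: {score}")
```

Filtering after the fact only changes what is displayed. To keep stop words out of the topic representation itself, one option is BERTopic's `vectorizer_model` argument, for example a scikit-learn `CountVectorizer(stop_words="english")`, though that has not been run against this sample here.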