### Amazon Arts, Crafts, and Sewing Reviews LSA and LDA Modeling Project

Chosen Dataset: Amazon Arts and Crafts Reviews (2018)
Link: https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/

### Code:

In [35]:
# LSA Model
# Imports
import os
import re
import json
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Downloads
nltk.download('punkt')
nltk.download('stopwords')

# Load Data
file_path = r"/Users/christinewu/Downloads/Arts_Crafts_and_Sewing_5.json"
review_list = []

with open(file_path, "r") as infile:
    for line in infile:
        review = json.loads(line)
        review_list.append(review)

# Preprocessing Function
from nltk.tokenize import TreebankWordTokenizer

# Build the tokenizer and stopword set once instead of on every call
tokenizer = TreebankWordTokenizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    """Lowercase, strip non-letters, tokenize, and drop stopwords; return a string."""
    if not isinstance(text, str):
        return ''
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = tokenizer.tokenize(text)
    return ' '.join(word for word in tokens if word not in stop_words)

# Apply Preprocessing
texts = [preprocess(review['reviewText']) for review in review_list if 'reviewText' in review]

# LSA
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_df=0.9, min_df=5, stop_words='english')
X = vectorizer.fit_transform(texts)
feature_names = vectorizer.get_feature_names_out()
n_topics = 5
lsa_model = TruncatedSVD(n_components=n_topics, random_state=42)
lsa_topic_matrix = lsa_model.fit_transform(X)

# Print top words per topic
for i, topic in enumerate(lsa_model.components_):
    top_words = [feature_names[j] for j in topic.argsort()[-10:]]
    print(f"LSA Topic #{i + 1}: {', '.join(top_words[::-1])}")



LSA Topic #1: use, great, machine, like, love, good, paper, used, really, set
LSA Topic #2: machine, sewing, thread, foot, machines, bobbin, needle, embroidery, brother, stitches
LSA Topic #3: great, machine, price, product, sewing, thread, works, love, embroidery, foot
LSA Topic #4: use, paper, great, cut, easy, cutting, glue, mat, cricut, blade
LSA Topic #5: love, yarn, use, easy, needles, perfect, hooks, crochet, make, colors


In [39]:
# LDA Model
# Imports
import json
import re
import random
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TreebankWordTokenizer
from gensim import corpora, models

# Downloads
nltk.download('stopwords')

# Load Data
file_path = r"/Users/christinewu/Downloads/Arts_Crafts_and_Sewing_5.json"
review_list = []
with open(file_path, "r") as infile:
    for line in infile:
        review = json.loads(line)
        review_list.append(review)

# Preprocessing Function
stop_words = set(stopwords.words('english'))
tokenizer = TreebankWordTokenizer()

def preprocess(text):
    if not isinstance(text, str):
        return []
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = tokenizer.tokenize(text)
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

# Apply Preprocessing/Tokenization
texts = [preprocess(review.get('reviewText', '')) for review in review_list]

# Filter out empty token lists
texts = [text for text in texts if len(text) > 0]

# Sample a smaller subset for faster training (at most 10,000 reviews)
sampled_texts = random.sample(texts, min(10000, len(texts)))

# Create dictionary and corpus
dictionary = corpora.Dictionary(sampled_texts)
dictionary.filter_extremes(no_below=10, no_above=0.5)

bow_corpus = [dictionary.doc2bow(doc) for doc in sampled_texts]

# Train LDA model
lda_model = models.LdaMulticore(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=5,
    passes=5,
    workers=2,
    chunksize=2000
)

# Show topics
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {0} \n Words: {1}".format(idx, topic))
    print("\n")
    
# Visualize
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)
pyLDAvis.display(vis)



Topic: 0 
 Words: 0.028*"love" + 0.019*"nice" + 0.014*"one" + 0.014*"use" + 0.013*"perfect" + 0.011*"color" + 0.010*"size" + 0.010*"easy" + 0.009*"great" + 0.009*"colors"


Topic: 1 
 Words: 0.048*"great" + 0.036*"good" + 0.023*"product" + 0.019*"quality" + 0.018*"use" + 0.014*"price" + 0.012*"nice" + 0.012*"colors" + 0.010*"would" + 0.009*"well"


Topic: 2 
 Words: 0.013*"one" + 0.013*"machine" + 0.012*"get" + 0.011*"use" + 0.010*"dont" + 0.009*"ive" + 0.009*"used" + 0.009*"like" + 0.009*"love" + 0.008*"im"


Topic: 3 
 Words: 0.022*"yarn" + 0.015*"beautiful" + 0.012*"good" + 0.012*"make" + 0.012*"color" + 0.012*"work" + 0.011*"well" + 0.008*"great" + 0.008*"love" + 0.008*"easy"


Topic: 4 
 Words: 0.023*"use" + 0.020*"great" + 0.014*"like" + 0.011*"paper" + 0.009*"used" + 0.009*"cut" + 0.008*"needles" + 0.008*"love" + 0.007*"well" + 0.007*"really"
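
Topic 2 above contains tokens such as "dont", "ive", and "im". These probably survive because the regex strips apostrophes before the stopword check, so "don't" becomes "dont", which may not be in the stopword list. A minimal sketch of one possible workaround is below: normalize the stopwords with the same regex as the review text. The small `RAW_STOP_WORDS` set is a hypothetical stand-in for `stopwords.words('english')`, and `str.split` is used in place of `TreebankWordTokenizer` to keep the example self-contained.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is longer.
RAW_STOP_WORDS = {"i", "i'm", "i've", "don't", "the", "a", "it", "this", "and", "to"}

def strip_non_letters(text):
    """Lowercase and drop everything except a-z and whitespace (same regex as the notebook)."""
    return re.sub(r"[^a-z\s]", "", text.lower())

# Normalize the stopwords the same way as the review text, so "don't" -> "dont"
# is still recognized after the apostrophe has been removed.
NORMALIZED_STOP_WORDS = {strip_non_letters(word) for word in RAW_STOP_WORDS}

def preprocess(text):
    """Return a list of tokens with normalized stopwords removed."""
    if not isinstance(text, str):
        return []
    tokens = strip_non_letters(text).split()  # assumption: whitespace tokenization
    return [token for token in tokens if token not in NORMALIZED_STOP_WORDS]

print(preprocess("I don't think I've used this machine, and I'm happy!"))
# ['think', 'used', 'machine', 'happy']
```

With the real NLTK list, the same idea would be to build the set as `{strip_non_letters(w) for w in stopwords.words('english')}` before filtering.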




### Comparison of the LSA and LDA Results - Patterns, Insights

Topic Comparisons between LSA and LDA:

+ Topic 1 from the LSA model and Topics 1 and 2 from the LDA model all contain key words that I thought related to general product satisfaction and versatility, such as "great", "good", "use", and "like".
+ Topic 2 from the LSA model and Topic 2 from the LDA model both feature "machine". In the LSA topic it appears alongside "sewing", "thread", and "bobbin", while the LDA topic pairs it with general usage words and does not list "sewing" among its top ten words.
+ Topic 3 from the LSA model and Topic 1 from the LDA model both seem to emphasize product value and quality, with "price" and "product" in both and "quality" in the LDA topic.
+ Topic 4 from the LSA model and Topic 4 from the LDA model both mention words that relate to crafting tools and materials, such as "paper" and "cut".
+ Topic 5 from the LSA model and Topic 3 from the LDA model both center on yarn, which points to hobbies like knitting and crochet ("hooks" and "crochet" appear in the LSA topic). "needles" also appears in LDA Topic 4.
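
One rough way to sanity-check pairings like these is to count the top words that two topics share. The sketch below uses the top-ten word lists copied from the printed output above and reports the Jaccard overlap for every LSA/LDA pair. The helper names are my own, and this is only a heuristic: sentiment words like "great" and "love" show up in most topics and inflate the scores, so the shared-word list is more informative than the number itself.

```python
# Top-10 words per topic, copied from the printed LSA and LDA output.
lsa_topics = {
    1: "use great machine like love good paper used really set",
    2: "machine sewing thread foot machines bobbin needle embroidery brother stitches",
    3: "great machine price product sewing thread works love embroidery foot",
    4: "use paper great cut easy cutting glue mat cricut blade",
    5: "love yarn use easy needles perfect hooks crochet make colors",
}
lda_topics = {
    0: "love nice one use perfect color size easy great colors",
    1: "great good product quality use price nice colors would well",
    2: "one machine get use dont ive used like love im",
    3: "yarn beautiful good make color work well great love easy",
    4: "use great like paper used cut needles love well really",
}

def top_words(topics):
    """Turn each space-separated word string into a set of words."""
    return {topic_id: set(words.split()) for topic_id, words in topics.items()}

def jaccard(a, b):
    """Shared words divided by all distinct words across both sets."""
    return len(a & b) / len(a | b)

lsa_sets = top_words(lsa_topics)
lda_sets = top_words(lda_topics)

for lsa_id, lsa_words in lsa_sets.items():
    for lda_id, lda_words in lda_sets.items():
        shared = sorted(lsa_words & lda_words)
        score = jaccard(lsa_words, lda_words)
        print(f"LSA {lsa_id} vs LDA {lda_id}: {score:.2f} {shared}")
```

Filtering out words that appear in most topics before computing the overlap would probably make the topical matches stand out more clearly.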

Key Patterns and Insights:
Overall, I found that both models produced topics built around specific tools and materials: "machine", "yarn", "paper", and "needles" appear in both models, and "thread" and "blade" appear in the LSA topics. This suggests that Amazon reviews for arts, crafts, and sewing tend to focus on product-specific feedback. Additionally, sentiment words such as "love" and "great" appear in almost every topic, which tells me that these product reviews are generally positive. I also noticed that words relating to knitting and crochet were prevalent enough to form a topic of their own in the LSA model (Topic 5), which suggests that knitting and crocheting products are very commonly reviewed. Finally, LDA Topics 0, 1, and 3 all include "color" or "colors", which I think shows that people who review these types of items care a lot about aesthetics.
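
The LDA topic numbers discussed above come from a single run. Because the LDA cell samples reviews with `random.sample` and trains without a fixed seed, rerunning it would probably change both the topic numbering and the top words. A minimal sketch of seeded sampling is below. The seed value, the toy `texts` list, and the `sample_reviews` helper are all assumptions for illustration.

```python
import random

SEED = 42  # arbitrary, hypothetical choice

# Toy stand-in for the tokenized reviews held in `texts`.
texts = [
    ["yarn", "soft"],
    ["machine", "sewing"],
    ["paper", "cut"],
    ["glue", "mat"],
    ["hooks", "crochet"],
]

def sample_reviews(docs, k, seed):
    """Return up to k documents chosen with a dedicated, seeded generator."""
    rng = random.Random(seed)  # separate generator, so global random state is untouched
    return rng.sample(docs, min(k, len(docs)))

first = sample_reviews(texts, 3, SEED)
second = sample_reviews(texts, 3, SEED)
print(first == second)  # True: same seed, same sample
```

For the model itself, gensim's `LdaMulticore` accepts a `random_state` argument, so passing something like `random_state=SEED` should make training more repeatable, although multi-worker training may still introduce some run-to-run variation.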