# Introduction  

The fashion industry is one of the most significant contributors to national and global GDP. According to Fashion United, the industry accounts for approximately 2% of the world’s GDP, with a labor force of 3384.1 million and a value of 3 trillion dollars. With new materials and technology being rapidly introduced, it is important to understand how the industry operates. One area that even non-experts can easily understand and learn more about is consumer sentiment toward fashion products. Today, countless ratings and reviews are available for online shoppers to consult before making a purchase. These are invaluable resources that both consumers and manufacturers can use to keep up with current trends. 

### Unsupervised Approach
Taking the perspective of the manufacturers, I aim to study the relationship between product types and consumers’ overall sentiment by analyzing Amazon fashion product reviews. Specifically, the study extracts reviews of products filtered by their rating and groups the review texts by topic and sentiment. The topic modelling process is unsupervised and is carried out with BERTopic, a topic-modelling package built on BERT-style embeddings.

Examining either the top (best) or bottom (worst) fashion products in terms of product rating and reviews will provide the most insightful information. For this study, the reviews of the top-performing products will be extracted. In their raw state, the datasets are a collection of unordered products and reviews. The task is to classify the data so that it can be inspected visually. More specifically, the results will show the words most strongly associated with the top fashion products, which we can use to derive features and characteristics of those products.

### Datasets
This study uses two datasets derived from the Amazon Product Data of Prof. Julian McAuley (UC San Diego), containing the meta-data about the products and their product reviews, respectively.  
Fortunately, the datasets are organized in an analysis-ready structure, as I will show below. 



### Methods
There are two datasets that this study will leverage. The first contains general information about the products (meta-data), and the second contains the product reviews. Only the top products will be extracted from the meta-data, followed by extraction of reviews from the second dataset by matching product codes. The review texts will be stored separately so that they can be used by the classification algorithm. As for initial data cleaning and exploratory analysis, the datasets have already been cleaned and are in an organized structure, so no additional work is required in that respect. Subsequent runs with parameter tuning and additional segmenting of the review texts may be required depending on the results of the first run. 
The methodology mainly involves two packages, namely BERT and TensorFlow. BERT stands for Bidirectional Encoder Representations from Transformers. It was developed by Google as a machine learning technique based on the transformer architecture. The initial run uses a pre-trained model of this kind to create topics consisting of frequent context-based words, which serve as labels for the clusters.
TensorFlow is a popular, widely used open-source framework for developing deep learning applications. 

The procedure consists of the following steps:

1.	Data segmentation. This step takes the meta-data and segments it by one or more conditions, such as the ‘overall’ review rating on a scale of 5, the sales rank of the product relative to all other products, or the product subcategory. 
2.	ID creation. Store the ‘asin’ values, or product IDs, of the items resulting from step 1 as a separate list.
3.	Review extraction. Take the list of product IDs from step 2 and extract the reviews that match those IDs. The dictionary structure allows quick access to both the ID and review information such as content text, rating, reviewer, and date.
4.	Apply filter conditions. The results at this point still contain too many rows for optimal classification, so additional filters are applied. Conditions can vary depending on the area of interest.
5.	Tune parameters. At this point, there are distinct topic models that aptly represent products and their characteristics. Classification by clustering requires optimizing intertopic distances as well as the number of clusters.
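
As a rough illustration of steps 1 to 4, the sketch below runs the same logic on a few invented in-memory records instead of the gzipped files. The function names `select_asins` and `extract_reviews` are hypothetical helpers, not part of the notebook; only the field names (`asin`, `categories`, `overall`, `reviewerID`, `reviewText`) follow the dataset.

```python
# Invented toy records; only the field names mirror the Amazon data.
meta = [
    {"asin": "A1", "categories": [["Clothing", "Shoes", "Nike"]]},
    {"asin": "B2", "categories": [["Clothing", "Watches"]]},
]
reviews = [
    {"asin": "A1", "reviewerID": "r1", "overall": 2.0, "reviewText": "Too narrow."},
    {"asin": "A1", "reviewerID": "r2", "overall": 5.0, "reviewText": "Great fit."},
    {"asin": "B2", "reviewerID": "r3", "overall": 1.0, "reviewText": "Band broke."},
]


def select_asins(meta_records, keyword):
    """Steps 1-2: IDs of products whose (nested) categories contain `keyword`."""
    keyword = keyword.lower()
    return {
        rec["asin"]
        for rec in meta_records
        if any(keyword == c.lower() for path in rec.get("categories", []) for c in path)
    }


def extract_reviews(review_records, asins, max_rating):
    """Steps 3-4: reviews of the selected products rated at or below `max_rating`."""
    kept = {}
    for rev in review_records:
        if rev["asin"] in asins and rev["overall"] <= max_rating:
            kept["%s.%s" % (rev["asin"], rev["reviewerID"])] = rev
    return kept


asins = select_asins(meta, "nike")
low_rated = extract_reviews(reviews, asins, max_rating=3.0)
texts = [rev["reviewText"] for rev in low_rated.values()]
print(texts)  # ['Too narrow.']
```

Keeping the product IDs in a set makes the membership test in step 3 cheap, which matters when the review file has far more rows than the toy list here.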


```
import gzip
import itertools
import json
import operator
import pickle

import numpy as np
import pandas as pd

# Install the topic-modelling packages on first use (notebook shell syntax).
try:
    import pyLDAvis
    import tmtoolkit
    from lda import LDA
except ImportError:
    !pip install pyLDAvis
    !pip install tmtoolkit
    !pip install lda
    import pyLDAvis
    import tmtoolkit
    from lda import LDA

from tmtoolkit.bow.bow_stats import doc_lengths
from tmtoolkit.topicmod.model_stats import generate_topic_labels_from_top_words
from tmtoolkit.topicmod.model_io import ldamodel_top_doc_topics
from tmtoolkit.topicmod.visualize import parameters_for_ldavis
```

```
from tmtoolkit.topicmod.tm_lda import compute_models_parallel
```

```
# Locations of the gzipped meta-data and review files on the mounted drive.
meta_path = 'drive/MyDrive/Final_Project/meta_Clothing_Shoes_and_Jewelry.json.gz'
reviews_path = 'drive/MyDrive/reviews_Clothing_Shoes_and_Jewelry.json.gz'
```

The file paths have been set.  
The first part that is necessary for creating the model is data segmentation, as described in step 1 above. Before segmenting, we preview the first few review records to check their structure. 


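```
import ast
import itertools

# Preview the first six review records. Each line is a dict literal, which
# ast.literal_eval parses without executing arbitrary code (unlike eval).
with gzip.open(reviews_path, 'rt') as reviews:
    for line in itertools.islice(reviews, 6):
        print(ast.literal_eval(line))
```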

```
>>> {'reviewerID': 'A2XVJBSRI3SWDI', 'asin': '0000031887', 'reviewerName': 'abigail', 'helpful': [0, 0], 'reviewText': 'Perfect red tutu for the price. I baught it as part of my daughters Halloween costume and it looked great on her.', 'overall': 5.0, 'summary': 'Nice tutu', 'unixReviewTime': 1383523200, 'reviewTime': '11 4, 2013'}
{'reviewerID': 'A2G0LNLN79Q6HR', 'asin': '0000031887', 'reviewerName': 'aj_18 "Aj_18"', 'helpful': [1, 1], 'reviewText': 'This was a really cute tutu the only problem is that it was super short on my 5 yr old daughter. Other than that it was really adorable.', 'overall': 4.0, 'summary': 'Really Cute but rather short.', 'unixReviewTime': 1337990400, 'reviewTime': '05 26, 2012'}
{'reviewerID': 'A2R3K1KX09QBYP', 'asin': '0000031887', 'reviewerName': 'alert consumer', 'helpful': [1, 1], 'reviewText': 'the tutu color was very nice. the only issue with this tutu is the quality of material used. it appears cheap and after much play my 3 year old managed to snag a piece of the fabric and that was the end of life for this particular tutu.', 'overall': 2.0, 'summary': 'not very good material.', 'unixReviewTime': 1361059200, 'reviewTime': '02 17, 2013'}
{'reviewerID': 'A19PBP93OF896', 'asin': '0000031887', 'reviewerName': 'Alinna Satake "Can\'t Stop Eating"', 'helpful': [0, 1], 'reviewText': "My 3-yr-old daughter received this as a gift for her birthday.  She's no pixy, but she's not huge either, and it was VERY tight on her, so I doubt a 6 year old can fit it comfortably.  The tutu fell apart after 12 hours -- the satin waistband detached from the tulle.  Unless twirling counts as rough wear, I'd say this was poorly constructed.  I sent two messages to Sydney So Sweet directly, trying to get a replacement or at least some kind of acknowledgement and NOTHING.  So ... crappy construction and crappy customer service.  I already don't like tutu/fairy/princess stuff, and this just furthers my opinion that companies like this are preying on daughters.  Boo!", 'overall': 1.0, 'summary': 'Tiny and Poorly Constructed!', 'unixReviewTime': 1363824000, 'reviewTime': '03 21, 2013'}
{'reviewerID': 'A1P0IHU93EF9ZK', 'asin': '0000031887', 'reviewerName': 'Amanda', 'helpful': [0, 0], 'reviewText': 'Bought it for my daughters first birthday which is lady bug themed and it fits perfect the stitching is a little loose but only need it one day', 'overall': 4.0, 'summary': 'i love it', 'unixReviewTime': 1390435200, 'reviewTime': '01 23, 2014'}
{'reviewerID': 'A3Q6CTO56DJ8UZ', 'asin': '0000031887', 'reviewerName': 'Amazing Amazon', 'helpful': [3, 4], 'reviewText': 'I ordered this for a costume for me (I\'m a 5\'5" adult) and was surprised by the quality considering it was under $10.  The tulle is double layered and the waistband is satin and very comfortable.  The waistband stretches to about 30".', 'overall': 4.0, 'summary': 'Good Quality!', 'unixReviewTime': 1268697600, 'reviewTime': '03 16, 2010'}
```


We can see that the overall structure is well organized and does not require cleaning. Each record is stored as a dictionary, which allows quick access to specific fields and to the product identification numbers.  
  
Next, we will save the product ID numbers based on subcategories.  
Note that the filter keyword (‘nike’ in the code below) can be replaced with any category of interest.
This is used in subsequent runs to generate the additional clusters below.

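```
import ast

# Collect the IDs of every product that has "nike" among its categories.
# A set keeps the membership checks fast when matching reviews later.
asins = set()

with gzip.open(meta_path, 'rt') as products:
    for line in products:
        data = ast.literal_eval(line)
        # 'categories' is a list of lists; flatten it and lowercase each entry.
        categories = [c.lower() for c in
                      itertools.chain(*data.get("categories", []))]
        if "nike" in categories:
            asins.add(data["asin"])
```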

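```
import ast

bad_texts = []      # review texts used for topic modelling
bad_asins = []      # product ID of each kept review
bad_reviews = {}    # full review records keyed by "<asin>.<reviewerID>"
asin_set = set(asins)  # fast membership checks

with gzip.open(reviews_path, 'rt') as reviews:
    for line in reviews:
        data = ast.literal_eval(line)
        # Keep reviews of the selected products rated 3.0 or lower.
        if data['asin'] in asin_set and data['overall'] <= 3.0:
            _id = "%s.%s" % (data['asin'], data['reviewerID'])
            bad_asins.append(data['asin'])
            bad_reviews[_id] = data
            bad_texts.append(data["reviewText"])
```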

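```
import json

# Persist the filtered data; the context managers close each file reliably.
# Review texts can contain commas themselves, so the comma-joined text file
# is a convenience dump rather than a reliably parseable format.
with open('bad_texts.txt', 'w') as f:
    f.write(','.join(bad_texts))
with open('bad_asins.txt', 'w') as f:
    f.write(','.join(bad_asins))
# Store the full review records as JSON.
with open('bad_reviews.txt', 'w') as f:
    json.dump(bad_reviews, f)
```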

```
bad_reviews
```

Import necessary packages for creating corpora and generating models.

```
from tmtoolkit.corpus import Corpus
from tmtoolkit.preprocess import TMPreproc
from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words
from tmtoolkit.topicmod.tm_lda import compute_models_parallel

# Install BERTopic on first use (notebook shell syntax).
try:
    from bertopic import BERTopic
except ImportError:
    !pip install bertopic
    from bertopic import BERTopic
```

```
topic_model = BERTopic(language = 'english', calculate_probabilities = True, verbose = True)
topics, probs = topic_model.fit_transform(bad_texts)
freq = topic_model.get_topic_info(); freq.head(10)
```

Initialize the vectorizer and update the topic representations according to its parameters.  
For this stage I used scikit-learn's CountVectorizer with an ngram_range (the number of consecutive words that can constitute a 'term' in the processing stage) of (1, 3), so terms of one to three words are considered.  
Then I passed that vectorizer to topic_model.update_topics to refit the topic representations.

```
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words = "english", ngram_range = (1, 3))
topic_model.update_topics(bad_texts, topics, vectorizer_model = vectorizer_model)
freq = topic_model.get_topic_info(); freq.head(10)

>>> freq
Topic	Count	Name
0	-1	1433	-1_shoes_shoe_size_fit
1	0	400	0_watch_band_wrist_battery
2	1	128	1_color_black_white_red
3	2	105	2_size_small_head_fit
4	3	102	3_fake_real_nike_shoes
5	4	94	4_price_pay_shoes_money
6	5	93	5_wide_narrow_feet_wide feet
7	6	92	6_running_shoes_shoe_running shoes
8	7	91	7_item_product_received_like
9	8	83	8_nike_shoe_shoes_pair
```
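
To make the ngram_range = (1, 3) setting concrete, the following is a minimal pure-Python sketch of how word n-grams of length one to three can be enumerated. It is an illustration only, not scikit-learn's implementation, and the helper name `word_ngrams` is hypothetical. It shows why multi-word terms such as 'running shoes' or 'wide feet' can appear in the topic names above.

```python
def word_ngrams(tokens, ngram_range):
    """Return all contiguous word n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["running", "shoes", "narrow"]
grams = word_ngrams(tokens, (1, 3))
print(grams)
# ['running', 'shoes', 'narrow', 'running shoes', 'shoes narrow', 'running shoes narrow']
```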

We can observe the first topic and the elements contained within. This gives a general idea of what the final classification will show.

```
topic_model.get_topic(0)

>>>
[('watch', 0.04147225040424569),
 ('band', 0.014635142623022322),
 ('wrist', 0.01061386050184958),
 ('battery', 0.00991637074224343),
 ('nike', 0.007691470163720198),
 ('time', 0.007624289358415673),
 ('face', 0.007396593408941119),
 ('watches', 0.0072838276210527315),
 ('broke', 0.006372994725387081),
 ('months', 0.006371949888803141)]
```
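
Comparing this output with the table above suggests that each entry in the Name column is the topic number followed by its four highest-scoring words, joined by underscores. The sketch below reproduces that pattern under this assumption; `topic_label` is a hypothetical helper and not a BERTopic function.

```python
def topic_label(topic_id, top_words, n_words=4):
    """Join the topic number and its first `n_words` top words with underscores."""
    words = [word for word, _score in top_words[:n_words]]
    return "_".join([str(topic_id)] + words)


# First five (word, score) pairs from the output above, with scores rounded.
top_words = [
    ("watch", 0.0415),
    ("band", 0.0146),
    ("wrist", 0.0106),
    ("battery", 0.0099),
    ("nike", 0.0077),
]
label = topic_label(0, top_words)
print(label)  # 0_watch_band_wrist_battery
```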

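We can tune the hyperparameters of the vectorizer model to generate topic representations under different conditions. The code below adds max_df = 0.7 and min_df = 0.2, which drop terms whose document frequency is above 70% or below 20%, so that the most common and the rarest terms no longer appear in the topic descriptions and the more specific words come forward. Note that update_topics changes only the words that describe each topic: the cluster assignments and counts are the same in both tables.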

```
vectorizer_model = CountVectorizer(stop_words = "english", ngram_range = (1, 3), max_df = 0.7, min_df = 0.2)
topic_model.update_topics(bad_texts, topics, vectorizer_model = vectorizer_model)

freq = topic_model.get_topic_info(); freq.head(10)

>>>
Topic	Count	Name
0	-1	1433	-1_tight_ordered_narrow_running
1	0	400	0_broke_months_read_year
2	1	128	1_color_white_black_red
3	2	105	2_bigger_sizes_expected_larger
4	3	102	3_real_box_original_nike shoes
5	4	94	4_pay_money_people_dollars
6	5	93	5_wide_narrow_width_wider
7	6	92	6_running_running shoes_miles_running shoe
8	7	91	7_item_received_review_cheap
9	8	83	8_nikes_nike shoes_leather_years
```

```
topic_model.get_topic(0)

>>>
[('broke', 0.03305476293499219),
 ('months', 0.02712911505601609),
 ('read', 0.02537250317195788),
 ('year', 0.022928815579935143),
 ('worked', 0.017310300863954495),
 ('easy', 0.017031659712593256),
 ('replace', 0.016637082866294863),
 ('does', 0.015657600959412758),
 ('light', 0.015530976544856188),
 ('thing', 0.014669671173207112)]
```
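
The effect of min_df and max_df can be illustrated with a small pure-Python sketch that filters terms by document frequency. This is a simplified illustration under the assumption that each string counts as one document; it is not scikit-learn's or BERTopic's implementation, and `filter_by_doc_freq` is a hypothetical helper. The toy sentences are invented.

```python
def filter_by_doc_freq(docs, min_df, max_df):
    """Keep terms whose document frequency (fraction of docs) lies in [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # Count each term at most once per document.
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items() if min_df <= df / n_docs <= max_df)


docs = [
    "shoes narrow toe",
    "shoes narrow heel",
    "shoes broke",
    "shoes broke fast",
    "shoes fake",
    "shoes great",
]
kept = filter_by_doc_freq(docs, min_df=0.2, max_df=0.7)
print(kept)  # ['broke', 'narrow']
```

Here 'shoes' occurs in every document and is removed by max_df, while words that occur only once fall below min_df, which mirrors how generic words such as 'shoes' disappear from the topic names after tuning.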

```
topic_model.visualize_hierarchy()
```



Building on run #2, the clustering parameters are tuned to reduce the number of clusters, so that overlapping subclusters are merged into a single cluster representing the main context of the reviews. The result is visually appealing and gives a quicker overview of context, but the disadvantage is that the subcategories of each product cannot be distinguished without closer inspection of individual review contents, which is a manual and laborious task.

```
topic_model.visualize_barchart(top_n_topics=5)
```

```
freq = topic_model.get_topic_info()
print(freq.head(10))

>>>
   Topic  Count                                        Name
0     -1   1433             -1_tight_ordered_narrow_running
1      0    400                    0_broke_months_read_year
2      1    128                     1_color_white_black_red
3      2    105              2_bigger_sizes_expected_larger
4      3    102              3_real_box_original_nike shoes
5      4     94                  4_pay_money_people_dollars
6      5     93                   5_wide_narrow_width_wider
7      6     92  6_running_running shoes_miles_running shoe
8      7     91                7_item_received_review_cheap
9      8     83            8_nikes_nike shoes_leather_years
```


```
topic_model.visualize_topics(top_n_topics = 7)
```

### Assessment / Evaluation
Out of the top products from the initial review extraction, secondary segmentation by an ‘overall’ rating of 3.0 or lower can be deemed a more general, or common, filter criterion. Considering there are hundreds of thousands of reviews for the top products, even such poor ratings number in the thousands. Consequently, there will be many smaller, overlapping clusters that each represent a topic of their own. For example, a topic consisting of four frequent words may include ‘headband’, ‘color’, ‘size’, and ‘fit’. Any permutation of these words, as long as they appear in a similar context, can be clustered into the same effective radius of that contextual cluster. However, the topic itself represents a small cluster of apparel, namely headbands. There are some advantages to this. Because the cluster map is interactive, one can use the bottom slider to view specific subclusters while getting a rough sense of the larger context at the same time.  
  
Narrowing down the result from run #1 by segmenting the review contents by sentiment using TextBlob brings the topics closer to one another in terms of intertopic distance. This has the same effect as enlarging the context and giving the subtopics a more central identity.  
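
The TextBlob step itself is not shown in this notebook, so the following is only a sketch of how review texts could be segmented by sentiment before topic modelling. The scorer `toy_polarity` and its word lists are invented stand-ins; with TextBlob installed, a function returning `TextBlob(text).sentiment.polarity` could be passed in its place. The threshold of 0.0 is likewise an assumption.

```python
# Invented word lists for the stand-in scorer; not taken from TextBlob.
NEGATIVE = {"broke", "fake", "cheap", "tight", "narrow"}
POSITIVE = {"great", "comfortable", "perfect", "love"}


def toy_polarity(text):
    """Hypothetical scorer in [-1, 1]: (positive hits - negative hits) / total hits."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def keep_negative(texts, polarity_fn, threshold=0.0):
    """Return only the texts whose polarity score is below `threshold`."""
    return [t for t in texts if polarity_fn(t) < threshold]


texts = [
    "the band broke after two months",
    "great shoes but a bit narrow",
    "perfect fit and very comfortable",
]
negative_texts = keep_negative(texts, toy_polarity)
print(negative_texts)  # ['the band broke after two months']
```

Passing the scorer as an argument keeps the filtering logic independent of the sentiment library, so the same function works with any polarity function that returns a number.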

```
topic_model.visualize_barchart(top_n_topics = 7)
```

The figure shows the final scores of the words that represent each cluster. Upon inspection we can see that most of the poor reviews of the top-ranking products are due to anomalies, misuse, and third-party sellers of counterfeit products. The sentiment of the affected consumers is clearly represented and easily distinguishable in the distinct clusters. 