### Topic Model Using Amazon Review Data

Using `.json` files containing Amazon product metadata and reviews for clothing, shoes, and jewelry, I created several topic models to explore brand insights for Columbia Sportswear. The Amazon product data used in this project contained about 1.5 million clothing, shoe, and jewelry products and 5.7 million reviews of those products. Of those, 4,988 were Columbia Sportswear products, with a total of 27,278 reviews and an average rating of 4.32 out of 5. Using this data, I built a few different topic models with K-means clustering to better understand the strengths and weaknesses reviewers see across different product groups.

First, I imported all the required packages.

In [1]:
import pandas as pd
import json
import nltk
from nltk.corpus import stopwords

# For Kmeans clustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import os

# For IBM Watson
from os.path import join, dirname
from watson_developer_cloud import PersonalityInsightsV3
import requests
from watson_developer_cloud import WatsonApiException
import time
print('Packages Imported')

Packages Imported


Many steps in this project are used more than once, so I put them into functions. The first one loads the review texts and collects them into a set.

In [2]:
# Function that loads the texts of the review dictionary and puts the relevant info into a set
def load_texts(reviewdata):
    texts = set()
    for review in reviewdata:
        if 'reviewText' in reviewdata[review]:
            reviewtext = reviewdata[review]['reviewText']
            summary = reviewdata[review]['summary']
            asin = reviewdata[review]['asin']

            reviewwords = '%s %s %s' % (reviewtext, summary, asin)
            # Add the text to a set of review data
            texts.add(reviewwords)
    print('Texts Loaded')
    return texts
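
As a quick check of how this behaves, the sketch below runs a compact equivalent of `load_texts` on three made-up reviews (the ASIN, keys, and text are invented for illustration). Because the result is a set, identical strings collapse into one entry, and reviews without `reviewText` are skipped.

```python
def load_texts(reviewdata):
    """Compact equivalent of the function above: one string per review."""
    texts = set()
    for review in reviewdata.values():
        if 'reviewText' in review:
            texts.add('%s %s %s' % (review['reviewText'], review['summary'], review['asin']))
    return texts

# Hypothetical sample data in the same shape as the review dictionary
sample_reviews = {
    'B000000001.userA': {'asin': 'B000000001', 'summary': 'Warm', 'reviewText': 'Great jacket'},
    'B000000001.userB': {'asin': 'B000000001', 'summary': 'Warm', 'reviewText': 'Great jacket'},
    'B000000001.userC': {'asin': 'B000000001', 'summary': 'No text'},
}

texts = load_texts(sample_reviews)
print(texts)
```

Two of the three sample reviews produce the same string, so the set ends up with a single entry. That deduplication is worth keeping in mind when comparing the number of documents to the number of reviews.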

This function creates a k-means model and saves each topic's documents to an output folder inside the working directory.

In [3]:
# Function that creates a topic model using K-Means and saves the information to a folder     
def kmeans_creator(documents, review_dictionary, n_topics, brand, foldername):
    print('Creating and Saving K-Means Topic Model')
    # Build a list of English and Spanish stop words, plus the brand name
    stop_words = stopwords.words('english') + stopwords.words('spanish')
    stop_words.append(brand)
    
    # Vectorizing the data and creating a matrix of data
    vectorizer = TfidfVectorizer(stop_words = stop_words)
    X = vectorizer.fit_transform(documents)
    
    # Assuming number of topics
    true_k = n_topics
    # Creating a Kmeans clustering model
    model = KMeans(n_clusters=true_k, max_iter=10000)
    model.fit(X)
        
    # Sort each centroid's term indices from highest to lowest weight
    order_centroids = model.cluster_centers_.argsort()[:,::-1]
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    terms = vectorizer.get_feature_names_out()
    
    # Iterating the number of topics
    print("Top Terms per Cluster:")
    for i in range(true_k):
        # Taking the top four terms of the cluster
        topic_terms = [terms[ind] for ind in order_centroids[i, :4]]
        # Printing the cluster number and the topic terms
        print('%d: %s' % (i, ' '.join(topic_terms)))

    # Saving the topics in a folder (foldername) as txt files
    outputfiles = {}
    print('Creating a new directory') 
    try:
        os.mkdir(foldername)
        
    except OSError:
        print('\nDirectory already exists. Documents added to existing folder.')
               
    else:
        print('\nSuccessfully created the directory')
    
    for topic in range(true_k):
        topic_terms = [terms[ind] for ind in order_centroids[topic, :4]]
        # Creating output file inside of dictionary
        outputfiles[topic] = open(os.path.join(foldername, '_'.join(topic_terms) + '.txt'), 'w')
        
    print('Filling Directory')
    for key in review_dictionary:
        review = review_dictionary[key]
        # Only process reviews that contain text
        if 'reviewText' in review:
            reviewbit = '%s %s %s %s' % (review['asin'], review['overall'], review['summary'], review['reviewText'])
            # Vectorize this one review with the vectorizer fitted above
            Y = vectorizer.transform([reviewbit])
            # predict() returns the index of the nearest cluster for each row of Y
            for prediction in model.predict(Y):
                outputfiles[prediction].write('%s\n' % (reviewbit))
    # Close every output file (keys are topic indices, values are file handles)
    for f in outputfiles.values():
        f.close()
    print('K-Means Creation Finished and Saved')
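
The `order_centroids` line in the function above is what turns a centroid into a readable label. A minimal pure-Python sketch of the same idea follows, assuming a made-up vocabulary and made-up centroid weights; `top_terms` is a hypothetical helper, not part of scikit-learn.

```python
def top_terms(centroid, terms, n=4):
    """Return the n terms with the largest weights in one centroid."""
    # Sort term indices by weight, largest first
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [terms[i] for i in order[:n]]

# Hypothetical vocabulary and centroid weights (one weight per term)
terms = ['boots', 'feet', 'jacket', 'size', 'snow', 'warm']
centroid = [0.41, 0.22, 0.01, 0.05, 0.19, 0.30]

cluster_terms = top_terms(centroid, terms)
print(' '.join(cluster_terms))            # label printed for the cluster
print('_'.join(cluster_terms) + '.txt')   # file name used for the cluster's reviews
```

The four highest-weighted terms become both the printed label and the file name, which is why the cluster files end up with names such as `size_small_large_ordered.txt`.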

This function computes the average star rating across all the reviews in a review dictionary.

In [4]:
# Function that gives an average score of a review dictionary
def score_getter(review_dictionary):
    score = 0
    for i in review_dictionary:
        score = score + review_dictionary[i]['overall']
    print('The average score is ' + str(round((score/len(review_dictionary)), 2)))
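
If the average is needed later rather than only printed, a variant that returns the value can be handy. This is a sketch under the assumption that every rating sits under the `overall` key, as in the function above; `average_score` is a hypothetical name, and returning `None` for an empty dictionary is my own choice to avoid a division by zero.

```python
from statistics import mean

def average_score(review_dictionary):
    """Return the mean 'overall' rating, or None when there are no reviews."""
    ratings = [float(r['overall']) for r in review_dictionary.values() if 'overall' in r]
    return round(mean(ratings), 2) if ratings else None

# Hypothetical ratings
sample = {
    'a': {'overall': 5.0},
    'b': {'overall': 4.0},
    'c': {'overall': 2.0},
}
avg = average_score(sample)
empty_avg = average_score({})
print(avg, empty_avg)
```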

I then adapted the IBM Watson code from my Russian troll ad research to work with these reviews, gathering personality insights for each cluster of reviews. The function also computes the average star rating for each cluster.

In [5]:
def IBM_Watson_Cluster_Personality(user, passw, foldername, savefile):
    # IBM Watson API------------------------------------------------
    # Uses PersonalityInsightsV3 and WatsonApiException imported from watson_developer_cloud above.
    # (Redefining WatsonApiException inside this function would shadow the SDK's class,
    # so the except clause below would never catch errors raised by the API.)
    # Gathering data from folder and saving the names and the contents to separate lists
    filelst = []
    namelst = []
    for filename in os.listdir(foldername):
        with open(os.path.join(foldername, filename)) as f:
            contentlst = []
            namelst.append(filename)
            for line in f:
                contentlst.append(line)
            filelst.append(contentlst)
    # Authentication info for IBM Watson
    service = PersonalityInsightsV3(
        version='2017-10-13',
        ## url is optional, and defaults to the URL below. Use the correct URL for your region.
        # url='https://gateway.watsonplatform.net/personality-insights/api',
        username=user,
        password=passw)
    
    # Test call: ask Watson to analyze a placeholder string so we can read the CSV column headers.
    response = service.profile(
        'YOUR TEXT HERE ' + 'the dog and ' * 100, # Input must contain at least 100 words
        content_type='text/plain',
        accept="text/csv",
        charset='utf-8',
        csv_headers=True).get_result()
    
    #print(response.content)
    # Splitting the lines from the headers and the variables
    profile = response.content
    cr = profile.splitlines()
    
    # Creating column headers
    labelslst = []
    # The data is in one long list of bytes. We need to convert this to strings
    letter = ''
    # Iterating over each set of bytes in the list
    # This little for loops creates a column of soon-to-be column headers 
        # from the bytes gathered by test call to Watson
    for i in cr[0]:
        # The letter = the character value converted from ASCII decimal
        letter = letter + chr(i)
        # If the byte is 44 (a comma), append the full letter value to the labelslst
        if i == 44:
            letter = letter[:-1]
            labelslst.append(letter)
            #print(letter)
            letter = ''
    # Create a dataframe of the labels
    insightsdf = pd.DataFrame(labelslst)
            
    # For each cluster file in filelst,
    for item in range(len(filelst)):
        itemtxt = ''.join(filelst[item])
        try:
            # API call to Watson
            response = service.profile(
                    itemtxt,
                    content_type='text/plain',
                    accept="text/csv",
                    charset='utf-8',
                    csv_headers=True).get_result()
            
            profile = response.content
            cr = profile.splitlines()
            
            # Appending values for the text to a list
            vallst = []
            val = ''
            for i in cr[1]:
                # Convert the byte (an ASCII code) to its character
                val = val + chr(i)
                # If the byte is 44 (a comma), the value is complete: append it to vallst
                if i == 44:
                    try:
                        val = val[:-1]
                        val = float(val)
                        vallst.append(val)        
                    except ValueError:
                        vallst.append(1)
                    val = ''
            # Appending the list to the dataframe, leaving room for the column headers
            insightsdf[item+1] = vallst
            time.sleep(-time.time()%1)
        except WatsonApiException:
            insightsdf[item+1] = 0
            continue 
        print(str(item) + ' Done!')  

    # Transpose the data frame 
    insightsdfT = insightsdf.T 
    # Add column names as the labels
    insightsdfT.columns = insightsdfT.iloc[0]
    # Will need to drop the 0th row once we have data inside the dataframe
    insightsdfT = insightsdfT.iloc[1:]
    insightsdfT['Clusters'] = namelst
    
    # Adding the average scores to the data frame
    kmeansdict = {}
    for filename in os.listdir(foldername):
        with open(os.path.join(foldername, filename)) as f:
            ratinglst = []
            kmeansdict[filename] = []
            for line in f:
                ratinglst.append(float(line[11:14]))
            kmeansdict[filename].append(ratinglst)
    
    # Printing average score of each cluster
    avgscore = []
    for cluster in kmeansdict:
        ratings = kmeansdict[cluster][0]
        average = sum(ratings)/len(ratings)
        print(cluster + ': ')
        print(average)
        print('\n')
        avgscore.append(average)
    insightsdfT['Avg Scores'] = avgscore
    
    insightsdfT.to_csv(savefile)
    print('IBM Watson Finished')
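
The function above splits the CSV response at commas one byte at a time, which would skip a final field that is not followed by a comma and would also split quoted fields that contain commas. As an alternative, here is a minimal sketch that uses Python's `csv` module. The sample bytes and the column names (`trait_a`, `trait_b`, `language`) are invented, and `parse_profile_csv` is a hypothetical helper rather than part of the Watson SDK.

```python
import csv
import io

def parse_profile_csv(raw_bytes):
    """Map each header to its value; numeric values become floats."""
    reader = csv.reader(io.StringIO(raw_bytes.decode('utf-8')))
    headers = next(reader)   # first line: column names
    values = next(reader)    # second line: one row of values
    parsed = {}
    for header, value in zip(headers, values):
        try:
            parsed[header] = float(value)
        except ValueError:
            parsed[header] = value  # keep non-numeric fields as text
    return parsed

# Invented two-line response: header row, then one row of values
sample = b'trait_a,trait_b,language\r\n0.82,0.13,en\r\n'
profile = parse_profile_csv(sample)
print(profile)
```

Unlike the loop above, this keeps non-numeric fields as text instead of replacing them with 1, so the two approaches are not drop-in replacements for each other.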

Next, I create a dictionary of product information. Although the file contains over 1.5 million products, the loop runs relatively quickly despite the nested for-loops. To show progress, it prints a count every 100,000 products.

In [6]:
import ast

print('Creating Product Dictionary')
count = 0
allproducts = {}
catlist = {}
# Loading in the json file of products; the with-block closes the file when done
with open('meta_Clothing_Shoes_and_Jewelry.json', 'r') as loadedjson:
    # Iterating over each entry
    for row in loadedjson:
        count += 1
        if count % 100000 == 0:
            print(format(count, ',') + ' Items Created')
        # Each line is a printed Python dictionary; literal_eval parses it without executing code
        product = ast.literal_eval(row)

        # Keeping the ASIN number for each product
        allproducts[product['asin']] = product
        # Iterating over each category list in each product
        for categories in product['categories']:
            # Counting how often each category appears
            for acategory in categories:
                catlist[acategory] = catlist.get(acategory, 0) + 1
print('Product Dictionary and Category List Created')

Creating Product Dictionary
100,000 Items Created
200,000 Items Created
300,000 Items Created
400,000 Items Created
500,000 Items Created
600,000 Items Created
700,000 Items Created
800,000 Items Created
900,000 Items Created
1,000,000 Items Created
1,100,000 Items Created
1,200,000 Items Created
1,300,000 Items Created
1,400,000 Items Created
1,500,000 Items Created
Product Dictionary and Category List Created
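
The category tally in this cell can also be written with `collections.Counter`. The sketch below assumes, as the loop above does, that `categories` is a list of category paths; the two products are invented for illustration.

```python
from collections import Counter

# Hypothetical products; 'categories' is a list of category paths
products = [
    {'asin': 'B000000001', 'categories': [['Clothing', 'Men', 'Columbia']]},
    {'asin': 'B000000002', 'categories': [['Clothing', 'Women'], ['Shoes', 'Columbia']]},
]

# Flatten every path of every product and count each category name
category_counts = Counter(
    category
    for product in products
    for path in product['categories']
    for category in path
)
print(category_counts)
```

A `Counter` also offers `most_common(n)`, which is a convenient way to look at the largest categories before deciding how to filter for a brand.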


I chose to look at Columbia Sportswear as a brand, so I created a set of Columbia Amazon Standard Identification Numbers (ASINs). This lets us scan the `.json` file of product reviews and match each review to a Columbia product by its ASIN.

In [7]:
# Writing the Columbia asins to a txt file
count = 0
allcolumbiaasins = set() # A set stores each ASIN only once
print('Creating Set of ASINs')  
for product in allproducts: # Iterating through the products in all the products
    theproduct = allproducts[product] # setting a variable equal to the current product
    
    count += 1
    if count % 100000 == 0:
        print(str(round(100*(count/len(allproducts)),2)) + '%')
        
    # iterating through the categories that the product is in
    for categories in theproduct['categories']: 
        # Iterating through each category in each list
        for acategory in categories:
            # If columbia is a category, add the asin to the set
            if 'columbia' in acategory.lower():
                allcolumbiaasins.add(theproduct['asin'])
print('ASIN Set Created')
with open("columbia.txt", "w") as output:
    output.write(str(allcolumbiaasins))

Creating Set of ASINs
6.65%
13.3%
19.95%
26.61%
33.26%
39.91%
46.56%
53.21%
59.86%
66.52%
73.17%
79.82%
86.47%
93.12%
99.77%
ASIN Set Created
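
Writing `str(allcolumbiaasins)` stores the set in Python's printed form, which is awkward to read back later. If the ASINs need to be reloaded in another session, one option is to store them as a JSON list. This is a sketch with invented ASINs and a hypothetical file name, written to a temporary folder so it does not touch the project files.

```python
import json
import os
import tempfile

asins = {'B000000002', 'B000000001'}  # hypothetical ASINs

path = os.path.join(tempfile.mkdtemp(), 'columbia_asins.json')  # hypothetical file name
with open(path, 'w') as f:
    json.dump(sorted(asins), f)   # sets are not JSON-serializable, so store a sorted list

with open(path) as f:
    reloaded = set(json.load(f))  # back to a set for fast membership tests

print(reloaded == asins)
```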


The reviews file contains over 5.7 million reviews for products in the clothing, shoes, and jewelry section of Amazon. The loop below keeps only the reviews of Columbia products; each entry in the finished dictionary includes the review text, review title, reviewer ID, star rating, and product ASIN.

In [8]:
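import ast

print('Creating Review Dictionary')
allreviews = {}
count = 0
# Reading in the json for reviews; the with-block closes the file when done
with open('reviews_Clothing_Shoes_and_Jewelry.json', 'r') as loadjson2:
    for line in loadjson2:
        count += 1
        if count % 100000 == 0:
            # Progress estimate based on roughly 5.7 million reviews
            print(str(round(100*(count/5700000),2)) + '%')

        # literal_eval parses the printed dictionary without executing code
        review = ast.literal_eval(line)
        theasin = review['asin']

        # Keep only reviews of Columbia products, keyed by ASIN and reviewer
        if theasin in allcolumbiaasins:
            thekey = '%s.%s' % (theasin, review['reviewerID'])
            allreviews[thekey] = review
with open('ColumbiaReviews.json', 'w') as outfile:
    json.dump(allreviews, outfile)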

Creating Review Dictionary
1.75%
3.51%
5.26%
7.02%
8.77%
10.53%
12.28%
14.04%
15.79%
17.54%
19.3%
21.05%
22.81%
24.56%
26.32%
28.07%
29.82%
31.58%
33.33%
35.09%
36.84%
38.6%
40.35%
42.11%
43.86%
45.61%
47.37%
49.12%
50.88%
52.63%
54.39%
56.14%
57.89%
59.65%
61.4%
63.16%
64.91%
66.67%
68.42%
70.18%
71.93%
73.68%
75.44%
77.19%
78.95%
80.7%
82.46%
84.21%
85.96%
87.72%
89.47%
91.23%
92.98%
94.74%
96.49%
98.25%
100.0%
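
The filtering step can be tried on a couple of lines without the full file. A minimal sketch, assuming each line is a printed Python dictionary as the parsing above implies; the ASINs and reviewer IDs are invented.

```python
import ast
import io

columbia_asins = {'B000000001'}  # hypothetical ASIN set

# Two invented lines standing in for the reviews file
raw_lines = io.StringIO(
    "{'asin': 'B000000001', 'reviewerID': 'A1', 'overall': 5.0}\n"
    "{'asin': 'B000000009', 'reviewerID': 'A2', 'overall': 3.0}\n"
)

kept = {}
for line in raw_lines:
    review = ast.literal_eval(line)      # parses literals only, never runs code
    if review['asin'] in columbia_asins:
        # Same composite key as above: ASIN, a dot, then the reviewer ID
        key = '%s.%s' % (review['asin'], review['reviewerID'])
        kept[key] = review

print(list(kept))
```

Since the key combines ASIN and reviewer ID, a second review of the same product by the same reviewer would overwrite the first one.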


Here we load the reviews and run a k-means cluster analysis on them. The topics are saved to an output folder in the working directory. I specified 9 clusters because this was the point where most of the duplicate clusters started to fade out. With this method, it's important to keep the number of clusters low enough to avoid duplicates while still keeping as many meaningful clusters as possible. The nine clusters are: `Boots`, `Quality and Pricing`, `Warmth and Fit`, `Jackets`, `Gifts`, `Shoes`, `Sizing`, `Vests`, and `Wallets`. They cover a wide variety of products and, with the slight exception of `Warmth and Fit`, seem fairly distinct in their topics. Because K-means starts from randomly chosen centroids, a rerun can produce somewhat different clusters, so the output of any single run may not match these labels exactly.

In [11]:
documents = list(load_texts(allreviews))
kmeans_creator(documents, allreviews, 9, 'columbia', 'output')
score_getter(allreviews)

Creating and Saving K-Means Topic Model
Top Terms per Cluster:
0: great good warm fit
1: vest b003nx8c2o b00062nnlk great
2: boot boots warm great
3: jacket great warm nice
4: pants great fit b003s9vuh2
5: shoes shoe comfortable great
6: coat great warm nice
7: size small large ordered
8: boots warm feet snow
Creating a new directory

Directory already exists. Documents added to existing folder.
Filling Directory
K-Means Creation Finished and Saved
The average score is 4.32
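
One rough way to spot duplicate clusters, which was not part of the original analysis, is to measure how many top terms two clusters share. The sketch below computes the Jaccard overlap for four of the clusters printed in this run; what counts as "too similar" is a judgment call.

```python
from itertools import combinations

# Top terms copied from the run shown above
clusters = {
    2: 'boot boots warm great',
    3: 'jacket great warm nice',
    6: 'coat great warm nice',
    8: 'boots warm feet snow',
}

def jaccard(a, b):
    """Share of terms two clusters have in common (0 = none, 1 = identical)."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

overlaps = {
    (i, j): round(jaccard(clusters[i], clusters[j]), 2)
    for i, j in combinations(sorted(clusters), 2)
}
# Print the pairs from most to least similar
for pair, score in sorted(overlaps.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, score)
```

In this run the jacket and coat clusters share three of their four top terms, which suggests they may describe the same topic, while the two boot clusters overlap less than their labels imply.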


In [None]:
# Creating a list of cluster names and cluster contents            
print('Analyzing Review Dictionary with IBM Watson')
IBM_Watson_Cluster_Personality('USERNAME','PASSWORD', 'output', 'cluster_personality.csv')

When looking through the clusters, I noticed that `wallet_cards_pocket_leather` and `size_small_large_ordered` had the lowest average scores of all the clusters, with `size_small_large_ordered` scoring significantly lower than any other cluster. I decided to subset these two clusters and analyze them further, looking at personality insights for each.

The ‘Adjusted Wallet’ cluster had 66 ratings below 3.0 stars (exclusive), with an average rating of 1.530, and the ‘Adjusted Sizing’ cluster had 433 ratings below 3.0 stars, with an average rating of 1.643. This is a significantly higher proportion of poor reviews in the ‘Adjusted Sizing’ cluster than in the ‘Adjusted Wallet’ cluster. Simply reading through the reviews in the ‘Adjusted Sizing’ cluster gives an idea of the problem: the sizing is not what customers expect. Many customers state that the product is smaller than expected and that the products “run small”. In fact, the word “small” appears in roughly 280 of the 433 ‘Adjusted Sizing’ reviews rated below 3.0, well over half. Other reviewers state that the color or the thickness is not what they expected, most likely a common problem across all online clothing sales. Reading through the reviews in the ‘Adjusted Wallet’ cluster, one gets the sense that customers see these products as ‘poor quality’ and that the wallets are not durable. Two reviews state that the wallets smell bad.

In [13]:
# Low scoring areas: Wallets and Sizing
poorproductlst = set()
walletlst = []
sizelst = []
for filename in os.listdir('output'):
    if filename == 'wallet_cards_pocket_leather.txt':
        with open(os.path.join('output', filename)) as f:
            for line in f:
                # Characters 0-9 hold the ASIN; characters 11-13 hold the star rating (e.g. "2.0")
                if float(line[11:14]) < 3.0:
                    walletlst.append(line)
                    poorproductlst.add(line[0:10])
    elif filename == 'size_small_large_ordered.txt':
        with open(os.path.join('output', filename)) as f:
            for line in f:
                if float(line[11:14]) < 3.0:
                    sizelst.append(line)
                    poorproductlst.add(line[0:10])
# Saving the files to their own folder
try:
    os.mkdir('poor_clusters') 
except OSError:
    print('\nDirectory already exists.')     
else:
    print('\nSuccessfully created the directory')
with open(os.path.join('poor_clusters', 'walletbadreviews.txt'), 'w') as output:
    for item in walletlst:
        output.write("%s" % item)
with open(os.path.join('poor_clusters', 'sizebadreviews.txt'), 'w') as output:
    for item in sizelst:
        output.write('%s' % item)

# Counting the low-rated sizing reviews that mention "small"
count = 0
for i in sizelst:
    txt = i.lower()
    if 'small' in txt:
        count += 1
print('Number of times "small" is mentioned: ', count)


Directory already exists.
Number of times "small" is mentioned:  278
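
Note that `'small' in txt` is a substring test, so it counts each review at most once and also matches words such as "smaller". The sketch below contrasts it with a whole-word match on three invented lines, then works out the share of low-rated sizing reviews that mention "small" from the counts reported above (278 of 433).

```python
import re

# Invented lines in the "ASIN rating summary text" layout of the cluster files
lines = [
    'B000000001 1.0 Too small Runs small, order a size up',
    'B000000002 2.0 Smaller than expected Tight in the shoulders',
    'B000000003 2.0 Wrong color Not the shade pictured',
]

# Substring match: also counts "smaller"
substring_hits = sum('small' in line.lower() for line in lines)
# Whole-word match: counts only the standalone word "small"
whole_word_hits = sum(bool(re.search(r'\bsmall\b', line.lower())) for line in lines)

# Share of reviews mentioning "small", from the counts reported above
share = round(100 * 278 / 433, 1)
print(substring_hits, whole_word_hits, share)
```

Either way of counting supports the same reading here: roughly two thirds of the low-rated sizing reviews complain that the product is small.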


In [None]:
# Running IBM Watson on the two files
IBM_Watson_Cluster_Personality('USERNAME','PASSWORD', 'poor_clusters', 'poorcluster_personality.csv')

Additionally, I wanted to subset the reviews by rating. With the code below, I created two more review dictionaries: one for reviews rated 4.0 or higher and one for reviews rated 2.0 or lower. This lets me see which topics are common in each group and make better branding decisions.

In [6]:
import ast

# Creating High and Low review dictionaries based on star rating
highreviews = {}
lowreviews = {}
count = 0
# The with-block closes the reviews file when the loop is done
with open('reviews_Clothing_Shoes_and_Jewelry.json', 'r') as loadjson2:
    for line in loadjson2:
        count += 1
        if count % 100000 == 0:
            print(format(count, ',') + ' Items Created')

        # literal_eval parses the printed dictionary without executing code
        review = ast.literal_eval(line)
        theasin = review['asin']
        # Skip reviews of non-Columbia products
        if theasin not in allcolumbiaasins:
            continue
        stars = float(review['overall'])
        thekey = '%s.%s' % (theasin, review['reviewerID'])

        if stars >= 4.0:
            highreviews[thekey] = review
        elif stars <= 2.0:
            lowreviews[thekey] = review
print('High and Low Review Dictionaries Created')
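
The split rule itself is small enough to test on its own. A minimal sketch, with `rating_bucket` as a hypothetical helper that mirrors the thresholds used above:

```python
def rating_bucket(stars):
    """Classify a star rating as 'high' (>= 4), 'low' (<= 2), or None."""
    stars = float(stars)
    if stars >= 4.0:
        return 'high'
    if stars <= 2.0:
        return 'low'
    return None  # 3-star reviews fall in neither dictionary

buckets = {s: rating_bucket(s) for s in (1.0, 2.0, 3.0, 4.0, 5.0)}
print(buckets)
```

Three-star reviews are left out of both groups, which keeps the high and low sets clearly separated at the cost of ignoring the most ambivalent reviewers.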

Finally, I clustered the high-rated and low-rated reviews separately and applied IBM Watson Personality Insights to each set of clusters.

In [None]:
# Clustering the high and low reviews separately and applying IBM Watson API
documents = list(load_texts(highreviews))
kmeans_creator(documents, highreviews, 5, 'columbia', 'outputHigh')
score_getter(highreviews)
with open('ColumbiaHighReviews.json', 'w') as outfile:
    json.dump(highreviews, outfile)
print('Analyzing High Review Dictionary with IBM Watson')
IBM_Watson_Cluster_Personality('USERNAME','PASSWORD', 'outputHigh', 'Highcluster_personality.csv')

documents = list(load_texts(lowreviews))
kmeans_creator(documents, lowreviews, 5, 'columbia', 'outputLow')
score_getter(lowreviews)
with open('ColumbiaLowReviews.json', 'w') as outfile:
    json.dump(lowreviews, outfile)
print('Analyzing Low Review Dictionary with IBM Watson')
IBM_Watson_Cluster_Personality('USERNAME','PASSWORD', 'outputLow', 'Lowcluster_personality.csv')