In this notebook we will analyze the Amazon reviews of an online business, the OrthoKey.

Rather than building predictive models, we’ll focus on word frequency analysis and topic modeling, as these approaches seem more relevant and actionable for the business.

In [1]:
from IPython.display import Image

Image(url= "https://www.theorthokey.com/wp-content/uploads/2019/05/cropped-OrthoKeyLogo-Rasterized-2.png")

First, we'll scrape reviews from the product page. Keep in mind that Amazon is not very scraping-friendly, so while the code below currently works, it may stop functioning if Amazon updates its website.

We’ll need to install a web driver for Selenium. If you haven’t already downloaded it, you’ll need to do that first. In this notebook, we’re using ChromeDriver version 91. You can download it here:

https://chromedriver.chromium.org/downloads

In [1]:
from selenium import webdriver

This is the URL of the product's reviews on Amazon.

In [15]:
url = 'https://www.amazon.com/OrthoKey-OrthoPod-Aligners-Toothbrush-Toothpaste/product-reviews/B08GJK3KW9/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1'

Let's now instantiate the driver.

In [3]:
# Selenium 4+ expects webdriver.Chrome(service=Service(path)) instead of a positional path.
driver = webdriver.Chrome(r'C:\Users\jaliu\Downloads\chromedriver.exe')

Use the driver to navigate to the Amazon review page.

In [16]:
driver.get(url)

Before continuing, make sure all required packages are installed and properly imported. If any section of the notebook fails to execute as expected, missing or misconfigured packages are a likely cause.

In [17]:
from bs4 import BeautifulSoup as bs
import numpy as np
import seaborn as sns
import pandas as pd
import requests
import time
import random
from selenium import webdriver
import matplotlib.pyplot as plt
import csv
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
import string
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import FreqDist
import nltk
from textblob import TextBlob
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

After scraping, we'll return the content as a BeautifulSoup object. At this stage, the extracted text is not yet human-readable.

In [18]:
soup = bs(driver.page_source, 'html.parser')
soup

<html class="a-js a-audio a-video a-canvas a-svg a-drag-drop a-geolocation a-history a-webworker a-autofocus a-input-placeholder a-textarea-placeholder a-local-storage a-gradients a-hires a-transform3d a-touch-scrolling a-text-shadow a-text-stroke a-box-shadow a-border-radius a-border-image a-opacity a-transform a-transition a-ember" data-19ax5a9jf="dingo" data-aui-build-date="3.23.1-2023-06-23" lang="en-us"><!-- sp:feature:head-start --><head><script async="" crossorigin="anonymous" src="https://images-na.ssl-images-amazon.com/images/I/31bJewCvY-L.js"></script><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>
<!-- sp:end-feature:head-start -->
<!-- sp:feature:csm:head-open-part1 -->
<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>
<!-- sp:end-feature:csm:head-open-part1 -->
<!-- sp:feature:cs-optimization -->
<meta content="on" http-equiv="x-dns-prefetch-control"/>
<link href="https://images-na.ssl-images-amazon.com" rel="dns-prefetc

Now that we’ve retrieved the raw HTML, we’ll begin parsing the reviews. This involves extracting specific elements from the page structure that contain review content. Once the reviews are extracted, we’ll store them in a DataFrame to make further analysis easier and more structured.

Concretely, we search the soup object for the HTML tags that hold each field so we can build our dataframe.

Getting reviewer names.

In [19]:
names = soup.find_all('span', class_='a-profile-name')
names

[<span class="a-profile-name">Noble Path</span>,
 <span class="a-profile-name">T. Miller</span>,
 <span class="a-profile-name">Noble Path</span>,
 <span class="a-profile-name">Mayra</span>,
 <span class="a-profile-name">Karen<span class="a-icon a-profile-verified-badge"><span class="a-profile-verified-text"></span></span></span>,
 <span class="a-profile-name">A. Perez</span>,
 <span class="a-profile-name">T. Miller</span>,
 <span class="a-profile-name">Richard Rahl</span>,
 <span class="a-profile-name">Richard Rahl</span>,
 <span class="a-profile-name">Just Jo</span>,
 <span class="a-profile-name">Just Jo</span>,
 <span class="a-profile-name">Kat.L</span>,
 <span class="a-profile-name">P Bru</span>,
 <span class="a-profile-name">D. Bandy</span>]

In [20]:
reviewers = []

for i in range(0,len(names)):
    reviewers.append(names[i].get_text())
    
reviewers

['Noble Path',
 'T. Miller',
 'Noble Path',
 'Mayra',
 'Karen',
 'A. Perez',
 'T. Miller',
 'Richard Rahl',
 'Richard Rahl',
 'Just Jo',
 'Just Jo',
 'Kat.L',
 'P Bru',
 'D. Bandy']

Remove the first and second reviewers because they are duplicates from the top positive and negative review sections.


In [21]:
reviewers.pop(0)
reviewers.pop(0)

reviewers

['Noble Path',
 'Mayra',
 'Karen',
 'A. Perez',
 'T. Miller',
 'Richard Rahl',
 'Richard Rahl',
 'Just Jo',
 'Just Jo',
 'Kat.L',
 'P Bru',
 'D. Bandy']

Repeat for review titles.

In [22]:
title = soup.find_all('a', class_='review-title')
title

[<a class="a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R2DEUNBRUIGHHU/ref=cm_cr_arp_d_rvw_ttl?ie=UTF8&amp;ASIN=B08GJK3KW9"><i class="a-icon a-icon-star a-star-4 review-rating" data-hook="review-star-rating"><span class="a-icon-alt">4.0 out of 5 stars</span></i><span class="a-letter-space"></span>
 <span>Very handy product</span>
 </a>,
 <a class="a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold" data-hook="review-title" href="/gp/customer-reviews/RAAUVKNZXQW66/ref=cm_cr_arp_d_rvw_ttl?ie=UTF8&amp;ASIN=B08GJK3KW9"><i class="a-icon a-icon-star a-star-5 review-rating" data-hook="review-star-rating"><span class="a-icon-alt">5.0 out of 5 stars</span></i><span class="a-letter-space"></span>
 <span>Great</span>
 </a>,
 <a class="a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R

In [23]:
review_titles = []

for i in range(0,len(title)):
    review_titles.append(title[i].get_text())  
    
review_titles

['4.0 out of 5 stars\nVery handy product\n',
 '5.0 out of 5 stars\nGreat\n',
 '4.0 out of 5 stars\nGreat for keeping everything together when traveling\n',
 '4.0 out of 5 stars\nKeeps most Invisalign cleaning supplies together\n',
 '3.0 out of 5 stars\nVery deceiving photo. Item is CASE ONLY.\n',
 '4.0 out of 5 stars\nBad use of space\n',
 '4.0 out of 5 stars\nSeems like a good case\n',
 "3.0 out of 5 stars\nDoesn't fit well in your purse\n",
 '3.0 out of 5 stars\nWhat it pictures does not come with it.  It’s just a case.\n',
 '5.0 out of 5 stars\nPefect\n']

Strip the leading and trailing newlines from the review titles.

In [32]:
# strip('\n') removes newlines from both ends in one pass
review_titles = [title_text.strip('\n') for title_text in review_titles]
review_titles

['2.0 out of 5 stars\nProduct',
 '2.0 out of 5 stars\nToo bulky',
 '1.0 out of 5 stars\nMisleading Description - Do Not Buy',
 '3.0 out of 5 stars\nGet one.',
 '3.0 out of 5 stars\nMixed review',
 '5.0 out of 5 stars\nBest All-In-One Case For Aligners',
 '5.0 out of 5 stars\nGET THIS if you have aligners or a retainer!',
 '5.0 out of 5 stars\nNice product',
 '5.0 out of 5 stars\nGreat for daily travel.',
 '5.0 out of 5 stars\nHolds everything you need']
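The stripped titles above still begin with the star-rating label. A minimal sketch of separating the two parts, assuming the label always ends in "out of 5 stars" and is followed by a newline as in the output shown (the helper name `split_title` is hypothetical):

```python
def split_title(raw_title):
    """Split a scraped title into (rating_text, title_text)."""
    # Drop surrounding whitespace, then split on the first newline only.
    head, sep, tail = raw_title.strip().partition('\n')
    if sep and head.endswith('out of 5 stars'):
        return head, tail.strip()
    # No rating prefix found: treat the whole string as the title.
    return '', raw_title.strip()


raw = '4.0 out of 5 stars\nVery handy product\n'
rating_text, title_text = split_title(raw)
print(rating_text)  # 4.0 out of 5 stars
print(title_text)   # Very handy product
```

Keeping only the title part would stop words such as "stars" from the rating label being counted in the later word-frequency steps.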

Repeat for ratings of reviews.

In [33]:
rating = soup.find_all("i", {"data-hook":"review-star-rating"})

rating

[<i class="a-icon a-icon-star a-star-2 review-rating" data-hook="review-star-rating"><span class="a-icon-alt">2.0 out of 5 stars</span></i>,
 <i class="a-icon a-icon-star a-star-2 review-rating" data-hook="review-star-rating"><span class="a-icon-alt">2.0 out of 5 stars</span></i>,
 <i class="a-icon a-icon-star a-star-1 review-rating" data-hook="review-star-rating"><span class="a-icon-alt">1.0 out of 5 stars</span></i>,
 <i class="a-icon a-icon-star a-star-3 review-rating" data-hook="review-star-rating"><span class="a-icon-alt">3.0 out of 5 stars</span></i>,
 <i class="a-icon a-icon-star a-star-3 review-rating" data-hook="review-star-rating"><span class="a-icon-alt">3.0 out of 5 stars</span></i>,
 <i class="a-icon a-icon-star a-star-5 review-rating" data-hook="review-star-rating"><span class="a-icon-alt">5.0 out of 5 stars</span></i>,
 <i class="a-icon a-icon-star a-star-5 review-rating" data-hook="review-star-rating"><span class="a-icon-alt">5.0 out of 5 stars</span></i>,
 <i class="a-

In [34]:
ratings = []

for i in range(0,len(rating)):
    ratings.append(rating[i].get_text())
    
ratings

['2.0 out of 5 stars',
 '2.0 out of 5 stars',
 '1.0 out of 5 stars',
 '3.0 out of 5 stars',
 '3.0 out of 5 stars',
 '5.0 out of 5 stars',
 '5.0 out of 5 stars',
 '5.0 out of 5 stars',
 '5.0 out of 5 stars',
 '5.0 out of 5 stars']

Remove characters to isolate numerical values

In [35]:
for i in range(0,len(ratings)):
    ratings[i] = ratings[i].replace('out of 5 stars','').strip()
    
ratings    

['2.0', '2.0', '1.0', '3.0', '3.0', '5.0', '5.0', '5.0', '5.0', '5.0']
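As an alternative to replacing the label text, the numeric part can be pulled out with a regular expression and converted to a float in one step. This is only a sketch; the names `RATING_PATTERN` and `parse_rating` are hypothetical, and the pattern assumes labels of the form shown above.

```python
import re

# Matches labels such as '2.0 out of 5 stars' and captures the number.
RATING_PATTERN = re.compile(r'^\s*(\d+(?:\.\d+)?)\s+out of 5 stars\s*$')


def parse_rating(label):
    """Return the rating as a float, or None if the label has another shape."""
    match = RATING_PATTERN.match(label)
    return float(match.group(1)) if match else None


labels = ['2.0 out of 5 stars', '5.0 out of 5 stars', 'not a rating']
print([parse_rating(label) for label in labels])  # [2.0, 5.0, None]
```

Returning None for unexpected labels makes a change in the page layout visible instead of silently producing bad values.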

Repeat for the actual review text.

In [36]:
text = soup.find_all("span", class_='review-text-content')

text

[<span class="a-size-base review-text review-text-content" data-hook="review-body">
 <span>It is too big and bulky for a purse, did not like</span>
 </span>,
 <span class="a-size-base review-text review-text-content" data-hook="review-body">
 <span>It is too big and bulky. You need a big purse to carry it on. Not very functional when you travel. It is better to keep it home to have things organized</span>
 </span>,
 <span class="a-size-base review-text review-text-content" data-hook="review-body">
 <span>I bought this case on 2/23/23. Nowhere in the initial description did it mention accessories were not included. The 5 photos and one video show a toothbrush and toothpaste and some show the key as well.<br/><br/>Way down in the at the bottom of the Features and Details it says…. “Please note that this product only includes the Orthokey Orthopod. Toothbrushes, toothpaste, Orthokey Aligner Chews, and Clear Aligners are not included.”  It was in very small print and did not stand out at a

In [37]:
review_text = []

for i in range(0,len(text)):
    review_text.append(text[i].get_text())  
    
review_text

['\nIt is too big and bulky for a purse, did not like\n',
 '\nIt is too big and bulky. You need a big purse to carry it on. Not very functional when you travel. It is better to keep it home to have things organized\n',
 '\nI bought this case on 2/23/23. Nowhere in the initial description did it mention accessories were not included. The 5 photos and one video show a toothbrush and toothpaste and some show the key as well.Way down in the at the bottom of the Features and Details it says…. “Please note that this product only includes the Orthokey Orthopod. Toothbrushes, toothpaste, Orthokey Aligner Chews, and Clear Aligners are not included.”  It was in very small print and did not stand out at all.The disclaimer needs to be shown in the pics, video, and in the initial description at the top of the page. I feel like the placement of the only disclaimer is intentional.I will take the blame for not reading the information near the bottom of the page, but shame on them for misleading us sho

Cleaning the text.

In [38]:
# strip('\n') removes newlines from both ends in one pass
review_text = [body.strip('\n') for body in review_text]
review_text

['It is too big and bulky for a purse, did not like',
 'It is too big and bulky. You need a big purse to carry it on. Not very functional when you travel. It is better to keep it home to have things organized',
 'I bought this case on 2/23/23. Nowhere in the initial description did it mention accessories were not included. The 5 photos and one video show a toothbrush and toothpaste and some show the key as well.Way down in the at the bottom of the Features and Details it says…. “Please note that this product only includes the Orthokey Orthopod. Toothbrushes, toothpaste, Orthokey Aligner Chews, and Clear Aligners are not included.”  It was in very small print and did not stand out at all.The disclaimer needs to be shown in the pics, video, and in the initial description at the top of the page. I feel like the placement of the only disclaimer is intentional.I will take the blame for not reading the information near the bottom of the page, but shame on them for misleading us shoppers.',
 

Create dataframe with our lists and check data.


In [39]:
df = pd.DataFrame()

df['Customer Name'] = reviewers
df['Title'] = review_titles
df['Rating'] = ratings
df['Text'] = review_text

df

ValueError: Length of values (10) does not match length of index (12)

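The ValueError shows that the lists are not yet the same length: reviewers holds 12 names but review_titles holds only 10. Some names appear twice in a row in the scraped list (for example 'Richard Rahl' and 'Just Jo'), so the extra entries need to be removed before the dataframe can be built and checked against what we see on the site.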
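A quick way to compare the column lengths before building the dataframe can be sketched as follows. The helper names `column_lengths` and `assert_same_length` are hypothetical, and the placeholder lists only mimic the sizes reported in the error above.

```python
def column_lengths(**columns):
    """Map each column name to the number of values collected for it."""
    return {name: len(values) for name, values in columns.items()}


def assert_same_length(**columns):
    """Raise ValueError if the columns do not all have the same length."""
    lengths = column_lengths(**columns)
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths


# Placeholder data with the sizes from the error message.
reviewers = ['name'] * 12
review_titles = ['title'] * 10

try:
    assert_same_length(reviewers=reviewers, review_titles=review_titles)
except ValueError as err:
    print(err)  # Column lengths differ: {'reviewers': 12, 'review_titles': 10}
```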

Now we have to do this for all pages of reviews. We set our loop to some arbitrary large number to ensure all pages are captured.

## We will create a function that will do all the above for each page.


First, we make a soup function.

In [None]:
def make_soup(url):
    driver.get(url)
    soup = bs(driver.page_source, 'html.parser')
    return soup

Next, we create a function to scrape the data we want from the soup. We will combine and clean up the code we produced earlier to do this cleanly and efficiently.

In [None]:
def scrape_reviews(soup):
    """Append one page of reviews to the global lists defined below."""

    # get names; remember where this page's names start in the global list
    start = len(reviewers)
    names = soup.find_all('span', class_='a-profile-name')
    reviewers.extend(name.get_text() for name in names)

    # remove this page's first two names because they are duplicates from
    # the top positive and negative review sections
    del reviewers[start:start + 2]

    # get titles, stripping surrounding newlines
    titles = soup.find_all('a', class_='review-title')
    review_titles.extend(t.get_text().strip('\n') for t in titles)

    # get ratings and keep only the numeric part, e.g. '4.0'
    stars = soup.find_all('i', {'data-hook': 'review-star-rating'})
    ratings.extend(s.get_text().replace('out of 5 stars', '').strip() for s in stars)

    # get review text, stripping surrounding newlines
    bodies = soup.find_all('span', class_='review-text-content')
    review_text.extend(b.get_text().strip('\n') for b in bodies)

We begin scraping reviews from every page. We'll need to create empty lists for each column to add our data into.

In [None]:
reviewers = []
review_titles = []
ratings = []
review_text = []

The different review pages can be identified by the page number element in the URL, so we will loop through every page, changing the page number in the URL on each pass.

Note: the HTML changes when we reach the international reviews, so we stop collecting data once we land on that page.
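The page-number substitution described above can be sketched with the standard library. The helper name `page_url` and the example URL are hypothetical, used only for illustration; the sketch rewrites the `pageNumber` query parameter and does not fetch anything.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(url, page):
    """Return `url` with its pageNumber query parameter set to `page`."""
    parts = urlsplit(url)
    # Keep the existing query parameters in order and overwrite pageNumber.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query['pageNumber'] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical review URL, for illustration only.
example = ('https://www.example.com/product-reviews/ITEM123/'
           '?ie=UTF8&reviewerType=all_reviews&pageNumber=1')
for page in range(1, 4):
    print(page_url(example, page))
```

An f-string with the page number at the end of the URL does the same job when the parameter is known to come last; parsing the query is just less sensitive to parameter order.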


In [None]:
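# Same OrthoKey review URL as defined earlier, with the page number filled in on each pass.
base_url = ('https://www.amazon.com/OrthoKey-OrthoPod-Aligners-Toothbrush-Toothpaste/'
            'product-reviews/B08GJK3KW9/ref=cm_cr_arp_d_paging_btm_next_2'
            '?ie=UTF8&reviewerType=all_reviews&pageNumber={}')

for i in range(1, 999):
    soup = make_soup(base_url.format(i))
    # stop once we reach the international reviews, where the HTML changes
    if soup.find(string='From other countries'):
        break
    scrape_reviews(soup)

    print(len(reviewers))
    print(len(review_titles))
    print(len(review_text))
    print(len(ratings))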

Creating our dataframe.

In [None]:
df = pd.DataFrame()

df['Customer Name'] = reviewers
df['Title'] = review_titles
df['Rating'] = ratings
df['Text'] = review_text


We combine title text with review text into one new column so we don't lose information.

In [None]:
df['Combined'] = df['Title'] + ' ' + df['Text']

df


Check our dataframe dimensions.

In [None]:
df.shape

Check tail of dataset to verify data matches online reviews.

In [None]:
df.tail()

We can export our dataframe to CSV and Excel. This step is optional and can be skipped.

In [None]:
df.to_csv('orthokeydat.csv', index=False)
df.to_excel('orthokeydat.xlsx', index=False)

Let's do some exploratory data analysis!!!

Check missing values.

In [None]:
df.isnull().sum()

Check unique values.

In [None]:
df['Rating'].value_counts()

We will categorize the ratings as positive or negative. Ratings of 3 and lower will be coded '0' for 'bad', and ratings higher than 3 will be coded '1' for 'good'.

In [None]:
df['Rating'] = df['Rating'].astype(float)
df['Target']= [0  if x <= 3 else 1 for x in df['Rating']]

Plotting review counts for different groups.

In [None]:
# count reviews per star rating, ordered from lowest to highest rating
rating_counts = df['Rating'].value_counts().sort_index()

sns.barplot(x=rating_counts.index, y=rating_counts.values)


Plotting review counts in bad review group and good review group.

In [None]:
# count reviews per class, with the good (1) class listed first
target_counts = df['Target'].value_counts().sort_index(ascending=False)

sns.barplot(x=target_counts.index, y=target_counts.values)

Let's check frequency of words.

Now that we have a structured dataset of reviews, we first need to do some text pre-processing to prepare it for analysis. This includes removing text that carries little meaning, such as stop words, punctuation, and symbols. We will also reduce words to their root forms so that different tenses, and singular and plural forms, are counted together, and treat words with different casing as the same.

Make list of stopwords (words that occur frequently but aren't useful to analyze). We will use the stopwords provided by nltk.

In [None]:
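nltk.download('stopwords')
# tokenizer data used by nltk.word_tokenize below; newer NLTK releases may ask for 'punkt_tab'
nltk.download('punkt')
from nltk.corpus import stopwords

stopwords = stopwords.words('english')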

Include some punctuation characters in our list.

In [None]:
stopwords += list(string.punctuation)

stopwords

Now, we'll make a tokenizer. This splits our text into individual words. Our function will include stopwords removal.

In [None]:
def Tokenizer(text):
    tokens = nltk.word_tokenize(text)
    processed_text = [token.lower() for token in tokens if token.lower() not in stopwords]
    return processed_text
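The idea behind the tokenizer, splitting text into lowercase words and dropping stop words, can be illustrated without nltk. The tiny stop-word set and the regular expression below are assumptions made for this sketch; nltk's tokenizer and stop-word list are more thorough.

```python
import re

# A tiny, hypothetical stop-word set for illustration; nltk's list is much longer.
SMALL_STOPWORDS = {'it', 'is', 'and', 'for', 'a', 'the', 'to', 'did', 'not'}


def simple_tokenizer(text):
    """Lowercase the text, keep runs of letters, digits and apostrophes, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in SMALL_STOPWORDS]


print(simple_tokenizer('It is too big and bulky for a purse, did not like'))
# ['too', 'big', 'bulky', 'purse', 'like']
```

Because the regular expression only keeps word characters, punctuation never becomes a token here, which is why this sketch needs no punctuation entries in its stop-word set.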

Applying the tokenizer to our text. We can see the individual lists of words for each row of reviewers.

In [None]:
processed_text = df.Combined.apply(Tokenizer)

len(processed_text)

Check first row after applying Tokenizer.

In [None]:
processed_text[0]

Next, we’ll perform stemming and lemmatization to reduce words to their base or root forms, helping to group similar words together during analysis. In other words we convert words to their roots, so we don't count the "same words" multiple times as different words.

Create instance of stemmer.

In [None]:
ps = PorterStemmer()

Make stemmer function.

In [None]:
def Stemmer(text):
    stemmed_text=[]
    for word in text:
        stemmed_text.append(ps.stem(word))
    return stemmed_text

Stem the processed text

In [None]:
processed_text = processed_text.apply(Stemmer)

len(processed_text)

Check first row after applying Stemmer.

In [None]:
processed_text[0]

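Now, we'll lemmatize the text. Lemmatization is similar in spirit to stemming, but instead of chopping off suffixes it maps each word to its dictionary form (lemma), so different inflections of the same word are counted together. Because the lemmatizer looks words up in a dictionary, it may leave already-stemmed tokens that are no longer real words unchanged. We use nltk's WordNet database.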

In [None]:
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

Create lemmatization function.

In [None]:
def Lemmatizer(text):
    lemmatized_list=[]
    for word in text:
        lemma_word=lemmatizer.lemmatize(word,pos='v') 
        lemmatized_list.append(lemma_word)
    return lemmatized_list

Lemmatize the processed text

In [None]:
# assign the result back; apply() returns a new Series
processed_text = processed_text.apply(Lemmatizer)

Check first row after applying Lemmatizer

In [None]:
processed_text[0]

Now we'll check the word frequencies before we processed the text.

In [None]:
df.Combined.str.split(expand=True).stack().value_counts()

After processing the text. 

In [None]:
# explode() puts one token per row, so the counts are of clean tokens
word_counts = processed_text.explode().value_counts()
word_counts
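Counting words across a column of token lists amounts to flattening the lists and tallying each token. A minimal plain-Python sketch, using made-up token lists rather than the scraped reviews:

```python
from collections import Counter
from itertools import chain

# Made-up token lists standing in for the processed reviews.
token_lists = [
    ['big', 'bulky', 'purse'],
    ['big', 'bulky', 'travel'],
    ['case', 'travel', 'big'],
]

# Flatten the lists into one stream of tokens and count each token.
counts = Counter(chain.from_iterable(token_lists))

print(counts['big'])           # 3
print(counts.most_common(1))   # [('big', 3)]
```

Working on the tokens themselves, rather than on the string form of each list, keeps brackets, quotes and commas out of the counted words.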

We’ll visualize the top 20 word frequencies using a bar plot to make the most common terms stand out more clearly.

In [None]:
word_counts[:20].plot(kind='bar')

Our lists look quite different after the processing.

With the text cleaned, let's look at the top words across positive and negative reviews to get a general sense of what customers are talking about.

Subset data into negative and positive and apply processing functions to text.

In [None]:
neg_text = df.Combined[df.Target==0].apply(Tokenizer).apply(Stemmer).apply(Lemmatizer)
pos_text = df.Combined[df.Target==1].apply(Tokenizer).apply(Stemmer).apply(Lemmatizer)

Plotting our top 20 from negative group.

In [None]:
neg_counts = neg_text.explode().value_counts()

neg_counts[:20].plot(kind='barh')

Plotting top 20 from positive group.

In [None]:
pos_counts = pos_text.explode().value_counts()

pos_counts[:20].plot(kind='barh')

Our single-word frequencies don't seem very useful. Let's try looking at bigrams and trigrams: loosely, sequences of 2 and 3 consecutive words, which convey more context than single words. See any introduction to n-grams for more detail on this topic.
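To make the idea concrete, here is a small plain-Python sketch of n-gram extraction. The helper name `ngrams` is hypothetical, and the sentence is adapted from one of the reviews shown earlier.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = 'too big and bulky for a purse'.split()

print(ngrams(tokens, 2))
# ['too big', 'big and', 'and bulky', 'bulky for', 'for a', 'a purse']
print(ngrams(tokens, 3))
# ['too big and', 'big and bulky', 'and bulky for', 'bulky for a', 'for a purse']
```

A sentence of 7 tokens yields 6 bigrams and 5 trigrams, since each n-gram starts one position later than the previous one.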

Creating bigrams for our negative reviews. We'll look at the top 10.

In [None]:
vectorizer = CountVectorizer(lowercase=True, ngram_range = (2,2), stop_words=('english'))

bi_neg = vectorizer.fit_transform(df.Combined[df.Target==0])

bi_neg_df = pd.DataFrame(bi_neg.toarray(), columns = vectorizer.get_feature_names_out())

bi_neg_counts = bi_neg_df.sum().sort_values(ascending=False)[:10]

bi_neg_counts

Top 10 bigrams for our positive reviews.

In [None]:
vectorizer = CountVectorizer(lowercase=True, ngram_range = (2,2), stop_words=('english'))

bi_pos = vectorizer.fit_transform(df.Combined[df.Target==1])

bi_pos_df = pd.DataFrame(bi_pos.toarray(), columns = vectorizer.get_feature_names_out())

bi_pos_counts = bi_pos_df.sum().sort_values(ascending=False)[:10]

bi_pos_counts

Top 10 trigrams for our negative reviews.

In [None]:
vectorizer = CountVectorizer(lowercase=True, ngram_range = (3,3), stop_words=('english'))

tri_neg = vectorizer.fit_transform(df.Combined[df.Target==0])

tri_neg_df = pd.DataFrame(tri_neg.toarray(), columns = vectorizer.get_feature_names_out())

tri_neg_counts = tri_neg_df.sum().sort_values(ascending=False)[:10]

tri_neg_counts

Top 10 trigrams for our positive reviews.

In [None]:
vectorizer = CountVectorizer(lowercase=True, ngram_range = (3,3), stop_words=('english'))

tri_pos = vectorizer.fit_transform(df.Combined[df.Target==1])

tri_pos_df = pd.DataFrame(tri_pos.toarray(), columns = vectorizer.get_feature_names_out())

tri_pos_counts = tri_pos_df.sum().sort_values(ascending=False)[:10]

tri_pos_counts

We see negative reviews mention that using fingers was easier while positive reviewers seemed to buy the product because they didn't like sticking their fingers in their mouths (and the tool helped with that).

Let's visualize bigrams and trigrams in a word cloud! Larger text indicates a higher frequency.

In [None]:
from wordcloud import WordCloud

Build word cloud function.

In [None]:
def Word_Cloud(counts):
    wordcloud = WordCloud(colormap='Spectral').generate_from_frequencies(counts)
    plt.figure(figsize=(8,8), facecolor='black')
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()

Word Cloud on negative, and then positive trigrams to visualize results.

In [None]:
Word_Cloud(tri_neg_counts)

In [None]:
Word_Cloud(tri_pos_counts) 

We will try TF-IDF (term frequency-inverse document frequency), which takes into account a term's frequency not only within each review but across all reviews. This gives more weight to words that are distinctive to a review than to words that are common everywhere. Essentially, it places each word in the context of the entire corpus of reviews rather than a single review. Please do some research if you're interested in a more detailed explanation of TF-IDF vectorization.
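The weighting can be sketched in plain Python. This sketch uses the smoothed inverse document frequency described in scikit-learn's documentation, idf = ln((1 + n) / (1 + df)) + 1, but skips the row normalization that TfidfVectorizer applies by default, so the numbers will not match its output exactly. The function name `tfidf` and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (no normalization)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return weights


# Toy tokenized reviews: 'case' appears in every document, 'bulky' in one.
docs = [['case', 'bulky'], ['case', 'great'], ['case', 'misleading', 'photo']]
weights = tfidf(docs)
print(weights[0]['case'])             # 1.0
print(round(weights[0]['bulky'], 3))  # 1.693
```

The word that appears in every review gets the lowest possible weight, while the word unique to one review is weighted more heavily.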

Creating the instance, fitting and transforming the text into a sparse matrix.

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words= 'english')

tfidf_vectorizer.fit(df.Combined)

tfidf_df = pd.DataFrame(tfidf_vectorizer.transform(df.Combined).toarray(),
                        columns = tfidf_vectorizer.get_feature_names_out())

Checking the number of different words in our text.

In [None]:
len(tfidf_df.columns)

Adding the words and weights to our dataframe.

In [None]:
df_combined = pd.concat([df, tfidf_df], axis=1)

"Melting" or pivoting, the words to change the format of the dataframe from wide to long so we can sort by weights.

In [None]:
df_long = pd.melt(df_combined,
                  id_vars='Customer Name', value_name='tfidf',
                  value_vars=list(tfidf_vectorizer.get_feature_names_out()))

Dropping rows with zeroes.

In [None]:
df_long = df_long[df_long['tfidf'] != 0]

Grouping by reviewer and looking at their top 5 words.

In [None]:
df_long.set_index('variable', inplace=True)
top_tfidf = df_long.groupby('Customer Name')['tfidf'].nlargest(5)

Check the customer Aurora, who gave a one-star review.

In [None]:
top_tfidf['Aurora']

Aurora's review mentions she had problems with the tip of the device being too thick, so the code is doing its job.

Finally, we want to do topic modelling on our negative reviews to see what needs to be improved. We will use LDA (Latent Dirichlet Allocation), which extracts latent topics by finding groups of words that tend to occur together in documents across a corpus. Each document has a probability distribution over topics and each topic has a probability distribution over words. Again, please do some research if interested in further details.

The first step is to convert the text to a document-term matrix and set some parameters. Of note, we want to exclude words that occur too frequently or too infrequently, so we set the max_df and min_df arguments accordingly.

In [None]:
vectorizer = CountVectorizer(lowercase   = True,
                             ngram_range = (1,2),
                             max_df      = .90,
                             min_df      = .01,
                             stop_words   = 'english',
                             max_features = None)


Fit and transform vectorizer on negative reviews.

In [None]:
vectorizer.fit(df['Combined'][df.Target==0])

In [None]:
review_word_counts = vectorizer.transform(df['Combined'][df.Target==0])

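Since choosing topics with LDA is somewhat subjective, the "right" number of topics is debatable. We'll try a grid search to see which number of topics performs best. With no scoring argument, GridSearchCV ranks candidates by the model's own score method, which for LDA is an approximate log-likelihood.

Create parameters for the number of topics, ranging from 1 to 7.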

In [None]:
parameters = {'n_components': [1,2,3,4,5,6,7]}

Initialize the LDA model.

In [None]:
lda = LatentDirichletAllocation()

Search for the optimal number of topics.

In [None]:
grid = GridSearchCV(lda, param_grid=parameters)

grid.fit(review_word_counts)

Let's see which number of topics performed best.

In [None]:
grid.best_estimator_

The optimal number of topics is 1. Playing with the arguments, the model still wants to choose 1 topic as the best number of topics. However, this isn't so useful for us. So let's see what happens if we force the model to use 3 topics.

In [None]:
lda_model = LatentDirichletAllocation(n_components = 3, random_state=777)

lda_model.fit(review_word_counts)

three_topics = lda_model.transform(review_word_counts)


We'll need to do some coding in order to display the words and weights for each topic.

Create dictionary and enumerate the elements.

In [None]:
top_dictionary = {}
feature_names = vectorizer.get_feature_names_out()

for index, topic in enumerate(lda_model.components_):
    # indices of the 10 highest-weighted terms, largest first
    top_idx = topic.argsort()[:-10 - 1:-1]
    top_dictionary["%d words" % (index+1)] = [feature_names[i] for i in top_idx]
    top_dictionary["%d weights" % (index+1)] = ['{:.1f}'.format(topic[i]) for i in top_idx]


Display our dataframe. This gives us the top 10 words with their weights for each topic.

In [None]:
pd.DataFrame(top_dictionary)

We see that the words have simply been split across three topics rather than one. In other words, the top word in each of the three topics was among the top three words when we chose a single topic (not shown here). The first topic is more about 'work', the second about 'use', and the third about 'invisalign'.

Some more coding to display the likelihood of each review belonging to a certain topic.

In [None]:
# number topics from 1 so the labels match the words and weights table
top_names = ["Topic" + str(i + 1) for i in range(lda_model.n_components)]

reviewer_names = ["Reviewer" + str(i) for i in range(len(df['Combined'][df.Target==0]))]

In [None]:
doc_weights = pd.DataFrame(np.round(three_topics, 2), columns=top_names, index=reviewer_names)

doc_weights

We see that review 0, which focuses specifically on the user's experience with Invisalign, likely belongs to the topic that emphasizes Invisalign! We note, however, that most reviews are very similar, which is probably why the grid search chose one topic and may be causing peculiarities in the results.