# Identifying Topics among 2151 Amazon Reviews

## In college I've had the chance to experiment with the entrepreneurial process through the SDSU ZIP Launchpad. With help from friends, I began by attempting to create a knee brace that would relieve arthritis pain.

## A recurring issue hindering our progress has been identifying and clearly communicating the problems that the product's features explicitly address. I found topic modeling algorithms, specifically "Latent Dirichlet Allocation" (LDA), to be promising tools for identifying trends in large amounts of Amazon product review data.

## This short report covers the methodology behind the algorithms used, the implementation of the tools, the challenges I faced, ways that you can use this technology, and an update on the current state of topic modeling algorithms.

#### Latent Dirichlet Allocation (LDA) is a statistical model within Natural Language Processing (NLP). It counts how often each word occurs in each document and groups words that tend to occur together into topics.

#### LDA is just one algorithm within NLP, but the necessary data preprocessing tasks are similar regardless of which statistical process you apply to your data. In order to use the LDA model, I had to clean my data with 3 different processes: removing stopwords, tokenizing the data and stemming the data.
#### "Stopwords" are common English words like "and", "or" and "the" which don't add any value to our final output. To run the LDA algorithm correctly you have to "stop" these "words" from being included in the text.
#### "Tokenizing" is the process of separating the text into individual words so that they can be entered into a matrix (a system of columns and rows) which the algorithm uses to measure the frequency of words and create topics.
#### "Stemming" is the process of reducing words to their "stem", e.g. reducing "stemming" to "stem". It is easier for the LDA algorithm to compare words when variants such as "strap" and "straps" are reduced to one common form. (The code in this notebook uses NLTK's WordNetLemmatizer, a closely related technique that maps words to their dictionary base form.)
#### Once the data was cleaned, I converted the lists of tokens (individual stemmed words) into a dictionary, which maps each unique word to an ID, and a bag-of-words corpus; these are the data formats required by the LDA model function. I then created the 'lda_model' variable and called the function with the necessary parameters.
    Parameters are instructions given to the function to modify its output - they can be adjusted within the parentheses next to function names
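
#### As a rough illustration of the three cleaning steps and of the word-to-ID dictionary, here is a minimal sketch in plain Python. The tiny stopword list, the `crude_stem` suffix rule and the sample review are all made up for illustration; the notebook itself relies on NLTK and gensim for these steps.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list (NLTK's English list is much longer)
STOPWORDS = {"the", "and", "or", "is", "it", "my", "a", "but"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Strip a few common suffixes; a simplified stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stopwords, then reduce each token to a crude stem."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]

review = "The strap is slipping and the straps slipped"
tokens = preprocess(review)
print(tokens)  # ['strap', 'slipp', 'strap', 'slipp']

# A dictionary assigns each unique token an integer ID ...
word_to_id = {word: i for i, word in enumerate(sorted(set(tokens)))}

# ... and a bag of words records (token ID, count) pairs for one document
bag_of_words = sorted(Counter(word_to_id[t] for t in tokens).items())
print(bag_of_words)  # [(0, 2), (1, 2)]
```

#### The bag-of-words pairs are the same kind of structure that gensim's `doc2bow` produces later in this notebook, although gensim assigns its own IDs.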

## I am going to avoid going into depth on any individual component of this project because I want it to be accessible to people who are completely new to data science, especially people in the ZIP Launchpad who want to use this as a tool to enhance their project pitches. To make it as usable as possible, I am going to break down the general steps starting from the moment you open your computer to begin the project. For more depth on the topic, there are various papers available online.
#### 1) Download the Python programming language package: https://www.python.org/downloads/
#### 2) Download Anaconda: https://www.anaconda.com/products/individual
        Anaconda is a platform which lets you download pre-made libraries and access development tools from one application on your computer.
            Libraries are packages of code created by other programmers which allow people to deploy complicated algorithms without coding them from scratch.
            Development tools, namely Jupyter Notebooks, let people write and test code on the same screen.
#### 3) Download PyCharm: https://www.jetbrains.com/pycharm/download/
        PyCharm is the Integrated Development Environment I recommend for building the Amazon reviews scraper
#### 4) Build the Amazon scraper with 'Scrapy' in PyCharm
    buildwithpython has a great tutorial on how to build a web scraper. The tool is immensely powerful but also incredibly complicated. Follow his tutorial as closely as possible.
    buildwithpython: https://www.youtube.com/playlist?list=PLhTjy8cBISEqkN-5Ku_kXG4QW33sxQo0t

#### 5) Save your scraped reviews to a csv using this code in the PyCharm terminal: scrapy crawl 'crawler name' -o 'filename'.csv
#### 6) Upload your .csv file into your Jupyter Notebook Home environment
#### 7) Input your file name into the "df = pd.read_csv('filename.csv')" section of code.
#### 8) If you are not aggregating multiple files together, prevent the section of code related to this action from running by placing a comment before each of the following lines:
    dataframes = [df1, df2, df3, df4, df5]
    df = pd.concat(dataframes, ignore_index=True)
#### 9) Open this notebook file in Jupyter Notebook
#### 10) Load your .csv(s) in the code line: df = pd.read_csv('')
#### 11) Import the necessary libraries and run the notebook (the run command can be found under "Kernel" -> "Restart and Run All")

### After cleaning the data you can apply the LDA algorithm and whatever visualization tools you desire to understand the results.

# Import .CSV(s) and Combine Them (If Necessary)

In [1]:
import nltk
import pandas as pd

# The stopword list needs a one-time download: nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

df1 = pd.read_csv('powerlix (1).csv')
df2 = pd.read_csv('bodyprox.csv')
df3 = pd.read_csv('cambivo.csv')
df4 = pd.read_csv('physix.csv')
df5 = pd.read_csv('riptgear.csv')

dataframes = [df1, df2, df3, df4, df5]

df = pd.concat(dataframes, ignore_index = True)
df

Unnamed: 0,review
0,The brace DOES NOT stay in place. I bought th...
1,"First of all, it's just one knee brace. Knee ..."
2,I don't know how I decided to buy this item th...
3,Doesn’t look like picture & not comfortable. T...
4,The knee brace is not good for work outs it do...
...,...
270,"Ran small,dissappointing,No pain relief and it..."
271,I thought it would have more support than it d...
272,"Not recommended.,too small' Amaz should ref..."
273,It works alright. Only see a little bit of a d...


# Split Rows of 10 Reviews Each Into Rows of 1 Review Each

In [2]:
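# Split each scraped row (one page of reviews) on '.,' into single reviews
pages = [str(page).split('.,') for page in df['review']]

# Flatten the list of lists into one flat list of reviews
reviews = [review for page in pages for review in page]

df = pd.DataFrame({'review': reviews})

# Define the corpus after the split so each document is one review
# (it was previously assigned before the split, so it held the unsplit rows)
corpus = df['review']
df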

Unnamed: 0,review
0,The brace DOES NOT stay in place. I bought th...
1,The size chart is way off. A true 5” above the...
2,I would recommend the Littlejian Compression K...
3,This brace worked fine for the first 3-4 weeks...
4,I've had this Knee Brace for about 10 days. D...
...,...
2147,But! Although this felt great around the knee ...
2148,This item is just a plain sleeve that can be p...
2149,It doesn’t help at all plus it is not stable (...
2150,It made me break out in a rash under my knee


# Define the Data Cleaning Function

In [3]:
import re
from nltk.stem import WordNetLemmatizer

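# Note: this is a lemmatizer rather than a true stemmer.
# It needs a one-time download: nltk.download('wordnet')
stemmer = WordNetLemmatizer()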

def preprocess_text(document):
        # Remove all the special characters
        document = re.sub(r'\W', ' ', str(document))

        # remove all single characters
        document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

        # Remove single characters from the start
        # (the caret must be unescaped to anchor at the start of the string)
        document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

        # Substituting multiple spaces with single space
        document = re.sub(r'\s+', ' ', document, flags=re.I)

        # Removing prefixed 'b'
        document = re.sub(r'^b\s+', '', document)

        # Converting to Lowercase
        document = document.lower()
        
        more_stopwords = ['knee', 'brace', 'amazon', 'doe', 'thigh','would', 'sleeve', 'product', 'like', 'one','two', 'leg','work', 'around']

        # Lemmatization
        tokens = document.split()
        tokens = [stemmer.lemmatize(word) for word in tokens]
        tokens = [word for word in tokens if word not in en_stop]
        tokens = [word for word in tokens if len(word)  > 2]
        tokens = [word for word in tokens if word not in more_stopwords]

        return tokens

# Apply the preprocess_text Cleaning Function to Your Data

In [4]:
processed_data = []
for doc in corpus:
    tokens = preprocess_text(doc)
    processed_data.append(tokens)

# Transform processed_data List into a Dictionary - Necessary Object Type for LDA Model

In [5]:
from gensim import corpora

gensim_dictionary = corpora.Dictionary(processed_data)
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in processed_data]

In [6]:
import pickle

# Use a context manager so the file is closed after writing
with open('gensim_corpus_corpus.pkl', 'wb') as corpus_file:
    pickle.dump(gensim_corpus, corpus_file)
gensim_dictionary.save('gensim_dictionary.gensim')

# Create the LDA Model with the Previously Created "gensim_corpus" and "gensim_dictionary"

In [7]:
import gensim

lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary, passes=20)
lda_model.save('gensim_model.gensim')

In [8]:
topics = lda_model.print_topics(num_words=4)
for topic in topics:
    print(topic)
    print('')

(0, '0.006*"compression" + 0.005*"seller" + 0.005*"email" + 0.003*"help"')

(1, '0.017*"support" + 0.014*"strap" + 0.012*"back" + 0.011*"good"')

(2, '0.018*"support" + 0.011*"fit" + 0.010*"good" + 0.008*"day"')

(3, '0.025*"size" + 0.020*"support" + 0.015*"fit" + 0.012*"tight"')



# Analyze Top Terms 

In [9]:
lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=8, id2word=gensim_dictionary, passes=15)
lda_model.save('gensim_model.gensim')
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.025*"size" + 0.020*"support" + 0.018*"fit" + 0.015*"tight" + 0.011*"good"')
(1, '0.026*"strap" + 0.024*"support" + 0.018*"back" + 0.015*"velcro" + 0.015*"uncomfortable"')
(2, '0.017*"support" + 0.016*"size" + 0.013*"fit" + 0.008*"large" + 0.007*"much"')
(3, '0.005*"velcro" + 0.005*"bulky" + 0.005*"support" + 0.004*"muy" + 0.004*"pero"')
(4, '0.023*"size" + 0.019*"support" + 0.012*"good" + 0.010*"fit" + 0.009*"ordered"')
(5, '0.023*"size" + 0.021*"support" + 0.010*"compression" + 0.010*"good" + 0.009*"stay"')
(6, '0.015*"support" + 0.013*"size" + 0.012*"day" + 0.010*"fit" + 0.009*"compression"')
(7, '0.013*"fit" + 0.010*"support" + 0.009*"strap" + 0.008*"get" + 0.008*"compression"')
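
#### One way to read the output above is to count how many topics each top term appears in. A term that shows up in nearly every topic (as "support" does here) does little to tell the topics apart and may be a candidate for the custom stopword list. The sketch below is one assumed approach, not part of the original analysis: `parse_topic` and `topic_strings` are illustrative names, and the strings are topics 0, 1 and 3 copied from the output above.

```python
import re
from collections import Counter

# Three topic strings copied from the print_topics output (topics 0, 1 and 3)
topic_strings = [
    '0.025*"size" + 0.020*"support" + 0.018*"fit" + 0.015*"tight" + 0.011*"good"',
    '0.026*"strap" + 0.024*"support" + 0.018*"back" + 0.015*"velcro" + 0.015*"uncomfortable"',
    '0.005*"velcro" + 0.005*"bulky" + 0.005*"support" + 0.004*"muy" + 0.004*"pero"',
]

def parse_topic(topic_string):
    """Turn '0.025*"size" + ...' into [('size', 0.025), ...]."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Count the number of topics in which each term appears
term_counts = Counter(
    word for topic in topic_strings for word, _ in parse_topic(topic)
)

# Terms shared by more than one topic
shared_terms = sorted(term for term, count in term_counts.items() if count > 1)
print(shared_terms)  # ['support', 'velcro']
```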


# Visualize the Topic Groups with pyLDAvis Library

In [10]:
gensim_dictionary = gensim.corpora.Dictionary.load('gensim_dictionary.gensim')
with open('gensim_corpus_corpus.pkl', 'rb') as corpus_file:
    gensim_corpus = pickle.load(corpus_file)
lda_model = gensim.models.ldamodel.LdaModel.load('gensim_model.gensim')

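# Note: in pyLDAvis 3.x this module was renamed to pyLDAvis.gensim_models;
# on those versions, import and call pyLDAvis.gensim_models instead.
import pyLDAvis.gensim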

lda_visualization = pyLDAvis.gensim.prepare(lda_model, gensim_corpus, gensim_dictionary, sort_topics=False)
pyLDAvis.display(lda_visualization)

# Challenges

#### 1) Building a web scraper was my first priority. I first tried to scrape the desired Amazon product page with "beautifulsoup" and "requests", but my HTTP requests were blocked by Amazon, which doesn't like people scraping its data en masse. I then had to make a much more sophisticated scraper using the "scrapy" library, which could access Amazon by continuously cycling through different "user agents", the identifier that tells a website what kind of client is accessing it. A continuous cycle of different user agents makes it look like multiple people are requesting the review pages, so Amazon doesn't raise its "alarm".
#### 2) Once I had the .csv files to analyze, I had to figure out which process to apply to them. It took some research to learn that topic modeling with LDA is a standard procedure for this type of task.
#### 3) The scraper that I used collected all 10 reviews on each page into one row in the .csv. I initially thought this would distort the LDA model results because each review wouldn't be accurately represented. I used nested 'for loops' and the pandas library to restructure the rows so that each contained only one review. Unfortunately some rows still contain multiple reviews: I used '.,' as a delimiter since most reviews end in a '.' and are separated by commas, but some ended in a '!', so the split didn't apply to them. I don't expect this to have affected the model much, because LDA ignores word order and treats each row as an unordered collection of 'tokens' (individual words), so a row holding several reviews is simply handled as one longer document. This word-counting approach makes the model less effective for short texts (reviews, tweets, etc.), but I'll discuss this in the last section.
#### 4) The last problem was combining the 5 different product review scrapes into one dataframe to analyze together. Pandas has a simple method for this: 'df = pd.concat(dataframes, ignore_index = True)'
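
#### The delimiter problem from challenge 3 could probably be reduced by splitting on a comma that follows any sentence-ending punctuation mark, not only a period. The sketch below is a minimal illustration under that assumption; `split_reviews` and the sample row are hypothetical, and unlike `split('.,')` it keeps the punctuation mark at the end of each review. It will still split wrongly wherever a single review contains a sequence like '.,' or '!,' internally.

```python
import re

def split_reviews(page_text):
    """Split one scraped row into reviews at commas that follow '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?]),", page_text)
    # Drop surrounding whitespace and any empty fragments
    return [part.strip() for part in parts if part.strip()]

# Hypothetical row: three reviews joined by commas, one ending in '!'
row = "Great brace!,Too small.,Fits well, but tight"
print(split_reviews(row))
# ['Great brace!', 'Too small.', 'Fits well, but tight']
```

#### The comma inside the last review is left alone because it does not follow a punctuation mark.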

# Advancements in Topic Modeling

#### While LDA is a powerful topic modeling tool, there is room for improvement in analyzing collections of short texts like tweets and product reviews. The 'Gibbs Sampling Dirichlet Mixture Model (GSDMM)' is a newer topic modeling algorithm that assumes each document has exactly 1 topic, whereas LDA models each document as a mixture of several topics. I wasn't able to implement GSDMM because part of the code is protected intellectual property and I'm not skilled enough in programming to recreate it.
#### A report on GSDMM can be found here: https://towardsdatascience.com/short-text-topic-modeling-70e50a57c883
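
#### To make the "one topic per document" idea concrete, here is a simplified sketch of the Gibbs sampling step GSDMM is built on, written from the published description of the algorithm (Yin and Wang, 2014) and not taken from any existing implementation. The function name `gsdmm`, the parameter values and the toy documents are all assumptions for illustration; it has not been run on the review data in this report.

```python
import math
import random
from collections import Counter

def gsdmm(docs, n_clusters=4, alpha=0.1, beta=0.1, n_iters=15, seed=0):
    """Assign each tokenized document to exactly one cluster (topic)."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_count = [0] * n_clusters                         # documents per cluster
    word_total = [0] * n_clusters                        # total tokens per cluster
    word_count = [Counter() for _ in range(n_clusters)]  # per-word counts per cluster
    labels = []

    def move(doc, cluster, sign):
        """Add (sign=+1) or remove (sign=-1) a document's counts from a cluster."""
        doc_count[cluster] += sign
        word_total[cluster] += sign * len(doc)
        for word in doc:
            word_count[cluster][word] += sign

    # Start from a random assignment
    for doc in docs:
        cluster = rng.randrange(n_clusters)
        labels.append(cluster)
        move(doc, cluster, +1)

    for _ in range(n_iters):
        for i, doc in enumerate(docs):
            move(doc, labels[i], -1)  # take the document out of its cluster
            log_probs = []
            for k in range(n_clusters):
                # Larger clusters are more attractive ...
                log_p = math.log(doc_count[k] + alpha)
                seen = Counter()
                for position, word in enumerate(doc):
                    # ... and so are clusters that already contain these words
                    log_p += math.log(word_count[k][word] + beta + seen[word])
                    log_p -= math.log(word_total[k] + vocab_size * beta + position)
                    seen[word] += 1
                log_probs.append(log_p)
            peak = max(log_probs)
            weights = [math.exp(lp - peak) for lp in log_probs]
            labels[i] = rng.choices(range(n_clusters), weights=weights)[0]
            move(doc, labels[i], +1)  # put it into the sampled cluster
    return labels

# Hypothetical, already-cleaned reviews
docs = [
    ["size", "small", "tight"],
    ["size", "large", "loose"],
    ["strap", "velcro", "slip"],
    ["strap", "slip", "velcro"],
]
labels = gsdmm(docs, n_clusters=3)
print(labels)  # one cluster label per review
```

#### Each review ends up with a single label, in contrast to LDA, which gives every document a weight for every topic.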

# Conclusion and LDA Uses

#### Topic modeling is a very powerful text mining method for determining what people are talking about at scale. Because Python is easy to learn, anyone willing to focus on the task can extract insight from large, unstructured data. LDA specifically can be used to identify what people do and don't like about products, people, and political movements, and can be a valuable asset to marketing and consumer research firms.