# Topic Modelling using Amazon reviews (Automotive Products)



![image.png](attachment:image.png)

### What is topic modelling?
One of the major applications of Natural Language Processing is extracting the topics discussed in large volumes of text. Examples of such text include social media feeds, customer reviews, feedback, and news stories. Latent Dirichlet Allocation (LDA) is an algorithm that can read through text documents and automatically output the topics discussed.
I will be using the Latent Dirichlet Allocation (LDA) implementation from the Gensim package.

LDA’s approach to topic modeling is to treat each document as a collection of topics in certain proportions, and each topic as a collection of keywords, again in certain proportions.

Once you provide the algorithm with the number of topics, all it does is rearrange the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distribution.
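
To make the two levels of proportions concrete, here is a minimal sketch with invented numbers. The topic names, words, and probabilities are illustrative assumptions, not output of the model built later in this notebook. It shows that, under this view, the probability of a word in a document is the topic-weighted sum of the per-topic word probabilities.

```python
# Toy illustration of LDA's two distributions (all numbers are invented).

# A document is a mixture of topics; the proportions sum to 1.
doc_topics = {"tires": 0.7, "lighting": 0.3}

# A topic is a distribution over keywords; each row sums to 1.
topic_words = {
    "tires": {"tire": 0.6, "pressure": 0.3, "bulb": 0.1},
    "lighting": {"bulb": 0.5, "bright": 0.4, "tire": 0.1},
}


def word_probability(word):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        doc_topics[topic] * topic_words[topic].get(word, 0.0)
        for topic in doc_topics
    )


vocabulary = sorted({w for words in topic_words.values() for w in words})
for word in vocabulary:
    print(word, round(word_probability(word), 2))
```

Because both levels are proper probability distributions, the resulting word probabilities for the document also sum to 1.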

##### What is a topic?
A topic is nothing but a collection of dominant keywords that typically represent it. Just by looking at the keywords, you can identify what the topic is all about.

We will also extract the volume and percentage contribution of each topic, using a visualization in pyLDAvis, to get an idea of how important each topic is.

In [1]:
#import packages

import nltk
nltk.download('stopwords')

import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# libraries for visualization
# (in pyLDAvis 3.x the gensim helper module is named pyLDAvis.gensim_models)
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

[nltk_data] Downloading package stopwords to C:\Users\Kalpita
[nltk_data]     Raut\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Reading and loading the Data

In [2]:
#read data
df = pd.read_json('Automotive_5.json', lines=True)
df.head()

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | B00002243X | [4, 4] | 5 | I needed a set of jumper cables for my new car... | 08 17, 2011 | A3F73SC1LY51OO | Alan Montgomery | Work Well - Should Have Bought Longer Ones | 1313539200 |
| 1 | B00002243X | [1, 1] | 4 | These long cables work fine for my truck, but ... | 09 4, 2011 | A20S66SKYXULG2 | alphonse | Okay long cables | 1315094400 |
| 2 | B00002243X | [0, 0] | 5 | Can't comment much on these since they have no... | 07 25, 2013 | A2I8LFSN2IS5EO | Chris | Looks and feels heavy Duty | 1374710400 |
| 3 | B00002243X | [19, 19] | 5 | I absolutley love Amazon!!! For the price of ... | 12 21, 2010 | A3GT2EWQSO45ZG | DeusEx | Excellent choice for Jumper Cables!!! | 1292889600 |
| 4 | B00002243X | [0, 0] | 5 | I purchased the 12' feet long cable set and th... | 07 4, 2012 | A3ESWJPAVRPWB4 | E. Hernandez | Excellent, High Quality Starter Cables | 1341360000 |


As you can see, the data contains the following columns:
- asin – ID of the product
- helpful – helpfulness rating of the review, e.g. 2/3
- overall – rating of the product
- reviewText – text of the review
- reviewTime – time of the review (raw)
- reviewerID – ID of the reviewer
- reviewerName – name of the reviewer
- summary – summary of the review
- unixReviewTime – time of the review (unix time)

For the scope of our analysis and this article, we will be using only the reviews column, i.e., **reviewText**.

## Steps for Preparing Data in Topic Modelling using LDA:


### 1. Tokenizing
Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.

In [3]:
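# remove unwanted characters, numbers and symbols
# (regex=True is needed in pandas 2.0+, where str.replace defaults to literal matching)
df['reviewText'] = df['reviewText'].str.replace("[^a-zA-Z#]", " ", regex=True)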
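
As a quick check of what this pattern does, the sketch below applies the same character class to a single made-up string with the standard `re` module. The helper name `clean_text` and the sample sentence are assumptions for illustration.

```python
import re


def clean_text(text):
    """Replace every character that is not an ASCII letter or '#' with a space."""
    return re.sub(r"[^a-zA-Z#]", " ", text)


sample = "Great 12' cables!!! Worked 100%."
cleaned = clean_text(sample)
tokens = cleaned.split()
print(tokens)
```

Digits and punctuation become spaces, so splitting on whitespace afterwards leaves only the alphabetic words.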

### 2. Stopwords
Certain parts of speech in English, like conjunctions (“for”, “or”) or the article “the”, carry no meaning for a topic model. These terms are called stop words and need to be removed from our token list.


In [4]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [5]:
# function to remove stopwords
def remove_stopwords(rev):
    rev_new = " ".join([i for i in rev if i not in stop_words])
    return rev_new

# remove short words (length < 3)
df['reviewText'] = df['reviewText'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))

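# remove stopwords from the text
# note: the NLTK list is lowercase and the text is not lowercased yet,
# so capitalized stopwords (e.g. "These", "For") survive this filter
reviews = [remove_stopwords(r.split()) for r in df['reviewText']]

# make entire text lowercase
reviews = [r.lower() for r in reviews]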
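
In the cell above, stop words are filtered before the text is lowercased, which is why words such as 'these' and 'for' still show up in the tokenized output further down. The sketch below shows one possible alternative order (lowercase first, then filter). It is not the pipeline used for the results in this notebook, and the small `STOP_WORDS` set is a hand-picked stand-in for the NLTK list.

```python
# A tiny stand-in for NLTK's English stop word list (assumption for illustration).
STOP_WORDS = {"these", "for", "my", "the", "but", "is"}


def preprocess(text, stop_words=STOP_WORDS, min_len=3):
    """Lowercase the text, then drop short words and stop words."""
    words = text.lower().split()
    return [w for w in words if len(w) >= min_len and w not in stop_words]


print(preprocess("These long cables work fine for my truck"))
```

Using a set for the stop words also makes each membership test constant-time, which can matter on a corpus with thousands of reviews.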

### 3. Lemmatizing
Lemmatization converts a word to its base form (lemma). For example: the lemma of the word ‘machines’ is ‘machine’. Likewise, ‘walking’ –> ‘walk’, ‘mice’ –> ‘mouse’ and so on.

In [6]:
# the original cell used the spaCy 2.x shortcut 'en'; spaCy 3.x needs the full model name
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatization(texts, tags=['NOUN', 'ADJ']): # keep only nouns and adjectives
    output = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        output.append([token.lemma_ for token in doc if token.pos_ in tags])
    return output

#### Let’s tokenize the reviews and then lemmatize the tokenized reviews

In [7]:
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])

['these', 'long', 'cables', 'work', 'fine', 'truck', 'quality', 'seems', 'little', 'shabby', 'side', 'for', 'money', 'expecting', 'dollar', 'snap', 'jumper', 'cables', 'seem', 'like', 'would', 'see', 'chinese', 'knock', 'shop', 'like', 'harbor', 'freight', 'bucks']


In [8]:
# print lemmatized review
reviews_2 = lemmatization(tokenized_reviews)
print(reviews_2[1])

['long', 'cable', 'fine', 'truck', 'quality', 'little', 'shabby', 'side', 'money', 'dollar', 'jumper', 'cable', 'chinese', 'shop', 'harbor', 'freight', 'buck']


#### As you can see, we have not just lemmatized the words but also filtered only nouns and adjectives. Let’s de-tokenize the lemmatized reviews 

In [10]:
# join each list of lemmas back into a single string
reviews_3 = [' '.join(tokens) for tokens in reviews_2]

df['reviews'] = reviews_3



# Building an LDA model

The first step in topic modelling is to convert the text into a numeric representation using a document-term frequency matrix: the **corpus**.

This matrix contains the count of each word in each document; the set of these words is called the **vocabulary**.

If we applied singular value decomposition to the input (corpus), as in latent semantic analysis, we would obtain two output matrices of interest:
- Document-specific topic allocation
- Topic-specific word allocation

In LDA, we apply statistical inference to obtain similar output, but in the form of probability distributions:
- Given a document, the probability distribution of topics within the document
- Given a specific topic, the distribution of words from the vocabulary




In [11]:
#creating the term dictionary of our corpus, where every unique term is assigned an index
dictionary = corpora.Dictionary(reviews_2)


In [12]:
#converting the list of reviews (reviews_2) into a Document Term Matrix using the dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_2]
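
To show what the (token id, count) pairs in such a document-term matrix look like, here is a pure-Python sketch of the same idea. The helper names `build_vocabulary` and `to_bow` are hypothetical, and the id assignment (order of first appearance) is a simplification that may differ from the ids gensim assigns.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to every unique token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


docs = [
    ["long", "cable", "fine", "truck", "cable"],
    ["truck", "quality", "fine"],
]
vocab = build_vocabulary(docs)
bows = [to_bow(doc, vocab) for doc in docs]
for bow in bows:
    print(bow)
```

Each document is stored sparsely: only the words it contains appear in its list, which keeps the matrix small even when the vocabulary is large.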

In [19]:
# Creating the object for LDA model using gensim library
LDA = gensim.models.ldamodel.LdaModel

# Build LDA model
lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=10, random_state=100,
                chunksize=1000, passes=50)

**I have specified the number of topics as 10 for this model using the num_topics parameter.**

In [22]:
# Print the keywords in the 10 topics
lda_model.print_topics()

[(0,
  '0.026*"good" + 0.021*"car" + 0.019*"oil" + 0.018*"product" + 0.015*"price" + 0.014*"filter" + 0.012*"time" + 0.011*"great" + 0.010*"quality" + 0.009*"year"'),
 (1,
  '0.093*"tire" + 0.076*"leather" + 0.053*"wheel" + 0.033*"gauge" + 0.029*"brush" + 0.024*"pressure" + 0.023*"seat" + 0.016*"air" + 0.016*"clean" + 0.014*"conditioner"'),
 (2,
  '0.034*"port" + 0.033*"brake" + 0.025*"board" + 0.025*"grease" + 0.019*"compressor" + 0.018*"seal" + 0.014*"film" + 0.014*"micro" + 0.012*"elbow" + 0.011*"project"'),
 (3,
  '0.067*"blade" + 0.056*"wiper" + 0.024*"windshield" + 0.018*"rain" + 0.015*"snow" + 0.013*"window" + 0.013*"bosch" + 0.013*"car" + 0.011*"old" + 0.011*"year"'),
 (4,
  '0.019*"easy" + 0.018*"good" + 0.015*"great" + 0.012*"use" + 0.011*"small" + 0.011*"work" + 0.011*"little" + 0.010*"nice" + 0.010*"thing" + 0.009*"fit"'),
 (5,
  '0.089*"light" + 0.038*"bulb" + 0.026*"bright" + 0.017*"white" + 0.014*"headlight" + 0.014*"color" + 0.013*"mat" + 0.012*"kit" + 0.009*"blue" + 0.

# INTERPRETATION:

Topic 0 is represented as '0.026*"good" + 0.021*"car" + 0.019*"oil" + 0.018*"product" + 0.015*"price" + 0.014*"filter" + 0.012*"time" + 0.011*"great" + 0.010*"quality" + 0.009*"year"'.

This means the top 10 keywords that contribute to this topic are ‘good’, ‘car’, ‘oil’ and so on, and the weight of ‘good’ in topic 0 is 0.026. (gensim numbers topics from 0; the lists below number them from 1, so gensim's topic 0 is Topic 1 below.)

The weights reflect how important a keyword is to that topic.
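
If the keywords and weights are needed as data rather than as a formatted string, the string can be split with a small regular expression. The sketch below uses a hypothetical helper, `parse_topic`, on the first five terms of the topic 0 string shown above. gensim's `lda_model.show_topic(topic_id, topn=10)` should return comparable (word, weight) pairs directly.

```python
import re


def parse_topic(topic_string):
    """Turn '0.026*"good" + 0.021*"car"' into [('good', 0.026), ('car', 0.021)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


topic_0 = '0.026*"good" + 0.021*"car" + 0.019*"oil" + 0.018*"product" + 0.015*"price"'
keywords = parse_topic(topic_0)
print(keywords[0])
print([word for word, _ in keywords])
```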


- **Topic 1:** 
good,car,oil,product,price,filter,time,great,quality,year

- **Topic 2:**
tire,leather,wheel,gauge,brush,pressure,seat,air,clean,conditioner

- **Topic 3:**
port,brake,board,grease,compressor,seal,film,micro,elbow,project

- **Topic 4:**
blade,wiper,windshield,rain,snow,window,bosch,car,old,year

- **Topic 5:**
good,easy,great,use,small,work,little,nice,thing,fit

- **Topic 6:**
light,bulb,bright,white,headlight,color,mat,kit,blue,good

- **Topic 7:**
battery,car,power,device,unit,charger,light,use,phone,charger

- **Topic 8:**
car,product,towel,good,wax,clea,use,great,time,water

- **Topic 9:**
code,app,plug,tool,engine,repair,sensor,cap,scanner,computer

- **Topic 10:**
hose,water,tank,valve,pump,pressure,nozzle,sewer,black,air

# Visualization
To visualize our topics in a 2-dimensional space we will use the pyLDAvis library. This visualization is interactive in nature and displays topics along with the most relevant words.

In [21]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
vis



## Interpreting the Visualization

- Each bubble on the left-hand side plot represents a topic. The larger the bubble, the more prevalent that topic is.
- The size of a bubble represents the percentage of the corpus's tokens that belong to that topic. For example, Topic 7 contains 4.6% of the tokens in the corpus.
- A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant. In our chart, topics 1-6 overlap whereas topics 7-10 do not. The overlap suggests that topics 1-6 share some vocabulary, while topics 7-10 are clearly separated, so overall we have built a reasonably good topic model.
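
To give a feel for where a figure like "4.6% of tokens" can come from, the sketch below estimates each topic's share of tokens by weighting every document's topic proportions by its length and normalizing. This is a simplified view of how such a share can be computed, with invented numbers; the function name `topic_token_shares` is hypothetical and this is not pyLDAvis's own code.

```python
def topic_token_shares(doc_topic_dists, doc_lengths):
    """Estimate each topic's share of all tokens in the corpus.

    doc_topic_dists: one list of topic proportions per document (each sums to 1).
    doc_lengths: number of tokens in each document.
    """
    num_topics = len(doc_topic_dists[0])
    totals = [0.0] * num_topics
    for dist, length in zip(doc_topic_dists, doc_lengths):
        for k, proportion in enumerate(dist):
            totals[k] += proportion * length
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


# Three toy documents over two topics (all numbers are invented).
doc_topic_dists = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_lengths = [10, 30, 20]
shares = topic_token_shares(doc_topic_dists, doc_lengths)
print([round(s, 3) for s in shares])
```

Longer documents contribute more tokens, so a topic that dominates a few long reviews can end up with a larger bubble than one spread thinly over many short ones.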



## Results & Discussion

Looking at the topics, we can summarize that they represent information about cars and car parts.

The following can be the topic summary for each topic:
- **Topic 1: (Regarding car quality)**
good,car,oil,product,price,filter,time,great,quality,year

- **Topic 2: (Regarding parts of the car)**
tire,leather,wheel,gauge,brush,pressure,seat,air,clean,conditioner

- **Topic 3: (Regarding car tools)**
port,brake,board,grease,compressor,seal,film,micro,elbow,project

- **Topic 4: (Regarding parts of the car)**
blade,wiper,windshield,rain,snow,window,bosch,car,old,year

- **Topic 5: (Quality review)**
good,easy,great,use,small,work,little,nice,thing,fit

- **Topic 6: (Regarding colour and appeal)**
light,bulb,bright,white,headlight,color,mat,kit,blue,good

- **Topic 7: (Regarding car power supply)**
battery,car,power,device,unit,charger,light,use,phone,charger

- **Topic 8: (Regarding car maintenance)**
car,product,towel,good,wax,clea,use,great,time,water

- **Topic 9: (Regarding electronic car applications)**
code,app,plug,tool,engine,repair,sensor,cap,scanner,computer

- **Topic 10: (Regarding fuel and internal parts of engine)**
hose,water,tank,valve,pump,pressure,nozzle,sewer,black,air


Thus, I agree with the topics that I have found. Most of the keywords in each topic are relevant to what the topic is about.  
 

 

## Do online communities always stay on topic?

Across the thousands of reviews in our dataset, some keywords fall outside the focus of their topic. Using the visualization and the top 10 words listed above, we observe that some keywords don't match what the topic represents. Some of these are:
 - Topic 3: elbow,port
 - Topic 4: year
 - Topic 6: mat
 - Topic 8: towel, time, water, shoe
 - Topic 10: hose, sewer, air
 
However, these anomalies are negligible compared to the other keywords, which relate closely to what the topic represents. Based on the observations in this study, we can conclude that most of the keywords were relevant to what their particular topic represented. Thus, online communities mostly tend to stay on topic.

## Citations:
1. For the visualization: https://pyldavis.readthedocs.io/en/latest/readme.html#usage
2. For the dataset:
   - R. He, J. McAuley. Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016.
   - J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015.
