# Sentiment Analysis using LDA

Businesses often want to know what customers think about the quality of their services so that they can improve and increase profits. Restaurant-goers may want to learn from others' experiences using a variety of criteria such as food quality, service, ambience, discounts and value for money. Yelp users may post reviews and ratings of businesses and services, or simply express their thoughts on other reviews. Negative reviews may influence potential customers' decisions; for example, a potential customer may cancel a service and persuade others to do the same.

#### Topic Modelling

As the name suggests, topic modelling is a process for automatically identifying the topics present in a text object and deriving hidden patterns exhibited by a text corpus, which in turn supports better decision making.

Topic modelling differs from rule-based text mining approaches that use regular expressions or dictionary-based keyword searching. It is an unsupervised approach for finding groups of words (called “topics”) in large collections of text.

Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”.
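
To make "co-occurring terms" concrete, here is a minimal sketch that counts which word pairs appear together in more than one document. The toy corpus is hypothetical, and pair counting is only an intuition aid; it is not how LDA itself is estimated.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each string stands in for one cleaned review.
docs = [
    "airport security line long",
    "airport terminal security slow",
    "pizza crust cheese great",
    "pizza cheese sauce great",
]

pair_counts = Counter()
for doc in docs:
    # Count each unordered pair of distinct words once per document.
    words = sorted(set(doc.split()))
    pair_counts.update(combinations(words, 2))

# Pairs that appear together in more than one document hint at a shared theme.
repeated_pairs = sorted(pair for pair, n in pair_counts.items() if n > 1)
print(repeated_pairs)
```

In this toy data the repeated pairs fall into an airport-related group and a pizza-related group, which is the kind of pattern a topic model tries to recover.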

Topic models are useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection.

In [6]:
import pandas as pd
import numpy as np
import gensim
from textblob import TextBlob
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer as ps
from nltk.stem.wordnet import WordNetLemmatizer
import string
from gensim.parsing.preprocessing import STOPWORDS

#### Reading the JSON file

In [7]:
path = 'Filtered_review_10K.json'
df = pd.read_json(path, orient='records', lines=True)

#### Preparing Documents
Cleaning is an important step before any text mining task. In this step, we lowercase the text and remove stopwords, punctuation and digits to normalize the corpus.
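
As a quick illustration of these steps on a single sentence, the sketch below applies the same order of operations using only the standard library. The `clean` helper, the `STOP` set and the sample sentence are hypothetical; the notebook itself relies on gensim's `STOPWORDS`.

```python
import string

# Hypothetical mini stop-word list; the notebook uses gensim's STOPWORDS instead.
STOP = {"the", "was", "and", "a", "at", "for", "we"}

def clean(text, stop_words=STOP):
    """Lowercase, drop stop words, then strip punctuation and digits."""
    kept = [w for w in text.lower().split() if w not in stop_words]
    joined = " ".join(kept)
    return "".join(ch for ch in joined
                   if ch not in string.punctuation and not ch.isdigit())

sample = "The line was long, and we waited 45 minutes at Terminal 4!"
cleaned = clean(sample)
print(repr(cleaned))
```

Removing digits after joining leaves doubled or trailing spaces behind. Splitting the result with `split(" ")` therefore produces empty-string tokens, whereas `split()` with no argument does not.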

In [8]:
# Characters to strip from every review; built once rather than per row.
exclude = set(string.punctuation)

# Maps business_id -> list of per-review dicts.
rest_review_dict = dict()
for _, row in df.iterrows():
    business_id = row.business_id
    review_text = row['text']

    # Lowercase and drop stopwords, then strip punctuation and digits.
    stop_free = ' '.join([word for word in review_text.lower().split() if word not in STOPWORDS])
    stop_punc = ''.join(ch for ch in stop_free if ch not in exclude)
    text = ''.join([i for i in stop_punc if not i.isdigit()])

    # Note: despite the key name, 'stemmed_text' holds the cleaned text; no stemming is applied.
    rest_review_dict.setdefault(business_id, []).append({
        'review_text': review_text,
        'review_stars': row['stars'],
        'polarity': TextBlob(text).sentiment.polarity,
        'stemmed_text': text,
    })

#### Latent Dirichlet Allocation (LDA) for Topic Modeling

Latent Dirichlet Allocation is one of the most popular topic modeling techniques. LDA assumes that each document is produced from a mixture of topics, and that each topic generates words according to its own probability distribution. Given a dataset of documents, LDA works backwards to infer which topics would have generated those documents in the first place.
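
The generative story that LDA assumes can be sketched in a few lines. This is only an illustration of the forward process, not of how gensim fits a model; the two topics, their word probabilities and the `alpha` value are all hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical topics: each maps words to probabilities that sum to 1.
topics = {
    "travel": {"airport": 0.5, "terminal": 0.3, "security": 0.2},
    "food": {"pizza": 0.5, "cheese": 0.3, "sauce": 0.2},
}

def sample_dirichlet(alpha, k):
    """Draw a k-dimensional symmetric Dirichlet sample by normalising Gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5):
    """Generate one document following LDA's assumed generative story."""
    names = list(topics)
    mixture = sample_dirichlet(alpha, len(names))  # per-document topic mixture
    words = []
    for _ in range(n_words):
        topic = random.choices(names, weights=mixture)[0]  # pick a topic for this word
        vocab = topics[topic]
        word = random.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return mixture, words

mixture, words = generate_document(8)
print(mixture, words)
```

Fitting LDA amounts to running this story in reverse: only the words are observed, and the topic mixtures and per-topic word distributions are inferred.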

LDA can also be viewed as a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix.
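
In that view, the factorization can be written as below. The notation is an assumption introduced here for illustration, not taken from the notebook: $D$ documents, $V$ vocabulary terms, $K$ topics, and $M$ the document-term matrix with each row normalized to sum to 1.

$$
M_{D \times V} \;\approx\; \Theta_{D \times K}\,\Phi_{K \times V},
\qquad \sum_{k=1}^{K} \Theta_{dk} = 1,
\qquad \sum_{v=1}^{V} \Phi_{kv} = 1
$$

Row $d$ of $\Theta$ is the topic mixture of document $d$, and row $k$ of $\Phi$ is the word distribution of topic $k$.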

#### Preparing Document-Term Matrix
All the text documents combined are known as the corpus. To run a mathematical model on a text corpus, it is good practice to convert it into a matrix representation. The LDA model looks for repeating term patterns across the entire document-term (DT) matrix. Python has many libraries for text mining; gensim is one that is scalable, robust and efficient. The notebook code further below uses gensim's `corpora.Dictionary` and `doc2bow` to convert each corpus into a document-term representation.
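
To show what a document-term representation looks like, here is a library-free sketch. It mimics the roles that gensim's `Dictionary` and `doc2bow` play, but it is not gensim's implementation, and the tokenised documents are hypothetical. The integer ids are arbitrary, so gensim may assign different ones.

```python
from collections import Counter

# Hypothetical tokenised documents standing in for cleaned reviews.
texts = [
    ["airport", "security", "line", "airport"],
    ["pizza", "cheese", "pizza"],
]

# Assign each distinct token an integer id (the role a dictionary object plays).
token2id = {}
for doc in texts:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Sparse bag-of-words rows of (token_id, count) pairs, one row per document.
bow_corpus = [
    sorted(Counter(token2id[t] for t in doc).items()) for doc in texts
]

# Dense document-term matrix: one row per document, one column per token.
dt_matrix = [
    [dict(row).get(j, 0) for j in range(len(token2id))] for row in bow_corpus
]
print(bow_corpus)
print(dt_matrix)
```

The sparse form stores only non-zero counts, which matters for real review corpora where each document uses a small fraction of the vocabulary.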

#### Running LDA Model
The next step is to create an LDA model object and train it on the document-term matrix. Training also requires a few parameters as input, such as the number of topics. The gensim module supports both LDA model estimation from a training corpus and inference of topic distributions on new, unseen documents.

In [9]:
from gensim import corpora, models

# Maps business_id -> {word: highest weight of that word across the business's topics}.
business_corpus = dict()
for business_id, review_array in rest_review_dict.items():
    consolidated_text = [review['stemmed_text'] for review in review_array]

    # Flatten all of a business's reviews into one token list, i.e. a single document.
    # split(" ") keeps empty strings where spaces were doubled, hence the '' token in the output.
    texts = [[word for t in consolidated_text for word in t.split(" ")]]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    ## Creating the object for LDA model using gensim library
    lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)

    # top_topics yields (topic, coherence) pairs; each topic is a list of (weight, word).
    topics = dict()
    for topic_terms, _coherence in lda.top_topics(corpus):
        for weight, word in topic_terms[:15]:
            # Keep the largest weight seen for each word.
            if word not in topics or topics[word] < weight:
                topics[word] = weight

    business_corpus[business_id] = topics

#### Result
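We can inspect the topics and their highest-weighted words from the LDA model. Note that `lda` here is the model fitted for the last business in the loop above, and that it contains only 2 topics even though up to 10 are requested.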

In [11]:
print(lda.print_topics(num_topics=10, num_words=3))

[(0, '0.017*"" + 0.016*"sweeping" + 0.016*"guys"'), (1, '0.016*"wanted" + 0.015*"working" + 0.015*"husband"')]


In [22]:
print(business_corpus['hW0Ne_HTHEAgGF1rAdmR-g'])

{'': 0.020021653, 'airport': 0.016872462, 'terminal': 0.009792045, 'security': 0.007875004, 'harbor': 0.006297942, 'sky': 0.0067009623, 'its': 0.006039902, 'time': 0.005759867, 'bus': 0.0050824843, 'long': 0.00472993, 'nice': 0.0045633, 'free': 0.0041222954, 'dont': 0.003928789, 'wifi': 0.0039093425, 'flight': 0.004128684, 'friendly': 0.004289146, 'line': 0.0038695524, 'parking': 0.0037288917}


In [28]:
all_reviews = rest_review_dict['hW0Ne_HTHEAgGF1rAdmR-g']

business_reviews = []
positive_topics = []
negative_topics = []
for review in all_reviews :
    # A polarity of exactly 0 (neutral) is counted as positive here.
    if review['polarity'] < 0 :
        print('negative')
    else :
        print('positive')

positive
positive
positive
positive
positive
{'review_text': "Sky Harbor is one poor gateway to the city of Phoenix and the Valley of the Sun.  It is outdated, super crowded, it feels dirty and it is the home of US Airways, the biggest carrier in and out of this airport.\n\nTerminal 4 is configured rather poorly.  The terminal reminds me of a backwards E, like the second E in EMIN3M, except that there is another prong to the backwards E.  The concourses are not spacious at all, the seating limited, and consequently the crowds spill out into the walkway areas.\n.  \nTo go from one end of Terminal 4 to the other necessitates using really narrow moving walkways which are on the spine of the E.  It is really hard to maneuver if you want to pass the throngs given the limited space.  The distances are long, the bruises will be many, either to you but most likely given by you as you negotiate your way between flights.\n\nThe other thing that must be mentioned is that Sky Harbor is not a busin