# The Yelp Dataset Challenge Project tutorial

The Yelp Dataset is a dataset published by the business review service Yelp for academic research and educational purposes. I really like the Yelp Dataset as a subject for machine learning and natural language processing demos, because it's big (but not so big that you need your own data center to process it), well-connected, and anyone can relate to it - it's largely about food, after all!

The current iteration of the Yelp dataset (as of this demo) consists of the following data:

- 1.2M users
- 192K businesses
- 6.6M user reviews

Focusing on restaurants alone, there are 59K restaurants with approximately 1.9M user reviews written about them.

The data is provided in a handful of files in JSON format. We'll be using the following files for our demo:

    yelp_academic_dataset_business.json - the records for individual businesses
    yelp_academic_dataset_review.json - the records for reviews users wrote about businesses

The files are text files (UTF-8) with one JSON object per line, each one corresponding to an individual data record. Let's take a look at a few examples.
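To make the one-object-per-line layout concrete, here is a minimal sketch that parses two made-up records held in memory. The field values and the `iter_records` helper are illustrative assumptions, not part of the Yelp files.

```python
import io
import json

# Two hypothetical records in the same one-JSON-object-per-line layout
sample_file = io.StringIO(
    '{"business_id": "abc123", "categories": "Pizza, Restaurants"}\n'
    '{"business_id": "def456", "categories": "Golf, Active Life"}\n'
)

def iter_records(lines):
    """Yield one Python dict per non-empty line of JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(iter_records(sample_file))
print(len(records))               # 2
print(records[0]["business_id"])  # abc123
```

Because each line is a complete JSON document, a file in this layout can be processed one record at a time without loading the whole file.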

The business records consist of key, value pairs containing information about the particular business. A few attributes we'll be interested in for this demo include:

    business_id - unique identifier for businesses
    categories - the relevant category values for the business (stored as a comma-separated string in this version of the dataset)

The categories attribute is of special interest. This demo will focus on restaurants, which are indicated by the presence of the Restaurants tag in categories. In addition, categories may contain more detailed information about restaurants, such as the type of food they serve.
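As a rough sketch of the restaurant check, assuming categories is a comma-separated string (or missing), one could split it into a set of tags and test for membership. The `is_restaurant` helper and the sample records are hypothetical, and this is a variation on the idea rather than the exact code used later in this demo.

```python
def is_restaurant(record):
    """Return True if the record's categories include the Restaurants tag."""
    raw = record.get("categories") or ""   # guard against a missing or null value
    tags = {tag.strip() for tag in raw.split(",")}
    return "Restaurants" in tags

print(is_restaurant({"categories": "Pizza, Restaurants, Italian"}))  # True
print(is_restaurant({"categories": "Golf, Active Life"}))            # False
print(is_restaurant({"categories": None}))                           # False
```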

The review records are stored in a similar manner - key, value pairs containing information about reviews.

In [1]:
import os
import codecs

# Path to the directory that holds the data
data_directory = os.path.join('C:\\', 'yelp_dataset')

# Path to the business data file
business_filepath = os.path.join(data_directory, 'business.json')

# Read only the first record (one JSON object per line)
with codecs.open(business_filepath, encoding='utf-8') as f:
    first_business_record = f.readline()

print(first_business_record)

{"business_id":"1SWheh84yJXfytovILXOAQ","name":"Arizona Biltmore Golf Club","address":"2818 E Camino Acequia Drive","city":"Phoenix","state":"AZ","postal_code":"85016","latitude":33.5221425,"longitude":-112.0184807,"stars":3.0,"review_count":5,"is_open":0,"attributes":{"GoodForKids":"False"},"categories":"Golf, Active Life","hours":null}



A few attributes to note on the review records:

    business_id - indicates which business the review is about
    text - the natural language text the user wrote
    
The text attribute will be our focus...


JSON is a handy file format for data interchange, but it's typically not the most usable for any sort of modeling work. Let's do a bit more data preparation to get our data in a more usable format. Our next code block will do the following:

    1. Read in each business record and convert it to a Python dict
    2. Filter out business records that aren't restaurants
    3. Create a frozenset of business IDs for restaurants, which we'll use in the next step
    
  

In [2]:
# Path to the directory that holds the data
data_directory = os.path.join('C:\\', 'yelp_dataset')

# Path to the review data file
review_json_filepath = os.path.join(data_directory, 'review.json')

# Read only the first review record
with codecs.open(review_json_filepath, encoding='utf-8') as f:
    first_review_record = f.readline()

print(first_review_record)

{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}



Next, we will create a new file that contains only the text from reviews about restaurants, with one review per line in the file.

In [3]:
#Filepath in the intermediate directory(pre-existing)
intermediate_directory = os.path.join('C:\\', 'intermediate')

# Attach a filename to our intermediate directory
review_txt_filepath = os.path.join(intermediate_directory, 'review_text_all.txt')

In [4]:
import json

restaurant_ids = set()

r_counter = 0

# open the business file, decoding it as UTF-8
with codecs.open(business_filepath, encoding='utf-8') as f:

    # iterate through each line (JSON record) in the file
    for bus_line_json in f:

        # convert the JSON record to a Python dict
        business_dict = json.loads(bus_line_json)

        # categories is a comma-separated string; split it into separate words
        cat_list = str(business_dict.get('categories', "")).split()

        # check every word in the category string
        for category in cat_list:

            # if this business is tagged as a restaurant, record its id
            if category == 'Restaurants,' or category == 'Restaurants':

                # add the restaurant business id to our restaurant_ids set
                restaurant_ids.add(business_dict.get('business_id', ""))

                # increase the counter variable
                r_counter += 1
    
# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(restaurant_ids)

# print the number of restaurant matches counted
print(r_counter, 'restaurants in the dataset')

59382 restaurants in the dataset


In [5]:
%%time

r_counter = 0

restaurant_reviews_id = set()

# create & open a new file in write mode
with codecs.open(review_txt_filepath, 'w', encoding='utf-8') as review_txt_file:

    # open the existing review JSON file
    with codecs.open(review_json_filepath, encoding='utf-8') as review_json_file:

        # loop through all reviews in the existing file
        for review_json in review_json_file:

            # limit the number of reviews to write
            if r_counter == 2000:
                break

            # convert the JSON record to a Python dict
            review_dict = json.loads(review_json)

            # if this review is not about a restaurant, skip to the next one
            # (frozenset membership is a constant-time lookup)
            if review_dict.get('business_id', "") not in restaurant_ids:
                continue

            # record the business id of the restaurant being reviewed
            restaurant_reviews_id.add(review_dict.get('business_id'))

            # write the review to the .txt file, escaping line breaks and
            # appending 'ETX' as an end-of-review marker
            review_txt_file.write(review_dict.get('text').replace('\n', '\\n') + 'ETX')

            # increase the counter variable
            r_counter += 1

print(r_counter, 'restaurant reviews written to file\n')

# print the last restaurant review text written to the text file
print('Last review text written to file:\n\n', review_dict.get('text', ""), '\n')
print("-----------------------------------------------------\n")
print(f'Text from {r_counter} restaurant reviews written to the new_txt_file.', '\n')

2000 restaurant reviews written to file

Last review text written to file:

 I ordered delivery from here today. Overall it was a pleasant experience.  I called to ask about their restaurant.com deal.  You can only use that for carry out but he was nice enough to give me a coupon code, $3.99 (coupon code 400) off a large one topping pizza for delivery.  I ordered half sausage half spinach.  The toppings were fresh, dough was nice and soft, and delivered quickly!  It was delicious.  Nice change from the other pizza delivery options around here! 

They have a restaurant.com deal going on right for $10 for $4 or $25 for $10 only for carry out! 

-----------------------------------------------------

Text from 2000 restaurant reviews written to the new_txt_file. 

Wall time: 30.8 s


spaCy is an industrial-strength natural language processing (NLP) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:

       Tokenization
       Text normalization, such as lowercasing and lemmatization
       Part-of-speech tagging
       Syntactic dependency parsing
       Sentence boundary detection
       Named entity recognition and annotation

In the "batteries included" Python tradition, spaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English text:

        Large English vocabulary, including stopword lists
        Token "probabilities"
        Word vectors

spaCy is written in optimized Cython, which means it's fast. According to a few independent sources, it's the fastest syntactic parser available in any language. Key parts of the spaCy parsing pipeline are written in pure C, enabling efficient multithreading (i.e., spaCy can release the GIL, the Global Interpreter Lock).

In [6]:
import spacy
import pandas as pd
import itertools as it

# Loading the model takes a few seconds
nlp = spacy.load('en_core_web_sm')



Now that we've completed our first step by loading spaCy, let's grab some reviews to use as sample text.

In [7]:
r_counter = 0

with codecs.open(review_txt_filepath, encoding='utf_8') as f:

    # Read the whole text file and split it on the 'ETX' end-of-review marker,
    # storing each review as an element in a list of reviews
    rest_reviews = str(f.read()).split('ETX')

    # Print the first two reviews, restoring the original line breaks
    for text in rest_reviews:
        if r_counter == 2:
            break
        print(text.replace('\\n', '\n'), '\n')
        print("-----------------------------------------------------\n")
        r_counter += 1

Went in for a lunch. Steak sandwich was delicious, and the Caesar salad had an absolutely delicious dressing, with a perfect amount of dressing, and distributed perfectly across each leaf. I know I'm going on about the salad ... But it was perfect.

Drink prices were pretty good.

The Server, Dawn, was friendly and accommodating. Very happy with her.

In summation, a great pub experience. Would go again! 

-----------------------------------------------------

I'll be the first to admit that I was not excited about going to La Tavolta. Being a food snob, when a group of friends suggested we go for dinner I looked online at the menu and to me there was nothing special and it seemed overpriced.  Im also not big on ordering pasta when I go out. Alas, I was outnumbered. Thank goodness! I ordered the sea bass special. It was to die for. Cooked perfectly, seasoned perfectly, perfect portion. I can not say enough good things about this dish. When the server asked how it was he seemed very pro

Great, we were able to grab our text from our file!

Also, we made a list of reviews, so later we can iterate through the reviews pretty easily.

In [8]:
# Pass a restaurant review to spaCy's nlp function, which returns a parsed review
parsed_rev = nlp(rest_reviews[999].replace('\\n','\n'))

print(parsed_rev, '\n')

We visited last week.  Be Warned! My husband's sister ordered a cheeseburger.  The waiter asked how she wanted it cooked and explained that "Medium" would be pink through-out.  She ordered the burger medium.  It took a very long time for our food to arrive and when it did, her burger was cooked very well done.  The waiter offered to have another burger made.  We had waited quite a while, and everyone was nearly done with their meals, all of us sharing with my sister-in-law whom still had no burger.  The waiter came by and my sister-in-law said to just cancel the order because we had already waited too long for it.  He said that he would take care of it.  A few minutes later, someone who appeared to be a manager came to the table delivering the burger we had already asked to be removed from the order.  My sister in law, frustrated at this point stated "No, I don't want it now".  The manager walked away.  A few minutes later I noticed the manager with a waiter standing in the middle of t

Even though the text doesn't look like it changed at all, we'll check behind the scenes to see what happened.

In [9]:
# Re-parse the review here to avoid a "variable not defined" error
# when this cell is run on its own in the Jupyter Notebook
parsed_rev = nlp(rest_reviews[999].replace('\\n','\n'))

# Print each sentence spaCy detected, numbered from 1
for i,sentence in enumerate(parsed_rev.sents):
    print(f'Sentence {i+1}:')
    print(sentence)
    print('')

Sentence 1:
We visited last week.  

Sentence 2:
Be Warned!

Sentence 3:
My husband's sister ordered a cheeseburger.  

Sentence 4:
The waiter asked how she wanted it cooked and explained that "Medium" would be pink through-out.  

Sentence 5:
She ordered the burger medium.  

Sentence 6:
It took a very long time for our food to arrive and

Sentence 7:
when it did, her burger was cooked very well done.  

Sentence 8:
The waiter offered to have another burger made.  

Sentence 9:
We had waited quite a while, and everyone was nearly done with their meals, all of us sharing with my sister-in-law whom still had no burger.  

Sentence 10:
The waiter came by and my sister-in-law said to just cancel the order because we had already waited too long for it.  

Sentence 11:
He said that he would take care of it.  

Sentence 12:
A few minutes later, someone who appeared to be a manager came to the table delivering the burger we had already asked to be removed from the order.  

Sentence 13:
My si

When you call nlp on a text, spaCy first tokenizes it to produce a Doc, then calls each pipeline component on the Doc, in order. Since the model data is loaded, the components can access it to assign annotations to the Doc object, and subsequently to the Token and Span objects, which are only views of the Doc and don't own any data themselves.
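The idea of a container that owns the data, with tokens acting as lightweight views, can be sketched in plain Python. Everything here (`ToyDoc`, `ToyToken`, `toy_nlp`, and the title-case "component") is a hypothetical stand-in to illustrate the pattern, not spaCy's actual implementation.

```python
class ToyDoc:
    """Owns all the data: the words plus one annotation list per attribute."""
    def __init__(self, words):
        self.words = words
        self.is_title = [None] * len(words)

    def __getitem__(self, i):
        return ToyToken(self, i)

    def __len__(self):
        return len(self.words)


class ToyToken:
    """A view: stores only a reference to the doc and a position."""
    def __init__(self, doc, i):
        self.doc = doc
        self.i = i

    @property
    def text(self):
        return self.doc.words[self.i]

    @property
    def is_title(self):
        return self.doc.is_title[self.i]


def toy_tokenizer(text):
    return ToyDoc(text.split())

def title_case_component(doc):
    # A stand-in "component": annotates the doc in place and returns it
    doc.is_title = [w.istitle() for w in doc.words]
    return doc

def toy_nlp(text, pipeline=(title_case_component,)):
    doc = toy_tokenizer(text)      # step 1: tokenize
    for component in pipeline:     # step 2: run each component in order
        doc = component(doc)
    return doc

doc = toy_nlp("We visited last week")
token = doc[0]
print(token.text, token.is_title)  # We True
```

The token never copies anything; it looks up its text and annotations in the doc each time they are requested.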

Let's look at what other components in the pipeline produce: first named entity recognition, then part-of-speech tagging.

In [10]:
for i, entity in enumerate(parsed_rev.ents):
    print(f'Entity {i+1}:', entity, '-', entity.label_)
    print('')

Entity 1: last week - DATE

Entity 2: A few minutes later - TIME

Entity 3: A few minutes later - TIME



In [11]:
# Iterate through the tokens of the parsed review; .orth_ returns each token's text
# as a string, and we collect the strings in the token_txt list
token_txt = [token.orth_ for token in parsed_rev]

# Iterate through the tokens again; .pos_ returns each token's part-of-speech tag
# as a string, and we collect the tags in the token_pos list
token_pos = [token.pos_ for token in parsed_rev]

pd.DataFrame(zip(token_txt, token_pos), columns = ['Token Text', 'Part Of Speech'])

| | Token Text | Part Of Speech |
|---|---|---|
| 0 | We | PRON |
| 1 | visited | VERB |
| 2 | last | ADJ |
| 3 | week | NOUN |
| 4 | . | PUNCT |
| 5 | | SPACE |
| 6 | Be | VERB |
| 7 | Warned | VERB |
| 8 | ! | PUNCT |
| 9 | My | DET |


The lemma is a more normalized or root form of a word. For example, take the word "thinking": the lemma removes the suffix and returns "think", the root form. Likewise, the lemma of "were" is "be".

Below, we'll use the same code structure as in the cell above to display each token's lemma and shape.

In [12]:
# Iterate through the parsed review and call .lemma_ (without the "_" it returns an int instead),
# which returns the lemmatized string, to create the token_lemma list
token_lemma = [token.lemma_ for token in parsed_rev]

# Iterate through the parsed review and call .shape_, which returns the token's
# orthographic shape as a string, to create the token_shape list
token_shape = [token.shape_ for token in parsed_rev]

# zip() takes the iterables passed to it and returns an iterator of tuples that
# pairs up their elements. pd.DataFrame turns those tuples into a table
# for a clean look
pd.DataFrame(zip(token_txt,token_lemma, token_shape), columns = ['Token Text','Token Lemma', 'Token Shape'])

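| | Token Text | Token Lemma | Token Shape |
|---|---|---|---|
| 0 | We | -PRON- | Xx |
| 1 | visited | visit | xxxx |
| 2 | last | last | xxxx |
| 3 | week | week | xxxx |
| 4 | . | . | . |
| 5 | | | |
| 6 | Be | Be | Xx |
| 7 | Warned | warn | Xxxxx |
| 8 | ! | ! | ! |
| 9 | My | -PRON- | Xx |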
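The shape column can look cryptic at first. The sketch below is a simplified imitation of the idea, not spaCy's own code: it assumes uppercase letters map to X, lowercase letters to x, digits to d, other characters stay as they are, and runs of the same symbol are capped at four. The `word_shape` function is a hypothetical helper.

```python
def word_shape(text, max_run=4):
    """Map characters to X/x/d and cap runs of the same symbol at max_run."""
    shape = []
    last = None
    run = 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch
        # count how many times in a row we have seen this symbol
        run = run + 1 if sym == last else 1
        last = sym
        if run <= max_run:
            shape.append(sym)
    return "".join(shape)

for word in ["We", "visited", "Warned", "!"]:
    print(word, word_shape(word))
```

Under these assumptions the four sample words come out as Xx, xxxx, Xxxxx and !, the same shapes shown in the table for those tokens.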


In [13]:
token_etype = [token.ent_type_ for token in parsed_rev]
token_eiob = [token.ent_iob_ for token in parsed_rev]

pd.DataFrame(zip(token_txt, token_etype, token_eiob),
             columns=['Token Text', 'Entity Type', 'Inside-Outside-Begin'])

| | Token Text | Entity Type | Inside-Outside-Begin |
|---|---|---|---|
| 0 | We | | O |
| 1 | visited | | O |
| 2 | last | DATE | B |
| 3 | week | DATE | I |
| 4 | . | | O |
| 5 | | | O |
| 6 | Be | | O |
| 7 | Warned | | O |
| 8 | ! | | O |
| 9 | My | | O |
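To see how the per-token B/I/O codes relate to whole entities, here is a small sketch that groups tokens back into entity spans. The `spans_from_iob` helper is hypothetical; it assumes B marks the first token of an entity, I a continuation, and O a token outside any entity.

```python
def spans_from_iob(tokens, ent_types, iob_tags):
    """Group tokens into (text, label) entity spans using B/I/O tags."""
    spans = []
    current = None  # [list of tokens, label] for the entity being built
    for token, label, tag in zip(tokens, ent_types, iob_tags):
        if tag == "B":                    # a new entity begins
            if current:
                spans.append((" ".join(current[0]), current[1]))
            current = [[token], label]
        elif tag == "I" and current:      # the open entity continues
            current[0].append(token)
        else:                             # outside any entity
            if current:
                spans.append((" ".join(current[0]), current[1]))
            current = None
    if current:
        spans.append((" ".join(current[0]), current[1]))
    return spans

tokens    = ["We", "visited", "last", "week", "."]
ent_types = ["",   "",        "DATE", "DATE", ""]
iob_tags  = ["O",  "O",       "B",    "I",    "O"]
print(spans_from_iob(tokens, ent_types, iob_tags))  # [('last week', 'DATE')]
```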


There are other attributes, such as the relative frequency of tokens, and whether or not a token matches any of these categories:

        stopword
        punctuation
        whitespace
        represents a number

In [14]:
token_attbs = [(token.orth_, 
                token.is_stop, 
                token.is_punct,
                token.is_space,
                token.like_num,
                token.is_oov)
               for token in parsed_rev]


data_frame = pd.DataFrame(token_attbs,
                          columns=['Text',
                                   'Stop?',
                                   'Punctuation',
                                   'Whitespace?',
                                   'Number?',
                                   'Out Of Vocab.?'])

data_frame.loc[:, 'Stop?':'Out Of Vocab.?'] = (data_frame.loc[:, 'Stop?':'Out Of Vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
data_frame

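| | Text | Stop? | Punctuation | Whitespace? | Number? | Out Of Vocab.? |
|---|---|---|---|---|---|---|
| 0 | We | Yes | | | | Yes |
| 1 | visited | | | | | Yes |
| 2 | last | Yes | | | | Yes |
| 3 | week | | | | | Yes |
| 4 | . | | Yes | | | Yes |
| 5 | | | | Yes | | Yes |
| 6 | Be | Yes | | | | Yes |
| 7 | Warned | | | | | Yes |
| 8 | ! | | Yes | | | Yes |
| 9 | My | Yes | | | | Yes |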


# Phrase Modeling

Phrase modeling is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. For this project we'll build phrase models by looping over the words in our reviews and looking for words that co-occur (i.e., appear one after another) much more frequently than you would expect by random chance. The formula our phrase models will use to determine whether two tokens A and B constitute a phrase is:

	(Term_Freq(A, B) - min_range) / (Term_Freq(A) * Term_Freq(B)) * count(corpus_vocab) > Threshold
    

Term_Freq(A) is the number of times token A appears in the corpus

Term_Freq(B) is the number of times token B appears in the corpus

Term_Freq(A, B) is the number of times the tokens A B appear in the corpus in order

count(corpus_vocab) is the total size of the corpus vocabulary

min_range is a user-defined parameter to ensure that accepted phrases occur a minimum number of times

Threshold is a user-defined parameter that controls how strong the relationship between two tokens must be before the model accepts them as a phrase
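As a worked example, the formula above can be written as a small function. The counts, the vocabulary size, and the default values for min_range and threshold below are made-up numbers chosen for illustration, and `phrase_score` and `is_phrase` are hypothetical helper names.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_range):
    """Score a candidate phrase "A B" with the formula above."""
    return (count_ab - min_range) / (count_a * count_b) * vocab_size

def is_phrase(count_a, count_b, count_ab, vocab_size, min_range=5, threshold=10.0):
    """Accept the pair as a phrase when its score exceeds the threshold."""
    return phrase_score(count_a, count_b, count_ab, vocab_size, min_range) > threshold

# Hypothetical counts in a corpus with a 10,000-token vocabulary:
# "happy" appears 200 times, "hour" 150 times, "happy hour" 120 times
print(is_phrase(200, 150, 120, 10_000))    # True  (score is about 38.3)

# "the" appears 5,000 times, "food" 800 times, "the food" 300 times
print(is_phrase(5000, 800, 300, 10_000))   # False (score is about 0.74)
```

The second pair occurs together more often in absolute terms, but both words are so common on their own that the co-occurrence is unremarkable, so the score stays low.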

After our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token.

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so new york would become new_york). But you would also expect multi-word expressions that represent common concepts, but aren't specifically named entities (such as happy hour) to also become phrases in the model.

We turn to the indispensable gensim library to help us with phrase modeling, and its Phrases class in particular.

In [15]:
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import LineSentence
from gensim.test.utils import datapath as dp





As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. We are going to prep our data in the following order:

     1. Segment text of complete reviews into sentences & normalize text
     2. Apply first-order phrase model to transform sentences
     3. Apply second-order phrase model to transform sentences
     4. Apply text normalization and second-order phrase model to text of complete reviews

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

First, let's define a few helper functions that we'll use for text normalization. In particular, the lemma_sent_corpus generator function will use spaCy to:

    - Iterate over the reviews in the review_text_all.txt file we created before
    - Segment the reviews into individual sentences
    - Remove punctuation and excess whitespace
    - Lemmatize the text

By multithreading the text processing step, we can iterate efficiently through a large number of reviews. To do that, the work inside the loop has to release the GIL (Global Interpreter Lock). spaCy's pipe method processes the texts in batches and lets the threads run on multiple cores, which speeds up processing.

In [16]:
# Check whether the token is punctuation or whitespace
def punct_space(token):

    return token.is_punct or token.is_space

# Generator function to read in reviews from the file
# and restore the original line breaks in the text
def rev_line(file_name):
    with codecs.open(file_name, encoding='utf-8') as f:
        for review in f:
            yield review.replace('\\n', '\n')

# Generator function that uses spaCy to parse reviews, lemmatize the text, and yield sentences
def lemma_sent_corpus(file_name):

    nlp.Defaults.stop_words = {'-pron-','think', 'bring', 'etx', 'tell', 'right', 'know', 'selection', 'ok'}

    # Stream the review text through nlp.pipe, setting the batch size and the number
    # of threads to use (Python nltk module does not support unlocking of GIL)
    for parsed_rev in nlp.pipe(rev_line(file_name), batch_size=20000, n_threads=5):

        # For each sentence, lemmatize every token that is not punctuation or
        # whitespace and yield the result as the string form of the list
        for sentence in parsed_rev.sents:

            yield str([token.lemma_ for token in sentence if not punct_space(token)])


In [17]:
#create a new file to store our lemmatized sentences
unigram_sentences_filepath = os.path.join(intermediate_directory,'unigram_sentences_all.txt')

Let's use the lemma_sent_corpus generator to loop over the original review text, segmenting the reviews into individual sentences and normalizing the text. We'll write this data back out to a new file (unigram_sentences_all), with one normalized sentence per line. We'll use this data for training our phrase models.

In [18]:
%%time

#create & open new file in write mode
with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
    # Write the lemmatized sentence to the file
    for lemmatized_sentence in lemma_sent_corpus(review_txt_filepath):
        f.write(lemmatized_sentence.lower())

Wall time: 36.3 s



If your data is organized like our unigram_sentences_all file now is — a large text file with one document/sentence per line — gensim's LineSentence class provides a convenient iterator for working with other gensim components. It streams the documents/sentences from disk, so that you never have to hold the entire corpus in RAM at once. This allows you to scale your modeling pipeline up to potentially very large corpora.
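The streaming idea can be sketched with the standard library alone. `ToyLineSentence` below is a hypothetical stand-in that mimics the behavior described here (one whitespace-tokenized sentence per line, read lazily from disk); it is not gensim's implementation, and the two-line corpus is made up.

```python
import os
import tempfile

class ToyLineSentence:
    """Re-iterable stream: yields each line of a file as a list of tokens."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so only one line is in memory at a time
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

# Write a tiny hypothetical corpus to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("go in for a lunch\ndrink price be pretty good\n")
    path = tmp.name

sentences = ToyLineSentence(path)
first_pass = list(sentences)
second_pass = list(sentences)   # iterating again works, unlike a plain generator
os.remove(path)

print(first_pass[0])            # ['go', 'in', 'for', 'a', 'lunch']
```

Being able to iterate more than once matters because training code typically makes several passes over the corpus.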

In [19]:
unigram_sentences = LineSentence(dp(unigram_sentences_filepath))

Let's take a look at a few sample sentences in our new, transformed file.

In [20]:
for unigram_sentence in unigram_sentences:
    print(u' '.join(unigram_sentence).split('etx')[0])
    print(u'')



['go', 'in', 'for', 'a', 'lunch']['steak', 'sandwich', 'be', 'delicious', 'and', 'the', 'caesar', 'salad', 'have', 'an', 'absolutely', 'delicious', 'dressing', 'with', 'a', 'perfect', 'amount', 'of', 'dressing', 'and', 'distribute', 'perfectly', 'across', 'each', 'leaf']['-pron-', 'know', '-pron-', 'be', 'go', 'on', 'about', 'the', 'salad']['but', '-pron-', 'be', 'perfect']['drink', 'price', 'be', 'pretty', 'good']['the', 'server', 'dawn', 'be', 'friendly', 'and', 'accommodate']['very', 'happy', 'with', '-pron-']['in', 'summation', 'a', 'great', 'pub', 'experience']['would', 'go']["again!

'plan', 'the', 'next', 'time', '-pron-', 'come', 'here']['and', '-pron-', 'server', 'be', 'a', 'sweet', 'sweet', 'girl']['soft', 'speak', 'but', 'just', 'the', 'perfect', 'server']['-pron-', 'would', 'take', 'a', 'picture', 'of', '-pron-', 'food']['but', '-pron-', 'dove', 'in', 'and', 'clean', '-pron-', 'plate']['lol']['oh', 'well']['there', 'be', '254', 'pic', 'for', '-pron-', 'to', 'choose', 'from'

'indoor']['so', 'get', 'on', 'with', '-pron-', 'service', '-pron-', 'gawd', 'amazing']['last', 'time', '-pron-', 'and', 't', 'friend', 'be', 'serve', 'by', 'jeff', '-pron-', 'love', 'jeff']['-pron-', 'be', 'a', 'super', 'awesome', 'waiter', 'no', 'longer', 'there', 'if', 'u', 'know', 'where', '-pron-', 'be', 'go']['please', 'message', '-pron-']['because', '-pron-', 'will', 'go', 'there']['this', 'time', '-pron-', 'have', 'eric', 'who', 'be', 'fantastic', 'and', 'soooooo', 'charming']['and', 'handsome']['what', 'an', 'excellent', 'waiter', '-pron-', 'do', 'awesome', 'consider', '-pron-', 'be', 'deal', 'with', 'a', 'staff', 'restaurant', 'and', 'stock', 'that', 'have', 'have', 'an', 'unexpected', 'surplus', 'of', 'clientele', 'the', 'previous', 'night', 'that', 'have', 'all', 'but', 'obliterate', 'the', 'quantity', 'available', 'of', 'wine', 'and', 'food', 'make', '-pron-', 'job', 'as', 'a', 'waiter', 'most', 'precarious']['-pron-', 'do', 'not', 'charge', '-pron-', 'for', 'most', 'of', '

Now let's train a first-order (bigram) phrase model on these sentences.

In [21]:
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')

In [22]:
%%time

bigram_model = Phrases(unigram_sentences)

bigram_model.save(bigram_model_filepath)

# load the finished model from disk
bigram_model = Phrases.load(bigram_model_filepath)

bigram_sentences_filepath = os.path.join(intermediate_directory, 'bigram_sentences_all.txt')

# Now that we have a trained phrase model for word pairs, let's apply
# it to the review sentences data and explore the results.

with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:

    for unigram_sentence in unigram_sentences:

        bigram_sentence = u' '.join(bigram_model[unigram_sentence])

        f.write(bigram_sentence + '\n')

# Uncomment to load the sentences from the bigram sentence file and print them
# bigram_sentences = LineSentence(dp(bigram_sentences_filepath))
#
# for bigram_sentence in bigram_sentences:
#     print(u' '.join(bigram_sentence))
#     print(u'')



Wall time: 1.26 s


OK, we have applied one order of phrase modeling to our review text, and at this point we will apply one more. Keep in mind that you can apply as many levels as you like. If you know that three-word phrases like "new york city" exist in your documents, you need at least this second pass: the first pass can join "new york" into new_york, and the second pass can then join new_york and city into new_york_city. Two-word phrases like "happy hour" are still grouped into a single token.
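The effect of stacking two passes can be illustrated without any trained model. In this sketch the sets of accepted pairs are hypothetical stand-ins for what the two phrase models would have learned, and `merge_phrases` is a simplified greedy merge, not gensim's algorithm.

```python
def merge_phrases(tokens, phrases):
    """Greedy left-to-right pass: join adjacent tokens whose pair is in phrases."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical phrase sets standing in for the two trained models
first_order = {("new", "york"), ("happy", "hour")}
second_order = {("new_york", "city")}

sentence = "happy hour in new york city".split()
pass_one = merge_phrases(sentence, first_order)
pass_two = merge_phrases(pass_one, second_order)
print(pass_one)   # ['happy_hour', 'in', 'new_york', 'city']
print(pass_two)   # ['happy_hour', 'in', 'new_york_city']
```

The second pass only ever sees pairs, but because one member of the pair is already a merged token, the result can span three original words.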

In [23]:
trigram_model_filepath = os.path.join(intermediate_directory,
                                      'trigram_model_all')

In [24]:
%%time

bigram_sentences_filepath = os.path.join(intermediate_directory, 'bigram_sentences_all.txt')

bigram_sentences = LineSentence(dp(bigram_sentences_filepath))

trig_mod = Phrases(bigram_sentences)

trig_mod.save(trigram_model_filepath)

Wall time: 420 ms


In [25]:
trigram_sentences_filepath = os.path.join(intermediate_directory,
                                          'trigram_sentences_all.txt')

In [26]:
%%time

with codecs.open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:

    bigram_sentences_filepath = os.path.join(intermediate_directory, 'bigram_sentences_all.txt')

    bigram_sentences = LineSentence(dp(bigram_sentences_filepath))
    
    trig_mod = Phrases.load(trigram_model_filepath)
    
    for bigram_sentence in bigram_sentences:

        trigram_sentence = u' '.join(trig_mod[bigram_sentence])

        f.write(trigram_sentence + '\n')

Wall time: 792 ms


In [27]:
trigram_reviews_filepath = os.path.join(intermediate_directory,
                                        'trigram_transformed_reviews_all.txt')

In [28]:
# Load the trained bigram and trigram models once, before the loop
bigram_model = Phrases.load(bigram_model_filepath)
trig_mod = Phrases.load(trigram_model_filepath)

with codecs.open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:

    for parsed_review in nlp.pipe(rev_line(review_txt_filepath),
                                  batch_size=5000, n_threads=5):

        # lemmatize the text, removing punctuation and whitespace
        unigram_review = [token.lemma_ for token in parsed_review
                          if not punct_space(token)]

        # apply the first-order phrase model
        bigram_review = bigram_model[unigram_review]

        # apply the second-order phrase model
        trigram_review = trig_mod[bigram_review]

        # remove any remaining stopwords
        trigram_review = [term for term in trigram_review
                          if term not in spacy.lang.en.stop_words.STOP_WORDS]

        # join the list of tokens into a single string
        trigram_review = u' '.join(trigram_review)

        # write the transformed review as a line in the new file
        f.write(trigram_review + '\n')



Let's preview the results. We'll grab one review from the file with the original, untransformed text, grab the same review from the file with the normalized and transformed text, and compare the two.

In [29]:
trigram_reviews_filepath = os.path.join(intermediate_directory,
                                        'trigram_transformed_reviews_all.txt')

print( u'Original:' + u'\n')


print( rest_reviews[0])

print( u'----' + u'\n')
print(u'Transformed:' + u'\n')

with codecs.open(trigram_reviews_filepath, encoding='utf_8') as f:
    
    iter_mod_rev = str(f.read()).split('etx')
    
    print(iter_mod_rev[0])

Original:

Went in for a lunch. Steak sandwich was delicious, and the Caesar salad had an absolutely delicious dressing, with a perfect amount of dressing, and distributed perfectly across each leaf. I know I'm going on about the salad ... But it was perfect.\n\nDrink prices were pretty good.\n\nThe Server, Dawn, was friendly and accommodating. Very happy with her.\n\nIn summation, a great pub experience. Would go again!
----

Transformed:

lunch Steak sandwich delicious Caesar salad absolutely delicious dressing perfect dressing distribute perfectly leaf -PRON- know -PRON- salad -PRON- perfect Drink price pretty good Server Dawn friendly accommodate happy -PRON- summation great pub experience Would again!


# Topic Modeling with Latent Dirichlet Allocation (LDA)

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is a popular approach to topic modeling: it describes each document in terms of the topics it contains. For this demo, we'll be using LDA.

In many conventional NLP applications, documents are represented as a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a vector of token counts. There are two layers in this model — documents and tokens — and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. This approach has a number of disadvantages:

   - Document vectors tend to be large (one dimension for each token translates to lots of dimensions).
   - They also tend to be very sparse. Any given document only contains a small fraction of all tokens in the vocabulary, so most values in the document's token vector are 0.
   - The dimensions are fully independent of each other: there's no sense of connection between related tokens, such as knife and fork.
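
To make the first two points concrete, here's a minimal plain-Python sketch that builds dense token-count vectors for a toy corpus and measures how many entries are zero. The toy reviews and the `count_vector` helper are made up for illustration; they aren't part of the demo's pipeline.

```python
from collections import Counter

# Toy corpus, invented for illustration; each document is a list of tokens.
docs = [
    ["steak", "sandwich", "delicious"],
    ["caesar", "salad", "delicious", "dressing"],
    ["friendly", "server", "great", "pub"],
]

# The vocabulary fixes one vector dimension per distinct token.
vocabulary = sorted({token for doc in docs for token in doc})


def count_vector(doc):
    """Return a dense vector of token counts, one entry per vocabulary token."""
    counts = Counter(doc)
    return [counts[token] for token in vocabulary]


vectors = [count_vector(doc) for doc in docs]

# Fraction of entries that are zero across all document vectors.
total_entries = len(vectors) * len(vocabulary)
zero_entries = sum(value == 0 for vector in vectors for value in vector)
sparsity = zero_entries / total_entries

print(len(vocabulary), "dimensions per document")
print("fraction of zero entries: {:.2f}".format(sparsity))
```

Even with three tiny documents, most entries are zero. With a real vocabulary of many thousands of tokens, the effect is far more pronounced.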

LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of topics, and the topics are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow Dirichlet probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of the tokens.
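
To get a feel for why a Dirichlet prior encourages "a handful of topics", here's a small sketch that draws mixtures from a symmetric Dirichlet distribution using only the standard library. The helper name `sample_dirichlet` and the concentration values are assumptions chosen for illustration; this is not how gensim trains its model.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over `size` categories.

    Uses the standard construction: draw independent Gamma(alpha, 1)
    values and normalize them so they sum to 1.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# A small concentration parameter tends to put most of the mass on a
# few topics; a large one tends to spread the mass fairly evenly.
sparse_mixture = sample_dirichlet(alpha=0.1, size=10, rng=rng)
even_mixture = sample_dirichlet(alpha=10.0, size=10, rng=rng)

print("alpha=0.1 :", [round(p, 2) for p in sparse_mixture])
print("alpha=10.0:", [round(p, 2) for p in even_mixture])
```

Each printed list is a valid (document, topic) mixture: the values are non-negative and sum to 1.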

LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.

We'll again turn to gensim to assist with data preparation and modeling. In particular, gensim offers a high-performance parallelized implementation of LDA with its LdaMulticore class.

In [30]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import warnings
import pickle


The first step to creating an LDA model is to learn the full vocabulary of the corpus to be modeled. We'll use gensim's Dictionary class for this.

In [31]:
trigram_dictionary_filepath = os.path.join(intermediate_directory,'trigram_dict_all.dict')

In [32]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to learn the dictionary yourself.
if 1 == 1:

    trigram_reviews = LineSentence(dp(trigram_reviews_filepath))

    # learn the dictionary by iterating over all of the reviews
    trigram_dictionary = Dictionary(trigram_reviews)
    
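    # optionally filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify);
    # both calls are left disabled here, so the full vocabulary is kept.
    # note: no_below is an absolute document count, no_above is a fraction
    #trigram_dictionary.filter_extremes(no_below=0.0002, no_above=0.007)
    
    #trigram_dictionary.compactify()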

    trigram_dictionary.save(trigram_dictionary_filepath)
    



Wall time: 78.8 ms
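
The filter_extremes call is commented out in the cell above, but it's worth knowing what it would do. In gensim, `no_below` is an absolute number of documents and `no_above` is a fraction of the corpus. Here's a plain-Python sketch of that idea on a toy corpus; the function name `filter_by_doc_freq` is hypothetical and this is not gensim's implementation.

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens that appear in at least `no_below` documents (an
    absolute count) and in at most `no_above` of all documents (a fraction)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {token for token, df in doc_freq.items()
            if no_below <= df <= max_docs}


# Toy corpus, invented for illustration.
docs = [
    ["good", "food", "taco"],
    ["good", "food", "sushi"],
    ["good", "service", "sushi"],
    ["good", "food", "pizza"],
]

kept = filter_by_doc_freq(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "good" is dropped for being in every document, and "taco", "service", and "pizza" are dropped for appearing in only one.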


Like many NLP techniques, LDA uses a simplifying assumption known as the bag-of-words model. In the bag-of-words model, a document is represented by the counts of distinct terms that occur within it. Additional information, such as word order, is discarded.

We'll use the gensim Dictionary we learned to generate a bag-of-words representation for each review. The trigram_bow_generator function implements this. We'll save the resulting bag-of-words reviews as a matrix.

In the following code, "bag-of-words" is abbreviated as bow.
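
Before the real thing, here's a minimal plain-Python sketch of what a bag-of-words conversion does. The `token2id` mapping and the `to_bow` helper are hypothetical stand-ins for the learned dictionary and its doc2bow method; the point is only that two reviews with the same words in a different order get the same representation.

```python
from collections import Counter

# Hypothetical toy vocabulary mapping each token to an integer id.
token2id = {"delicious": 0, "salad": 1, "steak": 2}


def to_bow(tokens):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


a = to_bow(["steak", "delicious", "salad", "delicious"])
b = to_bow(["delicious", "delicious", "salad", "steak"])

print(a)
print(a == b)  # word order is discarded, so the two are identical
```

Only tokens that actually occur are stored, which is why this form stays compact even when the vocabulary is large.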

In [33]:
trigram_bow_filepath = os.path.join(intermediate_directory,'trigram_bow_corpus_all.mm')

In [34]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)

In [35]:
# load the finished dictionary from disk
trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)

In [36]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to build the bag-of-words corpus yourself.
if 1 == 1:

    # generate bag-of-words representations for
    # all reviews and save them as a matrix
    MmCorpus.serialize(dp(trigram_bow_filepath), trigram_bow_generator(trigram_reviews_filepath))
    
    

Wall time: 97.7 ms


With the bag-of-words corpus, we're finally ready to learn our topic model from the reviews. We simply need to pass the bag-of-words matrix and Dictionary from our previous steps to LdaMulticore as inputs, along with the number of topics the model should learn. For this demo, we're asking for 50 topics.

In [37]:
lda_model_filepath = os.path.join(intermediate_directory, 'lda_model_all')

In [38]:
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)


with warnings.catch_warnings():
    warnings.simplefilter('ignore')

    # workers => sets the parallelism, and should be
    # set to your number of physical cores minus one
    lda = LdaMulticore(trigram_bow_corpus, num_topics=50, id2word=trigram_dictionary, workers=4)

lda.save(lda_model_filepath)



Our topic model is now trained and ready to use! Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.

In [39]:
def explore_topic(topic_number, topn=50):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """

    print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')

    # honor the topn argument instead of a hard-coded value
    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print(u'{:20} {:.4f}'.format(term, round(frequency, 4)))

In [40]:
# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)

In [41]:
explore_topic(topic_number=2)

term                 frequency

-PRON-               0.0476
good                 0.0060
place                0.0056
food                 0.0056
come                 0.0030
time                 0.0030
great                0.0026
order                0.0026
service              0.0020
like                 0.0020
restaurant           0.0020
table                0.0018
$                    0.0016
eat                  0.0016
try                  0.0015
wait                 0.0015
fry                  0.0014
price                0.0014
pretty               0.0014
chicken              0.0014
want                 0.0013
drink                0.0013
know                 0.0012
nice                 0.0012
love                 0.0012
menu                 0.0011
thing                0.0011
staff                0.0011
fresh                0.0011
sauce                0.0011
definitely           0.0011
look                 0.0010
taste                0.0010
friend               0.0010
delicious       

Manually reviewing the top terms for each topic is a helpful exercise, but to get a deeper understanding of the topics and how they relate to each other, we need to visualize the data — preferably in an interactive format. Fortunately, we have the fantastic pyLDAvis library to help with that!

pyLDAvis includes a one-line function to take topic models created with gensim and prepare their data for visualization.

In [42]:
topic_names = {0: u'mexican',
               1: u'menu',
               2: u'thai',
               3: u'steak',
               4: u'donuts & appetizers',
               5: u'specials',
               6: u'soup',
               7: u'wings, sports bar',
               8: u'foreign language',
               9: u'las vegas',
               10: u'chicken',
               11: u'aria buffet',
               12: u'noodles',
               13: u'ambience & seating',
               14: u'sushi',
               15: u'arizona',
               16: u'family',
               17: u'price',
               18: u'sweet',
               19: u'waiting',
               20: u'general',
               21: u'tapas',
               22: u'dirty',
               23: u'customer service',
               24: u'restrooms',
               25: u'chinese',
               26: u'gluten free',
               27: u'pizza',
               28: u'seafood',
               29: u'amazing',
               30: u'eat, like, know, want',
               31: u'bars',
               32: u'breakfast',
               33: u'location & time',
               34: u'italian',
               35: u'barbecue',
               36: u'arizona',
               37: u'indian',
               38: u'latin & cajun',
               39: u'burger & fries',
               40: u'vegetarian',
               41: u'lunch buffet',
               42: u'customer service',
               43: u'taco, ice cream',
               44: u'high cuisine',
               45: u'healthy',
               46: u'salad & sandwich',
               47: u'greek',
               48: u'poor experience',
               49: u'wine & dine'}

In [43]:
LDAvis_data_filepath = os.path.join(intermediate_directory, 'ldavis_prepared')

In [44]:

LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus, trigram_dictionary)

# pickle writes bytes, so a plain binary file handle is all we need
with open(LDAvis_data_filepath, 'wb') as f:
    pickle.dump(LDAvis_prepared, f)



In [45]:
# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)


pyLDAvis.display(LDAvis_prepared)

Wait, what am I looking at again?

There are a lot of moving parts in the visualization. Here's a brief summary:

On the left, there is a plot of the "distance" between all of the topics (labeled as the Intertopic Distance Map)
    
   - The plot is rendered in two dimensions according to a multidimensional scaling (MDS) algorithm. Topics that are generally similar should appear close together on the plot, while dissimilar topics should appear far apart.
        
   - The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.
        
   - An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.
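
As far as I know, pyLDAvis by default measures the "distance" between two topics with the Jensen-Shannon divergence between their term distributions, and then projects those distances into two dimensions. Treat that as an assumption about the library's defaults. Here's a self-contained sketch of the divergence itself, using made-up topic distributions over a four-word vocabulary.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats.

    Terms where p_i is 0 contribute nothing to the sum.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded variant of KL."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Made-up topic-term distributions over ["taco", "salsa", "sushi", "roll"].
mexican = [0.50, 0.40, 0.05, 0.05]
tex_mex = [0.45, 0.45, 0.05, 0.05]
japanese = [0.05, 0.05, 0.50, 0.40]

print(jensen_shannon(mexican, tex_mex))   # similar topics: small value
print(jensen_shannon(mexican, japanese))  # dissimilar topics: larger value
```

Topics that put their probability on the same tokens end up with a small divergence and would be drawn close together.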
    
On the right, there is a bar chart showing top terms.
    
   - When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.

   - When a particular topic is selected, the bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter $\lambda$, which can be adjusted with a slider above the bar chart.

      - Setting the $\lambda$ parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.

      - Setting $\lambda$ close to 0.0 will rank the terms solely according to their "distinctiveness" or "exclusivity" within the topic — i.e., terms that occur only in this topic, and do not occur in other topics.

      - Setting $\lambda$ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.

   - Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.
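
The effect of the $\lambda$ slider is easier to see with numbers. The relevance metric described for LDAvis is, as I understand it, $\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$, where $p(w \mid t)$ is a term's probability within the topic and $p(w)$ is its overall probability in the corpus. The sketch below ranks three terms at both extremes; the probabilities are invented for illustration.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return (lam * math.log(p_term_given_topic)
            + (1 - lam) * math.log(p_term_given_topic / p_term))


# term: (p(term | topic), p(term) in the whole corpus); made-up numbers
terms = {
    "good": (0.050, 0.040),
    "sushi": (0.030, 0.003),
    "roll": (0.020, 0.001),
}


def rank_terms(lam):
    """Return the terms ordered from most to least relevant for a given lambda."""
    def score(term):
        p_wt, p_w = terms[term]
        return relevance(p_wt, p_w, lam)
    return sorted(terms, key=score, reverse=True)


print(rank_terms(1.0))  # ranked by probability within the topic
print(rank_terms(0.0))  # ranked by how exclusive the term is to the topic
```

At $\lambda = 1$ the common word "good" comes first; at $\lambda = 0$ it drops to last because it's nearly as frequent everywhere else.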

A more detailed explanation of the pyLDAvis visualization can be found here. Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's LdaMulticore object and pyLDAvis' visualization, you have to dig through the terms manually.



# Analyzing our LDA model

The interactive visualization pyLDAvis produces is helpful for both:

   1. Better understanding and interpreting individual topics, and
   2. Better understanding the relationships between the topics.

For (1), you can manually select each topic to view its most frequent and/or "relevant" terms, using different values of the $\lambda$ parameter. This can help when you're trying to assign a human-interpretable name or "meaning" to each topic.

For (2), exploring the Intertopic Distance Plot can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.

In our plot, there is a stark divide along the x-axis, with two topics far to the left and most of the remaining 48 far to the right. Inspecting the two outlier topics provides a plausible explanation: both topics contain many non-English words, while most of the rest of the topics are in English. So, one of the main attributes that distinguish the reviews in the dataset from one another is their language.

This finding isn't entirely a surprise. In addition to English-speaking cities, the Yelp dataset includes reviews of businesses in Montreal and Karlsruhe, Germany, often written in French and German, respectively. Having multiple languages isn't a problem for our demo, but for a real NLP application, you might need to ensure that the text you're processing is written in English (or is at least tagged for language) before passing it along to some downstream processing. If that were the case, the divide along the x-axis in the topic plot would immediately alert you to a potential data quality issue.

The y-axis separates two large groups of topics — let's call them "super-topics" — one in the upper-right quadrant and the other in the lower-right quadrant. These super-topics correlate reasonably well with the pattern we'd noticed while naming the topics:

   - The super-topic in the lower-right tends to be about food. It groups together the burger & fries, breakfast, sushi, barbecue, and greek topics, among others.
   - The super-topic in the upper-right tends to be about other elements of the restaurant experience. It groups together the ambience & seating, location & time, family, and customer service topics, among others.

So, in addition to the 50 direct topics the model has learned, our analysis suggests a higher-level pattern in the data. Restaurant reviewers in the Yelp dataset talk about two main things in their reviews, in general: (1) the food, and (2) their overall restaurant experience. For this dataset, this is a very intuitive result, and we probably didn't need a sophisticated modeling technique to tell it to us. When working with datasets from other domains, though, such high-level patterns may be much less obvious from the outset, and that's where topic modeling can help.

# Describing text with LDA

Beyond data exploration, one of the key uses for an LDA model is providing a compact, quantitative description of natural language text. Once an LDA model has been trained, it can be used to represent free text as a mixture of the topics the model learned from the original corpus. This mixture can be interpreted as a probability distribution across the topics, so the LDA representation of a paragraph of text might look like 50% Topic A, 20% Topic B, 20% Topic C, and 10% Topic D.
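
A topic mixture like that is usually handled as a list of (topic id, probability) pairs. Here's a small sketch of turning such a list into a readable description; the `describe_mixture` helper, the example mixture, and the three-entry name lookup are all made up for illustration.

```python
def describe_mixture(topic_probs, names, min_prob=0.05):
    """Format (topic_id, probability) pairs, highest probability first,
    dropping any topic whose probability is below min_prob."""
    ranked = sorted(topic_probs, key=lambda pair: pair[1], reverse=True)
    return ["{:.0%} {}".format(prob, names.get(topic_id, "topic {}".format(topic_id)))
            for topic_id, prob in ranked if prob >= min_prob]


# Invented mixture and a partial id-to-name lookup.
mixture = [(3, 0.20), (14, 0.50), (27, 0.20), (31, 0.10)]
names = {3: "steak", 14: "sushi", 27: "pizza"}

print(describe_mixture(mixture, names))
```

Topics without a name in the lookup fall back to their numeric id, which is handy while you're still deciding what each topic means.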

To use an LDA model to generate a vector representation of new text, you'll need to apply any text preprocessing steps you used on the model's training corpus to the new text, too. For our model, the preprocessing steps we used include:

   - Using spaCy to remove punctuation and lemmatize the text
   - Applying our first-order phrase model to join word pairs
   - Applying our second-order phrase model to join longer phrases
   - Removing stopwords
   - Creating a bag-of-words representation

Once you've applied these preprocessing steps to the new text, it's ready to pass directly to the model to create an LDA representation. The lda_description(...) function will perform all these steps for us, including printing the resulting topical description of the input text.

In [46]:
def lda_description(review_text, min_topic_freq=0.005):
    """
    accept the original text of a review and (1) parse it with spaCy,
    (2) apply text pre-processing steps, (3) create a bag-of-words
    representation, (4) create an LDA representation, and
    (5) print a sorted list of the top topics in the LDA representation
    """
    
    # parse the review text with spaCy
    parsed_review = nlp(review_text)
    
    # lemmatize the text and remove punctuation and whitespace
    unigram_review = [token.lemma_ for token in parsed_review
                      if not punct_space(token)]
    
    # apply the first-order and second-order phrase models
    bigram_review = bigram_model[unigram_review]
    trigram_review = trig_mod[bigram_review]
    
    
    
    # remove any remaining stopwords
    trigram_review = [term for term in trigram_review
                      if term not in spacy.lang.en.stop_words.STOP_WORDS]
    
    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(trigram_review)
    
    # create an LDA representation
    review_lda = lda[review_bow]
    
    # sort with the most highly related topics first
    review_lda = sorted(review_lda, key=lambda freq: freq[1], reverse=True)
    
    for topic_number, freq in review_lda:
        
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic numbers and frequencies
        print( '{:25} {}'.format(topic_number,
                                round(freq, 3)))

In [47]:
lda_description(rest_reviews[0])



                       14 0.9700000286102295
