# The Yelp Dataset Challenge Project tutorial

The Yelp Dataset is a dataset published by the business review service Yelp for academic research and educational purposes. I really like the Yelp Dataset as a subject for machine learning and natural language processing demos, because it's big (but not so big that you need your own data center to process it), well-connected , and anyone can relate to it - it's largly about food, after all!!!

The current iteration of the Yelp dataset (As of this demo) consist of the following data:

552K users
77K businesses
2.2M user reviews

When focusing on restaurants alonem there are 22K restaurants with approximately 1M user reviews written about them.

The data is provided in a handful of files in json format. We'll be using the files for our demo:

    yelp_academic_dataset_business.json - the records for individual business
    yelp_academic_dataset_review.json - the records for reviews users wrote about businesses

The files are text files (UTF-8) with one json object per line, each one corresponding to an individual data record. Let's take a look at a few examples

The business records consist of key, value pairs containing information about  the particular business. A few arrributes we'll be interested in for this demo include:
    
    business id - unique identifier for businesses
    categories - an array containing relevent category values of businesses
    
The categories attribute is of special interest. This demo will focus on restaurants, which are indicated by the presence of the Restaurant tag in the categories array. In addition, the categories array may contain more detailed information about restaurants, such as the tyope of food they serve

The review records are stored in a similar manner - key, value parirs containing information about reviews

In [1]:
import os as os
import codecs


#Create a variable that will store the path to my data
data_directory = os.path.join('C:\\', 'yelp_dataset')

#Create variable that will contain a seperate filepath for my business data
business_filepath = os.path.join(data_directory,'business.json')

with codecs.open(business_filepath, encoding='utf-8') as f:
    first_business_record = f.readline()
    
print( first_business_record)

A few attributes to note on the review records:

    busines_id - indicates which business the review is about
    text - the natural language text the user wrote
    
The text attribute will be our focus...


json is handy file format for data interchange, but it's typically not the most usable for any sort of modeling work. Let's do a bit more data preparation to get our data in a more usable format. Our next code block will do the following:

    1. Read in business record and convert it to a Python dict
    2. Filter out business records that aren't restaurants
    3. Create a frozenset of business ID's for restaurants, which we'll use in the next step
    
  

In [2]:
#Create a variable that will store the path to my data
data_directory = os.path.join('C:\\', 'yelp_dataset')


#Create another variable that will contain a seperate filepath for my reiews data
review_json_filepath = os.path.join(data_directory, 'review.json')


with codecs.open(review_json_filepath, encoding='utf-8') as f:
        first_review_record = f.readline()
        
print(first_review_record)

Next, we will create a new file that contains only the text from reviews about restaurants, with one review per line in the file.

In [3]:
#Filepath in the intermediate directory(pre-existing)
intermediate_directory = os.path.join('C:\\', 'intermediate')

# Attach a filename to our intermediate directory
review_txt_filepath = os.path.join(intermediate_directory, 'review_text_all.txt')

In [4]:
import json

restaurant_ids = set()

r_counter = 0

#open the Businesses file and encode our business json file with utf-8  
with codecs.open(business_filepath, encoding='utf-8') as f:
    
    #iterate through each line (json record) in the file
    for bus_line_json in f:
        #if i%10000==0:
            #print(i)
        
        #if i==20000:
            #break    
        
        #convert the json record to a Python dict
        business_dict = json.loads(bus_line_json)
        
        #convert category dictionary to a list & Split the list into seperate words to be easily iterated through
        cat_list = str(business_dict.get(u'categories',"")).split()
        
        #Iterate through every word in the list
        for category in cat_list:
            
            #if this business is not a restaurant, skip to the next one
            if category == u'Restaurants,' or category == u'Restaurants':

                #add the restaurant business id to our restaurant_ids set
                restaurant_ids.add(business_dict.get(u'business_id',""))

                #Increase the counter variable 
                r_counter+=1
    
#turn restaurant_ids into a frozenset, as we don't need to change it to anymore
restaurant_ids = frozenset(restaurant_ids)

#print the number of unique restaurant ids is in the dataset
print(r_counter, u' retaurants in the dataset,')

59382  retaurants in the dataset,


In [5]:
%%time
r_counter = 0

restaurant_reviews_id = set()

# This is a bit time consuming - make the if statement True
# If you want to execute data prep yourself assign "1" to the "if statement"
    
if 1==1:    

    #create & open new file in write mode
    with codecs.open(review_txt_filepath,'w', encoding='utf-8') as review_txt_file:
        
        #open the existing review json file
        with codecs.open(review_json_filepath, encoding='utf-8') as review_json_file:
        
            # loop through all reviews in the existing file and convert to Python dict
            for review_json in review_json_file:
                
                # Limit the amount of reviews to write
                if r_counter == 1000:
                    break  
                 
                #Convert review_json to a Python dict
                review_dict = json.loads(review_json)
    
                # if this review is not about a restuarant, skip to the next one
                for rest_id in restaurant_ids:
                    
                    # if(review_dict.[u'business_id'] == rest_id): "Verbose and inefficent"
                    if(review_dict.get(u'business_id',"") == rest_id):
                        
                        # Add review id's to restaurant_reviews_id set
                        restaurant_reviews_id.add(review_dict.get(u'business_id'))
                        
                        # Write the review to the .txt file and append a '0' character as end marker
                        review_txt_file.write(review_dict.get(u'text').replace('\n', '\\n')+'ETX')
                        
                        # Increase the counter variable
                        r_counter+=1
                        
    print(r_counter, u' retaurants in the dataset, \n')

    #Print the last Restaurant review text wrten to the text file
    print('Last review text writen to file:\n\n', review_dict.get(u'text',""), '\n')
    print("-----------------------------------------------------\n")
    print(f'Text from {r_counter} restaurant reviews written to the new_txt_file.', '\n')    

1000  retaurants in the dataset, 

Last review text writen to file:

 We visited last week.  Be Warned! My husband's sister ordered a cheeseburger.  The waiter asked how she wanted it cooked and explained that "Medium" would be pink through-out.  She ordered the burger medium.  It took a very long time for our food to arrive and when it did, her burger was cooked very well done.  The waiter offered to have another burger made.  We had waited quite a while, and everyone was nearly done with their meals, all of us sharing with my sister-in-law whom still had no burger.  The waiter came by and my sister-in-law said to just cancel the order because we had already waited too long for it.  He said that he would take care of it.  A few minutes later, someone who appeared to be a manager came to the table delivering the burger we had already asked to be removed from the order.  My sister in law, frustrated at this point stated "No, I don't want it now".  The manager walked away.  A few minutes

spaCy is an industrial-strength natural language processing (NLP) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

spaCy handles many task commonly associated with building an end-to-end natural language processing pipeline:

       Tokenization
       Text normailization, such as lowercasing, stemming/lemmentazation
       Part-Of-Speech tagging
       Syntatic dependency parsing
       Sentence boundary detection
       Named entity recognition and annotation
       
In the "batteries included" Python tradition, spaCy contains built-in-data and models which you can use out-of-the-box for processing general-purpose English text:

        Large English vocabulary, including stopword lists
        Token "probabilities"
        Word Vectors
        
spaCy is written in optimized Cython, which means it's fast. According to a few independent sources, it's the fastest syntactic parser available in any language. Key places of the spaCy parsing pipeline are written in pure C, enabling efficient multithreading (i.e., spaCy can release the GIL(Global Interpreter Lock).

In [6]:
import spacy
import pandas as pd
import itertools as it

#It takes a few seconds to load because there is about 2GB to 3GB worth of models
nlp = spacy.load('en_core_web_sm')



Now, we've completed our first step by loading spaCy....let's some reviews  with a sample text

In [7]:
r_counter = 0

with codecs.open(review_txt_filepath, encoding='utf_8') as f:
        
        # Read the binary data from our text file up to the "^" marker
        # and store our review as a element in a list of reviews
        
        rest_reviews = str(f.read()).split('ETX')
        
        for text in rest_reviews:
            if r_counter == 2:
                    break  
            print(text.replace('\\n', '\n'), '\n')
            print("-----------------------------------------------------\n")
            r_counter+=1

Went in for a lunch. Steak sandwich was delicious, and the Caesar salad had an absolutely delicious dressing, with a perfect amount of dressing, and distributed perfectly across each leaf. I know I'm going on about the salad ... But it was perfect.

Drink prices were pretty good.

The Server, Dawn, was friendly and accommodating. Very happy with her.

In summation, a great pub experience. Would go again! 

-----------------------------------------------------

I'll be the first to admit that I was not excited about going to La Tavolta. Being a food snob, when a group of friends suggested we go for dinner I looked online at the menu and to me there was nothing special and it seemed overpriced.  Im also not big on ordering pasta when I go out. Alas, I was outnumbered. Thank goodness! I ordered the sea bass special. It was to die for. Cooked perfectly, seasoned perfectly, perfect portion. I can not say enough good things about this dish. When the server asked how it was he seemed very pro

Great we were able to grab our text from our file!!! 

Also, we made a list of reviews so later we can iterate through the reviews pretty easily...

In [34]:
#Pass the a restaurant review to spaCy's nlp function which will return a parsed review
parsed_rev = nlp(rest_reviews[999].replace('\\n','\n'))

print(parsed_rev, '\n')

We visited last week.  Be Warned! My husband's sister ordered a cheeseburger.  The waiter asked how she wanted it cooked and explained that "Medium" would be pink through-out.  She ordered the burger medium.  It took a very long time for our food to arrive and when it did, her burger was cooked very well done.  The waiter offered to have another burger made.  We had waited quite a while, and everyone was nearly done with their meals, all of us sharing with my sister-in-law whom still had no burger.  The waiter came by and my sister-in-law said to just cancel the order because we had already waited too long for it.  He said that he would take care of it.  A few minutes later, someone who appeared to be a manager came to the table delivering the burger we had already asked to be removed from the order.  My sister in law, frustrated at this point stated "No, I don't want it now".  The manager walked away.  A few minutes later I noticed the manager with a waiter standing in the middle of t

Even though the text doesn't look like it changed at all we'll check behind the scene to see what happened

In [35]:
#Getting annoying "variable not defined" error because of scope issues in Jupyter Notebook
#So I repeated a redundant step by calling the nlp function and assigning to "parsed_rev" variable
parsed_rev = nlp(rest_reviews[999].replace('\\n','\n'))

# Use f string object update to provide a more efficent syntax and remove verbose code
for i,sentence in enumerate(parsed_rev.sents):
    print(f'Sentence {i+1}:')
    print(sentence)
    print('')

Sentence 1:
We visited last week.  

Sentence 2:
Be Warned!

Sentence 3:
My husband's sister ordered a cheeseburger.  

Sentence 4:
The waiter asked how she wanted it cooked and explained that "Medium" would be pink through-out.  

Sentence 5:
She ordered the burger medium.  

Sentence 6:
It took a very long time for our food to arrive and

Sentence 7:
when it did, her burger was cooked very well done.  

Sentence 8:
The waiter offered to have another burger made.  

Sentence 9:
We had waited quite a while, and everyone was nearly done with their meals, all of us sharing with my sister-in-law whom still had no burger.  

Sentence 10:
The waiter came by and my sister-in-law said to just cancel the order because we had already waited too long for it.  

Sentence 11:
He said that he would take care of it.  

Sentence 12:
A few minutes later, someone who appeared to be a manager came to the table delivering the burger we had already asked to be removed from the order.  

Sentence 13:
My si

When you call nlp on a text, spaCy will take the first step and tokenize it then call each component on the Doc, in order. Since the model data is loaded, the components can access it to assign annotations to the Doc object, and subsequently to the Token and Span which are only views of the Doc, and don’t own any data themselves.

Let's see what the next component in the pipeline which is the "tagger"(part of speech tagging)

In [36]:
for i, entity in enumerate(parsed_rev.ents):
    print(f'Entity {i+1}:', entity, '-', entity.label_)
    print('')

Entity 1: last week - DATE

Entity 2: A few minutes later - TIME

Entity 3: A few minutes later - TIME



In [37]:
# Iterate through the parsed review list, call .orth_ which returns a tokenized string
# and create a element in your new "token_txt list"
token_txt = [token.orth_ for token in parsed_rev]

# Iterate through the parsed review list, call .pos which returns a tagged string
# and create a element in your new "token_pos list"
token_pos = [token.pos_ for token in parsed_rev]

pd.DataFrame(zip(token_txt, token_pos), columns = ['Token Text', 'Part Of Speech'])

Unnamed: 0,Token Text,Part Of Speech
0,We,PRON
1,visited,VERB
2,last,ADJ
3,week,NOUN
4,.,PUNCT
5,,SPACE
6,Be,VERB
7,Warned,VERB
8,!,PUNCT
9,My,DET


The lemma is a more normalized or root form of the word. For example, lets take the word "thinking". The lemma will remove the suffix and return the word to "think" which is the root form of the word "thinking" or "were" to "be". 

Below, we'll use the same code structure as in the cell above to display lemma, and shaping

In [38]:
# Iterate through the parsed review, call .lemma_(w\o "_" it will return a int instead)
# which returns a lemmanoized string create a "token_lemma list"
token_lemma = [token.lemma_ for token in parsed_rev]

# Iterate through the parsed review list, call .pos which returns a tagged string
# to create a "token_shape list"
token_shape = [token.shape_ for token in parsed_rev]

# The zip() function take iterables (can be zero or more), makes iterator that aggregates elements 
# based on the iterables passed, and returns a map. Pandas.Dataframe creates our database
# for a clean look
pd.DataFrame(zip(token_txt,token_lemma, token_shape), columns = ['Token Text','Token Lemma', 'Token Shape'])

Unnamed: 0,Token Text,Token Lemma,Token Shape
0,We,-PRON-,Xx
1,visited,visit,xxxx
2,last,last,xxxx
3,week,week,xxxx
4,.,.,.
5,,,
6,Be,Be,Xx
7,Warned,warn,Xxxxx
8,!,!,!
9,My,-PRON-,Xx


In [39]:
token_etype = [token.ent_type_ for token in parsed_rev]
token_eiob = [token.ent_iob_ for token in parsed_rev]

pd.DataFrame(zip(token_txt, token_etype, token_eiob),
             columns=['Token Text', 'Entity Type', 'Inside-Outsid-Begin'])

Unnamed: 0,Token Text,Entity Type,Inside-Outsid-Begin
0,We,,O
1,visited,,O
2,last,DATE,B
3,week,DATE,I
4,.,,O
5,,,O
6,Be,,O
7,Warned,,O
8,!,,O
9,My,,O


There are other attributes such as the realative frequency of tokens, and whether or not a token matches any of these categories?

        stopword
        punctuation
        whitespace
        represents a number

In [40]:
token_attbs = [(token.orth_, 
                token.is_stop, 
                token.is_punct,
                token.is_space,
                token.like_num,
                token.is_oov)
               for token in parsed_rev]


data_frame = pd.DataFrame(token_attbs,
                          columns=['Text',
                                   'Stop?',
                                   'Punctuation',
                                   'Whitespace?',
                                   'Number?',
                                   'Out Of Vocab.?'])

data_frame.loc[:, 'Stop?':'Out Of Vocab.?'] = (data_frame.loc[:, 'Stop?':'Out Of Vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
data_frame

Unnamed: 0,Text,Stop?,Punctuation,Whitespace?,Number?,Out Of Vocab.?
0,We,Yes,,,,Yes
1,visited,,,,,Yes
2,last,Yes,,,,Yes
3,week,,,,,Yes
4,.,,Yes,,,Yes
5,,,,Yes,,Yes
6,Be,Yes,,,,Yes
7,Warned,,,,,Yes
8,!,,Yes,,,Yes
9,My,Yes,,,,Yes


# Phrase Modeling

Phrase modeling is a alternative approach to learning groupings(combinations) of tokens that together embody meaningful multi-word concepts. For this project we are going to design phrase models by looping over the words in our reviews and looking for words that accompany together much more commonly than you would assume them by random chance. The formula our phrase models will use to determine whether two tokens “A” and “B” constitute a phrase is:

	(( Term_Freq(Token_A, Token_B) - min_range) / ( Term_Freq(Token_B) * Term_Freq(Token_A))) * count(corpus_vocab) > Threshold
    

Term_Freq(A) is the number of times token A appears in the corpus

Term_Freq(B) is the number of times token B appears in the corpus

Term_Freq(A, B) is the number of times the tokens A B appear in the corpus in 
order

count(corpus_vocab)is the total size of the corpus vocabulary

min_range is a user-defined parameter to ensure that accepted phrases occur a 
minimum number of times

Threshold is a user-defined parameter to control how strong of a relationship between two tokens as a prerequisite for the model

After our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that identifies as a phrase, it will merge the two into a single new token.

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so new york would become new_york). But you would also expect multi-word expressions that represent common concepts, but aren't specifically named entities (such as happy hour) to also become phrases in the model.

We turn to the indispensable gensim library to help us with phrase modeling — the Phrases class in particular

In [15]:
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import LineSentence
from gensim.test.utils import datapath as dp





As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. We are going to prep our data in the following order:

     1. Segment text of complete reviews into sentences & normalize text
     2. Apply first-order phrase model to transform sentences
     3. Apply second-order phrase model to transform sentences
     4. Apply text normalization and second-order phrase model to text of complete reviews

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

First, let's define a few helper functions that we'll use for text normalization. In particular, the lemmatized_sentence_corpus generator function will use spaCy to:

    - Iterate over the 1M reviews in the review_txt_all.txt we created before
    - Segment the reviews into individual sentences
    - Remove punctuation and excess whitespace
    - Lemmatize the text
        
....by multithreading our text processing step we can efficiently iterate through large number of reviews. In order to accomplish this we must unlock the GIL(Global Interpreter Lock) for for-loop’s. spaCy’s pipe method will allow your threads to actually run on multiple cores which speed up production.

In [16]:
# Verifies if the token is a punctuation or a space
def punct_space(token):
    return token.is_punct or token.is_space

# generator function to read in reviews from the file
# and reverse the escaping of a string the original line breaks in the text
def rev_line(file_name):
    with codecs.open(file_name, encoding='utf=8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
# Generator function will use spaCy to parse reviews, lemmatize the textm and yield sentences
def lemma_sent_corpus(file_name):
    
    # Iterate through our review text, set the batch_size, and increase threads to use multi-cores
    # with our for-loop(Python nltk module does not support unlocking of GIL)
    for parsed_rev in nlp.pipe(rev_line(file_name), batch_size=10000, n_threads=5):
        
        # Our list compreshension will iterate through our sentence and lemmatize each word
        # in the sentence then return it as a unicode lemmatized list
        for sentence in parsed_rev.sents:
            
            yield str([token.lemma_ for token in sentence if not punct_space(token)])


In [17]:
#create a new file to store our lemmatized sentences
unigram_sentences_filepath = os.path.join(intermediate_directory,'unigram_sentences_all.txt')

Let's use the lemmatized_sentence_corpus generator to loop over the original review text, segmenting the reviews into individual sentences and normalizing the text. We'll write this data back out to a new file (unigram_sentences_all), with one normalized sentence per line. We'll use this data for learning our phrase models.

In [18]:
%%time

# #create & open new file in write mode
with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
    # Write the lemmatized sentence to the file
    for lemmatized_sentence in lemma_sent_corpus(review_txt_filepath):
        f.write(lemmatized_sentence)

Wall time: 18.1 s


In [19]:
unigram_sentences = LineSentence(dp(unigram_sentences_filepath))

In [20]:
for unigram_sentence in unigram_sentences:
    print(u' '.join(unigram_sentence))
    print(u'')

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


['go', 'in', 'for', 'a', 'lunch']['Steak', 'sandwich', 'be', 'delicious', 'and', 'the', 'Caesar', 'salad', 'have', 'an', 'absolutely', 'delicious', 'dressing', 'with', 'a', 'perfect', 'amount', 'of', 'dressing', 'and', 'distribute', 'perfectly', 'across', 'each', 'leaf']['-PRON-', 'know', '-PRON-', 'be', 'go', 'on', 'about', 'the', 'salad']['but', '-PRON-', 'be', 'perfect']['Drink', 'price', 'be', 'pretty', 'good']['the', 'Server', 'Dawn', 'be', 'friendly', 'and', 'accommodate']['very', 'happy', 'with', '-PRON-']['in', 'summation', 'a', 'great', 'pub', 'experience']['Would', 'go']["again!etxi'll", 'be', 'the', 'first', 'to', 'admit', 'that', '-PRON-', 'be', 'not', 'excited', 'about', 'go', 'to', 'La', 'Tavolta']['be', 'a', 'food', 'snob', 'when', 'a', 'group', 'of', 'friend', 'suggest', '-PRON-', 'go', 'for', 'dinner']['-PRON-', 'look', 'online', 'at', 'the', 'menu', 'and', 'to', '-PRON-', 'there', 'be', 'nothing', 'special']['and', '-PRON-', 'seem', 'overpriced']['-PRON-', 'be', 'also










'favorite']['-PRON-', 'only', 'other', 'gripe', 'be', 'smell', 'of', 'fajita']['when', '-PRON-', 'leave', '-PRON-', 'smell', 'like', '-PRON-', 'have', 'work', 'the', 'grill', '-PRON-', 'clothe', 'be', 'saturate', 'with', 'the', 'smell', 'and', '-PRON-', 'skin', 'feel', 'greasy']['-PRON-', 'have', 'notice', 'the', 'smell', 'and', 'cloud', 'of', 'smoke', 'in', 'the', 'restaurant', 'when', '-PRON-', 'sit', 'down']['-PRON-', 'will', 'just', 'be', 'sure', 'not', 'to', 'have', 'anything', 'special', 'plan', 'after', '-PRON-', 'eat', 'here']['price', 'be', 'on', 'par', 'with', 'other', 'mexican', 'restaurant', 'and', '-PRON-', 'get', 'more', 'than', 'enough', 'to', 'eat']['-PRON-', 'will', 'definitely', 'be', 'back', 'and', '-PRON-', 'will', 'be', 'pack', 'the', 'Febreeze!ETXService']['be', 'prompt', 'and', 'good']['-PRON-', 'order', 'the', 'roti', 'canai', 'which', 'be', 'excellent', 'and', 'also', 'fill']['the', 'chicken', 'curry', 'for', 'the', 'roti', 'be', 'not', 'thick', '-PRON-', 'thin




'-PRON-', 'got-', 'just', 'do', 'not', 'compute']['three', 'piece', 'of', 'fish', 'think', 'chicken', 'tender', 'size', 'and', 'a', 'small', 'portion', 'of', 'fry', 'along', 'with', 'a', 'small', 'side', 'of', 'overly', 'vinegare', 'cole', 'slaw']['again', 'the', 'food', 'be', 'good', 'but', '-PRON-', 'definitely', 'do', 'not', 'get', 'what', '-PRON-', 'pay', 'for']['-PRON-', 'understand', 'the', 'concept', 'of', 'small', 'portion', 'euro', 'style', 'etc']['but', 'for', 'a', 'bar', 'food', 'staple']['-PRON-', 'be', 'not', 'look', 'for', 'a', 'Costco', 'serve', 'of', 'anything', 'but', 'either', 'bump', 'up', 'portion', 'size', 'or', 'drop', 'the', 'price', 'from', '$', '16', 'to', '$', '11.50']['-PRON-', 'really', 'like', '-PRON-', 'server', 'but', 'for', 'some', 'reason', 'as', 'thing', 'go', 'on', '-PRON-', 'be', 'more', 'and', 'more', 'absent']['to', 'be', 'fair', '-PRON-', 'have', 'a', 'really', 'difficult', 'person', 'that', 'join', '-PRON-', 'some', 'of', 'the', 'stuff', 'this', 




'at', 'home']['Nobody', 'want', 'to', 'hear', '-PRON-']['etxthe', 'good', 'stuff', 'first', 'nice', 'décor', 'great', 'samosas', 'and', 'chutney']['great', 'chicken', 'tikka', 'masala']['sp']['the', 'bad', 'the', 'awful', 'shrimp', 'dish', 'spice', 'just', 'right']['but', 'the', 'shrimp', 'be', 'overcook', 'how', 'much', 'obviously', 'reheat', 'or', 'hold', 'too', 'long', 'like', 'bite', 'into', 'a', 'whole', 'water', 'chestnut']['-PRON-', 'should', 'have', 'send', 'the', 'dish', 'back']['do', 'mention', '-PRON-', 'too', 'the', 'waiter']['oh', '-PRON-', 'will', 'tell', 'the', 'chef']['no', 'response', 'at', 'all', 'from', 'management', 'be', 'busy', 'bs', 'e', 'at', 'the', 'bar']['Lesson', 'learn', 'by', '-PRON-', 'just', 'return', 'the', 'bloody', 'dish']['just', 'hate', 'do', 'that', 'though', 'would', '-PRON-', 'go', 'back']['yes', 'definitelyetxi', 'admit', 'that', '-PRON-', 'have', 'not', 'try', 'all', 'the', 'italian', 'restaurant', 'in', 'Toronto']['but', 'of', 'the', 'one', 'th




'all', 'these', 'year']['pizza', 'be', 'perfect', 'and', 'pasta', 'be', 'surprisingly', 'great', 'fresh', 'if', '-PRON-', 'can', 'resist', 'order', 'a', 'pizza']['Menu', 'be', 'often', 'change', 'and', 'new', 'addition', 'and', 'recommendation', 'be', 'usually', 'worth', 'take']['Have', 'not', 'yet', 'be', 'to', 'the', 'original', 'downtown', 'location', 'be', 'intimidate', 'by', 'no', 'reservation', 'and', 'what', '-PRON-', 'expect', 'to', 'be', 'long', 'line', 'for', 'a', 'reputable', 'place']['the', 'Town', 'Country', 'location', 'be', 'fashionable', 'shabby', 'chic', 'not', 'intimidate']['parking', 'be', 'kind', 'of', 'a', 'cluster']['look', 'forward', 'to', 'come', 'here', 'again', 'on', 'the', 'next', 'possible', 'date', 'night']['etxyet', 'another', 'example', 'of', 'a', 'business', 'cash', 'in', 'on', 'location', 'at', 'the', 'expense', 'of', 'quality']['cocktail', 'be', 'chain', 'restaurant', 'worthy', 'wine', 'list', 'be', 'weak', 'and', 'seafood', 'be', 'only', 'barely', 'co




'and', 'be', 'so', 'fun']['love', 'this']['Place!ETXPalermo', "'s", 'be', 'a', 'pretty', 'decent', 'pizza', 'for', 'the', 'price']['-PRON-', 'get', 'the', 'chicken', 'finger', 'and', 'pizza', 'special', 'for', '$', '12', 'which', 'be', 'a', 'great', 'deal']['the', 'delivery', 'be', 'a', 'bit', 'slow', 'but', 'free', 'so', 'no', 'complaint', 'there']['-PRON-', 'would', 'order', 'again']['ETXWhen.she', 'tell', '-PRON-', 'that', 'the', 'oven', 'be', '2,000', 'degree', 'so', '-PRON-', 'only', 'take', 'the', 'pizza', '90', 'second', 'to', 'cook', 'and', '-PRON-', 'order', 'would', 'be', 'ready', 'in', '5', 'minute', '-PRON-', 'should', 'have', 'hang', 'up', 'the', 'phone']['by', 'the', 'time', '-PRON-', 'get', '-PRON-', 'home']['-PRON-', 'be', 'rubbery']['be', 'that', 'a', 'word']['anyway', 'Yuck']['-PRON-', 'will', 'not', 'be', 'back']['ETXSo', 'this', 'be', 'actually', 'not', '-PRON-', 'first', 'time', 'here', 'come', 'here', 'a', 'few', 'time', 'already', 'and', 'food', 'be', 'pretty', '




'act', 'like', '-PRON-', 'love', '-PRON-', 'job', 'other', 'look', 'like', '-PRON-', 'hate', '-PRON-', 'life']['past', 'couple', 'of', 'time', '-PRON-', 'have', 'be', 'there']['the', 'food', 'come', 'out', 'lukewarm', 'at', 'good']['yesterday', '-PRON-', 'just', 'so', 'happen', 'that', '-PRON-', 'find', 'a', 'lemon', 'seed', 'in', '-PRON-', 'Brushetta', 'Chicken', 'Pasta', 'after', 'almost', 'crack', 'a', 'tooth', 'on', '-PRON-', 'ew']['although', '-PRON-', 'will', 'say', '-PRON-', 'just', 'move', 'that', 'bad', 'boy', 'to', 'the', 'side', 'and', 'act', 'like', '-PRON-', 'never', 'see', '-PRON-']['-PRON-', 'be', 'probably', 'well', 'that', 'way']['the', 'seat', 'arrangement', 'be', 'like', 'any', 'other', 'Friday', "'s", 'booth', 'table', 'a', 'bar']['some', 'of', '-PRON-', 'four', 'person', 'booth', 'would', 'probably', 'be', 'better', 'suited', 'for', '2', 'people', 'or', 'perhaps', 'small', 'child', '-PRON-', 'will', 'not', 'find', 'any', 'arm', 'room', 'or', 'a', 'place', 'to', 'se




'similar', 'to', 'most', 'indian', 'joint']['-PRON-', 'recommend', 'the', 'three', 'hot', 'relish', 'red', 'sauce', 'green', 'puree', 'and', 'pickle', 'pepper', 'provide', 'tableside', 'as', 'house', 'preparation', 'at', 'least', 'in', '-PRON-', 'experience', 'tend', 'to', 'be', 'mild']['Servers', 'be', 'generally', 'anglo', 'pleasant', 'and', 'competent', 'during', 'the', 'meal', 'but', 'just', 'about', 'every', 'time', '-PRON-', 'walk', 'in', 'for', 'dinner', 'at', '5', 'or', '5:30', 'the', 'entire', 'operation', 'seem', 'almost', 'startle', 'by', '-PRON-', 'presence']['Nobody', 'be', 'overtly', 'rude', 'but', 'instead', 'of', 'a', 'really', 'warm', 'welcome', '-PRON-', 'kind', 'of', 'get', 'four', 'or', 'five', 'employee', 'hang', 'out', 'at', 'the', 'bar', 'stare', 'at', '-PRON-', 'as', 'if', 'mentally', 'choose', 'straw', 'to', 'see', 'who', 'have', 'to', 'deal', 'with', 'the', 'first', 'dinner', 'customer', 'of', 'the', 'evening']['-PRON-', 'be', 'not', 'oppressive', 'but', '-PRO




'be', 'not', 'an', 'upcharge']['-PRON-', 'travel', 'around', 'and', 'glad', 'to', 'have', 'a', 'local', 'Thai', 'pace']['Pumpkin', 'curry', 'be', 'great']['too!etxoriginally', 'from', 'NY', '-PRON-', 'have', 'be', 'look', 'for', 'a', 'good', 'place', 'for', 'nearly', '20', 'year']['this', 'be', 'some', 'of', 'the', 'good', 'NY', 'pizza', '-PRON-', 'have', 'ever', 'have', 'in', 'AZ']['-PRON-', 'know', '-PRON-', 'good', 'when', '-PRON-', 'fold', '-PRON-', 'in', 'half', 'and', 'the', 'orange', 'oil', 'drip', 'off', 'the', 'tip']['yyuummmm']['pizza', 'heaven']['-PRON-', 'will', 'be', 'back!!ETXI']['can', 'not', 'do', 'italian', 'food', 'unless', '-PRON-', 'make', 'by', '-PRON-', 'mother', 'grandmother', 'or', 'grandfather']['all', 'who', 'descend', 'from', 'Sicily']['however', 'this', 'place', 'silence', 'even', 'the', 'most', 'difficult', 'critic']['each', 'dish', 'make', 'with', 'care', 'the', 'kitchen', 'obviously', 'have', 'a', 'deep', 'respect', 'for', 'this', 'cuisine']['and', 'the',




'-PRON-', 'will', 'get', 'a', 'well', 'feel', 'for', 'the', 'range', 'of', 'product', 'Cheese', 'Crackers', 'offer', 'on', 'a', 'weekly', 'basis', 'beyond', 'what', '-PRON-', 'just', 'see', 'in', 'store']['PROS', 'order', 'seafood']['-PRON-', 'have', 'a', 'mailing', 'list', 'to', 'order', 'fresh', 'seafood']['sign', 'up', 'on', '-PRON-', 'website']['-PRON-', 'bring', 'in', 'fresh', 'high', 'quality', 'seafood', 'from', 'Canada']['-PRON-', 'go', 'to', 'a', 'dinner', 'party', 'that', 'serve', 'fresh', 'salmon', 'from', 'here', 'be', 'delightful']['catering', '-PRON-', 'be', 'a', 'great', 'fit', 'for', 'catering', 'in', 'particular', 'artisanal', 'cheese', 'platter', 'and', 'tray']['baguette', 'Sandwiches']['to', 'go', '-PRON-', 'serve', 'sandwich', 'on', 'baguette', 'like', 'thinly', 'slice', 'roast', 'beef']['Cheese', 'Assortment', '-PRON-', 'have', 'an', 'extensive', 'list', 'of', 'feature', 'cheese']['CONS']['-PRON-', 'do', 'not', 'offer', 'in', 'store', 'or', 'to', 'go', 'cheese', 'p




Now let's apply the next level order of topic modeling...

In [21]:
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')

In [22]:
%%time

bigram_model = Phrases(unigram_sentences)

bigram_model.save(bigram_model_filepath)

# load the finished model from disk
bigram_model = Phrases.load(bigram_model_filepath)

bigram_sentences_filepath = os.path.join(intermediate_directory, 'bigram_sentences_all.txt')

'''
Now that we have a trained phrase model for word pairs, let's apply 
it to the review sentences data and explore the results.

'''
# this is a bit time consuming - make the if statement True

with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:

    for unigram_sentence in unigram_sentences:

        bigram_sentence = u' '.join(bigram_model[unigram_sentence])

        f.write(bigram_sentence + '\n')
        
bigram_sentences = LineSentence(dp(bigram_sentences_filepath))

for bigram_sentence in bigram_sentences:
    print( u' '.join(bigram_sentence))
    print( u'')



['go', 'in', 'for', 'a', 'lunch']['Steak', 'sandwich', 'be', 'delicious', 'and', 'the', 'Caesar', 'salad', 'have', 'an', 'absolutely', 'delicious', 'dressing', 'with', 'a', 'perfect', 'amount',_'of', 'dressing', 'and', 'distribute', 'perfectly', 'across', 'each', 'leaf']['-PRON-', 'know', '-PRON-', 'be', 'go', 'on', 'about', 'the', 'salad']['but', '-PRON-', 'be', 'perfect']['Drink', 'price', 'be', 'pretty', 'good']['the', 'Server', 'Dawn', 'be', 'friendly', 'and', 'accommodate']['very', 'happy', 'with', '-PRON-']['in', 'summation', 'a', 'great', 'pub', 'experience']['Would', 'go']["again!etxi'll", 'be', 'the', 'first', 'to', 'admit', 'that', '-PRON-', 'be', 'not', 'excited', 'about', 'go', 'to', 'La', 'Tavolta']['be', 'a', 'food', 'snob', 'when', 'a', 'group', 'of', 'friend', 'suggest', '-PRON-', 'go', 'for',_'dinner']['-PRON-', 'look', 'online', 'at', 'the', 'menu', 'and', 'to', '-PRON-', 'there', 'be', 'nothing', 'special']['and', '-PRON-', 'seem', 'overpriced']['-PRON-', 'be', 'also










'favorite']['-PRON-', 'only', 'other', 'gripe', 'be', 'smell', 'of', 'fajita']['when', '-PRON-', 'leave', '-PRON-', 'smell', 'like', '-PRON-', 'have', 'work', 'the', 'grill', '-PRON-', 'clothe', 'be', 'saturate', 'with', 'the', 'smell', 'and', '-PRON-', 'skin', 'feel', 'greasy']['-PRON-', 'have', 'notice', 'the', 'smell', 'and', 'cloud', 'of', 'smoke', 'in', 'the', 'restaurant', 'when', '-PRON-', 'sit', 'down']['-PRON-', 'will', 'just', 'be', 'sure', 'not', 'to', 'have', 'anything', 'special', 'plan', 'after', '-PRON-', 'eat', 'here']['price', 'be', 'on', 'par', 'with', 'other', 'mexican', 'restaurant', 'and', '-PRON-', 'get', 'more',_'than', 'enough',_'to', 'eat']['-PRON-', 'will',_'definitely', 'be', 'back', 'and', '-PRON-', 'will', 'be', 'pack', 'the', 'Febreeze!ETXService']['be', 'prompt', 'and', 'good']['-PRON-', 'order', 'the', 'roti', 'canai', 'which', 'be', 'excellent', 'and', 'also', 'fill']['the', 'chicken', 'curry', 'for', 'the', 'roti', 'be', 'not', 'thick', '-PRON-', 'thin




'-PRON-', 'got-', 'just', 'do',_'not', 'compute']['three', 'piece',_'of', 'fish', 'think', 'chicken', 'tender', 'size', 'and', 'a', 'small', 'portion', 'of', 'fry', 'along',_'with', 'a', 'small', 'side', 'of', 'overly', 'vinegare', 'cole', 'slaw']['again', 'the', 'food', 'be', 'good', 'but', '-PRON-', 'definitely', 'do',_'not', 'get', 'what', '-PRON-', 'pay', 'for']['-PRON-', 'understand', 'the', 'concept', 'of', 'small', 'portion', 'euro', 'style', 'etc']['but', 'for', 'a', 'bar', 'food', 'staple']['-PRON-', 'be', 'not', 'look',_'for', 'a', 'Costco', 'serve', 'of', 'anything', 'but', 'either', 'bump', 'up', 'portion',_'size', 'or', 'drop', 'the', 'price', 'from', '$', '16', 'to', '$', '11.50']['-PRON-', 'really', 'like', '-PRON-', 'server', 'but', 'for', 'some', 'reason', 'as', 'thing', 'go', 'on', '-PRON-', 'be', 'more', 'and', 'more', 'absent']['to', 'be', 'fair', '-PRON-', 'have', 'a', 'really', 'difficult', 'person', 'that', 'join', '-PRON-', 'some', 'of', 'the', 'stuff', 'this', 




'at', 'home']['Nobody', 'want',_'to', 'hear', '-PRON-']['etxthe', 'good', 'stuff', 'first', 'nice', 'décor', 'great', 'samosas', 'and', 'chutney']['great', 'chicken', 'tikka', 'masala']['sp']['the', 'bad', 'the', 'awful', 'shrimp', 'dish', 'spice', 'just', 'right']['but', 'the', 'shrimp', 'be', 'overcook', 'how', 'much', 'obviously', 'reheat', 'or', 'hold', 'too', 'long', 'like', 'bite', 'into', 'a', 'whole', 'water', 'chestnut']['-PRON-', 'should', 'have', 'send', 'the', 'dish', 'back']['do', 'mention', '-PRON-', 'too', 'the', 'waiter']['oh', '-PRON-', 'will', 'tell', 'the', 'chef']['no', 'response', 'at', 'all', 'from', 'management', 'be', 'busy', 'bs', 'e', 'at', 'the', 'bar']['Lesson', 'learn', 'by', '-PRON-', 'just', 'return', 'the', 'bloody', 'dish']['just', 'hate', 'do', 'that', 'though', 'would', '-PRON-', 'go', 'back']['yes', 'definitelyetxi', 'admit', 'that', '-PRON-', 'have', 'not', 'try', 'all', 'the', 'italian', 'restaurant', 'in', 'Toronto']['but', 'of', 'the', 'one', 'th




'all', 'these', 'year']['pizza', 'be', 'perfect', 'and', 'pasta', 'be', 'surprisingly', 'great', 'fresh', 'if', '-PRON-', 'can', 'resist', 'order', 'a', 'pizza']['Menu', 'be', 'often', 'change', 'and', 'new', 'addition', 'and', 'recommendation', 'be', 'usually', 'worth', 'take']['Have', 'not', 'yet', 'be', 'to', 'the', 'original', 'downtown', 'location', 'be', 'intimidate', 'by', 'no', 'reservation', 'and', 'what', '-PRON-', 'expect', 'to', 'be', 'long', 'line', 'for', 'a', 'reputable', 'place']['the', 'Town', 'Country', 'location', 'be', 'fashionable', 'shabby', 'chic', 'not', 'intimidate']['parking', 'be', 'kind',_'of', 'a', 'cluster']['look', 'forward',_'to', 'come',_'here', 'again', 'on', 'the', 'next', 'possible', 'date', 'night']['etxyet', 'another', 'example', 'of', 'a', 'business', 'cash', 'in', 'on', 'location', 'at', 'the', 'expense', 'of', 'quality']['cocktail', 'be', 'chain', 'restaurant', 'worthy', 'wine', 'list', 'be', 'weak', 'and', 'seafood', 'be', 'only', 'barely', 'co




'and', 'be', 'so', 'fun']['love', 'this']['Place!ETXPalermo', "'s", 'be', 'a', 'pretty', 'decent', 'pizza', 'for', 'the', 'price']['-PRON-', 'get', 'the', 'chicken', 'finger', 'and', 'pizza', 'special', 'for', '$', '12', 'which', 'be', 'a', 'great', 'deal']['the', 'delivery', 'be', 'a',_'bit', 'slow', 'but', 'free', 'so', 'no', 'complaint', 'there']['-PRON-', 'would', 'order', 'again']['ETXWhen.she', 'tell', '-PRON-', 'that', 'the', 'oven', 'be', '2,000', 'degree', 'so', '-PRON-', 'only', 'take', 'the', 'pizza', '90', 'second', 'to', 'cook', 'and', '-PRON-', 'order', 'would', 'be', 'ready', 'in', '5',_'minute', '-PRON-', 'should', 'have', 'hang', 'up', 'the', 'phone']['by', 'the', 'time', '-PRON-', 'get', '-PRON-', 'home']['-PRON-', 'be', 'rubbery']['be', 'that', 'a', 'word']['anyway', 'Yuck']['-PRON-', 'will', 'not', 'be', 'back']['ETXSo', 'this', 'be', 'actually', 'not', '-PRON-', 'first',_'time', 'here', 'come',_'here', 'a',_'few', 'time', 'already', 'and', 'food', 'be', 'pretty', '




'act', 'like', '-PRON-', 'love', '-PRON-', 'job', 'other', 'look',_'like', '-PRON-', 'hate', '-PRON-', 'life']['past', 'couple',_'of', 'time', '-PRON-', 'have', 'be', 'there']['the', 'food', 'come',_'out', 'lukewarm', 'at', 'good']['yesterday', '-PRON-', 'just', 'so', 'happen', 'that', '-PRON-', 'find', 'a', 'lemon', 'seed', 'in', '-PRON-', 'Brushetta', 'Chicken', 'Pasta', 'after', 'almost', 'crack', 'a', 'tooth', 'on', '-PRON-', 'ew']['although', '-PRON-', 'will', 'say', '-PRON-', 'just', 'move', 'that', 'bad', 'boy', 'to', 'the', 'side', 'and', 'act', 'like', '-PRON-', 'never', 'see', '-PRON-']['-PRON-', 'be', 'probably', 'well', 'that', 'way']['the', 'seat', 'arrangement', 'be', 'like', 'any', 'other', 'Friday', "'s", 'booth', 'table', 'a', 'bar']['some', 'of', '-PRON-', 'four', 'person', 'booth', 'would', 'probably', 'be', 'better', 'suited', 'for', '2', 'people', 'or', 'perhaps', 'small', 'child', '-PRON-', 'will', 'not', 'find', 'any', 'arm', 'room', 'or', 'a', 'place', 'to', 'se




'similar', 'to', 'most', 'indian', 'joint']['-PRON-', 'recommend', 'the', 'three', 'hot', 'relish', 'red', 'sauce', 'green', 'puree', 'and', 'pickle', 'pepper', 'provide', 'tableside', 'as', 'house', 'preparation', 'at',_'least', 'in', '-PRON-', 'experience', 'tend',_'to', 'be', 'mild']['Servers', 'be', 'generally', 'anglo', 'pleasant', 'and', 'competent', 'during', 'the', 'meal', 'but', 'just', 'about', 'every',_'time', '-PRON-', 'walk',_'in', 'for', 'dinner', 'at', '5', 'or', '5:30', 'the', 'entire', 'operation', 'seem', 'almost', 'startle', 'by', '-PRON-', 'presence']['Nobody', 'be', 'overtly', 'rude', 'but', 'instead',_'of', 'a', 'really', 'warm', 'welcome', '-PRON-', 'kind',_'of', 'get', 'four', 'or', 'five', 'employee', 'hang',_'out', 'at', 'the', 'bar', 'stare', 'at', '-PRON-', 'as', 'if', 'mentally', 'choose', 'straw', 'to', 'see', 'who', 'have', 'to', 'deal', 'with', 'the', 'first', 'dinner', 'customer', 'of', 'the', 'evening']['-PRON-', 'be', 'not', 'oppressive', 'but', '-PRO




'be', 'not', 'an', 'upcharge']['-PRON-', 'travel', 'around', 'and', 'glad', 'to', 'have', 'a', 'local', 'Thai', 'pace']['Pumpkin', 'curry', 'be', 'great']['too!etxoriginally', 'from', 'NY', '-PRON-', 'have', 'be', 'look',_'for', 'a', 'good', 'place', 'for', 'nearly', '20', 'year']['this', 'be', 'some', 'of', 'the', 'good', 'NY', 'pizza', '-PRON-', 'have',_'ever', 'have', 'in', 'AZ']['-PRON-', 'know', '-PRON-', 'good', 'when', '-PRON-', 'fold', '-PRON-', 'in', 'half', 'and', 'the', 'orange', 'oil', 'drip', 'off', 'the', 'tip']['yyuummmm']['pizza', 'heaven']['-PRON-', 'will', 'be', 'back!!ETXI']['can', 'not', 'do', 'italian', 'food', 'unless', '-PRON-', 'make', 'by', '-PRON-', 'mother', 'grandmother', 'or', 'grandfather']['all', 'who', 'descend', 'from', 'Sicily']['however', 'this',_'place', 'silence', 'even', 'the', 'most', 'difficult', 'critic']['each', 'dish', 'make', 'with', 'care', 'the', 'kitchen', 'obviously', 'have', 'a', 'deep', 'respect', 'for', 'this', 'cuisine']['and', 'the',




'-PRON-', 'will', 'get', 'a', 'well', 'feel', 'for', 'the', 'range', 'of', 'product', 'Cheese', 'Crackers', 'offer', 'on', 'a', 'weekly', 'basis', 'beyond', 'what', '-PRON-', 'just', 'see', 'in', 'store']['PROS', 'order', 'seafood']['-PRON-', 'have', 'a', 'mailing', 'list', 'to', 'order', 'fresh', 'seafood']['sign', 'up', 'on', '-PRON-', 'website']['-PRON-', 'bring', 'in', 'fresh', 'high', 'quality', 'seafood', 'from', 'Canada']['-PRON-', 'go', 'to', 'a', 'dinner', 'party', 'that', 'serve', 'fresh', 'salmon', 'from', 'here', 'be', 'delightful']['catering', '-PRON-', 'be', 'a', 'great', 'fit', 'for', 'catering', 'in', 'particular', 'artisanal', 'cheese', 'platter', 'and', 'tray']['baguette', 'Sandwiches']['to', 'go', '-PRON-', 'serve', 'sandwich', 'on', 'baguette', 'like', 'thinly', 'slice', 'roast', 'beef']['Cheese', 'Assortment', '-PRON-', 'have', 'an', 'extensive', 'list', 'of', 'feature', 'cheese']['CONS']['-PRON-', 'do',_'not', 'offer', 'in', 'store', 'or', 'to', 'go', 'cheese', 'p


Wall time: 846 ms


Ok, we have applied two orders of phrase modeling to our review text and at this point we will aplly one more level. Keep in mind that you can apply as many level as you desire. However, if you know words like "new york city" exist in your document which will require at least a third order model. Applying a 3 levels to our model ensures words "new york city" or "happy hour" will be group together as a phrase"new_york_city" or "happy_hour"

In [23]:
trigram_model_filepath = os.path.join(intermediate_directory,
                                      'trigram_model_all')

In [24]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.

bigram_sentences_filepath = os.path.join(intermediate_directory, 'bigram_sentences_all.txt')

bigram_sentences = LineSentence(dp(bigram_sentences_filepath))

trig_mod = Phrases(bigram_sentences)

trig_mod.save(trigram_model_filepath)

# load the finished model from disk
trig_mod = Phrases.load(trigram_model_filepath)

Wall time: 305 ms


In [25]:
trigram_sentences_filepath = os.path.join(intermediate_directory,
                                          'trigram_sentences_all.txt')

In [26]:
%%time

with codecs.open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:

    bigram_sentences_filepath = os.path.join(intermediate_directory, 'bigram_sentences_all.txt')

    bigram_sentences = LineSentence(dp(bigram_sentences_filepath))
    
    trig_mod = Phrases.load(trigram_model_filepath)
    
    for bigram_sentence in bigram_sentences:

        trigram_sentence = u' '.join(trig_mod[bigram_sentence])

        f.write(trigram_sentence + '\n')

Wall time: 418 ms


In [29]:
trigram_reviews_filepath = os.path.join(intermediate_directory,
                                        'trigram_transformed_reviews_all.txt')

In [None]:
with codecs.open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:

    for parsed_review in nlp.pipe(rev_line(review_txt_filepath),
                                  batch_size=10000, n_threads=4):

        # lemmatize the text, removing punctuation and whitespace
        unigram_review = [token.lemma_ for token in parsed_review
                          if not punct_space(token)]

        bigram_model = Phrases.load(bigram_model_filepath)
        
        trig_mod = Phrases.load(trigram_model_filepath)
        
        # apply the first-order phrase models
        bigram_review = bigram_model[unigram_review]
        
        
        # apply the second-order phrase models
        trigram_review = trig_mod[bigram_review]

        # remove any remaining stopwords
        trigram_review = [term for term in trigram_review
                          if term not in spacy.lang.en.stop_words.STOP_WORDS]

        # write the transformed review as a line in the new file
        trigram_review = u' '.join(trigram_review)
        
        f.write(trigram_review + '\n')

Let's preview the results. We'll grab one review from the file with the original, untransformed text, grab the same review from the file with the normalized and transformed text, and compare the two.