## Modern NLP in Python
### -Naveen
Reference: Patrick Harrison's Modern NLP in Python

This notebook gives a good understanding of an end-to-end data science and natural language processing pipeline, starting with raw data and running through preparing, modeling, visualizing, and analyzing the data.

The notebook covers:
* A tour of the dataset
* Introduction to text processing with spaCy
* Automatic phrase modeling
* Topic modeling with LDA
* Visualizing topic models with pyLDAvis
* Word vector models with word2vec
* Visualizing word2vec with t-SNE

## Yelp Dataset
The Yelp Dataset is published by the business review service Yelp for academic research and educational purposes. The dataset can be accessed <a href="https://www.yelp.com/dataset/challenge">here</a>.

The current iteration of the Yelp dataset (as of this demo) consists of the following data:<br>
__1,100,000__ users<br>
__156,000__ businesses<br>
__4,700,000__ user reviews

When focusing on restaurants alone, there are 51,613 restaurants with 2,927,731 user reviews written about them.

The data is provided as a handful of files in .json format. Files used in this notebook:<br>
__business.json__ — the records for individual businesses<br>
__review.json__ — the records for reviews users wrote about businesses

The files are text files (UTF-8) with one json object per line, each one corresponding to an individual data record. Let's take a look at a few examples.

In [1]:
import os

data_path = os.path.join('.', 'yelpData')
businesses_path = os.path.join(data_path, 'business.json')

with open(businesses_path, encoding='utf_8') as f:
    first_business = f.readline()
print(first_business)

{"business_id": "YDf95gJZaq05wvo7hTQbbQ", "name": "Richmond Town Square", "neighborhood": "", "address": "691 Richmond Rd", "city": "Richmond Heights", "state": "OH", "postal_code": "44143", "latitude": 41.5417162, "longitude": -81.4931165, "stars": 2.0, "review_count": 17, "is_open": 1, "attributes": {"RestaurantsPriceRange2": 2, "BusinessParking": {"garage": false, "street": false, "validated": false, "lot": true, "valet": false}, "BikeParking": true, "WheelchairAccessible": true}, "categories": ["Shopping", "Shopping Centers"], "hours": {"Monday": "10:00-21:00", "Tuesday": "10:00-21:00", "Friday": "10:00-21:00", "Wednesday": "10:00-21:00", "Thursday": "10:00-21:00", "Sunday": "11:00-18:00", "Saturday": "10:00-21:00"}}



The attributes we are interested in are:<br>
__business_id__<br>
__categories__, since we only need businesses in the 'Restaurants' category

In [2]:
reviews_path = os.path.join(data_path, 'reviews.json')

with open(reviews_path, encoding='utf_8') as f:
    first_review = f.readline()
print(first_review)

{"review_id":"VfBHSwC5Vz_pbFluy07i9Q","user_id":"cjpdDjZyprfyDG3RlkVG3w","business_id":"uYHaNptLzDLoV_JZ_MuzUA","stars":5,"date":"2016-07-12","text":"My girlfriend and I stayed here for 3 nights and loved it. The location of this hotel and very decent price makes this an amazing deal. When you walk out the front door Scott Monument and Princes street are right in front of you, Edinburgh Castle and the Royal Mile is a 2 minute walk via a close right around the corner, and there are so many hidden gems nearby including Calton Hill and the newly opened Arches that made this location incredible.\n\nThe hotel itself was also very nice with a reasonably priced bar, very considerate staff, and small but comfortable rooms with excellent bathrooms and showers. Only two minor complaints are no telephones in room for room service (not a huge deal for us) and no AC in the room, but they have huge windows which can be fully opened. The staff were incredible though, letting us borrow umbrellas for t

The attributes we are interested in are:<br>
__business_id__, restaurants only<br>
__text__, the review the user wrote

Now we create a set of our restaurant business ids.

In [3]:
import json

restaurant_ids = set()

# open the businesses file
with open(businesses_path, encoding='utf_8') as f:
    
    # iterate through each line (json record) in the file
    for business_json in f:
        
        # convert the json record to a Python dict
        business = json.loads(business_json)
        
        # if this business is not a restaurant, skip to the next one
        if u'Restaurants' not in business[u'categories']:
            continue
            
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business[u'business_id'])

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(restaurant_ids)

# print the number of unique restaurant ids in the dataset
print('{:,}'.format(len(restaurant_ids)), u'restaurants in the dataset.')

51,613 restaurants in the dataset.


Next, we will create a new file that contains only the text from reviews about restaurants, with one review per line in the file.

In [4]:
restaurant_review_path = os.path.join(data_path, 'restaurant_reviews.txt')

In [5]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 0 == 1:
    
    review_count = 0

    # create & open a new file in write mode
    with open(restaurant_review_path, 'w', encoding='utf_8') as restaurant_review_file:

        # open the existing review json file
        with open(reviews_path, encoding='utf_8') as review_file:

            # loop through all reviews in the existing file and convert to dict
            for review_json in review_file:
                review = json.loads(review_json)

                # if this review is not about a restaurant, skip to the next one
                if review[u'business_id'] not in restaurant_ids:
                    continue

                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                restaurant_review_file.write(review[u'text'].replace('\n', '\\n') + '\n')
                review_count += 1

    print(u'''Text from {:,} restaurant reviews written to the new txt file.'''.format(review_count))
    
else:
    
    with open(restaurant_review_path, encoding='utf_8') as restaurant_review_file:
        for review_count, line in enumerate(restaurant_review_file):
            pass
        
    print(u'Text from {:,} restaurant reviews in the txt file.'.format(review_count + 1))

Text from 2,929,512 restaurant reviews in the txt file.
CPU times: user 4.27 s, sys: 640 ms, total: 4.91 s
Wall time: 4.95 s
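The data prep cell above stores each review on a single line by escaping its line breaks. The following is a minimal sketch of that round trip on a made-up review string; the helper names `escape_newlines` and `unescape_newlines` are hypothetical and are not part of the notebook.

```python
def escape_newlines(text):
    # replace real line breaks with the two-character sequence backslash + n
    return text.replace('\n', '\\n')


def unescape_newlines(line):
    # reverse the escaping when a review is read back from the file
    return line.replace('\\n', '\n')


# made-up review text, for illustration only
review = "Great tacos.\n\nService was slow."

stored = escape_newlines(review)
restored = unescape_newlines(stored)

# the stored form contains no real line breaks, so it fits on one line
print('\n' in stored)
print(restored == review)
```

One caveat of this simple scheme: a review that already contains a literal backslash followed by `n` would come back with a real line break in its place, so the round trip is only exact for text without that sequence.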


## spaCy - Industrial-strength NLP in Python

spaCy is an industrial-strength natural language processing (NLP) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.<br>
spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:
* Tokenization
* Text normalization, such as lowercasing, stemming/lemmatization
* Part-of-speech tagging
* Syntactic dependency parsing
* Sentence boundary detection
* Named entity recognition and annotation

spaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English language text:
* Large English vocabulary, including stopword lists
* Token "probabilities"
* Word vectors

spaCy is written in optimized Cython, which means it's fast. According to a few independent sources, it's the fastest syntactic parser available in any language. Key pieces of the spaCy parsing pipeline are written in pure C, enabling efficient multithreading (i.e., spaCy can release the GIL).

In [6]:
import spacy
import pandas as pd
import itertools as it

nlp = spacy.load('en')

Let's grab a lengthy sample review to play with.

In [7]:
# Find the longest reviews by character count.
# index is the longest review, which contains a whole menu
# index1 is the previous record holder, which is similar to index
# index2 is the record holder before that, which is the review that we need
"""
max_length = 0
index = 0
index1 = 0

with open(restaurant_review_path, encoding='utf_8') as f:
    for review_count, line in enumerate(f):
        length = len(line)
        if length > max_length:
            max_length = length
            index2 = index1
            index1 = index
            index = review_count
            
    print(index2)
"""

# if you run the commented-out code above, index2 comes out as 982679
# (or you can eyeball all the reviews!)

# I chose index 150 arbitrarily, and it turned out to contain some words and phrases related to food
with open(restaurant_review_path, encoding='utf_8') as f:
    sample_review = list(it.islice(f, 150, 151))[0]
    sample_review = sample_review.replace('\\n', '\n')
        
print(sample_review)

On a stormy, humid night in Leslieville, my partner and I stumbled upon this welcoming looking restaurant with the rooster out front, and it turned out quite well.

Though the category on Yelp is listed as French, I would say the food is much closer to Italian if you had to pick a cuisine.

Right off the hop the service was questionable. Benefit of the doubt moment: it was after 8 pm on a Sunday -- which in Toronto means you're lucky if anything half-decent is open at all -- and the place seemed  understaffed. 

Still, having to stand awkwardly for 10-15 min in the non-foyer, while  waiters whisked by us was not too pleasant.

And once we had a table, service continued to be spotty. Our waiter was friendly and self-effacing, calling us his "little neglected table." This was not inaccurate as even getting our initially requested waters (it was sweaty out!) took nearly 10 minutes, and later he left us with dessert menus for 20 minutes before coming back to see what we wanted, our appetit

In [8]:
%%time
parsed_review = nlp(sample_review)

CPU times: user 44 ms, sys: 0 ns, total: 44 ms
Wall time: 43 ms


In [9]:
print(parsed_review)

On a stormy, humid night in Leslieville, my partner and I stumbled upon this welcoming looking restaurant with the rooster out front, and it turned out quite well.

Though the category on Yelp is listed as French, I would say the food is much closer to Italian if you had to pick a cuisine.

Right off the hop the service was questionable. Benefit of the doubt moment: it was after 8 pm on a Sunday -- which in Toronto means you're lucky if anything half-decent is open at all -- and the place seemed  understaffed. 

Still, having to stand awkwardly for 10-15 min in the non-foyer, while  waiters whisked by us was not too pleasant.

And once we had a table, service continued to be spotty. Our waiter was friendly and self-effacing, calling us his "little neglected table." This was not inaccurate as even getting our initially requested waters (it was sweaty out!) took nearly 10 minutes, and later he left us with dessert menus for 20 minutes before coming back to see what we wanted, our appetit

spaCy hands back a parsed object from which we can get sentence segmentation, named entities, part-of-speech tags, and more.

In [10]:
for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print()

Sentence 1:
On a stormy, humid night in Leslieville, my partner and I stumbled upon this welcoming looking restaurant with the rooster out front, and it turned out quite well.



Sentence 2:
Though the category on Yelp is listed as French, I would say the food is much closer to Italian if you had to pick a cuisine.



Sentence 3:
Right off the hop the service was questionable.

Sentence 4:
Benefit of the doubt moment: it was after 8 pm on a Sunday -- which in Toronto means you're lucky if anything half-decent is open at all -- and the place seemed  understaffed. 



Sentence 5:
Still, having to stand awkwardly for 10-15 min in the non-foyer, while  waiters whisked by us was not too pleasant.



Sentence 6:
And once we had a table, service continued to be spotty.

Sentence 7:
Our waiter was friendly and self-effacing, calling us his "little neglected table.

Sentence 8:
" This was not inaccurate as even getting our initially requested waters (it was sweaty out!) took nearly 10 minutes, 

Named entity detection

In [11]:
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print()

Entity 1: night - TIME

Entity 2: Leslieville - GPE

Entity 3: Yelp - PERSON

Entity 4: French - LANGUAGE

Entity 5: Italian - NORP

Entity 6: 8 pm - QUANTITY

Entity 7: Sunday - DATE

Entity 8: Toronto - GPE

Entity 9: half - CARDINAL

Entity 10: 10-15 min - DATE

Entity 11: nearly 10 minutes - TIME

Entity 12: 20 minutes - TIME

Entity 13: about half - CARDINAL

Entity 14: Rich - PERSON

Entity 15: two - CARDINAL

Entity 16: two - CARDINAL

Entity 17: One - CARDINAL

Entity 18: night - TIME

Entity 19: Wines - PERSON

Entity 20: Rollegrosse - GPE

Entity 21: $11 - MONEY

Entity 22: Anna - PERSON

Entity 23: one - CARDINAL

Entity 24: $13 - MONEY

Entity 25: two - CARDINAL

Entity 26: $77 + - MONEY

Entity 27: over $100 - MONEY

Entity 28: $30 - MONEY

Entity 29: more than three - CARDINAL

Entity 30: five - CARDINAL



Part of speech tagging

In [12]:
import pandas as pd
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]

pd.DataFrame({'token_text': token_text, 'part_of_speech': token_pos})

Unnamed: 0,part_of_speech,token_text
0,ADP,On
1,DET,a
2,ADJ,stormy
3,PUNCT,","
4,ADJ,humid
5,NOUN,night
6,ADP,in
7,PROPN,Leslieville
8,PUNCT,","
9,ADJ,my


What about text normalization, like stemming/lemmatization and shape analysis?

In [13]:
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame({'token_text': token_text, 'token_lemma': token_lemma, 'token_shape': token_shape})

Unnamed: 0,token_lemma,token_shape,token_text
0,on,Xx,On
1,a,x,a
2,stormy,xxxx,stormy
3,",",",",","
4,humid,xxxx,humid
5,night,xxxx,night
6,in,xx,in
7,leslieville,Xxxxx,Leslieville
8,",",",",","
9,-PRON-,xx,my


What about token-level entity analysis?

In [14]:
token_entity_type = [token.ent_type_ for token in parsed_review]
token_entity_iob = [token.ent_iob_ for token in parsed_review]

pd.DataFrame({'token_text': token_text, 'token_entity_type': token_entity_type,
              'token_entity_iob': token_entity_iob})

Unnamed: 0,token_entity_iob,token_entity_type,token_text
0,O,,On
1,O,,a
2,O,,stormy
3,O,,","
4,O,,humid
5,B,TIME,night
6,O,,in
7,B,GPE,Leslieville
8,O,,","
9,O,,my


What about a variety of other token-level attributes, such as the relative frequency of tokens, and whether or not a token matches any of these categories?
* stopword
* punctuation
* whitespace
* represents a number
* whether or not the token is included in spaCy's default vocabulary?

In [15]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_review]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
                                               
df

Unnamed: 0,text,log_probability,stop?,punctuation?,whitespace?,number?,out of vocab.?
0,On,-8.899799,Yes,,,,
1,a,-3.983075,Yes,,,,
2,stormy,-19.579313,,,,,
3,",",-3.391480,,Yes,,,
4,humid,-19.579313,,,,,
5,night,-8.964851,,,,,
6,in,-4.585874,Yes,,,,
7,Leslieville,-19.579313,,,,,Yes
8,",",-3.391480,,Yes,,,
9,my,-5.918125,Yes,,,,


## Phrase Modeling
Phrase modeling is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our reviews and looking for words that co-occur (i.e., appear one after another) much more frequently than you would expect them to by random chance. The formula our phrase models will use to determine whether two tokens $A$ and $B$ constitute a phrase is:

$$\frac{count(A B) - count_{min}}{count(A) * count(B)} * N > threshold$$

...where:

$count(A)$ is the number of times token $A$ appears in the corpus<br>
$count(B)$ is the number of times token $B$ appears in the corpus<br>
$count(A B)$ is the number of times the tokens $A\ B$ appear in the corpus in order<br>
$N$ is the total size of the corpus vocabulary<br>
$count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times<br>
$threshold$ is a user-defined parameter to control how strong of a relationship between two tokens the model requires before accepting them as a phrase
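As a quick worked example of the formula above, the sketch below scores two candidate pairs. All counts, the vocabulary size, and the `min_count` and `threshold` values are made up for illustration, and `phrase_score` is a hypothetical helper rather than a gensim function.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate phrase A B using the formula above."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size


# illustrative settings, not taken from the notebook
vocab_size = 10000
min_count = 5
threshold = 10.0

# two fairly rare tokens that almost always appear together
strong = phrase_score(count_a=100, count_b=80, count_ab=60,
                      vocab_size=vocab_size, min_count=min_count)

# two very common tokens that appear together the same number of times
weak = phrase_score(count_a=5000, count_b=4000, count_ab=60,
                    vocab_size=vocab_size, min_count=min_count)

print(strong > threshold)
print(weak > threshold)
```

With these numbers the first pair scores roughly 68.75 and is accepted, while the second scores well below 1 and is rejected. The same co-occurrence count is much weaker evidence when both tokens are common on their own.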

Once our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token.

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so new york would become new_york). But you would also expect multi-word expressions that represent common concepts, but aren't specifically named entities (such as happy hour) to also become phrases in the model.

We turn to the indispensable gensim library to help us with phrase modeling, the Phrases class in particular.

In [16]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:

* Segment text of complete reviews into sentences & normalize text
* First-order phrase modeling $\rightarrow$ apply first-order phrase model to transform sentences
* Second-order phrase modeling $\rightarrow$ apply second-order phrase model to transform sentences
* Apply text normalization and second-order phrase model to text of complete reviews

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.
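To see why two passes of a pairwise phrase model can produce three-word phrases, here is a small pure-Python sketch of the first-order and second-order steps from the roadmap. The function `merge_phrases` and the hand-picked phrase sets are assumptions for illustration; a trained model would learn which pairs to merge from the corpus.

```python
def merge_phrases(tokens, phrases, delimiter='_'):
    """Greedily merge adjacent token pairs that appear in phrases."""
    merged = []
    i = 0

    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])

        if len(pair) == 2 and pair in phrases:
            # join the pair into a single token and skip past both words
            merged.append(delimiter.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1

    return merged


# hand-picked phrase sets standing in for trained models
first_order = {('short', 'rib'), ('happy', 'hour')}
second_order = {('braise', 'short_rib')}

sentence = ['the', 'braise', 'short', 'rib', 'at', 'happy', 'hour']

bigrams = merge_phrases(sentence, first_order)
trigrams = merge_phrases(bigrams, second_order)

print(bigrams)
print(trigrams)
```

The second pass only ever joins two tokens at a time, but because one of them may already be a merged pair, the result can span three (or even four) original words.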

First, let's define a few helper functions that we'll use for text normalization. In particular, the lemmatized_sentence_corpus generator function will use spaCy to:
* Iterate over the restaurant reviews in the restaurant_reviews.txt file we created before
* Segment the reviews into individual sentences
* Remove punctuation and excess whitespace
* Lemmatize the text

... and do so efficiently in parallel, thanks to spaCy's nlp.pipe() function.

In [17]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space


def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    
    with open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(line_review(filename),
                                  batch_size=10000, n_threads=4):
        
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

In [18]:
unigram_sentences_filepath = os.path.join('yelpData',
                                          'unigram_sentences_all.txt')

Let's use the lemmatized_sentence_corpus generator to loop over the original review text, segmenting the reviews into individual sentences and normalizing the text. We'll write this data back out to a new file (unigram_sentences_all), with one normalized sentence per line. We'll use this data for learning our phrase models.

In [19]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 0 == 1:

    with open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(restaurant_review_path):
            f.write(sentence + '\n')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.25 µs


If your data is organized like our unigram_sentences_all file now is — a large text file with one document/sentence per line — gensim's LineSentence class provides a convenient iterator for working with other gensim components. It streams the documents/sentences from disk, so that you never have to hold the entire corpus in RAM at once. This allows you to scale your modeling pipeline up to potentially very large corpora.
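The streaming idea can be sketched in a few lines of plain Python. The class below is a hypothetical stand-in written for this illustration, not gensim's implementation: it re-opens the file each time it is iterated, so it can be looped over more than once without ever loading the whole file into memory.

```python
import os
import tempfile


class StreamingSentences:
    """Hypothetical restartable iterator over a one-sentence-per-line file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # re-open the file on every pass, yielding one token list at a time
        with open(self.path, encoding='utf_8') as f:
            for line in f:
                yield line.split()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sentences.txt')

    # write a tiny made-up corpus to disk
    with open(path, 'w', encoding='utf_8') as f:
        f.write('the food be great\nservice be slow\n')

    sentences = StreamingSentences(path)

    # the same object can be iterated more than once
    first_pass = list(sentences)
    second_pass = list(sentences)

print(first_pass == second_pass)
```

A plain generator function would be exhausted after one loop. Wrapping the file in an object with its own `__iter__` method is what makes repeated passes possible.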

In [20]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [21]:
for unigram_sentence in it.islice(unigram_sentences, 150, 160):
    print(u' '.join(unigram_sentence))
    print(u'')

the only downfall be that the noise be not conducive to conversation

-PRON- haven't be to an establishment this noisy since buca de beppo's which -PRON- love do n't get -PRON- wrong

but -PRON- will say on a positive note that the customer service be fabulous

something that can be hard to come by at time as for price $28 buck for the two of -PRON- include an appetizer two entree and two drink

coke and ice tea with free refill seem a bit pricy to -PRON- but honestly everything be expensive now day

overall try -PRON- out -PRON- offer take out too so -PRON- can enjoy great food in the comfort of -PRON- home without all the noise or -PRON- have a patio -PRON- can dine from

very clean and staff be always friendly

-PRON- usually order the honey sear chicken but decide to switch -PRON- up

-PRON- order the thai dynamite and -PRON- have a weird almost chemical taste

-PRON- will go back but will stick to -PRON- usual order



Next, we'll learn a phrase model that will link individual words into two-word phrases. We'd expect words that together represent a specific concept, like "ice cream", to be linked together to form a new, single token: "ice_cream".

In [22]:
bigram_model_filepath = os.path.join('yelpData', 'bigram_model_all')

In [23]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if 0 == 1:

    bigram_model = Phrases(unigram_sentences)

    bigram_model.save(bigram_model_filepath)
    
# load the finished model from disk
bigram_model = Phrases.load(bigram_model_filepath)

CPU times: user 9.96 s, sys: 760 ms, total: 10.7 s
Wall time: 10.8 s


Now that we have a trained phrase model for word pairs, let's apply it to the review sentences data and explore the results.

In [24]:
bigram_sentences_filepath = os.path.join('yelpData',
                                         'bigram_sentences_all.txt')

In [25]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 0 == 1:

    with open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for unigram_sentence in unigram_sentences:
            
            bigram_sentence = u' '.join(bigram_model[unigram_sentence])
            
            f.write(bigram_sentence + '\n')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 9.78 µs


In [26]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [27]:
for bigram_sentence in it.islice(bigram_sentences, 230, 240):
    print(u' '.join(bigram_sentence))
    print(u'')

nothing fantastic about -PRON- but -PRON- taste fine

-PRON- come with a sweet mustard dip sauce which be pretty tasty

usual filling of pork veggie etc

deep_fried pot_stickers- -PRON- get 2 of these thing

-PRON- be good

small but good

straight pork filling in the middle of -PRON-

overall -PRON- be just alright

-PRON- will say that -PRON- give -PRON- a grip of food

the 2 entree -PRON- get could have easily_feed 1 2 more people



In [28]:
trigram_model_filepath = os.path.join('yelpData',
                                      'trigram_model_all')

In [29]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if 0 == 1:

    trigram_model = Phrases(bigram_sentences)

    trigram_model.save(trigram_model_filepath)
    
# load the finished model from disk
trigram_model = Phrases.load(trigram_model_filepath)

CPU times: user 12.3 s, sys: 644 ms, total: 13 s
Wall time: 13 s


We'll apply our trained second-order phrase model to our first-order transformed sentences, write the results out to a new file, and explore a few of the second-order transformed sentences.

In [30]:
trigram_sentences_filepath = os.path.join('yelpData',
                                          'trigram_sentences_all.txt')

In [31]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 0 == 1:

    with open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for bigram_sentence in bigram_sentences:
            
            trigram_sentence = u' '.join(trigram_model[bigram_sentence])
            
            f.write(trigram_sentence + '\n')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.48 µs


In [32]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [33]:
for trigram_sentence in it.islice(trigram_sentences, 980, 990):
    print(u' '.join(trigram_sentence))
    print(u'')

book_through_opentable a few day before for 8_pm zip right in to -PRON- table

have the marinated_olive and diver_scallops to start wow

the scallop be fantastic some of the good -PRON- have have here in toronto

-PRON- be usually not a big béarnaise fan but find -PRON- scrap the plate

the guanciale be a neat change too different style of bacon

have the pork_tenderloin wrap in pancetta for dinner also fantastic

well cooked and present

love the square plate

-PRON- friend have the braise_short_rib but be mildly disappoint

find -PRON- to be very chewy



The final step of our text preparation process circles back to the complete text of the reviews. We're going to run the complete text of the reviews through a pipeline that applies our text normalization and phrase models.

In addition, we'll remove stopwords at this point. Stopwords are very common words, like a, the, and, and so on, that serve functional roles in natural language, but typically don't contribute to the overall meaning of text. Filtering stopwords is a common procedure that allows higher-level NLP modeling techniques to focus on the words that carry more semantic weight.

Finally, we'll write the transformed text out to a new file, with one review per line.
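As a small illustration of the stopword step, the sketch below filters a lemmatized, phrase-merged token list against a tiny hand-written stopword set. The set and the `remove_stopwords` helper are assumptions for this example; the notebook itself relies on spaCy's much larger built-in list.

```python
# tiny illustrative stopword set, not spaCy's list
STOP_WORDS = {'a', 'the', 'and', 'be', 'to', 'of'}


def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stop_words]


review = ['the', 'scallop', 'be', 'fantastic', 'and',
          'the', 'pork_tenderloin', 'be', 'great']

filtered = remove_stopwords(review)
print(filtered)
```

Filtering after the phrase models have been applied means merged tokens are matched as a whole. A phrase such as book_through_opentable would lose its middle word if a word like "through" were treated as a stopword and removed first.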

In [34]:
trigram_reviews_filepath = os.path.join('yelpData',
                                        'trigram_transformed_reviews_all.txt')

In [35]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 0 == 1:

    with open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:
        
        for parsed_review in nlp.pipe(line_review(restaurant_review_path),
                                      batch_size=10000, n_threads=4):
            
            # lemmatize the text, removing punctuation and whitespace
            unigram_review = [token.lemma_ for token in parsed_review
                              if not punct_space(token)]
            
            # apply the first-order and second-order phrase models
            bigram_review = bigram_model[unigram_review]
            trigram_review = trigram_model[bigram_review]
            
            # remove any remaining stopwords
            trigram_review = [term for term in trigram_review
                              if term not in nlp.Defaults.stop_words]
            
            # write the transformed review as a line in the new file
            trigram_review = u' '.join(trigram_review)
            f.write(trigram_review + '\n')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 6.44 µs


In [37]:
print(u'Original:' + u'\n')

for review in it.islice(line_review(restaurant_review_path), 150, 151):
    print(review)

print(u'----' + u'\n')
print(u'Transformed:' + u'\n')

with open(trigram_reviews_filepath, encoding='utf_8') as f:
    for review in it.islice(f, 150, 151):
        print(review)

Original:

On a stormy, humid night in Leslieville, my partner and I stumbled upon this welcoming looking restaurant with the rooster out front, and it turned out quite well.

Though the category on Yelp is listed as French, I would say the food is much closer to Italian if you had to pick a cuisine.

Right off the hop the service was questionable. Benefit of the doubt moment: it was after 8 pm on a Sunday -- which in Toronto means you're lucky if anything half-decent is open at all -- and the place seemed  understaffed. 

Still, having to stand awkwardly for 10-15 min in the non-foyer, while  waiters whisked by us was not too pleasant.

And once we had a table, service continued to be spotty. Our waiter was friendly and self-effacing, calling us his "little neglected table." This was not inaccurate as even getting our initially requested waters (it was sweaty out!) took nearly 10 minutes, and later he left us with dessert menus for 20 minutes before coming back to see what we wanted, 

You can see that most of the grammatical structure has been scrubbed from the text — capitalization, articles/conjunctions, punctuation, spacing, etc. However, much of the general semantic meaning is still present.  The review text is now ready for higher-level modeling.

## Topic Modeling with Latent Dirichlet Allocation (LDA)

Topic modeling is a family of techniques that can be used to describe and summarize the documents in a corpus according to a set of latent "topics".

In many conventional NLP applications, documents are represented as a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a vector of token counts. There are two layers in this model (documents and tokens), and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. This approach has a number of disadvantages:

* Document vectors tend to be large (one dimension for each token $\Rightarrow$ lots of dimensions)
* They also tend to be very sparse. Any given document only contains a small fraction of all tokens in the vocabulary, so most values in the document's token vector are 0.
* The dimensions are fully independent of each other: there's no sense of connection between related tokens, such as knife and fork.

LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of topics, and the topics are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow Dirichlet probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of the tokens.
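The three-layer picture can be made concrete with a toy calculation. In the sketch below, one document is a mixture of two topics and each topic is a mixture of three tokens. The probability of a token appearing in the document is then the topic-weighted sum of its per-topic probabilities. Every topic name and number here is made up for illustration, and `token_probabilities` is a hypothetical helper, not part of gensim.

```python
# made-up (document, topic) mixture for a single review
doc_topics = {'service': 0.7, 'dessert': 0.3}

# made-up (topic, token) mixtures; each row sums to 1
topic_tokens = {
    'service': {'waiter': 0.5, 'table': 0.4, 'cake': 0.1},
    'dessert': {'waiter': 0.1, 'table': 0.1, 'cake': 0.8},
}


def token_probabilities(doc_topics, topic_tokens):
    """Combine the two mixtures into one token distribution for the document."""
    probs = {}
    for topic, topic_weight in doc_topics.items():
        for token, token_weight in topic_tokens[topic].items():
            probs[token] = probs.get(token, 0.0) + topic_weight * token_weight
    return probs


probs = token_probabilities(doc_topics, topic_tokens)

for token, prob in sorted(probs.items()):
    print(token, round(prob, 2))
```

The document is described by just two numbers (its topic weights) instead of one count per vocabulary token, which is the dimensionality reduction the third layer buys.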

LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.

We'll again turn to gensim to assist with data preparation and modeling. In particular, gensim offers a high-performance parallelized implementation of LDA with its LdaMulticore class.

In [38]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import warnings
import pickle

The first step to creating an LDA model is to learn the full vocabulary of the corpus to be modeled. We'll use gensim's Dictionary class for this.

In [39]:
trigram_dictionary_filepath = os.path.join('yelpData',
                                           'trigram_dict_all.dict')

In [40]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to learn the dictionary yourself.
if 0 == 1:

    trigram_reviews = LineSentence(trigram_reviews_filepath)

    # learn the dictionary by iterating over all of the reviews
    trigram_dictionary = Dictionary(trigram_reviews)
    
    # filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify)
    trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
    trigram_dictionary.compactify()

    trigram_dictionary.save(trigram_dictionary_filepath)
    
# load the finished dictionary from disk
trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)

CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 49 ms
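
For intuition about what the document-frequency filtering step does, here is a simplified pure-Python sketch. The function name and the toy documents are hypothetical, and gensim's `filter_extremes` has further behavior (such as capping the vocabulary size) that is not modeled here.

```python
from collections import Counter

def filter_by_document_frequency(documents, no_below, no_above):
    """
    Keep tokens that appear in at least `no_below` documents and in
    no more than `no_above` (a fraction) of all documents.
    """
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))   # count each token once per document

    max_docs = no_above * len(documents)
    return sorted(token for token, df in doc_freq.items()
                  if no_below <= df <= max_docs)

docs = [["good", "pizza", "crust"],
        ["good", "pizza", "sauce"],
        ["good", "taco", "salsa"],
        ["good", "taco", "pizza"]]

kept = filter_by_document_frequency(docs, no_below=2, no_above=0.75)
print(kept)
```

Here "good" is dropped because it appears in every document, and "crust", "sauce" and "salsa" are dropped because each appears only once.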


Like many NLP techniques, LDA uses a simplifying assumption known as the bag-of-words model. In the bag-of-words model, a document is represented by the counts of distinct terms that occur within it. Additional information, such as word order, is discarded.

We can now use the gensim Dictionary we learned to generate a bag-of-words representation for each review. The trigram_bow_generator function implements this. We'll save the resulting bag-of-words reviews as a matrix.

In the following code, "bag-of-words" is abbreviated as bow.
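
Before looking at the gensim version, a minimal sketch of the bag-of-words idea may help. It assumes a toy vocabulary with ids assigned in sorted order (gensim assigns ids differently), and the helper names are hypothetical.

```python
from collections import Counter

def build_token2id(documents):
    """Assign an integer id to each distinct token, in sorted order."""
    vocabulary = sorted({token for tokens in documents for token in tokens})
    return {token: idx for idx, token in enumerate(vocabulary)}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

training_docs = [["good", "pizza", "good", "crust"],
                 ["taco", "salsa"]]
token2id = build_token2id(training_docs)

# word order is discarded; only the counts of known tokens survive
bow = to_bow(["pizza", "good", "good", "sushi"], token2id)
print(token2id)
print(bow)
```

Only non-zero counts are stored, which is why this sparse form is far more compact than a full vector with one slot per vocabulary term.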

In [41]:
trigram_bow_filepath = os.path.join('yelpData',
                                    'trigram_bow_corpus_all.mm')

In [42]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)

In [43]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to build the bag-of-words corpus yourself.
if 0 == 1:

    # generate bag-of-words representations for
    # all reviews and save them as a matrix
    MmCorpus.serialize(trigram_bow_filepath,
                       trigram_bow_generator(trigram_reviews_filepath))
    
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)

CPU times: user 292 ms, sys: 8 ms, total: 300 ms
Wall time: 350 ms


With the bag-of-words corpus, we're finally ready to learn our topic model from the reviews. We simply need to pass the bag-of-words matrix and Dictionary from our previous steps to LdaMulticore as inputs, along with the number of topics the model should learn. 50 topics are chosen here.

In [44]:
lda_model_filepath = os.path.join('yelpData', 'lda_model_all')

In [45]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to train the LDA model yourself.
if 0 == 1:

    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(trigram_bow_corpus,
                           num_topics=50,
                           id2word=trigram_dictionary,
                           workers=3)
    
    lda.save(lda_model_filepath)
    
# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)

CPU times: user 264 ms, sys: 44 ms, total: 308 ms
Wall time: 470 ms


Our topic model is now trained and ready to use! Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.

In [46]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')

    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

In [47]:
explore_topic(topic_number=0)

term                 frequency

table                0.033
come                 0.032
order                0.029
wait                 0.029
ask                  0.021
server               0.020
time                 0.020
service              0.018
seat                 0.017
drink                0.013
waitress             0.013
sit                  0.012
restaurant           0.011
tell                 0.010
waiter               0.010
bring                0.010
arrive               0.010
check                0.009
leave                0.009
walk                 0.008
want                 0.008
water                0.007
hostess              0.007
busy                 0.007
friend               0.006


The first topic has strong associations with words like table, time, service and waitress, as well as a handful of more general words. You might call this the service topic!

It's possible to go through and inspect each topic in the same way, and try to assign a human-interpretable label that captures the essence of each one. I've given it a shot for all 50 topics below.

In [48]:
topic_names = {0: u'service',
               1: u'good service',
               2: u'mexican',
               3: u'arizona',
               4: u'italian',
               5: u'facilities',
               6: u'quality',
               7: u'cheese and grill',
               8: u'delivery and pick up',
               9: u'food truck',
               10: u'look and feel',
               11: u'cafe',
               12: u'bbq',
               13: u'celebrations',
               14: u'spanish',
               15: u'burger',
               16: u'donut',
               17: u'chinese',
               18: u'american',
               19: u'los angeles',
               20: u'taste',
               21: u'thai',
               22: u'price',
               23: u'new orleans',
               24: u'good review',
               25: u'ambiance',
               26: u'bad review',
               27: u'chicken and wings',
               28: u'pizza',
               29: u'menu',
               30: u'breakfast',
               31: u'las vegas',
               32: u'latin & cajun',
               33: u'bagel and cream',
               34: u'mediterranean',
               35: u'vegan',
               36: u'japanese',
               37: u'sandwich',
               38: u'indian',
               39: u'seafood',
               40: u'dessert',
               41: u'spanish',
               42: u'bar',
               43: u'entertainment',
               44: u'general',
               45: u'appetizers and wine',
               46: u'time',
               47: u'salad',
               48: u'greek',
               49: u'worth'}

In [49]:
topic_names_filepath = os.path.join('yelpData', 'topic_names.pkl')

with open(topic_names_filepath, 'wb') as f:
    pickle.dump(topic_names, f)

You can see that, along with mexican, there are a variety of topics related to different styles of food, such as thai, italian, japanese, pizza, and so on. In addition, there are topics that are more related to the overall restaurant experience, like ambiance, good service, time, and price.

Beyond these two categories, there are still some topics that are difficult to apply a meaningful human interpretation to.

Manually reviewing the top terms for each topic is a helpful exercise, but to get a deeper understanding of the topics and how they relate to each other, we need to visualize the data — preferably in an interactive format. Fortunately, we have the fantastic pyLDAvis library to help with that!

pyLDAvis includes a one-line function to take topic models created with gensim and prepare their data for visualization.

In [50]:
LDAvis_data_filepath = os.path.join('yelpData', 'ldavis_prepared')

In [51]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 0 == 1:

    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus,
                                              trigram_dictionary)

    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
        
# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 52.5 ms


In [52]:
pyLDAvis.display(LDAvis_prepared)

### Wait, what am I looking at again?

There are a lot of moving parts in the visualization. Here's a brief summary:

* On the left, there is a plot of the "distance" between all of the topics (labeled as the Intertopic Distance Map).
  * The plot is rendered in two dimensions according to a multidimensional scaling (MDS) algorithm. Topics that are generally similar should appear close together on the plot, while dissimilar topics should appear far apart.
  * The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.
  * An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.
* On the right, there is a bar chart showing top terms.
  * When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
  * When a particular topic is selected, the bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter $\lambda$, which can be adjusted with a slider above the bar chart.
    * Setting the $\lambda$ parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.
    * Setting $\lambda$ close to 0.0 will rank the terms solely according to their "distinctiveness" or "exclusivity" within the topic (that is, terms that occur only in this topic, and do not occur in other topics).
    * Setting $\lambda$ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.
  * Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.
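
The relevance ranking described above is commonly written as $\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$, where $p(w \mid t)$ is a term's probability within the topic and $p(w)$ is its overall probability in the corpus. The sketch below applies that formula to a few made-up probabilities to show how the ranking shifts with $\lambda$; the term names and numbers are hypothetical.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))"""
    return (lam * math.log(p_term_given_topic)
            + (1 - lam) * math.log(p_term_given_topic / p_term))

# hypothetical probabilities: term -> (p(term | topic), p(term) in corpus)
toy_terms = {"good":    (0.050, 0.040),
             "salsa":   (0.020, 0.002),
             "burrito": (0.030, 0.004)}

def rank_terms(terms, lam):
    """Sort terms from most to least relevant for the given lambda."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)

by_probability = rank_terms(toy_terms, lam=1.0)   # within-topic probability only
by_exclusivity = rank_terms(toy_terms, lam=0.0)   # ratio to corpus probability only

print(by_probability)
print(by_exclusivity)
```

A generic word like "good" tops the list at $\lambda = 1.0$ but falls to the bottom at $\lambda = 0.0$, where terms concentrated in this one topic rise instead.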

A more detailed explanation of the pyLDAvis visualization can be found here. Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's LdaMulticore object and pyLDAvis' visualization, you have to dig through the terms manually.

### Analyzing our LDA model
The interactive visualization pyLDAvis produces is helpful for both:

1. Better understanding and interpreting individual topics, and
2. Better understanding the relationships between the topics.

For (1), you can manually select each topic to view its most frequent and/or "relevant" terms, using different values of the $\lambda$ parameter. This can help when you're trying to assign a human-interpretable name or "meaning" to each topic.

For (2), exploring the Intertopic Distance Map can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.

### Describing text with LDA
Beyond data exploration, one of the key uses for an LDA model is providing a compact, quantitative description of natural language text. Once an LDA model has been trained, it can be used to represent free text as a mixture of the topics the model learned from the original corpus. This mixture can be interpreted as a probability distribution across the topics, so the LDA representation of a paragraph of text might look like 50% Topic A, 20% Topic B, 20% Topic C, and 10% Topic D.
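
As a small sketch of what "compact, quantitative description" can mean in practice, the snippet below expands a list of (topic id, probability) pairs into a fixed-length vector. The six-topic model and the probabilities are hypothetical and simply mirror the 50/20/20/10 example above.

```python
def to_topic_vector(sparse_topics, num_topics):
    """Expand (topic_id, probability) pairs into a fixed-length list."""
    vector = [0.0] * num_topics
    for topic_id, probability in sparse_topics:
        vector[topic_id] = probability
    return vector

# hypothetical output for one paragraph from a 6-topic model
sparse = [(0, 0.5), (2, 0.2), (3, 0.2), (5, 0.1)]
dense = to_topic_vector(sparse, num_topics=6)
print(dense)
```

A vector like this has one dimension per topic rather than one per vocabulary term, so it is far smaller than a raw token-count vector.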

To use an LDA model to generate a vector representation of new text, you'll need to apply any text preprocessing steps you used on the model's training corpus to the new text, too. For our model, the preprocessing steps we used include:
* Using spaCy to remove punctuation and lemmatize the text
* Applying our first-order phrase model to join word pairs
* Applying our second-order phrase model to join longer phrases
* Removing stopwords
* Creating a bag-of-words representation

Once you've applied these preprocessing steps to the new text, it's ready to pass directly to the model to create an LDA representation. The lda_description(...) function will perform all these steps for us, including printing the resulting topical description of the input text.

In [53]:
def get_sample_review(review_number):
    """
    retrieve a particular review index
    from the reviews file and return it
    """
    
    return list(it.islice(line_review(restaurant_review_path),
                          review_number, review_number+1))[0]

In [54]:
def lda_description(review_text, min_topic_freq=0.05):
    """
    accept the original text of a review and (1) parse it with spaCy,
    (2) apply text pre-processing steps, (3) create a bag-of-words
    representation, (4) create an LDA representation, and
    (5) print a sorted list of the top topics in the LDA representation
    """
    
    # parse the review text with spaCy
    parsed_review = nlp(review_text)
    
    # lemmatize the text and remove punctuation and whitespace
    unigram_review = [token.lemma_ for token in parsed_review
                      if not punct_space(token)]
    
    # apply the first-order and second-order phrase models
    bigram_review = bigram_model[unigram_review]
    trigram_review = trigram_model[bigram_review]
    
    # remove any remaining stopwords
    trigram_review = [term for term in trigram_review
                      if term not in spacy.en.English.Defaults.stop_words]
    
    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(trigram_review)
    
    # create an LDA representation
    review_lda = lda[review_bow]
    
    # sort with the most highly related topics first
    review_lda = sorted(review_lda, key=lambda x: -x[1])
    
    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        print('{:25} {}'.format(topic_names[topic_number],
                                round(freq, 3)))

In [55]:
sample_review = get_sample_review(50)
print(sample_review)

Pretty good food a chain. You do get a lot of food for the price. However, every time I have ordered online there has been a mistake. Sometimes they tell me they never even received my order when I received the confirmation email. Also, if you in during a busy time and have a teenage girl as a cashier, you're chances of having a pleasant experience with her are slim to none. This has happened to me multiple times with multiple different cashiers. I have spoken to the manager and he is always very apologetic, but I guess you just can't get stressed teenage girls to be nice to customers. I will continue to go here as the food is good for the price and they are relatively quick with preparation. Nice options if you're veggies as well.



In [56]:
lda_description(sample_review)

delivery and pick up      0.429
quality                   0.185
worth                     0.123
italian                   0.072
good review               0.066




In [57]:
sample_review = get_sample_review(1)
print(sample_review)

For being fairly "fast" food.. Pei Wei (pronounced pay way I confirmed haha) is pretty darn good. we got a few things to share. I had the Asian chicken salad and was impressed! There was a decent amount of chicken. Some more veggies would be nice, but overall pretty good. The steak teriyaki was great as well as the fried rice. Over all good was good! Nice, clean, and reasonable.



In [58]:
lda_description(sample_review)

worth                     0.231
salad                     0.178
ambiance                  0.156
chicken and wings         0.139
thai                      0.103
latin & cajun             0.079




## Word Vector Embedding with Word2Vec
The goal of word vector embedding models, or word vector models for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vectors it learns about each term should encode some information about the meaning or concept the term represents, and the relationship between it and other terms in the vocabulary. Word vector models are also fully unsupervised — they learn all of these meanings and relationships solely by analyzing the text of the corpus, without any advance knowledge provided.

Perhaps the best-known word vector model is word2vec, originally proposed in 2013. The general idea of word2vec is, for a given focus word, to use the context of the word — i.e., the other words immediately before and after it — to provide hints about what the focus word might mean. To do this, word2vec uses a sliding window technique, where it considers snippets of text only a few tokens long at a time.
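
The sliding window idea can be sketched in a few lines of plain Python. This is a simplified illustration with a hypothetical helper name; real word2vec implementations add refinements (such as frequent-word subsampling) that are not shown.

```python
def context_pairs(tokens, window=2):
    """Yield (focus, context) pairs for every token within `window` positions."""
    for i, focus in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield focus, tokens[j]

sentence = ["the", "happy_hour", "menu", "be", "great"]
pairs = list(context_pairs(sentence, window=1))
print(pairs)
```

Each pair is one small training signal: the model sees which words tend to surround a given focus word.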

At the start of the learning process, the model initializes random vectors for all terms in the corpus vocabulary. The model then slides the window across every snippet of text in the corpus, with each word taking turns as the focus word. Each time the model considers a new snippet, it tries to learn some information about the focus word based on the surrounding context, and it "nudges" the words' vector representations accordingly. One complete pass sliding the window across all of the corpus text is known as a training epoch. It's common to train a word2vec model for multiple passes/epochs over the corpus. Over time, the model rearranges the terms' vector representations such that terms that frequently appear in similar contexts have vector representations that are close to each other in vector space.

For a deeper dive into word2vec's machine learning process, <a href="https://arxiv.org/pdf/1411.2738v4.pdf">see here</a>.

Word2vec has a number of user-defined hyperparameters, including:
* The dimensionality of the vectors. Typical choices include a few dozen to several hundred.
* The width of the sliding window, in tokens. Five is a common default choice, but narrower and wider windows are possible.
* The number of training epochs.

For using word2vec in Python, gensim comes to the rescue again! It offers a highly-optimized, parallelized implementation of the word2vec algorithm with its Word2Vec class.

In [59]:
from gensim.models import Word2Vec

trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join('yelpData', 'word2vec_model_all')

We'll train our word2vec model using the normalized sentences with our phrase models applied. We'll use 100-dimensional vectors, and set up our training process to run for twelve epochs.

In [60]:
%%time

import sys

# this is a bit time consuming - make the if statement True
# if you want to train the word2vec model yourself.
if 0 == 1:

    # initiate the model and perform the first epoch of training;
    # iter=1 makes each training call run exactly one pass over the
    # corpus (gensim's default would run several passes per call)
    food2vec = Word2Vec(trigram_sentences, size=100, window=5,
                        min_count=20, sg=1, workers=4, iter=1)
    print('First epoch completed')
    food2vec.save(word2vec_filepath)

    # perform another 11 epochs of training
    for i in range(1, 12):
        sys.stderr.write('\rOn {}'.format(i))
        food2vec.train(trigram_sentences,
                       total_examples=food2vec.corpus_count,
                       epochs=food2vec.iter)
        food2vec.save(word2vec_filepath)
        
# load the finished model from disk
food2vec = Word2Vec.load(word2vec_filepath)
food2vec.init_sims()

print(u'{} training epochs so far.'.format(food2vec.train_count))

12 training epochs so far.
CPU times: user 3.44 s, sys: 72 ms, total: 3.52 s
Wall time: 3.75 s


On my four-core machine, each epoch over all the text in the ~3 million Yelp reviews takes about 70 minutes.

In [61]:
print(u'{:,} terms in the food2vec vocabulary.'.format(len(food2vec.wv.vocab)))

91,036 terms in the food2vec vocabulary.


Let's take a peek at the word vectors our model has learned. We'll create a pandas DataFrame with the terms as the row labels, and the 100 dimensions of the word vector model as the columns.

In [62]:
# build a list of the terms, integer indices,
# and term counts from the food2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count)
                 for term, voc in food2vec.wv.vocab.items()]

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda x: -x[2])

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the food2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(food2vec.wv.syn0norm[term_indices, :],
                            index=ordered_terms)

word_vectors

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
-PRON-,-0.085470,0.014861,-0.123570,0.086499,-0.000353,-0.007540,-0.133669,-0.031146,0.070704,0.042372,...,-0.014009,-0.125808,-0.025142,-0.036401,1.135166e-01,-0.072344,0.086524,0.024117,-0.164374,0.011765
be,0.053216,-0.127242,-0.076624,-0.002680,0.020520,0.046150,-0.236219,0.129629,-0.094829,0.087052,...,-0.061288,-0.083323,-0.045794,0.004355,3.566204e-02,0.000128,-0.046617,-0.008546,-0.057985,0.083923
the,0.037929,0.004000,-0.225210,0.059887,0.038845,-0.119762,-0.124331,0.145222,0.079240,-0.012406,...,0.076552,-0.047843,-0.110813,-0.002587,1.084575e-01,0.024091,-0.121003,-0.119721,-0.090697,-0.066081
and,0.012674,0.001887,-0.055653,0.062659,0.103213,-0.044975,-0.103242,0.045381,0.108051,-0.069832,...,-0.035530,-0.004807,-0.001784,0.096487,1.159265e-01,0.006008,0.040725,-0.138051,-0.043151,-0.044955
a,0.091581,-0.032514,-0.074431,0.156406,0.017965,-0.017718,-0.060090,0.094293,0.015793,0.107813,...,-0.064855,0.001498,-0.057765,-0.108280,1.099965e-01,0.043991,0.075105,-0.089240,-0.045729,0.055314
to,0.056062,0.000651,-0.099027,0.014328,0.054837,-0.147337,-0.143888,0.031364,-0.009085,0.075165,...,-0.061655,-0.131655,-0.080241,-0.033305,1.708396e-01,0.013830,-0.090710,0.004262,0.051951,-0.055905
have,0.003593,-0.056869,-0.069458,0.072651,0.132269,0.052339,-0.073179,0.034690,-0.107481,-0.057593,...,-0.070799,-0.064124,0.007917,-0.034900,1.318552e-01,0.082462,0.080481,-0.012780,-0.081089,0.228976
of,0.089374,-0.052012,-0.172409,-0.033182,0.094349,-0.078493,-0.029926,-0.021875,-0.033056,0.032901,...,0.016428,0.073649,-0.219952,-0.048696,-1.436407e-02,-0.017604,0.003971,-0.076952,-0.072511,0.049584
not,-0.017735,-0.014053,-0.079273,0.039611,-0.028743,-0.043794,-0.146965,-0.062202,-0.044409,0.061669,...,-0.060804,-0.144764,-0.023305,-0.055358,1.175644e-01,-0.165212,-0.108636,0.091019,-0.096056,-0.102158
for,0.099926,0.064141,-0.037362,0.150152,0.054814,-0.035868,-0.009534,0.025909,-0.015882,0.263066,...,-0.039587,-0.092900,-0.140028,0.041006,1.667080e-01,-0.005095,0.005415,0.025702,-0.088043,0.018006


Holy wall of numbers! This DataFrame has 91,036 rows (one for each term in the vocabulary) and 100 columns. Our model has learned a quantitative vector representation for each term, as expected.

Put another way, our model has "embedded" the terms into a 100-dimensional vector space.

### So... what can we do with all these numbers?
The first thing we can use them for is to simply look up related words and phrases for a given term of interest.

In [63]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in food2vec.wv.most_similar(positive=[token], topn=topn):

        print(u'{:20} {}'.format(word, round(similarity, 3)))
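
For intuition about what a similarity lookup involves, here is a pure-Python sketch that ranks terms by cosine similarity. The three-dimensional vectors are invented for illustration; the real model uses 100 dimensions and a highly optimized implementation.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical 3-dimensional vectors
toy_vectors = {"happy_hour": [0.9, 0.1, 0.0],
               "hh":         [0.8, 0.2, 0.1],
               "pizza":      [0.0, 0.9, 0.3],
               "pasta":      [0.1, 0.8, 0.4]}

def related_terms(token, vectors, topn=2):
    """Return the topn (term, similarity) pairs closest to token."""
    scores = [(other, cosine(vectors[token], vec))
              for other, vec in vectors.items() if other != token]
    return sorted(scores, key=lambda pair: -pair[1])[:topn]

print(related_terms("happy_hour", toy_vectors))
```

Terms whose vectors point in nearly the same direction score close to 1.0, regardless of vector length.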

### What is happy hour?

In [64]:
get_related_terms(u'happy_hour')

hh                   0.913
reverse_happy_hour   0.88
happy_hr             0.855
happy_hour-          0.831
4_7pm                0.789
4_6pm                0.788
happy_hour_3_6pm     0.782
3pm-6pm              0.776
5_7pm                0.774
happy_hour_m_f       0.754


The model has noticed several alternate spellings for happy hour, such as hh and happy hr, and assesses them as highly related. If you were looking for reviews about happy hour, such alternate spellings would be very helpful to know.

Taking a deeper look — the model has turned up phrases like 3-6pm, 4-7pm, and mon-fri, too. This is especially interesting, because the model has no advance knowledge at all about what happy hour is, and what time of day it should be. But simply by scanning through restaurant reviews, the model has discovered that the concept of happy hour has something very important to do with that block of time around 3-7pm on weekdays.

### What things are like IHOP?

In [65]:
get_related_terms(u'ihop')

denny_'s             0.928
dennys               0.91
bob_evans            0.876
village_inn          0.855
perkins              0.848
cracker_barrel       0.826
egg_works            0.785
blueberry_hill       0.785
oph                  0.774
mimi_'s              0.762


The model has learned to identify similar kinds of restaurants.

### What is similar to McDonald's?

In [66]:
get_related_terms(u'mcdonalds')

mcdonald_'s          0.974
mcd_'s               0.94
mcdonald             0.939
mcd                  0.927
wendy_'s             0.898
mcds                 0.885
bk                   0.881
mc_donald_'s         0.879
wendys               0.876
carl_'s_jr.          0.872


The model has learned that fast food restaurants are similar to each other! In addition, the model has found that alternate spellings for the same entities are probably related, such as mcdonalds, mcdonald's and mcd's.

### Let's make pasta tonight. Which style do you want?

In [67]:
get_related_terms(u'pasta', topn=20)

rigatoni             0.858
spaghetti            0.848
fettuccine           0.845
penne                0.844
bolognese            0.837
lasagna              0.827
manicotti            0.819
penne_pasta          0.818
fettucini            0.817
fettuccini           0.816
ziti                 0.808
penne_vodka          0.805
tortellini           0.805
linguine             0.804
spaghettini          0.802
linguini             0.802
angel_hair_pasta     0.801
alfredo              0.8
ravioli              0.795
gnocchi              0.794


## Word algebra!
No self-respecting word2vec demo would be complete without a healthy dose of word algebra, also known as analogy completion.

The core idea is that once words are represented as numerical vectors, you can do math with them. The mathematical procedure goes like this:
1. Provide a set of words or phrases that you'd like to add or subtract.
2. Look up the vectors that represent those terms in the word vector model.
3. Add and subtract those vectors to produce a new, combined vector.
4. Look up the most similar vector(s) to this new, combined vector via cosine similarity.
5. Return the word(s) associated with the similar vector(s).

But more generally, you can think of the vectors that represent each word as encoding some information about the meaning or concepts of the word. What happens when you ask the model to combine the meaning and concepts of words in new ways? Let's see.

In [68]:
def word_algebra(add=None, subtract=None, topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    # fall back to empty lists instead of using mutable default arguments
    answers = food2vec.wv.most_similar(positive=add or [],
                                       negative=subtract or [],
                                       topn=topn)
    
    for term, similarity in answers:
        print(term)
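
The five-step procedure listed earlier can also be sketched without gensim. The version below is a simplified stand-in: the two-dimensional vectors are chosen by hand, the helper names are hypothetical, and gensim's own implementation may differ in its details.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_word_algebra(vectors, add, subtract=(), topn=1):
    """Add and subtract unit vectors, then rank other words by cosine similarity."""
    dims = len(next(iter(vectors.values())))
    combined = [0.0] * dims
    for word in add:
        combined = [c + x for c, x in zip(combined, unit(vectors[word]))]
    for word in subtract:
        combined = [c - x for c, x in zip(combined, unit(vectors[word]))]
    combined = unit(combined)

    # the input words themselves are excluded from the candidates
    used = set(add) | set(subtract)
    scores = [(word, sum(c * x for c, x in zip(combined, unit(vec))))
              for word, vec in vectors.items() if word not in used]
    return sorted(scores, key=lambda pair: -pair[1])[:topn]

# hypothetical 2-dimensional vectors chosen by hand
toy_vectors = {"lunch": [1.0, 0.1], "dinner": [1.0, 1.0],
               "day": [0.5, 0.1], "night": [0.5, 1.0],
               "menu": [0.6, 0.0]}

result = toy_word_algebra(toy_vectors, add=["lunch", "night"], subtract=["day"])
print(result)
```

With these hand-picked vectors, the closest remaining word to lunch - day + night is dinner.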

### breakfast + lunch = ?

In [69]:
word_algebra(add=[u'breakfast', u'lunch'])

brunch


OK, so the model knows that brunch is a combination of breakfast and lunch. What else?
### lunch - day + night = ?

In [70]:
word_algebra(add=[u'lunch', u'night'], subtract=[u'day'])

dinner


Now we're getting a bit more nuanced. The model has discovered that:
* Both lunch and dinner are meals
* The main difference between them is time of day
* Day and night are times of day
* Lunch is associated with day, and dinner is associated with night

What else?

### taco - mexican + chinese = ?

In [71]:
word_algebra(add=[u'taco', u'chinese'], subtract=[u'mexican'])

dumpling


Here's an entirely new and different type of relationship that the model has learned.
* It knows that tacos are a characteristic example of Mexican food
* It knows that Mexican and Chinese are both styles of food
* If you subtract Mexican from taco, you're left with something like the concept of a "characteristic type of food", which is represented as a new vector
* If you add that new "characteristic type of food" vector to Chinese, you get dumpling.

What else?

### bun - american + mexican = ?

In [72]:
word_algebra(add=[u'bun', u'mexican'], subtract=[u'american'])

corn_tortilla


The model knows that both buns and tortillas are the doughy thing that goes on the outside of your real food, and that the primary difference between them is the style of food they're associated with.

What else?
### filet mignon - beef + seafood = ?

In [73]:
word_algebra(add=[u'filet_mignon', u'seafood'], subtract=[u'beef'])

alaskan_king_crab_leg


The model has learned a concept of delicacy. If you take filet mignon and subtract beef from it, you're left with a vector that roughly corresponds to delicacy. If you add the delicacy vector to seafood, you get alaskan king crab leg.

What else?
### coffee - drink + snack = ?

In [74]:
word_algebra(add=[u'coffee', u'snack'], subtract=[u'drink'])

afternoon_snack


The model knows that if you're on your coffee break, but instead of drinking something, you're eating something... that thing is most likely an afternoon snack.

What else?
### Wendy's + fine dining = ?

In [75]:
word_algebra(add=[u'wendys', u'fine_dining'])

denny_'s


It makes sense, though. The model has learned that both Wendy's and Denny's are large chains, and that both serve fast, casual, American-style food. But Denny's has some elements that are slightly more upscale, such as printed menus and table service. Fine dining, indeed.

What if we keep going?

### Denny's + fine dining = ?

In [76]:
word_algebra(add=[u"denny_'s", u'fine_dining'])

dennys


The top result, dennys, is just an alternate spelling of denny_'s, so the model is effectively returning the same restaurant again.

### Applebee's + italian = ?

In [77]:
word_algebra(add=[u"applebee_'s", u'italian'])

olive_garden


### Applebee's + pancakes = ?

In [78]:
word_algebra(add=[u"applebee_'s", u'pancakes'])

ihop


### Applebee's + pizza = ?

In [79]:
word_algebra(add=[u"applebee_'s", u'pizza'])

pizza_hut


## Word Vector Visualization with t-SNE
t-Distributed Stochastic Neighbor Embedding, or t-SNE for short, is a dimensionality reduction technique to assist with visualizing high-dimensional datasets. It attempts to map high-dimensional data onto a low-dimensional (two- or three-dimensional) representation such that the relative distances between points are preserved as closely as possible in both high-dimensional and low-dimensional space.

scikit-learn provides a convenient implementation of the t-SNE algorithm with its TSNE class.

In [80]:
from sklearn.manifold import TSNE

Our input for t-SNE will be the DataFrame of word vectors we created before. Let's first:
1. Drop stopwords — it's probably not too interesting to visualize the, of, or, and so on
2. Take only the 5,000 most frequent terms in the vocabulary — no need to visualize all ~90,000 terms right now.

In [81]:
tsne_input = word_vectors.drop(spacy.en.English.Defaults.stop_words, errors=u'ignore')
tsne_input = tsne_input.head(5000)

In [82]:
tsne_input.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
-PRON-,-0.08547,0.014861,-0.12357,0.086499,-0.000353,-0.00754,-0.133669,-0.031146,0.070704,0.042372,...,-0.014009,-0.125808,-0.025142,-0.036401,0.1135166,-0.072344,0.086524,0.024117,-0.164374,0.011765
good,0.184775,0.027287,-0.087182,-0.03339,-0.132884,-0.082955,-0.067438,0.053086,-0.003937,0.011418,...,-0.039735,-0.095171,0.092665,0.040655,0.1938257,0.024601,-0.03048,-0.112038,-0.195792,0.121953
food,-0.045964,-0.021983,-0.108085,0.194386,-0.095299,-0.121416,-0.162542,0.09624,-0.101459,-0.081172,...,-0.051936,-0.199135,-0.144332,0.049387,-0.07141093,0.032712,0.117387,0.14532,-0.044421,-0.044294
place,0.012342,-0.04615,0.029041,-0.042761,-0.041367,0.002951,-0.086748,0.024527,-0.034502,0.14937,...,-0.21323,-0.10217,-0.200018,0.04291,7.016725e-07,-0.032343,0.147936,0.125483,-0.052058,0.198497
order,-0.044611,0.129437,-0.178152,0.087306,0.055869,-0.071726,-0.063208,-0.042146,0.056832,-0.014008,...,0.093484,-0.069846,0.077853,-0.075931,0.1208487,0.162153,-0.035212,0.082544,-0.180627,-0.01623


In [83]:
tsne_filepath = os.path.join('yelpData',
                             u'tsne_model')

tsne_vectors_filepath = os.path.join('yelpData',
                                     u'tsne_vectors.npy')

In [84]:
%%time

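import pickle
import numpy as np

# fitting t-SNE is time consuming, so it is skipped by default;
# make the condition True to fit and save the model yourself.
if 0 == 1:
    
    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)
    
    # persist the fitted model and the 2-D coordinates to disk
    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)

    # pd.np was removed in recent pandas versions; use numpy directly
    np.save(tsne_vectors_filepath, tsne_vectors)
    
# load the previously saved model and coordinates from disk
with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)
    
tsne_vectors = np.load(tsne_vectors_filepath)

# wrap the coordinates in a DataFrame indexed by term
tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=[u'x_coord', u'y_coord'])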

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 44.4 ms
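
The cell above skips the expensive fit and only loads saved results, which is why it finishes in milliseconds. One possible alternative to a hard-coded guard is to check whether the cached file exists and compute only when it does not. The sketch below shows that pattern with the standard library; the helper `load_or_compute`, the stand-in function `expensive_step`, and the file name `demo_vectors.pkl` are hypothetical and not part of this notebook.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute_fn):
    """Return the pickled result at cache_path if it exists;
    otherwise call compute_fn, save its result, and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)

    result = compute_fn()
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result


# stand-in for an expensive computation such as fitting t-SNE
calls = []

def expensive_step():
    calls.append(1)  # record how many times we really computed
    return [[0.0, 1.0], [2.0, 3.0]]


cache_dir = tempfile.mkdtemp()
demo_path = os.path.join(cache_dir, 'demo_vectors.pkl')

first = load_or_compute(demo_path, expensive_step)   # computes and saves
second = load_or_compute(demo_path, expensive_step)  # loads from disk
print(first == second, len(calls))
```

The trade-off is that a stale cache file is silently reused, so the file has to be deleted by hand whenever the input data or model settings change.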


Now we have a two-dimensional representation of our data! Let's take a look.

In [85]:
tsne_vectors.head()

Unnamed: 0,x_coord,y_coord
-PRON-,-45.807915,-35.178066
good,-12.017088,29.034958
food,-7.119519,8.582788
place,32.16494,-29.702133
order,-12.040922,57.257828


In [86]:
tsne_vectors[u'word'] = tsne_vectors.index

### Plotting with Bokeh

In [87]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()

In [88]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title=u't-SNE Word Embeddings',
                   plot_width = 800,
                   plot_height = 800,
                   tools= (u'pan, wheel_zoom, box_zoom,'
                           u'box_select, reset'),
                   active_scroll=u'wheel_zoom')

# add a hover tool to display words on roll-over
tsne_plot.add_tools( HoverTool(tooltips = u'@word') )

# draw the words as circles on the plot
tsne_plot.circle(u'x_coord', u'y_coord', source=plot_data,
                 color=u'blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color=u'black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);



## Conclusion
Let's round up the major components that we've seen:
1. Text processing with spaCy
2. Automated phrase modeling
3. Topic modeling with LDA $\ \longrightarrow\ $ visualization with pyLDAvis
4. Word vector modeling with word2vec $\ \longrightarrow\ $ visualization with t-SNE

Why use these models?<br>
Dense vector representations of text, such as those produced by LDA and word2vec, can improve performance on a number of common, text-heavy problems like:
* Text classification
* Search
* Recommendations
* Question answering

...and more generally are a powerful way machines can help humans make sense of what's in a giant pile of text. They're also often useful as a pre-processing step for many other downstream machine learning applications.
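
As a rough sketch of the search use case, the example below ranks documents by the cosine similarity between averaged word vectors. Everything in it is assumed for illustration: the two-dimensional `toy_vectors`, the `documents` dictionary, and the helper names are hypothetical and unrelated to the Yelp model trained above.

```python
import math

# hypothetical 2-D "word vectors"; real word2vec vectors have ~100 dims
toy_vectors = {
    'pizza': [0.9, 0.1],
    'pasta': [0.8, 0.2],
    'sushi': [0.1, 0.9],
    'ramen': [0.2, 0.8],
}

# hypothetical documents, already tokenized
documents = {
    'italian_place': ['pizza'],
    'japanese_place': ['sushi'],
}


def average_vector(tokens):
    """Average the vectors of all tokens found in the vocabulary."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        raise ValueError('no known tokens to average')
    return [sum(component) / len(known) for component in zip(*known)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_tokens):
    """Return document names sorted from most to least similar."""
    query = average_vector(query_tokens)
    return sorted(
        documents,
        key=lambda name: cosine(query, average_vector(documents[name])),
        reverse=True)


ranking = search(['ramen'])
print(ranking)
```

The query term never appears in either toy document, yet the document whose vector points in a similar direction is ranked first, which is the kind of match that plain keyword lookup would miss.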