# A Basic Delve into Python NLP
### *~ According to the Majestic Whims of Jay Kaiser ~*

**Note**: The vast majority of the code and methodology found here has been taken directly from [Modern NLP in Python](http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb). This is merely a way for me to practice these topics in my own environment, using the most recent dataset provided by Yelp. I do not claim original creation of anything found here, and all credit belongs to the authors of that Jupyter notebook. I am merely striving to learn these concepts in the manner most conducive to my own research...

With that in mind, let's get started.

## Introduction

Hi. My name is Jay Kaiser, and I am in my final semester of grad school at Indiana University Bloomington, where I am completing my M.S. in Computational Linguistics. Naturally, this has consisted mostly of separate classes in linguistics and in computer science, but occasionally a class appears that cleverly combines the two. Classes like these have slowly drawn me into the world of syntax and syntactic parsing. They have also led me to the wonderful world of data science and data analytics, an interest that only grew after an internship as a data scientist with Kingfisher Systems, Inc. in Washington, D.C. over the summer of 2017.

However, I must continue my studies if I ever hope to succeed. To acclimate myself quickly to data science and its many aspects, I have invested in a range of learning resources. Nothing, though, will teach me better than getting my hands dirty and doing it myself, so here goes...

### Project: Yelp Dataset

I have downloaded on my computer the entirety of the JSON portion of the [Yelp Dataset](https://www.yelp.com/dataset/challenge), which unzipped into six separate JSON files, each of which is outlined below.

| Dataset File | Description |
|-------------:|:------------|
| business.json | information on each business reviewed on Yelp |
| checkin.json | check-ins recorded at each business |
| photos.json | photos found on Yelp with their captions |
| review.json | text reviews of each business |
| tip.json | small quotes of advice for future visitors of the business |
| user.json | Yelp reviewers and information describing them |

However, only *business.json* and *review.json* will be used here.

### Step 1: A tour of the dataset

First, the files need to be loaded into Python.

In [2]:
import os
import codecs

data_directory = os.path.join(r'C:\Users\jayka\Documents', 'Datasets', 'yelp')
business_filepath = os.path.join(data_directory, 'business.json')
review_filepath = os.path.join(data_directory, 'review.json')

with codecs.open(business_filepath, encoding='utf_8') as f:
    first_business_record = f.readline()
    
print(first_business_record)

{"business_id": "YDf95gJZaq05wvo7hTQbbQ", "name": "Richmond Town Square", "neighborhood": "", "address": "691 Richmond Rd", "city": "Richmond Heights", "state": "OH", "postal_code": "44143", "latitude": 41.5417162, "longitude": -81.4931165, "stars": 2.0, "review_count": 17, "is_open": 1, "attributes": {"RestaurantsPriceRange2": 2, "BusinessParking": {"garage": false, "street": false, "validated": false, "lot": true, "valet": false}, "BikeParking": true, "WheelchairAccessible": true}, "categories": ["Shopping", "Shopping Centers"], "hours": {"Monday": "10:00-21:00", "Tuesday": "10:00-21:00", "Friday": "10:00-21:00", "Wednesday": "10:00-21:00", "Thursday": "10:00-21:00", "Sunday": "11:00-18:00", "Saturday": "10:00-21:00"}}



The records in *review.json* follow the same one-object-per-line layout. Of their fields, **text** is the one I will use, although any of the other fields could serve other types of analysis.

Raw JSON like this is not easily navigable, so it will have to be converted into a more usable format.

In [3]:
import json

restaurant_ids = set()

with codecs.open(business_filepath, encoding='utf_8') as f:
    for business_json in f:
        business = json.loads(business_json)
        
        if u'Restaurants' not in business[u'categories']:
            continue
        
        restaurant_ids.add(business[u'business_id'])
    
    restaurant_ids = frozenset(restaurant_ids)
    
    print('{:,}'.format(len(restaurant_ids)), u'restaurants in the dataset.')

51,613 restaurants in the dataset.


There are 51,613 registered restaurants in the dataset, over twice as many as there had been in 2016.
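
The filter above depends on each line of the file being its own JSON object. A minimal sketch of the same idea on an in-memory sample follows; the three records and their IDs are made up for illustration, and `collect_ids` is a hypothetical helper rather than anything from the original notebook.

```python
import io
import json

# Hypothetical sample in the same one-object-per-line layout as business.json.
sample_lines = io.StringIO(
    '{"business_id": "a1", "categories": ["Restaurants", "Pizza"]}\n'
    '{"business_id": "b2", "categories": ["Shopping"]}\n'
    '{"business_id": "c3", "categories": ["Bars", "Restaurants"]}\n'
)

def collect_ids(lines, category):
    """Return the IDs of every record whose category list contains `category`."""
    ids = set()
    for line in lines:
        record = json.loads(line)
        # `or []` guards against a missing or null category list (an assumption
        # about messy data, not something observed in the source).
        if category in (record.get('categories') or []):
            ids.add(record['business_id'])
    return frozenset(ids)

sample_restaurant_ids = collect_ids(sample_lines, 'Restaurants')
print(sorted(sample_restaurant_ids))
```

Reading line by line keeps memory use flat no matter how large the file is, which matters for files the size of the Yelp dump.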

Separately, I'll need to find the reviews for each of the restaurants, based on their ID. This will be placed in a separate file that's easier to parse.

In [4]:
intermediate_directory = os.path.join(r'C:\Users\jayka\Documents\Datasets', 'intermediate')
review_text_filepath = os.path.join(intermediate_directory, 'yelp_review_text.txt')

In [5]:
%%time

review_count = 0

if False: # Once this is run once it does not need to be run again.
    
    with codecs.open(review_text_filepath, 'w', encoding='utf_8') as review_text_file:
        with codecs.open(review_filepath, encoding='utf_8') as review_json_file:
            for review_json in review_json_file:
                review = json.loads(review_json)

                if review[u'business_id'] not in restaurant_ids:
                    continue

                review_text_file.write(review[u'text'].replace('\n', '\\n') + '\n')
                review_count += 1
            
    print(u'Text from {:,} restaurant reviews written to the new text file.'.format(review_count))

else:
    with codecs.open(review_text_filepath, encoding='utf_8') as review_text_file:
        for review_count, line in enumerate(review_text_file):
            pass

    print(u'Text from {:,} restaurant reviews written to the new txt file.'.format(review_count + 1))

Text from 2,929,512 restaurant reviews written to the new txt file.
Wall time: 45.6 s


Building the file takes about 2.5 minutes on my dual-core i5 laptop; the 45.6 s shown above is only the time needed to recount the lines of the already-written file. In total, 2,929,512 restaurant reviews were found, over three times as many as in 2016. I now have the restaurant IDs in memory and the review texts in their own file, and real parsing of the text can begin.
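
Storing one review per line only works because line breaks inside a review are escaped before writing and restored when reading. The sketch below isolates that round trip; the function names and the sample review are hypothetical.

```python
def escape_newlines(text):
    """Replace real line breaks with the two characters backslash + n."""
    return text.replace('\n', '\\n')

def unescape_newlines(line):
    """Undo escape_newlines on a stored line (trailing newline already stripped)."""
    return line.replace('\\n', '\n')

review = 'Great tacos.\nSlow service, though.'
stored = escape_newlines(review)

print(stored)                       # the whole review on one physical line
print(unescape_newlines(stored))    # the original two-line review
```

One caveat of this simple scheme: a review that already contains a literal backslash followed by `n` would come back with a line break in its place. For review text that is rare enough to ignore, but it is worth knowing.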

### Step 2: Introduction to text processing with spaCy

spaCy is a library for performing NLP tasks on the reviews just collected. It is both free to use and very fast, much more so than NLTK. It has been designed not for research, but for commercial and industrial usage. Below, a sample review has been printed. With it, I can get a better idea of what kind of data I'm working with.

In [6]:
import spacy
import pandas as pd
import itertools as it

nlp = spacy.load('en')

with codecs.open(review_text_filepath, encoding='utf_8') as f:
    sample_review = list(it.islice(f, 8, 9))[0]
    sample_review = sample_review.replace('\\n', '\n')

print(sample_review)

The staff here is great and they're nice,  wonderful and quick. People were ranting in raving about pei wei, I had to try it.  Even good yelp reviews.  I'm highly dissatisfied with the flavor of the food. This  should be labeled Asian inspired and not Asian. I've tried a variety of Chinese restaurants, this doesn't taste close to anything I've had at other Asian restaurants. Their Mongolian beef  was 5 pieces of beef and large mushrooms cut into thirds in a thick sauce. You eat the rice to wash off the nasty flavor. My shrimp was thickly coated in an overpowering  sauce as well.  I only ate some of the veggies that take center stage on a meat dish.  The center of my pork egg roll was cold. The hot N sour soup was a much thicker consistency almost like that of a chili instead of being brothy. Worst of all was the price.  This was not worth it to us. Neither me or my husband enjoyed either of  our dishes.  We didn't even eat half of our plates.  We even refused to take it home with us.  

In [7]:
%%time
parsed_review = nlp(sample_review)

Wall time: 24 ms


Parsing this entire review with state-of-the-art NLP technology took only about 24 milliseconds on my computer. Super fast!

For listing all of the sentences in the sample review:

In [8]:
for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')

Sentence 1:
The staff here is great and they're nice,  wonderful and quick.

Sentence 2:
People were ranting in raving about pei wei, I had to try it.  

Sentence 3:
Even good yelp reviews.  

Sentence 4:
I'm highly dissatisfied with the flavor of the food.

Sentence 5:
This  should be labeled Asian inspired and not Asian.

Sentence 6:
I've tried a variety of Chinese restaurants, this doesn't taste close to anything I've had at other Asian restaurants.

Sentence 7:
Their Mongolian beef  was 5 pieces of beef and large mushrooms cut into thirds in a thick sauce.

Sentence 8:
You eat the rice to wash off the nasty flavor.

Sentence 9:
My shrimp was thickly coated in an overpowering  sauce as well.  

Sentence 10:
I only ate some of the veggies that take center stage on a meat dish.  

Sentence 11:
The center of my pork egg roll was cold.

Sentence 12:
The hot N sour soup was a much thicker consistency almost like that of a chili instead of being brothy.

Sentence 13:
Worst of all was the 

For listing all of the named entities:

In [9]:
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')

Entity 1: Asian - NORP

Entity 2: Asian - NORP

Entity 3: Chinese - NORP

Entity 4: Asian - NORP

Entity 5: Mongolian - NORP

Entity 6: 5 - CARDINAL

Entity 7: half - CARDINAL

Entity 8: Asian - NORP



For listing each word's POS tag:

(We'll be using *pandas* for this for its easy displaying and further parsing.)

In [10]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_pos)),
             columns=['token_text', 'part_of_speech'])

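| | token_text | part_of_speech |
|--:|:-----------|:---------------|
| 0 | The | DET |
| 1 | staff | NOUN |
| 2 | here | ADV |
| 3 | is | VERB |
| 4 | great | ADJ |
| 5 | and | CCONJ |
| 6 | they | PRON |
| 7 | 're | VERB |
| 8 | nice | ADJ |
| 9 | , | PUNCT |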


For lemmas and shape analysis:

In [11]:
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)),
            columns=['token_text', 'token_lemma', 'token_shape'])

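| | token_text | token_lemma | token_shape |
|--:|:-----------|:------------|:------------|
| 0 | The | the | Xxx |
| 1 | staff | staff | xxxx |
| 2 | here | here | xxxx |
| 3 | is | be | xx |
| 4 | great | great | xxxx |
| 5 | and | and | xxx |
| 6 | they | -PRON- | xxxx |
| 7 | 're | be | 'xx |
| 8 | nice | nice | xxxx |
| 9 | , | , | , |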


As you can see, each lemma is its token stripped of inflection such as tense or number. Also note that all pronouns have been converted to *-PRON-*. (I can only guess this is because all pronouns can be seen as identical information-wise, so there's no reason to save them as separate lemmas.)
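
The shape column is easier to read once the mapping is spelled out: uppercase letters become `X`, lowercase letters `x`, digits `d`, and runs of the same symbol are cut off after four. The function below is a rough approximation written for illustration; `approximate_shape` is a hypothetical name, and spaCy's own implementation may differ in edge cases.

```python
def approximate_shape(text):
    """Map letters to X/x and digits to d, capping runs of one symbol at four."""
    shape = []
    last = ''
    run = 0
    for char in text:
        if char.isalpha():
            symbol = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            symbol = 'd'
        else:
            symbol = char               # punctuation is kept as-is
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= 4:
            shape.append(symbol)
    return ''.join(shape)

for word in ['The', 'staff', "'re", 'restaurants']:
    print(word, approximate_shape(word))
```

The cap on run length explains why the five-letter *staff* and any longer lowercase word share the shape `xxxx`.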

For listing tokens and their analytics (like what kind of named entity they are and where a token lies within a named entity phrase):

In [12]:
token_entity_type = [token.ent_type_ for token in parsed_review]
token_entity_iob = [token.ent_iob_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)),
            columns=['token_text', 'entity_type', 'inside_outside_begin'])

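| | token_text | entity_type | inside_outside_begin |
|--:|:-----------|:------------|:---------------------|
| 0 | The | | O |
| 1 | staff | | O |
| 2 | here | | O |
| 3 | is | | O |
| 4 | great | | O |
| 5 | and | | O |
| 6 | they | | O |
| 7 | 're | | O |
| 8 | nice | | O |
| 9 | , | | O |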
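
Since the first ten tokens all sit outside any entity, the table shows only `O`. To make the scheme concrete, here is a small sketch of how inside/outside/begin tags relate to entity spans. The tokens, the span, and the `ORG` label are invented for the example and are not spaCy's output for this review.

```python
def iob_tags(tokens, entity_spans):
    """Tag each token B (begins an entity), I (inside one) or O (outside all).

    entity_spans holds (start, end, label) triples over token indices,
    with `end` exclusive.
    """
    tags = ['O'] * len(tokens)
    types = [''] * len(tokens)
    for start, end, label in entity_spans:
        for i in range(start, end):
            tags[i] = 'B' if i == start else 'I'
            types[i] = label
    return list(zip(tokens, types, tags))

tokens = ['We', 'ate', 'at', 'Pei', 'Wei', 'twice']
spans = [(3, 5, 'ORG')]   # hypothetical entity covering "Pei Wei"
for row in iob_tags(tokens, spans):
    print(row)
```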


For listing stopwords, punctuation, whitespace, numbers, out-of-vocabulary words, the log probability of each token appearing in the corpus, etc.:

In [13]:
token_attributes = [(token.orth_,
                    token.prob,
                    token.is_stop,
                    token.is_punct,
                    token.is_space,
                    token.like_num,
                    token.is_oov)
                   for token in parsed_review]

df = pd.DataFrame(token_attributes,
                 columns=['text',
                         'log_probability',
                         'stop?',
                         'punctuation?',
                         'whitespace?',
                         'number?',
                         'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                      .applymap(lambda x: u'Yes' if x else u''))

df

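| | text | log_probability | stop? | punctuation? | whitespace? | number? | out of vocab.? |
|--:|:-----|----------------:|:------|:-------------|:------------|:--------|:---------------|
| 0 | The | -5.774222 | Yes | | | | |
| 1 | staff | -10.720455 | | | | | |
| 2 | here | -7.175437 | Yes | | | | |
| 3 | is | -4.329765 | Yes | | | | |
| 4 | great | -7.822114 | | | | | |
| 5 | and | -4.195279 | Yes | | | | |
| 6 | they | -5.429816 | Yes | | | | |
| 7 | 're | -6.377125 | | | | | |
| 8 | nice | -8.462502 | | | | | |
| 9 | , | -3.391480 | | Yes | | | |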


### Step 3: Automatic Phrase Modeling

The problem with these tokens is that, taken independently, they do not reflect their relationships to one another. For example, spaCy splits a contraction such as *don't* into the separate tokens *do* and *n't*, but the two would be better represented as a single collocation, since *n't* never appears apart from *do* (or another auxiliary verb).

With the *gensim* library, I can run phrase modeling over the current list of tokens to find such relationships.

In [14]:
import warnings
# Because of Windows Anaconda, a warning appears without this filter.
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

There are four steps that must be done in data preparation here:
- a) Segment text of complete reviews into sentences and normalize text
- b) First-order phrase modeling
- c) Second-order phrase modeling
- d) Apply text normalization and second-order phrase model to text of complete reviews

#### a) Segment text of complete reviews into sentences & normalize text

Three helper functions will be used for this step. They will, respectively, flag punctuation and whitespace tokens, restore the line breaks in each stored review, and yield lemmatized sentences.

In [15]:
def punct_space(token):
    """returns True for tokens that are purely punctuation or whitespace"""
    return token.is_punct or token.is_space

def line_review(filename):
    """reads in each review and unescapes the line-breaks"""
    with codecs.open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """uses spaCy to parse reviews, lemmatize, and tokenize sentences"""
    for parsed_review in nlp.pipe(line_review(filename),
                                 batch_size=10000, n_threads=4):
        
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                            if not punct_space(token)])

*lemmatized_sentence_corpus* loops over the original review text and outputs it to *unigram_sentences_all* one normalized sentence at a time, with one per line.

In [16]:
unigram_sentences_filepath = os.path.join(intermediate_directory,
                                         'unigram_sentences_all.txt')

In [17]:
%%time
if False: # Once this is run once it does not need to be run again.
    
    with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(review_text_filepath):
            f.write(sentence + '\n')

Wall time: 0 ns


This process took me 2 hours and 48 minutes to complete on my modest computer.

The gensim library provides a *LineSentence* class that streams documents of this format from disk. Because the data never has to be held in memory all at once, this approach scales to very large corpora.
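
To show what streaming from disk means in practice, here is a rough stand-in written with the standard library only. `LineTokens` is a hypothetical class, not gensim's implementation; it assumes one sentence per line with tokens separated by whitespace, which is the layout written out above.

```python
import os
import tempfile

class LineTokens:
    """Re-iterable stream of whitespace-split lines, read lazily from disk."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Opening the file here, not in __init__, lets the object be looped
        # over several times, which multi-pass training needs.
        with open(self.path, encoding='utf_8') as f:
            for line in f:
                yield line.split()

# Tiny demonstration file with made-up sentences.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf_8') as tmp:
    tmp.write('the staff be wonderful\nservice be a 4\n')

stream = LineTokens(tmp.name)
first_pass = list(stream)
second_pass = list(stream)   # a plain generator would be exhausted by now
os.remove(tmp.name)
print(first_pass)
```

The point of wrapping the file in a class rather than a generator function is that every `for` loop starts again from the top of the file, while only one line is held in memory at a time.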

In [18]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

What do these sentences look like now?

In [19]:
for unigram_sentence in it.islice(unigram_sentences, 450, 460):
    print(u' '.join(unigram_sentence))
    print(u'')

this place be clean the staff be wonderful and go out of -PRON- way to help -PRON-

-PRON- stay a step ahead of -PRON- and get what what -PRON- ne the food be perfect

-PRON- have eat here many many time and -PRON- have always be fresh hot and delicious -PRON- highly recommend this place 100%

service be a 4

very sloppy also the hot and sour soup be not that bad if u add enough chile paste and salt and soy sauce the gentleman name pop -PRON- be the manager great service from -PRON- and this other lady do not get -PRON- name unfortunetly -PRON- have change -PRON- rating 2 time that be how impressed -PRON- be

love the food but -PRON- be great at all location

the customer service be horrible at this location

only two table be occupy at the time -PRON- walk in for a carry out

-PRON- walk in and -PRON- look at -PRON- like really

go away



Notice that these new sentences are far simpler grammatically than the originals. This gives a glimpse of what the final output of this stage will look like.

#### b) First-order phrase modeling

From these normalized sentences, one can now use *gensim.models.Phrases* to find two-word collocations and link them together into single tokens. 

For example, in one of the sentences above are the words "soy sauce". Though they are technically separate words, when together as a bigram they act as a distinct entity. Hence, by training a model on frequent bigrams, one can disambiguate cases like these.
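
To make "frequent bigrams" concrete, the sketch below scores each adjacent pair with the formula from Mikolov et al. (2013), `(count(a b) - min_count) * vocab_size / (count(a) * count(b))`, and joins pairs that score above a threshold. As far as I can tell, gensim's default scorer is based on the same formula, but this is a simplified illustration: the function names, toy sentences, `min_count` and `threshold` values are all assumptions chosen for the example.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score every adjacent word pair; higher means a stronger collocation."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        unigram_counts.update(sentence)
        bigram_counts.update(zip(sentence, sentence[1:]))
    vocab_size = len(unigram_counts)
    return {
        (a, b): (n - min_count) * vocab_size / (unigram_counts[a] * unigram_counts[b])
        for (a, b), n in bigram_counts.items()
    }

def merge_bigrams(sentence, scores, threshold):
    """Join adjacent tokens with '_' when their score exceeds the threshold."""
    merged, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append('_'.join(pair))
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged

toy_sentences = [
    ['add', 'soy', 'sauce'],
    ['soy', 'sauce', 'be', 'good'],
    ['the', 'sauce', 'be', 'salty'],
    ['soy', 'sauce', 'and', 'rice'],
]
toy_scores = score_bigrams(toy_sentences)
print(toy_scores[('soy', 'sauce')], toy_scores[('sauce', 'be')])
print(merge_bigrams(['add', 'soy', 'sauce'], toy_scores, threshold=1.2))
```

Dividing by the individual word counts is what keeps merely common words from being glued together: *sauce be* occurs twice here, yet scores lower than *soy sauce* because *sauce* and *be* also appear often on their own.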

In [20]:
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')

In [21]:
%%time
if False: # Once this is run once it does not need to be run again.
    
    bigram_model = Phrases(unigram_sentences)
    bigram_model.save(bigram_model_filepath)

# load the finished model back in after it's been built
bigram_model = Phrases.load(bigram_model_filepath)

Wall time: 21.4 s


Training took a little more than 14 minutes on my computer; the 21.4 s shown above is only the time needed to load the saved model. Because the model is saved to disk, it can later be applied to unrelated text, which also allows models built from different corpora to be compared.

I can now edit the original sentences to account for these bigrams.

In [22]:
bigram_sentences_filepath = os.path.join(intermediate_directory,
                                        'bigram_sentences_all.txt')

In [23]:
%%time
if False: # Once this is run once it does not need to be run again.
    
    with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for unigram_sentence in unigram_sentences:
            bigram_sentence = u' '.join(bigram_model[unigram_sentence])
            
            f.write(bigram_sentence + '\n')

Wall time: 0 ns


This took a little more than 26 minutes on my computer. Let's take a look at what I'm left with.

In [24]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

for bigram_sentence in it.islice(bigram_sentences, 230, 240):
    print(u' '.join(bigram_sentence))
    print(u'')

nothing fantastic about -PRON- but -PRON- taste fine

-PRON- come with a sweet mustard dip sauce which be pretty tasty

usual filling of pork veggie etc

deep_fried pot_stickers- -PRON- get 2 of these thing

-PRON- be good

small but good

straight pork filling in the middle of -PRON-

overall -PRON- be just alright

-PRON- will say that -PRON- give -PRON- a grip of food

the 2 entree -PRON- get could have easily_feed 1 2 more people



So what has been achieved? In the sentences printed above, two-word phrases like "deep fried" and "pot stickers" are now written as "deep\_fried" and "pot\_stickers". This gives a better representation of which words co-occur at a rate far greater than chance.

#### c) Second-order phrase modeling

We now have concatenated bigrams, but accounting for trigrams as well might be a good idea. This involves nearly the same code as part (b), except that the new model is trained on the just-made bigram sentences instead of the unigram ones.

In [25]:
trigram_model_filepath = os.path.join(intermediate_directory,
                                     'trigram_model_all')

In [26]:
%%time
if False:  # Once this is run once it does not need to be run again.
    
    trigram_model = Phrases(bigram_sentences)
    trigram_model.save(trigram_model_filepath)

# load the finished model back in after it's been built
trigram_model = Phrases.load(trigram_model_filepath)

Wall time: 21.8 s


Training took almost 18 minutes on my computer. Again, I'm saving this trigram model in its own file so that it can be reused on other corpora. This second-order phrase model can now be applied to the first-order transformed sentences, with the results written to a third sentences file.

In [27]:
trigram_sentences_filepath = os.path.join(intermediate_directory,
                                         'trigram_sentences_all.txt')

In [28]:
%%time
if False:  # Once this is run once it does not need to be run again.
    
    with codecs.open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for bigram_sentence in bigram_sentences:
            trigram_sentence = u' '.join(trigram_model[bigram_sentence])
            
            f.write(trigram_sentence + '\n')

Wall time: 0 ns


This took almost 27 minutes on my computer. Now there is a file consisting of every sentence marked with trigrams. 

In [29]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [30]:
for trigram_sentence in it.islice(trigram_sentences, 270, 280):
    print(u' '.join(trigram_sentence))
    print(u'')

definitely a second good to p.f._chang_'s

service be fast and on time

love the online_ordering very convenient food be always hot and delicious

however

no forks for take out

that be poor management

either -PRON- do not order in time or there be a problem with the order either way hustle -PRON- booty to walmart and buy a couple box unacceptable

the food be fantastic -PRON- highly_recommend the lettuce_wraps

then -PRON- go to dinner at pei_wei

not bad for an asian_fusion place fill with senior_citizen



Why did I do all this? Because instead of the phrase "p.f. chang 's" being represented as three separate words, it is now concatenated into "p.f.\_chang\_'s" and correctly marked as a single entity.

Technically, I could repeat this process indefinitely with higher and higher ngrams, but very little is gained information-wise after trigrams, so here I stop.
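
The stacking of the two passes can be sketched without any trained model. Below, two hand-written sets of pairs stand in for the first- and second-order phrase models; `merge_pairs` and both sets are hypothetical and exist only to show how the second pass builds on the output of the first.

```python
def merge_pairs(tokens, pairs):
    """Join adjacent tokens with '_' wherever the pair is a known phrase."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + '_' + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical phrase lists standing in for the trained models.
first_order = {('p.f.', 'chang')}
second_order = {('p.f._chang', "'s")}

unigrams = ['a', 'second', 'good', 'to', 'p.f.', 'chang', "'s"]
bigrams = merge_pairs(unigrams, first_order)
trigrams = merge_pairs(bigrams, second_order)
print(bigrams)
print(trigrams)
```

Because the second pass sees already-merged tokens, it can also join two bigrams into a four-word phrase, so "trigram" here really means "up to four words".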

#### d) Apply text normalization and second-order phrase model to text of complete reviews

Now what? I have these useful models and lots of sentences, but I was originally looking at full reviews. Therefore, I can run the original texts through these models (and even remove information-poor stop words to boot) and be left with a semantically rich equivalent of the original text, one that ignores most of the grammar and focuses on the words themselves.

In [31]:
trigram_reviews_filepath = os.path.join(intermediate_directory,
                                       'trigram_tranformed_reviews_all.txt')

In [32]:
%%time
if False: # Once this is run once it does not need to be run again.
    
    with codecs.open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:
        for parsed_review in nlp.pipe(line_review(review_text_filepath),
                                    batch_size=10000, n_threads=4):
            
            # lemmatize, and remove punctuation and whitespace
            unigram_review = [token.lemma_ for token in parsed_review if not punct_space(token)]
            
            # apply the first- and second-order phrase models
            bigram_review = bigram_model[unigram_review]
            trigram_review = trigram_model[bigram_review]
            
            # remove stopwords, using spaCy's premade list
            # (spaCy 2.x exposes it under spacy.lang.en; spacy.en was the 1.x location)
            trigram_review = [term for term in trigram_review
                              if term not in spacy.lang.en.stop_words.STOP_WORDS]
            
            # finally, output this to the new file, review by review as line by line
            trigram_review = u' '.join(trigram_review)
            f.write(trigram_review + '\n')

Wall time: 0 ns


This took a little more than 3 hours and 47 minutes to complete on my computer.

Thus, to compare the text from before and after NLP transformations:

In [33]:
print(u'Original:\n')
for review in it.islice(line_review(review_text_filepath), 11, 12):
    print(review)
    
print(u'----\n')
print(u'Transformed:\n')

with codecs.open(trigram_reviews_filepath, encoding='utf_8') as f:
    for review in it.islice(f, 11, 12):
        print(review)

Original:

I love this place i'd recommend it to anyone ! We always order it togo and it never disappoints! The food always taste fresh and is always ready on time! Definitely our favorite lunch spot !

----

Transformed:

-PRON- love place -PRON- recommend -PRON- -PRON- order -PRON- togo -PRON- disappoint food taste fresh ready time definitely -PRON- favorite lunch spot



Comparing the two makes it clear that while a lot of content has been removed from the original text, the meat and potatoes remain. This information is far more useful for the semantic topic modeling that will be applied next.
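
One detail of the cell above is worth spelling out: stop words are removed only after the phrase models have run. A plausible reason is that some phrases contain stop words, and filtering first would destroy them. The sketch below illustrates this with a made-up stop list and made-up tokens.

```python
# Hypothetical stop list and phrase tokens, for illustration only.
stop_words = {'the', 'be', 'of', 'a', 'and'}

def drop_stop_words(tokens, stop_words):
    """Keep every token that is not itself a stop word."""
    return [t for t in tokens if t not in stop_words]

# Phrase model applied first: the phrase is already a single token.
phrase_first = drop_stop_words(['the', 'mac_and_cheese', 'be', 'creamy'], stop_words)

# Stop words removed first: "and" is gone before any phrase model sees it.
filter_first = drop_stop_words(['the', 'mac', 'and', 'cheese', 'be', 'creamy'], stop_words)

print(phrase_first)
print(filter_first)
```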

### Step 4: Visualizing topic models with pyLDAvis

*Topic Modeling* is a set of techniques where given a large corpus, inferences can be made between words such that many clusters of words appearing in similar contexts, or 'topics', can be determined and extracted. These topics can then be compared to the contexts found in specific documents, such that each document can be classified according to these contexts. In this sense, the computer is able to understand semantic connections between words without actually understanding the words themselves.

Here, [Latent Dirichlet Allocation](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf), or [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation), is to be used. This approach was introduced in 2003 and has proven highly successful and easy to implement in Python (though there are of course other topic modeling techniques that could be used as well).

In LDA, documents are analyzed not as long strings, but merely as bags of words. A lexicon consisting of all the vocabulary found across all documents is created, and each document is then described as a sparse vector holding the count of each token it contains. Stacking these vectors yields a massive and sparse *document-term matrix*, and it is from here that analysis can be done.
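
A toy version of this bag-of-words step, using only the standard library, might look as follows. The two documents are invented, and `to_bow` is a hypothetical helper that only mimics the kind of `(token_id, count)` pairs a dictionary lookup produces.

```python
from collections import Counter

docs = [
    ['chicken', 'wing', 'sauce', 'sauce'],
    ['pizza', 'sauce', 'cheese'],
]

# Assign each distinct token an integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

Each document stores only the tokens it actually contains, which is what makes the representation sparse: with a lexicon of tens of thousands of terms, a single review touches only a handful of them.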

The *Latent* in LDA can be read as a synonym for *hidden*. There are a number of hidden topics by which the documents can be described, and the total number of topics is chosen by the person running the model. In the examples below, 50 topics are found, though more or fewer would yield denser but less accurate topics or sparser but more comprehensive topics, respectively.

The [*Dirichlet*](https://en.wikipedia.org/wiki/Dirichlet_distribution) in LDA is a type of probability distribution named after Peter Gustav Lejeune Dirichlet. LDA assumes that each document's mixture of topics and each topic's mixture of tokens are drawn from Dirichlet distributions. Again, there are alternative options that could be utilized, but LDA works quite well for cases like these.
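
For intuition, a draw from a Dirichlet distribution is a list of non-negative weights that sum to one, which is exactly the form of a document's topic mixture. The sketch below samples such weights by normalizing independent Gamma draws, a standard construction; the concentration values and the five-topic size are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
sparse_mix = sample_dirichlet([0.1] * 5, rng)   # small alpha: tends to favor few topics
even_mix = sample_dirichlet([50.0] * 5, rng)    # large alpha: tends toward uniform

print([round(p, 3) for p in sparse_mix])
print([round(p, 3) for p in even_mix])
```

Small concentration values tend to put most of the weight on one or two components, matching the intuition that a single review is usually about only a few topics.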

Finally, it is important to note that this is an instance of unsupervised learning. This means that the algorithm will discover these 50 topics on its own, with no previous knowledge of what each topic is. Because of this, I will have to label each topic myself in the end, as their true definitions will remain hidden to the computer and to me.

Luckily, gensim again has modules that can be used for data processing and for LDA analysis.

In [34]:
import warnings
# Because of Windows Anaconda, a warning appears without this filter.
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import pickle

Gensim's *Dictionary* class allows the full lexicon of the Yelp review corpus to be modeled. The trigram-transformed reviews produced in the previous section will be used to build the dictionary.

In [35]:
trigram_dictionary_filepath = os.path.join(intermediate_directory,
                                          'trigram_dict_all.dict')

In [36]:
%%time
if False: # Once this is run once it does not need to be run again.
    
    trigram_reviews = LineSentence(trigram_reviews_filepath)
    
    # iterate the reviews and build a dictionary of trigrams
    trigram_dictionary = Dictionary(trigram_reviews)
    
    # filter tokens that are too rare or common (so little information gain) from the dict
    trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
    
    # reassign integer ids removed from the filtering to conserve space
    trigram_dictionary.compactify()
    
    trigram_dictionary.save(trigram_dictionary_filepath)

# load the finished dictionary back in after it's been built
trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)

Wall time: 67.6 ms


This process took only about 5 minutes on my computer. LDA uses a bag-of-words model to represent all the words and their counts, which will be saved as a sparse matrix whose dimensions are the number of reviews by the size of the vocabulary.
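
The `filter_extremes` call in the dictionary cell above prunes tokens by document frequency: `no_below` is an absolute number of documents and `no_above` is a fraction of the corpus. The sketch below reproduces that rule on a toy corpus; `keep_tokens` is a hypothetical helper, and it leaves out any other options the gensim method may apply, such as a cap on the total vocabulary size.

```python
from collections import Counter

def keep_tokens(docs, no_below, no_above):
    """Return tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))        # count each token once per document
    max_docs = no_above * len(docs)
    return {t for t, n in doc_freq.items() if no_below <= n <= max_docs}

toy_docs = [
    ['food', 'good', 'taco'],
    ['food', 'bad', 'taco'],
    ['food', 'good', 'ramen'],
    ['food', 'slow'],
]
print(sorted(keep_tokens(toy_docs, no_below=2, no_above=0.5)))
```

Here *food* is dropped for appearing in every document and *bad*, *ramen* and *slow* for appearing in only one, leaving the tokens that actually help tell documents apart.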

In [37]:
trigram_bow_filepath = os.path.join(intermediate_directory,
                                   'trigram_bow_corpus_all.mm')

In [38]:
def trigram_bow_generator(filepath):
    """
    generator function that reads in a file and yields a bow representation from it
    """
    
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)

In [39]:
%%time
if False: # Once this is run once it does not need to be run again.
    
    #generate the bow representation for the reviews and save as a matrix
    MmCorpus.serialize(trigram_bow_filepath,
                      trigram_bow_generator(trigram_reviews_filepath))

# load the finished corpus back in after it's been built
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)

Wall time: 579 ms


This took about 9.5 minutes on my computer. I pass into the LDA model the sparse matrix made above, the number of topics to be found (here 50), and the total lexicon.

In [40]:
lda_model_filepath = os.path.join(intermediate_directory,
                                 'lda_model_all')

In [41]:
%%time
if False: # Once this is run once it does not need to be run again.
    
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        lda = LdaMulticore(trigram_bow_corpus,
                          num_topics=50,
                          id2word=trigram_dictionary,
                          workers=1)
        
    lda.save(lda_model_filepath)

# load the finished model back in after it's been built
lda = LdaMulticore.load(lda_model_filepath)

Wall time: 700 ms


This took a little over 2 hours to complete. With this now-finished LDA model, each topic's most influential words can be seen, and from this one can infer roughly what each topic represents.

In [42]:
def explore_topic(topic_number, topn=25):
    """
    prints out a formatted list of the top terms of a given topic number
    """
    
    print(u'{:20} {}\n'.format(u'term', u'frequency'))
    
    # pass the caller's topn through instead of a hard-coded 25
    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

In [43]:
explore_topic(topic_number=15)

term                 frequency

chicken              0.102
sauce                0.037
meat                 0.034
bbq                  0.025
wing                 0.025
fry                  0.024
order                0.022
fried                0.018
like                 0.016
dry                  0.014
flavor               0.013
rib                  0.013
taste                0.011
tender               0.011
smoke                0.010
brisket              0.010
try                  0.009
cheese               0.009
hot                  0.009
come                 0.008
mac                  0.007
crispy               0.007
pork                 0.006
piece                0.006
little               0.006


Topic 15 contains words like *bbq*, *sauce*, *smoke*, and *brisket*, all of which relate to **BBQ**. A similar process of inference has been repeated for each of the topics, and these have been listed below.

In [44]:
topic_names = {0: u'interior',
              1: u'oriental',
              2: u'breakfast',
              3: u'service',
              4: u'arrival',
              5: u'cost',
              6: u'nightlife',
              7: u'alternative',
              8: u'hip',
              9: u'bar',
              10: u'japanese',
              11: u'pizza',
              12: u'bistro',
              13: u'steak',
              14: u'overnight',
              15: u'bbq',
              16: u'family',
              17: u'awesomesauce',
              18: u'dessert',
              19: u'selections',
              20: u'mexican',
              21: u'indian',
              22: u'discount',
              23: u'french',
              24: u'lunch',
              25: u'thai',
              26: u'location',
              27: u'parking/cajun/utensils',
              28: u'terribad',
              29: u'positive',
              30: u'buffet',
              31: u'tex-mex',
              32: u'cajun',
              33: u'burgers',
              34: u'ramen',
              35: u'fancy',
              36: u'alcohol/south',
              37: u'atmosphere',
              38: u'crowdedness',
              39: u'ice cream/delivery',
              40: u'french',
              41: u'management',
              42: u'experience',
              43: u'quality',
              44: u'letters/numbers',
              45: u'seafood',
              46: u'upper-class',
              47: u'celebration',
              48: u'breakfast diner',
              49: u'awesomesauce'}

In [45]:
topic_names_filepath = os.path.join(intermediate_directory,
                                   'topic_names.pkl')

with open(topic_names_filepath, 'wb') as f:
    pickle.dump(topic_names, f)
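
As a quick sanity check that a dictionary saved this way survives the round trip, here is a small sketch using a temporary directory. The two-entry `toy_topic_names` dictionary and the temporary path are stand-ins for illustration, not the real files used above.

```python
import os
import pickle
import tempfile

# a small stand-in for the full topic_names dictionary
toy_topic_names = {15: "bbq", 11: "pizza"}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_filepath = os.path.join(tmp_dir, "topic_names.pkl")

    # write the dictionary out in binary mode
    with open(toy_filepath, "wb") as f:
        pickle.dump(toy_topic_names, f)

    # read it back in, also in binary mode
    with open(toy_filepath, "rb") as f:
        restored = pickle.load(f)

print(restored[15])
```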

I've saved all of these topic names to a pickle file. Although not all topics are easy to infer, each can be roughly associated with some distinct aspect of restaurants and cuisine.

However, a graphical view makes it much easier to see how the topics relate to one another. Luckily, pyLDAvis makes this easy.

In [46]:
LDAvis_data_filepath = os.path.join(intermediate_directory,
                                   'ldavis_prepared')

In [47]:
%%time
if False: # After the first run, this does not need to be run again.
    
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda,
                                              trigram_bow_corpus,
                                              trigram_dictionary)
    
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)

# load the finished pyLDAvis data back in after it's been built       
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

Wall time: 12 ms


With pyLDAvis.display(pyLDAvisModel), a visualization can be displayed directly in Jupyter.

In [48]:
pyLDAvis.display(LDAvis_prepared)

...

In [49]:
def get_sample_review(review_number):
    """
    retrieve a particular review index from the reviews file and return it
    """
    
    return list(it.islice(line_review(review_text_filepath),
                         review_number,
                         review_number + 1))[0]
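
The `islice` trick above is worth a second look: it pulls a single item out of a generator without ever building the whole list in memory. Here is a minimal sketch of the same idea, where `toy_line_review` is a hypothetical stand-in for the real `line_review` generator and `get_nth` is a helper name I made up.

```python
import itertools as it


def toy_line_review():
    """Hypothetical stand-in for line_review: lazily yield one review at a time."""
    for i in range(1000):
        yield "review number {}".format(i)


def get_nth(iterable, n):
    """Return item n (0-based), consuming only the first n + 1 items."""
    return next(it.islice(iterable, n, n + 1))


print(get_nth(toy_line_review(), 50))
```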

In [50]:
def lda_description(review_text, min_topic_freq=0.05):
    """
    take the original text of a review and
    (1) parse it using SpaCy
    (2) pre-process the text
    (3) create a bow representation
    (4) create an LDA representation
    (5) print a sorted list of the top topics from the LDA representation
    """
    
    # (1) parse the text using SpaCy
    parsed_review = nlp(review_text)
    
    # (2) pre-process the text
    #  - lemmatize and remove punctuation and whitespace
    unigram_review = [token.lemma_ for token in parsed_review
                     if not punct_space(token)]
    
    #  - apply the first- and second-order phrase models
    bigram_review = bigram_model[unigram_review]
    trigram_review = trigram_model[bigram_review]
    
    #  - remove stopwords
    trigram_review = [term for term in trigram_review
                     if term not in spacy.en.STOP_WORDS]
    
    # (3) create a bow representation
    review_bow = trigram_dictionary.doc2bow(trigram_review)
    
    # (4) create an LDA representation
    review_lda = lda[review_bow]
    
    # sort with the most highly rated topics first
    review_lda = sorted(review_lda, key=(lambda topic_number_freq: -topic_number_freq[1]))
    
    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break
        
        # (5) print a sorted list of the top topics from the LDA representation
        print('{:25} {}'.format(topic_names[topic_number],
                               round(freq, 3)))
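
The last part of `lda_description` is just a sort followed by a cutoff. Here is that step on its own as a small sketch; `top_topics` is a hypothetical helper, and the `(topic_number, frequency)` pairs are toy values chosen for illustration rather than real model output.

```python
def top_topics(topic_freqs, min_topic_freq=0.05):
    """Sort (topic, freq) pairs by descending freq and keep those at or above the cutoff."""
    ranked = sorted(topic_freqs, key=lambda pair: -pair[1])
    kept = []
    for topic, freq in ranked:
        if freq < min_topic_freq:
            break  # the list is sorted, so everything after this is smaller still
        kept.append((topic, freq))
    return kept


# toy (topic_number, frequency) pairs
toy_review_lda = [(3, 0.131), (38, 0.265), (7, 0.02), (41, 0.238)]

print(top_topics(toy_review_lda))
```

Because the list is sorted first, the loop can stop at the first topic below the threshold instead of checking every remaining one.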

In [51]:
sample_review = get_sample_review(50)
print(sample_review)

Pretty good food a chain. You do get a lot of food for the price. However, every time I have ordered online there has been a mistake. Sometimes they tell me they never even received my order when I received the confirmation email. Also, if you in during a busy time and have a teenage girl as a cashier, you're chances of having a pleasant experience with her are slim to none. This has happened to me multiple times with multiple different cashiers. I have spoken to the manager and he is always very apologetic, but I guess you just can't get stressed teenage girls to be nice to customers. I will continue to go here as the food is good for the price and they are relatively quick with preparation. Nice options if you're veggies as well.



In [52]:
lda_description(sample_review)

crowdedness               0.265
management                0.238
service                   0.131
quality                   0.098
dessert                   0.091
selections                0.066


...

In [53]:
sample_review = get_sample_review(100)
print(sample_review)

If you blink. You will miss this little eatery, squished into the corner of a small strip mall in Mentor, Ohio.

I went with a friend of mine from the area, who was treated like a rock star when he arrived (well, he IS a rock star), and that's indicative of the casual and familial vibe at this restaurant.  I didn't take notes - too busy enjoying the company - but I'm pretty sure I got a burrito and it was delicious.  Margaritas were excellent (this much I do remember) and service was quick and pleasant.  Now that I know that this place exists, it gives me even more reason to travel to the northern tip of Ohio for good company and eats!



In [54]:
lda_description(sample_review)

experience                0.191
positive                  0.161
interior                  0.131
awesomesauce              0.106
crowdedness               0.097
cost                      0.076
breakfast                 0.064


...

### Step 5: Word vector models with word2vec

In [55]:
from gensim.models import Word2Vec

trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')

...

In [58]:
%%time
if False: # After the first run, this does not need to be run again.
    
    # initiate the model and run the first round of training
    food2vec = Word2Vec(trigram_sentences, size=100, window=5,
                       min_count=20, sg=1, workers=2)
    
    food2vec.save(word2vec_filepath)
    
    # run 11 more rounds of training, saving after each one
    # (each call makes `food2vec.iter` passes over the corpus)
    for i in range(1,12):
        food2vec.train(trigram_sentences,
                      total_examples=food2vec.corpus_count,
                      epochs=food2vec.iter)
        food2vec.save(word2vec_filepath)
        
# load the finished model back in after it's been built
food2vec = Word2Vec.load(word2vec_filepath)
food2vec.init_sims()

print(u'{} training epochs so far.'.format(food2vec.train_count))

12 training epochs so far.
Wall time: 14h 16min 2s


This took over 14 hours in total on my computer.

In [60]:
print(u'{:,} terms in the food2vec vocabulary.'.format(len(food2vec.wv.vocab)))

91,036 terms in the food2vec vocabulary.


...

In [65]:
# build a list of the terms, integer indices, and term counts from food2vec's vocabulary
ordered_vocab = [(term, voc.index, voc.count)
                for term, voc in food2vec.wv.vocab.items()]

# sort by term counts to order most frequent terms first
ordered_vocab = sorted(ordered_vocab, key=(lambda term_index_count: -term_index_count[2]))

# unzip the three into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a Pandas dataframe with the vectors as data and the terms as rows
word_vectors = pd.DataFrame(food2vec.wv.syn0norm[term_indices, :],
                           index=ordered_terms)
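
The sort-then-unzip idiom in that cell can be tried out without a trained model. In this sketch, `toy_vocab` is a made-up mapping from a term to its `(index, count)` pair, standing in for the real vocabulary objects.

```python
# made-up vocabulary: term -> (index, count)
toy_vocab = {"food": (1, 900), "good": (0, 1200), "ramen": (2, 40)}

# flatten into (term, index, count) triples
ordered = [(term, index, count) for term, (index, count) in toy_vocab.items()]

# sort by count, most frequent first
ordered = sorted(ordered, key=lambda triple: -triple[2])

# zip(*...) turns a list of triples into three parallel tuples
terms, indices, counts = zip(*ordered)

print(terms)
print(indices)
print(counts)
```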

...

In [66]:
def get_related_terms(token, topn=10):
    """
    Look up the topn most similar terms to the token and print them as a list.
    """
    
    for word, similarity in food2vec.most_similar(positive=[token], topn=topn):
        print(u'{:20} {}'.format(word, round(similarity, 3)))
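
As far as I understand it, "most similar" here means highest cosine similarity between word vectors. Here is a minimal pure-Python sketch of that idea on made-up two-dimensional vectors; `cosine`, `related_terms`, and `toy_vectors` are all hypothetical names for illustration and not part of gensim.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_terms(token, vectors, topn=10):
    """Rank every other term by cosine similarity to the given token."""
    scores = [(other, cosine(vectors[token], vec))
              for other, vec in vectors.items() if other != token]
    return sorted(scores, key=lambda pair: -pair[1])[:topn]


# made-up 2-d "embeddings"
toy_vectors = {
    "pasta": [1.0, 0.0],
    "spaghetti": [0.9, 0.1],
    "sushi": [0.0, 1.0],
}

for word, similarity in related_terms("pasta", toy_vectors, topn=2):
    print("{:20} {}".format(word, round(similarity, 3)))
```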

...

In [70]:
get_related_terms(u"mcdonald_'s")

mcdonalds            0.974
mcd_'s               0.95
mcdonald             0.925
wendy_'s             0.916
mcd                  0.905
mcds                 0.9
carl_'s_jr.          0.886
wendys               0.877
bk                   0.868
mickey_d_'s          0.865


...

In [68]:
get_related_terms(u'happy_hour')

hh                   0.905
reverse_happy_hour   0.89
happy_hr             0.863
happy_hour-          0.831
4_7pm                0.795
happy_hour_3_6pm     0.785
5_7pm                0.783
4_6pm                0.779
3pm-6pm              0.762
happy_hour_m_f       0.761


...

In [69]:
get_related_terms(u'pasta', topn=20)

spaghetti            0.856
rigatoni             0.854
fettuccine           0.852
penne                0.846
lasagna              0.828
ziti                 0.827
bolognese            0.825
manicotti            0.824
fettucini            0.821
linguine             0.813
alfredo              0.81
tortellini           0.809
angel_hair_pasta     0.807
fettuccini           0.806
linguini             0.802
penne_alla_vodka     0.802
penne_pasta          0.801
angel_hair           0.8
gnocchi              0.798
lasagne              0.792


...

In [71]:
def word_algebra(add=None, subtract=None, topn=1):
    """
    combine the vectors of words in add and subtract,
    and print the topn most similar terms to the combined vector
    """
    
    answers = food2vec.most_similar(positive=add or [], negative=subtract or [], topn=topn)
    
    for term, similarity in answers:
        print(term)
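
To convince myself of what this "word algebra" is doing, here is a rough pure-Python sketch. It assumes the combination works by adding and subtracting unit-length vectors and then ranking the remaining words by cosine similarity, which is my understanding of the approach and not a copy of gensim's code. The function `toy_word_algebra` and the two-dimensional vectors are made up for illustration.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_word_algebra(vectors, add=(), subtract=(), topn=1):
    """Add and subtract unit vectors, then rank the other words by cosine similarity."""
    dims = len(next(iter(vectors.values())))
    combined = [0.0] * dims
    for word in add:
        combined = [c + x for c, x in zip(combined, unit(vectors[word]))]
    for word in subtract:
        combined = [c - x for c, x in zip(combined, unit(vectors[word]))]
    combined = unit(combined)

    # the input words themselves are left out of the ranking
    excluded = set(add) | set(subtract)
    scores = [(word, sum(a * b for a, b in zip(combined, unit(vec))))
              for word, vec in vectors.items() if word not in excluded]
    return [word for word, _ in sorted(scores, key=lambda p: -p[1])[:topn]]


# made-up 2-d "embeddings"
toy_meals = {
    "breakfast": [1.0, 0.0],
    "lunch": [0.0, 1.0],
    "brunch": [1.0, 1.0],
    "dinner": [-1.0, 1.0],
}

print(toy_word_algebra(toy_meals, add=["breakfast", "lunch"]))
```

In this toy space the sum of *breakfast* and *lunch* points in exactly the same direction as *brunch*, so it comes out on top.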

...

In [72]:
word_algebra(add=[u'breakfast', u'lunch'])

brunch


...

In [73]:
word_algebra(add=[u'lunch', u'night'], subtract=[u'day'])

dinner


...

In [74]:
word_algebra(add=[u'taco', u'chinese'], subtract=[u'mexican'])

dumpling


...

In [75]:
word_algebra(add=[u'bun', u'mexican'], subtract=[u'american'])

steamed_bun


...

In [76]:
word_algebra(add=[u'filet_mignon', u'seafood'], subtract=[u'beef'])

alaskan_king_crab_leg


...

In [77]:
word_algebra(add=[u'coffee', u'snack'], subtract=[u'drink'])

pastry


...

In [79]:
word_algebra(add=[u"mcdonald_'s", u'fine_dining'])

denny_'s


...

In [80]:
word_algebra(add=[u"denny_'s", u'fine_dining'])

dennys


...

In [81]:
word_algebra(add=[u"applebee_'s", u'italian'])

olive_garden


...

In [82]:
word_algebra(add=[u"applebee_'s", u'pancakes'])

ihop


...

In [83]:
word_algebra(add=[u"applebee_'s", u'pizza'])

barro_'s


...

In [84]:
word_algebra(add=[u'wine', u'barley'], subtract=[u'grapes'])

merlot


...

### Step 6: Visualizing word2vec with t-SNE

In [85]:
from sklearn.manifold import TSNE

...

In [86]:
tsne_input = word_vectors.drop(spacy.en.STOP_WORDS, errors=u'ignore')
tsne_input = tsne_input.head(5000)

tsne_input.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -PRON- | 0.072001 | -0.085673 | -0.115002 | -0.131268 | 0.130546 | -0.091399 | 0.09209 | 0.061851 | -0.044852 | 0.075698 | ... | 0.036346 | -0.104727 | -0.034779 | -0.069648 | 0.144019 | 0.050682 | -0.030202 | 0.046225 | 0.019045 | 0.002935 |
| good | 0.043744 | -0.061336 | -0.088378 | -0.085642 | 0.021823 | -0.018142 | -0.075616 | 0.081995 | -0.050594 | 0.026308 | ... | 0.151693 | 0.151151 | -0.026826 | 0.033992 | 0.052284 | -0.055157 | -0.054231 | 0.177916 | 0.128972 | 0.047372 |
| food | 0.117888 | -0.181887 | 0.060473 | 0.126244 | 0.209582 | 0.047414 | -0.047423 | 0.075752 | -0.018864 | 0.043803 | ... | -0.025871 | 0.000534 | 0.07055 | -0.028516 | 0.201392 | -0.098547 | -0.061869 | 0.025773 | 0.082732 | 0.172148 |
| place | 0.084988 | -0.324791 | -0.055835 | -0.018883 | 0.182912 | -0.029256 | -0.020491 | 0.03062 | -0.027119 | 0.139906 | ... | 0.01101 | -0.025304 | 0.099694 | -0.090623 | 0.082347 | -0.067192 | -0.173651 | 0.010739 | 0.026422 | -0.00966 |
| order | 0.001718 | -0.081069 | 0.001485 | -0.2501 | 0.149529 | 0.015925 | -0.012842 | -0.014634 | -0.007161 | 0.057533 | ... | -0.021325 | -0.049446 | 0.020087 | -0.207049 | 0.071794 | -0.163573 | 0.08354 | 0.036266 | -0.108861 | -0.009785 |


...

In [87]:
tsne_filepath = os.path.join(intermediate_directory,
                            u'tsne_model')

tsne_vectors_filepath = os.path.join(intermediate_directory,
                                    u'tsne_vectors.npy')

In [91]:
%%time
if False: # After the first run, this does not need to be run again.
    
    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)
    
    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)
        
    pd.np.save(tsne_vectors_filepath, tsne_vectors)

    
with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)
    
tsne_vectors = pd.np.load(tsne_vectors_filepath)

tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=[u'x_coord', u'y_coord'])

Wall time: 20 ms


...

In [92]:
tsne_vectors.head()

| | x_coord | y_coord |
|---|---|---|
| -PRON- | -1.36721 | 5.265493 |
| good | 9.012697 | 1.281403 |
| food | 6.129308 | 2.01285 |
| place | -2.732799 | 7.652569 |
| order | 1.920465 | -2.917492 |


In [93]:
tsne_vectors[u'word'] = tsne_vectors.index

...

In [94]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()


In [96]:
# add the dataframe as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create a plot and configure the external variables
tsne_plot = figure(title=u't-SNE Word Embeddings',
                   plot_width = 800,
                   plot_height = 800,
                   tools = (u'pan, wheel_zoom, box_zoom,'
                            u'box_select, resize, reset'),
                   active_scroll = u'wheel_zoom')

# add a hover tool to display words on mouse-over
tsne_plot.add_tools(HoverTool(tooltips = u'@word'))

# draw the words as circles on the plot
tsne_plot.circle(u'x_coord', u'y_coord', source=plot_data,
                color=u'blue', line_alpha=0.2, fill_alpha=0.1,
                size=10, hover_line_color=u'black')

# configure the plot's visual elements
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# the plot is complete! Let's look at it!
show(tsne_plot)