## The Yelp Dataset
[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [Yelp](http://yelp.com) for academic research and educational purposes. I really like the Yelp dataset as a subject for machine learning and natural language processing demos, because it's big (but not so big that you need your own data center to process it), well-connected, and anyone can relate to it &mdash; it's largely about food, after all!

In [3]:
import os

# start_data_dir = '.'
start_data_dir = '/Users/aliosha/Development/nlp/'

data_directory = os.path.join(start_data_dir, 'data', 'yelp_dataset')

businesses_filepath = os.path.join(data_directory, 'business.json')

with open(businesses_filepath, encoding='utf_8') as f:
    first_business_record = f.readline() 

print(first_business_record)

{"business_id": "FYWN1wneV18bWNgQjJ2GNg", "name": "Dental by Design", "neighborhood": "", "address": "4855 E Warner Rd, Ste B9", "city": "Ahwatukee", "state": "AZ", "postal_code": "85044", "latitude": 33.3306902, "longitude": -111.9785992, "stars": 4.0, "review_count": 22, "is_open": 1, "attributes": {"AcceptsInsurance": true, "ByAppointmentOnly": true, "BusinessAcceptsCreditCards": true}, "categories": ["Dentists", "General Dentistry", "Health & Medical", "Oral Surgeons", "Cosmetic Dentists", "Orthodontists"], "hours": {"Friday": "7:30-17:00", "Tuesday": "7:30-17:00", "Thursday": "7:30-17:00", "Wednesday": "7:30-17:00", "Monday": "7:30-17:00"}}



The business records consist of _key, value_ pairs containing information about the particular business. A few attributes we'll be interested in for this demo include:
- __business\_id__ &mdash; _unique identifier for businesses_
- __categories__ &mdash; _an array containing relevant category values of businesses_

The _categories_ attribute is of special interest. This demo will focus on restaurants, which are indicated by the presence of the _Restaurant_ tag in the _categories_ array. In addition, the _categories_ array may contain more detailed information about restaurants, such as the type of food they serve.

In [4]:
review_json_filepath = os.path.join(data_directory, 'review.json')

with open(review_json_filepath, encoding='utf_8') as f:
    first_review_record = f.readline()
    
print(first_review_record)

{"review_id":"v0i_UHJMo_hPBq9bxWvW4w","user_id":"bv2nCi5Qv5vroFiqKGopiw","business_id":"0W4lkclzZThpx3V65bVgig","stars":5,"date":"2016-05-28","text":"Love the staff, love the meat, love the place. Prepare for a long line around lunch or dinner hours. \n\nThey ask you how you want you meat, lean or something maybe, I can't remember. Just say you don't want it too fatty. \n\nGet a half sour pickle and a hot pepper. Hand cut french fries too.","useful":0,"funny":0,"cool":0}



A few attributes of note on the review records:
- __business\_id__ &mdash; _indicates which business the review is about_
- __text__ &mdash; _the natural language text the user wrote_

The _text_ attribute will be our focus today!

In [5]:
import json

restaurant_ids = set()

# open the businesses file
with open(businesses_filepath, encoding='utf_8') as f:
    
    # iterate through each line (json record) in the file
    for business_json in f:
        
        # convert the json record to a Python dict
        business = json.loads(business_json)
        
        # if this business is not a restaurant, skip to the next one
        if u'Restaurants' not in business[u'categories']:
            continue
            
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business[u'business_id'])

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(restaurant_ids)

# print the number of unique restaurant ids in the dataset
print('{:,}'.format(len(restaurant_ids)), u'restaurants in the dataset.')

54,618 restaurants in the dataset.


In [6]:
intermediate_directory = os.path.join(data_directory, 'intermediate')

review_txt_filepath = os.path.join(intermediate_directory,
                                   'review_text_all.txt')

In [7]:
if not os.path.exists(intermediate_directory):
    os.makedirs(intermediate_directory)

In [8]:
with open(review_txt_filepath, 'w+', encoding='utf_8') as review_txt_file:
    pass

In [9]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
max_rev = 10000
process = True
process = False

review_count = 0


if process:
    # create & open a new file in write mode
    with open(review_txt_filepath, 'w+', encoding='utf_8') as review_txt_file:

        # open the existing review json file
        with open(review_json_filepath, encoding='utf_8') as review_json_file:

            # loop through all reviews in the existing file and convert to dict
            for review_json in review_json_file:
                review = json.loads(review_json)

                # if this review is not about a restaurant, skip to the next one
                if review[u'business_id'] not in restaurant_ids:
                    continue

                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                review_txt_file.write(review[u'text'].replace('\n', '\\n') + '\n')
                review_count += 1
                if review_count >= max_rev:
                    break

    print(u'''Text from {:,} restaurant reviews
              written to the new txt file.'''.format(review_count))
    
else:
    with open(review_txt_filepath, encoding='utf_8') as review_txt_file:
        for review_count, line in enumerate(review_txt_file):
            pass
        
    print(u'Text from {:,} restaurant reviews in the txt file.'.format(review_count + 1))

Text from 1 restaurant reviews in the txt file.
CPU times: user 159 µs, sys: 81 µs, total: 240 µs
Wall time: 211 µs


In [9]:
with open(review_txt_filepath, encoding='utf_8') as f:
    sample_review = list(it.islice(f, 340, 341))[0]
    sample_review = sample_review.replace('\\n', '\n')
        
print(sample_review)

I awake, wild eyed in a cold sweat, visions of caramel creme macaroons soliciting me like high class tricks in a whorehouse, Croque Monsieurs their swanky pimps, demanding payment for service. Essence strikes again. The only way to vanquish these dreams is to lunch here on a regular basis. 

I've been here twice in the past week alone and I think I'm addicted. The atmosphere, product, service, and baked goods have seduced me with their simple elegance. The only gripe I possibly have about this place is the fact that the parking lot is the size of a matchbox, and with Sacks, Cafe Lalibella and Starbucks all occupying the same plaza, finding a place to park during the busy lunch rush is like simultaneously playing Tetris and Frogger.

The atmosphere at Essence is understated in its sophistication. Windows line the majority of the dining room, creating an inviting, open air feeling and the attached patio is charming. Strong earth toned colors and cute little origami flower bouquets grace 

In [10]:
%%time
parsed_review = nlp(sample_review)

CPU times: user 208 ms, sys: 40 ms, total: 248 ms
Wall time: 231 ms


In [11]:
print(parsed_review)

I awake, wild eyed in a cold sweat, visions of caramel creme macaroons soliciting me like high class tricks in a whorehouse, Croque Monsieurs their swanky pimps, demanding payment for service. Essence strikes again. The only way to vanquish these dreams is to lunch here on a regular basis. 

I've been here twice in the past week alone and I think I'm addicted. The atmosphere, product, service, and baked goods have seduced me with their simple elegance. The only gripe I possibly have about this place is the fact that the parking lot is the size of a matchbox, and with Sacks, Cafe Lalibella and Starbucks all occupying the same plaza, finding a place to park during the busy lunch rush is like simultaneously playing Tetris and Frogger.

The atmosphere at Essence is understated in its sophistication. Windows line the majority of the dining room, creating an inviting, open air feeling and the attached patio is charming. Strong earth toned colors and cute little origami flower bouquets grace 

In [12]:
for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('###')

Sentence 1:
I awake, wild eyed in a cold sweat, visions of caramel creme macaroons soliciting me like high class tricks in a whorehouse, Croque Monsieurs their swanky pimps, demanding payment for service.
###
Sentence 2:
Essence strikes again.
###
Sentence 3:
The only way to vanquish these dreams is to lunch here on a regular basis. 


###
Sentence 4:
I've been here twice in the past week alone
###
Sentence 5:
and I think I'm addicted.
###
Sentence 6:
The atmosphere, product, service, and baked goods have seduced me with their simple elegance.
###
Sentence 7:
The only gripe I possibly have about this place is the fact that the parking lot is the size of a matchbox, and with Sacks, Cafe Lalibella and Starbucks
###
Sentence 8:
all occupying the same plaza, finding a place to park during the busy lunch rush is like simultaneously playing Tetris and Frogger.


###
Sentence 9:
The atmosphere at Essence is understated in its sophistication.
###
Sentence 10:
Windows line the majority of the d

In [13]:
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')

Entity 1: Croque Monsieurs - EVENT

Entity 2: the past week - DATE

Entity 3: Sacks - PRODUCT

Entity 4: Cafe Lalibella - GPE

Entity 5: Starbucks - ORG

Entity 6: Tetris - PRODUCT

Entity 7: Frogger - PRODUCT

Entity 8: Essence - ORG

Entity 9: Friday - DATE

Entity 10: Essence - ORG

Entity 11: Mill - PERSON

Entity 12: French - NORP

Entity 13: Greek - NORP

Entity 14: Essence - ORG

Entity 15: Little Debbie-esque - PERSON

Entity 16: One - CARDINAL

Entity 17: about three - CARDINAL

Entity 18: Essence - ORG

Entity 19: daily - DATE

Entity 20: 
 - GPE



In [14]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_pos)),
             columns=['token_text', 'part_of_speech'])

Unnamed: 0,token_text,part_of_speech
0,I,PRON
1,awake,VERB
2,",",PUNCT
3,wild,ADJ
4,eyed,VERB
5,in,ADP
6,a,DET
7,cold,ADJ
8,sweat,NOUN
9,",",PUNCT


In [15]:
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)),
             columns=['token_text', 'token_lemma', 'token_shape'])

Unnamed: 0,token_text,token_lemma,token_shape
0,I,-PRON-,X
1,awake,awake,xxxx
2,",",",",","
3,wild,wild,xxxx
4,eyed,eye,xxxx
5,in,in,xx
6,a,a,x
7,cold,cold,xxxx
8,sweat,sweat,xxxx
9,",",",",","


In [16]:
token_entity_type = [token.ent_type_ for token in parsed_review]
token_entity_iob = [token.ent_iob_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)),
             columns=['token_text', 'entity_type', 'inside_outside_begin'])

Unnamed: 0,token_text,entity_type,inside_outside_begin
0,I,,O
1,awake,,O
2,",",,O
3,wild,,O
4,eyed,,O
5,in,,O
6,a,,O
7,cold,,O
8,sweat,,O
9,",",,O


In [17]:
token_attributes = [(token.orth_,
#                      token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_review]

df = pd.DataFrame(token_attributes,
                  columns=['text',
#                            'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
                                               
df

Unnamed: 0,text,stop?,punctuation?,whitespace?,number?,out of vocab.?
0,I,,,,,Yes
1,awake,,,,,Yes
2,",",,Yes,,,Yes
3,wild,,,,,Yes
4,eyed,,,,,Yes
5,in,Yes,,,,Yes
6,a,Yes,,,,Yes
7,cold,,,,,Yes
8,sweat,,,,,Yes
9,",",,Yes,,,Yes


In [18]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

In [19]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_review(filename, max=None):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    

    with open(filename, encoding='utf_8') as f:
        for i, review in enumerate(f):
            if not max or i<= max:
                yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(line_review(filename),
                                  batch_size=10000, n_threads=4):
        
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

In [20]:
unigram_sentences_filepath = os.path.join(intermediate_directory,
                                          'unigram_sentences_all.txt')

In [21]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if process:

    with open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(review_txt_filepath):
            f.write(sentence + '\n')

CPU times: user 7min 47s, sys: 1min 57s, total: 9min 44s
Wall time: 7min 30s


In [22]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [23]:
for unigram_sentence in it.islice(unigram_sentences, 340, 350):
    print(u' '.join(unigram_sentence))
    print(u'')

decide to give this place a try base on the review

service be really slow on a thursday afternoon work day

-PRON- take almost 30 min before -PRON- get -PRON- food

definitely not the place to go during the work week

order a pork bone soup which be just ok

nothing special or memorable about -PRON-

not sure if -PRON- would come back again

this place be really good

the good restaurant in the area

the food there be really tasty especially the ahi tuna club sandwich lobster ravioli and the tandoori chicken flat bread



In [24]:
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')

In [25]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if process:

    bigram_model = Phrases(unigram_sentences)

    bigram_model.save(bigram_model_filepath)
    
# load the finished model from disk
bigram_model = Phrases.load(bigram_model_filepath)

CPU times: user 5.44 s, sys: 168 ms, total: 5.61 s
Wall time: 6.16 s


In [26]:
bigram_sentences_filepath = os.path.join(intermediate_directory,
                                         'bigram_sentences_all.txt')

In [27]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if process:

    with open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for unigram_sentence in unigram_sentences:
            
            bigram_sentence = u' '.join(bigram_model[unigram_sentence])
            
            f.write(bigram_sentence + '\n')



CPU times: user 11.3 s, sys: 178 ms, total: 11.5 s
Wall time: 12.4 s


In [28]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [29]:
for bigram_sentence in it.islice(bigram_sentences, 340, 350):
    print(u' '.join(bigram_sentence))
    print(u'')

decide to give this_place a try base_on the review

service be really slow on a thursday afternoon work day

-PRON- take almost 30_min before -PRON- get -PRON- food

definitely not the place to go during the work week

order a pork bone soup which be just ok

nothing_special or memorable about -PRON-

not sure if -PRON- would come_back again

this_place be really good

the good restaurant in the area

the food there be really tasty especially the ahi_tuna club sandwich lobster_ravioli and the tandoori_chicken flat_bread



In [30]:
trigram_model_filepath = os.path.join(intermediate_directory,
                                      'trigram_model_all')

In [31]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if process:

    trigram_model = Phrases(bigram_sentences)

    trigram_model.save(trigram_model_filepath)
    
# load the finished model from disk
trigram_model = Phrases.load(trigram_model_filepath)

CPU times: user 5.27 s, sys: 164 ms, total: 5.44 s
Wall time: 5.9 s


In [32]:
trigram_sentences_filepath = os.path.join(intermediate_directory,
                                          'trigram_sentences_all.txt')

In [33]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if process:

    with open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for bigram_sentence in bigram_sentences:
            
            trigram_sentence = u' '.join(trigram_model[bigram_sentence])
            
            f.write(trigram_sentence + '\n')



CPU times: user 11.1 s, sys: 195 ms, total: 11.3 s
Wall time: 12.2 s


In [None]:
for bigram_sentence in it.islice(tri, 340, 350):
    print(u' '.join(bigram_sentence))
    print(u'')

In [34]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [35]:
for trigram_sentence in it.islice(trigram_sentences, 340, 350):
    print(u' '.join(trigram_sentence))
    print(u'')

decide to give this_place a try base_on the review

service be really slow on a thursday afternoon work day

-PRON- take almost 30_min before -PRON- get -PRON- food

definitely not the place to go during the work week

order a pork bone soup which be just ok

nothing_special or memorable about -PRON-

not sure if -PRON- would come_back_again

this_place be really good

the good restaurant in the area

the food there be really tasty especially the ahi_tuna club sandwich lobster_ravioli and the tandoori_chicken flat_bread



In [36]:
import re
from collections import Counter

p = re.compile('.*_.*_.*')

selected = []
trigrams = Counter()

for trigram_sentence_ext in trigram_sentences:
    for word in trigram_sentence_ext:
        if p.search(word):
            selected.append(trigram_sentence_ext)
            trigrams[word] += 1
            
        
print(len(selected))
trigrams

5682


Counter({'10_15_minute': 7,
         '10_minute_before': 11,
         '10_minute_later': 6,
         '10_year_ago': 9,
         '15_20_minute': 9,
         '15_minute_before': 7,
         '20_minute_later': 7,
         '2_star_because': 10,
         '2_year_old': 9,
         '3_1/2_star': 6,
         '45_minute_wait': 9,
         '4_year_old': 7,
         '5_10_minute': 8,
         '5_year_old': 7,
         '8_year_ago': 5,
         'a___erto': 1,
         'about_$_10': 20,
         'about_10_minute': 23,
         'about_15_minute': 22,
         'about_20_minute': 27,
         'about_five_minute': 9,
         'about_ten_minute': 10,
         'always_leave_satisfied': 6,
         'an_add_bonus': 7,
         'an_early_dinner': 12,
         'an_empty_stomach': 7,
         'an_enjoyable_experience': 6,
         'an_extra_star': 8,
         'an_hour_later': 6,
         'an_old_house': 7,
         'another_10_minute': 12,
         'another_15_minute': 11,
         'answer_any_question': 7,
 

In [37]:
trigrams.most_common(50)

[('do_not_know', 259),
 ('as_well_as', 174),
 ('look_forward_to', 164),
 ('do_not_care', 126),
 ('do_not_mind', 87),
 ('as_soon_as', 84),
 ('do_not_disappoint', 82),
 ('do_not_know_what', 79),
 ('big_fan_of', 76),
 ('right_amount_of', 75),
 ('in_term_of', 73),
 ('not_go_wrong', 72),
 ('sweet_potato_fry', 69),
 ('run_out_of', 68),
 ('mac_n_cheese', 65),
 ('recommend_this_place', 62),
 ('take_care_of', 62),
 ('for_those_who', 61),
 ('huge_fan_of', 56),
 ('for_some_reason', 54),
 ('write_home_about', 54),
 ('start_off_with', 53),
 ('do_not_know_how', 53),
 ('definitely_come_back', 51),
 ('not_too_sweet', 50),
 ('would_definitely_recommend', 47),
 ('some_sort_of', 46),
 ('have_no_idea', 45),
 ('come_back_again', 44),
 ('will_definitely_return', 44),
 ('will_definitely_come_back', 42),
 ('as_an_appetizer', 41),
 ('go_back_again', 40),
 ('during_happy_hour', 39),
 ('at_least_once', 39),
 ('do_not_bother', 35),
 ('more_than_enough', 33),
 ('seat_right_away', 31),
 ('not_sure_why', 30),
 ('do_

In [38]:
trigram_reviews_filepath = os.path.join(intermediate_directory,
                                        'trigram_transformed_reviews_all.txt')

In [39]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if process:

    with open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:
        
        for parsed_review in nlp.pipe(line_review(review_txt_filepath, 0),
                                      batch_size=10000, n_threads=4):
            
            
            # lemmatize the text, removing punctuation and whitespace
            unigram_review = [token.lemma_ for token in parsed_review
                              if not punct_space(token)]
            
            # apply the first-order and second-order phrase models
            bigram_review = bigram_model[unigram_review]
            trigram_review = trigram_model[bigram_review]
            
            # remove any remaining stopwords
            trigram_review = [term for term in trigram_review
                              if term not in spacy.lang.en.STOP_WORDS]
            
            # write the transformed review as a line in the new file
            trigram_review = u' '.join(trigram_review)
            f.write(trigram_review + '\n')
            



CPU times: user 8min 3s, sys: 2min, total: 10min 3s
Wall time: 7min 46s


In [40]:
print(u'Original:' + u'\n')

for review in it.islice(line_review(review_txt_filepath), 11, 12):
    print(review)

print(u'----' + u'\n')
print(u'Transformed:' + u'\n')

with open(trigram_reviews_filepath, encoding='utf_8') as f:
    for review in it.islice(f, 11, 12):
        print(review)

Original:

This place is awesome! Definitely authentic!!!

My two favourite dishes are the rice flour rolls and the chicken pho. The rice flour rolls are always fresh whenever I'm there! And the chicken pho is always flavourful!! mmmm....just thinking of it makes me want some!

My boyfriend is Vietnamese and he agrees that this place is authentic and one of the best Vietnamese restaurants he has ever eaten at.

Prices are very reasonable too!

----

Transformed:

this_place awesome definitely authentic -PRON- favourite dish rice flour roll chicken pho rice flour roll fresh -PRON- chicken pho flavourful mmmm think -PRON- -PRON- want -PRON- boyfriend vietnamese -PRON- agree this_place authentic good vietnamese_restaurant -PRON- eat price very_reasonable



## Topic Modeling with Latent Dirichlet Allocation (_LDA_)

*Topic modeling* is family of techniques that can be used to describe and summarize the documents in a corpus according to a set of latent "topics". For this demo, we'll be using [*Latent Dirichlet Allocation*](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) or LDA, a popular approach to topic modeling.

In many conventional NLP applications, documents are represented a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a *vector* of token counts. There are two layers in this model &mdash; documents and tokens &mdash; and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. This approach has a number of disadvantages:
* Document vectors tend to be large (one dimension for each token $\Rightarrow$ lots of dimensions)
* They also tend to be very sparse. Any given document only contains a small fraction of all tokens in the vocabulary, so most values in the document's token vector are 0.
* The dimensions are fully indepedent from each other &mdash; there's no sense of connection between related tokens, such as _knife_ and _fork_.

LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of *topics*, and the *topics* are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow [*Dirichlet*](https://en.wikipedia.org/wiki/Dirichlet_distribution) probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of the tokens.

In [10]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import warnings
import pickle

In [42]:
trigram_dictionary_filepath = os.path.join(intermediate_directory,
                                           'trigram_dict_all.dict')

In [43]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to learn the dictionary yourself.
if process:

    trigram_reviews = LineSentence(trigram_reviews_filepath)

    # learn the dictionary by iterating over all of the reviews
    trigram_dictionary = Dictionary(trigram_reviews)
    
    # filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify)
    trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
    trigram_dictionary.compactify()

    trigram_dictionary.save(trigram_dictionary_filepath)
    
# load the finished dictionary from disk
trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)

CPU times: user 1.26 s, sys: 19 ms, total: 1.28 s
Wall time: 1.34 s


In [44]:
trigram_bow_filepath = os.path.join(intermediate_directory,
                                    'trigram_bow_corpus_all.mm')

In [45]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)

In [46]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to build the bag-of-words corpus yourself.
if process:

    # generate bag-of-words representations for
    # all reviews and save them as a matrix
    MmCorpus.serialize(trigram_bow_filepath,
                       trigram_bow_generator(trigram_reviews_filepath))
    
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)

CPU times: user 1.9 s, sys: 57 ms, total: 1.96 s
Wall time: 2.09 s


In [47]:
lda_model_filepath = os.path.join(intermediate_directory, 'lda_model_all')

In [102]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to train the LDA model yourself.

topic_num = 10

if process:

    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(trigram_bow_corpus,
                           num_topics=topic_num,
                           id2word=trigram_dictionary,
                           workers=3)
    
    lda.save(lda_model_filepath)
    
# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)

Process ForkPoolWorker-10:
Process ForkPoolWorker-11:
Process ForkPoolWorker-9:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run


KeyboardInterrupt: 

In [89]:
lda.get_topics().shape

(10, 82)

In [90]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')

    for term, frequency in lda.show_topic(topic_number, topn=25):
        print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

In [91]:
explore_topic(topic_number=2)

term                 frequency

sandwich             0.099
time                 0.055
restaurant           0.051
tasty                0.050
place                0.047
this_place           0.047
chicken              0.047
area                 0.047
amazing              0.047
serve                0.047
's                   0.047
soup                 0.033
overall              0.014
taste                0.013
great                0.012
tell                 0.012
friend               0.011
salad                0.010
sort_of              0.008
ask                  0.008
dish                 0.008
bread                0.008
sure                 0.008
atmosphere           0.008
review               0.008


In [92]:
for topic in range(topic_num):
    print("topic ", topic)
    print(explore_topic(topic_number=topic, topn=10))
    print('###')
    print()

topic  0
term                 frequency

this_place           0.075
place                0.068
ask                  0.036
nice                 0.034
want                 0.033
think                0.028
great                0.027
salad                0.026
do_not               0.026
husband              0.024
price                0.023
sandwich             0.023
soup                 0.023
service              0.022
's                   0.021
try                  0.018
thing                0.018
eat                  0.017
decide               0.016
like                 0.016
drink                0.016
bad                  0.016
dish                 0.015
atmosphere           0.014
chicken              0.014
None
###

topic  1
term                 frequency

great                0.042
come                 0.035
service              0.030
place                0.029
look                 0.028
think                0.028
delicious            0.026
small                0.024
drink            

In [12]:
LDAvis_data_filepath = os.path.join(intermediate_directory, 'ldavis_prepared')

In [95]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if process:

    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus,
                                              trigram_dictionary, mds='mmds')

    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
        
# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

CPU times: user 2.24 ms, sys: 1.04 ms, total: 3.29 ms
Wall time: 20.8 ms


In [13]:
# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

In [14]:
pyLDAvis.display(LDAvis_prepared)

In [97]:
def get_sample_review(review_number):
    """
    retrieve a particular review index
    from the reviews file and return it
    """
    
    return list(it.islice(line_review(review_txt_filepath),
                          review_number, review_number+1))[0]

In [98]:
def lda_description(review_text, min_topic_freq=0.05):
    """
    accept the original text of a review and (1) parse it with spaCy,
    (2) apply text pre-proccessing steps, (3) create a bag-of-words
    representation, (4) create an LDA representation, and
    (5) print a sorted list of the top topics in the LDA representation
    """
    
    # parse the review text with spaCy
    parsed_review = nlp(review_text)
    
    # lemmatize the text and remove punctuation and whitespace
    unigram_review = [token.lemma_ for token in parsed_review
                      if not punct_space(token)]
    
    # apply the first-order and secord-order phrase models
    bigram_review = bigram_model[unigram_review]
    trigram_review = trigram_model[bigram_review]
    
    # remove any remaining stopwords
    trigram_review = [term for term in trigram_review
                      if not term in spacy.lang.en.STOP_WORDS]
    
    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(trigram_review)
    
    # create an LDA representation
    review_lda = lda[review_bow]
    
    # sort with the most highly related topics first
    review_lda = sorted(review_lda, key=lambda X : -X[1])
    
    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        print('{:25} {}'.format(topic_number,
                                round(freq, 3)))

In [99]:
sample_review = get_sample_review(50)
print(sample_review)

Small little Japanese restaurant in the Don Mills neighbourhood. Lots of different rolls to pick from. 

Great lunch special selection.

Service is a little slow though. So if you only have 1 hour for lunch, may not be the best place. Take out is pretty quick though.



In [100]:
lda_description(sample_review)



IndexError: index 138 is out of bounds for axis 1 with size 82

In [101]:
sample_review2 = get_sample_review(100)
print(sample_review2)
lda_description(sample_review2)

This place is a bit of an adventure...at least for me. I've never had charcuterie before and I'm not generally a big meat eater, so this was a little out of the box. That said, I like trying to be adventurous and trying new things, so when are visiting friends suggested we give it a go, we went for it. Dinner service is open at 6pm, so we agreed we'd arrive at 6pm. We got there at about 6:15....there was already a HUGE wait for a table, but we managed to wrangle seats at the bar. 

The four of us ordered a bunch of dishes to split between ourselves. We tried the charcuterie (sort of a platter of assorted sliced meats and pâté), horse tar tar, roasted bone marrow, pork carnitas tacos, tongue on brioche and the foie gras and nutella for dessert. 

The dishes came out staggered, so we got to sit around chatting and enjoying ourselves and our cocktails. The charcuterie came out first. It was assorted sliced meats arranged on a wood plank and served with bread. It was actually really nice. 



IndexError: index 82 is out of bounds for axis 1 with size 82

In [50]:
from gensim.models import Word2Vec

trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')

In [51]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to train the word2vec model yourself.
if process:

    # initiate the model and perform the first epoch of training
    food2vec = Word2Vec(trigram_sentences, size=100, window=5,
                        min_count=20, sg=1, workers=4)
    
    food2vec.save(word2vec_filepath)

    # perform another 11 epochs of training
    for i in range(1,12):
        food2vec.train(trigram_sentences, total_examples=food2vec.corpus_count, epochs=food2vec.epochs)
        food2vec.save(word2vec_filepath)
        
# load the finished model from disk
food2vec = Word2Vec.load(word2vec_filepath)
food2vec.init_sims()

print(u'{} training epochs so far.'.format(food2vec.train_count))

  if hasattr(self, attrib):
  asides[attrib] = getattr(self, attrib)
  delattr(self, attrib)
  setattr(obj, attrib, val)
  if hasattr(self, attrib):
  asides[attrib] = getattr(self, attrib)
  delattr(self, attrib)
  setattr(obj, attrib, val)
  if hasattr(self, attrib):
  asides[attrib] = getattr(self, attrib)
  delattr(self, attrib)
  setattr(obj, attrib, val)
  if hasattr(self, attrib):
  asides[attrib] = getattr(self, attrib)
  delattr(self, attrib)
  setattr(obj, attrib, val)
  if hasattr(self, attrib):
  asides[attrib] = getattr(self, attrib)
  delattr(self, attrib)
  setattr(obj, attrib, val)
  if hasattr(self, attrib):
  asides[attrib] = getattr(self, attrib)
  delattr(self, attrib)
  setattr(obj, attrib, val)
  if hasattr(self, attrib):
  asides[attrib] = getattr(self, attrib)
  delattr(self, attrib)
  setattr(obj, attrib, val)
  if hasattr(self, attrib):
  asides[attrib] = getattr(self, attrib)
  delattr(self, attrib)
  setattr(obj, attrib, val)
  if hasattr(self, attrib):
  as

  if hasattr(self, attrib):
  asides[attrib] = getattr(self, attrib)
  delattr(self, attrib)
  setattr(obj, attrib, val)
  if hasattr(self, attrib):
  asides[attrib] = getattr(self, attrib)
  delattr(self, attrib)
  setattr(obj, attrib, val)


12 training epochs so far.
CPU times: user 6min 20s, sys: 4.2 s, total: 6min 25s
Wall time: 2min 34s


  if hasattr(self, attrib):
  asides[attrib] = getattr(self, attrib)
  delattr(self, attrib)
  setattr(obj, attrib, val)
  setattr(self, attrib, None)


In [52]:
print(u'{:,} terms in the food2vec vocabulary.'.format(len(food2vec.wv.vocab)))

3,629 terms in the food2vec vocabulary.


In [53]:
# build a list of the terms, integer indices,
# and term counts from the food2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count)
                 for term, voc in food2vec.wv.vocab.items()]

def sort_func(tup):
    term, indec, count = tup
    return -count

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=sort_func)

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the food2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(food2vec.wv.vectors_norm[term_indices, :],
                            index=ordered_terms)

word_vectors

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
-PRON-,-0.192540,-0.137421,0.087687,-0.041500,-0.046415,0.062379,0.016884,-0.103175,0.140363,0.088608,...,0.015603,-0.126255,-0.051399,0.037474,0.139312,0.042000,-0.059149,0.078348,-0.050734,-0.068732
be,-0.096318,-0.144332,-0.212839,-0.052480,0.036218,0.219431,0.000228,-0.034399,0.202455,-0.034905,...,0.006305,0.000102,0.030234,0.043527,-0.017617,0.009483,-0.145317,0.228525,-0.069701,-0.114176
the,0.000089,-0.028026,-0.041021,0.111821,-0.085819,0.043335,-0.056748,-0.131266,0.042512,0.022696,...,-0.188833,-0.068218,-0.112790,0.042619,-0.024181,0.014776,-0.080565,0.028031,-0.114328,0.064669
and,-0.189819,-0.073866,0.065609,0.040589,-0.076059,0.059223,0.021162,-0.162560,0.112582,-0.030728,...,-0.106540,0.071831,-0.049181,0.066906,-0.054017,-0.077172,-0.196286,0.115795,-0.091203,0.083549
a,0.030527,0.019898,-0.173009,0.110138,0.061045,0.147410,0.076303,-0.185216,-0.074776,-0.000041,...,-0.015731,0.120600,-0.101515,0.006418,-0.044200,-0.026002,-0.085152,0.075556,0.042961,-0.011100
to,-0.184892,0.053593,0.148398,-0.002422,-0.278245,0.127144,-0.046509,-0.088389,0.231082,0.091001,...,0.131239,0.017966,-0.044981,0.002161,0.141389,-0.029075,-0.197160,-0.052363,-0.029980,-0.062437
have,-0.263948,-0.139392,-0.094716,-0.105031,0.029591,0.186383,0.089885,-0.073477,0.127601,-0.038056,...,-0.042434,0.003974,-0.080431,-0.017601,0.019746,0.111616,-0.128615,0.125177,-0.054404,0.176189
of,-0.064054,-0.021256,-0.037077,-0.136387,0.130824,0.011794,-0.070115,-0.086504,0.034637,0.240778,...,-0.142515,0.128248,-0.085780,0.114970,-0.082709,0.076574,-0.083384,0.117512,-0.182946,-0.000966
for,-0.010958,-0.093656,-0.013771,0.135409,0.151720,0.223458,0.132373,0.050078,0.031994,0.018673,...,-0.091269,0.116898,0.030665,-0.040892,0.132972,-0.132993,-0.290694,0.164789,-0.119832,-0.133160
in,-0.127789,-0.014818,-0.066206,-0.095683,0.045466,0.096631,0.072912,0.010753,0.061598,0.327738,...,-0.108266,0.009744,-0.000948,0.109708,0.077226,0.040942,-0.131442,-0.055605,-0.031663,-0.078908


In [54]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in food2vec.wv.most_similar(positive=[token], topn=topn):

        print(u'{:20} {}'.format(word, round(similarity, 3)))

In [55]:
get_related_terms(u'burger')

slider               0.601
sandwich             0.547
wing                 0.526
bacon                0.526
turkey_burger        0.525
fry                  0.508
filet                0.499
cheeseburger         0.495
lettuce_tomato       0.486
tater_tot            0.472


In [56]:
get_related_terms(u'pizza')

pepperoni            0.602
pie                  0.575
salami               0.519
bbq_chicken          0.515
thin_crust           0.507
mozzarella           0.495
wing                 0.49
lasagna              0.475
cheeseburger         0.47
burger               0.467


In [57]:
get_related_terms(u'happy_hour')

tuesday              0.572
during_happy_hour    0.551
10_p.m.              0.549
hh                   0.534
happy_hour_special   0.524
every_day            0.517
wednesday            0.504
friday               0.497
thursday             0.494
7_p.m.               0.494


In [58]:
get_related_terms(u'pasta', topn=20)

marinara             0.587
penne                0.563
bolognese            0.561
mushroom             0.544
spinach              0.539
cream_sauce          0.523
arugula              0.519
lasagna              0.516
risotto              0.506
sausage              0.497
parmesan             0.496
mozzarella           0.489
carbonara            0.483
tomato               0.483
soup                 0.482
spaghetti            0.482
veal                 0.479
pesto                0.468
lamb                 0.467
pizza                0.463


In [59]:
def word_algebra(add=[], subtract=[], topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    answers = food2vec.wv.most_similar(positive=add, negative=subtract, topn=topn)
    
    for term, similarity in answers:
        print( term)

In [60]:
word_algebra(add=[u'breakfast', u'lunch'])

brunch


In [61]:
word_algebra(add=[u'lunch', u'night'], subtract=[u'day'])

dinner


In [62]:
word_algebra(add=[u'taco', u'chinese'], subtract=[u'mexican'])

carnita


In [63]:
word_algebra(add=[u'bun', u'mexican'], subtract=[u'american'])

tortilla


In [64]:
word_algebra(add=[u'filet_mignon', u'seafood'], subtract=[u'beef'])

desert


In [65]:
word_algebra(add=[u'coffee', u'snack'], subtract=[u'drink'])

hazelnut


In [66]:
word_algebra(add=[u'burger', u'fine_dining'])

fast_food_joint


In [67]:
from sklearn.manifold import TSNE

In [68]:
tsne_input = word_vectors.drop(spacy.lang.en.STOP_WORDS, errors=u'ignore')
tsne_input = tsne_input.head(5000)

In [69]:
tsne_input.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
-PRON-,-0.19254,-0.137421,0.087687,-0.0415,-0.046415,0.062379,0.016884,-0.103175,0.140363,0.088608,...,0.015603,-0.126255,-0.051399,0.037474,0.139312,0.042,-0.059149,0.078348,-0.050734,-0.068732
good,-0.18819,-0.06703,-0.128134,-0.042151,0.12797,0.109065,0.065108,-0.068333,0.096048,-0.027358,...,-0.113426,-0.200908,-0.002371,-0.08659,-0.028552,0.098042,0.027771,-0.002126,0.019668,-0.142265
food,-0.094605,-0.057565,0.01909,0.052923,-0.10534,-0.125611,0.02678,0.035995,0.088988,0.21887,...,0.052335,-0.060964,0.073406,-0.170049,0.042704,0.177785,-0.025385,-0.051239,-0.15938,-0.175729
order,-0.095767,-0.23151,0.04144,0.074546,-0.036252,0.052206,0.04943,-0.073733,-0.031316,-0.016103,...,-0.05173,0.077677,0.050358,0.009253,-0.0024,0.193037,-0.060212,0.041352,0.068094,0.041941
great,-0.117353,-0.018317,-0.026486,0.144878,0.142012,0.094409,0.01471,-0.11492,0.112542,0.031354,...,-0.214965,0.07244,0.065399,-0.138333,-0.23828,0.055685,0.005077,0.012492,-0.005923,-0.070824


In [5]:
tsne_filepath = os.path.join(intermediate_directory,
                             u'tsne_model')

tsne_vectors_filepath = os.path.join(intermediate_directory,
                                     u'tsne_vectors.npy')

NameError: name 'os' is not defined

In [71]:
%%time

if process:
    
    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)
    
    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)

    pd.np.save(tsne_vectors_filepath, tsne_vectors)
    
with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)
    
tsne_vectors = pd.np.load(tsne_vectors_filepath)

tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=[u'x_coord', u'y_coord'])

CPU times: user 1min 55s, sys: 9.61 s, total: 2min 5s
Wall time: 2min 8s


In [72]:
tsne_vectors.head()

Unnamed: 0,x_coord,y_coord
-PRON-,-4.02369,-14.034254
good,-0.155021,-27.410044
food,18.706568,35.679836
order,4.488968,-15.051637
great,1.070059,-28.578657


In [73]:
tsne_vectors[u'word'] = tsne_vectors.index

In [74]:
%load_ext autoreload
%autoreload 1
%aimport bokeh

In [75]:
%reload_ext autoreload

In [76]:
%autoreload 2

In [77]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()

In [78]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title=u't-SNE Word Embeddings',
                   plot_width = 800,
                   plot_height = 800,
                   tools= (u'pan, wheel_zoom, box_zoom,'
                           u'box_select, reset'),
                   active_scroll=u'wheel_zoom')

# add a hover tool to display words on roll-over
tsne_plot.add_tools( HoverTool(tooltips = u'@word') )

# draw the words as circles on the plot
tsne_plot.circle(u'x_coord', u'y_coord', source=plot_data,
                 color=u'blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color=u'black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);

In [79]:
from textblob      import TextBlob


In [80]:
sentiments = {'positive': [],
             'neutral': [],
              'negative': []}
# sentiments = Counter()
with open(review_txt_filepath) as rf:
    positive = 0
    neutral  = 0
    negative = 0
    for i, sent in enumerate(rf):
        analysis  = TextBlob(sent)
        sentiment = analysis.sentiment.polarity

        if sentiment > 0:
            positive += 1
            sentiments['positive'].append(sent)
        elif sentiment == 0:
            neutral += 1
            sentiments['neutral'].append(sent)

        else:
            negative += 1
            sentiments['negative'].append(sent)

        if i > 500:
            break

print(positive, neutral, negative)

456 4 42


In [81]:
sentiments['neutral']

['burgers are very big portions here.\\n\\ndefinitely order the onion ring tower to share...\\n\\nMilkshakes are tasty! My personal favourite - the vanilla one.\n',
 "Ate here with my girlfriend this evening. She had the Mongolian entree and I had the Pad Cha Cha Cha entree. Both were excellent.\\n\\nWe've also eaten here once before and had no complaints that time either.\\n\\nService was prompt and friendly.\\n\\nWe'll definitely be back again.\n",
 'I have only three words for this place. Yum. Yum. and Yum.\n',
 'I have since been back twice, once for breakfast and another time for lunch. Stan, thank you for sacking whoever was dicking off the thing last time I came in to visit.\\n\\n"That\'ll do, Pig. That\'ll do." Seal of Approval.\n']

In [82]:
sentiments['negative'][:10]

['Server was a little rude.\\n\\nOrdered the calamari, duck confit poutine and the trout fish with miso soba - all very tasty. Definitely not your typical diner.\n',
 'Wanted to check out this place due to all the hype I had heard. My friend wanted to come here for her birthday. We had a group of seven and based on the way the seating works (as it is communal), we wind up sharing with another group of five. \\n\\nFood was ok - not sure what the hype was. Almost 75% of the dishes had some sort of cheese in it.\\n\\nThis place is very loud so not ideal for catching up/talking. Not a great place to go if you have any gluten or vegan restrictions.\\n\\nInteresting concept. Would definitely recommend to try this place once - but definitely not worth the 30- 45 mins wait.\n',
 "really excited to hear of this restaurant coming to Toronto. When it finally opened, my friend and I were really excited to try this place.\\n\\nService here is not great, it felt like they had forgotten about us and 

In [3]:
from spacy import displacy
import spacy

In [4]:
nlp = spacy.load('en')

In [5]:
s = sentiments['negative'][9].replace("\\n"," . ")
s
doc = nlp(s)
doc

NameError: name 'sentiments' is not defined

In [2]:
s = "'Wanted to check out this place due to all the hype I had heard. My friend wanted to come here for her birthday. We had a group of seven and based on the way the seating works (as it is communal), we wind up sharing with another group of five. \\n\\nFood was ok - not sure what the hype was. Almost 75% of the dishes had some sort of cheese in it.\\n\\nThis place is very loud so not ideal for catching up/talking. Not a great place to go if you have any gluten or vegan restrictions.\\n\\nInteresting concept. Would definitely recommend to try this place once - but definitely not worth the 30- 45 mins wait.\n',"

In [13]:
print(nlp(s))

'Wanted to check out this place due to all the hype I had heard. My friend wanted to come here for her birthday. We had a group of seven and based on the way the seating works (as it is communal), we wind up sharing with another group of five. \n\nFood was ok - not sure what the hype was. Almost 75% of the dishes had some sort of cheese in it.\n\nThis place is very loud so not ideal for catching up/talking. Not a great place to go if you have any gluten or vegan restrictions.\n\nInteresting concept. Would definitely recommend to try this place once - but definitely not worth the 30- 45 mins wait.
',


In [14]:
print(displacy.render(nlp(s), style='dep'))
# displacy.render(nlp(s), style='dep',jupyter=True, options={'distance' : 100})


<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" id="0" class="displacy" width="20350" height="1012.0" style="max-width: none; height: 1012.0px; color: #000000; background: #ffffff; font-family: Arial">
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="922.0">
    <tspan class="displacy-word" fill="currentColor" x="50">'</tspan>
    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="50">PUNCT</tspan>
</text>

<text class="displacy-token" fill="currentColor" text-anchor="middle" y="922.0">
    <tspan class="displacy-word" fill="currentColor" x="225">Wanted</tspan>
    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="225">VERB</tspan>
</text>

<text class="displacy-token" fill="currentColor" text-anchor="middle" y="922.0">
    <tspan class="displacy-word" fill="currentColor" x="400">to</tspan>
    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="400">PART</tspan>
</text>

<text class="displacy-token" 

In [16]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [17]:
sentiments_vader = {'positive': [],
             'neutral': [],
              'negative': []}
# sentiments = Counter()
analyzer = SentimentIntensityAnalyzer()

with open(review_txt_filepath) as rf:
    positive = 0
    neutral  = 0
    negative = 0
    for i, sent in enumerate(rf):
        sentiment_scores = analyzer.polarity_scores(sent)
        sentiment = get_sentiment(sentiment_scores)

        if sentiment == 'pos':
            positive += 1
            sentiments_vader['positive'].append(sent)
        elif sentiment == 'neu':
            neutral += 1
            sentiments_vader['neutral'].append(sent)
        else:
            negative += 1
            sentiments_vader['negative'].append(sent)

        if i > 500:
            break

print(positive, neutral, negative)

NameError: name 'review_txt_filepath' is not defined

In [107]:
def get_sentiment(sentiment_values):
    res = sorted(sentiment_values.items(), key=lambda x: x[1])
    if res[-1][0] != 'compound':
        return res[-1][0]
    else:
        return res[-2][0]

test_score = {'compound': 0.9551, 'neg': 0.0, 'neu': 0.645, 'pos': 0.355}
get_sentiment(test_score)


'neu'

In [109]:
# positive sentiment: compound score >= 0.5
# neutral sentiment: (compound score > -0.5) and (compound score < 0.5)
# negative sentiment: compound score <= -0.5

def get_sentiment2(sentiment_values):
    if sentiment_values['compound'] > 0.5:
        return 'pos'
    elif sentiment_values['compound'] < -0.5:
        return 'neg'
    else:
        return 'neu'
    

test_score = {'compound': 0.9551, 'neg': 0.0, 'neu': 0.645, 'pos': 0.355}
get_sentiment(test_score)


'neu'

In [110]:
sentiments_vader = {'positive': [],
             'neutral': [],
              'negative': []}
# sentiments = Counter()
analyzer = SentimentIntensityAnalyzer()

with open(review_txt_filepath) as rf:
    positive = 0
    neutral  = 0
    negative = 0
    for i, sent in enumerate(rf):
        sent = sent.replace('\\n', ' . ')
        sentiment_scores = analyzer.polarity_scores(sent)
        sentiment = get_sentiment2(sentiment_scores)

        if sentiment == 'pos':
            positive += 1
            sentiments_vader['positive'].append(sent)
        elif sentiment == 'neu':
            neutral += 1
            sentiments_vader['neutral'].append(sent)
        else:
            negative += 1
            sentiments_vader['negative'].append(sent)

        if i > 500:
            break

print(positive, neutral, negative)

431 46 25


In [None]:
el = "worse customer service ever. \\nManager on duty was rude. She didn't care that I had negative feelings about this place when I said that I would never come back again!\\nRestaurant has gone downhill since they renovated!!\n"

In [None]:
el.split('\\n')

In [None]:
sentiments_vader['negative'][:10]

In [None]:
sentiments_vader['neutral'][:10]

In [111]:
experiment = ["I kill a good human", "I kill a bad human", "I save a good killer"]
for sent in experiment:
    print(analyzer.polarity_scores(sent))


{'neg': 0.547, 'neu': 0.116, 'pos': 0.337, 'compound': -0.4215}
{'neg': 0.891, 'neu': 0.109, 'pos': 0.0, 'compound': -0.8481}
{'neg': 0.413, 'neu': 0.0, 'pos': 0.587, 'compound': 0.2023}
