# Yelp Academic dataset

Yelp released an [academic dataset](https://www.yelp.com/dataset/challenge) to engage ML researchers in solving their problems, particularly those related to image processing and natural language processing (NLP).

We'll do an intro to NLP methods using some functions I wrote that simplify some of the processes involved. I use mostly Spacy and Gensim here, but here are some libraries that I've found helpful in other applications:

1. [Spacy](https://spacy.io/) - tokenizing, tagging, part of speech, entity recognition, etc.
2. [Gensim](https://radimrehurek.com/gensim/) - topic modeling, phrase modeling, etc.
3. [NLTK](https://www.nltk.org/) - tokenizing, tagging, part of speech, entity recognition, etc.
4. [Pattern](https://www.clips.uantwerpen.be/pattern) - tokenizing, tagging, part of speech, entity recognition, etc.
5. [pyLDAvis](https://github.com/bmabey/pyLDAvis) - visualize LDA topics (port from R)
6. [wordcloud](https://github.com/amueller/word_cloud) - creating word clouds
7. [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) - fuzzy text matching

# Read in some data

In [4]:
import sys, pandas as pd, numpy as np, os, random, ujson, codecs
scriptpath = "../Code/"
sys.path.append(scriptpath)
import spacy, re, gensim, json, pyLDAvis, warnings # numpy, pandas and codecs are already imported above
from gensim.models import Phrases, Word2Vec
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore
from gensim.models.word2vec import LineSentence
from collections import Counter
from itertools import chain
from matplotlib import pyplot as plt
from sklearn.manifold import TSNE
import basic_text_processing_functions as tx
floc = 'C:/Users/yangy/Documents/Data/Yelp/yelp_dataset_challenge_round9/yelp_dataset_challenge_round9/' # Download the dataset locally and change the floc to the correct place

In [5]:
fnames = os.listdir(floc) # these are the files that the dataset comes with (there's another folder for photos)
fnames

['._Dataset_Challenge_Dataset_Agreement.pdf',
 '._Yelp_Dataset_Challenge_Terms_round_9.pdf',
 'Dataset_Challenge_Dataset_Agreement.pdf',
 'PaxHeader',
 'texts_only.txt',
 'yelp_academic_dataset_business.json',
 'yelp_academic_dataset_checkin.json',
 'yelp_academic_dataset_review.json',
 'yelp_academic_dataset_tip.json',
 'yelp_academic_dataset_user.json',
 'Yelp_Dataset_Challenge_Terms_round_9.pdf']

In [158]:
fname = 'yelp_academic_dataset_review.json' # reviews data
# data = pd.read_json(floc+fname,  lines=True, encoding = 'utf-8') # alternative: load the whole file with pandas

In [159]:
import codecs
with codecs.open(floc+fname,'r' ,encoding = 'utf-8') as f:
    data = f.readlines()

In [160]:
import ujson
d = ujson.loads(data[0]) # each line is one review
d

{u'business_id': u'2aFiy99vNLklCx3T_tGS9A',
 u'cool': 0,
 u'date': u'2011-10-10',
 u'funny': 0,
 u'review_id': u'NxL8SIC5yqOdnlXCg18IBg',
 u'stars': 5,
 u'text': u"If you enjoy service by someone who is as competent as he is personable, I would recommend Corey Kaplan highly. The time he has spent here has been very productive and working with him educational and enjoyable. I hope not to need him again (though this is highly unlikely) but knowing he is there if I do is very nice. By the way, I'm not from El Centro, CA. but Scottsdale, AZ.",
 u'type': u'review',
 u'useful': 0,
 u'user_id': u'KpkOkG6RIf4Ra25Lhhxf1A'}

In [161]:
d.keys() # These are the variables included

[u'funny',
 u'user_id',
 u'review_id',
 u'text',
 u'business_id',
 u'stars',
 u'date',
 u'useful',
 u'type',
 u'cool']

For our purposes, we really only care about the text field. Let's write it out to a separate file.

In [10]:
import random
ujson.loads(random.choice(data))['text'] # this is how we would get a text

u"The kind of breakfast you should expect from a cheap hotel buffet. Powdered eggs and crepes with a side of tasty taters. Plus the coffee tastes like... Well it's tasteless.\nService reminds me of Canadian Tire."

In [14]:
if 1==0:
    with codecs.open(floc+'texts_only.txt','w' ,encoding = 'utf-8') as f:
        for d in data:
            dat = ujson.loads(d)['text'].strip().replace(u'\n', u' ')
            if len(dat)>1:
                f.write(dat+u'\n')

In [15]:
fname = 'texts_only.txt'
with codecs.open(floc+fname,'r' ,encoding = 'utf-8') as f:
    data = f.readlines()
print(len(data))

4156042


In [16]:
random.choice(data).strip() # look at a random line

u'Stopped by today to pick up a pint of vegan ice cream for the house and their special was Vegan Chocolate Ice Cream w/ bits of choco chip cookie & peanut butter - DIVINE!'
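The cells above read the whole file into memory with `readlines()`. For a file this size, the same extraction can be done one line at a time. Here is a minimal sketch using only the standard library `json` module (not `ujson`); the function name `iter_review_texts` and the two sample records are made up for illustration and are not part of the dataset or of my helper module.

```python
import io
import json

def iter_review_texts(lines):
    """Yield the cleaned 'text' field of each JSON-encoded review line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Same cleaning as the cell above: strip, then flatten newlines.
        text = json.loads(line)['text'].strip().replace(u'\n', u' ')
        if len(text) > 1:  # same length check as the cell above
            yield text

# Two made-up records standing in for yelp_academic_dataset_review.json
sample = io.StringIO(
    u'{"stars": 5, "text": "Great tacos.\\nWill be back."}\n'
    u'{"stars": 1, "text": " "}\n'
)
texts = list(iter_review_texts(sample))
print(texts)
```

Because the function takes any iterable of lines, an open file handle can be passed in directly, so only one review is held in memory at a time.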

# First, normalize these texts

In written language, we have grammar rules, normal sentence structures, punctuation, etc... but even when the writing is no good like this bad sentence be still meaning understand.

This is because the important words remain. Many text analysis tools use this "bag of words" idea to extract meaning from texts: a text is treated as an unordered collection of word counts.

Before we do this, we can convert words to normalized forms (lemmas), i.e. infinitive forms of verbs: am, is, were, are, was $\rightarrow$ be, and singular forms of nouns: drinks $\rightarrow$ drink.

There are several packages to deal with this: NLTK, Pattern, Textblob (uses NLTK/Pattern), Spacy.
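To make the bag-of-words idea concrete, here is a toy sketch. The lemma table `TOY_LEMMAS` is hand-written and hypothetical; a real pipeline would get lemmas from one of the packages above rather than from a lookup like this.

```python
from collections import Counter

# Hypothetical hand-written lemma table, for illustration only.
TOY_LEMMAS = {'am': 'be', 'is': 'be', 'are': 'be', 'was': 'be', 'were': 'be',
              'drinks': 'drink'}

def bag_of_words(text):
    """Lowercase, split on whitespace, map to lemmas, and count tokens."""
    tokens = text.lower().split()
    return Counter(TOY_LEMMAS.get(tok, tok) for tok in tokens)

a = bag_of_words('the drinks were cold')
b = bag_of_words('cold was the drink')
print(a == b)  # word order and inflection no longer matter
```

The two sentences differ in word order and inflection, but after normalization they have the same bag of words.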

I wrote my code to work with Spacy 1.9. After upgrading to 2.x, it seems to be a lot slower. There's a current open issue for this on Spacy's [GitHub page](https://github.com/explosion/spaCy/issues/1508).

In [23]:
# Save my current configuration
# write config file for the text processing functions
path = 'text_analysis/'
pathloc = floc + path
if 1==0:
    with codecs.open(floc+path+'default.cfg', 'w', encoding = 'utf-8') as f:
        f.write(json.dumps({'batch_size':1000, 'n_threads':8,
                            'fpathroot':floc+path, 'fpathappend':u'', 'entity_sub':True}))
else:
    tx.batch_size, tx.n_threads, tx.fpathroot, tx.fpathappend, tx.entity_sub, tx.numtopics=tx._config_text_analysis_(floc+path+'default.cfg')
#     tx.batch_size, tx.n_threads, tx.fpathroot, tx.fpathappend, tx.entity_sub, tx.numtopics = batch_size, n_threads, fpathroot, fpathappend, entity_sub, numtopics

In [29]:
# this step takes a while; it needs a fix on spacy's end, or going back to v1.9 or Pattern/NLTK
if 1 == 0:
    tx._write_unigram_(floc+fname,
                       unigram_sentences_filepath=tx.fpathroot+tx.fpathappend+'_sent_gram_0.txt',
                       entity_sub=True)

# Next, create phrase models (collocation detector)

In [152]:
passes = 2
if 1==0:
    ngrams = tx._phrase_detection_(fpath=tx.fpathroot+tx.fpathappend, passes = passes, returnmodels = True, threshold = 10.)
else:
    ngrams = list()
    gramfiles = os.listdir(tx.fpathroot)
    phrasemodels = [tx.fpathroot+g for g in gramfiles if 'phrase_model' in g]
    for m in phrasemodels:
        ngrams.append(Phrases.load(m))

In [110]:
with codecs.open(tx.fpathroot+'_sent_gram_0.txt','r' ,encoding = 'utf-8') as f:
    raw = f.readlines()
raw = [r for r in raw if len(r.strip())>0]
with codecs.open(tx.fpathroot+'sent_gram_{}.txt'.format(passes),'r' ,encoding = 'utf-8') as f:
    gram = f.readlines()

In [163]:
n = np.random.randint(len(raw)) # pick a random sentence index
print(raw[n]) # before phrase detection
print(gram[n]) # after phrase detection

course PRON leave vegan salt caramel cantaloupe gelato

course PRON leave vegan salt_caramel cantaloupe gelato
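To give a feel for how a collocation detector decides that `salt caramel` should become `salt_caramel`, here is a simplified pure-Python sketch of the bigram score from Mikolov et al. (2013), which, as I understand it, is what Gensim's default `Phrases` scorer follows. The function names, the toy sentences, and the `min_count`/`threshold` values are assumptions for illustration; this is not the code inside `tx._phrase_detection_`.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent word pairs; keep the ones scoring above the threshold."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        # Pairs that co-occur more often than their parts would suggest score high.
        score = (n_ab - min_count) * vocab_size / float(unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

def join_phrases(sent, phrases):
    """Replace each detected pair with a single underscore-joined token."""
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and (sent[i], sent[i + 1]) in phrases:
            out.append(sent[i] + '_' + sent[i + 1])
            i += 2
        else:
            out.append(sent[i])
            i += 1
    return out

toy = [['vegan', 'salt', 'caramel', 'gelato'],
       ['salt', 'caramel', 'sauce'],
       ['salt', 'caramel', 'latte'],
       ['vegan', 'gelato']]
phrases = score_bigrams(toy)
print(join_phrases(toy[0], phrases))
```

Running a second pass over the already-joined sentences is what lets three-word phrases form, which is the role of `passes` above.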



# Typically, we want to apply parser/phrase models at the document level

In [33]:
regram = 0 # change to 1 to organize bags of words by documents (vs. sentences)
# passes = 2
if regram == 1:
    if 1==1:
        grammed_reviews = tx._phrase_prediction_(tx.fpathroot+fname, ngrams, outfpath = tx.fpathroot+tx.fpathappend+'grammed_texts.txt', entity_sub = True)
    else:
        grammed_reviews = tx.fpathroot+'grammed_texts.txt'
else:
    grammed_reviews = tx.fpathroot+'sent_gram_{}.txt'.format(passes)

# Create dictionary by filtering out common/uncommon tokens

This step is *SUPER* critical and a lot of the (re)iterations come with changes to this step. Garbage in, garbage out still applies, but the problem is it's often hard to tell if you have garbage before you do the topic analysis.

Rule of thumb: You want to get rid of your most and least frequent 20% of words. You can get rid of less of the most frequent words if your least frequent words are plentiful... this is typically the case for working with user generated content. People can't read good and or do other stuff good too.

![zoolander](https://www.dropbox.com/s/a70gdqzc2nurnrq/zoolander.png?dl=1)

In [153]:
vocab, gensim_dictionary, cts = tx._make_dict_(grammed_reviews,
                                          floc = tx.fpathroot+tx.fpathappend+'dict_gram.dict',
                                          topfilter = 90,bottomfilter = 20,
                                          no_filters=False, keep_ent=False, 
                                         discard_list={},
                                         keep_list = {})

Success


In [164]:
len(vocab)

75539
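Here is a rough sketch of the rule of thumb above: rank tokens by how many documents they appear in, then drop a fraction from each end. The function `filter_vocab` and its `drop_top`/`drop_bottom` parameters are hypothetical and are not necessarily how `tx._make_dict_` interprets `topfilter` and `bottomfilter`.

```python
from collections import Counter

def filter_vocab(docs, drop_top=0.2, drop_bottom=0.2):
    """Drop the most and least frequent tokens, ranked by document frequency."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    # Most frequent first; ties broken alphabetically so the result is stable.
    ranked = [tok for tok, _ in
              sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))]
    n_top = int(len(ranked) * drop_top)
    n_bottom = int(len(ranked) * drop_bottom)
    return ranked[n_top:len(ranked) - n_bottom]

docs = [['the', 'taco', 'be', 'great'],
        ['the', 'taco', 'be', 'great'],
        ['the', 'taco', 'be'],
        ['the', 'taco'],
        ['the', 'xyzzy']]
kept = filter_vocab(docs)
print(kept)
```

In the toy corpus, `the` appears in every document and `xyzzy` in only one, so neither helps distinguish topics and both are dropped.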

# Create serialized representation of corpus (token id's instead of tokens)

This step in the process also filters out tokens that are not in the dictionary.

In [154]:
if 1==1: 
    grammed_corpus = tx._serialize_corpus_(grammed_reviews, gensim_dictionary,  outfpath = tx.fpathroot+tx.fpathappend+'_serialized.mm')
else:
    grammed_corpus_loc= tx.fpathroot+tx.fpathappend+'_serialized.mm'
    grammed_corpus = MmCorpus(grammed_corpus_loc)
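Conceptually, serializing means replacing each document with a sparse list of `(token_id, count)` pairs and silently dropping any token that is not in the dictionary. Here is a minimal pure-Python sketch of that idea; `build_token_ids` and `doc_to_bow` are illustrative names, not functions from Gensim or from my helper module.

```python
from collections import Counter

def build_token_ids(vocab):
    """Assign each vocabulary token an integer id."""
    return {tok: i for i, tok in enumerate(sorted(vocab))}

def doc_to_bow(tokens, token2id):
    """Convert tokens to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = build_token_ids(['gelato', 'salt_caramel', 'vegan'])
# 'PRON' is not in the vocabulary, so it is filtered out here.
bow = doc_to_bow(['vegan', 'salt_caramel', 'gelato', 'vegan', 'PRON'], token2id)
print(bow)
```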

# Do some kind of topic modeling (LDA here)

Some common types of topic modeling for texts:

1. Term frequency * inverse document frequency (TF-IDF). This is a weighting scheme that represents how important each word is for identifying a document.
    1. Fast
    2. Easy to implement
    3. Great for things like matching company names: in "Hilton San Diego", Hilton is more relevant than San Diego, especially if San Diego is lumped together with other location entities ("GPE" in Spacy), in which case Hilton gets even more weight.
2. Latent semantic indexing (LSI). Matrix decomposition method (at its core an SVD-based decomposition) to find the "principal components" of sparse matrix spaces, think of it as finding the span of the vectorized token space. The token vectors used as inputs are commonly TF-IDF representations.
    1. Relatively fast, does not require maximizing likelihood functions, etc.
    2. Reduces dimensions (vs. TF-IDF)
    3. Has both positive and negative weights to define "topics." Negative weights mean the topic is less likely if its associated token is in a document.
    4. Great alternative to LDA.
3. [Latent Dirichlet Allocation](http://ai.stanford.edu/~ang/papers/jair03-lda.pdf) (LDA). A generative hierarchical model of word distributions. Specifies a nested distributional structure in which documents are distributions over topics, which are in turn distributions over the (entire) vocabulary.
    1. Slower, but Gensim has a multicore implementation that reduces runtime to hours (vs days/weeks)
    2. Richer representation of topics
    3. Probably the standard that one would compare newer methods (typically variants of LDA) to.
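As a worked example of the TF-IDF weighting in item 1, here is a small sketch using the plain `tf * log(N / df)` form. Libraries differ in smoothing and normalization, so treat this as one common variant rather than what Gensim computes; the hotel names are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # number of documents containing each token
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({tok: tf[tok] * math.log(n_docs / float(doc_freq[tok]))
                        for tok in tf})
    return weights

names = [['hilton', 'san', 'diego'],
         ['marriott', 'san', 'diego'],
         ['hyatt', 'san', 'diego'],
         ['hilton', 'chicago']]
w = tfidf(names)[0]
print(sorted(w.items(), key=lambda kv: -kv[1]))
```

`hilton` appears in fewer names than `san` or `diego`, so it ends up with the larger weight, which is the behavior described for company name matching.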

A nice graphical illustration of LDA:
![LDA overview](../Figures/LDA.png)
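The generative story behind LDA can also be sketched in a few lines: draw a topic mixture for the document, then for each word draw a topic from that mixture and a word from that topic. This is only the forward (sampling) direction with two hand-made topics; fitting LDA means inferring these distributions from data, which this sketch does not do. All names and numbers below are assumptions for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, items, rng):
    """Draw one item according to the given probabilities."""
    r, acc = rng.random(), 0.0
    for p, item in zip(probs, items):
        acc += p
        if r < acc:
            return item
    return items[-1]  # guard against floating point round-off

def generate_document(topic_word, vocab, n_words, alpha, rng):
    """Generate one document: topic mixture first, then a topic and word per position."""
    theta = sample_dirichlet(alpha, rng)  # this document's distribution over topics
    topics = list(range(len(topic_word)))
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, topics, rng)  # pick a topic
        words.append(sample_categorical(topic_word[z], vocab, rng))  # pick a word
    return theta, words

vocab = ['taco', 'salsa', 'parking', 'service']
topic_word = [[0.5, 0.5, 0.0, 0.0],   # a made-up "food" topic
              [0.0, 0.0, 0.5, 0.5]]   # a made-up "logistics" topic
rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, n_words=8, alpha=[0.5, 0.5], rng=rng)
print(theta, words)
```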

The multicore version is a godsend!
![running cores](https://www.dropbox.com/s/ly2jzk303e613rn/Screenshot%202018-02-22%2013.50.18.png?dl=1)

In [37]:
ntopics = 10
ldafile = 'lda_'+str(ntopics)
if 1 ==0:
    lda = tx._lda_(gensim_dictionary, corpus_path = grammed_corpus, numtopics= ntopics,iterations=25) # defaults to 10 topics
    lda.save(tx.fpathroot+ldafile)
else: 
    lda = LdaMulticore.load(tx.fpathroot+ldafile)

In [165]:
lda.show_topics(num_topics = ntopics)

[(0,
  u'0.047*"price_reasonable" + 0.042*"right_away" + 0.023*"big_deal" + 0.018*"fry_chicken" + 0.012*"bottle_wine" + 0.012*"totally_worth" + 0.011*"grill_cheese" + 0.011*"large_party" + 0.009*"ahead_time" + 0.009*"fine_dining"'),
 (1,
  u'0.027*"hard_find" + 0.026*"walk_away" + 0.019*"hit_miss" + 0.019*"sound_like" + 0.013*"price_fair" + 0.011*"answer_question" + 0.010*"large_selection" + 0.009*"far_away" + 0.006*"lobster_roll" + 0.005*"plenty_parking"'),
 (2,
  u'0.023*"high_end" + 0.016*"bit_pricey" + 0.011*"sweet_potato" + 0.010*"group_friend" + 0.009*"outdoor_seating" + 0.009*"family_member" + 0.008*"old_fashioned" + 0.008*"bit_high" + 0.007*"regular_basis" + 0.007*"convenient_location"'),
 (3,
  u'0.064*"definitely_recommend" + 0.043*"little_bit" + 0.020*"friendly_helpful" + 0.019*"walk_door" + 0.015*"and_or" + 0.014*"sport_bar" + 0.013*"special_occasion" + 0.013*"stay_hotel" + 0.009*"save_money" + 0.009*"worth_drive"'),
 (4,
  u'0.054*"great_job" + 0.032*"read_review" + 0.018*

# Visualize LDA

In [39]:
import pyLDAvis.gensim # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
ldaviz = pyLDAvis.gensim.prepare(lda, grammed_corpus, gensim_dictionary)
pyLDAvis.save_html(ldaviz, '../Figures/viz_'+ldafile+'.html')

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]


# Some things I've done with text analysis
1. A lot of company name matching (hotels, matching nonstandard company names with standard ones from databases)
    1. tokenizing$\rightarrow$TF-IDF$\rightarrow$Fuzzy Matching
    2. SpaCy$\rightarrow$Gensim$\rightarrow$FuzzyWuzzy
    
2. Analyzing conversations online, such as online reviews and [manager responses to online reviews](http://journals.ama.org/doi/abs/10.1509/jmr.15.0511?journalCode=jmkr).
    1. tokenizing$\rightarrow$topic analysis$\rightarrow$econometric modeling with topics as features
    2. SpaCy$\rightarrow$Gensim$\rightarrow$R
    
3. Analyzing [performance reviews](http://www.cangrade.com) at large corporations.
    1. tokenizing$\rightarrow$topic analysis$\rightarrow$machine learning
    2. SpaCy$\rightarrow$Gensim$\rightarrow$TensorFlow
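As a small illustration of the company name matching in item 1, here is a sketch that normalizes names by sorting their tokens and then scores candidates with the standard library's `difflib.SequenceMatcher`. This swaps in `difflib` for FuzzyWuzzy and skips the TF-IDF weighting step entirely, so it is a simplification of the pipeline listed above; the hotel names and function names are made up.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop commas, and sort tokens so word order does not matter."""
    return ' '.join(sorted(name.lower().replace(',', ' ').split()))

def best_match(query, candidates):
    """Return (similarity, candidate) for the closest standard name."""
    scored = [(SequenceMatcher(None, normalize(query), normalize(c)).ratio(), c)
              for c in candidates]
    return max(scored)

# Made-up "standard" names standing in for a database of hotels.
standard = ['Hilton San Diego Bayfront',
            'Marriott Marquis San Diego',
            'Hyatt Regency Chicago']
score, match = best_match('san diego hilton bayfront', standard)
print(score, match)
```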