https://www.kaggle.com/yelp-dataset/yelp-dataset/notebooks?datasetId=10100&sortBy=dateRun
https://www.kaggle.com/poonaml/bidirectional-lstm-spacy-on-yelp-reviews
https://www.kaggle.com/saraclay/hcde-511-yelp-dataset


# Yelp Reviews

In [2]:
import numpy as np
import tensorflow as tf
import pandas as pd
import os

%matplotlib inline
import matplotlib

Checking that TensorFlow has GPU support.

In [3]:
tf.test.is_gpu_available()

Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


True

In [4]:
tf.version.VERSION
tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

## Importing the Data
Here, I collect the paths of all the data files into a list. The files live in the parent directory of the current working directory, so base_dir is the cwd joined with os.pardir. I then take only the files that sit directly in that directory (no subdirectories), join each file name with the root, and append the result to the list all_files.

In [5]:
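all_files = []

base_dir = os.path.join(os.getcwd(), os.pardir)
# The first tuple yielded by os.walk describes base_dir itself, so taking only
# that one lists the top-level files without descending into subdirectories.
root, dirs, files = next(os.walk(base_dir))
for file in files:
    all_files.append(os.path.join(root, file))
all_files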

['C:\\Users\\lhm\\GitHub\\yelp_reviews\\yelp_reviews\\..\\Dataset_Agreement.pdf',
 'C:\\Users\\lhm\\GitHub\\yelp_reviews\\yelp_reviews\\..\\yelp_academic_dataset_business.json',
 'C:\\Users\\lhm\\GitHub\\yelp_reviews\\yelp_reviews\\..\\yelp_academic_dataset_checkin.json',
 'C:\\Users\\lhm\\GitHub\\yelp_reviews\\yelp_reviews\\..\\yelp_academic_dataset_review.json',
 'C:\\Users\\lhm\\GitHub\\yelp_reviews\\yelp_reviews\\..\\yelp_academic_dataset_tip.json',
 'C:\\Users\\lhm\\GitHub\\yelp_reviews\\yelp_reviews\\..\\yelp_academic_dataset_user.json']
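The cells that follow pick files by position (for example all_files[3]), which depends on the directory listing order. As a hedged alternative, here is a minimal sketch that looks files up by name instead. The helper files_by_name is hypothetical (it is not part of this notebook), and the demo runs on a throwaway temporary directory rather than the real dataset folder.

```python
import tempfile
from pathlib import Path


def files_by_name(directory):
    """Map file name -> full path for files directly inside `directory` (no recursion)."""
    return {p.name: p for p in sorted(Path(directory).iterdir()) if p.is_file()}


# Demo on a temporary directory with made-up, empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("yelp_academic_dataset_review.json", "yelp_academic_dataset_tip.json"):
        (Path(tmp) / name).write_text("{}\n")
    # A file inside a subdirectory should not be picked up.
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "nested" / "ignored.json").write_text("{}\n")

    found = files_by_name(tmp)
    review_path = found["yelp_academic_dataset_review.json"]
    print(sorted(found))
```

With a lookup like this, the review file could be selected by its name even if another file is later added to the folder.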

The review file stores one JSON object per line, so pd.read_json needs the lines parameter, which is available from version 0.19.0 of Pandas onward. It tells pandas to parse each line as a separate record. Loading takes a few minutes, so there may be faster ways to get the data into memory.

One option is to pass the chunksize argument to pd.read_json. Another is to read the file with the json module into Python dictionaries and build the DataFrame from those. More details on these can be found here: https://datascience.stackexchange.com/questions/60268/load-large-jsons-file-into-pandas-dataframe
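To make the json module option concrete, here is a minimal sketch that parses a line-delimited JSON source one record at a time and keeps only the fields needed later. The helper iter_json_lines and the two sample records are made up for illustration, and I have not measured whether this is actually faster on the full review file.

```python
import io
import json


def iter_json_lines(handle, keep=("review_id", "stars", "text")):
    """Yield one dict per non-empty line, keeping only the requested keys."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield {key: record.get(key) for key in keep}


# Two made-up records standing in for the review file.
sample = io.StringIO(
    '{"review_id": "a1", "stars": 5, "text": "Great", "useful": 0}\n'
    '{"review_id": "b2", "stars": 1, "text": "Bad", "useful": 3}\n'
)
records = list(iter_json_lines(sample))
print(records)
```

The resulting list of dictionaries can be passed to pd.DataFrame(records). Dropping unused columns while reading keeps less data in memory than loading every field first.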

In [7]:
import json
#data = json.load(open(all_files[3],"r"))
#review = pd.DataFrame.from_dict(data, orient="index")
#business = pd.read_json(all_files[1], lines=True)
#checkin = pd.read_json(all_files[2], lines=True)
review = pd.read_json(all_files[3], lines=True) #, chunksize = 10)
#tip = pd.read_json(all_files[4], lines=True)
#user = pd.read_json(all_files[5], lines=True)

In [8]:
review.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,xQY8N_XvtGbearJ5X4QryQ,OwjRMXRC0KyPrIlcjaXeFQ,-MhfebM0QIsKt87iDN-FNw,2,5,0,0,"As someone who has worked with many museums, I...",2015-04-15 05:21:16
1,UmFMZ8PyXZTY2QcwzsfQYA,nIJD_7ZXHq-FX8byPMOkMQ,lbrU8StCq3yDfr-QMnGrmQ,1,1,1,0,I am actually horrified this place is still in...,2013-12-07 03:16:52
2,LG2ZaYiOgpr2DK_90pYjNw,V34qejxNsCbcgD8C0HVk-Q,HQl28KMwrEKHqhFrrDqVNQ,5,1,0,0,I love Deagan's. I do. I really do. The atmosp...,2015-12-05 03:18:11
3,i6g_oA9Yf9Y31qt0wibXpw,ofKDkJKXSKZXu5xJNGiiBQ,5JxlZaqCnk1MnbgRirs40Q,1,0,0,0,"Dismal, lukewarm, defrosted-tasting ""TexMex"" g...",2011-05-27 05:30:52
4,6TdNDKywdbjoTkizeMce8A,UgMW8bLE0QMJDCkQ1Ax5Mg,IS4cv902ykd8wj1TR0N3-A,4,0,0,0,"Oh happy day, finally have a Canes near my cas...",2017-01-14 21:56:57


Here is an example review:

In [9]:
review['text'][0]

'As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas. When I saw they would be showing infamous eggs of the House of Faberge from the Virginia Museum of Fine Arts (VMFA), I knew I had to go!\n\nTucked away near the gelateria and the garden, the Gallery is pretty much hidden from view. It\'s what real estate agents would call "cozy" or "charming" - basically any euphemism for small.\n\nThat being said, you can still see wonderful art at a gallery of any size, so why the two *s you ask? Let me tell you:\n\n* pricing for this, while relatively inexpensive for a Las Vegas attraction, is completely over the top. For the space and the amount of art you can fit in there, it is a bit much.\n* it\'s not kid friendly at all. Seriously, don\'t bring them.\n* the security is not trained properly for the show. When the curating and design teams collaborate for exhibitions, there is a definite flow. That means visitors should view the ar

I want to simplify this dataset so that it contains just the text and labels. Specifically, I want to use NLP to identify whether a review is positive or negative. To do this, I'm going to engineer a new column called "label", which is 1 when the review received 4 or 5 stars and 0 when it received 1 or 2 stars. I'm going to ignore 3-star reviews, since those would be neutral.

In [10]:
review = review[['review_id','stars','text']]
review = review.copy(deep=True)

In [11]:
review['label'] = review['stars'].map({5:'1',4:'1',2:'0',1:'0'})

In [12]:
review = review[review['label'].notnull()]
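The labeling rule can be checked on a small example. This is a sketch on a made-up toy DataFrame, not the real review data. One assumption to flag: it stores the labels as integers, whereas the cells above map to the strings '1' and '0'.

```python
import pandas as pd

# Toy stand-in for the review DataFrame; the values are made up for illustration.
toy = pd.DataFrame({
    "stars": [1, 2, 3, 4, 5],
    "text": ["awful", "meh", "okay", "good", "great"],
})

# 4-5 stars -> 1 (positive), 1-2 stars -> 0 (negative). 3 stars is absent
# from the mapping, so map() leaves NaN for those rows.
star_to_label = {5: 1, 4: 1, 2: 0, 1: 0}
toy["label"] = toy["stars"].map(star_to_label)

# Drop the unmapped (3-star) rows, then restore an integer dtype,
# because the NaN placeholders forced the column to float.
labeled = toy.dropna(subset=["label"]).reset_index(drop=True)
labeled["label"] = labeled["label"].astype(int)

print(labeled)
```

Leaving 3 out of the mapping is what makes the later notnull() filter remove the neutral reviews.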

In [42]:
# Keep an independent backup; plain assignment would only alias the same DataFrame
review2 = review.copy()

In [43]:
review.reset_index(drop=True, inplace = True)

Now I have the data separated into the review and the label of whether it is positive or negative.

In [44]:
train_size = int(review.shape[0] * 0.7)
cv_size = int(review.shape[0] * 0.2)
#test_size = int(review.shape[0] * 0.1)

In [45]:
train = review['text'][0:train_size]
cv = review['text'][train_size:train_size+cv_size]
test = review['text'][train_size + cv_size :]

train_y = review['label'][0:train_size]
cv_y = review['label'][train_size:train_size+cv_size]
test_y = review['label'][train_size + cv_size :]

In [47]:
train.shape

(5025183,)

In [48]:
cv.shape

(1435766,)

In [49]:
test.shape

(717884,)
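As a sanity check on the 70/20/10 split, the sizes can be recomputed from the row count alone. This is a small sketch. The helper split_sizes is hypothetical, and the total row count is taken as the sum of the three shapes printed above.

```python
def split_sizes(n_rows, train_frac=0.7, cv_frac=0.2):
    """Return (train, cv, test) sizes for a sequential split; test gets the remainder."""
    train_size = int(n_rows * train_frac)
    cv_size = int(n_rows * cv_frac)
    test_size = n_rows - train_size - cv_size
    return train_size, cv_size, test_size


# Total rows, taken as the sum of the train, cv and test shapes printed above.
n_rows = 5025183 + 1435766 + 717884
sizes = split_sizes(n_rows)
print(sizes)
```

Because int() truncates, the test set absorbs any leftover rows, so the three parts always add up to the full dataset. The split takes rows in their existing order without shuffling.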

In [21]:
X = review['text'].values
y = review['label'].values

Before doing anything else, I'm going to split the data into training and test sets. I'm doing this before running the tokenizer in order to avoid data leakage, which is when information from outside the training dataset is used to create the model. If I ran the tokenizer before separating the training and test sets, the model would have knowledge of the words used in the test set, because the tokenizer would already have assigned them numbers.
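To illustrate the leakage point, here is a minimal pure-Python sketch of a vocabulary built from the training texts only. It is a simplified stand-in for a tokenizer, not the Keras implementation. The helpers build_vocab and encode and the example sentences are made up for illustration.

```python
from collections import Counter

OOV = "<OOV>"


def build_vocab(texts):
    """Assign an integer id to every word seen in `texts`; id 1 is reserved for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {OOV: 1}
    for word, _ in counts.most_common():
        vocab[word] = len(vocab) + 1
    return vocab


def encode(text, vocab):
    """Replace each word with its id, falling back to the OOV id for unseen words."""
    return [vocab.get(word, vocab[OOV]) for word in text.lower().split()]


train_texts = ["great food", "bad service", "great service"]
test_texts = ["great ambiance"]

vocab = build_vocab(train_texts)  # fitted on the training split only
encoded_test = encode(test_texts[0], vocab)
print(encoded_test)
```

The word "ambiance" appears only in the test text, so it gets the OOV id. That is how a genuinely unseen word would be handled at prediction time.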

In [22]:
X

array(['As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas. When I saw they would be showing infamous eggs of the House of Faberge from the Virginia Museum of Fine Arts (VMFA), I knew I had to go!\n\nTucked away near the gelateria and the garden, the Gallery is pretty much hidden from view. It\'s what real estate agents would call "cozy" or "charming" - basically any euphemism for small.\n\nThat being said, you can still see wonderful art at a gallery of any size, so why the two *s you ask? Let me tell you:\n\n* pricing for this, while relatively inexpensive for a Las Vegas attraction, is completely over the top. For the space and the amount of art you can fit in there, it is a bit much.\n* it\'s not kid friendly at all. Seriously, don\'t bring them.\n* the security is not trained properly for the show. When the curating and design teams collaborate for exhibitions, there is a definite flow. That means visitors should view

In [23]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(oov_token="<OOV>", 
                      num_words = 5000, 
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True)
tokenizer.fit_on_texts(X)

In [39]:
## Dict mapping each word to the number of times it appeared during fit
tokenizer.word_counts

## Dict mapping each word to the number of documents/texts it appeared in during fit
#tokenizer.word_docs

## Dict mapping each word to its rank/index (int)
#tokenizer.word_index

## Number of documents the tokenizer was fit on
#tokenizer.document_count

OrderedDict([('as', 3155425),
             ('someone', 290057),
             ('who', 861504),
             ('has', 1201770),
             ('worked', 129104),
             ('with', 6361220),
             ('many', 573330),
             ('museums', 2237),
             ('i', 20624732),
             ('was', 13364467),
             ('eager', 12424),
             ('to', 19240219),
             ('visit', 425252),
             ('this', 6436950),
             ('gallery', 6632),
             ('on', 5323292),
             ('my', 7823017),
             ('most', 580819),
             ('recent', 44136),
             ('trip', 169330),
             ('las', 194679),
             ('vegas', 617202),
             ('when', 2172519),
             ('saw', 208393),
             ('they', 6362421),
             ('would', 2104363),
             ('be', 3354519),
             ('showing', 29908),
             ('infamous', 3186),
             ('eggs', 140386),
             ('of', 10717134),
             ('the', 37520

Although I set the tokenizer's num_words hyperparameter to 5000, the tokenizer still keeps counts for every word it sees. The num_words limit is applied only when a transformation method such as texts_to_sequences is called: at that point only the most frequent words (those whose index is below num_words) are kept, and the rest are mapped to the OOV token. It behaves this way so that fit_on_texts can be called multiple times. Each call updates the internal counters, and transformations use the top words according to the updated counters.
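The count-everything, cut-at-transform-time behavior can be sketched with a toy class. This is a simplified illustration and not the Keras Tokenizer: the class CountingVocab, its top_k parameter, and the use of 0 as the out-of-vocabulary id are all assumptions made for this example, and the exact index cutoff differs from Keras.

```python
from collections import Counter


class CountingVocab:
    """Toy stand-in: counts every word, but encodes only the `top_k` most frequent."""

    def __init__(self, top_k):
        self.top_k = top_k
        self.word_counts = Counter()

    def fit(self, texts):
        # Counts accumulate across calls; no word is discarded here.
        for text in texts:
            self.word_counts.update(text.lower().split())

    def encode(self, text, oov_id=0):
        # The ranking is computed at transform time from the current counts.
        top = self.word_counts.most_common(self.top_k)
        kept = {word: rank for rank, (word, _) in enumerate(top, start=1)}
        return [kept.get(word, oov_id) for word in text.lower().split()]


toy = CountingVocab(top_k=2)
toy.fit(["pizza pizza pizza pasta pasta salad"])
before = toy.encode("pizza pasta salad")

# A second fit updates the counts, which changes which words make the cut.
toy.fit(["salad salad salad salad"])
after = toy.encode("pizza pasta salad")

print(before, after, len(toy.word_counts))
```

After the second fit, "salad" overtakes the other words and "pasta" drops out of the top two, even though all three words stay in the counter throughout.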

In [40]:
word_index = tokenizer.word_index
print(len(word_index))

840909


In [41]:
sequences = tokenizer.texts_to_sequences(X)