# Final Capstone Notebook 1 out of 4

Because of size constraints, we cannot load the original dataset into a Google Colab notebook. Instead, we will load the dataset in a local Jupyter notebook, prepare the data, and export it to CSV. The original dataset can be found [here](https://www.kaggle.com/yelp-dataset/yelp-dataset/version/6?select=yelp_business.csv). Although there is a lot of information to work with, we will mostly use the business, reviews, and tips tables.

In [1]:
import pandas as pd
import spacy
import re
import warnings

warnings.filterwarnings('ignore', category=FutureWarning)

# Load Dataset

Because of the size of the dataset, we will restrict our analysis to a single city (Cleveland, Ohio). The business table will be used to filter the reviews and tips to that city. Then we will clean the text of the reviews and tips. Because the tips are unlabeled, we will have to train a sentiment classifier to label them. Only then can we combine the tips with the reviews for topic extraction. The classifier and topic extraction are explored in the later notebooks.

In [2]:
# Load the dataset
path = "C:\\Users\\James\\Desktop\\Data_Folder\\yelp\\"

biz=pd.read_csv(path + "yelp_business.csv")
reviews=pd.read_csv(path + "yelp_review.csv")
tips=pd.read_csv(path + "yelp_tip.csv")

In [3]:
# Remove unnecessary columns; note that `stars` here is the business-level
# rating, not the per-review rating kept in the reviews table
col = ['neighborhood', 'address', 'latitude', 'longitude', 'stars']
biz.drop(columns=col, inplace=True)

# Update categories so that it's a list and not a string
biz.categories = biz.categories.apply(lambda x: x.split(";"))

# We will filter open restaurants in Cleveland, OH only
biz = biz[
    (biz.city == 'Cleveland') & 
    (biz.state == 'OH') & 
    (biz.is_open == 1) & 
    (biz.categories.apply(lambda x: 'Restaurants' in x))]
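
As a quick sanity check, the category split and the four-way filter can be tried on a toy frame first. The sketch below uses made-up rows (the business IDs and category strings are hypothetical, not taken from the Yelp data) and assumes `categories` arrives as a semicolon-separated string, as the split in the cell above implies.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used by the filter
toy = pd.DataFrame({
    "business_id": ["a1", "b2", "c3", "d4"],
    "city": ["Cleveland", "Cleveland", "Akron", "Cleveland"],
    "state": ["OH", "OH", "OH", "OH"],
    "is_open": [1, 0, 1, 1],
    "categories": [
        "Breweries;Restaurants",
        "Restaurants;Chinese",
        "Restaurants;Pizza",
        "Hair Salons;Beauty & Spas",
    ],
})

# Split the semicolon-separated string into a list of categories
toy["categories"] = toy["categories"].str.split(";")

# Keep open restaurants in Cleveland, OH
mask = (
    (toy["city"] == "Cleveland")
    & (toy["state"] == "OH")
    & (toy["is_open"] == 1)
    & toy["categories"].apply(lambda cats: "Restaurants" in cats)
)
toy_filtered = toy[mask]
print(toy_filtered["business_id"].tolist())
```

Testing membership in the list, rather than searching the raw string, means only an exact `Restaurants` category counts as a match.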

In [4]:
df = pd.merge(reviews, biz, how='inner', on='business_id')

df.head()

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,name,city,state,postal_code,review_count,is_open,categories
0,OPZsR2jCG72uoDNjU71DQQ,qYbWTWH5leltA0bzWAOnmA,meXjqyhTNLFmknY39y2sMg,5,2014-09-11,Solid beers -- Christmas Ale defines my holida...,1,1,1,"""Great Lakes Brewing Company""",Cleveland,OH,44113,751,1,"[Breweries, Restaurants, Event Planning & Serv..."
1,fxGwEiSYDtAen8BNuVGGxg,8Az_JgEpXqAii_5EDkw2tw,meXjqyhTNLFmknY39y2sMg,3,2013-10-13,Meh. It was OK. A bartender the night before...,0,1,0,"""Great Lakes Brewing Company""",Cleveland,OH,44113,751,1,"[Breweries, Restaurants, Event Planning & Serv..."
2,Gweb4pADeQ26WnaiKEZ7GQ,T9tEic49JZjN4nCUcDvrRQ,meXjqyhTNLFmknY39y2sMg,4,2014-01-15,"Oh Christmas Ale, oh Christmas Ale, how lovely...",1,1,1,"""Great Lakes Brewing Company""",Cleveland,OH,44113,751,1,"[Breweries, Restaurants, Event Planning & Serv..."
3,P1vhwPI56SeZEz10ywaS7w,W1p8_CFW5FISSihmQo5Qzw,meXjqyhTNLFmknY39y2sMg,3,2012-02-09,What is the big deal about this place? The foo...,2,1,1,"""Great Lakes Brewing Company""",Cleveland,OH,44113,751,1,"[Breweries, Restaurants, Event Planning & Serv..."
4,1kQvQlBX0V5_rGddBh9-rQ,Y_PP05RRdzbKRYfDCCfh8w,meXjqyhTNLFmknY39y2sMg,5,2017-04-30,Great Lakes Brewing Company is one of my favor...,0,0,0,"""Great Lakes Brewing Company""",Cleveland,OH,44113,751,1,"[Breweries, Restaurants, Event Planning & Serv..."


In [5]:
# Still plenty of rows; the only nulls are 30 missing postal codes

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56910 entries, 0 to 56909
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review_id     56910 non-null  object
 1   user_id       56910 non-null  object
 2   business_id   56910 non-null  object
 3   stars         56910 non-null  int64 
 4   date          56910 non-null  object
 5   text          56910 non-null  object
 6   useful        56910 non-null  int64 
 7   funny         56910 non-null  int64 
 8   cool          56910 non-null  int64 
 9   name          56910 non-null  object
 10  city          56910 non-null  object
 11  state         56910 non-null  object
 12  postal_code   56880 non-null  object
 13  review_count  56910 non-null  int64 
 14  is_open       56910 non-null  int64 
 15  categories    56910 non-null  object
dtypes: int64(6), object(10)
memory usage: 7.4+ MB


# Text Cleaning

If you look at the filtered dataframe below, you can see that some of the text contains Chinese, Japanese, and Korean (CJK) characters. We need to decide how to handle them. One option is to ignore them and leave them in our data. The problem is that there are so few entries with CJK characters that our sentiment classifier and our topic extractor would only pick them up as noise.

The second option is to completely remove the sentences that contain these characters. However, some sentences have a clear sentiment but only a word or two in CJK, with the rest in English. We don't want to lose this information.

The third option is to remove only the characters and keep the sentences intact. This method has a weakness: some sentences are written entirely in CJK characters and would be left empty. However, the majority of the sentences with CJK characters contain only a few CJK words. As a result, we will take this third option despite its shortcomings.
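
To make the third option concrete, here is a minimal character-level sketch. It assumes the same three Unicode ranges that `cjk_detect` checks, and `strip_cjk` is a hypothetical helper, not part of the notebook, which instead drops whole tokens containing CJK characters during lemmatization.

```python
import re

# Hangul syllables, hiragana/katakana, and CJK unified ideographs
CJK_CHARS = re.compile("[\uac00-\ud7a3\u3040-\u30ff\u4e00-\u9fff]+")

def strip_cjk(text):
    """Remove CJK characters and collapse the whitespace left behind."""
    without_cjk = CJK_CHARS.sub(" ", text)
    return " ".join(without_cjk.split())

mixed = "早餐只供應到11:00 but the sandwich was great"
print(strip_cjk(mixed))

# A sentence written entirely in CJK is reduced to an empty string
print(repr(strip_cjk("美味しい")))
```

The second call shows the shortcoming mentioned above: a fully CJK sentence survives as a row but carries no text.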

In [6]:
# This function will be used to detect CJK characters
def cjk_detect(texts):
    # korean
    if re.search("[\uac00-\ud7a3]", texts):
        return True
    # japanese
    if re.search("[\u3040-\u30ff]", texts):
        return True
    # chinese
    if re.search("[\u4e00-\u9FFF]", texts):
        return True
    return False
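
The detector can be checked against a few hand-written strings before it is applied to the full dataframe. The sketch below is a compact stand-in that folds the three ranges into one pattern; `has_cjk` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import re

# One pattern covering the same three ranges as cjk_detect
CJK_PATTERN = re.compile("[\uac00-\ud7a3\u3040-\u30ff\u4e00-\u9fff]")

def has_cjk(text):
    """Return True if the text contains Hangul, kana, or CJK ideographs."""
    return CJK_PATTERN.search(text) is not None

# Sample text mapped to the expected detection result
samples = {
    "Great pierogi!": False,
    "김치 was excellent": True,       # Hangul
    "ラーメン lovers welcome": True,  # katakana
    "好吃 dumplings": True,           # ideographs
    "Café Noir": False,               # accented Latin is not CJK
}

for text, expected in samples.items():
    print(text, has_cjk(text), expected)
```

Note that the ranges identify scripts rather than languages: Japanese kanji fall in the ideograph range labeled "chinese" above.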

In [7]:
# Let's filter the df
df[df.text.apply(cjk_detect)]

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,name,city,state,postal_code,review_count,is_open,categories
1596,S2QzbvWSzg55eotbFUnWlg,ud2S2pG_8wtQiiZEonvYhw,WHlbUb5SNdXDKYfkNDo89A,2,2017-02-26,店の雰囲気、立地は良いと思った。\nただし、とにかくオーダーしてから食事が来るのが遅い。飲み...,0,0,0,"""Pickwick & Frolic Restaurant & Club""",Cleveland,OH,44115,128,1,"[Food, Lounges, Ice Cream & Frozen Yogurt, Ame..."
9533,D7_28LOkVPoYE0d3xdpOMA,ww6SJqoqDDlJRj-f6B56Ew,1veVZUawy7IhIc5oDpRRQA,5,2015-10-30,早餐只供應到11:00，可惜沒吃到，但他的sandwich 還是很精彩; Corned Be...,0,0,0,"""Slyman's Restaurant""",Cleveland,OH,44114,385,1,"[Sandwiches, Restaurants, American (Traditional)]"
10270,HVyd1pdbzR9_Y3oWfyupgw,XhyK4GaKHh3dzSeHfsfsvQ,tulUhFYMvBkYHsjmn30A9w,2,2017-07-04,Finally decided to see what all the hype was a...,0,0,0,"""Hot Sauce Williams Barbecue Restaurant""",Cleveland,OH,44103,162,1,"[Soul Food, Restaurants, Barbeque]"
12694,qFLoCT9-2d_MVJchyQZQcg,bLNS2BFfa9PTxCbT5jJztA,wftmt-n8OUA4Ng3bWWH5dw,5,2016-09-18,Wonton is a great place to go with a large gro...,8,6,6,"""Wonton Gourmet & BBQ""",Cleveland,OH,44114,220,1,"[Cantonese, Restaurants, Chinese, Barbeque]"
16230,IRi2mraRE6le1ZRUwnWOMA,7Sq9MNNDseNy9AobzAms3A,ijY4C4ut4M9xg3QvK2R2pg,3,2015-10-29,To be fair.... It was decent. It's the only Di...,3,1,0,"""Li Wah""",Cleveland,OH,44114,246,1,"[Cantonese, Restaurants, Dim Sum, Chinese]"
16381,aUSN66vxjUpfIV0hTk6lBA,4QazAwPkW0LRVFEL7Uf1Fg,ijY4C4ut4M9xg3QvK2R2pg,3,2015-06-21,"There's always a long line around lunch time, ...",14,10,12,"""Li Wah""",Cleveland,OH,44114,246,1,"[Cantonese, Restaurants, Dim Sum, Chinese]"
17866,D3vq0phHPaCzO6MjrLYXZg,YmXDorvj5dg-dJbGjj8dQw,GDpd9KZdUgnjcoCWTTphqw,5,2016-01-01,Very good food. Waiter checked gluten for us. ...,0,0,0,"""Szechuan Gourmet""",Cleveland,OH,44114,165,1,"[Restaurants, Chinese, Szechuan]"
17882,XxCO3kf4Iz_tO-cqy35yLg,5ZrFDFPJe8Qq3Zgue315oA,GDpd9KZdUgnjcoCWTTphqw,5,2015-08-05,I was ready for this seedy-looking restaurant ...,9,5,7,"""Szechuan Gourmet""",Cleveland,OH,44114,165,1,"[Restaurants, Chinese, Szechuan]"
17885,D6QKEWLmA9BbW5mxa5VmTQ,DOFUf5lLMHZU7fRki_M6uQ,GDpd9KZdUgnjcoCWTTphqw,5,2017-05-12,The food is served so fast!! And is absolutely...,0,0,0,"""Szechuan Gourmet""",Cleveland,OH,44114,165,1,"[Restaurants, Chinese, Szechuan]"
17969,TO1TZaitro8BtqCROtUBgA,Z27fERSsub99cZNbcJHBDg,GDpd9KZdUgnjcoCWTTphqw,4,2016-05-15,My mother can't handle Sichuan spice very well...,2,0,0,"""Szechuan Gourmet""",Cleveland,OH,44114,165,1,"[Restaurants, Chinese, Szechuan]"


Because we are doing sentiment analysis, removing certain stopwords such as negations (no, not, etc.) could completely reverse the sentiment of a sentence. As a result, we should keep a few of them. "Elsewhere" and "else" are commonly used in negative reviews (e.g. "Go elsewhere for your food"). Also, the word "well" can express positive sentiment (e.g. "Well done!" or "The food was prepared well.").
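
The effect is easy to see on a single sentence. The sketch below uses a deliberately tiny, hypothetical stopword set and plain whitespace tokenization; the notebook itself relies on spaCy's English stopwords and tokenizer.

```python
# Hypothetical miniature stopword list, for illustration only
stopwords = {"the", "was", "is", "a", "for", "your",
             "not", "no", "elsewhere", "well"}

# Words we want to keep because they carry sentiment
exceptions = {"not", "no", "elsewhere", "well"}

def remove_stopwords(text, stop):
    """Lowercase, split on whitespace, and drop any word found in `stop`."""
    return " ".join(w for w in text.lower().split() if w not in stop)

review = "The food was not good"
negation_lost = remove_stopwords(review, stopwords)
negation_kept = remove_stopwords(review, stopwords - exceptions)

print(negation_lost)
print(negation_kept)
```

With the full list the negation disappears and the review reads as positive; subtracting the exceptions keeps it.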

In [8]:
# Removes punctuation, emails, URLs, CJK tokens, and stop words, then lemmatizes
def lemma(text):
    lem_text = []
    
    for token in text:
        if (not token.is_punct and 
            not token.like_email and
            not token.like_url and
            not cjk_detect(token.text) and
            # Stopwords are matched on the lemma string
            # (all_stopwords holds strings, not Token objects)
            token.lemma_ not in all_stopwords): 
            lem_text.append(token.lemma_.lower())
    
    return " ".join(lem_text)

In [9]:
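# Build the stopword set, keeping sentiment-bearing words
nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words.add("-PRON-")

exceptions = {'do', 'no', 'not', 'never', 
              'nothing', 'none', 'go', 
              'else', 'elsewhere', 'well'}

# Note: this is the same set object as nlp.Defaults.stop_words,
# so the exceptions are removed from spaCy's defaults in place
all_stopwords = nlp.Defaults.stop_words
all_stopwords.difference_update(exceptions)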
all_stopwords

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 '-PRON-',
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'front',
 'full',
 'furt

In [10]:
%%time
# Run the spaCy pipeline over the reviews in parallel
df['proc'] = list(nlp.pipe(df.text, 
                           n_process=12, 
                           batch_size=1000))

# Create lemmatized sentences for the reviews
df['lem_join'] = df.proc.apply(lemma)

df.head()

Wall time: 4min 9s


Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,name,city,state,postal_code,review_count,is_open,categories,proc,lem_join
0,OPZsR2jCG72uoDNjU71DQQ,qYbWTWH5leltA0bzWAOnmA,meXjqyhTNLFmknY39y2sMg,5,2014-09-11,Solid beers -- Christmas Ale defines my holida...,1,1,1,"""Great Lakes Brewing Company""",Cleveland,OH,44113,751,1,"[Breweries, Restaurants, Event Planning & Serv...","(Solid, beers, --, Christmas, Ale, defines, my...",solid beer christmas ale define holiday season...
1,fxGwEiSYDtAen8BNuVGGxg,8Az_JgEpXqAii_5EDkw2tw,meXjqyhTNLFmknY39y2sMg,3,2013-10-13,Meh. It was OK. A bartender the night before...,0,1,0,"""Great Lakes Brewing Company""",Cleveland,OH,44113,751,1,"[Breweries, Restaurants, Event Planning & Serv...","(Meh, ., , It, was, OK, ., , A, bartender, t...",meh ok bartender night tell cleveland eat ...
2,Gweb4pADeQ26WnaiKEZ7GQ,T9tEic49JZjN4nCUcDvrRQ,meXjqyhTNLFmknY39y2sMg,4,2014-01-15,"Oh Christmas Ale, oh Christmas Ale, how lovely...",1,1,1,"""Great Lakes Brewing Company""",Cleveland,OH,44113,751,1,"[Breweries, Restaurants, Event Planning & Serv...","(Oh, Christmas, Ale, ,, oh, Christmas, Ale, ,,...",oh christmas ale oh christmas ale lovely do ta...
3,P1vhwPI56SeZEz10ywaS7w,W1p8_CFW5FISSihmQo5Qzw,meXjqyhTNLFmknY39y2sMg,3,2012-02-09,What is the big deal about this place? The foo...,2,1,1,"""Great Lakes Brewing Company""",Cleveland,OH,44113,751,1,"[Breweries, Restaurants, Event Planning & Serv...","(What, is, the, big, deal, about, this, place,...",big deal place food overprice beer do not
4,1kQvQlBX0V5_rGddBh9-rQ,Y_PP05RRdzbKRYfDCCfh8w,meXjqyhTNLFmknY39y2sMg,5,2017-04-30,Great Lakes Brewing Company is one of my favor...,0,0,0,"""Great Lakes Brewing Company""",Cleveland,OH,44113,751,1,"[Breweries, Restaurants, Event Planning & Serv...","(Great, Lakes, Brewing, Company, is, one, of, ...",great lakes brewing company favorite place wor...


In [11]:
%%time
# Keep only tips for the businesses we want; .copy() avoids a
# SettingWithCopyWarning when columns are added below
tips = tips[tips['business_id'].isin(df.business_id.unique())].copy()

# Run the spaCy pipeline over the tips in parallel
tips['proc'] = list(nlp.pipe(
    tips.text, 
    n_process=12, 
    batch_size=1000
))

# Create lemmatized sentences for the tips
tips['lem_join'] = tips.proc.apply(lemma)

# Add business names to the tips
biz_dict = pd.Series(biz.name.values, 
                     index=biz.business_id).to_dict()

tips['name'] = tips.business_id.map(biz_dict)

tips.head()

Wall time: 34.6 s


Unnamed: 0,text,date,likes,business_id,user_id,proc,lem_join,name
124,The Cleveland Pickle is the best sandwich deli...,2012-10-20,0,MTsIckdo3_uKuqk3B4zuKA,MolXvMRbUNY6Yr_s0zEQ0A,"(The, Cleveland, Pickle, is, the, best, sandwi...",cleveland pickle good sandwich deli hand uniqu...,"""Cleveland Pickle"""
382,A bit different. Out at a restaurant for Thank...,2010-11-25,0,0youcKV6-eE3F2MQj1l6Fw,blrWvPePSv87aU9hV1Zd8Q,"(A, bit, different, ., Out, at, a, restaurant,...",bit different restaurant thanksgiving buffet l...,"""100th Bomb Group"""
596,Tab Benoit,2012-08-10,0,CDqPVVvQtVncNQGydnZy7A,JE2qFjL4BaUbiI-cT5MSBw,"(Tab, Benoit)",tab benoit,"""Beachland Ballroom and Tavern"""
1140,One of the. Eat buffet I have been to in an ve...,2016-10-28,0,pGjtxXBq4tZcdKdgTU-Tww,a4pc6NRtbGkO7koP8qrVsg,"(One, of, the, ., Eat, buffet, I, have, been, ...",eat buffet long time,"""Little Hong Kong"""
3557,Shipwreck! That's an awesome breakfast!,2010-09-11,0,_5PJ4GHIXNdUdXtohylKGQ,ca___2Qaf5FFyPCf6T2eZA,"(Shipwreck, !, That, 's, an, awesome, breakfas...",shipwreck awesome breakfast,"""Lucky's Café"""
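
The filter-then-map pattern used for the tips can be sketched on toy data. The frames below are hypothetical stand-ins for the filtered business table and the raw tips table; the IDs, names, and tip texts are made up.

```python
import pandas as pd

# Hypothetical stand-ins for the business and tips tables
biz_toy = pd.DataFrame({
    "business_id": ["a1", "b2"],
    "name": ["Toy Diner", "Toy Deli"],
})
tips_toy = pd.DataFrame({
    "business_id": ["a1", "zz", "b2", "a1"],
    "text": ["Try the pie", "Closed on Mondays", "Cash only", "Great coffee"],
})

# Keep only tips for known businesses; .copy() makes the result an
# independent frame, so adding columns later does not raise a warning
kept = tips_toy[tips_toy["business_id"].isin(biz_toy["business_id"])].copy()

# Look up each tip's business name through a business_id -> name Series
name_by_id = biz_toy.set_index("business_id")["name"]
kept["name"] = kept["business_id"].map(name_by_id)
print(kept)
```

Mapping through a Series indexed by `business_id` gives the same lookup as the dictionary built in the cell above; either form works with `Series.map`.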


# Export to CSV

We will move these CSV files to Google Drive so they can be loaded into the Google Colab notebook. We will drop redundant columns before exporting.

In [12]:
# let's remove unnecessary columns
col = ['city', 'state', 'postal_code', 
       'is_open', 'user_id', 'review_id', 
       'proc', 'categories']
df.drop(columns=col, inplace=True)  

# let's create a csv file
df.to_csv('yelp_reviews_lem.csv', index=False)

In [13]:
# let's remove unnecessary columns
col = ['user_id', 'proc']
tips.drop(columns=col, inplace=True)

# let's create a csv file
tips.to_csv('yelp_tips_lem.csv', index=False)
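
One thing worth checking after the export: a review or tip made up entirely of stopwords lemmatizes to an empty string, and pandas reads empty CSV fields back as NaN by default. The sketch below reproduces this on a made-up two-row frame, writing to an in-memory buffer rather than a file; the frame contents and variable names are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical frame: the first row lemmatized to an empty string
toy = pd.DataFrame({
    "text": ["It was", "Great food"],
    "lem_join": ["", "great food"],
})

# Round-trip through CSV using an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# The empty string comes back as a missing value
n_missing = int(loaded["lem_join"].isna().sum())
print(n_missing)

# Restore empty strings before vectorizing the text downstream
loaded["lem_join"] = loaded["lem_join"].fillna("")
```

Filling the missing values right after loading keeps text vectorizers in the later notebooks from failing on NaN entries.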