# What's in a Yelp Review?
**Michael Feeley**  
**Metis Bootcamp - Project 4**

**===================================================================================================================**

### Preparing Data

**Below are all the imported packages for this project.**

In [None]:
# Obligatory
import numpy as np
import pandas as pd
import pickle

# Pre-processing
import langdetect
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
import re
from sklearn.feature_extraction import text
import string

# Modeling
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

**The data is stored in the 'data' subdirectory as json files.  Here, I use pandas' 'read_json' function to read in the data.**

**The original review file was too large to load on my MacBook, so I used the terminal to create a subset containing the first 100K rows.**

In [30]:
# Read in the reviews and business data
review_df = pd.read_json(r'data/review_sub.json', lines = True)
business_df = pd.read_json(r'data/business.json', lines = True)
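
**The subsetting step was done in the terminal and is not shown.  As a minimal sketch of the same idea in Python, assuming the full file is in JSON Lines format (one review per line), the first N lines can be copied without loading the whole file.  The function name 'write_head' and the path 'data/review.json' are hypothetical.**

```python
from itertools import islice


def write_head(src, dst, n_rows):
    """Copy the first n_rows lines of a JSON Lines file to a new file."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        # islice stops after n_rows lines, so the rest of the file is never read
        fout.writelines(islice(fin, n_rows))


# Hypothetical usage ('data/review.json' is an assumed name for the full file):
# write_head("data/review.json", "data/review_sub.json", 100_000)
```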

**I then checked which features are in the review dataframe.  This helps determine what is actually useful for the project.**

In [31]:
# Column names - review
review_df.columns

Index(['business_id', 'cool', 'date', 'funny', 'review_id', 'stars', 'text',
       'useful', 'user_id'],
      dtype='object')

**I don't need most of these features for the purpose of this project, so I only kept:**
* business_id (for zeroing in on a category of reviews)
* stars for the review (for analyzing positive and negative reviews separately)
* text (for insights)

In [32]:
# Retrieve the desired features
review_df = review_df.loc[:,['business_id','stars','text']]

In [33]:
# Display
review_df.head()

Unnamed: 0,business_id,stars,text
0,ujmEBvifdJM6h6RLv4wQIg,1,Total bill for this horrible service? Over $8G...
1,NZnhc2sEQy3RmzKTZnqtwQ,5,I *adore* Travis at the Hard Rock's new Kelly ...
2,WTqjgwHlXbSFevF32_DJVw,5,I have to say that this office really has it t...
3,ikCg8xy5JIg_NGPx-MSIDA,5,Went in for a lunch. Steak sandwich was delici...
4,b1b1eb3uo-w561D0ZfCEiQ,1,Today was my second out of three sessions I ha...


**I did the same thing for the business dataframe, as I only needed the categories.  I also kept the cities for future work.**

In [34]:
# Column names - business
business_df.columns

Index(['address', 'attributes', 'business_id', 'categories', 'city', 'hours',
       'is_open', 'latitude', 'longitude', 'name', 'postal_code',
       'review_count', 'stars', 'state'],
      dtype='object')

In [35]:
# Retrieve the desired features
business_df = business_df.loc[:,['business_id','categories','city']]

In [36]:
# Display
business_df.head()

Unnamed: 0,business_id,categories,city
0,1SWheh84yJXfytovILXOAQ,"Golf, Active Life",Phoenix
1,QXAEGFB4oINsVuTFxEYKFQ,"Specialty Food, Restaurants, Dim Sum, Imported...",Mississauga
2,gnKjwL_1w79qoiV3IC_xQQ,"Sushi Bars, Restaurants, Japanese",Charlotte
3,xvX2CttrVhyG2z1dFg_0xw,"Insurance, Financial Services",Goodyear
4,HhyxOkGAM07SRYtlQ4wMFQ,"Plumbing, Shopping, Local Services, Home Servi...",Charlotte


**Here's where the two dataframes tie together.  I merged both dataframes on the business_id, then filtered the review dataframe to include only 'Food' or 'Restaurant' categories.**

In [37]:
# Merge the dataframes
review_df = review_df.merge(business_df, on = 'business_id')

In [38]:
# All categories to be extracted
food = [ \
     "Afghan", "African", "Senegalese", "South African", "American (New)", 
     "American (Traditional)", "Arabian", "Argentine", "Armenian", "Asian Fusion", 
     "Australian", "Austrian", "Bangladeshi", "Barbeque", "Basque", "Belgian", 
     "Brasseries", "Brazilian", "Breakfast & Brunch", "British", "Buffets", 
     "Burgers", "Burmese", "Cafes", "Themed Cafes", "Cafeteria", "Cajun/Creole", 
     "Cambodian", "Caribbean", "Dominican", "Haitian", "Puerto Rican", "Trinidadian", 
     "Catalan", "Cheesesteaks", "Chicken Shop", "Chicken Wings", "Chinese", "Cantonese", 
     "Dim Sum", "Hainan", "Shanghainese", "Szechuan", "Comfort Food", "Creperies", 
     "Cuban", "Czech", "Delis", "Diners", "Dinner Theater", "Ethiopian", "Fast Food", 
     "Filipino", "Fish & Chips", "Fondue", "Food Court", "Food Stands", "French", 
     "Mauritius", "Reunion", "Game Meat", "Gastropubs", "German", "Gluten-Free", "Greek", 
     "Guamanian", "Halal", "Hawaiian", "Himalayan/Nepalese", "Honduran", 
     "Hong Kong Style Cafe", "Hot Dogs", "Hot Pot", "Hungarian", "Iberian", "Indian", 
     "Indonesian", "Irish", "Italian", "Calabrian", "Sardinian", "Sicilian", "Tuscan", 
     "Japanese", "Conveyor Belt Sushi", "Izakaya", "Japanese Curry", "Ramen", 
     "Teppanyaki", "Kebab", "Korean", "Kosher", "Laotian", "Latin American", "Colombian", 
     "Salvadoran", "Venezuelan", "Live/Raw Food", "Malaysian", "Mediterranean", "Falafel", 
     "Mexican", "Tacos", "Middle Eastern", "Egyptian", "Lebanese", "Modern European", 
     "Mongolian", "Moroccan", "New Mexican Cuisine", "Nicaraguan", "Noodles", "Pakistani",
     "Pan Asia", "Persian/Iranian", "Peruvian", "Pizza", "Polish", "Polynesian", 
     "Pop-Up Restaurants", "Portuguese", "Poutineries", "Russian", "Salad", "Sandwiches", 
     "Scandinavian", "Scottish", "Seafood", "Singaporean", "Slovakian", "Soul Food", "Soup", 
     "Southern", "Spanish", "Sri Lankan", "Steakhouses", "Supper Clubs", "Sushi Bars", 
     "Syrian", "Taiwanese", "Tapas Bars", "Tapas/Small Plates", "Tex-Mex", "Thai", 
     "Turkish", "Ukrainian", "Uzbek", "Vegan", "Vegetarian", "Vietnamese", "Waffles", "Wraps",
     "Food", "Bakeries", "Acai Bowls", "Bagels", "Bakeries", "Beer, Wine & Spirits", "Beverage Store", 
     "Breweries", "Brewpubs", "Bubble Tea", "Butcher", "CSA", "Chimney Cakes", "Cideries", 
     "Coffee & Tea", "Coffee Roasteries", "Convenience Stores", "Cupcakes", "Custom Cakes", 
     "Desserts", "Distilleries", "Do-It-Yourself Food", "Donuts", "Empanadas", "Farmers Market", 
     "Food Delivery Services", "Food Trucks", "Gelato", "Grocery", "Honey", "Ice Cream & Frozen Yogurt", 
     "Imported Food", "International Grocery", "Internet Cafes", "Juice Bars & Smoothies", "Kombucha", 
     "Organic Stores", "Patisserie/Cake Shop", "Piadina", "Poke", "Pretzels", "Shaved Ice", "Shaved Snow", 
     "Smokehouse", "Specialty Food", "Candy Stores", "Cheese Shops", "Chocolatiers & Shops", 
     "Fruits & Veggies", "Health Markets", "Herbs & Spices", "Macarons", "Meat Shops", "Olive Oil", "Pasta Shops", 
     "Popcorn Shops", "Seafood Markets", "Street Vendors", "Tea Rooms", "Water Stores", "Wineries", "Wine Tasting Room"]

In [39]:
# Split the categories feature into a list
review_df['categories'] = review_df['categories'].apply(lambda x: str(x).split(','))

# Strip white space
review_df['categories'] = review_df['categories'].apply(lambda x: [cat.strip() for cat in x])

review_df.head()

Unnamed: 0,business_id,stars,text,categories,city
0,ujmEBvifdJM6h6RLv4wQIg,1,Total bill for this horrible service? Over $8G...,"[Fitness & Instruction, Doctors, Health & Medi...",Las Vegas
1,ujmEBvifdJM6h6RLv4wQIg,4,My family has used this ER four times in the p...,"[Fitness & Instruction, Doctors, Health & Medi...",Las Vegas
2,ujmEBvifdJM6h6RLv4wQIg,1,I have never been more disappointed by the car...,"[Fitness & Instruction, Doctors, Health & Medi...",Las Vegas
3,ujmEBvifdJM6h6RLv4wQIg,1,"Went in for a broken finger, was asked if I wa...","[Fitness & Instruction, Doctors, Health & Medi...",Las Vegas
4,ujmEBvifdJM6h6RLv4wQIg,5,My mother was at Mountain View for nearly two ...,"[Fitness & Instruction, Doctors, Health & Medi...",Las Vegas


In [40]:
# Flag the document for whether it's in relevant categories
review_df['food'] = review_df['categories'].apply(lambda x: [set(x) & set(food)])
review_df['food'] = review_df['food'].apply(lambda x: len(x[0]) > 0)
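
**The flagging logic boils down to a set intersection test.  Here is a minimal sketch in plain Python, assuming the categories arrive as one comma-separated string; 'is_food_business' and 'food_sample' are hypothetical names used only for illustration.**

```python
def is_food_business(categories, food_categories):
    """Return True if any comma-separated category is in food_categories."""
    # Mirror the notebook: cast to str, split on commas, strip whitespace
    cats = {cat.strip() for cat in str(categories).split(",")}
    # isdisjoint is True when the two sets share no element
    return not cats.isdisjoint(food_categories)


# A tiny stand-in for the full food list
food_sample = {"Pizza", "Italian", "Tapas Bars"}

print(is_food_business("Restaurants, Italian, Pizza", food_sample))    # True
print(is_food_business("Insurance, Financial Services", food_sample))  # False
print(is_food_business(None, food_sample))                             # False
```

**One consequence of splitting on commas: a list entry that itself contains a comma, such as "Beer, Wine & Spirits", can never match, because the split breaks it into two pieces.**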

In [41]:
# Retrieve desired observations (only food reviews)
review_df = review_df[review_df['food'] == True]

In [42]:
review_df.head()

Unnamed: 0,business_id,stars,text,categories,city,food
60,ikCg8xy5JIg_NGPx-MSIDA,5,Went in for a lunch. Steak sandwich was delici...,"[Bars, Pubs, Nightlife, Tapas Bars, Restaurants]",Calgary,True
61,ikCg8xy5JIg_NGPx-MSIDA,1,"Really one of dirtiest places to eat,not sure ...","[Bars, Pubs, Nightlife, Tapas Bars, Restaurants]",Calgary,True
62,ikCg8xy5JIg_NGPx-MSIDA,1,"Terrible place to eat and or drink, waitresses...","[Bars, Pubs, Nightlife, Tapas Bars, Restaurants]",Calgary,True
75,eU_713ec6fTGNO4BegRaww,4,I'll be the first to admit that I was not exci...,"[Restaurants, Italian, Pizza]",Pittsburgh,True
76,eU_713ec6fTGNO4BegRaww,5,One of the best Italian restaurants in a city ...,"[Restaurants, Italian, Pizza]",Pittsburgh,True


**When exploring the data, I noticed some reviews written in languages other than English.  Since I'm only working with English, I used the 'langdetect' package to detect which reviews are in English and filtered down the dataset.**

In [43]:
def language(x):
    '''
    Detect the language of a string.
    '''
    try:
        return langdetect.detect(x)
    except Exception:
        # Detection fails on text with no usable features (e.g. empty strings)
        return np.nan

In [44]:
# Language flag
review_df['language'] = review_df['text'].apply(lambda x: language(x))

**The dataset now has a language feature, which I will use to take an English-only subset of the data.**

In [45]:
review_df.head()

Unnamed: 0,business_id,stars,text,categories,city,food,language
60,ikCg8xy5JIg_NGPx-MSIDA,5,Went in for a lunch. Steak sandwich was delici...,"[Bars, Pubs, Nightlife, Tapas Bars, Restaurants]",Calgary,True,en
61,ikCg8xy5JIg_NGPx-MSIDA,1,"Really one of dirtiest places to eat,not sure ...","[Bars, Pubs, Nightlife, Tapas Bars, Restaurants]",Calgary,True,en
62,ikCg8xy5JIg_NGPx-MSIDA,1,"Terrible place to eat and or drink, waitresses...","[Bars, Pubs, Nightlife, Tapas Bars, Restaurants]",Calgary,True,en
75,eU_713ec6fTGNO4BegRaww,4,I'll be the first to admit that I was not exci...,"[Restaurants, Italian, Pizza]",Pittsburgh,True,en
76,eU_713ec6fTGNO4BegRaww,5,One of the best Italian restaurants in a city ...,"[Restaurants, Italian, Pizza]",Pittsburgh,True,en


In [46]:
# Language distribution
review_df['language'].value_counts()

en       67030
fr         336
es          23
de          17
zh-cn        8
it           7
ja           4
nl           4
pt           3
sk           3
da           3
zh-tw        2
no           2
cy           2
ca           1
af           1
tl           1
hr           1
ko           1
ro           1
fi           1
sv           1
sl           1
Name: language, dtype: int64

In [47]:
# Only include English
review_df = review_df[review_df['language'] == 'en'].reset_index(drop = True)

**This is where the text pre-processing begins.  I removed all punctuation marks and numbers, and made all letters lowercase.  I also removed the new line (\n) character from all reviews.**

In [48]:
# Sample raw text
review_df['text'][1][:200]

'Really one of dirtiest places to eat,not sure how they get past the Health inspections,very rude staff and management. Spend your money elsewhere.'

In [49]:
# Remove the apostrophes (separation issues in word_tokenize)
remove_apostrophes = lambda x: x.replace('\'', '')

# Remove any token that contains a digit
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)

# Lowercase the text and replace punctuation with spaces
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

# Remove new line characters
no_new_line = lambda x: x.replace('\n','')

review_df['text'] = review_df['text'].map(remove_apostrophes).map(alphanumeric).map(punc_lower).map(no_new_line)
review_df.head(5)

Unnamed: 0,business_id,stars,text,categories,city,food,language
0,ikCg8xy5JIg_NGPx-MSIDA,5,went in for a lunch steak sandwich was delici...,"[Bars, Pubs, Nightlife, Tapas Bars, Restaurants]",Calgary,True,en
1,ikCg8xy5JIg_NGPx-MSIDA,1,really one of dirtiest places to eat not sure ...,"[Bars, Pubs, Nightlife, Tapas Bars, Restaurants]",Calgary,True,en
2,ikCg8xy5JIg_NGPx-MSIDA,1,terrible place to eat and or drink waitresses...,"[Bars, Pubs, Nightlife, Tapas Bars, Restaurants]",Calgary,True,en
3,eU_713ec6fTGNO4BegRaww,4,ill be the first to admit that i was not excit...,"[Restaurants, Italian, Pizza]",Pittsburgh,True,en
4,eU_713ec6fTGNO4BegRaww,5,one of the best italian restaurants in a city ...,"[Restaurants, Italian, Pizza]",Pittsburgh,True,en
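
**The cleaning steps above can be sketched as a single function.  This is an illustration, not the notebook's exact pipeline: as a labeled assumption, it turns newlines into spaces and collapses repeated whitespace, so words on adjacent lines are never glued together.  The name 'clean_text' is hypothetical.**

```python
import re
import string

# Punctuation characters, escaped for use inside a regex character class
PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))
# Any token that contains at least one digit
DIGIT_TOKEN_RE = re.compile(r"\w*\d\w*")


def clean_text(text):
    """Lowercase a review and strip apostrophes, digit tokens and punctuation."""
    text = text.replace("'", "")            # "I'll" -> "Ill"
    text = DIGIT_TOKEN_RE.sub(" ", text)    # drop tokens such as "8" or "2019"
    text = PUNCT_RE.sub(" ", text.lower())  # lowercase, then punctuation -> space
    # Assumption (differs from the notebook): newlines become spaces and
    # repeated whitespace is collapsed, so words on adjacent lines stay apart
    return " ".join(text.split())


print(clean_text("I'll be back!\nBest $8 steak in 2019."))
# ill be back best steak in
```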


**At this point, I pickled the dataframe to save the work done so far and load it back up quickly.**

In [53]:
with open('pickles/review_df','wb') as file:
      pickle.dump(review_df, file)
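
**Loading the pickle back is the mirror image of the dump.  A minimal round-trip sketch, using a plain dict and a temporary directory as stand-ins for 'review_df' and the 'pickles' folder; 'save_pickle' and 'load_pickle' are hypothetical helper names.**

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as file:
        pickle.dump(obj, file)


def load_pickle(path):
    """Load a pickled object back from path."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Round trip on a stand-in object; the notebook pickles review_df instead
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "review_df"
    save_pickle({"stars": [5, 1], "text": ["great", "awful"]}, target)
    restored = load_pickle(target)

print(restored["stars"])  # [5, 1]
```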

**As is typical with NLP projects, I applied lemmatization to the text.  First, I wrote a function to get the part of speech based on the 'nltk.pos_tag' function.  I passed this value into the lemmatizer function to get a more accurate lemma for each word (this wasn't 100% accurate, but it was more accurate than lemmatizing without the part-of-speech tag).**

In [57]:
def get_wordnet_pos(word):
    '''
    Map POS tag to first character lemmatize() accepts.
    '''
    
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [58]:
def lemmatizer(text):
    '''
    Lemmatize a given string.
    '''
    
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(text)]
    lemmatized_text = ' '.join(tokens)
    return lemmatized_text

In [71]:
# Apply lemmatization to the text
review_df['text'] = review_df['text'].apply(lambda x: lemmatizer(x))

**===================================================================================================================**

### Stop Words

**This is my favorite part of text pre-processing!  'Stop words' are words that don't add much value to the data.  For example, 'the', 'her', 'town', and 'then' add unnecessary noise to the data.  As you add more stop words, the topics modeled after vectorization become more specific.  If you add too many, you may end up 'overfitting' the topics.  If you don't add enough, the topics may not be distinguishable.**

**This is the main list of stop words, taken from the nltk and sklearn stop word lists.  I removed apostrophes so the words match the punctuation-free text in my dataset.**

In [89]:
# Add general English stopwords without apostrophes
nltk_stopwords = []

for word in list(stopwords.words('english')):
    nltk_stopwords.append(word.replace('\'',''))

# Add these words to our main list
stop_words = text.ENGLISH_STOP_WORDS.union(nltk_stopwords)
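
**To illustrate why the apostrophes are stripped: the cleaned reviews contain 'dont', not "don't", so an unmodified stop word would never match.  A minimal sketch with a three-word stand-in for the NLTK list; 'strip_apostrophes' is a hypothetical helper.**

```python
def strip_apostrophes(words):
    """Return the words with apostrophes removed, as a set."""
    return {word.replace("'", "") for word in words}


# Small stand-in for the NLTK list
sample_stop_words = ["don't", "you're", "the"]
cleaned_tokens = "i dont think youre wrong about the pizza".split()

normalised = strip_apostrophes(sample_stop_words)
kept = [tok for tok in cleaned_tokens if tok not in normalised]
print(kept)  # ['i', 'think', 'wrong', 'about', 'pizza']
```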

**I then kept a running list of extra stop words to add as I homed in on specific topics.**

In [92]:
# Add more stop words here
add_stop_words = ['las','la','tony','ive','say','really','eat','friend','im','cup',
                  'good','food']

# Join the stop words above to the original list
stop_words = stop_words.union(add_stop_words)

# Display the first five alphabetically
list(sorted(stop_words))[:5]

['a', 'about', 'above', 'across', 'after']

**===================================================================================================================**

### Vectorization

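**Now that the text has been cleaned, it's time to vectorize it.  Vectorization involves collecting every word that occurs in the entire dataset and counting how many times it appears in each document.  I used the TfidfVectorizer from sklearn for this task, which additionally down-weights words that appear in many documents.  Setting min_df to .005 drops words that appear in fewer than 0.5% of the reviews.**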

#### TF-IDF Vectorizer

In [95]:
# Create the vectorizer object
vectorizer = TfidfVectorizer(stop_words = stop_words, min_df = .005)

# Create the doc_word sparse matrix
doc_word = vectorizer.fit_transform(review_df['text'])

# Create a dataframe for easy labeled viewing
doc_word_df = pd.DataFrame(doc_word.toarray(), columns = vectorizer.get_feature_names())

**Here you can see the result of vectorization.  Because many words only appear in a small number of reviews, the matrix contains a large number of zeros.  This is known as a 'sparse matrix'.  In the above code, the 'doc_word.toarray()' call converts the doc_word sparse matrix object into a dense array, which becomes the content of the dataframe below (with the zeros included).**

In [117]:
doc_word_df.head()

Unnamed: 0,able,absolute,absolutely,accommodate,act,actual,actually,add,addition,additional,...,yeah,year,yelp,yes,yesterday,yogurt,york,young,yum,yummy
0,0.0,0.0,0.184802,0.223004,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
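
**To make the weights in the table less of a black box, here is a minimal pure-Python sketch of TF-IDF.  It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which is my understanding of scikit-learn's defaults; it ignores tokenisation details, stop words, and min_df.  The function name 'tfidf' is hypothetical.**

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute L2-normalised TF-IDF weights for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


docs = ["pizza crust burnt", "pizza slice cold", "pizza sauce fresh"]
rows = tfidf(docs)
# 'pizza' is in every document, so it gets a lower weight than 'burnt'
print(rows[0]["pizza"] < rows[0]["burnt"])  # True
```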


**===================================================================================================================**

### NMF Topic Modeling

**We now have a matrix of review vectors that each contain a value for each word, AKA document_word matrix or document_term matrix.  With this, I can create an NMF topic model object (Non-negative Matrix Factorization) and fit it to my dataset.**

**After experimenting with a variety of values for 'n_components' (the number of topics to cluster), I found 10 to be an optimal number for accurate and helpful topics.**

In [103]:
# Initialize NMF model with 10 topics
nmf = NMF(n_components = 10)

# Fit the doc_word sparse matrix
doc_topic = nmf.fit_transform(doc_word)

# Create a dataframe for easy labeled viewing
doc_topic_df = pd.DataFrame(doc_topic.round(5),
                            index = review_df.text.apply(lambda x: x[:100]),
                            columns = range(10))

**The modeling process outputs a matrix of review vectors, each containing a weight for each topic.  Because the computer cannot label the topics itself (it's clustering purely based on the review vectors, not actually separating topics), it's my job as a human to label the topics.**

In [123]:
doc_topic_df.head(1)

text,0,1,2,3,4,5,6,7,8,9
go in for a lunch steak sandwich be delicious and the caesar salad have an absolutely delicious dres,0.00043,0.02337,0.00141,0.01143,0.0,0.02099,0.0,0.0,0.0,0.02662
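
**For intuition about what the factorization does, here is a minimal pure-Python sketch of NMF on a toy matrix.  It uses the classic multiplicative update rules, which I have swapped in for readability; as far as I know, sklearn's NMF uses a different solver by default, so this is not what the notebook runs.  All function names are hypothetical.**

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius distance between V and W @ H."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rv, rw in zip(v, wh) for a, b in zip(rv, rw))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x words) into W (docs x topics) and H (topics x words)."""
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_words)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_words)]
             for i in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Toy doc-word matrix: docs 0-1 use words 0-1, docs 2-3 use words 2-3
v = [[1, 1, 0, 0], [2, 2, 0, 0], [0, 0, 1, 1], [0, 0, 3, 3]]
w, h = nmf(v, n_topics=2)
print("reconstruction error:", frobenius_error(v, w, h))
```

**Here W plays the role of 'doc_topic' (one row of topic weights per review) and H plays the role of 'nmf.components_' (one row of word weights per topic).**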


**The function below displays the top n words for each 'topic' found by the NMF model.  If the words make sense to be together and fall under a specific topic, the model did a good job of extracting a topic from the collection of reviews.  This ties back in to the goal of the project, which is to figure out what topics are associated with positive reviews and negative reviews.**

In [119]:
def display_topics(model, feature_names, no_top_words, topic_names=None):
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic: ", ix)
        else:
            print("\n", ix+1, "-", topic_names[ix], "\n")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
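
**The slice 'topic.argsort()[:-no_top_words - 1:-1]' can look cryptic: argsort returns indices in ascending order of weight, and the reversed slice takes the last n of them, highest first.  A minimal pure-Python sketch of the same selection, with made-up weights; 'top_words' is a hypothetical helper, and the ordering may differ from argsort when weights are tied.**

```python
def top_words(weights, feature_names, n):
    """Return the n feature names with the largest weights, highest first."""
    # Sort indices by weight in descending order, then keep the first n
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [feature_names[i] for i in order[:n]]


# Made-up weights standing in for one row of model.components_
weights = [0.1, 2.3, 0.0, 1.7, 0.4]
names = ["salad", "pizza", "sushi", "crust", "cheese"]
print(top_words(weights, names, 3))  # ['pizza', 'crust', 'cheese']
```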

In [120]:
display_topics(nmf, vectorizer.get_feature_names(), 15)


Topic:  0
order, time, wait, come, minute, ask, table, service, drink, server, bad, customer, hour, waitress, told

Topic:  1
great, service, atmosphere, price, beer, selection, drink, awesome, bar, wine, spot, happy, nice, lunch, fun

Topic:  2
pizza, crust, cheese, topping, slice, sauce, order, wing, delivery, salad, pie, best, garlic, italian, dough

Topic:  3
chicken, fry, rice, order, sauce, dish, soup, salad, noodle, spicy, beef, thai, curry, pork, chinese

Topic:  4
burger, fry, cheese, beer, onion, bun, patty, ring, bacon, mac, shake, guy, joint, topping, dog

Topic:  5
like, try, taste, little, make, sandwich, nice, menu, coffee, look, cream, flavor, pretty, breakfast, ice

Topic:  6
sushi, roll, buffet, price, fresh, quality, fish, sashimi, lunch, crab, tuna, salmon, rice, las_vegas, best

Topic:  7
place, love, favorite, try, awesome, time, people, clean, fun, new, family, nice, recommend, look, year

Topic:  8
taco, salsa, mexican, fish, burrito, chip, asada, bean, tortill

**You can see above that the topics are fairly accurately clustered.  Topic 2 seems to be about pizzerias, Topic 6 is about sushi or Japanese restaurants, and Topic 8 is about Mexican restaurants.**

**===================================================================================================================**

### Positive vs Negative

**I split the reviews into positive and negative subsets to explore what words are associated with each sentiment.  I ran the same vectorization and modeling process on both subsets.**

**Start with positive reviews.**

In [125]:
# Define the mask for positive reviews (only 5 and 4 stars)
pos_mask = ((review_df['stars'] == 5.0) |
            (review_df['stars'] == 4.0))

pos_review_df = review_df[pos_mask].reset_index(drop = True)

**Vectorization**

In [130]:
# Create the vectorizer object
pos_vectorizer = TfidfVectorizer(stop_words = stop_words, min_df = .005)

# Create the doc_word sparse matrix
pos_doc_word = pos_vectorizer.fit_transform(pos_review_df['text'])

# Create a dataframe for easy labeled viewing
pos_doc_word_df = pd.DataFrame(pos_doc_word.toarray(), columns = pos_vectorizer.get_feature_names())

**Topic Modeling**

In [142]:
# Initialize NMF model with 10 topics
pos_nmf = NMF(n_components = 10)

# Fit the doc_word sparse matrix
pos_doc_topic = pos_nmf.fit_transform(pos_doc_word)

# Create a dataframe for easy labeled viewing
pos_doc_topic_df = pd.DataFrame(pos_doc_topic.round(5),
                                index = pos_review_df.text.apply(lambda x: x[:100]),
                                columns = range(10))

**Then do the same for negative reviews.**

In [151]:
# Define the mask for negative reviews (only 2 and 1 stars)
neg_mask = ((review_df['stars'] == 2.0) |
            (review_df['stars'] == 1.0))

neg_review_df = review_df[neg_mask].reset_index(drop = True)
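
**Note that the two masks leave 3-star reviews in neither subset.  A minimal sketch of the bucketing rule in plain Python makes that explicit; 'sentiment_bucket' is a hypothetical helper, not part of the notebook.**

```python
def sentiment_bucket(stars):
    """Map a star rating to 'positive', 'negative', or None for 3 stars."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None  # 3-star reviews fall in neither subset


ratings = [5, 4, 3, 2, 1]
print([sentiment_bucket(s) for s in ratings])
# ['positive', 'positive', None, 'negative', 'negative']
```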

In [152]:
# Create the vectorizer object
neg_vectorizer = TfidfVectorizer(stop_words = stop_words, min_df = .005)

# Create the doc_word sparse matrix
neg_doc_word = neg_vectorizer.fit_transform(neg_review_df['text'])

# Create a dataframe for easy labeled viewing
neg_doc_word_df = pd.DataFrame(neg_doc_word.toarray(), columns = neg_vectorizer.get_feature_names())

In [153]:
# Initialize NMF model with 10 topics
neg_nmf = NMF(n_components = 10)

# Fit the doc_word sparse matrix
neg_doc_topic = neg_nmf.fit_transform(neg_doc_word)

# Create a dataframe for easy labeled viewing
neg_doc_topic_df = pd.DataFrame(neg_doc_topic.round(5),
                                index = neg_review_df.text.apply(lambda x: x[:100]),
                                columns = range(10))

**Then compare the topics for positive and negative reviews.**

In [156]:
pos_topics = ['General','Experience','Pizzeria','Chinese Food','Positivity',
              'Burger Joint','Sushi','Breakfast','Mexican Food','Experience']

In [157]:
neg_topics = ['General','Chinese Food','Pizzeria','Wait Time','Staff',
              'Burger Joint','Service','Seafood','Sushi','Mexican Food']

**When taking a closer look at top words extracted per topic, a number of insights are apparent.**
* Pizzerias typically get negative reviews for burnt or soggy pizza and slow delivery.
* Offering fresh sauce and having plenty of garlic is key.
* Wait time and staff attitude are big drivers of negative sentiment.
* Burger joints need to make sure they don't overcook their burgers.  Keep it fresh and juicy.
* Also, serve the burgers in a timely manner, because 'cold' is a frequent term used in negative reviews.

In [160]:
display_topics(pos_nmf, pos_vectorizer.get_feature_names(), 15, topic_names = pos_topics)


 1 - General 

time, come, like, order, make, menu, wait, restaurant, drink, table, nice, bar, night, little, look

 2 - Experience 

great, service, atmosphere, price, awesome, beer, selection, drink, customer, excellent, lunch, spot, fantastic, wine, fun

 3 - Pizzeria 

pizza, crust, cheese, topping, slice, wing, sauce, order, salad, delivery, italian, best, pie, fresh, garlic

 4 - Chinese Food 

chicken, fry, rice, order, sauce, soup, dish, salad, delicious, spicy, noodle, beef, thai, try, pork

 5 - Positivity 

place, love, favorite, try, awesome, like, new, family, recommend, people, clean, fun, look, kid, time

 6 - Burger Joint 

burger, fry, cheese, beer, onion, ring, mac, bacon, bun, patty, dog, juicy, shake, guy, joint

 7 - Sushi 

amaze, best, sushi, service, recommend, highly, restaurant, las_vegas, roll, excellent, definitely, delicious, town, absolutely, customer

 8 - Breakfast 

breakfast, coffee, sandwich, cream, ice, egg, chocolate, cake, flavor, delicious, tea, 

In [161]:
display_topics(neg_nmf, neg_vectorizer.get_feature_names(), 15, topic_names = neg_topics)


 1 - General 

table, come, server, ask, drink, waitress, seat, restaurant, bar, waiter, water, meal, menu, sat, manager

 2 - Chinese Food 

chicken, sauce, order, taste, rice, fry, salad, like, dry, flavor, soup, sandwich, dish, meat, noodle

 3 - Pizzeria 

pizza, crust, cheese, order, delivery, sauce, topping, slice, like, deliver, pie, dough, burnt, soggy, wing

 4 - Wait Time 

wait, order, minute, hour, long, time, min, finally, line, told, later, busy, people, arrive, ready

 5 - Staff 

time, location, customer, make, work, employee, store, manager, rude, want, ask, order, drive, know, sandwich

 6 - Burger Joint 

burger, fry, cheese, bun, patty, onion, bacon, dry, order, beer, cooked, cold, medium, guy, ring

 7 - Service 

service, bad, horrible, slow, customer, terrible, poor, rude, great, staff, experience, awful, mediocre, quality, restaurant

 8 - Seafood 

buffet, price, sushi, crab, las_vegas, selection, quality, bellagio, dessert, leg, worth, dinner, lunch, money, s

**I decided to change things up and zoom in on specific quality metrics like wait time and service, rather than keeping the types of restaurant separate.  This method may give more insight.**

**To do this, I added more stop words to eliminate any indication of what type of restaurant it is.  That way, the NMF model will no longer cluster around those types of words.**

In [155]:
neg_review_df.head()

Unnamed: 0,business_id,stars,text,categories,city,food,language
0,ikCg8xy5JIg_NGPx-MSIDA,1,really one of dirtiest place to eat not sure h...,"[Bars, Pubs, Nightlife, Tapas Bars, Restaurants]",Calgary,True,en
1,ikCg8xy5JIg_NGPx-MSIDA,1,terrible place to eat and or drink waitress be...,"[Bars, Pubs, Nightlife, Tapas Bars, Restaurants]",Calgary,True,en
2,eU_713ec6fTGNO4BegRaww,1,oh this place i want to like it i really do i ...,"[Restaurants, Italian, Pizza]",Pittsburgh,True,en
3,eU_713ec6fTGNO4BegRaww,2,this place be byob which i love and i remember...,"[Restaurants, Italian, Pizza]",Pittsburgh,True,en
4,eU_713ec6fTGNO4BegRaww,2,we use to love come to this restaurant but i m...,"[Restaurants, Italian, Pizza]",Pittsburgh,True,en


In [162]:
# Add more stop words here
add_stop_words = ['las','la','tony','ive','say','really','eat','friend','im','cup',
                  'good','food','definitely','love','place','las_vegas','recommend',
                  'great','service','dish', 'cooky', 'guy', 'want', 'town', 'best',
                  'probably','like','morning','spot', 'wine', 'bar', 'chip', 'fish', 
                  'burrito', 'tortilla', 'favorite', 'absolutely', 'amaze', 'come', 
                  'try', 'menu', 'look', 'sure', 'lunch', 'chicken', 'pizza', 'crust', 
                  'cheese', 'topping', 'wing', 'sauce', 'salad', 'slice', 'bread', 
                  'garlic', 'fry', 'burger', 'coffee', 'egg', 'taco', 'salsa', 'mexican', 
                  'beer', 'night', 'breakfast', 'sandwich', 'sausage', 'potato', 'toast', 
                  'bacon', 'pretty', 'husband', 'enjoy', 'meal', 'highly', 'beautiful', 
                  'cocktail', 'happy', 'hour', 'tea', 'cool', 'special', 'sweet', 'shrimp', 
                  'rice', 'family', 'visit', 'home', 'know', 'brisket', 'margarita', 'cool', 
                  'thing', 'sushi', 'meat', 'tasty', 'chocolate', 'strawberry', 'cake', 
                  'waffle', 'butter', 'latte', 'weve', 'star', 'month', 'wife','way', 'bit',
                  'ice', 'cream', 'shop','week', 'couple', 'ill', 'make', 'drink', 'perfect',
                  'impressed', 'steak', 'dinner', 'lot', 'table', 'excellent', 'thank', 'day',
                  'usually', 'super', 'point', 'year', 'lettuce','bartender','bbq', 'hostess',
                  'sat','today']

# Join the stop words above to the standard English list
stop_words = stop_words.union(add_stop_words)

# Display the first five alphabetically
list(sorted(stop_words))[:5]

['a', 'about', 'above', 'absolutely', 'across']

**I set ngram_range to (1, 3) here instead of the default (1, 1).  This means the vectorizer will count all 1-, 2-, and 3-word combinations.  This vastly increases the feature space and takes longer to run, but it gives more insight in this particular case because some words belong together and take their meaning from context.**

**I'll now apply the vectorization and topic modeling to the subsets with the updated stop words and adjusted hyperparameters.  I found setting n_components to 5 in this case worked well.**
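
**To show what an n-gram range of 1 to 3 produces, here is a minimal pure-Python sketch.  It assumes, as I understand scikit-learn's word analyzer to do, that stop words are removed before the n-grams are built, which is why phrases can join words that were not adjacent in the original review.  'word_ngrams' is a hypothetical helper.**

```python
def word_ngrams(tokens, min_n, max_n, stop_words=frozenset()):
    """Build word n-grams for every n in [min_n, max_n] after stop-word removal."""
    # Stop words are dropped first, so neighbours of a removed word become adjacent
    kept = [tok for tok in tokens if tok not in stop_words]
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(kept) - n + 1):
            grams.append(" ".join(kept[start:start + n]))
    return grams


tokens = "the order be wrong".split()
print(word_ngrams(tokens, 1, 3, stop_words={"the", "be"}))
# ['order', 'wrong', 'order wrong']
```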

In [170]:
# Positive subset vectorization
pos_vectorizer = TfidfVectorizer(ngram_range = (1,3), stop_words = stop_words, min_df = .01)
pos_doc_word = pos_vectorizer.fit_transform(pos_review_df['text'])
pos_doc_word_df = pd.DataFrame(pos_doc_word.toarray(), columns = pos_vectorizer.get_feature_names())

In [172]:
# NMF model with 5 topics
pos_nmf = NMF(n_components = 5)
pos_doc_topic = pos_nmf.fit_transform(pos_doc_word)
pos_doc_topic_df = pd.DataFrame(pos_doc_topic.round(5),
                                index = pos_review_df.text.apply(lambda x: x[:100]),
                                columns = range(5))

In [176]:
# Negative subset vectorization
neg_vectorizer = TfidfVectorizer(ngram_range = (1,3), stop_words = stop_words, min_df = .01)
neg_doc_word = neg_vectorizer.fit_transform(neg_review_df['text'])
neg_doc_word_df = pd.DataFrame(neg_doc_word.toarray(), columns = neg_vectorizer.get_feature_names())

In [178]:
# NMF model with 5 topics
neg_nmf = NMF(n_components = 5)
neg_doc_topic = neg_nmf.fit_transform(neg_doc_word)
neg_doc_topic_df = pd.DataFrame(neg_doc_topic.round(5),
                                index = neg_review_df.text.apply(lambda x: x[:100]),
                                columns = range(5))

In [174]:
display_topics(pos_nmf, pos_vectorizer.get_feature_names(), 15)


Topic:  0
restaurant, nice, price, little, taste, flavor, fresh, people, small, area, experience, portion, right, worth, serve

Topic:  1
staff, friendly, staff friendly, friendly staff, atmosphere, clean, helpful, awesome, nice, fast, location, attentive, quick, wait staff, owner

Topic:  2
time, long, wait, awesome, server, long time, experience, second, fun, busy, twice, stop, disappointed, customer, people

Topic:  3
order, wait, minute, delivery, hot, spicy, roll, fast, line, thai, beef, ask, arrive, decide, long

Topic:  4
delicious, fresh, soup, dessert, yummy, wonderful, roll, spicy, fantastic, perfectly, flavorful, authentic, flavor, homemade, hot


In [179]:
display_topics(neg_nmf, neg_vectorizer.get_feature_names(), 15)


Topic:  0
ask, customer, bad, manager, restaurant, server, rude, experience, told, walk, waitress, people, work, staff, need

Topic:  1
taste, price, buffet, quality, flavor, restaurant, bland, dry, bad, small, fresh, ok, little, disappointed, nice

Topic:  2
order, wrong, delivery, order order, ask, order wrong, deliver, receive, time order, online, pick, told, min, mess, piece

Topic:  3
wait, minute, wait minute, seat, long, line, finally, people, minute wait, min, walk, busy, minute order, arrive, server

Topic:  4
time, location, waste, drive, time time, long, waste time, second, second time, close, slow, long time, time order, open, money


**You can see here that the clusters have clear topics.  I'll now label the topics by creating a list of topic names and passing it as an argument to the display_topics() function.**

In [189]:
pos_topic_names = ['Experience','Staff','Wait Time','Time','Atmosphere']

In [190]:
neg_topic_names = ['Staff','Food Quality','Order Accuracy','Wait Time','Order Time']

In [193]:
display_topics(pos_nmf, pos_vectorizer.get_feature_names(), 15, topic_names = pos_topic_names)


 1 - Experience 

restaurant, nice, price, little, taste, flavor, fresh, people, small, area, experience, portion, right, worth, serve

 2 - Staff 

staff, friendly, staff friendly, friendly staff, atmosphere, clean, helpful, awesome, nice, fast, location, attentive, quick, wait staff, owner

 3 - Wait Time 

time, long, wait, awesome, server, long time, experience, second, fun, busy, twice, stop, disappointed, customer, people

 4 - Time 

order, wait, minute, delivery, hot, spicy, roll, fast, line, thai, beef, ask, arrive, decide, long

 5 - Atmosphere 

delicious, fresh, soup, dessert, yummy, wonderful, roll, spicy, fantastic, perfectly, flavorful, authentic, flavor, homemade, hot


In [192]:
display_topics(neg_nmf, neg_vectorizer.get_feature_names(), 15, topic_names = neg_topic_names)


 1 - Staff 

ask, customer, bad, manager, restaurant, server, rude, experience, told, walk, waitress, people, work, staff, need

 2 - Food Quality 

taste, price, buffet, quality, flavor, restaurant, bland, dry, bad, small, fresh, ok, little, disappointed, nice

 3 - Order Accuracy 

order, wrong, delivery, order order, ask, order wrong, deliver, receive, time order, online, pick, told, min, mess, piece

 4 - Wait Time 

wait, minute, wait minute, seat, long, line, finally, people, minute wait, min, walk, busy, minute order, arrive, server

 5 - Order Time 

time, location, waste, drive, time time, long, waste time, second, second time, close, slow, long time, time order, open, money


**From the model results above, it becomes apparent that staff attitude, food quality, and wait time are big drivers of negative sentiment.  Price, food quality, and friendly, competent staff seem to drive the ratings up.**

**Since this is such a large dataset, there is a lot more to explore.  For example, this subset mostly covers Las Vegas.  It would be very interesting to see how the metrics differ from city to city.**