## Motivation for this Notebook

If I were a business owner, I would want to know how my customers generally feel. After reading a few reviews you start to pick up on trends, but who has the time to go through every comment to get the full picture of what people are saying about the company? Luckily, NLP and machine learning algorithms can do this compiling and grouping for us. Here I try to get a better look at the 'average' reviews for a particular business, and at what is being said in them, by clustering the reviews with k-means.

In [1]:
!pip install nltk
!pip install textblob
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import word_tokenize
from sklearn.cluster import KMeans
import textblob as tb


In [2]:
review_df = pd.read_json('yelp_academic_dataset_review.json', lines = True)
business_df = pd.read_json('yelp_academic_dataset_business.json', lines = True)

In [3]:
name_df = business_df[['business_id', 'name']]

In [4]:
review_df = pd.merge(review_df, name_df, how = 'left', on = 'business_id')
review_df.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,name
0,xQY8N_XvtGbearJ5X4QryQ,OwjRMXRC0KyPrIlcjaXeFQ,-MhfebM0QIsKt87iDN-FNw,2,5,0,0,"As someone who has worked with many museums, I...",2015-04-15 05:21:16,Bellagio Gallery of Fine Art
1,UmFMZ8PyXZTY2QcwzsfQYA,nIJD_7ZXHq-FX8byPMOkMQ,lbrU8StCq3yDfr-QMnGrmQ,1,1,1,0,I am actually horrified this place is still in...,2013-12-07 03:16:52,Rio Hair Salon
2,LG2ZaYiOgpr2DK_90pYjNw,V34qejxNsCbcgD8C0HVk-Q,HQl28KMwrEKHqhFrrDqVNQ,5,1,0,0,I love Deagan's. I do. I really do. The atmosp...,2015-12-05 03:18:11,Deagan's Kitchen & Bar
3,i6g_oA9Yf9Y31qt0wibXpw,ofKDkJKXSKZXu5xJNGiiBQ,5JxlZaqCnk1MnbgRirs40Q,1,0,0,0,"Dismal, lukewarm, defrosted-tasting ""TexMex"" g...",2011-05-27 05:30:52,Cabo Mexican Restaurant
4,6TdNDKywdbjoTkizeMce8A,UgMW8bLE0QMJDCkQ1Ax5Mg,IS4cv902ykd8wj1TR0N3-A,4,0,0,0,"Oh happy day, finally have a Canes near my cas...",2017-01-14 21:56:57,Raising Cane's Chicken Fingers


I want to stem the words so that different forms of the same word are counted together instead of showing up as separate features. I also tokenize the text with a regular expression so that we keep only words, including those with apostrophes. The `tokenize` function below is passed as the `tokenizer` argument of TfidfVectorizer, which overrides its default tokenization and adds the stemming step.
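
To see which tokens the regular expression keeps, the same pattern can be run with Python's built-in `re` module. This is only a sketch: `re.findall` stands in for NLTK's `RegexpTokenizer`, the stemming step is left out, and the helper name `simple_tokenize` and the sample sentence are assumptions made for illustration.

```python
import re

# Same character class as the RegexpTokenizer pattern: letters and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-zA-Z']+")

def simple_tokenize(text):
    """Hypothetical stand-in for tokenize(): lowercases and splits, no stemming."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = simple_tokenize("I love Deagan's. I do. 5 stars!")
print(tokens)  # ['i', 'love', "deagan's", 'i', 'do', 'stars']
```

Digits and punctuation are dropped, while the apostrophe in "deagan's" survives because it is part of the character class.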

In [10]:
snowball = SnowballStemmer('english')
tokenizer = RegexpTokenizer(r'[a-zA-Z\']+')

def tokenize(text):
    return [snowball.stem(word) for word in tokenizer.tokenize(text.lower())]

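def vectorize_reviews2(reviews):
    # TF-IDF over stemmed 1- to 3-grams; drop terms that appear in fewer than 0.25%
    # or more than 5% of the reviews, and keep at most 1000 features.
    vectorizer = TfidfVectorizer(stop_words = 'english', tokenizer = tokenize,
                        min_df = 0.0025, max_df = 0.05, max_features = 1000, ngram_range = (1, 3))
    X = vectorizer.fit_transform(reviews)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    words = vectorizer.get_feature_names_out()
    return X, words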

def print_clusters2(company_id, K = 8, num_words = 10):
    company_df = review_df[review_df['business_id'] == company_id]
    company_name = company_df['name'].unique()[0]
    reviews = company_df['text'].values
    X, words = vectorize_reviews2(reviews)
    
    kmeans = KMeans(n_clusters = K)
    kmeans.fit(X)
    
    common_words = kmeans.cluster_centers_.argsort()[:,-1:-num_words-1:-1]
    print(f'Groups of {num_words} words typically used together in reviews for {company_name}')
    for num, centroid in enumerate(common_words):
        print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))

def calc_polarity(text):
    blob = tb.TextBlob(text)
    return blob.sentiment.polarity

def calc_subjectivity(text):
    blob = tb.TextBlob(text)
    return blob.sentiment.subjectivity

def get_pol_sub(company_id):
    # .copy() so the new columns are added to an independent frame, not a slice of review_df
    company_df = review_df[review_df['business_id'] == company_id].copy()
    company_name = company_df['name'].unique()[0]
    company_df['polarity'] = company_df['text'].apply(calc_polarity)
    company_df['subjectivity'] = company_df['text'].apply(calc_subjectivity)
    
    print('Company: ' + company_name + '\nMean Polarity: ' + str(company_df['polarity'].mean())
          + '\nMean Subjectivity: ' + str(company_df['subjectivity'].mean()))
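
The slice `argsort()[:, -1:-num_words-1:-1]` in `print_clusters2` is what picks the highest-weighted terms of each centroid: `argsort` orders the term indices from lowest to highest weight, and the negative-step slice reads the last `num_words` of them backwards. Here is a minimal sketch of that idea in plain Python on a single centroid row. The weights and vocabulary are made up for illustration, and `sorted` stands in for NumPy's `argsort`.

```python
# Hypothetical centroid weights and vocabulary, for illustration only.
centroid = [0.10, 0.40, 0.05, 0.30]
words = ['food', 'service', 'wait', 'price']
num_words = 2

# Equivalent of argsort(): indices ordered from lowest to highest weight.
order = sorted(range(len(centroid)), key=lambda i: centroid[i])

# Same slice as in print_clusters2: the last num_words indices, highest first.
top = order[-1:-num_words - 1:-1]
top_words = [words[i] for i in top]
print(top_words)  # ['service', 'price']
```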

In [3]:
for bus_id in business_df['business_id']:
    print_clusters2(bus_id, K = 3, num_words = 20)
    get_pol_sub(bus_id)

NameError: name 'business_df' is not defined