## Motivation for this Notebook

If I were a business owner, I would want to know how my customers generally feel. After reading a few reviews you can start to pick up on trends, but who has the time to go through every comment to get the full picture of what people are saying about the company? Luckily, NLP and machine learning algorithms can do this compiling and grouping for us. Here I try to get a better look at the 'average' reviews for a particular business, and at what is being said in them, by clustering the review text with k-means.

In [1]:
!pip install nltk
!pip install textblob
import cProfile
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import word_tokenize
from sklearn.cluster import KMeans
import textblob as tb

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [2]:
def readDataset():
    return pd.read_json('yelp_academic_dataset_review.json', lines = True)

#cProfile.run('readDataset()')
review_df = readDataset()
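
The cell above loads every review in the file, while the stated goal is to look at one business. A minimal sketch of narrowing the data while reading, assuming each line of the file is a JSON object with a `business_id` field (the helper name `iter_business_reviews` and the sample records are made up for illustration):

```python
import json

def iter_business_reviews(lines, business_id):
    """Yield review dicts whose 'business_id' matches, one JSON object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        if review.get('business_id') == business_id:
            yield review

# Two made-up records standing in for lines of the review file
sample_lines = [
    '{"business_id": "abc", "stars": 5, "text": "Great tacos"}',
    '{"business_id": "xyz", "stars": 2, "text": "Slow service"}',
]
matches = list(iter_business_reviews(sample_lines, 'abc'))
print(len(matches), matches[0]['text'])
```

With the real file, passing an open file handle as `lines` and wrapping the result in `pd.DataFrame(...)` should give a much smaller `review_df`. Which business ID to filter on is left open here.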

In [3]:
snowball = SnowballStemmer('english')
tokenizer = RegexpTokenizer(r'[a-zA-Z\']+')

def tokenize(text):
    return [snowball.stem(word) for word in tokenizer.tokenize(text.lower())]

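def vectorize_reviews(reviews):
    # Keep terms found in 0.25%-5% of reviews, capped at the 1,000 most frequent.
    # Stemmed tokens do not all match the built-in stop-word list, so
    # scikit-learn warns that stop_words may be inconsistent with the tokenizer.
    vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize,
                                 min_df=0.0025, max_df=0.05,
                                 max_features=1000, ngram_range=(1, 3))
    X = vectorizer.fit_transform(reviews)
    # Needs scikit-learn >= 1.0; get_feature_names() was removed in 1.2
    words = vectorizer.get_feature_names_out()
    return X, words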

def print_clusters():
    num_words = 20
    X, words = vectorize_reviews(review_df['text'])
    
    kmeans = KMeans(n_clusters = 3)
    kmeans.fit(X)
    
    common_words = kmeans.cluster_centers_.argsort()[:,-1:-num_words-1:-1]
    for num, centroid in enumerate(common_words):
        print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))

def calc_polarity(text):
    blob = tb.TextBlob(text)
    return blob.sentiment.polarity

def calc_subjectivity(text):
    blob = tb.TextBlob(text)
    return blob.sentiment.subjectivity

def get_pol_sub():
    review_df['polarity'] = review_df['text'].apply(calc_polarity)
    review_df['subjectivity'] = review_df['text'].apply(calc_subjectivity)
    
    print('\nMean Polarity: ' + str(review_df['polarity'].mean())\
          + '\nMean Subjectivity: ' + str(review_df['subjectivity'].mean()))
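
The `min_df` and `max_df` arguments passed to `TfidfVectorizer` are easy to misread. When they are floats, they are fractions of documents, not raw counts. Here is a plain-Python sketch of that filtering step as I understand it. The function name and the toy documents are made up, and this is not scikit-learn's actual implementation:

```python
from collections import Counter

def filter_by_doc_frequency(tokenized_docs, min_df, max_df):
    """Keep terms whose share of documents lies in [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    doc_counts = Counter()
    for tokens in tokenized_docs:
        doc_counts.update(set(tokens))  # count each term once per document
    return sorted(term for term, count in doc_counts.items()
                  if min_df <= count / n_docs <= max_df)

# Four made-up, already-stemmed reviews
docs = [
    ['food', 'great', 'taco'],
    ['food', 'slow', 'servic'],
    ['food', 'great', 'price'],
    ['food', 'cold'],
]
kept = filter_by_doc_frequency(docs, min_df=0.5, max_df=0.75)
print(kept)  # ['great']
```

'food' appears in every document and is dropped as too common, while the one-off terms are dropped as too rare. In the notebook the same idea is applied with a 0.25% to 5% window.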

In [None]:
#print_clusters()
cProfile.run('print_clusters()')
#cProfile.run('get_pol_sub()')
get_pol_sub()

  'stop_words.' % sorted(inconsistent))
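
The slice `argsort()[:, -1:-num_words-1:-1]` in `print_clusters` is terse. For each centroid it picks the indices of the `num_words` largest weights, largest first. The same ranking can be sketched in plain Python (the helper `top_terms`, the vocabulary, and the weights below are invented for illustration, and tie order may differ from NumPy's):

```python
def top_terms(center_weights, vocabulary, num_words):
    """Return the num_words highest-weighted terms for one centroid."""
    ranked = sorted(range(len(center_weights)),
                    key=lambda i: center_weights[i], reverse=True)
    return [vocabulary[i] for i in ranked[:num_words]]

vocabulary = ['food', 'servic', 'wait', 'price']
# One row of made-up TF-IDF weights per cluster centre
centers = [
    [0.10, 0.50, 0.30, 0.05],
    [0.90, 0.20, 0.40, 0.60],
]
for num, weights in enumerate(centers):
    print(str(num) + ' : ' + ', '.join(top_terms(weights, vocabulary, 2)))
# 0 : servic, wait
# 1 : food, price
```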
