“Yelp Inc. (NYSE: YELP) connects people with great local businesses. Yelp was founded in San Francisco in July 2004. Since then, Yelp communities have taken root in major metros across 32 countries. By the end of Q4 2018, Yelpers had written approximately 177 million rich, local reviews, making Yelp the leading local guide for real word-of-mouth on everything from boutiques and mechanics to restaurants and dentists. Approximately 33 million unique devices* accessed Yelp via the Yelp app, approximately 69 million unique visitors visited Yelp via mobile web** and approximately 62 million unique visitors visited Yelp via desktop*** on a monthly average basis during Q4 2018.”

Yelp is a website that lets users leave recommendations for various types of businesses. Yelp also acts as a social platform: users are able to have ‘friends’ and rate their friends' reviews. 
Each year, Yelp provides a dataset that includes information on local businesses in 10 metropolitan areas across two countries, with the aim of having students research and analyze this data and share their discoveries.

Yelp reviewers leave "stars" for the businesses that they are reviewing, and each business has an aggregate number of "stars" indicating whether it is rated favorably or unfavorably.

Utilizing the Yelp dataset, the objective of our project is twofold. First, we will look at the social influence that Yelp Elite users have within the Yelp network by focusing on the links in the bipartite network. Do businesses where Elite users have left recommendations often have more recommendations than other businesses?  

Second, we will perform a sentiment analysis of user reviews from the Yelp dataset. We will attempt to find out whether Elite users tend to leave more positive or negative reviews and whether, based on those reviews, the business then receives more or fewer stars. 

Overall, this analysis will focus on the influence that Elite Yelp users have on business ratings.


In [1]:
#Packages needed to run program
import requests
import re
import pandas as pd
import json
from pandas.io.json import json_normalize
import numpy as np
from collections import Counter 

# graph viz
import plotly
import plotly.offline as pyo
from plotly.graph_objs import *
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns 
import matplotlib.gridspec as gridspec 
from wordcloud import *

#graph section
import networkx as nx

#natural language section
import string
import nltk
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from textblob import TextBlob


#machine learning
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer  
from sklearn.svm import LinearSVC
import ast


%matplotlib inline

First, we load each of the datasets from their JSON format to a pandas dataframe. It's important to look at each dataframe and figure out what values are in each column.
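
The Yelp files store one JSON object per line, which is why each line is parsed separately. As a minimal sketch of that loading pattern, the helper below reads such a file with a context manager so the file handle is closed afterwards. The helper name `load_json_lines` and the two sample records are made up for illustration and are not part of the Yelp dataset.

```python
import json
import os
import tempfile

def load_json_lines(path):
    """Read a JSON Lines file (one JSON object per line) into a list of dicts."""
    records = []
    with open(path, 'r', encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records

# Tiny stand-in file; both records are hypothetical.
sample = '{"business_id": "a1", "stars": 4.0}\n{"business_id": "b2", "stars": 2.5}\n'
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False, encoding='utf-8') as tmp:
    tmp.write(sample)
    sample_path = tmp.name

records = load_json_lines(sample_path)
os.remove(sample_path)
print(len(records), records[0]['business_id'])
```

The resulting list can be passed to `pd.DataFrame(...)` as in the cells that follow; `pd.read_json(path, lines=True)` is another option for the same file layout.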

In [2]:
tips = []
for line in open('/Users/ntlrsmllghn/Dropbox/Data/Data 620/Final/yelp_dataset/tip.json', 'r'):
    tips.append(json.loads(line))

tips_df = pd.DataFrame(tips)
tips_df.head()

Unnamed: 0,business_id,compliment_count,date,text,user_id
0,VaKXUpmWTTWDKbpJ3aQdMw,0,2014-03-27 03:51:24,"Great for watching games, ufc, and whatever el...",UPw5DWs_b-e2JRBS-t37Ag
1,OPiPeoJiv92rENwbq76orA,0,2013-05-25 06:00:56,Happy Hour 2-4 daily with 1/2 price drinks and...,Ocha4kZBHb4JK0lOWvE0sg
2,5KheTjYPu1HcQzQFtm4_vw,0,2011-12-26 01:46:17,Good chips and salsa. Loud at times. Good serv...,jRyO2V1pA4CdVVqCIOPc1Q
3,TkoyGi8J7YFjA6SbaRzrxg,0,2014-03-23 21:32:49,The setting and decoration here is amazing. Co...,FuTJWFYm4UKqewaosss1KA
4,AkL6Ous6A1atZejfZXn1Bg,0,2012-10-06 00:19:27,Molly is definately taking a picture with Sant...,LUlKtaM3nXd-E4N4uOk_fQ


In [3]:
tips_df.shape

(1223094, 5)

In [4]:
business = []
for line in open('/Users/ntlrsmllghn/Dropbox/Data/Data 620/Final/yelp_dataset/business.json', 'r'):
    business.append(json.loads(line))
business_df = pd.DataFrame(business)
business_df.head()

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,postal_code,review_count,stars,state
0,2818 E Camino Acequia Drive,{'GoodForKids': 'False'},1SWheh84yJXfytovILXOAQ,"Golf, Active Life",Phoenix,,0,33.522143,-112.018481,Arizona Biltmore Golf Club,85016,5,3.0,AZ
1,30 Eglinton Avenue W,"{'RestaurantsReservations': 'True', 'GoodForMe...",QXAEGFB4oINsVuTFxEYKFQ,"Specialty Food, Restaurants, Dim Sum, Imported...",Mississauga,"{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...",1,43.605499,-79.652289,Emerald Chinese Restaurant,L5R 3E7,128,2.5,ON
2,"10110 Johnston Rd, Ste 15","{'GoodForKids': 'True', 'NoiseLevel': 'u'avera...",gnKjwL_1w79qoiV3IC_xQQ,"Sushi Bars, Restaurants, Japanese",Charlotte,"{'Monday': '17:30-21:30', 'Wednesday': '17:30-...",1,35.092564,-80.859132,Musashi Japanese Restaurant,28210,170,4.0,NC
3,"15655 W Roosevelt St, Ste 237",,xvX2CttrVhyG2z1dFg_0xw,"Insurance, Financial Services",Goodyear,"{'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ...",1,33.455613,-112.395596,Farmers Insurance - Paul Lorenz,85338,3,5.0,AZ
4,"4209 Stuart Andrew Blvd, Ste F","{'BusinessAcceptsBitcoin': 'False', 'ByAppoint...",HhyxOkGAM07SRYtlQ4wMFQ,"Plumbing, Shopping, Local Services, Home Servi...",Charlotte,"{'Monday': '7:0-23:0', 'Tuesday': '7:0-23:0', ...",1,35.190012,-80.887223,Queen City Plumbing,28217,4,4.0,NC


In [None]:
business_df.shape

(192609, 14)

In [None]:
review = []
for line in open('/Users/ntlrsmllghn/Dropbox/Data/Data 620/Final/yelp_dataset/review.json', 'r'):
    review.append(json.loads(line))
review_df = pd.DataFrame(review)
review_df.head()

In [None]:
review_df.shape

In [None]:
user = []
for line in open('/Users/ntlrsmllghn/Dropbox/Data/Data 620/Final/yelp_dataset/user.json', 'r'):
    user.append(json.loads(line))
user_df = pd.DataFrame(user)
user_df.head()

In [None]:
user_df.shape

In order to see the review for each business, we must combine the business dataframe and review dataframe. The common column that they both share is the `business_id` column. 

In [None]:
business_review_df = pd.merge(business_df, review_df, how='inner', on='business_id')

In [None]:
business_review_df.head()

The number of rows in this dataframe is equal to the length of the review dataframe, so our merge was successful.

In [None]:
business_review_df.shape

In [None]:
business_review_df.columns

Since some of the columns shared the same names, pandas appended `_x` and `_y` suffixes to them during the merge, so we had to rename those columns.
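
To see where names such as `stars_x` and `stars_y` come from, here is a small sketch with two hypothetical miniature tables that share a key column and one other column name. The frames and their values are invented; only the suffix behavior of `pd.merge` is the point.

```python
import pandas as pd

# Hypothetical miniature tables that share 'business_id' and 'stars'.
left = pd.DataFrame({'business_id': ['a1', 'b2'], 'stars': [4.0, 2.5]})
right = pd.DataFrame({'business_id': ['a1', 'a1', 'b2'], 'stars': [5, 3, 2]})

# Default behavior: overlapping non-key columns get '_x' (left) and '_y' (right).
merged = pd.merge(left, right, how='inner', on='business_id')
print(list(merged.columns))

# The suffixes can also be chosen up front, which avoids a later rename.
named = pd.merge(left, right, how='inner', on='business_id',
                 suffixes=('_business', '_review'))
print(list(named.columns))
```

With an inner merge on a key that is unique on the left side, the result has one row per matching right-hand row, which is why the merged row count is compared with the review table.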

In [None]:
business_review_df.rename(columns={'name_x': 'business_name', 'stars_x':'average_stars','stars_y': 'reviewer_star'}, inplace=True)

In [None]:
business_review_df.head()

In order to find out more about the elite user reviews, we have to merge a third dataframe, the user dataframe, with the combined business and review dataframe. The common column that these dataframes share is `user_id`.

In [None]:
master_df = pd.merge(business_review_df, user_df, how='inner', on='user_id')

In [None]:
master_df.columns

There are a number of columns within this dataframe that are not useful for our analysis, so we will drop them.

In [None]:
master_df = master_df.drop(columns=['compliment_cool', 'compliment_funny', 'compliment_hot', 'compliment_list', 'compliment_more',
                        'compliment_note', 'compliment_photos', 'compliment_cute', 'compliment_plain', 'compliment_profile',
                       'compliment_writer'])

As before, there are columns with the same name, so we will rename these columns.

In [None]:
master_df.rename(columns={'review_count_x': 'business_review_cnt', 'average_stars_x':'business_avg_stars', 'cool_x':'cool_business_review',
                          'funny_x': 'funny_business_review', 'useful_x':'useful_business_review', 'name_y':'user_name', 
                          'average_stars_y': 'user_avg_stars', 'cool_y':'cool_user_reviews', 'funny_y':'funny_user_reviews',
                          'review_count_y':'user_review_cnt', 'useful_y':'useful_user_reviews'}, inplace=True)

Checking the shape of the new master dataframe, we find that this dataframe is the same length as the `business_review_df` dataframe.

In [None]:
master_df.shape

In [None]:
master_df.columns

Now that we have loaded all the data, let's do an exploratory analysis of our data.

First, what types of businesses are reviewed on Yelp?

In [None]:
category_list = set()
for business in business_df['categories'][business_df['categories'].notnull()].str.split(','):
    business = [x.strip(' ') for x in business]
    category_list = set().union(business, category_list)
category_list = list(category_list)

In [None]:
category_count = []
for cat in category_list:
    # regex=False: names such as "American (New)" contain regex metacharacters
    category_count.append([cat,business_df['categories'].str.contains(cat, regex=False).sum()])

names = ['category_name','category_count']
category_df = pd.DataFrame(data=category_count, columns=names)
category_df.sort_values("category_count", inplace=True, ascending=False)
category_df.head(10)
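
One thing to keep in mind with the counts above is that `str.contains` matches substrings, so a short category name can also be counted inside longer ones. The sketch below contrasts exact category counting with substring counting using `collections.Counter`; the category strings are hypothetical examples written in the same comma-separated style as the `categories` column.

```python
from collections import Counter

# Hypothetical category strings, one per business (None = no categories).
category_strings = [
    "Restaurants, Fast Food",
    "Food, Grocery",
    "Restaurants, Food Trucks, Food",
    None,
]

exact_counts = Counter()
for entry in category_strings:
    if entry is None:
        continue
    # Count each category once per business, after trimming whitespace.
    exact_counts.update({part.strip() for part in entry.split(',')})

# Substring matching also counts "Fast Food" and "Food Trucks" as "Food".
substring_food = sum(1 for entry in category_strings if entry and "Food" in entry)

print(exact_counts["Food"], substring_food)
```

Here the exact count for "Food" is lower than the substring count, so category totals obtained by substring matching are best read as upper bounds.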

In [None]:
plt.subplots(figsize=(8, 8))
labels=category_df['category_name'][category_df['category_count']>10000]
category_df['category_count'][category_df['category_count']>10000].plot.bar( align='center', alpha=0.5)
y_pos = np.arange(len(labels))
#plt.yticks(y_pos, labels)
plt.xticks(y_pos, labels)
plt.xlabel('Business Categories')
plt.ylabel('Categories Count')

plt.show()

Restaurants, Food, and Shopping are the top three categories reviewed on Yelp. 

Which cities are featured in this data?

In [None]:
x=business_df['city'].value_counts()
x=x.sort_values(ascending=False)
x=x.iloc[0:20]
plt.figure(figsize=(16,4))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title("Which city has the most businesses?")
locs, labels = plt.xticks()
plt.setp(labels, rotation=45)
plt.ylabel('# businesses', fontsize=12)
plt.xlabel('City', fontsize=12)

#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')

plt.show()

In [None]:
top_cities = business_df.city.value_counts()
top_cities.head(20)

Within this dataset, Las Vegas, Toronto, and Phoenix have the most businesses. 

It would be interesting to see the average number of stars for businesses in these cities.

In [None]:
city_business_reviews = business_df[['city', 'review_count', 'stars']].groupby(['city']).\
agg({'review_count': 'sum', 'stars': 'mean'}).sort_values(by='review_count', ascending=False)
city_business_reviews.head(10)

Which businesses have the most reviews? Let's take a look at the top 25 most reviewed businesses.

In [None]:
business_df[['name', 'review_count', 'city', 'stars']].sort_values(ascending=False, by="review_count")[0:25]

All of these businesses are located in Las Vegas. This is likely because there are more reviews in the dataset for businesses in Las Vegas.

Let's see how stars are distributed within this dataset.

In [None]:
x=business_df['stars'].value_counts()
x=x.sort_index()
#plot
plt.figure(figsize=(8,4))
ax= sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title("Star Rating Distribution")
plt.ylabel('# of businesses', fontsize=12)
plt.xlabel('Star Ratings ', fontsize=12)

#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')

plt.show()

In [None]:
business_df['stars'].value_counts()

Ratings of 3.5 to 5 stars make up most of the distribution. Let's see whether that means there will be more positive sentiment than negative.

Is there any correlation between the number of stars and the length of the review? Or how useful/cool/or funny the reviews are?

In [None]:
review_df['text length'] = review_df['text'].apply(len)
g = sns.FacetGrid(data=review_df, col='stars')
g.map(plt.hist, 'text length', bins=50)

In [None]:
stars = review_df.groupby('stars').mean(numeric_only=True)  # average only the numeric columns
stars

In [None]:
stars.corr()

From the table of means, it looks like the 1-star and 2-star ratings have much longer text, so maybe text length won’t be such a useful feature to consider after all.

Looking at the matrix, funny is strongly correlated with useful, and useful seems strongly correlated with text length. We can also see a negative correlation between cool and the other three features. Maybe funny reviews are longer than useful and cool reviews.

Among the 20 most-reviewed businesses in the data, which have the highest average rating?

In [None]:
review_df['name'] = review_df['business_id'].map(business_df.set_index('business_id')['name'])
top_rated = review_df.name.value_counts().index[:20].tolist()
top_rated

In [None]:
df_review_top = review_df.loc[review_df['name'].isin(top_rated)]
df_review_top.groupby(df_review_top.name)['stars'].mean().sort_values(ascending=True).plot(kind='barh',figsize=(12, 10))
plt.yticks(fontsize=12)
plt.title('Top rated businesses on Yelp',fontsize=12)
plt.ylabel('Business names', fontsize=12)
plt.xlabel('Ratings', fontsize=12) 
plt.show()

Now that we have a sense of what the businesses, reviews, and distribution of stars look like, we want to start looking at the social network of Yelp reviews and businesses.


<b>Social Network Analysis of Yelp users</b>

We used the user dataframe, more specifically `user_id` and `friends`, to define the nodes and the edges of the network.

In [None]:
user_df.describe()

In [None]:
print(user_df.dtypes)

In [None]:
#convert string to timestamp
user_df['joined']= pd.to_datetime(user_df['yelping_since'])
#group by year and count occurrences
yearGrouping =user_df.groupby(user_df['joined'].map(lambda x : x.year))['yelping_since'].count()
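# assumption: 'friends' is a comma-separated string of user ids ('None' when empty), so count its entries
user_df['number of Friends'] = user_df['friends'].apply(lambda f: 0 if pd.isnull(f) or f in ('None', '') else len(str(f).split(',')))
user_df['number of Friends']=user_df['number of Friends'].astype(np.int64)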
user_df['elite']=pd.to_numeric(user_df['elite'],errors='coerce').fillna(0)
user_df['elite']=user_df['elite'].astype(np.int64)                                                                   
#user_df['target'] = user_df['elite']!='[]'
print(user_df.dtypes)

In [None]:
plt.figure()
plt.plot(yearGrouping)
plt.figure()
plt.scatter(user_df['average_stars'],user_df['review_count'])

<b>Looking at the entire user data</b>

1. From the first figure above we can see that users started joining around 2006, that sign-ups peaked somewhere between 2014 and 2016, and that they are currently declining.
2. The second figure, a scatter plot, depicts the review count against the average number of stars. We can see that users whose average is somewhere between 3.5 and 4 stars leave more reviews.


<b>Creating a subset of users</b>

Graphing the entire data would require a lot of memory and would look very messy, so we work with a subset of users.

In [None]:
#subset users who have at least 1 friend
subset_users=user_df[user_df['friends']!='None']
#user has given at least 10 reviews
subset_users=subset_users[subset_users['review_count']>=10]
#subset_users=subset_users.sort_values('review_count',ascending=False)

#split the friends string into a list, trimming the whitespace around each id
subset_users['list_friends']=subset_users["friends"].apply(lambda x: [f.strip() for f in str(x).split(',')])

subset_users=subset_users[['user_id','list_friends']]
#stopping at 6k due to space constraints
subset_users=subset_users.iloc[0:6000]
res = subset_users.set_index(['user_id'])['list_friends'].apply(pd.Series).stack()

In [None]:
network_data=res.reset_index()
#checking the dataframe
network_data.head()

In [None]:
#changing the column name to suit nx import
network_data.columns=['source','level_1','target']

# Considering each (user_id,friend) pair as an edge of a graph, constructing the graph
graph=nx.from_pandas_edgelist(network_data)

In [None]:
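# nx.info() was removed in NetworkX 3.0, so report the node and edge counts directly
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")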
#check density
print("The density of the graph is ",nx.density(graph))

<b>Network Density</b>

"Network density is a measure of the proportion of possible ties which are actualized among the members of a network. Dense social networks, especially coupled with strong boundaries segregating the group from others, can enforce communal norms so that social pressures for conformity can inhibit creativity, which necessarily contains an element of deviance. Small dense networks may develop ‘groupthink’ where conformity of ideas is highly valued and normatively enforced. This inhibits creativity within the group."--Katherine Giuffre, Cultural Productons in Networks


We can see that the density of the graph is not very high, but again, we are not portraying the entire network due to memory constraints. "From an academic perspective graph density would be defined as the ratio of the number of edges and the number of possible edges." For the purposes of our project, it would be interesting to see who the most influential Yelp users are and to draw a graph that depicts all the connections.
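
As a small worked example of that ratio, the sketch below computes the density of an undirected graph by hand: the number of edges divided by the n(n-1)/2 possible edges. The function name and the four-person friendship list are hypothetical; for undirected graphs this is the same quantity that `nx.density` reports.

```python
def undirected_density(num_nodes, num_edges):
    """Density of an undirected graph: edges present / edges possible."""
    if num_nodes < 2:
        return 0.0
    possible_edges = num_nodes * (num_nodes - 1) / 2
    return num_edges / possible_edges

# Hypothetical friendship edges among four users.
edges = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}
nodes = {user for edge in edges for user in edge}

density = undirected_density(len(nodes), len(edges))
print(density)
```

Four users allow six possible friendships, so three actual friendships give a density of one half. Because the number of possible edges grows quadratically with the number of users, large friendship networks tend to have very low density.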

In [None]:
#lets take a single town's population and make a graph out of those users
# since we dont have people and location together
# Mapping businesses of a location to reviews and then to users and then finding their friends network

subset=business_df[business_df.city=='Cleveland']
subset=pd.merge(subset,review_df,how='inner',on='business_id')
subset_users=subset.user_id.unique()

subset_users=pd.DataFrame(subset_users,columns=['user_id'])
subset_users=pd.merge(subset_users,user_df,how='inner',on='user_id')

# create friend list
subset_users['list_friends']=subset_users["friends"].apply(lambda x: [f.strip() for f in str(x).split(',')])  # trim whitespace around each id
subset_users['count_friends']=subset_users["list_friends"].apply(lambda x: len(x))

#check
subset_users.shape

The dataframe subset_users reflects data from Cleveland only, and we will attempt to graph that subset.

In [None]:
subset_users_list=subset_users[['user_id','list_friends']]
network_data = subset_users_list.set_index(['user_id'])['list_friends'].apply(pd.Series).stack()
network_data=network_data.reset_index()
#changing the column name to suit nx import
network_data.columns=['source','level_1','target']

In [None]:
# Considering each (user_id,friend) pair as an edge of a graph, constructing the graph
graph=nx.from_pandas_edgelist(network_data)
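# nx.info() was removed in NetworkX 3.0, so report the node and edge counts directly
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")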

<b>The top/ most influential Yelp users</b>

We wanted to identify the most influential people, so we ranked users by degree centrality and used the heapq library to pick the 200 most connected ones.



In [None]:
#use degree-centrality to find out influencers in the selected region
import heapq  # for getting top n number of things from list,dict
x=nx.degree_centrality(graph)
#Creating a subset again as we cant handle 70k nodes, unfortunately.

#Using heapq to find the 200 most connected nodes (ie) people with the most connections
elite=heapq.nlargest(200, x, key=x.get)

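`heapq.nlargest(n, iterable, key=None)`: Return a list with the n largest elements from the dataset defined by iterable. key, if provided, specifies a function of one argument that is used to extract a comparison key from each element in iterable.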
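
The combination of degree centrality and `heapq.nlargest` can be illustrated on a toy network. In this sketch the friendship dictionary is hypothetical, and degree centrality is computed by hand as the number of neighbors divided by n - 1, which is the normalization `nx.degree_centrality` uses.

```python
import heapq

# Hypothetical friendship lists (undirected, so each tie appears on both sides).
friends = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}

n = len(friends)
# Degree centrality: share of the other n - 1 users that a user is connected to.
centrality = {user: len(neighbors) / (n - 1) for user, neighbors in friends.items()}

# Iterating a dict yields its keys; key=centrality.get ranks them by their score.
top_two = heapq.nlargest(2, centrality, key=centrality.get)
print(top_two)
```

Passing the dictionary itself together with `key=centrality.get` returns node labels rather than scores, which is what makes the result usable as a node list for `graph.subgraph`.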



In [None]:
elite_sub_graph=graph.subgraph(elite)

# Check for isolates ( nodes with no edges (ie) users without friends in the sub-graph)
# graph=graph.remove_nodes_from(nx.isolates(graph))
list_of_nodes_to_be_removed=[x for x in nx.isolates(elite_sub_graph)]

# remove the selected isolates from the main graph
graph.remove_nodes_from(list_of_nodes_to_be_removed)

In [None]:
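# nx.info() was removed in NetworkX 3.0, so report the node and edge counts directly
print(elite_sub_graph.number_of_nodes(), "nodes,", elite_sub_graph.number_of_edges(), "edges")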
#check density
print("The density of the graph is ",nx.density(elite_sub_graph))

In [None]:
# create the layout
pos = nx.spring_layout(elite_sub_graph)

In [None]:
plt.figure(figsize=(15,15))
plt.title("Cleveland's elite graph")
nx.draw(elite_sub_graph, pos=pos, node_size=0.05, width=1)

We can see from the graph that two users are the "central" nodes in this network. The Kamada-Kawai algorithm: "The system tries to find a balance between the “springs”; vertices whose edges have small desired distance tend to move in groups away from the more dissimilar vertices."

In [None]:

#creating a larger Kamada Kawai layout 
plt.figure(figsize=(16,16))
plt.title("Plot of Cleveland's community : Kamada Kawai layout")
pos2=nx.kamada_kawai_layout(elite_sub_graph)
nx.draw(elite_sub_graph, pos=pos2, node_size=0.7, width=0.1)  # use the Kamada Kawai positions computed above

Detecting communities among the elite users in Cleveland's subgraph

In [None]:
from networkx.algorithms import community  # provides girvan_newman
communities_generator = community.girvan_newman(elite_sub_graph)
top_level_communities = next(communities_generator)
next_level_communities = next(communities_generator)
print(len(top_level_communities))
#sorted(map(sorted, top_level_communities))

Three Communities have been identified.

In [None]:
print(type(top_level_communities))

In [None]:
plt.figure(figsize=(16,16))
plt.title("Cleveland's graph of the 200 most influential users" , fontsize=20)
nx.draw_networkx(elite_sub_graph, pos = pos2,cmap = plt.get_cmap("jet"), node_size = 0.9, with_labels = False)
plt.axis('off')  # hide the axes (assigning to plt.axes would overwrite the pyplot function)

Checking if the graph is connected

In [None]:
nx.is_connected(elite_sub_graph)

Now we can use the function betweenness_centrality() to compute the centrality of each node. This function returns a dictionary that maps each node label to its centrality value. We can use this information to trim the original network and keep only the most important nodes.

In [None]:
G=elite_sub_graph
def most_important(G):
    """Return a copy of G that keeps only the most important nodes,
    ranked by betweenness centrality."""
    ranking = nx.betweenness_centrality(G)  # dict: node -> centrality
    print(ranking)
    mean_centrality = sum(ranking.values()) / len(ranking)
    threshold = mean_centrality * 3  # keep only nodes with at least 3 times the mean
    Gt = G.copy()
    for node, value in ranking.items():
        if value < threshold:
            Gt.remove_node(node)
    return Gt

Gt = most_important(G) # trimming
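
To make the thresholding rule concrete, here is a sketch on a tiny star-shaped graph, where one hub sits between every pair of leaves. The graph is a stand-in built with `nx.star_graph`, not the Yelp network, and the variable names are illustrative.

```python
import networkx as nx

# Star graph: node 0 is the hub, nodes 1..5 are leaves connected only to the hub.
star = nx.star_graph(5)

# betweenness_centrality returns a dict mapping node -> normalized score.
scores = nx.betweenness_centrality(star)

# Same rule as in the trimming step: keep nodes scoring at least 3x the mean.
mean_score = sum(scores.values()) / len(scores)
threshold = 3 * mean_score
kept = [node for node, score in scores.items() if score >= threshold]

print(scores)
print(kept)
```

Only the hub survives the cut here, since every shortest path between two leaves passes through it while the leaves lie on no such paths.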

And we can use the original network and the trimmed one to visualize the network as follows

In [None]:
from pylab import show
# create the layout
pos = nx.spring_layout(G)
# draw the nodes and the edges (all)
nx.draw_networkx_nodes(G,pos,node_color='b',alpha=0.9,node_size=8)
nx.draw_networkx_edges(G,pos,alpha=0.1)

# draw the most important nodes with a different style
nx.draw_networkx_nodes(Gt,pos,node_color='r',alpha=0.4,node_size=254)
# also the labels this time
nx.draw_networkx_labels(Gt,pos,font_size=12,font_color='b')
show()

<b>Conclusion</b>

We can clearly see the 2 most influential users within this network. One of the main qualities of an elite user on Yelp is connectivity, and we were interested in seeing to what extent this quality is really prevalent in the network. Our results support that claim: the most influential users are highly connected to the rest of the users.

<b>Sentiment Analysis</b>

Now ...

Let's see if elite users are more likely to leave positive or negative reviews on businesses and if this seems to impact the number of stars that a business has.

First, we subset elite users from the master dataframe.

In [None]:
elite_user_df = master_df[master_df['elite'].str.contains("201", na = False)]

In [None]:
elite_user_df.head()

In [None]:
elite_user_df.columns

In [None]:
elite_user_df.shape

Now, we will subset `elite_user_df` by average business stars. If a business has fewer than 3 stars, we will assume that the business has negative reviews, and if the business has 3 stars or more, we will classify the business as having positive reviews.

In [None]:
x=elite_user_df['business_avg_stars'].value_counts()
x=x.sort_index()
#plot
plt.figure(figsize=(8,4))
ax= sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title("Star Rating Distribution for Businesses Reviewed by Elite Users")
plt.ylabel('# of reviews', fontsize=12)
plt.xlabel('Star Ratings ', fontsize=12)

#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')

plt.show()

In [None]:
# businesses below 3 stars count as negative, 3 stars and above as positive
negative_reviews = elite_user_df[elite_user_df["business_avg_stars"] < 3]
positive_reviews = elite_user_df[elite_user_df["business_avg_stars"] >= 3]

Since the dataset has 1559043 reviews, I'm going to reduce the dataset to 10000. This is for the sake of my computer's processing abilities.

In [None]:
elite_user_df = elite_user_df[:10000].copy()  # copy so new columns can be added without warnings

An initial step in text and sentiment classification is pre-processing. A number of techniques are applied to the data in order to reduce noise in the text, reduce dimensionality, and improve classification effectiveness. The most popular techniques include:

Remove numbers,
Stemming,
Part of speech tagging,
Remove punctuation,
Lowercase,
Remove stopwords

In [None]:
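stop_words = set(stopwords.words('english'))  # build the English stop-word set once

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())              # keep lowercase letters and whitespace only
    x = [w for w in x.split() if w not in stop_words]   # drop stop words
    return ' '.join(x)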

elite_user_df['processed_text'] = elite_user_df['text'].apply(preprocess)
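
The effect of this pre-processing step is easiest to see on a single sentence. The sketch below follows the same three steps (lowercase, strip everything except letters and whitespace, drop stop words) but uses a small hypothetical stop-word set so that it runs without the NLTK corpus; the function name and example review are also made up.

```python
import re

# Small hypothetical stop-word set; NLTK's English list is considerably longer.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "to", "of"}

def preprocess_sketch(text, stop_words=STOP_WORDS):
    lowered = text.lower()
    # Remove digits and punctuation, keeping only letters and whitespace.
    letters_only = re.sub(r'[^a-z\s]', '', lowered)
    tokens = [word for word in letters_only.split() if word not in stop_words]
    return ' '.join(tokens)

cleaned = preprocess_sketch("The tacos were AMAZING, and the service was 10/10!")
print(cleaned)
```

Note that the numeric rating "10/10" disappears entirely, which is one reason the star value is kept as a separate column rather than recovered from the text.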

Now, let's find out the sentiment of most user reviews

In [None]:
def sentiment(x):
    sentiment = TextBlob(x)
    return sentiment.sentiment.polarity

elite_user_df['text_sentiment'] = elite_user_df['processed_text'].apply(sentiment)

Finally, let's plot the user sentiment polarity to understand what type of reviews we're going to be training on and see if these positive or negative reviews have an impact on predicting the number of stars a business has. 

In [None]:
elite_user_df['sentiment'] = ''
# use .loc instead of chained indexing so the assignments reach the dataframe itself
elite_user_df.loc[elite_user_df['text_sentiment'] > 0, 'sentiment'] = 'positive'
elite_user_df.loc[elite_user_df['text_sentiment'] < 0, 'sentiment'] = 'negative'
elite_user_df.loc[elite_user_df['text_sentiment'] == 0, 'sentiment'] = 'neutral'

plt.figure(figsize=(6,6))
ax = sns.countplot(x=elite_user_df['sentiment'])
plt.title('Review Sentiments')
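
The labeling rule used for the plot can also be written as a single function of the polarity score. This is a minimal sketch: the function name and the example scores are hypothetical, and the only assumption about TextBlob is that polarity lies between -1 and 1.

```python
def label_polarity(score):
    """Map a polarity score to a sentiment label using zero as the cut-off."""
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'

# Hypothetical polarity scores in the [-1, 1] range.
example_scores = [0.6, -0.25, 0.0]
labels = [label_polarity(score) for score in example_scores]
print(labels)
```

A function like this could be applied with `Series.apply`. Because the cut-off is exactly zero, even a faintly positive score counts as positive, so the neutral group will contain only reviews whose polarity is exactly 0.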

To build a classification algorithm, we will need some sort of feature vector in order to perform the classification task. The simplest way to convert a corpus to a vector format is the bag-of-words approach, where each unique word in a text is represented by one number.
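
As a rough illustration of the bag-of-words idea, the sketch below builds count vectors by hand for two hypothetical mini-reviews. It is not how `CountVectorizer` is implemented, only the same idea in a few lines: one vector position per vocabulary word, holding that word's count in the document.

```python
from collections import Counter

# Two hypothetical, already-cleaned reviews.
docs = ["great food great service", "slow service"]

# The vocabulary is the sorted set of all words seen in the corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})

vectors = []
for doc in docs:
    counts = Counter(doc.split())
    # Counter returns 0 for words that do not occur in this document.
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Most entries are zero once the vocabulary grows to thousands of words, which is why the real document-term matrix is stored in sparse form.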

Let’s create a function that will split a review into individual words, return a list, and remove stop words (such as “the”, “a”, “an”, etc.). To do this, we can take advantage of the NLTK library. The function below removes punctuation, stopwords, and returns a list of the remaining words, or tokens.

First, let's get the average business stars

In [None]:
review_class = elite_user_df[elite_user_df['business_avg_stars'].isin([1, 2, 3, 4, 5])]
review_class.shape

Now, we will assign `X` and `Y` to the two variables we want to use in our sentiment analysis: the text of a review by an elite user and the average business stars.

In [None]:
X = review_class['text']
Y = review_class['business_avg_stars']

Creating a function for cleaning the data.

In [None]:
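english_stopwords = set(stopwords.words('english'))  # load the stop-word list once, not once per word

def clean_review(review):
    # strip punctuation characters
    clean_data = [char for char in review if char not in string.punctuation]
    clean_data = ''.join(clean_data)
    # split into tokens and drop English stop words
    return [word for word in clean_data.split() if word.lower() not in english_stopwords]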

Bag of Words to convert the text documents into a matrix of token counts

In [None]:
bag_of_word_transformer = CountVectorizer(analyzer=clean_review).fit(X)

We need to find out the shape of the sparse matrix, the number of non-zero occurrences, and the density of the non-zero values

In [None]:
X = bag_of_word_transformer.transform(X)
print('Shape of Sparse Matrix: ', X.shape)
print('Amount of Non-Zero occurrences: ', X.nnz)
# Percentage of non-zero values
density = (100.0 * X.nnz / (X.shape[0] * X.shape[1]))
print("Density: {}".format((density)))

Now we need a function to classify whether a review is positive or negative based on its star value

In [None]:
def classification(sent):
    # 2 stars or fewer count as negative, everything else as positive
    if sent <= 2:
        return "negative"
    else:
        return "positive"

Next, let's transform a count matrix to a normalized tf or tf-idf representation

Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.
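
A hand-computed example may help make the weighting scheme concrete. The sketch below uses the plain textbook form, term frequency times log(N / document frequency), on three hypothetical tokenized reviews. This is an assumption made for illustration: scikit-learn's `TfidfTransformer` by default uses a smoothed idf and L2-normalizes each row, so its numbers will differ.

```python
import math

# Three hypothetical tokenized reviews.
docs = [
    ["great", "food"],
    ["great", "service"],
    ["slow", "service", "slow"],
]

def tf_idf(term, doc, corpus):
    """Textbook tf-idf; assumes the term occurs in at least one document."""
    tf = doc.count(term) / len(doc)
    document_frequency = sum(1 for other in corpus if term in other)
    idf = math.log(len(corpus) / document_frequency)
    return tf * idf

slow_weight = tf_idf("slow", docs[2], docs)
great_weight = tf_idf("great", docs[0], docs)
print(slow_weight, great_weight)
```

"slow" appears twice in one review and nowhere else, so it gets a higher weight than "great", which is spread across two of the three reviews.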

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidfconverter = TfidfTransformer()  
X = tfidfconverter.fit_transform(X).toarray() 

Now, we are ready to split our dataset into train and test samples: 70% of the dataset is the training data and 30% of the dataset is the test data.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=101)

We will try a few different approaches to predicting, from the text of an elite user's review, what the average stars of the business will be. 

The first approach will be to use the Naive Bayes model.

In [None]:
from sklearn.naive_bayes import MultinomialNB
naiveBayes = MultinomialNB()
naiveBayes.fit(X_train, Y_train)

In [None]:
prediction = naiveBayes.predict(X_test)
print ("Prediction using Naive Bayes Model")
print(prediction)
ax = sns.countplot(x=prediction)
plt.title('Naive Bayes Model Review Sentiments')

We will create a list that keeps track of our f1 scores for each of the models that we run.

In [None]:
from sklearn.metrics import f1_score
f1_scr  = []
f1_scr.append(f1_score(Y_test,prediction,average='micro')*100)
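
For readers unfamiliar with the `average='micro'` setting, the sketch below computes a micro-averaged F1 score by hand: true positives, false positives, and false negatives are pooled over all classes before the score is formed. The function name and the label lists are hypothetical; for single-label multiclass predictions like these, the result coincides with plain accuracy.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, guess in zip(y_true, y_pred):
            if guess == label and truth == label:
                tp += 1
            elif guess == label:
                fp += 1
            elif truth == label:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical true and predicted star classes.
y_true = [5, 4, 4, 3, 1]
y_pred = [5, 4, 3, 3, 1]

score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, accuracy)
```

Because micro-averaging pools the counts, frequent star classes dominate the score; `average='macro'` or `average='weighted'` would weight the classes differently, so scores are only comparable across models when the same setting is used.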

Our next model will be a linear support vector classifier (Linear SVC).

In [None]:
svc = LinearSVC()
svc.fit(X_train, Y_train)

In [None]:
prediction = svc.predict(X_test)
print ("Prediction using Linear SVC model\n")
print(prediction)
ax = sns.countplot(x=prediction)
plt.title('Linear SVC model Review Sentiments')

Append our f1 score list

In [None]:
f1_scr.append(f1_score(Y_test,prediction,average='micro')*100)

Our third model will be a Logistic Regression model.

In [None]:
from sklearn.linear_model import LogisticRegression
LogReg = LogisticRegression()
LogReg.fit(X_train, Y_train)

In [None]:
prediction = LogReg.predict(X_test)

print ("Prediction using Logistic Regression model\n")
print(prediction)
ax = sns.countplot(x=prediction)
plt.title('Logistic Regression Prediction of Review Sentiments')

In [None]:
f1_scr.append(f1_score(Y_test,prediction,average='micro')*100)

Finally, let's try a Random Forest Classifier model to predict whether a review left by an elite user has an impact on the overall stars a business has.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=18)
rf.fit(X_train, Y_train)

In [None]:
prediction = rf.predict(X_test)
print ("Prediction using Random Forest Classifier model\n")
print(prediction)
ax = sns.countplot(x=prediction)
plt.title('Random Forest Classifier model Review Sentiments')

In [None]:
f1_scr.append(f1_score(Y_test,prediction,average='micro')*100)  # same averaging as the other models so scores are comparable

In [None]:
line1 = plt.plot (
          ['Naive Bayes','Linear SVC','LogisticRegression','RandomForest Classifier'],f1_scr ,'--o',alpha=0.7)
plt.title("Model Evaluation")
plt.ylabel('F1 score')
plt.xlabel('Models')
plt.show()

In [None]:
print(f1_scr)

Looking at the F1 score, we see that elite users sentiment 