# Yelp Dataset Sentiment Analysis

#### Dataset - https://www.yelp.com/dataset/download

#### Source code - https://github.com/iwiszhou/ML1010-final-project

#### Group 10 - Haofeng Zhou - zhf85@my.yorku.ca

This notebook uses the Yelp data-set for sentiment analysis.
The goal is to build a model that predicts whether a review is positive or negative.
The data-set is large, so the first step is to extract the review data and create a simpler data-set
that contains only Review and Rating. After that, I add a new column, Class, which is either Positive or Negative.
If Rating is greater than 3, Class is Positive; otherwise it is Negative.
If time allows, I will introduce a third Class value, Neutral (when Rating is equal to 3).
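
The labeling rule described above can be sketched as a small function. This is only an illustration of the rule, not the code used later in the notebook: the name `label_from_stars` and the `with_neutral` flag are hypothetical, and string labels are used here purely for readability.

```python
def label_from_stars(stars, with_neutral=False):
    """Map a star rating (1-5) to a sentiment class.

    Hypothetical helper: a rating above 3 is Positive, anything else is
    Negative. With with_neutral=True, a rating of exactly 3 is Neutral.
    """
    if stars > 3:
        return "Positive"
    if with_neutral and stars == 3:
        return "Neutral"
    return "Negative"


# Show both labeling schemes side by side for a few ratings
for stars in (1, 3, 5):
    print(stars, label_from_stars(stars), label_from_stars(stars, with_neutral=True))
```

In the two-class scheme a 3-star review falls into Negative, which is worth keeping in mind when reading the results.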

In [1]:
# Import libraries
import pandas as pd
import json
import numpy as np
import re
import nltk
import sqlite3
import matplotlib.pyplot as plt
from pathlib import Path
import os
import spacy

In [2]:
# Download stopwords if not existing
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/iwiszhou/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Show full column contents; None disables truncation (-1 is deprecated in recent pandas)
pd.set_option('display.max_colwidth', None)

In [4]:
# Database name & tables' name
db_name = "yelp.db"
table_names = {
    "reviews": "reviews",
    "clean_reviews": "clean_reviews"
}

In [18]:
# Helper functions

def get_absolute_path(file_name):
    # Resolve file_name relative to the current working directory
    return os.path.abspath(file_name)


# Save data to database
def save_to_db(dataFrame, tableName):
    con = sqlite3.connect(db_name)
    dataFrame.to_sql(tableName, con)
    con.close()


# Get data (dataframe format) from database by table name
def get_table_by_name(tableName):
    con = sqlite3.connect(db_name)
    df = pd.read_sql_query("SELECT * FROM " + tableName + ";", con)
    con.close()
    return df


# Read data from file
# NOTE - the data-set is too big to load at once; several attempts crashed my computer. So I start with the first
# 10000 rows and will increase the data-set size when training the model.
def load_json():
    filename = get_absolute_path('./yelp_dataset/review.json')
    row_count = 0
    row_limit = 10000
    df = []
    with open(filename, encoding="utf8") as f:
        for line in f:
            df.append(json.loads(line))
            row_count += 1
            # stop once exactly row_limit rows have been read
            if row_count >= row_limit:
                break
    df = pd.DataFrame(df)
    return df
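
Because the file stores one JSON object per line, it can also be read lazily, so that only the requested rows are ever held in memory. The sketch below assumes that layout. `iter_json_lines` and `demo_path` are hypothetical names, and the demo writes a tiny temporary file instead of touching the real data-set.

```python
import json
import os
import tempfile
from itertools import islice


def iter_json_lines(path, limit=None):
    """Yield parsed records from at most `limit` lines of a JSON-lines file."""
    with open(path, encoding="utf8") as f:
        # islice stops reading after `limit` lines; None means no limit
        for line in islice(f, limit):
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Demo on a small temporary file (a stand-in for review.json)
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf8"
) as tmp:
    for i in range(5):
        tmp.write(json.dumps({"stars": i + 1, "text": f"review {i}"}) + "\n")
    demo_path = tmp.name

first_three = list(iter_json_lines(demo_path, limit=3))
print(len(first_three))
os.remove(demo_path)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(...)` in the same way `load_json` does.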


### STEP 1 - Gather data

In [20]:
# Get data from database or json file

file_path = get_absolute_path("data.csv")

if os.path.isfile(file_path):
    # Import csv
    df = pd.read_csv(file_path, encoding='utf-8')
else:
    df = load_json()
    # Export to csv
    df.to_csv(file_path, encoding='utf-8', index=False)


# Top 5 records
print(df.head().values)

# Shape of dataframe
print(df.shape)

# View data information
print(df.info())

# Check na values
print(df.isnull().values.sum())

[['Q1sbwvVQXV2734tPgoKj4Q' 'hG7b0MtEbXx5QzbzE6C_VA'
  'ujmEBvifdJM6h6RLv4wQIg' 1.0 6 1 0
  'Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.'
  '2013-05-07 04:34:36']
 ['GJXCdrto3ASJOqKeVWPi6Q' 'yXQM5uF2jS6es16SJzNHfg'
  'NZnhc2sEQy3RmzKTZnqtwQ' 5.0 0 0 0
  "I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon!  I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level!  \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit.  Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room.  Travis has freakishly strong fingers (in a 

There are no NA values in this data-set. Next, let's create a new column to store our Class/Label value,
which depends on the 'stars' column: if 'stars' is greater than 3, Class/Label is 1 - 'Positive'. Otherwise,
it is 0 - 'Negative'.

In [21]:
# Create a class(label) column
def get_class_label_value(row):
    # Positive (1) only when the rating is strictly greater than 3
    if row["stars"] > 3:
        return 1
    return 0


review_file_path = get_absolute_path("review.csv")

if not os.path.isfile(review_file_path):
    df["class"] = df.apply(get_class_label_value, axis=1)

    # Create new data frame
    filter_df = df[['class', 'text']]
    print(filter_df.head(1).values)
    print(filter_df.shape[0])
    print(filter_df.columns)

    # Export to csv
    filter_df.to_csv(review_file_path, encoding='utf-8', index=False)
else:
    # Import csv
    filter_df = pd.read_csv(review_file_path, encoding='utf-8')

### STEP 2 - Clean data / Text pre-processing

#### First of all, let's balance the data

In [22]:
balance_review_file_path = get_absolute_path("balance_review.csv")

if not os.path.isfile(balance_review_file_path):
    # num of Positive record
    print(filter_df.loc[filter_df["class"] == 1].count())

    # num of Negative record
    print(filter_df.loc[filter_df["class"] == 0].count())

    # balance the data: keep the same number of rows from each class
    balance_data_count = 10
    n_df = filter_df.loc[filter_df["class"] == 0][:balance_data_count]
    # number of negative rows
    print("Number of negative should be", balance_data_count, "- actual is", len(n_df.loc[n_df["class"] == 0]))
    print("Number of positive should be 0 - actual is", len(n_df.loc[n_df["class"] == 1]))

    p_df = filter_df.loc[filter_df["class"] == 1][:balance_data_count]
    # number of positive rows
    print("Number of positive should be", balance_data_count, "- actual is", len(p_df.loc[p_df["class"] == 1]))
    print("Number of negative should be 0 - actual is", len(p_df.loc[p_df["class"] == 0]))

    # merge positive and negative rows into one balanced data frame
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    filter_df = pd.concat([n_df, p_df])

    filter_df.to_csv(balance_review_file_path, encoding='utf-8', index=False)
else:
    # Import csv
    filter_df = pd.read_csv(balance_review_file_path, encoding='utf-8')
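
Taking the first rows of each class makes the balanced sample depend on the order of the file. One alternative, sketched here under the assumption that pandas 1.1 or newer is available, is to draw a random sample of equal size from each class. The helper `balance_classes` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def balance_classes(frame, label_col="class", random_state=42):
    """Downsample every class to the size of the smallest class."""
    n_per_class = frame[label_col].value_counts().min()
    balanced = frame.groupby(label_col).sample(
        n=n_per_class, random_state=random_state
    )
    # shuffle so the classes are not stored in contiguous blocks
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy data: five negative rows and two positive rows
toy = pd.DataFrame({
    "class": [0, 0, 0, 0, 0, 1, 1],
    "text": [f"review {i}" for i in range(7)],
})
balanced_toy = balance_classes(toy)
print(balanced_toy["class"].value_counts())
```

Fixing `random_state` keeps the sample reproducible between runs.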


#### Secondly, we would use NLTK method to normalize our corpus.

In [23]:
# Text Normalization - using NLTK
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')


def normalize_document(doc):
    # lower case and remove special characters
    # (re.I must be passed as flags=; as the 4th positional argument it is read as count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc


normalize_corpus = np.vectorize(normalize_document)

# filter_df["norm_text"] = normalize_corpus(filter_df["text"])
#
# # Check the result
# print(filter_df["norm_text"].describe())
# print(filter_df.head(1))

As a result, the normalized text still contains words that are not reduced to the form we want.
For example, we expected 'checked' and 'costs' to be stemmed.
Next, let's try the spaCy library, which provides many helper methods for normalizing our corpus.
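
The reason is that `normalize_document` only strips special characters, lower-cases, and removes stop words; it has no stemming step. For comparison, here is a minimal sketch of what adding NLTK's `PorterStemmer` would look like. The helper name `stem_tokens` is hypothetical and this step is not part of the notebook's pipeline.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_tokens(tokens):
    """Reduce each token to its Porter stem."""
    return [stemmer.stem(token) for token in tokens]


# The two words mentioned above
print(stem_tokens(["checked", "costs"]))
```

Porter stems are not always dictionary words, which is one reason to prefer a lemmatizer for this task.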

#### Next, let's use spaCy to normalize the text

In [24]:
nlp = spacy.load("en_core_web_sm")
white_list_pos = ["VERB", "PART", "NOUN", "ADJ", "ADV"]


def spacy_norm_text(text):
    # tokenizing
    doc = nlp(str(text))

    ret_set = set()

    # handle stop words, VERB, PART, ADJ, ADV and NOUN
    for token in doc:
        if not token.is_stop and token.text:  # remove stop words & empty string
            if token.pos_ in white_list_pos:  # if token is in white list, taking lemma_ instead
                ret_set.add(token.lemma_.lower().strip())

    # handle named entities (doc.ents yields entity spans, e.g. proper nouns)
    for ent in doc.ents:
        ret_set.add(ent.text)

    # convert to list
    unique_list = list(ret_set)

    return " ".join(unique_list)


norm_review_file_path = get_absolute_path("norm_review.csv")

if not os.path.isfile(norm_review_file_path):
    filter_df["norm_text"] = filter_df["text"].apply(spacy_norm_text)

    # Export norm text to file
    filter_df.to_csv(norm_review_file_path, encoding='utf-8', index=False)
else:
    # Import norm text data frame
    filter_df = pd.read_csv(norm_review_file_path, encoding='utf-8')

# Check the result
print(filter_df["norm_text"].describe())
print(filter_df.head(1))

count     20                                                                                                                                                                                                        
unique    20                                                                                                                                                                                                        
top       love shabu perspective fresh home limited bland water taste price miserable try good be clean skip sauce favor well judge quality place pot small hot selection expensive soup star appetite base quantity
freq      1                                                                                                                                                                                                         
Name: norm_text, dtype: object
   class  \
0  0       

                                                                                            

### STEP 3 - Feature extraction from text

#### Using TF-IDF to convert text to vector

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2)
tfidf = vectorizer.fit_transform(filter_df["norm_text"].values)

# convert to array
tfidf = tfidf.toarray()
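print(tfidf.shape)  # (rows, columns): one row per review, one column per vocabulary word

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out()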

# plt.figure(figsize=[20,4])
# _ = plt.show(tfidf)

pd.DataFrame(tfidf, columns=words)

(20, 222)


|   | 10 | 100 | 15 | 25 | 30 | 45 | 80 | about | actually | add | ... | want | watch | water | way | work | worth | write | wrong | year | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.315025 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.174245 | 0.0 | 0.095106 | 0.095106 | 0.087123 | 0.105399 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.075084 | 0.095106 | 0.0 | 0.095106 | 0.0 | 0.0 | 0.0 | 0.105399 | 0.066093 | 0.075084 |
| 2 | 0.0 | 0.0 | 0.230908 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.230908 | 0.255898 | 0.0 | 0.160466 | 0.182297 |
| 3 | 0.0 | 0.171843 | 0.155061 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.218662 | ... | 0.0 | 0.0 | 0.0 | 0.218662 | 0.0 | 0.0 | 0.0 | 0.0 | 0.151956 | 0.172629 |
| 5 | 0.2724 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.164772 | 0.0 | 0.0 | ... | 0.117381 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.103324 | 0.117381 |
| 6 | 0.123956 | 0.0 | 0.0 | 0.0 | 0.123956 | 0.0 | 0.0 | 0.0 | 0.149959 | 0.0 | ... | 0.106828 | 0.135314 | 0.0 | 0.0 | 0.123956 | 0.0 | 0.149959 | 0.0 | 0.094035 | 0.106828 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.262448 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.119747 | 0.0 | 0.0 | 0.197965 | 0.119747 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.085305 | 0.108053 | 0.119747 | 0.0 | 0.098982 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.146683 | 0.0 | 0.0 | 0.177453 | 0.0 | 0.160124 | ... | 0.126414 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
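
To make the numbers in the output above easier to interpret, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings as I understand them: a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, multiplied by the term count, followed by L2 normalization of each row. The function `tfidf_rows` and the two toy documents are hypothetical, and the sketch splits on whitespace instead of reproducing scikit-learn's tokenizer or the `min_df` filter.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # document frequency: in how many documents each word appears
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


vocab, rows = tfidf_rows(["good food", "bad food"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

A word that appears in every document ("food") gets a lower weight than a word that appears in only one ("good" or "bad"), which is why rarer words dominate each row.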


### STEP 4 - Build Models

In [26]:
from sklearn.model_selection import train_test_split

X = tfidf  # the features we want to analyze
y = filter_df['class'].values  # the labels, or answers, we want to test against

# split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Logistic regression
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X_train, y_train)

predict_ret = model.predict_proba(X_test)

# convert to Positive and Negative
y_predict = np.array([int(p[1] > 0.5) for p in predict_ret])

# accuracy
print(y_predict)
print(y_test)
print(np.sum(y_test == y_predict) / len(y_test))

[0 1 1 1 0 0]
[0 0 1 1 0 1]
0.6666666666666666
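
With only six test reviews, a single accuracy figure hides which kind of error the model makes. The sketch below takes the two arrays printed above and counts true/false positives and negatives. The helper `confusion_counts` is a hypothetical name, and the precision and recall definitions are the standard ones.

```python
# Arrays copied from the output above
y_predict = [0, 1, 1, 1, 0, 0]
y_test = [0, 0, 1, 1, 0, 1]


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is Positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_test, y_predict)
accuracy = (tp + tn) / len(y_test)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print("tp, fp, fn, tn:", tp, fp, fn, tn)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

Here the model makes one false positive and one false negative. Since `train_test_split` is called without a fixed `random_state`, these counts will change from run to run.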


