

Programming Assignment 

Project Overview: 

For this programming assignment, I developed a sentiment analysis program using the Google Local Reviews (2021) dataset (citation here). The trained model analyzes the sentiment of reviews of businesses listed on Google Places across America. The TensorFlow Keras library is used to train an LSTM model that predicts whether a review is negative or positive. 

Problem Space: 

Customer experience is extremely important to the continued success of any business, and one way to improve existing services is to refer to customer feedback. Data is plentiful today, and customers may leave hundreds or thousands of reviews. As a business owner, it is critical to identify the useful reviews in order to draw insights that improve operations. Many reviews on the internet yield minimal information, such as “great service.” More importantly, not every platform requires users to give a numerical rating when leaving a review. Thus, business owners need to sift through many uninformative reviews to find the insightful ones to analyze. 

Solution: Sentiment Analysis Program 

A robust sentiment analysis program will help quickly identify meaningful reviews. In some ways, extremely good or bad reviews generate the most insights. While it is not ideal for businesses to see bad reviews, those reviews generally yield significantly more insights.

Some examples of these reviews are as follows: 

“It is posted that they open at 9:30 am Monday through Friday. I'm here at 10:05 in the morning with packages that need to be sent out and there's nobody here!” 

“Their value has gone down based on their massage quality declining. There is definitely less effort verses when I first starting going here. I usually tip $5 for the $25 massage. But these guys actually frown upon you and ask you to make it $10. And then charge $1 for using a credit card.” 

“What does it say about a company when they can't even show you pricing. Have to trick you into a membership. stay far away!” 

While these reviews are negative, they tell the business owners exactly what caused the poor experiences. For a business owner, reviews that point to actionable areas for operational improvement are inherently more valuable. 

Methodology: 

The program reads the data file and groups reviews by positive and negative sentiment. From personal experience, any rating under 4 is generally considered sub-par. For this project, reviews with ratings of 3 or lower are treated as negative, and reviews with ratings of 4 or 5 are treated as positive. To keep the data balanced, 3000 positive and 3000 negative reviews are used. More data would likely yield better results, but the dataset appears to contain significantly fewer negative reviews than positive ones. Thus, to keep runtime reasonable, only 3000 reviews of each class are used. 
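
The labeling and balancing rule described above can be sketched in a few lines. This is a minimal illustration, not the program's actual loading code: `label_from_rating`, `take_balanced`, and the sample ratings are hypothetical names and values chosen for the example.

```python
def label_from_rating(rating):
    """Map a 1-5 star rating to a binary label: 1 = positive (4-5), 0 = negative (1-3)."""
    return 1 if rating > 3 else 0


def take_balanced(ratings, per_class):
    """Keep at most `per_class` items of each label; stop once both quotas are met."""
    counts = {0: 0, 1: 0}
    kept = []
    for rating in ratings:
        if counts[0] >= per_class and counts[1] >= per_class:
            break
        label = label_from_rating(rating)
        if counts[label] < per_class:
            kept.append((rating, label))
            counts[label] += 1
    return kept, counts


# Hypothetical ratings, skewed toward positive reviews.
sample_ratings = [5, 5, 4, 5, 2, 5, 4, 1, 5, 3, 5]
kept, counts = take_balanced(sample_ratings, per_class=2)
print(kept)
print(counts)
```

Because positive reviews are more common in this toy list, the positive quota fills first and the loop keeps scanning until enough negative reviews are found.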

The reviews are then preprocessed: emoji and punctuation are removed, the text is lowercased, stopwords are removed, and the remaining words are lemmatized. The reviews are split into training and testing sets (80% and 20%). The texts are then tokenized with a dictionary size of 8000, padded to a length of 100 using Keras library functions, and fed into the model. 
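
To make the tokenization and padding step concrete, here is a small pure-Python sketch of the idea: build a frequency-ranked word index on the training texts, drop words outside the dictionary size, and left-pad each sequence with zeros to a fixed length. The helper names (`build_word_index`, `texts_to_ids`, `pad_left`) and the toy texts are hypothetical; the program itself relies on the Keras `Tokenizer` and `pad_sequences`, whose behavior may differ in details such as tie-breaking.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; id 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ranked)}


def texts_to_ids(texts, word_index, num_words):
    """Convert texts to id lists, keeping only ids below `num_words`; unseen words are dropped."""
    return [
        [word_index[w] for w in text.split() if word_index.get(w, num_words) < num_words]
        for text in texts
    ]


def pad_left(sequences, maxlen):
    """Keep the last `maxlen` ids and pad shorter sequences with zeros on the left."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]


train_texts = ["great service great staff", "slow service rude staff", "great food"]
word_index = build_word_index(train_texts)
ids = texts_to_ids(["great service", "rude manager great food staff"], word_index, num_words=5)
padded = pad_left(ids, maxlen=4)
print(word_index)
print(padded)
```

Fitting the index on the training texts only means that words seen only in the test set are simply dropped, which mirrors how the model would treat unseen words in new reviews.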

The model has an embedding layer, one LSTM layer that uses a dropout rate of 0.2, and a single sigmoid output unit. After testing with various batch sizes, a batch size of 40 seems to balance accuracy with runtime.
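
As a rough check on model size, the trainable parameter count can be worked out by hand. The sketch below assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and a bias) and the layer sizes used in the code: a dictionary of 8000 words, 75-dimensional embeddings, 75 LSTM units, and one output unit. The function names are illustrative only.

```python
def embedding_params(vocab_size, embed_dim):
    """One embedding vector per word in the dictionary."""
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """Fully connected layer: weights plus one bias per unit."""
    return input_dim * units + units


embedding = embedding_params(vocab_size=8000, embed_dim=75)
lstm = lstm_params(input_dim=75, units=75)
dense = dense_params(input_dim=75, units=1)
total = embedding + lstm + dense
print(embedding, lstm, dense, total)
```

Most of the parameters sit in the embedding layer, so under these assumptions the dictionary size and embedding dimension dominate the model size far more than the LSTM width does.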
 
Results: 

The model trained with these parameters reached a testing accuracy of 90%. 
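
Accuracy alone does not show whether errors fall mainly on positive or negative reviews. A minimal sketch of how sigmoid outputs could be thresholded at 0.5 and summarized as a confusion matrix is given below. The probabilities and labels are made-up values for illustration, not results from this model, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives after thresholding the probabilities."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob > threshold else 0
        if pred == 1 and truth == 1:
            counts["tp"] += 1
        elif pred == 0 and truth == 0:
            counts["tn"] += 1
        elif pred == 1 and truth == 0:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


# Made-up labels and predicted probabilities for six reviews.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]

cm = confusion_counts(y_true, y_prob)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
print(cm)
print("accuracy:", accuracy)
```

With the real model, `y_prob` would come from the model's predictions on the padded test sequences and `y_true` from the test labels.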

Conclusion:

This project explored the effectiveness of an LSTM for sentiment analysis. A high-performing sentiment analysis program is very valuable to any business owner. The performance of the LSTM could be improved with more parameter tuning. In addition, because of excessive runtime, only 6000 total reviews were used. With more data, the model accuracy should improve. This project only implemented a simple LSTM through the TensorFlow Keras library, and more complex models could yield higher accuracy. 



In [None]:
import gzip
import json
import re
import string

import pandas as pd
!pip install langdetect
from langdetect import detect


import nltk
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import tensorflow 

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


#inspired by https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python
#removes emoji from reviews
def remove_emoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

#check if text is in English (some reviews are in a foreign language)
def is_english(text):
    try:
        return detect(text) == 'en'
    except Exception:
        #detection can fail on text with no usable features (e.g. empty or emoji-only reviews)
        return False

#get sentiment: ratings > 3 are positive (1), ratings <= 3 are negative (0)
def get_senitment(rating):
    if rating > 3: 
        return 1
    else: 
        return 0

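# Given from dataset website : https://jiachengli1995.github.io/google/index.html
#load json objects, one per line of the gzip file
def parse(path):
    #the context manager closes the file once the generator is exhausted
    with gzip.open(path, 'r') as g:
        for line in g:
            yield json.loads(line)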

#from https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/
#drop stopwords and lemmatize the remaining tokens
def filter_tokens(tokens):
    lemmatized_tokens = []
    for token in tokens:
        if token not in stop_words:
            lemmatized_tokens.append(lemmatizer.lemmatize(token))
    return lemmatized_tokens


#from https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/
def text_preprocess(text): 
    filtered_text = remove_emoji(text)
    filtered_text = filtered_text.lower()
     
    #remove punctuation from https://www.geeksforgeeks.org/python-remove-punctuation-from-string/
    filtered_text = filtered_text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(filtered_text)
    filtered_tokens = filter_tokens(tokens)
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text


#******************************** data import and processing ************************

data = []
count_positive = 0
count_negative = 0
entries_need = 3000
for parsed_obj in parse("review-California_10.json.gz"):
    #stop once both classes have enough reviews
    if count_positive >= entries_need and count_negative >= entries_need:
        break

    #skip reviews with no text or not in English
    if parsed_obj['text'] is None or not is_english(parsed_obj['text']):
        continue

    #balance data: keep a review only if its class still needs entries
    senitment = get_senitment(parsed_obj['rating'])
    if senitment == 1 and count_positive < entries_need:
        count_positive += 1
    elif senitment == 0 and count_negative < entries_need:
        count_negative += 1
    else:
        continue

    data.append({'review': parsed_obj['text'], 'filtered_review': text_preprocess(parsed_obj['text']), 'rating': parsed_obj['rating'], "sentiment": senitment})

    #report progress every 500 kept reviews instead of on every record
    if len(data) % 500 == 0:
        print("count_positive, %s" % count_positive)
        print("count_negative, %s" % count_negative)


#build the DataFrame directly from the list of row dictionaries
df = pd.DataFrame(data)

#******************************** Tokenization and Model Training ************************
#Keras ships with TensorFlow, so all Keras imports go through tensorflow.keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from sklearn.model_selection import train_test_split

#split
X_train, X_test, Y_train, Y_test = train_test_split(df['filtered_review'], df['sentiment'], test_size=0.2, random_state=30)


#preprocess step: fit the tokenizer on the training texts only
tokenizer = Tokenizer(num_words=8000)
tokenizer.fit_on_texts(X_train)

#Tokenize and pad to length of 100
#https://www.educba.com/keras-pad_sequences/
X_train = tokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(X_train, maxlen=100)
X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=100)


Y_train = Y_train.values
Y_test = Y_test.values

#add model layers
#https://victorzhou.com/blog/keras-rnn-tutorial/ 
#https://keras.io/api/layers/recurrent_layers/lstm/
#
model = Sequential()
model.add(Embedding(input_dim=8000, output_dim=75, input_length=100))
model.add(LSTM(units=75, dropout=0.2))
model.add(Dense(units=1, activation='sigmoid'))

#compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

#training (note: the test set is also passed as validation data here)
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=8, batch_size=40)

# Test the model accuracy
#https://www.tutorialspoint.com/keras/keras_model_evaluation_and_prediction.htm
result = model.evaluate(X_test, Y_test)
print('accuracy of this model is: ', result[1])


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Streaming output truncated to the last 5000 lines.
count_negative, 1524
count_positive, 3000