# <b>Introduction</b>

In this project, I classify reviews from the Yelp round-10 review dataset. The reviews contain a lot of metadata that can be mined to infer meaning, business attributes, and sentiment. For simplicity, I classify the review comments into two classes: positive or negative. Reviews with more than three stars are regarded as positive, and reviews with three stars or fewer as negative. Because every review carries a label, this is a supervised learning problem. To build and train the models, I first tokenize the text and convert it to integer sequences. Each review comment is limited to 50 words: texts shorter than 50 words are padded with zeros, and longer ones are truncated. After processing the review comments, I train three models:

<li> Model-1: A neural network with a single embedding layer followed by an LSTM layer.
<li> Model-2: Model-1 with an extra 1D convolutional layer (plus max pooling) placed in front of the LSTM layer to reduce the training time.
<li> Model-3: The same network architecture as Model-2, but with the pre-trained 100-dimensional GloVe word embeddings as the initial embedding weights.

Training on the full dataset takes a long time, so this notebook works on a random sample of 100,000 reviews and limits training to three epochs. In the runs below, Model-2 trains roughly twice as fast as Model-1 and reaches a similar validation accuracy (0.87 to 0.88 versus 0.88 to 0.89), while Model-3, whose GloVe embeddings are frozen, is the fastest but the least accurate (about 0.84).

## <b>Project Outline</b>

In this project I will cover the following:

<li> Download data from Yelp and process it
<li> Build neural network with LSTM
<li> Build neural network with LSTM and CNN
<li> Use pre-trained GloVe word embeddings
<li> Word Embeddings from Word2Vec

In [2]:
!pip install keras
!pip install tensorflow



In [1]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


## <b>Import libraries</b>

In [2]:
# Keras
# from keras.preprocessing.text import Tokenizer
# from keras.preprocessing.sequence import pad_sequences
# from keras.models import Sequential
# from keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation
# from keras.layers.embeddings import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation
from tensorflow.keras.layers import Embedding
## Plot
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)
import matplotlib.pyplot as plt

# NLTK
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Other
import re
import string
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

## <b>Data Processing</b>

Only two columns of the original dataset are needed: the review text and the star rating. To save time, I write these two columns into a temporary dataframe.

'''
/content# cp /content/drive/MyDrive/ColabNotebooks/ml_1002/yelp_academic_dataset_review.csv /content/sample_data/'''

Copy the file to local storage with the command above. Reading from local disk may be faster than reading from a mounted drive.

In [3]:
df = pd.read_csv('/content/sample_data/yelp_academic_dataset_review.csv')


Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.



The file holds about 6 million records. I take a random sample of 100,000 and do the train/validation split on that.

In [4]:
df=df.sample(n=100000)

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,stars
3793111,3793109,I love this place! The falafel is always good ...,5.0
2904416,2904416,Overall terrible experience. To think we had a...,1.0
4543595,4543593,This is a must if your visiting Tampa! The his...,5.0
6430858,6430856,What a delight! Usually I get dragged to a veg...,5.0
2060823,2060823,I am very sad that I had to give Faith a 3 out...,3.0


The new dataset stores the stars as floats. The remaining step is to filter out missing values and empty strings.

In [6]:
df= df.dropna()
#df = df[df.stars.apply(lambda x: x.isnumeric())]
df = df[df.stars.apply(lambda x: x !="")]
df = df[df.text.apply(lambda x: x !="")]

In [7]:
df.describe()

Unnamed: 0,stars
count,100000.0
mean,3.74691
std,1.478586
min,1.0
25%,3.0
50%,4.0
75%,5.0
max,5.0


In [8]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,stars
3793111,3793109,I love this place! The falafel is always good ...,5.0
2904416,2904416,Overall terrible experience. To think we had a...,1.0
4543595,4543593,This is a must if your visiting Tampa! The his...,5.0
6430858,6430856,What a delight! Usually I get dragged to a veg...,5.0
2060823,2060823,I am very sad that I had to give Faith a 3 out...,3.0


### Convert five classes into two classes (positive = 1 and negative = 0)

Since the main purpose is to identify positive or negative comments, I convert the five star ratings into two classes:

<li> (1) Positive: comments with stars > 3 and
<li> (2) Negative: comments with stars <= 3

In [9]:
labels = df['stars'].map(lambda x : 1 if int(x) > 3 else 0)

In [10]:
labels.head()

Unnamed: 0,stars
3793111,1
2904416,0
4543595,1
6430858,1
2060823,0
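
The thresholding rule can be checked on a few made-up ratings. This is a minimal sketch: `star_to_label` and `toy_stars` are hypothetical names, not part of the notebook. It compares the float directly instead of calling `int(x)` first, which gives the same result for whole-star ratings but would differ for a fractional value such as 3.5.

```python
from collections import Counter

def star_to_label(stars, threshold=3):
    """Return 1 (positive) if stars > threshold, else 0 (negative)."""
    return 1 if stars > threshold else 0

# Made-up ratings for illustration only.
toy_stars = [5.0, 1.0, 3.0, 4.0, 2.0]
toy_labels = [star_to_label(s) for s in toy_stars]

# Class balance of the toy labels.
toy_balance = Counter(toy_labels)
print(toy_labels, dict(toy_balance))
```

Counting the labels this way on the real data is a quick check of how unbalanced the two classes are before training.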


### Tokenize text data

To limit the computational cost, I keep only the 20,000 most frequent words. I first tokenize the comments and then convert them into integer sequences. Each sequence is capped at 50 words.
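
To make the tokenize, index, and pad steps concrete, here is a pure-Python sketch of what they roughly do. It is an illustration under assumptions, not the Keras implementation: the helper names are hypothetical, lowercasing and punctuation filtering are skipped, and it mirrors the Keras defaults in which a word limit of `num_words` keeps only indices below `num_words`, and padding and truncation both happen at the start of the sequence.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_index_sequences(texts, word_index, num_words):
    """Map words to indices, dropping words whose index is >= num_words."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_or_truncate(seq, maxlen):
    """Keep the last maxlen tokens and left-pad shorter sequences with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_texts = ["good food good price", "bad food", "good place great food good staff"]
toy_word_index = build_word_index(toy_texts)
toy_sequences = texts_to_index_sequences(toy_texts, toy_word_index, num_words=4)
toy_padded = [pad_or_truncate(s, maxlen=3) for s in toy_sequences]
print(toy_padded)
```

With these defaults a long review keeps its last 50 words rather than its first 50, which is worth knowing when the most informative sentence sits at the start of a review.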

In [11]:
def clean_text(text):

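    ## Keep punctuation for now: the regexes below expand contractions
    ## first and then strip or space out punctuation. (Calling
    ## text.translate(string.punctuation) without str.maketrans would
    ## remap control characters rather than remove punctuation.)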

    ## Convert words to lower case and split them
    text = text.lower().split()

    ## Remove stop words
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops and len(w) >= 3]

    text = " ".join(text)

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)

    text = text.split()
    stemmer = SnowballStemmer('english')
    stemmed_words = [stemmer.stem(word) for word in text]
    text = " ".join(stemmed_words)

    return text

In [12]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
df['text'] = df['text'].map(lambda x: clean_text(x))

In [None]:
df.head(10)

Unnamed: 0,stars,text
0,5,minut realiz conflict block away visit way lea...
2,5,love conflict kitchen food fantast downsid rot...
3,4,holi moli addict first heard conflict kitchen ...
4,4,great persian food though cheap fill street fo...
5,4,yummi food good price encourag tri new thing w...
6,5,passion peopl passion food ask question relat ...
7,1,first food never mix polit second hummus serv ...
8,5,past review discuss concept need countri exper...
9,5,premi food countri conflict delici creativ qui...
10,5,sixth star joke place everyth right give flip ...


In [15]:
vocabulary_size = 20000
tokenizer = Tokenizer(num_words= vocabulary_size)
tokenizer.fit_on_texts(df['text'])

sequences = tokenizer.texts_to_sequences(df['text'])
data = pad_sequences(sequences, maxlen=50)

In [16]:
print(data.shape)

(100000, 50)


## <b>Build neural network with LSTM</b>

### Network Architecture

The network starts with an embedding layer, which maps each token index to a dense vector so that the network can represent words in a meaningful way. The first argument, 20000, is the size of the vocabulary, and the second, 100, is the dimension of the embeddings. The third, input_length=50, is the length of each comment sequence.
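
As a rough check on the size of this network, the parameter counts can be worked out by hand. The sketch below assumes the standard layer definitions (an embedding table of vocabulary size by embedding dimension, an LSTM with four gates that each have input weights, recurrent weights, and a bias, and a dense layer with one bias per unit). The function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

param_counts = {
    "embedding": embedding_params(20000, 100),
    "lstm": lstm_params(100, 100),
    "dense": dense_params(100, 1),
}
total_params = sum(param_counts.values())
print(param_counts, total_params)
```

Under these assumptions the embedding table accounts for 2,000,000 of about 2.08 million parameters, so most of the model's capacity sits in the word vectors.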

In [17]:
model_lstm = Sequential()
model_lstm.add(Embedding(20000, 100, input_length=50))
model_lstm.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model_lstm.add(Dense(1, activation='sigmoid'))
model_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


Argument `input_length` is deprecated. Just remove it.



### Train the network

Even on the 100,000-comment sample, each epoch takes about five minutes in the run below, so I train for only three epochs to save time. A GPU would speed up training and make more epochs practical. I use 60% of the data for training and 40% for validation.
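
The split sizes and the number of steps per epoch follow from simple arithmetic. This sketch assumes the Keras default batch size of 32, since `fit` is called without a `batch_size` argument, and uses hypothetical helper names.

```python
import math

def split_sizes(n_samples, validation_split):
    """Return (n_train, n_val) for a fractional validation split."""
    n_val = round(n_samples * validation_split)
    return n_samples - n_val, n_val

def steps_per_epoch(n_train, batch_size=32):
    """Number of batches needed to see every training sample once."""
    return math.ceil(n_train / batch_size)

n_train, n_val = split_sizes(100000, 0.4)
n_steps = steps_per_epoch(n_train)
print(n_train, n_val, n_steps)
```

Keras takes the validation samples from the end of the arrays before shuffling, so the split is only representative if the rows are already in random order, as they are here after `df.sample`.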

In [18]:
model_lstm.fit(data, np.array(labels), validation_split=0.4, epochs=3)

Epoch 1/3
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 300s 158ms/step - accuracy: 0.8324 - loss: 0.3750 - val_accuracy: 0.8876 - val_loss: 0.2863
Epoch 2/3
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 318s 156ms/step - accuracy: 0.9089 - loss: 0.2300 - val_accuracy: 0.8867 - val_loss: 0.2788
Epoch 3/3
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 358s 175ms/step - accuracy: 0.9261 - loss: 0.1900 - val_accuracy: 0.8835 - val_loss: 0.2874


<keras.src.callbacks.history.History at 0x7f41677070a0>

## <b>Build neural network with LSTM and CNN</b>

The LSTM model works well, but three epochs take about 16 minutes in the run above. One way to speed up training is to add a convolutional layer in front of the LSTM. Convolutional neural networks (CNNs) come from image processing. They pass a filter over the data and compute a higher-level representation. They have been shown to work surprisingly well for text, even though they have none of the sequence-processing ability of LSTMs.
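
The speed-up comes largely from the shorter sequence that reaches the LSTM. The sketch below shows a filter sliding over a toy signal and then computes the sequence lengths for a kernel of size 5 and a pooling window of 4, assuming "valid" padding, a convolution stride of 1, and non-overlapping pooling. All function names are hypothetical.

```python
def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal and take a weighted sum at each position."""
    k = len(kernel)
    return [sum(s * w for s, w in zip(signal[i:i + k], kernel))
            for i in range(len(signal) - k + 1)]

def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1

def pool1d_output_length(input_length, pool_size):
    """Output length of non-overlapping pooling with 'valid' padding."""
    return (input_length - pool_size) // pool_size + 1

toy_response = conv1d_valid([1, 2, 3, 4, 5], [1, 0, -1])

after_conv = conv1d_output_length(50, 5)
after_pool = pool1d_output_length(after_conv, 4)
print(toy_response, after_conv, after_pool)
```

Under these assumptions the LSTM processes 11 time steps instead of 50, at the cost of seeing pooled features rather than individual words.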

In [19]:
def create_conv_model():
    model_conv = Sequential()
    model_conv.add(Embedding(vocabulary_size, 100, input_length=50))
    model_conv.add(Dropout(0.2))
    model_conv.add(Conv1D(64, 5, activation='relu'))
    model_conv.add(MaxPooling1D(pool_size=4))
    model_conv.add(LSTM(100))
    model_conv.add(Dense(1, activation='sigmoid'))
    model_conv.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model_conv

In [20]:
model_conv = create_conv_model()
model_conv.fit(data, np.array(labels), validation_split=0.4, epochs = 3)

Epoch 1/3
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 163s 84ms/step - accuracy: 0.8384 - loss: 0.3648 - val_accuracy: 0.8822 - val_loss: 0.2947
Epoch 2/3
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 157s 84ms/step - accuracy: 0.9110 - loss: 0.2245 - val_accuracy: 0.8835 - val_loss: 0.2842
Epoch 3/3
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 200s 82ms/step - accuracy: 0.9422 - loss: 0.1508 - val_accuracy: 0.8697 - val_loss: 0.3832


<keras.src.callbacks.history.History at 0x7f4134b52140>

### Save processed Data

In [21]:
df_save = pd.DataFrame(data)
df_label = pd.DataFrame(np.array(labels))

In [22]:
result = pd.concat([df_save, df_label], axis = 1)

In [23]:
result.to_csv('/content/sample_data/train_dense_word_vectors.csv', index=False)

## <b>Use pre-trained GloVe word embeddings</b>

In this subsection, I use pre-trained GloVe word embeddings. The glove.6B vectors were trained on a corpus of six billion tokens (words) with a vocabulary of 400 thousand words. They come in 50, 100, 200, and 300 dimensions; I chose the 100-dimensional version. I also want to see how the model behaves when the pre-trained word weights are not updated, so I set the trainable attribute of the embedding layer to False.

### Get embeddings from GloVe

In [25]:
embeddings_index = dict()
# Each line holds a word followed by its 100 vector components.
with open('/content/drive/MyDrive/ColabNotebooks/ml_1002/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.


In [26]:
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocabulary_size, 100))
for word, index in tokenizer.word_index.items():
    if index > vocabulary_size - 1:
        break
    else:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector
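
Words without a GloVe vector keep an all-zero row in the matrix, and because the text was stemmed, stems such as "realiz" or "delici" may well be missing. A small coverage check like the sketch below shows how many vocabulary words actually receive a pre-trained vector. `embedding_coverage` and the toy dictionaries are hypothetical and only illustrate the idea.

```python
def embedding_coverage(word_index, embeddings_index, vocabulary_size):
    """Return (found, total, missing) for the words that fit in the matrix."""
    in_vocab = [w for w, i in word_index.items() if i < vocabulary_size]
    missing = [w for w in in_vocab if w not in embeddings_index]
    return len(in_vocab) - len(missing), len(in_vocab), missing

# Toy stand-ins for tokenizer.word_index and the loaded GloVe dictionary.
toy_word_index = {"food": 1, "great": 2, "realiz": 3, "delici": 4, "rare": 5}
toy_embeddings = {"food": [0.1], "great": [0.2], "rare": [0.3]}

found, total, missing = embedding_coverage(toy_word_index, toy_embeddings, vocabulary_size=5)
print(f"{found}/{total} words covered; missing: {missing}")
```

Calling it with `tokenizer.word_index`, `embeddings_index`, and `vocabulary_size` would report the coverage for the real vocabulary.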

### Develop model

I use the same architecture as Model-2, with a convolutional layer in front of the LSTM layer.

In [27]:
model_glove = Sequential()
model_glove.add(Embedding(vocabulary_size, 100, input_length=50, weights=[embedding_matrix], trainable=False))
model_glove.add(Dropout(0.2))
model_glove.add(Conv1D(64, 5, activation='relu'))
model_glove.add(MaxPooling1D(pool_size=4))
model_glove.add(LSTM(100))
model_glove.add(Dense(1, activation='sigmoid'))
model_glove.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [28]:
model_glove.fit(data, np.array(labels), validation_split=0.4, epochs = 3)

Epoch 1/3
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 116s 61ms/step - accuracy: 0.7560 - loss: 0.5012 - val_accuracy: 0.8213 - val_loss: 0.3976
Epoch 2/3
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 131s 55ms/step - accuracy: 0.8207 - loss: 0.3899 - val_accuracy: 0.8283 - val_loss: 0.3897
Epoch 3/3
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 144s 56ms/step - accuracy: 0.8404 - loss: 0.3590 - val_accuracy: 0.8356 - val_loss: 0.3691


<keras.src.callbacks.history.History at 0x7f415c38c7c0>

## <b>Word embedding visualization</b>

In this subsection, I visualize the word embedding weights obtained from the trained models. The 100-dimensional word embeddings are first reduced to 2 dimensions using t-SNE. TensorFlow has an excellent tool for visualizing embeddings, but here I only want to look at the relationships between words.
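
A scatter plot shows word relationships only qualitatively. As a complement, the closest words to a query can be listed by cosine similarity on the original vectors. The sketch below uses hypothetical helper names and made-up 2-dimensional vectors; with real data, the dictionary would map each word to its row in one of the embedding matrices.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_words(query, vectors, top_n=3):
    """Return the top_n (word, similarity) pairs closest to the query word."""
    q = vectors[query]
    scored = [(w, cosine_similarity(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Made-up vectors for illustration only.
toy_vectors = {
    "good": [1.0, 0.9],
    "great": [0.9, 1.0],
    "bad": [-1.0, -0.8],
    "tabl": [0.1, -1.0],
}
toy_neighbors = nearest_words("good", toy_vectors, top_n=2)
print(toy_neighbors)
```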

### Get embedding weights from the trained models

In [29]:
lstm_embds = model_lstm.layers[0].get_weights()[0]

In [30]:
conv_embds = model_conv.layers[0].get_weights()[0]

In [31]:
glove_emds = model_glove.layers[0].get_weights()[0]

### Get word list

In [32]:
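# Row i of each embedding matrix belongs to the word with tokenizer index i.
# Index 0 is reserved for padding, so add a placeholder to keep labels aligned.
word_list = ['<pad>'] + list(tokenizer.word_index)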

### Scatter plot of first two components of TSNE

In [39]:
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)

In [40]:
def plot_words(data, start, stop, step):
    """Scatter-plot the 2D embeddings in data[start:stop:step] with word hover labels."""
    trace = go.Scatter(
        x = data[start:stop:step, 0],
        y = data[start:stop:step, 1],
        mode = 'markers',
        text = word_list[start:stop:step]
    )
    layout = dict(title= 't-SNE 1 vs t-SNE 2',
                  yaxis = dict(title='t-SNE 2'),
                  xaxis = dict(title='t-SNE 1'),
                  hovermode= 'closest')
    # Build a Figure object; a plain dict has no .show() method.
    fig = go.Figure(data=[trace], layout=layout)
    py.iplot(fig)


#### 1. LSTM

In [34]:
number_of_words = 2000
lstm_tsne_embds = TSNE(n_components=2).fit_transform(lstm_embds)

In [41]:
plot_words(lstm_tsne_embds, 0, number_of_words, 1)

#### 2. CNN + LSTM

In [36]:
conv_tsne_embds = TSNE(n_components=2).fit_transform(conv_embds)

In [37]:
plot_words(conv_tsne_embds, 0, number_of_words, 1)

#### 3. Glove

In [42]:
glove_tsne_embds = TSNE(n_components=2).fit_transform(glove_emds)

In [43]:
plot_words(glove_tsne_embds, 0, number_of_words, 1)

## <b>Word Embeddings from Word2Vec</b>

In this subsection, I use word2vec to create word embeddings from the review comments. Word2vec is one algorithm for learning a word embedding from a text corpus.

In [48]:
from gensim.models import Word2Vec
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

### Tokenize the review comments

In [49]:
df['tokenized'] = df['text'].apply(nltk.word_tokenize)

In [50]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,stars,tokenized
3793111,3793109,love place ! falafel alway good occas would ge...,5.0,"[love, place, !, falafel, alway, good, occas, ..."
2904416,2904416,overal terribl experi think group peopl privat...,1.0,"[overal, terribl, experi, think, group, peopl,..."
4543595,4543593,must visit tampa ! histori restaur start tapa ...,5.0,"[must, visit, tampa, !, histori, restaur, star..."
6430858,6430856,delight ! usual get drag vegan place want afte...,5.0,"[delight, !, usual, get, drag, vegan, place, w..."
2060823,2060823,sad give faith 5 + first balayag went great se...,3.0,"[sad, give, faith, 5, +, first, balayag, went,..."


### Train word2vec model

In [52]:
model_w2v = Word2Vec(df['tokenized'], vector_size=100)

In [58]:
X = model_w2v.wv[model_w2v.wv.key_to_index]



### Plot Word Vectors Using Truncated SVD

In [59]:
from sklearn.decomposition import TruncatedSVD

In [60]:
tsvd = TruncatedSVD(n_components=5, n_iter=10)
result = tsvd.fit_transform(X)

In [61]:
result.shape

(17202, 5)

In [64]:
# Words in the same order as the rows of X.
tsvd_word_list = list(model_w2v.wv.key_to_index)

trace = go.Scatter(
    x = result[0:number_of_words, 0],
    y = result[0:number_of_words, 1],
    mode = 'markers',
    text= tsvd_word_list[0:number_of_words]
)

layout = dict(title= 'SVD 1 vs SVD 2',
              yaxis = dict(title='SVD 2'),
              xaxis = dict(title='SVD 1'),
              hovermode= 'closest')

fig = dict(data = [trace], layout= layout)
py.iplot(fig)