# Sentiment Analysis on Yelp Open Dataset for Review Classification

Sentiment analysis is, together with speech recognition, one of the most popular applications of Natural Language Processing. In this project we study several learning models to find an efficient ensemble that can decide whether a Yelp review is positive or negative. The project consists of studying an open dataset made available by Yelp for academic purposes, describing its properties, preparing the data for training, building models with specific architectures, saving them, and analyzing the results.

### Import Libraries

The libraries are grouped into categories according to how they are used in the notebook: classic machine learning libraries such as tensorflow and sklearn, text manipulation libraries such as nltk and gensim, and data analysis and plotting tools such as seaborn, matplotlib and wordcloud.

In [None]:
# data collections
import pandas as pd
import numpy as np

# data analysis
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
%matplotlib inline

# text manipulation
import gensim
from gensim.parsing.preprocessing import remove_stopwords
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
# resources needed by word_tokenize, nltk.pos_tag and WordNetLemmatizer
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# dataset manipulation
from collections import Counter, defaultdict
from datetime import datetime
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# data modelling
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_addons as tfa


# save models
from keras.models import load_model
import pickle

## 1. Data Loading

One of the first obstacles in the project is loading the data. The dataset file is up to 5 GB in size, so loading it in a single pass could exhaust the available memory. To avoid this, we read the file in chunks of a predetermined size, which keeps the amount of data held in memory at any one time small. To speed the process up, we also declare the type of each JSON field in advance, so that the size of the values to be parsed and loaded is known beforehand. At the end, the chunks are concatenated into a dataframe, the tabular data structure provided by the Python pandas library.

In [None]:
# define features types on the file to reduce the loading time

rtypes = {  "review_id": str,
            "user_id":str,
            "business_id":str,
            "stars": np.float16, 
            "useful": np.int32, 
            "funny": np.int32,
            "cool": np.int32,
            "text" : str,
           }

# json dataset path
path = './data/yelp_academic_dataset_review.json'

# chunk size used in the read_json method
chunkSize = 10000

In [None]:
%%time
# making a JsonReader object
review = pd.read_json(path, lines=True,
                      orient="records",
                      dtype=rtypes,
                      chunksize=chunkSize)
chunkList = []

# using the chunk segmentation to manipulate only chunk data for each block
for chunkReview in review:
    # removing relational attributes
    chunkReview = chunkReview.drop(['review_id', 'user_id','business_id'], axis=1)
    chunkList.append(chunkReview)
    
# chunks concatenation to make the pandas dataframe 
df = pd.concat(chunkList, ignore_index=True, axis=0)

By reading the file in chunks, the data loading time was reduced to 1 minute and 42 seconds.

In [None]:
# dataframe head showing
df.head()

The resulting dataframe still contains fields that may turn out to be unnecessary for the goals of the project. We nevertheless keep them for now, so that we can analyze all properties, look for possible correlations with the main fields and, if any exist, exploit them when building the models.

## 2. Data Analysis

Data analysis aims to uncover hidden properties of the data, in order to understand its quality and how to make the most of it when training the models. Next, we examine the dataframe field by field, looking for useful information about data distribution, data types, and correlations or dependencies between features.

In [None]:
# information about the dataframe columns and the number of entries (rows)
df.info()

### 2.1 Stars Analysis

The "stars" feature is one of the most important fields in the entire dataset. It is a numerical rating associated with the text of a review which, according to the official Yelp documentation, takes integer values from 1 to 5. Since our goal is binary classification, we will aggregate these classes to distinguish positive reviews from negative ones. First, however, we need to analyze the feature and understand how its values are distributed in the dataset.

In [None]:
# define figure size
plt.figure(figsize=(8,8))

# counting stars by values and put the result in the pie plot
df['stars'].value_counts().plot.pie(startangle=60)

# define the plot title
plt.title('Distribution of values for the stars attribute')

As the previous chart shows, the classes are not balanced. The highest "stars" rating dominates, so the dataset cannot currently be split into two opposite categories of comparable size. For this reason, we will rebalance the values by taking a subset of the reviews and dividing it into training, validation and test sets.

In [None]:
# distribution of polarized or binary class stars 
binstars = pd.DataFrame()
binstars['stars'] = [0 if star <= 3.0 else 1 for star in df['stars']]


# define the size of the figure
plt.figure(figsize=(8,8))


# counting stars by value and put results in the pie plot
binstars['stars'].value_counts().plot.pie(startangle=60)

# define a title for the plot
plt.title('Distribution of positive and negative values')

A binary split, in which ratings greater than 3 are classified as "positive" and the remaining ones as "negative", does not make the dataset balanced. As the chart shows, reviews labelled 1 (positive) outnumber reviews labelled 0 (negative).
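
The polarization rule can be tried out on a handful of made-up ratings. The following is only a minimal sketch: the helper name `polarize` and the example ratings are assumptions introduced for illustration and are not part of the dataset.

```python
from collections import Counter

def polarize(stars, threshold=3.0):
    """Map star ratings to 1 (positive, above the threshold) or 0 (negative)."""
    return [1 if star > threshold else 0 for star in stars]

# hypothetical ratings, for illustration only
example_stars = [5.0, 4.0, 1.0, 3.0, 5.0, 2.0, 4.0]
example_labels = polarize(example_stars)

# count how many reviews fall into each class
class_counts = Counter(example_labels)
print(example_labels)
print(class_counts)
```

On the real data, this per-class count is the quantity that the pie chart visualizes.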

### 2.2 Cool, Funny and Useful Analysis

In [None]:
# insert a new feature to define the text length
df['textLength']  = df['text'].str.len()

In [None]:
# define correlations between the numeric features only (the text column is excluded)
corr = df.select_dtypes(include='number').corr()

# generate heatmap with correlation results
sns.heatmap(corr)

We looked for possible correlations between the stars and text properties and the other dataset features. The heatmap shows no such correlations; therefore the secondary features were not considered useful for achieving the objectives of the project.
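
To make the heatmap values easier to read, it may help to recall what a single cell contains. By default, `DataFrame.corr` computes the Pearson coefficient for each pair of columns. The sketch below recomputes that coefficient by hand on made-up values; the function name `pearson` and all the example lists are assumptions used only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# hypothetical values, for illustration only
stars_example = [1, 2, 3, 4, 5]
length_example = [50, 40, 30, 20, 10]  # decreases linearly as stars increase
useful_example = [2, 0, 3, 0, 2]       # no linear relation with stars

r_length = pearson(stars_example, length_example)
r_useful = pearson(stars_example, useful_example)
print(round(r_length, 3), round(r_useful, 3))
```

A coefficient close to -1 or 1 indicates a strong linear relation, while a value close to 0, as for the second pair, indicates none.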

### 2.3 Text Analysis

Analyzing the texts is fundamental to understand which terms make up the vocabulary of the reviews and how they can be associated with one of the two categories. The aim of this section is to determine which terms are used most frequently and whether some misleading terms need to be removed.

In [None]:
%%time

# define a subset of the input dataset 
subset = df[:100000]
# concatenation of texts in a single string and put every character in lowercase
inputText = ' '.join(subset['text']).lower()

# creation of the word cloud ignoring the stopwords
wordCloud = WordCloud(background_color='white', stopwords=gensim.parsing.preprocessing.STOPWORDS).generate(inputText)
# display the image using bilinear interpolation
plt.imshow(wordCloud, interpolation='bilinear')

# set axis invisible
plt.axis('off')
# show the most frequent terms in a word cloud plot
plt.show()

Given the dataset's imbalance towards positive reviews, it was natural to expect positive terms to be more frequent than negative ones. Specifically, the previous word cloud shows that terms such as "good" and "delicious" are frequently used to describe the service being reviewed. From words like "food", "service", "place" or "restaurant", we can deduce that most of the reviews concern restaurants or other food-related businesses. This already gives an idea of the vocabulary in use but, for greater precision, we also plot the term frequencies in a bar chart.

In [None]:
# calculate the most frequent terms
wordTokens = word_tokenize(inputText)
tokens = list()
for word in wordTokens:
    if word.isalpha() and word not in gensim.parsing.preprocessing.STOPWORDS:
        tokens.append(word)
tokenDist = FreqDist(tokens)
# select the 20 most frequent terms from the token dictionary
dist = pd.DataFrame(tokenDist.most_common(20),columns=['term', 'freq'])

In [None]:
# showing results in a bar plot
fig = plt.figure(figsize=(14,8))
ax = fig.add_axes([0,0,1,1])
x = dist['term']
y = dist['freq']
ax.bar(x,y)
plt.title('Frequency of the most used terms')
plt.show()

The analysis of the terms used in the review texts shows that, with stopwords filtered out, the most frequent words all carry conceptual meaning. We should still remove the stopwords from the dataset to reduce the variety of terms it contains: with a smaller vocabulary, training focuses only on the most relevant words, so a well-performing classifier can be obtained faster.

In [None]:
df.head()

In [None]:
# difference of the length between reviews from the different stars values
graph = sns.FacetGrid(data=df,col='stars')
graph.map(plt.hist,'textLength',bins=50,color='blue')

Once text lengths are normalized, the distribution of this property is similar across reviews, independently of the stars value.

## 3. Data Pre-processing

Data pre-processing takes into account what was discovered during data analysis and prepares the dataset for the training phase. Based on those findings, we can remove features such as cool, funny, useful and textLength, because they are not correlated with the stars and text fields. In addition, we transform the texts into lowercase, lemmatized sequences of words without stopwords (except for negative ones, which are important for determining the class of a review).

### 3.1 Removal of unused and null data

In [None]:
# removing useless features
df = df.drop(['cool', 'funny', 'useful', 'textLength'], axis=1)

In [None]:
df.head()

In [None]:
# removing rows with null text; the index is reset so that positional access by row number still works
df = df.dropna(subset=['text']).reset_index(drop=True)

### 3.2 Lowercase reduction

This change is necessary so that the same word is not treated as two different terms depending on its case, as with "Hello" and "hello".

In [None]:
# conversion of the dataset text to lowercase
df['text'] = [review_text.lower() for review_text in df['text']]

In [None]:
df['text'].head()

### 3.3 Stars polarization and dataset balance

To lighten the workload of the subsequent transformations, we perform the stars polarization at this point. Doing it first lets us carry out the remaining steps faster, on a balanced subset made of an equal number of positive and negative reviews according to the polarized stars values.

In [None]:
# polarization procedure 

texts =  df['text']

# setting stars as 1 if more than 3, otherwise set it to 0
stars = [0 if star <= 3.0 else 1 for star in df['stars']]

balancedTexts = [] # balanced subset of the text feature
balancedLabels = [] # polarized stars matching the balanced texts

# the limit per class is 200,000, so we select 400,000 reviews: 50% positive and 50% negative
limit = 200000  

# counters of negative and positive reviews added to the balanced subset
negPosCounts = [0, 0] 

for i in range(0,len(texts)):
    polarity = stars[i]
    if negPosCounts[polarity] < limit: # if the limit for this class has not been reached
        balancedTexts.append(texts[i])
        balancedLabels.append(stars[i])
        negPosCounts[polarity] += 1

In [None]:
df_balanced = pd.DataFrame()
df_balanced['text'] = balancedTexts
df_balanced['labels'] = balancedLabels
df_balanced.head()

In [None]:
# verify the balance between the two classes
counter = Counter(df_balanced['labels'])
print(f'There are {counter[1]} positive reviews and {counter[0]} negative reviews')

As a result of the balancing, we have a dataset of 400,000 reviews, of which 50% are positive and the rest negative.
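
The balancing strategy amounts to capping the number of reviews accepted per class. It can be sketched on a toy input as follows; the function name `balance_by_cap` and the example texts are assumptions made for illustration, not part of the notebook.

```python
def balance_by_cap(texts, labels, limit):
    """Keep at most `limit` items per class (0 or 1), scanning in input order."""
    counts = {0: 0, 1: 0}
    kept_texts, kept_labels = [], []
    for text, label in zip(texts, labels):
        if counts[label] < limit:  # this class still has room
            kept_texts.append(text)
            kept_labels.append(label)
            counts[label] += 1
    return kept_texts, kept_labels

# hypothetical toy reviews, for illustration only
toy_texts = ["great", "lovely", "superb", "awful", "tasty", "rude"]
toy_labels = [1, 1, 1, 0, 1, 0]

kept_texts, kept_labels = balance_by_cap(toy_texts, toy_labels, limit=2)
print(kept_texts)
print(kept_labels)
```

Note that this approach keeps the first `limit` reviews of each class in dataset order rather than a random sample, so the subset reflects whatever ordering the source file has.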

### 3.3 Lemmatization

Lemmatization is an expensive procedure on the balanced dataset. We first associate a part-of-speech tag with each word, to identify which type of term is being analyzed; then, according to the grammatical class it belongs to, we apply the matching lemmatization to obtain the correct normalized form.

In [None]:
%%time
# create the lemmatizer
lemmatizer = WordNetLemmatizer()

# word_tagger maps a Penn Treebank POS tag to the matching WordNet POS constant (None if there is none)
def word_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:         
        return None


# texts to lemmatize
texts = df_balanced['text']

# lemmatize_reviews returns the lemmatized version of the given collection of texts
def lemmatize_reviews(texts):
    df_texts = []
    for text in texts:
        # association between a token and the relative tag
        word_tagged = nltk.pos_tag(nltk.word_tokenize(text))
        # mapping word : tag
        map_word_tag = list(map(lambda x: (x[0], word_tagger(x[1])), word_tagged))
        # building the lemmatized text for each text
        lemmatized_text = []
        for word, tag in map_word_tag:
            if tag is None:
                # element with no tag
                lemmatized_text.append(word)
            else:
                # element with tag
                lemmatized_text.append(lemmatizer.lemmatize(word, tag))
        # adding the lemmatized text to the texts collection
        lemmatized_text = " ".join(lemmatized_text)
        df_texts.append(lemmatized_text)
    return df_texts

df_texts = lemmatize_reviews(texts)
print(texts[0] + "\n\n")
print(df_texts[0])

In [None]:
df_balanced['text'] = df_texts

### 3.4 Removing non-alphanumeric characters and stopwords

It was not possible to use the stopword set offered by nltk, because it contains negative terms which, for the objectives of the project, are very important for distinguishing the class of a review. To solve this problem, we used an alternative set provided by the gensim library. The set is printed below so that its content can be checked.
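
Whichever stopword list is used, negation terms can also be protected explicitly by subtracting them from the list before filtering. The following is a minimal sketch of that idea, not the approach taken in the notebook: the tiny stopword set, the negation set and the function name are all hypothetical and serve only as an illustration.

```python
# hypothetical, deliberately tiny stopword list; real lists are much larger
EXAMPLE_STOPWORDS = {"the", "a", "is", "was", "not", "no", "and", "it"}
# negation terms to keep even when the stopword list contains them
NEGATIONS = {"not", "no", "never", "nor"}

def remove_stopwords_keep_negations(text, stopwords=EXAMPLE_STOPWORDS, keep=NEGATIONS):
    """Drop stopwords from a whitespace-separated text, preserving negations."""
    effective = stopwords - keep
    return " ".join(word for word in text.split() if word not in effective)

cleaned = remove_stopwords_keep_negations("the food was not good and it is no bargain")
print(cleaned)
```

Keeping "not" and "no" preserves the difference between "good" and "not good", which is exactly the signal a sentiment classifier needs.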

In [None]:
# stopwords to remove
print(gensim.parsing.preprocessing.STOPWORDS)

In [None]:
# removing stopwords
df_texts = []
for text in df_balanced['text']:
    df_texts.append(remove_stopwords(text))

df_balanced['text'] = df_texts

# removing non-alphanumeric characters
df_texts = []
for text in df_balanced['text']:
    df_texts.append(''.join(ch for ch in text if ch.isalnum() or ch == ' '))

df_balanced['text'] = df_texts

In [None]:
print(df_balanced['text'])

### 3.5 Text Tokenization

Tokenization simply transforms a text into a list of words, so that each term can be addressed individually by its position in the list.

In [None]:
%%time
# tokenization using nltk word_tokenize
df_balanced['text'] = [nltk.word_tokenize(text) for text in df_balanced['text']]

In [None]:
df_balanced['text']

### 3.6 Word Reviews Representation

Since predictive models and classifiers work on numerical values, we need to transform our textual tokens into numbers. Each term is assigned an integer identifier. Thanks to lemmatization, words that originally appeared in several inflected forms are mapped to a single identifier; this reduces the number of distinct integer values present in the reviews, gives a more precise view of the meaning of the text and, consequently, should improve the performance of the models.

In [None]:
# count the distinct terms in the vocabulary of the lemmatized and balanced dataset
map_terms = dict()
for text in df_balanced['text']:
    for word in text:
        if word not in map_terms:
            map_terms[word] = 1

print(f'There are {len(map_terms)} different words') # number of words

In [None]:
%%time
# define a tokenizer restricted to the 15,000 most frequent words
tokenizer = Tokenizer(num_words=15000)
tokenizer.fit_on_texts(df_balanced['text'])
# transformation of the texts into sequences of integers
sequences = tokenizer.texts_to_sequences(df_balanced['text'])
# every sequence is brought to 300 terms per review: longer ones are truncated,
# shorter ones are filled with zeros
text_sequence = pad_sequences(sequences, maxlen=300)
labels = np.array(df_balanced['labels'])

In [None]:
# partial checking 
word_index = tokenizer.word_index
# print the first 50 terms
check = {key: value for key, value in word_index.items() if value <= 50}
print(check)

The numerical vectors draw their values from a vocabulary of 15,000 distinct words out of the 132,062 found in total. We therefore keep slightly more than 1/10 of the words present in the reviews; these are, however, the most frequent ones, and so more relevant than the remaining 9/10. Furthermore, each sequence preserves the order in which the terms occur in the text, up to the maximum size of 300 terms.
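
The effect of the vocabulary limit and of the fixed sequence length can be illustrated without Keras. The sketch below is a simplified re-implementation, written under the assumption that it mirrors the Keras defaults (words ranked by frequency starting from 1, only ranks below `num_words` kept, zeros added and items dropped at the start of the sequence); the helper names and the toy documents are hypothetical.

```python
from collections import Counter

def build_word_index(token_lists):
    """Rank words by frequency: 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(tokens, word_index, num_words):
    """Replace tokens by their rank, dropping words whose rank is >= num_words."""
    return [word_index[t] for t in tokens if t in word_index and word_index[t] < num_words]

def pad(sequence, maxlen):
    """Keep the last maxlen items, or left-pad with zeros up to maxlen."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

# hypothetical tokenized reviews, for illustration only
docs = [["good", "food", "good", "service"], ["bad", "food"]]
word_index = build_word_index(docs)
encoded = [encode(doc, word_index, num_words=4) for doc in docs]
padded = [pad(seq, maxlen=3) for seq in encoded]
print(word_index)
print(padded)
```

In this toy run the rarest word ("bad") falls outside the vocabulary and disappears from the second review, while the first review loses its first token because it is longer than the maximum length.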

## 4. Modelling

In this section we build several alternative models that are commonly used today for sentiment analysis and, more generally, in Natural Language Processing.

In [None]:
# check the TensorFlow compile flags and version
print(tf.sysconfig.get_compile_flags())
print(tf.__version__)

One of the metrics we use is the F-score (F1-score), which measures the quality of a model's predictions on a given set of data (the training and validation sets in our case) by combining precision and recall. To calculate it, divide twice the product of precision and recall by their sum.
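
As a quick worked example of the formula F1 = 2 · precision · recall / (precision + recall), the sketch below computes the score from confusion-matrix counts. The function names and the counts are assumptions chosen for illustration; in the notebook itself the score is computed with `sklearn.metrics.f1_score`.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f1_from_precision_recall(precision, recall)

# hypothetical counts: precision = 8/10, recall = 8/12
f1_example = f1_from_counts(tp=8, fp=2, fn=4)
print(round(f1_example, 4))
```

Because it is a harmonic mean, the score stays low whenever either precision or recall is low, which makes it more informative than accuracy when one kind of error dominates.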

In [None]:
# callbacks management
# callback that computes the F1 score on the training (and optionally validation) data
class f1_score_callback(tf.keras.callbacks.Callback):

    def __init__(self, train, validation=None):
        super(f1_score_callback, self).__init__()
        self.validation = validation
        self.train = train

    # overrides Callback.on_epoch_end: stores the F1 scores in the logs at the end of each epoch
    def on_epoch_end(self, epoch, logs=None):
        # avoid a shared mutable default argument
        logs = logs if logs is not None else {}

        X_train, y_train = self.train[0], self.train[1]
        # threshold the sigmoid outputs at 0.5 to obtain 0/1 class labels
        y_pred = (self.model.predict(X_train).ravel() > 0.5).astype(int)
        logs['f1_score_train'] = np.round(f1_score(y_train, y_pred), 5)

        if self.validation:
            X_valid, y_valid = self.validation[0], self.validation[1]
            y_val_pred = (self.model.predict(X_valid).ravel() > 0.5).astype(int)
            logs['f1_score_val'] = np.round(f1_score(y_valid, y_val_pred), 5)

In [None]:
# early stopping callbacks
es_acc_callback = keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3, verbose=0)
es_loss_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, verbose=0)

In [None]:
# train and test splitting
x_train, x_test, y_train, y_test = train_test_split(text_sequence , labels ,random_state=520, test_size=0.33, shuffle=True)

# train and validation splitting
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=0.5, random_state=1)

### 4.1 LSTM

The proposed model feeds an input embedding layer into an LSTM layer, flattens its output and passes it through two Dense layers before the output layer, which uses a sigmoid activation to determine the final class of a review.

In [None]:
# building the sequential network
lstm = keras.Sequential([
    layers.Embedding(15000,128,input_length=300), # word embedding
    layers.LSTM(128,return_sequences = True,  dropout=0.2), # long short-term memory layer returning the full sequence of outputs
    layers.Flatten(), # flattening the output
    layers.Dense(256, activation="relu"), # hidden layer
    layers.Dense(128, activation="relu"), # hidden layer
    layers.Dense(1, activation="sigmoid") # output layer
])

lstm.compile(
    loss='binary_crossentropy', # loss function
    optimizer='adam', # adam optimizer
    metrics=['accuracy']) # accuracy metrics

lstm.summary()

In [None]:
results_lstm = lstm.fit(x_train, y_train, epochs=3, 
                              validation_data=(x_val, y_val), 
                              callbacks=[f1_score_callback(train=(x_train,y_train),validation=(x_val,y_val)), 
                                         es_acc_callback, es_loss_callback])

In [None]:
lstm.evaluate(x_test, y_test)

### 4.2 Convolutional Neural Network

As in image recognition, we can treat words as if they were pixels. An embedding layer produces the input of the convolutional neural network; convolutional and pooling layers then extract the features, their output is flattened and passed to Dense layers and, finally, a single neuron with sigmoid activation determines the class.

In [None]:
cnn = keras.Sequential([
    layers.Embedding(15000,128,input_length=300), # word embedding
    layers.Conv1D(128, 8, activation="relu"), # convolutional layer for feature pattern matching
    layers.MaxPooling1D(pool_size=4), # max pooling to condense the matched patterns
    layers.Flatten(), # flattening the dimension of the output
    layers.Dropout(0.2), # dropout layer
    layers.Dense(256, activation="relu"), # hidden layer
    layers.Dropout(0.2), # dropout layer
    layers.Dense(128, activation="relu"), # hidden layer
    layers.Dense(1, activation="sigmoid") # output layer
])


cnn.compile(
    loss='binary_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'])



cnn.summary()

In [None]:
results_cnn = cnn.fit(x_train, y_train, epochs=3, 
                               validation_data=(x_val, y_val), 
                              callbacks=[f1_score_callback(train=(x_train,y_train),validation=(x_val,y_val)), 
                                         es_acc_callback, es_loss_callback], 
                              batch_size=67)

In [None]:
cnn.evaluate(x_test, y_test)

### 4.1.1 Model based on the combination of a Convolutional Neural Network and LSTM

This model combines the feature extraction of the convolutional neural network with the classification capabilities of the LSTM and, in addition, uses Dense layers to refine the result. Finally, the output goes to our usual sigmoid output layer.

In [52]:
# model combining the previous convolutional neural network and LSTM models
clstm = keras.Sequential([
    layers.Embedding(15000,128,input_length=300), # word embedding
    layers.Conv1D(128, 8, activation="relu"), # convolutional layer for feature pattern matching
    layers.MaxPooling1D(pool_size=4),  # max pooling to condense the matched patterns
    layers.LSTM(128,return_sequences = True,  dropout=0.2), # long short-term memory layer to capture sequential information
    layers.Flatten(), # flattening output
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid") # output layer returning the review class
])


clstm.compile(
    loss='binary_crossentropy', 
    optimizer='adam',
    metrics=['accuracy'])

clstm.summary()

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 300, 128)          1920000   
_________________________________________________________________
conv1d_12 (Conv1D)           (None, 293, 128)          131200    
_________________________________________________________________
max_pooling1d_12 (MaxPooling (None, 73, 128)           0         
_________________________________________________________________
lstm_10 (LSTM)               (None, 73, 128)           131584    
_________________________________________________________________
flatten_4 (Flatten)          (None, 9344)              0         
_________________________________________________________________
dense_24 (Dense)             (None, 128)               1196160   
_________________________________________________________________
dense_25 (Dense)             (None, 64)              
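
The shapes and parameter counts reported in the summary above can be reproduced by hand. The sketch below assumes the standard formulas for these Keras layers with their default settings (biases enabled, 'valid' convolution padding with stride 1, pooling stride equal to the pool size); the helper function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def conv1d_params(kernel_size, in_channels, filters):
    # one kernel per filter plus one bias per filter
    return kernel_size * in_channels * filters + filters

def conv1d_length(length, kernel_size):
    # 'valid' padding, stride 1
    return length - kernel_size + 1

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, units):
    return inputs * units + units

seq_len = conv1d_length(300, 8)   # 293 time steps after the convolution
pooled_len = seq_len // 4         # 73 time steps after max pooling
flat = pooled_len * 128           # 9344 values after flattening

params = {
    "embedding": embedding_params(15000, 128),  # 1920000
    "conv1d": conv1d_params(8, 128, 128),       # 131200
    "lstm": lstm_params(128, 128),              # 131584
    "dense_128": dense_params(flat, 128),       # 1196160
}
print(seq_len, pooled_len, flat)
print(params)
```

The embedding and the first Dense layer account for most of the weights, which is a consequence of the large vocabulary and of flattening the full LSTM output sequence.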

In [None]:
results_clstm = clstm.fit(x_train, y_train, epochs=3, 
                              validation_data=(x_val, y_val), 
                              callbacks=[f1_score_callback(train=(x_train,y_train),validation=(x_val,y_val)), 
                                         es_acc_callback, es_loss_callback], 
                              batch_size=67)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [None]:
clstm.evaluate(x_test, y_test)

### 4.3 Model based on the combination of a Convolutional Neural Network and biLSTM

This model is similar to the previous one. The difference lies in the recurrent layer, which is a biLSTM (bidirectional LSTM): it reads the sequence in both directions, so each position is related to both the preceding and the following features, whereas a plain LSTM works in one direction only and relates the current feature only to the previous ones.

In [None]:
bilstm = keras.Sequential([
    layers.Embedding(15000,128,input_length=300), # word embedding
    layers.Conv1D(128, 2, activation="relu"), # convolutional layer for feature patterns
    layers.MaxPooling1D(pool_size=8), # max pooling to condense the patterns
    layers.Bidirectional(layers.LSTM(128,return_sequences = True,  dropout=0.2)), # bidirectional long short-term memory layer to capture sequential information
    layers.Flatten(), # flattening output
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid") # output layer returning the review class
])

bilstm.compile(
    loss='binary_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'])

bilstm.summary()

In [None]:
results_bid = bilstm.fit(x_train, y_train, epochs=3, 
                              validation_data=(x_val, y_val), 
                              callbacks=[f1_score_callback(train=(x_train,y_train),validation=(x_val,y_val)), 
                                         es_acc_callback, es_loss_callback], 
                              batch_size=67)

In [None]:
bilstm.evaluate(x_test,y_test)

## 5. Save Models

Given the long time needed to train the models, we decided to save them to files so that they remain available for future tests: the tokenizer is serialized with the Python pickle library, while the models are stored in HDF5 format through the Keras save method. As the examples show, the classifiers are not optimal, but they are a good support tool for deciding whether a review is positive or negative.

In [None]:
import pickle

# saving of the tokenizer
with open("dump/tokenizer/keras_tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)

lstm.save("dump/model/yelp_lstm.hdf5")
cnn.save("dump/model/yelp_cnn.hdf5")
clstm.save("dump/model/yelp_clstm.hdf5")
bilstm.save("dump/model/yelp_bilstm.hdf5")


In [None]:
# function converting predicted probabilities into class labels (threshold 0.5)
def predict_conv(predictions):
    res = []
    for pred in predictions:
        if pred >= 0.5:
            res.append(1)
        else:
            res.append(0)
            
    return res   
predict_conv([0.8,0.4,0.5])

In [None]:
#  loading of the tokenizer
with open("dump/tokenizer/keras_tokenizer.pickle", "rb") as f:
    tokenizer = pickle.load(f)

# loading models
lstm = load_model("dump/model/yelp_lstm.hdf5")
cnn = load_model("dump/model/yelp_cnn.hdf5")
clstm = load_model("dump/model/yelp_clstm.hdf5")
bilstm = load_model("dump/model/yelp_bilstm.hdf5")

# defines testing samples
sample = df[:100].sample()
text_sample = np.array(sample.text) # text array
lem_sample = lemmatize_reviews(text_sample) # text lemmatization
# using sequence of integers to give to the model
sequences = tokenizer.texts_to_sequences(lem_sample) # texts to sequences of integer tokens
data_examples = pad_sequences(sequences, maxlen=300) # sequence padding

# doing predictions and save the results
predictions_lstm = lstm.predict(data_examples)
predictions_cnn = cnn.predict(data_examples)
predictions_clstm = clstm.predict(data_examples)
predictions_bid = bilstm.predict(data_examples)

# print the randomly selected input review
print(f"Random review:\n{sample}")

# print predictions
print(f"LSTM results:\n {predict_conv(predictions_lstm)}\n\n")
print(f"Convolutional Neural Network predictions results:\n {predict_conv(predictions_cnn)}\n\n")
print(f"Convolutional Neural Network concatenated with LSTM predictions results:\n {predict_conv(predictions_clstm)}\n\n")
print(f"Convolutional Neural Network concatenated with a biLSTM predictions results:\n {predict_conv(predictions_bid)}\n\n")

## 6. Data Results

In this section, we present a graphical comparison of the results, plotting the performance metrics taken into consideration: accuracy, loss, and F-score. The results are broken down by model and by training epoch.

In [None]:
# plot to compare training accuracy values
plt.plot(results_lstm.history['accuracy'])
plt.plot(results_cnn.history['accuracy'])
plt.plot(results_clstm.history['accuracy'])
plt.plot(results_bid.history['accuracy'])
plt.title('Models training accuracy')
plt.ylabel('Accuracy')
plt.xlabel('epoch')
plt.legend(['LSTM', 'CNN', 'CNN and LSTM', 'CNN and biLSTM'], loc='best')
plt.show()


# plot to compare validation accuracy values
plt.plot(results_lstm.history['val_accuracy'])
plt.plot(results_cnn.history['val_accuracy'])
plt.plot(results_clstm.history['val_accuracy'])
plt.plot(results_bid.history['val_accuracy'])
plt.title('Models validation accuracy')
plt.ylabel('Accuracy')
plt.xlabel('epoch')
plt.legend(['LSTM', 'CNN', 'CNN and LSTM', 'CNN and biLSTM'], loc='best')
plt.show()



In [None]:
# plot to compare training loss values
plt.plot(results_lstm.history['loss'])
plt.plot(results_cnn.history['loss'])
plt.plot(results_clstm.history['loss'])
plt.plot(results_bid.history['loss'])
plt.title('Models training loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['LSTM', 'CNN', 'CNN and LSTM', 'CNN and biLSTM'], loc='best')
plt.show()



# plot to compare validation loss values
plt.plot(results_lstm.history['val_loss'])
plt.plot(results_cnn.history['val_loss'])
plt.plot(results_clstm.history['val_loss'])
plt.plot(results_bid.history['val_loss'])
plt.title('Models validation loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['LSTM', 'CNN', 'CNN and LSTM', 'CNN and biLSTM'], loc='best')
plt.show()

In [None]:
# plots for f1 scores

plt.plot(results_lstm.history['f1_score_train'])
plt.plot(results_cnn.history['f1_score_train'])
plt.plot(results_clstm.history['f1_score_train'])
plt.plot(results_bid.history['f1_score_train'])
plt.title('Models training F1 scores values')
plt.ylabel('F-Score')
plt.xlabel('Epoch')
plt.legend(['LSTM', 'CNN', 'CNN and LSTM', 'CNN and biLSTM'], loc='best')
plt.show()

plt.plot(results_lstm.history['f1_score_val'])
plt.plot(results_cnn.history['f1_score_val'])
plt.plot(results_clstm.history['f1_score_val'])
plt.plot(results_bid.history['f1_score_val'])
plt.title('Models validation F1 scores values')
plt.ylabel('F-Score')
plt.xlabel('Epoch')
plt.legend(['LSTM', 'CNN', 'CNN and LSTM', 'CNN and biLSTM'], loc='best')
plt.show()