# Model Training

This notebook presents an exploration into predicting the star rating of a restaurant review using different types of machine learning models - Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), and Bidirectional Encoder Representations from Transformers (BERT).

All the models aim to predict the star rating of a restaurant review based on the text of the review. The models are trained on the Yelp Dataset from Kaggle: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset?select=yelp_academic_dataset_review.json

## Import Required Libraries

In this first cell, all necessary libraries for this notebook are imported. 

In [4]:
# Load libraries
import pandas as pd
import numpy as np
import json
import nltk
from nltk.stem import SnowballStemmer
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from nltk.corpus import stopwords

## Preprocessing

Preprocessing involves cleaning the text data, removing unwanted words, converting it into a form that is predictive. This process involves several steps:

1. Downloading the NLTK stopwords: Stopwords are words that you want to ignore, so you download a list of these words from NLTK, which is a toolkit for natural language processing.
2. The function preprocess_text() is defined to convert the text into lower case, remove punctuation, remove stopwords, and stem the words.
3. The data is loaded and processed in chunks to reduce RAM usage. I had to this here because we're dealing with large amounts of data that did not fit into memory.

The cleaned and preprocessed data is then saved for further analysis.

In [5]:
# Download the NLTK stopwords
nltk.download('punkt')
nltk.download('stopwords')

stop_words = stopwords.words('english')
stemmer = SnowballStemmer('english')

# Define a function to preprocess the text
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove stopwords and stem the words
    text = ' '.join(stemmer.stem(word) for word in text.split() if word not in stop_words)
    return text

# Define the chunk size
chunksize = 1000 # for 5k reviews
# chunksize = 10000 # for 100k reviews
# chunksize = 50000 # for 1M reviews
# chunksize = 100000 # for the entire dataset

# Initialize an empty DataFrame to store the preprocessed data
df_preprocessed = pd.DataFrame()

# Load and preprocess the data in chunks
for i, chunk in enumerate(pd.read_json('data/yelp_academic_dataset_review_reduced_5k.json', lines=True, chunksize=chunksize)): # for 5k reviews 
# for i, chunk in enumerate(pd.read_json('data/yelp_academic_dataset_review_reduced_100k.json', lines=True, chunksize=chunksize)): # for 100k reviews
# for i, chunk in enumerate(pd.read_json('data/yelp_academic_dataset_review_reduced_1M.json', lines=True, chunksize=chunksize)): # for 1M reviews
# for i, chunk in enumerate(pd.read_json('data/yelp_academic_dataset_review.json', lines=True, chunksize=chunksize)): # for the entire dataset
    # Preprocess the text in the chunk
    chunk['text'] = chunk['text'].apply(preprocess_text)
    # Append the preprocessed chunk to the DataFrame
    df_preprocessed = pd.concat([df_preprocessed, chunk])
    # Save the preprocessed chunk to a separate file
    chunk.to_json(f'preprocessing/preprocessed_reviews_chunk_{i}.json')
    # Print the progress 
    print(f'preprocessed chunk {i}')

# Save the entire preprocessed DataFrame
df_preprocessed.to_json('preprocessing/preprocessed_reviews.json')

# Display the first few rows of the preprocessed DataFrame
df_preprocessed.head()

# Print the number of rows in the DataFrame
print('Number of rows in the DataFrame:', len(df_preprocessed))

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dinopelesevic/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dinopelesevic/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


preprocessed chunk 0
preprocessed chunk 1
preprocessed chunk 2
preprocessed chunk 3
preprocessed chunk 4
Number of rows in the DataFrame: 5000


## Loading Preprocessed Data in Chunks

I implemented this step as my Kernel kept crashing because my RAM wasn't big enough. With this approach I was able to continue with the preprocessing even if the kernel crashed. 

The preprocessed data is loaded in chunks for further usage. This reduces the RAM usage as the whole data isn't loaded into memory at once. 
However, if your RAM is big enough you can load the whole preprocessed json-file at once.

The data is then divided into three subsets to be used for different models - CNN, LSTM and BERT.

In [6]:
# Initialize an empty DataFrame to store the preprocessed data
df_preprocessed = pd.DataFrame()

# Load the preprocessed data chunks into the DataFrame
# The number of chunks may vary depending on the chunk size in the previous step

for i in range(5): # for 5k reviews
# for i in range(10): # for 100k reviews
# for i in range(20): # for 1M reviews
# for i in range(70): # for the entire dataset
    print(f'Loading chunk {i}')
    # Load the chunk
    chunk = pd.read_json(f'preprocessing/preprocessed_reviews_chunk_{i}.json')

    # Append the chunk to the DataFrame
    df_bert = pd.concat([df_preprocessed, chunk])
    df_cnn = df_bert.copy()
    df_lstm = df_bert.copy()




Loading chunk 0
Loading chunk 1
Loading chunk 2
Loading chunk 3
Loading chunk 4


## CNN

This section is dedicated to building a Convolutional Neural Network (CNN) for the text classification task.

- First, the necessary libraries for building a CNN model are imported from TensorFlow.
- Tokenization: The Tokenizer utility class is used to vectorize a text corpus into a list of integers. Each integer maps to a value in a dictionary that encodes the entire corpus, with the keys in the dictionary being the vocabulary of the corpus.
- The model is defined with an Embedding layer, Conv1D layer and Dense layers. It's then compiled with categorical crossentropy loss function and Adam optimizer.
- The model is trained for a specified number of epochs.
- The trained model is saved for future use.

In [7]:
# Load Libraries
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Dense, Embedding
from tensorflow.keras.callbacks import EarlyStopping

# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 10000
# Max number of words in each review.
MAX_SEQUENCE_LENGTH = 250
# This is fixed.
EMBEDDING_DIM = 100

# Extract the strings from the dictionaries in the 'text' column
df_cnn['text'] = df_cnn['text'].apply(lambda x: list(x.values())[0] if isinstance(x, dict) else x)

# Extract the ratings from the dictionaries in the 'stars' column
df_cnn['stars'] = df_cnn['stars'].apply(lambda x: list(x.values())[0] if isinstance(x, dict) else x)

# Create a tokenizer
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@\[\]^_`{|}~', lower=True)

# Fit the tokenizer on the texts
tokenizer.fit_on_texts(df_cnn['text'].values)

# Vocabulary size
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Transform text to sequence of integers
X = tokenizer.texts_to_sequences(df_cnn['text'].values)

# Pad sequences
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

# One-hot encode labels
Y = pd.get_dummies(df_cnn['stars']).values
print('Shape of label tensor:', Y.shape)

# Split the data
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.10, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

# Define the model
model_cnn = Sequential()
model_cnn.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model_cnn.add(Conv1D(128, 5, activation='relu'))
model_cnn.add(GlobalMaxPooling1D())
model_cnn.add(Dense(64, activation='relu'))
model_cnn.add(Dense(5, activation='softmax'))

# Compile the model
model_cnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Summarize the model
model_cnn.summary()

# Define the number of epochs and the batch size
epochs = 1 # I did 5 epochs for the whole dataset
batch_size = 64 # and 128 batch size

# Train the model
history_cnn = model_cnn.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])

# Save the model
model_cnn.save('models/sentiment_analysis_model_cnn_5k.h5')
# model_cnn.save('models/sentiment_analysis_model_cnn_100k.h5')
# model_cnn.save('models/sentiment_analysis_model_cnn_1M.h5')
# model_cnn.save('models/sentiment_analysis_model_cnn.h5')


Found 6468 unique tokens.
Shape of data tensor: (1000, 250)
Shape of label tensor: (1000, 5)
(900, 250) (900, 5)
(100, 250) (100, 5)
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 250, 100)          1000000   
                                                                 
 conv1d (Conv1D)             (None, 246, 128)          64128     
                                                                 
 global_max_pooling1d (Globa  (None, 128)              0         
 lMaxPooling1D)                                                  
                                                                 
 dense (Dense)               (None, 64)                8256      
                                                                 
 dense_1 (Dense)             (None, 5)                 325       
                                                       

![Result after Training the CNN Model in 5 Epochs with the full Dataset](images/CNN_FullDataset_5Epochs_128Batches.png)

## LSTM

This section is dedicated to building a Long Short Term Memory (LSTM) model for the text classification task.

- Similar steps as in the CNN section are followed to build and train an LSTM model, except that the architecture of the model is different. LSTM layers are used instead of Conv1D layers.
- The trained model is saved for future use.

In [8]:
# Load Libraries
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.callbacks import EarlyStopping

# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 10000
# Max number of words in each review.
MAX_SEQUENCE_LENGTH = 250
# This is fixed.
EMBEDDING_DIM = 100

# Extract the strings from the dictionaries in the 'text' column
df_lstm['text'] = df_lstm['text'].apply(lambda x: list(x.values())[0] if isinstance(x, dict) else x)

# Extract the ratings from the dictionaries in the 'stars' column
df_lstm['stars'] = df_lstm['stars'].apply(lambda x: list(x.values())[0] if isinstance(x, dict) else x)

# Create a tokenizer
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@\[\]^_`{|}~', lower=True)

# Fit the tokenizer on the texts
tokenizer.fit_on_texts(df_lstm['text'].values)

# Vocabulary size
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Transform text to sequence of integers
X = tokenizer.texts_to_sequences(df_lstm['text'].values)

# Pad sequences
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

# One-hot encode labels
Y = pd.get_dummies(df_lstm['stars']).values
print('Shape of label tensor:', Y.shape)

# Split the data
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.10, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

# Define the model
model_lstm = Sequential()
model_lstm.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model_lstm.add(SpatialDropout1D(0.2))
model_lstm.add(LSTM(100, dropout=0.2, recurrent_dropout=0))
model_lstm.add(Dense(5, activation='softmax'))

# Compile the model
model_lstm.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Summarize the model
model_lstm.summary()

# Define the number of epochs and the batch size
epochs = 1
batch_size = 64

# Train the model
history = model_lstm.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])

# Save the model
model_lstm.save('models/sentiment_analysis_model_lstm_5k.h5')
# model_lstm.save('models/sentiment_analysis_model_lstm_100k.h5')
# model_lstm.save('models/sentiment_analysis_model_lstm_1M.h5')
# model_lstm.save('models/sentiment_analysis_model_lstm.h5')


Found 6468 unique tokens.
Shape of data tensor: (1000, 250)
Shape of label tensor: (1000, 5)
(900, 250) (900, 5)
(100, 250) (100, 5)
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 250, 100)          1000000   
                                                                 
 spatial_dropout1d (SpatialD  (None, 250, 100)         0         
 ropout1D)                                                       
                                                                 
 lstm (LSTM)                 (None, 100)               80400     
                                                                 
 dense_2 (Dense)             (None, 5)                 505       
                                                                 
Total params: 1,080,905
Trainable params: 1,080,905
Non-trainable params: 0
___________________________________________

2023-06-11 14:55:05.612304: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_2_grad/concat/split_2/split_dim' with dtype int32
	 [[{{node gradients/split_2_grad/concat/split_2/split_dim}}]]
2023-06-11 14:55:05.613770: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_grad/concat/split/split_dim' with dtype int32
	 [[{{node gradients/split_grad/concat/split/split_dim}}]]
2023-06-11 14:55:05.615621: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You mus



2023-06-11 14:55:10.901892: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_2_grad/concat/split_2/split_dim' with dtype int32
	 [[{{node gradients/split_2_grad/concat/split_2/split_dim}}]]
2023-06-11 14:55:10.903497: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_grad/concat/split/split_dim' with dtype int32
	 [[{{node gradients/split_grad/concat/split/split_dim}}]]
2023-06-11 14:55:10.905247: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You mus



![Result after Training the LSTM Model in 5 Epochs with the full Dataset](images/LSTM_FullDataset_5Epochs_128Batches.png)

## BERT

This section covers the usage of BERT (Bidirectional Encoder Representations from Transformers) model. BERT has been pre-trained on a large corpus of text and can generate high-quality embeddings for text classification tasks.

- First, the necessary libraries are imported, including the AutoTokenizer and TFDistilBertForSequenceClassification classes from the transformers library.

- The BERT tokenizer is loaded using the "distilbert-base-uncased" model, which is a version of BERT that is smaller, faster, cheaper, and lighter.

- The 'text' and 'stars' columns of the DataFrame are processed to extract the actual strings from the dictionaries.

- The ratings are converted to integers and then encoded to integer labels using LabelEncoder.

- The data is split into a training set and a test set.

- The reviews are tokenized using the BERT tokenizer, padding and truncating them to a maximum length of 512 tokens.

- The tokenized data and labels are converted into a TensorFlow dataset and batched.

- The BERT model is initialized using the "distilbert-base-uncased" model and compiled with the Adam optimizer, the SparseCategoricalCrossentropy loss function, and the SparseCategoricalAccuracy metric.

- The model is trained for the chosen amount of epochs and then saved.

**Warning for Demo Purposes:**
 The runtime for the Bert model takes a lot longer than LSTM and CNN. Even with the 5k Dataset, it takes about 3min.

In [10]:
# Load Libraries
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification
from sklearn import preprocessing
import numpy as np

# Load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Extract the strings from the dictionaries in the 'text' column
df_bert['text'] = df_bert['text'].apply(lambda x: list(x.values())[0] if isinstance(x, dict) else x)

# Extract the ratings from the dictionaries in the 'stars' column
df_bert['stars'] = df_bert['stars'].apply(lambda x: list(x.values())[0] if isinstance(x, dict) else x)

# Convert labels to integers
df_bert['stars'] = df_bert['stars'].astype(int)

# Initialize label encoder
le = preprocessing.LabelEncoder()
le.fit(df_bert['stars']) # Fit label encoder with the all data

# Transform the labels
df_bert['stars'] = le.transform(df_bert['stars'])

# Split the data
df_train, df_test = train_test_split(df_bert, test_size=0.2, random_state=42)

# Tokenize the datasets
tokenized_train_dataset = tokenizer(df_train['text'].tolist(), padding=True, truncation=True, max_length=512)
tokenized_test_dataset = tokenizer(df_test['text'].tolist(), padding=True, truncation=True, max_length=512)

# Convert the tokenized data and labels into a tensorflow dataset
train_dataset = tf.data.Dataset.from_tensor_slices(({"input_ids": tokenized_train_dataset['input_ids'], "attention_mask": tokenized_train_dataset['attention_mask']}, df_train['stars'].values)).batch(16)
test_dataset = tf.data.Dataset.from_tensor_slices(({"input_ids": tokenized_test_dataset['input_ids'], "attention_mask": tokenized_test_dataset['attention_mask']}, df_test['stars'].values)).batch(16)

# Initialize the model
model_bert = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(np.unique(df_bert['stars'])))

# Compile the model
model_bert.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

# Train the model
model_bert.fit(train_dataset, epochs=1, validation_data=test_dataset)

# Save the model
model_bert.save_pretrained('models/sentiment_analysis_model_bert_5k')
# model_bert.save_pretrained('models/sentiment_analysis_model_bert_100k')
# model_bert.save_pretrained('models/sentiment_analysis_model_bert_1M')
# model_bert.save_pretrained('models/sentiment_analysis_model_bert')


  from .autonotebook import tqdm as notebook_tqdm


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Downloading model.safetensors: 100%|██████████| 268M/268M [00:06<00:00, 38.8MB/s] 
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.



2023-06-11 15:08:14.210567: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_2' with dtype int64 and shape [200]
	 [[{{node Placeholder/_2}}]]




![Result after Training the LSTM Model in 5 Epochs with the full Dataset](images/Bert_100k_5Epochs.png)

## Loading the Models

After saving the models, they are loaded for use. The model files are located in the 'models' directory and loaded using the load_model function from TensorFlow (for ltsm and cnn) and the from_pretrained method from transformers (for bert).

In [11]:
from tensorflow.keras.models import load_model

# Load the model
model_cnn = load_model('models/sentiment_analysis_model_cnn_5k.h5')
model_lstm = load_model('models/sentiment_analysis_model_lstm_100k.h5')
model_bert = TFDistilBertForSequenceClassification.from_pretrained('models/sentiment_analysis_model_bert_100k')

2023-06-11 15:09:48.011085: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_2_grad/concat/split_2/split_dim' with dtype int32
	 [[{{node gradients/split_2_grad/concat/split_2/split_dim}}]]
2023-06-11 15:09:48.013698: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_grad/concat/split/split_dim' with dtype int32
	 [[{{node gradients/split_grad/concat/split/split_dim}}]]
2023-06-11 15:09:48.015274: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You mus

## Testing

The models are then tested on new data. In order to do this, the necessary libraries are imported and the preprocessing steps carried out earlier are repeated on the new data.


If you are starting from the Loading part, you have to run this cell as well.

In [12]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import re
from tensorflow.keras.preprocessing.text import Tokenizer
import pandas as pd
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# If the preprocessing has already been done and you're ram is big enough, you can load the preprocessed data all at once
# But if your ram is not big enough, you can load the data in chunks and preprocess them in chunks under "Loading Preprocessed Data in Chunks"
df_preprocessed= pd.read_json('preprocessing/preprocessed_reviews.json')

# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 10000
# Max number of words in each review.
MAX_SEQUENCE_LENGTH = 250
# This is fixed.
EMBEDDING_DIM = 100

# Extract the strings from the dictionaries in the 'text' column
df_preprocessed['text'] = df_preprocessed['text'].apply(lambda x: list(x.values())[0] if isinstance(x, dict) else x)

# Extract the ratings from the dictionaries in the 'stars' column
df_preprocessed['stars'] = df_preprocessed['stars'].apply(lambda x: list(x.values())[0] if isinstance(x, dict) else x)

# Create a tokenizer
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@\[\]^_`{|}~', lower=True)

# Fit the tokenizer on the texts
tokenizer.fit_on_texts(df_preprocessed['text'].values)

# Vocabulary size
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Transform text to sequence of integers
X = tokenizer.texts_to_sequences(df_preprocessed['text'].values)

Found 6468 unique tokens.


A set of test reviews is defined, and these reviews are preprocessed and tokenized using the same steps as the training data.

The models are then used to predict the ratings for these reviews, and the predicted ratings are printed.

In [13]:
# Download the stopwords from NLTK
nltk.download('stopwords')
stop_words = stopwords.words('english')
stemmer = SnowballStemmer('english')

def preprocess_reviews(reviews):
    processed_reviews = []
    for review in reviews:
        # Convert to lowercase
        review = review.lower()
        # Remove punctuation
        review = re.sub(r'[^\w\s]', '', review)
        # Remove stopwords and stem the words
        review = ' '.join(stemmer.stem(word) for word in review.split() if word not in stop_words)
        processed_reviews.append(review)
    return processed_reviews

# Select a few reviews to test the model
test_reviews = [
    'The food was absolutely wonderful, from preparation to presentation, very pleasing.',
    'The staff did not give us good service.',
    'The restaurant was not clean. Our food was terrible.',
    'The food was delicious and the service was great!',
    'The food was ok. But we liked the service.',
    'We ate not fine, but the food was not great at all!'
]

# Preprocess the test reviews
test_reviews = preprocess_reviews(test_reviews)

# Convert the test reviews into sequences
test_sequences = tokenizer.texts_to_sequences(test_reviews)
test_sequences = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)


# Make predictions on the test reviews
predictions_cnn = model_cnn.predict(test_sequences)
predictions_lstm = model_lstm.predict(test_sequences)


# Print the predictions
for i, review in enumerate(test_reviews):
    print(f'Review: {review}')
    print(f'Predicted rating (CNN): {np.argmax(predictions_cnn[i]) + 1}')
    print(f'Predicted rating (LSTM): {np.argmax(predictions_lstm[i]) + 1}')
    print('---')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dinopelesevic/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Review: food absolut wonder prepar present pleas
Predicted rating (CNN): 5
Predicted rating (LSTM): 5
---
Review: staff give us good servic
Predicted rating (CNN): 5
Predicted rating (LSTM): 5
---
Review: restaur clean food terribl
Predicted rating (CNN): 5
Predicted rating (LSTM): 5
---
Review: food delici servic great
Predicted rating (CNN): 5
Predicted rating (LSTM): 5
---
Review: food ok like servic
Predicted rating (CNN): 5
Predicted rating (LSTM): 5
---
Review: ate fine food great
Predicted rating (CNN): 5
Predicted rating (LSTM): 5
---


2023-06-11 15:09:55.379213: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_2_grad/concat/split_2/split_dim' with dtype int32
	 [[{{node gradients/split_2_grad/concat/split_2/split_dim}}]]
2023-06-11 15:09:55.382274: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_grad/concat/split/split_dim' with dtype int32
	 [[{{node gradients/split_grad/concat/split/split_dim}}]]
2023-06-11 15:09:55.384552: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You mus

In the BERT model testing, the DistilBertTokenizer is used for tokenizing the reviews, and the reviews are formatted as inputs for the BERT model. The predictions are then made using the BERT model and the predicted ratings are printed.

In [14]:
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Test Reviews
test_reviews = [
    'The food was absolutely wonderful, from preparation to presentation, very pleasing.',
    'The staff did not give us good service.',
    'The restaurant was not clean. Our food was terrible.',
    'The food was delicious and the service was great!',
    'The food was ok. But we liked the service.',
    'We ate not fine, but the food was not great at all!'
]

# For Bert: Tokenize and format the sentences as model inputs
inputs = tokenizer(test_reviews, return_tensors='tf', padding=True, truncation=True, max_length=512)

# Make predictions on the test reviews
predictions_bert = model_bert(inputs)

# Print the predictions
for i, review in enumerate(test_reviews):
    print(f'Review: {review}')
    print(f'Predicted rating (BERT): {np.argmax(predictions_bert.logits[i]) + 1}')
    print('---')


Review: The food was absolutely wonderful, from preparation to presentation, very pleasing.
Predicted rating (BERT): 5
---
Review: The staff did not give us good service.
Predicted rating (BERT): 5
---
Review: The restaurant was not clean. Our food was terrible.
Predicted rating (BERT): 5
---
Review: The food was delicious and the service was great!
Predicted rating (BERT): 5
---
Review: The food was ok. But we liked the service.
Predicted rating (BERT): 5
---
Review: We ate not fine, but the food was not great at all!
Predicted rating (BERT): 5
---
