# Enhancing Singapore Airlines' Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**



## Singapore Airlines Customer Reviews Dataset Information

The [Singapore Airlines Customer Reviews Dataset](https://www.kaggle.com/datasets/kanchana1990/singapore-airlines-reviews) aggregates 10,000 anonymized customer reviews, providing a broad perspective on the passenger experience with Singapore Airlines. 

The dimensions are shown below:
- **`published_date`**: Date and time of review publication.
- **`published_platform`**: Platform where the review was posted.
- **`rating`**: Customer satisfaction rating, from 1 (lowest) to 5 (highest).
- **`type`**: Specifies the content as a review.
- **`text`**: Detailed customer feedback.
- **`title`**: Summary of the review.
- **`helpful_votes`**: Number of users finding the review helpful.

## Additional web scraping of online reviews

During our EDA, we noticed two main trends in the distribution of our dataset:
1. Fewer than 10% of the reviews were published between 2022 and 2024, making it hard to capture recent trends in sentiment.
2. Most of the reviews were highly positive. This may simply reflect that SIA receives mostly positive reviews, but we still wanted more negative reviews to improve the robustness of our model.

### TripAdvisor

We scraped more data for airline reviews from TripAdvisor, specifically for the years 2022 to 2024. 
(https://www.tripadvisor.com.sg/Airline_Review-d8729151-Reviews-Singapore-Airlines)

The dimensions are shown below:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 5)


### Skytrax

We also scraped from Skytrax, which is another data source for online reviews. 
(https://www.airlinequality.com/airline-reviews/singapore-airlines/?sortby=post_date%3ADesc&pagesize=100)

The dimensions are shown below:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 10)
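
Because TripAdvisor rates on a 1 to 5 scale and Skytrax on a 1 to 10 scale, the ratings need a common scale before they can be turned into sentiment labels. The sketch below shows one possible way to do this. The helper name `rating_to_sentiment` and the thresholds (4 and above is Positive, 3 up to 4 is Neutral, below 3 is Negative on the 5-point scale) are assumptions for illustration, not necessarily the rule used to build the final dataset.

```python
def rating_to_sentiment(rating, scale_max=5):
    """Map a numeric rating to a sentiment label (hypothetical thresholds)."""
    # Rescale to the 5-point scale, e.g. a Skytrax 8/10 becomes 4.0/5
    rating_5 = rating * 5 / scale_max
    if rating_5 >= 4:
        return "Positive"
    if rating_5 >= 3:
        return "Neutral"
    return "Negative"


tripadvisor_label = rating_to_sentiment(5, scale_max=5)   # TripAdvisor, 1 to 5
skytrax_label = rating_to_sentiment(4, scale_max=10)      # Skytrax, 1 to 10
print(tripadvisor_label, skytrax_label)
```

A simple proportional rescaling is used here; other mappings between the two scales are equally defensible.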

## Importing Libraries

Please uncomment the code box below to pip install relevant dependencies for this notebook.

In [11]:
# !pip3 install -r requirements.txt



In [29]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime 

# Statistical functions
from scipy.stats import zscore

# Text Preprocessing and NLP
import nltk
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords

# WordNet lexical database (used for lemmatization)
from nltk.corpus import wordnet

# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer


# For generating n-grams
from nltk.util import ngrams
from collections import Counter

## Data Preparation (Loading CSV)

Load the preprocessed reviews from `final_df.csv` into a pandas DataFrame `data`.

In [30]:
data = pd.read_csv('final_df.csv')

In [31]:
data.head()

Unnamed: 0,year,month,sentiment,processed_full_review
0,2024,3,Neutral,ok use airlin go singapor london heathrow issu...
1,2024,3,Negative,don give money book paid receiv email confirm ...
2,2024,3,Positive,best airlin world best airlin world seat food ...
3,2024,3,Negative,premium economi seat singapor airlin not worth...
4,2024,3,Negative,imposs get promis refund book flight full mont...


In [32]:
data['sentiment'].value_counts()

sentiment
Positive    7913
Negative    2441
Neutral     1164
Name: count, dtype: int64

In [33]:
data['year'].value_counts()

year
2019    5129
2018    2596
2022    1184
2023    1111
2020     888
2024     514
2021      96
Name: count, dtype: int64
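
As a quick check on the recency concern raised during EDA, the counts above can be turned into a share of reviews from 2022 to 2024. This is a minimal sketch that hard-codes the `value_counts()` output shown above instead of reading `final_df.csv`.

```python
# Year counts copied from the data['year'].value_counts() output
year_counts = {2019: 5129, 2018: 2596, 2022: 1184, 2023: 1111,
               2020: 888, 2024: 514, 2021: 96}

total = sum(year_counts.values())
recent = sum(n for year, n in year_counts.items() if year >= 2022)
recent_share = recent / total

print(f"{recent} of {total} reviews are from 2022 to 2024 ({recent_share:.1%})")
```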

# Basic RNN

In [37]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import tensorflow as tf
import numpy as np
import random

tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Parameters
vocab_size = 5000         # Limit vocabulary to 5000 words
embedding_dim = 128        # Embedding dimensions for each word
max_sequence_length = 300 # Max number of words in each sequence

# Step 1: Tokenize and Pad the Text
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(data['processed_full_review'])
sequences = tokenizer.texts_to_sequences(data['processed_full_review'])
X_padded = pad_sequences(sequences, maxlen=max_sequence_length)

# Labels
sentiment_dict = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
y = data['sentiment'].map(sentiment_dict).values

X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=42)

# Step 2: Define a Simple RNN Model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length, trainable=True))
model.add(SimpleRNN(64, activation='tanh'))
model.add(Dropout(0.5))  # Add dropout for regularization
model.add(Dense(3, activation='softmax'))   # Output layer for 3 classes

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Step 3: Train the Model
model.fit(X_train, y_train, epochs=10, batch_size=64,  validation_split=0.2, verbose=1)

y_pred_prob = model.predict(X_test)
y_pred = np.argmax(y_pred_prob, axis=1)

# Calculate and print classification report
report = classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, digits=4)
print('Performance Metrics:\n', report)

Epoch 1/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 8s 60ms/step - accuracy: 0.6356 - loss: 0.8535 - val_accuracy: 0.8036 - val_loss: 0.5356
Epoch 2/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 6s 51ms/step - accuracy: 0.8384 - loss: 0.4481 - val_accuracy: 0.8150 - val_loss: 0.5042
Epoch 3/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 41ms/step - accuracy: 0.8947 - loss: 0.2977 - val_accuracy: 0.7965 - val_loss: 0.5594
Epoch 4/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 41ms/step - accuracy: 0.9629 - loss: 0.1397 - val_accuracy: 0.8258 - val_loss: 0.6575
Epoch 5/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 42ms/step - accuracy: 0.9811 - loss: 0.0728 - val_accuracy: 0.7971 - val_loss: 0.6594
Epoch 6/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 42ms/step - accuracy: 0.9890 - loss: 0.0487 - val_accuracy: 0.7645 - val_loss: 0.8196
Epoch 7/10
116/116 ━

# RNN while accounting for imbalanced classes

The overall F1 score drops very slightly.

By applying `class_weight` computed with `compute_class_weight`, the model pays more attention to the minority classes (Negative and Neutral), which may cause it to misclassify some instances of the majority class (Positive). This re-balancing can lower the overall F1 score if the model sacrifices performance on the majority class.
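
To make the re-weighting concrete, the sketch below reproduces the formula that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count)`, in plain Python. It uses the full-dataset counts from `data['sentiment'].value_counts()` as an illustration; the notebook computes the weights on `y_train` only, so the exact values there will differ slightly. The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(counts):
    """Return n_samples / (n_classes * count) for each class."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# Full-dataset counts from data['sentiment'].value_counts()
sentiment_counts = {"Negative": 2441, "Neutral": 1164, "Positive": 7913}
weights = balanced_class_weights(sentiment_counts)

for label, weight in weights.items():
    print(f"{label}: {weight:.4f}")
```

The rarest class (Neutral) receives the largest weight and the majority class (Positive) a weight below 1, which is why errors on Positive reviews become cheaper for the model.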

In [39]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight
import tensorflow as tf
import numpy as np
import random

tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Parameters
vocab_size = 5000         # Limit vocabulary to 5000 words
embedding_dim = 128        # Embedding dimensions for each word
max_sequence_length = 300 # Max number of words in each sequence

# Step 1: Tokenize and Pad the Text
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(data['processed_full_review'])
sequences = tokenizer.texts_to_sequences(data['processed_full_review'])
X_padded = pad_sequences(sequences, maxlen=max_sequence_length)

# Labels
sentiment_dict = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
y = data['sentiment'].map(sentiment_dict).values

X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=42)

# Step 2: Define a Simple RNN Model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length, trainable=True))
model.add(SimpleRNN(64, activation='tanh'))
model.add(Dropout(0.5))  # Add dropout for regularization
model.add(Dense(3, activation='softmax'))   # Output layer for 3 classes

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}

# Step 3: Train the Model
model.fit(X_train, y_train, epochs=10, batch_size=64,  validation_split=0.2, verbose=1, class_weight=class_weights_dict)

y_pred_prob = model.predict(X_test)
y_pred = np.argmax(y_pred_prob, axis=1)

# Calculate and print classification report
report = classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, digits=4)
print('Performance Metrics:\n', report)

Epoch 1/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 8s 49ms/step - accuracy: 0.4260 - loss: 1.0870 - val_accuracy: 0.6772 - val_loss: 0.8115
Epoch 2/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 47ms/step - accuracy: 0.7761 - loss: 0.7477 - val_accuracy: 0.7575 - val_loss: 0.5963
Epoch 3/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 46ms/step - accuracy: 0.8885 - loss: 0.3859 - val_accuracy: 0.7596 - val_loss: 0.6410
Epoch 4/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 46ms/step - accuracy: 0.9447 - loss: 0.1741 - val_accuracy: 0.8074 - val_loss: 0.5949
Epoch 5/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 46ms/step - accuracy: 0.9728 - loss: 0.0915 - val_accuracy: 0.8101 - val_loss: 0.6316
Epoch 6/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 46ms/step - accuracy: 0.9858 - loss: 0.0437 - val_accuracy: 0.7862 - val_loss: 0.7083
Epoch 7/10
116/116 ━

# RNN + Count Vectoriser

### Loss of Sequential Information
Performance is poor because RNNs are not well suited to the bag-of-words representation generated by `CountVectorizer`. `CountVectorizer` treats each document as an unordered collection of words, so each word is represented only by its count, not by its position in the text. RNNs are designed for ordered sequences in which the position and context of words matter; without word order, the RNN cannot capture dependencies between words over time.

### Sparse, non-contextual input
`CountVectorizer` produces a sparse representation in which each word is an independent feature based on its frequency. The counts encode no semantic or contextual relationship between words. RNNs perform best with dense, continuous inputs that represent meaningful relationships between words, which are typically obtained with word embeddings.
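
The loss of word order can be illustrated without any library beyond the standard one. In this minimal sketch, `bag_of_words` is a hypothetical stand-in for what `CountVectorizer` records per document, and the two toy reviews are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, discarding word order."""
    return Counter(text.split())


review_a = "food good not bad"
review_b = "food bad not good"

# Different word order, identical counts
same_counts = bag_of_words(review_a) == bag_of_words(review_b)
print(same_counts)
```

Two reviews with opposite meanings end up with the same count vector, so a model fed only counts has nothing sequential to learn from.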

In [40]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import SimpleRNN, Dense, Dropout
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight
import tensorflow as tf
import numpy as np
import random

tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Parameters
max_features = 5000       # Limit vocabulary to 5000 words
max_sequence_length = 300 # Max number of words in each sequence

# Step 1: Text Vectorization using CountVectorizer
vectorizer = CountVectorizer(max_features=max_features)
X_counts = vectorizer.fit_transform(data['processed_full_review']).toarray()

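# Convert counts to fixed-length inputs.
# Note: with maxlen=300, pad_sequences keeps only the last 300 of the 5000
# count columns per review (default truncating='pre')
X_padded = pad_sequences(X_counts, maxlen=max_sequence_length)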

# Labels
sentiment_dict = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
y = data['sentiment'].map(sentiment_dict).values

X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=42)

# Reshape input to 3D for RNN (samples, timesteps, features)
X_train_reshaped = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test_reshaped = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

model = Sequential()
model.add(SimpleRNN(64, activation='tanh', input_shape=(X_train_reshaped.shape[1], 1)))  # Input shape adjusted
model.add(Dropout(0.5))  # Dropout for regularization
model.add(Dense(3, activation='softmax'))  # Output layer for 3 classes

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}

model.fit(X_train_reshaped, y_train, epochs=10, batch_size=64,  validation_split=0.2, verbose=1, class_weight=class_weights_dict)

y_pred_prob = model.predict(X_test_reshaped)
y_pred = np.argmax(y_pred_prob, axis=1)

# Calculate and print classification report
report = classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, digits=4)
print('Performance Metrics:\n', report)

Epoch 1/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 6s 37ms/step - accuracy: 0.3968 - loss: 1.1084 - val_accuracy: 0.3793 - val_loss: 1.1062
Epoch 2/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 4s 33ms/step - accuracy: 0.3866 - loss: 1.0804 - val_accuracy: 0.2870 - val_loss: 1.1182
Epoch 3/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 4s 33ms/step - accuracy: 0.3607 - loss: 1.0819 - val_accuracy: 0.4742 - val_loss: 1.0924
Epoch 4/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 4s 35ms/step - accuracy: 0.3930 - loss: 1.0843 - val_accuracy: 0.5648 - val_loss: 1.0602
Epoch 5/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 4s 35ms/step - accuracy: 0.3768 - loss: 1.0897 - val_accuracy: 0.4954 - val_loss: 1.0961
Epoch 6/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 4s 34ms/step - accuracy: 0.3963 - loss: 1.0876 - val_accuracy: 0.2241 - val_loss: 1.2186
Epoch 7/10
116/116 ━

# RNN + Count Vectoriser + Conversion to pseudo-sequences with word indices

Performance is better than the basic RNN.

Here, we transform the `CountVectorizer` output into integer sequences, which are compatible with the embedding layer. Each review becomes the list of vocabulary indices of the distinct words it contains.

`CountVectorizer` works better here because sentiment analysis often hinges more on the presence of certain key words than on the strict order of words in a sequence. Unlike NLP tasks where the exact sequence of words matters (e.g. translation or grammar correction), sentiment analysis can often succeed with just the occurrence or frequency of these key words. `CountVectorizer` captures this by creating a bag-of-words representation that prioritises word presence and frequency, which is often enough for sentiment detection.
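
The conversion to pseudo-sequences can be sketched in plain Python. The function name `to_pseudo_sequence` and the toy vocabulary are hypothetical; the sketch assumes, as `CountVectorizer` does, that vocabulary indices follow alphabetical order, and that the non-zero columns of each count row are read in ascending index order.

```python
def to_pseudo_sequence(text, vocabulary):
    """Return 1-based vocabulary indices of the distinct known words, ascending."""
    present = {vocabulary[word] for word in text.split() if word in vocabulary}
    return [idx + 1 for idx in sorted(present)]  # 0 stays reserved for padding


# Toy vocabulary with alphabetical indices (illustrative only)
toy_vocab = {"bad": 0, "food": 1, "good": 2, "seat": 3}

seq = to_pseudo_sequence("good food good seat", toy_vocab)
print(seq)
```

The resulting order reflects the vocabulary index rather than the position of each word in the review, and a repeated word such as "good" appears only once.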



In [42]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight
import tensorflow as tf
import numpy as np
import random

tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Parameters
max_features = 5000       # Limit vocabulary to 5000 words
embedding_dim = 128        # Embedding dimensions for each word
max_sequence_length = 300 # Max number of words in each sequence

# Step 1: Text Vectorization using CountVectorizer
vectorizer = CountVectorizer(max_features=max_features)
X_counts = vectorizer.fit_transform(data['processed_full_review'])
word_index = vectorizer.vocabulary_

# Inverse vocabulary mapping for sequences creation
index_to_word = {i: word for word, i in word_index.items()}

def counts_to_sequences(X_counts):
    sequences = []
    for i in range(X_counts.shape[0]):
        indices = X_counts[i].nonzero()[1]
        words = [index_to_word[idx] for idx in indices]
        seq = [word_index[word] + 1 for word in words]  # +1 because 0 is reserved for padding
        sequences.append(seq)
    return sequences

sequences = counts_to_sequences(X_counts)
X_padded = pad_sequences(sequences, maxlen=max_sequence_length)

# Labels
sentiment_dict = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
y = data['sentiment'].map(sentiment_dict).values

X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=42)

model = Sequential()
model.add(Embedding(input_dim=len(word_index) + 1, output_dim=embedding_dim, input_length=max_sequence_length, trainable=True))
model.add(SimpleRNN(64, activation='tanh'))  # Input shape is inferred from the Embedding layer
model.add(Dropout(0.5))  # Dropout for regularization
model.add(Dense(3, activation='softmax'))  # Output layer for 3 classes

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}

model.fit(X_train, y_train, epochs=10, batch_size=64,  validation_split=0.2, verbose=1, class_weight=class_weights_dict)

y_pred_prob = model.predict(X_test)
y_pred = np.argmax(y_pred_prob, axis=1)

# Calculate and print classification report
report = classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, digits=4)
print('Performance Metrics:\n', report)

Epoch 1/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 6s 43ms/step - accuracy: 0.4720 - loss: 1.0589 - val_accuracy: 0.7911 - val_loss: 0.5905
Epoch 2/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 42ms/step - accuracy: 0.8687 - loss: 0.5639 - val_accuracy: 0.7629 - val_loss: 0.5588
Epoch 3/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 43ms/step - accuracy: 0.9391 - loss: 0.2436 - val_accuracy: 0.7466 - val_loss: 0.6518
Epoch 4/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 46ms/step - accuracy: 0.9750 - loss: 0.0925 - val_accuracy: 0.7982 - val_loss: 0.6056
Epoch 5/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 6s 53ms/step - accuracy: 0.9902 - loss: 0.0385 - val_accuracy: 0.8041 - val_loss: 0.6644
Epoch 6/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 6s 52ms/step - accuracy: 0.9983 - loss: 0.0145 - val_accuracy: 0.8041 - val_loss: 0.7157
Epoch 7/10
116/116 ━

# RNN + Within model trained Word2Vec

`Word2Vec` performs worse than `CountVectorizer`.

Because our dataset has only about 11,500 rows, Word2Vec embeddings trained on it might lack the depth needed for nuanced sentiment patterns, particularly without pre-training on a larger corpus. If the Word2Vec embeddings do not generalise well or are trained on insufficient context, the RNN might not capture subtle sentiment signals in the text, which can degrade model performance. In contrast, `CountVectorizer` builds a fixed vocabulary of words based on frequency and does not need to learn semantic relationships among words, making it robust when the training corpus is small.


In [43]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight
from gensim.models import Word2Vec
import tensorflow as tf
import numpy as np
import random

tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Parameters
vocab_size = 5000         # Limit vocabulary to 5000 words
embedding_dim = 128        # Embedding dimensions for each word
max_sequence_length = 300 # Max number of words in each sequence

# Step 1: Tokenize and Pad the Text
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(data['processed_full_review'])
sequences = tokenizer.texts_to_sequences(data['processed_full_review'])
X_padded = pad_sequences(sequences, maxlen=max_sequence_length)

# Labels
sentiment_dict = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
y = data['sentiment'].map(sentiment_dict).values

sentences = [text.split() for text in data['processed_full_review']]
word2vec_model = Word2Vec(sentences, vector_size=embedding_dim, window=5, min_count=1, workers=4, sg=1)

# Create Embedding Matrix from Trained Word2Vec Model
embedding_matrix = np.zeros((vocab_size, embedding_dim))
word_index = tokenizer.word_index

for word, i in word_index.items():
    if i < vocab_size:
        # Retrieve the embedding vector for the word
        embedding_vector = word2vec_model.wv[word] if word in word2vec_model.wv else None
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=42)

# Step 2: Define a Simple RNN Model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=True))
model.add(SimpleRNN(64, activation='tanh'))
model.add(Dropout(0.5))  # Add dropout for regularization
model.add(Dense(3, activation='softmax'))   # Output layer for 3 classes

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}

# Step 3: Train the Model
model.fit(X_train, y_train, epochs=10, batch_size=64,  validation_split=0.2, verbose=1, class_weight=class_weights_dict)

y_pred_prob = model.predict(X_test)
y_pred = np.argmax(y_pred_prob, axis=1)

# Calculate and print classification report
report = classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, digits=4)
print('Performance Metrics:\n', report)

Epoch 1/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 8s 50ms/step - accuracy: 0.4982 - loss: 1.0830 - val_accuracy: 0.7189 - val_loss: 0.6627
Epoch 2/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 45ms/step - accuracy: 0.7076 - loss: 0.7795 - val_accuracy: 0.7699 - val_loss: 0.5509
Epoch 3/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 44ms/step - accuracy: 0.7935 - loss: 0.6131 - val_accuracy: 0.7802 - val_loss: 0.5531
Epoch 4/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 46ms/step - accuracy: 0.8305 - loss: 0.5432 - val_accuracy: 0.7694 - val_loss: 0.5675
Epoch 5/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 46ms/step - accuracy: 0.8801 - loss: 0.3689 - val_accuracy: 0.5659 - val_loss: 0.9977
Epoch 6/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 46ms/step - accuracy: 0.8229 - loss: 0.3752 - val_accuracy: 0.7482 - val_loss: 0.6872
Epoch 7/10
116/116 ━

# RNN + Pre-trained Word2Vec

The pre-trained Word2Vec embeddings perform worse than the Word2Vec embeddings trained within the model.

Google's Word2Vec embeddings were trained on the very general Google News dataset, which may not align well with the context or vocabulary of our specific dataset, while custom embeddings trained directly on our dataset are tailored to the specific language and sentiment patterns within it.

Since our dataset contains a lot of domain-specific terms and sentiment-heavy words that are less common in general news (like "amazing", "terrible", "refund"), pre-trained embeddings may not capture these terms accurately. Our review text is also stemmed (e.g. "airlin", "singapor"), so many tokens likely have no entry in the pre-trained vocabulary and keep an all-zero vector, and because the embedding layer is frozen (`trainable=False`) these vectors cannot adapt during training. Within-model embeddings can adapt specifically to the words and nuances in our dataset.
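
One way to check how well a pre-trained vocabulary matches this dataset is to measure the share of distinct tokens that actually receive a vector. The sketch below is a hedged illustration: `vocab_coverage` is a hypothetical helper, `pretrained_vocab` is a made-up stand-in for the Google News vocabulary, and the stemmed tokens mirror the style of `processed_full_review` shown by `data.head()`.

```python
def vocab_coverage(tokens, known_words):
    """Fraction of distinct tokens that have a pre-trained vector."""
    distinct = set(tokens)
    if not distinct:
        return 0.0
    return len(distinct & known_words) / len(distinct)


# Hypothetical pre-trained vocabulary containing only unstemmed word forms
pretrained_vocab = {"airline", "singapore", "seat", "food", "refund"}
stemmed_tokens = ["airlin", "singapor", "seat", "food", "refund"]

coverage = vocab_coverage(stemmed_tokens, pretrained_vocab)
print(f"coverage: {coverage:.0%}")
```

With the real embeddings, the same idea would apply by testing each word in `tokenizer.word_index` for membership in the loaded `KeyedVectors` object, as the notebook already does when filling `embedding_matrix`.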

In [44]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight
from gensim.models import KeyedVectors
import tensorflow as tf
import numpy as np
import random

tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Parameters
vocab_size = 5000         # Limit vocabulary to 5000 words
embedding_dim = 300        # Embedding dimensions for each word
max_sequence_length = 300 # Max number of words in each sequence

# Step 1: Tokenize and Pad the Text
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(data['processed_full_review'])
sequences = tokenizer.texts_to_sequences(data['processed_full_review'])
X_padded = pad_sequences(sequences, maxlen=max_sequence_length)

# Labels
sentiment_dict = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
y = data['sentiment'].map(sentiment_dict).values

word2vec_model = KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)

# Create Embedding Matrix with Pre-trained Word2Vec
embedding_matrix = np.zeros((vocab_size, embedding_dim))
word_index = tokenizer.word_index

for word, i in word_index.items():
    if i < vocab_size:
        # Retrieve the embedding vector for the word
        if word in word2vec_model:
            embedding_matrix[i] = word2vec_model[word]

X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=42)

# Step 2: Define a Simple RNN Model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=False))
model.add(SimpleRNN(64, activation='tanh'))
model.add(Dropout(0.5))  # Add dropout for regularization
model.add(Dense(3, activation='softmax'))   # Output layer for 3 classes

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}

# Step 3: Train the Model
model.fit(X_train, y_train, epochs=10, batch_size=64,  validation_split=0.2, verbose=1, class_weight=class_weights_dict)

y_pred_prob = model.predict(X_test)
y_pred = np.argmax(y_pred_prob, axis=1)

# Calculate and print classification report
report = classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, digits=4)
print('Performance Metrics:\n', report)

Epoch 1/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 6s 41ms/step - accuracy: 0.4276 - loss: 1.1304 - val_accuracy: 0.7184 - val_loss: 0.6579
Epoch 2/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 40ms/step - accuracy: 0.6504 - loss: 0.8589 - val_accuracy: 0.6994 - val_loss: 0.7364
Epoch 3/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 40ms/step - accuracy: 0.6428 - loss: 0.8659 - val_accuracy: 0.6804 - val_loss: 0.9888
Epoch 4/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 39ms/step - accuracy: 0.5904 - loss: 1.0298 - val_accuracy: 0.7303 - val_loss: 0.6904
Epoch 5/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 39ms/step - accuracy: 0.7149 - loss: 0.7646 - val_accuracy: 0.7656 - val_loss: 0.5968
Epoch 6/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 5s 39ms/step - accuracy: 0.7062 - loss: 0.7707 - val_accuracy: 0.7699 - val_loss: 0.5865
Epoch 7/10
116/116 ━

# RNN + Count Vectoriser + Conversion to pseudo-sequences with word indices + GridSearch CV

Since RNN + `CountVectorizer` with conversion to pseudo-sequences performs best so far, we run `GridSearchCV` on this model to select the best combination of hyperparameters.
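
Before launching the search, it helps to estimate its cost. The sketch below counts the hyperparameter combinations using the grid values and `cv=3` from the cell that follows; it is a plain-Python estimate and does not call scikit-learn.

```python
from itertools import product

# Grid values copied from the GridSearchCV cell
param_grid = {
    "embedding_dim": [64, 128, 300],
    "rnn_units": [32, 64, 128],
    "dropout_rate": [0.3, 0.5, 0.7],
    "batch_size": [32, 64, 128],
    "epochs": [5, 10],
}
cv_folds = 3

combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * cv_folds

print(len(combinations), "combinations,", total_fits, "model fits")
```

Each fit trains an RNN for 5 or 10 epochs, and `GridSearchCV` refits the best configuration once more by default, so the full search is expensive; a smaller grid may be a reasonable starting point.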

In [51]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout
from scikeras.wrappers import KerasClassifier
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight
import tensorflow as tf
import numpy as np
import random

tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)


# Parameters
max_features = 5000       # Limit vocabulary to 5000 words
embedding_dim = 128        # Embedding dimensions for each word
max_sequence_length = 300 # Max number of words in each sequence

# Step 1: Text Vectorization using CountVectorizer
vectorizer = CountVectorizer(max_features=max_features)
X_counts = vectorizer.fit_transform(data['processed_full_review'])
word_index = vectorizer.vocabulary_

# Inverse vocabulary mapping for sequences creation
index_to_word = {i: word for word, i in word_index.items()}

def counts_to_sequences(X_counts):
    sequences = []
    for i in range(X_counts.shape[0]):
        indices = X_counts[i].nonzero()[1]
        words = [index_to_word[idx] for idx in indices]
        seq = [word_index[word] + 1 for word in words]  # +1 because 0 is reserved for padding
        sequences.append(seq)
    return sequences

sequences = counts_to_sequences(X_counts)
X_padded = pad_sequences(sequences, maxlen=max_sequence_length)

# Labels
sentiment_dict = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
y = data['sentiment'].map(sentiment_dict).values

X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=42)

class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}

def build_model(embedding_dim=128, rnn_units=64, dropout_rate=0.5):
    model = Sequential()
    model.add(Embedding(input_dim=len(word_index) + 1, output_dim=embedding_dim, input_length=max_sequence_length, trainable=True))
    model.add(SimpleRNN(rnn_units, activation='tanh'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(3, activation='softmax'))
    model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

# scikeras takes the model-building function via `model`; `build_fn` is deprecated
model = KerasClassifier(model=build_model, verbose=1, class_weight=class_weights_dict)

# Define the parameter grid to search
# Arguments of build_model are routed with the `model__` prefix
param_grid = {
    'model__embedding_dim': [64, 128, 300],      # Different embedding dimensions
    'model__rnn_units': [32, 64, 128],           # Number of units in SimpleRNN layer
    'model__dropout_rate': [0.3, 0.5, 0.7],      # Dropout rates
    'batch_size': [32, 64, 128],                 # Batch sizes
    'epochs': [5, 10]                            # Number of epochs
}

# Instantiate GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=3)

# Run GridSearchCV
grid_result = grid.fit(X_train, y_train)

# Print the best parameters and the best score
print("Best parameters found: ", grid_result.best_params_)
print("Best cross-validation accuracy: ", grid_result.best_score_)

# Evaluate on the test set
best_model = grid_result.best_estimator_
# KerasClassifier.predict returns class labels directly, so no argmax is needed
y_pred = best_model.predict(X_test)

# Calculate and print classification report
report = classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, digits=4)
print('Performance Metrics:\n', report)

ImportError: cannot import name '_deprecate_Xt_in_inverse_transform' from 'sklearn.utils.deprecation' (c:\Users\Redbu\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\deprecation.py)