# Enhancing Singapore Airlines' Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**



## Singapore Airlines Customer Reviews Dataset Information

The [Singapore Airlines Customer Reviews Dataset](https://www.kaggle.com/datasets/kanchana1990/singapore-airlines-reviews) aggregates 10,000 anonymized customer reviews, providing a broad perspective on the passenger experience with Singapore Airlines. 

The columns are listed below:
- **`published_date`**: Date and time of review publication.
- **`published_platform`**: Platform where the review was posted.
- **`rating`**: Customer satisfaction rating, from 1 (lowest) to 5 (highest).
- **`type`**: Specifies the content as a review.
- **`text`**: Detailed customer feedback.
- **`title`**: Summary of the review.
- **`helpful_votes`**: Number of users finding the review helpful.

## Additional web scraping of online reviews

During our EDA, we noticed two main trends in the distribution of our dataset:
1. Fewer than 10% of the reviews were published between 2022 and 2024, making it hard to capture recent trends in sentiment.
2. Most of the reviews were highly positive. This may simply reflect that SIA receives mostly positive reviews, but we wanted more negative reviews to improve the robustness of our model.
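
Both trends can be checked with a few lines of pandas. The sketch below is illustrative only: the `reviews` DataFrame is a small made-up stand-in for the Kaggle data, and only the column names `published_date` and `rating` come from the dataset description above.

```python
import pandas as pd

# Hypothetical stand-in for the reviews table (values are made up);
# the column names follow the dataset description above.
reviews = pd.DataFrame({
    "published_date": pd.to_datetime([
        "2018-05-01", "2019-03-12", "2019-07-30", "2019-11-02",
        "2020-01-15", "2018-09-09", "2019-02-20", "2019-06-18",
        "2023-08-21", "2019-12-25",
    ]),
    "rating": [5, 5, 4, 5, 1, 5, 4, 3, 2, 5],
})

# Trend 1: share of reviews published from 2022 to 2024
year = reviews["published_date"].dt.year
recent_share = year.between(2022, 2024).mean()

# Trend 2: share of reviews at each rating level, ordered from 1 to 5
rating_share = reviews["rating"].value_counts(normalize=True).sort_index()

print(f"Share of reviews from 2022-2024: {recent_share:.1%}")
print(rating_share)
```

On the real data, `recent_share` and `rating_share` would quantify how under-represented recent and negative reviews are.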

### TripAdvisor

We scraped additional airline reviews from [TripAdvisor](https://www.tripadvisor.com.sg/Airline_Review-d8729151-Reviews-Singapore-Airlines), specifically for the years 2022 to 2024.

The columns are listed below:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 5)


### Skytrax

We also scraped reviews from [Skytrax](https://www.airlinequality.com/airline-reviews/singapore-airlines/?sortby=post_date%3ADesc&pagesize=100), another source of online airline reviews.

The columns are listed below:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 10)
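
Because TripAdvisor rates on a 1 to 5 scale and Skytrax on a 1 to 10 scale, the ratings need a common scale before the sources can be combined. The notebook does not state which rule was used, so the following is only a sketch under assumed rules: `to_five_point` and `to_sentiment` are hypothetical helpers, and the thresholds (1-2 Negative, 3 Neutral, 4-5 Positive) are an assumption rather than the authors' documented mapping.

```python
def to_five_point(rating: int, scale_max: int) -> int:
    """Map a rating on a 1..scale_max scale onto 1..5 (assumed rule: round up)."""
    if not 1 <= rating <= scale_max:
        raise ValueError(f"rating {rating} is outside 1..{scale_max}")
    # Integer ceiling of rating * 5 / scale_max, avoiding float rounding
    return (rating * 5 + scale_max - 1) // scale_max


def to_sentiment(rating_5: int) -> str:
    """Assumed thresholds: 1-2 Negative, 3 Neutral, 4-5 Positive."""
    if rating_5 <= 2:
        return "Negative"
    if rating_5 == 3:
        return "Neutral"
    return "Positive"


# A Skytrax rating (1 to 10) and a TripAdvisor rating (1 to 5)
for rating, scale_max in [(8, 10), (5, 10), (2, 5)]:
    five = to_five_point(rating, scale_max)
    print(rating, scale_max, five, to_sentiment(five))
```

Rounding up is one of several reasonable choices; any rule works as long as it is applied consistently across sources.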

## Importing Libraries

Run the cell below to install the dependencies for this notebook.

In [1]:
!pip3 install -r requirements.txt



In [1]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime 

# Statistical functions
from scipy.stats import zscore

# Text Preprocessing and NLP
import nltk
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords

# WordNet lexical database (used for lemmatization)
from nltk.corpus import wordnet

# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer


# For generating n-grams
from nltk.util import ngrams
from collections import Counter

## Data Preparation (Loading CSV)

Load the preprocessed reviews from `final_df.csv` into a pandas DataFrame `data`.

In [2]:
data = pd.read_csv('final_df.csv')

In [3]:
data.head()

Unnamed: 0,year,month,sentiment,processed_full_review
0,2024,3,Neutral,ok use airlin go singapor london heathrow issu...
1,2024,3,Negative,don give money book paid receiv email confirm ...
2,2024,3,Positive,best airlin world best airlin world seat food ...
3,2024,3,Negative,premium economi seat singapor airlin not worth...
4,2024,3,Negative,imposs get promis refund book flight full mont...


In [4]:
data['sentiment'].value_counts()

sentiment
Positive    7913
Negative    2441
Neutral     1164
Name: count, dtype: int64

In [5]:
data['year'].value_counts()

year
2019    5129
2018    2596
2022    1184
2023    1111
2020     888
2024     514
2021      96
Name: count, dtype: int64

## Basic LSTM


In [11]:
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
import tensorflow as tf
import random

tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Load the dataset
data = pd.read_csv('final_df.csv')

# Preprocess text and labels
texts = data['processed_full_review'].astype(str)
labels = data['sentiment']

# Encode labels (e.g., Positive=2, Negative=0, Neutral=1)
label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)

# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=10000)  # Adjust the vocabulary size as needed
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
maxlen = 100  # Set maximum sequence length, adjust as needed
X = pad_sequences(sequences, maxlen=maxlen)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels_encoded, test_size=0.2, random_state=42)

# Define the LSTM model
model = Sequential()
# input_length is omitted: it is deprecated in recent Keras versions and not needed here
model.add(Embedding(input_dim=10000, output_dim=128))  # Adjust output_dim as needed
model.add(LSTM(64, return_sequences=True))
model.add(Dropout(0.5))
model.add(LSTM(64, return_sequences=False))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))  # 3 classes for Positive, Negative, Neutral

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.2)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.4f}")

# Generate predictions for the test set
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_classes, target_names=label_encoder.classes_, digits=4))


Epoch 1/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 22s 141ms/step - accuracy: 0.6747 - loss: 0.8129 - val_accuracy: 0.8188 - val_loss: 0.4700
Epoch 2/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 17s 147ms/step - accuracy: 0.8518 - loss: 0.4210 - val_accuracy: 0.8275 - val_loss: 0.4556
Epoch 3/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 18s 152ms/step - accuracy: 0.8761 - loss: 0.3300 - val_accuracy: 0.8388 - val_loss: 0.4805
Epoch 4/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 17s 151ms/step - accuracy: 0.9064 - loss: 0.2525 - val_accuracy: 0.8383 - val_loss: 0.5099
Epoch 5/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 17s 147ms/step - accuracy: 0.9183 - loss: 0.2355 - val_accuracy: 0.8426 - val_loss: 0.6264
Epoch 6/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 17s 149ms/step - accuracy: 0.9450 - loss: 0.1709 - val_accuracy: 0.8367 - val_loss: 0.7412
Epoch 7/10
116/11
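
In the six complete epochs logged above, validation loss is lowest early on and then rises while training accuracy keeps climbing, a pattern that usually indicates overfitting. The sketch below only re-reads numbers copied from that log; the variable names are illustrative.

```python
# Metrics copied from the six complete epochs in the training log above
train_acc = [0.6747, 0.8518, 0.8761, 0.9064, 0.9183, 0.9450]
val_acc = [0.8188, 0.8275, 0.8388, 0.8383, 0.8426, 0.8367]
val_loss = [0.4700, 0.4556, 0.4805, 0.5099, 0.6264, 0.7412]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Gap between training and validation accuracy per epoch
gap = [round(t - v, 4) for t, v in zip(train_acc, val_acc)]

print(f"Lowest validation loss at epoch {best_epoch}")
print("Train/validation accuracy gap per epoch:", gap)
```

One common response, not used in this notebook, is the Keras `EarlyStopping` callback monitoring `val_loss` with `restore_best_weights=True`, which stops training once validation loss stops improving.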

## LSTM with Hashing Vectorization

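The Hashing Vectorizer approach trains much faster than the Tokenizer and Embedding approach above: in the logs below, each training step takes about 15 ms, compared with roughly 150 ms for the Basic LSTM.

`HashingVectorizer` transforms text directly into fixed-length numerical vectors by hashing each term and mapping it to one of a specified number of features. This eliminates the need to build a vocabulary or to convert tokens into embeddings. The `Tokenizer` approach, by contrast, first builds a vocabulary, then converts the text into sequences of integers, which an `Embedding` layer turns into dense vectors. This two-step process is more computationally intensive than direct hashing. Note also that each hashed review is fed to the LSTM as a single timestep, whereas the padded sequences above have 100 timesteps, which accounts for part of the speed-up.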
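
The contrast between a learned vocabulary and stateless hashing can be sketched with scikit-learn alone. Here `CountVectorizer` stands in for the vocabulary-building step (it is not the Keras `Tokenizer` used above), the toy texts are made up in the stemmed style of `processed_full_review`, and `n_features=16` is chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Made-up toy reviews in the stemmed style of `processed_full_review`
train_texts = ["best airlin world seat food", "imposs get refund book flight"]
new_texts = ["crew friendli seat comfort"]

# Vocabulary-based: the vocabulary must be learned with fit() first,
# and words not seen during fit() are dropped at transform time.
count_vec = CountVectorizer()
count_vec.fit(train_texts)
X_count = count_vec.transform(new_texts)

# Hashing-based: stateless, so transform() works without fit();
# each token is mapped to one of n_features columns by a hash function.
hash_vec = HashingVectorizer(n_features=16, alternate_sign=False)
X_hash = hash_vec.transform(new_texts)

print("Vocabulary size:", len(count_vec.vocabulary_))
print("Count matrix shape and non-zeros:", X_count.shape, X_count.nnz)
print("Hash matrix shape and non-zeros:", X_hash.shape, X_hash.nnz)
```

The trade-off is that hashing cannot be inverted back to words, and distinct words may collide in the same column when `n_features` is small.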

In [15]:
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics import classification_report
import tensorflow as tf
import random

tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Load the dataset
data = pd.read_csv('final_df.csv')

# Preprocess text and labels
texts = data['processed_full_review'].astype(str)
labels = data['sentiment']

# Encode labels (e.g., Positive=2, Negative=0, Neutral=1)
label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)

# Use Hashing Vectorizer
vectorizer = HashingVectorizer(n_features=5000, alternate_sign=False)  # Set n_features as needed
X = vectorizer.transform(texts).toarray()

# Reshape to 3D array as expected by LSTM input (samples, timesteps, features)
X = np.reshape(X, (X.shape[0], 1, X.shape[1]))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels_encoded, test_size=0.2, random_state=42)

# Define the LSTM model
model = Sequential()
model.add(LSTM(64, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.5))
model.add(LSTM(64, return_sequences=False))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))  # 3 classes for Positive, Negative, Neutral

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.2)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.4f}")

# Generate predictions for the test set
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_classes, target_names=label_encoder.classes_, digits=4))


Epoch 1/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 7s 20ms/step - accuracy: 0.6778 - loss: 0.9333 - val_accuracy: 0.7830 - val_loss: 0.5660
Epoch 2/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 2s 14ms/step - accuracy: 0.8307 - loss: 0.4873 - val_accuracy: 0.8410 - val_loss: 0.4140
Epoch 3/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - accuracy: 0.8619 - loss: 0.3743 - val_accuracy: 0.8410 - val_loss: 0.4039
Epoch 4/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - accuracy: 0.8720 - loss: 0.3322 - val_accuracy: 0.8394 - val_loss: 0.4103
Epoch 5/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - accuracy: 0.8851 - loss: 0.2973 - val_accuracy: 0.8367 - val_loss: 0.4216
Epoch 6/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 2s 14ms/step - accuracy: 0.8957 - loss: 0.2715 - val_accuracy: 0.8361 - val_loss: 0.4366
Epoch 7/10
116/116

## LSTM + Hashing Vectorizer + GridSearch CV
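
The grid in the cell below has five hyperparameters with two candidate values each, and every combination is cross-validated with `cv=3`. Before running a search like this, it helps to estimate how many models will be trained. The sketch repeats the same grid and counts the combinations with scikit-learn's `ParameterGrid`; `cv_folds`, `n_candidates`, and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search cell of this section
param_grid = {
    "model__units": [32, 64],
    "model__dropout_rate": [0.3, 0.5],
    "optimizer": ["adam", "rmsprop"],
    "epochs": [5, 10],
    "batch_size": [16, 32],
}
cv_folds = 3  # matches cv=3 in GridSearchCV

# Number of distinct hyperparameter combinations
n_candidates = len(ParameterGrid(param_grid))

# Each combination is trained once per fold
n_fits = n_candidates * cv_folds

print(f"{n_candidates} combinations x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` trains one more model on the full training set using the best combination.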

In [None]:
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from scikeras.wrappers import KerasClassifier  # Updated import
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics import classification_report
import tensorflow as tf
import random

tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Load the dataset
data = pd.read_csv('final_df.csv')

# Preprocess text and labels
texts = data['processed_full_review'].astype(str)
labels = data['sentiment']

# Encode labels
label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)

# Use Hashing Vectorizer
vectorizer = HashingVectorizer(n_features=5000, alternate_sign=False)  # Set n_features as needed
X = vectorizer.transform(texts).toarray()
X = np.reshape(X, (X.shape[0], 1, X.shape[1]))  # Reshape for LSTM

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels_encoded, test_size=0.2, random_state=42)

# Define a function to create the model (for use in KerasClassifier)
def create_model(units=64, dropout_rate=0.5, optimizer='adam'):
    model = Sequential()
    model.add(LSTM(units, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
    model.add(Dropout(dropout_rate))
    model.add(LSTM(units, return_sequences=False))
    model.add(Dense(units // 2, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(3, activation='softmax'))  # 3 classes for Positive, Negative, Neutral
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

# Wrap the model using KerasClassifier from scikeras
model = KerasClassifier(model=create_model, verbose=0)  # scikeras syntax

# Define the grid of hyperparameters
param_grid = {
    'model__units': [32, 64],
    'model__dropout_rate': [0.3, 0.5],
    'optimizer': ['adam', 'rmsprop'],
    'epochs': [5, 10],  # Reduced for demo; increase as needed
    'batch_size': [16, 32]
}

# Set up GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)

# Perform grid search
grid_result = grid.fit(X_train, y_train)

# Display the best parameters and accuracy
print("Best Parameters:", grid_result.best_params_)
print("Best Score:", grid_result.best_score_)

# Evaluate the best model on the test set
best_model = grid_result.best_estimator_
y_pred = best_model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_, digits=4))


Best Parameters: {'batch_size': 32, 'epochs': 5, 'model__dropout_rate': 0.5, 'model__units': 32, 'optimizer': 'rmsprop'}
Best Score: 0.8505538553425414

Classification Report:
              precision    recall  f1-score   support

    Negative     0.7003    0.8851    0.7820       470
     Neutral     0.3462    0.0789    0.1286       228
    Positive     0.9240    0.9539    0.9387      1606

    accuracy                         0.8533      2304
   macro avg     0.6568    0.6393    0.6164      2304
weighted avg     0.8212    0.8533    0.8266      2304
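
The two average rows in the report can be reproduced from the per-class rows, which makes it clear why they differ. The sketch below uses only numbers copied from the report above; `macro` and `weighted` are illustrative helper names.

```python
# Per-class precision, recall and support copied from the report above
report = {
    "Negative": {"precision": 0.7003, "recall": 0.8851, "support": 470},
    "Neutral": {"precision": 0.3462, "recall": 0.0789, "support": 228},
    "Positive": {"precision": 0.9240, "recall": 0.9539, "support": 1606},
}
total = sum(row["support"] for row in report.values())


def macro(metric: str) -> float:
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted(metric: str) -> float:
    """Mean over classes weighted by the number of test samples per class."""
    return sum(row[metric] * row["support"] for row in report.values()) / total


for metric in ("precision", "recall"):
    print(f"{metric}: macro={macro(metric):.4f} weighted={weighted(metric):.4f}")
```

The macro averages are pulled down by the Neutral class, whose recall is below 0.08, while the weighted averages stay high because Positive reviews make up most of the 2,304 test samples.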

