# Enhancing Singapore Airlines' Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**



## Singapore Airlines Customer Reviews Dataset Information

The [Singapore Airlines Customer Reviews Dataset](https://www.kaggle.com/datasets/kanchana1990/singapore-airlines-reviews) aggregates 10,000 anonymized customer reviews, providing a broad perspective on the passenger experience with Singapore Airlines. 

The dataset contains the following columns:
- **`published_date`**: Date and time of review publication.
- **`published_platform`**: Platform where the review was posted.
- **`rating`**: Customer satisfaction rating, from 1 (lowest) to 5 (highest).
- **`type`**: Specifies the content as a review.
- **`text`**: Detailed customer feedback.
- **`title`**: Summary of the review.
- **`helpful_votes`**: Number of users finding the review helpful.

## Additional web scraping of online reviews

During our EDA, we noticed two main trends in the distribution of our dataset:
1. Fewer than 10% of the reviews were published between 2022 and 2024, making it hard to capture recent trends in sentiment.
2. Most of the reviews were highly positive. This may simply reflect that SIA receives mostly positive reviews, but we wanted more negative reviews to improve the robustness of our model.

### TripAdvisor

We scraped additional airline reviews from [TripAdvisor](https://www.tripadvisor.com.sg/Airline_Review-d8729151-Reviews-Singapore-Airlines), specifically for the years 2022 to 2024.

The scraped data contains the following columns:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 5)


### Skytrax

We also scraped reviews from [Skytrax](https://www.airlinequality.com/airline-reviews/singapore-airlines/?sortby=post_date%3ADesc&pagesize=100), another source of online airline reviews.

The scraped data contains the following columns:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 10)
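
Because the two scraped sources use different rating scales (1 to 5 on TripAdvisor, 1 to 10 on Skytrax), the ratings need a common scale before the sources can be combined. The sketch below shows one possible way to do this. The helper names and the sentiment thresholds are illustrative assumptions, not the mapping used to build the final dataset.

```python
import math


def skytrax_to_five_point(rating: int) -> int:
    """Map a Skytrax rating (1-10) onto a 1-5 scale by halving and rounding up."""
    if not 1 <= rating <= 10:
        raise ValueError(f"Skytrax rating out of range: {rating}")
    return math.ceil(rating / 2)


def rating_to_sentiment(rating: int) -> str:
    """Hypothetical thresholds: 1-2 is Negative, 3 is Neutral, 4-5 is Positive."""
    if not 1 <= rating <= 5:
        raise ValueError(f"Rating out of range: {rating}")
    if rating <= 2:
        return "Negative"
    if rating == 3:
        return "Neutral"
    return "Positive"


# Convert a few Skytrax ratings and show the resulting label
for skytrax_rating in (1, 6, 10):
    five_point = skytrax_to_five_point(skytrax_rating)
    print(skytrax_rating, five_point, rating_to_sentiment(five_point))
```

Rounding up keeps the lowest Skytrax rating at 1 and the highest at 5; other rescaling choices are equally defensible and would shift where the borderline ratings land.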

## Importing Libraries

Run the cell below to install the dependencies for this notebook.

In [2]:
!pip3 install -r requirements.txt

Collecting tensorflow-gpu>=2.11.0 (from -r requirements.txt (line 11))
  Using cached tensorflow-gpu-2.12.0.tar.gz (2.6 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'


  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [44 lines of output]
      Traceback (most recent call last):
        File "C:\Users\Admin\AppData\Roaming\Python\Python312\site-packages\packaging\requirements.py", line 36, in __init__
          parsed = _parse_requirement(requirement_string)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\Admin\AppData\Roaming\Python\Python312\site-packages\packaging\_parser.py", line 62, in parse_requirement
          return _parse_requirement(Tokenizer(source, rules=DEFAULT_RULES))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\Admin\AppData\Roaming\Python\Python312\site-packages\packaging\_parser.py", line 80, in _parse_requirement
          url, specifier, marker = _parse_requirement_details(tokenizer)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\User

In [3]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime 

# Statistical functions
from scipy.stats import zscore

# Text Preprocessing and NLP
import nltk
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords

# WordNet lexical database (backs the WordNet lemmatizer)
from nltk.corpus import wordnet

# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer


# For generating n-grams
from nltk.util import ngrams
from collections import Counter

## Data Preparation (Loading CSV)

Load the preprocessed dataset, which combines the three sources above, from `final_df.csv` into a pandas DataFrame `data`.

In [4]:
data = pd.read_csv('final_df.csv')

In [5]:
data.head()

Unnamed: 0,year,month,sentiment,processed_full_review
0,2024,3,Neutral,ok use airlin go singapor london heathrow issu...
1,2024,3,Negative,don give money book paid receiv email confirm ...
2,2024,3,Positive,best airlin world best airlin world seat food ...
3,2024,3,Negative,premium economi seat singapor airlin not worth...
4,2024,3,Negative,imposs get promis refund book flight full mont...


In [6]:
data['sentiment'].value_counts()

sentiment
Positive    7913
Negative    2441
Neutral     1164
Name: count, dtype: int64
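
The counts above show a strong class imbalance. As a quick sanity check, the sketch below computes each class's share and the accuracy a model would reach by always predicting the most frequent class (roughly 69% here). The counts are copied from the output above; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above
counts = {"Positive": 7913, "Negative": 2441, "Neutral": 1164}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# The class a "always predict the most common label" baseline would choose
majority_label = max(counts, key=counts.get)

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {shares[majority_label]:.4f}")
```

Any classifier trained on this data should be judged against that baseline, and per-class precision and recall are more informative than accuracy alone.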

In [7]:
data['year'].value_counts()

year
2019    5129
2018    2596
2022    1184
2023    1111
2020     888
2024     514
2021      96
Name: count, dtype: int64

## Convolutional Neural Network

A Convolutional Neural Network (CNN) is a type of deep learning model that is particularly effective for pattern recognition tasks, especially in images and, increasingly, in text. The key components of a basic CNN are described below.

1. Convolutional Layer:
	- A CNN’s core layer is the convolutional layer, which applies filters (kernels) to small regions of the input data.
	- For text, a convolutional layer slides filters over sequences of words or tokens. Each filter is designed to detect specific patterns, such as n-grams (e.g., “not good” or “very interesting”) or word sequences relevant to the task.
	- The convolution operation outputs a feature map, where each entry represents the presence or strength of a detected pattern in a specific region of the input.

2. Activation Function (e.g., ReLU):
	- After convolution, an activation function (like ReLU) is applied to introduce non-linearity, allowing the network to model complex patterns.
	- This function essentially “activates” certain features, helping the network focus on meaningful patterns while ignoring less relevant details.

3. Pooling Layer:
	- A pooling layer (often Global Max Pooling for text data) is applied after convolution to reduce the dimensionality of the feature map, keeping only the most important features.
	- Pooling helps make the network more robust to minor variations and reduces the number of parameters, which speeds up training and helps prevent overfitting.

4. Fully Connected (Dense) Layer:
	- The pooled features are then passed through one or more fully connected (dense) layers. These layers process the extracted features, combining them to make predictions.
	- In text classification, a final dense layer with a sigmoid or softmax activation is often used to produce a probability score or class label for each input.
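
To make the first three steps concrete, here is a minimal pure-Python sketch of a single filter sliding over a short sequence, followed by ReLU and global max pooling. The toy vectors and kernel weights are made up for illustration, and the bias term that a real `Conv1D` layer adds is omitted.

```python
def conv1d(sequence, kernel):
    """Slide `kernel` over `sequence` with stride 1 and no padding."""
    k = len(kernel)
    feature_map = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        # Dot product of the window with the kernel, summed over
        # both token positions and embedding dimensions
        activation = sum(
            w * x
            for kernel_row, token_vec in zip(kernel, window)
            for w, x in zip(kernel_row, token_vec)
        )
        feature_map.append(activation)
    return feature_map


def relu(values):
    """Zero out negative activations."""
    return [max(0.0, v) for v in values]


def global_max_pool(values):
    """Keep only the strongest response of the filter across the sequence."""
    return max(values)


# Toy 2-dimensional "embeddings" for a 5-token review (made-up numbers)
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [-1.0, 0.5]]
# One filter spanning 2 consecutive tokens (kernel_size=2)
kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d(sequence, kernel)   # one value per 2-token window
activated = relu(feature_map)
pooled = global_max_pool(activated)      # a single number per filter
print(feature_map, activated, pooled)
```

A layer with 128 filters repeats this with 128 different kernels, so global max pooling yields a 128-dimensional vector regardless of the review's length.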


### Basic Convolutional Neural Network

- For our task of sentiment classification, we add an embedding layer before the convolution layer. An embedding layer is often included to convert words into dense, continuous vector representations (embeddings) that capture semantic relationships.

- An embedding layer provides input that’s suitable for convolution by encoding words as vectors. This way, the CNN can capture patterns in these representations rather than working with raw token IDs.
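
Conceptually, an embedding layer is a lookup table: each token ID selects one row of a weight matrix. The sketch below illustrates this with toy sizes and randomly initialized weights; in the real model these weights are learned during training, and the names used here are illustrative.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 4  # toy sizes for illustration

# One row of weights per token ID; ID 0 is the value used for padding
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Replace each token ID with its row in the embedding matrix."""
    return [embedding_matrix[i] for i in token_ids]


token_ids = [0, 0, 3, 1, 5]  # a padded sequence of token IDs (made up)
embedded = embed(token_ids)

# The sequence of 5 IDs becomes a 5 x 4 grid of floats for the convolution
print(len(embedded), len(embedded[0]))
```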

In [13]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import OneHotEncoder

# Assuming 'data' is your DataFrame with 'processed_full_review' and 'sentiment' columns

# Step 1: Tokenization and Padding
max_words = 10000  # Maximum vocabulary size
max_sequence_length = 300  # Maximum length of sequences

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data['processed_full_review'])
sequences = tokenizer.texts_to_sequences(data['processed_full_review'])

# Pad sequences to ensure uniform length
X = pad_sequences(sequences, maxlen=max_sequence_length)

# One-hot encode the sentiment labels
onehot_encoder = OneHotEncoder(sparse_output=False) 
y = onehot_encoder.fit_transform(data[['sentiment']])

# Step 2: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Define the Basic CNN Model
def create_basic_cnn():
    model = Sequential()
    
    # Embedding layer
    model.add(Embedding(input_dim=max_words, output_dim=128))
    
    # Convolutional layer
    model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))

    # Max pooling layer
    model.add(GlobalMaxPooling1D())
    
    # Fully connected layer
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))  # Dropout for regularization
    
    # Output layer for three-class classification using softmax
    model.add(Dense(3, activation='softmax'))
    
    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) # Categorical cross-entropy is used for multi-class classification
    return model

# Step 4: Train the Model
model = create_basic_cnn()
history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.2, verbose=1)

# Step 5: Evaluate the Model
y_pred = np.argmax(model.predict(X_test), axis=1) #argmax to convert one-hot encoded output to label
y_true = np.argmax(y_test, axis=1) #argmax to convert one-hot encoded output to label

accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred, target_names=onehot_encoder.categories_[0],digits=4)

print("\nEvaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(report)

Epoch 1/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 3s 18ms/step - accuracy: 0.7104 - loss: 0.7826 - val_accuracy: 0.8302 - val_loss: 0.4405
Epoch 2/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 2s 17ms/step - accuracy: 0.8518 - loss: 0.3861 - val_accuracy: 0.8470 - val_loss: 0.3685
Epoch 3/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - accuracy: 0.9005 - loss: 0.2548 - val_accuracy: 0.8497 - val_loss: 0.3724
Epoch 4/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - accuracy: 0.9416 - loss: 0.1671 - val_accuracy: 0.8459 - val_loss: 0.4221
Epoch 5/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - accuracy: 0.9629 - loss: 0.1227 - val_accuracy: 0.8535 - val_loss: 0.4701
Epoch 6/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - accuracy: 0.9806 - loss: 0.0754 - val_accuracy: 0.8562 - val_loss: 0.5309
Epoch 7/10
116/116

### Convolutional Neural Network + TF-IDF Vectorizer

Using a TF-IDF vectorizer with the CNN led to a drastic fall in performance: validation accuracy stayed at 0.6636 in every logged epoch, compared with roughly 0.83 to 0.86 for the basic CNN above. Below are some reasons why TF-IDF vectors are a poor fit for a CNN.

#### Lack of Spatial Structure:

TF-IDF vectors are sparse and non-sequential representations where each position in the vector represents a word, not a spatial pattern.
CNNs are designed to detect patterns in sequential or spatially structured data (e.g., images or sentences), so they might struggle to find meaningful patterns in TF-IDF vectors.

#### High-Dimensional Sparse Data:

TF-IDF vectors, especially with a high max_features value (like 10,000), result in a high-dimensional but sparse input.
CNNs are generally not well-suited for such high-dimensional sparse data; they perform better with dense embeddings where words have contextually meaningful dimensions.

#### Mismatch Between Input Type and CNN Architecture:

CNNs are typically effective when applied to word embeddings (like GloVe or Word2Vec) because embeddings maintain semantic relationships and neighborhood structures.
TF-IDF, however, does not capture word order or semantic relationships, which means the convolution operation might not yield meaningful feature maps.
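
The loss of word order can be shown with a small bag-of-words example. TF-IDF reweights term counts but is still computed per term, so two reviews with the same words in a different order receive the same vector. The sketch below uses plain counts and two made-up reviews to illustrate this.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs, ignoring word order."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


# Two made-up reviews with opposite meanings but identical words
review_a = "food good not bad"
review_b = "food bad not good"
vocabulary = sorted(set(review_a.split()) | set(review_b.split()))

vec_a = bag_of_words(review_a, vocabulary)
vec_b = bag_of_words(review_b, vocabulary)

print(vocabulary)
print("Same bag-of-words vector:", vec_a == vec_b)
print("Same token sequence:", review_a.split() == review_b.split())
```

A convolution over token sequences can distinguish "not bad" from "not good"; a convolution over the vocabulary axis of a bag-of-words vector only sees words that happen to sit next to each other in the vocabulary ordering.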


In [12]:
import numpy as np
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Dense, Dropout, Reshape, Input
from tensorflow.keras.models import Model
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, precision_recall_fscore_support

# Step 1: Apply TF-IDF Vectorization
max_features = 10000  # Limit TF-IDF to top 10,000 features
tfidf_vectorizer = TfidfVectorizer(max_features=max_features)
X_tfidf = tfidf_vectorizer.fit_transform(data['processed_full_review']).toarray()

# Convert the labels and one-hot encode for categorical cross-entropy
y = data['sentiment'].values
y = pd.get_dummies(y).values  # Assuming sentiment is a categorical string label

# Step 2: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Step 3: Define the CNN Model for TF-IDF Input
def create_cnn_with_tfidf():
    inputs = Input(shape=(max_features,))
    x = Reshape((max_features, 1))(inputs)  # Reshape TF-IDF output to be compatible with Conv1D

    # Convolutional layer
    x = Conv1D(filters=128, kernel_size=5, activation='relu')(x)
    x = GlobalMaxPooling1D()(x)
    
    # Fully connected layer
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.5)(x)  # Dropout for regularization
    outputs = Dense(3, activation='softmax')(x)  # Output layer for triple classification

    # Create model
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Step 4: Train the Model
model = create_cnn_with_tfidf()
history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.2, verbose=1)

# Step 5: Evaluate the Model
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred, target_names=["Negative","Neutral","Positive"], digits=4)

print("\nEvaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(report)

Epoch 1/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 18s 154ms/step - accuracy: 0.6738 - loss: 0.9241 - val_accuracy: 0.6636 - val_loss: 0.8511
Epoch 2/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 18s 151ms/step - accuracy: 0.6994 - loss: 0.8191 - val_accuracy: 0.6636 - val_loss: 0.8503
Epoch 3/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 17s 151ms/step - accuracy: 0.6926 - loss: 0.8250 - val_accuracy: 0.6636 - val_loss: 0.8525
Epoch 4/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 18s 152ms/step - accuracy: 0.6873 - loss: 0.8257 - val_accuracy: 0.6636 - val_loss: 0.8506
Epoch 5/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 18s 151ms/step - accuracy: 0.6925 - loss: 0.8239 - val_accuracy: 0.6636 - val_loss: 0.8505
Epoch 6/10
116/116 ━━━━━━━━━━━━━━━━━━━━ 17s 150ms/step - accuracy: 0.6901 - loss: 0.8290 - val_accuracy: 0.6636 - val_loss: 0.8510
Epoch 7/10

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### Convolutional Neural Network + Custom-trained Word2Vec embeddings

In this case, we train Word2Vec embeddings on our own reviews and load them into the embedding layer. This lets the embeddings capture domain-specific vocabulary better than pre-trained embeddings, which are trained on a very large and general corpus.

##### 1. Word embeddings capture the semantic relationships between words in a dense, low-dimensional space.
Customer reviews often express sentiment in subtle language, and word embeddings such as Word2Vec capture the context in which words appear, allowing the model to pick up relationships between words that simple vectorizers would miss. This helps in detecting nuanced differences in language use between positive, neutral, and negative reviews.

##### 2. Word embeddings produce dense, low-dimensional vectors (e.g., 100-300 dimensions) that capture rich word information.
Pre-trained embeddings are built on large corpora like Wikipedia and news articles, which gives a model broad external knowledge. Custom-trained embeddings give up that breadth in exchange for vectors fitted to the vocabulary of airline reviews, including the reviews gathered through our web scraping.

##### 3. Efficient Representation of Semantics
Words in reviews can appear in different contexts but with similar underlying meanings (e.g., "delay" and "late"). Word2Vec embeddings place such words close together in the vector space, helping the model recognize sentiment patterns more effectively than TF-IDF or Count Vectorizer.

##### 4. Handling Synonyms and Rare Words:
Reviews often use alternative phrases or rare terminology. Because our Word2Vec model is trained only on this dataset, words seen fewer than `min_count` times receive no learned vector; the code below falls back to a random vector for them.
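
Closeness in the embedding space is usually measured with cosine similarity. The sketch below computes it for three hypothetical 3-dimensional vectors, invented purely for illustration; real Word2Vec vectors would come from the trained model and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, not taken from any trained model
toy_vectors = {
    "delay": [0.9, 0.1, 0.0],
    "late": [0.8, 0.2, 0.1],
    "legroom": [0.0, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_vectors["delay"], toy_vectors["late"])
sim_unrelated = cosine_similarity(toy_vectors["delay"], toy_vectors["legroom"])

print(f"delay vs late:    {sim_related:.3f}")
print(f"delay vs legroom: {sim_unrelated:.3f}")
```

With a trained gensim model, the same comparison is available through `custom_word2vec.wv.similarity(word_a, word_b)`, provided both words are in the model's vocabulary.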

In [1]:
# import pandas as pd
# import numpy as np
# from tensorflow.keras.models import Model
# from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout, Input
# from tensorflow.keras.preprocessing.text import Tokenizer
# from tensorflow.keras.preprocessing.sequence import pad_sequences
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score, precision_recall_fscore_support
# from gensim.models import Word2Vec


# # Tokenization parameters
# max_words = 10000  # Maximum number of words to keep in the vocabulary
# max_sequence_length = 300  # Maximum length of sequences

# # Tokenize and create sequences
# tokenizer = Tokenizer(num_words=max_words)
# tokenizer.fit_on_texts(data['processed_full_review'])
# sequences = tokenizer.texts_to_sequences(data['processed_full_review'])
# X = pad_sequences(sequences, maxlen=max_sequence_length)
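# from sklearn.preprocessing import LabelEncoder
# label_encoder = LabelEncoder()
# y = label_encoder.fit_transform(data['sentiment'])  # Integer class labels (0, 1, 2), as required by sparse_categorical_crossentropy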

# # Step 2: Train Word2Vec Embeddings
# # Prepare sentences as lists of words for Word2Vec training
# sentences = [text.split() for text in data['processed_full_review']]

# # Train custom Word2Vec model
# embedding_dim = 200  # Set embedding dimension (try 100-200)
# custom_word2vec = Word2Vec(sentences, vector_size=embedding_dim, window=5, min_count=2, workers=4)

# # Step 3: Create Embedding Matrix from Custom Word2Vec
# vocab_size = len(tokenizer.word_index) + 1  # Add 1 for the padding token
# embedding_matrix = np.zeros((vocab_size, embedding_dim))

# # Map words in tokenizer's vocabulary to the Word2Vec vectors
# for word, i in tokenizer.word_index.items():
#     if i < max_words:  # Limit to top max_words
#         if word in custom_word2vec.wv:
#             embedding_matrix[i] = custom_word2vec.wv[word]
#         else:
#             embedding_matrix[i] = np.random.normal(size=(embedding_dim,))  # Random init for OOV words

# # Step 4: Train-Test Split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# # Step 5: Define CNN Model with Custom Word2Vec Embeddings
# def create_cnn_with_custom_word2vec():
#     input_layer = Input(shape=(max_sequence_length,))
    
#     # Embedding layer with custom Word2Vec embeddings
#     embedding_layer = Embedding(input_dim=vocab_size,
#                                 output_dim=embedding_dim,
#                                 weights=[embedding_matrix],
#                                 trainable=False)(input_layer)  # Set to non-trainable

#     # Convolutional and pooling layers
#     x = Conv1D(filters=128, kernel_size=5, activation='relu')(embedding_layer)
#     x = GlobalMaxPooling1D()(x)
    
#     # Fully connected layer with Dropout
#     x = Dense(64, activation='relu')(x)
#     x = Dropout(0.5)(x)
#     output_layer = Dense(3, activation='softmax')(x)  # Output layer for multi-class classification

#     # Compile model
#     model = Model(inputs=input_layer, outputs=output_layer)
#     model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
#     return model

# # Step 6: Train the CNN Model
# model = create_cnn_with_custom_word2vec()
# history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.2, verbose=1)

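# # Step 7: Evaluate the Model
# # The softmax output has one probability per class, so take the most probable class
# # (y_test must hold integer class labels for the comparison below)
# y_pred = np.argmax(model.predict(X_test), axis=1)
# accuracy = accuracy_score(y_test, y_pred)
# precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='macro')  # Macro-average over the three classes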

# print("\nEvaluation Metrics:")
# print(f"Accuracy: {accuracy:.4f}")
# print(f"Precision: {precision:.4f}")
# print(f"Recall: {recall:.4f}")
# print(f"F1 Score: {f1:.4f}")