# Enhancing Singapore Airlines' Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**



## Singapore Airlines Customer Reviews Dataset Information

The [Singapore Airlines Customer Reviews Dataset](https://www.kaggle.com/datasets/kanchana1990/singapore-airlines-reviews) aggregates 10,000 anonymized customer reviews, providing a broad perspective on the passenger experience with Singapore Airlines. 

The dataset contains the following columns:
- **`published_date`**: Date and time of review publication.
- **`published_platform`**: Platform where the review was posted.
- **`rating`**: Customer satisfaction rating, from 1 (lowest) to 5 (highest).
- **`type`**: Specifies the content as a review.
- **`text`**: Detailed customer feedback.
- **`title`**: Summary of the review.
- **`helpful_votes`**: Number of users finding the review helpful.

## Additional web scraping of online reviews

During our EDA, we noticed two main trends in the distribution of our dataset:
1. Fewer than 10% of our reviews were published between 2022 and 2024, making it hard to capture recent trends in sentiment.
2. Most of the reviews were highly positive. This may simply reflect that SIA receives mostly positive reviews, but we wanted more negative reviews to improve the robustness of our model.
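
Both observations can be checked with a few lines of pandas. The sketch below is illustrative only: it uses a tiny made-up frame with the `published_date` and `rating` columns described above, and it assumes the dates can be parsed by `pd.to_datetime`. The EDA code actually used for this project is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the Kaggle dataset (values are made up for illustration)
reviews = pd.DataFrame({
    "published_date": ["2018-05-01", "2019-03-12", "2019-07-30", "2019-11-02", "2023-01-15"],
    "rating": [5, 5, 4, 1, 2],
})

# Share of reviews per publication year
years = pd.to_datetime(reviews["published_date"]).dt.year
year_share = years.value_counts(normalize=True).sort_index()

# Share of reviews published from 2022 to 2024 (both bounds inclusive)
recent_share = years.between(2022, 2024).mean()

# Share of each rating value
rating_share = reviews["rating"].value_counts(normalize=True).sort_index()

print(year_share)
print(f"Share of reviews from 2022 to 2024: {recent_share:.1%}")
print(rating_share)
```

On the real data, a small `recent_share` and a `rating_share` concentrated on ratings 4 and 5 would correspond to the two trends listed above.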

### TripAdvisor

We scraped additional airline reviews from [TripAdvisor](https://www.tripadvisor.com.sg/Airline_Review-d8729151-Reviews-Singapore-Airlines), specifically for the years 2022 to 2024.

The scraped data contains the following columns:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 5)


### Skytrax

We also scraped reviews from [Skytrax](https://www.airlinequality.com/airline-reviews/singapore-airlines/?sortby=post_date%3ADesc&pagesize=100), another source of online airline reviews.

The scraped data contains the following columns:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 10)
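
Because TripAdvisor rates on a 1 to 5 scale and Skytrax on a 1 to 10 scale, the two sources have to be brought onto a common scale before they can be combined. This notebook does not show how that was done, so the sketch below is only one possible approach: the round-up rescaling rule, the sentiment thresholds, and the function names `to_five_point` and `to_sentiment` are all assumptions made for illustration.

```python
import math


def to_five_point(rating, scale_max):
    """Map a rating on a 1..scale_max scale onto 1..5 (assumed rule: round up)."""
    if not 1 <= rating <= scale_max:
        raise ValueError(f"rating {rating} is outside 1..{scale_max}")
    return math.ceil(rating * 5 / scale_max)


def to_sentiment(rating_5):
    """Assumed thresholds: 1-2 is Negative, 3 is Neutral, 4-5 is Positive."""
    if rating_5 <= 2:
        return "Negative"
    if rating_5 == 3:
        return "Neutral"
    return "Positive"


# A Skytrax rating (1 to 10) and a TripAdvisor rating (1 to 5)
skytrax_label = to_sentiment(to_five_point(7, scale_max=10))
tripadvisor_label = to_sentiment(to_five_point(2, scale_max=5))
print(skytrax_label, tripadvisor_label)
```

Rounding up is only one choice; a different rule would shift borderline ratings between classes, so the rule should be fixed before the sources are merged.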

## Importing Libraries

Run the cell below to install the dependencies required for this notebook.

In [1]:
!pip3 install -r requirements.txt


In [2]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime 

# Statistical functions
from scipy.stats import zscore

# Text Preprocessing and NLP
import nltk
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords

# WordNet lexical database
from nltk.corpus import wordnet

# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer


# For generating n-grams
from nltk.util import ngrams
from collections import Counter

## Data Preparation (Loading CSV)

Load the preprocessed dataset, `final_df.csv`, into a pandas DataFrame `data`.

In [3]:
data = pd.read_csv('final_df.csv')

In [4]:
data.head()

| Unnamed: 0 | year | month | sentiment | processed_full_review |
| --- | --- | --- | --- | --- |
| 0 | 2024 | 3 | Neutral | ok use airlin go singapor london heathrow issu... |
| 1 | 2024 | 3 | Negative | don give money book paid receiv email confirm ... |
| 2 | 2024 | 3 | Positive | best airlin world best airlin world seat food ... |
| 3 | 2024 | 3 | Negative | premium economi seat singapor airlin not worth... |
| 4 | 2024 | 3 | Negative | imposs get promis refund book flight full mont... |


In [5]:
data['sentiment'].value_counts()

sentiment
Positive    7913
Negative    2441
Neutral     1164
Name: count, dtype: int64
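
The classes are clearly imbalanced, which matters when reading accuracy figures later on. As a quick arithmetic check (not part of the original notebook), the snippet below turns the counts printed above into class shares and into the accuracy a trivial "always predict the majority class" baseline would reach on data with this distribution.

```python
# Class counts copied from the value_counts() output above
counts = {"Positive": 7913, "Negative": 2441, "Neutral": 1164}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label:<8} {share:.1%}")
print(f"Majority-class baseline ({majority_label}): {baseline_accuracy:.1%}")
```

The baseline comes out at roughly 68.7%, so a model's accuracy is only informative to the extent that it exceeds this figure. Per-class precision and recall are worth reading alongside it.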

In [6]:
data['year'].value_counts()

year
2019    5129
2018    2596
2022    1184
2023    1111
2020     888
2024     514
2021      96
Name: count, dtype: int64

## Simple Neural Network

A simple neural network, here a fully connected neural network (FCNN), is a basic deep learning model suited to straightforward classification tasks. It consists mainly of fully connected (dense) layers that operate on flattened inputs, which makes it applicable to many types of data, including text.

Below is an explanation of how a simple NN works:

1. Embedding Layer (for Text Data):
    - For text inputs, an embedding layer transforms words into numerical vectors that capture meaning and context.

2. Flattening:
    - The embeddings are flattened into a single long vector, allowing the network to process them as one input.

3. Dense (Fully Connected) Layers:
    - Dense layers are the core of an FCNN. Each neuron connects to all neurons in the previous layer, learning complex relationships.
    - Activation functions, such as ReLU, are applied here to introduce non-linearity, helping the network capture more intricate patterns.

4. Output Layer:
    - The final layer outputs class probabilities using a softmax activation (for multi-class classification) or sigmoid (for binary classification).
    - This layer helps the model predict the likelihood of each class for an input.

5. Training:
    - During training, the network adjusts its weights to minimize prediction errors, gradually improving its accuracy through backpropagation.
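
To make steps 1 to 4 concrete, here is a minimal NumPy sketch of a single forward pass. It is not the Keras model used in this notebook: the layer sizes are toy values chosen for readability, there is only one hidden layer, the weights are random rather than trained, and dropout is omitted. Names such as `forward`, `W1` and `W2` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for readability (assumptions, not the notebook's settings)
vocab_size, embed_dim, seq_len = 50, 4, 6
hidden_units, num_classes = 8, 3

# Randomly initialised parameters; training would adjust these via backpropagation
embedding = rng.normal(size=(vocab_size, embed_dim))
W1 = rng.normal(size=(seq_len * embed_dim, hidden_units))
b1 = np.zeros(hidden_units)
W2 = rng.normal(size=(hidden_units, num_classes))
b2 = np.zeros(num_classes)


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def forward(token_ids):
    """Embedding lookup -> flatten -> dense + ReLU -> dense + softmax."""
    embedded = embedding[token_ids]              # (batch, seq_len, embed_dim)
    flat = embedded.reshape(len(token_ids), -1)  # (batch, seq_len * embed_dim)
    hidden = np.maximum(0.0, flat @ W1 + b1)     # ReLU activation
    return softmax(hidden @ W2 + b2)             # (batch, num_classes)


# Two already-padded sequences of token ids (0 is used as the padding id here)
batch = np.array([[0, 0, 3, 17, 8, 42],
                  [0, 5, 9, 9, 1, 30]])
probs = forward(batch)
print(probs.shape)        # one row of class probabilities per review
print(probs.sum(axis=1))  # each row sums to 1
```

Step 5 (training) would compare `probs` with the one-hot labels using a loss such as categorical cross-entropy and update the parameters by gradient descent. Keras handles that part automatically.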


In [9]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import OneHotEncoder

# Assuming 'data' is your DataFrame with 'processed_full_review' and 'sentiment' columns

# Step 1: Tokenization and Padding
max_words = 10000  # Maximum vocabulary size
max_sequence_length = 300  # Maximum length of sequences

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data['processed_full_review'])
sequences = tokenizer.texts_to_sequences(data['processed_full_review'])

# Pad sequences to ensure uniform length
X = pad_sequences(sequences, maxlen=max_sequence_length)

# One-hot encode the sentiment labels
onehot_encoder = OneHotEncoder(sparse_output=False)
y = onehot_encoder.fit_transform(data[['sentiment']])

# Step 2: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Define the Simple Neural Network Model
def create_simple_nn():
    model = Sequential()
    
    # Embedding layer
    model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_sequence_length))
    
    # Flatten the embeddings to feed into dense layers
    model.add(Flatten())
    
    # Fully connected layers
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))  # Dropout for regularization
    
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.5))
    
    # Output layer for three-class classification using softmax
    model.add(Dense(3, activation='softmax'))
    
    # Compile the model; categorical cross-entropy is used for multi-class classification
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Step 4: Train the Model
model = create_simple_nn()
history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_test, y_test), verbose=1)

# Step 5: Evaluate the Model
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred, target_names=onehot_encoder.categories_[0], digits=4)

print("\nEvaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(report)


Epoch 1/10
144/144 ━━━━━━━━━━━━━━━━━━━━ 5s 29ms/step - accuracy: 0.6435 - loss: 0.9063 - val_accuracy: 0.6984 - val_loss: 0.6102
Epoch 2/10
144/144 ━━━━━━━━━━━━━━━━━━━━ 9s 62ms/step - accuracy: 0.7104 - loss: 0.6659 - val_accuracy: 0.8368 - val_loss: 0.4321
Epoch 3/10
144/144 ━━━━━━━━━━━━━━━━━━━━ 8s 53ms/step - accuracy: 0.8236 - loss: 0.4386 - val_accuracy: 0.8398 - val_loss: 0.4099
Epoch 4/10
144/144 ━━━━━━━━━━━━━━━━━━━━ 10s 72ms/step - accuracy: 0.8627 - loss: 0.3280 - val_accuracy: 0.8477 - val_loss: 0.3986
Epoch 5/10
144/144 ━━━━━━━━━━━━━━━━━━━━ 8s 57ms/step - accuracy: 0.8866 - loss: 0.2583 - val_accuracy: 0.8459 - val_loss: 0.4422
Epoch 6/10
144/144 ━━━━━━━━━━━━━━━━━━━━ 9s 62ms/step - accuracy: 0.9252 - loss: 0.1976 - val_accuracy: 0.8485 - val_loss: 0.5799
Epoch 7/10
[1m144/144[0m [32m