# Enhancing Singapore Airlines' Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**



## Singapore Airlines Customer Reviews Dataset Information

The [Singapore Airlines Customer Reviews Dataset](https://www.kaggle.com/datasets/kanchana1990/singapore-airlines-reviews) aggregates 10,000 anonymized customer reviews, providing a broad perspective on the passenger experience with Singapore Airlines. 

The dataset contains the following columns:
- **`published_date`**: Date and time of review publication.
- **`published_platform`**: Platform where the review was posted.
- **`rating`**: Customer satisfaction rating, from 1 (lowest) to 5 (highest).
- **`type`**: Specifies the content as a review.
- **`text`**: Detailed customer feedback.
- **`title`**: Summary of the review.
- **`helpful_votes`**: Number of users finding the review helpful.

## Additional web scraping of online reviews

During our EDA, we noticed two main trends in the distribution of our dataset:
1. Fewer than 10% of the reviews were published between 2022 and 2024, making it hard to capture recent trends in sentiment.
2. Most of the reviews were highly positive. This may simply reflect that SIA receives mostly positive reviews, but we wanted more negative reviews to improve the robustness of our model.
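
Both checks can be reproduced with a few lines of pandas. The sketch below is illustrative only: the toy frame `reviews` and its values are invented, and the column names `published_date` and `rating` follow the dataset description above.

```python
import pandas as pd

# Toy stand-in for the Kaggle reviews (values are invented for illustration)
reviews = pd.DataFrame({
    "published_date": ["2019-05-01", "2019-08-12", "2018-03-03", "2023-01-20", "2019-11-30"],
    "rating": [5, 5, 4, 1, 5],
})

# Parse the dates and extract the publication year
reviews["year"] = pd.to_datetime(reviews["published_date"]).dt.year

# Share of reviews published from 2022 to 2024
recent_share = reviews["year"].between(2022, 2024).mean()

# Relative frequency of each rating, ordered by rating value
rating_share = reviews["rating"].value_counts(normalize=True).sort_index()

print(f"Share of reviews from 2022-2024: {recent_share:.0%}")
print(rating_share)
```

On the real data, the same two quantities would show how thin the recent years are and how skewed the ratings are towards 5.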

### TripAdvisor

We scraped additional airline reviews from [TripAdvisor](https://www.tripadvisor.com.sg/Airline_Review-d8729151-Reviews-Singapore-Airlines), specifically for the years 2022 to 2024.

The scraped data contains the following columns:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 5)


### Skytrax

We also scraped reviews from [Skytrax](https://www.airlinequality.com/airline-reviews/singapore-airlines/?sortby=post_date%3ADesc&pagesize=100), another source of online airline reviews.

The scraped data contains the following columns:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 10)
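
Because Skytrax uses a 1-to-10 scale while the other sources use 1 to 5, the ratings need a common scale before they can be compared. The notebook does not state which mapping was used; the sketch below shows one possible choice, a linear rescaling, using a hypothetical helper named `rescale_to_five`.

```python
def rescale_to_five(rating_out_of_ten: float) -> float:
    """Linearly map a 1-10 rating onto the 1-5 scale (1 -> 1, 10 -> 5).

    Hypothetical helper for illustration; not taken from the notebook.
    """
    if not 1 <= rating_out_of_ten <= 10:
        raise ValueError("rating must be between 1 and 10")
    return 1 + (rating_out_of_ten - 1) * 4 / 9


# Endpoints and midpoint of the 1-10 scale
for r in (1, 5.5, 10):
    print(r, "->", round(rescale_to_five(r), 2))
```

A linear map keeps the endpoints and the midpoint aligned across both scales; other mappings (for example, binning pairs of scores) are equally defensible.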

## Importing Libraries

Uncomment the cell below to install the dependencies for this notebook with pip.

In [1]:
#!pip3 install -r requirements.txt

In [9]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime 

# Statistical functions
from scipy.stats import zscore

# For concurrency (running functions in parallel)
from concurrent.futures import ThreadPoolExecutor

# For caching (to speed up repeated function calls)
from functools import lru_cache

# For progress tracking
from tqdm import tqdm

# Plotting and Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Language Detection packages
# `langdetect` for detecting language
from langdetect import detect as langdetect_detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException
# `langid` for an alternative language detection method
from langid import classify as langid_classify

# Text Preprocessing and NLP
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords

import nltk

# WordNet corpus (used by the lemmatizer)
from nltk.corpus import wordnet

# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer
# Regular expressions for text pattern matching
import re

# Word Cloud generation
from wordcloud import WordCloud

# For generating n-grams
from nltk.util import ngrams
from collections import Counter

In [10]:
# Note for Mac users: you might have to download these resources manually

# Ensure the required NLTK data is downloaded
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading punkt_tab: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading wordnet: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data]     failed: unable to get local issuer certificate
[nltk_data]   

False

## Data Preparation (Loading CSV)

Load the final CSV file into a dataframe named `data`.

In [18]:
data = pd.read_csv('final_df.csv')

In [19]:
data.head()

Unnamed: 0,year,month,sentiment,processed_full_review
0,2024,3,Neutral,ok use airlin go singapor london heathrow issu...
1,2024,3,Negative,don give money book paid receiv email confirm ...
2,2024,3,Positive,best airlin world best airlin world seat food ...
3,2024,3,Negative,premium economi seat singapor airlin not worth...
4,2024,3,Negative,imposs get promis refund book flight full mont...


In [20]:
data['year'].value_counts()

year
2019    5129
2018    2596
2022    1184
2023    1111
2020     888
2024     514
2021      96
Name: count, dtype: int64

# Feature Extraction
We applied several feature extraction techniques to prepare the dataset:
- TF-IDF with N-grams: Captures word combinations (bigrams and trigrams) to represent more complex patterns in text.

- Hashing Vectorizer: A memory-efficient method that represents text as a fixed-size sparse vector.

- Latent Semantic Analysis (LSA): Uses truncated singular value decomposition (SVD) to reduce the dimensionality of the TF-IDF matrix while capturing its most important features.

- We combined the TF-IDF and LSA features with numeric features such as year and month to create the final feature matrix. The hashing features are computed in the cell below but are not added to the combined matrix.
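
As a minimal sketch of how the three text representations could be stacked into a single matrix without converting it to a dense DataFrame, the example below uses `scipy.sparse.hstack` on a tiny invented corpus. The corpus, the small `n_features` and `n_components` values, and the variable names are assumptions chosen for illustration, not the notebook's settings.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Tiny invented corpus standing in for `processed_full_review`
docs = [
    "best airlin world seat food",
    "imposs get refund book flight",
    "premium economi seat not worth",
    "ok flight crew friend food good",
]

# TF-IDF over unigrams and bigrams (range chosen to suit the toy corpus)
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)

# Hashing vectorizer: output width is fixed by n_features, no vocabulary is stored
hasher = HashingVectorizer(n_features=16, alternate_sign=False)
X_hash = hasher.transform(docs)

# LSA: truncated SVD of the TF-IDF matrix
svd = TruncatedSVD(n_components=2, random_state=42)
X_lsa = svd.fit_transform(X_tfidf)

# Stack the three blocks column-wise, keeping the result sparse
X_all = hstack([X_tfidf, X_hash, csr_matrix(X_lsa)]).tocsr()
print(X_tfidf.shape, X_hash.shape, X_lsa.shape, X_all.shape)
```

Keeping the matrix sparse avoids materialising thousands of mostly-zero columns, which matters once `max_features` and the number of reviews grow.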

In [23]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.decomposition import TruncatedSVD

# The data is in a dataframe called 'data'
# The text column is 'processed_full_review', and the target column is 'sentiment'

# Step 1: TF-IDF with Bigrams/Trigrams
tfidf_ngram_vectorizer = TfidfVectorizer(ngram_range=(2, 3), max_features=1000)
X_tfidf_ngram = tfidf_ngram_vectorizer.fit_transform(data['processed_full_review'])

# Convert TF-IDF matrix to DataFrame for readability
tfidf_ngram_df = pd.DataFrame(X_tfidf_ngram.toarray(), columns=tfidf_ngram_vectorizer.get_feature_names_out())

# Step 2: Hashing Vectorizer
hash_vectorizer = HashingVectorizer(n_features=1000, alternate_sign=False)
X_hash = hash_vectorizer.fit_transform(data['processed_full_review'])

# Step 3: LSA (Latent Semantic Analysis) via SVD
# Applying SVD on the TF-IDF matrix for dimensionality reduction
svd = TruncatedSVD(n_components=100)
X_lsa = svd.fit_transform(X_tfidf_ngram)

# Combine the TF-IDF and LSA features (X_hash is not included here)
X_combined = pd.concat([tfidf_ngram_df, pd.DataFrame(X_lsa)], axis=1)

# Step 4: Select Other Columns (Year, Month, etc.)
df_selected = data[['year', 'month']]  # Modify based on the features you want to select
X_combined = pd.concat([X_combined, df_selected], axis=1)

# Output the final dataframe for inspection
X_combined.head()


Unnamed: 0,air crew,air hostess,air line,air new,air new zealand,air nz,air singapor,air steward,airlin airlin,airlin alway,...,92,93,94,95,96,97,98,99,year,month
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.040275,-0.023772,0.006079,0.010352,0.019244,-0.013823,-0.040387,-0.022565,2024,3
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.057204,0.005147,-0.106796,0.042967,-0.056759,0.054508,-0.101851,0.002805,2024,3
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.031834,-0.04035,0.005602,-0.089119,-0.041138,-0.070701,0.027507,-0.016442,2024,3
3,0.0,0.0,0.0,0.0,0.0,0.451325,0.0,0.0,0.0,0.0,...,-0.006751,0.040318,-0.026912,0.012194,-0.027348,0.052086,0.004994,-0.003546,2024,3
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.020591,-0.045212,0.006276,0.034718,-0.041666,0.014788,-0.027002,-0.025512,2024,3


# Random Forest

In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Assuming 'X_combined' is the feature matrix and 'data['sentiment']' is the target
X_combined.columns = X_combined.columns.map(str)
y = data['sentiment']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

# Initialize and train Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Detailed classification report
print(classification_report(y_test, y_pred))


Accuracy: 79.99%
              precision    recall  f1-score   support

    Negative       0.81      0.58      0.67       470
     Neutral       0.85      0.05      0.09       228
    Positive       0.80      0.97      0.88      1606

    accuracy                           0.80      2304
   macro avg       0.82      0.53      0.55      2304
weighted avg       0.81      0.80      0.76      2304
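
The gap between the macro and weighted averages in the report above comes from class imbalance. The short calculation below recomputes both recall averages from the per-class values printed in the report; because those inputs are already rounded, the results match the report only to two decimals.

```python
# Per-class recall and support copied from the classification report above
recall = {"Negative": 0.58, "Neutral": 0.05, "Positive": 0.97}
support = {"Negative": 470, "Neutral": 228, "Positive": 1606}

total = sum(support.values())

# Macro average: every class counts equally
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The weighted figure is dominated by the Positive class, so the macro average is the more telling number for how poorly Neutral reviews are recovered.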



# Random Forest with Cross-Validation

Cross-validation is a technique used to evaluate the performance of a model by splitting the data into multiple subsets, training the model on some subsets, and validating it on others.
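
To make the splitting concrete, the sketch below runs `StratifiedKFold` on a small set of invented, imbalanced labels and prints the class counts in each validation fold. The label counts are assumptions chosen so that they divide evenly across five folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Invented, imbalanced labels: 20 Positive, 10 Negative, 5 Neutral
y = np.array(["Positive"] * 20 + ["Negative"] * 10 + ["Neutral"] * 5)
X = np.zeros((len(y), 1))  # feature values are irrelevant for the split itself

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_counts = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y), start=1):
    # Count how many samples of each class landed in the validation fold
    labels, counts = np.unique(y[val_idx], return_counts=True)
    fold_counts.append(dict(zip(labels.tolist(), counts.tolist())))
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}, "
          f"val classes={fold_counts[-1]}")
```

Stratification keeps the class proportions the same in every fold, which is why it is preferred over plain `KFold` when one class is rare.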

In [25]:
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Ensure X_combined columns are strings
X_combined.columns = X_combined.columns.map(str)
y = data['sentiment']

# Split data into separate train/test sets
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

# Step 1: Cross-Validation on the training set only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Cross-validation on the training set
cv_scores = cross_val_score(rf_classifier, X_train, y_train, cv=cv, scoring='accuracy')

# Print Cross-validation results
print(f"Cross-validation accuracy scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean():.2f}")

# Step 2: Train on the full training set after cross-validation
rf_classifier.fit(X_train, y_train)

# Step 3: Evaluate on the separate test set
y_pred = rf_classifier.predict(X_test)

# Step 4: Classification report on test set
print(classification_report(y_test, y_pred))


Cross-validation accuracy scores: [0.79055887 0.79164406 0.78187737 0.79327184 0.79261672]
Mean cross-validation accuracy: 0.79
              precision    recall  f1-score   support

    Negative       0.81      0.58      0.67       470
     Neutral       0.85      0.05      0.09       228
    Positive       0.80      0.97      0.88      1606

    accuracy                           0.80      2304
   macro avg       0.82      0.53      0.55      2304
weighted avg       0.81      0.80      0.76      2304



# Random Forest with Grid Search for Hyperparameter Tuning

Instead of using the default hyperparameters, we perform a grid search over `n_estimators`, `max_depth`, and `min_samples_split` to find the best combination.
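
Grid search cost grows multiplicatively with the number of values per hyperparameter. As a rough sketch, the snippet below enumerates the candidates for the grid used in this section (values copied from the cell that follows) and counts the model fits required under 5-fold cross-validation.

```python
from itertools import product

# Same grid as in the cell that follows
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}

# Every combination of one value per hyperparameter is one candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 5
total_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
print("first candidate:", candidates[0])
```

`GridSearchCV` also refits the best candidate on the full training set by default (`refit=True`), and its `n_jobs` parameter can be used to run the fits in parallel.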



In [None]:
from sklearn.model_selection import GridSearchCV

# Ensure X_combined columns are strings
X_combined.columns = X_combined.columns.map(str)
y = data['sentiment']

# Split data into separate train/test sets
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

# Define the parameter grid for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Set up Grid Search
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the model with Grid Search
grid_search.fit(X_train, y_train)

# Get the best parameters and score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best parameters: {best_params}")
print(f"Best cross-validation accuracy: {best_score:.2f}")

# Use the best model for predictions
best_rf_classifier = grid_search.best_estimator_
y_pred = best_rf_classifier.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))


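# Random Forest with Out-of-Bag (OOB) Evaluation

Random Forest has a built-in performance estimate called the Out-of-Bag (OOB) score. Each tree is trained on a bootstrap sample of the training data, so it can be evaluated on the samples it did not see during training. This gives an estimate similar to cross-validation without needing a separate validation set. Note that the outputs in this and the following sections report a test set of 346 reviews rather than the 2,304 used above, so they appear to come from a different run and are not directly comparable with the earlier results.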
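
To see why out-of-bag samples exist at all, the sketch below simulates a single bootstrap draw and measures how many samples are left out. The sample size and seed are arbitrary assumptions; the point is only that roughly 1/e (about 37%) of the data ends up out of bag for each tree.

```python
import math
import random

random.seed(42)
n_samples = 10_000

# One bootstrap sample: draw n_samples indices with replacement
in_bag = {random.randrange(n_samples) for _ in range(n_samples)}

# Samples never drawn are "out of bag" for this tree
oob_fraction = 1 - len(in_bag) / n_samples

# Probability that a given sample is missed by all n_samples draws
theoretical = (1 - 1 / n_samples) ** n_samples

print(f"simulated OOB fraction:   {oob_fraction:.3f}")
print(f"theoretical OOB fraction: {theoretical:.3f}")
print(f"limit 1/e:                {math.exp(-1):.3f}")
```

Each training sample is therefore scored only by the trees that did not train on it, which is what `oob_score_` aggregates.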

In [35]:
# Initialize Random Forest with OOB enabled
rf_classifier = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)

# Train the model
rf_classifier.fit(X_train, y_train)

# Print the OOB score
print(f"OOB Score: {rf_classifier.oob_score_:.2f}")

# Evaluate on the test set
y_pred = rf_classifier.predict(X_test)
print(classification_report(y_test, y_pred))


OOB Score: 0.68
              precision    recall  f1-score   support

    Negative       0.62      0.97      0.75       178
     Neutral       0.00      0.00      0.00        38
    Positive       0.82      0.43      0.57       130

    accuracy                           0.66       346
   macro avg       0.48      0.47      0.44       346
weighted avg       0.63      0.66      0.60       346





# Ensemble Stacking

Stacking combines the predictions of several base models, which can be any set of models and not just Random Forests, by training a final estimator (meta-learner) on those predictions to produce the final prediction.
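
The sketch below illustrates what the meta-learner actually receives, using synthetic data from `make_classification` rather than the review features. The dataset size, the number of trees, and `max_iter` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the sentiment data
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("svm", SVC(kernel="linear", probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # base-model predictions for the meta-learner come from 5-fold CV
)
stack.fit(X_tr, y_tr)

# Meta-features: one probability column per class per base model (2 models x 3 classes)
meta_features = stack.transform(X_te)
print("meta-feature shape:", meta_features.shape)
print("test accuracy:", round(stack.score(X_te, y_te), 2))
```

The meta-learner never sees the original features here, only the base models' class probabilities, so it learns how much to trust each model for each class.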

In [37]:
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Define the base models
base_estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svm', SVC(kernel='linear', probability=True, random_state=42))
]

# Define the final estimator (meta-learner)
stacking_clf = StackingClassifier(estimators=base_estimators, final_estimator=LogisticRegression())

# Train the stacking classifier
stacking_clf.fit(X_train, y_train)

# Make predictions
y_pred = stacking_clf.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

    Negative       0.73      0.90      0.81       178
     Neutral       0.00      0.00      0.00        38
    Positive       0.76      0.72      0.74       130

    accuracy                           0.73       346
   macro avg       0.49      0.54      0.51       346
weighted avg       0.66      0.73      0.69       346



# Logistic Regression

In [34]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=42).fit(X_train, y_train)
clf_predictions = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, clf_predictions))
print("Classification Report:\n", classification_report(y_test, clf_predictions))

Accuracy: 0.7196531791907514
Classification Report:
               precision    recall  f1-score   support

    Negative       0.69      0.94      0.79       178
     Neutral       0.00      0.00      0.00        38
    Positive       0.80      0.62      0.70       130

    accuracy                           0.72       346
   macro avg       0.50      0.52      0.50       346
weighted avg       0.65      0.72      0.67       346



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
