# Enhancing Singapore Airlines' Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**



## Singapore Airlines Customer Reviews Dataset Information

The [Singapore Airlines Customer Reviews Dataset](https://www.kaggle.com/datasets/kanchana1990/singapore-airlines-reviews) aggregates 10,000 anonymized customer reviews, providing a broad perspective on the passenger experience with Singapore Airlines. 

The dataset has the following columns:
- **`published_date`**: Date and time of review publication.
- **`published_platform`**: Platform where the review was posted.
- **`rating`**: Customer satisfaction rating, from 1 (lowest) to 5 (highest).
- **`type`**: Specifies the content as a review.
- **`text`**: Detailed customer feedback.
- **`title`**: Summary of the review.
- **`helpful_votes`**: Number of users finding the review helpful.

## Importing Libraries

Uncomment the cell below to install this notebook's dependencies with pip.

In [2]:
# !pip3 install -r requirements.txt

In [3]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np

# Statistical functions
from scipy.stats import zscore

# For concurrency (running functions in parallel)
from concurrent.futures import ThreadPoolExecutor

# For caching (to speed up repeated function calls)
from functools import lru_cache

# For progress tracking
from tqdm import tqdm

# Plotting and Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Language Detection packages
# `langdetect` for detecting language
from langdetect import detect as langdetect_detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException
# `langid` for an alternative language detection method
from langid import classify as langid_classify

# Text Preprocessing and NLP
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords

# WordNet lexical database (used by the lemmatizer)
from nltk.corpus import wordnet

# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer
import nltk
# Regular expressions for text pattern matching
import re

# Word Cloud generation
from wordcloud import WordCloud

# For generating n-grams
from nltk.util import ngrams
from collections import Counter

In [3]:
data = pd.read_csv('final_df.csv')

data

Unnamed: 0,year,month,sentiment,processed_full_review
0,2024,3,Neutral,ok use airlin go singapor london heathrow issu...
1,2024,3,Negative,don give money book paid receiv email confirm ...
2,2024,3,Positive,best airlin world best airlin world seat food ...
3,2024,3,Negative,premium economi seat singapor airlin not worth...
4,2024,3,Negative,imposs get promis refund book flight full mont...
...,...,...,...,...
11513,2021,11,Negative,websit buggi paid first busi class ticket webs...
11514,2021,10,Negative,reduc level qualiti servic fear futur airlin t...
11515,2021,10,Negative,chang would cost usd book ticket singapor airl...
11516,2021,8,Negative,disappoint flight check secur check frankfurt ...


# Feature Selection
Now we select the final features for sentiment analysis of the airline reviews.
- Columns kept: `processed_full_review`, `processed_review_length`, `sentiment`, `year`, `month`

- Columns excluded: [`published_platform`, `type`, `helpful_votes`, `language`, `review_length`, `day`, `day_of_week`, `year_month`]

- Create a new DataFrame (`data_final`) by selecting columns from the original DataFrame `data`. The cell below keeps only `processed_full_review` and `sentiment`, the text feature and the label used for modelling.

In [4]:
data_final = data[['processed_full_review','sentiment']]
data_final.head()

Unnamed: 0,processed_full_review,sentiment
0,ok use airlin go singapor london heathrow issu...,Neutral
1,don give money book paid receiv email confirm ...,Negative
2,best airlin world best airlin world seat food ...,Positive
3,premium economi seat singapor airlin not worth...,Negative
4,imposs get promis refund book flight full mont...,Negative


# Multinomial NB with TF-IDF (`max_features=1000`)

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

tfidf_vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf_vectorizer.fit_transform(data['processed_full_review'])

X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, data['sentiment'],test_size=0.3, random_state=42)

nb_model = MultinomialNB()
nb_model.fit(X_train,y_train)

nb_predictions = nb_model.predict(X_test)

print("Multinomial NB Accuracy:", accuracy_score(y_test,nb_predictions))
print("Multinomial NB Classification Report:\n", classification_report(y_test, nb_predictions, digits=3))

Multinomial NB Accuracy: 0.8318865740740741
Multinomial NB Classification Report:
               precision    recall  f1-score   support

    Negative      0.836     0.694     0.758       692
     Neutral      0.571     0.081     0.142       346
    Positive      0.836     0.979     0.902      2418

    accuracy                          0.832      3456
   macro avg      0.748     0.584     0.601      3456
weighted avg      0.809     0.832     0.797      3456
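
One caveat about the cell above: `TfidfVectorizer` is fitted on every review before the train/test split, so the vocabulary and IDF weights are partly learned from the test reviews. The block below is a minimal sketch of one way to avoid this, by splitting the raw text first and wrapping the vectorizer and classifier in a `Pipeline`. The `texts` and `labels` lists are hypothetical stand-ins for `data['processed_full_review']` and `data['sentiment']`, and `nb_pipeline` is an illustrative name, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the real review text and sentiment labels
texts = [
    "best airlin seat food great", "great crew friendli servic",
    "love flight comfort seat", "excel food great servic",
    "refund imposs bad servic", "websit buggi paid twice",
    "delay flight rude staff", "lost luggag no refund",
    "ok flight averag food", "averag seat ok servic",
    "flight ok noth special", "food ok seat averag",
]
labels = ["Positive"] * 4 + ["Negative"] * 4 + ["Neutral"] * 4

# Split the raw text first, so test reviews never reach the vectorizer's fit step
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier
nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=1000)),
    ("nb", MultinomialNB()),
])
nb_pipeline.fit(X_train_txt, y_train)

# Terms that occur only in the test reviews cannot appear in the fitted vocabulary
train_vocab = set(nb_pipeline.named_steps["tfidf"].vocabulary_)
train_terms = {word for text in X_train_txt for word in text.split()}
test_only_terms = {word for text in X_test_txt for word in text.split()} - train_terms

predictions = nb_pipeline.predict(X_test_txt)
```

The same pipeline object can be passed to `cross_val_score` or `GridSearchCV`, in which case the vectorizer is refitted inside each fold.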



# Complement NB with TF-IDF (`max_features=1000`)

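Complement Naive Bayes (CNB) is designed to handle imbalanced classes better than Multinomial Naive Bayes (MNB). It tends to improve classification of minority classes and often yields more balanced per-class performance. In the results here, Neutral recall rises from 0.081 with MNB to 0.223 with CNB, while overall accuracy stays almost the same (0.832 against 0.828).

CNB estimates each class's feature weights from the complement of that class, that is, from all training documents that do not belong to it. Because the complement is larger and less skewed than a minority class on its own, the estimates are more stable, which acts as a form of implicit regularisation.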
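
To make the complement idea concrete, the sketch below recomputes the CNB feature weights by hand on a small count matrix and compares them with what scikit-learn stores in `feature_log_prob_`. The matrix `X`, the labels `y`, and the name `manual_weights` are hypothetical and chosen only for illustration. It assumes `norm=False`, which is the `ComplementNB` default.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

# Hypothetical term-count matrix: 6 documents x 4 terms, with imbalanced labels
X = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [4, 0, 0, 1],
    [0, 3, 1, 0],
    [0, 2, 0, 1],
    [1, 1, 2, 2],
])
y = np.array(["Positive", "Positive", "Positive", "Negative", "Negative", "Neutral"])

alpha = 1.0
cnb = ComplementNB(alpha=alpha, norm=False).fit(X, y)

manual_weights = []
for c in cnb.classes_:
    # Smoothed term counts over every document that does NOT belong to class c
    complement_counts = X[y != c].sum(axis=0) + alpha
    theta = complement_counts / complement_counts.sum()
    # scikit-learn stores the negated log, so prediction remains an argmax
    manual_weights.append(-np.log(theta))
manual_weights = np.array(manual_weights)

print(np.allclose(manual_weights, cnb.feature_log_prob_))
```

Note that the single Neutral document still gets weights estimated from the five other documents, which is why CNB is less sensitive to small classes than MNB.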

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import accuracy_score, classification_report

tfidf_vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf_vectorizer.fit_transform(data['processed_full_review'])

X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, data['sentiment'],test_size=0.3, random_state=42)

nb_model = ComplementNB()
nb_model.fit(X_train, y_train)

nb_predictions = nb_model.predict(X_test)

print("Complement NB Accuracy:", accuracy_score(y_test,nb_predictions))
print("Complement NB Classification Report:\n", classification_report(y_test, nb_predictions, digits=3))

Complement NB Accuracy: 0.8275462962962963
Complement NB Classification Report:
               precision    recall  f1-score   support

    Negative      0.634     0.867     0.733       692
     Neutral      0.438     0.223     0.295       346
    Positive      0.935     0.903     0.919      2418

    accuracy                          0.828      3456
   macro avg      0.669     0.664     0.649      3456
weighted avg      0.825     0.828     0.819      3456



# Complement NB with TF-IDF (`max_features=1000`) with GridSearchCV

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import accuracy_score, classification_report

tfidf_vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf_vectorizer.fit_transform(data['processed_full_review'])

X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, data['sentiment'], test_size=0.3, random_state=42)

param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0, 5.0]}  
grid_search = GridSearchCV(ComplementNB(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

best_nb_model = grid_search.best_estimator_

nb_predictions = best_nb_model.predict(X_test)

print("Best Alpha:", grid_search.best_params_)
print("Complement NB Accuracy:", accuracy_score(y_test, nb_predictions))
print("Complement NB Classification Report:\n", classification_report(y_test, nb_predictions, digits=3))

Best Alpha: {'alpha': 5.0}
Complement NB Accuracy: 0.8339120370370371
Complement NB Classification Report:
               precision    recall  f1-score   support

    Negative      0.658     0.850     0.741       692
     Neutral      0.465     0.228     0.306       346
    Positive      0.926     0.916     0.921      2418

    accuracy                          0.834      3456
   macro avg      0.683     0.665     0.656      3456
weighted avg      0.826     0.834     0.824      3456



# Random Forest with TF-IDF (`max_features=1000`)

In [29]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

rf_predictions = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_predictions))
print("Random Forest Classification Report:\n", classification_report(y_test, rf_predictions, digits=3))

Random Forest Accuracy: 0.8451967592592593
Random Forest Classification Report:
               precision    recall  f1-score   support

    Negative      0.808     0.747     0.776       692
     Neutral      0.929     0.075     0.139       346
    Positive      0.853     0.983     0.914      2418

    accuracy                          0.845      3456
   macro avg      0.863     0.602     0.610      3456
weighted avg      0.851     0.845     0.809      3456



# Logistic Regression with TF-IDF (`max_features=1000`)

In [39]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=42, multi_class='multinomial', solver='lbfgs').fit(X_train, y_train)
clf_predictions = clf.predict(X_test)
print("Log Regression Accuracy:", accuracy_score(y_test, clf_predictions))
print("Log Regression Classification Report:\n", classification_report(y_test, clf_predictions, digits=3))

Log Regression Accuracy: 0.8587962962962963
Log Regression Classification Report:
               precision    recall  f1-score   support

    Negative      0.785     0.795     0.790       692
     Neutral      0.489     0.257     0.337       346
    Positive      0.905     0.963     0.933      2418

    accuracy                          0.859      3456
   macro avg      0.726     0.672     0.687      3456
weighted avg      0.839     0.859     0.845      3456



# SVM (linear) with TF-IDF with Stratified K-fold

A linear kernel computes the dot product between two vectors. It works best for linearly separable data and is effective when features are numerous, as in text classification, where each word or term is a feature in a high-dimensional space.

It is also less computationally intensive and faster to train than non-linear kernels.

Class Imbalance Handling: We set `class_weight='balanced'` to automatically adjust the class weights inversely proportional to the class frequencies in the training data, helping the model pay more attention to minority classes.

Stratified K-fold: maintains the class distribution across the folds, which is important for imbalanced data.
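
As a rough illustration of both points, the sketch below computes the `'balanced'` class weights and checks that stratified folds keep the Neutral share stable. The label array `y` is hypothetical: it is built from the test-set supports printed in the reports above (692 / 346 / 2418) purely to get realistic proportions, whereas `SVC` derives the actual weights from `y_train`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with the same class proportions as the reported supports
y = np.array(["Negative"] * 692 + ["Neutral"] * 346 + ["Positive"] * 2418)
classes = np.unique(y)

# 'balanced' weight for class c = n_samples / (n_classes * count(c))
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes, weights))

# Each stratified validation fold keeps roughly the overall Neutral proportion
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_neutral_share = [
    np.mean(y[val_idx] == "Neutral")
    for _, val_idx in skf.split(np.zeros(len(y)), y)
]

print(class_weights)
print(fold_neutral_share)
```

With these proportions the weights come out at about 1.66 for Negative, 3.33 for Neutral and 0.48 for Positive, so a misclassified Neutral review costs roughly seven times as much as a misclassified Positive one.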

In [32]:
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

svm_model = SVC(kernel='linear', C=1, class_weight='balanced', random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cross_val_scores = cross_val_score(svm_model, X_train, y_train, cv=skf, scoring='accuracy')

svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)

print("Cross-Validation Accuracy Scores:", cross_val_scores)
print("Mean Cross-Validation Accuracy:", cross_val_scores.mean())

print("SVM(linear) Accuracy:", accuracy_score(y_test, svm_predictions))
print("SVM(linear) Classification Report:\n", classification_report(y_test, svm_predictions, digits=3))

Cross-Validation Accuracy Scores: [0.81277123 0.8233106  0.82506203 0.81637717 0.81265509]
Mean Cross-Validation Accuracy: 0.818035225578773
SVM(linear) Accuracy: 0.8081597222222222
SVM(linear) Classification Report:
               precision    recall  f1-score   support

    Negative      0.743     0.757     0.750       692
     Neutral      0.335     0.581     0.425       346
    Positive      0.961     0.855     0.905      2418

    accuracy                          0.808      3456
   macro avg      0.680     0.731     0.693      3456
weighted avg      0.855     0.808     0.826      3456



# SVM (RBF kernel) with TF-IDF

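The radial basis function (RBF) kernel, also known as the Gaussian kernel, is a non-linear kernel that implicitly maps data to a higher-dimensional space. It allows non-linear separation where classes cannot be separated by a single hyperplane, and it can capture complex patterns by creating flexible decision boundaries.

It is more computationally expensive and requires careful tuning of its parameters to avoid overfitting.

A linear kernel is often considered the more suitable choice for text classification such as airline review sentiment, because TF-IDF features are already high-dimensional and sparse. On this dataset, however, the RBF kernel scores higher on the held-out split (0.864 test accuracy against 0.808 for the linear kernel).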
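
The difference between the two kernels can be sketched on a pair of vectors. The vectors `x` and `z` and the value of `gamma` below are hypothetical. In particular, `gamma = 0.5` is an illustrative choice, not the value `SVC` derives from its default `gamma='scale'`. The vectors are unit length because `TfidfVectorizer` L2-normalises each row by default.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

# Two hypothetical L2-normalised feature vectors
x = np.array([[0.6, 0.8, 0.0]])
z = np.array([[0.0, 0.6, 0.8]])
gamma = 0.5  # illustrative value only

# Linear kernel: the plain dot product
k_linear = linear_kernel(x, z)[0, 0]

# RBF kernel: exp(-gamma * squared Euclidean distance)
squared_distance = np.sum((x - z) ** 2)
k_rbf_manual = np.exp(-gamma * squared_distance)
k_rbf = rbf_kernel(x, z, gamma=gamma)[0, 0]

print(k_linear, k_rbf, k_rbf_manual)
```

For unit-length vectors the squared distance equals `2 - 2 * k_linear`, so the RBF value is a monotonic function of the dot product. The extra flexibility of the RBF kernel comes from how the SVM combines many such similarities, not from a different notion of closeness between two reviews.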

In [33]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

svm_model = SVC(kernel='rbf' , C=1, class_weight='balanced', random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cross_val_scores = cross_val_score(svm_model, X_train, y_train, cv=skf, scoring='accuracy')

svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)

print("Cross-Validation Accuracy Scores:", cross_val_scores)
print("Mean Cross-Validation Accuracy:", cross_val_scores.mean())

print("SVM(rbf) Accuracy:", accuracy_score(y_test, svm_predictions))
print("SVM(rbf) Classification Report:\n", classification_report(y_test, svm_predictions, digits=3))

Cross-Validation Accuracy Scores: [0.866708   0.86546807 0.86724566 0.87282878 0.86166253]
Mean Cross-Validation Accuracy: 0.8667826084281097
SVM(rbf) Accuracy: 0.8642939814814815
SVM(rbf) Classification Report:
               precision    recall  f1-score   support

    Negative      0.761     0.838     0.798       692
     Neutral      0.502     0.471     0.486       346
    Positive      0.947     0.928     0.938      2418

    accuracy                          0.864      3456
   macro avg      0.737     0.746     0.740      3456
weighted avg      0.865     0.864     0.864      3456



# SVM (linear) with TF-IDF with GridSearchCV

We run `GridSearchCV` to tune the regularisation parameter `C`.

The `C` parameter controls the trade-off between maximising the margin and minimising classification errors.
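
For reference, this trade-off is visible in the standard soft-margin objective for the binary linear case, written here as a sketch in generic notation ($w$, $b$, $\xi_i$ are the usual weight vector, bias and slack variables, not names from this notebook):

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad y_i\,(w^{\top}x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0 .
$$

The first term favours a wide margin and the second penalises margin violations. A small `C` tolerates more violations in exchange for a wider margin, while a large `C` fits the training data more tightly.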

In [15]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(kernel='linear'), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best C parameter:", grid_search.best_params_)

Best C parameter: {'C': 1}
