# Enhancing Singapore Airlines' Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**



## Singapore Airlines Customer Reviews Dataset Information

The [Singapore Airlines Customer Reviews Dataset](https://www.kaggle.com/datasets/kanchana1990/singapore-airlines-reviews) aggregates 10,000 anonymized customer reviews, providing a broad perspective on the passenger experience with Singapore Airlines. 

The dataset's columns are described below:
- **`published_date`**: Date and time of review publication.
- **`published_platform`**: Platform where the review was posted.
- **`rating`**: Customer satisfaction rating, from 1 (lowest) to 5 (highest).
- **`type`**: Specifies the content as a review.
- **`text`**: Detailed customer feedback.
- **`title`**: Summary of the review.
- **`helpful_votes`**: Number of users finding the review helpful.

## Importing Libraries

Uncomment the cell below to install the dependencies for this notebook with pip.

In [1]:
# !pip3 install -r requirements.txt

In [2]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np

# Statistical functions
from scipy.stats import zscore

# For concurrency (running functions in parallel)
from concurrent.futures import ThreadPoolExecutor

# For caching (to speed up repeated function calls)
from functools import lru_cache

# For progress tracking
from tqdm import tqdm

# Plotting and Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Language Detection packages
# `langdetect` for detecting language
from langdetect import detect as langdetect_detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException
# `langid` for an alternative language detection method
from langid import classify as langid_classify

# Text Preprocessing and NLP
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords

# WordNet lexical database (used by the lemmatizer)
from nltk.corpus import wordnet

# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer
import nltk
# Regular expressions for text pattern matching
import re

# Word Cloud generation
from wordcloud import WordCloud

# For generating n-grams
from nltk.util import ngrams
from collections import Counter

## Data Preparation (Loading CSV)

Load the `singapore_airline_reviews.csv` file into a pandas DataFrame `data`.

In [3]:
data = pd.read_csv("singapore_airline_reviews.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   published_date      10000 non-null  object
 1   published_platform  10000 non-null  object
 2   rating              10000 non-null  int64 
 3   type                10000 non-null  object
 4   text                10000 non-null  object
 5   title               9999 non-null   object
 6   helpful_votes       10000 non-null  int64 
dtypes: int64(2), object(5)
memory usage: 547.0+ KB


In [4]:
data.head()

| | published_date | published_platform | rating | type | text | title | helpful_votes |
|---|---|---|---|---|---|---|---|
| 0 | 2024-03-12T14:41:14-04:00 | Desktop | 3 | review | We used this airline to go from Singapore to L... | Ok | 0 |
| 1 | 2024-03-11T19:39:13-04:00 | Desktop | 5 | review | The service on Singapore Airlines Suites Class... | The service in Suites Class makes one feel lik... | 0 |
| 2 | 2024-03-11T12:20:23-04:00 | Desktop | 1 | review | Booked, paid and received email confirmation f... | Don’t give them your money | 0 |
| 3 | 2024-03-11T07:12:27-04:00 | Desktop | 5 | review | Best airline in the world, seats, food, servic... | Best Airline in the World | 0 |
| 4 | 2024-03-10T05:34:18-04:00 | Desktop | 2 | review | Premium Economy Seating on Singapore Airlines ... | Premium Economy Seating on Singapore Airlines ... | 0 |


# Data Cleaning

## Remove Duplicate Rows

- Drop duplicate rows from the dataframe (`data`) and reset the index.

In [5]:
data = data.drop_duplicates().reset_index(drop=True)

# Display the new dataframe shape
print("The new shape is: ", data.shape)

# Make sure no more duplicates are present
print("Remaining duplicate rows:", data.duplicated().sum())

The new shape is:  (10000, 7)
Remaining duplicate rows: 0


## Check for Null Values

- Here we check which features have null values using the `isnull()` function.

In [6]:
# Only the `title` column has a null value (one row); it will be filled with an empty string ""
data.isnull().sum()

published_date        0
published_platform    0
rating                0
type                  0
text                  0
title                 1
helpful_votes         0
dtype: int64

In [7]:
# Fill missing values with empty string
data = data.fillna("")

In [8]:
# Verify that there are no missing values
data.isnull().sum()

published_date        0
published_platform    0
rating                0
type                  0
text                  0
title                 0
helpful_votes         0
dtype: int64

## Convert data types

Since the `published_date` column is stored as strings (`str`), we will:
- Convert `published_date` to a standard timezone (UTC) format as a new column `date`.
- Drop the original `published_date` column after conversion and reset the index.

In [9]:
# Set `utc=True` to convert the date to common timezone (UTC)
data["date"] = pd.to_datetime(data["published_date"], utc=True)
print(data["date"].dtype)

datetime64[ns, UTC]
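
To illustrate what `utc=True` does, here is a minimal sketch using two of the timestamps from the preview above. The variable names `raw` and `converted` are illustrative and not part of the notebook. Each string carries a `-04:00` offset, so the parsed values are shifted forward by four hours when expressed in UTC.

```python
import pandas as pd

# Two timestamps copied from the preview above, each with a -04:00 offset
raw = pd.Series(["2024-03-12T14:41:14-04:00", "2024-03-11T19:39:13-04:00"])

# utc=True parses the offset and places every value on the UTC timeline
converted = pd.to_datetime(raw, utc=True)

print(converted.iloc[0])  # 14:41:14-04:00 becomes 18:41:14+00:00
```

This matches the `date` values shown after the conversion in the notebook, where `14:41:14-04:00` appears as `18:41:14+00:00`.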


In [10]:
# Drop `published_date` column and reset the index
data = data.drop(columns=["published_date"]).reset_index(drop=True)
data.head()

| | published_platform | rating | type | text | title | helpful_votes | date |
|---|---|---|---|---|---|---|---|
| 0 | Desktop | 3 | review | We used this airline to go from Singapore to L... | Ok | 0 | 2024-03-12 18:41:14+00:00 |
| 1 | Desktop | 5 | review | The service on Singapore Airlines Suites Class... | The service in Suites Class makes one feel lik... | 0 | 2024-03-11 23:39:13+00:00 |
| 2 | Desktop | 1 | review | Booked, paid and received email confirmation f... | Don’t give them your money | 0 | 2024-03-11 16:20:23+00:00 |
| 3 | Desktop | 5 | review | Best airline in the world, seats, food, servic... | Best Airline in the World | 0 | 2024-03-11 11:12:27+00:00 |
| 4 | Desktop | 2 | review | Premium Economy Seating on Singapore Airlines ... | Premium Economy Seating on Singapore Airlines ... | 0 | 2024-03-10 09:34:18+00:00 |


## Remove Outliers

### `text`

The `text` column of `data` (of type `str`) may contain unusually long values, which we treat as outliers. We identify them using the Z-score method.

1. Create a new column `text_length` in the DataFrame `data` by calculating the length of each review. (Set the value to 0 if the corresponding `text` value is NaN.)

2. Check the statistics of `text_length` using `describe()` method.

3. Calculate the mean and standard deviation of the `text_length` column.

4. Set the Z-score threshold for identifying outliers to 3.

5. Identify outliers of the `text_length` column and set the corresponding `text` to np.nan.

6. Drop the `text_length` column from the DataFrame.

In [11]:
data['text_length'] = data['text'].apply(lambda x: len(x) if pd.notna(x) else 0)
print(data.head(3))

TL = data["text_length"]
stats_TL = TL.describe()
print(stats_TL)

  published_platform  rating    type  \
0            Desktop       3  review   
1            Desktop       5  review   
2            Desktop       1  review   

                                                text  \
0  We used this airline to go from Singapore to L...   
1  The service on Singapore Airlines Suites Class...   
2  Booked, paid and received email confirmation f...   

                                               title  helpful_votes  \
0                                                 Ok              0   
1  The service in Suites Class makes one feel lik...              0   
2                         Don’t give them your money              0   

                       date  text_length  
0 2024-03-12 18:41:14+00:00         1356  
1 2024-03-11 23:39:13+00:00         4674  
2 2024-03-11 16:20:23+00:00          420  
count    10000.00000
mean       558.33400
std        642.79261
min        100.00000
25%        228.00000
50%        381.00000
75%        667.25000
max      1

In [12]:
mean_TL = TL.mean()
# print(mean_TL)

sd_TL = TL.std()
# print(sd_TL)

threshold = 3

z_score = zscore(TL)
# print(z_score)

# Set 'text' to NaN where its length is more than 3 standard deviations away from the mean
data.loc[abs(z_score) > threshold, 'text'] = np.nan
# print(data.head(3))

data = data.drop("text_length", axis=1)

data.head()
data.shape

(10000, 7)
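
As a rough illustration of the Z-score rule used above, the sketch below computes z = (x - mean) / std by hand and flags values with |z| > 3. The helper name `zscore_outlier_mask` and the sample lengths are hypothetical. It uses the population standard deviation (`ddof=0`), which is the default of `scipy.stats.zscore`, whereas pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`).

```python
import numpy as np

def zscore_outlier_mask(values, threshold=3.0):
    """Return a boolean mask marking values whose |z-score| exceeds threshold."""
    values = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0), as in scipy.stats.zscore's default
    z = (values - values.mean()) / values.std(ddof=0)
    return np.abs(z) > threshold

# Hypothetical review lengths: 19 typical values and one extreme value
lengths = np.array([300] * 19 + [20000])
mask = zscore_outlier_mask(lengths)

print(mask.sum())      # number of flagged values
print(lengths[mask])   # the flagged lengths
```

Because the mask uses the absolute value, unusually short values would be flagged as well as unusually long ones, just as in the notebook cell.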

### `title`

Similarly, the `title` column of `data` (of type `str`) may also contain values with unusually long lengths, indicating the presence of outliers.

1. Create a new column `title_length` in the DataFrame `data` by calculating the length of each title. (Set the value to 0 if the corresponding `title` value is NaN.)

2. Check the statistics of `title_length` using the `describe()` method.

3. Identify outliers in `title_length` using a Z-score threshold of 3 and set the corresponding value of `title` to np.nan.

4. Drop the `title_length` column from the DataFrame.

In [13]:
data['title_length'] = data['title'].apply(lambda x: len(x) if pd.notna(x) else 0)
print(data.head(3))

TL = data["title_length"]
stats_TL = TL.describe()
print(stats_TL)

  published_platform  rating    type  \
0            Desktop       3  review   
1            Desktop       5  review   
2            Desktop       1  review   

                                                text  \
0  We used this airline to go from Singapore to L...   
1                                                NaN   
2  Booked, paid and received email confirmation f...   

                                               title  helpful_votes  \
0                                                 Ok              0   
1  The service in Suites Class makes one feel lik...              0   
2                         Don’t give them your money              0   

                       date  title_length  
0 2024-03-12 18:41:14+00:00             2  
1 2024-03-11 23:39:13+00:00            51  
2 2024-03-11 16:20:23+00:00            26  
count    10000.000000
mean        28.409600
std         17.279945
min          0.000000
25%         16.000000
50%         24.000000
75%         36.000000

In [14]:
mean_TL = TL.mean()
# print(mean_TL)

sd_TL = TL.std()
# print(sd_TL)

threshold = 3

z_score = zscore(TL)
# print(z_score)

# Set 'title' to NaN where its length is more than 3 standard deviations away from the mean
data.loc[abs(z_score) > threshold, 'title'] = np.nan
# print(data.head(3))

data = data.drop("title_length", axis=1)
data.head()

| | published_platform | rating | type | text | title | helpful_votes | date |
|---|---|---|---|---|---|---|---|
| 0 | Desktop | 3 | review | We used this airline to go from Singapore to L... | Ok | 0 | 2024-03-12 18:41:14+00:00 |
| 1 | Desktop | 5 | review | NaN | The service in Suites Class makes one feel lik... | 0 | 2024-03-11 23:39:13+00:00 |
| 2 | Desktop | 1 | review | Booked, paid and received email confirmation f... | Don’t give them your money | 0 | 2024-03-11 16:20:23+00:00 |
| 3 | Desktop | 5 | review | Best airline in the world, seats, food, servic... | Best Airline in the World | 0 | 2024-03-11 11:12:27+00:00 |
| 4 | Desktop | 2 | review | Premium Economy Seating on Singapore Airlines ... | Premium Economy Seating on Singapore Airlines ... | 0 | 2024-03-10 09:34:18+00:00 |


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   published_platform  10000 non-null  object             
 1   rating              10000 non-null  int64              
 2   type                10000 non-null  object             
 3   text                9846 non-null   object             
 4   title               9834 non-null   object             
 5   helpful_votes       10000 non-null  int64              
 6   date                10000 non-null  datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), int64(2), object(4)
memory usage: 547.0+ KB


In [16]:
#check data types of each column, make sure they are correct
print(data.dtypes)

# Make sure no more duplicates are present
print("Remaining duplicate rows:", data.duplicated().sum())

# Check for outliers in ratings
print("Unique ratings:", data['rating'].unique())

published_platform                 object
rating                              int64
type                               object
text                               object
title                              object
helpful_votes                       int64
date                  datetime64[ns, UTC]
dtype: object
Remaining duplicate rows: 0
Unique ratings: [3 5 1 2 4]


In [17]:
data.isnull().sum()

published_platform      0
rating                  0
type                    0
text                  154
title                 166
helpful_votes           0
date                    0
dtype: int64

# Feature Engineering

### Create new column `full_review`
Since there are some rows with empty `text` and `title`, we will concatenate both columns (`text` and `title`) to form a new column `full_review`.
1. Replace `NaN` values in `text` and `title` with an empty string.

2. Combine `text` and `title` into `full_review`.

3. Strip any leading/trailing whitespaces in `full_review`.

4. Drop `text` and `title` columns.

In [18]:
# 1) Fill NaN values in 'text' and 'title' with an empty string
data['title'] = data['title'].fillna('')
data['text'] = data['text'].fillna('')

# 2) Combine 'text' and 'title' into 'full_review'
data['full_review'] = data['text'] + " " + data['title']

# 3) Strip any leading/trailing whitespace
data['full_review'] = data['full_review'].str.strip()

# 4) Drop `text` and `title` columns
data = data.drop(columns = ['text', 'title'])

# Check that the 'full_review' column was added and that the 'text' and 'title' columns have been dropped
print(data.head())
print("\nThe old shape is:",data.shape)

  published_platform  rating    type  helpful_votes                      date  \
0            Desktop       3  review              0 2024-03-12 18:41:14+00:00   
1            Desktop       5  review              0 2024-03-11 23:39:13+00:00   
2            Desktop       1  review              0 2024-03-11 16:20:23+00:00   
3            Desktop       5  review              0 2024-03-11 11:12:27+00:00   
4            Desktop       2  review              0 2024-03-10 09:34:18+00:00   

                                         full_review  
0  We used this airline to go from Singapore to L...  
1  The service in Suites Class makes one feel lik...  
2  Booked, paid and received email confirmation f...  
3  Best airline in the world, seats, food, servic...  
4  Premium Economy Seating on Singapore Airlines ...  

The old shape is: (10000, 6)


### Remove empty strings
1. Drop rows where `full_review` is an empty string and reset the index.

2. Check that there are no more null values in `data`.

In [19]:
# 1) Drop rows where `full_review` are empty strings and reset the index
data = data[data['full_review'] != ""].reset_index(drop=True)
print("The new shape is:",data.shape)

# 2) Check if there are no more null values in `data`
data.isnull().sum()

The new shape is: (9990, 6)


published_platform    0
rating                0
type                  0
helpful_votes         0
date                  0
full_review           0
dtype: int64

### Create new column `language`
Some rows of `full_review` may be written in languages other than English (e.g., French or Russian). To detect them, we run two language detection libraries (`langdetect` and `langid`) on the `full_review` column and combine their predictions by majority vote.

**Reason**: `langdetect` may perform well on longer texts, while `langid` may be more reliable on short texts, so combining detectors reduces the impact of any single detector's errors. Note that with only two detectors a majority exists only when both agree: as implemented below, a review is labelled with a language only if both detectors return the same one. If they disagree, or if either detector fails or raises an error, the review is labelled `unknown`.

1. Set a seed for `langdetect` to ensure reproducibility.

2. Preprocess the text in `full_review`:
    - a\) Function to remove non-alphabetic characters and normalise whitespaces in  `full_review`.
    - b\) Function to determine if the text is non-language (e.g., numbers, symbols only).

3. Two functions for language detection:
    - a\) Using `langdetect`.
    - b\) Using `langid`.

4. Function for calculating majority vote for each language.

5. Function for parallel processing for efficiency.

6. Caching function for repeated inputs

7. Function for choosing language based on combined majority voting.

8. Applying the combined function on `full_review` column.

9. Display the resulting `data` DataFrame.
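
The voting rule in steps 4 and 7 can be sketched on its own, without either detection library. The function name `agreed_language` is hypothetical, and the strict-majority test is an assumption that generalises the notebook's "more than one vote" check to any number of detectors. With two detectors it behaves the same way: both must agree.

```python
from collections import Counter

def agreed_language(predictions, unknown="unknown"):
    """Return the language predicted by a strict majority of detectors, else `unknown`."""
    counts = Counter(predictions)
    language, votes = counts.most_common(1)[0]
    # Require more than half of the detectors to agree on a real language
    if language == unknown or votes <= len(predictions) / 2:
        return unknown
    return language

print(agreed_language(["en", "en"]))       # both detectors agree -> en
print(agreed_language(["en", "fr"]))       # disagreement -> unknown
print(agreed_language(["en", "unknown"]))  # one detector failed -> unknown
```

A consequence of this design is that a single failing detector is enough to send a review to `unknown`, which trades some recall for fewer mislabelled reviews.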

### <span style="color:red">The code below will take approximately 2 minutes to run!</span>

In [20]:
# 1) Set a seed for langdetect to ensure reproducibility
DetectorFactory.seed = 0

# 2a) Simplified preprocessing: only remove non-alphabetic characters
def preprocess_text_simple(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
    return text.strip()

# 2b) Check if the text is non-language (e.g., numbers, symbols only)
def is_non_language_text(text):
    if re.match(r'^[^a-zA-Z]*$', text):  # Check if text has no alphabetic characters
        return True
    return False

# 3a) Function to get langdetect prediction
def get_langdetect_prediction(text):
    try:
        # Directly use text without preprocessing for efficiency
        if len(text) < 10 or is_non_language_text(text):
            return "unknown"
        lang = langdetect_detect(text)
        return lang
    except LangDetectException:
        return "unknown"

# 3b) Function to get langid prediction
def get_langid_prediction(text):
    try:
        # Skip very short or non-alphabetic text before calling the classifier
        if len(text) < 10 or is_non_language_text(text):
            return "unknown"
        lang, _ = langid_classify(text)
        return lang
    except Exception:
        return "unknown"

# 4) Function to calculate majority vote for each language
def calculate_majority_vote(predictions):
    vote_counts = {}
    for lang in predictions:
        if lang in vote_counts:
            vote_counts[lang] += 1
        else:
            vote_counts[lang] = 1
    return vote_counts

# 5) Parallel processing for efficiency with limited workers
def parallel_detection(text):
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(lambda func: func(text), 
                                    [get_langdetect_prediction, get_langid_prediction]))
    return results

# 6) Caching function for repeated inputs
@lru_cache(maxsize=500)
def get_cached_language(text):
    return combined_language_detection(text)

# 7) Combined majority voting language detection function
def combined_language_detection(text):
    # Check if the text is non-language (e.g., numbers, symbols only)
    if is_non_language_text(text):
        return "unknown"
    
    # Run the detectors in parallel for efficiency
    predictions = parallel_detection(text)
    
    # Calculate majority vote for each language based on predictions
    vote_counts = calculate_majority_vote(predictions)
    
    # Determine the language with the highest majority vote
    final_language = max(vote_counts, key=vote_counts.get)
    
    # With two detectors, a language is accepted only if both agree; otherwise return "unknown"
    if final_language == "unknown" or vote_counts[final_language] <= 1:
        return "unknown"
    
    return final_language

# 8) Apply the cached function to each text in the DataFrame with a progress bar
data['language'] = [get_cached_language(text) for text in tqdm(data['full_review'], desc="Language Detection")]

# 9) Display the DataFrame with detected languages
data

Language Detection: 100%|██████████| 9990/9990 [02:04<00:00, 80.05it/s] 


| | published_platform | rating | type | helpful_votes | date | full_review | language |
|---|---|---|---|---|---|---|---|
| 0 | Desktop | 3 | review | 0 | 2024-03-12 18:41:14+00:00 | We used this airline to go from Singapore to L... | en |
| 1 | Desktop | 5 | review | 0 | 2024-03-11 23:39:13+00:00 | The service in Suites Class makes one feel lik... | en |
| 2 | Desktop | 1 | review | 0 | 2024-03-11 16:20:23+00:00 | Booked, paid and received email confirmation f... | en |
| 3 | Desktop | 5 | review | 0 | 2024-03-11 11:12:27+00:00 | Best airline in the world, seats, food, servic... | en |
| 4 | Desktop | 2 | review | 0 | 2024-03-10 09:34:18+00:00 | Premium Economy Seating on Singapore Airlines ... | en |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9985 | Desktop | 5 | review | 1 | 2018-08-06 07:48:21+00:00 | First part done with Singapore Airlines - acce... | en |
| 9986 | Mobile | 5 | review | 1 | 2018-08-06 02:50:29+00:00 | And again a great Flight with Singapore Air. G... | en |
| 9987 | Desktop | 5 | review | 1 | 2018-08-06 02:47:06+00:00 | We flew business class from Frankfurt, via Sin... | en |
| 9988 | Desktop | 4 | review | 2 | 2018-08-06 00:32:03+00:00 | As always, the A380 aircraft was spotlessly pr... | en |


In [21]:
# See distribution of languages
data["language"].value_counts()

language
en         9953
unknown      32
es            1
de            1
th            1
fr            1
sv            1
Name: count, dtype: int64
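
The caching step relies on `functools.lru_cache`, which only saves work when exactly the same string is passed again. The sketch below shows this with a hypothetical stand-in detector (`detect_cached` is not the notebook's function) and a small made-up list of reviews. `cache_info()` reports how many calls were answered from the cache.

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=500)
def detect_cached(text):
    # Stand-in for an expensive detector call; counts how often it really runs
    calls["count"] += 1
    return "en" if text.isascii() else "unknown"

reviews = ["Great flight", "Great flight", "Good food", "Great flight"]
labels = [detect_cached(r) for r in reviews]

info = detect_cached.cache_info()
print(info.hits, info.misses)  # 2 hits, 2 misses for the list above
print(calls["count"])          # the underlying function ran only twice
```

Since the reviews here are long free-text strings, identical inputs are probably rare, so the speed-up from caching in this notebook may be small.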

In [22]:
# Drop rows where language is NOT in english and reset the index
data = data[data['language'] == 'en'].reset_index(drop=True)
print(data.shape)

(9953, 7)


We will drop the `language` column since all values of `language` are `en` and all `full_review` are in the English language.

In [23]:
data.info()
data.drop(columns=["language"], inplace=True)
print("The new shape is:", data.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9953 entries, 0 to 9952
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   published_platform  9953 non-null   object             
 1   rating              9953 non-null   int64              
 2   type                9953 non-null   object             
 3   helpful_votes       9953 non-null   int64              
 4   date                9953 non-null   datetime64[ns, UTC]
 5   full_review         9953 non-null   object             
 6   language            9953 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(2), object(4)
memory usage: 544.4+ KB
The new shape is: (9953, 6)


In [24]:
data.head()

| | published_platform | rating | type | helpful_votes | date | full_review |
|---|---|---|---|---|---|---|
| 0 | Desktop | 3 | review | 0 | 2024-03-12 18:41:14+00:00 | We used this airline to go from Singapore to L... |
| 1 | Desktop | 5 | review | 0 | 2024-03-11 23:39:13+00:00 | The service in Suites Class makes one feel lik... |
| 2 | Desktop | 1 | review | 0 | 2024-03-11 16:20:23+00:00 | Booked, paid and received email confirmation f... |
| 3 | Desktop | 5 | review | 0 | 2024-03-11 11:12:27+00:00 | Best airline in the world, seats, food, servic... |
| 4 | Desktop | 2 | review | 0 | 2024-03-10 09:34:18+00:00 | Premium Economy Seating on Singapore Airlines ... |


# Text Preprocessing for NLP

Here we will define a function `process_full_review` that takes a textual value as input and applies the following processing steps in sequence:

1. Convert the input text to lowercase using the `lower()` function.

2. Tokenize the lowercase text using the `word_tokenize` function from the NLTK library.

3. Create a list (`alphabetic_tokens`) containing only alphabetic tokens using a list comprehension with a regular expression match.

4. Remove stopwords
-   Obtain a set of English stopwords using the `stopwords.words('english')` method.
-   Define a list of `allowed_words` that should not be removed.
-   Remove the stopwords (excluding those that should not be removed).

5. Apply stemming to each token using the `stem` method of NLTK's `PorterStemmer`, producing the list `stemmed_words`.

6. Join the stemmed tokens into a single processed text using the `join` method and return the processed text.

Create a new column `processed_full_review` in `data` by applying the `process_full_review` function to the `full_review` column.
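
The stopword step is the part of this pipeline most likely to affect sentiment, so here is a minimal standalone sketch of it. It is an illustration under stated assumptions: `STOP_WORDS` is a hypothetical miniature list (the notebook uses NLTK's English list), `re.findall` stands in for `word_tokenize`, and stemming is left out.

```python
import re

# Hypothetical miniature stopword list; the notebook uses NLTK's English list
STOP_WORDS = {"the", "was", "not", "but", "a", "and", "is", "no"}
# Negations and contrast words to keep even though they are stopwords
ALLOWED_WORDS = {"no", "not", "but"}

def filter_tokens(text):
    """Lowercase, keep alphabetic tokens, drop stopwords except allowed ones."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS or t in ALLOWED_WORDS]

print(filter_tokens("The food was not good, but the crew was friendly."))
# ['food', 'not', 'good', 'but', 'crew', 'friendly']
```

Without the allowed list, "not good" would be reduced to "good", which reverses the meaning of the phrase.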

In [25]:
# Ensure the required NLTK data is downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Redbu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Redbu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Redbu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Redbu\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [26]:
from nltk.stem import PorterStemmer
# Define function to process text
def process_full_review(text):
    processed_text = ""

    # Convert text to lowercase
    text = text.lower()

    # Tokenize the text into words
    tokens = word_tokenize(text)

    # Keep only alphabetic tokens
    alphabetic_tokens = [i for i in tokens if re.match('^[a-zA-Z]+$', i)]

    if len(alphabetic_tokens) == 0:
        # Return empty processed text if there are no alphabetic tokens
        return processed_text

    # Set of English stopwords (a set makes membership checks fast)
    stop_words = set(stopwords.words('english'))

    # List of allowed words (to preserve certain negative words and conjunctions)
    allowed_words = ["no", "not", "don't", "dont", "don", "but", 
                     "however", "never", "wasn't", "wasnt", "shouldn't",
                     "shouldnt", "mustn't", "mustnt"]
    '''
    These words may carry important information, such as negation. For example, in
    "don't ever get this dish", removing "don't" could leave "get dish", which reads as the
    opposite sentiment of the original review.
    Conjunctions like "but" and "however" signal a contrast with what was said before, so they can
    change the sentiment of the sentence. Similarly, "mustn't" and "shouldn't" typically carry a negative sentiment.
    '''

    # Filter out stopwords, keeping allowed words
    filtered_tokens = [i for i in alphabetic_tokens if i not in stop_words or i in allowed_words]

    # Initialise the Porter stemmer
    stemmer = PorterStemmer()

    # Stem the filtered tokens
    stemmed_words = [stemmer.stem(word) for word in filtered_tokens]

    # Join the stemmed words back into a single string
    processed_text = ' '.join(stemmed_words)

    return processed_text

In [27]:
# Enable tqdm for pandas (progress bar)
tqdm.pandas(desc="Processing Reviews")

# Apply process_full_review with a tqdm progress bar and store the results in a new column
data['processed_full_review'] = data['full_review'].progress_apply(process_full_review)

data

Processing Reviews: 100%|██████████| 9953/9953 [00:34<00:00, 287.57it/s]


| | published_platform | rating | type | helpful_votes | date | full_review | processed_full_review |
|---|---|---|---|---|---|---|---|
| 0 | Desktop | 3 | review | 0 | 2024-03-12 18:41:14+00:00 | We used this airline to go from Singapore to L... | use airlin go singapor london heathrow issu ti... |
| 1 | Desktop | 5 | review | 0 | 2024-03-11 23:39:13+00:00 | The service in Suites Class makes one feel lik... | servic suit class make one feel like vip |
| 2 | Desktop | 1 | review | 0 | 2024-03-11 16:20:23+00:00 | Booked, paid and received email confirmation f... | book paid receiv email confirm extra legroom s... |
| 3 | Desktop | 5 | review | 0 | 2024-03-11 11:12:27+00:00 | Best airline in the world, seats, food, servic... | best airlin world seat food servic brilliant c... |
| 4 | Desktop | 2 | review | 0 | 2024-03-10 09:34:18+00:00 | Premium Economy Seating on Singapore Airlines ... | premium economi seat singapor airlin narrow se... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9948 | Desktop | 5 | review | 1 | 2018-08-06 07:48:21+00:00 | First part done with Singapore Airlines - acce... | first part done singapor airlin accept comfort... |
| 9949 | Mobile | 5 | review | 1 | 2018-08-06 02:50:29+00:00 | And again a great Flight with Singapore Air. G... | great flight singapor air great uniqu servic o... |
| 9950 | Desktop | 5 | review | 1 | 2018-08-06 02:47:06+00:00 | We flew business class from Frankfurt, via Sin... | flew busi class frankfurt via singapor brisban... |
| 9951 | Desktop | 4 | review | 2 | 2018-08-06 00:32:03+00:00 | As always, the A380 aircraft was spotlessly pr... | alway aircraft spotlessli present board carpet... |


### Mapping ratings to sentiment labels

In [28]:
# Function to map ratings to sentiment
def rating_to_sentiment(rating):
    if rating <= 2:
        return 'Negative'
    elif rating == 3:
        return 'Neutral'
    else:
        return 'Positive'

# Apply the function to the 'rating' column
data['sentiment'] = data['rating'].apply(rating_to_sentiment)

# Check the sentiment distribution
print(data['sentiment'].value_counts())

sentiment
Positive    7376
Negative    1577
Neutral     1000
Name: count, dtype: int64
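
The same mapping (ratings 1 and 2 are Negative, 3 is Neutral, 4 and 5 are Positive) can also be expressed without a Python-level function. The sketch below is one possible vectorised alternative using `pd.cut`. The sample `ratings` series is made up for illustration, and the result is a categorical column rather than plain strings.

```python
import pandas as pd

ratings = pd.Series([1, 2, 3, 4, 5])

# Bins are right-inclusive: (0, 2] -> Negative, (2, 3] -> Neutral, (3, 5] -> Positive
sentiment = pd.cut(ratings, bins=[0, 2, 3, 5],
                   labels=["Negative", "Neutral", "Positive"])

print(sentiment.tolist())
# ['Negative', 'Negative', 'Neutral', 'Positive', 'Positive']
```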


# Feature Selection
Now, we select the final columns to use for our sentiment analysis of airline reviews.
- Columns kept: `processed_full_review`, `sentiment`

- Columns excluded: [`published_platform`, `rating`, `type`, `helpful_votes`, `date`, `full_review`]

- Create a new DataFrame (`data_final`) by selecting the columns kept above from the original DataFrame `data`.

In [29]:
data_final = data[['processed_full_review','sentiment']]
data_final.head()

| | processed_full_review | sentiment |
|---|---|---|
| 0 | use airlin go singapor london heathrow issu ti... | Neutral |
| 1 | servic suit class make one feel like vip | Positive |
| 2 | book paid receiv email confirm extra legroom s... | Negative |
| 3 | best airlin world seat food servic brilliant c... | Positive |
| 4 | premium economi seat singapor airlin narrow se... | Negative |


# Multinomial NB with Count Vectorizer

Count vectoriser converts a collection of text documents into a matrix of token counts. It simply counts the number of occurrences of each word in each document, without weighting words by how common or rare they are across the entire corpus.

With a count vectoriser every word is weighted equally by its raw frequency, whereas tf-idf weights each word by its term frequency and by its rarity across documents.

Count vectoriser is simple and fast as it's just raw counts, while tf-idf requires more computation as it is slightly more complex due to the use of inverse document frequency.
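
To make the difference concrete, the sketch below computes raw counts and a textbook tf-idf score (raw count times log(N / df)) by hand on three made-up documents. This is an illustration only: scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalisation by default, so its numbers would differ from these.

```python
import math
from collections import Counter

docs = [
    "great flight great crew",
    "bad flight late",
    "great food",
]
tokenized = [d.split() for d in docs]

# Raw counts per document (what a count vectoriser stores)
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each word appears
df = Counter(word for tokens in tokenized for word in set(tokens))
n_docs = len(docs)

def tfidf(word, doc_index):
    # Textbook variant: raw term count times log(N / df)
    return counts[doc_index][word] * math.log(n_docs / df[word])

print(counts[0]["great"], counts[0]["crew"])                      # 2 1
print(round(tfidf("great", 0), 3), round(tfidf("crew", 0), 3))    # 0.811 1.099
```

In the first document "great" has the higher raw count, but "crew" gets the higher tf-idf score because it appears in only one of the three documents.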

In [3]:
data = pd.read_csv("final_df.csv")

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

vectorizer = CountVectorizer(max_features=1000)
X = vectorizer.fit_transform(data['processed_full_review'])

X_train, X_test, y_train, y_test = train_test_split(X, data['sentiment'],test_size=0.3, random_state=42)

nb_model = MultinomialNB()
nb_model.fit(X_train,y_train)

nb_predictions = nb_model.predict(X_test)

print(accuracy_score(y_test,nb_predictions))
print(classification_report(y_test, nb_predictions))

0.8240740740740741
              precision    recall  f1-score   support

    Negative       0.78      0.69      0.73       692
     Neutral       0.38      0.64      0.48       346
    Positive       0.95      0.89      0.92      2418

    accuracy                           0.82      3456
   macro avg       0.70      0.74      0.71      3456
weighted avg       0.86      0.82      0.84      3456



# RF with Count Vectorizer

In [5]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

rf_predictions = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_predictions))
print("Random Forest Classification Report:\n", classification_report(y_test, rf_predictions))

Random Forest Accuracy: 0.8518518518518519
Random Forest Classification Report:
               precision    recall  f1-score   support

    Negative       0.81      0.78      0.79       692
     Neutral       0.88      0.08      0.15       346
    Positive       0.86      0.98      0.92      2418

    accuracy                           0.85      3456
   macro avg       0.85      0.61      0.62      3456
weighted avg       0.85      0.85      0.82      3456



# Logistic Regression with Count Vectorizer

In [6]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=42, max_iter=100).fit(X_train, y_train)
clf_predictions = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, clf_predictions))
print("Classification Report:\n", classification_report(y_test, clf_predictions))

Accuracy: 0.8449074074074074
Classification Report:
               precision    recall  f1-score   support

    Negative       0.76      0.76      0.76       692
     Neutral       0.43      0.36      0.39       346
    Positive       0.92      0.94      0.93      2418

    accuracy                           0.84      3456
   macro avg       0.70      0.69      0.69      3456
weighted avg       0.84      0.84      0.84      3456



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


With the count vectoriser, logistic regression does not converge within the 100 iterations allowed. The count vectoriser produces raw counts for each word, which can lead to large values in the feature matrix when a word appears very frequently in a document. TF-IDF, by contrast, down-weights common words and normalises each document vector (L2 normalisation by default in scikit-learn's `TfidfVectorizer`), so the values in the feature matrix are smaller.

The solver used for logistic regression is sensitive to the scale of the features. Because the count vectoriser can produce features on widely varying scales (some words appear many times in a document, others only once), convergence can be slow. TF-IDF yields more evenly scaled features, which makes the optimisation easier and convergence quicker.

The warning can be addressed either by increasing `max_iter` from 100 to 200 or by scaling the data.
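
Whether a fit actually reached the iteration limit can also be checked programmatically instead of relying on the printed warning. The sketch below is a minimal illustration under stated assumptions: `fit_and_check` is a hypothetical helper, and `X_demo` / `y_demo` are synthetic count-like data, not the review features. It records any `ConvergenceWarning` raised during fitting and reads the number of iterations used from the fitted model's `n_iter_` attribute.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression


def fit_and_check(X, y, max_iter):
    """Fit a logistic regression and report whether a convergence warning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf = LogisticRegression(random_state=42, max_iter=max_iter).fit(X, y)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return clf, int(clf.n_iter_.max()), warned


# Hypothetical stand-in for a count matrix: non-negative integer features
rng = np.random.default_rng(0)
X_demo = rng.poisson(lam=3.0, size=(200, 20))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 6).astype(int)

for max_iter in (5, 1000):
    _, n_iter, warned = fit_and_check(X_demo, y_demo, max_iter)
    print(f"max_iter={max_iter}: iterations used={n_iter}, warning={warned}")
```

The same helper could be applied to `X_train` and `y_train` to confirm whether a given `max_iter` is large enough before comparing accuracies.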

In [7]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=42, max_iter=200).fit(X_train, y_train)
clf_predictions = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, clf_predictions))
print("Classification Report:\n", classification_report(y_test, clf_predictions))

Accuracy: 0.84375
Classification Report:
               precision    recall  f1-score   support

    Negative       0.76      0.76      0.76       692
     Neutral       0.43      0.36      0.39       346
    Positive       0.92      0.94      0.93      2418

    accuracy                           0.84      3456
   macro avg       0.70      0.68      0.69      3456
weighted avg       0.84      0.84      0.84      3456



In [8]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=False)  # with_mean=False because sparse matrices don't support centering
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = LogisticRegression(random_state=42).fit(X_train_scaled, y_train)
clf_predictions = clf.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, clf_predictions))
print("Classification Report:\n", classification_report(y_test, clf_predictions))


Accuracy: 0.8035300925925926
Classification Report:
               precision    recall  f1-score   support

    Negative       0.70      0.70      0.70       692
     Neutral       0.34      0.39      0.36       346
    Positive       0.91      0.89      0.90      2418

    accuracy                           0.80      3456
   macro avg       0.65      0.66      0.65      3456
weighted avg       0.81      0.80      0.81      3456
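
The vectorising, scaling and classification steps above can also be bundled into a single scikit-learn `Pipeline`, so that the vocabulary and the scaling statistics are learnt from the training texts only and reapplied automatically at prediction time. This is a sketch on a made-up miniature dataset, not the notebook's actual workflow: `train_texts` and `train_labels` are hypothetical, and no results from the review data are implied.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature training set (not from the review dataset)
train_texts = [
    "great seat great crew",
    "late flight rude crew",
    "great food",
    "rude staff late",
]
train_labels = ["Positive", "Negative", "Positive", "Negative"]

pipe = make_pipeline(
    CountVectorizer(),
    StandardScaler(with_mean=False),  # no centering, so the matrix stays sparse
    LogisticRegression(random_state=42, max_iter=200),
)

# Fitting the pipeline fits each step in turn on the training texts only
pipe.fit(train_texts, train_labels)

# Raw text goes in; the fitted vectoriser and scaler are applied before predicting
prediction = pipe.predict(["great crew"])
print(prediction)
```

With the real data, the pipeline would be fitted on the training split of `data['processed_full_review']` rather than on a pre-vectorised matrix.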

