# Enhancing Singapore Airlines' Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**



## Singapore Airlines Customer Reviews Dataset Information

The [Singapore Airlines Customer Reviews Dataset](https://www.kaggle.com/datasets/kanchana1990/singapore-airlines-reviews) aggregates 10,000 anonymized customer reviews, providing a broad perspective on the passenger experience with Singapore Airlines. 

The dataset contains the following columns:
- **`published_date`**: Date and time of review publication.
- **`published_platform`**: Platform where the review was posted.
- **`rating`**: Customer satisfaction rating, from 1 (lowest) to 5 (highest).
- **`type`**: Specifies the content as a review.
- **`text`**: Detailed customer feedback.
- **`title`**: Summary of the review.
- **`helpful_votes`**: Number of users finding the review helpful.

## Additional web scraping of online reviews

During our EDA, we noticed two main trends in the distribution of our dataset:
1. Fewer than 10% of our reviews were published between 2022 and 2024, making it hard to capture recent trends in sentiment.
2. Most of the reviews were highly positive. This may simply reflect that SIA receives mostly positive reviews, but we wanted more negative reviews to improve the robustness of our model.
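
Both observations come down to simple frequency counts. The sketch below is only an illustration: it uses a small made-up list of `(year, rating)` pairs rather than the real dataset, and the helper name `share` is hypothetical. It relies on the standard library only.

```python
from collections import Counter

# Hypothetical (year, rating) pairs standing in for the real reviews
reviews = [
    (2018, 5), (2018, 4), (2019, 5), (2019, 5), (2019, 3),
    (2019, 4), (2020, 5), (2022, 1), (2019, 5), (2018, 2),
]

def share(counter, keys):
    """Return the fraction of all counted items that fall under `keys`."""
    total = sum(counter.values())
    return sum(counter[k] for k in keys) / total

year_counts = Counter(year for year, _ in reviews)
rating_counts = Counter(rating for _, rating in reviews)

recent_share = share(year_counts, [2022, 2023, 2024])   # reviews from 2022 to 2024
positive_share = share(rating_counts, [4, 5])           # ratings of 4 or 5

print(f"Recent reviews: {recent_share:.0%}, positive reviews: {positive_share:.0%}")
```

A low recent share and a high positive share on the real data are what motivated the extra scraping described in the next sections.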

### TripAdvisor

We scraped additional airline reviews from [TripAdvisor](https://www.tripadvisor.com.sg/Airline_Review-d8729151-Reviews-Singapore-Airlines), specifically for the years 2022 to 2024.

The scraped data contains the following columns:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 5)


### Skytrax

We also scraped reviews from [Skytrax](https://www.airlinequality.com/airline-reviews/singapore-airlines/?sortby=post_date%3ADesc&pagesize=100), another source of online airline reviews.

The scraped data contains the following columns:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 10)

## Importing Libraries

Please uncomment the code box below to pip install relevant dependencies for this notebook.

In [1]:
# !pip3 install -r requirements.txt

In [2]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime 

# Statistical functions
from scipy.stats import zscore

# For concurrency (running functions in parallel)
from concurrent.futures import ThreadPoolExecutor

# For caching (to speed up repeated function calls)
from functools import lru_cache

# For progress tracking
from tqdm import tqdm

# Plotting and Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Language Detection packages
# `langdetect` for detecting language
from langdetect import detect as langdetect_detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException
# `langid` for an alternative language detection method
from langid import classify as langid_classify

# Text Preprocessing and NLP
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords

import nltk

# Tokenizing sentences/words
from nltk.corpus import wordnet

# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer
# Regular expressions for text pattern matching
import re

# Word Cloud generation
from wordcloud import WordCloud

# For generating n-grams
from nltk.util import ngrams
from collections import Counter

In [3]:
## For Mac users: these resources might have to be installed manually

# Ensure required NLTK data is downloaded
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptr

True

## Data Preparation (Loading CSV)

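Load the three CSV files into separate pandas DataFrames: `tripadvisor_df` (the original Kaggle dataset), `tripadvisor_scraped_df`, and `skytrax_scraped_df`. They are combined into a single DataFrame `data` after cleaning.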

In [4]:
tripadvisor_df = pd.read_csv('singapore_airlines_reviews.csv')
tripadvisor_scraped_df = pd.read_csv('singapore_airlines_reviews_tripadvisor.csv')
skytrax_scraped_df = pd.read_csv('singapore_airlines_reviews_skytrax.csv')

In [5]:
tripadvisor_df.head()

Unnamed: 0,published_date,published_platform,rating,type,text,title,helpful_votes
0,2024-03-12T14:41:14-04:00,Desktop,3,review,We used this airline to go from Singapore to L...,Ok,0
1,2024-03-11T19:39:13-04:00,Desktop,5,review,The service on Singapore Airlines Suites Class...,The service in Suites Class makes one feel lik...,0
2,2024-03-11T12:20:23-04:00,Desktop,1,review,"Booked, paid and received email confirmation f...",Don’t give them your money,0
3,2024-03-11T07:12:27-04:00,Desktop,5,review,"Best airline in the world, seats, food, servic...",Best Airline in the World,0
4,2024-03-10T05:34:18-04:00,Desktop,2,review,Premium Economy Seating on Singapore Airlines ...,Premium Economy Seating on Singapore Airlines ...,0


In [6]:
tripadvisor_scraped_df.head()

Unnamed: 0,Year,Month,Title,Review Text,Rating
0,2024,May,I would give them zero stars!,I’d give zero stars if I could. I always see o...,1.0
1,2024,September,Very Poor service/responce,Zero stars. Very disappointed with Singapore a...,1.0
2,2024,October,Business Class of Singpore Airlines 2024,"Amazing, best service ever. Food is amazing, e...",5.0
3,2024,October,10/10 for how they handled our delayed flight,We had a 4.5 hour delay on our flight from Lon...,5.0
4,2024,October,Premium Economy,The leg room was great but the food was terrib...,3.0


In [7]:
skytrax_scraped_df.head()

Unnamed: 0,Year,Month,Title,Review Text,Rating
0,2024,10,"""one of our most enjoyable flights""",✅Trip Verified| We flew Singapore Air;ine (SIA...,10
1,2024,10,"""Excellent for economy""",✅Trip Verified| Excellent for economy. Five ...,10
2,2024,10,"""dismissive and unapologetic tone""",✅Trip Verified| My sister made an error in t...,1
3,2024,9,“Moon Festival Treats were a nice touch”,✅Trip Verified| SQ22 Solo Seat. Boards on Chan...,10
4,2024,9,“Little touches are always nice”,✅Trip Verified| SQ191 HAN-SIN in Economy. Staf...,10


# Data Cleaning

### Clean original tripadvisor dataset

In [8]:
## Drop rows with missing values
tripadvisor_df.dropna(inplace=True)

## Convert published_date column to datetime type
tripadvisor_df['date'] = pd.to_datetime(tripadvisor_df['published_date'],utc=True)

## Extract year and month
tripadvisor_df['year'] = tripadvisor_df['date'].dt.year
tripadvisor_df['month'] = tripadvisor_df['date'].dt.month

## Drop unnecessary columns
tripadvisor_df.drop(columns=['published_date','date','published_platform','type','helpful_votes'],inplace=True)

## Drop duplicates
tripadvisor_df.drop_duplicates(inplace=True)
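
One detail worth noting in the cell above: because `utc=True` converts every timestamp to UTC before the year and month are extracted, a review posted late in the evening on the last day of a month (in its local offset) is attributed to the following month. The sketch below reproduces the same conversion with the standard library on a made-up timestamp; it is an illustration, not part of the original pipeline.

```python
from datetime import datetime, timezone

# Hypothetical timestamp in the same ISO 8601 format as `published_date`
local = datetime.fromisoformat("2024-03-31T22:00:00-04:00")

# Convert to UTC, mirroring what pd.to_datetime(..., utc=True) does
as_utc = local.astimezone(timezone.utc)

# The month extracted after conversion differs from the local month
print(local.month, as_utc.month)
```

For monthly trend analysis this shift only affects reviews posted within a few hours of a month boundary, so its impact is likely small.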

In [9]:
tripadvisor_df.head()

Unnamed: 0,rating,text,title,year,month
0,3,We used this airline to go from Singapore to L...,Ok,2024,3
1,5,The service on Singapore Airlines Suites Class...,The service in Suites Class makes one feel lik...,2024,3
2,1,"Booked, paid and received email confirmation f...",Don’t give them your money,2024,3
3,5,"Best airline in the world, seats, food, servic...",Best Airline in the World,2024,3
4,2,Premium Economy Seating on Singapore Airlines ...,Premium Economy Seating on Singapore Airlines ...,2024,3


### Sentiment mapping

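We mapped the rating of each review to a sentiment label (`Negative`, `Neutral`, or `Positive`). Because TripAdvisor rates on a 1 to 5 scale and Skytrax on a 1 to 10 scale, each platform has its own mapping function.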

In [10]:
# Function to map ratings to sentiment
def tripadvisor_rating_to_sentiment(rating):
    if rating <= 2:
        return 'Negative'
    elif rating == 3:
        return 'Neutral'
    else:
        return 'Positive'
    

def skytrax_rating_to_sentiment(rating):
    if rating <= 4:
        return 'Negative'
    elif rating == 5:
        return 'Neutral'
    else:
        return 'Positive'

### Cleaning Scraped TripAdvisor Data

In [11]:
# 1. Clean TripAdvisor scraped dataframe
def clean_tripadvisor_df(df):
    cleaned_df = df.copy()
    
    # Convert rating to integer
    cleaned_df['Rating'] = cleaned_df['Rating'].astype(int)  
      
    # Dictionary to map month names to numbers
    month_map = {
        'January': 1, 'February': 2, 'March': 3, 'April': 4,
        'May': 5, 'June': 6, 'July': 7, 'August': 8,
        'September': 9, 'October': 10, 'November': 11, 'December': 12}

    # Applying the mapping to the 'Month' column
    cleaned_df['Month'] = cleaned_df['Month'].map(month_map)
    
    # Rename columns to the lowercase names used by the original dataset
    cleaned_df = cleaned_df.rename(columns={'Year': 'year', 'Month': 'month', 'Review Text': 'text','Title': 'title', 'Rating': 'rating'})

    # Clean text fields
    cleaned_df['text'] = cleaned_df['text'].apply(lambda x: str(x).strip())
    cleaned_df['title'] = cleaned_df['title'].apply(lambda x: str(x).strip())
        
    # Remove any empty reviews
    cleaned_df = cleaned_df.dropna(subset=['text', 'rating'])
    
    # Remove any duplicate reviews (drop_duplicates returns a new DataFrame, so assign it)
    cleaned_df = cleaned_df.drop_duplicates(subset=['text']).reset_index(drop=True)
        
    return cleaned_df

# Apply cleaning function
clean_scraped_tripadvisor = clean_tripadvisor_df(tripadvisor_scraped_df)

In [12]:
# Apply sentiment mapping to both tripadvisor dataframes
clean_scraped_tripadvisor['sentiment'] = clean_scraped_tripadvisor['rating'].apply(tripadvisor_rating_to_sentiment)
tripadvisor_df['sentiment'] = tripadvisor_df['rating'].apply(tripadvisor_rating_to_sentiment)

In [13]:
tripadvisor_df.head()

Unnamed: 0,rating,text,title,year,month,sentiment
0,3,We used this airline to go from Singapore to L...,Ok,2024,3,Neutral
1,5,The service on Singapore Airlines Suites Class...,The service in Suites Class makes one feel lik...,2024,3,Positive
2,1,"Booked, paid and received email confirmation f...",Don’t give them your money,2024,3,Negative
3,5,"Best airline in the world, seats, food, servic...",Best Airline in the World,2024,3,Positive
4,2,Premium Economy Seating on Singapore Airlines ...,Premium Economy Seating on Singapore Airlines ...,2024,3,Negative


In [14]:
clean_scraped_tripadvisor.head()

Unnamed: 0,year,month,title,text,rating,sentiment
0,2024,5,I would give them zero stars!,I’d give zero stars if I could. I always see o...,1,Negative
1,2024,9,Very Poor service/responce,Zero stars. Very disappointed with Singapore a...,1,Negative
2,2024,10,Business Class of Singpore Airlines 2024,"Amazing, best service ever. Food is amazing, e...",5,Positive
3,2024,10,10/10 for how they handled our delayed flight,We had a 4.5 hour delay on our flight from Lon...,5,Positive
4,2024,10,Premium Economy,The leg room was great but the food was terrib...,3,Neutral


### Cleaning scraped Skytrax data

In [15]:
# 2. Clean Skytrax scraped dataframe
def clean_skytrax_df(df):
    cleaned_df = df.copy()
    
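    # Coerce rating to numeric and drop rows where it is missing or invalid
    # (ratings stay on Skytrax's 1 to 10 scale; no rescaling is applied)
    cleaned_df['Rating'] = pd.to_numeric(cleaned_df['Rating'], errors='coerce')
    cleaned_df = cleaned_df.dropna(subset=['Rating'])
    cleaned_df['Rating'] = cleaned_df['Rating'].astype(int)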
        
    # Clean text fields and remove "Trip Verified" prefix
    cleaned_df['Review Text'] = cleaned_df['Review Text'].apply(lambda x: str(x).replace('✅Trip Verified| ', '').strip())
    cleaned_df['Title'] = cleaned_df['Title'].apply(lambda x: str(x).strip().strip('"'))
    
    # Rename columns to the lowercase names used by the original dataset
    cleaned_df = cleaned_df.rename(columns={'Year': 'year', 'Month': 'month', 'Review Text': 'text','Title': 'title', 'Rating': 'rating'})

    # Remove any empty reviews
    cleaned_df = cleaned_df.dropna(subset=['text', 'rating'])
    
    # Remove any duplicate reviews (drop_duplicates returns a new DataFrame, so assign it)
    cleaned_df = cleaned_df.drop_duplicates(subset=['text']).reset_index(drop=True)
    
    ## Keep reviews only from years 2021 to 2024
    cleaned_df = cleaned_df[(cleaned_df['year'] >= 2021) & (cleaned_df['year'] <= 2024)]
    
    
    return cleaned_df

# Apply cleaning functions
clean_skytrax = clean_skytrax_df(skytrax_scraped_df)

# Apply sentiment mapping
clean_skytrax['sentiment'] = clean_skytrax['rating'].apply(skytrax_rating_to_sentiment)

In [16]:
clean_skytrax.head()

Unnamed: 0,year,month,title,text,rating,sentiment
0,2024,10,one of our most enjoyable flights,We flew Singapore Air;ine (SIA) for the first ...,10,Positive
1,2024,10,Excellent for economy,Excellent for economy. Five hours into flight ...,10,Positive
2,2024,10,dismissive and unapologetic tone,"My sister made an error in the booking, and I ...",1,Negative
3,2024,9,“Moon Festival Treats were a nice touch”,SQ22 Solo Seat. Boards on Changi didn’t have t...,10,Positive
4,2024,9,“Little touches are always nice”,SQ191 HAN-SIN in Economy. Staff at SQ HAN coun...,10,Positive


In [17]:
tripadvisor_df.info()
clean_scraped_tripadvisor.info()
clean_skytrax.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9999 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   rating     9999 non-null   int64 
 1   text       9999 non-null   object
 2   title      9999 non-null   object
 3   year       9999 non-null   int32 
 4   month      9999 non-null   int32 
 5   sentiment  9999 non-null   object
dtypes: int32(2), int64(1), object(3)
memory usage: 468.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1446 entries, 0 to 1445
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   year       1446 non-null   int64 
 1   month      1446 non-null   int64 
 2   title      1446 non-null   object
 3   text       1446 non-null   object
 4   rating     1446 non-null   int32 
 5   sentiment  1446 non-null   object
dtypes: int32(1), int64(2), object(3)
memory usage: 62.3+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 3

In [18]:
# Drop rating column for all dataframes
tripadvisor_df = tripadvisor_df.drop(columns=['rating'])
clean_scraped_tripadvisor = clean_scraped_tripadvisor.drop(columns=['rating'])
clean_skytrax = clean_skytrax.drop(columns=['rating'])

In [19]:
## Combine all dataframes into one
data = pd.concat([tripadvisor_df,clean_scraped_tripadvisor, clean_skytrax],ignore_index=True)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11780 entries, 0 to 11779
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       11780 non-null  object
 1   title      11780 non-null  object
 2   year       11780 non-null  int64 
 3   month      11780 non-null  int64 
 4   sentiment  11780 non-null  object
dtypes: int64(2), object(3)
memory usage: 460.3+ KB


In [20]:
data.head()

Unnamed: 0,text,title,year,month,sentiment
0,We used this airline to go from Singapore to L...,Ok,2024,3,Neutral
1,The service on Singapore Airlines Suites Class...,The service in Suites Class makes one feel lik...,2024,3,Positive
2,"Booked, paid and received email confirmation f...",Don’t give them your money,2024,3,Negative
3,"Best airline in the world, seats, food, servic...",Best Airline in the World,2024,3,Positive
4,Premium Economy Seating on Singapore Airlines ...,Premium Economy Seating on Singapore Airlines ...,2024,3,Negative


## Feature Engineering

### Create new column `full_review`
Since there are some rows with empty `text` and `title`, we will concatenate both columns (`text` and `title`) to form a new column `full_review`.
1. Replace `NaN` values in `text` and `title` with an empty string.

2. Combine `text` and `title` into `full_review`.

3. Strip any leading/trailing whitespaces in `full_review`.

4. Drop `text` and `title` columns.

In [21]:
# 1) Replace NaN values in 'text' and 'title' with an empty string
data[['text', 'title']] = data[['text', 'title']].fillna('')

# 2) Combine 'text' and 'title' into 'full_review'
data['full_review'] = data['title'] + " " + data['text']

# 3) Strip any leading/trailing whitespace
data['full_review'] = data['full_review'].str.strip()

# 4) Drop `text` and `title` columns
data = data.drop(columns=['text', 'title'])

# Check that the 'full_review' column was added and that the 'text' and 'title' columns have been dropped
print(data.head())
print("\nThe old shape is:", data.shape)

   year  month sentiment                                        full_review
0  2024      3   Neutral  Ok We used this airline to go from Singapore t...
1  2024      3  Positive  The service in Suites Class makes one feel lik...
2  2024      3  Negative  Don’t give them your money Booked, paid and re...
3  2024      3  Positive  Best Airline in the World Best airline in the ...
4  2024      3  Negative  Premium Economy Seating on Singapore Airlines ...

The old shape is: (11780, 4)


## Remove Outliers

### `full_review`

The `full_review` column of `data`, which is of string (`str`) type, may contain unusually long reviews that act as outliers. We identify them using the Z-score method.

1. Create a new column `text_length` in the DataFrame `data` by calculating the length of each review. (Set the value to 0 if the corresponding `full_review` value is NaN.)

2. Check the statistics of `text_length` using the `describe()` method.

3. Calculate the Z-score of each `text_length` value, which is based on the mean and standard deviation of the column.

4. Set the Z-score threshold for identifying outliers to 3.

5. Identify outliers in the `text_length` column and set the corresponding `full_review` to `np.nan`.

6. Drop the `text_length` column from the DataFrame.
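
For reference, the Z-score of a length `x` is `(x - mean) / std`, and `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default. The sketch below works through the same calculation with the standard library on made-up review lengths; the numbers are illustrative assumptions, not values from the dataset.

```python
from statistics import fmean, pstdev

# Hypothetical review lengths (in characters); the last one is extreme
lengths = [250, 300, 280, 310, 270, 290, 260, 305, 295, 275, 285, 5000]

mean = fmean(lengths)
std = pstdev(lengths)   # population standard deviation, matching zscore's default ddof=0

threshold = 3
z_scores = [(x - mean) / std for x in lengths]

# Keep only the lengths whose absolute Z-score exceeds the threshold
outliers = [x for x, z in zip(lengths, z_scores) if abs(z) > threshold]

print(outliers)
```

Because the mean and standard deviation are themselves pulled upwards by extreme values, the Z-score method tends to flag only the most extreme reviews, which suits the goal of trimming a long right tail.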

In [23]:
def remove_outliers(df, text_column='full_review', threshold=3):
    """
    Flags outliers based on the length of the text in the given column. 
    Outliers are defined as texts whose lengths are more than the specified 
    number of standard deviations away from the mean.

    Parameters:
    - df: DataFrame containing the text data.
    - text_column: Name of the column containing text reviews (default: 'full_review').
    - threshold: Number of standard deviations to define an outlier (default: 3).

    Returns:
    - df: DataFrame in which outlier texts (extreme lengths) are set to NaN,
      so that they can be dropped afterwards with dropna().
    """
    
    # Create a new column for text length
    df['text_length'] = df[text_column].apply(lambda x: len(x) if pd.notna(x) else 0)
    
    # Calculate the statistics for text length
    TL = df["text_length"]
    stats_TL = TL.describe()
    print(stats_TL)

    # Calculate z-scores for text lengths
    z_score = zscore(TL)
    
    # Remove rows where the z-score exceeds the threshold
    df.loc[abs(z_score) > threshold, text_column] = np.nan
    
    # Drop the temporary 'text_length' column after cleaning
    df = df.drop("text_length", axis=1)
    
    return df

data = remove_outliers(data, text_column='full_review', threshold=3)


count    11780.000000
mean       560.107301
std        614.266970
min        103.000000
25%        265.000000
50%        365.000000
75%        655.000000
max      18793.000000
Name: text_length, dtype: float64


In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11780 entries, 0 to 11779
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   year         11780 non-null  int64 
 1   month        11780 non-null  int64 
 2   sentiment    11780 non-null  object
 3   full_review  11593 non-null  object
dtypes: int64(2), object(2)
memory usage: 368.2+ KB


In [25]:
## Drop rows where full_review is NaN
data = data.dropna(subset=['full_review'])

In [26]:
#check data types of each column, make sure they are correct
print(data.dtypes)

# Make sure no more duplicates are present
print("Remaining duplicate rows:", data.duplicated().sum())
data.drop_duplicates(inplace=True)

# Check that only the expected sentiment labels are present
print("Unique sentiments:", data['sentiment'].unique())

year            int64
month           int64
sentiment      object
full_review    object
dtype: object
Remaining duplicate rows: 68
Unique sentiments: ['Neutral' 'Negative' 'Positive']


In [27]:
data['year'].value_counts()

year
2019    5133
2018    2596
2022    1186
2023    1111
2020     889
2024     514
2021      96
Name: count, dtype: int64

# Feature Engineering

### Remove empty strings
1. Drop rows where `full_review` is an empty string and reset the index.

2. Check that there are no more null values in `data`.

In [28]:
# 1) Drop rows where `full_review` are empty strings and reset the index
data = data[data['full_review'] != ""].reset_index(drop=True)
print("The new shape is:",data.shape)

# 2) Check if there are no more null values in `data`
data.isnull().sum()

The new shape is: (11525, 4)


year           0
month          0
sentiment      0
full_review    0
dtype: int64

### Create new column `language`
Some rows in `full_review` may be written in languages other than English (e.g., French, Russian). To identify them, we run two language detection libraries (`langdetect` and `langid`) on the `full_review` column, combine their predictions, and select the most frequent predicted language. With only two detectors, a language is accepted only when both agree; otherwise the row is labelled `unknown`.

**Reason**: `langdetect` might perform well on longer texts while `langid` is more reliable on short texts. Using multiple detectors reduces the likelihood of misclassification and mitigates individual detector errors, leading to more accurate overall predictions. Also, if one detector fails or throws an error, the failure is caught and reported as `unknown` rather than interrupting the run, which improves the robustness of the language detection.

1. Set a seed for `langdetect` to ensure reproducibility.

2. Preprocess the text in `full_review`:
    - a\) Function to remove non-alphabetic characters and normalise whitespaces in  `full_review`.
    - b\) Function to determine if the text is non-language (e.g., numbers, symbols only).

3. Two functions for language detection:
    - a\) Using `langdetect`.
    - b\) Using `langid`.

4. Function for calculating majority vote for each language.

5. Function for parallel processing for efficiency.

6. Caching function for repeated inputs

7. Function for choosing language based on combined majority voting.

8. Applying the combined function on `full_review` column.

9. Display the resulting `data` DataFrame.
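
The voting and caching steps can be sketched in isolation. In the example below, `detector_a` and `detector_b` are hypothetical stubs standing in for the real `langdetect` and `langid` calls, and `detect_language` is an illustrative name; the point is only to show that, with two voters, a single vote means disagreement, and that `lru_cache` stores results for repeated inputs.

```python
from collections import Counter
from functools import lru_cache

# Hypothetical stand-ins for the real detectors, for illustration only
def detector_a(text: str) -> str:
    return 'fr' if 'bonjour' in text.lower() else 'en'

def detector_b(text: str) -> str:
    return 'fr' if 'merci' in text.lower() else 'en'

@lru_cache(maxsize=500)
def detect_language(text: str) -> str:
    """Return the language both detectors agree on, else 'unknown'."""
    votes = Counter(detector(text) for detector in (detector_a, detector_b))
    language, count = votes.most_common(1)[0]
    # With two voters, a single vote means the detectors disagreed
    return language if count > 1 else 'unknown'

print(detect_language("Great flight and friendly crew"))   # both stubs vote 'en'
print(detect_language("Bonjour, merci pour le vol"))       # both stubs vote 'fr'
print(detect_language("Bonjour and thank you"))            # the stubs disagree
```

Adding a third detector would turn this into a true majority vote, at the cost of extra run time per review.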

### <span style="color:red">The code below will take approximately 1 minute to run!</span>

In [29]:
# 1) Set a seed for langdetect to ensure reproducibility
DetectorFactory.seed = 0

# 2a) Simplified preprocessing: only remove non-alphabetic characters
def preprocess_text_simple(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
    return text.strip()

# 2b) Check if the text is non-language (e.g., numbers, symbols only)
def is_non_language_text(text):
    if re.match(r'^[^a-zA-Z]*$', text):  # Check if text has no alphabetic characters
        return True
    return False

# 3a) Function to get langdetect prediction
def get_langdetect_prediction(text):
    try:
        # Directly use text without preprocessing for efficiency
        if len(text) < 10 or is_non_language_text(text):
            return "unknown"
        lang = langdetect_detect(text)
        return lang
    except LangDetectException:
        return "unknown"

# 3b) Function to get langid prediction
def get_langid_prediction(text):
    try:
        # Skip very short or non-language text before calling the classifier
        if len(text) < 10 or is_non_language_text(text):
            return "unknown"
        lang, _ = langid_classify(text)
        return lang
    except Exception:
        return "unknown"

# 4) Function to calculate majority vote for each language
def calculate_majority_vote(predictions):
    vote_counts = {}
    for lang in predictions:
        if lang in vote_counts:
            vote_counts[lang] += 1
        else:
            vote_counts[lang] = 1
    return vote_counts

# 5) Parallel processing for efficiency with limited workers
def parallel_detection(text):
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(lambda func: func(text), 
                                    [get_langdetect_prediction, get_langid_prediction]))
    return results

# 6) Caching function for repeated inputs
@lru_cache(maxsize=500)
def get_cached_language(text):
    return combined_language_detection(text)

# 7) Combined majority voting language detection function
def combined_language_detection(text):
    # Check if the text is non-language (e.g., numbers, symbols only)
    if is_non_language_text(text):
        return "unknown"
    
    # Run the detectors in parallel for efficiency
    predictions = parallel_detection(text)
    
    # Calculate majority vote for each language based on predictions
    vote_counts = calculate_majority_vote(predictions)
    
    # Determine the language with the highest majority vote
    final_language = max(vote_counts, key=vote_counts.get)
    
    # If "unknown" is the most common or if all detectors fail, return "unknown"
    if final_language == "unknown" or vote_counts[final_language] <= 1:
        return "unknown"
    
    return final_language

# 8) Apply the cached function to each text in the DataFrame with a progress bar
data['language'] = [get_cached_language(text) for text in tqdm(data['full_review'], desc="Language Detection")]

# 9) Display the DataFrame with detected languages
data

Language Detection: 100%|██████████| 11525/11525 [01:21<00:00, 141.12it/s]


Unnamed: 0,year,month,sentiment,full_review,language
0,2024,3,Neutral,Ok We used this airline to go from Singapore t...,en
1,2024,3,Negative,"Don’t give them your money Booked, paid and re...",en
2,2024,3,Positive,Best Airline in the World Best airline in the ...,en
3,2024,3,Negative,Premium Economy Seating on Singapore Airlines ...,en
4,2024,3,Negative,Impossible to get a promised refund We booked ...,en
...,...,...,...,...,...
11520,2021,11,Negative,their website is too buggy I paid $7200 for a ...,en
11521,2021,10,Negative,reduce both the level and quality of service I...,en
11522,2021,10,Negative,the change would cost 490 USD I booked a round...,en
11523,2021,8,Negative,"I was disappointed with my flights Check in, s...",en


In [30]:
# See distribution of languages
data["language"].value_counts()

language
en         11517
unknown        5
th             1
nl             1
fr             1
Name: count, dtype: int64

In [31]:
# Drop rows where language is NOT in english and reset the index
data = data[data['language'] == 'en'].reset_index(drop=True)
print(data.shape)

(11517, 5)


Since every remaining row is now labelled `en`, the `language` column carries no further information, so we drop it.

In [32]:
data.info()
data.drop(columns=["language"], inplace=True)
print("The new shape is:", data.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11517 entries, 0 to 11516
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   year         11517 non-null  int64 
 1   month        11517 non-null  int64 
 2   sentiment    11517 non-null  object
 3   full_review  11517 non-null  object
 4   language     11517 non-null  object
dtypes: int64(2), object(3)
memory usage: 450.0+ KB
The new shape is: (11517, 4)


In [33]:
data.head()

Unnamed: 0,year,month,sentiment,full_review
0,2024,3,Neutral,Ok We used this airline to go from Singapore t...
1,2024,3,Negative,"Don’t give them your money Booked, paid and re..."
2,2024,3,Positive,Best Airline in the World Best airline in the ...
3,2024,3,Negative,Premium Economy Seating on Singapore Airlines ...
4,2024,3,Negative,Impossible to get a promised refund We booked ...


# Text Preprocessing for NLP

Here we define a function `process_full_review` that takes a review string as input and applies the following processing steps in sequence:

1. Convert the input text to lowercase using the `lower()` method.

2. Tokenize the lowercase text using the `word_tokenize` function from the NLTK library.

3. Remove punctuation tokens by checking each token against `string.punctuation`.

4. Tag each remaining token with its part of speech using `pos_tag`.

5. Remove stopwords
-   Obtain the English stopwords using the `stopwords.words('english')` method.
-   Define a list of `allowed_words` (negations and contrast words such as "not" and "but") that should not be removed.
-   Remove the stopwords (excluding those that should not be removed).

6. Apply lemmatization to each remaining token using the `lemmatize` method, passing the part of speech (verb, noun, or adverb) where the tag identifies one.

7. Join the lemmatized tokens into a single processed text using the `join` method and return the processed text.

Create a new column `processed_full_review` in `data` by applying the `process_full_review` function to the `full_review` column.
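
The reason for keeping a list of allowed words is that negations carry sentiment: removing "not" turns "not good" into "good". The sketch below illustrates only this filtering step. As assumptions for the sake of a self-contained example, it uses a tiny hand-written stop-word set and a simple regular-expression tokenizer in place of NLTK's stop-word list and `word_tokenize`, and the name `filter_tokens` is hypothetical.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "was", "is", "a", "and", "not", "no", "but", "it"}
# Stop words that are kept because they change the sentiment of a phrase
ALLOWED_WORDS = {"not", "no", "but"}

def filter_tokens(text: str) -> list[str]:
    """Lowercase, split into alphabetic tokens, and drop stop words except allowed ones."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS or t in ALLOWED_WORDS]

print(filter_tokens("The food was not good, but the crew was friendly."))
```

Without the allow-list, the same sentence would reduce to "food good crew friendly", which reads as entirely positive.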

In [None]:
# Ensure required NLTK data is downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\johnt\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\johnt\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\johnt\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\johnt\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [34]:
# Define function to process text
from nltk import pos_tag
import string
def process_full_review(text):
    # Convert to lowercase and tokenize
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in string.punctuation]
    pos_categories = {
    'pronouns': ['PRP', 'PRP$', 'WP', 'WP$'],
    'modal_verbs': ['MD'],
    'negations': ['RB', 'DT'],  # Words like 'not', 'no', 'never'
    'verbs': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],  # Action words
    'nouns': ['NN', 'NNS', 'NNP', 'NNPS'],  # Subjects, objects
    'adverbs': ['RB', 'RBR', 'RBS'],  # Words that modify verbs/adjectives
    }
    # Set of stopwords (set membership checks are faster than list lookups)
    stop_words = set(stopwords.words('english'))
    allowed_words = ["no", "not", "don't", "dont", "don", "but", 
                     "however", "never", "wasn't", "wasnt", "shouldn't",
                     "shouldnt", "mustn't", "mustnt"]

    # POS tagging
    pos_tags = pos_tag(tokens)
    
    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()

    # Lemmatize tokens based on POS tags in your dictionary
    lemmatized_tokens = []
    for word, tag in pos_tags:
        if word not in stop_words or word in allowed_words:
            if tag in pos_categories['verbs']:
                lemmatized_tokens.append(lemmatizer.lemmatize(word, 'v'))
            elif tag in pos_categories['nouns']:
                lemmatized_tokens.append(lemmatizer.lemmatize(word, 'n'))
            elif tag in pos_categories['adverbs']:
                lemmatized_tokens.append(lemmatizer.lemmatize(word, 'r'))
            else:
                lemmatized_tokens.append(lemmatizer.lemmatize(word))

    # Join and return processed text
    return ' '.join(lemmatized_tokens)
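
The effect of `allowed_words` in the function above can be seen in isolation with a toy example. This is only a minimal sketch: `toy_stop_words`, `negation_whitelist`, and `filter_tokens` are hypothetical names, and the tiny stopword set stands in for NLTK's full English list.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
toy_stop_words = {"the", "is", "not", "no", "but", "a", "was"}
# Hypothetical whitelist of words that should survive stopword removal.
negation_whitelist = {"no", "not", "but", "never", "however"}


def filter_tokens(tokens, stop_words, keep):
    """Drop stopwords unless they are explicitly whitelisted."""
    return [t for t in tokens if t not in stop_words or t in keep]


tokens = "the seat was not worth the money".split()
without_whitelist = filter_tokens(tokens, toy_stop_words, set())
with_whitelist = filter_tokens(tokens, toy_stop_words, negation_whitelist)

print(without_whitelist)  # ['seat', 'worth', 'money']
print(with_whitelist)     # ['seat', 'not', 'worth', 'money']
```

Without the whitelist, "not worth" collapses to "worth" and the review reads as positive, which is presumably why the negations are kept.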

In [35]:
# Enable tqdm for pandas (progress bar)
tqdm.pandas(desc="Processing Reviews")

# Apply process_full_review with a tqdm progress bar and store the result in a new column
data['processed_full_review'] = data['full_review'].progress_apply(process_full_review)

data

Processing Reviews: 100%|██████████| 11517/11517 [00:44<00:00, 260.43it/s]


| | year | month | sentiment | full_review | processed_full_review |
|---|---|---|---|---|---|
| 0 | 2024 | 3 | Neutral | Ok We used this airline to go from Singapore t... | ok use airline go singapore london heathrow is... |
| 1 | 2024 | 3 | Negative | Don’t give them your money Booked, paid and re... | don ’ give money book pay receive email confir... |
| 2 | 2024 | 3 | Positive | Best Airline in the World Best airline in the ... | best airline world best airline world seat foo... |
| 3 | 2024 | 3 | Negative | Premium Economy Seating on Singapore Airlines ... | premium economy seat singapore airline not wor... |
| 4 | 2024 | 3 | Negative | Impossible to get a promised refund We booked ... | impossible get promised refund book flight ful... |
| ... | ... | ... | ... | ... | ... |
| 11512 | 2021 | 11 | Negative | their website is too buggy I paid $7200 for a ... | website buggy pay 7200 first business class ti... |
| 11513 | 2021 | 10 | Negative | reduce both the level and quality of service I... | reduce level quality service fear future airli... |
| 11514 | 2021 | 10 | Negative | the change would cost 490 USD I booked a round... | change would cost 490 usd book round-trip tick... |
| 11515 | 2021 | 8 | Negative | I was disappointed with my flights Check in, s... | disappoint flight check security check frankfu... |


In [36]:
# Drop the raw full_review column now that the processed text is available
data.drop(columns=['full_review'], inplace=True)

In [37]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11517 entries, 0 to 11516
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   year                   11517 non-null  int64 
 1   month                  11517 non-null  int64 
 2   sentiment              11517 non-null  object
 3   processed_full_review  11517 non-null  object
dtypes: int64(2), object(2)
memory usage: 360.0+ KB


In [38]:
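# In a notebook cell only data.info() (which prints) and the final expression
# produce output; wrap the other calls in print() or display() to see them too.
data.head()
data.info()
data['year'].value_counts()
data['sentiment'].value_counts()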

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11517 entries, 0 to 11516
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   year                   11517 non-null  int64 
 1   month                  11517 non-null  int64 
 2   sentiment              11517 non-null  object
 3   processed_full_review  11517 non-null  object
dtypes: int64(2), object(2)
memory usage: 360.0+ KB


sentiment
Positive    7913
Negative    2441
Neutral     1163
Name: count, dtype: int64
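
The counts above show a strong skew toward positive reviews, which matters when reading accuracy figures later. As a rough sketch using only the numbers printed above, the class shares and the accuracy of a trivial "always predict the majority class" baseline can be worked out as follows (the variable names are illustrative, and the shares in any particular train/test split may differ slightly):

```python
# Sentiment counts copied from the value_counts() output above.
counts = {"Positive": 7913, "Negative": 2441, "Neutral": 1163}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label:<8} {share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model evaluated on this data should be judged against that baseline of roughly 0.69 rather than against 0.5 or 0.33.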

In [40]:
data_final = data[['processed_full_review','sentiment']]
data_final.head()

| | processed_full_review | sentiment |
|---|---|---|
| 0 | ok use airline go singapore london heathrow is... | Neutral |
| 1 | don ’ give money book pay receive email confir... | Negative |
| 2 | best airline world best airline world seat foo... | Positive |
| 3 | premium economy seat singapore airline not wor... | Negative |
| 4 | impossible get promised refund book flight ful... | Negative |


# Complement Naive Bayes with CountVectorizer

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import accuracy_score, classification_report

# Initialize CountVectorizer instead of TfidfVectorizer
count_vectorizer = CountVectorizer(max_features=1000)
count_matrix = count_vectorizer.fit_transform(data['processed_full_review'])

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(count_matrix, data['sentiment'], test_size=0.2, random_state=42)

# Train the model
nb_model = ComplementNB(alpha=5.0)
nb_model.fit(X_train, y_train)

# Make predictions
nb_predictions = nb_model.predict(X_test)

# Evaluate the model
print("Complement NB Accuracy:", accuracy_score(y_test, nb_predictions))
print("Complement NB Classification Report:\n", classification_report(y_test, nb_predictions, digits=4))

Complement NB Accuracy: 0.8333333333333334
Complement NB Classification Report:
               precision    recall  f1-score   support

    Negative     0.6856    0.8278    0.7500       482
     Neutral     0.4336    0.2130    0.2857       230
    Positive     0.9149    0.9246    0.9197      1592

    accuracy                         0.8333      2304
   macro avg     0.6780    0.6552    0.6518      2304
weighted avg     0.8188    0.8333    0.8209      2304
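
The gap between the macro and weighted averages in the report above comes from how each one treats the small Neutral class. As a worked check using the rounded F1 scores and supports printed in the report (so the last digit could differ from scikit-learn's unrounded computation), the two averages can be reproduced like this:

```python
# Per-class F1 and support copied from the Complement NB report above.
report = {
    "Negative": {"f1": 0.7500, "support": 482},
    "Neutral": {"f1": 0.2857, "support": 230},
    "Positive": {"f1": 0.9197, "support": 1592},
}

total_support = sum(row["support"] for row in report.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = (
    sum(row["f1"] * row["support"] for row in report.values()) / total_support
)

print(f"macro F1:    {macro_f1:.4f}")     # 0.6518
print(f"weighted F1: {weighted_f1:.4f}")  # 0.8209
```

Because Positive reviews make up most of the test set, the weighted average stays high even though the Neutral class is poorly predicted; the macro average exposes that weakness.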



# Random Forest with CountVectorizer

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize CountVectorizer
count_vectorizer = CountVectorizer(max_features=1000)
count_matrix = count_vectorizer.fit_transform(data['processed_full_review'])

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(count_matrix, data['sentiment'], test_size=0.2, random_state=42)

# Initialize and train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
rf_predictions = rf_model.predict(X_test)

# Evaluate the model
print("Random Forest Accuracy:", accuracy_score(y_test, rf_predictions))
print("Random Forest Classification Report:\n", classification_report(y_test, rf_predictions, digits=4))

Random Forest Accuracy: 0.8441840277777778
Random Forest Classification Report:
               precision    recall  f1-score   support

    Negative     0.8205    0.7490    0.7831       482
     Neutral     0.8571    0.0783    0.1434       230
    Positive     0.8497    0.9837    0.9118      1592

    accuracy                         0.8442      2304
   macro avg     0.8424    0.6036    0.6128      2304
weighted avg     0.8443    0.8442    0.8082      2304
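
The Neutral row above pairs a high precision (0.8571) with a very low recall (0.0783). As a sketch of what that means in terms of counts, the numbers below are inferred from the rounded report values, so they are an assumption rather than output taken from the notebook:

```python
# Neutral-class figures copied from the Random Forest report above.
support = 230       # true Neutral reviews in the test set
recall = 0.0783
precision = 0.8571

# Inferred counts (assumption: rounding the report values back to integers).
true_positives = round(recall * support)               # Neutral reviews found
predicted_neutral = round(true_positives / precision)  # times Neutral was predicted

# F1 expressed with counts: 2*TP / (predicted + actual).
f1 = 2 * true_positives / (predicted_neutral + support)

print(true_positives, predicted_neutral)  # 18 21
print(f"Neutral F1: {f1:.4f}")            # 0.1434
```

Under these inferred counts the model predicts Neutral only about 21 times in 2304 test reviews, so its high precision on that class says little; almost all Neutral reviews are assigned to another class.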



# Logistic Regression with CountVectorizer

In [45]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize CountVectorizer
count_vectorizer = CountVectorizer(max_features=1000)
count_matrix = count_vectorizer.fit_transform(data['processed_full_review'])

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(count_matrix, data['sentiment'], test_size=0.2, random_state=42)

# Initialize and train the Logistic Regression model
clf = LogisticRegression(random_state=42, multi_class='multinomial', solver='lbfgs', max_iter=200)
clf.fit(X_train, y_train)

# Make predictions
clf_predictions = clf.predict(X_test)

# Evaluate the model
print("Logistic Regression Accuracy:", accuracy_score(y_test, clf_predictions))
print("Logistic Regression Classification Report:\n", classification_report(y_test, clf_predictions, digits=4))



Logistic Regression Accuracy: 0.8389756944444444
Logistic Regression Classification Report:
               precision    recall  f1-score   support

    Negative     0.7588    0.7635    0.7611       482
     Neutral     0.3842    0.3391    0.3603       230
    Positive     0.9202    0.9340    0.9271      1592

    accuracy                         0.8390      2304
   macro avg     0.6877    0.6789    0.6828      2304
weighted avg     0.8329    0.8390    0.8358      2304
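
All three model cells above fit `CountVectorizer` on the full dataset before calling `train_test_split`, so the vocabulary is chosen with knowledge of the test reviews. One common way to avoid this is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` and fit it on the training texts only. The sketch below illustrates the idea on a few made-up reviews; the toy texts and labels are hypothetical, and scores obtained this way on the real data would not necessarily match the ones reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for data['processed_full_review'].
train_texts = [
    "great seat friendly crew",
    "delay rude staff",
    "ok flight nothing special",
    "best airline world",
    "lost baggage no refund",
    "average food ok seat",
]
train_labels = ["Positive", "Negative", "Neutral",
                "Positive", "Negative", "Neutral"]
test_texts = ["friendly crew turbulence"]

# The vectorizer is fitted inside the pipeline, on training texts only.
pipe = Pipeline([
    ("vec", CountVectorizer(max_features=1000)),
    ("nb", ComplementNB(alpha=5.0)),
])
pipe.fit(train_texts, train_labels)

vocab = pipe.named_steps["vec"].vocabulary_
predictions = pipe.predict(test_texts)

# Words seen only at test time never enter the vocabulary.
print("turbulence" in vocab)  # False
print(predictions)
```

With the real data, the split would be done on the raw text column first (for example `train_test_split(data['processed_full_review'], data['sentiment'], ...)`), and the same pipeline object could then be passed to cross-validation utilities without refitting the vocabulary on held-out reviews.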

