# Enhancing Singapore Airlines' Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**



## Singapore Airlines Customer Reviews Dataset Information

The [Singapore Airlines Customer Reviews Dataset](https://www.kaggle.com/datasets/kanchana1990/singapore-airlines-reviews) aggregates 10,000 anonymized customer reviews, providing a broad perspective on the passenger experience with Singapore Airlines. 

The dimensions are shown below:
- **`published_date`**: Date and time of review publication.
- **`published_platform`**: Platform where the review was posted.
- **`rating`**: Customer satisfaction rating, from 1 (lowest) to 5 (highest).
- **`type`**: Specifies the content as a review.
- **`text`**: Detailed customer feedback.
- **`title`**: Summary of the review.
- **`helpful_votes`**: Number of users finding the review helpful.

## Additional web scraping of online reviews

During our EDA, we noticed two main trends in the distribution of our dataset:
1. Fewer than 10% of the reviews were published between 2022 and 2024, making it hard to capture recent trends in sentiment.
2. Most of the reviews were highly positive. This may simply reflect that SIA receives mostly positive reviews, but we wanted more negative reviews to improve the robustness of our model.
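
The two checks above can be sketched roughly as follows. This is a dependency-free illustration, not the EDA code itself: the sample rows, `recent_share`, and `rating_distribution` are hypothetical names and values, and only the timestamp format is taken from the `published_date` column.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows in the layout (published_date, rating).
sample_reviews = [
    ("2024-03-12T14:41:14-04:00", 3),
    ("2019-07-01T09:00:00-04:00", 5),
    ("2018-11-20T18:30:00-05:00", 5),
    ("2019-02-14T12:00:00-05:00", 4),
]

def recent_share(rows, start_year=2022, end_year=2024):
    """Fraction of reviews published between start_year and end_year (inclusive)."""
    years = [datetime.fromisoformat(date).year for date, _ in rows]
    recent = sum(start_year <= year <= end_year for year in years)
    return recent / len(years)

def rating_distribution(rows):
    """Fraction of reviews per rating value."""
    counts = Counter(rating for _, rating in rows)
    total = sum(counts.values())
    return {rating: counts[rating] / total for rating in sorted(counts)}

print(recent_share(sample_reviews))
print(rating_distribution(sample_reviews))
```

A low recent share together with a distribution concentrated on ratings 4 and 5 is the pattern that motivated the extra scraping.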

### TripAdvisor

We scraped additional airline reviews from [TripAdvisor](https://www.tripadvisor.com.sg/Airline_Review-d8729151-Reviews-Singapore-Airlines), specifically for the years 2022 to 2024.

The dimensions are shown below:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 5)


### Skytrax

We also scraped reviews from [Skytrax](https://www.airlinequality.com/airline-reviews/singapore-airlines/?sortby=post_date%3ADesc&pagesize=100), another source of online airline reviews.

The dimensions are shown below:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 10)

## Importing Libraries

Please uncomment the code box below to pip install relevant dependencies for this notebook.

In [13]:
# !pip3 install -r requirements.txt

In [14]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime 

# Statistical functions
from scipy.stats import zscore

# For concurrency (running functions in parallel)
from concurrent.futures import ThreadPoolExecutor

# For caching (to speed up repeated function calls)
from functools import lru_cache

# For progress tracking
from tqdm import tqdm

# Plotting and Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Language Detection packages
# `langdetect` for detecting language
from langdetect import detect as langdetect_detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException
# `langid` for an alternative language detection method
from langid import classify as langid_classify

# Text Preprocessing and NLP
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords

import nltk

# WordNet lexical database
from nltk.corpus import wordnet

# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer
# Regular expressions for text pattern matching
import re

# Word Cloud generation
from wordcloud import WordCloud

# For generating n-grams
from nltk.util import ngrams
from collections import Counter

In [15]:
# On macOS, these downloads may fail with an SSL certificate error; in that case the NLTK data has to be installed manually

# Ensure the required NLTK data is downloaded
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading punkt_tab: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading wordnet: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data]     failed: unable to get local issuer certificate
[nltk_data]   

False
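
The `CERTIFICATE_VERIFY_FAILED` messages and the final `False` indicate that the downloads did not succeed in this run, so the later NLTK calls presumably rely on data already present on the machine. One commonly cited workaround is sketched below, under clear assumptions: `allow_unverified_https` and `download_nltk_data` are hypothetical helper names, and the approach disables HTTPS certificate verification for the Python process, which should only be done on a trusted network. On python.org builds for macOS, running the bundled `Install Certificates.command` is usually the cleaner fix.

```python
import ssl

def allow_unverified_https():
    """Workaround (not a fix): make urllib-based downloads skip certificate checks."""
    # Relies on private attributes of the ssl module; treat this as a temporary measure.
    ssl._create_default_https_context = ssl._create_unverified_context

def download_nltk_data(resources=("punkt", "punkt_tab", "stopwords", "wordnet")):
    """Download the given NLTK resources and report success per resource."""
    import nltk  # imported here so the helper above works without NLTK installed
    # nltk.download returns True on success and False on failure
    return {name: nltk.download(name, quiet=True) for name in resources}

# Typical use in a notebook cell:
# allow_unverified_https()
# print(download_nltk_data())
```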

## Data Preparation (Loading CSV)

Load the three CSV files into separate pandas DataFrames: `data` (the Kaggle dataset), `tripadvisor_scraped_df`, and `skytrax_scraped_df`.

In [16]:
data = pd.read_csv("../singapore_airlines_reviews.csv")
tripadvisor_scraped_df = pd.read_csv('singapore_airlines_reviews_tripadvisor.csv')
skytrax_scraped_df = pd.read_csv('singapore_airlines_reviews_skytrax.csv')

In [17]:
data.head()

Unnamed: 0,published_date,published_platform,rating,type,text,title,helpful_votes
0,2024-03-12T14:41:14-04:00,Desktop,3,review,We used this airline to go from Singapore to L...,Ok,0
1,2024-03-11T19:39:13-04:00,Desktop,5,review,The service on Singapore Airlines Suites Class...,The service in Suites Class makes one feel lik...,0
2,2024-03-11T12:20:23-04:00,Desktop,1,review,"Booked, paid and received email confirmation f...",Don’t give them your money,0
3,2024-03-11T07:12:27-04:00,Desktop,5,review,"Best airline in the world, seats, food, servic...",Best Airline in the World,0
4,2024-03-10T05:34:18-04:00,Desktop,2,review,Premium Economy Seating on Singapore Airlines ...,Premium Economy Seating on Singapore Airlines ...,0


In [18]:
tripadvisor_scraped_df.head()

Unnamed: 0,Year,Month,Title,Review Text,Rating
0,2024,May,I would give them zero stars!,I’d give zero stars if I could. I always see o...,1.0
1,2024,September,Very Poor service/responce,Zero stars. Very disappointed with Singapore a...,1.0
2,2024,October,Business Class of Singpore Airlines 2024,"Amazing, best service ever. Food is amazing, e...",5.0
3,2024,October,10/10 for how they handled our delayed flight,We had a 4.5 hour delay on our flight from Lon...,5.0
4,2024,October,Premium Economy,The leg room was great but the food was terrib...,3.0


In [19]:
skytrax_scraped_df.head()

Unnamed: 0,Year,Month,Title,Review Text,Rating
0,2024,10,"""one of our most enjoyable flights""",✅Trip Verified| We flew Singapore Air;ine (SIA...,10
1,2024,10,"""Excellent for economy""",✅Trip Verified| Excellent for economy. Five ...,10
2,2024,10,"""dismissive and unapologetic tone""",✅Trip Verified| My sister made an error in t...,1
3,2024,9,“Moon Festival Treats were a nice touch”,✅Trip Verified| SQ22 Solo Seat. Boards on Chan...,10
4,2024,9,“Little touches are always nice”,✅Trip Verified| SQ191 HAN-SIN in Economy. Staf...,10


# Data Cleaning

### TripAdvisor

In [20]:
# 1. Clean TripAdvisor scraped dataframe
def clean_tripadvisor_df(df):
    cleaned_df = df.copy()

    # Remove rows with a missing review text or rating
    # (before str() conversion, which would turn NaN into the string 'nan')
    cleaned_df = cleaned_df.dropna(subset=['Review Text', 'Rating'])

    # Convert rating to integer
    cleaned_df['Rating'] = cleaned_df['Rating'].astype(int)

    # Dictionary to map month names to numbers
    month_map = {
        'January': 1, 'February': 2, 'March': 3, 'April': 4,
        'May': 5, 'June': 6, 'July': 7, 'August': 8,
        'September': 9, 'October': 10, 'November': 11, 'December': 12}

    # Apply the mapping to the 'Month' column
    cleaned_df['Month'] = cleaned_df['Month'].map(month_map)

    # Rename the Year and Month columns to lowercase
    cleaned_df = cleaned_df.rename(columns={'Year': 'year', 'Month': 'month'})

    # Clean text fields
    cleaned_df['Review Text'] = cleaned_df['Review Text'].apply(lambda x: str(x).strip())
    cleaned_df['Title'] = cleaned_df['Title'].apply(lambda x: str(x).strip())

    # Remove duplicate reviews (drop_duplicates returns a new DataFrame, so assign the result)
    cleaned_df = cleaned_df.drop_duplicates(subset=['Review Text']).reset_index(drop=True)

    return cleaned_df



# Apply cleaning functions
clean_tripadvisor = clean_tripadvisor_df(tripadvisor_scraped_df)


In [21]:
clean_tripadvisor.head()

Unnamed: 0,year,month,Title,Review Text,Rating
0,2024,5,I would give them zero stars!,I’d give zero stars if I could. I always see o...,1
1,2024,9,Very Poor service/responce,Zero stars. Very disappointed with Singapore a...,1
2,2024,10,Business Class of Singpore Airlines 2024,"Amazing, best service ever. Food is amazing, e...",5
3,2024,10,10/10 for how they handled our delayed flight,We had a 4.5 hour delay on our flight from Lon...,5
4,2024,10,Premium Economy,The leg room was great but the food was terrib...,3


### Skytrax

In [22]:
# 2. Clean Skytrax scraped dataframe
def clean_skytrax_df(df):
    cleaned_df = df.copy()

    # Parse ratings as numbers and drop rows with a missing review text or rating
    # (before str() conversion, which would turn NaN into the string 'nan')
    cleaned_df['Rating'] = pd.to_numeric(cleaned_df['Rating'], errors='coerce')
    cleaned_df = cleaned_df.dropna(subset=['Review Text', 'Rating'])

    # Halve the 1-to-10 rating to approximate the 1-to-5 scale.
    # Note: round() rounds halves to the nearest even integer, so a Skytrax rating of 1 becomes 0.
    cleaned_df['Rating'] = cleaned_df['Rating'].apply(lambda x: round(x / 2))
    cleaned_df['Rating'] = cleaned_df['Rating'].astype(int)

    # Clean text fields and remove the "Trip Verified" prefix
    cleaned_df['Review Text'] = cleaned_df['Review Text'].apply(lambda x: str(x).replace('✅Trip Verified| ', '').strip())
    # Strip whitespace and straight double quotes from titles
    cleaned_df['Title'] = cleaned_df['Title'].apply(lambda x: str(x).strip().strip('"'))

    # Rename the Year and Month columns to lowercase
    cleaned_df = cleaned_df.rename(columns={'Year': 'year', 'Month': 'month'})

    # Remove duplicate reviews (drop_duplicates returns a new DataFrame, so assign the result)
    cleaned_df = cleaned_df.drop_duplicates(subset=['Review Text']).reset_index(drop=True)

    # Keep only reviews from 2021 to 2024
    cleaned_df = cleaned_df[(cleaned_df['year'] >= 2021) & (cleaned_df['year'] <= 2024)]

    return cleaned_df

clean_skytrax = clean_skytrax_df(skytrax_scraped_df)
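
A note on the rescaling step: halving a 1-to-10 rating and passing it to Python's built-in `round()` does not land exactly on a 1-to-5 scale, because `round()` rounds halves to the nearest even integer. The sketch below compares that arithmetic with a ceiling-based alternative. `half_scale_round` and `half_scale_ceil` are hypothetical names, and the ceiling variant is only one possible choice, not the method used in this notebook.

```python
import math

def half_scale_round(rating_10):
    """Same arithmetic as the cleaning function: round(x / 2) with half-to-even rounding."""
    return round(rating_10 / 2)

def half_scale_ceil(rating_10):
    """Alternative (an assumption): map 1-2 -> 1, 3-4 -> 2, ..., 9-10 -> 5."""
    return math.ceil(rating_10 / 2)

for rating in range(1, 11):
    print(rating, half_scale_round(rating), half_scale_ceil(rating))
```

With `round()`, a Skytrax rating of 1 maps to 0 and ratings 3, 4 and 5 all map to 2, which is why a `0` can appear among the unique ratings after cleaning. Under the sentiment mapping used later, a 0 is still treated as negative.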

In [23]:
clean_tripadvisor.info()
clean_skytrax.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1446 entries, 0 to 1445
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   year         1446 non-null   int64 
 1   month        1446 non-null   int64 
 2   Title        1446 non-null   object
 3   Review Text  1446 non-null   object
 4   Rating       1446 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 56.6+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 335 entries, 0 to 334
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   year         335 non-null    int64 
 1   month        335 non-null    int64 
 2   Title        335 non-null    object
 3   Review Text  335 non-null    object
 4   Rating       335 non-null    int64 
dtypes: int64(3), object(2)
memory usage: 15.7+ KB


In [24]:
## Combine both dataframes into one
combined_df = pd.concat([clean_tripadvisor, clean_skytrax], ignore_index=True)

combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1781 entries, 0 to 1780
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   year         1781 non-null   int64 
 1   month        1781 non-null   int64 
 2   Title        1781 non-null   object
 3   Review Text  1781 non-null   object
 4   Rating       1781 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 69.7+ KB


### Create new column `full_review`
Since some rows may have an empty review text or title, we concatenate both columns (`Review Text` and `Title`) to form a new column `full_review`. Missing values need no separate handling here, because the cleaning functions above already converted both columns to strings.
1. Combine `Review Text` and `Title` into `full_review`.

2. Strip any leading/trailing whitespace in `full_review`.

3. Drop the `Review Text` and `Title` columns.
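
For data where missing values have not been converted to strings, note that concatenating pandas string columns with `+` yields `NaN` whenever either side is missing. A minimal, dependency-free sketch of a concatenation that tolerates missing parts is shown below; `build_full_review` is a hypothetical helper, not part of the notebook.

```python
def build_full_review(text, title):
    """Join review text and title, skipping parts that are missing or blank."""
    # Non-string values (None, float NaN) and whitespace-only strings are ignored.
    parts = [part.strip() for part in (text, title)
             if isinstance(part, str) and part.strip()]
    return " ".join(parts)

print(build_full_review("Great flight ", " Loved it"))
print(build_full_review(float("nan"), "Ok"))
```

Applied row by row, this keeps a review that has only a title or only a body instead of turning it into a missing value.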

In [25]:
# 1) Combine 'Review Text' and 'Title' into 'full_review'
combined_df['full_review'] = combined_df['Review Text'] + " " + combined_df['Title']

# 2) Strip any leading/trailing whitespace
combined_df['full_review'] = combined_df['full_review'].str.strip()

# 3) Drop the 'Review Text' and 'Title' columns
combined_df = combined_df.drop(columns = ['Review Text', 'Title'])

# Check that 'full_review' was added and that 'Review Text' and 'Title' have been dropped
print(combined_df.head())
print("\nThe old shape is:", combined_df.shape) 

   year  month  Rating                                        full_review
0  2024      5       1  I’d give zero stars if I could. I always see o...
1  2024      9       1  Zero stars. Very disappointed with Singapore a...
2  2024     10       5  Amazing, best service ever. Food is amazing, e...
3  2024     10       5  We had a 4.5 hour delay on our flight from Lon...
4  2024     10       3  The leg room was great but the food was terrib...

The old shape is: (1781, 4)


## Remove Outliers

### `full_review`

The `full_review` column of `combined_df`, which is of string (`str`) type, may contain unusually long values, which we treat as outliers. We identify them using the Z-score method.

1. Create a new column `text_length` in the DataFrame `combined_df` by calculating the length of each review. (Set the value to 0 if the corresponding `full_review` value is NaN.)

2. Check the statistics of `text_length` using the `describe()` method.

3. Calculate the Z-score of each `text_length` value from the column's mean and standard deviation.

4. Set the Z-score threshold for identifying outliers to 3.

5. Identify outliers in the `text_length` column and set the corresponding `full_review` to `np.nan`.

6. Drop the `text_length` column from the DataFrame.
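
As a small worked illustration of the Z-score rule, the sketch below computes $z = (x - \mu) / \sigma$ for each text length and flags values with $|z| > 3$. It uses only the standard library; `length_zscores` and `flag_length_outliers` are hypothetical names, and the population standard deviation is used on the assumption that it matches the default of `scipy.stats.zscore` (`ddof=0`).

```python
from statistics import mean, pstdev

def length_zscores(texts):
    """Z-score of each text's length, using the population standard deviation."""
    lengths = [len(text) for text in texts]
    mu = mean(lengths)
    sigma = pstdev(lengths)  # assumes the lengths are not all identical (sigma > 0)
    return [(length - mu) / sigma for length in lengths]

def flag_length_outliers(texts, threshold=3):
    """True for texts whose length is more than `threshold` standard deviations from the mean."""
    return [abs(z) > threshold for z in length_zscores(texts)]

# Twenty reviews of 100 characters and one of 5000 characters (synthetic data)
reviews = ["x" * 100] * 20 + ["x" * 5000]
flags = flag_length_outliers(reviews)
print(flags.count(True), flags[-1])
```

Because review lengths are right-skewed and cannot be negative, this rule in practice only removes very long reviews.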

In [26]:
def remove_outliers(df, text_column='full_review', threshold=3):
    """
    Flags outliers based on the length of the text in the given column.
    Outliers are texts whose lengths are more than the specified
    number of standard deviations away from the mean; their text is set
    to NaN so that the rows can be dropped afterwards.

    Parameters:
    - df: DataFrame containing the text data.
    - text_column: Name of the column containing text reviews (default: 'full_review').
    - threshold: Number of standard deviations to define an outlier (default: 3).

    Returns:
    - df: Copy of the DataFrame with outlier texts (extreme lengths) replaced by NaN.
    """

    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Create a new column for text length
    df['text_length'] = df[text_column].apply(lambda x: len(x) if pd.notna(x) else 0)

    # Calculate the statistics for text length
    TL = df["text_length"]
    stats_TL = TL.describe()
    print(stats_TL)

    # Calculate z-scores for text lengths
    z_score = zscore(TL)

    # Set the text to NaN where the absolute z-score exceeds the threshold
    df.loc[abs(z_score) > threshold, text_column] = np.nan

    # Drop the temporary 'text_length' column after cleaning
    df = df.drop("text_length", axis=1)

    return df

combined_df = remove_outliers(combined_df, text_column='full_review', threshold=3)


count    1781.000000
mean      405.410444
std       344.534000
min       114.000000
25%       295.000000
50%       310.000000
75%       337.000000
max      3285.000000
Name: text_length, dtype: float64


In [27]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1781 entries, 0 to 1780
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   year         1781 non-null   int64 
 1   month        1781 non-null   int64 
 2   Rating       1781 non-null   int64 
 3   full_review  1730 non-null   object
dtypes: int64(3), object(1)
memory usage: 55.8+ KB


In [28]:
## Drop rows where full_review is NaN
combined_df = combined_df.dropna(subset=['full_review'])

In [29]:
#check data types of each column, make sure they are correct
print(combined_df.dtypes)

# Make sure no more duplicates are present
print("Remaining duplicate rows:", combined_df.duplicated().sum())
combined_df.drop_duplicates(inplace=True)

# Check for outliers in ratings
print("Unique ratings:", combined_df['Rating'].unique())

year            int64
month           int64
Rating          int64
full_review    object
dtype: object
Remaining duplicate rows: 0
Unique ratings: [1 5 3 4 2 0]


In [30]:
combined_df['year'].value_counts()

year
2022    677
2023    628
2024    416
2021      9
Name: count, dtype: int64

# Feature Engineering

### Remove empty strings
1. Drop rows where `full_review` is an empty string and reset the index.

2. Check that there are no more null values in `combined_df`.

In [31]:
# 1) Drop rows where `full_review` are empty strings and reset the index
combined_df = combined_df[combined_df['full_review'] != ""].reset_index(drop=True)
print("The new shape is:",combined_df.shape)

# 2) Check that there are no more null values in `combined_df`
combined_df.isnull().sum()

The new shape is: (1730, 4)


year           0
month          0
Rating         0
full_review    0
dtype: int64

### Create new column `language`
Some reviews may be written in languages other than English (e.g., French or Russian). To detect them, we run two language detection libraries (`langdetect` and `langid`) on the `full_review` column and combine their predictions by majority vote. With only two detectors, a language is accepted only when both agree; otherwise the review is labelled `unknown`.

**Reason**: `langdetect` might perform well on longer texts while `langid` is more reliable on short texts. Using multiple detectors reduces the likelihood of misclassification and mitigates individual detector errors, leading to more accurate overall predictions. Also, if one detector fails or throws an error, its prediction is recorded as `unknown` instead of aborting the run, which improves the robustness of the language detection.

1. Set a seed for `langdetect` to ensure reproducibility.

2. Preprocess the text in `full_review`:
    - a\) Function to remove non-alphabetic characters and normalise whitespaces in  `full_review`.
    - b\) Function to determine if the text is non-language (e.g., numbers, symbols only).

3. Two functions for language detection:
    - a\) Using `langdetect`.
    - b\) Using `langid`.

4. Function for calculating majority vote for each language.

5. Function for parallel processing for efficiency.

6. Caching function for repeated inputs

7. Function for choosing language based on combined majority voting.

8. Applying the combined function on `full_review` column.

9. Display the resulting `combined_df` DataFrame.
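
The voting rule in steps 4 and 7 can be sketched in a few lines. This is an illustrative, dependency-free version: `vote` and its `min_votes` parameter are hypothetical, and the detector outputs are passed in as plain strings instead of being computed by `langdetect` and `langid`.

```python
from collections import Counter

def vote(predictions, min_votes=2):
    """Return the most frequent language, or "unknown" if it lacks enough votes."""
    counts = Counter(predictions)
    language, votes = counts.most_common(1)[0]
    if language == "unknown" or votes < min_votes:
        return "unknown"
    return language

print(vote(["en", "en"]))   # both detectors agree
print(vote(["en", "nl"]))   # detectors disagree
```

With two detectors and `min_votes=2`, this is effectively an agreement check: any disagreement, or a failure of either detector, yields `unknown`.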

### <span style="color:red">The code below will take approximately 1 minute to run!</span>

In [32]:
# 1) Set a seed for langdetect to ensure reproducibility
DetectorFactory.seed = 0

# 2a) Simplified preprocessing: only remove non-alphabetic characters
def preprocess_text_simple(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
    return text.strip()

# 2b) Check if the text is non-language (e.g., numbers, symbols only)
def is_non_language_text(text):
    if re.match(r'^[^a-zA-Z]*$', text):  # Check if text has no alphabetic characters
        return True
    return False

# 3a) Function to get langdetect prediction
def get_langdetect_prediction(text):
    try:
        # Directly use text without preprocessing for efficiency
        if len(text) < 10 or is_non_language_text(text):
            return "unknown"
        lang = langdetect_detect(text)
        return lang
    except LangDetectException:
        return "unknown"

# 3b) Function to get langid prediction
def get_langid_prediction(text):
    try:
        # Skip very short or non-language text before calling the classifier
        if len(text) < 10 or is_non_language_text(text):
            return "unknown"
        lang, _ = langid_classify(text)
        return lang
    except Exception:
        return "unknown"

# 4) Function to calculate majority vote for each language
def calculate_majority_vote(predictions):
    vote_counts = {}
    for lang in predictions:
        if lang in vote_counts:
            vote_counts[lang] += 1
        else:
            vote_counts[lang] = 1
    return vote_counts

# 5) Parallel processing for efficiency with limited workers
def parallel_detection(text):
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(lambda func: func(text), 
                                    [get_langdetect_prediction, get_langid_prediction]))
    return results

# 6) Caching function for repeated inputs
@lru_cache(maxsize=500)
def get_cached_language(text):
    return combined_language_detection(text)

# 7) Combined majority voting language detection function
def combined_language_detection(text):
    # Check if the text is non-language (e.g., numbers, symbols only)
    if is_non_language_text(text):
        return "unknown"
    
    # Run the detectors in parallel for efficiency
    predictions = parallel_detection(text)
    
    # Calculate majority vote for each language based on predictions
    vote_counts = calculate_majority_vote(predictions)
    
    # Determine the language with the highest majority vote
    final_language = max(vote_counts, key=vote_counts.get)
    
    # Return "unknown" if it is the top vote or if no language received more than one vote (the detectors disagree)
    if final_language == "unknown" or vote_counts[final_language] <= 1:
        return "unknown"
    
    return final_language

# 8) Apply the cached function to each text in the DataFrame with a progress bar
combined_df['language'] = [get_cached_language(text) for text in tqdm(combined_df['full_review'], desc="Language Detection")]

# 9) Display the DataFrame with detected languages
combined_df

Language Detection: 100%|██████████| 1730/1730 [00:10<00:00, 168.60it/s]


Unnamed: 0,year,month,Rating,full_review,language
0,2024,5,1,I’d give zero stars if I could. I always see o...,en
1,2024,9,1,Zero stars. Very disappointed with Singapore a...,en
2,2024,10,5,"Amazing, best service ever. Food is amazing, e...",en
3,2024,10,5,We had a 4.5 hour delay on our flight from Lon...,en
4,2024,10,3,The leg room was great but the food was terrib...,en
...,...,...,...,...,...
1725,2021,12,5,"Due to the pandemic, it has been two years sin...",en
1726,2021,12,5,"SQ25, JFK-FRA, in Business. Check in Counter, ...",en
1727,2021,11,4,"FRA to JFK, still a good and safe airline but ...",en
1728,2021,11,0,I paid $7200 for a first / business class tick...,en


In [33]:
# See distribution of languages
combined_df["language"].value_counts()

language
en    1729
nl       1
Name: count, dtype: int64

In [34]:
# Drop rows where language is NOT in english and reset the index
combined_df = combined_df[combined_df['language'] == 'en'].reset_index(drop=True)
print(combined_df.shape)

(1729, 5)


We will drop the `language` column since all values of `language` are `en` and all `full_review` are in the English language.

In [35]:
combined_df.info()
combined_df.drop(columns=["language"], inplace=True)
print("The new shape is:", combined_df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1729 entries, 0 to 1728
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   year         1729 non-null   int64 
 1   month        1729 non-null   int64 
 2   Rating       1729 non-null   int64 
 3   full_review  1729 non-null   object
 4   language     1729 non-null   object
dtypes: int64(3), object(2)
memory usage: 67.7+ KB
The new shape is: (1729, 4)


In [36]:
combined_df.head()

Unnamed: 0,year,month,Rating,full_review
0,2024,5,1,I’d give zero stars if I could. I always see o...
1,2024,9,1,Zero stars. Very disappointed with Singapore a...
2,2024,10,5,"Amazing, best service ever. Food is amazing, e..."
3,2024,10,5,We had a 4.5 hour delay on our flight from Lon...
4,2024,10,3,The leg room was great but the food was terrib...


# Text Preprocessing for NLP

Here we define a function `process_full_review` that takes a text value as input and applies the following processing steps in sequence:

1. Convert the input text to lowercase using the `lower()` method.

2. Tokenize the lowercase text using the `word_tokenize` function from the NLTK library.

3. Create a list (`alphabetic_tokens`) containing only alphabetic tokens using a list comprehension with a regular expression match.

4. Remove stopwords
-   Obtain the English stopwords using the `stopwords.words('english')` method.
-   Define a list of `allowed_words` that should not be removed.
-   Remove the stopwords (excluding those that should not be removed).

5. Apply stemming to each remaining token using the `stem` method of NLTK's `PorterStemmer`, producing the list `stemmed_words`.

6. Join the stemmed tokens into a single processed text using the `join` method and return the processed text.

Create a new column `processed_full_review` in `combined_df` by applying the `process_full_review` function to the `full_review` column.
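
To make the effect of the allow-list concrete, here is a simplified sketch of steps 1 to 4 only (lowercasing, alphabetic tokenisation, stopword removal with an allow-list). It is an approximation under stated assumptions: `STOP_WORDS`, `KEEP`, and `simple_process` are hypothetical, the stopword list is a tiny stand-in for NLTK's English list, a regular expression replaces `word_tokenize`, and stemming is left out.

```python
import re

# Tiny illustrative stopword list (the notebook uses NLTK's English stopwords)
STOP_WORDS = {"the", "was", "but", "not", "a", "is", "and"}
# Stopwords that are kept because they carry or flip sentiment
KEEP = {"not", "but"}

def simple_process(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords that are not in KEEP."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [token for token in tokens if token not in STOP_WORDS or token in KEEP]
    return " ".join(kept)

print(simple_process("The leg room was great but the food was NOT good."))
```

Without the allow-list, "not" and "but" would be removed and the example would reduce to "leg room great food good", losing the negative half of the review.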

In [37]:
# Ensure the required NLTK data is downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading wordnet: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data]     failed: unable to get local issuer certificate
[nltk_data]     (_ssl.c:1000)>


False

In [38]:
from nltk.stem import PorterStemmer
# Define function to process text
def process_full_review(text):
    processed_text = ""

    # Convert text to lowercase
    text = text.lower()

    # Tokenize the text into words
    tokens = word_tokenize(text)

    # Keep only alphabetic tokens
    alphabetic_tokens = [i for i in tokens if re.match('^[a-zA-Z]+$', i)]

    if len(alphabetic_tokens) == 0:
        # Return empty processed text if there are no alphabetic tokens
        return processed_text

    # Set of English stopwords
    stop_words = set(stopwords.words('english'))

    # Allowed words (negations and contrastive conjunctions) that are kept even if they are stopwords.
    # They can carry or flip sentiment: in "don't ever get this dish", dropping the negation
    # leaves "get dish", the opposite of the intended sentiment. "but" and "however" signal a
    # contrast with the preceding clause, and "mustn't" or "shouldn't" typically carry a
    # negative sentiment.
    # Note: only purely alphabetic tokens reach this point, so entries containing an
    # apostrophe (e.g. "don't") never match.
    allowed_words = ["no", "not", "don't", "dont", "don", "but",
                     "however", "never", "wasn't", "wasnt", "shouldn't",
                     "shouldnt", "mustn't", "mustnt"]

    # Filter out stopwords, keeping allowed words
    filtered_tokens = [i for i in alphabetic_tokens if i not in stop_words or i in allowed_words]

    # Initialise the Porter stemmer
    stemmer = PorterStemmer()

    # Stem the filtered tokens
    stemmed_words = [stemmer.stem(word) for word in filtered_tokens]

    # Join the stemmed words back into a single string
    processed_text = ' '.join(stemmed_words)

    return processed_text

In [39]:
# Enable tqdm for pandas (progress bar)
tqdm.pandas(desc="Processing Reviews")

# Apply process_full_review with a tqdm progress bar and store the results in a new column
combined_df['processed_full_review'] = combined_df['full_review'].progress_apply(process_full_review)

combined_df

Processing Reviews: 100%|██████████| 1729/1729 [00:00<00:00, 1812.46it/s]


Unnamed: 0,year,month,Rating,full_review,processed_full_review
0,2024,5,1,I’d give zero stars if I could. I always see o...,give zero star could alway see onlin singapor ...
1,2024,9,1,Zero stars. Very disappointed with Singapore a...,zero star disappoint singapor air not much due...
2,2024,10,5,"Amazing, best service ever. Food is amazing, e...",amaz best servic ever food amaz especi starter...
3,2024,10,5,We had a 4.5 hour delay on our flight from Lon...,hour delay flight london singapor thought woul...
4,2024,10,3,The leg room was great but the food was terrib...,leg room great but food terribl travel singapo...
...,...,...,...,...,...
1724,2021,12,5,"Due to the pandemic, it has been two years sin...",due pandem two year sinc board flight opportun...
1725,2021,12,5,"SQ25, JFK-FRA, in Business. Check in Counter, ...",busi check counter not know put final destin l...
1726,2021,11,4,"FRA to JFK, still a good and safe airline but ...",fra jfk still good safe airlin but massiv cut ...
1727,2021,11,0,I paid $7200 for a first / business class tick...,paid first busi class ticket websit buggi hand...


### Mapping ratings to sentiment labels

In [40]:
# Function to map ratings to sentiment
def rating_to_sentiment(rating):
    if rating <= 2:
        return 'Negative'
    elif rating == 3:
        return 'Neutral'
    else:
        return 'Positive'

# Apply the function to the 'Rating' column
combined_df['sentiment'] = combined_df['Rating'].apply(rating_to_sentiment)

# Check the sentiment distribution
print(combined_df['sentiment'].value_counts())

sentiment
Negative    964
Positive    581
Neutral     184
Name: count, dtype: int64


# Feature Selection
Now, we select the final features to use for our sentiment analysis of airline reviews.
- Columns kept: `year`, `month`, `processed_full_review`, `sentiment`

- Columns excluded: `Rating`, `full_review`

- Create a new DataFrame (`scraped_data_final`) by selecting the columns listed above from `combined_df`, and save it to `scraped_data_final.csv`.

In [41]:
scraped_data_final = combined_df[['year','month','processed_full_review','sentiment']]
scraped_data_final.head()
scraped_data_final.to_csv('scraped_data_final.csv', index=False)
