# Enhancing Airline Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**

Developed a series of data preprocessing tasks using the [Airlines Review Dataset](https://www.kaggle.com/datasets/juhibhojani/airline-reviews), then performed sentiment analysis and evaluated performance metrics across multiple models.

## Airline Customer Review Dataset Information

The [Airline Customer Review Dataset](https://www.kaggle.com/datasets/juhibhojani/airline-reviews) contains customer review data for airline flights.

- **Airline Name:** Name of the airline.
- **Overall_Rating:** Rating given by the user.
- **Review_Title:** Title of review.
- **Review Date:** The date when review was entered (e.g., 1st January 2019).
- **Verified:** Whether the reviewer is verified or not.
- **Review:** Detailed review given by the user.
- **Aircraft:** Type of aircraft.
- **Type Of Traveller:** The type of traveller (e.g., Solo Leisure).
- **Seat Type:** Categorical seat class type (e.g., Economy Class).
- **Route:** Flight source and destination.
- **Date Flown:** Month and year of flight (e.g., September 2019).
- **Seat Comfort:** Rating out of 5.
- **Cabin Staff Service:** Rating out of 5.
- **Food & Beverages:** Rating out of 5.
- **Ground Service:** Rating out of 5.
- **Inflight Entertainment:** Rating out of 5.
- **Wifi & Connectivity:** Rating out of 5.
- **Value For Money:** Rating out of 5.
- **Recommended:** Whether the flight is recommended or not.
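
The two date columns use different text formats (`1st January 2019` versus `September 2019`). A minimal sketch of how they could be parsed with the standard library is shown below. The helper names `parse_review_date` and `parse_date_flown` are hypothetical, and the sketch assumes English month names and the default locale.

```python
import re
from datetime import date, datetime

def parse_review_date(text):
    """Parse a date such as '1st January 2019' into a datetime.date."""
    # Drop the ordinal suffix that follows the day number (1st -> 1, 22nd -> 22).
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text.strip())
    return datetime.strptime(cleaned, "%d %B %Y").date()

def parse_date_flown(text):
    """Parse a month-year value such as 'September 2019' (the day defaults to 1)."""
    return datetime.strptime(text.strip(), "%B %Y").date()

print(parse_review_date("1st January 2019"))
print(parse_date_flown("September 2019"))
```

The regular expression only removes a suffix that directly follows digits, so month names such as "August" are left untouched.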

## Import Libraries

Uncomment the line below to install the dependencies required for this notebook.

In [1]:
# !pip3 install -r requirements.txt

In [62]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np

# Statistical functions
from scipy.stats import zscore

# For concurrency (running functions in parallel)
from concurrent.futures import ThreadPoolExecutor

# For caching (to speed up repeated function calls)
from functools import lru_cache

# For progress tracking
from tqdm import tqdm

# Plotting and Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Language Detection packages
# `langdetect` for detecting language
from langdetect import detect as langdetect_detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException
# `langid` for an alternative language detection method
from langid import classify as langid_classify

# Text Preprocessing and NLP
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords
# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer
import nltk
# Regular expressions for text pattern matching
import re
from textblob import TextBlob
import emoji

# Word Cloud generation
from wordcloud import WordCloud

# Data Preparation (Loading CSV)

Load the Airline Review `data.csv` file into a pandas DataFrame `data_raw`.

In [95]:
data_raw = pd.read_csv('data.csv')

In [96]:
data_raw.info()
print("Dataframe Shape: ", data_raw.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23171 entries, 0 to 23170
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              23171 non-null  int64  
 1   Airline Name            23171 non-null  object 
 2   Overall_Rating          23171 non-null  object 
 3   Review_Title            23171 non-null  object 
 4   Review Date             23171 non-null  object 
 5   Verified                23171 non-null  bool   
 6   Review                  23171 non-null  object 
 7   Aircraft                7129 non-null   object 
 8   Type Of Traveller       19433 non-null  object 
 9   Seat Type               22075 non-null  object 
 10  Route                   19343 non-null  object 
 11  Date Flown              19417 non-null  object 
 12  Seat Comfort            19016 non-null  float64
 13  Cabin Staff Service     18911 non-null  float64
 14  Food & Beverages        14500 non-null

In [97]:
data_raw.head()

Unnamed: 0.1,Unnamed: 0,Airline Name,Overall_Rating,Review_Title,Review Date,Verified,Review,Aircraft,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Food & Beverages,Ground Service,Inflight Entertainment,Wifi & Connectivity,Value For Money,Recommended
0,0,AB Aviation,9,"""pretty decent airline""",11th November 2019,True,Moroni to Moheli. Turned out to be a pretty ...,,Solo Leisure,Economy Class,Moroni to Moheli,November 2019,4.0,5.0,4.0,4.0,,,3.0,yes
1,1,AB Aviation,1,"""Not a good airline""",25th June 2019,True,Moroni to Anjouan. It is a very small airline...,E120,Solo Leisure,Economy Class,Moroni to Anjouan,June 2019,2.0,2.0,1.0,1.0,,,2.0,no
2,2,AB Aviation,1,"""flight was fortunately short""",25th June 2019,True,Anjouan to Dzaoudzi. A very small airline an...,Embraer E120,Solo Leisure,Economy Class,Anjouan to Dzaoudzi,June 2019,2.0,1.0,1.0,1.0,,,2.0,no
3,3,Adria Airways,1,"""I will never fly again with Adria""",28th September 2019,False,Please do a favor yourself and do not fly wi...,,Solo Leisure,Economy Class,Frankfurt to Pristina,September 2019,1.0,1.0,,1.0,,,1.0,no
4,4,Adria Airways,1,"""it ruined our last days of holidays""",24th September 2019,True,Do not book a flight with this airline! My fr...,,Couple Leisure,Economy Class,Sofia to Amsterdam via Ljubljana,September 2019,1.0,1.0,1.0,1.0,1.0,1.0,1.0,no


# Feature Selection
Here we select the features relevant to sentiment analysis.
- `Review_Title`, `Review`, `Recommended`. The cell below additionally keeps `Overall_Rating` and `Seat Comfort`.
- Create a new DataFrame (`data`) by selecting these specific columns from the original DataFrame `data_raw`.

In [102]:
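# Select the relevant features for sentiment analysis.
# .copy() makes `data` an independent DataFrame, so columns added to it later
# are written to `data` itself rather than to a slice of `data_raw`.
data = data_raw[[
    'Review_Title',
    'Review',
    "Overall_Rating",
    "Recommended",
    "Seat Comfort"
]].copy()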
print(type(data))
print(data.head())

# Shape before dropping duplicates
print("The old shape is: ", data.shape)

<class 'pandas.core.frame.DataFrame'>
                            Review_Title  \
0                "pretty decent airline"   
1                   "Not a good airline"   
2         "flight was fortunately short"   
3    "I will never fly again with Adria"   
4  "it ruined our last days of holidays"   

                                              Review Overall_Rating  \
0    Moroni to Moheli. Turned out to be a pretty ...              9   
1   Moroni to Anjouan. It is a very small airline...              1   
2    Anjouan to Dzaoudzi. A very small airline an...              1   
3    Please do a favor yourself and do not fly wi...              1   
4   Do not book a flight with this airline! My fr...              1   

  Recommended  Seat Comfort  
0         yes           4.0  
1          no           2.0  
2          no           2.0  
3          no           1.0  
4          no           1.0  
The old shape is:  (23171, 5)


In [106]:
data[(data["Seat Comfort"] == 1)]

Unnamed: 0,Review_Title,Review,Overall_Rating,Recommended,Seat Comfort
3,"""I will never fly again with Adria""",Please do a favor yourself and do not fly wi...,1,no,1.0
4,"""it ruined our last days of holidays""",Do not book a flight with this airline! My fr...,1,no,1.0
5,"""Had very bad experience""",Had very bad experience with rerouted and ca...,1,no,1.0
6,"""worse than the budget airlines""","Ljubljana to Zürich. Firstly, Ljubljana airp...",1,no,1.0
7,"""book another company""","First of all, I am not complaining about a s...",1,no,1.0
...,...,...,...,...,...
23137,ZIPAIR customer review,The seats are abysmal for a long-haul flight...,1,no,1.0
23148,"""has the worst customer service""",Zipair has the worst customer service. The c...,3,no,1.0
23149,"""airline is a total rip-off for the extras""",This airline is a total rip-off for the extra...,2,no,1.0
23155,"""Very frustrating experience""",Very frustrating experience. Does not allow c...,1,no,1.0


# Data Cleaning

## Remove Duplicate Rows
- Drop duplicate rows from the dataframe (`data`)

In [70]:
data = data.drop_duplicates()

# Display the new dataframe shape
print("The new shape is: ", data.shape)

The new shape is:  (23050, 3)


In [71]:
data.isnull().sum()

Review_Title      0
Review            0
Overall_Rating    0
dtype: int64

## Remove Outliers

### `Review` 

The `Review` column of `data`, which is of string type, may contain unusually long values, which we treat as outliers. We identify them using the Z-score method.

1. Create a new column `review_length` in the DataFrame `data` by calculating the length of each review. (Set the value to 0 if the corresponding `Review` value is NaN.)

2. Check the statistics of `review_length` using `describe()` method.

3. Calculate the mean and standard deviation of the `review_length` column.

4. Set the Z-score threshold for identifying outliers to 3.

5. Identify outliers of the `review_length` column and set the corresponding `Review` to np.nan.

6. Drop the `review_length` column from the DataFrame.
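
As a worked illustration of the steps above, the sketch below computes z = (x - mean) / std on a small, made-up list of review lengths using only the standard library. The data and the helper name `z_scores` are hypothetical. Note that `scipy.stats.zscore` defaults to the population standard deviation (`ddof=0`), whereas `Series.std()` defaults to the sample version (`ddof=1`), so a standard deviation computed with `Series.std()` differs slightly from the one `zscore` uses internally.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / std for each value, using the population std (ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

# Hypothetical review lengths: 19 ordinary reviews and one very long one.
lengths = [350, 420, 380, 510, 290, 460, 400, 330, 480, 370,
           440, 310, 390, 450, 360, 410, 470, 340, 430, 5000]

threshold = 3
outliers = [x for x, z in zip(lengths, z_scores(lengths)) if abs(z) > threshold]
print(outliers)  # [5000]
```

With a single extreme value among n points, the largest possible population z-score is sqrt(n - 1), so very small samples cannot exceed a threshold of 3 at all.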

In [72]:
data['review_length'] = data['Review'].apply(lambda x: len(x) if pd.notna(x) else 0)
print(data.head(3))

TL = data["review_length"]
stats_TL = TL.describe()
print(stats_TL)

                     Review_Title  \
0         "pretty decent airline"   
1            "Not a good airline"   
2  "flight was fortunately short"   

                                              Review Overall_Rating  \
0    Moroni to Moheli. Turned out to be a pretty ...              9   
1   Moroni to Anjouan. It is a very small airline...              1   
2    Anjouan to Dzaoudzi. A very small airline an...              1   

   review_length  
0            352  
1            689  
2            405  
count    23050.000000
mean       721.822299
std        537.901606
min         14.000000
25%        361.000000
50%        561.000000
75%        904.000000
max       5080.000000
Name: review_length, dtype: float64


In [73]:
mean_TL = TL.mean()
# print(mean_TL)

sd_TL = TL.std()
# print(sd_TL)

threshold = 3

z_score = zscore(TL)
# print(z_score)

# Set 'Review' to NaN where the length lies more than 3 standard deviations from the mean (in either direction)
data.loc[abs(z_score) > threshold, 'Review'] = np.nan
# print(data.head(3))

data = data.drop("review_length", axis=1)
data.head()

Unnamed: 0,Review_Title,Review,Overall_Rating
0,"""pretty decent airline""",Moroni to Moheli. Turned out to be a pretty ...,9
1,"""Not a good airline""",Moroni to Anjouan. It is a very small airline...,1
2,"""flight was fortunately short""",Anjouan to Dzaoudzi. A very small airline an...,1
3,"""I will never fly again with Adria""",Please do a favor yourself and do not fly wi...,1
4,"""it ruined our last days of holidays""",Do not book a flight with this airline! My fr...,1


### `Review_Title`

Similarly, the `Review_Title` column of `data` (of type `str`) may also contain unusually long values, which we treat as outliers.

1. Create a new column `title_length` in the DataFrame `data` by calculating the length of each review title. (Set the value to 0 if the corresponding `Review_Title` value is NaN.)

2. Check the statistics of `title_length` using the `describe()` method.

3. Identify outliers in `title_length` using the same Z-score threshold of 3 and set the corresponding `Review_Title` to np.nan.

4. Drop the `title_length` column from the DataFrame.

In [74]:
data['title_length'] = data['Review_Title'].apply(lambda x: len(x) if pd.notna(x) else 0)
print(data.head(3))

TL = data["title_length"]
stats_TL = TL.describe()
print(stats_TL)

                     Review_Title  \
0         "pretty decent airline"   
1            "Not a good airline"   
2  "flight was fortunately short"   

                                              Review Overall_Rating  \
0    Moroni to Moheli. Turned out to be a pretty ...              9   
1   Moroni to Anjouan. It is a very small airline...              1   
2    Anjouan to Dzaoudzi. A very small airline an...              1   

   title_length  
0            23  
1            20  
2            30  
count    23050.000000
mean        29.841085
std          7.087795
min          2.000000
25%         25.000000
50%         29.000000
75%         33.000000
max         83.000000
Name: title_length, dtype: float64


In [76]:
mean_TL = TL.mean()
# print(mean_TL)

sd_TL = TL.std()
# print(sd_TL)

threshold = 3

z_score = zscore(TL)
# print(z_score)

# Set 'Review_Title' to NaN where the length lies more than 3 standard deviations from the mean (in either direction)
data.loc[abs(z_score) > threshold, 'Review_Title'] = np.nan
# print(data.head(3))

# errors="ignore" keeps this cell safe to re-run after `title_length` has already been dropped
data = data.drop(columns="title_length", errors="ignore")
data.head()

In [77]:
data.isnull().sum()

Review_Title      266
Review            504
Overall_Rating      0
dtype: int64

## Feature Engineering

### Create new column `full_review`
Since some rows have an empty `Review_Title` or `Review`, we concatenate both columns into a new column `full_review`.
1. Replace `NaN` values in `Review_Title` and `Review` with an empty string.

2. Strip leading and trailing double quotation marks (`"`) from `Review_Title`.

3. Combine `Review_Title` and `Review` into `full_review`.

4. Strip any leading/trailing whitespace in `full_review`.

5. Drop the `Review_Title` and `Review` columns.

6. Drop rows where `full_review` is an empty string.

In [78]:
# 1) Fill NaN values in 'Review_Title' and 'Review' with an empty string
data['Review_Title'] = data['Review_Title'].fillna('')
data['Review'] = data['Review'].fillna('')

# 2) Strip leading and trailing double quotation marks (") from 'Review_Title'
data['Review_Title'] = data['Review_Title'].str.strip('"')

# 3) Combine 'Review_Title' and 'Review' into 'full_review'
data['full_review'] = data['Review_Title'] + " " + data['Review']

# 4) Strip any leading/trailing whitespace
data['full_review'] = data['full_review'].str.strip()

# 5) Drop `Review_Title` and `Review` columns
data = data.drop(columns = ['Review_Title', 'Review'])

# Check that the 'full_review' column was added and that the 'Review_Title' and 'Review' columns have been dropped
data.head()
print("The old shape is:",data.shape)

The old shape is: (23050, 2)


### Handle Missing Values
1. Drop rows where `full_review` is an empty string and reset the index.

2. Check that there are no more null values in `data`.

In [79]:
# 1) Drop rows where `full_review` are empty strings and reset the index
data = data[data['full_review'] != ""].reset_index(drop=True)
print("The new shape is:",data.shape)

# 2) Check if there are no more null values in `data`
data.isnull().sum()

The new shape is: (23040, 2)


Overall_Rating    0
full_review       0
dtype: int64

In [80]:
data.head()

Unnamed: 0,Overall_Rating,full_review
0,9,pretty decent airline Moroni to Moheli. Turn...
1,1,Not a good airline Moroni to Anjouan. It is a...
2,1,flight was fortunately short Anjouan to Dzao...
3,1,I will never fly again with Adria Please do ...
4,1,it ruined our last days of holidays Do not bo...


In [87]:
data[data["Overall_Rating"] != "1"]

Unnamed: 0,Overall_Rating,full_review
0,9,pretty decent airline Moroni to Moheli. Turn...
9,8,the crew was nice Ljubljana to Munich. The ho...
12,2,overall very poor We were traveling from Par...
13,2,Would not fly again Ljubljana to Munich. Adr...
14,3,very unpleasant experience A very unpleasant...
...,...,...
23019,2,I should not have to pay extra just to get a m...
23021,9,I highly recommend ZipAir It was my first ti...
23022,5,insist I paid for another 23kg of luggage Tw...
23037,3,Will not recommend to anyone Flight was leav...
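
The filter above compares `Overall_Rating` against the string `"1"`, which is consistent with the `object` dtype reported by `data_raw.info()`. If numeric ratings are needed later (for plotting or correlation), one possible approach is `pd.to_numeric` with `errors="coerce"`. The sample values below are made up, and the non-numeric entry `"n"` is an assumption used only to show the coercion.

```python
import pandas as pd

# Hypothetical sample; "n" stands in for any entry that is not a number.
ratings = pd.Series(["9", "1", "n", "3"], name="Overall_Rating")

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
ratings_numeric = pd.to_numeric(ratings, errors="coerce")

print(ratings_numeric.tolist())
print(int(ratings_numeric.isna().sum()), "value(s) could not be parsed")
```

After conversion, comparisons such as `ratings_numeric != 1` work on numbers rather than strings, and the unparseable rows can be inspected or dropped explicitly.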


### Create new column `language`
We found rows where `full_review` is written in a language other than English (e.g., French, Russian). We run two language-detection libraries (`langdetect` and `langid`) on the `full_review` column and combine their predictions by vote. Because there are only two detectors, a language is accepted only when both agree; any disagreement is labelled `unknown`.

**Reason**: `langdetect` might perform well on longer texts while `langid` is more reliable on short texts. Using multiple detectors reduces the likelihood of misclassification and mitigates individual detector errors. If one detector fails or throws an error, its vote becomes `unknown`, so the row is conservatively labelled `unknown` rather than misclassified.

1. Set a seed for `langdetect` to ensure reproducibility.

2. Preprocess the text in `full_review`:
    - a\) Function to remove non-alphabetic characters and normalise whitespaces in `full_review`.
    - b\) Function to determine if the text is non-language (e.g., numbers, symbols only).

3. Two functions for language detection:
    - a\) Using `langdetect`.
    - b\) Using `langid`.

4. Function for calculating majority vote for each language.

5. Function for parallel processing for efficiency.

6. Caching function for repeated inputs

7. Function for choosing language based on combined majority voting.

8. Applying the combined function on `full_review` column.

9. Display the resulting `data` DataFrame.
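
The voting rule in steps 4 and 7 can be sketched in a few lines with `collections.Counter`. This is a simplified stand-in rather than the notebook's exact functions: the name `vote` and the `min_votes` parameter are hypothetical, and the sketch assumes a non-empty list of predictions.

```python
from collections import Counter

def vote(predictions, min_votes=2):
    """Return the most common label if it has at least `min_votes`, else 'unknown'."""
    label, count = Counter(predictions).most_common(1)[0]
    if label == "unknown" or count < min_votes:
        return "unknown"
    return label

print(vote(["en", "en"]))       # en
print(vote(["en", "fr"]))       # unknown
print(vote(["en", "unknown"]))  # unknown
```

With two voters and a two-vote minimum, the rule is effectively an agreement check. Adding a third detector would turn it into a genuine majority vote.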

In [52]:
# 1) Set a seed for langdetect to ensure reproducibility
DetectorFactory.seed = 0

# 2a) Simplified preprocessing: only remove non-alphabetic characters
def preprocess_text_simple(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
    return text.strip()

# 2b) Check if the text is non-language (e.g., numbers, symbols only)
def is_non_language_text(text):
    if re.match(r'^[^a-zA-Z]*$', text):  # Check if text has no alphabetic characters
        return True
    return False

# 3a) Function to get langdetect prediction
def get_langdetect_prediction(text):
    try:
        # Directly use text without preprocessing for efficiency
        if len(text) < 10 or is_non_language_text(text):
            return "unknown"
        lang = langdetect_detect(text)
        return lang
    except LangDetectException:
        return "unknown"

# 3b) Function to get langid prediction
def get_langid_prediction(text):
    try:
        # Skip very short or non-alphabetic text before calling the classifier
        if len(text) < 10 or is_non_language_text(text):
            return "unknown"
        lang, _ = langid_classify(text)
        return lang
    except Exception:
        return "unknown"

# 4) Function to calculate majority vote for each language
def calculate_majority_vote(predictions):
    vote_counts = {}
    for lang in predictions:
        if lang in vote_counts:
            vote_counts[lang] += 1
        else:
            vote_counts[lang] = 1
    return vote_counts

# 5) Parallel processing for efficiency with limited workers
def parallel_detection(text):
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(lambda func: func(text), 
                                    [get_langdetect_prediction, get_langid_prediction]))
    return results

# 6) Caching function for repeated inputs
@lru_cache(maxsize=500)
def get_cached_language(text):
    return combined_language_detection(text)

# 7) Combined majority voting language detection function
def combined_language_detection(text):
    # Check if the text is non-language (e.g., numbers, symbols only)
    if is_non_language_text(text):
        return "unknown"
    
    # Run the detectors in parallel for efficiency
    predictions = parallel_detection(text)
    
    # Calculate majority vote for each language based on predictions
    vote_counts = calculate_majority_vote(predictions)
    
    # Determine the language with the highest majority vote
    final_language = max(vote_counts, key=vote_counts.get)
    
    # Accept a language only if it received more than one vote (with two detectors, both must agree); otherwise return "unknown"
    if final_language == "unknown" or vote_counts[final_language] <= 1:
        return "unknown"
    
    return final_language

# 8) Apply the cached function to each text in the DataFrame with a progress bar
data['language'] = [get_cached_language(text) for text in tqdm(data['full_review'], desc="Language Detection")]

# 9) Display the DataFrame with detected languages
data

Language Detection: 100%|██████████| 23039/23039 [03:01<00:00, 127.05it/s]


Unnamed: 0,full_review,language
0,pretty decent airline Moroni to Moheli. Turn...,en
1,Not a good airline Moroni to Anjouan. It is a...,en
2,flight was fortunately short Anjouan to Dzao...,en
3,I will never fly again with Adria Please do ...,en
4,it ruined our last days of holidays Do not bo...,en
...,...,...
23034,customer service is terrible Bangkok to Tokyo...,en
23035,Avoid at all costs Avoid at all costs. I boo...,en
23036,Will not recommend to anyone Flight was leav...,en
23037,It was immaculately clean,en
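
The `parallel_detection` function above creates a new `ThreadPoolExecutor` for every review. One possible variation, sketched below, reuses a single pool for the whole column and maps over the texts instead. The two detector functions here are hypothetical stand-ins for `get_langdetect_prediction` and `get_langid_prediction`, and the sketch assumes the real detectors are safe to call from multiple threads, which is not verified here. Whether it actually runs faster would need to be measured.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two detector functions in the notebook.
def detector_a(text):
    return "en" if text.isascii() else "unknown"

def detector_b(text):
    return "en" if len(text) >= 10 else "unknown"

DETECTORS = [detector_a, detector_b]

def detect_all(texts, max_workers=4):
    """Run every detector on every text, reusing a single thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per text; each task calls the detectors in turn.
        return list(pool.map(lambda t: [d(t) for d in DETECTORS], texts))

results = detect_all(["It was immaculately clean", "ok"])
print(results)
```

`pool.map` returns results in input order, so the output can be assigned back to the DataFrame column directly.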


In [53]:
# See distribution of languages
data["language"].value_counts()

language
en         22852
unknown      145
fr            18
es             7
it             4
de             2
no             2
id             2
sv             2
et             1
nl             1
ca             1
pt             1
th             1
Name: count, dtype: int64

In [54]:
# Keep only rows detected as English and reset the index
data = data[data['language'] == 'en'].reset_index(drop=True)
data

Unnamed: 0,full_review,language
0,pretty decent airline Moroni to Moheli. Turn...,en
1,Not a good airline Moroni to Anjouan. It is a...,en
2,flight was fortunately short Anjouan to Dzao...,en
3,I will never fly again with Adria Please do ...,en
4,it ruined our last days of holidays Do not bo...,en
...,...,...
22847,customer service is terrible Bangkok to Tokyo...,en
22848,Avoid at all costs Avoid at all costs. I boo...,en
22849,Will not recommend to anyone Flight was leav...,en
22850,It was immaculately clean,en


We will drop the `language` column since all values of `language` are `en` and all `full_review` are in the English language.

In [55]:
data.info()
data.drop(columns=["language"], inplace=True)
print("The new shape is:", data.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22852 entries, 0 to 22851
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   full_review  22852 non-null  object
 1   language     22852 non-null  object
dtypes: object(2)
memory usage: 357.2+ KB
The new shape is: (22852, 1)


In [56]:
data

Unnamed: 0,full_review
0,pretty decent airline Moroni to Moheli. Turn...
1,Not a good airline Moroni to Anjouan. It is a...
2,flight was fortunately short Anjouan to Dzao...
3,I will never fly again with Adria Please do ...
4,it ruined our last days of holidays Do not bo...
...,...
22847,customer service is terrible Bangkok to Tokyo...
22848,Avoid at all costs Avoid at all costs. I boo...
22849,Will not recommend to anyone Flight was leav...
22850,It was immaculately clean


# Text Preprocessing for NLP

Here we will define a function `process_full_review` that takes a textual value as input and applies the following processing steps in sequence:

1. Convert the input text to lowercase using the `lower()` function.

2. Tokenize the lowercase text using the `word_tokenize` function from the NLTK library.

3. Create a list (`alphabetic_tokens`) containing only alphabetic tokens using a list comprehension with a regular expression match.

4. Remove stopwords
-   Obtain a set of English stopwords using the `stopwords.words('english')` method.
-   Define a list of `allowed_words` that should not be removed.
-   Remove the stopwords (excluding those that should not be removed).

5. Apply lemmatization to each token in the list (`lemmatized_words`) using the `lemmatize` method.

6. Join the lemmatized tokens into a single processed text using the `join` method and return the processed text.

Create a new column (`processed_full_review`) in `data` by applying the `process_full_review` function to the `full_review` column.

In [57]:
# Ensure the required NLTK data is downloaded
# (newer NLTK releases may also require nltk.download('punkt_tab') for word_tokenize)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [67]:
# Define function to process text
def process_full_review(text):
    processed_text = ""

    # Convert text to lowercase
    text = text.lower()

    # Tokenize the text into words
    tokens = word_tokenize(text)

    # Keep only alphabetic tokens
    alphabetic_tokens = [i for i in tokens if re.match('^[a-zA-Z]+$', i)]

    if len(alphabetic_tokens) == 0:
        # Return the empty string if no alphabetic token remains
        return processed_text

    # Set of English stopwords (set membership tests are faster than list lookups)
    stop_words = set(stopwords.words('english'))

    # Allowed words (negations and contrastive conjunctions) that are kept even though
    # some of them are stopwords. They can carry sentiment: removing "don't" from
    # "don't ever get this dish" could leave "get dish", which reads as the opposite of
    # the original review. Conjunctions such as "but" and "however" signal a contrast
    # with what was said before, and "mustn't" or "shouldn't" typically carry a
    # negative sentiment.
    # Note: tokens containing an apostrophe do not pass the alphabetic-only filter
    # above, so the apostrophe forms listed here ("don't", "wasn't", ...) never match.
    allowed_words = {"no", "not", "don't", "dont", "don", "but",
                     "however", "never", "wasn't", "wasnt", "shouldn't",
                     "shouldnt", "mustn't", "mustnt"}

    # Filter out stopwords, keeping allowed words
    filtered_tokens = [i for i in alphabetic_tokens if i not in stop_words or i in allowed_words]

    # Initialise the WordNet Lemmatizer
    lemmatizer = WordNetLemmatizer()

    # Lemmatize the filtered tokens
    lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    # Join the lemmatized words back into a single string
    processed_text = ' '.join(lemmatized_words)

    return processed_text
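
One caveat with the function above: NLTK's `word_tokenize` typically splits a contraction such as "don't" into "do" and "n't", and the alphabetic-only filter then discards "n't", so the negation can be lost before the allowed-words check runs. A possible workaround, sketched below with the standard library only, is to expand negative contractions before tokenizing. The helper name `expand_negations` and the small `IRREGULAR` table are hypothetical and not part of the original notebook.

```python
import re

# Contractions whose stem changes when expanded (a deliberately small, illustrative table).
IRREGULAR = {"can't": "can not", "won't": "will not", "shan't": "shall not"}

def expand_negations(text):
    """Lowercase the text and rewrite "...n't" contractions as "... not"."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    for contraction, expansion in IRREGULAR.items():
        text = text.replace(contraction, expansion)
    # Regular cases: don't -> do not, wasn't -> was not, shouldn't -> should not
    return re.sub(r"(\w+)n't\b", r"\1 not", text)

print(expand_negations("Don't fly, it wasn't good and I can't recommend"))
```

Calling `expand_negations` on the text before `word_tokenize` would let "not" survive as an ordinary alphabetic token that is already in `allowed_words`.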

In [68]:
# Enable tqdm for pandas (progress bar)
tqdm.pandas(desc="Processing Airlines Reviews")

# Apply the process_full_review function with a tqdm progress bar and store the result in a new column
data['processed_full_review'] = data['full_review'].progress_apply(process_full_review)

data

Processing Airlines Reviews: 100%|██████████| 22852/22852 [00:26<00:00, 862.24it/s]


Unnamed: 0,full_review,processed_full_review
0,pretty decent airline Moroni to Moheli. Turn...,pretty decent airline moroni moheli turned pre...
1,Not a good airline Moroni to Anjouan. It is a...,not good airline moroni anjouan small airline ...
2,flight was fortunately short Anjouan to Dzao...,flight fortunately short anjouan dzaoudzi smal...
3,I will never fly again with Adria Please do ...,never fly adria please favor not fly adria rou...
4,it ruined our last days of holidays Do not bo...,ruined last day holiday not book flight airlin...
...,...,...
22847,customer service is terrible Bangkok to Tokyo...,customer service terrible bangkok tokyo flown ...
22848,Avoid at all costs Avoid at all costs. I boo...,avoid cost avoid cost booked flight go singapo...
22849,Will not recommend to anyone Flight was leav...,not recommend anyone flight leaving hour half ...
22850,It was immaculately clean,immaculately clean


# Exploratory Data Analysis (EDA)

## Statistical Summary

In [32]:
print("The shape of the `data` DataFrame is:",data.shape)

data.isnull().sum()

data["Recommended"].value_counts()

The shape of the `data` DataFrame is: (22853, 3)


Recommended
no     15096
yes     7757
Name: count, dtype: int64

#### Class Distribution

In [None]:
# Plotting the distribution of the "Overall Rating" dependent variable
sns.countplot(x='Overall_Rating', data=df_cleaned)
plt.title('Class Distribution')
plt.show()

In [None]:
# Get percentage distribution of "Overall Rating"
class_distribution_percentage = df_cleaned['Overall_Rating'].value_counts(normalize=True) * 100

print(class_distribution_percentage)

#### Distribution of Features

#### Text Preprocessing

In [None]:
# Text preprocessing
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Function to clean text for "Review" column of df_cleaned
def preprocess_text(text):
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()  # Convert to lowercase
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

# Apply cleaning to "Review"
df_cleaned['Cleaned_Review'] = df_cleaned['Review'].apply(preprocess_text)
df_cleaned["Cleaned_Review"].head() 

In [None]:
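# Convert text to TF-IDF features

# Models need numerical input, so each review is converted into a vector of TF-IDF weights.
# Note: TF-IDF yields one fixed-length bag-of-words vector per review, not an ordered token
# sequence; a sequence model such as an RNN would usually take tokenised, padded index sequences instead.

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer with a limit on vocabulary size
max_words = 5000
vectorizer = TfidfVectorizer(max_features=max_words)

# Fit and transform the text data into a sparse document-term matrix
sequences = vectorizer.fit_transform(df_cleaned['Cleaned_Review'])

# Convert to a dense array (only if needed; the dense copy can use far more memory than the sparse matrix)
sequences_array = sequences.toarray()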

# Check the shape of the output
print(sequences_array.shape)
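
To make the weighting concrete, here is a dependency-free sketch of the TF-IDF calculation on three made-up reviews. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation per document) and simplifies tokenisation to a whitespace split, so it is an illustration rather than a reimplementation of `TfidfVectorizer`. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

docs = ["flight delayed", "flight great", "flight crew rude"]
rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

A term that appears in every review ("flight") receives a lower weight than a rarer term in the same review ("delayed"), which is the property that makes TF-IDF useful for separating reviews by content.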



#### Visualizing the Distribution of Text Length

In [None]:
df_cleaned['review_length'] = df_cleaned['Cleaned_Review'].apply(lambda x: len(x.split()))
df_cleaned['review_length'].hist(bins=50)


In [None]:
import seaborn as sns
sns.histplot(df_cleaned['review_length'], kde=True)

#### Most Common Words



In [None]:
from collections import Counter

all_words = ' '.join([text for text in df_cleaned['Cleaned_Review']])
word_counts = Counter(all_words.split())
common_words = word_counts.most_common(20)

print(common_words)


#### Word Cloud Analysis


In [None]:
from wordcloud import WordCloud

# Generating word cloud from cleaned reviews
text = ' '.join(df_cleaned['Cleaned_Review'].tolist())
wordcloud = WordCloud(width=800, height=400, max_words=100).generate(text)

# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

#### Bigram Analysis

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a count vectorizer for the most common bigrams (phrases of 2 words)
vectorizer = CountVectorizer(ngram_range=(2, 2), max_features=19)

# Fit and transform the cleaned review data
bigrams = vectorizer.fit_transform(df_cleaned['Cleaned_Review'])

# Get the bigram frequencies
bigram_frequencies = pd.DataFrame(bigrams.toarray(), columns=vectorizer.get_feature_names_out()).sum().sort_values(ascending=False)

print(bigram_frequencies)
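
For intuition, the same idea can be sketched without scikit-learn by pairing each token with its successor and counting the pairs. The function name `top_bigrams` and the sample reviews are hypothetical. Because `CountVectorizer` applies its own tokenisation by default (lowercasing and ignoring single-character tokens), its counts may differ slightly from this whitespace-based version.

```python
from collections import Counter

def top_bigrams(texts, k):
    """Return the k most frequent adjacent word pairs across all texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # zip(tokens, tokens[1:]) yields each token paired with the next one.
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(k)

reviews = ["customer service bad", "customer service great", "great crew"]
print(top_bigrams(reviews, 1))  # [(('customer', 'service'), 2)]
```

Bigrams are counted within each review only, so no pair spans the boundary between two reviews.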

#### Finding the Sentiment Polarity Distribution

In [None]:
%pip install textblob
from textblob import TextBlob

# Calculate polarity
# Take note that this current polarity is calculated using the TextBlob library

df_cleaned['polarity'] = df_cleaned['Cleaned_Review'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Plot the polarity distribution
sns.histplot(df_cleaned['polarity'], bins=30)
plt.title('Sentiment Polarity Distribution')
plt.show()

In [None]:
# After the additional columns are added to df_cleaned, this is what it looks like now
df_cleaned.tail()

#### Correlation Matrix


In [None]:
# Correlation matrix over the numerical columns only (numeric_only=True skips text columns)
sns.heatmap(df_cleaned.corr(numeric_only=True), annot=True)

#### Pairplot of Features
