# Enhancing Airline Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**

Developed a series of data preprocessing tasks using the [Airlines Review Dataset](https://www.kaggle.com/datasets/juhibhojani/airline-reviews), then performed sentiment analysis and evaluated performance metrics using multiple models.

### Airline Customer Review Dataset Information

The [Airline Customer Review Dataset](https://www.kaggle.com/datasets/juhibhojani/airline-reviews) contains customer review data for airline flights.

- **Airline Name:** Name of the airline.
- **Overall_Rating:** Rating given by the user.
- **Review_Title:** Title of review.
- **Review Date:** The date when the review was entered (e.g., 1st January 2019).
- **Verified:** Whether the reviewer is verified or not.
- **Review:** Detailed review given by the user.
- **Aircraft:** Type of aircraft.
- **Type Of Traveller:** The type of traveller (e.g., Solo Leisure).
- **Seat Type:** Categorical seat class type (e.g., Economy Class).
- **Route:** Flight source and destination.
- **Date Flown:** Month and year of flight (e.g., September 2019).
- **Seat Comfort:** Rating out of 5.
- **Cabin Staff Service:** Rating out of 5.
- **Food & Beverages:** Rating out of 5.
- **Ground Service:** Rating out of 5.
- **Inflight Entertainment:** Rating out of 5.
- **Wifi & Connectivity:** Rating out of 5.
- **Value For Money:** Rating out of 5.
- **Recommended:** Whether the flight is recommended or not.

## Import Libraries

Uncomment the line below to install the dependencies required for this notebook.

In [None]:
# !pip3 install -r requirements.txt

In [2]:
import pandas as pd
import numpy as np
from scipy.stats import zscore

# Plot
import matplotlib.pyplot as plt
import seaborn as sns

# Text Preprocessing and NLP
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
import nltk
import re
import spacy
from wordcloud import WordCloud


## Data Preparation (Loading CSV)

Load the Airline Review `csv` file into a pandas DataFrame `data_raw`.

In [103]:
data_raw = pd.read_csv('data.csv')

In [None]:
data_raw.info()
print("Dataframe Shape: ", data_raw.shape)

In [None]:
data_raw.head()

## Feature Selection
Here we select the features relevant to sentiment analysis:
- `Airline Name`, `Overall_Rating`, `Review_Title`, `Review Date`, `Recommended`, `Review`, `Type Of Traveller`, `Seat Type`
- Create a new DataFrame (`data`) by selecting the specific columns listed above from the original DataFrame `data_raw`.

### Remove Duplicate Rows
- Drop duplicate rows from the dataframe (`data`).

In [None]:
# Selecting the relevant features for sentiment analysis 
data = data_raw[[
    'Airline Name', 'Overall_Rating', 'Review_Title', 'Review Date',
    'Review', 'Type Of Traveller', 'Seat Type', 'Recommended'
]]
print(type(data))
print(data.head())

# Shape before dropping duplicates
print("The old shape is: ", data.shape)

data = data.drop_duplicates()

# Display the new dataframe shape
print("The new shape is: ", data.shape)

## Remove Outliers

#### Review 

The `Review` column of `data`, which is of string type, may contain values with unusually long lengths, indicating the presence of outliers. We will identify the outliers using the Z-score method.

1. Create a new column `review_length` in the DataFrame `data` by calculating the length of each review. (Set the value to 0 if the corresponding `Review` value is NaN.)
2. Check the statistics of `review_length` using the `describe()` method.
3. Calculate the mean and standard deviation of the `review_length` column.
4. Set the Z-score threshold for identifying outliers to 3.
5. Identify outliers of the `review_length` column and set the corresponding `Review` to `np.nan`.
6. Drop the `review_length` column from the DataFrame.
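
As a small illustration of the rule described above, the sketch below applies a Z-score threshold of 3 to a made-up series of lengths. The values and the name `lengths` are hypothetical and do not come from the dataset. One detail worth noting: `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, while `Series.std()` uses the sample standard deviation (`ddof=1`), so a manually computed score can differ slightly unless `ddof=0` is passed.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical lengths: twenty ordinary values and one extreme value
lengths = pd.Series([100] * 10 + [120] * 10 + [5000])

# Manual z-score using the population standard deviation (ddof=0),
# which matches the default behaviour of scipy.stats.zscore
z_manual = (lengths - lengths.mean()) / lengths.std(ddof=0)
z_scipy = zscore(lengths)

threshold = 3
is_outlier = np.abs(z_manual) > threshold

# Only the extreme value is flagged
print(lengths[is_outlier])
```

With real data the flagged rows would then have their text set to `np.nan`, as in the cell that follows.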

In [None]:
data['review_length'] = data['Review'].apply(lambda x: len(x) if pd.notna(x) else 0)
print(data.head(3))

TL = data["review_length"]
stats_TL = TL.describe()
print(stats_TL)

In [None]:
mean_TL = TL.mean()
# print(mean_TL)

sd_TL = TL.std()
# print(sd_TL)

threshold = 3

z_score = zscore(TL)
# print(z_score)

# Set 'Review' to NaN where its length is more than 3 standard deviations from the mean
data.loc[abs(z_score) > threshold, 'Review'] = np.nan
# print(data.head(3))

data = data.drop("review_length", axis=1)
# print(data.head(3))

#### Review_Title

Similarly, the `Review_Title` column of `data` (of type `str`) may also contain values with unusually long lengths, indicating the presence of outliers.

1. Create a new column `title_length` in the DataFrame `data` by calculating the length of each title. (Set the value to 0 if the corresponding `Review_Title` value is NaN.)
2. Check the statistics of `title_length` using the `describe()` method.
3. Identify outliers in `title_length` using the same Z-score threshold of 3 and set the corresponding `Review_Title` to `np.nan`.
4. Drop the `title_length` column from the DataFrame.

In [None]:
data['title_length'] = data['Review_Title'].apply(lambda x: len(x) if pd.notna(x) else 0)
print(data.head(3))

TL = data["title_length"]
stats_TL = TL.describe()
print(stats_TL)

In [None]:
mean_TL = TL.mean()
# print(mean_TL)

sd_TL = TL.std()
# print(sd_TL)

threshold = 3

z_score = zscore(TL)
# print(z_score)

# Set 'Review_Title' to NaN where its length is more than 3 standard deviations from the mean
data.loc[abs(z_score) > threshold, 'Review_Title'] = np.nan
# print(data.head(3))

data = data.drop("title_length", axis=1)
# print(data.head(3))

In [None]:
data.isnull().sum()

## Feature Engineering

### Create new column `Full_Review`
Since there are some rows with empty `Review_Title` and `Review`, we will concatenate both columns (`Review_Title` and `Review`) to form a new column `Full_Review`.
1. Replace `NaN` values in `Review_Title` and `Review` with an empty string

2. Strip leading and trailing double quotation marks (`"`) from `Review_Title`

3. Combine `Review_Title` and `Review` into `Full_Review`

4. Strip any leading/trailing whitespace in `Full_Review`

5. Drop `Review_Title` and `Review` columns
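
To show what steps 1 to 4 produce when one of the two fields is missing, here is a minimal sketch on a hypothetical three-row DataFrame (`toy` and its contents are invented for illustration). Filling NaN with an empty string before concatenating keeps the other field instead of turning the whole result into NaN, and the final `strip()` removes the stray space left behind.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, one with a missing title, one with a missing review
toy = pd.DataFrame({
    "Review_Title": ['"Great flight"', np.nan, '"Lost luggage"'],
    "Review": ["Friendly crew.", "Delayed by two hours.", np.nan],
})

# Steps 1 and 2: fill NaN with empty strings, strip surrounding double quotes from the title
title = toy["Review_Title"].fillna("").str.strip('"')
review = toy["Review"].fillna("")

# Steps 3 and 4: concatenate, then strip leading/trailing whitespace
toy["Full_Review"] = (title + " " + review).str.strip()

print(toy["Full_Review"].tolist())
```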

In [None]:
# 1) Fill NaN values in 'Review_Title' and 'Review' with an empty string
data['Review_Title'] = data['Review_Title'].fillna('')
data['Review'] = data['Review'].fillna('')

# 2) Strip leading and trailing double quotation marks (") from 'Review_Title'
data['Review_Title'] = data['Review_Title'].str.strip('"')

# 3) Combine 'Review_Title' and 'Review' into 'Full_Review'
data['Full_Review'] = data['Review_Title'] + " " + data['Review']

# 4) Strip any leading/trailing whitespace
data['Full_Review'] = data['Full_Review'].str.strip()

# 5) Drop `Review_Title` and `Review` columns
# (left commented out: 'Review' is still used by the text preprocessing step later in the notebook)
# data = data.drop(columns = ['Review_Title', 'Review'])

# Check that the 'Full_Review' column was added correctly
data.head()

In [None]:
data[(data["Overall_Rating"] == "1") & (data["Recommended"]=="yes")]

## Handle Missing Values

In [None]:
# Step 1: Inspect the unique values of 'Overall_Rating' before conversion
unique_ratings = data['Overall_Rating'].unique()
print(unique_ratings)

# Step 2: Convert 'Overall_Rating' to numeric and handle non-numeric values (errors='coerce' converts non-numeric values to NaN)
data['Overall_Rating'] = pd.to_numeric(data['Overall_Rating'], errors='coerce')

# Check how many missing values were introduced in 'Overall_Rating'
data['Overall_Rating'].isnull().sum()
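
The effect of `errors='coerce'` is easiest to see on a tiny example. The sketch below uses a hypothetical series of rating strings (not taken from the dataset) in which one entry is not a number; that entry becomes NaN while the rest are parsed as floats.

```python
import pandas as pd

# Hypothetical raw ratings stored as strings, including one non-numeric entry
raw = pd.Series(["1", "9", "n", "10"])

# Non-numeric strings are converted to NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.tolist())
print(int(numeric.isnull().sum()))  # number of values that could not be parsed
```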


In [None]:
# Remove rows with missing 'Overall_Rating' values and store the result in df_cleaned
# (.copy() avoids modifying a view of `data` in later cells)
df_cleaned = data.dropna(subset=['Overall_Rating']).copy()

# Display the shape and info of the cleaned dataframe
print(df_cleaned.shape)
df_cleaned.info()

In [None]:
# Display the first few rows of df_cleaned
df_cleaned.head()

In [None]:
# Label encode 'Recommended' as binary values (1 for 'yes', 0 otherwise; str() guards against missing values)
df_cleaned['Recommended'] = df_cleaned['Recommended'].apply(lambda x: 1 if str(x).lower() == 'yes' else 0)

# Handle missing values in 'Type Of Traveller' and 'Seat Type' by filling with 'Unknown'
# (assigning the result back avoids chained-assignment warnings from inplace=True)
df_cleaned['Type Of Traveller'] = df_cleaned['Type Of Traveller'].fillna('Unknown')
df_cleaned['Seat Type'] = df_cleaned['Seat Type'].fillna('Unknown')

# Display the final cleaned dataframe information and first few rows
# (info() prints its summary directly and returns None, so it is not stored)
df_cleaned.info()
df_cleaned.head()


In [12]:

# Remove duplicates
df_cleaned = df_cleaned.drop_duplicates()




### Exploratory Data Analysis

#### Statistical Summary

In [None]:
# Only the last expression in a cell is displayed, so print the shape explicitly
print(df_cleaned.shape)
df_cleaned.isnull().sum()

#### Class Distribution

In [None]:
# Plotting the distribution of the "Overall Rating" dependent variable
sns.countplot(x='Overall_Rating', data=df_cleaned)
plt.title('Class Distribution')
plt.show()

In [None]:
# Get percentage distribution of "Overall Rating"
class_distribution_percentage = df_cleaned['Overall_Rating'].value_counts(normalize=True) * 100

print(class_distribution_percentage)

#### Distribution of Features

#### Text Preprocessing

In [None]:
# Text preprocessing
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Function to clean text for "Review" column of df_cleaned
def preprocess_text(text):
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()  # Convert to lowercase
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

# Apply cleaning to "Review"
df_cleaned['Cleaned_Review'] = df_cleaned['Review'].apply(preprocess_text)
df_cleaned["Cleaned_Review"].head() 
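
To make the effect of the cleaning function concrete, here is a self-contained sketch of the same three steps applied to one sentence. The names `demo_stop_words` and `preprocess_text_demo` are hypothetical and the stop-word set is deliberately tiny; the notebook itself uses NLTK's English stop-word list.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration
demo_stop_words = {"the", "was", "and", "a", "to"}

def preprocess_text_demo(text, stop_words=demo_stop_words):
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    text = text.lower()                   # convert to lowercase
    # keep only the words that are not in the stop-word set
    return " ".join(w for w in text.split() if w not in stop_words)

cleaned = preprocess_text_demo("The crew was friendly, and the flight left on time!")
print(cleaned)  # crew friendly flight left on time
```

Because punctuation is removed before stop words are filtered, contractions lose their apostrophe (for example "don't" becomes "dont") and may no longer match an entry in the stop-word list.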

In [None]:
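# Convert text to numerical features

# For an RNN (or any other model), text data needs to be converted into numerical form.
# Note: TF-IDF yields a bag-of-words matrix (one row per review, one column per term),
# not ordered token sequences, so word order is not preserved.

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer with a limit on vocabulary size
max_words = 5000
vectorizer = TfidfVectorizer(max_features=max_words)

# Fit and transform the text data into a sparse TF-IDF matrix
sequences = vectorizer.fit_transform(df_cleaned['Cleaned_Review'])

# Convert to a dense array (only if needed; this can use a lot of memory on large datasets)
sequences_array = sequences.toarray()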

# Check the shape of the output
print(sequences_array.shape)



#### Visualizing the Distribution of Text Length

In [None]:
df_cleaned['review_length'] = df_cleaned['Cleaned_Review'].apply(lambda x: len(x.split()))
df_cleaned['review_length'].hist(bins=50)


In [None]:
import seaborn as sns
sns.histplot(df_cleaned['review_length'], kde=True)

#### Most Common Words



In [None]:
from collections import Counter

all_words = ' '.join([text for text in df_cleaned['Cleaned_Review']])
word_counts = Counter(all_words.split())
common_words = word_counts.most_common(20)

print(common_words)


#### Word Cloud Analysis


In [None]:
from wordcloud import WordCloud

# Generating word cloud from cleaned reviews
text = ' '.join(df_cleaned['Cleaned_Review'].tolist())
wordcloud = WordCloud(width=800, height=400, max_words=100).generate(text)

# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

#### Bigram Analysis

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a count vectorizer for the most common bigrams (phrases of 2 words)
vectorizer = CountVectorizer(ngram_range=(2, 2), max_features=19)

# Fit and transform the cleaned review data
bigrams = vectorizer.fit_transform(df_cleaned['Cleaned_Review'])

# Get the bigram frequencies
bigram_frequencies = pd.DataFrame(bigrams.toarray(), columns=vectorizer.get_feature_names_out()).sum().sort_values(ascending=False)

print(bigram_frequencies)

#### Finding the Sentiment Polarity Distribution

In [None]:
%pip install textblob
from textblob import TextBlob

# Calculate polarity
# Note: the polarity here is calculated using the TextBlob library

df_cleaned['polarity'] = df_cleaned['Cleaned_Review'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Plot the polarity distribution
sns.histplot(df_cleaned['polarity'], bins=30)
plt.title('Sentiment Polarity Distribution')
plt.show()

In [None]:
# With the additional columns added, df_cleaned now looks like this
df_cleaned.tail()

#### Correlation Matrix


In [None]:
# Correlation matrix of the numerical columns
# (numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x)
sns.heatmap(df_cleaned.corr(numeric_only=True), annot=True)

#### Pairplot of Features
