*Data Cleaning*

The data extracted from the website is not yet clean enough to analyze. The reviews column needs to be cleaned of punctuation, spelling errors and other unwanted characters.

In [68]:
#imports

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

#regex
import re

In [69]:
#create a dataframe from csv file

cwd = os.getcwd()

df = pd.read_csv(cwd+"/BA_reviews.csv", index_col=0)

In [70]:
df.head()

Unnamed: 0,reviews,stars,date,country
0,✅ Trip Verified | I have never travelled wit...,5.0,14th February 2025,United Kingdom
1,"✅ Trip Verified | Terrible overall, medium ser...",9.0,7th February 2025,Switzerland
2,✅ Trip Verified | London Heathrow to Male In...,1.0,1st February 2025,United Kingdom
3,Not Verified | Very good flight following an ...,9.0,20th January 2025,United Kingdom
4,Not Verified | An hour's delay due to late ar...,9.0,19th January 2025,United Kingdom


In [71]:
#We will also create a column which mentions if the user is verified or not.

df['verified'] = df.reviews.str.contains("Trip Verified")

df['verified']

0        True
1        True
2        True
3       False
4       False
        ...  
3495    False
3496    False
3497    False
3498    False
3499    False
Name: verified, Length: 3500, dtype: bool
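
Note that `str.contains("Trip Verified")` matches the phrase anywhere in the review text. A stricter variant is sketched below on made-up reviews, assuming the marker always leads the review and that missing reviews should count as unverified; the pattern itself is an assumption, not something taken from the scraped site.

```python
import pandas as pd

# Toy reviews; the text is invented for illustration only.
reviews = pd.Series([
    "✅ Trip Verified | Smooth flight.",
    "Not Verified | Late departure.",
    "Crew mentioned my Trip Verified status.",  # phrase appears mid-text
    None,                                        # missing review
])

# Anchor the match at the start of the string and map missing values to False.
verified = reviews.str.contains(r"^\s*✅?\s*Trip Verified\s*\|", regex=True, na=False)

print(verified.tolist())
```

With `na=False` the resulting column is purely boolean even when some reviews are missing.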

In [73]:
df.columns

Index(['reviews', 'stars', 'date', 'country', 'verified'], dtype='object')

*Cleaning Reviews*

We will extract the reviews column into a separate Series and clean it for semantic analysis.

In [30]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [None]:
import nltk
nltk.download('wordnet')        # WordNet corpus, used for lemmatization
nltk.download('omw-1.4')        # Open Multilingual Wordnet data
nltk.download('stopwords')      # Stopwords dataset

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\shaik\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\shaik\AppData\Roaming\nltk_data...
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shaik\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [74]:
import pandas as pd
import re

#for lemmatization of words we will use nltk library
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
lemma = WordNetLemmatizer()


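#str.strip treats its argument as a set of characters, not a prefix (it turns "Terrible" into "ble"),
#so remove the leading verification marker with an anchored regex instead
reviews_data = df.reviews.str.replace(r"^\s*(?:✅\s*Trip Verified|Not Verified)\s*\|\s*", "", regex=True)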

#create an empty list to collect cleaned data corpus
corpus =[]

#build the stopword set once instead of on every word
stop_words = set(stopwords.words("english"))

#loop through each review, replace non-letters with spaces, lowercase it, drop stopwords, lemmatize, rejoin and add it to corpus
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]',' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    rev = " ".join(rev)
    corpus.append(rev)
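
When removing the leading marker, note that `str.strip` with a string argument trims any of the listed characters from both ends rather than a literal prefix. That behaviour would explain the `ble overall ...` entry in the recorded output for a review starting with "Terrible overall". A minimal sketch on made-up review text, comparing it with an anchored regex (the pattern is an assumption about how the markers are formatted):

```python
import pandas as pd

# Toy reviews modelled on the format shown above; the wording is invented.
reviews = pd.Series([
    "✅ Trip Verified | Terrible overall, medium service.",
    "Not Verified | Very good flight.",
])

# str.strip: every character in the argument is trimmed from both ends.
stripped = reviews.str.strip("✅ Trip Verified |")

# Hypothetical pattern: remove either verification marker, only at the start.
prefix = r"^\s*(?:✅\s*Trip Verified|Not Verified)\s*\|\s*"
cleaned = reviews.str.replace(prefix, "", regex=True)

print(stripped.tolist())
print(cleaned.tolist())
```

The first list loses the leading "Terri" and keeps the "Not Verified |" marker, while the second keeps the review text intact in both cases.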

In [75]:
# add the corpus to the original dataframe

df['corpus'] = corpus

In [76]:
df.head()

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | I have never travelled wit...,5.0,14th February 2025,United Kingdom,True,never travelled british airway first time chos...
1,"✅ Trip Verified | Terrible overall, medium ser...",9.0,7th February 2025,Switzerland,True,ble overall medium service flight delayed help...
2,✅ Trip Verified | London Heathrow to Male In...,1.0,1st February 2025,United Kingdom,True,london heathrow male new business class ba con...
3,Not Verified | Very good flight following an ...,9.0,20th January 2025,United Kingdom,False,verified good flight following equally good fl...
4,Not Verified | An hour's delay due to late ar...,9.0,19th January 2025,United Kingdom,False,verified hour delay due late arrival incoming ...


In [77]:
df.columns

Index(['reviews', 'stars', 'date', 'country', 'verified', 'corpus'], dtype='object')


*Cleaning/Formatting date*

In [79]:
df.dtypes

reviews      object
stars       float64
date         object
country      object
verified       bool
corpus       object
dtype: object

In [83]:
# convert the date to datetime format

# Remove ordinal suffixes (st, nd, rd, th)
df['date'] = df['date'].apply(lambda x: re.sub(r'(\d+)(st|nd|rd|th)', r'\1', x))

df['date'] = pd.to_datetime(df['date'], format="%d %B %Y")


df.date.head()

0   2025-02-14
1   2025-02-07
2   2025-02-01
3   2025-01-20
4   2025-01-19
Name: date, dtype: datetime64[ns]
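
The same conversion can be sketched with the vectorized `Series.str.replace` instead of `apply`. The dates below are made up for illustration; the regex only removes a suffix that directly follows digits, so the "st" in "August" is left alone. Month names are parsed with `%B`, which assumes an English locale.

```python
import pandas as pd

# Toy date strings in the same "14th February 2025" style.
dates = pd.Series([
    "14th February 2025",
    "1st August 2024",
    "22nd March 2023",
    "3rd May 2022",
])

# Drop the ordinal suffix only when it immediately follows the day number.
no_suffix = dates.str.replace(r"(\d+)(?:st|nd|rd|th)\b", r"\1", regex=True)

# Parse with an explicit format so unexpected strings raise an error.
parsed = pd.to_datetime(no_suffix, format="%d %B %Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```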

In [103]:
df.head()

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | I have never travelled wit...,5.0,2025-02-14,United Kingdom,True,never travelled british airway first time chos...
1,"✅ Trip Verified | Terrible overall, medium ser...",9.0,2025-02-07,Switzerland,True,ble overall medium service flight delayed help...
2,✅ Trip Verified | London Heathrow to Male In...,1.0,2025-02-01,United Kingdom,True,london heathrow male new business class ba con...
3,Not Verified | Very good flight following an ...,9.0,2025-01-20,United Kingdom,False,verified good flight following equally good fl...
4,Not Verified | An hour's delay due to late ar...,9.0,2025-01-19,United Kingdom,False,verified hour delay due late arrival incoming ...


*Cleaning ratings with stars*

In [107]:
# drop the rows where the rating is the string "None"
# (stars is float64 here, so missing ratings are NaN and this comparison matches no rows)
df.drop(df[df.stars == "None"].index, axis=0, inplace=True)

In [108]:
#check for unique values
df.stars.unique()

array([ 5.,  9.,  1.,  7.,  2.,  8.,  4., 10.,  3.,  6., nan])

In [109]:
df.stars.value_counts()

stars
1.0     892
2.0     407
3.0     402
8.0     337
10.0    277
7.0     274
9.0     264
5.0     244
4.0     233
6.0     169
Name: count, dtype: int64
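
The `unique()` output above still lists `nan`, and `value_counts()` silently skips it. A minimal sketch of how missing ratings could be dropped when the column is numeric, using a made-up frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing rating.
ratings = pd.DataFrame({
    "stars": [5.0, np.nan, 9.0],
    "country": ["United Kingdom", "Spain", "France"],
})

# Comparing a float column with the string "None" matches no rows.
matches = (ratings.stars == "None").sum()

# dropna removes the rows whose rating is NaN.
cleaned = ratings.dropna(subset=["stars"])

print(matches, len(cleaned))
```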

*Check for Null Values*

In [110]:
df.isnull().value_counts()

reviews  stars  date   country  verified  corpus
False    False  False  False    False     False     3498
                       True     False     False        1
         True   False  False    False     False        1
Name: count, dtype: int64

In [114]:
df.stars.isnull().value_counts()

stars
False    3499
True        1
Name: count, dtype: int64

In [113]:
df.country.isnull().value_counts()

country
False    3499
True        1
Name: count, dtype: int64
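
A more compact way to read the same information is to count missing values per column and then look at the affected rows. This is a sketch on a made-up frame, not the review data:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing rating and one missing country.
frame = pd.DataFrame({
    "stars": [1.0, np.nan, 7.0],
    "country": ["United Kingdom", "Australia", None],
})

# Number of missing values in each column.
missing_per_column = frame.isnull().sum()

# Rows that have at least one missing value.
rows_with_gaps = frame[frame.isnull().any(axis=1)]

print(missing_per_column.to_dict())
print(rows_with_gaps.index.tolist())
```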

In [None]:
#We have one missing value for country (and one for stars). Here we remove the review (row) with the missing country from the dataframe.
#drop the rows where the country value is null
df.dropna(subset=["country"], inplace=True)

In [117]:
df.shape

(3499, 6)

In [118]:
#resetting the index (reset_index returns a new dataframe, so assign it back)
df = df.reset_index(drop=True)
df

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | I have never travelled wit...,5.0,2025-02-14,United Kingdom,True,never travelled british airway first time chos...
1,"✅ Trip Verified | Terrible overall, medium ser...",9.0,2025-02-07,Switzerland,True,ble overall medium service flight delayed help...
2,✅ Trip Verified | London Heathrow to Male In...,1.0,2025-02-01,United Kingdom,True,london heathrow male new business class ba con...
3,Not Verified | Very good flight following an ...,9.0,2025-01-20,United Kingdom,False,verified good flight following equally good fl...
4,Not Verified | An hour's delay due to late ar...,9.0,2025-01-19,United Kingdom,False,verified hour delay due late arrival incoming ...
...,...,...,...,...,...,...
3494,I travel to and from Singapore on BA in Club w...,9.0,2014-11-20,United Kingdom,False,travel singapore ba club world month first tim...
3495,First time with BA (a code share flight for JA...,10.0,2014-11-20,Australia,False,first time ba code share flight jal travelled ...
3496,London Heathrow to Zagreb return in economy. U...,10.0,2014-11-20,United Kingdom,False,london heathrow zagreb return economy used ba ...
3497,BA16 Singapore to London. B777 World Traveller...,7.0,2014-11-20,Singapore,False,ba singapore london b world traveller cabin on...


In [120]:
# export the cleaned data

df.to_csv(cwd + "/cleaned-BA-reviews.csv")