Data Cleaning

The data extracted from the website is still raw and not ready to be analyzed. The review text needs to be cleaned of punctuation, spelling issues and other stray characters first.

In [87]:
#imports

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

#regex
import re

In [88]:
#create a dataframe from csv file

cwd = os.getcwd()

df = pd.read_csv(os.path.join(cwd, "BA_reviews.csv"), index_col=0)

In [89]:
df.head()

Unnamed: 0,reviews,stars,date,country
0,"✅ Trip Verified | Flight cancelled, not refu...",5.0,7th November 2024,Canada
1,"✅ Trip Verified | I had visa issues, and hen...",1.0,5th November 2024,India
2,✅ Trip Verified | Singapore to Heathrow with...,1.0,5th November 2024,United Kingdom
3,✅ Trip Verified | I recently travelled from ...,6.0,3rd November 2024,United Kingdom
4,Not Verified | I paid for seats 80 A and B on...,1.0,3rd November 2024,United States


In [90]:
df.dtypes

reviews     object
stars      float64
date        object
country     object
dtype: object

We will also create a boolean column that records whether each review is marked "Trip Verified".

In [91]:
df['verified']  = df.reviews.str.contains("Trip Verified")

In [92]:
df['verified'] 

0        True
1        True
2        True
3        True
4       False
        ...  
3495    False
3496    False
3497    False
3498    False
3499    False
Name: verified, Length: 3500, dtype: bool
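
The flag above is False both for reviews explicitly marked "Not Verified" and for older reviews that carry no marker at all. If that distinction matters, a three-way label is one option. The sketch below is only an illustration: the function name `verification_status` and the label strings are hypothetical, and it assumes the marker always sits at the start of the review text.

```python
import re

# Assumed marker layout: optional check mark, then the status text at the start.
VERIFIED = re.compile(r"^\s*(?:✅\s*)?Trip Verified\b")
NOT_VERIFIED = re.compile(r"^\s*Not Verified\b")

def verification_status(text: str) -> str:
    """Return 'verified', 'not verified' or 'unknown' (hypothetical labels)."""
    if VERIFIED.match(text):
        return "verified"
    if NOT_VERIFIED.match(text):
        return "not verified"
    return "unknown"

samples = [
    "✅ Trip Verified | Flight cancelled",
    "Not Verified | I paid for seats",
    "Travelled from Gatwick to Orlando",
]
statuses = [verification_status(s) for s in samples]
```

On the dataframe this could be applied with something like `df.reviews.map(verification_status)`.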

Cleaning Reviews

In [93]:
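#for lemmatization of words we will use the nltk library
#(needs the "stopwords" and "wordnet" corpora, e.g. via nltk.download)
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
lemma = WordNetLemmatizer()

#build the stop word set once instead of once per word
stop_words = set(stopwords.words("english"))

#note: str.strip removes any of these characters from both ends of the text,
#not the literal prefix, so "Not Verified" stays and other leading or trailing letters can be trimmed
reviews_data = df.reviews.str.strip("✅ Trip Verified |")

#create an empty list to collect the cleaned corpus
corpus = []

#loop through each review: keep letters only, lowercase, drop stop words, lemmatize, rejoin and add to corpus
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]',' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    rev = " ".join(rev)
    corpus.append(rev)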
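
One thing to watch in the step above: `str.strip` takes a set of characters, not a prefix. That is why a review starting with "Not Verified" keeps the word "verified" in its corpus, and why a review starting with "Travelled" loses its first two letters. A minimal sketch of removing the marker as a literal prefix instead, assuming the marker is always "Trip Verified |" or "Not Verified |" with an optional check mark (the helper name `strip_status_prefix` is hypothetical):

```python
import re

# Optional check mark, then either status text, then the "|" separator.
PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")

def strip_status_prefix(text: str) -> str:
    """Remove a leading verification marker; leave all other text untouched."""
    return PREFIX.sub("", text)

samples = [
    "✅ Trip Verified | Flight cancelled",
    "Not Verified | I paid for seats",
    "Travelled from Gatwick to Orlando",
]
cleaned = [strip_status_prefix(s) for s in samples]

# For comparison: character-set stripping eats the "Tr" of "Travelled".
stripped = [s.strip("✅ Trip Verified |") for s in samples]
```

The same pattern could be passed to `df.reviews.str.replace(..., regex=True)` to clean the whole column in one call.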

In [94]:
# add the corpus to the original dataframe

df['corpus'] = corpus

In [95]:
df.head()

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,"✅ Trip Verified | Flight cancelled, not refu...",5.0,7th November 2024,Canada,True,flight cancelled refunding money saying took f...
1,"✅ Trip Verified | I had visa issues, and hen...",1.0,5th November 2024,India,True,visa issue hence debarred flying ground staff ...
2,✅ Trip Verified | Singapore to Heathrow with...,1.0,5th November 2024,United Kingdom,True,singapore heathrow ba two choice route economy...
3,✅ Trip Verified | I recently travelled from ...,6.0,3rd November 2024,United Kingdom,True,recently travelled munich london british airwa...
4,Not Verified | I paid for seats 80 A and B on...,1.0,3rd November 2024,United States,False,verified paid seat b flight heathrow boston pa...


In [96]:
df.dtypes

reviews      object
stars       float64
date         object
country      object
verified       bool
corpus       object
dtype: object

In [97]:
# convert the date to datetime format (format='mixed' needs pandas 2.0 or later)

df['date'] = pd.to_datetime(df['date'], format='mixed')

In [98]:
df.date.head()

0   2024-11-07
1   2024-11-05
2   2024-11-05
3   2024-11-03
4   2024-11-03
Name: date, dtype: datetime64[ns]
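
With `format='mixed'` pandas infers the format of each value separately. If every date follows the "7th November 2024" pattern seen above, an explicit alternative is to drop the ordinal suffix and parse with a fixed format. The sketch below assumes that single pattern and an English locale for month names; `parse_review_date` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Ordinal suffix directly after a digit: 1st, 2nd, 3rd, 4th ...
ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b")

def parse_review_date(text: str) -> datetime:
    """Parse dates such as '7th November 2024' (assumed format)."""
    return datetime.strptime(ORDINAL.sub("", text.strip()), "%d %B %Y")

parsed = [parse_review_date(d) for d in ["7th November 2024", "1st August 2023"]]
```

A fixed format raises an error on any value that does not match, which makes unexpected date strings visible instead of silently guessed.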

In [99]:
#check for unique values
df.stars.unique()

array([ 5.,  1.,  6.,  3.,  7.,  9.,  2.,  8.,  4., 10., nan])

In [100]:
df['stars'].value_counts()

stars
1.0     880
2.0     406
3.0     400
8.0     339
10.0    283
7.0     273
9.0     265
5.0     246
4.0     235
6.0     170
Name: count, dtype: int64

In [109]:
df.dropna(subset=["stars"], inplace=True)
df['stars'] = df['stars'].astype(int)
df['stars'].value_counts()

stars
1     880
2     406
3     400
8     339
10    283
7     273
9     265
5     246
4     235
6     170
Name: count, dtype: int64
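
The rows with a missing rating are dropped before the cast because NaN cannot be stored in a plain integer column. A small sketch on toy data (the series below is made up for illustration) shows that step next to pandas' nullable `Int64` dtype, which would keep the row instead:

```python
import pandas as pd

ratings = pd.Series([5.0, 1.0, None], name="stars")  # toy data, not from the reviews

# Approach used in this notebook: drop missing ratings, then cast to int.
kept = ratings.dropna().astype(int)

# Alternative: nullable integer dtype keeps the row and marks it as <NA>.
nullable = ratings.astype("Int64")
```

Which one fits depends on whether reviews without a rating are still useful for the later analysis.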

In [110]:
#check the unique values again
df.stars.unique()

array([ 5,  1,  6,  3,  7,  9,  2,  8,  4, 10])

In [111]:
df.stars.value_counts()

stars
1     880
2     406
3     400
8     339
10    283
7     273
9     265
5     246
4     235
6     170
Name: count, dtype: int64

Check for Null Values

In [112]:
df.isnull().value_counts()

reviews  stars  date   country  verified  corpus
False    False  False  False    False     False     3496
                       True     False     False        1
Name: count, dtype: int64

In [105]:
df.country.isnull().value_counts()

country
False    3499
True        1
Name: count, dtype: int64
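
`isnull().value_counts()` counts combinations of missing flags across whole rows, which gets harder to read as columns are added. A per-column count is often easier to scan. A minimal sketch on toy data (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "stars": [5, 1, 6],
    "country": ["Canada", None, "India"],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# The rows that contain at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]
```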

One review has a missing country value. Since it is a single row, we can simply remove it from the dataframe.


In [113]:
#drop the rows where the country value is null
df.dropna(subset=["country"], inplace=True)

In [114]:
df.shape

(3496, 6)

In [115]:
#resetting the index (reset_index returns a new dataframe, so assign it back)
df = df.reset_index(drop=True)
df

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,"✅ Trip Verified | Flight cancelled, not refu...",5,2024-11-07,Canada,True,flight cancelled refunding money saying took f...
1,"✅ Trip Verified | I had visa issues, and hen...",1,2024-11-05,India,True,visa issue hence debarred flying ground staff ...
2,✅ Trip Verified | Singapore to Heathrow with...,1,2024-11-05,United Kingdom,True,singapore heathrow ba two choice route economy...
3,✅ Trip Verified | I recently travelled from ...,6,2024-11-03,United Kingdom,True,recently travelled munich london british airwa...
4,Not Verified | I paid for seats 80 A and B on...,1,2024-11-03,United States,False,verified paid seat b flight heathrow boston pa...
...,...,...,...,...,...,...
3491,BA268 LAX-LHR seat WT+ on the A380. Check in w...,1,2014-11-06,United Kingdom,False,ba lax lhr seat wt check quick security onewor...
3492,Travelled from Gatwick to Orlando on the 24th ...,10,2014-11-06,United Kingdom,False,avelled gatwick orlando th october ba disappoi...
3493,LGW to Cancun - flew with BA in CW. The galler...,7,2014-11-06,United Kingdom,False,lgw cancun flew ba cw gallery lounge gatwick o...
3494,31.10.14 - LHR to Berlin Tegel. Flight out goo...,9,2014-11-06,United Kingdom,False,lhr berlin tegel flight good modern plane clea...
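
`reset_index` returns a new dataframe rather than modifying the existing one, so the result has to be assigned for the renumbered index to reach the exported file. A small sketch on toy data (invented for illustration) shows the difference:

```python
import pandas as pd

# Toy frame with a gap in the index, as left behind by dropping a row.
toy = pd.DataFrame({"stars": [5, 1, 6]}, index=[0, 2, 3])

toy.reset_index(drop=True)           # result is discarded, toy is unchanged
index_without_assignment = list(toy.index)

toy = toy.reset_index(drop=True)     # result is kept
index_with_assignment = list(toy.index)
```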


The data is now cleaned and ready for visualization and analysis.

In [116]:
# export the cleaned data

df.to_csv(os.path.join(cwd, "cleaned-BA-reviews.csv"))
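
CSV stores everything as text, so a datetime column comes back as strings unless it is parsed again on load. A minimal round-trip sketch, using an in-memory buffer and toy data in place of the real file:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "stars": [5, 1],
    "date": pd.to_datetime(["2024-11-07", "2024-11-05"]),
    "verified": [True, False],
})

buffer = io.StringIO()
toy.to_csv(buffer)                   # index is written as the first column

buffer.seek(0)
plain = pd.read_csv(buffer, index_col=0)                           # date stays text

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=["date"])  # date is parsed
```

Reading the cleaned file with `index_col=0` and `parse_dates=["date"]` should therefore give back the same column types that were exported.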