# Data Cleaning: Preparing Extracted Data for Analysis
After extracting the data from the website, it is essential to perform data cleaning before analyzing it. The reviews section requires cleaning to remove punctuation marks, correct spellings, and eliminate other unwanted characters. Data cleaning ensures that the data is in a standardized and consistent format, enabling accurate analysis.

In [23]:
#imports

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

#regex
import re

In [24]:
#create a dataframe from csv file

cwd = os.getcwd()

df = pd.read_csv(os.path.join(cwd, "..", "Data_collection", "BA_reviews.csv"), index_col=0)

In [25]:
df.head()

Unnamed: 0,reviews,name,stars,date_of_review,country,locations,date_of_travel,travel_type,seat_type,route
0,✅ Trip Verified |. The BA first lounge at Term...,E Michaels,5.0,22nd May 2023,United Kingdom,United Kingdom,May 2023,Business,Business Class,London Heathrow to Malaga
1,Not Verified | Paid a quick visit to Nice yest...,Steve Bennett,2.0,22nd May 2023,United Kingdom,United Kingdom,May 2023,Couple Leisure,Business Class,London to Nice
2,✅ Trip Verified | Words fail to describe this...,N Mayle,4.0,19th May 2023,United States,United States,September 2022,Solo Leisure,Business Class,London to San Francisco
3,✅ Trip Verified | Absolutely terrible experie...,E Heale,2.0,17th May 2023,United States,United States,April 2023,Solo Leisure,Economy Class,London to Dallas
4,✅ Trip Verified | BA overbook every flight to ...,H Mike,1.0,17th May 2023,United Kingdom,United Kingdom,May 2023,Business,Economy Class,London to Madrid


In [26]:
df['verified'] = df.reviews.str.contains("Trip Verified")

In [27]:
df['verified']

0        True
1       False
2        True
3        True
4        True
        ...  
3545    False
3546    False
3547    False
3548    False
3549    False
Name: verified, Length: 3550, dtype: bool

# Cleaning Reviews:

As part of the data preprocessing phase, we take the reviews column and build a cleaned text corpus for semantic analysis, which is stored as a new `corpus` column in the dataframe. Each review has its `✅ Trip Verified |` label removed, is reduced to letters only, lowercased, filtered for stop words, and lemmatized. This normalization gives the subsequent analysis consistent, high-quality text to work with.

In [28]:
#for lemmatization of words we will use nltk library
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
lemma = WordNetLemmatizer()

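# remove the leading "✅ Trip Verified |" label; str.strip() would treat its argument
# as a set of characters and also trim matching letters from the end of a review
reviews_data = df.reviews.str.replace(r"^\s*✅\s*Trip Verified\s*\|", "", regex=True)
# create an empty list to collect the cleaned corpus
corpus = []

# for each review: keep letters only, lowercase, drop stop words, lemmatize, rejoin
for review in reviews_data:
    review = re.sub('[^a-zA-Z]', ' ', review)
    review = review.lower()
    review = review.split()
    # STOPWORDS is already a set, so membership tests are fast
    review = [lemma.lemmatize(word) for word in review if word not in STOPWORDS]
    review = " ".join(review)
    corpus.append(review)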

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [29]:
# add the corpus to the original dataframe

df['corpus'] = corpus

In [30]:
df.head()

Unnamed: 0,reviews,name,stars,date_of_review,country,locations,date_of_travel,travel_type,seat_type,route,verified,corpus
0,✅ Trip Verified |. The BA first lounge at Term...,E Michaels,5.0,22nd May 2023,United Kingdom,United Kingdom,May 2023,Business,Business Class,London Heathrow to Malaga,True,ba first lounge terminal zoo pm dirty table us...
1,Not Verified | Paid a quick visit to Nice yest...,Steve Bennett,2.0,22nd May 2023,United Kingdom,United Kingdom,May 2023,Couple Leisure,Business Class,London to Nice,False,verified paid quick visit nice yesterday heath...
2,✅ Trip Verified | Words fail to describe this...,N Mayle,4.0,19th May 2023,United States,United States,September 2022,Solo Leisure,Business Class,London to San Francisco,True,word fail describe last awful flight baby acro...
3,✅ Trip Verified | Absolutely terrible experie...,E Heale,2.0,17th May 2023,United States,United States,April 2023,Solo Leisure,Economy Class,London to Dallas,True,absolutely terrible experience app would let c...
4,✅ Trip Verified | BA overbook every flight to ...,H Mike,1.0,17th May 2023,United Kingdom,United Kingdom,May 2023,Business,Economy Class,London to Madrid,True,ba overbook every flight maximise income regar...
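
The `verified` token at the start of row 1's corpus above comes from the `Not Verified |` label, which the cleaning step leaves in place. A minimal sketch of removing either label with one anchored regular expression follows. The pattern, the helper name `strip_verification_label`, and the sample strings (loosely modeled on the rows above) are assumptions for illustration, not part of the original notebook.

```python
import re

# Assumed pattern: optional check mark, then "Trip Verified" or "Not Verified",
# then a pipe, all anchored at the start of the review text.
LABEL_PATTERN = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")


def strip_verification_label(review: str) -> str:
    """Remove a leading verification label; leave all other text untouched."""
    return LABEL_PATTERN.sub("", review)


# Hypothetical sample reviews.
samples = [
    "✅ Trip Verified | Words fail to describe this flight.",
    "Not Verified | Paid a quick visit to Nice.",
    "LHR to HAM. Purser addresses all club passengers.",
]
cleaned = [strip_verification_label(s) for s in samples]
for text in cleaned:
    print(text)
```

On the real data this could be applied with `df.reviews.map(strip_verification_label)`. The verification status itself is already captured in the `verified` column.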


We will clean and format the date column in the dataset to ensure consistency and ease of analysis.

In [31]:
df.dtypes

reviews            object
name               object
stars             float64
date_of_review     object
country            object
locations          object
date_of_travel     object
travel_type        object
seat_type          object
route              object
verified             bool
corpus             object
dtype: object

In [32]:
# convert the date to datetime format

df.date_of_review = pd.to_datetime(df.date_of_review, format="mixed")
df.date_of_travel = pd.to_datetime(df.date_of_travel, format="mixed")

In [33]:
df.date_of_review.head(5)


0   2023-05-22
1   2023-05-22
2   2023-05-19
3   2023-05-17
4   2023-05-17
Name: date_of_review, dtype: datetime64[ns]

In [34]:
df.date_of_travel.head(5)

0   2023-05-01
1   2023-05-01
2   2022-09-01
3   2023-04-01
4   2023-05-01
Name: date_of_travel, dtype: datetime64[ns]
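
`format="mixed"` lets pandas infer the format of each value separately. Where the layout is known, an explicit alternative is to remove the ordinal suffix and pass a fixed format. The sketch below assumes only the two layouts visible in the output above (`22nd May 2023` and `May 2023`) with full month names; the sample values are taken from those rows, and `errors="coerce"` turns anything that does not match into `NaT`.

```python
import pandas as pd

review_dates = pd.Series(["22nd May 2023", "19th May 2023", "17th May 2023"])
travel_dates = pd.Series(["May 2023", "September 2022", None])

# Drop the ordinal suffix (st, nd, rd, th) that follows the day number.
without_suffix = review_dates.str.replace(r"(?<=\d)(st|nd|rd|th)\b", "", regex=True)

parsed_review = pd.to_datetime(without_suffix, format="%d %B %Y", errors="coerce")
# A month-year string has no day, so the result falls on the first of the month.
parsed_travel = pd.to_datetime(travel_dates, format="%B %Y", errors="coerce")

print(parsed_review)
print(parsed_travel)
```

A fixed format fails loudly (or yields `NaT` with `errors="coerce"`) on unexpected layouts, which makes silent misparsing less likely than with per-value inference.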

# Cleaning the Star Ratings

In [35]:
#check for unique values
df.stars.unique()
# convert the ratings to strings and strip any surrounding \n and \t
df.stars = df.stars.astype(str)
df.stars = df.stars.str.strip("\n\t")

In [36]:
df.stars.value_counts()

stars
1.0     757
5.0     528
2.0     374
3.0     369
8.0     321
10.0    279
7.0     269
9.0     264
4.0     223
6.0     162
nan       4
Name: count, dtype: int64

There are 4 rows with the value "nan" in the ratings. We will drop these 4 rows.

In [37]:
# drop the rows where the rating is the string "nan"
df.drop(df[df.stars == "nan"].index, axis=0, inplace=True)

In [38]:
#check the unique values again
df.stars.unique()

array(['5.0', '2.0', '4.0', '1.0', '3.0', '10.0', '9.0', '7.0', '8.0',
       '6.0'], dtype=object)
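
An alternative to comparing against the string `"nan"` is to keep the ratings numeric and let pandas mark the missing values. This is a minimal sketch, assuming a small hypothetical `ratings` frame whose raw values are strings padded with tabs and newlines. `pd.to_numeric(..., errors="coerce")` converts anything non-numeric to `NaN`, which `dropna` can then remove.

```python
import pandas as pd

# Hypothetical raw ratings with stray whitespace and two unusable entries.
ratings = pd.DataFrame({"stars": ["\n\t\t5", "2", "\n\t\t10", "None", "nan"]})

# Strip whitespace, then coerce to numbers; unparseable entries become NaN.
ratings["stars"] = pd.to_numeric(ratings["stars"].str.strip(), errors="coerce")

# Drop the rows without a rating and renumber the remaining rows.
ratings = ratings.dropna(subset=["stars"]).reset_index(drop=True)

print(ratings["stars"].tolist())
```

Keeping the column numeric also means later steps such as averaging or plotting do not need another type conversion.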

# Cleaning Null Values


In [39]:
df.isnull().value_counts()

reviews  name   stars  date_of_review  country  locations  date_of_travel  travel_type  seat_type  route   verified  corpus
False    False  False  False           False    False      False           False        False      False   False     False     2764
                                                           True            True         False      True    False     False      762
                                                                           False        False      False   False     False        8
                                                           False           False        False      True    False     False        5
                                                                           True         False      False   False     False        2
                                                           True            False        False      True    False     False        2
                                       True     True       True            True     

In [40]:
df.country.isnull().value_counts()

country
False    3544
True        2
Name: count, dtype: int64
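
A per-column count is often easier to read than a table of combined null patterns. The following is a minimal sketch on a small hypothetical frame (the `demo` name and its values are invented for illustration): `isnull().sum()` counts missing values per column, and `dropna(subset=...)` drops only the rows that lack a specific column.

```python
import pandas as pd

# Hypothetical frame with the same kinds of gaps as the review data.
demo = pd.DataFrame({
    "country": ["United Kingdom", None, "United States"],
    "route": ["London to Nice", None, None],
    "stars": [2.0, 7.0, 1.0],
})

# Number of missing values in each column.
null_counts = demo.isnull().sum()
print(null_counts)

# Keep only rows that have a country; gaps in other columns are left in place.
with_country = demo.dropna(subset=["country"])
print(with_country.shape)
```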

In [41]:
# drop the rows where the country value is null
df.drop(df[df.country.isnull()].index, axis=0, inplace=True)

In [42]:
df.shape

(3544, 12)

In [43]:
# reset the index; reset_index returns a new dataframe, so assign it back
df = df.reset_index(drop=True)
df

Unnamed: 0,reviews,name,stars,date_of_review,country,locations,date_of_travel,travel_type,seat_type,route,verified,corpus
0,✅ Trip Verified |. The BA first lounge at Term...,E Michaels,5.0,2023-05-22,United Kingdom,United Kingdom,2023-05-01,Business,Business Class,London Heathrow to Malaga,True,ba first lounge terminal zoo pm dirty table us...
1,Not Verified | Paid a quick visit to Nice yest...,Steve Bennett,2.0,2023-05-22,United Kingdom,United Kingdom,2023-05-01,Couple Leisure,Business Class,London to Nice,False,verified paid quick visit nice yesterday heath...
2,✅ Trip Verified | Words fail to describe this...,N Mayle,4.0,2023-05-19,United States,United States,2022-09-01,Solo Leisure,Business Class,London to San Francisco,True,word fail describe last awful flight baby acro...
3,✅ Trip Verified | Absolutely terrible experie...,E Heale,2.0,2023-05-17,United States,United States,2023-04-01,Solo Leisure,Economy Class,London to Dallas,True,absolutely terrible experience app would let c...
4,✅ Trip Verified | BA overbook every flight to ...,H Mike,1.0,2023-05-17,United Kingdom,United Kingdom,2023-05-01,Business,Economy Class,London to Madrid,True,ba overbook every flight maximise income regar...
...,...,...,...,...,...,...,...,...,...,...,...,...
3539,This was a bmi Regional operated flight on a R...,J Robertson,9.0,2012-08-29,United Kingdom,United Kingdom,NaT,,Economy Class,,False,bmi regional operated flight rj manchester hea...
3540,LHR to HAM. Purser addresses all club passenge...,Nick Berry,7.0,2012-08-28,United Kingdom,United Kingdom,NaT,,Business Class,,False,lhr ham purser address club passenger name boa...
3541,My son who had worked for British Airways urge...,Avril Barclay,1.0,2011-10-12,United Kingdom,United Kingdom,NaT,,Economy Class,,False,son worked british airway urged fly british ai...
3542,London City-New York JFK via Shannon on A318 b...,C Volz,2.0,2011-10-11,United States,United States,NaT,,Premium Economy,,False,london city new york jfk via shannon really ni...


In [44]:
# export the cleaned data

df.to_csv(os.path.join(cwd, "cleaned_BA_reviews.csv"))
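
With the default `index=True`, `to_csv` writes the row index as an extra unnamed column, which a later `read_csv` has to account for (for example with `index_col=0`). CSV also does not store dtypes, so datetime columns come back as strings unless they are parsed again. Here is a minimal sketch of a round trip that handles both, using a temporary directory and a hypothetical two-row frame in place of the cleaned reviews:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned reviews.
cleaned = pd.DataFrame({
    "stars": [5.0, 2.0],
    "date_of_review": pd.to_datetime(["2023-05-22", "2023-05-19"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_demo.csv")
    # index=False leaves the row index out of the file.
    cleaned.to_csv(path, index=False)
    # parse_dates restores the datetime dtype on read.
    restored = pd.read_csv(path, parse_dates=["date_of_review"])

print(restored.dtypes)
```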