## Data Cleaning

The data extracted from the website is still raw and not yet ready for analysis. The review text needs to be cleaned of punctuation, inconsistent spellings and other non-letter characters.

In [1]:
#imports

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

#regex
import re



In [2]:
#create a dataframe from csv file

cwd = os.getcwd()

df = pd.read_csv(cwd+"/BA_reviews.csv", index_col=0)

In [3]:
df.head()

Unnamed: 0,reviews,stars,date,country
0,Not Verified | BA is not treating its premium ...,4,6th July 2023,(United Kingdom)
1,✅ Trip Verified | 24 hours before our departu...,1,5th July 2023,(South Africa)
2,✅ Trip Verified | We arrived at Heathrow at 0...,1,5th July 2023,(United Kingdom)
3,✅ Trip Verified | Original flight was cancell...,3,4th July 2023,(Greece)
4,Not Verified | Airport check in was functiona...,3,3rd July 2023,(Italy)


We will also add a column that flags whether each review is marked as trip-verified.

In [4]:
df['verified'] = df.reviews.str.contains("Trip Verified")

In [5]:
df['verified']

0       False
1        True
2        True
3        True
4       False
        ...  
3495    False
3496    False
3497    False
3498    False
3499    False
Name: verified, Length: 3500, dtype: bool
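As a small illustration of how this flag behaves, here is a sketch on a toy Series. The three sample strings are made up to mimic the prefixes seen in `df.head()` and are not taken from the dataset. Passing `regex=False` is an optional choice that treats the pattern as a literal substring.

```python
import pandas as pd

# Toy reviews mimicking the prefixes in the scraped data (illustrative only).
sample = pd.Series([
    "✅ Trip Verified | Flight was on time.",
    "Not Verified | Seat was cramped.",
    "Flew to Boston last week.",  # older reviews carry no prefix at all
])

# regex=False makes str.contains do a plain substring search.
flags = sample.str.contains("Trip Verified", regex=False)
print(flags.tolist())
```

Reviews without any prefix end up as `False`, the same as those explicitly marked "Not Verified", which is worth keeping in mind when comparing verified and unverified reviews.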

### Cleaning Reviews

In [6]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PrafulcooL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PrafulcooL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

We will extract the reviews column into a separate Series and clean it for semantic analysis.

In [7]:
#for lemmatization of words we will use nltk library

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
lemma = WordNetLemmatizer()

#build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words("english"))

#note: str.strip treats its argument as a set of characters, not a prefix,
#so "Not Verified" is left in place and the word "verified" survives in the corpus
reviews_data = df.reviews.str.strip("✅ Trip Verified |")

#create an empty list to collect cleaned data corpus
corpus = []

#loop through each review, replace non-letters with spaces, lowercase it,
#drop stopwords, lemmatize the rest, join it and add it to corpus
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]', ' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    rev = " ".join(rev)
    corpus.append(rev)
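One thing to be aware of in the cell above: `str.strip` removes any of the given characters from both ends of the string rather than removing a prefix. The sketch below shows that effect and one possible alternative using an anchored regular expression. The pattern, the `strip_prefix` helper and the sample strings are assumptions made for illustration and are not part of the original notebook.

```python
import re

# str.strip treats its argument as a character set, so matching letters
# at the end of the text are removed as well.
print("Not Verified | tired".strip("✅ Trip Verified |"))  # Not Verified | t

# Hypothetical pattern: optional check mark, then "Trip Verified" or
# "Not Verified", then the "|" separator, only at the start of the string.
PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")

def strip_prefix(review: str) -> str:
    """Remove the verification prefix and leave the review body untouched."""
    return PREFIX.sub("", review)

samples = [
    "✅ Trip Verified | We arrived at Heathrow.",
    "Not Verified | BA is not treating its premium passengers well.",
    "Flew World Traveller Plus for the first time.",
]
cleaned = [strip_prefix(s) for s in samples]
for text in cleaned:
    print(text)
```

On a pandas Series the same pattern could be applied with `df.reviews.str.replace(PREFIX.pattern, "", regex=True)`. Doing so would change the resulting corpus, so the outputs shown in this notebook would no longer match exactly.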

In [8]:
# add the corpus to the original dataframe

df['corpus'] = corpus

In [9]:
df.head()

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,Not Verified | BA is not treating its premium ...,4,6th July 2023,(United Kingdom),False,verified ba treating premium economy passenger...
1,✅ Trip Verified | 24 hours before our departu...,1,5th July 2023,(South Africa),True,hour departure ba cape town heathrow thursday ...
2,✅ Trip Verified | We arrived at Heathrow at 0...,1,5th July 2023,(United Kingdom),True,arrived heathrow find flight ibiza cancelled b...
3,✅ Trip Verified | Original flight was cancell...,3,4th July 2023,(Greece),True,original flight cancelled explanation represen...
4,Not Verified | Airport check in was functiona...,3,3rd July 2023,(Italy),False,verified airport check functionary little warm...


### Cleaning/Format date

In [10]:
df.dtypes

reviews     object
stars       object
date        object
country     object
verified      bool
corpus      object
dtype: object

In [11]:
# convert the date to datetime format

df.date = pd.to_datetime(df.date)

In [12]:
df.date.head()

0   2023-07-06
1   2023-07-05
2   2023-07-05
3   2023-07-04
4   2023-07-03
Name: date, dtype: datetime64[ns]
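`pd.to_datetime` inferred the format of strings such as "6th July 2023" on its own here. If the inference ever fails or becomes slow, one option is to remove the ordinal suffix and pass an explicit format. The following is a minimal sketch on made-up sample dates, assuming an English locale for month names.

```python
import pandas as pd

# Sample strings in the same style as the date column (illustrative only).
raw = pd.Series([
    "6th July 2023", "5th July 2023", "3rd July 2023",
    "1st June 2014", "22nd May 2015",
])

# Drop the ordinal suffix (st, nd, rd, th) that directly follows the day number.
no_suffix = raw.str.replace(r"(?<=\d)(st|nd|rd|th)\b", "", regex=True)

# Parse with an explicit day / full month name / year format.
parsed = pd.to_datetime(no_suffix, format="%d %B %Y")
print(parsed)
```

The lookbehind `(?<=\d)` keeps the replacement from touching letters inside month names such as "August".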

### Cleaning ratings with stars

In [13]:
#check for unique values
df.stars.unique()

array(['4', '1', '3', '10', '2', '7', '9', '5', '8', '6', 'None'],
      dtype=object)

In [14]:
df.stars.value_counts()

1       801
2       397
3       396
8       348
10      311
7       302
9       299
4       239
5       224
6       178
None      5
Name: stars, dtype: int64

Five rows have the string "None" as their rating. We will drop these five rows.

In [15]:
# drop the rows where the value of ratings is None
df.drop(df[df.stars == "None"].index, axis=0, inplace=True)

In [16]:
#check the unique values again
df.stars.unique()

array(['4', '1', '3', '10', '2', '7', '9', '5', '8', '6'], dtype=object)
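Note that the remaining ratings are still strings (`dtype=object`). If numeric ratings are needed later, for example to compute an average, one possible approach is to convert with `pd.to_numeric` and drop whatever fails to parse. This is a sketch on a toy frame, not the notebook's own method.

```python
import pandas as pd

# Toy frame with the same kind of values as the stars column (illustrative only).
toy = pd.DataFrame({"stars": ["4", "1", "None", "10", "None"]})

# errors="coerce" turns anything non-numeric (here the string "None") into NaN.
toy = toy.assign(stars=pd.to_numeric(toy["stars"], errors="coerce"))

# Drop the rows that could not be parsed, then store the ratings as integers.
toy = toy.dropna(subset=["stars"]).astype({"stars": int})
print(toy["stars"].tolist())
```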

In [17]:
df.isnull()

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
3495,False,False,False,False,False,False
3496,False,False,False,False,False,False
3497,False,False,False,False,False,False
3498,False,False,False,False,False,False


### Check for null values

In [18]:
df.isnull().value_counts()

reviews  stars  date   country  verified  corpus
False    False  False  False    False     False     3495
dtype: int64

In [19]:
df.country.isnull().value_counts()

False    3495
Name: country, dtype: int64
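A more compact way to get the same information is to count nulls per column with `isnull().sum()`. Here is a minimal sketch on a toy frame with made-up values.

```python
import pandas as pd

# Toy frame with one missing value in each column (illustrative only).
toy = pd.DataFrame({
    "reviews": ["good flight", None, "late departure"],
    "country": ["(Italy)", "(Greece)", None],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# Single yes/no answer for the whole frame.
has_nulls = bool(toy.isnull().values.any())
print(has_nulls)
```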

In [20]:
df.shape

(3495, 6)

In [21]:
#resetting the index (reset_index returns a new dataframe, so assign it back)
df = df.reset_index(drop=True)
df

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,Not Verified | BA is not treating its premium ...,4,2023-07-06,(United Kingdom),False,verified ba treating premium economy passenger...
1,✅ Trip Verified | 24 hours before our departu...,1,2023-07-05,(South Africa),True,hour departure ba cape town heathrow thursday ...
2,✅ Trip Verified | We arrived at Heathrow at 0...,1,2023-07-05,(United Kingdom),True,arrived heathrow find flight ibiza cancelled b...
3,✅ Trip Verified | Original flight was cancell...,3,2023-07-04,(Greece),True,original flight cancelled explanation represen...
4,Not Verified | Airport check in was functiona...,3,2023-07-03,(Italy),False,verified airport check functionary little warm...
...,...,...,...,...,...,...
3490,BA 213 LHR to Boston. T5 was very busy but che...,9,2014-06-12,(United Kingdom),False,ba lhr boston busy check fast efficient flight...
3491,Flew World Traveller Plus for the first time. ...,7,2014-06-12,(Canada),False,flew world traveller plus first time trip lhr ...
3492,Glasgow to LHR on a completely full flight. Th...,10,2014-06-12,(United Kingdom),False,glasgow lhr completely full flight crew amazin...
3493,The outward trip Manchester - Heathrow - Milan...,5,2014-06-10,(United Kingdom),False,outward trip manchester heathrow milan fine ev...


In [28]:
df['country'].unique().shape

(70,)

*****

Our data is now cleaned and ready for visualization and analysis.

In [22]:
# export the cleaned data

df.to_csv(cwd + "/cleaned-BA-reviews.csv")
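Keep in mind that a CSV file stores dates as plain text, so the `datetime64` dtype is not preserved automatically when the file is read back. The sketch below uses an in-memory buffer and a toy frame (both assumptions made for illustration) to show one way of restoring it with `parse_dates`.

```python
import io
import pandas as pd

# Toy frame with a datetime column (illustrative only).
toy = pd.DataFrame({
    "stars": [4, 1],
    "date": pd.to_datetime(["2023-07-06", "2023-07-05"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates asks read_csv to convert the column back to datetime.
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```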