# Data Cleaning

We will now check the scraped data and clean it where necessary.

In [1]:
# importing libraries

import pandas as pd
import numpy as np

In [2]:
# reading the csv file into a dataframe

df = pd.read_csv("data/BA_reviews_raw.csv", index_col = 0)

In [3]:
# sanity check

df.head()

Unnamed: 0,reviews,rating,review date,country
0,Not Verified | If I could give a minus rating...,1,15th March 2023,United Kingdom
1,✅ Trip Verified | Plane was over an hour late ...,2,15th March 2023,United Kingdom
2,Not Verified | We were flying World Traveller...,2,14th March 2023,United Kingdom
3,Not Verified | This was literally one of the ...,1,13th March 2023,Ireland
4,✅ Trip Verified | The usual shambolic unfoldi...,1,12th March 2023,United Kingdom


In [4]:
# shape of the data

df.shape

(3497, 4)

### Cleaning the *reviews* column

Each review starts with either *✅ Trip Verified* or *Not Verified*. We add a boolean column that records whether the reviewer is verified.

In [5]:
df['verified user'] = df['reviews'].str.contains('Trip Verified')
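As a minimal sketch of how this flag behaves, the snippet below applies the same `str.contains` check to a small made-up sample. The `sample` frame and its rows are hypothetical, and `na=False` is an optional extra that treats missing review text as not verified; the real data has no missing reviews, so it is not needed here.

```python
import pandas as pd

# Hypothetical sample mimicking the prefixes seen in the raw reviews
sample = pd.DataFrame({"reviews": [
    "✅ Trip Verified | Plane was late",
    "Not Verified | Poor service",
    None,  # a missing review, included only to show the effect of na=False
]})

# na=False makes a missing review count as not verified instead of NaN
sample["verified user"] = sample["reviews"].str.contains("Trip Verified", na=False)

# How many reviews fall into each group
verified_counts = sample["verified user"].value_counts()
```

Running `value_counts()` on the real column is a quick way to see how the reviews split between the two groups.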

In [6]:
# sanity check

df.head()

Unnamed: 0,reviews,rating,review date,country,verified user
0,Not Verified | If I could give a minus rating...,1,15th March 2023,United Kingdom,False
1,✅ Trip Verified | Plane was over an hour late ...,2,15th March 2023,United Kingdom,True
2,Not Verified | We were flying World Traveller...,2,14th March 2023,United Kingdom,False
3,Not Verified | This was literally one of the ...,1,13th March 2023,Ireland,False
4,✅ Trip Verified | The usual shambolic unfoldi...,1,12th March 2023,United Kingdom,True


Let's now clean the review text by removing the *✅ Trip Verified* and *Not Verified* prefixes, as this information is now contained in a separate column.

In [7]:
for i in range(0, len(df)):
    try:
        df.loc[i,'reviews'] = df.loc[i,'reviews'].split("|")[1].strip()
    except IndexError:
        continue
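The loop above could also be written without iterating row by row. The following is a sketch of one possible vectorized alternative, shown on a hypothetical `sample` frame. It differs from the loop in one respect: it splits only on the first `|`, so any later `|` inside the review body is kept, whereas `split("|")[1]` would discard the text after a second `|`.

```python
import pandas as pd

# Hypothetical sample: one review with an extra "|", one without any prefix
sample = pd.DataFrame({"reviews": [
    "✅ Trip Verified | Plane was late | crew was kind",
    "Not Verified | Poor service",
    "No prefix at all",
]})

# Split only on the first "|" so later pipes in the review body are preserved
parts = sample["reviews"].str.split("|", n=1, expand=True)

# Rows without a "|" have no second part; they keep their original text
sample["reviews"] = parts[1].str.strip().fillna(sample["reviews"])
```

This assumes at least one row contains a `|`; otherwise `parts` has a single column and `parts[1]` does not exist.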

In [8]:
# sanity check

df.head()

Unnamed: 0,reviews,rating,review date,country,verified user
0,"If I could give a minus rating, I would. Suppo...",1,15th March 2023,United Kingdom,False
1,"Plane was over an hour late leaving, no proble...",2,15th March 2023,United Kingdom,True
2,We were flying World Traveller Plus their Prem...,2,14th March 2023,United Kingdom,False
3,This was literally one of the worst experience...,1,13th March 2023,Ireland,False
4,The usual shambolic unfolding that BA has now ...,1,12th March 2023,United Kingdom,True


### Checking for null values and datatypes

In [9]:
# checking the null values, datatypes of the data

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3497 entries, 0 to 3496
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   reviews        3497 non-null   object
 1   rating         3497 non-null   object
 2   review date    3497 non-null   object
 3   country        3495 non-null   object
 4   verified user  3497 non-null   bool  
dtypes: bool(1), object(4)
memory usage: 269.1+ KB


<ul>
    <li>We can see that the <i>country</i> column has 2 missing values, which can be dropped without much data loss.</li>
    <li>The <i>review date</i> column has object dtype and can be converted to datetime.</li>
    <li>The <i>rating</i> column can be converted to int.</li>
</ul>

In [10]:
# dropping the rows where the country information is missing

df.dropna(subset = ['country'], inplace = True)

In [11]:
# convert the date column to a datetime format

df['review date'] = pd.to_datetime(df['review date'])
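The dates carry ordinal suffixes such as *15th*, so `pd.to_datetime` has to infer the format here, and how it does so can vary between pandas versions. If an explicit format is preferred, one option is to strip the suffix first. The sketch below uses a hypothetical `dates` series and assumes every date follows the day, full month name, year layout seen in the output above.

```python
import pandas as pd

# Hypothetical dates in the same layout as the 'review date' column
dates = pd.Series(["15th March 2023", "1st April 2022", "2nd May 2021", "23rd June 2020"])

# Drop the ordinal suffix that directly follows the day number (st, nd, rd, th)
stripped = dates.str.replace(r"(?<=\d)(st|nd|rd|th)\b", "", regex=True)

# Parse with an explicit format: day, full month name, four-digit year
parsed = pd.to_datetime(stripped, format="%d %B %Y")
```

The lookbehind for a digit keeps month names such as *August* untouched, even though they contain the letters *st*.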

In [12]:
# sanity check

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3495 entries, 0 to 3496
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   reviews        3495 non-null   object        
 1   rating         3495 non-null   object        
 2   review date    3495 non-null   datetime64[ns]
 3   country        3495 non-null   object        
 4   verified user  3495 non-null   bool          
dtypes: bool(1), datetime64[ns](1), object(3)
memory usage: 139.9+ KB


The row count has dropped by 2, from 3497 to 3495.

Recall that while scraping the data we ran into errors when retrieving the *rating* column. At that time we recorded the rating as **None**. Let us check how many reviews have this error.

In [13]:
# checking for unique rating values

df['rating'].unique()

array(['1', '2', '8', '10', '4', '6', '7', '5', '9', '3', 'None'],
      dtype=object)

In [14]:
df['rating'].value_counts()

1       765
2       394
3       386
8       360
10      318
7       308
9       303
4       242
5       229
6       185
None      5
Name: rating, dtype: int64

So there are only 5 reviews where this error occurred. We can drop these reviews without causing much data loss.

In [15]:
# dropping the indexes where 'rating' is None

df.drop(df[df['rating'] == "None"].index, axis = 0, inplace = True)

In [16]:
# convert the rating column to integers

df['rating'] = df['rating'].astype(int)
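The drop-then-cast steps could also be combined using `pd.to_numeric`, which is useful when the set of invalid strings is not known in advance. This is a sketch of an alternative on a hypothetical `sample` frame, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical ratings, including the "None" placeholder written during scraping
sample = pd.DataFrame({"rating": ["1", "10", "None", "7"]})

# Entries that cannot be parsed as numbers, such as "None", become NaN
sample["rating"] = pd.to_numeric(sample["rating"], errors="coerce")

# Drop the rows with NaN ratings, then cast the remaining values to int
sample = sample.dropna(subset=["rating"]).astype({"rating": int})
```

With `errors="coerce"`, any other unexpected string would be dropped in the same way, so it is worth checking `value_counts()` first, as done above.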

In [17]:
# sanity check

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3490 entries, 0 to 3496
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   reviews        3490 non-null   object        
 1   rating         3490 non-null   int32         
 2   review date    3490 non-null   datetime64[ns]
 3   country        3490 non-null   object        
 4   verified user  3490 non-null   bool          
dtypes: bool(1), datetime64[ns](1), int32(1), object(2)
memory usage: 126.1+ KB


In [18]:
# checking the shape

df.shape

(3490, 5)

We should now reset the index and export the cleaned data into a new csv file.

In [19]:
# resetting the index

df = df.reset_index(drop = True)

In [20]:
# exporting the cleaned data

df.to_csv("data/BA_review_cleaned.csv")
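Note that `to_csv` writes the index as an unnamed first column by default, which is why the raw file had to be read with `index_col = 0`. If the index carries no information, it can be left out on export. The sketch below uses a hypothetical `cleaned` frame and an in-memory buffer in place of a file to show the round trip.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned dataframe
cleaned = pd.DataFrame({"reviews": ["Poor service"], "rating": [1]})

# Write without the index so no unnamed column appears when reading back
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Read the CSV back from the start of the buffer
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

Either convention works as long as the reading side matches the writing side.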