# Data Cleaning

### After data collection, the data needs to be cleaned to make analysis possible and easier.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Import the regular expression (regex) module
import re
# A regular expression is a sequence of characters that defines a search pattern.
# It can test whether a string matches the pattern, extract or replace the matching parts,
# and split a string wherever the pattern occurs.
# Python's built-in re module provides these operations through functions such as
# re.search, re.sub, and re.split, each of which takes a pattern and a string.
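
The three regex operations described in the comments (detecting, splitting, and substituting) can be sketched with the standard `re` module alone. The sample string is made up for illustration; it only mimics the prefix format of the raw reviews.

```python
import re

# Hypothetical sample string, shaped like the raw reviews in this dataset
sample = "Trip Verified | LHR check in was quick, 10/10!"

# re.search returns a match object if the pattern occurs anywhere, else None
has_verified = re.search(r"Trip Verified", sample) is not None

# re.split breaks the string on the pattern; maxsplit=1 leaves any later '|' intact
prefix, body = re.split(r"\s*\|\s*", sample, maxsplit=1)

# re.sub replaces every run of non-letter characters with a single space
letters_only = re.sub(r"[^A-Za-z]+", " ", body).strip()

print(has_verified)
print(prefix)
print(letters_only)
```

The same `[^A-Za-z]+` pattern is used later in this notebook to strip digits and punctuation from the reviews.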

In [2]:
# Create a dataframe from the csv file
cwd = os.getcwd()

df = pd.read_csv(cwd+ '/BA_reviews.csv', index_col=0)
df.head()

Unnamed: 0,reviews,stars,date,country
0,Not Verified | They changed our Flights from ...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,18th April 2023,United States
1,Not Verified | At Copenhagen the most chaotic...,2,18th April 2023,United States
2,✅ Trip Verified | Worst experience of my life...,5,17th April 2023,United States
3,✅ Trip Verified | Due to code sharing with Ca...,1,17th April 2023,Hong Kong
4,✅ Trip Verified | LHR check in was quick at t...,3,16th April 2023,United Kingdom


#### Create a column that shows whether a review is verified

In [3]:
# Series.str.contains() tests whether a pattern or regex occurs in each string of a Series or Index
# and returns a boolean Series or Index with the result for each element.

# Here we create a column that flags a review as verified when its text in the 'reviews' column
# contains the string "Trip Verified".

df['Verified'] = df.reviews.str.contains("Trip Verified")
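
To make the behaviour of `str.contains` concrete, here is a minimal sketch on a made-up three-row Series. The `regex=False` and `na=False` arguments are optional extras that the cell above does not use: the first treats the pattern as a literal substring, and the second maps missing values to `False` instead of `NaN`.

```python
import pandas as pd

# Hypothetical mini-sample shaped like the raw 'reviews' column
sample = pd.Series([
    "Not Verified | They changed our flights",
    "Trip Verified | Worst experience of my life",
    None,  # a missing review
])

# regex=False: match the literal substring; na=False: treat missing values as False
verified = sample.str.contains("Trip Verified", regex=False, na=False)

print(verified.tolist())
```

Note that "Not Verified" does not contain the substring "Trip Verified", so unverified reviews are correctly flagged as `False`.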

In [4]:
df.head()

Unnamed: 0,reviews,stars,date,country,Verified
0,Not Verified | They changed our Flights from ...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,18th April 2023,United States,False
1,Not Verified | At Copenhagen the most chaotic...,2,18th April 2023,United States,False
2,✅ Trip Verified | Worst experience of my life...,5,17th April 2023,United States,True
3,✅ Trip Verified | Due to code sharing with Ca...,1,17th April 2023,Hong Kong,True
4,✅ Trip Verified | LHR check in was quick at t...,3,16th April 2023,United Kingdom,True


In [5]:
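# Keep only the text after the '|' separator, discarding the verification prefix.
# Rows without a '|' have no second part and become NaN.
df['reviews'] = df.reviews.str.split('|', expand=True)[1]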
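
One caveat with taking column `[1]` of an expanded split: any review that has no `|` at all yields `NaN`, and any text after a second `|` is discarded. This may explain the missing values in `reviews` reported by `isna()` further down, although the raw file would need to be checked to confirm it. The sketch below, on made-up rows, compares that approach with a more lenient alternative that is an assumption of this example, not what the notebook does.

```python
import pandas as pd

# Hypothetical rows: two separators, one separator, and no separator
sample = pd.Series([
    "Trip Verified | Great crew | but late departure",
    "Not Verified | Lost my bag",
    "Older review with no prefix at all",
])

# Approach used in this notebook: rows without '|' have no column 1, so they become NaN
strict = sample.str.split("|", expand=True)[1]

# Alternative: split once and take the last piece, so unprefixed rows survive
# and any later '|' inside the review text is kept
lenient = sample.str.split("|", n=1).str[-1].str.strip()

print(strict.isna().tolist())
print(lenient.tolist())
```

Whether the lenient version is preferable depends on whether the unprefixed rows are genuine reviews worth keeping.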

In [6]:
df.head()

Unnamed: 0,reviews,stars,date,country,Verified
0,They changed our Flights from Brussels to Lo...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,18th April 2023,United States,False
1,At Copenhagen the most chaotic ticket counte...,2,18th April 2023,United States,False
2,Worst experience of my life trying to deal w...,5,17th April 2023,United States,True
3,Due to code sharing with Cathay Pacific I wa...,1,17th April 2023,Hong Kong,True
4,LHR check in was quick at the First Wing and...,3,16th April 2023,United Kingdom,True


In [7]:
df.isna().sum()

reviews     2984
stars          0
date           0
country        4
Verified       0
dtype: int64

In [8]:
df.dropna(axis=0, inplace=True)
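
`dropna(axis=0)` removes every row that has a missing value in any column. As a hedged sketch on made-up data, the snippet below counts the affected rows first and then shows the `subset` argument, which restricts the check to chosen columns. Using `subset` is an alternative, not what the cell above does.

```python
import pandas as pd
import numpy as np

# Hypothetical frame with one missing review and one missing country
sample = pd.DataFrame({
    "reviews": ["Good flight", np.nan, "Late again"],
    "country": ["United Kingdom", "Hong Kong", np.nan],
})

# Count the rows that contain at least one missing value
rows_with_na = int(sample.isna().any(axis=1).sum())

# Drop only rows whose review text is missing; keep rows that lack just a country
kept = sample.dropna(subset=["reviews"])

print(rows_with_na)
print(len(kept))
```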

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4016 entries, 0 to 6649
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   reviews   4016 non-null   object
 1   stars     4016 non-null   object
 2   date      4016 non-null   object
 3   country   4016 non-null   object
 4   Verified  4016 non-null   bool  
dtypes: bool(1), object(4)
memory usage: 160.8+ KB


In [10]:
df.head()

Unnamed: 0,reviews,stars,date,country,Verified
0,They changed our Flights from Brussels to Lo...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,18th April 2023,United States,False
1,At Copenhagen the most chaotic ticket counte...,2,18th April 2023,United States,False
2,Worst experience of my life trying to deal w...,5,17th April 2023,United States,True
3,Due to code sharing with Cathay Pacific I wa...,1,17th April 2023,Hong Kong,True
4,LHR check in was quick at the First Wing and...,3,16th April 2023,United Kingdom,True


### Cleaning the 'Reviews' Column

#### We will extract the 'reviews' column into a separate variable and prepare it for sentiment analysis.
#### Sentiment analysis is a natural language processing (NLP) technique that determines whether a piece of text expresses a positive, negative, or neutral opinion.
#### To do this, the computer has to interpret the structure of the language and the relationships between words in order to extract meaning.

#### Lemmatization is one of the most common text pre-processing techniques in NLP and ML. It reduces a word to its dictionary form, or lemma, so that different forms of the same word can be treated as one. For example, a lemmatization algorithm would reduce the word "better" to its lemma, "good".
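
The idea can be illustrated with a toy lookup table. This is only a sketch: the table below is hand-written and hypothetical, whereas a real lemmatizer such as NLTK's `WordNetLemmatizer` consults a full lexical database. Note also that `WordNetLemmatizer.lemmatize` treats words as nouns unless a `pos` argument is passed, so mapping "better" to "good" requires `pos="a"`.

```python
# Toy lookup table (hypothetical, hand-written for illustration only)
LEMMA_TABLE = {
    "better": "good",
    "flights": "flight",
    "seats": "seat",
    "was": "be",
}

def toy_lemmatize(word):
    """Return the lemma for a known word, or the lowercased word itself otherwise."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

lemmas = [toy_lemmatize(w) for w in ["Better", "flights", "crew"]]
print(lemmas)
```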

#### NLTK (Natural Language Toolkit) is a third-party Python package for NLP. It is not part of the Python standard library, but it ships prebuilt functions and utilities and is one of the most widely used libraries for natural language processing and computational linguistics.

#### A corpus is a collection of authentic text or audio organized into a dataset. 'Authentic' here means text written or audio spoken by a native speaker of the language or dialect.

In [11]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [12]:
import nltk
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True


In [14]:
# Check the first review to see which words and characters it contains
df['reviews'].loc[0]

'  They changed our Flights from Brussels to London Heathrow to LAX on 4/16/2023. We paid extra to choose our seats. Since they cancelled they never honored the seat that we bought, they seated us in totally different seats. I asked the check in employee, she was very rude and told us that we have to understand that was a different flight. From London to LAX was worse, nobody in the airport help us. Employees from BA told us that we have to return next day for our flight we can rent a hotel or go terminal 3 and sleep there. Finally one employee help us and gives a voucher for hotel. It was a nightmare this airline. We missed one day work and BA didn’t return the money that we paid for our previous chosen seats.'

In [15]:
# extract the column from the dataframe and store in a variable
reviews_data = df.reviews
reviews_data

0         They changed our Flights from Brussels to Lo...
1         At Copenhagen the most chaotic ticket counte...
2         Worst experience of my life trying to deal w...
3         Due to code sharing with Cathay Pacific I wa...
4         LHR check in was quick at the First Wing and...
                              ...                        
5676      London Heathrow to Houston on British Airway...
5677      We have flown with British Airways over 100 ...
5678      British Airways from Seattle to Johannesburg...
5680      Gatwick to Amsterdam in Business class was t...
6649    ) I did not see the attendants down our aisle ...
Name: reviews, Length: 4016, dtype: object

In [28]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
lemma = WordNetLemmatizer()

# build the stopword set once instead of on every word
stop_words = set(stopwords.words("english"))

# create a list to store the cleaned reviews data
corpus = []

for review in reviews_data:
    # replace every run of non-letter characters (digits, punctuation) with a space
    review = re.sub('[^A-Za-z]+', ' ', str(review))
    # lowercase
    review = review.lower()
    # split the text into words so each can be lemmatized
    review = review.split()
    # lemmatize each word and remove stopwords
    review = [lemma.lemmatize(word) for word in review if word not in stop_words]
    # join the words back into a single string
    review = ' '.join(review)
    # append the cleaned review to corpus
    corpus.append(review)
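
To see what the non-lemmatizing steps of this loop do to a single review, here is a dependency-free sketch. The stopword set is a tiny hand-picked assumption standing in for NLTK's English list, and lemmatization is left out, so the output only approximates what the loop produces.

```python
import re

# Tiny hand-picked stopword set (an assumption for this sketch);
# the notebook itself uses NLTK's full English stopword list
STOP_WORDS = {"the", "was", "and", "at", "in", "to", "we", "our"}

def basic_clean(text):
    """Strip non-letters, lowercase, split into words, and drop stopwords."""
    text = re.sub(r"[^A-Za-z]+", " ", str(text))
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = basic_clean("LHR check in was quick at the First Wing, 10/10!")
print(cleaned)
```

Wrapping the steps in a function like this also makes it possible to apply them with `df['reviews'].apply(...)` instead of an explicit loop.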

In [29]:
# add corpus as a column to the df
df['cleaned data'] = corpus
df.head()

Unnamed: 0,reviews,stars,date,country,Verified,cleaned data
0,They changed our Flights from Brussels to Lo...,5,2023-04-18,United States,False,changed flight brussels london heathrow lax pa...
1,At Copenhagen the most chaotic ticket counte...,2,2023-04-18,United States,False,copenhagen chaotic ticket counter assignment h...
2,Worst experience of my life trying to deal w...,5,2023-04-17,United States,True,worst experience life trying deal customer ser...
3,Due to code sharing with Cathay Pacific I wa...,1,2023-04-17,Hong Kong,True,due code sharing cathay pacific downgraded ba ...
4,LHR check in was quick at the First Wing and...,3,2023-04-16,United Kingdom,True,lhr check quick first wing quickly security fi...


#### Clean the stars column

In [19]:
df.head()

Unnamed: 0,reviews,stars,date,country,Verified
0,They changed our Flights from Brussels to Lo...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,18th April 2023,United States,False
1,At Copenhagen the most chaotic ticket counte...,2,18th April 2023,United States,False
2,Worst experience of my life trying to deal w...,5,17th April 2023,United States,True
3,Due to code sharing with Cathay Pacific I wa...,1,17th April 2023,Hong Kong,True
4,LHR check in was quick at the First Wing and...,3,16th April 2023,United Kingdom,True


In [20]:
df.stars.unique()

array(['\n\t\t\t\t\t\t\t\t\t\t\t\t\t5', '2', '5', '1', '3', '4', '9', '7',
       '10', '8', '6'], dtype=object)

In [24]:
# strip() with no argument removes leading and trailing whitespace, including '\n' and '\t'
df.stars = df.stars.str.strip()
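
`str.strip` treats its argument as a set of characters to remove from both ends, not as a literal prefix, so repeating `\t` in the argument has no extra effect. The sketch below shows this on a plain Python string; pandas' `Series.str.strip` follows the same semantics. The conversion to `int` at the end is an optional extra step that the notebook does not perform.

```python
raw = "\n\t\t\t\t\t\t\t\t\t\t\t\t\t5"

# The argument is a character set, so "\n\t" behaves the same as "\n\t\t\t\t"
same_result = raw.strip("\n\t") == raw.strip("\n\t\t\t\t")

# With no argument, strip() removes all leading and trailing whitespace
rating = int(raw.strip())

print(same_result)
print(rating)
```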

In [25]:
df.head()

Unnamed: 0,reviews,stars,date,country,Verified
0,They changed our Flights from Brussels to Lo...,5,18th April 2023,United States,False
1,At Copenhagen the most chaotic ticket counte...,2,18th April 2023,United States,False
2,Worst experience of my life trying to deal w...,5,17th April 2023,United States,True
3,Due to code sharing with Cathay Pacific I wa...,1,17th April 2023,Hong Kong,True
4,LHR check in was quick at the First Wing and...,3,16th April 2023,United Kingdom,True


##### Convert the date column to datetime

In [26]:
df.date = pd.to_datetime(df.date)
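
`pd.to_datetime` infers the format of strings such as `18th April 2023` on its own. For cases where an explicit, predictable parse is preferred, the following standard-library sketch strips the ordinal suffix and applies a fixed format. The helper name `parse_review_date` is hypothetical, and `%B` assumes English month names in the active locale.

```python
import re
from datetime import datetime

def parse_review_date(text):
    """Parse strings such as '18th April 2023' into a datetime."""
    # Drop the ordinal suffix (st, nd, rd, th) that directly follows the day number
    without_suffix = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    return datetime.strptime(without_suffix, "%d %B %Y")

parsed = parse_review_date("18th April 2023")
print(parsed.date())
```

The lookbehind `(?<=\d)` keeps the pattern from touching the "st" inside a month name such as "August".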

In [27]:
df.date.head()

0   2023-04-18
1   2023-04-18
2   2023-04-17
3   2023-04-17
4   2023-04-16
Name: date, dtype: datetime64[ns]

##### Export the cleaned dataframe to a csv file

In [30]:
df.to_csv(cwd + '/cleaned-BA-reviews.csv')
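
CSV files do not store dtypes, so the `date` column comes back as plain strings unless it is parsed again on load. A minimal round-trip sketch, using an in-memory buffer and made-up rows instead of the real file:

```python
import io
import pandas as pd

# Hypothetical two-row frame with the same kinds of columns as the cleaned data
sample = pd.DataFrame({
    "stars": ["5", "2"],
    "date": pd.to_datetime(["2023-04-18", "2023-04-17"]),
})

# Write to an in-memory buffer instead of a file for this sketch
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the datetime dtype
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=["date"])

print(reloaded.dtypes)
```

The numeric strings in `stars` are read back as integers, which is usually what the later analysis needs.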