# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com) you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways) you will see this data. Now, we can use `Python` and `BeautifulSoup` to request each paginated listing page and collect the review text it contains.

In [4]:
# # Install the requests module
# !pip install requests


In [1]:
# import requests
# from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
# pages = 30
# page_size = 100

# reviews = []

# for i in range(1, pages + 1):

#     print(f"Scraping page {i}")

#     # Create URL to collect links from paginated data
#     url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

#     # Collect HTML data from this page
#     response = requests.get(url)

#     # Parse content
#     content = response.content
#     parsed_content = BeautifulSoup(content, 'html.parser')
#     for para in parsed_content.find_all("div", {"class": "text_content"}):
#         reviews.append(para.get_text())
    
#     print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
Scrapi

In [3]:
# df = pd.DataFrame()
# df["reviews"] = reviews
# df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | The incoming and outgoing f...
1,✅ Trip Verified | Back in December my family ...
2,✅ Trip Verified | As usual the flight is dela...
3,✅ Trip Verified | A short BA euro trip and thi...
4,Not Verified | We are flying Business class f...


In [4]:
# df.to_csv("../data/BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loop above collects up to `pages × page_size` reviews by iterating through the paginated pages on the website; with `pages = 30` and `page_size = 100` as set above, that is up to 3000 reviews. If you want to collect more data, try increasing the number of pages!
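To make the effect of the page count concrete, here is a minimal sketch that builds the listing URLs the loop would request. `build_page_urls` is a hypothetical helper, not part of the notebook; it only mirrors the URL pattern used in the scraping cell and makes no network requests.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_urls(pages, page_size=100):
    """Return the listing URLs for pages 1..pages (hypothetical helper)."""
    # urlencode escapes ":" as %3A, matching the query string in the scraping cell
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return [f"{BASE_URL}/page/{i}/?{query}" for i in range(1, pages + 1)]


urls = build_page_urls(pages=30)

# Upper bound on the number of reviews; the last page may hold fewer
max_reviews = len(urls) * 100
print(urls[0])
print(max_reviews)
```

Raising `pages` lengthens this list one URL at a time, so the upper bound grows by `page_size` reviews per extra page.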

The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.
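One possible way to do this is with a regular expression anchored at the start of each review. The sketch below is an illustration only: `strip_status_prefix` is a hypothetical name, and the list of labels is taken from the samples visible in this notebook, so it may not cover every label the site uses.

```python
import re

# Optional check mark, one of the status labels seen in this notebook,
# then the "|" separator, all anchored at the start of the review.
PREFIX_RE = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Verified Review|Not Verified)\s*\|\s*"
)


def strip_status_prefix(text):
    """Remove a leading verification label; other text is left untouched."""
    return PREFIX_RE.sub("", text, count=1)


samples = [
    "✅ Trip Verified | The flight was delayed.",
    "Not Verified | We are flying Business class.",
    "LHR-SFO-LHR. Why do I keep thinking BA will improve?",
]
for sample in samples:
    print(strip_status_prefix(sample))
```

Because the pattern is anchored at the start and names the labels explicitly, a `|` that appears later in the body of a review is not affected.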

## Data Analysis

We have increased the dataset from 1000 reviews to 3000 reviews in order to improve the results of our analysis.

First, let us list out our analysis workflow:

1. Data Loading
2. Data Cleaning
3. Word Cloud
4. Topic Modeling
5. Sentiment Analysis

We have scraped and saved the data. Now let's load it and get to cleaning.

### 1. Data Loading

In [76]:
import pandas as pd
df = pd.read_csv("../data/BA_reviews.csv")
df = df.drop(columns=["Unnamed: 0"], errors="ignore")
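The `Unnamed: 0` column dropped above typically appears because `to_csv` writes the DataFrame index by default, and `read_csv` then reads it back as an unnamed column. A minimal sketch of that round trip, using an in-memory buffer and made-up demo rows (`df_demo` is not the notebook's data):

```python
import io

import pandas as pd

df_demo = pd.DataFrame({"reviews": ["first review", "second review"]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Saving with `index=False` would make the `drop` call unnecessary, although keeping it with `errors="ignore"` is harmless either way.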

In [77]:
df

Unnamed: 0,reviews
0,✅ Trip Verified | The incoming and outgoing flight was delayed because French Air Traffic Contr...
1,✅ Trip Verified | Back in December my family and I as we were getting onto the plane were refus...
2,"✅ Trip Verified | As usual the flight is delayed this week, it already 3 hours and I’m held on ..."
3,"✅ Trip Verified | A short BA euro trip and this is where BA excel. Clean aircraft, good crew, pr..."
4,Not Verified | We are flying Business class for most of our flight and then Premium economy for...
...,...
2995,LHR-SFO-LHR. Why do I keep thinking BA will improve? The 747 on this route is falling to bits. T...
2996,Muscat - Abu Dhabi - London. Delayed over two hours no vouchers offered. No welcome drinks or pa...
2997,The last trip was in Nov to Washington flew first in the new A380 which is a great aircraft very...
2998,We flew from Manchester to LHR to YCC and return over the Christmas period on the Dreamliner. Th...


In [78]:
df.iloc[2604]

reviews    A pleasant trip with British Airways as usual but suffered a 60 minute delay, for which the Capt...
Name: 2604, dtype: object

In [69]:
df.shape

(3000, 1)

### 2. Data Cleaning

In [79]:
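# Import libraries for text data cleaning
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

# The stopword list and lemmatizer rely on the NLTK "stopwords" and "wordnet" corpora,
# e.g. via nltk.download("stopwords") and nltk.download("wordnet")
stopwords = nltk.corpus.stopwords.words("english")
wn = nltk.WordNetLemmatizer()
punct = string.punctuation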

Remove the parts before the `|` sign

In [82]:
def split_review(text):
    # Keep everything after the first "|"; reviews without one are returned unchanged
    if '|' in text:
        text = text.split('|', 1)[1]
    return text

In [86]:
df["reviews"] = df["reviews"].apply(split_review)
df

Unnamed: 0,reviews
0,The incoming and outgoing flight was delayed because French Air Traffic Controllers were on st...
1,Back in December my family and I as we were getting onto the plane were refused. Even though w...
2,"As usual the flight is delayed this week, it already 3 hours and I’m held on a bus waiting to ..."
3,"A short BA euro trip and this is where BA excel. Clean aircraft, good crew, professional, on ti..."
4,We are flying Business class for most of our flight and then Premium economy for the balance. ...
...,...
2995,LHR-SFO-LHR. Why do I keep thinking BA will improve? The 747 on this route is falling to bits. T...
2996,Muscat - Abu Dhabi - London. Delayed over two hours no vouchers offered. No welcome drinks or pa...
2997,The last trip was in Nov to Washington flew first in the new A380 which is a great aircraft very...
2998,We flew from Manchester to LHR to YCC and return over the Christmas period on the Dreamliner. Th...


In [35]:
# This function lowercases a review and strips punctuation and other non-word characters

def clean_text(text):
    # Drop ASCII punctuation characters and lowercase everything else
    text = "".join([char.lower() for char in text if char not in punct])

    # Replace any remaining runs of non-word characters (e.g. "✅", curly quotes) with a single space
    text = " ".join(re.split(r'\W+', text))

    # Optionally remove leading status words (trip, verified, review, not) from the text
    #text = re.sub(r'^trip', '', text)

    return text

df["cleaned_reviews"] = df["reviews"].apply(clean_text)
df.sample(10)

Unnamed: 0,reviews,cleaned_reviews
2712,"10/6/15, LHR-GLA, A321, Seat 9A. Was able to choose this seat in advance which has plenty of le...",10615 lhrgla a321 seat 9a was able to choose this seat in advance which has plenty of legroom as...
1295,✅ Verified Review | Toulouse to London Heathrow. This airline will one day get its comeuppance ...,verified review toulouse to london heathrow this airline will one day get its comeuppance for p...
1191,✅ Verified Review | My wife and I booked two round trip business class air tickets from Johanne...,verified review my wife and i booked two round trip business class air tickets from johannesbur...
1334,✅ Verified Review | Dubai to London. Didn't have high expectations but surprised how bland and ...,verified review dubai to london didnt have high expectations but surprised how bland and uninte...
570,Not Verified | I had a stress free journey with my 8 yr old autistic son and 6 yr old girl from...,not verified i had a stress free journey with my 8 yr old autistic son and 6 yr old girl from ch...
121,✅ Trip Verified | Despite BA's promise to credit double tier points for a holiday booked on thei...,trip verified despite bas promise to credit double tier points for a holiday booked on their we...
1495,"✅ Verified Review | Buenos Aires to London. I was warned that BA had gone downhill, but I decid...",verified review buenos aires to london i was warned that ba had gone downhill but i decided to ...
476,"✅ Trip Verified | I had a connection flight from London to Berlin, traveling with hand luggage....",trip verified i had a connection flight from london to berlin traveling with hand luggage i got...
827,✅ Trip Verified | Gatwick to Las Vegas. Boarding by group number seemed to work well at Gatwick...,trip verified gatwick to las vegas boarding by group number seemed to work well at gatwick we t...
2902,Recently flew to EZE and back from LHR. Longest flight on the BA network. In each direction the ...,recently flew to eze and back from lhr longest flight on the ba network in each direction the cr...
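The `stopwords` list loaded earlier is not yet applied to the cleaned text. As a rough sketch of how that step could look, the example below filters tokens against a tiny hand-written stopword set. `DEMO_STOPWORDS` and `remove_stopwords` are assumptions made for illustration; in the notebook, the NLTK `stopwords` list would take the place of the demo set.

```python
# Tiny illustrative stopword set (an assumption for this sketch);
# the NLTK English stopword list is much longer.
DEMO_STOPWORDS = {"the", "to", "and", "a", "i", "was", "this", "in", "of"}


def remove_stopwords(cleaned_text, stopword_set=DEMO_STOPWORDS):
    """Split already-cleaned text on whitespace and drop stopwords."""
    tokens = cleaned_text.split()
    return [token for token in tokens if token not in stopword_set]


tokens = remove_stopwords("i was warned that ba had gone downhill")
print(tokens)
```

Converting the stopword list to a `set` before filtering keeps each membership check fast, which matters once this runs over a few thousand reviews.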
