# British Airways Customer Reviews Webscraper

**Author:** Lester Liam  | [GitHub](https://github.com/lester-liam) | [LinkedIn](https://www.linkedin.com/in/lester-liam) <br/>
**Last Updated:** 27 February 2025
<br/>

> The generated dataset is available on [Google Drive](https://drive.google.com/drive/folders/16VPHZBDPZUUnRndVtKLxfd1Px2CfUMhI?usp=drive_link)

This notebook is part of my project for the virtual experience programme on [Forage](https://www.theforage.com/simulations/british-airways/data-science-yqoz). It implements a web scraper that collects review data from the "Airline" reviews section of [Skytrax](https://airlinequality.com).

**Background:**
> British Airways (BA) is the flag carrier airline of the United Kingdom (UK). Every day, thousands of BA flights arrive to and depart from the UK, carrying customers across the world. Whether it’s for holidays, work or any other reason, the end-to-end process of scheduling, planning, boarding, fuelling, transporting, landing, and continuously running flights on time, efficiently and with top-class customer service is a huge task with many highly important responsibilities.<br/><br/>
> As a data scientist at BA, it will be your job to apply your analytical skills to influence real life multi-million-pound decisions from day one, making a tangible impact on the business as your recommendations, tools and models drive key business decisions, reduce costs and increase revenue.<br/><br/>
> Customers who book a flight with BA will experience many interaction points with the BA brand. Understanding a customer's feelings, needs, and feedback is crucial for any business, including BA. <br/><br/>
> This first task is focused on scraping and collecting customer feedback and reviewing data from a third-party source and analysing this data to present any insights you may uncover.



<hr/>

# Table of Contents

>[Initial Website Preview](#scrollTo=Ev043d9HLoLc)

>[Test Extraction (100 Reviews)](#scrollTo=GPtb8IMHK067)

>>[Review Table Metrics Name Extraction](#scrollTo=2xaWltn6KuxM)

>>[Final CSV Dict Format](#scrollTo=9FjXhZxOKnOD)

>>[Review Extraction Function](#scrollTo=K9t1L2kAKjkT)

>[Webscraper (1000 Reviews)](#scrollTo=8xzV38vUKduU)

>>[Export Dataset](#scrollTo=ts35RK9EEavy)

<hr/>

# Initial Website Preview

Firstly, we inspected the site's `robots.txt` file to understand any web scraping policies being enforced. <br/> <br/>

![image.png](https://i.ibb.co/W4pytpqH/image.png)
<br/> <br/>
For any user-agent (`*`), the file specifies a crawl-delay of 5. Hence, our code in later sections must pause between requests made to the site.
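The crawl delay can also be read programmatically rather than from a screenshot. The sketch below uses Python's standard `urllib.robotparser`. The `SAMPLE_ROBOTS_TXT` string is an assumed stand-in that mirrors only the crawl-delay rule described above, not the site's full file, and `read_crawl_delay` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed stand-in for the site's robots.txt (only the rule discussed above)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
"""

def read_crawl_delay(robots_txt, user_agent="*"):
    """Return the crawl-delay that applies to `user_agent`, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)

crawl_delay = read_crawl_delay(SAMPLE_ROBOTS_TXT)
print("Crawl delay:", crawl_delay)
```

Against the live site, calling `set_url(...)` and `read()` on the parser would fetch the file instead of parsing a local string.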

<hr/>

The image below shows a sample of the latest review from the site. We can observe a paragraph of raw text, as well as the ratings the customer gave for specific metrics. Using this information, we can measure the overall rating for each category and perform sentiment analysis on the raw text. <br/>
<br/>
Each page holds multiple reviews, and there are multiple pages. Inspecting the site's URL shows that we can simply change the page number in the URL to retrieve more reviews, as well as sort them by passenger class. <br/>

**Example for Page 10, 100 Reviews** <br/>.../page/`10?pagesize=100`

![image.png](https://i.ibb.co/mryjsjdC/skytrax-airline-review-metrics.png)
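The paging scheme described above can be captured in a small helper. This is a minimal sketch: `build_page_url` is a hypothetical name, and the `sortby` and `pagesize` parameters are taken from the request URL used later in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"

def build_page_url(page, page_size=100, sort_by="post_date:Desc"):
    """Build the review-listing URL for one page of results."""
    # urlencode percent-encodes the ':' in the sort key as %3A
    query = urlencode({"sortby": sort_by, "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"

print(build_page_url(10))
```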

<hr/>

Given that the number of reviews per cabin class differs significantly, we will scrape the first 10 pages (1,000 reviews) and proceed from there.

In [None]:
# Import Packages
import pandas as pd

import time
import pprint
from datetime import date

# Webscraping Packages
import requests
from bs4 import BeautifulSoup

# Test Extraction (100 Reviews)

In [None]:
# Request First Review Page
base_url = "https://www.airlinequality.com/airline-reviews/british-airways/page/1/?sortby=post_date%3ADesc&pagesize=100"

response = requests.get(base_url, headers={'User-Agent': '*'})

# Check Response Code
print("Status Code: ", response.status_code)

Status Code:  200


In [None]:
# Parse Page Content as HTML into BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

In [None]:
# Extract all '<article>' with 'review' itemprop attribute
reviews = soup.find_all("article", attrs={"itemprop": "review"})
print(len(reviews))

100


In [None]:
# Test Various Extractions on 1 Review
ratings_dict = {}

# Date Published
date_published = reviews[0].find("time", attrs={"itemprop": "datePublished"})
datetime_published = date_published['datetime']

# Rating Value
rating = reviews[0].find("span", attrs={"itemprop":"ratingValue"}).text

# Review Title
review_title = reviews[0].find("h2", attrs={"class":"text_header"}).text

# Review Body (split into verification status and review text)
review_body = reviews[0].find("div", attrs={"itemprop":"reviewBody"}).text.split('|')

# Ratings Table
ratings_table = reviews[0].find("table", attrs={"class":"review-ratings"})
ratings_rows = ratings_table.find_all("tr")

# For Each Table Row, Extract {Metric:Value}
for row in ratings_rows:

  row = row.find_all("td")

  metric = row[0].text

  if row[1].find('span', attrs={"class":"star fill"}) is None:
    value = row[1].text

  else:
    value = len(row[1].find_all('span', attrs={"class":"star fill"}))

  ratings_dict[metric] = value
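To make the row-by-row logic above easier to follow, here is a minimal sketch that runs the same idea against a small hand-written table. The markup in `SAMPLE_HTML` is hypothetical: only the tag and class names that appear in the code above are taken from this notebook. Unlike the cell above, `parse_ratings` (a hypothetical helper) also strips surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may contain more attributes and rows
SAMPLE_HTML = """
<table class="review-ratings">
  <tr>
    <td>Seat Type</td>
    <td>Economy Class</td>
  </tr>
  <tr>
    <td>Seat Comfort</td>
    <td>
      <span class="star fill">1</span><span class="star fill">2</span>
      <span class="star fill">3</span><span class="star fill">4</span>
      <span class="star">5</span>
    </td>
  </tr>
</table>
"""

def parse_ratings(table):
    """Map each metric name to a star count or, for text rows, its text."""
    ratings = {}
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        filled = cells[1].find_all("span", attrs={"class": "star fill"})
        # Star rows are scored by counting filled stars; other rows keep their text
        ratings[cells[0].text.strip()] = len(filled) if filled else cells[1].text.strip()
    return ratings

soup_sample = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_ratings = parse_ratings(soup_sample.find("table", attrs={"class": "review-ratings"}))
print(sample_ratings)
```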

In [None]:
print(datetime_published)
print(rating)
print(review_title)
print(review_body)
pprint.pp(ratings_dict)

2025-02-21
3
"not use British Airways on this route"
['✅ Trip Verified ', '   Prior to boarding a gate agent seemed to pick on elderly people and asked them to check in relatively small bags. By contrast the same staff member looked the other way to a boisterous group of 20 somethings all of whom had large amounts of luggage. London Gatwick to Marrakech with British Airways was rather disappointing. The flight is just over 3 hours and during this flight passengers in economy were offered a tiny packet of digestive biscuits. No water was offered other than a cabin crew member walking through the aisle with a bottle of mineral water and plastic cups. She claimed she ran out of water and so many passengers did not even get that miserly offering. There was no route map on this A320. The captain made only 2 very brief announcements. When I needed to use the lavatory one cabin crew member rudely drew the curtain as she was sitting behind the lavatories with one of her colleagues – they were 
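The printed `review_body` above shows that the text before the first `|` carries the verification status and the rest is the review itself. A minimal sketch of that split is shown below. `split_review_body` is a hypothetical helper, the input strings are made up, and the "Not Verified" wording is an assumption based on the check used later in this notebook.

```python
def split_review_body(raw_text):
    """Split '<status> | <review text>' into (trip_verified, body)."""
    status, separator, body = raw_text.partition("|")
    if not separator:
        # No '|' found: treat the whole text as the body, status unknown
        return None, raw_text.strip()
    trip_verified = "Not Verified" if "Not" in status else "Verified"
    return trip_verified, body.strip()

print(split_review_body("✅ Trip Verified |   A short, made-up review."))
print(split_review_body("Not Verified | Another made-up review."))
```

Using `partition` splits only at the first `|`, so a review that itself contains a `|` character is kept whole.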

## Review Table Metrics Name Extraction

Each review has a **different order and list of metrics**. Hence, we'll run through the first 100 reviews to collect a list of measurable metrics.

In [None]:
# Function to Extract All Metrics (from first 100 Reviews)
def getMetrics(reviews):

  metric_list = []
  for review in reviews:
      ratings_table = review.find("table", attrs={"class":"review-ratings"})
      ratings_rows = ratings_table.find_all("tr")

      # For Each Row in Ratings Table, Extract the Metric Name
      for row in ratings_rows:

        data = row.find_all("td")
        metric = data[0].text

        # Append Metric (only if not seen before)
        if metric not in metric_list:
          metric_list.append(metric)

  # Sort once, after all reviews have been processed
  return sorted(metric_list)

# Call Function
METRICS_LIST = getMetrics(reviews)

print("List of Rating Metrics:")
pprint.pp(METRICS_LIST)

List of Rating Metrics:
['Aircraft',
 'Cabin Staff Service',
 'Date Flown',
 'Food & Beverages',
 'Ground Service',
 'Inflight Entertainment',
 'Recommended',
 'Route',
 'Seat Comfort',
 'Seat Type',
 'Type Of Traveller',
 'Value For Money',
 'Wifi & Connectivity']


## Final CSV Dict Format

Using the list of metrics collected above, we define the draft CSV output format.

In [None]:
# Create a Template Dictionary of {ColumnName: None}

dict_format = {
    "ratingValue": None,
    "title": None,
    "body": None,
    "tripVerified": None,
}

for m in METRICS_LIST:
  dict_format[m] = None

pprint.pp(dict_format)

{'ratingValue': None,
 'title': None,
 'body': None,
 'tripVerified': None,
 'Aircraft': None,
 'Cabin Staff Service': None,
 'Date Flown': None,
 'Food & Beverages': None,
 'Ground Service': None,
 'Inflight Entertainment': None,
 'Recommended': None,
 'Route': None,
 'Seat Comfort': None,
 'Seat Type': None,
 'Type Of Traveller': None,
 'Value For Money': None,
 'Wifi & Connectivity': None}
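A template like the one above is useful because reviews do not all report the same metrics. The sketch below, using made-up rows and a shortened hypothetical template, shows how merging each row into the template keeps every column the same length, with `None` marking a missing metric.

```python
# Hypothetical template and rows; real values come from the scraped reviews
template = dict.fromkeys(["ratingValue", "Aircraft", "Seat Comfort"])

rows = [
    {"ratingValue": "3", "Aircraft": "A320", "Seat Comfort": 4},
    {"ratingValue": "1", "Seat Comfort": 3},  # no Aircraft reported
]

columns = {name: [] for name in template}
for row in rows:
    filled = {**template, **row}  # metrics missing from the row stay None
    for name in columns:
        columns[name].append(filled[name])

print(columns)
```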


## Review Extraction Function
Using the methodology above, we can now create a function that extracts each `<article>` review and stores it in a dictionary.

In [None]:
# Review Extraction Function

# Extract Elements
def extractReview(review, dict_format) -> dict:
  reviews_dict = dict_format.copy()

  # Rating Value
  reviews_dict['ratingValue'] = review.find("span", attrs={"itemprop":"ratingValue"}).text

  # Review Title
  reviews_dict['title'] = review.find("h2", attrs={"class":"text_header"}).text

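  # Review Body (format: "<verification status> | <review text>")
  # Split only at the first '|' so a '|' inside the review text is kept
  review_body = review.find("div", attrs={"itemprop":"reviewBody"}).text.split("|", 1)

  if len(review_body) == 2:
    if "Not" in review_body[0]:
      reviews_dict['tripVerified'] = "Not Verified"
    else:
      reviews_dict['tripVerified'] = "Verified"

    reviews_dict['body'] = review_body[1]
  else:
    # No verification prefix found: keep the whole text as the body
    reviews_dict['body'] = review_body[0]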

  # Ratings Table
  ratings_table = review.find("table", attrs={"class":"review-ratings"})
  ratings_rows = ratings_table.find_all("tr")

  # Extract Table Data
  for row in ratings_rows:

    data = row.find_all("td")

    metric = data[0].text

    if data[1].find('span', attrs={"class":"star fill"}) is None:
      value = data[1].text
    else:
      value = len(data[1].find_all('span', attrs={"class":"star fill"}))

    reviews_dict[metric] = value

  return(reviews_dict)

In [None]:
# Test for first 5 reviews
for review in reviews[:5]:
  pprint.pp(extractReview(review, dict_format))

{'ratingValue': '3',
 'title': '"not use British Airways on this route"',
 'body': '   Prior to boarding a gate agent seemed to pick on elderly people '
         'and asked them to check in relatively small bags. By contrast the '
         'same staff member looked the other way to a boisterous group of 20 '
         'somethings all of whom had large amounts of luggage. London Gatwick '
         'to Marrakech with British Airways was rather disappointing. The '
         'flight is just over 3 hours and during this flight passengers in '
         'economy were offered a tiny packet of digestive biscuits. No water '
         'was offered other than a cabin crew member walking through the aisle '
         'with a bottle of mineral water and plastic cups. She claimed she ran '
         'out of water and so many passengers did not even get that miserly '
         'offering. There was no route map on this A320. The captain made only '
         '2 very brief announcements. When I needed to us

# Webscraper (1000 Reviews)
Scraping every review would be resource-intensive and time-consuming, so we only scrape the first 10 pages. Reviews beyond page 10 are also older and may no longer be representative, which could affect our analysis in the next section.

In [None]:
# Create CSV Dictionary
csv_dict = {}
for key in dict_format.keys():
  csv_dict[key] = []

# Extract first 10 pages (10*100=1000) Reviews
for i in range(1, 11):
  url = f"https://www.airlinequality.com/airline-reviews/british-airways/page/{i}/?sortby=post_date%3ADesc&pagesize=100"

  response = requests.get(url, headers={'User-Agent': '*'})

  if response.status_code == 200:
    # Perform Review Extraction
    soup = BeautifulSoup(response.content, 'html.parser')
    reviews = soup.find_all("article", attrs={"itemprop": "review"})

    for r in reviews:
      row_dict = extractReview(r, dict_format)

      # Append one value per known column so all columns stay the same length;
      # metrics not seen in the first 100 reviews are skipped
      for col in csv_dict:
        csv_dict[col].append(row_dict.get(col))

  else:
    print(f"GET request failed for page {i} (status {response.status_code})")

  time.sleep(10) # Crawl Delay (robots.txt specifies a crawl-delay of 5)
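The loop above combines fetching, parsing, and waiting. One way to make the delay logic easier to test is to pass the fetch step in as a function, as in the sketch below. `scrape_pages` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for the real request so the example runs without network access.

```python
import time

def scrape_pages(fetch_page, pages, delay_seconds=10, sleep=time.sleep):
    """Call `fetch_page(page)` for each page, pausing after every request.

    `fetch_page` is expected to return a list of rows or raise on failure.
    Returns (rows, failed_pages).
    """
    rows, failed = [], []
    for page in pages:
        try:
            rows.extend(fetch_page(page))
        except Exception as error:
            print(f"Page {page} failed: {error}")
            failed.append(page)
        sleep(delay_seconds)  # pause after every request, success or not
    return rows, failed

# Stand-in fetcher that simulates one failing page
def fake_fetch(page):
    if page == 2:
        raise RuntimeError("simulated HTTP error")
    return [f"review from page {page}"]

rows, failed = scrape_pages(fake_fetch, range(1, 4), sleep=lambda seconds: None)
print(rows, failed)
```

Recording the failed pages makes it possible to retry only those pages later instead of repeating the whole run.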

In [None]:
# Convert to DataFrame
df = pd.DataFrame(csv_dict)
df.head()

| | ratingValue | title | body | tripVerified | Aircraft | Cabin Staff Service | Date Flown | Food & Beverages | Ground Service | Inflight Entertainment | Recommended | Route | Seat Comfort | Seat Type | Type Of Traveller | Value For Money | Wifi & Connectivity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | "not use British Airways on this route" | Prior to boarding a gate agent seemed to pi... | Verified | A320 | 1.0 | February 2025 | | 2.0 | | no | London Gatwick to Marrakech | 4.0 | Economy Class | Solo Leisure | 2 | |
| 1 | 1 | "they still haven't replied" | I flew from Amsterdam to Las Vegas with a l... | Verified | | 3.0 | November 2024 | 3.0 | 1.0 | 3.0 | no | Amsterdam to Las Vegas via London | 3.0 | Premium Economy | Business | 1 | |
| 2 | 4 | “food has really gone downhill” | First the good news, the club suites are such... | Verified | A350-1000 | 1.0 | February 2025 | 2.0 | 3.0 | | no | London to Nairobi | 4.0 | Business Class | Couple Leisure | 3 | |
| 3 | 9 | "thoroughly enjoyed this flight" | I have never travelled with British airways... | Verified | A380 | 5.0 | February 2025 | 5.0 | 5.0 | 4.0 | yes | Dubai to London Heathrow | 4.0 | Economy Class | Solo Leisure | 5 | |
| 4 | 1 | “customer support was terrible” | Terrible overall, medium service and the flig... | Verified | | 2.0 | December 2024 | 1.0 | 1.0 | 1.0 | no | Zürich to London | 2.0 | Economy Class | Couple Leisure | 1 | |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ratingValue             1000 non-null   object 
 1   title                   1000 non-null   object 
 2   body                    1000 non-null   object 
 3   tripVerified            1000 non-null   object 
 4   Aircraft                526 non-null    object 
 5   Cabin Staff Service     902 non-null    float64
 6   Date Flown              1000 non-null   object 
 7   Food & Beverages        778 non-null    float64
 8   Ground Service          943 non-null    float64
 9   Inflight Entertainment  518 non-null    float64
 10  Recommended             1000 non-null   object 
 11  Route                   996 non-null    object 
 12  Seat Comfort            912 non-null    float64
 13  Seat Type               1000 non-null   object 
 14  Type Of Traveller       998 non-null    o
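`df.info()` lists `ratingValue` as `object` because the scraper stores the raw text of each element. Later analysis will probably need numeric and boolean types. The sketch below shows one possible conversion on a small made-up frame; it is not part of the original pipeline.

```python
import pandas as pd

# Made-up rows mimicking two of the scraped columns
sample = pd.DataFrame({
    "ratingValue": ["3", "1", "9"],
    "Recommended": ["no", "no", "yes"],
})

# errors="coerce" turns unparseable text into NaN instead of raising
sample["ratingValue"] = pd.to_numeric(sample["ratingValue"], errors="coerce")
sample["Recommended"] = sample["Recommended"].map({"yes": True, "no": False})

print(sample.dtypes)
```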

## Export Dataset

In [None]:
# Generate Output CSV File
output_file = "british_airways_reviews_" + date.today().strftime("%m_%Y") + ".csv"
df.to_csv(output_file, index=False)
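The file name above carries only the month and year, so a second run in the same month overwrites the earlier export. The sketch below reproduces the naming scheme for a fixed date to show the resulting format; `build_output_name` is a hypothetical helper.

```python
from datetime import date

def build_output_name(run_date, prefix="british_airways_reviews"):
    """Return '<prefix>_<MM>_<YYYY>.csv' for the given date."""
    return f"{prefix}_{run_date.strftime('%m_%Y')}.csv"

print(build_output_name(date(2025, 2, 27)))
```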

<hr/>