There are mainly two ways to extract data from a website:

- Use the API of the website (if it exists). For example, Facebook has the Facebook Graph API which allows retrieval of data posted on Facebook.
- Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping or web harvesting or web data extraction.

This notebook walks through web scraping in Python with Beautiful Soup. The steps involved are:

1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use requests, a third-party HTTP library for Python.
2. Once we have accessed the HTML content, we are left with the task of parsing the data. Since most HTML data is nested, we cannot reliably extract data through simple string processing. We need a parser that builds a nested (tree) structure from the HTML. There are many HTML parser libraries available; this notebook uses html5lib, which parses pages the same way a web browser does and is therefore very lenient with malformed markup.
3. Now all we need to do is navigate and search the parse tree that we created, i.e. tree traversal. For this task, we will use another third-party Python library, Beautiful Soup, which pulls data out of HTML and XML files.
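
The three steps above can be sketched in a few lines. This is a minimal illustration, not the scraper used later: the helper names `fetch_html` and `extract_headings` and the sample markup are made up for the example, and the built-in `html.parser` is swapped in for html5lib so the snippet needs no extra parser package.

```python
import requests
from bs4 import BeautifulSoup

def fetch_html(url, timeout=10):
    """Step 1: send an HTTP GET request and return the raw HTML bytes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    return response.content

def extract_headings(html):
    """Steps 2 and 3: build the parse tree, then search it for <h2> tags."""
    soup = BeautifulSoup(html, 'html.parser')
    return [h2.get_text(strip=True) for h2 in soup.find_all('h2')]

# Hypothetical markup standing in for a downloaded page (no network needed)
sample_html = "<html><body><h2> First </h2><div><h2>Second</h2></div></body></html>"
headings = extract_headings(sample_html)
print(headings)
```

On a live page the two helpers would be chained as `extract_headings(fetch_html(url))`.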

##Install required libraries

In [1]:
# Install first if needed: pip install requests beautifulsoup4 html5lib pandas
import requests
from bs4 import BeautifulSoup
import pandas as pd

##Searching and navigating through the parse tree

In [2]:
def parse_reviews(url):
  response = requests.get(url)
  soup = BeautifulSoup(response.content, 'html5lib')
  reviews = []
  for review in soup.find_all('article', attrs={'itemprop':'review'}):
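    # The body text has the form "<verification>|<review text>".
    # partition() avoids an IndexError when a review has no '|' separator.
    text = review.find('div', class_= 'text_content').get_text(strip = True)
    verification, sep, review_content = text.partition('|')
    if sep:
      verification = verification.replace('✅', '')
    else:
      verification, review_content = None, text
    date = review.find('time').get('datetime')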

    rating = review.find('div', class_= 'rating-10')
    if rating:
      rating_value = rating.find('span', itemprop = 'ratingValue').get_text(strip=True)
    else:
      rating_value = None

    impression = review.find('h2', class_='text_header').get_text(strip=True) if review.find('h2', class_='text_header') else None

    author_info = review.find('h3', class_ = 'text_sub_header userStatusWrapper')
    if author_info:
      country = author_info.get_text(strip=True).split('(')[-1].split(')')[0]
    else:
      country = None

    content = {}
    for row in review.find_all('tr'):
      header = row.find('td', class_ = 'review-rating-header')
      value = row.find('td', class_ = 'review-value')
      stars = row.find_all('span', class_ = 'star fill')

      if header and (value or stars):
        header_text = header.get_text(strip=True)
        if value:
          content[header_text] = value.get_text(strip=True)
        elif stars:
          content[header_text] = len(stars)

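    # Build the record once per review, after all rating rows have been read.
    # (Inside the row loop it would be undefined or stale for reviews with no rows.)
    reviews_data = {'Date': date,
                    'Overall Rating' : rating_value,
                    'Country' : country,
                    'Impression' : impression,
                    'Verification' : verification,
                    'Review' : review_content,
                    **content}
    reviews.append(reviews_data)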
  return reviews
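
To see how the rating-table loop in `parse_reviews` behaves, here is a small sketch run against hand-written markup. The HTML below is hypothetical: it only mimics the class names the function searches for (`review-rating-header`, `review-value`, `star fill`) and is not copied from the real site. The built-in `html.parser` is used so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical table: one text-valued row and one star-rated row
sample = """
<table>
  <tr><td class="review-rating-header">Seat Type</td>
      <td class="review-value">Economy Class</td></tr>
  <tr><td class="review-rating-header">Seat Comfort</td>
      <td class="review-rating-stars">
        <span class="star fill">1</span><span class="star fill">2</span>
        <span class="star">3</span><span class="star">4</span><span class="star">5</span>
      </td></tr>
</table>
"""

soup = BeautifulSoup(sample, 'html.parser')
content = {}
for row in soup.find_all('tr'):
    header = row.find('td', class_='review-rating-header')
    value = row.find('td', class_='review-value')
    # A class_ string containing a space matches the exact attribute value "star fill"
    stars = row.find_all('span', class_='star fill')
    if header and (value or stars):
        key = header.get_text(strip=True)
        # Text rows keep their text; star rows are scored by counting filled stars
        content[key] = value.get_text(strip=True) if value else len(stars)

print(content)
```

Text-valued rows end up as strings and star rows as integers, which is why the numeric columns are converted explicitly later in the notebook.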

##Parsing the HTML content

A really nice thing about the BeautifulSoup library is that it is built on top of HTML parsing libraries such as html5lib, lxml and html.parser. This means you can create the BeautifulSoup object and specify the parser library at the same time, as `parse_reviews` does with `BeautifulSoup(response.content, 'html5lib')`.
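
As a rough sketch of choosing the parser at construction time: the helper `make_soup` below is a made-up name, and the fallback to Python's built-in `html.parser` is an assumption for environments where html5lib is not installed, not something the notebook itself does.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, parser='html5lib'):
    """Create the soup with the requested parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, parser)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        return BeautifulSoup(markup, 'html.parser')

markup = "<p>First</p><p>Second</p>"  # hypothetical, well-formed sample
soup = make_soup(markup)
paragraphs = [p.get_text() for p in soup.find_all('p')]
print(paragraphs)
```

For well-formed markup like this the parsers agree. They can build different trees for broken HTML, so it is worth sticking to one parser throughout a project.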

In [3]:
def parse_multiple_pages(base_url, num_pages=10):
    df = pd.DataFrame()

    for i in range(1, num_pages + 1):
        url = f'{base_url}/page/{i}/?sortby=post_date&pagesize=100'
        page_review = pd.DataFrame(parse_reviews(url))
        df = pd.concat([df, page_review], axis=0).reset_index(drop=True)
    return df

base_url = 'https://www.airlinequality.com/airline-reviews/british-airways'
df = parse_multiple_pages(base_url, num_pages=10)
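
A possible variation on `parse_multiple_pages`, shown only as a sketch: it collects the per-page frames in a list and concatenates once at the end, and it pauses between requests. The names `build_page_url`, `collect_pages`, the `delay` parameter and the stand-in parser are all hypothetical; the URL pattern is the one used above.

```python
import time
import pandas as pd

def build_page_url(base_url, page, page_size=100):
    """Build the listing URL for one page, following the pattern used above."""
    return f'{base_url}/page/{page}/?sortby=post_date&pagesize={page_size}'

def collect_pages(base_url, parse_page, num_pages=10, delay=0.0):
    """Parse pages 1..num_pages with parse_page and return one DataFrame."""
    frames = []
    for page in range(1, num_pages + 1):
        frames.append(pd.DataFrame(parse_page(build_page_url(base_url, page))))
        time.sleep(delay)  # be polite to the server between requests
    # Concatenating once avoids re-copying the growing frame on every page
    return pd.concat(frames, ignore_index=True)

# Stand-in parser so the sketch runs without network access
def fake_parse(url):
    return [{'url': url}]

result = collect_pages('https://example.com/reviews', fake_parse, num_pages=3)
print(result)
```

With the real scraper, `parse_reviews` would be passed in place of `fake_parse`, with a non-zero `delay`.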


In [4]:
df.dtypes

Unnamed: 0,0
Date,object
Overall Rating,object
Country,object
Impression,object
Verification,object
Review,object
Aircraft,object
Type Of Traveller,object
Seat Type,object
Route,object


In [5]:
columns_to_convert = ['Overall Rating', 'Seat Comfort', 'Cabin Staff Service', 'Food & Beverages', 'Inflight Entertainment', 'Ground Service', 'Wifi & Connectivity', 'Value For Money']
df[columns_to_convert] = df[columns_to_convert].apply(pd.to_numeric, errors = 'coerce')
df['Date'] = pd.to_datetime(df['Date'])
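
The effect of `errors='coerce'` is easiest to see on a tiny frame. The data below is invented for illustration. Filtering the column list down to columns that actually exist is an extra precaution assumed here, since selecting a missing column would raise a `KeyError`.

```python
import pandas as pd

# Invented sample: ratings arrive as strings, one of them not numeric
sample = pd.DataFrame({
    'Overall Rating': ['1', '6', 'na'],
    'Date': ['2024-10-14', '2024-10-12', '2024-10-09'],
})

wanted = ['Overall Rating', 'Seat Comfort']  # 'Seat Comfort' is absent on purpose
present = [c for c in wanted if c in sample.columns]

# Unparseable values become NaN instead of raising an error
sample[present] = sample[present].apply(pd.to_numeric, errors='coerce')
sample['Date'] = pd.to_datetime(sample['Date'])

print(sample.dtypes)
```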

In [8]:
df.head(10)

Unnamed: 0,Date,Overall Rating,Country,Impression,Verification,Review,Aircraft,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Food & Beverages,Inflight Entertainment,Ground Service,Wifi & Connectivity,Value For Money,Recommended
0,2024-10-14,1,South Africa,“Appalling service”,Trip Verified,Appalling service with failing defective fl...,A380,Business,Business Class,Johannesburg to London,October 2024,2.0,1.0,2.0,2.0,1.0,2.0,1,no
1,2024-10-12,6,United Kingdom,“BA’s petty penny pinching ”,Trip Verified,British Airways charge you for the pleasure o...,Boeing 787-900,Business,Business Class,London to Mexico City,October 2024,1.0,5.0,1.0,1.0,2.0,2.0,3,yes
2,2024-10-12,1,Germany,“British arrogance with no manners”,Trip Verified,What is wrong with you guys? People pay lots ...,A320N,Couple Leisure,Business Class,Berlin to London Heathrow,October 2024,1.0,1.0,1.0,,2.0,,1,no
3,2024-10-12,2,United States,"""Terrible customer service""",Trip Verified,We booked two business class seat with Brit...,,Couple Leisure,Business Class,London Heathrow to Philadelphia,September 2024,1.0,1.0,1.0,,1.0,1.0,1,no
4,2024-10-09,1,South Africa,“left much to be desired”,Trip Verified,"I’ve flown with many airlines, but my recent ...",,Solo Leisure,Economy Class,Johannesburg to New York City via Heathrow,October 2024,5.0,4.0,3.0,2.0,3.0,1.0,2,no
5,2024-10-08,8,United Kingdom,"""crew were all very friendly""",Trip Verified,I recently flew from New York back to Londo...,Boeing 777-336 ER,Solo Leisure,Premium Economy,New York JFK to London Heathrow,October 2024,5.0,5.0,3.0,5.0,5.0,,4,yes
6,2024-10-06,4,United Kingdom,"""Simply not good enough""",Not Verified,BA business class in Europe has a seat the ...,,Solo Leisure,Business Class,Antalya to Gatwick,October 2024,1.0,2.0,4.0,,2.0,1.0,2,no
7,2024-10-05,1,United States,“BA refuses to reimburse us”,Trip Verified,Our flight started in Seattle Wa heading to L...,,Couple Leisure,Economy Class,London to Lisbon,September 2024,3.0,2.0,1.0,1.0,1.0,1.0,2,no
8,2024-10-04,4,India,"""my luggage is missing""",Trip Verified,British Airways Flight from Edinburgh got d...,,Solo Leisure,Economy Class,Edinburgh to Delhi via London,September 2024,1.0,1.0,1.0,,1.0,,1,no
9,2024-09-28,5,United Kingdom,"""work is needed to provide a better customer e...",Trip Verified,British Airways World Traveller Plus (Premi...,A350,Solo Leisure,Premium Economy,London to Vancouver,September 2024,4.0,5.0,2.0,1.0,3.0,,2,yes


In [9]:
df.to_csv('british_airways_reviews.csv', index=False)