### Web Scraping Data

As part of the project, data needed to be web scraped to address the business outcome of analyzing customer reviews. This involved web scraping 392 pages of reviews from the website Skytrax. Various metrics were extracted

BeautifulSoup and Requests were used to scrape the relavant data. Pandas was then used to convert it into a DataFrame and produce a dataset in CSV format. 

In [7]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

base_url="https://www.airlinequality.com/airline-reviews/british-airways"
pages = 392

headers = { 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36 Edg/133.0.0.0"
}

all_reviews =[]

for i in range(1, pages+1): 
    print(f"Scraping page {i}...")

    #Constructing the URL
    if i == 1:
        url = base_url
    else: 
        url = f"{base_url}/page{i}"
    url = f"{base_url}/page/{i}/"
       
    #Send request
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")

    #Finding all reviews
    reviews = soup.find_all("article",{"itemprop":"review"})

    #Extracting details from each review 
    for review in reviews:
        try:
            title = review.find("h2", class_="text_header").get_text(strip=True)
            author = review.find("span", itemprop="name").get_text(strip=True)
            date_published = review.find("time", itemprop="datePublished")["datetime"]
            review_text = review.find("div", class_="text_content").get_text(strip=True)
            overall_rating = review.find("span", itemprop="ratingValue").get_text(strip=True)
            traveller_type = review.find("td", class_="review-rating-header type_of_traveller")
            traveller_type = traveller_type.find_next_sibling("td").get_text(strip=True) if traveller_type else None
            seat_type = review.find("td", class_="review-rating-header cabin_flown")
            seat_type = seat_type.find_next_sibling("td").get_text(strip=True) if seat_type else None
            route = review.find("td", class_="review-rating-header route")
            route = route.find_next_sibling("td").get_text(strip=True) if route else None
            date_flown = review.find("td", class_="review-rating-header date_flown")
            date_flown = date_flown.find_next_sibling("td").get_text(strip=True) if date_flown else None
            seat_comfort = len(review.find("td", class_="review-rating-header seat_comfort").find_next_sibling("td").find_all("span", class_="star fill"))
            cabin_staff = len(review.find("td", class_="review-rating-header cabin_staff_service").find_next_sibling("td").find_all("span",class_="star_fill"))
            food = len(review.find("td", class_="review-rating-header food_and_beverages").find_next_sibling("td").find_all("span", class_="star fill"))
            entertainment = len(review.find("td",class_="review-rating-header inflight_entertainment").find_next_sibling("td").find_all("span", class_="star fill"))
            ground_service = len(review.find("td", class_="review-rating-header ground_service").find_next_sibling("td").find_all("span", class_="star fill"))
            value_for_money = len(review.find("td", class_="review-rating-header value_for_money").find_next_sibling("td"))
            recommended = review.find("td", class_="review-rating-header recommended")
            recommended = recommended.find_next_sibling("td").get_text(strip=True) if recommended else None
            #Appending data to list
            all_reviews.append({
                "Title": title,
                "Author": author,
                "Date Published": date_published,
                "Review Text": review_text,
                "Overall Rating": overall_rating,
                "Traveller Type": traveller_type, 
                "Seat Type": seat_type,
                "Route": route,
                "Date Flown": date_flown, 
                "Seat Comfort": seat_comfort,
                "Cabin Staff Service": cabin_staff,
                "Food & Beverages": food, 
                "Inflight Entertainment": entertainment,
                "Ground Service": ground_service,
                "Value for Money": value_for_money,
                "Recommended": recommended
            })
        except Exception as e:
            print(f"Error extracting review: {e}")

    print(f" ---> {len(all_reviews)} total reviews collected so far")
    

Scraping page 1...
Error extracting review: 'NoneType' object has no attribute 'find_next_sibling'
Error extracting review: 'NoneType' object has no attribute 'find_next_sibling'
Error extracting review: 'NoneType' object has no attribute 'find_next_sibling'
Error extracting review: 'NoneType' object has no attribute 'find_next_sibling'
Error extracting review: 'NoneType' object has no attribute 'find_next_sibling'
 ---> 5 total reviews collected so far
Scraping page 2...
Error extracting review: 'NoneType' object has no attribute 'find_next_sibling'
Error extracting review: 'NoneType' object has no attribute 'find_next_sibling'
 ---> 13 total reviews collected so far
Scraping page 3...
Error extracting review: 'NoneType' object has no attribute 'find_next_sibling'
Error extracting review: 'NoneType' object has no attribute 'find_next_sibling'
Error extracting review: 'NoneType' object has no attribute 'find_next_sibling'
Error extracting review: 'NoneType' object has no attribute 'fin

Pandas is being used to convert the data into a Dataframe, ready to be used as a dataset to analyse

In [9]:
df = pd.DataFrame(all_reviews)
df.head()

Unnamed: 0,Title,Author,Date Published,Review Text,Overall Rating,Traveller Type,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Food & Beverages,Inflight Entertainment,Ground Service,Value for Money,Recommended
0,"""they still haven't replied""",E Vandoon,2025-02-18,✅Trip Verified| I flew from Amsterdam to Las...,1,Business,Premium Economy,Amsterdam to Las Vegas via London,November 2024,3,0,3,3,1,5,no
1,"""thoroughly enjoyed this flight""",A Hashin,2025-02-14,✅Trip Verified| I have never travelled with ...,9,Solo Leisure,Economy Class,Dubai to London Heathrow,February 2025,4,0,5,4,5,5,yes
2,“customer support was terrible”,L Martin,2025-02-07,"✅Trip Verified| Terrible overall, medium servi...",1,Couple Leisure,Economy Class,Zürich to London,December 2024,2,0,1,1,1,5,no
3,"""a really enjoyable experience""",Paul Lee,2025-02-01,✅Trip Verified| London Heathrow to Male In n...,9,Couple Leisure,Business Class,London to Male,January 2025,5,0,4,5,5,5,yes
4,"""the flight was delayed""",S Herron,2025-01-05,✅Trip Verified| The check in process and rewar...,1,Business,Economy Class,London to Basel,January 2025,1,0,2,1,1,5,no


In [13]:
df.to_csv("Data/BA_reviews_on_skytrax.csv",index=False)
print("Data scrapping complete. File saved as 'Data/BA_reviews_on_skytrax.csv'.")

Data scrapping complete. File saved as 'Data/BA_reviews_on_skytrax.csv'.
