## Web scraping and analysis

This Jupyter notebook contains code for web scraping. I use `requests` and `BeautifulSoup` to collect the data from the web. Once collected, the data is saved to a local `.csv` file for further analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com) you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways) you will see this data. Now we can use Python and `BeautifulSoup` to collect the reviews: the code below requests each paginated listing page and extracts the review text from every `div` with the class `text_content`.

In [3]:
# Importing Web scraping and Data manipulation libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Misc
import time
from pathlib import Path

In [5]:
# This code block scrapes reviews from the British Airways page on AirlineQuality
# The reviews are paginated, and we will scrape multiple pages to collect all reviews

# Set the base URL and pagination parameters
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 50
page_size = 100

reviews = []      # The reviews are stored in a list and will be converted to a DataFrame at the end

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Stop on an HTTP error instead of parsing an error page

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews") # Print the number of reviews collected after each page
    time.sleep(1)  # Sleep for 1 second to avoid overwhelming the server

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
Scrapi
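
The paginated URL in the loop above is assembled by hand, with the colon in `post_date:Desc` already percent-encoded. As a minimal sketch, the same URL could be built with the standard library's `urlencode`, which handles the encoding. The helper name `build_page_url` and the constant `BASE_URL` are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page: int, page_size: int = 100, base_url: str = BASE_URL) -> str:
    """Return the listing URL for one page of reviews, sorted newest first."""
    # urlencode percent-encodes the colon, producing sortby=post_date%3ADesc
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


print(build_page_url(1))
```

Inside the loop, `url = build_page_url(i, page_size)` would then replace the f-string, keeping the query parameters in one place.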

In [6]:
# Build the DataFrame directly from the list of scraped reviews
df = pd.DataFrame({"reviews": reviews})

In [7]:
df['reviews'].loc[1]

'Not Verified |  Decided to use point to upgrade to business after visiting family in the UK. I have to say British Airways Business Class was up there with the best of them. Comfortable seating, great service from the crew and with a smile, and great food. On the whole I found the experience comfortable and worth the points upgrade and will consider flying them again.'
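
The sample review above starts with a `Not Verified |` prefix before the actual text. One possible cleaning step for the later analysis is to separate that prefix from the review body. The sketch below is only an illustration: the function name `split_verification` is hypothetical, and it assumes that a genuine prefix is short and contains the word "verified", which I have only confirmed for the single example shown here.

```python
from typing import Optional, Tuple


def split_verification(raw: str) -> Tuple[Optional[str], str]:
    """Split 'STATUS | review text' into (status, text).

    Returns (None, text) when no verification prefix is found.
    """
    prefix, sep, rest = raw.partition("|")
    # Assumption: a genuine prefix is short and mentions "verified";
    # otherwise the "|" is treated as part of the review itself.
    if sep and "verified" in prefix.lower() and len(prefix) < 40:
        return prefix.strip(), rest.strip()
    return None, raw.strip()


sample = "Not Verified |  Decided to use point to upgrade to business after visiting family in the UK."
status, text = split_verification(sample)
print(status)
print(text)
```

It could be applied to the whole column with `df["reviews"].map(split_verification)`, which yields a Series of `(status, text)` tuples.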

In [9]:
# Base directory = current working directory (for a notebook, typically the folder it was started in)
base_dir = Path().resolve()

data_dir = base_dir / "data"

# Create the directory if it doesn't exist
data_dir.mkdir(parents=True, exist_ok=True)

# Then save the file
raw_reviews_path = data_dir / "british_airways_reviews.csv"
df.to_csv(raw_reviews_path, index=False)
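
After saving, a quick sanity check is to read the file back and confirm that the number of rows matches the number of scraped reviews. The sketch below uses only the standard library's `csv` module and runs against a throwaway file with made-up rows, so it does not touch the real data. The helper name `count_csv_rows` is hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def count_csv_rows(path: Path) -> int:
    """Count data rows (excluding the header) in a CSV file."""
    with path.open(newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.DictReader(handle))


# Demonstration on a temporary file with made-up rows
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_reviews.csv"
    with demo_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["reviews"])
        writer.writerow(["Great crew, good food"])       # field contains a comma
        writer.writerow(["Late departure.\nLost bag."])  # field contains a newline
    demo_count = count_csv_rows(demo_path)

print(demo_count)
```

For the file written above, `count_csv_rows(raw_reviews_path)` should equal `len(df)`. Counting parsed rows rather than raw lines matters because a review may contain line breaks of its own.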