# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com) you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways) you will see this data. We can now use `Python` and `BeautifulSoup` to walk through the paginated review listing and collect the text of each review.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
# Number of pages to scrape.
pages = 20
# Number of reviews per page. The webpage offers 10, 20 and 100 reviews to be viewed at a single time.
page_size = 100

# List that collects the text of every review.
reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Build the URL of the i-th results page, newest reviews first
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; stop early on an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        # Append the text of each review to the list.
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
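
The paginated URL used in the loop above can also be assembled from its parts. The following is a minimal sketch, not part of the original notebook: the helper name `build_page_url` is hypothetical, and the query parameters are simply the ones that appear in the URL string above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(base_url, page, page_size=100):
    """Return the URL of one results page, sorted newest first.

    Hypothetical helper: it reproduces the URL pattern used in the loop.
    """
    # urlencode percent-encodes the colon in "post_date:Desc" as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    # rstrip avoids a double slash when base_url already ends with "/"
    return f"{base_url.rstrip('/')}/page/{page}/?{query}"


print(build_page_url(BASE_URL, 1))
```

Building the query with `urlencode` keeps the encoding in one place, so changing the sort order or page size does not require editing the string by hand.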


In [None]:
df = pd.DataFrame({"reviews": reviews})


In [11]:
df.head(1)

Unnamed: 0,reviews
0,✅ Trip Verified | When on our way to Heathrow ...


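In [None]:
# Mount Google Drive first so that the target folder exists.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

In [9]:
# index=False keeps the row index out of the CSV file.
df.to_csv("/content/drive/MyDrive/British Airways/data.csv", index=False)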


The 2000 reviews are now saved as a CSV file. The next step is to convert the data into a more meaningful, structured format.
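
As one possible first step towards a structured format, the verification label at the start of each review can be separated from the review body. This is a sketch only: the function name `split_verification` is hypothetical, and the three labels it recognises are the ones visible in the outputs of this notebook; other labels may exist on the site.

```python
# Labels seen in this notebook's output; an assumption, not an exhaustive list.
KNOWN_LABELS = {
    "trip verified": True,
    "verified review": True,
    "not verified": False,
}


def split_verification(raw):
    """Split 'label | text' into (verified, text).

    verified is True, False, or None when no known label is present.
    """
    head, sep, tail = raw.partition("|")
    label = head.replace("✅", "").strip().lower()
    if sep and label in KNOWN_LABELS:
        return KNOWN_LABELS[label], tail.strip()
    # No recognised prefix: keep the whole string as the review text
    return None, raw.strip()


print(split_verification("✅ Trip Verified | Nice flight, good crew"))
print(split_verification("Not Verified | Arrived from Johannesburg"))
print(split_verification("Common lounge for First and Business class"))
```

Returning `None` for unlabelled reviews keeps "unknown" distinct from "not verified".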

## Scraping Airline Reviews Data

In [22]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 1 #10
page_size = 100 #100

reviews = []
df = pd.DataFrame()

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Build the URL of the i-th results page, newest reviews first
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')

    # Free-text body of each review
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    # Each "review-stats" block holds the rating table of one review
    for stats in parsed_content.find_all("div", {"class": "review-stats"}):
        # Text values (traveller type, route, ...); the last one is "Recommended"
        rating = [cell.get_text() for cell in stats.find_all('td', {'class': 'review-value'})]
        recommend = rating.pop()

        # Star ratings: count the filled stars in each star cell
        for cell in stats.find_all('td', {'class': 'review-rating-stars stars'}):
            rating.append(len(cell.find_all('span', {'class': 'star fill'})))
        rating.append(recommend)

        # Column names in page order; this relies on the page listing text rows
        # first, star rows next and "Recommended" last
        category = [cell.get_text() for cell in stats.find_all('td', {'class': 'review-rating-header'})]

        # One-row frame per review; concat aligns columns and fills gaps with NaN
        row = pd.DataFrame([rating], columns=category)
        df = pd.concat([df, row], ignore_index=True)

    print(f"   ---> {len(reviews)} total reviews")

df["reviews"] = reviews

Scraping page 1
   ---> 100 total reviews
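
The star ratings above are obtained by counting the filled star elements inside each rating cell. The sketch below illustrates the same idea on a small inline snippet. It uses the standard library's `html.parser` instead of `BeautifulSoup` so that it runs without extra dependencies, and the markup in `cell_html` is a simplified, hypothetical stand-in for the real page.

```python
from html.parser import HTMLParser


class StarCounter(HTMLParser):
    """Count <span> tags whose class attribute is exactly 'star fill'."""

    def __init__(self):
        super().__init__()
        self.filled = 0

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "star fill":
            self.filled += 1


def count_filled_stars(html):
    """Return the number of filled stars found in an HTML fragment."""
    parser = StarCounter()
    parser.feed(html)
    parser.close()
    return parser.filled


# Hypothetical, simplified markup for one rating cell (3 of 5 stars filled).
cell_html = (
    '<td class="review-rating-stars stars">'
    '<span class="star fill">1</span>'
    '<span class="star fill">2</span>'
    '<span class="star fill">3</span>'
    '<span class="star">4</span>'
    '<span class="star">5</span>'
    '</td>'
)

print(count_filled_stars(cell_html))  # 3
```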


In [24]:
df.head()

Unnamed: 0,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Ground Service,Value For Money,Recommended,Aircraft,Food & Beverages,Inflight Entertainment,Wifi & Connectivity,reviews
0,Business,Business Class,London to Johannesburg,August 2023,3.0,2.0,2.0,1,no,,,,,✅ Trip Verified | When on our way to Heathrow ...
1,Couple Leisure,Business Class,LHR to LAX,August 2023,4.0,5.0,5.0,4,yes,Boeing 777-300,3.0,5.0,,"✅ Trip Verified | Nice flight, good crew, very..."
2,Family Leisure,Economy Class,Delhi to Vancouver via London,December 2022,1.0,1.0,1.0,1,no,,,,,✅ Trip Verified | 8 months have passed and st...
3,Solo Leisure,Economy Class,Copenhagen to London,June 2023,,,,1,no,,,,,✅ Trip Verified | In June my flight was cance...
4,Solo Leisure,Business Class,Larnaca to London Heathrow,July 2023,3.0,3.0,3.0,1,yes,A320neo,1.0,,,✅ Trip Verified | Ground and cabin crew alway...
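
The empty cells in the table above appear because not every review reports every category. A minimal sketch of that alignment, using only the standard library rather than pandas: each review becomes a dictionary built with `zip` and `dict`, and missing keys are written as empty fields. The two sample rows reuse a few values from the output above; the variable names are illustrative.

```python
import csv
import io

# (category, rating) pairs for two reviews; the second one also has "Aircraft"
rows = [
    (["Type Of Traveller", "Seat Type", "Recommended"],
     ["Business", "Business Class", "no"]),
    (["Type Of Traveller", "Seat Type", "Recommended", "Aircraft"],
     ["Couple Leisure", "Business Class", "yes", "Boeing 777-300"]),
]

# Pair each header with its value
records = [dict(zip(category, rating)) for category, rating in rows]

# Union of all column names, in order of first appearance
fieldnames = []
for record in records:
    for key in record:
        if key not in fieldnames:
            fieldnames.append(key)

# restval fills columns that a record does not have
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

Pairing headers and values per review avoids relying on every review having the same set of columns.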


## Scraping Airline Seat Data

In [25]:
# No trailing slash: the loop below adds "/page/..." itself.
base_url = "https://www.airlinequality.com/seat-reviews/british-airways"
pages = 1 #10
page_size = 100 #100

reviews = []
df = pd.DataFrame()

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Build the URL of the i-th results page, newest reviews first
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')

    # Free-text body of each review
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    # Each "review-stats" block holds the rating table of one review
    for stats in parsed_content.find_all("div", {"class": "review-stats"}):
        # Text values; the last one is "Recommended"
        rating = [cell.get_text() for cell in stats.find_all('td', {'class': 'review-value'})]
        recommend = rating.pop()

        # Star ratings: count the filled stars in each star cell
        for cell in stats.find_all('td', {'class': 'review-rating-stars stars'}):
            rating.append(len(cell.find_all('span', {'class': 'star fill'})))
        rating.append(recommend)

        # Column names in page order; this relies on the page listing text rows
        # first, star rows next and "Recommended" last
        category = [cell.get_text() for cell in stats.find_all('td', {'class': 'review-rating-header'})]

        # One-row frame per review; concat aligns columns and fills gaps with NaN
        row = pd.DataFrame([rating], columns=category)
        df = pd.concat([df, row], ignore_index=True)

    print(f"   ---> {len(reviews)} total reviews")

df["reviews"] = reviews

Scraping page 1
   ---> 100 total reviews


## Scraping Airline Lounge Data

In [29]:
# No trailing slash: the loop below adds "/page/..." itself.
base_url = "https://www.airlinequality.com/lounge-reviews/british-airways"
pages = 1 #10
page_size = 100 #100

reviews = []
df = pd.DataFrame()

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Build the URL of the i-th results page, newest reviews first
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')

    # Free-text body of each review
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    # Each "review-stats" block holds the rating table of one review
    for stats in parsed_content.find_all("div", {"class": "review-stats"}):
        # Text values (lounge name, airport, ...); the last one is "Recommended"
        rating = [cell.get_text() for cell in stats.find_all('td', {'class': 'review-value'})]
        recommend = rating.pop()

        # Star ratings: count the filled stars in each star cell
        for cell in stats.find_all('td', {'class': 'review-rating-stars stars'}):
            rating.append(len(cell.find_all('span', {'class': 'star fill'})))
        rating.append(recommend)

        # Column names in page order; this relies on the page listing text rows
        # first, star rows next and "Recommended" last
        category = [cell.get_text() for cell in stats.find_all('td', {'class': 'review-rating-header'})]

        # One-row frame per review; concat aligns columns and fills gaps with NaN
        row = pd.DataFrame([rating], columns=category)
        df = pd.concat([df, row], ignore_index=True)

    print(f"   ---> {len(reviews)} total reviews")

df["reviews"] = reviews

Scraping page 1
   ---> 100 total reviews


In [30]:
df

Unnamed: 0,Lounge Name,Airport,Type Of Lounge,Date Visit,Type Of Traveller,Comfort,Cleanliness,Bar & Beverages,Catering,Washrooms,Recommended,Staff Service,Wifi Connectivity,reviews
0,T5 Galleries South,London Heathrow Airport,Business Class,June 2023,Business,4.0,4.0,5.0,5.0,4.0,yes,,,✅ Trip Verified | Flew London to Kalamata and...
1,Terminal 5 South Lounge,London Heathrow Airport,Business Class,April 2023,Business,1.0,1.0,1.0,,1.0,no,1.0,,✅ Trip Verified | We were at The British Airw...
2,Arrivals,London Heathrow Airport,Business Class,April 2023,Business,3.0,3.0,4.0,4.0,2.0,no,1.0,4.0,✅ Trip Verified | A pretty underwhelming expe...
3,T5 Galleries South,London Heathrow Airport,Business Class,April 2023,Business,3.0,3.0,3.0,3.0,,no,3.0,3.0,✅ Trip Verified | I travelled on a weekend wh...
4,Concorde Room T5,London Heathrow Airport,First Class,March 2023,,4.0,4.0,,3.0,3.0,no,1.0,,Not Verified | Arrived from Johannesburg and ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Gallery,Johannesburg Airport,First Class,November 2016,,2.0,2.0,2.0,2.0,,no,2.0,2.0,Common lounge for First and Business class pas...
96,Club Europe,London Gatwick Airport,Business Class,September 2016,Business,1.0,2.0,1.0,1.0,3.0,no,1.0,3.0,✅ Verified Review | My husband and I were loo...
97,Concorde Lounge,London Heathrow Airport,First Class,November 2016,,5.0,5.0,5.0,5.0,,yes,5.0,,As we were flying in BA First from London to W...
98,Terraces Lounge,San Francisco Airport,Business Class,September 2016,Business,3.0,4.0,4.0,1.0,,no,4.0,4.0,✅ Verified Review | Terrible food. Stale/dry ...
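
The `Date Visit` column above (and `Date Flown` in the airline reviews) holds strings such as `June 2023`. Before any time-based analysis these can be turned into real dates. The following is a sketch under the assumption that every non-empty value follows the "Month YYYY" pattern seen in the output; the helper name `parse_month_year` is hypothetical.

```python
from datetime import datetime


def parse_month_year(value):
    """Parse strings such as 'June 2023'; return None for empty values.

    The day is not part of the source data, so it defaults to the 1st.
    """
    value = (value or "").strip()
    if not value:
        return None
    return datetime.strptime(value, "%B %Y").date()


samples = ["June 2023", "April 2023", "November 2016", ""]
print([parse_month_year(v) for v in samples])
```

Once parsed, the values sort chronologically, which the raw strings do not.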
