# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com] you can see that there is a lot of data there. For this task, we are only interested in reviews of British Airways and of the airline's own services.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now, we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then to collect the text data on each of the individual review links.
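As a minimal sketch of that idea, the snippet below parses a small hand-written HTML fragment instead of the live site. The `text_content` class name is the one the scraper in this notebook looks for; the fragment itself and the names `sample_html` and `sample_reviews` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a review page (illustrative, not real site HTML)
sample_html = """
<div class="text_content">First review text.</div>
<div class="text_content">Second review text.</div>
<div class="other">Not a review.</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collect the text of every <div class="text_content"> element
sample_reviews = [
    div.get_text(strip=True)
    for div in soup.find_all("div", {"class": "text_content"})
]
print(sample_reviews)
```

Only the two `text_content` blocks are picked up; the third `div` is ignored because its class does not match.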

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The scraped data, 2000 records in total, is saved as a CSV file. The next step is to convert it into a meaningful format.

## Scraping Airline Reviews Data

In [3]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 20 #10
page_size = 100 #100

reviews = []
aircraft = []
seat_type = []
route = []
recommended = []
df = pd.DataFrame()

# for i in range(1, pages + 1):
for i in range(1, pages + 1):

    rating = []
    category = []

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    for para2 in parsed_content.find_all("div", {"class" : "review-stats"}):
        for para3 in para2.find_all('td',{'class' : 'review-value'}):
            rating.append(para3.get_text())
        recomend = rating[-1]
        rating = rating[:-1]

        for para4 in para2.find_all('td',{'class' : 'review-rating-stars stars'}):
            para5 = len(para4.find_all('span', {'class' : 'star fill'}))
            rating.append(para5)
        rating.append(recomend)
        #print(rating)

        for para6 in para2.find_all('td',{'class' : 'review-rating-header'}):
            category.append(para6.get_text())

        #print(category)
        # Build a one-row DataFrame whose columns are the category names
        data_dict = pd.DataFrame([rating], columns=category)
        # Append the row to the running DataFrame (DataFrame.append was removed in pandas 2.0)
        df = pd.concat([df, data_dict], ignore_index=True)
        #print(df)
        rating = []
        category = []

    print(f"   ---> {len(reviews)} total reviews")

df["reviews"] = reviews

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
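The loop above relies on the text values, the star ratings and the `Recommended` flag appearing in a fixed order. One alternative, sketched here under the assumption that each rating sits in its own table row next to its header, is to read header and value row by row and collect one dictionary per review. The helper name `parse_review_stats` and the sample markup are hypothetical; the real page structure may differ.

```python
from bs4 import BeautifulSoup
import pandas as pd

def parse_review_stats(stats_div):
    """Return {category: value} for one review-stats block (hypothetical helper)."""
    record = {}
    for row in stats_div.find_all("tr"):
        header = row.find("td", {"class": "review-rating-header"})
        if header is None:
            continue
        name = header.get_text(strip=True)
        stars = row.find("td", {"class": "review-rating-stars"})
        if stars is not None:
            # Star ratings: count the filled stars
            record[name] = len(stars.find_all("span", {"class": "fill"}))
        else:
            value = row.find("td", {"class": "review-value"})
            record[name] = value.get_text(strip=True) if value else None
    return record

# Hand-written markup that mimics one review (an assumption, not copied from the site)
sample_html = """
<div class="review-stats"><table>
<tr><td class="review-rating-header">Seat Type</td><td class="review-value">Economy Class</td></tr>
<tr><td class="review-rating-header">Value For Money</td>
    <td class="review-rating-stars stars"><span class="star fill">1</span><span class="star fill">2</span><span class="star">3</span></td></tr>
<tr><td class="review-rating-header">Recommended</td><td class="review-value">no</td></tr>
</table></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
records = [parse_review_stats(div) for div in soup.find_all("div", {"class": "review-stats"})]

# Building the frame once from a list of dicts avoids calling pd.concat per review
stats_df = pd.DataFrame(records)
print(stats_df)
```

Categories that a review does not report simply come out as missing values when the list of dictionaries is turned into a DataFrame.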


In [4]:
df.to_csv("Airline_reviews_data")
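By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a tiny made-up frame, using in-memory buffers instead of files; the frame contents and variable names are illustrative only.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "Seat Type": ["Economy Class", "Business Class"],
    "Recommended": ["no", "yes"],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False writes only the real columns
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'Seat Type', 'Recommended']
print(cols_without)  # ['Seat Type', 'Recommended']
```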

In [5]:
df.head()

Unnamed: 0,Type Of Traveller,Seat Type,Route,Date Flown,Ground Service,Value For Money,Recommended,Seat Comfort,Cabin Staff Service,Food & Beverages,Inflight Entertainment,Aircraft,Wifi & Connectivity,reviews
0,Solo Leisure,Economy Class,Boston to London,May 2024,1.0,3,no,,,,,,,✅ Trip Verified | Really terrible user experi...
1,Couple Leisure,Economy Class,New York to Manchester via London,May 2024,4.0,4,yes,4.0,5.0,5.0,5.0,,,✅ Trip Verified | Very impressed with BA. Chec...
2,Family Leisure,Business Class,London to San Francisco,August 2023,5.0,2,no,2.0,5.0,4.0,4.0,A380 / Boeing 777-200ER,,"✅ Trip Verified | LHR - SFO, LAS - LGW August..."
3,Couple Leisure,Economy Class,Malaga to Boston via London,May 2024,1.0,2,no,2.0,2.0,1.0,2.0,,1.0,Not Verified | I flew from Malaga via LHR to...
4,Couple Leisure,Business Class,Milan Linate to Miami via London Heathrow,April 2024,2.0,1,no,1.0,2.0,3.0,2.0,A380,,✅ Trip Verified | Milan to Miami return via L...


## Scraping Airline Seat Data

In [6]:
base_url = "https://www.airlinequality.com/seat-reviews/british-airways"
pages = 20 #10
page_size = 100 #100

reviews = []
aircraft = []
seat_type = []
route = []
recommended = []
df = pd.DataFrame()

# for i in range(1, pages + 1):
for i in range(1, pages + 1):

    rating = []
    category = []

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    for para2 in parsed_content.find_all("div", {"class" : "review-stats"}):
        for para3 in para2.find_all('td',{'class' : 'review-value'}):
            rating.append(para3.get_text())
        recomend = rating[-1]
        rating = rating[:-1]

        for para4 in para2.find_all('td',{'class' : 'review-rating-stars stars'}):
            para5 = len(para4.find_all('span', {'class' : 'star fill'}))
            rating.append(para5)
        rating.append(recomend)
        #print(rating)

        for para6 in para2.find_all('td',{'class' : 'review-rating-header'}):
            category.append(para6.get_text())

        #print(category)
        # Build a one-row DataFrame whose columns are the category names
        data_dict = pd.DataFrame([rating], columns=category)
        # Append the row to the running DataFrame (DataFrame.append was removed in pandas 2.0)
        df = pd.concat([df, data_dict], ignore_index=True)
        #print(df)
        rating = []
        category = []

    print(f"   ---> {len(reviews)} total reviews")

df["reviews"] = reviews

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 197 total reviews
Scraping page 3
   ---> 197 total reviews
Scraping page 4
   ---> 197 total reviews
Scraping page 5
   ---> 197 total reviews
Scraping page 6
   ---> 197 total reviews
Scraping page 7
   ---> 197 total reviews
Scraping page 8
   ---> 197 total reviews
Scraping page 9
   ---> 197 total reviews
Scraping page 10
   ---> 197 total reviews
Scraping page 11
   ---> 197 total reviews
Scraping page 12
   ---> 197 total reviews
Scraping page 13
   ---> 197 total reviews
Scraping page 14
   ---> 197 total reviews
Scraping page 15
   ---> 197 total reviews
Scraping page 16
   ---> 197 total reviews
Scraping page 17
   ---> 197 total reviews
Scraping page 18
   ---> 197 total reviews
Scraping page 19
   ---> 197 total reviews
Scraping page 20
   ---> 197 total reviews
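The running total stops growing at 197 after page 2, so the remaining requests return no new reviews. A minimal sketch of stopping at the first empty page is shown below. To keep it runnable offline, the network call is replaced by a hypothetical `fetch_page` callable that maps a page number to a list of review texts; `scrape_until_empty` and `fake_fetch` are illustrative names, not part of the notebook.

```python
def scrape_until_empty(fetch_page, max_pages=20):
    """Collect reviews page by page, stopping at the first empty page.

    fetch_page is a hypothetical callable: page number -> list of review texts.
    """
    collected = []
    for page in range(1, max_pages + 1):
        page_reviews = fetch_page(page)
        if not page_reviews:
            print(f"Page {page} returned no reviews, stopping")
            break
        collected.extend(page_reviews)
    return collected

# Stand-in for the network call: 100 reviews, then 97, then nothing
fake_pages = {1: ["review"] * 100, 2: ["review"] * 97}
calls = []

def fake_fetch(page):
    calls.append(page)
    return fake_pages.get(page, [])

all_reviews = scrape_until_empty(fake_fetch)
print(len(all_reviews), calls)
```

With the stand-in data, only three pages are requested instead of twenty.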


In [7]:
df.head()

Unnamed: 0,Seat Type,Aircraft Type,Seat Layout,Date Flown,Type Of Traveller,Sleep Comfort,Sitting Comfort,Seat/bed Width,Seat/bed Length,Seat Privacy,Power Supply,Seat Storage,Recommended,Seat Legroom,Seat Recline,Seat Width,Aisle Space,Viewing Tv Screen,reviews
0,Business Class,Boeing 787,2x3x2,May 2024,Leisure,1.0,2.0,1.0,3.0,4.0,1.0,5.0,no,,,,,,✅ Trip Verified | One of the worst business c...
1,Economy Class,A320,3x3,April 2024,Couple Leisure,,,,,,1.0,1.0,no,1.0,1.0,1.0,3.0,,✅ Trip Verified | BA doesn't seem to understa...
2,Business Class,Boeing 787-9,2x3x2,May 2023,Leisure,1.0,1.0,1.0,1.0,1.0,1.0,1.0,no,,,,,,✅ Trip Verified | Boeing 787-9 - not worth th...
3,Economy Class,A350,3-3-3,April 2023,Couple Leisure,,,,,,1.0,1.0,no,1.0,1.0,1.0,1.0,3.0,✅ Trip Verified | Worst seat we have ever had...
4,Premium Economy,Boeing 777,3x4x3,November 2022,Couple Leisure,,,,,,4.0,4.0,no,4.0,4.0,5.0,1.0,1.0,"✅ Trip Verified | Did research, premium econo..."


In [8]:
# Saving airline seat data
df.to_csv("Airline_seat_reviews_data.csv")

## Scraping Airline Lounge Data

In [9]:
base_url = "https://www.airlinequality.com/lounge-reviews/british-airways"
pages = 20 #10
page_size = 100 #100

reviews = []
aircraft = []
seat_type = []
route = []
recommended = []
df = pd.DataFrame()

# for i in range(1, pages + 1):
for i in range(1, pages + 1):

    rating = []
    category = []

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    for para2 in parsed_content.find_all("div", {"class" : "review-stats"}):
        for para3 in para2.find_all('td',{'class' : 'review-value'}):
            rating.append(para3.get_text())
        recomend = rating[-1]
        rating = rating[:-1]

        for para4 in para2.find_all('td',{'class' : 'review-rating-stars stars'}):
            para5 = len(para4.find_all('span', {'class' : 'star fill'}))
            rating.append(para5)
        rating.append(recomend)
        #print(rating)

        for para6 in para2.find_all('td',{'class' : 'review-rating-header'}):
            category.append(para6.get_text())

        #print(category)
        # Build a one-row DataFrame whose columns are the category names
        data_dict = pd.DataFrame([rating], columns=category)
        # Append the row to the running DataFrame (DataFrame.append was removed in pandas 2.0)
        df = pd.concat([df, data_dict], ignore_index=True)
        #print(df)
        rating = []
        category = []

    print(f"   ---> {len(reviews)} total reviews")

df["reviews"] = reviews

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 421 total reviews
Scraping page 6
   ---> 421 total reviews
Scraping page 7
   ---> 421 total reviews
Scraping page 8
   ---> 421 total reviews
Scraping page 9
   ---> 421 total reviews
Scraping page 10
   ---> 421 total reviews
Scraping page 11
   ---> 421 total reviews
Scraping page 12
   ---> 421 total reviews
Scraping page 13
   ---> 421 total reviews
Scraping page 14
   ---> 421 total reviews
Scraping page 15
   ---> 421 total reviews
Scraping page 16
   ---> 421 total reviews
Scraping page 17
   ---> 421 total reviews
Scraping page 18
   ---> 421 total reviews
Scraping page 19
   ---> 421 total reviews
Scraping page 20
   ---> 421 total reviews


In [10]:
df.head()

Unnamed: 0,Lounge Name,Airport,Type Of Lounge,Date Visit,Type Of Traveller,Comfort,Cleanliness,Bar & Beverages,Catering,Washrooms,Staff Service,Recommended,Wifi Connectivity,reviews
0,Business Class,Barbados Grantley Adams Airport,Business Class,March 2024,Business,3.0,3.0,3.0,3.0,1.0,3.0,yes,,✅ Trip Verified | The lounge is clean but the...
1,,London Heathrow Airport,Business Class,December 2023,Business,5.0,5.0,4.0,5.0,5.0,5.0,yes,5.0,✅ Trip Verified | The lounge is very spacious...
2,South Terminal,London Gatwick Airport,Business Class,September 2023,Business,1.0,1.0,2.0,2.0,1.0,1.0,no,3.0,"✅ Trip Verified | Tatty and uncared for, the ..."
3,First Lounge - T5,London Heathrow Airport,First Class,,,2.0,2.0,3.0,2.0,1.0,2.0,no,2.0,"✅ Trip Verified | Crowded or rather ""overcrow..."
4,T5 Galleries South,London Heathrow Airport,Business Class,June 2023,Business,4.0,4.0,5.0,5.0,4.0,,yes,,✅ Trip Verified | Flew London to Kalamata and...


In [11]:
df.to_csv("Airline_lounge_reviews_data.csv")

## Cleaning the data

In [30]:
data = pd.read_csv("/content/Airline_reviews_data")

In [31]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              2000 non-null   int64  
 1   Type Of Traveller       1997 non-null   object 
 2   Seat Type               2000 non-null   object 
 3   Route                   1997 non-null   object 
 4   Date Flown              2000 non-null   object 
 5   Ground Service          1933 non-null   float64
 6   Value For Money         2000 non-null   int64  
 7   Recommended             2000 non-null   object 
 8   Seat Comfort            1889 non-null   float64
 9   Cabin Staff Service     1877 non-null   float64
 10  Food & Beverages        1637 non-null   float64
 11  Inflight Entertainment  1140 non-null   float64
 12  Aircraft                1210 non-null   object 
 13  Wifi & Connectivity     495 non-null    float64
 14  reviews                 2000 non-null   object 

In [35]:
data.head()

Unnamed: 0.1,Unnamed: 0,Type Of Traveller,Seat Type,Route,Date Flown,Ground Service,Value For Money,Recommended,Seat Comfort,Cabin Staff Service,Food & Beverages,Inflight Entertainment,Aircraft,Wifi & Connectivity,reviews
0,0,Solo Leisure,Economy Class,Boston to London,May 2024,1.0,3,no,,,,,,,✅ Trip Verified | Really terrible user experi...
1,1,Couple Leisure,Economy Class,New York to Manchester via London,May 2024,4.0,4,yes,4.0,5.0,5.0,5.0,,,✅ Trip Verified | Very impressed with BA. Chec...
2,2,Family Leisure,Business Class,London to San Francisco,August 2023,5.0,2,no,2.0,5.0,4.0,4.0,A380 / Boeing 777-200ER,,"✅ Trip Verified | LHR - SFO, LAS - LGW August..."
3,3,Couple Leisure,Economy Class,Malaga to Boston via London,May 2024,1.0,2,no,2.0,2.0,1.0,2.0,,1.0,Not Verified | I flew from Malaga via LHR to...
4,4,Couple Leisure,Business Class,Milan Linate to Miami via London Heathrow,April 2024,2.0,1,no,1.0,2.0,3.0,2.0,A380,,✅ Trip Verified | Milan to Miami return via L...


In [32]:
data.isna().sum()

Unnamed: 0                   0
Type Of Traveller            3
Seat Type                    0
Route                        3
Date Flown                   0
Ground Service              67
Value For Money              0
Recommended                  0
Seat Comfort               111
Cabin Staff Service        123
Food & Beverages           363
Inflight Entertainment     860
Aircraft                   790
Wifi & Connectivity       1505
reviews                      0
dtype: int64
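The counts above are easier to compare as shares of the total. The sketch below computes the fraction of missing values per column on a small made-up frame and lists the columns that are mostly empty. The 50% cut-off and the toy values are assumptions chosen for illustration, not a rule taken from this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Value For Money": [3, 4, 2, 1],
    "Seat Comfort": [np.nan, 4.0, 2.0, 2.0],
    "Wifi & Connectivity": [np.nan, np.nan, np.nan, 1.0],
})

# isna() gives booleans, so the column mean is the fraction of missing values
missing_share = toy.isna().mean()

# Columns where more than half of the values are missing (assumed threshold)
sparse_columns = missing_share[missing_share > 0.5].index.tolist()
print(missing_share)
print(sparse_columns)
```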

### Splitting the data by type of traveller, since reviews vary with the type of traveller

In [33]:
data['Type Of Traveller'].value_counts()

Type Of Traveller
Couple Leisure    660
Solo Leisure      614
Business          445
Family Leisure    278
Name: count, dtype: int64
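One way to perform the split itself is `groupby`, sketched here on a small made-up frame. The dictionary name `by_traveller` and the toy rows are illustrative; rows with a missing traveller type are left out because `groupby` drops missing keys by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "Type Of Traveller": ["Solo Leisure", "Business", "Solo Leisure", None],
    "Value For Money": [3, 1, 5, 2],
})

# One sub-frame per traveller type; the row with a missing type is dropped
by_traveller = {name: group for name, group in toy.groupby("Type Of Traveller")}

print({name: len(group) for name, group in by_traveller.items()})
```

Each value in `by_traveller` is an ordinary DataFrame, so the same analysis can be run on every traveller type in turn.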

In [34]:
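# Note: 'Type Of Lounge' is a column of the lounge reviews data, not of the
# airline reviews loaded into `data`, which is why this lookup raises a KeyError
data['Type Of Lounge'].value_counts()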

KeyError: 'Type Of Lounge'

In [16]:
import re

# Define a function to clean the text
def clean(text):
    # Replace every run of non-letter characters (digits, punctuation, symbols)
    # with a single space, keeping only the letters A-Z and a-z
    text = re.sub('[^A-Za-z]+', ' ', str(text))
    return text
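`clean` works on a single string, so it has to be applied to each review in turn rather than to a whole Series or DataFrame. The sketch below repeats the function so that it runs on its own and applies it with `Series.apply`; the two sample strings are shortened, made-up reviews, and the extra `.str.strip()` is an optional addition that removes the leading or trailing space left by the substitution.

```python
import re
import pandas as pd

def clean(text):
    # Same rule as above: keep letters only, collapse everything else to a space
    return re.sub('[^A-Za-z]+', ' ', str(text))

sample = pd.Series([
    "✅ Trip Verified | LHR - SFO, August 2023.",
    "Not Verified | I flew from Malaga via LHR.",
])

# Series.apply calls clean once per review
cleaned = sample.apply(clean).str.strip()
print(cleaned.tolist())
```

Note that the verification prefix ("Trip Verified" / "Not Verified") survives the cleaning, because it consists of letters.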

Unnamed: 0,Lounge Name,Airport,Type Of Lounge,Date Visit,Type Of Traveller,Comfort,Cleanliness,Bar & Beverages,Catering,Washrooms,Staff Service,Recommended,Wifi Connectivity,reviews,Cleaned Reviews
0,Business Class,Barbados Grantley Adams Airport,Business Class,March 2024,Business,3.0,3.0,3.0,3.0,1.0,3.0,yes,,✅ Trip Verified | The lounge is clean but the...,Trip Verified The lounge is clean but the sea...
1,,London Heathrow Airport,Business Class,December 2023,Business,5.0,5.0,4.0,5.0,5.0,5.0,yes,5.0,✅ Trip Verified | The lounge is very spacious...,Trip Verified The lounge is very spacious wit...
2,South Terminal,London Gatwick Airport,Business Class,September 2023,Business,1.0,1.0,2.0,2.0,1.0,1.0,no,3.0,"✅ Trip Verified | Tatty and uncared for, the ...",Trip Verified Tatty and uncared for the BA bu...
3,First Lounge - T5,London Heathrow Airport,First Class,,,2.0,2.0,3.0,2.0,1.0,2.0,no,2.0,"✅ Trip Verified | Crowded or rather ""overcrow...",Trip Verified Crowded or rather overcrowded a...
4,T5 Galleries South,London Heathrow Airport,Business Class,June 2023,Business,4.0,4.0,5.0,5.0,4.0,,yes,,✅ Trip Verified | Flew London to Kalamata and...,Trip Verified Flew London to Kalamata and as ...


In [26]:
# Take an independent copy of the review text so that later assignments
# do not act on a slice of `data`
reviews_data = data[['reviews']].copy()
reviews_data['reviews'].head()

0    ✅ Trip Verified |  Really terrible user experi...
1    ✅ Trip Verified | Very impressed with BA. Chec...
2    ✅ Trip Verified |  LHR - SFO, LAS - LGW August...
3    Not Verified |   I flew from Malaga via LHR to...
4    ✅ Trip Verified |  Milan to Miami return via L...
Name: reviews, dtype: object

In [24]:
# Apply `clean` to each review individually; passing the whole object to
# `clean` would convert it to one string instead of cleaning row by row
reviews_data['clean_reviews'] = reviews_data['reviews'].apply(clean)
reviews_data['reviews'].head()


0    ✅ Trip Verified |  Really terrible user experi...
1    ✅ Trip Verified | Very impressed with BA. Chec...
2    ✅ Trip Verified |  LHR - SFO, LAS - LGW August...
3    Not Verified |   I flew from Malaga via LHR to...
4    ✅ Trip Verified |  Milan to Miami return via L...
Name: reviews, dtype: object