# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit <https://www.airlinequality.com>, you can see that there is a lot of data there. For this task, we are only interested in the reviews that relate to British Airways and to the airline itself.

If you navigate to <https://www.airlinequality.com/airline-reviews/british-airways> you will see this data. We can now use Python and `BeautifulSoup` to step through the paginated list of reviews and collect the review text and the rating details shown on each page.
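As a small illustration, the paginated listing can be addressed by page number and page size. The sketch assumes the `/page/<n>/?sortby=...&pagesize=...` pattern used in the scraping cell of this notebook; `build_page_url` is a hypothetical helper name, and the site may change its URL scheme at any time.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page_num, page_size=100, base_url=BASE_URL):
    """Return the listing URL for one page (hypothetical helper)."""
    # urlencode percent-encodes the ":" in "post_date:Desc" as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page_num}/?{query}"


urls = [build_page_url(n) for n in range(1, 4)]
for url in urls:
    print(url)
```

Building the query string with `urlencode` keeps the encoding in one place, so changing the sort key or page size does not require editing an f-string by hand.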

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [87]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 36
page_size = 100

dfs = []  # List to store data frames for each page

for page_num in range(1, pages + 1):
    page_reviews = []
    page_stats = []

    print(f"Scraping page {page_num}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{page_num}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; stop early on HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    parsed_content = BeautifulSoup(response.content, 'html.parser')

    review_divs = parsed_content.find_all("div", {"class": "text_content"})
    stats_divs = parsed_content.find_all("div", {"class": "review-stats"})

    for review_div, stats_div in zip(review_divs, stats_divs):
        page_reviews.append(review_div.get_text())

        rating_values = []
        category_values = []

        # Text-valued rows (traveller type, route, ...); the last one is the recommendation
        for rating_td in stats_div.find_all('td', {'class': 'review-value'}):
            rating_values.append(rating_td.get_text())

        # Set the recommendation aside so it can be re-appended after the star ratings
        recommendation = rating_values[-1]
        rating_values = rating_values[:-1]

        # Star-rated rows: the score is the number of filled stars
        for stars_td in stats_div.find_all('td', {'class': 'review-rating-stars stars'}):
            num_stars = len(stars_td.find_all('span', {'class': 'star fill'}))
            rating_values.append(num_stars)
        rating_values.append(recommendation)

        # Category names, in the order they appear on the page
        for header_td in stats_div.find_all('td', {'class': 'review-rating-header'}):
            category_values.append(header_td.get_text())

        # Pair each category with its value by position
        stats_data = {category_values[i]: rating_values[i] for i in range(len(category_values))}
        page_stats.append(stats_data)

    # Create a DataFrame for each page with the reviews and stats side by side
    df_page = pd.concat([pd.DataFrame({'Review': page_reviews}), pd.DataFrame(page_stats)], axis=1)

    # Append the page DataFrame to the list
    dfs.append(df_page)

# Concatenate all the data frames in the list
df = pd.concat(dfs, ignore_index=True)
print("Scraping complete")

Scraping page 1
Scraping page 2
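The loop above lines up category names and values by position, which relies on the text rows coming before the star rows and the recommendation coming last. An alternative sketch pairs each header with the value found in the same table row. The HTML fragment here is a made-up imitation of the markup implied by the class names used above, not a copy of the live page, and `parse_stats` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up fragment that imitates the structure the scraper expects
SAMPLE_HTML = """
<div class="review-stats"><table>
  <tr><td class="review-rating-header">Seat Type</td>
      <td class="review-value">Economy Class</td></tr>
  <tr><td class="review-rating-header">Seat Comfort</td>
      <td class="review-rating-stars stars">
        <span class="star fill">1</span><span class="star fill">2</span>
        <span class="star">3</span><span class="star">4</span><span class="star">5</span>
      </td></tr>
  <tr><td class="review-rating-header">Recommended</td>
      <td class="review-value">yes</td></tr>
</table></div>
"""


def parse_stats(stats_div):
    """Map each category name to the value in the same row (hypothetical helper)."""
    stats = {}
    for row in stats_div.find_all("tr"):
        header = row.find("td", class_="review-rating-header")
        if header is None:
            continue
        name = header.get_text(strip=True)
        stars_td = row.find("td", class_="review-rating-stars")
        if stars_td is not None:
            # Star rows: count the filled stars
            stats[name] = len(stars_td.find_all("span", class_="star fill"))
        else:
            # Text rows: take the cell text, or None if the cell is missing
            value_td = row.find("td", class_="review-value")
            stats[name] = value_td.get_text(strip=True) if value_td else None
    return stats


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
stats = parse_stats(soup.find("div", class_="review-stats"))
print(stats)
```

Pairing by row means a missing or reordered category cannot shift the remaining values into the wrong columns, at the cost of assuming that each category sits in its own `<tr>`.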


In [72]:
df.sample(10)

Unnamed: 0,Review,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Ground Service,Value For Money,Recommended,Aircraft,Food & Beverages,Inflight Entertainment,Wifi & Connectivity
134,✅ Trip Verified | Organised boarding process. ...,Business,Economy Class,London Heathrow to Glasgow,January 2023,5.0,5.0,5.0,5,yes,A320,4.0,,5.0
19,✅ Trip Verified | We booked on the BA website...,Couple Leisure,Business Class,London to Amman,March 2023,,,,1,no,,,,
24,✅ Trip Verified | Having experienced delays a...,Business,Business Class,Amsterdam to London,June 2023,3.0,3.0,2.0,2,no,A350,3.0,2.0,1.0
65,Not Verified | Regarding the aircraft and seat...,Couple Leisure,Business Class,Singapore to London,April 2023,4.0,2.0,3.0,1,no,Boeing 787,1.0,5.0,1.0
30,✅ Trip Verified | We were traveling as a fami...,Family Leisure,Economy Class,Gatwick to Venice,June 2023,3.0,5.0,1.0,3,no,,4.0,,
177,✅ Trip Verified | Extremely sub-par service. H...,Solo Leisure,Economy Class,San Francisco to London,November 2022,2.0,1.0,3.0,2,no,A380,2.0,2.0,1.0
4,✅ Trip Verified | My family and I have flown ...,Couple Leisure,Premium Economy,Chennai to London,July 2023,3.0,2.0,4.0,1,no,Boeing 777,1.0,1.0,
193,✅ Trip Verified | Just a few years ago flying...,Family Leisure,Economy Class,Larnaca to London,October 2022,1.0,2.0,2.0,2,no,A320,1.0,1.0,
164,✅ Trip Verified | Turned up 3.5 hours in advan...,Solo Leisure,Economy Class,London Heathrow to Bangkok via Doha,December 2022,2.0,2.0,1.0,1,no,,2.0,2.0,
95,Not Verified | I was meant to fly in January t...,Family Leisure,Economy Class,London to Algiers,May 2022,1.0,1.0,1.0,2,no,,,,
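Not every review fills in every category, so the per-review dictionaries have different keys. A minimal sketch with made-up values shows how pandas handles this: missing keys become `NaN`, and a numeric column that contains `NaN` is stored as float. That would be consistent with ratings such as `5.0` appearing next to an integer `Value For Money` in the sample above, although the sketch does not prove that this is the cause.

```python
import pandas as pd

# Made-up values for two reviews; the second has no "Seat Comfort" rating
page_reviews = ["First review text", "Second review text"]
page_stats = [
    {"Seat Type": "Economy Class", "Seat Comfort": 3, "Value For Money": 2},
    {"Seat Type": "Business Class", "Aircraft": "A380", "Value For Money": 4},
]

# One row per review: text on the left, stats on the right
stats_df = pd.DataFrame(page_stats)
df_page = pd.concat([pd.DataFrame({"Review": page_reviews}), stats_df], axis=1)

print(df_page.dtypes)
print(df_page)
```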


In [73]:
df.to_csv("data/BA_reviews.csv", index=False)
df

Unnamed: 0,Review,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Ground Service,Value For Money,Recommended,Aircraft,Food & Beverages,Inflight Entertainment,Wifi & Connectivity
0,✅ Trip Verified | Customer Service does not e...,Family Leisure,Business Class,London to Rhodes,July 2023,1.0,1.0,1.0,1,no,,,,
1,✅ Trip Verified | Another really great pair of...,Family Leisure,Business Class,Newcastle to Las Vegas via Heathrow,June 2023,4.0,5.0,4.0,4,yes,A320 A350,4.0,3.0,
2,Not Verified | Our A380 developed a fault tax...,Solo Leisure,Business Class,London to Miami,June 2023,1.0,1.0,1.0,1,no,A380,1.0,,
3,Not Verified | Horrible airline. Does not care...,Solo Leisure,Economy Class,Amman to London,July 2023,3.0,1.0,4.0,3,no,A320Neo,1.0,1.0,3.0
4,✅ Trip Verified | My family and I have flown ...,Couple Leisure,Premium Economy,Chennai to London,July 2023,3.0,2.0,4.0,1,no,Boeing 777,1.0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,✅ Trip Verified | Baby across the aisle cried ...,Solo Leisure,Business Class,Madrid to San Francisco via London,September 2022,3.0,1.0,3.0,1,no,A380,1.0,4.0,
196,✅ Trip Verified | Evening flight from LHR to ...,Solo Leisure,Economy Class,London to Washington,October 2022,3.0,5.0,2.0,5,yes,A380,5.0,3.0,
197,✅ Trip Verified | We boarded our flight at Ed...,Couple Leisure,Economy Class,London to San Francisco,October 2022,1.0,1.0,1.0,1,no,A380,3.0,3.0,5.0
198,✅ Trip Verified | While entering the aircraft...,Solo Leisure,Economy Class,London to Delhi,October 2022,2.0,2.0,4.0,1,no,,1.0,3.0,


Congratulations! You now have your dataset for this task. The loop above requests `pages` times `page_size` reviews, which is up to 3,600 with the values set in the cell; the output captured in this notebook shows two pages and a 200-row DataFrame. If you want to collect more data, increase the number of pages.

The next step is to clean this data and remove any unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it exists, as it's not relevant to what we want to investigate.

### Task 1.1 - Cleaning the data
---

In [84]:
print(df.shape)
df.sample(1)

(200, 15)


Unnamed: 0,Review,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Ground Service,Value For Money,Recommended,Aircraft,Food & Beverages,Inflight Entertainment,Wifi & Connectivity,Verified
84,Literally the worst flight of my life. After...,Couple Leisure,Economy Class,London to Malta,April 2023,3.0,1.0,1.0,1,no,,2.0,,,Not Verified


In [85]:
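# Split "<status> | <text>" on the first "|". Assigning the result of
# str.split('|', n=1, expand=True) to two columns raises
# "ValueError: Columns must be same length as key" when no review contains "|"
# (e.g. on a second run); str.partition always returns three columns instead.
parts = df['Review'].str.partition('|')
has_sep = parts[1] == '|'
df['Verified'] = parts[0].where(has_sep).str.strip()
df['Review'] = parts[2].where(has_sep, parts[0]).str.strip()

print(df.shape)
df.sample(5)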

In [None]:
# True only for rows whose status contains "Trip Verified"
df['Verified'] = df['Verified'].astype(str).str.contains('Trip Verified', regex=False)

df.sample(5)
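To see why splitting on `|` needs care, the sketch below runs on made-up rows. `str.split(..., expand=True)` returns as many columns as the longest split, so a Series with no separator yields a single column and cannot be assigned to two; `str.partition` always yields three. `split_verification` is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows: two with a status prefix, one without
mixed = pd.Series([
    "✅ Trip Verified | Organised boarding process.",
    "Not Verified | Horrible airline.",
    "Literally the worst flight of my life.",
])
# Rows that have already been cleaned and no longer contain "|"
already_clean = pd.Series(["Horrible airline.", "Great crew."])

split_mixed_shape = mixed.str.split("|", n=1, expand=True).shape
split_clean_shape = already_clean.str.split("|", n=1, expand=True).shape
print(split_mixed_shape, split_clean_shape)


def split_verification(reviews):
    """Return a DataFrame with a boolean 'Verified' and the cleaned 'Review'."""
    parts = reviews.str.partition("|")  # columns 0 (before), 1 (sep), 2 (after)
    has_sep = parts[1] == "|"
    status = parts[0].where(has_sep, "")
    text = parts[2].where(has_sep, parts[0]).str.strip()
    return pd.DataFrame({
        "Verified": status.str.contains("Trip Verified", regex=False),
        "Review": text,
    })


cleaned = split_verification(mixed)
print(cleaned)
```

Rows without a prefix are treated as not verified in this sketch; whether that is the right default depends on how the analysis uses the flag.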

In [7]:
# Save new file
df.to_csv("data/BA_reviews_clean.csv", index=False)
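`to_csv` does not create missing folders, so the save fails if `data/` does not exist yet. A minimal sketch, using a temporary directory and made-up rows, creates the folder first and reads the file back, which is also how a later session can restore `df`. `save_reviews` is a hypothetical helper name.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_reviews(frame, path):
    """Write frame to path as CSV, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path


# Made-up rows standing in for the cleaned reviews
demo = pd.DataFrame({
    "Review": ["Great crew.", "Late departure."],
    "Verified": [True, False],
})

with tempfile.TemporaryDirectory() as tmp:
    saved = save_reviews(demo, Path(tmp) / "data" / "BA_reviews_clean.csv")
    reloaded = pd.read_csv(saved)

print(reloaded)
```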

### Task 1.2 - Tokenisation of data
Using NLTK, I will tokenise the reviews, remove stopwords, and tag each remaining token with its part of speech (POS).

In [8]:
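# Requires the nltk package; install it first (for example with `%pip install nltk`)
# if this cell raises ModuleNotFoundError
import nltk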
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.corpus import wordnet

# Tokenisation
nltk.download('punkt')

# Stopwords
nltk.download('stopwords')

# POS tagging
nltk.download('averaged_perceptron_tagger')

# Wordnet
nltk.download('wordnet')

ModuleNotFoundError: No module named 'nltk'

In [None]:
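# pos_tag returns Penn Treebank tags, so map on their first letter:
# JJ* -> adjective, VB* -> verb, NN* -> noun, RB* -> adverb; other tags give None
pos_dict = {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN, 'R': wordnet.ADV}

# Build the stopword set once rather than once per token
stop_words = set(stopwords.words('english'))

def posTag(review):
    """Return (token, WordNet POS) pairs for the non-stopword tokens of a review."""
    tags = pos_tag(word_tokenize(review))
    return [(text, pos_dict.get(tag[0])) for text, tag in tags
            if text.lower() not in stop_words]

# If df is not defined (for example after a kernel restart), reload the cleaned
# file first: df = pd.read_csv("data/BA_reviews_clean.csv")
# The review text lives in the 'Review' column
df['POS Tag'] = df['Review'].apply(posTag)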

NameError: name 'df' is not defined
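The mapping step can be tried without NLTK installed. The sketch below assumes Penn Treebank tags, which is what `pos_tag` produces by default, and spells out NLTK's WordNet POS constants as the single letters they stand for (`'a'`, `'v'`, `'n'`, `'r'`). The tagged tokens are written by hand for illustration rather than produced by a tagger, the stopword set is a tiny stand-in for NLTK's English list, and `to_wordnet_pairs` is a hypothetical helper name.

```python
# First letter of a Penn Treebank tag -> WordNet POS letter
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

# Tiny illustrative stopword set; NLTK's English list is much longer
STOP_WORDS = {"the", "was", "and", "a", "very"}


def to_wordnet_pairs(tagged_tokens):
    """Drop stopwords and map each remaining tag to a WordNet POS (or None)."""
    return [
        (token, WORDNET_POS.get(tag[:1]))
        for token, tag in tagged_tokens
        if token.lower() not in STOP_WORDS
    ]


# Hand-written (token, tag) pairs in the format pos_tag returns
tagged = [
    ("The", "DT"), ("crew", "NN"), ("was", "VBD"), ("very", "RB"),
    ("friendly", "JJ"), ("and", "CC"), ("helped", "VBD"), ("quickly", "RB"),
]

pairs = to_wordnet_pairs(tagged)
print(pairs)
```

Tokens whose tag has no WordNet counterpart, such as determiners, map to `None`, so a later lemmatisation step needs to decide how to treat them.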