# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you have collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit https://www.airlinequality.com, you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to https://www.airlinequality.com/airline-reviews/british-airways, you will see this data. We can now use Python and `BeautifulSoup` to collect all the links to the reviews and then collect the text data from each individual review link.
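
As a minimal sketch of the pagination step, the URL of each listing page can be assembled from the base URL, a page number, and a page size. The helper name `build_page_url` is hypothetical and not part of the original notebook; the `sortby` and `pagesize` query parameters are the ones that appear in the URL used by the scraping cell in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(base_url: str, page: int, page_size: int = 100) -> str:
    """Return the URL of one paginated review listing (hypothetical helper)."""
    # urlencode percent-escapes the ':' in the sort key
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


# URLs for the first three listing pages
page_urls = [build_page_url(BASE_URL, page) for page in range(1, 4)]
```

Building the query string with `urlencode` keeps the escaping in one place, which may be easier to maintain than a hand-written `%3A` inside an f-string.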

In [77]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [78]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 1
page_size = 100

rating_list = []
text_header_list = []
text_sub_header_list = []
reviews_list = []
reviews_detail_list = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail early on a timeout or HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    parsed_content = BeautifulSoup(response.content, 'html.parser')

    # Each review sits in its own <article itemprop="review"> element
    for para in parsed_content.find_all("article", {"itemprop": "review"}):
        rating = para.find("div", {"class": "rating-10"})
        text_header = para.find("h2", {"class": "text_header"})
        text_sub_header = para.find("h3", {"class": "text_sub_header userStatusWrapper"})
        reviews = para.find("div", {"class": "text_content"})

        # Collect the rows of the ratings table as a {header: value} dict
        review_detail = {}
        for row in para.find_all('tr'):
            key_cell = row.find('td', class_='review-rating-header')
            value_cell = row.find('td', class_='review-value') or row.find('td', class_='review-rating-stars')
            if key_cell and value_cell:
                key = key_cell.text.strip()
                value = value_cell.text.strip()
                if 'stars' in value_cell['class']:
                    # Star ratings: count the filled stars rather than using the cell text
                    value = len(value_cell.find_all('span', class_='star fill'))
                review_detail[key] = value
        reviews_detail_list.append(review_detail)

        rating_list.append(rating.get_text())
        text_header_list.append(text_header.get_text())
        text_sub_header_list.append(text_sub_header.get_text())
        reviews_list.append(reviews.get_text())
    
# Strip newlines and quotation marks left over from the page markup
rating_list = [x.replace('\n', '') for x in rating_list]
text_header_list = [x.replace('"', '').replace('”', '').replace('“', '') for x in text_header_list]
text_sub_header_list = [x.replace('\n', '') for x in text_sub_header_list]

print(f"   ---> {len(rating_list)} total rating_list")
print(f"   ---> {len(text_header_list)} total text_header_list")
print(f"   ---> {len(text_sub_header_list)}  total text_sub_header_list")
print(f"   ---> {len(reviews_list)} total reviews_list")
print(f"   ---> {len(reviews_detail_list)} total reviews_detail_list")




Scraping page 1
   ---> 100 total rating_list
   ---> 100 total text_header_list
   ---> 100  total text_sub_header_list
   ---> 100 total reviews_list
   ---> 100 total reviews_detail_list


In [79]:
def split_helper(text):
    # Expects "<name> (<location>) <date>" and splits on the parentheses
    parts = text.split("(")
    name = parts[0].strip()
    location, date = parts[1].split(")")
    return name, location.strip(), date.strip()
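
The `user_name` value `39 reviewsG Jones` in the output further down suggests that a review-count label can end up fused onto the reviewer's name. The following is a minimal sketch of a more tolerant parser, assuming the sub-header always has the shape `<optional "N reviews"><name> (<location>) <date>`. The names `SUB_HEADER_RE` and `parse_sub_header` are hypothetical and not part of the original notebook.

```python
import re

# Optional "<N> reviews" prefix, then "<name> (<location>) <date>"
SUB_HEADER_RE = re.compile(
    r"^\s*(?:\d+\s+reviews?\s*)?(?P<name>.*?)\s*\((?P<location>[^)]*)\)\s*(?P<date>.*?)\s*$"
)


def parse_sub_header(text: str) -> tuple[str, str, str]:
    """Split a sub-header into (name, location, date); hypothetical helper."""
    match = SUB_HEADER_RE.match(text)
    if match is None:
        # Surface unexpected layouts instead of guessing
        raise ValueError(f"Unrecognised sub-header: {text!r}")
    return match["name"], match["location"].strip(), match["date"]
```

Unlike a plain `split("(")`, this raises a descriptive error when the parentheses are missing. It could still misread a lower-case name that begins with "s" directly after a singular "1 review" label, so it is worth spot-checking the results.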

In [80]:
user_name_list=[]
location_list=[]
date_list=[]

In [81]:
for item in text_sub_header_list:
    name, location, date = split_helper(item)
    user_name_list.append(name)
    location_list.append(location)
    date_list.append(date)


In [82]:
all_keys_feature_name_list = set()
for item in reviews_detail_list:
    all_keys_feature_name_list.update(item.keys())
all_keys_feature_name_list

{'Aircraft',
 'Cabin Staff Service',
 'Date Flown',
 'Food & Beverages',
 'Ground Service',
 'Inflight Entertainment',
 'Recommended',
 'Route',
 'Seat Comfort',
 'Seat Type',
 'Type Of Traveller',
 'Value For Money',
 'Wifi & Connectivity'}
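
Because each review reports only some of these fields, the per-review dictionaries have different keys. As a small sketch of how they line up into uniform rows, the example below uses two made-up dictionaries (illustrative sample values, not scraped data) and fills missing fields with `None`.

```python
# Two per-review dicts with different keys (illustrative sample values)
sample_details = [
    {"Seat Type": "Economy Class", "Seat Comfort": 3, "Aircraft": "A320"},
    {"Seat Type": "Business Class", "Recommended": "no"},
]

# Union of all keys, sorted so the column order is reproducible
columns = sorted(set().union(*sample_details))

# One row per review; .get() returns None where a review has no such field
rows = [{col: detail.get(col) for col in columns} for detail in sample_details]
```

pandas can perform the same alignment directly: `pd.DataFrame(reviews_detail_list)` creates one column per key and leaves missing entries as `NaN`.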

In [83]:
reviews_list

["✅ Trip Verified | An excellent flight! Despite this being a 4.5 hour flight in a A320 that is configured for short hops to Edinburgh or Glasgow, and despite there being no IFE, I ended up really enjoying the flight. I don't know if it was just that I had really low expectations, but I thought the crew were fabulous, and the food and wine was surprisingly good. The plane itself looked very new and was in perfect condition. From the very start, I really liked the crew - they were all cheerful, genuine, and clearly enjoying their work. I noticed they constantly had friendly and proactive interactions with passengers, and when the food and drink came around, they were generous with the wine - and happily provided top-ups later in the flight. The food box was deceptively small - it actually contained a decent selection of food, and was more than enough to keep me feeling satisfied on this trip. There were regular, and detailed, updates from the cockpit which I also appreciated. So whilst 

In [84]:
vertify_list = []
for i, review in enumerate(reviews_list):
    # Reviews look like "✅ Trip Verified | <text>"; partition tolerates a missing or repeated "|"
    status, sep, body = review.partition("|")
    vertify_list.append("Trip Verified" if sep and "Trip Verified" in status else "Not Verified")
    reviews_list[i] = body if sep else review



In [85]:
len(reviews_list)

100

In [86]:
df = pd.DataFrame()
df['date']=date_list
df['rating_list']=rating_list
df["text_header"]=text_header_list
df["Trip_Verified"]=vertify_list
df["reviews"] = reviews_list
df["user_name"] = user_name_list
df["location"]=location_list

df.head()

Unnamed: 0,date,rating_list,text_header,Trip_Verified,reviews,user_name,location
0,12th August 2024,8/10,the crew were fabulous,Trip Verified,An excellent flight! Despite this being a 4.5...,39 reviewsG Jones,Lebanon
1,11th August 2024,3/10,customer service has been horrible,Trip Verified,I recently traveled with British Airways and ...,Edward King,United States
2,9th August 2024,1/10,Not as much as an apology,Trip Verified,My family and I were booked to leave London...,N Kwok,United Kingdom
3,8th August 2024,2/10,no one attending or caring,Not Verified,We had to change from AA to BA for a flight ...,Greg Szczurek,United States
4,8th August 2024,2/10,disappointed in British Airways,Trip Verified,After paying $6500 for tickets for my family ...,G Cooper,United States
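
The `date` and `rating_list` columns shown above are still plain strings such as `12th August 2024` and `8/10`. For later analysis it may help to convert them. The sketch below uses only the standard library, assumes English month names and the `N/10` rating format seen in this output, and the helper names `parse_review_date` and `parse_rating` are hypothetical.

```python
import re
from datetime import date, datetime


def parse_review_date(text: str) -> date:
    """Convert e.g. '12th August 2024' to a date (hypothetical helper)."""
    # Drop the ordinal suffix that follows the day number: 12th -> 12
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    # %B matches the full month name in the current locale (English assumed)
    return datetime.strptime(cleaned, "%d %B %Y").date()


def parse_rating(text: str) -> int:
    """Convert e.g. '8/10' to the integer 8 (hypothetical helper)."""
    return int(text.strip().split("/")[0])
```

Applied to the DataFrame, this could look like `df["date"].map(parse_review_date)` and `df["rating_list"].map(parse_rating)`.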


In [87]:
for item in all_keys_feature_name_list:
    df[item]=None

In [88]:
for i in range(len(df)):
    for feature in all_keys_feature_name_list:
        # .loc avoids chained assignment, which can silently write to a copy
        df.loc[i, feature] = reviews_detail_list[i].get(feature)

In [89]:
display(df)

Unnamed: 0,date,rating_list,text_header,Trip_Verified,reviews,user_name,location,Ground Service,Aircraft,Inflight Entertainment,Cabin Staff Service,Route,Recommended,Date Flown,Seat Type,Type Of Traveller,Wifi & Connectivity,Value For Money,Food & Beverages,Seat Comfort
0,12th August 2024,8/10,the crew were fabulous,Trip Verified,An excellent flight! Despite this being a 4.5...,39 reviewsG Jones,Lebanon,4,A320,,5,London to Amman,yes,August 2024,Economy Class,Solo Leisure,,4,4,3
1,11th August 2024,3/10,customer service has been horrible,Trip Verified,I recently traveled with British Airways and ...,Edward King,United States,2,A380,4,4,Barcelona to Dallas via Heathrow,no,August 2024,Economy Class,Family Leisure,,3,4,3
2,9th August 2024,1/10,Not as much as an apology,Trip Verified,My family and I were booked to leave London...,N Kwok,United Kingdom,1,,,,London to Hong Kong,no,August 2024,Economy Class,Family Leisure,,1,,1
3,8th August 2024,2/10,no one attending or caring,Not Verified,We had to change from AA to BA for a flight ...,Greg Szczurek,United States,2,,2,2,Dallas to London,no,May 2024,Economy Class,Couple Leisure,2,2,2,2
4,8th August 2024,2/10,disappointed in British Airways,Trip Verified,After paying $6500 for tickets for my family ...,G Cooper,United States,3,Boeing 777,4,1,London to Tampa,no,August 2024,Economy Class,Family Leisure,3,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,30th January 2024,9/10,no great expectations,Trip Verified,I flew to LHR from ATH in Club Europe with B...,Makoto Hashimoto,Japan,4,,,5,Athens to London,yes,December 2023,Business Class,Solo Leisure,3,4,5,3
96,29th January 2024,8/10,actively try and split families up,Trip Verified,I like the British Airways World Traveller P...,Clarke Roper,United Kingdom,1,Boeing 777-200,4,4,London Gatwick to Cancun,yes,January 2024,Premium Economy,Family Leisure,4,4,4,4
97,28th January 2024,3/10,my cabin luggage was taken,Trip Verified,I have come to boarding and my cabin luggage...,Mariia Volkovq,Ukraine,1,,2,3,London to Warsaw,yes,January 2024,Economy Class,Business,3,3,3,3
98,26th January 2024,2/10,appalling service,Trip Verified,Stinking nappies being changed in business ca...,M Baker,United Kingdom,3,,2,1,London Heathrow to Miami,no,December 2023,Business Class,Family Leisure,,2,2,2


In [90]:
# Assumes the data/ directory already exists; to_csv does not create it
df.to_csv("data/BA_reviews.csv")