# Task 1

---


### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you will find a large collection of airline reviews. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. We can use `Python` and `BeautifulSoup` to step through the paginated review listing and collect, for each review, its text, date, seat type, and recommendation.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

In [3]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 20        # number of listing pages to fetch
page_size = 100   # reviews requested per page

reviews = []
review_recommendation = []
dates = []
seat_type = []

for i in tqdm(range(pages)):
    # Listing pages are numbered from 1 and sorted newest first
    url = f'{base_url}/page/{i+1}/?sortby=post_date%3ADesc&pagesize={page_size}'
    response = requests.get(url)
    parsed_content = BeautifulSoup(response.content, 'html.parser')

    # Review bodies and yes/no recommendation cells are matched together.
    # This assumes they alternate in document order: body, recommendation, body, ...
    matches = parsed_content.find_all(
        ['div', 'td'],
        {'class': ['text_content', 'review-value rating-yes', 'review-value rating-no']},
    )
    for j, paras in enumerate(matches):
        if j % 2 == 0:
            reviews.append(paras.get_text())
        else:
            review_recommendation.append(paras.get_text())

    # Publication date of each review
    for date in parsed_content.find_all('time'):
        dates.append(date.get('datetime'))

    # The seat type is read from the element right after the "cabin_flown" header cell
    for element in parsed_content.find_all('td', {'class': 'review-rating-header cabin_flown'}):
        seat_type.append(element.next_sibling.get_text())


In [4]:
df = pd.DataFrame()
df['date']=dates
df['seat_type']=seat_type
df["reviews"] = reviews
df['recommendation']=review_recommendation
df.head()

|   | date | seat_type | reviews | recommendation |
|---|---|---|---|---|
| 0 | 2023-05-23 | Premium Economy | Not Verified \| Top Ten REASONS to not use Brit... | no |
| 1 | 2023-05-23 | Economy Class | Not Verified \| Easy check in on the way to He... | yes |
| 2 | 2023-05-23 | Economy Class | ✅ Trip Verified \| Online check in worked fine... | yes |
| 3 | 2023-05-22 | Business Class | ✅ Trip Verified \|. The BA first lounge at Term... | no |
| 4 | 2023-05-22 | Business Class | Not Verified \| Paid a quick visit to Nice yest... | no |
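
Assigning four independently collected lists as columns only works if every list has the same length and order; a review with a missing field would shift every value after it, or make the assignment fail outright. One way to guard against that is to parse each review inside its own container and build one record per review. The following is a minimal sketch under assumptions: the `<article>` wrapper and the `SAMPLE_HTML` snippet are made up for illustration (the real page markup may differ), `parse_reviews` is a hypothetical helper, and only the class names `text_content`, `cabin_flown`, `rating-yes`, and `rating-no` are taken from the loop above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the live page may be structured differently.
SAMPLE_HTML = """
<article>
  <time datetime="2023-05-23">23rd May 2023</time>
  <div class="text_content">Not Verified | Easy check in.</div>
  <table>
    <tr><td class="review-rating-header cabin_flown">Seat Type</td><td class="review-value">Economy Class</td></tr>
    <tr><td class="review-rating-header">Recommended</td><td class="review-value rating-yes">yes</td></tr>
  </table>
</article>
<article>
  <time datetime="2023-05-22">22nd May 2023</time>
  <div class="text_content">Trip Verified | Long delay.</div>
  <table>
    <tr><td class="review-rating-header">Recommended</td><td class="review-value rating-no">no</td></tr>
  </table>
</article>
"""


def parse_reviews(html):
    """Return one dict per review; fields that are missing become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.find_all("article"):  # assumed per-review container
        body = article.find("div", class_="text_content")
        if body is None:
            continue  # not a review block
        time_tag = article.find("time")
        cabin_header = article.find("td", class_="cabin_flown")
        cabin_value = cabin_header.find_next_sibling("td") if cabin_header else None
        recommended = article.find("td", class_=["rating-yes", "rating-no"])
        rows.append({
            "date": time_tag.get("datetime") if time_tag else None,
            "seat_type": cabin_value.get_text(strip=True) if cabin_value else None,
            "reviews": body.get_text(strip=True),
            "recommendation": recommended.get_text(strip=True) if recommended else None,
        })
    return rows


rows = parse_reviews(SAMPLE_HTML)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` gives the same four columns, with the second sample review keeping its own date and recommendation even though its seat type is missing.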


Congratulations! You now have your dataset for this task. The loop above collected 2,000 reviews (20 pages of 100 reviews each) by iterating through the paginated listing on the website. If you want to collect more data, try increasing the value of `pages`.
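
The page URLs requested by the loop can also be generated separately, which makes it easy to check them before changing the number of pages. This is a minimal sketch using only the standard library; `page_urls` is a hypothetical helper, and the query parameters are the ones already used in the loop above. How many pages the site actually serves is not something this sketch checks.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def page_urls(pages, page_size=100, base_url=BASE_URL):
    """Yield the listing URL for each page, numbered from 1."""
    # urlencode percent-encodes the ':' in the sort key as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    for page in range(1, pages + 1):
        yield f"{base_url}/page/{page}/?{query}"


urls = list(page_urls(pages=3))
for url in urls:
    print(url)
```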

The next step is to clean this data by removing unnecessary text from each row. For example, the verification label at the start of a review, such as "✅ Trip Verified", can be removed where it exists, as it is not relevant to what we want to investigate.

In [5]:
df.reviews

0       Not Verified | Top Ten REASONS to not use Brit...
1       Not Verified |  Easy check in on the way to He...
2       ✅ Trip Verified |  Online check in worked fine...
3       ✅ Trip Verified |. The BA first lounge at Term...
4       Not Verified | Paid a quick visit to Nice yest...
                              ...                        
1995    Manchester to New York via Heathrow. We were b...
1996    ✅ Verified Review |  The British Airways exper...
1997    Flew Malta to London. First the plus points. G...
1998    Philadelphia to London Heathrow with British A...
1999    Upgraded on the outbound flight from London to...
Name: reviews, Length: 2000, dtype: object

In [6]:
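def filter_text(text):
    # Drop the verification label before the first '|', e.g. "✅ Trip Verified |".
    # Splitting once keeps any later '|' characters inside the review text.
    text_split = text.split('|', 1)
    if len(text_split) > 1:
        return text_split[1].strip()
    return text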

In [7]:
df['reviews']=df.reviews.apply(filter_text)
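
Splitting on `|` also trims reviews that carry no verification label but happen to contain a pipe character. A slightly more defensive variant, sketched below, removes only the labels seen in the output above ("Not Verified", "✅ Trip Verified", "✅ Verified Review"). The function name `strip_verification_prefix` is hypothetical, and the site may use other label variants that this pattern does not cover.

```python
import re

# Labels observed in the sample output; treat this list as an assumption, not a complete set.
_PREFIX = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Verified Review|Not Verified)\s*\|\s*"
)


def strip_verification_prefix(text):
    """Remove a leading verification label and its '|' separator, if present."""
    return _PREFIX.sub("", text, count=1).strip()


examples = [
    "✅ Trip Verified |  Online check in worked fine.",
    "Not Verified | Paid a quick visit to Nice.",
    "Manchester to New York | via Heathrow.",  # no label: left unchanged
]
cleaned = [strip_verification_prefix(e) for e in examples]
for line in cleaned:
    print(line)
```

It can be applied in the same way as `filter_text`, for example with `df.reviews.apply(strip_verification_prefix)`.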

In [8]:
df['date']=pd.to_datetime(df.date)
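
If some reviews have a missing or malformed date, `pd.to_datetime` raises an error by default. A minimal sketch of a more tolerant conversion is shown below on made-up values; it assumes the dates use the `YYYY-MM-DD` form seen in the table above and turns anything else into `NaT` so the bad rows can be counted or dropped.

```python
import pandas as pd

# Made-up sample values: two valid dates, one missing, one malformed
raw_dates = pd.Series(["2023-05-23", "2023-05-22", None, "unknown"])

# errors="coerce" replaces unparseable entries with NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

n_bad = int(parsed.isna().sum())
print(parsed)
print(f"{n_bad} dates could not be parsed")
```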

In [9]:
df.head()

|   | date | seat_type | reviews | recommendation |
|---|---|---|---|---|
| 0 | 2023-05-23 | Premium Economy | Top Ten REASONS to not use British Airways To... | no |
| 1 | 2023-05-23 | Economy Class | Easy check in on the way to Heathrow. The fl... | yes |
| 2 | 2023-05-23 | Economy Class | Online check in worked fine. Quick security ... | yes |
| 3 | 2023-05-22 | Business Class | . The BA first lounge at Terminal 5 was a zoo... | no |
| 4 | 2023-05-22 | Business Class | Paid a quick visit to Nice yesterday from Hea... | no |


In [None]:
df.to_csv("data/BA_reviews.csv")
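
Two details are worth keeping in mind when saving: `to_csv` fails if the target directory does not exist, and by default it writes the index as an extra column with no name, which appears as `Unnamed: 0` when the file is read back. The sketch below handles both; `save_reviews` is a hypothetical helper, and the sample frame and temporary directory are used only to keep the example self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_reviews(df, path):
    """Write df to CSV without the index, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Made-up single-row frame with the same columns as the dataset above
sample = pd.DataFrame({
    "date": ["2023-05-23"],
    "seat_type": ["Economy Class"],
    "reviews": ["Easy check in."],
    "recommendation": ["yes"],
})

with tempfile.TemporaryDirectory() as tmp:
    out = save_reviews(sample, Path(tmp) / "data" / "BA_reviews.csv")
    reloaded = pd.read_csv(out)

print(list(reloaded.columns))
```

In the notebook itself the call would be `save_reviews(df, "data/BA_reviews.csv")`.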