# Task 1

---

## Web scraping and analysis

We will use the `requests` package to download pages from the web and a package called `BeautifulSoup` to extract the data from them.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you can see that the site hosts a large amount of review data. For this task, we are only interested in the reviews of British Airways as an airline.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. We can use `Python` and `BeautifulSoup` to step through the paginated review listing and collect the text of every review shown on each page.
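The listing is paginated, and the scraping loop in this notebook requests one page at a time through the `page`, `sortby`, and `pagesize` parts of the URL. As a minimal sketch, the same URLs can be assembled with the standard library so that the query string is percent-encoded automatically. `build_page_url` is a hypothetical helper name introduced here for illustration; it is not part of `requests` or of the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page: int, page_size: int = 100) -> str:
    """Return the listing URL for one page of reviews, newest first.

    Hypothetical helper for illustration; the parameter names mirror
    the URL pattern used in the scraping loop.
    """
    # urlencode percent-encodes the colon in "post_date:Desc" as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"


print(build_page_url(1))
```

Building the query from a dictionary keeps the sort order and page size in one place, which makes it easier to change them without editing the URL string by hand.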

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 12       # number of listing pages to scrape
page_size = 100  # number of reviews requested per page

reviews = []

for i in range(1, pages + 1):  # page numbering starts at 1

    print(f"Scraping page {i}")

    # Build the URL for this page of the paginated listing
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Fetch the HTML for this page
    response = requests.get(url)

    # Parse the raw HTML with BeautifulSoup
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')

    # Each review body sits in a <div> with the class "text_content"
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())  # keep only the text, without tags

    # Running total of reviews collected so far
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
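The extraction step in the loop above depends on every review body sitting in a `<div>` with the class `text_content`. The sketch below shows that selection on a small inline HTML string, so it runs without a network request. The HTML is made up for illustration and only imitates the structure the scraper relies on; the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure the scraper relies on;
# the real page markup may differ.
sample_html = """
<article>
  <div class="text_content">Great flight, friendly crew.</div>
  <div class="rating">8/10</div>
</article>
<article>
  <div class="text_content">Delayed by two hours.</div>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every <div> whose class list includes "text_content";
# the <div class="rating"> element is skipped
sample_reviews = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="text_content")
]

print(sample_reviews)
```

Here `get_text(strip=True)` also trims surrounding whitespace, whereas the loop above calls `get_text()` without arguments and keeps the text exactly as it appears in the page.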


In [3]:
df = pd.DataFrame({"reviews": reviews})
df

Unnamed: 0,reviews
0,✅ Trip Verified | On a recent flight from Cy...
1,✅ Trip Verified | Flight BA 0560 arrived in ...
2,✅ Trip Verified | This was the first time I ...
3,✅ Trip Verified | Pretty good flight but sti...
4,"✅ Trip Verified | Check in was fine, but no pr..."
...,...
1195,✅ Trip Verified | London to Miami. Worst long ...
1196,✅ Trip Verified | I used avios point to upgrad...
1197,"✅ Trip Verified | Boarding was fairly quick, t..."
1198,✅ Trip Verified | Bangalore to London. Ground...


In [4]:
df.to_csv("data/BA_reviews.csv")

Now we have our dataset for this task! The loop above collected 1,200 reviews (12 pages of 100 reviews each) by iterating through the paginated pages on the website.

In [5]:
pd.read_csv('data/BA_reviews.csv')

Unnamed: 0.1,Unnamed: 0,reviews
0,0,✅ Trip Verified | On a recent flight from Cy...
1,1,✅ Trip Verified | Flight BA 0560 arrived in ...
2,2,✅ Trip Verified | This was the first time I ...
3,3,✅ Trip Verified | Pretty good flight but sti...
4,4,"✅ Trip Verified | Check in was fine, but no pr..."
...,...,...
1195,1195,✅ Trip Verified | London to Miami. Worst long ...
1196,1196,✅ Trip Verified | I used avios point to upgrad...
1197,1197,"✅ Trip Verified | Boarding was fairly quick, t..."
1198,1198,✅ Trip Verified | Bangalore to London. Ground...
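The extra `Unnamed: 0` column in the reloaded data appears because `to_csv` writes the DataFrame index as an unnamed first column by default, and `read_csv` then loads that column as ordinary data. A small sketch of two ways to avoid this, using an in-memory buffer and made-up rows in place of the real file:

```python
import io

import pandas as pd

# Made-up rows standing in for the scraped reviews
sample = pd.DataFrame({"reviews": ["first review", "second review"]})

# Option 1: do not write the index to the CSV at all
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

# Option 2: keep the index in the file and restore it when loading
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)

print(list(without_index.columns))
print(list(with_index.columns))
```

Either option leaves `reviews` as the only data column. Separately, `to_csv` does not create missing directories, so the `data/` folder has to exist before the file is saved.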
