# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use the `requests` package to download pages and a package called `BeautifulSoup` to parse them. Once you've collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you can see that there is a lot of data there. For this task, we are only interested in reviews of British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. Now we can use Python and `BeautifulSoup` to page through the review listings and collect the text, metadata, and ratings of each individual review.

In [1]:
import requests
from bs4 import BeautifulSoup 
import pandas as pd   

In [2]:
import re

base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

# Rating categories displayed as stars; the score is the number of filled stars.
# "Wifi & Connectivity" is not in this set, so its cell text is stored as-is.
star_categories = {
    "Ground Service",
    "Value For Money",
    "Seat Comfort",
    "Cabin Staff Service",
    "Food & Beverages",
    "Inflight Entertainment",
}

ratings = []  # one dictionary per review

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect reviews from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    parsed_content = BeautifulSoup(response.content, "html.parser")

    for para in parsed_content.find_all("div", {"class": "body"}):
        ratings_data = {}

        # The country appears in parentheses in the reviewer sub-header
        h3_tag = para.find("h3", {"class": "text_sub_header userStatusWrapper"})
        country_match = re.search(r"\((.*?)\)", h3_tag.text) if h3_tag else None
        time_tag = para.find("time", itemprop="datePublished")

        ratings_data["title"] = para.find("h2", "text_header").get_text()
        ratings_data["review"] = para.find("div", "text_content").get_text()
        ratings_data["country"] = country_match.group(1) if country_match else None
        ratings_data["date"] = time_tag["datetime"] if time_tag else None

        # Each row of the ratings table holds a category and its value
        table = para.find("table", class_="review-ratings")
        rows = table.find_all("tr") if table else []
        for row in rows:
            cols = row.find_all("td")
            if len(cols) >= 2:
                key = cols[0].text.strip()
                if key in star_categories:
                    ratings_data[key] = len(cols[1].find_all("span", class_="fill"))
                else:
                    ratings_data[key] = cols[1].text.strip()
        ratings.append(ratings_data)

    print(f"   ---> {len(ratings)} total reviews")


Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
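
The country lookup in the scraping loop depends on the reviewer sub-header containing a country in parentheses. The sketch below isolates that step so it can be checked on its own. `extract_country` is a hypothetical helper, and the sample header strings are made up for illustration under the assumption that the header reads roughly as "name (country) date".

```python
import re

# Hypothetical helper: matches the first parenthesised group in a header string.
COUNTRY_PATTERN = re.compile(r"\((.*?)\)")


def extract_country(header_text):
    """Return the text inside the first pair of parentheses, or None."""
    match = COUNTRY_PATTERN.search(header_text)
    return match.group(1).strip() if match else None


# Made-up header strings in the assumed "<name> (<country>) <date>" layout
print(extract_country("A Reviewer (United Kingdom) 12th March 2023"))
print(extract_country("A Reviewer 12th March 2023"))
```

Returning `None` instead of raising keeps one malformed header from stopping the whole scrape; the missing value can then be handled during cleaning.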


In [24]:
df = pd.DataFrame(ratings)
df

Unnamed: 0,title,review,country,date,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Food & Beverages,Inflight Entertainment,Ground Service,Wifi & Connectivity,Value For Money,Recommended,Aircraft
0,"""My baggage never arrived""",Not Verified | Lost my case and took 6 weeks ...,United Kingdom,2023-03-12,Business,Business Class,London to Dubai,December 2022,4.0,4.0,4.0,4.0,1.0,12345.0,1,no,
1,"""Total disruption, wasted time""",✅ Trip Verified | The incoming and outgoing f...,United Kingdom,2023-03-10,Solo Leisure,Economy Class,Geneva to London,March 2023,2.0,3.0,3.0,,1.0,,2,no,A320
2,"""what an absolute nightmare""",✅ Trip Verified | Back in December my family ...,Australia,2023-03-10,Family Leisure,Economy Class,Prague to London,December 2022,1.0,,,,1.0,,1,no,
3,"""I detest British Airways""",✅ Trip Verified | As usual the flight is dela...,United Kingdom,2023-03-10,Business,Economy Class,Heathrow to Glasgow,March 2023,1.0,1.0,1.0,1.0,1.0,12345.0,1,no,
4,"""Clean aircraft, good crew, professional""",✅ Trip Verified | A short BA euro trip and thi...,United Kingdom,2023-03-09,Business,Economy Class,London Heathrow to Arlanda Stockholm,March 2023,5.0,5.0,2.0,,5.0,,5,yes,A321
5,"""this airline is horrible""",Not Verified | We are flying Business class f...,United States,2023-03-08,Couple Leisure,Business Class,Portland to Tel Aviv via Heathrow,March 2023,,,,,,,1,no,
6,"""avoid flying British Airways""",✅ Trip Verified | I am in Australia and on Fr...,Australia,2023-03-06,Solo Leisure,Business Class,Heathrow to Milan Malpensa,April 2022,,,,,1.0,,1,no,
7,"""had better treatment from Ryanair""",✅ Trip Verified | At 7.54 am on the day of tr...,United Kingdom,2023-03-04,Solo Leisure,Economy Class,London to Los Angeles,March 2023,1.0,2.0,1.0,1.0,3.0,12345.0,1,no,
8,"""Would happily fly them again""",✅ Trip Verified | Would happily fly them agai...,United States,2023-03-02,Solo Leisure,Economy Class,New York to Istanbul via London,March 2023,5.0,5.0,5.0,5.0,5.0,,5,yes,Boeing 777 / A320
9,"""one drink service on 10 hour flight""","Not Verified | Flew premium, only worth the e...",United Kingdom,2023-03-02,Couple Leisure,Premium Economy,London Heathrow to Las Vegas,March 2023,3.0,2.0,1.0,4.0,2.0,,3,no,


In [4]:
df.to_csv("BA_reviews1.csv", index=False)      

Congratulations! Now you have your dataset for this task! The loop above collected 1,000 reviews by iterating through 10 paginated pages of 100 reviews each. If you want to collect more data, try increasing the number of pages (`pages`).
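
To make the effect of the page count concrete, here is a minimal sketch that only builds the paginated URLs without requesting them. `build_page_urls` is a hypothetical helper that is not part of the notebook; it reuses the URL pattern from the scraping loop, and the value of 20 pages is just an example.

```python
def build_page_urls(base_url, pages, page_size=100):
    """Return the review-listing URL for each page from 1 to `pages`."""
    return [
        f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"
        for i in range(1, pages + 1)
    ]


base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
urls = build_page_urls(base_url, pages=20)  # 20 is an illustrative value

print(len(urls))
print(urls[0])
```

Doubling `pages` doubles the number of requests; how many reviews come back still depends on how many the site actually has.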

The next step is to clean this data by removing any unnecessary text from each row. For example, the "✅ Trip Verified" prefix can be removed from each review where it appears, as it's not relevant to what we want to investigate.
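
One possible way to do this cleaning is sketched below using only the standard library. `strip_verification_prefix` is a hypothetical helper, the sample strings are made up, and the pattern assumes the marker always sits at the start of the review followed by a `|` separator, as in the rows shown above.

```python
import re

# Matches an optional check mark, then "Trip Verified" or "Not Verified",
# then the "|" separator, only at the start of the text.
VERIFICATION_PREFIX = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*"
)


def strip_verification_prefix(text):
    """Remove a leading verification marker; leave other text unchanged."""
    return VERIFICATION_PREFIX.sub("", text).strip()


# Made-up review strings for illustration
samples = [
    "✅ Trip Verified | The flight left on time.",
    "Not Verified | The seat was comfortable.",
    "No marker at the start of this review.",
]
for sample in samples:
    print(strip_verification_prefix(sample))
```

In the notebook, this could be applied to the whole column with `df["review"] = df["review"].apply(strip_verification_prefix)`.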