<a href="https://colab.research.google.com/github/matyi101/British_Airways/blob/main/BA_SKYTRAX_REVIEW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. Now we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then collect the text data from each individual review link.
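The review listing is spread across numbered pages, so collecting everything means requesting one URL per page. Below is a minimal sketch of building those URLs, assuming the `/page/<n>/` path and the `sortby` and `pagesize` query parameters used in the scraping cell further down. `build_page_url` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page_num, page_size=100, base_url=BASE_URL):
    """Return the listing URL for one page of reviews (hypothetical helper)."""
    # urlencode percent-encodes the colon in "post_date:Desc" as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page_num}/?{query}"


# Build the URLs for the first three pages without sending any requests
for page_num in range(1, 4):
    print(build_page_url(page_num))
```

Building the query string with `urlencode` keeps the encoding in one place, which makes it easier to change the sort order or page size later.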

In [1]:
import os
from google.colab import drive

drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/')

Mounted at /content/drive


In [3]:
import csv
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

In [6]:
# Set the URL of the paginated webpage that you want to scrape
url = "https://www.airlinequality.com/airline-reviews/british-airways"

# Initialize an empty list to store the data that you scrape
data = []

# Number of reviews requested per page (sent as the `pagesize` query parameter)
page_size = 100
# Maximum number of pages to scrape
max_pages = 25

# Create a session object for HTTP requests
session = requests.Session()

# Loop over the page numbers and scrape each page in turn
# (requests has no built-in pagination, so each page URL is built by hand)
for page_num in range(1, max_pages + 1):
    print(f"Scraping page {page_num}")

    # Set the URL of the webpage to be scraped 
    paginated_url = f"{url}/page/{page_num}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # A GET request to the paginated URL using the session object
    response = session.get(paginated_url)

    # Parsing the response using BeautifulSoup
    parsed_content = BeautifulSoup(response.text, "html.parser")

    # Find every element on the page that holds a review
    elements = parsed_content.find_all("div", class_="body")

    # Extract the title, reviewer info and review text from each element
    page_data = [[element.find("h2", class_="text_header").text.replace("\n", " "),
                  element.find("h3", class_="text_sub_header").text.replace("\n", " "),
                  element.find("div", class_="text_content").text.replace("\n", " ")]
                 for element in elements]

    # Append the page data to the overall data list
    data.extend(page_data)

    print(f"   ---> {len(data)} total reviews")


Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
...

In [8]:
# Convert the scraped list into a DataFrame
df = pd.DataFrame(data)
df.columns = ["review","info","content"]

# Remove unwanted text (first text preprocessing step)
df.replace(re.compile(r'\s*✅ Trip Verified \|\s*'), '', inplace=True)
df

Unnamed: 0,review,info,content
0,"""It was a nightmare""",Guadalupe Carlos-Alarcon (United States) 18t...,Not Verified | They changed our Flights from ...
1,"""Abysmal service""",Patrick Sparks (United States) 18th April 2023,Not Verified | At Copenhagen the most chaotic...
2,"""trained to give you the runaround""",T Cayle (United States) 17th April 2023,Worst experience of my life trying to deal wit...
3,"""they only had one choice of meal""",1 reviews Andrew Pybus (Hong Kong) 17th A...,Due to code sharing with Cathay Pacific I was ...
4,"""relentless BA cost cutting""",M Edwards (United Kingdom) 16th April 2023,LHR check in was quick at the First Wing and q...
...,...,...,...
2495,"""inflight entertainment is rubbish""",T Ronayne (United Kingdom) 9th October 2015,I'm starting to go with British Airways less a...
2496,"""no passenger baggage arrived""",David Trounce (Canada) 8th October 2015,Flew British Airways from Toronto to Rome via ...
2497,"""need help to get the service right""",Nick Read (Australia) 7th October 2015,I paid for myself to get a First upgrade for m...
2498,"""you could care less""",E Ohler (United States) 7th October 2015,We paid for World Traveller Plus (Premium Econ...


Congratulations! Now you have your dataset for this task! The loop above iterated through 25 paginated pages of the website, requesting 100 reviews per page, which gives up to 2,500 reviews. If you want to collect more data, try increasing the number of pages!
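The introduction mentions saving the collected data to a local `.csv` file. A minimal sketch of that step using the standard-library `csv` module follows. The helper name `rows_to_csv` and the sample row are made up for illustration, and the column names are assumed to match the ones given to the DataFrame above.

```python
import csv
import io


def rows_to_csv(rows, header=("review", "info", "content")):
    """Serialise scraped rows to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    # csv.writer quotes fields that contain commas or quote characters
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder row, not real scraped data
sample_rows = [
    ['"Example title"',
     "A Reviewer (United Kingdom) 1st January 2023",
     "Example review text, with a comma."],
]

csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

With the DataFrame already built, `df.to_csv("reviews.csv", index=False)` would do the same job (the file name here is hypothetical). Passing `index=False` avoids an extra `Unnamed: 0` column appearing when the file is read back.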

The next step is to clean this data by removing any unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it exists, as it's not relevant to what we want to investigate.
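The cleaning step described here can be sketched on plain strings with the `re` module. This version assumes that all three prefixes seen in this notebook ("✅ Trip Verified |", "✅ Verified Review |" and "Not Verified |") should be dropped; whether "Not Verified" is worth keeping as a feature is a judgment call. `strip_verification_prefix` is a hypothetical helper name.

```python
import re

# Optional check mark, one of the verification phrases, then the "|" separator.
# Anchored at the start so text later in the review is never touched.
PREFIX = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Verified Review|Not Verified)\s*\|\s*"
)


def strip_verification_prefix(text):
    """Remove a leading verification marker from one review (hypothetical helper)."""
    return PREFIX.sub("", text).strip()


# Made-up review texts for illustration
examples = [
    "✅ Trip Verified | Great crew.",
    "Not Verified | They changed our flights.",
    "LHR check in was quick.",
]
for text in examples:
    print(strip_verification_prefix(text))
```

On a DataFrame, the same function could be applied with `df["content"].map(strip_verification_prefix)`.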

In [11]:
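# Keep only the review text for sentiment analysis
SA_df = df.drop(["review","info"], axis=1)
# Remove the "✅ Verified Review |" prefix from the review text
SA_df.replace(re.compile(r'\s*✅ Verified Review \|\s*'), '', inplace=True)
SA_df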

Unnamed: 0,content
0,Not Verified | They changed our Flights from ...
1,Not Verified | At Copenhagen the most chaotic...
2,Worst experience of my life trying to deal wit...
3,Due to code sharing with Cathay Pacific I was ...
4,LHR check in was quick at the First Wing and q...
...,...
2495,I'm starting to go with British Airways less a...
2496,Flew British Airways from Toronto to Rome via ...
2497,I paid for myself to get a First upgrade for m...
2498,We paid for World Traveller Plus (Premium Econ...


# Task 2

---

## INSIGHTS
This part of the notebook includes some code to generate insights from the data. We will use a package called `NLTK` to perform sentiment analysis.
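One possible shape for the sentiment step is sketched below. It assumes NLTK's VADER analyzer (`SentimentIntensityAnalyzer`) is the tool of choice, which the notebook does not state, and it uses the conventional ±0.05 cutoff on VADER's `compound` score. The function names are hypothetical, and `nltk` is imported lazily so the labelling logic can be read and tried on its own.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse label (hypothetical helper)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def label_reviews(texts):
    """Label each review text using NLTK's VADER analyzer (assumed choice)."""
    # Imported here so the helper above works even without nltk installed
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    # The lexicon is downloaded once and cached by nltk
    nltk.download("vader_lexicon", quiet=True)
    analyzer = SentimentIntensityAnalyzer()
    return [
        label_from_compound(analyzer.polarity_scores(text)["compound"])
        for text in texts
    ]


# The thresholding can be checked without running the analyzer
print(label_from_compound(0.6), label_from_compound(-0.4), label_from_compound(0.0))
```

Applied to this notebook's data, a call such as `label_reviews(SA_df["content"])` would return one label per review.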

## GRAPHS AND PLOTS
We will then use a package called `matplotlib` to plot charts and graphs.
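As a rough sketch of what such a plot could look like, the code below draws a bar chart of how many reviews fall under each label. It assumes every review has already been assigned one of three labels ("positive", "neutral", "negative"); the label names and both function names are hypothetical, and `matplotlib` is imported lazily inside the plotting function.

```python
from collections import Counter

LABEL_ORDER = ("positive", "neutral", "negative")


def count_labels(labels, order=LABEL_ORDER):
    """Return the number of reviews per label, in a fixed order."""
    counts = Counter(labels)
    return [counts.get(name, 0) for name in order]


def plot_label_counts(labels, order=LABEL_ORDER):
    """Draw a bar chart of label counts and return the figure."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(order), count_labels(labels, order))
    ax.set_xlabel("Sentiment")
    ax.set_ylabel("Number of reviews")
    ax.set_title("Reviews per sentiment label")
    return fig


# Made-up labels for illustration
sample_labels = ["positive", "negative", "positive"]
print(count_labels(sample_labels))
```

Keeping the counting separate from the drawing means the numbers behind the chart can be inspected before anything is plotted.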