# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com) you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the Airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways) you will see this data. Now, we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then to collect the text data on each of the individual review links.

In [4]:
# # Install the requests module
# !pip install requests

Collecting requests
  Downloading requests-2.28.2-py3-none-any.whl (62 kB)
     -------------------------------------- 62.8/62.8 kB 305.7 kB/s eta 0:00:00
Collecting charset-normalizer<4,>=2
  Downloading charset_normalizer-3.1.0-cp310-cp310-win_amd64.whl (97 kB)
     -------------------------------------- 97.1/97.1 kB 504.8 kB/s eta 0:00:00
Collecting urllib3<1.27,>=1.21.1
  Downloading urllib3-1.26.14-py2.py3-none-any.whl (140 kB)
     ------------------------------------ 140.6/140.6 kB 642.4 kB/s eta 0:00:00
Installing collected packages: urllib3, charset-normalizer, requests
Successfully installed charset-normalizer-3.1.0 requests-2.28.2 urllib3-1.26.14


In [1]:
# import requests
# from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
# pages = 30
# page_size = 100

# reviews = []

# # for i in range(1, pages + 1):
# for i in range(1, pages + 1):

#     print(f"Scraping page {i}")

#     # Create URL to collect links from paginated data
#     url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

#     # Collect HTML data from this page
#     response = requests.get(url)

#     # Parse content
#     content = response.content
#     parsed_content = BeautifulSoup(content, 'html.parser')
#     for para in parsed_content.find_all("div", {"class": "text_content"}):
#         reviews.append(para.get_text())
    
#     print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
Scrapi

In [3]:
# df = pd.DataFrame()
# df["reviews"] = reviews
# df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | The incoming and outgoing f...
1,✅ Trip Verified | Back in December my family ...
2,✅ Trip Verified | As usual the flight is dela...
3,✅ Trip Verified | A short BA euro trip and thi...
4,Not Verified | We are flying Business class f...


In [4]:
# df.to_csv("../data/BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loops above collected 1000 reviews by iterating through the paginated pages on the website. However, if you want to collect more data, try increasing the number of pages!

 The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.

## Data Analysis

We have increased the dataset from 1000 reviews to 3000 reviews in order to improve the output of our analysis

First let us list out our analysis work flow

1. Data Loading
2. Data Cleaning
3. Word Cloud
4. Topic Modeling
5. Sentiment Analysis

We have scrapped and loaded the data, now lets get to cleaning the data

### 1. Data Loading

In [12]:
df = pd.read_csv("../data/BA_reviews.csv")
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

In [14]:
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | The incoming and outgoing f...
1,✅ Trip Verified | Back in December my family ...
2,✅ Trip Verified | As usual the flight is dela...
3,✅ Trip Verified | A short BA euro trip and thi...
4,Not Verified | We are flying Business class f...


In [16]:
df.shape

(3000, 1)

### 2. Data Cleaning

In [24]:
# Import labraries for test data cleaning
import pandas as pd 
import re 
import string 
import nltk
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words("english")
wn = nltk.WordNetLemmatizer()
punct = string.punctuation

In [43]:
# This function cleans the data

def clean_text(text):
    # Remove punctuations 
    text = "".join([word.lower() for word in text if word not in punct])
    
    # Remove any other signs
    text = " ".join(re.split('\W+', text))
    
    # remove trip, verified, review, unverified, not in the starting text
    text = re.sub('\^trip', "", text)
    
    return text

df["cleaned_reviews"] = df["reviews"]. apply(lambda x: clean_text(x))
df.sample(10)

Unnamed: 0,reviews,cleaned_reviews
914,"✅ Trip Verified | Istanbul to London Heathrow. The plane itself was old, I found the food choi...",trip verified istanbul to london heathrow the plane itself was old i found the food choices wer...
2219,The British Airways outbound flight from London to Hong Kong was on a 3 year old A380. We sat ne...,the british airways outbound flight from london to hong kong was on a 3 year old a380 we sat nea...
1012,✅ Trip Verified | Hamburg to Abu Dhabi via London. Hamburg to Heathrow not even a free glass of...,trip verified hamburg to abu dhabi via london hamburg to heathrow not even a free glass of wate...
1025,✅ Trip Verified | Miami to London. My most recent BA experience was positive. I fly BA for work...,trip verified miami to london my most recent ba experience was positive i fly ba for work and l...
724,✅ Trip Verified | Tenerife to Heathrow. Effectively a budget airline masquerading at premium ai...,trip verified tenerife to heathrow effectively a budget airline masquerading at premium airline...
548,✅ Trip Verified | Having booked this flight a week before the BA strike and mistakingly thinking...,trip verified having booked this flight a week before the ba strike and mistakingly thinking th...
484,✅ Trip Verified | San Francisco to London. A380 is a wonderful aeroplane. Movie selection was b...,trip verified san francisco to london a380 is a wonderful aeroplane movie selection was below a...
2365,"Heathrow to Las Vegas with British Airways, and a farcical flight to be honest. 2 hours into a 1...",heathrow to las vegas with british airways and a farcical flight to be honest 2 hours into a 10 ...
625,✅ Trip Verified | Frankfurt to London. BA staff watched while security went through partner's b...,trip verified frankfurt to london ba staff watched while security went through partners bag for...
2655,"Flight on time, nice crew on the plane, very comfortable seat and great food. Snacks and beverag...",flight on time nice crew on the plane very comfortable seat and great food snacks and beverages ...


In [None]:
re.sub()