# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com) you can see that there is a lot of data there. For this task, we are only interested in reviews of British Airways, the airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways) you will see this data. Now, we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then to collect the text data on each of the individual review links.

In [4]:
# # Install the requests module
# !pip install requests

Collecting requests
  Downloading requests-2.28.2-py3-none-any.whl (62 kB)
     -------------------------------------- 62.8/62.8 kB 305.7 kB/s eta 0:00:00
Collecting charset-normalizer<4,>=2
  Downloading charset_normalizer-3.1.0-cp310-cp310-win_amd64.whl (97 kB)
     -------------------------------------- 97.1/97.1 kB 504.8 kB/s eta 0:00:00
Collecting urllib3<1.27,>=1.21.1
  Downloading urllib3-1.26.14-py2.py3-none-any.whl (140 kB)
     ------------------------------------ 140.6/140.6 kB 642.4 kB/s eta 0:00:00
Installing collected packages: urllib3, charset-normalizer, requests
Successfully installed charset-normalizer-3.1.0 requests-2.28.2 urllib3-1.26.14


In [1]:
# import requests
# from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
# pages = 30
# page_size = 100

# reviews = []

# for i in range(1, pages + 1):

#     print(f"Scraping page {i}")

#     # Create URL to collect links from paginated data
#     url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

#     # Collect HTML data from this page
#     response = requests.get(url)

#     # Parse content
#     content = response.content
#     parsed_content = BeautifulSoup(content, 'html.parser')
#     for para in parsed_content.find_all("div", {"class": "text_content"}):
#         reviews.append(para.get_text())
    
#     print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
Scrapi

In [3]:
# df = pd.DataFrame()
# df["reviews"] = reviews
# df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | The incoming and outgoing f...
1,✅ Trip Verified | Back in December my family ...
2,✅ Trip Verified | As usual the flight is dela...
3,✅ Trip Verified | A short BA euro trip and thi...
4,Not Verified | We are flying Business class f...


In [4]:
# df.to_csv("../data/BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loop above collects reviews by iterating through the paginated pages on the website at `page_size = 100` reviews per page: 10 pages give 1000 reviews, and the `pages = 30` set above gives up to 3000. If you want to collect more data, try increasing the number of pages!
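To make the relationship between `pages`, `page_size` and the size of the dataset concrete, here is a minimal sketch that only builds the paginated URLs and sends no requests. The helper name `build_page_urls` is an assumption for illustration; the URL pattern is the one used in the scraping loop above.

```python
def build_page_urls(base_url, pages, page_size):
    """Return the paginated review URLs for pages 1..pages (hypothetical helper)."""
    return [
        f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"
        for i in range(1, pages + 1)
    ]


base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
page_size = 100
urls = build_page_urls(base_url, pages=30, page_size=page_size)

# Upper bound on the dataset size: one full page of `page_size` reviews per URL
max_reviews = len(urls) * page_size
print(len(urls), "pages, at most", max_reviews, "reviews")
```

The count is an upper bound because the last page of results may hold fewer than `page_size` reviews.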

The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.
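One possible way to remove such a label from the raw text is a regular expression anchored at the start of the string. This is only a sketch: the helper name `strip_verification_prefix` is hypothetical, and the list of labels is assumed from the sample rows shown in this notebook, so the real data may contain other variants.

```python
import re

# Labels assumed from the sample rows in this notebook; other variants may exist
VERIFICATION_PREFIX = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Verified Review|Not Verified|Unverified)\s*\|\s*"
)


def strip_verification_prefix(text):
    """Remove a leading label such as '✅ Trip Verified | ' (hypothetical helper)."""
    return VERIFICATION_PREFIX.sub("", text, count=1)


print(strip_verification_prefix("✅ Trip Verified | Mumbai to London."))
print(strip_verification_prefix("Not Verified | Good morning."))
# Rows without a label are returned unchanged
print(strip_verification_prefix("London Heathrow to New York JFK return."))
```

Once the reviews are in a DataFrame, the same function could be applied with `df["reviews"].map(strip_verification_prefix)`.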

## Data Analysis

We have increased the dataset from 1000 reviews to 3000 reviews to improve the output of our analysis.

First, let us list our analysis workflow:

1. Data Loading
2. Data Cleaning
3. Word Cloud
4. Topic Modeling
5. Sentiment Analysis

We have scraped and loaded the data; now let's get to cleaning it.

### 1. Data Loading

In [18]:
import pandas as pd
df = pd.read_csv("../data/BA_reviews.csv")
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

In [3]:
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | The incoming and outgoing f...
1,✅ Trip Verified | Back in December my family ...
2,✅ Trip Verified | As usual the flight is dela...
3,✅ Trip Verified | A short BA euro trip and thi...
4,Not Verified | We are flying Business class f...


In [4]:
df.shape

(3000, 1)

### 2. Data Cleaning

In [19]:
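# Import libraries for text data cleaning
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

# stopwords.words() needs the NLTK "stopwords" corpus, and the lemmatizer needs "wordnet" when it is used;
# run nltk.download("stopwords") and nltk.download("wordnet") once if they are missing.
stopwords = nltk.corpus.stopwords.words("english")
wn = nltk.WordNetLemmatizer()
punct = string.punctuation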
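The `stopwords` list defined above is not used by the cleaning function in this section. As a rough sketch of how such a list is typically applied, the example below filters tokens against a tiny stand-in set. Both `example_stopwords` and `remove_stopwords` are assumptions for illustration, not part of the notebook.

```python
# Tiny stand-in stop-word set for illustration only (assumption);
# in the notebook, the NLTK `stopwords` list would take its place.
example_stopwords = {"the", "is", "a", "to", "and", "was", "it", "i"}


def remove_stopwords(text, stop_words):
    """Split on whitespace and keep the tokens that are not stop words."""
    return [token for token in text.split() if token not in stop_words]


tokens = remove_stopwords("the check in process was quick and friendly", example_stopwords)
print(tokens)  # ['check', 'in', 'process', 'quick', 'friendly']
```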

In [9]:
report = 'trip to las vegas'

re.sub('^trip ', '', report)

'to las vegas'

In [20]:
# This function cleans the data

def clean_text(text):
    # Lowercase the text and drop punctuation characters
    text = "".join([char.lower() for char in text if char not in punct])

    # Collapse any remaining runs of non-word characters (e.g. "✅", line breaks) into single spaces
    text = " ".join(re.split(r'\W+', text))

    # remove trip, verified, review, unverified, not in the starting text
    #text = re.sub('^trip', '', text)

    return text

df["cleaned_reviews"] = df["reviews"].apply(clean_text)
df.sample(10)

Unnamed: 0,reviews,cleaned_reviews
2626,Quick online check-in and boarding passes with the BA mobile APP. It is so easy to use and defin...,quick online checkin and boarding passes with the ba mobile app it is so easy to use and definit...
1457,"✅ Verified Review | These are words you may not read often, but I actually enjoy the British Ai...",verified review these are words you may not read often but i actually enjoy the british airways...
1050,✅ Trip Verified | Hyderabad to Brussels via London. Didn't expect this from British Airways. For...,trip verified hyderabad to brussels via london didnt expect this from british airways for me qu...
576,Not Verified | Good morning. I would like to write a review for British Airways. It took me a w...,not verified good morning i would like to write a review for british airways it took me a while ...
396,"✅ Trip Verified | Mumbai to London. The check in process was quick, efficient and friendly. Unf...",trip verified mumbai to london the check in process was quick efficient and friendly unfortunat...
2106,London Heathrow to New York JFK return. This was the first time for a while that I have been on ...,london heathrow to new york jfk return this was the first time for a while that i have been on a...
2323,Montreal to Rome via London with British Airways. Boarded the plane on time (20.30hrs) and sat f...,montreal to rome via london with british airways boarded the plane on time 2030hrs and sat for s...
422,✅ Trip Verified | Mumbai to Boston via London. My flight with British airways was really good. ...,trip verified mumbai to boston via london my flight with british airways was really good the ca...
2077,✅ Verified Review | I was looking forward to trying the new 787 from British Airways flying fro...,verified review i was looking forward to trying the new 787 from british airways flying from ku...
1930,✅ Verified Review | \r\nI have to be upfront and say the flight from London Heathrow to Boston ...,verified review i have to be upfront and say the flight from london heathrow to boston exceeded...


In [24]:
# remove the words trip, verified, review
#df["cleaned_reviews"].map(lambda x: re.sub('trip ', '', x))
# df["cleaned_reviews"] = df["cleaned_reviews"].map(lambda x: re.sub(r'^verified ', '', x))
# df["cleaned_reviews"] = df["cleaned_reviews"].map(lambda x: re.sub(r'^review ', '', x))
df.sample(10)

Unnamed: 0,reviews,cleaned_reviews
2591,This was my first time on a long haul flight with British Airways. I found the seats to be very ...,this was my first time on a long haul flight with british airways i found the seats to be very c...
2207,Very good service on a packed British Airways flight from London to Mumbai. Crew were excellent....,very good service on a packed british airways flight from london to mumbai crew were excellent f...
615,✅ Trip Verified | Tel Aviv to Toronto via London. The plane from London to Toronto was run by A...,trip verified tel aviv to toronto via london the plane from london to toronto was run by air be...
1733,✅ Verified Review | London Heathrow to Seattle on a redemption ticket in premium economy. Board...,verified review london heathrow to seattle on a redemption ticket in premium economy boarding a...
2999,Travelled from London Heathrow to Tokyo Haneda overall not a very pleasurable experience. The se...,travelled from london heathrow to tokyo haneda overall not a very pleasurable experience the sea...
2242,Club class on British Airways between Cape Town and London. Boeing 747-400 has the flat bed and ...,club class on british airways between cape town and london boeing 747400 has the flat bed and wh...
2583,"Boarding delayed in MIA due to weather but BA kept passengers updated. On boarding, offered drin...",boarding delayed in mia due to weather but ba kept passengers updated on boarding offered drinks...
2818,LHR-BKK in Club on B777. Fourth long haul leg on BA in the last month and this was by far the be...,lhrbkk in club on b777 fourth long haul leg on ba in the last month and this was by far the best...
2758,LGW-RAK-LGW May 2015. These flights were on BA's newly configured A320's we paid for Emergency E...,lgwraklgw may 2015 these flights were on bas newly configured a320s we paid for emergency exit r...
2495,"London Heathrow to Copenhagen with British Airways, and I was surprised by the service. Customer...",london heathrow to copenhagen with british airways and i was surprised by the service customer s...
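The sampled rows show that labels such as "trip verified" and "verified review" still lead some of the cleaned reviews. A minimal sketch of one way to drop them from the already-cleaned text follows. The helper name `drop_cleaned_prefix` is hypothetical, and the label list is assumed from the samples above.

```python
import re

# Lowercased, punctuation-free forms of the labels seen in the samples (assumption).
# The leading \s* tolerates a space left behind where "✅" was split away.
CLEANED_PREFIX = re.compile(
    r"^\s*(?:trip verified|verified review|not verified|unverified)\s+"
)


def drop_cleaned_prefix(text):
    """Strip a leading verification label from already-cleaned text (hypothetical helper)."""
    return CLEANED_PREFIX.sub("", text)


print(drop_cleaned_prefix("trip verified mumbai to london"))
print(drop_cleaned_prefix(" verified review london heathrow to seattle"))
# Text that merely starts with "not" is left alone
print(drop_cleaned_prefix("not a very pleasurable trip"))
```

Matching the whole label at the start of the string, rather than single words such as `trip`, avoids deleting those words where they occur inside a review. It could be applied with `df["cleaned_reviews"].map(drop_cleaned_prefix)`.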
