## Web Scraping and Analysis

### Introduction 
British Airways (BA) is the flag carrier airline of the United Kingdom (UK). Every day, thousands of BA flights arrive to and depart from the UK, carrying customers across the world. Whether it’s for holidays, work or any other reason, the end-to-end process of scheduling, planning, boarding, fuelling, transporting, landing, and continuously running flights on time, efficiently and with top-class customer service is a huge task with many highly important responsibilities.

As a data scientist at British Airways, it will be your job to apply your analytical skills to influence real life multi-million-pound decisions from day one, making a tangible impact on the business as your recommendations, tools and models drive key business decisions, reduce costs and increase revenue.Customers who book a flight with BA will experience many interaction points with the BA brand. Understanding a customer's feelings, needs, and feedback is crucial for any business, British Airways. The steps taken are;


### 1. Scraping data from Skytrax

For this task, we will reviews the [British Airways Airline data](https://www.airlinequality.com/airline-reviews/british-airways).  `Python` and `BeautifulSoup` will be used to collect all the links to the reviews and then to collect the text data on each of the individual review links.

### 2. Analyse the data
Once we have the dataset, we will prepare it. The data is very messy and contain purely text. We will need to perform data cleaning in order to prepare the data for analysis. When the data is clean, we should perform several analysis to uncover some insights. 

### 3. Present insights
We have been required by the manager to summarise our findings within a single PowerPoint slide, so that they can present the results at the next board meeting. We would create visualisations and metrics to include within this slide, as well as clear and concise explanations in order to quickly provide the key points from our analysis.


### 1. Scraping data from Skytrax

In [1]:
# importing required libaries 
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# making a request to the site that contains the reviews we will be extracting
# The loops used to collect 2000 reviews by iterating through the paginated pages on the website.
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 20
page_size = 100

reviews = []

# for i in range(1, pages + 1):
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews


In [3]:
# Creating a pandas dataframe out of the reviews list
df = pd.DataFrame()
df["reviews"] = reviews

### 2. Analyse the data

* Calculating four metrics for each review:

In [4]:
# Calculate word count - total number of words in each review
df['word_count'] = df['reviews'].apply(lambda x: len(str(x).split(" ")))

In [5]:
# Calculate character count - total number of characters in each review
df['char_count'] = df['reviews'].str.len()

In [6]:
# Average word length – the average length of words used
def avg_word(review):
  words = review.split()
  return (sum(len(word) for word in words) / len(words))

# Calculate average words
df['avg_word'] = df['reviews'].apply(lambda x: avg_word(x))

In [7]:
# Import stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
# Calculate number of stop words
stop_words = stopwords.words('english')
df['stopword_count'] = df['reviews'].apply(lambda x: len([x for x in x.split() if x in stop_words]))

In [9]:
# a review the summary statistics
df.describe()

Unnamed: 0,word_count,char_count,avg_word,stopword_count
count,2000.0,2000.0,2000.0,2000.0
mean,172.135,938.608,4.53048,68.895
std,111.787477,601.803078,0.288742,49.154716
min,28.0,148.0,3.67,5.0
25%,92.0,508.0,4.33307,34.0
50%,144.5,798.0,4.514152,57.0
75%,214.0,1181.0,4.697146,87.0
max,659.0,3529.0,5.83871,323.0


In [10]:
# calling the first 10 dataset
df.head(10)

Unnamed: 0,reviews,word_count,char_count,avg_word,stopword_count
0,✅ Trip Verified | BA shuttle service across t...,101,527,4.27,36
1,✅ Trip Verified | I must admit like many other...,155,828,4.348387,52
2,Not Verified | When will BA update their Busi...,102,596,4.95,32
3,✅ Trip Verified | Paid £200 day before flight...,189,1001,4.324468,80
4,✅ Trip Verified | BA website did not work (we...,142,799,4.666667,45
5,✅ Trip Verified | Absolutely terrible experie...,127,721,4.722222,51
6,✅ Trip Verified | Vancouver to Delhi via Lond...,576,3209,4.58885,257
7,✅ Trip Verified | Old A320 with narrow pitch....,30,182,5.275862,6
8,✅ Trip Verified | Another BA Shambles. Starte...,274,1492,4.498155,102
9,Not Verified | BA cancelled my flight home to...,147,786,4.383562,65


In [11]:
# calling the last 10 dataset
df.tail(10)

Unnamed: 0,reviews,word_count,char_count,avg_word,stopword_count
1990,London Gatwick to Lima with British Airways. T...,143,788,4.517483,60
1991,✅ Verified Review | Flew British Airways from...,90,502,4.640449,29
1992,✅ Verified Review | After a hiatus of almost ...,187,1117,5.032432,69
1993,✅ Verified Review | Flew with British Airways...,163,863,4.354037,72
1994,✅ Verified Review | \r\nGoing against the gra...,518,2840,4.489362,202
1995,I was very impressed with the World Traveller ...,260,1466,4.642308,113
1996,Flew British Airways from Gatwick to Punta Can...,264,1420,4.382576,114
1997,✅ Verified Review | This was a flight from Ga...,268,1501,4.621723,108
1998,London to Calgary. It's hard to know quite wha...,78,408,4.24359,31
1999,"Warsaw to Heathrow, and the check in is only o...",67,340,4.089552,26


In [12]:
# saving data collected 
df.to_csv("data/BA_reviews.csv")

In [13]:
# checking for the data types
df.dtypes

reviews            object
word_count          int64
char_count          int64
avg_word          float64
stopword_count      int64
dtype: object

* Data Cleaning

In this session, we will be focused more on items listed below;

*Lowercasing all words*

*Removing punctuation*

*Removing stopwords*

*Removing excessively short and frequent words that are not important*

*Lowercasing all words*

By lowercasing all of the text in the reviews it means that words which are capitalised won’t be missed

In [31]:
# Lowercasing  all words
df['reviews'].apply(lambda x: " ".join(x.lower() for x in x.split()))

0       ✅ trip verified | ba shuttle service across th...
1       ✅ trip verified | i must admit like many other...
2       not verified | when will ba update their busin...
3       ✅ trip verified | paid £200 day before flight ...
4       ✅ trip verified | ba website did not work (wei...
                              ...                        
1995    i was very impressed with the world traveller ...
1996    flew british airways from gatwick to punta can...
1997    ✅ verified review | this was a flight from gat...
1998    london to calgary. it's hard to know quite wha...
1999    warsaw to heathrow, and the check in is only o...
Name: reviews, Length: 2000, dtype: object

In [23]:
# storing it up in a new column
df['reviews_lower'] = df['reviews'].apply(lambda x: " ".join(x.lower() for x in x.split()))

*Removing punctuation*
 
It is best to strip out punctuations as it does not add much meaning when searching for a word or trying to ascertain sentiment.

In [30]:
# Removing Punctuations
df['reviews_lower'].str.replace('[^\w\s]', '')

  df['reviews_lower'].str.replace('[^\w\s]', '')


0        trip verified  ba shuttle service across the ...
1        trip verified  i must admit like many others ...
2       not verified  when will ba update their busine...
3        trip verified  paid 200 day before flight for...
4        trip verified  ba website did not work weirdl...
                              ...                        
1995    i was very impressed with the world traveller ...
1996    flew british airways from gatwick to punta can...
1997     verified review  this was a flight from gatwi...
1998    london to calgary its hard to know quite what ...
1999    warsaw to heathrow and the check in is only op...
Name: reviews_lower, Length: 2000, dtype: object

In [24]:
# storing it up in a new column
df['reviews_nopun'] = df['reviews_lower'].str.replace('[^\w\s]', '')

  df['reviews_nopun'] = df['reviews_lower'].str.replace('[^\w\s]', '')


*Removing stopwords*

Stop words are commonly occuring words that hold little to no meaning. Hence, it is common practice in natural language processing to take them off.

In [35]:
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [32]:
# Removing Stopwords
df['reviews_nopun'].apply(lambda x: " ".join(x for x in x.split() if x not in stop_words))

0       trip verified ba shuttle service across uk sti...
1       trip verified must admit like many others tend...
2       verified ba update business class cabin 8 acro...
3       trip verified paid 200 day flight upgrade econ...
4       trip verified ba website work weirdly deleted ...
                              ...                        
1995    impressed world traveller plus experience brit...
1996    flew british airways gatwick punta cana begin ...
1997    verified review flight gatwick bermuda 777 bri...
1998    london calgary hard know quite makes 787 comfo...
1999    warsaw heathrow check open 2 hours wanted shop...
Name: reviews_nopun, Length: 2000, dtype: object

In [25]:
# storing it up in a new column
df['reviews_nopun_nostop'] = df['reviews_nopun'].apply(lambda x: " ".join(x for x in x.split() if x not in stop_words))

*Removing excessively short and frequent words that are not important*

In [44]:
# top 35 occuring words in our review
pd.Series(" ".join(df['reviews_nopun_nostop']).split()).value_counts()[:35]

flight        3734
ba            2566
verified      1852
service       1630
london        1566
food          1249
trip          1223
seat          1221
british       1151
crew          1149
airways       1149
time          1108
cabin         1058
class         1052
seats         1027
good          1004
one            936
heathrow       926
business       899
staff          859
would          858
review         772
economy        767
get            765
airline        677
first          661
flights        624
hours          620
us             620
passengers     588
back           586
boarding       582
plane          576
even           571
could          543
dtype: int64

In [46]:
# Returning frequency of values
freq= pd.Series(" ".join(df['reviews_nopun_nostop']).split()).value_counts()[:25]

In [49]:
# checking for other stopwords
other_stopwords = ['get', 'us', 'see', 'use', 'even', 'could', 'back', 'would' \
  'one', 'to', 'and', 'a', 'was', 'i', 'of', 'this', 'had', 'be'\
  'in', 'on', 'for', 'is', 'one', 'not', 'that', 'were', 'from', 'have'\
  'it', 'we', 'but', 'they', 'has', 'at', 'very', 'no']

In [50]:
# checking the lenght of other stopwords
len(other_stopwords)

33

In [54]:
# removing from the review text block.
df['reviews_nopun_nostop'].apply(lambda x: "".join(" ".join(x for x in x.split() if x not in other_stopwords)))

0       trip verified ba shuttle service across uk sti...
1       trip verified must admit like many others tend...
2       verified ba update business class cabin 8 acro...
3       trip verified paid 200 day flight upgrade econ...
4       trip verified ba website work weirdly deleted ...
                              ...                        
1995    impressed world traveller plus experience brit...
1996    flew british airways gatwick punta cana begin ...
1997    verified review flight gatwick bermuda 777 bri...
1998    london calgary hard know quite makes 787 comfo...
1999    warsaw heathrow check open 2 hours wanted shop...
Name: reviews_nopun_nostop, Length: 2000, dtype: object

In [59]:
# storing it up in a new column
df['cleanreviews'] = df['reviews_nopun_nostop'].apply(lambda x: "".join
                       (" ".join(x for x in x.split() if x not in other_stopwords)))

In [60]:
# checking the top 35 clean reviews
pd.Series(" ".join(df['cleanreviews']).split()).value_counts()[:35]

flight        3734
ba            2566
verified      1852
service       1630
london        1566
food          1249
trip          1223
seat          1221
british       1151
airways       1149
crew          1149
time          1108
cabin         1058
class         1052
seats         1027
good          1004
heathrow       926
business       899
staff          859
would          858
review         772
economy        767
airline        677
first          661
flights        624
hours          620
passengers     588
boarding       582
plane          576
lounge         539
return         520
experience     504
check          499
meal           494
club           490
dtype: int64

In [63]:
df.head()

Unnamed: 0,reviews,word_count,char_count,avg_word,stopword_count,reviews_lower,reviews_nopun,reviews_nopun_nostop,reviews_nopun_nostop_nocommon,cleanreviews
0,✅ Trip Verified | BA shuttle service across t...,101,527,4.27,36,✅ trip verified | ba shuttle service across th...,trip verified ba shuttle service across the ...,trip verified ba shuttle service across uk sti...,trip verified ba shuttle service across uk sti...,trip verified ba shuttle service across uk sti...
1,✅ Trip Verified | I must admit like many other...,155,828,4.348387,52,✅ trip verified | i must admit like many other...,trip verified i must admit like many others ...,trip verified must admit like many others tend...,trip verified must admit like many others tend...,trip verified must admit like many others tend...
2,Not Verified | When will BA update their Busi...,102,596,4.95,32,not verified | when will ba update their busin...,not verified when will ba update their busine...,verified ba update business class cabin 8 acro...,verified ba update business class cabin 8 acro...,verified ba update business class cabin 8 acro...
3,✅ Trip Verified | Paid £200 day before flight...,189,1001,4.324468,80,✅ trip verified | paid £200 day before flight ...,trip verified paid 200 day before flight for...,trip verified paid 200 day flight upgrade econ...,trip verified paid 200 day flight upgrade econ...,trip verified paid 200 day flight upgrade econ...
4,✅ Trip Verified | BA website did not work (we...,142,799,4.666667,45,✅ trip verified | ba website did not work (wei...,trip verified ba website did not work weirdl...,trip verified ba website work weirdly deleted ...,trip verified ba website work weirdly deleted ...,trip verified ba website work weirdly deleted ...


### 3. Present insights