# British Airways: Predicting Customer Behaviours

### Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called BeautifulSoup to collect the data from the web. Once you have collected your data and saved it to a local .csv file, you can start your analysis.

### Scraping data from Skytrax

If you visit https://www.airlinequality.com you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to https://www.airlinequality.com/airline-reviews/british-airways you will see this data. We can use Python and BeautifulSoup to request the paginated review listing and collect the text of each review from those pages.
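To make the extraction step concrete before touching the live site, here is a minimal sketch that runs BeautifulSoup on a small inline HTML string. The fragment `SAMPLE_HTML` and the helper `extract_reviews` are hypothetical and exist only for illustration; the class name `text_content` is the one the scraping cell in this notebook searches for, and the real page markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating the structure the scraper relies on:
# each review body is assumed to sit in a <div class="text_content">.
SAMPLE_HTML = """
<article>
  <div class="text_content">Not Verified | First sample review.</div>
</article>
<article>
  <div class="text_content">Trip Verified | Second sample review.</div>
</article>
"""


def extract_reviews(html):
    """Return the text of every <div class="text_content"> in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True trims surrounding whitespace; the notebook's own loop
    # calls get_text() without it and keeps the raw text.
    return [
        div.get_text(strip=True)
        for div in soup.find_all("div", class_="text_content")
    ]


sample_reviews = extract_reviews(SAMPLE_HTML)
print(sample_reviews)
```

Keeping the parsing in a small function like this makes it possible to check the selector against saved HTML without sending a request for every experiment.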

In [68]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [69]:
# Web scraping: collect review text from the paginated listing pages
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []


# Loop over the listing pages (page numbers start at 1)
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; stop on timeouts and HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    # Each review body is expected in a <div class="text_content">
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
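The scraping loop assembles each page URL by hand, with the query string already percent-encoded (`post_date%3ADesc`). As an alternative sketch, the same URL can be built with the standard library's `urllib.parse.urlencode`, which handles the encoding. The helper name `build_page_url` is hypothetical; the parameter names `sortby` and `pagesize` are taken from the URL used in the loop.

```python
from urllib.parse import urlencode


def build_page_url(base_url, page, page_size):
    """Build the URL of one listing page, newest reviews first."""
    # urlencode percent-encodes the ':' in the sort key as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


example_url = build_page_url(
    "https://www.airlinequality.com/airline-reviews/british-airways",
    page=1,
    page_size=100,
)
print(example_url)
```

Building the query from a dictionary keeps the sort order and page size readable and avoids encoding mistakes if a parameter changes later.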


In [71]:
# Build a DataFrame with one review per row
df = pd.DataFrame({"reviews": reviews})
df.head()

   reviews
0  Not Verified | As a Spanish born individual l...
1  ✅ Trip Verified | A rather empty and quiet fl...
2  ✅ Trip Verified | Easy check in and staff mem...
3  ✅ Trip Verified | Being a silver flyer and bo...
4  Not Verified | I find BA incredibly tacky and...


In [73]:
df.to_csv("data/BA_reviews2.csv")
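`to_csv` does not create missing folders, so the call above fails if the `data` directory does not exist yet. A minimal sketch of a safer save, assuming a hypothetical helper named `save_reviews`; the demo writes to a temporary directory with made-up rows so it does not touch the real output file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_reviews(frame, csv_path):
    """Write the DataFrame to csv_path, creating parent folders if needed."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    # The index is written as the first column, so it can be restored
    # later with pd.read_csv(..., index_col=[0]).
    frame.to_csv(csv_path)
    return csv_path


# Demo with made-up rows in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"reviews": ["sample review one", "sample review two"]})
    saved = save_reviews(demo, Path(tmp) / "data" / "demo_reviews.csv")
    reloaded = pd.read_csv(saved, index_col=[0])

print(reloaded)
```

In the notebook itself the call would be `save_reviews(df, "data/BA_reviews2.csv")`.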

### Data Exploration

In [74]:
data_ba = pd.read_csv('data/BA_reviews2.csv', index_col=[0])

In [75]:
data_ba.head()

   reviews
0  Not Verified | As a Spanish born individual l...
1  ✅ Trip Verified | A rather empty and quiet fl...
2  ✅ Trip Verified | Easy check in and staff mem...
3  ✅ Trip Verified | Being a silver flyer and bo...
4  Not Verified | I find BA incredibly tacky and...


In [76]:
data_ba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   reviews  1000 non-null   object
dtypes: object(1)
memory usage: 15.6+ KB


In [77]:
# Work on a copy so that data_ba stays unchanged
data1 = data_ba.copy()

In [78]:
data1.head()

   reviews
0  Not Verified | As a Spanish born individual l...
1  ✅ Trip Verified | A rather empty and quiet fl...
2  ✅ Trip Verified | Easy check in and staff mem...
3  ✅ Trip Verified | Being a silver flyer and bo...
4  Not Verified | I find BA incredibly tacky and...


In [79]:
# Check whether the reviews column contains any missing values
data1.isnull().any()

reviews    False
dtype: bool
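`isnull()` only detects missing values (`NaN`/`None`); it does not flag reviews that are empty strings, whitespace only, or exact duplicates. The following sketch counts all three cases. The helper `summarise_text_quality` and the demo rows are hypothetical and are not results from the scraped data.

```python
import pandas as pd


def summarise_text_quality(frame, column="reviews"):
    """Count missing, blank (empty or whitespace-only), and duplicated texts."""
    text = frame[column]
    stripped = text.fillna("").astype(str).str.strip()
    return {
        "missing": int(text.isna().sum()),
        # Blank means present but empty after trimming whitespace
        "blank": int(((stripped == "") & text.notna()).sum()),
        # Every repeat after the first occurrence counts as a duplicate
        "duplicates": int(text.duplicated().sum()),
    }


# Made-up rows: one blank, one missing, one duplicate
demo = pd.DataFrame({"reviews": ["Good flight", "   ", None, "Good flight"]})
summary = summarise_text_quality(demo)
print(summary)
```

On the notebook's data the call would be `summarise_text_quality(data1)`.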