# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. We can now use Python and `BeautifulSoup` to page through the review listings and collect the text of each review.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail on a hung connection or an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
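
The paginated URL in the loop above is assembled with an f-string and a hand-encoded query string. As a minimal sketch, the same URL can be built with `urllib.parse.urlencode`, which percent-encodes the `:` in `post_date:Desc` automatically. The helper name `build_page_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(base_url: str, page: int, page_size: int) -> str:
    """Return the listing URL for one results page (hypothetical helper)."""
    # urlencode percent-encodes reserved characters, so ":" becomes "%3A"
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


print(build_page_url(BASE_URL, 2, 100))
```

Building the query from a dictionary keeps the sort order and page size readable and avoids encoding mistakes if more parameters are added later.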


In [4]:
df = pd.DataFrame({"reviews": reviews})
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | Stupidly tried BA again aft...
1,Not Verified | Seat horribly narrow; 3-4-3 on...
2,✅ Trip Verified | Glasgow to London delayed b...
3,✅ Trip Verified | When I tried to check in on...
4,✅ Trip Verified | I flew from Prague to LHR. ...


In [5]:
df.to_csv("BA_reviews.csv")

Congratulations! You now have your dataset for this task. The loop above collected 1000 reviews by iterating through the paginated pages on the website. If you want to collect more data, try increasing the number of pages.
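
As a rough sketch of how to choose `pages` for a target dataset size, the page count is the target number of reviews divided by `page_size`, rounded up. The helper `pages_needed` is hypothetical, and the result assumes every page returns a full `page_size` reviews, which may not hold for the last page on the site.

```python
import math


def pages_needed(target_reviews: int, page_size: int = 100) -> int:
    """Return how many listing pages to request to reach target_reviews (hypothetical helper)."""
    if target_reviews <= 0:
        return 0
    # Round up so a partially filled final page is still requested
    return math.ceil(target_reviews / page_size)


for target in (1000, 2500, 1001):
    print(target, "reviews ->", pages_needed(target), "pages")
```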

The next step is to clean this data by removing unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it exists, as it's not relevant to what we want to investigate.
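
One way to do this is with a regular expression that strips the verification label up to and including the `|` separator. This is a sketch rather than the notebook's own method: the function name `strip_verification_prefix` is hypothetical, and the pattern assumes the label always sits at the start of the review as "✅ Trip Verified |" or "Not Verified |", as in the rows shown above.

```python
import re

# Optional check mark, then "Trip Verified" or "Not Verified", then the "|" separator
_PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip|Not) Verified\s*\|\s*")


def strip_verification_prefix(review: str) -> str:
    """Remove a leading verification label from one review (hypothetical helper)."""
    return _PREFIX.sub("", review, count=1).strip()


samples = [
    "✅ Trip Verified |  Glasgow to London delayed by 1 hour.",
    "Not Verified | Seat horribly narrow.",
    "A review without any label.",
]
for sample in samples:
    print(strip_verification_prefix(sample))
```

Because the pattern tolerates any amount of whitespace around the separator, it handles both the single-space and double-space variants that appear after the `|` in the scraped rows.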

In [6]:
# Keep only verified reviews by dropping rows flagged "Not Verified"
df = df[~df["reviews"].str.contains("Not Verified")]
print(df)

                                               reviews
0    ✅ Trip Verified |  Stupidly tried BA again aft...
2    ✅ Trip Verified |  Glasgow to London delayed b...
3    ✅ Trip Verified |  When I tried to check in on...
4    ✅ Trip Verified |  I flew from Prague to LHR. ...
5    ✅ Trip Verified |  Disappointing again especia...
..                                                 ...
995  ✅ Trip Verified |  Milan to London Heathrow. T...
996  ✅ Trip Verified |  London to Helsinki, Busines...
997  ✅ Trip Verified |  London Heathrow to Denver. ...
998  ✅ Trip Verified | It was a 13 hours flight fro...
999  ✅ Trip Verified |  Worst business class I flew...

[838 rows x 1 columns]

In [52]:
# Strip the verification prefix and leftover whitespace; the new frame gets a fresh 0..n-1 index
data = pd.DataFrame({"reviews": [review.removeprefix("✅ Trip Verified |").strip() for review in df["reviews"]]})

In [53]:
data

Unnamed: 0,reviews
0,Stupidly tried BA again after a five year gap...
1,Glasgow to London delayed by 1 hour. My wife ...
2,"When I tried to check in online, I was offere..."
3,"I flew from Prague to LHR. Excellent service,..."
4,Disappointing again especially on business. T...
...,...
833,Milan to London Heathrow. The service and cle...
834,"London to Helsinki, Business class, Friday 23..."
835,London Heathrow to Denver. Words cannot expre...
836,It was a 13 hours flight from Heathrow to Kual...


In [55]:
!pip install Unidecode

Collecting Unidecode
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 235.9/235.9 kB 2.8 MB/s eta 0:00:00
Installing collected packages: Unidecode
Successfully installed Unidecode-1.3.6


In [56]:
import string

import unidecode
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Note: word_tokenize and stopwords rely on NLTK data packages that must
# already be available locally (they can be fetched with nltk.download)

# Build the stop word set once instead of on every call
STOP_WORDS = set(stopwords.words("english"))


def clean(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, " ")  # Replace punctuation with spaces

    lowercased = text.lower()  # Lower case
    unaccented = unidecode.unidecode(lowercased)  # Remove accents
    tokenized = word_tokenize(unaccented)  # Tokenize
    words_only = [word for word in tokenized if word.isalpha()]  # Drop tokens that are not purely alphabetic
    without_stopwords = [word for word in words_only if word not in STOP_WORDS]  # Remove stop words

    return " ".join(without_stopwords)


# `data` has a fresh 0..n-1 index while `df` kept its filtered index, so assign
# by position (.to_numpy()) to keep each cleaned text on the row of its own review
df["clean_text"] = data["reviews"].apply(clean).to_numpy()

df.head()

Unnamed: 0,reviews,clean_text
0,✅ Trip Verified | Stupidly tried BA again aft...,stupidly tried ba five year gap paid wife go b...
2,✅ Trip Verified | Glasgow to London delayed b...,glasgow london delayed hour wife ...
3,✅ Trip Verified | When I tried to check in on...,tried check online offered upgrade premium eco...
4,✅ Trip Verified | I flew from Prague to LHR. ...,flew prague lhr excellent service attentive st...
5,✅ Trip Verified | Disappointing again especia...,disappointing especially business service anci...
