# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com], you can see that there is a lot of data there. For this task, we are only interested in reviews of British Airways and of the airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now, we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then to collect the text data on each of the individual review links.

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [4]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):
    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail early on a hung connection or an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews


In [5]:
# Create a dataframe for all reviews
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

   reviews
0  Not Verified | Took a trip to Nashville with m...
1  Not Verified | A nightmare journey courtesy o...
2  ✅ Trip Verified | Absolutely atrocious. LHR-OR...
3  ✅ Trip Verified | As someone who flies relentl...
4  ✅ Trip Verified | Flew with British Airways ...


In [6]:
# save it to CSV file
df.to_csv("BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loop above collected 1000 reviews by iterating through 10 paginated result pages of 100 reviews each. If you want to collect more data, increase the value of `pages`.
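
To make the paging explicit, the URL for any results page can be built with a small helper. This is only a sketch: `build_page_url` is a hypothetical name, and it assumes the site keeps the `/page/<n>/` path and the `sortby` and `pagesize` query parameters used in the loop above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page, page_size=100, base_url=BASE_URL):
    """Return the URL of one page of reviews, newest first (hypothetical helper)."""
    # urlencode percent-escapes the colon, giving sortby=post_date%3ADesc
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


# URLs for the first 20 pages instead of the first 10
urls = [build_page_url(page) for page in range(1, 21)]
print(urls[0])
```

Each URL can then be passed to `requests.get` exactly as in the scraping loop.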

The next step is to clean this data by removing unnecessary text from each row. For example, the "✅ Trip Verified" tag can be removed wherever it appears, as it is not relevant to what we want to investigate.
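
One way to remove such a tag is a regular expression anchored at the start of the text. The sketch below uses only the standard library; `strip_verification_tag` and `TAG_PATTERN` are hypothetical names, and the pattern assumes the tag always appears as a prefix followed by `|`, as in the rows shown above. It also shows why `str.strip` is a poor fit here: its argument is a set of characters to trim from both ends, not a prefix.

```python
import re

# Matches "✅ Trip Verified |" or "Not Verified |" at the start of a review
TAG_PATTERN = re.compile(r"^\s*(?:✅\s*Trip Verified|Not Verified)\s*\|\s*")


def strip_verification_tag(text):
    """Remove a leading verification tag, leaving the review body untouched."""
    return TAG_PATTERN.sub("", text)


sample = "Not Verified | Definitely tired"
print(strip_verification_tag(sample))   # Definitely tired

# str.strip trims any of the given characters from both ends,
# so it also eats the trailing word "tired"
print(sample.strip("Not Verified |"))   # Definitely
```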

## Data Cleaning

In this part, we remove unnecessary text from each row. We use `NLTK` to drop stop words and lemmatize the remaining words after splitting each review into words. The cleaned text is added to the dataframe and then saved to a second CSV file.
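
As a rough illustration of these cleaning steps on a single sentence, the sketch below lowercases the text, removes URLs and punctuation, and drops stop words. It is a simplification under stated assumptions: `basic_clean` is a hypothetical name, `STOP_WORDS` is a tiny hand-written stand-in for NLTK's English stop-word list, and lemmatization is left out so that the example runs without any downloads.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's full English list
STOP_WORDS = {"a", "an", "the", "was", "to", "and", "of", "is", "it"}


def basic_clean(text):
    """Lowercase, strip URLs and punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)    # remove URLs
    text = re.sub(r"[^\w\s]", "", text)    # remove punctuation
    words = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(words)


print(basic_clean("Check https://example.com, the flight WAS great!"))
# check flight great
```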

In [7]:
import nltk
import ssl
import re

In [8]:
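# Work around SSL certificate errors that can block nltk.download
# (this disables HTTPS certificate verification), then download the
# NLTK resources: stopwords, punkt, wordnet, vader_lexicon and omw-1.4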
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('vader_lexicon')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jeremychen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jeremychen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jeremychen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jeremychen/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/jeremychen/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [9]:
# Load CSV file and read the dataset
reviews = pd.read_csv("BA_reviews.csv")
texts = reviews['reviews']
print(texts)

0      Not Verified | Took a trip to Nashville with m...
1      Not Verified |  A nightmare journey courtesy o...
2      ✅ Trip Verified | Absolutely atrocious. LHR-OR...
3      ✅ Trip Verified | As someone who flies relentl...
4      ✅ Trip Verified |   Flew with British Airways ...
                             ...                        
995    ✅ Trip Verified |  Return flight to Dublin. Ou...
996    ✅ Trip Verified |  Barbados to Gatwick. We boa...
997    ✅ Trip Verified |  I would like to praise the ...
998    ✅ Trip Verified | Madrid to London Heathrow. T...
999    ✅ Trip Verified | BA762 Heathrow to Oslo I hav...
Name: reviews, Length: 1000, dtype: object


In [10]:
# Imports and lemmatizer used by the preprocessing function
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

lemma = WordNetLemmatizer()

In [11]:
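# Remove the verification tag at the start of each review.
# A regex is used because str.strip() treats its argument as a set of
# characters to trim from both ends, not as a prefix.
texts = texts.str.replace(r"^\s*(?:✅\s*Trip Verified|Not Verified)\s*\|\s*", "", regex=True)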

# Function to clean text
def preprocess(text):
    # convert to lowercase, then strip URLs and punctuation
    text = text.lower()
    text = re.sub(r'http\S+', '', text) # remove URLs
    text = re.sub(r'[^\w\s]', '', text) # remove punctuation

    # create English stop words list
    en_stop = set(stopwords.words('english'))

    # remove stop words from text
    unstopped_words = [word for word in text.split() if word not in en_stop]

    # lemmatize words
    lemmatized_words = [lemma.lemmatize(word) for word in unstopped_words]
    return ' '.join(lemmatized_words)

# Clean the texts
cleaned_texts = texts.apply(preprocess)
reviews['cleaned_reviews'] = cleaned_texts
print(reviews)

     Unnamed: 0                                            reviews  \
0             0  Not Verified | Took a trip to Nashville with m...   
1             1  Not Verified |  A nightmare journey courtesy o...   
2             2  ✅ Trip Verified | Absolutely atrocious. LHR-OR...   
3             3  ✅ Trip Verified | As someone who flies relentl...   
4             4  ✅ Trip Verified |   Flew with British Airways ...   
..          ...                                                ...   
995         995  ✅ Trip Verified |  Return flight to Dublin. Ou...   
996         996  ✅ Trip Verified |  Barbados to Gatwick. We boa...   
997         997  ✅ Trip Verified |  I would like to praise the ...   
998         998  ✅ Trip Verified | Madrid to London Heathrow. T...   
999         999  ✅ Trip Verified | BA762 Heathrow to Oslo I hav...   

                                       cleaned_reviews  
0    took trip nashville wife leisure break arrived...  
1    nightmare journey courtesy british airwa

In [12]:
# Save cleaned texts to CSV file
reviews.to_csv("cleaned_BA_reviews.csv", index = False)