# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit https://www.airlinequality.com you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to https://www.airlinequality.com/airline-reviews/british-airways you will see this data. We can use Python and `BeautifulSoup` to collect all the links to the reviews, and then collect the text data from each individual review link.
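As a minimal sketch of the parsing step, the snippet below runs `BeautifulSoup` on a small, hypothetical HTML fragment rather than the live site. The `text_content` class is the one used by the scraping cell in this notebook; the sample markup and the helper name `extract_reviews` are assumptions made for illustration, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page is larger and may differ.
sample_html = """
<article>
  <div class="text_content">✅ Trip Verified | Smooth flight, friendly crew.</div>
</article>
<article>
  <div class="text_content">Not Verified | Two-hour delay and no updates.</div>
</article>
"""

def extract_reviews(html):
    """Return the text of every <div class="text_content"> in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True) for div in soup.find_all("div", class_="text_content")]

sample_reviews = extract_reviews(sample_html)
print(sample_reviews)
```

Testing the selector on a saved or hand-written fragment like this is a cheap way to check the parsing logic before sending any requests.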

In [7]:
# Install the scraping dependencies (run once)
# !pip install pandas
!pip install beautifulsoup4





In [8]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


In [10]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on an HTTP error instead of parsing an error page

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews


In [11]:
df = pd.DataFrame({"reviews": reviews})
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | London Heathrow to Male In...
1,Not Verified | Very good flight following an ...
2,Not Verified | An hour's delay due to late ar...
3,✅ Trip Verified | I booked through BA becaus...
4,✅ Trip Verified | British airways lost bags ...


In [13]:
# Assumes a local "data" directory already exists next to the notebook
df.to_csv("data/BA_reviews.csv")

Congratulations! You now have your dataset for this task. The loop above collected 1000 reviews by iterating through the paginated pages on the website. If you want to collect more data, try increasing the value of `pages`.
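One way to collect more pages without knowing in advance how many exist is to stop at the first page that returns no reviews. The sketch below is only an illustration: `build_page_url`, `collect_reviews`, and the injected `fetch_page` callable are hypothetical names, and a stand-in fetcher replaces the real request so the example runs without network access.

```python
import time

def build_page_url(base_url, page, page_size):
    """Build the paginated listing URL in the same shape the scraping loop uses."""
    return f"{base_url}/page/{page}/?sortby=post_date%3ADesc&pagesize={page_size}"

def collect_reviews(fetch_page, max_pages, delay_seconds=0.0):
    """Call fetch_page(page) for pages 1..max_pages, stopping at the first empty page.

    fetch_page is a hypothetical callable that returns a list of review strings
    for one page (for example, a wrapper around requests.get and BeautifulSoup).
    """
    collected = []
    for page in range(1, max_pages + 1):
        page_reviews = fetch_page(page)
        if not page_reviews:
            break  # no reviews on this page, so assume we ran past the last one
        collected.extend(page_reviews)
        time.sleep(delay_seconds)  # optional pause between requests
    return collected

# Stand-in fetcher so the sketch runs offline: 3 pages with 2 reviews each.
def fake_fetch(page):
    return [f"review {page}-{n}" for n in (1, 2)] if page <= 3 else []

all_reviews = collect_reviews(fake_fetch, max_pages=10)
print(len(all_reviews))  # 6
```

In real use, `fetch_page` would wrap the request and parsing code from the scraping cell, and a non-zero `delay_seconds` keeps the request rate modest.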

The next step is to clean this data by removing any unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it appears, as it is not relevant to what we want to investigate.
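A minimal sketch of that cleaning step, assuming each review starts with an optional verification label followed by a `|` separator, as in the rows shown above. The pattern and the helper name `strip_verification_prefix` are illustrative assumptions, not part of the original task code.

```python
import re

# Optional "✅", optional verification label, then the "|" separator, all at the start.
PREFIX_PATTERN = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified|Verified|Unverified)?\s*\|\s*"
)

def strip_verification_prefix(text):
    """Remove a leading verification label and its "|" separator, if present."""
    return PREFIX_PATTERN.sub("", text, count=1).strip()

print(strip_verification_prefix("✅ Trip Verified | London Heathrow to Male"))
print(strip_verification_prefix("Not Verified | Very good flight"))
print(strip_verification_prefix("No label on this review"))
```

Anchoring the pattern at the start of the string means a later mention of "Verified" inside the review body is left untouched.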

In [14]:
df.head(5)

Unnamed: 0,reviews
0,✅ Trip Verified | London Heathrow to Male In...
1,Not Verified | Very good flight following an ...
2,Not Verified | An hour's delay due to late ar...
3,✅ Trip Verified | I booked through BA becaus...
4,✅ Trip Verified | British airways lost bags ...


In [16]:
import re
df = pd.DataFrame(reviews, columns=['Review'])

# Define a function to clean each review
def clean_review(text):
    # Remove the "✅ Trip Verified" / "Not Verified" labels and newline characters.
    # The "|" separator that follows the label is left in place at this stage.
    text = re.sub(r"✅ Trip Verified|Not Verified|Verified|Unverified|\n", "", text)
    # Additional cleaning (extra spaces, unwanted characters) can be added here.
    return text.strip()

# Apply the cleaning function to the 'Review' column
df['Cleaned_Review'] = df['Review'].apply(clean_review)

# Print the cleaned reviews
print(df['Cleaned_Review'])

0      |   London Heathrow to Male In new business cl...
1      |  Very good flight following an equally good ...
2      |  An hour's delay due to late arrival of the ...
3      |   I booked through BA because Loganair don’t...
4      |   British airways lost bags in LHR then foun...
                             ...                        
995    |  London to Shanghai. The Concorde room in He...
996    |  I have often flown British Airways and have...
997    |  Good morning. I would like to write a revie...
998    | My flight was cancelled 3 days in a row. Was...
999    |  Hong Kong to Copenhagen via London. The who...
Name: Cleaned_Review, Length: 1000, dtype: object



In [17]:
%pip install pandas nltk matplotlib wordcloud


Defaulting to user installation because normal site-packages is not writeable
Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Collecting matplotlib
  Using cached matplotlib-3.10.0-cp310-cp310-win_amd64.whl (8.0 MB)
Collecting wordcloud
  Downloading wordcloud-1.9.4-cp310-cp310-win_amd64.whl (299 kB)



In [18]:
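import nltk
nltk.download('punkt')
# Note: recent NLTK releases (3.9+) may also need nltk.download('punkt_tab') for word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')  # Only needed if you plan to use sentiment analysis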


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Manas\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Manas\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Manas\AppData\Roaming\nltk_data...
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Manas\AppData\Roaming\nltk_data...


True

In [19]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud

In [28]:
def clean_review(text):
    # Remove '✅ Trip Verified' or any similar labels
    text = re.sub(r"✅ Trip Verified|Not Verified|Verified|Unverified|\n", "", text)
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation, numbers, and extra spaces
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\d+", "", text)
    text = re.sub(r"\s+", " ", text)  # Remove extra spaces
    return text.strip()

# Apply the cleaning function
df['Cleaned_Review'] = df['Review'].apply(clean_review)

# Tokenization and removing stopwords
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def tokenize_and_clean(text):
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)

# Apply tokenization and stopword removal
df['Processed_Review'] = df['Cleaned_Review'].apply(tokenize_and_clean)

# Print cleaned reviews
print(df['Processed_Review'])

# Word Frequency Analysis
all_reviews = ' '.join(df['Processed_Review'])
word_tokens = word_tokenize(all_reviews)

# Get word frequency
# from collections import Counter
# word_freq = Counter(word_tokens)

# # Print most common words
# print(word_freq.most_common(10))

# # Visualization with Wordcloud
# wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)

# # Plot the word cloud
# plt.figure(figsize=(10, 5))
# plt.imshow(wordcloud, interpolation='bilinear')
# plt.axis("off")
# plt.show()

# # Optional: Sentiment Analysis
# from nltk.sentiment import SentimentIntensityAnalyzer

# sia = SentimentIntensityAnalyzer()
# df['Sentiment'] = df['Processed_Review'].apply(lambda x: sia.polarity_scores(x)['compound'])

# # Print sentiment analysis results
# print(df[['Review', 'Sentiment']].head())

# # Example: Positive/Negative Distribution
# df['Sentiment_Label'] = df['Sentiment'].apply(lambda x: 'Positive' if x > 0 else ('Negative' if x < 0 else 'Neutral'))

# # Plot sentiment distribution
# df['Sentiment_Label'].value_counts().plot(kind='bar', color=['green', 'red', 'gray'])
# plt.title('Sentiment Distribution')
# plt.ylabel('Count')
# plt.xlabel('Sentiment')
# plt.show()

0      london heathrow male new business class ba con...
1      good flight following equally good flight rome...
2      hour delay due late arrival incoming aircraft ...
3      booked ba loganair dont representative manches...
4      british airway lost bag lhr found sent cologne...
                             ...                        
995    london shanghai concorde room heathrow termina...
996    often flown british airway considered good air...
997    good morning would like write review british a...
998    flight cancelled day row flying thursday final...
999    hong kong copenhagen via london whole experien...
Name: Processed_Review, Length: 1000, dtype: object
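
The word-frequency step that is commented out in the last code cell could look roughly like the sketch below. It uses a tiny, hypothetical list of processed reviews in place of `df['Processed_Review']`, and splits on whitespace on the assumption that the text has already been cleaned and lemmatized.

```python
from collections import Counter

# Hypothetical processed reviews, standing in for df['Processed_Review'].
sample_processed = [
    "flight delayed crew friendly",
    "good flight good food",
    "lost bag flight cancelled",
]

# Count every whitespace-separated token across all reviews.
word_freq = Counter(word for review in sample_processed for word in review.split())
top_words = word_freq.most_common(2)
print(top_words)  # [('flight', 3), ('good', 2)]
```

A `Counter` like this can be passed directly to `WordCloud.generate_from_frequencies`, which is what the commented-out code does with `word_freq`.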
