## Web scraping and analysis

We will use the `BeautifulSoup` package to collect the data from the web. Once you have collected the data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com], you will see that the site hosts a large amount of data. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways], you will see this data. We can use `Python` and `BeautifulSoup` to step through the paginated review listing and collect the review text from each page.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Build the URL for page i of the paginated review listing
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail fast on timeouts and HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')

    # Each review body sits in a <div class="text_content"> element
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews


In [3]:
df = pd.DataFrame({"reviews": reviews})
df.head()

,reviews
0,✅ Trip Verified | The worst airline I have e...
1,"✅ Trip Verified | Excellent service levels, ..."
2,Not Verified | Booked a very special holiday ...
3,"Not Verified | Just returned from Chicago, fle..."
4,✅ Trip Verified | BA standards continue to de...


In [5]:
df.to_csv("data/BA_reviews.csv")

The loop above collected 1000 reviews by iterating through the paginated listing on the website (10 pages of 100 reviews each). To collect more data, increase the value of `pages`.
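
The page URL used in the loop can also be assembled with the standard library, which keeps the query-string encoding explicit. This is only an illustrative sketch: `build_page_url` is a hypothetical helper name, and it assumes the site keeps the `/page/<n>/` path and the `sortby` and `pagesize` query parameters seen in the loop above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page: int, page_size: int = 100) -> str:
    """Return the listing URL for one page of reviews (hypothetical helper)."""
    # urlencode percent-encodes the colon in "post_date:Desc" as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"


# Doubling the number of pages from 10 to 20 would request these URLs,
# assuming the site has that many pages of reviews
urls = [build_page_url(i) for i in range(1, 21)]
print(urls[0])
```

Keeping the URL construction in one function makes it easier to change the page size or sort order in a single place.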

The next step is to clean the data by removing unnecessary text from each row. For example, the "✅ Trip Verified" prefix can be removed wherever it appears, as it is not relevant to what we want to investigate.

In [7]:
reviews = pd.read_csv("data/BA_reviews.csv")

In [9]:
pd.options.display.max_rows = None
pd.options.display.max_columns = None

In [12]:
reviews.head(10)

,Unnamed: 0,reviews
0,0,✅ Trip Verified | The worst airline I have e...
1,1,"✅ Trip Verified | Excellent service levels, ..."
2,2,Not Verified | Booked a very special holiday ...
3,3,"Not Verified | Just returned from Chicago, fle..."
4,4,✅ Trip Verified | BA standards continue to de...
5,5,Not Verified | Awful. Business class check in...
6,6,✅ Trip Verified | Not a reliable airline. You...
7,7,✅ Trip Verified | I take comfort in reading t...
8,8,✅ Trip Verified | The worst journey in my lif...
9,9,✅ Trip Verified | The airplanes and the lounge...


In [14]:
reviews.shape

(1000, 2)

In [15]:
reviews.isnull().sum()

Unnamed: 0    0
reviews       0
dtype: int64

In [16]:
reviews.drop(columns = 'Unnamed: 0' , inplace = True)
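
The `Unnamed: 0` column appears because `to_csv` writes the DataFrame index as an unnamed first column by default, and `read_csv` then reads it back as ordinary data. A minimal sketch of this round trip, using an in-memory buffer and two made-up rows in place of the real file:

```python
import io

import pandas as pd

# Two placeholder rows standing in for the scraped reviews
df = pd.DataFrame({"reviews": ["first review", "second review"]})

# Default behaviour: the index is written and comes back as "Unnamed: 0"
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)
print(list(with_index.columns))

# Writing with index=False leaves only the real column
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)
print(list(without_index.columns))
```

Saving with `index=False` in the first place would make the `drop` step unnecessary.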

In [17]:
reviews.head(10)

,reviews
0,✅ Trip Verified | The worst airline I have e...
1,"✅ Trip Verified | Excellent service levels, ..."
2,Not Verified | Booked a very special holiday ...
3,"Not Verified | Just returned from Chicago, fle..."
4,✅ Trip Verified | BA standards continue to de...
5,Not Verified | Awful. Business class check in...
6,✅ Trip Verified | Not a reliable airline. You...
7,✅ Trip Verified | I take comfort in reading t...
8,✅ Trip Verified | The worst journey in my lif...
9,✅ Trip Verified | The airplanes and the lounge...


In [18]:
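import re

# Strip the verification marker from each review.
# Note: the second alternative begins with a space, so a "Not Verified |" prefix
# at the very start of the text is not matched and stays in place (see the output below).
reviews['clean_reviews'] = reviews['reviews'].apply(lambda x: re.sub(r'✅ Trip Verified \| | Not Verified \|', '', x))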


In [20]:
reviews.drop(columns = 'reviews' , inplace = True)

In [21]:
reviews.head(15)

,clean_reviews
0,The worst airline I have ever flown with. Al...
1,"Excellent service levels, proactive crew and..."
2,Not Verified | Booked a very special holiday ...
3,"Not Verified | Just returned from Chicago, fle..."
4,BA standards continue to decline every time I...
5,Not Verified | Awful. Business class check in...
6,Not a reliable airline. You cannot trust the ...
7,I take comfort in reading the last ten or so ...
8,The worst journey in my life. The connection ...
9,"The airplanes and the lounges are worn out, ol..."
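
Rows 2, 3 and 5 in the output above still start with "Not Verified |", most likely because the second alternative in the pattern requires a leading space that is not present at the start of the text. One possible fix is sketched below. It assumes every prefix has the form seen in these samples (an optional ✅, then "Trip Verified" or "Not Verified", then a pipe); `strip_verification_prefix` is a hypothetical helper name, and the sample strings are shortened from the rows shown above.

```python
import re

# Optional leading whitespace and check mark, either verification status, then the pipe
PREFIX_PATTERN = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")


def strip_verification_prefix(text: str) -> str:
    """Remove a leading verification marker, if present (hypothetical helper)."""
    return PREFIX_PATTERN.sub("", text)


samples = [
    "✅ Trip Verified | BA standards continue to decline",
    "Not Verified | Awful. Business class check in",
    "Not a reliable airline.",
]
for sample in samples:
    print(strip_verification_prefix(sample))
```

Anchoring the pattern with `^` means a review that merely mentions "Not Verified" later in its text is left untouched. The helper could be applied with `reviews['reviews'].apply(strip_verification_prefix)`.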


In [24]:
!pip install TextBlob

from textblob import TextBlob

# Polarity is a float from -1.0 (most negative) to 1.0 (most positive)
reviews['sentiment'] = reviews['clean_reviews'].apply(lambda x: TextBlob(x).sentiment.polarity)


Collecting TextBlob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
Installing collected packages: TextBlob
Successfully installed TextBlob-0.17.1


In [25]:
reviews.head(10)

,clean_reviews,sentiment
0,The worst airline I have ever flown with. Al...,-0.093651
1,"Excellent service levels, proactive crew and...",0.349883
2,Not Verified | Booked a very special holiday ...,0.036967
3,"Not Verified | Just returned from Chicago, fle...",0.024118
4,BA standards continue to decline every time I...,0.008532
5,Not Verified | Awful. Business class check in...,-0.123958
6,Not a reliable airline. You cannot trust the ...,0.004924
7,I take comfort in reading the last ten or so ...,-0.127273
8,The worst journey in my life. The connection ...,-0.066477
9,"The airplanes and the lounges are worn out, ol...",-0.0675


In [26]:
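categories = ['service', 'food', 'comfort', 'staff', 'punctuality']

def categorize_review(review):
    """Return the first category keyword found in the review, or 'other'."""
    # Case-sensitive substring match; earlier entries in `categories` take priority
    for category in categories:
        if category in review:
            return category
    return 'other'

reviews['category'] = reviews['clean_reviews'].apply(categorize_review)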


In [27]:
reviews.head(10)

,clean_reviews,sentiment,category
0,The worst airline I have ever flown with. Al...,-0.093651,other
1,"Excellent service levels, proactive crew and...",0.349883,service
2,Not Verified | Booked a very special holiday ...,0.036967,food
3,"Not Verified | Just returned from Chicago, fle...",0.024118,service
4,BA standards continue to decline every time I...,0.008532,service
5,Not Verified | Awful. Business class check in...,-0.123958,staff
6,Not a reliable airline. You cannot trust the ...,0.004924,other
7,I take comfort in reading the last ten or so ...,-0.127273,food
8,The worst journey in my life. The connection ...,-0.066477,service
9,"The airplanes and the lounges are worn out, ol...",-0.0675,other
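
`categorize_review` uses a case-sensitive substring test and returns the first keyword in list order. A review that mentions both "service" and "food" is therefore counted only under `service`, and row 7 is filed under `food` even though its visible text begins with "I take comfort in". A hedged sketch of one alternative, which matches whole words case-insensitively and keeps every matching category; `find_categories` is a hypothetical name and the sample sentences are made up:

```python
import re

CATEGORIES = ["service", "food", "comfort", "staff", "punctuality"]


def find_categories(review: str) -> list[str]:
    """Return every category keyword that occurs as a whole word (hypothetical helper)."""
    matches = [
        category
        for category in CATEGORIES
        # \b restricts the match to whole words; IGNORECASE also catches "Food"
        if re.search(rf"\b{re.escape(category)}\b", review, flags=re.IGNORECASE)
    ]
    return matches or ["other"]


print(find_categories("Food was cold but the staff were kind."))
print(find_categories("The flight left on time."))
```

Whole-word matching is a trade-off: it avoids accidental hits inside longer words, but it also means "comfortable" no longer counts as a mention of "comfort". With several labels per review, `DataFrame.explode` can turn the list column into one row per category before grouping.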


In [35]:
category_sentiments = reviews.groupby('category')['sentiment'].mean()
positive_reviews = reviews[reviews['sentiment'] > 0]
negative_reviews = reviews[reviews['sentiment'] < 0]


In [37]:
# Label each review as "positive", "negative" or "neutral" based on the sign of its sentiment score
reviews['sentiment_label'] = reviews['sentiment'].apply(lambda x: 'positive' if x > 0 else 'negative' if x < 0 else 'neutral')


In [38]:
reviews.head(10)

,clean_reviews,sentiment,category,sentiment_label
0,The worst airline I have ever flown with. Al...,-0.093651,other,negative
1,"Excellent service levels, proactive crew and...",0.349883,service,positive
2,Not Verified | Booked a very special holiday ...,0.036967,food,positive
3,"Not Verified | Just returned from Chicago, fle...",0.024118,service,positive
4,BA standards continue to decline every time I...,0.008532,service,positive
5,Not Verified | Awful. Business class check in...,-0.123958,staff,negative
6,Not a reliable airline. You cannot trust the ...,0.004924,other,positive
7,I take comfort in reading the last ten or so ...,-0.127273,food,negative
8,The worst journey in my life. The connection ...,-0.066477,service,negative
9,"The airplanes and the lounges are worn out, ol...",-0.0675,other,negative
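
Scores very close to zero are labelled by sign alone, which is how row 6 (polarity 0.004924) ends up as positive. One option is to treat a small band around zero as neutral. The sketch below assumes a cut-off of 0.05, which is an arbitrary illustrative value and not something derived from this data; `label_sentiment` is a hypothetical name.

```python
def label_sentiment(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity score to a label, treating |polarity| <= threshold as neutral."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Polarity values taken from rows 0, 1 and 6 of the table above
for score in [-0.093651, 0.349883, 0.004924]:
    print(score, label_sentiment(score))
```

Setting `threshold=0` reproduces the sign-based labelling used in the notebook.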


In [39]:
# Calculate the counts of each sentiment label
sentiment_counts = reviews['sentiment_label'].value_counts()

# Calculate the percentage of positive and negative sentiment
positive_percentage = (sentiment_counts.get('positive', 0) / len(reviews)) * 100
negative_percentage = (sentiment_counts.get('negative', 0) / len(reviews)) * 100

print(f"Percentage of positive sentiment: {positive_percentage:.2f}%")
print(f"Percentage of negative sentiment: {negative_percentage:.2f}%")


Percentage of positive sentiment: 64.50%
Percentage of negative sentiment: 34.40%



In [43]:
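# Add a helper column of ones so that summing it gives the number of reviews per category
reviews['count'] = 1
reviews.groupby('category')[['sentiment', 'count']].sum()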

category,sentiment,count
comfort,2.584562,41
food,11.983202,134
other,17.719905,286
service,33.084169,457
staff,3.572504,82


In [51]:
# Create a DataFrame with category and sentiment_label columns
category_sentiment_counts = reviews.groupby(['category', 'sentiment_label'])['sentiment_label'].count().reset_index(name='count')

# Pivot the table to create separate columns for positive and negative counts
pivot_table = category_sentiment_counts.pivot(index='category', columns='sentiment_label', values='count')

# Fill NaN values with 0 (in case a category has only one sentiment label)
pivot_table.fillna(0, inplace=True)

# Calculate the total reviews for each category
pivot_table['total_reviews'] = pivot_table['negative'] + pivot_table['positive']

# Reset the index to have 'category' as a regular column
pivot_table.reset_index(inplace=True)

# If a category has no negative or positive reviews, replace NaN with 0
pivot_table['negative'].fillna(0, inplace=True)
pivot_table['positive'].fillna(0, inplace=True)

print(pivot_table)


sentiment_label category  negative  neutral  positive  total_reviews
0                comfort      16.0      0.0      25.0           41.0
1                   food      43.0      1.0      90.0          133.0
2                  other     102.0      5.0     179.0          281.0
3                service     149.0      3.0     305.0          454.0
4                  staff      34.0      2.0      46.0           80.0
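
The per-category counts can be cross-checked against the overall percentages printed earlier. The snippet below retypes the counts from the table above by hand. Note that `total_reviews` adds only the negative and positive columns, so the neutral reviews have to be added back to reach the full sample size.

```python
# (negative, neutral, positive) counts copied from the printed pivot table
counts = {
    "comfort": (16, 0, 25),
    "food": (43, 1, 90),
    "other": (102, 5, 179),
    "service": (149, 3, 305),
    "staff": (34, 2, 46),
}

negative = sum(row[0] for row in counts.values())
neutral = sum(row[1] for row in counts.values())
positive = sum(row[2] for row in counts.values())
total = negative + neutral + positive

print(f"total reviews: {total}")
print(f"positive: {positive / total:.2%}")
print(f"negative: {negative / total:.2%}")
print(f"neutral: {neutral / total:.2%}")
```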


In [53]:
df_1 = pd.DataFrame(pivot_table)

In [54]:
df_1.head()

sentiment_label,category,negative,neutral,positive,total_reviews
0,comfort,16.0,0.0,25.0,41.0
1,food,43.0,1.0,90.0,133.0
2,other,102.0,5.0,179.0,281.0
3,service,149.0,3.0,305.0,454.0
4,staff,34.0,2.0,46.0,80.0


After removing the `neutral` column:
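
The cell that removed the column is not included here. A minimal sketch of one way to do it, assuming a plain `DataFrame.drop`; the small table is rebuilt by hand from the earlier `df_1.head()` output so that the snippet runs on its own:

```python
import pandas as pd

# Values retyped from the earlier df_1.head() output
df_1 = pd.DataFrame(
    {
        "category": ["comfort", "food", "other", "service", "staff"],
        "negative": [16.0, 43.0, 102.0, 149.0, 34.0],
        "neutral": [0.0, 1.0, 5.0, 3.0, 2.0],
        "positive": [25.0, 90.0, 179.0, 305.0, 46.0],
        "total_reviews": [41.0, 133.0, 281.0, 454.0, 80.0],
    }
)

# Drop the neutral counts; total_reviews already excludes them
df_1 = df_1.drop(columns="neutral")
print(df_1.head())
```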

In [62]:
df_1.head()

sentiment_label,category,negative,positive,total_reviews
0,comfort,16.0,25.0,41.0
1,food,43.0,90.0,133.0
2,other,102.0,179.0,281.0
3,service,149.0,305.0,454.0
4,staff,34.0,46.0,80.0


In [63]:
df_1 = df_1.sort_values('total_reviews' , ascending = False)

In [64]:
df_1.head()

sentiment_label,category,negative,positive,total_reviews
3,service,149.0,305.0,454.0
2,other,102.0,179.0,281.0
1,food,43.0,90.0,133.0
4,staff,34.0,46.0,80.0
0,comfort,16.0,25.0,41.0


Save the results to CSV files for future use:

In [65]:
df_1.to_csv("categorywise_analysis.csv")
reviews.to_csv("cleaned_reviews_analysis.csv")

### Insights

Of the 1000 reviews analysed, 64.50% have positive sentiment and 34.40% have negative sentiment; the remaining 1.10% are neutral.