## Scrape Data

In this phase, we gather customer reviews of British Airways from the web.
The source is the Skytrax website (airlinequality.com), and we collect as many reviews as possible so that the later analysis has enough data to work with.
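
The scraping cell that follows walks a fixed range of listing pages. As a rough sketch of an alternative, the page URL can be built by a small helper and the loop can stop as soon as a page yields no reviews. The names `build_page_url`, `iter_pages`, and `fetch_reviews` are hypothetical, and the sketch assumes that a page past the end of the listing comes back with no review elements, which has not been verified here.

```python
from typing import Callable, Iterator, List

BASE_URL = "https://www.airlinequality.com/airline-reviews"


def build_page_url(airline_slug: str, page: int, page_size: int = 100) -> str:
    """Return the listing URL for one page of reviews, newest first."""
    return (
        f"{BASE_URL}/{airline_slug}/page/{page}/"
        f"?sortby=post_date%3ADesc&pagesize={page_size}"
    )


def iter_pages(
    fetch_reviews: Callable[[str], List],
    airline_slug: str,
    max_pages: int = 100,
) -> Iterator[List]:
    """Yield the reviews of each page until a page comes back empty.

    fetch_reviews is a hypothetical callable that downloads and parses one
    URL and returns a list of reviews (an empty list when there are none).
    max_pages is a safety cap so the loop always terminates.
    """
    for page in range(1, max_pages + 1):
        reviews = fetch_reviews(build_page_url(airline_slug, page))
        if not reviews:
            break
        yield reviews
```

Passing the fetch step in as a callable keeps the paging logic separate from the network and parsing code, so it can be tried out with a stub before any real request is made.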




In [1]:
#imports

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests 
import re

In [2]:
# Initialize a list to store review data
reviews_data = []

for i in range(1, 39):
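    # Request one page of British Airways reviews (100 per page, newest first)
    page = requests.get(
        f"https://www.airlinequality.com/airline-reviews/british-airways/page/{i}/?sortby=post_date%3ADesc&pagesize=100",
        timeout=30,
    )
    # Fail on an HTTP error instead of parsing an error page
    page.raise_for_status()
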
    # Parse HTML using Beautiful Soup
    soup = BeautifulSoup(page.content, 'html.parser')

    # Loop through each review element
    for review_elem in soup.find_all('article', class_='comp_media-review-rated'):
        review_data = {}

        # Extract rating information
        rating_elem = review_elem.find('span', itemprop='ratingValue')
        if rating_elem:
            review_data['Rating'] = int(rating_elem.get_text())

        # Extract title information
        title_elem = review_elem.find('h2', class_='text_header')
        if title_elem:
            review_data['Title'] = title_elem.get_text().strip()

        # Extract author, date, and country information
        # (all three are recorded only if every element is present)
        author_elem = review_elem.find('span', itemprop='author')
        date_elem = review_elem.find('time', itemprop='datePublished')
        country_elem = review_elem.find('h3', class_='text_sub_header')

        if author_elem and date_elem and country_elem:
            review_data['Author'] = author_elem.get_text().strip()
            review_data['Date'] = date_elem.get('datetime')
            country_text = country_elem.get_text()
            review_data['Country'] = country_text.split('(')[-1].split(')')[0]

        # Extract review information
        body_elem = review_elem.find('div', class_='text_content')
        if body_elem:
            review_data['Review'] = body_elem.get_text().strip()

        # Extract rating details
        rating_elems = review_elem.find_all('tr')
        for rating in rating_elems:
            header = rating.find('td', class_='review-rating-header')
            value = rating.find('td', class_='review-value')
            if header and value:
                review_data[header.get_text().strip()] = value.get_text().strip()

        # Extract star information for each rating category
        star_elems = review_elem.find_all('td', class_='review-rating-stars stars')
        for star_elem in star_elems:
            header = star_elem.find_previous_sibling('td', class_='review-rating-header')
            if header:
                stars = len(star_elem.find_all('span', class_='star fill'))
                review_data[header.get_text().strip()] = stars

        # Append review data to the list reviews_data
        reviews_data.append(review_data)
    
# Create a DataFrame from the list reviews_data
df = pd.DataFrame(reviews_data)
df

Unnamed: 0,Rating,Title,Author,Date,Country,Review,Aircraft,Type Of Traveller,Seat Type,Route,Date Flown,Recommended,Seat Comfort,Cabin Staff Service,Ground Service,Value For Money,Food & Beverages,Inflight Entertainment,Wifi & Connectivity
0,8.0,"""stick to their cabin bag size limit""",4 reviews\n\n\n\nMichael Powell,2024-03-21,United Kingdom,Not Verified | The flight was comfortable eno...,A320,Solo Leisure,Economy Class,London Heathrow to Toulouse,March 2024,yes,4.0,4.0,3.0,3,,,
1,8.0,"""crew were attentive, friendly""",N Wardan,2024-03-21,Canada,✅ Trip Verified | We had a really good flying...,Boeing 787-8 /777-200 / A320-200,Family Leisure,Economy Class,Montreal to Venice via London Heathrow,March 2024,yes,4.0,5.0,3.0,4,5.0,4.0,
2,3.0,"""Utterly outrageous""",Solomon Pachtinger,2024-03-19,United Kingdom,✅ Trip Verified | Waited an hour to check-in ...,A321,Business,Business Class,Paphos to London,March 2024,no,2.0,1.0,1.0,1,2.0,1.0,2.0
3,3.0,"""They have a long way to go""",Paul Roberts,2024-03-19,Singapore,Not Verified | Not a great experience at all...,Boeing 777-300,Business,Business Class,London to Houston,March 2024,no,1.0,3.0,1.0,1,3.0,2.0,2.0
4,7.0,"""FA's were friendly""",42 reviews\n\n\n\nE Carmere,2024-03-14,Belgium,✅ Trip Verified | Boarding was difficult caus...,A320,Solo Leisure,Business Class,London Heathrow to Brussels,March 2024,yes,2.0,4.0,3.0,3,3.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3766,1.0,British Airways customer review,J Robertson,2012-08-29,United Kingdom,This was a bmi Regional operated flight on a R...,,,Economy Class,,,no,3.0,1.0,,3,2.0,0.0,
3767,9.0,British Airways customer review,Nick Berry,2012-08-28,United Kingdom,LHR to HAM. Purser addresses all club passenge...,,,Business Class,,,yes,4.0,5.0,,3,4.0,0.0,
3768,5.0,British Airways customer review,Avril Barclay,2011-10-12,United Kingdom,My son who had worked for British Airways urge...,,,Economy Class,,,yes,,,,4,,,
3769,4.0,British Airways customer review,C Volz,2011-10-11,United States,London City-New York JFK via Shannon on A318 b...,,,Premium Economy,,,no,1.0,3.0,,1,5.0,0.0,
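
The output above shows two patterns worth cleaning before analysis: some `Author` values carry a leading review counter (for example `4 reviews` followed by blank lines and then the name), and many `Review` values start with a verification status such as `✅ Trip Verified |` or `Not Verified |`. The following is a minimal sketch of how these could be separated. The helper names `clean_author` and `split_verification` are hypothetical, and only the two status strings visible in the output are handled.

```python
import re
from typing import Optional, Tuple

# Matches an optional non-word prefix (such as the check mark), one of the
# two status strings seen in the scraped output, and the "|" separator.
_STATUS = re.compile(r"^\W*(Trip Verified|Not Verified)\s*\|\s*")


def clean_author(raw: str) -> str:
    """Keep only the last non-empty line, dropping any 'N reviews' counter."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def split_verification(review: str) -> Tuple[Optional[bool], str]:
    """Split a review into (verified, text).

    verified is True for 'Trip Verified', False for 'Not Verified', and
    None when the review has no recognised status prefix.
    """
    match = _STATUS.match(review)
    if match is None:
        return None, review.strip()
    verified = match.group(1) == "Trip Verified"
    return verified, review[match.end():].strip()
```

These helpers work on plain strings, so they could be applied column-wise (for example with `Series.map`) once missing values have been handled.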


In [3]:
# Save the dataframe to an Excel file
df.to_excel('C:/Data Science - British Airways/Data/data_airline-reviews(British Airways).xlsx', index=False)
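
The save step above uses a hard-coded absolute Windows path. As one possible alternative, the output location could be assembled with `pathlib` so the same notebook runs from other folders or machines. The helper name `build_output_path` and the relative `Data` directory are assumptions made for illustration only.

```python
from pathlib import Path


def build_output_path(data_dir, airline: str, suffix: str = ".xlsx") -> Path:
    """Return <data_dir>/data_airline-reviews(<airline>)<suffix> as a Path."""
    return Path(data_dir) / f"data_airline-reviews({airline}){suffix}"


output_path = build_output_path("Data", "British Airways")

# Possible usage with the DataFrame from the scraping cell:
# output_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder
# df.to_excel(output_path, index=False)
```

Writing `.xlsx` files from pandas requires an Excel engine such as `openpyxl` to be installed.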