# Data Collection

**In this stage we collect customer review data for British Airways from Skytrax, the airline quality portal. For each review we gather the following fields:**

1. Written Reviews
2. User Ratings
3. Posting Dates
4. User's Country

Link to the website: https://www.airlinequality.com/airline-reviews/british-airways/?sortby=post_date%3ADesc&pagesize=100

<hr>

### Imports

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

### Lists for Each Collected Field

In [2]:
reviews  = []
ratings = []
date = []
country = []

### Web Scraping

In [3]:
for i in range(1, 38):
    
    # Construct the URL for each page
    page_url = f"https://www.airlinequality.com/airline-reviews/british-airways/page/{i}/?sortby=post_date%3ADesc&pagesize=100"
    
    # Send an HTTP GET request to the URL
    page = requests.get(page_url)
    
    # Parse the page content with BeautifulSoup
    soup = BeautifulSoup(page.content, "html.parser")
    
    # Extract reviews
    for item in soup.find_all("div", class_="text_content"):
        reviews.append(item.text)
    
    # Extract ratings
    for item in soup.find_all("div", class_="rating-10"):
        try:
            ratings.append(item.span.text)
        except AttributeError:
            # No <span> inside this rating block, so there is no score to read
            print(f"Error on page {i}")
            ratings.append("None")
    
    # Extract dates
    for item in soup.find_all("time"):
        date.append(item.text)
        
    # Extract reviewer countries
    for item in soup.find_all("h3"):
        country.append(item.span.next_sibling.text.strip(" ()"))

Error on page 32
Error on page 33
Error on page 33
Error on page 35
Error on page 36
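
The page URLs in the loop above follow a fixed pattern, with the sort order and page size passed as query parameters. As a minimal sketch (not part of the original notebook), the same URLs could be built with the standard library's `urlencode`, which handles the percent-encoding of `post_date:Desc`. The names `BASE_URL` and `build_page_url` are hypothetical helpers introduced only for illustration.

```python
from urllib.parse import urlencode

# Assumed base path, copied from the URL used in the scraping loop
BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page, page_size=100):
    """Return the review-listing URL for one page, newest reviews first."""
    # urlencode percent-encodes the colon, giving sortby=post_date%3ADesc
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"


# Same page range as the loop above: pages 1 through 37
urls = [build_page_url(i) for i in range(1, 38)]
print(urls[0])
print(len(urls))
```

Keeping the URL construction in one function makes it easier to change the page size or sort order in a single place.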


### Checking List Lengths Before Constructing the DataFrame

In [4]:
print(f'Number of Reviews : {len(reviews)}')
print(f'Number of Ratings : {len(ratings)}')
print(f'Number of Dates : {len(date)}')
print(f'Number of Countries : {len(country)}')

Number of Reviews : 3685
Number of Ratings : 3722
Number of Dates : 3685
Number of Countries : 3685


#### Truncating Ratings to Match the Length of the Other Lists

In [5]:
ratings = ratings[:len(reviews)]

In [6]:
len(ratings)

3685
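
One caveat about this truncation: the surplus is 3722 - 3685 = 37, which equals the number of pages scraped. That suggests each page contributes one extra `rating-10` element, possibly an overall score shown near the top of the page. This is an inference from the counts, not something confirmed against the page markup. If it holds, slicing the combined list to `len(reviews)` can pair reviews with the wrong ratings. A minimal sketch of trimming per page instead, assuming the surplus elements come first on each page; `align_page_ratings` is a hypothetical helper and the sample values are made up:

```python
def align_page_ratings(page_ratings, n_reviews):
    """Drop surplus leading ratings so one page's ratings match its reviews.

    Assumption: any extra rating elements appear before the per-review ones.
    """
    surplus = len(page_ratings) - n_reviews
    if surplus < 0:
        raise ValueError("fewer ratings than reviews on this page")
    return page_ratings[surplus:]


# Made-up example: one page with three reviews and four rating elements
page_reviews = ["review a", "review b", "review c"]
page_ratings = ["\n\t5", "8", "3", "4"]

aligned = align_page_ratings(page_ratings, len(page_reviews))
print(aligned)
```

Inside the scraping loop, this would mean collecting each page's ratings into a temporary list, aligning it against that page's review count, and only then extending the global `ratings` list.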

### Constructing DataFrame

In [7]:
data = {
    'Review': reviews,
    'Rating': ratings,
    'Date': date,
    'Country': country
}

df = pd.DataFrame(data)

In [8]:
df.head(5)

| | Review | Rating | Date | Country |
|---|---|---|---|---|
| 0 | Not Verified \|  Flew back from Malta after sc... | `\n\t\t\t\t\t\t\t\t\t\t\t\t\t5` | 22nd October 2023 | United Kingdom |
| 1 | Not Verified \| Cabin luggage had to go to carg... | 8 | 21st October 2023 | Netherlands |
| 2 | ✅ Trip Verified \|  I have been using BA for a ... | 3 | 21st October 2023 | United Kingdom |
| 3 | ✅ Trip Verified \|  I flew from Istanbul to Lo... | 4 | 19th October 2023 | United Kingdom |
| 4 | Not Verified \|  I have flow on BA several time... | 1 | 19th October 2023 | United States |
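
The preview shows that the raw values still need cleaning: the first rating carries leading whitespace characters, ratings are stored as strings (with `"None"` where no score was found), and dates use ordinal suffixes such as `22nd`. The following is a minimal sketch of how individual values might be normalized using only the standard library. `clean_rating` and `parse_review_date` are hypothetical helpers, not part of the original notebook, and the date parsing assumes an English locale for month names.

```python
import re
from datetime import date, datetime


def clean_rating(raw):
    """Strip whitespace and return the rating as an int, or None if absent."""
    text = raw.strip()
    return int(text) if text.isdigit() else None


def parse_review_date(raw):
    """Convert a string like '22nd October 2023' into a datetime.date."""
    # Remove the ordinal suffix only when it directly follows a digit
    without_suffix = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", raw.strip())
    return datetime.strptime(without_suffix, "%d %B %Y").date()


print(clean_rating("\n\t\t5"))
print(clean_rating("None"))
print(parse_review_date("22nd October 2023"))
```

These helpers could be applied column-wise, for example with `df['Rating'].map(clean_rating)` and `df['Date'].map(parse_review_date)`.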


### Saving DataFrame as `csv` file

In [9]:
import os

# Build the output path portably instead of concatenating strings
cwd = os.getcwd()
df.to_csv(os.path.join(cwd, "BA_Reviews.csv"))