# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use `Selenium` to load the pages and a package called `BeautifulSoup` to parse the data out of them. Once you've collected your data and saved it into a local `.csv` file, you should start your analysis.

### Scraping data from Skytrax

If you visit https://www.airlinequality.com you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to https://www.airlinequality.com/airline-reviews/british-airways you will see this data. Now we can use `Python` and `BeautifulSoup` to step through the paginated review listing and collect the text and metadata of each review.

In [75]:
# importing necessary packages

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options

import time
import pprint
import pymongo
from pymongo import MongoClient
import pandas as pd
import re
import datetime

import psycopg2
from sqlalchemy import create_engine


In [125]:
# creating a driver
driver = webdriver.Chrome()

# base url
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"

# getting the url
driver.get(base_url)

# scrolling to and clicking the button to get 100 reviews per page
element_100 = driver.find_element(By.XPATH, '//*[@id="main"]/section[3]/div[1]/article/div[1]/div[2]/form/ul/li[4]/label')
driver.execute_script("arguments[0].scrollIntoView({'block':'center', 'inline':'center'});", element_100)
element_100.click()

# finding the total number of pages
num_element = driver.find_element(By.XPATH, '//*[@id="main"]/section[3]/div[1]/div/article/ul/li[8]/a')
driver.execute_script("arguments[0].scrollIntoView({'block':'center', 'inline':'center'});", num_element)
num_pages = int(num_element.text)


# example url for paginated data:
# https://www.airlinequality.com/airline-reviews/british-airways/page/1/?sortby=post_date%3ADesc&pagesize=100



In [154]:
reviews = []
ratings = []
authors = []
countries = []
dates = []

for i in range(1, num_pages+1):

    print('scraping page ',i)

    # url for paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize=100"
    driver.get(url)

    soup = BeautifulSoup(driver.page_source, 'html.parser')

    all_revs = soup.find_all('div', class_='text_content')
    all_ratings = soup.find_all('article', attrs={"itemprop": "review"})
    all_authors = soup.find_all('span', attrs={"itemprop": "name"})
    all_countries = soup.find_all('h3', class_="text_sub_header userStatusWrapper")
    all_dates = soup.find_all('time', attrs={"itemprop": "datePublished"})

    for rev in all_revs:
        reviews.append(rev.text)

    for rate in all_ratings:
        rating_tag = rate.find("div", class_='rating-10')
        if rating_tag is None:
            # no rating block in this review: append a placeholder so the lists stay aligned
            ratings.append('NA')
            continue
        # keep only the part of the rating text before the '/'
        temp_rate = rating_tag.text.replace('\n', '')
        ratings.append(temp_rate.split('/')[0])

    for auth in all_authors:
        authors.append(auth.text)

    for count in all_countries:
        # the country is the first parenthesised part of the header text
        temp = re.findall(r'\((.*?)\)', count.text)
        countries.append(temp[0] if temp else 'NA')

    for date in all_dates:
        dates.append(date['datetime'])

scraping page  1
scraping page  2
scraping page  3
scraping page  4
scraping page  5
scraping page  6
scraping page  7
scraping page  8
scraping page  9
scraping page  10
scraping page  11
scraping page  12
scraping page  13
scraping page  14
scraping page  15
scraping page  16
scraping page  17
scraping page  18
scraping page  19
scraping page  20
scraping page  21
scraping page  22
scraping page  23
scraping page  24
scraping page  25
scraping page  26
scraping page  27
scraping page  28
scraping page  29
scraping page  30
scraping page  31
scraping page  32
scraping page  33
scraping page  34
scraping page  35
scraping page  36
scraping page  37


In [163]:
len(reviews)

3695
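The loop above fills five separate lists, and those lists only stay aligned if every selector matches exactly once per review. One possible alternative, sketched here under assumptions, is to parse each review `article` as a unit and return one record per review. The markup in `SAMPLE_HTML` is made up to mimic the selectors used above and is not copied from the site, and `parse_review` is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the selectors used in this notebook.
SAMPLE_HTML = """
<article itemprop="review">
  <div class="rating-10"><span>8</span>/<span>10</span></div>
  <h3 class="text_sub_header userStatusWrapper">
    <span itemprop="name">A Reviewer</span> (United Kingdom)
    <time itemprop="datePublished" datetime="2023-11-08">8th November 2023</time>
  </h3>
  <div class="text_content">Trip Verified | Good flight.</div>
</article>
<article itemprop="review">
  <h3 class="text_sub_header userStatusWrapper">
    <span itemprop="name">B Reviewer</span>
    <time itemprop="datePublished" datetime="2023-11-07">7th November 2023</time>
  </h3>
  <div class="text_content">Not Verified | No rating given.</div>
</article>
"""


def parse_review(article):
    """Extract all fields of one review so that they cannot drift out of step."""
    rating_tag = article.find("div", class_="rating-10")
    header = article.find("h3", class_="text_sub_header")
    name_tag = article.find("span", attrs={"itemprop": "name"})
    time_tag = article.find("time", attrs={"itemprop": "datePublished"})
    text_tag = article.find("div", class_="text_content")

    # Keep only the score before the slash; None when the block is missing.
    rating = None
    if rating_tag is not None:
        rating = rating_tag.get_text(strip=True).split("/")[0]

    # The country is the parenthesised part of the header text, if present.
    country_match = re.search(r"\((.*?)\)", header.get_text()) if header else None

    return {
        "author": name_tag.get_text(strip=True) if name_tag else None,
        "country": country_match.group(1) if country_match else "NA",
        "published": time_tag["datetime"] if time_tag else None,
        "rating": rating,
        "review": text_tag.get_text(strip=True) if text_tag else None,
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [
    parse_review(article)
    for article in soup.find_all("article", attrs={"itemprop": "review"})
]
```

A list of such records can be passed straight to `pd.DataFrame(records)`, and a review with a missing field shows up as a gap in its own row instead of shifting every later row.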

In [100]:
# creating a mongo client
ba_client = MongoClient()

# creating a database
ba_database = ba_client['ba_database']

In [101]:
# creating a collection
ba_col = ba_database['ba_review_col']

In [171]:
# inserting reviews to mongoDB
ids = []
for i in range(len(reviews)):
    review = {
        '_id': i,
        'author': authors[i],
        'country': countries[i],
        'published': dates[i],
        'rating': ratings[i],
        'review': reviews[i]
    }

    # store the inserted _id rather than the whole InsertOneResult object
    ids.append(ba_col.insert_one(review).inserted_id)
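Indexing five lists by position raises an `IndexError` partway through if one list is shorter than `reviews`. A minimal sketch of a guard, assuming a hypothetical helper named `build_documents` and made-up sample values, checks the lengths first and then builds all documents in one pass:

```python
def build_documents(authors, countries, dates, ratings, reviews):
    """Zip five parallel lists into MongoDB-style documents (hypothetical helper)."""
    columns = (authors, countries, dates, ratings, reviews)

    # Fail early, before anything is written, if the lists are out of step.
    if len({len(col) for col in columns}) != 1:
        raise ValueError(
            "input lists differ in length: "
            + ", ".join(str(len(col)) for col in columns)
        )

    return [
        {"_id": i, "author": a, "country": c, "published": d, "rating": r, "review": t}
        for i, (a, c, d, r, t) in enumerate(zip(*columns))
    ]


# Made-up sample values, for illustration only.
docs = build_documents(
    ["A Reviewer", "B Reviewer"],
    ["United Kingdom", "NA"],
    ["2023-11-08", "2023-11-07"],
    ["8", "NA"],
    ["First review text", "Second review text"],
)
# With a live collection, ba_col.insert_many(docs) would write these in one call.
```

Building the documents separately from writing them also makes it easy to inspect a few before anything reaches the database.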

In [184]:
df = pd.DataFrame()
df['author'] = authors
df['country'] = countries
df['published'] = dates
df['rating'] = ratings
df['review'] = reviews
df.head() 

| | author | country | published | rating | review |
|---|---|---|---|---|---|
| 0 | Graeme Boothman | United Kingdom | 2023-11-08 | 8 | ✅ Trip Verified \| Booked online months ago an... |
| 1 | R Vines | United Kingdom | 2023-11-07 | 7 | ✅ Trip Verified \| The flight was on time. The... |
| 2 | Massimo Tricca | Italy | 2023-11-05 | 2 | Not Verified \| Angry, disappointed, and unsat... |
| 3 | J Kaye | United Kingdom | 2023-11-05 | 3 | ✅ Trip Verified \| As an infrequent flyer, Bri... |
| 4 | M Collie | Ireland | 2023-11-04 | 8 | Not Verified \| A totally unremarkable flight,... |


In [185]:
df.to_csv("C:/Users/snkri/OneDrive/Desktop/virtual_interns/BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loop above collected 3,695 reviews by iterating through all 37 paginated pages on the website, at up to 100 reviews per page. Because `num_pages` is read from the site itself, rerunning the notebook later will also pick up any pages added since.

The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row where it appears, as it's not relevant to what we want to investigate.
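As a starting point for that cleaning step, here is a minimal sketch using only the standard library. It assumes each review starts with an optional "✅", then either "Trip Verified" or "Not Verified", then a `|` separator, as in the sample rows shown earlier; other prefixes may exist in the full dataset. `strip_verification_prefix` is a hypothetical helper name.

```python
import re

# Optional check mark, then a verification label, then the "|" separator.
# Assumption: only these two labels occur; extend the pattern if others turn up.
VERIFICATION_PREFIX = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*"
)


def strip_verification_prefix(text):
    """Remove a leading verification label from one review, if there is one."""
    return VERIFICATION_PREFIX.sub("", text).strip()


cleaned = [
    strip_verification_prefix(raw)
    for raw in [
        "✅ Trip Verified | Booked online months ago",
        "Not Verified | A totally unremarkable flight",
        "No label on this one",
    ]
]
```

To apply it to the DataFrame built earlier, something like `df['review'] = df['review'].map(strip_verification_prefix)` should work. Anchoring the pattern at the start of the string leaves any later mention of "verified" inside a review untouched.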