# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to parse the pages, which are loaded here through `Selenium`, and collect the data from the web. Once you've collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com) you can see that there is a lot of data there. For this task, we are only interested in reviews of British Airways and in the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways) you will see this data. We can now use `Python` and `BeautifulSoup` to walk through the paginated review listing and collect the text of each review.
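The review listing is paginated, and the scraping loop later in this notebook requests URLs of the form `.../page/{i}/?sortby=post_date%3ADesc&pagesize=100`. A minimal sketch of how such a URL could be assembled is shown here. The helper name `build_page_url` and its `page_size` parameter are hypothetical and used only for illustration; the query parameters are the ones that appear in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page: int, page_size: int = 100) -> str:
    """Return the URL of one page of the review listing (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # urlencode percent-encodes the colon, giving sortby=post_date%3ADesc
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"


print(build_page_url(1))
```

Building the query string with `urlencode` avoids hand-writing the percent-encoded `%3A` and keeps the page size in one place.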

In [1]:
# importing necessary packages

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options

import time
import pprint
import pymongo
from pymongo import MongoClient
import pandas as pd
import re
import datetime

import psycopg2
from sqlalchemy import create_engine


In [31]:
# creating a driver
driver = webdriver.Chrome()

# base url
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"

# getting the url
driver.get(base_url)

# scrolling to and clicking the button to get 100 reviews per page
element_100 = driver.find_element(By.XPATH, '//*[@id="main"]/section[3]/div[1]/article/div[1]/div[2]/form/ul/li[4]/label')
driver.execute_script("arguments[0].scrollIntoView({'block':'center', 'inline':'center'});", element_100)
element_100.click()

# finding the total number of pages
num_element = driver.find_element(By.XPATH, '//*[@id="main"]/section[3]/div[1]/div/article/ul/li[8]/a')
driver.execute_script("arguments[0].scrollIntoView({'block':'center', 'inline':'center'});", num_element)
num_pages = int(num_element.text)


# url pattern for paginated data:
# https://www.airlinequality.com/airline-reviews/british-airways/page/1/?sortby=post_date%3ADesc&pagesize=100



In [32]:
reviews = []

for i in range(1, num_pages + 1):
    print('scraping page ', i)

    # url for paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize=100"
    driver.get(url)

    # parse the rendered page and pull out every review body
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    all_revs = soup.find_all('div', class_='text_content')

    for rev in all_revs:
        reviews.append(rev.text)

scraping page  1
scraping page  2
scraping page  3
scraping page  4
scraping page  5
scraping page  6
scraping page  7
scraping page  8
scraping page  9
scraping page  10
scraping page  11
scraping page  12
scraping page  13
scraping page  14
scraping page  15
scraping page  16
scraping page  17
scraping page  18
scraping page  19
scraping page  20
scraping page  21
scraping page  22
scraping page  23
scraping page  24
scraping page  25
scraping page  26
scraping page  27
scraping page  28
scraping page  29
scraping page  30
scraping page  31
scraping page  32
scraping page  33
scraping page  34
scraping page  35
scraping page  36
scraping page  37


In [33]:
pprint.pprint(len(reviews))

3694


In [34]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

   reviews
0  ✅ Trip Verified | The flight was on time. The...
1  Not Verified | Angry, disappointed, and unsat...
2  ✅ Trip Verified | As an infrequent flyer, Bri...
3  Not Verified | A totally unremarkable flight,...
4  ✅ Trip Verified | 1. Ground crew in Heathrow...


In [36]:
df.to_csv("C:/Users/snkri/OneDrive/Desktop/virtual_interns/BA_reviews.csv")

Congratulations! You now have your dataset for this task. The loop above collected 3,694 reviews by iterating through all 37 paginated pages on the website, at up to 100 reviews per page. Because `num_pages` is read from the site itself, rerunning the notebook later will pick up any pages added since.

The next step is to clean the data by removing unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it appears, as it's not relevant to what we want to investigate.
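A minimal sketch of this cleaning step, using only the standard library, could look like the following. The helper name `strip_verification_prefix` is hypothetical. Stripping the "Not Verified" marker as well is an assumption, based on the sample rows shown in this notebook.

```python
import re

# Optional check mark, then "Trip Verified" or "Not Verified", then the "|"
# separator and any surrounding whitespace. Removing "Not Verified" too is an
# assumption based on the sample rows in this notebook.
_PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")


def strip_verification_prefix(text: str) -> str:
    """Remove a leading verification marker, if present (hypothetical helper)."""
    return _PREFIX.sub("", text).strip()


sample = [
    "✅ Trip Verified |  The flight was on time.",
    "Not Verified |  Angry, disappointed, and unsatisfied.",
    "No marker on this review.",
]
cleaned = [strip_verification_prefix(s) for s in sample]
print(cleaned)
```

On the DataFrame built above, the same function could be applied with `df["reviews"] = df["reviews"].map(strip_verification_prefix)`.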

In [37]:
# creating a connection with postgres
conn = psycopg2.connect(
    database='ba_reviews',
    user='postgres',
    password='Hemanthkumar#1',
    port='5432',
    host='localhost'
)

In [39]:
# creating a cursor
cursor = conn.cursor()

# creating an engine
engine = create_engine('postgresql+psycopg2://postgres:Hemanthkumar#1@localhost:5432/ba_reviews')
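The connection above embeds the database password directly in the notebook. One possible alternative, sketched here under assumptions, is to read the credentials from environment variables and percent-encode them before building the URL. The helper name `build_postgres_url` is hypothetical. Using the `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, and `PGDATABASE` variable names is an assumption about how the environment is set up; the defaults mirror the values used in this notebook.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ) -> str:
    """Assemble a PostgreSQL URL from environment variables (hypothetical helper)."""
    user = env.get("PGUSER", "postgres")
    password = env["PGPASSWORD"]  # deliberately no default: fail if unset
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "ba_reviews")
    # quote_plus escapes characters such as "@" or "#" that have special
    # meaning in a URL.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )
```

With `PGPASSWORD` set in the environment, the engine could then be created with `create_engine(build_postgres_url())`.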

In [40]:
df.to_sql('ba_reviews', engine, if_exists='replace')

694

In [43]:
cursor.execute('SELECT * FROM ba_reviews')

# fetching the table
table = cursor.fetchall()

test_df = pd.DataFrame(table)

In [48]:
test_df.iloc[1, 1]

'Not Verified |  Angry, disappointed, and unsatisfied. My route was from London to Atlanta. My suitcase was not boarded, therefore not landed with me. For both comfort and safety reason, a bag always fly with its passenger and that did not happen. Claims and few phone calls were made by desk assistants who answered my questions unprofessionally and miserably. Certainly, I was left with nothing but my backpack which contained not more than few snacks. Neither clothes nor anything else was ever provided as an apology. Meanwhile, I was also told that my bag would have been delivered through the next 24 hours which also did not happen. British Airways is a great airline to fly with but its organization, when it comes to customer service, is poor and uncertain. Still waiting for my bag.'

In [46]:
pprint.pprint(test_df.head())

   0                                                  1
0  0  ✅ Trip Verified |  The flight was on time. The...
1  1  Not Verified |  Angry, disappointed, and unsat...
2  2  ✅ Trip Verified |  As an infrequent flyer, Bri...
3  3  Not Verified |  A totally unremarkable flight,...
4  4  ✅ Trip Verified |   1. Ground crew in Heathrow...
