# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com] you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the Airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now, we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then to collect the text data on each of the individual review links.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [None]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 177
page_size = 10

reviews = []

#declare dictionary
data_dict = {}
# for i in range(1, pages + 1):
review_count = 1
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')

    for para in parsed_content.find_all("article",{"itemprop":"review"}):
        new_dict = {}
        review_text = para.find_all("div", {"class": "text_content"})
        new_dict['reviews'] = review_text[0].get_text() # store reviews in dictionary

        table = para.select('table.review-ratings')
        length_rating = len(table[0].find_all("td",{"class":"review-rating-header"}))
        ratings = []
        count = 0
        stars_label_names = []
        for j in range(0,length_rating*2,2):
            type_rate = table[0].find_all("td")[j].attrs['class'][1]
            columns = table[0].find_all("td")[j].findNext('td').get_text()
            
            if(columns == '12345'):
                stars_label = table[0].find_all("td")[j].get_text()
                stars_label_names.append(stars_label)
                count+=1
            else:
                new_dict[type_rate] = columns
    
        stars_count = 0
        stars_index = 0
        
        for k in range(0,count*5,1):
            stars = table[0].find_all("span")[k].attrs['class']
            if(len(stars) == 2):
                stars_count+=1
            if((k+1)%5 == 0):
                #assign stars count
                new_dict[stars_label_names[stars_index]] = stars_count
                stars_index+=1
                stars_count = 0
        data_dict[review_count] = new_dict
        review_count+=1

    print(f"   ---> {len(data_dict)} total reviews")


Scraping page 1
   ---> 10 total reviews
Scraping page 2
   ---> 20 total reviews
Scraping page 3
   ---> 30 total reviews
Scraping page 4
   ---> 40 total reviews
Scraping page 5


In [6]:
df = pd.DataFrame(data_dict)
df = df.transpose()
df.shape
df.head(20)

(10, 14)

In [41]:
df.to_csv('reviews.csv')

In [12]:
print(df)

                                               reviews
0    Not Verified |  They changed our Flights from ...
1    Not Verified |  At Copenhagen the most chaotic...
2    ✅ Trip Verified |  Worst experience of my life...
3    ✅ Trip Verified |  Due to code sharing with Ca...
4    ✅ Trip Verified |  LHR check in was quick at t...
..                                                 ...
995  ✅ Trip Verified |  Linate to London. The morni...
996  ✅ Trip Verified | Flew British Airways from JK...
997  ✅ Trip Verified | I have flown British Airways...
998  ✅ Trip Verified | We can not fault the new 'Cl...
999  ✅ Trip Verified |  Very disappointing experien...

[1000 rows x 1 columns]


In [13]:
df['reviews'] = df['reviews'].apply(lambda x:x.split("|")[1])

In [14]:
df[df['reviews']

SyntaxError: unexpected EOF while parsing (2476954167.py, line 1)

## Topic Modelling,
## Sentiment Analysis
## wordclouds

### Topic Modelling

In [17]:
#import packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from nltk.corpus import stopwords
import numpy as np


In [18]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DEGGIE\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
nltk.download('punkt') 
nltk.download('wordnet')   
from nltk import word_tokenize 
from nltk.stem import WordNetLemmatizer   
class LemmaTokenizer: 
    def __init__(self): 
        self.wnl = WordNetLemmatizer() 
    def __call__(self, doc): 
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc) if (t.isalpha() and len(t) >= 2)]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DEGGIE\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DEGGIE\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [20]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\DEGGIE\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [23]:
new_stopwords = ['could', 'doe', 'ha', 'might', 'must', 'need', 'sha', 'wa', 'wo', 'would']
stp_word = nltk.corpus.stopwords.words('english')
stp_word.extend(new_stopwords)

In [24]:
count_vect = CountVectorizer(stop_words=stopwords.words('english'), tokenizer=LemmaTokenizer(), lowercase=True, max_df=0.5, min_df=10)
x_counts = count_vect.fit_transform(df['reviews'].to_numpy())
# x_counts.todense()

In [25]:
count_vect.get_feature_names_out()

array(['able', 'absolutely', 'accept', ..., 'york', 'young', 'zero'],
      dtype=object)

In [26]:
tfidf_transformer = TfidfTransformer()
x_tfidf = tfidf_transformer.fit_transform(x_counts)

In [27]:
dimension = 3
lda = LDA(n_components=dimension, random_state=42)
lda_array = lda.fit_transform(x_tfidf)
lda_array

array([[0.88171134, 0.05048518, 0.06780348],
       [0.68497998, 0.08726333, 0.22775669],
       [0.85626632, 0.04579329, 0.09794038],
       ...,
       [0.37657081, 0.04012432, 0.58330487],
       [0.08426215, 0.04413684, 0.87160101],
       [0.05395486, 0.04928959, 0.89675555]])

In [28]:
components = [lda.components_[i] for i in range(len(lda.components_))]
features = count_vect.get_feature_names_out()
important_words = [sorted(features, key = lambda x: components[j][np.where(features == x )], reverse = True)[:5] for j in range(len(components))]
print(important_words)

[['customer', 'hour', 'airway', 'british', 'airline'], ['thank', 'clothes', 'sydney', 'suitcase', 'wish'], ['seat', 'crew', 'good', 'food', 'cabin']]


In [None]:
df.shape

In [29]:
import pyLDAvis

In [None]:
df.reviews.unique().shape

In [None]:
df.to_csv("data/BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loops above collected 1000 reviews by iterating through the paginated pages on the website. However, if you want to collect more data, try increasing the number of pages!

 The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.