
#Web scraping British Airways Reviews

## Import Important Libraries


In [1]:
#Install Beautiful Soup
!pip install beautifulsoup4

#Import the required modules
import requests
from bs4 import BeautifulSoup
import pandas as pd


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 36
page_size = 100

reviews = []

for i in range(1, pages+1):
    print(f"Scraping page {i}")

    # Create URL to collect data from each page
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Send a GET request to the URL; fail fast on timeouts and HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Create a Beautiful Soup object
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the appropriate HTML elements containing the reviews
    review_elements = soup.find_all('div', class_='text_content')

    # Extract the text content of the reviews and append them to the 'reviews' list
    for element in review_elements:
        reviews.append(element.get_text())

    print(f"   ---> {len(reviews)} total reviews")


Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
Scrapi

In [8]:
# Create an empty DataFrame with the desired data type
df = pd.DataFrame(dtype=str)

# Add the 'reviews' list as a column named 'reviews' to the DataFrame
df["reviews"] = reviews

# Display the first few rows of the DataFrame
df.head()


Unnamed: 0,reviews
0,✅ Trip Verified | The worst experience in all ...
1,✅ Trip Verified | The worst experience in all...
2,✅ Trip Verified | A serious medical problem a...
3,✅ Trip Verified | I haven't flown British Air...
4,Not Verified | My itinerary was supposed to b...


In [9]:
# Split each review at the first "|", keep everything after it, and strip leading/trailing whitespace
df["reviews"] = df["reviews"].apply(lambda x: x.split("|", 1)[1].strip() if "|" in x else x.strip())

# Display the modified DataFrame
df.head()


Unnamed: 0,reviews
0,The worst experience in all my years of travel...
1,The worst experience in all my years of travel...
2,A serious medical problem appeared while I was...
3,I haven't flown British Airways before and wil...
4,My itinerary was supposed to be Las Vegas-Chic...


#Sentiment Analysis


##[Rule-based approach](https://www.analyticsvidhya.com/blog/2021/06/rule-based-sentiment-analysis-in-python/)

###This is a practical approach to analyzing text without training or using machine learning models. It relies on a set of rules, known as a lexicon, by which the text is labeled as positive, negative, or neutral. For this reason the rule-based approach is also called the lexicon-based approach.

###Widely used lexicon-based tools include TextBlob, VADER, and SentiWordNet.
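
To make the idea of a lexicon concrete, here is a minimal sketch of rule-based labeling. The lexicon entries and the function name `label_sentiment` are illustrative assumptions and are not taken from TextBlob, VADER, or SentiWordNet, which use far larger lexicons and extra rules.

```python
# Toy lexicon: word -> polarity score (illustrative values, an assumption)
TOY_LEXICON = {
    "good": 1.0, "great": 2.0, "friendly": 1.0,
    "bad": -1.0, "worst": -2.0, "delayed": -1.0,
}

def label_sentiment(text, lexicon=TOY_LEXICON):
    """Sum the polarity of known words and map the total to a label."""
    score = sum(lexicon.get(word, 0.0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label_sentiment("The crew was friendly and the food was great"))  # positive
print(label_sentiment("The worst experience and a delayed flight"))     # negative
print(label_sentiment("We flew from London to Madrid"))                 # neutral
```

No training step is involved: the label depends only on which lexicon words appear in the text.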

##Data Preprocessing steps:
1. Cleaning the text
2. Tokenization
3. Enrichment-POS tagging
4. Stopwords removal
5. Obtaining the stem words





###Cleaning the text 
####Remove all numeric and special characters from the text, keeping only letters

In [10]:
# Import the regular expression module
import re

# Define a function to clean the reviews
def clean_reviews(review):
    # Replace every run of non-alphabetic characters with a single space
    review = re.sub('[^A-Za-z]+', ' ', review)
    return review

# Apply the clean_reviews function to the "reviews" column and store the result in a new column "Cleaned_reviews"
df["Cleaned_reviews"] = df["reviews"].apply(clean_reviews)

# Display the modified DataFrame
df.head()


Unnamed: 0,reviews,Cleaned_reviews
0,The worst experience in all my years of travel...,The worst experience in all my years of travel...
1,The worst experience in all my years of travel...,The worst experience in all my years of travel...
2,A serious medical problem appeared while I was...,A serious medical problem appeared while I was...
3,I haven't flown British Airways before and wil...,I haven t flown British Airways before and wil...
4,My itinerary was supposed to be Las Vegas-Chic...,My itinerary was supposed to be Las Vegas Chic...


###**Tokenization**
####Tokenization is the process of breaking the text into smaller pieces called tokens. It can be performed at the sentence level (sentence tokenization) or at the word level (word tokenization).
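
The two levels of tokenization can be sketched with regular expressions alone. This is a rough stand-in under simple assumptions (sentences end in `.`, `!` or `?`; words are runs of letters); the helper names `sentence_tokenize` and `word_tokenize_simple` are hypothetical, and the notebook itself uses NLTK's `word_tokenize`.

```python
import re

def sentence_tokenize(text):
    """Split on whitespace that follows '.', '!' or '?' (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize_simple(text):
    """Return runs of letters as word tokens."""
    return re.findall(r"[A-Za-z]+", text)

text = "The flight was late. The crew was kind!"
print(sentence_tokenize(text))     # ['The flight was late.', 'The crew was kind!']
print(word_tokenize_simple(text))  # ['The', 'flight', 'was', 'late', 'The', 'crew', 'was', 'kind']
```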

###**Enrichment-POS tagging**
####Part-of-speech (POS) tagging converts each token into a tuple of the form (word, tag). POS tags preserve the grammatical context of each word and are needed for lemmatization.

###**Stopwords removal**
####Stopwords in English are words that carry very little useful information, so we remove them as part of text preprocessing. NLTK ships stopword lists for a number of languages, including English.
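
A minimal sketch of stopword filtering follows. The tiny set `TOY_STOPWORDS` and the helper `remove_stopwords` are assumptions for illustration; the notebook uses NLTK's `stopwords.words('english')` list instead.

```python
# Hypothetical, deliberately tiny stopword set for illustration only
TOY_STOPWORDS = {"the", "was", "and", "a", "in", "of", "to", "i"}

def remove_stopwords(tokens, stopword_set=TOY_STOPWORDS):
    """Keep tokens whose lowercase form is not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopword_set]

tokens = ["The", "flight", "was", "late", "and", "the", "crew", "was", "rude"]
print(remove_stopwords(tokens))  # ['flight', 'late', 'crew', 'rude']
```

Lowercasing before the membership test matters because stopword lists are usually stored in lowercase.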

###**Obtaining the stem words**
####A stem is the part of a word responsible for its lexical meaning. The two popular techniques for obtaining the root/stem words are stemming and lemmatization.
####The key difference is that stemming often produces meaningless root words because it simply chops off characters at the end of a word. Lemmatization produces meaningful root words, but it requires the POS tags of the words.
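
The contrast can be illustrated with a toy example. Both helpers below are hypothetical simplifications: `naive_stem` strips a few suffixes and is not the Porter algorithm, and `toy_lemmatize` looks words up in a hand-written table rather than in WordNet.

```python
def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ies", "ing", "ed", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table keyed by (word, POS tag)
TOY_LEMMAS = {
    ("flies", "v"): "fly",
    ("leaving", "v"): "leave",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

for word, pos in [("flies", "v"), ("leaving", "v"), ("better", "a")]:
    print(word, "->", naive_stem(word), "|", toy_lemmatize(word, pos))
# flies -> flie | fly
# leaving -> leav | leave
# better -> better | good
```

The stems "flie" and "leav" are not real words, while the lemmas are, at the cost of needing the POS tag for the lookup.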


In [13]:
# Import the necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet


# Download the NLTK resources needed for stopword removal, tokenization, lemmatization, and POS tagging
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [14]:
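# Map the first letter of a Penn Treebank tag to the matching WordNet POS constant
pos_dict = {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN, 'R': wordnet.ADV}

# Build the stopword set once instead of once per token
stop_words = set(stopwords.words('english'))

def token_stop_pos(review):
    """Tokenize a review, POS-tag it, drop stopwords, and return (word, wordnet_tag) tuples."""
    tokens = word_tokenize(review)
    tags = pos_tag(tokens)

    new_list = []
    for word, tag in tags:
        if word.lower() not in stop_words:
            # Tags that do not start with J, V, N, or R map to None
            new_list.append((word, pos_dict.get(tag[0])))
    return new_list

df["POS_tagged"] = df["Cleaned_reviews"].apply(token_stop_pos)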



In [16]:
df.head()

Unnamed: 0,reviews,Cleaned_reviews,POS_tagged
0,The worst experience in all my years of travel...,The worst experience in all my years of travel...,"[(worst, a), (experience, n), (years, n), (tra..."
1,The worst experience in all my years of travel...,The worst experience in all my years of travel...,"[(worst, a), (experience, n), (years, n), (tra..."
2,A serious medical problem appeared while I was...,A serious medical problem appeared while I was...,"[(serious, a), (medical, a), (problem, n), (ap..."
3,I haven't flown British Airways before and wil...,I haven t flown British Airways before and wil...,"[(flown, a), (British, a), (Airways, n), (neve..."
4,My itinerary was supposed to be Las Vegas-Chic...,My itinerary was supposed to be Las Vegas Chic...,"[(itinerary, n), (supposed, v), (Las, n), (Veg..."


#####**Explanation**: token_stop_pos is the function that takes the text, tokenizes it, tags each token with its POS, and removes stopwords. We applied it to the ‘Cleaned_reviews’ column and stored the result in a new ‘POS_tagged’ column.

#####As mentioned earlier, to obtain an accurate lemma the WordNetLemmatizer requires POS tags in the form ‘n’, ‘a’, etc. But the tags returned by pos_tag are Penn Treebank tags such as ‘NN’, ‘JJ’, etc.

#####To map pos_tag tags to WordNet tags, we created a dictionary, pos_dict. Any pos_tag that starts with J is mapped to wordnet.ADJ, any pos_tag that starts with R is mapped to wordnet.ADV, and so on.

#####Our tags of interest are noun, adjective, adverb, and verb. Any tag outside these four is mapped to None.
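
The first-letter mapping can be tried without NLTK, as in the sketch below. The WordNet constants are single letters (`wordnet.ADJ` is `'a'`, `wordnet.VERB` is `'v'`, `wordnet.NOUN` is `'n'`, `wordnet.ADV` is `'r'`), so plain strings are used here. The names `POS_MAP` and `to_wordnet_tag` are hypothetical, and the (word, tag) pairs are hand-written stand-ins for `pos_tag` output.

```python
# Same mapping as pos_dict, written with the literal WordNet letters
POS_MAP = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_tag(treebank_tag):
    """Map a Penn Treebank tag to a WordNet tag by its first letter, else None."""
    return POS_MAP.get(treebank_tag[:1])

# Hand-written pairs for illustration (an assumption, not real tagger output)
tagged = [("worst", "JJS"), ("experience", "NN"), ("flown", "VBN"),
          ("never", "RB"), ("in", "IN")]

print([(word, to_wordnet_tag(tag)) for word, tag in tagged])
# [('worst', 'a'), ('experience', 'n'), ('flown', 'v'), ('never', 'r'), ('in', None)]
```

Because the lookup uses `dict.get`, any tag outside the four groups (here the preposition tag `IN`) yields None instead of raising a KeyError.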