<a href="https://colab.research.google.com/github/i-ninte/machine_learning/blob/main/getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you can see that the site hosts a lot of review data. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. We can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then collect the text from each individual review link.
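To make the parsing step concrete before touching the live site, here is a minimal sketch that runs `BeautifulSoup` on a small inline HTML string. The `text_content` class name is the one the scraping cell in this notebook looks for; the HTML fragment itself is hypothetical and only imitates the page, so the real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating a review listing page.
# Only the "text_content" class name comes from this notebook.
sample_html = """
<article>
  <div class="text_content">✅ Trip Verified | Smooth flight, friendly crew.</div>
</article>
<article>
  <div class="text_content">Not Verified | Delayed by two hours.</div>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collect the text of every <div class="text_content"> element.
sample_reviews = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="text_content")
]

for text in sample_reviews:
    print(text)
```

Working on a static string like this makes it easy to check the selector without sending any requests.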

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

# Loop over the paginated review listing, one request per page
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; stop early on an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews


In [None]:
df = pd.DataFrame({"reviews": reviews})
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | I flew from Istanbul to Lo...
1,Not Verified | I have flow on BA several time...
2,✅ Trip Verified | The flight departed over a...
3,✅ Trip Verified | I hate British Airways! We...
4,✅ Trip Verified | Our BA flight from Porto t...


In [5]:
df.to_csv("BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loop above collected 1000 reviews by iterating through the paginated pages on the website (10 pages of 100 reviews each). If you want to collect more data, try increasing the value of `pages`!

The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.
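One possible way to do this cleaning is sketched below. It assumes every label sits at the start of the review and is followed by a `|` separator, as in the sample rows shown in this notebook. It also strips "Not Verified", which is an assumption beyond the example in the text. The names `VERIFICATION_PREFIX` and `strip_verification_prefix` are made up for illustration.

```python
import re

# Matches an optional check mark, then "Trip Verified" or "Not Verified",
# then the "|" separator, all anchored at the start of the review.
VERIFICATION_PREFIX = re.compile(
    r"^\s*✅?\s*(?:Trip Verified|Not Verified)\s*\|\s*"
)


def strip_verification_prefix(text: str) -> str:
    """Return the review text without its leading verification label."""
    return VERIFICATION_PREFIX.sub("", text).strip()


examples = [
    "✅ Trip Verified |   I flew from Istanbul to London.",
    "Not Verified |  Gatwick to Tenerife.",
    "No label at all.",
]
cleaned = [strip_verification_prefix(t) for t in examples]
print(cleaned)
```

On the DataFrame, the same function could be applied with `df["reviews"] = df["reviews"].apply(strip_verification_prefix)`. Reviews without a label pass through unchanged.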

In [11]:
import pandas as pd
df = pd.read_csv("BA_reviews.csv", index_col=0)

In [12]:
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | I flew from Istanbul to Lo...
1,Not Verified | I have flow on BA several time...
2,✅ Trip Verified | The flight departed over a...
3,✅ Trip Verified | I hate British Airways! We...
4,✅ Trip Verified | Our BA flight from Porto t...



In [13]:
df.columns

Index(['reviews'], dtype='object')

In [14]:
df.tail()

Unnamed: 0,reviews
995,✅ Trip Verified | I had flown British Airways ...
996,Not Verified | Gatwick to Tenerife. This airc...
997,✅ Trip Verified | Booked a flight through Exp...
998,✅ Trip Verified | Johannesburg to London. I h...
999,✅ Trip Verified | London to Kuala Lumpur. This...


In [15]:
df['reviews']

0      ✅ Trip Verified |   I flew from Istanbul to Lo...
1      Not Verified |  I have flow on BA several time...
2      ✅ Trip Verified |   The flight departed over a...
3      ✅ Trip Verified |   I hate British Airways! We...
4      ✅ Trip Verified |   Our BA flight from Porto t...
                             ...                        
995    ✅ Trip Verified | I had flown British Airways ...
996    Not Verified |  Gatwick to Tenerife. This airc...
997    ✅ Trip Verified |  Booked a flight through Exp...
998    ✅ Trip Verified |  Johannesburg to London. I h...
999    ✅ Trip Verified | London to Kuala Lumpur. This...
Name: reviews, Length: 1000, dtype: object

In [17]:
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Load your DataFrame; the first CSV column is the saved index
df = pd.read_csv("BA_reviews.csv", index_col=0)

# Initialize the VADER sentiment analyzer
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

# Apply sentiment analysis to each review
df['Sentiment_Scores'] = df['reviews'].apply(sid.polarity_scores)


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
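
Each entry of `Sentiment_Scores` is a dictionary with `neg`, `neu`, `pos`, and `compound` keys, where `compound` is a normalized score between -1 and 1. The sketch below shows one way to turn that score into a label. The helper name `label_from_compound` and its `threshold` parameter are hypothetical. With the default `threshold=0.0` it labels purely by sign, which is the rule this notebook applies. The VADER documentation suggests a neutral band of roughly ±0.05, which can be tried by passing `threshold=0.05`.

```python
def label_from_compound(compound: float, threshold: float = 0.0) -> str:
    """Map a VADER compound score to 'Positive', 'Negative', or 'Neutral'.

    Scores strictly above `threshold` are positive, scores strictly below
    `-threshold` are negative, and everything in between is neutral.
    """
    if compound > threshold:
        return "Positive"
    if compound < -threshold:
        return "Negative"
    return "Neutral"


# A mildly positive score is labeled differently under the two rules.
print(label_from_compound(0.03))                  # sign-based rule
print(label_from_compound(0.03, threshold=0.05))  # with a neutral band
```

With a neutral band, reviews whose compound score is close to zero move from the positive or negative group into the neutral one, so the distribution can shift noticeably when many scores are small.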


In [18]:
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Load the data from "BA_reviews.csv"; the first CSV column is the saved index
df = pd.read_csv('BA_reviews.csv', index_col=0)

# Initialize the VADER sentiment analyzer
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

# Apply sentiment analysis to each review and create a new column for sentiment scores
df['Sentiment_Scores'] = df['reviews'].apply(sid.polarity_scores)

# Label each review by the sign of its compound score
df['Sentiment'] = df['Sentiment_Scores'].apply(
    lambda x: 'Positive' if x['compound'] > 0
    else ('Negative' if x['compound'] < 0 else 'Neutral')
)

# Count the number of reviews in each sentiment category
sentiment_counts = df['Sentiment'].value_counts()

# Print the sentiment distribution
print("Sentiment Distribution:")
print(sentiment_counts)

# Calculate and print the average sentiment score
average_sentiment_score = df['Sentiment_Scores'].apply(lambda x: x['compound']).mean()
print("\nAverage Sentiment Score:", average_sentiment_score)

# Example: Display a few positive and negative reviews
positive_reviews = df[df['Sentiment'] == 'Positive'].head(2)['reviews']
negative_reviews = df[df['Sentiment'] == 'Negative'].head(2)['reviews']

print("\nExample Positive Reviews:")
for review in positive_reviews:
    print(review)

print("\nExample Negative Reviews:")
for review in negative_reviews:
    print(review)


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Sentiment Distribution:
Negative    502
Positive    493
Neutral       5
Name: Sentiment, dtype: int64

Average Sentiment Score: 0.028311700000000002

Example Positive Reviews:
Not Verified |  I have flow on BA several times since the airline started non-stop service between Austin, TX and London Heathrow. Every year the service gets worse. I brought a tape measure with me to see how far apart the seats are in the economy section. 25½ inches. BA has removed bathrooms in order to squeeze a few more rows of seats onto the aircraft. Thankfully I'm not a big person and was able to squeeze into my seat without being too uncomfortable. When BA first started flying out of Austin, the food was great. Now, when I arrive at Heathrow, I get real food and take it on the plane with me. I don't know where or who makes BA's food, but it is not eatable. If BA did not have a direct flight from my hometown to London, I would fly with a different airline. When we checked into Heathrow, the line was extrem