Submit request to website

In [1]:
import requests

def request(url):
    """Send a GET request to the URL and return the response."""

    # prepend https:// if the URL does not already start with a scheme
    if not url.startswith(("https://", "http://")):
        url = "https://" + url

    my_session = requests.Session()
    # a first plain request collects the cookies to send with the real one
    for_cookies = requests.get(url, timeout=5).cookies
    # browser-like User-Agent header
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"
    }

    return my_session.get(url, headers=headers, cookies=for_cookies, timeout=5)

response = request("bbc.co.uk")

In [2]:
print(response.status_code)
print(response.text[0:100])

200
<!DOCTYPE html><html lang="en-GB" class="no-js"><head><meta charSet="utf-8" /><meta name="viewport" 


Read the HTML content as a string to analyse. So that others can replicate my results, I saved the HTML text to a file beforehand and load it from there.

In [3]:
#response_text = response.text

## this is how I saved the response in a text file

#with open("response.txt", "w") as text_file:
#    text_file.write(response_text)

with open("response.txt","r") as text_file:
    response_text = text_file.read()
    
response_text[0:200]

'<!DOCTYPE html><html lang="en-GB" class="no-js"><head><meta charSet="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" /><title data-rh="true">BBC - Home</title><meta data-r'

Get the individual text pieces inside the web page as separate list elements using Beautiful Soup.

In [4]:
from bs4 import BeautifulSoup as bs

soup_li = bs(response_text, "lxml").body.get_text(separator="||").split("||")

In [5]:
print(len(soup_li))
print(soup_li[0:20])

280
['BBC Homepage', 'Skip to content', 'Accessibility Help', 'Your account', 'Notifications', 'Home', 'News', 'Sport', 'Weather', 'iPlayer', 'Sounds', 'CBBC', 'CBeebies', 'Menu', 'More', 'Search', 'Home', 'News', 'Sport', 'Weather']
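
To see what the split above produces without installing anything, here is a rough stdlib-only sketch that uses `html.parser` instead of Beautiful Soup. It only approximates `get_text(separator="||").split("||")`: it also strips whitespace, and the `TextPieces` class and `sample_html` snippet are made up for illustration.

```python
from html.parser import HTMLParser


class TextPieces(HTMLParser):
    """Collect each non-empty run of text between tags as its own list element."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.pieces.append(text)


# Hypothetical snippet standing in for the real page body
sample_html = "<body><a>Home</a><a>News</a><h3>PM calls emergency meeting</h3></body>"

parser = TextPieces()
parser.feed(sample_html)
parser.close()
pieces = parser.pieces
print(pieces)
```

Each link label and headline ends up as its own element, which is why the real list starts with so many short navigation entries.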


There are a lot of one-word tokens; let's filter them out by keeping only tokens with at least five words.

In [6]:
long_tokens = [x for x in soup_li if len(x.split()) >= 5]

In [7]:
print(len(long_tokens))
long_tokens

73


['Plymouth mass shooter was licensed gun holder',
 'PM calls emergency meeting to discuss Afghanistan',
 'England bat after Anderson takes five wickets in second Test',
 "Teachers: 'It's been hell grading exams'",
 "'One signing could decide the title race, but it's not Harry Kane'",
 'Lure of the island with no electricity or wi-fi',
 'UK records a further 100 Covid deaths',
 'Germany fears thousands got saline, not vaccine',
 "'We had to tip milk down the drain - now we sell 200-400 bottles a day'",
 "Gunman's victims include three-year-old girl",
 'Panic as thousands flee Taliban onslaught',
 'Murder-accused boy, 14, in court after stabbing',
 'England lose two wickets in two balls - clips, radio & text',
 "Women's Hundred: Trent Rockets struggle in must-win game against Birmingham Phoenix",
 "'One signing could decide the title race - but it's not Harry Kane'",
 "Holiday 'stress' over paper vaccine certificate",
 'Woman arrested in murder probe after boy, 2, dies',
 'Medics warn of

Notice how a few phrases repeat in a shorter and a longer version. The shorter ones seem more suitable, but it would be hard to pick them out algorithmically. So we'll keep the longer ones instead, even though they contain irrelevant words.

In [14]:
def is_copy(text_input, text_li):
    counter = 0
    for text in text_li:
        if text_input in text:
            counter += 1
    return counter > 1

unique_tokens = [x for x in long_tokens if not is_copy(x, long_tokens)]
unique_tokens

['Plymouth mass shooter was licensed gun holder',
 'PM calls emergency meeting to discuss Afghanistan',
 'England bat after Anderson takes five wickets in second Test',
 "Teachers: 'It's been hell grading exams'",
 "'One signing could decide the title race, but it's not Harry Kane'",
 'Lure of the island with no electricity or wi-fi',
 'UK records a further 100 Covid deaths',
 'Germany fears thousands got saline, not vaccine',
 "'We had to tip milk down the drain - now we sell 200-400 bottles a day'",
 "Gunman's victims include three-year-old girl",
 'Panic as thousands flee Taliban onslaught',
 'Murder-accused boy, 14, in court after stabbing',
 'England lose two wickets in two balls - clips, radio & text',
 "Women's Hundred: Trent Rockets struggle in must-win game against Birmingham Phoenix",
 "'One signing could decide the title race - but it's not Harry Kane'",
 "Holiday 'stress' over paper vaccine certificate",
 'Woman arrested in murder probe after boy, 2, dies',
 'Medics warn of
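
As a small worked example of the substring check, the sketch below reruns the same logic on three headlines taken from this page. `is_copy` is rewritten compactly here but should behave the same way; the `sample` and `exact` lists are just for illustration.

```python
def is_copy(text_input, text_li):
    """Return True if text_input occurs inside more than one entry of text_li."""
    return sum(text_input in text for text in text_li) > 1


sample = [
    "Life in the Taliban's new territory",
    "Life in the Taliban's new territory. Video",
    "UK records a further 100 Covid deaths",
]

# The shorter headline is found in itself and in the longer variant,
# so it counts as a copy and is dropped; the longer variant survives.
kept = [x for x in sample if not is_copy(x, sample)]
print(kept)

# Caveat: two identical entries each match twice, so both are dropped.
exact = ["Same headline twice", "Same headline twice"]
kept_exact = [x for x in exact if not is_copy(x, exact)]
print(kept_exact)
```

The second case suggests that exact duplicates would need a separate step (for example deduplicating with a set first) if at least one copy should be kept.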

Score each of the tokens using...
- [AFINN](https://github.com/fnielsen/afinn) = wordlist-based approach
- [VADER](https://github.com/cjhutto/vaderSentiment) = lexicon+rule-based approach
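
To make "wordlist-based" concrete, here is a minimal sketch of the idea: look up each word in a valence dictionary, add the values up, and divide by the word count. The `TOY_LEXICON` values and the `wordlist_score` helper are invented for illustration; they are not the real AFINN list or API.

```python
# Toy lexicon with made-up valences; the real AFINN list is far larger
# and its values may differ from these.
TOY_LEXICON = {"murder": -2, "stabbing": -2, "positive": 2}


def wordlist_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of each known word, then divide by the word count."""
    words = text.lower().split()
    if not words:
        return 0.0
    # strip surrounding punctuation before the lookup; unknown words score 0
    total = sum(lexicon.get(word.strip(".,:;!?'\""), 0) for word in words)
    return total / len(words)


short_score = wordlist_score("Boy in court after stabbing")
long_score = wordlist_score("Boy in court after stabbing in the city centre last night")
print(short_score, long_score)
```

The same single negative word weighs less in the longer phrase, which is the effect of normalising by length.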

In [7]:
from afinn import Afinn
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

afinn = Afinn()
vader = SentimentIntensityAnalyzer()
scored_tokens = []

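# NOTE: `tokens` is never defined in the cells shown. The output below
# (73 entries, in sorted order) is consistent with sorted(long_tokens),
# so that is assumed here.
tokens = sorted(long_tokens)

for token in tokens:
    # afinn.score sums the scores of the individual words, so divide
    # by the number of words to make phrases of different lengths comparable
    afinn_score = afinn.score(token) / len(token.split())

    # VADER's compound score is already normalised to [-1, 1]
    vader_score = vader.polarity_scores(token)["compound"]

    scored_tokens.append({"token": token, "afinn_score": afinn_score, "vader_score": vader_score})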

Check the output...

In [8]:
print(len(scored_tokens), scored_tokens[0:10])

73 [{'token': ' 2021 BBC. The BBC is not responsible for the content of external sites. ', 'afinn_score': 0.15384615384615385, 'vader_score': -0.2411}, {'token': "'I'm just not ready to buy an electric car'", 'afinn_score': 0.0, 'vader_score': -0.2755}, {'token': "'I've had to develop a way of getting results instantly'", 'afinn_score': 0.0, 'vader_score': 0.0}, {'token': "'One signing could decide the title race - but it's not Harry Kane'", 'afinn_score': 0.0, 'vader_score': 0.0}, {'token': "'One signing could decide the title race, but it's not Harry Kane'", 'afinn_score': 0.0, 'vader_score': 0.0}, {'token': "'We drove from Ireland to Australia in a camper van'", 'afinn_score': 0.0, 'vader_score': 0.0}, {'token': "'We had to tip milk down the drain - now we sell 200-400 bottles a day'", 'afinn_score': 0.0, 'vader_score': 0.0}, {'token': "'We never knew how dangerous Loch Lomond was'", 'afinn_score': -0.25, 'vader_score': 0.3724}, {'token': "'When you're on a BMX 20 feet in the air th

Let's move the data to pandas and spot-check some of the results

In [10]:
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option('display.max_colwidth', None)

token_df = pd.DataFrame(scored_tokens)

token_df[33:43]

| | token | afinn_score | vader_score |
|---|---|---|---|
| 33 | Holiday 'stress' over paper vaccine certificate | -0.166667 | -0.0258 |
| 34 | How often do you bathe your children? Celebrities debate | 0.0 | 0.0 |
| 35 | Life in the Taliban's new territory | 0.0 | 0.0 |
| 36 | Life in the Taliban's new territory. Video | 0.0 | 0.0 |
| 37 | Looking back on 40 years of Indiana Jones | 0.0 | 0.0 |
| 38 | Looking back on 40 years of Indiana Jones. Audio | 0.0 | 0.0 |
| 39 | Lure of the island with no electricity or wi-fi | -0.111111 | -0.296 |
| 40 | Marvel launches new Disney+ show featuring Chadwick Boseman | 0.375 | 0.4215 |
| 41 | Medics warn of more cancelled operations | -0.5 | -0.4005 |
| 42 | Murder-accused boy, 14, in court after stabbing | -0.571429 | 0.0 |


Both scoring models seem pretty accurate and mostly agree with each other. The few exceptions I noticed when spot-checking:
- `Composer, DJ and bandleader...` should be a positive text, yet both models scored it as neutral. I can see why, though: no particular words in it suggest positivity.
- `Totally affordable...` is a positive phrase, but VADER did not score it as one.

Let's now see all the places where the two models disagree

In [10]:
filtered_token = token_df[(token_df["afinn_score"] * token_df["vader_score"] <= 0) &
    ((token_df["afinn_score"] != 0) | (token_df["vader_score"] != 0))]

filtered_token.count()

token          14
afinn_score    14
vader_score    14
dtype: int64
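
The filter condition is easier to read in plain Python. The sketch below applies the same rule to four rows copied from the tables in this notebook; `models_disagree` is a helper name made up for illustration.

```python
def models_disagree(afinn_score, vader_score):
    """True when the scores have opposite signs, or exactly one of them is zero."""
    opposite_or_zero = afinn_score * vader_score <= 0
    both_zero = afinn_score == 0 and vader_score == 0
    return opposite_or_zero and not both_zero


rows = [
    ("UK records a further 100 Covid deaths", -0.285714, 0.0),
    ("'We never knew how dangerous Loch Lomond was'", -0.25, 0.3724),
    ("Medics warn of more cancelled operations", -0.5, -0.4005),
    ("Life in the Taliban's new territory", 0.0, 0.0),
]

flags = [models_disagree(afinn, vader) for _, afinn, vader in rows]
print(flags)
```

Only the first two rows count as disagreements: one model is neutral while the other is not, or the signs are opposite. Rows where both models are neutral are deliberately excluded.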

So the two models disagree on 14 of the 73 tokens, which is roughly 19% of the time.

Let's look at exactly where they disagree.

In [11]:
filtered_token

| | token | afinn_score | vader_score |
|---|---|---|---|
| 8 | 'I'm just not ready to buy an electric car' | 0.0 | -0.2755 |
| 14 | 2021 BBC. The BBC is not responsible for the content of external sites. | 0.153846 | -0.2411 |
| 19 | Amazon moves Lord of the Rings production to UK | 0.0 | 0.1779 |
| 22 | 'We never knew how dangerous Loch Lomond was' | -0.25 | 0.3724 |
| 23 | Why the Tourette's queen of Twitch hasn't been banned | -0.222222 | 0.357 |
| 44 | England lose two wickets in two balls - clips, radio & text | 0.0 | -0.4019 |
| 48 | Women's Hundred: Trent Rockets struggle in must-win game against Birmingham Phoenix | 0.181818 | -0.3182 |
| 49 | UK records a further 100 Covid deaths | -0.285714 | 0.0 |
| 61 | Germany fears thousands got saline, not vaccine | 0.0 | -0.4215 |
| 63 | Murder-accused boy, 14, in court after stabbing | -0.571429 | 0.0 |


AFINN got 6 right, while VADER got the remaining 8, so there's no clear winner. What if we now look into this using a trained, state-of-the-art model? Enter HuggingFace.

We have several options on the menu: classic BERT, DistilBERT, RoBERTa and XLNet. I'm choosing DistilBERT because it's the most resource-light - this is a side project, after all.

| | token | afinn_score | vader_score |
|---|---|---|---|
| 33 | Why you can trust the BBC | 0.166667 | 0.5106 |
| 34 | GB's Ujah suspended after positive test | 0.166667 | 0.128 |
| 35 | 'We never knew how dangerous Loch Lomond was' | -0.25 | 0.3724 |
| 36 | 2021 BBC. The BBC is not responsible for the content of external sites. | 0.153846 | -0.2411 |
| 37 | Woman arrested in murder probe after boy, 2, dies | -0.555556 | -0.8316 |
| 38 | Russia millionaire kills man he 'mistook for bear' | -0.375 | -0.5423 |
| 39 | Casualty's 10 most memorable episodes | -0.2 | 0.0 |
| 40 | PM calls emergency meeting to discuss Afghanistan | -0.285714 | -0.3818 |
| 41 | One mystery, four suspects, many lies | -0.166667 | -0.6369 |
| 42 | Composer, DJ and bandleader: The women redefining the sound of UK jazz | 0.0 | 0.0 |