Submit request to website

In [1]:
import requests

def request(url):
    """Send a GET request to the URL and return the response."""

    # prepend https:// if the URL has no scheme
    if not url.startswith(("https://", "http://")):
        url = "https://" + url

    # present a browser-like User-Agent
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"
    }

    my_session = requests.Session()
    # first request collects the cookies the site sets; the session stores them
    my_session.get(url, headers=headers, timeout=5)
    # second request is sent with those cookies
    return my_session.get(url, headers=headers, timeout=5)

response = request("bbc.co.uk")

In [2]:
print(response.status_code)
print(response.text[0:100])

200
<!DOCTYPE html><html lang="en-GB" class="no-js"><head><meta charSet="utf-8" /><meta name="viewport" 


Read the HTML content as a string for analysis. So that others can replicate my results, I've already saved the HTML text to `response.txt`.

In [3]:
#response_text = response.text

## this is how I saved the response in a text file

#with open("response.txt", "w") as text_file:
#    text_file.write(response_text)

with open("response.txt","r") as text_file:
    response_text = text_file.read()
    
response_text[0:200]

'<!DOCTYPE html><html lang="en-GB" class="no-js"><head><meta charSet="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" /><title data-rh="true">BBC - Home</title><meta data-r'

Get the individual pieces of text in the web page as separate list elements using Beautiful Soup.

In [4]:
from bs4 import BeautifulSoup as bs

soup_li = bs(response_text, "lxml").body.get_text(separator="||").split("||")

In [5]:
print(len(soup_li))
print(soup_li[0:20])

280
['BBC Homepage', 'Skip to content', 'Accessibility Help', 'Your account', 'Notifications', 'Home', 'News', 'Sport', 'Weather', 'iPlayer', 'Sounds', 'CBBC', 'CBeebies', 'Menu', 'More', 'Search', 'Home', 'News', 'Sport', 'Weather']


There are a lot of one-word tokens; let's filter out anything shorter than five words.

In [6]:
long_tokens = [x for x in soup_li if len(x.split()) >= 5]

In [7]:
print(len(long_tokens))
long_tokens

73


['Plymouth mass shooter was licensed gun holder',
 'PM calls emergency meeting to discuss Afghanistan',
 'England bat after Anderson takes five wickets in second Test',
 "Teachers: 'It's been hell grading exams'",
 "'One signing could decide the title race, but it's not Harry Kane'",
 'Lure of the island with no electricity or wi-fi',
 'UK records a further 100 Covid deaths',
 'Germany fears thousands got saline, not vaccine',
 "'We had to tip milk down the drain - now we sell 200-400 bottles a day'",
 "Gunman's victims include three-year-old girl",
 'Panic as thousands flee Taliban onslaught',
 'Murder-accused boy, 14, in court after stabbing',
 'England lose two wickets in two balls - clips, radio & text',
 "Women's Hundred: Trent Rockets struggle in must-win game against Birmingham Phoenix",
 "'One signing could decide the title race - but it's not Harry Kane'",
 "Holiday 'stress' over paper vaccine certificate",
 'Woman arrested in murder probe after boy, 2, dies',
 'Medics warn of

Notice that a few phrases repeat. The shorter versions seem more suitable, but picking them out algorithmically would be hard. So we'll keep the longer ones instead, even though they contain irrelevant words.

In [14]:
def is_copy(text_input, text_li):
    counter = 0
    for text in text_li:
        if text_input in text:
            counter += 1
    return counter > 1

unique_tokens = [x for x in long_tokens if not is_copy(x, long_tokens)]
unique_tokens

['Plymouth mass shooter was licensed gun holder',
 'PM calls emergency meeting to discuss Afghanistan',
 'England bat after Anderson takes five wickets in second Test',
 "Teachers: 'It's been hell grading exams'",
 "'One signing could decide the title race, but it's not Harry Kane'",
 'Lure of the island with no electricity or wi-fi',
 'UK records a further 100 Covid deaths',
 'Germany fears thousands got saline, not vaccine',
 "'We had to tip milk down the drain - now we sell 200-400 bottles a day'",
 "Gunman's victims include three-year-old girl",
 'Panic as thousands flee Taliban onslaught',
 'Murder-accused boy, 14, in court after stabbing',
 'England lose two wickets in two balls - clips, radio & text',
 "Women's Hundred: Trent Rockets struggle in must-win game against Birmingham Phoenix",
 "'One signing could decide the title race - but it's not Harry Kane'",
 "Holiday 'stress' over paper vaccine certificate",
 'Woman arrested in murder probe after boy, 2, dies',
 'Medics warn of
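
One caveat with `is_copy`: it counts every element that contains the phrase, including the phrase itself, so a headline that appears twice verbatim is counted twice and both copies are dropped. Below is a minimal sketch of a variant that first collapses exact duplicates and then drops phrases contained in a longer one. `drop_contained` is a hypothetical helper, not part of the notebook, and the shorter "Life in..." variant in the sample is made up for illustration.

```python
def drop_contained(phrases):
    """Keep one copy of each phrase, dropping any phrase contained in a longer one."""
    # dict.fromkeys removes exact duplicates while preserving order
    distinct = list(dict.fromkeys(phrases))
    return [
        p for p in distinct
        if not any(p != other and p in other for other in distinct)
    ]

sample = [
    "UK records a further 100 Covid deaths",
    "UK records a further 100 Covid deaths",       # verbatim repeat
    "Life in the Taliban's new territory",         # illustrative shorter variant
    "Life in the Taliban's new territory. Video",  # longer variant
]
kept = drop_contained(sample)
print(kept)
```

Neither approach catches near-duplicates that differ by punctuation, which is why both `'One signing could decide the title race...'` headlines survive in the output above.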

Score each of the tokens using...
- [AFINN](https://github.com/fnielsen/afinn) = wordlist-based approach
- [VADER](https://github.com/cjhutto/vaderSentiment) = lexicon+rule-based approach

In [17]:
from afinn import Afinn
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

tokens = unique_tokens

afinn = Afinn()
vader = SentimentIntensityAnalyzer()
scored_tokens = []

for token in tokens:
  # afinn.score adds up the scores of the individual words,
  # so standardise by dividing by the number of words in the phrase
  afinn_score = afinn.score(token) / len(token.split())

  vader_score = vader.polarity_scores(token)["compound"]
  
  scored_tokens.append({"token": token, "afinn_score": afinn_score, "vader_score": vader_score})

Check the output...

In [18]:
print(len(scored_tokens), scored_tokens[0:10])

64 [{'token': 'Plymouth mass shooter was licensed gun holder', 'afinn_score': -0.14285714285714285, 'vader_score': -0.34}, {'token': 'PM calls emergency meeting to discuss Afghanistan', 'afinn_score': -0.2857142857142857, 'vader_score': -0.3818}, {'token': 'England bat after Anderson takes five wickets in second Test', 'afinn_score': 0.0, 'vader_score': 0.0}, {'token': "Teachers: 'It's been hell grading exams'", 'afinn_score': -0.6666666666666666, 'vader_score': -0.6808}, {'token': "'One signing could decide the title race, but it's not Harry Kane'", 'afinn_score': 0.0, 'vader_score': 0.0}, {'token': 'Lure of the island with no electricity or wi-fi', 'afinn_score': -0.1111111111111111, 'vader_score': -0.296}, {'token': 'UK records a further 100 Covid deaths', 'afinn_score': -0.2857142857142857, 'vader_score': 0.0}, {'token': 'Germany fears thousands got saline, not vaccine', 'afinn_score': 0.0, 'vader_score': -0.4215}, {'token': "'We had to tip milk down the drain - now we sell 200-400
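
To make the length normalisation concrete, here is a toy version of the word-list arithmetic. The three-word lexicon and its valences are made up for illustration and are not the real AFINN word list; only the sum-then-divide step mirrors the scoring loop above.

```python
# hypothetical mini-lexicon: word -> valence (NOT the real AFINN values)
TOY_LEXICON = {"hell": -4, "love": 3, "warn": -2}

def toy_sum_score(phrase):
    """Sum the valence of every word found in the lexicon."""
    words = phrase.lower().split()
    return sum(TOY_LEXICON.get(w, 0) for w in words)

def toy_normalised_score(phrase):
    """Divide the summed score by the word count, as done for afinn_score."""
    return toy_sum_score(phrase) / len(phrase.split())

short_phrase = "we love today"
long_phrase = "we love today and also tomorrow and the day after"

# same raw sum for both phrases
print(toy_sum_score(short_phrase), toy_sum_score(long_phrase))
# per-word score shrinks as neutral words are added
print(toy_normalised_score(short_phrase), toy_normalised_score(long_phrase))
```

VADER's `compound` value is already normalised to lie between -1 and 1, which is why the loop leaves it untouched.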

Let's move the data to pandas and spot-check some of the results

In [19]:
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option('display.max_colwidth', None)

token_df = pd.DataFrame(scored_tokens)

token_df[33:43]

Unnamed: 0,token,afinn_score,vader_score
33,'When you're on a BMX 20 feet in the air there's no room for error' Video,-0.1875,-0.5994
34,Swimmer taking on 'coldest swim on Earth' to highlight climate change,0.181818,0.34
35,Why the Tourette's queen of Twitch hasn't been banned,-0.222222,0.357
36,'We never knew how dangerous Loch Lomond was',-0.25,0.3724
37,Postcard from Chile arrives in Dorset after 30 years,0.0,0.0
38,Decades-old lesson found on hidden blackboard,0.0,0.0
39,Hampshire & Isle of Wight,0.0,0.0
40,Flexible recipes for when you need to use up leftovers,0.0,0.2263
41,Totally affordable 30-minute meals for two,0.333333,0.0
42,Create incredible pasta dishes with only five ingredients,0.0,0.2732


The models seem all over the place at first glance. But once you examine the sentences, things make more sense. Take the first one: how can you judge its sentiment without knowing the context? Is the intent to amaze the reader, or to scare them off?

Observations so far:
- both models get it right when they're both 0
- when one model is 0 and the other is non-0, the non-0 one is correct
- there's not enough data on negative sentiment to make a decision
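
The cases in these observations can be made explicit by labelling each row according to how the two scores relate. A minimal sketch: `agreement` is a hypothetical helper, and the score pairs are copied from rows 33 to 42 above.

```python
from collections import Counter

def agreement(afinn_score, vader_score):
    """Label how two sentiment scores relate to each other."""
    if afinn_score == 0 and vader_score == 0:
        return "both zero"
    if afinn_score == 0 or vader_score == 0:
        return "one zero"
    if afinn_score * vader_score > 0:
        return "same sign"
    return "opposite sign"

# (afinn_score, vader_score) for rows 33-42
rows = [
    (-0.1875, -0.5994), (0.181818, 0.34), (-0.222222, 0.357), (-0.25, 0.3724),
    (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.2263), (0.333333, 0.0), (0.0, 0.2732),
]
counts = Counter(agreement(a, v) for a, v in rows)
print(counts)
```

Two rows in this slice (35 and 36) have scores of opposite sign, a case the observations above do not cover.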

Let's try another slice

In [20]:
token_df[17:27]

Unnamed: 0,token,afinn_score,vader_score
17,Medics warn of more cancelled operations,-0.5,-0.4005
18,3 things we love today,0.6,0.6369
19,Three of the strangest organs in the animal kingdom,0.0,0.0
20,"Composer, DJ and bandleader: The women redefining the sound of UK jazz",0.0,0.0
21,Victoria Derbyshire overcomes on-air shoes malfunction,0.0,0.0
22,'I've had to develop a way of getting results instantly',0.0,0.0
23,"From Amy Winehouse to James Arthur, Annabel Williams has had a stellar career as a singing coach",0.0,0.0
24,Amazon moves Lord of the Rings production to UK,0.0,0.1779
25,Britney Spears' father to step down as conservator,0.0,0.0
26,Grimmy on leaving Radio 1 and the 'instant bad mood' song,-0.272727,-0.5423


More observations:
- the models seem to miss positive phrases this time around, such as `Composer, DJ and...` or `From Amy Winehouse to...`
- but they get it right when they're both of a certain sentiment, whether it's positive or negative

So it seems like the models are good complements for a composite score. Let's give it a shot...

In [22]:
for token in scored_tokens:

    # neutral when both scores are 0 or when the models disagree on the sign
    if (token["afinn_score"] == 0 and token["vader_score"] == 0) or (token["afinn_score"] * token["vader_score"] < 0):
        token["composite_score"] = 0
    # otherwise positive if either score is positive
    elif (token["afinn_score"] > 0 or token["vader_score"] > 0):
        token["composite_score"] = 1
    # otherwise negative
    else:
        token["composite_score"] = -1
        
token_df = pd.DataFrame(scored_tokens)
token_df[45:55]

Unnamed: 0,token,afinn_score,vader_score,composite_score
45,Get your kids active with these football club challenges,0.111111,0.4588,1
46,GB's Ujah suspended after positive test,0.166667,0.128,1
47,Ex-Olympian coached for years after abuse claim,-0.428571,-0.6369,-1
48,What are the fan tokens given to Messi by PSG?,0.3,0.3182,1
49,The new trick cyber-criminals use to cash out,-0.375,-0.0516,-1
50,Life in the Taliban's new territory. Video,0.0,0.0,0
51,Russia millionaire kills man he 'mistook for bear',-0.375,-0.5423,-1
52,Superstar violinist Nicola Benedetti delights the Proms. IPlayer-Video,0.375,0.4588,1
53,Looking back on 40 years of Indiana Jones. Audio,0.0,0.0,0
54,The ghosts are back for more supernatural shenanigans. IPlayer-Video,0.0,0.0,0
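
The rule in the cell above can also be written as a standalone function, which makes it easy to check individual cases. This is a sketch of the same logic: `composite` is a hypothetical name, and the inputs are score pairs taken from the tables above.

```python
def composite(afinn_score, vader_score):
    """Combine two scores into -1, 0 or 1 using the rule from the cell above."""
    both_zero = afinn_score == 0 and vader_score == 0
    opposite_signs = afinn_score * vader_score < 0
    if both_zero or opposite_signs:
        return 0
    # at this point neither score contradicts the other, so any positive score wins
    if afinn_score > 0 or vader_score > 0:
        return 1
    return -1

print(composite(0.111111, 0.4588))   # both positive
print(composite(-0.375, -0.0516))    # both negative
print(composite(0.0, 0.1779))        # one zero, one positive
print(composite(-0.222222, 0.357))   # opposite signs
print(composite(0.0, 0.0))           # both zero
```

Because only the sign is kept, a score as small as -0.0516 counts as a full -1.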


It seems to be doing better, with only 2 misses: `GB's Ujah suspended...` and `What are the fan tokens...`. But what if we could make it even more accurate with a pre-trained, state-of-the-art model? Enter Hugging Face.

In [23]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

ModuleNotFoundError: No module named 'transformers'
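
The cell fails because the `transformers` package is not installed in this environment. Once it is, the pipeline returns one dict per input with a `label` and a `score`. Below is a sketch of how that output could be turned into a signed score comparable to the other two. `to_signed_score` and `score_with_transformers` are hypothetical helpers, and the example dicts are hand-written stand-ins, not real model output. It assumes the default English sentiment model, whose labels are `POSITIVE` and `NEGATIVE`.

```python
def to_signed_score(result):
    """Map {'label': ..., 'score': ...} to a value in [-1, 1]."""
    sign = 1 if result["label"].upper() == "POSITIVE" else -1
    return sign * result["score"]

def score_with_transformers(phrases):
    """Run the sentiment pipeline; requires transformers and a model download."""
    from transformers import pipeline  # imported lazily so the rest runs without it
    classifier = pipeline("sentiment-analysis")
    return [to_signed_score(r) for r in classifier(phrases)]

# hand-written stand-ins for pipeline output (assumed shape, made-up scores)
fake_results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.75},
]
signed = [to_signed_score(r) for r in fake_results]
print(signed)
```

A binary positive/negative model has no neutral class, so headlines that AFINN and VADER both score as 0 would still be pushed to one side. Treating low-confidence predictions as neutral would be one possible way to handle that.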

In [10]:
filtered_token = token_df[(token_df["afinn_score"] * token_df["vader_score"] <= 0) &
    ((token_df["afinn_score"] != 0) | (token_df["vader_score"] != 0))]

filtered_token.count()

token          14
afinn_score    14
vader_score    14
dtype: int64