Submit request to website

In [9]:
import requests

def request(url):
    """Send a GET request to the URL and return the response."""

    # prepend https:// if the URL has no scheme
    if not url.startswith(("https://", "http://")):
        url = "https://" + url

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"
    }

    # the first request only collects cookies, which are re-sent with the second one
    my_session = requests.Session()
    for_cookies = requests.get(url, timeout=5).cookies

    return my_session.get(url, headers=headers, cookies=for_cookies, timeout=5)

response = request("bbc.co.uk")

In [10]:
print(response.status_code)
print(response.text[0:100])

200
<!DOCTYPE html><html lang="en-GB" class="no-js"><head><meta charSet="utf-8" /><meta name="viewport" 


Read the HTML content as a string to analyse. So that others can replicate my results, I've saved the downloaded HTML to a text file and load it from there.

In [11]:
#response_text = response.text

## this is how I saved the response in a text file

#with open("response.txt", "w") as text_file:
#    text_file.write(response_text)

with open("response.txt","r") as text_file:
    response_text = text_file.read()
    
response_text[0:200]

'<!DOCTYPE html><html lang="en-GB" class="no-js"><head><meta charSet="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" /><title data-rh="true">BBC - Home</title><meta data-r'

Get the individual text pieces inside the web page as separate list elements using Beautiful Soup.

In [12]:
from bs4 import BeautifulSoup as bs

soup_li = bs(response_text, "lxml").body.get_text(separator="||").split("||")

In [13]:
print(len(soup_li))
print(soup_li[0:20])

280
['BBC Homepage', 'Skip to content', 'Accessibility Help', 'Your account', 'Notifications', 'Home', 'News', 'Sport', 'Weather', 'iPlayer', 'Sounds', 'CBBC', 'CBeebies', 'Menu', 'More', 'Search', 'Home', 'News', 'Sport', 'Weather']


There are a lot of very short tokens (mostly one-word menu items); let's keep only the ones with at least five words.

In [14]:
long_tokens = [x for x in soup_li if len(x.split()) >= 5]

In [15]:
print(len(long_tokens))
long_tokens

73


['Plymouth mass shooter was licensed gun holder',
 'PM calls emergency meeting to discuss Afghanistan',
 'England bat after Anderson takes five wickets in second Test',
 "Teachers: 'It's been hell grading exams'",
 "'One signing could decide the title race, but it's not Harry Kane'",
 'Lure of the island with no electricity or wi-fi',
 'UK records a further 100 Covid deaths',
 'Germany fears thousands got saline, not vaccine',
 "'We had to tip milk down the drain - now we sell 200-400 bottles a day'",
 "Gunman's victims include three-year-old girl",
 'Panic as thousands flee Taliban onslaught',
 'Murder-accused boy, 14, in court after stabbing',
 'England lose two wickets in two balls - clips, radio & text',
 "Women's Hundred: Trent Rockets struggle in must-win game against Birmingham Phoenix",
 "'One signing could decide the title race - but it's not Harry Kane'",
 "Holiday 'stress' over paper vaccine certificate",
 'Woman arrested in murder probe after boy, 2, dies',
 'Medics warn of

Notice that a few phrases repeat, sometimes as a shorter and a longer variant of the same headline. The shorter ones seem more suitable, but picking them out algorithmically would be hard. So we'll keep the longer ones instead, even though they contain irrelevant words.

In [16]:
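def is_copy(text_input, text_li):
    """Return True if text_input is a substring of more than one phrase in text_li."""
    # every phrase matches itself once, so a count above 1 means it also occurs
    # inside another phrase; note that exact duplicates flag each other as well
    counter = 0
    for text in text_li:
        if text_input in text:
            counter += 1
    return counter > 1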

unique_tokens = [x for x in long_tokens if not is_copy(x, long_tokens)]
unique_tokens

['Plymouth mass shooter was licensed gun holder',
 'PM calls emergency meeting to discuss Afghanistan',
 'England bat after Anderson takes five wickets in second Test',
 "Teachers: 'It's been hell grading exams'",
 "'One signing could decide the title race, but it's not Harry Kane'",
 'Lure of the island with no electricity or wi-fi',
 'UK records a further 100 Covid deaths',
 'Germany fears thousands got saline, not vaccine',
 "'We had to tip milk down the drain - now we sell 200-400 bottles a day'",
 "Gunman's victims include three-year-old girl",
 'Panic as thousands flee Taliban onslaught',
 'Murder-accused boy, 14, in court after stabbing',
 'England lose two wickets in two balls - clips, radio & text',
 "Women's Hundred: Trent Rockets struggle in must-win game against Birmingham Phoenix",
 "'One signing could decide the title race - but it's not Harry Kane'",
 "Holiday 'stress' over paper vaccine certificate",
 'Woman arrested in murder probe after boy, 2, dies',
 'Medics warn of
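
As a possible variant of the filtering above (a sketch, not the notebook's method), the helper below keeps one copy of each exact duplicate and only drops phrases that are contained in a different, longer phrase. The name `drop_contained` is hypothetical and the sample list is hand-written for illustration.

```python
def drop_contained(phrases):
    """Keep one copy of each phrase and drop phrases contained in a longer one."""
    # dict.fromkeys removes exact duplicates while preserving the original order
    distinct = list(dict.fromkeys(phrases))
    return [
        phrase for phrase in distinct
        if not any(phrase != other and phrase in other for other in distinct)
    ]

# hand-written sample: one exact duplicate and one short/long pair
sample = [
    "Panic as thousands flee Taliban onslaught",
    "Panic as thousands flee Taliban onslaught",
    "Medics warn of more cancelled operations",
    "Medics warn of more cancelled operations Video",
]
kept = drop_contained(sample)
print(kept)
```

With `is_copy`, both copies of the duplicated headline would be removed; here one copy survives, which may or may not be what you want.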

Score each of the tokens using...
- [AFINN](https://github.com/fnielsen/afinn) = wordlist-based approach
- [VADER](https://github.com/cjhutto/vaderSentiment) = lexicon+rule-based approach

In [17]:
from afinn import Afinn
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

tokens = unique_tokens

afinn = Afinn()
vader = SentimentIntensityAnalyzer()
scored_tokens = []

for token in tokens:
    # afinn.score adds up the scores of the individual words,
    # so standardise by dividing by the number of words in the phrase
    afinn_score = afinn.score(token) / len(token.split())

    vader_score = vader.polarity_scores(token)["compound"]

    scored_tokens.append({"token": token, "afinn_score": afinn_score, "vader_score": vader_score})
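
To see why the AFINN score is divided by the word count, here is a minimal sketch of word-list scoring. `TOY_LEXICON`, `wordlist_score` and `per_word_score` are hypothetical names, and the lexicon holds only a few hand-picked valences for the example; it is not the real AFINN list or its API.

```python
# hypothetical mini-lexicon; valences chosen for illustration only
TOY_LEXICON = {"hell": -4, "love": 3, "panic": -3}

def wordlist_score(phrase, lexicon):
    """Sum the valence of every known word; unknown words count as 0."""
    return sum(
        lexicon.get(word.strip("'\".,:!?").lower(), 0)
        for word in phrase.split()
    )

def per_word_score(phrase, lexicon):
    """Divide the summed valence by the number of words in the phrase."""
    words = phrase.split()
    return wordlist_score(phrase, lexicon) / len(words) if words else 0.0

short = "3 things we love today"
padded = "3 things we love today and every other day of this week"

# same raw sum, but the padded phrase is diluted by its extra neutral words
print(wordlist_score(short, TOY_LEXICON), wordlist_score(padded, TOY_LEXICON))
print(per_word_score(short, TOY_LEXICON), per_word_score(padded, TOY_LEXICON))
```

The raw sum grows with the number of sentiment words and so favours long phrases; the per-word average makes short and long headlines comparable, at the cost of diluting a single strong word in a long headline.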

Check the output...

In [18]:
print(len(scored_tokens), scored_tokens[0:10])

64 [{'token': 'Plymouth mass shooter was licensed gun holder', 'afinn_score': -0.14285714285714285, 'vader_score': -0.34}, {'token': 'PM calls emergency meeting to discuss Afghanistan', 'afinn_score': -0.2857142857142857, 'vader_score': -0.3818}, {'token': 'England bat after Anderson takes five wickets in second Test', 'afinn_score': 0.0, 'vader_score': 0.0}, {'token': "Teachers: 'It's been hell grading exams'", 'afinn_score': -0.6666666666666666, 'vader_score': -0.6808}, {'token': "'One signing could decide the title race, but it's not Harry Kane'", 'afinn_score': 0.0, 'vader_score': 0.0}, {'token': 'Lure of the island with no electricity or wi-fi', 'afinn_score': -0.1111111111111111, 'vader_score': -0.296}, {'token': 'UK records a further 100 Covid deaths', 'afinn_score': -0.2857142857142857, 'vader_score': 0.0}, {'token': 'Germany fears thousands got saline, not vaccine', 'afinn_score': 0.0, 'vader_score': -0.4215}, {'token': "'We had to tip milk down the drain - now we sell 200-400

Let's move the data to pandas and spot-check some of the results

In [19]:
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

token_df = pd.DataFrame(scored_tokens)

token_df[33:43]

Unnamed: 0,token,afinn_score,vader_score
33,'When you're on a BMX 20 feet in the air there's no room for error' Video,-0.1875,-0.5994
34,Swimmer taking on 'coldest swim on Earth' to highlight climate change,0.181818,0.34
35,Why the Tourette's queen of Twitch hasn't been banned,-0.222222,0.357
36,'We never knew how dangerous Loch Lomond was',-0.25,0.3724
37,Postcard from Chile arrives in Dorset after 30 years,0.0,0.0
38,Decades-old lesson found on hidden blackboard,0.0,0.0
39,Hampshire & Isle of Wight,0.0,0.0
40,Flexible recipes for when you need to use up leftovers,0.0,0.2263
41,Totally affordable 30-minute meals for two,0.333333,0.0
42,Create incredible pasta dishes with only five ingredients,0.0,0.2732


The values can be split into 3 groups: negative, neutral and positive. But a two-way split is simpler to compare, so we'll group the neutral scores together with the positive ones.

With that in mind, the models seem to be performing pretty well. Only a couple of misses:
- `Why the Tourette's...` was scored negatively by AFINN
- `We never knew...` was scored positively by VADER
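
The two-group rule can be sketched in plain Python as follows. The helper names `group` and `disagree` are hypothetical; the score pairs are copied from the table above.

```python
def group(score):
    """Collapse a score into two groups: negative vs. neutral-or-positive."""
    return "negative" if score < 0 else "non-negative"

def disagree(score_a, score_b):
    """True when the two scores fall into different groups."""
    return group(score_a) != group(score_b)

# (afinn_score, vader_score) pairs taken from the table above
pairs = {
    "Why the Tourette's queen of Twitch hasn't been banned": (-0.222222, 0.357),
    "'We never knew how dangerous Loch Lomond was'": (-0.25, 0.3724),
    "Postcard from Chile arrives in Dorset after 30 years": (0.0, 0.0),
}

flagged = [phrase for phrase, (afinn, vader) in pairs.items() if disagree(afinn, vader)]
print(flagged)
```

Under this rule a score of exactly 0 lands in the non-negative group, so a model that abstains (returns 0) is counted as disagreeing with any model that scores the phrase negatively.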

Let's check every place where the models are disagreeing...

In [20]:
token_df[((token_df["afinn_score"]<0) & (token_df["vader_score"]>=0)) | ((token_df["afinn_score"]>=0) & (token_df["vader_score"]<0))]

Unnamed: 0,token,afinn_score,vader_score
6,UK records a further 100 Covid deaths,-0.285714,0.0
7,"Germany fears thousands got saline, not vaccine",0.0,-0.4215
11,"Murder-accused boy, 14, in court after stabbing",-0.571429,0.0
12,"England lose two wickets in two balls - clips, radio & text",0.0,-0.4019
13,Women's Hundred: Trent Rockets struggle in must-win game against Birmingham Phoenix,0.181818,-0.3182
28,Casualty's 10 most memorable episodes,-0.2,0.0
31,'I'm just not ready to buy an electric car',0.0,-0.2755
35,Why the Tourette's queen of Twitch hasn't been banned,-0.222222,0.357
36,'We never knew how dangerous Loch Lomond was',-0.25,0.3724
62,© 2021 BBC. The BBC is not responsible for the content of external sites.,0.142857,-0.2411


There are 10 disagreements (out of 64 phrases). AFINN got 3 right and VADER 7, so VADER seems to be the better model on this sample!
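
For a sense of scale, the counts above can be turned into rates. This is just arithmetic on the numbers quoted in the text; which model was "right" on each phrase remains a manual judgement.

```python
total_phrases = 64
disagreements = 10
afinn_right, vader_right = 3, 7  # manual judgement on the 10 disagreements

agreement_rate = (total_phrases - disagreements) / total_phrases
vader_share = vader_right / disagreements
afinn_share = afinn_right / disagreements

print(f"agreement: {agreement_rate:.1%}")
print(f"VADER right on {vader_share:.0%} of disagreements, AFINN on {afinn_share:.0%}")
```

The two models agree on 54 of the 64 phrases, so the comparison rests on only 10 headlines and should be read as a rough indication.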

But what if we could make it even more accurate with a pre-trained, state-of-the-art model? Enter Hugging Face.

In [21]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

Let's see if the model works

In [22]:
classifier('We are very happy to show you the 🤗 Transformers library.')

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

So far so good. Let's now apply it to the dataset.

In [23]:
for token in scored_tokens:
    bert = classifier(token["token"])[0]
    token["bert_score"] = bert["score"] * (-1 if bert["label"] == "NEGATIVE" else 1)
    
token_df = pd.DataFrame(scored_tokens)
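
The sign conversion in the loop above can be isolated into a small helper, sketched here without loading the model. `signed_score` is a hypothetical name; the first dict is the classifier output shown earlier, the second is hand-written in the same shape.

```python
def signed_score(prediction):
    """Turn {'label': ..., 'score': ...} into a single number in [-1, 1]."""
    sign = -1 if prediction["label"] == "NEGATIVE" else 1
    return sign * prediction["score"]

positive = {"label": "POSITIVE", "score": 0.9997795224189758}  # output shown earlier
negative = {"label": "NEGATIVE", "score": 0.98}                # made-up example

print(signed_score(positive), signed_score(negative))
```

Note that `score` is the model's confidence in the chosen label, not an intensity, so the signed values tend to sit close to -1 or 1 and there is no neutral zone around 0 as with AFINN or VADER.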

In [24]:
token_df[["token","bert_score"]][10:20]

Unnamed: 0,token,bert_score
10,Panic as thousands flee Taliban onslaught,-0.975269
11,"Murder-accused boy, 14, in court after stabbing",-0.982073
12,"England lose two wickets in two balls - clips, radio & text",-0.996979
13,Women's Hundred: Trent Rockets struggle in must-win game against Birmingham Phoenix,0.963504
14,'One signing could decide the title race - but it's not Harry Kane',-0.98869
15,Holiday 'stress' over paper vaccine certificate,-0.994894
16,"Woman arrested in murder probe after boy, 2, dies",-0.960317
17,Medics warn of more cancelled operations,-0.999388
18,3 things we love today,0.999752
19,Three of the strangest organs in the animal kingdom,0.993196


DistilBERT has perfect accuracy on this slice, impressive! Let's see how it compares to VADER...

In [25]:
# avoid the name `filter`, which would shadow the Python built-in
bert_vs_vader = token_df[((token_df["bert_score"]<0) & (token_df["vader_score"]>=0)) | ((token_df["bert_score"]>=0) & (token_df["vader_score"]<0))][["token","vader_score","bert_score"]]
print(bert_vs_vader.count())
bert_vs_vader.to_csv("bert-vader.csv")

token          19
vader_score    19
bert_score     19
dtype: int64


DistilBERT and VADER disagree on 19 phrases. Turns out DistilBERT has lower accuracy than VADER?!

What if we use another model, tailored to Twitter?

In [2]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# keep a local copy of the model in a directory with the same name
model.save_pretrained(MODEL)

In [5]:
text = "Good night 😊"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

In [6]:
output

SequenceClassifierOutput(loss=None, logits=tensor([[-2.4362,  0.5167,  2.2756]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In [7]:
import numpy as np
from scipy.special import softmax

In [8]:
scores = output[0][0].detach().numpy()
scores = softmax(scores)
scores

array([0.00760985, 0.14581196, 0.8465782 ], dtype=float32)
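
As a sanity check on what `softmax` does here, the same probabilities can be reproduced with the standard library alone. This is a minimal sketch; `softmax_plain` is a hypothetical name and the logits are copied (rounded) from the model output above.

```python
import math

def softmax_plain(logits):
    """Exponentiate each logit and normalise so the results sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-2.4362, 0.5167, 2.2756]  # from the SequenceClassifierOutput above
probs = softmax_plain(logits)
print([round(p, 4) for p in probs])
```

The three values sum to 1 and can be read as the model's probabilities for its three classes, with the largest one belonging to the last class.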

In [28]:
scores[0]

0.007609855

In [27]:
token_df

Unnamed: 0,token,afinn_score,vader_score,bert_score
0,Plymouth mass shooter was licensed gun holder,-0.142857,-0.34,-0.702464
1,PM calls emergency meeting to discuss Afghanistan,-0.285714,-0.3818,-0.804552
2,England bat after Anderson takes five wickets in second Test,0.0,0.0,0.941195
3,Teachers: 'It's been hell grading exams',-0.666667,-0.6808,-0.995785
4,"'One signing could decide the title race, but it's not Harry Kane'",0.0,0.0,-0.983202
5,Lure of the island with no electricity or wi-fi,-0.111111,-0.296,-0.999425
6,UK records a further 100 Covid deaths,-0.285714,0.0,-0.977418
7,"Germany fears thousands got saline, not vaccine",0.0,-0.4215,-0.998494
8,'We had to tip milk down the drain - now we sell 200-400 bottles a day',0.0,0.0,-0.998709
9,Gunman's victims include three-year-old girl,-0.6,-0.3182,-0.959833


In [30]:
for token in scored_tokens:
    encoded_input = tokenizer(token["token"], return_tensors='pt')
    output = model(**encoded_input)
    
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    
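    # label order for this model: 0 = negative, 1 = neutral, 2 = positive
    # map the most probable label to -1, 0 or 1
    if scores[0] > scores[1] and scores[0] > scores[2]:
        roberta_score = -1
    elif scores[1] > scores[0] and scores[1] > scores[2]:
        roberta_score = 0
    else:
        roberta_score = 1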
        
    token["roberta_score"] = roberta_score
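
The if/elif chain above amounts to picking the index of the largest probability and shifting it by one. A compact sketch of that idea, assuming the label order 0 = negative, 1 = neutral, 2 = positive (`polarity_from_probs` is a hypothetical name):

```python
def polarity_from_probs(probs):
    """Map [p_negative, p_neutral, p_positive] to -1, 0 or 1."""
    top = max(range(len(probs)), key=lambda i: probs[i])  # index of the largest value
    return top - 1

good_night = [0.00760985, 0.14581196, 0.8465782]  # softmax output shown earlier
print(polarity_from_probs(good_night))
print(polarity_from_probs([0.7, 0.2, 0.1]))
print(polarity_from_probs([0.2, 0.6, 0.2]))
```

One difference to be aware of: on an exact tie this version returns the lower index, whereas the loop above falls through to 1. With floating-point softmax outputs such ties are very unlikely.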

In [32]:
token_df = pd.DataFrame(scored_tokens)
token_df[["token","roberta_score"]][25:35]

Unnamed: 0,token,roberta_score
25,Britney Spears' father to step down as conservator,0
26,Grimmy on leaving Radio 1 and the 'instant bad mood' song,-1
27,Olympian Adam Peaty joins the all-star Strictly 2021 line-up,0
28,Casualty's 10 most memorable episodes,1
29,Marvel launches new Disney+ show featuring Chadwick Boseman,1
30,First batch of student's washing machines shipped,0
31,'I'm just not ready to buy an electric car',-1
32,Olympian given new medal after first got bitten,0
33,'When you're on a BMX 20 feet in the air there's no room for error' Video,0
34,Swimmer taking on 'coldest swim on Earth' to highlight climate change,0


In [33]:
# avoid the name `filter`, which would shadow the Python built-in
roberta_vs_vader = token_df[((token_df["roberta_score"]<0) & (token_df["vader_score"]>=0)) | ((token_df["roberta_score"]>=0) & (token_df["vader_score"]<0))][["token","vader_score","roberta_score"]]
print(roberta_vs_vader.count())
roberta_vs_vader.to_csv("roberta-vader.csv")

token            14
vader_score      14
roberta_score    14
dtype: int64
