Import modules from web app

In [None]:
import sys, os.path

# Build the path with os.path.join throughout instead of concatenating "/"
modules_dir = os.path.join(os.path.abspath(""), "..", "main", "modules")
sys.path.append(modules_dir)

import scraping

Retrieve HTML text from website

In [None]:
response = scraping.request("bbc.co.uk")

Check whether the request was successful

In [None]:
print(response.status_code)
print(response.text[0:100])

200
<!DOCTYPE html><html lang="en-GB" class="no-js"><head><meta charSet="utf-8" /><meta name="viewport" 


Read the HTML content as a string for analysis. So that others can replicate my results, I've already saved the downloaded HTML to `response.txt`; the commented-out lines show how that file was written.

In [None]:
#response_text = response.text
#with open("response.txt", "w") as text_file:
#    text_file.write(response_text)

with open("response.txt","r") as text_file:
    response_text = text_file.read()
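
The save-once, read-from-disk steps above amount to a small cache. A minimal sketch of that pattern, assuming a hypothetical `load_or_fetch` helper and a `fetch` callable (neither is part of the `scraping` module):

```python
from pathlib import Path
from typing import Callable


def load_or_fetch(path: str, fetch: Callable[[], str]) -> str:
    """Return the cached text at `path`, fetching and saving it on first use.

    `fetch` is a hypothetical zero-argument callable that returns the HTML
    as a string, e.g. `lambda: scraping.request("bbc.co.uk").text`.
    """
    cache = Path(path)
    if cache.exists():
        # Reuse the saved snapshot so the results stay reproducible.
        return cache.read_text(encoding="utf-8")
    text = fetch()
    cache.write_text(text, encoding="utf-8")
    return text
```

With this, the live request only happens when `response.txt` is missing; delete the file to force a fresh download.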

Process the HTML text into tokens

In [None]:
tokens = scraping.process(response_text)

Check the output...

In [None]:
print(len(tokens))
print(tokens[0:10])

79
['One mystery, four suspects, many lies. iPlayer', 'Olympian Adam Peaty joins the all-star Strictly 2021 line-up', "Teachers: 'It's been hell grading exams'", "Gunman's victims include three-year-old girl", "Why the Tourette's queen of Twitch hasn't been banned", 'Superstar violinist Nicola Benedetti delights the Proms. IPlayer-Video', 'York & North Yorkshire', "'When you're on a BMX 20 feet in the air there's no room for error' Video", "Britney Spears' father to step down as conservator", 'England bat after Anderson takes five wickets in second Test']


Score each of the tokens using...
- [AFINN](https://github.com/fnielsen/afinn) = wordlist-based approach
- [VADER](https://github.com/cjhutto/vaderSentiment) = lexicon+rule-based approach

In [None]:
from afinn import Afinn
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

afinn = Afinn()
vader = SentimentIntensityAnalyzer()
scored_tokens = []

for token in tokens:
    # afinn.score() sums the scores of the individual words, so divide by
    # the number of words to make phrases of different lengths comparable.
    afinn_score = afinn.score(token) / len(token.split())

    # VADER's compound score is already normalised to the range [-1, 1].
    vader_score = vader.polarity_scores(token)["compound"]

    scored_tokens.append({"token": token, "afinn_score": afinn_score, "vader_score": vader_score})
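
To see why the per-word normalisation matters, here is a minimal sketch with a toy three-word lexicon. `TOY_LEXICON`, `raw_score` and `per_word_score` are hypothetical stand-ins, not AFINN's real word list or API. A raw sum depends only on which sentiment words appear, while the per-word average also reflects how much of the phrase is neutral. The sketch also returns 0.0 for an empty token, which would otherwise raise a `ZeroDivisionError`.

```python
# Toy word scores for illustration only; not the real AFINN word list.
TOY_LEXICON = {"hell": -4, "delights": 3, "error": -2}


def raw_score(text: str) -> float:
    """Sum the toy scores of every word, ignoring case and edge punctuation."""
    words = (w.strip(".,:'\"").lower() for w in text.split())
    return float(sum(TOY_LEXICON.get(w, 0) for w in words))


def per_word_score(text: str) -> float:
    """Average the raw score over the number of words in the phrase."""
    words = text.split()
    if not words:
        # Avoid dividing by zero on empty tokens.
        return 0.0
    return raw_score(text) / len(words)


print(raw_score("hell"), raw_score("It has been hell"))            # -4.0 -4.0
print(per_word_score("hell"), per_word_score("It has been hell"))  # -4.0 -1.0
```

Both phrases have the same raw score, but the longer one is diluted by its neutral words once the score is averaged.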

Check the output...

In [None]:
print(len(scored_tokens), scored_tokens[0:10])

79 [{'token': 'One mystery, four suspects, many lies. iPlayer', 'afinn_score': -0.14285714285714285, 'vader_score': -0.6369}, {'token': 'Olympian Adam Peaty joins the all-star Strictly 2021 line-up', 'afinn_score': 0.0, 'vader_score': 0.0}, {'token': "Teachers: 'It's been hell grading exams'", 'afinn_score': -0.6666666666666666, 'vader_score': -0.6808}, {'token': "Gunman's victims include three-year-old girl", 'afinn_score': -0.6, 'vader_score': -0.3182}, {'token': "Why the Tourette's queen of Twitch hasn't been banned", 'afinn_score': -0.2222222222222222, 'vader_score': 0.357}, {'token': 'Superstar violinist Nicola Benedetti delights the Proms. IPlayer-Video', 'afinn_score': 0.375, 'vader_score': 0.4588}, {'token': 'York & North Yorkshire', 'afinn_score': 0.0, 'vader_score': 0.0}, {'token': "'When you're on a BMX 20 feet in the air there's no room for error' Video", 'afinn_score': -0.1875, 'vader_score': -0.5994}, {'token': "Britney Spears' father to step down as conservator", 'afinn_

Let's export a slice of the data to a CSV file (which opens in Excel) so we can spot-check the scores

In [None]:
import pandas as pd

# Use the full option name; the abbreviation "max_row" can match more than one option
pd.set_option("display.max_rows", None)

token_df = pd.DataFrame(scored_tokens)

token_df.iloc[33:43].to_csv("out.csv", index=False)

Both scoring models seem pretty accurate and mostly agree with each other. The few exceptions:
- `Composer, DJ and bandleader...` should be a positive text, yet both models scored it as neutral. I can see why, though: no particular words in it suggest positivity.
- `Totally affordable...` is positive, but VADER failed to score it as such.

Let's now see all the places where the two models disagree

In [None]:
filtered_token = token_df[(token_df["afinn_score"] * token_df["vader_score"] <= 0) &
    ((token_df["afinn_score"] != 0) | (token_df["vader_score"] != 0))]

filtered_token.count()

So the two models disagree on 14 of the 79 tokens, which is roughly 18% of the time.
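
The filter counts a pair of scores as a disagreement when their product is zero or negative (opposite signs, or one model neutral) and at least one of them is non-zero, so two neutral scores count as agreement. A plain-Python sketch of the same predicate, using a hypothetical `disagrees` helper and a few illustrative score pairs:

```python
def disagrees(afinn_score: float, vader_score: float) -> bool:
    """Mirror the DataFrame filter: conflicting signs, or exactly one neutral."""
    opposite_or_one_neutral = afinn_score * vader_score <= 0
    both_neutral = afinn_score == 0 and vader_score == 0
    return opposite_or_one_neutral and not both_neutral


# Illustrative (afinn, vader) pairs: opposite signs, both neutral,
# same sign, and one neutral.
pairs = [(-0.22, 0.357), (0.0, 0.0), (0.375, 0.4588), (0.0, 0.4)]
flags = [disagrees(a, v) for a, v in pairs]
print(flags)  # [True, False, False, True]
```

Note that the "one model neutral" case is counted as a disagreement here, which is a stricter definition than opposite signs alone.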

Let's look at exactly where they disagree.

In [None]:
pd.set_option('display.max_colwidth', None)
filtered_token


AFINN got 6 of these right, while VADER got the remaining 8, so there's no clear winner. What if we look at this with a trained, state-of-the-art model instead? Enter Hugging Face.

We have several options on the menu: classic BERT, DistilBERT, RoBERTa and XLNet. I'm choosing DistilBERT because it's the most resource-light - this is a side project, after all.
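
As a rough sketch of what that could look like, assuming the `transformers` library is installed and using the `distilbert-base-uncased-finetuned-sst-2-english` checkpoint (one publicly available DistilBERT sentiment model, not necessarily the one this project ends up using). Both function names are hypothetical. `to_signed_score` maps the pipeline's label and confidence onto a -1 to 1 range so the result can sit next to the VADER compound score.

```python
from typing import Dict, List


def to_signed_score(result: Dict) -> float:
    """Map {'label': 'POSITIVE' | 'NEGATIVE', 'score': p} to a value in [-1, 1]."""
    sign = 1.0 if result["label"].upper() == "POSITIVE" else -1.0
    return sign * float(result["score"])


def score_with_distilbert(texts: List[str]) -> List[float]:
    """Score each text with a DistilBERT sentiment pipeline (downloads the model)."""
    # Imported lazily so to_signed_score works without transformers installed.
    from transformers import pipeline

    classifier = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",  # assumed checkpoint
    )
    return [to_signed_score(result) for result in classifier(texts)]
```

Calling `score_with_distilbert(tokens)` would return one signed score per headline. This checkpoint is a binary classifier with no neutral class, so neutral headlines such as place names will still come back with a positive or negative sign.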