# News Positivity Research

This Jupyter notebook shows how to scrape the text content from a single news site. The sentiment scoring is done in the [model.ipynb](./model.ipynb) notebook.

The script is designed to work only for the landing pages (e.g. https://bbc.co.uk), not for the article pages (e.g. https://www.bbc.co.uk/news/uk-england-birmingham-58282348).

My analysis was done on 3 URLs:
- positive.news
- bbc.co.uk
- thecanary.co

In [2]:
url = "positive.news"

Retrieve the HTML text from the website. News sites are updated constantly, so I saved the downloaded content to text files. Run the cell as-is (with the request lines commented out) if you wish to replicate my results from the saved files.

In [3]:
import requests

def request(url):
    """Send a GET request to the URL and return the response."""

    # add https:// if the URL has no scheme
    if not url.startswith(("https://", "http://")):
        url = "https://" + url

    my_session = requests.Session()

    # these settings help avoid getting blocked by the site:
    # a first plain request collects cookies, which are then sent back
    # together with a browser-like User-Agent header
    for_cookies = requests.get(url, timeout=5).cookies
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"
    }

    return my_session.get(url, headers=headers, cookies=for_cookies, timeout=5)

#response = request(url)

#with open(url+"-html_text.txt", "w") as text_file:
#    text_file.write(response.text)

with open(url+"-html_text.txt","r") as text_file:
    response_text = text_file.read()
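
The comment/uncomment step above can also be written as a small cache helper. This is only a sketch, not what I ran: `load_html` and its `fetch` and `cache_dir` parameters are hypothetical names, and it assumes `url` is a bare domain such as `positive.news`, so that it is safe to use in a file name.

```python
from pathlib import Path


def load_html(url, fetch, cache_dir="."):
    """Return the saved HTML for `url`, calling `fetch(url)` only if no file exists yet."""
    # Same file naming as the cell above: <url>-html_text.txt
    cache_file = Path(cache_dir) / f"{url}-html_text.txt"

    if cache_file.exists():
        # Reuse the saved snapshot so results stay reproducible
        return cache_file.read_text(encoding="utf-8")

    # No snapshot yet: download once and save it for later runs
    html = fetch(url)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

With the `request` function defined above, a call could look like `response_text = load_html(url, lambda u: request(u).text)`.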

In [4]:
response_text[0:200]

'<!DOCTYPE html>\n<!--[if lt IE 7]><html class="no-js ie ie6 lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->\n<!--[if IE 7]><html class="no-js ie ie7 lt-ie9 lt-ie8" lang="en"> <![endif]-->\n<!--[if IE 8]><h'

We need to pull out all the relevant text content from the HTML of the websites. We'll start by splitting the HTML text into an array of text pieces:

In [5]:
from bs4 import BeautifulSoup as bs

soup_li = bs(response_text, "lxml").body.get_text(separator="||").split("||")

In [6]:
soup_li[0:10]

['\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n']

There's a lot of generic text. We can remove the vast majority of it with 2 filters:
1) Must have at least 5 words
2) Must not include generic site keywords like `sign up` or `newsletter`

In [7]:
KEYWORDS = ["cookie",
            "newsletter",
            "copyright",
            "trademark",
            "mailing list",
            "subscribe",
            "sign up",
            "rights reserved",
            "this site",
            "©",
            "ltd",
            "llp",
            "inc"
           ]

def is_generic(text):
    if len(text.split()) < 5:
        return True
    
    lower_text = text.lower()
    
    for word in KEYWORDS:
        if word in lower_text:
            return True
        
    return False

long_text = [x for x in soup_li if not is_generic(x)]
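
One side effect of plain substring matching is that short keywords such as `inc` also match inside ordinary words like "since" or "increase". The sketch below shows one possible way to match the short abbreviations as whole words only. It is not what I used for my results, and switching to it would change which pieces of text survive the filter. The names `SUBSTRING_KEYWORDS`, `WHOLE_WORD_KEYWORDS` and `has_generic_keyword` are hypothetical, and the keyword lists are shortened for the example.

```python
import re

# Longer keywords are still matched anywhere in the text (so "cookies" is caught)
SUBSTRING_KEYWORDS = ["cookie", "newsletter", "sign up", "©"]

# Short abbreviations are matched as whole words only
WHOLE_WORD_KEYWORDS = ["ltd", "llp", "inc"]

# \b is a word boundary: "inc" matches in "Media Inc." but not in "since"
WHOLE_WORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in WHOLE_WORD_KEYWORDS) + r")\b"
)


def has_generic_keyword(text):
    """Return True if the text contains any generic site keyword."""
    lower_text = text.lower()
    if any(keyword in lower_text for keyword in SUBSTRING_KEYWORDS):
        return True
    return WHOLE_WORD_RE.search(lower_text) is not None
```

For example, `has_generic_keyword("Published by Example Media Inc.")` is true, while a sentence that merely contains "since" is no longer filtered out.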

In [8]:
long_text[0:10]

['What can I do about climate change? 14 ways to take positive action',
 'After the UN issued a ‘code red for humanity’ last week, many people are asking — what can I do about climate change? Quite a lot, actually ',
 'The cookbook for people who have long Covid',
 '\n          The authors of a new, free cookbook hope it will improve taste for Covid patients\n        ',
 'Wind power firm aims to nip nimbyism in the bud with tulip-shaped turbines',
 "\n          Want to improve your business's eco-credentials, and ward off nimby naysayers? Flower Turbines may be the ticket \n        ",
 'Meet the plastic-hunting ‘pirates’ of Cornwall',
 '\n          The pirates’ bounty is melted down to make sea kayaks, which are then used to collect more rubbish   \n        ',
 'What went right this week: Australia’s ‘healing journey’, plus more positive news',
 '\n          Australia pledged reparations for indigenous people, wildlife returned to Scottish rivers, plus coffee shops that help homeless p

It seems like there are some duplicates. Let's remove any piece of text that is already contained in a piece we have kept.

In [9]:
unique_li = []

for text_l in long_text:
    unique = True
    
    for text_u in unique_li:
        if text_l in text_u:
            unique = False
        
    if unique:
        unique_li.append(text_l)
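
Note that the loop above only compares each piece with the pieces kept before it, so a short piece that appears earlier than a longer piece containing it is still kept. If that matters, an order-independent variant could look like the sketch below. `drop_contained` is a hypothetical helper, not part of my analysis.

```python
def drop_contained(texts):
    """Drop exact duplicates and any text that is a substring of another text."""
    kept = []

    # Visit the longest texts first, so every shorter text is compared
    # against all the longer texts that could contain it
    for text in sorted(texts, key=len, reverse=True):
        if not any(text in other for other in kept):
            kept.append(text)

    # Return the survivors in their original order
    kept_set = set(kept)
    seen = set()
    result = []
    for text in texts:
        if text in kept_set and text not in seen:
            seen.add(text)
            result.append(text)
    return result
```

It would be used in place of the loop as `unique_li = drop_contained(long_text)`.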

In [10]:
unique_li[0:10]

['What can I do about climate change? 14 ways to take positive action',
 'After the UN issued a ‘code red for humanity’ last week, many people are asking — what can I do about climate change? Quite a lot, actually ',
 'The cookbook for people who have long Covid',
 '\n          The authors of a new, free cookbook hope it will improve taste for Covid patients\n        ',
 'Wind power firm aims to nip nimbyism in the bud with tulip-shaped turbines',
 "\n          Want to improve your business's eco-credentials, and ward off nimby naysayers? Flower Turbines may be the ticket \n        ",
 'Meet the plastic-hunting ‘pirates’ of Cornwall',
 '\n          The pirates’ bounty is melted down to make sea kayaks, which are then used to collect more rubbish   \n        ',
 'What went right this week: Australia’s ‘healing journey’, plus more positive news',
 '\n          Australia pledged reparations for indigenous people, wildlife returned to Scottish rivers, plus coffee shops that help homeless p

There is still some leading and trailing whitespace, plus non-ASCII characters such as curly quotes and dashes. Let's remove them.

In [11]:
import re

def text_transform(text_input):
    encoded_text = text_input.encode("ascii", "ignore")
    decoded_text = encoded_text.decode("unicode_escape")
    stripped_text = re.sub(
        r"\r|\n|\t| \(link opens in a new browser window\)", "", decoded_text
    ).strip()
    return stripped_text

processed_li = [text_transform(x) for x in unique_li]

In [12]:
processed_li[0:10]

['What can I do about climate change? 14 ways to take positive action',
 'After the UN issued a code red for humanity last week, many people are asking  what can I do about climate change? Quite a lot, actually',
 'The cookbook for people who have long Covid',
 'The authors of a new, free cookbook hope it will improve taste for Covid patients',
 'Wind power firm aims to nip nimbyism in the bud with tulip-shaped turbines',
 "Want to improve your business's eco-credentials, and ward off nimby naysayers? Flower Turbines may be the ticket",
 'Meet the plastic-hunting pirates of Cornwall',
 'The pirates bounty is melted down to make sea kayaks, which are then used to collect more rubbish',
 'What went right this week: Australias healing journey, plus more positive news',
 'Australia pledged reparations for indigenous people, wildlife returned to Scottish rivers, plus coffee shops that help homeless people']
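
Dropping every non-ASCII character also drops curly apostrophes, so `Australia’s` becomes `Australias`. If the apostrophes matter for later processing, one option is to map common typographic punctuation to ASCII before stripping the rest. The sketch below is an alternative, not the transform used for my results: `PUNCT_MAP` and `to_ascii` are hypothetical names, and unlike `text_transform` it collapses runs of whitespace into a single space.

```python
import re
import unicodedata

# Typographic punctuation mapped to its closest ASCII equivalent
PUNCT_MAP = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def to_ascii(text):
    """Convert text to ASCII, keeping quotes and dashes as plain characters."""
    text = text.translate(PUNCT_MAP)
    # Split accented letters into base letter + combining mark
    text = unicodedata.normalize("NFKD", text)
    # Drop whatever is still outside ASCII (e.g. the combining marks)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse newlines, tabs and repeated spaces, then trim the ends
    return re.sub(r"\s+", " ", text).strip()
```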

Export the result to a CSV file.

In [15]:
import pandas as pd

token_df = pd.DataFrame(processed_li,columns=["text"])
token_df["site"] = url

## Commented out to avoid overwriting
#token_df.to_csv(url+"-csv.csv", index=False)

After I ran this for the 3 news sites, I manually combined the 3 resulting CSV files into the `test.csv` file.
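
I did the combining by hand, but it could also be scripted. The sketch below uses only the standard library and assumes the per-site files follow the `url+"-csv.csv"` naming from the export cell and all share the same header. `combine_csvs` is a hypothetical helper.

```python
import csv


def combine_csvs(paths, out_path):
    """Concatenate CSV files that share a header; return the number of rows written."""
    fieldnames = None
    rows = []

    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            elif reader.fieldnames != fieldnames:
                raise ValueError(f"{path} has a different header")
            rows.extend(reader)

    if fieldnames is None:
        raise ValueError("no input files given")

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    return len(rows)
```

A call for the 3 sites could look like `combine_csvs([site + "-csv.csv" for site in ["positive.news", "bbc.co.uk", "thecanary.co"]], "test.csv")`.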