
**THE ASSIGNMENT**

**Part 1**: Produce a scraper function that returns the following information when given the URL of a BBC News article:

a) URL (for example, https://www.bbc.co.uk/news/uk-51004218)

b) Title

c) Date

d) Content (the main body of the article)

**Part 2**: Write a function that, when given a block of text (as a string), returns all of the following entities in a JSON object:

a) people

b) places

c) organisations
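
As an illustration of the shape Part 2 asks for, here is a minimal sketch of the JSON object. The key names match the three entity types listed above; the sample values are hypothetical and only show the structure.

```python
import json

# Hypothetical sample values; real output depends on the input text.
example_entities = {
    "people": ["Pete"],
    "places": ["San Francisco"],
    "organisations": ["Google"],
}

# Serialise to a JSON string, then decode it again to confirm the round trip.
entities_json = json.dumps(example_entities)
decoded = json.loads(entities_json)
print(entities_json)
```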


In [15]:
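# Install dependencies
! pip install beautifulsoup4 requests
! pip install spacy
# If the en_core_web_sm model is not already installed, uncomment the next line:
# ! python -m spacy download en_core_web_sm

# Import dependencies
import json

import pytest
import requests
import spacy
from bs4 import BeautifulSoup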



In [16]:
def bbc_scraper(url):
    """
    Take the URL of a BBC News article and return a JSON string
    containing the following fields:
    URL, Title, Date_published, Content
    """

    # Fetch the page and fail early on a 4xx/5xx response
    response = requests.get(url)
    response.raise_for_status()
    bbc_soup = BeautifulSoup(response.text, 'html.parser')

    # The headline and date are read from the first <h1> and <time> tags
    title = bbc_soup.h1.text
    date_published = bbc_soup.time.text

    # Body paragraphs sit in <div data-component="text-block"> inside <article>
    content_blocks = bbc_soup.article.find_all("div", attrs={"data-component": "text-block"})

    # Join the blocks with single spaces (no trailing space after the last one)
    article_content = " ".join(block.text for block in content_blocks)

    results_dict = {
        'URL': url,
        'Title': title,
        'Date_published': date_published,
        'Content': article_content
    }

    return json.dumps(results_dict)
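
The `Date_published` field is returned as display text such as `11 April 2020`. If a sortable date is needed, one option is to convert that text to ISO format. The sketch below assumes the day-month-year form seen in the test article; `to_iso_date` is a hypothetical helper, not part of the assignment, and it returns `None` for any text that does not match that form.

```python
from datetime import datetime


def to_iso_date(date_text):
    """Convert text such as '11 April 2020' to '2020-04-11'.

    Returns None when the text is not in the assumed day-month-year form.
    """
    try:
        parsed = datetime.strptime(date_text.strip(), "%d %B %Y")
    except ValueError:
        return None
    return parsed.date().isoformat()


print(to_iso_date("11 April 2020"))
```

The month name is parsed with `%B`, which follows the process locale, so this sketch also assumes an English (or default C) locale.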

In [17]:
# TEST 1: function with first test case url
bbc_results = bbc_scraper('https://www.bbc.co.uk/news/uk-52255054')
print(bbc_results)

# TEST 2: function with second test case url
bbc_results_two = bbc_scraper('https://www.bbc.co.uk/news/uk-51004218')
print(bbc_results_two)

# TEST 3: function iterating over a list of urls
urls = ['https://www.bbc.co.uk/news/uk-52255054', 'https://www.bbc.co.uk/news/uk-51004218']
for url in urls:
    print(bbc_scraper(url))

# TEST 4: function with current bbc news url
news = bbc_scraper('https://www.bbc.co.uk/news/world-europe-65740839')
print(news)

{"URL": "https://www.bbc.co.uk/news/uk-52255054", "Title": "Coronavirus: 'We need Easter as much as ever,' says the Queen", "Date_published": "11 April 2020", "Content": "\"Coronavirus will not overcome us,\" the Queen has said, in an Easter message to the nation. While celebrations would be different for many this year, she said: \"We need Easter as much as ever.\" Referencing the tradition of lighting candles to mark the occasion, she said: \"As dark as death can be - particularly for those suffering with grief - light and life are greater.\" It comes as the number of coronavirus deaths in UK hospitals reached 9,875. Speaking from Windsor Castle, the Queen said many religions had festivals celebrating light overcoming darkness, which often featured the lighting of candles. She said: \"They seem to speak to every culture, and appeal to people of all faiths, and of none. \"They are lit on birthday cakes and to mark family anniversaries, when we gather happily around a source of light. 

In [18]:
def extract_entities(string):
    """
    Return a JSON string containing the people, places, and
    organisations found in the text string provided.
    """

    # Load the English pipeline (tokenizer and entity recognizer)
    # Ref: spacy.io accessed 29 May 23
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(string)

    # Map spaCy's entity label names to the keys of the output JSON
    label_to_key = {
        'PERSON': 'people',
        'GPE': 'places',  # countries, cities, states
        'ORG': 'organisations'
    }

    # Create dictionary as basis for entities JSON
    entities_results = {
        'people': [],
        'places': [],
        'organisations': []
    }

    for entity in doc.ents:
        # label_ is the label's string name, which is clearer than
        # comparing entity.label against hard-coded integer IDs
        key = label_to_key.get(entity.label_)
        if key is not None:
            entities_results[key].append(entity.text)

    return json.dumps(entities_results)
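
Because `extract_entities` appends every mention it finds, a name that occurs several times in the text appears several times in the result. If only distinct names are wanted, the lists can be de-duplicated afterwards. The sketch below is one way to do that while keeping first-seen order; `dedupe_entities` is a hypothetical helper and the sample input is made up for illustration.

```python
import json


def dedupe_entities(entities_json):
    """Remove repeated names from each list, keeping first-seen order."""
    entities = json.loads(entities_json)
    # dict.fromkeys keeps the first occurrence of each name, in order.
    deduped = {key: list(dict.fromkeys(names)) for key, names in entities.items()}
    return json.dumps(deduped)


# Hypothetical input in the same shape that extract_entities returns.
sample = json.dumps({
    "people": ["Johnson", "Trump", "Johnson"],
    "places": ["UK", "US", "UK", "Iraq"],
    "organisations": ["BBC"],
})
deduped_sample = json.loads(dedupe_entities(sample))
print(deduped_sample)
```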

In [19]:
# ENTITY TEST 1: function identifies test case entities
extract_entities("I work for Google, my name is Pete, I live in San Francisco")

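# ENTITY TEST 2: function identifies entities from news article content
# Note: news and bbc_results_two are the full JSON strings returned by
# bbc_scraper, so the URL, title, and escaped quotes are analysed along with
# the article body; json.loads(news)['Content'] would pass the body alone.
print(extract_entities(news))
print(extract_entities(bbc_results_two))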

{"people": ["Gen Kyrylo Budanov", "Gen Kyrylo Budanov", "Gen Budanov", "Oleksandr Scherba", "Scherba", "Volodymyr Zelensky", "Kyiv", "Mr Zelensky"], "places": ["Ukraine", "Russia", "Ukraine", "Ukrainian", "Kyiv", "Russia", "Moscow", "Ukraine", "Ukraine", "Ukraine", "Russia", "Russia", "Moscow", "Ukraine", "Ukraine", "Belgorod", "Ukraine"], "organisations": ["BBC"]}
{"people": ["Qasem Soleimani", "Soleimani", "Donald Trump", "Mark Esper", "Esper", "Boris Johnson", "Adel Abdul Mahdi", "Johnson", "Dominic Raab", "Jens Stoltenberg", "Mr Abdul Mahdi", "Hamid Baeidinejad", "Twitter", "Simon Vincent Mayall", "Johnson", "Angela Merkel", "Emmanuel Macron", "\\\"It", "Johnson", "Trump", "Trump", "Soleimani", "Jeremy Hunt", "Nazanin Zaghari-Ratcliffe"], "places": ["Iraq", "UK", "Iraq", "UK", "US", "US", "Iraq", "US", "US", "Iraq", "US", "US", "UK", "Iraq", "Daesh", "UK", "US", "Iran", "Iraq", "US", "Iraq", "UK", "Iraq", "US", "UK", "Iraq", "Iran", "UK", "Soleimani", "Iran", "US", "UK", "UK", "Bri

In [23]:
####################################################################
# Test cases (partially adapted from provided code)

def test_bbc_scrape():
    results = {'URL': 'https://www.bbc.co.uk/news/uk-52255054',
                'Title': 'Coronavirus: \'We need Easter as much as ever,\' says the Queen',
                'Date_published': '11 April 2020',
                'Content': '"Coronavirus will not overcome us," the Queen has said, in an Easter message to the nation. While celebrations would be different for many this year, she said: "We need Easter as much as ever." Referencing the tradition of lighting candles to mark the occasion, she said: "As dark as death can be - particularly for those suffering with grief - light and life are greater." It comes as the number of coronavirus deaths in UK hospitals reached 9,875. Speaking from Windsor Castle, the Queen said many religions had festivals celebrating light overcoming darkness, which often featured the lighting of candles. She said: "They seem to speak to every culture, and appeal to people of all faiths, and of none. "They are lit on birthday cakes and to mark family anniversaries, when we gather happily around a source of light. It unites us." The monarch, who is head of the Church of England, said: "As darkness falls on the Saturday before Easter Day, many Christians would normally light candles together.  "In church, one light would pass to another, spreading slowly and then more rapidly as more candles are lit. It\'s a way of showing how the good news of Christ\'s resurrection has been passed on from the first Easter by every generation until now." As far as we know, this is the first time the Queen has released an Easter message. And coming as it does less than a week since the televised broadcast to the nation, it underlines the gravity of the situation as it is regarded by the monarch. It serves two purposes really; it is underlining the government\'s public safety message, acknowledging Easter will be difficult for us but by keeping apart we keep others safe, and the broader Christian message of hope and reassurance.  We know how important her Christian faith is, and coming on the eve of Easter Sunday, it is clearly a significant time for people of all faiths, but particularly Christian faith. '
                           'She said the discovery of the risen Christ on the first Easter Day gave his followers new hope and fresh purpose, adding that we could all take heart from this.  Wishing everyone of all faiths and denominations a blessed Easter, she said: "May the living flame of the Easter hope be a steady guide as we face the future." The Queen, 93, recorded the audio message in the White Drawing Room at Windsor Castle, with one sound engineer in the next room.  The Palace described it as "Her Majesty\'s contribution to those who are celebrating Easter privately".  It follows a speech on Sunday, in which the monarch delivered a rallying message to the nation. In it, she said the UK "will succeed" in its fight against the coronavirus pandemic, thanked people for following government rules about staying at home and praised those "coming together to help others". She also thanked key workers, saying "every hour" of work "brings us closer to a return to more normal times".'}
    scraper_result = bbc_scraper('https://www.bbc.co.uk/news/uk-52255054')
    assert json.loads(scraper_result) == results


def test_extract_entities_google_org():
    input_string = "I work for Google."
    results_dict = {'people':[],
                    'places':[],
                    'organisations': ['Google']
                    }
    extracted_entities_results = extract_entities(input_string)
    assert json.loads(extracted_entities_results) == results_dict


def test_extract_entities_name():
    input_string = "My name is Pete"
    results_dict = {'people':['Pete'],
                    'places':[],
                    'organisations': []
                    }
    extracted_entities_results = extract_entities(input_string)
    assert json.loads(extracted_entities_results) == results_dict

In [22]:
test_bbc_scrape()
test_extract_entities_google_org()
test_extract_entities_name()
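
`test_bbc_scrape` fetches the live page, so it can fail if the network is unavailable or the article markup changes. One possible complement is to test the parsing step against a fixed HTML string. The sketch below assumes the same tags that `bbc_scraper` reads (`h1`, `time`, and `div` elements with `data-component="text-block"` inside `article`); `parse_article`, `SAMPLE_HTML`, and the sample text are hypothetical and not taken from a real BBC page.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup that mimics the elements bbc_scraper reads.
SAMPLE_HTML = """
<html><body><article>
  <h1>Sample headline</h1>
  <time>11 April 2020</time>
  <div data-component="text-block"><p>First paragraph.</p></div>
  <div data-component="text-block"><p>Second paragraph.</p></div>
</article></body></html>
"""


def parse_article(html):
    """Extract title, date, and body text from article HTML (no network access)."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.article.find_all("div", attrs={"data-component": "text-block"})
    return {
        "Title": soup.h1.text,
        "Date_published": soup.time.text,
        "Content": " ".join(block.text for block in blocks),
    }


def test_parse_article_sample():
    parsed = parse_article(SAMPLE_HTML)
    assert parsed["Title"] == "Sample headline"
    assert parsed["Date_published"] == "11 April 2020"
    assert parsed["Content"] == "First paragraph. Second paragraph."


test_parse_article_sample()
```

Splitting fetching from parsing in this way keeps the live-page test as an end-to-end check while the parsing logic can be verified offline.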