<a href="https://colab.research.google.com/github/jhall1996/nlp/blob/main/Mod5_NLP_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# This provides a framework for functions that will be tested as part of Module 5 NLP.
# Test your submissions against the cases listed below. They will then be
# tested against further unseen cases before being reviewed manually.

**THE ASSIGNMENT**

Populate the notebook below to create functions that achieve the following tasks.
They must also pass the tests included at the bottom of the notebook.

**Part 1**: Produce a scraper function that returns the following information when given a URL from the BBC News site. The function must be reusable: it can be called in a loop over a number of URLs, returning the following information as JSON for each one.

a) URL (provided, for example https://www.bbc.co.uk/news/uk-51004218)

b) Title

c) Date

d) Content (the main body of article)
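
Part 1 asks for a function that can be called in a loop over several URLs, with the results returned as JSON. A minimal sketch of that calling pattern follows. The helper name `scrape_many` and the stand-in `fake_scraper` are hypothetical and only illustrate the shape of the loop; in the notebook, the real scraper function would be passed in instead.

```python
import json


def scrape_many(urls, scraper):
    """Call `scraper` on each URL and collect the results as a JSON string.

    `scraper` is any callable that takes a URL and returns a dict
    (assumed here to hold the URL, title, date and content).
    """
    articles = []
    for url in urls:
        try:
            articles.append(scraper(url))
        except Exception as exc:
            # One failing page should not stop the whole loop
            articles.append({'URL': url, 'Error': str(exc)})
    return json.dumps(articles, indent=4)


def fake_scraper(url):
    # Hypothetical stand-in so the sketch runs without network access
    return {'URL': url, 'Title': 'Example title',
            'Date_published': '11 April 2020', 'Content': 'Example content'}


print(scrape_many(['https://www.bbc.co.uk/news/uk-51004218'], fake_scraper))
```

Recording failures per URL, rather than letting an exception escape, is one possible design choice; it keeps a long batch running when a single page has an unexpected layout.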

**Part 2**: Write a function that, when given a block of text (as a string), returns all of the following entities in a JSON object. It is suggested that you use a pre-built or custom entity recogniser rather than a rules-based method. Entity recognisers are available in the following Python packages: NLTK, spaCy.

a) people

b) places

c) organisations 
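
The required output is a single JSON object with one list per entity type. The sketch below shows one way to build it from `(label, text)` pairs. The sample pairs, the helper name `group_entities` and the label mapping are illustrative assumptions: `PERSON`, `GPE` and `ORGANIZATION` are the labels NLTK's `ne_chunk` produces, and a different recogniser may use different ones.

```python
import json

# Assumed mapping from recogniser labels to the keys required by Part 2
LABEL_TO_KEY = {'PERSON': 'people', 'GPE': 'places', 'ORGANIZATION': 'organisations'}


def group_entities(labelled_entities):
    """Group (label, text) pairs into the Part 2 JSON structure."""
    grouped = {'people': [], 'places': [], 'organisations': []}
    for label, text in labelled_entities:
        key = LABEL_TO_KEY.get(label)
        # Skip labels outside the mapping and names already recorded
        if key is not None and text not in grouped[key]:
            grouped[key].append(text)
    return json.dumps(grouped)


pairs = [('PERSON', 'Bob'), ('ORGANIZATION', 'Amazon'), ('PERSON', 'Bob'), ('DATE', 'today')]
print(group_entities(pairs))
# {"people": ["Bob"], "places": [], "organisations": ["Amazon"]}
```

Checking membership before appending removes duplicates while keeping the order in which names first appear.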

**CONSTRAINTS**

The code must run in Google Colab.

Do not change the name of the functions or their inputs.

Your functions will be expected to return outputs as specified in the template functions.

You may add additional functions as desired.

Do not change the test cases at the bottom.

In [1]:
# Do not change these dependencies
import pytest
# Import your dependencies here
from bs4 import BeautifulSoup
import requests
import json
import nltk
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [6]:
def bbc_scraper(url):
    # Fetch the page and fail early on HTTP errors rather than parsing an error page
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the title, date, and content using HTML tags and CSS selectors
    title = soup.select_one('#main-heading').text.strip()  # Title is the element with ID "main-heading"
    datetime_str = soup.select_one('time[data-testid="timestamp"]').get('datetime')  # Timestamp string from the <time> element's datetime attribute
    date_published = datetime_str[:10]  # Keep only the year, month, and day (YYYY-MM-DD)
    content_divs = soup.select('div[data-component="text-block"]')  # All text blocks in the article body
    # Take the first paragraph of each text block (skipping blocks without one),
    # strip surrounding whitespace, and join the text into a single string
    content = ' '.join([div.find('p').text.strip() for div in content_divs if div.find('p')])

    # Create a dictionary with the extracted information
    results = {
        'URL': url,
        'Title': title,
        'Date_published': date_published,
        'Content': content
    }

    return results

# Example usage
url = input('Please enter BBC URL ')
results = bbc_scraper(url)
print(json.dumps(results, indent=4))


Please enter BBC URL https://www.bbc.co.uk/news/entertainment-arts-65876276
{
    "URL": "https://www.bbc.co.uk/news/entertainment-arts-65876276",
    "Title": "Tony Awards 2023: Jodie Comer wins as Ariana DeBose hosts unscripted",
    "Date_published": "2023-06-12",
    "Content": "Jodie Comer says she's \"overwhelmed\" after winning a prestigious Tony Award for her one-woman Broadway show Prima Facie. The Killing Eve actress won best leading actress in a play for her portrayal of a defence lawyer who ends up in the witness box. The Tony Awards were hosted by Ariana DeBose in New York, but she did not use a script due to the writers strike. The ceremony also saw two non-binary actors win prizes for the first time. In her acceptance speech for her performance in Prima Facie, Comer said: \"This woman in this play has been my greatest teacher. \"I have to thank Suzie Miller for that, who wrote this magnificent piece. Without her writing that [I] would not be here so this feels just as mu

In [7]:
def extract_entities(string):
    # Initialise an empty dictionary to store the extracted entities
    entities_json = {'People': [], 'Places': [], 'Organisations': []}

    # Perform part-of-speech tagging and named-entity recognition on the string.
    # word_tokenize separates punctuation from words (e.g. "Amazon." becomes "Amazon", ".")
    tags = nltk.pos_tag(word_tokenize(string))
    entity_chunker = nltk.ne_chunk(tags)

    # Iterate over the named-entity subtrees; NLTK labels organisations 'ORGANIZATION', not 'ORG'
    for subtree in entity_chunker.subtrees(filter=lambda t: t.label() in ['PERSON', 'GPE', 'ORGANIZATION']):
        # Extract entity type and name
        entity_type = subtree.label()
        entity_name = ' '.join(leaf[0] for leaf in subtree.leaves())

        # Store the entity name under the matching key in the dictionary
        if entity_type == 'PERSON':
            entities_json['People'].append(entity_name)
        elif entity_type == 'GPE':
            entities_json['Places'].append(entity_name)
        elif entity_type == 'ORGANIZATION':
            entities_json['Organisations'].append(entity_name)

    # Remove duplicates from the entity dictionary
    for entity_type in entities_json:
        entities_json[entity_type] = list(set(entities_json[entity_type]))

    # Return the entities extracted from the string as a dictionary
    return entities_json

#Call scraper and print results
scraper_result = bbc_scraper(url)
entities_json = extract_entities(scraper_result['Content'])
print(json.dumps(entities_json, indent=4))

{
    "People": [
        "Andrzej",
        "Brandon Uranowitz",
        "Prima",
        "Sir Tom Stoppard",
        "Tony Award",
        "Ariana",
        "Good",
        "Tim Lutkin",
        "Brigitte Reiffenstuel",
        "Patrick Marber",
        "Hot Special Tony Award",
        "Alex Newell",
        "Akimbo Best",
        "Sean Hayes",
        "Ariana DeBose",
        "Jodie",
        "Liverpool",
        "Victoria",
        "Carolyn Downing",
        "Kimberly Akimbo Best",
        "Charlie Rosen",
        "Lutkin",
        "John Kander Isabelle Stevenson Tony Award",
        "Fleet Street Best",
        "Jodie Comer",
        "Nevin Steinberg",
        "Parade Best",
        "Harrison Ghee",
        "Arden",
        "Michael Arden",
        "David",
        "Beowulf Boritt",
        "Bryan Carter",
        "Robert Fried",
        "Kimberly Akimbo",
        "Harrison",
        "Suzie Miller",
        "Leopoldstadt Best",
        "Sweeney",
        "Bonnie Milligan",
      

In [None]:
####################################################################
# Test cases

def test_bbc_scrape():
    results = {'URL': 'https://www.bbc.co.uk/news/uk-52255054',
                'Title': 'Coronavirus: \'We need Easter as much as ever,\' says the Queen',
                'Date_published': '11 April 2020',
                'Content': '"Coronavirus will not overcome us," the Queen has said, in an Easter message to the nation. While celebrations would be different for many this year, she said: "We need Easter as much as ever." Referencing the tradition of lighting candles to mark the occasion, she said: "As dark as death can be - particularly for those suffering with grief - light and life are greater." It comes as the number of coronavirus deaths in UK hospitals reached 9,875. Speaking from Windsor Castle, the Queen said many religions had festivals celebrating light overcoming darkness, which often featured the lighting of candles. She said: "They seem to speak to every culture, and appeal to people of all faiths, and of none. "They are lit on birthday cakes and to mark family anniversaries, when we gather happily around a source of light. It unites us." The monarch, who is head of the Church of England, said: "As darkness falls on the Saturday before Easter Day, many Christians would normally light candles together.  "In church, one light would pass to another, spreading slowly and then more rapidly as more candles are lit. It\'s a way of showing how the good news of Christ\'s resurrection has been passed on from the first Easter by every generation until now." As far as we know, this is the first time the Queen has released an Easter message. And coming as it does less than a week since the televised broadcast to the nation, it underlines the gravity of the situation as it is regarded by the monarch. It serves two purposes really; it is underlining the government\'s public safety message, acknowledging Easter will be difficult for us but by keeping apart we keep others safe, and the broader Christian message of hope and reassurance.  We know how important her Christian faith is, and coming on the eve of Easter Sunday, it is clearly a significant time for people of all faiths, but particularly Christian faith. 
She said the discovery of the risen Christ on the first Easter Day gave his followers new hope and fresh purpose, adding that we could all take heart from this.  Wishing everyone of all faiths and denominations a blessed Easter, she said: "May the living flame of the Easter hope be a steady guide as we face the future." The Queen, 93, recorded the audio message in the White Drawing Room at Windsor Castle, with one sound engineer in the next room.  The Palace described it as "Her Majesty\'s contribution to those who are celebrating Easter privately".  It follows a speech on Sunday, in which the monarch delivered a rallying message to the nation. In it, she said the UK "will succeed" in its fight against the coronavirus pandemic, thanked people for following government rules about staying at home and praised those "coming together to help others". She also thanked key workers, saying "every hour" of work "brings us closer to a return to more normal times".'}
    scraper_result = bbc_scraper('https://www.bbc.co.uk/news/uk-52255054')
    assert json.loads(scraper_result) == results


def test_extract_entities_amazon_org():
    input_string = "I work for Amazon."
    results_dict = {'people':[],
                    'places':[],
                    'organisations': ['Amazon']
                    }
    extracted_entities_results = extract_entities(input_string)
    assert json.loads(extracted_entities_results) == results_dict


def test_extract_entities_name():
    input_string = "My name is Bob"
    results_dict = {'people':['Bob'],
                    'places':[],
                    'organisations': []
                    }
    extracted_entities_results = extract_entities(input_string)
    assert json.loads(extracted_entities_results) == results_dict
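
Each test passes the function's return value to `json.loads`, which accepts a `str`, `bytes` or `bytearray` and raises `TypeError` when given a dict. The entity tests also compare against lowercase keys (`people`, `places`, `organisations`). The sketch below shows a conversion step that would address both points, assuming the function has built a dict with capitalised keys. The helper name `to_test_json` is hypothetical.

```python
import json


def to_test_json(entities):
    """Lowercase the top-level keys and serialise the dict to a JSON string."""
    return json.dumps({key.lower(): values for key, values in entities.items()})


raw = {'People': ['Bob'], 'Places': [], 'Organisations': []}

try:
    json.loads(raw)  # a dict is not valid input for json.loads
except TypeError as exc:
    print('TypeError:', exc)

# The serialised form round-trips to the structure the tests compare against
assert json.loads(to_test_json(raw)) == {'people': ['Bob'], 'places': [], 'organisations': []}
```

The same idea applies to the scraper test: a value passed to `json.loads` has to be a JSON string, for example one produced with `json.dumps`.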

In [None]:
test_bbc_scrape()
test_extract_entities_amazon_org()
test_extract_entities_name()
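
`test_bbc_scrape` expects `Date_published` in the form `11 April 2020`, whereas the `datetime` attribute of a `<time>` element is normally an ISO 8601 timestamp. A minimal sketch of the conversion follows, assuming the timestamp begins with `YYYY-MM-DD`. The helper name `format_bbc_date` and the sample timestamp are hypothetical.

```python
from datetime import datetime


def format_bbc_date(datetime_str):
    """Turn a timestamp starting with 'YYYY-MM-DD' into e.g. '11 April 2020'."""
    date = datetime.strptime(datetime_str[:10], '%Y-%m-%d')
    # Use date.day for the day number: '%d' would zero-pad single digits.
    # '%B' gives the full month name (English in the default locale).
    return f'{date.day} {date.strftime("%B %Y")}'


print(format_bbc_date('2020-04-11T19:00:37.000Z'))  # 11 April 2020
```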