# Scraping from 3 different news sites

**Datetime:** 04-05-2021, 2021

NOTE: Originally finished on 03-26-2021; the code had to be re-run to re-scrape the news.

**Sites:**

- Inquirer (crawl delay: 5 seconds)
- Manila Bulletin (crawl delay: unspecified; a random delay of 10-20 seconds is used)
- The Guardian (using the API)
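
The crawl delays listed above come from each site's `robots.txt`. As a minimal sketch of how such a value could be read programmatically, the snippet below parses a robots.txt string with the standard library and falls back to a random 10-20 second delay when no `Crawl-delay` is given. `SAMPLE_ROBOTS_TXT` and `polite_delay` are hypothetical names for illustration, and the sample content is made up so the sketch runs offline.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not fetched from any site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def polite_delay(robots_txt, user_agent="*", fallback_range=(10, 20)):
    """Return the Crawl-delay for user_agent, or a random fallback in seconds."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        # No Crawl-delay directive: pick a random delay, as done for Manila Bulletin.
        delay = random.randint(*fallback_range)
    return delay


print(polite_delay(SAMPLE_ROBOTS_TXT))
print(polite_delay("User-agent: *\nDisallow:\n"))
```

In a live run, `RobotFileParser.set_url(...)` followed by `read()` would fetch the real file instead of parsing a string.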

## Imports

In [1]:
import requests
import json
import time
import random
from bs4 import BeautifulSoup

## Scraping from Inquirer

### Getting the latest news links

In [2]:
URL = "https://www.inquirer.net/latest-stories"
headers = {'User-agent' : '*'}
page = requests.get(URL, headers=headers) # Access time: 04-05-2021, 2014
soup = BeautifulSoup(page.content, 'html.parser')

In [3]:
container = soup.find(id='al-wrap')
headlines = container.find_all(id='al-box')
# Sanity Check
print(headlines[0])

<div id="al-box">
<div id="al-time">GLOBALNATION - 7:59 PM</div>
<h2><a href="https://globalnation.inquirer.net/194858/police-communities-across-us-fight-back-vs-anti-asian-hate-crimes">Police, communities across US fight back vs anti-Asian hate crimes</a></h2>
</div>


In [4]:
# parse content

for idx in range(len(headlines[0].contents)):
    print("Index ", idx, " Content:\n", headlines[0].contents[idx])

print("----------")    
for idx in range(len(headlines[0].contents[3].contents)):
    print("Index ", idx, " Content:\n", headlines[0].contents[3].contents[idx])

    
print("----------")
print("\nType - time: ", headlines[0].contents[1].text.strip())
print("\nHeadline: ", headlines[0].contents[3].contents[0].text.strip())
print("\nURL: ", headlines[0].contents[3].contents[0]['href'])

Index  0  Content:
 

Index  1  Content:
 <div id="al-time">GLOBALNATION - 7:59 PM</div>
Index  2  Content:
 

Index  3  Content:
 <h2><a href="https://globalnation.inquirer.net/194858/police-communities-across-us-fight-back-vs-anti-asian-hate-crimes">Police, communities across US fight back vs anti-Asian hate crimes</a></h2>
Index  4  Content:
 

----------
Index  0  Content:
 <a href="https://globalnation.inquirer.net/194858/police-communities-across-us-fight-back-vs-anti-asian-hate-crimes">Police, communities across US fight back vs anti-Asian hate crimes</a>
----------

Type - time:  GLOBALNATION - 7:59 PM

Headline:  Police, communities across US fight back vs anti-Asian hate crimes

URL:  https://globalnation.inquirer.net/194858/police-communities-across-us-fight-back-vs-anti-asian-hate-crimes


In [5]:
# get only of type NEWSINFO

news_headlines = []

for item in headlines:
    if item.contents[1].text.strip().find('NEWSINFO') != -1:
        news_headlines.append({
            'type': item.contents[1].text.strip(),
            'Headline': item.contents[3].contents[0].text.strip(),
            'URL': item.contents[3].contents[0]['href']
        })

news_urls = [item['URL'] for item in news_headlines]

# Sanity Check
print(news_urls)

['https://newsinfo.inquirer.net/1415144/healthcare-utilization-should-drop-to-60-before-easing-of-ncr-plus-restrictions-doh', 'https://newsinfo.inquirer.net/1415154/un-tribunal-denies-early-release-to-rwanda-genocide-mastermind', 'https://newsinfo.inquirer.net/1415157/jordans-prince-hamzah-strikes-defiant-tone-amid-palace-turmoil', 'https://newsinfo.inquirer.net/1415160/davao-city-braces-for-worst-as-ncr-cases-keep-surging', 'https://newsinfo.inquirer.net/1415151/ghana-investigates-after-dead-fish-dolphins-wash-up-on-shore', 'https://newsinfo.inquirer.net/1415150/838-yolanda-houses-turned-over-in-leyte-eastern-samar', 'https://newsinfo.inquirer.net/1415145/mandaue-mayor-isolates-self-after-covid-19-symptoms-show', 'https://newsinfo.inquirer.net/1415143/bangladesh-ferry-accident-kills-at-least-26', 'https://newsinfo.inquirer.net/1415137/govt-to-ramp-up-covid-19-testing-by-using-30000-antigen-kits-per-day', 'https://newsinfo.inquirer.net/1414910/singapore-to-accept-covid-19-digital-trave
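
The filter above matches on the label text of each box. A possible alternative, assuming the section is always reflected in the URL's hostname (as it is in the output shown), is to filter on the hostname instead. `filter_by_host` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlparse


def filter_by_host(urls, host="newsinfo.inquirer.net"):
    """Keep only the URLs whose hostname equals host."""
    return [u for u in urls if urlparse(u).hostname == host]


# Two URLs taken from the outputs above: one GLOBALNATION, one NEWSINFO.
sample_urls = [
    "https://globalnation.inquirer.net/194858/police-communities-across-us-fight-back-vs-anti-asian-hate-crimes",
    "https://newsinfo.inquirer.net/1415143/bangladesh-ferry-accident-kills-at-least-26",
]

newsinfo_only = filter_by_host(sample_urls)
print(newsinfo_only)
```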

### Scraping the news pages

In [6]:
curr_url = news_urls[0]
curr_page = requests.get(curr_url, headers=headers)
curr_soup = BeautifulSoup(curr_page.content, 'html.parser')

# attempt to find format
headline = curr_soup.find('h1', {'class':'entry-title'}).text.strip()
author = curr_soup.find(id='art_author')['data-byline-strips']
date = curr_soup.find(id='art_plat').contents[2]

body = ''
body_soup = curr_soup.find(id='article_content').contents[1].find_all('p')

# remove all children from <p>s found in article content (e.g. advertisements, external links)
for x in body_soup:
    for y in x.find_all():
        if len(list(y.parents)) >= 1:
            y.extract()
    # to remove the caption
    if x.has_attr('class'):  # has_key() is deprecated in BeautifulSoup 4
        if x['class'][0] == 'wp-caption-text':
            x.extract()

for x in body_soup:
    if x.text.strip() == 'RELATED STORIES': # cut off of article
        break
    if x.has_attr('class'):
        if x['class'][0] == 'corona_article_tracker': # to account for the common paragraphs about COVID-19 in COVID-related news
            break
        pass
    else:
        body += '\n' + x.text.strip()

print('URL: ', curr_url)
print("\nHeadline: ", headline)
print("\nAuthor: ", author)
print("\nDate: ", date[4:])
print("\nArticle body: \n", body)

URL:  https://newsinfo.inquirer.net/1415144/healthcare-utilization-should-drop-to-60-before-easing-of-ncr-plus-restrictions-doh

Headline:  To ease ‘NCR Plus’ curbs, healthcare demand must drop to 60% – DOH

Author:  Christia Marie Ramos

Date:  7:49 PM April 05, 2021

Article body: 
 
MANILA, Philippines — The current restrictions enforced on the “NCR Plus” area can be eased once healthcare demand has been lowered to at least 60 percent, the Department of Health (DOH) said Monday.
“For healthcare utilization, we need to see that the utilization will be down to at least 60% before we can say that we are at that safe level,” DOH Undersecretary Ma. Rosario Vergeire said in a Palace briefing.
“The health system should be able to manage and should be able to breathe and should have this decongestion before we can say that we can easily lift the restrictions for this community quarantine,” she added.
According to a DOH official, the National Capital Region’s total healthcare demand is repor



In [14]:
# actual scraping

inquirer_news = []

for x in news_urls:
    print("[SCRAPING] URL: ", x)
    curr_url = x
    curr_page = requests.get(curr_url, headers=headers)
    curr_soup = BeautifulSoup(curr_page.content, 'html.parser')

    # attempt to find format
    headline = curr_soup.find('h1', {'class':'entry-title'}).text.strip()
    if curr_soup.find(id='art_author') is not None:
        author = curr_soup.find(id='art_author')['data-byline-strips']
    else:
        author = ''
    # contents[2] only exists when there are at least three children
    if len(curr_soup.find(id='art_plat').contents) > 2:
        date = curr_soup.find(id='art_plat').contents[2]
    else:
        date = curr_soup.find(id='art_plat').contents[0]

    body = ''
    body_soup = curr_soup.find(id='article_content').contents[1].find_all('p')

    # remove all children from <p>s found in article content (e.g. advertisements, external links)
    # (inner loops use p/child so they do not shadow the outer loop variable x)
    for p in body_soup:
        for child in p.find_all():
            if len(list(child.parents)) >= 1:
                child.extract()
        # to remove the caption
        if p.has_attr('class'):  # has_key() is deprecated in BeautifulSoup 4
            if p['class'][0] == 'wp-caption-text':
                p.extract()

    for p in body_soup:
        if p.text.strip() == 'RELATED STORIES': # cut off of article
            break
        if p.has_attr('class'):
            if p['class'][0] == 'corona_article_tracker': # to account for the common paragraphs about COVID-19 in COVID-related news
                break
            pass
        else:
            body += '\n' + p.text.strip()
    
    inquirer_news.append({
        'source': curr_url,
        'date': date[4:],
        'title': headline,
        'article_body': body,
        'author': author
    })
    
    time.sleep(5)
    print("DONE, NEXT:")

[SCRAPING] URL:  https://newsinfo.inquirer.net/1415144/healthcare-utilization-should-drop-to-60-before-easing-of-ncr-plus-restrictions-doh
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1415154/un-tribunal-denies-early-release-to-rwanda-genocide-mastermind
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1415157/jordans-prince-hamzah-strikes-defiant-tone-amid-palace-turmoil
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1415160/davao-city-braces-for-worst-as-ncr-cases-keep-surging
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1415151/ghana-investigates-after-dead-fish-dolphins-wash-up-on-shore
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1415150/838-yolanda-houses-turned-over-in-leyte-eastern-samar
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1415145/mandaue-mayor-isolates-self-after-covid-19-symptoms-show
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1415143/bangladesh-ferry-accident-kills-at-least
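
The loop above stops at the first page whose layout differs (for example, a missing element makes `find(...)` return `None`). A minimal sketch of one way to keep the run going is shown below: each failure is recorded and the loop moves on. `scrape_all` and `fake_scrape` are hypothetical names, and the example.com URLs are placeholders; the real per-page parsing would be passed in as `scrape_one`.

```python
import time


def scrape_all(urls, scrape_one, delay_seconds=5, sleep=time.sleep):
    """Scrape each URL in turn, pausing between requests.

    scrape_one is any callable that takes a URL and returns a record dict.
    Failures are collected instead of stopping the whole run.
    """
    records, failures = [], []
    for url in urls:
        try:
            records.append(scrape_one(url))
        except Exception as exc:  # e.g. AttributeError from a missing element
            failures.append((url, repr(exc)))
        sleep(delay_seconds)  # respect the crawl delay after every request
    return records, failures


# Stand-in scraper so the demo needs no network access.
def fake_scrape(url):
    if url.endswith("/broken"):
        raise AttributeError("'NoneType' object has no attribute 'text'")
    return {"source": url}


records, failures = scrape_all(
    ["https://example.com/a", "https://example.com/broken"],
    fake_scrape,
    sleep=lambda seconds: None,  # skip the real delay in the demo
)
print(len(records), len(failures))
```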

In [15]:
# sanity check

article_no = random.randint(0,len(inquirer_news)-1)

print('Index: ', article_no)
print('\nSource: ', inquirer_news[article_no]['source'])
print("\nTitle: ", inquirer_news[article_no]['title'])
print("\nAuthor: ", inquirer_news[article_no]['author'])
print("\nDate: ", inquirer_news[article_no]['date'])
print("\nArticle body: \n", inquirer_news[article_no]['article_body'])

Index:  48

Source:  https://newsinfo.inquirer.net/1414931/csc-warns-public-anew-against-cse-review-classes-reviewers-for-sale

Title:  CSC warns public anew against CSE review classes, reviewers for sale

Author:  

Date:  1:42 PM April 05, 2021

Article body: 
 
MANILA, Philippines — The Civil Service Commission (CSC) has cautioned the public against review classes and reviewers for the Career Service Professional and Subprofessional Examination (CSE) being offered by any center or group, or individual.
CSC reiterated Monday that it does not conduct any review classes or disseminate any review materials for its CSE or any other civil service tests.
“May we advise again the public that the CSC neither holds any review classes nor publishes or distributes any review materials for the Career and any Civil Service examinations,” it said in a statement.
According to CSC, it has received information that reviewers and group reviews for CSEs were being offered and sold either through bookst

In [16]:
# write to file
with open('inquirer_news.json', 'w') as file:
    json.dump(inquirer_news, file, indent=4)
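
By default `json.dump` escapes non-ASCII characters as `\uXXXX`, and `open` without an `encoding` argument uses the platform default. The sketch below, offered as an optional variant rather than the notebook's method, writes with explicit UTF-8 and `ensure_ascii=False` so curly quotes stay readable in the file. The file name `sample_news.json` and the temporary directory are assumptions for the demo.

```python
import json
import os
import tempfile

# One headline from the output above, containing curly quotes and an en dash.
sample = [{"title": "To ease ‘NCR Plus’ curbs, healthcare demand must drop to 60% – DOH"}]

# Write to a temporary directory so the demo does not touch real output files.
path = os.path.join(tempfile.mkdtemp(), "sample_news.json")
with open(path, "w", encoding="utf-8") as file:
    json.dump(sample, file, indent=4, ensure_ascii=False)

# Read it back with the same encoding and confirm the round trip.
with open(path, "r", encoding="utf-8") as file:
    loaded = json.load(file)

print(loaded == sample)
```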

## Scraping from Manila Bulletin

### Getting the latest news links

In [17]:
URL = "https://mb.com.ph/news/"
headers = {'User-agent' : '*'}
page = requests.get(URL, headers=headers) # Access time: 03-26-2021, 2202
soup = BeautifulSoup(page.content, 'html.parser')

In [18]:
container = soup.find('ul',{'class': 'articles-list'}) # the <ul> with the main news articles
headlines = container.find_all('li',{'class': 'article'})

# Sanity check
print(headlines[0])

<li class="article article-highlight">
<div class="article-inner row flex-row-reverse flex-md-row">
<figure class="article-figure col col-sm-auto">
<a class="article-img" href="https://mb.com.ph/2021/03/31/fil-am-woman-attacked-in-nyc-security-guard-shuts-doors-on-her/">
<img alt="US-CRIME-RACISM-ASIAN" src="https://mb.com.ph/wp-content/uploads/2021/03/000_9729JD-1.jpg"/>
</a>
</figure>
<div class="article-info col">
<div class="cat">
<a href="https://mb.com.ph/category/news/world/">World</a> </div>
<h3 class="title"><a href="https://mb.com.ph/2021/03/31/fil-am-woman-attacked-in-nyc-security-guard-shuts-doors-on-her/">Fil-Am woman attacked in NYC, security guard shuts doors on her</a></h3>
<div class="desc">
<p>An elderly Filipina immigrant was randomly attacked in Midtown Manhattan in New York City on Monday, the latest in the string of violent attacks against Asian Americans.</p>
<p>The 65-year-old Filipino woman was walking along 360 West 43rd Street on Monday when a man approached 

In [19]:
# parse format

print("URL", headlines[0].find('h3',{'class': 'title'}).find('a')['href'])
print('\nTitle: ', headlines[0].find('h3',{'class': 'title'}).find('a').text.strip())

URL https://mb.com.ph/2021/03/31/fil-am-woman-attacked-in-nyc-security-guard-shuts-doors-on-her/

Title:  Fil-Am woman attacked in NYC, security guard shuts doors on her


In [20]:
count = 0
for x in headlines:
    if x.find('h3',{'class': 'title'}) is not None:
        print("\nURL: ", x.find('h3',{'class': 'title'}).find('a')['href'])
        count+=1
    else:
        if x.find('h4',{'class': 'title'}) is not None:
            print("\nURL: ", x.find('h4',{'class': 'title'}).find('a')['href'])
            count+=1
            
print("\nNumber of headlines: ", count)


URL:  https://mb.com.ph/2021/03/31/fil-am-woman-attacked-in-nyc-security-guard-shuts-doors-on-her/

URL:  https://mb.com.ph/2021/04/05/no-extension-after-all-april-15-deadline-for-2020-itr-filing-stays-says-bir/

URL:  https://mb.com.ph/2021/04/05/defensor-to-give-away-ivermectin-to-qc-residents-for-free/

URL:  https://mb.com.ph/2021/04/05/doh-fda-do-not-recommend-use-of-ivermectin-for-covid-19-treatment/

URL:  https://mb.com.ph/2021/04/05/who-delivery-of-astrazeneca-vaccines-from-covax-may-be-delayed/

URL:  https://mb.com.ph/2021/04/05/doh-fda-deny-profiting-from-remdesivir-tocilizumab-use/

URL:  https://mb.com.ph/2021/04/05/mayor-tiangco-household-lockdown-is-a-good-plan-but-how-do-you-implement-it/

URL:  https://mb.com.ph/2021/04/05/start-of-something-good-dar-links-pangasinan-squash-kamote-farmers-to-enutribun-producer/

URL:  https://mb.com.ph/2021/04/05/avoid-rapid-tests-ph-red-cross-reminds-public-to-choose-rt-pcr-for-saliva-based-covid-19-testing/

URL:  https://mb.com.ph

In [21]:
# get more from second page
URL_2 = "https://mb.com.ph/news/page/2"
page_2 = requests.get(URL_2, headers=headers) # Access time: 04-05-2021, 20:20
soup_2 = BeautifulSoup(page_2.content, 'html.parser')
container_2 = soup_2.find('ul',{'class': 'articles-list'})
headlines_2 = container_2.find_all('li',{'class': 'article'})

count_2 = 0
for x in headlines_2:
    if x.find('h3',{'class': 'title'}) is not None:
        print("\nURL: ", x.find('h3',{'class': 'title'}).find('a')['href'])
        count_2+=1
    else:
        if x.find('h4',{'class': 'title'}) is not None:
            print("\nURL: ", x.find('h4',{'class': 'title'}).find('a')['href'])
            count_2+=1
            
print("\nNumber of headlines: ", count_2)


URL:  https://mb.com.ph/2021/04/05/go-holistic-approach-needed-to-curb-rise-in-covid-19-cases/

URL:  https://mb.com.ph/2021/04/05/tiangco-to-get-sinovac-vaccine-with-a-heavy-heart/

URL:  https://mb.com.ph/2021/04/05/lotto-digit-games-operations-remain-suspended-in-ecq-areas-pcso/

URL:  https://mb.com.ph/2021/04/05/covid-19-virus-spreading-in-ph-three-to-nine-times-more-contagious-doh/

URL:  https://mb.com.ph/2021/04/05/police-sarge-caught-in-anti-illegal-cockfighting-operation-in-iloilo/

URL:  https://mb.com.ph/2021/04/05/palace-to-philhealth-make-prompt-payment-of-hospital-claims/

URL:  https://mb.com.ph/2021/04/05/sec-guevarra-suggests-no-prison-term-or-fine-for-ecq-violators-in-ncr/

URL:  https://mb.com.ph/2021/04/05/binay-backs-doctors-groups-call-to-convert-hotels-into-temporary-hospital-facilities/

URL:  https://mb.com.ph/2021/04/05/9-rescued-as-motor-launch-loaded-with-gasoline-explodes/

URL:  https://mb.com.ph/2021/04/05/calls-for-covid-19-mass-testing-ok-if-funds-ava

In [22]:
# store all headlines
news_urls = []

for x in headlines:
    if x.find('h3',{'class': 'title'}) is not None:
        news_urls.append(x.find('h3',{'class': 'title'}).find('a')['href'])
    else:
        if x.find('h4',{'class': 'title'}) is not None:
            news_urls.append(x.find('h4',{'class': 'title'}).find('a')['href'])
            
for x in headlines_2:
    if x.find('h3',{'class': 'title'}) is not None:
        news_urls.append(x.find('h3',{'class': 'title'}).find('a')['href'])
    else:
        if x.find('h4',{'class': 'title'}) is not None:
            news_urls.append(x.find('h4',{'class': 'title'}).find('a')['href'])

print("Number of headlines: ", len(news_urls))

Number of headlines:  27
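
Because the two listing pages are fetched separately, an article could in principle appear on both if the listing shifts between requests. This is an assumption, not something observed in the output above. A minimal sketch of removing duplicates while keeping the original order follows; `dedupe_keep_order` is a hypothetical helper and the URLs are made-up placeholders.

```python
def dedupe_keep_order(urls):
    """Drop repeated URLs while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))


# Placeholder URLs: the second page repeats one entry from the first.
page_1 = ["https://example.com/news/a", "https://example.com/news/b"]
page_2 = ["https://example.com/news/b", "https://example.com/news/c"]

unique_urls = dedupe_keep_order(page_1 + page_2)
print(len(unique_urls))
```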


### Scraping the news pages

In [23]:
curr_url = news_urls[0]
curr_page = requests.get(curr_url, headers=headers)
curr_soup = BeautifulSoup(curr_page.content, 'html.parser')

# parsing the page
headline = curr_soup.find('h2',{'class':'title'}).text.strip()
author = curr_soup.find('div',{'class':'meta'}).find('p',{'class':'author'}).text.strip()
date = curr_soup.find('div',{'class':'meta'}).find('p',{'class':'published'}).text.strip()

body = ''
body_parts = curr_soup.find('section',{'class':'article-content'}).find_all('p')

for p in body_parts:
    body += p.text.strip() + '\n'

print("URL: ", curr_url)
print("\nHeadline: ", headline)
print("\nAuthor: ", author[3:])
print("\nDate: ", date[10:])
print("\nArticle Body: \n", body)

URL:  https://mb.com.ph/2021/03/31/fil-am-woman-attacked-in-nyc-security-guard-shuts-doors-on-her/

Headline:  Fil-Am woman attacked in NYC, security guard shuts doors on her

Author:  Jaleen Ramos

Date:  March 31, 2021, 9:10 AM

Article Body: 
 An elderly Filipina immigrant was randomly attacked in Midtown Manhattan in New York City on Monday, the latest in the string of violent attacks against Asian Americans.
The 65-year-old Filipino woman was walking along 360 West 43rd Street on Monday when a man approached her and suddenly kicked her multiple times, according to the report by the New York Police District (NYPD).
Authorities said the attacker “made anti-Asian statements” towards the victim as he kicked her and shouted “F*ck you. You don’t belong here.”
In a video circulated online, bystanders who were on the scene just watched without intervening.
A security guard from inside an adjacent building where the incident happened also failed to aid the woman and even closed the door.
T

In [24]:
# actual scraping

mb_news = []

for x in news_urls:
    print("[SCRAPING] URL: ", x)
    
    curr_url = x
    curr_page = requests.get(curr_url, headers=headers)
    curr_soup = BeautifulSoup(curr_page.content, 'html.parser')

    headline = curr_soup.find('h2',{'class':'title'}).text.strip()
    author = curr_soup.find('div',{'class':'meta'}).find('p',{'class':'author'}).text.strip()
    date = curr_soup.find('div',{'class':'meta'}).find('p',{'class':'published'}).text.strip()

    body = ''
    body_parts = curr_soup.find('section',{'class':'article-content'}).find_all('p')

    for p in body_parts:
        body += p.text.strip() + '\n'
        
    mb_news.append({
        'source': curr_url,
        'date': date[10:],
        'title': headline,
        'article_body': body,
        'author': author[3:]
    })
    
    rand_delay = random.randint(10,20)
    print("DONE. DELAY: ", rand_delay)
    time.sleep(rand_delay)
    
print("\nArticles scraped: ", len(mb_news))

[SCRAPING] URL:  https://mb.com.ph/2021/03/31/fil-am-woman-attacked-in-nyc-security-guard-shuts-doors-on-her/
DONE. DELAY:  10
[SCRAPING] URL:  https://mb.com.ph/2021/04/05/no-extension-after-all-april-15-deadline-for-2020-itr-filing-stays-says-bir/
DONE. DELAY:  13
[SCRAPING] URL:  https://mb.com.ph/2021/04/05/defensor-to-give-away-ivermectin-to-qc-residents-for-free/
DONE. DELAY:  11
[SCRAPING] URL:  https://mb.com.ph/2021/04/05/doh-fda-do-not-recommend-use-of-ivermectin-for-covid-19-treatment/
DONE. DELAY:  15
[SCRAPING] URL:  https://mb.com.ph/2021/04/05/who-delivery-of-astrazeneca-vaccines-from-covax-may-be-delayed/
DONE. DELAY:  19
[SCRAPING] URL:  https://mb.com.ph/2021/04/05/doh-fda-deny-profiting-from-remdesivir-tocilizumab-use/
DONE. DELAY:  15
[SCRAPING] URL:  https://mb.com.ph/2021/04/05/mayor-tiangco-household-lockdown-is-a-good-plan-but-how-do-you-implement-it/
DONE. DELAY:  12
[SCRAPING] URL:  https://mb.com.ph/2021/04/05/start-of-something-good-dar-links-pangasinan-squa

In [25]:
# sanity check

article_no = random.randint(0,len(mb_news)-1)

print('Index: ', article_no)
print('\nSource: ', mb_news[article_no]['source'])
print("\nTitle: ", mb_news[article_no]['title'])
print("\nAuthor: ", mb_news[article_no]['author'])
print("\nDate: ", mb_news[article_no]['date'])
print("\nArticle body: \n", mb_news[article_no]['article_body'])

Index:  15

Source:  https://mb.com.ph/2021/04/05/tiangco-to-get-sinovac-vaccine-with-a-heavy-heart/

Title:  Tiangco to get Sinovac vaccine ‘with a heavy heart’

Author:  Joseph Pedrajas

Date:  April 5, 2021, 5:55 PM

Article body: 
 Navotas City Mayor Toby Tiangco said Monday he will take the Chinese Sinovac vaccine “with a heavy heart” as Metro Manila mayors started getting vaccinated amid rising cases of coronavirus disease (COVID-19) in the country.
Tiangco explained in an ANC interview that his reservations on getting the Chinese vaccine were not because of its “medical efficacy” but because of the Philippines’ issue with China on the West Philippine Sea.
“I don’t know if it’s a wrong sense of patriotism, it could be wrong, ‘yung issue kasi doon sa (but the issue on) West Philippine Sea… parang papasukin namin ‘yung West Philippine Sea, hindi naman kayo makaka-angal kasi nagbibigay kami ng bakuna eh (For me it’s like they could occupy the West Philippine Sea since we could not o

In [26]:
# write to file
with open('mb_news.json', 'w') as file:
    json.dump(mb_news, file, indent=4)

## Using API for The Guardian

In [27]:
# get the key from local file
API_KEY = ''
with open('key.txt', 'r') as file:
    API_KEY = file.read().strip()  # drop any trailing newline so the query string stays valid

# get 50 articles
results = requests.get("https://content.guardianapis.com/search?api-key=" + API_KEY + "&section=news&page-size=50")
articles = json.loads(results.text)
print(len(articles['response']['results']))

print(articles['response']['results'][0])

50
{'id': 'news/2021/apr/04/corrections-and-clarifications', 'type': 'article', 'sectionId': 'news', 'sectionName': 'News', 'webPublicationDate': '2021-04-04T20:00:16Z', 'webTitle': 'Corrections and clarifications', 'webUrl': 'https://www.theguardian.com/news/2021/apr/04/corrections-and-clarifications', 'apiUrl': 'https://content.guardianapis.com/news/2021/apr/04/corrections-and-clarifications', 'isHosted': False, 'pillarId': 'pillar/news', 'pillarName': 'News'}


In [28]:
# get the author and the text as well
results = requests.get("https://content.guardianapis.com/search?api-key=" + API_KEY + "&section=news&page-size=50&show-tags=contributor&show-blocks=body")
# Access time: 03-27-2021, 0047
articles = json.loads(results.text)
print(articles['response']['results'][0])

{'id': 'news/2021/apr/04/corrections-and-clarifications', 'type': 'article', 'sectionId': 'news', 'sectionName': 'News', 'webPublicationDate': '2021-04-04T20:00:16Z', 'webTitle': 'Corrections and clarifications', 'webUrl': 'https://www.theguardian.com/news/2021/apr/04/corrections-and-clarifications', 'apiUrl': 'https://content.guardianapis.com/news/2021/apr/04/corrections-and-clarifications', 'tags': [{'id': 'profile/editor-of-the-corrections-and-clarifications-column', 'type': 'contributor', 'webTitle': 'Corrections and clarifications column editor', 'webUrl': 'https://www.theguardian.com/profile/editor-of-the-corrections-and-clarifications-column', 'apiUrl': 'https://content.guardianapis.com/profile/editor-of-the-corrections-and-clarifications-column', 'references': [], 'firstName': 'editor', 'lastName': 'ofthecorrectionsandclarificationscolumn'}], 'blocks': {'body': [{'id': '5e74af488f085c6327bc3b4f', 'bodyHtml': '<p>• An article about city traffic (<a href="https://www.theguardian.
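
The request URLs above are built by string concatenation. As a sketch of an alternative, the same query parameters can be assembled with `urllib.parse.urlencode`, which also percent-encodes any unexpected characters in the key. `build_search_url` and the placeholder key `YOUR_KEY` are assumptions for illustration; the parameter names are the ones already used in the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://content.guardianapis.com/search"


def build_search_url(api_key, section="news", page_size=50, with_body=True):
    """Build the search URL with the parameters used in the cells above."""
    params = {"api-key": api_key, "section": section, "page-size": page_size}
    if with_body:
        # Ask for contributor tags (authors) and the article body blocks.
        params["show-tags"] = "contributor"
        params["show-blocks"] = "body"
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("YOUR_KEY"))
```

With `requests`, the same dictionary could instead be passed through the `params` argument of `requests.get`.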

In [31]:
# get the article text according to the docs
print(articles['response']['results'][0]['blocks']['body'][0]['bodyTextSummary'])
test = articles

• An article about city traffic (The long read: To the barricades, 25 March, page 5, Journal) said 15 pedestrians and cyclists were killed on London’s Kensington High Street in the last three years; that is the number who were seriously injured. Also, mention was made of a 1939 plan to create national cycle routes, but some already existed then. • We mistakenly reprinted an old Doonesbury cartoon in Friday’s G2 (page 12). The strip we meant to publish can be found at theguardian.com/doonesbury2803.


In [32]:
# filter out other article types (e.g. liveblogs)
articles = list(filter(lambda x: x['type'] == 'article', test['response']['results']))

# parsing the json
url = articles[3]['webUrl']
headline = articles[3]['webTitle']
author = []

for x in articles[3]['tags']:
    if x['type'] != 'contributor':
        continue
    author.append(x['webTitle'])

date = articles[3]['webPublicationDate']
body = articles[3]['blocks']['body'][0]['bodyTextSummary']

print("URL: ", url)
print("\nHeadline: ", headline)
print("\nAuthor: ", author)
print("\nDate: ", date)
print("\nArticle Body: \n", body)

URL:  https://www.theguardian.com/news/2021/apr/02/corrections-and-clarifications

Headline:  Corrections and clarifications

Author:  ['Corrections and clarifications column editor']

Date:  2021-04-02T20:00:18Z

Article Body: 
 • The photo featuring a railway viaduct that accompanied a travel article about surprising villages did not show the “view across Morecambe Bay towards … Arnside”, as claimed in the caption; it was of the Leven viaduct, a few miles away from the Kent viaduct that approaches Arnside. A second photo mistakenly described the Youlgrave water tank, dated 1829, as Victorian. And we are pleased to say decorative tiles are still made in Jackfield, contrary to what the article said (Small wonders, 27 March, page 58). • Other recently amended articles include: Grenfell expert witness is father of council’s head of fire safety ‘A game of two halves’: how Bristol protest went from calm to mayhem Senior policing figures fear further violence after Bristol ‘kill the bill’ p

In [33]:
# clean the body & headline
# str.replace returns a new string, so the result has to be assigned back
body = body.replace("â€™", "'").replace("â€œ", "\"").replace("â€�", "\"").replace("â€¢", "*").replace("Ã©", "e").replace("Ã¼", "u").replace("â€“", "-")
headline = headline.replace("â€™", "'").replace("â€œ", "\"").replace("â€�", "\"").replace("â€¢", "*").replace("Ã©", "e").replace("Ã¼", "u").replace("â€“", "-")

print(headline)
print(body)

Corrections and clarifications
• The photo featuring a railway viaduct that accompanied a travel article about surprising villages did not show the “view across Morecambe Bay towards … Arnside”, as claimed in the caption; it was of the Leven viaduct, a few miles away from the Kent viaduct that approaches Arnside. A second photo mistakenly described the Youlgrave water tank, dated 1829, as Victorian. And we are pleased to say decorative tiles are still made in Jackfield, contrary to what the article said (Small wonders, 27 March, page 58). • Other recently amended articles include: Grenfell expert witness is father of council’s head of fire safety ‘A game of two halves’: how Bristol protest went from calm to mayhem Senior policing figures fear further violence after Bristol ‘kill the bill’ protest Brit awards nominations 2021: Dua Lipa, Arlo Parks and Celeste lead improved field for women Rihanna’s 30 greatest singles – ranked! What is allowed under Covid lockdown rules around the UK? ‘
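
Python strings are immutable, so `str.replace` returns a new string and leaves the original untouched. The sketch below shows a table-driven version of the cleaning step in which each result is assigned back. `REPLACEMENTS` holds a subset of the pairs used in the cell above, and `clean_text` is a hypothetical helper name.

```python
# Mis-decoded sequences mapped to plain ASCII (subset of the pairs used above).
REPLACEMENTS = {
    "â€™": "'",
    "â€œ": '"',
    "â€¢": "*",
    "â€“": "-",
    "Ã©": "e",
    "Ã¼": "u",
}


def clean_text(text):
    """Apply every replacement and return the cleaned string."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)  # assign back: replace() does not modify in place
    return text


dirty = "councilâ€™s head of fire safety"
dirty.replace("â€™", "'")   # result discarded, dirty is unchanged
cleaned = clean_text(dirty)
print(cleaned)
```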

In [34]:
guardian_news = []

for x in articles:
    curr_url = x['webUrl']
    headline = x['webTitle']
    author = []

    for y in x['tags']:
        if y['type'] != 'contributor':
            continue
        author.append(y['webTitle'])

    date = x['webPublicationDate']
    body = x['blocks']['body'][0]['bodyTextSummary']
    # str.replace returns a new string, so the result has to be assigned back
    body = body.replace("â€™", "'").replace("â€œ", "\"").replace("â€�", "\"").replace("â€¢", "*").replace("Ã©", "e").replace("Ã¼", "u").replace("â€“", "-")
    headline = headline.replace("â€™", "'").replace("â€œ", "\"").replace("â€�", "\"").replace("â€¢", "*").replace("Ã©", "e").replace("Ã¼", "u").replace("â€“", "-")
    
    guardian_news.append({
        'source': curr_url,
        'date': date,
        'title': headline,
        'article_body': body,
        'author': author
    })
    
print(len(guardian_news))

49


In [35]:
# write to file
with open('guardian_news.json', 'w') as file:
    json.dump(guardian_news, file, indent=4)
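
The three sources store `date` in different formats (see the sanity-check outputs above). If the files are later combined, the dates could be normalized with `datetime.strptime`, as in the sketch below. The function names are hypothetical, the format strings are inferred from the sample outputs only, and the month and AM/PM parsing assumes an English locale. Time zones for Inquirer and Manila Bulletin are not stated in the scraped text, so those values are left naive.

```python
from datetime import datetime, timezone


def parse_inquirer(text):
    """Parse e.g. '7:49 PM April 05, 2021' (after the date[4:] slice)."""
    return datetime.strptime(text.strip(), "%I:%M %p %B %d, %Y")


def parse_mb(text):
    """Parse e.g. 'April 5, 2021, 5:55 PM' (after the date[10:] slice)."""
    return datetime.strptime(text.strip(), "%B %d, %Y, %I:%M %p")


def parse_guardian(text):
    """Parse e.g. '2021-04-04T20:00:16Z'; the trailing Z marks UTC."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


print(parse_inquirer("7:49 PM April 05, 2021").isoformat())
print(parse_mb("April 5, 2021, 5:55 PM").isoformat())
print(parse_guardian("2021-04-04T20:00:16Z").isoformat())
```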