# Scraping from 3 different news sites

**Datetime:** 03-26-2021 1600

**Sites:**

- Inquirer (Crawl delay: 5 seconds)
- ABS-CBN (Crawl delay: 10 seconds)
- Philstar (Crawl delay: unspecified, will use 5 seconds)
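
The crawl delays listed above come from each site's `robots.txt`. As a rough sketch, the standard library's `urllib.robotparser` can read that value. The robots.txt text below is a made-up sample so the cell runs offline, and `crawl_delay_from_text` and `DEFAULT_DELAY` are hypothetical names, not part of the original notebook.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only so this sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

DEFAULT_DELAY = 5  # assumed fallback (seconds) when no crawl delay is declared


def crawl_delay_from_text(robots_txt, user_agent="*", default=DEFAULT_DELAY):
    """Return the Crawl-delay for user_agent, or default if none is declared."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return default if delay is None else delay


print(crawl_delay_from_text(SAMPLE_ROBOTS_TXT))
print(crawl_delay_from_text("User-agent: *\nDisallow:\n"))
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` would replace the `parse(...)` call. The fallback mirrors the choice made for Philstar, where no delay is specified.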

## Imports

In [69]:
import requests
import json
import time
import random
from bs4 import BeautifulSoup

## Scraping from Inquirer

### Getting the latest news links

In [3]:
URL = "https://www.inquirer.net/latest-stories"
headers = {'User-agent' : '*'}
page = requests.get(URL, headers=headers) # Access time: 03-26-2021, 1644
soup = BeautifulSoup(page.content, 'html.parser')

In [5]:
container = soup.find(id='al-wrap')
headlines = container.find_all(id='al-box')
# Sanity Check
print(headlines[0])

<div id="al-box">
<div id="al-time">POP - 4:38 PM</div>
<h2><a href="https://pop.inquirer.net/107813/this-teacher-took-his-class-on-a-virtual-field-trip-to-the-zoo">This teacher took his class on a virtual field trip to the zoo</a></h2>
</div>


In [6]:
# parse content: inspect the children of the first headline box
first = headlines[0]

for idx, child in enumerate(first.contents):
    print("Index ", idx, " Content:\n", child)

print("----------")
for idx, child in enumerate(first.contents[3].contents):
    print("Index ", idx, " Content:\n", child)

print("----------")
print("\nType - time: ", first.contents[1].text.strip())
print("\nHeadline: ", first.contents[3].contents[0].text.strip())
print("\nURL: ", first.contents[3].contents[0]['href'])

Index  0  Content:
 

Index  1  Content:
 <div id="al-time">POP - 4:38 PM</div>
Index  2  Content:
 

Index  3  Content:
 <h2><a href="https://pop.inquirer.net/107813/this-teacher-took-his-class-on-a-virtual-field-trip-to-the-zoo">This teacher took his class on a virtual field trip to the zoo</a></h2>
Index  4  Content:
 

----------
Index  0  Content:
 <a href="https://pop.inquirer.net/107813/this-teacher-took-his-class-on-a-virtual-field-trip-to-the-zoo">This teacher took his class on a virtual field trip to the zoo</a>
----------

Type - time:  POP - 4:38 PM

Headline:  This teacher took his class on a virtual field trip to the zoo

URL:  https://pop.inquirer.net/107813/this-teacher-took-his-class-on-a-virtual-field-trip-to-the-zoo
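
The positional lookups above (`contents[1]`, `contents[3]`) depend on the whitespace text nodes between tags, so they would shift if the site's markup changed spacing. A possible alternative, sketched here on the markup printed earlier, is to look elements up by id and tag name. `parse_headline_box` is a hypothetical helper, and splitting the label on `" - "` assumes every label follows the `TYPE - TIME` pattern seen in this one sample.

```python
from bs4 import BeautifulSoup

# Markup copied from the sanity-check output of headlines[0].
SAMPLE_HTML = """
<div id="al-box">
<div id="al-time">POP - 4:38 PM</div>
<h2><a href="https://pop.inquirer.net/107813/this-teacher-took-his-class-on-a-virtual-field-trip-to-the-zoo">This teacher took his class on a virtual field trip to the zoo</a></h2>
</div>
"""


def parse_headline_box(box):
    """Extract section, time, headline and URL from one al-box element."""
    label = box.find(id="al-time").get_text(strip=True)
    # Assumes the label looks like "POP - 4:38 PM"
    section, _, posted = label.partition(" - ")
    link = box.find("a")
    return {
        "section": section,
        "time": posted,
        "headline": link.get_text(strip=True),
        "url": link["href"],
    }


box = BeautifulSoup(SAMPLE_HTML, "html.parser").find(id="al-box")
print(parse_headline_box(box))
```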


In [11]:
# keep only the headlines of type NEWSINFO

news_headlines = []

for item in headlines:
    item_type = item.contents[1].text.strip()  # e.g. "POP - 4:38 PM"
    if 'NEWSINFO' in item_type:
        link = item.contents[3].contents[0]
        news_headlines.append({
            'type': item_type,
            'Headline': link.text.strip(),
            'URL': link['href']
        })

news_urls = [item['URL'] for item in news_headlines]

# Sanity Check
print(news_urls)

['https://newsinfo.inquirer.net/1411623/sc-confirms-another-justice-to-retire-for-health-reasons', 'https://newsinfo.inquirer.net/1411718/phs-covid-19-cases-top-700000-as-doh-reports-about-10000-new-infections', 'https://newsinfo.inquirer.net/1411705/thailand-faces-meth-trafficking-surge-after-myanmar-coup', 'https://newsinfo.inquirer.net/1411707/covid-19-hits-over-15000-health-workers-doh', 'https://newsinfo.inquirer.net/1411686/bureau-of-immigration-warns-public-against-online-scammers', 'https://newsinfo.inquirer.net/1411690/ched-six-heis-partner-with-lgus-to-serve-as-vaccination-centers', 'https://newsinfo.inquirer.net/1411680/solon-seeks-instant-sanctions-for-govt-execs-who-skipped-vaccine-priority-list', 'https://newsinfo.inquirer.net/1411696/zambales-mayor-tests-positive-for-covid-19', 'https://newsinfo.inquirer.net/1411643/thailand-urges-calm-after-death-of-covid-19-vaccine-recipient', 'https://newsinfo.inquirer.net/1411677/covid-19-icu-occupancy-rate-nearing-moderate-risk-trea
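
Every URL in the output above sits on the `newsinfo.inquirer.net` host, so a cross-check on the hostname is one way to confirm the label filter did what was intended. This is only a sketch: it assumes NEWSINFO articles are always served from that subdomain, which the notebook does not verify, and `is_newsinfo` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Two URLs taken from the outputs above: one NEWSINFO, one POP.
sample_urls = [
    "https://newsinfo.inquirer.net/1411623/sc-confirms-another-justice-to-retire-for-health-reasons",
    "https://pop.inquirer.net/107813/this-teacher-took-his-class-on-a-virtual-field-trip-to-the-zoo",
]


def is_newsinfo(url):
    """True when the URL's host is the (assumed) NEWSINFO subdomain."""
    return urlparse(url).hostname == "newsinfo.inquirer.net"


newsinfo_only = [url for url in sample_urls if is_newsinfo(url)]
print(newsinfo_only)
```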

### Scraping the news pages

In [63]:
# SANITY CHECK: scrape a single article first
curr_url = news_urls[0]
curr_page = requests.get(curr_url, headers=headers)
curr_soup = BeautifulSoup(curr_page.content, 'html.parser')

# attempt to find format
headline = curr_soup.find('h1', {'class':'entry-title'}).text.strip()
author = curr_soup.find(id='art_author')['data-byline-strips']
date = curr_soup.find(id='art_plat').contents[2]

body = ''
body_soup = curr_soup.find(id='article_content').contents[1].find_all('p')

# remove all children from <p>s found in article content (e.g. advertisements, external links)
for x in body_soup:
    for y in x.find_all():
        y.extract()
    # to remove the caption
    if x.has_attr('class') and x['class'][0] == 'wp-caption-text':
        x.extract()

for x in body_soup:
    if x.text.strip() == 'RELATED STORIES': # end of the article
        break
    if x.has_attr('class'):
        # the boilerplate COVID-19 paragraphs in COVID-related news mark the end of the article
        if x['class'][0] == 'corona_article_tracker':
            break
        # any other <p> with a class (e.g. a caption) is skipped
    else:
        body += '\n' + x.text.strip()

print('URL: ', curr_url)
print("\nHeadline: ", headline)
print("\nAuthor: ", author)
print("\nDate: ", date[4:])
print("\nArticle body: \n", body)

URL:  https://newsinfo.inquirer.net/1411623/sc-confirms-another-justice-to-retire-for-health-reasons

Headline:  SC confirms another justice to retire for health reasons

Author:  Tetch Torres-Tupas

Date:  4:34 PM March 26, 2021

Article body: 
 
MANILA, Philippines—Supreme Court Associate Justice Edgardo Delos Santos is considering retiring early due to health reasons, Public Information Chief and Spokesperson Atty. Brian Keith Hosaka said Friday.
“According to Justice Delos Santos, due to health reasons, he is considering the possibility of retiring ahead of his 70th birthday on 12 June 2022,” Hosaka told reporters.
Hosaka said the magistrate had sent a letter to his staff advising them as early as March 19, 2021, to look for other employment, “knowing the difficulty of finding a job during this pandemic.”
“Justice Delos Santos further added that he remains an incumbent member of the Supreme Court until after a specific date of retirement, as may be indicated in a formal letter from
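
The date is kept as a plain string such as `4:34 PM March 26, 2021`. If sorting or comparing articles by time is needed later, it could be parsed into a `datetime`. The sketch below assumes every article uses this exact format and that the process runs under an English locale (`%B` and `%p` are locale-dependent). `parse_inquirer_date` and `INQUIRER_DATE_FORMAT` are hypothetical names.

```python
from datetime import datetime

# Format inferred from the single sample "4:34 PM March 26, 2021" (an assumption).
INQUIRER_DATE_FORMAT = "%I:%M %p %B %d, %Y"


def parse_inquirer_date(text):
    """Parse a date string like '4:34 PM March 26, 2021' into a datetime."""
    return datetime.strptime(text.strip(), INQUIRER_DATE_FORMAT)


parsed = parse_inquirer_date("4:34 PM March 26, 2021")
print(parsed.isoformat())
```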

In [66]:
# actual scraping

inquirer_news = []

for curr_url in news_urls:
    print("[SCRAPING] URL: ", curr_url)
    curr_page = requests.get(curr_url, headers=headers)
    curr_soup = BeautifulSoup(curr_page.content, 'html.parser')

    # attempt to find format
    headline = curr_soup.find('h1', {'class':'entry-title'}).text.strip()
    if curr_soup.find(id='art_author') is not None:
        author = curr_soup.find(id='art_author')['data-byline-strips']
    else:
        author = ''
    date = curr_soup.find(id='art_plat').contents[2]

    body = ''
    body_soup = curr_soup.find(id='article_content').contents[1].find_all('p')

    # remove all children from <p>s found in article content (e.g. advertisements, external links)
    for p in body_soup:
        for child in p.find_all():
            child.extract()
        # to remove the caption
        if p.has_attr('class') and p['class'][0] == 'wp-caption-text':
            p.extract()

    for p in body_soup:
        if p.text.strip() == 'RELATED STORIES': # end of the article
            break
        if p.has_attr('class'):
            # the boilerplate COVID-19 paragraphs in COVID-related news mark the end of the article
            if p['class'][0] == 'corona_article_tracker':
                break
            # any other <p> with a class (e.g. a caption) is skipped
        else:
            body += '\n' + p.text.strip()
    
    inquirer_news.append({
        'source': curr_url,
        'date': date[4:],
        'title': headline,
        'article_body': body,
        'author': author
    })
    
    time.sleep(5)
    print("DONE, NEXT:")

[SCRAPING] URL:  https://newsinfo.inquirer.net/1411623/sc-confirms-another-justice-to-retire-for-health-reasons
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1411718/phs-covid-19-cases-top-700000-as-doh-reports-about-10000-new-infections
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1411705/thailand-faces-meth-trafficking-surge-after-myanmar-coup
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1411707/covid-19-hits-over-15000-health-workers-doh
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1411686/bureau-of-immigration-warns-public-against-online-scammers
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1411690/ched-six-heis-partner-with-lgus-to-serve-as-vaccination-centers
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1411680/solon-seeks-instant-sanctions-for-govt-execs-who-skipped-vaccine-priority-list
DONE, NEXT:
[SCRAPING] URL:  https://newsinfo.inquirer.net/1411696/zambales-mayor-tests-positive-for-covid-19
D
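
In the loop above, a single article with unexpected markup (for example, no `h1.entry-title`, so `find` returns `None`) would raise an exception and stop the whole run. One way to guard against that, sketched with hypothetical names (`scrape_all`, `scrape_one`, `fake_scrape`), is to wrap the per-article work in a function and record failures instead of aborting. The fake scraper stands in for the real request so the cell runs offline.

```python
import time


def scrape_all(urls, scrape_one, delay_seconds=5):
    """Call scrape_one(url) for each URL, pausing between requests.

    Failures are recorded instead of aborting the whole run.
    """
    results, failures = [], []
    for index, url in enumerate(urls):
        try:
            results.append(scrape_one(url))
        except Exception as exc:  # e.g. AttributeError from a missing tag
            failures.append({"url": url, "error": repr(exc)})
        if index < len(urls) - 1:
            time.sleep(delay_seconds)  # respect the site's crawl delay
    return results, failures


def fake_scrape(url):
    """Stand-in for the real per-article scraper (hypothetical)."""
    if "bad" in url:
        raise AttributeError("missing <h1>")
    return {"source": url}


results, failures = scrape_all(
    ["https://example.com/ok", "https://example.com/bad"],
    fake_scrape,
    delay_seconds=0,  # no pause needed for the offline demo
)
print(len(results), len(failures))
```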

In [74]:
# sanity check

article_no = random.randint(0,len(inquirer_news)-1)

print('Index: ', article_no)
print('\nSource: ', inquirer_news[article_no]['source'])
print("\nTitle: ", inquirer_news[article_no]['title'])
print("\nAuthor: ", inquirer_news[article_no]['author'])
print("\nDate: ", inquirer_news[article_no]['date'])
print("\nArticle body: \n", inquirer_news[article_no]['article_body'])

Index:  28

Source:  https://newsinfo.inquirer.net/1411597/healthcare-workers-in-ncr-bubble-cebu-davao-to-get-400000-sinovac-vaccines

Title:  Healthcare workers in NCR bubble, Cebu, Davao to get 400,000 Sinovac vaccines

Author:  Daphne Galvez

Date:  2:06 PM March 26, 2021

Article body: 
 

MANILA, Philippines — The government will be allocating most of the recently delivered  of Sinovac BioTech for healthcare workers in areas most affected by new coronavirus variants, Malacañang said Friday.
This includes healthcare workers in the National Capital Region “bubble” (NCR, Batangas, Rizal, Laguna, Cavite) Cebu and Davao, Presidential spokesman Harry Roque said.
“Nagkaroon na ng desisyon ang ating NITAG (National Immunization Technical Advisory Group) na ‘yung mga kakarating na pinakahuling donasyon ng China na 400,000 na Sinovac ay ibibigay ang karamihan nito doon sa pinaka-apektado ng new variants kasama na ang NCR plus, Cebu, at Davao,” Presidential spokesman Harry Roque announced in

In [75]:
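# write to file (UTF-8, keeping characters such as "ñ" readable instead of \u-escaped)
with open('inquirer_news.json', 'w', encoding='utf-8') as file:
    json.dump(inquirer_news, file, indent=4, ensure_ascii=False)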
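
A quick way to confirm the file was written correctly is to load it back and compare it with the in-memory list. The sketch below does this on a one-item sample in a temporary directory so it does not touch `inquirer_news.json`. The sample record reuses values printed earlier, with a shortened body, and the file name is hypothetical.

```python
import json
import os
import tempfile

# One-item sample shaped like the entries in inquirer_news (body shortened).
sample_news = [{
    "source": "https://newsinfo.inquirer.net/1411597/healthcare-workers-in-ncr-bubble-cebu-davao-to-get-400000-sinovac-vaccines",
    "date": "2:06 PM March 26, 2021",
    "title": "Healthcare workers in NCR bubble, Cebu, Davao to get 400,000 Sinovac vaccines",
    "article_body": "Malacañang said Friday.",
    "author": "Daphne Galvez",
}]

path = os.path.join(tempfile.mkdtemp(), "inquirer_news_sample.json")

with open(path, "w", encoding="utf-8") as file:
    json.dump(sample_news, file, indent=4, ensure_ascii=False)

with open(path, encoding="utf-8") as file:
    loaded = json.load(file)

print(loaded == sample_news)
```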