# Getting data directly from a website
This notebook scrapes the [ABS-CBN news website](https://news.abs-cbn.com/news), gathers the contents of news articles published on March 11 and 12, 2021, and saves the details in JSON files.

### Import `requests` library
The `requests` package lets you download a web page's HTML so that you can extract data from it. Let's save the URLs of the three listing pages we need in the variables `URL1`, `URL2`, and `URL3`.

In [1]:
import requests

URL1="https://news.abs-cbn.com/news?page=6"
URL2="https://news.abs-cbn.com/news?page=7"
URL3="https://news.abs-cbn.com/news?page=8"
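
Since the three URLs differ only in the `page` query parameter, they could also be generated instead of typed out. The following is a minimal sketch; `BASE_URL` and `build_page_urls` are illustrative names that do not appear in the original notebook, and the sketch assumes every listing page follows the same `?page=N` pattern.

```python
BASE_URL = "https://news.abs-cbn.com/news"  # listing page used in this notebook

def build_page_urls(first_page, last_page, base_url=BASE_URL):
    """Return the listing URLs for pages first_page..last_page (inclusive)."""
    return [f"{base_url}?page={n}" for n in range(first_page, last_page + 1)]

# Pages 6 to 8 are the ones this notebook scrapes
urls = build_page_urls(6, 8)
print(urls)
```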

### Load the pages

In [2]:
page1=requests.get(URL1)
page2=requests.get(URL2)
page3=requests.get(URL3)
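
The three requests above have no timeout and do not check whether the server actually returned the page. One possible way to handle both is sketched below. `fetch_all` and its parameters are hypothetical names, and the one-second pause between requests is an assumed courtesy delay, not a documented requirement of the site. The function takes the `get` callable as an argument so that `requests.get` can be passed in.

```python
import time

def fetch_all(urls, get, delay_seconds=1.0, timeout=10):
    """Fetch each URL in order with `get` (for example requests.get).

    Raises an exception on the first HTTP error status instead of
    silently returning an error page.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        response = get(url, timeout=timeout)
        response.raise_for_status()    # raises on 4xx/5xx responses
        pages.append(response)
    return pages
```

With `requests` imported as above, this could be called as `page1, page2, page3 = fetch_all([URL1, URL2, URL3], requests.get)`.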

In [3]:
print(page1.content)



### Parse HTML data

In [4]:
from bs4 import BeautifulSoup

# Parse page1 HTML and get articles
soup = BeautifulSoup(page1.content, 'html.parser')
articles1=soup.find_all('article', class_='clearfix')

# Parse page2 HTML and get articles
soup = BeautifulSoup(page2.content, 'html.parser')
articles2=soup.find_all('article', class_='clearfix')

# Parse page3 HTML and get articles
soup = BeautifulSoup(page3.content, 'html.parser')
articles3=soup.find_all('article', class_='clearfix')

In [5]:
# Combine the articles from all three pages into one list.
# Plain concatenation works even if the pages hold different numbers of articles.
articles = articles1 + articles2 + articles3

### Find details of the first article

In [6]:
# Find title
articles[0].find(class_='title').text

'THE DAY IN PHOTOS: March 12, 2021'

In [7]:
# Find link of article
link = 'https://news.abs-cbn.com'+articles[0].find('a')['href']
print(link)

https://news.abs-cbn.com/news/multimedia/slideshow/03/12/21/the-day-in-photos-march-12-2021
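
Prefixing the site root by hand works as long as every `href` is a relative path. If some links turned out to be absolute already, the standard library's `urljoin` would handle both cases. This is a small sketch; `SITE_ROOT` and `absolute_link` are illustrative names, and the `example.com` URL is a made-up placeholder.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://news.abs-cbn.com"  # same prefix used in the cell above

def absolute_link(href, root=SITE_ROOT):
    """Resolve href against the site root; absolute hrefs pass through unchanged."""
    return urljoin(root, href)

# Relative path taken from the first article above
print(absolute_link("/news/multimedia/slideshow/03/12/21/the-day-in-photos-march-12-2021"))
# Placeholder absolute URL, returned as-is
print(absolute_link("https://example.com/story"))
```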


In [8]:
# Finding the author of the article
articles[0].find(class_='author').text.strip()

'ABS-CBN News'
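
Calls like `.find(class_='author').text` raise an `AttributeError` when the element is missing, because `find` returns `None` in that case. A defensive helper might look like the sketch below. `field_text` is a hypothetical name, and `SAMPLE_HTML` is invented markup that only mimics the class names used in this notebook; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only; it deliberately has no author element
SAMPLE_HTML = """
<article class="clearfix">
  <p class="title"><a href="/news/sample-story">Sample story</a></p>
  <span class="datetime">Mar 12 11:37 PM</span>
</article>
"""

def field_text(article, name=None, class_=None, default=""):
    """Return the stripped text of the first matching element, or default if absent."""
    found = article.find(name, class_=class_)
    return found.get_text(strip=True) if found else default

sample_article = BeautifulSoup(SAMPLE_HTML, "html.parser").find("article", class_="clearfix")
print(field_text(sample_article, class_="title"))
print(field_text(sample_article, "span", class_="datetime"))
print(field_text(sample_article, class_="author", default="Unknown"))
```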

In [9]:
# Finding the date the article was published
articles[0].find('span', class_='datetime').text.strip()

'Mar 12 11:37 PM'
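
The scraped timestamp is a plain string without a year. If a real `datetime` is needed later (for sorting or filtering), it could be parsed as sketched below. `parse_listing_date` is an illustrative name, the default year of 2021 is an assumption based on the dates this notebook targets, and `%b` and `%p` assume an English locale.

```python
from datetime import datetime

def parse_listing_date(text, year=2021):
    """Parse a listing timestamp such as 'Mar 12 11:37 PM', supplying the missing year."""
    # The year is assumed, since the listing page does not include it
    return datetime.strptime(f"{year} {text}", "%Y %b %d %I:%M %p")

parsed = parse_listing_date("Mar 12 11:37 PM")
print(parsed.isoformat())
```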

In [10]:
# Finding the content of the article
article = requests.get(link)
article_content = article.content

soup = BeautifulSoup(article_content, 'html.parser')
body = soup.find_all('div', class_='block-content')
x = body[0].find_all('p')
x.pop(2) # Remove "Share"

# Collect the text of each paragraph, skipping the first two and the last one
list_paragraphs = []
for p in range(2, len(x)-1):
    paragraph = x[p].get_text()
    list_paragraphs.append(paragraph)

# Join the paragraphs once, after the loop has finished
final_article = " ".join(list_paragraphs)

final_article

'Share Here are the big stories today in photos. Hundreds of minimum health protocol violators from different communities in Quezon City are apprehended and processed at the Quezon City Memorial Circle on Friday in an effort to curb the surge of COVID-19 infections in Metro Manila. The police together with the local government Task Force Disiplina are enforcing the restrictions, but only gave tickets and face masks this time. Jire Carreon, ABS-CBN News Lucha libre wrestler Ciclon Ramirez sprays water at a man as he and others encourage mask-less people to wear masks as a measure of prevention against the COVID-19 disease at the Central Abastos market in Mexico City, Mexico in this picture taken March 10, 2021. Lucha libre is a popular form of professional wrestling in Mexico and they are helping in the fight aginst the coronavirus pandemic which has claimed almost 200,000 lives in their country. Carlos Jasso, Reuters Health workers care for patients infected with COVID-19 at the full e

### Find all articles and their details and save as json

In [11]:
mar11_json = []
mar12_json = []

for entry in articles:

    link = 'https://news.abs-cbn.com'+entry.find('a')['href']
    title = entry.find(class_='title').text.strip()
    date = entry.find('span', class_='datetime').text.strip()
    author = entry.find(class_='author').text.strip()

    # Getting the content
    article = requests.get(link)
    article_content = article.content

    # Parse HTML data of specific article
    soup = BeautifulSoup(article_content, 'html.parser')
    body = soup.find_all('div', class_='block-content')
    x = body[0].find_all('p')
    x.pop(2) # Remove "Share"

    # Unifying the paragraphs (skip the first two and the last one)
    list_paragraphs = []
    for p in range(2, len(x)-1):
        list_paragraphs.append(x[p].get_text())

    # Join once per article, after the loop, so an article with no
    # paragraphs gets an empty string instead of the previous article's text
    final_article = " ".join(list_paragraphs)

    # Build the record and file it under the matching date
    record = {
        "date": date,
        "title": title,
        "author": author,
        "content": final_article
    }
    if date.startswith('Mar 11'):
        mar11_json.append(record)
    else:
        mar12_json.append(record)
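
Note that the `else` branch above files every article that is not from March 11 under March 12, so an article from any other day on these pages would also end up in `mar12_json`. A more general grouping by day could be sketched as follows. `group_by_day` is a hypothetical helper, and the sample records are made-up placeholders rather than scraped data.

```python
from collections import defaultdict

def group_by_day(records):
    """Group article dicts by the 'Mon DD' prefix of their 'date' field."""
    groups = defaultdict(list)
    for record in records:
        day = " ".join(record["date"].split()[:2])  # 'Mar 11 11:57 PM' -> 'Mar 11'
        groups[day].append(record)
    return dict(groups)

# Placeholder records, for illustration only
sample = [
    {"date": "Mar 12 11:37 PM", "title": "placeholder A"},
    {"date": "Mar 11 11:57 PM", "title": "placeholder B"},
    {"date": "Mar 12 09:00 AM", "title": "placeholder C"},
]
grouped = group_by_day(sample)
print({day: len(items) for day, items in grouped.items()})
```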

In [12]:
mar11_json

[{'date': 'Mar 11 11:57 PM',
  'title': "Duque says PH's vaccination rollout rate 'not as quick as we wanted'",
  'author': 'Vivienne Gulla, ABS-CBN News',
  'content': 'MANILA — The pace of the country’s COVID-19 vaccination program is “not as quick” as what the government had wanted, Health Secretary Francisco Duque III said on Thursday, more than a week since the country began rolling out the vaccines’ first doses.\xa0 Some 83,000 health personnel have been inoculated so far, according to Duque. Malacañang, meanwhile, said some 114,615 Filipinos have been vaccinated as of Wednesday, out of the government\'s 70 million target this year.\xa0 “The first week, I will admit, the vaccination rate was not as quick as we wanted it, but for obvious reasons. Siyempre nag-uumpisa pa lang (we have just started)" he explained. "Pangalawa, mayroon pong option na makapili ang babakunahang healthcare workers. Kung ayaw sa Sinovac, binigyan po natin sila ng right of first refusal, at ‘yung AstraZene

In [13]:
mar12_json

[{'date': 'Mar 12 11:37 PM',
  'title': 'THE DAY IN PHOTOS: March 12, 2021',
  'author': 'ABS-CBN News',
  'content': 'Share Here are the big stories today in photos. Hundreds of minimum health protocol violators from different communities in Quezon City are apprehended and processed at the Quezon City Memorial Circle on Friday in an effort to curb the surge of COVID-19 infections in Metro Manila. The police together with the local government Task Force Disiplina are enforcing the restrictions, but only gave tickets and face masks this time. Jire Carreon, ABS-CBN News Lucha libre wrestler Ciclon Ramirez sprays water at a man as he and others encourage mask-less people to wear masks as a measure of prevention against the COVID-19 disease at the Central Abastos market in Mexico City, Mexico in this picture taken March 10, 2021. Lucha libre is a popular form of professional wrestling in Mexico and they are helping in the fight aginst the coronavirus pandemic which has claimed almost 200,0

### Save articles in a json file

In [14]:
import json

# Save the March 11 articles as mar11.json
with open("mar11.json", "w") as outfile:
    json.dump(mar11_json, outfile, indent=4)

# Save the March 12 articles as mar12.json
with open("mar12.json", "w") as outfile:
    json.dump(mar12_json, outfile, indent=4)
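
To check that a saved file can be read back, and to keep characters such as "ñ" readable instead of escaped as `\u00f1`, a round trip along the following lines may help. This is a sketch only: `save_and_reload` is an illustrative name, the sample record is a placeholder, and it writes to a temporary directory rather than to the files created above.

```python
import json
import os
import tempfile

def save_and_reload(records, path):
    """Write records as UTF-8 JSON to path, then load them back."""
    with open(path, "w", encoding="utf-8") as outfile:
        # ensure_ascii=False keeps non-ASCII characters as-is in the file
        json.dump(records, outfile, indent=4, ensure_ascii=False)
    with open(path, encoding="utf-8") as infile:
        return json.load(infile)

# Placeholder record, for illustration only
sample = [{"date": "Mar 11 11:57 PM", "author": "placeholder", "content": "Malacañang"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_reload(sample, os.path.join(tmp_dir, "sample.json"))
print(reloaded == sample)
```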