# Getting data directly from a website
This notebook scrapes the [ABS-CBN news website](https://news.abs-cbn.com/news), gathers the contents of news articles published on March 12, 2021, and saves the details in a JSON file.

### Import the `requests` library
The `requests` package downloads a web page's HTML so that you can extract data from it. Save the URL of the news listing page in the `URL` variable.

In [1]:
import requests

URL = "https://news.abs-cbn.com/news?page=6"

### Load the page

In [2]:
page = requests.get(URL)

In [3]:
print(page.content)
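
A request can fail or hang, so it may help to check the response before parsing it. The sketch below wraps `requests.get` in a small helper. The name `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw HTML bytes of `url`, or raise if the request fails.

    `fetch_html` and the default timeout are hypothetical choices.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.content
```

With this helper, `fetch_html(URL)` could stand in for `requests.get(URL).content` in the cells that follow.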



### Parse HTML data

In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

In [5]:
# Get every <article> element that carries the "clearfix" class
articles = soup.find_all('article', class_='clearfix')
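
To see how `find_all` and `find` behave without touching the network, here is a minimal sketch on a hand-written HTML snippet. The markup is hypothetical: it only imitates the tag and class names used in this notebook and is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the selectors used in this notebook
sample_html = """
<article class="clearfix">
  <a href="/news/first"><h3 class="title">First story</h3></a>
  <span class="datetime"> Mar 12 07:41 PM </span>
</article>
<article class="other">
  <h3 class="title">Not matched</h3>
</article>
<article class="clearfix">
  <a href="/news/second"><h3 class="title">Second story</h3></a>
</article>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ matches elements whose class attribute contains "clearfix"
sample_articles = sample_soup.find_all('article', class_='clearfix')
titles = [a.find(class_='title').text for a in sample_articles]
print(titles)

# find() returns None when nothing matches, so check before reading .text
missing = sample_articles[1].find('span', class_='datetime')
print(missing is None)
```

Because `find` returns `None` for a missing element, a call such as `.find(...).text` raises `AttributeError` on any article that lacks that element.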

### Find details of the first article

In [6]:
# Find title
articles[0].find(class_='title').text

'Pagsipa ng COVID-19 cases di dapat isisi sa taumbayan: labor groups'

In [7]:
# Find link of article
link = 'https://news.abs-cbn.com'+articles[0].find('a')['href']
print(link)

https://news.abs-cbn.com/video/news/03/12/21/pagsipa-ng-covid-19-cases-di-dapat-isisi-sa-taumbayan-labor-groups
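
String concatenation works here because the `href` is a path that starts with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library also handles `href` values that are already absolute. The `BASE_URL` name and the `example.com` address below are illustrative assumptions.

```python
from urllib.parse import urljoin

BASE_URL = 'https://news.abs-cbn.com'

# Relative href matching the link printed for the first article
href = '/video/news/03/12/21/pagsipa-ng-covid-19-cases-di-dapat-isisi-sa-taumbayan-labor-groups'

# urljoin resolves a relative path against the base URL
full_link = urljoin(BASE_URL, href)
print(full_link)

# An href that is already absolute is returned unchanged
external_link = urljoin(BASE_URL, 'https://example.com/elsewhere')
print(external_link)
```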


In [8]:
# Finding the author of the article
articles[0].find(class_='author').text.strip()

'ABS-CBN News'

In [9]:
# Finding the date the article was published
articles[0].find('span', class_='datetime').text.strip()

'Mar 12 07:41 PM'
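
The scraped date is a plain string with no year. If a sortable timestamp is needed, one option is `datetime.strptime`. This is a minimal sketch that assumes the year is 2021 (taken from the introduction of this notebook) and an English locale for the month abbreviation and the AM/PM marker.

```python
from datetime import datetime

raw_date = 'Mar 12 07:41 PM'

# Assumption: the year is 2021, as stated in the notebook's introduction
ASSUMED_YEAR = 2021

# Append the year before parsing; without one, strptime defaults to 1900
parsed = datetime.strptime(f'{raw_date} {ASSUMED_YEAR}', '%b %d %I:%M %p %Y')
print(parsed.isoformat())
```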

In [10]:
# Finding the content of the article
article = requests.get(link)

# Parse the article page under its own name so the listing page's `soup` is kept
article_soup = BeautifulSoup(article.content, 'html.parser')
body = article_soup.find_all('div', class_='block-content')
x = body[0].find_all('p')
x.pop(2)  # Remove "Share"

# Skip the first two <p> tags and the last one, then join the rest
list_paragraphs = [p.get_text() for p in x[2:-1]]
final_article = " ".join(list_paragraphs)

final_article

'Nananawagan ang mga labor group at employers na huwag isisi sa taumbayan ang muling pagdami ng mga kaso ng COVID-19 kundi sa mga maling istratehiya ng pamahalaan. Nagpa-Patrol, Zen Hernandez. TV Patrol, Biyernes, 12 Marso 2021'

### Find all articles and their details and save as JSON

In [11]:
articles_json = []

for item in articles:

    # Getting the link of the article
    link = 'https://news.abs-cbn.com' + item.find('a')['href']

    # Getting the title
    title = item.find(class_='title').text.strip()

    # Getting the date
    date = item.find('span', class_='datetime').text.strip()

    # Getting the author
    author = item.find(class_='author').text.strip()

    # Getting the content
    article = requests.get(link)

    # Parse HTML data of specific article
    article_soup = BeautifulSoup(article.content, 'html.parser')
    body = article_soup.find_all('div', class_='block-content')
    x = body[0].find_all('p')
    x.pop(2)  # Remove "Share"

    # Unifying the paragraphs: skip the first two <p> tags and the last one.
    # Joining after the slice means an article with no body text yields ''
    # rather than reusing the previous article's content.
    list_paragraphs = [p.get_text() for p in x[2:-1]]
    final_article = " ".join(list_paragraphs)

    # Saving as json
    articles_json.append({
        "date": date,
        "title": title,
        "author": author,
        "content": final_article
    })

In [12]:
articles_json

[{'date': 'Mar 12 07:41 PM',
  'title': 'Pagsipa ng COVID-19 cases di dapat isisi sa taumbayan: labor groups',
  'author': 'ABS-CBN News',
  'content': 'Nananawagan ang mga labor group at employers na huwag isisi sa taumbayan ang muling pagdami ng mga kaso ng COVID-19 kundi sa mga maling istratehiya ng pamahalaan. Nagpa-Patrol, Zen Hernandez. TV Patrol, Biyernes, 12 Marso 2021'},
 {'date': 'Mar 12 07:31 PM',
  'title': 'Pagpapatupad ng curfew sa Metro Manila kasado na pero may mga di natuwa',
  'author': 'ABS-CBN News',
  'content': 'MAYNILA — Sa darating na Lunes ay ipatutupad na sa Metro Manila ang unified curfew na aarangkada mula alas-10 ng gabi hanggang alas-5 ng umaga, na inaasahang makatulong sa pagkontrol ng pagkalat ng COVID-19. Para sa ilang mga tao, makatutulong ang pagkakasundo ng Metro Manila mayors sa unified curfew para mabawasan ang kalituhan, lalo\'t iba-iba noon ang patakaran ng bawat lokal na pamahalaan.\xa0 Pero ang ilan, hindi masaya, katulad ni Aling Charito na ma

### Save articles in a JSON file

In [13]:
# Save the list of articles as articles.json
import json

# Write the list straight to the file so `articles_json` stays a list
with open("articles.json", "w") as outfile:
    json.dump(articles_json, outfile, indent=4)
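
To confirm that a saved file can be read back, the sketch below writes one record and loads it again. It uses a temporary path (an assumption, chosen so that `articles.json` is not overwritten) and a single record copied from the output above. Passing `ensure_ascii=False` together with `encoding="utf-8"` is an optional variation that keeps non-ASCII characters readable in the file instead of escaping them as `\uXXXX`.

```python
import json
import os
import tempfile

# One record copied from the output above, without the long "content" field
sample = [{
    "date": "Mar 12 07:41 PM",
    "title": "Pagsipa ng COVID-19 cases di dapat isisi sa taumbayan: labor groups",
    "author": "ABS-CBN News",
}]

# Hypothetical temporary location, used only for this check
path = os.path.join(tempfile.mkdtemp(), "articles_check.json")

with open(path, "w", encoding="utf-8") as outfile:
    json.dump(sample, outfile, indent=4, ensure_ascii=False)

# Load the file again and compare it with what was written
with open(path, encoding="utf-8") as infile:
    loaded = json.load(infile)

print(loaded == sample)
```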