# Web Scraping Use Case: News Articles

The code below was used as part of a project that required scraping news articles from different Filipino news websites. This script in particular was used to scrape opinion pieces from The Philippine Star.

In [1]:
from selenium import webdriver
from time import sleep
import re
import numpy as np
import pandas as pd
from selenium.webdriver.common.keys import Keys
import requests
from bs4 import BeautifulSoup as soup
from selenium.webdriver.chrome.options import Options
import random
import os
from IPython.display import clear_output
import matplotlib.pyplot as plt

## Setting up a Proxy

You might want to set up a proxy, as shown below, to keep your actual IP address from being blocked by the website you're scraping. The IPs below are just placeholders.

In [2]:
os.environ['HTTP_PROXY'] = 'http://54.238.250.91:8080/'
os.environ['HTTPS_PROXY'] = 'https://54.238.250.91:8080/'
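
If you switch proxies often, the two assignments above can be wrapped in a small helper. This is only a sketch: `set_proxy` and `clear_proxy` are hypothetical names, the address is a documentation-range placeholder, and the default `http` scheme assumes the proxy itself speaks plain HTTP, which you should check against your provider.

```python
import os


def set_proxy(host, port, scheme="http"):
    """Point both proxy variables at one proxy (hypothetical helper)."""
    proxy_url = f"{scheme}://{host}:{port}/"
    os.environ["HTTP_PROXY"] = proxy_url
    os.environ["HTTPS_PROXY"] = proxy_url
    return proxy_url


def clear_proxy():
    """Remove the proxy variables so later requests go out directly."""
    os.environ.pop("HTTP_PROXY", None)
    os.environ.pop("HTTPS_PROXY", None)


# Placeholder address from the 203.0.113.0/24 documentation range.
active_proxy = set_proxy("203.0.113.10", 8080)
```

Call `clear_proxy()` once scraping is done so the rest of the notebook is not routed through the proxy.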

### Initialize driver object

You'll need to initialize a driver object to use `Selenium`. Search for `<Your Browser Name Here> Driver` and you should find several download options. Check your browser settings for the version number, then download the matching driver version. Place the driver file in the same folder as your notebook.

In [3]:
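opts = Options()
# Adjacent string literals keep indentation whitespace out of the user-agent.
opts.add_argument(
    "user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 "
    "Safari/537.36"
)

# Open the opinion section, which lists the headlines.
# Passing the driver path positionally works on Selenium 3; newer Selenium 4
# releases expect a Service object instead.
driver = webdriver.Chrome('chromedriver', options=opts)
driver.get('https://www.philstar.com/opinion')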

# Using Selenium

Selenium is helpful for websites with dynamic HTML, such as pages that load more content as you scroll or that rely on pop-ups and hover-over elements. The web driver lets you interact with page objects in several ways. For example, `find_element_by_css_selector` selects a headline by its CSS selector. (The `find_element_by_*` helpers belong to older Selenium releases; Selenium 4.3 and later removed them in favor of `driver.find_element(By.CSS_SELECTOR, ...)`.)

In [4]:
# driver.find_element_by_css_selector('div.profileCard__header__info')
driver.find_element_by_css_selector('div.news_title').text

'Seven deadly sins of leadership'

# Using BeautifulSoup

For this use case, we only use Selenium to manually scroll through and load all available archived news articles in the opinion section of the website. It's more practical to use BeautifulSoup to select the items we need to get from the web page. We will be extracting the headline, link, date, and text of each opinion piece.
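
The scrolling was done by hand in this notebook. One way to automate it is sketched here, assuming the page loads more articles each time you reach the bottom. `scroll_to_bottom`, `pause`, and `max_rounds` are hypothetical names. The only driver call used is `execute_script`.

```python
from time import sleep


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give the page time to fetch and render more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared, so assume the end was reached
        last_height = new_height
        loaded += 1
    return loaded
```

Run `scroll_to_bottom(driver)` before reading `driver.page_source`. The `max_rounds` cap keeps the loop from running forever on a page that never stops growing.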

In [11]:
page_html = soup(driver.page_source, "lxml")

Below are fairly straightforward examples of using `BeautifulSoup` to select the objects we are interested in. `.select` returns a list, so we index the first element to show a sample.

In [12]:
page_html.select('div.news_title > a')[0].get('href')

'https://www.philstar.com/opinion/2019/08/18/1944271/seven-deadly-sins-leadership'

In [13]:
page_html.select('div.news_title > a')[0].text

'Seven deadly sins of leadership'

Once we're sure about the selected objects, store them into lists.

In [14]:
k = page_html.select('div.news_title > a')

links = []
heads = []

for elem in k:
    links.append(elem.get('href'))
    heads.append(elem.text)

Use regular expressions to extract the dates from the links.

In [15]:
dates = re.findall(r"/(\d{4}/\d{2}/\d{2})", str(links))

In [16]:
# Just making sure all lengths are equal
len(links), len(heads), len(dates)

(214, 214, 214)
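
The lengths match here because every link contains exactly one date. If some link had none, a single `findall` over the whole list would silently shift the dates out of alignment. A per-link version is sketched below. `date_from_link` is a hypothetical helper, and the second sample link is made up to show the no-date case.

```python
import re
from datetime import date

# Same pattern as above, with year, month, and day captured separately.
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})")


def date_from_link(link):
    """Return the date embedded in a link, or None if there is none."""
    match = DATE_IN_URL.search(link)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # digits matched but do not form a real calendar date


sample_links = [
    "https://www.philstar.com/opinion/2019/08/18/1944271/seven-deadly-sins-leadership",
    "https://www.philstar.com/opinion",  # hypothetical link without a date
]
sample_dates = [date_from_link(link) for link in sample_links]
```

Because the result has one entry per link, it can go straight into the dataframe, and missing dates show up as empty values instead of shifting the column.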

In [18]:
op_data = pd.DataFrame({"headlines": heads, "links": links, "dates": dates})

In [19]:
op_data.head()

| | headlines | links | dates |
|---|---|---|---|
| 0 | Seven deadly sins of leadership | https://www.philstar.com/opinion/2019/08/18/19... | 2019/08/18 |
| 1 | Aftershock of the colonial earthquake | https://www.philstar.com/opinion/2019/08/18/19... | 2019/08/18 |
| 2 | Clark airport placed under private group | https://www.philstar.com/opinion/2019/08/18/19... | 2019/08/18 |
| 3 | Tightening US-Philippines relations | https://www.philstar.com/opinion/2019/08/18/19... | 2019/08/18 |
| 4 | Diabetes – high rate of med discontinuation re... | https://www.philstar.com/opinion/2019/08/18/19... | 2019/08/18 |


# Getting the text

Note that at this point, we still don't have the bodies of text associated with each headline. We extracted the links per headline precisely for this reason: we will access each link in the list we generated earlier and extract the text. The output list will then be appended as a column in a new dataframe.

In [28]:
def get_html(url):
    """ Return soup object of given webpage. """
    # Adjacent string literals keep indentation whitespace out of the header.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/39.0.2171.95 Safari/537.36'}
    p = requests.get(url, headers=headers)
    sp = soup(p.text, 'html.parser')
    p.close()  # call the method; a bare `p.close` does nothing
    return sp

Note: `get_text` appends to a global list, `arts`, which is initialized after the function is defined.

In [29]:
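def get_text(links, lim=None):
    """
    Accepts a list of links and extracts the main body of text from
    each accessed link. Results are appended to the global list `arts`,
    one entry per link, so the list stays aligned with `links`.

    Parameters
    ----------
    links : list
        list of links to access
    lim : int, optional
        maximum number of links to access; None (default) accesses all
    """
    for count, url in enumerate(links):
        if lim is not None and count == lim:
            break
        try:
            sp = get_html(url)
            # yes, they are tagged sports articles for some reason
            texts = sp.select('#sports_article_writeup')[0].text
            arts.append(texts)
        except Exception as err:
            # Keep one placeholder per failed link so the rows still line up.
            print('could not extract text from', url, '-', err)
            arts.append('something went wrong')

        # Pause 5 to 9 seconds between requests to go easy on the server.
        sleeptime = random.randint(5, 9)
        sleep(sleeptime)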

In [30]:
arts = []

In [31]:
get_text(links)
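
Appending to a global list works, but running the cell twice doubles the entries in `arts`. As an alternative, here is a sketch that returns a fresh list instead. `collect_texts`, `fetch_text`, and `delay_range` are hypothetical names, and `fetch_text` stands for any callable that takes a URL and returns its article text.

```python
import random
from time import sleep


def collect_texts(links, fetch_text, delay_range=(5, 9)):
    """Return one entry per link: the article text, or None on failure."""
    texts = []
    for position, url in enumerate(links):
        try:
            texts.append(fetch_text(url))
        except Exception:
            texts.append(None)  # keep the slot so rows stay aligned with links
        if position < len(links) - 1:
            # Random pause between requests; skipped after the last link.
            sleep(random.uniform(*delay_range))
    return texts
```

With the functions in this notebook, `fetch_text` could be `lambda url: get_html(url).select('#sports_article_writeup')[0].text`, and the call would be `arts = collect_texts(links, fetch_text)`.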

In [33]:
op_data = pd.DataFrame({"headlines": heads, "links": links, "dates": dates, "text": arts})

In [34]:
op_data.to_csv("Philstar_Opinion.csv")

# Sample Output

In [35]:
op_data

| | headlines | links | dates | text |
|---|---|---|---|---|
| 0 | Seven deadly sins of leadership | https://www.philstar.com/opinion/2019/08/18/19... | 2019/08/18 | \nPeter Drucker, the greatest management guru,... |
| 1 | Aftershock of the colonial earthquake | https://www.philstar.com/opinion/2019/08/18/19... | 2019/08/18 | \nIf you watch the violent demonstration in Ho... |
| 2 | Clark airport placed under private group | https://www.philstar.com/opinion/2019/08/18/19... | 2019/08/18 | \nCLARK FREEPORT – Management of Clark Interna... |
| 3 | Tightening US-Philippines relations | https://www.philstar.com/opinion/2019/08/18/19... | 2019/08/18 | \nThe recent signing of a memorandum of unders... |
| 4 | Diabetes – high rate of med discontinuation re... | https://www.philstar.com/opinion/2019/08/18/19... | 2019/08/18 | \nMost patients with type 2 diabetes mellitus ... |
| 5 | Petition denied based on petitioner’s prior fi... | https://www.philstar.com/opinion/2019/08/18/19... | 2019/08/18 | \nMany people believe that once they obtain a ... |
| 6 | Choppy | https://www.philstar.com/opinion/2019/08/17/19... | 2019/08/17 | \nThe New York Stock Exchange (NYSE) lost 800 ... |
| 7 | AFP, PNP resurrect creepy memories | https://www.philstar.com/opinion/2019/08/17/19... | 2019/08/17 | \nOn Dec. 4, 2018, President Duterte signed Ex... |
| 8 | The young and the restless | https://www.philstar.com/opinion/2019/08/17/19... | 2019/08/17 | \nThe world is spellbound by the Hong Kong rio... |
| 9 | The sun of the imagination | https://www.philstar.com/opinion/2019/08/17/19... | 2019/08/17 | \nIn the popular mind, to be a poet means to b... |
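
Each extracted text begins with a newline, and the article bodies probably contain more stray whitespace. A minimal cleanup sketch follows. `clean_text` is a hypothetical helper, and collapsing all whitespace also removes paragraph breaks, which may or may not suit your analysis.

```python
import re


def clean_text(raw):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


sample = "\n  Hello\n\nworld \t"
cleaned = clean_text(sample)
```

On the dataframe this could be applied with `op_data["text"].apply(clean_text)`.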
