# News Article Analysis 1.0

## Web Scraping

### Keyword

Specify a 'keyword' to search for in Google News. The tool built for this example is designed to scrape articles from www.thestar.com.my only.

The search in Google News will look like this: "Petronas" site:www.thestar.com.my
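
As a minimal sketch of how that search phrase can be assembled and URL-encoded, the snippet below uses `quote_plus` from the standard library's `urllib.parse`. The helper name `build_query` is an assumption made for illustration; the notebook itself builds the URL by string concatenation in the "Web Scraping in Action" section.

```python
from urllib.parse import quote_plus

def build_query(keyword, site="www.thestar.com.my"):
    """Return the search phrase: the keyword in double quotes plus a site: filter."""
    return f'"{keyword}" site:{site}'

query = build_query("Petronas")
print(query)              # "Petronas" site:www.thestar.com.my
print(quote_plus(query))  # %22Petronas%22+site%3Awww.thestar.com.my
```

The double quotes ask for an exact phrase match, which is why they appear as `%22` in the encoded form.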

### Pages

Specify how many pages of Google News results to scrape. One page of Google News results generally contains 10 articles.

In [None]:
# KEYWORD to search in Google News
keyword = 'Harimau Malaya'

# PAGES to download from Google News
pages = 30
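
To make the relationship between pages and articles concrete, here is a small sketch. It assumes 10 results per page, as stated above, and treats each page as a zero-based result offset, which matches how the search URL is built later in this notebook. `RESULTS_PER_PAGE` and `page_offsets` are hypothetical names, not part of the tool.

```python
RESULTS_PER_PAGE = 10  # assumption taken from the text: about 10 articles per page

def page_offsets(pages):
    """Return the result offset of each page: 0, 10, 20, ..."""
    return [page * RESULTS_PER_PAGE for page in range(pages)]

print(page_offsets(3))        # [0, 10, 20]
print(30 * RESULTS_PER_PAGE)  # 300, the upper bound on articles when pages = 30
```

The actual number of saved articles is usually lower, because results whose content does not contain the keyword are discarded.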

### Storing the results

Results will be saved in all_articles, a list in which each item is itself a list holding one article.

all_articles = [[date, title, content, link],[date, ..., ..., ...],....,....]
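
The sketch below shows how this list-of-lists layout can be read back. The two rows are placeholder values written for illustration only; they are not real scraped articles.

```python
# Placeholder rows in the [date, title, content, link] layout (not real data).
all_articles = [
    ["2018-10-09", "Example title A", "Example content A", "https://www.thestar.com.my/example-a"],
    ["2018-10-10", "Example title B", "Example content B", "https://www.thestar.com.my/example-b"],
]

# Each inner list unpacks into its four fields, in this order.
for date, title, content, link in all_articles:
    print(date, title)

# Fields can also be read by position: index 1 is the title.
titles = [article[1] for article in all_articles]
print(titles)  # ['Example title A', 'Example title B']
```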

### Adjust keyword casing

Convert the keyword to lower case, so that casing is standardized for text matching and analysis.

In [None]:
# All articles (date, title, content, link) will be saved in a list of lists
all_articles = []

# turn keyword into lower case
keyword = keyword.lower()
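
The reason for lower-casing becomes clear in a quick, self-contained check. This is only a sketch: `contains_keyword` is a hypothetical helper and the sample sentences are made up.

```python
keyword = "Harimau Malaya".lower()  # 'harimau malaya'

def contains_keyword(text, keyword):
    """Case-insensitive substring check; keyword is expected to be lower case already."""
    return keyword in text.lower()

print(contains_keyword("HARIMAU MALAYA won the match", keyword))  # True
print(contains_keyword("A story about something else", keyword))  # False
```

Without lower-casing both sides, "HARIMAU MALAYA" and "Harimau Malaya" would be treated as different strings.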

## Import modules needed for Web Scraping

requests - used for downloading HTML.

BeautifulSoup - used for parsing HTML.

Regular Expression (re) - used for searching and matching.

time - its time.sleep() function slows down the scraping (hopefully reducing the burden on the target website's server).

random - used to make the sleep time a random number of seconds.

datetime - used to reformat the dates downloaded from articles.

unicodedata - used to clean up some Unicode characters in the article content.

In [None]:
import requests
from bs4 import BeautifulSoup
import re
import time
import random
import datetime
import unicodedata  # to clean unicode eg. \xa0
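
As a small illustration of the unicodedata clean-up mentioned above, the sketch below normalizes a string containing `\xa0` (a non-breaking space). The sample string is made up for the example.

```python
import unicodedata

raw = "Kuala\xa0Lumpur"  # \xa0 is a non-breaking space
clean = unicodedata.normalize("NFKD", raw)

print(repr(raw))    # 'Kuala\xa0Lumpur'
print(repr(clean))  # 'Kuala Lumpur'
```

Note that NFKD also decomposes other compatibility characters and splits accented letters into a base letter plus a combining mark, so it changes more than just `\xa0`.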

## Scrape Individual Article

Define scrapTheStar(link), a function that scrapes a single article page from The Star Online.

The date, title, content and link of the article will be added to all_articles.

To scrape news articles from another site, this function will need to be rewritten, because different websites have different HTML structures and store their data in different places.

In [None]:
# define function to scrape one The Star Online article (date, title, content, link)
def scrapTheStar(link):
    page_response = requests.get(link, timeout=5)
    page_content = BeautifulSoup(page_response.content, "html.parser")
    
    # scrape date
    date = page_content.select('p.date')
    date = date[0].text.strip()     # Tuesday, 9 Oct 2018
    date = re.findall(r'\w+', date)
    date = ' '.join(date[1:4])      # 9 Oct 2018
    date = datetime.datetime.strptime(date,'%d %b %Y').strftime('%Y-%m-%d')  # 2018-10-09
    
    # scrape title
    titles = page_content.select('h1')
    title = titles[0].text.strip()

    # scrape content
    nodes = page_content.select('div.story p')
    content = ''.join(node.text for node in nodes)
    content = unicodedata.normalize("NFKD", content)

    # scrape content (alternative method, if the above method failed)
    if len(content) < 10:
        try:
            # use a separate variable so content stays a string if the lookup fails
            stories = page_content.select('div.story')
            content = stories[1].text
        except IndexError:
            print('unable to download content')
            
    # gathering information into a list
    date_title_content = [date, title, content, link]
    
    # Note: Google results may include articles whose content lacks the keyword (it only appears in a related news title).
    # Only append articles that have the keyword in their content
    if keyword in date_title_content[2].lower():  
        all_articles.append(date_title_content)
    else: 
        print('keyword not in content')
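
The date handling inside scrapTheStar can be tried on its own, without downloading anything. The sketch below repeats the same steps in a hypothetical helper named `reformat_date`; it assumes English month abbreviations, since `%b` depends on the active locale.

```python
import datetime
import re

def reformat_date(raw_date):
    """Convert a date such as 'Tuesday, 9 Oct 2018' to '2018-10-09'."""
    words = re.findall(r'\w+', raw_date)   # ['Tuesday', '9', 'Oct', '2018']
    day_month_year = ' '.join(words[1:4])  # '9 Oct 2018'
    parsed = datetime.datetime.strptime(day_month_year, '%d %b %Y')
    return parsed.strftime('%Y-%m-%d')

print(reformat_date('Tuesday, 9 Oct 2018'))  # 2018-10-09
```

Storing dates as year-month-day strings has the side benefit that sorting them as text also sorts them chronologically.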
    

## Scrape URLs from Google News

Define scrapit(googleNewsUrl), a function that scrapes news article URLs from a page of Google News results.

Running this function calls scrapTheStar(link) for each article URL found.

In [None]:
# define function to scrape one Google News results page and loop through its links to get The Star Online URLs.
def scrapit(googleNewsUrl):
    res = requests.get(googleNewsUrl, timeout=5)
    soup = BeautifulSoup(res.content, "html.parser")
    
    links = soup.select('h3 a')
    # compile once: match https://www.thestar.com.my.../&sa (dots escaped so they match literally)
    urlRegex = re.compile(r'https://www\.thestar\.com\.my.*/&sa')

    for link in links:
        href = link.get('href') or ''
        matches = urlRegex.findall(href)  # findall returns a list of matches
        if not matches:
            continue  # skip results that do not point to The Star Online
        link = matches[0][:-4]   # [:-4] strips the trailing '/&sa', which is only needed for the match
        #print(link)
        scrapTheStar(link)
        
        # make random sleep to slow down the scraping
        r = random.randint(1, 5)
        #print('sleep', r, 'seconds')
        time.sleep(r)
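
The URL extraction step can also be checked in isolation. The sketch below mirrors the regular expression used in scrapit (with the dots escaped) in a hypothetical helper, `extract_star_url`, that returns None when nothing matches. The sample href is made up and only imitates the shape of link the regular expression expects.

```python
import re

URL_REGEX = re.compile(r'https://www\.thestar\.com\.my.*/&sa')

def extract_star_url(href):
    """Return the article URL embedded in a Google result href, or None if absent."""
    match = URL_REGEX.search(href)
    if match is None:
        return None
    return match.group(0)[:-4]  # drop the trailing '/&sa'

# Hypothetical href, shaped like the redirect links the pattern targets.
sample = '/url?q=https://www.thestar.com.my/news/nation/2018/10/09/example-article/&sa=U&ved=abc'
print(extract_star_url(sample))
# https://www.thestar.com.my/news/nation/2018/10/09/example-article
print(extract_star_url('/url?q=https://example.com/&sa=U'))  # None
```

Returning None for non-matching links lets the caller skip them instead of failing with an IndexError.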
        

## Web Scraping in Action

Insert the keyword into googleNewsUrl, loop through the specified number of pages, and call scrapit(googleNewsUrl), which in turn calls scrapTheStar(link).

Optional: uncomment the print statements to show the Google News URL, the individual article URLs and the sleep time in seconds, for checking.

In [None]:
for page in range(pages):
    keyword_in_link = '+'.join(keyword.split())  # add + between keyword
    googleNewsUrl = 'https://www.google.com/search?q=%22' + keyword_in_link +'%22+site:www.thestar.com.my&hl=en&tbm=nws&ei=3u2wW7rtLMmKvQTTo77QCw&start=' + str(page) + '0&sa=N'
    #print(googleNewsUrl)
    scrapit(googleNewsUrl)

## Save all_articles as JSON

After saving a copy to disk, we can use it for text analysis later.

In [None]:
import json

# write all_articles to news.json; the with block closes the file automatically
with open('news.json', 'w') as jsonFile:
    json.dump(all_articles, jsonFile)
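
To confirm that the list-of-lists layout survives the trip through JSON, here is a minimal round-trip sketch. It writes to a temporary directory rather than to news.json, and the single row is placeholder data, not a real article.

```python
import json
import os
import tempfile

# Placeholder row in the [date, title, content, link] layout (not real data).
articles = [["2018-10-09", "Example title", "Example content", "https://www.thestar.com.my/example"]]

path = os.path.join(tempfile.mkdtemp(), "news.json")  # temporary location for the demo
with open(path, "w") as f:
    json.dump(articles, f)

with open(path) as f:
    loaded = json.load(f)

print(loaded == articles)  # True
```

Lists and strings map directly onto JSON arrays and strings, so no conversion is needed when the file is loaded again.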

## Result of the Web Scraping tool

Open news.json to check the web scraping result.

Example: print all titles from the result.

Results are stored in: all_articles = [[date, title, content, link],[date, ..., ..., ...],....,....]

In [19]:
# Open JSON file
with open('news.json') as f:
    all_articles = json.load(f)

# Print out all titles, with index in front
for index, article in enumerate(all_articles, start=1):
    print(str(index) + ') ' + article[1])
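
Because dates are stored as year-month-day strings, the same listing can be produced in chronological order with a plain sort. The sketch below uses placeholder rows in place of the loaded news.json, so the titles and links are not real.

```python
# Placeholder rows in the [date, title, content, link] layout (not real data).
all_articles = [
    ["2018-10-10", "Example title B", "content B", "link B"],
    ["2018-09-30", "Example title A", "content A", "link A"],
    ["2018-11-01", "Example title C", "content C", "link C"],
]

# ISO-style date strings sort chronologically when compared as text.
sorted_articles = sorted(all_articles, key=lambda article: article[0])

for index, article in enumerate(sorted_articles, start=1):
    print(str(index) + ') ' + article[0] + ' ' + article[1])
```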