# Project Title: Analysis of 11 Indonesian News Websites and Outreach of Singapore Government Press Statements in Indonesian Media

## Problem

The purpose of this project is to find out whether, and to what extent, Singapore Government press statements have been reported in the Indonesian media.  Since I will be scraping data from 11 Indonesian news websites, the data collected can also be used for NLP analysis, for example to identify the kinds of news about Singapore that the Indonesian media report.

## Data Sources 

For a list of Singapore Government news, I will scrape www.gov.sg.  It contains press statements from all Government agencies except the Ministry of Foreign Affairs (MFA) of Singapore. 

For a list of MFA's press statements, I will scrape www.mfa.gov.sg.

For both websites, I will scrape only the search results for press statements that contain the word 'Indonesia', on the assumption that the Indonesian media will only be interested in reporting on press statements that concern Indonesia.

To obtain a list of Indonesian media articles, I will scrape articles that contain the word 'Singapore' or 'Singapura' from the following Indonesian news websites:

1. www.thejakartapost.com (uses Google Custom Search Engine (GCSE) to search its site - the only credible English news site)
2. news.kompas.com (uses GCSE)
3. www.tribunnews.com/ (uses GCSE)
4. www.merdeka.com (uses GCSE)
5. www.inilah.com (uses GCSE)
6. www.metrotvnews.com (uses GCSE)
7. news.detik.com (uses its own search engine)
8. http://www.antaranews.com/ (uses its own search engine)
9. http://www.liputan6.com/ (uses its own search engine)
10. https://www.sindonews.com/ (uses its own search engine)
11. www.suara.com (uses its own search engine)

There are more Indonesian news websites than these.  I am, however, limiting my analysis to the 11 sites above since they are known to have reported on news concerning Singapore before.


<b>Limits imposed to scraping:</b> 

I will scrape only the top 100 search results from each news site, using 'Singapore' or 'Singapura' as the keyword.  There are two reasons for this limit.  First, GCSE returns only the top 100 results, since the news websites above that use GCSE are all on its 'free' version.  Second, scraping an entire news website would take a long time, and Jupyter Notebook may crash if the amount of data is too large.  
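
The 100-result limit can be sketched as a paging loop. This is a minimal illustration in plain Python with no third-party packages, not the notebook's actual scraping code: `fetch_page` and `fake_fetch` are hypothetical stand-ins for a real API call, and the sketch assumes the search API counts results from 1, so the second page of ten starts at 11.

```python
def page_starts(total=100, page_size=10):
    """Return the 1-based start index for each page of results."""
    return list(range(1, total + 1, page_size))


def fetch_all(fetch_page, total=100, page_size=10):
    """Collect results page by page, stopping early if a page comes back empty.

    fetch_page is a hypothetical callable that takes a 1-based start index
    and returns a list of result items.
    """
    results = []
    for start in page_starts(total, page_size):
        items = fetch_page(start)
        if not items:
            break
        results.extend(items)
    return results


# Stand-in for the real API: pretend the site has only 35 matching articles.
def fake_fetch(start, available=35, page_size=10):
    last = min(start + page_size - 1, available)
    return ["result-%d" % i for i in range(start, last + 1)]


starts = page_starts()
collected = fetch_all(fake_fetch)
```

Stopping on the first empty page avoids spending API calls on sites that have fewer than 100 matching articles.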

Since a quick scan of the search results reveals that the top 100 search results are dated between January 2016 and July 2017, I have also limited the scraping of Singapore Government press statements to those dated within that period.
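
Restricting the press statements to that period amounts to a simple date filter. The sketch below is illustrative only: the exact boundary days (1 January 2016 and 31 July 2017) are my assumption, since only the months are stated, and the sample statements are made up.

```python
from datetime import date

# Assumed boundaries; the text above only gives the months.
WINDOW_START = date(2016, 1, 1)
WINDOW_END = date(2017, 7, 31)


def in_window(published, start=WINDOW_START, end=WINDOW_END):
    """Return True if the publication date falls inside the window (inclusive)."""
    return start <= published <= end


# Made-up (title, date) pairs standing in for scraped press statements.
statements = [
    ("statement A", date(2015, 10, 8)),
    ("statement B", date(2016, 1, 1)),
    ("statement C", date(2017, 7, 24)),
    ("statement D", date(2017, 8, 2)),
]

kept = [title for title, published in statements if in_window(published)]
```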

## Hypothesis 

Null Hypothesis: None of the Singapore Government Press Statements have been reported in the Indonesian Media

Alternate Hypothesis: Singapore Government Press Statements have been reported in the Indonesian Media before

Note: Even if the null hypothesis cannot be rejected, the outcome of this project will not go to waste.  The data collected from the 11 Indonesian news websites can still be used for NLP analysis.

## Steps

Step 1: Scrape the data from the Indonesian news websites

Step 2: Clean and parse the data, keeping only the information that I need. 

Step 3: Scrape data from the Singapore Government website and Ministry of Foreign Affairs website

Step 4: Use Google Translate API to translate the text of the Indonesian articles

Step 5: Use the similarities method in gensim library to determine whether a Singapore Government press statement matches an Indonesian article
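
To make Step 5 concrete, here is a minimal sketch of the matching idea using a bag-of-words cosine similarity written in plain Python. It is a stand-in for illustration, not gensim's API: the function names `tokenize`, `cosine_similarity` and `best_match` are hypothetical, the sample texts are made up, and a real run would work on the translated article bodies.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(statement, articles):
    """Return (index, score) of the most similar article (non-empty list assumed)."""
    scores = [cosine_similarity(statement, article) for article in articles]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Made-up texts, for illustration only.
statement = "Singapore and Indonesia sign agreement on tourism cooperation"
articles = [
    "Jakarta traffic worsens during the rainy season",
    "Indonesia and Singapore sign tourism cooperation agreement",
    "Football league results for the weekend",
]

index, score = best_match(statement, articles)
```

A press statement would count as "reported" only when the best score clears a threshold, and that threshold would need to be chosen by inspecting real matches.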

# Scraping Data 

## A) Scraping the Indonesian News Websites - those that use Google Custom Search Engines (GCSE)

GCSE IDs of the different Sites:

Jakarta Post - '007685728690098461931:2lpamdk7yne' 

Kompas - '018212539862037696382:-xa61bkyvao'

Tribun News - 'partner-pub-7486139053367666:4965051114'

Merdeka - '001561947424278099921:7qnaw_9r2rq'

Inilah - '009693617939879592379:wxvqc64bh3s'

MetroTVnews - '008086164163598071346:djhxpzyiqw4'

In [1]:
from googleapiclient.discovery import build
import pandas as pd
import json
from bs4 import BeautifulSoup
import urllib2

For a description of how the GCSE API works, please refer to: https://google-api-client-libraries.appspot.com/documentation/customsearch/v1/python/latest/customsearch_v1.cse.html

In [3]:
AppKeyID = 'AIzaSyB_cCz7n__XGOLiq3p3wxMk-JwHqQqKdEA'

In [4]:
#Scraping data from the news websites that use Google Custom Search Engine (GCSE) and publish in Bahasa Indonesia.
#The search is sorted by relevance
service = build("customsearch", 'v1', developerKey=AppKeyID)

result_pandas = pd.DataFrame()
file_names = []
file_names = ['kompas.csv', 'tribunnews.csv', 'merdeka.csv', 'inilah.csv', 'metrotvnews.csv']
indo_news_sites = [
'018212539862037696382:-xa61bkyvao',
'partner-pub-7486139053367666:4965051114',
'001561947424278099921:7qnaw_9r2rq',
'009693617939879592379:wxvqc64bh3s',
'008086164163598071346:djhxpzyiqw4']

for m, indo_news_site in enumerate(indo_news_sites):
    print 'News Site: ' + indo_news_site
    
    #The GCSE API returns at most 10 results per call.  
    #Therefore, I will need to loop 10 times to get 100 search results
    for count in range(10):        
            
        print 'Count: ' + str(count)
        
        if (count == 0):
            search_result = service.cse().list(q='Singapura', cx=indo_news_site, start=None).execute()
            column_names = list(pd.read_json(json.dumps(search_result['items'])).columns.values)
            #column_names = list(pd.read_json(json.dumps(search_result['items'][0]['pagemap']['metatags'])).columns.values)        
            result_pandas = pd.DataFrame(columns=(column_names))
            result_pandas = result_pandas.append(pd.read_json(json.dumps(search_result['items'])))
    
        else:
            #start is the 1-based index of the first result, so the second page begins at 11
            search_result = service.cse().list(q='Singapura', cx=indo_news_site, start=count*10 + 1).execute()
            if 'items' in search_result:
                result_pandas = result_pandas.append(pd.read_json(json.dumps(search_result['items'])))

                    
    print ('Results: ', len(result_pandas))
    
    file_name = file_names[m]
    result_pandas.to_csv(file_name, encoding='utf-8')

News Site: 018212539862037696382:-xa61bkyvao
Count: 0
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
Count: 6
Count: 7
Count: 8
Count: 9
('Results: ', 100)
News Site: partner-pub-7486139053367666:4965051114
Count: 0
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
Count: 6
Count: 7
Count: 8
Count: 9
('Results: ', 100)
News Site: 001561947424278099921:7qnaw_9r2rq
Count: 0
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
Count: 6
Count: 7
Count: 8
Count: 9
('Results: ', 100)
News Site: 009693617939879592379:wxvqc64bh3s
Count: 0
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
Count: 6
Count: 7
Count: 8
Count: 9
('Results: ', 100)
News Site: 008086164163598071346:djhxpzyiqw4
Count: 0
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
Count: 6
Count: 7
Count: 8
Count: 9
('Results: ', 100)


In [5]:
#removing variables that I do not need anymore from the system
del file_names
del indo_news_sites
del search_result
del result_pandas

In [6]:
#Scraping data from the news websites that use Google Custom Search Engine (GCSE) and publish in Bahasa Indonesia.
#The search is sorted by date
#I have to pass the argument sort='Date' to list()
service = build("customsearch", 'v1', developerKey=AppKeyID)

result_pandas = pd.DataFrame()
file_names = []
file_names = [
#'kompas2.csv', 
'tribunnews2.csv', 'merdeka2.csv', 'inilah2.csv', 'metrotvnews2.csv']
indo_news_sites = [
#'018212539862037696382:-xa61bkyvao',
'partner-pub-7486139053367666:4965051114',
'001561947424278099921:7qnaw_9r2rq',
'009693617939879592379:wxvqc64bh3s',
'008086164163598071346:djhxpzyiqw4']

for m, indo_news_site in enumerate(indo_news_sites):
    print 'News Site: ' + indo_news_site
    
    #The GCSE API returns at most 10 results per call.  
    #Therefore, I will need to loop 10 times to get 100 search results
    for count in range(10):        
            
        print 'Count: ' + str(count)
        
        if (count == 0):
            search_result = service.cse().list(q='Singapura', cx=indo_news_site, sort='Date', start=None).execute()
            column_names = list(pd.read_json(json.dumps(search_result['items'])).columns.values)
            #column_names = list(pd.read_json(json.dumps(search_result['items'][0]['pagemap']['metatags'])).columns.values)        
            result_pandas = pd.DataFrame(columns=(column_names))
            result_pandas = result_pandas.append(pd.read_json(json.dumps(search_result['items'])))
    
        else:
            #start is the 1-based index of the first result, so the second page begins at 11
            search_result = service.cse().list(q='Singapura', cx=indo_news_site, sort='Date', start=count*10 + 1).execute()
            if 'items' in search_result:
                result_pandas = result_pandas.append(pd.read_json(json.dumps(search_result['items'])))

                    
    print ('Results: ', len(result_pandas))
    
    file_name = file_names[m]
    
    print file_name
    
    result_pandas.to_csv(file_name, encoding='utf-8')

News Site: partner-pub-7486139053367666:4965051114
Count: 0
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
Count: 6
Count: 7
Count: 8
Count: 9
('Results: ', 100)
tribunnews2.csv
News Site: 001561947424278099921:7qnaw_9r2rq
Count: 0
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
Count: 6
Count: 7
Count: 8
Count: 9
('Results: ', 100)
merdeka2.csv
News Site: 009693617939879592379:wxvqc64bh3s
Count: 0
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
Count: 6
Count: 7
Count: 8
Count: 9
('Results: ', 100)
inilah2.csv
News Site: 008086164163598071346:djhxpzyiqw4
Count: 0
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
Count: 6
Count: 7
Count: 8
Count: 9
('Results: ', 100)
metrotvnews2.csv


In [7]:
#Scraping data from The Jakarta Post
#Results are obtained by relevance
result_pandas = pd.DataFrame()
service = build("customsearch", 'v1', developerKey=AppKeyID)

#The GCSE API returns at most 10 results per call.  
#Therefore, I will need to loop 10 times to get 100 search results
for count in range(10):        
            
    print 'Count: ' + str(count)
        
    if (count == 0):
        search_result = service.cse().list(q='Singapore', cx='007685728690098461931:2lpamdk7yne', start=None).execute()
        column_names = list(pd.read_json(json.dumps(search_result['items'])).columns.values)
        #column_names = list(pd.read_json(json.dumps(search_result['items'][0]['pagemap']['metatags'])).columns.values)        
        result_pandas = pd.DataFrame(columns=(column_names))
        result_pandas = result_pandas.append(pd.read_json(json.dumps(search_result['items'])))
    
    else:
        #start is the 1-based index of the first result, so the second page begins at 11
        search_result = service.cse().list(q='Singapore', cx='007685728690098461931:2lpamdk7yne', start=count*10 + 1).execute()
        if 'items' in search_result:
            result_pandas = result_pandas.append(pd.read_json(json.dumps(search_result['items'])))

    
print ('Results: ', len(result_pandas))
result_pandas.to_csv('jakartapost.csv', encoding='utf-8')

Count: 0
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
Count: 6
Count: 7
Count: 8
Count: 9
('Results: ', 100)


In [8]:
#Scraping data from The Jakarta Post
#Results are sorted by date
result_pandas = pd.DataFrame()
service = build("customsearch", 'v1', developerKey=AppKeyID)

#The GCSE API returns at most 10 results per call.  
#Therefore, I will need to loop 10 times to get 100 search results
for count in range(10):        
            
    print 'Count: ' + str(count)
        
    if (count == 0):
        search_result = service.cse().list(q='Singapore', cx='007685728690098461931:2lpamdk7yne', sort='Date', start=None).execute()
        column_names = list(pd.read_json(json.dumps(search_result['items'])).columns.values)
        #column_names = list(pd.read_json(json.dumps(search_result['items'][0]['pagemap']['metatags'])).columns.values)        
        result_pandas = pd.DataFrame(columns=(column_names))
        result_pandas = result_pandas.append(pd.read_json(json.dumps(search_result['items'])))
    
    else:
        #start is the 1-based index of the first result, so the second page begins at 11
        search_result = service.cse().list(q='Singapore', cx='007685728690098461931:2lpamdk7yne', sort='Date', start=count*10 + 1).execute()
        if 'items' in search_result:
            result_pandas = result_pandas.append(pd.read_json(json.dumps(search_result['items'])))

    
print ('Results: ', len(result_pandas))
result_pandas.to_csv('jakartapost2.csv', encoding='utf-8')

Count: 0
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5
Count: 6
Count: 7
Count: 8
Count: 9
('Results: ', 100)


In [9]:
result_pandas

Unnamed: 0,cacheId,displayLink,formattedUrl,htmlFormattedUrl,htmlSnippet,htmlTitle,kind,link,pagemap,snippet,title
0,LSV0VDWOnekJ,www.thejakartapost.com,www.thejakartapost.com/news/national,www.thejakartapost.com/news/national,14 hours ago <b>...</b> National 5 days ago &m...,News - national - The Jakarta Post,customsearch#result,http://www.thejakartapost.com/news/national,{u'metatags': [{u'twitter:creator': u'@jakpost...,14 hours ago ... National 5 days ago · Inmate ...,News - national - The Jakarta Post
1,biXRsvwbqdYJ,www.thejakartapost.com,www.thejakartapost.com/.../jomblang-cave-a-ver...,www.thejakartapost.com/.../jomblang-cave-a-ver...,4 hours ago <b>...</b> “The going up and down ...,"Jomblang Cave, a vertical cave with heaven-sen...",customsearch#result,http://www.thejakartapost.com/travel/2017/07/2...,{u'metatags': [{u'cdb:id': u'4065a5a8898c7cc66...,4 hours ago ... “The going up and down the cav...,"Jomblang Cave, a vertical cave with heaven-sen..."
2,ZyaFsoE7Vc8J,www.thejakartapost.com,www.thejakartapost.com/.../art-stage-jakarta-r...,www.thejakartapost.com/.../art-stage-jakarta-r...,10 hours ago <b>...</b> Art Stage Jakarta and ...,Art Stage Jakarta returns with more participan...,customsearch#result,http://www.thejakartapost.com/life/2017/07/24/...,{u'metatags': [{u'cdb:id': u'4065a5a8898c7cc66...,10 hours ago ... Art Stage Jakarta and Singapo...,Art Stage Jakarta returns with more participan...
3,7Mv1m_UBT6EJ,www.thejakartapost.com,www.thejakartapost.com/,www.thejakartapost.com/,14 hours ago <b>...</b> The Jakarta Post - Alw...,The Jakarta Post - Always Bold. Always Indepen...,customsearch#result,http://www.thejakartapost.com/,"{u'metatags': [{u'srv': u'web2-web', u'viewpor...",14 hours ago ... The Jakarta Post - Always Bol...,The Jakarta Post - Always Bold. Always Indepen...
4,fnU784HguEEJ,www.thejakartapost.com,www.thejakartapost.com/.../lily-yulianti-farid...,www.thejakartapost.com/.../lily-yulianti-farid...,12 hours ago <b>...</b> All of the experience ...,Lily Yulianti Farid: For the love of writing -...,customsearch#result,http://www.thejakartapost.com/life/2017/07/24/...,{u'metatags': [{u'cdb:id': u'4065a5a8898c7cc66...,12 hours ago ... All of the experience of writ...,Lily Yulianti Farid: For the love of writing -...
5,bmGAXif99NwJ,www.thejakartapost.com,www.thejakartapost.com/.../jokowi-orders-polic...,www.thejakartapost.com/.../jokowi-orders-polic...,2 days ago <b>...</b> ... that Indonesia was a...,Jokowi orders police to gun down foreign drug ...,customsearch#result,http://www.thejakartapost.com/news/2017/07/22/...,{u'metatags': [{u'cdb:id': u'4065a5a8898c7cc66...,2 days ago ... ... that Indonesia was a potent...,Jokowi orders police to gun down foreign drug ...
6,N3AV94fBZNsJ,www.thejakartapost.com,www.thejakartapost.com/.../bumi-subsidiary-to-...,www.thejakartapost.com/.../bumi-subsidiary-to-...,"2 days ago <b>...</b> Then, BRMS&#39; debts wo...",Bumi subsidiary to gain big from NFC deal: Ana...,customsearch#result,http://www.thejakartapost.com/news/2017/07/22/...,{u'metatags': [{u'cdb:id': u'4065a5a8898c7cc66...,"2 days ago ... Then, BRMS' debts worth $191 mi...",Bumi subsidiary to gain big from NFC deal: Ana...
7,XQ4xPq0GFWUJ,www.thejakartapost.com,www.thejakartapost.com/.../justin-biebers-indo...,www.thejakartapost.com/.../justin-biebers-indo...,"2 days ago <b>...</b> The Canadian singer, how...",Justin Bieber&#39;s Indonesia concert rumors h...,customsearch#result,http://www.thejakartapost.com/life/2017/07/22/...,{u'metatags': [{u'cdb:id': u'4065a5a8898c7cc66...,"2 days ago ... The Canadian singer, however, i...",Justin Bieber's Indonesia concert rumors heat ...
8,63h4nU0N5xkJ,www.thejakartapost.com,www.thejakartapost.com/seasia,www.thejakartapost.com/seasia,"2 days ago <b>...</b> Vital to keep Malacca, S...",Southeast Asia - The Jakarta Post,customsearch#result,http://www.thejakartapost.com/seasia,{u'metatags': [{u'twitter:creator': u'@jakpost...,"2 days ago ... Vital to keep Malacca, S'pore s...",Southeast Asia - The Jakarta Post
9,Hu1DnE8Q5s8J,www.thejakartapost.com,www.thejakartapost.com/.../graft-buster-novel-...,www.thejakartapost.com/.../graft-buster-novel-...,3 days ago <b>...</b> He has since been underg...,Graft buster Novel Baswedan may have major sur...,customsearch#result,http://www.thejakartapost.com/news/2017/07/20/...,{u'metatags': [{u'cdb:id': u'4065a5a8898c7cc66...,3 days ago ... He has since been undergoing tr...,Graft buster Novel Baswedan may have major sur...


I have saved all the data files in a data folder.  

The files that hold search results sorted by date have the suffix 2 at the end of the file name.  For example, inilah.csv holds the search results sorted by relevance, while inilah2.csv holds the search results sorted by date.

In [13]:
#Combine the search results for those found by relevance and those found by Date
#Thereafter, remove duplicates
saved_files = ['inilah', 'jakartapost', 'kompas', 'merdeka', 'metrotvnews', 'tribunnews']

In [32]:
for saved_file in saved_files:
    print saved_file
    #open files which contains results arranged by relevance
    temp_saved_file = pd.read_csv('data/' + saved_file + '.csv', index_col=0)
    #append results that were arranged by date
    temp_saved_file = temp_saved_file.append(pd.read_csv('data/' + saved_file + '2.csv', index_col=0))
    #drop duplicates
    temp_saved_file = temp_saved_file.drop_duplicates(subset='formattedUrl')
    #keep relevant columns only
    if ('title' in temp_saved_file.columns) and ('link' in temp_saved_file.columns) and ('snippet' in temp_saved_file.columns):
        print 'yes'
        temp_saved_file = temp_saved_file[['title','link','snippet']]
    #save file
    temp_saved_file.to_csv('data/' + saved_file + '_final.csv', encoding='utf-8')

inilah
yes
jakartapost
yes
kompas
yes
merdeka
yes
metrotvnews
yes
tribunnews
yes
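
One caveat about the step above: the `formattedUrl` values shown in the earlier output appear abbreviated (they contain `/.../`), so two different articles could in principle share the same value and be dropped as duplicates. The sketch below illustrates deduplicating on the full `link` instead. It uses plain Python dictionaries rather than pandas, the helper name `dedupe_by` is hypothetical, and the rows are made up.

```python
def dedupe_by(key, *result_lists):
    """Merge several result lists, keeping the first row seen for each key value."""
    seen = set()
    merged = []
    for results in result_lists:
        for row in results:
            if row[key] not in seen:
                seen.add(row[key])
                merged.append(row)
    return merged


# Made-up rows: two distinct articles share one abbreviated formattedUrl.
by_relevance = [
    {"link": "http://example.com/news/2017/07/24/story-one.html",
     "formattedUrl": "example.com/.../story-..."},
    {"link": "http://example.com/news/2017/07/24/story-two.html",
     "formattedUrl": "example.com/.../story-..."},
]
by_date = [
    {"link": "http://example.com/news/2017/07/24/story-two.html",
     "formattedUrl": "example.com/.../story-..."},
    {"link": "http://example.com/life/2017/07/22/other.html",
     "formattedUrl": "example.com/.../other..."},
]

merged_on_link = dedupe_by("link", by_relevance, by_date)
merged_on_formatted = dedupe_by("formattedUrl", by_relevance, by_date)
```

In this made-up data, deduplicating on `link` keeps three articles while deduplicating on `formattedUrl` keeps only two.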


## A2) Getting the body from the links

In [8]:
def getSoupLink(url):
    import ssl
    #Skip SSL certificate verification so that sites with invalid certificates can still be opened
    context = ssl._create_unverified_context()
    #Send a browser-like User-Agent header with the request
    req = urllib2.Request(url, headers={'User-Agent' : "Chrome"}) 
    try:
        con = urllib2.urlopen(req, context=context)
        page = con.read()
        soup = BeautifulSoup(page,"lxml")
        con.close()
        return soup
    
    except urllib2.HTTPError as err:
        if err.code == 404:
            return err.code
        else:
            raise

In [223]:
#Scraping the body of kompas sites

temp_saved_file = pd.read_csv('data/' + 'kompas' + '_final.csv', index_col=0).reset_index(drop=True)

temp_saved_file.loc[:,'body'] = '' #initialising body column
temp_saved_file.loc[:,'date'] = '' #initialising date column

In [224]:
for m, url in enumerate(temp_saved_file.link):
    
    soup = getSoupLink(temp_saved_file.link[m])
    print 'Index: ' + str(m)
        
    if (soup != 404):
        
        if not (soup.find("div", attrs={"class":"read__content"}) is None):
            paras = soup.find("div", attrs={"class":"read__content"}).find_all("p")
            content = " ".join([para.getText() for para in paras])
            
            temp_saved_file.loc[m, 'body'] = content
            
            date = soup.find('div', attrs={'class':'read__time'}).getText()
            date = date.split()[2]
    
            date = date.replace(',','')
            print date
            temp_saved_file.loc[m, 'date'] = date
        
    else:
        print 'URL cannot be opened'

        
temp_saved_file.to_csv('data/' + 'kompas' + '_final2.csv', encoding='utf-8')

del m
del url
del date
del content


Index: 0
24/07/2017
Index: 1
24/07/2017
Index: 2
24/07/2017
Index: 3
01/11/2016
Index: 4
18/07/2017
Index: 5
18/05/2017
Index: 6
20/06/2017
Index: 7
31/08/2016
Index: 8
14/06/2017
Index: 9
08/12/2016
Index: 10
15/07/2017
Index: 11
19/07/2017
Index: 12
17/09/2016
Index: 13
15/09/2016
Index: 14
14/07/2017
Index: 15
27/07/2016
Index: 16
19/11/2016
Index: 17
13/07/2017
Index: 18
02/09/2016
Index: 19
12/07/2017
Index: 20
15/09/2016
Index: 21
13/06/2017
Index: 22
12/04/2017
Index: 23
02/09/2016
Index: 24
08/06/2017
Index: 25
05/12/2016
Index: 26
15/01/2017
Index: 27
29/06/2017
Index: 28
12/06/2017
Index: 29
25/05/2016
Index: 30
06/08/2016
Index: 31
29/05/2017
Index: 32
06/06/2017
Index: 33
14/06/2017
Index: 34
22/07/2016
Index: 35
06/03/2017
Index: 36
07/06/2017
Index: 37
21/07/2016
Index: 38
27/04/2017
Index: 39
16/09/2016
Index: 40
05/04/2017
Index: 41
13/06/2017
Index: 42
16/05/2017
Index: 43
05/05/2017
Index: 44
04/07/2017
Index: 45
06/04/2017
Index: 46
04/05/2017
Index: 47
28/04/2017

In [228]:
#jakartapost

temp_saved_file = pd.read_csv('data/' + 'jakartapost' + '_final.csv', index_col=0).reset_index(drop=True)
temp_saved_file.loc[:,'body'] = '' #initialising body column
temp_saved_file.loc[:,'date'] = '' #initialising date column

for m, url in enumerate(temp_saved_file.link):

    soup = getSoupLink(url)
    print 'Index: ' + str(m)
        
    if (soup != 404):
        
        if not (soup.find("div", attrs={"class":"show-define-text"}) is None):
            paras = soup.find("div", attrs={"class":"show-define-text"}).find_all("p")
            content = " ".join([para.getText() for para in paras])
            temp_saved_file.loc[m, 'body'] = content
        
            dates = soup.find_all("span", attrs={"class":"day"})
        
            for date1 in dates:
                if (date1.getText().count(",") == 2):
                    day,date,year = date1.getText().split(",")
                    month,date = date.split()
                    date = date + " " + month + " " + year
                    print date
                    temp_saved_file.loc[m, 'date'] = date
    
    else:
        print 'URL cannot be opened'
        
temp_saved_file.to_csv('data/' + 'jakartapost' + '_final2.csv', encoding='utf-8')

Index: 0
11 July  2017
Index: 1
20 April  2017
Index: 2
12 July  2017
Index: 3
16 July  2017
Index: 4
11 July  2017
Index: 5
20 March  2017
Index: 6
12 April  2017
Index: 7
17 March  2017
Index: 8
8 October  2015
Index: 9
3 January  2017
Index: 10
15 March  2017
Index: 11
16 June  2016
Index: 12
31 January  2017
Index: 13
6 June  2017
Index: 14
30 June  2017
Index: 15
8 May  2017
Index: 16
13 July  2017
Index: 17
15 December  2016
Index: 18
10 May  2017
Index: 19
16 May  2017
Index: 20
14 November  2016
Index: 21
25 April  2017
Index: 22
19 May  2017
Index: 23
27 April  2017
Index: 24
16 March  2017
Index: 25
16 July  2017
Index: 26
8 July  2017
Index: 27
25 May  2017
Index: 28
9 February  2017
Index: 29
6 March  2017
Index: 30
7 May  2017
Index: 31
15 June  2017
Index: 32
27 August  2016
Index: 33
20 May  2017
Index: 34
1 March  2017
Index: 35
18 May  2017
Index: 36
30 May  2016
Index: 37
17 March  2017
Index: 38
4 May  2017
Index: 39
4 July  2017
Index: 40
14 March  2017
Index: 41
25

In [229]:
#removing variables I do not need anymore

del m
del url
del date
del content

In [299]:
#inilah

temp_saved_file = pd.read_csv('data/' + 'inilah' + '_final.csv', index_col=0).reset_index(drop=True)
        
temp_saved_file.loc[:,'body'] = '' #initialising body column
temp_saved_file.loc[:,'date'] = '' #initialising date column

In [300]:
for m, url in enumerate(temp_saved_file.link):
    
    soup = getSoupLink(url)
    print 'Index: ' + str(m)

    if (soup != 404):
        
        if not (soup.find("div", attrs={"style":"width:400px;float:left"}) is None):
            paras = soup.find("div", attrs={"style":"width:400px;float:left"}).find_all("p")
            content = " ".join([para.getText() for para in paras])
         
            temp_saved_file.loc[m, 'body'] = content
        
            date = soup.find("div", attrs={"class":"w-cd"}).find('h6').getText()
        
            date = date.split('|')[1]
            date = date.split(',')[1]
        
            print date

            temp_saved_file.loc[m, 'date'] = date
            
        elif not (soup.find("div", attrs={"id":"isi"}) is None):
            paras_temps = soup.find("div", attrs={"id":"isi"})
            
            while (paras_temps.div is not None):
                paras_temps.div.replace_with('')
    
            content = paras_temps.getText()
            content = content.replace('\r', '').replace('\n', '')
    
            #print content
            temp_saved_file.loc[m, 'body'] = content
        
            date = soup.find("div", attrs={"id":"date"}).getText()
            #Selasa, 28 Mei 2013 | 09:15 WIB
            date = date.split('|')[0]
            date = date.split(',')[1]
        
            print date

            temp_saved_file.loc[m, 'date'] = date

    else:
        print 'URL cannot be opened'
        
temp_saved_file.to_csv('data/' + 'inilah' + '_final2.csv', encoding='utf-8')


Index: 0
 8 Juni 2017 
Index: 1
 9 April 2017 
Index: 2
 25 November 2016 
Index: 3
 30 Maret 2017 
Index: 4
 8 Juni 2017 
Index: 5
 14 Juni 2017 
Index: 6
 5 Juli 2017 
Index: 7
 5 Mei 2017 
Index: 8
 25 November 2016 
Index: 9
 30 Mei 2017 
Index: 10
 24 November 2016 
Index: 11
 14 Juli 2017 
Index: 12
 19 Mei 2017 
Index: 13
 3 Juni 2017 
Index: 14
 16 April 2017 
Index: 15
 31 Maret 2017 
Index: 16
 15 Mei 2017 
Index: 17
 30 Mei 2017 
Index: 18
 23 Mei 2017 
Index: 19
 21 Desember 2016 
Index: 20
 12 April 2017 
Index: 21
 31 Maret 2017 
Index: 22
 29 Mei 2017 
Index: 23
 31 Maret 2017 
Index: 24
 24 Maret 2017 
Index: 25
 8 Mei 2017 
Index: 26
 29 Maret 2017 
Index: 27
 8 Maret 2017 
Index: 28
 26 April 2017 
Index: 29
 21 April 2017 
Index: 30
 11 April 2017 
Index: 31
 29 Mei 2017 
Index: 32
 15 Mei 2017 
Index: 33
 15 Juni 2017 
Index: 34
 17 Mei 2017 
Index: 35
 7 Maret 2017 
Index: 36
 18 Mei 2017 
Index: 37
 18 Juli 2017 
Index: 38
 14 April 2017 
Index: 39
 28 Maret 2017 

In [5]:
# merdeka

temp_saved_file = pd.read_csv('prelim_data/' + 'merdeka' + '_final.csv', index_col=0).reset_index(drop=True)

#temp_saved_file.loc[:,'body'] = '' #initialising body column
#temp_saved_file.loc[:,'date'] = '' #initialising date column

In [282]:
for m, url in enumerate(temp_saved_file.link):
    soup = getSoupLink(url)
    print 'Index: ' + str(m)
        
    if (soup != 404):
        
        if not (soup.find("div", attrs={"class":"mdk-body-paragpraph"}) is None):
            paras = soup.find("div", attrs={"class":"mdk-body-paragpraph"}).find_all("p")
            content = " ".join([para.getText() for para in paras])
            temp_saved_file.loc[m, 'body'] = content
        
            date = soup.find('span', attrs={'class':'date-post'}).getText()
            #Rabu, 30 November 2016 12:56
            date = date.split(',')[1]
            date = date.split()
            date1 = ''
            date1 = date[0] + ' ' + date[1] + ' ' + date[2]  
            #del date
        
            print date1
            temp_saved_file.loc[m, 'date'] = date1

    else:
        print 'URL cannot be opened'

temp_saved_file.to_csv('data/' + 'merdeka' + '_final2.csv', encoding='utf-8')


Index: 0
30 November 2016
Index: 1
11 Agustus 2016
Index: 2
25 Maret 2017
Index: 3
9 Agustus 2015
Index: 4
24 Maret 2015
Index: 5
16 Januari 2017
Index: 6
26 Juli 2016
Index: 7
3 November 2016
Index: 8
Index: 9
6 Agustus 2016
Index: 10
16 Juli 2017
Index: 11
6 Oktober 2016
Index: 12
21 Maret 2017
Index: 13
31 Maret 2017
Index: 14
13 April 2017
Index: 15
Index: 16
19 Agustus 2016
Index: 17
15 September 2016
Index: 18
10 Agustus 2016
Index: 19
3 April 2017
Index: 20
26 April 2016
Index: 21
13 Februari 2014
Index: 22
18 Mei 2017
Index: 23
2 November 2016
Index: 24
30 Mei 2012
Index: 25
29 Juni 2017
Index: 26
18 Mei 2016
Index: 27
7 Oktober 2016
Index: 28
19 Agustus 2016
Index: 29
30 Maret 2017
Index: 30
20 Agustus 2016
Index: 31
27 Maret 2017
Index: 32
18 Januari 2017
Index: 33
19 Januari 2016
Index: 34
13 Mei 2017
Index: 35
31 Juli 2015
Index: 36
12 November 2015
Index: 37
11 Februari 2014
Index: 38
5 Juli 2017
Index: 39
16 September 2016
Index: 40
22 Agustus 2016
Index: 41
4 Juli 2017

In [264]:
#removing variables I do not need anymore

del m
del url
del date
del content

In [5]:
def isint(x):
    #Return True if x can be converted to an integer, False otherwise
    try:
        int(x)
    
    except ValueError:
        return False
    
    else:
        return True
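
The `isint` helper is used in the cells that follow to find the first integer token in a byline and then read the next two tokens as month and year. As an alternative, a regular expression can pick out the "day month year" group in one step. This is a minimal sketch: the name `extract_date` is hypothetical, and the pattern assumes a one- or two-digit day, an alphabetic month name and a four-digit year.

```python
import re

# day (1-2 digits), month name (letters only), year (4 digits)
DATE_PATTERN = re.compile(r"\b(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\b")


def extract_date(text):
    """Return 'day month year' from a byline, or '' if no date is found."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return ''
    return ' '.join(match.groups())


byline_date = extract_date("Krisna Octavianus  -  24 Juli 2017 14:48 WIB")
```

Unlike indexing `date[g+1]` and `date[g+2]`, this returns an empty string instead of raising an error when the byline ends right after the day number.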

In [271]:
# metrotvnews

temp_saved_file = pd.read_csv('data/' + 'metrotvnews' + '_final.csv', index_col=0).reset_index(drop=True)

temp_saved_file.loc[:,'body'] = '' #initialising body column
temp_saved_file.loc[:,'date'] = '' #initialising date column

for m, url in enumerate(temp_saved_file.link):

    soup = getSoupLink(url)
    print 'Index: ' + str(m)
        
    if (soup != 404):
        
        if not (soup.find("div", attrs={"class":"tru"}) is None):
            
            paras_temps = soup.find("div", attrs={"class":"tru"})
    
            while (paras_temps.div is not None):
                paras_temps.div.replace_with('')
    
            content = paras_temps.getText()
            content = content.replace('\r', '').replace('\n', '')
    
            #print content
            temp_saved_file.loc[m, 'body'] = content
    
            date = soup.find('div', attrs={'class':'reg'}).getText()
            #Krisna Octavianus &nbsp;&nbsp; &bull; &nbsp;&nbsp; 24 Juli 2017 14:48 WIB
            date = date.split()
            
            date1 = ''
            
            for g, dat1 in enumerate(date):
                
                if (isint(date[g]) == True):
                    date1 = str(date[g]) + ' ' + date[g+1] + ' ' + date[g+2]
                    break
        
            print date1
    
            temp_saved_file.loc[m, 'date'] = date1

    else:
        print 'URL cannot be opened'
        
temp_saved_file.to_csv('data/' + 'metrotvnews' + '_final2.csv', encoding='utf-8')

Index: 0
24 Juli 2017
Index: 1
15 April 2017
Index: 2
11 April 2017
Index: 3
12 April 2017
Index: 4
13 Juni 2017
Index: 5
15 April 2017
Index: 6
08 Juni 2017
Index: 7
09 Juni 2017
Index: 8
08 Juni 2017
Index: 9
24 Juli 2017
Index: 10
24 Juli 2017
Index: 11
Index: 12
Index: 13
23 Juli 2017
Index: 14
Index: 15
Index: 16
12 Mei 2017
Index: 17
18 Apr 2017
Index: 18
09 Juni 2017
Index: 19
21 Juli 2017
Index: 20
25 Apr 2017
Index: 21
16 Apr 2017
Index: 22
31 Mar 2017
Index: 23
30 Mar 2017
Index: 24
17 April 2016
Index: 25
Index: 26
29 Jun 2017
Index: 27
25 Apr 2017
Index: 28
19 May 2017
Index: 29
29 Mar 2017
Index: 30
17 Juli 2017
Index: 31
Index: 32
01 Jun 2017
Index: 33
17 Juli 2017
Index: 34
03 Jul 2016
Index: 35
13 April 2017
Index: 36
28 Nov 2016
Index: 37
Index: 38
14 Apr 2017
Index: 39
06 May 2017
Index: 40
Index: 41
21 Dec 2016
Index: 42
19 Februari 2017
Index: 43
09 Juni 2017
Index: 44
09 Juni 2017
Index: 45
13 Jul 2017
Index: 46
07 Mar 2017
Index: 47
11 Jan 2017
Index: 48
05 Apr 20

In [6]:
temp_saved_file = pd.read_csv('data/' + 'metrotvnews' + '_final2.csv', index_col=0).reset_index(drop=True)

m_links = [64, 79, 88, 95, 96, 112, 118, 178]

for m in m_links:

    soup = getSoupLink(temp_saved_file.link[m])
    print 'Index: ' + str(m)
        
    if (soup != 404):
        
        if not (soup.find("div", attrs={"class":"part article"}) is None):
            
            paras_temps = soup.find("div", attrs={"class":"part article"})
    
            while (paras_temps.div is not None):
                paras_temps.div.replace_with('')
    
            content = paras_temps.getText()
            content = content.replace('\r', '').replace('\n', '')
    
            #print content
            temp_saved_file.loc[m, 'body'] = content
    
            date = soup.find('span', attrs={'class':'red'}).getText()
            #- 26 Maret 2017 10:40 wib
            date = date.split()
            
            date1 = ''
            
            for g, dat1 in enumerate(date):
                
                if (isint(date[g]) == True):
                    date1 = str(date[g]) + ' ' + date[g+1] + ' ' + date[g+2]
                    break
        
            print date1
    
            temp_saved_file.loc[m, 'date'] = date1

    else:
        print 'URL cannot be opened'
        
temp_saved_file.to_csv('data/' + 'metrotvnews' + '_final2.csv', encoding='utf-8')

Index: 64
14 Juni 2017
Index: 79
21 Juni 2017
Index: 88
26 Maret 2017
Index: 95
09 Juni 2017
Index: 96
06 Juli 2017
Index: 112
Index: 118
Index: 178
14 Juli 2017
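
The date loop in the two metrotvnews cells above scans the split date string for the first integer token and keeps it together with the next two tokens. A minimal sketch of that step as a reusable function follows. The name `extract_date` is hypothetical, and `str.isdigit` stands in for the notebook's `isint` helper, whose definition is not shown in this section. Unlike the loop above, the sketch also checks that two more tokens exist before indexing them.

```python
def extract_date(text):
    """Return 'DD Month YYYY' starting at the first integer token, or '' if none fits."""
    tokens = text.split()
    for i, token in enumerate(tokens):
        # Accept a day number only when a month and a year token follow it.
        if token.isdigit() and i + 2 < len(tokens):
            return ' '.join(tokens[i:i + 3])
    return ''


# Example input taken from the comment in the cell above.
example = extract_date('- 26 Maret 2017 10:40 wib')
```

Returning an empty string when nothing matches mirrors the `date1 = ''` default used in the cells, so rows without a recognisable date stay blank instead of raising an `IndexError`.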


In [285]:
# tribunnews

temp_saved_file = pd.read_csv('data/' + 'tribunnews' + '_final.csv', index_col=0).reset_index(drop=True)

temp_saved_file.loc[:,'body'] = '' #initialising body column
temp_saved_file.loc[:,'date'] = '' #initialising date column

for m, url in enumerate(temp_saved_file.link):

    soup = getSoupLink(url)
    print 'Index: ' + str(m)
        
    if (soup != 404):
        
        if not (soup.find("div", attrs={"class":"side-article txt-article"}) is None):
            paras = soup.find("div", attrs={"class":"side-article txt-article"}).find_all('p')
            content = " ".join([para.getText() for para in paras])
    
            #print content
            temp_saved_file.loc[m, 'body'] = content
    
            date = soup.find('time').getText()
            #Senin, 14 November 2016 18:50
            date = date.split(',')[1]
            #14 November 2016 18:50
            date = date.split()
            date1 = ''
            date1 = date[0] + ' ' + date[1] + ' ' + date[2]  
        
            print date1
    
            temp_saved_file.loc[m, 'date'] = date1

    else:
        print 'URL cannot be opened'
        
temp_saved_file.to_csv('data/' + 'tribunnews' + '_final2.csv', encoding='utf-8')

Index: 0
20 September 2015
Index: 1
3 Februari 2017
Index: 2
14 November 2016
Index: 3
13 Mei 2017
Index: 4
21 April 2017
Index: 5
14 November 2016
Index: 6
10 Februari 2017
Index: 7
14 November 2016
Index: 8
9 Mei 2016
Index: 9
14 November 2016
Index: 10
14 November 2016
Index: 11
14 November 2016
Index: 12
11 Februari 2014
Index: 13
18 Desember 2016
Index: 14
9 Juli 2017
Index: 15
10 Maret 2016
Index: 16
4 April 2017
Index: 17
11 Oktober 2016
Index: 18
16 November 2016
Index: 19
9 Desember 2016
Index: 20
14 Februari 2017
Index: 21
9 Desember 2016
Index: 22
5 Mei 2017
Index: 23
3 November 2016
Index: 24
14 Februari 2017
Index: 25
20 April 2017
Index: 26
20 Januari 2017
Index: 27
14 November 2016
Index: 28
3 April 2016
Index: 29
23 November 2014
Index: 30
13 November 2016
Index: 31
20 Januari 2017
Index: 32
14 November 2016
Index: 33
27 Maret 2016
Index: 34
8 Desember 2016
Index: 35
8 Desember 2016
Index: 36
25 April 2017
Index: 37
11 Mei 2017
Index: 38
24 Agustus 2016
Index: 39
5 Janu
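
The dates printed in the outputs above mix Indonesian and English month names, both full and abbreviated (for example `Juli`, `Jul`, `Mei`, `May`, `Desember`, `Dec`). If the dates are later compared across sites, one possible approach is to map them to ISO format first. The sketch below is an assumption about how that could be done, not a step taken in this notebook; `MONTHS` and `to_iso` are hypothetical names.

```python
# First three letters of each month name, Indonesian and English spellings.
MONTHS = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'mei': 5, 'may': 5,
    'jun': 6, 'jul': 7, 'agu': 8, 'aug': 8, 'sep': 9,
    'okt': 10, 'oct': 10, 'nov': 11, 'des': 12, 'dec': 12,
}


def to_iso(date_text):
    """Convert 'DD Month YYYY' to 'YYYY-MM-DD'; return None if it cannot be parsed."""
    parts = date_text.split()
    if len(parts) != 3:
        return None
    day, month_name, year = parts
    month = MONTHS.get(month_name[:3].lower())
    if month is None or not (day.isdigit() and year.isdigit()):
        return None
    return '{:04d}-{:02d}-{:02d}'.format(int(year), month, int(day))


iso_example = to_iso('24 Juli 2017')
```

Matching on the first three letters covers both the long and short spellings seen above with a single lookup table.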

## B) Scraping news websites that have their own search engines

In [7]:
#Detik News

temp_saved_file = pd.DataFrame(columns=('title','link','snippet','body', 'date'))
temp_saved_file.loc[:,'body'] = ''


In [8]:
num_pages = 397
count = 0

for i in range(num_pages):
    print ('Page: ', i+1)
    url = 'https://www.detik.com/search/searchnews?query=singapura&siteid=3&sortby=date&fromdatex=01/01/2015&todatex=25/07/2017&page={}'.format(i+1)
    soup = getSoupLink(url)
    
    for link in soup.find_all("article"):
        link2 = link.find("a").get("href")
        title_link = link.find("h2", attrs={"class":"title"}).getText()
        snippet = link.find("p").getText()
        date = link.find("span", attrs={"class":"date"}).getText()
        #Selasa, 25 Jul 2017 11:32 WIB   -  
        date = date.split(',')[1]
        #25 Jul 2017 11:32 WIB 
        date = date.split()
        date1 = ''
        date1 = date[0] + ' ' + date[1] + ' ' + date[2]
        temp_saved_file.loc[count] = [title_link, link2, snippet, '', date1]
        count = count + 1
    
temp_saved_file.to_csv('data/' + 'detiknews' + '_final.csv', encoding='utf-8')

('Page: ', 1)
('Page: ', 2)
('Page: ', 3)
('Page: ', 4)
('Page: ', 5)
('Page: ', 6)
('Page: ', 7)
('Page: ', 8)
('Page: ', 9)
('Page: ', 10)
('Page: ', 11)
('Page: ', 12)
('Page: ', 13)
('Page: ', 14)
('Page: ', 15)
('Page: ', 16)
('Page: ', 17)
('Page: ', 18)
('Page: ', 19)
('Page: ', 20)
('Page: ', 21)
('Page: ', 22)
('Page: ', 23)
('Page: ', 24)
('Page: ', 25)
('Page: ', 26)
('Page: ', 27)
('Page: ', 28)
('Page: ', 29)
('Page: ', 30)
('Page: ', 31)
('Page: ', 32)
('Page: ', 33)
('Page: ', 34)
('Page: ', 35)
('Page: ', 36)
('Page: ', 37)
('Page: ', 38)
('Page: ', 39)
('Page: ', 40)
('Page: ', 41)
('Page: ', 42)
('Page: ', 43)
('Page: ', 44)
('Page: ', 45)
('Page: ', 46)
('Page: ', 47)
('Page: ', 48)
('Page: ', 49)
('Page: ', 50)
('Page: ', 51)
('Page: ', 52)
('Page: ', 53)
('Page: ', 54)
('Page: ', 55)
('Page: ', 56)
('Page: ', 57)
('Page: ', 58)
('Page: ', 59)
('Page: ', 60)
('Page: ', 61)
('Page: ', 62)
('Page: ', 63)
('Page: ', 64)
('Page: ', 65)
('Page: ', 66)
('Page: ', 67)
('Pa

In [49]:
#Antara News

del temp_saved_file
temp_saved_file = pd.DataFrame()
temp_saved_file = pd.DataFrame(columns=('title','link','snippet','body', 'date'))
temp_saved_file.loc[:,'body'] = ''

In [55]:
num_pages = 81 - 75
#count = 0

for i in range(num_pages):
    i = i + 1 + 75
    print ('Page: ', i)
    url = 'http://www.antaranews.com/search?q=Singapura&page={}'.format(i)
    soup = getSoupLink(url)
    
    for link in soup.find_all("div", attrs={"class":"paging_ekonomi"}):
        link2 = 'http://www.antaranews.com' + link.find("a").get("href")
        title_link = link.find("h3").getText()
        snippet = link.find("div", attrs={"class":"pt5"}).getText()
        date = link.find("div", attrs={"class":"date"}).getText()
        #25 Jul 2017 11:32 WIB   -  
        date = date.split()
        date1 = ''
        date1 = date[0] + ' ' + date[1] + ' ' + date[2]
        temp_saved_file.loc[count] = [title_link, link2, snippet, '', date1]
        count = count + 1
    
temp_saved_file.to_csv('data/' + 'antaranews' + '_final.csv', encoding='utf-8')

('Page: ', 76)
('Page: ', 77)
('Page: ', 78)
('Page: ', 79)
('Page: ', 80)
('Page: ', 81)


In [38]:
#Liputan6 - Bisnis

del temp_saved_file
temp_saved_file = pd.DataFrame()
temp_saved_file = pd.DataFrame(columns=('title','link','snippet','body', 'date'))
temp_saved_file.loc[:,'body'] = ''
temp_saved_file.loc[:,'date'] = ''

In [43]:
#url = 'http://www.liputan6.com/search?order=relevance&channel_id=5&from_date=01%2F01%2F2015&to_date=25%2F07%2F2017&type=all&q=singapura'
url = open('prelim_data/liputanbisnis.html', 'r')
#soup = getSoupLink(url)
soup = BeautifulSoup(url, 'lxml')

for i, link in enumerate(soup.find_all("a", attrs={"class":"ui--a articles--iridescent-list--text-item__title-link"})):
    temp_saved_file.loc[i, 'link'] = link.get("href")
    
for j, title_link in enumerate(soup.find_all("span", attrs={"class":"articles--iridescent-list--text-item__title-link-text"})):
    temp_saved_file.loc[j, 'title'] = title_link.getText()

for k, snippet in enumerate(soup.find_all("div", attrs={"class":"articles--iridescent-list--text-item__summary"})):
    temp_saved_file.loc[k, 'snippet'] = snippet.getText()

url.close()

#temp_saved_file.to_csv('data/' + 'liputan6' + '_final.csv', encoding='utf-8')

In [44]:
#Liputan6 - News

temp_saved_file2 = pd.DataFrame()
temp_saved_file2 = pd.DataFrame(columns=('title','link','snippet','body', 'date'))
temp_saved_file2.loc[:,'body'] = ''
temp_saved_file2.loc[:,'date'] = ''

url = open('prelim_data/liputannews.html', 'r')
#soup = getSoupLink(url)
soup = BeautifulSoup(url, 'lxml')
    
for i, link in enumerate(soup.find_all("a", attrs={"class":"ui--a articles--iridescent-list--text-item__title-link"})):
    temp_saved_file2.loc[i, 'link'] = link.get("href")
    
for j, title_link in enumerate(soup.find_all("span", attrs={"class":"articles--iridescent-list--text-item__title-link-text"})):
    temp_saved_file2.loc[j, 'title'] = title_link.getText()

for k, snippet in enumerate(soup.find_all("div", attrs={"class":"articles--iridescent-list--text-item__summary"})):
    temp_saved_file2.loc[k, 'snippet'] = snippet.getText()

temp_saved_file3 = temp_saved_file.append(temp_saved_file2)
temp_saved_file3.to_csv('data/' + 'liputan6' + '_final.csv', encoding='utf-8')

url.close()

In [None]:
#Sindo News

del temp_saved_file
temp_saved_file = pd.DataFrame()
temp_saved_file = pd.DataFrame(columns=('title','link','snippet','body', 'date'))
temp_saved_file.loc[:,'body'] = ''
temp_saved_file.loc[:,'date'] = ''

In [48]:
num_pages = 86
count = 0

for i in range(num_pages):
    i = i*12
    print ('Page: ', i)
    url = 'https://search.sindonews.com/search/{}?type=artikel&q=Singapura'.format(i)
    soup = getSoupLink(url)
    
    for link in soup.find_all("div", attrs={"class":"news-content"}):
        link2 = link.find("a").get("href")
        title_link = link.find("a").getText()
        snippet = link.find("div", attrs={"class":"news-summary"}).getText()
        date = link.find("div", attrs={"class":"news-date"}).getText()
        #Kamis, 20 Juli 2017 - 05:01 WIB
        date = date.split(',')[1]
        #20 Juli 2017 - 05:01 WIB
        date = date.split('-')[0].strip()  # drop the spaces left around '20 Juli 2017'
        #date1 = date[0] + ' ' + date[1] + ' ' + date[2]
        temp_saved_file.loc[count] = [title_link, link2, snippet, '', date]
        count = count + 1
    
temp_saved_file.to_csv('data/' + 'sindonews' + '_final.csv', encoding='utf-8')

('Page: ', 0)
('Page: ', 12)
('Page: ', 24)
('Page: ', 36)
('Page: ', 48)
('Page: ', 60)
('Page: ', 72)
('Page: ', 84)
('Page: ', 96)
('Page: ', 108)
('Page: ', 120)
('Page: ', 132)
('Page: ', 144)
('Page: ', 156)
('Page: ', 168)
('Page: ', 180)
('Page: ', 192)
('Page: ', 204)
('Page: ', 216)
('Page: ', 228)
('Page: ', 240)
('Page: ', 252)
('Page: ', 264)
('Page: ', 276)
('Page: ', 288)
('Page: ', 300)
('Page: ', 312)
('Page: ', 324)
('Page: ', 336)
('Page: ', 348)
('Page: ', 360)
('Page: ', 372)
('Page: ', 384)
('Page: ', 396)
('Page: ', 408)
('Page: ', 420)
('Page: ', 432)
('Page: ', 444)
('Page: ', 456)
('Page: ', 468)
('Page: ', 480)
('Page: ', 492)
('Page: ', 504)
('Page: ', 516)
('Page: ', 528)
('Page: ', 540)
('Page: ', 552)
('Page: ', 564)
('Page: ', 576)
('Page: ', 588)
('Page: ', 600)
('Page: ', 612)
('Page: ', 624)
('Page: ', 636)
('Page: ', 648)
('Page: ', 660)
('Page: ', 672)
('Page: ', 684)
('Page: ', 696)
('Page: ', 708)
('Page: ', 720)
('Page: ', 732)
('Page: ', 744)
('
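
In the Sindonews cell above, the number printed as `Page` is a result offset rather than a page number: the loop multiplies the counter by 12, so the search URLs use 0, 12, 24 and so on. A minimal sketch of building those URLs directly follows. The function name `sindonews_search_urls` and the `page_size` parameter are hypothetical; the URL template and the step of 12 are taken from the cell.

```python
def sindonews_search_urls(num_pages, page_size=12, query='Singapura'):
    """Build one search URL per result page, using the result offset in the path."""
    template = 'https://search.sindonews.com/search/{}?type=artikel&q={}'
    offsets = range(0, num_pages * page_size, page_size)
    return [template.format(offset, query) for offset in offsets]


urls = sindonews_search_urls(3)
```

Stepping `range` by the page size keeps the page counter and the offset separate, which avoids reassigning the loop variable inside the loop.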

In [100]:
del temp_saved_file2
del temp_saved_file3

In [None]:
# Suara.com

del temp_saved_file
temp_saved_file = pd.DataFrame()
temp_saved_file = pd.DataFrame(columns=('title','link','snippet','body', 'date'))
temp_saved_file.loc[:,'body'] = ''
temp_saved_file.loc[:,'date'] = ''

num_pages = 32
count = 0

for i in range(num_pages):
    i = i+1
    print ('Page: ', i)
    url = 'http://www.suara.com/search/singapura/page--{}'.format(i)
    soup = getSoupLink(url)
    
    for m, link in enumerate(soup.find_all("a", attrs={"class":"ellipsis2"})):
        link2 = link.get("href")
        title_link = link.getText()
        #http://www.suara.com/news/2015/01/05/165809/angkut-1400-mobil-mewah-kapal-kargo-singapura-terbalik
        date = link.get("href").replace('http://www.suara.com/', '')
        #news/2015/01/05/165809/angkut-1400-mobil-mewah-kapal-kargo-singapura-terbalik
        date = date.split('/')
        date1 = ''
        if (len(date) > 3):
            date1 = date[3] + '/' + date[2] + '/' + date[1]
        
        snippet = soup.find_all("p", attrs={"class":"ellipsis2"})[m].getText()
        temp_saved_file.loc[count] = [title_link, link2, snippet, '', date1]
        count = count + 1
    
temp_saved_file.to_csv('data/' + 'suara' + '_final.csv', encoding='utf-8')
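
The Suara cell above has no date field to scrape, so it reads the date from the article URL by splitting on `/`. A regular expression is one alternative that does not depend on the position of the date within the path. The sketch below assumes the `/YYYY/MM/DD/` pattern seen in the example URL in the cell; `DATE_IN_URL` and `date_from_url` are hypothetical names.

```python
import re

# Matches a /YYYY/MM/DD/ segment anywhere in the URL path.
DATE_IN_URL = re.compile(r'/(\d{4})/(\d{2})/(\d{2})/')


def date_from_url(url):
    """Return the date embedded in a URL as 'DD/MM/YYYY', or '' if there is none."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return ''
    year, month, day = match.groups()
    return '{}/{}/{}'.format(day, month, year)


url_date = date_from_url(
    'http://www.suara.com/news/2015/01/05/165809/'
    'angkut-1400-mobil-mewah-kapal-kargo-singapura-terbalik'
)
```

URLs without a date segment produce an empty string, the same fallback the cell uses when the split yields too few parts.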

## B2) Inserting body of text

In [119]:
#Detik News

temp_saved_file = pd.read_csv('data/' + 'detiknews' + '_final2.csv', index_col=0).reset_index(drop=True)

#temp_saved_test = [temp_saved_file.link[0], temp_saved_file.link[11], temp_saved_file.link[27]]

In [121]:
temp_saved_file.count()

title      3568
link       3568
snippet    3520
body          0
date       3568
dtype: int64

In [124]:
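#for m, url in enumerate(temp_saved_file.link):
# last_stop is the row index to resume from after an interrupted run.
# It must be defined before this cell is run (the output below starts at index 545).
#last_stop=649
for m in range(3568-last_stop):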

    m = m + last_stop
    soup = getSoupLink(temp_saved_file.link[m])
    print 'Index: ' + str(m)
        
    if (soup != 404):
        
        if not (soup.find("div", attrs={"class":"detail_text"}) is None):
            paras_temps = soup.find("div", attrs={"class":"detail_text"})
            #content = " ".join([para for para in paras])
            
            while (paras_temps.script is not None):
                paras_temps.script.replace_with('')
            
            content = paras_temps.getText()
            content = content.replace('\r', '').replace('\n', '')
            
            #print content
            temp_saved_file.loc[m, 'body'] = content          

    else:
        print 'URL cannot be opened'
        
#temp_saved_file.to_csv('data/' + 'detiknews' + '_final2.csv', encoding='utf-8')

Index: 545
Index: 546
Index: 547
Index: 548
Index: 549
Index: 550
Index: 551
Index: 552
Index: 553
Index: 554
Index: 555
Index: 556
Index: 557
Index: 558
Index: 559
Index: 560
Index: 561
Index: 562
Index: 563
Index: 564
Index: 565
Index: 566
Index: 567
Index: 568
Index: 569
Index: 570
Index: 571
Index: 572
Index: 573
Index: 574
Index: 575
Index: 576
Index: 577
Index: 578
Index: 579
Index: 580
Index: 581
Index: 582
Index: 583
Index: 584
Index: 585
Index: 586
Index: 587
Index: 588
Index: 589
Index: 590
Index: 591
Index: 592
Index: 593
Index: 594
Index: 595
Index: 596
Index: 597
Index: 598
Index: 599
Index: 600
Index: 601
Index: 602
Index: 603
Index: 604
Index: 605
Index: 606
Index: 607
Index: 608
Index: 609
Index: 610
Index: 611
Index: 612
Index: 613
Index: 614
Index: 615
Index: 616
Index: 617
Index: 618
Index: 619
Index: 620
Index: 621
Index: 622
Index: 623
Index: 624
Index: 625
Index: 626
Index: 627
Index: 628
Index: 629
Index: 630
Index: 631
Index: 632
Index: 633
Index: 634
Index: 635

KeyboardInterrupt: 

In [127]:
temp_saved_file.to_csv('data/' + 'detiknews' + '_final2.csv', encoding='utf-8')
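
The Detik cell above is restarted by hand: after an interruption, `last_stop` is set to the index where the previous run ended. One way to find that index automatically is to look for the first row whose body is still empty. The sketch below works on a plain list of body values and is an assumption about how this could be done, not the method used in this notebook; `first_unfilled` is a hypothetical name.

```python
def first_unfilled(bodies):
    """Index of the first empty or missing body, or len(bodies) if all are filled."""
    for index, body in enumerate(bodies):
        # body != body is True only for NaN, which is how pandas reads empty CSV cells.
        if body is None or body != body or str(body).strip() == '':
            return index
    return len(bodies)


resume_at = first_unfilled(['article one', 'article two', '', None])
```

With a DataFrame, the list could be obtained with `temp_saved_file.body.tolist()`. One caveat: a page that was visited but had no matching `div` also leaves an empty body, so the returned index may point at such a row rather than at the true interruption point.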

In [3]:
#AntaraNews

temp_saved_file = pd.read_csv('data/' + 'antaranews' + '_final2.csv', index_col=0).reset_index(drop=True)


In [4]:
temp_saved_file.count()
#temp_saved_test = [temp_saved_file.link[0], temp_saved_file.link[400], temp_saved_file.link[800]]

title      810
link       810
snippet    810
body         0
date       810
dtype: int64

In [15]:
#total_number of cells = 810
#for m, url in enumerate(temp_saved_file.link):
last_stop=802
for m in range(810-last_stop):
    
    m = m + last_stop
    soup = getSoupLink(temp_saved_file.link[m])
    print 'Index: ' + str(m)
        
    if (soup != 404):
        
        if not (soup.find("div", attrs={"id":"content_news"}) is None):
            paras_temps = soup.find("div", attrs={"id":"content_news"})
            #content = " ".join([para for para in paras])
            
            while (paras_temps.script is not None):
                paras_temps.script.replace_with('')
            
            content = paras_temps.getText()
            content = content.replace('\r', '').replace('\n', '').replace('\t', '')
            
            #print 'yes'
            temp_saved_file.loc[m, 'body'] = content
            
        #else:
        #    print 'no body'

    else:
        print 'URL cannot be opened'

Index: 802
Index: 803
Index: 804
Index: 805
Index: 806
Index: 807
Index: 808
Index: 809


In [16]:
temp_saved_file.to_csv('data/' + 'antaranews' + '_final2.csv', encoding='utf-8')

In [3]:
#Liputan6

temp_saved_file = pd.read_csv('data/' + 'liputan6' + '_final.csv', index_col=0).reset_index(drop=True)
#temp_saved_test = [temp_saved_file.link[0], temp_saved_file.link[20], temp_saved_file.link[80]]
temp_saved_file.count()

title      191
link       191
snippet    191
body         0
date         0
dtype: int64

In [4]:
for m, url in enumerate(temp_saved_file.link):

    soup = getSoupLink(url)
    print 'Index: ' + str(m)
        
    if (soup != 404):
        
        if not (soup.find("div", attrs={"class":"article-content-body__item-content"}) is None):
            paras_temps = soup.find("div", attrs={"class":"article-content-body__item-content"})
            #content = " ".join([para for para in paras])
            
            while (paras_temps.find("p", attrs={"class":"baca-juga__header"}) is not None):
                paras_temps.find("p", attrs={"class":"baca-juga__header"}).replace_with('')
            
            while (paras_temps.find("ul", attrs={"class":"baca-juga__list"}) is not None):
                paras_temps.find("ul", attrs={"class":"baca-juga__list"}).replace_with('')
                
            content = paras_temps.getText()
            content = content.replace('\r', '').replace('\n', '').replace('\t', '')
            
            #print content
            temp_saved_file.loc[m, 'body'] = content
        
        if not (soup.find("time", attrs={"class": "read-page--header--author__datetime"}) is None):
            date = soup.find("time", attrs={"class": "read-page--header--author__datetime"}).get("datetime")
            date = date.split()[0]
            #print date
            temp_saved_file.loc[m, 'date'] = date
        

    else:
        print 'URL cannot be opened'

temp_saved_file.to_csv('data/' + 'liputan6' + '_final2.csv', encoding='utf-8')

Index: 0
Index: 1
Index: 2
Index: 3
Index: 4
Index: 5
Index: 6
Index: 7
Index: 8
Index: 9
Index: 10
Index: 11
Index: 12
Index: 13
Index: 14
Index: 15
Index: 16
Index: 17
Index: 18
Index: 19
Index: 20
Index: 21
Index: 22
Index: 23
Index: 24
Index: 25
Index: 26
Index: 27
Index: 28
Index: 29
Index: 30
Index: 31
Index: 32
Index: 33
Index: 34
Index: 35
Index: 36
Index: 37
Index: 38
Index: 39
Index: 40
Index: 41
Index: 42
Index: 43
Index: 44
Index: 45
Index: 46
Index: 47
Index: 48
Index: 49
Index: 50
Index: 51
Index: 52
Index: 53
Index: 54
Index: 55
Index: 56
Index: 57
Index: 58
Index: 59
Index: 60
Index: 61
Index: 62
Index: 63
Index: 64
Index: 65
Index: 66
Index: 67
Index: 68
Index: 69
Index: 70
Index: 71
Index: 72
Index: 73
Index: 74
Index: 75
Index: 76
Index: 77
Index: 78
Index: 79
Index: 80
Index: 81
Index: 82
Index: 83
Index: 84
Index: 85
Index: 86
Index: 87
Index: 88
Index: 89
Index: 90
Index: 91
Index: 92
Index: 93
Index: 94
Index: 95
Index: 96
Index: 97
Index: 98
Index: 99
Index: 100

In [5]:
del soup
del temp_saved_file
del content
del date

In [10]:
#Sindonews

temp_saved_file = pd.read_csv('data/' + 'sindonews' + '_final.csv', index_col=0).reset_index(drop=True)
temp_saved_file.count()
#temp_saved_test = [temp_saved_file.link[0], temp_saved_file.link[20], temp_saved_file.link[80]]

title      1032
link       1032
snippet    1032
body          0
date       1032
dtype: int64

In [11]:
for m, url in enumerate(temp_saved_file.link):
#last_stop=649
#for m in range(3568-last_stop):

    soup = getSoupLink(url)
    print 'Index: ' + str(m)
        
    if (soup != 404):
        
        if not (soup.find("div", attrs={"id":"content"}) is None):
            paras_temps = soup.find("div", attrs={"id":"content"})
            #content = " ".join([para for para in paras])
            
            #while (paras_temps.find("p", attrs={"baca-juga__header"}) is not None):
            #    paras_temps.find("p", attrs={"baca-juga__header"}).replace_with('')
            
            #while (paras_temps.find("ul", attrs={"baca-juga__list"}) is not None):
            #    paras_temps.find("ul", attrs={"baca-juga__list"}).replace_with('')
                
            content = paras_temps.getText()
            content = content.replace('\r', '').replace('\n', '').replace('\t', '')
            
            #print content
            temp_saved_file.loc[m, 'body'] = content
        

    else:
        print 'URL cannot be opened'

temp_saved_file.to_csv('data/' + 'sindonews' + '_final2.csv', encoding='utf-8')

Index: 0
Index: 1
Index: 2
Index: 3
Index: 4
Index: 5
Index: 6
Index: 7
Index: 8
Index: 9
Index: 10
Index: 11
Index: 12
Index: 13
Index: 14
Index: 15
Index: 16
Index: 17
Index: 18
Index: 19
Index: 20
Index: 21
Index: 22
Index: 23
Index: 24
Index: 25
Index: 26
Index: 27
Index: 28
Index: 29
Index: 30
Index: 31
Index: 32
Index: 33
Index: 34
Index: 35
Index: 36
Index: 37
Index: 38
Index: 39
Index: 40
Index: 41
Index: 42
Index: 43
Index: 44
Index: 45
Index: 46
Index: 47
Index: 48
Index: 49
Index: 50
Index: 51
Index: 52
Index: 53
Index: 54
Index: 55
Index: 56
Index: 57
Index: 58
Index: 59
Index: 60
Index: 61
Index: 62
Index: 63
Index: 64
Index: 65
Index: 66
Index: 67
Index: 68
Index: 69
Index: 70
Index: 71
Index: 72
Index: 73
Index: 74
Index: 75
Index: 76
Index: 77
Index: 78
Index: 79
Index: 80
Index: 81
Index: 82
Index: 83
Index: 84
Index: 85
Index: 86
Index: 87
Index: 88
Index: 89
Index: 90
Index: 91
Index: 92
Index: 93
Index: 94
Index: 95
Index: 96
Index: 97
Index: 98
Index: 99
Index: 100

In [9]:
#Suara

temp_saved_file = pd.read_csv('data/' + 'suara' + '_final.csv', index_col=0).reset_index(drop=True)
temp_saved_file.count()

title      640
link       640
snippet    640
body         0
date       639
dtype: int64

In [None]:
for m, url in enumerate(temp_saved_file.link):
#last_stop=649
#for m in range(3568-last_stop):

    soup = getSoupLink(url)
    print 'Index: ' + str(m)
        
    if (soup != 404):
        
        if not (soup.find("article") is None):
            paras_temps = soup.find("article")
            #content = " ".join([para.getText() for para in paras])
            
            while (paras_temps.find("div", attrs={"class":"baca-juga"}) is not None):
                paras_temps.find("div", attrs={"class":"baca-juga"}).replace_with('')
            
            #while (paras_temps.find("ul") is not None):
            #    paras_temps.find("ul").replace_with('')
                
            #content = paras_temps.getText()
            paras = paras_temps.find_all('p')
            content = " ".join([para.getText() for para in paras])
            #content = content.replace('\r', '').replace('\n', '').replace('\t', '')
            
            #print content
            temp_saved_file.loc[m, 'body'] = content
        

    else:
        print 'URL cannot be opened'

temp_saved_file.to_csv('data/' + 'suara' + '_final2.csv', encoding='utf-8')

## C) Scraping the Government News Site

In [21]:
#scraping gov's site
base_url = 'http://www.gov.sg'
articles = []
num_pages = 6
df_statement = pd.DataFrame(columns=('date','title','content','link'))
count = 0
for i in range(num_pages):
    i = i+1
    print ('Page: ', i)
    url = 'https://www.gov.sg/resources/sgpc/page-{}?start=01-Jan-2016&end=22-Jul-2017&keyword=Indonesia&type=press-release'.format(i)
    soup = getSoupLink(url)
    for link in soup.find_all("div", attrs={"class": "press-title"}):
        link2 = link.find("a")
        title_link = link2.getText()
        #print count
        #print link2.get("href")
        #count = count + 1
        soup = getSoupLink(base_url + link2.get("href"))
        if (soup != 404):
            paras = soup.find("div", attrs={"class":"wrap"})
        
            while (paras.div is not None):
                paras.div.replace_with('')
                
            while (paras.h2 is not None):
                paras.h2.replace_with('')
            
            while (paras.h3 is not None):
                paras.h3.replace_with('')
            
            while (paras.find("ul", attrs={"class":"contact-list"}) is not None):
                paras.find("ul", attrs={"class":"contact-list"}).replace_with('')        
            
            # getText() always returns a string, so no None check is needed;
            # replace() leaves the text unchanged when the character is absent
            bodycontent = paras.getText()
            bodycontent = bodycontent.replace('\n', '').replace(u'\xa0', u'')
            bodycontent = bodycontent.replace('\r', ' ').replace(u'\u2019', "'")
        
            print base_url + link2.get("href")
            date = soup.find("div", attrs={"class":"lastupdate"}).getText()
            
            date = date.split()
            date1 = ''  # reset so a date from the previous article is never reused
            
            for g, dat1 in enumerate(date):
                
                if (isint(date[g]) == True):
                    date1 = str(date[g]) + ' ' + date[g+1] + ' ' + date[g+2]
                    break
            
            #year = soup.find("span", attrs={"class":"day"}).getText()
            #print link2.get("href")
            #date,month,year = date1.getText().split()
            #date = date + ' ' + month + ' ' + year
            #content = " ".join([para.getText() for para in paras])
            df_statement.loc[count] = [date1,title_link,bodycontent,base_url + link2.get("href")]
            count = count + 1
            
print count

('Page: ', 1)
http://www.gov.sg/resources/sgpc/media_releases/mewr/press_release/P-20170712-1
http://www.gov.sg/resources/sgpc/media_releases/mewr/press_release/P-20170712-2
http://www.gov.sg/resources/sgpc/media_releases/mof/press_release/P-20170712-1
http://www.gov.sg/resources/sgpc/media_releases/mof/press_release/P-20170711-2
http://www.gov.sg/resources/sgpc/media_releases/mewr/press_release/P-20170709-2
http://www.gov.sg/resources/sgpc/media_releases/mindef/press_release/P-20170617-1
http://www.gov.sg/resources/sgpc/media_releases/mindef/press_release/P-20170603-3
http://www.gov.sg/resources/sgpc/media_releases/mindef/press_release/P-20170518-2
http://www.gov.sg/resources/sgpc/media_releases/mewr/press_release/P-20170517-1
('Page: ', 2)
http://www.gov.sg/resources/sgpc/media_releases/mewr/press_release/P-20170517-2
http://www.gov.sg/resources/sgpc/media_releases/mindef/press_release/P-20170516-3
http://www.gov.sg/resources/sgpc/media_releases/mindef/press_release/P-20170513-1
http

In [23]:
df_statement.to_csv('data/statement.csv', encoding='utf-8')

In [27]:
df_statement.content[10]

u"1.Minister for Defence Dr Ng Eng Hen opened the 11thInternational Maritime Defence Exhibition and Conference (IMDEX) Asia at the Changi Exhibition Centre this morning.2.During the opening ceremony, Dr Ng highlighted the importance of maritime trade to Asia, and the need to maintain open sea lines of communication. He said, \u201cThe seas around us are strategic and will remain so far into the future.\u201d He added that \u201cthese sea lines of communication, or SLOCs, must remain open and stable for all to use \u2013 SLOCs are the global commons which we and all other stakeholders must collectively protect and preserve. Integral to these collective efforts is dialogue and cooperation, supported by the institutionalisation and acceptance of a rules-based order by all countries.\u201d3.Dr Ng also highlighted that the Republic of Singapore Navy (RSN) plays a major role in strengthening efforts to tackle regional maritime security threats, through the Malacca Straits Patrol with Indones
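
In the output above, numbered paragraphs run into each other (for example `morning.2.During`) because the gov.sg cell replaces newlines and non-breaking spaces with an empty string. If the text is later tokenised for NLP, collapsing whitespace into a single space may work better. The sketch below is one possible alternative, not what the cell does; `normalize_whitespace` is a hypothetical name.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace, including non-breaking spaces, into one space."""
    text = text.replace(u'\xa0', u' ')
    return re.sub(r'\s+', ' ', text).strip()


cleaned = normalize_whitespace(u'this morning.\r\n2.\xa0During the opening ceremony')
```

Replacing with a space rather than an empty string keeps word and sentence boundaries intact, so adjacent paragraphs do not fuse into one token.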

In [9]:
#scraping mfa.gov.sg's site
base_url = 'http://www.mfa.gov.sg'
date = []
num_pages = 3
df_mfa = pd.DataFrame(columns=('date','title','content','link'))
count = 0
#date_count = 0

for i in range(num_pages):
    i = i+1
    #print ('Page: ', i)
    url = 'http://www.mfa.gov.sg/content/mfa/media_centre/press_room.page.Indonesia.on.pressrelease.......2016.January.2017.July.{}.html'.format(i)
    soup = getSoupLink(url)
    print ('Page: ', i)
    
    for date_pt in soup.find("table", attrs={"class": "table_pressroom"}).find_all("td", attrs={"class": "col-xs-4 col-sm-2"}):
        #print date_pt.getText()
        date.append(date_pt.getText())
        #date_count = date_count + 1
    
    for link in soup.find("table", attrs={"class": "table_pressroom"}).find_all("td", attrs={"width":"82%"}):
        link2 = link.find("a")
        title_link = link2.getText()
        #print count
        #print link2.get("href")
        #count = count + 1
        soup = getSoupLink(base_url + link2.get("href"))
        if (soup != 404):
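            parbase = soup.find_all("div", attrs={"class":"parsys par_article"})
            
            content = ''  # reset so text from the previous article is never reused
            for parbase_sect in parbase:
                # note: each pass overwrites content, so only the last section is kept
                paras = parbase_sect.find_all('p')
                content = " ".join([para.getText() for para in paras])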
            
            #bodycontent = ""
            #for content in paras:
            #    if not (content.getText() is None): 
            #        #bodycontent = bodycontent + repr(content.string)
            #        bodycontent = bodycontent + ' ' + content.getText()
                
            #if (bodycontent.find('\n') != -1):
            #    bodycontent = bodycontent.replace('\n','')
            #if (bodycontent.find(u'\xa0') != -1):
            #    bodycontent = bodycontent.replace(u'\xa0',u'')
            #if (bodycontent.find("Press Release") != -1):
            #    bodycontent = bodycontent.replace("Press Release",'')
            #if (bodycontent.find('\r') != -1):
            #    bodycontent = bodycontent.replace("\r",' ')
            #if (bodycontent.find(u'\u2019') != -1):
            #    bodycontent = bodycontent.replace(u'\u2019',"'")
        
            df_mfa.loc[count] = [date[count],title_link,content,base_url + link2.get("href")]
            count = count + 1
            
print count

('Page: ', 1)
('Page: ', 2)
('Page: ', 3)
30
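
The MFA cell above collects dates and links in two separate passes and joins them through the shared counter `count`, which only advances when an article page opens successfully. If any page fails, every later row would pick up the date of an earlier entry. Pairing the two lists before fetching avoids this. The sketch below uses plain lists and a list of booleans in place of the real page requests; `pair_rows` and `fetch_ok` are hypothetical names.

```python
def pair_rows(dates, titles, fetch_ok):
    """Pair each date with its title, skipping entries whose page failed to load."""
    rows = []
    for date, title, ok in zip(dates, titles, fetch_ok):
        if not ok:
            # Skip the failed page; zip has already consumed its date,
            # so later rows stay aligned.
            continue
        rows.append((date, title))
    return rows


rows = pair_rows(
    ['24-Jul-2017', '17-Jul-2017', '25-May-2017'],
    ['Statement A', 'Statement B', 'Statement C'],
    [True, False, True],
)
```

In the run shown above all 30 pages opened, so the counter and the date list stayed in step; the pairing only matters when a request fails.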


In [10]:
df_mfa

| | date | title | content | link |
|---|---|---|---|---|
| 0 | 24-Jul-2017 | MFA Press Statement: Visit by Senior Minister ... | \n Senior Minister of State (SM... | http://www.mfa.gov.sg/content/mfa/media_centre... |
| 1 | 17-Jul-2017 | MFA Spokesman's Comments in response to media ... | The MFA Spokesman said: ... | http://www.mfa.gov.sg/content/mfa/media_centre... |
| 2 | 25-May-2017 | MFA Press Statement: Condolence letters from S... | \r\n President Tony Tan Keng Ya... | http://www.mfa.gov.sg/content/mfa/media_centre... |
| 3 | 25-May-2017 | MFA Spokesman’s Comments on the bomb blast in ... | \r\n Singapore strongly condemns th... | http://www.mfa.gov.sg/content/mfa/media_centre... |
| 4 | 25-Apr-2017 | MFA Press Statement: Visit by Minister for For... | Minister for Foreign Affairs Dr Vivian... | http://www.mfa.gov.sg/content/mfa/media_centre... |
| 5 | 07-Mar-2017 | MFA Press Statement: Visit by Deputy Prime Min... | DPM Teo Chee Hean represented Singapore as Spe... | http://www.mfa.gov.sg/content/mfa/media_centre... |
| 6 | 06-Mar-2017 | MFA Press Statement: Visit by Deputy Prime Min... | President of the Republic of Indonesia Joko Wi... | http://www.mfa.gov.sg/content/mfa/media_centre... |
| 7 | 05-Mar-2017 | MFA Press Statement: Visit by Deputy Prime Min... | Deputy Prime Minister (DPM) and Coordina... | http://www.mfa.gov.sg/content/mfa/media_centre... |
| 8 | 11-Feb-2017 | Edited transcript of the Joint Press Conferenc... | Emcee: Thank you. Excellencies, distinguished ... | http://www.mfa.gov.sg/content/mfa/media_centre... |
| 9 | 10-Feb-2017 | MFA Press Statement: Official Visit of Her Exc... | Minister for Foreign Affairs of the Re... | http://www.mfa.gov.sg/content/mfa/media_centre... |


In [12]:
df_mfa.to_csv('data/mfa.csv', encoding='utf-8')