# Guardian article extractor

About 15 years ago I wrote a script to scrape the Guardian newspaper site
and write the stories to an .html file so that I could print them off for
reading in the toilet at work.

It took a couple of weeks to get working correctly. The site then changed
a couple of times and I had to rewrite the script, but that wasn't so hard.

---

### Steps in the procedure

1. Scrape the front page and get the links to stories
1. Access the links and scrape each of the pages
1. Extract metadata and story text from the pages
1. Sort the stories and create a table of contents (TOC)
1. Format and dump to a file

### Much easier

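Things are easier today using **Goose** (the `goose3` package), which takes care of fetching each page and extracting its metadata and text.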





# Let's get started

Pull the front page and get the links.
Some people use BeautifulSoup to do this, but a regular expression
is enough for a single fixed pattern like this and avoids building a full parse tree.

In [25]:
import datetime, calendar
mydate = datetime.datetime.now()

#build the 'mon/dd' fragment that appears in today's article links, e.g. 'aug/12'
today = '%s/%02d' % (calendar.month_abbr[mydate.month].lower(), mydate.day)
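
The date fragment is easier to check with a fixed date than with `now()`. Here is a minimal sketch, assuming a hypothetical helper `date_token` that is not part of the original script. It spells out the English month abbreviations because `calendar.month_abbr` follows the current locale, and it keeps the zero-padded day just like the cell above.

```python
import datetime

# English month abbreviations, fixed here so the result does not depend on locale.
MONTHS = ('jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec')

def date_token(day):
    """Return the 'mon/dd' fragment for a date (hypothetical helper)."""
    return '%s/%02d' % (MONTHS[day.month - 1], day.day)

print(date_token(datetime.date(2019, 8, 12)))  # aug/12
print(date_token(datetime.date(2019, 1, 5)))   # jan/05
```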

In [26]:
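import requests
import re

#capture the href and the data-link-name attribute of each anchor
pp = re.compile(r'<a href="(.*?)".*?data-link-name="(.*?)"', re.DOTALL)
r = requests.get('https://www.theguardian.com/', timeout=30)
r.raise_for_status()

#keep today's article links; the set removes duplicates
lks = {lk for lk, name in pp.findall(r.text)
       if name == 'article' and today in lk}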



In [27]:
list(lks)[:3]

['https://www.theguardian.com/us-news/2019/aug/12/pennsylvania-fire-five-children-die-in-fire-at-daycare-centre',
 'https://www.theguardian.com/world/2019/aug/12/uk-and-ireland-commemorate-1979-fastnet-race-disaster',
 'https://www.theguardian.com/music/2019/aug/12/love-yourself-more-k-pop-band-bts-take-extended-break-to-enjoy-life']
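
The link filter can be tried offline on a few hand-written anchors. This is only a sketch: the sample markup and the names `SAMPLE_HTML`, `LINK_RE` and `article_links` are made up for illustration. The pattern is also a slightly stricter variant of the one above. It uses `[^>]*?` instead of `.*?` so that a match cannot run past the end of one tag and pair an `href` with the `data-link-name` of a later anchor.

```python
import re

# Hand-written sample markup, for illustration only (not real Guardian HTML).
SAMPLE_HTML = '''
<a href="https://www.theguardian.com/world/2019/aug/12/some-story" class="x" data-link-name="article">Story</a>
<a href="https://www.theguardian.com/help">Help</a>
<a href="https://www.theguardian.com/football/2019/aug/11/old-match" data-link-name="article">Old</a>
<a href="https://www.theguardian.com/world/2019/aug/12/some-story" data-link-name="article">Story again</a>
<a href="https://www.theguardian.com/uk" data-link-name="nav : uk">UK</a>
'''

# [^>]*? keeps each match inside a single opening tag.
LINK_RE = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*?data-link-name="([^"]*)"')

def article_links(html, today):
    """Return the set of article links whose URL contains the date fragment."""
    return {href for href, name in LINK_RE.findall(html)
            if name == 'article' and today in href}

links = article_links(SAMPLE_HTML, 'aug/12')
print(sorted(links))
```

The duplicate link collapses into one entry, and the anchor without a `data-link-name` attribute is skipped, as are the one dated the day before and the navigation link.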

# Sorting the links

I always write a class to do this.
The object contains:

- metadata
- a payload
- a unique identifier generated using a hash function
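
The identifier only has to be stable and unique per link so that it can double as an HTML anchor. Here is a minimal sketch, assuming a hypothetical helper `article_id` that is not part of the class. It uses the MD5 hex digest directly rather than going through the built-in `hash()`, which truncates the value returned by `__hash__` to the size of a `Py_ssize_t`.

```python
import hashlib

def article_id(url):
    """Derive a stable anchor id from a URL (hypothetical helper)."""
    digest = hashlib.md5(url.encode('utf-8')).hexdigest()
    # A letter prefix means the id never starts with a digit,
    # which keeps it usable in CSS selectors as well as in href="#...".
    return 'art-' + digest

first = article_id('https://www.theguardian.com/world/2019/aug/12/a')
second = article_id('https://www.theguardian.com/world/2019/aug/12/b')
print(first)
print(first != second)
```

MD5 is used here only as a fingerprint, not for security.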


In [28]:
import hashlib

class GuardianArt():
    #this wraps a Goose object
    #maybe I should inherit but ???
    
    def __init__(self, lk=""):
        
        #this is the payload
        self.goose_art = None
        
        self.raw = lk
        self.id = str(hash(self))
        
        #the rest is metadata
        #pad the parts so a short link (or the default "") can't raise IndexError
        parts = lk.split('/') + [''] * 5
        self.label = parts[3]
        self.tag = 'misc'
            
        #change tag from misc if appropriate to do so 
        if parts[4] in ['audio', 'video', 'gallery']:
            self.tag = self.label = 'xmedia'    
        elif self.label == 'world' or 'news' in self.label:
            self.tag = 'news'
        elif self.label == 'commentisfree':
            self.tag = 'op_ed'

    def __hash__(self):
        return int( hashlib.md5(self.raw.encode()).hexdigest(), 16)
        
    def __repr__(self):
        return self.raw
    
    def __str__(self):
        return self.raw


arts = [GuardianArt(lk) for lk in lks]

shit_list = ['xmedia', 'football', 'business', 'stage', 'sport'] 
arts = [art for art in arts 
             if art.label not in shit_list]

#tags = set([art.tag for art in arts])
misc = set([art.label for art in arts if art.tag == 'misc'])
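
The tagging rules in the class can also be written as a plain function, which makes them easy to try out on a few links. This is a sketch under that assumption. `classify` is a hypothetical name, and the video URL in the example is invented to exercise the `xmedia` branch.

```python
def classify(url):
    """Return (label, tag) for an article URL, following the rules in the class above."""
    parts = url.split('/')
    # parts[3] is the first path segment, e.g. 'world'; guard against short URLs.
    label = parts[3] if len(parts) > 3 else ''
    subsection = parts[4] if len(parts) > 4 else ''
    if subsection in ('audio', 'video', 'gallery'):
        return 'xmedia', 'xmedia'
    if label == 'world' or 'news' in label:
        return label, 'news'
    if label == 'commentisfree':
        return label, 'op_ed'
    return label, 'misc'

base = 'https://www.theguardian.com/'
for path in ('commentisfree/2019/aug/11/a',
             'us-news/2019/aug/12/b',
             'music/2019/aug/12/c',
             'world/video/2019/aug/12/d'):   # invented URL
    print(path.split('/')[0], classify(base + path))
```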

In [20]:
arts

[https://www.theguardian.com/commentisfree/2019/aug/11/boris-johnson-has-a-problem-with-women-its-not-the-one-you-think,
 https://www.theguardian.com/world/2019/aug/11/british-governments-hong-kong-intervention-riles-china,
 https://www.theguardian.com/society/2019/aug/11/medicinal-cannabis-the-hype-is-strong-but-the-evidence-is-weak,
 https://www.theguardian.com/commentisfree/2019/aug/11/good-old-days-look-deeper-and-myths-of-ideal-communities-fades,
 https://www.theguardian.com/us-news/2019/aug/11/iranian-woman-arrested-in-australia-pleads-guilty-to-conspiracy-in-us,
 https://www.theguardian.com/politics/2019/aug/11/sajid-javids-plan-to-flood-tills-with-brexit-50p-coins,
 https://www.theguardian.com/environment/2019/aug/11/extinction-rebellion-hitting-a-nerve-at-australias-climate-flashpoint,
 https://www.theguardian.com/world/2019/aug/11/i-will-eventually-get-killed-meet-bryan-kramer-papua-new-guineas-anti-corruption-tsar,
 https://www.theguardian.com/media/2019/aug/11/lights-dim-bu

# Loading the articles

This used to be hard, but Goose now does it for me.

In [29]:
from goose3 import Goose
#from  bs4 import BeautifulSoup

gg = Goose()

for art in arts:
    print(str(art).split('/')[-1])
    try:
        art.goose_art = gg.extract(url=str(art))
    except Exception as err:
        #report the failure instead of hiding it; goose_art stays None
        print('  failed:', err)
print('Done!')


pennsylvania-fire-five-children-die-in-fire-at-daycare-centre
uk-and-ireland-commemorate-1979-fastnet-race-disaster
love-yourself-more-k-pop-band-bts-take-extended-break-to-enjoy-life
joseph-fiennes-ive-done-my-bit-for-society-ive-illustrated-the-patheticness-of-misogyny
new-zealand-gun-buyback-10000-firearms-returned-after-christchurch-attack
vladimir-putin-west-russian-president-20-years
rock-sea-meaing-tel-aviv-beach-refugee
guatemala-elects-president-alejandro-giammattei-who-called-trump-immigration-deal-bad-news
myanmar-scrambles-to-rescue-flood-victims-as-landslide-kills-dozens
drug-maker-will-make-21bn-from-treating-cystic-fibrosis
behrouz-boochani-wins-25000-national-biography-award-and-accepts-via-whatsapp-from-manus
nigel-farage-prince-harry-meghan-markle-overweight-queen-mother-cpac-brexit
troubles-tourism-should-derry-be-celebrating-its-political-murals
russia-honours-national-heroes-killed-in-mysterious-nuclear-rocket-blast
australia-coal-use-is-existential-threat-to-pacif
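
Some downloads will fail on any given day, so it helps to keep the failures somewhere instead of dropping them silently. Here is a minimal sketch, assuming a hypothetical helper `extract_all` that takes the extractor as a parameter. `fake_extract` is a stand-in so the example runs without the network. In the notebook the callable would be something like `lambda url: gg.extract(url=url)`.

```python
def extract_all(urls, extract):
    """Call extract(url) for each URL and return (results, failures) as dicts."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = extract(url)
        except Exception as exc:
            # Keep going, but remember which link failed and why.
            failures[url] = exc
    return results, failures

def fake_extract(url):
    """Stand-in extractor for illustration; raises for URLs containing 'broken'."""
    if 'broken' in url:
        raise ValueError('could not fetch ' + url)
    return {'title': url.rsplit('/', 1)[-1]}

results, failures = extract_all(
    ['https://example.com/good-story', 'https://example.com/broken-story'],
    fake_extract)
print(sorted(results))
print({url: str(exc) for url, exc in failures.items()})
```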

# Output to HTML

There are libraries for generating HTML from Python,
but as often as not plain string formatting is easier.
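
One thing to watch with plain string formatting is that headlines can contain characters such as `&` or `<`. Here is a small sketch, assuming hypothetical templates `HEADLINE` and `TOC_ITEM` and a helper `render_entry`. It runs the text through the standard library's `html.escape` before formatting.

```python
import html

HEADLINE = '<h2 id="{id}">{title}</h2>'
TOC_ITEM = '<a href="#{id}">{title}</a><br>'

def render_entry(art_id, title):
    """Return (toc_line, headline) with the id and title escaped for HTML."""
    safe = {'id': html.escape(art_id), 'title': html.escape(title)}
    return TOC_ITEM.format(**safe), HEADLINE.format(**safe)

toc_line, headline = render_entry('art-1', 'Fish & chips: <b>a history</b>')
print(toc_line)
print(headline)
```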

In [30]:
header = '''<!DOCTYPE html>
<html id="js-context" class="js-off is-not-modern id--signed-out" lang="en" data-page-path="/international">
<head>
<title>News, sport and opinion from the Guardian's global edition | The Guardian</title>
<meta charset="utf-8">
</head>
'''  

headline_wrapper = '<h2 id="%s">%s</h2>'
keyword_wrapper = '<h4>%s</h4>'
para_wrapper = "<p>%s</p>\n\n"
toc_item_wrapper = '<a href="#%s">%s</a><br>\n'

sections = ['news', 'op_ed', 'misc']
text  = {x:[] for x in sections}
tocs  = {x:[] for x in sections}


for yy in arts:
    #the goose object is stored in the article
    art = yy.goose_art
    if art is None:
        #extraction failed for this link, so there is nothing to write
        continue
    
    html_id = (yy.id, art.title)
    tocs[yy.tag].append(html_id) 
    
    html_text = ''.join( [para_wrapper%x for x in art.cleaned_text.split('\n'*2)])
    text[yy.tag].append('\n\n'.join([ headline_wrapper%html_id, 
                                      keyword_wrapper%art.meta_keywords, 
                                      html_text]))

#write to a file; a context manager closes it even if a write fails
#the header declares utf-8, so open the file with the same encoding
with open('text.html', 'w', encoding='utf-8') as fp:
    fp.write(header + '<body>\n')
    
    for ss in sections:
        fp.write('<h1>%s</h1>'%ss.upper())
        toc_html = [toc_item_wrapper%x for x in tocs[ss] ]
        fp.write(''.join(toc_html) + '<br>\n')
        
    for ss in sections:
        fp.write('\n'.join(text[ss]) )

    fp.write('</body>\n</html>\n')
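
A quick sanity check on the result is that every table-of-contents link points at an id that actually exists in the page. Here is a sketch using the standard library's `html.parser`. The class `AnchorCheck`, the function `dangling_links` and the sample string are all made up for illustration.

```python
from html.parser import HTMLParser

class AnchorCheck(HTMLParser):
    """Collect element ids and the targets of in-page links."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.targets = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get('id'):
            self.ids.add(attrs['id'])
        href = attrs.get('href') or ''
        if tag == 'a' and href.startswith('#'):
            self.targets.add(href[1:])

def dangling_links(page):
    """Return the in-page link targets that have no matching id."""
    checker = AnchorCheck()
    checker.feed(page)
    return checker.targets - checker.ids

sample = '<a href="#a1">One</a><br><a href="#a2">Two</a><br><h2 id="a1">One</h2>'
print(dangling_links(sample))
```

For the real file, the same function could be called on the contents of `text.html` after it has been written.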