# Easy Scraping

Demo: scrapely, python-readability, PyQuery, httpie

Prerequisites:

* Python 3
* `pip install -r requirements.txt`

A useful trick in the IPython notebook: render HTML inline as a cell's output.

In [87]:
import pprint
from IPython.core.display import HTML

In [2]:
HTML('Logo of Initium Lab: <img src="%s">' % 'http://initiumlab.com/favicon-32x32.png')

A small hack that stops long output areas from collapsing into a scroll box:

In [3]:
%%javascript
//IPython.OutputArea.auto_scroll_threshold = 9999;
IPython.OutputArea.prototype._should_scroll = function(){return false;}

<IPython.core.display.Javascript object>

## Readability

We use a version ported to Python 3:
<https://github.com/hyperlinkapp/python-readability>
(already included in the `requirements.txt` file)

In [4]:
from readability.readability import Document
import requests
# Fetch the raw page bytes.
html = requests.get('http://initiumlab.com/').content
# Parse once and reuse the same Document for the body and the title.
doc = Document(html)
readable_article = doc.summary()
readable_title = doc.short_title()

In [5]:
print(readable_article)

<html><body><div><div class="post-body">

      
      

      
        
          <video controls="" poster="./blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png"><br/>  <source src="./blog/20150922-jackathon3-review/jackathon3-timelapse.mp4" type="video/mp4"/><br/>  <source src="./blog/20150922-jackathon3-review/jackathon3-timelapse.webm" type="video/webm"/><br/>  Sorry, you browser does not support HTML5 video.<br/></video>

<p>The video is also available on <a href="https://youtu.be/zFeSh2W1_C8">YouTube</a> and <a href="http://v.youku.com/v_show/id_XMTM0MzM1MjEwMA==.html?from=y1.7-2">Youku</a>.</p>
<h2 id="What_did_we_do?">What did we do?</h2><p>Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining data, analysing information, and reporting.</p>
<p>This week, the goal for each participant is to read one the t

In [6]:
HTML(readable_article)

## PyQuery

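The extracted HTML still carries relative URLs (the video poster and its sources), which do not resolve when the article is rendered inside the notebook. Let's rewrite them as absolute URLs with PyQuery.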

In [7]:
import pyquery
r = pyquery.PyQuery(readable_article)
r('p')

[<p>, <p>, <p>]

In [8]:
r('video').attr('poster')

'./blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png'

In [9]:
r('video source').attr('src')

'./blog/20150922-jackathon3-review/jackathon3-timelapse.mp4'

In [10]:
r('video').attr('poster', 'http://initiumlab.com/%s' % r('video').attr('poster'))

[<video>]

In [11]:
r('video').attr('poster')

'http://initiumlab.com/./blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png'

In [12]:
r('video source').attr('src', 'http://initiumlab.com/%s' % r('video source').attr('src'))

[<source>, <source>]

In [13]:
r('video source').attr('src')

'http://initiumlab.com/./blog/20150922-jackathon3-review/jackathon3-timelapse.mp4'

In [14]:
r.html()

'<body><div><div class="post-body">\n\n      \n      \n\n      \n        \n          <video controls="" poster="http://initiumlab.com/./blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png"><br/>  <source src="http://initiumlab.com/./blog/20150922-jackathon3-review/jackathon3-timelapse.mp4" type="video/mp4"/><br/>  <source src="http://initiumlab.com/./blog/20150922-jackathon3-review/jackathon3-timelapse.mp4" type="video/webm"/><br/>  Sorry, you browser does not support HTML5 video.<br/></video>\n\n<p>The video is also available on <a href="https://youtu.be/zFeSh2W1_C8">YouTube</a> and <a href="http://v.youku.com/v_show/id_XMTM0MzM1MjEwMA==.html?from=y1.7-2">Youku</a>.</p>\n<h2 id="What_did_we_do?">What did we do?</h2><p>Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining data, analysing information, and reporting

In [16]:
HTML(r.html())
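
Two details in the cells above are worth a second look. Prefixing the site root by hand leaves a stray `./` in each URL, and because `.attr(name, value)` sets the attribute on every matched element, both `<source>` tags end up pointing at the `.mp4` file (visible in the `r.html()` output). A minimal sketch of one alternative follows, assuming the base URL is the page we fetched; `absolutize` and `absolutize_attr` are hypothetical helper names, not part of PyQuery, and `doc` is expected to be a PyQuery object such as `r`.

```python
from urllib.parse import urljoin

# Assumption: the article was fetched from the site root, so relative
# URLs are resolved against it.
BASE_URL = 'http://initiumlab.com/'

def absolutize(base_url, relative_url):
    """Resolve a possibly relative URL against base_url, dropping './'."""
    return urljoin(base_url, relative_url)

def absolutize_attr(doc, selector, attr_name, base_url=BASE_URL):
    """Rewrite attr_name on each matched element, one element at a time.

    `doc` is expected to be a pyquery.PyQuery object; .items() yields one
    PyQuery wrapper per matched element, so each element keeps its own value.
    """
    for element in doc(selector).items():
        value = element.attr(attr_name)
        if value:
            element.attr(attr_name, absolutize(base_url, value))

print(absolutize(BASE_URL, './blog/20150922-jackathon3-review/jackathon3-timelapse.webm'))
# http://initiumlab.com/blog/20150922-jackathon3-review/jackathon3-timelapse.webm
```

On a freshly parsed `r`, the calls would be `absolutize_attr(r, 'video', 'poster')` and `absolutize_attr(r, 'video source', 'src')`. URLs that are already absolute pass through `urljoin` unchanged.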

## Scrapely

Scrapely learns an extraction template from one annotated example page and then applies it to pages with a similar layout.

In [17]:
from scrapely import Scraper
s = Scraper()

In [26]:
help(s.train)

Help on method train in module scrapely:

train(url, data, encoding=None) method of scrapely.Scraper instance



In [102]:
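import os
from urllib import parse
def get_localhost_url(url):
    # Cache the page under tmp/ and return the URL at which the local
    # notebook server (port 8888) serves that file.
    filename = parse.quote_plus(url)
    fullpath = 'tmp/%s' % filename
    os.makedirs('tmp', exist_ok=True)  # make sure the cache directory exists
    html = requests.get(url).content
    with open(fullpath, 'wb') as f:  # close the file once it is written
        f.write(html)
    return 'http://localhost:8888/files/%s?download=1' % parse.quote_plus(fullpath)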
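
`get_localhost_url` maps each remote URL to a file under `tmp/` and then to a URL on the local notebook server. The sketch below reproduces only that name mapping, with no network or disk access, to show what the two `quote_plus` calls produce. `local_paths` is a hypothetical name; the port `8888` and the `/files/` prefix are copied from the helper and assume a notebook server running with default settings.

```python
from urllib import parse

def local_paths(url):
    """Return (path on disk, local URL) the way get_localhost_url builds them."""
    # First quoting: ':' and '/' become %3A and %2F, giving a flat file name.
    filename = parse.quote_plus(url)
    fullpath = 'tmp/%s' % filename
    # Second quoting: the '%' signs themselves are escaped as %25, so the
    # server decodes the request back to the literal file name on disk.
    local_url = 'http://localhost:8888/files/%s?download=1' % parse.quote_plus(fullpath)
    return fullpath, local_url

fullpath, local_url = local_paths('http://initiumlab.com/blog/20150916-legco-eng/')
print(fullpath)
# tmp/http%3A%2F%2Finitiumlab.com%2Fblog%2F20150916-legco-eng%2F
print(local_url)
# http://localhost:8888/files/tmp%2Fhttp%253A%252F%252Finitiumlab.com%252Fblog%252F20150916-legco-eng%252F?download=1
```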

In [103]:
training_url = 'http://initiumlab.com/blog/20150916-legco-eng/'
training_data = {'title': 'Legco Matrix Brief (English)', 
                 'author': 'Initium Lab', 
                 'date': '2015-09-16'}
s.train(get_localhost_url(training_url), training_data)

In [104]:
testing_url = 'http://initiumlab.com/blog/20150901-data-journalism-for-the-blind/'
s.scrape(get_localhost_url(testing_url))

[{'author': ['Andy Shu'],
  'date': ['\n            2015-09-01\n          '],
  'title': [' 可視化火了 盲人怎麼辦 | Initium Lab ']}]

In [105]:
testing_url = 'http://initiumlab.com/blog/20150922-jackathon3-review/'
s.scrape(get_localhost_url(testing_url))

[{'author': ['Initium Lab'],
  'date': ['\n            2015-09-22\n          '],
  'title': [' Jackathon #3 -- Read a data science book in 8 hours | Initium Lab ']}]
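
The scraped values come back as lists of raw strings: the dates carry surrounding whitespace and the titles still end with the site name. A minimal post-processing sketch, assuming each field holds a single value, the date is in `YYYY-MM-DD` form, and the title suffix is ` | Initium Lab` as in the output above; `clean_record` is a hypothetical helper, not part of Scrapely.

```python
from datetime import date, datetime

def clean_record(record, site_suffix=' | Initium Lab'):
    """Flatten one scraped record into stripped, single-valued fields."""
    cleaned = {}
    for field, values in record.items():
        # Keep the first extracted value and trim surrounding whitespace.
        text = values[0].strip() if values else ''
        # Assumption: titles end with the site name, which we drop.
        if field == 'title' and text.endswith(site_suffix):
            text = text[:-len(site_suffix)].strip()
        cleaned[field] = text
    if cleaned.get('date'):
        # Assumption: dates are always formatted as YYYY-MM-DD.
        cleaned['date'] = datetime.strptime(cleaned['date'], '%Y-%m-%d').date()
    return cleaned

record = {'author': ['Initium Lab'],
          'date': ['\n            2015-09-22\n          '],
          'title': [' Jackathon #3 -- Read a data science book in 8 hours | Initium Lab ']}
print(clean_record(record))
```

Applied to a whole result, this would be `[clean_record(rec) for rec in s.scrape(...)]`.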