## Scrapely notes


https://blog.scrapinghub.com/2016/07/07/scrapely-the-brains-behind-portia-spiders/

> This approach is handy as you don’t need a well-defined HTML page. It instead relies on the order of tags on a page. Another useful feature of this approach is that Scrapely doesn’t need to find a 100% match, and instead looks for the best match. Even if the page is updated and tags are changed, Scrapely can still extract the data.

* Supports partial matching. It is strict about false positives, which makes it prone to false negatives.
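
The trade-off in the bullet above can be illustrated with a toy best-match selector. This is only a sketch built on Python's `difflib`, not Scrapely's actual algorithm; the `best_match` helper, the `threshold` parameter, and the tag sequences are all made up for illustration. A candidate is accepted when its similarity to the trained tag order clears the threshold, so a partial match can still succeed. Raising the threshold lowers the risk of a false positive but makes a miss more likely.

```python
from difflib import SequenceMatcher

def best_match(trained, candidates, threshold=0.6):
    """Return the name of the candidate tag sequence closest to `trained`,
    or None when even the best candidate scores below `threshold`."""
    best_name, best_score = None, 0.0
    for name, tags in candidates.items():
        score = SequenceMatcher(None, trained, tags, autojunk=False).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Hypothetical tag order seen around the annotated field at training time.
trained = ["div", "h1", "span", "p"]

# Hypothetical regions of a new page whose markup has changed slightly.
candidates = {
    "nav": ["ul", "li", "li", "a"],
    "headline": ["div", "h1", "em", "p"],
    "footer": ["div", "p"],
}

print(best_match(trained, candidates))                  # partial match accepted
print(best_match(trained, candidates, threshold=0.9))   # stricter: nothing matches
```

With the default threshold the "headline" region is picked even though one tag changed; with the stricter threshold the same page yields no match at all.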

Other observations:

* Infinite scroll adds another barrier to downloading and parsing pages article by article.


## Metadata

In [3]:
from scrapely import Scraper

In [17]:
s = Scraper()
training_url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
training_data = {'title': '港特首梁振英就住宅违建事件道歉', 
                 'date': '2012年12月10日'}
s.train(training_url, training_data)

In [18]:
testing_url = 'http://www.bbc.com/zhongwen/simp/chinese_news/2012/12/121206_hongkong_illegal_structure.shtml'
s.scrape(testing_url)

[{'date': ['更新时间'], 'title': ['梁振英定下出席违建事件质询日期']}]

## Body

In [19]:
s = Scraper()
training_url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
training_data = {'title': '港特首梁振英就住宅违建事件道歉', 
                 'body': '梁振英说，其位于太平山山顶的住宅内的违建部分大都不是由他所建，此前没有马上公开交待和处理，是因为律师意见认为司法程序仍在进行，他不应评论'}
s.train(training_url, training_data)

testing_url = 'http://www.bbc.com/zhongwen/simp/chinese_news/2012/12/121206_hongkong_illegal_structure.shtml'
s.scrape(testing_url)

[{'body': ['港府落实行政长官梁振英出席立法会答问大会，就其大宅的违章建筑问题（香港称僭建）接受质询的时间。</p>\n                     <p>香港立法会内务委员会主席梁君彦星期四（12月6日）在与政务司司长林郑月娥会面后宣布，梁振英将于下星期一（10日）到立法会答辩。</p>\n                     <p>梁振英此前承认参选前就知道住宅的违建问题，引发政界人士质疑其诚信。</p>\n                     <p>香港媒体分析说，梁振英将能赶及在民主党下星期三（12日）对他提出不信任动议之前回应违建事件。'],
  'title': ['梁振英定下出席违建事件质询日期']}]

In [25]:
s = Scraper()
training_url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
training_data = {'title': '港特首梁振英就住宅违建事件道歉', 
                 'body': '民主党星期四公布的民调报告称，约58%的受访民众认为梁振英在违建事件上隐瞒了实情，23%认为他没有隐瞒；40%受访者支持对梁振英提出不信任动议，33%反对'}
s.train(training_url, training_data)

testing_url = 'http://www.bbc.com/zhongwen/simp/chinese_news/2012/12/121206_hongkong_illegal_structure.shtml'
s.scrape(testing_url)

FragmentNotFound: Fragment not found annotating 'body' using: <function best_match.<locals>.func at 0x1118cdd08>

## Metadata, adjusted labelling

Scrapely fails to find a precise match. The element it annotates is one level above the intended one, i.e. the `<span>` that begins with 更新时间 ("last updated") in this case. Labelling both the date and the time therefore causes a conflict, because both values end up in the same element.

In [20]:
s = Scraper()
training_url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
training_data = {'title': '港特首梁振英就住宅违建事件道歉', 
                 'date': '2012年12月10日',
                 'time': '格林尼治标准时间10:41'}
s.train(training_url, training_data)

testing_url = 'http://www.bbc.com/zhongwen/simp/chinese_news/2012/12/121206_hongkong_illegal_structure.shtml'
s.scrape(testing_url)

FragmentAlreadyAnnotated: Fragment already annotated: <span class="lastupdated" data-scrapy-annotate="{&quot;annotations&quot;: {&quot;content&quot;: &quot;date&quot;}}">

In [22]:
s = Scraper()
training_url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
training_data = {'title': '港特首梁振英就住宅违建事件道歉', 
                 'date': '更新时间 2012年12月10日, 格林尼治标准时间10:41'}
s.train(training_url, training_data)

testing_url = 'http://www.bbc.com/zhongwen/simp/chinese_news/2012/12/121206_hongkong_illegal_structure.shtml'
s.scrape(testing_url)

FragmentNotFound: Fragment not found annotating 'date' using: <function best_match.<locals>.func at 0x1118cdbf8>
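
Since neither labelling works on this page, one possible fallback, not tried in these notes, is to obtain the whole "last updated" line by some other means and split it afterwards. The sketch below assumes the line always has the shape of the string used in the last cell; `split_timestamp` and the pattern are hypothetical.

```python
import re

# Assumed format: "<label> <YYYY年M月D日>, <time-zone label><HH:MM>"
TIMESTAMP_RE = re.compile(
    r"(?P<date>\d{4}年\d{1,2}月\d{1,2}日)\s*,\s*(?P<time>\S*?\d{1,2}:\d{2})"
)

def split_timestamp(text):
    """Split a 'last updated' line into its date and time parts.

    Returns None when the text does not follow the assumed format.
    """
    match = TIMESTAMP_RE.search(text)
    if match is None:
        return None
    return {"date": match.group("date"), "time": match.group("time")}

print(split_timestamp('更新时间 2012年12月10日, 格林尼治标准时间10:41'))
```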

## Metadata: a modern example

Many older web pages lack a clear HTML structure, which hinders Scrapely's parsing. Let's try a more modern example.

### Hitting an HTTP request wall: the site has some anti-crawling measures

In [29]:
s = Scraper()
training_url = 'http://www.scmp.com/week-asia/opinion/article/2152071/see-why-trumps-tariffs-have-hit-chinese-nerve-read-history'
training_data = {'title': 'TO SEE WHY TRUMP’S TARIFFS HAVE HIT A CHINESE NERVE, READ HISTORY', 
                 'date': '24 JUN 2018',
                 'author': 'RANA MITTER',
                 'shares': '4'}
s.train(training_url, training_data)

testing_url = 'http://www.scmp.com/news/china/diplomacy-defence/article/2152195/chinese-leaders-absolutely-confused-trumps-demands'
s.scrape(testing_url)

HTTPError: HTTP Error 405: Method Not Allowed
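
A 405 here may mean the server rejects the default client signature rather than the URL itself; that is a guess, not something these notes confirm. A minimal sketch of sending a browser-like `User-Agent` with the standard library follows. The header value and the helper names `build_request` and `fetch_html` are assumptions for illustration, and there is no guarantee this gets past the site's checks.

```python
from urllib.request import Request, urlopen

# Assumed header value; any realistic browser string could be used instead.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)"

def build_request(url, user_agent=BROWSER_UA):
    """Build a GET request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, encoding="utf-8", timeout=10):
    """Download a page and decode it to text.

    Raises urllib.error.HTTPError if the server still answers with 4xx/5xx.
    """
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode(encoding, errors="replace")

# Inspect the request without touching the network.
req = build_request("http://www.scmp.com/")
print(req.get_method(), req.get_header("User-agent"))
```

The decoded text returned by `fetch_html` could then be handed to whichever parsing step comes next.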

### An HTML interface exists, but it expects Scrapely's own page object; the first attempt fails

In [30]:
help(s.train_from_htmlpage)

Help on method train_from_htmlpage in module scrapely:

train_from_htmlpage(htmlpage, data) method of scrapely.Scraper instance



In [31]:
import requests

You cannot just pass `r`, `r.content`, or `r.text`, as you normally would with other parsing libraries.

In [38]:
training_url = 'http://www.scmp.com/week-asia/opinion/article/2152071/see-why-trumps-tariffs-have-hit-chinese-nerve-read-history'
#r.content

s = Scraper()
training_url = 'http://www.scmp.com/week-asia/opinion/article/2152071/see-why-trumps-tariffs-have-hit-chinese-nerve-read-history'
r = requests.get(training_url)
training_data = {'title': 'TO SEE WHY TRUMP’S TARIFFS HAVE HIT A CHINESE NERVE, READ HISTORY', 
                 'date': '24 JUN 2018',
                 'author': 'RANA MITTER',
                 'shares': '4'}
s.train_from_htmlpage(r, training_data)

AttributeError: 'Response' object has no attribute 'parsed_body'

In [39]:
r.encoding

'utf-8'

In [40]:
testing_url = 'http://www.scmp.com/news/china/diplomacy-defence/article/2152195/chinese-leaders-absolutely-confused-trumps-demands'
#s.scrape(testing_url)

### Continuing: assembling an `HtmlPage` object

This is not well documented. There are clues in issue https://github.com/scrapy/scrapely/issues/17

In [63]:
from scrapely.htmlpage import HtmlPage

In [64]:
s = Scraper()

training_url = 'http://www.scmp.com/week-asia/opinion/article/2152071/see-why-trumps-tariffs-have-hit-chinese-nerve-read-history'
r = requests.get(training_url)
h = HtmlPage(body=r.text)

training_data = {'title': 'To see why Trump’s tariffs have hit a Chinese nerve, read history', 
                 'date': '24 Jun 2018',
                 'author': 'Rana Mitter'}
s.train_from_htmlpage(h, training_data)

In [65]:
open('test.html', 'w').write(r.text)

88469

In [66]:
testing_url = 'http://www.scmp.com/news/china/diplomacy-defence/article/2152195/chinese-leaders-absolutely-confused-trumps-demands'
r = requests.get(testing_url)
h = HtmlPage(body=r.text)
s.scrape_page(h)

[{'author': ['have said']}]

In [67]:
open('test.html', 'w').write(r.text)

195157

Extraction failed; the page structure seems to be different. The downloaded training page does not have styles.

It may be due to the "infinite scroll" feature.
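
One way to check the suspicion that the two downloads differ structurally is to compare their opening-tag sequences. The sketch below uses only the standard library; `TagCollector`, `tag_sequence`, and `structure_similarity` are hypothetical helpers and the HTML snippets are made up. In practice the inputs would be the `r.text` of the training and testing pages.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record the name of every opening tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tag_sequence(html):
    """Return the list of opening-tag names found in `html`."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags

def structure_similarity(html_a, html_b):
    """Similarity in [0, 1] between the opening-tag sequences of two pages."""
    matcher = SequenceMatcher(
        None, tag_sequence(html_a), tag_sequence(html_b), autojunk=False
    )
    return matcher.ratio()

page_a = "<html><body><h1>Title</h1><span>Date</span><p>Body</p></body></html>"
page_b = "<html><body><h1>Other</h1><span>Date</span><p>Text</p></body></html>"
page_c = "<html><body><ul><li>One</li><li>Two</li></ul></body></html>"

print(structure_similarity(page_a, page_b))  # same layout, different text
print(structure_similarity(page_a, page_c))  # different layout
```

A low score between the training and testing downloads would support the idea that the template learned from one page cannot be applied to the other.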

### Try training on a different page

* The title can be parsed.
* The date cannot be parsed.


In [70]:
s = Scraper()

training_url = 'http://www.scmp.com/news/china/diplomacy-defence/article/2152195/chinese-leaders-absolutely-confused-trumps-demands'
r = requests.get(training_url)
h = HtmlPage(body=r.text)

training_data = {'title': 'Chinese leaders ‘absolutely confused’ by Trump’s demands on trade', 
                 'date': 'PUBLISHED : Sunday, 24 June, 2018, 11:15am'}
s.train_from_htmlpage(h, training_data)

In [73]:
#help(s.add_template)

In [71]:
testing_url = 'http://www.scmp.com/news/china/policies-politics/article/2151563/chinese-hackers-targeting-satellite-and-defense-firms'
r = requests.get(testing_url)
h = HtmlPage(body=r.text)
s.scrape_page(h)

[{'title': ['Chinese hackers targeting satellite and defense firms, researchers find']}]
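
The outcome noted at the top of this section (title parsed, date missing) can also be checked programmatically. A small sketch, assuming the result keeps the list-of-dicts shape of the output above; `missing_fields` is a hypothetical helper, not part of Scrapely.

```python
def missing_fields(expected, results):
    """Return the expected field names that no result record provides a value for."""
    found = set()
    for record in results:
        found.update(name for name, values in record.items() if values)
    return sorted(set(expected) - found)

# The scrape result copied from the cell above.
results = [{'title': ['Chinese hackers targeting satellite and defense firms, researchers find']}]

print(missing_fields(['title', 'date'], results))  # → ['date']
```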

In [74]:
from scrapely import TemplateMaker

In [76]:
help(TemplateMaker)

Help on class TemplateMaker in module scrapely.template:

class TemplateMaker(builtins.object)
 |  Methods defined here:
 |  
 |  __init__(self, htmlpage)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  annotate(self, field, score_func, best_match=True)
 |      Annotate a field.
 |      
 |      ``score_func`` is a callable that receives two arguments: (fragment,
 |      htmlpage) and returns a relevancy score (float) indicating how relevant
 |      is the fragment. 0 means the fragment is irrelevant. Higher scores
 |      means the fragment is more relevant. Otherwise, the closest opening tag
 |      (to the left) is annotated with the given attribute.
 |      
 |      If ``best_match`` is ``True``, only the best fragment is annotated.
 |      Otherwise, all fragments (with a positive relevancy) are annotated.
 |  
 |  annotate_fragment(self, index, field)
 |  
 |  annotations(self)
 |      Return all annotations contained in the template as a list of t