Add web scrape sensor #3841

Merged: 3 commits into home-assistant:dev on Oct 16, 2016

Conversation

@fabaff (Member) commented Oct 12, 2016

Description:
The web scrape sensor is built on top of the REST sensor; it retrieves a whole web page and extracts a single value from it. Its limited features can't compete with a specialized tool like Scrapy, but as a last resort it can be helpful. There are samples with a bit of context in the docs PR (below you find the entries for the configuration.yaml file), and a Jupyter notebook shows some details.

Related issue (if applicable): fixes Web page parsing or web scraping sensor

Pull request in home-assistant.io with documentation (if applicable): home-assistant/home-assistant.io#1220

Example entry for configuration.yaml (if applicable):

sensor:
  - platform: scrape
    resource: https://home-assistant.io
    name: Release
    select: ".current-version h1"
    value_template: '{{ value.split(":")[1] }}'
  - platform: scrape
    resource: https://home-assistant.io/components/
    name: Home Assistant impl.
    select: 'a[href="#all"]'
    value_template: '{{ value.split("(")[1].split(")")[0] }}'

Checklist:

If user-exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • Local tests with tox run successfully. Your PR cannot be merged unless tests pass
  • New dependencies have been added to the REQUIREMENTS variable (example).
  • New dependencies are only imported inside functions that use them (example).
  • New dependencies have been added to requirements_all.txt by running script/gen_requirements_all.py.
  • New files were added to .coveragerc.

@mention-bot commented:

@fabaff, thanks for your PR! By analyzing the history of the files in this pull request, we identified @balloob, @robbiet480 and @rmkraus to be potential reviewers.

@rpitera commented Oct 12, 2016

Not to nitpick, but these are typically called web "scrapers", so I'd suggest naming the platform "scrape" instead of "scrap" for clarity and recognition. I would love to see this added.

@fabaff changed the title from "Add web scrap sensor" to "Add web scrape sensor" on Oct 12, 2016
@fabaff (Member, Author) commented Oct 12, 2016

Thanks, renamed.

@covrig (Contributor) commented Oct 12, 2016

Thanks, great addition. Is it possible to scrape JavaScript/AJAX-generated (dynamic) values with this sensor?

@balloob (Member) commented Oct 12, 2016

Instead of creating your own selection method, let's use CSS selectors via BeautifulSoup's select method: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

sensor:
  - platform: scrape
    resource: https://home-assistant.io
    select: ".current-version h1"

@fabaff (Member, Author) commented Oct 13, 2016

Now it's BeautifulSoup-only. It seems select() will save a bunch of lines, as it does the same thing my xml2dict detour did.

@fabaff (Member, Author) commented Oct 13, 2016

@covrig, do you have an example?

@covrig (Contributor) commented Oct 13, 2016

@fabaff Sure.
Try scraping the numbers from here (table):
http://www.transelectrica.ro/widget/web/tel/sen-harta/-/harta_WAR_SENOperareHartaportlet
I already have a python script for this based on PhantomJS if it helps you.

A small question: what kind of output do you get from the scraper (number, string, etc.)? Can you do any math operations (comparisons) with the result, in case the values are numbers?

Thanks.

@fabaff (Member, Author) commented Oct 13, 2016

As long as you can use a CSS selector to identify the part you need, you should be OK.

The output is a string, but that doesn't really matter since any further processing can be done with a template sensor.
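
As an illustration of that last point (a sketch, not part of this PR), a template sensor can convert the scraped string into a number so comparisons and math work in automations; the entity name sensor.grid_load and the unit are hypothetical:

sensor:
  - platform: template
    sensors:
      grid_load_numeric:
        # sensor.grid_load would be a scrape sensor defined elsewhere;
        # the float filter turns its string state into a number
        value_template: '{{ states.sensor.grid_load.state | float }}'
        unit_of_measurement: 'MW'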

@happyleavesaoc (Contributor) commented:
If you want to scrape dynamic content, you'll either have to track down the AJAX URL of the data source or effectively run a browser emulator like PhantomJS.

@arsaboo (Contributor) commented Oct 13, 2016

Sounds like a great option for services that do not provide API access. Can this be used to, say, track prices on Amazon or the IFTTT status? I think a few common examples would really help.

@covrig (Contributor) commented Oct 13, 2016

@fabaff I completely forgot about the template sensor. Thanks again for the great work.

@fabaff (Member, Author) commented Oct 14, 2016

> I think a few common examples would really help.

The docs already contain a couple of examples. I added one for IFTTT status.
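
For reference, such an entry might look like the sketch below; the CSS selector is only an assumption about the markup of status.ifttt.com and may need adjusting:

sensor:
  - platform: scrape
    resource: https://status.ifttt.com/
    name: IFTTT status
    # '.component-status' is an assumed selector for the status page markup
    select: '.component-status'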

The review comment below refers to this part of the sensor code:

    _LOGGER.error(data)

    try:
        self._state = data[self._element].text[self._before:self._after]
A Member commented on this diff:

Let's simplify this a bit.

  • Remove element. People can specify this with the CSS selector :nth-child(); instead, always pick the first match.
  • Instead of before and after, add support for a value_template. This gives ultimate freedom and is in line with our other platforms (see the sketch below).
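
To make the two suggestions concrete, a configuration following them might look like this sketch (the target page, selector, and names are hypothetical):

sensor:
  - platform: scrape
    resource: https://www.example.com/stats
    name: Second row value
    # :nth-child() in the selector replaces the removed 'element' option;
    # the platform always picks the first match of the selector.
    select: 'table tr:nth-child(2) td'
    # value_template replaces the 'before'/'after' string slicing
    value_template: '{{ value | trim }}'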

@fabaff (Member, Author) replied:
Done

@balloob (Member) commented Oct 16, 2016

🐬

@balloob merged commit 71ee847 into home-assistant:dev on Oct 16, 2016.
@fabaff deleted the scrap-sensor branch on October 17, 2016.
@home-assistant locked and limited the conversation to collaborators on Mar 17, 2017.