Add web scrape sensor #3841

Merged: 3 commits into home-assistant:dev on Oct 16, 2016

Conversation

@fabaff (Member) commented Oct 12, 2016

Description:
The web scrape sensor is built on top of the REST sensor; it retrieves a whole web page and extracts a single value from it. Its limited features can't compete with a specialized tool like Scrapy, but as a last resort it can be helpful. There are samples with a bit of context in the docs PR (below you find the entries for the configuration.yaml file), and a Jupyter notebook shows some details.

Related issue (if applicable): fixes Web page parsing or web scraping sensor

Pull request in home-assistant.io with documentation (if applicable): home-assistant/home-assistant.io#1220

Example entry for configuration.yaml (if applicable):

sensor:
  - platform: scrape
    resource: https://home-assistant.io
    name: Release
    select: ".current-version h1"
    value_template: '{{ value.split(":")[1] }}'
  - platform: scrape
    resource: https://home-assistant.io/components/
    name: Home Assistant impl.
    select: 'a[href="#all"]'
    value_template: '{{ value.split("(")[1].split(")")[0] }}'

Checklist:

If user-exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • Local tests with tox run successfully. Your PR cannot be merged unless tests pass
  • New dependencies have been added to the REQUIREMENTS variable (example).
  • New dependencies are only imported inside functions that use them (example).
  • New dependencies have been added to requirements_all.txt by running script/gen_requirements_all.py.
  • New files were added to .coveragerc.

@mention-bot commented:

@fabaff, thanks for your PR! By analyzing the history of the files in this pull request, we identified @balloob, @robbiet480 and @rmkraus to be potential reviewers.

@rpitera commented Oct 12, 2016

Not to nitpick, but these are typically called web "scrapers", so I'd suggest naming the platform "scrape" instead of "scrap" for clarity and recognition. I would love to see this added.

@fabaff changed the title from "Add web scrap sensor" to "Add web scrape sensor" on Oct 12, 2016
@fabaff (Member, Author) commented Oct 12, 2016

Thanks, renamed.

@covrig (Contributor) commented Oct 12, 2016

Thanks, great addition. Is it possible to scrape JavaScript/AJAX-generated (dynamic) values with this sensor?

@balloob (Member) commented Oct 12, 2016

Instead of creating your own selection method, let's use CSS selectors via BeautifulSoup's select method: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

sensor:
  - platform: scrape
    resource: https://home-assistant.io
    select: ".current-version h1"

@fabaff (Member, Author) commented Oct 13, 2016

Now it's BeautifulSoup-only. It seems select() will save a bunch of lines, as it does the same thing my xml2dict detour did.

@fabaff (Member, Author) commented Oct 13, 2016

@covrig, do you have an example?

@covrig (Contributor) commented Oct 13, 2016

@fabaff Sure.
Try scraping the numbers from here (table):
http://www.transelectrica.ro/widget/web/tel/sen-harta/-/harta_WAR_SENOperareHartaportlet
I already have a python script for this based on PhantomJS if it helps you.

A small question: what kind of output do you get from the scraper (number, string, etc.)? Can you do any math operations (comparisons) with the result, in case the values are numbers?

Thanks.

@fabaff (Member, Author) commented Oct 13, 2016

As long as you can use a CSS selector to identify the part you need, you should be OK.

The output is a string, but that doesn't really matter since any further processing can be done with a template sensor.
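
As an illustration of that last point (a sketch, not part of this PR), a template sensor can convert the scraped string into a number so comparisons and math work in automations; the entity name sensor.grid_load and the unit are hypothetical:

sensor:
  - platform: template
    sensors:
      grid_load_numeric:
        # sensor.grid_load would be a scrape sensor defined elsewhere;
        # the float filter turns its string state into a number
        value_template: '{{ states.sensor.grid_load.state | float }}'
        unit_of_measurement: 'MW'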

@happyleavesaoc (Contributor) commented:
If you want to scrape dynamic content, you'll either have to track down the AJAX URL of the data source or effectively run a browser emulator like PhantomJS.

@arsaboo (Contributor) commented Oct 13, 2016

Sounds like a great option for services that do not provide API access. Can this be used to, say, track prices on Amazon or the IFTTT status? I think a few common examples would really help.

@covrig (Contributor) commented Oct 13, 2016

@fabaff I completely forgot about the template sensor. Thanks again for the great work.

@fabaff (Member, Author) commented Oct 14, 2016

> I think a few common examples would really help.

The docs already contain a couple of examples. I added one for IFTTT status.
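
For reference, such an entry might look like the sketch below; the CSS selector is only an assumption about the markup of status.ifttt.com and may need adjusting:

sensor:
  - platform: scrape
    resource: https://status.ifttt.com/
    name: IFTTT status
    # '.component-status' is an assumed selector for the status page markup
    select: '.component-status'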

The review comment below refers to this part of the sensor code:

    _LOGGER.error(data)

    try:
        self._state = data[self._element].text[self._before:self._after]
A Member commented on this diff:

Let's simplify this a bit.

  • Remove element. People can specify this with the CSS selector :nth-child(); instead, always pick the first match.
  • Instead of before and after, add support for a value_template. This gives ultimate freedom and is in line with our other platforms (see the sketch below).
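
To make the two suggestions concrete, a configuration following them might look like this sketch (the target page, selector, and names are hypothetical):

sensor:
  - platform: scrape
    resource: https://www.example.com/stats
    name: Second row value
    # :nth-child() in the selector replaces the removed 'element' option;
    # the platform always picks the first match of the selector.
    select: 'table tr:nth-child(2) td'
    # value_template replaces the 'before'/'after' string slicing
    value_template: '{{ value | trim }}'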

@fabaff (Member, Author) replied:
Done

@balloob (Member) commented Oct 16, 2016

🐬

@balloob merged commit 71ee847 into home-assistant:dev on Oct 16, 2016.
@fabaff deleted the scrap-sensor branch on October 17, 2016.
@home-assistant locked and limited the conversation to collaborators on Mar 17, 2017.