Add web scrape sensor #3841
@fabaff, thanks for your PR! By analyzing the history of the files in this pull request, we identified @balloob, @robbiet480 and @rmkraus to be potential reviewers. |
Not to nitpick but typically we used to call these things web "scrapers" and would like to suggest the platform be named "scrape" instead of "scrap" for clarity and recognition. I would love to see this added. |
Thanks, renamed. |
Thanks. Great addition. Is it possible to scrape JavaScript, Ajax generated values (dynamic) with this sensor? |
Instead of trying to create your own selecting method, let's use CSS selectors using beautifulsoup:
sensor:
  - platform: scrape
    resource: https://home-assistant.io
    select: ".current-version h1"
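For illustration, the CSS-selector approach suggested above can be sketched in plain Python with BeautifulSoup. This is a minimal sketch, not the code from this PR: the function name `scrape`, the inline HTML and the version string in it are made up for the example, and a real sensor would fetch the page from the configured resource.

```python
from bs4 import BeautifulSoup

# Inline stand-in for a fetched page. The markup and the version
# string are assumptions made up for this example.
HTML = """
<div class="current-version">
  <h1>Current Version: 1.2.3</h1>
</div>
"""


def scrape(html: str, selector: str) -> str:
    """Return the text of the first element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select(selector)
    if not matches:
        raise ValueError(f"no element matches {selector!r}")
    return matches[0].text.strip()


value = scrape(HTML, ".current-version h1")
print(value)
```

The selector string is the same one used in the configuration entry, so the selection logic lives entirely in the CSS selector rather than in custom options.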
Now it's beautifulsoup-only. Seems that |
@covrig, do you have an example? |
@fabaff Sure. A small question: what kind of output do you get with the scraper (number, string, etc.)? Can you do any math operations (comparisons) with the result (in case they are numbers)? Thanks. |
As long as you can use a CSS selector to identify the part you need, you should be OK. The output is a string. This doesn't really matter, since further processing can be done with a template sensor. |
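Since the output is a string, a numeric comparison needs a conversion step first. In Home Assistant that step would live in a template sensor, as noted above. The Python sketch below only illustrates the idea. The helper name `to_number` and the sample strings are hypothetical, and it assumes commas are thousands separators.

```python
import re


def to_number(state: str) -> float:
    """Extract the first decimal number from a scraped string.

    Assumes commas are thousands separators and simply drops them.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", state.replace(",", ""))
    if match is None:
        raise ValueError(f"no number found in {state!r}")
    return float(match.group())


# Hypothetical scraped state; once converted, comparisons work as usual.
price = to_number("Price: $1,299.50")
print(price < 1500)
```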
If you want to scrape dynamic content, you'll either have to track down the AJAX url of the data source, or effectively run a browser emulator like PhantomJS. |
Sounds like a great option for services that do not provide API access. Can this be used to, say, track prices on Amazon or IFTTT status? I think a few common examples will really help. |
@fabaff I completely forgot about the template sensor. Thanks again for the great work. |
The docs already contain a couple of examples. I added one for IFTTT status. |
_LOGGER.error(data)

try:
    self._state = data[self._element].text[self._before:self._after]
Let's simplify this a bit.
- Remove 'element'. People can specify this using the CSS selector :nth-child(). Instead, always pick the first one.
- Instead of 'before' and 'after', add support for a 'value_template'. This will give ultimate freedom and is in line with our other platforms.
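The simplification proposed here can be sketched roughly as follows. This is an assumption-laden illustration, not the code that was merged: `scrape_state` is a hypothetical helper, and the plain callable `value_fn` merely stands in for a value_template. The sketch relies on a BeautifulSoup version whose selector engine supports :nth-child().

```python
from typing import Callable, Optional

from bs4 import BeautifulSoup


def scrape_state(
    html: str,
    select: str,
    value_fn: Optional[Callable[[str], str]] = None,
) -> str:
    """Pick the first match for a CSS selector, then post-process it.

    value_fn is a hypothetical stand-in for a value_template.
    """
    soup = BeautifulSoup(html, "html.parser")
    element = soup.select_one(select)  # always the first match
    if element is None:
        raise ValueError(f"no element matches {select!r}")
    text = element.text
    return value_fn(text) if value_fn else text


# Made-up markup for the example.
HTML = "<ul><li>first: 10</li><li>second: 20</li><li>third: 30</li></ul>"

# No index option: the first match is used.
first = scrape_state(HTML, "li")
# :nth-child() replaces 'element'; the callable replaces 'before'/'after'.
second = scrape_state(HTML, "li:nth-child(2)", lambda t: t.split(": ")[1])
print(first, second)
```

Moving the element choice into the selector and the trimming into a post-processing step leaves the sensor itself with a single job: return the text of the first match.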
Done
and remove 'before', 'after' & 'element'
🐬 |
Description:
The web scrape sensor is built on top of the REST sensor; it retrieves a whole web page and extracts a value from it. Its limited features can't compete with a specialized tool like scrapy, but as a last resort it could be helpful. There are samples available in the docs PR with a bit of context (below you find the entries for the configuration.yaml file), and a Jupyter notebook shows some details.
Related issue (if applicable): fixes Web page parsing or web scraping sensor
Pull request in home-assistant.io with documentation (if applicable): home-assistant/home-assistant.io#1220
Example entry for configuration.yaml (if applicable):
Checklist:
If user exposed functionality or configuration variables are added/changed:
If the code communicates with devices, web services, or third-party tools:
tox run successfully. Your PR cannot be merged unless tests pass
REQUIREMENTS variable (example).
requirements_all.txt by running script/gen_requirements_all.py.
.coveragerc.