This repository has been archived by the owner on Dec 29, 2022. It is now read-only.

PROPOSAL: Get Title/Description/icon from site directly #482

Open
scottjenson opened this issue Aug 1, 2015 · 29 comments

Comments

@scottjenson
Contributor

BACKSTORY
One of the issues we have in supporting mDNS (a broadcast protocol over WiFi) is that many of the broadcast URLs are local IP addresses. This means the Physical Web Service (PWS) can't contact the site to fetch/cache the metadata (Title/Description/Icon).

Keep in mind that the Physical Web wants to be more than just BLE. One of the main reasons we added mDNS was to show that we can harvest URLs from a range of sources. We expect there to be other transports as well.

As the PWS can't get the Title/Description/Icon for the client, the client now needs to fetch it directly. One of the core principles of the Physical Web is that we try to build on top of the web as much as possible. So instead of having, for example, an mDNS-specific path to pass the Title/Description/Icon through, we feel it is much better to come up with a web-based approach which can work with any future transport that comes along.

The current way to find the Title/Description/Icon information is to have the PWS just download the entire page and parse through the HTML. This is useful for the simple reason that no site needs to change anything; the Physical Web works with sites as they are.

The problem is that we want to keep the effort/processing/data requirements of any client to a minimum. Of course a client can do anything it wants, and if someone wants to do the HTML traversal, please, go knock yourselves out. However, the proposal here is to provide something much lighter and easier to support. In fact, this should also work for any page, public or private; it doesn't have to be limited to local IP addresses.

ACTUAL PROPOSAL
Any website that wants to expose the Physical Web metadata specifically would do it through a RESTful call, so http://w.x.y.z/info (exact path is TBD) would return a JSON blob of text that would include these attributes. If this exists, it would supersede the HTML tag info. Like the path, the details of the JSON blob are TBD.

The purpose of this issue is to gather comments from the community. Assuming we get agreement on this REST/JSON approach, we'll move on to the details (exact path, JSON structure).
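For concreteness, a minimal sketch of what such a blob might contain; the path and every field name here are placeholders, since the issue explicitly leaves both TBD:

{
  "title": "Living Room Printer",
  "description": "Status page and print queue for the office printer",
  "icon": "/static/icon.png"
}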

@cqueern

cqueern commented Aug 1, 2015

Might it be better to piggyback on the Structured Data work that Google's pushing?

https://developers.google.com/structured-data/site-name
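For reference, the site-name markup that page describes is a small JSON-LD block along these lines (the values are the placeholders from Google's documentation):

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "WebSite",
  "name": "Your Site Name",
  "alternateName": "An alternative name",
  "url": "http://www.yoursite.com"
}
</script>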

Pros:

  • Don't need an entirely new paradigm for already attention-constrained site owners to consider
  • it's in JSON
  • I believe content management systems and their supporting ecosystems are already building in such support. I base this on Joost's SEO plugin, which is very popular in the WordPress community and includes basic structured data JSON integration
  • Some pretty wacky and/or dangerous stuff (keyword stuffing, crazy character encoding) winds its way into Page Title and Description HTML content
  • Getting site owners across the web to adopt an entirely new declaration in a new file is non-trivial (just think of how many site owners haven't even gotten around to implementing a robots.txt file)

Cons:

  • it requires parsing of the whole HTML page, which is the original problem I think you wanted to avoid

@scottjenson
Contributor Author

Good point. I agree we should definitely not reinvent the wheel. The JSON format for site name seems a good start. But you correctly point out that it's buried within a potentially large HTML file, which is far from helpful.

I'm hoping others know of a well-known RESTful pattern we could use to access this data.

@triblondon

The TAG has recently discussed a similar problem with loading metadata associated with CSV files, and one proposed solution that proved controversial was to load the metadata from a well-known URL. Initially this was relative to the URL of the CSV, but now seems to be site-wide:

http://www.w3.org/TR/tabular-data-model/#site-wide-location-configuration

I don't like this, as in principle I'm opposed to adding more things to the already overpopulated and nonsensical collection of 'well known urls'; the web exists to provide cross-references between resources exactly so that you don't need to know the path.

If you are to use a well-known URL it should at least be defined by RFC 5785, but my instinct is to prefer a request for the root path "/" combined with content negotiation to request the metadata that you want. My reasoning is that under the current system you are executing a GET on "/" already and parsing the HTML, and your problem is that HTML is not a great format for presenting this data, so you're essentially saying you want the same resource in a different format. That's what the Accept header is for. It's also backwards compatible without making a second request.
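A sketch of what that negotiation could look like on the wire; no media type for this metadata has been agreed anywhere, so application/ld+json below is purely a stand-in:

GET / HTTP/1.1
Host: 192.168.1.20
Accept: application/ld+json

HTTP/1.1 200 OK
Content-Type: application/ld+json

{"name": "...", "description": "...", "image": "..."}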

@devunwired
Contributor

I might just be piggybacking off what @cqueern said (I'm a bit out of my element here), but why not use one or all of the same methods used by App Indexing to provide deep link metadata? Something like Schema.org markup in the page headers or sitemaps.

I'm not familiar with whether or not any page markup solution presents the same HTML parsing problem mentioned before, but a sitemap would seem to be smaller and better structured if it does.

https://developers.google.com/app-indexing/reference/deeplinks

@scottjenson
Contributor Author

@triblondon The issue here is that for any public URL, you are correct, the PWS is doing a GET on the root URL and parsing the entire page. This has maximum compatibility, as the site has to do nothing at all; the PWS will parse and figure everything out.

The issue is with local URLs, which can't be reached. We're trying to avoid the full download/parsing effort on the client (especially if it has to do 20 of them!) so we're looking for a lightweight way for the client to use for some sites when there is no choice.

The proposal is some form of /properties which returns a simple JSON blob so the client can snarf it up quickly. This seems very simple; we're just making sure there aren't any existing systems in place (such as the Web of Things proposal).

@devunwired the answer is much the same: we feel strongly that we don't want the client downloading the entire page, so something direct and fast seems like a good idea here.

@cqueern

cqueern commented Aug 1, 2015

It seems like having the desired structured data as JSON in a flat file would work best. (@triblondon I'd love to check out a link to arguments against 'well known urls' if you have one handy!)

Many content management systems and plugins for those CMSes create sitemap.xml files automatically. It wouldn't be too much extra work to engineer them to create, as @scottjenson put it, 'http://w.x.y.z/info (exact path is TBD)' which 'would return a JSON blob of text that would include these attributes.'

I believe one of the metrics we care about most is ease of adoption. If the adoption rate of the sitemap.xml effort is an acceptable comparison, we should be able to get close by leveraging familiar schemes.

@danbri

danbri commented Aug 2, 2015

FWIW if there's anything missing from schema.org's vocabulary which would make things easier here, just file an issue (nearby in http://github.com/schemaorg/schemaorg).

If JSON inline in HTML is too much to handle, sticking JSON-LD in its own file with a link rel=meta or similar should work.
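A sketch of that external-file approach; the rel value follows danbri's "rel=meta or similar" and the filename is a placeholder:

<link rel="meta" type="application/ld+json" href="/metadata.jsonld">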

An aside re Google (although none of this is otherwise Google-specific): Google is currently focussed on JSON-LD within HTML, but that is not set in stone; and even then it will index the post-JavaScript DOM, so you could have a script load up the external metadata and inject it into the page to make it available for the various features listed at http://developers.google.com/structured-data/

@dinhvh
Contributor

dinhvh commented Aug 3, 2015

@danbri @cqueern which fields from the JSON-LD do you suggest using for a website?
"name" for the title, "description" for the snippet of the website, and "image" for the icon?

@cqueern

cqueern commented Aug 3, 2015

@dinhviethoa, would something like the format displayed in Rich Snippets for Articles work? I believe it's the closest thing to what we're looking for.
Here's an example:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "headline": "Article headline",
  "alternativeHeadline": "The headline of the Article",
  "image": [
    "thumbnail1.jpg",
    "thumbnail2.jpg"
  ],
  "datePublished": "2015-02-05T08:00:00+08:00",
  "description": "A most wonderful article",
  "articleBody": "The full body of the article"
}
</script>

"name" could be derived from Headline
"description" would be derived from description.
"image" would be derived from image.

@cqueern

cqueern commented Aug 8, 2015

Was just thinking about the different levels of sophistication of users out there. Consider...

Scenario one: I'm a large retail enterprise with a reasonably sophisticated CMS or dev team and my web platform can support on-demand RESTful JSON for different product categories or promotions that a shopper might find broadcast by beacons in different parts of my store.

Scenario two: I'm a small business owner and am savvy enough to put up a flat text file containing the appropriate Physical Web JSON markup (details TBD) at a well-known URL on my site. I don't have the resources to do anything more complicated.

Going back to @scottjenson's original proposal:

ACTUAL PROPOSAL
Any website that wants to expose the Physical Web metadata specifically would do it through a RESTful call, so http://w.x.y.z/info (exact path is TBD) would return a JSON blob of text that would include these attributes. If this exists, it would supersede the HTML tag info. Like the path, the details of the JSON blob are TBD.

Would the following be acceptable logic to accommodate the majority of use cases?

  1. Call http://w.x.y.z/info (exact path is TBD), which would return a JSON blob of text that would include the Title/Description/Icon attributes. If no response is available there...
  2. Call http://w.x.y.z/new-well-known-physical-web-URL.txt, which would return a more generic JSON blob of text that would include the Title/Description/Icon attributes (see the sketch below). If no response is available there...
  3. Fall back to having the PWS just download the entire page and parse through the HTML to find the Title/Description/Icon information.

While the logic above does require several network calls (increasing overhead), it lowers the bar for those who wish to participate. Just a thought...
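The flat file in step 2 could be nothing more than the same blob served statically; every value in this sketch is invented and the real structure is still TBD:

{
  "title": "Corner Bakery",
  "description": "Daily specials and opening hours",
  "icon": "http://w.x.y.z/icon.png"
}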

@scottjenson
Contributor Author

Call http://w.x.y.z/info (exact path is TBD), which would return a JSON blob of text that would include the Title/Description/Icon attributes. If no response is available there...

OK, that's the same as the proposal; the RESTful call is the first call we make.

Call http://w.x.y.z/new-well-known-physical-web-URL.txt, which would return a more generic JSON blob of text that would include the Title/Description/Icon attributes. If no response is available there...

This is your addition: a simple .txt file so people don't have to create a full RESTful interface. It's just a simple part of the website, a bit like robots.txt. However, if I understand how RESTful works, the path can resolve to a file, can't it? Depending on how you've configured your web server, "http://w.x.y.z/info" can return "http://w.x.y.z/info/index.html". If that is possible, then we probably don't need the .txt file at all.

Fall back to having the PWS just download the entire page and parse through the HTML to find the Title/Description/Icon information.

Well, that's the thing we're trying to avoid ;-) The issue is that the IP is a local address and the PWS doesn't have access, so we actually can't do this.

Am I missing something obvious about a RESTful call returning a simple .html file?

For what it's worth, I can see going with a simple .txt file instead of the RESTful call if people feel the RESTful approach is too much of a burden.

@cqueern

cqueern commented Aug 9, 2015

Well, that's the thing we're trying to avoid ;-) The issue is that the IP is a local address and the PWS doesn't have access, so we actually can't do this.

So are you suggesting that in no circumstance should HTML pages be valid sources of Title/Description/Icon? I'm not necessarily against that idea, just clarifying.

Depending on how you've configured your web server, "http://w.x.y.z/info" can return "http://w.x.y.z/info/index.html". If that is possible, then we probably don't need the .txt file at all.

True... but that would require some site owners to perform the extra step of that configuration. I'm still thinking of the mom-and-pop small business owners who have enough on their plate already, and of minimizing constraints to adoption of the Physical Web among users like them. Because sitemap.xml and robots.txt files should be familiar to most folks by now, doing something similar, like physicalweb.txt (or whatever we call it, including the file extension), would seem most easily adopted.

@scottjenson
Contributor Author

So are you suggesting that in no circumstance should HTML pages be valid sources of Title/Description/Icon? I'm not necessarily against that idea, just clarifying.

I struggle with this. It's clearly a standard web technique, as in nearly all cases the web page is being downloaded anyway. In our case, we are looking at potentially dozens of pages, and downloading them all could easily overwhelm the mobile client (especially since, most of the time, you won't even be picking the page at all!)

This is why we're trying to come up with a web appropriate way to get as much info about the server as possible, and why we're exploring this RESTful approach.

True... but that would require some site owners to perform the extra step of that configuration. I'm still thinking of the mom-and-pop small business owners who have enough on their plate already, and of minimizing constraints to adoption of the Physical Web among users like them.

This is interesting. I'm assuming this local IP address is almost always going to be a hardware product (like a TV). I really don't see how mDNS could ever be a mom-and-pop technology. They would most likely use a plug-in BLE wall wart or even just have their phone broadcast the URL (to a public website). It seems very unlikely a mom-and-pop shop would host their own website.

As to the configuration difficulties, my understanding is that what I described is the default behavior of Apache, so it really should only be as difficult as a) making the JSON, b) naming it index.html, and c) placing it in the 'info' directory (or whatever we call it). Of course, I want to make sure I'm not overlooking something here, but my understanding is that it should be nearly as easy as creating a .txt file.

But let me be clear, I'm not religious on this point. I'm just trying to avoid two rules. It just seems like having one would be better, especially if it's very easy to do.

@dermike
Contributor

dermike commented Aug 9, 2015

I know too little about performance in this case, but depending on whether it's the parsing or the HTTP request that's expensive, wouldn't it be possible to only fetch the first couple of kilobytes of the internal URL? Might not be fail-safe, but easier... Getting support for this in hardware products could be slow?

@scottjenson
Contributor Author

Worth considering! However, a local IP address means that the device in question has a web server built in, so it's already on a totally different level than a simplistic BLE beacon with just an ad packet. That's kind of my point: devices of this complexity really don't need to do much here, as they are already quite fully functional. Does that make sense?

@dermike
Contributor

dermike commented Aug 9, 2015

@scottjenson Yes, but I still think it will be a long time before my NAS maker adds this kind of thing to the firmware in addition to the mDNS... so a solution like this could be really slow to roll out, or maybe I'm just being a pessimist. ;)

@scottjenson
Contributor Author

Let me rephrase. If a device is broadcasting its local URL (e.g. 192.0.0.123), who is serving up that page? Keep in mind that it is perfectly acceptable for it to broadcast a public URL (mynas.com/modelXYZ) and not serve up a page at all.

The whole reason I'm proposing this is because we're getting asked by hardware makers that want to have a webserver device broadcast a local URL. In that case, we need to have a solution that mobile clients can find/browse easily.

You CAN'T have a local IP address and NOT serve up a web page. Does that help clarify?

@dermike
Contributor

dermike commented Aug 10, 2015

Not sure we're talking about the same thing... The point I was trying to make is that my NAS, which has a web server built in, is broadcasting its IP with mDNS (I guess, since it shows up in the PW app). That IP serves up the web-based admin page for all settings. To get the metadata other than from the HTML would require a firmware upgrade by the manufacturer, since I have no control of that server myself, even though it's my NAS. That was my point about possible slow adoption.

I might be confused about this topic though.

@triblondon

Scott, why is the processing overhead on the sensing device a problem? One would assume that the sensing device is a smartphone, so wouldn't it have a decent HTML parser in it already? I suspect I'm missing something. I think I missed the bit when the PWS was added and it feels like a weakness of the project to make a centralised PWS part of the workflow. I'd rather have my device connect directly to the objects that are broadcasting their existence. Is there a doc explaining the PWS role and why it's needed?

@scottjenson
Contributor Author

We have a README on the PWS here on GitHub. The PWS serves two purposes: speed and protection. Imagine if you can see 50 beacons. That would mean that to get the title/description/favicon you'd have to parse through 50 different web pages. That's a lot of data, especially if you want to pick only one. Besides, you would be contacting all 50 of those websites to download that information. By having the PWS there, it can contact the devices, cache the results, and not expose the user to fingerprinting. One last feature: the PWS can filter out spam and malicious websites, providing a layer of protection for the user.

However, let's be clear, this does create a single server through which all lookups flow, which isn't in the spirit of the web. That is why we've open sourced both the client and the PWS, to encourage others to make their own alternative scanners/PWS servers. We very much want there to be user choice in this area.

So if you want to have a simple client that gathers just the URLs, doesn't show any metadata, and provides a minimal but serverless UX, you are welcome to build that client. The PWS is meant to be an optional part of that experience.

However, for local IPs, the PWS can't reach those devices (it can only cache publicly reachable websites). Yet the same data issues remain: I don't want my phone to curl 50 local websites and parse through the HTML. It would be much better to just fetch the data I need from a single file. That is what we're discussing here: how to have the same data returned for a local device that can't be contacted by something like the PWS.

Does that make sense? Anything I missed?

@triblondon

Thanks for this, and sorry for being dim. Explaining it to me might help someone else, I suppose!

So you want to be able to make a single API call listing all the beacons you can see, and get an aggregated response that includes details on all of them. And this solves two problems - the HTTP/TCP/radio/battery/bandwidth overhead of the multiple connections that would otherwise be required, and the parsing of HTML in the response which is replaced by efficient parsing of JSON.
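Purely as an illustration of that aggregated exchange (the thread doesn't pin down any wire format for the PWS, so every field name here is invented):

{
  "request": { "urls": ["http://example.com/a", "http://example.com/b"] },
  "response": [
    { "url": "http://example.com/a", "title": "...", "description": "...", "icon": "..." },
    { "url": "http://example.com/b", "title": "...", "description": "...", "icon": "..." }
  ]
}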

Although private subnet addresses can't be reached by an aggregation service like that, you still want to get the second half of the solution above by standardising a well known path that can be used in place of the root path, and which is expected to return JSON. So you revert back to your 50 requests, but each is returning structured data.

So how does this relate to BLE beacons, where you might be able to detect 50 beacons all broadcasting the same host with a different path, and those paths when requested return distinct metadata? Would there be a priority to the metadata discovery process, along the lines of:

  1. If (BLE and address is public) use PWS to load the broadcasted path
  2. else if (BLE and address is local) fetch the broadcasted path directly
  3. else if (mDNS and address is public) use PWS which should try well known beacon info path, fall back to root path
  4. else if (mDNS and address is local) fetch well known beacon info path directly, fall back to fetching root path directly

@scottjenson
Contributor Author

If (BLE and address is public) use PWS to load the broadcasted path
else if (BLE and address is local) fetch the broadcasted path directly
else if (mDNS and address is public) use PWS which should try well known beacon info path, fall back to root path
else if (mDNS and address is local) fetch well known beacon info path directly, fall back to fetching root path directly

The problem with a BLE private subnet address is that it's the user's responsibility to be on the correct network. It's too easy to imagine that you're on the wrong one and can't get to it. That's why we're only considering this approach for mDNS (which implies you are already on a network). This is fairly conservative of course, but we'd rather not assume too much. Here is how we see the flow (up for comment):

  1. if BLE, require public website
  2. if mDNS, test for private subnet addresses (e.g. 192.168.x.x) and use this lookup
  3. else assume public website

However, your flow also suggests that if we have this root lookup method to return JSON, why NOT use it for all websites, even public ones? I agree, that is certainly a reasonable suggestion. We've just had to assume for the last year that websites were oblivious to us, and that assumption has probably sunk in too deep ;-) We've just assumed we'd ALWAYS have to scrape the page. This root lookup method was born out of the impossibility of reaching mDNS local IPs.

@triblondon

Yes, I think I was confused by why the need for a well known path was tightly bound to the issue of private URLs.

I had of course not considered that the user is not necessarily on the right network if BLE is broadcasting a private subnet address. Right, so there's basically no use case for that.

The issue for me is arbitrarily switching to a different path: in BLE scenarios, you have a path, so would you ever consider ignoring it and trying your special well-known one? Is the well-known path purely an answer to the mDNS problem of not having a path component in the discovered address? I don't have a better answer (other than content negotiation, but I concede that has its own problems and a higher barrier to entry), but if you do adopt a well-known path it seems important to me to clarify whether it is ONLY used for private hosts discovered via mDNS.

@scottjenson
Contributor Author

Agreed. We started down this path as mDNS clearly has a problem with private subnet addresses. The ORIGINAL idea was to stuff more into the mDNS protocol, and someone much wiser than me politely reminded me that we are a web project after all. So we came up with this 'put it in the server' approach, which certainly feels like a more mature solution (and allows other transports in the future to work just as easily).

The biggest concern is the additional complexity/burden. However, as we've discussed above, if you have a private subnet address and you are already going to the trouble of serving up a web page, adding a single text file at a path really isn't asking that much (we hope).

We could use this same approach for public websites but it doesn't appear to be a priority. If people ask for it, we'll certainly prioritize it.

I'm hoping we're getting close to wrapping this discussion up. Anyone else, if there are any additional concerns/issues, please raise them now.

@scottjenson
Contributor Author

For the record, we are working on this issue; it's just on a back burner for the moment. mDNS has turned out to be more complicated than we thought. However, it's still critical to the project that we support technologies other than just BLE. We will be getting back to this.

@danbri

danbri commented Sep 30, 2015

I'd suggest some basic cross-domain fields that give you whatever you need for general UI. And then we can look at more specific use cases, e.g. what we'd say for a http://schema.org/Restaurant homepage vs. a http://schema.org/Museum, etc.

@scottjenson
Contributor Author

At this point, we're just trying to reflect what the PWS needs, so it's very simple: title and description, that's all.

@rektide

rektide commented Oct 4, 2015

I really like @cqueern's thought of using Structured Data, which is simply a schema.org/WebSite. There are plenty of starting places in the WebSite entity, such as linking to more specific entities via the "about" field. This is exactly the sort of information I'd want if I were making link-local connections to ambient systems.

Using tabular data sounds like a not-so-great idea vs. JSON, but I appreciate that the tabular data spec mentioned by @triblondon is 'well known url' compliant (it roots in /.well-known).

Take common best practices for today, Schema.org and .well-known, and I don't think you can go wrong with this work.

@scottjenson
Contributor Author

@rektide @danbri it appears that schema.org/Thing has everything we need (URL, Name, Description, Image), so it's looking more and more like we can use that.
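A minimal sketch of what a device could serve using only schema.org/Thing properties (url, name, description, and image are all Thing properties; the values here are invented):

{
  "@context": "http://schema.org",
  "@type": "Thing",
  "url": "http://192.168.1.20/",
  "name": "Living Room TV",
  "description": "Web remote for the living room television",
  "image": "http://192.168.1.20/icon.png"
}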

The biggest issue we just recently discussed on Twitter is basic security. If you are on a trusted home network, where you've added all of the devices, this is fairly simple. The problem is if you are on an open wifi at a coffee shop. There you could find/interact with any mDNS device directly: there is no PWS proxy to filter/protect the user from malicious devices.

There are a few approaches we can take:

  • Don't worry about it
    Let the consumer beware. This seems counter to the strong protection that web browsers normally try to maintain.
  • Warn each time
    Every single time you use an mDNS beacon, warn the user. This clearly punishes home users, for whom this is safe, and like any overly protective warning system it will quickly just become ignored.
  • Warn and remember
    Warn the user to be careful if the network name is unknown. If the user proceeds, remember the network and do not ask again in the future.
  • Something else?
    Another approach we haven't thought of yet...

This is always hard, as nothing is perfect. However, I hate solutions that just throw up their hands and ask the user every single time. It's a legal CYA type of move that is, ultimately, only irritating and will just be ignored by the user anyway.
