Thought experiment: using schema.org + JSON-LD instead #49

Closed
darobin opened this issue May 19, 2017 · 42 comments

@darobin

darobin commented May 19, 2017

First: don't shoot me. This is just a thought experiment, but one that I believe may prove useful.

My reasoning is as follows:

Downside:

  • It's a bit more verbose (but not that much).

This is just a translation of the example from the site's front page to give a feel for the result. If there's interest we can dig through more details.

{
  "@type": "Blog",
  "name": "My Example Feed",
  "mainEntity": "https://example.org/",
  "url": "https://example.org/feed.json",
  "blogPost": [
    {
      "@type": "BlogPosting",
      "@id": "https://example.org/second-item",
      "articleBody": "This is a second item.",
      "url": "https://example.org/second-item"
    },
    {
      "@type": "BlogPosting",
      "@id": "https://example.org/initial-post",
      "articleBody": { "@type": "rdf:HTML", "@value": "<p>Hello, world!</p>" },
      "url": "https://example.org/initial-post"
    }
  ]
}
@gruber

gruber commented May 19, 2017

"articleBody": "This is a second item.",
"articleBody": { "@type": "rdf:HTML", "@value": "<p>Hello, world!</p>" },

That's more than just "more verbose" -- it's more complex. A feed reader wouldn't be able to make any assumptions about the content of articleBody; it would have to inspect each one.

@darobin
Author

darobin commented May 19, 2017

@gruber I'm not sure what you mean about "inspecting". If the articleBody is a string, then it's text. If it's an object then the @type will tell you what it is — an option we could (and probably should) limit to HTML. That's no more inspection than checking to see if there is a content_text or a content_html.
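
To make the comparison concrete, here is a minimal sketch of the two checks side by side. It assumes the body is either a plain string or an object carrying "@type" and "@value", as in the example above; the helper names `readArticleBody` and `readJsonFeedContent` and the returned shape are hypothetical and come from neither format.

```javascript
// Hypothetical helper: classify a schema.org-style articleBody value.
// Assumption: a string means plain text; an object with "@type": "rdf:HTML"
// carries HTML in "@value". Anything else is reported as unknown.
function readArticleBody(articleBody) {
  if (typeof articleBody === 'string') {
    return { kind: 'text', value: articleBody };
  }
  if (articleBody && typeof articleBody === 'object' &&
      articleBody['@type'] === 'rdf:HTML' &&
      typeof articleBody['@value'] === 'string') {
    return { kind: 'html', value: articleBody['@value'] };
  }
  return { kind: 'unknown', value: null };
}

// The JSON Feed counterpart: choose between two sibling fields.
function readJsonFeedContent(item) {
  if (typeof item.content_html === 'string') {
    return { kind: 'html', value: item.content_html };
  }
  if (typeof item.content_text === 'string') {
    return { kind: 'text', value: item.content_text };
  }
  return { kind: 'unknown', value: null };
}
```

Both functions end up with the same two branches plus a fallback, which is the point being argued here; whether one reads more clearly than the other is a matter of taste.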

@hsivonen

This isn't the first time RDF is pushed onto feeds. It might be worthwhile to reflect on how well the RSS 1.0 effort did compared to RSS 0.92 and Atom and how often RSS 1.0 was actually consumed using RDF tooling.

Several issues are already covered

JSON-LD answers issue #39 by saying RFC 3986 even though the WHATWG URL spec is the one that's more useful to implement (actually says how to handle errors), so at least that point doesn't suggest that inheriting the choices of JSON-LD would be a positive thing.

@darobin
Author

darobin commented May 21, 2017

Whoa there, let's not FUD Henri. JSON-LD != RDF. It's RDF-compatible, but most people who use it completely ignore that because it was designed that way. We have a complete and complex publishing system built atop JSON-LD and there is nothing in it RDFy, even though we do manipulate graphs. Worst-case scenario "RDF" shows up in the names of some things.

@frivoal

frivoal commented May 21, 2017

I totally support that. json-ld-feed is generic enough I'd be happy to store the source of my blog/site in that, generate the HTML from it, and serve it in a feed either as is or lightly filtered.

You could theoretically do the same with json-feed, but you'd constantly run into things that it can't quite express yet, and either not do it, or come up with your own non-standard way of doing it, which means hell for clients trying to figure out what (if anything) it can do with the various bits of non standard data they find in your feed. Not all feed readers need to react to everything you find in a feed, but having well understood ways of including all sorts of information makes life easier for anyone who wants to.

Here are a handful of examples of things json-ld would know how to do, but json feed doesn't, all taken from a poetry site/blog I'm running (https://namino.rivoal.net/poems/):

  • say which language various things are in (author name, content of the post...) (relates to Language & Internationalization #40)
  • Non-english text in titles that need markup to be correctly represented, like ねこ太陽たいよう
  • some items have an author, some items have a translator, some have both, some have several of each (relates to Behavior for multiple authors #6)
  • I store the author's name in multiple languages, also sometimes their place of birth, date of birth and date of death.
  • I have various metadata about the book poems come from. This is easily expressed using json-ld + schema.org.

And a bunch more like this. Aside from the international aspects, arguably, you do not need all of these in a feed. But given the choice between a format that already has answers to all of these, and one that doesn't, I think it'd be a lot less pain to go with the one that does.

Having not just the json syntax in common with the rest of the world, but also the json-ld semantics makes for much easier interactions with all sorts of tools out there.

@hsivonen

Whoa there, let's not FUD Henri. JSON-LD != RDF

"LD" is a rebranding of the "Semantic Web", so I think it's fair to ask how things would be better this time round. XML got namespaces due to RDF, so clearly, the SemWeb influence wasn't positive on XML and contributes in significant part to why JSON feels simpler than XML even now that the SemWeb/LD community has moved on from XML. Now JSON-LD adds that complexity to JSON for largely the same reason.

More specifically in relation to feeds, history has RSS 1.0 as a data point. It seems fair to ask why mixing "LD" to JSON-based feeds would be better this time round compared to adding "RDF" or "Semantic Web" to XML-based feeds earlier.

@frivoal

frivoal commented May 21, 2017

It seems fair to ask why mixing "LD" to JSON-based feeds would be better this time round

  1. While LD brings some complexity to JSON, the result is still way less complex than XML.

  2. Using JSON-LD may not be the only way to solve all the relevant problems, but relevant problems need solving. As currently proposed, JSONFeed falls short on a number of things that JSON-LD already solves. (See my earlier comment, or many of the open issues, including those quoted by @darobin ).

  3. There's no evidence that all these problems will be solved for JSONFeed, or that if they are solved, the result will be less complex than JSON-LD.

@darobin
Author

darobin commented May 21, 2017

To be candid Henri, it isn't clear to me why you feel compelled to import the nasty parts from the W3C/WHATWG playbook here. Just in that last short comment I count castigation by association (RDF, namespaces…), negative reframing ("rebranding"), and slippery sloping ("ZOMG you're going to bring namespaces"). There's a world outside standards and an awful lot of people are nice there. I see no need to spread that sort of tactic.

If your concern is the risk of grandfathering in complexity, that's a legitimate concern — certainly one that I share at any rate — but in that case why not just say it and discuss solutions instead of reaching for the negative?

The avowed purpose of this proposal is to maintain the simplicity of JSON Feed while reaping the benefits from an existing vocabulary that is already widely deployed, is already consumed by several major implementers, and has already solved many of the problems — including some that haven't been filed yet (@frivoal's ruby text example is a great one: it's a common Western-centric error to assume that things like title can be conveyed solely with a string). Grandfathering in external complexity would defeat the purpose, so I reckon it would be pretty stupid.

Thankfully there's a lot that can be done to retain the full simplicity of JSON Feed while not reinventing the wheel, not rediscovering solved problems, and benefiting from implementations. That's why I mentioned creating a vernacular, the implication being that it's just a subset of the syntax that processes like JSON-LD — enough to inherit the solved problems — but restricts it in a way that makes it usable by anyone who doesn't care what other things you can do there.

There are some technicalities involved obviously; I'm happy to help with those.

@manton
Owner

manton commented May 21, 2017

This is an interesting proposal, and it's given me the opportunity to become more familiar with JSON-LD and how it might fit with JSON Feed. Thanks! But it's really a different solution. JSON Feed prioritizes clarity and ease of implementation over some flexibility. JSON-LD is much larger in scope. If JSON-LD is solving problems for people, that's great. I'm glad it's there for developers to use if they need it.

Also, since RDF and RSS 1.0 were brought up... Many of us remember those early debates and we don't want to recreate that kind of "us vs. them" split. The web is bigger now, and there are a bunch of JSON formats. Just something to keep in mind.

I think we can let this thread run its course and then close it. If anyone has related blog posts or other articles, feel free to post a link so there's a record.

@hsivonen

To be candid Henri, it isn't clear to me why you feel compelled to import the nasty parts from the W3C/WHATWG playbook here.

I find it interesting that you accuse me of a nasty playbook here, when you suggested resolving issue #39 by the proxy decision of buying into a large bundle of decisions when you must be aware that of the specs mentioned over at issue #39 the WHATWG URL spec best serves practical needs and that a decision to delegate the choice to JSON-LD would effectively mean buying into the W3C playbook of pretending WHATWG non-existence by refusing to make normative references to WHATWG specs even when it would make the most sense technically.

I thought that suggesting this kind of roundabout way of resolving issue #39 was particularly uncool, which is what prompted me to comment here in the first place.

Other than that, I spent quite a bit of time participating in the Atom standardization process, so I have some nostalgic interest in what happens in the next generation of feeds. Back then, there was debate of RDF or not to RDF and Atom chose not to RDF. Now that someone is trying to simplify further, it seems backwards to import the complexity back then avoided especially when there's the RSS 1.0 case to study. (It's telling that you seem to take association with RDF as an offense.)

Further, over the years, I've had a lot of bad time with Namespaces both implementation-wise and standards committee-wise, so I was particularly unhappy to discover that the bit of RDF history that should have been thoroughly recognized as a bad design has been imported into JSON-LD. I would hope that the next generation of specs could be spared from the badness of the prefix-based indirection of Namespaces in XML.

@darobin
Author

darobin commented May 21, 2017

@manton I think maybe I was not entirely clear about the proposal. The idea is emphatically not to bring in all of JSON-LD. That would seem neither useful nor practical. The proposal is little more than to keep JSON Feed as it is with just 1) different field names, 2) a different mechanism for text/HTML distinction, and 3) @type. Item (3) is the one that annoys me most, but I can show examples of the sort of things that it enables (without touching the complexity of processors one bit).

Simplicity and clarity are unaffected, but we get a format that has already solved the problems faced here, that is already understood by major consumers, and that is already compatible with the content produced by a few million sites. I'm not entirely sure what's not to like…

FWIW I too recall many of the horrors surrounding the earlier RSS discussions, which is why I'm sad to see that universe spill over here, into what I hoped would be a constructive discussion.

@darobin
Author

darobin commented May 21, 2017

@hsivonen Again, if it's #39 you have a problem with why not simply say so instead of being aggressive over it? There are very few people who care which URL spec is referenced and I am not one of them — like the vast, vast majority of developers I just use libraries for that crap. If you think that JSON-LD should reference WHATWG URL that is perfectly fine by me. I'm pretty sure that if you open an issue there then @gkellogg or whoever handles it will be happy to consider it.

I'm not sure what you're referring to about "pretending WHATWG non-existence"; I left that world a couple of years ago precisely because I feel life is too short to put up with the kind of toxic comment you are bringing here.

I would like to have a friendly, constructive discussion about this issue — that's why I came here. I have seen the (sadly, unofficial) numbers for the number of pages that have the type of schema.org content we're talking about here from a major processor and they're very impressive. Building on what is actually used out there strikes me as a plus over publishing the same content mostly using different key names. That's all I care about here.

I know you're technically very competent Henri, so if you have constructive input then I very much welcome it. But I don't think technical competence can justify negative behaviour, so if it's just to grind whatever old axes you have with whatever organisation, I would very kindly, and very respectfully, ask that you please take your anger somewhere else.

@manton
Owner

manton commented May 21, 2017

@darobin Thanks for clarifying the proposal. Even those 3 changes would turn JSON Feed into something very different, though. I'm happy to learn from the problems that JSON-LD has already solved, but we can't rename the fields, or it becomes a different format.

@frivoal

frivoal commented May 22, 2017

but we can't rename the fields, or it becomes a different format.

That's true, but JSONFeed is very new and hasn't been deployed much yet (you can point at some deployments, but not at some tens of thousands of deployments yet), and regardless of whether you adopt the JSON-LD proposal or not, I think it's premature to consider it frozen. I know you've promised it would be and that you'd be compatible forever going forward, but nobody gets something this subtle right on the first attempt, and not all problems can be solved by strict additions.

Even if you end up rejecting the approach proposed here, please reconsider your stance on rejecting all breaking changes this early.

@manton
Owner

manton commented May 22, 2017

@frivoal Thanks for the feedback. We've been trying to take the approach of keeping most of the issues open for a while, including this one, so that people have a chance to comment. This proposal is for a completely different format, though. It's not so much that it's a "breaking change" as it throws everything out and starts over. :-)

@hsivonen

I'm not sure what you're referring to about "pretending WHATWG non-existence"

Extreme avoidance of normatively referencing WHATWG specs. Trying to minimize credit to the WHATWG when publishing a WHATWG spec under the W3C name. Not mentioning WHATWG as a stakeholder in charters unless pushed to do so. This went on while you were still there.

negative behaviour

It's worth noting that "I see you are doing a spec. Please rebase your spec onto SemWeb/LD stuff." can be seen as negative behavior (in the context of this not being the first time) even if expressed very politely. Then if someone asks "How did it go last time?" that can be portrayed as negative.

@msporny

msporny commented May 22, 2017

Hi folks, jumping in here as one of the lead creators and spec editors of JSON-LD 1.0 to try and help...

@hsivonen said:

WHATWG URL spec best serves practical needs and that a decision to delegate the choice to JSON-LD would effectively mean buying into the W3C playbook of pretending WHATWG non-existence by refusing to make normative references to WHATWG specs even when it would make the most sense technically

JSON-LD was born out of the desire to reflect reality (Web Developers use JSON) and do practical Linked Data (RDF is far too academic to be useful to most Web Developers). If it makes more sense to refer to the WHATWG URL spec, which I personally think it does, then we should reference it and fight for that in W3C. Especially because I think implementers will just implement according to the most practical spec out there (which is the WHATWG URL spec).

@darobin said:

 "@type": "BlogPosting",
 "@id": "https://example.org/second-item",

Remember that you can alias the "@id" and "@type" keywords to "id" and "type". Vanquish the curly "a" if possible. If this is a new format (which it seems like it is), I suggest you do that to avoid Web developers scratching their head over the strange syntax. The "@id" and "@type" keywords are already aliased in schema.org, so you should be able to do this:

  "id": "https://example.org/second-item",
  "type": "BlogPosting",

@gruber said:

  "articleBody": { "@type": "rdf:HTML", "@value": "<p>Hello, world!</p>" },

That's more than just "more verbose" -- it's more complex.

I agree with @gruber, -1 to doing that. Have you considered using JSON-LD Type Coercion, @darobin? So, you do something like this in the JSON-LD Context:

"articleBodyHtml": {
  "@id": "http://schema.org/articleBody",
  "@type": "rdf:HTML"  // or alternatively - "@type": "whatwg:HTML" :)
}

That would enable you to do this in the JSONFeed document, which developers should like:

 "articleBodyHtml": "<p>Hello, world!</p>"

@darobin said:

different field names

To be clear, JSON-LD doesn't require you to change any of your field names. It should be possible to create a JSON-LD context that is compatible with JSONFeed's syntax today. It's true that there are some corner cases where this is not possible, but it should be true for 99% of the JSON-based data formats out there.
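
As a rough illustration of what such a context might look like, here is a sketch that maps a handful of JSON Feed keys onto schema.org terms using keyword aliasing and type coercion. The specific term choices (for example `title` to `schema:name`) are assumptions made for this example and not an agreed mapping, and `withContext` is a hypothetical helper, not part of any JSON-LD library.

```javascript
// Hypothetical JSON-LD context for a few JSON Feed keys.
// Assumption: the schema.org terms chosen here are illustrative only.
const jsonFeedContext = {
  '@context': {
    schema: 'http://schema.org/',
    rdf: 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    id: '@id',                          // keyword aliasing
    title: 'schema:name',
    url: 'schema:url',
    items: 'schema:blogPost',
    content_text: 'schema:articleBody',
    // Type coercion: the plain string value is interpreted as HTML.
    content_html: { '@id': 'schema:articleBody', '@type': 'rdf:HTML' }
  }
};

// Attach the context to an otherwise unchanged JSON Feed document.
// The input object is not modified; a shallow copy is returned.
function withContext(feed) {
  return Object.assign({}, jsonFeedContext, feed);
}

const feed = {
  title: 'My Example Feed',
  items: [{ id: 'https://example.org/initial-post', content_html: '<p>Hello, world!</p>' }]
};
console.log(JSON.stringify(withContext(feed), null, 2));
```

A plain JSON consumer can ignore the `@context` key entirely and read `items` and `content_html` as before; only a JSON-LD processor would use the context to interpret those keys as schema.org terms.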

@darobin
Author

darobin commented May 22, 2017

Just a couple short things and then I'd like to jump into the heart of it.

We've taken our discussion offline with @hsivonen (or rather, to a different online) so as not to annoy everyone here. I reckon we can shake away the misunderstandings.

@msporny I am well aware of all the tricks one can do with contexts, but my goal here is emphatically not to add a layer of indirection: it's to reuse the data model and code that millions of domains already use and that many large consumers (Google, Bing, etc.) already accept.

The "@type": "Blog"… example I started with is already what I (and several million other domains) have to put on a home page (as recommended by Google, etc.). Obviously, I can translate it to JSON Feed, but that's one more format to manage that has the exact same content and semantics as one that is already widely used on the Web. Adding a context (and therefore requiring translation) would defeat the purpose.

We already have schema.org as the de facto widespread data model for sharing blog data on the Web. I want to avoid reinvention — ideally this should just be an effort in agreeing to split that data out of the HTML and agree on some fields you must have (schema.org is a very loose duck-typing affair, it can help to require a few things on top).

Now, to address @manton's point, I would like to push back somewhat on the notion that this "throws out everything" or constitutes a major change in any way.

There are essentially two different kinds of change that could apply to a simple format: naming and processing algorithm. Naming is of course required to get both ends to understand something, but changing a name — especially early in the process — is a lot simpler and in a wholly different sphere from changing the processing, even in a simple way. If we decide to change title to zorblux (to please our new Zorbluxian overlords, obviously) it's pretty straightforward and different from, say, deciding that all the items must now be URL references to external things that need to be dereferenced.

Now, that's all a lot of words so I'd like to switch to code if you fine people don't mind. Below is a small CLI util, simple but (I think) representative of the problem space, that takes the URL of a feed and prints out the content of the first item as text on the console, implemented once in the current JSON Feed proposal and once in the schema-compatible variant. I'm taking a few shortcuts for brevity (essentially in error handling) but they're the same shortcuts in both.

JSON Feed:

const request = require('request');
const stripTags = require('strip-tags');

request({ url: process.argv[2], json: true }, (err, res, body) => {
  if (err) return console.error(err);
  let item = body && body.items && body.items[0];
  if (!item) return console.error('No item');
  if (item.content_html) {
    console.log(stripTags(item.content_html));
  } else if (item.content_text) {
    console.log(item.content_text);
  } else {
    console.log('No content');
  }
});

schema.org:

const request = require('request');
const stripTags = require('strip-tags');

request({ url: process.argv[2], json: true }, (err, res, body) => {
  if (err) return console.error(err);
  let item = body && body.blogPost && body.blogPost[0] && body.blogPost[0].articleBody;
  if (!item) return console.error('No item');
  if (item['@value']) {
    console.log(stripTags(item['@value']));
  } else {
    console.log(item);
  }
});

I would contend that these two implementations are pretty damn similar. Both are, to the best of my knowledge, complete and correct implementations of the task at hand, and in my opinion demonstrate enough of what would be involved in implementing a more elaborate tool (eg. a feed reader) atop either format.

The only difference is that the latter example uses a format that is already understood by all major search engines and a very large number of Web developers who've had to use it for SEO, with AMP, etc., whereas the former essentially reinvents that with different field names and problems that have already been found and solved elsewhere. The last thing I want is to disparage this effort, I really like the overall idea, but I don't understand why developers should be made to do the work twice. I know you're all smart folks so it can't be about preferring underscores to camel case or whatnot.

So — can we talk about this?

@msporny

msporny commented May 22, 2017

@darobin wrote:

@msporny I am well aware of all the tricks one can do with contexts, but my goal here is emphatically not to add a layer of indirection: it's to reuse the data model and code that millions of domains already use and that many large consumers (Google, Bing, etc.) already accept.

Ah, then in that case:

  1. schema.org already supports @type and @id aliasing (it's in the vanilla schema.org context), even though all of their examples have @id and @type.
  2. Try to stay away from { "@type": "rdf:HTML", "@value": "<p>Hello, world!</p>" }, as @gruber mentioned, people will have an allergic reaction to it. Either re-use articleBody as is... or lobby schema.org to add articleBodyHtml (with the proper data typing as I mentioned above).
  3. Avoid using "@id": "https://example.org/initial-post", as it seems to duplicate url. Just use url OR use id, but not both.

In any case, +1 to using vanilla schema.org. If this community does that, I'm sure you can lobby schema.org for the tiny change that will make it easier for Web developers to pick this stuff up and treat it as JSON (the articleBodyHtml comment above).

Best of luck to the JSONFeed community. If you folks go the route of schema.org and JSON-LD, both of those communities are here to support you if you need it.

@alxndr-w

https://xkcd.com/927/
#SCNR

@manton
Owner

manton commented May 23, 2017

@alexplusde Heh. That never gets old. Even though I've seen it a few times in the last week. :-)

@darobin I appreciate where you're coming from with this. Also, thanks for the code examples! They do a good job of illustrating the similarities between the 2 formats.

But even though you joke that it can't be about camelcase, in a way it is. It's our belief that JSON Feed should be clear and readable, which will encourage more people to support it, because it's so obvious. There's already evidence that this is happening, with support from multiple feed readers shipping this week.

If we renamed items to blogPost and went back to one of these feed reader developers — let's say the developer of Feedbin — and told him to update his parser even though it would break all existing JSON feeds that people have been experimenting with... What would he say? There would have to be an incredibly good reason to force a change like that, and with this issue there are clear disadvantages in readability to renaming the fields regardless of current implementations. I hope you can see our perspective on that.

@msporny Thank you. That the @context can do aliasing and type coercion like that is really cool. Best of luck to the JSON-LD community as well.

@darobin
Author

darobin commented May 23, 2017

@manton With respect, I can only push back. This repository was created nine days ago, the public announcement was six days ago. Are you honestly making the case that within a review period of nine days you are confident that you have delivered a sufficiently defect-free format? I don't believe anyone, no matter how smart or experienced, would really expect that.

Or maybe if you'd already decided to keep it that way no matter what issues may be found beyond typos, it might have been simpler to not open it up at all? Or at the very least it may have been polite to say so.

If I were implementing a specification that was publicly announced six days ago, I would absolutely expect it to change. Everyone is familiar with what early adoption entails. I have every last bit of confidence in the ability of Feedbin's developer to let items = feed.blogPost || feed.items.
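
Spelled out a little further, a transitional reader could look something like the sketch below. It assumes the schema.org-style names from the example at the top of this thread (`blogPost`, `articleBody`) alongside the JSON Feed names (`items`, `content_html`, `content_text`); `normalizeFeed` and the shape it returns are hypothetical.

```javascript
// Hypothetical normalizer that tolerates both naming schemes.
// Assumption: `blogPost`/`articleBody` follow the schema.org-style example,
// `items`/`content_html`/`content_text` follow JSON Feed.
function normalizeFeed(feed) {
  const items = feed.blogPost || feed.items || [];
  return items.map((item) => {
    const body = item.articleBody;
    if (body && typeof body === 'object') {
      // Object form: the markup lives in "@value".
      return { url: item.url, html: body['@value'], text: null };
    }
    if (typeof body === 'string') {
      // String form: plain text.
      return { url: item.url, html: null, text: body };
    }
    // No articleBody: fall back to the JSON Feed fields.
    return {
      url: item.url,
      html: item.content_html || null,
      text: item.content_text || null
    };
  });
}
```

This only covers item content; a real reader would need the same treatment for titles, dates, and authors, so it is a sketch of the idea and not a complete parser.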

What's more, you are giving precedence to the few over the many. There are a few implementations and a few deployments, all of them early adopters with the wherewithal to handle change. You are, for unclear reasons, giving them precedence over the many more of us who will now have to produce (and test, and maintain, and fix, and document) the exact same information in two different JSON formats.

Why you consider that the better decision strikes me as particularly unclear.

@manton
Owner

manton commented May 23, 2017

Fair enough, but I was trying to make the point that while we can debate whether the spec should be frozen — there are a lot of proposals and I'm reading all of them — at the very least there's friction introduced in any change, so the change has to be inarguably the correct way forward. blogPost does not fit the design of the spec: it's camelcase when nothing else is, it's not plural as all array fields should be, and it unnecessarily mentions "blog". And that's ignoring the complexity of the other changes like articleBody and rdf:HTML.

The repository creation date doesn't tell the full story. Many months of work went into this spec, debating it and rewriting it, before we felt it was ready to share with everyone.

I also think it's worth pointing out again that JSON Feed and JSON-LD can co-exist. They are solving different problems. They do not need to be merged into a single format.

@danbri

danbri commented May 23, 2017

hi @manton @brentsimmons, everyone. Quick intro: I run Schema.org, I work for Google, and this discussion is giving me RSS 1.0 flashback :)

JSONFeed is a fine and good thing, and to be honest I stepped back from jumping in and saying "hey look, here's how you could do something similar using Schema.org and JSON-LD" because it felt like it could come off negatively, and there are so many kinds of feeds already out there that another exploration of the design space can't hurt. But since @darobin has opened the conversation and made some positive proposals, maybe I can share some thoughts from a Schema.org, Google and RSS1-ish person's perspective?

Schema.org was launched with an emphasis on Microdata syntax, but when JSON-LD came along we adopted it as another representation (alongside RDFa). All our schemas are defined in terms of a general purpose graph data model (it's basically a collection of named types and properties), so they are not tightly bound to any of these syntax choices. Other syntaxes we've looked at that might be interesting from a feeds perspective are CSV (see W3C CSVW which gives a JSON mapping of tabular data into graphs - spec + my undocumented js hackery), and also Web Components where the idea is to document the meaning of new custom elements by mapping into schema.org-based graphs.

Currently there is nothing on the Schema.org site about using JSON-LD outside of HTML, although this is a very natural and even simpler use of it. For now we only give examples (ab)using the <script> element. It can be quite confusing for developers/publishers to think about issues like escaping when dealing with JSON inside HTML, so we'll probably talk about Schema.org in plain JSON(-LD) files at some point. As @darobin points out we have a lot of the raw materials for addressing feedlike scenarios, but we haven't presented it as such, partly because there are so many options already (RSS, Atom) as well as closely related things like W3C's brand new Activity Streams 2.0 spec (/cc @evanp @jasnell).

At Google, in addition to using Schema.org from the general public Web we have experimented with using it for data feeds of various kinds (mainly in JSON-LD). In many cases it turns out that the "feed" part of the problem is relatively small, and actually you end up focussing more on data structures for the various kinds of thing that the feed items are describing. This may well not be the case for simple RSS-like "new blog post" scenarios, but it fits with my experience from pre-Google efforts too.

Here is an example that is ancient but which illustrates something of the strengths and weaknesses of using a graph-oriented structure for data feeds whose items have non-feed properties; in this case the feed is a jobs feed. Forgive the graceless RDF/XML syntax; JSON and JSON-LD didn't exist then. Hopefully the idea comes across at least.

Each item describes a job posting, giving the job title, a salary, the salary currency and the url for the hiring organization. While that original example schema was made up in the pub, today you can find all that vocabulary and a lot more already in http://schema.org/JobPosting, and advocacy for using schema.org for Job postings e.g. from UK govt. Are these kinds of feed considered in-scope for JSONFeed?

This raises a core strength that Schema.org brings to a feeds setting: it covers a lot of scenarios beyond the RSS-like basics, i.e. it often nicely covers the stuff the feed is about. It isn't perfect but it is regularly improved in fairly pragmatic small steps, and without a lot of bureaucracy. If you have a feed about scholarly articles or books, or tv/radio content or videos or events or products or courses or hotels or cars, we have schemas that give at least some basic set of properties for those things.

That's all I meant by "actually you end up focussing more on data structures for the things the feed items are describing". To be clear the schema.org schemas are never going to be enough to address everyone's needs, but they're often a good start and go a lot deeper than RSS/Atom into the non-feed-oriented content of data feeds. That doesn't mean that all the complexity/richness needs to be in the face of developers; a simple example as @darobin shows is a perfectly fine start.

If there is interest here in collaborating to minimize the gap between feed-like Schema.org in JSON-LD and the JSONFeed effort, I'd be happy to talk more. It might also be useful to get folk on a call with the Activity Streams team cc:'d above. Even if all these efforts diverge it is worth staying in touch...

@darobin
Author

darobin commented May 23, 2017

I understand and respect the fact that work went into this prior to its release. But this is a format for the whole big Web and presumably meant to last. Don't you agree that it could do with more review? I'm not saying years, but maybe one month?

The casing and plurality might not be consistent with JSON Feed but they are consistent with the rest of schema.org, which is what we already have to use to describe our content. BlogPosting is used in over a million independent domains on the Web (yes that's domains, not pages). So is Blog, and blogPost is onto the half-million mark. Can you explain why you consider it more important to be compatible with your decisions than with what the Web already does, or why you want developers to learn your conventions instead of the ones they already have to use?

I don't buy the argument that choosing between an object or a string is more complex than picking between two fields. As it happens, it would seem that neither do other web developers the world around since articleBody is also used on over a million domains.
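To make that comparison concrete, here is a minimal sketch of both checks side by side. It is only an illustration under stated assumptions: the function names are hypothetical, the schema.org variant accepts only a plain string or an object typed `rdf:HTML` (the restriction suggested earlier in this thread), and neither function is taken from any spec or library.

```python
from typing import Any, Optional, Tuple


def body_from_schema_org(item: dict) -> Optional[Tuple[str, str]]:
    """Return (kind, content) from a schema.org-style item, or None.

    Assumption: articleBody is either a plain string (text) or an
    object of the form {"@type": "rdf:HTML", "@value": "..."}.
    """
    body: Any = item.get("articleBody")
    if isinstance(body, str):
        return ("text", body)
    if isinstance(body, dict) and body.get("@type") == "rdf:HTML":
        value = body.get("@value")
        if isinstance(value, str):
            return ("html", value)
    return None


def body_from_json_feed(item: dict) -> Optional[Tuple[str, str]]:
    """Return (kind, content) from a JSON Feed-style item, or None.

    Checks the two separate fields, content_html and content_text.
    """
    html = item.get("content_html")
    if isinstance(html, str):
        return ("html", html)
    text = item.get("content_text")
    if isinstance(text, str):
        return ("text", text)
    return None
```

Both functions end up with one branch per content kind; the difference is whether the consumer branches on the value's shape or on which field is present.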

I am not entirely sure what you mean by "JSON Feeds and JSON-LD can coexist". I think that's obvious since one is a proposed feed format while the other is a set of conventions for capturing data in JSON — they are in completely different spaces. This proposal is simply to use the schema.org field names as represented in the JSON-LD syntax (since it works as plain JSON). This makes it a strictly equivalent proposal in complexity, just with conventions that are already common Web technology rather than reinvented.

Honestly, the last thing I want is to stir up trouble and have an endless discussion but it is really hard to understand why you think that we should now learn two sets of conventions for the exact same thing. Basically at the very least I would like someone to tell me why in all the sites I'm going to be developing in the next I don't know how many years I am henceforth going to have to support two different syntaxes of the exact same form, the exact same degree of complexity, and the exact same content. I hate to complain but it just doesn't seem fair or even simply considerate to those of us who have to work with this stuff afterwards.

@manton
Owner

manton commented May 23, 2017

Hi @danbri! Thanks for the feedback. Glad you were able to chime in, and I agree that it's good for everyone to keep in touch and be aware of the progress in all these formats.

@darobin Since you keep bringing up the million domains... :-) Can you point to an example of a JSON-LD feed file on the web that is currently being used for a similar purpose as RSS or JSON Feed? I haven't seen one of these in the wild so would like to see what they look like for a real site.

I think we just have a different approach to this and that's okay. I don't want to escalate the debate, and I'm not here to bash JSON-LD. In fact in 1996-ish I was a big fan of MCF (developed by RV Guha while he was at Apple, before RDF) and I built a site database tool to export in that format. I get the appeal, but JSON Feed takes a more narrow approach.

@danbri

danbri commented May 23, 2017

MCF :) shouldn't it all be 3D fly-thrus by now?

There will always be more file formats coming along that kind of overlap but address different needs; https://www.w3.org/TR/appmanifest/ is another one...

@frivoal

frivoal commented May 23, 2017

Can you point to an example of a JSON-LD feed file on the web that is currently being used for a similar purpose as RSS or JSON Feed? I haven't seen one of these in the wild so would like to see what they look like for a real site.

They are currently not being used as feeds, but to carry the exact same information for search engines and the like to figure out what's on your site.

Doing an actual feed (rather than a thing that isn't a feed but carries the same info) in JSON is your idea, and is a good idea. Doing it with a syntax that will cause millions of sites to deploy the same data in two almost identical formats is less of a good idea.

@manton
Owner

manton commented May 23, 2017

@danbri Good times. :-)

@frivoal Got it, thank you. I assumed that was probably true but as I've been digging into this more, didn't want to overlook any existing feeds.

@darobin
Author

darobin commented May 25, 2017

@manton First and foremost, I want to second what @frivoal has said: having a JSON feed is a great idea and it's your idea, and I think I speak for the folks who've +1ed this here and elsewhere when I say we're not at all here to 386 you. We don't think you're wrong, we just think you could be more right.

The second thing I should say is that judging from the discussion I made a very stupid mistake even mentioning JSON-LD. JSON-LD is great for loads of stuff, especially when you have JSON data that starts to look like a graph (and that can come at you fast), but I really, really would NOT want a JSON Feed processor to implement JSON-LD. That would bring in complexity that is quite simply unjustified. What I am proposing is that JSON Feed be able to duck type as a JSON(that-happens-to-be-LD-but-it-totally-doesn't-matter) encoding of schema.org. A JSON Feed processor should just have to care that something is JSON and has some given fields with predictable names. But it would really save us some bugs (and headaches, and time, and cost) if those names were the ones that are already used for this sort of content.

Right now this schema.org JSON is overwhelmingly embedded in HTML (as special script elements). That's perfectly sensible for crawlers that are reading the HTML anyway, but it's arguably a fair bit less sensible for feed readers. So splitting that exact same content into a separate file makes a whole lot of sense. You're absolutely right: let's do that.
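For reference, pulling that embedded data out of a page takes little code. The sketch below uses only Python's standard library and assumes the data sits in `script` elements with the `application/ld+json` media type; the class and function names are made up for illustration, and malformed blocks are simply skipped.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.blocks.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # tolerate broken blocks instead of failing the whole page


def extract_json_ld(html: str) -> list:
    """Return every JSON value found in JSON-LD script elements of the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

A feed reader would have to fetch and scan the whole HTML page to do this, which is the argument for serving the same JSON as a separate file.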

The question I do have is: why translate? Without a doubt we've all done things a hell of a lot more complicated than translating between JSON Feed and any possible variation on the example given in the OP. If I end up having to do that, sure enough I'll do it: just like every other decision in a standard that makes things a bit more complicated than they ought to be, but not by so much that you give up, you suck it up. But (at least for the kind of service I'm working on these days) this is imperfect data from essentially random users that will always find a way to trigger whatever corner case has been made possible. The translation adds complexity that will add bugs and things that break, and that leads to less time building cool things and making paper planes for your kids. It seems trite and minor, but at the scale of the Web that's a lot fewer paper planes.

The duck typing I mentioned is important: schema.org is a "proper" Web technology in the sense that it's resilient to all manners of crap in the same way that HTML and CSS are (a very important aspect for formats intended for the Web that, by the way, is IMHO insufficiently the case in JSON Feed as currently described).

The example I gave in the OP compounds the mistake I made in mentioning JSON-LD by being a "correct" example. As such, it is more complicated than it need be. The reason for that is that properties in schema.org are essentially global: there is no case of, say, table meaning something different in a room and in a book; any given property always has the same meaning.

This is neat because you don't need to specify the @type for the rest of the object to be meaningful. In fact people typically use @type as they would any other property, just to indicate what kind of thing you're dealing with. It was a mistake to include it like that in the example as it evokes a whole world of typing BDSM when, really, it leaves it up to the consumer to decide just how much BDSM they actually do enjoy.

To give an example, right now I am working on science publishing. The items I publish would be @type: ScholarlyArticle because that's what they are. But under this proposal, from a JSON Feed consumer's perspective, that changes nothing: a url is a url, and articleBody is an articleBody, etc.

If you trawl around the Web, you'll find people typing their content variously as BlogPosting, Article, NewsArticle, ScholarlyArticle, and a whole bunch more (all of which are defined) but the content is equally usable even if you've never seen one and in fact you only need to care if for some reason the type is important to you. A generic feed reader could read them all without ever caring, but one could treat them differently (if only to give them a different icon) and of course understand any number of additional properties (there are quite a few) so long as we agree on a small core that ought to be present in a feed.
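As a rough sketch of what that duck typing could look like in a consumer: the function below reads the core fields by name and treats `@type` as an optional hint. The icon table, the function name, and the choice of `url` as the one required field are assumptions made for the example, not part of any proposal text.

```python
from typing import Optional

# Hypothetical icon table: only types the reader cares about are listed.
ICONS = {
    "ScholarlyArticle": "paper",
    "NewsArticle": "news",
}


def summarize_item(item: dict) -> Optional[dict]:
    """Duck-type one feed item: use the fields that are present, ignore the rest."""
    url = item.get("url")
    if not isinstance(url, str):
        return None  # assumed core field is missing, so skip the item

    # @type is just another property here; unknown or absent types fall
    # back to a generic icon instead of being rejected.
    item_type = item.get("@type")
    icon = ICONS.get(item_type, "generic") if isinstance(item_type, str) else "generic"

    return {"url": url, "icon": icon, "has_body": "articleBody" in item}
```

An item typed `BlogPosting`, one typed `ScholarlyArticle`, and one with no `@type` at all pass through the same code path; only the icon differs.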

You mention numbers: I don't have access to the trove of knowledge that @danbri has, the figures I quote are from the schema.org site (every type and property has a volume ballpark based on real data). If you are looking for examples, any random Medium article ought to have that for instance (maybe @julien51 can comment as to their usage). Or this Make Your Page Discoverable example from a web developer guide.

So, to summarise what I know is a long post in a long thread: I apologise if I naïvely thought ancient hatchet buried and if I stupidly assumed it obvious that I would advocate for something simpler than the status quo rather than for dragging in massively overkill machinery (no matter how neat it may be in other more complex cases). Please allow me to reformulate: might we just use the schema.org field names that we already use so that we may not do the work twice and still have our stuff read by the big crawlers?

We can duck type the hell out of this thing, and @frivoal and I would be happy to contribute processor rules to make this a Web format resilient against content errors.

If you've read this far, please have a 🍪, 🥃, or both.

@msporny

msporny commented May 25, 2017

I'm going to duck out of this conversation at this point because @darobin and @danbri are making all of the points I would make... but before I go, I just wanted to underscore something.

There is a general tendency for Web developers working on new formats to invent a new simple Domain Specific Language (DSL) for their small set of use cases and tie it to a particular syntax. The thinking is that "we'll build other use cases out once we're successful with this simple first step". This focus on "simplicity" and "what developers need today", while an admirable goal, is often misinterpreted by folks working on new data formats - you need to design for the future, not just the present. Simple also means that it fits into the current Web ecosystem well, but most developers miss that.

JSONFeed, as it stands right now, is a case in point. Yes, it has a simple syntax (JSON) and a constrained set of properties (JSONFeed) that solve a specific set of use cases. However, it doesn't fit well into the current Web ecosystem because of at least two reasons.

The first is that JSON won't last forever and you've tied yourself to it (schema.org doesn't tie itself to a particular syntax). If you don't think this is a problem, go talk to everyone that tied all their Web Services to XML. :)

The second is that schema.org exists and already provides more than 80% of what JSONFeed does (and the bits that are missing can be easily added to schema.org). This opens you up to a competitive data format that re-uses schema.org instead of effectively re-inventing the data model (which is what JSONFeed does).

Ultimately, the question will be whether or not JSONFeed survives for more than a decade. If you dovetail w/ schema.org, I'd put the chances of it surviving for longer than RSS or Atom very high. If you don't, you'll most likely become RSS 1.0 when the next favored syntax craze hits the software world.

@msporny

msporny commented May 25, 2017

Can you point to an example of a JSON-LD feed file on the web that is currently being used for a similar purpose as RSS or JSON Feed

Not exactly a single example, but if you're looking for public deployment numbers for JSON-LD this may be helpful (look at the adoption curve for JSON-LD vs. other formats):

http://webdatacommons.org/structureddata/#toc9

Again, as @darobin mentioned, you don't have to use JSON-LD... but please do re-use the data model that everyone else is starting to use (schema.org).

@asbjornu

asbjornu commented May 28, 2017

Besides echoing and applauding everything @darobin, @danbri and @msporny have written already, I would like to point out that I find it strange to invent a format intended to replace RSS and Atom 12 years later and give it less power than the formats it is intended to replace.

Both RSS and Atom were built on XML, which, as pointed out by @hsivonen, therefore had the dread of XML Namespaces forced upon them. In many ways, this baggage only added weight without giving much value, but in some cases it did. You might say OData and AtomPub could have been invented without XML Namespaces, but the fact that the architecture for this sort of distributed, collision-free extension mechanism was already in place made it a lot easier.

Now I'm not saying JSONFeed should implement XML Namespaces (please don't!), I'm just saying that the sort of distributed extensibility that XML Namespaces gave to RSS and Atom is made possible in JSON with JSON-LD.

This extensibility does not impose JSON-LD onto consumers of the core JSONFeed format, but those who want to experiment and extend it now have a clearly defined and standardised way to do so, and the maintainers of the JSONFeed format don't have to lift a finger to enable it.

@hsivonen

but I really, really would NOT want a JSON Feed processor to implement JSON-LD. That would bring in complexity that is quite simply unjustified. What I am proposing is that JSON Feed be able to duck type as a JSON(that-happens-to-be-LD-but-it-totally-doesn't-matter) encoding of schema.org. A JSON Feed processor should just have to care that something is JSON and has some given fields with predictable names.

This kind of thing is why I said in my first comment here that it's worthwhile to consider "how often RSS 1.0 was actually consumed using RDF tooling".

I don't believe it works at scale to have a format be two things and people working with the format just ignoring the facet they feel isn't relevant to them.

One of three things happens:

  • The optional processing layer introduces enough syntactic sugar that some producers start relying on the quasi-optional layer (JSON-LD) being there and consumers that didn’t want to buy into the quasi-optional layer are forced to implement the layer that was sold as optional.

  • Producers don’t test with software that uses the quasi-optional layer, so what they output is broken for the purposes of the quasi-optional layer. (Consider how often XHTML served as text/html isn't well-formed XML and can't actually be consumed using a conforming XML processor.) The people who wanted the quasi-optional layer to be there can’t get the benefits and need to write a JSON Feed-specific converter on ingest anyway.

  • A messy mixed state of the two above options without clearly converging on either.

@danbri

danbri commented May 31, 2017

@hsivonen - Thanks for sharing your perspective. Just as a datapoint regarding "at scale", at Google we consume JSON-LD, Microdata and RDFa (amongst other things) using a triples/graphs abstraction which is reflected back to users/publishers in terms of typed items with properties/relationships. For consuming diverse data, it works pretty well at scale. For managing/editing/publishing, you'll often want to pick a concrete representation and have additional constraints.
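To give a feel for what a triples view of such data means, here is a toy flattening of nested JSON into (subject, property, value) tuples. It is deliberately naive and is not JSON-LD expansion or anything resembling a production pipeline: contexts are ignored, every nested object is treated as a node, and the function name and the `_:b0`-style labels for nodes without an `@id` are choices made for this sketch.

```python
from itertools import count


def to_triples(node: dict, counter=None):
    """Flatten a nested JSON object into (subject, property, value) triples.

    Returns (subject, triples). Nodes without an "@id" get a generated
    blank-node style label such as "_:b0".
    """
    counter = counter if counter is not None else count()
    subject = node.get("@id") or f"_:b{next(counter)}"
    triples = []
    for key, value in node.items():
        if key in ("@id", "@context"):
            continue
        # Treat a single value and a list of values uniformly.
        for v in value if isinstance(value, list) else [value]:
            if isinstance(v, dict):
                child_subject, child_triples = to_triples(v, counter)
                triples.append((subject, key, child_subject))
                triples.extend(child_triples)
            else:
                triples.append((subject, key, v))
    return subject, triples
```

Once data from different syntaxes is reduced to tuples like these, the same downstream code can handle all of them; the cost is that positions in the original file are lost unless the parser carries them along.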

You are correct that it is extra work (but still possible) to tie warnings/errors noticed at the level of the abstraction back to the concrete syntax level (whether it is "{" and "["-based or "<"-based). In particular, many off-the-shelf parsers that extract triples/graphs do not pass along the kind of information that's helpful for expressing such warnings/errors back to users, e.g. character offsets or whatever. There are certainly challenges that come with having different useful levels of abstraction, but the approach starts to pay off when you're mapping multiple concrete representations into a common model.

Perhaps in the future everyone will use JSON for everything, in which case the abstraction may come to seem too expensive in many contexts. But markup isn't dead yet, even XML, plus there's Web Components, CSV, etc., none of which are likely to go away any time soon.

@hsivonen

hsivonen commented Jun 1, 2017

Just as a datapoint regarding "at scale", at Google we consume JSON-LD

Note that I didn't suggest that JSON-LD is a problem "at scale". I suggested that "take your pick from different processing models for the format" is.

@danbri

danbri commented Jun 1, 2017

@hsivonen understood. A similar version of this dialogue happened around RDF and XML. If you write XSLTs against a particular serialization of a graph into XML, you'll often be out of luck if someone generates their XML using a generic XML-to-RDF "dump this graph as markup" serialization algorithm. But you might equally be out of luck because the information you wanted simply wasn't in the graph. Without something like shacl to check whether the graph is fit for some purpose, mapping it into a particular syntactic form is only half the problem.

@msporny

msporny commented Jun 1, 2017

@hsivonen said:

I suggested that "take your pick from different processing models for the format" is.

Yeah, I think that Henri makes a fair point here; waffling on the processing model leads to problems. For example, if Microsoft were to process the data as hard coded JSON and Google were to allow different JSON-LD Contexts, we'd start seeing problems.

I'd be surprised if ALL of the search companies actually implement/use a fully conformant JSON-LD processor. Instead, I expect that some of them have hard coded the schema.org context and only allow the terms used in schema.org (even though JSON-LD allows you to redefine terms using a different JSON-LD context).

In effect, what the search companies are doing is being restrictive on the sort of input they take in (i.e. you must use the schema.org context and we will not support compaction/expansion). We predicted that implementations of JSON-LD would do this and designed the language appropriately.

schema.org usage in the wild is a restricted subset of full JSON-LD, and that's a perfectly reasonable thing to do because interpretation of the data is the same between a restricted JSON-LD processor (schema.org) and a full blown JSON-LD processor.
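A minimal sketch of such a restricted consumer, purely as an illustration and not a description of how any search engine actually works: it accepts only a string `@context` naming schema.org, refuses anything that would need real context processing, and then reads terms as plain JSON keys. The names `SCHEMA_ORG_CONTEXTS`, `UnsupportedContext` and `read_terms` are hypothetical.

```python
# Spellings of the schema.org context this sketch chooses to accept.
SCHEMA_ORG_CONTEXTS = frozenset({
    "http://schema.org",
    "http://schema.org/",
    "https://schema.org",
    "https://schema.org/",
})


class UnsupportedContext(ValueError):
    """Raised when a document would need full JSON-LD context processing."""


def read_terms(doc: dict, terms=("name", "url")) -> dict:
    """Read schema.org terms as plain JSON keys, with no expansion or compaction."""
    context = doc.get("@context")
    if not isinstance(context, str) or context not in SCHEMA_ORG_CONTEXTS:
        # Custom or inline contexts could redefine terms, so reject them
        # rather than risk interpreting the data differently.
        raise UnsupportedContext(f"unsupported @context: {context!r}")
    return {term: doc[term] for term in terms if term in doc}
```

Because the only accepted context is the one a full processor would also resolve to schema.org, the restricted reader and a full JSON-LD processor should agree on the meaning of every document the restricted reader accepts.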

I think this is what @darobin is advocating here... use schema.org "as JSON"... Web Developers will like that... and it just so happens to be compatible with full-blown JSON-LD processing if you ever want to go there (and many won't care to).

So, yes, @hsivonen is correct... allowing developers to pick from different processing models leads to pain, but allowing them to pick from compatible subsets of the processing model works just fine, and that's what schema.org does.

@asbjornu

asbjornu commented Jun 7, 2017

I just want to add that I value the work being put into developing JSONFeed. I even argued for its existence 4 years ago. However, a perspective perhaps worth considering is the fact that the development of Atom took more than 40 people about 2 years to complete. The history of RSS is a bit more chaotic, but it too took time and a lot of hard work.

This is not a plea to start an IETF Working Group and develop JSONFeed the same way as Atom was, it is just a plea to not rush things and to perhaps gather some input and experience from the people that were involved in the development of Atom (and RSS) before you wrap this format up and ship it to production.

@manton
Owner

manton commented Jun 16, 2018

Thanks again everyone for the discussion of JSON Feed and JSON-LD. Probably past time to close this now.

@manton manton closed this as completed Jun 16, 2018
@danbri

danbri commented Jun 16, 2018 via email

@Globik

Globik commented Oct 26, 2018

What is the articlebody.value for? Why duplicate the content of an article? Nonsense! 3000 lines of article in paragraph tags, doubled in an articlebody.value. Or do you want to break the user experience on a mobile device with little memory on board?
