Canonical Tags: Subjects to become 1st class objects in metamodel #2819

tfmorris · 2020-01-02T15:13:28Z

Subjects are currently treated as strings, with light normalization to coalesce similar strings, which limits our flexibility to do things like support aliases, multiple languages, metadata such as descriptions, links to Wikidata, etc.

Proposal & Constraints

Subjects should be first class objects with a set of attributes including:

key
preferred label (one per language)
description (one per language)
aliases (multiple per language)
external identifier(s) - Wikidata to start, perhaps others like FAST
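As a rough illustration of the attribute list above, a subject record could be sketched like this. The class name, field names, key format, and identifier value are all hypothetical and are not taken from the Open Library codebase.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subject:
    """Hypothetical first-class subject record; every field name is illustrative."""

    key: str
    # One preferred label and one description per language code.
    labels: Dict[str, str] = field(default_factory=dict)
    descriptions: Dict[str, str] = field(default_factory=dict)
    # Any number of aliases per language code.
    aliases: Dict[str, List[str]] = field(default_factory=dict)
    # External identifiers keyed by source name, e.g. "wikidata" or "fast".
    identifiers: Dict[str, str] = field(default_factory=dict)

    def names(self) -> List[str]:
        """Return every preferred label and alias, across all languages."""
        result = list(self.labels.values())
        for alias_list in self.aliases.values():
            result.extend(alias_list)
        return result


# Made-up example record; the key and the identifier are placeholders.
history = Subject(
    key="/subjects/OL1S",
    labels={"en": "History", "fr": "Histoire"},
    aliases={"en": ["Hist."]},
    identifiers={"wikidata": "Q-PLACEHOLDER"},
)
```

Keeping labels, descriptions, and aliases keyed by language code is one way to satisfy the "one per language" and "multiple per language" constraints without separate tables.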

Component Updates

Change importer to look up using subject labels and aliases and return the subject key to be stored.
Change subject display on works, etc. pages to use the preferred label in the user's preferred language
Change subject page to include description and aliases as well as preferred label. Allow editing of these elements.
Add multilingual label, alias, & description editing (i.e. for languages other than the current UI language)
Add subject merge (for the inevitable duplicates which will occur)
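The importer change in the first item could be sketched roughly as follows: build an index from every label and alias to a subject key, then look up incoming strings after light normalization. The function names, the normalization rule, and the keys are assumptions for illustration only.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def normalize(text: str) -> str:
    """Light normalization (an assumption): case-fold and collapse whitespace."""
    return " ".join(text.casefold().split())


def build_index(subjects: Iterable[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Map each normalized label or alias to a subject key.

    Each item is (key, names), where names holds labels and aliases in any
    language. The first subject to claim a name keeps it; a real system would
    need a policy for such collisions.
    """
    index: Dict[str, str] = {}
    for key, names in subjects:
        for name in names:
            index.setdefault(normalize(name), key)
    return index


def lookup(index: Dict[str, str], raw: str) -> Optional[str]:
    """Return the subject key to store for an imported string, or None."""
    return index.get(normalize(raw))


# Hypothetical keys and names.
index = build_index([
    ("/subjects/OL1S", ["History", "Histoire"]),
    ("/subjects/OL2S", ["World War II", "Second World War"]),
])
```

A miss (`None`) is where the importer would have to decide whether to create a new subject or queue the string for review; that policy is not specified in this issue.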

Additional context

Traditionally library cataloging standards have used pre-coordinated subjects like "U.S. History -- World War II -- 1945" (made up, perhaps invalid, example) which we split apart into constituent elements during import, similar to FAST. The working assumption is that we'll continue to do that, but just making the assumption explicit here.
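For illustration only, splitting a pre-coordinated heading into its constituent elements might look like the sketch below. It assumes the elements are separated by "--" as in the made-up example above; real MARC records carry these elements in separate subfields, so an actual importer would not need string splitting.

```python
from typing import List


def split_precoordinated(heading: str) -> List[str]:
    """Split a heading on "--" and drop empty or whitespace-only parts."""
    parts = [part.strip() for part in heading.split("--")]
    return [part for part in parts if part]


elements = split_precoordinated("U.S. History -- World War II -- 1945")
# elements == ["U.S. History", "World War II", "1945"]
```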

Stakeholders

⚠️ EDIT by @mekarpeles: Supplanted by #7904

xayhewalo · 2020-01-03T02:38:17Z

@hornc I added your personal label as I thought it was relevant.

LeadSongDog · 2020-01-03T21:28:09Z

@tfmorris
Just spitballing here, but as a transitional step, will we not need to have support for both the existing free-form and whatever structured form is chosen?

tfmorris · 2020-01-15T17:15:52Z

> will we not need to have support for both the existing free-form and whatever structured form is chosen?

No. My expectation is that we'll convert everything to structured form at once, but with perhaps imperfect resolution/merging of duplicates which will improve over time, i.e. we might have two different subjects with labels of "History" and "Histoire", but over time they'll get consolidated together into a single object (with redirects for the former merged subjects).
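A minimal sketch of that consolidation, assuming an in-memory store with hypothetical keys: merging folds the duplicate's labels into the surviving subject and leaves a redirect behind so old references still resolve.

```python
from typing import Dict


class SubjectStore:
    """Toy in-memory store; a stand-in for whatever persistence is used."""

    def __init__(self) -> None:
        self.labels: Dict[str, Dict[str, str]] = {}  # key -> {lang: label}
        self.redirects: Dict[str, str] = {}          # merged key -> surviving key

    def add(self, key: str, labels: Dict[str, str]) -> None:
        self.labels[key] = dict(labels)

    def resolve(self, key: str) -> str:
        """Follow redirects until reaching a key that has not been merged away."""
        while key in self.redirects:
            key = self.redirects[key]
        return key

    def merge(self, duplicate: str, target: str) -> str:
        """Fold `duplicate` into `target`, keeping the target's existing labels."""
        duplicate, target = self.resolve(duplicate), self.resolve(target)
        if duplicate == target:
            return target
        for lang, label in self.labels.pop(duplicate).items():
            self.labels[target].setdefault(lang, label)
        self.redirects[duplicate] = target
        return target


store = SubjectStore()
store.add("/subjects/OL1S", {"en": "History"})
store.add("/subjects/OL2S", {"fr": "Histoire"})
store.merge("/subjects/OL2S", "/subjects/OL1S")
```

Resolving both keys before merging means a redirect always points at a live subject, so chains stay short and cannot loop.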

LeadSongDog · 2020-01-17T16:16:34Z

So then, what happens in the many cases where there's no structured form clearly equivalent to the old free-form? Do we have a catch-all?

tfmorris · 2020-01-17T18:07:39Z

There's no such case. See "everything" in my previous reply.

tfmorris · 2023-02-01T15:21:54Z

> I'm aware @tfmorris would prefer us moving directly to a system

Since this is the first sign of progress I've seen and I wasn't aware that the design was happening in the back rooms, I can't really say. I've added the Google doc with the design/plan to my list to review.

I will note, however, that a search for the terms MARC, BIBFRAME, FAST, LOD, Linked Data, Linked Open Data all turned up zero hits, so I'm a little concerned about interoperability with the Real World.

mekarpeles · 2023-03-15T06:24:10Z

Plan looks something like:

Before building anything sophisticated, I think a few things would be helpful:

Create new infogami Tag type OL…T (in prod + local dev environment) (essentially a json doc)
- may require some work from me, @cdrini, and @jimchamp (to make sure this type exists on dev instances)
Creating an experimental Tag document instance (e.g. for a collection, as a prototype) we can test
Seeing if we can synthesize a collection based on this collection Tag -- e.g. if someone goes to a /collections/<foo> and the page doesn't exist, then the controller will render a collections page based on the data in <foo> Tag.
- As per October focus (helping researchers) we may want to pilot an enhanced K-12 collection
- Create mapper to resolve a Work.subject string → Tag document (if one exists)
Trying to build a simple extension on the ILE so we can associate works with this tag by adding its Tag.name to a work's subjects list.

T.B.D.

Formalize schema for Tags

tfmorris · 2023-03-15T16:21:45Z

Subjects, Collections, Awards, and Censorship Warnings (e.g. NSFW) are all VERY different things. Attempting to smoosh them into a single schema creates unnecessary complexity and makes them more difficult to query.

Collections are simple sets (unordered) or lists (ordered) which are manually curated or dynamically created using search criteria.
Awards typically have a sponsoring organization and a date. They may be given to a contributor (author, illustrator, etc) or work.
Censorship categories depend on the geographical & political regime as well as the age, gender, religion, etc of the viewer. They are probably very difficult to model.
Subjects are very well developed and known in the library community. They are explicitly assigned by librarians in both MARC and BIBFRAME cataloging formats. Modern catalog records even include URLs for subjects which act as strong identifiers. Subjects are organized into taxonomic hierarchies giving them structure.

There's a vast trove of professionally curated subject assignments in the MARC library records which is currently greatly underutilized (e.g. no import of FAST URLs). The proposal makes no mention of how the MARC importer will be affected or how this interacts with BIBFRAME data which libraries are already trialing in production.

I also see no mention of how the existing subjects will be deduplicated and matched with the new Thing.Tag.Subject subjects. I'm worried that perhaps this is seen as an entirely manual process, which isn't scalable at all.

I expect this is a fait accompli which isn't open for community input, so I'll stop there.

mekarpeles · 2023-03-21T22:35:03Z

@tfmorris why do you presume fait accompli? We've had no less than 3 community calls on this topic + we're open to discussion here as well.

You're right, there are lots of data sources we can use to get data. One thing blocking importing is having a place to put data.

Open Library currently has works, editions, authors, lists, and several other types. These are all APIs to maintain. Today we have a system that works ~well for subjects in that:

one can edit any work's subjects and it gets indexed in solr
subjects can be any string
subject pages are created dynamically based on subject membership

There are also deficiencies:

a subject is just a string, lots of dupes
subject pages encode very little data other than a string (and works/authors which subscribe to this string). This leads librarians to hand-code /collections pages.
Limited ability to support multiple types (currently just subjects -- which are misused -- and places, times, people)

I feel you're right that collections, subjects, moderation, and awards all have different schema. They could all be constructed as independent entities with their own functions, APIs, solr integrations, import pipelines, edit + display UIs. This seems like it could be difficult to maintain when really what's important is that each of these types has mutually consistent schema. I'm imagining they all "inherit" from type Tag and share a schema according to their sub type (e.g. subject, collection, moderation, award, etc). An important aspect to me, from an engineering & implementation perspective, is that they share the same infogami API and we're not creating more types than we need and creating more exposure than we can cover.
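One way to picture "inheriting" from a shared Tag type is a common set of base fields plus per-sub-type extras, checked by a single validator. This is only a sketch: the extra field names below (`queries`, `sponsor`, `date`, `region`) are hypothetical, loosely based on the descriptions earlier in this thread, and no schema has been formalized.

```python
from typing import Any, Dict, List, Set

# Fields every Tag document would share (assumed).
BASE_FIELDS: Set[str] = {"key", "tag_type", "tag_name"}

# Extra required fields per sub-type (all hypothetical).
EXTRA_FIELDS: Dict[str, Set[str]] = {
    "subject": set(),
    "collection": {"queries"},
    "award": {"sponsor", "date"},
    "moderation": {"region"},
}


def missing_fields(doc: Dict[str, Any]) -> List[str]:
    """Return the required fields absent from a Tag document, sorted by name."""
    tag_type = doc.get("tag_type")
    if tag_type not in EXTRA_FIELDS:
        raise ValueError(f"unknown tag_type: {tag_type!r}")
    required = BASE_FIELDS | EXTRA_FIELDS[tag_type]
    return sorted(required - set(doc))


award = {"key": "/tag/OL3T", "tag_type": "award", "tag_name": "example award", "sponsor": "Example Org"}
# missing_fields(award) == ["date"]
```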

You make a point that forcing tags to use the same schema increases complexity and reduces the ability to query. Wouldn't having tags all in one bucket decrease the complexity of the engineering even if it increases the complexity of tags itself? I agree it does transfer complexity from the system to the patron -- and given our resource constraints, to me this is an advantage in this specific situation.

With respect to querying, I agree that the combo of tags with types may make it more difficult to query in infogami, but I imagine the primary use case (i.e. how subjects are used today) is querying via solr and there are any number of optimizations we can make to aggregate a work's tags and usefully bucket them within solr. I intend for infogami to be the storage mechanism and keeping tags as simple, interoperable, and as extensible as possible architecturally I believe is to our benefit.

mekarpeles · 2023-03-22T00:09:51Z

Example generic Tag document could look something like:

{
  key: "/tag/OL1T",
  tag_type: "subject",
  tag_name: "fantasy",
  queries: [{
    "title": {"en": "Recent release"},
    "query": "...&sort=newest",
  }],
  exclusion: "title: ...",
  title: {
    "en": "Fantasy",
    "fr": "Fantaisie"
  },
  description: {
    "en": "..."
  },
  header_img: "https://media.istockphoto.com/id/1070683626/photo/magical-old-book-with-sparkles.jpg",
  children: ["/tag/OL2T", "/tag/OL33T"],
  neighbors: [],
  ... // additional schema related to tag_type
}
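Given per-language maps like `title` and `description` above, display code would need some fallback rule when the patron's language is missing. A minimal sketch, assuming English as the default fallback (the function name and the fallback order are assumptions, not part of the proposal):

```python
from typing import Any, Dict, Optional


def localized(doc: Dict[str, Any], field: str, lang: str, fallback: str = "en") -> Optional[str]:
    """Pick doc[field][lang], else the fallback language, else any value, else None."""
    values = doc.get(field) or {}
    if lang in values:
        return values[lang]
    if fallback in values:
        return values[fallback]
    # No preferred or fallback language: return whichever value is listed first.
    return next(iter(values.values()), None)


# Trimmed-down copy of the example document above.
tag = {
    "key": "/tag/OL1T",
    "tag_type": "subject",
    "tag_name": "fantasy",
    "title": {"en": "Fantasy", "fr": "Fantaisie"},
}
```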

mekarpeles · 2023-04-27T18:47:24Z

I'll remain the lead for this issue but am going to mark @JaydenTeoh as the assignee as they've been making great progress (keep up the great work!)

mekarpeles · 2023-05-23T17:43:49Z

Related: #65

mekarpeles · 2023-09-07T15:26:29Z

Supplanted by #7904

tfmorris · 2023-09-07T19:31:09Z

In the case of duplicates, most projects keep the oldest issue so that provenance, discussion, and age are preserved. This project seems to continually replace old, perfectly valid issues with new ones. Why is that? Does someone's bonus depend on how long tickets have been open?

It looks like my assumption of "fait accompli" was accurate. No response to questions about how this relates to BIBFRAME, FAST, LCSH, Wikidata, MARC, or anything else in the real world. No response to comments in the design document that was linked above. No attention being paid to centuries of library cataloging practice or the directions that the library community (the only source of high quality metadata for OL) is going.

For the record, my request for Subjects to become first class objects in OpenLibrary was not "completed" as the issue status seems to indicate, but instead roundly rejected. "Tags" may become useful some day and it may even be possible to build Subjects as first class objects on top of them, but there is no plan or path which shows how (or even if) that is going to happen. I'm very disappointed.

jimchamp · 2023-09-07T19:38:25Z

The insults will surely help your case, Tom.

mekarpeles · 2023-09-09T06:41:25Z

@tfmorris, we simply have 2 issues that were similar, one was closed, one remains open, and both are linked together. You're right that there could have been a better way of preserving provenance -- but hey, we could be happy that someone cares enough to at least go through 700+ issues and try their best to dedupe at all. Furthermore, a section was explicitly added to the planning doc in response to updates we're planning for October: https://docs.google.com/document/d/1zrZAXgk2GEZRWb0D8tsrgaPzX4KdXHVt1s6ZQ4wUHLI/edit#heading=h.o9utr3tyh8k. We have also made progress on several elements of this plan through GSoC this year. As someone who has been involved with the project for several years, you know as well as anyone that we're a small team doing the best we can, and sometimes an older issue gets closed instead of a newer one. It's not done out of malice but rather in an attempt to get things organized so we can make forward progress towards efforts that I know you care about.

In response to your questions about BIBFRAME, FAST, LCSH, Wikidata, MARC -- we continue to import classifications for these sources and also intend to link to sources within Tag documents (e.g. "this Tag is a classification from LCCN and here is its number").

We simply haven't gotten to that step of fully defining what is included in a Tag document because there's a lot of opportunities for integrating pieces we have confidence will be required. Much of the schema we imagine may be impacted by #7833 as we are interviewing 15 learners and educators to understand what types of affordances they may want beyond what we have in our current subject pages.

We've had dozens of calls on this topic, spanning staff, design team, engineering, and librarians, and at least 10 different people have been involved in weighing in, as have you (to the best of my ability) over github. This is one of many issues and it would be nice if it were appreciated that we are doing the best we're able, and also that it is hard to fully respond to the feedback folks have over github, which is why we do weekly community calls which I've tried hard to include you in.

Yes, I'm not perfect and will continue to make mistakes. I understand how frustrating advising and contributing to Open Library must feel under these constraints. All I can do is continue to be open to collaborating to the best of my ability and it would be nice if we could achieve that with good will and the compassion of two contributors who both care deeply about the project and doing right by our patrons.

tfmorris mentioned this issue Jan 2, 2020

Identify UI elements that are not yet covered by i18n #973

Closed

18 tasks

xayhewalo added CH: subjects Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] Priority: 2 Important, as time permits. [managed] labels Jan 3, 2020

tfmorris mentioned this issue Jan 15, 2020

Subjects directly linking to authors #2870

Closed

hornc removed the CH: subjects label Mar 9, 2020

xayhewalo removed the State: Backlogged label Mar 17, 2020

tfmorris mentioned this issue Mar 21, 2020

Ability for Librarians to remove subjects #3233

Closed

LeadSongDog mentioned this issue May 21, 2020

Add an option to add synonyms to publishers, authors and places #3470

Open

milotype mentioned this issue Jun 18, 2021

Update Croatian translation #5301

Merged

tfmorris mentioned this issue Apr 18, 2022

Rework for subjects #6434

Closed

2 tasks

cdrini mentioned this issue Apr 26, 2022

Translate /subjects into Chinese #6486

Closed

mekarpeles mentioned this issue Mar 15, 2023

Support private lists #3456

Closed

mekarpeles self-assigned this Mar 15, 2023

mekarpeles added the Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] label Mar 15, 2023

cdrini modified the milestones: 2023, Sprint 2023-04 Mar 27, 2023

JaydenTeoh mentioned this issue Apr 5, 2023

Tag Edit UI & Plugin System #7766

Merged

6 tasks

mekarpeles modified the milestones: Sprint 2023-04, Sprint 2023-05 May 1, 2023

tfmorris mentioned this issue May 17, 2023

Fetch subjects in JSON #7882

Closed

JaydenTeoh mentioned this issue May 25, 2023

Canonical Tags: Subjects to become 1st class objects in metamodel #7904

Open

7 tasks

mekarpeles modified the milestones: Sprint 2023-05, Sprint 2023-06 May 30, 2023

mekarpeles modified the milestones: Sprint 2023-06, Sprint 2023-07, 2023 Jul 10, 2023

mekarpeles closed this as completed Sep 7, 2023

mekarpeles modified the milestones: 2023, Sprint 2023-08 Sep 9, 2023
