Canonical Tags: Subjects to become 1st class objects in metamodel #2819

Closed
tfmorris opened this issue Jan 2, 2020 · 17 comments
Labels
  • Affects: Data - Issues that affect book/author metadata or user/account data. [managed]
  • Affects: Experience - Issues relating directly to service design & patrons experience
  • Affects: Server - Issues with the server (olweb) or its plugins. [managed]
  • Affects: UI - Issues with the web site's user interface. [managed]
  • Lead: @mekarpeles - Issues overseen by Mek (Staff: Program Lead) [managed]
  • Needs: Breakdown - This big issue needs a checklist or subissues to describe a breakdown of work. [managed]
  • Needs: Community Discussion - This issue is to be brought up in the next community call. [managed]
  • Priority: 2 - Important, as time permits. [managed]
  • Theme: Internationalization - Making OpenLibrary work for both foreign-language users and books. [managed]
  • Type: Epic - A feature or refactor that is big enough to require subissues. [managed]
  • Type: Feature Request - Issue describes a feature or enhancement we'd like to implement. [managed]

Comments

@tfmorris
Contributor

tfmorris commented Jan 2, 2020

Subjects are currently treated as strings, with light normalization to coalesce similar strings, which limits our flexibility to do things like support aliases, multiple languages, metadata such as descriptions, links to Wikidata, etc.

Proposal & Constraints

Subjects should be first class objects with a set of attributes including:

  • key
  • preferred label (one per language)
  • description (one per language)
  • aliases (multiple per language)
  • external identifier(s) - Wikidata to start, perhaps others like FAST
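
A minimal sketch of what such a record might look like, assuming Python dataclasses. The class name, field names, key format, and the Wikidata value are all illustrative placeholders, not an agreed schema:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Subject:
    """Hypothetical first-class subject record (field names are assumptions)."""

    key: str
    labels: dict[str, str] = field(default_factory=dict)         # one preferred label per language
    descriptions: dict[str, str] = field(default_factory=dict)   # one description per language
    aliases: dict[str, list[str]] = field(default_factory=dict)  # multiple aliases per language
    identifiers: dict[str, str] = field(default_factory=dict)    # e.g. {"wikidata": "..."}

    def label_for(self, lang: str, fallback: str = "en") -> str:
        """Preferred label in `lang`, else the fallback language, else the key."""
        return self.labels.get(lang) or self.labels.get(fallback) or self.key


history = Subject(
    key="/subjects/OL1S",  # made-up key format
    labels={"en": "History", "fr": "Histoire"},
    aliases={"en": ["Past events"]},
    identifiers={"wikidata": "Q-PLACEHOLDER"},  # placeholder, not a real QID
)
```

Here `history.label_for("fr")` gives the French label, while a language with no label falls back to English.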

Component Updates

  • Change importer to look up using subject labels and aliases and return the subject key to be stored.
  • Change subject display on works, etc pages to use preferred label in the user's preferred language
  • Change subject page to include description and aliases as well as preferred label. Allow editing of these elements.
  • Add multilingual label, alias, & description editing (ie for languages other than the current UI language)
  • Add subject merge (for the inevitable duplicates which will occur)
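
The importer lookup in the first bullet could be sketched roughly as below. This assumes subjects are held as plain dicts with `labels` and `aliases` keyed by language; the `normalize` rule and the key format are hypothetical stand-ins for whatever normalization the real importer would use:

```python
from __future__ import annotations


def normalize(label: str) -> str:
    """Light normalization: collapse whitespace and ignore case."""
    return " ".join(label.split()).casefold()


def build_index(subjects: dict[str, dict]) -> dict[str, str]:
    """Map every normalized label and alias, in any language, to a subject key."""
    index: dict[str, str] = {}
    for key, doc in subjects.items():
        names = list(doc.get("labels", {}).values())
        for alias_list in doc.get("aliases", {}).values():
            names.extend(alias_list)
        for name in names:
            # The first subject to claim a name keeps it; clashes would need review.
            index.setdefault(normalize(name), key)
    return index


def lookup(index: dict[str, str], label: str) -> str | None:
    """Return the subject key to store for an incoming label, or None if unknown."""
    return index.get(normalize(label))


SUBJECTS = {
    "/subjects/OL1S": {  # made-up key format
        "labels": {"en": "History", "fr": "Histoire"},
        "aliases": {"en": ["Past events"]},
    },
}
INDEX = build_index(SUBJECTS)
```

With this index, `lookup(INDEX, "  HISTOIRE ")` and `lookup(INDEX, "past   events")` both resolve to the same key, and an unknown label returns `None` so the importer can decide whether to create a new subject.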

Additional context

Traditionally library cataloging standards have used pre-coordinated subjects like "U.S. History -- World War II -- 1945" (made up, perhaps invalid, example) which we split apart into constituent elements during import, similar to FAST. The working assumption is that we'll continue to do that, but just making the assumption explicit here.
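
For illustration, the splitting step might look like this. It is a sketch that assumes the `--` separator shown in the made-up example above; the function name is hypothetical and real MARC subfield handling is more involved:

```python
def split_precoordinated(heading: str, separator: str = "--") -> list[str]:
    """Split a pre-coordinated heading into its constituent elements."""
    parts = [part.strip() for part in heading.split(separator)]
    return [part for part in parts if part]  # drop empty fragments


elements = split_precoordinated("U.S. History -- World War II -- 1945")
# elements == ["U.S. History", "World War II", "1945"]
```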

Stakeholders

⚠️ EDIT by @mekarpeles: Supplanted by #7904

@tfmorris tfmorris added Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] Affects: Data Issues that affect book/author metadata or user/account data. [managed] Affects: Experience Issues relating directly to service design & patrons experience Affects: Server Issues with the server (olweb) or its plugins. [managed] Affects: UI Issues with the web site's user interface. [managed] Theme: Internationalization Making OpenLibrary work for both foreign-language users and books. [managed] labels Jan 2, 2020
@xayhewalo xayhewalo added CH: subjects Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] Priority: 2 Important, as time permits. [managed] labels Jan 3, 2020
@xayhewalo
Collaborator

@hornc I added your personal label as I thought it was relevant.

@LeadSongDog

@tfmorris
Just spitballing here, but as a transitional step, will we not need to have support for both the existing free-form and whatever structured form is chosen?

@tfmorris
Contributor Author

will we not need to have support for both the existing free-form and whatever structured form is chosen?

No. My expectation is that we'll convert everything to structured form at once, but with perhaps imperfect resolution/merging of duplicates which will improve over time. ie we might have two different subjects with labels of "History" and "Histoire" but over time they'll get consolidated together into a single object (with redirects for the former merged subjects).
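
A rough sketch of that consolidation, assuming subjects live in a dict keyed by subject key and that a merged record is replaced by a `{"redirect": ...}` stub. Both the storage shape and the function names are hypothetical:

```python
def merge_subjects(store: dict, duplicate_key: str, target_key: str) -> None:
    """Fold the duplicate's labels into the target and leave a redirect behind."""
    duplicate = store[duplicate_key]
    target = store[target_key]
    labels = target.setdefault("labels", {})
    for lang, label in duplicate.get("labels", {}).items():
        if lang not in labels:
            labels[lang] = label  # target gains a label in a new language
        elif label != labels[lang]:
            # Target already has a preferred label; keep the other one as an alias.
            target.setdefault("aliases", {}).setdefault(lang, []).append(label)
    store[duplicate_key] = {"redirect": target_key}


def resolve(store: dict, key: str) -> str:
    """Follow redirects until a live subject is reached."""
    seen = set()
    while "redirect" in store[key]:
        if key in seen:
            raise ValueError(f"redirect loop at {key}")
        seen.add(key)
        key = store[key]["redirect"]
    return key


store = {
    "/subjects/OL1S": {"labels": {"en": "History"}},   # made-up keys
    "/subjects/OL2S": {"labels": {"fr": "Histoire"}},
}
merge_subjects(store, "/subjects/OL2S", "/subjects/OL1S")
```

After the merge, works that still reference `/subjects/OL2S` resolve to `/subjects/OL1S`, which now carries both the English and French labels.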

@LeadSongDog

So then, what happens in the many cases where there's no structured form clearly equivalent to the old free-form? Do we have a catch-all?

@tfmorris
Contributor Author

There's no such case. See "everything" in my previous reply.

@tfmorris
Contributor Author

tfmorris commented Feb 1, 2023

I'm aware @tfmorris would prefer us moving directly to a system

Since this is the first sign of progress I've seen and I wasn't aware that the design was happening in the back rooms, I can't really say. I've added the Google doc with the design/plan to my list to review.

I will note, however, that a search for the terms MARC, BIBFRAME, FAST, LOD, Linked Data, Linked Open Data all turned up zero hits, so I'm a little concerned about interoperability with the Real World.

@mekarpeles
Member

mekarpeles commented Mar 15, 2023

Plan looks something like:

Before building anything sophisticated, I think a few things would be helpful:

  • Create new infogami Tag type OL…T (in prod + local dev environment) (essentially a json doc)
    • may require some work from me, @cdrini, and @jimchamp (to make sure this type exists on dev instances)
  • Creating an experimental Tag document instance (e.g. for a collection, as a prototype) we can test
  • Seeing if we can synthesize a collection based on this collection Tag -- e.g. if someone goes to a /collections/<foo> and the page doesn't exist, then the controller will render a collections page based on the data in <foo> Tag.
    • As per October focus (helping researchers) we may want to pilot an enhanced K-12 collection
    • Create mapper to resolve a Work.subject string → Tag document (if one exists)
  • Trying to build a simple extension on the ILE so we can associate works with this tag by adding its Tag.name to a work's subjects list.
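
The `/collections/<foo>` fallback from the list above could be sketched as a plain function. This is only an illustration: it assumes Tag documents are dicts with `tag_type`, `tag_name`, `title`, and `description` fields (names borrowed from the example Tag document later in this thread), and `collection_page` and the `synthesized` flag are invented for the sketch rather than taken from the Open Library codebase:

```python
from __future__ import annotations


def collection_page(name: str, pages: dict[str, dict], tags: list[dict]) -> dict | None:
    """Prefer an existing page; otherwise synthesize one from a collection Tag."""
    if name in pages:
        return pages[name]
    for tag in tags:
        if tag.get("tag_type") == "collection" and tag.get("tag_name") == name:
            return {
                "title": tag.get("title", {}).get("en", name),
                "description": tag.get("description", {}).get("en", ""),
                "synthesized": True,  # marks pages built from a Tag
            }
    return None  # the caller would respond with a 404


TAGS = [
    {"key": "/tag/OL1T", "tag_type": "collection", "tag_name": "k-12",
     "title": {"en": "K-12"}},
]
PAGES = {"handmade": {"title": "Hand-coded page", "synthesized": False}}

page = collection_page("k-12", PAGES, TAGS)
```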

T.B.D.

  • Formalize schema for Tags

@mekarpeles mekarpeles added the Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] label Mar 15, 2023
@tfmorris
Contributor Author

Subjects, Collections, Awards, and Censorship Warnings (e.g. NSFW) are all VERY different things. Attempting to smoosh them into a single schema creates unnecessary complexity and makes them more difficult to query.

  • Collections are simple sets (unordered) or lists (ordered) which are manually curated or dynamically created using search criteria.
  • Awards typically have a sponsoring organization and a date. They may be given to a contributor (author, illustrator, etc) or work.
  • Censorship categories depend on the geographical & political regime as well as the age, gender, religion, etc of the viewer. They are probably very difficult to model.
  • Subjects are very well developed and known in the library community. They are explicitly assigned by librarians in both MARC and BIBFRAME cataloging formats. Modern catalog records even include URLs for subjects which act as strong identifiers. Subjects are organized into taxonomic hierarchies giving them structure.

There's a vast trove of professionally curated subject assignments in the MARC library records which is currently greatly underutilized (e.g. no import of FAST URLs). The proposal makes no mention of how the MARC importer will be affected or how this interacts with BIBFRAME data which libraries are already trialing in production.

I also see no mention of how the existing subjects will be deduplicated and matched with the new Thing.Tag.Subject subjects. I'm worried that perhaps this is seen as an entirely manual process, which isn't scalable at all.

I expect this is a fait accompli which isn't open for community input, so I'll stop there.

@mekarpeles
Member

mekarpeles commented Mar 21, 2023

@tfmorris why do you presume fait accompli? We've had no less than 3 community calls on this topic + we're open to discussion here as well.

You're right, there are lots of data sources we can use to get data. One thing blocking importing is having a place to put data.

Open Library currently has works, editions, authors, lists, and several other types. These are all APIs to maintain. Today we have a system that works ~well for subjects in that:

  • one can edit any work's subjects and it gets indexed in solr
  • subjects can be any string
  • subject pages are created dynamically based on subject membership

There are also deficiencies:

  • a subject is just a string, lots of dupes
  • subject pages encode very little data other than a string (and works/authors which subscribe to this string). This leads librarians to hand-code /collections pages.
  • Limited ability to support multiple types (currently just subjects -- which are misused -- and places, times, people)

I feel you're right that collections, subjects, moderation, and awards all have different schema. They could all be constructed as independent entities with their own functions, APIs, solr integrations, import pipelines, edit + display UIs. This seems like it could be difficult to maintain when really what's important is that each of these types has mutually consistent schema. I'm imagining they all "inherit" from type Tag and share a schema according to their sub type (e.g. subject, collection, moderation, award, etc). An important aspect to me, from an engineering & implementation perspective, is that they share the same infogami API and we're not creating more types than we need and creating more exposure than we can cover.
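
One way to picture the "inherit from type Tag" idea is a shared set of base fields plus per-type extras. This is a sketch under assumptions: the per-type field names below are invented for illustration and no such schema has been formalized:

```python
# Fields every Tag would carry, whatever its sub type.
BASE_FIELDS = {"key", "tag_type", "tag_name"}

# Hypothetical extra fields required by each sub type.
EXTRA_FIELDS = {
    "subject": {"identifiers"},
    "collection": {"queries"},
    "award": {"sponsor", "year"},
}


def missing_fields(doc: dict) -> list[str]:
    """List required fields absent from a Tag document, given its tag_type."""
    tag_type = doc.get("tag_type")
    if tag_type not in EXTRA_FIELDS:
        raise ValueError(f"unknown tag_type: {tag_type!r}")
    required = BASE_FIELDS | EXTRA_FIELDS[tag_type]
    return sorted(required - set(doc))


award = {"key": "/tag/OL3T", "tag_type": "award", "tag_name": "example-award",
         "sponsor": "Example Society"}
gaps = missing_fields(award)  # the award document lacks a year
```

All sub types go through the same validation entry point, which is the maintenance benefit being argued for; the cost is that each document must be checked against its `tag_type` before its extra fields can be trusted.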

You make a point that forcing tags to use the same schema increases complexity and reduces the ability to query. Wouldn't having tags all in one bucket decrease the complexity of the engineering even if it increases the complexity of the tags themselves? I agree it does transfer complexity from the system to the patron -- and given our resource constraints, to me this is an advantage in this specific situation.

With respect to querying, I agree that the combo of tags with types may make it more difficult to query in infogami, but I imagine the primary use case (i.e. how subjects are used today) is querying via solr and there are any number of optimizations we can make to aggregate a work's tags and usefully bucket them within solr. I intend for infogami to be the storage mechanism and keeping tags as simple, interoperable, and as extensible as possible architecturally I believe is to our benefit.

@mekarpeles
Member

Example generic Tag document could look something like:

{
  key: "/tag/OL1T",
  tag_type: "subject",
  tag_name: "fantasy",
  queries: [{
    "title": {"en": "Recent release"},
    "query": "...&sort=newest",
  }],
  exclusion: "title: ...",
  title: {
    "en": "Fantasy",
    "fr": "Fantaisie"
  },
  description: {
    "en": "..."
  },
  header_img: "https://media.istockphoto.com/id/1070683626/photo/magical-old-book-with-sparkles.jpg",
  children: ["/tag/OL2T", "/tag/OL33T"],
  neighbors: [],
  ... // additional schema related to tag_type
}

@cdrini cdrini modified the milestones: 2023, Sprint 2023-04 Mar 27, 2023
@mekarpeles
Member

I'll remain the lead for this issue but am going to mark @JaydenTeoh as the assignee as they've been making great progress (keep up the great work!)

@mekarpeles
Member

Related: #65

@mekarpeles
Member

Supplanted by #7904

@tfmorris
Contributor Author

tfmorris commented Sep 7, 2023

In the case of duplicates most projects keep the oldest issue so that provenance, discussion, and age are preserved. This project seems to continually replace old, perfectly valid issues with new ones. Why is that? Does someone's bonus depend on how long tickets have been open?

It looks like my assumption of "fait accompli" was accurate. No response to questions about how this relates to BIBFRAME, FAST, LCSH, Wikidata, MARC, or anything else in the real world. No response to comments in design document that was linked above. No attention being paid to centuries of library cataloging practice or the directions that the library community (the only source of high quality metadata for OL) is going.

For the record, my request for Subjects to become first class objects in OpenLibrary was not "completed" as the issue status seems to indicate, but instead roundly rejected. "Tags" may become useful some day and it may even be possible to build Subjects as first class objects on top of them, but there is no plan or path which shows how (or even if) that is going to happen. I'm very disappointed.

@jimchamp
Collaborator

jimchamp commented Sep 7, 2023

The insults will surely help your case, Tom.

@mekarpeles mekarpeles modified the milestones: 2023, Sprint 2023-08 Sep 9, 2023
@mekarpeles
Member

mekarpeles commented Sep 9, 2023

@tfmorris, we simply have 2 issues that were similar, one was closed, one remains open, and both are linked together. You're right that there could have been a better way of preserving provenance -- but hey, we could be happy that someone cares enough to at least go through 700+ issues and try their best to dedupe at all. Furthermore, a section was explicitly added to the planning doc in response to updates we're planning for October: https://docs.google.com/document/d/1zrZAXgk2GEZRWb0D8tsrgaPzX4KdXHVt1s6ZQ4wUHLI/edit#heading=h.o9utr3tyh8k. We have also made progress on several elements of this plan through GSoC this year. As well as anyone who has been involved with the project for several years, you know we're a small team doing the best we can. Sometimes an older issue gets closed instead of a newer one; that's not done out of malice but rather in an attempt to get things organized so we can make forward progress towards efforts that I know you care about.

In response to your questions about BIBFRAME, FAST, LCSH, Wikidata, MARC -- we continue to import classifications for these sources and also intend to link to sources within Tag documents (e.g. "this Tag is a classification from LCCN and here is its number").

We simply haven't gotten to that step of fully defining what is included in a Tag document because there are a lot of opportunities for integrating pieces we have confidence will be required. Much of the schema we imagine may be impacted by #7833 as we are interviewing 15 learners and educators to understand what types of affordances they may want beyond what we have in our current subject pages.

We've had dozens of calls on this topic, spanning staff, design team, engineering, and librarians, and at least 10 different people have been involved in weighing in, as have you (to the best of my ability) over GitHub. This is one of many issues and it would be nice if it were appreciated that we are doing the best we're able, and also that it is hard to fully respond to the feedback folks have over GitHub, which is why we do weekly community calls which I've tried hard to include you in.

Yes, I'm not perfect and will continue to make mistakes. I understand how frustrating advising and contributing to Open Library must feel under these constraints. All I can do is continue to be open to collaborating to the best of my ability and it would be nice if we could achieve that with good will and the compassion of two contributors who both care deeply about the project and doing right by our patrons.


7 participants