Design librarian process & UI for merging duplicate Subjects #65

george08 · 2011-09-07T17:59:23Z

Can we please get rid of the duplicate variations on search results pages for subjects?

E.g.
http://openlibrary.org/search/subjects?q=New%20York

⚠️ EDIT: Administrative edit by @mekarpeles -- please jump to #65 (comment) to see recent context + proposal for implementing a solution to this issue.

bencomp · 2014-05-05T23:31:50Z

Duplicate variations are cool. That there are so many of them is a data problem, not a search problem per se. Different spellings may make a huge difference and so do Paris (place) vs. Paris (person).

A serious problem I face here is getting 0 results when I click the New York example. Somehow OL doesn't understand the space, whether it is kept as %20 or translated to +. I only get results when I explicitly write new_york.
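For reference, the three spellings of the query mentioned above can be produced with Python's standard `urllib.parse`. This is only a sketch of the encodings; it makes no claim about how OL parses them, and the lowercase-with-underscores rule is an assumption based on the one form that returned results.

```python
from urllib.parse import quote, quote_plus

query = "New York"

percent_form = quote(query)      # space kept as %20
plus_form = quote_plus(query)    # space translated to +
# Assumption: the working form is the query lowercased, with spaces as underscores.
underscore_form = query.lower().replace(" ", "_")

print(percent_form, plus_form, underscore_form)
```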

bencomp · 2014-06-06T06:48:01Z

It appears my problem was solved by 06b0da2, thanks Anand!

I now realise the problem with duplicates/variations is a bit different than I thought. The search query for New York yields subjects like New York University. On page 1 are three listings for New York University:

New York University (subject) 56 books
New York university (subject) 55 books
New York University (org) 47 books

The URLs for these results are the same, which means the number of books is wrong in at least two of these three results. The 'org' results are probably books in which the original records have the subject in a special (sub)field, but there is no way normal users can edit this in individual books.

I'm pretty sure this is a problem with how Solr gets updated.

hornc · 2017-11-01T00:25:21Z

In https://openlibrary.org/search/subjects?q=University+York the subject and org versions with the same URLs show the same counts...

but https://openlibrary.org/search/subjects?q=University+London
has
London University College 40 books, subject
London University College 39 books, org

and the same URLs, so this is still an issue.

SOLR updating issue?

brad2014 · 2019-05-03T21:59:36Z

@hornc , could you please opine as to the appropriate priority of this, by setting a label?

tfmorris · 2019-05-05T05:14:39Z

George's query returns 0 results (problem solved? :) ) Likely related to #322

The general problem is that subjects need to be objects, not string, and have aliases, associated metadata, etc.

hornc · 2019-05-05T06:03:14Z

Yes, I think this can be closed. #322 is a more specific and current issue that can be dealt with independently. Any other current Subject related issues should be raised separately with current examples.

tfmorris · 2019-05-05T19:26:35Z

I think this is still a valid issue. Although George's original query is broken due to a different bug (#322), the problem she reported isn't fixed. The queries that @hornc posted here in 2017: #65 (comment) still demonstrate the problem.

The first three hits:

York University (Toronto, Ont.) 40 books, subject
York University (Toronto, Ont.). 24 books, org
York University (Toronto, Ont.). 24 books, subject

resolve to only 2 unique URLs, which differ only by a single trailing period. This is precisely the same as Ben's example from 2014 (except there the difference was a single letter case difference).

Additionally, the two subject URLs have work counts of 29 and 19, not 40 and 24, respectively, so the counts reported on the search results page are incorrect. I'm guessing that if we fix the duplication, the counts will take care of themselves, so why don't we focus this issue on that? It's also the problem that George reported in 2011 and was confirmed in 2014 and again in 2017.

There are also additional search hits which should be merged:

Toronto York University 4 books, subject
York University (Toronto) 1 book, org
York University, Toronto 1 book, subject

but if we take care of the simple, common cases to start, we'll have a 90% solution.
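A minimal sketch of what handling the simple, common cases could look like, assuming subjects are plain strings and that case and trailing periods are the only differences to ignore. The `normalize_subject` and `cluster_subjects` helpers are hypothetical and are not existing Open Library code.

```python
from collections import defaultdict


def normalize_subject(name: str) -> str:
    """Hypothetical clustering key: collapse whitespace, drop trailing periods, ignore case."""
    return " ".join(name.split()).rstrip(". ").casefold()


def cluster_subjects(names):
    """Group subject strings that share the same normalized key."""
    clusters = defaultdict(list)
    for name in names:
        clusters[normalize_subject(name)].append(name)
    return dict(clusters)


hits = [
    "York University (Toronto, Ont.)",
    "York University (Toronto, Ont.).",
    "New York University",
    "New York university",
    "York University, Toronto",
]
clusters = cluster_subjects(hits)
for key, variants in clusters.items():
    print(key, variants)
```

Variants such as "York University, Toronto" stay in their own cluster under this key, which matches the idea of leaving the harder cases for later.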

On a more general note, I think favoring "current" (i.e. more modern) reports over historical ones is a bad idea in general because it loses the history of research that was done and obscures how long the problem has existed (8 years in this case).

mekarpeles · 2019-10-14T17:04:21Z

related #188

hornc · 2019-11-12T22:39:52Z

I'm trying to figure out what the remaining issue is here -- it seems the latest example is
https://openlibrary.org/search/subjects?q=University+York

and in the first three hits:

York University (Toronto, Ont.) 40 books, subject
York University (Toronto, Ont.). 24 books, org
York University (Toronto, Ont.). 24 books, subject

The issue is why there are two separate rows for:

York University (Toronto, Ont.). 24 books, org
York University (Toronto, Ont.). 24 books, subject

A book that has this subject https://openlibrary.org/works/OL19055710W.json
shows there is only one entry for York University (Toronto, Ont.). under subjects

Why is it showing in subject search results as an org and a subject?

Investigation task:

Investigate the significance of org and subject on the subject results page, and how this is represented in Solr, because it does not appear to be saved in the item metadata.

Subject search results page is: https://github.com/internetarchive/openlibrary/blob/master/openlibrary/templates/search/subjects.html

Alternatively, skip investigation, and fix display:

At first glance it looks like if $key or $n have been seen before, they shouldn't be displayed again.
Ideally the most specific category should be used over subject, so that the results are:

York University (Toronto, Ont.) 40 books, subject
York University (Toronto, Ont.). 24 books, org

The difference in the trailing period is a separate data issue, and subjects should go through some normalisation on import. (I believe they do on the import API path).

openlibrary/openlibrary/templates/search/subjects.html

Line 44 in a53f901

$for doc in response['docs']:
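The display fix described above could be sketched roughly as follows, in plain Python rather than in the template. The field names `key`, `name`, `type` and `count`, and the placeholder keys, are assumptions for illustration; the real Solr documents may be shaped differently.

```python
def dedupe_docs(docs):
    """Keep one row per key, preferring a specific type (e.g. 'org') over 'subject'."""
    best = {}  # key -> chosen doc; dicts preserve first-seen order
    for doc in docs:
        seen = best.get(doc["key"])
        if seen is None:
            best[doc["key"]] = doc
        elif seen["type"] == "subject" and doc["type"] != "subject":
            best[doc["key"]] = doc  # replace the generic row with the specific one
    return list(best.values())


# Placeholder keys; only the names and counts come from the example above.
docs = [
    {"key": "/subjects/a", "name": "York University (Toronto, Ont.)", "type": "subject", "count": 40},
    {"key": "/subjects/b", "name": "York University (Toronto, Ont.).", "type": "org", "count": 24},
    {"key": "/subjects/b", "name": "York University (Toronto, Ont.).", "type": "subject", "count": 24},
]
rows = dedupe_docs(docs)
for row in rows:
    print(row["name"], row["count"], row["type"])
```

This only hides the repeated row; it does not address the trailing-period difference or the counts.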

tfmorris · 2019-11-13T13:11:00Z

I believe it was specifically the normalization/clustering of subjects that George's original issue was about.

The org/place/time subjects are used as search facets. If you think there's an issue there, we should create a separate issue to address it and not conflate it with the original problem.

mekarpeles · 2019-12-13T00:49:30Z

I think this issue needs to be renamed and needs a clear scope (how do we know it's complete) if it's going to stay open. @tfmorris would you like to suggest a more useful title and description for this issue? I don't understand the problem well enough other than... There is a discrepancy between subject search results for subject v. org? Is there any proposed solution? Do we know where to look / what code is doing something wrong?

mekarpeles · 2023-02-01T05:29:49Z

We should be able to make significant progress on this through #7486 of #2819.

Extending the ILE admin blue bar should allow librarians to search for works by subject and update them in bulk.

It's possible we'll also want to use scripts for duplicate subjects applied to thousands+ of books.

mekarpeles · 2023-09-08T18:25:39Z

Updates

The good news is, we now have a subject "object" which we're calling a Tag. See: #7928

We decided to make a new object called a Tag (and leave subjects as is) because there are many things we want a Tag object for that are not limited to "subject".

The current strategy is to use subject strings as a mechanism to pull a corresponding Tag object from the db. This process and the corresponding features and definitions are described in the breakdown section here.

Challenges

Even with a way to promote subject strings into Tag objects, there are still several challenges around duplication and naming. There are two things I'd like to briefly discuss:

Some human or librarian-editable mechanism (e.g. a UI) should exist that allows us to merge subjects or tags (and have the changes apply to all relevant Works, etc.)
Leaning into the usage of: prefixes as a mechanism of "typing" subjects is another thing we've talked about, and designing Tags so they have a type (like subject, place, content-warning, etc).
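To make the first point concrete, here is a rough sketch of what applying a merge to a single Work record might involve, assuming a Work is a dict with a `subjects` list of strings. The `merge_subject_strings` helper and the example key are hypothetical, and saving the change back to the db is out of scope here.

```python
def merge_subject_strings(work, losers, winner):
    """Return a copy of `work` with every losing subject string replaced by `winner`."""
    merged = []
    for subject in work.get("subjects", []):
        subject = winner if subject in losers else subject
        if subject not in merged:  # avoid listing the winner twice
            merged.append(subject)
    return {**work, "subjects": merged}


# Hypothetical record; the key is a placeholder, not a real Work.
work = {
    "key": "/works/EXAMPLE",
    "subjects": ["New York university", "History", "New York University"],
}
updated = merge_subject_strings(
    work, losers={"New York university"}, winner="New York University"
)
print(updated["subjects"])
```

The original dict is left untouched, so a caller could diff the two versions before committing anything.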

Proposed Solution

As a result of the progress we've made this year, I'm going to rename this issue to make it something more actionable:

"Design process & UI for merging duplicate Subjects"

Ideally, the solution would also leverage and extend our existing Librarian Merge Queue: https://openlibrary.org/merges to include merge requests for a list of subjects.

Extend the blue librarian toolbar (ILE) to allow librarians to select multiple subjects on the subject search page (e.g. https://openlibrary.org/search/subjects?q=new+york) and then click Merge
The ILE will redirect to a new Merge page for subjects, https://openlibrary.org/subjects/merge?records= (similar to the Works merge page, e.g. https://openlibrary.org/works/merge?records=OL929211W), that will allow the librarian to confirm, leave a commit message, and select which subject name should "win" (i.e. be kept after) the merge
The merge dashboard https://openlibrary.org/merges should be updated to include a subject type and should also have a type filter and dropdown with checkboxes that allows librarians to select and see Work and/or Author and/or subject merge requests. Similar to how the URL may specify e.g. ?reviewer=librarian123, we should have types=authors,works,subjects (default) -- very similar change to Add "Status" filter to merge request table #8272
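A small sketch of the two URL pieces in the steps above, using only the standard library. The helper names, the comma-separated `records` value, the example subject keys, and the fallback to all types on unknown input are all assumptions, not decided behaviour.

```python
from urllib.parse import urlencode

ALLOWED_TYPES = ("authors", "works", "subjects")


def parse_types(raw):
    """Parse a `types` query value such as 'works,subjects'; default to every type."""
    if not raw:
        return list(ALLOWED_TYPES)
    requested = [t.strip().lower() for t in raw.split(",")]
    # Assumption: unknown values are ignored, and an empty result falls back to the default.
    return [t for t in ALLOWED_TYPES if t in requested] or list(ALLOWED_TYPES)


def subject_merge_url(subject_keys):
    """Build the proposed subject merge URL; assumes records are comma-separated."""
    query = urlencode({"records": ",".join(subject_keys)}, safe=",")
    return "https://openlibrary.org/subjects/merge?" + query


print(parse_types("works,subjects"))
print(subject_merge_url(["new_york", "new_york_city"]))
```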

ghost assigned EdwardBetts Sep 7, 2011

george08 unassigned EdwardBetts May 5, 2014

mekarpeles added Theme: Search Issues related to search UI and backend. [managed] subjects labels Mar 23, 2017

mekarpeles mentioned this issue Mar 23, 2017

Subject search breaks when keyword "new" present #322

Closed

hornc added the Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] label Nov 1, 2017

brad2014 assigned hornc May 3, 2019

hornc closed this as completed May 5, 2019

tfmorris reopened this May 5, 2019

tfmorris mentioned this issue Jul 30, 2019

WIP - Solr enhancements #2246

Closed

xayhewalo added this to Un-Triaged in Triage Oct 20, 2019

xayhewalo added Priority: 2 Important, as time permits. [managed] State: Backlogged labels Nov 12, 2019

xayhewalo moved this from Un-Triaged to Triaged in Triage Nov 12, 2019

hornc removed their assignment Jan 14, 2020

hornc removed the CH: subjects label Mar 9, 2020

xayhewalo removed the State: Backlogged label Mar 17, 2020

mekarpeles added Needs: Lead Theme: Subjects labels Apr 20, 2020

mekarpeles mentioned this issue Jan 22, 2021

Improving Search (No Dead Ends) #2728

Closed

13 tasks

mekarpeles added Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] and removed Needs: Lead labels Jan 5, 2022

mekarpeles mentioned this issue Sep 26, 2022

Search: Editions in Solr #6377

Closed

63 tasks

mekarpeles changed the title ~~Fix subject search results~~ Fix duplicate subject search results Dec 19, 2022

mekarpeles modified the milestones: September 2011, Next (proposed) Dec 19, 2022

mekarpeles modified the milestones: Next (proposed), 2023 Jan 26, 2023

mekarpeles mentioned this issue Feb 1, 2023

Canonical Tags: Subjects to become 1st class objects in metamodel #2819

Closed

mekarpeles changed the title ~~Fix duplicate subject search results~~ De-duplicate subject Feb 1, 2023

mekarpeles changed the title ~~De-duplicate subject~~ De-duplicate subjects Feb 1, 2023

JaydenTeoh mentioned this issue May 25, 2023

Canonical Tags: Subjects to become 1st class objects in metamodel #7904

Open

7 tasks

mekarpeles changed the title ~~De-duplicate subjects~~ Design librarian process & UI for merging duplicate Subjects Sep 8, 2023

mekarpeles added Affects: Librarians Issues related to features that librarians particularly need. [managed] Theme: Book Tags Issues related to community book tags Type: Epic A feature or refactor that is big enough to require subissues. [managed] labels Sep 8, 2023

mekarpeles mentioned this issue Sep 9, 2023

Add "Status" filter to merge request table #8272

Merged

mekarpeles modified the milestones: 2023, 2024 (provisional, requires discussion) Nov 6, 2023

cdrini added Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] and removed Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] labels Nov 20, 2023
