Design librarian process & UI for merging duplicate Subjects #65

Open
Tracked by #7904
george08 opened this issue Sep 7, 2011 · 13 comments
Labels
Affects: Librarians Issues related to features that librarians particularly need. [managed] Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 2 Important, as time permits. [managed] Theme: Book Tags Issues related to community book tags Theme: Search Issues related to search UI and backend. [managed] Theme: Subjects Type: Bug Something isn't working. [managed] Type: Epic A feature or refactor that is big enough to require subissues. [managed]

@george08

george08 commented Sep 7, 2011

Can we please get rid of the duplicate variations on search results pages for subjects?

E.g.
http://openlibrary.org/search/subjects?q=New%20York

⚠️ EDIT: Administrative edit by @mekarpeles -- please jump to #65 (comment) to see recent context + proposal for implementing a solution to this issue.

@ghost ghost assigned EdwardBetts Sep 7, 2011
@bencomp
Contributor

bencomp commented May 5, 2014

Duplicate variations are cool. That there are so many of them is a data problem, not a search problem per se. Different spellings may make a huge difference and so do Paris (place) vs. Paris (person).

A serious problem I face here is getting 0 results when I click the New York example. Somehow OL doesn't understand the space, whether it is kept as %20 or translated to +. Only when I explicitly write new_york do I get results.
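
For reference, a small sketch of the two space encodings mentioned above, using only the Python standard library. The `slug` line is an assumption about how a form like `new_york` could be derived from the query; it is not taken from the Open Library code.

```python
from urllib.parse import quote, quote_plus, unquote_plus

query = "New York"

# Percent-encoding keeps the space as %20; form-style encoding turns it into +.
percent_form = quote(query)      # 'New%20York'
plus_form = quote_plus(query)    # 'New+York'

# Both forms decode back to the same query string.
decoded_percent = unquote_plus(percent_form)
decoded_plus = unquote_plus(plus_form)

# Hypothetical slug form (lowercase, spaces replaced by underscores).
slug = query.lower().replace(" ", "_")   # 'new_york'
```

Since both encodings decode to the same string, a search endpoint would normally be expected to treat them identically.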

@bencomp
Contributor

bencomp commented Jun 6, 2014

It appears my problem was solved by 06b0da2, thanks Anand!

I now realise the problem with duplicates/variations is a bit different than I thought. The search query for New York yields subjects like New York University. On page 1 are three listings for New York University:

  • New York University (subject) 56 books
  • New York university (subject) 55 books
  • New York University (org) 47 books

The URLs for these results are the same, which means the number of books is wrong in at least two of these three results. The 'org' results are probably books in which the original records have the subject in a special (sub)field, but there is no way normal users can edit this in individual books.

I'm pretty sure this is a problem with how Solr gets updated.

@mekarpeles mekarpeles added Theme: Search Issues related to search UI and backend. [managed] subjects labels Mar 23, 2017
@hornc hornc added the Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] label Nov 1, 2017
@hornc
Collaborator

hornc commented Nov 1, 2017

In https://openlibrary.org/search/subjects?q=University+York the subject and org versions with the same URLs show the same counts...

but https://openlibrary.org/search/subjects?q=University+London
has
London University College 40 books, subject
London University College 39 books, org

and the same URLs, so this is still an issue.

SOLR updating issue?

@brad2014
Collaborator

brad2014 commented May 3, 2019

@hornc, could you please opine on the appropriate priority of this by setting a label?

@tfmorris
Contributor

tfmorris commented May 5, 2019

George's query returns 0 results (problem solved? :) ) Likely related to #322

The general problem is that subjects need to be objects, not strings, and have aliases, associated metadata, etc.

@hornc
Collaborator

hornc commented May 5, 2019

Yes, I think this can be closed. #322 is a more specific and current issue that can be dealt with independently. Any other current Subject related issues should be raised separately with current examples.

@hornc hornc closed this as completed May 5, 2019
@tfmorris
Contributor

tfmorris commented May 5, 2019

I think this is still a valid issue. Although George's original query is broken due to a different bug (#322), the problem she reported isn't fixed. The queries that @hornc posted here in 2017: #65 (comment) still demonstrate the problem.

The first three hits:

York University (Toronto, Ont.) 40 books, subject
York University (Toronto, Ont.). 24 books, org
York University (Toronto, Ont.). 24 books, subject

resolve to only 2 unique URLs, which differ only by a single trailing period. This is precisely the same as Ben's example from 2014 (except that there the difference was a single letter's case).

Additionally, the two subject URLs have work counts of 29 and 19, not 40 and 24, respectively, so the counts reported on the search results page are incorrect. I'm guessing that if we fix the duplication, the counts will take care of themselves, so why don't we focus this issue on that? It's also the problem that George reported in 2011 and that was confirmed in 2014 and again in 2017.

There are also additional search hits which should be merged:

Toronto York University 4 books, subject
York University (Toronto) 1 book, org
York University, Toronto 1 book, subject

but if we take care of the simple, common cases to start, we'll have a 90% solution.
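
To illustrate the simple, common cases (case differences and a trailing period), here is a minimal sketch of clustering subject strings by a normalized key. The function names and the normalization rules are assumptions for illustration, not the existing Open Library normalization.

```python
import re
from collections import defaultdict


def normalize_subject(name: str) -> str:
    """Hypothetical clustering key: case-fold, drop trailing periods, collapse whitespace."""
    key = name.casefold().strip()
    key = key.rstrip(".").rstrip()
    return re.sub(r"\s+", " ", key)


def cluster_subjects(names):
    """Group subject strings that share the same normalized key."""
    clusters = defaultdict(list)
    for name in names:
        clusters[normalize_subject(name)].append(name)
    return dict(clusters)


# Example strings taken from the search hits quoted in this thread.
hits = [
    "York University (Toronto, Ont.)",
    "York University (Toronto, Ont.).",
    "New York University",
    "New York university",
    "York University, Toronto",
]
clusters = cluster_subjects(hits)
```

With these rules the first two strings share one cluster and the two New York University spellings share another, while "York University, Toronto" stays separate; reordered or abbreviated forms like that one would need more than string normalization.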

On a more general note, I think favoring "current" (i.e., more modern) reports over historical ones is a bad idea because it loses the history of research that was done and obscures how long the problem has existed (8 years in this case).

@mekarpeles
Member

related #188

@xayhewalo xayhewalo added this to Un-Triaged in Triage Oct 20, 2019
@xayhewalo xayhewalo added Priority: 2 Important, as time permits. [managed] State: Backlogged labels Nov 12, 2019
@xayhewalo xayhewalo moved this from Un-Triaged to Triaged in Triage Nov 12, 2019
@hornc
Collaborator

hornc commented Nov 12, 2019

I'm trying to figure out what the remaining issue is here -- it seems the latest example is
https://openlibrary.org/search/subjects?q=University+York

and in the first three hits:

York University (Toronto, Ont.) 40 books, subject
York University (Toronto, Ont.). 24 books, org
York University (Toronto, Ont.). 24 books, subject

The issue is why there are two separate rows for:

York University (Toronto, Ont.). 24 books, org
York University (Toronto, Ont.). 24 books, subject

A book that has this subject https://openlibrary.org/works/OL19055710W.json
shows there is only one entry for York University (Toronto, Ont.). under subjects

Why is it showing in subject search results as an org and a subject?

Investigation task:

  • Investigate what is the significance of org and subject on the subject results page.
  • and how is this represented in Solr, because it does not appear to be saved in the item metadata

Subject search results page is: https://github.com/internetarchive/openlibrary/blob/master/openlibrary/templates/search/subjects.html

Alternatively, skip investigation, and fix display:

At first glance it looks like if $key or $n have been seen before, they shouldn't be displayed again.
Ideally the most specific category should be used over subject, so that the results are:

York University (Toronto, Ont.) 40 books, subject
York University (Toronto, Ont.). 24 books, org

The difference in the trailing period is a separate data issue, and subjects should go through some normalisation on import. (I believe they do on the import API path).

$for doc in response['docs']:
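
The display-side fix could look roughly like the following Python sketch: keep one row per key, and prefer a specific type such as org over the generic subject. The field names (`key`, `name`, `subject_type`, `count`) and the key values are assumptions for illustration; the real Solr documents and template variables may differ.

```python
def dedupe_docs(docs):
    """Keep one row per key, preferring a specific type (org, place, ...) over 'subject'."""
    best = {}  # key -> chosen doc; dicts keep first-seen order
    for doc in docs:
        current = best.get(doc["key"])
        if current is None:
            best[doc["key"]] = doc
        elif current["subject_type"] == "subject" and doc["subject_type"] != "subject":
            # A more specific type replaces the generic row for the same key.
            best[doc["key"]] = doc
    return list(best.values())


# Hypothetical docs modelled on the three hits above; keys are made up.
docs = [
    {"key": "/subjects/a", "name": "York University (Toronto, Ont.)", "subject_type": "subject", "count": 40},
    {"key": "/subjects/b", "name": "York University (Toronto, Ont.).", "subject_type": "org", "count": 24},
    {"key": "/subjects/b", "name": "York University (Toronto, Ont.).", "subject_type": "subject", "count": 24},
]
rows = dedupe_docs(docs)
```

This only hides the repeated row; it does not address the trailing-period difference or the counts.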

@tfmorris
Contributor

I believe it was specifically the normalization/clustering of subjects that George's original issue was about.

The org/place/time subjects are used as search facets. If you think there's an issue there, we should create a separate issue to address it and not conflate it with the original problem.

@mekarpeles
Member

I think this issue needs to be renamed and needs a clear scope (how do we know it's complete) if it's going to stay open. @tfmorris would you like to suggest a more useful title and description for this issue? I don't understand the problem well enough other than... There is a discrepancy between subject search results for subject v. org? Is there any proposed solution? Do we know where to look / what code is doing something wrong?

@mekarpeles mekarpeles added Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] and removed Needs: Lead labels Jan 5, 2022
@mekarpeles mekarpeles changed the title Fix subject search results Fix duplicate subject search results Dec 19, 2022
@mekarpeles mekarpeles modified the milestones: Next (proposed), 2023 Jan 26, 2023
@mekarpeles mekarpeles changed the title Fix duplicate subject search results De-duplicate subject Feb 1, 2023
@mekarpeles mekarpeles changed the title De-duplicate subject De-duplicate subjects Feb 1, 2023
@mekarpeles
Member

We should be able to make significant progress on this through #7486 of #2819.

Extending the ILE admin blue bar should allow librarians to search for works by subject and update them in bulk.

It's possible we'll also want to use scripts for duplicate subjects applied to thousands+ of books.

@mekarpeles
Member

mekarpeles commented Sep 8, 2023

Updates

The good news is, we now have a subject "object" which we're calling a Tag. See: #7928

We decided to make a new object called a Tag (and leave subjects as is) because there are many things we want a Tag object for that are not limited to "subject".

The current strategy is to use subject strings as a mechanism to pull a corresponding Tag object from the db. This process and the corresponding features and definitions are described in the breakdown section here.

Challenges

Even with a way to promote subject strings into Tag objects, there are still several challenges around duplication and naming. There are two things I'd like to briefly discuss:

  1. Some human or librarian-editable mechanism (e.g. a UI) should exist that allows us to merge subjects or tags (and have the changes apply to all relevant Works, etc.)
  2. Leaning into the usage of `:` prefixes as a mechanism for "typing" subjects is another thing we've talked about, as is designing Tags so they have a type (like subject, place, content-warning, etc.).
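
As a rough sketch of both points, the Python below splits a prefixed subject string into a type and a name, and rewrites a set of duplicate subject strings to one canonical string across a list of works. Everything here is hypothetical: the function names, the list of known types, the work keys, and the shape of the work records are assumptions, not the Tag implementation.

```python
KNOWN_TYPES = ("subject", "place", "person", "org", "time", "content-warning")


def split_typed_subject(value: str, known_types=KNOWN_TYPES):
    """Split 'place:Paris' into ('place', 'Paris'); untyped strings default to 'subject'."""
    prefix, sep, rest = value.partition(":")
    if sep and prefix in known_types:
        return prefix, rest.strip()
    return "subject", value.strip()


def merge_subjects(works, duplicates, canonical):
    """Replace each duplicate subject with the canonical one; return keys of changed works."""
    dupes = set(duplicates)
    changed = []
    for work in works:
        old = work.get("subjects", [])
        new = []
        for subject in old:
            subject = canonical if subject in dupes else subject
            if subject not in new:  # avoid repeating the canonical subject
                new.append(subject)
        if new != old:
            work["subjects"] = new
            changed.append(work["key"])
    return changed


# Hypothetical work records; keys and subject lists are made up.
works = [
    {"key": "/works/OL1W", "subjects": ["York University (Toronto, Ont.).", "History"]},
    {"key": "/works/OL2W", "subjects": ["York University (Toronto, Ont.)", "York University (Toronto, Ont.)."]},
    {"key": "/works/OL3W", "subjects": ["History"]},
]
changed = merge_subjects(
    works,
    duplicates=["York University (Toronto, Ont.)."],
    canonical="York University (Toronto, Ont.)",
)
```

Returning the list of changed work keys is one way a librarian-facing UI or a bulk script could preview or log which works a merge touched before saving.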

Proposed Solution

As a result of the progress we've made this year, I'm going to rename this issue to make it something more actionable:

"Design process & UI for merging duplicate Subjects"

Ideally, the solution would also leverage and extend our existing Librarian Merge Queue: https://openlibrary.org/merges to include merge requests for a list of subjects.

@mekarpeles mekarpeles changed the title De-duplicate subjects Design librarian process & UI for merging duplicate Subjects Sep 8, 2023
@mekarpeles mekarpeles added Affects: Librarians Issues related to features that librarians particularly need. [managed] Theme: Book Tags Issues related to community book tags Type: Epic A feature or refactor that is big enough to require subissues. [managed] labels Sep 8, 2023
@cdrini cdrini added Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] and removed Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] labels Nov 20, 2023
9 participants