Design librarian process & UI for merging duplicate Subjects #65
Comments
Duplicate variations are cool. That there are so many of them is a data problem, not a search problem per se. Different spellings may make a huge difference and so do Paris (place) vs. Paris (person). A serious problem I face here is getting 0 results when I click the New York example. Somehow OL doesn't understand the space, neither kept as |
It appears my problem was solved by 06b0da2, thanks Anand! I now realise the problem with duplicates/variations is a bit different than I thought. The search query for New York yields subjects like New York University. On page 1 are three listings for New York University:
The URLs for these results are the same, which means the number of books is wrong in at least two of these three results. The 'org' results are probably books in which the original records have the subject in a special (sub)field, but there is no way normal users can edit this in individual books. I'm pretty sure this is a problem with how Solr gets updated. |
In https://openlibrary.org/search/subjects?q=University+York the subject and org versions with the same URLs show the same counts... but https://openlibrary.org/search/subjects?q=University+London and the same URLs, so this is still an issue. SOLR updating issue? |
@hornc , could you please opine as to the appropriate priority of this, by setting a label? |
George's query returns 0 results (problem solved? :) ). Likely related to #322. The general problem is that subjects need to be objects, not strings, with aliases, associated metadata, etc. |
Yes, I think this can be closed. #322 is a more specific and current issue that can be dealt with independently. Any other current Subject related issues should be raised separately with current examples. |
I think this is still a valid issue. Although George's original query is broken due to a different bug (#322), the problem she reported isn't fixed. The queries that @hornc posted here in 2017: #65 (comment) still demonstrate the problem. The first three hits:
resolve to only 2 unique URLs, which differ only by a single trailing period. This is precisely the same as Ben's example from 2014 (except there the difference was the case of a single letter). Additionally, the two subject URLs have work counts of 29 and 19, not 40 and 24, respectively, so the counts reported on the search results page are incorrect. I'm guessing that if we fix the duplication, the counts will take care of themselves, so why don't we focus this issue on that? It's also the problem that George reported in 2011 and that was confirmed in 2014 and again in 2017. There are also additional search hits which should be merged:
but if we take care of the simple, common cases to start, we'll have a 90% solution. On a more general note, I think favoring "current" (i.e. more modern) reports over historical ones is a bad idea because it loses the history of research that was done and obscures how long the problem has existed (8 years in this case). |
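To make the "simple, common cases" concrete (variants that differ only by letter case or a trailing period), here is a minimal Python sketch that groups subject strings under a normalised key. The names `subject_key` and `cluster_subjects` are hypothetical, and this is not Open Library's actual normalisation code; it only illustrates the kind of clustering being asked for.

```python
import re
from collections import defaultdict


def subject_key(subject: str) -> str:
    """Hypothetical key: collapse whitespace, drop trailing periods, ignore case."""
    collapsed = re.sub(r"\s+", " ", subject).strip()
    return collapsed.rstrip(". ").casefold()


def cluster_subjects(subjects):
    """Group subject strings whose keys match; keep only groups with duplicates."""
    clusters = defaultdict(list)
    for subject in subjects:
        clusters[subject_key(subject)].append(subject)
    return {key: names for key, names in clusters.items() if len(names) > 1}
```

The key is only used for grouping; which variant becomes the displayed name would still be a librarian's call. Anything beyond case and trailing punctuation (e.g. spelling variants) is deliberately out of scope here.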
related #188 |
I'm trying to figure out what the remaining issue is here -- it seems the latest example is and in the first three hits:
The issue is why there are two separate rows for:
A book that has this subject https://openlibrary.org/works/OL19055710W.json Why is it showing in subject search results as an Investigation task:
Subject search results page is: https://github.com/internetarchive/openlibrary/blob/master/openlibrary/templates/search/subjects.html Alternatively, skip investigation, and fix display: At first glance it looks like if
The difference in the trailing period is a separate data issue, and subjects should go through some normalisation on import. (I believe they do on the import API path).
|
I believe it was specifically the normalization/clustering of subjects that George's original issue was about. The org/place/time subjects are used as search facets. If you think there's an issue there, we should create a separate issue to address it and not conflate it with the original problem. |
I think this issue needs to be renamed and needs a clear scope (how do we know it's complete) if it's going to stay open. @tfmorris would you like to suggest a more useful title and description for this issue? I don't understand the problem well enough other than... There is a discrepancy between subject search results for subject v. org? Is there any proposed solution? Do we know where to look / what code is doing something wrong? |
Updates
The good news is, we now have a subject "object" which we're calling a Tag. See: #7928 We decided to make a new The current strategy is to use subject strings as a mechanism to pull a corresponding Tag object from the db. This process and the corresponding features and definitions are described in the breakdown section here.
Challenges
Even with a way to promote subject strings into Tag objects, there are still several challenges around duplication and naming. There are two things I'd like to briefly discuss:
Proposed Solution
As a result of the progress we've made this year, I'm going to rename this issue to make it something more actionable:
Ideally, the solution would also leverage and extend our existing Librarian Merge Queue: https://openlibrary.org/merges to include merge requests for a list of subjects.
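As a starting point for discussion, a merge request for a list of subjects could look something like the sketch below. Everything in it is an assumption for illustration: `SubjectMergeRequest`, its fields, the status values, and `apply_merge` are hypothetical and do not reflect the existing Merge Queue's schema or code.

```python
from dataclasses import dataclass, field


@dataclass
class SubjectMergeRequest:
    """Hypothetical request to fold `duplicates` into one `canonical` subject."""
    canonical: str
    duplicates: list[str] = field(default_factory=list)
    status: str = "pending"  # assumed states: pending, approved, declined


def apply_merge(work_subjects: list[str], request: SubjectMergeRequest) -> list[str]:
    """Rewrite one work's subjects: map duplicates to canonical, drop repeats."""
    if request.status != "approved":
        return list(work_subjects)  # nothing changes until a librarian approves
    dupes = set(request.duplicates)
    merged, seen = [], set()
    for subject in work_subjects:
        subject = request.canonical if subject in dupes else subject
        if subject not in seen:
            seen.add(subject)
            merged.append(subject)
    return merged
```

Rewriting the subject list per work (rather than summing counts across variants) means a work tagged with both variants ends up counted once under the canonical subject.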
|
Can we please get rid of the duplicate variations on search results pages for subjects?
E.g.
http://openlibrary.org/search/subjects?q=New%20York