Remove fake subjects from Works #2107

Closed
tfmorris opened this issue May 6, 2019 · 6 comments
Labels: Affects: Data, Lead: @hornc, Priority: 3, State: Work In Progress, Type: Refactor/Clean-up

@tfmorris
Contributor

tfmorris commented May 6, 2019

Description

Three of the top five "subjects" are not subjects at all:

  • Accessible book - 2.5 million
  • Protected DAISY - 1.2 million
  • In library - 0.5 million

There are also some lower-frequency noise terms, such as Lending library and Internet Archive Wishlist, but the three above account for the bulk of the noise.

Expectation

The subject list should contain things which are actually subjects of the work.

Proposal & Constraints

Remove the three subjects above. If they're needed to provide functionality, move them to a hidden portion of the Solr index where they don't pollute the UI.
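
A minimal sketch of the removal step, assuming a work record is a plain dict with an optional `subjects` list of strings. The `FAKE_SUBJECTS` set and the `strip_fake_subjects` helper are hypothetical names used for illustration only; they are not taken from the Open Library codebase.

```python
# Hypothetical noise list, built from the terms named in this issue.
FAKE_SUBJECTS = {
    "accessible book",
    "protected daisy",
    "in library",
    "lending library",
    "internet archive wishlist",
}


def strip_fake_subjects(work: dict) -> dict:
    """Return a copy of `work` without the noise terms in its subjects list.

    Assumes `work["subjects"]` (if present) is a list of strings.
    Matching is case-insensitive and ignores surrounding whitespace.
    """
    kept = [
        s for s in work.get("subjects", [])
        if s.strip().lower() not in FAKE_SUBJECTS
    ]
    cleaned = dict(work)  # shallow copy so the input is left untouched
    if kept:
        cleaned["subjects"] = kept
    else:
        # Drop the key entirely rather than leave an empty list behind.
        cleaned.pop("subjects", None)
    return cleaned
```

Case-insensitive matching is a design choice here, since the thread spells the same term as both "In library" and "In Library".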

@mekarpeles added the Affects: Data, Needs: Triage, Priority: 2, and subjects labels May 6, 2019
@LeadSongDog

@tfmorris Perhaps these tags should be supplanted by a new entry under The Physical Object / Format prior to deletion, or is there a better way?

@hornc removed the Needs: Triage label Jul 21, 2019
@hornc
Collaborator

hornc commented Jul 22, 2019

We should also remove this code, which looks like it is trying to protect these subjects:

SYSTEM_SUBJECTS = ["Accessible Book", "Lending Library", "In Library", "Protected DAISY"]

self._prevent_system_subjects_deletion(work)

def _prevent_system_subjects_deletion(self, work):

These subjects should no longer be used as there are other ways to get the information they were trying to convey.

@hornc added the State: Work In Progress label Jul 22, 2019
@cdrini added the Type: Refactor/Clean-up label Jul 22, 2019
@LeadSongDog

@hornc
Are those "other ways to get the information" all working and documented to users?

@xayhewalo added this to Triaged in Triage Oct 20, 2019
@hornc added the Priority: 3 label and removed the Priority: 2 label Nov 7, 2019
@mekarpeles added the Lead: @hornc label Dec 18, 2019
@hornc added this to the Next Sprint (Proposed) milestone Feb 3, 2020
@hornc
Collaborator

hornc commented Feb 13, 2020

Checking ol_dump_editions_2019-12-31.txt

  • Lending library appears to be clear
  • Internet Archive Wishlist 0, also clear

ol_dump_works_2019-12-31.txt

  • Lending library 0
  • Internet Archive Wishlist - 382004 remaining
  • In Library - 436321 remaining
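
Counts like the ones above could be reproduced with a short script. This is only a sketch: it assumes each dump line is tab-separated with the record JSON in the fifth column (type, key, revision, last modified, JSON) and that subjects live in a top-level `subjects` list of strings. `TARGETS` and `count_subjects` are hypothetical names, not part of any Open Library tooling.

```python
import json
from collections import Counter
from typing import Iterable

# Hypothetical list of terms to look for, taken from this thread.
TARGETS = {"lending library", "internet archive wishlist", "in library"}


def count_subjects(lines: Iterable[str], targets=TARGETS) -> Counter:
    """Count how many records carry each target subject (case-insensitive)."""
    counts = Counter()
    for line in lines:
        # Assumed layout: type, key, revision, last_modified, JSON.
        parts = line.rstrip("\n").split("\t", 4)
        if len(parts) != 5:
            continue  # skip lines that do not match the assumed layout
        try:
            record = json.loads(parts[4])
        except json.JSONDecodeError:
            continue  # skip records whose JSON column cannot be parsed
        if not isinstance(record, dict):
            continue
        # Normalize and de-duplicate so a record is counted once per term.
        seen = {
            s.strip().lower()
            for s in record.get("subjects", [])
            if isinstance(s, str)
        }
        for name in seen & set(targets):
            counts[name] += 1
    return counts
```

Usage would look something like `with open("ol_dump_works_2019-12-31.txt", encoding="utf-8") as f: print(count_subjects(f))`. Because a `Counter` returns 0 for missing keys, a term that is fully cleared reads as 0 instead of raising an error.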

@hornc
Collaborator

hornc commented Feb 14, 2020

Subjects search for quicker testing of current numbers:

NB: the Solr indexing of these subjects seems way off. Many more items show up with these subjects at the URLs than are in the data dumps.

@cdrini removed this from the Sprint 2020-02 milestone Mar 2, 2020
@hornc removed the CH: subjects label Mar 9, 2020
@hornc changed the title from "Remove fake subjects" to "Remove fake subjects from Works" Aug 11, 2020
@hornc
Collaborator

hornc commented Aug 19, 2020

The last works carrying Accessible book subjects in ol_dump_works_2020-07-31.txt.gz have been processed, and these subjects have been removed.

@hornc closed this as completed Aug 19, 2020