Issue with legacy IDs in Solr statistics on DSpace 6.x #12

Closed
nwoodward opened this issue Jan 4, 2021 · 3 comments

@nwoodward

First off, this is a great tool for reviewing DSpace statistics. Thanks for releasing it to the community. I wanted to ask if you have run into issues with DSpace 6.x instances that have been migrated from prior major versions and thus potentially contain non-UUID IDs in the Solr statistics?

After running solr-upgrade-statistics-6x on my instance I was left with some IDs in Solr that couldn't be migrated and thus were labeled "XXXXX-unmigrated". When I run the indexer on the v6_x branch, it fails as soon as it comes across an unmigrated ID. So I'm wondering if some sort of UUID validation step would be useful before the calls that update views/downloads statistics in PostgreSQL?

@alanorth
Member

alanorth commented Jan 4, 2021

@nwoodward Unmigrated statistics are the bane of my existence! Yes, I've had this issue many times, and I always just purged the unmigrated stats to fix it. But you're absolutely right that I could instead modify the indexer's Solr query to do one of the following:

  1. Only work on statistics records with IDs that are UUIDs, i.e.: id:/.{36}/
  2. Only work on statistics records with IDs that are not unmigrated, i.e.: NOT id:/.+-unmigrated/

What do you think?
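
For illustration, the two options above can be expressed as Solr filter queries. This is only a sketch, not the indexer's actual code: the helper names and the `q`/`rows` defaults are assumptions, and only the two `id:` patterns come from the comment above.

```python
from urllib.parse import urlencode

# Option 1: keep only records whose id is 36 characters long (UUID-shaped).
FQ_UUID_ONLY = "id:/.{36}/"
# Option 2: exclude records whose id ends in "-unmigrated".
FQ_NOT_UNMIGRATED = "NOT id:/.+-unmigrated/"


def build_params(filter_query: str, rows: int = 0) -> dict:
    """Build Solr select parameters (hypothetical helper, assumed defaults)."""
    return {"q": "*:*", "fq": filter_query, "rows": rows, "wt": "json"}


def build_query_string(filter_query: str, rows: int = 0) -> str:
    """URL-encode the parameters so they can be appended to a /select URL."""
    return urlencode(build_params(filter_query, rows))


if __name__ == "__main__":
    print(build_query_string(FQ_UUID_ONLY))
    print(build_query_string(FQ_NOT_UNMIGRATED))
```

Note that the two filters are not equivalent: option 2 would still let through other non-UUID values that do not end in "-unmigrated".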

@nwoodward
Author

Yeah, I haven't run into any unmigrated IDs that match anything in the current DSpace database, so I think number 1 makes the most sense. I'd also add a note that the Solr statistics must be migrated to 6.x, especially in older shards, since that's where this problem is likely to occur. For some strange reason, I couldn't get Python's uuid module to validate the UUIDs in the Solr stats reliably; it was producing a lot of false negatives. There are several regex patterns out there to match UUIDs, and I just chose one of them.
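
As a rough sketch of regex-based validation in Python: the pattern below is one common way to match the canonical 8-4-4-4-12 hexadecimal layout, and is not necessarily the pattern chosen above. The function name and the sample UUID are made up for illustration.

```python
import re

# Canonical textual form of a UUID: 8-4-4-4-12 hexadecimal digits.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def is_uuid(value: str) -> bool:
    """Return True only if the whole string has the canonical UUID shape."""
    return UUID_RE.fullmatch(value) is not None


if __name__ == "__main__":
    # The first value is a made-up UUID; the others are legacy-style IDs.
    for candidate in ["f44cf173-2344-4eb2-8f00-ee55df32c76f", "9391-unmigrated", "-1", "0"]:
        print(candidate, is_uuid(candidate))
```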

@alanorth
Member

alanorth commented Jan 5, 2021

@nwoodward yeah, I guess it's better to explicitly match UUIDs than to try to exclude unmigrated IDs, as the two are not necessarily the same. For what it's worth, in our recent DSpace 6 migration this year I had all kinds of non-UUID values like -1, 0, 9391-unmigrated, etc. I had to purge millions of records from our ten years of stats. I assume those come from deleted items, deleted bitstreams (think: regenerated ImageMagick thumbnails), and homepage hits (as the top-level homepage doesn't have a UUID). Shame there is no discussion of this on the DSpace wiki.

Regarding the regex to match UUIDs, I think matching 36 characters is good enough for us. BTW, the issue in this project is Solr's regex support, not Python's!

alanorth added a commit that referenced this issue Jan 5, 2021
We need to make sure that the indexer only tries to index UUIDs, as
opposed to legacy IDs that may have been left over from a migration
from earlier DSpace versions. For example, "98110-unmigrated", "-1"
etc.

For matching the UUIDs in Solr I decided that it is sufficient for
our use case to simply match thirty-six characters, where a UUID is
composed of thirty-two hexadecimal characters and four dashes. We
don't need to do any verification of "real" UUIDs because it would
be needlessly complex in our case.

See: #12
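
To illustrate the trade-off described in the commit message, here is a small Python sketch that mimics the length-only match on the client side. It assumes that a Solr (Lucene) regular expression must match the whole term, which `re.fullmatch` imitates here. The function name and the sample values are hypothetical.

```python
import re

# 32 hexadecimal characters + 4 dashes = 36 characters in a UUID string.
LENGTH_ONLY = re.compile(r".{36}")


def matches_length_only(value: str) -> bool:
    """Emulate id:/.{36}/ by requiring exactly 36 characters of any kind."""
    return LENGTH_ONLY.fullmatch(value) is not None


if __name__ == "__main__":
    samples = [
        "f44cf173-2344-4eb2-8f00-ee55df32c76f",  # made-up UUID: accepted
        "98110-unmigrated",                      # legacy ID: rejected
        "-1",                                    # legacy ID: rejected
        "x" * 36,                                # not a UUID, but accepted
    ]
    for sample in samples:
        print(sample, matches_length_only(sample))
```

The last sample shows what the simpler pattern gives up: any 36-character value passes, which is acceptable here only because the legacy IDs seen in practice are much shorter.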
alanorth closed this as completed Jan 5, 2021