Issue with legacy IDs in Solr statistics on DSpace 6.x #12
Comments
@nwoodward Unmigrated statistics are the bane of my existence! Yes I've had this issue many times, and I always just went and purged the unmigrated stats to fix it. But you're absolutely right that I could similarly just modify the indexer's Solr query to:
What do you think?
Yeah, I haven't run into examples of unmigrated IDs matching anything in the current DSpace database, so I think number 1 makes the most sense. Plus a note that the Solr statistics must be migrated to 6.x, especially in older shards, since that's where this problem is likely to occur. And for some strange reason, I couldn't get Python's uuid module to work for validating UUIDs in the Solr stats. It was producing a lot of false negatives. There are several regex patterns out there to match UUIDs, and I just chose one of them.
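For illustration, a regex-based check along the lines described above might look like the sketch below. The pattern and the function name `looks_like_uuid` are assumptions made for this example, not the pattern actually chosen in the thread; it only checks the canonical 8-4-4-4-12 hexadecimal layout.

```python
import re

# Hypothetical pattern: canonical 8-4-4-4-12 hexadecimal layout.
# This is one of many possible UUID regexes, not the one used in the thread.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def looks_like_uuid(value: str) -> bool:
    """Return True if the whole string has the canonical UUID shape."""
    return UUID_PATTERN.fullmatch(value) is not None


# Legacy IDs left over from a migration are rejected; a well-formed UUID passes.
for candidate in ("f7b1d1a4-3c2e-4b5a-9d6f-0123456789ab", "98110-unmigrated", "-1"):
    print(candidate, looks_like_uuid(candidate))
```

A check like this could run on each ID coming back from Solr before any attempt to update the views/downloads statistics, so that unmigrated IDs are skipped rather than causing a failure.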
@nwoodward yeah I guess it's better to explicitly match UUIDs than to try to not match unmigrated, as the two are not necessarily the same. For what it's worth, in our recent DSpace 6 migration this year I had all kinds of non-UUID values like Regarding the regex to match UUIDs, I think matching 36 characters is good enough for us. BTW, the issue in this project is Solr's regex support, not Python's!
We need to make sure that the indexer only tries to index UUIDs, as opposed to legacy IDs that may have been left over from a migration from earlier DSpace versions. For example, "98110-unmigrated", "-1" etc. For matching the UUIDs in Solr I decided that it is sufficient for our use case to simply match thirty-six characters, where a UUID is composed of thirty-two hexadecimal characters and four dashes. We don't need to do any verification of "real" UUIDs because it would be needlessly complex in our case. See: #12
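As a rough sketch of the "match thirty-six characters" approach, the snippet below builds a Solr regular-expression clause and mirrors the same length check in Python. The field name `id`, the helper names, and the exact query text are assumptions for illustration; the actual query used by the indexer is not shown in this thread.

```python
def build_uuid_length_filter(field: str = "id") -> str:
    """Build a Solr regex clause matching terms of exactly 36 characters.

    Assumes Lucene-style regex syntax (/.../), where the expression must
    match the whole term. The field name is a hypothetical default.
    """
    return f"{field}:/.{{36}}/"


def has_uuid_length(value: str) -> bool:
    """Client-side equivalent: 32 hex characters plus 4 dashes = 36."""
    return len(value) == 36


print(build_uuid_length_filter())            # clause to add to the Solr query
print(has_uuid_length("98110-unmigrated"))   # legacy ID, too short
```

Note the trade-off: this accepts any 36-character string, not only well-formed UUIDs, which is the simplification the commit message describes as sufficient for this use case.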
First off, this is a great tool for reviewing DSpace statistics. Thanks for releasing it to the community. I wanted to ask if you have run into issues with DSpace 6.x instances that have been migrated from prior major versions and thus potentially contain non-UUID IDs in the Solr statistics?
After running solr-upgrade-statistics-6x on my instance I was left with some IDs in Solr that couldn't be migrated and thus were labeled "XXXXX-unmigrated". When I run the indexer while on the v6_x branch I see it fails when it comes across an unmigrated ID. So I'm wondering if some sort of UUID validation step would be useful before the calls to update views/downloads statistics in PostgreSQL?