Health check and new way of harvest data #96
There are two "MEL downloads and views" plugins:

$ grep -rsI "MEL downloads and views"
backend/src/plugins/mel_downloads_and_views/info.json: "display_name": "MEL downloads and views",
backend/src/plugins/dspace_add_missing_items/info.json: "display_name": "MEL downloads and views",

The new harvesting method is much faster. I did a complete harvest of DSpace Test in about three hours, but there was an error at the end of harvesting. Here is the log from the API container: ...
[Nest] 88 - 06/16/2021, 2:01:12 PM [HarvesterService] Starting Harvest
[Nest] 88 - 06/16/2021, 4:53:43 PM [FetchConsumer] OnGlobalQueueDrained
[Nest] 88 - 06/16/2021, 4:53:46 PM [DSpaceHealthCheck] Health Check Started
[Nest] 88 - 06/16/2021, 4:55:12 PM [PluginsConsumer] OnGlobalQueueDrained
[Nest] 88 - 06/16/2021, 4:55:12 PM [DSpaceHealthCheck] ResponseError: search_phase_execution_exception +395ms
TypeError: Cannot read property 'body' of undefined
at DSpaceHealthCheck.deleteDuplicates (/backend/dist/plugins/dspace_health_check/index.js:90:18)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
at async DSpaceHealthCheck.transcode (/backend/dist/plugins/dspace_health_check/index.js:45:13)
[Nest] 88 - 06/16/2021, 4:56:15 PM [DSpaceHealthCheck] Health Check Started
[Nest] 88 - 06/16/2021, 4:57:46 PM [DSpaceHealthCheck] ResponseError: search_phase_execution_exception +91033ms
TypeError: Cannot read property 'body' of undefined
at DSpaceHealthCheck.deleteDuplicates (/backend/dist/plugins/dspace_health_check/index.js:90:18)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
at async DSpaceHealthCheck.transcode (/backend/dist/plugins/dspace_health_check/index.js:45:13)

Also, now the DSpace Health Check has over 600,000 jobs...

Looking in the DSpace server logs, it seems to be stuck on one item, because I see 34,000 of these so far (and increasing fast):

15.21.125.64 - - [16/Jun/2021:19:22:14 +0200] "GET /rest/handle/10568/3703?expand=metadata,parentCommunityList,parentCollectionList,bitstreams HTTP/1.1" 200 443 "-" "ARES harvesting bot; https://github.com/ilri/OpenRXV"

From the nginx logs on the server:

# grep -c '/rest/handle/10568/3703?expand=metadata,parentCommunityList,parentCollectionList,bitstreams' /var/log/nginx/rest.log
37553
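
The `TypeError` in the log above follows a `ResponseError`, which suggests the search result was undefined by the time `.body` was read. A minimal sketch of a guard follows, assuming the search call can yield `undefined` after a failure. The `bodyOrNull` helper and the `SearchResponse` shape are hypothetical and are not taken from the plugin source:

```typescript
// Hypothetical response shape; the real client response type may differ.
interface SearchResponse<T> {
  body: T;
}

// Return the response body, or null when the search produced no response
// (for example because an earlier error was caught and only logged).
function bodyOrNull<T>(response: SearchResponse<T> | undefined): T | null {
  if (response === undefined) {
    return null;
  }
  return response.body;
}

// Usage: skip the duplicate check instead of throwing when there is no response.
const failed = bodyOrNull<{ total: number }>(undefined);
if (failed === null) {
  console.log('search failed, skipping duplicate check');
}
```

With a guard like this the caller decides what to do when the search fails, rather than crashing while reading a property of `undefined`.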
This implements a new method for harvesting DSpace repositories as well as a "health check" plugin that runs after harvesting to find duplicate and missing items. The new harvesting method harvests 10 items at a time and is much faster (at least 3x) than the previous method.

We still need to do some cleanup work on this:

- Commit button should only be active if `openrxv-items-temp` has data, otherwise if you click it twice it will overwrite the live `openrxv-items-final` index.
- Some documentation tips in the admin dashboard to help users with the values that are expected in various new fields
- Making the "DSpace add missing items" plugin hidden, as it is used internally by the health check plugin and should not be enabled manually

See: #96
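
The commit-button condition from the cleanup list could be sketched roughly as follows. This is only an illustration: `IndexSummary` and `canCommit` are hypothetical names, and the document count is assumed to come from a count query against `openrxv-items-temp` that is not shown here:

```typescript
// Hypothetical summary of an index; docCount would come from a count query.
interface IndexSummary {
  name: string;
  docCount: number;
}

// The commit action is only enabled when the temporary index holds documents,
// so a second click on an already-emptied temp index cannot overwrite the
// live index.
function canCommit(temp: IndexSummary): boolean {
  return temp.docCount > 0;
}

const afterHarvest: IndexSummary = { name: 'openrxv-items-temp', docCount: 92000 };
const afterCommit: IndexSummary = { name: 'openrxv-items-temp', docCount: 0 };
console.log(canCommit(afterHarvest)); // true
console.log(canCommit(afterCommit)); // false
```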
This is working well now. Squashed into one commit and merged to master manually. We still need to:
Should solve #67 and #62. I will evaluate that in the coming days after we have more experience with this in production.
@moayadnajd there's something strange with the health check. It seems to create way too many plugin jobs. This happens every time I start the plugins after harvesting. I investigated the jobs and most of them are from `dspace_add_missing_items`:

$ redis-cli KEYS "bull:plugins:*" \
| sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \
| ncat -w 3 localhost 6379 \
| grep -v -E '^\$' | sort | uniq -c | sort -h
3 dspace_health_check
4 -ERR wrong number of arguments for 'hget' command
12 mel_downloads_and_views
129 dspace_altmetrics
932 dspace_downloads_and_views
186428 dspace_add_missing_items
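
The tally produced by the shell pipeline above can also be expressed in code. The sketch below assumes the job names have already been read into an array (how they are fetched from Redis is left out), and `countByName` is a hypothetical helper:

```typescript
// Count how many times each job name occurs, sorted by ascending count
// (the same ordering as `sort | uniq -c | sort -h`).
function countByName(names: string[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const name of names) {
    counts.set(name, (counts.get(name) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => a[1] - b[1]);
}

// Made-up sample input for illustration only.
const sample = ['dspace_altmetrics', 'dspace_add_missing_items', 'dspace_add_missing_items'];
console.log(countByName(sample));
```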
@moayadnajd I added debugging to the DSpace Health Check and I think it's confused when there is more than one repository. Notice how many "missing" items it wants to add for WorldFish:

[Nest] 88 - 06/26/2021, 5:46:15 PM [NestApplication] Nest application successfully started +12ms
[Nest] 88 - 06/26/2021, 5:47:06 PM [DSpaceHealthCheck] Started DSpace health check
[Nest] 88 - 06/26/2021, 5:47:06 PM [DSpaceHealthCheck] Started DSpace health check
[Nest] 88 - 06/26/2021, 5:47:10 PM [DSpaceHealthCheck] getHandles
[Nest] 88 - 06/26/2021, 5:47:10 PM [DSpaceHealthCheck] 0 handles found
[Nest] 88 - 06/26/2021, 5:47:10 PM [DSpaceHealthCheck] getHandles
[Nest] 88 - 06/26/2021, 5:47:11 PM [DSpaceHealthCheck] 0 handles found
[Nest] 88 - 06/26/2021, 5:47:11 PM [DSpaceHealthCheck] 4431 handles found
[Nest] 88 - 06/26/2021, 5:47:11 PM [DSpaceHealthCheck] Searching for duplicate handles
[Nest] 88 - 06/26/2021, 5:47:12 PM [DSpaceHealthCheck] Adding 92409 missing items (WorldFish)
[Nest] 88 - 06/26/2021, 5:47:13 PM [PluginsConsumer] OnGlobalQueueDrained
[Nest] 88 - 06/26/2021, 5:47:13 PM [DSpaceHealthCheck] 9999 handles found
[Nest] 88 - 06/26/2021, 5:47:13 PM [DSpaceHealthCheck] Finished DSpace health check (WorldFish)
[Nest] 88 - 06/26/2021, 5:47:13 PM [DSpaceHealthCheck] 19998 handles found
[Nest] 88 - 06/26/2021, 5:47:14 PM [DSpaceHealthCheck] 29997 handles found
[Nest] 88 - 06/26/2021, 5:47:15 PM [DSpaceHealthCheck] 39996 handles found
[Nest] 88 - 06/26/2021, 5:47:15 PM [DSpaceHealthCheck] 49995 handles found
[Nest] 88 - 06/26/2021, 5:47:16 PM [DSpaceHealthCheck] 59994 handles found
[Nest] 88 - 06/26/2021, 5:47:17 PM [DSpaceHealthCheck] 69993 handles found
[Nest] 88 - 06/26/2021, 5:47:18 PM [DSpaceHealthCheck] 79992 handles found
[Nest] 88 - 06/26/2021, 5:47:18 PM [DSpaceHealthCheck] 89991 handles found
[Nest] 88 - 06/26/2021, 5:47:18 PM [DSpaceHealthCheck] 90990 handles found
[Nest] 88 - 06/26/2021, 5:47:18 PM [DSpaceHealthCheck] Searching for duplicate handles
[Nest] 88 - 06/26/2021, 5:47:53 PM [DSpaceHealthCheck] Adding 1419 missing items (DSpace Test)

The WorldFish repository only has 4,400 items total, so it's impossible that 92,000 are missing. Similarly, there are actually about 92,000 items in CGSpace, so this seems to be all the items in the repository.
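
One way to keep repositories from affecting each other's counts is to compute the missing handles separately per repository. This is a sketch under assumptions, not the plugin's actual logic: `HandlesByRepo` and `findMissingByRepo` are hypothetical, and the handles in the example are made up:

```typescript
// Handles keyed by repository name (hypothetical shape; the plugin's actual
// data structures are not shown in this thread).
type HandlesByRepo = Record<string, string[]>;

// For each repository, return the sitemap handles that are absent from that
// same repository's indexed handles. Handles from other repositories are never
// consulted, so one repository cannot inflate another's "missing" count.
function findMissingByRepo(sitemap: HandlesByRepo, indexed: HandlesByRepo): HandlesByRepo {
  const missing: HandlesByRepo = {};
  for (const repo of Object.keys(sitemap)) {
    const known = new Set(indexed[repo] ?? []);
    missing[repo] = sitemap[repo].filter((handle) => !known.has(handle));
  }
  return missing;
}

// Made-up handles for illustration only.
const missing = findMissingByRepo(
  { WorldFish: ['wf/1', 'wf/2'], CGSpace: ['cg/1'] },
  { WorldFish: ['wf/1'], CGSpace: [] },
);
console.log(missing);
```

With this shape, the number of missing items reported for a repository can never exceed the number of handles in that repository's own sitemap.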
Now the harvester looks at the sitemap and estimates the number of pages it should harvest
Pages are now 10 items each, since that takes less memory and time per request
You have to specify the link to the sitemap when you set up your repository
There is now a health check plugin to fix duplicates and add missing items
You have to specify the repository name and the link to the REST API item-by-handle endpoint
After you finish indexing, you have to click "commit indexing" to show results in the explorer
This may fix #62 and #67
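
The page estimate described in these notes amounts to a ceiling division. The sketch below assumes the item count has already been read from the sitemap; how the harvester actually derives it is not shown here, and `estimatePages` is a hypothetical name:

```typescript
// Page size used by the new harvesting method, per the notes above.
const ITEMS_PER_PAGE = 10;

// Estimate how many pages must be requested to cover every item in the sitemap.
function estimatePages(sitemapItemCount: number, pageSize: number = ITEMS_PER_PAGE): number {
  if (sitemapItemCount <= 0) {
    return 0;
  }
  return Math.ceil(sitemapItemCount / pageSize);
}

console.log(estimatePages(4431)); // 444
console.log(estimatePages(90990)); // 9099
```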