Health check and new way of harvest data #96

Closed
wants to merge 11 commits into from

Conversation

moayadnajd

@moayadnajd moayadnajd commented Jun 14, 2021

The harvester now reads the sitemap to estimate the number of pages it should harvest.
Each page is now 10 items, since smaller pages take less memory and time to fetch.

You have to specify the link to the sitemap when you set up your repository.
Screen Shot 2021-06-14 at 5 23 30 PM
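
As a rough illustration of the page estimate described above, here is a minimal sketch. The helper names, the sample URLs, and the assumption that every `<loc>` entry in the sitemap is one item are all hypothetical and are not taken from the PR's code; only the page size of 10 comes from the description.

```typescript
// Hypothetical helpers for illustration only; not the harvester's real code.
const PAGE_SIZE = 10; // items per page, as stated in the description

// Count the <loc> entries in a sitemap XML string.
// Assumption: each <loc> entry corresponds to exactly one item.
function countSitemapUrls(sitemapXml: string): number {
  const matches = sitemapXml.match(/<loc>[^<]*<\/loc>/g);
  return matches ? matches.length : 0;
}

// Number of pages needed to cover `itemCount` items.
function estimatePages(itemCount: number, pageSize: number = PAGE_SIZE): number {
  return Math.ceil(itemCount / pageSize);
}

// Build a made-up sitemap with 23 entries.
const sampleEntries: string[] = [];
for (let i = 1; i <= 23; i++) {
  sampleEntries.push(`<url><loc>https://repo.example.org/handle/123/${i}</loc></url>`);
}
const sampleSitemap = `<urlset>\n${sampleEntries.join("\n")}\n</urlset>`;

const itemCount = countSitemapUrls(sampleSitemap);
console.log(itemCount, estimatePages(itemCount)); // 23 items need 3 pages of 10
```

A real sitemap may be split into several files behind a sitemap index, which this sketch does not handle.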

There is now a health check plugin that fixes duplicates and adds missing items.

You have to specify the repository name and the link to the REST API endpoint that returns an item by handle.

Screen Shot 2021-06-14 at 5 26 21 PM

After indexing finishes, you have to click "Commit indexing" to show the results in the explorer.

Screen Shot 2021-06-14 at 5 38 50 PM

This may fix #62 and #67.

@moayadnajd moayadnajd added the enhancement New feature or request label Jun 14, 2021
@moayadnajd moayadnajd requested a review from alanorth June 14, 2021 14:29
@alanorth
Member

There are two "MEL downloads and views" plugins:

$ grep -rsI "MEL downloads and views"
backend/src/plugins/mel_downloads_and_views/info.json:    "display_name": "MEL downloads and views",
backend/src/plugins/dspace_add_missing_items/info.json:    "display_name": "MEL downloads and views",

@alanorth
Member

The new harvesting method is much faster. I did a complete harvest of DSpace Test in about three hours, but there was an error at the end of harvesting. Here is the log from the API container:

...
[Nest] 88   - 06/16/2021, 2:01:12 PM   [HarvesterService] Starting Harvest
[Nest] 88   - 06/16/2021, 4:53:43 PM   [FetchConsumer] OnGlobalQueueDrained
[Nest] 88   - 06/16/2021, 4:53:46 PM   [DSpaceHealthCheck] Health Check Started
[Nest] 88   - 06/16/2021, 4:55:12 PM   [PluginsConsumer] OnGlobalQueueDrained
[Nest] 88   - 06/16/2021, 4:55:12 PM   [DSpaceHealthCheck] ResponseError: search_phase_execution_exception +395ms
TypeError: Cannot read property 'body' of undefined
    at DSpaceHealthCheck.deleteDuplicates (/backend/dist/plugins/dspace_health_check/index.js:90:18)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async DSpaceHealthCheck.transcode (/backend/dist/plugins/dspace_health_check/index.js:45:13)
[Nest] 88   - 06/16/2021, 4:56:15 PM   [DSpaceHealthCheck] Health Check Started
[Nest] 88   - 06/16/2021, 4:57:46 PM   [DSpaceHealthCheck] ResponseError: search_phase_execution_exception +91033ms
TypeError: Cannot read property 'body' of undefined
    at DSpaceHealthCheck.deleteDuplicates (/backend/dist/plugins/dspace_health_check/index.js:90:18)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async DSpaceHealthCheck.transcode (/backend/dist/plugins/dspace_health_check/index.js:45:13)
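
The trace suggests that the search failed (the `ResponseError`) and that the code then read `.body` from a response that was never set. A guard along the following lines would avoid the `TypeError`, though not the underlying `search_phase_execution_exception`. This is a sketch only: the `SearchResponse` shape and the `extractHitIds` name are assumptions, and the actual code at `index.js:90` is not shown in this thread.

```typescript
// Assumed minimal shape of a search response; not the real client types.
interface SearchResponse {
  body?: { hits?: { hits?: Array<{ _id: string }> } };
}

// Return the ids of all hits, or an empty list when the response
// (or any level of it) is missing, e.g. after a failed search.
function extractHitIds(response: SearchResponse | undefined): string[] {
  if (!response || !response.body || !response.body.hits || !response.body.hits.hits) {
    return [];
  }
  return response.body.hits.hits.map((hit) => hit._id);
}

// A failed search that left the response undefined no longer throws.
console.log(extractHitIds(undefined).length); // 0
console.log(extractHitIds({ body: { hits: { hits: [{ _id: "a" }, { _id: "b" }] } } }));
```

Returning an empty list lets the caller skip the duplicate cleanup for that run instead of crashing; whether skipping silently is the right behavior is a design choice for the plugin.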

Also, now the DSpace Health Check has over 600,000 jobs...

Screenshot 2021-06-16 at 20-18-31 OpenRXV

Looking in the DSpace server logs, it seems to be stuck on one item: I see 34,000 of these requests so far, and the number is increasing fast:

15.21.125.64 - - [16/Jun/2021:19:22:14 +0200] "GET /rest/handle/10568/3703?expand=metadata,parentCommunityList,parentCollectionList,bitstreams HTTP/1.1" 200 443 "-" "ARES harvesting bot; https://github.com/ilri/OpenRXV"

From the nginx logs on the server:

# grep -c '/rest/handle/10568/3703?expand=metadata,parentCommunityList,parentCollectionList,bitstreams' /var/log/nginx/rest.log
37553

alanorth pushed a commit that referenced this pull request Jun 23, 2021
This implements a new method for harvesting DSpace repositories as
well as a "health check" plugin that runs after harvesting to find
duplicate and missing items. The new harvesting method harvests 10
items at a time and is much faster (at least 3x) than the previous
method.

We still need to do some cleanup work on this:

- Commit button should only be active if `openrxv-items-temp` has
data, otherwise if you click it twice it will overwrite the live
`openrxv-items-final` index.
- Some documentation tips in the admin dashboard to help users with
the values that are expected in various new fields
- Making the "DSpace add missing items" plugin hidden, as it is used
internally by the health check plugin and should not be enabled
manually

See: #96
@alanorth
Member

This is working well now. Squashed into one commit and merged to master manually. We still need to:

  • Make the commit button active only if openrxv-items-temp has data, otherwise clicking it twice will overwrite the live openrxv-items-final index.
  • Add some documentation tips in the admin dashboard to help users with the values that are expected in the various new fields.
  • Hide the "DSpace add missing items" plugin, as it is used internally by the health check plugin and should not be enabled manually.
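
To make the first item concrete, here is a small in-memory model of the guard. It assumes that a commit replaces the final index with the contents of the temp index and leaves the temp index empty; that behavior, the `Indexes` type, and the `commit` function are illustrative assumptions, not the project's actual implementation.

```typescript
// In-memory stand-in for the two indexes (illustrative only).
interface Indexes {
  temp: string[];  // stands in for openrxv-items-temp
  final: string[]; // stands in for openrxv-items-final
}

// The commit action should only be available when temp has data.
function canCommit(ix: Indexes): boolean {
  return ix.temp.length > 0;
}

// Assumed commit semantics: final is replaced by temp, and temp is emptied.
// With the guard, a second click is a no-op instead of wiping final.
function commit(ix: Indexes): Indexes {
  if (!canCommit(ix)) {
    return ix;
  }
  return { temp: [], final: ix.temp };
}

const afterFirst = commit({ temp: ["item-1", "item-2"], final: ["old-item"] });
const afterSecond = commit(afterFirst);
console.log(afterSecond.final); // still ["item-1", "item-2"]
```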

Should solve #67 and #62. I will evaluate that in the coming days after we have more experience with this in production.

@alanorth alanorth closed this Jun 23, 2021
@alanorth
Member

@moayadnajd there's something strange with the health check. It seems to create way too many plugin jobs:

Screenshot 2021-06-25 at 21-29-25 AReS-fs8

This happens every time I start the plugins after harvesting. I investigated the jobs and most of them are from dspace_add_missing_items:

$ redis-cli KEYS "bull:plugins:*" \
  | sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \
  | ncat -w 3 localhost 6379 \
  | grep -v -E '^\$' | sort | uniq -c | sort -h
      3 dspace_health_check
      4 -ERR wrong number of arguments for 'hget' command
     12 mel_downloads_and_views
    129 dspace_altmetrics
    932 dspace_downloads_and_views
 186428 dspace_add_missing_items
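
The four `-ERR` lines come from keys that do not end in a digit: the `sed` expression only appends `name` to keys ending in a digit, so the remaining keys produce an `HGET` with a single argument. A sketch of the same tally with those keys filtered out first is below. The function names and the sample keys are made up for illustration, and fetching the job names from Redis is left out.

```typescript
// Keep only keys that name a single job, e.g. "bull:plugins:123".
function jobKeys(keys: string[]): string[] {
  return keys.filter((key) => /^bull:plugins:\d+$/.test(key));
}

// Count how often each job name occurs, sorted ascending by count
// (like `sort | uniq -c | sort -h` in the pipeline above).
function tallyNames(names: string[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const name of names) {
    counts.set(name, (counts.get(name) || 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => a[1] - b[1]);
}

// Made-up sample keys: two job hashes and two non-job keys.
const sampleKeys = ["bull:plugins:1", "bull:plugins:2", "bull:plugins:id", "bull:plugins:completed"];
console.log(jobKeys(sampleKeys)); // only the two numeric keys remain

console.log(tallyNames(["dspace_add_missing_items", "dspace_health_check", "dspace_add_missing_items"]));
```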

@alanorth
Member

@moayadnajd I added debugging to the DSpace Health Check and I think it's confused when there is more than one repository. Notice how many "missing" items it wants to add for WorldFish:

[Nest] 88   - 06/26/2021, 5:46:15 PM   [NestApplication] Nest application successfully started +12ms
[Nest] 88   - 06/26/2021, 5:47:06 PM   [DSpaceHealthCheck] Started DSpace health check
[Nest] 88   - 06/26/2021, 5:47:06 PM   [DSpaceHealthCheck] Started DSpace health check
[Nest] 88   - 06/26/2021, 5:47:10 PM   [DSpaceHealthCheck] getHandles
[Nest] 88   - 06/26/2021, 5:47:10 PM   [DSpaceHealthCheck] 0 handles found
[Nest] 88   - 06/26/2021, 5:47:10 PM   [DSpaceHealthCheck] getHandles
[Nest] 88   - 06/26/2021, 5:47:11 PM   [DSpaceHealthCheck] 0 handles found
[Nest] 88   - 06/26/2021, 5:47:11 PM   [DSpaceHealthCheck] 4431 handles found
[Nest] 88   - 06/26/2021, 5:47:11 PM   [DSpaceHealthCheck] Searching for duplicate handles
[Nest] 88   - 06/26/2021, 5:47:12 PM   [DSpaceHealthCheck] Adding 92409 missing items (WorldFish)
[Nest] 88   - 06/26/2021, 5:47:13 PM   [PluginsConsumer] OnGlobalQueueDrained
[Nest] 88   - 06/26/2021, 5:47:13 PM   [DSpaceHealthCheck] 9999 handles found
[Nest] 88   - 06/26/2021, 5:47:13 PM   [DSpaceHealthCheck] Finished DSpace health check (WorldFish)
[Nest] 88   - 06/26/2021, 5:47:13 PM   [DSpaceHealthCheck] 19998 handles found
[Nest] 88   - 06/26/2021, 5:47:14 PM   [DSpaceHealthCheck] 29997 handles found
[Nest] 88   - 06/26/2021, 5:47:15 PM   [DSpaceHealthCheck] 39996 handles found
[Nest] 88   - 06/26/2021, 5:47:15 PM   [DSpaceHealthCheck] 49995 handles found
[Nest] 88   - 06/26/2021, 5:47:16 PM   [DSpaceHealthCheck] 59994 handles found
[Nest] 88   - 06/26/2021, 5:47:17 PM   [DSpaceHealthCheck] 69993 handles found
[Nest] 88   - 06/26/2021, 5:47:18 PM   [DSpaceHealthCheck] 79992 handles found
[Nest] 88   - 06/26/2021, 5:47:18 PM   [DSpaceHealthCheck] 89991 handles found
[Nest] 88   - 06/26/2021, 5:47:18 PM   [DSpaceHealthCheck] 90990 handles found
[Nest] 88   - 06/26/2021, 5:47:18 PM   [DSpaceHealthCheck] Searching for duplicate handles
[Nest] 88   - 06/26/2021, 5:47:53 PM   [DSpaceHealthCheck] Adding 1419 missing items (DSpace Test)

The WorldFish repository only has about 4,400 items in total, so it's impossible that 92,000 are missing. There are, however, about 92,000 items in CGSpace, so this count seems to be every item in that repository.
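
If the check really does mix handles across repositories, as the log hints, the effect can be reproduced with a small model. This is a sketch of one possible reading, not the plugin's code: `IndexedItem`, `findMissing`, the repository names "A" and "B", and the handles are all made up for illustration.

```typescript
// An item already present in the index, tagged with its repository.
interface IndexedItem {
  handle: string;
  repo: string;
}

// Handles expected for `repo` (e.g. from its sitemap) that are not
// yet indexed for that same repository.
function findMissing(expected: string[], indexed: IndexedItem[], repo: string): string[] {
  const have = new Set(
    indexed.filter((item) => item.repo === repo).map((item) => item.handle)
  );
  return expected.filter((handle) => !have.has(handle));
}

const indexed: IndexedItem[] = [
  { handle: "1111/1", repo: "A" },
  { handle: "1111/2", repo: "A" },
  { handle: "2222/1", repo: "B" },
];

// Correct pairing: repository B's expected handles against B's indexed items.
console.log(findMissing(["2222/1", "2222/2"], indexed, "B")); // ["2222/2"]

// Mixed pairing: repository A's handles checked against B's indexed items.
// Every handle from A is reported as "missing" from B.
console.log(findMissing(["1111/1", "1111/2"], indexed, "B")); // ["1111/1", "1111/2"]
```

The second call shows how a small repository could appear to be missing roughly as many items as a much larger one contains.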

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing communities and collections
2 participants