Inconsistencies / missing data in automatic updates #372

Closed
fsteeg opened this issue Jan 8, 2024 · 16 comments

@fsteeg
Member

fsteeg commented Jan 8, 2024

Via email feedback, original message on 12/22/23 12:38 by M.H.

New entry was missing in lobid-gnd:

https://services.dnb.de/oai/repository?verb=GetRecord&metadataPrefix=RDFxml&identifier=oai:dnb.de/authorities/1312101741

Latest update is now on 2023-12-27T11:12:51.000, which is 2023-12-27T10:12:51Z in OAI-PMH, as clarified by DNB via email on 12/22/23, 17:13 by J.R.
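The conversion between the local modification time and the OAI-PMH timestamp can be sketched as follows. This is only an illustration, not code from lobid-gnd: the function name `local_to_oai_utc` and the fixed `CET` offset are assumptions, and a DST-aware zone would be needed for summer dates.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CET as a fixed UTC+1 offset, which holds for the December timestamp above.
# In summer (CEST, UTC+2) a DST-aware zone such as zoneinfo.ZoneInfo("Europe/Berlin")
# would be required instead.
CET = timezone(timedelta(hours=1), "CET")


def local_to_oai_utc(local_timestamp: str, tz: timezone = CET) -> str:
    """Interpret a naive ISO timestamp as local time and format it as OAI-PMH UTC."""
    naive = datetime.fromisoformat(local_timestamp)
    as_utc = naive.replace(tzinfo=tz).astimezone(timezone.utc)
    return as_utc.strftime("%Y-%m-%dT%H:%M:%SZ")


print(local_to_oai_utc("2023-12-27T11:12:51.000"))  # 2023-12-27T10:12:51Z
```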

Fetching updates manually worked, the missing resource is now in lobid-gnd:

https://lobid.org/gnd/1312101741

However, the automatic update for that time span on the server is way too small:

sol@quaoar1:~/git/lobid-gnd$ ls -alh data/backup/GND-updates_2023-12-27T09:40:26Z_2023-12-27T10:40:25Z.*
1.2K Dec 27 10:40 data/backup/GND-updates_2023-12-27T09:40:26Z_2023-12-27T10:40:25Z.jsonl
3.8K Dec 27 10:40 data/backup/GND-updates_2023-12-27T09:40:26Z_2023-12-27T10:40:25Z.rdf

Compared to the manual run for the same time span (sol@quaoar3:~/git/lobid-gnd$ sbt "runMain apps.ConvertUpdates 2023-12-27T09:40:26Z 2023-12-27T10:40:25Z"):

sol@quaoar3:~/git/lobid-gnd$ ls -alh GND-updates.*
143K Jan  8 12:34 GND-updates.jsonl
369K Jan  8 12:34 GND-updates.rdf

Might have been temporary network issues, but at least we need better monitoring.
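One possible shape for such monitoring, as a rough sketch: count the records in the automatic `.jsonl` file and compare them with a later re-fetch of the same time span. The function names, the 0.9 threshold, and the example counts are hypothetical; nothing like this exists in the repository as far as this issue shows.

```python
from pathlib import Path


def count_records(jsonl_path: Path) -> int:
    """Count non-empty lines, assuming one record per line in the .jsonl file."""
    with jsonl_path.open(encoding="utf-8") as lines:
        return sum(1 for line in lines if line.strip())


def looks_incomplete(automatic: int, reference: int, min_ratio: float = 0.9) -> bool:
    """Flag the automatic run if it holds fewer than min_ratio of the reference records."""
    if reference == 0:
        return False  # nothing to compare against
    return automatic < reference * min_ratio


# Hypothetical counts: an automatic run with 3 records vs. a manual re-run with 350.
print(looks_incomplete(automatic=3, reference=350))
```

A check like this only detects a shortfall after a second fetch, so it would complement rather than replace alerting on failed runs.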

@dr0i
Member

dr0i commented Jan 18, 2024

Might be related to #363.

@witzigs

witzigs commented Jan 18, 2024

Hi,
We link to lobid-gnd from a search interface (swisscollections.ch), and I received a question about why the links for two GND records don't return results on lobid-gnd. As far as I know, both were added to the GND on 14.12.2023:
https://services.dnb.de/oai/repository?verb=GetRecord&metadataPrefix=RDFxml&identifier=oai:dnb.de/authorities/1312495189
https://services.dnb.de/oai/repository?verb=GetRecord&metadataPrefix=RDFxml&identifier=oai:dnb.de/authorities/1312496002

Reading this issue, I assume that the records weren't added to lobid-gnd due to these update issues. So I'm adding this feedback here in case more examples help you fix the issue.

Best regards,
Silvia

@fsteeg
Member Author

fsteeg commented Feb 14, 2024

I got a question why the links for two GND records don't return results on lobid-gnd

I reindexed those two (underlying issue is still unresolved):

https://lobid.org/gnd/1312495189
https://lobid.org/gnd/1312496002

@dr0i
Member

dr0i commented Feb 19, 2024

Couldn't find the underlying problem. What is especially strange is that the automatic updates are sometimes smaller than the later, manually invoked ones. As @fsteeg mentioned, this could be a temporary network issue. There might also be a problem on the provider's side. Because this problem is hard to debug, and also to cope with possible problems on the provider's side, I suggest running a daily update in addition to the hourly updates. This way we should be safer in getting all the data. If agreed, I will configure an additional daily update. Or are there better ideas?
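To illustrate the redundancy idea: the daily run would re-request a span that covers the last 24 hourly spans, so a record missed by one hourly run gets a second chance. A minimal sketch, assuming a hypothetical helper `update_window`; the actual scheduling in the repository may look different.

```python
from datetime import datetime, timedelta, timezone

FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def update_window(now: datetime, length: timedelta) -> tuple[str, str]:
    """Return (from, until) in UTC for the span of the given length ending at `now`."""
    until = now.astimezone(timezone.utc)
    start = until - length
    return start.strftime(FORMAT), until.strftime(FORMAT)


now = datetime(2024, 2, 19, 12, 0, 0, tzinfo=timezone.utc)
hourly = update_window(now, timedelta(hours=1))
daily = update_window(now, timedelta(days=1))
print(hourly)  # ('2024-02-19T11:00:00Z', '2024-02-19T12:00:00Z')
print(daily)   # ('2024-02-18T12:00:00Z', '2024-02-19T12:00:00Z')
```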

dr0i added the bug label Feb 19, 2024
@acka47
Contributor

acka47 commented Feb 19, 2024

Because this problem is hard to debug, and also to cope with possible problems on the provider's side, I suggest running a daily update in addition to the hourly updates. This way we should be safer in getting all the data. If agreed, I will configure an additional daily update.

+1 This sounds like a good approach to me. Hasn't the number of reports risen since we switched to hourly updates in November (#350)? The question is whether hourly updates are a good idea in the first place if people cannot rely on them being carried out.

acka47 assigned dr0i and unassigned fsteeg and acka47 Feb 19, 2024
dr0i added a commit that referenced this issue Feb 19, 2024
This redundancy in updating the data should bring more stability
in being in sync with upstream data.
@dr0i
Member

dr0i commented Feb 19, 2024

#378 works as a safety rope.
Why we sometimes (as in "seldom") have trouble getting all the data hourly remains a puzzle. It could be interesting to ask DNB whether they notice issues on their side regarding the OAI-PMH service and data syncing.

@fsteeg
Member Author

fsteeg commented Feb 20, 2024

(+1 for additional daily updates, I've approved #378)

Why we sometimes (as in "seldom") have trouble getting all the data hourly remains a puzzle. It could be interesting to ask DNB whether they notice issues on their side regarding the OAI-PMH service and data syncing.

Could this in some way be related to the fact that the OAI-PMH interface expects UTC times, while the modification times in the data and the server use local time (see mail from J.R. on 2023-12-22)?

@dr0i
Member

dr0i commented Feb 20, 2024

From German Wikipedia:

Adding one hour to UTC gives Central European Time (CET), which applies for part of the year in Germany, Austria, Switzerland and other Central European countries. For Central European Summer Time (CEST), which applies in summer, two hours must be added.

So indeed: when we query what we think is the span from one hour ago until now (in CET), we in fact query the span from now until one hour from now (in UTC). I wonder why there was any data at all. Going to fix it.
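The off-by-one-hour effect can be reproduced in a few lines. The sketch below builds an OAI-PMH `ListRecords` request with `from` and `until` converted to UTC and contrasts it with the local wall-clock time mislabeled as UTC. It is not the fix from the repository: `list_records_url` is a made-up name, the fixed `CET` offset ignores daylight saving time, and `set=authorities` is an assumption that is not stated in this issue.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

BASE_URL = "https://services.dnb.de/oai/repository"  # taken from the links in this issue
CET = timezone(timedelta(hours=1))  # assumption: fixed winter offset, no DST handling
FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def list_records_url(start: datetime, end: datetime) -> str:
    """Build an OAI-PMH ListRecords URL; `from` and `until` are sent as UTC."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": "RDFxml",
        "set": "authorities",  # assumption: set name, not confirmed in this issue
        "from": start.astimezone(timezone.utc).strftime(FORMAT),
        "until": end.astimezone(timezone.utc).strftime(FORMAT),
    }
    return f"{BASE_URL}?{urlencode(params)}"


now_local = datetime(2024, 2, 20, 12, 0, 0, tzinfo=CET)

# The bug: formatting the local wall-clock time and appending "Z" without converting.
mislabeled_until = now_local.strftime(FORMAT)
# The correct value: convert to UTC first.
correct_until = now_local.astimezone(timezone.utc).strftime(FORMAT)

print(mislabeled_until)  # 2024-02-20T12:00:00Z, one hour in the future
print(correct_until)     # 2024-02-20T11:00:00Z
print(list_records_url(now_local - timedelta(hours=1), now_local))
```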

dr0i added a commit that referenced this issue Feb 20, 2024
OAI-PMH expects UTC times. Our server runs on CET, so these times
must be changed to UTC. The files produced will still have CET-based
timestamps, though.
@dr0i
Member

dr0i commented Feb 22, 2024

Should be fixed with #379 "from now on". That is, I assume a complete reindexing is needed to catch up with all the possibly missing data, @fsteeg?

dr0i assigned fsteeg and unassigned dr0i Feb 22, 2024
@acka47
Contributor

acka47 commented Feb 22, 2024

A new dump should be provided soon:

image

Source: https://www.dnb.de/DE/Professionell/Metadatendienste/Datenbezug/Gesamtabzuege/gesamtabzuege_node.html#doc58272bodyText2

@acka47
Contributor

acka47 commented Feb 23, 2024

We received a mail yesterday about missing records that were created last week. Example: https://lobid.org/gnd/1319507522

Creation date (see MARC) is: 2024-02-15

@acka47
Contributor

acka47 commented Feb 23, 2024

We received a mail yesterday about missing records that were created last week.

The example (https://lobid.org/gnd/1319507522) now works and I sent out a mail response.

@acka47
Contributor

acka47 commented Feb 26, 2024

E.V. who sent the mail mentioned in #372 (comment) followed up on it by providing more entries that are still missing. I went through them to see on which day they were created and found entries from the following days:

  • 2024-01-23
  • 2024-01-24
  • 2024-01-25
  • 2024-01-29
  • 2024-01-30
  • 2024-01-31
  • 2024-02-21

He closes the email by noting that the list is not exhaustive and that more entries are missing. As the missing updates significantly degrade the service, we should not wait for a new full dump but reindex titles, probably best starting at 2023-11-10, as this is the date we rescheduled the updates (see #350).
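A catch-up run from 2023-11-10 could be scripted by generating one invocation per day of the manual command quoted at the top of this issue. This is a sketch under assumptions: `backfill_commands` is a hypothetical helper, the end date below is only an example, and whether `apps.ConvertUpdates` copes well with day-long spans is not confirmed here.

```python
from datetime import date, timedelta


def backfill_commands(first_day: date, last_day: date) -> list[str]:
    """One ConvertUpdates invocation per UTC day, first_day through last_day inclusive."""
    commands = []
    day = first_day
    while day <= last_day:
        start = f"{day.isoformat()}T00:00:00Z"
        end = f"{day.isoformat()}T23:59:59Z"
        commands.append(f'sbt "runMain apps.ConvertUpdates {start} {end}"')
        day += timedelta(days=1)
    return commands


# Example end date only; the real range would run up to the day the fix was deployed.
for command in backfill_commands(date(2023, 11, 10), date(2023, 11, 12)):
    print(command)
```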

@fsteeg
Member Author

fsteeg commented Feb 27, 2024

fsteeg assigned acka47 and unassigned fsteeg Feb 27, 2024
@acka47
Contributor

acka47 commented Feb 27, 2024

+1 It's OK for me to close this issue now, but we should monitor closely whether updates come in reliably.

acka47 assigned fsteeg and unassigned acka47 Feb 27, 2024
@acka47
Contributor

acka47 commented Jun 4, 2024

Closing. Updates have been fine during the last weeks/months.

acka47 closed this as completed Jun 4, 2024