
[Epic] Import additional data from zbMATH Open #3

Open
10 of 25 tasks
aot29 opened this issue Jan 14, 2022 · 17 comments
Labels: enhancement (New feature or request), epic

Comments

@aot29
Collaborator

aot29 commented Jan 14, 2022

Issue description:
Additional data from the zbMATH Open API should be imported into the MaRDI-Portal.
Related: #6

Remarks:

  • Import the IDs of the articles, the zbMATH classification codes and keywords, DOIs, and document_title (see the fetch-and-parse sketch after this list)
  • Create authors as items
  • Set the MaRDI oai_zb_preview format
  • Filter out duplicates
  • In Wikibase, add zbMATH classification codes and keywords as items
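
A minimal fetch-and-parse sketch for a single record, assuming the oai_zb_preview format exposes elements such as document_title, author, doi, classification, and keyword (the element names are assumptions, not taken from the official schema), using Python with requests:

    # Sketch: fetch one record from the zbMATH Open OAI-PMH API and pull out a
    # few preview fields. The element names queried below are assumptions.
    import requests
    import xml.etree.ElementTree as ET

    OAI_BASE = "https://oai.zbmath.org/v1/"

    def fetch_preview(identifier):
        """Return selected preview fields for one OAI identifier, e.g. 'oai:zbmath.org:3224368'."""
        params = {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": "oai_zb_preview",
        }
        resp = requests.get(OAI_BASE, params=params, timeout=30)
        resp.raise_for_status()
        root = ET.fromstring(resp.content)

        def texts(tag):
            # The '{*}' wildcard (Python 3.8+) avoids hard-coding exact namespaces.
            return [e.text for e in root.findall(".//{*}" + tag) if e.text]

        return {
            "title": texts("document_title"),
            "authors": texts("author"),
            "doi": texts("doi"),
            "classification": texts("classification"),
            "keywords": texts("keyword"),
        }

    if __name__ == "__main__":
        print(fetch_preview("oai:zbmath.org:3224368"))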

TODO:

Acceptance-Criteria

  • Data can be downloaded from zbMath
  • Data can be imported into MaRDI-Portal
  • New data can be imported
  • Data can be imported incrementally
  • Duplicates are filtered out
  • Data can be rolled back

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

Using Crossref data: "No sign-up is required to use the REST API, and the data can be treated as facts from members. The data is not subject to copyright, and you may use it for any purpose.

Crossref generally provides metadata without restriction; however, some abstracts contained in the metadata may be subject to copyright by publishers or authors." (https://www.crossref.org/documentation/retrieve-metadata/rest-api/)
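
For reference, a minimal sketch of looking up one DOI against the Crossref REST API (/works/{doi}); the contact address and the example DOI are placeholders:

    # Sketch: fetch Crossref metadata for a single DOI via the public REST API.
    import requests

    def crossref_work(doi, mailto="portal@example.org"):
        # The mailto parameter identifies the caller and routes the request to
        # Crossref's "polite" pool; replace it with a real contact address.
        resp = requests.get(
            "https://api.crossref.org/works/" + doi,
            params={"mailto": mailto},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["message"]

    work = crossref_work("10.1000/xyz123")  # placeholder DOI
    print(work.get("title"), work.get("DOI"))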

aot29 added the enhancement (New feature or request) label Jan 14, 2022
aot29 changed the title from "Import additional data from zwMath" to "Import additional data from zbMath" Jan 14, 2022
aot29 self-assigned this Jan 14, 2022
@aot29
Collaborator Author

aot29 commented Jan 17, 2022

@physikerwelt according to the documentation and the examples, the OAI API should return JSON, but it only returns XML. That's OK for me, but is it the intended behavior?

@physikerwelt
Member

Yes, the response content type is incorrect. I'll make a pull request.

@physikerwelt
Member

physikerwelt commented Jan 17, 2022

The content type itself is correct; only the Swagger API documentation expects the wrong content type.

< HTTP/1.1 200 OK
< Date: Mon, 17 Jan 2022 18:48:33 GMT
< Server: Apache/2.4.38 (Debian)
< Content-Length: 5810
< Vary: Accept-Encoding
< Content-Type: text/xml; charset=utf-8
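
For reference, the same check from Python (requests); Identify is the standard OAI-PMH verb and should work against the /v1/ base URL:

    import requests

    resp = requests.get("https://oai.zbmath.org/v1/", params={"verb": "Identify"})
    print(resp.status_code, resp.headers.get("Content-Type"))  # expect: 200 text/xml; charset=utf-8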

@aot29
Collaborator Author

aot29 commented Jan 20, 2022

@Hyper-Node suggested that we wouldn't necessarily need to import the data if it is already in a graph database in zbMATH Open (I didn't find any documentation on how the backend database is implemented). If that is the case, it would be the most elegant solution.

Otherwise, would we import preview data (title, author, DOI, keywords, etc., but no abstract) for all publications in zbMATH Open? Taking licensing limitations into account, that would be roughly 3 million entries. If so, I would first prototype it as described in this issue, then build an import container and put it on the server to run overnight. I estimate that importing 3 million entries (probably in batches of 100) using QuickStatements would take a couple of weeks.
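
A rough sketch of the batching step, in QuickStatements V1 command syntax; the property IDs P1 (DOI) and P2 (zbMATH ID) are placeholders, not the portal's actual properties:

    # Sketch: turn parsed preview records into QuickStatements V1 commands,
    # grouped into batches of 100. P1 and P2 are placeholder property IDs.
    def to_quickstatements(record):
        lines = ["CREATE"]
        if record.get("title"):
            lines.append('LAST\tLen\t"%s"' % record["title"])  # English label
        if record.get("doi"):
            lines.append('LAST\tP1\t"%s"' % record["doi"])
        if record.get("zbmath_id"):
            lines.append('LAST\tP2\t"%s"' % record["zbmath_id"])
        return "\n".join(lines)

    def batches(records, size=100):
        for i in range(0, len(records), size):
            yield "\n".join(to_quickstatements(r) for r in records[i:i + size])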

@Hyper-Node @physikerwelt what do you think?

@physikerwelt
Member

I would say stay focused. There is no graph database for zbMATH Open. I would be extremely happy if we could develop a tool that is capable of importing individual zbMATH Open entries (or a batch of entries) on demand, without creating duplicates (but rather updating existing entries). If I interpret the ticket description correctly, this is what this ticket is about. I am not sure we need to import all of zbMATH at this point in time. I would like to create a different ticket for that and keep the focus here on building the first version of the zbMATH Open -> MaRDI portal ingestion pipeline.

@physikerwelt
Member

> @physikerwelt according to the documentation and the examples, the OAI API should return JSON, but it only returns XML. That's OK for me, but is it the intended behavior?

Fixed now, cf. https://oai.zbmath.org/
[Screenshot 2022-01-20: zbMATH Open OAI-PMH API]

@aot29
Collaborator Author

aot29 commented Jan 20, 2022

Thanks.
Next question: the ListSets endpoint always crashes with error 500. I can work around this using helper/filter so that's OK for me so far.

@aot29
Collaborator Author

aot29 commented Jan 20, 2022

The helper/filter endpoint also crashes with certain parameters, e.g.

    curl -X 'GET' 'https://oai.zbmath.org/v1/helper/filter?filter=software:FORTRAN&metadataPrefix=oai_zb_preview' -H 'accept: text/xml'

returns INTERNAL SERVER ERROR

while

    curl -X 'GET' 'https://oai.zbmath.org/v1/helper/filter?filter=software:Gfan&metadataPrefix=oai_zb_preview' -H 'accept: text/xml'

works fine.

Since FORTRAN is probably much more popular than Gfan, is there some error related to the size of the result set?

(also mailing OAI support)
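
A small Python (requests) reproduction of the two curl calls above, printing the HTTP status code for each filter:

    # Reproduce the two filter queries and print the HTTP status codes.
    import requests

    BASE = "https://oai.zbmath.org/v1/helper/filter"
    for software in ("FORTRAN", "Gfan"):
        resp = requests.get(
            BASE,
            params={"filter": "software:" + software,
                    "metadataPrefix": "oai_zb_preview"},
            headers={"accept": "text/xml"},
            timeout=120,
        )
        # As of this comment: FORTRAN returns 500, Gfan returns 200.
        print(software, resp.status_code)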

@sedimentation-fault

Don't go to the trouble of downloading 4.2+ million records from zbMATH - at least not yet. Two-thirds of all XML records are records whose title, authors, and many other elements contain just the string

zbMATH Open Web Interface contents unavailable due to conflicting licenses.

Example:

Let's take the item with "DE number" 3224368 and form the OAI-PMH URL for the "GetRecord" endpoint:

https://oai.zbmath.org/v1/?verb=GetRecord&identifier=oai%3Azbmath.org%3A3224368&metadataPrefix=oai_zb_preview

Download this with your favorite web client to, say, 3224368.xml and inspect it - you'll see what I mean. This happens to roughly 2 out of every 3 items that I try at random.
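
A short Python (requests) sketch that downloads the record above and counts the elements carrying only that placeholder text (the placeholder string is copied from the quote above):

    # Count elements whose text is exactly the "conflicting licenses" placeholder.
    import requests
    import xml.etree.ElementTree as ET

    PLACEHOLDER = ("zbMATH Open Web Interface contents unavailable "
                   "due to conflicting licenses.")
    URL = ("https://oai.zbmath.org/v1/?verb=GetRecord"
           "&identifier=oai%3Azbmath.org%3A3224368&metadataPrefix=oai_zb_preview")

    root = ET.fromstring(requests.get(URL, timeout=30).content)
    blocked = [el.tag for el in root.iter() if (el.text or "").strip() == PLACEHOLDER]
    print(len(blocked), "elements contain only the placeholder:", blocked)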

Interestingly enough, trying the bibtex for the same item from

https://zbmath.org/bibtex/03224368.bib

will get you full information on exactly those fields where OAI-PMH encounters "conflicting licenses" - go figure.

I would understand if this appeared in XML elements that might contain copyrightable information, but I fail to see how titles or author names fall into any category of items to which licensing restrictions might apply...

@physikerwelt
Member

Indeed, title and authors cannot be exposed via the API due to license restrictions. The terms and conditions of zbMATH don't allow scraping the BibTeX information. Thanks to @rank-zero for the background information.

I think that, independent of the restricted fields, we should design the ingestion process so that it downloads the initial dataset once and fetches updates at fixed intervals. Here the OAI-PMH format comes in handy.
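
A minimal sketch of such a harvest loop, assuming the standard OAI-PMH ListRecords verb with selective harvesting via the 'from' parameter and resumptionToken paging (the datestamp format should be checked against the repository's Identify response):

    # Sketch: full harvest once, then incremental harvests via 'from'.
    import requests
    import xml.etree.ElementTree as ET

    OAI_BASE = "https://oai.zbmath.org/v1/"

    def harvest(since=None):
        """Yield <record> elements; pass since='YYYY-MM-DD' for an incremental run."""
        params = {"verb": "ListRecords", "metadataPrefix": "oai_zb_preview"}
        if since:
            params["from"] = since
        while True:
            resp = requests.get(OAI_BASE, params=params, timeout=60)
            resp.raise_for_status()
            root = ET.fromstring(resp.content)
            yield from root.findall(".//{*}record")
            token = root.find(".//{*}resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            # Follow-up requests carry only the verb and the resumption token.
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    # Initial load:      for rec in harvest(): ...
    # Scheduled update:  for rec in harvest(since="2022-01-01"): ...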

@sedimentation-fault

"The terms and conditions of zbMATH don't allow scaping the bibtex information."

I thought zbMATH decided to become "open access" some time ago. Besides, if I present a title, does this give rise to a legal suspicion that I scraped a BibTeX entry from somewhere? zbMATH does not have to reveal to anyone how it arrived at any given title or author name. Plus, the way you say it implies that it is zbMATH itself that imposes restrictions on...itself? - I don't understand all this.

Anyway, to stay on topic: you plan to download 4.x million records? Even with a sleep interval of 1-2 seconds between downloads, which is short (I think), it may take close to a year, since the download itself will also consume some seconds: assuming ~6 seconds per record, you get 10 records per minute, or 600 per hour, i.e. ~14,400 per day - at that rate you need the better part of a year to get them all once. How many requests per second do you plan to send to the oai.zbmath.org server? Are you OK with such a long-running process? Just curious...

@physikerwelt
Member

I am not a lawyer, and I agree with you that the situation is not intuitive, especially since one can use the DOI field to join data from Semantic Scholar or Crossref. However, one is still not allowed to redistribute the merged data. I have double-checked that zbMATH Open cannot expose titles and authors via APIs without breaking German law.

The API is capable of providing the dataset in a few hours; I tested that last week, so one can get all the data quite quickly. The import into Wikibase is the long-running task. This issue is about developing the software to import records from zbMATH Open; importing everything is subject to another discussion. It is not entirely clear to me whether importing everything is a good idea or whether a lazy approach is preferable.

@sedimentation-fault

O.K., I don't want to get into lengthy debates about this here, especially since you are not zbMATH. :-) But I do have some remarks and I urge you to consider them seriously in your project:

  1. Common sense says: no matter how capable the API is, imported records that lack title and/or author information will be useless. To see this, just pause for a moment and ask yourself: "Why are we doing all this?" You do this for researchers - and no researcher will tell you that such a crippled record is of any use.

  2. "zbMATH Open can not expose titles and authors via APIs without to break German law." Well, there is some possibility that exposing titles and authors would break a law that forbids the dissemination of databases. Maybe there is a conflict there. No matter what the conflict is, common sense again suggests that it should be allowed to either expose, or add/combine title/author information - and disseminate the combined records. I thus strongly suggest that your consortium fights for this right in the german courts. There are some fights that you cannot win with code - and this is one of them. You must stand up and fight it in the courts.

The TODO list above should be updated with a new, high-priority item:

Fight in the courts for the right to use titles/authors, no matter how we got them!

In a sane society, this right would be self-evident - but we don't live in one, so this is the way to go.

@physikerwelt
Member

Thank you for your opinion. As said, I am not a lawyer, and therefore this is out of scope. I think your ambition to improve the legal situation is noble; however, this is not our expertise. There are other initiatives with the required legal expertise to pursue these issues. In this project we need to respect the rules and regulations.

LizzAlice transferred this issue from MaRDI4NFDI/portal-compose Apr 21, 2022
LizzAlice assigned LizzAlice and unassigned LizzAlice Apr 21, 2022
Hyper-Node added the epic label May 9, 2022
Hyper-Node changed the title from "Import additional data from zbMath" to "[Epic] Import additional data from zbMath" Jun 27, 2022
@physikerwelt
Member

This has now been in the making for over a year. @LizzAlice, can you estimate how long it will take to complete this task?

physikerwelt changed the title from "[Epic] Import additional data from zbMath" to "[Epic] Import additional data from zbMATH Open" Jan 31, 2023
@LizzAlice
Contributor

I would think that this would take 1-2 months. However, if I am the one to do it, I would like to postpone it until after my link prediction work is in the beta stage.

@physikerwelt
Member

@LizzAlice I feel we have different tickets for the same task. Can you close the duplicates?
