Offer content type JSON Lines and format gzip #91

Closed
acka47 opened this Issue Apr 23, 2018 · 8 comments

Contributor

acka47 commented Apr 23, 2018

As in lobid-resources, see http://lobid.org/resources/api#content_types

Contributor Author

acka47 commented Apr 24, 2018

And also support gzip compression, requested via an HTTP header (Accept-Encoding: gzip).

@acka47 acka47 changed the title Offer content type JSON Lines Offer content type JSON Lines and format gzip Apr 24, 2018

@fsteeg fsteeg added the ready label May 14, 2018

@fsteeg fsteeg self-assigned this May 14, 2018

@fsteeg fsteeg added working ready and removed ready working labels Jun 13, 2018

fsteeg added 4 commits that referenced this issue Jun 19, 2018

fsteeg added 4 commits that referenced this issue Jun 20, 2018

@fsteeg fsteeg added review and removed working labels Jun 20, 2018

Member

fsteeg commented Jun 20, 2018

Deployed to stage, see:

http://stage.lobid.org/gnd/search?q=ehrenfeld&format=bulk

Tested uncompressed request for all corporate bodies:

curl "http://stage.lobid.org/gnd/search?q=type:CorporateBody&format=bulk" > bulk.jsonl

This yields a 1.7 GB file. Took about:

  • 1:30 minutes on the same machine
  • 2:30 minutes on our local network
  • 6:45 minutes on our Eduroam WLAN

Tested same request, but compressed (handled by the Apache proxy):

curl --header "Accept-Encoding: gzip" "http://stage.lobid.org/gnd/search?q=type:CorporateBody&format=bulk" > bulk.gz

This yields a 174 MB file. Took about:

  • 1:30 minutes on the same machine
  • 1:30 minutes on our local network
  • 1:30 minutes on our Eduroam WLAN
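
A client that keeps the compressed download on disk can still read it record by record. The following is a minimal Python sketch, not part of the lobid code base: the helper name `iter_jsonl_gz` and the sample records are made up for illustration, and it assumes the saved file is a gzip stream containing one JSON object per line, as produced by the compressed request above.

```python
import gzip
import io
import json
from typing import BinaryIO, Iterator


def iter_jsonl_gz(stream: BinaryIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzip-compressed JSON Lines stream."""
    # gzip.open accepts an already opened binary file object and decompresses lazily.
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


# Self-contained demo with made-up records instead of a real download.
sample = (
    b'{"id": "a", "type": ["CorporateBody"]}\n'
    b'{"id": "b", "type": ["CorporateBody"]}\n'
)
compressed = gzip.compress(sample)
records = list(iter_jsonl_gz(io.BytesIO(compressed)))
print(len(records), records[0]["id"])
```

For a real download, the same function could be applied to the saved file, e.g. `with open("bulk.gz", "rb") as f: for record in iter_jsonl_gz(f): ...`, which avoids unpacking the full file first.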

See also documentation: http://stage.lobid.org/gnd/api#content_types

@fsteeg fsteeg assigned acka47 and unassigned fsteeg Jun 20, 2018

Contributor Author

acka47 commented Jun 21, 2018

You should also be able to get jsonlines and gzip for a filter query. I tried curl --header "Accept: application/x-jsonlines" "http://stage.lobid.org/gnd/search?filter=%2B%28type%3APlaceOrGeographicName%29" > geographika.jsonl and am currently pulling the whole GND.

Furthermore, as we discussed online, we will use jsonl instead of bulk as the format value (and adjust this in lobid-resources at a later point as well).
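
The percent-encoded filter value in the curl call above is simply the URL-encoded form of `+(type:PlaceOrGeographicName)`. A small Python sketch of how a client might build such a URL; the helper name `filter_url` is hypothetical, and only the base URL, the `filter` parameter and the filter value are taken from the comment:

```python
from urllib.parse import quote

BASE_URL = "http://stage.lobid.org/gnd/search"  # stage URL as used in this thread


def filter_url(filter_query: str) -> str:
    """Build a search URL with a percent-encoded `filter` parameter."""
    # safe="" makes quote() also encode '+', '(', ':' and ')'.
    return f"{BASE_URL}?filter={quote(filter_query, safe='')}"


url = filter_url("+(type:PlaceOrGeographicName)")
print(url)
```

The resulting URL can then be requested with the `Accept: application/x-jsonlines` header, optionally combined with `Accept-Encoding: gzip`.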

Contributor Author

acka47 commented Jun 21, 2018

Downloading the whole GND as gzip (1.5 GB, 14 GB unzipped) took just 13 minutes. So, this definitely works like a charm.
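
As a rough back-of-the-envelope check of these numbers (a sketch only, assuming decimal units with 1 GB = 1000 MB and a constant transfer rate):

```python
compressed_gb = 1.5     # size of the gzip download reported above
uncompressed_gb = 14.0  # size after unzipping
minutes = 13            # reported download time

seconds = minutes * 60
ratio = uncompressed_gb / compressed_gb              # compression ratio
wire_mb_per_s = compressed_gb * 1000 / seconds       # data actually transferred
payload_mb_per_s = uncompressed_gb * 1000 / seconds  # JSON Lines data received

print(round(ratio, 1), round(wire_mb_per_s, 1), round(payload_mb_per_s, 1))
# prints: 9.3 1.9 17.9
```

So the compressed transfer moves roughly 1.9 MB/s over the wire while delivering about 17.9 MB/s of JSON Lines data.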

@acka47 acka47 assigned fsteeg and unassigned acka47 Jun 21, 2018

fsteeg added a commit that referenced this issue Jun 22, 2018

Use `jsonl` as format value for bulk requests
For consistency with `html`, `json`, etc.

See #91

fsteeg added a commit that referenced this issue Jun 22, 2018

@fsteeg fsteeg assigned acka47 and unassigned fsteeg Jun 22, 2018

Contributor Author

acka47 commented Jun 22, 2018

Looks good. One minor problem is left, though, which actually predates this ticket: http://lobid.org/gnd/4074335-4.jsonl also returns JSON rather than JSON Lines, as does http://lobid.org/gnd/4074335-4.jsonfoo. After .json, we should only allow a colon (:) followed by more characters.

fsteeg added a commit that referenced this issue Jun 22, 2018

Serve HTTP 415 responses for unsupported media types
Don't fall back to JSON if unsupported format was requested

See #91
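
The rule discussed above (no silent fallback to JSON, and after `json` only a colon followed by more characters) could be sketched as follows. This is an illustrative Python sketch, not the code from the referenced commit: the function names and the set of supported format values are assumptions.

```python
# Assumed set of supported format values; the real service may differ.
SUPPORTED_FORMATS = ("html", "json", "jsonl")


def is_supported(fmt: str) -> bool:
    """Accept a known format value, or `json:` followed by at least one character."""
    base, colon, rest = fmt.partition(":")
    if base not in SUPPORTED_FORMATS:
        return False
    if colon:
        # Only `json` may carry a `:suffix`, and the suffix must not be empty.
        return base == "json" and rest != ""
    return True


def status_for(fmt: str) -> int:
    """Return 200 for a supported format, 415 (Unsupported Media Type) otherwise."""
    return 200 if is_supported(fmt) else 415


for value in ("json", "json:x", "jsonl", "jsonfoo", "json:"):
    print(value, status_for(value))
```

With this check, `jsonfoo` and a bare `json:` are rejected with 415 instead of falling back to JSON.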

Contributor Author

acka47 commented Jun 22, 2018

+1

@acka47 acka47 removed their assignment Jun 22, 2018

@dr0i dr0i added deploy and removed review labels Jun 26, 2018

@fsteeg fsteeg closed this in #126 Jun 27, 2018

fsteeg added a commit that referenced this issue Jun 27, 2018

@dr0i dr0i removed the deploy label Jun 27, 2018
