Too many media_urls returned in search #45
With "duplicate media_urls", do you mean that there are items that contain identical media_urls?
It seems I was mistaken. I didn't expect that a result object can have multiple different images (not duplicates), each with their own set of resolutions. For example: Is this correct? Because that's a lot of data for one object, and with a default size of 100 objects per search you hit the memory limit pretty soon.
You can look up the item in its original form (as returned by the source) by requesting http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea/source to get a better idea of what's going on. It looks like the transformation of the item went fine, since the source also shows that there are lots of images associated with the item. I suspect that there is (or was) something wrong at Nationaal Archief Leiden, since the item no longer exists there (http://hdl.handle.net/10648/37b5754f-5494-4337-a317-3ec5b5ef12cf returns a 404). Regarding your memory limit problem: to me the response doesn't seem that big (± 2 Mb). Maybe you should bump up PHP's memory limit a bit 😉.
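As a sketch of the lookup pattern described above (assuming the v0 URL layout of collection name plus object id; the helper name is illustrative, not part of the API):

```python
# Build the /source lookup URL for an item, following the v0 URL scheme
# shown in the comment above. The function name is hypothetical.
API_BASE = "http://api.opencultuurdata.nl/v0"

def source_url(collection: str, object_id: str) -> str:
    """Return the URL of the item's original form as returned by the source."""
    return f"{API_BASE}/{collection}/{object_id}/source"

url = source_url(
    "nationaal_archief_beeldbank",
    "1bc1ac4c800047243c27179c7620ba27cb9521ea",
)
```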
If you think that a JSON response for a single object (± 3 Mb) is acceptable, then I wish all users of the RESTful API (not just PHP, think jQuery) happy coding and you can close this issue. EDIT: I can't imagine any use for an object with 15900 media_urls. Maybe there should be a feature to limit the total number of returned media_urls?
This is not an issue about PHP...
I think the problem here is that we currently can't provide any guarantees about the size of a single item. For example, there may exist items with many URLs, huge descriptions and long lists of items. In situations where the user's application directly calls the REST API at query time (like @frankstrater's search interface), a potentially large response of multiple megabytes is not desirable. My suggestion is to add an optional filter to the REST API which allows the user to specify the maximum size in bytes of objects that should be included in the result set. This prevents us from having to filter out large items at index time, and gives the API user control over the maximum size of the returned response. Additionally, I would like to add the option to specify which fields should be returned. Frank, would this help you?
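A minimal sketch of the suggested size filter, in Python rather than the backend's actual code (the `max_bytes` parameter name is hypothetical):

```python
import json

def filter_by_size(items, max_bytes):
    """Keep only items whose JSON serialization fits within max_bytes.

    Sketch of the proposed optional filter; the real backend would apply
    this against search hits rather than plain dicts.
    """
    return [
        item for item in items
        if len(json.dumps(item).encode("utf-8")) <= max_bytes
    ]

small = {"title": "ok"}
large = {"media_urls": ["http://example.com/img.jpg"] * 10_000}
results = filter_by_size([small, large], max_bytes=1024)
```

The filter keeps `small` and drops `large`, since only the former serializes to under 1024 bytes.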
In this case we are talking about a single item that has a size of ± 2 Mb. When you happen to issue a query where multiple of these items are returned in a single response, you're easily talking about a response size of multiple megabytes. Currently all responses are served uncompressed; I will enable GZIP compression.
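To illustrate the kind of reduction GZIP gives on repetitive JSON (the payload below is illustrative, not the actual API response):

```python
import gzip
import json

# Repetitive JSON, like an item with thousands of similar media_urls,
# compresses very well. Illustrative data only.
payload = json.dumps(
    {"media_urls": [{"url": f"http://example.com/{i}.jpg",
                     "width": 100, "height": 100} for i in range(5000)]}
).encode("utf-8")

compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)
```

On structured data like this the compressed size is a small fraction of the original, which is why GZIP helps here even though it doesn't address the underlying item size.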
The problem with specifying a maximum size for objects is that you would like to have at least one media url in your response ... |
@justinvw A way to specify which fields should be returned should be added (think statistical dashboard apps). I noticed the JSON response is nicely padded, which makes it human-readable, but amounts to a lot of overhead (if I'm not mistaken). @ajslaghu http://search.opencultuurdata.nl/ uses 42 as size, but should work if you change that to 18. With 128Mb you exceed the memory limit somewhere between a size of 20 and 30 if you search for 'notulen', for example. As of now, 50 is too much to handle.
@breyten But on what basis do you want to select this single URL? Details such as size aren't always present. Also, this problem isn't specific to the

@frankstrater by default the response content is pretty-printed, but if you send

I just enabled GZIP compression, which results in some nice reductions of the response size:
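The pretty-printing overhead mentioned above can be estimated with the standard `json` module (illustrative data, not the real response):

```python
import json

# Compare a pretty-printed (indented) encoding against the most compact
# one. Illustrative data standing in for a large search response.
data = {"media_urls": [{"url": f"http://example.com/{i}.jpg"}
                       for i in range(1000)]}

pretty = json.dumps(data, indent=4)
compact = json.dumps(data, separators=(",", ":"))

overhead = len(pretty) - len(compact)
```

The indentation and spacing add bytes on every line, so for list-heavy responses the compact form is noticeably smaller, though GZIP largely neutralizes the difference on the wire.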
I sort of expected something along the lines of:
to get the first 10 image media_urls, but I'm not sure if this is feasible.
How about this for a compromise?
Other fields may be large as well, but I think by limiting the media urls we solve 90-95% of the issue. |
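The compromise of capping media_urls per item might look like this (a sketch; the `limit` parameter and function name are hypothetical, not part of the API):

```python
def truncate_media_urls(item, limit=10):
    """Return a copy of the item with at most `limit` media_urls.

    Sketch of the proposed compromise: cap the media_urls list per item
    instead of filtering out whole items by size.
    """
    capped = dict(item)
    capped["media_urls"] = item.get("media_urls", [])[:limit]
    return capped

item = {"title": "photo", "media_urls": [f"u{i}" for i in range(15900)]}
small_item = truncate_media_urls(item, limit=10)
```

This keeps at least one media URL per item (addressing the objection above), while bounding the worst case like the 15900-URL item from this issue.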
@breyten, did you check how many items there are in the index that have more than 10 media_urls?
Since scripting is disabled, no ;)
Isn't it simpler to decouple possibly long lists into a separate endpoint? It is very natural for the specific case here and perhaps others as well (the classic author/books). |
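Decoupling the long list into its own endpoint would amount to paginating media_urls separately; a sketch (the endpoint shape and names are hypothetical):

```python
def media_urls_page(item, offset=0, limit=10):
    """Return one page of an item's media_urls plus paging metadata.

    Sketch of a hypothetical /<collection>/<id>/media_urls sub-endpoint,
    analogous to the classic author/books decoupling mentioned above.
    """
    urls = item.get("media_urls", [])
    return {
        "total": len(urls),
        "offset": offset,
        "media_urls": urls[offset:offset + limit],
    }

item = {"media_urls": [f"u{i}" for i in range(25)]}
page = media_urls_page(item, offset=10, limit=10)
```

The main item response would then carry only a count (or the first page), and clients that actually need all URLs could walk the sub-endpoint.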
Some search queries return a lot of duplicate media_urls, leading to an "allowed memory size exhausted" error when parsing the JSON response. Might be a caching problem on the ocd_backend. Test scripts to reproduce the bug:
http://strateradvies.nl/ocdsearch/test.php
http://strateradvies.nl/ocdsearch/src_test.php