
Too much media_urls returned in search #45

Closed
frankstrater opened this issue Jun 17, 2014 · 17 comments

@frankstrater (Contributor)

Some search queries return a lot of duplicate media_urls, causing PHP's allowed memory size to be exceeded when parsing the JSON response. This might be a caching problem in the ocd_backend. Test scripts to reproduce the bug:

http://strateradvies.nl/ocdsearch/test.php
http://strateradvies.nl/ocdsearch/src_test.php
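
A minimal command-line sketch of the problem (this assumes the v0 search endpoint accepts a JSON body with a query string and a size; the query term is just an example):

# Fetch 100 results for a media-heavy query and count the uncompressed
# bytes; a response like this can run to many megabytes.
$ curl -s -XPOST 'http://api.opencultuurdata.nl/v0/search' -d '{
    "query": "notulen",
    "size": 100
}' | wc -c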

@justinvw added the bug label on Jun 17, 2014
@justinvw (Contributor)

With "duplicate media_urls", do you mean that there are items that contain identical media_urls within the same item? Can you provide the id's of some items where this problem occurs?

@frankstrater (Contributor, Author)

It seems I was mistaken. I didn't expect that a result object could have multiple different images (not duplicates), each with its own set of resolutions. For example:

http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea

Is this correct? Because that is a lot of data for one object, and with the default search size of 100 objects you hit the memory limit pretty quickly.

@justinvw (Contributor)

You can look up the item in its original form (as returned by the source) by requesting http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea/source to get a better idea of what's going on. The transformation of the item seems to have gone fine, since the source also shows that there are lots of images associated with the item.

I suspect that there is (or was) something wrong at Nationaal Archief Leiden, since the item no longer exists there (http://hdl.handle.net/10648/37b5754f-5494-4337-a317-3ec5b5ef12cf returns a 404).

Regarding your memory limit problem: to me the response doesn't seem that big (± 2 MB). Maybe you should bump up PHP's memory limit a bit 😉.

@frankstrater (Contributor, Author)

If you think that a JSON response of ± 3 MB for a single object is acceptable, then I wish all users of the RESTful API (not just PHP, think jQuery) happy coding and you can close this issue.

EDIT: I can't imagine any use for an object with 15900 media_urls. Maybe there should be a feature to limit the total number of returned media_urls?

@breyten (Member) commented Jun 17, 2014

  • Regarding the memory limitation in PHP, I would either raise it or use a lower result limit (return 50 objects, say)
  • As for limiting the media URLs, we might consider other options -- e.g. specifying preferred dimensions of the image (if applicable) (I dunno really ;))

@frankstrater (Contributor, Author)

This is not an issue about PHP...

@justinvw (Contributor)

I think the problem here is that we currently can't provide any guarantees about the size of a single item. For example, there may be items with many URLs, huge descriptions, and other long lists. In situations where the user's application calls the REST API directly at query time (like @frankstrater's search interface), a potentially large response of multiple megabytes is not desirable.

My suggestion is to add an optional filter to the REST API which allows the user to specify the maximum size in bytes of objects that should be included in the result set. This saves us from having to filter out large items at index time, and gives the API user control over the maximum size of the returned response. Additionally, I would like to add the option to specify which fields should be returned.
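
For illustration, such a request could look like this (max_size and fields are hypothetical parameter names here, not part of the current API):

# Hypothetical parameters: skip items larger than 100 KB and
# only return the title and media_urls fields.
$ curl -s -XPOST 'http://api.opencultuurdata.nl/v0/search' -d '{
    "query": "notulen",
    "size": 50,
    "max_size": 102400,
    "fields": ["title", "media_urls"]
}'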

Frank, would this help you?

@ajslaghu (Contributor)

  1. I'd like to think that retrieving, say, 50 items should be feasible on either channel (mobile/web). But I'm also taking possible compression into account. Is the 2 to 3.5 MB response compressed or uncompressed?
  2. Our memory limit is 128 MB. To me that seems like plenty, and enough for queries of up to 50 items.

@justinvw (Contributor)

In this case we are talking about a single item that has a size of ± 2 MB. When you happen to issue a query where multiple of these items are returned in a single response, you're easily talking about a response size of multiple megabytes.

Currently all responses are served uncompressed. I will enable GZIP compression.

@breyten (Member) commented Jun 17, 2014

The problem with specifying a maximum size for objects is that you would still like to have at least one media URL in your response...

@frankstrater (Contributor, Author)

@justinvw A way to specify which fields should be returned should definitely be added (think statistical dashboard apps). I also noticed the JSON response is nicely padded, which makes it human-readable but amounts to a lot of overhead (if I'm not mistaken).

@ajslaghu http://search.opencultuurdata.nl/ uses a size of 42, but it should work if you change that to 18. With 128 MB you exceed the memory limit somewhere between a size of 20 and 30 if you search for 'notulen', for example. As it stands, 50 is too much to handle.

@frankstrater changed the title from "Duplicate media_urls in search" to "Too much media_urls returned in search" on Jun 17, 2014
@justinvw (Contributor)

@breyten But on what basis do you want to select this single URL? Details such as size aren't always present. Also, this problem isn't specific to the media_urls field: other fields can theoretically also contain huge amounts of data.

@frankstrater by default the response content is pretty-printed, but if you send X-Requested-With: XMLHttpRequest as a request header, you will receive an unpadded response.
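
For example:

# The X-Requested-With header suppresses the pretty-printing whitespace.
$ curl -H "X-Requested-With: XMLHttpRequest" http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea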

I just enabled GZIP compression, which results in some nice reductions of the response size:

$ curl -H "Accept-Encoding: gzip" http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea > test.json.gzip
$ curl http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea > test.json
$ ls -la test.json*
-rw-r--r--  1 justin  staff  3034699 Jun 17 14:43 test.json
-rw-r--r--  1 justin  staff   494143 Jun 17 14:42 test.json.gzip

@frankstrater (Contributor, Author)

I sort of expected something along the lines of:

"filters": {
    "media_content_type": {
        "terms": ["image/jpeg", "image/png"],
        "count": 10
    }
}

to get the first 10 image media_urls, but I'm not sure whether this is feasible.

@breyten (Member) commented Jun 19, 2014

How about this for a compromise?

  • When doing a search, only the first 10 image URLs are returned (in the order Elasticsearch returns them)
  • When querying an individual object, all media URLs are returned

Other fields may be large as well, but I think that by limiting the media URLs we solve 90-95% of the issue.

@justinvw (Contributor)

@breyten, did you check how many items there are in the index that have more than 10 media_urls?

@breyten (Member) commented Jun 19, 2014

Since scripting is disabled, no ;)

breyten@ks206687:~$ curl -s -XPOST 'http://localhost:9201/ocd_combined_index/_search' -d '{
"query": {"match_all": {}},
"filter" : {
    "script" : {
        "script" : "doc[\"media_urls\"].values.length > param1",
        "params" : {
            "param1" : 10
        }
    }
},
"size": 0
}'

@melvyn-sopacua
Isn't it simpler to decouple potentially long lists into a separate endpoint? That is very natural for the specific case here, and perhaps for others as well (the classic author/books relationship).
Instead of specifying limits for entities, which the application then has to enforce, one can use paging with a sane default page size, plus references.
In search results, the media list of a result object would then contain a reference to its canonical endpoint.
This makes the API forwards compatible, as the application only has to be taught how to follow references and how to request previous or next pages.
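
As a sketch, a decoupled item could look something like this (the /media sub-endpoint, its page parameter and the field layout are hypothetical, purely to illustrate the idea):

# Hypothetical layout: the item references a paged media endpoint
# instead of embedding all 15900 media_urls inline.
$ curl http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea
{
    "title": "...",
    "media_urls": {
        "href": "http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea/media?page=1",
        "total": 15900
    }
}

The client then follows the href and requests subsequent pages as needed, instead of parsing one multi-megabyte document.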
