
Too much media_urls returned in search #45

Closed
frankstrater opened this issue Jun 17, 2014 · 17 comments

@frankstrater (Contributor)

Some search queries return a lot of duplicate media_urls, causing PHP's allowed memory size to be exceeded when parsing the JSON response. This might be a caching problem in the ocd_backend. Test scripts to reproduce the bug:

http://strateradvies.nl/ocdsearch/test.php
http://strateradvies.nl/ocdsearch/src_test.php
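
A minimal command-line sketch of the problem (this assumes the v0 search endpoint accepts a JSON body with a query string and a size; the query term is just an example):

# Fetch 100 results for a media-heavy query and count the uncompressed
# bytes; a response like this can run to many megabytes.
$ curl -s -XPOST 'http://api.opencultuurdata.nl/v0/search' -d '{
    "query": "notulen",
    "size": 100
}' | wc -c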

@justinvw added the bug label on Jun 17, 2014
@justinvw (Contributor)

With "duplicate media_urls", do you mean that there are items that contain identical media_urls within the same item? Can you provide the id's of some items where this problem occurs?

@frankstrater (Contributor, Author)

It seems I was mistaken. I didn't expect that a result object could have multiple different images (not duplicates), each with its own set of resolutions. For example:

http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea

Is this correct? Because that is a lot of data for one object, and with the default search size of 100 objects you hit the memory limit pretty quickly.

@justinvw (Contributor)

You can look up the item in its original form (as returned by the source) by requesting http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea/source to get a better idea of what's going on. The transformation of the item seems to have gone fine, since the source also shows that there are lots of images associated with the item.

I suspect that there is (or was) something wrong at Nationaal Archief Leiden, since the item no longer exists there (http://hdl.handle.net/10648/37b5754f-5494-4337-a317-3ec5b5ef12cf returns a 404).

Regarding your memory limit problem: to me the response doesn't seem that big (± 2 MB). Maybe you should bump up PHP's memory limit a bit 😉.

@frankstrater (Contributor, Author)

If you think that a JSON response of ± 3 MB for a single object is acceptable, then I wish all users of the RESTful API (not just PHP, think jQuery) happy coding and you can close this issue.

EDIT: I can't imagine any use for an object with 15900 media_urls. Maybe there should be a feature to limit the total number of returned media_urls?

@breyten (Member) commented Jun 17, 2014

  • Regarding the memory limitation in PHP, I would either raise it or use a lower result limit (return 50 objects, say)
  • As for limiting the media URLs, we might consider other options -- e.g. specifying preferred dimensions of the image (if applicable) (I dunno really ;))

@frankstrater (Contributor, Author)

This is not an issue about PHP...

@justinvw (Contributor)

I think the problem here is that we currently can't provide any guarantees about the size of a single item. For example, there may be items with many URLs, huge descriptions, and other long lists. In situations where the user's application calls the REST API directly at query time (like @frankstrater's search interface), a potentially large response of multiple megabytes is not desirable.

My suggestion is to add an optional filter to the REST API which allows the user to specify the maximum size in bytes of objects that should be included in the result set. This saves us from having to filter out large items at index time, and gives the API user control over the maximum size of the returned response. Additionally, I would like to add the option to specify which fields should be returned.
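
For illustration, such a request could look like this (max_size and fields are hypothetical parameter names here, not part of the current API):

# Hypothetical parameters: skip items larger than 100 KB and
# only return the title and media_urls fields.
$ curl -s -XPOST 'http://api.opencultuurdata.nl/v0/search' -d '{
    "query": "notulen",
    "size": 50,
    "max_size": 102400,
    "fields": ["title", "media_urls"]
}'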

Frank, would this help you?

@ajslaghu (Contributor)

  1. I'd like to think that retrieving, say, 50 items should be feasible on either channel (mobile/web). But I'm also taking possible compression into account. Is the 2 to 3.5 MB response compressed or uncompressed?
  2. Our memory limit is 128 MB. To me that seems like plenty, and enough for queries of up to 50 items.

@justinvw (Contributor)

In this case we are talking about a single item that has a size of ± 2 MB. When you happen to issue a query where multiple of these items are returned in a single response, you're easily talking about a response size of multiple megabytes.

Currently all responses are served uncompressed. I will enable GZIP compression.

@breyten (Member) commented Jun 17, 2014

The problem with specifying a maximum size for objects is that you would still like to have at least one media URL in your response...

@frankstrater (Contributor, Author)

@justinvw A way to specify which fields should be returned should definitely be added (think statistical dashboard apps). I also noticed the JSON response is nicely padded, which makes it human-readable but amounts to a lot of overhead (if I'm not mistaken).

@ajslaghu http://search.opencultuurdata.nl/ uses a size of 42, but it should work if you change that to 18. With 128 MB you exceed the memory limit somewhere between a size of 20 and 30 if you search for 'notulen', for example. As it stands, 50 is too much to handle.

@frankstrater changed the title from "Duplicate media_urls in search" to "Too much media_urls returned in search" on Jun 17, 2014
@justinvw (Contributor)

@breyten But on what basis do you want to select this single URL? Details such as size aren't always present. Also, this problem isn't specific to the media_urls field: other fields can theoretically also contain huge amounts of data.

@frankstrater by default the response content is pretty-printed, but if you send X-Requested-With: XMLHttpRequest as a request header, you will receive an unpadded response.
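
For example:

# The X-Requested-With header suppresses the pretty-printing whitespace.
$ curl -H "X-Requested-With: XMLHttpRequest" http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea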

I just enabled GZIP compression, which results in some nice reductions of the response size:

$ curl -H "Accept-Encoding: gzip" http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea > test.json.gzip
$ curl http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea > test.json
$ ls -la test.json*
-rw-r--r--  1 justin  staff  3034699 Jun 17 14:43 test.json
-rw-r--r--  1 justin  staff   494143 Jun 17 14:42 test.json.gzip

@frankstrater (Contributor, Author)

I sort of expected something along the lines of:

"filters": {
    "media_content_type": {
        "terms": ["image/jpeg", "image/png"],
        "count": 10
    }
}

to get the first 10 image media_urls, but I'm not sure whether this is feasible.

@breyten (Member) commented Jun 19, 2014

How about this for a compromise?

  • When doing a search, only the first 10 image URLs are returned (in the order Elasticsearch returns them)
  • When querying an individual object, all media URLs are returned

Other fields may be large as well, but I think that by limiting the media URLs we solve 90-95% of the issue.

@justinvw (Contributor)

@breyten, did you check how many items there are in the index that have more than 10 media_urls?

@breyten (Member) commented Jun 19, 2014

Since scripting is disabled, no ;)

breyten@ks206687:~$ curl -s -XPOST 'http://localhost:9201/ocd_combined_index/_search' -d '{
"query": {"match_all": {}},
"filter" : {
    "script" : {
        "script" : "doc[\"media_urls\"].values.length > param1",
        "params" : {
            "param1" : 10
        }
    }
},
"size": 0
}'

@melvyn-sopacua
Isn't it simpler to decouple potentially long lists into a separate endpoint? That is very natural for the specific case here, and perhaps for others as well (the classic author/books relationship).
Instead of specifying limits for entities, which the application then has to enforce, one can use paging with a sane default page size, plus references.
In search results, the media list of a result object would then contain a reference to its canonical endpoint.
This makes the API forwards compatible, as the application only has to be taught how to follow references and how to request previous or next pages.
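
As a sketch, a decoupled item could look something like this (the /media sub-endpoint, its page parameter and the field layout are hypothetical, purely to illustrate the idea):

# Hypothetical layout: the item references a paged media endpoint
# instead of embedding all 15900 media_urls inline.
$ curl http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea
{
    "title": "...",
    "media_urls": {
        "href": "http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea/media?page=1",
        "total": 15900
    }
}

The client then follows the href and requests subsequent pages as needed, instead of parsing one multi-megabyte document.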
