This repository has been archived by the owner on Jun 30, 2018. It is now read-only.

[Feature] Handling Selecting Inner Fields on Inner Objects #31

Closed
wants to merge 2 commits into from

Conversation

bcoe
Contributor

@bcoe bcoe commented Jun 9, 2011

[Feature] When selecting specific fields on an inner object, Elasticsearch returns them under keys of the form '_source.inner_object' or '_source.inner_object.inner_key'. I need this handled properly in my app: some inner objects carry large keys that I do not want to serialize, so I cherry-pick the fields I need. This patch adds handling for that syntax.

Here's an example of the underlying Elasticsearch request:

curl -X POST "http://localhost:9200/foo-attachments-index/_search?pretty=true" -d '{"size":5,"from":0,"fields":["uuid","hash","file_extension","file_size","filename","type","external_id","downloadable","date","sender","visible","thumbnail_created","shared","large_thumbnail_created","share_url","meta","_source.meta.image_width"],"sort":["_score"],"query":{"bool":{"must":[{"query_string":{"query":"*"}},{"term":{"user_uuid":"foo"}}]}}}'

@karmi
Owner

karmi commented Jun 9, 2011

I'm not sure I understand the problem, and even less the code. What is the real issue here? What is the actual response from ES?

(I suspect the issue is that ES returns pseudo-JSON like _source.inner_object and the Item instantiation fails on that, but I am not sure.)

@bcoe
Contributor Author

bcoe commented Jun 9, 2011

Query:

  curl -X POST "http://localhost:9200/5336bfb50322c2285eff3afb07576452-attachments-index/_search?pretty=true" -d '{
      "size":5,
      "from":0,
      "fields":
      [
          "uuid",
          "hash",
          "file_extension",
          "file_size",
          "filename",
          "type",
          "external_id",
          "downloadable",
          "date",
          "sender",
          "visible",
          "thumbnail_created",
          "shared",
          "large_thumbnail_created",
          "share_url",
          "meta",
          "_source.meta"
      ],
      "sort":
      [
          "_score"
      ],
      "query":
      {
          "bool":
          {
              "must":
              [
                  {
                      "query_string":
                      {
                          "query":"*"
                      }
                  },
                  {
                      "term":
                      {
                          "user_uuid":"5336bfb50322c2285eff3afb07576452"
                      }
                  }
              ]
          }
      }
  }'

Response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "hits" : {
    "total" : 83,
    "max_score" : 1.3976741,
    "hits" : [ {
      "_index" : "5336bfb50322c2285eff3afb07576452-attachments-index",
      "_type" : "attachment",
      "_id" : "f7bcf93ae94365d97f974a3b6d7a1ab6",
      "_score" : 1.3976741,
      "fields" : {
        "sender" : "bencoe@gmail.com",
        "_source.meta" : {
          "image_width" : 200,
          "image_height" : 200
        },
        "file_size" : 12090,
        "thumbnail_created" : "true",
        "hash" : "2ff70f3bc51d1af4b47439edc3f53b77",
        "file_extension" : "png",
        "filename" : "attachmentsme-200x200.png",
        "uuid" : "f7bcf93ae94365d97f974a3b6d7a1ab6",
        "type" : "image",
        "date" : 1299097315,
        "downloadable" : "true"
      }
    }

The code:

  1. Examines the final document for keys of the form '_source.foo' (these keys currently cause exceptions when they are mapped onto an object).
  2. Recursively coerces these fields into the proper form: '_source.foo' becomes the key 'foo' when mapped onto the final object, and '_source.foo.bar' becomes the key 'foo.bar'.

In the example I've posted, note the key '_source.meta' in the response; it is the same key that was passed in the initial query.
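To make the two steps above concrete, here is a minimal sketch of the prefix stripping. It is not the code from this pull request: `strip_source_prefix` is a hypothetical name, and the sample hash is a cut-down copy of the `fields` section from the response above.

```ruby
# Hypothetical helper (not part of the library): removes a leading
# "_source." from every key and leaves the rest of the key untouched,
# so "_source.foo.bar" becomes "foo.bar".
def strip_source_prefix(fields)
  fields.each_with_object({}) do |(key, value), result|
    result[key.to_s.sub(/\A_source\./, '')] = value
  end
end

# Cut-down copy of the "fields" hash from the response above.
fields = {
  'sender'       => 'bencoe@gmail.com',
  '_source.meta' => { 'image_width' => 200, 'image_height' => 200 }
}

stripped = strip_source_prefix(fields)
```

After this call, `stripped` has the keys 'sender' and 'meta', so the hash can be mapped onto an object without the dotted '_source.' key getting in the way.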

@karmi
Owner

karmi commented Jun 10, 2011

OK, thanks for the clarification. I think this is unexpected behaviour on ElasticSearch's part. The documentation at http://www.elasticsearch.org/guide/reference/api/search/fields.html states:

The fields will automatically load stored fields (store mapping set to yes),
or, if not stored, will load the _source and extract it from it (allowing to return nested document object).

And this is probably not the case. Another user, @vhyza has already hit this issue as well.

In your case, you should be able to pass fields=meta,... and the returned field should be simply meta, not _source.meta.

Now, you are working around this by setting fields to fields=_source.meta. I think ES should return a response like this:

"_source" : { "meta" : "..." }

and not like this:

"_source.meta" : "..."

That response would be properly parsed by the Item.new logic.

@kimchy, do you have any ideas or suggestions for this?

@vhyza
Collaborator

vhyza commented Jun 10, 2011

@kimchy here is a gist with an example: https://gist.github.com/1018584. I tried it on ES 0.16.1 and on master at the latest commit, elastic/elasticsearch@6382ddf43cea9a6a88a2

@bcoe
Contributor Author

bcoe commented Jun 10, 2011

I agree it seems like weird behaviour with how ES (or maybe Lucene?) is handling object fields behind the scenes. The only way I could figure out to cherry-pick the 'meta' field was using '_source.meta', which required this patch.

I'm going to dig into the ES code a little bit, maybe this will give me an excuse to submit my first patch to that project -- I'll keep you posted.

@kimchy

kimchy commented Jun 10, 2011

ES only decides automatically whether to load a field from _source for non-compound fields (like person.name). Anything that points to a compound field, like person, always has to go through _source, and it is delicate to work out whether _source really needs to be loaded, so for those you have to prefix the field with _source.
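On the client side, that rule could be applied when building the request body. The sketch below is only an illustration for the ES version discussed in this thread: `search_fields` is a hypothetical helper, and the caller is assumed to know which field names point to compound (object) fields.

```ruby
require 'json'

# Hypothetical helper: simple field names are passed through unchanged,
# names of compound (object) fields get the "_source." prefix.
def search_fields(simple, compound)
  simple + compound.map { |name| "_source.#{name}" }
end

# Field names taken from the query earlier in this thread.
body = {
  'fields' => search_fields(%w[uuid filename], %w[meta]),
  'query'  => { 'term' => { 'user_uuid' => 'foo' } }
}

json = JSON.generate(body)
```

The resulting `json` string can be sent as the body of the `_search` request shown in the curl examples above.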

@karmi
Owner

karmi commented Jun 10, 2011

@kimchy, so it's intentional that ES returns JSON like "_source.meta" : "..." and not "_source" : { "meta" : "..." }?

If so, then client libraries have to take this into account and strip the _source.* prefix.

@kimchy

kimchy commented Jun 10, 2011

Yes. It stems from the fact that you had to explicitly specify _source if you wanted the field loaded from the source and parsed. The prefix is also an indication that the value was loaded from it.

@bcoe
Contributor Author

bcoe commented Jun 10, 2011

Cool, thanks for the clarification, I'm glad it wasn't just weird behaviour that I was noticing.

@karmi
Owner

karmi commented Jun 10, 2011

OK, what a pity :) But OK, I'd probably then:

  • strip _source from any key
  • split the key by . and recursively build the Hash out of the resulting Array.
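A minimal sketch of those two steps follows. It is not the code that was eventually committed: `nest_source_fields` is a hypothetical name, the sample values are made up, and the sketch assumes that no intermediate key (such as 'track') is also returned as a plain scalar field.

```ruby
# Hypothetical helper: strips a leading "_source." from each key, splits
# the remainder on "." and builds nested Hashes from the resulting Array.
def nest_source_fields(fields)
  fields.each_with_object({}) do |(key, value), result|
    parts = key.to_s.sub(/\A_source\./, '').split('.')
    leaf  = parts.pop
    # Walk down (creating Hashes as needed) to the parent of the leaf key.
    parent = parts.inject(result) { |hash, part| hash[part] ||= {} }
    parent[leaf] = value
  end
end

# Made-up sample data for illustration.
fields = {
  'message'                     => 'This is a tweet!',
  '_source.track.info.duration' => { 'minutes' => 3 }
}

nested = nest_source_fields(fields)
```

With this input, `nested['track']['info']['duration']['minutes']` is reachable through ordinary nested Hash access, while 'message' stays a top-level key.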

@bcoe
Contributor Author

bcoe commented Jun 10, 2011

Yeah, that's what my patch is attempting to do ;)

With unit tests even.

@karmi
Owner

karmi commented Jun 10, 2011

@bcoe, yes, I'll have a look if it could be simplified somehow...

karmi added a commit that referenced this pull request Jul 12, 2011
…rom ES [#31]

The underlying issue is this, reported by @vhyza: <#31 (comment)>

For a query with limited fields, such as:

<http://localhost:9200/fields_test/_search?q=shay&fields=message,_source.person>

ES returns JSON like this:

    fields: {
      message: "This is a tweet!"
      _source.person: {
        sid: "12345"
        name: {
          last_name: "Banon"
          first_name: "Shay"
        }
      }
    }

Notice the `_source` prefix for the `person`.
It gets even more tricky with nested hashes such as `_source.track.info.duration`.
This should be converted into a regular Hash and accessible as `track.info.duration.minutes` property of the result.

See the test suite for example data. This commit closes issue #31.
@karmi
Owner

karmi commented Jul 12, 2011

I have cheated and used eval to do the trick for building the nested hashes. Seems more intelligible to the casual source code reader and more expressive with regard to the code intent; see the Collection#__parse_fields method.

@karmi karmi closed this Jul 12, 2011