Strip empty arrays and maps from the es document before sending it downstream #37

Merged
merged 3 commits into from Jul 12, 2016

Conversation

Projects
None yet
3 participants
@missinglink
Member

missinglink commented Jul 12, 2016

strip empty arrays and maps from the es document before sending it downstream

prior to this PR when we send quite large documents to ES, a big chunk of the request body can be empty arrays for the parent.* fields or an empty {} for address_parts.

the downside of having these empty fields is 1. network overhead, 2. they actually get stored in the index despite contain no true 'values' and get returned to API from the cluster.

this has actually been bugging me for some time but I didn't know how best to deal with it, in the end I used JSON.parse( JSON.stringify( _, replacer ) ) to do the job, its a little more CPU but seemed better than the other approaches I considered (key checks on each insert).

I'm hoping we get a little performance bump out of ES due to the reduced size of documents (by up to 80%) being sent around the cluster and when responding to requests from the REST API.

Also! this will make the sense output much easier to read :)

the only worry here is that somewhere in the API codebase we are not checking the existence of these fields (eg. we can now return a doc with no doc.category = [] array when that array is empty); so we might need to add some key checks there.

an example of the fields produced before and after the PR (for a bare-minimal document):

{
  _id: "myid",
  _index: "pelias",
  _type: "mylayer",
  data: {
    address_parts: {},
    category: [],
    center_point: {},
    layer: "mylayer",
    name: {},
    parent: {
      borough: [],
      borough_a: [],
      borough_id: [],
      country: [],
      country_a: [],
      country_id: [],
      county: [],
      county_a: [],
      county_id: [],
      localadmin: [],
      localadmin_a: [],
      localadmin_id: [],
      locality: [],
      locality_a: [],
      locality_id: [],
      macrocounty: [],
      macrocounty_a: [],
      macrocounty_id: [],
      macroregion: [],
      macroregion_a: [],
      macroregion_id: [],
      neighbourhood: [],
      neighbourhood_a: [],
      neighbourhood_id: [],
      region: [],
      region_a: [],
      region_id: []
    },
    phrase: {},
    source: "mysource",
    source_id: "myid"
  }
}
{
  _id: "myid",
  _index: "pelias",
  _type: "mylayer",
  data: {
    layer: "mylayer",
    parent: {},
    source: "mysource",
    source_id: "myid"
  }
}

note: the parent: {} remains due to the nature of how the replacer function works; when that key is checked it still contains a bunch of arrays, so it's not considered empty. it looks like getting rid of that property would require some nasty code and more CPU so I'm happy to live with it as-is.

@missinglink missinglink self-assigned this Jul 12, 2016

missinglink added some commits Jul 12, 2016

@orangejulius

This comment has been minimized.

Show comment
Hide comment
@orangejulius

orangejulius Jul 12, 2016

Member

LGTM! This has been bugging me too. Some profiling from a while back suggests that JSON parsing/serializing is one of the biggest costs in our importers, and it can't really be sped up, only avoided, so this should really help.

If it turns out the CPU cost of doing the replacing is too high, I had a branch at one point that made sure they were never created. It's possible to make work.

Member

orangejulius commented Jul 12, 2016

LGTM! This has been bugging me too. Some profiling from a while back suggests that JSON parsing/serializing is one of the biggest costs in our importers, and it can't really be sped up, only avoided, so this should really help.

If it turns out the CPU cost of doing the replacing is too high, I had a branch at one point that made sure they were never created. It's possible to make work.

@trescube

This comment has been minimized.

Show comment
Hide comment
@trescube

trescube Jul 12, 2016

Contributor

:shipit:

Contributor

trescube commented Jul 12, 2016

:shipit:

@orangejulius

This comment has been minimized.

Show comment
Hide comment
@orangejulius

orangejulius Jul 12, 2016

Member

Oh yes, let's also make sure this is well tested (i suppose it will be with any build) on dev/dev2 and prod_build

Member

orangejulius commented Jul 12, 2016

Oh yes, let's also make sure this is well tested (i suppose it will be with any build) on dev/dev2 and prod_build

@missinglink

This comment has been minimized.

Show comment
Hide comment
@missinglink

missinglink Jul 12, 2016

Member

I tested this on a local build, imported OSM and ran the API via compare app, no errors (after my sneaky second commit) :)

Member

missinglink commented Jul 12, 2016

I tested this on a local build, imported OSM and ran the API via compare app, no errors (after my sneaky second commit) :)

@missinglink missinglink merged commit c05758b into master Jul 12, 2016

3 checks passed

approvals/lgtm this commit looks good
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@missinglink missinglink removed the in review label Jul 12, 2016

@orangejulius orangejulius referenced this pull request in pelias/api Jul 25, 2016

Merged

Handle empty parent properties #599

@orangejulius orangejulius deleted the strip_empty_arrays_objects branch Jul 25, 2016

@missinglink missinglink referenced this pull request in pelias/api Sep 6, 2016

Merged

Refactor deduper and write additional tests #649

je-l pushed a commit to nlsfi/pelias-model that referenced this pull request Aug 31, 2017

Merge pull request #37 from pelias/strip_empty_arrays_objects
Strip empty arrays and maps from the es document before sending it downstream
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment