New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to Elasticsearch 5 #461

Open
dianashk opened this Issue Nov 16, 2016 · 20 comments

Comments

Projects
None yet
10 participants
@dianashk
Contributor

dianashk commented Nov 16, 2016

ES 5 has been released. 🎊

Let's see what migration might look like and what the benefits/drawbacks might be.

Looking for input from the community here. ✨

@jacobbaskin

This comment has been minimized.

jacobbaskin commented Nov 16, 2016

I'm going to be hacking stuff in to try and do a performance comparison. But here's what I've found so far:

  • Everything would need to depend on the elasticsearch 12.0.1 client library (this should be a noop -- afaict most of the master branches already do but they'd need to get released)
  • Schema would have to change slightly. 5.0 doesn't support the "index_concurrency" parameter and geo indexing changed so a lot of the options on "geo_point" need to be removed
  • Some deprecated query clauses need to be rewritten

Maybe that's it? I'll let you know if I find anything else.

Given this, I think there could be a flag in pelias.json that turns on "ES 5 mode" for querying/indexing. Would be relatively painless; I'd be happy to take a crack at it.

@missinglink

This comment has been minimized.

Member

missinglink commented Nov 17, 2016

happy to assist, I'm not the best at checking my github notifications but you have our email if you need anything :)

each repo contains at least one testing framework; some include two or three. The pelias/schema repo includes integration tests along with unit tests, so you should be able to confirm that the refactor works as expected by simply running those tests and ensuring they pass.

having a feature flag for 'ES 5 mode' has the potential to get pretty messy; particularly if the pelias/query code also needs to be varied accordingly, so let's try and keep that as tidy as we can, an ideal scenario would be using APIs that are backwards compatible with ES2, although that's probably wishful thinking ;)

@jacobbaskin

This comment has been minimized.

jacobbaskin commented Nov 17, 2016

@missinglink the query code should be 100% backward-compatible as far as I can tell. The schema code could be but I'd worry about hurting performance in ES 2. Is there an easy way to remove some fields and check indexing + querying performance? Will the integration tests work for that?

@missinglink

This comment has been minimized.

Member

missinglink commented Nov 23, 2016

hey @jacobbaskin, you can remove fields from the schema but it might cause code depending on those fields to break.

what exactly are you proposing to change?

@missinglink

This comment has been minimized.

Member

missinglink commented Dec 2, 2016

@cag

This comment has been minimized.

cag commented Dec 2, 2016

Okay, in order to do this, I have 3 pull requests that must be simultaneously resolved, and afterwards, version numbers for these repos must be propagated throughout all the repos.

See:
pelias/dbclient#35
pelias/config#45
pelias/schema#195

It would probably be best to resolve them in that order.

@orangejulius

This comment has been minimized.

Member

orangejulius commented Feb 23, 2017

Hi everyone,

It looks like there are only a few things to do to support ES5:

orangejulius added a commit to pelias/api that referenced this issue Apr 13, 2017

Use bool instead of filtered queries in fixtures
This is needed for Elasticsearch 5 support as part of pelias/pelias#461

orangejulius added a commit to pelias/query that referenced this issue Jul 5, 2017

Fix: explicitly set minimum_should_match to 1
A while back #61 changed all our
queries to use Elasticsearch 5 compatible [bool queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html)
instead of the `filtered` queries that were deprecated in Elasticsearch
2, and no longer supported in Elasticsearch 5.

However, while the queries were supposed to behave identically, it
appears that a `filtered` query effectively had a `minimum_should_match`
setting of 1 implied, while the new queries have a
`minimum_should_match` of 0 implied.

This was causing the new queries to return a bunch of irrelevant
results, since any results that match the parent hierarchy filters would
be returned, even if it did not match any of the should conditions.

Our query logic depended on `minimum_should_match` being set to 1, so it's
now explicitly set.

Connects pelias/pelias#461
Connects pelias/api#762
Connects pelias/api#855

@wafflebot wafflebot bot added in progress and removed ideas labels Jul 5, 2017

@orangejulius

This comment has been minimized.

Member

orangejulius commented Jul 5, 2017

Hi everyone,
It's been a while since we've worked on anything here. When looking at another task @trescube and I figured out what was blocking our PR to use Elasticsearch 5 compatible queries, and with pelias/query#65 all our acceptance-tests are now passing, so we should soon be able to have full functionality with ES5!

orangejulius added a commit to pelias/api that referenced this issue Jul 6, 2017

Update fixtures for Elasticsearch 5 queries
These queries use [bool](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html)
queries instead of the deprecated `filtered` queries, and also set
`minimum_should_match` to 1, which was apparently implied previously.

Connects #762
Connects #844
Connects #855
Connects pelias/pelias#461

orangejulius added a commit to pelias/api that referenced this issue Jul 6, 2017

Update fixtures for Elasticsearch 5 queries
These queries use [bool](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html)
queries instead of the deprecated `filtered` queries, and also set
`minimum_should_match` to 1, which was apparently implied previously.

Connects #762
Connects #844
Connects #855
Connects pelias/pelias#461

@dianashk dianashk added processed and removed in progress labels Jul 27, 2017

@orangejulius orangejulius removed their assignment Jul 27, 2017

@orangejulius orangejulius referenced this issue Dec 8, 2017

Closed

ES 5+ #698

@orangejulius orangejulius changed the title from Upgrade to Elastcisearch 5 to Upgrade to Elasticsearch 5 Feb 6, 2018

@orangejulius

This comment has been minimized.

Member

orangejulius commented Mar 24, 2018

Hello again everyone,
Time for another ES5 update. This is something we want to take care of soon. We need to add support for ES5 with backwards compatibility for ES2, at least for long enough to transition over.

I recently rebased @cag's old es5-compat branch in pelias/schema, and it still works.

@orangejulius orangejulius referenced this issue Mar 24, 2018

Open

Elasticsearch 6 Support #719

1 of 7 tasks complete
@mdorda

This comment has been minimized.

mdorda commented Jul 17, 2018

Hello. Thank you for all the great job. It looks es5-compat branch has not been updated for a while. Will you plan any update soon?

@LEDfan

This comment has been minimized.

LEDfan commented Aug 17, 2018

We are looking into using Pelias in our production environment, unfortunately the lack of ES5/6 support is holding us back. Is there any update on this?

From the TODO list above it seems that it can already be tested, if this is the case I can definitely test this is on our setup and report back if it works.

Thanks!

@andrewkcarter

This comment has been minimized.

andrewkcarter commented Aug 31, 2018

We are also considering using Pelias, but the lack of ES 5 or 6 is a deal breaker for us as well.

@ashleyholt

This comment has been minimized.

ashleyholt commented Sep 14, 2018

Re: the to-do list above: for testing the upgrade with a full planet build, are the current acceptance tests sufficient for checking that the upgrade doesn't introduce changes to the query results, or should we include a subtask for updating the tests as well?

@orangejulius

This comment has been minimized.

Member

orangejulius commented Sep 14, 2018

@ashleyholt the acceptance-tests are perfect for testing any full planet build. we'll definitely use them to verify any Elasticsearch upgrades.

@bboure

This comment has been minimized.

bboure commented Sep 22, 2018

Hi all,
Any news on this? I just discovered Pelias and it looks just great and exactly what I need. I am considering using it for a new project. However if I could use ES 5 or 6 it would be great.
I am not familiar yet neither with Pelias nor with Elastic search (still learning) so I cannot really offer my help here but if I can help in any way (even if just by testing), just let me know.

@missinglink

This comment has been minimized.

Member

missinglink commented Sep 24, 2018

This is happening imminently (firstly an ES5 upgrade and then later an ES6 upgrade).
Hard to give an estimate but I would hope we could have something working in ES5 within the next 30-60 days.

@bboure is there a deal-breaker reason why you can't run the existing ES@2.4 setup, possibly something ops related?
If you haven't seen https://github.com/pelias/docker then check that out, it makes getting everything setup much easier :)

@bboure

This comment has been minimized.

bboure commented Sep 24, 2018

@missinglink Thank you for your feedback.
No deal breaker. I am still at the very early stage of my project and I am still playing with Pelias to try to see if it would fits my requirements; and everything indicates that it does! πŸ™‚
Right now I am using ES 2.3 (because AWS does not offer 2.4) but of course I would prefer to use a much more recent release whenever possible.
I can wait 30-60 more days.

@orangejulius

This comment has been minimized.

Member

orangejulius commented Sep 25, 2018

Hi everyone,
Just to give a bit more context on this, we are beginning work on the upgrade process in earnest.

We plan to begin some initial work to allow development with Elasticsearch 5, and then determine how feasible it is to support both ES 2 and ES 5 for a little while to make transitioning easier. Worst case, we will publish a "final" Pelias release that supports ES2 and require ES5 for all subsequent versions.

Milestones for initial development work:

After that we'll iron out any issues bringing ES5 support to Pelias mainline. This will involve a progression of testing on smaller to larger builds. Ultimately, we will require a full planet Pelias build run under known conditions to show good results before we consider Elasticsearch 5 production ready.

We'll likely test on a progression of builds something like this:

While continent sized and full planet builds require serious resources to run, all the prior builds can be run on most personal laptops with around 8GB or more of RAM and a good amount of free disk space. We'll post many updates here, but definitely when we are ready to test at those stages, as we'd love feedback from as wide an audience as possible.

At this time, it appears the easiest way to move forward is to focus on Elasticsearch 5 support only for now. Once we've settled in, it appears supporting E5 and ES6 is much easier than ES2, ES5, and ES6.

@missinglink

This comment has been minimized.

Member

missinglink commented Sep 28, 2018

FYI, the schema changes are ready for testing/review in pelias/schema#323

@missinglink

This comment has been minimized.

Member

missinglink commented Oct 25, 2018

Today I merged pelias/docker#25 which means we support 5.6, please see the notes in pelias/schema#323 regarding configuration for non-docker setups.

orangejulius added a commit to pelias/schema that referenced this issue Oct 25, 2018

feat(mapping): use "index": "not_analyzed" for literal fields
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much more clear. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped all together, the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata which is extremely expensive in terms of memory usage.

In my testing, for the Portland metro Docker project, disk usage went
from 451MB to 473MB, or about a 5% increase.

If we wanted to trim that down a bit, we could consider disabling
`doc_values` for the `parent.*_id` fields. We don't have an immediate
need for `doc_values` on those fields, although it might be interesting
for analysis.

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes #99

JWileczek added a commit to JWileczek/schema that referenced this issue Oct 26, 2018

feat(mapping): use "index": "not_analyzed" for literal fields
As guessed in pelias#99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much more clear. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped all together, the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](pelias#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata which is extremely expensive in terms of memory usage.

In my testing, for the Portland metro Docker project, disk usage went
from 451MB to 473MB, or about a 5% increase.

If we wanted to trim that down a bit, we could consider disabling
`doc_values` for the `parent.*_id` fields. We don't have an immediate
need for `doc_values` on those fields, although it might be interesting
for analysis.

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes pelias#99

orangejulius added a commit to pelias/schema that referenced this issue Nov 2, 2018

feat(mapping): use "index": "not_analyzed" for literal fields
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much more clear. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped all together, the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata which is extremely expensive in terms of memory usage.

This PR disables doc_values for all fields except `source` and `layer`,
which gives us about a 4% disk space savings. Merely changing the literal
field to use `not_analyzed` _increases_ disk space goes up around 3%, so
this is roughly a 7% win!

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes #99

orangejulius added a commit to pelias/schema that referenced this issue Nov 2, 2018

feat(mapping): use "index": "not_analyzed" for literal fields
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much more clear. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped all together, the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata which is extremely expensive in terms of memory usage.

This PR disables doc_values for all fields except `source` and `layer`,
which gives us about a 4% disk space savings. Merely changing the literal
field to use `not_analyzed` _increases_ disk space goes up around 3%, so
this is roughly a 7% win!

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes #99

orangejulius added a commit to pelias/schema that referenced this issue Nov 3, 2018

feat(mapping): use "index": "not_analyzed" for literal fields
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much more clear. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped all together, the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata which is extremely expensive in terms of memory usage.

This PR disables doc_values for all fields except `source` and `layer`,
which gives us about a 4% disk space savings. Merely changing the literal
field to use `not_analyzed` _increases_ disk space goes up around 3%, so
this is roughly a 7% win!

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes #99

orangejulius added a commit to pelias/schema that referenced this issue Nov 3, 2018

feat(mapping): use "index": "not_analyzed" for literal fields
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much more clear. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped all together, the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata which is extremely expensive in terms of memory usage.

This PR disables doc_values for all fields except `source` and `layer`,
which gives us about a 4% disk space savings. Merely changing the literal
field to use `not_analyzed` _increases_ disk space goes up around 3%, so
this is roughly a 7% win!

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes #99
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment