feat(mapping): use "index": "not_analyzed" for literal fields
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much clearer. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` has the
following effects, all of which we'd like for these literal fields:

- Analysis is skipped altogether and the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.
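
The three effects listed above can be summarized in a small helper. This is only a minimal sketch and not code from this repository: `effectiveStringSettings` is a hypothetical name, it encodes just the Elasticsearch 2.4 defaults described in the linked documentation, and it assumes the mapping does not set `norms` explicitly or use `"index": "no"`.

```javascript
// Hypothetical helper (not part of pelias/schema): derives the effective
// behaviour of an Elasticsearch 2.4 `string` field from its mapping.
// Simplification: explicit `norms` settings and `"index": "no"` are ignored.
function effectiveStringSettings(mapping) {
  const notAnalyzed = mapping.index === 'not_analyzed';
  return {
    // not_analyzed fields skip the analysis chain entirely
    analyzed: !notAnalyzed,
    // norms default to enabled only for analyzed fields
    norms: !notAnalyzed,
    // doc_values default to on for not_analyzed strings and are not
    // available for analyzed strings; an explicit `false` turns them off
    docValues: notAnalyzed && mapping.doc_values !== false
  };
}

// The three mapping shapes discussed in this commit
const oldLiteral = { type: 'string', analyzer: 'keyword' };
const newLiteral = { type: 'string', index: 'not_analyzed', doc_values: false };
const newLiteralWithDocValues = { type: 'string', index: 'not_analyzed' };

console.log(effectiveStringSettings(oldLiteral));
console.log(effectiveStringSettings(newLiteral));
console.log(effectiveStringSettings(newLiteralWithDocValues));
```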

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dashboard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While this is a minor need, we needed a solution to it, and the only other one is
fielddata, which is extremely expensive in terms of memory usage.
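
As an illustration of the kind of aggregation that doc_values make cheap, the per-source and per-layer counts could be requested with a `terms` aggregation. The sketch below is an assumption about how a dashboard or script might do this, not code from this PR: the helper names `buildCountsQuery` and `extractCounts` and the `_counts` aggregation names are hypothetical, and sending the body to Elasticsearch is left out.

```javascript
// Hypothetical helper: builds a search body that counts documents per
// distinct value of each given field using a `terms` aggregation.
function buildCountsQuery(fields, maxBuckets = 100) {
  const aggs = {};
  for (const field of fields) {
    aggs[field + '_counts'] = { terms: { field: field, size: maxBuckets } };
  }
  // size: 0 skips returning hits; only the aggregation buckets are needed
  return { size: 0, aggs: aggs };
}

// Hypothetical helper: flattens the `aggregations` part of a search
// response into { aggregationName: { value: doc_count } } maps.
function extractCounts(aggregations) {
  const result = {};
  for (const name of Object.keys(aggregations)) {
    result[name] = {};
    for (const bucket of aggregations[name].buckets) {
      result[name][bucket.key] = bucket.doc_count;
    }
  }
  return result;
}

// Request body for counting documents by `source` and by `layer`
console.log(JSON.stringify(buildCountsQuery(['source', 'layer']), null, 2));
```

Only `source` and `layer` keep doc_values in this change, so these are the only literal fields such a query could aggregate on without falling back to fielddata.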

This PR disables doc_values for all fields except `source` and `layer`,
which gives us about a 4% disk space savings. Merely changing the literal
fields to use `not_analyzed` _increases_ disk usage by around 3%, so
this is roughly a 7% win!
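
The arithmetic behind that figure, as a rough sketch: it assumes both percentages are measured against the same baseline (the index size before this change), which the text implies but does not state, and it uses the approximate numbers quoted above rather than any new measurements.

```javascript
// Index sizes relative to a baseline of 100 (the index before this PR),
// using the approximate percentages quoted in the commit message.
const baseline = 100;
const notAnalyzedOnly = baseline + 3;    // not_analyzed with doc_values left on
const withDocValuesOff = baseline - 4;   // this PR: doc_values off except source/layer

// Difference between the two variants, in percentage points of the baseline
const winInPoints = notAnalyzedOnly - withDocValuesOff;

// The same difference expressed relative to the larger index
const winRelative = (winInPoints / notAnalyzedOnly) * 100;

console.log(winInPoints);            // 7
console.log(winRelative.toFixed(1)); // "6.8"
```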

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.
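
To show why this brings the mappings closer to Elasticsearch 5+, here is a minimal sketch of how the new partials would translate: a `not_analyzed` `string` corresponds to `keyword`, and an analyzed `string` corresponds to `text`. The function `toEs5Mapping` is hypothetical and not part of this PR, and it deliberately rejects `"index": "no"` and any other `index` value rather than guessing at their handling.

```javascript
// Hypothetical migration helper (not part of this PR): converts an
// Elasticsearch 2.x `string` mapping into the 5+ `text`/`keyword` form.
function toEs5Mapping(mapping) {
  if (mapping.type !== 'string') {
    return Object.assign({}, mapping); // other datatypes are left untouched
  }
  const index = mapping.index === undefined ? 'analyzed' : mapping.index;
  if (index !== 'analyzed' && index !== 'not_analyzed') {
    throw new Error('unsupported "index" value in this sketch: ' + index);
  }
  const converted = Object.assign({}, mapping);
  delete converted.index; // the analyzed/not_analyzed choice moves into the type
  converted.type = index === 'not_analyzed' ? 'keyword' : 'text';
  return converted;
}

// The two literal partials from this commit
console.log(toEs5Mapping({ type: 'string', index: 'not_analyzed', doc_values: false }));
console.log(toEs5Mapping({ type: 'string', index: 'not_analyzed' }));
```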

Fixes #99
orangejulius committed Nov 2, 2018
1 parent 162af3a commit e81f17a
Showing 6 changed files with 493 additions and 262 deletions.
5 changes: 3 additions & 2 deletions mappings/document.js
@@ -3,14 +3,15 @@ const postalcode = require('./partial/postalcode');
 const hash = require('./partial/hash');
 const multiplier = require('./partial/multiplier');
 const literal = require('./partial/literal');
+const literal_with_doc_values = require('./partial/literal_with_doc_values');
 const config = require('pelias-config').generate();

 var schema = {
 properties: {

 // data partitioning
-source: literal,
-layer: literal,
+source: literal_with_doc_values,
+layer: literal_with_doc_values,
 alpha3: admin,

 // place name (ngram analysis)
3 changes: 2 additions & 1 deletion mappings/partial/literal.json
@@ -1,4 +1,5 @@
 {
 "type": "string",
-"analyzer": "keyword"
+"index": "not_analyzed",
+"doc_values": false
 }
4 changes: 4 additions & 0 deletions mappings/partial/literal_with_doc_values.json
@@ -0,0 +1,4 @@
+{
+"type": "string",
+"index": "not_analyzed"
+}
4 changes: 2 additions & 2 deletions test/document.js
@@ -117,7 +117,7 @@ module.exports.tests.parent_analysis = function(test, common) {
 t.equal(prop[field+'_a'].type, 'string');
 t.equal(prop[field+'_a'].analyzer, 'peliasAdmin');
 t.equal(prop[field+'_id'].type, 'string');
-t.equal(prop[field+'_id'].analyzer, 'keyword');
+t.equal(prop[field+'_id'].index, 'not_analyzed');

 t.end();
 });
@@ -129,7 +129,7 @@ module.exports.tests.parent_analysis = function(test, common) {
 t.equal(prop['postalcode'+'_a'].type, 'string');
 t.equal(prop['postalcode'+'_a'].analyzer, 'peliasZip');
 t.equal(prop['postalcode'+'_id'].type, 'string');
-t.equal(prop['postalcode'+'_id'].analyzer, 'keyword');
+t.equal(prop['postalcode'+'_id'].index, 'not_analyzed');

 t.end();
 });
