added force_document_id option to ES output enable resend data avoiding duplicated ES documents, fix #7891 #8019

toni-moreno · 2020-08-21T16:26:08Z

This PR solves #7891
The document ID is computed as sha256(concat(timestamp, measurement, series-hash)), which makes it possible to resend or update data without creating duplicate ES documents.
Tested with several loads of the same data; no duplicate documents were generated.
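
The ID scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `pointID`, the use of a nanosecond timestamp, and the `uint64` type for the series hash are all assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"time"
)

// pointID is a hypothetical helper: it concatenates the timestamp (assumed to
// be in nanoseconds), the measurement name and the series hash (assumed to be
// a uint64), then returns the hex-encoded SHA-256 digest of that byte string.
func pointID(tsNanos int64, measurement string, seriesHash uint64) string {
	var buf []byte
	buf = strconv.AppendInt(buf, tsNanos, 10)
	buf = append(buf, measurement...)
	buf = strconv.AppendUint(buf, seriesHash, 10)
	sum := sha256.Sum256(buf)
	return hex.EncodeToString(sum[:])
}

func main() {
	ts := time.Date(2020, 8, 21, 16, 26, 8, 0, time.UTC).UnixNano()

	first := pointID(ts, "cpu", 42)
	resend := pointID(ts, "cpu", 42) // same point sent a second time

	// Identical inputs always produce the same ID, so a resend targets the
	// same document instead of creating a new one.
	fmt.Println(first == resend)
}
```

Because the ID depends only on the point's own timestamp, measurement and series, a retry of the same batch maps every point back to the document it wrote the first time.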

Hope this PR could help everybody.

ssoroka

I think I'd rather name force_document_id as avoid_duplicates or something, but is there any reason not to turn this on for everyone, and remove the config option? Under some circumstances, Telegraf assumes it can resend the same metrics with no ill effects downstream. I can't imagine a case where it'd be desirable to see all the duplicates.

toni-moreno · 2020-08-24T06:54:04Z

Hello @ssoroka, thank you for the fast review.

About the chosen name "force_document_id": I chose it because "document_id" is a familiar term for people used to working with ES and Logstash, and it could help them identify the same property in both tools (Telegraf / Logstash).

About turning the property on by default: I set it to false by default to keep this change backwards compatible. I can't speak for everyone using Telegraf to send data to Elasticsearch, and setting the property to true by default would be a breaking change in some rare cases, so I'd prefer to maintain backwards compatibility for now and perhaps change it in the future.

Anyway, I'm open to consensus and will change both things if other people give us their opinion.

Thank you very much.

ssoroka · 2020-08-24T16:15:09Z

@lpic10 do you want to weigh in here before I merge?

lpic10 · 2020-08-24T17:27:18Z

Concerning the default option, I don't know if there is a valid scenario in which Telegraf users would want the same data stored twice (maybe when there are genuinely duplicated log lines or metric points?).

I understand that InfluxDB does that deduplication by timestamp and tags/fields automatically, so maybe duplicates are not really something most people expect. It could make sense to have this enabled by default on the ES output even if there is a potential performance impact on the Telegraf side.

About the config name, there seems to be no consistency among the other tools sending data to ES (e.g. fluentd calls it "hash_id" and beats/logstash call it "fingerprint"). But in all cases this option is configurable.

For me both force_document_id and avoid_duplicates make sense but I'm bad at naming things.

ssoroka · 2020-08-26T18:17:10Z

Ok. I think I like the idea of changing this to avoid duplicates by default, since this is more likely the expected behavior. In that case, I'd make a new config option called allow_duplicates or disable_document_id and default it to false. We should really strive to do the right thing most of the time, even if that's a breaking change. In this case, this change is more of a fix than a break, anyway, as it probably should have been the default behavior in the first place, and likely anyone surprised by the change will be pleasantly surprised. Duplicate data is a pain.

melodous · 2020-08-26T20:04:15Z

IMHO, in the Elasticsearch world it's not natural to perform this kind of deduplication; Elasticsearch users probably do not expect this behavior by default.

Also, this behavior carries a penalty in write throughput. From the official Elastic documentation:

'Use auto-generated ids
When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster'
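
To make the difference in the quoted passage concrete, here is a small sketch of the action line of an Elasticsearch bulk `index` request with and without an explicit `_id`. The helper `bulkIndexAction` and the index name are made up for illustration and are not taken from the Telegraf plugin.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexMeta holds the metadata of a bulk "index" action. The _id field is
// omitted when empty, which leaves ID generation to Elasticsearch.
type indexMeta struct {
	Index string `json:"_index"`
	ID    string `json:"_id,omitempty"`
}

// bulkIndexAction is a hypothetical helper that renders the action line that
// precedes each document in a bulk request body.
func bulkIndexAction(index, id string) (string, error) {
	line, err := json.Marshal(map[string]indexMeta{"index": {Index: index, ID: id}})
	if err != nil {
		return "", err
	}
	return string(line), nil
}

func main() {
	// No explicit ID: Elasticsearch auto-generates one and can skip the
	// "does this ID already exist?" check.
	auto, _ := bulkIndexAction("telegraf-2020.08.21", "")
	fmt.Println(auto)

	// Explicit ID: a resend overwrites the existing document, at the cost of
	// the lookup described in the quoted documentation.
	forced, _ := bulkIndexAction("telegraf-2020.08.21", "abc123")
	fmt.Println(forced)
}
```

The only difference on the wire is the `_id` key, which is why the trade-off discussed here is between write throughput and protection against duplicates.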

ssoroka · 2020-08-27T20:04:56Z

That's good feedback @melodous. I'll change my position to say we should choose the faster option by default and allow users interested in this to turn it on.

ssoroka · 2020-08-27T20:07:35Z

@toni-moreno I'd say it's good to merge whenever you want. Let me know if you plan to rename or want to merge as is.

toni-moreno · 2020-08-31T09:59:25Z

Hello @ssoroka, IMHO and regarding @melodous's comment: if setting the document ID at the origin can be a costly operation in the ES backend, the name "force_document_id" gives people the idea of something with a possible penalty, so it looks good to me if everybody agrees.

added force_document_id option to ES output enable resend data and av…

49d21c9

…oiding duplicated ES documents, fix influxdata#7891

ssoroka reviewed Aug 21, 2020

ssoroka approved these changes Aug 24, 2020

ssoroka added area/elasticsearch feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin labels Aug 24, 2020

ssoroka merged commit 9a06ac1 into influxdata:master Sep 8, 2020

idohalevi pushed a commit to idohalevi/telegraf that referenced this pull request Sep 29, 2020

added force_document_id option to ES output enable resend data and av…

8591369

…oiding duplicated ES documents, fix influxdata#7891 (influxdata#8019)

arstercz pushed a commit to arstercz/telegraf that referenced this pull request Mar 5, 2023

added force_document_id option to ES output enable resend data and av…

683f613

…oiding duplicated ES documents, fix influxdata#7891 (influxdata#8019)
