Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added force_document_id option to ES output enable resend data avoiding duplicated ES documents, fix #7891 #8019

Conversation

toni-moreno
Copy link
Contributor

This PR solves #7891
ID has been computed as hash computed with sha256(concat(timestamp,measurement,series-hash)),enables resend or update data avoiding ES duplicated documents.
Tested with several loads with the same data and no duplicated documents have been generated.

Hope this PR could help everybody.

Copy link
Contributor

@ssoroka ssoroka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd rather name force_document_id as avoid_duplicates or something, but is there any reason not to turn this on for everyone, and remove the config option? Under some circumstances, Telegraf assumes it can resend the same metrics with no ill effects downstream. I can't imagine a case where it'd be desirable to see all the duplicates.

@toni-moreno
Copy link
Contributor Author

Hello @ssoroka , thank you for the fast review.

About the chosen name "force_document_id". I've chosen because "document_id" is something known for people used to play with ES and Logstash, and could help people to identify the property in both tools ( telegraf / logstash ).

About turn on the property by default. I put false by default , to make this change backwards compatible, I'm not in the head of the people working with telegraf sending to elastic, and set the property to true by default would be a breaking change in some rare cases , so I'd prefer maintain backwards compatibility right now and perhaps change it in the future.

Anyway I'm open to consensus and change both things if other people could give us their opinion.

Thank you very much.

@ssoroka
Copy link
Contributor

ssoroka commented Aug 24, 2020

@lpic10 do you want to weigh in here before I merge?

@ssoroka ssoroka added area/elasticsearch feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin labels Aug 24, 2020
@lpic10
Copy link
Contributor

lpic10 commented Aug 24, 2020

Concerning the default option, I don't know if there is a valid scenario in that the telegraf users would want the same data stored twice (maybe when there are really duplicated log lines or metric points?)

I understand that InfluxDB does that deduplication by timestamp and tags/fields automatically, so maybe that is not really something expected by most people. It could make sense to have this enabled by default on ES output even if there is a potential performance impact on telegraf side.

About the config name there seems to be no consistency on the other tools sending data to ES. (eg. fluentd calls it "hash_id" and beats/logstash calls it "fingerprint"). But in all cases this option is configurable.

For me both force_document_id and avoid_duplicates make sense but I'm bad at naming things.

@ssoroka
Copy link
Contributor

ssoroka commented Aug 26, 2020

Ok. I think I like the idea of changing this to avoid duplicates by default, since this is more likely the expected behavior. In that case, I'd make a new config option called allow_duplicates or disable_document_id and default it to false. We should really strive to do the right thing most of the time, even if that's a breaking change. In this case, this change is more of a fix than a break, anyway, as it probably should have been the default behavior in the first place, and likely anyone surprised by the change will be pleasantly surprised. Duplicate data is a pain.

@melodous
Copy link

melodous commented Aug 26, 2020

IMHO in the elasticsearch world it's not natural to perform this kind of deduplication, probably elasticsearch users does not expect this behavior by default.

Also, this behavior has a penalty in write throughput, from the official Elastic documentation:

'Use auto-generated ids
When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster'

@ssoroka
Copy link
Contributor

ssoroka commented Aug 27, 2020

That's good feedback @melodous. I'll change my position to say we should choose the faster option by default and allow users interested in this to turn it on.

@ssoroka
Copy link
Contributor

ssoroka commented Aug 27, 2020

@toni-moreno I'd say it's good to merge whenever you want. let me know if you plan to rename or want to merge as is.

@toni-moreno
Copy link
Contributor Author

Hello @ssoroka , INHO and regarding @melodous comment, if set document_id from origin could be a costly operation in the ES backend , the name "force_document_id" ,gives people idea of something with possible penalty , and looks good for me if everybody agree.

@ssoroka ssoroka merged commit 9a06ac1 into influxdata:master Sep 8, 2020
idohalevi pushed a commit to idohalevi/telegraf that referenced this pull request Sep 29, 2020
arstercz pushed a commit to arstercz/telegraf that referenced this pull request Mar 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/elasticsearch feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants