Add support for _ttl at event level? #43

Closed
wiibaa opened this issue Jan 26, 2015 · 3 comments

wiibaa (Contributor) commented Jan 26, 2015

Moved from https://logstash.jira.com/browse/LOGSTASH-470
Elasticsearch can automatically prune older documents if a TTL value is provided on the message or set as a default on the index itself.
So, for example, you could create a single index and tell it to store events for 30 days; after 30 days Elasticsearch would start removing the old entries from that index. This could also be used with the daily-style indexes that Logstash creates automatically: it would not delete the index itself, but it would empty it out.
It may also be worthwhile to allow filters to tweak this value, so that I could keep my access log entries for a week but my error logs for a month.
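A rough sketch of what an index-level default would look like against the Elasticsearch 1.x mapping API (the index and type names here are only placeholders, and the _ttl meta-field was deprecated and removed in later Elasticsearch versions):

  # Enable the legacy _ttl meta-field with a 30-day default on one daily index
  curl -XPUT 'http://localhost:9200/logstash-2015.01.26/_mapping/logs' -d '
  {
    "logs": {
      "_ttl": { "enabled": true, "default": "30d" }
    }
  }'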

An interesting comment from MixMuffins:
The thing about using TTL is that it creates a lot of excess overhead in Elasticsearch if you're dealing with a lot of indexes. Have you considered an automated script that removes entries after a certain period of time, run as a cron job? It would be more efficient and wouldn't sacrifice speed on your Elasticsearch indexing.

untergeek (Contributor) commented:

@wiibaa, thanks for migrating these. I know you're aware of these details, but I'm going to comment here for the benefit of other readers.

TTL is a bad idea for time-series data because indices can grow to billions of documents per day. If a document-level TTL is set on any document in an index, the entire index is scanned every 60 seconds (a configurable default) to find documents that have a TTL set and to check whether they have expired. That is a tremendous amount of overhead just for reading, before any deleting even happens.

Even using a delete_by_query via cron is problematic because of how it affects segment sizing and allocation. It makes for very uneven segment merges, which puts strain on your indexing and search operations. In addition, deleting documents in Elasticsearch does not result in immediate deletion. From the book Elasticsearch: The Definitive Guide:

Internally, Elasticsearch has marked the old document as deleted…The old version of the document doesn’t disappear immediately, although you won’t be able to access it. Elasticsearch cleans up deleted documents in the background as you continue to index more data.
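For illustration, such a cron-driven purge would look roughly like this against the Elasticsearch 1.x delete-by-query endpoint (the index pattern and age are placeholders, and the endpoint itself was removed in Elasticsearch 2.0):

  # Delete every document older than 30 days across all Logstash indices
  curl -XDELETE 'http://localhost:9200/logstash-*/_query' -d '
  {
    "query": {
      "range": { "@timestamp": { "lt": "now-30d" } }
    }
  }'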

Pruning documents by TTL or by query is necessary in certain environments, but time-series data (like logs) should almost never be handled this way, in my opinion. Your Elasticsearch environment will be far better served by splitting the data into separate indices and dropping whole indices with DELETE calls (or using curator).
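Dropping a whole daily index is a single, cheap metadata operation, along these lines (the index name is illustrative; curator can run the same kind of cleanup on a schedule):

  # Remove an entire day's index in one call instead of deleting documents one by one
  curl -XDELETE 'http://localhost:9200/logstash-2014.12.27'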

With this said, if you insist on doing event-level TTL, there's nothing preventing you from attaching a TTL to individual events in Logstash by adding a field called _ttl with a string value that Elasticsearch recognizes, such as "1d" for one day or "1w" for one week. See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html#index-ttl

  filter {
    mutate {
      # expire this event one day after it is indexed
      add_field => { "_ttl" => "1d" }
    }
  }
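Note that, as with the index-level default sketched earlier, Elasticsearch 1.x only honors a per-document _ttl when the _ttl meta-field is enabled in the target index's mapping.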

Since this is fully supported now (as of the Logstash 1.2 schema change), I'm going to close this issue with all of the caveats mentioned above. Feel free to re-open if you believe it necessary.

wiibaa (Contributor, Author) commented Jan 27, 2015

@untergeek I had a rough idea, but it is always good to hear the details from the experts.

wols commented May 9, 2015

A valuable clue. Thank you!
