This repository has been archived by the owner. It is now read-only.

ElasticSearch sink - Improve mappings, timestamps, index separations, etc #1701

Closed
rikatz opened this Issue Jun 26, 2017 · 21 comments

Comments

Projects
None yet
7 participants
@rikatz
Copy link
Contributor

rikatz commented Jun 26, 2017

As mentioned here, maybe using / creating a Generic 'MetricsTimestamp' field usable by all the metrics regardless of which type it uses is better than creating a per-type Timestamp.

Using a per-type Timestamp forces us to create a Datasource per metric in Grafana (like one per CPU, one per memory and so) or even a new index mapping per metric type, when we could filter the desired metric per '_type' field (as already inserted).

The idea here is to insert a general point["MetricsTimestamp"] = date.UTC() in sinks/elasticsearch/driver.go instead of creating a esCommon.MetricFamilyTimestamp(family)

Need your opinion in this, as I'm thinking about start working in a PR for this. I've taken a look in #1313 but couldn't realize why Timestamp is separated, as there's a _type field that can be used to separate everything in Kibana :)

TODO (as suggested by @outcoldman)

  • Replace the specific timestamps to a more generic timestamp (like MetricsTimestamp or even just timestamp)
  • Create a new cmd line argument, that allows us to specify the creation of a per-metric index (cpu-index, memory-index, etc)
  • Improve the type mapping, removing the unecessary text analyzed fields.
  • remove all {"type": "text", "fields": {"raw": {"type": "keyword"} } } types and replace them with just {"type": "keyword"}, nobody will use full text search over metrics fields, it just does not make any sense. That will save the storage.
  • split labels with , and send it as an array of ["app=foo","label2=my_label"] and keep them as {"type": "keyword"} as well, so you will be able to search with MetricsTags.labels=app=foo
  • Rename the fields memory/cache to more like JSON friendly memory_cache and maybe remove unnecessary nesting with Metrics. memory_cache, just keep memory_cache keep only one field with type text, which is Message on events.

Thanks in advice!

@rikatz

This comment has been minimized.

Copy link
Contributor Author

rikatz commented Jun 26, 2017

@AlmogBaku need your opinion in this :)

@AlmogBaku

This comment has been minimized.

Copy link
Contributor

AlmogBaku commented Jun 26, 2017

@rikatz

This comment has been minimized.

Copy link
Contributor Author

rikatz commented Jun 26, 2017

OK, maybe I'm missing something or I don't understand well how Kibana works, but when you add a new 'index' pattern, you have to specify if the index contains 'time based events', and what's the Time-field base.

If we do add a common 'index' (like testevent-*) pointing only to 'CpuMetricsTimestamp', the different events (memory, network) are 'discarded' from Kibana search.

If you add a non timestamped index in Kibana, we can plot graphs but the behaviour is pretty strange. You have to add graphs with 'X axis' using a Histogram, but the plotting doesn't work well (and also selecting the Time range).

Do you have any Kibana Dashboard example?

Thank you!

@AlmogBaku

This comment has been minimized.

Copy link
Contributor

AlmogBaku commented Jun 27, 2017

I don't have an example.. but I guess adding an extra field couldn't hurt..

However, it's interesting to try and see if having a unified timestamp outside of the types will also work for Kibana.. (maybe I was wrong.. I don't really recall how it works right now)

@rikatz

This comment has been minimized.

Copy link
Contributor Author

rikatz commented Jun 29, 2017

@AlmogBaku No problems at all. I'll make some tests here, including Kibana Dashboards and check if everything works fine.

@outcoldman

This comment has been minimized.

Copy link

outcoldman commented Jul 5, 2017

@AlmogBaku I am confused, how is having multiple timestamp fields helps kibana?

I actually see several issues with current sink implementation of ElasticSearch:

  1. I would suggest to change the mapping to keep events with different set of fields in separate indexes. See for details https://www.elastic.co/blog/index-vs-type. Having multiple types in the same index, which do not have any common fields, just hurts lucene. Certainly does not save the space.

  2. Current mapping has a lot of fields analyzed by default. Why would you analyze MetricsTags.namespace_name? Certainly you don't want to use full text search on these fields. The only analyzed fields should be probably Message from events (maybe something else).

  3. Every index should have only one timestamp field. That will work best with kibana.

  4. Events currently aren't searchable as Time-based events, considering that we probably don't have a timebased field for it... And some alias.

@rikatz

This comment has been minimized.

Copy link
Contributor Author

rikatz commented Jul 5, 2017

@outcoldman about the issues:

Separating the indexes per metrics could be applicable, but I think would make harder to create Kibana and/or Grafana Dashboards.

Let's take a look at how Elastic does this, as they have a Metricbeat Plugin for kubernetes/kubelet:

Regardless of using kube-state-metrics (this is another subject) or kubelet (and the collected metrics), Metricbeat would everytime export the metrics to the same Elasticsearch cluster (or any other output) as defined here.

The most important thing here is that the metrics use the same Timestamp field, but different type mapping per metrics.

Waiting for your considerations :)

@outcoldman

This comment has been minimized.

Copy link

outcoldman commented Jul 5, 2017

@rikatz yes, you can send all the metrics to just one index, but as Do your documents have similar mappings? If no, use different indices.

With metricsbeat you can:

a) you can specify dedicated index for every type with elasticsearch output.
b) use logstash output and forward events based on type to specific index

I agree that having multiple indexes per every type can be hard to manage if you are just playing with that setup. But in case of real production use - that will pay off. Considering that sinks are configured only on command arguments level - maybe having a dedicated_indexes=[true|false] can target both scenarios?

For consideration - we can implement another sink - generic HTTP/JSON one, so it will be possible to use for example logstash http input and so folks will be able to do anything they want on pipeline.

Btw, considering that ES5 supports ingest pipelines - I guess I can use that now as a workaround for separating types and putting stuff in right indexes.

@outcoldman

This comment has been minimized.

Copy link

outcoldman commented Jul 5, 2017

but I think would make harder to create Kibana and/or Grafana Dashboards.

btw, regarding this one. If you will have indexes with names like heapster-metrics-cpu-YYYY.MM.dd, heapster-metrics-network-YYYY.MM.dd, ... and heapster-events-YYYY.MM.dd you can easily build dashboards in Kibana and Grafana by just using names like heapster-metrics-*

@rikatz

This comment has been minimized.

Copy link
Contributor Author

rikatz commented Jul 5, 2017

Agreed, you can use wildcards in that. Maybe a new argument would help anyway.

@AlmogBaku what do you think about this rebasing:

  • Using only one timestamp instead of per-metric timestamp. This could be made in a new PR without any impacts (I think), as I don't know, with ES5 if there was someone using previously the Elasticsearch sink. Maybe need to release a 'breaking change' with this.
  • Create a new dedicated_indexes argument to create a index per metric type.
@outcoldman

This comment has been minimized.

Copy link

outcoldman commented Jul 5, 2017

@rikatz also I would suggest to update the mappings:

  • remove all {"type": "text", "fields": {"raw": {"type": "keyword"} } } types and replace them with just {"type": "keyword"}, nobody will use full text search over metrics fields, it just does not make any sense. That will save the storage.
  • split labels with , and send it as an array of ["app=foo","label2=my_label"] and keep them as {"type": "keyword"} as well, so you will be able to search with MetricsTags.labels=app=foo
  • if that is possible - would suggest to rename the fields memory/cache to more like JSON friendly memory_cache and maybe remove unnecessary nesting with Metrics. memory_cache, just keep memory_cache
  • keep only one field with type text, which is Message on events.
@rikatz

This comment has been minimized.

Copy link
Contributor Author

rikatz commented Jul 5, 2017

@outcoldman let's take this by parts :D first of all, changing the timestamp schema to something more generic would help? For me would a lot.

Need more opinions before 'breaking' stuffs!

Then let's see how to refactor the fields / mapping structures (in another PR) and break other stuffs :)

Let's make a 'list' of things that we think need to be changed (fields to be removed, arguments to be inserted, things to be changed) and achieve this in a lot of PRs to improve the ES5 sink :)

@rikatz rikatz changed the title ElasticSearch sink - Change Timestamp schema ElasticSearch sink - Improve mappings, timestamps, index separations, etc Jul 5, 2017

@rikatz

This comment has been minimized.

Copy link
Contributor Author

rikatz commented Jul 5, 2017

@outcoldman Have changed the Issue description, take a look if I've missed something.

Anyway, need some other people from community (instead or just 3 of us), as I think some changes would break existing production environments :)

@outcoldman

This comment has been minimized.

Copy link

outcoldman commented Jul 5, 2017

@rikatz sure, just one common timestamp field will help a lot, that will be good start.

As for breaking changes - I would suggest to break just once. If with every release this sink will have breaking changes - folks will be very unhappy. I would suggest to actually make all of them at once and standartize the schema with best practices from elasticsearch.

Btw as for mappings - you can take a look on which mappings ElasticSearch is using https://www.elastic.co/guide/en/beats/metricbeat/master/exported-fields-kubernetes.html#_kubernetes_event_message - they do not have any text fields at all, Message also is a keyword. Not sure if that is desirable, but again - possible nobody will use full text search on this field as well. Plus if you will really need to use analyzers and full text search on some fields - you can use ingest pipeline.

Btw on the list 3rd and 4th is the same.

Curious if maybe metricbeats/kubernetes authors can help with advise as well. cc @exekias @monicasarbu - the community will really apprieate if you will help heapster be much nicer with ElasticSearch, we understand that you want to make your beats experience as simple as possible. But heapster is the native metrics collector from kubernetes, at least some advice and guidence will be appreciated. Your choise of pulling metrics from kubernetes directly is certainly ok, but my suggestion - API in kubernetes are getting breaking changes times to time - for example prometheus/kubernetes plugin feels like is getting broken all the time, and they actually using the same path as you do by pulling metrics directly. I would assume that heapster will have less of the breaking changes, considering closer sitting with kubernetes. This is why I am asking you to help with that, as this is what most people will rely on. The best case can be that heapster and metricsbeats/kubernetes will produce the same mappings (with the same names) so all the dashboards built for one can be used with another.

@exekias

This comment has been minimized.

Copy link

exekias commented Jul 7, 2017

Hi there!

I would say keyword is right in most cases, normally you do metric lookups by their full value. If you expect to have a random text probably text is still preferable.

As for the labels, we use nesting to store them, so they would be something like "labels.app": "foo".

@outcoldman pointed out to our exported fields list, it's a good reference if you want to match our names.

@outcoldman

This comment has been minimized.

Copy link

outcoldman commented Jul 7, 2017

About the types and indices, https://www.elastic.co/blog/index-type-parent-child-join-now-future-in-elasticsearch - in ES 6 - there will be no types, only one type per index

@AlmogBaku

This comment has been minimized.

Copy link
Contributor

AlmogBaku commented Jul 7, 2017

  1. using keyword make sense
  2. what are the benefits of having an index per metric type?
  3. I'm still not sure how and if you can configure multiple types with only one timestamp on kibana (and also for older kibanas then version 5)
  4. I agree that we need to minimize the breaking changes and to try to create one stable mapping. We might send multiple PRs, but all of them should be released at the same time.
@rikatz

This comment has been minimized.

Copy link
Contributor Author

rikatz commented Jul 7, 2017

@AlmogBaku I think that, by the end of 'types' in ES6, it's a good idea to have a per-metric Index, probably the impact migrating from 5 to 6 would be less than expected.

About the type, I think you can use it (the field _type) to filter which metrics you want. If this is not the case to Kibana < 5, than probably this needs to be a breaking change. Maybe, after all we need to decide wheter to support or not anymore ES 2x versions.

Let's close those questions, and we can start working on that :D

@outcoldman

This comment has been minimized.

Copy link

outcoldman commented Jul 8, 2017

@AlmogBaku
2. that is the main confusion about the elasticsearch indexes and types. There are not similar to databases and tables. Both articles has really good explanation why having one type per index makes sense https://www.elastic.co/blog/index-type-parent-child-join-now-future-in-elasticsearch and https://www.elastic.co/blog/index-vs-type
3. Not sure what you are referencing to. If you will have type per index - you will be able to do that. Even if not - you will have just one kibana type for this index - which will have fields from all the different types in the index. How different timestamps help? At minimum you can always search for type:cpu at search

@fejta-bot

This comment has been minimized.

Copy link

fejta-bot commented Dec 31, 2017

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@AlmogBaku

This comment has been minimized.

Copy link
Contributor

AlmogBaku commented Jan 12, 2018

let's continue the discussion in #1909 and decide what's the new schema will be.
/close

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.