
future of raintank-metric, use something else? #11

Closed
Dieterbe opened this Issue Jul 27, 2015 · 44 comments

Comments

@Dieterbe
Contributor

Dieterbe commented Jul 27, 2015

please help me fill this in. we need to agree on what our requirements/desires are before talking about using other tools

current requirements?

  • safely relay metrics from our queue into storage and ES without losing data in case we can't safely deliver
  • decode messages from our custom format used in rabbitmq (but i suppose we could also store them differently in rabbit?)
  • encode messages into our custom format, to be stored in ES

possible future requirements

  • real time aggregation
  • real time processing/alerting (I personally don't think we need to be too concerned about this just yet. once we have high performance/scalability requirements we'll probably use a dedicated real time processing framework like spark/storm/heron/...)

questions

  • can we write our own decode, encode, processor plugins in Go, in heka?
  • can somebody describe what we do with ES from the raintank-metric/rabbitmq perspective and how dependent this is on the main storage backend? like if kairosdb is down, can we or must we still update ES? if ES is down, can or must we still write to kairos?
  • does rabbitmq support multiple readers of the same data, and does it maintain what has been acked by which reader?

@woodsaj


Member

woodsaj commented Jul 28, 2015

I personally think that Heka is the best path forward.

The required flow of data is as follows:

  • Consume from rabbitmq. The messages received from rabbit will be json encoded arrays of individual metrics.
  • decode rabbitmq messages. first unmarshal the json into a list of go data structs, eg. []map[string]interface{}. Then split the list for processing of each individual metric.
  • create and/or update ES with the metric definition. These definitions are currently only used by graphite-api when searching metrics. Now would be a good time to revisit the data and format of what we are storing.
  • write the metric to our TSDB (currently kairosdb)

We are currently using the consistent-hash exchange with rabbitmq (https://github.com/rabbitmq/rabbitmq-consistent-hash-exchange). The reason for selecting this exchange type was that we were doing aggregation and so needed to ensure that metrics were always routed to the same raintank-metric process. this is no longer the case, so we can re-evaluate our options.
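
For illustration, a minimal Go sketch of the decode step described in the list above; the payload field names ("name", "value", "time") and the helper functions are assumptions for the example, not the actual raintank schema:

package main

import (
    "encoding/json"
    "fmt"
    "log"
)

// decodeBatch unmarshals one rabbitmq message body (a json encoded array of
// metrics) into generic maps, then hands each metric off individually.
func decodeBatch(body []byte) error {
    var metrics []map[string]interface{}
    if err := json.Unmarshal(body, &metrics); err != nil {
        return err
    }
    for _, m := range metrics {
        processMetric(m) // would update the ES definition and write the datapoint to the TSDB
    }
    return nil
}

// processMetric stands in for the ES-definition and TSDB writes.
func processMetric(m map[string]interface{}) {
    fmt.Println("metric:", m["name"], "value:", m["value"]) // field names are assumed
}

func main() {
    payload := []byte(`[{"name":"litmus.foo.collector.http.duration","value":12.3,"time":1438000000}]`)
    if err := decodeBatch(payload); err != nil {
        log.Fatal(err)
    }
}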


@woodsaj


Member

woodsaj commented Jul 29, 2015

so we should be able to leverage the existing Heka rabbitmq input plugin, which will then pass the messages to a new Decoder plugin. We will then need 2 output plugins, 1 for elasticsearch and 1 for tsdb.

@woodsaj


Member

woodsaj commented Jul 29, 2015

raintank/grafana#275 Highlights a current issue we have with the data we store in Elasticsearch. On all of the RT dashboards we have a template variable that uses the query "litmus.*" to populate. We already have clients that have 70K unique metrics and having to parse all of these to get the list of ~70 endpoints they have is extremely expensive. At present we limit the response payload to 10k documents which leads to not all endpoints showing up in the list.

So as we re-work the raintank-metric code i think we also need to overhaul the metricDefinition that we store in Elastic. To allow efficient querying from graphite i think we need to tokenize the metric name.
i.e. litmus.endpoint.collector.proto.measurement would become

{
  "series": "litmus.endpoint.collector.proto.measurement",
  "pos0": "litmus",
  "pos1": "endpoint",
  "pos2": "collector",
  "pos3": "proto",
  "pos4": "measurement",
  ...
}

When searching for a query like "litmus.*" we could query elastic just for the branch nodes that match:

{
  "size": 0,
  "aggs": {
    "names": {
      "terms": {"field": "pos1"}
    }
  },
  "query": {
    "filtered": {
      "filter": {
        "and": [
          {
            "or": [
              {
                "term": {
                  "org_id": 1
                }
              },
              {
                "term": {
                  "public": true
                }
              }
            ]
          },
          {
            "term": {
              "pos0": "litmus"
            }
          }
        ]
      },
      "query": {
        "regexp": {
          "pos1": "*"
        }
      }
    }
  }
}

Further thought and investigation is required here to determine how to identify which metrics are leaf nodes.
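
As a rough illustration of the tokenization described above (the pos0/pos1/... field names follow the example document; the helper itself is hypothetical):

package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

// tokenize builds a document like the one sketched above: the full series
// name plus one posN field per node, so graphite-style queries can filter
// and aggregate on positions instead of regex-scanning every name.
func tokenize(series string) map[string]interface{} {
    doc := map[string]interface{}{"series": series}
    for i, node := range strings.Split(series, ".") {
        doc[fmt.Sprintf("pos%d", i)] = node
    }
    return doc
}

func main() {
    out, _ := json.MarshalIndent(tokenize("litmus.endpoint.collector.proto.measurement"), "", "  ")
    fmt.Println(string(out))
}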

@Dieterbe


Contributor

Dieterbe commented Jul 29, 2015

notes from meeting:

  • rabbitmq performs poorly if acks take long. it's not really good at being a persisted queue, so better to take data out of rabbit asap. the queue could be made durable but it's not worth the hassle.
  • might change rabbit for kafka again later.
  • nsq could be useful since it does buffering, work distribution and is distributed. but it's more of a generic messaging system and probably has no good support for plugins for custom formats and so on. heka does have a nice plugin system and some more features for logs/metrics that seem useful for us down the road.

look into using heka

  • future scale-out will be fine. can be shared nothing round robin or whatever.
    will need consistent hashing once we do something like aggregation/alerting again.
  • for our use cases we can probably even just bypass rabbitmq and have collector-ctrl write directly to heka. (but i think it's simplest if we just leave rabbitmq in for now?)
  • can implement metrics 2.0

heka TODO:

  • validate performance
  • can the ES and kairosdb outputs be independent? (or should we not add definitions to ES if kairosdb is down)
  • verify whether we can prioritize realtime data and give lower priority to older data. (also: is there any cap on the bandwidth it tries to use for old data, so it doesn't affect realtime data? are any caps adjustable at runtime?)
  • look into behavior when the disk grows full, mem goes full / is capped?
  • any other potential gotchas to consider?

@Dieterbe Dieterbe self-assigned this Jul 30, 2015

@nopzor1200


nopzor1200 commented Jul 31, 2015

Sounds like you guys are on top of this. I don't have much to add but 3 points:

  1. I think the ES and kairosdb outputs should ideally be independent. I can't think of a reason we wouldn't want to send to ES if Cassandra is down, or to Cassandra if ES is down.

(obviously the less buffering we do of data IN GENERAL the better we'll be. practically i think we need to have low MTTR for issues in prod; the more data we allow to buffer the more challenging it will be to reload, and the longer that prod may have to be in a degraded state due to higher load.)

  2. I think it'd be good to prioritize real time data and lower the priority of older data, but I wouldn't prioritize it for an initial replacement if it's challenging.

  3. I think it'd be nice to cap bandwidth for old-data buffering. I guess this very much relates to the concern I mention under 1). I'm not sure where the best place to do this is -- eg. I've heard of cassandra shops changing some of their tunables (about repair bandwidth, write and read threads etc) during specific times like cluster repairs or large reloads etc.

@Dieterbe


Contributor

Dieterbe commented Aug 3, 2015

so when the metadata for a metric is in ES but the data is not in kairosdb, will it give issues? how does graphite-kairosdb behave if you request series that exist in ES but not in kairosdb? i generally agree with you here but @woodsaj can you confirm there is no gotcha here. i wouldn't want graphite to return 500's in this scenario for example.

other things i have to look into:

  • max messages in flight that can be dropped? there's a bunch of chan's with a buf of 50 and probably a periodic sync?
@woodsaj


Member

woodsaj commented Aug 3, 2015

As we are querying C* directly, we only get back rows that have data. The query used is:

SELECT * FROM data_points WHERE key=? AND column1 > ? AND column1 <= ?

where key is essentially the series we want, and column1 is the timestamp of datapoints.

So querying for a series that is not in C* yields the same result as querying for a time range for which there is no data, C* returns nothing.

As of late last week, when this happens we now return null for every datapoint in the requested time range (we were previously returning an empty set "[]").
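
A hedged sketch of that "null for every datapoint" behaviour; the real logic lives in graphite-kairosdb, so the step handling and types below are assumptions:

package main

import "fmt"

// fillRange returns one entry per step in [from, to); points that came back
// from C* are filled in, everything else stays nil (rendered as null).
func fillRange(from, to, step int64, points map[int64]float64) []interface{} {
    var out []interface{}
    for ts := from; ts < to; ts += step {
        if v, ok := points[ts]; ok {
            out = append(out, v)
        } else {
            out = append(out, nil)
        }
    }
    return out
}

func main() {
    // a series with no data at all comes back as all nulls, the same as an
    // existing series queried over an empty time range.
    fmt.Println(fillRange(0, 60, 10, map[int64]float64{20: 1.5}))
}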

@ctdk


Contributor

ctdk commented Aug 4, 2015

Redoing the ES definitions seems like a good idea. One question though: do we actually need to prefix these things with "litmus.*"? If everything we're querying there has litmus in front of it, maybe we could just leave it off.

@Dieterbe


Contributor

Dieterbe commented Aug 4, 2015

the long term goal is to add all kinds of non-litmus applications/metrics/integrations to raintank, so i think we should prefix metrics with a litmus namespace.

@Dieterbe


Contributor

Dieterbe commented Aug 5, 2015

i'm not too excited about heka for the raintank-metric use case because:

  • if heka crashes / is killed, messages can be dropped at various points in the pipeline. the disk buffer is only used in the output plugin phase, long after the amqp ack, and with various buffers in between. there doesn't seem to be an fsync but i need to get that confirmed.
  • heka might be much slower than we want it to be (see https://github.com/trink/hindsight , a bench conducted by one of the heka authors). there seems to be overhead caused by protobuf encoding, the internal routing system, etc.
  • no QoS. all data is FIFO. we could work around it by spinning up new heka's as described in my email

solutions:

  • proceed with heka, knowing it can lose messages, comes with overhead because it's so general purpose, and requires intervention to provide QoS.
  • look into hindsight, but only if we're serious about C (-1) and someone can write the needed plugins.
  • reconsider kafka
  • redesign raintank-metric around the diskqueue code, which can be ripped from nsqd like i did with carbon-relay-ng. ack to amqp when fsync returns. add some QoS to it, which shouldn't be too hard. prototyping this should take about the same time as prototyping a heka based solution I think.
@Dieterbe


Contributor

Dieterbe commented Aug 6, 2015

@woodsaj thoughts on the above?
also, you mentioned in the last call that rabbit doesn't deal too well with certain approaches, i'd like to get a better understanding of that:

  1. can you query for, say 1k messages at once, or is it 1 by 1.
  2. can you concurrently query for messages? let's say have 20 workers each getting messages 1-by-1, or 1k-by-1k
  3. how long is reasonable to delay acks? subsecond good enough? is it ok to have rabbit wait for acks in context of 1k-by-1k retrieval, in context of concurrent fetches, and both (say 20 workers each pulling in 1k metrics and only acking them a second after the messages arrive)
  4. what are reasonable message sizes? (like can we work around some of the above by packing many metrics into single messages)
@woodsaj


Member

woodsaj commented Aug 7, 2015

Rabbitmq is not something you query. it is an AMQP broker. Messages get pushed to the client.
https://www.rabbitmq.com/tutorials/amqp-concepts.html

Read this for detailed info on Rabbitmq performance.
https://www.rabbitmq.com/blog/2012/04/25/rabbitmq-performance-measurements-part-2/

If the consumer is configured with automatic Acks then the broker will just send messages as they come in. But if the consumer is configured for explicit Acks, then only prefetch-count (a configuration of the channel) messages will be delivered to a consumer until the messages are acked; the default is 1.

So generally, when dealing with high message counts, Ack'ing a message should be used to signal that the message has been successfully received. Delaying the Ack until the message has been processed will significantly reduce the number of messages that can be consumed.

We already pack many metrics into each message. The grafana-server buffers metrics for 100ms, grouped into 512 buckets (this can be tuned) based on a hash of the series name. So at most, each grafana-server can emit 5120 messages per second.

As per the rabbitmq performance blog article, the throughput of rabbitmq (i.e. the total volume of metrics we can send) is maximized with large message sizes, i.e. by packing more metrics into a message.
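
For reference, a minimal consumer sketch showing prefetch-count plus explicit acks; the streadway/amqp client, broker URL, queue name and prefetch value are assumptions made for the example, not something this thread specifies:

package main

import (
    "log"

    "github.com/streadway/amqp"
)

func main() {
    conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/") // assumed local broker
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    ch, err := conn.Channel()
    if err != nil {
        log.Fatal(err)
    }

    // prefetch-count: at most 100 unacked messages in flight on this channel.
    if err := ch.Qos(100, 0, false); err != nil {
        log.Fatal(err)
    }

    // autoAck=false, so the broker waits for explicit acks.
    msgs, err := ch.Consume("metrics", "", false, false, false, false, nil)
    if err != nil {
        log.Fatal(err)
    }

    for d := range msgs {
        // ack on receipt; delaying this until after processing throttles
        // delivery to the prefetch window, as described above.
        d.Ack(false)
    }
}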

@woodsaj


Member

woodsaj commented Aug 9, 2015

@Dieterbe.
So to respond more specifically to the proposed options:

  1. proceed with heka? The performance numbers listed for Heka are well and truly below what we need. Though we are pushing metrics from rabbit in batches, after decoding each metric would still be represented as a single message within Heka, and so we would likely experience lower than required performance from the filter, encode and output plugins.
  2. Hindsight? We don't currently have the resources to deal with C. The lead time to bring on new developers with a strong C background is too long for it to be a viable option right now.
  3. Kafka? kafka just replaces rabbitmq in the stack. we still need to handle consuming from Kafka and writing to the TSDB in a reliable way. The github.com/Shopify/sarama package looks ok, but the documentation doesn't even have an example of how to create a consumer. Additionally, using Kafka means that we need to add a Kafka cluster and a Zookeeper cluster to the stack. Kafka does certainly have a number of advantages though.
  4. Redesign raintank-metric? This option is looking more promising to me. Some things to consider:
    • we can move the raintank-metric logic directly into grafana-server. It is clear from the server performance that handling metrics is CPU intensive. This is likely from the encoding/decoding to/from json. Currently we are decoding the message in grafana-server, then encoding again to send to rabbitmq, then decoding again in raintank-metric before finally encoding into a different schema and sending to the TSDB. Moving the logic to grafana removes two of the steps. Additionally this also removes the need for rabbitmq.
    • grafana-server is where we first receive the metric and so a disk based commit log (diskqueue) can be used there.
@woodsaj


Member

woodsaj commented Aug 9, 2015

i should also note that instead of moving the raintank-metric logic into grafana-server, we can also do the reverse and move the collector-ctrl logic out of grafana-server and into raintank-metric.

@ctdk


Contributor

ctdk commented Aug 9, 2015

Also I worked on raintank-metric while flying home this morning, and have some changes I want to try out and see if they help.


@Dieterbe


Contributor

Dieterbe commented Aug 10, 2015

the fair thing to do would be to benchmark heka ourselves with our use case, but frankly I don't think we should spend too much time pursuing this direction because heka doesn't look like such a good fit anymore.
the main argument we had for heka over nsq was that it was simpler (it could run input, output and processing plugins all in 1 process), but heka still requires having a queue in front, unless we run collector-ctrl as a heka input plugin, and then the entire metrics pipeline is only 1 thing running on a frontend, which doesn't seem right. so i'd like to push for nsq a bit again. (not just its library, but as a queue, see below)

first of all, I agree on removing rabbit, and we can also switch to a more optimized encoding like protobuf
(see https://github.com/alecthomas/go_serialization_benchmarks)

i've been doing some experimenting with nsq

I'm awaiting some feedback from them but it looks like we might be able to achieve a quicker result by just using nsqd as-is instead of building the queue into raintank-metric.
i'm starting to think this would be a better design anyway: it decouples producers from consumers (i'm not a fan of doing collector controller + queue + to-kairos + to-ES in 1 process), and allows interesting topologies with extra consumers for future use cases, etc. they also have some interesting stuff in the pipeline like WALs for each topic, so it'll be a bit more kafka-like, but still simpler.

@woodsaj is moving out collector-ctrl into raintank-metric something you can take on?

I almost have embedded nsqd working (gives us free diskqueue and more), i'll finish that, and i would like to move the to-kairos and to-ES to simple standalone workers that consume from nsqd.
this part deviates from your POV so i would be happy to discuss if you disagree.
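
A minimal consumer along the lines of the proposed nsq_to_kairos / nsq_to_elasticsearch workers, using the go-nsq client library; the topic, channel and nsqd address are made up for the example:

package main

import (
    "log"

    "github.com/nsqio/go-nsq"
)

func main() {
    cfg := nsq.NewConfig()
    cfg.MaxInFlight = 200 // tunable; caps unacked messages this consumer will hold

    consumer, err := nsq.NewConsumer("metrics", "kairos", cfg) // topic and channel are assumptions
    if err != nil {
        log.Fatal(err)
    }

    consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
        // decode the message and write to kairosdb / ES here. returning an
        // error makes the client requeue the message, which is what gives us
        // the "ack delayed until written into kairos" behaviour.
        return nil
    }))

    if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
        log.Fatal(err)
    }
    <-consumer.StopChan
}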

@ctdk


Contributor

ctdk commented Aug 10, 2015

@Dieterbe I still need to push the changes I made on the plane, but I made some small but potentially significant changes, mostly to the event processing, that might speed that up a bunch.

I also got the groupcache caching caught up with the current raintank-metric master. That was working fine before, except that we need to not use consistent hashing with the queue. I think that's a grafana change though.

@woodsaj


Member

woodsaj commented Aug 10, 2015

firstly, yes i can migrate the collector-ctrl logic to raintank-metric.

My biggest motivation for keeping everything in one process is to reduce the number of encoding/decoding steps needed to process messages, as it consumes a lot of CPU. By removing the queue from the topology I would expect to see at least a 30% reduction in CPU usage (this usage is currently split between grafana-server and raintank-metric).

The main benefit of a producer/consumer topology is to allow horizontally scaling work across many worker nodes. However, as we see from current production, the resource usage of grafana-server is about the same as for raintank-metric, with grafana-server using slightly more resources due to the alertingJobs. If we split ingestion of metrics from writing to the TSDB, then as metric volume increases we would need to scale both parts at the same time anyway. This of course changes if we add new processing steps or output destinations.

So the way i see things working is:

  1. receive json encoded metrics payload from collector.
  2. commit the raw payload ([]bytes) to disk using an append only commit log
  3. unmarshal the payload
  4. validate the metrics (we force the orgId of metrics for payloads sent from private collectors)
  5. send to ES and TSDB
  6. update commit log position so if we restart, we know where to restart from.
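
A very rough sketch of steps 2-6 above, with a plain file append standing in for whatever commit log implementation gets chosen; every name and type here is hypothetical:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os"
)

type metric struct {
    OrgID int     `json:"org_id"`
    Name  string  `json:"name"`
    Value float64 `json:"value"`
    Time  int64   `json:"time"`
}

// commitLog appends raw payloads to a file; read-position tracking is elided.
type commitLog struct{ f *os.File }

func (c *commitLog) append(raw []byte) error {
    _, err := c.f.Write(append(raw, '\n'))
    return err
}

func handlePayload(cl *commitLog, raw []byte, fromPrivateCollector bool, collectorOrg int) error {
    // 2. commit the raw bytes to disk before doing anything else
    if err := cl.append(raw); err != nil {
        return err
    }
    // 3. unmarshal the payload
    var metrics []metric
    if err := json.Unmarshal(raw, &metrics); err != nil {
        return err
    }
    // 4. validate: force the orgId for payloads sent from private collectors
    for i := range metrics {
        if fromPrivateCollector {
            metrics[i].OrgID = collectorOrg
        }
    }
    // 5. send to ES and the TSDB (stubbed out here)
    for _, m := range metrics {
        fmt.Printf("would index definition and write datapoint: %+v\n", m)
    }
    // 6. advance the commit log read position once both writes succeed
    return nil
}

func main() {
    f, err := os.OpenFile("commitlog.dat", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    raw := []byte(`[{"org_id":0,"name":"litmus.foo.http.duration","value":12.3,"time":1438000000}]`)
    if err := handlePayload(&commitLog{f: f}, raw, true, 1); err != nil {
        log.Fatal(err)
    }
}
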
@Dieterbe


Contributor

Dieterbe commented Aug 10, 2015

@woodsaj internally we can just use gogoprotobuf (see https://github.com/alecthomas/go_serialization_benchmarks for comparison) instead of json. for probe to ingest we can do the same thing.

  "update commit log position so if we restart, we know where to restart from."

note that if kairos is down we still want to update ES, and vice versa. this requires 2 read cursors in the commit log. originally i figured we could modify the nsqd diskqueue to support multiple read cursors, but since then the main author has started working on https://github.com/mreiferson/wal which will be the basis of the new WAL for nsqd and is probably more appropriate. i'll start reading the code.

but I think we can save at least 1 week, probably 2, of dev time by just using nsq
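
A toy illustration of the two-read-cursor idea: each output tracks its own position over the same log, so one backend being down doesn't block the other. Everything here is hypothetical:

package main

import "fmt"

// cursor is an independent read position over an in-memory "log".
type cursor struct {
    name string
    pos  int
}

// drain writes entries from the cursor's position onward; on the first write
// failure it stops and keeps its position, so it can retry later without
// affecting the other cursor.
func (c *cursor) drain(entries [][]byte, write func([]byte) error) {
    for c.pos < len(entries) {
        if err := write(entries[c.pos]); err != nil {
            return
        }
        c.pos++
    }
}

func main() {
    entries := [][]byte{[]byte("m1"), []byte("m2"), []byte("m3")}
    es := &cursor{name: "elasticsearch"}
    kairos := &cursor{name: "kairos"}

    es.drain(entries, func(b []byte) error { return nil })                           // ES up: catches up
    kairos.drain(entries, func(b []byte) error { return fmt.Errorf("kairos down") }) // kairos down: stays put
    fmt.Println("es pos:", es.pos, "kairos pos:", kairos.pos)
}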

@woodsaj


Member

woodsaj commented Aug 10, 2015

Your argument is very compelling. Let's go with nsq and gogoprotobuf then.

@Dieterbe


Contributor

Dieterbe commented Aug 12, 2015

currently building out nsq based prototype.

  • brief tests with protobuf were unsuccessful; couldn't get it to work. not worrying about this yet, json will do for now
  • grafana has an nsq branch for nsqd based publishing, dev-stack has an nsq branch which sets up all components, and i'm currently building the nsq_to_kairos and nsq_to_elasticsearch pieces in an nsq branch of raintank-metric.

then will need to do some testing/validations:

  • performs well, even with "ack delayed until written into kairos"
  • performs well and clean recovery when kairos goes down and comes back up
  • when the nsq_to_* tools connect to a new nsqd instances/topics, no information is lost

(anything else to test?)

once that looks good, will polish up the code, add instrumentation, docs, config flags, cleaner dev stack, data prioritization, handle nsqd down, etc.

@Dieterbe


Contributor

Dieterbe commented Aug 12, 2015

btw, another design goal in this is to cap the mem usage of collector-controller, which will become simple here (collector-controller is just a thin layer on top of nsqd; if nsqd fails, grafana can just kill itself), to fix the OOM issues as per raintank/grafana#248

@Dieterbe


Contributor

Dieterbe commented Aug 14, 2015

update: i've been doing some stress testing using this script, and thought nsq was dropping messages. but it turns out grafana was dropping some log entries before shutdown (see grafana/grafana#2516), and also, while kairos returns 204 for all metrics, it seems some of them are not coming through. i did some docker logs -f on the kairosdb container but didn't get more info out of it, so i may need to do some extra debugging in/around kairosdb. (@woodsaj any idea? instead of backfilling all the way, what i see is that for a range of time (say an hour) there are gaps here and there of a minute to several minutes, in the middle of otherwise completely backfilled data. does it have a hard time dealing with backfilling in the past or out of order metrics? because nsq doesn't have strict ordering guarantees)

so while the nsq stack is in pretty good shape, i do want to rule out these kinks.
i've also seen cases where there's a timeout between grafana and nsqd, perhaps a docker thing, or perhaps because the max-in-flight is too high and starves the network link, so i want to play with that value.

i've also extended our messages so that the first byte is a format specifier, which will make it easier if we change formats later ('\x00' means our current json format), and the next 8 bytes are an int64 timestamp, so i can trace messages through the pipe using that as an id. (for reference, i use these scripts to parse the logs)
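
For reference, a hedged sketch of that framing; the byte order of the int64 is not stated here, so big-endian below is an assumption:

package main

import (
    "encoding/binary"
    "errors"
    "fmt"
    "time"
)

const formatJSON byte = 0x00 // '\x00' = current json format, per the comment above

// frame prepends the 1-byte format specifier and an 8-byte int64 timestamp
// (used as a trace id) to the encoded metrics body.
func frame(format byte, ts int64, body []byte) []byte {
    out := make([]byte, 9+len(body))
    out[0] = format
    binary.BigEndian.PutUint64(out[1:9], uint64(ts)) // byte order assumed
    copy(out[9:], body)
    return out
}

func parse(msg []byte) (format byte, ts int64, body []byte, err error) {
    if len(msg) < 9 {
        return 0, 0, nil, errors.New("message too short")
    }
    return msg[0], int64(binary.BigEndian.Uint64(msg[1:9])), msg[9:], nil
}

func main() {
    msg := frame(formatJSON, time.Now().UnixNano(), []byte(`[{"name":"litmus.foo"}]`))
    f, id, body, _ := parse(msg)
    fmt.Printf("format=%#x id=%d body=%s\n", f, id, body)
}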

@woodsaj


Member

woodsaj commented Aug 17, 2015

kairosdb buffers metrics before writing to cassandra, so the dropped metrics could be from kairos not being able to write to cassandra, though kairos logs this in /opt/kairosdb/log/kairosdb.log (or similar).

Kairos should definitely be able to backfill data, however this is considerably more resource intensive for cassandra.

@Dieterbe


Contributor

Dieterbe commented Aug 17, 2015

good news and bad news.

good:

  • in all my testing, nsq deals fine with late acks / no acks; it's built for this and its auto-requeue mechanism seems to work fine. when you kill an nsq consumer it reconnects fine and everything just works. no data loss.
  • when i kill -9 nsqd and restart it, no data is lost in my few tests. consumers just resume where they left off. i think we should tolerate up to 1 lost message here. we could test this much more in-depth but i decided not to go too far in this testing; after all many companies run nsq in prod, many for years.

the bad:

  • sometimes there's an io timeout between consumers and nsqd, and sometimes between nsqadmin and nsqd.
    the consumers are often hitting 100% cpu while nsqd surprisingly stays under 10% cpu most of the time. so cpu saturation might explain the first case, but not the second. when a timeout happens, it just waits 60s, reconnects, and then everything works fine.
  • cassandra/kairos blows up on my laptop whenever i go above 1k metrics/s. (see raintank/raintank-docker#29)

from what i see in datadog (c* and "metrics received" dashboards), we do about 8k metrics/s right now. @woodsaj suggested aiming for 10k metrics/s per core, which I think is definitely a criterion we should meet. the kairos stack on my laptop definitely can't handle that, but i can test without it, and when i do, i hit the dreaded timeouts to/from nsqd (though i forgot how many metrics/s that was). so while i prepare for production readiness (add instrumentation etc) i'll look into this more. but if the stack on my laptop can do 15k/s in total (with persistence to kairosdb disabled) then that should be good enough to proceed IMHO.

in the meantime,

  • i'd like it if @woodsaj and @ctdk could have a close look at nsq_to_kairos and nsq_to_elasticsearch (see the nsq branch). they are currently a pretty naive port of the logic from raintank-metric into nsq consumers. i want one of you to look at what it does and what it actually should do; i know there's been talk about refactoring/rethinking the schema of what goes into ES and such.
    once we iron those things out, we can start looking at optimisations and performance tuning.
    that said, i also don't think we should hold up deploying the first version of this to prod for too long. after all the initial scope was "minimize data loss when the metrics pipeline fails" and that's still something that should be solved sooner rather than later.
  • likewise, protobuf/faster serialization still seems useful, though we'll probably be fine without it for a while. i guess it depends more on whether someone can do the work without holding up anything else.

@woodsaj , @ctdk thoughts?

@Dieterbe


Contributor

Dieterbe commented Aug 18, 2015

further tests on my laptop with the new nsq_to_kairos --dry=true (don't store in kairos) option:
(this is continuous flow-through, no recovery)

  • 10k/s all fine, low cpu usage of all involved daemons. recovers fine after downtime.
  • 20k/s all fine. <60% for all daemons. still recovers fine after downtime.
  • 40k/s nsq_to_elasticsearch hits 100% cpu. see below
  • 50k/s grafana-server, nsq_to_kairos and nsq_to_elasticsearch all hit 100% cpu. queues start to grow. increasing GOMAXPROCS for nsq_to_* helps to achieve >100% cpu usage, but does not catch up the metrics backlog.
@Dieterbe


Contributor

Dieterbe commented Aug 18, 2015

recovery tests (kill nsq_to_kairos for 10min, then restart it, so it reprioritizes and catches up):

  • at 20k/s, it can't keep up, potentially due to 60s read timeouts and cpu saturation (80% of time spent in json decode), cpu >100%
  • at 10k/s, it can recover, but still suffers from 60s read timeouts after which it reconnects, which makes the process slower and more annoying than it should be.
@Dieterbe


Contributor

Dieterbe commented Aug 18, 2015

when I test the nsq stack outside of docker, with the standard nsq utilities (but still mimicking same message size, disk queue behavior, etc) I get much better results. no timeouts, and faster too.
https://gist.github.com/Dieterbe/2fd593c988d0fc8f796a

so this tells me that the bottleneck/timeouts are either related to/caused by our code (likely), or potentially docker networking, and something we can figure out as we go. i.e. i feel pretty comfortable getting the ball rolling to get this to dev-portal and then prod, and doing optimisations as next steps. (thoughts @woodsaj ?)

@woodsaj


Member

woodsaj commented Aug 18, 2015

this is looking really good dieter.

I have a branch created from yours that updates the metric schema and also has trimmed out a lot of the legacy code that is not needed. The branch has also done away with the local in-memory cache in favor of just Redis. The localCache was abusing a mutex and would have resulted in pretty poor performance if metrics weren't in the local in-memory cache.
https://github.com/raintank/raintank-metric/tree/nsq-metric-schema

as well as the raintank-metric branch, there are also branches of
grafana: https://github.com/raintank/grafana/tree/nsq-metric-schema
collector: https://github.com/raintank/raintank-collector/tree/metric-schema
graphite: https://github.com/raintank/graphite-kairosdb/tree/metric_schema

assuming you are using raintank-docker, checking out the branches in your raintank_code directory and running supervisorctl restart all should be enough for grafana and the collector. But for graphite, inside the graphite container run pip install --upgrade git+https://github.com/raintank/graphite-kairosdb@metric_schema && supervisorctl restart all

@woodsaj


Member

woodsaj commented Aug 19, 2015

Just pushed commits to raintank-collector, grafana and raintank-metric (branches listed in previous comment) to add processing of events.

The event schema has also been updated to separate tags into an explicit field. The raintankEvents panel and rt-events dashboard have been updated to use the new schema.

@Dieterbe


Contributor

Dieterbe commented Aug 19, 2015

  • i wonder if perhaps tags should be a []string instead of a map[string]string. this would allow us to use tags that don't have a key. i try to always come up with a key for a given tag value, but sometimes it's awkward, and i've noticed some people appreciate the ability to just tag things with a single word, without a key.
    key-value pairs could be strings like key=val in the array. I believe this is also how datadog does it.
  • is it safe to merge nsq-metric-schema into nsq and start working on that single branch, or should we keep these concepts separate?
  • for collector, i get
module.js:340
    throw err;
          ^
Error: Cannot find module 'json3'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/opt/raintank/raintank-collector/node_modules/socket.io-client/node_modules/socket.io-parser/index.js:7:12)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
@Dieterbe


Contributor

Dieterbe commented Aug 19, 2015

outstanding TODO's:

  • add instrumentation for messages/s, metrics/s, message size, num metrics, duration
  • add instrumentation for consumers. messages/s, metrics/s, msg age. duration of consume, kairos, and ES. low/high prio queue size.
  • set alerts on message size
  • set monitoring/alerts on how long of consumer downtime can we sustain with current traffic/hdd space
  • add docs for new config options
  • polish up code / remove stresser
  • prod strategy for handling nsqd down. needs raintank/ops#97
  • prod strategy for introducing new producers. see raintank/ops#98
  • more efficient, faster serialization (time permitting)
  • concurrency fix 0b55eef#commitcomment-12779920
@Dieterbe


Contributor

Dieterbe commented Aug 19, 2015

i merged your nsq-metric-schema branches into nsq for raintank-metric and grafana, so we can just work together on the same branch now.
also updated the raintank-docker nsq branch to add in nsq_events and fix the collector build.

@woodsaj


Member

woodsaj commented Aug 20, 2015

the []string vs map[string]string point is a good observation. For endpoints, collectors and dashboards we just use []string, so we should definitely standardize. The main driver for map[string]string is that it is easier to build queries for. If we move to a tags based TSDB, then we would need to be able to a) query on the tags and b) also perform groupBy functions. If I have a key:value pair, then i can groupBy('key'), but how would i do that if it was just a string 'key=value'? would just parsing all tags and splitting them into key:value pairs where possible be the right approach?

Also, Datadog definitely uses key:value pairs.

@Dieterbe


Contributor

Dieterbe commented Aug 20, 2015

Also, Datadog definitely uses key:value pairs.

not exclusively. like i said, you can use simple tags without keys. see http://docs.datadoghq.com/guides/tagging/. though I agree with you (and DD mentions the same argument on that page) that a key is needed if you want to filter/group by a named dimension.
at vimeo / with metrics 2.0 i've run into cases where the enforced key:val format felt like overkill and awkward, and i've also heard people appreciate the support for simple tags in DD.
i don't feel strongly about making this change right now; as long as k:v feels appropriate for our metrics that's fine by me, but i anticipate this topic will resurface once we're dealing with a broader variety of metrics.

If I have a key:value pair, then i can groupBy('key'), but how would i do that if it was just a string 'key=value'? would just parsing all tags and splitting them into key:value pairs where possible be the right approach?

my argument only concerns itself with the schema used at creation time and during transport, not TSDB storage. for metrics2.0 / graph-explorer I actually store arrays of "key=val" strings in ES and found that to work surprisingly well, but it may be that for other systems it'll be more efficient to store this in a more structured way. i also didn't compare both approaches in ES to see if there's a performance difference. we can store in ES and kairos however it makes the most sense for that platform.
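
A small sketch of what handling an array of "key=val" strings could look like (purely illustrative; tags without a key are kept as simple tags):

package main

import (
    "fmt"
    "strings"
)

// splitTags turns []string tags into key:value pairs where possible; tags
// without a '=' are kept as simple (keyless) tags.
func splitTags(tags []string) (kv map[string]string, simple []string) {
    kv = make(map[string]string)
    for _, t := range tags {
        parts := strings.SplitN(t, "=", 2)
        if len(parts) == 2 && parts[0] != "" {
            kv[parts[0]] = parts[1]
        } else {
            simple = append(simple, t)
        }
    }
    return kv, simple
}

func main() {
    kv, simple := splitTags([]string{"collector=nyc", "proto=http", "canary"})
    fmt.Println(kv, simple) // map[collector:nyc proto:http] [canary]
}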

@Dieterbe


Contributor

Dieterbe commented Aug 22, 2015

@woodsaj see comment above. I want to let you make the decision since you're most familiar with all the schema stuff.

@woodsaj


Member

woodsaj commented Aug 24, 2015

I think key:value pairs are much more robust, especially when using Elasticsearch.

I thought the whole purpose of using tags, vs just a long metric name, was to provide context about the data you are sending.
I can't think of a scenario where you would have a value and not have a key that describes what that value represents. @Dieterbe can you give an example?

@Dieterbe


Contributor

Dieterbe commented Aug 24, 2015

let's continue the tag keys discussion in #20


@ctdk


Contributor

ctdk commented Sep 5, 2015

Now that #101 is closed, and tag keys are being discussed elsewhere, is this issue ready to be closed?

@Dieterbe


Contributor

Dieterbe commented Sep 6, 2015

no, there are still some outstanding TODO's (see the list a few comments up)

@Dieterbe


Contributor

Dieterbe commented Sep 9, 2015

just finished all the todo's, which were some code cleanups and configuring some alerts in DD.

@Dieterbe Dieterbe closed this Sep 9, 2015

@Dieterbe Dieterbe removed the in progress label Sep 9, 2015

@Dieterbe Dieterbe removed their assignment Nov 27, 2015

Dieterbe pushed a commit that referenced this issue Jan 22, 2018

Merge pull request #11 from bloomberg/grepExclude
Add exclude and grep functions