
Introduction and Scope

  • Medidata is the largest provider of software for clinical trials.
  • We have around 100 services, written in different languages.
  • Medidata is an early adopter of OpenZipkin and contributed code to the C# and Ruby client libraries.
  • Our Zipkin work is currently in maintenance mode: no engineer works on it full time, but we still maintain the Ruby Zipkin client.

System Overview

  • Instrumentation. We support Ruby, C#, Java, Scala and Python. We deploy in containers and in AWS Lambda.
  • Data ingestion. Traces are sent to the Zipkin server using the HTTP API and AWS SQS (see the first sketch after this list).
  • Data storage. We use Elasticsearch instances managed by AWS for storage. We sample 100%, so we can check anything that looks wrong in logs or other systems.
  • We add trace IDs and span IDs to all our log messages (second sketch after this list).
  • We have different stages; development stages send traces to different Zipkin servers than production. Each stage has its own Zipkin server and Elasticsearch cluster, sized according to the traffic of that stage.
  • We visualize traces in the standard open-source Zipkin UI. We deploy its container as-is.
  • We have a service that reads tracing information from Elasticsearch, compares it with API performance objectives, and issues alerts when performance degrades (third sketch after this list).
  • We have a service to visualize important workflows through the system. These workflows are created by adding tags to the traces.
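
As a rough sketch of the HTTP ingestion path, the Python snippet below builds one Zipkin v2 span by hand and POSTs it to the server's /api/v2/spans endpoint. The host name, service name, and tag values are placeholders (our real services report through the Zipkin client libraries listed above, not hand-rolled JSON); the `workflow` tag only illustrates the kind of tag our workflow visualization relies on.

```python
# Minimal sketch: report one Zipkin v2 span over the HTTP API.
# The host name, service name and tag values are placeholders.
import json
import secrets
import time
import urllib.request

ZIPKIN_URL = "http://zipkin.example.internal:9411/api/v2/spans"  # placeholder

now_us = int(time.time() * 1_000_000)  # Zipkin v2 uses epoch microseconds
span = {
    "traceId": secrets.token_hex(16),   # 128-bit trace id as 32 hex chars
    "id": secrets.token_hex(8),         # 64-bit span id
    "name": "get /studies",
    "kind": "SERVER",
    "timestamp": now_us,
    "duration": 25_000,                 # 25 ms, also in microseconds
    "localEndpoint": {"serviceName": "example-api"},
    "tags": {"workflow": "study-setup", "http.status_code": "200"},
}

req = urllib.request.Request(
    ZIPKIN_URL,
    data=json.dumps([span]).encode(),   # the endpoint accepts a JSON array of spans
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)                  # 202 Accepted on success
```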
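A minimal sketch of the log correlation, assuming the active trace context can be fetched from whatever Zipkin client library the service uses (the get_current_trace_context helper below is hypothetical): a logging filter stamps every record with the trace and span IDs so they show up in each log line.

```python
# Minimal sketch: stamp every log line with the current trace and span id.
# get_current_trace_context() is a hypothetical stand-in for however the
# Zipkin client library in use exposes the active trace context.
import logging

def get_current_trace_context():
    # Placeholder: a real service would read this from its tracer.
    return {"trace_id": "463ac35c9f6413ad48485a3953bb6124", "span_id": "a2fb4a1d1a96d312"}

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        ctx = get_current_trace_context()
        record.trace_id = ctx["trace_id"]
        record.span_id = ctx["span_id"]
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s [trace_id=%(trace_id)s span_id=%(span_id)s] %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("example-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("study saved")  # the log line now carries the ids needed to find the trace
```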
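The alerting service can be approximated as follows. This sketch uses Zipkin's query API (/api/v2/traces) rather than the direct Elasticsearch reads our real service performs, and the URL, service name, and latency objective are placeholders.

```python
# Minimal sketch: flag recent traces that exceed a latency objective.
# Uses Zipkin's query API for brevity; names and thresholds are placeholders.
import json
import urllib.parse
import urllib.request

ZIPKIN_URL = "http://zipkin.example.internal:9411"   # placeholder
SERVICE = "example-api"                               # placeholder
OBJECTIVE_MS = 500                                    # placeholder latency objective

params = urllib.parse.urlencode({
    "serviceName": SERVICE,
    "minDuration": OBJECTIVE_MS * 1000,   # the API takes microseconds
    "lookback": 15 * 60 * 1000,           # last 15 minutes, in milliseconds
    "limit": 100,
})
with urllib.request.urlopen(f"{ZIPKIN_URL}/api/v2/traces?{params}") as resp:
    traces = json.load(resp)              # a list of traces (each a list of spans)

for trace in traces:
    root = min(trace, key=lambda s: s.get("timestamp", 0))  # earliest span in the trace
    print(f"SLOW: trace {root['traceId']} took "
          f"{root.get('duration', 0) / 1000:.0f} ms, objective is {OBJECTIVE_MS} ms")
```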

Goals

  • Near term: Improve our tooling that consumes tracing information. Add more interesting tags to track more complex workflows.
  • Near term: Add user/API key information to traces so they can easily be mapped to clients.
  • Mid term: Extend tracing to services that do not yet use it.
  • Mid term: Improve tracing of async and background processes.
  • Long term: Move to 128-bit trace IDs so we are compatible with AWS X-Ray (see the sketch after this list).
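
For context on the long-term goal: an AWS X-Ray trace ID embeds the epoch seconds of the trace start in its first 32 bits, so a 128-bit Zipkin trace ID laid out the same way can be carried across both systems. A minimal sketch, assuming that layout (32-bit epoch-seconds prefix plus 96 random bits):

```python
# Minimal sketch: generate a 128-bit Zipkin trace id whose layout matches
# AWS X-Ray's (first 32 bits = epoch seconds, remaining 96 bits random),
# and show the equivalent X-Ray root id for comparison.
import secrets
import time

def new_xray_compatible_trace_id():
    epoch = format(int(time.time()), "08x")   # 8 hex chars of epoch seconds
    rand = secrets.token_hex(12)              # 24 hex chars (96 random bits)
    return epoch + rand                       # 32 hex chars = 128 bits

trace_id = new_xray_compatible_trace_id()
xray_root = f"1-{trace_id[:8]}-{trace_id[8:]}"  # X-Ray's "1-<epoch>-<random>" form
print(trace_id, xray_root)
```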

Current Status (April 2020)

Per the Elasticsearch documentation, instance store is recommended for ES instances, so we've switched to i3.xlarge.elasticsearch:

  • Zipkin (version 2.21.0)
    • Instance: Running on AWS ECS with 1GB of memory, 2 services
  • Elasticsearch (version 7.1)
    • Instance: i3.xlarge.elasticsearch (950GB) x 3
    • Number of shards: 3
    • Dedicated master instances: enabled with c4.large.elasticsearch x 3
  • Retention days: 30 days
  • Total cost: $1,864/month

Current Status (March 2019)

We have experienced some performance issues: we were able to ingest 200 spans/second with 0.3% of spans being dropped. At 190 spans/second the drop rate was 0.2%, so we think we were close to the ingestion limit for the following setup (all in AWS terms):

  • Zipkin (version 2.9.4)
    • Instance: Running on AWS ECS with 4GB of memory, 1 service
  • Elasticsearch (version 5.5)
    • Instance: r4.large.elasticsearch x 4
    • Storage: 200GB (EBS) / instance
    • Number of shards: 1
    • Dedicated master instances: disabled
  • Retention days: 100 days
  • Total cost: $800/month

To resolve the issues, we upgraded instances and reduced the retention period as follows:

  • Zipkin (version 2.9.4)
    • Instance: Running on AWS ECS with 8GB of memory, 2 services
  • Elasticsearch (version 5.5)
    • Instance: m4.xlarge.elasticsearch x 3
    • Storage: 500GB (EBS) / instance
    • Number of shards: 3
    • Dedicated master instances: enabled with c4.large.elasticsearch x 3
  • Retention days: 30 days
  • Total cost: $1,600/month

Current Status (November 2018)

  • Some legacy services still do not use Zipkin but all new infrastructure does.
  • Currently we are ingesting 16 GB/day and storing data for 100 days. We sample 100% of traces.
  • New services use Zipkin so we are growing organically with the organization.
  • The Elasticsearch cluster is used by Zipkin and other tools built on top of that data. Logs are stored separately.