
ZIPKIN : Hotels.com

Created by Daniel on Feb 12, 2019

Introduction and Scope

At Hotels.com we have engineering teams spread across multiple locations that run and operate hundreds of services. Most of our infrastructure is owned and maintained by us, but, being part of Expedia Group, there is a lot that we share across the different brands (including some of the business backend services).

That said, most of our monitoring tools are managed by a dedicated Hotels.com team. These include tools like Graphite and Grafana, Prometheus, AppDynamics, and now Zipkin and Haystack (https://github.com/ExpediaDotCom/haystack) as well. There are a few others (Splunk, etc.) that are operated by a central team within Expedia Group.

Distributed tracing was introduced in the company by a small team in an attempt to identify performance bottlenecks in one of our services. This immediately added value by pointing out the worst offenders and by making it easier to identify performance improvements, such as network calls that could be done in parallel.

From our experience, and despite the fact that distributed tracing is a team sport, in most cases you can add value even when this is done in isolation, i.e., when you integrate distributed tracing into a single service.

System Overview

Our services are primarily written in Java but we also have a bit of Scala and Kotlin.

Services are spread across datacenters (some owned and operated by Expedia Group, some on AWS). We use Kubernetes as our container orchestrator on AWS and a number of custom-built tools and scripts to operate and deploy services to our own datacenters.

For instrumentation we mainly use Spring Cloud Sleuth and Brave (one team is using zipkin-finagle), as most services use Spring Boot or plain Spring.
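To give a flavour of what that looks like, below is a minimal sketch of a Brave setup reporting spans over HTTP in Zipkin format. The service name, span name and endpoint are illustrative rather than our actual values, and Sleuth users get the equivalent wiring through auto-configuration instead of writing this by hand.

```java
import brave.ScopedSpan;
import brave.Tracer;
import brave.Tracing;
import zipkin2.reporter.AsyncReporter;
import zipkin2.reporter.okhttp3.OkHttpSender;

public class TracingSetup {
  public static void main(String[] args) throws Exception {
    // Send spans in Zipkin JSON v2 format over HTTP (endpoint is an example value).
    OkHttpSender sender = OkHttpSender.create("http://localhost:9411/api/v2/spans");
    AsyncReporter<zipkin2.Span> reporter = AsyncReporter.create(sender);

    // Build the tracer; "checkout-service" is a hypothetical service name.
    Tracing tracing = Tracing.newBuilder()
        .localServiceName("checkout-service")
        .spanReporter(reporter)
        .build();
    Tracer tracer = tracing.tracer();

    // Trace a unit of work; nested client calls would become child spans.
    ScopedSpan span = tracer.startScopedSpan("price-lookup");
    try {
      // ... call a downstream service ...
    } finally {
      span.finish();
    }

    // Flush and release resources on shutdown.
    tracing.close();
    reporter.close();
    sender.close();
  }
}
```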

For the analysis of distributed tracing data, we use Haystack. If this is the first time you're hearing this name, Haystack is a distributed tracing tool that was open-sourced by Expedia Group. It was inspired by Zipkin and as such overlaps heavily with it in terms of features, but it adds extra bits on top, such as adaptive alerting and trend metrics.

It was common practice for service owners to distribute binaries/SDKs instead of documenting their APIs. This means that integration with Zipkin wasn't always straightforward and would sometimes require coordination between different teams. Depending on the number of dependencies you had, instrumenting all the HTTP clients of a given service could take anywhere from a few days to several months.

To work around this problem, and to provide a minimum of observability during this period, some teams opted to leverage Hystrix (https://github.com/Netflix/Hystrix/issues), which is our de facto circuit breaker. The solution involves registering a command execution hook as a Hystrix plugin that starts and stops spans when a command is executed. This is very similar to what Cloud Sleuth already provides, but we had some issues because we were already using custom Hystrix plugins and Sleuth was wiping them out.
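A rough sketch of that hook is shown below, assuming Brave as the underlying tracer. The class and span names are illustrative, and a production version also has to preserve any previously registered plugins, which is exactly where we hit the issue described above.

```java
import brave.Span;
import brave.Tracer;
import com.netflix.hystrix.HystrixInvokable;
import com.netflix.hystrix.HystrixInvokableInfo;
import com.netflix.hystrix.exception.HystrixRuntimeException.FailureType;
import com.netflix.hystrix.strategy.HystrixPlugins;
import com.netflix.hystrix.strategy.executionhook.HystrixCommandExecutionHook;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Starts a span when a Hystrix command starts and finishes it on success or error. */
public class TracingCommandExecutionHook extends HystrixCommandExecutionHook {

  private final Tracer tracer;
  // Hystrix commands can hop threads, so spans are keyed by command instance, not by thread.
  private final Map<HystrixInvokable<?>, Span> spans = new ConcurrentHashMap<>();

  public TracingCommandExecutionHook(Tracer tracer) {
    this.tracer = tracer;
  }

  /** Call once at startup, before any command runs. */
  public static void register(Tracer tracer) {
    HystrixPlugins.getInstance()
        .registerCommandExecutionHook(new TracingCommandExecutionHook(tracer));
  }

  @Override
  public <T> void onStart(HystrixInvokable<T> command) {
    // Use the command key as the span name when available.
    String name = command instanceof HystrixInvokableInfo
        ? ((HystrixInvokableInfo<?>) command).getCommandKey().name()
        : "hystrix-command";
    spans.put(command, tracer.nextSpan().name(name).start());
  }

  @Override
  public <T> void onSuccess(HystrixInvokable<T> command) {
    Span span = spans.remove(command);
    if (span != null) span.finish();
  }

  @Override
  public <T> Exception onError(HystrixInvokable<T> command, FailureType failureType, Exception e) {
    Span span = spans.remove(command);
    if (span != null) span.error(e).finish();
    return e;
  }
}
```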

All of our tracing data is sent to a collector (https://github.com/HotelsDotCom/pitchfork) in Zipkin's JSON v2 format, which we then push to Zipkin (via HTTP) and translate and forward to Haystack (via Kafka).

The reason we added this collector was to allow our development teams to use stable, well-known and well-documented Zipkin libraries while still benefiting from the features that Haystack adds.
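In practice, a Sleuth-instrumented Spring Boot service only needs its standard Zipkin reporter pointed at the collector rather than at a Zipkin server. A minimal sketch, assuming Sleuth 2.x property names; the hostname and port are placeholders for wherever the pitchfork deployment lives:

```properties
# application.properties of a Sleuth-instrumented service (hostname/port are placeholders)
spring.zipkin.baseUrl=http://pitchfork.internal.example:9411/
```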

Zipkin is also easier to spin up on a developer's laptop than Haystack, which is useful when we want to run tests locally and capture tracing data.
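For example, a throwaway Zipkin instance for local testing is a single container, using the standard Zipkin quick-start image:

```sh
# Run Zipkin locally; the UI and the JSON v2 span endpoint are both served on port 9411.
docker run -d -p 9411:9411 openzipkin/zipkin
```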

The entire Haystack and Zipkin stacks run on a Kubernetes cluster. The Zipkin backend is being decommissioned, as we are able to get the same information from Haystack and hardly anyone is using it anymore.

On the Haystack backend we use Kafka to feed data into the system. The data is then read by subcomponents that index the traces, extract and process metrics, create and update a service graph of the dependencies, and provide adaptive alerting (we are not yet using this last feature).

Data is stored for one day and can be queried by service name, operation name, trace ID, or a handful of other fields that we explicitly whitelist (not every tag key is whitelisted).

Goals

  • The sampling decision is made in our multiple front-end applications and is set to 10% (see the configuration sketch after this list). In the short term we plan to move this further to the edge and start our traces in our reverse proxy, Styx (https://github.com/HotelsDotCom/styx/), which is effectively the single entry point into Hotels.com.

  • Increase retention from 1 day to 2 or 3 days

  • Expand coverage to more services. There is good momentum in terms of adoption: our journey with distributed tracing started fairly recently and in a short period of time we have covered about a third of our systems.

  • There is a plan to ditch our self-hosted Haystack and to submit data to one that will be shared by the multiple brands of Expedia Group. This means less overhead for the team that manages this infrastructure, and deeper and more complete traces, as some of our systems also make calls to Expedia Group's backend services.

  • We're trialing a service mesh solution, which could potentially help remove some of the tracing boilerplate code from our apps. The expectation is that a lot of teams will continue using Spring Cloud Sleuth and similar libraries, because they make local testing easier and simplify the propagation of tracing context and tracing HTTP headers.
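As referenced in the first bullet above, the 10% sampling decision boils down to a single property in the Sleuth-instrumented front-end applications; a sketch assuming Sleuth 2.x, with downstream services honouring whatever decision arrives in the propagated headers:

```properties
# Front-end application.properties: keep roughly 1 in 10 traces (Sleuth 2.x)
spring.sleuth.sampler.probability=0.1
```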
