diff --git a/slides/techmeetup/apdex.png b/slides/techmeetup/apdex.png new file mode 100644 index 0000000..3de2259 Binary files /dev/null and b/slides/techmeetup/apdex.png differ diff --git a/slides/techmeetup/breakdown.png b/slides/techmeetup/breakdown.png new file mode 100644 index 0000000..044423c Binary files /dev/null and b/slides/techmeetup/breakdown.png differ diff --git a/slides/techmeetup/dashboard.gif b/slides/techmeetup/dashboard.gif new file mode 100644 index 0000000..38e71cd Binary files /dev/null and b/slides/techmeetup/dashboard.gif differ diff --git a/slides/techmeetup/details.png b/slides/techmeetup/details.png new file mode 100644 index 0000000..90b9bd9 Binary files /dev/null and b/slides/techmeetup/details.png differ diff --git a/slides/techmeetup/errors.png b/slides/techmeetup/errors.png new file mode 100644 index 0000000..fb3a305 Binary files /dev/null and b/slides/techmeetup/errors.png differ diff --git a/slides/techmeetup/flamegraph.png b/slides/techmeetup/flamegraph.png new file mode 100644 index 0000000..200ad8a Binary files /dev/null and b/slides/techmeetup/flamegraph.png differ diff --git a/slides/techmeetup/flamegraph2.png b/slides/techmeetup/flamegraph2.png new file mode 100644 index 0000000..58eb17f Binary files /dev/null and b/slides/techmeetup/flamegraph2.png differ diff --git a/slides/techmeetup/flamegraph3.png b/slides/techmeetup/flamegraph3.png new file mode 100644 index 0000000..a5674ba Binary files /dev/null and b/slides/techmeetup/flamegraph3.png differ diff --git a/slides/techmeetup/latency1.png b/slides/techmeetup/latency1.png new file mode 100644 index 0000000..e20a923 Binary files /dev/null and b/slides/techmeetup/latency1.png differ diff --git a/slides/techmeetup/latency2.png b/slides/techmeetup/latency2.png new file mode 100644 index 0000000..01ec68f Binary files /dev/null and b/slides/techmeetup/latency2.png differ diff --git a/slides/techmeetup/latency3.png b/slides/techmeetup/latency3.png new file mode 100644 index 
0000000..d7386eb Binary files /dev/null and b/slides/techmeetup/latency3.png differ diff --git a/slides/techmeetup/logo.svg b/slides/techmeetup/logo.svg new file mode 100644 index 0000000..36d2599 --- /dev/null +++ b/slides/techmeetup/logo.svg @@ -0,0 +1,5 @@ + + + + + diff --git a/slides/techmeetup/minimap.png b/slides/techmeetup/minimap.png new file mode 100644 index 0000000..ce742f8 Binary files /dev/null and b/slides/techmeetup/minimap.png differ diff --git a/slides/techmeetup/no.jpg b/slides/techmeetup/no.jpg new file mode 100644 index 0000000..a85649c Binary files /dev/null and b/slides/techmeetup/no.jpg differ diff --git a/slides/techmeetup/silence.png b/slides/techmeetup/silence.png new file mode 100644 index 0000000..93bc8fd Binary files /dev/null and b/slides/techmeetup/silence.png differ diff --git a/slides/techmeetup/slides.md b/slides/techmeetup/slides.md new file mode 100644 index 0000000..c3f0d1e --- /dev/null +++ b/slides/techmeetup/slides.md @@ -0,0 +1,296 @@ +--- +layout: slides +title: Techmeetup +description: Techmeetup todo +transition: slide +permalink: /slides/techmeetup/ +--- + +
+Application Observability: A Developer’s Perspective + +Odin (Ondřej Popelka) +
+{% comment %} +Application Observability from the developers' point of view +{% endcomment %} + + +
+### What I Do +- Senior Backend Engineer at Keboola +- Architecture, Service Design, API Design, Resources setup via Terraform, CI pipelines, Coding, Monitoring, Operations, 24/7 Support, Vacuuming, Washing the dishes, ... +
+ + +
+## What We Do + +- Data Operating System, Data Stack, Data processing platform. +- If you have a (big) data problem, we have likely already solved it. +- If you have 2+ information systems in your company that do not talk to each other, we make them talk. + +Keboola Logo + +
+{% comment %} + If you have this kind of problem, you know immediately that we can solve it for you. If you don't have + this kind of problem, it is very difficult to explain to you what we do. +{% endcomment %} + + +
+## What does DevOps do? + +- SRE gives us the Kubernetes cluster. +- SRE gives us the networking (private clusters). +- SRE gives us the monitoring tools. +- SRE makes sure that we do not do anything *obviously stupid*. +- The UI consumes the API blueprint. + +--- + +**For everything else, there is DevOps** +
+ + +
+## Random Numbers + +- 20 domain services (PHP8, Node.js, Go + lots of Python & PHP7 + bits of Java, PHP5, R) +- 1 monolith service (~7 more domains) +- 1000+ integrations +- 120+ Kubernetes nodes, 9 production stacks, 3 clouds (AWS, Azure, GCP) +- 280+ requests per second, 24+ million / day +- 260,000+ asynchronous jobs a day -- ranging from 1 second to 24 hours +- 1,500,000+ lines of code, 13 developers, 4 SREs +
+ +
+## Environment + +- High heterogeneity +- High load variability +- High request length variability +- Uneven distribution of requests + +--- + +### Must have +- High automation, high reliability, high observability + +
+ +
+## Easy part + +### Latency is the King +- Latency is what the user feels. +- Measure the Nth percentile (p90, p75, p50). +- A big difference between the percentiles means that the service is unstable. + +![Good latency](/slides/techmeetup/latency3.png) +
+ +{% comment %} +Note that the data is still heavily aggregated. The obvious way to aggregate is to take the average, +which doesn't work very well; a percentile is better. +p90 means that 90% of requests are faster than this value and 10% are slower. +{% endcomment %} + +
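To make the note about averages versus percentiles concrete, here is a minimal sketch using the nearest-rank method. The latency samples are made up for illustration and the `percentile` helper is hypothetical; it is not taken from any monitoring tool, which may use a different interpolation.

```python
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it (p is an integer)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds with one slow outlier.
latencies_ms = [100, 110, 120, 130, 140, 150, 160, 170, 180, 5000]

mean_ms = statistics.mean(latencies_ms)  # dragged up by the single outlier
p50 = percentile(latencies_ms, 50)
p75 = percentile(latencies_ms, 75)
p90 = percentile(latencies_ms, 90)

print(f"mean={mean_ms} p50={p50} p75={p75} p90={p90}")
```

With these samples the mean is several times larger than p90, even though nine out of ten requests finished in under 200 ms, which is why the average is a poor summary of what users feel.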
+## Still Easy part +- Graph of obviously bad latency + +![Bad latency](/slides/techmeetup/latency1.png) + +
+{% comment %} +APDEX measures user satisfaction with a metric for which a target performance has been set. +{% endcomment %} + + +
+## Still Easy part ? +- Graph of obviously bad latency: +No +![Bad latency](/slides/techmeetup/latency2.png) +- APDEX Monitoring -- Application Performance Index ![Apdex](/slides/techmeetup/apdex.png) +
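As a rough illustration of what an Apdex score expresses, the sketch below uses the standard Apdex formula: requests at or below the target time T count as satisfied, requests between T and 4T count as tolerating (weighted by one half), and everything slower counts as frustrated. The sample latencies and the 0.5 s target are assumptions for illustration, not values from the services described here.

```python
def apdex(latencies_s, target_s):
    """Apdex score in [0, 1] for response times in seconds and target T."""
    if not latencies_s:
        raise ValueError("no samples")
    satisfied = sum(1 for t in latencies_s if t <= target_s)
    tolerating = sum(1 for t in latencies_s if target_s < t <= 4 * target_s)
    # Frustrated requests (slower than 4T) contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(latencies_s)


# Hypothetical response times in seconds.
samples = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 1.0, 1.9, 2.5, 3.0]

score = apdex(samples, target_s=0.5)
print(f"apdex={score:.2f}")
```

A single number like this hides the shape of the latency graph, but it is easy to alert on because the target encodes what "good enough" means for that endpoint.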
+ + +
+## Error rate is the Queen + +- The very first metric is Error rate +- First to look at when something goes wrong +- First to monitor with APDEX +No +![Error-ish service](/slides/techmeetup/errors.png) +
+ + +
+## "Weird" API endpoints + +- If the request fails, it's actually a valid situation. +- Error rate can be very high, but **never 100%**. +- Always monitor individual endpoints **not services**! +- "Negative" metric -- there must be at least some requests succeeding. +- I do appreciate tips on how to monitor these. +
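One possible way to watch such endpoints is to flip the alert: instead of a threshold on the error rate, flag an endpoint that received enough requests in a time window and had no successful one. The sketch below is only an assumption about how this could be done; the endpoint paths, the `min_requests` threshold, and treating status codes below 400 as success are all hypothetical choices.

```python
from collections import Counter


def endpoints_without_success(requests, min_requests=10):
    """Return endpoints that saw at least min_requests requests in the
    window but not a single successful one.

    requests: iterable of (endpoint, http_status) pairs from one time window.
    """
    totals = Counter()
    successes = Counter()
    for endpoint, status in requests:
        totals[endpoint] += 1
        if status < 400:  # assumption: anything below 400 counts as success
            successes[endpoint] += 1
    return sorted(
        endpoint
        for endpoint, count in totals.items()
        if count >= min_requests and successes[endpoint] == 0
    )


# Hypothetical window: a "weird" endpoint that mostly fails but is healthy,
# a broken endpoint, and a rarely used endpoint with too few requests to judge.
window = (
    [("/token/verify", 401)] * 11
    + [("/token/verify", 200)]
    + [("/jobs", 500)] * 10
    + [("/rare", 500)] * 2
)

print(endpoints_without_success(window))
```

Here `/token/verify` has a 92% error rate and stays quiet, while `/jobs` is flagged because nothing succeeded. Grouping by endpoint rather than by service is what keeps the healthy-but-noisy endpoint from masking the broken one.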
+ +
+## Diagnosing + +Old Lady + +- Latency breakdown; +- Breakdown of time spent in "3rd party" services: + +![Latency Breakdown](/slides/techmeetup/breakdown.png) +
+ + +
+## FlameGraph is God + +- It is absolutely crucial that flame graphs are cross-service. +- One request: +![Flamegraph](/slides/techmeetup/flamegraph.png) + +
+ + +
+## FlameGraph cont. + +- Breakdown of time spent by the business logic: +![Flamegraph](/slides/techmeetup/flamegraph2.png) + +
+ + +
+## FlameGraph cont. + +- Includes time spent in the DB by 3rd party services: +![Flamegraph](/slides/techmeetup/flamegraph3.png) + +
+ + +
+## What are good metrics ? + +Silence +- Incident-proven: + - 250+ incidents per month, + - Fail, fail, fail, succeed... + +- After an incident: + - Find which metric/alarm should've triggered; + - Find metrics that shouldn't have triggered; + +--- +Metrics give suspicion +× Flamegraphs and traces **give insight** + +
+{% comment %} +metric != alarm != escalation policy +{% endcomment %} + + +
+## How to get a good metric? + +- Must be representative of the end-user experience. +- At the same time it can be totally Meaningless™. +- E.g. "Iteration time": + - When divided by the number of jobs, it represents the upper bound of the time between when a job is received on the internal queue and when it is forwarded to the worker to be switched to the processing state and picked up by the processing engine. + - It should be between 0.1 and 5 + - Why not 7? + +- Beware of changes in code that affect the metric! +
+ + +
+## When to watch the metrics? + +- When incidents are triggered; +- Ideally every second morning; +- After deploys and during database migrations; +![Versions](/slides/techmeetup/versions.png) +
+ + +
+## What are the best dashboards? + +Eventually all end up like this: + +![Dashboard](/slides/techmeetup/dashboard.gif) + +**The best ones are those that do not eat your battery when you're on 24/7** + +
+ + + +
+## Who's watching the costs? + +--- + +### Budget alerts +- Everything else is wrong. +- Applies to personal pet projects too. +- Budget alerts also apply to the **cost of the monitoring**. + +
+ + +
+## Hard Part + +- Asynchronous jobs + - Containers that run from seconds up to days + +- Endless loop + - Non-interactive daemons that run for days to months + - Queue workers, stream processors, ... + +--- +Some other time... +
+ + + +
+## Thanks + +Questions & Comments ? + +[linkedin.com/in/odinuv](https://www.linkedin.com/in/odinuv) + +--- + +Keboola Logo +Vacancy + +[keboola.com/about/jobs](https://www.keboola.com/about/jobs) + +
+ diff --git a/slides/techmeetup/vacancy.gif b/slides/techmeetup/vacancy.gif new file mode 100644 index 0000000..8912225 Binary files /dev/null and b/slides/techmeetup/vacancy.gif differ diff --git a/slides/techmeetup/versions.png b/slides/techmeetup/versions.png new file mode 100644 index 0000000..2690b9d Binary files /dev/null and b/slides/techmeetup/versions.png differ