```{=latex}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{textcomp}
\usepackage{fancyvrb}

\newcommand{\passthrough}[1]{\lstset{mathescape=false}#1\lstset{mathescape=true}}
```

```{=latex}
\title{Observable Python Applications}
\author{Moshe Zadka -- https://cobordism.com}
\date{}

\begin{document}
\begin{titlepage}
\maketitle
\end{titlepage}

\frame{\titlepage}
```

```{=latex}
\begin{frame}
\frametitle{Acknowledgement of Country}

Belmont (in San Francisco Bay Area Peninsula)

Ancestral homeland of the Ramaytush Ohlone

\end{frame}
```

## Introduction to Observability

### What is observability?

Our applications execute a lot of code,
in a way that is invisible.
Is this code working?
Is it working well?
Who is using it?
How?

Observability is the ability to look at data
that tells you what your code is doing.
Mostly,
in this context,
the main problem area is server code in distributed systems.

It is not that observability is unimportant for clients;
it is that clients tend not to be written in Python.
It is not that observability does not matter for,
say,
data science;
it is that the tooling for observability there
(mostly Jupyter and quick feedback)
is different.

```{=latex}
\begin{frame}
\frametitle{What is observability}

It's 5pm,
do you know where your application is?

\end{frame}
```

### Why does observability matter?

So why does observability matter?
Observability is a key part of
the software development life cycle (SDLC).

Shipping an application is not the end,
it is the beginning of a new cycle.
The first step is to know the new version is running well.
Otherwise,
a rollback is probably needed.

Then,
you need to know what is going on
to know what to work on next.
Which features are working well?
Which ones have subtle bugs?

Things fail in weird ways.
Whether it is a natural disaster,
a roll-out of underlying infrastructure,
or an application getting into a strange state,
things can fail at any time,
for any reason.

Outside of the normal SDLC,
you need to know that everything is still running.
If it is not running,
it is important to be able to know how it is failing.

```{=latex}
\begin{frame}
\frametitle{Why observability}

Ship it and forget it?

\end{frame}
```

### Feedback

The first part of observability is getting
*feedback*.
When code gives information about what it is doing,
this can help in many ways.

In a staging or testing environment,
this helps find problems
and,
more importantly,
triage them in a faster way.
This improves the tooling and communication
around the validation step.

When doing a canary deployment,
or changing a feature flag,
feedback is also important.
This lets you know whether to continue,
wait longer,
or roll it back.

```{=latex}
\begin{frame}
\frametitle{Feedback}

Is my code doing what I think it does?

\end{frame}
```

### Monitor

Sometimes you suspect that something has gone wrong.
Maybe a dependent service is having issues,
or maybe Twitter is,
um,
a-Twitter with questions about your site.

Maybe there is a complicated operation in a related system,
and you want to make sure your system is handling it well.
In those cases,
you want to aggregate the data from your observability system
into
*dashboards*.

When writing the application,
these dashboards need to be part of the design criteria.
The only way they will have data to display
is if the application shares it with them.

```{=latex}
\begin{frame}
\frametitle{Monitor}

What is going on right now?

\end{frame}
```

### Alert

Watching dashboards for more than 15 minutes at a time
is like watching paint dry.
No human should be subjected to this.

For this,
we have alerting systems.
Alerting systems compare the observability data
to the expected data,
and send a notification when the two do not match.

Fully delving into incident management is out of scope here.
However,
observable applications are alert-friendly in two ways:

* They produce enough data, with enough quality,
  that high quality alerts can be sent.
* The alert either has enough data,
  or the receiver can easily get the data,
  to help triage the source.
  
High quality alerts have three properties:

* Low false alarms: if the alert fires, there is a problem.
* Low missing alarms: the alert fires whenever there is a problem.
* Timely: the alert is sent quickly to minimize time to recovery.

These three properties are in a three-way conflict.
You can reduce false alarms by raising the threshold of detection,
at the cost of increasing missing alarms.
You can reduce missing alarms by lowering the threshold of detection,
at the cost of increasing false alarms.
You can reduce both false alarms and missing alarms by collecting more data,
at the cost of timeliness.
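
The trade-off between the first two properties can be made concrete with a small sketch.
The latency samples,
the "was there really a problem" labels,
and the `count_alert_errors` helper below
are all made up for illustration;
they do not come from any real alerting system.

```python
# Hypothetical samples: (observed latency in ms, was there really a problem?)
SAMPLES = [
    (120, False),
    (180, False),
    (240, False),
    (260, True),
    (310, False),
    (400, True),
    (520, True),
]


def count_alert_errors(samples, threshold):
    """Count false alarms and missing alarms for a simple threshold alert.

    The alert fires when the observed value is above the threshold.
    """
    false_alarms = sum(
        1 for value, is_problem in samples if value > threshold and not is_problem
    )
    missing_alarms = sum(
        1 for value, is_problem in samples if value <= threshold and is_problem
    )
    return false_alarms, missing_alarms


for threshold in (200, 300, 450):
    false_alarms, missing_alarms = count_alert_errors(SAMPLES, threshold)
    print(f"threshold={threshold}: false={false_alarms} missing={missing_alarms}")
```

With this made-up data,
raising the threshold from 200 to 450
takes false alarms from 2 down to 0
while missing alarms go from 0 up to 2.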

Improving
all three properties at once
is harder to do.
This is where the quality of observability data comes in.
Higher quality data can improve all three.

```{=latex}
\begin{frame}
\frametitle{Alert}

Is there a problem?

\end{frame}
```

## Logging

### Intro to logging

```{=latex}
\begin{frame}
\frametitle{logging}

A print for the modern world

\end{frame}
```

### Logging levels

```{=latex}
\begin{frame}
\frametitle{logging levels}

What should go where?

Consistent semantics

\end{frame}
```

```{=latex}
\begin{frame}
\frametitle{logging level semantics}

\begin{itemize}
\item Error: Alert now
\item Warning: Alert in business hours
\item Info: In Prod
\item Debug: Staging/Explicit
\end{itemize}

\end{frame}
```
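
One way to keep the semantics consistent
is to leave the call sites alone
and let each environment choose its threshold.
This is a minimal sketch using the standard library `logging` module;
the logger names,
the `make_logger` and `handle_request` helpers,
and the messages are assumptions made for illustration.

```python
import io
import logging


def make_logger(level, stream):
    """Build a logger that writes records at or above `level` to `stream`."""
    logger = logging.getLogger(f"myapp.{logging.getLevelName(level)}")
    logger.setLevel(level)
    logger.propagate = False  # keep the example independent of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.handlers.clear()
    logger.addHandler(handler)
    return logger


def handle_request(logger):
    logger.debug("parsed request body")           # staging, or explicitly enabled
    logger.info("request served")                 # kept in production
    logger.warning("cache miss rate is rising")   # alert in business hours
    logger.error("database connection refused")   # alert now


production = io.StringIO()
handle_request(make_logger(logging.INFO, production))

staging = io.StringIO()
handle_request(make_logger(logging.DEBUG, staging))

print(production.getvalue())
print(staging.getvalue())
```

The same `handle_request` code emits three records under the production threshold
and four under the staging one;
only the configuration differs.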

### Logging aggregation

```{=latex}
\begin{frame}
\frametitle{logging aggregation}

All instances $\rightarrow$ Centralized server\pause

Query \& Alert

\end{frame}
```

### Logging queries

```{=latex}
\begin{frame}
\frametitle{logging queries}

Match\pause

Structure

\end{frame}
```

## Metric scraping

```{=latex}
\begin{frame}
\frametitle{Metrics scraping}

Server pull model

\end{frame}
```

### Prometheus as a standard

```{=latex}
\begin{frame}
\frametitle{Prometheus format}

All common metrics aggregation systems support it

\end{frame}
```

```{=latex}
\begin{frame}
\frametitle{Web endpoint}

Integrate into web framework of choice\pause

Use native library

\end{frame}
```
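
As a sketch of what such an endpoint involves,
here is a bare WSGI application
that serves metrics in the Prometheus text exposition format.
It uses only the standard library;
the `METRICS` dictionary,
the metric name `myapp_requests_total`,
and the `metrics_app` function are hypothetical,
and a real application would more likely rely on an established client library
than render the format by hand.

```python
# Hypothetical in-process metric store: metric name -> current value.
METRICS = {"myapp_requests_total": 0}


def render_metrics(metrics):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def metrics_app(environ, start_response):
    """WSGI application: serve metrics on /metrics, 404 elsewhere."""
    if environ.get("PATH_INFO") != "/metrics":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]
    body = render_metrics(METRICS).encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/plain; version=0.0.4; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

Because this is plain WSGI,
it can be mounted next to the application in most Python web frameworks,
or served on its own for a quick test with
`wsgiref.simple_server.make_server("", 8000, metrics_app).serve_forever()`
(the port is an arbitrary choice).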

### Using counters

```{=latex}
\begin{frame}
\frametitle{Counters}

Tick up or die\pause

Hits\pause

Bytes sent


\end{frame}
```
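
A counter only ever goes up while the process is alive,
and starts again from zero when the process restarts
("tick up or die").
The `Counter` class below is a hypothetical,
standard-library-only sketch of those semantics,
not the API of any particular metrics library;
the metric names are also made up.

```python
class Counter:
    """A monotonically increasing value, reset only by a process restart."""

    def __init__(self, name):
        self.name = name
        self._value = 0

    def inc(self, amount=1):
        # Counters never go down; reject anything that would decrease them.
        if amount < 0:
            raise ValueError("counters can only increase")
        self._value += amount

    @property
    def value(self):
        return self._value


hits = Counter("myapp_hits_total")
bytes_sent = Counter("myapp_bytes_sent_total")

# Simulate three responses of different sizes.
for response_size in (512, 2048, 128):
    hits.inc()
    bytes_sent.inc(response_size)

print(hits.name, hits.value)
print(bytes_sent.name, bytes_sent.value)
```

Because the value only increases,
the aggregation side can turn it into a rate,
and can treat a drop back toward zero as a restart rather than as negative traffic.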

### Using gauges

```{=latex}
\begin{frame}
\frametitle{Gauges}

Point in time measurement\pause

Total allocated memory

\end{frame}
```
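
Unlike a counter,
a gauge can go up or down,
and only its value at the moment of the scrape matters.
The `Gauge` class below is a hypothetical sketch,
not any library's API.
It reads memory traced by the standard library `tracemalloc` module,
which is only one possible stand-in for "total allocated memory".

```python
import tracemalloc


class Gauge:
    """A point-in-time measurement: it can go up or down between reads."""

    def __init__(self, name, read):
        self.name = name
        self._read = read  # called on every read to get the current value

    @property
    def value(self):
        return self._read()


def traced_memory_bytes():
    """Bytes currently allocated by Python, as seen by tracemalloc."""
    current, _peak = tracemalloc.get_traced_memory()
    return current


tracemalloc.start()
allocated = Gauge("myapp_allocated_bytes", traced_memory_bytes)

before = allocated.value
payload = bytearray(1_000_000)  # allocate roughly 1 MB
during = allocated.value
del payload                     # release it again
after = allocated.value

print(before, during, after)
```

The exact numbers depend on the interpreter,
but the reading taken while `payload` is alive is higher than the readings on either side of it,
which is the behavior a counter cannot express.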

### Using enums

```{=latex}
\begin{frame}
\frametitle{Enums}

Different states\pause

0/1 mutually exclusive gauges

\end{frame}
```
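
The "0/1 mutually exclusive gauges" idea can be sketched as follows.
The `EnumMetric` class,
the `state` label,
and the lifecycle states are assumptions made for illustration;
real client libraries have their own names for these.

```python
class EnumMetric:
    """Exactly one state is current; it is exposed as a set of 0/1 gauges."""

    def __init__(self, name, states):
        self.name = name
        self._states = list(states)
        self._current = self._states[0]

    def state(self, new_state):
        if new_state not in self._states:
            raise ValueError(f"unknown state: {new_state!r}")
        self._current = new_state

    def samples(self):
        """Return one (labelled name, value) pair per state."""
        return [
            (f'{self.name}{{state="{state}"}}', int(state == self._current))
            for state in self._states
        ]


lifecycle = EnumMetric("myapp_lifecycle", ["starting", "serving", "draining"])
lifecycle.state("serving")

for sample_name, value in lifecycle.samples():
    print(sample_name, value)
```

Exactly one of the samples is 1 at any time,
so a dashboard can sum them across instances
to show how many instances are in each state.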

## Analytics

```{=latex}
\begin{frame}
\frametitle{Analytics}

Per-transaction measurements

\end{frame}
```

### OpenTelemetry: Strictly in the Future

```{=latex}
\begin{frame}
\frametitle{OpenTelemetry}

Looks good...\pause

but not there yet

\end{frame}
```

### Structured Logging

```{=latex}
\begin{frame}
\frametitle{Structured Logging}

Collect data in per-transaction object\pause

Send it to log

\end{frame}
```
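
A minimal sketch of this pattern,
using only the standard library:
fields are collected on a per-transaction object as the request is handled,
then written as one JSON record at the end.
The `Transaction` class,
the logger name,
and the field names are all hypothetical.

```python
import io
import json
import logging
import time


class Transaction:
    """Collects fields for one transaction and logs them as a single record."""

    def __init__(self, logger, **fields):
        self._logger = logger
        self._fields = dict(fields)
        self._start = time.monotonic()

    def add(self, **fields):
        self._fields.update(fields)

    def finish(self):
        self._fields["duration_seconds"] = round(time.monotonic() - self._start, 6)
        # One JSON object per line keeps the record easy to query later.
        self._logger.info(json.dumps(self._fields, sort_keys=True))


# Log to an in-memory stream so the example is self-contained.
stream = io.StringIO()
logger = logging.getLogger("myapp.transactions")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))

transaction = Transaction(logger, path="/checkout", method="POST")
transaction.add(user_id=42, items=3)
transaction.add(status=200)
transaction.finish()

record = json.loads(stream.getvalue())
print(record["path"], record["status"])
```

Emitting one record per transaction,
rather than one line per step,
means each question about a transaction can be answered from a single structured record.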

## Error tracking

```{=latex}
\begin{frame}
\frametitle{Error Tracking}

Detailed data about errors\pause

Usually exceptions

\end{frame}
```

### Using Sentry

```{=latex}
\begin{frame}
\frametitle{Sentry}

For most non-trivial cases,
run yourself:\pause

Detailed error data can be sensitive!

\end{frame}
```

## Summary

### Fast, Safe, Repeatable: Choose All Three

```{=latex}
\begin{frame}
\frametitle{Data, Not Speculation}

Observability $\rightarrow$ Knowledge

\end{frame}
```

### Upfront Investment Pays Off

```{=latex}
\begin{frame}
\frametitle{Return on Investment}

\begin{itemize}
\item Testing \pause
\item Monitoring \pause
\item On-boarding
\end{itemize}

\end{frame}
```

```{=latex}
\end{document}
```