# Prometheus

> __Prometheus is an open source monitoring and alerting toolkit gathering and processing data locally__

A few facts about it:
- Written in Go (`golang`)
- Provides client libraries for different languages (including Python), __but__...
- We use our browser to query data using a Prometheus-specific query language called PromQL
- The built-in web UI (__dashboard__) is served on `localhost:9090` by default

![](images/prometheus_architecture.png)

## What's going on above?

- Prometheus scrapes data (like metrics):
    - from short-lived jobs via the `Pushgateway`
    - from long-running jobs directly
- __All samples (values with timestamps) are stored locally__ (together with necessary metadata)
- Runs predefined rules on collected data:
    - Gather and aggregate new records
    - Process them and send alerts
- Consumers of Prometheus's API (for example the web UI) are used to visualize data

## When to use it?

- __Recording numerical time series__, this could be:
    - various training metrics gathered across epochs/batches
    - hardware related data
    - network traffic and other statistics
- __To debug infrastructure during network outages etc.__ (if interested, you can read about Slack's outage [here](https://slack.engineering/slacks-outage-on-january-4th-2021/))

> __Data is stored locally (on each node) AND DOES NOT RELY ON NETWORK STORAGE__

Because of this, if something fails, you always have access to the data __on each node__ (as each Prometheus server is self-contained and independent).

__When shouldn't we use it?__

- __You need very detailed data coming in really fast__ (per-request metrics with a lot of requests from users, exact billing when every millisecond counts etc.)

That's because:
- Prometheus is designed to scrape data every few seconds
- Data is kept in local storage, which can fill up really fast (in this case think about large remote storage and data rotation)
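
To get a feel for how quickly local storage fills up, here is a rough back-of-the-envelope sketch. It follows the rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample). The figure of `2` bytes per sample and the example workload are assumptions made for illustration, not measurements.

```python
def estimate_disk_bytes(retention_seconds: float,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough capacity estimate: retention * ingestion rate * bytes per sample."""
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical setup: 1000 series scraped every 15 seconds, kept for 15 days.
series = 1000
scrape_interval_s = 15
retention_s = 15 * 24 * 60 * 60

estimate = estimate_disk_bytes(retention_s, series / scrape_interval_s)
print(f"{estimate / 1e6:.1f} MB")
```

Per-request metrics multiply the number of series (and therefore samples per second) very quickly, which is why this estimate grows fast in the scenario described above.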

## Data

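> __All Prometheus data is stored as timestamped time series differentiated by metric name and (optionally) labels__

- Metric names should consist of ASCII letters, digits, underscores and colons; label names of ASCII letters, digits and underscores (neither should start with a digit)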
- __Samples are `float64` (`double`) numerical types__
- __Timestamps are in milliseconds__

```bash
api_http_requests_total{method="POST", handler="/messages"}
```

In the above case:
- `api_http_requests_total` - metric name
- `method="POST"` - label `method` which is equal to `"POST"`
- `handler="/messages"` - label `handler` which is equal to `"/messages"`
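
A minimal sketch of how a metric name and its labels identify one time series is given here. The regular expressions are the classic naming rules from the Prometheus data model documentation; `format_series` is a hypothetical helper written for this lesson, and it does not escape special characters in label values.

```python
import re

# Classic naming rules from the Prometheus data model documentation.
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
LABEL_NAME_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def format_series(name: str, labels: dict) -> str:
    """Render a series identifier such as metric{label="value"}."""
    if not METRIC_NAME_RE.match(name):
        raise ValueError(f"invalid metric name: {name!r}")
    for label in labels:
        if not LABEL_NAME_RE.match(label):
            raise ValueError(f"invalid label name: {label!r}")
    if not labels:
        return name
    # Sort labels so the same label set always renders identically.
    pairs = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{name}{{{pairs}}}"


series = format_series("api_http_requests_total",
                       {"method": "POST", "handler": "/messages"})
print(series)
```

Two samples belong to the same time series only if both the metric name and every label value match.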

## Metrics

> Prometheus provides `4` metric types out of the box

__Note:__ The Prometheus server does not differentiate between metric types (it only stores the data); the types are used by the client libraries (once again, `4` client libraries are provided, for `golang`, `java`, __`python`__, `ruby`)

### Counter

> __Monotonically increasing counter (it can only be reset to zero, e.g. when the process restarts)__

Useful for:
- number of requests
- tasks completed
- __anything which can only grow OR start anew__
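
Because a counter may start anew after a restart, anything consuming it has to cope with sudden drops. The sketch below shows the idea in simplified form: a decrease is treated as a reset to zero. PromQL functions such as `rate()` and `increase()` handle resets in a similar spirit, but they also extrapolate to the edges of the time window, which this hypothetical `increase` helper does not do.

```python
def increase(samples: list) -> float:
    """Total growth of a counter, treating any drop as a reset to zero."""
    total = 0.0
    previous = samples[0]
    for current in samples[1:]:
        if current >= previous:
            total += current - previous
        else:
            # Counter restarted: everything counted since the reset is new.
            total += current
        previous = current
    return total


# Hypothetical samples: the process restarts between 120 and 5.
values = [100.0, 110.0, 120.0, 5.0, 15.0]
print(increase(values))  # 10 + 10 + 5 + 10 = 35.0
```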

### Gauge

> __Single value which can increase & decrease__

Useful for:
- memory usage monitoring
- temperature
- __anything which can change value arbitrarily__

### Histogram

> __Samples observations and groups them in buckets (which you can configure)__

Let's say our histogram metric is named `our_super_histogram`. In this case the following series are exposed for the histogram (__notice we add suffixes to the metric name!__):
- `our_super_histogram_bucket{le="<upper inclusive bound>"}` - cumulative counters for the observation buckets
- `our_super_histogram_sum` - sum of all observed values
- `our_super_histogram_count` - count of observations
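
The word "cumulative" is easy to miss, so here is a small sketch of how the three kinds of series relate to raw observations. The observations and bucket bounds are made up for illustration, and `histogram_series` is a hypothetical helper, not part of any client library.

```python
import math


def histogram_series(observations, upper_bounds):
    """Return (cumulative bucket counts keyed by `le`, sum, count)."""
    bounds = sorted(upper_bounds) + [math.inf]  # the +Inf bucket always exists
    buckets = {bound: 0 for bound in bounds}
    for value in observations:
        for bound in bounds:
            if value <= bound:  # `le` means "less than or equal"
                buckets[bound] += 1
    return buckets, sum(observations), len(observations)


# Hypothetical request durations in seconds and bucket bounds.
buckets, total, count = histogram_series([0.05, 0.3, 0.3, 1.2, 7.0],
                                         [0.1, 0.5, 1.0, 5.0])
for bound, hits in buckets.items():
    le = "+Inf" if bound == math.inf else bound
    print(f'our_super_histogram_bucket{{le="{le}"}} {hits}')
print(f"our_super_histogram_sum {total}")
print(f"our_super_histogram_count {count}")
```

Each observation is counted in every bucket whose bound is at least as large as the value, so the `+Inf` bucket always equals the `_count` series.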

### Summary

> __Samples observations and calculates configurable quantiles over a sliding time window__

- Provides `sum` and `count` like histogram
- `our_super_summary{quantile="<φ>"}` - quantile of observations (one can do something similar for histogram, but using functions)

### Functions

There are also functions one can use in queries, such as `day_of_month()`; we will talk about them in more detail later.

## Jobs and instances

> __An instance is an endpoint you can scrape data from__, usually a single process
>
> __A job is a collection of instances with the same purpose__, usually replicated for reliability/flexibility

See the [jobs and instances documentation](https://prometheus.io/docs/concepts/jobs_instances/) for more details if needed.

# First contact

## Installation

> Prometheus was written in `golang`, hence __it is contained in a single compiled executable__

Due to the above, its installation & deployment is really simple and can be performed efficiently in many different scenarios:

- Go to [their download page](https://prometheus.io/download/) and download appropriate binary
    - if you're on Mac, it's the download labelled Darwin
- Check your OS's packages, some of those are officially maintained (e.g. `prometheus` for Arch Linux)
- __Run prometheus inside Docker container__

We will use the last option. Run command below in your command line:

In [None]:
docker run --rm -p 9090:9090 prom/prometheus

What it does:
- bind container's `9090` port to `9090` port on the localhost
- remove container (`--rm` flag) when you kill the process

__Simply go to [`localhost:9090`](http://localhost:9090) in your browser__ and you should see the following (you can mark every checkbox the same way we do):

![](images/prometheus_webui.png)

Start typing something into the `Expression`, you should see autocompletion with possibilities to query:

![](images/prometheus_webui_autocompletion.png)
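
Everything typed into the `Expression` field can also be sent to the server programmatically. The following is a minimal sketch using only Python's standard library and the `/api/v1/query` endpoint of the Prometheus HTTP API; the helper names are hypothetical, and `instant_query` assumes the container started above is still running on `localhost:9090`.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, expression: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": expression})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def instant_query(base_url: str, expression: str) -> list:
    """Run the query and return the list of results (needs a running server)."""
    url = build_query_url(base_url, expression)
    with urllib.request.urlopen(url, timeout=5) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


# Only builds the URL, so this line works without a server.
print(build_query_url("http://localhost:9090", 'up{job="prometheus"}'))
```

With the server running, `instant_query("http://localhost:9090", "up")` should return one entry per scraped target.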

## Exercise

- Find `20` functions starting with `20` different letters and run them
- Check their graph representation

Understand what those do, consult documentation if needed

# Prometheus configuration

## Configuration file

Until now, we didn't know any details of what just happened, but that's about to change.

> __Prometheus server is configured via [YAML](https://en.wikipedia.org/wiki/YAML) files__

Previously, we ran Prometheus __with the default configuration file__ (one can see it at [`localhost:9090/config`](http://localhost:9090/config)).

Let's take a look at this config file to get a feel for what's possible:

In [None]:
# Section with default values
global:
  scrape_interval: 15s # How frequently to scrape targets from jobs
  scrape_timeout: 10s # How long to wait for a response from an instance before the scrape fails
  evaluation_interval: 15s # How frequently to evaluate rules (recording and alerting rules)
# Prometheus alert manager, left for now
alerting:
  alertmanagers:
  - follow_redirects: true
    scheme: http
    timeout: 10s
    api_version: v2
    static_configs:
    - targets: []
# Specific configuration for jobs
scrape_configs:
- job_name: prometheus # Name of the job, can be anything
  honor_timestamps: true # Use timestamps provided by the scraped target (if present)
  scrape_interval: 15s # As before, but for this job
  scrape_timeout: 10s # ^
  metrics_path: /metrics # Where metrics are located w.r.t. port (localhost:9090/metrics)
  scheme: http # Configures the protocol scheme used for requests (localhost is http)
  follow_redirects: true
  static_configs:
  - targets:
    - localhost:9090

> __Prometheus provides A LOT of configuration options, check all of them [here](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)__

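A few things to keep in mind about the config file:
- __Prometheus can reload its configuration at runtime, without a restart__ (by default this does not happen automatically: send `SIGHUP` to the process or, if the server was started with `--web.enable-lifecycle`, send a `POST` request to `/-/reload`)
- __If the file is not correctly formed, THE CHANGES WILL NOT BE APPLIED__ (the server keeps running with the previous configuration)
- `global` provides the defaults used whenever a setting is not specified elsewhere (especially inside `scrape_configs`)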
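
The fallback from job-level settings to `global` can be pictured as a simple dictionary merge. This is only a sketch of the idea: the dictionaries mirror a subset of the YAML above, the second job is hypothetical, and the real server applies additional validation that is not modelled here.

```python
# Subset of the configuration above, written as plain Python dictionaries.
GLOBAL = {"scrape_interval": "15s", "scrape_timeout": "10s"}
SCRAPE_CONFIGS = [
    {"job_name": "prometheus"},
    {"job_name": "slow_job", "scrape_interval": "30s"},  # hypothetical second job
]


def effective_settings(job: dict, global_defaults: dict) -> dict:
    """Job-level keys win; anything missing falls back to `global`."""
    return {**global_defaults, **job}


for job in SCRAPE_CONFIGS:
    print(effective_settings(job, GLOBAL))
```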

### [scrape_configs](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)

> Specifies __sets of targets__ (what should be scraped) and parameters describing how to do it for each target

Targets can be defined __statically__ or __dynamically__

- `statically` - listed explicitly under `static_configs` as `host:port` pairs
- `dynamically` (service discovery) - using __service discovery mechanisms__ (for example all jobs labeled by `deep-learning`)

A lot of integrations for scraping are provided, some of those are:
- [`consul_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#consul_sd_config) - retrieve scraping targets from HashiCorp's [Consul](https://www.consul.io/) used for service discovery and network setup
- [`dockerswarm_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dockerswarm_sd_config) - used with [Docker's Swarm mode](https://docs.docker.com/engine/swarm/) which allows us to connect and orchestrate many containers as one application
- [`ec2_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#ec2_sd_config) - retrieve scrape targets from EC2 instances
- [`kubernetes_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config) - configure scrape targets from Kubernetes REST API



## Command line

> The Prometheus server instance itself can be configured via command line flags

You can run the command below to see available options:

In [1]:
docker container run --rm prom/prometheus --help

usage: prometheus [<flags>]

The Prometheus monitoring server

Flags:
  -h, --help                     Show context-sensitive help (also try
                                 --help-long and --help-man).
      --version                  Show application version.
      --config.file="prometheus.yml"  
                                 Prometheus configuration file path.
      --web.listen-address="0.0.0.0:9090"  
                                 Address to listen on for UI, API, and
                                 telemetry.
      --web.config.file=""       [EXPERIMENTAL] Path to configuration file that
                                 can enable TLS or authentication.
      --web.read-timeout=5m      Maximum duration before timing out read of the
                                 request, and closing idle connections.
      --web.max-connections=512  Maximum number of simultaneous connections.
      --web.external-url=<URL>   The URL under which Prometheus is externally
                 

                                 above. One of: [debug, info, warn, error]
      --log.format=logfmt        Output format of log messages. One of: [logfmt,
                                 json]



Most notable are:

- `--web.read-timeout=5m` - Maximum duration before timing out read of the request, and closing idle connections
- `--web.max-connections=512` - Maximum number of simultaneous connections
- `--web.enable-lifecycle` - Enable shutdown and reload via HTTP request (requests to `/-/reload` mentioned previously)
- `--web.page-title="..."` - Change the title of the web UI page we opened previously
- `--storage.tsdb.retention.time` - How long should we keep the data (default: `15` days)
- `--storage.remote.read-concurrent-limit=10` - Maximum number of concurrent remote read calls
- `--log.level=info` - Verbosity of Prometheus server, one of `[debug, info, warn, error]` can be set

__Note: Flag names are hierarchical: the category comes first, optionally followed by sub-categories, and the option itself comes last__ (e.g. `--storage.tsdb.retention.time`)

This naming follows from Prometheus's internal structure and from `golang` being the language of choice.
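
The `--web.enable-lifecycle` flag listed above is what makes the `/-/reload` endpoint usable. A minimal sketch of triggering a reload from Python follows; the helper names are hypothetical, and `reload_config` assumes a server started with that flag is listening on `localhost:9090`.

```python
import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Prepare the POST request that asks Prometheus to re-read its config."""
    return urllib.request.Request(f"{base_url.rstrip('/')}/-/reload", method="POST")


def reload_config(base_url: str = "http://localhost:9090") -> int:
    """Send the request and return the HTTP status code (needs a running server)."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=5) as response:
        return response.status


# Only builds the request, so this works without a server.
request = build_reload_request("http://localhost:9090")
print(request.get_method(), request.full_url)
```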

## Exporters

> Exporters export existing metrics from third-party systems and make them available as Prometheus metrics

There are hundreds of exporters currently:
- official - follow best practices and are verified by Prometheus; __always pick them if possible__
- unofficial - working, not verified for best practices or may have overlapping functionalities
- in development - to be released as either of the two above

You can see a [short list here](https://prometheus.io/docs/instrumenting/exporters/) and a much longer one [here](https://github.com/prometheus/prometheus/wiki/Default-port-allocations). Important things to keep in mind:
- __Most exporters occupy ports `9100`-`9999`__, and any new exporter should pick a free port from that range (see the longer list above)
- A few exporters use ports outside the standard range (once again, see the longer list above)
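
To make the idea of an exporter more concrete, here is a toy sketch written with Python's standard library only. It reads numbers from a pretend third-party system and serves them on `/metrics` in the Prometheus text exposition format. The metric names, the values and port `9999` are assumptions for illustration; a real exporter would normally be built with an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values: dict) -> str:
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(values.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def read_third_party_values() -> dict:
    """Stand-in for reading numbers from the system being exported."""
    return {"demo_queue_length": 3, "demo_temperature_celsius": 21.5}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_third_party_values()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9999) -> None:
    """Block and serve /metrics until interrupted."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()


print(render_metrics(read_third_party_values()), end="")
```

Calling `serve()` starts the toy exporter; Prometheus could then scrape it like any other target at `localhost:9999`.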

To learn more, we will now use one of the most common exporters, __[Node exporter](https://github.com/prometheus/node_exporter)__:

> Node Exporter is a Prometheus supported exporter for hardware and OS metrics exposed by *NIX kernels

### Exporting *NIX metrics

> `Node` exporter is a single static binary one can download and run straight from the workstation

- The following commands will download the exporter and unpack the `.tar.gz` archive (__we assume you have a *NIX system!__).
- You may run those commands anywhere you want; the cell below uses a temporary directory.
- If you have Windows, you should use [this exporter](https://github.com/prometheus-community/windows_exporter).
- If you are on macOS, run `brew install node_exporter`.

In [4]:
tmpdir=$(mktemp -d)  # create a temporary working directory
cd "$tmpdir"
wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
tar xvfz node_exporter-1.1.2.linux-amd64.tar.gz
./node_exporter-1.1.2.linux-amd64/node_exporter --help
cd -  # go back to the directory we started from

--2021-04-12 14:50:19--  https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/9524057/715b1a00-7d9f-11eb-8cfa-533c911cfe9a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210412%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210412T125019Z&X-Amz-Expires=300&X-Amz-Signature=afb080ed4433852dd36268cfbf30bf1a961ae75d295d08de7b1ce0128fac6bae&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=9524057&response-content-disposition=attachment%3B%20filename%3Dnode_exporter-1.1.2.linux-amd64.tar.gz&response-content-type=application%2Foctet-stream [following]
--2021-04-12 14:50:19--  https://github-releases.githubusercontent.com/9524057/

                                 test fixtures to use for wifi collector metrics
      --collector.arp            Enable the arp collector (default: enabled).
      --collector.bcache         Enable the bcache collector (default: enabled).
      --collector.bonding        Enable the bonding collector (default:
                                 enabled).
      --collector.btrfs          Enable the btrfs collector (default: enabled).
      --collector.buddyinfo      Enable the buddyinfo collector (default:
                                 disabled).
      --collector.conntrack      Enable the conntrack collector (default:
                                 enabled).
      --collector.cpu            Enable the cpu collector (default: enabled).
      --collector.cpufreq        Enable the cpufreq collector (default:
                                 enabled).
      --collector.diskstats      Enable the diskstats collector (default:
                                 enabled).
      --collector.dr

__Note:__ Add the following alias:
- `unpack` -> `tar xvzf`

> We start three node exporters to show a few more things about Prometheus, __one would be enough to monitor your OS!__

And let's start the `node_exporter`s (run those inside your command line after downloading, check the comments):
- each in separate terminal __OR__
- each in the background (using `&` after the command)

In [6]:
# By default node_exporter binds to port 9100, we change it to 8080, 8081 and 8082
# Run each command in a separate terminal, from the directory containing the unpacked node_exporter binary
# If you're on Mac and installed the package with brew, remove the ./ prefix from the commands below

# ./node_exporter --web.listen-address 127.0.0.1:8080
# ./node_exporter --web.listen-address 127.0.0.1:8081
# ./node_exporter --web.listen-address 127.0.0.1:8082

## Configure Prometheus to scrape exporters

Now that we have our exporters running, we need to set up the `Prometheus` server to scrape data from them.

Let's take a look at the config file (and save it as `prometheus-nodes.yml`):

In [None]:
---
global:
  scrape_interval: '15s'  # By default, scrape targets every 15 seconds.
  external_labels:
    monitor: 'codelab-monitor'

scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    scrape_interval: '10s'
    static_configs:
      - targets: ['localhost:9090']
  # OS monitoring
  - job_name: 'node'
    scrape_interval: '5s'
    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'dev'

Now we need to give this config file to `Prometheus` server (__it is contained in Docker, remember!__).

There are two ways to do that:
- Mount directory in your `localhost` to Docker container during runtime: 
    - When configuration is changing often (and you have autoreload set)
- Create new Docker image and copy the configuration:
    - When configuration is static and changing rarely
    
As this configuration will not change often, we will stick to the second route (see [here](https://prometheus.io/docs/prometheus/latest/installation/#volumes-bind-mount) for the first approach):

In [None]:
FROM prom/prometheus

# prometheus.yml is the default path from which Prometheus will take the config
COPY prometheus-nodes.yml /etc/prometheus/prometheus.yml

# document which ports should be mapped
EXPOSE 9090

Now that we have this simple `Dockerfile` (named as `prometheus-nodes.Dockerfile`) we have to:
- `docker image build` - build the image and tag it 
- `docker container run` - run Prometheus server container and expose appropriate ports

In [None]:
docker image build --rm \
  --file prometheus-nodes.Dockerfile \
  --tag aicore/prometheus-nodes:latest .

docker container run --rm --net=host --publish 9090:9090 aicore/prometheus-nodes:latest                                                 

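The second command might need a little explanation:
- Our `node` exporters publish data on `8080`, `8081`, `8082` __BUT on Docker's host machine__, not inside the container
- Our `Docker` container (`prometheus` server) __needs access to those ports on the host's localhost__ (in order to scrape data)
- Hence we let the container share the host's network via the `--net=host` flag
- With host networking, Prometheus listens on the host's port `9090` directly, so `--publish 9090:9090` is redundant here (Docker discards published ports in host network mode and prints a warning)

__Why can't we map ports `8080`, `8081`, `8082`?__

Publishing a port forwards traffic from the host into the container, which is the opposite direction to the one we need. On top of that, those host ports are already taken by our exporters (you can only bind one service to one port!)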

__Go to [`localhost:9090/targets`](http://localhost:9090/targets) to verify everything was set up correctly:__

![](images/prometheus_nodes_targets.png)
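
The same information is available as JSON from the `/api/v1/targets` endpoint of the HTTP API. Below is a sketch of summarising such a response per job; the `sample` payload is hand-written to resemble the response for our configuration and is trimmed to the fields the hypothetical `summarize_targets` helper uses.

```python
from collections import Counter


def summarize_targets(payload: dict) -> dict:
    """Count active targets per (job, health) from an /api/v1/targets response."""
    counts = Counter()
    for target in payload["data"]["activeTargets"]:
        counts[(target["labels"]["job"], target["health"])] += 1
    return dict(counts)


# Hand-written sample shaped like the API response, trimmed to the fields used.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus", "instance": "localhost:9090"}, "health": "up"},
            {"labels": {"job": "node", "instance": "localhost:8080", "group": "production"}, "health": "up"},
            {"labels": {"job": "node", "instance": "localhost:8081", "group": "production"}, "health": "up"},
            {"labels": {"job": "node", "instance": "localhost:8082", "group": "dev"}, "health": "down"},
        ]
    },
}
print(summarize_targets(sample))
```

In the sample, one `node` target is marked `down` on purpose, to show what a stopped exporter would look like in the summary.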

__Go to [`localhost:9090`](http://localhost:9090) and check whether you have `node` commands available:__

![](images/prometheus_nodes_queries.png)

Run a few of them; depending on the query, you should see `3` graphs, one for each node exporter we are monitoring (try `node_cpu_seconds_total` and look for different `localhost:<port>` by hovering over the graph).

__In the next lesson, we will learn how to query our data efficiently__, but before that...

## Exercise

__Your exercise is to set up Prometheus with `3` scrape jobs monitoring your operating system:__

- Prometheus itself (we did this one previously)
- `node` (__remember this one has to be run on `localhost` AND ONLY AS A SINGLE REPLICA__)
- `docker` (we will monitor Docker itself via server... inside `Docker` :D)

In order to do that, do the following steps:
- Change the Docker daemon configuration to expose its metrics on port `9323` (check [here](https://docs.docker.com/config/daemon/prometheus/#configure-docker), __only this single bulletpoint!__)
- Start a single `node` exporter (by default it exposes data on port `9100`)
- Create a `prometheus.yml` file describing those three services (__each job with a single instance & remember about the ports!__)
- Write and build a `docker` image containing `prometheus.yml` (like we did previously)
- `docker container run` the created image (remember about `--net=host`, as Prometheus has to reach services on the host, and about making the containerized `Prometheus` available on port `9090`)

__Verify targets are being scraped correctly by checking `localhost:9090/targets`!__

### Additionally

To make it fully usable for basic system monitoring

## Challenges

### Assessment

- What are the available alternatives to Prometheus? Read about them [here](https://prometheus.io/docs/introduction/comparison/#). When should we use a different tool? 
- What is Prometheus's [Push Gateway](https://prometheus.io/docs/practices/pushing/)? When should we use it?
- Read a little bit about alerting in Prometheus (e.g. when disaster happens it can send you an e-mail). Go through [documentation](https://prometheus.io/docs/alerting/latest/overview/) and get a basic grasp (leave alerting rules until the next lesson though)

### Non-assessment

- Local disk storage is limited. How can one integrate with Prometheus's remote storage? Read about it [here](https://prometheus.io/docs/prometheus/latest/storage/).
- What is Prometheus Federation? Read about it [here](https://prometheus.io/docs/prometheus/latest/federation/)