# Prometheus

> __Prometheus is an open source monitoring and alerting toolkit that gathers and processes data locally. It was originally developed at SoundCloud and is now a fully open source, community-driven project. Prometheus is designed to collect and store metrics as time series data: each value is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.__

To get a broader overview of Prometheus and its uses, you can visit the overview page [here](https://prometheus.io/docs/introduction/overview/).

A few facts about it:
- Written in `golang`, a programming language developed at Google!
- Provides client libraries for different languages (including Python), __but__...
- We use our browser to query data using a Prometheus-specific query language called PromQL
- The __Prometheus dashboard__ resides on `localhost:9090` by default
  

## The main components

* **Prometheus server:** The server will scrape and store the time series data.
* **Client libraries:** for instrumenting application code.
* **Push gateway:** for supporting short lived jobs.
* **Alertmanager:** To handle alerts.
* **Exporters:** for supported services like HAProxy, StatsD, Graphite and many more.

![](images/prometheus_architecture.png)

## What's going on above?

- Prometheus scrapes data (like metrics):
    - from short lived jobs via `push gateway`
    - from long running jobs directly
- __All samples (values with timestamps) are stored locally__ (together with necessary metadata)
- Runs predefined rules on collected data:
    - Gather and aggregate new records
    - Process them and send alerts
- Consumers use Prometheus's API to visualise data.

## When to use it?

- __Recording numerical time series__, this could be:
    - various training metrics gathered across epochs/batches
    - hardware related data
    - network traffic and other statistics
- __To debug infrastructure during network outages etc.__ (if interested, you can read about Slack's outage [here](https://slack.engineering/slacks-outage-on-january-4th-2021/))

> __Data is stored locally and does not rely on network storage__

Because of this, if something fails you still have access to the data (each Prometheus server is self-contained and independent).

__When shouldn't we use it?__

- __You need very detailed data coming really fast__ (per-request metrics with a lot of requests from users, exact billing when every millisecond counts etc.)

That's because:
- Prometheus is designed to scrape data every few seconds
- Data is kept in local storage which can fill up really fast (in this case think about large remote storage and data rotation)
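
To get a feel for how quickly local storage can fill up, the Prometheus storage documentation suggests a rough capacity formula: retention time × ingested samples per second × bytes per sample. The sketch below applies that formula. The 2 bytes per sample and the 10,000 series are assumptions chosen purely for illustration, not measurements, and the function names are hypothetical.

```python
def samples_per_second(series_count: int, scrape_interval_seconds: float) -> float:
    """Each time series contributes one sample per scrape."""
    return series_count / scrape_interval_seconds


def estimate_disk_bytes(retention_seconds: float,
                        ingested_samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough size estimate: retention * ingestion rate * bytes per sample.

    bytes_per_sample=2.0 is an assumed average, not a guaranteed figure.
    """
    return retention_seconds * ingested_samples_per_second * bytes_per_sample


if __name__ == "__main__":
    fifteen_days = 15 * 24 * 60 * 60  # the default retention time, in seconds
    for interval in (15, 1):
        rate = samples_per_second(10_000, interval)  # 10,000 series is an assumption
        size_gib = estimate_disk_bytes(fifteen_days, rate) / 2**30
        print(f"scrape_interval={interval}s -> roughly {size_gib:.1f} GiB")
```

Shortening the scrape interval multiplies the storage need by the same factor, which is why very fine-grained data is a poor fit for local storage.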

## Data

> __All Prometheus data is stored as timestamped time series, identified by a metric name and (optionally) labels__

- Metric names and label names should be alphanumeric (underscores are allowed, and metric names may also contain colons)
- __Samples are `float64` (`double`) numerical types__
- __Timestamps have millisecond precision__

For instance, `process_cpu_seconds_total`, which reports the total CPU time used by our Prometheus instance, is displayed as:

`process_cpu_seconds_total{instance="localhost:9090", job="prometheus"}`

The general layout of the query is as follows:
- `process_cpu_seconds_total` - the metric name.
- `instance="localhost:9090"` - a label telling us which instance the CPU time was measured on.
- `job="prometheus"` - another label telling us which job that instance belongs to.
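
The layout above can be taken apart programmatically. The following is a minimal sketch that splits a series identifier into its metric name and labels. The function name `parse_series` is hypothetical, the name pattern follows the classic naming rules from the Prometheus data model documentation, and escaped quotes inside label values are deliberately not handled.

```python
import re

# Classic metric name rule: letters, digits, underscores and colons,
# not starting with a digit.
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
# One name="value" pair; escaped quotes in values are not supported here.
LABEL_PAIR_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_series(identifier: str) -> tuple[str, dict[str, str]]:
    """Split 'name{label="value", ...}' into (name, labels)."""
    name, _, label_part = identifier.partition("{")
    name = name.strip()
    if not METRIC_NAME_RE.match(name):
        raise ValueError(f"invalid metric name: {name!r}")
    labels = dict(LABEL_PAIR_RE.findall(label_part))
    return name, labels


if __name__ == "__main__":
    series = 'process_cpu_seconds_total{instance="localhost:9090", job="prometheus"}'
    metric_name, metric_labels = parse_series(series)
    print(metric_name)
    print(metric_labels)
```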

## Metrics

> Prometheus provides `4` metric types out of the box

__Note:__ The Prometheus server does not differentiate between metric types (it only keeps the data); the types are used by the client libraries (official client libraries are provided for languages including `golang`, `java`, __`python`__ and `ruby`)

### Counter

> __Monotonically increasing counter (it can only be reset to zero on restart)__

Useful for:
- number of requests
- tasks completed
- __anything which can only grow OR start anew__
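
Because a counter may restart from zero, anything that reads it has to cope with resets. The sketch below shows the basic idea under a simplifying assumption: any drop in value is treated as a reset, and the counter is assumed to have restarted from zero. The function name is hypothetical, and the real PromQL functions `rate()` and `increase()` additionally extrapolate to the edges of the time window, which is omitted here.

```python
def total_increase(samples: list[float]) -> float:
    """Sum the growth of a counter across consecutive samples.

    A sample lower than its predecessor is treated as a reset, so the
    whole new value counts as growth since the restart.
    """
    increase = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # Counter dropped: assume the process restarted and began at zero.
            increase += current
    return increase


if __name__ == "__main__":
    # The counter climbs to 12, the process restarts, then it climbs to 8.
    print(total_increase([0, 5, 12, 3, 8]))
```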

### Gauge

> __Single value which can increase & decrease__

Useful for:
- memory usage monitoring
- temperature
- __anything which can change value arbitrarily__

### Histogram

> __Samples observations and groups them in buckets (which you can configure)__

Let's say our histogram metric is named `our_super_histogram`. In this case the following time series are exposed (__notice we add suffixes to the metric name!__):
- `our_super_histogram_bucket{le="<upper inclusive bound>"}` - cumulative counters for the observation buckets
- `our_super_histogram_sum` - sum of all observed values
- `our_super_histogram_count` - count of observations
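
To make "cumulative" concrete, here is a toy histogram. It is a sketch for illustration only, not how any client library is implemented; the class name `ToyHistogram` and the bucket bounds are assumptions. Each observation increments every bucket whose upper bound is greater than or equal to the value, so the `+Inf` bucket always equals the total count.

```python
import math


class ToyHistogram:
    """Cumulative-bucket histogram (illustrative sketch only)."""

    def __init__(self, upper_bounds: list[float]):
        # The +Inf bucket always exists, so every observation lands somewhere.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        self.sum += value
        self.count += 1
        # Cumulative: bump every bucket whose bound is >= the value.
        for index, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[index] += 1

    def samples(self, name: str) -> dict[str, float]:
        """Return the series a scrape would see, keyed by identifier."""
        out: dict[str, float] = {}
        for bound, bucket_count in zip(self.upper_bounds, self.bucket_counts):
            le = "+Inf" if bound == math.inf else str(bound)
            out[f'{name}_bucket{{le="{le}"}}'] = bucket_count
        out[f"{name}_sum"] = self.sum
        out[f"{name}_count"] = self.count
        return out


if __name__ == "__main__":
    histogram = ToyHistogram([0.1, 0.5, 1.0])
    for seconds in (0.05, 0.3, 0.7, 2.0):
        histogram.observe(seconds)
    for series, value in histogram.samples("our_super_histogram").items():
        print(series, value)
```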

### Summary

> __Samples observations and provides configurable quantiles over a sliding time window__

- Provides `sum` and `count` like histogram (as `our_super_summary_sum` and `our_super_summary_count`)
- `our_super_summary{quantile="<φ>"}` - the φ-quantile of observations, with 0 ≤ φ ≤ 1 (one can do something similar for histogram but using functions)
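
The sliding-window idea can be sketched as follows. This toy version keeps every observation from the last `max_age_seconds` and computes a nearest-rank quantile on demand. Real client libraries use streaming estimators rather than storing all values, so treat the class `ToySummary` and its parameters as assumptions made for illustration. Note that `sum` and `count` cover the whole lifetime, while quantiles only cover the window.

```python
import math
from collections import deque


class ToySummary:
    """Sliding-window summary (illustrative sketch only)."""

    def __init__(self, max_age_seconds: float):
        self.max_age_seconds = max_age_seconds
        self._window: deque[tuple[float, float]] = deque()  # (timestamp, value)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float, now: float) -> None:
        self.sum += value
        self.count += 1
        self._window.append((now, value))

    def quantile(self, q: float, now: float) -> float:
        # Drop observations that have aged out of the window.
        while self._window and now - self._window[0][0] > self.max_age_seconds:
            self._window.popleft()
        if not self._window:
            return math.nan
        values = sorted(value for _, value in self._window)
        rank = max(1, math.ceil(q * len(values)))  # nearest-rank method
        return values[rank - 1]


if __name__ == "__main__":
    summary = ToySummary(max_age_seconds=60)
    for second in range(10):
        summary.observe(float(second + 1), now=float(second))
    print(summary.quantile(0.5, now=9.0))
    print(summary.quantile(0.5, now=65.0))  # older observations have expired
```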

### Functions

There are also functions one can use in queries, such as `day_of_month()`; we will talk about them in more detail later.

## Jobs and instances

- __An instance is an endpoint you can scrape data from__, for example an EC2 instance or a Docker container; it usually corresponds to a single process.
- __A job is a collection of instances with the same purpose__, like multiple EC2 instances or Docker containers running the same service; these are usually replicated for reliability/flexibility.

Read more about jobs and instances in the Prometheus documentation [here](https://prometheus.io/docs/concepts/jobs_instances/).

# First contact

## Installation

> Prometheus is written in `golang`, a programming language released by Google in 2009, hence __it is distributed as a single compiled executable__

Because of this, its installation and deployment are really simple and can be performed efficiently in many different scenarios:

- Either go to [their download page](https://prometheus.io/download/) and download the specific version of Prometheus for your operating system.
    - if you're on Mac, it's the download labelled Darwin
- Or __run Prometheus inside a Docker container__

We will use the last option. Since we are running Prometheus from a Docker container, we need to bind-mount our Prometheus config file into the container. This lets us edit the config file on the host and have the containerised server pick up the changes when it reloads. First, pull the Prometheus Docker image from Docker Hub using `docker pull prom/prometheus`. Then create a `prometheus.yml` file with the following content; it is a simple config to get you started with Prometheus.


In [None]:
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name added as a label `job=<job_name>` to any timeseries scraped
  - job_name: 'prometheus'
    # Override the global default and scrape targets from job every 5 seconds.
    scrape_interval: '5s'
    static_configs:
      - targets: ['localhost:9090']



We can then start a container from the image using the following command. You just need to change `/path/to/prometheus.yml` to the path where your `prometheus.yml` config file is stored locally. We also add the `--web.enable-lifecycle` flag, as it enables reloading the config via an HTTP request. Run the command below in your command line (if using Windows, run it in PowerShell, not bash):

In [None]:
docker run --rm -p 9090:9090 --name prometheus -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle

What it does:
- `-p 9090:9090` binds the container's `9090` port to port `9090` on the localhost
- `--rm` removes the container when you kill the process
- `--name` specifies the name of the container
- `-v` mounts your local Prometheus config file over the config file inside the container, so we can edit the config on the fly.

Prometheus is now running, so you can __simply go to [`localhost:9090`](http://localhost:9090) in your browser__ and you should see the following Prometheus dashboard. You might want to enable local time, so the metrics are shown in your time zone, and query history, so you can track all the queries you make. And of course configure dark mode (top right corner) as it's clearly the best!

<img src="images/prom_init_dash.png?modified=232132453">

If you start typing in the expression field, Prometheus will suggest queries you can run. No other targets are configured yet, but Prometheus can track itself, so we can start by running expressions that begin with `prometheus`. For instance, type `prometheus_build_info` and then execute the query. Notice that the dropdown list shows details about each metric when you mouse over it.

<img src="images/prom_expression_selected_dashboard.png?modified=232132453">

You will get the results of the query in the panel underneath. In this case the result was:

`prometheus_build_info{branch="HEAD", goversion="go1.17.1", instance="localhost:9090", job="prometheus", revision="b30db03f35651888e34ac101a06e25d27d15b476", version="2.30.2"}`

- **metric name:** `prometheus_build_info`
- **labels:** goversion, instance, job etc.
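
The same query can be run without the browser through the HTTP API endpoint `/api/v1/query`. The sketch below builds the request URL and flattens a vector result into rows. The helper names are hypothetical, and `run_query` needs a running server, so it is defined but never called here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url: str, expression: str) -> str:
    """Build the URL for an instant query against the HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expression})}"


def vector_to_rows(payload: dict) -> list[tuple[dict, float]]:
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each value is [unix_timestamp, "value as string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def run_query(base_url: str, expression: str) -> list[tuple[dict, float]]:
    """Send the query to a live server (requires Prometheus to be running)."""
    with urlopen(instant_query_url(base_url, expression), timeout=5) as response:
        return vector_to_rows(json.load(response))


if __name__ == "__main__":
    print(instant_query_url("http://localhost:9090", "prometheus_build_info"))
```

With the container from earlier running, `run_query("http://localhost:9090", "prometheus_build_info")` should return one row whose labels match the result shown above.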

You can see a text list of all the metrics the Prometheus server currently exposes about itself by going to `http://localhost:9090/metrics`.
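
The `/metrics` page uses a simple line-based text format: comment lines starting with `# HELP` and `# TYPE` describe a metric, and every other line is a series identifier followed by its value. The parser below is a minimal sketch of reading that format. It ignores optional timestamps and does not handle escaped characters inside label values, and the function name is hypothetical.

```python
def parse_exposition(text: str) -> tuple[dict[str, str], list[tuple[str, float]]]:
    """Return ({metric: type}, [(series_identifier, value), ...])."""
    types: dict[str, str] = {}
    samples: list[tuple[str, float]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split(None, 3)  # '#', keyword, metric name, rest
            if len(parts) == 4 and parts[1] == "TYPE":
                types[parts[2]] = parts[3]
            continue
        if "}" in line:
            # Label values may contain spaces, so cut after the closing brace.
            head, _, tail = line.rpartition("}")
            series = head + "}"
        else:
            series, _, tail = line.partition(" ")
        fields = tail.split()
        samples.append((series, float(fields[0])))  # fields[1], if any, is a timestamp
    return types, samples


if __name__ == "__main__":
    example = (
        "# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.\n"
        "# TYPE process_cpu_seconds_total counter\n"
        "process_cpu_seconds_total 0.42\n"
    )
    print(parse_exposition(example))
```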

## Exercise

- Try to run `20` Prometheus expressions and figure out what they are used for.
- Check their graph representation (some expressions, for instance `prometheus_build_info`, may not have a meaningful graph).

Use the hints in the dropdown list to find out what each metric does; if you are still unsure, search for the metric online.

# Prometheus configuration

## Configuration file

Until now, we didn't know any details of what just happened, but that's about to change.

> __Prometheus server is configured via [YAML](https://en.wikipedia.org/wiki/YAML) files.__ The configuration file is used to define which instances we would like to scrape and how.

Previously, we ran Prometheus __with a default configuration file__ (you can see the active configuration at [`localhost:9090/config`](http://localhost:9090/config) or by selecting **Status > Configuration** from the dashboard).

Let's take a look at the config file and get a better idea of what's going on.

In [None]:
# Section with default values
global:
  scrape_interval: 15s # How frequently to scrape targets from jobs
  scrape_timeout: 10s # How long to wait for a target to respond before the scrape fails
  evaluation_interval: 15s # How frequently to evaluate rules (recording and alerting rules)
# Prometheus alert manager, left for now
alerting:
  alertmanagers:
  - follow_redirects: true
    scheme: http
    timeout: 10s
    api_version: v2
    static_configs:
    - targets: []
# Specific configuration for jobs
scrape_configs:
- job_name: prometheus # Name of the job, can be anything
  honor_timestamps: true # Use timestamps provided by job
  scrape_interval: 15s # As before, but for this job
  scrape_timeout: 10s # ^
  metrics_path: /metrics # Where metrics are located w.r.t. port (localhost:9090/metrics)
  scheme: http # Configures the protocol scheme used for requests (localhost is http)
  follow_redirects: true
  static_configs:
  - targets:
    - localhost:9090

> __Prometheus provides A LOT of configuration options, check all of them [here](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)__

A few things to keep in mind about the config file:
- __Prometheus can reload its configuration from the config file while it is running, so you can change it live!__ (the reload is triggered by sending a `POST` request to the `/-/reload` endpoint, which requires the `--web.enable-lifecycle` flag.)
- __If the file is not correctly formatted IT WILL NOT BE UPDATED__; it needs to be valid **YAML**.
- All parameters that are specified under `global` are available to all other scrape configs defined in the configuration file, e.g. if you create a `scrape_config` that gets metrics from EC2 instances then it will use the global parameters unless explicitly changed.

By adding the `--web.enable-lifecycle` flag when creating our Docker container and mounting our local `prometheus.yml` file to the container we should be able to edit the config live. Let's change the default `scrape_interval` and `scrape_timeout` to 1s. The start of your config should now be:

In [None]:
# Section with default values
global:
  scrape_interval: 1s # How frequently to scrape targets from jobs
  scrape_timeout: 1s # How long to wait for a target to respond before the scrape fails
  evaluation_interval: 15s # How frequently to evaluate rules (recording and alerting rules)
# Prometheus alert manager, left for now

Now we can send a request to the `/-/reload` endpoint to update our config while our Prometheus server is running. Once you have made your changes, run the following command in your terminal.

In [None]:
curl -X POST http://localhost:9090/-/reload

Now go to `localhost:9090/config` and check your config; it should be updated and you should see the changes that we have made. Excellent, we can now edit our config while our Prometheus server is running!
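
If you would rather trigger the reload from Python than from `curl`, the same `POST` request can be sent with the standard library. This is a minimal sketch: the function names are hypothetical, and `reload_config` only works against a server started with `--web.enable-lifecycle`, so it is defined but not called here.

```python
from urllib.request import Request, urlopen


def build_reload_request(base_url: str = "http://localhost:9090") -> Request:
    """Create the POST request that asks Prometheus to reload its config."""
    return Request(f"{base_url.rstrip('/')}/-/reload", data=b"", method="POST")


def reload_config(base_url: str = "http://localhost:9090", timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code.

    urlopen raises an HTTPError if the server rejects the request,
    for example when the new configuration file is invalid.
    """
    with urlopen(build_reload_request(base_url), timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    request = build_reload_request()
    print(request.get_method(), request.full_url)
```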

## [scrape_configs](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)

> Specifies __sets of targets__ (what should be scraped) and parameters describing how to do it for each target. For instance, you could specify a `scrape_config` to scrape metrics from EC2 instances, or one to scrape metrics from Kubernetes.

Targets can be defined __statically__ or __dynamically__:

- `statically` - configured directly in our configuration YAML in a `scrape_config`.
- `dynamically` - using __service discovery configurations__. By defining the service discovery options we allow Prometheus to track new instances of a particular service. For example, in industry you might be starting, stopping, and creating new Docker containers constantly based on the needs of the business. With a service discovery configuration, Prometheus is able to track any newly created Docker containers without them being listed directly in the configuration file.

A lot of service discovery integrations are available for commonly used services:
- [`consul_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#consul_sd_config) - retrieve scraping targets from HashiCorp's [Consul](https://www.consul.io/) used for service discovery and network setup
- [`dockerswarm_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dockerswarm_sd_config) - used with [Docker's Swarm mode](https://docs.docker.com/engine/swarm/) which allows us to connect and orchestrate many containers as one application
- [`ec2_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#ec2_sd_config) - retrieve scrape targets from EC2 instances
- [`kubernetes_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config) - configure scrape targets from Kubernetes REST API
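
Besides the integrations listed above, Prometheus also supports file-based service discovery (`file_sd_configs`), where targets are read from JSON or YAML files that are re-read when they change. That makes it easy to experiment with dynamic targets from a script. The sketch below generates such a JSON file. The function names, the `group` label and the file path are assumptions for illustration; writing to a temporary file and renaming it is a precaution so that a half-written file is never read.

```python
import json
import os
import tempfile


def build_target_groups(groups: dict[str, list[str]]) -> list[dict]:
    """Convert {group_name: ["host:port", ...]} into file_sd target groups."""
    return [
        {"targets": sorted(targets), "labels": {"group": group}}
        for group, targets in sorted(groups.items())
    ]


def write_targets_file(path: str, groups: dict[str, list[str]]) -> None:
    """Write the target groups as JSON, replacing the old file in one step."""
    document = json.dumps(build_target_groups(groups), indent=2)
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(document)
    os.replace(temp_path, path)  # atomic rename on the same filesystem


if __name__ == "__main__":
    print(json.dumps(build_target_groups({"production": ["localhost:9100"]}), indent=2))
```

A scrape config would then point at the generated file through the `files` list of a `file_sd_configs` entry instead of listing targets under `static_configs`.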


The full documentation about creating configs can be found [here](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#static_config).

## Command line

> The Prometheus server instance itself can be configured via command line flags

You can run the command below to see available options:

In [None]:
docker container run --rm prom/prometheus --help

Most notable are:

- `--web.read-timeout=5m` - Maximum duration before timing out read of the request, and closing idle connections
- `--web.max-connections=512` - Maximum number of simultaneous connections
- `--web.enable-lifecycle` - Enable shutdown and reload via HTTP request (requests to `/-/reload` mentioned previously)
- `--web.page-title="..."` - Change the document title of the Prometheus web page we opened previously
- `--storage.tsdb.retention.time` - How long should we keep the data (default: `15` days)
- `--storage.remote.read-concurrent-limit=10` - Maximum number of concurrent remote read calls
- `--log.level=info` - Verbosity of Prometheus server, one of `[debug, info, warn, error]` can be set

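__Note: Flag names are hierarchical: the category comes first, optionally followed by sub-categories, with the option itself last (for example `--storage.tsdb.retention.time`)__

This layout follows from Prometheus's internal structure and the conventions of `golang`, its implementation language.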

## Exporters

> Exporters are libraries and servers which export existing metrics from third-party systems and make them available for Prometheus to scrape, e.g. an exporter that lets you track GitHub metrics.

There are hundreds of exporters, all in different states of development. The most common ones are listed [here](https://prometheus.io/docs/instrumenting/exporters/), and a more complete list of exporters (with the ports they use) is on GitHub [here](https://github.com/prometheus/prometheus/wiki/Default-port-allocations). They fall into three groups:
- **official** - follow best practices and are verified by Prometheus; __always pick them if possible__
- **unofficial** - working, not verified for best practices or may have overlapping functionalities
- **in development** - to be released as either of the two above

Important things to keep in mind:
- __Most of the exporters occupy ports `9100`-`9999`__ and any new exporter should use a free port in that range if one is available (see the GitHub list above to see which ports exporters run on)
- There are a few exporters outside the standard range (once again, see the GitHub list)

To learn more, we will now use one of the most commonly used exporters, which exposes hardware and software metrics from your operating system: __[Node exporter](https://github.com/prometheus/node_exporter)__.


## Setting up the Node exporter

> `Node` exporter is a single static binary one can download and run straight from the workstation

- The following commands download the exporter and unpack the `.tar.gz` archive (__we assume you have a Linux-based system!__).
- You may run those commands anywhere you want; the cell below uses a `tmp` directory.
- If you have Windows, you should use [this exporter](https://github.com/prometheus-community/windows_exporter).
- If on macOS, run `brew install node_exporter`

### For Linux based machines

In [None]:
# Work inside a scratch directory so the download does not clutter the current one
mkdir -p tmp
cd tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
tar xvfz node_exporter-1.1.2.linux-amd64.tar.gz
./node_exporter-1.1.2.linux-amd64/node_exporter --help
cd ..

__Note:__ Add the following alias:
- `unpack` -> `tar xvzf`

> __One node exporter is enough to monitor your OS__, so that is all we start here.

Let's start the `node_exporter` (run the command below inside your command line after downloading, check the comments):
- in a separate terminal __OR__
- in the background (using `&` after the command)

In [None]:
# By default it will bind to port 9100
# If on Mac and you've installed the package with brew, remove the ./ from the command below and run it the same way

# ./node_exporter --web.listen-address 127.0.0.1:9100

## Configure Prometheus to scrape Node exporter

Now that we have our exporter running, we need to set up the `Prometheus` server to scrape data from it.

Let's change our `prometheus.yml` config file to contain the following setup, which will allow Prometheus to scrape the metrics the `Node exporter` exposes about our system.

### For Linux/macOS systems

In [None]:
global:
  scrape_interval: '1s'  # Scrape targets every second unless a job overrides it
  external_labels:
    monitor: 'codelab-monitor'

scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    scrape_interval: '10s'
    static_configs:
      - targets: ['localhost:9090']
  # OS monitoring
  - job_name: 'node'
    scrape_interval: '5s'
    static_configs:
      - targets: ['prometheus:9100']
        labels:
          group: 'production' # extra label attached to every series scraped from this target

### For Windows systems

In [None]:
global:
  scrape_interval: '5s'  # Scrape targets every 5 seconds unless a job overrides it
  external_labels:
    monitor: 'codelab-monitor'

scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    scrape_interval: '10s'
    static_configs:
      - targets: ['localhost:9090']
  # OS monitoring
  - job_name: 'wmiexporter'
    scrape_interval: '30s'
    static_configs:
      - targets: ['YOUR PUBLIC IP HERE:9182']
  

Notice the basic layout when defining new targets to scrape is:

In [None]:
scrape_configs:                      # all targets are defined under this key
  - job_name: 'prometheus'           # name of the job
    scrape_interval: '10s'           # parameters defining how to scrape the target
    static_configs:                  # lets you specify a list of targets
      - targets: ['localhost:9090']  # define the targets here

Now we need to give this config file to the `Prometheus` server (__it is contained in Docker, remember!__).

There are two ways to do that:
- Mount the file (or its directory) from your `localhost` into the Docker container at runtime:
    - When configuration is changing often (and you have reloading enabled)
- Create a new Docker image and copy the configuration into it:
    - When configuration is static and changing rarely

We have already set up the first approach, so we will keep using it: edit the configuration file locally and then ask the running server to reload it. You can find the documentation for the second approach, using a `Dockerfile`, on the following [page](https://prometheus.io/docs/prometheus/latest/installation/).

Now that our `prometheus.yml` has been updated locally, we can ask the Prometheus server to reload it using the same command as before.

In [None]:
curl -X POST http://localhost:9090/-/reload

__Go to [`localhost:9090/targets`](http://localhost:9090/targets) to verify everything was set up correctly.__ On the targets dashboard you should see that both of our targets have the state **UP**, which means they are being scraped successfully. Targets can also be in one of the other predefined states:

- **Down:** Prometheus cannot scrape the target.
- **Unknown:** Prometheus has not scraped the target yet, for example right after a configuration change.
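
The information on the targets page is also available as JSON from the HTTP API endpoint `/api/v1/targets`, which is handy for automated checks. The sketch below summarises such a response. The function name is hypothetical, and the payload in the example is a hand-written sample shaped like the API response, not real output.

```python
from collections import Counter


def summarise_targets(payload: dict) -> tuple[dict[str, int], list[tuple[str, str, str]]]:
    """Count targets per health state and list the ones that are not up.

    Returns ({health: count}, [(job, scrape_url, last_error), ...]).
    """
    active = payload["data"]["activeTargets"]
    by_health = Counter(target["health"] for target in active)
    not_up = [
        (target["labels"].get("job", ""), target["scrapeUrl"], target.get("lastError", ""))
        for target in active
        if target["health"] != "up"
    ]
    return dict(by_health), not_up


if __name__ == "__main__":
    # Hand-written sample for illustration only.
    sample = {
        "status": "success",
        "data": {
            "activeTargets": [
                {"labels": {"job": "prometheus"}, "health": "up", "lastError": "",
                 "scrapeUrl": "http://localhost:9090/metrics"},
                {"labels": {"job": "node"}, "health": "down",
                 "lastError": "connection refused",
                 "scrapeUrl": "http://localhost:9100/metrics"},
            ]
        },
    }
    print(summarise_targets(sample))
```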

<img src="images/prom_target_dash.png?modified=232132453">

From the targets window we can click on the endpoint of the target, in our case the Windows exporter at `http://192.168.8.50:9182/metrics`, to get a text list of all the metrics which are being scraped and are available to Prometheus. This can be helpful if you are unsure what metrics are available to you.

<img src="images/prom_metrics_endpoint.png?modified=232132453">

Once done, go to [`localhost:9090`](http://localhost:9090) and check whether you have the metrics prefixed with `node` or `windows` available:

<img src="images/prom_win_commands.png?modified=23212453">

Run a few expressions and look at the resulting graphs. A good one for Windows is `windows_os_processes`: by hovering over the graph we can see the number of running processes at each scrape.

<img src="images/prom_win_process.png?modified=23212453">

__In the next lesson, we will learn how to query our data efficiently__, but before that, let's show you how to monitor metrics from Docker containers!

## Configuring Prometheus to track Docker

First, to get Prometheus to track Docker, we need to edit Docker's configuration to specify its metrics address, so Prometheus knows where to collect the metrics from. We do this by editing Docker's `daemon.json` file or by editing Docker Desktop's configuration.

- **For Linux:** Navigate to `/etc/docker/daemon.json`
- **For macOS/Windows:** Go to Docker Desktop, click the cog to go to **Settings > Docker Engine**.

Add this code to either the `daemon.json` file or the Docker Engine config to expose Docker's metrics.


In [None]:
{
  "metrics-addr" : "0.0.0.0:9323",
  "experimental" : true
}

Docker Desktop will likely ask you to restart the application. On Linux you will need to restart Docker yourself with `service docker restart`. Great, now we just need to update our `prometheus.yml` to add Docker as a target so Prometheus can begin scraping the metrics. Add Docker as a target in your `prometheus.yml` like so.

In [None]:
global:
  scrape_interval: '15s'  # By default, scrape targets every 15 seconds.
  scrape_timeout: '10s'
  external_labels:
    monitor: 'codelab-monitor'

scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    scrape_interval: '10s'
    static_configs:
      - targets: ['localhost:9090']
  # OS monitoring
  - job_name: 'wmiexporter'
    scrape_interval: '30s'
    static_configs:
      - targets: ['YOUR LOCAL IP HERE:9182']

  # Docker daemon monitoring
  - job_name: 'docker'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['YOUR LOCAL IP HERE:9323']
  

Now, like before, reload the config on the Prometheus server while it's running using the command:

In [None]:
curl -X POST http://localhost:9090/-/reload

Now we can check the Prometheus targets page on our server and see that Docker is now available as an endpoint to collect metrics from. Remember to check its `/metrics` endpoint to see a list of available metrics to track. The Docker ones usually start with `engine_daemon_container`.

<img src="images/prom_targets_docker.png?modified=232132453">

Let's try starting some containers: either start more Prometheus containers like before, or pull some other images from Docker Hub. Once they are running, you can run expressions to view the metrics. For example, here is the result of the `engine_daemon_container_states_containers` expression after starting and stopping some containers, where the red line shows stopped containers and the blue line shows running containers.

<img src="images/prom_cont_states.png?modified=232132453">

### Summary

- We've learned to start a Prometheus server inside a Docker container.
- We've learned how to access the Prometheus dashboard and run some query expressions.
- We've learned how to create a `prometheus.yml` configuration for our server, which we can update while the server is running.
- We have integrated the node/Windows exporter with the server so we can track hardware/software metrics from a machine.
- We have also set up Prometheus to track the Docker service and learned how to define targets in the `prometheus.yml` config file.

### Additional 

- What are the available alternatives to Prometheus? Read about them [here](https://prometheus.io/docs/introduction/comparison/#). When should we use a different tool? 
- What is Prometheus's [Push Gateway](https://prometheus.io/docs/practices/pushing/)? When should we use it?
- Read a little bit about alerting in Prometheus (e.g. when disaster happens it can send you an e-mail). Go through [documentation](https://prometheus.io/docs/alerting/latest/overview/) and get a basic grasp (leave alerting rules until the next lesson though)
- Local disk storage is limited. How can one integrate with Prometheus's remote storage? Read about it [here](https://prometheus.io/docs/prometheus/latest/storage/).
- What is Prometheus Federation? Read about it [here](https://prometheus.io/docs/prometheus/latest/federation/)