diff --git a/blog/2023-09-26-netdata-processes-monitoring-comparison-with-console-tools/2023-09-26-netdata-prometheus-grafana-stack/img/stacked-netdata.png b/blog/2023-09-26-netdata-processes-monitoring-comparison-with-console-tools/2023-09-26-netdata-prometheus-grafana-stack/img/stacked-netdata.png
new file mode 100644
index 00000000..f60b43a8
Binary files /dev/null and b/blog/2023-09-26-netdata-processes-monitoring-comparison-with-console-tools/2023-09-26-netdata-prometheus-grafana-stack/img/stacked-netdata.png differ
diff --git a/blog/2023-09-26-netdata-processes-monitoring-comparison-with-console-tools/2023-09-26-netdata-prometheus-grafana-stack/index.md b/blog/2023-09-26-netdata-processes-monitoring-comparison-with-console-tools/2023-09-26-netdata-prometheus-grafana-stack/index.md
new file mode 100644
index 00000000..5d7ea113
--- /dev/null
+++ b/blog/2023-09-26-netdata-processes-monitoring-comparison-with-console-tools/2023-09-26-netdata-prometheus-grafana-stack/index.md
@@ -0,0 +1,170 @@
+---
+slug: netdata-processes-monitoring-comparison-with-console-tools
+title: "Netdata processes monitoring and its comparison with other console-based tools"
+authors: satya
+tags: [processes, top, htop, atop, glances, application-monitoring, apm]
+keywords: [processes, top, htop, atop, glances, application-monitoring, apm]
+image: ./img/stacked-netdata.png
+---
+
+![netdata-processes-monitoring](./img/stacked-netdata.png)
+
+Netdata reads `/proc/<pid>/stat` for all processes once per second and extracts `utime` and
+`stime` (user and system CPU utilization), much like all the console tools do.
+
+But it also extracts `cutime` and `cstime`, which account for the user and system CPU time of the exited children of each process.
+By keeping an in-memory map of the whole process tree, it can assign the right time to every process, taking
+into account all of its exited children.
+
+This is tricky: a process may have been running for an hour, and once it exits its parent should not
+be charged the whole hour of CPU time in a single second - the CPU time already reported for it in
+previous iterations has to be subtracted.
+
+It is even trickier, because walking through the entire process tree takes some time itself. So,
+if you sum the CPU utilization of all processes, you might end up with more CPU time than the reported
+total CPU time of the system. Netdata solves this by scaling the per-process CPU utilization to
+the total of the system.
+
+## Comparison with console tools
+
+SSH to a server running Netdata and execute this:
+
+```sh
+while true; do ls -l /var/run >/dev/null; done
+```
+
+On most systems `/var/run` is a `tmpfs` mount, so there is nothing to stop this command
+from entirely consuming one of the CPU cores of the machine.
+
+As we will see below, **none** of the console performance monitoring tools can report that this
+command is using 100% CPU. They do, of course, report that the CPU is busy, but **they fail to
+identify the process that consumes so much CPU**.
+
+Here is what common Linux console monitoring tools report:
+
+### top
+
+`top` reports that `bash` is using just 14%.
+
+If you check the total system CPU utilization, it says there is no idle CPU at all, but `top`
+fails to provide a breakdown of the CPU consumption in the system. The sum of the CPU utilization
+of all processes reported by `top` is 15.6%.
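+
+You can watch the missing time accumulate yourself. From a second terminal, a rough sketch like the one
+below prints the busy shell's `cutime`/`cstime` counters (fields 16 and 17 of `/proc/<pid>/stat`), which
+hold the CPU time of its *exited* children - the short-lived `ls` processes - in clock ticks. Replace
+`<pid>` with the PID of the `bash` running the loop:
+
+```sh
+# Sketch: print the CPU time accumulated by the shell's exited children, once per second.
+# The simple awk below assumes the process name (field 2 of /proc/<pid>/stat) contains no
+# spaces, which holds for bash. Values are in clock ticks (getconf CLK_TCK of them per second).
+BUSY_PID=<pid>
+while sleep 1; do
+    awk '{ print "cutime=" $16, "cstime=" $17, "clock ticks" }' "/proc/$BUSY_PID/stat"
+done
+```
+
+The console tools base their per-process numbers on `utime` and `stime` alone, so - as the outputs below
+show - the cost of the exited `ls` processes never shows up under `bash`. Here is the `top` snapshot taken
+while the loop was running: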
+ +``` +top - 18:46:28 up 3 days, 20:14, 2 users, load average: 0.22, 0.05, 0.02 +Tasks: 76 total, 2 running, 74 sleeping, 0 stopped, 0 zombie +%Cpu(s): 32.8 us, 65.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 1.3 hi, 0.3 si, 0.0 st +KiB Mem : 1016576 total, 244112 free, 52012 used, 720452 buff/cache +KiB Swap: 0 total, 0 free, 0 used. 753712 avail Mem + + PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND +12789 root 20 0 14980 4180 3020 S 14.0 0.4 0:02.82 bash + 9 root 20 0 0 0 0 S 1.0 0.0 0:22.36 rcuos/0 + 642 netdata 20 0 132024 20112 2660 S 0.3 2.0 14:26.29 netdata +12522 netdata 20 0 9508 2476 1828 S 0.3 0.2 0:02.26 apps.plugin + 1 root 20 0 67196 10216 7500 S 0.0 1.0 0:04.83 systemd + 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd +``` + +### htop + +Exactly like `top`, `htop` is providing an incomplete breakdown of the system CPU utilization. + +``` + CPU[||||||||||||||||||||||||100.0%] Tasks: 27, 11 thr; 2 running + Mem[||||||||||||||||||||85.4M/993M] Load average: 1.16 0.88 0.90 + Swp[ 0K/0K] Uptime: 3 days, 21:37:03 + + PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command +12789 root 20 0 15104 4484 3208 S 14.0 0.4 10:57.15 -bash + 7024 netdata 20 0 9544 2480 1744 S 0.7 0.2 0:00.88 /usr/libexec/netd + 7009 netdata 20 0 138M 21016 2712 S 0.7 2.1 0:00.89 /usr/sbin/netdata + 7012 netdata 20 0 138M 21016 2712 S 0.0 2.1 0:00.31 /usr/sbin/netdata + 563 root 20 0 308M 202M 202M S 0.0 20.4 1:00.81 /usr/lib/systemd/ + 7019 netdata 20 0 138M 21016 2712 S 0.0 2.1 0:00.14 /usr/sbin/netdata +``` + +### atop + +`atop` also fails to break down CPU usage. + +``` +ATOP - localhost 2016/12/10 20:11:27 ----------- 10s elapsed +PRC | sys 1.13s | user 0.43s | #proc 75 | #zombie 0 | #exit 5383 | +CPU | sys 67% | user 31% | irq 2% | idle 0% | wait 0% | +CPL | avg1 1.34 | avg5 1.05 | avg15 0.96 | csw 51346 | intr 10508 | +MEM | tot 992.8M | free 211.5M | cache 470.0M | buff 87.2M | slab 164.7M | +SWP | tot 0.0M | free 0.0M | | vmcom 207.6M | vmlim 496.4M | +DSK | vda | busy 0% | read 0 | write 4 | avio 1.50 ms | +NET | transport | tcpi 16 | tcpo 15 | udpi 0 | udpo 0 | +NET | network | ipi 16 | ipo 15 | ipfrw 0 | deliv 16 | +NET | eth0 ---- | pcki 16 | pcko 15 | si 1 Kbps | so 4 Kbps | + + PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPU CMD 1/600 +12789 0.98s 0.40s 0K 0K 0K 336K -- - S 14% bash + 9 0.08s 0.00s 0K 0K 0K 0K -- - S 1% rcuos/0 + 7024 0.03s 0.00s 0K 0K 0K 0K -- - S 0% apps.plugin + 7009 0.01s 0.01s 0K 0K 0K 4K -- - S 0% netdata +``` + +### glances + +And the same is true for `glances`. The system runs at 100%, but `glances` reports only 17% +per process utilization. + +Note also, that being a `python` program, `glances` uses 1.6% CPU while it runs. + +``` +localhost Uptime: 3 days, 21:42:00 + +CPU [100.0%] CPU 100.0% MEM 23.7% SWAP 0.0% LOAD 1-core +MEM [ 23.7%] user: 30.9% total: 993M total: 0 1 min: 1.18 +SWAP [ 0.0%] system: 67.8% used: 236M used: 0 5 min: 1.08 + idle: 0.0% free: 757M free: 0 15 min: 1.00 + +NETWORK Rx/s Tx/s TASKS 75 (90 thr), 1 run, 74 slp, 0 oth +eth0 168b 2Kb +eth1 0b 0b CPU% MEM% PID USER NI S Command +lo 0b 0b 13.5 0.4 12789 root 0 S -bash + 1.6 2.2 7025 root 0 R /usr/bin/python /u +DISK I/O R/s W/s 1.0 0.0 9 root 0 S rcuos/0 +vda1 0 4K 0.3 0.2 7024 netdata 0 S /usr/libexec/netda + 0.3 0.0 7 root 0 S rcu_sched +FILE SYS Used Total 0.3 2.1 7009 netdata 0 S /usr/sbin/netdata +/ (vda1) 1.56G 29.5G 0.0 0.0 17 root 0 S oom_reaper +``` + +### why does this happen? 
+
+All the console tools report usage based on the processes found running *at the moment they
+examine the process tree*. So, they see just one `ls` command, which is actually very quick
+and uses very little CPU. But the shell is spawning hundreds of them, one after another
+(much like shell scripts do).
+
+### What does Netdata report?
+
+The total CPU utilization of the system:
+
+![image](https://cloud.githubusercontent.com/assets/2662304/21076212/9198e5a6-bf2e-11e6-9bc0-6bdea25befb2.png)
+***Figure 1**: The system overview section of Netdata, just a few seconds after the command was run*
+
+And in the Applications section, `apps.plugin` breaks down CPU usage per application:
+
+![image](https://cloud.githubusercontent.com/assets/2662304/21076220/c9687848-bf2e-11e6-8d81-348592c5aca2.png)
+***Figure 2**: The Applications section of Netdata, just a few seconds after the command was run*
+
+So, the `ssh` session is using 95% CPU time.
+
+Why `ssh`?
+
+`apps.plugin` groups all processes based on its configuration file.
+The default configuration has nothing for `bash`, but it does for `sshd`, so Netdata accumulates
+all ssh sessions into a dimension on the charts called `ssh`. This includes all the processes in
+the process tree of `sshd`, **including the exited children**.
+
+> Distributions based on `systemd` provide another way to get CPU utilization per user session
+> or running service: control groups (cgroups), commonly used as part of containers.
+> `apps.plugin` does not use these mechanisms. The process grouping made by `apps.plugin` works
+> on any Linux, `systemd` based or not.
\ No newline at end of file
diff --git a/blog/2023-09-26-netdata-prometheus-grafana-stack/img/stacked-netdata.png b/blog/2023-09-26-netdata-prometheus-grafana-stack/img/stacked-netdata.png
new file mode 100644
index 00000000..f60b43a8
Binary files /dev/null and b/blog/2023-09-26-netdata-prometheus-grafana-stack/img/stacked-netdata.png differ
diff --git a/blog/2023-09-26-netdata-prometheus-grafana-stack/index.md b/blog/2023-09-26-netdata-prometheus-grafana-stack/index.md
new file mode 100644
index 00000000..065b41e2
--- /dev/null
+++ b/blog/2023-09-26-netdata-prometheus-grafana-stack/index.md
@@ -0,0 +1,267 @@
+---
+slug: netdata-prometheus-grafana-stack
+title: "Netdata, Prometheus, Grafana Stack"
+authors: satya
+tags: [prometheus, exporter, grafana, netdata, monitoring-stack]
+keywords: [prometheus, exporter, grafana, netdata, monitoring-stack]
+image: ./img/stacked-netdata.png
+---
+
+![netdata-prometheus-grafana-stack](./img/stacked-netdata.png)
+
+In this blog, we will walk you through the basics of getting Netdata, Prometheus and Grafana all working together and
+monitoring your application servers. This article uses docker on your local workstation. We will be working
+with docker in an ad-hoc way, launching containers that run `/bin/bash` and attaching a TTY to them. We use docker here
+in a purely academic fashion and do not condone running Netdata in a container. We pick this method so individuals
+without cloud accounts or access to VMs can try this out, and for its speed of deployment.
+
+## Why Netdata, Prometheus, and Grafana
+
+Some time ago I was introduced to Netdata by a coworker. We were attempting to troubleshoot Python code which seemed to
+be bottlenecked. I was instantly impressed by the number of metrics Netdata exposes to you. I quickly added Netdata to
+my set of go-to tools when troubleshooting system performance.
+
+Some time later, I was introduced to Prometheus. Prometheus is a monitoring application which flips the normal
+architecture around and polls REST endpoints for its metrics. This architectural change greatly simplifies and decreases
+the time necessary to begin monitoring your applications. Compared to traditional monitoring solutions, the time spent
+designing the monitoring infrastructure is greatly reduced. Running a single Prometheus server per application becomes
+feasible with the help of Grafana.
+
+Grafana has been the go-to graphing tool for… some time now. It's awesome - anyone who has used it knows it's awesome.
+We can point Grafana at Prometheus and use Prometheus as a data source.
+This allows a pretty simple overall monitoring
+architecture: install Netdata on your application servers, point Prometheus at Netdata, and then point Grafana at
+Prometheus.
+
+I'm omitting an important ingredient in this stack in order to keep this tutorial simple, and that is service discovery.
+My personal preference is to use Consul. Prometheus can plug into Consul and automatically begin to scrape new hosts
+that register a Netdata client with Consul.
+
+At the end of this tutorial you will understand how each technology fits together to create a modern monitoring stack.
+This stack will offer you visibility into your application and system performance.
+
+## Getting Started - Netdata
+
+To begin, let's create the container we will install Netdata on. We need to run a container, forward the necessary
+port that Netdata listens on, and attach a tty so we can interact with the bash shell on the container. But before we do
+this, we want name resolution between the two containers to work. In order to accomplish this we will create a
+user-defined network and attach both containers to this network. The first command we should run is:
+
+```sh
+docker network create --driver bridge netdata-tutorial
+```
+
+With this user-defined network created, we can now launch the container we will install Netdata on and attach it to this
+network.
+
+```sh
+docker run -it --name netdata --hostname netdata --network=netdata-tutorial -p 19999:19999 centos:latest '/bin/bash'
+```
+
+This command creates an interactive tty session (`-it`), gives the container both a name in relation to the docker
+daemon and a hostname (so you know which container is which when working in the shells, and so docker maps hostname
+resolution to this container), forwards the local port 19999 to the container's port 19999 (`-p 19999:19999`), sets the
+command to run (`/bin/bash`) and then chooses the base container image (`centos:latest`). After running this you should
+be sitting inside the shell of the container.
+
+After we have entered the shell we can install Netdata. This process could not be easier. If you take a look at [this
+link](https://github.com/netdata/netdata/blob/master/packaging/installer/README.md), the Netdata devs give us several one-liners to install Netdata. I have not had
+any issues with these one-liners and their bootstrapping scripts so far (if you run into anything, do share). Run
+the following command in your container.
+
+```sh
+wget -O /tmp/netdata-kickstart.sh https://my-netdata.io/kickstart.sh && sh /tmp/netdata-kickstart.sh --dont-wait
+```
+
+After the install completes you should be able to hit the Netdata dashboard at `http://localhost:19999` (replace
+`localhost` if you're doing this on a VM or have the docker container hosted on a machine not on your local system). If
+this is your first time using Netdata I suggest you take a look around. The amount of time I've spent digging through
+`/proc` and calculating my own metrics has been greatly reduced by this tool. Take it all in.
+
+Next I want to draw your attention to a particular endpoint. Navigate to
+`http://localhost:19999/api/v1/allmetrics?format=prometheus` in your browser. This is the endpoint which
+publishes all the metrics in a format which Prometheus understands. Let's take a look at one of these metrics.
+`netdata_disk_space_GiB_average{chart="disk_space._run",dimension="avail",family="/run",mount_point="/run",filesystem="tmpfs",mount_root="/"} 0.0298195 1684951093000`
+This metric represents several things, which I will cover in more detail in the section on Prometheus.
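+
+If you are following along, you can also pull a few of these series from the host with `curl`, since we
+forwarded port 19999 earlier (a quick sketch - the exact series names on your system will differ):
+
+```sh
+# Fetch the Prometheus-format export and show a few of the disk-space series (sketch):
+curl -s 'http://localhost:19999/api/v1/allmetrics?format=prometheus' \
+  | grep '^netdata_disk_space' | head -n 3
+```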
+For now, understand that this metric, `netdata_disk_space_GiB_average`, has several labels
+(`chart`, `family`, `dimension`, `mount_point`, `filesystem`, `mount_root`).
+This corresponds to the disk space chart you see on the Netdata dashboard.
+
+![](https://github.com/ldelossa/NetdataTutorial/raw/master/Screen%20Shot%202017-07-28%20at%204.00.45%20PM.png)
+
+In the screenshot above, the CHART is called `system.cpu`, the FAMILY is `cpu`, and the DIMENSION we are observing is
+`system`. You can begin to draw links between the charts in Netdata and the Prometheus metrics format in this manner.
+
+## Prometheus
+
+We will be installing Prometheus in a container for the purpose of demonstration. While Prometheus does have an official
+container, I would like to walk through the install process and setup on a fresh container. This will allow anyone
+reading to migrate this tutorial to a VM or server of any sort.
+
+Let's start another container in the same fashion as we did the Netdata container.
+
+```sh
+docker run -it --name prometheus --hostname prometheus \
+--network=netdata-tutorial -p 9090:9090 centos:latest '/bin/bash'
+```
+
+This should drop you into a shell once again. Once there, quickly install your favorite editor, as we will be editing
+files later in this tutorial.
+
+```sh
+yum install vim -y
+```
+
+You will also need `wget` and `curl` to download files and `sudo` if you are not root.
+
+```sh
+yum install curl sudo wget -y
+```
+
+Prometheus provides a tarball of its latest stable versions [here](https://prometheus.io/download/).
+
+Let's download the latest version and install it into your container.
+
+```sh
+cd /tmp && curl -s https://api.github.com/repos/prometheus/prometheus/releases/latest \
+| grep "browser_download_url.*linux-amd64.tar.gz" \
+| cut -d '"' -f 4 \
+| wget -qi -
+
+mkdir /opt/prometheus
+
+sudo tar -xvf /tmp/prometheus-*linux-amd64.tar.gz -C /opt/prometheus --strip=1
+```
+
+This should get Prometheus installed into the container. Let's test that we can run Prometheus and connect to its web
+interface.
+
+```sh
+/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml
+```
+
+Now attempt to go to `http://localhost:9090`. You should be presented with the Prometheus homepage. This is a good
+point to talk about [Prometheus's data model](https://prometheus.io/docs/concepts/data_model/). As explained there, we
+have two key elements in Prometheus metrics: the _metric_ and its _labels_. Labels allow for
+granularity between metrics. Let's use our previous example to further explain.
+
+```conf
+netdata_disk_space_GiB_average{chart="disk_space._run",dimension="avail",family="/run",mount_point="/run",filesystem="tmpfs",mount_root="/"} 0.0298195 1684951093000
+```
+
+Here our metric is `netdata_disk_space_GiB_average` and our common labels are `chart`, `family`, and `dimension`. The
+metric's type (gauge, counter, etc…) determines how to interpret it, and the two trailing numbers are the sample itself:
+the metric value and its timestamp in milliseconds. We also have chart-specific
+labels named `mount_point`, `filesystem`, and `mount_root`. We can begin graphing system metrics with this information,
+but first we need to hook up Prometheus to poll Netdata stats.
+
+Let's move our attention to Prometheus's configuration. Prometheus gets its config from the file located (in our example)
+at `/opt/prometheus/prometheus.yml`. I won't spend an extensive amount of time going over the configuration values, which
+are documented in the [Prometheus configuration reference](https://prometheus.io/docs/prometheus/latest/configuration/configuration/).
+We will be adding a new job under `scrape_configs`.
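+
+Before editing the file, it is worth a quick sanity check - from inside the `prometheus` container - that
+the `netdata` hostname resolves over the user-defined network and that the endpoint answers. Something
+along these lines:
+
+```sh
+# Sketch: the netdata name resolves because both containers sit on the netdata-tutorial network.
+curl -s 'http://netdata:19999/api/v1/allmetrics?format=prometheus' | head -n 5
+```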
+Let's make the `scrape_configs` section look like this (we can use the DNS name `netdata` thanks to the
+custom user-defined network we created in docker beforehand).
+
+```yaml
+scrape_configs:
+  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
+  - job_name: 'prometheus'
+
+    # metrics_path defaults to '/metrics'
+    # scheme defaults to 'http'.
+
+    static_configs:
+      - targets: ['localhost:9090']
+
+  - job_name: 'netdata'
+
+    metrics_path: /api/v1/allmetrics
+    params:
+      format: [prometheus]
+
+    static_configs:
+      - targets: ['netdata:19999']
+```
+
+Let's start Prometheus once again by running `/opt/prometheus/prometheus`. If we now navigate to
+`http://localhost:9090/targets` we should see our target being successfully scraped. If we now go back to
+Prometheus's homepage and begin to type `netdata_`, Prometheus should autocomplete the metrics it is now scraping.
+
+![](https://github.com/ldelossa/NetdataTutorial/raw/master/Screen%20Shot%202017-07-28%20at%205.13.43%20PM.png)
+
+Let's now start exploring how we can graph some metrics. Back in our Netdata container, let's get the CPU spinning with a
+pointless busy loop. On the shell do the following:
+
+```sh
+[root@netdata /]# while true; do echo "HOT HOT HOT CPU"; done
+```
+
+Our Netdata CPU graph should be showing some activity. Let's represent this in Prometheus. In order to do this, let's
+keep our metrics page open for reference: `http://localhost:19999/api/v1/allmetrics?format=prometheus`. We are
+setting out to graph the data in the CPU chart, so let's search for `system.cpu` in the metrics page above. We come
+across a section of metrics with the first comments `# COMMENT homogeneous chart "system.cpu", context "system.cpu",
+family "cpu", units "percentage"` followed by the metrics. This is a good start; now let us drill down to the specific
+metric we would like to graph.
+
+```conf
+# COMMENT
+netdata_system_cpu_percentage_average: dimension "system", value is percentage, gauge, dt 1501275951 to 1501275951 inclusive
+netdata_system_cpu_percentage_average{chart="system.cpu",family="cpu",dimension="system"} 0.0000000 1501275951000
+```
+
+Here we learn that the metric name we care about is `netdata_system_cpu_percentage_average`, so throw this into
+Prometheus and see what we get. We should see something similar to this (I shut off my busy loop).
+
+![](https://github.com/ldelossa/NetdataTutorial/raw/master/Screen%20Shot%202017-07-28%20at%205.47.53%20PM.png)
+
+This is a good step toward what we want. Also make note that Prometheus tags on an `instance` label for us, which
+corresponds to our statically defined target in the configuration file. This allows us to tailor our queries to specific
+instances. Now we need to isolate the dimension we want in our query. To do this, let us refine the query slightly by
+also filtering on the dimension. Place this into the query text box:
+`netdata_system_cpu_percentage_average{dimension="system"}`. We now wind up with the following graph.
+
+![](https://github.com/ldelossa/NetdataTutorial/raw/master/Screen%20Shot%202017-07-28%20at%205.54.40%20PM.png)
+
+Awesome, this is exactly what we wanted. If you haven't caught on yet, we can emulate entire charts from Netdata by using
+the `chart` label. If you'd like, you can combine the `chart` and `instance` labels to create per-instance charts.
+Let's give this a try: `netdata_system_cpu_percentage_average{chart="system.cpu", instance="netdata:19999"}`
+
+This is the basics of using Prometheus to query Netdata.
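+
+Everything we just typed into the query box can also be asked over Prometheus's HTTP API, which is handy
+once you start scripting things. A minimal sketch, using the query we built above:
+
+```sh
+# Ask Prometheus (listening on the forwarded port 9090) for the current value of the query:
+curl -s 'http://localhost:9090/api/v1/query' \
+  --data-urlencode 'query=netdata_system_cpu_percentage_average{chart="system.cpu",instance="netdata:19999"}'
+```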
+
+I'd advise everyone at this point to read [this
+page](https://github.com/netdata/netdata/blob/master/exporting/prometheus/README.md#using-netdata-with-prometheus). The key point here is that Netdata can export metrics from
+its internal DB or can send metrics _as-collected_ by specifying the `source=as-collected` URL parameter, like so:
+`http://localhost:19999/api/v1/allmetrics?format=prometheus&source=as-collected`. If you choose to use
+this method you will need to use [Prometheus's set of functions](https://prometheus.io/docs/prometheus/latest/querying/functions/) to
+obtain useful metrics, as you are now dealing with raw counters from the system. For example, you will have to use the
+`irate()` function over a counter to get that metric's rate per second. If your graphing needs are met by using the
+metrics returned by Netdata's internal database (not specifying any `source=` URL parameter) then use that. If you find
+limitations, then consider rewriting your queries using the raw data and using Prometheus functions to get the desired
+chart.
+
+## Grafana
+
+Finally, we make it to Grafana. This is the easiest part, in my opinion. This time we will actually run the official
+Grafana docker container, as all the configuration we need to do is done via the GUI. Let's run the following command:
+
+```sh
+docker run -i -p 3000:3000 --network=netdata-tutorial grafana/grafana
+```
+
+This will get Grafana running at `http://localhost:3000`. Let's go there and
+log in using the default credentials admin:admin.
+
+The first thing we want to do is click "Add data source". Let's make it look like the following screenshot.
+
+![](https://github.com/ldelossa/NetdataTutorial/raw/master/Screen%20Shot%202017-07-28%20at%206.36.55%20PM.png)
+
+With this completed, let's graph! Create a new dashboard by clicking on the top left Grafana icon and create a new graph
+in that dashboard. Fill in the query like we did above and save.
+
+![](https://github.com/ldelossa/NetdataTutorial/raw/master/Screen%20Shot%202017-07-28%20at%206.39.38%20PM.png)
+
+## Conclusion
+
+There you have it: a complete systems monitoring stack which is very easy to deploy. From here I would begin to
+explore how Prometheus and a service discovery mechanism such as Consul can play together nicely. My current prod
+deployments automatically register Netdata services into Consul and Prometheus automatically begins to scrape them. Once
+that is achieved, you do not have to think about the monitoring system until Prometheus can no longer keep up with your
+scale. When that happens, there are options presented in the Prometheus documentation for solving it. Hope this was
+helpful, happy monitoring.
diff --git a/blog/2023-09-26-netdata-qos-monitoring/2023-09-26-netdata-prometheus-grafana-stack/img/stacked-netdata.png b/blog/2023-09-26-netdata-qos-monitoring/2023-09-26-netdata-prometheus-grafana-stack/img/stacked-netdata.png
new file mode 100644
index 00000000..f60b43a8
Binary files /dev/null and b/blog/2023-09-26-netdata-qos-monitoring/2023-09-26-netdata-prometheus-grafana-stack/img/stacked-netdata.png differ
diff --git a/blog/2023-09-26-netdata-qos-monitoring/2023-09-26-netdata-prometheus-grafana-stack/index.md b/blog/2023-09-26-netdata-qos-monitoring/2023-09-26-netdata-prometheus-grafana-stack/index.md
new file mode 100644
index 00000000..418ca26d
--- /dev/null
+++ b/blog/2023-09-26-netdata-qos-monitoring/2023-09-26-netdata-prometheus-grafana-stack/index.md
@@ -0,0 +1,206 @@
+---
+slug: netdata-qos-classes-monitoring
+title: "Netdata QoS Classes monitoring"
+authors: satya
+tags: [qos, tc, quality-of-service, classes, network-monitoring]
+keywords: [qos, tc, quality-of-service, classes, network-monitoring]
+image: ./img/stacked-netdata.png
+---
+
+![netdata-qos-classes](./img/stacked-netdata.png)
+
+Netdata monitors `tc` QoS classes for all interfaces.
+
+If you also use [FireQOS](http://firehol.org/tutorial/fireqos-new-user/), it will also collect interface and class names.
+
+There is a [shell helper](https://raw.githubusercontent.com/netdata/netdata/master/collectors/tc.plugin/tc-qos-helper.sh.in) for this (all parsing is done by the plugin in `C` code - this shell script is just a configuration for the command to run to get the `tc` output).
+
+The source of the tc plugin is [here](https://raw.githubusercontent.com/netdata/netdata/master/collectors/tc.plugin/plugin_tc.c). It is somewhat complex, because a state machine was needed to keep track of all the `tc` classes, including the pseudo classes tc dynamically creates.
+
+You can see a live demo [here](https://registry.my-netdata.io/spaces/registrymy-netdataio/rooms/local/overview#metrics_correlation=false&modal=&modalTab=&modalParams=&selectedIntegrationCategory=deploy.operating-systems&chartName-val=menu_tc&local--chartName-val=menu_tc).
+
+## Motivation
+
+One category of metrics missing in Linux monitoring is the bandwidth consumption of each open socket (inbound and outbound traffic). So, you cannot tell how much bandwidth your web server, your database server, your backup, your ssh sessions, etc. are using.
+
+To solve this problem, the most *adventurous* Linux monitoring tools install kernel modules to capture all traffic, analyze it and provide reports per application. This is a lot of work, it is CPU intensive, and it carries a fair degree of risk (the kernel modules involved might affect the stability of the whole system). Not to mention that such solutions are probably better suited for a core Linux router in your network.
+
+Others use NFACCT, the netfilter accounting module which is already part of the Linux firewall. However, this would require configuring a firewall on every system whose bandwidth you want to measure (just FYI, I do install a firewall on every server - and I strongly advise you to do so too - but configuring accounting on all servers seems overkill when you don't really need it for billing purposes).
+
+**There is, however, a much simpler approach**.
+
+## QoS
+
+One of the features the Linux kernel has, but which is rarely used, is its ability to **apply QoS on traffic**. Even more interesting is that it can apply QoS to **both inbound and outbound traffic**.
+
+QoS is about two features:
+
+1. **Classify traffic**
+
+   Classification is the process of organizing traffic into groups, called **classes**. Classification can evaluate every aspect of network packets, like source and destination ports, source and destination IPs, netfilter marks, etc.
+
+   When you classify traffic, you just assign a label to it. Of course classes have some properties themselves (like queuing mechanisms), but let's say it is that simple: **a label**. For example, **I call `web server` the traffic from and to my server's tcp/80 and tcp/443, while I call `web surfing` all other tcp/80 and tcp/443 traffic**. You can use any combinations you like. There is no limit.
+
+2. **Apply traffic shaping rules to these classes**
+
+   Traffic shaping is used to control how network interface bandwidth should be shared among the classes. Normally, you need to do this when there is not enough bandwidth to satisfy all the demand, or when you want to control the supply of bandwidth to certain services. Of course classification is sufficient for monitoring traffic, but traffic shaping is also quite important, as we will explain in the next section.
+
+## Why you want QoS
+
+1. **Monitoring the bandwidth used by services**
+
+   Netdata provides wonderful real-time charts, like this one (wait to see the orange `rsync` part):
+
+   ![qos3](https://cloud.githubusercontent.com/assets/2662304/14474189/713ede84-0104-11e6-8c9c-8dca5c2abd63.gif)
+
+2. **Ensure sensitive administrative tasks will not starve for bandwidth**
+
+   Have you tried to ssh to a server when the network is congested? If you have, you already know it does not work very well. QoS can guarantee that services like ssh, dns, ntp, etc. will always have a small supply of bandwidth. So, no matter what happens, you will be able to ssh to your server and DNS will always work.
+
+3. **Ensure administrative tasks will not monopolize all the bandwidth**
+
+   Services like backups, file copies, database dumps, etc. can easily monopolize all the available bandwidth. It is common, for example, for a nightly backup or a huge file transfer to negatively influence the end-user experience. QoS can fix that.
+
+4. **Ensure each end-user connection will get a fair cut of the available bandwidth.**
+
+   Several QoS queuing disciplines in Linux do this automatically, without any configuration from you. The result is that new sockets are favored over older ones, so interactive users get a snappier experience while others are transferring large amounts of traffic.
+
+5. **Protect the servers from DDoS attacks.**
+
+   When your system is under a DDoS attack, it will receive a lot more traffic than it can handle, and your applications will probably crash. Setting a limit on the inbound traffic using QoS will protect your servers (by throttling the requests) and, depending on the size of the attack, may allow your legitimate users to access the server while the attack is taking place.
+
+   Using QoS together with a [SYNPROXY](https://github.com/netdata/netdata/blob/master/collectors/proc.plugin/README.md) will provide a great degree of protection against most DDoS attacks. Actually, when I wrote that article, a few folks tried to DDoS the Netdata demo site to see the SYNPROXY operation in real-time. They did not do it right, but anyway a great number of requests reached the Netdata server. What saved Netdata was QoS. The Netdata demo server has QoS installed, so the requests were throttled and the server did not even reach the point of resource starvation.
+   Read about it [here](https://github.com/netdata/netdata/blob/master/collectors/proc.plugin/README.md).
+
+On top of all these, QoS is extremely light. You will configure it once, and this is it. It will not bother you again and it will not use any noticeable CPU resources, especially on application and database servers. In summary, the goals are to:
+
+```
+- ensure administrative tasks (like ssh, dns, etc) will always have a small but guaranteed bandwidth. So, no matter what happens, I will be able to ssh to my server and DNS will work.
+
+- ensure other administrative tasks will not monopolize all the available bandwidth. So, my nightly backup will not hurt my users, a developer that is copying files over the net will not get all the available bandwidth, etc.
+
+- ensure each end-user connection will get a fair cut of the available bandwidth.
+```
+
+Once **traffic classification** is applied, we can use **[Netdata](https://github.com/netdata/netdata)** to visualize the bandwidth consumption per class in real-time (no configuration is needed for Netdata - it will figure it out).
+
+This is QoS on a home Linux router. Check these features:
+
+1. It is real-time (per-second updates)
+2. QoS really works in Linux - check how the `background` traffic is squeezed when `surfing` needs it.
+
+![test2](https://cloud.githubusercontent.com/assets/2662304/14093004/68966020-f553-11e5-98fe-ffee2086fafd.gif)
+
+---
+
+## QoS in Linux?
+
+Of course, `tc` is probably **the most undocumented, complicated and unfriendly** command in Linux.
+
+For example, do you know that to match a simple port range in `tc`, e.g. all the high ports from 1025 to 65535 inclusive, you have to match all of these:
+
+```
+1025/0xffff
+1026/0xfffe
+1028/0xfffc
+1032/0xfff8
+1040/0xfff0
+1056/0xffe0
+1088/0xffc0
+1152/0xff80
+1280/0xff00
+1536/0xfe00
+2048/0xf800
+4096/0xf000
+8192/0xe000
+16384/0xc000
+32768/0x8000
+```
+
+To do it the hard way, you can go through the [tc configuration steps](#qos-configuration-with-tc). An easier way is to use **[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)**, a tool that simplifies QoS management in Linux.
+
+## QoS Configuration with FireHOL
+
+The **[FireHOL](https://firehol.org/)** package already distributes **[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)**. Check the **[FireQOS tutorial](https://firehol.org/tutorial/fireqos-new-user/)** to learn how to write your own QoS configuration.
+
+With **[FireQOS](https://firehol.org/tutorial/fireqos-new-user/)**, it is **really simple for everyone to use QoS in Linux**. Just install the package `firehol`. It should already be available for your distribution. If not, check the **[FireHOL Installation Guide](https://firehol.org/installing/)**.
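+
+On most mainstream distributions this boils down to something like the following (a sketch - package names
+vary, and on some distributions `fireqos` ships as a separate package built from the same source):
+
+```sh
+# Debian/Ubuntu (sketch):
+sudo apt-get install firehol
+# RPM-based distributions (sketch; the package may live in an extra repository such as EPEL):
+sudo yum install firehol
+```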
+
+After that, you will have the `fireqos` command, which uses a configuration file like the following `/etc/firehol/fireqos.conf`, used at the Netdata demo site:
+
+```sh
+# configure the Netdata ports
+server_netdata_ports="tcp/19999"
+
+interface eth0 world bidirectional ethernet balanced rate 50Mbit
+    class arp
+        match arp
+
+    class icmp
+        match icmp
+
+    class dns commit 1Mbit
+        server dns
+        client dns
+
+    class ntp
+        server ntp
+        client ntp
+
+    class ssh commit 2Mbit
+        server ssh
+        client ssh
+
+    class rsync commit 2Mbit max 10Mbit
+        server rsync
+        client rsync
+
+    class web_server commit 40Mbit
+        server http
+        server netdata
+
+    class client
+        client surfing
+
+    class nms commit 1Mbit
+        match input src 10.2.3.5
+```
+
+Nothing more is needed. You just run `fireqos start` to apply this configuration, restart Netdata, and you have real-time visualization of the bandwidth consumption of your applications. FireQOS is not a daemon. It just converts the configuration to `tc` commands, runs them, and exits.
+
+**IMPORTANT**: If you copy this configuration to apply it to your system, please adapt the speeds, and experiment in non-production environments to learn the tool before applying it on your servers.
+
+And this is what you are going to get:
+
+![image](https://cloud.githubusercontent.com/assets/2662304/14436322/c91d90a4-0024-11e6-9fb1-57cdef1580df.png)
+
+## QoS Configuration with tc
+
+First, set up the tc rules in `rc.local`, using commands that assign different QoS markings to different classids. You can see one such example in [GitHub issue #4563](https://github.com/netdata/netdata/issues/4563#issuecomment-455711973).
+
+Then, map the classids to names by creating `/etc/iproute2/tc_cls`. For example:
+
+```
+2:1 Standard
+2:8 LowPriorityData
+2:10 HighThroughputData
+2:16 OAM
+2:18 LowLatencyData
+2:24 BroadcastVideo
+2:26 MultimediaStreaming
+2:32 RealTimeInteractive
+2:34 MultimediaConferencing
+2:40 Signalling
+2:46 Telephony
+2:48 NetworkControl
+```
+
+Add the following configuration option in `/etc/netdata/netdata.conf`:
+
+```
+[plugin:tc]
+    enable show all classes and qdiscs for all interfaces = yes
+```
+
+Finally, create `/etc/netdata/tc-qos-helper.conf` with this content:
+`tc_show="class"`
+
+Please note that, by default, Netdata enables monitoring for metrics only when they are not zero. If they are constantly zero, they are ignored. Metrics that start having values after Netdata is started will be detected, and charts will be added to the dashboard automatically (a refresh of the dashboard is needed for them to appear, though). Set `yes` for a chart instead of `auto` to enable it permanently. You can also set the `enable zero metrics` option to `yes` in the `[global]` section, which enables charts with zero metrics for all internal Netdata plugins.
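+
+To put the last two steps together, something along these lines works (a sketch - the paths and the service
+name assume a standard package install, so adjust them to your setup):
+
+```sh
+# Write the helper configuration and restart Netdata so the tc plugin picks everything up
+echo 'tc_show="class"' | sudo tee /etc/netdata/tc-qos-helper.conf
+sudo systemctl restart netdata   # assumes a systemd-managed Netdata service
+```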