
Setting up cluster monitoring

Raul.A edited this page Mar 6, 2018 · 2 revisions

Cluster Monitoring

This is a set of instructions for setting up a basic Collectd->Graphite->Grafana monitoring system for a DCOS cluster.

Step 1: Setup Graphite

Graphite is the center of the monitoring system. It gives Collectd a place to store data, and Grafana a place to pull from.

The Docker image chosen for Graphite is nickstenning/graphite. This is a basic image consisting of Graphite, the Carbon backend, and an NGINX web server that serves the data through a web interface.

The following ports need to be mapped:

  • 80 - for the web interface and API
  • 2003 - for receiving data from Collectd

You will want to map the following volumes as well:

  • /var/lib/graphite/conf - for persistent configuration
  • /var/lib/graphite/storage/whisper - the actual data being collected

You can start a new instance in DCOS with the following JSON:

{
    "id": "/graphite",
    "backoffFactor": 1.15,
    "backoffSeconds": 1,
    "container": {
        "portMappings": [
            {
                "containerPort": 80,
                "hostPort": 0,
                "labels": {
                    "VIP_0": "/graphite2:80"
                },
                "protocol": "tcp",
                "servicePort": 10154
            },
            {
                "containerPort": 2003,
                "hostPort": 0,
                "labels": {
                    "VIP_1": "/graphite2:2003"
                },
                "protocol": "tcp",
                "servicePort": 10155
            }
        ],
        "type": "DOCKER",
        "volumes": [
            {
                "containerPath": "/var/lib/graphite/conf",
                "hostPath": "<your/host-path/here>",
                "mode": "RW"
            },
            {
                "containerPath": "/var/lib/graphite/storage/whisper",
                "hostPath": "<your/host-path/here>",
                "mode": "RW"
            }
        ],
        "docker": {
            "image": "nickstenning/graphite",
            "forcePullImage": false,
            "privileged": false,
            "parameters": []
        }
    },
    "cpus": 4,
    "disk": 0,
    "instances": 1,
    "maxLaunchDelaySeconds": 3600,
    "mem": 2056,
    "gpus": 0,
    "networks": [
        {
            "mode": "container/bridge"
        }
    ],
    "requirePorts": false,
    "upgradeStrategy": {
        "maximumOverCapacity": 1,
        "minimumHealthCapacity": 1
    },
    "killSelection": "YOUNGEST_FIRST",
    "unreachableStrategy": {
        "inactiveAfterSeconds": 0,
        "expungeAfterSeconds": 0
    },
    "healthChecks": [],
    "fetch": [],
    "constraints": []
}
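The JSON above still contains two <your/host-path/here> placeholders, so it can be worth filling them in and sanity-checking the port mappings before submitting the definition. The following is a minimal Python sketch of that idea, not part of the original setup; the function names and the example host paths are hypothetical.

```python
import json

# Web/API port and Carbon plaintext port, per the port list above.
REQUIRED_PORTS = {80, 2003}


def prepare_app(app_json, host_paths):
    """Parse an app definition and set hostPath on every container volume.

    host_paths maps containerPath -> hostPath. A volume without an entry
    raises KeyError, so a placeholder is never submitted by accident.
    """
    app = json.loads(app_json)
    for volume in app["container"]["volumes"]:
        volume["hostPath"] = host_paths[volume["containerPath"]]
    return app


def missing_ports(app):
    """Return the required container ports that have no port mapping."""
    mapped = {m["containerPort"] for m in app["container"]["portMappings"]}
    return REQUIRED_PORTS - mapped
```

Assuming the definition is saved as graphite.json (a hypothetical file name), prepare_app(open("graphite.json").read(), {"/var/lib/graphite/conf": "/srv/graphite/conf", "/var/lib/graphite/storage/whisper": "/srv/graphite/whisper"}) returns a dictionary that can be written back out with json.dumps, and missing_ports should return an empty set for it.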

In DCOS, make sure to check "Enable load balanced service address" to get a service address for both ports. The CPU and memory required will vary with how many nodes you plan to monitor, but I found 4 CPUs and 2 GB to be plenty for 16 nodes.

You can now verify graphite is up and running by going to ip_address:80/dashboard.

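To adjust retention settings, open storage-schemas.conf, located in the host directory you mapped to /var/lib/graphite/conf, and add the following between the [carbon] and [default_1min_for_1day] blocks. Carbon applies the first section whose pattern matches, so this block has to come before the default one.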

[collectd]
pattern = ^collectd.*
retentions = 10s:1h,1m:1d,10m:2w

This retains 10-second data points for 1 hour, 1-minute data points for 1 day, and 10-minute data points for 2 weeks for any metric whose name starts with collectd.
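To get a feel for what a retention string costs on disk, the arithmetic can be sketched in Python. This is an illustrative calculation, not something from the original setup: it assumes the standard Whisper layout (a 16-byte header, 12 bytes of metadata per archive, 12 bytes per data point) and only handles the unit-suffixed form used above.

```python
# Seconds per unit suffix accepted in this sketch.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_retention(spec):
    """Return (seconds_per_point, points) for one 'precision:duration' pair."""
    precision, duration = spec.split(":")
    step = int(precision[:-1]) * UNITS[precision[-1]]
    span = int(duration[:-1]) * UNITS[duration[-1]]
    return step, span // step


def whisper_size(retentions):
    """Estimate the size in bytes of one Whisper file for a retention string."""
    archives = [parse_retention(spec) for spec in retentions.split(",")]
    points = sum(count for _, count in archives)
    return 16 + 12 * len(archives) + 12 * points


for spec in "10s:1h,1m:1d,10m:2w".split(","):
    step, count = parse_retention(spec)
    print(f"{spec}: one point every {step}s, {count} points")
print("bytes per metric:", whisper_size("10s:1h,1m:1d,10m:2w"))
```

For the [collectd] block this works out to 360 + 1440 + 2016 = 3816 points, or roughly 45 KB per metric, so total disk use scales with the number of metrics each node reports.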

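After adjusting the conf file, restart the Graphite instance. The new retentions apply only to Whisper files created after the change; files that already exist keep their old layout.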

Step 2: Collectd

Collectd collects data through whichever plugins you enable. There are plugins for nearly everything you would want to monitor, and plenty more for things you wouldn't. In this case, we are only going to monitor CPU, memory, and disk.

Here is our basic collectd.conf:

FQDNLookup false
Interval 10
Timeout 2
ReadThreads 5

LoadPlugin cpu
LoadPlugin disk
LoadPlugin memory
LoadPlugin write_graphite

<Plugin disk>
	Disk "/^[hs]d[a-f][0-9]?$/"
	IgnoreSelected false
</Plugin>

<Plugin "write_graphite">
 <Node "endpoint">
   Host "${EP_HOST}"
   Port "${EP_PORT}"
   Protocol "tcp"
   LogSendErrors true
   EscapeCharacter "_"
   Prefix "${PREFIX}"
 </Node>
</Plugin>
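The Disk regular expression in the config above decides which block devices are reported (with IgnoreSelected false, matching devices are the ones collected). A quick way to see what it covers is to try the same pattern in Python; the device names below are just illustrative examples.

```python
import re

# Same pattern as the Disk option, without collectd's surrounding slashes.
DISK_PATTERN = re.compile(r"^[hs]d[a-f][0-9]?$")


def is_selected(device):
    """Return True if the disk plugin's pattern matches this device name."""
    return DISK_PATTERN.match(device) is not None


for name in ["sda", "sdb1", "hdc", "sdg", "sda10", "vda", "nvme0n1"]:
    print(name, "collected" if is_selected(name) else "ignored")
```

Only hd*/sd* devices a through f, with at most a single-digit partition number, match. Nodes whose disks show up under other names (for example virtio or NVMe devices) would need the pattern extended.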

A simple installation script for CentOS 7 follows:

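#!/bin/bash
# Install collectd on CentOS 7 and point it at Graphite.
set -e  # stop at the first failing command

# Load-balanced Graphite address from DCOS; replace the placeholder before running.
EP_HOST="<the host given by DCOS>"
EP_PORT=2003
PREFIX="collectd."

# collectd is packaged in EPEL, so enable that repository first.
yum -y install epel-release
yum -y install collectd collectd-utils

# Fill in the ${...} placeholders in the template and install the result.
sed -e "s/\${EP_HOST}/$EP_HOST/" -e "s/\${EP_PORT}/$EP_PORT/" -e "s/\${PREFIX}/$PREFIX/" collectd.conf > /etc/collectd.conf

systemctl enable collectd
systemctl start collectd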

This assumes the script file and collectd.conf are in the same directory.
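The sed line in the script is a plain placeholder substitution. For anyone who prefers to generate the file from Python (for instance because a value contains a / or & that sed would misinterpret), a rough equivalent might look like the sketch below. The function name and the example host are hypothetical, and this is an alternative to the sed step rather than something the original setup uses.

```python
def render_conf(template, values):
    """Replace each ${NAME} placeholder in the template with values[NAME].

    Plain string replacement is used so that other '$' characters, such as
    the one in the Disk regular expression, are left untouched.
    """
    rendered = template
    for name, value in values.items():
        rendered = rendered.replace("${" + name + "}", value)
    if "${" in rendered:
        raise ValueError("unfilled placeholder left in rendered config")
    return rendered


template = 'Host "${EP_HOST}"\nPort "${EP_PORT}"\nPrefix "${PREFIX}"\n'
print(render_conf(template, {
    "EP_HOST": "graphite.example",  # hypothetical address
    "EP_PORT": "2003",
    "PREFIX": "collectd.",
}))
```

The check for a leftover ${ is there so that a forgotten value fails loudly instead of producing a config that collectd cannot use.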

Simply run this script on any node you wish to monitor and it will start reporting data to graphite right away!

You can confirm data is coming in by going back to ip_address:80/dashboard. If the collectd. prefix shows up in the metric list, you know it is working.
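If nothing shows up, it can help to rule out the Graphite side by sending a single data point by hand. Carbon's plaintext listener on port 2003 accepts lines of the form "path value timestamp". The following is a minimal Python sketch of that check; the metric name and the host are hypothetical placeholders.

```python
import socket
import time


def format_line(path, value, timestamp=None):
    """Build one Carbon plaintext line: '<path> <value> <unix timestamp>\\n'."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(host, port, path, value):
    """Open a TCP connection to Carbon and send a single data point."""
    line = format_line(path, value)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example call (not executed here); replace the host with your service address:
# send_metric("graphite.example", 2003, "collectd.manual_test.value", 1)
print(format_line("collectd.manual_test.value", 1, 1520000000), end="")
```

If the test metric appears in the dashboard but the collectd metrics do not, the problem is more likely on the node (config, firewall, or the collectd service) than in Graphite.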

Step 3: Grafana

Information on the Grafana Docker image is available at https://hub.docker.com/r/grafana/grafana/

You'll want to expose port 3000 for web access.

For DCOS here is an example configuration:

{
    "id": "/grafana",
    "backoffFactor": 1.15,
    "backoffSeconds": 1,
    "container": {
        "portMappings": [
            {
                "containerPort": 3000,
                "hostPort": 3000,
                "labels": {
                    "VIP_0": "/grafana:3000"
                },
                "protocol": "tcp",
                "servicePort": 10153
            }
        ],
        "type": "DOCKER",
        "volumes": [
            {
                "containerPath": "/var/lib/grafana",
                "hostPath": "hostpath",
                "mode": "RW"
            }
        ],
        "docker": {
            "image": "grafana/grafana",
            "forcePullImage": false,
            "privileged": false,
            "parameters": []
        }
    },
    "cpus": 2,
    "disk": 0,
    "instances": 1,
    "maxLaunchDelaySeconds": 3600,
    "mem": 512,
    "gpus": 0,
    "networks": [
        {
            "mode": "container/bridge"
        }
    ],
    "requirePorts": false,
    "upgradeStrategy": {
        "maximumOverCapacity": 1,
        "minimumHealthCapacity": 1
    },
    "killSelection": "YOUNGEST_FIRST",
    "unreachableStrategy": {
        "inactiveAfterSeconds": 0,
        "expungeAfterSeconds": 0
    },
    "healthChecks": [],
    "fetch": [],
    "constraints": []
}

Once it's up and running, you can access the web UI at ip:3000 and log in with admin/admin.

Next we need to set up a data source. Under Configuration -> Data Sources, create a new data source. Name it whatever you want, but set the type to Graphite. For the URL, give it the load-balanced address from DCOS for Graphite's port 80. Keep access as proxy.

Hit "Save & Test". If everything works, you will see a green "Data source is working" box.
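The same data source can also be created without the UI through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request so it can be inspected first; the two URLs are hypothetical, and the admin/admin credentials are the defaults mentioned above.

```python
import base64
import json
import urllib.request


def datasource_request(grafana_url, graphite_url, name="graphite",
                       user="admin", password="admin"):
    """Build (but do not send) the request that registers a Graphite data source."""
    payload = {
        "name": name,
        "type": "graphite",
        "url": graphite_url,
        "access": "proxy",  # same as keeping access set to proxy in the UI
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


request = datasource_request("http://grafana.example:3000",
                             "http://graphite.example:80")
print(request.get_method(), request.full_url)
# To actually create the data source: urllib.request.urlopen(request)
```

Scripting this step is mainly useful when the Grafana container is recreated often and its /var/lib/grafana volume is not persistent.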

From here you should create a new dashboard. Click the + symbol on the left and select Dashboard. Click Graph, and a new empty graph will appear. Click "Panel Title" and select Edit. Set the data source to the Graphite data source we just set up.

Now you can start drilling down into which metric you want displayed. For example, to show all disk write ops of a node, you would use a path such as collectd.<hostname>.disk-*.disk_ops.write (a * in place of the hostname matches every node), and then add a sum function to combine the per-disk series into a single line.
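The same kind of query can be tried directly against Graphite's render API, which is a convenient way to check a metric path before building a panel around it. Below is a small Python sketch that builds such a URL; the base URL and host name are hypothetical, and it assumes the collectd. prefix and the disk_ops.write naming used in this guide. Note that with EscapeCharacter "_" in the collectd config, dots in a node's hostname appear as underscores in the metric path.

```python
from urllib.parse import urlencode


def render_url(base_url, host, window="-1h"):
    """Build a Graphite render URL summing write ops over all disks of one host."""
    target = f"sumSeries(collectd.{host}.disk-*.disk_ops.write)"
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


print(render_url("http://graphite.example", "node1_example_com"))
```

Opening the printed URL in a browser (or fetching it with curl) should return the summed series as JSON if the path matches any metrics, and an empty list otherwise.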
