Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
133 lines (92 sloc) 6.75 KB
---
title: Selecting and Configuring a Monitoring System
owner: Cloud Ops
---
This topic describes considerations for selecting and configuring a system to continuously monitor Pivotal Cloud Foundry (PCF) component performance and health.
## <a id='select'></a>Selecting a Monitoring Platform
Pivotal recommends using <a href="https://docs.pivotal.io/pcf-healthwatch/index.html">PCF Healthwatch</a> to
monitor your deployment. PCF Healthwatch is a service tile developed and supported by Pivotal and available on [Pivotal Network](https://network.pivotal.io/products/p-healthwatch).
Many third-party systems can also be used to monitor a PCF deployment.</p>
### <a id='types'></a>Monitoring Platform Types
Monitoring platforms support two types of monitoring:
- A _dashboard_ for active monitoring when you are at a keyboard and
screen
- Automated _alerts_ for when your attention is elsewhere
Some monitoring solutions offer both in one package. Others require
putting the two pieces together.
### <a id='platforms'></a>Monitoring Platforms
There are many monitoring options
available, both open source and commercial products. Some commonly-used platforms among PCF customers include:
* [PCF Healthwatch](https://docs.pivotal.io/pcf-healthwatch/index.html) by Pivotal
* PCF Partner
Services available on [Pivotal Network](https://network.pivotal.io):
- [AppDynamics](https://docs.pivotal.io/partners/appdynamics/old_index.html)
- [Datadog](https://docs.pivotal.io/partners/datadog/release-notes.html)
- [Dynatrace](https://docs.pivotal.io/partners/dynatrace-fullstack-addon/release-notes.html)
- [New
Relic](http://docs.pivotal.io/partners/newrelic/release-notes.html)
- [SignalFx](http://docs.pivotal.io/partners/signalfx/)
- [WaveFront by VMware](http://docs.pivotal.io/partners/wavefront-nozzle/)
* Other Commercial Services
- [VMware vRealize Operations (vROPS)](http://www.vmware.com/products/vrealize-operations.html)
* Open Source Tooling
- [Prometheus +
Grafana](https://github.com/bosh-prometheus/prometheus-boshrelease)
- [OpenTSDB](http://opentsdb.net)
### <a id='cloud-ops'></a>Pivotal Cloud Ops Tools
The Pivotal Cloud Ops Team manages two types of deployments for internal Pivotal use: open-source Cloud Foundry, and Pivotal Cloud Foundry (PCF).
For **Cloud Foundry**, Pivotal Cloud Ops uses several monitoring tools.
The [Datadog Config
repository](https://github.com/pivotal-cf-experimental/datadog-config-oss)
provides an example of how the Pivotal Cloud Ops team uses a customized Datadog dashboard to monitor the health of its open-source Cloud Foundry deployments.
To monitor its **PCF** deployments, Pivotal Cloud Ops leverages a combination of
[PCF
Healthwatch](https://docs.pivotal.io/pcf-healthwatch/1-2/index.html)
and [Google Stackdriver](https://cloud.google.com/stackdriver/).
## <a id='cloud-ops'></a>Key Inputs for Platform Monitoring
### <a id='component-metrics'></a>BOSH VM and PCF Component Health Metrics
Most monitoring service tiles for PCF come packaged with the Firehose
nozzle necessary to extract the BOSH and CF metrics leveraged for platform
monitoring. Nozzles are programs that consume data from the Loggregator Firehose. Nozzles can be configured to select, buffer, and transform data, and to forward it to other apps and services.
The nozzles gather the component logs and metrics streaming from the Loggregator Firehose endpoint. For more information about the Firehose, see [Loggregator Architecture](https://docs.pivotal.io/pivotalcf/loggregator/architecture.html).
As of PCF v2.0, both BOSH VM Health metrics and Cloud Foundry component metrics stream through the Firehose by default.
- PCF component metrics originate from the Metron agents on their
source components, then travel through Dopplers to the Traffic
Controller.
- The Traffic Controller aggregates both metrics and log messages
system-wide from all Dopplers, and emits them from its Firehose
endpoint.
The following topic list high-signal-value metrics and capacity scaling
indicators in a PCF deployment:
- [Key Performance Indicators](https://docs.pivotal.io/pivotalcf/monitoring/kpi.html)
- [Key Capacity Scaling Indicators](https://docs.pivotal.io/pivotalcf/monitoring/key-cap-scaling.html)
### <a id='smoke-tests'></a>Continuous Functional Smoke Tests
PCF includes [smoke
tests](https://github.com/pivotal-cloudops/cf-smoke-tests/tree/Dockerfile),
which are functional unit and integration tests on all major system
components. By default, whenever an operator upgrades to a new version
of PAS, these smoke tests run as a post-deploy errand.
Pivotal recommends additional higher-resolution monitoring by the execution
of continuous smoke tests, or Service Level Indicator tests, that measure user-defined features and check them against expected levels.
- [PCF
Healthwatch](https://network.pivotal.io/products/p-healthwatch)
automatically executes these tests for PAS Service Level Indicators.
- The Pivotal Cloud Ops [CF Smoke
Tests](https://github.com/pivotal-cloudops/cf-smoke-tests/tree/Dockerfile)
repository offers additional testing examples.
See the [Metrics](https://concourse.ci/metrics.html) topic in the Concourse documentation for how to set up Concourse to generate custom component metrics.
## <a id="thresholds"></a>Warning and Critical Thresholds
To properly configure your monitoring dashboard and alerts, you must establish what thresholds should drive alerting and red/yellow/green dashboard behavior.
Some key metrics have more fixed thresholds, with similar threshold numbers numbers recommended across different foundations and use cases. These metrics tend to revolve around the health and performance of key components that can
impact the performance of the entire system.
Other metrics of operational value are more dynamic in nature. This means that you must establish a baseline and yellow/red thresholds suitable for your system and its use cases. You can establish initial baselines by watching values of key metrics over time and noting what seems to be a good starting threshold level that divides acceptable and unacceptable system performance and health.
### <a id='continuous'></a>Continuous Evolution
Effective platform monitoring requires continuous evolution.
After you establish initial baselines, Pivotal recommends that you continue to refine your metrics and tests to maintain the appropriate balance between early detection and reducing
unnecessary alert fatigue. The dynamic measures recommended in [Key
Performance
Indicators](https://docs.pivotal.io/pivotalcf/monitoring/kpi.html)
and [Key Capacity Scaling
Indicators](https://docs.pivotal.io/pivotalcf/monitoring/key-cap-scaling.html)
should be revisited on occasion to ensure they are still appropriate to
the current system configuration and its usage patterns.
You can’t perform that action at this time.