Fixed broken links in gke-monitoring (#1748)
* Fixed broken links in gke-monitoring

* replace entire contents with Prometheus

* Addressed review comments
k8s-ci-robot committed Mar 4, 2020
1 parent 183cef0 commit 5c114a5
Showing 1 changed file with 2 additions and 157 deletions.
159 changes: 2 additions & 157 deletions content/docs/gke/monitoring.md
@@ -4,162 +4,7 @@ description = "Logging and monitoring for Kubeflow"
weight = 110
+++

This guide describes how to set up logging and monitoring for your
Kubeflow deployment.

# Logging
[Prometheus](https://prometheus.io/) is a monitoring tool often used with Kubernetes. If you configure Kubernetes Engine Monitoring and include Prometheus support, then the metrics that are generated by services using the [Prometheus exposition format](https://prometheus.io/docs/instrumenting/exposition_formats/) can be exported from the cluster and made visible as [external metrics](https://cloud.google.com/monitoring/api/metrics_other#externalgoogleapiscom) in Cloud Monitoring.
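As a rough illustration, a service that uses the exposition format serves its metrics as plain text over HTTP, conventionally on a `/metrics` path. This is only a sketch: the host, port, and metric name below are hypothetical, and the commented output just shows the general shape of the format.

```
# Fetch metrics from a service that exposes them in the Prometheus exposition
# format (the host, port, and path here are hypothetical).
curl -s http://my-service:8080/metrics

# Typical output looks something like:
#   # HELP http_requests_total Total number of HTTP requests handled.
#   # TYPE http_requests_total counter
#   http_requests_total{code="200"} 1027
```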

## Stackdriver on GKE

By default, GKE sends logs to
[Stackdriver Logging](https://cloud.google.com/logging/docs/).

Stackdriver recently introduced new features for [Kubernetes Monitoring](https://cloud.google.com/monitoring/kubernetes-engine/migration) that are currently
in Beta. These features are only available on Kubernetes v1.10 or later and must
be explicitly installed. Below are instructions for both the default Stackdriver
support and the new Stackdriver Kubernetes Monitoring.

### Default Stackdriver

This section contains instructions for using the existing Stackdriver support
for GKE, which is the default.

To get the logs for a particular pod, you can use the following
advanced filter in Stackdriver Logging's search UI.

```
resource.type="container"
resource.labels.cluster_name="${CLUSTER}"
resource.labels.pod_id="${POD_NAME}"
```

where ${POD_NAME} is the name of the pod and ${CLUSTER} is the name of your cluster.

The equivalent gcloud command would be

```
gcloud --project=${PROJECT} logging read \
--freshness=24h \
--order asc \
"resource.type=\"container\" resource.labels.cluster_name=\"${CLUSTER}\" resource.labels.pod_id=\"${POD}\" "
```


Kubernetes events for a TFJob are also available in Stackdriver and can
be obtained using the following query in the UI, where ${TFJOB} is the name of the TFJob:

```
resource.labels.cluster_name="${CLUSTER}"
logName="projects/${PROJECT}/logs/events"
jsonPayload.involvedObject.name="${TFJOB}"
```

The equivalent gcloud command is

```
gcloud --project=${PROJECT} logging read \
--freshness=24h \
--order asc \
"resource.labels.cluster_name=\"${CLUSTER}\" jsonPayload.involvedObject.name=\"${TFJOB}\" logName=\"projects/${PROJECT}/logs/events\" "
```

### Stackdriver Kubernetes

This section contains the relevant Stackdriver queries and gcloud commands
if you are using the new [Stackdriver Kubernetes Monitoring](https://cloud.google.com/monitoring/kubernetes-engine).

To get the stdout/stderr logs for a particular container, you can use the following
advanced filter in Stackdriver Logging's search UI.

```
resource.type="k8s_container"
resource.labels.cluster_name="${CLUSTER}"
resource.labels.pod_name="${POD_NAME}"
```

where ${POD_NAME} is the name of the pod and ${CLUSTER} is the name of your cluster.

The equivalent gcloud command would be

```
gcloud --project=${PROJECT} logging read \
--freshness=24h \
--order asc \
"resource.type=\"k8s_container\" resource.labels.cluster_name=\"${CLUSTER}\" resource.labels.pod_name=\"${POD_NAME}\" "
```

Events about individual pods can be obtained with the following query

```
resource.type="k8s_pod"
resource.labels.cluster_name="${CLUSTER}"
resource.labels.pod_name="${POD_NAME}"
```

or via gcloud

```
gcloud --project=${PROJECT} logging read \
--freshness=24h \
--order asc \
"resource.type=\"k8s_pod\" resource.labels.cluster_name=\"${CLUSTER}\" resource.labels.pod_name=\"${POD_NAME}\" "
```

#### Filter with labels

The new agents also support querying for logs using pod labels.
For example:

```
resource.type="k8s_container"
resource.labels.cluster_name="${CLUSTER}"
metadata.userLabels.${LABEL_KEY}="${LABEL_VALUE}"
```
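The equivalent gcloud command should follow the same pattern as the earlier examples; this is a sketch, with ${LABEL_KEY} and ${LABEL_VALUE} standing in for the pod label you want to match.

```
# Read container logs filtered by a pod label (sketch; adjust the label to your workload).
gcloud --project=${PROJECT} logging read \
--freshness=24h \
--order asc \
"resource.type=\"k8s_container\" resource.labels.cluster_name=\"${CLUSTER}\" metadata.userLabels.${LABEL_KEY}=\"${LABEL_VALUE}\" "
```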

# Monitoring

## Stackdriver on GKE
The new [Stackdriver Kubernetes Monitoring](https://cloud.google.com/monitoring/kubernetes-engine)
provides observability in a single dashboard and is compatible with the Prometheus data model.

See this [doc](https://cloud.google.com/monitoring/kubernetes-engine/observing) for more
details on the dashboard.

Stackdriver by default provides container-level CPU/memory metrics.
You can also define custom Prometheus metrics and view them on the Stackdriver dashboard.
See the [Stackdriver Prometheus documentation](https://cloud.google.com/monitoring/kubernetes-engine/prometheus) for more detail.

## Prometheus

### Kubeflow Prometheus component
Kubeflow provides a Prometheus [component](https://github.com/kubeflow/kubeflow/blob/master/kubeflow/gcp/prometheus.libsonnet).
To deploy the Prometheus component:

```
ks generate prometheus prom --projectId=YOUR_PROJECT --clusterName=YOUR_CLUSTER --zone=ZONE
ks apply YOUR_ENV -c prom
```

The Prometheus server will scrape services that have the annotation `prometheus.io/scrape=true`.
See the [configuration](https://github.com/kubeflow/kubeflow/blob/master/kubeflow/gcp/prometheus.yml#L75) for more detail
and an [example](https://github.com/kubeflow/kubeflow/blob/master/kubeflow/gcp/metric-collector.libsonnet#L83).
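For example, you can opt an existing service into scraping by adding the annotation with kubectl; this is a sketch, and `my-service` is a placeholder for your service's name (in practice the annotation is usually set in the service's manifest).

```
# Add the scrape annotation to an existing service (my-service is a placeholder).
kubectl annotate service my-service prometheus.io/scrape=true
```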

#### Export metrics to Stackdriver
The Prometheus server will export metrics to Stackdriver, as
[configured](https://github.com/kubeflow/kubeflow/blob/master/kubeflow/gcp/prometheus.yml#L127).
We are using an [image](https://github.com/kubeflow/kubeflow/blob/master/kubeflow/gcp/prometheus.libsonnet#L170)
provided by Stackdriver. See the Stackdriver [documentation](https://cloud.google.com/monitoring/kubernetes-engine/prometheus)
for more detail; you don't need to change anything here.

If you don't want to export metrics to Stackdriver, remove the `remote_write` section from `prometheus.yml`
and use a native Prometheus [image](https://hub.docker.com/r/prom/prometheus/tags/).

### Metric collector component for IAP (GKE only)
Kubeflow also provides a metric-collector [component](https://github.com/kubeflow/kubeflow/tree/master/metric-collector).
This component periodically pings your Kubeflow endpoint and provides a
[metric](https://github.com/kubeflow/kubeflow/blob/master/metric-collector/service-readiness/kubeflow-readiness.py#L21)
indicating whether the endpoint is up. To deploy it:

```
ks generate metric-collector mc --targetUrl=YOUR_KF_ENDPOINT
ks apply YOUR_ENV -c mc
```
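Once applied, you can check that the collector pod is running; this is a sketch and assumes Kubeflow is deployed in the `kubeflow` namespace.

```
# Look for the metric-collector pod in the kubeflow namespace.
kubectl -n kubeflow get pods | grep metric-collector
```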
To configure and use Prometheus with Kubernetes Engine Monitoring, see [the GCP documentation](https://cloud.google.com/monitoring/kubernetes-engine/prometheus).
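If Kubernetes Engine Monitoring is not yet enabled on your cluster, it can typically be turned on with gcloud; this is a sketch, and depending on your gcloud version the flag may only be available in the beta component.

```
# Enable Kubernetes Engine Monitoring (Stackdriver Kubernetes) on an existing cluster.
gcloud container clusters update ${CLUSTER} \
--zone=${ZONE} \
--enable-stackdriver-kubernetes
```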
