This repository contains the files required for the SRE Workshop
This repository showcases the usage of several solutions:
- the HipsterShop
- Litmus Chaos
- OpenTelemetry Collector
- Istio
- Keptn Lifecycle Toolkit
- The OpenTelemetry Demo application
In this workshop we will walk through configuring the Keptn Lifecycle Toolkit to deploy:
- the hipster-shop
- the OpenTelemetry Demo application
During this workshop we will learn how to:
- Configure Chaos Experiments
- Create SLOs to keep track of the health of our OpenTelemetry Collectors
The following tools need to be installed on your machine:
- jq
- kubectl
- git
- gcloud (if you are using GKE)
- Helm
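A quick way to verify that the prerequisites are available (a minimal check, adjust to your setup):
jq --version
kubectl version --client
git --version
gcloud --version
helm version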
PROJECT_ID="<your-project-id>"
gcloud services enable container.googleapis.com --project ${PROJECT_ID}
gcloud services enable monitoring.googleapis.com \
cloudtrace.googleapis.com \
clouddebugger.googleapis.com \
cloudprofiler.googleapis.com \
--project ${PROJECT_ID}
ZONE=europe-west3-a
NAME=sreworshop
gcloud container clusters create ${NAME} --zone=${ZONE} --machine-type=e2-standard-8 --num-nodes=3
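If kubectl is not already pointing at the new cluster, fetch the credentials (using the same project, zone and cluster name defined above):
gcloud container clusters get-credentials ${NAME} --zone=${ZONE} --project ${PROJECT_ID}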
git clone https://github.com/henrikrexed/sre-workshop
cd sre-workshop
List the nodes and label one of them for observability and the remaining two as workers:
kubectl get nodes -o wide
kubectl label node <your-observability-node> node-type=observability
kubectl label node <your-first-worker-node> node-type=worker
kubectl label node <your-second-worker-node> node-type=worker
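To double-check the assignment (the -L flag adds a column showing the value of the node-type label):
kubectl get nodes -L node-type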
- Download Istioctl
curl -L https://istio.io/downloadIstio | sh -
This command downloads the latest version of Istio (in our case Istio 1.17.2) compatible with your operating system.
- Add istioctl to your PATH
cd istio-1.17.2
This directory contains samples with addons. We will refer to it later.
export PATH=$PWD/bin:$PATH
To enable Istio and take advantage of its tracing capabilities, install Istio with the following settings:
istioctl install --set meshConfig.accessLogFile=/dev/stdout --set meshConfig.enableTracing=true --set profile=demo -y
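Istio only injects its sidecar proxies into namespaces labeled for injection. If the deployment script does not handle this for you, you can label the application namespace manually (assuming the hipster-shop namespace used later in this workshop):
kubectl label namespace hipster-shop istio-injection=enabled --overwrite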
If you don't have a Dynatrace tenant, I suggest creating a trial using the following link: Dynatrace Trial
Once you have your tenant, save the Dynatrace tenant URL (including https) in the variable DT_TENANT_URL
(for example: https://dedededfrf.live.dynatrace.com)
DT_TENANT_URL=<YOUR TENANT URL>
The Dynatrace Operator requires several tokens:
- a token to deploy and configure the various components
- a token to ingest metrics and traces
Create a first token for the operator with the following scopes:
- Create ActiveGate tokens
- Read entities
- Read Settings
- Write Settings
- Access problem and event feed, metrics and topology
- Read configuration
- Write configuration
- PaaS integration - Installer download
Save the value of the token; we will use it later to store it in a Kubernetes secret:
API_TOKEN=<YOUR TOKEN VALUE>
Then create a second Dynatrace token for data ingestion with the following scopes:
- Ingest metrics (metrics.ingest)
- Ingest logs (logs.ingest)
- Ingest events (events.ingest)
- Ingest OpenTelemetry traces (openTelemetryTrace.ingest)
- Read metrics (metrics.read)
DATA_INGEST_TOKEN=<YOUR TOKEN VALUE>
cd ..
chmod 777 deployment.sh
./deployment.sh --clustername "${NAME}" --dturl "${DT_TENANT_URL}" --dtoperatortoken "${API_TOKEN}" --dtingesttoken "${DATA_INGEST_TOKEN}"
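The script takes a few minutes to complete. Once it returns, you can verify that the workloads are up (assuming the script created the dynatrace and hipster-shop namespaces):
kubectl get pods -n dynatrace
kubectl get pods -n hipster-shop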
Before running any chaos experiments, let's start by creating alerting rules and SLOs to measure the health and efficiency of:
- the K8S cluster
- the Application
This SLO would be measured by comparing the total CPU core usage with the total CPU cores requested. Our SLI would be expressed as:
- total usage / total requested * 100
Because of the way resources are managed in Kubernetes, the usage can be higher than the request (if the limit is greater than the request).
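For example, assuming the containers in the cluster currently consume 6 CPU cores while the workloads request a total of 8 cores, the SLI would be 6 / 8 * 100 = 75%. A value above 100% simply means that containers are bursting above their requests.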
In Dynatrace we can express the SLI with the following metric expression:
((builtin:containers.cpu.usageUserMilliCores:filter(and(or(in("dt.entity.container_group_instance",entitySelector("type(container_group_instance),fromRelationship.isCgiOfCluster(type(KUBERNETES_CLUSTER),entityName.equals(~"sreworshop~"))"))))):splitBy():sum)*(0.001))/(builtin:kubernetes.node.requests_cpu:splitBy():sum)
It would be a similar ratio, except that the CPU cores requested and used would be split by namespace:
(builtin:kubernetes.workload.cpu_usage:splitBy("k8s.namespace.name"):sort(value(auto,descending)):limit(20))/(builtin:kubernetes.workload.requests_cpu:splitBy("k8s.namespace.name"):sort(value(auto,descending)):limit(20))*(100)
The same ratio can be split by workload:
(builtin:kubernetes.workload.cpu_usage:splitBy("k8s.workload.name"):sort(value(auto,descending)):limit(20))/(builtin:kubernetes.workload.requests_cpu:splitBy("k8s.workload.name"):sort(value(auto,descending)):limit(20))*(100)
We would measure the number of nodes in ready state compared to the total number of nodes available in the cluster.
This SLO would be expressed with the following metric expression:
In our example the Hipster-shop has a small K6 load test that acts like a synthetic test.
The K6 test uses the Dynatrace output plugin, which sends all the K6 statistics to Dynatrace.
We could measure the response time from K6 with the following metric expression:
The eviction process happens if a node reports any of the pressure conditions:
- DiskPressure
- MemoryPressure
- PIDPressure
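You can inspect the current conditions of a node at any time, for example:
kubectl describe node <your-node-name> | grep -A 8 'Conditions:'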
In our example we will try to simulate a node pressure situation. For this we will use two existing experiments available in the Litmus ChaosHub:
- Node CPU Hog
- Node Memory Hog
We will run both experiments in parallel on the same node (the one having the label node-type=worker), with:
- CPU hog usage: 70%
- Memory hog usage: 70%
Let's select one node from our cluster:
kubectl get nodes -l node-type=worker
Save one of the node names in the following variable:
NODE_NAME=<YOUR NODE_NAME>
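Alternatively, you can pick the first matching worker node programmatically:
NODE_NAME=$(kubectl get nodes -l node-type=worker -o jsonpath='{.items[0].metadata.name}')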
Let's update our chaos experiment with our node name:
sed -i "s,NODE_NAME_TO_REPLACE,$NODE_NAME," litmus chaos/chaos_schedule_nodememoryhog.yaml
Now we can run the experiment:
kubectl apply -f litmus chaos/rbac.yaml -n hipster-shop
kubectl apply -f litmus chaos/chaos_schedule_nodememoryhog.yaml
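You can follow the progress and the verdict of the experiments through the Litmus custom resources (assuming they run in the hipster-shop namespace, as above):
kubectl get chaosengine -n hipster-shop
kubectl get chaosresult -n hipster-shop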
To measure the impact of the failure of the most important components of the Hipster-shop:
- the Redis database
- the ProductCatalog service
To achieve this, we will first run an experiment deleting Redis and then run the experiment deleting the ProductCatalog.
kubectl apply -f litmus chaos/redis_product.yaml
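While the experiment runs, you can watch the targeted pods being deleted and recreated:
kubectl get pods -n hipster-shop -w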
To guarantee the good behavior of our main service, the frontend, let's create an HPA rule that will scale the frontend deployment based on the CPU throttling of the frontend container. By default, HPA works with the Kubernetes metrics server, which only exposes the CPU and memory usage of the pods. To extend our current metrics, we will use the Metrics Operator deployed with the Keptn Lifecycle Toolkit. For this we need to create (see the sketch after this list):
- a KeptnMetricsProvider using the provider type dynatrace
- a KeptnMetric with the right metric expression
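As an illustration, the two resources typically look like the sketch below (a minimal sketch assuming the metrics.keptn.sh/v1alpha2 API, a dynatrace-token secret, and the placeholder names frontend-cpu-throttling and DT_TOKEN; the actual manifests used in this workshop are in the keptn/ directory):
apiVersion: metrics.keptn.sh/v1alpha2
kind: KeptnMetricsProvider
metadata:
  name: dynatrace
  namespace: hipster-shop
spec:
  type: dynatrace
  targetServer: "https://<your-tenant>.live.dynatrace.com"
  secretKeyRef:
    name: dynatrace-token
    key: DT_TOKEN
---
apiVersion: metrics.keptn.sh/v1alpha2
kind: KeptnMetric
metadata:
  name: frontend-cpu-throttling
  namespace: hipster-shop
spec:
  provider:
    name: dynatrace
  query: "<your Dynatrace metric expression>"
  fetchIntervalSeconds: 10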
kubectl apply -f keptn/metricProvider.yaml
kubectl apply -f keptn/keptnmetric.yaml
Let's have a look at the value reported in our Kubernetes cluster:
kubectl get KeptnMetric -n hipster-shop
Now that we see a value for our new custom metric, we can deploy our HPA rule:
kubectl apply -f hipstershop/hpa.yaml -n hipster-shop
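You can then verify that the HPA picks up the custom metric and reports its targets:
kubectl get hpa -n hipster-shop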
To deploy the OpenTelemetry demo application:
kubectl apply -f openTelemetry-demo/deployment.yaml -n otel-demo