Bug 1465974 Add option for installing datadog-agent #473
ansible/files/bootstrap/telemetry.sh
Outdated
# Install datadog-agent; see https://app.datadoghq.com/account/settings#agent/aws
if "$INSTALL_DATADOG" && "$IS_MASTER" ; then
    export DD_API_KEY="$(aws s3 cp $TELEMETRY_CONF_BUCKET/credentials/datadog -)"
cc @whd Would it be reasonable to go ahead and provision a Datadog API key that we can place in S3 so we can move ahead with testing this PR? Does the approach here look reasonable so far?
ansible/files/bootstrap/telemetry.sh
Outdated
    export DD_API_KEY="$(sudo aws s3 cp $TELEMETRY_CONF_BUCKET/credentials/datadog -)"
    sudo bash -c "$(curl -sLS https://raw.githubusercontent.com/DataDog/datadog-agent/master/cmd/agent/install_script.sh)"
fi
if [ "$METRICS_PROVIDER" = "stackdriver" ] ; then
I don't think we're moving forward with stackdriver for metrics
So we're planning to keep Datadog for metrics longer term? Do we have that documented anywhere at this point?
I would talk to Brian Pitts about this. I don't think the long term solution has been decided yet
Talked to Brian. Sounds like Datadog for the near or medium term, so I'm going to remove the stackdriver bits from this PR.
ansible/files/bootstrap/telemetry.sh
Outdated
| if [ "$METRICS_PROVIDER" = "stackdriver" ] || [ "$LOGGING_PROVIDER" = "stackdriver" ]; then | ||
| # Install GCP credentials for stackdriver; see https://cloud.google.com/logging/docs/agent/authorization | ||
| sudo mkdir -p /etc/google/auth/ | ||
| sudo aws s3 cp $TELEMETRY_CONF_BUCKET/credentials/gcp.json /etc/google/auth/application_default_credentials.json |
who will have access to this bucket?
This is following existing practice in this repo for, e.g. plotly credentials. It's not immediately obvious to me who all has read/write access to telemetry-spark-emr-2. I'm open to other suggestions if this doesn't seem like a good approach.
Just checked, and this bucket lives in our cloudservices-aws-dev account, so all developers have access to it. A better practice would be to generate credentials (which we would have to do for the AWS-linked GCP project anyway) and put them in an S3 bucket that only the EMR IAM role has access to. This is annoying because we would either have to host that S3 bucket in a separate AWS account and set up cross-account access, or add a rule explicitly denying access to this bucket for the group that all developers belong to.
As :klukas mentions this is status quo and I'm comfortable leaving it as-is for something such as datadog API keys, which are already accessible to developers if they know what they're doing. The only difference in access is that these will be stored unencrypted in s3 whereas for other dev instances they are stored KMS-encrypted, but I don't think we should block this PR on revamping the sordid state of affairs that is the emr bootstrap process. It would technically be possible to:
- update the IAM role for EMR to allow assuming the role that allows KMS decryption of dev secrets, and to access the S3 location of the common dev secrets
- update the bootstrap logic to assume the role, pull down the secrets that contain the datadog API key, use sops to decrypt the file, and parse the YAML to access the datadog creds
- have at it
In fact I vaguely recall telling :sunasuh some years ago that this was possible, but we never implemented it.
But again I don't know that it's necessarily worth the time to do this. If you want to give it a whirl I can look up the specific values of the various parameters mentioned above.
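For posterity, the flow described above would look roughly like the following. This is a hedged sketch: the role ARN, bucket name, and file paths are placeholders, and the sops key path simply mirrors what the PR later adopted.

```bash
# Hedged sketch of the assume-role + sops approach described above.
# All ARNs, bucket names, and paths are placeholders.
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/dev-secrets-decrypt \
  --role-session-name emr-bootstrap \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)
read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$CREDS"
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

# Pull down the encrypted dev secrets and extract the Datadog API key.
aws s3 cp s3://example-dev-secrets-bucket/emr.yaml /tmp/emr.yaml
export DD_API_KEY=$(sops --decrypt --extract '["datadog_agent::api_key"]' /tmp/emr.yaml)
```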
ansible/files/bootstrap/telemetry.sh
Outdated
fi
if [ "$LOGGING_PROVIDER" = "stackdriver" ] ; then
    # https://cloud.google.com/logging/docs/agent/installation
    sudo bash -c "$(curl -sLS https://dl.google.com/cloudagents/install-logging-agent.sh)"
This likely requires additional setup and configuration, esp. if this runs in AWS.
That may be the case? The stackdriver docs specifically talk about installing on AWS and it sounds like simply providing creds and installing the agent is enough to get a default set of logs shipping, including syslog. My plan is to deploy this to a personal staging bucket once we have creds figured out and start testing what gets to GCP to discover if additional configuration is needed.
A manual step for linking the AWS account is required. A service account and credentials then need to be created and placed on the host. The agent (fluentd) is preconfigured to send syslog and a few other things, but Spark logs on EMR will likely require additional configuration. I have outlined some steps at https://mana.mozilla.org/wiki/display/SVCOPS/Stackdriver+Logging+for+AWS, but those are for EC2 instances running containers.
whd left a comment
Apologies for totally missing this; generally /CC on github is not strong enough to ensure I will see something.
I haven't actually looked at the PR, only the question posed in the /CC; see my more substantive comment on the credentials thread above.
Thanks to both of you, @haroldwoo and @whd, for looking this over and commenting. bpitts is out today, but I will follow up with him about plans for the future of metrics and monitoring. If it's not too scary, it would indeed unblock me from further investigation if we could provision a Datadog API key and Google application creds and upload them to the bucket. I opened https://bugzilla.mozilla.org/show_bug.cgi?id=1487073 to request the GCP creds. Is opening the bug helpful? Should I open one for Datadog creds as well? As @haroldwoo mentioned, I may not be fully comprehending the scope of what's necessary to get Stackdriver hooked up, so let me know if this is a significant amount of work to make those creds available.
Force-pushed from cfb6b1f to 062c801
ansible/files/bootstrap/telemetry.sh
Outdated
tags:
$(describe_cluster '.Cluster.Tags[] | "- \(.Key):\(.Value)"')
- cluster_id:${CLUSTER_ID}
EOL
For an example cluster I spun up similar to an ATMO cluster, this heredoc expanded to:
init_config:
instances:
- spark_url: http://ip-172-31-33-18:8088
cluster_name: klukas-metrics-test
spark_cluster_mode: spark_yarn_mode
tags:
- App:telemetry-analysis
- Environment:prod
- Owner:jklukas@mozilla.com
- Type:worker
- Application:telemetry-analysis-worker-instance
- Name:klukas-metrics-test
- cluster_id:j-330MAU776X1LH

Force-pushed from 448e306 to 829e30d
ansible/files/bootstrap/telemetry.sh
Outdated
tags:
- cluster_id:${CLUSTER_ID}
$(describe_cluster '.Cluster.Tags[] | " - \(.Key):\(.Value)"')
EOL
For a test cluster I made similar to ATMO, this expanded to:
init_config:
instances:
- spark_url: http://ip-172-31-40-188:8088
cluster_name: klukas-metrics-test
spark_cluster_mode: spark_yarn_mode
tags:
- cluster_id:j-2T2VUFT1TOPVE
- App:telemetry-analysis
- Application:telemetry-analysis-worker-instance
- Environment:prod
- Name:klukas-metrics-test
- Owner:jklukas@mozilla.com
- Type:worker
| "Effect": "Allow", | ||
| "Action": [ | ||
| "elasticmapreduce:Describe*", | ||
| "elasticmapreduce:List*", |
Required to call aws emr describe-cluster
This is rather unfortunate, as we've had rate-limiting issues hitting the describe and list APIs before. Perhaps this isn't a problem anymore, and since this is looking up a specific cluster it might not be an issue (IIRC the previous issues were mainly with list-clusters).
Ideally we could pass this cluster name in via ATMO/Airflow instead, but at any rate these API calls have been known to fail before.
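If throttling does show up in practice, a retry wrapper along these lines could soften it. The function name and backoff values are illustrative, not from this PR; it assumes CLUSTER_ID is already set and that results are filtered through jq, as elsewhere in the script.

```bash
# Hedged sketch: retry describe-cluster with a simple linear backoff so a
# throttled API call doesn't fail the whole bootstrap.
describe_cluster_with_retry() {
  local filter="$1" attempt out
  for attempt in 1 2 3 4 5; do
    if out=$(aws emr describe-cluster --cluster-id "$CLUSTER_ID"); then
      echo "$out" | jq -r "$filter"
      return 0
    fi
    sleep $((attempt * 5))  # back off 5s, 10s, 15s, ...
  done
  return 1
}
```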
This is now complete and ready to review. This should not cause any change in behavior for new clusters, since the new behavior requires explicitly setting the new option.
You can hold off on review. I realized I missed some nuance from the discussion in https://bugzilla.mozilla.org/show_bug.cgi?id=1465974#c10, so I'm going to refactor to use StatsD more directly. Should land tomorrow.
Force-pushed from 4ca2d6a to a523f1c
This now has Spark configured to report to StatsD rather than using Datadog's integration. Ready for review.
}, {
    "Classification": "spark-defaults",
    "Properties": {
        "spark.metrics.namespace": "spark",
We set namespace to "spark" so that Spark doesn't use the default namespace of applicationId, which is unique for every cluster.
By setting this to spark and not configuring a prefix for the StatsdSink, we end up with metric paths like:
spark.driver.LiveListenerBus.queue.appStatus.listenerProcessingTime.mean_rate
spark.1.executor.shuffleTotalBytesRead
👍
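For context, the Spark side of this wiring boils down to a handful of StatsdSink properties. The PR presumably sets them through its EMR configuration rather than by writing the file directly, so the path below is illustrative; this is only a sketch of the equivalent metrics.properties.

```bash
# Hedged sketch: point every Spark metrics source at the local dogstatsd
# listener (port 8125) with no extra prefix, so the namespace above yields
# paths like spark.driver.* and spark.<executor-id>.* as described.
cat <<'EOF' | sudo tee /etc/spark/conf/metrics.properties
*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.host=127.0.0.1
*.sink.statsd.port=8125
*.sink.statsd.prefix=
EOF
```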
whd left a comment
r+wc
Aside from the annoyance of having to call the AWS API to get the name, this seems reasonable.
| "Effect": "Allow", | ||
| "Action": [ | ||
| "elasticmapreduce:Describe*", | ||
| "elasticmapreduce:List*", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is rather unfortunate, as we've had rate-limiting issues hitting the describe and list APIs before. Perhaps this isn't a problem anymore, and since this is looking up a specific cluster it might not be an issue (IIRC the previous issues were mainly with list-clusters).
Ideally we could pass this cluster name in through via ATMO/airflow instead, but at any rate these API calls have been known to fail before.
sudo yum -y localinstall $DD_DIR/sops-3.0.4-1.x86_64.rpm
export DD_API_KEY=$(/usr/local/bin/sops --decrypt --extract '["datadog_agent::api_key"]' $DD_DIR/emr.yaml)
# Install datadog-agent; see https://app.datadoghq.com/account/settings#agent/aws
bash -c "$(curl -sLS https://raw.githubusercontent.com/DataDog/datadog-agent/master/cmd/agent/install_script.sh)"
We ideally should not rely on github (or datadog) for bootstrapping datadog. I actually did the work of productionizing the EMR datadog flow (i.e. porting the ops workflow for datadog installation to work with EMR) for an earlier iteration of telemetry-streaming but it wouldn't be worth the time to hack it onto this bootstrap, given that if this fails I assume it will be noncritical.
We could mirror this script into the telemetry conf bucket instead, but then we won't receive updates to the script. I'm not sure which method is better, but for reliability purposes mirroring the installation script only protects against github going down, not datadog (again, ops logic has this all sorted out correctly).
Given the above I'm inclined to ok the current approach with the caveat that we may need to fix it in the future.
I prefer using the vendor's install because they can update it without us having to make changes. It would be a good idea to add some monitoring to ensure that the agent is actually installed/running afterwards.
https://docs.datadoghq.com/agent/basic_agent_usage/ubuntu/?tab=agentv6
Do we want to set DD_UPGRADE=true to install v6?
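A minimal post-install check along these lines would cover the "is it actually running" concern. Treat it as a sketch: the exact process and service names differ between Agent v5 and v6, and the sleep duration is arbitrary.

```bash
# Hedged sketch: warn loudly (without failing the bootstrap) if no
# datadog-agent process is visible shortly after the install script runs.
sleep 15
if ! pgrep -f datadog-agent > /dev/null; then
  echo "WARNING: datadog-agent does not appear to be running after install" 1>&2
fi
```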
tags:
- cluster_name:${CLUSTER_NAME}
- cluster_id:${CLUSTER_ID}
Normally we put an env and some other standard tags. I would say add App:data and Application:data and that should probably suffice.
App and Application are already provided as tags on the cluster, at least in the ATMO case, so the next line here (calling describe-cluster) will produce App and Application tags mirroring what's set on the cluster.
Does that sound fine? Does that need to be better documented?
Added commentary about what tags to expect.
ansible/files/bootstrap/telemetry.sh
Outdated
| if [ "$METRICS_PROVIDER" = "datadog" ] ; then | ||
| # Pull down and decrypt Datadog API key; see https://github.com/mozilla/emr-bootstrap-spark/pull/485 | ||
| DD_DIR=$(mktemp -d) | ||
| aws s3 cp $TELEMETRY_CONF_BUCKET/packages/sops-3.0.4-1.x86_64.rpm $DD_DIR/ |
make the sops rpm name a variable in case we ever need to update it
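Something like this, for instance (the variable name is just a suggestion):

```bash
# Hedged sketch of the suggestion: parameterize the sops package name so a
# future version bump only touches one line.
SOPS_RPM="sops-3.0.4-1.x86_64.rpm"
aws s3 cp "$TELEMETRY_CONF_BUCKET/packages/$SOPS_RPM" "$DD_DIR/"
sudo yum -y localinstall "$DD_DIR/$SOPS_RPM"
```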
EOF

    sudo restart datadog-agent
elif [ -n "$METRICS_PROVIDER" ]; then
should this just be else? In case the user submits empty string? Or should we use a different flag altogether e.g. --use-datadog-metrics so we don't have to arg check
This was an attempt to stick with existing convention for other options. For --email and --efs-dns, if the options aren't supplied, then we never set the corresponding variable and we use a -n "$EMAIL" check to see if we need to do anything.
I'm not too concerned about the possibility of a user accidentally passing empty string as the argument. It seems more likely that they'd pass it with no argument, in which case shift would fail or the next flag would get swallowed.
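For reference, the convention described here looks roughly like the following. This is a sketch, not the exact logic in telemetry.sh, and the EFS_DNS variable name is an assumption.

```bash
# Hedged sketch of the existing convention: a variable is set only when its
# flag is passed, so later code gates on the variable being non-empty.
if [ -n "$EFS_DNS" ]; then
  : # mount EFS only when --efs-dns was supplied
fi

if [ "$METRICS_PROVIDER" = "datadog" ]; then
  : # install and configure datadog-agent
elif [ -n "$METRICS_PROVIDER" ]; then
  : # presumably warn about or reject unknown providers
fi
```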
haroldwoo left a comment
had minor comments. lgtm
Force-pushed from 8459234 to cb8fce4
I've been testing these changes in stage, and realized I needed a few minor fixups and one necessary change. In particular, I was only partially correct about it being fine to enable the metrics configuration unconditionally. I've squashed all the previously reviewed commits and added one additional commit that factors out metrics configuration to a new file that we can leave empty if the metrics option is not enabled.
| if [[ "$EMR_LABEL" =~ ^emr-5.* && "$MIN_EMR_LABEL" > "$EMR_LABEL" ]]; then | ||
| echo "ERROR: Incompatible options. StatsdSink is not present in $EMR_LABEL, but is required to support --metrics-provider=datadog; must specify $MIN_EMR_LABEL or later to enable metrics; terminating" 1>&2 | ||
| # Cause the cluster to terminate with "Bootstrap failure" | ||
| exit 1 |
I'm choosing to exit 1 here to force the cluster to terminate. I tested this case and I'm able to access the logs to see what went wrong.
It seems better to force the cluster down during bootstrap than to let it come up and have the user be faced with mysterious errors when they try to run jobs.
Marking this for review again to see if you have any objections to the new changes in cb8fce4. This is now behaving as expected in stage, though we're not able to actually install the datadog-agent in stage due to the credentials, sops package, etc. existing only in the prod S3 bucket.
ansible/files/bootstrap/telemetry.sh
Outdated
MIN_EMR_LABEL=emr-5.13.0
EMR_LABEL=$(describe_cluster .Cluster.ReleaseLabel)
if [[ "$EMR_LABEL" =~ ^emr-5.* && "$MIN_EMR_LABEL" > "$EMR_LABEL" ]]; then
My only comment is that this may fail if they change the way the EMR release label version strings are constructed.
Force-pushed from d397ad5 to ec5f417
if (parse_version('${SPARK_VERSION}') < parse_version('${MIN_SPARK_VERSION}')):
    exit(1)
EOF
}
This now parses based on the Spark version rather than the EMR release label, which should be more stable. Also, we now use Python's pkg_resources.parse_version rather than string comparison.
lgtm
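For anyone reading along, the check shown only partially in the diff above amounts to something like the following. This is a hedged sketch, not the exact commit: it assumes SPARK_VERSION and MIN_SPARK_VERSION are determined earlier in the bootstrap script.

```bash
# Hedged sketch of the version gate: fail the bootstrap if the cluster's
# Spark predates StatsdSink (first shipped in Spark 2.3.0).
spark_version_at_least() {
  python - "$1" "$2" <<'EOF'
import sys
from pkg_resources import parse_version
sys.exit(0 if parse_version(sys.argv[1]) >= parse_version(sys.argv[2]) else 1)
EOF
}

if ! spark_version_at_least "$SPARK_VERSION" "$MIN_SPARK_VERSION"; then
  echo "ERROR: StatsdSink requires Spark >= $MIN_SPARK_VERSION; terminating" 1>&2
  exit 1  # bootstrap failure terminates the cluster
fi
```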
Force-pushed from ec5f417 to b99e289
Had some conversation on IRC indicating that turning on Datadog for all clusters would likely be prohibitively expensive due to per-host pricing. The idea here is to make datadog-agent explicitly opt-in by requiring the command-line argument --install-datadog.
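A sketch of what that opt-in looks like in the bootstrap script's argument handling (the surrounding parser is abbreviated; IS_MASTER comes from the existing script, and the other option branches are elided):

```bash
# Hedged sketch: --install-datadog takes no value and simply flips a
# variable that later gates the agent installation on the master node.
INSTALL_DATADOG=false
while [ $# -gt 0 ]; do
  case "$1" in
    --install-datadog)
      INSTALL_DATADOG=true
      ;;
    # ... other options elided ...
  esac
  shift
done

if "$INSTALL_DATADOG" && "$IS_MASTER"; then
  : # fetch the API key and run the Datadog install script here
fi
```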