This repository was archived by the owner on Feb 15, 2020. It is now read-only.

Conversation

@jklukas jklukas commented Aug 23, 2018

Had some conversation on IRC indicating that turning on Datadog for all clusters would likely be prohibitively expensive due to per-host pricing. The idea here is to make datadog-agent explicitly opt-in by requiring command-line argument --install-datadog.
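For orientation, here is a minimal sketch of how such an opt-in flag could be wired into the bootstrap script's argument loop. The flag name comes from the description above; the case-statement shape and the false default are assumptions for illustration, not the PR's actual diff (an excerpt of which follows):

# Sketch only: opt-in flag parsing (assumed shape, not the actual diff)
INSTALL_DATADOG=false
while [ $# -gt 0 ]; do
    case $1 in
        --install-datadog)
            INSTALL_DATADOG=true
            ;;
        # other existing options would be handled here
    esac
    shift
done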


# Install datadog-agent; see https://app.datadoghq.com/account/settings#agent/aws
if "$INSTALL_DATADOG" && "$IS_MASTER" ; then
export DD_API_KEY="$(aws s3 cp $TELEMETRY_CONF_BUCKET/credentials/datadog -)"
Contributor Author

cc @whd Would it be reasonable to go ahead and provision a Datadog API key that we can place in S3 so we can move ahead with testing this PR? Does the approach here look reasonable so far?

export DD_API_KEY="$(sudo aws s3 cp $TELEMETRY_CONF_BUCKET/credentials/datadog -)"
sudo bash -c "$(curl -sLS https://raw.githubusercontent.com/DataDog/datadog-agent/master/cmd/agent/install_script.sh)"
fi
if [ "$METRICS_PROVIDER" = "stackdriver" ] ; then

I don't think we're moving forward with stackdriver for metrics

Contributor Author

So we're planning to keep Datadog for metrics longer term? Do we have that documented anywhere at this point?

I would talk to Brian Pitts about this. I don't think the long term solution has been decided yet

Contributor Author

Talked to Brian. Sounds like Datadog for the near or medium term, so I'm going to remove the stackdriver bits from this PR.

if [ "$METRICS_PROVIDER" = "stackdriver" ] || [ "$LOGGING_PROVIDER" = "stackdriver" ]; then
# Install GCP credentials for stackdriver; see https://cloud.google.com/logging/docs/agent/authorization
sudo mkdir -p /etc/google/auth/
sudo aws s3 cp $TELEMETRY_CONF_BUCKET/credentials/gcp.json /etc/google/auth/application_default_credentials.json

who will have access to this bucket?

Contributor Author

This is following existing practice in this repo for, e.g. plotly credentials. It's not immediately obvious to me who all has read/write access to telemetry-spark-emr-2. I'm open to other suggestions if this doesn't seem like a good approach.

Just checked and this bucket lives in our cloudservices-aws-dev account, so all developers have access to it. A better practice would be to generate credentials (which we would have to do for the AWS-link GCP project anyway) and put them in an s3 bucket which only the EMR IAM role has access to. This is annoying because we either have to host that s3 bucket in a separate aws account and set up cross-account access, or add a rule to explicitly deny access to this bucket to the group that all developers belong to.

Member

As :klukas mentions, this is the status quo and I'm comfortable leaving it as-is for something such as datadog API keys, which are already accessible to developers if they know what they're doing. The only difference in access is that these will be stored unencrypted in s3, whereas for other dev instances they are stored KMS-encrypted, but I don't think we should block this PR on revamping the sordid state of affairs that is the emr bootstrap process. It would technically be possible to:

  1. update the IAM role for EMR to allow assuming the role that allows KMS decryption of dev secrets and access to the s3 location of the common dev secrets
  2. update the bootstrap logic to assume the role, pull down the secrets that contain the datadog API key, use sops to decrypt the file, and parse the YAML to access the datadog creds
  3. have at it

In fact I vaguely recall telling :sunasuh some years ago that this was possible, but we never implemented it.

But again I don't know that it's necessarily worth the time to do this. If you want to give it a whirl I can look up the specific values of the various parameters mentioned above.
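For concreteness, a rough sketch of what steps 1 and 2 above might look like in the bootstrap script. The role and bucket variables are placeholders for the specific values whd offers to look up, and the sops extract key is borrowed from a later revision of this PR; none of this is code from the PR itself:

# Sketch only: assume the dev-secrets role, then fetch and decrypt the Datadog key
CREDS=$(aws sts assume-role \
    --role-arn "$DEV_SECRETS_ROLE_ARN" \
    --role-session-name emr-bootstrap)
export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | jq -r .Credentials.AccessKeyId)
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | jq -r .Credentials.SecretAccessKey)
export AWS_SESSION_TOKEN=$(echo "$CREDS" | jq -r .Credentials.SessionToken)
aws s3 cp "$DEV_SECRETS_BUCKET/emr.yaml" /tmp/emr.yaml
export DD_API_KEY=$(/usr/local/bin/sops --decrypt --extract '["datadog_agent::api_key"]' /tmp/emr.yaml)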

fi
if [ "$LOGGING_PROVIDER" = "stackdriver" ] ; then
# https://cloud.google.com/logging/docs/agent/installation
sudo bash -c "$(curl -sLS https://dl.google.com/cloudagents/install-logging-agent.sh)"

This likely requires additional setup and configuration, esp. if this runs in AWS.

Contributor Author

That may be the case? The stackdriver docs specifically talk about installing on AWS and it sounds like simply providing creds and installing the agent is enough to get a default set of logs shipping, including syslog. My plan is to deploy this to a personal staging bucket once we have creds figured out and start testing what gets to GCP to discover if additional configuration is needed.

A manual step for linking the aws account is required. A service account + credentials then need to be created and placed on the host. The agent (fluentd) is preconfigured to send syslog + other things, but Spark logs on EMR likely require additional configuration. I have outlined some steps at https://mana.mozilla.org/wiki/display/SVCOPS/Stackdriver+Logging+for+AWS but this is for EC2 instances running containers.

Member

@whd whd left a comment

Apologies for totally missing this; generally /CC on github is not strong enough to ensure I will see something.

I haven't actually looked at the PR, only the question posed in the /CC; see my more substantive comment farther down.

if [ "$METRICS_PROVIDER" = "stackdriver" ] || [ "$LOGGING_PROVIDER" = "stackdriver" ]; then
# Install GCP credentials for stackdriver; see https://cloud.google.com/logging/docs/agent/authorization
sudo mkdir -p /etc/google/auth/
sudo aws s3 cp $TELEMETRY_CONF_BUCKET/credentials/gcp.json /etc/google/auth/application_default_credentials.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As :klukas mentions this is status quo and I'm comfortable leaving it as-is for something such as datadog API keys, which are already accessible to developers if they know what they're doing. The only difference in access is that these will be stored unencrypted in s3 whereas for other dev instances they are stored KMS-encrypted, but I don't think we should block this PR on revamping the sordid state of affairs that is the emr bootstrap process. It would technically be possible to:

  1. update the IAM role for EMR to allow assuming the role that allows KMS decryption of dev secrets, and the s3 location of the common dev secrets
  2. update the bootstrap logic to assume the role, pull down the secrets that contain the datadog API key, use sops to decrypt the file, and parse the YAML to access the datadog creds
  3. have at it

In fact I vaguely recall telling :sunasuh some years ago that this was possible, but we never implemented it.

But again I don't know that it's necessarily worth the time to do this. If you want to give it a whirl I can look up the specific values of the various parameters mentioned above.

jklukas commented Aug 30, 2018

Thanks to both of you, @haroldwoo and @whd for looking this over and commenting.

bpitts is out today, but I will follow up with him about plans for the future of metrics and monitoring.

If it's not too scary, it would indeed unblock me from further investigation if we could provision a DataDog API key and google application creds and upload them to the bucket.

I opened https://bugzilla.mozilla.org/show_bug.cgi?id=1487073 to request the GCP creds. Is opening the bug helpful? Should I open one for Datadog creds as well? As @haroldwoo mentioned, I may not be fully comprehending the scope of what's necessary to get Stackdriver hooked up, so let me know if this is a significant amount of work to make those creds available.

@jklukas jklukas force-pushed the datadog-agent branch 2 times, most recently from cfb6b1f to 062c801 on September 18, 2018 19:11
@jklukas jklukas changed the title from "WIP Bug 1465974 Add option for installing datadog-agent" to "Bug 1465974 Add option for installing datadog-agent" on Sep 18, 2018
tags:
$(describe_cluster '.Cluster.Tags[] | "- \(.Key):\(.Value)"')
- cluster_id:${CLUSTER_ID}
EOL
Contributor Author

@jklukas jklukas Sep 18, 2018

For an example cluster I spun up similar to an ATMO cluster, this heredoc expanded to:

init_config:
instances:
- spark_url: http://ip-172-31-33-18:8088
  cluster_name: klukas-metrics-test
spark_cluster_mode: spark_yarn_mode
tags:
- App:telemetry-analysis
- Environment:prod
- Owner:jklukas@mozilla.com
- Type:worker
- Application:telemetry-analysis-worker-instance
- Name:klukas-metrics-test
- cluster_id:j-330MAU776X1LH

tags:
- cluster_id:${CLUSTER_ID}
$(describe_cluster '.Cluster.Tags[] | " - \(.Key):\(.Value)"')
EOL
Contributor Author

@jklukas jklukas Sep 18, 2018

For a test cluster I made similar to ATMO, this expanded to:

init_config:
instances:
- spark_url: http://ip-172-31-40-188:8088
  cluster_name: klukas-metrics-test
  spark_cluster_mode: spark_yarn_mode
  tags:
  - cluster_id:j-2T2VUFT1TOPVE
  - App:telemetry-analysis
  - Application:telemetry-analysis-worker-instance
  - Environment:prod
  - Name:klukas-metrics-test
  - Owner:jklukas@mozilla.com
  - Type:worker

"Effect": "Allow",
"Action": [
"elasticmapreduce:Describe*",
"elasticmapreduce:List*",
Contributor Author

Required to call aws emr describe-cluster

Member

This is rather unfortunate, as we've had rate-limiting issues hitting the describe and list APIs before. Perhaps this isn't a problem anymore, and since this is looking up a specific cluster it might not be an issue (IIRC the previous issues were mainly with list-clusters).

Ideally we could pass this cluster name in via ATMO/airflow instead, but at any rate these API calls have been known to fail before.
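If the rate limiting does resurface, one mitigation would be to wrap the lookup in a retry with backoff. A hedged sketch follows; the jq-style interface and CLUSTER_ID mirror what the diff already uses, while the retry loop itself is an assumption, not part of this PR:

# Sketch only: retry describe-cluster with backoff in case of API throttling
describe_cluster_with_retry() {
    local jq_expr=$1 attempt out
    for attempt in 1 2 3 4 5; do
        if out=$(aws emr describe-cluster --cluster-id "$CLUSTER_ID"); then
            echo "$out" | jq -r "$jq_expr"
            return 0
        fi
        sleep $((attempt * 5))  # back off before retrying
    done
    return 1
}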

@jklukas jklukas requested review from haroldwoo and whd September 18, 2018 20:24
jklukas commented Sep 18, 2018

This is now complete and ready to review. This should not cause any change in behavior for new clusters since the new behavior requires setting --metrics-provider.

@jklukas jklukas removed request for haroldwoo and whd September 18, 2018 20:29
jklukas commented Sep 18, 2018

You can hold off on review. I realized I missed some nuance from the discussion in https://bugzilla.mozilla.org/show_bug.cgi?id=1465974#c10, so I'm going to refactor to use StatsD more directly. Should land tomorrow.

jklukas commented Sep 19, 2018

This now has Spark configured to report to Statsd rather than using Datadog's integration. Ready for review.

}, {
"Classification": "spark-defaults",
"Properties": {
"spark.metrics.namespace": "spark",
Contributor Author

We set namespace to "spark" so that Spark doesn't use the default namespace of applicationId, which is unique for every cluster.

By setting this to spark and not configuring a prefix for the StatsdSink, we end up with metric paths like:

  spark.driver.LiveListenerBus.queue.appStatus.listenerProcessingTime.mean_rate
  spark.1.executor.shuffleTotalBytesRead

Member

👍
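For readers unfamiliar with Spark's StatsD support, the configuration being discussed above amounts to something like the following. The sink class name follows Spark's StatsdSink documentation; the file path, host, and port here are illustrative assumptions, not values copied from the PR:

# Sketch only: point all Spark metric registries at a local statsd listener,
# with no prefix, so the spark.metrics.namespace=spark setting above yields
# metric names like the examples listed in the comment.
sudo tee /etc/spark/conf/metrics.properties > /dev/null <<EOF
*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.host=localhost
*.sink.statsd.port=8125
EOF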

Member

@whd whd left a comment

r+wc

Aside from the annoyance of having to call the AWS API to get the name, this seems reasonable.

"Effect": "Allow",
"Action": [
"elasticmapreduce:Describe*",
"elasticmapreduce:List*",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is rather unfortunate, as we've had rate-limiting issues hitting the describe and list APIs before. Perhaps this isn't a problem anymore, and since this is looking up a specific cluster it might not be an issue (IIRC the previous issues were mainly with list-clusters).

Ideally we could pass this cluster name in through via ATMO/airflow instead, but at any rate these API calls have been known to fail before.

sudo yum -y localinstall $DD_DIR/sops-3.0.4-1.x86_64.rpm
export DD_API_KEY=$(/usr/local/bin/sops --decrypt --extract '["datadog_agent::api_key"]' $DD_DIR/emr.yaml)
# Install datadog-agent; see https://app.datadoghq.com/account/settings#agent/aws
bash -c "$(curl -sLS https://raw.githubusercontent.com/DataDog/datadog-agent/master/cmd/agent/install_script.sh)"
Member

We ideally should not rely on github (or datadog) for bootstrapping datadog. I actually did the work of productionizing the EMR datadog flow (i.e. porting the ops workflow for datadog installation to work with EMR) for an earlier iteration of telemetry-streaming but it wouldn't be worth the time to hack it onto this bootstrap, given that if this fails I assume it will be noncritical.

We could mirror this script into the telemetry conf bucket instead, but then we won't receive updates to the script. I'm not sure which method is better, but for reliability purposes mirroring the installation script only protects against github going down, not datadog (again, ops logic has this all sorted out correctly).

Given the above I'm inclined to ok the current approach with the caveat that we may need to fix it in the future.

I prefer using the vendor's install because they can update it without us having to make changes. It would be a good idea to add some monitoring to ensure that the agent is actually installed/running afterwards.

https://docs.datadoghq.com/agent/basic_agent_usage/ubuntu/?tab=agentv6
Do we want to set DD_UPGRADE=true to install v6?
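One low-effort way to act on the monitoring suggestion above would be a post-install check in the bootstrap itself. A sketch, assuming the upstart service name datadog-agent that the later sudo restart datadog-agent line implies; the check is not part of this PR:

# Sketch only: warn in the bootstrap log if the agent is not running after install
if ! sudo status datadog-agent 2>/dev/null | grep -q running; then
    echo "WARNING: datadog-agent does not appear to be running after install" 1>&2
fi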

tags:
- cluster_name:${CLUSTER_NAME}
- cluster_id:${CLUSTER_ID}
Member

Normally we put an env and some other standard tags. I would say add App:data and Application:data and that should probably suffice.

Contributor Author

App and Application are already provided as tags on the cluster, at least in the ATMO case, so the next line here (calling describe-cluster) will produce App and Application tags mirroring what's set on the cluster.

Does that sound fine? Does that need to be better documented?

Contributor Author

Added commentary about what tags to expect.

if [ "$METRICS_PROVIDER" = "datadog" ] ; then
# Pull down and decrypt Datadog API key; see https://github.com/mozilla/emr-bootstrap-spark/pull/485
DD_DIR=$(mktemp -d)
aws s3 cp $TELEMETRY_CONF_BUCKET/packages/sops-3.0.4-1.x86_64.rpm $DD_DIR/

make the sops rpm name a variable in case we ever need to update it
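Something like the following is presumably what is being suggested (the variable name is illustrative; the two commands are the existing lines from the diff):

# Sketch only: factor the sops rpm name into a variable so upgrades touch one line
SOPS_RPM=sops-3.0.4-1.x86_64.rpm
aws s3 cp $TELEMETRY_CONF_BUCKET/packages/$SOPS_RPM $DD_DIR/
sudo yum -y localinstall $DD_DIR/$SOPS_RPM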

EOF

sudo restart datadog-agent
elif [ -n "$METRICS_PROVIDER" ]; then

should this just be else? In case the user submits empty string? Or should we use a different flag altogether e.g. --use-datadog-metrics so we don't have to arg check

Contributor Author

This was an attempt to stick with existing convention for other options. For --email and --efs-dns, if the options aren't supplied, then we never set the corresponding variable and we use a -n "$EMAIL" check to see if we need to do anything.

I'm not too concerned about the possibility of a user accidentally passing empty string as the argument. It seems more likely that they'd pass it with no argument, in which case shift would fail or the next flag would get swallowed.

@haroldwoo haroldwoo left a comment

had minor comments. lgtm

jklukas commented Sep 21, 2018

I've been testing these changes in stage, and realized a few minor fixups and one necessary change.

In particular, I was only partially correct about it being fine to enable StatsdSink on all clusters. An EMR 5.13.0 cluster behaves just fine with StatsdSink outputting to a nonexistent agent. However, StatsdSink doesn't exist in earlier Spark versions, so a 5.11.1 cluster with StatsdSink enabled will throw errors when attempting to instantiate a Spark context.

I've squashed all the previously reviewed commits and added one additional commit that factors out metrics configuration to a new file that we can leave empty if the metrics option is not enabled.

if [[ "$EMR_LABEL" =~ ^emr-5.* && "$MIN_EMR_LABEL" > "$EMR_LABEL" ]]; then
echo "ERROR: Incompatible options. StatsdSink is not present in $EMR_LABEL, but is required to support --metrics-provider=datadog; must specify $MIN_EMR_LABEL or later to enable metrics; terminating" 1>&2
# Cause the cluster to terminate with "Bootstrap failure"
exit 1
Contributor Author

I'm choosing to exit 1 here to force the cluster to terminate. I tested this case and I'm able to access the logs to see what went wrong.

It seems better to force the cluster down during bootstrap than to let it come up and have the user be faced with mysterious errors when they try to run jobs.

@jklukas jklukas requested review from haroldwoo and whd September 21, 2018 18:35
jklukas commented Sep 21, 2018

Marking this for review again to see if you have any objections to the new changes in cb8fce4.

This is now behaving as expected in stage, though we're not able to actually install the datadog-agent in stage due to the credentials, sops package, etc. existing only in the prod S3 bucket.


MIN_EMR_LABEL=emr-5.13.0
EMR_LABEL=$(describe_cluster .Cluster.ReleaseLabel)
if [[ "$EMR_LABEL" =~ ^emr-5.* && "$MIN_EMR_LABEL" > "$EMR_LABEL" ]]; then

My only comment is that this may fail if they choose to change the way the EMR label version strings are constructed.

if (parse_version('${SPARK_VERSION}') < parse_version('${MIN_SPARK_VERSION}')):
    exit(1)
EOF
}
Contributor Author

This now parses based on Spark version rather than emr label, which should be more stable. Also, we now use python pkg_version.parse_version rather than string comparison.
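For illustration, a hedged reconstruction of what such a check might look like end to end. The function name, the way SPARK_VERSION is obtained, the pkg_resources import, and the 2.3.0 minimum (assumed to be the first Spark release shipping StatsdSink) are assumptions; only the comparison shown in the diff above is from the PR:

# Sketch only: fail the bootstrap if the cluster's Spark predates StatsdSink
check_spark_supports_statsd() {
    local MIN_SPARK_VERSION=2.3.0
    local SPARK_VERSION
    SPARK_VERSION=$(spark-submit --version 2>&1 | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)
    python - <<EOF
from pkg_resources import parse_version
import sys
if parse_version('${SPARK_VERSION}') < parse_version('${MIN_SPARK_VERSION}'):
    sys.exit(1)
EOF
}
# usage sketch: check_spark_supports_statsd || { echo "ERROR: Spark too old for StatsdSink" 1>&2; exit 1; }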

lgtm

@jklukas jklukas merged commit ad2dced into mozilla:master Sep 24, 2018
@jklukas jklukas deleted the datadog-agent branch September 24, 2018 12:47