Bug 1465974 Add option for installing datadog-agent #473
ansible/files/bootstrap/telemetry.sh
Outdated
# Install datadog-agent; see https://app.datadoghq.com/account/settings#agent/aws
if "$INSTALL_DATADOG" && "$IS_MASTER" ; then
    export DD_API_KEY="$(aws s3 cp $TELEMETRY_CONF_BUCKET/credentials/datadog -)"
cc @whd Would it be reasonable to go ahead and provision a Datadog API key that we can place in S3 so we can move ahead with testing this PR? Does the approach here look reasonable so far?
ansible/files/bootstrap/telemetry.sh
Outdated
    export DD_API_KEY="$(sudo aws s3 cp $TELEMETRY_CONF_BUCKET/credentials/datadog -)"
    sudo bash -c "$(curl -sLS https://raw.githubusercontent.com/DataDog/datadog-agent/master/cmd/agent/install_script.sh)"
fi
if [ "$METRICS_PROVIDER" = "stackdriver" ] ; then
I don't think we're moving forward with stackdriver for metrics
So we're planning to keep Datadog for metrics longer term? Do we have that documented anywhere at this point?
I would talk to Brian Pitts about this. I don't think the long term solution has been decided yet
Talked to Brian. Sounds like Datadog for the near or medium term, so I'm going to remove the stackdriver bits from this PR.
ansible/files/bootstrap/telemetry.sh
Outdated
| if [ "$METRICS_PROVIDER" = "stackdriver" ] || [ "$LOGGING_PROVIDER" = "stackdriver" ]; then | ||
| # Install GCP credentials for stackdriver; see https://cloud.google.com/logging/docs/agent/authorization | ||
| sudo mkdir -p /etc/google/auth/ | ||
| sudo aws s3 cp $TELEMETRY_CONF_BUCKET/credentials/gcp.json /etc/google/auth/application_default_credentials.json |
who will have access to this bucket?
This is following existing practice in this repo for, e.g. plotly credentials. It's not immediately obvious to me who all has read/write access to telemetry-spark-emr-2. I'm open to other suggestions if this doesn't seem like a good approach.
Just checked, and this bucket lives in our cloudservices-aws-dev account, so all developers have access to it. A better practice would be to generate credentials (which we would have to do for the AWS-linked GCP project anyway) and put them in an S3 bucket that only the EMR IAM role has access to. This is annoying because we would either have to host that S3 bucket in a separate AWS account and set up cross-account access, or add a rule explicitly denying access to this bucket for the group that all developers belong to.
As :klukas mentions this is status quo and I'm comfortable leaving it as-is for something such as datadog API keys, which are already accessible to developers if they know what they're doing. The only difference in access is that these will be stored unencrypted in s3 whereas for other dev instances they are stored KMS-encrypted, but I don't think we should block this PR on revamping the sordid state of affairs that is the emr bootstrap process. It would technically be possible to:
- update the IAM role for EMR to allow assuming the role that allows KMS decryption of dev secrets, and to access the S3 location of the common dev secrets
- update the bootstrap logic to assume the role, pull down the secrets that contain the datadog API key, use sops to decrypt the file, and parse the YAML to access the datadog creds
- have at it
In fact I vaguely recall telling :sunasuh some years ago that this was possible, but we never implemented it.
But again I don't know that it's necessarily worth the time to do this. If you want to give it a whirl I can look up the specific values of the various parameters mentioned above.
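For posterity, the flow described above would look roughly like the following. This is a hedged sketch: the role ARN, bucket name, and file paths are placeholders, and the sops key path simply mirrors what the PR later adopted.

```bash
# Hedged sketch of the assume-role + sops approach described above.
# All ARNs, bucket names, and paths are placeholders.
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/dev-secrets-decrypt \
  --role-session-name emr-bootstrap \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)
read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$CREDS"
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

# Pull down the encrypted dev secrets and extract the Datadog API key.
aws s3 cp s3://example-dev-secrets-bucket/emr.yaml /tmp/emr.yaml
export DD_API_KEY=$(sops --decrypt --extract '["datadog_agent::api_key"]' /tmp/emr.yaml)
```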
ansible/files/bootstrap/telemetry.sh
Outdated
fi
if [ "$LOGGING_PROVIDER" = "stackdriver" ] ; then
    # https://cloud.google.com/logging/docs/agent/installation
    sudo bash -c "$(curl -sLS https://dl.google.com/cloudagents/install-logging-agent.sh)"
This likely requires additional setup and configuration, esp. if this runs in AWS.
That may be the case? The stackdriver docs specifically talk about installing on AWS and it sounds like simply providing creds and installing the agent is enough to get a default set of logs shipping, including syslog. My plan is to deploy this to a personal staging bucket once we have creds figured out and start testing what gets to GCP to discover if additional configuration is needed.
A manual step for linking the AWS account is required. A service account and credentials then need to be created and placed on the host. The agent (fluentd) is preconfigured to send syslog and a few other things, but Spark logs on EMR will likely require additional configuration. I have outlined some steps at https://mana.mozilla.org/wiki/display/SVCOPS/Stackdriver+Logging+for+AWS, but those are for EC2 instances running containers.
whd left a comment
Apologies for totally missing this; generally /CC on github is not strong enough to ensure I will see something.
I haven't actually looked at the PR, only the question posed in the /CC; see my more substantive comment on the credentials thread above.
Thanks to both of you, @haroldwoo and @whd, for looking this over and commenting. bpitts is out today, but I will follow up with him about plans for the future of metrics and monitoring. If it's not too scary, it would indeed unblock me from further investigation if we could provision a Datadog API key and Google application creds and upload them to the bucket. I opened https://bugzilla.mozilla.org/show_bug.cgi?id=1487073 to request the GCP creds. Is opening the bug helpful? Should I open one for Datadog creds as well? As @haroldwoo mentioned, I may not be fully comprehending the scope of what's necessary to get Stackdriver hooked up, so let me know if this is a significant amount of work to make those creds available.
Force-pushed from cfb6b1f to 062c801
ansible/files/bootstrap/telemetry.sh
Outdated
tags:
$(describe_cluster '.Cluster.Tags[] | "- \(.Key):\(.Value)"')
- cluster_id:${CLUSTER_ID}
EOL
For an example cluster I spun up similar to an ATMO cluster, this heredoc expanded to:
init_config:
instances:
- spark_url: http://ip-172-31-33-18:8088
cluster_name: klukas-metrics-test
spark_cluster_mode: spark_yarn_mode
tags:
- App:telemetry-analysis
- Environment:prod
- Owner:jklukas@mozilla.com
- Type:worker
- Application:telemetry-analysis-worker-instance
- Name:klukas-metrics-test
- cluster_id:j-330MAU776X1LH

Force-pushed from 448e306 to 829e30d
ansible/files/bootstrap/telemetry.sh
Outdated
tags:
- cluster_id:${CLUSTER_ID}
$(describe_cluster '.Cluster.Tags[] | " - \(.Key):\(.Value)"')
EOL
For a test cluster I made similar to ATMO, this expanded to:
init_config:
instances:
- spark_url: http://ip-172-31-40-188:8088
cluster_name: klukas-metrics-test
spark_cluster_mode: spark_yarn_mode
tags:
- cluster_id:j-2T2VUFT1TOPVE
- App:telemetry-analysis
- Application:telemetry-analysis-worker-instance
- Environment:prod
- Name:klukas-metrics-test
- Owner:jklukas@mozilla.com
- Type:worker
| "Effect": "Allow", | ||
| "Action": [ | ||
| "elasticmapreduce:Describe*", | ||
| "elasticmapreduce:List*", |
Required to call aws emr describe-cluster
This is rather unfortunate, as we've had rate-limiting issues hitting the describe and list APIs before. Perhaps this isn't a problem anymore, and since this is looking up a specific cluster it might not be an issue (IIRC the previous issues were mainly with list-clusters).
Ideally we could pass this cluster name in via ATMO/Airflow instead, but at any rate these API calls have been known to fail before.
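If throttling does show up in practice, a retry wrapper along these lines could soften it. The function name and backoff values are illustrative, not from this PR; it assumes CLUSTER_ID is already set and that results are filtered through jq, as elsewhere in the script.

```bash
# Hedged sketch: retry describe-cluster with a simple linear backoff so a
# throttled API call doesn't fail the whole bootstrap.
describe_cluster_with_retry() {
  local filter="$1" attempt out
  for attempt in 1 2 3 4 5; do
    if out=$(aws emr describe-cluster --cluster-id "$CLUSTER_ID"); then
      echo "$out" | jq -r "$filter"
      return 0
    fi
    sleep $((attempt * 5))  # back off 5s, 10s, 15s, ...
  done
  return 1
}
```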
This is now complete and ready to review. This should not cause any change in behavior for new clusters, since the new behavior requires explicitly setting the new option.
You can hold off on review. I realized I missed some nuance from the discussion in https://bugzilla.mozilla.org/show_bug.cgi?id=1465974#c10, so I'm going to refactor to use StatsD more directly. Should land tomorrow.
Force-pushed from 4ca2d6a to a523f1c
This now has Spark configured to report to StatsD rather than using Datadog's integration. Ready for review.
}, {
    "Classification": "spark-defaults",
    "Properties": {
        "spark.metrics.namespace": "spark",
We set namespace to "spark" so that Spark doesn't use the default namespace of applicationId, which is unique for every cluster.
By setting this to spark and not configuring a prefix for the StatsdSink, we end up with metric paths like:
spark.driver.LiveListenerBus.queue.appStatus.listenerProcessingTime.mean_rate
spark.1.executor.shuffleTotalBytesRead
👍
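For context, the Spark side of this wiring boils down to a handful of StatsdSink properties. The PR presumably sets them through its EMR configuration rather than by writing the file directly, so the path below is illustrative; this is only a sketch of the equivalent metrics.properties.

```bash
# Hedged sketch: point every Spark metrics source at the local dogstatsd
# listener (port 8125) with no extra prefix, so the namespace above yields
# paths like spark.driver.* and spark.<executor-id>.* as described.
cat <<'EOF' | sudo tee /etc/spark/conf/metrics.properties
*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.host=127.0.0.1
*.sink.statsd.port=8125
*.sink.statsd.prefix=
EOF
```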
whd left a comment
r+wc
Aside from the annoyance of having to call the AWS API to get the name, this seems reasonable.
| "Effect": "Allow", | ||
| "Action": [ | ||
| "elasticmapreduce:Describe*", | ||
| "elasticmapreduce:List*", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is rather unfortunate, as we've had rate-limiting issues hitting the describe and list APIs before. Perhaps this isn't a problem anymore, and since this is looking up a specific cluster it might not be an issue (IIRC the previous issues were mainly with list-clusters).
Ideally we could pass this cluster name in through via ATMO/airflow instead, but at any rate these API calls have been known to fail before.
sudo yum -y localinstall $DD_DIR/sops-3.0.4-1.x86_64.rpm
export DD_API_KEY=$(/usr/local/bin/sops --decrypt --extract '["datadog_agent::api_key"]' $DD_DIR/emr.yaml)
# Install datadog-agent; see https://app.datadoghq.com/account/settings#agent/aws
bash -c "$(curl -sLS https://raw.githubusercontent.com/DataDog/datadog-agent/master/cmd/agent/install_script.sh)"
We ideally should not rely on github (or datadog) for bootstrapping datadog. I actually did the work of productionizing the EMR datadog flow (i.e. porting the ops workflow for datadog installation to work with EMR) for an earlier iteration of telemetry-streaming but it wouldn't be worth the time to hack it onto this bootstrap, given that if this fails I assume it will be noncritical.
We could mirror this script into the telemetry conf bucket instead, but then we won't receive updates to the script. I'm not sure which method is better, but for reliability purposes mirroring the installation script only protects against github going down, not datadog (again, ops logic has this all sorted out correctly).
Given the above I'm inclined to ok the current approach with the caveat that we may need to fix it in the future.
I prefer using the vendor's install because they can update it without us having to make changes. It would be a good idea to add some monitoring to ensure that the agent is actually installed/running afterwards.
https://docs.datadoghq.com/agent/basic_agent_usage/ubuntu/?tab=agentv6
Do we want to set DD_UPGRADE=true to install v6?
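A minimal post-install check along these lines would cover the "is it actually running" concern. Treat it as a sketch: the exact process and service names differ between Agent v5 and v6, and the sleep duration is arbitrary.

```bash
# Hedged sketch: warn loudly (without failing the bootstrap) if no
# datadog-agent process is visible shortly after the install script runs.
sleep 15
if ! pgrep -f datadog-agent > /dev/null; then
  echo "WARNING: datadog-agent does not appear to be running after install" 1>&2
fi
```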
tags:
- cluster_name:${CLUSTER_NAME}
- cluster_id:${CLUSTER_ID}
Normally we put an env and some other standard tags. I would say add App:data and Application:data and that should probably suffice.
App and Application are already provided as tags on the cluster, at least in the ATMO case, so the next line here (calling describe-cluster) will produce App and Application tags mirroring what's set on the cluster.
Does that sound fine? Does that need to be better documented?
Added commentary about what tags to expect.
ansible/files/bootstrap/telemetry.sh
Outdated
| if [ "$METRICS_PROVIDER" = "datadog" ] ; then | ||
| # Pull down and decrypt Datadog API key; see https://github.com/mozilla/emr-bootstrap-spark/pull/485 | ||
| DD_DIR=$(mktemp -d) | ||
| aws s3 cp $TELEMETRY_CONF_BUCKET/packages/sops-3.0.4-1.x86_64.rpm $DD_DIR/ |
make the sops rpm name a variable in case we ever need to update it
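Something like this, for instance (the variable name is just a suggestion):

```bash
# Hedged sketch of the suggestion: parameterize the sops package name so a
# future version bump only touches one line.
SOPS_RPM="sops-3.0.4-1.x86_64.rpm"
aws s3 cp "$TELEMETRY_CONF_BUCKET/packages/$SOPS_RPM" "$DD_DIR/"
sudo yum -y localinstall "$DD_DIR/$SOPS_RPM"
```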
EOF

    sudo restart datadog-agent
elif [ -n "$METRICS_PROVIDER" ]; then
should this just be else? In case the user submits empty string? Or should we use a different flag altogether e.g. --use-datadog-metrics so we don't have to arg check
This was an attempt to stick with existing convention for other options. For --email and --efs-dns, if the options aren't supplied, then we never set the corresponding variable and we use a -n "$EMAIL" check to see if we need to do anything.
I'm not too concerned about the possibility of a user accidentally passing empty string as the argument. It seems more likely that they'd pass it with no argument, in which case shift would fail or the next flag would get swallowed.
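For reference, the convention described here looks roughly like the following. This is a sketch, not the exact logic in telemetry.sh, and the EFS_DNS variable name is an assumption.

```bash
# Hedged sketch of the existing convention: a variable is set only when its
# flag is passed, so later code gates on the variable being non-empty.
if [ -n "$EFS_DNS" ]; then
  : # mount EFS only when --efs-dns was supplied
fi

if [ "$METRICS_PROVIDER" = "datadog" ]; then
  : # install and configure datadog-agent
elif [ -n "$METRICS_PROVIDER" ]; then
  : # presumably warn about or reject unknown providers
fi
```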
haroldwoo left a comment
had minor comments. lgtm
Force-pushed from 8459234 to cb8fce4
I've been testing these changes in stage, and realized I needed a few minor fixups and one necessary change. In particular, I was only partially correct about it being fine to enable the metrics configuration unconditionally. I've squashed all the previously reviewed commits and added one additional commit that factors out metrics configuration to a new file that we can leave empty if the metrics option is not enabled.
| if [[ "$EMR_LABEL" =~ ^emr-5.* && "$MIN_EMR_LABEL" > "$EMR_LABEL" ]]; then | ||
| echo "ERROR: Incompatible options. StatsdSink is not present in $EMR_LABEL, but is required to support --metrics-provider=datadog; must specify $MIN_EMR_LABEL or later to enable metrics; terminating" 1>&2 | ||
| # Cause the cluster to terminate with "Bootstrap failure" | ||
| exit 1 |
I'm choosing to exit 1 here to force the cluster to terminate. I tested this case and I'm able to access the logs to see what went wrong.
It seems better to force the cluster down during bootstrap than to let it come up and have the user be faced with mysterious errors when they try to run jobs.
Marking this for review again to see if you have any objections to the new changes in cb8fce4. This is now behaving as expected in stage, though we're not able to actually install the datadog-agent in stage due to the credentials, sops package, etc. existing only in the prod S3 bucket.
ansible/files/bootstrap/telemetry.sh
Outdated
MIN_EMR_LABEL=emr-5.13.0
EMR_LABEL=$(describe_cluster .Cluster.ReleaseLabel)
if [[ "$EMR_LABEL" =~ ^emr-5.* && "$MIN_EMR_LABEL" > "$EMR_LABEL" ]]; then
My only comment is that this may fail if they change the way the EMR release label version strings are constructed.
Force-pushed from d397ad5 to ec5f417
if (parse_version('${SPARK_VERSION}') < parse_version('${MIN_SPARK_VERSION}')):
    exit(1)
EOF
}
This now parses based on the Spark version rather than the EMR release label, which should be more stable. Also, we now use Python's pkg_resources.parse_version rather than string comparison.
lgtm
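For anyone reading along, the check shown only partially in the diff above amounts to something like the following. This is a hedged sketch, not the exact commit: it assumes SPARK_VERSION and MIN_SPARK_VERSION are determined earlier in the bootstrap script.

```bash
# Hedged sketch of the version gate: fail the bootstrap if the cluster's
# Spark predates StatsdSink (first shipped in Spark 2.3.0).
spark_version_at_least() {
  python - "$1" "$2" <<'EOF'
import sys
from pkg_resources import parse_version
sys.exit(0 if parse_version(sys.argv[1]) >= parse_version(sys.argv[2]) else 1)
EOF
}

if ! spark_version_at_least "$SPARK_VERSION" "$MIN_SPARK_VERSION"; then
  echo "ERROR: StatsdSink requires Spark >= $MIN_SPARK_VERSION; terminating" 1>&2
  exit 1  # bootstrap failure terminates the cluster
fi
```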
Force-pushed from ec5f417 to b99e289
Had some conversation on IRC indicating that turning on Datadog for all clusters would likely be prohibitively expensive due to per-host pricing. The idea here is to make datadog-agent explicitly opt-in by requiring the command-line argument --install-datadog.
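A sketch of what that opt-in looks like in the bootstrap script's argument handling (the surrounding parser is abbreviated; IS_MASTER comes from the existing script, and the other option branches are elided):

```bash
# Hedged sketch: --install-datadog takes no value and simply flips a
# variable that later gates the agent installation on the master node.
INSTALL_DATADOG=false
while [ $# -gt 0 ]; do
  case "$1" in
    --install-datadog)
      INSTALL_DATADOG=true
      ;;
    # ... other options elided ...
  esac
  shift
done

if "$INSTALL_DATADOG" && "$IS_MASTER"; then
  : # fetch the API key and run the Datadog install script here
fi
```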