MAAP-HEC-AWS #40: update ADES-K8s metrics to use generalized schema #6

Merged
merged 2 commits into from
Jun 14, 2022

Conversation

pymonger
Collaborator

This PR updates the format of the job metrics emitted by the ADES-K8s backend, which are collected from the Calrissian docker_usage.json file. The format now conforms to this schema:

https://docs.google.com/document/d/1p0dYy_6NMBQrn5Qq3yXXFIk17Q0jBmlprItmzFlQ41Y/edit#heading=h.qwv42w6gqyda

In order to provide these metrics in this format, the following updates were required:

  • when specifying the calrissian command line options, we now use the --pod-labels option to pass in a YAML file of labels to attach to every CWL process pod created by the job pod
    • the YAML file is created on the fly in the calrissian docker image from the first argument (job_id) in the list of arguments sent to our wrapper script (pymonger/calrissian@436e530)
  • update the get_job ADES API to query for all pods that carry a specific job_id label and iterate over the returned pods to extract the node-specific metrics
    • NOTE: a K8s node's disk_space_free_gb and memory_gb are reported as "unknown" because, from the context of pod execution, K8s abstracts the node hardware away behind the pod's memory and disk requirements
      • it may be possible to query for a K8s node's resource information, but that would require additional queries against the K8s API; this can be looked at in the future
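
The two updates above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names (write_pod_labels, label_selector, node_entry) are hypothetical, the pod is assumed to be a plain dict shaped like the JSON form of a K8s Pod object, and using the pod IP for both hostname and ip_address is a guess based on the example payload, where the two values are identical.

```python
import json
from pathlib import Path


def write_pod_labels(job_id: str, path: Path) -> str:
    """Write a one-entry YAML mapping suitable for calrissian --pod-labels."""
    # A JSON string literal is also a valid YAML double-quoted scalar,
    # so json.dumps gives safe quoting without a YAML dependency.
    text = f"job_id: {json.dumps(job_id)}\n"
    path.write_text(text)
    return text


def label_selector(job_id: str) -> str:
    """Equality-based label selector matching every pod of one job."""
    return f"job_id={job_id}"


def node_entry(pod: dict) -> dict:
    """Build the per-process "node" entry from a pod (hypothetical mapping)."""
    # Assumption: the pod IP stands in for both hostname and ip_address.
    pod_ip = pod.get("status", {}).get("podIP", "unknown")
    return {
        "hostname": pod_ip,
        "ip_address": pod_ip,
        # Node hardware is not visible from the pod, as noted above.
        "disk_space_free_gb": "unknown",
        "memory_gb": "unknown",
    }
```

The selector string is the kind of value that would be passed as the label selector when listing pods through the K8s API; the cores field is left out here because deriving it requires parsing the pod's CPU resource quantities.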

The following snippet is an example JSON payload, containing the updated metrics, returned by a get_job() call after the job has completed:

{
  "statusInfo": {
    "jobID": "downsample-landsat-workflow-3.0.0-cafe2d889b598f02104310feeb6dbd0e1deb64f7",
    "metrics": {
      "blob": {
        "children": [
          {
            "cpu_hours": 0.0016666666666666668,
            "cpus": 1.0,
            "disk_megabytes": 2.128405,
            "elapsed_hours": 0.0016666666666666668,
            "elapsed_seconds": 6.0,
            "finish_time": "2022-06-14T21:37:12+00:00",
            "name": "stage_in",
            "ram_megabyte_hours": 0.4473924266666667,
            "ram_megabytes": 268.435456,
            "start_time": "2022-06-14T21:37:06+00:00"
          },
          {
            "cpu_hours": 0.006666666666666667,
            "cpus": 1.0,
            "disk_megabytes": 0.035387,
            "elapsed_hours": 0.006666666666666667,
            "elapsed_seconds": 24.0,
            "finish_time": "2022-06-14T21:37:40+00:00",
            "name": "downsample_landsat",
            "ram_megabyte_hours": 1.7895697066666667,
            "ram_megabytes": 268.435456,
            "start_time": "2022-06-14T21:37:16+00:00"
          },
          {
            "cpu_hours": 0.0002777777777777778,
            "cpus": 1.0,
            "disk_megabytes": 0.0,
            "elapsed_hours": 0.0002777777777777778,
            "elapsed_seconds": 1.0,
            "finish_time": "2022-06-14T21:37:44+00:00",
            "name": "stage_out",
            "ram_megabyte_hours": 0.07456540444444444,
            "ram_megabytes": 268.435456,
            "start_time": "2022-06-14T21:37:43+00:00"
          }
        ],
        "cores_allowed": 1.0,
        "elapsed_hours": 0.010555555555555556,
        "elapsed_seconds": 38.0,
        "finish_time": "2022-06-14T21:37:44+00:00",
        "max_parallel_cpus": 1.0,
        "max_parallel_ram_megabytes": 268.435456,
        "max_parallel_tasks": 1,
        "ram_mb_allowed": 1073.741824,
        "start_time": "2022-06-14T21:37:06+00:00",
        "total_cpu_hours": 0.008611111111111111,
        "total_disk_megabytes": 2.163792,
        "total_ram_megabyte_hours": 2.3115275377777778,
        "total_tasks": 3
      },
      "processes": [
        {
          "memory_max_gb": 0.262144,
          "name": "stage_in",
          "node": {
            "cores": 1.0,
            "disk_space_free_gb": "unknown",
            "hostname": "10.1.0.39",
            "ip_address": "10.1.0.39",
            "memory_gb": "unknown"
          },
          "time_end": "2022-06-14T21:37:12+00:00",
          "time_started": "2022-06-14T21:37:06+00:00",
          "work_dir_size_gb": 0.0020785205078125
        },
        {
          "memory_max_gb": 0.262144,
          "name": "downsample_landsat",
          "node": {
            "cores": 1.0,
            "disk_space_free_gb": "unknown",
            "hostname": "10.1.0.40",
            "ip_address": "10.1.0.40",
            "memory_gb": "unknown"
          },
          "time_end": "2022-06-14T21:37:40+00:00",
          "time_started": "2022-06-14T21:37:16+00:00",
          "work_dir_size_gb": 3.45576171875e-05
        },
        {
          "memory_max_gb": 0.262144,
          "name": "stage_out",
          "node": {
            "cores": 1.0,
            "disk_space_free_gb": "unknown",
            "hostname": "10.1.0.41",
            "ip_address": "10.1.0.41",
            "memory_gb": "unknown"
          },
          "time_end": "2022-06-14T21:37:44+00:00",
          "time_started": "2022-06-14T21:37:43+00:00",
          "work_dir_size_gb": 0.0
        }
      ],
      "workflow": {
        "exit_code": 0,
        "time_end": "2022-06-14T21:37:44+00:00",
        "time_queued": "2022-06-14T21:36:48+00:00",
        "time_started": "2022-06-14T21:37:06+00:00"
      }
    },
    "status": "successful"
  }
}
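
The numbers in this payload suggest how the "processes" entries relate to the "blob" children: memory_max_gb and work_dir_size_gb match ram_megabytes and disk_megabytes divided by 1024, and the total_* fields match sums over the children. The sketch below encodes those relationships as inferred from the example values only; the function names (to_process, totals) are hypothetical, and the actual conversion code in this PR may differ.

```python
def to_process(child: dict) -> dict:
    """Map one "blob" child onto the fields of a "processes" entry."""
    return {
        "name": child["name"],
        # Factor of 1024 inferred from the example payload, not confirmed.
        "memory_max_gb": child["ram_megabytes"] / 1024,
        "work_dir_size_gb": child["disk_megabytes"] / 1024,
        "time_started": child["start_time"],
        "time_end": child["finish_time"],
    }


def totals(children: list) -> dict:
    """Recompute the aggregate fields of the "blob" section from its children."""
    return {
        "total_cpu_hours": sum(c["cpu_hours"] for c in children),
        "total_disk_megabytes": sum(c["disk_megabytes"] for c in children),
        "total_ram_megabyte_hours": sum(c["ram_megabyte_hours"] for c in children),
        "total_tasks": len(children),
    }
```

A check like this can be run against a returned payload to confirm that the two metric sections stay consistent with each other.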

@mkarim2017 mkarim2017 left a comment

LGTM

@pymonger pymonger merged commit 0d37a3b into main Jun 14, 2022
@pymonger pymonger deleted the maap-hec-aws#40 branch June 14, 2022 22:18