Send ML feature information with UpdateNodeInfo. #11913

Merged
merged 4 commits into from Dec 22, 2021

@vkalintiris vkalintiris commented Dec 16, 2021

Summary

We achieve this by adding the ml_{capable,enabled} fields in
system_info. When streaming, these fields allow a parent to understand if
the child has ML and if it runs ML for itself.

A newly added protobuf message contains the aforementioned fields. Inside
NodeInfo the information is an exact copy of system_info. Inside
UpdateNodeInfo the ml_capable field reflects the ability of the localhost
to run ML, and the ml_enabled field denotes the training status of the
specified node instance.
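
To make the two meanings concrete, here is a minimal sketch in C. The struct and both function names are hypothetical and only illustrate the semantics described above; they are not the code added by this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the two flags carried in the new protobuf message. */
struct ml_info {
    bool ml_capable;
    bool ml_enabled;
};

/* NodeInfo: an exact copy of what the node reported in its system_info. */
struct ml_info node_info_ml(const struct ml_info *reported)
{
    assert(reported != NULL);
    return *reported;
}

/*
 * UpdateNodeInfo: ml_capable describes the agent sending the message
 * (localhost), ml_enabled says whether the given node is being trained.
 */
struct ml_info update_node_info_ml(bool localhost_capable, bool node_is_trained)
{
    struct ml_info out;
    out.ml_capable = localhost_capable;
    out.ml_enabled = node_is_trained;
    return out;
}
```

For a child that reports no ML at all, streamed to a parent that trains it, the first function would return both flags as false while the second would return both as true.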

Component Name

aclk

Test Plan

TBD

Additional Information

Setting the ml_{capable,enabled} fields inside rrdhost_create is a little bit
awkward. It is necessary because an ML configuration option lets the user
filter, by hostname, which nodes are trained. We therefore have to
create/initialize the RRDHOST object first, and only then check whether the
ML component will train it based on the user's configuration.
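
The ordering constraint can be sketched roughly as follows. This is an illustration only: `struct host`, `struct ml_config`, `host_create` and `host_resolve_ml` are hypothetical names, and POSIX `fnmatch` stands in for whatever pattern matching the real hostname filter uses.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for RRDHOST. */
struct host {
    const char *hostname;
    bool ml_capable;
    bool ml_enabled;
};

/* Hypothetical ML settings: a global switch plus a hostname pattern. */
struct ml_config {
    bool enabled;
    const char *hosts_to_train; /* shell-style pattern, e.g. "*" or "web-*" */
};

/* Step 1: create the host; whether it will be trained is not known yet. */
struct host host_create(const char *hostname, bool built_with_ml)
{
    assert(hostname != NULL);
    struct host h = { hostname, built_with_ml, false };
    return h;
}

/* Step 2: with the host in hand, match its name against the user's filter. */
void host_resolve_ml(struct host *h, const struct ml_config *cfg)
{
    assert(h != NULL && cfg != NULL);
    h->ml_enabled = h->ml_capable && cfg->enabled &&
                    fnmatch(cfg->hosts_to_train, h->hostname, 0) == 0;
}
```

The point of the two steps is that `ml_enabled` cannot be filled in before the host has a name to match, which is why the assignment ends up next to host creation.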

The UpdateNodeInfo includes this information about a child, plus a
boolean that is set to true when the parent runs ML for the child.
@thiagoftsm thiagoftsm self-requested a review December 20, 2021 15:26
@thiagoftsm
Contributor

@vkalintiris the following warning is not caused by this PR, but it is related to ML:

ml/ml-dummy.c: In function 'ml_get_host_info':
ml/ml-dummy.c:13:50: warning: control reaches end of non-void function [-Wreturn-type]
   13 | char *ml_get_host_info(RRDHOST *RH) { (void) RH; }
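
One possible way to silence this warning, assuming callers of the stub can handle a NULL result (not confirmed anywhere in this thread), is to return explicitly. The forward declaration below only stands in for netdata's real RRDHOST type:

```c
#include <stddef.h>

/* Forward declaration standing in for netdata's real RRDHOST type. */
typedef struct rrdhost RRDHOST;

/* Stub for builds without ML: return NULL explicitly to mean "no ML info". */
char *ml_get_host_info(RRDHOST *RH)
{
    (void) RH; /* parameter intentionally unused */
    return NULL;
}
```

Falling off the end of a non-void function and then using its result is undefined behavior in C, so an explicit return is more than a cosmetic fix.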

@thiagoftsm
Contributor

@vkalintiris the last commit fails to compile for me, with the following errors:

aclk/schema-wrappers/node_info.cc: In function 'int generate_node_info(nodeinstance::info::v1::NodeInfo*, aclk_node_info*)':
aclk/schema-wrappers/node_info.cc:65:29: error: 'MachineLearningInfo' is not a member of 'nodeinstance::info::v1'
   65 |     nodeinstance::info::v1::MachineLearningInfo *ml_info = info->mutable_ml_info();
      |                             ^~~~~~~~~~~~~~~~~~~
aclk/schema-wrappers/node_info.cc:65:50: error: 'ml_info' was not declared in this scope; did you mean 'mc_info'?
   65 |     nodeinstance::info::v1::MachineLearningInfo *ml_info = info->mutable_ml_info();
      |                                                  ^~~~~~~
      |                                                  mc_info
aclk/schema-wrappers/node_info.cc:65:66: error: 'class nodeinstance::info::v1::NodeInfo' has no member named 'mutable_ml_info'
   65 |     nodeinstance::info::v1::MachineLearningInfo *ml_info = info->mutable_ml_info();
      |                                                                  ^~~~~~~~~~~~~~~
aclk/schema-wrappers/node_info.cc: In function 'char* generate_update_node_info_message(size_t*, update_node_info*)':
aclk/schema-wrappers/node_info.cc:93:29: error: 'MachineLearningInfo' is not a member of 'nodeinstance::info::v1'
   93 |     nodeinstance::info::v1::MachineLearningInfo *ml_info = msg.mutable_ml_info();
      |                             ^~~~~~~~~~~~~~~~~~~
aclk/schema-wrappers/node_info.cc:93:50: error: 'ml_info' was not declared in this scope; did you mean 'mc_info'?
   93 |     nodeinstance::info::v1::MachineLearningInfo *ml_info = msg.mutable_ml_info();
      |                                                  ^~~~~~~
      |                                                  mc_info
aclk/schema-wrappers/node_info.cc:93:64: error: 'class nodeinstance::info::v1::UpdateNodeInfo' has no member named 'mutable_ml_info'
   93 |     nodeinstance::info::v1::MachineLearningInfo *ml_info = msg.mutable_ml_info();

@vkalintiris
Contributor Author

> @vkalintiris the last commit fails to compile for me, with the following errors:
>
> aclk/schema-wrappers/node_info.cc: In function 'int generate_node_info(nodeinstance::info::v1::NodeInfo*, aclk_node_info*)':
> aclk/schema-wrappers/node_info.cc:65:29: error: 'MachineLearningInfo' is not a member of 'nodeinstance::info::v1'
>    65 |     nodeinstance::info::v1::MachineLearningInfo *ml_info = info->mutable_ml_info();

@thiagoftsm this PR updates the aclk-schema submodule. You need to make sure it points to 9fa80e397be191b56320b0cfa7da97b249c3b3eb (i.e. netdata/aclk-schemas#28).

@thiagoftsm thiagoftsm left a comment

Everything worked as expected. I did not observe any problems with streaming over two hours. LGTM!

@MrZammler
Contributor

Hi @vkalintiris !

First test, with a single (no children) agent, ML enabled, connected to staging with new architecture:

{
         "nodeId" => "c8ee2856-9c9b-4af6-8eda-e94ca585ebd4",
        "claimId" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
           "data" => {
                      "name" => "winterland",
                        "os" => "linux",
                    "osName" => "Gentoo",
                 "osVersion" => "unknown",
                "kernelName" => "Linux",
             "kernelVersion" => "5.10.76-gentoo-r1",
              "architecture" => "x86_64",
                      "cpus" => 12,
              "cpuFrequency" => "4714000000",
                    "memory" => "16776036352",
                 "diskSpace" => "0",
                   "version" => "v1.32.1-8-ge8d8c85c8",
            "releaseChannel" => "nightly",
                  "timezone" => "EET",
        "virtualizationType" => "none",
             "containerType" => "unknown",
               "machineGuid" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
                "hostLabels" => {
                      "_is_k8s_node" => "false",
                        "_aclk_impl" => "Next Generation",
                "_aclk_ng_available" => "true",
                       "_os_version" => "unknown",
                "_system_disk_space" => "0",
                     "_system_cores" => "12",
                        "_container" => "unknown",
                       "_aclk_proxy" => "none",
                     "_architecture" => "x86_64",
                   "_kernel_version" => "5.10.76-gentoo-r1",
                   "_virt_detection" => "none",
                 "_system_ram_total" => "16776036352",
                        "_is_parent" => "false",
                  "_system_cpu_freq" => "4714000000",
                          "_os_name" => "Gentoo",
            "_aclk_legacy_available" => "true",
              "_container_detection" => "none",
                   "_virtualization" => "none"
        },
                    "mlInfo" => {
            "mlCapable" => true,
            "mlEnabled" => true
        }
    },
      "updatedAt" => "2021-12-21T15:23:13.921963Z",
    "machineGuid" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
         "mlInfo" => {
        "mlCapable" => true,
        "mlEnabled" => true
    }
}

@MrZammler
Contributor

Same node with ML disabled:

{
         "nodeId" => "c8ee2856-9c9b-4af6-8eda-e94ca585ebd4",
        "claimId" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
           "data" => {
                      "name" => "winterland",
                        "os" => "linux",
                    "osName" => "Gentoo",
                 "osVersion" => "unknown",
                "kernelName" => "Linux",
             "kernelVersion" => "5.10.76-gentoo-r1",
              "architecture" => "x86_64",
                      "cpus" => 12,
              "cpuFrequency" => "4714000000",
                    "memory" => "16776036352",
                 "diskSpace" => "0",
                   "version" => "v1.32.1-8-ge8d8c85c8",
            "releaseChannel" => "nightly",
                  "timezone" => "EET",
        "virtualizationType" => "none",
             "containerType" => "unknown",
               "machineGuid" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
                "hostLabels" => {
              "_container_detection" => "none",
                        "_aclk_impl" => "Next Generation",
                "_aclk_ng_available" => "true",
                       "_os_version" => "unknown",
                "_system_disk_space" => "0",
                     "_system_cores" => "12",
                        "_container" => "unknown",
                       "_aclk_proxy" => "none",
                     "_architecture" => "x86_64",
                   "_kernel_version" => "5.10.76-gentoo-r1",
                   "_virt_detection" => "none",
                 "_system_ram_total" => "16776036352",
                        "_is_parent" => "false",
                  "_system_cpu_freq" => "4714000000",
                          "_os_name" => "Gentoo",
            "_aclk_legacy_available" => "true",
                      "_is_k8s_node" => "false",
                   "_virtualization" => "none"
        },
                    "mlInfo" => {
            "mlCapable" => true
        }
    },
      "updatedAt" => "2021-12-22T08:14:04.670129Z",
    "machineGuid" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
         "mlInfo" => {
        "mlCapable" => true
    }
}
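
A note for reading these dumps: a flag that is false does not appear at all. That is consistent with protobuf leaving default-valued scalar fields out of the serialized message, assuming the schema uses proto3-style fields (this thread does not say). The sketch below mimics that convention; `render_ml_info` is a hypothetical helper, not netdata or protobuf API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: print the two flags in the style of the dumps in
 * this thread, leaving out any flag that is false. Returns the number of
 * characters snprintf would have written.
 */
int render_ml_info(char *buf, size_t len, bool ml_capable, bool ml_enabled)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "{%s%s%s}",
                    ml_capable ? "\"mlCapable\" => true" : "",
                    (ml_capable && ml_enabled) ? ", " : "",
                    ml_enabled ? "\"mlEnabled\" => true" : "");
}
```

Read this way, a dump showing only `mlCapable` means capable but not enabled, and an empty `mlInfo` means neither flag is set.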

@MrZammler
Contributor

This is the UpdateNodeInfo message sent to the cloud by the parent (winterland) for its child (6am). Both are compiled with ML, but ML is not enabled on either:

{
         "nodeId" => "c16ec0e2-7a97-4d76-a6db-cd203f443013",
        "claimId" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
           "data" => {
                      "name" => "6am",
                        "os" => "linux",
                    "osName" => "Gentoo",
                 "osVersion" => "unknown",
                "kernelName" => "Linux",
             "kernelVersion" => "5.10.76-gentoo-r1",
              "architecture" => "x86_64",
                      "cpus" => 6,
              "cpuFrequency" => "2375000000",
                    "memory" => "7794307072",
                 "diskSpace" => "0",
                   "version" => "v1.32.1-8-ge8d8c85c8",
            "releaseChannel" => "nightly",
                  "timezone" => "EET",
        "virtualizationType" => "none",
             "containerType" => "unknown",
               "machineGuid" => "0179d482-6300-11ec-b07e-0c3796231cc3",
                "hostLabels" => {
                      "_is_k8s_node" => "false",
                        "_aclk_impl" => "Next Generation ",
                "_aclk_ng_available" => "true",
                       "_os_version" => "unknown",
                "_system_disk_space" => "0",
                     "_system_cores" => "6",
                        "_container" => "unknown",
                       "_aclk_proxy" => "none",
                     "_architecture" => "x86_64",
                   "_kernel_version" => "5.10.76-gentoo-r1",
                   "_virt_detection" => "none",
                 "_system_ram_total" => "7794307072",
                       "_streams_to" => "192.168.0.1",
                        "_is_parent" => "false",
                  "_system_cpu_freq" => "2375000000",
                          "_os_name" => "Gentoo",
            "_aclk_legacy_available" => "true",
              "_container_detection" => "none",
                   "_virtualization" => "none"
        },
                    "mlInfo" => {
            "mlCapable" => true
        }
    },
      "updatedAt" => "2021-12-22T08:23:41.316180Z",
    "machineGuid" => "0179d482-6300-11ec-b07e-0c3796231cc3",
          "child" => true,
         "mlInfo" => {
        "mlCapable" => true
    }
}

@MrZammler
Contributor

(Sorry for the spamming).

Parent has ML enabled, child does not. Parent's UpdateNodeInfo for itself:

{
         "nodeId" => "c8ee2856-9c9b-4af6-8eda-e94ca585ebd4",
        "claimId" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
           "data" => {
                      "name" => "winterland",
                        "os" => "linux",
                    "osName" => "Gentoo",
                 "osVersion" => "unknown",
                "kernelName" => "Linux",
             "kernelVersion" => "5.10.76-gentoo-r1",
              "architecture" => "x86_64",
                      "cpus" => 12,
              "cpuFrequency" => "4714000000",
                    "memory" => "16776036352",
                 "diskSpace" => "0",
                   "version" => "v1.32.1-8-ge8d8c85c8",
            "releaseChannel" => "nightly",
                  "timezone" => "EET",
        "virtualizationType" => "none",
             "containerType" => "unknown",
               "machineGuid" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
                "hostLabels" => {
                      "_is_k8s_node" => "false",
                "_aclk_ng_available" => "true",
                        "_aclk_impl" => "Next Generation",
                       "_os_version" => "unknown",
                "_system_disk_space" => "0",
                     "_system_cores" => "12",
                        "_container" => "unknown",
                       "_aclk_proxy" => "none",
                     "_architecture" => "x86_64",
                   "_kernel_version" => "5.10.76-gentoo-r1",
                   "_virt_detection" => "none",
                 "_system_ram_total" => "16776036352",
                        "_is_parent" => "true",
                  "_system_cpu_freq" => "4714000000",
                          "_os_name" => "Gentoo",
            "_aclk_legacy_available" => "true",
              "_container_detection" => "none",
                   "_virtualization" => "none"
        },
                    "mlInfo" => {
            "mlCapable" => true,
            "mlEnabled" => true
        }
    },
      "updatedAt" => "2021-12-22T08:31:45.366432Z",
    "machineGuid" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
         "mlInfo" => {
        "mlCapable" => true,
        "mlEnabled" => true
    }
}

Parent's UpdateNodeInfo for its child:

{
         "nodeId" => "c16ec0e2-7a97-4d76-a6db-cd203f443013",
        "claimId" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
           "data" => {
                      "name" => "6am",
                        "os" => "linux",
                    "osName" => "Gentoo",
                 "osVersion" => "unknown",
                "kernelName" => "Linux",
             "kernelVersion" => "5.10.76-gentoo-r1",
              "architecture" => "x86_64",
                      "cpus" => 6,
              "cpuFrequency" => "2375000000",
                    "memory" => "7794307072",
                 "diskSpace" => "0",
                   "version" => "v1.32.1-8-ge8d8c85c8",
            "releaseChannel" => "nightly",
                  "timezone" => "EET",
        "virtualizationType" => "none",
             "containerType" => "unknown",
               "machineGuid" => "0179d482-6300-11ec-b07e-0c3796231cc3",
                "hostLabels" => {
              "_container_detection" => "none",
                        "_aclk_impl" => "Next Generation ",
                "_aclk_ng_available" => "true",
                       "_os_version" => "unknown",
                "_system_disk_space" => "0",
                     "_system_cores" => "6",
                        "_container" => "unknown",
                       "_aclk_proxy" => "none",
                     "_architecture" => "x86_64",
                   "_kernel_version" => "5.10.76-gentoo-r1",
                   "_virt_detection" => "none",
                       "_streams_to" => "192.168.0.1",
                 "_system_ram_total" => "7794307072",
                        "_is_parent" => "false",
                  "_system_cpu_freq" => "2375000000",
                          "_os_name" => "Gentoo",
            "_aclk_legacy_available" => "true",
                      "_is_k8s_node" => "false",
                   "_virtualization" => "none"
        },
                    "mlInfo" => {
            "mlCapable" => true
        }
    },
      "updatedAt" => "2021-12-22T08:31:46.443400Z",
    "machineGuid" => "0179d482-6300-11ec-b07e-0c3796231cc3",
          "child" => true,
         "mlInfo" => {
        "mlCapable" => true,
        "mlEnabled" => true
    }
}

@MrZammler
Contributor

Both parent and child have ML enabled, looks good:

{
         "nodeId" => "c16ec0e2-7a97-4d76-a6db-cd203f443013",
        "claimId" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
           "data" => {
                      "name" => "6am",
                        "os" => "linux",
                    "osName" => "Gentoo",
                 "osVersion" => "unknown",
                "kernelName" => "Linux",
             "kernelVersion" => "5.10.76-gentoo-r1",
              "architecture" => "x86_64",
                      "cpus" => 6,
              "cpuFrequency" => "2375000000",
                    "memory" => "7794307072",
                 "diskSpace" => "0",
                   "version" => "v1.32.1-8-ge8d8c85c8",
            "releaseChannel" => "nightly",
                  "timezone" => "EET",
        "virtualizationType" => "none",
             "containerType" => "unknown",
               "machineGuid" => "0179d482-6300-11ec-b07e-0c3796231cc3",
                "hostLabels" => {
              "_container_detection" => "none",
                "_aclk_ng_available" => "true",
                        "_aclk_impl" => "Next Generation ",
                       "_os_version" => "unknown",
                "_system_disk_space" => "0",
                     "_system_cores" => "6",
                        "_container" => "unknown",
                       "_aclk_proxy" => "none",
                     "_architecture" => "x86_64",
                   "_kernel_version" => "5.10.76-gentoo-r1",
                   "_virt_detection" => "none",
                 "_system_ram_total" => "7794307072",
                       "_streams_to" => "192.168.0.1",
                        "_is_parent" => "false",
                  "_system_cpu_freq" => "2375000000",
                          "_os_name" => "Gentoo",
            "_aclk_legacy_available" => "true",
                      "_is_k8s_node" => "false",
                   "_virtualization" => "none"
        },
                    "mlInfo" => {
            "mlCapable" => true,
            "mlEnabled" => true
        }
    },
      "updatedAt" => "2021-12-22T08:37:53.494683Z",
    "machineGuid" => "0179d482-6300-11ec-b07e-0c3796231cc3",
          "child" => true,
         "mlInfo" => {
        "mlCapable" => true,
        "mlEnabled" => true
    }
}

@MrZammler
Contributor

Child compiled with --disable-ml, parent enabled:

{
         "nodeId" => "c16ec0e2-7a97-4d76-a6db-cd203f443013",
        "claimId" => "19a32f44-f03b-44d4-b30a-b43c91502a32",
           "data" => {
                      "name" => "6am",
                        "os" => "linux",
                    "osName" => "Gentoo",
                 "osVersion" => "unknown",
                "kernelName" => "Linux",
             "kernelVersion" => "5.10.76-gentoo-r1",
              "architecture" => "x86_64",
                      "cpus" => 6,
              "cpuFrequency" => "2375000000",
                    "memory" => "7794307072",
                 "diskSpace" => "0",
                   "version" => "v1.32.1-8-ge8d8c85c8",
            "releaseChannel" => "nightly",
                  "timezone" => "EET",
        "virtualizationType" => "none",
             "containerType" => "unknown",
               "machineGuid" => "0179d482-6300-11ec-b07e-0c3796231cc3",
                "hostLabels" => {
                      "_is_k8s_node" => "false",
                "_aclk_ng_available" => "true",
                        "_aclk_impl" => "Next Generation ",
                       "_os_version" => "unknown",
                "_system_disk_space" => "0",
                     "_system_cores" => "6",
                        "_container" => "unknown",
                       "_aclk_proxy" => "none",
                     "_architecture" => "x86_64",
                   "_kernel_version" => "5.10.76-gentoo-r1",
                   "_virt_detection" => "none",
                 "_system_ram_total" => "7794307072",
                       "_streams_to" => "192.168.0.1",
                        "_is_parent" => "false",
                  "_system_cpu_freq" => "2375000000",
                          "_os_name" => "Gentoo",
            "_aclk_legacy_available" => "true",
              "_container_detection" => "none",
                   "_virtualization" => "none"
        },
                    "mlInfo" => {}
    },
      "updatedAt" => "2021-12-22T08:44:49.123388Z",
    "machineGuid" => "0179d482-6300-11ec-b07e-0c3796231cc3",
          "child" => true,
         "mlInfo" => {
        "mlCapable" => true,
        "mlEnabled" => true
    }
}

@MrZammler
Contributor

Same result as the last check when the child is built from current master instead of this PR.

I think it looks good! @vkalintiris anything else you think we should check?

@vkalintiris
Contributor Author

> I think it looks good! @vkalintiris anything else you think we should check?

Nothing I can think of.

@vkalintiris vkalintiris merged commit df8930d into netdata:master Dec 22, 2021