
Switch to Fluent bit from Fluentd #943

Merged
Gerhut merged 8 commits into microsoft:dltsdev from Gerhut:logging/fluentd-to-fluent-bit on Mar 23, 2020

Conversation

@Gerhut (Member) commented Mar 18, 2020

  • Bring nanoseconds back
  • Should we bring docker.container_id back? No
    @Anbang-Hu @xudifsd currently the restful API retrieves logs by pod_name while NFS stores logs by container_id. In which cases do they differ?

coveralls commented Mar 18, 2020

Pull Request Test Coverage Report for Build 2731

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 92.848%

Totals Coverage Status
  • Change from base Build 2726: 0.0%
  • Covered Lines: 665
  • Relevant Lines: 706

💛 - Coveralls

@xudifsd (Contributor) commented Mar 19, 2020

What do you mean by "NFS stores by container id"? Right now the pod name and the container id within the pod are the same. I was thinking about using main as the container name, but we can stick with the current convention if changing it may break something.

@Gerhut (Member, Author) commented Mar 19, 2020

Hi @xudifsd

Currently the job logs on NFS are grouped by container_id (https://github.com/microsoft/DLWorkspace/blob/dltsdev/src/ClusterManager/joblog_manager.py#L59-L70),

while the GetJobLog API groups job logs by pod_name (https://github.com/microsoft/DLWorkspace/blob/dltsdev/src/utils/JobRestAPIUtils.py#L649-L659).

Do we currently, or do we plan to, put two or more containers in a pod? If not, I will keep logs grouped only by pod_name, which is good for job scheduling.
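The proposed convention can be sketched in a few lines: group flat log records solely by pod_name, the same key the GetJobLog API already uses. This is an illustrative sketch; the field names (pod_name, log) are assumptions, not the actual log schema.

```python
# Hypothetical sketch: group job log records by pod_name only,
# matching the grouping the GetJobLog API already uses.
# Field names ("pod_name", "log") are illustrative assumptions.
from collections import defaultdict

def group_logs_by_pod(records):
    """Group flat log records into {pod_name: [log lines]}."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["pod_name"]].append(record["log"])
    return dict(grouped)

records = [
    {"pod_name": "job-abc-master", "log": "epoch 1 done"},
    {"pod_name": "job-abc-worker0", "log": "epoch 1 done"},
    {"pod_name": "job-abc-master", "log": "epoch 2 done"},
]
print(group_logs_by_pod(records))
# → {'job-abc-master': ['epoch 1 done', 'epoch 2 done'],
#    'job-abc-worker0': ['epoch 1 done']}
```

With one container per pod, this grouping is equivalent to grouping by container_id, which is why dropping the container_id key is safe under the current policy.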

@xudifsd (Contributor) commented Mar 19, 2020

I think we'd better unify on using only the pod name. It seems we will not change the one-pod-one-container policy for jobs, but I've heard that AML will put a sidecar inside the job pod to facilitate log and metric collection, so we'd better prepare for that.

@Gerhut (Member, Author) commented Mar 19, 2020

executing 22 cases [
    'test_inference_job.test_inference_job_running',
    'test_distributed_job.test_blobfuse',
    'test_distributed_job.test_distributed_job_env',
    'test_distributed_job.test_distributed_job_ssh',
    'test_distributed_job.test_distributed_non_preemptable_job_running',
    'test_distributed_job.test_distributed_preemptable_job_running',
    'test_distributed_job.test_distributed_with_default_cmd',
    'test_regular_job.test_batch_kill_jobs',
    'test_regular_job.test_batch_op_jobs',
    'test_regular_job.test_blobfuse',
    'test_regular_job.test_data_job_running',
    'test_regular_job.test_do_not_expose_private_key',
    'test_regular_job.test_job_fail',
    'test_regular_job.test_list_all_jobs',
    'test_regular_job.test_op_job',
    'test_regular_job.test_regular_job_custom_ssh_key',
    'test_regular_job.test_regular_job_env',
    'test_regular_job.test_regular_job_ssh',
    'test_regular_job.test_regular_non_preemptable_job_running',
    'test_regular_job.test_regular_preemptable_job_running',
    'test_regular_job.test_ssh_do_not_expose_private_key',
    'test_regular_job.test_sudo_installed'
], 8 failed
test_inference_job.test_inference_job_running,
test_distributed_job.test_distributed_job_env,
test_distributed_job.test_distributed_job_ssh,
test_distributed_job.test_distributed_with_default_cmd,
test_regular_job.test_regular_job_custom_ssh_key,
test_regular_job.test_regular_job_env,
test_regular_job.test_regular_job_ssh,
test_regular_job.test_ssh_do_not_expose_private_key

@Gerhut Gerhut marked this pull request as ready for review March 20, 2020 15:56
@Gerhut (Member, Author) commented Mar 23, 2020

Workaround for an issue in hostNetwork deployment: previously we were able to deploy fluentd with hostNetwork, but it stopped working when switching to fluent-bit.

Fluentd requests the k8s apiserver through the IP address in environment variables:

'https://' + ENV.fetch('KUBERNETES_SERVICE_HOST') + ':' + ENV.fetch('KUBERNETES_SERVICE_PORT') + '/api'

while fluent-bit requests the k8s apiserver through the internal domain (https://kubernetes.default.svc:443).

After configuring fluent-bit to request the k8s apiserver through the IP (https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}), there is a certificate verification failure.

Since the k8s certificate has only these domain names for SNI (kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local), requests via a pure IP are not allowed when SNI is enabled in TLS verification.
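The failure mode can be sketched as a simplified hostname check against the certificate's SAN list: connecting by raw IP fails because only the service DNS names are listed. This is an illustrative sketch (exact-match only, no wildcards or IP SANs); real TLS stacks implement the full RFC 6125 rules.

```python
# Hypothetical illustration of why a raw-IP request fails TLS verification:
# the k8s apiserver certificate carries only these DNS SANs (from the
# comment above), so hostname verification succeeds only for those names.
KUBE_CERT_SANS = [
    "kubernetes",
    "kubernetes.default",
    "kubernetes.default.svc",
    "kubernetes.default.svc.cluster.local",
]

def hostname_matches(requested_host, sans=KUBE_CERT_SANS):
    """Simplified SAN check: exact DNS-name match only."""
    return requested_host in sans

print(hostname_matches("kubernetes.default.svc"))  # → True
print(hostname_matches("10.0.0.1"))                # → False: raw IP not in SANs
```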

Unfortunately fluent-bit enforces this verification while fluentd does not.

Tracking in fluent/fluent-bit#1615 (comment)
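A hypothetical sketch of what a fix could look like, assuming fluent-bit gains an option to supply a verification/SNI hostname separate from the connection address (the tls.vhost option name here is an assumption, pending the outcome of fluent/fluent-bit#1615):

```
[FILTER]
    Name        kubernetes
    Match       kube.*
    # Connect by IP (works under hostNetwork), but verify the certificate
    # against the service DNS name instead of the IP. tls.vhost is an
    # assumed option name, not confirmed for the fluent-bit version in use.
    Kube_URL    https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}
    tls.verify  On
    tls.vhost   kubernetes.default.svc
```

This keeps TLS verification on while avoiding the cluster-DNS dependency that breaks under hostNetwork.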

@Gerhut Gerhut force-pushed the logging/fluentd-to-fluent-bit branch from cb33fe5 to dbbdb4c on March 23, 2020 14:59
@Gerhut Gerhut merged commit 6fd8fba into microsoft:dltsdev Mar 23, 2020
@xudifsd (Contributor) commented Mar 23, 2020

Why do we want fluent-bit to use the host network? For efficiency?

@Gerhut Gerhut deleted the logging/fluentd-to-fluent-bit branch March 27, 2020 16:06
@Gerhut (Member, Author) commented Mar 27, 2020

Why do we want fluent-bit to use the host network? For efficiency?

Yes


4 participants