Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

calicoctl node diags grab logs from hosted installs #1702

Merged
merged 1 commit into from
Sep 16, 2017

Conversation

heschlie
Copy link
Contributor

@heschlie heschlie commented Aug 16, 2017

Description

Fixes #1690
Fixes projectcalico/calico#810

With this change calicoctl node diags will pickup logs from hosted installs. This will allow for easier and quicker debugging when a user sends us their debug logs with hosted installs.

There is also a small change to add some clarity when a command fails, as before it was simply saying exit status 1 will now tell you what command failed to run. This will give the user a better idea of what failed to run in the diags.

I've tested this via the Calico vagrant demo and verified it works with and without Calico containers (obviously not writing any logs when there are no Calico containers present).

Todos

  • Tests
  • Documentation
  • Release note

Release Note

The [`calicoctl node diags`](https://docs.projectcalico.org/v2.6/reference/calicoctl/commands/node/diags) command now returns logs even when Calico is configured to log to stdout (such as in a hosted install).

// Get a list of Calico containers running on this Node.
out, err := exec.Command("docker", "ps", "-a", "--filter", "\"name=calico\"", "--format", "{{.Names}}").Output()
if err != nil {
fmt.Printf("Could not run docker command: %s\n", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense to capture these failures (170 and 175) in the diag output? it's possible the errors are too unlikely to bother trying to capture.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It certainly isn't much effort to do, and could shed light on a possible permissions issue if the user isn't running as root. As for the No Calico containers found it might not really be an error, as someone might not be running a hosted install.

Also, this should be short circuiting here, and it's not :(

@bcreane
Copy link
Contributor

bcreane commented Aug 17, 2017

lgtm!

Copy link
Contributor

@gunjan5 gunjan5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few questions/suggestions, but this feature will be very useful!

@@ -134,6 +136,9 @@ func runDiags(logDir string) {
fmt.Printf("No logs found in %s; skipping log copying", logDir)
}

// Try to copy logs from containers for hosted installs.
getDockerLogs(tmpLogDir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should call this getContainerLogs, because in future if we want to add rkt logs or any other container runtime logs it sounds more generic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think getHostedLogs might be more appropriate? As it is what we are trying to get, WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that!

// Get a list of Calico containers running on this Node.
var stderr bytes.Buffer
cmd := exec.Command("docker", "ps", "-a", "--filter", "\"name=calico\"", "--format", "{{.Names}}")
cmd.Stderr = &stderr
Copy link
Contributor

@gunjan5 gunjan5 Aug 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is necessary. I think you can just do the following:

out,err := cmd.Output()
if err != nil {
     fmt.Println("blah: %s\n", err.Stderr)
     return
}

it might be err.(*ExitError).Stderr, I haven't tried it 🍰
From Golang spec:

Any returned error will usually be of type *ExitError. If c.Stderr was nil, Output populates ExitError.Stderr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gives me a runtime error if docker doesn't exist, but works if docker throws an error. Whereas the other method works in both situations.

func getDockerLogs(logDir string) {
os.Mkdir(logDir, os.ModeDir)

fmt.Println("Copying logs from Calico containers")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should move this after line 182, since if we don't find any Calico containers then we print this message suggesting we're collecting logs but in fact we are not (and the debug log goes silently unnoticed by the user)

}

// No Calico containers found.
if string(out) == "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a fan of this variable name out, I think it should be containerNames, containers or cNames or something similar. WDYT?

name := scanner.Text()
cLog, err := exec.Command("docker", "logs", name).Output()
if err != nil {
fmt.Printf("Could not pull log for container %s: %s\n", name, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we continue here if we get an error from line 188? else it will try to access cLog which might be a nil in line 192

@@ -26,6 +26,8 @@ import (
"github.com/docopt/docopt-go"
log "github.com/sirupsen/logrus"
shutil "github.com/termie/go-shutil"
"bufio"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not sure if you ran goimports on this, but the recommendation is to have all the Go native imports together at the top

Copy link
Contributor

@gunjan5 gunjan5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one tiny nit, then looks good!


"github.com/docopt/docopt-go"
log "github.com/sirupsen/logrus"
shutil "github.com/termie/go-shutil"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I much rather explicitly name the import when import name is different than the package name (not a good practice) like in this case package is shutil (and import is go-shutil) and we use shutil in the code, so I like to make it explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I missed that, I agree with what you're saying, my IDE must have thought different.

@gunjan5
Copy link
Contributor

gunjan5 commented Aug 24, 2017

squash it when you address the last bit of nit (if you agree with it)

Copy link
Contributor

@gunjan5 gunjan5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@heschlie heschlie requested review from caseydavenport and removed request for robbrockbank August 30, 2017 23:42
@heschlie
Copy link
Contributor Author

@caseydavenport could I get you to do a once over of this PR to see if we can get it merged?

Copy link
Member

@caseydavenport caseydavenport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heschlie finally got around to this!

A couple of things I think we should do and a question or two. LMK what you think.

@@ -169,7 +212,7 @@ func writeDiags(cmds diagCmd, dir string) {

content, err := command.Output()
if err != nil {
fmt.Println(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're losing the error in this log the way it's written now. I think we need to keep that so we can tell what failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll get this setup so we see the returned error from the command,. Before we were only getting an unexplained exit status 1 here.

@@ -134,6 +136,9 @@ func runDiags(logDir string) {
fmt.Printf("No logs found in %s; skipping log copying", logDir)
}

// Try to copy logs from containers for hosted installs.
getHostedLogs(tmpLogDir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a nit: this isn't specific to "self-hosted" calico installs - this is just "get stdout logs from any calico containers"

So, basically a change to the comment and function name (e.g. getNodeContainerLogs)


// Get a list of Calico containers running on this Node.
var stderr bytes.Buffer
cmd := exec.Command("docker", "ps", "-a", "--filter", "\"name=calico\"", "--format", "{{.Names}}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this --filter name=calico actually work when Calico is deployed as a DaemonSet? I would think the name would be something like k8s_calico-node-asdfasfdksdflahsdfhsdf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That can be a partial match and will return any containers with "calico" in them, so it will match, I tested this on a Vagrant install and got the expected output.


fmt.Println("Copying logs from Calico containers")

// Grab the log for each container and write it as <containerName>.log.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing we had talked about was also saving off the output of docker ps so that users could correlate containerName.log with a specific container / timestamp / etc.

i.e. if I'm looking at the directory this code generates, how do I know which of the 100 containers I'm interested in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall talking about it, but at the moment I'm having trouble seeing what this would buy the user, at the moment the only output of the docker command is a list of container names that have "calico" in them, which we use for grabbing the logs and using that name for the logfile. I could run a similar query with more flags that could include some extra info, like start/creation time and save that out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main thing we're looking for is the ability to correlate a log file with a specific iteration of a container.

So, being able to see the start/creation time is important. Alternatively we can just encode that into the file name somehow.

Copy link
Contributor Author

@heschlie heschlie Sep 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The names are already pretty long, I'll dump a file with the format:

{{.Names}}: {{.CreatedAt}}

Which will give us a mapping. Dunno what a good name for this would be

container_creation_time
container_to_creation_map
when_logged_containers_were_created

Something better than that maybe?

@heschlie
Copy link
Contributor Author

@caseydavenport I think we might be all set? Let me know if the changes to grabbing the container->created time mapping looks ok.

return
}

fmt.Println("Copying logs from Calico containers")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some output I got from running on a k8s system:

$ docker ps -a --filter name=calico --format "{{.Names}}: {{.CreatedAt}}"
k8s_install-cni_calico-node-6p0xp_kube-system_603c4dd8-98c8-11e7-acf3-42010a800008_0: 2017-09-13 21:13:44 +0000 UTC
k8s_calico-node_calico-node-6p0xp_kube-system_603c4dd8-98c8-11e7-acf3-42010a800008_0: 2017-09-13 21:13:36 +0000 UTC
k8s_calico-etcd_calico-etcd-t6dwl_kube-system_5fe749f4-98c8-11e7-acf3-42010a800008_0: 2017-09-13 21:13:26 +0000 UTC
k8s_POD_calico-node-6p0xp_kube-system_603c4dd8-98c8-11e7-acf3-42010a800008_0: 2017-09-13 21:13:23 +0000 UTC
k8s_POD_calico-etcd-t6dwl_kube-system_5fe749f4-98c8-11e7-acf3-42010a800008_0: 2017-09-13 21:13:23 +0000 UTC

We should filter out anything with k8s_POD_ in it since those are just the infra containers and won't have useful logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, did not know that, will filter them out.

@heschlie
Copy link
Contributor Author

When trying to trim out k8s_POD containers I found that the policy controller log was empty, turns out we are only logging to stderr in that container, so I migrated to exec.Command.CombinedOutput() which coincidentally helped me cleanup a few other parts of the code.

Copy link
Member

@caseydavenport caseydavenport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heschlie this LGTM - could you please squash before merging?

With this change `calicoctl node diags` will pickup logs from hosted
installs.

There is also a small change to add some clarity when a command fails,
as before it was simply saying `exit status 1` will now tell you what
command failed to run.
@heschlie heschlie merged commit e03438a into projectcalico:master Sep 16, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants