-
Notifications
You must be signed in to change notification settings - Fork 293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
calicoctl node diags grab logs from hosted installs #1702
calicoctl node diags grab logs from hosted installs #1702
Conversation
calicoctl/commands/node/diags.go
Outdated
// Get a list of Calico containers running on this Node. | ||
out, err := exec.Command("docker", "ps", "-a", "--filter", "\"name=calico\"", "--format", "{{.Names}}").Output() | ||
if err != nil { | ||
fmt.Printf("Could not run docker command: %s\n", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it make sense to capture these failures (170 and 175) in the diag output? it's possible the errors are too unlikely to bother trying to capture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It certainly isn't much effort to do, and could shed light on a possible permissions issue if the user isn't running as root. As for the No Calico containers found it might not really be an error, as someone might not be running a hosted install.
Also, this should be short circuiting here, and it's not :(
lgtm! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few questions/suggestions, but this feature will be very useful!
calicoctl/commands/node/diags.go
Outdated
@@ -134,6 +136,9 @@ func runDiags(logDir string) { | |||
fmt.Printf("No logs found in %s; skipping log copying", logDir) | |||
} | |||
|
|||
// Try to copy logs from containers for hosted installs. | |||
getDockerLogs(tmpLogDir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should call this getContainerLogs
, because in future if we want to add rkt logs or any other container runtime logs it sounds more generic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think getHostedLogs
might be more appropriate? As it is what we are trying to get, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that!
calicoctl/commands/node/diags.go
Outdated
// Get a list of Calico containers running on this Node. | ||
var stderr bytes.Buffer | ||
cmd := exec.Command("docker", "ps", "-a", "--filter", "\"name=calico\"", "--format", "{{.Names}}") | ||
cmd.Stderr = &stderr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is necessary. I think you can just do the following:
out,err := cmd.Output()
if err != nil {
fmt.Println("blah: %s\n", err.Stderr)
return
}
it might be err.(*ExitError).Stderr
, I haven't tried it 🍰
From Golang spec:
Any returned error will usually be of type *ExitError. If c.Stderr was nil, Output populates ExitError.Stderr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gives me a runtime error if docker
doesn't exist, but works if docker throws an error. Whereas the other method works in both situations.
calicoctl/commands/node/diags.go
Outdated
func getDockerLogs(logDir string) { | ||
os.Mkdir(logDir, os.ModeDir) | ||
|
||
fmt.Println("Copying logs from Calico containers") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should move this after line 182, since if we don't find any Calico containers then we print this message suggesting we're collecting logs but in fact we are not (and the debug log goes silently unnoticed by the user)
calicoctl/commands/node/diags.go
Outdated
} | ||
|
||
// No Calico containers found. | ||
if string(out) == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a fan of this variable name out
, I think it should be containerNames
, containers
or cNames
or something similar. WDYT?
calicoctl/commands/node/diags.go
Outdated
name := scanner.Text() | ||
cLog, err := exec.Command("docker", "logs", name).Output() | ||
if err != nil { | ||
fmt.Printf("Could not pull log for container %s: %s\n", name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we continue
here if we get an error from line 188? else it will try to access cLog
which might be a nil
in line 192
calicoctl/commands/node/diags.go
Outdated
@@ -26,6 +26,8 @@ import ( | |||
"github.com/docopt/docopt-go" | |||
log "github.com/sirupsen/logrus" | |||
shutil "github.com/termie/go-shutil" | |||
"bufio" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: not sure if you ran goimports
on this, but the recommendation is to have all the Go native imports together at the top
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one tiny nit, then looks good!
calicoctl/commands/node/diags.go
Outdated
|
||
"github.com/docopt/docopt-go" | ||
log "github.com/sirupsen/logrus" | ||
shutil "github.com/termie/go-shutil" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I much rather explicitly name the import when import name is different than the package name (not a good practice) like in this case package is shutil
(and import is go-shutil
) and we use shutil
in the code, so I like to make it explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I missed that, I agree with what you're saying, my IDE must have thought different.
squash it when you address the last bit of nit (if you agree with it) |
351fda9
to
46de5d2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@caseydavenport could I get you to do a once over of this PR to see if we can get it merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@heschlie finally got around to this!
A couple of things I think we should do and a question or two. LMK what you think.
calicoctl/commands/node/diags.go
Outdated
@@ -169,7 +212,7 @@ func writeDiags(cmds diagCmd, dir string) { | |||
|
|||
content, err := command.Output() | |||
if err != nil { | |||
fmt.Println(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're losing the error in this log the way it's written now. I think we need to keep that so we can tell what failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll get this setup so we see the returned error from the command,. Before we were only getting an unexplained exit status 1
here.
calicoctl/commands/node/diags.go
Outdated
@@ -134,6 +136,9 @@ func runDiags(logDir string) { | |||
fmt.Printf("No logs found in %s; skipping log copying", logDir) | |||
} | |||
|
|||
// Try to copy logs from containers for hosted installs. | |||
getHostedLogs(tmpLogDir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a nit: this isn't specific to "self-hosted" calico installs - this is just "get stdout logs from any calico containers"
So, basically a change to the comment and function name (e.g. getNodeContainerLogs
)
calicoctl/commands/node/diags.go
Outdated
|
||
// Get a list of Calico containers running on this Node. | ||
var stderr bytes.Buffer | ||
cmd := exec.Command("docker", "ps", "-a", "--filter", "\"name=calico\"", "--format", "{{.Names}}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this --filter name=calico
actually work when Calico is deployed as a DaemonSet? I would think the name would be something like k8s_calico-node-asdfasfdksdflahsdfhsdf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That can be a partial match and will return any containers with "calico" in them, so it will match, I tested this on a Vagrant install and got the expected output.
calicoctl/commands/node/diags.go
Outdated
|
||
fmt.Println("Copying logs from Calico containers") | ||
|
||
// Grab the log for each container and write it as <containerName>.log. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing we had talked about was also saving off the output of docker ps
so that users could correlate containerName.log with a specific container / timestamp / etc.
i.e. if I'm looking at the directory this code generates, how do I know which of the 100 containers I'm interested in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall talking about it, but at the moment I'm having trouble seeing what this would buy the user, at the moment the only output of the docker command is a list of container names that have "calico" in them, which we use for grabbing the logs and using that name for the logfile. I could run a similar query with more flags that could include some extra info, like start/creation time and save that out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main thing we're looking for is the ability to correlate a log file with a specific iteration of a container.
So, being able to see the start/creation time is important. Alternatively we can just encode that into the file name somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The names are already pretty long, I'll dump a file with the format:
{{.Names}}: {{.CreatedAt}}
Which will give us a mapping. Dunno what a good name for this would be
container_creation_time
container_to_creation_map
when_logged_containers_were_created
Something better than that maybe?
@caseydavenport I think we might be all set? Let me know if the changes to grabbing the container->created time mapping looks ok. |
calicoctl/commands/node/diags.go
Outdated
return | ||
} | ||
|
||
fmt.Println("Copying logs from Calico containers") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some output I got from running on a k8s system:
$ docker ps -a --filter name=calico --format "{{.Names}}: {{.CreatedAt}}"
k8s_install-cni_calico-node-6p0xp_kube-system_603c4dd8-98c8-11e7-acf3-42010a800008_0: 2017-09-13 21:13:44 +0000 UTC
k8s_calico-node_calico-node-6p0xp_kube-system_603c4dd8-98c8-11e7-acf3-42010a800008_0: 2017-09-13 21:13:36 +0000 UTC
k8s_calico-etcd_calico-etcd-t6dwl_kube-system_5fe749f4-98c8-11e7-acf3-42010a800008_0: 2017-09-13 21:13:26 +0000 UTC
k8s_POD_calico-node-6p0xp_kube-system_603c4dd8-98c8-11e7-acf3-42010a800008_0: 2017-09-13 21:13:23 +0000 UTC
k8s_POD_calico-etcd-t6dwl_kube-system_5fe749f4-98c8-11e7-acf3-42010a800008_0: 2017-09-13 21:13:23 +0000 UTC
We should filter out anything with k8s_POD_
in it since those are just the infra containers and won't have useful logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, did not know that, will filter them out.
When trying to trim out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@heschlie this LGTM - could you please squash before merging?
With this change `calicoctl node diags` will pickup logs from hosted installs. There is also a small change to add some clarity when a command fails, as before it was simply saying `exit status 1` will now tell you what command failed to run.
7bdadf9
to
8bd70e7
Compare
Description
Fixes #1690
Fixes projectcalico/calico#810
With this change
calicoctl node diags
will pickup logs from hosted installs. This will allow for easier and quicker debugging when a user sends us their debug logs with hosted installs.There is also a small change to add some clarity when a command fails, as before it was simply saying
exit status 1
will now tell you what command failed to run. This will give the user a better idea of what failed to run in the diags.I've tested this via the Calico vagrant demo and verified it works with and without Calico containers (obviously not writing any logs when there are no Calico containers present).
Todos
Release Note