run: Fix a lot of leakages and implement support for multiple socket filter programs #2264

Merged
mauriciovasquezbernal merged 2 commits into main from mauricio/socket-filter-network-tracer on Dec 1, 2023

mauriciovasquezbernal
Member

All of this started while investigating this comment, which refers to #2108. I found that we weren't releasing the network tracer, and then I found that the issue was worse: we weren't releasing any programs at all when running image-based gadgets.

How to test

Before this fix

Before running any gadget, there are some programs and maps in the node:

$ sudo bpftool prog list --json | jq length
28
$ sudo bpftool map list --json | jq length
40

Run a gadget, and terminate it as soon as it prints some events.

$ ./kubectl-gadget run ghcr.io/inspektor-gadget/gadget/trace_exec:latest -A
... // ctrl + C

The number of open programs and maps increased:

$ sudo bpftool prog list --json | jq length
30
$ sudo bpftool map list --json | jq length
51

If you run the gadget again, they keep increasing:

$ ./kubectl-gadget run ghcr.io/inspektor-gadget/gadget/trace_exec:latest -A
... // ctrl + C

$ sudo bpftool prog list --json | jq length
32
$ sudo bpftool map list --json | jq length
53
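
The before/after comparison above can also be scripted. The following is a minimal sketch, assuming the output of `bpftool prog list --json` or `bpftool map list --json` has been captured as bytes; the helper names `countObjects` and `leaked` and the sample data are hypothetical and are not part of this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// countObjects returns the number of elements in a JSON array, such as the
// output of `bpftool prog list --json`. It reports the same value as `jq length`.
func countObjects(data []byte) (int, error) {
	var objs []json.RawMessage
	if err := json.Unmarshal(data, &objs); err != nil {
		return 0, fmt.Errorf("parsing bpftool output: %w", err)
	}
	return len(objs), nil
}

// leaked returns how many more objects exist after a run than before it.
func leaked(before, after int) int {
	if after > before {
		return after - before
	}
	return 0
}

func main() {
	// Hypothetical, trimmed-down samples standing in for real bpftool output.
	beforeJSON := []byte(`[{"id":1},{"id":2}]`)
	afterJSON := []byte(`[{"id":1},{"id":2},{"id":3},{"id":4}]`)

	before, err := countObjects(beforeJSON)
	if err != nil {
		panic(err)
	}
	after, err := countObjects(afterJSON)
	if err != nil {
		panic(err)
	}
	fmt.Printf("leaked programs: %d\n", leaked(before, after))
}
```

With the numbers from the runs above, `leaked(28, 30)` reports the two programs left behind by the first trace_exec run.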

The situation is worse with a networking gadget:

$ ./kubectl-gadget run ghcr.io/inspektor-gadget/gadget/trace_dns:latest -A
... // ctrl + C

$ sudo bpftool prog list --json | jq length
45
$ sudo bpftool map list --json | jq length
59

$ ./kubectl-gadget run ghcr.io/inspektor-gadget/gadget/trace_dns:latest -A
... // ctrl + C

$ sudo bpftool prog list --json | jq length
58
$ sudo bpftool map list --json | jq length
68

Many maps named tail_call, and the dns program itself, are leaked:

$ sudo bpftool map list name tail_call
22521: prog_array  name tail_call  flags 0x0
	key 4B  value 4B  max_entries 1  memlock 4096B
	owner_prog_type socket_filter  owner jited
22576: prog_array  name tail_call  flags 0x0
	key 4B  value 4B  max_entries 1  memlock 4096B
	owner_prog_type socket_filter  owner jited
22616: prog_array  name tail_call  flags 0x0
	key 4B  value 4B  max_entries 1  memlock 4096B
22617: prog_array  name tail_call  flags 0x0
	key 4B  value 4B  max_entries 1  memlock 4096B
	owner_prog_type socket_filter  owner jited
22657: prog_array  name tail_call  flags 0x0
	key 4B  value 4B  max_entries 1  memlock 4096B
22658: prog_array  name tail_call  flags 0x0
	key 4B  value 4B  max_entries 1  memlock 4096B
	owner_prog_type socket_filter  owner jited

$ sudo bpftool prog list name ig_trace_dns
6083: socket_filter  name ig_trace_dns  tag 73eb506774d37c0f  gpl
	loaded_at 2023-11-30T20:32:43+0000  uid 0
	xlated 7272B  jited 4152B  memlock 8192B  map_ids 22635,22637,22631,22636
	btf_id 9806
	pids gadgettracerman(802944)
6122: socket_filter  name ig_trace_dns  tag 73eb506774d37c0f  gpl
	loaded_at 2023-11-30T20:33:05+0000  uid 0
	xlated 7272B  jited 4152B  memlock 8192B  map_ids 22676,22678,22672,22677
	btf_id 9861
	pids gadgettracerman(802944)

After this fix

Before running any gadget, there are some programs and maps in the node:

$ sudo bpftool prog list --json | jq length
28
$ sudo bpftool map list --json | jq length
39

Run a gadget, and terminate it as soon as it prints some events.

$ ./kubectl-gadget run ghcr.io/inspektor-gadget/gadget/trace_exec:latest -A
... // ctrl + C

The number of open programs and maps is the same as before, even after a second run:

$ ./kubectl-gadget run ghcr.io/inspektor-gadget/gadget/trace_exec:latest -A
... // ctrl + C

$ sudo bpftool prog list --json | jq length
28
$ sudo bpftool map list --json | jq length
39

Running the dns gadget has the same behaviour

$ ./kubectl-gadget run ghcr.io/inspektor-gadget/gadget/trace_dns:latest -A
... // ctrl + C

$ sudo bpftool prog list --json | jq length
30
$ sudo bpftool map list --json | jq length
42

In this case the numbers increased because the socket enricher is lazily loaded. If we run the gadget again, the numbers stay constant:

$ ./kubectl-gadget run ghcr.io/inspektor-gadget/gadget/trace_dns:latest -A
... // ctrl + C

$ sudo bpftool prog list --json | jq length
30
$ sudo bpftool map list --json | jq length
42
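
Why the counts grow once and then stay flat can be illustrated with a small sketch of lazy loading. This is not the socket enricher's actual code: the `enricher` type, its `get` method, and the use of `sync.Once` are assumptions made only to show the pattern.

```go
package main

import (
	"fmt"
	"sync"
)

// enricher stands in for a shared helper that owns BPF programs and maps.
// It is a hypothetical type, not the real socket enricher.
type enricher struct {
	once  sync.Once
	loads int // how many times the expensive load step ran
}

// get loads the shared objects on first use and reuses them afterwards.
// It returns the total number of loads performed so far.
func (e *enricher) get() int {
	e.once.Do(func() { e.loads++ })
	return e.loads
}

func main() {
	e := &enricher{}
	fmt.Println("loads after first gadget run:", e.get())
	fmt.Println("loads after second gadget run:", e.get())
}
```

The first call pays the cost of loading, which matches the one-time increase in the counts above. Later calls reuse the same objects, so nothing new appears in the node.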

Fixes #2108

@matthyx
Contributor

matthyx commented Dec 1, 2023

@mauriciovasquezbernal I find this approach much better, LGTM

Member

@burak-ok burak-ok left a comment

LGTM
Thanks for the fix and multiple socket filter program feature

@@ -124,11 +124,7 @@ func (t *Tracer) Init(gadgetCtx gadgets.GadgetContext) error {
return nil
}

// Close is needed because of the StartStopGadget interface
Member

Now I remember the importance of the override keyword in C++
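
Go has no override keyword, but a compile-time interface assertion gives a similar guarantee. The sketch below is an illustration under assumptions: the `closer` interface, the `release` helper, and both tracer types are hypothetical names, and the way the real runtime discovers cleanup methods may differ.

```go
package main

import "fmt"

// closer is a hypothetical interface a runtime could use to release a gadget.
type closer interface {
	Close()
}

// tracer implements closer.
type tracer struct{ closed bool }

func (t *tracer) Close() { t.closed = true }

// Compile-time check: the build fails if *tracer stops implementing closer,
// for example after the method is renamed or the interface changes.
var _ closer = (*tracer)(nil)

// legacyTracer only has Stop(), which no interface refers to anymore.
type legacyTracer struct{ stopped bool }

func (t *legacyTracer) Stop() { t.stopped = true }

// release calls Close only when the gadget implements closer.
// It reports whether any cleanup ran.
func release(g any) bool {
	c, ok := g.(closer)
	if !ok {
		return false // silently skipped: nothing is released
	}
	c.Close()
	return true
}

func main() {
	fmt.Println("tracer released:", release(&tracer{}))
	fmt.Println("legacyTracer released:", release(&legacyTracer{}))
}
```

Adding a similar `var _ closer = (*legacyTracer)(nil)` line would turn the silent skip into a build error, which is the kind of mistake the override keyword catches in C++.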

Member

@eiffel-fl eiffel-fl left a comment

It looks OK from code inspection.
Can you please add a Fixes: tag to the first commit?

}
socketFilterFound = true
err := t.networkTracer.AttachProg(t.collection.Programs[progName])
networkTracer := t.networkTracers[p.Name]
Member

Should you check `ok` here to avoid a SEGFAULT?

Member Author

I don't think that case is possible. If that happens, it seems there's something very broken, so crashing is fine.
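
For reference, the trade-off discussed in this thread looks roughly like the sketch below. It is a simplified, assumed model: `networkTracer`, `attach`, `attachChecked`, and `attachUnchecked` are hypothetical names, not the gadget's real types or methods.

```go
package main

import "fmt"

// networkTracer is a hypothetical stand-in for the real network tracer.
type networkTracer struct{ attached []string }

func (n *networkTracer) attach(prog string) {
	n.attached = append(n.attached, prog)
}

// attachChecked tests the map lookup and returns an error for unknown names.
func attachChecked(tracers map[string]*networkTracer, name string) error {
	nt, ok := tracers[name]
	if !ok {
		return fmt.Errorf("no network tracer for program %q", name)
	}
	nt.attach(name)
	return nil
}

// attachUnchecked skips the check. A missing key yields a nil pointer, and
// the method call then panics with a nil pointer dereference. The recover
// exists only so this demo can report the crash instead of exiting.
func attachUnchecked(tracers map[string]*networkTracer, name string) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	tracers[name].attach(name)
	return false
}

func main() {
	tracers := map[string]*networkTracer{"ig_trace_dns": {}}
	fmt.Println("checked, known program:", attachChecked(tracers, "ig_trace_dns"))
	fmt.Println("checked, unknown program:", attachChecked(tracers, "missing"))
	fmt.Println("unchecked, unknown program panicked:", attachUnchecked(tracers, "missing"))
}
```

Both variants stop on a missing entry. The checked one returns an error the caller can report, while the unchecked one crashes, which the author considers acceptable for a state that should be impossible.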

The GadgetStartStop interface was removed a long time ago in a3e5871 ("pkg/runtime/local: Move context and timeout management into tracers"),
so the Stop() method was never being called. This caused all BPF objects
of the gadgets being run to leak.

This commit also fixes a leak in the network tracer, which was never
being closed.

Fixes: 07ca0fb ("Initial support for containerized gadgets")
Fixes: c1cc966 ("implement networking gadgets for containerized gadgets")

Signed-off-by: Mauricio Vásquez <mauriciov@microsoft.com>
@mauriciovasquezbernal mauriciovasquezbernal force-pushed the mauricio/socket-filter-network-tracer branch from 175be40 to 966b473 Compare December 1, 2023 12:49
The logic to create the network tracer (needed by socket filter programs)
was placed in NewInstance(). Unfortunately, that had several drawbacks:
1. NewInstance() is called many times in the workflow without running
the gadget, so the network tracer was instantiated multiple times for
no purpose.
2. Creating the network tracer requires root, which made `ig --help`
print some annoying warnings (#2108).
3. The network tracer was always created, even for gadgets that don't
need it. Also, only a single socket filter program was supported.

The main motivation for having this logic in NewInstance() was that the
network tracer has to be created before calling AttachContainer(). This
commit moves all that logic to Init() (which is called before
AttachContainer()) and enables support for multiple socket filter
programs in a gadget.

Signed-off-by: Mauricio Vásquez <mauriciov@microsoft.com>
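
The lifecycle described in the commit message above can be sketched as follows. This is a simplified model under assumptions, not the code from this PR: `progSpec`, `newInstance`, and the string program types are hypothetical, and only the `networkTracers` map keyed by program name mirrors the diff fragment shown in the review.

```go
package main

import "fmt"

// progSpec is a hypothetical description of one eBPF program in a gadget.
type progSpec struct {
	Name string
	Type string // for example "socket_filter" or "kprobe"
}

// networkTracer is a hypothetical stand-in for the real network tracer.
type networkTracer struct{ closed bool }

func (n *networkTracer) Close() { n.closed = true }

type tracer struct {
	// One network tracer per socket filter program, keyed by program name.
	networkTracers map[string]*networkTracer
}

// newInstance stays cheap: it creates nothing that needs root.
func newInstance() *tracer { return &tracer{} }

// Init creates network tracers only for the socket filter programs.
func (t *tracer) Init(progs []progSpec) {
	t.networkTracers = make(map[string]*networkTracer)
	for _, p := range progs {
		if p.Type != "socket_filter" {
			continue
		}
		t.networkTracers[p.Name] = &networkTracer{}
	}
}

// Close releases every network tracer created by Init.
func (t *tracer) Close() {
	for name, nt := range t.networkTracers {
		nt.Close()
		delete(t.networkTracers, name)
	}
}

func main() {
	g := newInstance()
	fmt.Println("tracers after newInstance:", len(g.networkTracers))
	g.Init([]progSpec{
		{Name: "ig_trace_dns", Type: "socket_filter"},
		{Name: "ig_other", Type: "socket_filter"},
		{Name: "ig_probe", Type: "kprobe"},
	})
	fmt.Println("tracers after Init:", len(g.networkTracers))
	g.Close()
	fmt.Println("tracers after Close:", len(g.networkTracers))
}
```

Because creation happens in Init, a dummy instance built only to print help text never touches privileged resources, and a gadget without socket filter programs ends up with an empty map.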
@mauriciovasquezbernal mauriciovasquezbernal force-pushed the mauricio/socket-filter-network-tracer branch from 966b473 to b663533 Compare December 1, 2023 12:54
@mauriciovasquezbernal mauriciovasquezbernal merged commit 70f5d80 into main Dec 1, 2023
50 checks passed
@mauriciovasquezbernal mauriciovasquezbernal deleted the mauricio/socket-filter-network-tracer branch December 1, 2023 13:36
Labels: kind/bug (Something isn't working), priority/P0
Linked issue: ig --help gives error "failed to create dummy instance"
4 participants