Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

automatically expose pprof+prometheus endpoint. #677

Merged
merged 8 commits into from Mar 18, 2020
Merged

Conversation

@raulk
Copy link
Member

raulk commented Mar 12, 2020

With this change, test plans using runtime.Invoke() will automatically benefit from pprof and prometheus HTTP exports. Fixes #673.

Works in all runneres (local:exec, local:docker, cluster:k8s (via k8s port forwarding).

Fixes #698.

@raulk raulk changed the title automatically expose pprof+prometheus endpoint. [DO NOT MERGE] automatically expose pprof+prometheus endpoint. Mar 12, 2020
@raulk

This comment has been minimized.

Copy link
Member Author

raulk commented Mar 12, 2020

TODO:

  • publish ports on local:docker.
  • make sure sidecar doesn't annul the routes.
@raulk raulk added this to the Testground v0.3 milestone Mar 12, 2020
pkg/build/golang/Dockerfile.template Show resolved Hide resolved
manifests/placebo.toml Show resolved Hide resolved
pkg/docker/container.go Show resolved Hide resolved
pkg/runner/local_docker.go Show resolved Hide resolved
// Allow traffic from the host machine (acting as the gateway).
var (
gwIPs = make(map[string]struct{})
gwRoutes []netlink.Route
)
for _, route := range controlRoutes {
if _, ok := gwIPs[route.Gw.String()]; ok {
continue
}
nlroutes, err := netlinkHandle.RouteGet(route.Gw)
if err != nil {
return nil, fmt.Errorf("failed to resolve route: %w", err)
}
gwIPs[route.Gw.String()] = struct{}{}
gwRoutes = append(gwRoutes, nlroutes...)
}

controlRoutes = append(controlRoutes, gwRoutes...)
Comment on lines 177 to 194

This comment has been minimized.

Copy link
@raulk

raulk Mar 17, 2020

Author Member

This is the key. In kubernetes we don't need this, because kubernetes has native port-forwarding out of the box. In our case, we connect our test containers to the control and data networks. The former is tightly restricted, and the latter is internal and connects only the containers participating in the test. For containers to successfully expose ports to the host, besides publishing those ports, we need to ALLOW traffic from the gateway (host).

This comment has been minimized.

Copy link
@coryschwartz

coryschwartz Mar 17, 2020

Collaborator

What is actually added here?
I think this would be just a route to the single host, the route to the host, right?

This comment has been minimized.

Copy link
@coryschwartz

coryschwartz Mar 17, 2020

Collaborator

by the way, I see this route table on plan containers:

docker exec tg-example-prometheus2-f3de125372b3-single-1 ip route
default via 192.18.0.1 dev eth0 
16.0.0.0/16 dev eth1 proto kernel scope link src 16.0.0.2 
192.18.0.0/16 dev eth0 proto kernel scope link src 192.18.0.6 

they can ping google

docker exec tg-example-prometheus2-f3de125372b3-single-1 ping google.com
PING google.com (172.217.1.142) 56(84) bytes of data.
64 bytes from dfw28s23-in-f14.1e100.net (172.217.1.142): icmp_seq=1 ttl=52 time=97.6 ms
64 bytes from dfw28s23-in-f14.1e100.net (172.217.1.142): icmp_seq=2 ttl=52 time=104 ms

This comment has been minimized.

Copy link
@coryschwartz

coryschwartz Mar 17, 2020

Collaborator

Just for comparison... Without this change, from the current master,
the route table looks like this:

docker exec tg-example-prometheus2-19979292b320-single-1 ip route
16.1.0.0/16 dev eth1 proto kernel scope link src 16.1.0.3 
192.18.0.3 dev eth0 src 192.18.0.7 
192.18.0.4 dev eth0 src 192.18.0.7 

and there is no default route, so no place to send packets outside of known networks.

docker exec tg-example-prometheus2-19979292b320-single-1 ping -c 2 google.com
ping: google.com: Temporary failure in name resolution

This comment has been minimized.

Copy link
@raulk

raulk Mar 18, 2020

Author Member

Thanks for spotting this. Definitely unintended. I've fixed this logic in eaa06cd. Can you take it out for a spin again? Works for me.

  • The host can access ports exposed by the container.
  • The container cannot access public IPs.
@raulk raulk changed the title [DO NOT MERGE] automatically expose pprof+prometheus endpoint. automatically expose pprof+prometheus endpoint. Mar 17, 2020
@raulk raulk requested a review from coryschwartz Mar 17, 2020
Sometimes the flush could end up sending a nil bytes slice
to TgWriter, and this would make it as a nil progress payload
on the wire.
manifests/placebo.toml Show resolved Hide resolved
pkg/docker/container.go Show resolved Hide resolved
// Allow traffic from the host machine (acting as the gateway).
var (
gwIPs = make(map[string]struct{})
gwRoutes []netlink.Route
)
for _, route := range controlRoutes {
if _, ok := gwIPs[route.Gw.String()]; ok {
continue
}
nlroutes, err := netlinkHandle.RouteGet(route.Gw)
if err != nil {
return nil, fmt.Errorf("failed to resolve route: %w", err)
}
gwIPs[route.Gw.String()] = struct{}{}
gwRoutes = append(gwRoutes, nlroutes...)
}

controlRoutes = append(controlRoutes, gwRoutes...)

This comment has been minimized.

Copy link
@coryschwartz

coryschwartz Mar 17, 2020

Collaborator

What is actually added here?
I think this would be just a route to the single host, the route to the host, right?

pkg/docker/container.go Show resolved Hide resolved
manifests/placebo.toml Show resolved Hide resolved
// Allow traffic from the host machine (acting as the gateway).
var (
gwIPs = make(map[string]struct{})
gwRoutes []netlink.Route
)
for _, route := range controlRoutes {
if _, ok := gwIPs[route.Gw.String()]; ok {
continue
}
nlroutes, err := netlinkHandle.RouteGet(route.Gw)
if err != nil {
return nil, fmt.Errorf("failed to resolve route: %w", err)
}
gwIPs[route.Gw.String()] = struct{}{}
gwRoutes = append(gwRoutes, nlroutes...)
}

controlRoutes = append(controlRoutes, gwRoutes...)

This comment has been minimized.

Copy link
@coryschwartz

coryschwartz Mar 17, 2020

Collaborator

by the way, I see this route table on plan containers:

docker exec tg-example-prometheus2-f3de125372b3-single-1 ip route
default via 192.18.0.1 dev eth0 
16.0.0.0/16 dev eth1 proto kernel scope link src 16.0.0.2 
192.18.0.0/16 dev eth0 proto kernel scope link src 192.18.0.6 

they can ping google

docker exec tg-example-prometheus2-f3de125372b3-single-1 ping google.com
PING google.com (172.217.1.142) 56(84) bytes of data.
64 bytes from dfw28s23-in-f14.1e100.net (172.217.1.142): icmp_seq=1 ttl=52 time=97.6 ms
64 bytes from dfw28s23-in-f14.1e100.net (172.217.1.142): icmp_seq=2 ttl=52 time=104 ms
// Allow traffic from the host machine (acting as the gateway).
var (
gwIPs = make(map[string]struct{})
gwRoutes []netlink.Route
)
for _, route := range controlRoutes {
if _, ok := gwIPs[route.Gw.String()]; ok {
continue
}
nlroutes, err := netlinkHandle.RouteGet(route.Gw)
if err != nil {
return nil, fmt.Errorf("failed to resolve route: %w", err)
}
gwIPs[route.Gw.String()] = struct{}{}
gwRoutes = append(gwRoutes, nlroutes...)
}

controlRoutes = append(controlRoutes, gwRoutes...)

This comment has been minimized.

Copy link
@coryschwartz

coryschwartz Mar 17, 2020

Collaborator

Just for comparison... Without this change, from the current master,
the route table looks like this:

docker exec tg-example-prometheus2-19979292b320-single-1 ip route
16.1.0.0/16 dev eth1 proto kernel scope link src 16.1.0.3 
192.18.0.3 dev eth0 src 192.18.0.7 
192.18.0.4 dev eth0 src 192.18.0.7 

and there is no default route, so no place to send packets outside of known networks.

docker exec tg-example-prometheus2-19979292b320-single-1 ping -c 2 google.com
ping: google.com: Temporary failure in name resolution
@raulk raulk requested a review from coryschwartz Mar 18, 2020
@raulk

This comment has been minimized.

Copy link
Member Author

raulk commented Mar 18, 2020

Re-requested review from @coryschwartz. I've rectified the logic that whitelists the host's IP addr.

Copy link
Collaborator

coryschwartz left a comment

Appears to work, I can hit http://<plan_ip>:6060/metrics and it responds with output,

The plan can't ping outside the network. LGTM

switch {
case len(pub) == 0:
logging.S().Warnw("failed to discover gateway/host address; no routes to public IPs", "container_id", container.ID)
case pub[0].Gw == nil:

This comment has been minimized.

Copy link
@coryschwartz

coryschwartz Mar 18, 2020

Collaborator

With this change, I have the following routing table in the plans:

docker exec tg-example-prometheus2-a0bf3767448d-single-9 ip r
16.2.0.0/16 dev eth1 proto kernel scope link src 16.2.0.3 
192.18.0.1 dev eth0 src 192.18.0.6 
192.18.0.3 dev eth0 src 192.18.0.6 
192.18.0.4 dev eth0 src 192.18.0.6 

Very nice, no default gateway.

and the outside is un-pingable, also very nice.

docker exec tg-example-prometheus2-a0bf3767448d-single-9 ping 1.1.1.1
connect: Network is unreachable
// Sidecar doesn't whitelist traffic to public addresses, but it special-cases
// traffic between the container and the host, so that pprof, metrics and other
// ports can be exposed to the Docker host.
var PublicAddr = net.ParseIP("1.1.1.1")

This comment has been minimized.

Copy link
@coryschwartz

coryschwartz Mar 18, 2020

Collaborator

This is the IP address of Cloudfliar DNS. I don't have a problem with that, just pointing that out.

This comment has been minimized.

Copy link
@raulk

raulk Mar 18, 2020

Author Member

Yep, that's ok. This will not perform any lookup of any kind, it just matches against the kernel routing tables.

@@ -44,6 +44,10 @@ type Error struct {
}

func (tgw *TgWriter) Write(p []byte) (n int, err error) {
if p == nil {

This comment has been minimized.

Copy link
@coryschwartz

coryschwartz Mar 18, 2020

Collaborator

Does this prevent errors from popping up?

This comment has been minimized.

Copy link
@raulk

raulk Mar 18, 2020

Author Member

It suppresses empty writes from making it onto the wire, which I found can happen when Flush is called. This was leading to the intermittent bug with the describe command.

@raulk raulk merged commit e16f9ca into master Mar 18, 2020
0 of 2 checks passed
0 of 2 checks passed
Travis CI - Branch Build Failed
Details
Travis CI - Pull Request Build Failed
Details
@raulk raulk deleted the feat/automatic-pprof branch Mar 18, 2020
@raulk raulk added status/done and removed status/waiting labels Mar 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

3 participants
You can’t perform that action at this time.