Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose UDP message on /metrics endpoint #2608

Merged
merged 1 commit into from
Jun 14, 2018

Conversation

fmejia97
Copy link
Contributor

@fmejia97 fmejia97 commented Jun 5, 2018

What this PR does / why we need it:

This PR:

  1. Adds a UDP collector that listens to UDP messages from monitor.lua and exposes them on /metrics endpoint using prometheus client.
  2. Removes VTS collector.

Why do we need this PR?

With the dynamic mode introduced a couple of releases ago, the load balancing is handled in LUA. This means there is no upstream section in the generated nginx.conf, there is only one used to switch to LUA. In this scenario, the VTS module is useless because the stats only contains information about this upstream without any context about the traffic. Additionally, by using the VTS module it is not possible to add custom variables to the stats, such as information about the namespace, ingress rule, and service (something available as NGINX variable).

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

replaces #2607

fixes #2355
fixes #2128
fixes #557

@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 5, 2018
@aledbf
Copy link
Member

aledbf commented Jun 5, 2018

You have some lint issues (maybe from one of my commits :P )

./hack/verify-golint.sh
!!! 'golint' problems: 
/go/src/k8s.io/ingress-nginx/internal/ingress/metric/collector/stats.go:53:6: exported type StatsCollector should have comment or be unexported
/go/src/k8s.io/ingress-nginx/internal/ingress/metric/collector/stats.go:64:1: exported function NewInstance should have comment or be unexported
/home/aledbf/go/src/k8s.io/ingress-nginx/internal/ingress/metric/collector/stats.go:179:1: exported method StatsCollector.Run should have comment or be unexported

@gianrubio
Copy link
Contributor

Does it make sense to fully remove the current vts library or should we have metrics/v1 => vts and metrics/v2 => lua exporter.

Can I have a docker image of this changes?

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 5, 2018
@aledbf
Copy link
Member

aledbf commented Jun 5, 2018

@gianrubio I was building that 😉 quay.io/aledbf/nginx-ingress-controller:0.371

@aledbf
Copy link
Member

aledbf commented Jun 5, 2018

Does it make sense to fully remove the current vts library or should we have metrics/v1 => vts and metrics/v2 => lua exporter.

Once this PR lands we are removing the VTS module from the nginx image

@aledbf
Copy link
Member

aledbf commented Jun 5, 2018

@fmejia97

I0605 19:00:21.763053       6 stats.go:126] msg: {"requestLength":"1343","namespace":"gitlab","host":"git.192.168.1.29.xip.io","upstreamResponseTime":"-","remoteAddr":"192.168.1.2","ingress":"gitlab-dev","status":"499","protocol":"HTTP\/1.1","upstreamStatus":"-","path":"\/docker-images\/base\/branches","method":"GET","service":"gitlab","requestDuration":"0.522","bytesSent":"0","upstreamIP":"10.2.1.22:80"}
panic: json: cannot unmarshal number - into Go struct field udpData.upstreamResponseTime of type float64

goroutine 66 [running]:
k8s.io/ingress-nginx/internal/ingress/metric/collector.(*StatsCollector).handleMessage(0xc4204ca5f0, 0xc420626000, 0x167, 0x10400)
	/home/aledbf/go/src/k8s.io/ingress-nginx/internal/ingress/metric/collector/stats.go:132 +0xbc7
k8s.io/ingress-nginx/internal/ingress/metric/collector.(*StatsCollector).(k8s.io/ingress-nginx/internal/ingress/metric/collector.handleMessage)-fm(0xc420626000, 0x167, 0x10400)

@fmejia97
Copy link
Contributor Author

fmejia97 commented Jun 5, 2018

🤔 Looking right now

@aledbf
Copy link
Member

aledbf commented Jun 5, 2018

Example of the metrics https://gist.github.com/aledbf/c68ad149c1292da54d6bd3162bf05740

@aledbf
Copy link
Member

aledbf commented Jun 5, 2018

"upstreamResponseTime":"-"

This could be an issue. In some scenarios, nginx returns - in some fields (usually when status==499)

@fmejia97
Copy link
Contributor Author

fmejia97 commented Jun 5, 2018

Yes, that is what's crashing Unmarshal. I'm working on a fix and will update the PR soon.

@aledbf aledbf added this to In progress in 0.16.0 Jun 5, 2018
@ElvinEfendi
Copy link
Member

I've not looked deep into the PR yet, but can you make another PR for Lua middleware and deletion of VTS? Will make it easier to review.

I'd suggest we first get the Lua middleware merged and then merge this PR. And finally remove VTS and edit nginx.tmpl to switch to the new metric collection.

@aledbf
Copy link
Member

aledbf commented Jun 5, 2018

I'd suggest we first get the Lua middleware merged and then merge this PR.

Both things are in this PR

And finally remove VTS and edit nginx.tmpl to switch to the new metric collection.

That's the plan

@aledbf
Copy link
Member

aledbf commented Jun 5, 2018

@ElvinEfendi by default the vts module is disabled so the prometheus metrics only contain the default status information from nginx

@ElvinEfendi
Copy link
Member

ElvinEfendi commented Jun 5, 2018

Both things are in this PR

I know but for example Lua middleware lacks unit tests, and having self contained separate PRs if possible makes it easier to review/revert.

@aledbf
Copy link
Member

aledbf commented Jun 5, 2018

Will make it easier to review.

Not sure if splitting this PR makes sense (in this case). It will be harder to review from my point of view

Name: "bytes_sent",
Help: "The the number of bytes sent to a client",
Namespace: ns,
Buckets: prometheus.LinearBuckets(50, 25, 10), // 10 buckets, each 50 bytes wide.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably use exponential buckets for bytes sent, from 100 bytes to maybe 10 MiB.

@SuperQ
Copy link

SuperQ commented Jun 6, 2018

Is it possible to use ngx.socket.stream with a unix domain socket rather than UDP? This would make more sense for something communicating locally. It would avoid the whole network stack, even on localhost large amounts of UDP can be problematic.

@aledbf
Copy link
Member

aledbf commented Jun 6, 2018

@SuperQ yes but not sure if that change will be done in this PR

@aledbf
Copy link
Member

aledbf commented Jun 6, 2018

@fmejia97 maybe we could add this additional metric #2121 now? (and close that PR)

@fmejia97
Copy link
Contributor Author

fmejia97 commented Jun 6, 2018

@aledbf Looks like the PR wants to add a metric for average request time for individual services (upstreams). I'm currently exposing upstream_response_time_seconds_sum{...} which contains services labels and looking at Prometheus conventions, aggregation (avg) is usually carried out at the prometheus server side. We could obtain the additional metric using the query: avg_over_time(range-vector) https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation-_over_time. This query allows us to specify how back in time we want to consider values for the aggregation. What are your thoughts?

@aledbf
Copy link
Member

aledbf commented Jun 12, 2018

@fmejia97 I forgot to mention that we don't need nginx_status_collector.go. Please remove that file too.
After this PR, there is only one way to get metrics.

@fmejia97
Copy link
Contributor Author

fmejia97 commented Jun 13, 2018

@aledbf After talking with @ElvinEfendi and @andrewlouis93 , we decided that keeping nginx_status_collector.go is a good idea since it gives us connections and request metrics. Ideally we don't want to emit these metrics through the monitor.lua module since this module only emits data regarding each individual request. Connections metrics are independent of individual requests. As it is right now, the nginx_status collector writes metrics using the prometheus client, therefore both monitor.lua and nginx_status metrics will be visible on the same endpoint: /metrics. If we remove it, we won't have connection metrics. What do you think?

@aledbf
Copy link
Member

aledbf commented Jun 13, 2018

we decided that keeping nginx_status_collector.go is a good idea since it gives us connections and request metrics.

Until now nginx_status_collector.go was used only if the VTS module was disabled. I would prefer to get this info only from one place if possible. Maybe we can leave this until we merge this PR and then see exactly what can be removed (or not)

@gianrubio
Copy link
Contributor

I'd like to understand what are the main differences between the prometheus Lua collector and nginx vts?

I noticed that nginx vts added the feature to expose metrics in prometheus format, can we discuss pro and con between lua x vts?

@aledbf
Copy link
Member

aledbf commented Jun 13, 2018

I noticed that nginx vts added the feature to expose metrics in prometheus format, can we discuss pro and con between lua x vts?

With the dynamic mode introduced a couple of releases ago, the load balancing stuff is handled in LUA. This means there is no upstream section in the generated nginx.conf, there is only one used to switch to LUA. In this scenario, the VTS module is useless because the stats only contains information about this upstream without any context about the traffic.
Additionally to this, using the VTS module is not possible to add custom variables to the stats, like information about the namespace, ingress rule, and service (something available as NGINX variable).

@andrewloux
Copy link
Contributor

andrewloux commented Jun 13, 2018

@fmejia97 Is not clear to everyone why we are switching away from VTS - could we add more context to this PR description before this ships?

@codecov-io
Copy link

codecov-io commented Jun 13, 2018

Codecov Report

Merging #2608 into master will decrease coverage by 0.03%.
The diff coverage is 3.67%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #2608      +/-   ##
=========================================
- Coverage   40.83%   40.8%   -0.04%     
=========================================
  Files          75      74       -1     
  Lines        5123    5083      -40     
=========================================
- Hits         2092    2074      -18     
+ Misses       2750    2726      -24     
- Partials      281     283       +2
Impacted Files Coverage Δ
internal/ingress/controller/nginx.go 11.51% <0%> (+0.24%) ⬆️
internal/ingress/metric/collector/udp_collector.go 0% <0%> (ø)
internal/ingress/controller/util.go 27.77% <0%> (-13.89%) ⬇️
...rnal/ingress/metric/collector/process_collector.go 0% <0%> (ø)
...ingress/metric/collector/nginx_status_collector.go 0% <0%> (ø)
cmd/nginx/main.go 22.62% <0%> (-6.38%) ⬇️
internal/ingress/metric/collector/listener.go 69.23% <69.23%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 764bcd5...2cd2da7. Read the comment docs.

@aledbf
Copy link
Member

aledbf commented Jun 13, 2018

@fmejia97 please rollback the last commit. The namespace cannot be a constant. Use a helper to fix the name (like ingress-nginx to ingress_nginx)

@fmejia97 fmejia97 force-pushed the expose-udp-metrics-updated branch 3 times, most recently from 8f33578 to f508189 Compare June 14, 2018 01:00
@fmejia97
Copy link
Contributor Author

@aledbf I used a constant since I saw that's how it was used before. Added a new commit where I remove the constant and instead I fix the namespace name 👍

@fmejia97
Copy link
Contributor Author

@aledbf Rebased + squashed 👍

@andrewloux
Copy link
Contributor

@aledbf I'm happy with this as is right now. Will do another PR where we'll try an reduce complexity once this one is shipped 🙏 Great work on this @fmejia97 ❤️

@aledbf
Copy link
Member

aledbf commented Jun 14, 2018

Let's merge this, it's already too big.

@aledbf
Copy link
Member

aledbf commented Jun 14, 2018

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 14, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, fmejia97

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 14, 2018
@aledbf
Copy link
Member

aledbf commented Jun 14, 2018

@fmejia97 @andrewlouis93 I will work in the use of unix sockets instead of UDP to send the metrics

@k8s-ci-robot k8s-ci-robot merged commit c9a0c90 into kubernetes:master Jun 14, 2018
@aledbf aledbf moved this from In progress to Done in 0.16.0 Jun 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
No open projects
0.16.0
  
Done
9 participants