
feat: add monitoring #674

Merged: 6 commits into main on May 27, 2022

Conversation

samsja (Contributor) commented Apr 5, 2022

WIP, blocked by jina-ai/jina#4526

@samsja samsja marked this pull request as draft April 5, 2022 13:43
@samsja samsja force-pushed the feat-monitoring branch 3 times, most recently from 10e8ac4 to 5fe72a5 on April 5, 2022 14:45
Comment on lines 14 to 15
monitoring: true
port_monitoring: 9000
Member

Having both monitoring and port_monitoring is redundant. If port_monitoring is set, then monitoring is enabled; otherwise it is disabled. Hence, monitoring is not necessary here.

Contributor Author

The issue is that there is always a default value passed to port_monitoring, so we do need another parameter to indicate whether monitoring is enabled or not. What I could do here is just use the default monitoring port so that the YAML does not contain redundant information.

Member

The default value can be None, am I right?
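For illustration, a minimal sketch of the suggestion above, assuming a hypothetical deployment constructor (names are illustrative, not the actual Jina signature): if port_monitoring defaults to None, its value alone can act as the on/off switch, making a separate monitoring flag unnecessary.

```python
from typing import Optional

class MonitoredDeployment:
    """Hypothetical constructor, for illustration only: monitoring is implied by the port."""

    def __init__(self, port_monitoring: Optional[int] = None):
        # no separate `monitoring` flag: a None port means monitoring stays disabled
        self.monitoring_enabled = port_monitoring is not None
        self.port_monitoring = port_monitoring

# With a non-None default port instead (the counterpoint above), an explicit on/off flag
# is still needed, which is what `monitoring: true` / `port_monitoring: 9000` expresses.
```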

self.summary_text = self.get_summary('text_preproc_second', 'Time to preprocess text')
self.summary_image = self.get_summary('image_preproc_second', 'Time to preprocess image')
self.summary_encode = self.get_summary('encode_second', 'Time to encode')
def get_summary(self, title, details):
Member

Can we use a global Summary context instance, which could be initialized in the BaseExecutor? Then the developer would not need to inject the above code into the executor implementation. For custom tracking names, we can borrow some ideas from the metric logger in TensorBoard, e.g.,

self.summary.log('text_docs', len(da), ...)
...
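For context, a rough sketch of what such a BaseExecutor-hosted logger could look like, backed by prometheus_client; MetricLogger and _metrics_registry are hypothetical names, not the actual Jina API.

```python
from prometheus_client import CollectorRegistry, Summary

class MetricLogger:
    """Hypothetical TensorBoard-style logger backed by Prometheus Summaries."""

    def __init__(self, registry: CollectorRegistry):
        self._registry = registry
        self._summaries = {}

    def log(self, name: str, value: float, documentation: str = ''):
        # lazily create one Summary per metric name, then record the observation
        if name not in self._summaries:
            self._summaries[name] = Summary(
                name, documentation or name, registry=self._registry)
        self._summaries[name].observe(value)


class BaseExecutor:
    def __init__(self, *args, **kwargs):
        # one registry per executor, exposed on the monitoring port by the runtime
        self._metrics_registry = CollectorRegistry()
        self.summary = MetricLogger(self._metrics_registry)
```

With something like this in place, an executor method would only need self.summary.log('text_docs', len(da)) to record a value.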

Contributor Author

There are already some base summaries initialized in the BaseExecutor, but preprocessing text and preprocessing images are specific to this executor, so they would always need to be defined here.
I take note of the TensorBoard idea for custom tracking names.

Member

I agree with Felix on this one. I'd prefer:

Best: no change to the executor at all. Injection happens inside the core. Think about Hub Executors: how could we change all of them, especially those that are not owned by us?
Okay: injection happens at @requests(..., monitor=True, summary=True).
Bad: many lines of change, like in the current PR. This is unacceptable to me, and also for all of our Hub Executors.

Contributor Author

I might have been unclear.

The core will support a whole set of metrics common to all Executors, like the latency on the gateway, head and worker side, and the time taken by the functions wrapped by requests. This will be common to all Executors, and not a single line of code will be needed on the Executor side.

On top of that, I want an extra feature that allows Executor developers to customize and enrich this monitoring by defining new metrics to expose themselves. This is highly relevant for clip-as-service, as I would like to know how much time is actually spent on GPU computation and how much is spent in text/image preprocessing. The idea of this extra feature is that you can monitor each sub-function called inside your request method. These things cannot be abstracted away, as they are not common to all Executors.
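To make the custom-metrics idea concrete, here is a minimal sketch of timing sub-functions with the raw prometheus_client; the metric names and the encode method are illustrative, not the actual clip_server code.

```python
from prometheus_client import CollectorRegistry, Summary

class CLIPEncoder:
    def __init__(self):
        # per-executor registry, assumed to be exposed on the monitoring port by the runtime
        self._registry = CollectorRegistry()
        self._preproc_text = Summary(
            'text_preproc_seconds', 'Time to preprocess text', registry=self._registry)
        self._encode_gpu = Summary(
            'encode_seconds', 'Time spent in model encoding', registry=self._registry)

    def encode(self, docs):
        # each sub-step of the request is timed with its own Summary
        with self._preproc_text.time():
            batch = self._preprocess_text(docs)
        with self._encode_gpu.time():
            embeddings = self._run_model(batch)
        return embeddings

    def _preprocess_text(self, docs):
        ...  # tokenization etc.

    def _run_model(self, batch):
        ...  # GPU forward pass
```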

@hanxiao hanxiao (Member) left a comment

This change is not acceptable, as I pointed out above.

samsja (Contributor Author) commented Apr 14, 2022

Summary of the sync between @numb3r3 and me:

We thought about three possible interfaces for exposing metrics from an Executor:

  • Accessing the raw prometheus client.
    • Con: users need to define the summaries themselves.
    • Pro: they can leverage the full power of the prometheus client. This solution will stay in any case, because the other two will rely on it.
  • Having a decorator to monitor private methods.
  • Having a pytorch-like interface to do the monitoring, e.g. `self.summary.log('text_docs', len(da), ...)`
    • Pro: easier to use, as we only need to add one line of code.
    • Con: we have to teach a new monitoring interface to the user.

I will create a draft feature for each proposition so that we can try them out with @numb3r3 and decide what we want to ship in the end.


@@ -52,6 +52,7 @@ def __init__(

self._pool = ThreadPool(processes=num_worker_preprocess)

@monitor('preproc_images_seconds','Time preprocessing images')
Member

Does the monitor decorator work with arbitrary functions, or does it only work with Executor functions (i.e., member functions of the executor) whose inputs and outputs are DocumentArrays?

Contributor Author

It works with any method of the Executor and does not require any particular signature. It still needs to be a method rather than an arbitrary function, though, as it needs access to the metrics registry, which is stored inside the Executor.
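For illustration, a minimal sketch of how such a decorator could be implemented on top of prometheus_client; this is an assumption about the shape of the implementation, not the actual Jina code, and _metrics_registry is a hypothetical attribute name.

```python
import functools
from prometheus_client import Summary

def monitor(name: str, documentation: str):
    """Hypothetical decorator: times an Executor method with a Prometheus Summary."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            # the Summary is attached to the executor's registry, which is why the
            # decorated callable must be a method rather than a free function
            attr = f'_summary_{name}'
            if not hasattr(self, attr):
                setattr(self, attr, Summary(
                    name, documentation, registry=self._metrics_registry))
            with getattr(self, attr).time():
                return func(self, *args, **kwargs)
        return wrapper
    return decorator
```

Under this sketch, the diff above reduces to a single line per sub-method, e.g. @monitor('preproc_images_seconds', 'Time preprocessing images').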

@samsja samsja force-pushed the feat-monitoring branch 2 times, most recently from 2d3c830 to 1b04c87 on May 16, 2022 08:05
@samsja samsja marked this pull request as ready for review May 16, 2022 08:05
@github-actions github-actions bot added size/s and removed size/xs labels May 26, 2022
codecov bot commented May 26, 2022

Codecov Report

Merging #674 (5f6c199) into main (e1acf9a) will increase coverage by 0.30%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #674      +/-   ##
==========================================
+ Coverage   80.56%   80.86%   +0.30%     
==========================================
  Files          16       16              
  Lines        1137     1155      +18     
==========================================
+ Hits          916      934      +18     
  Misses        221      221              
Flag   Coverage Δ
cas    80.86% <100.00%> (+0.30%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                   Coverage Δ
server/clip_server/executors/clip_onnx.py       84.50% <100.00%> (+1.43%) ⬆️
server/clip_server/executors/clip_tensorrt.py   92.85% <100.00%> (+0.85%) ⬆️
server/clip_server/executors/clip_torch.py      85.71% <100.00%> (+1.50%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@numb3r3 numb3r3 merged commit 60a986a into main May 27, 2022
@numb3r3 numb3r3 deleted the feat-monitoring branch May 27, 2022 06:38