
Add Metrics Logging for Kruize Recommendations #1206

Merged
merged 7 commits into from
Jun 28, 2024

Conversation

msvinaykumar
Contributor

Pull Request Description: Add Metrics Logging for Kruize Recommendations

Summary

This pull request introduces the KruizeNotificationCollectionRegistry class, which is responsible for logging and creating metrics for notifications related to Kruize recommendations. The class processes recommendation notifications at various levels (container, timestamp, term, and model) and creates appropriate counters using Micrometer.

Key Features

  • Class KruizeNotificationCollectionRegistry: This new class handles the collection and logging of recommendation notifications.
  • Constructor: Initializes the class with experiment name, interval end time, and container name.
  • Method logNotification: Logs notifications from ContainerData by iterating through its recommendation structure and creating counters.
  • Method createCounterTag: Creates a counter with tags for the given level, term, model, and list of recommendation notifications.

Detailed Description

  1. Class KruizeNotificationCollectionRegistry:

    • This class is introduced to streamline the process of logging recommendation notifications and creating metrics.
    • It holds information about the experiment name, interval end time, and container name, which are essential for tagging metrics.
  2. Constructor:

    • Initializes the object with necessary parameters: experiment_name, interval_end_time, and container_name.
  3. Method logNotification:

    • Accepts a ContainerData object as input.
    • Iterates through the nested structure of recommendations within the ContainerData.
    • For each level (container, timestamp, term, model), it collects notifications and calls createCounterTag.
  4. Method createCounterTag:

    • Accepts parameters such as level, term, model, and a collection of RecommendationNotification objects.
    • Checks if the notification type is configured to be logged based on KruizeDeploymentInfo.log_recommendation_metrics_level.
    • Creates additional tags using the provided information and formats them according to KruizeConstants.KRUIZE_RECOMMENDATION_METRICS.
    • Finds or creates a counter for the metric and increments it.
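The flow described above can be sketched in a self-contained form. The sketch below substitutes a plain `ConcurrentHashMap` for Micrometer's `MeterRegistry`, and the class and method names are illustrative assumptions based on this description, not the actual Kruize source:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative stand-in for the Micrometer-backed notification registry.
public class NotificationCounterSketch {
    // One counter per unique tag combination, keyed by a joined tag string.
    private static final Map<String, AtomicLong> COUNTERS = new ConcurrentHashMap<>();

    // Mirrors the createCounterTag idea: find-or-create a counter for the
    // (experiment, level, term, model, type) tag set, then increment it.
    public static long count(String experiment, String level, String term,
                             String model, String type) {
        String key = String.join("|", experiment, level, term, model, type);
        return COUNTERS.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
    }

    public static void main(String[] args) {
        count("exp-1", "container", "short_term", "cost", "error");
        long n = count("exp-1", "container", "short_term", "cost", "error");
        // The same tag set incremented twice yields a counter value of 2,
        // which is also the behavior discussed in the review below when
        // generateRecommendations runs twice for the same interval.
        System.out.println(n);
    }
}
```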

Benefits

  • Enhanced Observability: By logging metrics for recommendations, this feature improves the observability and monitoring of Kruize recommendations.
  • Granular Metrics: Metrics are logged at various levels, providing detailed insights into different aspects of the recommendation process.
  • Configurable Logging: Only logs notifications that match the configured logging levels, ensuring flexibility and control over what gets logged.
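The "Configurable Logging" behavior can be sketched as a simple level filter. The parsing and matching details below (a comma-separated level list) are assumptions for illustration, not taken from the Kruize code:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative gate mirroring the check described above: a notification type
// is logged only if it appears in the configured level list (for example,
// a KruizeDeploymentInfo.log_recommendation_metrics_level value of
// "error,critical"). The format of the setting is assumed here.
public class MetricsGateSketch {
    public static boolean shouldLog(String configuredLevels, String notificationType) {
        List<String> allowed =
                Arrays.asList(configuredLevels.toLowerCase().split("\\s*,\\s*"));
        return allowed.contains(notificationType.toLowerCase());
    }
}
```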

Notes

  • Ensure that the necessary dependencies for Micrometer and other related utilities are available in the project.
  • This PR addresses the need for detailed metrics in Kruize recommendations, aiding in performance monitoring and debugging.

Testing

  • Thorough testing should be conducted to ensure that the metrics are correctly logged at each level.
  • Verify that the counters are created and incremented accurately based on the incoming notifications.
  • Ensure that the tags are properly formatted and include all relevant information.

Related Issues

  • References to any related issues or enhancement requests can be mentioned here.

Please review the changes and provide feedback. Your input is valuable to ensure that this feature integrates seamlessly and functions as expected.

Test image: quay.io/vinakuma/autotune_operator:metrics

```java
if (counterNotifications == null) {
    counterNotifications = MetricsConfig.timerBKruizeNotifications.tags(additionalTags).register(MetricsConfig.meterRegistry);
}
counterNotifications.increment();
```
Contributor

What happens if generateRecommendations is called twice for the same time interval and experiment?

Contributor Author

The counter gets incremented again.

Contributor

@kusumachalasani left a comment

It would be good to measure the overhead of the notifications in the updateRecommendations API with a scalability run.

```java
@@ -58,6 +61,8 @@ private MetricsConfig() {
timerBListDS = Timer.builder("kruizeAPI").description(API_METRIC_DESC).tag("api", "listDataSources").tag("method", "GET");
timerBImportDSMetadata = Timer.builder("kruizeAPI").description(API_METRIC_DESC).tag("api", "importDataSourceMetadata").tag("method", "POST");
timerBImportDSMetadata = Timer.builder("kruizeAPI").description(API_METRIC_DESC).tag("api", "importDataSourceMetadata").tag("method", "GET");

timerBKruizeNotifications = Counter.builder("KruizeNotifications").description("Kruize notifications").tag("api","updaterecommendations");
```
Contributor

Change 'updaterecommendations' to 'updateRecommendations' to follow the camelCase convention used by all the other methods. It would also be good to make the description value a constant, to maintain consistency across the file.

Contributor Author

done

@msvinaykumar
Contributor Author

> Would be good to include the overhead of the notifications in updateRecommendations API with a scalability run.

I agree, @chandrams, we might need a short scalability run for this. But please ensure each experiment creates at least one error or critical notification.

@msvinaykumar
Contributor Author

@chandrams this sample results json creates some error notifications:

https://privatebin.corp.redhat.com/?3b171fbe1bbb3244#8ZnjimR1QbfKUksj9qGZegAGCPMJUkhSFdidVvrmV3gv

```java
@@ -28,6 +28,7 @@
import com.autotune.common.k8sObjects.K8sObject;
import com.autotune.common.utils.CommonUtils;
import com.autotune.database.service.ExperimentDBService;
import com.autotune.metrics.KruizeNotificationCollectionRegistry;
```
Contributor

@msvinaykumar - Please update kruize documentation with the metrics being captured

Contributor

@msvinaykumar Can you update the documentation with the details? As discussed, we will need this to update the kruize metrics script before running the scale test.

Contributor Author

done

@chandrams
Contributor

@msvinaykumar - Updated the kruize metrics script to capture the notifications and I have triggered a short scalability run with the new image that you provided - quay.io/vinakuma/autotune_operator:metrics2 and the results json that you shared.

@chandrams
Contributor

@msvinaykumar - The scalability 5k / 15 days run took 3 hrs 16 mins, which is less than the scale test run on the same cluster with 0.0.22_mvp, which took 3 hrs 50 mins. Does your build contain all the latest changes along with this PR?

Summary of the test run:

```
exp_count / results_count / reco_count = 5000 / 7200000 / 300000
Postgres DB size in MB = 21767
Results count - 7200000
total_kruizeMetrics-20.csv
Update Reco Latency Max / Avg value: 0.61 / 0.39
Update Results Latency Max / Avg value: 0.13 / 0.11
LoadResultsByExpName Latency Max / Avg value: 0.2 / 0.16
Generate Plots Latency Max / Avg value: 0.0 / 0.0
Kruize memory Max value: 33.11 GB
Kruize cpu Max value: 6.92
Execution time - 03:15:30
```

The logs have these errors:

```
scaletest250-2.log:psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  sorry, too many clients already
scaletest250-2.log:AN ERROR OCCURED: too many values to unpack (expected 2)
```

Summary of the test run with 0.0.23_mvp on the same cluster:

```
exp_count / results_count / reco_count = 5000 / 7200000 / 300000
Postgres DB size in MB = 21760
python3 parse_metrics.py -d /home/jenkins/kruize_scale_test_results_0.0.23_mvp_5k_15days/remote-monitoring-scale-test-202406262204/results -r 7200000
Directory path - /home/jenkins/kruize_scale_test_results_0.0.23_mvp_5k_15days/remote-monitoring-scale-test-202406262204/results
Results count - 7200000
total_kruizeMetrics-20.csv
Update Reco Latency Max / Avg value: 0.63 / 0.39
Update Results Latency Max / Avg value: 0.24 / 0.17
LoadResultsByExpName Latency Max / Avg value: 0.35 / 0.25
Generate Plots Latency Max / Avg value: 0.0 / 0.0
Kruize memory Max value: 32.59 GB
Kruize cpu Max value: 4.52
Execution time - 03:51:01
```

@msvinaykumar
Contributor Author

@chandrams - Can you please confirm the kruizeRecommendation_total counts?

@msvinaykumar
Contributor Author

This error occurs when the PostgreSQL server has reached the maximum number of allowed client connections. We can increase the max_connections setting in the PostgreSQL configuration, or optimize the application to use fewer connections. However, we can ignore this error because there is no data loss (exp_count / results_count / reco_count = 5000 / 7200000 / 300000), so we can treat it as a warning.
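For reference, the connection cap that this error refers to is controlled by the `max_connections` setting in `postgresql.conf`. The value below is purely illustrative, not a recommendation for this cluster:

```
# postgresql.conf -- illustrative only; size this to available memory,
# or put a connection pooler (e.g. PgBouncer) in front of the server
max_connections = 200
```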

@msvinaykumar
Contributor Author

The build includes this PR change. Please confirm that we have generated enough KruizeRecommendations metrics.

@chandrams
Contributor

total_kruizeMetrics-20.csv

@msvinaykumar - You can check the last column in this spreadsheet; I was expecting values to be present for all entries, but they stopped after a while.

…mmendations.

Signed-off-by: msvinaykumar <vinakuma@redhat.com>
@msvinaykumar
Contributor Author

> total_kruizeMetrics-20.csv
>
> @msvinaykumar - You can check the last column in this spreadsheet, was expecting values to be present for all entries but it stopped after a while.

This looks good. The count is over 500k, so the idea is to generate more notifications without impacting execution time. Based on the results, performance is unaffected, so we're good to proceed. We also have a flag to disable it just in case any issues arise.

Contributor

@dinogun left a comment

LGTM

@dinogun dinogun merged commit 7045682 into kruize:mvp_demo Jun 28, 2024
2 of 3 checks passed