
Add Metrics Logging for Kruize Recommendations #1206

Merged
merged 7 commits into from
Jun 28, 2024

Conversation

msvinaykumar
Contributor

Pull Request Description: Add Metrics Logging for Kruize Recommendations

Summary

This pull request introduces the KruizeNotificationCollectionRegistry class, which is responsible for logging and creating metrics for notifications related to Kruize recommendations. The class processes recommendation notifications at various levels (container, timestamp, term, and model) and creates appropriate counters using Micrometer.

Key Features

  • Class KruizeNotificationCollectionRegistry: This new class handles the collection and logging of recommendation notifications.
  • Constructor: Initializes the class with experiment name, interval end time, and container name.
  • Method logNotification: Logs notifications from ContainerData by iterating through its recommendation structure and creating counters.
  • Method createCounterTag: Creates a counter with tags for the given level, term, model, and list of recommendation notifications.

Detailed Description

  1. Class KruizeNotificationCollectionRegistry:

    • This class is introduced to streamline the process of logging recommendation notifications and creating metrics.
    • It holds information about the experiment name, interval end time, and container name, which are essential for tagging metrics.
  2. Constructor:

    • Initializes the object with necessary parameters: experiment_name, interval_end_time, and container_name.
  3. Method logNotification:

    • Accepts a ContainerData object as input.
    • Iterates through the nested structure of recommendations within the ContainerData.
    • For each level (container, timestamp, term, model), it collects notifications and calls createCounterTag.
  4. Method createCounterTag:

    • Accepts parameters such as level, term, model, and a collection of RecommendationNotification objects.
    • Checks if the notification type is configured to be logged based on KruizeDeploymentInfo.log_recommendation_metrics_level.
    • Creates additional tags using the provided information and formats them according to KruizeConstants.KRUIZE_RECOMMENDATION_METRICS.
    • Finds or creates a counter for the metric and increments it.
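The flow described above can be sketched in a self-contained form. The sketch below substitutes a plain `ConcurrentHashMap` for Micrometer's `MeterRegistry`, and the class and method names are illustrative assumptions based on this description, not the actual Kruize source:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative stand-in for the Micrometer-backed notification registry.
public class NotificationCounterSketch {
    // One counter per unique tag combination, keyed by a joined tag string.
    private static final Map<String, AtomicLong> COUNTERS = new ConcurrentHashMap<>();

    // Mirrors the createCounterTag idea: find-or-create a counter for the
    // (experiment, level, term, model, type) tag set, then increment it.
    public static long count(String experiment, String level, String term,
                             String model, String type) {
        String key = String.join("|", experiment, level, term, model, type);
        return COUNTERS.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
    }

    public static void main(String[] args) {
        count("exp-1", "container", "short_term", "cost", "error");
        long n = count("exp-1", "container", "short_term", "cost", "error");
        // The same tag set incremented twice yields a counter value of 2,
        // which is also the behavior discussed in the review below when
        // generateRecommendations runs twice for the same interval.
        System.out.println(n);
    }
}
```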

Benefits

  • Enhanced Observability: By logging metrics for recommendations, this feature improves the observability and monitoring of Kruize recommendations.
  • Granular Metrics: Metrics are logged at various levels, providing detailed insights into different aspects of the recommendation process.
  • Configurable Logging: Only logs notifications that match the configured logging levels, ensuring flexibility and control over what gets logged.
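The "Configurable Logging" behavior can be sketched as a simple level filter. The parsing and matching details below (a comma-separated level list) are assumptions for illustration, not taken from the Kruize code:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative gate mirroring the check described above: a notification type
// is logged only if it appears in the configured level list (for example,
// a KruizeDeploymentInfo.log_recommendation_metrics_level value of
// "error,critical"). The format of the setting is assumed here.
public class MetricsGateSketch {
    public static boolean shouldLog(String configuredLevels, String notificationType) {
        List<String> allowed =
                Arrays.asList(configuredLevels.toLowerCase().split("\\s*,\\s*"));
        return allowed.contains(notificationType.toLowerCase());
    }
}
```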

Notes

  • Ensure that the necessary dependencies for Micrometer and other related utilities are available in the project.
  • This PR addresses the need for detailed metrics in Kruize recommendations, aiding in performance monitoring and debugging.

Testing

  • Thorough testing should be conducted to ensure that the metrics are correctly logged at each level.
  • Verify that the counters are created and incremented accurately based on the incoming notifications.
  • Ensure that the tags are properly formatted and include all relevant information.

Related Issues

  • References to any related issues or enhancement requests can be mentioned here.

Please review the changes and provide feedback. Your input is valuable to ensure that this feature integrates seamlessly and functions as expected.

Test image: quay.io/vinakuma/autotune_operator:metrics

```java
if (counterNotifications == null) {
    counterNotifications = MetricsConfig.timerBKruizeNotifications.tags(additionalTags).register(MetricsConfig.meterRegistry);
}
counterNotifications.increment();
```
Contributor

What happens if generateRecommendations is called twice for the same time interval and experiment?

Contributor Author

The counter gets incremented again.

Contributor

@kusumachalasani left a comment

It would be good to measure the overhead of the notifications in the updateRecommendations API with a scalability run.

```java
@@ -58,6 +61,8 @@ private MetricsConfig() {
timerBListDS = Timer.builder("kruizeAPI").description(API_METRIC_DESC).tag("api", "listDataSources").tag("method", "GET");
timerBImportDSMetadata = Timer.builder("kruizeAPI").description(API_METRIC_DESC).tag("api", "importDataSourceMetadata").tag("method", "POST");
timerBImportDSMetadata = Timer.builder("kruizeAPI").description(API_METRIC_DESC).tag("api", "importDataSourceMetadata").tag("method", "GET");

timerBKruizeNotifications = Counter.builder("KruizeNotifications").description("Kruize notifications").tag("api","updaterecommendations");
```
Contributor

Change 'updaterecommendations' to 'updateRecommendations' to follow the camelCase convention used by all the other methods. It would also be good to make the description value a constant, to maintain consistency across the file.

Contributor Author

done

@msvinaykumar
Contributor Author

> Would be good to include the overhead of the notifications in updateRecommendations API with a scalability run.

I agree, @chandrams, we might need a short scalability run for this. But please ensure each experiment creates at least one error or critical notification.

@msvinaykumar
Contributor Author

@chandrams this sample results json creates some error notifications:

https://privatebin.corp.redhat.com/?3b171fbe1bbb3244#8ZnjimR1QbfKUksj9qGZegAGCPMJUkhSFdidVvrmV3gv

```java
@@ -28,6 +28,7 @@
import com.autotune.common.k8sObjects.K8sObject;
import com.autotune.common.utils.CommonUtils;
import com.autotune.database.service.ExperimentDBService;
import com.autotune.metrics.KruizeNotificationCollectionRegistry;
```
Contributor

@msvinaykumar - Please update kruize documentation with the metrics being captured

Contributor

@msvinaykumar Can you update the documentation with the details? As discussed, we will need this to update the kruize metrics script before running the scale test.

Contributor Author

done

@chandrams
Contributor

@msvinaykumar - Updated the kruize metrics script to capture the notifications and I have triggered a short scalability run with the new image that you provided - quay.io/vinakuma/autotune_operator:metrics2 and the results json that you shared.

@chandrams
Contributor

@msvinaykumar - The scalability 5k / 15 days run took 3 hrs 16 mins, which is less than the scale test run on the same cluster with 0.0.22_mvp, which took 3 hrs 50 mins. Does your build contain all the latest changes along with this PR?

Summary of the test run:

```
exp_count / results_count / reco_count = 5000 / 7200000 / 300000
Postgres DB size in MB = 21767
Results count - 7200000
total_kruizeMetrics-20.csv
Update Reco Latency Max / Avg value: 0.61 / 0.39
Update Results Latency Max / Avg value: 0.13 / 0.11
LoadResultsByExpName Latency Max / Avg value: 0.2 / 0.16
Generate Plots Latency Max / Avg value: 0.0 / 0.0
Kruize memory Max value: 33.11 GB
Kruize cpu Max value: 6.92
Execution time - 03:15:30
```

The logs have these errors:

```
scaletest250-2.log:psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  sorry, too many clients already
scaletest250-2.log:AN ERROR OCCURED: too many values to unpack (expected 2)
```

Summary of the test run with 0.0.23_mvp on the same cluster:

```
exp_count / results_count / reco_count = 5000 / 7200000 / 300000
Postgres DB size in MB = 21760
python3 parse_metrics.py -d /home/jenkins/kruize_scale_test_results_0.0.23_mvp_5k_15days/remote-monitoring-scale-test-202406262204/results -r 7200000
Directory path - /home/jenkins/kruize_scale_test_results_0.0.23_mvp_5k_15days/remote-monitoring-scale-test-202406262204/results
Results count - 7200000
total_kruizeMetrics-20.csv
Update Reco Latency Max / Avg value: 0.63 / 0.39
Update Results Latency Max / Avg value: 0.24 / 0.17
LoadResultsByExpName Latency Max / Avg value: 0.35 / 0.25
Generate Plots Latency Max / Avg value: 0.0 / 0.0
Kruize memory Max value: 32.59 GB
Kruize cpu Max value: 4.52
Execution time - 03:51:01
```

@msvinaykumar
Contributor Author

@chandrams - Can you please confirm the kruizeRecommendation_total counts?

@msvinaykumar
Contributor Author

This error occurs when the PostgreSQL server has reached the maximum number of allowed client connections. We can increase the max_connections setting in the PostgreSQL configuration, or optimize the application to use fewer connections. However, we can ignore this error because there is no data loss (exp_count / results_count / reco_count = 5000 / 7200000 / 300000), so we can treat it as a warning.
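For reference, the connection cap that this error refers to is controlled by the `max_connections` setting in `postgresql.conf`. The value below is purely illustrative, not a recommendation for this cluster:

```
# postgresql.conf -- illustrative only; size this to available memory,
# or put a connection pooler (e.g. PgBouncer) in front of the server
max_connections = 200
```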

@msvinaykumar
Contributor Author

The build includes this PR change. Please confirm that we have generated enough KruizeRecommendations metrics.

@chandrams
Contributor

total_kruizeMetrics-20.csv

@msvinaykumar - You can check the last column in this spreadsheet; I was expecting values to be present for all entries, but they stopped after a while.

…mmendations.

Signed-off-by: msvinaykumar <vinakuma@redhat.com>
@msvinaykumar
Contributor Author

> total_kruizeMetrics-20.csv
>
> @msvinaykumar - You can check the last column in this spreadsheet, was expecting values to be present for all entries but it stopped after a while.

This looks good. The count is over 500k, so the idea is to generate more notifications without impacting execution time. Based on the results, performance is unaffected, so we're good to proceed. We also have a flag to disable it just in case any issues arise.

Contributor

@dinogun left a comment

LGTM

@dinogun dinogun merged commit 7045682 into kruize:mvp_demo Jun 28, 2024
2 of 3 checks passed