Add support to perform heapDump on exceeded memory limit failures #16669

Merged
merged 1 commit into from Sep 15, 2021

Contributor

@pgupta2 pgupta2 commented Aug 31, 2021

We don't have any easy way to get heapdump on exceeded_memory_limit
failures. Having an ability to trigger a heapdump in such cases will
greatly improve debugging experiences around query OOMs.

Test plan - Manually ran queries with the session property enabled; the heap snapshot was dumped into the specified file.

== RELEASE NOTES ==

General Changes
* Add a new session property `heap_dump_on_exceeded_memory_limit_enabled` to enable a heap dump on exceeded memory limit failures. The heap dump file directory can be provided using the `exceeded_memory_limit_heap_dump_file_directory` session property.

@pgupta2 pgupta2 requested a review from aweisberg Aug 31, 2021
Member

@aweisberg aweisberg left a comment

One nit regarding the name of the option. Otherwise LGTM. Thank you!

return new ExceededMemoryLimitException(EXCEEDED_LOCAL_MEMORY_LIMIT,
format("Query exceeded per-node total memory limit of %s [%s]", maxMemory, additionalFailureInfo));
}

public static ExceededMemoryLimitException exceededLocalRevocableMemoryLimit(DataSize maxMemory, String additionalFailureInfo)
public static ExceededMemoryLimitException exceededLocalRevocableMemoryLimit(
Member

@aweisberg aweisberg Aug 31, 2021

It is borderline whether doing this for revocable memory is as useful, but since this is a debug option that has to be enabled explicitly, it should be fine; exceeding the revocable limit is pretty rare.

@@ -204,6 +204,8 @@
public static final String MATERIALIZED_VIEW_DATA_CONSISTENCY_ENABLED = "materialized_view_data_consistency_enabled";
public static final String QUERY_OPTIMIZATION_WITH_MATERIALIZED_VIEW_ENABLED = "query_optimization_with_materialized_view_enabled";
public static final String AGGREGATION_IF_TO_FILTER_REWRITE_ENABLED = "aggregation_if_to_filter_rewrite_enabled";
public static final String HEAP_DUMP_ON_EXCEEDED_MEMORY_LIMIT_ENABLED = "heap_dump_on_exceeded_memory_limit_enabled";
public static final String HEAP_DUMP_FILE_PATH = "heap_dump_file_path";
Member

@aweisberg aweisberg Aug 31, 2021

This is specific to exceeding the memory limit so the name should probably reflect that.

private boolean heapDumpOnExceededMemoryLimitEnabled;

@GuardedBy("this")
private String heapDumpFilePath;
Member

@aweisberg aweisberg Aug 31, 2021

I'm fine with this being null given the specialized nature of the whole thing, but if someone picky comes along to commit they might want this to be Optional.

@aweisberg aweisberg requested a review from highker Aug 31, 2021
@pgupta2 pgupta2 force-pushed the heap_dumper branch 2 times, most recently from 446d8eb to 4e7fc30 Aug 31, 2021
@highker highker requested a review from souravpal Sep 1, 2021
return new ExceededMemoryLimitException(
EXCEEDED_REVOCABLE_MEMORY_LIMIT,
format("Query exceeded per-node revocable memory limit of %s [%s]", maxMemory, additionalFailureInfo));
}

private static void performHeapDumpIfEnabled(boolean heapDumpOnExceededMemoryLimitEnabled, Optional<String> heapDumpFilePath)
{
if (heapDumpOnExceededMemoryLimitEnabled && heapDumpFilePath.isPresent()) {
Contributor

@souravpal souravpal Sep 1, 2021

Does it make sense to check whether the heapDumpFilePath is a valid path?

Member

@aweisberg aweisberg Sep 1, 2021

That is a good point!

This is a session property per query and not a configuration at startup. It will fail if the path is bad here anyways, but it might make sense to do something like https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/spiller/FileSingleStreamSpillerFactory.java#L101 ?

I am torn on whether the query should fail fast at startup if the file can't be created vs just letting it fail to create the heap dump. It's a bit finicky because we would need to check for each query whether the file can be created.

This is for debug so the temptation is to just do what is simple.
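
If a cheap up-front check were wanted, one possible shape is sketched below. This is only an illustration: `HeapDumpPathCheck` and `canCreateFile` are hypothetical names, not code from this PR. The check is advisory at best, since the directory can still disappear or fill up before the dump is actually written.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of this PR.
final class HeapDumpPathCheck
{
    private HeapDumpPathCheck() {}

    // Returns true if the parent directory of filePath exists and is writable,
    // i.e. creating the heap dump file there has a reasonable chance of succeeding.
    static boolean canCreateFile(String filePath)
    {
        try {
            Path parent = Paths.get(filePath).toAbsolutePath().getParent();
            return parent != null && Files.isDirectory(parent) && Files.isWritable(parent);
        }
        catch (InvalidPathException e) {
            // The string is not a syntactically valid path on this platform
            return false;
        }
    }
}
```

Whether a failed check should fail the query or merely log a warning is the open question from the comment above; the sketch only reports the result.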

Contributor Author

@pgupta2 pgupta2 Sep 1, 2021

I did think of adding some kind of valid-path check but did not pursue it, since this is more of a debugging feature. If heapDump is enabled and heapDumpPath is not specified, a tmp path is automatically constructed for the heap dump. Users can override heapDumpPath if needed, and they are expected to provide a valid path.

Also, the heap dump is a best-effort feature and will not fail the query on exceptions. If any exception is thrown, it will be swallowed and query execution will fail with the proper EXCEEDED_MEMORY_LIMIT exception.

Contributor

@souravpal souravpal left a comment

I suspect this will be one of the most useful debug features. Ship it!

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.facebook.presto;
Contributor

@highker highker Sep 1, 2021

let's move this to com.facebook.presto.util;

*
* @param fileName name of the heap dump file
*/
public static synchronized void dumpHeap(String fileName)
Contributor

@highker highker Sep 1, 2021

do we need synchronized method if we use AtomicBoolean for isHeapDumpTriggered?

Contributor Author

@pgupta2 pgupta2 Sep 2, 2021

If we use AtomicBoolean, then we need to first set isHeapDumpTriggered and then trigger the heap dump. While the heapDump is in progress, other threads will skip this logic and throw the exceeded memory exception immediately, which should be OK.
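
For illustration only, the AtomicBoolean variant described above could look roughly like the sketch below. It is not the code from this PR: `OneShotHeapDumper`, `HeapDumpAction`, and `runOnce` are hypothetical names. `compareAndSet` lets exactly one thread claim the dump, and any failure is logged and swallowed so the caller can still throw the exceeded-memory exception.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not the HeapDumper class from this PR.
final class OneShotHeapDumper
{
    // Hypothetical functional interface so the dump step may throw checked exceptions.
    interface HeapDumpAction
    {
        void run() throws Exception;
    }

    private static final AtomicBoolean TRIGGERED = new AtomicBoolean(false);

    private OneShotHeapDumper() {}

    // Dumps the heap to fileName at most once per JVM; later callers return false immediately.
    static boolean dumpHeapOnce(String fileName)
    {
        return runOnce(() -> {
            HotSpotDiagnosticMXBean bean = ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(fileName, true); // true = dump only live (reachable) objects
        });
    }

    // Runs the action only for the first caller; returns whether this caller ran it.
    static boolean runOnce(HeapDumpAction action)
    {
        // Flip false -> true atomically; every thread except the winner skips the dump.
        if (!TRIGGERED.compareAndSet(false, true)) {
            return false;
        }
        try {
            action.run();
        }
        catch (Throwable throwable) {
            // Best effort: never let a failed dump mask the original memory-limit failure.
            System.err.println("Unable to perform heap dump: " + throwable);
        }
        return true;
    }
}
```

Because the flag is set before the dump starts and never reset, a failed attempt also consumes the single chance, which matches the "only once per process" behaviour discussed in this thread.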

HotSpotDiagnosticMXBean bean =
ManagementFactory.newPlatformMXBeanProxy(server,
HOTSPOT_BEAN_NAME, HotSpotDiagnosticMXBean.class);
return bean;
Contributor

@highker highker Sep 1, 2021

inline bean

return ManagementFactory.newPlatformMXBeanProxy(server, HOTSPOT_BEAN_NAME, HotSpotDiagnosticMXBean.class);

isHeapDumpTriggered = true;
}
catch (Throwable throwable) {
// Consume the error as we dont want to fail during heapdump
Contributor

@highker highker Sep 1, 2021

nit: s/dont/do not

performHeapDumpIfEnabled(heapDumpOnExceededMemoryLimitEnabled, heapDumpFilePath);
return new ExceededMemoryLimitException(EXCEEDED_LOCAL_MEMORY_LIMIT,
Contributor

@highker highker Sep 1, 2021

It's kinda weird to have a dump inside an exception and to do it synchronously... Can we instead throw the error at the callsite when memory is exceeded, and use an executor to do the dump asynchronously?

Contributor Author

@pgupta2 pgupta2 Sep 2, 2021

This is basically a catch-all method for any exceeded-memory OOM. If we add this logic at the callsites, it will be spread throughout the code and might not look clean. A Presto exceeded-memory-limit failure is like an OOM, so taking a heap dump here makes the most sense for debugging purposes.

Also, this is a debugging feature and will be disabled in prod. I didn't want to over-engineer the solution, since this logic will only be triggered in a controlled environment while testing a specific query for OOM failures.

Member

@aweisberg aweisberg Sep 2, 2021

It's important that the heap dump be triggered synchronously with the out-of-memory error, because the purpose of the entire thing is to capture the state of the heap when the error occurs. If it's asynchronous, it allows time for queries to fail and cleanup to occur, both in this query and in other queries that are running concurrently.

WDYT?

Contributor

@highker highker Sep 2, 2021

If that is the case, shall we add comments to this method (and others in this class) to indicate that this is needed and only for debugging purposes?

HEAP_DUMP_ON_EXCEEDED_MEMORY_LIMIT_ENABLED,
"Trigger heap dump to `EXCEEDED_MEMORY_LIMIT_HEAP_DUMP_FILE_PATH` on exceeded memory limit exceptions",
false,
false),
Contributor

@highker highker Sep 1, 2021

we probably don't wanna expose this session property to users..... let's make it hidden; same for the other one

booleanProperty(
HEAP_DUMP_ON_EXCEEDED_MEMORY_LIMIT_ENABLED,
"Trigger heap dump to `EXCEEDED_MEMORY_LIMIT_HEAP_DUMP_FILE_PATH` on exceeded memory limit exceptions",
false,
Contributor

@highker highker Sep 1, 2021

for the default value, we usually specify it in a config; for this case, featuresConfig would be a good one. Check the other examples in this class.

Contributor Author

@pgupta2 pgupta2 Sep 2, 2021

I explicitly did not add a Config property for this since we don't expect it to be enabled in prod. This feature will only be enabled by session property when specified. Is it always the case that we should specify a corresponding config property for every session property?

Member

@aweisberg aweisberg Sep 2, 2021

When I considered it I decided that the way it is structured it doesn't make much sense as a config option. It's a pretty niche usage and it will only dump the heap once per process (there is a guard in heap dumper).

I'm not sure how applicable this is for production clusters given that restriction. It seems like a good restriction just because it limits the blast radius of accidentally enabling the option and blasting it at a production cluster (yikes!).

Just thinking about it makes me think we actually want to prohibit this by default unless the cluster is started with an option permitting it that can't be set via session property (so features config).

Contributor

@highker highker Sep 2, 2021

Also, add a comment to these two places to explain the rationale why we don't want a config

EXCEEDED_MEMORY_LIMIT_HEAP_DUMP_FILE_PATH,
"File to which heap snapshot will be dumped, if heap_dump_on_exceeded_memory_limit_enabled",
System.getProperty("java.io.tmpdir") + "heap_dump.hprof",
Contributor

@highker highker Sep 1, 2021

probably we wanna specify the directory path instead of the file name. Then we can encode our own file name like <query_id>_<stage_id>.hprof

Contributor Author

@pgupta2 pgupta2 Sep 2, 2021

The idea was to keep this simple, but this suggestion also sounds good. Let's do this.
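
A rough sketch of the directory-plus-generated-name idea, under stated assumptions: `HeapDumpFileNames` and `heapDumpFile` are hypothetical names, and the `<query_id>_<stage_id>.hprof` pattern is taken from the suggestion above rather than from the merged code. Building the path with `Paths.get` also avoids relying on `java.io.tmpdir` ending in a separator, which it does not always do.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of this PR.
final class HeapDumpFileNames
{
    private HeapDumpFileNames() {}

    // Builds <directory>/<queryId>_<stageId>.hprof using the platform's path separator.
    static Path heapDumpFile(String directory, String queryId, int stageId)
    {
        return Paths.get(directory, queryId + "_" + stageId + ".hprof");
    }
}
```

For example, `HeapDumpFileNames.heapDumpFile(System.getProperty("java.io.tmpdir"), queryId, stageId)` would place each dump under the temp directory with a per-query, per-stage name.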

queryContext.setHeapDumpOnExceededMemoryLimitEnabled(isHeapDumpOnExceededMemoryLimitEnabled(session));
queryContext.setHeapDumpFilePath(getHeapDumpFilePath(session));
Contributor

@highker highker Sep 1, 2021

Can we make the two fields immutable and pass them in through the constructor? We will have a lot of tasks per query, and it's not necessary to set them every time.

Contributor Author

@pgupta2 pgupta2 Sep 2, 2021

I asked a similar question in the past on another PR, and got this response on why we should not pass it in the constructor: #16297 (comment)

Member

@aweisberg aweisberg Sep 2, 2021

I ran into the same thing. https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/execution/SqlTaskManager.java#L409
QueryContext is constructed from a LoadingCache and when the LoadingCache is constructed there is no session. Nested loading caches in fact. I think there are races where the Task and associated QueryContext need to be retrieved before the session is available because messages for a task can arrive before the associated session information does.

Seems like we could simplify by having one setter with two params though?
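
A single setter with two parameters might look like the sketch below. `QueryContextSketch` and its method names are hypothetical stand-ins, not the real `QueryContext`; the point is only that both fields are updated under one lock, so a reader never sees the flag from one call combined with the path from another.

```java
import java.util.Optional;

// Hypothetical stand-in for the relevant part of QueryContext.
final class QueryContextSketch
{
    // Both fields are guarded by "this"
    private boolean heapDumpOnExceededMemoryLimitEnabled;
    private Optional<String> heapDumpFilePath = Optional.empty();

    // Sets both values atomically with respect to other synchronized methods.
    synchronized void setHeapDumpOnExceededMemoryLimit(boolean enabled, Optional<String> filePath)
    {
        this.heapDumpOnExceededMemoryLimitEnabled = enabled;
        this.heapDumpFilePath = filePath;
    }

    // A dump is attempted only when the feature is enabled and a path is present.
    synchronized boolean shouldDumpHeap()
    {
        return heapDumpOnExceededMemoryLimitEnabled && heapDumpFilePath.isPresent();
    }
}
```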

private boolean heapDumpOnExceededMemoryLimitEnabled;

@GuardedBy("this")
private Optional<String> heapDumpFilePath;
Contributor

@highker highker Sep 1, 2021

they could be final if we get them from constructor

highker approved these changes Sep 2, 2021
private static final Logger log = Logger.get(HeapDumper.class);
private static final String HOTSPOT_BEAN_NAME = "com.sun.management:type=HotSpotDiagnostic";

private static AtomicBoolean isHeapDumpTriggered = new AtomicBoolean(false);
Contributor

@highker highker Sep 2, 2021

  • this could be private static final
  • call it IS_HEAPDUMP_TRIGGERED and move it closer to HOTSPOT_BEAN_NAME

catch (Throwable throwable) {
// Consume the error as we do not want to fail during heapdump
log.error(throwable, "Unable to perform heap dump");
}
Contributor

@highker highker Sep 2, 2021

Do we need a

finally {
    isHeapDumpTriggered.set(false);
}

Contributor Author

@pgupta2 pgupta2 Sep 14, 2021

We don't. We want to trigger the heap dump only once. As soon as the first thread sets the atomic boolean to true, no other thread should be able to trigger a heap dump after that.

@@ -204,6 +204,8 @@
public static final String MATERIALIZED_VIEW_DATA_CONSISTENCY_ENABLED = "materialized_view_data_consistency_enabled";
public static final String QUERY_OPTIMIZATION_WITH_MATERIALIZED_VIEW_ENABLED = "query_optimization_with_materialized_view_enabled";
public static final String AGGREGATION_IF_TO_FILTER_REWRITE_ENABLED = "aggregation_if_to_filter_rewrite_enabled";
public static final String HEAP_DUMP_ON_EXCEEDED_MEMORY_LIMIT_ENABLED = "heap_dump_on_exceeded_memory_limit_enabled";
public static final String EXCEEDED_MEMORY_LIMIT_HEAP_DUMP_FILE_DIR = "exceeded_memory_limit_heap_dump_file_dir";
Contributor

@highker highker Sep 2, 2021

nit: spell out directory and DIRECTORY

Contributor

@highker highker commented Sep 2, 2021

Could you update the RELEASE NOTES to reflect the right session property names?

@highker highker self-assigned this Sep 2, 2021
@highker highker merged commit ed2a648 into prestodb:master Sep 15, 2021
40 checks passed