-
Notifications
You must be signed in to change notification settings - Fork 54
Fix a leak where FileWatchers are added but never removed #36
Fix a leak where FileWatchers are added but never removed #36
Conversation
…expired from the cache, and some minor updates to privatize the cache by calling the public API in consumers Prior to this change, FileWatchers to track when a KNN index is deleted are added to the ResourceWatcherService but never removed even after indexes have been deleted, which causes the set of watchers to grow unbounded over time. This change updates the logic to clean up FileWatchers upon cache eviction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good Catch. thanks for the PR.
|
||
private static class KNNIndexCacheEntry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: java doc for class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add
private void onRemoval(RemovalNotification<String, KNNIndexCacheEntry> removalNotification) { | ||
KNNIndexCacheEntry knnIndexCacheEntry = removalNotification.getValue(); | ||
|
||
knnIndexCacheEntry.getFileWatcherHandle().stop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add try/finally block and move knnIndexCacheEntry.getFileWatcherHandle().stop();
to finally block to ensure the watcher is removed as file is deleted at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify what would be wrapped in the try{}? knnIndexCacheEntry.getFileWatcherHandle().stop()
should not throw - it just removes the watcher from the set, and doesn't declare any kind of exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to ensuring knnIndexCacheEntry.getFileWatcherHandle().stop()
being called when cache invalidate happens. some thing like this
try {
KNNIndexCacheEntry knnIndexCacheEntry = removalNotification.getValue();
executor.execute(() -> knnIndexCacheEntry.getKnnIndex().close());
} finally {
knnIndexCacheEntry.getFileWatcherHandle().stop();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a perceived difference in behavior here? Both should be called regardless - in the case of the PR, FileWatcher.stop
is called first, and should doesn't have an expected failure mode, and then we submit a task to gc
the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree there is no failure mode at this point. Ideally we want to ensure both gc() and stop() always get called in this function. We can defer this until we use functions that throws.
// the entry | ||
fileWatcher.init(); | ||
|
||
final KNNIndex knnIndex = KNNIndex.loadIndex(indexPathUrl, algoParams); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should first do resourceWatcherService.add
and then load the index.
resourceWatcherService.add
seems to already callinit()
on the filewatcher. So no explicit call to init()- this will also address the case of index not loaded if file is only present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify what you mean by the second case?
For the first case, the reason why I'm calling loadIndex
in between FileWatcher.init
and resourceWatcherService.add
is because I'm still not totally confident about the guarantee of no search threads executing while the file is deleted.
By calling init()
first, we know with certainty whether the file existed at the time we started loading the graph, and if it did not exist, we are guaranteed that loadIndex
will fail. If we also start the monitor before loading the index, then in the case that we're wrong about the guarantee mentioned above, we're effectively creating a multi-second window (depending on index loading time) for which the watcher could notify about a file delete, and invalidate() would cause a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense. Thanks
// TODO verify that this is safe - ideally we'd explicitly ensure that the FileWatcher is only checked | ||
// after the guava cache has finished loading the key to avoid a race condition where the watcher | ||
// causes us to invalidate an entry before the key has been fully loaded. | ||
final WatcherHandle<FileWatcher> watcherHandle = resourceWatcherService.add(fileWatcher); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think guava cache takes the lock when loading the entry for the key. So when invalidate happens, it should happen after the graph is loaded?
https://github.com/google/guava/blob/master/guava/src/com/google/common/cache/Cache.java#L91
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the following gist. Guava invalidate(key) does not block but is instead a no-op when executed if guava hasn't fully loaded the key into the cache: https://gist.github.com/jschmitz28/c8ff2d3b5856f625cf1f5a98095b7a51
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. You mean invalidating entry would not interrupt loading graph but no-op would make us miss gc()
on the graph. Agree this minimizes the risk of race condition assuming the index deletion happens during search.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the changes
src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNIndexCache.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks for the fix!
Description
Fix a leak where FileWatchers are added but never removed upon being expired from the cache, and some minor updates to privatize the cache by calling the public API in consumers
Prior to this change, FileWatchers to track when a KNN index is deleted are added to the ResourceWatcherService but never removed even after indexes have been deleted, which causes the set of watchers to grow unbounded over time. This change updates the logic to clean up FileWatchers upon cache eviction.