
trunk-tracking-shutdown: improve shutdown code #897

Closed
wants to merge 10 commits

Conversation

@oschaaf (Member) commented Feb 3, 2015

  • Add a test to make sure all logged output looks sane by whitelisting
    current errors/warnings.
  • Stop our nginx test instances after we are done testing.
  • Add tests for shutting down and reloading configuration under high
    load (depends on ab).
  • Reduce the number of keepalive requests in the keepalive tests to speed
    up test runs.
  • Fix exiting with open file descriptors and fix cleanup in nginx's cache
    manager/loader processes.
  • Attempt to finish queued-up NgxBaseFetches/requests on shutdown/reload
    (see the sketch after this list).
  • Under valgrind the blocking rewrite started failing after adding a test
    for reloading configuration under high load. I've added it to the
    expected failures for valgrind; looking into this is up next.
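
For the queued-fetch item above, the general pattern is to keep a count of
outstanding fetches and have the shutdown/reload path wait until that count
drains to zero. The sketch below only illustrates that idea; every name in
it is invented for the example and is not an actual ngx_pagespeed
identifier.

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustration only: a counter of in-flight fetches plus a bounded wait on
// the shutdown/reload path. In ngx_pagespeed the real bookkeeping lives
// around NgxBaseFetch; none of these names are the module's identifiers.
class FetchDrain {
 public:
  void OnFetchStarted() { ++active_; }
  void OnFetchDone() {
    if (--active_ == 0) {
      std::lock_guard<std::mutex> lock(mu_);
      cv_.notify_all();
    }
  }
  // Block (up to a deadline) until every outstanding fetch has completed.
  bool WaitForIdle(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    return cv_.wait_for(lock, timeout, [this] { return active_ == 0; });
  }
 private:
  std::atomic<int> active_{0};
  std::mutex mu_;
  std::condition_variable cv_;
};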

- Add a test to make sure all logged output looks sane
- Stop our nginx test instances after we are done testing, and
- Add a test that shuts down nginx under high load (needs ab)
- Reduce the number of keepalive requests in the keepalive tests
  to speed up test runs.
- Add notes/questions which need to be sorted out.
…lean"

This reverts commit 013ca06.
Another approach is needed, this was a bad idea
- Valgrind-clean run for the cache manager/loader processes
- Undo tracking file descriptors with valgrind, as it seems it isn't
  possible to suppress the messages, and nginx leaves some descriptors
  to the OS.
- Disallow [crit] messages in error.log next to [alert]
@oschaaf (Member, Author) commented Feb 3, 2015

I ran the tests for a while, and discovered that the fix for cleaning up when exiting from the cache manager/loader processes needs more work. I'm closing this pull, and will re-open when I have fixed that. Sorry!

@oschaaf closed this Feb 3, 2015
@oschaaf (Member, Author) commented Feb 4, 2015

@jeffkaufman Still working on this, but I think I'm very nearly done.

Could you evaluate these PSOL changes? They are required for this patch to
work. We need to explicitly shut down the driver factory before deleting it,
which currently trips a DCHECK. Also, with Serf enabled, the driver factory
will core dump on deletion when it has not been initialised all the way
through, which happens in nginx's cache manager/loader processes.

Index: pagespeed/system/serf_url_async_fetcher.cc
===================================================================
--- pagespeed/system/serf_url_async_fetcher.cc  (revision 4537)
+++ pagespeed/system/serf_url_async_fetcher.cc  (working copy)
@@ -144,6 +144,7 @@
     }
     if (pool_ != NULL) {
       apr_pool_destroy(pool_);
+      pool_ = NULL;
     }
   }

@@ -1173,8 +1174,13 @@
   if (threaded_fetcher_ != NULL) {
     delete threaded_fetcher_;
   }
-  delete mutex_;
-  apr_pool_destroy(pool_);  // also calls apr_allocator_destroy on the allocator
+  if (mutex_ != NULL) {
+    delete mutex_;
+  }
+  if (pool_ != NULL) {
+    apr_pool_destroy(pool_);  // also calls apr_allocator_destroy on the allocator
+    pool_ = NULL;
+  }
 }

 void SerfUrlAsyncFetcher::ShutDown() {
Index: pagespeed/system/system_caches.cc
===================================================================
--- pagespeed/system/system_caches.cc   (revision 4537)
+++ pagespeed/system/system_caches.cc   (working copy)
@@ -76,7 +76,6 @@
 }

 void SystemCaches::ShutDown(MessageHandler* message_handler) {
-  DCHECK(!was_shut_down_);
   if (was_shut_down_) {
     return;
   }

@jeffkaufman (Contributor) commented:

@@ -144,6 +144,7 @@
     }
     if (pool_ != NULL) {
       apr_pool_destroy(pool_);
+      pool_ = NULL;
     }
   }

Setting pool_ to NULL in ~SerfFetch shouldn't do anything, right? If there are methods still getting called on the SerfFetch after it's deleted, it seems like we should fix the callers instead?

@@ -1173,8 +1174,13 @@
   if (threaded_fetcher_ != NULL) {
     delete threaded_fetcher_;
   }
-  delete mutex_;
-  apr_pool_destroy(pool_);  // also calls apr_allocator_destroy on the allocator
+  if (mutex_ != NULL) {
+    delete mutex_;
+  }
+  if (pool_ != NULL) {
+    apr_pool_destroy(pool_);  // also calls apr_allocator_destroy on the allocator
+    pool_ = NULL;
+  }
 }

Same here, except that now you don't destroy pool_ if it's NULL. SerfUrlAsyncFetcher is created with a NULL pool_, but it's allocated in SerfUrlAsyncFetcher::Init. Is the issue that we want to be able to create SerfUrlAsyncFetchers and then delete them before running Init?

@@ -76,7 +76,6 @@
 }

 void SystemCaches::ShutDown(MessageHandler* message_handler) {
-  DCHECK(!was_shut_down_);
   if (was_shut_down_) {
     return;
   }

Why do we need to be able to shut down SystemCaches multiple times?

@oschaaf (Member, Author) commented Feb 4, 2015

@jeffkaufman
Please disregard the changes to serf_url_async_fetcher.cc; I was mistaken. I had introduced a bug caused by the ordering of two atexit() handlers: one introduced by me in this change, and one in SystemRewriteDriverFactory::InitApr(). Some weird values I saw in gdb made me jump to the wrong conclusions.
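
(Background on the ordering: the C runtime runs atexit() handlers in reverse,
last-registered-first order, so the relative registration order of the two
handlers determines which teardown runs first. A minimal standalone sketch;
the two handlers below are placeholders, not the real APR/factory code:)

#include <cstdio>
#include <cstdlib>

// atexit() handlers run in reverse (LIFO) order of registration, so
// whichever handler is registered second runs first at exit.
static void apr_teardown() { std::puts("APR teardown (registered first, runs last)"); }
static void factory_shutdown() { std::puts("factory shutdown (registered second, runs first)"); }

int main() {
  std::atexit(apr_teardown);      // e.g. what InitApr() registers
  std::atexit(factory_shutdown);  // e.g. the handler added in this change
  return 0;  // prints "factory shutdown ..." then "APR teardown ..."
}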

As for removing the DCHECK in SystemCaches::ShutDown(): I need to explicitly call ShutDown() on the driver factory and then destroy it at a later point in time (in a pool cleanup handler). However, the SystemRewriteDriverFactory destructor also ends up calling SystemCaches::ShutDown() unconditionally:
https://code.google.com/p/modpagespeed/source/browse/trunk/src/net/instaweb/rewriter/rewrite_driver_factory.cc#128
https://code.google.com/p/modpagespeed/source/browse/trunk/src/pagespeed/system/system_rewrite_driver_factory.cc#485
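
So ShutDown() is reached twice: once explicitly from the pool cleanup
handler, and once more from the destructor. Removing the DCHECK while
keeping the was_shut_down_ guard turns the second call into a no-op. A
simplified standalone sketch of that pattern (not the actual SystemCaches
code):

#include <cstdio>

// Idempotent shutdown: the first call does the work; the second call
// (from the destructor) is a harmless no-op instead of a DCHECK failure.
class Caches {
 public:
  ~Caches() { ShutDown(); }      // destructor path, called unconditionally
  void ShutDown() {
    if (was_shut_down_) return;  // second call: no-op
    was_shut_down_ = true;
    std::puts("releasing cache resources");
  }
 private:
  bool was_shut_down_ = false;
};

int main() {
  Caches caches;
  caches.ShutDown();  // explicit call, e.g. from a pool cleanup handler
}                     // ~Caches calls ShutDown() again: fine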

@oschaaf (Member, Author) commented Feb 4, 2015

@jeffkaufman Actually, the explicit shutdown is no longer needed, as we'll be waiting until there are no more active base fetches. So disregard that too; no PSOL changes are necessary.

@jeffkaufman (Contributor) commented:

Sounds good! Let me know when you've got something you'd like me to review.
