
trunk-tracking-shutdown: improve shutdown code #897

Closed
wants to merge 10 commits

Conversation

@oschaaf (Member) commented Feb 3, 2015

  • Add a test to make sure all logged output looks sane by whitelisting
    current errors/warnings.
  • Stop our nginx test instances after we are done testing.
  • Add tests for shutting down and reloading configuration under high
    load (depends on ab).
  • Reduce the number of keepalive requests in the keepalive tests to speed
    up test runs.
  • Fix exiting with open file descriptors and fix cleanup in nginx's cache
    manager/loader processes.
  • Attempt to finish queued-up NgxBaseFetches/requests on shutdown/reload
    (see the sketch after this list).
  • Under valgrind the blocking rewrite started failing after adding a test
    for reloading configuration under high load. I've added it to the
    expected failures for valgrind; looking into this is up next.
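
For the queued-fetch item above, the general pattern is to keep a count of
outstanding fetches and have the shutdown/reload path wait until that count
drains to zero. The sketch below only illustrates that idea; every name in
it is invented for the example and is not an actual ngx_pagespeed
identifier.

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustration only: a counter of in-flight fetches plus a bounded wait on
// the shutdown/reload path. In ngx_pagespeed the real bookkeeping lives
// around NgxBaseFetch; none of these names are the module's identifiers.
class FetchDrain {
 public:
  void OnFetchStarted() { ++active_; }
  void OnFetchDone() {
    if (--active_ == 0) {
      std::lock_guard<std::mutex> lock(mu_);
      cv_.notify_all();
    }
  }
  // Block (up to a deadline) until every outstanding fetch has completed.
  bool WaitForIdle(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    return cv_.wait_for(lock, timeout, [this] { return active_ == 0; });
  }
 private:
  std::atomic<int> active_{0};
  std::mutex mu_;
  std::condition_variable cv_;
};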

- Add a test to make sure all logged output looks sane
- Stop our nginx test instances after we are done testing, and
- Add a test that shuts down nginx under high load (needs ab)
- Reduce the number of keepalive requests in the keepalive tests
  to speed up test runs.
- Add notes/questions which need to be sorted out.
…lean"

This reverts commit 013ca06.
Another approach is needed, this was a bad idea
- Valgrind-clean run for the cache manager/loader processes
- Undo tracking file descriptors with valgrind, as it seems it isn't
  possible to suppress the messages, and nginx leaves some descriptors
  to the OS.
- Disallow [crit] messages in error.log next to [alert]
@oschaaf (Member, Author) commented Feb 3, 2015

I ran the tests for a while, and discovered that the fix for cleaning up when exiting from the cache manager/loader processes needs more work. I'm closing this pull, and will re-open when I have fixed that. Sorry!

@oschaaf closed this Feb 3, 2015
@oschaaf (Member, Author) commented Feb 4, 2015

@jeffkaufman Still working on this, but I think I'm very nearly done.

Could you evaluate these PSOL changes? They are required for this patch to
work. We need to explicitly shut down the driver factory before deleting it,
which currently trips a DCHECK. Also, with Serf enabled, the driver factory
will core dump on deletion when it has not been initialised all the way
through, which happens in nginx's cache manager/loader processes.

Index: pagespeed/system/serf_url_async_fetcher.cc
===================================================================
--- pagespeed/system/serf_url_async_fetcher.cc  (revision 4537)
+++ pagespeed/system/serf_url_async_fetcher.cc  (working copy)
@@ -144,6 +144,7 @@
     }
     if (pool_ != NULL) {
       apr_pool_destroy(pool_);
+      pool_ = NULL;
     }
   }

@@ -1173,8 +1174,13 @@
   if (threaded_fetcher_ != NULL) {
     delete threaded_fetcher_;
   }
-  delete mutex_;
-  apr_pool_destroy(pool_);  // also calls apr_allocator_destroy on the allocator
+  if (mutex_ != NULL) {
+    delete mutex_;
+  }
+  if (pool_ != NULL) {
+    apr_pool_destroy(pool_);  // also calls apr_allocator_destroy on the allocator
+    pool_ = NULL;
+  }
 }

 void SerfUrlAsyncFetcher::ShutDown() {
Index: pagespeed/system/system_caches.cc
===================================================================
--- pagespeed/system/system_caches.cc   (revision 4537)
+++ pagespeed/system/system_caches.cc   (working copy)
@@ -76,7 +76,6 @@
 }

 void SystemCaches::ShutDown(MessageHandler* message_handler) {
-  DCHECK(!was_shut_down_);
   if (was_shut_down_) {
     return;
   }

@jeffkaufman (Contributor) commented:

@@ -144,6 +144,7 @@
     }
     if (pool_ != NULL) {
       apr_pool_destroy(pool_);
+      pool_ = NULL;
     }
   }

Setting pool_ to NULL in ~SerfFetch shouldn't do anything, right? If there are methods still getting called on the SerfFetch after it's deleted, it seems like we should fix the callers instead?

@@ -1173,8 +1174,13 @@
   if (threaded_fetcher_ != NULL) {
     delete threaded_fetcher_;
   }
-  delete mutex_;
-  apr_pool_destroy(pool_);  // also calls apr_allocator_destroy on the allocator
+  if (mutex_ != NULL) {
+    delete mutex_;
+  }
+  if (pool_ != NULL) {
+    apr_pool_destroy(pool_);  // also calls apr_allocator_destroy on the allocator
+    pool_ = NULL;
+  }
 }

Same here, except that now you don't destroy pool_ if it's NULL. SerfUrlAsyncFetcher is created with a NULL pool_, but it's allocated in SerfUrlAsyncFetcher::Init. Is the issue that we want to be able to create SerfUrlAsyncFetchers and then delete them before running Init?

@@ -76,7 +76,6 @@
 }

 void SystemCaches::ShutDown(MessageHandler* message_handler) {
-  DCHECK(!was_shut_down_);
   if (was_shut_down_) {
     return;
   }

Why do we need to be able to shut down SystemCaches multiple times?

@oschaaf (Member, Author) commented Feb 4, 2015

@jeffkaufman
Please disregard the changes to serf_url_async_fetcher.cc; I was mistaken. I had introduced a bug caused by the ordering of two atexit() handlers: one introduced by me in this change, and one in SystemRewriteDriverFactory::InitApr(). Some weird values I saw in gdb made me jump to the wrong conclusions.
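
(Background on the ordering: the C runtime runs atexit() handlers in reverse,
last-registered-first order, so the relative registration order of the two
handlers determines which teardown runs first. A minimal standalone sketch;
the two handlers below are placeholders, not the real APR/factory code:)

#include <cstdio>
#include <cstdlib>

// atexit() handlers run in reverse (LIFO) order of registration, so
// whichever handler is registered second runs first at exit.
static void apr_teardown() { std::puts("APR teardown (registered first, runs last)"); }
static void factory_shutdown() { std::puts("factory shutdown (registered second, runs first)"); }

int main() {
  std::atexit(apr_teardown);      // e.g. what InitApr() registers
  std::atexit(factory_shutdown);  // e.g. the handler added in this change
  return 0;  // prints "factory shutdown ..." then "APR teardown ..."
}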

As for removing the DCHECK in SystemCaches::ShutDown(): I need to explicitly call ShutDown() on the driver factory and then destroy it at a later point in time (in a pool cleanup handler). However, the SystemRewriteDriverFactory destructor also ends up calling SystemCaches::ShutDown() unconditionally:
https://code.google.com/p/modpagespeed/source/browse/trunk/src/net/instaweb/rewriter/rewrite_driver_factory.cc#128
https://code.google.com/p/modpagespeed/source/browse/trunk/src/pagespeed/system/system_rewrite_driver_factory.cc#485
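
So ShutDown() is reached twice: once explicitly from the pool cleanup
handler, and once more from the destructor. Removing the DCHECK while
keeping the was_shut_down_ guard turns the second call into a no-op. A
simplified standalone sketch of that pattern (not the actual SystemCaches
code):

#include <cstdio>

// Idempotent shutdown: the first call does the work; the second call
// (from the destructor) is a harmless no-op instead of a DCHECK failure.
class Caches {
 public:
  ~Caches() { ShutDown(); }      // destructor path, called unconditionally
  void ShutDown() {
    if (was_shut_down_) return;  // second call: no-op
    was_shut_down_ = true;
    std::puts("releasing cache resources");
  }
 private:
  bool was_shut_down_ = false;
};

int main() {
  Caches caches;
  caches.ShutDown();  // explicit call, e.g. from a pool cleanup handler
}                     // ~Caches calls ShutDown() again: fine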

@oschaaf (Member, Author) commented Feb 4, 2015

@jeffkaufman Actually, the explicit shutdown is no longer needed, as we'll be waiting until there are no more active base fetches. So disregard that too; no PSOL changes are necessary.

@jeffkaufman (Contributor) commented:

Sounds good! Let me know when you've got something you'd like me to review.
