Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition on startup with janitor process #26

Closed
smclay opened this issue Aug 20, 2020 · 2 comments
Closed

Race condition on startup with janitor process #26

smclay opened this issue Aug 20, 2020 · 2 comments
Labels

Comments

@smclay
Copy link
Contributor

smclay commented Aug 20, 2020

On startup NGAS creates a janitor process. The janitor process reports and error when it tries to archive a rotated log file. The error occurs because the NGAS HTTP server is not ready to handle requests. Here are the log messages...

2020-08-20T09:27:49.310 [ 1881] [JANITOR-PR] [  INFO] ngamsServer.janitor.expired_data_cleaner#run:54 Checking/cleaning up processing directory: /srv/ngas1/./processing
2020-08-20T09:27:49.310 [ 1881] [JANITOR-PR] [  INFO] ngamsServer.janitor.expired_data_cleaner#run:54 Checking/cleaning up subscription backlog buffer: /srv/ngas1/./subscr-back-log
2020-08-20T09:27:49.311 [ 1881] [JANITOR-PR] [  INFO] ngamsServer.janitor.expired_data_cleaner#run:54 Checking/cleaning up NGAS tmp directory: /srv/ngas1/tmp
2020-08-20T09:27:49.314 [ 1881] [JANITOR-PR] [  INFO] ngamsPClient.ngamsPClient#archive:142 Archiving file with URI: file:///srv/ngas1/log/LOG-ROTATE-2020-08-20T09:23:53.255.nglog
2020-08-20T09:27:49.315 [ 1881] [JANITOR-PR] [  INFO] ngamsLib.ngamsHttpUtils#_http_response:125 About to GET to 127.0.0.1:8001//ARCHIVE?async=0&no_versioning=0&filename=file%3A%2F%2F%2Fsrv%2Fngas1%2Flog%2FLOG-ROTATE-2020-08-20T09%3A23%3A53.255.nglog&mime_type=ngas%2Fnglog
2020-08-20T09:27:49.317 [ 1881] [JANITOR-PR] [ ERROR] ngamsServer.janitor.main#JanitorCycle:66 Unexpected error in janitor plug-in run
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ngamsServer/janitor/main.py", line 62, in JanitorCycle
    p(srvObj, stopEvt)
  File "/usr/lib/python2.7/site-packages/ngamsServer/janitor/rotated_logfiles_handler.py", line 76, in run
    file_uri, 'ngas/nglog')
  File "/usr/lib/python2.7/site-packages/ngamsPClient/ngamsPClient.py", line 153, in archive
    return self.get_status(cmd, pars=pars)
  File "/usr/lib/python2.7/site-packages/ngamsPClient/ngamsPClient.py", line 693, in get_status
    resp, host, port = self._get(cmd, pars, hdrs)
  File "/usr/lib/python2.7/site-packages/ngamsPClient/ngamsPClient.py", line 719, in _get
    resp = self._do_get(host, port, cmd, pars, hdrs)
  File "/usr/lib/python2.7/site-packages/ngamsPClient/ngamsPClient.py", line 756, in _do_get
    timeout=self.timeout, auth=self.basic_auth
  File "/usr/lib/python2.7/site-packages/ngamsLib/ngamsHttpUtils.py", line 303, in httpGetUrl
    pars=allpars, hdrs=hdrs, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/ngamsLib/ngamsHttpUtils.py", line 273, in httpGet
    pars=pars, hdrs=hdrs, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/ngamsLib/ngamsHttpUtils.py", line 127, in _http_response
    _connect(conn)
  File "/usr/lib/python2.7/site-packages/ngamsLib/ngamsHttpUtils.py", line 63, in _connect
    conn.connect()
  File "/usr/lib64/python2.7/httplib.py", line 833, in connect
    self.timeout, self.source_address)
  File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
    raise err
error: [Errno 111] Connection refused
2020-08-20T09:27:49.324 [ 1881] [JANITOR-PR] [  INFO] ngamsServer.janitor.main#janitorThread:131 Janitor Thread executed - suspending for 86400 [s] ...
2020-08-20T09:27:49.723 [ 1869] [DATA-CHECK] [  INFO] ngamsServer.ngamsDataCheckThread#dataCheckThread:787 Data Check Thread starting iteration ...
2020-08-20T09:27:49.741 [ 1869] [DATA-CHECK] [  INFO] ngamsServer.ngamsDataCheckThread#get_disks_to_check:678 Will check 0 disks that are mounted in this system
2020-08-20T09:27:49.743 [ 1869] [DATA-CHECK] [  INFO] ngamsServer.ngamsDataCheckThread#collect_files_on_disk:205 Found 0 files to check in 0 disks
2020-08-20T09:27:49.858 [ 1869] [MainThread] [  INFO] ngamsServer.ngamsServer#serve:2177 NG/AMS HTTP Server ready

EDIT by rtobar to format error message

@rtobar rtobar added the bug label Sep 2, 2020
@rtobar
Copy link
Contributor

rtobar commented Sep 2, 2020

I just wanted to note that even though this error occurs, this is a non-blocking problem as the archiving of the rotate logfiles will be successful in the next janitor cycle.

Scratch that: actually if there's a failure the file will not be picked up anymore as its extension is changed prior to the archiving attempt, and not changed back on error. So effectively files will accumulate over the rotation threshold slowly when these errors occur.

rtobar added a commit that referenced this issue Sep 2, 2020
Before logfiles are rotated (i.e., removed) a series of processing tasks
are performed onto them. Among these, the first one is to archive the
logfile into the NGAS server itself, if the server has been configured
for this. During normal operations this is not a problem, but exactly on
the first try, when the janitor process has just been created and the
HTTP server might not be bound yet, it might result on an ECONNREFUSED
error. This produced big error messages on the logs, while in reality
this is a transient error that should disappear on the next try.

A second, more general problem, was found while inspecting this code:
logfiles were not renamed back to have their original ".unsaved"
extensions when errors happened. This meant that when errors in general
were found (and in particular when ECONNREFUSED was raised) logfiles
were not picked up by successive janitor cycles.

This commit acknowledges these problems, improving the handling of the
ECONNREFUSED error in particular, and of errors in general. On the on
hand, when the ECONNREFUSED error is encountered we simply issue a
warning log statement instead of letting the exception to propagate up
through the stack. On the other hand, if *any* error happens during
archiving we rename the file back to its original *.unsaved name so it
gets picked up again in the next janitor cycle.

To make code and error handling a bit simpler I took the chance of
moving the archiving of files into a separate try_archiving() function,
whose invocation is then surrounded by the error handling block.
Additionally I also added a sorted() call to process unsaved logfiles in
time order, which until now wasn't guaranteed (and is a nice property to
have).

This commit addresses #26.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
@rtobar
Copy link
Contributor

rtobar commented Sep 3, 2020

Fixed by #27. @smclay let me know if you need this fix in the v11 branch and I can backport it.

@rtobar rtobar closed this as completed Sep 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants