
lantorrent corruption #81

Closed
bc-umigs opened this Issue November 21, 2011 · 5 comments

4 participants

bc-umigs John Bresnahan David LaBissoniere igs-jeff
bc-umigs

We have noticed an issue with lantorrent that causes it to hang and become unresponsive. It can be reproduced by repeating the following steps:

- Start up an instance/cluster using either the cloud-client or the EC2 API.
- Issue a terminate before propagation has completed.

After this, the database is left with stale records that never seem to get updated. If this happens after a large termination, or after repeated runs of the steps above, lantorrent stops copying any new requests. The lantorrent daemon log shows this error:

2011-11-21 11:37:27,422 - WARNING - Stack trace
2011-11-21 11:37:27,423 - WARNING - ===========
2011-11-21 11:37:27,423 - WARNING - Traceback (most recent call last):
File "build/bdist.linux-x86_64/egg/pylantorrent/ltConnection.py", line 114, in send
self._write_to_socket(data)
File "build/bdist.linux-x86_64/egg/pylantorrent/ltConnection.py", line 99, in _write_to_socket
self.socket.sendall(data)
File "", line 1, in sendall
error: [Errno 104] Connection reset by peer

2011-11-21 11:37:27,423 - WARNING - ===========
2011-11-21 11:37:27,423 - WARNING -
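
The Errno 104 above means the receiving side reset the TCP connection mid-send, which is consistent with a VMM whose instance was just terminated. A minimal sketch of defensive handling that would avoid leaving a pending row behind, assuming a sqlite3-backed request table (the "requests" schema here is hypothetical, not lantorrent's actual one):

    import errno
    import socket

    def sendall_with_cleanup(sock, data, db, req_id):
        # Hedged sketch, not lantorrent's actual code. If the peer resets
        # the connection mid-send (errno 104, as in the traceback above),
        # mark the request failed instead of leaving its row pending
        # forever. 'db' is assumed to be a sqlite3 connection; the
        # "requests" table and its columns are hypothetical.
        try:
            sock.sendall(data)
        except socket.error as e:
            if e.errno == errno.ECONNRESET:
                db.execute("UPDATE requests SET state = 'failed' WHERE id = ?",
                           (req_id,))
                db.commit()
            raise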

If we clear out the req.db file and restart lantorrent, all seems to be well again. This is occurring in our production environment running 2.7, as well as in our development environment, which is at 2.8.
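
Before clearing req.db outright, the stale rows can be inspected. A hedged sketch, assuming req.db is a SQLite file; the path is an assumption and will differ per install, and the table names are discovered rather than guessed:

    import sqlite3

    # Inspect req.db before wiping it; path is an assumption.
    conn = sqlite3.connect("/opt/nimbus/var/lantorrent/req.db")
    for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"):
        count = conn.execute("SELECT COUNT(*) FROM %s" % name).fetchone()[0]
        print("%s: %d rows" % (name, count))
    conn.close()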

Please let me know what other information you need in order to help troubleshoot this.

Thanks,
Brian

David LaBissoniere
Collaborator

Thanks for the report. We will check this out.

John Bresnahan
Collaborator

This issue sounds a lot like #39, which should be solved in 2.8. So the behavior you are describing is expected on the 2.7 install but not on the 2.8 one. Would it be possible to see more of your log? I was expecting to see a line with the string "send error " in it.
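
A quick sketch for pulling those lines out; the daemon log path is an assumption and will differ per install:

    # Scan the daemon log for the "send error " lines mentioned above.
    with open("/opt/nimbus/var/lantorrent/ltdaemon.log") as f:
        for line in f:
            if "send error " in line:
                print(line.rstrip())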

igs-jeff

Patched both daemon.py & client.py.

Cleared all lantorrent log files.

Launched several instances & terminated all instances immediately.

Only ltrequest.log exists thus far, no ltdaemon.log yet; in ltrequest.log, all I get is:

2011-12-08 12:01:55,863 - INFO - checking for done on f2b710e6-21b9-11e1-b251-e41f13b8e0b8
2011-12-08 12:01:55,871 - INFO - checking for done on f251d744-21b9-11e1-acc8-e41f13b8ed90
2011-12-08 12:01:55,875 - INFO - checking for done on f29331c6-21b9-11e1-b356-e41f13b8968c
2011-12-08 12:01:55,933 - INFO - enter
2011-12-08 12:01:55,944 - INFO - checking for done on f24fabe0-21b9-11e1-a542-e41f13b8f7f0
2011-12-08 12:01:55,956 - INFO - enter
2011-12-08 12:01:55,967 - INFO - checking for done on f28d39ec-21b9-11e1-94c8-e41f13b8df98
2011-12-08 12:02:27,470 - INFO - enter
2011-12-08 12:02:27,473 - INFO - enter
2011-12-08 12:02:27,475 - INFO - enter
2011-12-08 12:02:27,481 - INFO - enter
2011-12-08 12:02:27,481 - INFO - enter
2011-12-08 12:02:27,482 - INFO - checking for done on f2aacfa2-21b9-11e1-b8ba-e41f13b877c4
2011-12-08 12:02:27,487 - INFO - checking for done on f29331c6-21b9-11e1-b356-e41f13b8968c
2011-12-08 12:02:27,488 - INFO - checking for done on f2419dca-21b9-11e1-818d-e41f13b8f2cc
2011-12-08 12:02:27,494 - INFO - checking for done on f251d744-21b9-11e1-acc8-e41f13b8ed90
2011-12-08 12:02:27,497 - INFO - enter
2011-12-08 12:02:27,512 - INFO - checking for done on f28d39ec-21b9-11e1-94c8-e41f13b8df98
2011-12-08 12:02:27,516 - INFO - checking for done on f24fabe0-21b9-11e1-a542-e41f13b8f7f0
2011-12-08 12:02:27,530 - INFO - enter
2011-12-08 12:02:27,542 - INFO - enter
2011-12-08 12:02:27,545 - INFO - checking for done on f2b710e6-21b9-11e1-b251-e41f13b8e0b8

ps & grep lant on all VMMs, showed 3 lantorrent python process per instance.

Stopped and restarted lantorrent...

Now ltdaemon.log shows the exact error from the first post; here it is again:

2011-12-08 12:08:20,959 - WARNING - send error 506 172.20.101.11:2893[{u'rename': True, u'id': u'f251d744-21b9-11e1-acc8-e41f13b8ed90', u'filename': u'/secureimages/wrksp-180/tmpSRa4FbRepo__VMS__7ca46e32-eaad-11e0-89f3-0019bb33e0ee__clovr-standard-2011-07-01-03-00-31.raw'}] A connection error occured on send 172.20.101.11:2893 [Errno 104] Connection reset by peer

2011-12-08 12:08:20,959 - WARNING - Stack trace
2011-12-08 12:08:20,959 - WARNING - ===========
2011-12-08 12:08:20,959 - WARNING - Traceback (most recent call last):
File "build/bdist.linux-x86_64/egg/pylantorrent/ltConnection.py", line 114, in send
self._write_to_socket(data)
File "build/bdist.linux-x86_64/egg/pylantorrent/ltConnection.py", line 99, in _write_to_socket
self.socket.sendall(data)
File "", line 1, in sendall
error: [Errno 104] Connection reset by peer

2011-12-08 12:08:20,960 - WARNING - ===========
2011-12-08 12:08:20,960 - WARNING -
2011-12-08 12:08:20,960 - WARNING - bad data: {"code": 503, "md5sum": "", "id": "f251d744-21b9-11e1-acc8-e41f13b8ed90", "host": "172.20.101.11", "file": "/secureimages/wrksp-180/tmpSRa4FbRepo__VMS__7ca46e32-eaad-11e0-89f3-0019bb33e0ee__clovr-standard-2011-07-01-03-00-31.raw", "message": "The output file could not be opened [Errno 2] No such file or directory: u'/secureimages/wrksp-180/tmpSRa4FbRepo__VMS__7ca46e32-eaad-11e0-89f3-0019bb33e0ee__clovr-standard-2011-07-01-03-00-31.raw.lantorrent'", "port": 2893}
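
The "bad data" reply suggests a race: the terminate removed the workspace directory while propagation was still in flight, so the destination could not create its .lantorrent temp file and answered with a 503. A hedged sketch of how such a reply could arise; the field names mirror the JSON above, but this is not lantorrent's actual code:

    import json

    def open_output(path, req_id, host, port):
        # Hypothetical destination-side helper: try to create the
        # .lantorrent temp file; if the directory is already gone
        # ([Errno 2], as above), return a 503 reply shaped like the
        # "bad data" JSON instead of a file object.
        try:
            return open(path + ".lantorrent", "wb"), None
        except IOError as e:
            err = {"code": 503, "md5sum": "", "id": req_id, "host": host,
                   "file": path, "port": port,
                   "message": "The output file could not be opened %s" % e}
            return None, json.dumps(err)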

Grepping for the "send error" string:

2011-12-08 12:08:20,959 - WARNING - send error 506 172.20.101.11:2893[{u'rename': True, u'id': u'f251d744-21b9-11e1-acc8-e41f13b8ed90', u'filename': u'/secureimages/wrksp-180/tmpSRa4FbRepo__VMS__7ca46e32-eaad-11e0-89f3-0019bb33e0ee__clovr-standard-2011-07-01-03-00-31.raw'}] A connection error occured on send 172.20.101.11:2893 [Errno 104] Connection reset by peer
2011-12-08 12:10:03,892 - WARNING - send error 506 172.20.101.11:2893[{u'rename': True, u'id': u'f251d744-21b9-11e1-acc8-e41f13b8ed90', u'filename': u'/secureimages/wrksp-180/tmpSRa4FbRepo__VMS__7ca46e32-eaad-11e0-89f3-0019bb33e0ee__clovr-standard-2011-07-01-03-00-31.raw'}] A connection error occured on send 172.20.101.11:2893 [Errno 104] Connection reset by peer

All the lantorrent processes still exist on the VMMs.

The database is corrupted again.

igs-jeff

After patching daemon.py in the ve, if I:

- launch 10 instances and terminate them immediately, 3 times
- launch 10 more instances without terminating (their status stays in pending)

then monitoring lantorrent processes across all VMMs initially shows 120 (3 processes per instance x 10 instances x 4 launches), eventually draining to 0 (it takes roughly 20 minutes or so).
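
A hedged sketch of that monitoring loop, to run on a VMM; matching on "lant" mirrors the grep used earlier in the thread:

    import subprocess
    import time

    # Poll the process table until the lantorrent workers drain to zero.
    while True:
        out = subprocess.check_output(["ps", "-ef"]).decode()
        n = sum(1 for l in out.splitlines()
                if "lant" in l and "python" in l)
        print("%s: %d lantorrent processes" % (time.strftime("%H:%M:%S"), n))
        if n == 0:
            break
        time.sleep(60)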

On the service node (/var/log/messages):

Dec 16 09:57:36 grinch abrtd: Executable '/opt/nimbus-2.8/ve/bin/ltrequest' doesn't belong to any package
Dec 16 09:57:36 grinch abrtd: Corrupted or bad crash /var/spool/abrt/pyhook-1324047456-18401 (res:4), deleting
Dec 16 09:57:37 grinch python: abrt: detected unhandled Python exception in /opt/nimbus-2.8/ve/bin/ltrequest
Dec 16 09:57:37 grinch abrtd: dumpsocket: New client connected
Dec 16 09:57:37 grinch abrtd: dumpsocket: Saved Python crash dump of pid 18419 to /var/spool/abrt/pyhook-1324047457-18419
Dec 16 09:57:37 grinch abrtd: dumpsocket: Socket client disconnected
Dec 16 09:57:37 grinch abrtd: Directory 'pyhook-1324047457-18419' creation detected
Dec 16 09:57:37 grinch abrtd: Executable '/opt/nimbus-2.8/ve/bin/ltrequest' doesn't belong to any package
Dec 16 09:57:37 grinch abrtd: Corrupted or bad crash /var/spool/abrt/pyhook-1324047457-18419 (res:4), deleting
Dec 16 09:57:37 grinch python: abrt: detected unhandled Python exception in /opt/nimbus-2.8/ve/bin/ltrequest
Dec 16 09:57:37 grinch abrtd: dumpsocket: New client connected
Dec 16 09:57:37 grinch abrtd: dumpsocket: Saved Python crash dump of pid 18429 to /var/spool/abrt/pyhook-1324047457-18429
Dec 16 09:57:37 grinch abrtd: dumpsocket: Socket client disconnected
Dec 16 09:57:37 grinch abrtd: Directory 'pyhook-1324047457-18429' creation detected
Dec 16 09:57:37 grinch abrtd: Executable '/opt/nimbus-2.8/ve/bin/ltrequest' doesn't belong to any package
Dec 16 09:57:37 grinch abrtd: Corrupted or bad crash /var/spool/abrt/pyhook-1324047457-18429 (res:4), deleting
Dec 16 09:57:37 grinch python: abrt: detected unhandled Python exception in /opt/nimbus-2.8/ve/bin/ltrequest
Dec 16 09:57:37 grinch abrtd: dumpsocket: New client connected
Dec 16 09:57:37 grinch abrtd: dumpsocket: Saved Python crash dump of pid 18449 to /var/spool/abrt/pyhook-1324047457-18449
Dec 16 09:57:37 grinch abrtd: dumpsocket: Socket client disconnected
Dec 16 09:57:37 grinch abrtd: Directory 'pyhook-1324047457-18449' creation detected
Dec 16 09:57:37 grinch abrtd: Executable '/opt/nimbus-2.8/ve/bin/ltrequest' doesn't belong to any package
Dec 16 09:57:37 grinch abrtd: Corrupted or bad crash /var/spool/abrt/pyhook-1324047457-18449 (res:4), deleting
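
The abrtd messages show ltrequest dying with an unhandled Python exception, and abrtd then discarding the crash dump because the executable does not belong to a package. As a hedged aside, installing a sys.excepthook keeps a usable trace in a log file of one's own regardless of abrtd; the log path below is an assumption:

    import logging
    import sys
    import traceback

    logging.basicConfig(filename="/tmp/ltrequest-crash.log",
                        level=logging.ERROR)

    def log_unhandled(exc_type, exc_value, exc_tb):
        # Record the full traceback ourselves, since abrtd deletes
        # dumps from unpackaged executables (as in the log above).
        logging.error("unhandled exception:\n%s",
                      "".join(traceback.format_exception(exc_type,
                                                         exc_value, exc_tb)))

    sys.excepthook = log_unhandled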

At this point I cannot start any more instances. Restarting the service node fixes the above error.

After the service node restarts, it still shows the last 10 instances in the pending state.

I can then terminate these 10 instances and launch new ones.

John Bresnahan
Collaborator

This specific issue appears to be fixed. We will wait for verification from testing on the next RC.

buzztroll (John Bresnahan) closed this on January 26, 2012.