
Mycodo 4.1.16 daemon crashes after a few days and does not automatically restart. #198

Closed
lucasvickers opened this issue Mar 1, 2017 · 96 comments


@lucasvickers

lucasvickers commented Mar 1, 2017

Mycodo Issue Report:

  • Mycodo version: 4.1.16

Problem Description

Daemon shuts down / crashes without log entries.

  • I have a DHT22, K30, and luminosity sensor attached, as well as an LCD screen for output. I am just running the system to test stability.
  • Not only does the daemon go down, it also does not automatically start back up; I need to restart it manually.

Errors

Daemon Log:
Below is all I could find. The rest of the log is repeated MySQL queries that I assume are the result of a persistent web client / graphing.

Feb 27 20:45:55 nibbler systemd[1]: mycodo.service: main process exited, code=killed, status=6/ABRT
Feb 27 20:45:55 nibbler mycodo_client.py[20340]: 2017-02-27 20:45:55,935 Connection refused. Is the daemon running?

Steps to Reproduce the issue:

How can this issue be reproduced?

  1. Run setup for multiple days.

Additional Notes

I've been able to do this twice now, so I believe it's consistent. Please let me know what other types of log entries I should look for or if I can enable some increased logging, etc.

@lucasvickers
Author

Also note that the /var/lock/mycodo.pid file gets left behind, and I need to remove it before restarting the daemon.

@kizniche
Owner

kizniche commented Mar 1, 2017

Run the daemon manually in debug mode and see what error it spits back at you.

sudo python ~/Mycodo/mycodo/mycodo_daemon.py -d

@lucasvickers
Author

Thanks, will do. I think it runs for a solid 4-5 days before crashing, so I'll update this ticket when I can.

@kizniche
Owner

kizniche commented Mar 1, 2017

Sounds good. Periodically check the size of the log, making sure it doesn't use all the free space on your SD card. Debug mode produces a lot of log lines!
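Something like du -h /var/log/mycodo/ (or df -h to watch the card's free space as a whole) should be enough to keep an eye on it.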

@kizniche
Owner

kizniche commented Mar 1, 2017

Also, there wasn't an error in /var/log/mycodo/mycodo.log when not running in debug mode?

@lucasvickers
Author

It's a 32GB card, so I should be good, but thanks for the tip.

I couldn't find any specific error, though the log is busy. I also checked the system log and saw the same info. Are your logging levels standardized, so that I could just grep the log for "ERROR", etc., to figure out whether one was thrown much earlier on?

@kizniche
Owner

kizniche commented Mar 1, 2017

Yes, it should be logged at the ERROR level (but not always). Can you attach the log here for me to look at? There shouldn't be any sensitive info logged.
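For a quick check, something like grep -i error /var/log/mycodo/mycodo.log should surface any error-level entries.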

@lucasvickers
Author

I'm running in debug mode now.

Logs here. The restart on 2017-02-26 10:34:02 was on purpose, and the errors before that were due to experimenting with sensors.

@lucasvickers
Author

The below occurred when the last measurement was taken. Any insight?
I've also attached both logs. mycodo-2.zip

Mar  4 04:28:16 nibbler systemd[1]: Stopping User Manager for UID 1001...
Mar  4 04:28:16 nibbler systemd[22374]: Stopping Default.
Mar  4 04:28:16 nibbler systemd[22374]: Stopped target Default.
Mar  4 04:28:16 nibbler systemd[22374]: Stopping Basic System.
Mar  4 04:28:16 nibbler systemd[22374]: Stopped target Basic System.
Mar  4 04:28:16 nibbler systemd[22374]: Stopping Paths.
Mar  4 04:28:16 nibbler systemd[22374]: Stopped target Paths.
Mar  4 04:28:16 nibbler systemd[22374]: Stopping Timers.
Mar  4 04:28:16 nibbler systemd[22374]: Stopped target Timers.
Mar  4 04:28:16 nibbler systemd[22374]: Stopping Sockets.
Mar  4 04:28:16 nibbler systemd[22374]: Stopped target Sockets.
Mar  4 04:28:16 nibbler systemd[22374]: Starting Shutdown.
Mar  4 04:28:16 nibbler systemd[22374]: Reached target Shutdown.
Mar  4 04:28:16 nibbler systemd[22374]: Starting Exit the Session...
Mar  4 04:28:16 nibbler systemd[22374]: Received SIGRTMIN+24 from PID 13300 (kill).
Mar  4 04:28:16 nibbler systemd[1]: Stopped User Manager for UID 1001.
Mar  4 04:28:16 nibbler systemd[1]: Stopping user-1001.slice.
Mar  4 04:28:16 nibbler systemd[1]: Removed slice user-1001.slice.

@kizniche
Owner

kizniche commented Mar 5, 2017

Both logs you sent stop March 1st, but you said your error occurred March 4th. Were there no log lines at the time of the issue?

@lucasvickers
Author

I just double-checked, and both logs in the mycodo-2.zip file end March 4.

Can you double-check?

@kizniche
Owner

kizniche commented Mar 6, 2017

Looking over the log, it doesn't indicate an issue. Can you enable components one by one to determine which part is causing it: sensor measurements, LCD output, or other parts of the system you have active?

@lucasvickers
Author

Sure. I currently have the following:

  • LCD Output
  • Two Graphs
  • Time Lapse Photos
  • K30 Sensor
  • TSL2561
  • DHT22

I'll start one by one.

@kizniche
Owner

kizniche commented Mar 6, 2017

Graphs shouldn't be an issue, but the others could.

@lucasvickers
Author

If you can think of any places where additional logging would help, now's the time :) I'll add more elements after 5 days of successful running, but troubleshooting may wind up taking a while.

You should think about a tool to monitor your daemon, similar to how you monitor sensors. It could be useful to notify people when it crashes, etc.

@kizniche
Owner

kizniche commented Mar 6, 2017

Without knowing exactly which controller(s) are causing the issue, it's difficult to know where to add more logging lines. The daemon-status check is a good idea. I'll put it on the to-do list for the 5.0 release.

@lucasvickers
Author

In case it's helpful at all, whenever I write similarly complex systems I always add a verbose level of logging. At the very least, I log when I begin a major operation (reading a specific sensor, outputting to an LCD screen, etc.). It creates a lot of log output, but it gives a definitive idea of where an issue occurred.

The kind of black-box troubleshooting we're doing right now doesn't scale very well.

Either way, I'm happy to do this debugging, just providing some input.

@kizniche
Owner

kizniche commented Mar 8, 2017

That should be what happens in debug mode (sudo python ~/Mycodo/mycodo/mycodo_daemon.py -d). That sets the logging level to DEBUG, for which there are a lot of logging statements. It may just be that you came across a bug in an area of the code that doesn't have good exception and/or logging coverage, or something else entirely.

@kizniche
Owner

kizniche commented Mar 8, 2017

I think this is the first or one of the very few times (I can't recall any other) that the daemon just stops without anything appearing in the log. You're the only person that I know of that's experiencing this issue. So, it could possibly be your install or hardware.

@kizniche
Owner

kizniche commented Mar 8, 2017

Just curious, before concluding the daemon crashed, did you confirm with ps aux | grep mycodo_daemon.py that the daemon did indeed stop and wasn't just unresponsive?

@lucasvickers
Author

I sure hope it's not the hardware; that would be challenging to debug. The next time it hangs I'll check ps.

Just curious, what were these messages about:

Mar  4 04:28:16 nibbler systemd[22374]: Reached target Shutdown.
Mar  4 04:28:16 nibbler systemd[22374]: Starting Exit the Session...
Mar  4 04:28:16 nibbler systemd[22374]: Received SIGRTMIN+24 from PID 13300 (kill).

I think it's pretty suspicious that they coincided with the last LCD update. I was using the LCD screen with time output to see if it was responsive or not, and when the issue occurred the LCD was frozen at roughly 4:28.

Thanks for helping me work through all this.

@kizniche
Owner

kizniche commented Mar 8, 2017

I'm not sure what those messages mean.

I just looked over the LCD controller and it's fairly well covered with exception logging, so an error there should have been caught if that were the issue. It's a mystery.

@lucasvickers
Author

http://serverfault.com/questions/774491/what-is-sigrtmin24-in-syslog

Weird. Perhaps this is some type of system issue. I'll keep my eyes on it and see what I can find.

@lucasvickers
Author

I was able to confirm that the daemon does crash; it does not hang. I will continue removing individual components of my system until it runs stably.

I reviewed the logs and your code a little more, and I want to bring up the option of additional logging again. You did a great job with defensive programming and capturing exceptions, but clearly something is going on that we can't find.

Even in debug mode, all I see are the following four lines repeated:

2017-03-10 06:14:44,967 - mycodo.lcd_jK2jnVDR - DEBUG - Latest temperature: 17.5 @ 2017-03-10 06:14:42
2017-03-10 06:14:44,998 - mycodo.lcd_jK2jnVDR - DEBUG - Latest humidity: 10.8 @ 2017-03-10 06:14:42
2017-03-10 06:14:45,026 - mycodo.lcd_jK2jnVDR - DEBUG - Latest co2: 413 @ 2017-03-10 06:14:34
2017-03-10 06:14:45,067 - mycodo.lcd_jK2jnVDR - DEBUG - Latest time: 17.5 @ 2017-03-10 06:14:42

This doesn't give me any insight into what else is happening. Presumably there is a function that handles grabbing sensor values, a function that kicks off camera captures, etc., correct?

I would recommend adding logging such as the following:

def update_lcd():
    logger.verbose("Starting LCD Update.")
    ...  # your logic
    logger.verbose("Finished LCD Update.")

or, if inside a loop, something like this:

def run(self):
    try:
        logger.verbose("Starting LCD Loop.")
        self.running = True
        ...
        while self.running:
            if time.time() > self.timer:
                logger.verbose("Starting LCD Logic.")
                self.get_lcd_strings()
                ...
            logger.verbose("LCD Sleeping.")
            time.sleep(1)
    finally:
        logger.verbose("LCD Loop has Ended.")

While this will produce a large amount of log output, it would let us actually know what was going on at the time of a crash. Whatever the issue may be, be it a hardware problem, an uncaught exception, etc., we would quickly be able to tell the last function that was called before things went wrong. This would really help narrow down the troubleshooting.

The reason I suggest doing this at a VERBOSE level is that you don't need it in most situations; keeping DEBUG and VERBOSE separate helps resolve issues like the one I'm having without flooding the logs for users who only need DEBUG.

If you are interested in this I'm happy to help work through your code and add this type of logging.

@kizniche
Owner

More logging would be nice. Thanks for the offer to help. A few questions and concerns:

How would one add VERBOSE as a logging level beyond DEBUG?

I'm getting ready to release version 5.0, so adding code to 4.1.16 may not be the best idea because it will not be maintained and merging with 5.0, when it's released, could become more difficult. Perhaps you should give the dev-5.0 branch a try and see if you still experience the same daemon crash issues.

@lucasvickers
Author

Ah, sorry, I forgot VERBOSE is not a built-in logging level in Python. There are ways to add custom levels; I can help with that.

I'll set my test back up in dev-5.0 and see if I have the issue. We can take it from there, and if needed I'll help with the logging in 5.0.

Thanks!

@kizniche
Owner

kizniche commented Mar 14, 2017

This seems like the proper way to add another logging level.

import logging

DEBUGV_LEVEL_NUM = 9  # one level below DEBUG (10)
logging.addLevelName(DEBUGV_LEVEL_NUM, "DEBUGV")

def debugv(self, message, *args, **kws):
    if self.isEnabledFor(DEBUGV_LEVEL_NUM):
        self._log(DEBUGV_LEVEL_NUM, message, args, **kws)

logging.Logger.debugv = debugv
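
Presumably, once that level is registered, usage would look something like the following (continuing from the snippet above; the logger name is just an example):

logging.basicConfig(level=DEBUGV_LEVEL_NUM)       # show DEBUGV and above
logger = logging.getLogger("mycodo.lcd_example")  # example logger name
logger.debugv("Starting LCD update")
logger.debugv("Finished LCD update")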

@lucasvickers
Author

I'll keep 5.0-dev running for the week and see if it is stable.

@kizniche
Owner

I just got 5.0 freshly installed. Thanks for the debugging.

@lucasvickers
Author

Sad to report, the daemon went dead after 4-5 days of uptime (same as usual).

I'm going to go back to disabling individual components and see what that yields, but I'd love to talk about getting that extra logging in place.

@lucasvickers
Author

5.0.27

@kizniche
Owner

kizniche commented May 7, 2017

I meant: are these the same Pi version?

@lucasvickers
Author

Both are Pi 3 Model B. Both are running Raspbian, and I just updated and upgraded the OS, so they should be mirrors of each other.

@lucasvickers
Author

lucasvickers commented May 12, 2017

Devices A and B both crashed within a few days of each other. RAM usage grew slightly but never got above 50 MB. I'm re-running in debug mode to see if I can get any additional information.

FWIW, these OSes were installed independently of each other, not mirrored.

Do you know of any other Pi 3 B devices that have been running stably for some time?

@kizniche
Owner

I have two Pi 3s that have been running for months without issue. One is a production device with several sensors and a time-lapse, all running since November.

@lucasvickers
Author

Well, what are the next steps here? I have two devices with this issue: one with very standard wiring, one with slightly longer wiring for a production setup. I can start removing sensors one by one again to see if I can find the specific cause, but when I tried that last time I couldn't fully isolate the issue.

Perhaps the daemon re-launching tool? The crash itself isn't actually a big deal, but the idea of the daemon crashing and certain relays being left on for extended periods is problematic.

@kizniche
Owner

Without knowing exactly what part of Mycodo is causing the issue, it's hard to do much. What did you narrow it down to?

@lucasvickers
Author

I couldn't isolate it. On device A I had removed/disabled sensors down to just the TSL2561, and it was still crashing.

I had then set up both devices with a DHT22, then a TSL2561, and it ran for a week straight. I assumed the issue was the possible use of a non-standard GPIO pin for the DHT22 and added back all sensors. The all-sensors test has now failed.

I can start from scratch again, but it's hard to isolate. I'll start with the TSL2561 alone.

I want to use one of the devices for production work, so I need a way to keep the daemon up when it crashes. I can still dedicate a second device to testing. I'm going to modify the service so that it runs the stayalive script you posted above. I'll then dedicate device A to testing and start from no sensors again on that device.

@kizniche
Owner

Perhaps you might want to try forking the repo, adding more debug logging lines throughout the functions you've been using, and then running the daemon in debug mode until it crashes.

@lucasvickers
Author

Good call. It seems like as long as I insert logging into each sensor's next() it should help isolate the issue. The code is really easy to follow 👍
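
As a rough sketch of what I mean (the method and attribute names here are illustrative, not the exact Mycodo code), per sensor it would be something like:

def next(self):
    # Log entry/exit around the actual measurement call so the last log
    # line tells us which sensor was being read if the daemon dies.
    self.logger.debug("TSL2561: starting measurement")
    measurement = self.get_measurement()  # illustrative name for the actual read
    self.logger.debug("TSL2561: finished measurement")
    return measurement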

@Ericktz
Contributor

Ericktz commented May 16, 2017

After a recent crash, I deleted the .lock file and tried to relaunch the daemon in debug mode, but the AM2302 sensor (ID 2) did not start:


2017-05-15 20:45:19,203 - mycodo - ERROR - Could not activate Sensor controller with ID 2: unpack requires a string argument of length 16
Traceback (most recent call last):
  File "/home/pi/Mycodo/mycodo/mycodo_daemon.py", line 415, in controller_activate
    ready, cont_id)
  File "/home/pi/Mycodo/mycodo/controller_sensor.py", line 287, in __init__
    power=self.power_relay_id)
  File "/home/pi/Mycodo/mycodo/sensors/dht22.py", line 58, in __init__
    self.pi = pigpio.pi()
  File "/usr/lib/python2.7/dist-packages/pigpio.py", line 3799, in __init__
    self._notify = _callback_thread(self.sl, host, port)
  File "/usr/lib/python2.7/dist-packages/pigpio.py", line 906, in __init__
    self.handle = _pigpio_command(self.sl, _PI_CMD_NOIB, 0, 0)
  File "/usr/lib/python2.7/dist-packages/pigpio.py", line 849, in _pigpio_command
    dummy, res = struct.unpack('12sI', sl.s.recv(16))
error: unpack requires a string argument of length 16

On the info page, the only value shown in red is the daemon virtualenv.
[Screenshot attached: 2017-05-15, 21:02]

I've also attached a photo of my sensor wiring, in case it helps provide a clue:
[Photo attached: rpi_conexion]

For now, I'm rebooting the Pi to get it working again.

@kizniche
Owner

kizniche commented May 16, 2017

error: unpack requires a string argument of length 16

I'm not sure what caused that.

The daemon won't run in the virtual environment if you run it manually from the terminal without starting it from the virtual environment's Python. Use this command:

sudo ~/Mycodo/env/python ~/Mycodo/mycodo/mycodo_daemon.py -d

For now, I'm rebooting the Pi to get it working again.

Does this mean rebooting fixed the issue?

@Ericktz
Contributor

Ericktz commented May 17, 2017

Does this mean rebooting fixed the issue?

Yes. When the Pi is restarted, the Mycodo daemon has no problem initializing, and sensor ID 2 (the AM2302, added as a DHT22) takes measurements again.

@Ericktz
Contributor

Ericktz commented May 19, 2017

That command is missing a directory:

sudo ~/Mycodo/env/python ~/Mycodo/mycodo/mycodo_daemon.py -d

corrected:

sudo ~/Mycodo/env/bin/python ~/Mycodo/mycodo/mycodo_daemon.py -d


Today I got the same error when I tried to relaunch the daemon in debug mode; the daemon had stopped again about 2 hours earlier:

2017-05-18 23:16:29,140 - mycodo.sensor_1 - INFO - Activated in 136.8 ms
2017-05-18 23:16:29,341 - mycodo - ERROR - Could not activate Sensor controller with ID 2: unpack requires a string argument of length 16
Traceback (most recent call last):
  File "/home/pi/Mycodo/mycodo/mycodo_daemon.py", line 415, in controller_activate
    ready, cont_id)
  File "/home/pi/Mycodo/mycodo/controller_sensor.py", line 287, in __init__
    power=self.power_relay_id)
  File "/home/pi/Mycodo/mycodo/sensors/dht22.py", line 58, in __init__
    self.pi = pigpio.pi()
  File "/usr/lib/python2.7/dist-packages/pigpio.py", line 3799, in __init__
    self._notify = _callback_thread(self.sl, host, port)
  File "/usr/lib/python2.7/dist-packages/pigpio.py", line 906, in __init__
    self.handle = _pigpio_command(self.sl, _PI_CMD_NOIB, 0, 0)
  File "/usr/lib/python2.7/dist-packages/pigpio.py", line 849, in _pigpio_command
    dummy, res = struct.unpack('12sI', sl.s.recv(16))
error: unpack requires a string argument of length 16
2017-05-18 23:16:29,504 - mycodo.sensor_3 - INFO - Activated in 98.7 ms
2017-05-18 23:16:29,674 - mycodo.sensor_4 - INFO - Activated in 127.3 ms
2017-05-18 23:16:29,818 - mycodo.sensor_5 - INFO - Activated in 99.6 ms
2017-05-18 23:16:29,819 - mycodo - INFO - All activated sensor controllers started

I will swap the AM2302 sensor for another one.

@kizniche
Owner

@Ericktz You might have corruption in your database. I've never seen that error before.

@kizniche
Owner

It could also be an issue with your pigpio install, or a hardware issue.

@Ericktz
Contributor

Ericktz commented May 19, 2017

@Ericktz You might have corruption in your database. I've never seen that error before.

If I delete the mycodo.db, can I reconfigure everything?

@kizniche
Owner

kizniche commented May 19, 2017

This is how I would do it:

sudo service mycodo stop &&
sudo /etc/init.d/apache2 stop &&
cd ~/Mycodo/databases &&
sudo mv mycodo.db mycodo.db.backup &&
sudo /etc/init.d/apache2 start &&
wget --quiet --no-check-certificate -p http://127.0.0.1 -O /dev/null &&
sudo service mycodo start

That will back up your database, then trigger a new one to be created. You'll then need to go through the setup procedure again: creating an admin account, adding/activating your sensors, etc.

@lucasvickers
Author

Detailed logging during each sensor reading shows that the crash does not occur inside a sensor read.

I also tried disabling graphs and this did not stop the crashing. The crashing still happens consistently on devices A and B.

At this point I'm just going to set up the restart script and move forward. Thanks for the assistance to date.

@kizniche
Owner

Sorry we couldn't come to a better resolution. If you ever get any more info, please come back to this (or a new) thread and share. Also, could you please provide the full script you are using?

@lucasvickers
Author

You've helped out quite a bit! Perhaps one day we'll find the problem. I wish I had more time to personally dig deeper.

Once I have the restart script ironed out I'll post it here.

@lucasvickers
Author

The following script is working for me as a tool to ensure the daemon stays up. I start it at boot using cron (example entry after the script).

#!/usr/bin/python
import os
import logging
import time

DAEMON_PID_FILE = '/var/lock/mycodo.pid'
LOG_FILE = '/var/log/mycodo/mycodokeepup.log'

def check_daemon():
    # Only act when a PID file exists; no PID file means the daemon was
    # stopped on purpose (or never started), so leave it alone.
    if os.path.exists(DAEMON_PID_FILE):
        with open(DAEMON_PID_FILE, 'r') as pid_file:
            pid = pid_file.read().strip()
        # If no process with that PID exists, the daemon has died.
        if not os.path.exists("/proc/{pid}".format(pid=pid)):
            logging.info("Daemon is not running, restarting.")
            try:
                os.remove(DAEMON_PID_FILE)
                rcode = os.system('/usr/sbin/service mycodo restart')
                if rcode != 0:
                    logging.error("Unable to execute restart command: {}".format(rcode))
            except OSError as e:
                logging.warning("Unable to remove pid file: {}".format(e))

if __name__ == '__main__':

    log_format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    logging.basicConfig(filename=LOG_FILE, format=log_format, level=logging.DEBUG)

    while True:
        check_daemon()
        time.sleep(5)
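
The cron entry that launches it at boot looks roughly like this; it goes in root's crontab (sudo crontab -e), since the script removes the PID file and restarts the service. The path is just wherever you saved the script; mine below is only an example:

@reboot /usr/bin/python /home/pi/mycodokeepup.py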

@kizniche
Owner

kizniche commented Jul 3, 2017

Looks good. Thanks!

One suggestion would be to add a failsafe for the case where the daemon can't start properly (bad configuration or otherwise), so the script doesn't keep trying to start it endlessly. Perhaps try 5 times and then stop the script entirely; if the daemon successfully starts, reset the counter.

@lucasvickers
Author

You did the bulk of that work so thanks for that!

Re your comment: it actually won't retry. If a stale PID file is found, the PID file is removed first and then the daemon is started. If the daemon does not start, the PID file is already gone, so it will not attempt a second time.

It would only loop forever if there were an issue removing the PID file.

I could see building in a failsafe around removing the PID file, but as it stands we can't try to start the daemon without a PID file present; otherwise it would relaunch the daemon even when it was purposely taken down.

@kizniche
Owner

kizniche commented Jul 5, 2017

Ah, I understand it now. It won't do anything if a PID file isn't present, so it's only taking action if it should be running and isn't.

kizniche added a commit that referenced this issue Jul 5, 2017
@kizniche
Owner

kizniche commented Jul 5, 2017

I just added this script (slightly modified) with the latest commit. It's now used during the upgrade/install to check for a stale PID file, remove it, and start the service if needed. You can also run the script with the --continuouscheck parameter to have it check every 5 seconds, as the original script did. That way you can use it with cron and still receive any updates made to the script to keep it compatible with the current version.

@lucasvickers
Author

Script seems to be working for me, thanks!

@kizniche
Owner

This script was added during a recent update (I can't remember which, and didn't reference it here), so a cron entry is now created to start it at boot.
