Mycodo 4.1.16 daemon crashes after a few days and does not automatically restart. #198
Comments
Also note that the |
Run the daemon manually in debug mode and see what error it spits back at you.
|
Thanks will do. I think it runs for a solid 4-5 days before crashing so I'll update this ticket when I can. |
Sounds good. Periodically check the size of the log, making sure it doesn't use all the free space on your SD card. Debug mode produces a lot of log lines! |
Also, there wasn't an error in /var/log/mycodo/mycodo.log when not running in debug mode? |
32GB card so should be good, but thanks for the tip. I couldn't find any specific error, tho the log is busy. I also checked the system log and saw the same info. Are your logging levels standardized and I could just grep the log for "ERROR", etc, to figure out if one was thrown way earlier on? |
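Searching the log for level names is a reasonable first pass. The sketch below shows one way to do it in Python. It assumes each log line carries its level as a standalone upper-case word (the exact Mycodo log format is not shown in this thread), and `find_level_lines` is a hypothetical helper, not part of Mycodo.

```python
import re

# Assumption: the level appears as a standalone upper-case word on each line.
LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b")


def find_level_lines(lines, levels=("ERROR", "CRITICAL")):
    """Return (line_number, text) pairs whose first level token is in `levels`."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = LEVEL_RE.search(line)
        if match and match.group(1) in levels:
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_levels.py /path/to/logfile
    with open(sys.argv[1], errors="replace") as handle:
        for number, text in find_level_lines(handle):
            print(f"{number}: {text}")
```

Since not every failure is necessarily logged at the ERROR level, it may also be worth searching the same file for `Traceback`.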
Yes, errors should be logged at the ERROR level (but not always). Can you attach the log here for me to look at? There shouldn't be any sensitive info logged. |
I'm running in debug now. Logs here. The restart on |
The below occurred when the last measurement was taken. Any insight?
|
Both logs you sent stop March 1st, but you said your error occurred March 4th. Were there no log lines at the time of the issue? |
I just double checked and both logs in the mycodo-2.zip zip file end March 4. Can you double check? |
Looking over the log, it doesn't indicate an issue. Can you enable components one by one to determine what part is causing the issue, either sensor measurements, LCD output, or other parts of the system you have active? |
Sure. I currently have the following:
I'll start one by one. |
Graphs shouldn't be an issue, but the others could. |
If you can think of any places additional logging would help, now's the time :) I'll add more elements after 5 days of successful running, but troubleshooting may wind up taking a while. You should think about a tool to monitor your daemon similar to how you monitor sensors. Could be useful to notify people that it crashed, etc. |
Without knowing exactly what controller(s) is causing the issue, it's difficult to know where to add more logging lines. The daemon-status check is a good idea. I'll put it on the todo list for the 5.0 release. |
In case it's helpful at all, whenever I write similar complex systems I always add a verbose level of logging. At the very least I log when I begin a major operation (reading a specific sensor, outputting to an LCD screen, etc). It creates a lot of log but it gives a definitive idea of where an issue occurred. The type of black box troubleshooting we're doing right now doesn't scale very well. Either way, I'm happy to do this debugging, just providing some input. |
That should be what happens in debug mode ( |
I think this is the first or one of the very few times (I can't recall any other) that the daemon just stops without anything appearing in the log. You're the only person that I know of that's experiencing this issue. So, it could possibly be your install or hardware. |
Just curious, before concluding the daemon crashed, did you confirm with |
I sure hope it's not the hardware, that would be challenging to debug. The next time I get it to hang I'll check Just curious, what were these messages about:
I think it's pretty suspicious that they coincided with the last LCD update. I was using the LCD screen with time output to see if it was responsive or not, and when the issue occurred the LCD was frozen at roughly 4:28. Thanks for helping me work through all this. |
I'm not sure what those messages mean. I just looked over the LCD controller and it's fairly covered with exception-level logging lines, so it should be caught if that's the issue. It's a mystery. |
http://serverfault.com/questions/774491/what-is-sigrtmin24-in-syslog Weird. Perhaps this is some type of system issue. I'll keep my eyes on it and see what I can find. |
I was able to confirm that the daemon does crash. It does not hang. I will continue with removing individual components of my system until it runs stable. I reviewed the logs and your code a little more and I want to bring back up the option of additional logging. You did a great job in defensive programming and finding ways to capture exceptions, but clearly there is something going on that we can't find. Even in debug mode, all I see are the following 4 lines repeated
This doesn't give me any insight into what else is happening. Presumably there is a function that will handle grabbing sensor values, and a function that will kick off camera captures, etc, correct? I would recommend adding logging such as the following:
or if inside a loop something like this
While this will make a large amount of logs, it would let us actually know what was going on at the time of a crash. Whatever the issue may be, be it a hardware problem, an uncaught exception, etc, we would quickly be able to tell the last function that was called before things went wrong. This would really help narrow down troubleshooting. The reason I suggest it's done in a VERBOSE mode is that you don't need this in most situations, so allowing a DEBUG and VERBOSE log helps resolve issues like the one I am having without flooding logs for the users who only need DEBUG. If you are interested in this I'm happy to help work through your code and add this type of logging. |
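The idea of logging the start of each major operation can be sketched with a small decorator. This is an illustration only, not the snippet originally posted: the logger name, the `trace` decorator, and `read_sensor` are hypothetical stand-ins and do not come from the Mycodo code base.

```python
import functools
import logging

logger = logging.getLogger("daemon_trace")  # hypothetical logger name


def trace(operation):
    """Log when an operation begins and ends.

    If the process dies mid-operation, the last "begin" line without a
    matching "end" or "failed" line points at the operation in progress.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.debug("begin %s", operation)
            try:
                result = func(*args, **kwargs)
            except Exception:
                # Record the traceback, then let the caller handle the error.
                logger.exception("failed %s", operation)
                raise
            logger.debug("end %s", operation)
            return result
        return wrapper
    return decorator


@trace("read sensor")
def read_sensor():
    # Hypothetical stand-in for a real sensor read.
    return 21.5
```

The same pattern works inside a loop by logging the loop variable in the "begin" message, at the cost of a much larger log.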
More logging would be nice. Thanks for the offer to help. A few questions and concerns: How would one add VERBOSE as a logging level beyond DEBUG? I'm getting ready to release version 5.0, so adding code to 4.1.16 may not be the best idea because it will not be maintained and merging with 5.0, when it's released, could become more difficult. Perhaps you should give the dev-5.0 branch a try and see if you still experience the same daemon crash issues. |
Ahh sorry, forgot VERBOSE is not enabled by default in Python. There are ways to add custom levels, I can help with that. I'll set my test back up in dev-5.0 and see if I have the issue. We can take it from there, and if needed I'll help with the logging in 5.0. Thanks! |
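This seems like the proper way to add another logging level.
import logging
# Custom level one step below logging.DEBUG (10)
DEBUG_LEVELV_NUM = 9
logging.addLevelName(DEBUG_LEVELV_NUM, "DEBUGV")
def debugv(self, message, *args, **kws):
    # Only build and emit the record when the logger accepts this level
    if self.isEnabledFor(DEBUG_LEVELV_NUM):
        self._log(DEBUG_LEVELV_NUM, message, args, **kws)
# Make logger.debugv(...) available on every Logger instance
logging.Logger.debugv = debugv |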
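A short usage sketch of such a custom level follows. It shows that a logger set to DEBUG suppresses DEBUGV messages, while a logger set to level 9 emits them. The logger name and the `demo` function are hypothetical, and the registration is repeated so the block runs on its own.

```python
import io
import logging

DEBUG_LEVELV_NUM = 9  # one below logging.DEBUG (10)
logging.addLevelName(DEBUG_LEVELV_NUM, "DEBUGV")


def debugv(self, message, *args, **kws):
    if self.isEnabledFor(DEBUG_LEVELV_NUM):
        self._log(DEBUG_LEVELV_NUM, message, args, **kws)


logging.Logger.debugv = debugv


def demo():
    """Return the lines emitted under DEBUG and then under DEBUGV."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    log = logging.getLogger("debugv_demo")  # hypothetical logger name
    log.addHandler(handler)
    log.propagate = False

    log.setLevel(logging.DEBUG)  # ordinary debug mode: DEBUGV is dropped
    log.debugv("reading sensor %s", "TSL2561")
    log.debug("measurement stored")

    log.setLevel(DEBUG_LEVELV_NUM)  # verbose mode: DEBUGV is emitted
    log.debugv("reading sensor %s", "TSL2561")

    log.removeHandler(handler)
    return stream.getvalue().splitlines()
```

This keeps the verbose lines out of the log for users who only need DEBUG, which was the motivation for a separate level.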
I'll keep 5.0-dev running for the week and see if it is stable. |
I just got 5.0 freshly installed. Thanks for the debugging. |
Sad to report the daemon went dead after 4-5 days uptime (same as usual). I'm going to go back to disabling individual components and see what that yields, but I'd love to talk about getting that extra logging in place. |
5.0.27 |
I meant these are the same Pi version? |
Both 3 B. Both running Raspbian, and I just updated and upgraded the OS so they should be mirrors of each other. |
A and B devices both crashed within a few days of each other. The RAM usage grew slightly but never got above 50MB. I'm re-running with debug to see if I can get any additional information. FWIW these OS's were installed independent of each other, not mirrored. Do you know of any other 3B devices that have been running stable for some time? |
I have two Pi 3s that have been running for months without issue. One is a production device with several sensors and a time-lapse, all running since November. |
Well, what are next steps for this? I have two devices with this issue, one with very standard wiring, one with slightly longer wiring for a production setup. I can start to remove sensor by sensor again and see if I can find the specific cause, but when I tried to do that last time I couldn't totally isolate the issue. Perhaps the daemon re-launching tool? The crash itself isn't actually a big deal, but the idea of the daemon crashing and certain relays being left on for extended periods is problematic. |
Without knowing exactly what part of Mycodo is causing the issue, it's hard to do much. What did you narrow it down to? |
I couldn't isolate it. On the A device I had removed/disabled sensors down to the TSL2561, and it was still crashing. I had then set up both devices with a DHT22 then a TSL2561 and it ran for a week straight. I assumed the issue was the possible use of a non-standard GPIO pin for the DHT22, and added back all sensors. The all sensor test has now failed. I can start from scratch again, but it's hard to isolate. I'll start with the TSL2561 alone. I want to use one of the devices for production work, so I need a way to keep the daemon up when it crashes. I can still dedicate a second device for testing. I'm going to modify the service so that it runs the stayalive script you posted above. I'll then dedicate device A to testing and will start from no sensors again on that device. |
Perhaps you might want to try forking the repo, then adding more debug logging lines throughout the functions you've been using, then run the daemon in debug mode until it crashes. |
Good call, seems like as long as I insert logging into each sensors |
I'm not sure what caused that. The Daemon won't run in the virtual environment if you're manually running it from the terminal and didn't start it from the virtual environment. Use this command:
Does this mean rebooting fixed the issue? |
Yes, when the RPi is restarted, the Mycodo daemon has no problem initializing, and sensor ID 2 (AM2302, added as DHT22) measures again... |
missing directory
corrected:
Today I got the same error when I tried to reload the daemon in debug mode; the daemon had again stopped about 2 hours earlier.
I will swap the AM2302 sensor for another one. |
@Ericktz You might have corruption in your database. I've never seen that error before. |
It could also be an issue with your pigpio install, or a hardware issue. |
If I delete the mycodo.db, can I reconfigure everything? |
This is how I would do it:
That will backup your database, then trigger a new database to be created. You'll then need to go through the setup procedure, creating an admin account, adding/activating your sensors, etc. |
Detailed logging during the reading of each sensor shows that the crash does not occur inside a sensor reading. I also tried disabling graphs and this did not stop the crashing. The crashing still happens consistently on devices A and B. At this point I'm just going to set up the restart script and move forward. Thanks for the assistance to date. |
Sorry we couldn't come to a better resolution. If you ever get any more info, please come back to this (or a new) thread and share. Also, could you please provide the full script you are using? |
You've helped out quite a bit! Perhaps one day we'll find the problem. I wish I had more time to personally dig deeper. Once I have the restart script ironed out I'll post it here. |
The following script is working for me as a tool to ensure the daemon is up. I start it at boot using cron.
|
Looks good. Thanks! One suggestion would be to add a failsafe in case the daemon can't start properly (bad configuration or other), so it doesn't continually try to start it endlessly if it won't be able to start properly. Perhaps try 5 times and then stop the script entirely. If it successfully starts, then reset the counter. |
You did the bulk of that work so thanks for that! Re your comment, it actually won't retry. If a pid file is found, then the pid file is first removed, and then the daemon is started. If the daemon does not start, the pid file is still gone so it will not attempt a second time. It will loop for infinity if there's an issue removing the pid file. I could see building it out to have a failsafe for removing the pid file, but as it is now we can't try to start the daemon without a pid present otherwise it would launch the daemon even if the daemon was purposely taken down. |
Ah, I understand it now. It won't do anything if a PID file isn't present, so it's only taking action if it should be running and isn't. |
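The behaviour described above (act only when a PID file exists but its process is gone) can be sketched as follows. This is not the script posted in the thread: `check_daemon`, `pid_is_running`, and the caller-supplied `start_daemon` callable are hypothetical names, and the PID file path is left to the caller.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def check_daemon(pid_file, start_daemon, is_running=pid_is_running):
    """Return 'absent', 'running', or 'restarted'."""
    if not os.path.exists(pid_file):
        # No PID file: the daemon was stopped on purpose, so do nothing.
        return "absent"
    with open(pid_file) as handle:
        content = handle.read().strip()
    if content.isdigit() and is_running(int(content)):
        return "running"
    # Stale PID file: the daemon died without cleaning up after itself.
    os.remove(pid_file)
    start_daemon()
    return "restarted"
```

Because the stale PID file is removed before the start attempt, a daemon that fails to start leaves no PID file behind, so the next check returns 'absent' and no further restart is attempted.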
I just added this script (slightly modified) with the latest commit. I now have it used during the upgrade/install that will basically check for a stale PID file and remove it and start the service if needed. You can also use the script with the |
Script seems to be working for me, thanks! |
This script has been added during a recent update (can't remember which and didn't reference it here) so a cron entry is created to start this script at boot. |
Mycodo Issue Report:
Problem Description
Daemon shuts down / crashes without log entries.
Errors
Daemon Log:
Below is all I could find. The rest of the log is repeating mysql queries that I assume are the result of a persistent web-client / graphing.
Steps to Reproduce the issue:
How can this issue be reproduced?
Additional Notes
I've been able to do this twice now, so I believe it's consistent. Please let me know what other types of log entries I should look for or if I can enable some increased logging, etc.