PID controllers get messed up after a while #671
Comments
You can go to the asynchronous graph page and, for the offending PID, view its P, I, and D values in addition to the PID duration output. This is the first bit of data that's needed to understand how the PID output is calculated. I would also include on this graph the Output duration itself to make sure it matches the PID output. Once we see that data for a period of time comprising both good and bad PID operation, we can begin to understand what's happening.
Also check the daemon log (ALL button) for any errors or strange lines at the time the PID(s) began acting strangely. Was it all PIDs that started doing this or one?
Other data series to include on the asynchronous graph I mentioned previously are the measurement the PID uses as input and the setpoint.
3 out of 5 PID controllers were acting up; the other 2 haven't produced any output recently because no heating or humidifying was needed. Here are the graphs: In the log I see a lot of these errors during the period things were messed up: "RPyC timed out. To prevent this error, increase the RPyC Timeout value in the configuration menu". These errors stopped after restarting the backend.
I just noticed that I used the average RV over 60 minutes as input in the 2 RV graphs; this should actually be a different input, the "live" average RV. I hope this won't be an issue for interpreting what went wrong?
Can you run one of the PID controllers that suffers from the issue in debug mode (Input option Log Level Debug)?
In fact, this will help the most. In addition to enabling debug mode on one of the affected PID controllers (and activating it), could you also do this:
This will start the daemon in debug mode, which will provide a lot of potentially useful information when the error occurs. Upgrading to master is needed because I just pushed code, not yet in a release, that includes debug logging lines about the RPyC process. Keep in mind this will generate a lot of log information. You may want to add the Free Space Input to monitor your micro SD card free space and notify you if it starts to run low. Logrotate will keep the log sizes manageable, but if you don't notice the error soon after it occurs, logrotate may have already moved the pertinent log info to an archive, in which case you'll have to go to where the archived logs are stored. Let me know if you can do this and if you have any questions. When in debug mode, the daemon will function exactly as it would normally, there will just be more log lines.
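If it helps to picture what such a free-space check does, here is a minimal sketch (not Mycodo's actual Free Space Input; the mount point and the 500 MB threshold are assumptions):

```python
# Minimal illustration of a free-space check; the path and the 500 MB
# threshold are assumptions, not Mycodo's actual Free Space Input settings.
import shutil

def free_space_mb(path="/"):
    """Return free disk space at `path` in megabytes."""
    return shutil.disk_usage(path).free / (1024 * 1024)

if free_space_mb("/") < 500:
    print("Warning: micro SD card free space is running low")
```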
How long does it typically take for the error to occur?
Since my setup is currently in use I'd rather not update at the moment. I'll put one of the PID controllers in debug mode for now. There's no typical period after which the errors start occurring. It can be days, hours, or weeks.
Did you have the Mycodo RAM Input running at the time, and can you look at what the RAM usage was when the issue began?
Unfortunately not. At the moment the temperature PID looks like it's keeping the cooling output almost constantly high, yet the temperature never reaches the Band Min setpoint. Instead the temperature is regulating just above the Band Max setpoint, within a span of 0.5 degrees C. I guess my next step is to check if the fridge is working properly; maybe the fridge thermostat is switching. This would be unwanted behaviour and would also be strange, because I measured the minimum temperature at the current thermostat setting and that was almost 5 degrees below the temperature currently set in the PID controller. Another thing I noticed at the moment is that the output duration of each PID period, with a min on time of 60 seconds, gets split up into 49 seconds and 11 seconds instead of 60 seconds per PID period.
In the meantime I found the problem with the continuous cooling. It was indeed the fridge thermostat switching too soon. I hope that all of the excess switching during the strange behaviour of the PID controller hasn't led to unwanted wear of the compressor. At least now I can get back to trying to figure out why the PID controllers act up sometimes.
Can you take a screenshot of the configuration options for your PIDs that cause the issue?
Something I'm confused about is your PID periods are all 60 seconds, which means no output used by any of these PIDs can turn on for more than 60 seconds (by these PIDs). However, I see in your initial post the outputs are turning on for far longer than 60 seconds (prior to the 'issue' starting). Do you have something else turning on these outputs besides the PIDs?
Sorry, I wrote that before realizing how the PIDs actually work (excuse my just-woken-up brain). The PIDs can keep an output on continually (past the PID period), though I now have an idea for what's causing your PID behavior. I'll have to look at the code to determine where exactly it's occurring.
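To illustrate what I mean by the output staying on past the PID period, here is a minimal time-proportioning sketch (not Mycodo's actual code; `compute_pid_output`, `turn_on`, and `turn_off` are hypothetical callbacks):

```python
import time

def pid_output_loop(compute_pid_output, turn_on, turn_off, period=60.0):
    """Each period, treat the PID output as requested on-time in seconds.
    If the request meets or exceeds the period, the output is simply left
    on into the next period, so it can stay on continually."""
    while True:
        on_time = max(0.0, min(compute_pid_output(), period))
        if on_time > 0:
            turn_on()
        if on_time < period:
            time.sleep(on_time)
            turn_off()
            time.sleep(period - on_time)
        else:
            time.sleep(period)  # output remains on through the next cycle
```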
Sorry for the late reply, I've been "offline" (read: vacation) for a while. Do you mean you think you can fix the issues?
I added a debugging line to the PID controller, but I'm not sure if that's going to catch the issue. I had asked previously if you could look at the daemon log at the exact time the issue began. Can you try to bring those log lines up and paste the contents here, for just a few minutes before the issue started until a few minutes after the issue began?
Unfortunately there is no specific time it begins, it just happens gradually. All I can see in my Daemon error log is that I get a lot of "RPyC timed out .." errors. Is it possible that this has something to do with the added ability to use the same output on 2 different PID controllers? I'm using 1 output on 2 PID controllers in my case.
What's more important than the error itself is what part of the system the error is coming from. Each log line has a string that identifies where the error originates. Therefore, it is crucial to have the exact log lines to analyze. Just knowing there is an error without knowing where it came from is not that useful.
An easy way to test would be to disable one of the PIDs. I don't recommend using the same output for more than one PID because there can be unforeseen issues.
Here's a few lines from the past days:
As you can see, it involves multiple PID controllers and one conditional statement (I only have one). Using the same output for 2 PID controllers is a must-have in my case, but I can test what happens if I disable one of them. Edit: This is the code of the conditional statement; the actions are set PWMs. I'm running my fans at a very low duty cycle, so this is just to make sure they keep running.
#measurements
inlet_rpm = measure("{4aa60edd}")
outlet_rpm = measure("{948428bf}")
#Conditional Inlet Fan RPM
if inlet_rpm < 100:
    run_action("{3ad209a7}")
else:
    run_action("{0c5ed710}")
#Conditional Outlet Fan RPM
if outlet_rpm < 100:
    run_action("{ecc3b38e}")
else:
    run_action("{4d2f3b63}")
I had one of my production systems experience this issue twice in the past 2 weeks (RPyC: "result timeout"). So, I'm exploring other RPC options, such as Pyro4. I've successfully replaced the code, and am currently running a stability test.
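For reference, the general shape of a Pyro4 server/client pair looks something like this minimal sketch (the class, object id, and port are hypothetical examples, not Mycodo's actual RPC interface):

```python
import Pyro4

@Pyro4.expose
class ControllerServer(object):
    """Hypothetical example object exposed over Pyro4."""
    def pid_status(self, controller_id):
        return "running"

# Server process: register the object and serve requests.
daemon = Pyro4.Daemon(host="127.0.0.1", port=9090)
uri = daemon.register(ControllerServer, objectId="example.controller")
print("Serving at", uri)
daemon.requestLoop()

# In a client process, the call would look like:
#   status = Pyro4.Proxy("PYRO:example.controller@127.0.0.1:9090").pid_status(1)
```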
"Good" to hear that it's not just me. |
I haven't seen any issues with the new communication system. If you would like to try it and see if your issues persist, you can upgrade to master.
Thanks! I will upgrade to master ASAP. In the meantime I've raised the RPyC timeout to 60 and this seems to delay the occurrence of the issues, but after a few days they still turn up slowly.
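(For context, in a plain RPyC client that kind of timeout typically corresponds to the `sync_request_timeout` connection setting; a generic sketch, with placeholder host and port rather than Mycodo's actual values:)

```python
import rpyc

# Generic RPyC client connection with a 60-second synchronous request
# timeout; host and port are placeholders, not Mycodo's actual values.
conn = rpyc.connect("localhost", 18813,
                    config={"sync_request_timeout": 60})
print(conn.root)  # remote requests exceeding 60 s raise AsyncResultTimeout
```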
I get the following error when trying to upgrade to master. I'm currently in production, so I restored back to 7.6.3 for now because I don't want too much downtime.
Edit:
This may be a result of previously upgrading to master and now upgrading to master again. The master branch changes, so repeated upgrades to master may cause issues. Can you restore to the backup that was made of your system just prior to upgrading to master the first time, then upgrade to master? All settings are preserved when upgrading to master, and a backup is also made just as if you were upgrading to a release.
Can you look at the full upgrade log for the error that caused the one you pasted to be returned?
I didn't upgrade to master earlier, this was my first try. Here's the full log of yesterday's attempt:
Thank you so much for all the effort you are putting into this!
Exactly the error I suspected, which was likely the result of previously upgrading to master. If you follow the instructions in my previous comment, you should be able to upgrade successfully:
Unfortunately I'm still getting the same error.
The latest release, 7.7.1, has switched to Pyro4. If you upgrade, let me know if your issue persists.
Thank you, I just updated and will let you know how it goes.
Oh wow. I took a closer look at the graphs and that is a lot better. I take it from the lack of new messages since you updated that you haven't had any problems. When I refactored Mycodo to use Pyro4 instead of RPyC, I didn't know a Pyro5 existed. So, in this latest version release (7.7.3) I've upgraded Pyro4 to Pyro5. I'm not sure if there will be any further noticeable performance improvements, but maybe ;)
I think the performance improvement you saw was related to the messages being transmitted faster. For whatever reason, with RPyC, communication between processes could sometimes take several seconds! Definitely not good for things like turning on outputs from PIDs. Seeing your data makes all that work worth it. I rarely see such an obvious performance improvement.
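For anyone following along, the Pyro5 client side is nearly identical to the Pyro4 sketch earlier; a minimal example (same hypothetical object name, not Mycodo's actual URI):

```python
from Pyro5.api import Proxy

# Client call against the hypothetical example object from the Pyro4 sketch;
# Pyro5 proxies can be used as context managers to release the connection.
with Proxy("PYRO:example.controller@127.0.0.1:9090") as controller:
    print(controller.pid_status(1))
```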
Had some issues with the PID controllers not working at all in version 7.7.3 but after upgrading to 7.7.4 it was all working again and it's still looking good so far. There is some offset in the resulting values but I reckon I need to re-tweak the PID controllers for that.
How has Pyro been treating your Mycodo?
I don't mean to be a pain, but I'm eager to know if your issue has reemerged since the RPC change from RPyC to Pyro. If all has been good, I think we can finally close this issue.
Sorry for the late reply. I've been busy with a lot of things, just not my curing chamber. On top of that I'm currently having VNC issues with my R-Pi, but I had a good chance to test it after you implemented it. After switching to Pyro4 it got a lot better, as you have seen in the screenshots I posted earlier. After the update to Pyro5 I was under the impression that it got a bit less stable than with version 4, but still a lot better. The slightly worse results might also have something to do with there being fewer products in the fridge.
Time to close this one! I finally got around to sorting out my VNC issues and monitored the behaviour for a short while. Looking good! Thank you for all the effort that went into this!
Mycodo Issue Report:
Problem Description
PID issues
Errors
Once in a while I have to restart the Mycodo backend because my PID controllers get messed up.
Steps to Reproduce the issue:
I can't reproduce it in a specific way, it just happens once in a while.
After rebooting the backend it starts running properly again.
Screenshot:
You can clearly see the strange behaviour before rebooting the backend (just after 9:30) and correct behaviour afterwards.
P.S. I don't know if this info is relevant to the issue, but I have 5 PID controllers in total:
Cooling (Fridge)
Heating (Heater, separate from cooling because it needs very different settings)
RV-1 (Cooling to lower humidity)
RV-2 (Dehumidifier to lower humidity)
RV+ (Humidifier, separate from both RV- controllers because it needs very different settings)
I use both the dehumidifier and the fridge for lowering RV because it produces an overall better result.