PINE64: activated 1.2GHz and 1.34 GHz frequency steps #3
Conversation
Nice, thanks. I assume it is stable - do you have a heatsink?
It is stable when you start throttling 1.34 GHz at 65°C (0x41).
Please be aware of http://linux-sunxi.org/Hardware_Reliability_Tests#Reliability_of_cpufreq_voltage.2Ffrequency_settings Unfortunately it's not that easy to run this set of tests with newer Allwinner SoCs and this budget cooling stuff, since cpufreq-ljt-stress-test doesn't recognize when throttling happens (you'll see it only if frequencies in the normal range are SKIPPED, since budget cooling does not allow switching to the next cpufreq the script tries to test).

BTW: On Debian/Ubuntu this automates the RPi-Monitor installation completely, including a template for A64: http://kaiser-edv.de/tmp/4U4tkD/install-rpi-monitor-for-a64.sh

@umiddelb please don't fool yourself: the issue is undervoltage that might occur under load peaks and then lead to data corruption at a point where throttling doesn't kick in since the temperature is still too low. This can only be tested with the approach above, and especially the upper dvfs operating points need a strong fan to really test through.
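The throttling-detection gap described above (you only notice budget cooling when requested frequencies get skipped) could be checked after a test run roughly like this. This is a sketch; `find_skipped_freqs` and the sample values are hypothetical, only the idea of comparing requested vs. actually reached cpufreq steps comes from the comment above:

```python
def find_skipped_freqs(requested, achieved):
    """Return requested cpufreq steps (kHz) that were never actually
    reached during the run -- a hint that budget cooling silently
    throttled them instead of letting the stress test exercise them."""
    reached = set(achieved)
    return [f for f in requested if f not in reached]

# Hypothetical log of a stress-test run: 1152 MHz and above were
# requested, but the SoC was kept at 1008 MHz by budget cooling.
requested = [480000, 720000, 1008000, 1152000, 1344000]
achieved = [480000, 720000, 1008000, 1008000, 1008000]
print(find_skipped_freqs(requested, achieved))  # [1152000, 1344000]
```

In practice the `achieved` list would come from sampling `scaling_cur_freq` while the test runs.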
@ThomasKaiser Understood, AW introduces operating points with inappropriate voltage settings and disables them by limiting maximum frequency in budget cooling. The
Strange. Does
help?
@umiddelb - the cpufreq-ljt-stress-test tools work just fine for me - just install "libjpeg-turbo-progs" (on Ubuntu) to avoid having to compile those tools and you should be fine.
So far I am not able to get it running at 1.34 GHz - the Pine freezes instantly when 1.34 GHz is enabled as /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq and cpuburn-a53 is started - I assume this is undervoltage, and I am using a 2A power supply only. Also, I did not attach a heat sink or fan.
Regarding these power issues: it's a well known fact that many/most USB cables are insufficient (resistance way too high) and the Micro USB receptacle is also limited to 1.8A max by spec. I had some hope the PSU for the RPi 3 (rated 2.5A) could do some magic, but it looks like the RPi foundation simply uses both data and power lines on the new RPi+PSU to overcome the 1.8A current limitation (at least the cable should be better and prevent undervoltage situations more often).

With older Allwinner PMICs like the AXP209 (or the enhanced drivers for them) it was possible to easily read this out and get a clue what's going on, since undervoltage only happens when consumption increases (you think "Alright, my PSU has too little amperage!" but in reality the voltage drop shut your board down).

A different thing is VDD_CPUX as set in the dvfs table, and this seems to be @longsleep's problem now, since the dvfs table defines 1300mV @ 1200MHz as the upper value (with the cooler table limiting to 1152 MHz). In the upper region you need way more voltage if you increase cpufreq just a little bit. Regarding possible values, all we have (as usual) are comments in fex files:
I didn't have a look in the A64's datasheet whether a maximum VDD_CPUX is mentioned (for H3 it's 1400mV), but since the Pine64 already suffers from both DC-IN problems due to the lack of a barrel plug and from overheating, I would refrain from entering this region if it's about stability/reliability. You can try to fulfill Pine64's marketing claims (1.2GHz) by trying to get 1200MHz stable, and leave everything else to the overclocker community (this does exist, have a look in the Orange Pi forums; there are people that happily fry their boards at 1.5V using fans and large heatsinks, idling at 1536MHz for no reason).

But if you want even only the 1200 MHz confirmed to work reliably while choosing a dvfs operating point that isn't too high (since both consumption and temperature will increase heavily above 1.3V!) then good luck. Best practice would be to heat up the SoC (cpuburn-a8/-a53) for at least 5 minutes before starting cpufreq-ljt-stress-test, and only if you manage to prevent throttling this way with some safety headroom can you be sure that throttling won't occur when cpufreq-ljt-stress-test later runs on top (hard to achieve without fan+heatsink). So even 'unlocking' the 1.2GHz would already require huge efforts! Funny enough: all this was predictable by looking into the first leaked A64 BSP :)
Thanks, I'm running the tests now ... |
You need to run cpuburn-a53 at the same time. |
I would try it with cpuburn-a8 first, since cpuburn-a53 is so efficient at putting stress on the CPU cores that I doubt you're able to even reach 1008MHz (throttling will prevent this, CPU cores get killed, or the board deadlocks when jumping from 480 MHz to 1008MHz). I doubt cpufreq-ljt-stress-test can work with the Pine64 as intended; the operating points should be checked in ascending order (and then checking for throttling and active CPU cores is still missing, as discussed yesterday in IRC). cpufreq-ljt-stress-test was a quick hack for the A10 back when throttling and SMP weren't a problem.
And please keep in mind: the whole idea behind cpufreq-ljt-stress-test is to check whether VDD_CPUX for a specific dvfs operating point is sufficient or not. That means putting as much reasonable stress as possible on the SoC and then starting this JPEG game to detect data corruption. Unfortunately cpuburn-a53 is a possible use case for software that makes heavy use of NEON stuff. But given that running cpuburn-a53 at 816MHz with an applied heatsink already leads to killed CPU cores since temperatures exceeded thresholds, I doubt you're able to get any usable results from cpufreq-ljt-stress-test for higher cpufreqs without a) liquid cooling and b) soldering a sane DC-IN solution when running cpuburn-a53 in parallel (ssvb said his consumption increased by 1100mA at 812MHz/1.1V when running cpuburn-a53; 1008MHz/1.2V might already exceed the Micro USB max rating of 1800mA).
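A back-of-the-envelope check of that Micro USB concern, starting from ssvb's measured +1100mA at 812MHz/1.1V. This is a rough sketch under two assumptions not stated above: dynamic power scales roughly with VDD²·f, and the 5V regulator efficiency stays constant, so input current scales the same way:

```python
# Measured extra input current when cpuburn-a53 runs (from the
# comment above): +1100 mA at 812 MHz / 1.1 V.
i_extra_ma = 1100.0

def scale_current(i_ma, v0, f0_mhz, v1, f1_mhz):
    """Scale an input-current delta between two dvfs operating points,
    assuming dynamic power ~ VDD^2 * f and constant supply efficiency."""
    return i_ma * (v1 ** 2 * f1_mhz) / (v0 ** 2 * f0_mhz)

i_1008 = scale_current(i_extra_ma, 1.1, 812, 1.2, 1008)
print(round(i_1008))  # ~1625 mA extra -- on top of idle consumption,
                      # close to the 1800 mA Micro USB ceiling already
```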
And please allow a final comment. A good read is ssvb's explanation of how he came up with the default dvfs settings for Orange Pi PC: https://groups.google.com/d/msg/linux-sunxi/20Ir4It3GsA/TPT9wv8IAQAJ

You should also always keep in mind that these dirt-cheap SoCs are subject to production tolerances, and no selection process like with Intel CPUs happens afterwards. So while my board might run stable with specific dvfs settings at 1344MHz, @longsleep's deadlocks and @umiddelb's corrupts data silently. This is stuff the overclocker camp doesn't care about, but since the settings currently being developed here will most likely be the default on every OS image that relies on kernel 3.10.x, it's somewhat important to stay safe.

It's up to the user in question to apply different dvfs settings if he knows what he's doing. For example, if I know that I need as little consumption as possible, won't connect a display (it plays a role whether the display engine is active or not!), limit the board to 480 MHz scaling_max_cpufreq, and know that my code doesn't make use of NEON, then after some extensive testing I might come up with dvfs operating points using significantly less VDD_CPUX voltage (like e.g. these settings for OPi PC). But these are things only the user in question can decide after testing his device individually -- the defaults should work on every board.
TL;DR: I would suggest only unlocking the 1200MHz and adding one additional dvfs operating point: 1200MHz @ 1340mV. And if it's not possible to test this setting with at least 4 cpuburn-a8 threads running in parallel, then revert back to defaults (letting the Pine64 people explain why their 1.2GHz board only runs at 1152 MHz max ;) )
Thanks for all the details! I definitely want defaults in the images which work reliably and without losing CPU cores when doing something. If you guys come up with an updated DTS please feel free to send a PR. I will continue testing tomorrow night.
No PR to be expected from me ;) I just wanted to point out what has to be considered when allowing higher clockspeeds. The process to develop a sane dvfs table is time consuming: you start with reasonable defaults, decrease the voltage step by step until problems occur, and then add some mV for safety headroom. And then search for others to test with these settings. And as already written: with the currently available set of tools it's impossible to test, since deadlocks will occur for sure when switching from the lowest to the highest cpufreq settings. Unless this is fixed I would revert back to 1152 (or maybe 1200 -- IMO that's the highest value possible without risking stability/reliability... that still needs confirmation regarding dvfs settings)
So when I try to run cpuburn-a53 @ 816MHz everything is fine; running cpuburn-a53 @ 1008MHz I get
cooling stage 5 (>90°C) was reached within a couple of seconds. With the two remaining cores cpuburn-a53 produces 64°C, cpuburn-a8 44°C. Will run the tests with cpuburn-a8 now. How can I bring up killed CPU cores without rebooting?
CPU temperature stays around 60°C |
CPU temperature touches 65°C without triggering cooling stage 1. |
Concerning the power supply, the PINE64 is connected to an Apple iPad USB power supply (2A rated) (and draws some extra power through the UART connector ;) |
Thanks. I'm using the UART connector on the EXP bus now. For the tests I needed to raise the trigger temperature for stage 1 from 65°C to 68°C, leaving the rest untouched.
Nice, thanks for testing! Can you share the rest of your exact setup? Do you have a heat sink? And just for confirmation, you are running cpuburn-a8 the whole time?
Yes, I'm using this heatsink with a thermal pad. Unfortunately there are no mounting holes to press the heatsink down onto the board. The heatsink covers the SoC and the DRAM chips, but not the power converter (one half of the area marked with the white outline on the board). And yes, cpuburn-a8 was running the whole time, but building the kernel with make -j4 produces more stress.
Ok, if kernel compilation is more intensive then cpuburn-a8 is insufficient :( Did you add another dvfs operating point or do you still use 1.3V as max voltage? BTW: When testing the A83T I realised way too late that cpuburn wasn't running on all cores. Did you check that too? 4 threads?
BTW: My first attempt to use new dvfs operating points failed because I forgot that I had exchanged my build host in the meantime (shell history is great, but sometimes...). I changed settings on host B but still copied over u-boot-with-dtb.bin from host A...
Hmm... I tried to increase the 1300mV to 1360mV for DCDC2/DCDC3 in u-boot's drivers/power/sunxi/axp81X_supply.c but it's still not working and I get ARISC errors. Maybe I did something wrong again. Will now try my luck with
and higher thermal trip points instead to see whether I get Linpack running without throttling and without failure above 1152 MHz.
With these dvfs settings 1296MHz passed (but throttling down to 1248MHz happened for a few moments) and as expected 1344MHz failed again: http://pastebin.com/QEBXCgZz Since we (and the average user even more so) are already running into throttling/thermal issues, I would refrain from increasing VDD_CPUX above 1.3V and consider any clockspeed above 1.2GHz as overclocking and irrelevant for now. At least this Linpack run seems useful for reliability testing to me.

I made another q&d test: set the maximum cpufreq to 720MHz (to increase the runtime of Linpack), then let cpuminer run for a few minutes, stopped it and immediately started Linpack, and immediately afterwards started cpuburn-a53 (pkill minerd && bin/OpenBLAS/xhpl && cpuburn-a53): cpuminer 39°C, Linpack 46°C, cpuburn-a53 53°C.
BTW: I let cpuminer and Linpack also run on an Orange Pi PC (H3, 1296MHz): 2.35 khash/s and 1.73 Gflops on the H3 vs. 3.9 khash/s and 3.4 Gflops with the A64 running at the same clockspeed.
I tried 3 different dvfs settings with voltages reduced by 20mV each to get an idea regarding safety headroom (using this small script... and to most probably end up at Allwinner's current defaults when adding a few mV back to ensure reliable operation everywhere ;)
And then I let Linpack test the specific dvfs operating points (the only task running except for a few monitoring background tasks and unfortunately a whole running X-Windows that might be responsible for the result variations). I would say Linpack might be able to detect undervoltage situations (at least it's convenient to get a failed validation rather than a deadlock as proof that undervoltage occurred)? My next step would be to create an optimised dvfs table at the lower limit and test it with just a heatsink, and also on the other board that doesn't even have a heatsink. Then add some mV to each operating point and let cpuburn-a53 try to kill the board. Any thoughts? Edit: already testing the last run and ending up with the following:
So +60mV for each operating point later, ignoring anything above 1200MHz, and letting cpuburn-a53 do the job... (note to myself: test also with a connected HDMI display since at least with Orange Pi PC that made a huge difference)
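The sweep described above ("reduce every operating point by 20mV per run, then add headroom back") could be scripted roughly like this. A sketch only; the operating-point values below are illustrative placeholders, not the actual A64 dvfs table:

```python
# Illustrative dvfs operating points: (MHz, mV). Not the real table.
baseline = [(480, 1040), (720, 1100), (1008, 1200), (1200, 1300)]

def reduce_voltages(table, step_mv):
    """One sweep iteration: lower every operating point by step_mv."""
    return [(mhz, mv - step_mv) for mhz, mv in table]

def add_headroom(table, margin_mv):
    """After finding the lowest voltages that still pass Linpack,
    add a safety margin to every operating point."""
    return [(mhz, mv + margin_mv) for mhz, mv in table]

# Three test runs at -20 mV each, then +60 mV headroom: as suspected
# in the comment above, that lands back at the starting defaults.
table = baseline
for _ in range(3):
    table = reduce_voltages(table, 20)
final = add_headroom(table, 60)
print(final == baseline)  # True
```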
One question: I'm unsure whether the number of cooling states should match the number of trip points and the number of binds?
No idea. But at least trip points and binds correlate and define stuff for both CPU and GPU (see the real node names |
Concerning trip points and binds, yes; but cooling stages 7-9 are very unlikely to be reached (stage 6 already shuts down the device (init 0)).
I have an idea how it's meant to be. The first four trip points control the CPU frequency, and each point refers to a number of stages to be conducted if the temperature is still above the trigger temperature.
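That reading of the mechanism could be modelled like this. A sketch under the assumption that each trip point maps to a maximum allowed cpufreq; the temperatures and frequencies are illustrative, not the actual values from the DTS:

```python
# Illustrative trip points: (trigger temperature in °C, cpufreq cap in kHz).
# The real values live in the thermal zone nodes of the DTS.
TRIP_POINTS = [
    (70, 1152000),
    (80, 1008000),
    (90, 720000),
    (100, 480000),
]
CRITICAL_C = 105  # reaching this triggers a shutdown

def max_freq_for(temp_c):
    """Walk the trip points in order and return the resulting cpufreq
    cap, or None if the critical temperature was reached (shutdown)."""
    if temp_c >= CRITICAL_C:
        return None
    allowed = 1344000  # unthrottled below the first trip point
    for trigger, cap in TRIP_POINTS:
        if temp_c >= trigger:
            allowed = cap
    return allowed

print(max_freq_for(65))   # 1344000 -- no throttling yet
print(max_freq_for(92))   # 720000  -- third trip point active
print(max_freq_for(106))  # None    -- critical, shut down
```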
Hmm... I was solely focused on undervoltage today, trying to get an idea whether Linpack fail/pass depends on temperature or on undervoltage instead. Testing the dvfs settings above with a connected display but still heatsink+fan doesn't show a difference. 1296MHz @ 1300mV isn't stable but otherwise it looks ok: http://pastebin.com/eBTkPAyG

I then tried it with the fan switched off and just a heatsink, 1st try with 1152MHz (shutdown when the SoC temp reached 100°C pretty soon), then with 1056MHz (the same), then with 1008MHz: http://pastebin.com/wZE3sqRU (SoC temp close to 100°C with both 1008MHz and 960MHz). But at least it seems that stability/reliability issues are more related to undervoltage than to temperature (to be confirmed).

Next try with the other board without a heatsink at all: I changed the order to start from 480MHz and increase clockspeeds, since it's obvious that Linpack is way too demanding for the A64 without improved heat dissipation: http://pastebin.com/8PF8QL0S (board powered off 20 seconds after starting 912MHz @ 1060mV @ 105°C)

From the results I would believe the Linpack bench can be used to detect undervoltage pretty well. But I wonder how to proceed. If we increase every dvfs operating point by 60mV, then a Pine64+ without heatsink won't even survive such a run at 720MHz? Based on the above, killing CPU cores rather quickly makes a bit more sense... but maybe I just messed up the whole thermal configuration?
Most likely it depends on both. Have you read http://asic-soc.blogspot.fi/2008/03/process-variations-and-static-timing.html ? We can probably describe it with something like the following formula:
where And as you have already observed with the
Why hasn't it throttled to a much lower clock frequency? It seems like it had enough time to do this (20 seconds is a lot).
It is reasonable to add very low clock frequencies to the cooling table, even if we can't reduce the voltage any more. For example, downclocking from 648MHz @1.04V to something like 312MHz @1.04V would still roughly cut the power consumption in half under heavy load (because less work is getting done per second). We just need to figure out what would be a safe CPU clock frequency where 4 cores are able to run cpuburn-a53 safely without overheating (and shutting down) on a device without a heatsink. Even if we have to go as low as something under 100MHz for this.
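That "roughly half" figure follows directly from the dynamic power relation P ≈ α·C·VDD²·f. A minimal check (ignores leakage, which stays constant at the same voltage; α·C is folded into a single arbitrary factor):

```python
def dynamic_power(vdd, f_mhz, alpha_c=1.0):
    # P_dyn = alpha * C_L * VDD^2 * f_clk, with alpha * C_L as one factor
    return alpha_c * vdd ** 2 * f_mhz

p_full = dynamic_power(1.04, 648)  # 648 MHz @ 1.04 V
p_half = dynamic_power(1.04, 312)  # 312 MHz @ 1.04 V
print(round(p_half / p_full, 2))   # 0.48 -- roughly half
```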
My suspicion is that Linpack does a few first steps that aren't that demanding and then load/consumption/temperature 'explode' immediately. But more likely I just adjusted the thermal settings wrong (tried to define higher trip points to let throttling kick in later). And I have to admit that I still don't understand the relationships, since I didn't read the links you posted (next weekend). Unfortunately I also can't glance at the power meter, since I put the whole setup in another room because the fan is really annoying. This is how the temperatures developed (only clockspeed and temperature are correct):
Tried again with @longsleep's settings and it's obvious I f*cked up throttling. With the very same hardware setup but his most recent settings, throttling prevents an emergency shutdown when I run the Linpack test with just a heatsink :) So all my tests were rather useless (except maybe for the relationship to undervoltage). EDIT: And the Pine suddenly died
Ok, this might be the other issue; I'll try to get a clue whether this might be AXP803 related. Next weekend...
Mhm, seeing it die at well below 95°C is bad - I could not reproduce this yet, but it makes me reconsider increasing all the trip points by 5°C.
The log looks like both the CPU clock speed and the temperature kept increasing in the end. Either something went really wrong in the budget cooling logic, or both the CPU frequency and the temperature were swinging up and down wildly and got sampled this way purely accidentally. Based on your previous logs, it seems like something is shutting down the CPU upon reaching 105°C. In the H3 FEX file, this 105°C limit used to be configured explicitly. I can't see it offhand in the DTS file or in the kernel sources, but I did not try to search for it too hard.
See the comment above. I suspect that it also got automatically shut down upon reaching 105°C, but this was just not logged properly.
Or in fact what we get is that 100°C is labelled as "critical" and causes shutdowns. Maybe it is worth moving the last critical trip point a bit higher relative to the previous one, in order to allow a bit of temperature overshoot and have a bit more time to respond?
Nope, the issues I had before were happening above 100°C. I had this other issue (frequent power-offs) the last few days and tried to reproduce it (to no avail). I thought maybe fast switching between different dvfs operating points might overburden the PMU. I put a heatsink on the PMU of the Pine that already wears heatsinks and it stopped (no idea why -- maybe unrelated). Today I tried it with the other Pine with no heatsinks, let cpuminer run together with a simple script that always jumps between 720 and 1056. Ran for hours at upper temperatures. So I thought maybe the AXP803 on the other Pine has a problem: there I jump between 720 and 1156, but also to no avail. Runs for hours now... no idea...

@longsleep I second ssvb's suggestion to increase the critical trip point. Dumb question: does reaching this critical point initiate an immediate power-off or a clean shutdown?
@ThomasKaiser clean shutdown, at least from the kernel PoV. See https://github.com/longsleep/linux-pine64/blob/pine64-hacks-1.2/drivers/thermal/thermal_core.c#L348-L368
BTW, am I the only one here who finds the decompiled dts file with hex numbers and raw phandles instead of symbolic node references rather unreadable? As for the thermal framework dts bindings documentation, it can be found here. The last two CPU related cooling map nodes (bind2 and bind3) look rather suspicious because they reference multiple
@ssvb once things have settled down I will merge the changes back into the device tree sources in the kernel tree and we'll get readable stuff. I'd also happily accept a PR which changes the format :) Just did not find the time and the passion to change it myself. Regarding the multiple states: I added them similar to what I saw in the original file, but I have not researched what this actually does.
Correction: I did actually research it a bit
But I did not check what the code is doing with the values.
To add some basic info that was confusing in the ASIC blog link: supply voltage scaling was initially adopted for switching power reduction. It has proven to be an effective method for power reduction because switching power depends quadratically on the supply voltage. Leakage current is also exponentially dependent on VDD, so in current process nodes it has become more effective to also reduce power in standby mode.

Dynamic Power = α · CL · VDD² · fclk; Leakage Power = VDD · Isub(VDD), with Isub depending exponentially on VDD.

High temps effect:
Then my recommendation as an IC designer is to find the lowest supply voltage possible per frequency per IC sample at 30°C. We don't know what process or techniques were used in these cheap ASICs from the Chinese DH, but if Adaptive Body Biasing were available... man!
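The two power terms stated above can be combined into a toy consumption model. A sketch only: the fit parameters `alpha_c`, `i0` and `k` are made up for illustration (real values would have to be measured per chip), and temperature dependence is ignored:

```python
import math

def total_power(vdd, f_mhz, alpha_c=0.5, i0=0.01, k=5.0):
    """Toy model: P = dynamic + leakage.
    Dynamic: alpha*C * VDD^2 * f          (quadratic in VDD)
    Leakage: VDD * Isub, Isub ~ I0*e^(k*VDD) (exponential in VDD)
    alpha_c, i0, k are illustrative, not measured values."""
    p_dyn = alpha_c * vdd ** 2 * f_mhz
    p_leak = vdd * i0 * math.exp(k * vdd)
    return p_dyn + p_leak

# Raising VDD from 1.1 V to 1.3 V at the same clock increases total
# power far more than the ~18% voltage bump alone would suggest:
ratio = total_power(1.3, 1008) / total_power(1.1, 1008)
print(round(ratio, 2))  # ≈ 1.40
```

This is why the upper dvfs operating points discussed in this thread heat the SoC so disproportionately.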