PINE64: activated 1.2GHz and 1.34 GHz frequency steps #3

Merged
merged 1 commit into from Mar 1, 2016

Conversation

umiddelb
Contributor

@umiddelb umiddelb commented Mar 1, 2016

No description provided.

@longsleep
Owner

Nice thanks. I assume it is stable - do you have a heatsink?

longsleep added a commit that referenced this pull request Mar 1, 2016
PINE64: activated 1.2GHz and 1.34 GHz frequency steps
@longsleep longsleep merged commit 989d615 into longsleep:master Mar 1, 2016
@umiddelb
Contributor Author

umiddelb commented Mar 1, 2016

It is stable when you start throttling 1.34 GHz at 65°C (0x41).
I'm using a passive heatsink which I've tailored for the ODROID-XU4, before the test sample from HK arrived.
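
The 0x41 quoted above is simply 65 written in hexadecimal. A minimal Python sketch for converting between a whole-degree trip temperature and that hex notation; the function names are illustrative and not part of any kernel tooling:

```python
def celsius_to_hex(temp_c: int) -> str:
    """Return a whole-degree temperature as a hex literal, e.g. 65 -> '0x41'."""
    if temp_c < 0:
        raise ValueError("negative temperatures are not handled in this sketch")
    return hex(temp_c)


def hex_to_celsius(value: str) -> int:
    """Inverse of celsius_to_hex, e.g. '0x41' -> 65."""
    return int(value, 16)


print(celsius_to_hex(65))
print(hex_to_celsius("0x44"))
```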

@ThomasKaiser
Contributor

Please be aware of http://linux-sunxi.org/Hardware_Reliability_Tests#Reliability_of_cpufreq_voltage.2Ffrequency_settings

Unfortunately it's not that easy to run this set of tests on newer Allwinner SoCs with this budget cooling stuff, since cpufreq-ljt-stress-test doesn't recognize when throttling happens. You'll only notice it when frequencies in the normal range are reported as SKIPPED, because budget cooling did not allow switching to the next cpufreq the script tried to test.

BTW: On Debian/Ubuntu this automates RPi-Monitor installation including template for A64 completely: http://kaiser-edv.de/tmp/4U4tkD/install-rpi-monitor-for-a64.sh

@umiddelb please don't fool yourself: The issue is undervoltage that might occur under load peaks and then leads to data corruption at a point where throttling doesn't jump in since temperature is too low. This can only be tested with the approach above and especially the upper dvfs operating points need a strong fan to be able to really test through.

@umiddelb
Contributor Author

umiddelb commented Mar 1, 2016

@ThomasKaiser Understood, AW introduces operating points with inappropriate voltage settings and disables them by limiting maximum frequency in budget cooling.

The cpufreq-ljt-stress-test tests don't run out of the box ...

The cjpeg and djpeg tools from libjpeg-turbo are not found.
Trying to download and compile them.
Extracting libjpeg-turbo-1.3.1.tar.gz ... done
Compiling libjpeg-turbo, please be patient ..../config.guess: unable to guess system type

This script, last modified 2004-09-07, has failed to recognize
the operating system you are using. It is advised that you
download the most up to date version of the config scripts from

    ftp://ftp.gnu.org/pub/gnu/config/

If the version you run (./config.guess) is already up to date, please
send the following data and any information you think might be
pertinent to <config-patches@gnu.org> in order to provide the needed
information to handle your system.

config.guess timestamp = 2004-09-07

uname -m = aarch64
uname -r = 3.10.65+
uname -s = Linux
uname -v = #1 SMP PREEMPT Tue Mar 1 16:23:59 CET 2016

/usr/bin/uname -p =
/bin/uname -X     =

hostinfo               =
/bin/universe          =
/usr/bin/arch -k       =
/bin/arch              =
/usr/bin/oslevel       =
/usr/convex/getsysinfo =

UNAME_MACHINE = aarch64
UNAME_RELEASE = 3.10.65+
UNAME_SYSTEM  = Linux
UNAME_VERSION = #1 SMP PREEMPT Tue Mar 1 16:23:59 CET 2016
configure: error: cannot guess build type; you must specify one
 failed
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for gawk... no
checking for mawk... mawk
checking whether make sets $(MAKE)... yes
checking for style of include used by make... GNU
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ANSI C... none needed
checking dependency style of gcc... gcc3
checking how to run the C preprocessor... gcc -E
checking for gcc... (cached) gcc
checking whether we are using the GNU C compiler... (cached) yes
checking whether gcc accepts -g... (cached) yes
checking for gcc option to accept ANSI C... (cached) none needed
checking dependency style of gcc... (cached) gcc3
checking whether gcc and cc understand -c and -o together... yes
checking for a BSD-compatible install... /usr/bin/install -c
checking build system type...

@ThomasKaiser
Contributor

Strange. Does

wget -O config.guess 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD'
wget -O config.sub 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD'

help?

@longsleep
Owner

@umiddelb - the cpufreq-ljt-stress-test tools work just fine for me - Just install "libjpeg-turbo-progs" (on Ubuntu) to avoid having to compile those tools and you should be fine.

@longsleep
Owner

So far I am not able to get it running at 1.34 GHz: the Pine freezes instantly when 1.34 GHz is set as /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq and cpuburn-a53 is started. I assume this is undervoltage, and I am only using a 2A power supply. Also, I did not attach a heat sink or fan.

@ThomasKaiser
Contributor

Regarding these power issues: it's a well-known fact that many/most USB cables are insufficient (resistance way too high), and the Micro USB receptacle is also limited to 1.8A max by spec. I had some hope the PSU for the RPi 3 (rated 2.5A) could do some magic, but it looks like the RPi foundation simply uses both data and power lines on the new RPi+PSU to overcome the 1.8A current limitation (at least the cable should be better and prevent undervoltage situations more often).

With older Allwinner PMICs like AXP209 (or the enhanced drivers for them) it was possible to easily read this out and get a clue what's going on since undervoltage only happens when consumption increases (you think "Alright, my PSU has less amperage!" but in reality the voltage drop shut your board down)
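
The reason the voltage only sags under load follows from Ohm's law: the drop across cable and connector grows with the current drawn. A rough Python sketch; the 5.0 V supply and the 0.3 ohm total resistance are assumed example values, not measurements from this thread:

```python
def voltage_at_board(supply_v: float, current_a: float, cable_ohm: float) -> float:
    """Voltage left at the board after the drop across cable and connector (V = I * R)."""
    return supply_v - current_a * cable_ohm


# Assumed example values: 5.0 V supply, 0.3 ohm total cable/connector resistance.
for amps in (0.5, 1.0, 1.8):
    print(f"{amps:.1f} A -> {voltage_at_board(5.0, amps, 0.3):.2f} V at the board")
```

With these assumed numbers, the same cable that looks fine at idle loses more than half a volt at the 1.8A limit of the Micro USB receptacle.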

A different thing is VDD_CPUX as set in the dvfs table, and this seems to be @longsleep's problem now, since the dvfs table defines 1300mV @ 1200MHz as the upper value (with the cooler table limiting to 1152 MHz).

In the upper region you need way more voltage if you increase cpufreq just a little bit. Regarding possible values all we have (as usual) are comments in fex files:

; lv1: core vdd is 1.30v if cpu frequency is (1104Mhz, 1152Mhz]
; lv2: core vdd is 1.26v if cpu frequency is (1008Mhz, 1104Mhz]
; lv3: core vdd is 1.20v if cpu frequency is (816Mhz,  1008Mhz]
; lv4: core vdd is 1.10v if cpu frequency is (648Mhz,   816Mhz]
; lv5: core vdd is 1.04v if cpu frequency is (480Mhz,   648Mhz]
; lv6: core vdd is 1.04v if cpu frequency is (480Mhz,   648Mhz]
; lv7: core vdd is 1.04v if cpu frequency is (480Mhz,   648Mhz]
; lv8: core vdd is 1.04v if cpu frequency is (480Mhz,   648Mhz]
;
;----------------------------------------------------------------------------------
[dvfs_table]
;extremity_freq = 1344000000

I didn't have a look in the A64's datasheet whether maximum VDD_CPUX is mentioned (for H3 it's 1400mV), but since the Pine64 already suffers from both DC-IN problems due to lack of a barrel plug and overheating, I would refrain from entering this region if it's about stability/reliability. You can try to fulfill Pine64's marketing claims (1.2GHz) by trying to get 1200MHz stable, and leave everything else to the overclocker community (this does exist, have a look in the Orange Pi forums, there are people that fry their boards happily at 1.5V using fans and large heatsinks and idling at 1536MHz for no reason).

But if you want even only the 1200 MHz confirmed to work reliably while choosing a dvfs operating point that isn't too high (since both consumption and temperature will increase heavily when leaving the 1.3V!) then good luck. Best practice would be to heat up the SoC (cpuburn-a8/-a53) at least 5 minutes before starting cpufreq-ljt-stress-test and only if you manage to prevent throttling this way with some safety headroom you can be sure that throttling won't occur when cpufreq-ljt-stress-test later runs on top (hard to achieve without fan+heatsink). So even 'unlocking' the 1.2GHz would already require huge efforts!

Funny enough: All this was predictable by looking into the first leaked A64 BSP :)

@umiddelb
Contributor Author

umiddelb commented Mar 2, 2016

@umiddelb - the cpufreq-ljt-stress-test tools work just fine for me - Just install "libjpeg-turbo-progs" (on Ubuntu) to avoid having to compile those tools and you should be fine.

Thanks, I'm running the tests now ...
It seems that all CPU cores are tested individually, in this case the CPU temp stays below 50°C (even with 1.34 GHz).

@umiddelb
Contributor Author

umiddelb commented Mar 2, 2016

debian@p64:~/github/cpuburn-arm$ sudo ./cpufreq-ljt-stress-test
CPU stress test, which is doing JPEG decoding by libjpeg-turbo
at different cpufreq operating points.

Testing CPU 0
 1344 MHz ............................................................ OK
 1200 MHz ............................................................ OK
 1152 MHz ............................................................ OK
 1104 MHz ............................................................ OK
 1008 MHz ............................................................ OK
  816 MHz ............................................................ OK
  720 MHz ............................................................ OK
  600 MHz ............................................................ OK
  480 MHz ............................................................ OK

Testing CPU 1
 1344 MHz ............................................................ OK
 1200 MHz ............................................................ OK
 1152 MHz ............................................................ OK
 1104 MHz ............................................................ OK
 1008 MHz ............................................................ OK
  816 MHz ............................................................ OK
  720 MHz ............................................................ OK
  600 MHz ............................................................ OK
  480 MHz ............................................................ OK

Testing CPU 2
 1344 MHz ............................................................ OK
 1200 MHz ............................................................ OK
 1152 MHz ............................................................ OK
 1104 MHz ............................................................ OK
 1008 MHz ............................................................ OK
  816 MHz ............................................................ OK
  720 MHz ............................................................ OK
  600 MHz ............................................................ OK
  480 MHz ............................................................ OK

Testing CPU 3
 1344 MHz ............................................................ OK
 1200 MHz ............................................................ OK
 1152 MHz ............................................................ OK
 1104 MHz ............................................................ OK
 1008 MHz ............................................................ OK
  816 MHz ............................................................ OK
  720 MHz ............................................................ OK
  600 MHz ............................................................ OK
  480 MHz ............................................................ OK

Overall result : PASSED

@longsleep
Owner

You need to run cpuburn-a53 at the same time.

@ThomasKaiser
Contributor

I would try it with cpuburn-a8 first, since cpuburn-a53 is so efficient at putting stress on the CPU cores that I doubt you're able to even reach 1008MHz (since throttling will prevent this, CPU cores get killed, or the board deadlocks when jumping from 480 MHz to 1008MHz).

I doubt cpufreq-ljt-stress-test can work with Pine64 as intended, the operating points should be checked in ascending order (and then still checking for throttling and active CPU cores is missing as discussed yesterday in IRC)

(cpufreq-ljt-stress-test was a quick hack for A10 back then when throttling and SMP wasn't a problem)

@ThomasKaiser
Contributor

And please keep in mind: The whole idea behind cpufreq-ljt-stress-test is to check whether VDD_CPUX for a specific dvfs operating point is sufficient or not.

That means putting as much reasonable stress as possible on the SoC and then starting this JPEG game to detect data corruption. Unfortunately cpuburn-a53 is a possible use case for software that makes heavy use of NEON stuff. But given that running cpuburn-a53 at 816MHz with a heatsink applied already leads to killed CPU cores because temperatures exceeded thresholds, I doubt you're able to get any usable results from cpufreq-ljt-stress-test for higher cpufreqs when running cpuburn-a53 in parallel without a) liquid cooling and b) soldering a sane DC-IN solution (ssvb said his consumption increased by 1100mA at 812MHz/1.1V when running cpuburn-a53. 1008MHz/1.2V might even exceed the Micro USB max rating of 1800mA already).

@ThomasKaiser
Contributor

And please allow a final comment. A good read is ssvb's explanation how he came up with default dvfs settings for Orange Pi PC: https://groups.google.com/d/msg/linux-sunxi/20Ir4It3GsA/TPT9wv8IAQAJ

You should also always keep in mind that these dirt-cheap SoCs are subject to production tolerances and no selection process like with Intel CPUs happens afterwards. So while my board might run stable with specific dvfs settings at 1344MHz @longsleep's deadlocks and @umiddelb's corrupts data silently.

This is stuff the overclocker camp doesn't care about but since the settings currently being developed here will most likely be the default on every OS image that relies on kernel 3.10.x it's somewhat important to stay safe.

It's up to the user in question to apply different dvfs settings if he knows what he's doing. For example, if I know that I need as little consumption as possible, won't connect a display (whether the display engine is active or not plays a role!), limit the board to 480 MHz via scaling_max_freq and know that my code doesn't make use of NEON, then after some extensive testing I might come up with dvfs operating points using significantly less VDD_CPUX voltage (like e.g. these settings for OPi PC). But these are things only the user in question can decide after testing his device individually -- the defaults should work on every board.

@ThomasKaiser
Contributor

TL;DR: I would suggest only unlocking the 1200MHz and adding one additional dvfs operating point: 1200MHz @ 1340mV. And if it's not possible to test this setting at least with 4 cpuburn-a8 threads running in parallel, then revert back to defaults (letting the Pine64 people explain why their 1.2GHz board only runs at 1152 MHz max ;) )

@longsleep
Owner

Thanks for all the details! I definitely want defaults in the images which work reliably and without losing CPU cores when doing something. If you guys come up with an updated DTS, please feel free to send a PR. I will continue testing tomorrow night.

@ThomasKaiser
Contributor

No PR to be expected from me ;) I just wanted to point out what has to be considered when allowing higher clockspeeds. The process to develop a sane dvfs table is time consuming: You start with reasonable defaults, decrease the voltage step by step until problems occur and then add some mV for some safety headroom. And then search for others to test with these settings.
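
The procedure described here (start from a reasonable default, lower the voltage step by step until problems occur, then add some mV back) can be sketched as a small search loop. This is only an illustration: `is_stable` is a hypothetical callback standing in for a real stress-test run at a given frequency and voltage, and the step, headroom and floor defaults are assumed values.

```python
from typing import Callable


def find_operating_voltage(
    freq_mhz: int,
    start_mv: int,
    is_stable: Callable[[int, int], bool],
    step_mv: int = 20,      # assumed step size
    headroom_mv: int = 60,  # assumed safety margin
    floor_mv: int = 900,    # assumed lower bound, never go below this
) -> int:
    """Lower the voltage step by step until is_stable() fails, then add headroom."""
    if not is_stable(freq_mhz, start_mv):
        raise ValueError("starting voltage is already unstable")
    lowest_good = start_mv
    while (lowest_good - step_mv >= floor_mv
           and is_stable(freq_mhz, lowest_good - step_mv)):
        lowest_good -= step_mv
    return lowest_good + headroom_mv


# Fake stability check for demonstration: pretend the board needs >= 1180 mV.
def fake_check(freq_mhz: int, mv: int) -> bool:
    return mv >= 1180


print(find_operating_voltage(1200, 1300, fake_check))
```

The result is valid for a single board only; as noted above, it still has to be confirmed by others testing with the same settings.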

And as already written: with the currently available set of tools it's impossible to test, since deadlocks will occur for sure when switching from the lowest to the highest cpufreq setting. Unless this is fixed I would revert back to 1152 (or maybe 1200; IMO that's the highest value possible without risking stability/reliability, though that still needs confirmation regarding dvfs settings).

@umiddelb
Contributor Author

umiddelb commented Mar 3, 2016

So when I try to run cpuburn-a53 @ 816000 kHz everything is fine; running cpuburn-a53 @ 1008000 kHz I get

[86723.275629] CPU Budget hotplug: cluster0 min:0 max:4                         
[86723.281098] CPU Budget hotplug: cluster0 min:0 max:4                         
[86724.269454] CPU Budget hotplug: cluster0 min:0 max:4                         
[86724.761455] CPU Budget hotplug: cluster0 min:0 max:4                         
[86726.237404] CPU Budget:update CPU 0 cpufreq max to 816000 min to 816000      
[86726.246860] CPU Budget hotplug: cluster0 min:0 max:2                         
[86726.266877] CPU Budget:Try to down cpu 3, cluster0 online 4, max 2           
[86726.287138] CPU3: shutdown                                                   
[86726.290133] psci: CPU3 killed.                                               
[86726.295252] CPU Budget:Try to down cpu 2, cluster0 online 4, max 2           
[86726.326476] CPU2: shutdown                                                   
[86726.329473] psci: CPU2 killed.

Cooling stage 5 (>90°C) was reached within a couple of seconds. With the two remaining cores cpuburn-a53 produces 64°C, cpuburn-a8 44°C. Will run the tests with cpuburn-a8 now.
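
Kernel log excerpts like the one above can be summarized mechanically. A minimal Python sketch that relies only on the two message shapes visible in this log ("cpufreq max to ... min to ..." and "psci: CPUn killed"); the function name is illustrative and other BSP versions may word these messages differently:

```python
import re

KILLED_RE = re.compile(r"psci: CPU(\d+) killed")
FREQ_CAP_RE = re.compile(r"cpufreq max to (\d+) min to (\d+)")


def summarize_budget_log(log_text: str) -> dict:
    """Collect killed CPU numbers and the last frequency cap (kHz) from kernel log text."""
    killed = [int(n) for n in KILLED_RE.findall(log_text)]
    caps = FREQ_CAP_RE.findall(log_text)
    last_cap_khz = int(caps[-1][0]) if caps else None
    return {"killed_cpus": killed, "max_freq_khz": last_cap_khz}


sample = """\
[86726.237404] CPU Budget:update CPU 0 cpufreq max to 816000 min to 816000
[86726.290133] psci: CPU3 killed.
[86726.329473] psci: CPU2 killed.
"""
print(summarize_budget_log(sample))
```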

How can I bring up killed CPU cores without rebooting?

@umiddelb
Contributor Author

umiddelb commented Mar 3, 2016

debian@p64:~/github/cpuburn-arm$ sudo ./cpufreq-ljt-stress-test 
CPU stress test, which is doing JPEG decoding by libjpeg-turbo
at different cpufreq operating points.

Testing CPU 0
 1344 MHz SKIPPED
 1200 MHz ............................................................ OK
 1152 MHz SKIPPED
 1104 MHz SKIPPED
 1008 MHz SKIPPED
  816 MHz SKIPPED
  720 MHz SKIPPED
  600 MHz SKIPPED
  480 MHz SKIPPED

Testing CPU 1
 1344 MHz SKIPPED
 1200 MHz ............................................................ OK
 1152 MHz SKIPPED
 1104 MHz SKIPPED
 1008 MHz SKIPPED
  816 MHz SKIPPED
  720 MHz SKIPPED
  600 MHz SKIPPED
  480 MHz SKIPPED

Testing CPU 2
 1344 MHz SKIPPED
 1200 MHz ............................................................ OK
 1152 MHz SKIPPED
 1104 MHz SKIPPED
 1008 MHz SKIPPED
  816 MHz SKIPPED
  720 MHz SKIPPED
  600 MHz SKIPPED
  480 MHz SKIPPED

Testing CPU 3
 1344 MHz SKIPPED
 1200 MHz ............................................................ OK
 1152 MHz SKIPPED
 1104 MHz SKIPPED
 1008 MHz SKIPPED
  816 MHz SKIPPED
  720 MHz SKIPPED
  600 MHz SKIPPED
  480 MHz SKIPPED

Overall result : PASSED

CPU temperature stays around 60°C
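
Since SKIPPED entries are the only visible hint that budget cooling kept the script from switching frequency, it can help to extract them from the output automatically. A small Python sketch that assumes the output layout shown in this thread ("Testing CPU n" followed by "<freq> MHz ... <status>" lines); the function names are illustrative:

```python
def parse_results(output: str) -> dict:
    """Map CPU number -> {frequency in MHz: status word} from the script output."""
    results = {}
    cpu = None
    for line in output.splitlines():
        words = line.split()
        if len(words) == 3 and words[:2] == ["Testing", "CPU"]:
            cpu = int(words[2])
            results[cpu] = {}
        elif cpu is not None and len(words) >= 3 and words[1] == "MHz":
            # The last word on the line is the status (e.g. OK or SKIPPED).
            results[cpu][int(words[0])] = words[-1]
    return results


def skipped_frequencies(results: dict) -> dict:
    """Return, per CPU, the sorted list of frequencies that were not tested."""
    return {
        cpu: sorted(f for f, status in freqs.items() if status == "SKIPPED")
        for cpu, freqs in results.items()
    }


sample = "Testing CPU 0\n 1344 MHz SKIPPED\n 1200 MHz ...... OK\n 1152 MHz SKIPPED\n"
print(skipped_frequencies(parse_results(sample)))
```

An "Overall result : PASSED" with most frequencies skipped, as in the run above, only says something about the operating points that were actually exercised.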

@umiddelb
Contributor Author

umiddelb commented Mar 3, 2016

debian@p64:~/github/cpuburn-arm$ sudo ./cpufreq-ljt-stress-test 
CPU stress test, which is doing JPEG decoding by libjpeg-turbo
at different cpufreq operating points.

Testing CPU 0
 1344 MHz ............................................................ OK
 1200 MHz SKIPPED
 1152 MHz SKIPPED
 1104 MHz SKIPPED
 1008 MHz SKIPPED
  816 MHz SKIPPED
  720 MHz SKIPPED
  600 MHz SKIPPED
  480 MHz SKIPPED

Testing CPU 1
 1344 MHz ............................................................ OK
 1200 MHz SKIPPED
 1152 MHz SKIPPED
 1104 MHz SKIPPED
 1008 MHz SKIPPED
  816 MHz SKIPPED
  720 MHz SKIPPED
  600 MHz SKIPPED
  480 MHz SKIPPED

Testing CPU 2
 1344 MHz ............................................................ OK
 1200 MHz SKIPPED
 1152 MHz SKIPPED
 1104 MHz SKIPPED
 1008 MHz SKIPPED
  816 MHz SKIPPED
  720 MHz SKIPPED
  600 MHz SKIPPED
  480 MHz SKIPPED

Testing CPU 3
 1344 MHz ............................................................ OK
 1200 MHz SKIPPED
 1152 MHz SKIPPED
 1104 MHz SKIPPED
 1008 MHz SKIPPED
  816 MHz SKIPPED
  720 MHz SKIPPED
  600 MHz SKIPPED
  480 MHz SKIPPED

Overall result : PASSED

CPU temperature touches 65°C without triggering cooling stage 1.

@umiddelb
Contributor Author

umiddelb commented Mar 3, 2016

Concerning the power supply, the PINE64 is connected to an Apple iPad USB power supply (2A rated) (and draws some extra power through the UART connector ;)

@longsleep
Owner

echo 1 > /sys/devices/system/cpu/cpu0/online brings back CPU 0, as an example. Also, I suggest conducting these tests without the UART connection.
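
To find out which cores need this treatment, Linux exposes the currently offline CPUs as a range list (for example "2-3") in /sys/devices/system/cpu/offline. A minimal Python sketch that expands such a list into the per-CPU `online` paths to write "1" to; the helper names are illustrative, and the actual write needs root:

```python
def parse_cpu_list(text: str) -> list:
    """Expand a kernel CPU list such as '0-1,3' into [0, 1, 3]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def online_paths(offline_list: str) -> list:
    """Return the sysfs 'online' files to write '1' to for each offline CPU."""
    return [f"/sys/devices/system/cpu/cpu{n}/online"
            for n in parse_cpu_list(offline_list)]


print(online_paths("2-3"))
```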

@umiddelb
Contributor Author

umiddelb commented Mar 3, 2016

Thanks. I'm using the UART connector on the EXP bus now. For the tests I needed to lift up the trigger temperature for stage 1 from 65°C to 68°C, leaving the rest untouched.

debian@p64:~/github/cpuburn-arm$ sudo ./cpufreq-ljt-stress-test 
[sudo] password for debian: 
CPU stress test, which is doing JPEG decoding by libjpeg-turbo
at different cpufreq operating points.

Testing CPU 0
 1344 MHz ............................................................ OK
 1200 MHz SKIPPED
 1152 MHz SKIPPED
 1104 MHz SKIPPED
 1008 MHz SKIPPED
  816 MHz SKIPPED
  720 MHz SKIPPED
  600 MHz SKIPPED
  480 MHz SKIPPED

Testing CPU 1
 1344 MHz ............................................................ OK
 1200 MHz SKIPPED
 1152 MHz SKIPPED
 1104 MHz SKIPPED
 1008 MHz SKIPPED
  816 MHz SKIPPED
  720 MHz SKIPPED
  600 MHz SKIPPED
  480 MHz SKIPPED

Testing CPU 2
 1344 MHz ............................................................ OK
 1200 MHz SKIPPED
 1152 MHz SKIPPED
 1104 MHz SKIPPED
 1008 MHz SKIPPED
  816 MHz SKIPPED
  720 MHz SKIPPED
  600 MHz SKIPPED
  480 MHz SKIPPED

Testing CPU 3
 1344 MHz ............................................................ OK
 1200 MHz SKIPPED
 1152 MHz SKIPPED
 1104 MHz SKIPPED
 1008 MHz SKIPPED
  816 MHz SKIPPED
  720 MHz SKIPPED
  600 MHz SKIPPED
  480 MHz SKIPPED

Overall result : PASSED

@longsleep
Owner

Nice thanks for testing! Can you share the rest of your exact setup? Do you have a heat sink? And just for confirmation you are running cpuburn-a8 the whole time?

@umiddelb
Contributor Author

umiddelb commented Mar 3, 2016

Yes, I'm using this heatsink with a thermal pad. Unfortunately there are no mounting holes to pressure down the heatsink on the board. The heatsink covers the SoC and the DRAM chips, but not the power converter (one half of the area marked with the white outline on the board).
The heatsink is cooled with a small fan, standing aside. If I remove the fan, the trigger points will be reached sooner, but the PINE64 doesn't crash. Without the heatsink I've experienced some crashes.

And yes, cpuburn-a8 was running the whole time, but building the kernel with make -j4 produces more stress.

@ThomasKaiser
Contributor

Ok, if kernel compilation is more intensive then cpuburn-a8 is insufficient :(

Did you add another dvfs operating point or do you still use 1.3V as max voltage? BTW: When testing A83T I realised way too late that cpuburn wasn't running on all cores. Did you check that too? 4 threads?

@ThomasKaiser
Contributor

BTW: My first attempt to use new dvfs operating points failed due to me failing to understand that I exchanged my build host in the meantime (shell history is great but sometimes...). I changed settings on host B but still copied over u-boot-with-dtb.bin from host A...

@ThomasKaiser
Contributor

Hmm... I tried to increase the 1300mV to 1360mV for DCDC2/DCDC3 in u-boot's drivers/power/sunxi/axp81X_supply.c but it's still not working and I get ARISC errors. Maybe I did something wrong again. Will now try my luck with

   extremity_freq = <1296000000>;
   max_freq = <1200000000>;
   min_freq = <480000000>;
   lv_count = <7>;
   lv1_freq = <1296000000>;
   lv1_volt = <1300>;
   lv2_freq = <1200000000>;
   lv2_volt = <1280>;
   lv3_freq = <1104000000>;
   lv3_volt = <1220>;
   lv4_freq = <1008000000>;
   lv4_volt = <1160>;
   lv5_freq = <912000000>;
   lv5_volt = <1100>;
   lv6_freq = <816000000>;
   lv6_volt = <1060>;
   lv7_freq = <648000000>;
   lv7_volt = <1020>;

and higher thermal trip points instead to see whether I get Linpack running without throttling and without failure above 1152 MHz.

@ThomasKaiser
Contributor

With these dvfs settings 1296MHz passed (but throttling down to 1248MHz happened for a few moments) and, as expected, 1344MHz failed again: http://pastebin.com/QEBXCgZz

Since we (and the average user even more) are already running into throttling/thermal issues, I would refrain from increasing VDD_CPUX above 1.3V and consider any clockspeed above 1.2GHz as overclocking and irrelevant for now.

At least this Linpack run seems useful for reliability testing to me. I made another q&d test: set the maximum CPU frequency to 720MHz (to increase the runtime of Linpack), let cpuminer run for a few minutes, stopped it, immediately started Linpack and immediately afterwards started cpuburn-a53 (pkill minerd && bin/OpenBLAS/xhpl && cpuburn-a53): cpuminer 39°C, Linpack 46°C, cpuburn-a53 53°C.

@ThomasKaiser
Contributor

BTW: I let cpuminer and Linpack also run on Orange Pi PC (H3, 1296MHz): 2.35 khash/s and 1.73 Gflops on H3 vs. 3.9 khash/s and 3.4Gflops with A64 running at the same clockspeed.

@ThomasKaiser
Contributor

I tried 3 different dvfs settings with voltages reduced by 20mV each to get an idea regarding safety headroom (using this small script), most probably ending up at Allwinner's current defaults once a few mV are added back to ensure reliable operation everywhere ;)

And then I let Linpack test the specific dvfs operating points (the only task running except for a few monitoring background tasks and unfortunately a whole running X-Windows that might be responsible for the result variations). I would say Linpack might be able to detect undervoltage situations (at least it's convenient to get a validation and not a deadlock as proof that undervoltage occurred)?

My next step would be to create an optimised dvfs table at the lower limit and testing it with just heatsink and on the other board that not even has a heatsink. Then add some mV to each operating point and let cpuburn-a53 kill the board. Any thoughts?

Edit: already testing the last run and ending up with the following:

   extremity_freq = <1296000000>;
   max_freq = <1200000000>;
   min_freq = <480000000>;
   lv_count = <8>;
   lv1_freq = <1296000000>;
   lv1_volt = <1300>;
   lv2_freq = <1200000000>;
   lv2_volt = <1240>;
   lv3_freq = <1104000000>;
   lv3_volt = <1160>;
   lv4_freq = <1008000000>;
   lv4_volt = <1100>;
   lv5_freq = <912000000>;
   lv5_volt = <1060>;
   lv6_freq = <816000000>;
   lv6_volt = <1020>;
   lv7_freq = <648000000>;
   lv7_volt = <960>;
   lv8_freq = <480000000>;
   lv8_volt = <940>;

So +60mV for each operating point later, ignoring anything above 1200MHz and let cpuburn-a53 do the job... (note to myself: test also with connected HDMI display since at least with Orange Pi PC that made a huge difference)
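
The planned adjustment (add 60mV to every operating point, ignore anything above 1200MHz) is easy to compute from the table above. A short Python sketch; the dictionary is copied from the lv1 to lv8 entries, and the function and parameter names are illustrative:

```python
# Frequency (MHz) -> voltage (mV), copied from the table above.
LOWER_LIMIT_TABLE = {
    1296: 1300, 1200: 1240, 1104: 1160, 1008: 1100,
    912: 1060, 816: 1020, 648: 960, 480: 940,
}


def with_headroom(table: dict, headroom_mv: int = 60, max_freq_mhz: int = 1200) -> dict:
    """Add headroom to every operating point and drop points above max_freq_mhz."""
    return {
        freq: mv + headroom_mv
        for freq, mv in sorted(table.items(), reverse=True)
        if freq <= max_freq_mhz
    }


for freq, mv in with_headroom(LOWER_LIMIT_TABLE).items():
    print(f"{freq} MHz @ {mv} mV")
```

With these inputs the 1200MHz point lands at exactly 1300mV, the VDD_CPUX value this thread prefers not to exceed.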

@umiddelb
Contributor Author

One question: I'm unsure whether the number of cooling states should match the number of trip points and the number of binds?

@ThomasKaiser
Contributor

No idea. But at least trip points and binds correlate and define stuff for both CPU and GPU (see the real node names

@umiddelb
Contributor Author

concerning trip points and binds, yes, but cooling stages 7-9 are very unlikely to be reached (stage 6 already shuts down the device (init 0)).

@umiddelb
Contributor Author

concerning trip points and binds, yes, but cooling stages 7-9 are very unlikely to be reached (stage 6 already shuts down the device (init 0)).

I have an idea how it's meant to be. The first four trip points control the CPU frequency, and each point refers to a number of stages to be conducted if the temperature is still above the trigger temperature.

@ThomasKaiser
Contributor

Hmm... I was solely focused on undervoltage today and trying to get an idea whether Linpack fail/pass depend on temperature or undervoltage instead.

Testing the dvfs settings above with a connected display but still heatsink+fan doesn't show a difference. 1296MHz @ 1300mV isn't stable, but otherwise it looks ok: http://pastebin.com/eBTkPAyG

I then tried it with the fan switched off and just a heatsink, 1st try with 1152MHz (shutdown when SoC temp reached 100°C pretty soon), then with 1056MHz (the same), then with 1008MHz: http://pastebin.com/wZE3sqRU (SoC temp close to 100°C with both 1008MHz and 960MHz). But at least it seems that stability/reliability issues are more related to undervoltage than temperature (to be confirmed).

Next try with the other board without a heatsink at all: I changed the order to start from 480MHz and increase clockspeeds since it's obvious that Linpack is way too demanding for A64 without improved heat dissipation: http://pastebin.com/8PF8QL0S (board powered off 20 seconds after starting the 912MHz @ 1060mV @ 105°C)

From the results I would believe the Linpack bench can be used to detect undervoltage pretty well. But I wonder how to proceed. If we increase every dvfs operating point by 60mV, then a Pine64+ without heatsink won't survive such a run at 720MHz?

Based on the above killing CPU cores rather quickly makes a bit more sense... but maybe I just messed up the whole thermal configuration?

@ssvb

ssvb commented Mar 14, 2016

@ThomasKaiser

Hmm... I was solely focused on undervoltage today and trying to get an idea whether Linpack fail/pass depend on temperature or undervoltage instead.

Most likely it depends on both. Have you read http://asic-soc.blogspot.fi/2008/03/process-variations-and-static-timing.html ? We can probably describe it with something like the following formula:

max_reliable_clock_frequency = F(board_specific_variation, voltage, temperature)

where F() is some unknown function, we don't know much about (and don't really care) except that higher voltage means higher max_reliable_clock_frequency and higher temperature means lower max_reliable_clock_frequency. And let's say that the board_specific_variation here covers the differences between the silicon quality, the PMIC voltage output accuracy and the temperature sensor accuracy (all of these things may vary to some extent).

And as you have already observed with the cpufreq-ljt-stress-test script yourself, the reliability problems are easier to get detected when the chip temperature is also high (by running the cpuburn program simultaneously). This happens because the transistors become slower at higher temperatures (as the linked article says).

board powered off 20 seconds after starting the 912MHz @ 1060mV @ 105°C

Why hasn't it throttled to a much lower clock frequency? Seems like it had enough time to do this (20 seconds is a lot).

@ssvb

ssvb commented Mar 14, 2016

@longsleep

What do you mean with "explore". Try to reduce the voltage for those frequencies in the dvfs_table and run tests if it's still stable. Though I have doubts that the voltage can be reduced so much that the temperature reduction matters. I think it is pretty much pointless to try this and we should rather concentrate on getting better performance at high frequencies with a reasonable and easy to get heat dissipation solution.

It is reasonable to add very low clock frequencies to the cooling table, even if we can't reduce the voltage anymore. For example, downclocking from 648MHz @1.04V to something like 312MHz @1.04V still would roughly cut the power consumption in half under heavy load (because less work is getting done per second).
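
The "roughly in half" estimate matches the usual first-order model for CMOS dynamic power, P ~ V^2 * f. A small Python sketch of that model; it ignores static (leakage) power and everything outside the CPU cores, so treat the numbers as rough guidance only:

```python
def relative_dynamic_power(freq_mhz: float, volt: float,
                           ref_freq_mhz: float, ref_volt: float) -> float:
    """Dynamic power relative to a reference point, using P ~ V^2 * f."""
    return (volt ** 2 * freq_mhz) / (ref_volt ** 2 * ref_freq_mhz)


# 312MHz @ 1.04V compared with 648MHz @ 1.04V, the example from the comment above.
ratio = relative_dynamic_power(312, 1.04, 648, 1.04)
print(f"{ratio:.2f}")
```

At constant voltage the ratio reduces to 312/648, a bit under one half.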

We just need to figure out what would be a safe CPU clock frequency, where 4 cores are able to run cpuburn-a53 safely without overheating (and shutting down) on a device without a heatsink. Even if we have to go as low as something under 100MHz for this.

@ThomasKaiser
Contributor

@ssvb

Why hasn't it throttled to a much lower clock frequency?

My suspicion is that Linpack does a few first steps that aren't that demanding and then load/consumption/temperature 'explodes' immediately. But more likely I just adjusted the thermal settings wrong (I tried to define higher trip points to let throttling jump in later). And I have to admit that I still don't understand the relationships, since I haven't read the links you posted yet (next weekend). Unfortunately I also can't take a quick look at the power meter, since I moved the whole setup to another room because the fan is really annoying. This is how temperatures developed (only clockspeed and temperature are correct):

Time        CPU    load %cpu %sys %usr %nice %io %irq   CPU
17:30:05: 1152MHz  3.08  73%   0%  71%   0%   0%   0%   66°C 4 cores active
17:30:10: 1152MHz  3.23  73%   0%  71%   0%   0%   0%   95°C 4 cores active
17:30:16: 1104MHz  3.37  73%   0%  71%   0%   0%   0%   93°C 4 cores active
17:30:21: 1104MHz  3.42  73%   0%  71%   0%   0%   0%   94°C 4 cores active
17:30:26: 1104MHz  3.47  73%   0%  71%   0%   0%   0%   98°C 4 cores active
17:30:32: 1056MHz  3.51  73%   0%  71%   0%   0%   0%   95°C 4 cores active
17:30:37: 1056MHz  3.71  73%   0%  71%   0%   0%   0%   95°C 4 cores active
17:30:43: 1056MHz  3.73  73%   0%  71%   0%   0%   0%   98°C 4 cores active
17:30:48: 1056MHz  3.93  73%   0%  71%   0%   0%   0%   96°C 4 cores active
17:30:53: 1056MHz  3.93  73%   0%  72%   0%   0%   0%   99°C 4 cores active
17:30:58: 1152MHz  3.78  73%   0%  72%   0%   0%   0%   80°C 4 cores active
17:31:03: 1152MHz  3.56  73%   0%  71%   0%   0%   0%   76°C 4 cores active
17:31:09: 1056MHz  3.59  73%   0%  71%   0%   0%   0%   97°C 4 cores active
17:31:14: 1056MHz  3.79  73%   0%  71%   0%   0%   0%   99°C 4 cores active
Connection to 192.168.83.112 closed by remote host.

Time        CPU    load %cpu %sys %usr %nice %io %irq   CPU
17:36:21: 1056MHz  2.83  43%   2%  35%   0%   6%   0%   63°C 4 cores active
17:36:26: 1056MHz  2.68  43%   2%  35%   0%   6%   0%   66°C 4 cores active
17:36:31: 1056MHz  2.79  44%   2%  36%   0%   6%   0%   92°C 4 cores active
17:36:37: 1056MHz  2.88  45%   2%  37%   0%   5%   0%   95°C 4 cores active
17:36:42: 1056MHz  3.05  46%   2%  38%   0%   5%   0%   97°C 4 cores active
17:36:47: 1056MHz  3.13  47%   2%  39%   0%   5%   0%   99°C 4 cores active
17:36:52: 1056MHz  3.20  48%   2%  41%   0%   5%   0%   94°C 4 cores active
17:36:58:  960MHz  3.42  49%   2%  42%   0%   5%   0%   95°C 4 cores active
17:37:03: 1056MHz  3.47  50%   2%  43%   0%   5%   0%  101°C 4 cores active
17:37:08: 1008MHz  3.51  51%   2%  44%   0%   5%   0%  101°C 4 cores active
17:37:14: 1008MHz  3.71  52%   1%  45%   0%   5%   0%  102°C 4 cores active
17:37:19: 1056MHz  3.73  53%   1%  46%   0%   5%   0%   85°C 4 cores active
17:37:24: 1056MHz  3.52  52%   1%  45%   0%   5%   0%   85°C 4 cores active
17:37:29: 1056MHz  3.31  52%   1%  45%   0%   4%   0%   83°C 4 cores active
17:37:34: 1056MHz  3.37  53%   1%  46%   0%   4%   0%  101°C 4 cores active
17:37:39: 1008MHz  3.58  54%   1%  47%   0%   4%   0%   99°C 4 cores active
17:37:45: 1056MHz  3.69  54%   1%  48%   0%   4%   0%  100°C 4 cores active
17:37:50: 1056MHz  4.05  55%   1%  49%   0%   4%   0%   98°C 4 cores active
17:37:55: 1008MHz  4.05  56%   1%  49%   0%   4%   0%  103°C 4 cores active
Connection to 192.168.83.112 closed by remote host.

Time        CPU    load %cpu %sys %usr %nice %io %irq   CPU
18:17:12:  912MHz  3.64  55%   1%  53%   0%   1%   0%   68°C 4 cores active
18:17:17:  912MHz  3.43  55%   1%  52%   0%   1%   0%   68°C 4 cores active
18:17:22:  912MHz  3.40  55%   1%  53%   0%   1%   0%   87°C 4 cores active
18:17:28:  912MHz  3.52  56%   1%  53%   0%   1%   0%   93°C 4 cores active
18:17:33:  912MHz  3.56  56%   1%  53%   0%   1%   0%   95°C 4 cores active
18:17:38:  912MHz  3.60  56%   1%  53%   0%   1%   0%   97°C 4 cores active
18:17:44:  912MHz  3.71  56%   1%  54%   0%   1%   0%  100°C 4 cores active
18:17:49:  912MHz  3.73  57%   1%  54%   0%   1%   0%  102°C 4 cores active
18:17:54:  912MHz  3.75  57%   1%  54%   0%   1%   0%  103°C 4 cores active
18:17:59:  912MHz  3.93  57%   1%  54%   0%   1%   0%  105°C 4 cores active
Connection to 192.168.83.112 closed by remote host.

@ThomasKaiser
Contributor

@ssvb

Why hasn't it throttled to a much lower clock frequency?

Tried again with @longsleep's settings and it's obvious I f*cked up throttling. With the very same hardware setup but his most recent settings throttling prevents an emergency shutdown when I run the Linpack test just with a heatsink :)

So all my tests were rather useless (maybe except for the relationship to undervoltage)

EDIT: And the Pine suddenly died

21:30:51:  912MHz  3.81  68%   1%  66%   0%   0%   0%   88°C 4 cores active
21:30:57:  912MHz  3.83  68%   1%  66%   0%   0%   0%   89°C 4 cores active
21:31:02:  600MHz  3.92  68%   1%  66%   0%   0%   0%   86°C 4 cores active
21:31:07:  912MHz  4.01  68%   1%  66%   0%   0%   0%   90°C 4 cores active
21:31:13:  816MHz  4.01  68%   1%  66%   0%   0%   0%   88°C 4 cores active
21:31:18:  912MHz  4.01  68%   1%  66%   0%   0%   0%   90°C 4 cores active
21:31:23:  960MHz  4.01  69%   1%  66%   0%   0%   0%   91°C 4 cores active
Connection to 192.168.83.112 closed by remote host.

OK, this might be the other issue. I'll try to find out whether it is AXP803 related. Next weekend...

@longsleep
Owner

Mhm seeing it die at well below 95°C is bad - I could not reproduce this yet but this makes me reconsider increasing all the trip points by 5°C.

@ssvb

ssvb commented Mar 15, 2016

@ThomasKaiser

21:31:13: 816MHz 4.01 68% 1% 66% 0% 0% 0% 88°C 4 cores active
21:31:18: 912MHz 4.01 68% 1% 66% 0% 0% 0% 90°C 4 cores active
21:31:23: 960MHz 4.01 69% 1% 66% 0% 0% 0% 91°C 4 cores active
Connection to 192.168.83.112 closed by remote host.

Going by the log, both the CPU clock speed and the temperature kept increasing at the end. Either something went really wrong in the budget cooling logic, or both the CPU frequency and the temperature were swinging up and down wildly and got sampled this way purely by accident.

Based on your previous logs, it seems like something is shutting down the CPU upon reaching 105°C. In the H3 FEX file, this 105°C limit used to be configured explicitly. I can't see it offhand in the DTS file or in the kernel sources, but I did not try to search for it too hard.
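To compare the runs more easily, here is a small sketch that pulls the clock speed and temperature out of monitoring lines in the format shown in this thread and reports the last sample before the connection was closed. The regular expression and the function names are assumptions made for this example, not part of any monitoring tool; the sample data is copied from the logs above.

```python
import re

# One monitoring sample looks like:
# "18:17:59:  912MHz  3.93  57%   1%  54%   0%   1%   0%  105°C 4 cores active"
SAMPLE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}):\s+(\d+)MHz\b.*?(\d+)°C")


def parse_samples(log_text):
    """Return (time, mhz, temp_c) tuples; lines that are not samples are skipped."""
    samples = []
    for line in log_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:
            time, mhz, temp = match.groups()
            samples.append((time, int(mhz), int(temp)))
    return samples


def summarize(log_text):
    """Report the last sample and the peak temperature of one run."""
    samples = parse_samples(log_text)
    if not samples:
        return None
    return {
        "last": samples[-1],
        "peak_temp_c": max(temp for _, _, temp in samples),
    }


RUN_18_17 = """\
18:17:49:  912MHz  3.73  57%   1%  54%   0%   1%   0%  102°C 4 cores active
18:17:54:  912MHz  3.75  57%   1%  54%   0%   1%   0%  103°C 4 cores active
18:17:59:  912MHz  3.93  57%   1%  54%   0%   1%   0%  105°C 4 cores active
Connection to 192.168.83.112 closed by remote host.
"""

RUN_21_31 = """\
21:31:13:  816MHz  4.01  68%   1%  66%   0%   0%   0%   88°C 4 cores active
21:31:18:  912MHz  4.01  68%   1%  66%   0%   0%   0%   90°C 4 cores active
21:31:23:  960MHz  4.01  69%   1%  66%   0%   0%   0%   91°C 4 cores active
Connection to 192.168.83.112 closed by remote host.
"""

print(summarize(RUN_18_17))  # last sample at 105°C
print(summarize(RUN_21_31))  # last sample at 91°C
```

Since samples arrive only about every five seconds, the last logged value is a lower bound on the temperature at shutdown rather than the exact figure.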

@longsleep

Mhm seeing it die at well below 95°C is bad - I could not reproduce this yet but this makes me reconsider increasing all the trip points by 5°C.

See the comment above. I suspect that it also got automatically shut down upon reaching 105°C, but this was just not logged properly.

@ssvb

ssvb commented Mar 15, 2016

Or in fact what we get is that 100°C is labelled as "critical" and causes shutdowns. Maybe it is worth moving the last critical trip point a bit higher relative to the previous one in order to allow a bit of temperature overshoot and have a bit more time to respond?

@ThomasKaiser
Contributor

Mhm seeing it die at well below 95°C is bad

Nope, the issues I had before happened above 100°C. I have had this other issue (frequent power-offs) over the last few days and tried to reproduce it (to no avail).

I thought maybe fast switching between different dvfs operating points might overburden the PMU. I put a heatsink on the PMU of the Pine that already has heatsinks and it stopped (no idea why -- maybe unrelated).

Today I tried it with the other Pine without heatsinks: I let cpuminer run together with a simple script that constantly jumps between 720 and 1056. It ran for hours at high temperatures. So I thought maybe the AXP803 on the other Pine has a problem: there I jump between 720 and 1156, but also to no avail. It has been running for hours... no idea...

@longsleep I second ssvb's suggestion to increase the critical trip point. Dumb question: Does reaching this critical point initiate an immediate power-off or a clean shutdown?

@longsleep
Owner

@ssvb

ssvb commented Mar 15, 2016

BTW, am I the only one here who finds the decompiled dts file with hex numbers and raw phandles instead of symbolic node references rather unreadable? As for the thermal framework dts bindings documentation, it can be found here.

The last two CPU related cooling map nodes (bind2 and bind3) look rather suspicious because they reference multiple cpu_budget_cool states (grouping state3+state4 and state5+state6). We probably need more cooling map nodes for fine-grained CPU clock frequency selection, especially around high temperatures.

@longsleep
Owner

@ssvb once things have settled down I will merge the changes back into the device tree sources in the kernel tree and we get readable stuff. I also happily accept a PR which changes the format :) Just did not find the time and the passion to change it myself.

Regarding the multiple states, I added them similar to what I saw in the original file, but I have not researched what this actually does.

@longsleep
Owner

Correction: I actually did research it a bit

- cooling-device: A phandle of a cooling device with its specifier, referring to which cooling device is used in this binding. In the cooling specifier, the first cell is the minimum cooling state and the second cell is the maximum cooling state used in this map.
  Type: phandle + cooling specifier

But I did not check what the code does with the values.
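One way to read the binding text: each cooling map carries a [minimum, maximum] range of cooling states for its cooling device. The sketch below models only that range, using the state groupings mentioned earlier in the thread for bind2 and bind3 (state3 to state4, state5 to state6). The class and method names are hypothetical, and how the kernel actually picks a state inside the range is not modelled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CoolingMap:
    """Hypothetical model of one cooling map: a name plus its state range."""
    name: str
    min_state: int  # first cell of the cooling specifier
    max_state: int  # second cell of the cooling specifier

    def states(self) -> List[int]:
        """All cooling states this map is allowed to use."""
        return list(range(self.min_state, self.max_state + 1))

    def clamp(self, requested: int) -> int:
        """Force a requested state into this map's allowed range."""
        return max(self.min_state, min(self.max_state, requested))


# Groupings as described in the thread; trip temperatures are left out
# because they are not given here.
bind2 = CoolingMap("bind2", 3, 4)
bind3 = CoolingMap("bind3", 5, 6)

for cooling_map in (bind2, bind3):
    print(cooling_map.name, cooling_map.states())
```

Under this reading, a map that spans two states leaves the choice between them to the governor, which is one reason finer-grained maps with one state each might give more predictable throttling.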

@brunogm0

brunogm0 commented Dec 4, 2016

So to add basic info that was confusing on the asic-blog link:

Supply voltage scaling was initially adopted for switching power reduction. It has proved to be an effective method for power reduction because switching power is quadratically dependent on the supply voltage. Leakage current is also exponentially dependent on VDD, so in current process nodes it has become more effective for reducing power in standby mode.

Dynamic Power = α * CL * VDD^2 * fclk; and Leakage Power = VDD * Isub^(VDD)
where α is the activity factor, CL is the total load capacitance, Isub is the subthreshold current, VDD is the supply voltage, and fclk is the clock frequency.
See this book: https://books.google.com.br/books?id=TcNAAAAAQBAJ&printsec=frontcover

High temps effect:

  • Noise: With increase in temperature thermal energy (kT, k-Boltzmann Constant, T-Temperature in Kelvin) of the charge carriers (electrons) increases. This increase in thermal energy will lead to more collisions between the electrons and increases noise.
  • Power Consumption:
    • To achieve the same performance from a circuit at a higher temperature, it consumes more power than at a lower temperature.
    • Power consumption of an idle circuit increases with temperature.
  • Speed of a circuit reduces with increase in temperature.
  • Reliability: A circuit degrades if it is operated at higher temperatures. (electro-migration of dopants destroys transistors)

So my recommendation as an IC designer is to find the lowest possible supply voltage per frequency per IC sample at 30°C.
At high temps (>65°C) the best option is to change the frequency, because the behavior is linear.
Sometimes too low a frequency does not really help in reducing power because the IC doesn't sleep.

We don't know what process or techniques were used in these cheap ASICs from Chinese DH, but if Adaptive Body Biasing were available... man!


5 participants