raspi-hacks: Remove fakeclock, add e2fsck, libcofi_rpi, break out doinst.sh
idlemoor committed Jul 10, 2012
1 parent d397b48 commit 1e8584b
Showing 12 changed files with 615 additions and 138 deletions.
21 changes: 15 additions & 6 deletions raspi-hacks/README
@@ -9,13 +9,22 @@ lines' entries s0, s1 and s2. The Raspberry Pi's UART is /dev/ttyAMA0, not

 * Clock - the Raspberry Pi has no hardware clock and boots with the
 date/time set to 1970-01-01, so /etc/rc.d/rc.local will be modified to
-set the correct date/time from the network (using the 'sntp' command).
-Also, to provide a pragmatic approximation for the correct date/time
-during early boot and if the network is unavailable, /sbin/hwclock will be
-replaced by a shellscript which saves the date/time to disk on shutdown
-and restores it on startup. This prevents the error 'Cannot access the
-Hardware Clock via any known method'.
+set the correct date/time from the network (using the 'sntp' command).
+Additionally, the file /etc/e2fsck.conf will be created to stop e2fsck
+from erroring or requiring manual intervention when it encounters bad time
+stamps. The fakeclock mechanism in a previous version of this package is
+no longer used, and the original hwclock command will be restored when you
+upgrade this package. This reinstates the error 'Cannot access the
+Hardware Clock via any known method' :-)

 * Tuning - /etc/sysctl.conf will be created to tune vm.min_free_kbytes.
 This prevents the error 'smsc95xx 1-1.1:1.0: eth0: kevent 2 may have
 been dropped'.
+
+* libcofi_rpi - The library /usr/lib/libcofi_rpi.so contains teh_orph's
+replacement memcpy and memset functions. These replacements have been
+reported to improve application performance. They are disabled by default,
+but if you want to enable them, use this command and then log out and log
+in again:
+
+    chmod ugo+x /etc/profile.d/libcofi_rpi.{sh,csh}
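
The enable step above works through the execute bit on the profile scripts. The following is a minimal sketch of that mechanism, assuming the login profile sources only those /etc/profile.d scripts that are executable (as Slackware-style profiles do). The helper name source_enabled_scripts, the scratch directory and the stand-in variable DEMO_PRELOAD are all hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of this package): source each *.sh file
# in a directory, but only if it is executable. This mimics the assumed
# login-profile behaviour; it is not copied from any real /etc/profile.
source_enabled_scripts() {
    for script in "$1"/*.sh; do
        if [ -x "$script" ]; then
            . "$script"
        fi
    done
}

# Work in a scratch directory so that /etc/profile.d is never touched.
demo_dir=$(mktemp -d)
# DEMO_PRELOAD stands in for LD_PRELOAD to keep the demo harmless.
echo 'DEMO_PRELOAD=/usr/lib/libcofi_rpi.so' > "$demo_dir/libcofi_rpi.sh"

chmod ugo-x "$demo_dir/libcofi_rpi.sh"      # disabled: script is skipped
unset DEMO_PRELOAD
source_enabled_scripts "$demo_dir"
before=${DEMO_PRELOAD:-unset}

chmod ugo+x "$demo_dir/libcofi_rpi.sh"      # enabled: script is sourced
source_enabled_scripts "$demo_dir"
after=${DEMO_PRELOAD:-unset}

echo "without execute bit: $before"
echo "with execute bit:    $after"
rm -rf "$demo_dir"
```

Under that assumption, setting the execute bit is what turns the preload on, and clearing it turns the preload off again at the next login.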
24 changes: 24 additions & 0 deletions raspi-hacks/doinst.sh
@@ -0,0 +1,24 @@
config() {
  NEW="$1"
  OLD="$(dirname "$NEW")/$(basename "$NEW" .new)"
  # If there's no config file by that name, mv it over:
  if [ ! -r "$OLD" ]; then
    mv "$NEW" "$OLD"
  elif [ "$(cat "$OLD" | md5sum)" = "$(cat "$NEW" | md5sum)" ]; then
    # toss the redundant copy
    rm "$NEW"
  fi
  # Otherwise, we leave the .new copy for the admin to consider...
}

preserve_perms() {
  NEW="$1"
  OLD="$(dirname "$NEW")/$(basename "$NEW" .new)"
  # If the file already exists, give the .new copy the existing file's
  # permissions and ownership while keeping the .new contents:
  if [ -e "$OLD" ]; then
    cp -a "$OLD" "${NEW}.incoming"
    cat "$NEW" > "${NEW}.incoming"
    mv "${NEW}.incoming" "$NEW"
  fi
  config "$NEW"
}
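
The behaviour of these two functions can be exercised in a scratch directory. The sketch below restates both functions so that it is self-contained, then walks through the three outcomes of config() and the permission carry-over of preserve_perms(). The file names a.conf to d.sh and the scratch directory are hypothetical; the commit does not show which files the functions are called on.

```shell
#!/bin/sh
# config() and preserve_perms() restated from doinst.sh so that this
# sketch is self-contained.
config() {
    NEW="$1"
    OLD="$(dirname "$NEW")/$(basename "$NEW" .new)"
    if [ ! -r "$OLD" ]; then
        mv "$NEW" "$OLD"
    elif [ "$(cat "$OLD" | md5sum)" = "$(cat "$NEW" | md5sum)" ]; then
        rm "$NEW"
    fi
}

preserve_perms() {
    NEW="$1"
    OLD="$(dirname "$NEW")/$(basename "$NEW" .new)"
    if [ -e "$OLD" ]; then
        cp -a "$OLD" "${NEW}.incoming"
        cat "$NEW" > "${NEW}.incoming"
        mv "${NEW}.incoming" "$NEW"
    fi
    config "$NEW"
}

work=$(mktemp -d)   # hypothetical scratch directory

# Case 1: no installed file yet, so a.conf.new is renamed to a.conf.
echo "default" > "$work/a.conf.new"
config "$work/a.conf.new"

# Case 2: installed file is identical, so b.conf.new is deleted.
echo "default" > "$work/b.conf"
echo "default" > "$work/b.conf.new"
config "$work/b.conf.new"

# Case 3: installed file was edited, so both copies are kept.
echo "local edit" > "$work/c.conf"
echo "default" > "$work/c.conf.new"
config "$work/c.conf.new"

# Case 4: d.sh.new takes over the mode of the installed d.sh.
echo "old" > "$work/d.sh";     chmod 755 "$work/d.sh"
echo "new" > "$work/d.sh.new"; chmod 644 "$work/d.sh.new"
preserve_perms "$work/d.sh.new"

files=$(ls "$work" | tr '\n' ' ')
mode=$(ls -l "$work/d.sh.new" | cut -c1-10)
echo "files: $files"
echo "d.sh.new mode: $mode"
rm -rf "$work"
```

After the run, the scratch directory holds a.conf, b.conf, c.conf, c.conf.new, d.sh and d.sh.new, and d.sh.new carries the executable mode of d.sh. A permission change made by the admin on an installed file therefore carries over to the incoming .new copy.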

3 changes: 3 additions & 0 deletions raspi-hacks/e2fsck.conf.new
@@ -0,0 +1,3 @@
[options]
accept_time_fudge = 1
broken_system_clock = 1
102 changes: 0 additions & 102 deletions raspi-hacks/fakeclock.sh

This file was deleted.

6 changes: 6 additions & 0 deletions raspi-hacks/libcofi_rpi/Makefile
@@ -0,0 +1,6 @@
libcofi_rpi.so: memcpy.o memset.o
	$(CC) -o libcofi_rpi.so -shared memcpy.o memset.o -g
memset.o: memset.s
	$(AS) memset.s -o memset.o -g
memcpy.o: memcpy.s
	$(AS) memcpy.s -o memcpy.o -g
59 changes: 59 additions & 0 deletions raspi-hacks/libcofi_rpi/README.libcofi_rpi
@@ -0,0 +1,59 @@
copies-and-fills

SUMMARY

Replacement memcpy and memset functions for the Raspberry Pi, intended to give greater performance.
Writing the code with an understanding of single-issue execution is important.

Tested using a modified https://github.com/ssvb/ssvb-membench, from Siarhei Siamashka.
The testing involves lots of random numbers, iterating through sizes and source/destination alignments.
If you find a bug, please tell me!

To use: set the environment variable LD_PRELOAD=/full/path/to/libcofi_rpi.so, then run the program.
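
As a minimal sketch of that usage, the wrapper below preloads the library for a single command only, rather than exporting LD_PRELOAD for the whole session. The function name run_with_preload and its fallback behaviour are assumptions made for this illustration, not part of copies-and-fills.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of copies-and-fills): run one command
# with the library preloaded, or fall back to running it unmodified
# when the library file cannot be read.
run_with_preload() {
    lib=$1
    shift
    if [ -r "$lib" ]; then
        env LD_PRELOAD="$lib" "$@"
    else
        echo "warning: $lib not readable, running without it" >&2
        "$@"
    fi
}

# /full/path/to is the README's placeholder, not a real location.
run_with_preload /full/path/to/libcofi_rpi.so echo "program ran"
```

Scoping the variable to one command makes it easy to compare a program's behaviour with and without the replacement functions.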

The inner loop of the misalignment path of memcpy is derived from the GNU libc ARM port. As a result, "copies-and-fills" is licensed under the GNU Lesser General Public License version 2.1. See http://www.gnu.org/licenses/ for details.
To see the original memcpy, browse it here: http://sourceware.org/git/?p=glibc-ports.git;a=blob;f=sysdeps/arm/memcpy.S;hb=HEAD

Simon Hall

NOTES

memcpy:
Can be found in memcpy.s.
Compared to the generic libc memcpy, this one reaches performance parity at around 150-byte copies with any source/destination alignment and eventually gains 2-3x throughput, especially when the source buffer is uncached.
Enabling the pld path in the libc source certainly improves the libc version. However, the source alignment option appears to do nothing for performance, yet greatly increases the code complexity.
In initial testing, some facts were found:
- despite the increase in free registers, copies via VFP were slower at peak by ~25%
- copying 32 bytes at a time with a single store-multiple gives the highest performance
- getting the destination 32-byte aligned gives a much greater throughput versus 4-byte alignment
- some memcpys are of a fixed size, e.g. 1, 2, 4 or 8 bytes
- byte transfers have a much worse performance than expected
- for misaligned transfers, 32-byte-aligned stms are the way forward with mov/orr byte shuffling; byte copies give very poor performance

The code deals with the special small sizes, then races to reach 32-byte alignment of the destination.
We then test for misalignment with the source. If (source - destination) & 3 != 0, i.e. the source is not word-aligned relative to the destination, then we use the misaligned path.
For the aligned path, we iterate through the data, 32 bytes at a time. We then handle a word at a time, then a byte at a time.
For the misaligned path, we have to determine how misaligned we are: 1, 2, or 3 bytes. There is a custom path for each that does the appropriate shifts.
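
The path selection described above can be illustrated with plain address arithmetic. This is only a sketch of one reading of the description, not the assembly in memcpy.s: the function name memcpy_plan is hypothetical, no data is copied, and the special small-size cases are not modelled.

```shell
#!/bin/sh
# Hypothetical illustration: given source address, destination address
# and length, report how a copy would be split up under the scheme
# described in the notes. Arithmetic only; nothing is copied.
memcpy_plan() {
    src=$(( $1 )); dst=$(( $2 )); len=$(( $3 ))
    # Bytes needed to bring the destination up to a 32-byte boundary.
    head=$(( (32 - dst % 32) % 32 ))
    [ "$head" -gt "$len" ] && head=$len
    # Source offset relative to the destination, modulo one word.
    skew=$(( (src - dst) & 3 ))
    rest=$(( len - head ))
    if [ "$skew" -eq 0 ]; then
        # Aligned path: 32-byte blocks, then words, then bytes.
        echo "head=$head path=aligned blocks=$(( rest / 32 )) words=$(( rest % 32 / 4 )) bytes=$(( rest % 4 ))"
    else
        # Misaligned path: one of three shift variants (1, 2 or 3).
        echo "head=$head path=misaligned skew=$skew"
    fi
}

memcpy_plan 0x1004 0x2010 100   # head=16 path=aligned blocks=2 words=5 bytes=0
memcpy_plan 0x1001 0x2000 70    # head=0 path=misaligned skew=1
```

In the first call both addresses share the same offset within a word, so once the destination is aligned the source is word-aligned too. In the second call the source sits one byte off, which would select the 1-byte shift variant.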

The key to this is prefetching the source array. Prefetch instructions must be placed far from the load instructions, as it appears the load/store pipe is busy for a while after a large load instruction is issued.

Speeds of up to 680 MB/s have been achieved (effective 339 MB/s copy).

memset:
Can be found in memset.s.
Compared to the generic libc memset, this quickly reaches performance parity at around 100 bytes with any alignment.
In testing:
- it appears 32-byte stores yield ~1000-1100 MB/s, but two sequential 16-byte stores can reach 1300-1400 MB/s
- again, 32-byte-aligned destinations are good

The code 4-byte aligns the destination with a byte writer, then 32-byte aligns it with a word writer.
We then write the data 2*16 bytes at a time (two sequential 16-byte stores), then write words, then bytes.
No preload of destination data seems to be required.
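
The staging of a fill can be sketched with the same kind of arithmetic. Again this is an assumption-laden illustration of the description above rather than the code in memset.s: memset_plan is a hypothetical name, and the function only counts how many bytes each stage would handle.

```shell
#!/bin/sh
# Hypothetical illustration: given a destination address and a length,
# count the work done by each stage of the described memset.
memset_plan() {
    dst=$(( $1 )); len=$(( $2 ))
    # Stage 1: byte writes until the destination is 4-byte aligned.
    lead_bytes=$(( (4 - dst % 4) % 4 ))
    [ "$lead_bytes" -gt "$len" ] && lead_bytes=$len
    dst=$(( dst + lead_bytes )); len=$(( len - lead_bytes ))
    # Stage 2: word writes until the destination is 32-byte aligned.
    lead_words=$(( (32 - dst % 32) % 32 / 4 ))
    [ $(( lead_words * 4 )) -gt "$len" ] && lead_words=$(( len / 4 ))
    len=$(( len - lead_words * 4 ))
    # Stage 3: 2*16-byte blocks, then trailing words, then bytes.
    echo "lead_bytes=$lead_bytes lead_words=$lead_words blocks=$(( len / 32 )) words=$(( len % 32 / 4 )) bytes=$(( len % 4 ))"
}

memset_plan 0x2003 200   # lead_bytes=1 lead_words=7 blocks=5 words=2 bytes=3
```

For the example call the stages add up to 1 + 7*4 + 5*32 + 2*4 + 3 = 200 bytes, so every byte of the request is covered exactly once.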

Speeds of up to 1390 MB/s have been achieved. This is ~7x faster than the libc version.

VERSION HISTORY

09/07/2012, minor updates
01/07/2012, initial release

3 changes: 3 additions & 0 deletions raspi-hacks/libcofi_rpi/libcofi_rpi.csh.new
@@ -0,0 +1,3 @@
#!/bin/csh

setenv LD_PRELOAD /usr/lib/libcofi_rpi.so
3 changes: 3 additions & 0 deletions raspi-hacks/libcofi_rpi/libcofi_rpi.sh.new
@@ -0,0 +1,3 @@
#!/bin/sh

export LD_PRELOAD=/usr/lib/libcofi_rpi.so