Status of mainline kernel support #75

Open
sumpfralle opened this Issue Mar 18, 2018 · 118 comments

@sumpfralle

sumpfralle commented Mar 18, 2018

I guess it is of vital importance for the longevity of this project (and the hardware) to bring support for the peripherals of GnuBee devices to the upstream kernel development tree. Otherwise we will end up with an outdated kernel (and thus more electronic landfill) at some point ...

Since this is not a short-term task, I propose to use this ticket for collecting the status of the ongoing mainlining efforts.

Could someone please start with summarizing the current state?
Thank you!

@neilbrown

Contributor

neilbrown commented Mar 19, 2018

https://github.com/neilbrown/linux/commits/gnubee/v4.15 mostly works. I just today noticed that only the first SATA controller works. Might fix that tomorrow.
I've posted some patches for inclusion in drivers/staging: https://lkml.org/lkml/2018/3/14/1035 Will repost with some improvements tomorrow.

@sumpfralle

Author

sumpfralle commented Mar 28, 2018

Thank you for your quick response, for all the work you have put into mainline support and for the entertaining LWN article you wrote.
Your progress is way better than I expected - this revives my positive emotions for the GnuBee platform. Thank you!

@Adirelle

Adirelle commented Mar 29, 2018

@neilbrown Have you hit this bug with the ethernet driver ?

Right now, I have a job that checks the network connectivity every minute and reboots the GB if it fails...

I am very happy that a kernel hacker has got some interest in this project. Thanks for your time and your work.

@neilbrown

Contributor

neilbrown commented Mar 29, 2018

I've seen something like bug #54 when the Gnubee was sitting at the u-boot prompt, and once when sitting at a shell prompt in the initramfs, but never with a mainline kernel fully booted and the network configured.
This message: https://groups.google.com/d/msg/gnubee/cJuFmwCu4XI/F1KwJSIfAgAJ describes the problem the way I see it - the whole switch attached to the gnubee dies.
My guess would be that there is some inter-switch protocol (possibly the Spanning Tree Protocol) which the embedded switch in the gnubee is messing up.
In mainline Linux the embedded switch is configured as a boring transparent switch with no smarts, and maybe that is why it doesn't confuse other switches as much.
It would be interesting to see what happens if the gnubee is directly connected to a PC instead of to a switch.

@alethiophile

alethiophile commented Mar 30, 2018

My experience is also that when it's booted into the Linux kernel, the network crash problem goes away. I see the problem when the board is in any of the three states uboot prompt/initramfs prompt/halted, but powered on.

@Adirelle

Adirelle commented Mar 30, 2018

Weird. It always happens on a booted kernel (the librecmc-based one, provided by GB) and does not take the switch down. The GB still answers ICMP (e.g. ping works) but TCP and UDP become unusable. That is how the job detects a failure: if it can ping a fixed IP but cannot connect to its echo service, it reboots the system.
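
For anyone wanting to reproduce this kind of check, here is a minimal sketch in Python. It is not the script actually running on the GnuBee: the function names, the echo port (TCP 7) and the timeouts are assumptions made for illustration.

```python
import socket
import subprocess


def ping_ok(host: str, timeout_s: int = 2) -> bool:
    """Return True if one ICMP echo request is answered (uses the system `ping`)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def tcp_echo_ok(host: str, port: int = 7, timeout_s: float = 2.0) -> bool:
    """Return True if the TCP echo service sends back exactly what was sent."""
    payload = b"gnubee-check\n"
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(payload)
            received = b""
            while len(received) < len(payload):
                chunk = sock.recv(len(payload) - len(received))
                if not chunk:  # peer closed the connection early
                    break
                received += chunk
            return received == payload
    except OSError:  # refused, unreachable or timed out
        return False


def should_reboot(icmp_works: bool, tcp_works: bool) -> bool:
    """The failure signature described above: ICMP still answers, TCP does not."""
    return icmp_works and not tcp_works
```

A cron job could call `should_reboot(ping_ok(host), tcp_echo_ok(host))` once a minute and reboot only when it returns True, so that a plain network outage (no ping either) does not cause a reboot loop.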

@neilbrown

Contributor

neilbrown commented Apr 1, 2018

Maybe there are two different bugs here - one that kills the switch and one that kills TCP but leaves ICMP working....

@xvybihal

xvybihal commented Apr 12, 2018

@Adirelle I also hit bug #54 with this mainline kernel. GnuBee PC1 is sitting here on my desk, working "fine" (quotes because of the ping results shown later). But when I come back some hours or day(s) later, I cannot connect to it - the time at which it becomes unreachable over the network is random, as far as I can tell. I have to power it off and on again; nothing else worked for me (such as unplugging and replugging the cable).

gnubee ~ # uname -a
Linux gnubee.jvi.cz 4.15.12+ #3 SMP Wed Apr 4 15:29:16 CEST 2018 mips GNU/Linux

gnubee ~ # cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

gnubee ~ # cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
	address 172.16.202.254
	netmask 255.255.0.0
	gateway 172.16.whatever

The strange thing is that when I run ping long enough, I see that every once in a while the gnubee takes quite a long time to respond (100+ ms). I can definitely rule out a faulty switch or cable.

https://gist.githubusercontent.com/xvybihal/b403c3c2677e423f3aea73feb8255a91/raw/d137ab85de894e2d5e3cb5bb9e2ffae16d023231/Gnubee%2520ping

Unfortunately I do not have a UART cable to check what is going on when the GnuBee is not reachable over the network.

This "thing" is nothing new: it has been acting this way from the beginning, with every kernel provided, and with Debian installed. It makes the GnuBee pretty unusable for me if I have to restart it several times a day.

@neilbrown

Contributor

neilbrown commented Apr 12, 2018

Thanks for the detailed bug report. I love details.
What is happening on the gnubee before it starts failing?
Is it just sitting there idle? Is it initiating network traffic itself? Is it responding to requests?
Are there lots of requests, or occasional? or few? What protocol or protocols?
I've started a job fetching a 10K file over http every 2 seconds. I'll see if it is still working in the morning.

@Adirelle

Adirelle commented Apr 12, 2018

It does not seem to be linked to network activity. Stopping the transmission daemon to reduce the bandwidth and number of connections does not prevent it from happening (nor does it postpone it). And I have downloaded an Ubuntu DVD torrent over a 1Gb link without any issue.

@neilbrown

Contributor

neilbrown commented Apr 12, 2018

So what is your gnubee doing when the problem happens? How do you first notice it? How long after boot does it typically fail? <1hr? <12hrs?
If I can reproduce the problem it is very likely that I can fix it
If not ... there are some kernel messages in that other bug which might be enough of a hint. Maybe I'll try copying a bigger file.

@xvybihal

xvybihal commented Apr 13, 2018

In my case, the GnuBee is doing nothing when it happens. What do I mean by nothing? The installed system is clean, with no additional network daemons or services. No network transfers are happening (except for time sync and the usual base-system traffic). I do not use it; it just sits here doing nothing. The time at which it happens is random. Sometimes it's hours, sometimes a day or two.
Next time it happens, I will try to take out the SD card and copy some logs here. That's probably the only thing I can do without a UART cable (which I am going to equip myself with soon).

@xvybihal

xvybihal commented Apr 13, 2018

Looking at the log/messages, there might be something helpful. Uploading the whole file, as it looks pretty similar to what @Adirelle posted: messages.tar.gz

@Adirelle

Adirelle commented Apr 13, 2018

> So what is your gnubee doing when the problem happens? How do you first notice it? How long after boot does it typically fail? <1hr? <12hrs?

Well, it seems totally random. The gnubee could be idle or I could be using it. Right now, it is running the following services : kernel nfs server, dropbear, dms (a upnp media server), ntpd, mysqld, transmission-daemon, ypbind, and there is a job running rdiff-backup once a day.

Sometimes it can be ok a whole week, sometimes only a few hours. I have the impression that the uptime rarely gets over 48 hours.

> If I can reproduce the problem it is very likely that I can fix it

I understand that -- I am a developer myself -- but unfortunately I could not find something to trigger the bug. I have only got the kernel error message.

I have a script that reboots it within one minute after it happens. It could dump the result of some commands in a file before rebooting, if you had some suggestions.

@neilbrown

Contributor

neilbrown commented Apr 13, 2018

Thanks for posting the whole log file (did I say that I love details?)
The problem appears to be that interrupts get disabled on one of the CPUs - presumably CPU0 as that gets all device interrupts on my gnubee. This is reported by RCU at timestamp 2179 which should be about 21 seconds after it happened. Then at timestamp 6317 (about 1 hour later) the network watchdog complains.
With interrupts disabled on one CPU, the network card will become unreliable. As polling is sometimes used you can still get some traffic through, but it won't be fast. A 'ping' probably works because it sends a packet every second, and only wants one in reply. If the driver checks for incoming packets whenever it transmits a packet (quite likely), ping will appear to work normally.
The next question is: why do interrupts get disabled? Two things might be useful.
1/ @Adirelle if you could get your script to "echo t > /proc/sysrq-trigger" when the problem is detected, that might help. It should write stack traces for all processes to the kernel log. Seeing those would be most helpful.
2/ @xvybihal If you could rebuild your kernel with lockdep enabled, that might produce more useful info (I hope). You need to enable CONFIG_PROVE_LOCKING and CONFIG_DEBUG_LOCKDEP and you may as well add CONFIG_PROVE_RCU. Then if it happens again, collect the logs the same way that you did before.
Thanks.

@Adirelle

Adirelle commented Apr 13, 2018

@neilbrown I have set up a script that will log the output of the following commands when it happens. I will post the results of the next occurrence.

date
uptime
vmstat 1 1
netstat -ieW
netstat -aopenW
lsmod
ps -ef
echo t > /proc/sysrq-trigger
dmesg
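
One way such a dump could be structured is sketched below in Python. This is not the script referred to above: the function names, the report layout and the log path passed to `dump_diagnostics` are hypothetical. Only the command list and the sysrq trigger come from this thread.

```python
import subprocess
from datetime import datetime

# Commands taken from the list above. The sysrq trigger is handled separately
# because it is a write to a procfs file rather than a command with output.
COMMANDS = [
    "date",
    "uptime",
    "vmstat 1 1",
    "netstat -ieW",
    "netstat -aopenW",
    "lsmod",
    "ps -ef",
]


def run_command(command: str) -> str:
    """Run one command and return its combined stdout/stderr as text."""
    result = subprocess.run(
        command.split(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.stdout


def trigger_task_dump(path: str = "/proc/sysrq-trigger") -> None:
    """Ask the kernel to log the stack of every task (needs root)."""
    with open(path, "w") as trigger:
        trigger.write("t")


def collect(commands, runner=run_command) -> str:
    """Build one report with a header line per command."""
    sections = []
    for command in commands:
        sections.append(f"### {command}\n{runner(command)}")
    return "\n".join(sections)


def dump_diagnostics(log_path: str) -> None:
    """Write the report, trigger the sysrq dump, then append dmesg."""
    report = collect(COMMANDS)
    trigger_task_dump()
    report += "\n" + collect(["dmesg"])
    with open(log_path, "a") as log:
        log.write(f"==== {datetime.now().isoformat()} ====\n{report}\n")
```

Running `dmesg` after the sysrq write matters, because the stack traces only reach the kernel log once the trigger has been written.
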
@neilbrown

Contributor

neilbrown commented Apr 14, 2018

I have found a smoking gun.... I don't know if it is the smoking gun.
I went looking for places in the code which disable interrupts without clearly re-enabling them. I didn't expect to find any - that would be too easy. I found two!
One is almost certainly of no consequence - it always happens very early during initialization, so something must re-enable them.
The other is in the network code. I don't understand the code enough to know when it happens, but it looks like it is in response to some event like the cable being pulled (though I tried that and it doesn't trigger anything). The same bug is in the 4.4 kernel code.
I've pushed out a fix to my gnubee/v4.15 kernel branch. Please test if you can.

@neheb

Contributor

neheb commented Apr 14, 2018

@neilbrown Have you tried patching the stock Mediatek MMC driver to add support for mt7621? I tried and failed. Something about missing pinctl. I think I needed to edit the dts file...

This was my basis: jonpry/openwrt_mt7688@a85e6d9

@neilbrown

Contributor

neilbrown commented Apr 14, 2018

> @neilbrown Have you tried patching the stock Mediatek MMC driver to add support for mt7621?

No I haven't. As MMC is currently working, that driver is not a priority for me. My current priorities are roughly:

  • reboot - kernel currently leaves NOR flash in an inconsistent state
  • SATA access seems really slow - 10MB/sec!
  • network switch - enable VLAN support so different ports can be on different subnets
  • 2nd network interface: the SOC has 2 interfaces to the switch, only one works at present
  • crypto engine
  • PCI driver has horrible hacks to select correct interrupt

So I won't be focusing on MMC for quite a while. I'd be very happy for you or anyone else to dig into it and ask questions. If you have specific focused questions, I'd be happy to share any expertise I might have. I'd suggest opening a separate issue for each driver.

@neheb

Contributor

neheb commented Apr 14, 2018

I just tested dd if=/dev/zero of=test count=1M with a speed result of 7.5MB/s on a btrfs array. No wonder transmission is slow...

I will dig through the MMC driver in a few days to see if I can get it working. Apparently the following commit to the one I linked has updated DTS entries.

@Adirelle

Adirelle commented Apr 15, 2018

@neilbrown

Compiled your v4.15 and it boots (which is already something since I was not sure about what I was doing) but :

  • the initramfs does not find my root partition, since it lies on a md RAID1 array.
  • it messes with the attached switch, probably because its own switch is not configured.

Your kernel branch might or might not fix some bugs, but as long as it works, I prefer having a recent version quite easy to build over the ones from libreCMC/LEDE.

PS: I will try to find what I need to add to the initramfs to support rootfs on MD array.
PS2: I cross-compiled the kernel on an Alpine-Linux-based VM, I hope this will not cause any issue.

@Adirelle

Adirelle commented Apr 15, 2018

Got it working ; I was also missing some other modules. I will let you know if the network lockup happens again.

@neilbrown

Contributor

neilbrown commented Apr 15, 2018

Another convert - hurray :-)
I've merged your patch - thanks.

@Adirelle

Adirelle commented Apr 16, 2018

Ok. So on the bad news:

  • clock skew is back again (like #49 (comment)).
  • performance seems lower than with the provided kernel, but I am not sure why.

I will try to compare kernel settings with the ones from LibreCMC.

@Adirelle

Adirelle commented Apr 16, 2018

@dgazineu Is there a way to flash the firmware from a running Linux ? (to shorten the whole "put your image on a USB stick, plug it on the GB, reboot, let uboot flash the firmware, remove the USB stick, reboot" cycle).

@neilbrown

Contributor

neilbrown commented Apr 16, 2018

> clock skew is back again (like #49 (comment)).

You need the cpuclock clock-frequency in arch/mips/boot/dts/ralink/gbpc1.dts to match the value set by the u-boot that you have installed. I have a u-boot that configures 900MHz so I set clock-frequency to 900000000.

> performance seems lower than with the provided kernel, but I am not sure why.

I just discovered that large kernel modules hurt performance. I rebuilt with CONFIG_XFS_FS=y (instead of =m, in O/.config) and filesystem throughput is a lot faster. I've pushed an update for the defconfig file.

> Is there a way to flash the firmware from a running Linux ?

Probably, using flash_erase and flashcp from mtd-utils. I haven't played much with them - be careful.
I set up my linux desktop as a tftp server and test kernels using tftpboot. It is fairly painless.

@Adirelle

Adirelle commented Apr 16, 2018

> I have a u-boot that configures 900MHz so I set clock-frequency to 900000000.

Ah, I think mine is configured at 880MHz. It seems there was some discrepancy between the shipped u-boot and the kernel updates that were provided later.

> I set up my linux desktop as a tftp server and test kernels using tftpboot. It is fairly painless.
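
To get a feel for the size of the skew such a mismatch would cause, here is a rough back-of-the-envelope sketch in Python. It assumes the kernel converts a hardware tick count to seconds using the clock-frequency from the device tree; that mechanism is an assumption here, not something confirmed in this thread, and the function name is made up for illustration.

```python
def kernel_seconds(real_seconds: float, actual_hz: int, assumed_hz: int) -> float:
    """Time the kernel believes has elapsed, if ticks arrive at actual_hz
    but are converted to seconds using assumed_hz."""
    ticks = real_seconds * actual_hz
    return ticks / assumed_hz


ACTUAL_HZ = 880_000_000   # what the shipped u-boot is believed to configure
ASSUMED_HZ = 900_000_000  # clock-frequency value set in the device tree

day = 24 * 60 * 60
drift = day - kernel_seconds(day, ACTUAL_HZ, ASSUMED_HZ)
print(f"clock falls behind by about {drift / 60:.0f} minutes per day")
```

Under that assumption the clock would lose about 2.2% of real time, which is far more than ntpd can normally correct.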

To be sure I understand you right : once the new image is available through TFTP, you use the provided u-boot menu to download it and flash it. Or do you run the kernel without flashing it (which would be ideal) ?

@neilbrown

Contributor

neilbrown commented Apr 16, 2018

flash_erase works as expected. flashcp doesn't. Something wrong with the spi driver...

@neilbrown

Contributor

neilbrown commented Apr 16, 2018

I don't need to flash the kernel to test it.
The script I use to build the kernel finishes with

cp O/arch/mips/boot/uImage.bin /srv/tftpboot/GB-PCx_uboot.bin
echo 'tftpboot;bootm 80200000'

I reboot (or power-cycle) the gnubee and press '4' repeatedly during the early messages. Once is probably enough, but more doesn't hurt.
When I get the prompt I cut/paste that last message printed by the script.
For this to work you at least need to set serverip in your u-boot environment
e.g.

setenv serverip 192.168.1.4
saveenv

You might also want to set, or at least check (printenv) bootfile, and ipaddr

@neilbrown

Contributor

neilbrown commented Dec 26, 2018

@Adirelle

Adirelle commented Dec 26, 2018

Two more:

  GEN     Makefile
scripts/kconfig/conf  --syncconfig Kconfig
  GEN     Makefile
  Using .. as source for kernel
  CALL    ../scripts/checksyscalls.sh
  CHK     include/generated/compile.h
  GEN     usr/initramfs_data.cpio.lzma
  HOSTCC  lib/gen_crc64table
../lib/gen_crc64table.c:19:24: fatal error: linux/swab.h: No such file or directory
 #include <linux/swab.h>
                        ^
compilation terminated.
make[3]: *** [scripts/Makefile.host:90: lib/gen_crc64table] Error 1

and:

  AS      usr/initramfs_data.o
  AR      usr/built-in.a
  CC      drivers/watchdog/mt7621_wdt.o
../drivers/watchdog/mt7621_wdt.c:174:34: error: array type has incomplete element type ‘struct of_device_id’
 static const struct of_device_id mt7621_wdt_match[] = {
                                  ^~~~~~~~~~~~~~~~
../drivers/watchdog/mt7621_wdt.c:175:4: error: field name not in record or union initializer
  { .compatible = "mediatek,mt7621-wdt" },
    ^
../drivers/watchdog/mt7621_wdt.c:175:4: note: (near initialization for ‘mt7621_wdt_match’)
../drivers/watchdog/mt7621_wdt.c:174:34: warning: ‘mt7621_wdt_match’ defined but not used [-Wunused-variable]
 static const struct of_device_id mt7621_wdt_match[] = {
                                  ^~~~~~~~~~~~~~~~
make[4]: *** [../scripts/Makefile.build:292: drivers/watchdog/mt7621_wdt.o] Error 1

I'd report them to your linux fork but you disabled the issues.

Edit: here is the config file.

@xvybihal

xvybihal commented Dec 27, 2018

Very similar build error here:

  CC      drivers/watchdog/mt7621_wdt.o
../drivers/watchdog/rt2880_wdt.c:188:34: error: array type has incomplete element type
 static const struct of_device_id rt288x_wdt_match[] = {
                                  ^
../drivers/watchdog/rt2880_wdt.c:189:2: error: field name not in record or union initializer
  { .compatible = "ralink,rt2880-wdt" },
  ^
../drivers/watchdog/rt2880_wdt.c:189:2: error: (near initialization for 'rt288x_wdt_match')
../drivers/watchdog/rt2880_wdt.c:188:34: warning: 'rt288x_wdt_match' defined but not used [-Wunused-variable]
 static const struct of_device_id rt288x_wdt_match[] = {
                                  ^
make[3]: *** [../scripts/Makefile.build:292: drivers/watchdog/rt2880_wdt.o] Error 1
make[3]: *** Waiting for unfinished jobs....
  CC      drivers/usb/phy/of.o
  CC      lib/xarray.o
../drivers/watchdog/mt7621_wdt.c:174:34: error: array type has incomplete element type
 static const struct of_device_id mt7621_wdt_match[] = {
                                  ^
../drivers/watchdog/mt7621_wdt.c:175:2: error: field name not in record or union initializer
  { .compatible = "mediatek,mt7621-wdt" },
  ^
../drivers/watchdog/mt7621_wdt.c:175:2: error: (near initialization for 'mt7621_wdt_match')
../drivers/watchdog/mt7621_wdt.c:174:34: warning: 'mt7621_wdt_match' defined but not used [-Wunused-variable]
 static const struct of_device_id mt7621_wdt_match[] = {
                                  ^
make[3]: *** [../scripts/Makefile.build:292: drivers/watchdog/mt7621_wdt.o] Error 1
make[2]: *** [../scripts/Makefile.build:516: drivers/watchdog] Error 2
make[2]: *** Waiting for unfinished jobs....
@Adirelle

Adirelle commented Dec 27, 2018

> ../lib/gen_crc64table.c:19:24: fatal error: linux/swab.h: No such file or directory

I have fixed this error by installing linux headers on the host. I find this a bit weird: shouldn't the headers from the current source tree be used instead?

@smurfix

smurfix commented Dec 27, 2018

Umm … no. gen_crc64table is run on the build host.

@Adirelle

Adirelle commented Dec 27, 2018

> Umm … no. gen_crc64table is run on the build host.

Yes, I know. This is why I am perplexed: if I did not install the linux-headers package on the build host (which would not be possible on a non-linux distro, btw), I would not be able to cross-compile the kernel from its sources.

@Adirelle

Adirelle commented Dec 27, 2018

> ../drivers/watchdog/rt2880_wdt.c:188:34: error: array type has incomplete element type

I disabled the rt2880 watchdog, assuming that hardware is probably not present in the mt7621, but I am not sure of that.

@Adirelle

Adirelle commented Dec 28, 2018

Maybe we should ping @neilbrown . ^^

@neilbrown

Contributor

neilbrown commented Dec 29, 2018

The compile error in rt2880_wdt.c and mt7621_wdt.c is caused by
Commit ac3167257b9f ("headers: separate linux/mod_devicetable.h from linux/platform_device.h")

We need to add
#include <linux/mod_devicetable.h>
to both those files. I'll submit a patch.
I suspect that mt7621_wdt.c is the driver to use if you want a watchdog, though I haven't tested it.
There might be problems using the watchdog together with the flash-memory as the flash controller needs to be reset for a reboot, and the hardware watchdog doesn't force a reset. Something can probably be done as long as the flash controller isn't being used when the watchdog fires.

@neilbrown

Contributor

neilbrown commented Dec 29, 2018

The problem with compiling gen_crc64table was caused by
Commit feba04fd2cf8 ("lib: add crc64 calculation routines")
and can be fixed by simply removing the
#include <linux/swab.h>
from the source file. I've posted a patch.

I've pushed out a new gnubee/v4.20 which contains these patches, and also updates to v4.20-final.

@xvybihal

xvybihal commented Jan 2, 2019

@neilbrown I successfully built your gnubee/v4.20 for GB1 and it booted, but without any network interfaces. Is that known/expected? When I build 4.15 with the same defconfig, it works as expected, network included.
Anybody else tried it?

@neilbrown

Contributor

neilbrown commented Jan 3, 2019

> @neilbrown I successfully built your gnubee/v4.20 for GB1 and it booted, but without any network interfaces.

That is not expected. Can you confirm that you have
CONFIG_NET_VENDOR_MEDIATEK_STAGING=y
CONFIG_NET_MEDIATEK_MT7621=y
CONFIG_NET_MEDIATEK_SOC_STAGING=y

in your .config ?
Does
dmesg | grep eth

produce anything?
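
The first check can also be scripted. The Python sketch below is only an illustration: the helper name `missing_options` is hypothetical, and it simply reports which of the three options above are not set to `=y` in the text of a .config file.

```python
REQUIRED = (
    "CONFIG_NET_VENDOR_MEDIATEK_STAGING",
    "CONFIG_NET_MEDIATEK_MT7621",
    "CONFIG_NET_MEDIATEK_SOC_STAGING",
)


def missing_options(config_text: str, required=REQUIRED) -> list:
    """Return the required options that are not set to =y in a .config."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value == "y":
            enabled.add(name)
    return [name for name in required if name not in enabled]
```

For example, `missing_options(open("O/.config").read())` returns an empty list when all three options are built in, assuming the build directory is `O` as used elsewhere in this thread.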

@xvybihal

xvybihal commented Jan 3, 2019

> That is not expected. Can you confirm that you have
> CONFIG_NET_VENDOR_MEDIATEK_STAGING=y
> CONFIG_NET_MEDIATEK_MT7621=y
> CONFIG_NET_MEDIATEK_SOC_STAGING=y

I only had CONFIG_NET_MEDIATEK_MT7621=y

> Does
> dmesg | grep eth
>
> produce anything?

I can not check, because I flashed back the 4.15.18 which worked. I will try to build new image today with the suggested .config values.

Btw, in your gnubee/v4.20 branch, I can not see any gnubee defconfig, so I used the one from 4.15 and modified some stuff related to filesystems, fuse, etc. Maybe you forgot to commit it, or should we use mt7621_defconfig?
Thanks

//edit: Amazing, I remotely flashed the newly built image and it worked. Now I even have two network interfaces. Thanks again @neilbrown - great work.

Used gnubee1_defconfig

gnubee ~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether 8a:c3:7a:93:d8:4f brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.101/24 brd 192.168.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fdf2:b583:42a8:0:88c3:7aff:fe93:d84f/64 scope global mngtmpaddr dynamic 
       valid_lft forever preferred_lft forever
    inet6 fe80::88c3:7aff:fe93:d84f/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 7e:05:be:8b:83:a9 brd ff:ff:ff:ff:ff:ff
gnubee ~ # uname -a
Linux gnubee.jvi.cz 4.20.0+ #3 SMP Thu Jan 3 08:25:00 CET 2019 mips GNU/Linux

@neilbrown

Contributor

neilbrown commented Jan 3, 2019

> Btw, in your gnubee/v4.20 branch, I can not see any gnubee defconfig,

I've just pushed out an update which contains gbpc1_defconfig which is just the config I use. Of course, one danger of using that is you only test the things I test, so you might not find bugs that I don't find...
Have fun.

@Adirelle

Adirelle commented Jan 4, 2019

Speaking of that, I just compiled and booted the new version, with my own configuration file (see there). It seems to boot fine, with the network, but it fails to activate a raid logical volume. I have checked that the device mapper is properly configured, after https://github.com/Adirelle/gnubee-kernel/blob/master/gnubee1_defconfig#L1407.

Here are the relevant error messages:

[   32.512176] device-mapper: table: 252:5: raid: Component device(s) too small
[   32.526398] device-mapper: ioctl: error adding target to table
[   33.176958] device-mapper: table: 252:5: raid: Component device(s) too small
[   33.191491] device-mapper: ioctl: error adding target to table

252:5 is the device number of the actual LV on v4.15.

Here is the output of lvs -a -o path,name,kernel_major,kernel_minor,devices,attr,size,sync_percent with the v4.15 kernel:

  Path             LV              KMaj KMin Devices                           Attr       LSize   Cpy%Sync
  /dev/vgdata/data data             252    5 data_rimage_0(0),data_rimage_1(0) rwi-aor--- 925,00g 100,00
                   [data_rimage_0]  252    2 /dev/sda2(129)                    iwi-aor--- 925,00g
                   [data_rimage_1]  252    4 /dev/sdb2(1)                      iwi-aor--- 925,00g
                   [data_rmeta_0]   252    1 /dev/sda2(128)                    ewi-aor---   4,00m
                   [data_rmeta_1]   252    3 /dev/sdb2(0)                      ewi-aor---   4,00m
  /dev/vgdata/swap swap             252    0 /dev/sda2(0)                      -wi-ao---- 512,00m

Output of the same command with v4.20:

  Path             LV              KMaj KMin Devices                           Attr       LSize   Cpy%Sync
  /dev/vgdata/data data              -1   -1 data_rimage_0(0),data_rimage_1(0) rwi---r--- 925,00g
                   [data_rimage_0]  252    2 /dev/sda2(129)                    Iwi-a-r-r- 925,00g
                   [data_rimage_1]  252    4 /dev/sdb2(1)                      Iwi-a-r-r- 925,00g
                   [data_rmeta_0]   252    1 /dev/sda2(128)                    ewi-a-r-r-   4,00m
                   [data_rmeta_1]   252    3 /dev/sdb2(0)                      ewi-a-r-r-   4,00m
  /dev/vgdata/swap swap             252    0 /dev/sda2(0)                      -wi-ao---- 512,00m

I am wondering what changed in DM raid that could cause this error. Glaring at git log gnubee/v4.15..gnubee/v4.20 has not resulted in the expected enlightenment.

@neilbrown

Contributor

neilbrown commented Jan 5, 2019

> [ 32.512176] device-mapper: table: 252:5: raid: Component device(s) too small

That error message - and the check which leads to it - doesn't exist in 4.15. It was added in 4.16 by commit 188a212df1f3a2d7ea9bb0fc0ab4173042c23470

Can you report the output of "dmsetup table" and "dmsetup info" on 4.15 when the array is running properly?

@Adirelle

This comment has been minimized.

@neilbrown

Contributor

neilbrown commented Jan 6, 2019

Thanks. That allows me to rule out some possible causes. I think it is a code bug.
Could you please edit include/linux/device-mapper.h in the 4.15 code and change

static inline sector_t to_sector(unsigned long n)

to

static inline sector_t to_sector(unsigned long long n)

i.e. change "long" to "long long". Then rebuild and see if it works.
The size of your component device is 993211187200 bytes, which when stored in an "unsigned long" on a 32-bit arch is truncated to 1073741824. to_sector() converts this to 2097152 sectors, which is much smaller than the array requirement of 1939865600 sectors - hence the error.
When you confirm that this fixes the problem, I'll post a patch. If you would like to be acknowledged with a "Reported-and-tested-by:" tag, let me know and give me an email address.
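
The arithmetic above can be checked with a short Python sketch. It mimics the truncation by masking the byte count to the width of the parameter type. The 9-bit shift (512-byte sectors) is an assumption about what to_sector() does internally, though it is consistent with the figures quoted above.

```python
SECTOR_SHIFT = 9  # 512-byte sectors (assumed)


def to_sector(n_bytes: int, param_bits: int) -> int:
    """Mimic to_sector() when its argument is an unsigned integer of param_bits bits."""
    truncated = n_bytes & ((1 << param_bits) - 1)  # what fits in the parameter
    return truncated >> SECTOR_SHIFT


component_bytes = 993211187200
required_sectors = 1939865600

print(to_sector(component_bytes, 32))  # "unsigned long" on a 32-bit arch
print(to_sector(component_bytes, 64))  # "unsigned long long"
```

With a 32-bit parameter the result is 2097152 sectors, far below the requirement. With 64 bits it is 1939865600, exactly the size the array needs.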

@Adirelle

Adirelle commented Jan 6, 2019

> in the 4.15 code

1/ Should it not be in the 4.20 code, since it is not even checked in the 4.15 ?
2/ Does it only affect a check or should I expect some data corruption or other side effects with 4.15 ?

@Adirelle

Adirelle commented Jan 6, 2019

It fixes the error on v4.20. I ran a read-only fsck of the underlying filesystem and it seems ok. Should I test with v4.15 too ?

BTW, is there a way to build the initramfs and u-boot image in the path indicated by the "O" variable instead of the source tree ?

@neilbrown

Contributor

neilbrown commented Jan 6, 2019

1/ Yes, 4.20. Sorry.
2/ No data corruption. You might get an error if you try to reshape an array or add a journal.

Thanks for testing and reporting - I'll send the patch upstream.
I doubt you would notice any difference if you made the same change in 4.15.

If you set GNUBEE_INITRAMFS_TREE in gnubee-tools/config to some other directory, the initramfs should be built there. The u-boot image is always built in the O directory ($GNUBEE_KERNEL_OBJECTS) and then copied to $GNUBEE_BUILD_DIR.
Everything should be configurable in the 'config' file.

@Adirelle

Adirelle commented Jan 6, 2019

Thank you.

Here are more error messages (which do not seem to prevent anything from working, though):

[   11.599861] mt7621_gpio 1e000600.gpio: registering 32 gpios
[   11.615781] gpio gpiochip1: (1e000600.gpio-bank1): detected irqchip that is shared with multiple gpiochips: please fix the driver.
[   11.639095] mt7621_gpio 1e000600.gpio: registering 32 gpios
[   11.652240] gpio gpiochip2: (1e000600.gpio-bank2): detected irqchip that is shared with multiple gpiochips: please fix the driver.
[   11.675488] mt7621_gpio 1e000600.gpio: registering 32 gpios
[   11.775701] cacheinfo: Failed to find cpu0 device node
[   11.786392] cacheinfo: Unable to detect cache hierarchy for CPU 0
[   12.324852] ------------[ cut here ]------------
[   12.334100] WARNING: CPU: 2 PID: 1 at /home/user/gnubee/linux/drivers/mtd/spi-nor/spi-nor.c:3659 spi_nor_init+0x134/0x1d8
[   12.369010] enabling reset hack; may not recover from unexpected reboots
[   12.382345] Modules linked in:
[   12.388419] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0+ #11
[   12.400529] Stack : 00000123 00000000 00000000 80068b8c 00000001 811e3814 80640000 0000000b
[   12.417147]         807054b8 9bc81944 00000000 811e0000 80720000 00000001 9bc818d8 53260ce1
[   12.433766]         00000000 00000000 81220000 00000007 00000000 6465746e 00000123 00000000
[   12.450382]         00000000 00000122 81210000 79616d20 80720000 00000000 80740000 806db044
[   12.450402]         00000009 00000e4b 9bc81b88 00000000 00000000
[   12.450436]  802fca50 00000008 811e0008
[   12.450442]         ...
[   12.450448] Call Trace:
[   12.450472] [<8000c244>] show_stack+0x8c/0x130
[   12.450500] [<805c63a4>] dump_stack+0x94/0xd0
[   12.450514] [<80027d00>] __warn+0x10c/0x114
[   12.450524] [<80027d48>] warn_slowpath_fmt+0x40/0x64
[   12.450535] [<80375bf8>] spi_nor_init+0x134/0x1d8
[   12.450546] [<80378b88>] spi_nor_scan+0x8f8/0xa60
[   12.450562] [<80366d60>] m25p_probe+0x178/0x218
[   12.450573] [<8030b768>] really_probe+0x2cc/0x430
[   12.450595] [<803094a4>] bus_for_each_drv+0xac/0xcc
[   12.450604] [<8030b9ec>] __device_attach+0xbc/0x130
[   12.450616] [<8030a214>] bus_probe_device+0x3c/0xb0
[   12.450626] [<80307ef0>] device_add+0x494/0x5b0
[   12.450648] [<80390f08>] spi_add_device+0x148/0x1b0
[   12.450659] [<803919e8>] spi_register_controller+0x7a4/0x940
[   12.450679] [<8030d708>] platform_drv_probe+0x40/0x7c
[   12.450688] [<8030b768>] really_probe+0x2cc/0x430
[   12.450697] [<8030be98>] __driver_attach+0xb4/0x138
[   12.450707] [<803093a0>] bus_for_each_dev+0x6c/0xb0
[   12.450719] [<8030a5d8>] bus_add_driver+0x204/0x24c
[   12.450728] [<8030c7d8>] driver_register+0xd0/0x118
[   12.450738] [<80001638>] do_one_initcall+0x84/0x19c
[   12.450758] [<80753f2c>] kernel_init_freeable+0x248/0x250
[   12.450775] [<805e1fdc>] kernel_init+0x14/0x110
[   12.450784] [<80006838>] ret_from_kernel_thread+0x14/0x1c
[   12.450832] ---[ end trace b800848cea8dadd4 ]---

By the way, it seems the SATA drives are configured for UDMA/133 despite a link at 6.0 Gbps:

[   13.045238] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   13.141553] ata2.00: ATA-10: ST1000LX015-1U7172, SDM1, max UDMA/133
[   13.154447] ata2.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 32), AA
[   13.262441] ata2.00: configured for UDMA/133

The same goes for ata1. The other slots are unused.

For reference, I used this config file. I have tried to disable what my GB1 does not need, to build in the always-used drivers, and to enable as modules what I may need (like USB mass storage). I hope I have not disabled anything essential.

Contributor

neilbrown commented Jan 6, 2019

The warning in spi-nor is annoying but harmless.
Some background can be read here: https://patchwork.ozlabs.org/patch/950299/
The hardware is not "broken" - but I didn't think it was worth fighting too hard.

I don't know much about ATA speeds so I cannot comment on that issue. I doubt it is related to your config-file choices.

Adirelle commented Jan 7, 2019

What about the "detected irqchip that is shared with multiple gpiochips: please fix the driver." message? Not that I have any use for the GPIO, but could it cause a bug? IIRC, there is only one GPIO available to the user on the GB.

> I don't know much about ATA speeds so I cannot comment on that issue.

Well, neither HDDs nor SSDs reach 6 Gb/s (about 600 MB/s, to be compared with the 133 MB/s of UDMA/133), but I wanted to be sure the disk rate was not limited by the link. However, it seems the ST1000LX015-1U7172 hardly reaches 93 MB/s, so that should not be an issue.
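
To put those numbers side by side, here is a rough sketch. It assumes the usual 8b/10b line encoding on SATA links (10 bits on the wire per data byte) and ignores protocol overhead; the function name is illustrative, and 93 MB/s is simply the drive figure quoted above.

```python
def sata_payload_mb_per_s(line_rate_gbps):
    """Approximate payload rate of a SATA link in MB/s.

    Assumes 8b/10b encoding: every data byte costs 10 bits on the wire.
    """
    bits_per_second = line_rate_gbps * 1e9
    bytes_per_second = bits_per_second / 10
    return bytes_per_second / 1e6


link_mb_s = sata_payload_mb_per_s(6.0)  # link speed reported in the log
udma133_mb_s = 133.0                    # nominal UDMA/133 rate
drive_mb_s = 93.0                       # throughput quoted for the drive

print(link_mb_s)                  # 600.0
print(drive_mb_s < udma133_mb_s)  # True
print(drive_mb_s < link_mb_s)     # True
```

Under these assumptions the drive stays below both limits, so neither figure would be the bottleneck.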

> I doubt it is related to your config-file choices.

OK. While reviewing the drivers to enable, I was wondering which hardware is actually present on the GB-PC1. E.g., what is SPI used for, and does the GB-PC1 use it? Is it actually needed? Same questions for I2C, ...

@neheb

> Fun little project: https://github.com/vschagen/mtk-eip93

I am not familiar with hardware crypto engines. Could userland software (e.g. openss[lh] and the like) use the hardware crypto?

Contributor

neilbrown commented Jan 7, 2019

SPI is used to access the Flash storage.
GPIO is used to drive the LEDs and to sense the push-button.
I don't think I2C is used.
Thanks for mentioning the irqchip thing - I'll look into it.
I believe most crypto libraries will use kernel-supported hardware when available; I'm fairly sure that includes openssl and I suspect that includes openssh.

Contributor

neheb commented Jan 7, 2019

The hardware only supports AES in CBC mode. SSH does not do CBC. dm-crypt defaults to XTS but can be configured to use CBC.
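
One way to see which implementations the kernel has registered for a given mode is to read /proc/crypto. The sketch below parses text in that format and lists the entries for one algorithm name, highest priority first. The sample text is made up for illustration, and `cbc-aes-hypothetical-hw` is a placeholder, not the real driver name of any engine; on a device you would pass the contents of /proc/crypto instead.

```python
import re


def parse_proc_crypto(text):
    """Split /proc/crypto-style text into one dict per algorithm entry."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
        if entry:
            entries.append(entry)
    return entries


def drivers_for(entries, name):
    """Return the entries implementing `name`, highest priority first."""
    matches = [e for e in entries if e.get("name") == name]
    return sorted(matches, key=lambda e: int(e.get("priority", 0)), reverse=True)


# Made-up sample in /proc/crypto format; driver names are placeholders.
SAMPLE = """\
name         : cbc(aes)
driver       : cbc-aes-hypothetical-hw
module       : kernel
priority     : 300
type         : skcipher

name         : cbc(aes)
driver       : cbc(aes-generic)
module       : kernel
priority     : 100
type         : skcipher

name         : xts(aes)
driver       : xts(ecb(aes-generic))
module       : kernel
priority     : 100
type         : skcipher
"""

entries = parse_proc_crypto(SAMPLE)
cbc_drivers = [e["driver"] for e in drivers_for(entries, "cbc(aes)")]
print(cbc_drivers)
```

On a real system, `parse_proc_crypto(open("/proc/crypto").read())` gives the same structure.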

Adirelle commented Jan 12, 2019

It seems the network bug (ping works, but neither TCP nor UDP does) is back. I have not identified what causes it and I see no suspicious messages in dmesg or the logs. It happens with both the 4.15.18 and 4.20 kernels.
An ifdown+ifup of eth0 does not help, but a reboot does.
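
Since ping keeps working while TCP does not, a connectivity check has to open a real TCP connection to notice this failure. A minimal sketch of such a probe follows; the function name, the timeout, and the loopback demo are illustrative assumptions, not the job actually running on the device, and UDP is not covered.

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Demo against a throwaway listener on the loopback interface.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable_while_listening = tcp_reachable("127.0.0.1", demo_port, timeout=1.0)
listener.close()
reachable_after_close = tcp_reachable("127.0.0.1", demo_port, timeout=1.0)

print(reachable_while_listening)
print(reachable_after_close)
```

In a watchdog, the probe would target a host and port on the LAN that is known to accept connections, for example a router's web interface.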

cmm commented Jan 13, 2019

@Adirelle do you use NFS (or Samba with sendfile() on)?

Adirelle commented Jan 13, 2019

@cmm the kernel NFS server, yes. I do not know if sendfile is enabled. I removed samba a few weeks ago.

cmm commented Jan 13, 2019

@Adirelle I started getting the network bug a lot once my Gnubee started serving mostly 1080p media (instead of mostly 720p, where I was having those lockups maybe once a month); the problem disappeared completely once I stopped using NFS and moved to Samba-sans-sendfile().

it is my not-entirely-informed guess that the VFS/network interplay in the kernel is screwy, probably due to how the switch code does locking. in fact, any in-kernel code that serves data streams through the switch is probably dangerous (I don't know if there is any apart from sendfile() & NFS, though). the actual problem here is that most SoC vendors just don't test such configurations -- most small NAS systems on the market don't do NFS at all, and the SoC in Gnubee is made for routers...
