Status of mainline kernel support #75

Open
sumpfralle opened this issue Mar 18, 2018 · 139 comments

Comments

@sumpfralle

I guess it is of vital importance for the longevity of this project (and the hardware) to bring support for the peripherals of GnuBee devices to the upstream kernel development tree. Otherwise we will end up with an outdated kernel (and thus more electronic landfill) someday ...

Since this is not a short-term task, I propose using this ticket to collect the status of the ongoing mainlining efforts.

Could someone please start with summarizing the current state?
Thank you!

@neilbrown
Contributor

https://github.com/neilbrown/linux/commits/gnubee/v4.15 mostly works. I just today noticed that only the first SATA controller works. Might fix that tomorrow.
I've posted some patches for inclusion in drivers/staging: https://lkml.org/lkml/2018/3/14/1035 Will repost with some improvements tomorrow.

@sumpfralle
Author

Thank you for your quick response, for all the work you have put into mainline support and for the entertaining LWN article you wrote.
Your progress is way better than I expected - this revives my positive emotions for the GnuBee platform. Thank you!

@Adirelle

Adirelle commented Mar 29, 2018

@neilbrown Have you hit this bug with the ethernet driver ?

Right now, I have a job that checks the network connectivity every minute and reboots the GB if it fails...

I am very happy that a kernel hacker has got some interest in this project. Thanks for your time and your work.

@neilbrown
Contributor

I've seen something like bug #54 when the Gnubee was sitting at the u-boot prompt, and once when sitting at a shell prompt in the initramfs, but never with a mainline kernel fully booted and the network configured.
This message: https://groups.google.com/d/msg/gnubee/cJuFmwCu4XI/F1KwJSIfAgAJ describes the problem the way I see it - the whole switch attached to the gnubee dies.
My guess would be that there is some inter-switch protocol (possibly the Spanning Tree Protocol) which the embedded switch in the gnubee is messing up.
In mainline Linux the embedded switch is configured as a boring transparent switch with no smarts, and maybe that is why it doesn't confuse other switches as much.
It would be interesting to see what happens if the gnubee is directly connected to a PC instead of to a switch.

@alethiophile

My experience is also that when it's booted into the Linux kernel, the network crash problem goes away. I see the problem when the board is in any of the three states uboot prompt/initramfs prompt/halted, but powered on.

@Adirelle

Weird. It always happens on a booted kernel (the libreCMC-based one, provided by GB) and does not take the switch down. The GB still has ICMP (e.g. ping works) but TCP and UDP become unusable. That is how the job detects a failure: if it can ping a fixed IP but cannot connect to its echo service, it reboots the system.
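
A check of this kind could be sketched in shell roughly as follows. This is only an illustration, not the actual job: the host address, the port, the use of `nc` for the TCP probe and the function names are all assumptions.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: tell "ICMP works but TCP is dead" apart
# from a healthy link or a total outage. Host and port are placeholders.

HOST="${HOST:-192.0.2.1}"   # assumed fixed IP that runs an echo service
PORT="${PORT:-7}"           # TCP echo port

# Map the two probe results (0 = success) to a state name.
classify() {
    icmp_rc=$1
    tcp_rc=$2
    if [ "$icmp_rc" -eq 0 ] && [ "$tcp_rc" -eq 0 ]; then
        echo ok
    elif [ "$icmp_rc" -eq 0 ]; then
        echo tcp-lockup      # ping answers, TCP connect does not
    else
        echo unreachable     # may be the remote host, so do not reboot
    fi
}

# Run both probes and print the resulting state.
probe() {
    ping -c 1 -W 2 "$HOST" >/dev/null 2>&1; icmp_rc=$?
    nc -z -w 2 "$HOST" "$PORT" >/dev/null 2>&1; tcp_rc=$?
    classify "$icmp_rc" "$tcp_rc"
}

# From cron, once a minute (left commented out so nothing reboots by accident):
# [ "$(probe)" = tcp-lockup ] && reboot
```

Keeping the decision in `classify` separate from the probes makes it easy to check the logic without a network, and rebooting only on the `tcp-lockup` state avoids reboot loops when the remote host itself is down.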

@neilbrown
Contributor

Maybe there are two different bugs here - one that kills the switch and one that kills TCP but leaves ICMP working....

@xvybihal

@Adirelle I also hit bug #54 with this mainline kernel. My GnuBee PC1 is sitting here on my desk, working "fine" (in quotes because of the ping behaviour shown below). But when I come back some hours or day(s) later, I cannot connect to it - the time at which it becomes unavailable via network is random, as far as I can tell. I have to power it off and on again; nothing else worked for me (such as (un)plugging the cable).

gnubee ~ # uname -a
Linux gnubee.jvi.cz 4.15.12+ #3 SMP Wed Apr 4 15:29:16 CEST 2018 mips GNU/Linux

gnubee ~ # cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

gnubee ~ # cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
	address 172.16.202.254
	netmask 255.255.0.0
	gateway 172.16.whatever

The strange thing is that when I run ping long enough, I see that every once in a while the gnubee takes quite a long time to respond (100+ ms). I can definitely rule out a faulty switch or cable.

https://gist.githubusercontent.com/xvybihal/b403c3c2677e423f3aea73feb8255a91/raw/d137ab85de894e2d5e3cb5bb9e2ffae16d023231/Gnubee%2520ping

Unfortunately I do not have a UART cable to check what is going on when the GnuBee is not available via network.

This "thing" is nothing new; it has been acting this way from the beginning, with every kernel provided and with Debian installed. It makes the GnuBee pretty unusable for me if I have to restart it several times a day.

@neilbrown
Contributor

Thanks for the detailed bug report. I love details.
What is happening on the gnubee before it starts failing?
Is it just sitting there idle? Is it initiating network traffic itself? Is it responding to requests?
Are there lots of requests, or occasional? or few? What protocol or protocols?
I've started a job fetching a 10K file over http every 2 seconds. I'll see if it is still working in the morning.

@Adirelle

It does not seem to be linked to network activity. Stopping the transmission daemon to reduce the bandwidth and number of connections does not prevent it from happening (nor does it postpone it). And I have downloaded an Ubuntu DVD torrent over a 1Gb link without any issue.

@neilbrown
Contributor

So what is your gnubee doing when the problem happens? How do you first notice it? How long after boot does it typically fail? <1hr? <12hrs?
If I can reproduce the problem it is very likely that I can fix it.
If not ... there are some kernel messages in that other bug which might be enough of a hint. Maybe I'll try copying a bigger file.

@xvybihal

In my case, the GnuBee is doing nothing when it happens. What do I mean by nothing? The installed system is clean; no additional network daemons or services are installed. No network transfers are happening (except for time sync and the common stuff in the base system). I do not use it; it just sits here doing nothing. The time when it happens is just random. Sometimes it's hours, sometimes a day or two.
Next time it happens, I will try to take out the SD card and copy some logs here. That's probably the only thing I can do without a UART cable (which I am going to equip myself with soon).

@xvybihal

Looking at the log/messages, there might be something helpful. Uploading the whole file; it looks pretty similar to what @Adirelle posted: messages.tar.gz

@Adirelle

So what is your gnubee doing when the problem happens? How do you first notice it? How long after boot does it typically fail? <1hr? <12hrs?

Well, it seems totally random. The gnubee could be idle or I could be using it. Right now, it is running the following services : kernel nfs server, dropbear, dms (a upnp media server), ntpd, mysqld, transmission-daemon, ypbind, and there is a job running rdiff-backup once a day.

Sometimes it can be ok a whole week, sometimes only a few hours. I have the impression that the uptime rarely gets over 48 hours.

If I can reproduce the problem it is very likely that I can fix it

I understand that -- I am a developer myself -- but unfortunately I could not find something to trigger the bug. I have only got the kernel error message.

I have a script that reboots it within one minute after it happens. It could dump the result of some commands in a file before rebooting, if you had some suggestions.

@neilbrown
Contributor

Thanks for posting the whole log file (did I say that I love details?)
The problem appears to be that interrupts get disabled on one of the CPUs - presumably CPU0, as that gets all device interrupts on my gnubee. This is reported by RCU at timestamp 2179, which should be about 21 seconds after it happened. Then at timestamp 6317 (about 1 hour later) the network watchdog complains.
With interrupts disabled on one CPU, the network card will become unreliable. As polling is sometimes used you can still get some traffic through, but it won't be fast. A 'ping' probably works because it sends a packet every second, and only wants one in reply. If the driver checks for incoming packets whenever it transmits a packet (quite likely), ping will appear to work normally.
The next question is: why do interrupts get disabled? Two things might be useful.
1/ @Adirelle if you could get your script to run "echo t > /proc/sysrq-trigger" when the problem is detected, that might help. It should write stack traces for all processes to the kernel log. Seeing those would be most helpful.
2/ @xvybihal If you could rebuild your kernel with lockdep enabled, that might produce more useful info (I hope). You need to enable CONFIG_PROVE_LOCKING and CONFIG_DEBUG_LOCKDEP and you may as well add CONFIG_PROVE_RCU. Then if it happens again, collect the logs the same way that you did before.
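
Before rebuilding, it may be worth confirming that the three options actually ended up in the config. A minimal sketch, assuming the build directory is `O/` as used elsewhere in this thread; the function name is made up for illustration:

```shell
#!/bin/sh
# Sketch: report which of the lock-debugging options named above are not
# set to "y" in a kernel .config. The function name is hypothetical.

missing_lockdep_opts() {
    config_file=$1
    for opt in CONFIG_PROVE_LOCKING CONFIG_DEBUG_LOCKDEP CONFIG_PROVE_RCU; do
        # -x matches the whole line, so "# CONFIG_X is not set" is not a hit
        grep -qx "${opt}=y" "$config_file" || echo "$opt"
    done
}

# Example: missing_lockdep_opts O/.config
```

If anything is reported, the kernel tree's `scripts/config --file O/.config -e PROVE_LOCKING -e DEBUG_LOCKDEP` followed by `make olddefconfig` is one way to switch the options on; on some kernel versions CONFIG_PROVE_RCU is selected automatically by CONFIG_PROVE_LOCKING.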
Thanks.

@Adirelle

@neilbrown I have set up a script that will log the output of the following commands when it happens. I will post the next one.

date
uptime
vmstat 1 1
netstat -ieW
netstat -aopenW
lsmod
ps -ef
echo t > /proc/sysrq-trigger
dmesg
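
One way to capture the output of such a list into a single file before rebooting is sketched below. It is not the script actually used here; the helper name `log_cmd` and the log path are assumptions.

```shell
#!/bin/sh
# Sketch: append the output of one command to a log file, preceded by a
# header line, so several commands can be captured before a reboot.
# "log_cmd" and the log path are illustrative names.

log_cmd() {
    logfile=$1
    shift
    {
        echo "=== $* ==="
        "$@" 2>&1
    } >> "$logfile"
}

# Example use when the lockup is detected:
# LOG=/var/log/netlockup-$(date +%Y%m%d-%H%M%S).log
# log_cmd "$LOG" date
# log_cmd "$LOG" uptime
# log_cmd "$LOG" sh -c 'echo t > /proc/sysrq-trigger'
# log_cmd "$LOG" dmesg
# sync   # flush the log to disk before rebooting
```

Running `dmesg` after the sysrq trigger matters, because the stack traces land in the kernel log rather than on standard output.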

@neilbrown
Contributor

I have found a smoking gun.... I don't know if it is the smoking gun.
I went looking for places in the code which disable interrupts without clearly re-enabling them. I didn't expect to find any - that would be too easy. I found two!
One is almost certainly of no consequence - it always happens very early during initialization, so something must re-enable them.
The other is in the network code. I don't understand the code enough to know when it happens, but it looks like it is in response to some event like the cable being pulled (though I tried that and it doesn't trigger anything). The same bug is in the 4.4 kernel code.
I've pushed out a fix to my gnubee/v4.15 kernel branch. Please test if you can.

@neheb
Contributor

neheb commented Apr 14, 2018

@neilbrown Have you tried patching the stock Mediatek MMC driver to add support for mt7621? I tried and failed. Something about missing pinctrl. I think I needed to edit the dts file...

This was my basis: jonpry/openwrt_mt7688@a85e6d9

@neilbrown
Contributor

@neilbrown Have you tried patching the stock Mediatek MMC driver to add support for mt7621?

No I haven't. As MMC is currently working, that driver is not a priority for me. My current priorities are roughly:

  • reboot - kernel currently leaves NOR flash in an inconsistent state
  • SATA access seems really slow - 10MB/sec!
  • network switch - enable VLAN support so different ports can be on different subnets
  • 2nd network interface: the SOC has 2 interfaces to the switch, only one works at present
  • crypto engine
  • PCI driver has horrible hacks to select correct interrupt

So I won't be focusing on MMC for quite a while. I'd be very happy for you or anyone else to dig into it and ask questions. If you have specific focused questions, I'd be happy to share any expertise I might have. I'd suggest opening a separate issue for each driver.

@neheb
Contributor

neheb commented Apr 14, 2018

I just tested dd if=/dev/zero of=test count=1M with a speed result of 7.5MB/s on a btrfs array. No wonder transmission is slow...
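
For comparing runs it can help to spell out the block size and force a flush, since dd defaults to 512-byte blocks (so `count=1M` amounts to 512 MiB) and the rate it prints can otherwise include page-cache effects. A minimal sketch; the function name and the target path are placeholders:

```shell
#!/bin/sh
# Sketch: write N MiB of zeros with an explicit block size and force the
# data to disk before dd reports its rate. "write_test" is a made-up name.

write_test() {
    target=$1
    mib=$2
    # conv=fsync makes dd flush to disk before exiting, so the reported
    # rate is not just the speed of the page cache
    dd if=/dev/zero of="$target" bs=1M count="$mib" conv=fsync
}

# Example: write_test /mnt/array/test.bin 512 && rm /mnt/array/test.bin
```

Zeros compress very well, so on a filesystem with compression enabled the figure may still be optimistic.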

I will dig through the MMC driver in a few days to see if I can get it working. Apparently the following commit to the one I linked has updated DTS entries.

@Adirelle

Adirelle commented Apr 15, 2018

@neilbrown

Compiled your v4.15 and it boots (which is already something since I was not sure about what I was doing) but :

  • the initramfs does not find my root partition, since it lies on a md RAID1 array.
  • it messes with the attached switch, probably because its own switch is not configured.

Your kernel branch might or might not fix some bugs, but as long as it works, I prefer having a recent version quite easy to build over the ones from libreCMC/LEDE.

PS: I will try to find what I need to add to the initramfs to support rootfs on MD array.
PS2: I cross-compiled the kernel on an Alpine-Linux-based VM, I hope this will not cause any issue.

@Adirelle

Got it working ; I was also missing some other modules. I will let you know if the network lockup happens again.

@neilbrown
Contributor

Another convert - hurray :-)
I've merged your patch - thanks.

@Adirelle

Ok. So on the bad news:

I will try to compare kernel settings with the ones from LibreCMC.

@Adirelle

@dgazineu Is there a way to flash the firmware from a running Linux ? (to shorten the whole "put your image on a USB stick, plug it on the GB, reboot, let uboot flash the firmware, remove the USB stick, reboot" cycle).

@neilbrown
Contributor

clock skew is back again (like #49 (comment)).

You need the cpuclock clock-frequency in arch/mips/boot/dts/ralink/gbpc1.dts to match the value set by the u-boot that you have installed. I have a u-boot that configures 900MHz so I set clock-frequency to 900000000.
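
To see what a given tree currently has, the value can be pulled out of the device-tree source. The sketch below assumes the property is written in the usual `clock-frequency = <N>;` form and prints every such value in the file, not only the cpuclock one; the function name is illustrative.

```shell
#!/bin/sh
# Sketch: print every clock-frequency value found in a device-tree source
# file, converted to MHz. Assumes the usual "clock-frequency = <N>;" form.

dts_clock_mhz() {
    sed -n 's/.*clock-frequency *= *<\([0-9][0-9]*\)>;.*/\1/p' "$1" |
    while read -r hz; do
        echo "$((hz / 1000000)) MHz"
    done
}

# Example: dts_clock_mhz arch/mips/boot/dts/ralink/gbpc1.dts
```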

performance seems lower than with the provided kernel, but I am not sure why.

I just discovered that large kernel modules hurt performance. I rebuilt with CONFIG_XFS_FS=y (instead of =m, in O/.config) and filesystem throughput is a lot faster. I've pushed an update for the defconfig file.
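
To find other candidates for the same change, the options currently built as modules can be listed from the config. A minimal sketch with a made-up function name:

```shell
#!/bin/sh
# Sketch: list the options a kernel .config builds as modules (=m), as
# candidates for switching to built-in (=y). Function name is illustrative.

modular_opts() {
    # Print only the option name, dropping the "=m" suffix
    sed -n 's/^\(CONFIG_[A-Za-z0-9_]*\)=m$/\1/p' "$1"
}

# Example: modular_opts O/.config | grep -i xfs
```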

Is there a way to flash the firmware from a running Linux ?

Probably, using flash_erase and flashcp from mtd-utils. I haven't played much with them - be careful.
I set up my linux desktop as a tftp server and test kernels using tftpboot. It is fairly painless.

@Adirelle

I have a u-boot that configures 900MHz so I set clock-frequency to 900000000.

Ah, I think mine is configured at 880 MHz. It seems there was some discrepancy between the shipped u-boot and the kernel updates that were provided later.

I set up my linux desktop as a tftp server and test kernels using tftpboot. It is fairly painless.

To be sure I understand you right : once the new image is available through TFTP, you use the provided u-boot menu to download it and flash it. Or do you run the kernel without flashing it (which would be ideal) ?

@neilbrown
Contributor

flash_erase works as expected. flashcp doesn't. Something wrong with the spi driver...

@neilbrown
Contributor

I don't need to flash the kernel to test it.
The script I use to build the kernel finishes with

cp O/arch/mips/boot/uImage.bin /srv/tftpboot/GB-PCx_uboot.bin
echo 'tftpboot;bootm 80200000'

I reboot (or power-cycle) the gnubee and press '4' repeatedly during the early messages. Once is probably enough, but more doesn't hurt.
When I get the prompt I cut/paste that last message printed by the script.
For this to work you at least need to set serverip in your u-boot environment
e.g.

setenv serverip 192.168.1.4
saveenv

You might also want to set, or at least check (printenv) bootfile, and ipaddr

@neilbrown
Contributor

The warning in spi-nor is annoying but harmless.
Some background can be read here: https://patchwork.ozlabs.org/patch/950299/
The hardware is not "broken" - but I didn't think it was worth fighting too hard.

I don't know much about ATA speeds so I cannot comment on that issue. I doubt it is related to your config-file choices.

@Adirelle

Adirelle commented Jan 7, 2019

What about "detected irqchip that is shared with multiple gpiochips: please fix the driver."? Not that I have any use for the GPIO, but could it cause a bug ? IIRC, there is only one GPIO available for users on the GB.

I don't know much about ATA speeds so I cannot comment on that issue.

Well, no HDD or SSD reaches 6 Gb/s (aka ATA-600, 600 to be compared to the 133 of UDMA-133) but I would like to be sure the disk rate was not limited by the link. However, it seems the ST1000LX015-1U7172 hardly reaches 93 MB/s so that should not be an issue.

I doubt it is related to your config-file choices.

Ok. While reviewing the drivers to enable, I was wondering about the hardware that is found or not on the GB-PC1. E.g., what is SPI used for and does the GB-PC1 use it ? Is it actually needed ? Same questions for I2C, ...

@neheb

Fun little project: https://github.com/vschagen/mtk-eip93

I am not familiar with hardware crypto engines. Could userland software (e.g. openss[lh] and the like) use the hardware crypto ?

@neilbrown
Contributor

SPI is used to access the Flash storage.
GPIO is used to drive the LEDs and to sense the push-button.
I don't think I2C is used.
Thanks for mentioning the irqchip thing - I'll look into it.
I believe most crypto libraries will use kernel-supported hardware when available, I'm fairly sure that includes openssl and I suspect that includes openssh.

@neheb
Contributor

neheb commented Jan 7, 2019

The hardware only supports AES in CBC mode. SSH does not do CBC. dm-crypt defaults to XTS but can be configured to use CBC.
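
Whether a hardware driver is registered for a given mode can be checked in /proc/crypto, where the kernel lists each algorithm with its driver name and priority. The sketch below only parses that listing; the function name is an assumption, and the driver names printed depend entirely on the kernel in use.

```shell
#!/bin/sh
# Sketch: list the drivers the kernel has registered for one algorithm,
# with their priorities, by parsing /proc/crypto-style text.
# "crypto_drivers" is an illustrative name.

crypto_drivers() {
    algo=$1
    file=${2:-/proc/crypto}
    awk -v algo="$algo" '
        $1 == "name"     { match_algo = ($3 == algo) }
        $1 == "driver"   { driver = $3 }
        $1 == "priority" { if (match_algo) print driver, $3 }
    ' "$file"
}

# Example: crypto_drivers "cbc(aes)"
```

When several drivers provide the same algorithm, the kernel crypto API normally prefers the one with the highest priority, so a hardware driver would be expected to show up with a larger number than the generic software one.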

@Adirelle

It seems the network bug (ping is working but not TCP nor UDP) is back. I have not identified what causes it and I have no suspicious message in dmesg or logs. It happens with both the 4.15.18 & 4.20 kernels.
Trying to ifdown+ifup eth0 does not work but a reboot does.

@cmm

cmm commented Jan 13, 2019

@Adirelle do you use NFS (or Samba with sendfile() on)?

@Adirelle

@cmm the kernel NFS server, yes. I do not know if sendfile is enabled. I removed samba a few weeks ago.

@cmm

cmm commented Jan 13, 2019

@Adirelle I started getting the network bug a lot once my Gnubee started serving mostly 1080p media (instead of mostly 720p, where I was having those lockups maybe once a month); the problem disappeared completely once I stopped using NFS and moved to Samba-sans-sendfile().

it is my not-entirely-informed guess that the VFS/network interplay in the kernel is screwy, probably due to how the switch code does locking. in fact, any in-kernel code that serves data streams through the switch is probably dangerous (I don't know if there is any apart from sendfile() & NFS, though). the actual problem here is that most SoC vendors just don't test such configurations -- most small NAS systems on the market don't do NFS at all, and the SoC in Gnubee is made for routers...

@Adirelle

Adirelle commented Mar 10, 2019

@neilbrown the initramfs init script played a trick on me recently: as I was recompiling the same kernel version, it refused to mount /lib/modules from the initramfs and used the existing /lib/modules instead, which was an issue since I had changed the module settings. Is there something in the modules folder that could be used to know if the kernel builds are different ?

Edit: I was thinking about adding a file containing a hash of the .config file and using it to tell.

@neilbrown
Contributor

Is there something in the modules folder that could be used to know if the kernel builds are different ?

No. For my gnubee-tools package, I create a 'stamp' file with the date when the modules were copied in, and compare that.
You could set CONFIG_LOCALVERSION_AUTO=y. This adds part of the git hash of the top commit to the version so there is no risk of using old libraries with a newer kernel.
I've updated the gnubee1_defconfig in v4.15 to have this changed.
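
The hash-of-.config idea mentioned in the question could be sketched as below. This is not what gnubee-tools does (it uses a date stamp, as described above); the stamp file name and both function names are hypothetical.

```shell
#!/bin/sh
# Sketch of the "hash of .config" idea: store a checksum next to each copy
# of the modules and compare the two stamps. File names are hypothetical.

# Print the SHA-256 of a kernel config file.
config_hash() {
    sha256sum "$1" | cut -d' ' -f1
}

# Succeed only if both stamp files exist and have identical content.
same_build() {
    [ -f "$1" ] && [ -f "$2" ] && cmp -s "$1" "$2"
}

# At build time, write the stamp into the modules directory that gets
# copied to both the initramfs and the root filesystem:
#   config_hash O/.config > "$MODDIR/build.stamp"
# At boot time, compare the two copies:
#   same_build "$INITRAMFS_MODDIR/build.stamp" "$ROOT_MODDIR/build.stamp" \
#       || echo "modules on the root filesystem are from a different build"
```

A missing stamp is treated as a mismatch, which errs on the side of using the modules shipped with the running kernel.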

@neilbrown
Contributor

The mediatek network driver in mainline now supports the MT7621, so we don't need the drivers/staging driver.
It works with DSA support for the integrated switch. I now have it working on both network interfaces on the PC1 and all three on the PC2.
So I'm now using mainline (5.1-rc2) on my main gnubee and will be building firmware with mainline kernels from time to time. See the announcement in the google group.
I now consider the gnubee to be "fully supported" in mainline. Though there is still a bit of work to do, it is mostly cleaning up the code and getting it moved out of drivers/staging (spi is almost there, and mmc probably isn't far away).

@neheb
Contributor

neheb commented Apr 1, 2019

The MMC driver situation is the same as the Ethernet. The mainline mtk-sd can work with it with some modifications.

@smurfix

smurfix commented Apr 1, 2019

Nice. Thanks for the work. Does mainline have a usable kconfig file? if not (or not yet …), where can I find one?

@vgiralt

vgiralt commented Apr 1, 2019

@neilbrown

I now consider the gnubee to be "fully supported" in mainline. Though there is still a bit of work to do, it is mostly cleaning up the code and getting it moved out of drivers/staging (spi is almost there, and mmc probably isn't far away).

This definitely deserves a congratulations! and Thank you!

@neilbrown
Contributor

Look in https://github.com/neilbrown/linux.git branch gnubee/v5.1
This is v5.1-rc2 plus some staging patches plus a few little things from me. A couple of changes to the DTS files are needed and there are defconfig files in there. There are also (almost) identical config files in github.com/neilbrown/gnubee-tools.git

@neilbrown
Contributor

The MMC driver situation is the same as the Ethernet. The mainline mtk-sd can work with it with some modifications.

Now might be the time to put that hypothesis to the test - GregKH has just removed the mt7621-mmc driver due to licensing concerns. If you have any specific information (patches??), could you share it please.

https://lkml.org/lkml/2019/4/2/311

@neheb
Contributor

neheb commented Apr 2, 2019

Sure. Here are some.

jonpry/openwrt_mt7688@a85e6d9
jonpry/openwrt_mt7688@2487846

A few notes: The MMC driver there is basically the 4.9 mtk-sd one with all the patches from maybe 4.17 or 4.18 backported.

edit: Note that I haven't exactly gotten it working. Probably needs extra DTS entries.

@neheb
Contributor

neheb commented Apr 2, 2019

Here's the diff that I was able to generate:

https://gist.github.com/neheb/3d9e4cbf966f8487114df19b49f28214

@neilbrown
Contributor

Here's the diff that I was able to generate:

https://gist.github.com/neheb/3d9e4cbf966f8487114df19b49f28214

Very helpful, thanks. I just booted my PC2 off the SD card using the mainline driver - with these changes and some others. There is still polishing to do but the hard work is done.

Thanks!

@smurfix

smurfix commented Jul 7, 2019

Is the "boring polishing" stuff pushed to mainline by now? If not, what's left to be done?

@neilbrown
Contributor

The changes needed to drivers/mmc/host/mtk-sd.c landed in 5.2-rc1.
The changes needed to the mt7621.dtsi device tree file are in staging-next and should land in 5.3-rc1.

PCI is the main outstanding driver that needs work. The clean-up work in staging has introduced a bug (occasional hang on boot) that no-one has found yet.

@Qwertie-

Qwertie- commented Aug 4, 2019

Will mainline support mean I can download the ARM debian or fedora from their website and it will just work like it does for a desktop/laptop?

@neheb
Contributor

neheb commented Aug 4, 2019

The GnuBee uses MIPS not ARM.

@Qwertie-

Qwertie- commented Aug 5, 2019

So debian mips would work?

@vgiralt

vgiralt commented Aug 5, 2019 via email

@gordon-quad

Any updates here? Looks like the majority of stuff should already be merged, according to the post above about the 5.2 and 5.3 RCs. What about the PCI driver? Is there a config that I can use to build a custom mainline kernel?

@neilbrown
Contributor

In 5.17 (just released) everything except that device-tree file has been moved out of drivers/staging.
I'm running 5.15.22 on my devices using the kernel from https://github.com/neilbrown/linux/tree/gnubee/v5.15 and config file found in https://github.com/neilbrown/gnubee-tools/tree/master/kern_config
My kernel has 13 patches on top of 5.15.22. If/when I update to 5.17 I expect that to be a lot less, but probably not zero.

@sumpfralle
Author

Just out of curiosity: Debian Bookworm is being released in two weeks and it will contain Linux 6.1.
Will that version be able to show the full beauty of the GnuBee?

Thanks for your work!

@neheb
Contributor

neheb commented May 19, 2023

Yes
