Really bad sound output #5

Open
asdplayer opened this issue Nov 25, 2018 · 25 comments

Comments

@asdplayer

asdplayer commented Nov 25, 2018

I compiled and installed your kernel successfully, but I have problems at runtime.
I used localmodconfig, and everything else is working. I also followed the
instructions given in this repository. I attached a dmesg log and the kernel config that I used (renamed to .txt just so I could upload it).
cx2072x_dmesg_while_using.log
cx2072x_kernel_config_4.19.4.txt

I tried to play audio, but there are some problems.
Everything was also checked against a working USB soundcard, the Behringer UMC404HD, to rule out the most obvious issues in user software.

=== Play mp3/m4a/ogg/wav using Audacious player, to ALSA:
Plays at 2x speed (and 2x pitch), with some noise that sounds like gross aliasing.
The card is reported twice (this is normal, I think), as:
-> sysdefault:CARD=chtcx2072x
-> usbstream:CARD=chtcx2072x
Selecting the latter results in an error: "ALSA error: snd_pcm_open failed: Invalid argument." Why does it say "usbstream" anyway? aplay -L (list PCM outputs) reports the same two entries.

=== Play using Audacity editor, to ALSA:
Plays at normal speed and pitch.
(The UI lags consistently when changing play/capture settings from the toolbars, while the soundcard is disabled and reinitialized many (~10) times, but this is Audacity's fault.)

=== Play mp3/m4a/ogg/wav/video using Parole video player, which uses gstreamer, which uses PulseAudio:
There is major distortion, but playback speed is OK. See below for a hypothesis.

=== Play mp3/m4a/ogg/wav/video using VLC player, to ALSA:
Audio is played at 2x speed and 2x pitch, but stays synchronized to real time: some length of audio is played, then there is a gap, then it restarts, at random intervals, maybe 3 gaps a second. The total play time is therefore not shortened.
Using the PulseAudio backend instead of direct ALSA, VLC yields exactly the same result as Parole.

=== Play test signals with kokkinizita's jaaa, through JACK, 1024 sample buffer:
I played various test signals.
Playing a 200 Hz sine:

  • There is a small high-frequency distortion byproduct. The maximum usable signal level is about -42 dB (just under the 8-bit noise level?); anything above that produces harsh distortion from the soundcard. It looks as if the wrong endianness is applied to the audio signal: a small signal stays mainly in the middle byte and is represented correctly; the least significant byte is over-represented, which amounts to 1-bit error noise magnified by a 16-bit shift; and once the MSB is used, the signal is completely garbled.
  • When no distortion is happening, a 400 Hz waveform is measured at the soundcard output (I used a tuner app and the Spectroid app on my phone, and my ear roughly confirms it). It looks as if the codec is set to 96 kHz while applications believe they are playing at 48 kHz (because that is what they are told), so they are effectively asked for twice the data per unit of real time.
  • The 2nd (right) digital channel outputs sound to both channels of the headphones and speakers. Playing the sample on the 1st (left) digital channel results in almost no sound, as if I were hearing just analog crosstalk. (The 2nd channel is right and the 1st is left, I think.)

=== Lowering buffer size to 256 samples or 128 samples:
There is another type of problem in the signal output, and I can't tell for sure what it is. It's as if the buffer cannot be smaller than 512 samples, and it is filled with 0's or random values if I choose a smaller buffer size. Sounds from audio players now seem to be played at the right speed; I'm not sure about pitch. The signal seems to be mixed with another frequency that could be a byproduct of incorrect buffer handling. I measured the waveforms with a mobile phone, so this is just "wild" guesswork.
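
As a rough check on the -42 dB figure above (a sketch of the arithmetic only, not a diagnosis of the driver): a peak level of -42 dBFS sits about 7 bits below full scale, so in a signed 16-bit stream the peak is roughly ±260, which is just past what a single byte can hold. The function names here are illustrative and not part of any ALSA or JACK API.

```python
import math


def bits_below_full_scale(dbfs: float) -> float:
    """How many bits of headroom a peak level in dBFS corresponds to.

    Each bit is 20*log10(2), about 6.02 dB.
    """
    return -dbfs / (20 * math.log10(2))


def peak_amplitude(dbfs: float, sample_bits: int) -> float:
    """Peak value, in integer sample units, of a signal at the given dBFS
    level in a signed PCM format with sample_bits bits per sample."""
    full_scale = 2 ** (sample_bits - 1)
    return full_scale * 10 ** (dbfs / 20)


if __name__ == "__main__":
    level = -42.0  # threshold reported in this issue
    print(f"{level} dBFS is {bits_below_full_scale(level):.2f} bits below full scale")
    for bits in (16, 24, 32):
        print(f"peak in S{bits}: about {peak_amplitude(level, bits):.0f}")
```

Whether this supports the endianness hypothesis depends on the sample format the driver actually negotiates, which the report does not state.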

=== Playing various sounds through JACK, with synths and audio players:
Every application that outputs more than about -45 dB of digital signal produces harsh distortion from the soundcard.
Some applications kind of work: they play audio at 2x speed and 2x pitch, and the output gets distorted above about -45 dB. They are: jaaa, synth_v1, audacious (which also has a JACK backend), yoshimi.
Some applications don't work, and sometimes wreck the JACK instance so that the JACK server keeps running but no application can produce sound anymore: mixxx, fluidsynth.
Bristol crashes 3 seconds after starting, but that is kind of expected, as the code is old and unstable.
VLC (which has a JACK backend, too) sounds as before, with pitch and speed at 2x, except that now the intervals with and without sound are equally spaced and have the same duration, like 50%/50%.
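
The 96 kHz vs. 48 kHz hypothesis is consistent with both observations: the doubled pitch and the 50%/50% pattern heard with VLC. A minimal sketch of that arithmetic, assuming the application produces samples at one rate while the codec consumes them at another (the function names are made up for illustration):

```python
def speed_factor(app_rate_hz: float, codec_rate_hz: float) -> float:
    """Playback speed (and pitch) multiplier when samples generated for
    app_rate_hz are clocked out at codec_rate_hz."""
    return codec_rate_hz / app_rate_hz


def heard_frequency(signal_hz: float, app_rate_hz: float, codec_rate_hz: float) -> float:
    """Frequency actually heard for a tone the application thinks is signal_hz."""
    return signal_hz * speed_factor(app_rate_hz, codec_rate_hz)


def audible_duty_cycle(app_rate_hz: float, codec_rate_hz: float) -> float:
    """Fraction of wall-clock time with sound if the player feeds audio in
    real time (as VLC does) but the codec drains it faster than it arrives."""
    return min(1.0, app_rate_hz / codec_rate_hz)


if __name__ == "__main__":
    print(heard_frequency(200, 48_000, 96_000))   # 400.0, matching the measurement
    print(audible_duty_cycle(48_000, 96_000))     # 0.5, matching the 50%/50% gaps
```

A player that is not paced by a real-time clock would simply finish in half the time instead of gapping, which fits the Audacious behaviour described earlier.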

JACK is configured for "playback only", because I could not start it in "duplex" mode (so I only have output, not input + output).
If I try to start JACK in duplex mode, it says something unhelpful like "overall operation failed", and hundreds of these lines appear in the dmesg log:
[ 177.914611] intel_sst_acpi 808622A8:00: sst: Busy wait failed, cant send this msg
This is a good topic for another report maybe.

EDIT:
My computer is the Asus E200H, the soundcard is the CX20723.
And some typo fixes.
And another thing: if you see the need, in a week's time I can attach an oscilloscope to the headphone output to diagnose further; right now I don't have time. Let me know!

@heikomat
Owner

heikomat commented Nov 27, 2018

Oh damn, you've gone way more in-depth than i ever have, nice work!
I'd be very interested in at least reproducing your findings, but i have only ever worked with debian-based distributions. Getting this to work on arch-based distributions would be really nice though.

this guy seems to have been successful at using this kernel, but he probably used kernel version 4.17, and some things have changed since then.

I see your kernel version is 4.19.4. The newest version i merged is 4.19. Did you apply the fixes of my cx2072x branch yourself?

Could you write up something like a reproduction manual? This way i might see if something is odd or missing, and could try it myself.

@asdplayer
Author

I'm writing up the complete compilation process I followed. Basically I edit Arch Linux build scripts (PKGBUILD files), either the ones used to compile official packages or user-provided ones in the AUR.
I made patches from your branch of the kernel. I don't have enough free space on my E200 to pull two source trees at the same time, so, instead of using diff and learning git, I copied and pasted from the Files changed tab of this page in your repository into patches I modified by hand. They output some warning messages when applied, but otherwise work (How unexpected!). I checked each file by hand afterwards.
By the way, I also wonder why there is a difference in a seemingly unrelated file: drivers/pwm/pwm-lpss.c.
Having both the modified tree AND the patches, I tried various kernels, and I obtained the same results on all of them.

  • 4.19.1, official source tarball, my ugly cx2072x patches, -rt patches, some other patches for Arch and bfq-mq scheduler.
  • 4.19, from your repository, with the same patches (but without the need to use my ugly cx2072x patches).
  • 4.19.4, Arch Linux kernel Git repo and the cx2072x patches. No 'native optimizations' option was available on this one, so I chose 'Intel Atom' instead.

All these kernels used make localmodconfig, then I configured them further with make menuconfig to enable native compiler optimization and the two options for the cx2072x, and to check everything I know about.
The realtime kernels also had many security features disabled, including memory layout randomization and even features I don't entirely understand.
Instead, while configuring the kernel from the official Arch repo, I left everything as default (other than enabling cx2072x and compile optimization for the Atom architecture), so almost all sensible security features are compiled in.
Before the week ends I'll have more storage space, so I'll try to compile another kernel, based only on this repository, and see if it works.
I also tried to read the sources of the cx2072x driver, but I don't understand many things and I don't have proper documentation of the codec. For example it would be good to be able to use the inbuilt equalizer, or the 192 kHz sample rate, but I'm not ready to code yet.
Thank you for the time you spend on this, I'll post the complete compilation steps soon.

@heikomat
Owner

heikomat commented Nov 28, 2018

The specific (seemingly unrelated) function you mentioned was not originally added by me.
When i started this patched kernel, i based it on fixes from tiwai. He had 3 different branches that fixed different things regarding cherrytrail.

The specific commit including the reasoning for adding the function was this one

I'll later check if it is really no longer needed, and if so, remove it

@7twin

7twin commented Dec 25, 2018

@heikomat @asdplayer
Just saw this - never got a notification for it. I did indeed not have it on 4.19 yet, as on my 4.19 setup I didn't yet have need for sound.

4.20 is supposed to come soon too, so I would be definitely interested in how to get it to work with either 4.19.12 or 4.20.

Thanks for doing all this work for getting sound working!

@heikomat
Owner

heikomat commented Dec 26, 2018

@7twin nice to see you here :)
I'll merge 4.20 tomorrow (in about 10-12 hours from now) and make the regular debian/ubuntu build

heikomat pushed a commit that referenced this issue Dec 26, 2018
It was observed that a process blocked indefintely in
__fscache_read_or_alloc_page(), waiting for FSCACHE_COOKIE_LOOKING_UP
to be cleared via fscache_wait_for_deferred_lookup().

At this time, ->backing_objects was empty, which would normaly prevent
__fscache_read_or_alloc_page() from getting to the point of waiting.
This implies that ->backing_objects was cleared *after*
__fscache_read_or_alloc_page was was entered.

When an object is "killed" and then "dropped",
FSCACHE_COOKIE_LOOKING_UP is cleared in fscache_lookup_failure(), then
KILL_OBJECT and DROP_OBJECT are "called" and only in DROP_OBJECT is
->backing_objects cleared.  This leaves a window where
something else can set FSCACHE_COOKIE_LOOKING_UP and
__fscache_read_or_alloc_page() can start waiting, before
->backing_objects is cleared

There is some uncertainty in this analysis, but it seems to be fit the
observations.  Adding the wake in this patch will be handled correctly
by __fscache_read_or_alloc_page(), as it checks if ->backing_objects
is empty again, after waiting.

Customer which reported the hang, also report that the hang cannot be
reproduced with this fix.

The backtrace for the blocked process looked like:

PID: 29360  TASK: ffff881ff2ac0f80  CPU: 3   COMMAND: "zsh"
 #0 [ffff881ff43efbf8] schedule at ffffffff815e56f1
 #1 [ffff881ff43efc58] bit_wait at ffffffff815e64ed
 #2 [ffff881ff43efc68] __wait_on_bit at ffffffff815e61b8
 #3 [ffff881ff43efca0] out_of_line_wait_on_bit at ffffffff815e625e
 #4 [ffff881ff43efd08] fscache_wait_for_deferred_lookup at ffffffffa04f2e8f [fscache]
 #5 [ffff881ff43efd18] __fscache_read_or_alloc_page at ffffffffa04f2ffe [fscache]
 #6 [ffff881ff43efd58] __nfs_readpage_from_fscache at ffffffffa0679668 [nfs]
 #7 [ffff881ff43efd78] nfs_readpage at ffffffffa067092b [nfs]
 #8 [ffff881ff43efda0] generic_file_read_iter at ffffffff81187a73
 #9 [ffff881ff43efe50] nfs_file_read at ffffffffa066544b [nfs]
#10 [ffff881ff43efe70] __vfs_read at ffffffff811fc756
#11 [ffff881ff43efee8] vfs_read at ffffffff811fccfa
#12 [ffff881ff43eff18] sys_read at ffffffff811fda62
#13 [ffff881ff43eff50] entry_SYSCALL_64_fastpath at ffffffff815e986e

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David Howells <dhowells@redhat.com>
heikomat pushed a commit that referenced this issue Dec 26, 2018
Function graph tracing recurses into itself when stackleak is enabled,
causing the ftrace graph selftest to run for up to 90 seconds and
trigger the softlockup watchdog.

Breakpoint 2, ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:200
200             mcount_get_lr_addr        x0    //     pointer to function's saved lr
(gdb) bt
#0  ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:200
#1  0xffffff80081d5280 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:153
#2  0xffffff8008555484 in stackleak_track_stack () at ../kernel/stackleak.c:106
#3  0xffffff8008421ff8 in ftrace_ops_test (ops=0xffffff8009eaa840 <graph_ops>, ip=18446743524091297036, regs=<optimized out>) at ../kernel/trace/ftrace.c:1507
#4  0xffffff8008428770 in __ftrace_ops_list_func (regs=<optimized out>, ignored=<optimized out>, parent_ip=<optimized out>, ip=<optimized out>) at ../kernel/trace/ftrace.c:6286
#5  ftrace_ops_no_ops (ip=18446743524091297036, parent_ip=18446743524091242824) at ../kernel/trace/ftrace.c:6321
#6  0xffffff80081d5280 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:153
#7  0xffffff800832fd10 in irq_find_mapping (domain=0xffffffc03fc4bc80, hwirq=27) at ../kernel/irq/irqdomain.c:876
#8  0xffffff800832294c in __handle_domain_irq (domain=0xffffffc03fc4bc80, hwirq=27, lookup=true, regs=0xffffff800814b840) at ../kernel/irq/irqdesc.c:650
#9  0xffffff80081d52b4 in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:205

Rework so we mark stackleak_track_stack as notrace

Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
heikomat pushed a commit that referenced this issue Dec 26, 2018
The *_frag_reasm() functions are susceptible to miscalculating the byte
count of packet fragments in case the truesize of a head buffer changes.
The truesize member may be changed by the call to skb_unclone(), leaving
the fragment memory limit counter unbalanced even if all fragments are
processed. This miscalculation goes unnoticed as long as the network
namespace which holds the counter is not destroyed.

Should an attempt be made to destroy a network namespace that holds an
unbalanced fragment memory limit counter the cleanup of the namespace
never finishes. The thread handling the cleanup gets stuck in
inet_frags_exit_net() waiting for the percpu counter to reach zero. The
thread is usually in running state with a stacktrace similar to:

 PID: 1073   TASK: ffff880626711440  CPU: 1   COMMAND: "kworker/u48:4"
  #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
  #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
  #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
  #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
  #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
 #10 [ffff880621563e38] process_one_work at ffffffff81096f14

It is not possible to create new network namespaces, and processes
that call unshare() end up being stuck in uninterruptible sleep state
waiting to acquire the net_mutex.

The bug was observed in the IPv6 netfilter code by Per Sundstrom.
I thank him for his analysis of the problem. The parts of this patch
that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.

Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
Reported-by: Per Sundstrom <per.sundstrom@redqube.se>
Acked-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
heikomat pushed a commit that referenced this issue Dec 26, 2018
Commit 9b6f7e1 ("mm: rework memcg kernel stack accounting") will
result in fork failing if allocating a kernel stack for a task in
dup_task_struct exceeds the kernel memory allowance for that cgroup.

Unfortunately, it also results in a crash.

This is due to the code jumping to free_stack and calling
free_thread_stack when the memcg kernel stack charge fails, but without
tsk->stack pointing at the freshly allocated stack.

This in turn results in the vfree_atomic in free_thread_stack oopsing
with a backtrace like this:

#5 [ffffc900244efc88] die at ffffffff8101f0ab
 #6 [ffffc900244efcb8] do_general_protection at ffffffff8101cb86
 #7 [ffffc900244efce0] general_protection at ffffffff818ff082
    [exception RIP: llist_add_batch+7]
    RIP: ffffffff8150d487  RSP: ffffc900244efd98  RFLAGS: 00010282
    RAX: 0000000000000000  RBX: ffff88085ef55980  RCX: 0000000000000000
    RDX: ffff88085ef55980  RSI: 343834343531203a  RDI: 343834343531203a
    RBP: ffffc900244efd98   R8: 0000000000000001   R9: ffff8808578c3600
    R10: 0000000000000000  R11: 0000000000000001  R12: ffff88029f6c21c0
    R13: 0000000000000286  R14: ffff880147759b00  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffffc900244efda0] vfree_atomic at ffffffff811df2c7
 #9 [ffffc900244efdb8] copy_process at ffffffff81086e37
#10 [ffffc900244efe98] _do_fork at ffffffff810884e0
#11 [ffffc900244eff10] sys_vfork at ffffffff810887ff
#12 [ffffc900244eff20] do_syscall_64 at ffffffff81002a43
    RIP: 000000000049b948  RSP: 00007ffcdb307830  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000896030  RCX: 000000000049b948
    RDX: 0000000000000000  RSI: 00007ffcdb307790  RDI: 00000000005d7421
    RBP: 000000000067370f   R8: 00007ffcdb3077b0   R9: 000000000001ed00
    R10: 0000000000000008  R11: 0000000000000246  R12: 0000000000000040
    R13: 000000000000000f  R14: 0000000000000000  R15: 000000000088d018
    ORIG_RAX: 000000000000003a  CS: 0033  SS: 002b

The simplest fix is to assign tsk->stack right where it is allocated.

Link: http://lkml.kernel.org/r/20181214231726.7ee4843c@imladris.surriel.com
Fixes: 9b6f7e1 ("mm: rework memcg kernel stack accounting")
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
@heikomat
Owner

As promised, 4.20 is merged and the installer-script updated

@7twin

7twin commented Dec 26, 2018

Thanks, well done! will try to test it asap!

@7twin

7twin commented Dec 28, 2018

@heikomat can confirm it working, thanks! the only thing I faced was having to rename: /net/netfilter/xt_rateest.ko to /net/netfilter/xt_RATEEST.ko else during make modules_install it would complain about not being able to stat that file.
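
One possible explanation for the rename, offered as an assumption rather than a confirmed cause: the kernel tree contains netfilter sources whose names differ only by case (for example xt_RATEEST.c and xt_rateest.c), so if the tree ever passes through a case-insensitive filesystem or archive step, one of each pair can be lost or renamed. A small sketch for spotting such pairs in a checkout; `case_collisions` and `list_tree` are illustrative helpers, not part of the kernel build system:

```python
import os
from collections import defaultdict


def case_collisions(paths):
    """Group paths that are identical when compared case-insensitively.

    Returns a list of sorted groups, each with at least two distinct paths.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[path.lower()].append(path)
    return [sorted(set(group)) for group in groups.values() if len(set(group)) > 1]


def list_tree(root):
    """Yield every file path under root, relative to root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.relpath(os.path.join(dirpath, name), root)


if __name__ == "__main__":
    # Point this at the kernel source directory; "." is only a placeholder.
    for group in case_collisions(list_tree(".")):
        print(" <-> ".join(group))
```

On a healthy case-sensitive checkout this should list both members of each pair; if a pair shows up with only one member present on disk, that would be consistent with the stat error described above.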

I did get two warnings after the successful install about xt_rateest_put together with another one possibly not existing, but haven't seen anything break (yet), so not sure if that is anything to be worried about and possibly just the regular warning output.

@heikomat
Owner

@7twin what are the steps you took to build the kernel?
Maybe we can dockerize the process and add a download for arch in future versions

@7twin

7twin commented Dec 31, 2018

@heikomat good that you mentioned that, reminded me that I anyway wanted to write down the process of getting it to work on arch for a long time, but kept forgetting it, here it is: https://github.com/7twin/arch_sound_e200ha

I am terrible at docker though, so can't help much there, hopefully somebody can pick up those instructions and create a working docker image.

@heikomat
Owner

Thanks for that!
I do a lot with docker, so i'll give it a try

@heikomat
Owner

heikomat commented Jan 2, 2019

ok, i tried a few things, and building the kernel in a docker container is most likely not a problem.
What is a problem though, is that i'd like to make it build pacman-installable linux-header and linux-image files.
I found some tutorials on how to make them, but they require configuration files (like a PKGBUILD file) to work, which i don't have.

It is probably very possible to write these package-config-files and then use them to create these installable kernel packages, but as i'm not yet that familiar with arch and pacman, creating these would certainly take me hours or even a day or two, because i'd need to read and learn a lot, which i (at this moment) don't want to invest, as i have other things to do.

@Titotix

Titotix commented Jan 4, 2019

@asdplayer I had the same troubles you described, but only for encoded sound files (so wav for sound and mov for video were working fine).
I followed the steps described by 7twin and it works for me, you should give it a try ;)

@asdplayer
Author

@Titotix Thank you, I definitely have to try the new version. Unfortunately my internet connection is down because some moron stepped on optical fiber left on the floor under a manhole cover (and obviously made a mess; the fiber is broken somewhere inside the pipe to my house, probably because it was pulled too hard). I had some kind of disconnected holiday, remembering the lost beauty of ADSL. In two days I'll be back at University; if I have time left I'll give this a try. It's unbelievable how short vacations become when in fact you have to work...

@asdplayer
Author

Good evening, I tried the new kernel.
Basically I have the same myriad of problems I had with the previous version.
Here is the PKGBUILD file I used to build the kernel: https://pastebin.com/eCdLU2Zr
To use it (on Arch), download all the files from this page, which holds the files for the official Arch kernel package (by hand, or using ABS: type asp export linux); then use this PKGBUILD instead of the original one (or edit it yourself).
Here is the .config I used. Copy it into the PKGBUILD directory with the other files, overwriting the existing config file (the one named without the leading dot).
The issues I have are similar to those I had with the previous version.
This time audio playback with the Audacious player is much better: playback speed is now right, but the aliasing-like artifacts remain. They can be clearly heard at the end of a song, when it fades away but there is still some signal.
JACK audio doesn't work well: apparently applications play at twice the speed, or break because they are aware of the real playback speed they should achieve.
Playback happens only from the second (right) logical channel, to both headphone speakers; playing from the first (left) logical channel only results in a small distorted output (from both headphone speakers), like analog crosstalk. This was tested with JACK, if that matters.
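
To make the channel test repeatable outside JACK, one option is a pair of stereo WAV files with a quiet tone on one channel only. The sketch below uses only the Python standard library; the 200 Hz tone and 48 kHz rate mirror the earlier tests in this thread, and -50 dBFS is chosen to stay under the reported distortion threshold. `write_channel_test` and the file names are illustrative.

```python
import math
import struct
import wave


def write_channel_test(path, freq_hz=200.0, level_dbfs=-50.0,
                       rate=48000, seconds=2.0, channel="left"):
    """Write a 16-bit little-endian stereo WAV with a sine on one channel
    and digital silence on the other. Returns the number of frames written."""
    amplitude = 32767 * 10 ** (level_dbfs / 20)
    n_frames = int(rate * seconds)
    frames = bytearray()
    for i in range(n_frames):
        sample = int(round(amplitude * math.sin(2 * math.pi * freq_hz * i / rate)))
        left, right = (sample, 0) if channel == "left" else (0, sample)
        frames += struct.pack("<hh", left, right)  # interleaved L, R
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)   # bytes per sample
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))
    return n_frames


if __name__ == "__main__":
    write_channel_test("left_only.wav", channel="left")
    write_channel_test("right_only.wav", channel="right")
```

Playing each file directly through ALSA and noting which speaker sounds, and at what pitch, would separate the channel-mapping problem from the rate problem.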
I'm not sure why, but the second time I plugged in the headphones, playback happened through the speakers anyway. It may be because I touched settings in alsamixer that are related to the speaker/headphone switching and messed something up.
I had no time to test further, I apologize for this. And I don't understand many things about programming, too. So feel free to ask for more testing, I'll get the notification email and answer ASAP.

@heikomat
Owner

@asdplayer thanks for the info and config files. I'll give building the kernel packages using docker another try on the weekend, using these files.

If we have installable packages, we might be able to reproduce your issues on other devices. We could also compare the config you used with the one used by @7twin

@7twin

7twin commented Jan 10, 2019

@heikomat absolutely, I'll be happy to upload my .config I sourced on my e200ha if needed

@asdplayer
Author

@heikomat Do you mean you want to try my compiled package? It doesn't seem too secure on your side, however if you want I can upload it on Drive or something.

@heikomat
Owner

@asdplayer close. I meant that i'll try to build the package myself, using your config files

@asdplayer
Author

Ok that's nicer, I misunderstood your message. :-| Thank you for your effort by the way.

@heikomat
Owner

@7twin @asdplayer Apparently someone already made an arch package for this kernel O.o
https://aur.archlinux.org/packages/linux-cx2072x/

Can you guys check if this works for you? (i still haven't installed arch on my laptop ^^)

@7twin

7twin commented Jan 12, 2019

@heikomat impressive to see, but to be quite honest I am not sure how I would use that, also lines like that make it seem somewhat odd: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=linux-cx2072x#n146

@heikomat
Owner

Sorry for the delayed progress on this, but this month is really packed for me. A lot of work, and a lot of studying. I'm still interested in debugging this, but I just don't know when I'll be able to do so.

@7twin

7twin commented Jan 24, 2019

@heikomat absolutely no issue, take your time, exams have much higher priority

@asdplayer
Author

No problem, I'm short of time too.

heikomat pushed a commit that referenced this issue May 19, 2019
Ido Schimmel says:

====================
mlxsw: Various fixes

This patchset contains various small fixes for mlxsw.

Patch #1 fixes a warning generated by switchdev core when the driver
fails to insert an MDB entry in the commit phase.

Patches #2-#4 fix a warning in check_flush_dependency() that can be
triggered when a work item in a WQ_MEM_RECLAIM workqueue tries to flush
a non-WQ_MEM_RECLAIM workqueue.

It seems that the semantics of the WQ_MEM_RECLAIM flag are not very
clear [1] and that various patches have been sent to remove it from
various workqueues throughout the kernel [2][3][4] in order to silence
the warning.

These patches do the same for the workqueues created by mlxsw that
probably should not have been created with this flag in the first place.

Patch #5 fixes a regression where an IP address cannot be assigned to a
VRF upper due to erroneous MAC validation check. Patch #6 adds a test
case.

Patch #7 adjusts Spectrum-2 shared buffer configuration to be compatible
with Spectrum-1. The problem and fix are described in detail in the
commit message.

Please consider patches #1-#5 for 5.0.y. I verified they apply cleanly.

[1] https://patchwork.kernel.org/patch/10791315/
[2] Commit ce162bf ("mac80211_hwsim: don't use WQ_MEM_RECLAIM")
[3] Commit 39baf10 ("IB/core: Fix use workqueue without WQ_MEM_RECLAIM")
[4] Commit 75215e5 ("iwcm: Don't allocate iwcm workqueue with WQ_MEM_RECLAIM")
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
heikomat pushed a commit that referenced this issue May 19, 2019
Syzkaller report this:

BUG: unable to handle kernel paging request at fffffbfff830524b
PGD 237fe8067 P4D 237fe8067 PUD 237e64067 PMD 1c9716067 PTE 0
Oops: 0000 [#1] SMP KASAN PTI
CPU: 1 PID: 4465 Comm: syz-executor.0 Not tainted 5.0.0+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:__list_add_valid+0x21/0xe0 lib/list_debug.c:23
Code: 8b 0c 24 e9 17 fd ff ff 90 55 48 89 fd 48 8d 7a 08 53 48 89 d3 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 48 83 ec 08 <80> 3c 02 00 0f 85 8b 00 00 00 48 8b 53 08 48 39 f2 75 35 48 89 f2
RSP: 0018:ffff8881ea2278d0 EFLAGS: 00010282
RAX: dffffc0000000000 RBX: ffffffffc1829250 RCX: 1ffff1103d444ef4
RDX: 1ffffffff830524b RSI: ffffffff85659300 RDI: ffffffffc1829258
RBP: ffffffffc1879250 R08: fffffbfff0acb269 R09: fffffbfff0acb269
R10: ffff8881ea2278f0 R11: fffffbfff0acb268 R12: ffffffffc1829250
R13: dffffc0000000000 R14: 0000000000000008 R15: ffffffffc187c830
FS:  00007fe0361df700(0000) GS:ffff8881f7300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff830524b CR3: 00000001eb39a001 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 __list_add include/linux/list.h:60 [inline]
 list_add include/linux/list.h:79 [inline]
 proto_register+0x444/0x8f0 net/core/sock.c:3375
 nr_proto_init+0x73/0x4b3 [netrom]
 ? 0xffffffffc1628000
 ? 0xffffffffc1628000
 do_one_initcall+0xbc/0x47d init/main.c:887
 do_init_module+0x1b5/0x547 kernel/module.c:3456
 load_module+0x6405/0x8c10 kernel/module.c:3804
 __do_sys_finit_module+0x162/0x190 kernel/module.c:3898
 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x462e99
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fe0361dec58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
RDX: 0000000000000000 RSI: 0000000020000100 RDI: 0000000000000003
RBP: 00007fe0361dec70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe0361df6bc
R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004
Modules linked in: netrom(+) ax25 fcrypt pcbc af_alg arizona_ldo1 v4l2_common videodev media v4l2_dv_timings hdlc ide_cd_mod snd_soc_sigmadsp_regmap snd_soc_sigmadsp intel_spi_platform intel_spi mtd spi_nor snd_usbmidi_lib usbcore lcd ti_ads7950 hi6421_regulator snd_soc_kbl_rt5663_max98927 snd_soc_hdac_hdmi snd_hda_ext_core snd_hda_core snd_soc_rt5663 snd_soc_core snd_pcm_dmaengine snd_compress snd_soc_rl6231 mac80211 rtc_rc5t583 spi_slave_time leds_pwm hid_gt683r hid industrialio_triggered_buffer kfifo_buf industrialio ir_kbd_i2c rc_core led_class_flash dwc_xlgmac snd_ymfpci gameport snd_mpu401_uart snd_rawmidi snd_ac97_codec snd_pcm ac97_bus snd_opl3_lib snd_timer snd_seq_device snd_hwdep snd soundcore iptable_security iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan
 bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun joydev mousedev ppdev tpm kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ide_pci_generic piix aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ide_core psmouse input_leds i2c_piix4 serio_raw intel_agp intel_gtt ata_generic agpgart pata_acpi parport_pc rtc_cmos parport floppy sch_fq_codel ip_tables x_tables sha1_ssse3 sha1_generic ipv6 [last unloaded: rxrpc]
Dumping ftrace buffer:
   (ftrace buffer empty)
CR2: fffffbfff830524b
---[ end trace 039ab24b305c4b19 ]---

If nr_proto_init failed, it may forget to call proto_unregister,
tiggering this issue.This patch rearrange code of nr_proto_init
to avoid such issues.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
heikomat pushed a commit that referenced this issue May 19, 2019
By calling maps__insert() we assume to get 2 references on the map,
which we relese within maps__remove call.

However if there's already same map name, we currently don't bump the
reference and can crash, like:

  Program received signal SIGABRT, Aborted.
  0x00007ffff75e60f5 in raise () from /lib64/libc.so.6

  (gdb) bt
  #0  0x00007ffff75e60f5 in raise () from /lib64/libc.so.6
  #1  0x00007ffff75d0895 in abort () from /lib64/libc.so.6
  #2  0x00007ffff75d0769 in __assert_fail_base.cold () from /lib64/libc.so.6
  #3  0x00007ffff75de596 in __assert_fail () from /lib64/libc.so.6
  #4  0x00000000004fc006 in refcount_sub_and_test (i=1, r=0x1224e88) at tools/include/linux/refcount.h:131
  #5  refcount_dec_and_test (r=0x1224e88) at tools/include/linux/refcount.h:148
  #6  map__put (map=0x1224df0) at util/map.c:299
  #7  0x00000000004fdb95 in __maps__remove (map=0x1224df0, maps=0xb17d80) at util/map.c:953
  #8  maps__remove (maps=0xb17d80, map=0x1224df0) at util/map.c:959
  #9  0x00000000004f7d8a in map_groups__remove (map=<optimized out>, mg=<optimized out>) at util/map_groups.h:65
  #10 machine__process_ksymbol_unregister (sample=<optimized out>, event=0x7ffff7279670, machine=<optimized out>) at util/machine.c:728
  #11 machine__process_ksymbol (machine=<optimized out>, event=0x7ffff7279670, sample=<optimized out>) at util/machine.c:741
  #12 0x00000000004fffbb in perf_session__deliver_event (session=0xb11390, event=0x7ffff7279670, tool=0x7fffffffc7b0, file_offset=13936) at util/session.c:1362
  #13 0x00000000005039bb in do_flush (show_progress=false, oe=0xb17e80) at util/ordered-events.c:243
  #14 __ordered_events__flush (oe=0xb17e80, how=OE_FLUSH__ROUND, timestamp=<optimized out>) at util/ordered-events.c:322
  #15 0x00000000005005e4 in perf_session__process_user_event (session=session@entry=0xb11390, event=event@entry=0x7ffff72a4af8,
  ...

Add the map to the list and get the reference even if we find a
map with the same name.
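
The invariant can be sketched as follows. This is a simplified illustration, not the perf implementation: `struct obj`, `struct container`, and the helper functions are hypothetical, and the two index slots stand in for the two lookup structures that each hold a reference.

```c
#include <assert.h>
#include <stddef.h>

struct obj {
    int refcnt;
};

static void obj_get(struct obj *o)
{
    o->refcnt++;
}

static void obj_put(struct obj *o)
{
    assert(o->refcnt > 0);     /* dropping more than was taken aborts here */
    o->refcnt--;
}

/* Two indexes over the same object; each one owns a reference. */
struct container {
    struct obj *by_addr;
    struct obj *by_name;
};

void container_insert(struct container *c, struct obj *o)
{
    c->by_addr = o;
    obj_get(o);
    /* Take the second reference unconditionally, so that remove can
     * always drop two without underflowing the count. */
    c->by_name = o;
    obj_get(o);
}

void container_remove(struct container *c, struct obj *o)
{
    c->by_addr = NULL;
    obj_put(o);
    c->by_name = NULL;
    obj_put(o);
}
```

If insert skipped the second `obj_get()` on some path while remove still called `obj_put()` twice, the count would reach zero too early and a later put would trip the assertion, which matches the shape of the abort in the backtrace above.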

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Saint-Etienne <eric.saint.etienne@oracle.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Fixes: 1e62856 ("perf symbols: Fix slowness due to -ffunction-section")
Link: http://lkml.kernel.org/r/20190416160127.30203-10-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
heikomat pushed a commit that referenced this issue May 19, 2019
Michael Chan says:

====================
bnxt_en: Misc. bug fixes.

6 miscellaneous bug fixes covering several issues in error code paths,
a setup issue for statistics DMA, and an improvement for setting up
multicast address filters.

Please queue these for stable as well.
Patch #5 (bnxt_en: Fix statistics context reservation logic) is for the
most recent 5.0 stable only.  Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
heikomat pushed a commit that referenced this issue Jul 13, 2019
Ido Schimmel says:

====================
mlxsw: Various fixes

This patchset contains various fixes for mlxsw.

Patch #1 fixes a hash polarization problem when a nexthop device is a
LAG device. This is caused by the fact that the same seed is used for
the LAG and ECMP hash functions.

Patch #2 fixes an issue in which the driver fails to refresh a nexthop
neighbour after it becomes dead. This prevents the nexthop from ever
being written to the adjacency table and used to forward traffic. Patch

Patch #4 fixes a wrong extraction of TOS value in flower offload code.
Patch #5 is a test case.

Patch #6 works around a buffer issue in Spectrum-2 by reducing the
default sizes of the shared buffer pools.

Patch #7 prevents prio-tagged packets from entering the switch when PVID
is removed from the bridge port.

Please consider patches #2, #4 and #6 for 5.1.y
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
heikomat pushed a commit that referenced this issue Jul 13, 2019
Puts range check before dereferencing the pointer.
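
A small sketch of the idea, assuming a buffer of entries that is either filled to the end or terminated early by a `ULONG_MAX` sentinel. `count_entries` is a hypothetical helper, not the tracing code itself.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Count entries up to a ULONG_MAX sentinel or the end of the buffer. */
size_t count_entries(const unsigned long *p, const unsigned long *end)
{
    size_t n = 0;

    assert(!p || p <= end);    /* caller must pass a valid range */

    /*
     * The bounds test comes first: && short-circuits, so *p is never
     * read once p has reached end. With the operands swapped, a full
     * buffer with no sentinel would be read one element past its end.
     */
    for (; p && p < end && *p != ULONG_MAX; p++)
        n++;

    return n;
}
```

The only difference from a buggy version is the operand order in the loop condition, which is why this class of bug is easy to miss in review.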

Reproducer:

  # echo stacktrace > trace_options
  # echo 1 > events/enable
  # cat trace > /dev/null

KASAN report:

  ==================================================================
  BUG: KASAN: use-after-free in trace_stack_print+0x26b/0x2c0
  Read of size 8 at addr ffff888069d20000 by task cat/1953

  CPU: 0 PID: 1953 Comm: cat Not tainted 5.2.0-rc3+ #5
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
  Call Trace:
   dump_stack+0x8a/0xce
   print_address_description+0x60/0x224
   ? trace_stack_print+0x26b/0x2c0
   ? trace_stack_print+0x26b/0x2c0
   __kasan_report.cold+0x1a/0x3e
   ? trace_stack_print+0x26b/0x2c0
   kasan_report+0xe/0x20
   trace_stack_print+0x26b/0x2c0
   print_trace_line+0x6ea/0x14d0
   ? tracing_buffers_read+0x700/0x700
   ? trace_find_next_entry_inc+0x158/0x1d0
   s_show+0xea/0x310
   seq_read+0xaa7/0x10e0
   ? seq_escape+0x230/0x230
   __vfs_read+0x7c/0x100
   vfs_read+0x16c/0x3a0
   ksys_read+0x121/0x240
   ? kernel_write+0x110/0x110
   ? perf_trace_sys_enter+0x8a0/0x8a0
   ? syscall_slow_exit_work+0xa9/0x410
   do_syscall_64+0xb7/0x390
   ? prepare_exit_to_usermode+0x165/0x200
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f867681f910
  Code: b6 fe ff ff 48 8d 3d 0f be 08 00 48 83 ec 08 e8 06 db 01 00 66 0f 1f 44 00 00 83 3d f9 2d 2c 00 00 75 10 b8 00 00 00 00 04
  RSP: 002b:00007ffdabf23488 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f867681f910
  RDX: 0000000000020000 RSI: 00007f8676cde000 RDI: 0000000000000003
  RBP: 00007f8676cde000 R08: ffffffffffffffff R09: 0000000000000000
  R10: 0000000000000871 R11: 0000000000000246 R12: 00007f8676cde000
  R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000000ec0

  Allocated by task 1214:
   save_stack+0x1b/0x80
   __kasan_kmalloc.constprop.0+0xc2/0xd0
   kmem_cache_alloc+0xaf/0x1a0
   getname_flags+0xd2/0x5b0
   do_sys_open+0x277/0x5a0
   do_syscall_64+0xb7/0x390
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 1214:
   save_stack+0x1b/0x80
   __kasan_slab_free+0x12c/0x170
   kmem_cache_free+0x8a/0x1c0
   putname+0xe1/0x120
   do_sys_open+0x2c5/0x5a0
   do_syscall_64+0xb7/0x390
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  The buggy address belongs to the object at ffff888069d20000
   which belongs to the cache names_cache of size 4096
  The buggy address is located 0 bytes inside of
   4096-byte region [ffff888069d20000, ffff888069d21000)
  The buggy address belongs to the page:
  page:ffffea0001a74800 refcount:1 mapcount:0 mapping:ffff88806ccd1380 index:0x0 compound_mapcount: 0
  flags: 0x100000000010200(slab|head)
  raw: 0100000000010200 dead000000000100 dead000000000200 ffff88806ccd1380
  raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff888069d1ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   ffff888069d1ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  >ffff888069d20000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                     ^
   ffff888069d20080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff888069d20100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================

Link: http://lkml.kernel.org/r/20190610040016.5598-1-devel@etsukata.com

Fixes: 4285f2f ("tracing: Remove the ULONG_MAX stack trace hackery")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
heikomat pushed a commit that referenced this issue Jul 13, 2019
It is possible for an irq triggered by channel0 to be received later,
after clks are disabled once the firmware is loaded during sdma probe. If
that happens, then clearing it by writing to SDMA_H_INTR won't work
and the kernel will hang processing infinite interrupts. Actually, an
interrupt on channel0 is not needed, since the current code polls
SDMA_H_STATSTOP to know when channel0 is done rather than waiting for
an interrupt, so just clear BD_INTR to disable the channel0 interrupt
and avoid the above case.
This issue was introduced by commit 1d069bf ("dmaengine: imx-sdma:
ack channel 0 IRQ in the interrupt handler"), which didn't take care
of the above case.

Fixes: 1d069bf ("dmaengine: imx-sdma: ack channel 0 IRQ in the interrupt handler")
Cc: stable@vger.kernel.org #5.0+
Signed-off-by: Robin Gong <yibin.gong@nxp.com>
Reported-by: Sven Van Asbroeck <thesven73@gmail.com>
Tested-by: Sven Van Asbroeck <thesven73@gmail.com>
Reviewed-by: Michael Olbrich <m.olbrich@pengutronix.de>
Signed-off-by: Vinod Koul <vkoul@kernel.org>