Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scanning loop #1164

Closed
talkingseedling opened this issue Jun 7, 2021 · 22 comments
Closed

Scanning loop #1164

talkingseedling opened this issue Jun 7, 2021 · 22 comments
Labels

Comments

@talkingseedling
Copy link

Description

Error while scanning folders for new music. Navidrome looped indefinitely over the same folder, causing high CPU usage and strerr log generation until manual Navidrome restart.

Expected Behaviour

Successful scan of a music folder, or error then move on.

Steps to reproduce

Problem started with the automated scheduled scan.
A restart corrected the problem, not reproducible.

Platform information

  • Navidrome version: v0.43.0
  • Navidrome source: deluan/navidrome:latest
  • Operating System: Linux mint 20.1
  • Docker 19.03.12

image

Additional information

Discovered with the large amount of logs which Navidrome was pushing to the ELK logging server, at a rate of 18,000 log entries every 5 seconds, totaling 80 million logs before manual Navidrome shutdown.
image
image

The log file:
image

The message in all 80 million log files is exactly the same:
time="2021-06-03T02:54:33Z" level=warning msg="Skipping DirEntry" error="readdirent /music/Music/Ed Kuepper/Reflections of Ol' Golden Eye: no such file or directory"

After restarting Navidrome, all folders were successfully scanned.

@deluan
Copy link
Member

deluan commented Jun 8, 2021

Very weird! I never saw this, and cannot think of a way for this to be triggered in the current scanner code.... Unfortunatelly without a way to reproduce this, it will be impossible to figure this out... Is there any other message between these "Skipping DirEntry" messages?

@talkingseedling
Copy link
Author

talkingseedling commented Jun 8, 2021

Hey, no, all the log messages were identical, with what looks like 4 messages every 0.001 seconds. I'm not an ELK expert so I don't know how to enumerate through them all for differences, but I looked at 5 in a row, and then 5 random ones, and they were all the same.

image

My volume source data is mounted from a SAMBA share. Could this bug be triggered if the file share hiccuped during a scan?

image

@deluan
Copy link
Member

deluan commented Jun 14, 2021

Could this bug be triggered if the file share hiccuped during a scan?

Maybe.... But I checked the code and it does not make sense.... Well, as we can't reproduce it, I'll close this for now, but please let me know if you have any other related issue. Thanks!

@deluan deluan closed this as completed Jun 14, 2021
@talkingseedling
Copy link
Author

talkingseedling commented Jun 15, 2021

Hey, sorry for posting on a closed topic, seems it wasn't a one off. It's triggered again.

image

Different folder it got stuck on this time, but like last time all of the logs are it scanning the exact same folder non-stop.

image

The "folders scanned" number does not move.
image

@nellistc
Copy link

nellistc commented Jul 7, 2021

Can confirm this is occurring in 0.44.1 at random. As above, stopping and restarting navidrome will continue scanning until the error happens again. It happens on different directories each time.

When this happens, kernel reports:

CIFS VFS: Send error in read = -4

Drive was mounted using SMB protocol 1. Tried SMB2 and SMB3 - same problem.

Looks related to golang/go#38836.

If the same share is mounted with NFS, scanning completes successfully.

@talkingseedling
Copy link
Author

I noticed that when Navidrome was stuck in the scanning loop, every other docker container also reported an error accessing files, pointing to an issue with my SMB host (though as the SMB share seemed fine on observation, restarting all docker containers fixed this). Though while the other containers threw 1 error, Navidrome loops and throws many errors.

@deluan
Copy link
Member

deluan commented Jul 14, 2021

@nellistc Yeah, it seems to be related to that issue, but it is not the same. That issue used to break Navidrome scanner in older versions, but it was fixed in Go 1.15 (we are now on Go 1.16). And the solution/workaround proposed in that issue is actually still in place in Navidrome's Dockerfile:

ENV GODEBUG "asyncpreemptoff=1"

@talkingseedling, do you see this error when Navidrome is not scanning?

@talkingseedling
Copy link
Author

@deluan Elastic reports the term "async" was never present in any log message
image

If it helps, what is triggering this bug is when the docker container expects to find data in a volume mount, and the volume mount has failed in some way (different to the volume mount working, and a file just not being there). What I haven't tested yet, is if after the error happens, if the volume re-mounts and is viewable from "docker exec -it navidrome ls /music".

My SMB server (where the docker volume mounts point) also auto-updates, and I have not had this error since 3rd June.

@deluan deluan closed this as completed in eb8ffc6 Jul 16, 2021
@deluan
Copy link
Member

deluan commented Jul 16, 2021

Ok, I found and fixed the infinite loop. If the SMB fs fails when reading a directory, it will just skip that dir. To test it, you can use the deluan/navidrome:develop docker tag (should be available in 20 mins) EDIT: Already available.

Please let me know if this is working as expected. If not, feel free to reopen this issue.

@whorfin
Copy link
Contributor

whorfin commented Jul 17, 2021

Apologies for the churn - I made a comment in the commit, but that doesn't show up at all on the mobile client nor did it tag this, so I'm replicating here for visibility

@deluan The fix in eb8ffc6 will break the "find all subdirs when some are missing rx permissions" behavior fixed by #1054
The loop over ReadDir was not a bug.
https://pkg.go.dev/os#File.ReadDir

ReadDir reads the contents of the directory associated with the file f and returns a slice of DirEntry values in directory order. Subsequent calls on the same file will yield later DirEntry records in the directory.
...
If n <= 0, ReadDir returns all the DirEntry records remaining in the directory. When it succeeds, it returns a nil error (not io.EOF).

#1164 reports a scenario where mounts appear to be broken; the golang behavior here doesn't seem right (it should "succeed" and not remain stuck on the same directoryentry), but the fix is not to bail on the first error - there may be more entries available with a properly functioning filesystem.

The guts of this bug appear in this case limited to the unix os (os/dir_unix.go), so it looks like, again just for this case, it might also work to return when
the error is a PathError and the Op is "readdirent"
An alternative hack/workaround/sanity-check would be to look for identical, successive errors (full error string, including the failing path), and bail, assuming that to have been an impossible occurance

@deluan
Copy link
Member

deluan commented Jul 17, 2021

Are you sure? I just tested with this scenario:

$ ll
total 0
d-w-------  16 deluan  staff   512B 13 Oct  2020 Arctic Monkeys - Favourite Worst Nightmare (2007) [CD - AAC-LC - 320]
drwxr-xr-x@  4 deluan  staff   128B 13 Oct  2020 Steven Wilson - Cover Version (2003) [CD - FLAC - 16-44.1] {SWCD6}
drwxr-xr-x@ 12 deluan  staff   384B 13 Oct  2020 The Alan Parsons Project - I Robot (1977) [CD - MP3 - 320]
$

After a clean scan, the Arctic Monkeys album does not appear in my albums, but the other two are there, together with all other albums in my library. Logs:

...
WARN[0002] Skipping unreadable directory                 error="open /Users/deluan/Music/Music/Media/_/Multi-valued/Arctic Monkeys - Favourite Worst Nightmare (2007) [CD - AAC-LC - 320]: permission denied" path="/Users/deluan/Music/Music/Media/_/Multi-valued/Arctic Monkeys - Favourite Worst Nightmare (2007) [CD - AAC-LC - 320]"
...
INFO[0006] Finished processing changed folder            dir="/Users/deluan/Music/Music/Media/_/Módulo Especial" elapsed=8.7ms purged=0 updated=1
INFO[0006] Finished processing changed folder            dir="/Users/deluan/Music/Music/Media/_/Multi-valued/Steven Wilson - Cover Version (2003) [CD - FLAC - 16-44.1] {SWCD6}" elapsed=22ms purged=0 updated=2
INFO[0007] Finished processing changed folder            dir="/Users/deluan/Music/Music/Media/_/Multi-valued/The Alan Parsons Project - I Robot (1977) [CD - MP3 - 320]" elapsed=111.5ms purged=0 updated=10
INFO[0007] Finished processing changed folder            dir="/Users/deluan/Music/Music/Media/_/New Order - 1985-05-01 - Kosei Nenkin Hall Tokyo (FM broadcast)" elapsed=115.9ms purged=0 updated=15
...

The fact that we are calling ReadDir with n == -1, it reads all readable entries in one shot. In the previous implementation the ReadDir was returning an error. It was showing the warning message but was not leaving the for loop, causing it to loop forever. The new implementation shows the warning message, and skips the current folder, by returning err == nil, so it does not interrupts the scan.

@whorfin
Copy link
Contributor

whorfin commented Jul 17, 2021

Yes, I am sure
Note how the message in your case is not from the code in question; it is "Skipping unreadable directory" instead of "Skipping DirEntry"
This is why I made that change...
To prove the point, I pulled down and built the most recent. And it totally kills my scan:

INFO[0002] Executing initial scan
WARN[0004] Skipping DirEntry                             error="lstat /tank/whorfin/Audio/Synth: permission denied"
ERRO[0004] Media Folder is empty. Aborting scan.         folder=/tank/whorfin/Audio

Restoring fullReadDir to its previous version, I'm back in business:

INFO[0002] Executing initial scan
WARN[0004] Skipping DirEntry                             error="lstat /tank/whorfin/Audio/Synth: permission denied"
WARN[0004] Skipping DirEntry                             error="lstat /tank/whorfin/Audio/nuptials: permission denied"
WARN[0004] Skipping DirEntry                             error="lstat /tank/whorfin/Audio/ipwn: permission denied"
WARN[0004] Skipping unreadable directory                 error="open /tank/whorfin/Audio/Midi: permission denied" path=/tank/whorfin/Audio/Midi
WARN[0004] Skipping unreadable directory                 error="open /tank/whorfin/Audio/Phillip: permission denied" path=/tank/whorfin/Audio/Phillip
WARN[0004] Skipping unreadable directory                 error="open /tank/whorfin/Audio/Samples: permission denied" path=/tank/whorfin/Audio/Samples
WARN[0004] Skipping unreadable directory                 error="open /tank/whorfin/Audio/Sound Effects: permission denied" path="/tank/whorfin/Audio/Sound Effects"
WARN[0004] Skipping unreadable directory                 error="open /tank/whorfin/Audio/Sounds: permission denied" path=/tank/whorfin/Audio/Sounds

The "Skipping unreadable directory" entries are simply chowned 0 for navidrome's running group
The "Skipping DirEntry" entries are chowned 0 and also have further restrictive ACLs

@deluan
Copy link
Member

deluan commented Jul 17, 2021

The "Skipping DirEntry" entries are chowned 0 and also have further restrictive ACLs

Hummm.... How to simulate this error then? What kind of permissions you have in place to cause this?

Anyways, we cannot have the for {} loop, as it is definitely causing the infinite loop issue described above.

EDIT:

The guts of this bug appear in this case limited to the unix os (os/dir_unix.go), so it looks like, again just for this case, it might also work to return when
the error is a PathError and the Op is "readdirent"
An alternative hack/workaround/sanity-check would be to look for identical, successive errors (full error string, including the failing path), and bail, assuming that to have been an impossible occurance

I can implement one of these solution, but I need to be able to reproduce it here. Or else, feel free to submit a PR.

@whorfin
Copy link
Contributor

whorfin commented Jul 17, 2021

How does this look? lmk if you'd prefer a PR:

func fullReadDir(name string) ([]os.DirEntry, error) {
	f, err := os.Open(name)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var allDirs []os.DirEntry
	var prevErr error
	for {
		dirs, err := f.ReadDir(-1)
		allDirs = append(allDirs, dirs...)
		if err == nil {
			break
		}
		log.Warn("Skipping DirEntry", err)
		if prevErr == err {
			log.Warn("Duplicate DirEntry failure, bailing", err)
			break
		} else {
			prevErr = err
		}
	}
	sort.Slice(allDirs, func(i, j int) bool { return allDirs[i].Name() < allDirs[j].Name() })
	return allDirs, nil
}

I have the opposite problem of replicating the broken filesystem...
In my scenario, here's an example - "Midi" gets the usual "Skipping unreadable directory" while "Synth" gets "Skipping DirEntry" and kills my scan with your change.

$ /usr/bin/ls -lVd /tank/whorfin/Audio/Midi
drwx------   5 whorfin  staff         25 Feb 14 13:16 /tank/whorfin/Audio/Midi
                 owner@:rwxp-DaARWcCos:-------:allow
                 group@:------a-R-c--s:-------:allow
              everyone@:------a-R-c--s:-------:allow
$ /usr/bin/ls -lVd /tank/whorfin/Audio/Synth
drwx------+ 10 whorfin  staff         11 May 15 16:39 /tank/whorfin/Audio/Synth
           user:whorfin:rwxpdDaARWcCos:-------:allow

@deluan
Copy link
Member

deluan commented Jul 17, 2021

Yeah, the code above may work. But not sure if you can compare error interfaces like that. Maybe it is better to do this instead?

if prevErr.Error() == err.Error() {
...

By the way, why are you blocking ND's access to folders this way (like in the "Synth" folder)? Why not just remove read access? Aren't we try to put in place a workaround for an issue that may not exist, given that you could just chmod -rx folders you don't want to be scanned (or even add a .ndignore to it)?

@whorfin
Copy link
Contributor

whorfin commented Jul 17, 2021

I believe equality should work; it's normally not recommended precisely because it includes "too specific" information. If you would prefer string comparisons, which I agree are perhaps less ambiguous, perhaps this is clearer:

func fullReadDir(name string) ([]os.DirEntry, error) {
	f, err := os.Open(name)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var allDirs []os.DirEntry
	var prevErrStr = ""
	for {
		dirs, err := f.ReadDir(-1)
		allDirs = append(allDirs, dirs...)
		if err == nil {
			break
		}
		log.Warn("Skipping DirEntry", err)
		if prevErrStr == err.Error() {
			log.Warn("Duplicate DirEntry failure, bailing", err)
			break
		} else {
			prevErrStr = err.Error()
		}
	}
	sort.Slice(allDirs, func(i, j int) bool { return allDirs[i].Name() < allDirs[j].Name() })
	return allDirs, nil
}

@whorfin
Copy link
Contributor

whorfin commented Jul 17, 2021

As to the "why" question...
Navidrome isn't the only thing running on my server; ACLs are used to lock down some directories differentially from simple group permissions.
We could debate which is more proper - handling ACLs, which are common on ZFS, or handling broken filesystems where ReadDir never completes - but let's make it work in both cases, right?

deluan added a commit that referenced this issue Jul 18, 2021
@whorfin
Copy link
Contributor

whorfin commented Jul 18, 2021

Big thanks!
And major hat-tip for figuring out how to make a test scenario for this

@quotengrote
Copy link

Hi,

is this fix already in the latest release or should i stay on deluan/navidrome:develop?

@deluan
Copy link
Member

deluan commented Aug 9, 2021

It was not released yet, so if you need this, please keep using the develop build. Hopefully I'll be able to release a new version this week.

@quotengrote
Copy link

👌

@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 10, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

5 participants