
Notify feature for non-interactive scripts is unreliable on large number of jobs [1000+] #18

Open
Asumerodi opened this issue May 21, 2017 · 9 comments · May be fixed by #45

Comments


Asumerodi commented May 21, 2017

I have created a batch conversion script in zsh, using this package, to convert thousands of songs. I am noticing, however, that after several hundred jobs the notify feature fails and I get stuck in an infinite loop waiting for a job that is actually already done.

I know the job is finished because I am using the 'watch' utility to watch the subprocess run to completion. My setup is more complex than, but similar to, your example. I use a loop to process only '$(nproc) * 2' jobs at a time so as not to overload the CPU, so when this bug occurs the script can't continue to the next batch and I have to kill it to continue.

I have tried killing the worker after each iteration and starting a new one at the beginning of the loop, but the same behavior continues. I have also tried modifying the script to run without limiting the number of jobs, and the same behavior still occurs after 200-300 jobs.

This is a big pain because once the script crosses the threshold it fails constantly, and I can only process 2 or 3 iterations on the next try before it fails again (my script skips files that are already converted, but that still counts as a job even though no conversion takes place).

Here is the relevant code section:

readonly -i cores=$(( $(nproc) * 2 ))

for (( i=1, j=cores; i <= (${#files} + cores); i+=cores, j+=cores )); do
  if [[ $j -gt ${#files} ]]; then
    j=${#files}
  fi

  async_start_worker async_conversion -n   ### <--- same behavior even if this is outside the loop
  async_register_callback async_conversion completed_callback

  for (( goo=i; goo <= j; goo++ )); do
    if [[ $goo -le ${#files} ]]; then
      async_job async_conversion convert ${files[$goo]}
    fi
  done

  while (( COMPLETED_JOBS < j )); do  ### <---- infinite loop starts here
    echo $COMPLETED_JOBS -- $j
    sleep 0.001
  done

  async_stop_worker async_conversion
done
@mafredri (Owner)

Hi @Asumerodi, and thanks for your report. I find it a bit hard to understand exactly how this code is executing; could you post a more complete sample?

Also, what version of zsh are you running? There are problems on at least 5.0.2 and 5.0.8 that I can't do anything about. Furthermore, the use of kill-signals (before zsh 5.2) is not optimal for a lot of simultaneous job execution.

One thing to keep in mind is that eventually the shell will not be able to create more forks, so there's a limit to how many async_jobs can run simultaneously, especially if your job creates new forks.

You could try changing this line to redirect errors inside the worker to a file, e.g. exec 2>> /tmp/err.log (note double >> for append). This might give some helpful hint as to what is going wrong.
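As a rough sketch of what that redirection achieves (the file name and the simulated error below are made up for illustration, and the subshell merely stands in for the worker), appending stderr to a log file from inside the worker looks like this:

```shell
# Hypothetical sketch: a subshell standing in for the async worker,
# with its stderr appended to a log file so errors survive its exit.
logfile=$(mktemp)

(
  exec 2>>"$logfile"           # note the double >> so repeated workers append
  echo "job failed: oops" >&2  # simulated error output from a job
)

cat "$logfile"  # -> job failed: oops
```

Anything a job writes to stderr inside the worker then ends up in the log file, even after the worker has exited.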

Sorry I've got nothing concrete for you yet, need to understand the problem you're having better first 😄.

@mafredri (Owner)

Re: complete sample. You don't really need to share your business logic, code that runs with a "noop" function would help.

@mafredri (Owner)

Actually, I might in fact understand the problem here. Since you're using async in a script, ZLE is not available, and thus -n (notify) will only work via kill.

I recommend you change a few bits:

test_large_number_of_jobs() {
	readonly -i cores=4
	readonly -i files=1000
	integer COMPLETED_JOBS=0
	integer i j goo

	completed_callback() {
		print cb $@
		COMPLETED_JOBS=$((COMPLETED_JOBS + 1))
	}
	convert() {
		print $1
	}

	async_start_worker async_conversion

	for (( i=1, j=cores; i <= (files + cores); i+=cores, j+=cores )); do
		print $i $j
		if [[ $j -gt $files ]]; then
			j=$files
		fi

		for (( goo=i; goo <= j; goo++ )); do
			print $i $j $goo
			if [[ $goo -le $files ]]; then
				async_job async_conversion convert $goo
			fi
		done

		while (( COMPLETED_JOBS < j )); do
			sleep 0.001
			async_process_results async_conversion completed_callback
			echo $COMPLETED_JOBS -- $j
		done
	done

	async_stop_worker async_conversion
}

I changed some things to keep the test function simple, but the biggest change is that we don't use -n (notify) for the worker, and instead call async_process_results async_conversion completed_callback manually in the loop. The way your code is structured, there was never any need for the notify part 😄.
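The manual-polling pattern above can be mimicked in plain shell without zsh-async at all. In this sketch (the batch sizes, file count, and temp-file "results" are invented for illustration), the main loop launches a batch of background jobs, waits for them, and then counts the completed results itself instead of relying on a notification:

```shell
# Plain-shell analogy of the fix: the main loop collects completed
# results itself rather than depending on an async notification.
files=20 cores=4 completed=0
tmpdir=$(mktemp -d)

i=1
while [ "$i" -le "$files" ]; do
  j=$((i + cores - 1))
  [ "$j" -gt "$files" ] && j=$files

  n=$i
  while [ "$n" -le "$j" ]; do
    ( echo "converted $n" > "$tmpdir/$n" ) &  # simulated convert job
    n=$((n + 1))
  done
  wait  # the whole batch has finished

  # "process results": count what landed on disk, like the callback does
  set -- "$tmpdir"/*
  completed=$#

  i=$((j + 1))
done

echo "completed=$completed"  # -> completed=20
rm -rf "$tmpdir"
```

The key point is the same as in the test function: completion is observed by the loop itself (here by counting result files; there by calling async_process_results), so no signal can be lost.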


Asumerodi commented May 23, 2017

I know the code is ugly and I was even reluctant to post it in that form; I appreciate the workaround and will certainly be using it for now. My zsh version is 5.3.1 on Arch Linux, so unless there is a regression in this version I dunno if what you mention applies. Essentially, the loop runs 'number of detected cores * 2' jobs at a time. On my system that is 16 jobs running concurrently; once all 16 finish, 16 more are started, until all conversions are finished.

The only other meaningful code in the script, besides option parsing, is this function, which handles conversion.

convert () {
  local -r file=$1

  if [[ -n ${outputdir} ]]; then
    local -r outdir=${outputdir[3]}  # <--- user-defined directory to put converted files
  else
    local -r outdir=$MUSICDIR/ALAC   # <--- if not specified on command line, use default
  fi

  local -r outfile=${outdir}/${file%flac}m4a

  # error check, unimportant to execution
  if [[ -n ${musicdir} ]] && [[ $(dirname ${file%/*}) != "." ]]; then
    mkdir -p ${outdir}/${file%/*}
  else
    mkdir -p ${outdir}
  fi

  # the print calls are just for async_process_results
  if ! [[ -f ${outfile} ]]; then
    if ! ffmpeg -i ${file} -vn -c:a alac ${outfile}; then
      print -u2 "'${outfile##*/}'"
      return 1
    fi
    print "'${outfile##*/}'"
  else
    print -u2 "'${outfile##*/}': exists"
    return 3
  fi

  return 0
}

Sorry if it's still hard to follow; this was just a brain dump that I want to clean up after I get it working correctly.


mafredri commented May 24, 2017

I know the code is ugly and was even reluctant to post in that form, and I appreciate the work around.

Don't be so hard on yourself, it is fine 😄.

Zsh version is 5.3.1 on Arch Linux so unless there is a regression in this version I dunno if what you mention applies.

No, you're fine in that regard. I honestly didn't consider at first that this was running as a script (without ZLE).

the only other meaningful code in the script besides option parsing is this function which handles conversion.

Thanks for posting it. Now, I don't see anything obvious; however, would you mind changing $(dirname ${file%/*}) to ${${file:h}%/*} to see if it has any effect? I'm curious whether avoiding a subshell here helps (fewer forks).
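For reference (the example path below is made up), the difference is between forking a subshell for dirname and doing the same strip with parameter expansion. ${${file:h}%/*} is zsh's nested form; a step-by-step equivalent looks like:

```shell
# Hypothetical path; both approaches strip the last two components.
file="artist/album/track.flac"

via_subshell=$(dirname "${file%/*}")  # forks a subshell for dirname

via_expansion=${file%/*}              # "artist/album"  (like ${file:h})
via_expansion=${via_expansion%/*}     # "artist"        (no fork at all)

echo "$via_subshell"   # -> artist
echo "$via_expansion"  # -> artist
```

Both yield the same string, but the expansion form never leaves the current shell process.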


I really wish I could make the notification (-n) feature more reliable when run in scripts, but kill signals are not reliable in zsh, at least not when received in rapid succession; some will be lost along the way 😞.

There is one way to utilize the ZLE watcher in scripts as well, but you'd have to start an interactive shell to accomplish that; there's an example of how to do it in async_test.zsh#L482-L512.

PS. Out of curiosity, did you try the exec 2>> /tmp/err.log to see if it spat out any errors?

@Asumerodi (Author)

I did try appending to '/tmp/err.log' on the line you mentioned; however, after the script gets stuck in the loop there is no error in that file, it's just empty. I also tried changing that 'dirname' to a parameter expansion as you suggested, and am still getting the same behavior. Fortunately, your workaround of calling async_process_results directly works perfectly, so at least I can get the job done. The next step is to generalize the script to convert to arbitrary formats.


mafredri commented Jun 7, 2017

Thanks for testing those cases @Asumerodi, it means I can cross some of my suspicions off the list. I'll try to look into this more deeply when I have some extra time. In the meantime, I'm glad that the work-around helped 🙂. Good luck with your script!

@mafredri changed the title from "notify featues is unreliable on large number of jobs [1000+]" to "Notify feature for non-interactive scripts is unreliable on large number of jobs [1000+]" on Jun 7, 2017
@mafredri (Owner)

@Asumerodi it may be 2 years too late, but I might have a fix for your original issue in #45. If you're still using async for these kinds of things, feel free to give it a spin. If not, just thought I'd let you know.


laggardkernel commented Nov 25, 2020

Update 1: It turns out zpty -w $worker "$@"$'\0' fails after too many jobs have been sent.


Behavior

@mafredri I may have encountered a similar problem in my async zsh prompt. I previously made 22 sections of the prompt async, but I found that 3 of the sections are never rendered in my prompt.

What I've dug into

I added an echo in the function _async_send_job, and I'm sure zpty -w $worker "$@"$'\0' is called and the job is passed to the worker. Later, I tried enabling debug in GH-45, and found no info about the last three sections in the debug log.

_async_send_job() {
	echo "_async_send_job: $@"
	echo "_async_send_job: $@" >> /tmp/sz.log
	zpty -w $worker "$@"$'\0'
}
_async_send_job: _async_eval setopt EXTENDED_GLOB
_async_send_job: sz::async_wrapper sz::vcs vcs·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::package package·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::node node·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::ruby ruby·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::elm elm·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::elixir elixir·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::xcode xcode·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::swift swift·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::golang golang·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::php php·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::rust rust·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::haskell haskell·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::julia julia·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::vagrant vagrant·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::docker docker·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::aws aws·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::conda conda·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::pyenv pyenv·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::dotnet dotnet·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::ember ember·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::kubectl kubectl·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::terraform terraform·\|·prompt·\|·

The keywords ember, kubectl, terraform didn't appear in the debug log assigned by ASYNC_DEBUG_WORKER_STDERR=/tmp/sz.log.

To check whether the worker is killed somewhere in my prompt code, I tried intercepting the function async_stop_worker:

async_stop_worker() {
	echo "----------------" >> /tmp/sz.log
	echo "Stopping $@" >> /tmp/sz.log
	# ... (original async_stop_worker body follows)
}
And it's only killed, and the message printed, once, in the precmd hook, which is the expected behavior.

Seems like a bug in zpty?

Reproduce the bug

Load my prompt spacezsh and enable all async sections by modifying the default settings in default.zsh.

https://github.com/laggardkernel/spacezsh-prompt/blob/31ea7c92519efc8248330eb8d83bb9f17e002c85/presets/default.zsh#L9-L45

if [ -z "$SPACESHIP_PROMPT_ORDER" ]; then
  SPACESHIP_PROMPT_ORDER=(
    user               # Username section
    host               # Hostname section
    dir                # Current directory section
    vcs::async         # Version control system section (git, hg, svn)
    # git::async         # deprecated
    # hg::async          # deprecated

    package::async     # Package version
    node::async        # Node.js section
    ruby::async        # Ruby section
    elm::async         # Elm section
    elixir::async      # Elixir section
    xcode::async       # Xcode section
    swift::async       # Swift section
    golang::async      # Go section
    php::async         # PHP section
    rust::async        # Rust section
    haskell::async     # Haskell Stack section
    julia::async       # Julia section
    vagrant::async     # Vagrant section
    docker::async      # Docker section
    aws::async         # Amazon Web Services section
    venv               # virtualenv section
    conda::async       # conda virtualenv section
    pyenv::async       # Pyenv section
    dotnet::async      # .NET section
    ember::async       # Ember.js section
    kubectl::async     # Kubectl context section
    terraform::async   # Terraform workspace section

    line_sep           # Line break
    # vi_mode           # deprecated
    char               # Prompt character, with vi-mode indicator integrated
  )
fi

Temporary Solution

Since not all of the sections are time-consuming, I'll make sections async only as needed, depending on the execution time of the async jobs I get in the callback function.
