Questions about train_propositional.sh #19

Closed
aymeric75 opened this issue May 23, 2022 · 1 comment

@aymeric75

Hello,

It seems that jbsub is a custom scheduler that we don't have access to. On my cluster we use srun instead.

So I tried replacing jbsub with srun in the first line of train_propositional.sh that calls it (l.47). Here is my file so far:


#!/bin/bash

set -e

trap exit SIGINT


ulimit -v 16000000000

export PYTHONUNBUFFERED=1
# sokoban problem 2 has the same small screen size as problem 0, and has more than 20000 states unlike problem 0.
# ('sokoban_image-20000-global-global-0-train.npz', array([56, 56,  3]), (3613, 1, 9408)) --- problem 0 has only 3613 states!
# ('sokoban_image-20000-global-global-2-train.npz', array([56, 56,  3]), (19999, 1, 9408))
export skb_train=sokoban_image-20000-global-global-2-train
export SHELL=/bin/bash
export common


task (){
    script=$1 ; shift
    mode=$1
    # main training experiments. results are used for planning experiments

    $common $script $mode hanoi     4 4           {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common $script $mode hanoi     3 9           {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common $script $mode hanoi     4 9           {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common $script $mode hanoi     5 9           {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common $script $mode puzzle    mnist    3 3  {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common $script $mode lightsout digital    5  {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common $script $mode lightsout twisted    5  {} $comment ::: 5000 ::: CubeSpaceAE_AMA{3,4}Conv
    $common -queue x86_12h $script $mode puzzle    mandrill 4 4  {} $comment ::: 20000 ::: CubeSpaceAE_AMA3Conv
    $common -queue x86_24h $script $mode puzzle    mandrill 4 4  {} $comment ::: 20000 ::: CubeSpaceAE_AMA4Conv
    $common -queue x86_6h  $script $mode sokoban   $skb_train    {} $comment ::: 20000 ::: CubeSpaceAE_AMA3Conv
    $common -queue x86_12h $script $mode sokoban   $skb_train    {} $comment ::: 20000 ::: CubeSpaceAE_AMA4Conv
    $common -queue x86_12h $script $mode blocks    cylinders-4-flat {} $comment ::: 20000 ::: CubeSpaceAE_AMA3Conv
    $common -queue x86_24h $script $mode blocks    cylinders-4-flat {} $comment ::: 20000 ::: CubeSpaceAE_AMA4Conv
}

export -f task


proj=$(date +%Y%m%d%H%M)sae-planning
number=2

################################################################
## Train the network, then run plot, summary, and dump as the job finishes
#common="parallel -j 1 --keep-order jbsub -mem 16g -cores 1+1 -queue x86_6h -proj $proj -require 'v100||a100'"
common="parallel -j 1 --keep-order srun -N 1 -p g100_usr_interactive --gres=gpu:1 -proj $proj -require 'v100||a100'"



export comment=kltune$number
parallel -j 1 --keep-order task ./train_kltune.py learn_summary_plot_dump ::: {1..30}

exit

This produces the error:

srun: fatal: Can not execute 202205230755sae-planning

I am having a hard time understanding what the "202205230755sae-planning" executable corresponds to, and what the "-proj" argument of jbsub does.

Best regards

Aymeric

@guicho271828
Owner

-proj is just a tag to assign to jobs. In my experience both LSF and Torque had this feature, so surely Slurm has one too.
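
For anyone porting the script to Slurm: the error in the question is consistent with srun reading `-proj` as its short option `-p` with the value `roj`, and then treating the next word, the expanded `$proj`, as the program to run. A minimal sketch of a replacement for the `common=` line follows. It assumes that `--job-name` is an acceptable stand-in for the jbsub tag, that `--mem=16G` matches `-mem 16g`, and that the cluster defines `v100` and `a100` as node features usable with `--constraint`. None of these choices is confirmed by the repository, and the partition name is simply the one from the question.

```shell
#!/bin/bash
# Sketch only: a Slurm-flavoured replacement for the jbsub-based `common` line.
# Assumptions (not from the repository):
#   --job-name   stands in for jbsub's -proj tag
#   --mem=16G    stands in for jbsub's -mem 16g
#   --constraint stands in for jbsub's -require, and only works if the
#                cluster admins defined v100 / a100 as node features
proj=$(date +%Y%m%d%H%M)sae-planning

# The single quotes stay inside the string on purpose: GNU parallel hands the
# assembled command line to a shell, so the quotes keep '|' from being read
# as a pipe at that stage (the original script does the same with 'v100||a100').
common="parallel -j 1 --keep-order srun -N 1 -p g100_usr_interactive --gres=gpu:1 --mem=16G --job-name=$proj --constraint='v100|a100'"

# Print the resulting prefix so it can be inspected before submitting anything.
echo "$common"
```

The `-queue x86_12h` style arguments inside `task` are also jbsub-specific and would need a similar substitution, for example a `--time` limit, depending on how the cluster is configured.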
