
/bin/sh: 1: qsub: not found #27

Closed
Karol-G opened this issue Mar 26, 2021 · 6 comments

@Karol-G

Karol-G commented Mar 26, 2021

Hi Sarthak,

When I try to train on the toy dataset with `samples/config_classification.yaml`, I get the error `/bin/sh: 1: qsub: not found`. I believe this originates from the `parallel_compute_command` entry in the config. I am using the newest pull of gandalf-refactor and am running on Linux.

The train command:

python gandlf_run -config ./experiments/2d_classification/model.yaml -data ./experiments/2d_classification/train.csv -output ./experiments/2d_classification/output_dir/ -train 1 -device cuda

Full error log:

Submitting job for testing split 0 and validation split 0
/bin/sh: 1: qsub: not found
Submitting job for testing split 0 and validation split 1
/bin/sh: 1: qsub: not found
[... the same "/bin/sh: 1: qsub: not found" error repeats for every remaining testing/validation split combination, up to testing split 4 and validation split 4 ...]

This is the model.yaml (which is the samples/config_classification.yaml):

# affix version
version:
  {
    minimum: 0.0.8,
    maximum: 0.0.8 # this should NOT be made a variable, but should be tested after every tag is created
  }
# Choose the model parameters here
model:
  {
    dimension: 3, # the dimension of the model and dataset: defines dimensionality of computations
    base_filters: 30, # Set base filters: number of filters present in the initial module of the U-Net convolution; for IncU-Net, keep this divisible by 4
    architecture: vgg16, # options: unet, resunet, fcn, uinc, vgg, densenet
    batch_norm: True, # this is only used for vgg
    final_layer: None, # can be either sigmoid, softmax or none (none == regression)
    amp: False, # Set if you want to use Automatic Mixed Precision for your operations or not - options: True, False
    n_channels: 3, # set the input channels - useful when reading RGB or images that have vectored pixel types
  }
# this is to enable or disable lazy loading - setting to true reads all data once during data loading, resulting in improvements
# in I/O at the expense of memory consumption
in_memory: False
# this will save the generated masks for validation and testing data for qualitative analysis
save_masks: False
# Set the Modality : rad for radiology, path for histopathology
modality: rad
# Patch size during training - 2D patch for breast images since third dimension is not patched 
patch_size: [64,64,64]
# uniform: UniformSampler or label: LabelSampler
patch_sampler: uniform
# Number of epochs
num_epochs: 100
# Set the patience - measured in number of epochs after which, if the performance metric does not improve, exit the training loop - defaults to the number of epochs
patience: 50
# Set the batch size
batch_size: 1
# Set the initial learning rate
learning_rate: 0.001
# Learning rate scheduler - options: triangle, triangle_modified, exp, reduce-on-lr, step, more to come soon - default hyperparameters can be changed thru code
scheduler: triangle
# Set which loss function you want to use - options : 'dc' - for dice only, 'dcce' - for sum of dice and CE and you can guess the next (only lower-case please)
# options: dc (dice only), dc_log (-log of dice), ce (), dcce (sum of dice and ce), mse () ...
# mse is the MSE defined by torch and can define a variable 'reduction'; see https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss
# use mse_torch for regression/classification problems and dice for segmentation
loss_function: mse
# this parameter weights the loss to handle imbalanced losses better
weighted_loss: True 
#loss_function:
#  {
#    'mse':{
#      'reduction': 'mean' # see https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss for all options
#    }
#  }
# Which optimizer do you want to use - adam/sgd
opt: adam
# this parameter controls the nested training process
# performs randomized k-fold cross-validation
# split is performed using sklearn's KFold method
# for single fold run, use '-' before the fold number
nested_training:
  {
    testing: 5, # this controls the testing data splits for final model evaluation; use '1' if this is to be disabled
    validation: 5 # this controls the validation data splits for model training
  }
## pre-processing
# this constructs an order of transformations, which is applied to all images in the data loader
# order: resize --> threshold/clip --> resample --> normalize
# 'threshold': performs intensity thresholding; i.e., if x[i] < min: x[i] = 0; and if x[i] > max: x[i] = 0
# 'clip': performs intensity clipping; i.e., if x[i] < min: x[i] = min; and if x[i] > max: x[i] = max
# 'threshold'/'clip': if either min/max is not defined, it is taken as the minimum/maximum of the image, respectively
# 'normalize': performs z-score normalization: https://torchio.readthedocs.io/transforms/preprocessing.html?highlight=ToCanonical#torchio.transforms.ZNormalization
# 'normalize_nonZero': perform z-score normalize but with mean and std-dev calculated on only non-zero pixels
# 'normalize_nonZero_masked': perform z-score normalize but with mean and std-dev calculated on only non-zero pixels with the stats applied on non-zero pixels
# 'crop_external_zero_planes': crops all non-zero planes from input tensor to reduce image search space
# 'resample: resolution: X,Y,Z': resample the voxel resolution: https://torchio.readthedocs.io/transforms/preprocessing.html?highlight=ToCanonical#torchio.transforms.Resample
# 'resample: resolution: X': resample the voxel resolution in an isotropic manner: https://torchio.readthedocs.io/transforms/preprocessing.html?highlight=ToCanonical#torchio.transforms.Resample
# resize the image(s) and mask (this should be greater than or equal to patch_size); resize is done ONLY when resample is not defined
data_preprocessing:
  {
    'normalize',
    # 'normalize_nonZero', # this performs z-score normalization only on non-zero pixels
    'resample':{
      'resolution': [1,2,3]
    },
    #'resize': [128,128], # this is generally not recommended, as it changes image properties in unexpected ways
    'crop_external_zero_planes', # this will crop all zero-valued planes across all axes
  }
# various data augmentation techniques
# options: affine, elastic, downsample, motion, ghosting, bias, blur, gaussianNoise, swap
# keep/edit as needed
# all transforms: https://torchio.readthedocs.io/transforms/transforms.html?highlight=transforms
# 'kspace': one of motion, ghosting or spiking is picked (randomly) for augmentation
# 'probability' subkey adds the probability of the particular augmentation getting added during training (this is always 1 for normalize and resampling)
data_augmentation: 
  {
    default_probability: 0.5,
    'affine',
    'elastic',
    'kspace':{
      'probability': 1
    },
    'bias',
    'blur': {
      'std': [0, 1] # default std-dev range, for details, see https://torchio.readthedocs.io/transforms/augmentation.html?highlight=randomblur#torchio.transforms.RandomBlur
    },
    'noise': { # for details, see https://torchio.readthedocs.io/transforms/augmentation.html?highlight=randomblur#torchio.transforms.RandomNoise
      'mean': 0, # default mean
      'std': [0, 1] # default std-dev range
    },
    'anisotropic':{
      'axis': [0,1],
      'downsampling': [2,2.5]
    },
  }
# parallel training on HPC - here goes the command to prepend to send to a high performance computing
# cluster for parallel computing during multi-fold training
# not used for single fold training
# this gets passed before the training_loop, so ensure enough memory is provided along with other parameters
# that your HPC would expect
# ${outputDir} will be changed to the outputDir you pass in CLI + '/${fold_number}'
# ensure that the correct location of the virtual environment is getting invoked, otherwise it would pick up the system python, which might not have all dependencies
parallel_compute_command: 'qsub -b y -l gpu -l h_vmem=32G -cwd -o ${outputDir}/\$JOB_ID.stdout -e ${outputDir}/\$JOB_ID.stderr `pwd`/sge_wrapper _correct_location_of_virtual_environment_/venv/bin/python'
## queue configuration - https://torchio.readthedocs.io/data/patch_training.html?#queue
# this determines the maximum number of patches that can be stored in the queue. Using a large number means that the queue needs to be filled less often, but more CPU memory is needed to store the patches
q_max_length: 40
# this determines the number of patches to extract from each volume. A small number of patches ensures a large variability in the queue, but training will be slower
q_samples_per_volume: 5
# this determines the number subprocesses to use for data loading; '0' means main process is used
q_num_workers: 2 # scale this according to available CPU resources
# used for debugging
q_verbose: False

Best
Karol
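(Editor's note: the error means the SGE client `qsub` is not on the `PATH` of the machine running the training, so every shelled-out submission fails. A minimal, generic way to verify this before enabling `parallel_compute_command` — a sketch using only the standard library, not actual GaNDLF code — is:)

```python
import shutil


def scheduler_available(command: str = "qsub") -> bool:
    """Return True if the given scheduler command can be found on PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    if not scheduler_available("qsub"):
        print("qsub not found: comment out parallel_compute_command, "
              "or run on a submit node where SGE is installed")
```

If this prints the warning, the cluster submission prefix cannot work on that machine and single-fold (non-parallel) training is the way to go.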

@Geeks-Sid

Hi @Karol-G, thanks for using GaNDLF.

Please comment out this line:

#parallel_compute_command: 'qsub -b y -l gpu -l h_vmem=32G -cwd -o ${outputDir}/\$JOB_ID.stdout -e ${outputDir}/\$JOB_ID.stderr `pwd`/sge_wrapper _correct_location_of_virtual_environment_/venv/bin/python'

so that it does not attempt parallel compute. Similarly, I would also recommend disabling folded validation:

nested_training:
  {
    testing: 1, # this controls the testing data splits for final model evaluation; use '1' if this is to be disabled
    validation: -5 # this controls the validation data splits for model training
  }

so that training runs only once.

@Karol-G

Karol-G commented Mar 26, 2021

I commented out `parallel_compute_command`, and I adjusted `testing` and `validation` rather than commenting out `nested_training` itself, because otherwise I get the error The parameter 'nested_training' needs to be defined.
But now I get the following error:

Using default folds for testing split:  -5
Using default folds for validation split:  -5
Using previously saved parameter file ./experiments/2d_classification/output_dir/parameters.pkl
Traceback (most recent call last):
  File "gandlf_run", line 75, in <module>
    main()
  File "gandlf_run", line 70, in main
    TrainingManager(dataframe=data_full, headers = headers, outputDir=model_path, parameters=parameters, device=device, reset_prev = reset_prev)
  File "/content/GaNDLF-refactor/GANDLF/training_manager.py", line 146, in TrainingManager
    device=device, params=parameters, testing_data=testingData)
  File "/content/GaNDLF-refactor/GANDLF/training_loop.py", line 322, in training_loop
    metrics = params["metrics"]
KeyError: 'metrics'

It seems a metric is missing. How do I define it for a classification task?

@sarthakpati

Hi @Karol-G,

You can add a key `metrics` in the parameter file with the value `['mse']` for classification. I will update the testing config for clarity.

Cheers,
Sarthak

@Karol-G

Karol-G commented Mar 26, 2021

Hmm, I still get the same error with:

metrics:
  - mse

or:

metrics: ['mse']
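(Editor's note: both spellings parse to the identical Python value, so the YAML style is not the cause here — the real issue, as resolved below, was a stale cached parameter file. A quick check, assuming PyYAML as the parser:)

```python
import yaml

# block-sequence style
block_style = yaml.safe_load("metrics:\n  - mse")
# flow-sequence style
flow_style = yaml.safe_load("metrics: ['mse']")

print(block_style)  # {'metrics': ['mse']}
assert block_style == flow_style == {"metrics": ["mse"]}
```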

@sarthakpati

Ah, I think you need to try with this:

python gandlf_run -config ./experiments/2d_classification/model.yaml -data ./experiments/2d_classification/train.csv -output ./experiments/2d_classification/output_dir/ -train 1 -device cuda \
-reset_prev True # this will remove all writes to disk (such as training/validation data and parameters) from previous run

@Karol-G

Karol-G commented Mar 26, 2021

Ah yes that fixed it, thanks!
