## Building a container

#### SSH into submit node (on local machine)

- SSH into the submit node. For example, for my submit node I run `ssh pravindran@submit2.chtc.wisc.edu` and follow the prompts in the terminal.

#### Initiate an interactive session (on CHTC submit node)

- Create a submit file for starting an interactive session on a "working" machine. We will build the container inside that interactive session.

- If the submit file is called `interactive_session_launcher.sub`, running `condor_submit -i interactive_session_launcher.sub` on the submit machine launches an interactive session on a working machine. Note that the working machine is assigned arbitrarily and changes every time, so back up anything you create in the interactive session before exiting.

- Contents of `interactive_session_launcher.sub`:
```
    # Submit file to use when launching an interactive session
    universe = vanilla
    log = interactive.log

    # If your build job needs access to any files in your /home directory,
    # transfer them to your job using transfer_input_files
    #transfer_input_files =
    
    +IsBuildJob = true
    requirements = (OpSysMajorVer =?= 8)
    request_cpus = 1
    request_memory = 32GB
    request_disk = 64GB

    queue
```

- Note the following about the contents of `interactive_session_launcher.sub`:
    - We request 32GB of memory, 64GB of disk, and 1 CPU. Requesting too few resources can lead to memory errors while building the container on the working machine. If that happens, check the messages in `interactive.log`.

- If all goes well, you will be in an interactive session on the working machine within a few seconds.


#### Build the container (on CHTC working node)
- Create a definition file that describes what we want in our container.

- Contents of file: `pytch113_container.def` 
    ```
    Bootstrap: docker
    From: pytorch/pytorch:1.13.0-cuda11.6-cudnn8-devel

    %post
        # -y accepts conda's confirmation prompt, which cannot be answered during a build
        conda install -y -c anaconda pandas
        conda install -y -c anaconda scipy
        conda install -y -c anaconda scikit-learn
    ```
    
- Note the following about the contents of `pytch113_container.def`:
    - The build starts from the PyTorch 1.13.0 Docker image `pytorch/pytorch:1.13.0-cuda11.6-cudnn8-devel`.
    - The `%post` section then installs pandas, scipy, and scikit-learn with conda.
    
- Build the container by running `apptainer build pytch113_container.sif pytch113_container.def` in the interactive session. If all goes well, the file `pytch113_container.sif` will be created.

- Move `pytch113_container.sif` to your staging directory: `mv pytch113_container.sif /staging/<username>`

- Exit from working node.
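
Before moving the image and leaving the working node, it can help to confirm that the packages installed in `%post` are importable inside the container. The following is a minimal sketch and not part of the original workflow: the script name `check_env.py` and the helper `missing_packages` are hypothetical, and the module list simply mirrors the definition file (`sklearn` is the import name of scikit-learn).

```python
# check_env.py (hypothetical helper, not part of the chtc_toy repo)
import importlib.util


def missing_packages(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level modules expected in the image built from pytch113_container.def.
EXPECTED_MODULES = ["torch", "pandas", "scipy", "sklearn"]

if __name__ == "__main__":
    missing = missing_packages(EXPECTED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All expected modules are importable.")
```

Assuming the script is available on the working node (for example via `transfer_input_files`), it could be run inside the image with `apptainer exec pytch113_container.sif python check_env.py`.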

## Preparing for job submission


#### Create the ingredients (on CHTC submit node)
- The executable is a bash script that every job runs with its own set of parameters. Here are the contents of `executable_for_job.sh`:
    ```
    #!/bin/bash

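    # Positional arguments supplied by the `arguments` line of the submit file
    DATAFILE=$1
    PARAM=$2
    OUTDIR=$3

    # Quote the variables so that values containing spaces are passed intact
    python main.py --data_file "$DATAFILE" --out_dir "$OUTDIR" --param "$PARAM"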
    ```
    
- The executable expects 3 arguments when invoked: the data file name, the parameter, and the output directory, in that order.
- ---
- The file `params_for_jobs.txt` contains the values for the arguments expected by `executable_for_job.sh`, one line per job. Here are the contents of `params_for_jobs.txt`:
    ```
    t1.pt, out, double
    t1.pt, out, triple
    t1.pt, out, quadruple
    t2.pt, out, double
    t2.pt, out, triple
    t2.pt, out, quadruple
    ```
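
For larger sweeps, typing this file by hand becomes error-prone. As a sketch only (the helper `build_param_lines` is hypothetical and not part of the repo), the same six lines could be generated from the data files and parameter values, keeping the column order used above: data file, output directory, parameter.

```python
from itertools import product


def build_param_lines(data_files, params, out_dir="out"):
    """Return one 'datafile, outdir, param' line per (data file, parameter) pair."""
    return [
        f"{data_file}, {out_dir}, {param}"
        for data_file, param in product(data_files, params)
    ]


# Values taken from the params_for_jobs.txt shown above.
lines = build_param_lines(["t1.pt", "t2.pt"], ["double", "triple", "quadruple"])
print("\n".join(lines))
```

Redirecting the printed output to `params_for_jobs.txt` would reproduce the file.
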
- ---
- The job submission file (`jobs_submitter.sub`) takes `executable_for_job.sh` and `params_for_jobs.txt` and splits the computation into multiple jobs. It launches one job per line of `params_for_jobs.txt`; each job is one invocation of `executable_for_job.sh` with one parameter setting, and runs on one machine on CHTC.

- Contents of `jobs_submitter.sub`:
    ```
    # Submit jobs.

    # Provide HTCondor with the name of your .sif file and universe information
    # (`universe = container` is optional as long as `container_image` is specified)
    container_image = pytch113_container.sif
    universe = container

    executable = executable_for_job.sh
    arguments = $(DATAFILE) $(PARAM) $(OUTDIR)

    # Tell HTCondor to transfer the pytch113_container.sif file, the code, and the data file to each job
    transfer_input_files = file:///staging/pravindran/pytch113_container.sif, main.py, params.py, data/$(DATAFILE)
    transfer_output_files = $(OUTDIR)/$(DATAFILE)_X$(PARAM).pt
    transfer_output_remaps = "$(DATAFILE)_X$(PARAM).pt = out/$(DATAFILE)_X$(PARAM).pt"

    log = logs/$(CLUSTER).log
    error = errors/$(CLUSTER)_$(PROCESS).err
    output = outputs/$(CLUSTER)_$(PROCESS).out

    # Make sure you request enough disk for the container image in addition to your other input files
    request_cpus = 4
    request_memory = 32GB
    request_disk = 64GB      

    # Variable names must follow the column order in params_for_jobs.txt
    queue DATAFILE, OUTDIR, PARAM from params_for_jobs.txt
    ```
    
- Notes about the contents of `jobs_submitter.sub`:
    - Expects jobs to be submitted from the directory in which `main.py` and `params.py` reside.
    - Expects `pytch113_container.sif` to be in `/staging/pravindran/`.
    - Put `executable_for_job.sh`, `params_for_jobs.txt`, and `jobs_submitter.sub` in the directory that contains `main.py` and `params.py`.
    - `transfer_input_files` copies the input listed as `data/$(DATAFILE)` to the top level of the job's working directory on the working node, so it sits next to `main.py` as `$(DATAFILE)` rather than under `data/`. `main.py` must open it by that name.
    - Our `main.py` creates the directory `out/` on the working node and writes its output file inside it. When the job finishes, the output file is copied into the directory on the submit node from which the jobs were submitted, without the `out/` prefix.
    - To reorganize the transferred output, we remap it with `transfer_output_remaps`, which here places each file under `out/`. CHTC cannot create `out/` on the submit node; the directory must already exist in the directory from which the jobs were submitted, so we create it before submitting the jobs (see below).
    - The paths given in `transfer_output_remaps` need not match those created by `main.py`, so `transfer_output_remaps` can also be used to rename files.
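
To make the file handling concrete, here is a minimal stand-in for `main.py`. It is a sketch under assumptions, not the script from the repo: only the three command-line flags come from `executable_for_job.sh`, while the functions `output_path` and `main` and the placeholder output are hypothetical. It reads the data file from the job's working directory (not from `data/`), creates the output directory, and writes a file whose name matches the pattern in `transfer_output_files`.

```python
import argparse
from pathlib import Path


def output_path(data_file, out_dir, param):
    """Build <out_dir>/<data_file>_X<param>.pt, the name the submit file transfers back."""
    return Path(out_dir) / f"{data_file}_X{param}.pt"


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_file", required=True)
    parser.add_argument("--out_dir", required=True)
    parser.add_argument("--param", required=True)
    args = parser.parse_args(argv)

    # The input arrives at the top level of the job directory,
    # so it is opened by its bare name rather than as data/<name>.
    payload = Path(args.data_file).read_bytes()

    # The job directory starts without the output directory, so create it here.
    target = output_path(args.data_file, args.out_dir, args.param)
    target.parent.mkdir(parents=True, exist_ok=True)

    # Placeholder: a real script would transform the data according to --param.
    target.write_bytes(payload)
    return target
```

Calling `main()` with no arguments parses the process's command-line arguments, which is how the script would be driven by `executable_for_job.sh` once an `if __name__ == "__main__":` guard invokes it.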

## High throughput processing

#### Submitting the jobs (on CHTC submit node)

- Clone this repo (`chtc_toy`) from GitHub: `git clone https://github.com/prabu-github/chtc_toy.git`.
- Get into `chtc_toy` directory: `cd chtc_toy`.
- Copy files into correct location: 
    - `cp chtc/executable_for_job.sh .`
    - `cp chtc/params_for_jobs.txt .`
    - `cp chtc/jobs_submitter.sub .`
- Make the script executable: `chmod +x executable_for_job.sh`.
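- Create the `out/` directory: `mkdir out`. The submit file also writes to `logs/`, `errors/`, and `outputs/`; if these are not already present, create them too: `mkdir -p logs errors outputs`.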
- Submit: `condor_submit jobs_submitter.sub`.
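
Once the jobs finish, one way to confirm that every output arrived is to compare `params_for_jobs.txt` against the files in `out/`. The sketch below is illustrative only: the function `missing_outputs` is hypothetical, and it assumes the column order shown in `params_for_jobs.txt` (data file, output directory, parameter) and the `$(DATAFILE)_X$(PARAM).pt` naming used in `transfer_output_remaps`.

```python
from pathlib import Path


def missing_outputs(param_lines, base_dir="."):
    """Return the expected output files that are not present under base_dir."""
    missing = []
    for line in param_lines:
        if not line.strip():
            continue  # skip blank lines
        data_file, out_dir, param = [field.strip() for field in line.split(",")]
        expected = Path(base_dir) / out_dir / f"{data_file}_X{param}.pt"
        if not expected.exists():
            missing.append(expected)
    return missing
```

For example, `missing_outputs(Path("params_for_jobs.txt").read_text().splitlines())`, run from the submit directory, returns an empty list when all six outputs are in place.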