# Reproducing FakET experiments on SHREC21 data

Some of the data preparation is done directly in this notebook, e.g. slicing the mrc files, creating noiseless or noisy projections, and running the reconstructions. The GPU tasks, however, are only prepared here as job files for SLURM. If your setup allows it, the jobs can be submitted to SLURM directly from this notebook. Ideally, this notebook runs in the appropriate conda environment, from which it can execute tasks on multiple CPUs and submit GPU or CPU jobs to SLURM using the `sbatch` command.

> **WARNING:** **Naming convention**
>
> Before submitting the article, we renamed the methods to make our experiments and aims easier to understand.
> This notebook still uses the old names. The old names map to the new ones as follows:
> ```   
>naming = {
>    'baseline': 'BENCHMARK',
>    'gauss': 'BASELINE',
>    'styled': 'FAKET',
>    'noisy': 'NOISY',
>}
>```

> **TIP:** **Reproducing the paper**
>
> If you just want to run the experiments we did, you do not have to re-create the sbatch files every time you come across a cell that generates them.
> Just use the sbatch files we provided in the repo. However, it is highly likely that your SLURM config differs from ours,
> in which case you will need to re-create the files anyway.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
import shutil
import numpy as np
import multiprocessing
import gpuMultiprocessing
from itertools import product
from os.path import join as pj
import matplotlib.pyplot as plt
from faket.data import vol_to_valid, get_theta
from faket.data import load_mrc, save_mrc, save_conf
from faket.noisy import estimate_noise_curves
from faket.noisy import aggregate, noise_projections
from faket.transform import radon_3d, reconstruct_mrc

In [None]:
# The data should be stored in the following folder
data_folder = 'data/shrec2021_extended_dataset/'

## Slicing volumes to valid voxels

In [None]:
# SHREC21 provides the data in square shape even
# though the data is stored only in the center.
# The following values specify where to slice.
z_valid = (0.32226, 0.67382)  # Valid range normalized
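
As a rough illustration of what the normalized range means, the sketch below converts `z_valid` into integer voxel bounds. The depth of 512 voxels and the rounding rule are assumptions made for this example only; `vol_to_valid` may compute the bounds differently.

```python
def normalized_to_indices(valid_range, depth):
    """Convert a normalized (start, stop) range into integer slice bounds."""
    start, stop = valid_range
    return round(start * depth), round(stop * depth)

# Hypothetical depth of 512 voxels along Z; not read from the SHREC21 files.
z_start, z_stop = normalized_to_indices((0.32226, 0.67382), depth=512)
print(z_start, z_stop, z_stop - z_start)
```

Under these assumptions the valid region spans voxels 165 to 345, i.e. 180 slices.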

In [None]:
# slice class_mask.mrc to faket/class_mask.mrc
for N in range(10):
    vol_to_valid(data_folder, f'model_{N}', 'class_mask', z_valid, 
                 out_fname='faket/class_mask.mrc')

In [None]:
# slice occupancy_mask.mrc to faket/occupancy_mask.mrc
for N in range(10):
    vol_to_valid(data_folder, f'model_{N}', 'occupancy_mask', z_valid, 
                 out_fname='faket/occupancy_mask.mrc')

In [None]:
# slice reconstruction.mrc to faket/reconstruction_shrec.mrc
for N in range(10):
    vol_to_valid(data_folder, f'model_{N}', 'reconstruction', z_valid, 
                 out_fname='faket/reconstruction_shrec.mrc')

## Creating projections from grandmodel

### Noiseless projections

In [None]:
# create faket/projections_noiseless.mrc by measuring the grandmodel_unbinned.mrc with Radon transform
for N in range(0, 10):
    print(f'Processing N: {N}')
    conf = {
        'input_mrc': pj(data_folder, f'model_{N}', 'grandmodel_unbinned.mrc'),
        'output_mrc': pj(data_folder, f'model_{N}', 'faket/projections_noiseless.mrc'),
        'radon_kwargs': {
            'theta': get_theta(data_folder, N),
            'dose': 0,
            'out_shape': 1024,
            'slice_axis': 1,
            # circle=False because we measure with the data outside the circle 
            # but later we cut the measurements to desired shape 
            # SHREC did it this way - confirmed from a personal communication
            'circle': False
        }
    }
    volume = load_mrc(conf['input_mrc'])
    sinogram = radon_3d(volume, **conf['radon_kwargs'])
    save_conf(conf['output_mrc'], conf)
    save_mrc(sinogram.astype(np.float32), conf['output_mrc'], overwrite=True)
print('Done')

### Noisy & Content & Gauss projections

Adding Gaussian noise in projection space and matching the tilt-wise mean and std of the style projections.
Noise statistics are estimated automatically from the style projections of the whole training set.

In [None]:
%%time

noisy_conf = {
    'noise_estimation_noisy_paths': [pj(data_folder, f'model_{N}', f'projections_unbinned.mrc') for N in range(9)],
    'noise_estimation_clean_paths': [pj(data_folder, f'model_{N}', 'faket/projections_noiseless.mrc') for N in range(9)],
    'noise_estimation_aggregation_order': 2, 
    'noise_estimation_aggregation_n': 'all', 
}

print('Estimating noise')
# Estimating the noise statistics for each tilt of each training tomogram
all_curves = estimate_noise_curves(
    projections_noisy_paths=noisy_conf['noise_estimation_noisy_paths'], 
    projections_clean_paths=noisy_conf['noise_estimation_clean_paths'])

# Aggregate stats
agg_o = noisy_conf['noise_estimation_aggregation_order']
agg_n = noisy_conf['noise_estimation_aggregation_n']
aggregated_rs = aggregate([c['rs'] for c in all_curves], order=agg_o, n=agg_n)
aggregated_σs = aggregate([c['stds'] for c in all_curves], order=agg_o, n=agg_n)
aggregated_style_μs = aggregate([c['noisy_μs'] for c in all_curves], order=agg_o, n=agg_n)
aggregated_style_σs = aggregate([c['noisy_σs'] for c in all_curves], order=agg_o, n=agg_n)

# Aggregate stats for Gauss modality (best possible guess if setting sigma of global gaussian noise manually).
# This is supposed to mimic what people in CryoEM do nowadays when generating synthetic data, assuming they
# add the noise in the projection space. That is not necessarily true, as some might take the easier route
# and add the noise in the reconstruction space.
global_aggregated_rs = aggregate([c['rs'] for c in all_curves], order=0, n=agg_n)
global_aggregated_σs = aggregate([c['stds'] for c in all_curves], order=0, n=agg_n)
global_aggregated_style_μs = aggregate([c['noisy_μs'] for c in all_curves], order=0, n=agg_n)
global_aggregated_style_σs = aggregate([c['noisy_σs'] for c in all_curves], order=0, n=agg_n)

print('Noise estimated.\n')

# create faket/projections_content.mrc and faket/projections_noisy.mrc
for N in range(0, 10):
    print(f'Processing N: {N}')
    
    # Noisy modality
    conf = {
        'input_mrc': pj(data_folder, f'model_{N}', 'faket/projections_noiseless.mrc'), 
        'output_mrc': pj(data_folder, f'model_{N}', 'faket/projections_noisy.mrc'),
        'r': list(aggregated_rs),
        'std': list(aggregated_σs),
        'style_mrc': None,  # Instead of one style_mrc we use aggregated stats from training set
        'style_means': list(aggregated_style_μs),
        'style_stds': list(aggregated_style_σs),
        'seed': N,
    }
    save_conf(conf['output_mrc'], dict(noisy_conf, **conf))
    noise_projections(**conf)
    
    # Content modality
    conf.update({
        'std': list(aggregated_σs * 0.25),
        'output_mrc': pj(data_folder, f'model_{N}', 'faket/projections_content.mrc'),
    })
    save_conf(conf['output_mrc'], dict(noisy_conf, **conf))
    noise_projections(**conf)
    
    # Gauss modality
    conf.update({
        'r': list(global_aggregated_rs),
        'std': list(global_aggregated_σs),
        'style_means': list(global_aggregated_style_μs),
        'style_stds': list(global_aggregated_style_σs),
        'output_mrc': pj(data_folder, f'model_{N}', 'faket/projections_gauss.mrc'),
    })
    gauss_conf = dict(noisy_conf, **conf)
    gauss_conf['noise_estimation_aggregation_order'] = 0
    save_conf(conf['output_mrc'], gauss_conf)
    noise_projections(**conf)

print('Done')

To get plain addition of Gaussian noise with a single std, do the same as above with `order=0`.\
A summary of what happens is below:

```
one_r = aggregate([c['rs'] for c in all_curves], order=0, n='all')[0]
one_σ = aggregate([c['stds'] for c in all_curves], order=0, n='all')[0]

x = standardize_per_tilt(projections_noiseless) * one_r
rng = np.random.default_rng(seed=seed)
x += (rng.normal(loc=0, scale=1, size=np.prod(x.shape)).reshape(x.shape) * one_σ)

# And finally match x per tilt with mean and std of style or provided arrays
```
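
The summary above can be turned into a small runnable sketch on toy data. The implementations of `standardize_per_tilt` and `match_per_tilt` below are illustrative assumptions, not the code in `faket.noisy`; they also assume that tilts are indexed by the first array axis, and the values of `one_r` and `one_sigma` are made up.

```python
import numpy as np

def standardize_per_tilt(x):
    """Give every tilt zero mean and unit std (axis 0 is assumed to index tilts)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    return (x - mean) / std

def match_per_tilt(x, target_means, target_stds):
    """Rescale every tilt of x to the given per-tilt mean and std."""
    means = np.asarray(target_means).reshape(-1, 1, 1)
    stds = np.asarray(target_stds).reshape(-1, 1, 1)
    return standardize_per_tilt(x) * stds + means

rng = np.random.default_rng(seed=0)
clean = rng.random((3, 8, 8))   # toy stand-in for noiseless projections
one_r, one_sigma = 1.0, 0.5     # hypothetical global scale and noise std

x = standardize_per_tilt(clean) * one_r
x += rng.normal(loc=0, scale=1, size=x.shape) * one_sigma

# Final step: match each tilt to target statistics (made-up values here).
out = match_per_tilt(x, target_means=[0.0, 1.0, 2.0], target_stds=[1.0, 2.0, 3.0])
print(out.mean(axis=(1, 2)), out.std(axis=(1, 2)))
```

After the last step, each tilt has exactly the requested mean and std, regardless of how much noise was added before.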


### Styled projections

Neural Style Transfer in projection space (needs a GPU to be reasonably fast).\
On a CPU, one NST run on a 61x1024x1024 sinogram takes about 5 min.

In [None]:
nstc = {  # NEURAL STYLE TRANSFER BASE CONFIG
    # The commented params will be set later
    # 'content': 'example.mrc',
    # 'style': 'example.mrc',
    # '--init': 'example.mrc',
    # '--output': 'example.mrc', 
    # '--random-seed': None,
    '--style-weights': 1.0,  # if number of style images is 1, 1.0 is the same as None
    '--content-weight': 1.0,  # weight of the content loss relative to style loss
    '--tv-weight': 0,  # No Total Variation is desired here
    '--min-scale': 1024,
    '--end-scale': 1024,
    '--iterations': 1,
    '--initial-iterations': 1,
    '--save-every': 2,
    '--step-size': 0.15,
    '--avg-decay': 0.99,
    '--style-scale-fac': 1.0,
    '--pooling': 'max',
    '--devices': 'cuda:0',
    '--seq_start' : 0,
    '--seq_end' : 61,
    '--content_layers': '8',
    '--content_layers_weights': '100',
    '--model_weights': 'pretrained'
}

# to run on cpu: f"srun --qos=cpusonly --cpus-per-task=1 --ntasks=1 --export=ALL "
def get_command(expname, nst_command, config):
    command = (
    "#!/bin/bash\n"
    f"#SBATCH --job-name {expname}\n"
    "#SBATCH --qos=gpus2\n"
    "#SBATCH --gres=gpu:1\n"
    "#SBATCH --cpus-per-task=5\n"
    "#SBATCH --nodes 1\n"
    "#SBATCH --ntasks=1\n"
    "#SBATCH --output reproduce/nst/%x.out\n\n"
    f"EXPNAME={expname} PYTHONHASHSEED=0 "
    f"{nst_command} {config['content']} {config['style']} "
    f"{' '.join([f'{k} {v}' for k, v in config.items() if k.startswith('--')])}")
    return command
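
To illustrate how `get_command` renders a config dictionary: keys that start with `--` become `flag value` pairs, while `content` and `style` are passed positionally. The sketch below repeats only that joining step on a tiny config; the file names are placeholders.

```python
config = {
    'content': 'content.mrc',  # positional argument, placeholder file name
    'style': 'style.mrc',      # positional argument, placeholder file name
    '--iterations': 1,
    '--pooling': 'max',
}

# Only keys starting with '--' are rendered as command-line flags.
flags = ' '.join(f'{k} {v}' for k, v in config.items() if k.startswith('--'))
line = f"{config['content']} {config['style']} {flags}"
print(line)
```

This prints `content.mrc style.mrc --iterations 1 --pooling max`. Flags appear in dictionary insertion order.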

In [None]:
# create faket/projections_styled.mrc
# This creates one sbatch file per tomogram in the selected folder
NST_command = 'python3 -m faket.style_transfer.cli'
folder = 'reproduce/nst'
os.makedirs(folder, exist_ok=True)

jobs = []
for N in range(0, 10):
    style_N = (N + 1) % 9 # For the last train model we take style stats from the first train model
    
    EXPNAME = f'nst_tomogram_{N}'  # Just for visualizing the progress
    tomo_folder = pj(data_folder, f'model_{N}', 'faket')

    conf = nstc.copy()
    conf.update({
        'content': pj(tomo_folder, 'projections_content.mrc'),
        'style': pj(data_folder, f'model_{style_N}', 'projections_unbinned.mrc'), 
        '--init': pj(tomo_folder, 'projections_noisy.mrc'),
        '--output': pj(tomo_folder, 'projections_styled.mrc'), 
        '--random-seed': N,
    })
    
    command = get_command(EXPNAME, NST_command, conf)
    job_path = pj(folder, f'{EXPNAME}.sbatch')
    with open(job_path, 'w') as fl:
        fl.writelines(command)  
    jobs.append(job_path)

In [None]:
# Submit jobs to the SLURM queue
for job_path in jobs:
    !sbatch $job_path

## Computing reconstructions

The reconstructions run directly from this notebook (i.e. no sbatch jobs).

In [None]:
recc = {  # RECONSTRUCTION BASE CONFIG
    'downsample_angle' : 1,  # Sinogram downsampling in theta dimension (1 = no downsampling)
    'downsample_pre' : 2,  # Sinogram downsampling (1 = no downsampling)
    'order' : 3,  # Downsampling in space with spline interpolation of order (0 - 5)
    'filtering' : 'approxShrec',  # Filter used during reconstruction in FBP algorithm
    'filterkwargs' : None,
    'downsample_post' : 1,  # Reconstruction downsampling
    'z_valid': z_valid, # 2-tuple range of valid pixels in Z dimension normalized from 0 to 1. (0., 1.) or None for all.
    'software': 'radontea',  # or imod
    'ncpus': 61, # multiprocessing.cpu_count(),  # Number of CPUs to use while reconstructing with radontea
}

### Noiseless reconstructions

In [None]:
# reconstruct faket/projections_noiseless.mrc to produce faket/reconstruction_noiseless.mrc
for N in range(0, 10):
    print(f'Processing N: {N}')
    conf = recc.copy()
    conf.update({
        'input_mrc' :  pj(data_folder, f'model_{N}', 'faket/projections_noiseless.mrc'), 
        'theta': pj(data_folder, f'model_{N}', 'alignment_simulated.txt'), 
        'output_mrc' : pj(data_folder, f'model_{N}', 'faket/reconstruction_noiseless.mrc')
    })
    reconstruct_mrc(**conf)

### Baseline reconstructions (in paper referred to as BENCHMARK)

In [None]:
# reconstruct projections_unbinned.mrc to produce faket/reconstruction_baseline.mrc
for N in range(0, 10):
    print(f'Processing N: {N}')
    conf = recc.copy()
    conf.update({
        'fix_edges_proj': 8,  # Fixing artifacts in SHREC projections
        'fix_edges_rec': 20,  # Fixing artifacts in SHREC reconstructions
        'input_mrc' :  pj(data_folder, f'model_{N}', 'projections_unbinned.mrc'), 
        'theta': pj(data_folder, f'model_{N}', 'alignment_simulated.txt'), 
        'output_mrc' : pj(data_folder, f'model_{N}', 'faket/reconstruction_baseline.mrc')
    })
    reconstruct_mrc(**conf)

### Noisy reconstructions

In [None]:
# reconstruct faket/projections_noisy.mrc to produce faket/reconstruction_noisy.mrc
for N in range(0, 10):
    print(f'Processing N: {N}')
    conf = recc.copy()
    conf.update({
        'input_mrc' : pj(data_folder, f'model_{N}', 'faket/projections_noisy.mrc'), 
        'theta': pj(data_folder, f'model_{N}', 'alignment_simulated.txt'), 
        'output_mrc' : pj(data_folder, f'model_{N}', 'faket/reconstruction_noisy.mrc')
    })
    reconstruct_mrc(**conf)

### Gauss reconstructions (in paper referred to as BASELINE)

In [None]:
# reconstruct faket/projections_gauss.mrc to produce faket/reconstruction_gauss.mrc
for N in range(0, 10):
    print(f'Processing N: {N}')
    conf = recc.copy()
    conf.update({
        'input_mrc' : pj(data_folder, f'model_{N}', 'faket/projections_gauss.mrc'), 
        'theta': pj(data_folder, f'model_{N}', 'alignment_simulated.txt'), 
        'output_mrc' : pj(data_folder, f'model_{N}', 'faket/reconstruction_gauss.mrc')
    })
    reconstruct_mrc(**conf)

### Styled reconstructions (in paper referred to as FAKET)
This cell has to be run only after all commands from `reproduce/nst` are done. 

In [None]:
# reconstruct faket/projections_styled.mrc to produce faket/reconstruction_styled.mrc
for N in range(0, 10):
    print(f'Processing N: {N}')
    conf = recc.copy()
    conf.update({
        'input_mrc' : pj(data_folder, f'model_{N}', f'faket/projections_styled.mrc'), 
        'theta': pj(data_folder, f'model_{N}', 'alignment_simulated.txt'), 
        'output_mrc' : pj(data_folder, f'model_{N}', f'faket/reconstruction_styled.mrc')
    })
    reconstruct_mrc(**conf)

# Deep Finder experiments

Train on 9 tomograms, eval on validation (last training) tomogram at every epoch. No early stopping. \
All jobs are written for SLURM submission system, except those that run within this notebook with multiprocessing.

## Training

In [None]:
# The SBATCH config will depend on your SLURM config, therefore will probably not run
# without your manual adjustment. E.g. --qos names are fully custom.
def get_full_DF_training_command(DF_training_command, config, with_header=True):
    command_header = '' if not with_header else (
        "#!/bin/bash\n"
        f"#SBATCH --job-name={config['expname']}\n"
        "#SBATCH --qos=gpus4\n" # normal
        "#SBATCH --gres=gpu:1\n"
        "#SBATCH --cpus-per-task=10\n"
        "#SBATCH --nodes=1\n"
        "#SBATCH --ntasks=1\n"
        "#SBATCH --output=reproduce/training/%x.out\n\n")

    command = (
        f"EXPNAME={config['expname']} PYTHONHASHSEED=0 "
        f"{DF_training_command} "
        f"--training_tomogram_ids {' '.join(list(zip(*config['training_tomograms']))[0])} "
        f"--training_tomograms {' '.join(list(zip(*config['training_tomograms']))[1])} "
        f"{' '.join([f'{k} {v}' for k, v in config.items() if k.startswith('--')])} "       
    )
    return command_header + command

In [None]:
# Run the training of Deep Finder for all the modalities.
DF_training_command ='python3 faket/deepfinder/launch_training.py'
folder = 'reproduce/training'
os.makedirs(folder, exist_ok=True)

jobs_per_gpu = 2
# You can fit 2 of these jobs on the same A100 40GB GPU. 
# This way you can save ~40% of time as opposed to running 1 job per GPU.
# 10 CPUs per sbatch task, shared by both jobs, is enough (we did not try fewer, but more does not help).
# Running 2 jobs in parallel (using 9 training tomograms) takes ~1060s to finish. 
# Running 2 jobs in parallel (using 1 training tomogram) takes ~120s to finish.

jobs = []
# modalities = ['rstyled']#, 'styled', 'noisy', 'baseline', 'shrec', 'noiseless', 'gauss']
modalities = ['baseline', 'gauss', 'styled', 'noisy']
n_tomos = range(9)
num_seeds = 6
num_epochs = 70

for idf in modalities:
    experiment_names = [f'exp_{idf}']
    training_tomograms = [f'{idf}']

    # command_queue = []
    for experiment_name, training_tomogram in zip(experiment_names, training_tomograms):
        for i, N in enumerate(range(1, num_seeds + 1)):
            training_conf = {
                "expname": f"{experiment_name}_seed-{N:02d}",
                "training_tomograms": [[str(i), training_tomogram] for i in n_tomos],
                "--training_tomo_path": data_folder,
                "--num_epochs": num_epochs,
                "--out_path": pj('data', 'results', experiment_name, f'seed{N}'),
                "--save_every": 1, 
                "--seed": N,
                # If continue_training_path is the same as out_path - continue from last epoch.
                # If it is a path to a specific weights.h5 file, continue from there.
                "--continue_training_path": pj('data', 'results', experiment_name, f'seed{N}'),

            }
            with_header = False if i % jobs_per_gpu else True
            command = get_full_DF_training_command(DF_training_command, training_conf, with_header)
            
            # Add the slurm header only once per file
            if with_header:
                job_path = pj(folder, f"{training_conf['expname']}.sbatch")
                mode = 'w'
            else:
                command = f'&\n\n' + command
                mode = 'a'
            
            # Tell slurm to wait for all jobs to finish 
            if jobs_per_gpu > 1 and ((i+1) % jobs_per_gpu == 0):
                command += '&\n\nwait < <(jobs -p)'  
            
            with open(job_path, mode) as fl:
                fl.writelines(command) 
            if job_path not in jobs:
                jobs.append(job_path)
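
The packing logic in the cell above can be hard to follow because it appends to files incrementally. Below is a simplified sketch of the same idea: groups of `jobs_per_gpu` commands share one script, each command is sent to the background with `&`, and a final `wait` blocks until all of them exit. `pack_commands` is a hypothetical helper, the commands are placeholders, and unlike the cell above it uses a plain `wait` and backgrounds the commands even when a group has a single entry.

```python
def pack_commands(commands, jobs_per_gpu, header='#!/bin/bash\n\n'):
    """Group shell commands into scripts that run jobs_per_gpu commands in parallel."""
    scripts = []
    for start in range(0, len(commands), jobs_per_gpu):
        group = commands[start:start + jobs_per_gpu]
        # Background every command of the group, then wait for all of them.
        body = '\n\n'.join(f'{cmd} &' for cmd in group)
        scripts.append(f'{header}{body}\n\nwait\n')
    return scripts

# Placeholder commands; three jobs packed two per script give two scripts.
scripts = pack_commands(['train --seed 1', 'train --seed 2', 'train --seed 3'],
                        jobs_per_gpu=2)
for script in scripts:
    print(script)
```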

In [None]:
# When do my jobs finish approximately? Just an approx. helper function.
# Based on timing on our hardware. There are decent chances your timing will be different.
def duration_training(n_modalities, n_tomos, n_seeds, n_epochs, n_parallel, n_gpus):
    from datetime import datetime, timedelta
    assert n_parallel in [1, 2]
    tomo_epoch = [None, 110, 125] # seconds    
    duration_h = ((tomo_epoch[n_parallel] / 3600) * n_epochs * n_seeds \
                  * n_tomos * n_modalities) / n_parallel / n_gpus
    print(f'Jobs will take ~{duration_h:.2f}h to finish.')
    end = datetime.now() + timedelta(hours=duration_h)
    print('From now that would be {:%d/%m/%Y %H:%M:%S}'.format(end))
    
# Do not forget to specify how many GPUs you have available
duration_training(len(modalities), len(n_tomos), num_seeds, num_epochs, jobs_per_gpu, n_gpus=8)
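
As a worked example of the estimate, the arithmetic for the settings used in this notebook (4 modalities, 9 tomograms, 6 seeds, 70 epochs, 2 jobs per GPU, 8 GPUs) is spelled out below. The 125 s per tomogram and epoch is the timing from our hardware quoted above, so treat the result as a rough guide only.

```python
seconds_per_tomo_epoch = 125  # timing used above for 2 jobs per GPU
n_modalities, n_tomos, n_seeds, n_epochs = 4, 9, 6, 70
n_parallel, n_gpus = 2, 8

# Total work if everything ran sequentially on one GPU, one job at a time.
total_seconds = seconds_per_tomo_epoch * n_epochs * n_seeds * n_tomos * n_modalities
# Spread the work over parallel jobs per GPU and over all GPUs.
duration_h = total_seconds / 3600 / n_parallel / n_gpus
print(f'Estimated duration: {duration_h:.1f}h')
```

With these numbers the total is 1,890,000 s (525 h) of sequential work, which comes down to roughly 32.8 hours on 8 GPUs with 2 jobs each.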

In [None]:
# Submit jobs to the SLURM queue
for job_path in jobs:
    !sbatch $job_path

## Segmentation

Wait until the training jobs have finished!

In [None]:
# The SBATCH config will depend on your SLURM config, therefore will probably not run
# without your manual adjustment. E.g. --qos names are fully custom.
def get_full_DF_analysis_command(DF_analysis_command, config, with_header=True):
    command_header = '' if not with_header else (
        "#!/bin/bash\n"
        f"#SBATCH --job-name={config['expname']}\n"
        "#SBATCH --qos=gpus4\n"
        "#SBATCH --gres=gpu:1\n"
        "#SBATCH --cpus-per-task=10\n"
        "#SBATCH --nodes=1\n"
        "#SBATCH --ntasks=1\n"
        "#SBATCH --output=reproduce/segmentation/%x.out\n\n")
    
    command=(
        f"EXPNAME={config['expname']} PYTHONHASHSEED=0 "
        f"{DF_analysis_command} "
        f"{' '.join([f'{k} {v}' for k, v in config.items() if k.startswith('--')])} "
    )
    return command_header + command

In [None]:
# Running the segmentation for all modalities with baseline as a test tomogram

DF_segmentation_command ='python3 faket/deepfinder/launch_segmentation.py'
folder = 'reproduce/segmentation'
os.makedirs(folder, exist_ok=True)

jobs_per_gpu = 3
# You can fit 3 of these jobs on the same A100 40GB GPU. 
# Segmenting 1 epoch checkpoint should take approx. 160s.

jobs = []

seed_ids = range(1, 7)
num_epochs = range(1, 71, 1)
modalities = {
    'styled': {
        'test_tomograms': ['shrec'],
        'test_tomogram_indices': [9]
    }, 
    'noisy': {
        'test_tomograms': ['shrec'],
        'test_tomogram_indices': [9]
    }, 
    'baseline': {
        'test_tomograms': ['shrec'],
        'test_tomogram_indices': [9]
    }, 
    'gauss': {
        'test_tomograms': ['shrec'],
        'test_tomogram_indices': [9]
    },
}

i = 0
segmentation_command_queue = []
segmentation_job_paths = []
segmentation_kwargs = []
for modality, tomograms in modalities.items():
    for test_tomo_modality in tomograms['test_tomograms']:
        for test_tomo_index in tomograms['test_tomogram_indices']:
            for N, epoch in product(seed_ids, num_epochs):
                analysis_conf = {
                    "expname": f'exp_{modality}_seed{N}_epoch{epoch:03d}_2021_model_{test_tomo_index}_{test_tomo_modality}',
                    "--test_tomo_path" : data_folder,
                    "--test_tomo_idx" : test_tomo_index, 
                    "--test_tomogram" : test_tomo_modality,
                    "--num_epochs" : epoch,
                    "--DF_weights_path" : pj('data', 'results', f'exp_{modality}', f'seed{N}'),
                    "--out_path" : pj('data', 'results', f'exp_{modality}', f'seed{N}'), 
                    # "--overwrite"
                }
                
                segmentation_kwargs.append(analysis_conf)
                
                with_header = False if i % jobs_per_gpu else True
                command = get_full_DF_analysis_command(DF_segmentation_command, analysis_conf, with_header)
                segmentation_command_queue.append(command)

                # Add the slurm header only once per file
                if with_header:
                    job_path = pj(folder, f"{analysis_conf['expname']}.sbatch")
                    mode = 'w'
                else:
                    command = f'&\n\n' + command
                    mode = 'a'

                # Tell slurm to wait for all jobs to finish 
                if jobs_per_gpu > 1 and ((i+1) % jobs_per_gpu == 0):
                    command += '&\n\nwait < <(jobs -p)'  
                    
                segmentation_job_paths.append(job_path)

                with open(job_path, mode) as fl:
                    fl.writelines(command) 
                if job_path not in jobs:
                    jobs.append(job_path)
                
                i+=1 # job counter      

In [None]:
# Submit jobs to the SLURM queue
for job_path in jobs:
    !sbatch $job_path

In [None]:
# Double checking if all was computed
missing = set()
for kwargs, jobpath in zip(segmentation_kwargs, segmentation_job_paths):
    out_path = kwargs['--out_path']
    tidx = kwargs['--test_tomo_idx']
    tt = kwargs['--test_tomogram']
    epochs = kwargs['--num_epochs']
    p = f'{out_path}/epoch{epochs:03d}_2021_model_{tidx}_{tt}_bin2_labelmap.mrc'
    if not os.path.exists(p):
        missing.add(jobpath)

missing

In [None]:
# Submit jobs that were not computed to the SLURM queue
for job_path in missing:
    !sbatch $job_path

## Clustering & Evaluation

In [None]:
# Before running the evaluation, make sure that `particle_locations.txt` file is present in the test tomogram folder.
for model in ['model_8', 'model_9']:
    src = pj(data_folder, model, 'particle_locations.txt')
    dst = pj(data_folder, model, 'faket', 'particle_locations.txt')
    shutil.copy(src, dst)

In [None]:
# The SBATCH config will depend on your SLURM config, therefore will probably not run
# without your manual adjustment. E.g. --qos names are fully custom.

# Template of the sbatch file that runs commands from each line of the <folder>/command.queue file
sbatch_template = '''#!/bin/bash
#SBATCH --job-name={jobname}
#SBATCH --qos=cpus150
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --array=1-{njobs}  # Must start from 1 until end (including) so 1-3 runs indices 0, 1, 2
#SBATCH --open-mode=append
#SBATCH --output={folder}/%x.out

sed -n ${slurm_task_id_var}p {folder}/{jobname}.queue | bash
'''

# Clustering jobs
DF_clustering_command ='python3 -m faket.deepfinder.launch_clustering'
folder_clustering = 'reproduce/clustering'
command_queue_clustering = []
os.makedirs(folder_clustering, exist_ok=True)

# Evaluation jobs
DF_evaluation_command ='python3 faket/deepfinder/launch_evaluation.py'
folder_evaluation = 'reproduce/evaluation'
command_queue_evaluation = []
os.makedirs(folder_evaluation, exist_ok=True)

# Storing kwargs if we want to run the jobs with multiprocessing instead of SLURM
kwargs_clustering = []

# Here we take the same seed_ids, num_epochs, modalities from the segmentation task.
for modality, tomograms in modalities.items():
    for test_tomo_modality in tomograms['test_tomograms']:
        for test_tomo_index in tomograms['test_tomogram_indices']:
            for N, epoch in product(seed_ids, num_epochs):
                analysis_conf = {
                    "expname": f'exp_{modality}_seed{N}_epoch{epoch:03d}_2021_model_{test_tomo_index}_{test_tomo_modality}',
                    # kwargs must stay in this order, otherwise multiprocessing starmap will cause problems
                    "--test_tomogram" : test_tomo_modality,
                    "--test_tomo_idx" : test_tomo_index, 
                    "--num_epochs" : epoch,
                    "--label_map_path" : pj('data', 'results', f'exp_{modality}', f'seed{N}'),
                    "--out_path" : pj('data', 'results', f'exp_{modality}', f'seed{N}'), 
                    "--n_jobs": 1,  # Always use 1 since we will later use multiprocessing
                    # "--overwrite": True, # Specify true only if running with multiprocessing, otherwise ''
                    # "--only_apply_thresholding": True, # Specify true only if running with multiprocessing, otherwise ''
                    # clustering also accepts n_jobs, overwrite, only_apply_thresholding args
                }
                
                command_clustering = get_full_DF_analysis_command(DF_clustering_command, analysis_conf, with_header=False)
                command_queue_clustering.append(command_clustering)
                
                eval_conf = {k: v for k, v in analysis_conf.items() if k in ["expname", "--test_tomogram", "--test_tomo_idx", "--num_epochs", "--label_map_path", "--out_path"]}
                command_evaluation = get_full_DF_analysis_command(DF_evaluation_command, eval_conf, with_header=False)
                command_queue_evaluation.append(command_evaluation)
                
                del analysis_conf['expname']
                kwargs_clustering.append(analysis_conf)
        
# Save the command queue to a file and create the associated sbatch file 
jobname = 'clustering_baseline'
clustering_filename = pj(folder_clustering, jobname)
with open(f'{clustering_filename}.queue', 'w') as fl:
    fl.writelines('\n'.join(command_queue_clustering))
with open(f'{clustering_filename}.sbatch', 'w') as fl:
    fl.writelines(sbatch_template.format(jobname=jobname, njobs=len(command_queue_clustering), 
                                         folder=folder_clustering, slurm_task_id_var='{SLURM_ARRAY_TASK_ID}'))

# Save the command queue to a file and create the associated sbatch file 
jobname = 'evaluation_baseline'
evaluation_filename = pj(folder_evaluation, jobname)
with open(f'{evaluation_filename}.queue', 'w') as fl:
    fl.writelines('\n'.join(command_queue_evaluation))
with open(f'{evaluation_filename}.sbatch', 'w') as fl:
    fl.writelines(sbatch_template.format(jobname=jobname, njobs=len(command_queue_evaluation), 
                                         folder=folder_evaluation, slurm_task_id_var='{SLURM_ARRAY_TASK_ID}'))
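
The sbatch template relies on two details that are easy to miss: `str.format` turns `${slurm_task_id_var}p` into `${SLURM_ARRAY_TASK_ID}p`, and `sed -n Np` prints the N-th line of the queue file counting from 1. The sketch below emulates both in Python; `select_queue_line` is a hypothetical helper and the commands are placeholders.

```python
def select_queue_line(queue_text, task_id):
    """Return the line that `sed -n <task_id>p` would print (task_id is 1-based)."""
    lines = queue_text.split('\n')
    return lines[task_id - 1]

# Placeholder queue with three commands, joined the same way as above.
queue_text = '\n'.join(['cmd_a', 'cmd_b', 'cmd_c'])

# The '$' and the trailing 'p' are literal; only the braces are substituted.
sed_line = 'sed -n ${slurm_task_id_var}p'.format(
    slurm_task_id_var='{SLURM_ARRAY_TASK_ID}')

print(sed_line)
print(select_queue_line(queue_text, 1), select_queue_line(queue_text, 3))
```

Array task 1 therefore runs the first command in the queue, and task `njobs` runs the last one.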

In [None]:
# Submit clustering jobs to SLURM
!sbatch $clustering_filename\.sbatch

In [None]:
# Submit evaluation jobs to SLURM
!sbatch $evaluation_filename\.sbatch

**All the results should now be stored in the `data/results` folder.** \
**Use the `figures.ipynb` to further visualize the results.**