# Phase 1ii: Grabbing Sequences & Running IQtree

```
Parameters
-------------
save_dir: str
    Path to the directory in which outputs are saved.
cache_dir: str
    Path to the directory in which cached objects are stored.
sequence_db: str
    Path to fasta file containing sequences.
partition: str
    Name of partition to use when calling `sbatch`.
```

In [None]:
save_dir=runs_of_pipeline/2025-02-05
cache_dir=cache
sequence_db=None
partition=None

In [None]:
# For some reason this still has to be called so that Slurm's sbatch uses the beast_pype environment.
source activate beast_pype

## Grabbing All Sequences

In [None]:
cached_fasta_with_root=$cache_dir/sequences_with_root.fasta_file
gabbing_sequences_out_file=$cache_dir/grabbing_sequences.out
cached_ids_with_root=$cache_dir/all_IDs.txt
job_ids_file=$cache_dir/slurm_job_ids.txt

In [None]:
job_name=grabbing_all_sequences
job_id=$(sbatch -p "$partition" -J "$job_name" -o "$gabbing_sequences_out_file" -c 8 --mem=4G --parsable --time=1-00:00:00 \
    --wrap="seqkit grep '$sequence_db' -w 0 -f '$cached_ids_with_root' > '$cached_fasta_with_root'; echo 'slurm_job_complete'")
echo "${job_id}" >> "$job_ids_file"

The `search_out_file_for_complete_phrase` function below is a hold function: it pauses the notebook until a given phrase appears as a complete line in a Slurm .out file.
In the cell below, this function holds the notebook until all sequences have been grabbed.

In [None]:
search_out_file_for_complete_phrase () {
    # $1: path to the Slurm .out file; $2: phrase that marks the job as complete.
    job_complete=False
    while [ "$job_complete" != "True" ]
        do
        if [ -f "$1" ]; then # If file exists. If a Slurm job is pending, the .out file may not have been created yet.
            if grep -Fxq -- "$2" "$1"; then # Checks for a line in the .out file that exactly matches the phrase.
                job_complete=True # Triggers the while loop's exit condition.
            else
                sleep 10
            fi
        else
            sleep 10
        fi
        done
    }

search_out_file_for_complete_phrase $gabbing_sequences_out_file 'slurm_job_complete'
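
The hold function above polls indefinitely if the phrase never appears, for example when a job is cancelled before it writes to its .out file. A minimal sketch of a variant with a timeout is given here; the name `wait_for_phrase` and its timeout and polling arguments are assumptions made for illustration and are not part of the pipeline.

```shell
# Hypothetical variant of the hold function with a timeout (all names are illustrative).
# Usage: wait_for_phrase FILE PHRASE [TIMEOUT_SECONDS] [POLL_SECONDS]
wait_for_phrase () {
    local out_file="$1"
    local phrase="$2"
    local timeout="${3:-3600}"
    local interval="${4:-10}"
    local waited=0
    # Poll until the file exists and one of its lines matches the phrase exactly.
    until [ -f "$out_file" ] && grep -Fxq -- "$phrase" "$out_file"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out after ${waited}s waiting for '$phrase' in $out_file" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Local demonstration that needs no Slurm: a background process writes the phrase after 1 second.
demo_out=$(mktemp)
( sleep 1; echo 'slurm_job_complete' >> "$demo_out" ) &
wait_for_phrase "$demo_out" 'slurm_job_complete' 30 1 && echo "demo job finished"
rm -f "$demo_out"
```

The function returns a non-zero status on timeout, so a caller can decide whether to stop the notebook or to inspect the job with `squeue`.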

## Sorting Sequences Between Strains

In [None]:
for i in "$save_dir"/*
    do
    if [ -d "$i" ] && [ "$i" != "$cache_dir" ]; then
        variant=${i##*/} # Directory name, used to label this strain's jobs.
        job_name="${variant}_strain_sequences"
        job_id=$(sbatch -p "$partition" -J "$job_name" -o "$i/sequences.out" -c 1 --mem=1G --parsable --time=4:00:00 \
            --wrap="seqkit grep '$cached_fasta_with_root' -w 0 -f '$i/strain_IDs.txt' > '$i/sequences.fasta'; echo 'slurm_job_complete'")
        echo "${job_id}" >> "$job_ids_file"
        job_name="${variant}_strain_with_root_sequences"
        job_id=$(sbatch -p "$partition" -J "$job_name" -o "$i/sequences_with_root.out" -c 1 --mem=1G --parsable --time=4:00:00 \
            --wrap="seqkit grep '$cached_fasta_with_root' -w 0 -f '$i/strain_with_root_IDs.txt' > '$i/sequences_with_root.fasta'; echo 'slurm_job_complete'")
        echo "${job_id}" >> "$job_ids_file"
    fi
    done

The code below uses `search_out_file_for_complete_phrase` to wait until the sbatch jobs that create `sequences.fasta` and `sequences_with_root.fasta` for each strain have finished.

In [None]:
for i in $save_dir/*
    do 
    if [ -d "$i" ] && [ "$i" != "$cache_dir" ]; then
        sequences_out=$i/sequences.out
        search_out_file_for_complete_phrase "$sequences_out" 'slurm_job_complete'
        sequences_with_root_out=$i/sequences_with_root.out
        search_out_file_for_complete_phrase "$sequences_with_root_out" 'slurm_job_complete'
    fi
    done
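
Because the wrapped commands join `seqkit` and `echo` with `;`, the `slurm_job_complete` line is written even when `seqkit` exits with an error, so the waits above cannot tell success from failure. One possible approach is sketched here. It assumes a second sentinel, `slurm_job_failed`, and a helper named `with_sentinel`; both are hypothetical and not defined by the pipeline.

```shell
# Hypothetical helper: wrap a command string so that it prints a sentinel reflecting its exit status.
# The result is a string of the kind that could be passed to sbatch --wrap.
with_sentinel () {
    # Braces group the command so that the sentinels depend on its overall exit status.
    printf "{ %s; } && echo 'slurm_job_complete' || echo 'slurm_job_failed'" "$1"
}

# Local demonstration with stand-in commands instead of seqkit (no Slurm needed).
sh -c "$(with_sentinel 'true')"
sh -c "$(with_sentinel 'false')"
```

If this pattern were adopted, the hold function would also need to look for the failure sentinel, otherwise it would keep waiting for a completion line that never arrives.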

To check the progress of the Slurm jobs, use the terminal command `squeue --me`.
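
`squeue --me` lists every job that belongs to the current user. To narrow the view to the jobs submitted by this notebook, the IDs recorded in `$job_ids_file` could be passed to `squeue -j`, which accepts a comma-separated list of job IDs. A minimal sketch follows; `job_ids_as_list` is a hypothetical helper name, and the sketch assumes the file holds one plain job ID per line.

```shell
# Hypothetical helper: turn a file with one Slurm job ID per line into a comma-separated list.
job_ids_as_list () {
    # paste -s joins all lines into one; -d, uses a comma as the separator.
    paste -s -d, "$1"
}

# Assumes $job_ids_file is set as earlier in this notebook and that squeue is on the PATH.
# The query is skipped when squeue is unavailable or the file is missing or empty.
if command -v squeue >/dev/null 2>&1 && [ -s "${job_ids_file:-}" ]; then
    squeue -j "$(job_ids_as_list "$job_ids_file")"
fi
```

Jobs that have already left the queue are not listed, so an empty table usually means that all recorded jobs have finished.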