# Beak Pipeline Tutorial

In [57]:
# Imports
from pathlib import Path
from beak.remote import Pipeline

# line magic to auto-reload modules
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


First, create a `Pipeline` object with the remote host, user name, and SSH key path

In [84]:
pipe = Pipeline(
    host="shr-zion.stanford.edu",
    user="mbolivas",
    key_path="~/.ssh/shr-zion",
)

Create a test FASTA file containing the query sequence

In [85]:
test_fasta = Path("hAcyP2_expansive_query.fasta")
test_fasta.write_text(""">hAcyP2
MSTAQSLKSVDYEVFGRVQGVCFRMYTEDEARKIGVVGWVKNTSKGTVTGQVQGPEDKVNSMKSWLSKVGSPSSRIDRTNFSNEKTISKLEYSNFSIRY
""")

108
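
The `108` printed above is the return value of `Path.write_text`, which is the number of characters written: an 8-character header line (`>hAcyP2` plus a newline) and a 99-residue sequence plus a newline. A minimal sketch of a sanity check that could be run before uploading, using only the standard library. `read_fasta` is a hypothetical helper written for this tutorial, not part of beak.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header].append(line)
    # Join multi-line sequences into a single string per record
    return {name: "".join(parts) for name, parts in records.items()}


# Same text that the tutorial writes to hAcyP2_expansive_query.fasta
fasta_text = """>hAcyP2
MSTAQSLKSVDYEVFGRVQGVCFRMYTEDEARKIGVVGWVKNTSKGTVTGQVQGPEDKVNSMKSWLSKVGSPSSRIDRTNFSNEKTISKLEYSNFSIRY
"""

records = read_fasta(fasta_text)
for name, seq in records.items():
    print(name, len(seq))
```

To check the file on disk instead, pass `test_fasta.read_text()` to `read_fasta`.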

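Next, define the pipeline steps. This will:
1. Search the query sequence against the Swiss-Prot database (the reviewed section of UniProtKB)
2. Filter the hits using the motif `G.VQGV`
3. Perform taxonomy annotation for the remaining hits
4. Align the sequences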

In [86]:
(
    pipe.search("hAcyP2_expansive_query.fasta", database="swissprot", threads=20)
    .filter(motif="G.VQGV")
    .taxonomy(database="swissprot")
    .align(threads=20)
)

print(pipe)

Pipeline:
  Input: hAcyP2_expansive_query.fasta
  Steps (4):
    1. search (database=swissprot, threads=20)
    2. filter
    3. taxonomy (database=swissprot)
    4. align (threads=20)
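
The `filter` step is given `motif="G.VQGV"`. Assuming beak treats the motif as a regular expression (an assumption; the tutorial does not say how the pattern is interpreted), the `.` would match any single residue. A minimal sketch with Python's `re` module showing where that pattern occurs in the query sequence. `find_motif` is a hypothetical helper, not a beak function.

```python
import re

# Query sequence from hAcyP2_expansive_query.fasta
QUERY = (
    "MSTAQSLKSVDYEVFGRVQGVCFRMYTEDEARKIGVVGWVKNTSKGTVTGQVQGPEDKVN"
    "SMKSWLSKVGSPSSRIDRTNFSNEKTISKLEYSNFSIRY"
)


def find_motif(sequence, motif):
    """Return (start, end, matched_text) for each non-overlapping motif hit.

    Positions are 0-based and the end index is exclusive, as in re.Match.
    """
    pattern = re.compile(motif)
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sequence)]


hits = find_motif(QUERY, "G.VQGV")
print(hits)
```

Under this regex reading, a sequence would pass the filter if `find_motif` returns at least one hit.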


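Execute the pipeline. This creates a job directory on the remote host, uploads the input file, and runs a generated shell script there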

In [87]:
# Execute
job_id = pipe.execute(job_name="acyp2_search")

Created remote directory: /home/mbolivas/beak_jobs/005084e6
Uploading input file...
#!/bin/bash
set -e

# Pipeline execution script
echo "Pipeline started: $(date)" > /home/mbolivas/beak_jobs/005084e6/status.txt
echo 'RUNNING' >> /home/mbolivas/beak_jobs/005084e6/status.txt

# Initialize context
declare -A CONTEXT

# Step 1: search
mkdir -p /home/mbolivas/beak_jobs/005084e6/01_search
mmseqs easy-search \
  /home/mbolivas/beak_jobs/005084e6/input.fasta \
  /srv/protein_sequence_databases/swissprot \
  /home/mbolivas/beak_jobs/005084e6/01_search/results.m8 \
  /home/mbolivas/beak_jobs/005084e6/01_search/tmp \
  --threads 20

# Extract hit sequences
cut -f2 /home/mbolivas/beak_jobs/005084e6/01_search/results.m8 | sort -u > /home/mbolivas/beak_jobs/005084e6/01_search/acc_list.txt
CONTEXT[hit_count]=$(wc -l < /home/mbolivas/beak_jobs/005084e6/01_search/acc_list.txt)
echo "Found ${CONTEXT[hit_count]} hits"
if [ -f /srv/protein_sequence_databases/swissprot.lookup ]; then
  grep -Ff /home/mbol
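
The `cut -f2 ... | sort -u` line in the generated script collects the unique target accessions from the second column of the tab-separated `results.m8` file. The same step can be sketched in Python. `unique_targets` is a hypothetical helper, and the sample rows are invented for illustration; they are not real search output.

```python
def unique_targets(m8_text):
    """Return the sorted, de-duplicated values of column 2 (the target ID)."""
    targets = set()
    for line in m8_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows
        targets.add(fields[1])
    return sorted(targets)


# Invented rows: query ID, target ID, then two placeholder numeric columns
sample_m8 = (
    "hAcyP2\ttarget_B\t0.95\t99\n"
    "hAcyP2\ttarget_A\t0.80\t97\n"
    "hAcyP2\ttarget_B\t0.60\t40\n"
)

acc_list = unique_targets(sample_m8)
print(f"Found {len(acc_list)} hits")
```

As in the shell version, a target that appears in several rows is counted once.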

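Inspect the raw state of the job on the remote host (process list, status file, job directory listing, and log output)

In [88]: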
pipe.debug_status(job_id)

    PID TTY          TIME CMD
 210316 ?        00:00:00 pipeline.sh
Pipeline started: Wed Nov 12 01:52:44 AM UTC 2025
RUNNING
total 36
drwxrwxr-x  3 mbolivas mbolivas 4096 Nov 12 01:52 .
drwxrwxr-x 26 mbolivas mbolivas 4096 Nov 12 01:52 ..
drwxrwxr-x  3 mbolivas mbolivas 4096 Nov 12 01:52 01_search
-rw-r--r--  1 mbolivas mbolivas  108 Nov 12 01:52 input.fasta
-rw-rw-r--  1 mbolivas mbolivas 6611 Nov 12 01:52 nohup.out
-rw-rw-r--  1 mbolivas mbolivas    7 Nov 12 01:52 pid.txt
-rwx--x--x  1 mbolivas mbolivas 3568 Nov 12 01:52 pipeline.sh
-rw-rw-r--  1 mbolivas mbolivas   58 Nov 12 01:52 status.txt
Use all table starts                   	false
Offset of numeric ids                  	0
Create lookup                          	0
Add orf stop                           	false
Overlap between sequences              	0
Sequence split mode                    	1
Header split mode                      	0
Chain overlapping alignments           	0
Merge query                            	1
Search type

Check the status of the job

In [89]:
pipe.print_detailed_status(job_id, watch=1)


Pipeline: acyp2_search (005084e6)
Status: RUNNING | Runtime: 0:00:15
  ✓ Step 1: search (database=swissprot, threads=20) [COMPLETED]
    └─ hits: 164
  ⠏ Step 2: filter [RUNNING]
  ○ Step 3: taxonomy (database=swissprot) [PENDING]
  ○ Step 4: align (threads=20) [PENDING]

(Press Ctrl+C to stop watching)


Stopped watching.
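
The watch mode above is interactive and is stopped with Ctrl+C. For unattended scripts, a generic polling loop is one alternative. This is a sketch only: `get_status` is a hypothetical callable that returns a status string, and `FAILED` is an assumed terminal state name (only `RUNNING`, `PENDING`, and `COMPLETED` appear in the output above). None of this is beak's documented API.

```python
import time


def wait_for_job(get_status, interval=1.0, timeout=3600.0,
                 terminal_states=("COMPLETED", "FAILED")):
    """Poll get_status() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} s")
        time.sleep(interval)


# Stand-in status source for demonstration: two RUNNING polls, then COMPLETED
fake_statuses = iter(["RUNNING", "RUNNING", "COMPLETED"])
final = wait_for_job(lambda: next(fake_statuses), interval=0.01, timeout=5)
print(final)
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the system clock is adjusted while the job runs.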


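Download the results of a specific step. The output below comes from a different job (2db006f3) in which `align` was step 3; in the four-step pipeline defined above, `align` is step 4.

In [66]:
# Get results from a specific step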
results = pipe.get_step_results(job_id, step_number=3)

✓ Downloaded step 3 results to 2db006f3_step3_align_alignment.fasta
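
Judging by its name, the downloaded file is an alignment in FASTA format. A quick check after download is that every row has the same length, gaps included. A minimal sketch using only the standard library. `alignment_width` is a hypothetical helper, and the three-row alignment is invented so the block runs on its own.

```python
def alignment_width(fasta_text):
    """Return the shared column count of an aligned FASTA, or raise if rows differ."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)  # sequences may wrap over several lines
    widths = set(lengths.values())
    if len(widths) != 1:
        raise ValueError(f"rows differ in length: {lengths}")
    return widths.pop()


# Invented toy alignment; '-' marks a gap
toy_alignment = ">a\nMK-LV\n>b\nMKALV\n>c\nM--LV\n"
width = alignment_width(toy_alignment)
print(width)
```

To check the real file, read it with `Path("2db006f3_step3_align_alignment.fasta").read_text()` and pass the text to `alignment_width`.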


In [67]:
pipe.cleanup(job_id)

✓ Cleaned up job 2db006f3
