<a href="https://colab.research.google.com/github/gbouras13/baktfold/blob/dev/run_baktfold.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# baktfold

[baktfold](https://github.com/gbouras13/baktfold) is a sensitive annotation tool for annotating bacterial genomes, MAGs and plasmids.

baktfold uses the [ProstT5](https://github.com/mheinzinger/ProstT5) protein language model to translate hypothetical protein amino acid sequences to the 3Di token alphabet used by [Foldseek](https://github.com/steineggerlab/foldseek). Foldseek is then used to search these against a series of databases of protein structures (AFDB Clusters, SwissProt, CATH and PDB).

For now, baktfold takes either the `.json` output file from [bakta](https://github.com/oschwengers/bakta) or a file of amino acid sequences in `.faa` format (these can be any protein sequences you like, not necessarily the output from bakta).

If you want to annotate a genome, you will need to run bakta first to get the `.json` file - if you are compute limited, we recommend the awesome [bakta webserver](https://bakta.computational.bio/).
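
As a rough illustration of the two input routes described above, the sketch below picks the subcommand from the file extension and assembles an argument list. The helper names `baktfold_subcommand` and `build_command` are hypothetical and not part of baktfold; the flags (`-d`, `-i`, `-o`, `-p`, `-t`) are the ones used in the cells further down this notebook.

```python
from pathlib import Path


def baktfold_subcommand(input_file: str) -> str:
    """Pick the baktfold subcommand from the input file extension (hypothetical helper)."""
    suffix = Path(input_file).suffix.lower()
    if suffix == ".json":
        return "run"  # bakta JSON output for a genome
    if suffix == ".faa":
        return "proteins"  # amino acid sequences
    raise ValueError(f"Expected a .json or .faa file, got: {input_file}")


def build_command(input_file: str, out_dir: str = "output_baktfold",
                  prefix: str = "baktfold", threads: int = 4) -> list[str]:
    """Assemble the argument list; a list avoids shell quoting problems."""
    return [
        "baktfold", baktfold_subcommand(input_file),
        "-d", "baktfold_db",
        "-i", input_file,
        "-o", out_dir,
        "-p", prefix,
        "-t", str(threads),
    ]


print(" ".join(build_command("assembly.json")))
print(" ".join(build_command("assembly.hypotheticals.faa", out_dir="output_baktfold_proteins")))
```

The resulting list could be passed to `subprocess.run(..., check=True)` without `shell=True`, which sidesteps problems with spaces in file names.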

* **Before you start, please make sure the runtime is set to T4 GPU (or any other kind of GPU if you have $$$), otherwise baktfold won't be installed properly**
* To do this, go to the top toolbar, then to Runtime -> Change runtime type -> Hardware accelerator

* To run the cells, press the play button on the left side
* Cells 1, 2 and 3 install baktfold and download the databases/models.
* Once cells 1-3 have been run, you can re-run Cell 4 (to run `baktfold run` for genomes) and/or Cell 5 (to run `baktfold proteins` for proteins) to annotate different inputs as many times as you would like
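
If you are unsure whether the GPU runtime requirement above is met, a quick check along the following lines can be run in a scratch cell before Cell 1. This is a minimal sketch, not part of baktfold: it assumes the NVIDIA driver tool `nvidia-smi` is on the `PATH` whenever a GPU runtime is attached, and `gpu_available` is a hypothetical helper name.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi is found and lists at least one GPU."""
    nvidia_smi = shutil.which("nvidia-smi")
    if nvidia_smi is None:
        return False  # no NVIDIA driver tools on PATH, so most likely a CPU runtime
    # `nvidia-smi -L` prints one line per detected GPU
    result = subprocess.run([nvidia_smi, "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_available():
    print("GPU runtime detected")
else:
    print("No GPU detected: go to Runtime -> Change runtime type -> Hardware accelerator")
```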



In [None]:

#@title 1. Install miniforge

#@markdown This cell installs miniforge

%%bash

set -e

PYTHON_VERSION=$(python3 -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')")

echo "python version ${PYTHON_VERSION}"

if [ ! -f CONDA_READY ]; then
  echo "installing miniforge"

  # miniforge 25.9.1 introduces some issue - latest as of 7 Nov 2025 - see https://github.com/gbouras13/phold/issues/106
  # issue is fixed if you use the previous release (25.3.1)
  #wget -qnc https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
  wget -qnc https://github.com/conda-forge/miniforge/releases/download/25.3.1-0/Miniforge3-25.3.1-0-Linux-x86_64.sh
  bash Miniforge3-25.3.1-0-Linux-x86_64.sh -bfp /usr/local 2>&1 1>/dev/null
  conda config --set auto_update_conda false
  touch CONDA_READY
fi

pip install --upgrade matplotlib matplotlib-inline



In [None]:
#@title 2. Install baktfold

#@markdown This cell installs baktfold. It will take a few minutes. Please be patient

BAKTFOLD_VERSION="0.0.2"

# add paths
import sys
sys.path.append("/usr/local/bin")

# Update environment variables for shell usage
import os
os.environ["PATH"] = "/usr/local/bin:" + os.environ["PATH"]

# install baktfold into the conda installation set up in Cell 1
# the flag file stops the install from being repeated if this cell is re-run
from pathlib import Path
flag_file = Path("BAKTFOLD_READY")
if not flag_file.exists():
  !conda install -y -c bioconda baktfold=={BAKTFOLD_VERSION} pytorch=*=cuda*
  # Touch the flag file
  flag_file.touch()

In [None]:
#@title 3. Download baktfold databases

#@markdown This cell downloads the baktfold database. It will take some time (probably 5-15 minutes, depending on Zenodo's traffic). Please be patient. Perhaps go for a walk or have a coffee or tea.


%%time

print("Downloading baktfold database. This will take a few minutes. Please be patient :)")
!baktfold install -d baktfold_db -t 8 --foldseek-gpu




In [6]:
#@title 4. baktfold run (bakta json input)

#@markdown First, upload your bakta json file.

#@markdown Click on the folder icon to the left and use the file upload button.

#@markdown Once it is uploaded, write the file name in the INPUT_FILE field on the right.

#@markdown Then provide a directory for baktfold's output using BAKTFOLD_OUT_DIR.
#@markdown The default is 'output_baktfold'.

#@markdown You can also provide a prefix for your output files with BAKTFOLD_PREFIX.
#@markdown If you provide nothing it will default to 'baktfold'.

#@markdown You can click FORCE to overwrite the output directory.
#@markdown This may be useful if your earlier baktfold run has crashed for whatever reason.

#@markdown The baktfold results will appear in the file browser (folder icon) in the left-hand panel.
#@markdown Additionally, it will be zipped so you can download the whole directory.

#@markdown The file to download is BAKTFOLD_OUT_DIR.zip, where BAKTFOLD_OUT_DIR is what you provided

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".


%%time
import os
import sys
import subprocess
import zipfile
INPUT_FILE = '' #@param {type:"string"}

if os.path.exists(INPUT_FILE):
    print(f"Input file {INPUT_FILE} exists")
else:
    print(f"Error: File {INPUT_FILE} does not exist")
    print(f"Please check the spelling and that you have uploaded it correctly")
    sys.exit(1)

BAKTFOLD_OUT_DIR = 'output_baktfold'  #@param {type:"string"}
BAKTFOLD_PREFIX = 'baktfold'  #@param {type:"string"}
FORCE = True  #@param {type:"boolean"}


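# Construct the command
# shlex.quote protects file names containing spaces or shell metacharacters
import shlex
command = (
    f'baktfold run -d baktfold_db -i {shlex.quote(INPUT_FILE)} -t 4 '
    f'-o {shlex.quote(BAKTFOLD_OUT_DIR)} -p {shlex.quote(BAKTFOLD_PREFIX)} --foldseek-gpu'
)

if FORCE:
    command = f"{command} -f"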

# Execute the command
try:
    print("Running baktfold")
    subprocess.run(command, shell=True, check=True)
    print("baktfold completed successfully.")
    print(f"Your output is in {BAKTFOLD_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{BAKTFOLD_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(BAKTFOLD_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), BAKTFOLD_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")







Input file assembly.json exists
Running baktfold
baktfold completed successfully.
Your output is in output_baktfold.
Zipping the output directory so you can download it all in one go.
Output directory has been zipped to output_baktfold.zip
CPU times: user 90.4 ms, sys: 4.93 ms, total: 95.3 ms
Wall time: 1min 42s


In [7]:
#@title 5. baktfold proteins (protein input)

#@markdown First, upload your file containing protein amino acid sequences (`.faa` format).

#@markdown Click on the folder icon to the left and use the file upload button.

#@markdown Once it is uploaded, write the file name in the INPUT_FILE field on the right.

#@markdown Then provide a directory for baktfold's output using BAKTFOLD_PROTEINS_OUT_DIR.
#@markdown The default is 'output_baktfold_proteins'.

#@markdown You can also provide a prefix for your output files with BAKTFOLD_PREFIX.
#@markdown If you provide nothing it will default to 'baktfold'.


#@markdown You can click FORCE to overwrite the output directory.
#@markdown This may be useful if your earlier baktfold run has crashed for whatever reason.

#@markdown The baktfold results will appear in the file browser (folder icon) in the left-hand panel.
#@markdown Additionally, it will be zipped so you can download the whole directory.

#@markdown The file to download is BAKTFOLD_PROTEINS_OUT_DIR.zip, where BAKTFOLD_PROTEINS_OUT_DIR is what you provided

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".



%%time
import os
import sys
import subprocess
import zipfile
INPUT_FILE = '' #@param {type:"string"}

if os.path.exists(INPUT_FILE):
    print(f"Input file {INPUT_FILE} exists")
else:
    print(f"Error: File {INPUT_FILE} does not exist")
    print(f"Please check the spelling and that you have uploaded it correctly")
    sys.exit(1)

BAKTFOLD_PROTEINS_OUT_DIR = 'output_baktfold_proteins'  #@param {type:"string"}
BAKTFOLD_PREFIX = 'baktfold'  #@param {type:"string"}
FORCE = True  #@param {type:"boolean"}


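# Construct the command
# shlex.quote protects file names containing spaces or shell metacharacters
import shlex
command = (
    f'baktfold proteins -d baktfold_db -i {shlex.quote(INPUT_FILE)} -t 4 '
    f'-o {shlex.quote(BAKTFOLD_PROTEINS_OUT_DIR)} -p {shlex.quote(BAKTFOLD_PREFIX)} --foldseek-gpu'
)

if FORCE:
    command = f"{command} -f"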

# Execute the command
try:
    print("Running baktfold")
    subprocess.run(command, shell=True, check=True)
    print("baktfold completed successfully.")
    print(f"Your output is in {BAKTFOLD_PROTEINS_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{BAKTFOLD_PROTEINS_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(BAKTFOLD_PROTEINS_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), BAKTFOLD_PROTEINS_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")







Input file assembly.hypotheticals.faa exists
Running baktfold
baktfold completed successfully.
Your output is in output_baktfold_proteins.
Zipping the output directory so you can download it all in one go.
Output directory has been zipped to output_baktfold_proteins.zip
CPU times: user 8.73 ms, sys: 8.25 ms, total: 17 ms
Wall time: 1min 52s
