# AMULETY CLI Tutorial

## Introduction

This tutorial demonstrates how to use the AMULETY command line interface (CLI) to translate and embed both BCR (B-cell receptor) and TCR (T-cell receptor) sequences. AMULETY supports a wide range of embedding models for different immune receptor types.

Before getting started, please install AMULETY using `pip install amulety`. You can check available commands from AMULETY by running the help command:

In [2]:
# If AMULETY is installed via pip
! amulety --help


 █████  ███    ███ ██    ██ ██      ███████ ████████     ██    ██
██   ██ ████  ████ ██    ██ ██      ██         ██         ██  ██
███████ ██ ████ ██ ██    ██ ██      █████      ██          ████
██   ██ ██  ██  ██ ██    ██ ██      ██         ██           ██
██   ██ ██      ██  ██████  ███████ ███████    ██           ██

AMULETY: Adaptive imMUne receptor Language model Embedding Tool
 version 0.1.1

 Usage: amulety [OPTIONS] COMMAND [ARGS]...

╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --install-completion          Install completion for the current shell.      │
│ --

### Available Commands

AMULETY provides three main commands:

1. **`translate-igblast`** - Translates nucleotide sequences to amino acid sequences using IgBlast
2. **`embed`** - Embeds sequences using various models (BCR, TCR, and protein language models)
3. **`check-deps`** - Checks whether optional embedding dependencies are installed

### Supported Models

AMULETY supports multiple categories of embedding models:

**BCR Models:**
- `ablang` - AbLang model for antibody sequences
- `antiberta2` - AntiBERTa2 RoFormer model
- `antiberty` - AntiBERTy model
- `balm-paired` - BALM-paired model for heavy-light chain pairs

**TCR Models:**
- `tcr-bert` - TCR-BERT model for T-cell receptors
- `tcrt5` - TCRT5 model (beta chains only)

**Immune Models (BCR & TCR):**
- `immune2vec` - Immune2Vec model for both BCR and TCR

**Protein Language Models:**
- `esm2` - ESM2 protein language model
- `prott5` - ProtT5 protein language model
- `custom` - Custom/fine-tuned models from HuggingFace

### Chain Types

AMULETY supports different chain input formats:
- **H** - Heavy chain (BCR) or Beta/Delta chain (TCR)
- **L** - Light chain (BCR) or Alpha/Gamma chain (TCR)
- **HL** - Heavy-Light paired chains (BCR) or Beta-Alpha/Delta-Gamma paired chains (TCR)
- **LH** - Light-Heavy paired chains (BCR) or Alpha-Beta/Gamma-Delta paired chains (TCR)
- **H+L** - Both chains separately (not paired)

## Translating nucleotides to amino acid sequences

The inputs to the embedding models are [AIRR format files](https://docs.airr-community.org/en/stable/datarep/overview.html#datarepresentations) with immune receptor amino acid sequences. If the AIRR file only contains nucleotide sequences, the `amulety translate-igblast` command can perform the translation. The command takes three arguments:

- Path to the V(D)J sequence AIRR file
- Output directory path to write the translated sequences
- Reference IgBlast database to perform alignment and translation
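
When the translation step is scripted from Python rather than typed into a notebook cell, the three arguments listed above can be assembled into a command as in the sketch below. `build_translate_command` and `run_translate` are hypothetical helper names, not part of AMULETY, and the positional order (AIRR file, output directory, IgBlast database) is assumed to match the invocation used in this tutorial.

```python
import os
import subprocess


def build_translate_command(airr_file, output_dir, igblast_db):
    """Return the argv list for `amulety translate-igblast`.

    Hypothetical helper: the positional order (AIRR file, output
    directory, IgBlast database) is an assumption taken from the
    tutorial's own invocation.
    """
    return [
        "amulety",
        "translate-igblast",
        os.fspath(airr_file),
        os.fspath(output_dir),
        os.fspath(igblast_db),
    ]


def run_translate(airr_file, output_dir, igblast_db):
    """Run the translation; raises CalledProcessError on a non-zero exit."""
    cmd = build_translate_command(airr_file, output_dir, igblast_db)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.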

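### Download BCR example data and reference database

The following commands download an example AIRR format file of BCR sequences and the reference IgBlast database into a `tutorial` directory. Later commands in this tutorial refer to that directory as `../tutorial`, so adjust the paths to match your working directory.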

In [None]:
# Create tutorial directory and download example data
! mkdir -p tutorial
! wget -P tutorial https://zenodo.org/records/11373741/files/AIRR_subject1_FNA_d0_1_Y1.tsv

# Download and extract IgBlast reference database
! wget -P tutorial -c https://github.com/nf-core/test-datasets/raw/airrflow/database-cache/igblast_base.zip
! unzip tutorial/igblast_base.zip -d tutorial
! rm tutorial/igblast_base.zip

### Run the translation command

In [10]:
! amulety translate-igblast ../tutorial/AIRR_subject1_FNA_d0_1_Y1.tsv ../tutorial/output ../tutorial/igblast_base


2025-08-20 13:40:26,220 - INFO - Converting AIRR table to FastA for IgBlast translation...
2025-08-20 13:40:26,224 - INFO - Calling IgBlast for running translation...
2025-08-20 13:40:27,860 - INFO - Saved the translations in the dataframe (sequence_aa contains the full translation and sequence_vdj_aa contains the VDJ translation).
2025-08-20 13:40:27,864 - INFO - Took 1.64 seconds
2025-08-20 13:40:27,864 - INFO - Saved the translations in ../tutorial/output/AIRR_subject1_FNA_d0_1_Y1_translated.tsv file.
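
Before embedding, it can be useful to confirm that the translated file actually contains amino acid sequences. The sketch below counts the rows with a non-empty `sequence_vdj_aa` column, the column named in the log message above. `count_translated` is a hypothetical helper, and the file is assumed to be a tab-separated AIRR table with a header row.

```python
import csv


def count_translated(tsv_path, column="sequence_vdj_aa"):
    """Return (rows_with_translation, total_rows) for an AIRR TSV file.

    Hypothetical helper: assumes a tab-separated file with a header row.
    """
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {tsv_path}")
        translated = 0
        total = 0
        for row in reader:
            total += 1
            # Short rows yield None for missing fields, so guard with `or ""`.
            if (row[column] or "").strip():
                translated += 1
    return translated, total
```

A large gap between the two counts suggests that some sequences could not be translated and would be missing from the embeddings.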


## Embedding sequences

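Now we are ready to embed the sequences using various models. AMULETY uses a unified `embed` command that supports all available models. The examples below read the translated file from `../tutorial/`; if your translation was written to `../tutorial/output/` as in the previous step, adjust `--input-airr` accordingly.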

### Basic usage

The basic syntax for the embed command is:

```bash
amulety embed --input-airr [INPUT_FILE] --chain [CHAIN] --model [MODEL] --batch-size [BATCH_SIZE] --output-file-path [OUTPUT]
```

### Required arguments:

* `--chain`: Chain(s) to embed
  - For BCR: `H` (Heavy), `L` (Light), `HL` (Heavy-Light pairs), `LH` (Light-Heavy pairs), `H+L` (Both chains separately)
  - For TCR: `H` (Beta/Delta), `L` (Alpha/Gamma), `HL` (Beta-Alpha/Delta-Gamma pairs), `LH` (Alpha-Beta/Gamma-Delta pairs), `H+L` (Both chains separately)

* `--model`: The embedding model to use (see model list above)

* `--output-file-path`: Path to save embeddings (supports `.pt`, `.csv`, `.tsv` extensions)

* `--input-airr`: Path to the input AIRR file

### Optional arguments:

* `--sequence-col`: Column containing amino acid sequences (default: `sequence_vdj_aa`)
* `--cell-id-col`: Column containing single-cell barcodes (default: `cell_id`)
* `--batch-size`: Mini-batch size for processing (default: 50)
* `--cache-dir`: Directory for caching model weights (default: `/tmp/amulety`)
* `--duplicate-col`: Column for selecting best chain when multiple exist (default: `duplicate_count`)
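
To run several models over the same file, the arguments above can be assembled programmatically. The following is a minimal sketch under stated assumptions: `build_embed_command` is a hypothetical helper, not part of AMULETY, and it only uses flags, chain codes and output extensions that appear in this tutorial. Extra keyword options are converted to flags by replacing underscores with hyphens.

```python
import os

# Chain codes and output extensions as listed in this tutorial.
VALID_CHAINS = {"H", "L", "HL", "LH", "H+L"}
VALID_OUTPUT_EXTENSIONS = {".pt", ".csv", ".tsv"}


def build_embed_command(input_airr, chain, model, output_file_path,
                        batch_size=50, **options):
    """Return the argv list for `amulety embed` (hypothetical helper).

    Extra keyword options become flags, e.g. cache_dir="/tmp/amulety"
    is emitted as `--cache-dir /tmp/amulety`.
    """
    if chain not in VALID_CHAINS:
        raise ValueError(f"unsupported chain {chain!r}")
    extension = os.path.splitext(output_file_path)[1].lower()
    if extension not in VALID_OUTPUT_EXTENSIONS:
        raise ValueError(f"unsupported output extension {extension!r}")
    cmd = [
        "amulety", "embed",
        "--input-airr", os.fspath(input_airr),
        "--chain", chain,
        "--model", model,
        "--batch-size", str(batch_size),
        "--output-file-path", os.fspath(output_file_path),
    ]
    for name, value in options.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Validating the chain code and extension up front catches typos before any model weights are loaded.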

### Custom model arguments (for `--model custom`):

* `--model-path`: HuggingFace model name or local path
* `--embedding-dimension`: Embedding dimension
* `--max-length`: Maximum sequence length

### Output formats:

- `.pt` files: PyTorch tensors saved with `torch.save()` (embeddings only)
- `.csv/.tsv` files: Include cell barcodes/sequence IDs as indices with embeddings
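
The tabular outputs can be read back without extra dependencies. The sketch below is illustrative only: `load_embedding_table` is a hypothetical helper, and the file layout (one header row, identifier in the first column, embedding values in the remaining columns) is an assumption that should be checked against an actual output file.

```python
import csv


def load_embedding_table(path):
    """Read a .csv/.tsv embedding file into (ids, vectors).

    Assumed layout (not confirmed by the tutorial): one header row,
    the first column holds the cell barcode or sequence ID, and the
    remaining columns hold the embedding values.
    """
    delimiter = "\t" if str(path).endswith(".tsv") else ","
    ids = []
    vectors = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        next(reader, None)  # skip the header row
        for row in reader:
            if not row:
                continue  # ignore blank lines
            ids.append(row[0])
            vectors.append([float(value) for value in row[1:]])
    return ids, vectors
```

For `.pt` output, `torch.load(path)` is the counterpart of `torch.save()` and should return the saved embeddings.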

The package automatically detects and uses GPU when available. Adjust `--batch-size` to avoid GPU out-of-memory errors.

### BCR embedding examples

Let's demonstrate embedding BCR sequences using different models:

#### AntiBERTy (BCR-specific model)

In [18]:
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H --model antiberty --batch-size 2 --output-file-path ../tutorial/test_embedding.pt


2025-08-20 11:51:53,375 - INFO - Detected single-cell data format
2025-08-20 11:51:53,376 - INFO - Processing single-cell data...
2025-08-20 11:51:53,438 - INFO - AntiBERTy loaded. Size: 26.03 M
2025-08-20 11:51:53,438 - INFO - Batch 1/48
2025-08-20 11:51:53,475 - INFO - Batch 2/48
2025-08-20 11:51:53,502 - INFO - Batch 3/48
2025-08-20 11:51:53,527 - INFO - Batch 4/48
2025-08-20 11:51:53,553 - INFO - Batch 5/48
2025-08-20 11:51:53,579 - INFO - Batch 6/48
2025-08-20 11:51:53,606 - INFO - Batch 7/48
2025-08-20 11:51:53,633 - INFO - Batch 8/48
2025-08-20 11:51:53,658 - IN

#### AntiBERTa2 (BCR-specific model)

In [4]:
# Embed heavy chains using AntiBERTa2
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H --model antiberta2 --batch-size 2 --output-file-path ../tutorial/AIRR_subject1_FNA_d0_1_Y1_antiberta2.pt



2025-08-21 11:55:10,483 - INFO - Detected single-cell data format
2025-08-21 11:55:10,484 - INFO - Processing single-cell data...
RoFormerForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly defined. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits f

#### AbLang (BCR-specific model with separate heavy/light models)

In [None]:
# Embed both heavy and light chains separately using AbLang
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H+L --model ablang --batch-size 2 --output-file-path ../tutorial/AIRR_subject1_FNA_d0_1_Y1_ablang.pt

#### BALM-paired (BCR paired chains)

BALM-paired is a specialized model for BCR heavy-light chain pairs. AMULETY downloads the model weights automatically on first use. The weights can also be fetched manually from Zenodo:

In [None]:
! wget -P tutorial https://zenodo.org/records/8237396/files/BALM-paired.tar.gz
! tar -xzf tutorial/BALM-paired.tar.gz -C tutorial
! rm tutorial/BALM-paired.tar.gz

--2024-06-06 14:54:13--  https://zenodo.org/records/8237396/files/BALM-paired.tar.gz
Resolving zenodo.org (zenodo.org)... 188.184.103.159, 188.184.98.238, 188.185.79.172, ...
Connecting to zenodo.org (zenodo.org)|188.184.103.159|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1129993036 (1.1G) [application/octet-stream]
Saving to: ‘tutorial/BALM-paired.tar.gz.1’


2024-06-06 14:54:40 (41.3 MB/s) - ‘tutorial/BALM-paired.tar.gz.1’ saved [1129993036/1129993036]



If you load manually downloaded weights instead, the following parameters also need to be specified in addition to those described above:

* `--model-path`: the path to the downloaded model weights

* `--embedding-dimension`: the dimension of the embedding

* `--max-length`: the maximum sequence length accepted by the model

In [None]:
# Embed heavy-light chain pairs using BALM-paired
# The model will be automatically downloaded on first use
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain HL --model balm-paired --batch-size 2 --output-file-path ../tutorial/AIRR_subject1_FNA_d0_1_Y1_balm_paired.pt


AMULETY: Adaptive imMUne receptor Language model Embedding Tool
 version 1.0

2024-06-06 15:21:05,068 - INFO - Processing single-cell BCR data...
2024-06-06 15:21:05,068 - INFO - Concatenating heavy and light chain per cell...
2024-06-06 15:21:07,869 - INFO - Model size: 303.92M
Batch 1/4

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

### Protein Language Models

Next, we embed the same dataset using general protein language models.

#### ESM2 (Protein language model)

In [1]:
# Embed heavy chains only using ESM2
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H --model esm2 --batch-size 2 --output-file-path ../tutorial/AIRR_subject1_FNA_d0_1_Y1_esm2.pt


2025-09-08 22:52:48,775 - INFO - Detected single-cell data format
2025-09-08 22:52:48,777 - INFO - Processing both BCR and TCR sequences from the file.
2025-09-08 22:52:48,777 - INFO - Single-cell AIRR data detected (all entries have cell_id).
2025-09-08 22:52:48,778 - INFO - Removed 102 sequences not matching H chain
tokenizer_config.json: 100%|██████████████████| 95.0/95.0 [00:00<00:00, 281kB/s]
vocab.txt: 100%|█████████████████████████████| 93.0/93.0 [00:00<00:00, 1.31MB/s]
special_tokens_map.json: 100%|██████████████████| 125/125 [00:00<00:00, 592kB/s]
config.json:

### Immune2Vec

Immune2Vec requires manual installation. Clone the repository, then pass its location to `amulety embed` via `--immune2vec-path`:

In [None]:
# Install Immune2Vec by cloning the repository
! git clone https://bitbucket.org/yaarilab/immune2vec_model.git

# Note where the repository was cloned (/path/to/immune2vec_model here)
# and pass that location via --immune2vec-path
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H --model immune2vec --immune2vec-path /path/to/immune2vec_model --batch-size 2 --output-file-path ../tutorial/AIRR_subject1_FNA_d0_1_Y1_immune2vec.pt

### Custom/Fine-tuned models

You can use custom or fine-tuned models from HuggingFace or local paths using the `custom` model type:

In [2]:
# Example: Using a fine-tuned ESM2 model from HuggingFace
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H --model custom \
  --model-path "AmelieSchreiber/esm2_t6_8M_UR50D-finetuned-localization" \
  --embedding-dimension 320 \
  --max-length 512 \
  --batch-size 2 \
  --output-file-path ../tutorial/custom_model_embeddings.pt


2025-09-08 23:12:04,178 - INFO - Detected single-cell data format
2025-09-08 23:12:04,179 - INFO - Processing both BCR and TCR sequences from the file.
2025-09-08 23:12:04,179 - INFO - Single-cell AIRR data detected (all entries have cell_id).
2025-09-08 23:12:04,180 - INFO - Removed 102 sequences not matching H chain
tokenizer_config.json: 100%|████████████████████| 108/108 [00:00<00:00, 247kB/s]
vocab.txt: 100%|█████████████████████████████| 93.0/93.0 [00:00<00:00, 1.22MB/s]
special_tokens_map.json: 100%|██████████████████| 125/125 [00:00<00:00, 599kB/s]
config.json:

### TCR embedding examples

AMULETY also supports TCR-specific models. Example TCR data will be made available for download (the link below is still a placeholder):

In [None]:
# Create tutorial directory and download TCR example data
# TBD...
wget -P tutorial https://zenodo.org/records/11373741/TBD...

#### TCR-BERT (TCR-specific model)

In [3]:
# Embed TCR beta-alpha chain pairs using TCR-BERT
# Note: This assumes you have TCR data in AIRR format
! amulety embed --input-airr ../tutorial/AIRR_tcr_sample.tsv --chain HL --model tcr-bert --batch-size 2 --output-file-path ../tutorial/tcr_embeddings_tcrbert.pt


2025-09-06 15:41:06,325 - INFO - Detected single-cell data format
2025-09-06 15:41:06,326 - INFO - Single-cell AIRR data detected (all entries have cell_id).
2025-09-06 15:41:06,330 - INFO - Loading TCR-BERT model for TCR embedding...
2025-09-06 15:41:06,937 - INFO - Successfully loaded TCR-BERT model
2025-09-06 15:41:06,937 - INFO - TCR-BERT model loaded. Size: 57.39 M
2025-09-06 15:41:06,937 - INFO - TCR-BERT Batch 1/25.
2025-09-06 15:41:06,989 - INFO - TCR-BERT Batch 2/25.
2025-09-06 15:41:07,022 - INFO - TCR-BERT Batch 3/25.
2025-09-06 15:41:07,052 - INFO - TCR-BER

#### TCRT5 (TCR beta chain only)

In [2]:
# Embed TCR beta chains using TCRT5 (only supports H/beta chains)
! amulety embed --input-airr ../tutorial/AIRR_tcr_sample.tsv --chain H --model tcrt5 --batch-size 2 --output-file-path ../tutorial/tcr_embeddings_tcrt5.pt


2025-09-06 15:40:29,129 - INFO - Detected single-cell data format
2025-09-06 15:40:29,131 - INFO - Single-cell AIRR data detected (all entries have cell_id).
2025-09-06 15:40:29,131 - INFO - Removed 100 sequences not matching H chain
2025-09-06 15:40:29,133 - INFO - Loading TCRT5 model for TCR embedding...
tokenizer_config.json: 21.1kB [00:00, 6.70MB/s]
spiece.model: 100%|██████████████████████████| 238k/238k [00:00<00:00, 2.87MB/s]
added_tokens.json: 2.35kB [00:00, 10.1MB/s]
special_tokens_map.json: 2.64kB [00:00, 9.78MB/s]
The tokenizer class you load from this check

## Checking dependencies

Some models require additional dependencies that are not installed by default. You can check which dependencies are missing:

In [12]:
# Check which optional dependencies are missing
! amulety check-deps


Checking AMULETY dependencies...

IgBlast (for translate-igblast command):
  IgBlast (igblastn) is available

Embedding model dependencies:
2025-09-08 23:27:12,074 - INFO - Available models: AntiBERTy, AbLang, TCREMP, TCR-BERT, TCRT5, ESM2, ProtT5
  1 dependencies are missing.
  AMULETY will raise ImportError with installation instructions when these models are used.

  To install missing dependencies:
    • Immune2Vec: git clone https://bitbucket.org/yaarilab/immune2vec_model.git && add to Python path

  Note: Models will provide detailed installation instructions whe
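
As a complement to `amulety check-deps`, a script can verify up front that the required executables are on the `PATH`. This is a minimal sketch: `find_missing_tools` is a hypothetical helper, and it only checks for the `amulety` and `igblastn` executables named in this tutorial, not for Python packages or model weights.

```python
import shutil


def find_missing_tools(tools=("amulety", "igblastn")):
    """Return the names in `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All required executables were found.")
```

An empty result only means the executables were located; `amulety check-deps` remains the way to check the embedding model dependencies.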