# AMULETY CLI Tutorial

## Introduction

This tutorial demonstrates how to use the AMULETY command line interface (CLI) to translate and embed both BCR (B-cell receptor) and TCR (T-cell receptor) sequences. AMULETY supports a wide range of embedding models for different immune receptor types.

Before getting started, please install AMULETY using `pip install amulety`. You can check available commands from AMULETY by running the help command:

In [2]:
# If AMULETY is installed via pip
! amulety --help


 █████  ███    ███ ██    ██ ██      ███████ ████████     ██    ██
██   ██ ████  ████ ██    ██ ██      ██         ██         ██  ██
███████ ██ ████ ██ ██    ██ ██      █████      ██          ████
██   ██ ██  ██  ██ ██    ██ ██      ██         ██           ██
██   ██ ██      ██  ██████  ███████ ███████    ██           ██

AMULETY: Adaptive imMUne receptor Language model Embedding Tool
 version 0.1.1

 Usage: amulety [OPTIONS] COMMAND [ARGS]...

╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --install-completion          Install completion for the current shell.      │
│ --

### Available Commands

AMULETY provides three main commands:

1. **`translate-igblast`** - Translates nucleotide sequences to amino acid sequences using IgBlast
2. **`embed`** - Embeds sequences using various models (BCR, TCR, and protein language models)
3. **`check-deps`** - Checks whether optional embedding dependencies are installed

### Supported Models

AMULETY supports multiple categories of embedding models:

**BCR Models:**
- `ablang` - AbLang model for antibody sequences
- `antiberta2` - AntiBERTa2 RoFormer model
- `antiberty` - AntiBERTy model
- `balm-paired` - BALM-paired model for heavy-light chain pairs

**TCR Models:**
- `tcr-bert` - TCR-BERT model for T-cell receptors
- `tcremp` - TCREMP model for repertoire-level tasks
- `tcrt5` - TCRT5 model (beta chains only)

**Immune Models (BCR & TCR):**
- `immune2vec` - Immune2Vec model for both BCR and TCR

**Protein Language Models:**
- `esm2` - ESM2 protein language model
- `prott5` - ProtT5 protein language model
- `custom` - Custom/fine-tuned models from HuggingFace

### Chain Types

AMULETY supports different chain input formats:
- **H** - Heavy chain (BCR) or Beta/Delta chain (TCR)
- **L** - Light chain (BCR) or Alpha/Gamma chain (TCR)
- **HL** - Heavy-Light paired chains (BCR) or Beta-Alpha/Delta-Gamma paired chains (TCR)
- **LH** - Light-Heavy paired chains (BCR) or Alpha-Beta/Gamma-Delta paired chains (TCR)
- **H+L** - Both chains separately (not paired)
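As a rough illustration of how these chain codes relate to the AIRR `locus` column, the sketch below maps each single-chain code to the loci it covers. The `CHAIN_TO_LOCI` table and the `rows_for_chain` helper are hypothetical names written for this tutorial, not part of the AMULETY API; the locus names follow the AIRR convention (`IGH`, `IGK`, `IGL`, `TRA`, `TRB`, `TRG`, `TRD`).

```python
# Hypothetical helper (not part of AMULETY): relate chain codes to AIRR loci.
CHAIN_TO_LOCI = {
    "H": {"IGH", "TRB", "TRD"},         # heavy (BCR) or beta/delta (TCR)
    "L": {"IGK", "IGL", "TRA", "TRG"},  # light (BCR) or alpha/gamma (TCR)
}


def rows_for_chain(loci, chain):
    """Return the indices of rows whose locus belongs to chain code 'H' or 'L'."""
    wanted = CHAIN_TO_LOCI[chain]
    return [i for i, locus in enumerate(loci) if locus in wanted]


loci = ["IGH", "IGL", "IGH", "IGK", "IGH"]
heavy_rows = rows_for_chain(loci, "H")
light_rows = rows_for_chain(loci, "L")
print(heavy_rows)  # [0, 2, 4]
print(light_rows)  # [1, 3]
```

The paired codes (`HL`, `LH`) and `H+L` combine these two groups per cell; only the single-chain lookup is shown here.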

## Translating nucleotides to amino acid sequences

The inputs to the embedding models are [AIRR format files](https://docs.airr-community.org/en/stable/datarep/overview.html#datarepresentations) with immune receptor amino acid sequences. If the AIRR file only contains nucleotide sequences, the `amulety translate-igblast` command can help with the translation.

### Requirements for translation:
- Path to the V(D)J sequence AIRR file
- Output directory path to write the translated sequences
- Reference IgBlast database to perform alignment and translation
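Before running the translation, it can help to check whether a file already carries amino acid sequences. The sketch below is a minimal check using only the standard library; `needs_translation` is a hypothetical helper written for this tutorial, and it assumes the amino acid column is named `sequence_vdj_aa` (the default used by `amulety embed`).

```python
import csv
import io


def needs_translation(airr_tsv, aa_column="sequence_vdj_aa"):
    """Return True if the AIRR table lacks a non-empty amino acid column."""
    reader = csv.DictReader(airr_tsv, delimiter="\t")
    if aa_column not in (reader.fieldnames or []):
        return True
    # The column exists; translation is still needed if every value is empty.
    return not any((row.get(aa_column) or "").strip() for row in reader)


# Toy tables standing in for real files; for a file on disk,
# pass open(path, newline="") instead of a StringIO object.
nt_only = io.StringIO("sequence_id\tsequence\nseq1\tGAGGTGCAGCTG\n")
with_aa = io.StringIO(
    "sequence_id\tsequence\tsequence_vdj_aa\nseq1\tGAGGTGCAGCTG\tEVQL\n"
)

print(needs_translation(nt_only))  # True
print(needs_translation(with_aa))  # False
```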

### Download example data and reference database

In [None]:
# Create tutorial directory and download example data
mkdir -p tutorial
wget -P tutorial https://zenodo.org/records/11373741/files/AIRR_subject1_FNA_d0_1_Y1.tsv

# Download and extract IgBlast reference database
wget -P tutorial -c https://github.com/nf-core/test-datasets/raw/airrflow/database-cache/igblast_base.zip
unzip tutorial/igblast_base.zip -d tutorial
rm tutorial/igblast_base.zip

### Run the translation command

In [10]:
! amulety translate-igblast ../tutorial/AIRR_subject1_FNA_d0_1_Y1.tsv ../tutorial/output ../tutorial/igblast_base


2025-08-20 13:40:26,220 - INFO - Converting AIRR table to FastA for IgBlast translation...
2025-08-20 13:40:26,224 - INFO - Calling IgBlast for running translation...
2025-08-20 13:40:27,860 - INFO - Saved the translations in the dataframe (sequence_aa contains the full translation and sequence_vdj_aa contains the VDJ translation).
2025-08-20 13:40:27,864 - INFO - Took 1.64 seconds
2025-08-20 13:40:27,864 - INFO - Saved the translations in ../tutorial/output/AIRR_subject1_FNA_d0_1_Y1_translated.tsv file.


## Embedding sequences

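Now we are ready to embed the sequences using various models. AMULETY uses a unified `embed` command that supports all available models. Note that the translation step above wrote its output to `../tutorial/output/`, while the examples below read `../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv`; move the translated file or adjust `--input-airr` to match your layout.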

### Basic usage

The basic syntax for the embed command is:

```bash
amulety embed --input-airr [INPUT_FILE] --chain [CHAIN] --model [MODEL] --batch-size [BATCH_SIZE] --output-file-path [OUTPUT]
```

### Required arguments:

* `--chain`: Chain(s) to embed
  - For BCR: `H` (Heavy), `L` (Light), `HL` (Heavy-Light pairs), `LH` (Light-Heavy pairs), `H+L` (Both chains separately)
  - For TCR: `H` (Beta/Delta), `L` (Alpha/Gamma), `HL` (Beta-Alpha/Delta-Gamma pairs), `LH` (Alpha-Beta/Gamma-Delta pairs), `H+L` (Both chains separately)

* `--model`: The embedding model to use (see model list above)

* `--output-file-path`: Path to save embeddings (supports `.pt`, `.csv`, `.tsv` extensions)

* `--input-airr`: Path to the input AIRR file

### Optional arguments:

* `--sequence-col`: Column containing amino acid sequences (default: `sequence_vdj_aa`)
* `--cell-id-col`: Column containing single-cell barcodes (default: `cell_id`)
* `--batch-size`: Mini-batch size for processing (default: 50)
* `--cache-dir`: Directory for caching model weights (default: `/tmp/amulety`)
* `--duplicate-col`: Column for selecting best chain when multiple exist (default: `duplicate_count`)

### Custom model arguments (for `--model custom`):

* `--model-path`: HuggingFace model name or local path
* `--embedding-dimension`: Embedding dimension
* `--max-length`: Maximum sequence length
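When scripting several runs, it can be convenient to assemble the command from Python. The sketch below is one possible way to do that, using only the flags documented above; `build_embed_command` and the two validation sets are hypothetical names for this tutorial, and the file names are placeholders.

```python
import shlex
from pathlib import Path

VALID_CHAINS = {"H", "L", "HL", "LH", "H+L"}
VALID_SUFFIXES = {".pt", ".csv", ".tsv"}


def build_embed_command(input_airr, chain, model, output_path, batch_size=50, extra=()):
    """Assemble an `amulety embed` argument list from the documented flags."""
    if chain not in VALID_CHAINS:
        raise ValueError(f"unknown chain code: {chain!r}")
    if Path(output_path).suffix not in VALID_SUFFIXES:
        raise ValueError(f"unsupported output extension: {output_path!r}")
    return [
        "amulety", "embed",
        "--input-airr", str(input_airr),
        "--chain", chain,
        "--model", model,
        "--batch-size", str(batch_size),
        "--output-file-path", str(output_path),
        *extra,  # e.g. ("--sequence-col", "sequence_aa")
    ]


cmd = build_embed_command("translated.tsv", "H", "antiberty", "embeddings.pt", batch_size=2)
print(shlex.join(cmd))
# To execute the command: subprocess.run(cmd, check=True)
```

Validating the chain code and output extension up front catches typos before any model weights are loaded.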

### Output formats:

- `.pt` files: PyTorch tensors saved with `torch.save()` (embeddings only)
- `.csv/.tsv` files: Include cell barcodes/sequence IDs as indices with embeddings

The package automatically detects and uses GPU when available. Adjust `--batch-size` to avoid GPU out-of-memory errors.
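To read the results back into Python, a loader can dispatch on the file extension. The sketch below is an assumption-laden example: `load_embeddings` is a hypothetical helper, `.pt` files are read with `torch.load` (PyTorch is imported only when needed), and the `.csv`/`.tsv` branch assumes a header row followed by one row per sequence with the ID in the first column. Check the layout of your own output before relying on it.

```python
import csv
import tempfile
from pathlib import Path


def load_embeddings(path):
    """Load embeddings written by `amulety embed`, dispatching on the extension."""
    path = Path(path)
    if path.suffix == ".pt":
        import torch  # imported lazily so the text formats work without PyTorch
        return torch.load(path)
    if path.suffix in {".csv", ".tsv"}:
        delimiter = "," if path.suffix == ".csv" else "\t"
        with path.open(newline="") as handle:
            rows = list(csv.reader(handle, delimiter=delimiter))
        # Assumed layout: header row, then one row per sequence with the ID first.
        return {row[0]: [float(v) for v in row[1:]] for row in rows[1:]}
    raise ValueError(f"unsupported extension: {path.suffix}")


# Demo on a tiny hand-written file in the assumed TSV layout.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.tsv"
    demo.write_text("cell_id\t0\t1\ncell_1\t0.5\t-1.0\n")
    embeddings = load_embeddings(demo)
print(embeddings)  # {'cell_1': [0.5, -1.0]}
```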

### BCR embedding examples

Let's demonstrate embedding BCR sequences using different models:

#### AntiBERTy (BCR-specific model)

In [18]:
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H --model antiberty --batch-size 2 --output-file-path ../tutorial/test_embedding.pt


2025-08-20 11:51:53,375 - INFO - Detected single-cell data format
2025-08-20 11:51:53,376 - INFO - Processing single-cell data...
2025-08-20 11:51:53,438 - INFO - AntiBERTy loaded. Size: 26.03 M
2025-08-20 11:51:53,438 - INFO - Batch 1/48
2025-08-20 11:51:53,475 - INFO - Batch 2/48
2025-08-20 11:51:53,502 - INFO - Batch 3/48
2025-08-20 11:51:53,527 - INFO - Batch 4/48
2025-08-20 11:51:53,553 - INFO - Batch 5/48
2025-08-20 11:51:53,579 - INFO - Batch 6/48
2025-08-20 11:51:53,606 - INFO - Batch 7/48
2025-08-20 11:51:53,633 - INFO - Batch 8/48
2025-08-20 11:51:53,658 - IN

#### AntiBERTa2 (BCR-specific model)

In [4]:
# Embed heavy chains using AntiBERTa2
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H --model antiberta2 --batch-size 2 --output-file-path ../tutorial/AIRR_subject1_FNA_d0_1_Y1_antiberta2.pt



2025-08-21 11:55:10,483 - INFO - Detected single-cell data format
2025-08-21 11:55:10,484 - INFO - Processing single-cell data...
RoFormerForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly defined. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits f

#### AbLang (BCR-specific model with separate heavy/light models)

In [None]:
# Embed both heavy and light chains separately using AbLang
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H+L --model ablang --batch-size 2 --output-file-path ../tutorial/AIRR_subject1_FNA_d0_1_Y1_ablang.pt

#### BALM-paired model (BCR paired chains)

BALM-paired is a specialized model for BCR heavy-light chain pairs. AMULETY downloads the model weights automatically on first use. The weights are also hosted on Zenodo and can be fetched manually:

In [None]:
wget -P tutorial https://zenodo.org/records/8237396/files/BALM-paired.tar.gz
tar -xzf tutorial/BALM-paired.tar.gz -C tutorial
rm tutorial/BALM-paired.tar.gz

--2024-06-06 14:54:13--  https://zenodo.org/records/8237396/files/BALM-paired.tar.gz
Resolving zenodo.org (zenodo.org)... 188.184.103.159, 188.184.98.238, 188.185.79.172, ...
Connecting to zenodo.org (zenodo.org)|188.184.103.159|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1129993036 (1.1G) [application/octet-stream]
Saving to: ‘tutorial/BALM-paired.tar.gz.1’


2024-06-06 14:54:40 (41.3 MB/s) - ‘tutorial/BALM-paired.tar.gz.1’ saved [1129993036/1129993036]



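The `balm-paired` command below relies on the automatic download and needs no extra parameters. If you instead load weights from a local path with `--model custom`, specify the custom model arguments described earlier:

* `--model-path`: the path to the downloaded model weights

* `--embedding-dimension`: the dimension of the embedding

* `--max-length`: maximum sequence length accepted by the model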

In [None]:
# Embed heavy-light chain pairs using BALM-paired
# The model will be automatically downloaded on first use
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain HL --model balm-paired --batch-size 2 --output-file-path ../tutorial/AIRR_subject1_FNA_d0_1_Y1_balm_paired.pt


AMULETY: Adaptive imMUne receptor Language model Embedding Tool
 version 1.0

2024-06-06 15:21:05,068 - INFO - Processing single-cell BCR data...
2024-06-06 15:21:05,068 - INFO - Concatenating heavy and light chain per cell...
2024-06-06 15:21:07,869 - INFO - Model size: 303.92M
Batch 1/4

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

### TCR embedding examples

AMULETY also supports TCR-specific models. Let's demonstrate with some examples:

#### TCR-BERT (TCR-specific model)

In [None]:
# Embed TCR beta-alpha chain pairs using TCR-BERT
# Note: This assumes you have TCR data in AIRR format
! amulety embed --input-airr ../tutorial/tcr_data.tsv --chain HL --model tcr-bert --batch-size 2 --output-file-path ../tutorial/tcr_embeddings_tcrbert.pt

#### TCRT5 (TCR beta chain only)

In [None]:
# Embed TCR beta chains using TCRT5 (only supports H/beta chains)
! amulety embed --input-airr ../tutorial/tcr_data.tsv --chain H --model tcrt5 --batch-size 2 --output-file-path ../tutorial/tcr_embeddings_tcrt5.pt

#### TCREMP (TCR repertoire-level model)

In [7]:
# Embed TCR sequences using TCREMP (may require --skip-clustering for stability)
! amulety embed --input-airr ../tutorial/sample_tcr_data.tsv --chain H --model tcremp --skip-clustering --batch-size 2 --output-file-path ../tutorial/tcr_embeddings_tcremp.pt


2025-08-21 13:40:17,936 - INFO - Detected single-cell data format
2025-08-21 13:40:17,937 - INFO - Using CDR3 column: cdr3_aa
2025-08-21 13:40:17,937 - INFO - Processing single-cell data...
2025-08-21 13:40:17,940 - INFO - Loading TCREMP model for TCR embedding...
2025-08-21 13:40:21,736 - INFO - Mapping amulety chain 'H' to TCREMP chain 'TRB'
2025-08-21 13:40:21,736 - INFO - Running TCREMP command-line tool...
DEBUG: Output file: /var/folders/g4/w7sz7ldj4yb673_hlj0jrc700000gn/T/tmp5i_n8eq8/output/tcremp_output_tcremp.parquet
DEBUG: File exists: True
DEBUG: File size: 

### ESM2 (Protein language model)

In [None]:
# Embed heavy chains only using ESM2
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H --model esm2 --batch-size 2 --output-file-path ../tutorial/AIRR_subject1_FNA_d0_1_Y1_esm2.pt

### Immune2Vec (Universal immune receptor model)

Immune2Vec can be used for both BCR and TCR sequences:

In [None]:
# Embed sequences using Immune2Vec (works for both BCR and TCR)
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H --model immune2vec --batch-size 2 --output-file-path ../tutorial/immune2vec_embeddings.pt

### Custom/Fine-tuned models

You can use custom or fine-tuned models from HuggingFace or local paths using the `custom` model type:

In [None]:
# Example: Using a fine-tuned ESM2 model from HuggingFace
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H --model custom \
  --model-path "AmelieSchreiber/esm2_t6_8M_UR50D-finetuned-localization" \
  --embedding-dimension 320 \
  --max-length 512 \
  --batch-size 2 \
  --output-file-path ../tutorial/custom_model_embeddings.pt

## Checking dependencies

Some models require additional dependencies that are not installed by default. You can check which dependencies are missing:

In [None]:
# Check which optional dependencies are missing
! amulety check-deps

## Advanced usage tips

### Chain selection for multiple chains

When your data contains multiple chains of the same type per cell, AMULETY uses the `duplicate_count` column by default to select the best chain. You can specify a custom column:

In [None]:
# Use a custom quality score column for chain selection
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain HL --model antiberta2 --duplicate-col quality_score --batch-size 2 --output-file-path ../tutorial/embeddings_custom_selection.pt
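To make the selection idea concrete, the sketch below keeps, for each cell and chain group, the row with the highest value in the chosen column. This illustrates the concept only and is not AMULETY's implementation; it assumes that "best" means the largest count, and `select_best_chains` is a hypothetical name.

```python
# Illustrative only: keep, per cell and chain group, the row with the highest count.
HEAVY_LIKE = {"IGH", "TRB", "TRD"}


def select_best_chains(rows, count_col="duplicate_count"):
    """Map (cell_id, 'H' or 'L') to the row with the largest value in count_col."""
    best = {}
    for row in rows:
        group = "H" if row["locus"] in HEAVY_LIKE else "L"
        key = (row["cell_id"], group)
        if key not in best or row[count_col] > best[key][count_col]:
            best[key] = row
    return best


rows = [
    {"sequence_id": "s1", "cell_id": "cell_1", "locus": "IGH", "duplicate_count": 23},
    {"sequence_id": "s2", "cell_id": "cell_1", "locus": "IGH", "duplicate_count": 6},
    {"sequence_id": "s3", "cell_id": "cell_1", "locus": "IGK", "duplicate_count": 118},
]
selected = select_best_chains(rows)
print(selected[("cell_1", "H")]["sequence_id"])  # s1
print(selected[("cell_1", "L")]["sequence_id"])  # s3
```

Passing a different `count_col` mirrors what `--duplicate-col` does on the command line: the ranking column changes while the grouping stays the same.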

### Memory management

For large datasets or limited GPU memory, adjust the batch size:

In [None]:
# Use smaller batch size for large models or limited memory
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain HL --model antiberta2 --batch-size 1 --output-file-path ../tutorial/embeddings_small_batch.pt
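The batch size trades memory for the number of forward passes. As a back-of-the-envelope sketch, assuming sequences are split into consecutive mini-batches with a smaller final batch, the batch count is the ceiling of the sequence count divided by the batch size. The helper name `number_of_batches` is hypothetical.

```python
import math


def number_of_batches(n_sequences, batch_size):
    """Number of mini-batches needed to cover n_sequences."""
    return math.ceil(n_sequences / batch_size)


print(number_of_batches(3, 2))   # 2
print(number_of_batches(96, 1))  # 96
```

Halving the batch size roughly doubles the number of batches, so peak memory drops at the cost of more iterations.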

### Custom sequence columns

If your AIRR file uses different column names for sequences:

In [5]:
# Use custom sequence column name
! amulety embed --input-airr ../tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv --chain H --model antiberty --sequence-col sequence_aa --batch-size 2 --output-file-path ../tutorial/embeddings_custom_col.pt


2025-08-21 11:55:52,980 - INFO - Detected single-cell data format
2025-08-21 11:55:52,980 - INFO - Processing single-cell data...
2025-08-21 11:55:53,079 - INFO - AntiBERTy loaded. Size: 26.03 M
2025-08-21 11:55:53,079 - INFO - Batch 1/48
2025-08-21 11:55:53,121 - INFO - Batch 2/48
2025-08-21 11:55:53,147 - INFO - Batch 3/48
2025-08-21 11:55:53,174 - INFO - Batch 4/48
2025-08-21 11:55:53,200 - INFO - Batch 5/48
2025-08-21 11:55:53,229 - INFO - Batch 6/48
2025-08-21 11:55:53,257 - INFO - Batch 7/48
2025-08-21 11:55:53,287 - INFO - Batch 8/48
2025-08-21 11:55:53,314 - IN

## Real-world examples and advanced usage

Let's demonstrate AMULETY with practical examples that mirror real research workflows.

### Installing IgBlast (Required for Translation)

The `translate-igblast` command requires IgBlast to be installed. If you encounter the error `'igblastn' not found`, install it:

In [None]:
# Install IgBlast using conda (recommended)
%conda install -c bioconda igblast -y

# Verify installation
!which igblastn

# Test if IgBlast works
!igblastn -help | head -5
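The same check can be done from Python, which is handy at the top of a script that calls `translate-igblast`. This is a minimal sketch using the standard library's `shutil.which`; `tool_available` is a hypothetical helper name.

```python
import shutil


def tool_available(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if tool_available("igblastn"):
    print("igblastn found at", shutil.which("igblastn"))
else:
    print("igblastn not found; install it before running translate-igblast")
```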

### Installing optional dependencies

Some advanced models require manual installation. Let's check what's missing and install them:

In [None]:
# Check which dependencies are missing
! amulety check-deps

#### Installing TCREMP (for TCR repertoire analysis)

TCREMP requires manual installation from GitHub:

In [None]:
# Install TCREMP (requires Python 3.11+)
git clone https://github.com/antigenomics/tcremp.git
cd tcremp
pip install .
cd ..

#### Installing Immune2Vec (for universal immune receptor embeddings)

Immune2Vec also requires manual installation:

In [None]:
# Install Immune2Vec
git clone https://bitbucket.org/yaarilab/immune2vec_model.git
cd immune2vec_model
# The repository will be used by AMULETY automatically
cd ..

### Example 1: BCR analysis workflow

A typical BCR analysis workflow using different embedding approaches:

In [1]:
import pandas as pd

# Create sample BCR data (based on real AIRR format)
bcr_data = {
    'sequence_id': ['BCR_001', 'BCR_002', 'BCR_003', 'BCR_004', 'BCR_005'],
    'sequence_vdj_aa': [
        'EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYWMNWVRQAPGKGLEWVANIKQDGTEKYYVDSVKGRFTISRDNAEDSVYLQMNSLRAEDTAVYYCARENLPSFFYYDSSAYLPEATFDFWGQGTMVTVSS',
        'QSVLTQPPSVSAAPGQKVTISCSGSSSNIGNNYLSWYQQLPGTPPKLLIYENNQRPSGIPDRFSGSKSGTSATLDITGLQTGDEADYYCGTWDSSLSAGVFGGGTKLTVL',
        'QLQLQESGSGLVKPSQTLSLTCAVSGGSINSGDYSWSWIRQPPGKGLEWIGSIYHSGSTSYNPSLKSRVTISVDRSKNQLSLKLSSATAADTAVYYCARSTVNIWGTFEYWGQGTLVTVSS',
        'DIQMTQSPSSLSASVGDRVTITCRASQGISNSLAWFQQKPGKAPKLLLYTASRLESGVPSRFSGSGSGTDYTLTISSLQPEDFATYYCQQYYSSVMYTFGQGTKLEIK',
        'EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYWMNWVRQAPGKGLEWVANIKQDGTEKYYVDSVKGRFTISRDNAEDSVYLQMNSLRAEDTAVYYCARENLPSFFYYDSSAYLPEATFDFWGQGTMVTVSS'
    ],
    'locus': ['IGH', 'IGL', 'IGH', 'IGK', 'IGH'],
    'cell_id': ['cell_1', 'cell_1', 'cell_2', 'cell_2', 'cell_3'],
    'duplicate_count': [23, 118, 6, 23, 15]
}

# Save to file
bcr_df = pd.DataFrame(bcr_data)
bcr_df.to_csv('../tutorial/sample_bcr_data.tsv', sep='\t', index=False)
print("Sample BCR data created with", len(bcr_df), "sequences")
print("\nData preview:")
print(bcr_df.head())
print("\nColumns:", list(bcr_df.columns))
print("\nChain distribution:")
print(bcr_df['locus'].value_counts())

Sample BCR data created with 5 sequences

Data preview:
  sequence_id                                    sequence_vdj_aa locus  \
0     BCR_001  EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYWMNWVRQAPGKGLE...   IGH   
1     BCR_002  QSVLTQPPSVSAAPGQKVTISCSGSSSNIGNNYLSWYQQLPGTPPK...   IGL   
2     BCR_003  QLQLQESGSGLVKPSQTLSLTCAVSGGSINSGDYSWSWIRQPPGKG...   IGH   
3     BCR_004  DIQMTQSPSSLSASVGDRVTITCRASQGISNSLAWFQQKPGKAPKL...   IGK   
4     BCR_005  EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYWMNWVRQAPGKGLE...   IGH   

  cell_id  duplicate_count  
0  cell_1               23  
1  cell_1              118  
2  cell_2                6  
3  cell_2               23  
4  cell_3               15  

Columns: ['sequence_id', 'sequence_vdj_aa', 'locus', 'cell_id', 'duplicate_count']

Chain distribution:
locus
IGH    3
IGL    1
IGK    1
Name: count, dtype: int64
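Since paired-chain modes such as `HL` need both a heavy and a light chain per cell, it is worth checking which cells in the sample table actually have both. The sketch below does this with the standard library; the `(cell_id, locus)` pairs are copied from the sample table, and the sketch makes no claim about how AMULETY itself treats cells with a missing chain.

```python
from collections import defaultdict

# (cell_id, locus) pairs copied from the sample BCR table above.
records = [
    ("cell_1", "IGH"), ("cell_1", "IGL"),
    ("cell_2", "IGH"), ("cell_2", "IGK"),
    ("cell_3", "IGH"),
]

# Collect the set of loci observed in each cell.
loci_per_cell = defaultdict(set)
for cell_id, locus in records:
    loci_per_cell[cell_id].add(locus)

# A cell counts as paired if it has a heavy chain and at least one light chain.
paired = sorted(
    cell for cell, loci in loci_per_cell.items()
    if "IGH" in loci and loci & {"IGK", "IGL"}
)
unpaired = sorted(cell for cell in loci_per_cell if cell not in paired)
print(paired)    # ['cell_1', 'cell_2']
print(unpaired)  # ['cell_3']
```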


In [15]:
# BCR embedding workflow - comparing different approaches

! echo "=== BCR Embedding Workflow ==="

# 1. BCR-specific model: AntiBERTy (heavy chains)
! echo "1. Embedding with AntiBERTy (BCR-specific)..."
! amulety embed --input-airr ../tutorial/sample_bcr_data.tsv --chain H --model antiberty --batch-size 2 --output-file-path ../tutorial/bcr_antiberty.pt

# 2. BCR-specific model: AbLang (both heavy and light chains)
! echo "2. Embedding with AbLang (separate H/L models)..."
! amulety embed --input-airr ../tutorial/sample_bcr_data.tsv --chain H+L --model ablang --batch-size 2 --output-file-path ../tutorial/bcr_ablang.pt

# 3. Protein language model: ESM2 (general protein model)
! echo "3. Embedding with ESM2 (protein language model)..."
! amulety embed --input-airr ../tutorial/sample_bcr_data.tsv --chain H --model esm2 --batch-size 2 --output-file-path ../tutorial/bcr_esm2.pt

# 4. Paired chain model: BALM-paired (if available)
! echo "4. Embedding with BALM-paired (paired chains)..."
! amulety embed --input-airr ../tutorial/sample_bcr_data.tsv --chain HL --model balm-paired --batch-size 2 --output-file-path ../tutorial/bcr_balm_paired.pt

! echo "BCR embedding workflow completed!"

=== BCR Embedding Workflow ===
1. Embedding with AntiBERTy (BCR-specific)...

2025-08-20 13:54:59,981 - INFO - Detected single-cell data format
2025-08-20 13:54:59,981 - INFO - Processing single-cell data...
2025-08-20 13:55:00,085 - INFO - AntiBERTy loaded. Size: 26.03 M
2025-08-20 13:55:00,085 - INFO - Batch 1/2
2025-08-20 13:55:00,124 - INFO - Batch 2/2
2025-08-20 13:55:00,149 - INFO - Took 0.06 seconds
2025-08-20 13:55:00,151 - INFO - Saved embedding at ../tutorial/bcr_antiberty.pt
2. Embedding with AbLang (separate H/L models)...


### Example 2: TCR analysis workflow

A comprehensive TCR analysis using different embedding models:

In [18]:
# Create sample TCR data (based on real test data)
tcr_data = {
    'sequence_id': ['TCR_001', 'TCR_002', 'TCR_003', 'TCR_004', 'TCR_005', 'TCR_006'],
    'sequence_vdj_aa': [
        'CASSLAPGATNEKLFF',  # TRB (beta chain)
        'CAVNTGNQFYF',       # TRA (alpha chain)
        'CASSLVGQGAYEQYF',   # TRB (beta chain)
        'CAVRDMEYGNKLVF',    # TRA (alpha chain)
        'CASSLPGQGAYEQYF',   # TRB (beta chain)
        'CAVKDSNYQLIW'       # TRA (alpha chain)
    ],
    'cdr3_aa': [
        'CASSLAPGATNEKLFF',
        'CAVNTGNQFYF',
        'CASSLVGQGAYEQYF',
        'CAVRDMEYGNKLVF',
        'CASSLPGQGAYEQYF',
        'CAVKDSNYQLIW'
    ],
    'locus': ['TRB', 'TRA', 'TRB', 'TRA', 'TRB', 'TRA'],
    'cell_id': ['cell_1', 'cell_1', 'cell_2', 'cell_2', 'cell_3', 'cell_3'],
    'duplicate_count': [10, 8, 15, 12, 20, 18],
    'v_call': ['TRBV7-9*01', 'TRAV8-4*01', 'TRBV7-2*01', 'TRAV13-1*01', 'TRBV7-2*01', 'TRAV12-1*01'],
    'j_call': ['TRBJ2-1*01', 'TRAJ49*01', 'TRBJ2-7*01', 'TRAJ56*01', 'TRBJ2-7*01', 'TRAJ33*01']
}

# Save to file
tcr_df = pd.DataFrame(tcr_data)
tcr_df.to_csv('../tutorial/sample_tcr_data.tsv', sep='\t', index=False)
print("Sample TCR data created with", len(tcr_df), "sequences")
print("\nChain distribution:")
print(tcr_df['locus'].value_counts())
print("\nCDR3 length distribution:")
print(tcr_df['cdr3_aa'].str.len().describe())

Sample TCR data created with 6 sequences

Chain distribution:
locus
TRB    3
TRA    3
Name: count, dtype: int64

CDR3 length distribution:
count     6.000000
mean     13.833333
std       1.940790
min      11.000000
25%      12.500000
50%      14.500000
75%      15.000000
max      16.000000
Name: cdr3_aa, dtype: float64


### Quick Test: Verify Command Format

Let's test the correct command format with a simple example:

In [None]:
# Test the correct command format
!python3 -m amulety embed --help | head -10