# AMULETY CLI Tutorial

## Introduction

This tutorial demonstrates how to use the AMULETY command line interface (CLI) to translate and embed BCR sequences. Before getting started, install AMULETY with `pip install amulety`. You can list the available commands by running `amulety --help`.

In [1]:
amulety --help


 █████  ███    ███ ██    ██ ██      ███████ ████████     ██    ██
██   ██ ████  ████ ██    ██ ██      ██         ██         ██  ██
███████ ██ ████ ██ ██    ██ ██      █████      ██          ████
██   ██ ██  ██  ██ ██    ██ ██      ██         ██           ██
██   ██ ██      ██  ██████  ███████ ███████    ██           ██

AMULETY: Adaptive imMUne receptor Language model Embedding Tool
 version 1.0

 Usage: amulety [OPTIONS] COMMAND [ARGS]...

╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --install-completion          Install completion for the current shell.      │
│ --show

### Translating nucleotides to amino acid sequences

The embedding models take as input an [AIRR format file](https://docs.airr-community.org/en/stable/datarep/overview.html#datarepresentations) containing antibody amino acid sequences. If the AIRR file contains only nucleotide sequences, the `amulety translate-igblast` command can translate them. The command requires:
- Path to the V(D)J sequence AIRR file
- Output directory path to write the translated sequences
- Reference IgBlast database to perform alignment and translation

The following commands download an example AIRR format file and the reference IgBlast database.

In [None]:
mkdir tutorial
wget -P tutorial https://zenodo.org/records/11373741/files/AIRR_subject1_FNA_d0_1_Y1.tsv
wget -P tutorial -c https://github.com/nf-core/test-datasets/raw/airrflow/database-cache/igblast_base.zip
unzip tutorial/igblast_base.zip -d tutorial
rm tutorial/igblast_base.zip

Now we are ready to run the translation command as follows. 

In [7]:
amulety translate-igblast tutorial/AIRR_subject1_FNA_d0_1_Y1.tsv tutorial tutorial/igblast_base


2024-06-06 13:50:25,146 - INFO - Converting AIRR table to FastA for IgBlast translation...
2024-06-06 13:50:25,156 - INFO - Calling IgBlast for running translation...
2024-06-06 13:50:33,793 - INFO - Saved the translations in the dataframe (sequence_aa contains the full translation and sequence_vdj_aa contains the VDJ translation).
2024-06-06 13:50:33,795 - INFO - Saved the translations in tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv file.
2024-06-06 13:50:33,815 - INFO - Took 8.67 seconds


The command outputs an AIRR file with three new columns appended to the original data table:
- `sequence_aa`: the full translated sequence
- `sequence_vdj_aa`: the translated V(D)J portion of the sequence (excluding the constant region)
- `sequence_alignment_aa`: the translated V(D)J portion with amino acid deletions marked as gaps (`-`)
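
Before moving on to embedding, it can be worth confirming that the translated file really contains these columns. The helper below is a minimal sketch, not part of AMULETY: `has_columns` is a hypothetical name, and it assumes the file is tab-separated with the column names on the first line.

```shell
# has_columns FILE COLUMN...
# Succeeds only if every COLUMN appears in the tab-separated header (first line) of FILE.
has_columns() {
    file=$1
    shift
    # Split the header into one column name per line (dropping any Windows line ending).
    header=$(head -n 1 "$file" | tr -d '\r' | tr '\t' '\n')
    for col in "$@"; do
        # -x: match the whole line, -F: fixed string, -q: quiet
        if ! printf '%s\n' "$header" | grep -qxF -- "$col"; then
            echo "missing column: $col" >&2
            return 1
        fi
    done
}

# Usage with the file written by the translation step above:
# has_columns tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv \
#     sequence_aa sequence_vdj_aa sequence_alignment_aa && echo "all columns present"
```

Matching whole lines with `grep -x` keeps a name such as `sequence_aa` from being satisfied by a longer or shorter column name.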

## Embeddings

Now we are ready to embed the sequences. AMULETY currently supports three published pre-trained models (AntiBERTy, AntiBERTa2, ESM2-650M) as well as weights from custom pre-trained models in the Hugging Face framework.


### Published pre-trained models

The input arguments for the published models include:

* `input_file_path`: Path to the input AIRR file containing the translated columns

* `chain`: Chain(s) to embed: heavy only (H), light only (L), heavy and light concatenated per cell barcode (HL)

* `output_file_path`: Path to the output embedding matrix for the selected chain(s). Supported file extensions are `csv`, `tsv`, and `pt`. The `csv` and `tsv` files contain the cell barcode and/or sequence ID as indices. The `pt` file, which is saved with `torch.save`, contains no index, but for the H and L options the rows keep the order of the original data.

Optional arguments include:

* `sequence-col`: the column containing the amino acid sequence (default is `sequence_vdj_aa`)

* `cell-id-col`: the column containing the single-cell barcode (default is `cell_id`)

* `batch-size`: the mini-batch size for embedding the sequences

The package auto-detects a GPU and uses it when one is available. If you hit a GPU out-of-memory error, reduce the `batch-size` parameter.
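
One way to find a batch size that fits in memory is to retry with progressively smaller values. The wrapper below is a rough sketch under two assumptions: that the option is spelled `--batch-size` for the published-model commands (the spelling used with `custommodel` later in this tutorial), and that a non-zero exit status signals a failed run. `run_with_smaller_batches` is a hypothetical helper, and it retries after any failure, not only out-of-memory errors.

```shell
# run_with_smaller_batches START CMD...
# Runs CMD with "--batch-size N" appended, halving N after each failure until N drops below 1.
run_with_smaller_batches() {
    size=$1
    shift
    while [ "$size" -ge 1 ]; do
        if "$@" --batch-size "$size"; then
            echo "succeeded with batch size $size"
            return 0
        fi
        # Integer division: 50 -> 25 -> 12 -> 6 -> 3 -> 1 -> 0
        size=$((size / 2))
    done
    return 1
}

# Usage (the starting value 50 is arbitrary):
# run_with_smaller_batches 50 amulety esm2 \
#     tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv HL \
#     tutorial/AIRR_subject1_FNA_d0_1_Y1_esm2.tsv
```

Each retry reloads the model, so for large inputs it is cheaper to note the working value once and pass it directly afterwards.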

The output file containing the embeddings will be written as specified by `output_file_path`. 
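
Since the three positional arguments follow the same order for every published model, the commands for all model and chain combinations can be generated in a loop. The sketch below only prints the commands so they can be reviewed first. `print_embedding_commands` and the `<input>_<model>_<chain>.tsv` output naming are illustrative choices, not AMULETY conventions.

```shell
# print_embedding_commands INPUT OUTDIR
# Prints (does not run) one amulety command per published model and chain option.
print_embedding_commands() {
    input=$1
    outdir=$2
    # Strip the directory and the .tsv extension to build the output file names.
    base=$(basename "$input" .tsv)
    for model in antiberty antiberta2 esm2; do
        for chain in H L HL; do
            echo "amulety $model $input $chain $outdir/${base}_${model}_${chain}.tsv"
        done
    done
}

print_embedding_commands tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv tutorial
```

Piping the output to `sh` would execute the nine commands one after another.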

#### AntiBERTy

In [9]:
amulety antiberty tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv HL tutorial/AIRR_subject1_FNA_d0_1_Y1_antiBERTy.tsv


2024-06-06 14:24:48,605 - INFO - Processing single-cell BCR data...
2024-06-06 14:24:48,605 - INFO - Concatenating heavy and light chain per cell...
2024-06-06 14:24:48,626 - INFO - Embedding 95 sequences using antiberty...
2024-06-06 14:24:49,541 - INFO - AntiBERTy loaded. Size: 26.03 M
2024-06-06 14:24:49,541 - INFO - Batch 1/1
2024-06-06 14:25:11,820 - INFO - Took 22.28 seconds
2024-06-06 14:25:11,883 - INFO - Saved embedding at tutorial/AIRR_subject1_FNA_d0_1_Y1_antiBERTy.tsv
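
A quick sanity check on the saved matrix is to count its lines and fields. The function below is a sketch that assumes only that the file is tab-separated; `table_shape` is a hypothetical helper. The log above reports 95 embedded sequences, so the line count should be close to that, depending on whether the file has a header row, which this tutorial does not specify.

```shell
# table_shape FILE
# Prints "<lines> <fields>" for a tab-separated file; fails if any line has a different field count.
table_shape() {
    awk -F '\t' '
        NR == 1 { n = NF }
        NF != n {
            printf "line %d has %d fields, expected %d\n", NR, NF, n > "/dev/stderr"
            bad = 1
        }
        END { print NR, n; exit bad + 0 }
    ' "$1"
}

# Usage with the embedding written above:
# table_shape tutorial/AIRR_subject1_FNA_d0_1_Y1_antiBERTy.tsv
```

The field count includes any index columns (cell barcode and/or sequence ID) in addition to the embedding dimensions.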


#### AntiBERTa2

In [10]:
amulety antiberta2 tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv HL tutorial/AIRR_subject1_FNA_d0_1_Y1_antiBERTa2.tsv


2024-06-06 14:25:23,884 - INFO - Processing single-cell BCR data...
2024-06-06 14:25:23,884 - INFO - Concatenating heavy and light chain per cell...
model.safetensors: 100%|██████████████████████| 811M/811M [00:02<00:00, 314MB/s]
2024-06-06 14:25:31,000 - INFO - AntiBERTa2 loaded. Size: 202.642462 M
2024-06-06 14:25:31,000 - INFO - Batch 1/1.
2024-06-06 14:28:01,505 - INFO - Took 150.5 seconds
2024-06-06 14:28:01,602 - INFO - Saved embedding at tutorial/AIRR_subject1_FNA_d0_1_Y1_antiBERTa2.tsv


#### ESM2-650M

In [11]:
amulety esm2 tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv HL tutorial/AIRR_subject1_FNA_d0_1_Y1_esm2.tsv


2024-06-06 14:28:32,409 - INFO - Processing single-cell BCR data...
2024-06-06 14:28:32,409 - INFO - Concatenating heavy and light chain per cell...
2024-06-06 14:28:44,899 - INFO - ESM2 650M model size: 652.36 M
2024-06-06 14:28:44,903 - INFO - Batch 1/2.


### Pre-trained weights of customized model

We will download the [pre-trained weights](https://zenodo.org/records/8237396/files/BALM-paired.tar.gz) of the [BALM-paired model](https://www.sciencedirect.com/science/article/pii/S2666389924000758?via%3Dihub).

In [15]:
wget -P tutorial https://zenodo.org/records/8237396/files/BALM-paired.tar.gz
tar -xzf tutorial/BALM-paired.tar.gz -C tutorial
rm tutorial/BALM-paired.tar.gz

--2024-06-06 14:54:13--  https://zenodo.org/records/8237396/files/BALM-paired.tar.gz
Resolving zenodo.org (zenodo.org)... 188.184.103.159, 188.184.98.238, 188.185.79.172, ...
Connecting to zenodo.org (zenodo.org)|188.184.103.159|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1129993036 (1.1G) [application/octet-stream]
Saving to: ‘tutorial/BALM-paired.tar.gz.1’


2024-06-06 14:54:40 (41.3 MB/s) - ‘tutorial/BALM-paired.tar.gz.1’ saved [1129993036/1129993036]
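
Note that the log shows `wget` saving to `BALM-paired.tar.gz.1`. This is what `wget` does when a file with the same name already exists in the target directory, so re-running the cell downloads the 1.1G archive a second time. A small guard avoids that. This is a sketch, and `fetch_once` is a hypothetical helper name.

```shell
# fetch_once URL DIR
# Downloads URL into DIR unless a file with the same name is already there.
fetch_once() {
    url=$1
    dir=$2
    target="$dir/$(basename "$url")"
    if [ -e "$target" ]; then
        echo "skip: $target already exists"
        return 0
    fi
    wget -P "$dir" "$url"
}

# Usage:
# fetch_once https://zenodo.org/records/8237396/files/BALM-paired.tar.gz tutorial
```

The check only looks at the file name, so an interrupted partial download would also be skipped. Resuming with `wget -c`, as used for the IgBlast database earlier, handles that case.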



In addition to the parameters mentioned above, we need to specify the following parameters:

* `modelpath`: the path to the downloaded model weights

* `embedding-dimension`: the dimension of the embedding

* `max-length`: the maximum sequence length accepted by the model
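
For Hugging Face style checkpoints, these two values can often be read from the model's `config.json`. The sketch below assumes that the extracted model directory contains such a file and that it uses the `hidden_size` and `max_position_embeddings` keys, which is common for BERT and RoBERTa style models but is not confirmed by this tutorial for BALM-paired. `config_value` is a hypothetical helper.

```shell
# config_value CONFIG_JSON KEY
# Prints the first integer value stored under KEY in a JSON file.
config_value() {
    sed -n 's/.*"'"$2"'"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Usage (assumes config.json exists in the extracted model directory):
# config_value tutorial/BALM-paired_LC-coherence_90-5-5-split_122222/config.json hidden_size
# config_value tutorial/BALM-paired_LC-coherence_90-5-5-split_122222/config.json max_position_embeddings
```

The position limit reported by a config typically counts special tokens as well, so the value passed as `max-length` may need to be slightly smaller than the configured maximum.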

In [5]:
amulety custommodel tutorial/BALM-paired_LC-coherence_90-5-5-split_122222 \
 tutorial/AIRR_subject1_FNA_d0_1_Y1_translated.tsv HL \
 tutorial/AIRR_subject1_FNA_d0_1_Y1_BALM-paired.tsv \
 --embedding-dimension 1024 \
 --batch-size 25 \
 --max-length 510


2024-06-06 15:21:05,068 - INFO - Processing single-cell BCR data...
2024-06-06 15:21:05,068 - INFO - Concatenating heavy and light chain per cell...
2024-06-06 15:21:07,869 - INFO - Model size: 303.92M
Batch 1/4

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already b