### VJ Gene Assignment
In this example, we'll see how to find the closest V and J genes in AntPack's database
for an input sequence, get the sequences of those VJ genes, get the sequence of
all VJ genes in the same family, and see the date when AntPack's database
was last updated.

In [1]:
from antpack import VJGeneTool, SingleChainAnnotator

vj_tool = VJGeneTool()

In [2]:
test_sequence = "VQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCARYDGIYGELDFWGQGTLVTVSS"

We can number the sequence first using ``SingleChainAnnotator`` (we'll often need the numbering
for other tasks anyway) and then pass that numbering to ``vj_tool.assign_numbered_sequence()``. Alternatively, we can ask
``vj_tool`` to number the sequence for us by calling ``vj_tool.assign_sequence()``. Note that if we enter a sequence with invalid
amino acids (e.g. X or -), supply an invalid species (not one of 'human', 'mouse'), or supply
other invalid input, the tool will return ``None`` for both the V and J genes.
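
Because invalid input comes back as ``None`` rather than as an error, it can be convenient to screen inputs before assignment. The sketch below is a hypothetical helper, not part of AntPack: ``is_valid_query`` and the two constants are names invented for illustration. It assumes that "valid" means the 20 standard uppercase amino acid letters and one of the two species named above.

```python
# Hypothetical pre-check helper (not part of AntPack).
# Assumption: valid residues are the 20 standard uppercase amino acid letters.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
# Species list taken from the text above.
SUPPORTED_SPECIES = {"human", "mouse"}


def is_valid_query(sequence, species):
    """Return True if the inputs pass the checks described in the text."""
    if species not in SUPPORTED_SPECIES:
        return False
    # Reject empty input and any letter outside the standard alphabet (e.g. X or -).
    return len(sequence) > 0 and set(sequence) <= STANDARD_AMINO_ACIDS


print(is_valid_query("VQLVQSGAEVKKPGSS", "human"))
print(is_valid_query("VQLVXSGAEV-KPGSS", "human"))
print(is_valid_query("VQLVQSGAEVKKPGSS", "rabbit"))
```

A check like this only mirrors the rules stated here; the tool itself remains the authority on what it accepts, so it is still worth testing the returned gene names for ``None``.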

In [3]:
v_gene, j_gene, v_percent_identity, j_percent_identity = vj_tool.assign_sequence(test_sequence, species="human")

In [4]:
print(v_gene)
print(j_gene)
print(v_percent_identity)
print(j_percent_identity)

IGHV1-69*01
IGHJ4*01
0.8877551020408163
0.8571428571428571


If we need to, we can see what the sequences of those genes are. AntPack stores those sequences in its internal
db pre-aligned using the IMGT numbering scheme, so each sequence will be length 128 with gaps inserted as appropriate.
We can use ``SingleChainAnnotator`` to number our input sequence, then do some simple manipulation to
convert the numbered sequence to the same format so we can see how well it lines up. The IMGT scheme contains
128 positions (any insertions beyond these are designated by appending a letter to the position number), so when
formatting our input sequence to match up to the V gene, we just extract the positions whose
number is in 1 through 128, as illustrated below.

In [5]:
# Get the V and J gene sequences
vgene_seq = vj_tool.get_vj_gene_sequence("IGHV1-69*01", species="human")
jgene_seq = vj_tool.get_vj_gene_sequence("IGHJ4*01", species="human")

# Now let's prep our input sequence so it can be directly compared to the V and J gene.

ntool = SingleChainAnnotator()
numbering, _, _, _ = ntool.analyze_seq(test_sequence)

formatted_seq = ["-"] * 128
# IMGT positions run from 1 through 128 inclusive.
expected_positions = {str(i) for i in range(1, 129)}

for ntoken, letter in zip(numbering, test_sequence):
    # Tokens that carry an insertion letter are not in expected_positions, so they are skipped.
    if ntoken in expected_positions:
        # We have to subtract 1 here because Python numbers from 0, IMGT numbers from 1.
        formatted_seq[int(ntoken) - 1] = letter

print(vgene_seq)
print(jgene_seq)
print("".join(formatted_seq))

QVQLVQSGA-EVKKPGSSVKVSCKASGGTF----SSYAISWVRQAPGQGLEWMGGIIPI--FGTANYAQKFQ-GRVTITADESTSTAYMELSSLRSEDTAVYYCAR----------------------
------------------------------------------------------------------------------------------------------------------FDYWGQGTLVTVSS
-VQLVQSGA-EVKKPGSSVKVSCKASGGTF----SSYAISWVRQAPGQGLEWMGGI--------IPIFQKFQ-GRVTITADESTSTAYMELSSLRSEDTAVYYCARYDGI-YGELDFWGQGTLVTVSS
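
Once everything is laid out on the same 128 positions, comparing the input against a gene is a column-by-column count. The sketch below is one simple way to do that. ``identity_to_gene`` is a hypothetical helper, and the choice to score only the positions where the gene has a letter is an assumption, not a description of how AntPack computes its percent identity. The strings are the human IGHV1-69*01 and IGHJ4*01 sequences from above and the input sequence placed on IMGT positions 1 through 128.

```python
def identity_to_gene(gene_aligned, query_aligned):
    """Fraction of the gene's non-gap positions where the query has the same letter."""
    compared = 0
    matches = 0
    for gene_letter, query_letter in zip(gene_aligned, query_aligned):
        if gene_letter == "-":
            # Positions the gene does not cover are not scored.
            continue
        compared += 1
        if gene_letter == query_letter:
            matches += 1
    return matches / compared


# Runs of gaps are written as "-" * n to keep the 128-position layout easy to check.
v_gene_aligned = ("QVQLVQSGA-EVKKPGSSVKVSCKASGGTF----SSYAISWVRQAPGQGLEWMGGIIPI--"
                  "FGTANYAQKFQ-GRVTITADESTSTAYMELSSLRSEDTAVYYCAR" + "-" * 22)
j_gene_aligned = "-" * 114 + "FDYWGQGTLVTVSS"
query_aligned = ("-VQLVQSGA-EVKKPGSSVKVSCKASGGTF----SSYAISWVRQAPGQGLEWMGGI" + "-" * 8
                 + "IPIFQKFQ-GRVTITADESTSTAYMELSSLRSEDTAVYYCARYDGI-YGELDFWGQGTLVTVSS")

print(identity_to_gene(v_gene_aligned, query_aligned))
print(identity_to_gene(j_gene_aligned, query_aligned))
```

For this sequence the two fractions work out to 87/98 and 12/14, which agree with the identities printed in cell 4. That agreement holds for this example only; other sequences may expose differences between this count and AntPack's own calculation.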


If we want to, we can also see the sequences of all other V and J genes in this family (or any other family of
interest for that matter). As an example, let's pull the sequences and names of all V genes for humans
in family IGHV1. These are not guaranteed to be returned in any particular order.

In [6]:
sequences, names = vj_tool.get_vj_gene_family("IGHV1", species="human")
print(sequences[:5])
print(names[:5])

['QVQLVQSGA-EVKKPGASVKVSCKASGYTF----TSYGISWVRQAPGQGLEWMGWISAY--NGNTNYAQKLQ-GRVTMTTDTSTSTAYMELRSLRSDDTAVYYCAR----------------------', 'QVQLVQSGA-EVKKPGASVKVSCKASGYTF----TSYGISWVRQAPGQGLEWMGWISAY--NGNTNYAQKLQ-GRVTMTTDTSTSTAYMELRSLRSDDMAVYYCAR----------------------', 'QVQLVQSGA-EVKKPGASVKVSCKASGYTF----TSYGISWVRQAPGQGLEWMGWISAY--NGNTNYAQKLQ-GRVTMTTDTSTSTAYMELRSLRSDDTAVYYCAR----------------------', 'QVQLVQSGA-EVKKPGASVKVSCKASGYTF----TGYYMHWVRQAPGQGLEWMGRINPN--SGGTNYAQKFQ-GRVTSTRDTSISTAYMELSRLRSDDTVVYYCAR----------------------', 'QVQLVQSGA-EVKKPGASVKVSCKASGYTF----TGYYMHWVRQAPGQGLEWMGWINPN--SGGTNYAQKFQ-GRVTMTRDTSISTAYMELSRLRSDDTAVYYCAR----------------------']
['IGHV1-18*01', 'IGHV1-18*03', 'IGHV1-18*04', 'IGHV1-2*01', 'IGHV1-2*02']
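
The family string used above ("IGHV1") is the leading part of an assigned gene name such as "IGHV1-69*01". A minimal sketch of deriving it from a gene name follows. ``gene_family`` is a hypothetical helper, and it assumes names follow the simple ``family-gene*allele`` pattern seen in this example; IMGT names with a more complex structure may need different handling. Only the V-gene case is confirmed by the example above; whether a J-gene result such as "IGHJ4" is the string the tool expects for J families is an assumption.

```python
def gene_family(gene_name):
    """Strip the allele suffix (after '*') and the gene number (after the first '-')."""
    without_allele = gene_name.split("*", 1)[0]
    return without_allele.split("-", 1)[0]


print(gene_family("IGHV1-69*01"))
print(gene_family("IGHV1-18*03"))
print(gene_family("IGHJ4*01"))
```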


Finally, let's see when AntPack's VJ database was last updated. Note that AntPack's VJ database is pulled from
IMGT but with some exclusions: for example, we exclude genes whose functionality is not "F" and genes that are
partial. The output below also shows which species and receptors are currently supported.

In [7]:
vj_tool.retrieve_db_dates()

{'human': {'IGKV': '2024-05-13',
  'IGLV': '2024-05-13',
  'IGLJ': '2024-05-13',
  'IGHV': '2024-05-13',
  'IGHJ': '2024-05-13',
  'IGKJ': '2024-05-13'},
 'mouse': {'IGKV': '2024-05-13',
  'IGLV': '2024-05-13',
  'IGHJ': '2024-05-13',
  'IGLJ': '2024-05-13',
  'IGHV': '2024-05-13',
  'IGKJ': '2024-05-13'}}
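
The returned value is a plain nested dictionary, so it is easy to summarize. The sketch below copies the output shown above into a literal so that it runs without AntPack. ``summarize_db_dates`` is a hypothetical helper that lists the supported species and receptors and finds the oldest update date, assuming every date is an ISO ``YYYY-MM-DD`` string as in the output above.

```python
from datetime import date

# Copied from the output above so the example runs without AntPack.
db_dates = {
    "human": {"IGKV": "2024-05-13", "IGLV": "2024-05-13", "IGLJ": "2024-05-13",
              "IGHV": "2024-05-13", "IGHJ": "2024-05-13", "IGKJ": "2024-05-13"},
    "mouse": {"IGKV": "2024-05-13", "IGLV": "2024-05-13", "IGHJ": "2024-05-13",
              "IGLJ": "2024-05-13", "IGHV": "2024-05-13", "IGKJ": "2024-05-13"},
}


def summarize_db_dates(dates):
    """Return (sorted species, sorted receptors, oldest update date)."""
    species = sorted(dates)
    receptors = sorted({name for per_species in dates.values() for name in per_species})
    oldest = min(
        date.fromisoformat(day)
        for per_species in dates.values()
        for day in per_species.values()
    )
    return species, receptors, oldest


species, receptors, oldest = summarize_db_dates(db_dates)
print(species)
print(receptors)
print(oldest.isoformat())
```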