<a href="https://colab.research.google.com/github/mariusmessemaker/STARSolo-inDrop-V3/blob/master/reconstruct_tcrs_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Full-length TCR reconstruction
This notebook reconstructs full-length TCR sequences from V, J and CDR3 annotations. When run on Google Colab, the notebook gets its own online environment, so nothing done here affects your local files. You can still upload and download files through the file browser (folder icon on the left).



## How to run:
There are text blocks and code blocks; this is a text block. Each code block has a button in its top left corner that looks like `In [ ]` or `In [12]` (the number can be anything). Click that button to run the code block.
- The code blocks should be **run in the order in which they are presented**.
- The code does not need to be edited, but some options can be changed (see the sections In-/Output and Functionalities).
 
**IMPORTANT**: changes made in a previous session are not saved. To be safe, check all settings at the start of each session.
## In-/Output

### *Input*
The input is defined in step 3 (specify file path and load the dataset). The following files need to be provided:
-  a '.csv' file that contains the following columns (it can be obtained by exporting/saving an Excel file as .csv)
    - The V and J annotation columns must be named `'TRAV'`, `'TRAJ'`, `'TRBV'` and `'TRBJ'`
    - The code in steps 4 to 6 also reads a `'TRBD'` column
    - The CDR3a and CDR3b annotation columns must be named `'cdr3_alpha_aa'` and `'cdr3_beta_aa'`
- two translation dictionaries (optional, the defaults can be kept)
    - Multiple versions can be found in `/IMGT_versions`
    - They translate (ambiguous) annotations to the IMGT standardized format and supply the corresponding V and J sequences.
    - When no allele information is provided, `*01` is used
    - One should contain the IMGT annotations as keys with the $AA$ (amino acid) sequences as values
    - The other should contain the IMGT annotations as keys with the $NT$ (nucleotide) sequences as values
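To make the role of the translation dictionaries concrete, here is a minimal sketch of the lookup they support. It is not the code from `vdj_reconstruction_utils`: the helper names `to_imgt` and `lookup` are hypothetical, the toy dictionary holds only two entries copied from the example output further down, and `TRAJ99` is a made-up annotation used to show a failed match.

```python
# Toy translation dictionary: IMGT annotation -> AA sequence.
# Both entries are copied from the example output shown in this notebook.
toy_dict_aa = {
    'TRAJ40*01': 'TTSGTYKYIFGTGTRLKVLA',
    'TRBD1*01': 'GTGG',
}


def to_imgt(annotation, default_allele='01'):
    """Hypothetical helper: append the default allele when none is given."""
    annotation = annotation.strip()
    if '*' in annotation:
        return annotation
    return annotation + '*' + default_allele


def lookup(annotation, translation_dict):
    """Return (imgt_name, sequence); sequence is None when nothing matches."""
    imgt_name = to_imgt(annotation)
    return imgt_name, translation_dict.get(imgt_name)


print(lookup('TRAJ40', toy_dict_aa))    # allele missing, *01 is assumed
print(lookup('TRBD1*01', toy_dict_aa))  # allele already present
print(lookup('TRAJ99', toy_dict_aa))    # made-up name, no sequence found
```

Returning `None` for an unknown annotation mirrors the empty cells that appear in the output tables when an annotation cannot be matched.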

### *Output*
- a '.csv' file that contains the original columns and the reconstructed sequence


## Functionalities:
- Include or exclude the leader sequence in `reconstruct_full_tcr` by setting `include_leader=True` (include) or `include_leader=False` (exclude)
- Choose a specific translation dictionary:
  - 'after benchmark': TRAJ58 was added and 2 sequences were adapted from IMGT to Ensembl
  - 'functional': functional annotations only (no ORF annotations)

## TODO:
- The constant region (after the joining segment) is not yet added
- Allow adding custom sequences before or in between segments
- Better descriptions of what the different translation dicts entail
- Make an abstracted version where you upload the file to be reconstructed and press enter
- Add a reference to the benchmark

  
### Benchmark

The reconstruction algorithm has been benchmarked on an internal dataset (RootPath) and a publicly available dataset (10x).
Compared with the RootPath reconstructions, this method matched nearly 100%$^1$ of them (900+ TCRA and TCRB).
The 10x data was a biological sample of 10k TCRs (50/50 TCRA/TCRB). The reconstruction fidelity on this dataset was $>85\%$.
The remaining $<15\%$ was explained by biological and/or technical noise and by missing information on allelic differences (alleles are often not annotated).

1. Small differences were explained by a possible error and by differences in assumptions between the unknown RootPath script and this method.
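As an illustration of how a fidelity figure like the ones above can be computed, the sketch below counts exact matches between reconstructed and reference sequences. The function `reconstruction_fidelity` is a hypothetical helper and the four sequences are toy values, not benchmark data; the actual benchmark may have used a different comparison.

```python
def reconstruction_fidelity(reconstructed, reference):
    """Fraction of reference sequences that were reconstructed exactly.

    Entries without a reference sequence (None) are ignored; a missing
    reconstruction (None) counts as a mismatch.
    """
    pairs = [(rec, ref) for rec, ref in zip(reconstructed, reference)
             if ref is not None]
    if not pairs:
        return 0.0
    matches = sum(1 for rec, ref in pairs if rec == ref)
    return matches / len(pairs)


# Toy example: two exact matches, one missing and one wrong reconstruction.
reconstructed = ['GTGG', 'GTSGG', None, 'GTGG']
reference = ['GTGG', 'GTSGG', 'GTGG', 'GTSGG']
print('fidelity: {:.0%}'.format(reconstruction_fidelity(reconstructed, reference)))
```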

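### 1. Run this cell to copy the required files from GitHub (only needed when running from Google Colab)

In [1]:
# Note: '!cd' runs in its own subshell and does not change the notebook's
# working directory, so the repository is cloned into the current directory.
# Remove an older copy of the repository, if present (on the first run this
# prints a harmless 'No such file or directory' message).
!rm -r TCR_reconstruction/
!git clone https://github.com/bpkwee/TCR_reconstruction
!pip install pysam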

rm: cannot remove 'TCR_reconstruction/': No such file or directory
Cloning into 'TCR_reconstruction'...
remote: Enumerating objects: 219, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 219 (delta 6), reused 4 (delta 4), pack-reused 207
Receiving objects: 100% (219/219), 486.92 KiB | 12.48 MiB/s, done.
Resolving deltas: 100% (128/128), done.
Collecting pysam
  Downloading pysam-0.18.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (14.9 MB)
Installing collected packages: pysam
Successfully installed pysam-0.18.0


### 2. Next, import all relevant functions
- Import the functions that were downloaded from GitHub
- Initialize the logging

In [2]:
# Alternative imports (presumably for running from inside the repository folder):
# from logger.logger import init_logger
# from vdj_reconstruction_utils import reconstruct_full_tcr
# from vdj_reconstruction_utils import reconstruct_vdj

from TCR_reconstruction.logger.logger import init_logger
from TCR_reconstruction.vdj_reconstruction_utils import reconstruct_vdj
from TCR_reconstruction.vdj_reconstruction_utils import reconstruct_full_tcr
import pandas

logger = init_logger('TCR_reconstruction.log', level_msg='INFO')

### 3. Specify file path and load the dataset
If you upload your own file in Google Colab, click the three dots to the right of the file name and choose 'copy file path'.
If your file is not loaded correctly, consider changing `delimiter=';'` to the correct delimiter.

Example:
```

dataset_file_path = '/content/TCR_reconstruction/saved_data/example.csv'
translation_dict_aa = '/content/TCR_reconstruction/IMGT_versions/after_benchmark/functional_with_L-PART1+V-EXON_after_benchmark/vdj_translation_dict_aa_2021-12-05_18h_.json'
translation_dict_nt = '/content/TCR_reconstruction/IMGT_versions/after_benchmark/functional_with_L-PART1+V-EXON_after_benchmark/vdj_translation_dict_nt_2021-12-05_18h_.json'
```
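Before loading a file of your own, it can help to check its delimiter and column names. The sketch below does this with the standard library only; `check_csv_header` and `REQUIRED_COLUMNS` are hypothetical names, the delimiter is guessed simply by counting candidate characters in the header line, and the sample text reuses the first row of the example output with an assumed `;` delimiter.

```python
import csv
import io

# Column names required by this notebook (see the Input section).
REQUIRED_COLUMNS = ['TRAV', 'TRAJ', 'TRBV', 'TRBJ',
                    'cdr3_alpha_aa', 'cdr3_beta_aa']


def check_csv_header(text, candidates=';,\t'):
    """Guess the delimiter and list the required columns that are missing."""
    if not text.strip():
        raise ValueError('the file is empty')
    first_line = text.splitlines()[0]
    # Simple heuristic: the candidate that occurs most often in the header.
    delimiter = max(candidates, key=first_line.count)
    header = next(csv.reader(io.StringIO(text), delimiter=delimiter))
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    return delimiter, missing


sample = (';cdr3_alpha_aa;cdr3_beta_aa;TRAV;TRAJ;TRBV;TRBD;TRBJ;data_origin\n'
          '0;CALRTYKYIF;CASGYWKLAGGPQETQYF;TRAV19;TRAJ40;TRBV7-2;TRBD2;TRBJ2-5;10x\n')
delimiter, missing = check_csv_header(sample)
print('delimiter:', repr(delimiter), '| missing columns:', missing)
```

To check a real file, read it with `open(path).read()` and pass the text to `check_csv_header`; the detected delimiter can then be used in `pandas.read_csv`.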

In [None]:
dataset_file_path = '/content/TCR_reconstruction/saved_data/example.csv'
translation_dict_aa = '/content/TCR_reconstruction/IMGT_versions/after_benchmark/functional_with_L-PART1+V-EXON_after_benchmark/vdj_translation_dict_aa_2021-12-05_18h_.json'
translation_dict_nt = '/content/TCR_reconstruction/IMGT_versions/after_benchmark/functional_with_L-PART1+V-EXON_after_benchmark/vdj_translation_dict_nt_2021-12-05_18h_.json'
dataset = pandas.read_csv(dataset_file_path, delimiter=';', low_memory=False)
dataset.head(10)

Unnamed: 0,cdr3_alpha_aa,cdr3_beta_aa,TRAV,TRAJ,TRBV,TRBD,TRBJ,data_origin
0,CALRTYKYIF,CASGYWKLAGGPQETQYF,TRAV19,TRAJ40,TRBV7-2,TRBD2,TRBJ2-5,10x
1,CAVNAGGGSQGNLIF,CASSTRSSYEQYF,TRAV8-1,TRAJ42,TRBV19,,TRBJ2-7,10x
2,CAGGGSQGNLIF,CASSIRSAYEQYF,TRAV27,TRAJ42,TRBV19,TRBD1,TRBJ2-7,10x
3,CAENEGGGSQGNLIF,CASSSRAGGEQYF,TRAV5,TRAJ42,TRBV19,TRBD2,TRBJ2-7,10x
4,CAVGGGGGSQGNLIF,CASSIRASYEQYF,TRAV8-3,TRAJ42,TRBV19,,TRBJ2-7,10x
5,CAMNPAWGGATNKLIF,CSASPGDYEQYF,TRAV12-3,TRAJ32,TRBV20-1,TRBD1,TRBJ2-7,10x
6,CAGSTSGSRLTF,CSATYEQYF,TRAV39,TRAJ58,TRBV20-1,,TRBJ2-7,10x
7,CAGAHGSSNTGKLIF,CASSIRSAYEQYF,TRAV27,TRAJ37,TRBV19,,TRBJ2-7,10x
8,CAAGGSQGNLIF,CASSIRSAYEQYF,TRAV27,TRAJ42,TRBV19,,TRBJ2-7,10x
9,CAREHMDSNYQLIW,CASSQLGRGDNEQFF,TRAV9-2,TRAJ33,TRBV7-9,TRBD1,TRBJ2-1,10x


### 4. Reconstruct the V, D and J AA and NT sequences from the TRAV, TRAJ, TRBV, TRBD and TRBJ annotations


In [None]:
for vdj in ['TRAV', 'TRAJ', 'TRBV', 'TRBD', 'TRBJ']:
    # For each segment, store the IMGT name and the sequence (AA and NT)
    (dataset[vdj + '_imgt_aa'],
     dataset[vdj + '_seq_aa'],
     dataset[vdj + '_imgt_nt'],
     dataset[vdj + '_seq_nt']) = reconstruct_vdj(dataset,
                                                 vdj,
                                                 translation_dict_nt,
                                                 translation_dict_aa)
dataset.head(10)

Unnamed: 0,cdr3_alpha_aa,cdr3_beta_aa,TRAV,TRAJ,TRBV,TRBD,TRBJ,data_origin,TRAV_imgt_aa,TRAV_seq_aa,TRAV_imgt_nt,TRAV_seq_nt,TRAJ_imgt_aa,TRAJ_seq_aa,TRAJ_imgt_nt,TRAJ_seq_nt,TRBV_imgt_aa,TRBV_seq_aa,TRBV_imgt_nt,TRBV_seq_nt,TRBD_imgt_aa,TRBD_seq_aa,TRBD_imgt_nt,TRBD_seq_nt,TRBJ_imgt_aa,TRBJ_seq_aa,TRBJ_imgt_nt,TRBJ_seq_nt
0,CALRTYKYIF,CASGYWKLAGGPQETQYF,TRAV19,TRAJ40,TRBV7-2,TRBD2,TRBJ2-5,10x,TRAV19*01,AQKVTQAQTEISVVEKEDVTLDCVYETRDTTYYLFWYKQPPSGELV...,TRAV19*01,atgctgactgccagcctgttgagggcagtcatagcctccatctgtg...,TRAJ40*01,TTSGTYKYIFGTGTRLKVLA,TRAJ40*01,actacctcaggaacctacaaatacatctttggaacaggcaccaggc...,TRBV7-2*01,GAGVSQSPSNKVTEKGKDVELRCDPISGHTALYWYRQSLGQGLEFL...,TRBV7-2*01,atgggcaccaggctcctcttctgggtggccttctgtctcctggggg...,TRBD2*01,GTSGG,TRBD2*01,gggactagcggggggg,TRBJ2-5*01,QETQYFGPGTRLLVL,TRBJ2-5*01,accaagagacccagtacttcgggccaggcacgcggctcctggtgctcg
1,CAVNAGGGSQGNLIF,CASSTRSSYEQYF,TRAV8-1,TRAJ42,TRBV19,,TRBJ2-7,10x,TRAV8-1*01,AQSVSQHNHHVILSEAASLELGCNYSYGGTVNLFWYVQYPGQHLQL...,TRAV8-1*01,atgctcctgttgctcataccagtgctggggatgatttttgccctga...,TRAJ42*01,NYGGSQGNLIFGKGTKLSVKP,TRAJ42*01,tgaattatggaggaagccaaggaaatctcatctttggaaaaggcac...,TRBV19*01,DGGITQSPKYLFRKEGQNVTLSCEQNLNHDAMYWYRQDPGQGLRLI...,TRBV19*01,ATGAGCAACCAGGTGCTCTGCTGTGTGGTCCTTTGTCTCCTGGGAG...,,,,,TRBJ2-7*01,SYEQYFGPGTRLTVT,TRBJ2-7*01,ctcctacgagcagtacttcgggccgggcaccaggctcacggtcacag
2,CAGGGSQGNLIF,CASSIRSAYEQYF,TRAV27,TRAJ42,TRBV19,TRBD1,TRBJ2-7,10x,TRAV27*01,TQLLEQSPQFLSIQEGENLTVYCNSSSVFSSLQWYRQEPGEGPVLL...,TRAV27*01,atggtcctgaaattctccgtgtccattctttggattcagttggcat...,TRAJ42*01,NYGGSQGNLIFGKGTKLSVKP,TRAJ42*01,tgaattatggaggaagccaaggaaatctcatctttggaaaaggcac...,TRBV19*01,DGGITQSPKYLFRKEGQNVTLSCEQNLNHDAMYWYRQDPGQGLRLI...,TRBV19*01,ATGAGCAACCAGGTGCTCTGCTGTGTGGTCCTTTGTCTCCTGGGAG...,TRBD1*01,GTGG,TRBD1*01,gggacagggggc,TRBJ2-7*01,SYEQYFGPGTRLTVT,TRBJ2-7*01,ctcctacgagcagtacttcgggccgggcaccaggctcacggtcacag
3,CAENEGGGSQGNLIF,CASSSRAGGEQYF,TRAV5,TRAJ42,TRBV19,TRBD2,TRBJ2-7,10x,TRAV5*01,GEDVEQSLFLSVREGDSSVINCTYTDSSSTYLYWYKQEPGAGLQLL...,TRAV5*01,atgaagacatttgctggattttcgttcctgtttttgtggctgcagc...,TRAJ42*01,NYGGSQGNLIFGKGTKLSVKP,TRAJ42*01,tgaattatggaggaagccaaggaaatctcatctttggaaaaggcac...,TRBV19*01,DGGITQSPKYLFRKEGQNVTLSCEQNLNHDAMYWYRQDPGQGLRLI...,TRBV19*01,ATGAGCAACCAGGTGCTCTGCTGTGTGGTCCTTTGTCTCCTGGGAG...,TRBD2*01,GTSGG,TRBD2*01,gggactagcggggggg,TRBJ2-7*01,SYEQYFGPGTRLTVT,TRBJ2-7*01,ctcctacgagcagtacttcgggccgggcaccaggctcacggtcacag
4,CAVGGGGGSQGNLIF,CASSIRASYEQYF,TRAV8-3,TRAJ42,TRBV19,,TRBJ2-7,10x,TRAV8-3*01,AQSVTQPDIHITVSEGASLELRCNYSYGATPYLFWYVQSPGQGLQL...,TRAV8-3*01,atgctcctggagcttatcccactgctggggatacattttgtcctga...,TRAJ42*01,NYGGSQGNLIFGKGTKLSVKP,TRAJ42*01,tgaattatggaggaagccaaggaaatctcatctttggaaaaggcac...,TRBV19*01,DGGITQSPKYLFRKEGQNVTLSCEQNLNHDAMYWYRQDPGQGLRLI...,TRBV19*01,ATGAGCAACCAGGTGCTCTGCTGTGTGGTCCTTTGTCTCCTGGGAG...,,,,,TRBJ2-7*01,SYEQYFGPGTRLTVT,TRBJ2-7*01,ctcctacgagcagtacttcgggccgggcaccaggctcacggtcacag
5,CAMNPAWGGATNKLIF,CSASPGDYEQYF,TRAV12-3,TRAJ32,TRBV20-1,TRBD1,TRBJ2-7,10x,TRAV12-3*01,QKEVEQDPGPLSVPEGAIVSLNCTYSNSAFQYFMWYRQYSRKGPEL...,TRAV12-3*01,atgatgaaatccttgagagttttactggtgatcctgtggcttcagt...,TRAJ32*01,NYGGATNKLIFGTGTLLAVQP,TRAJ32*01,tgaattatggcggtgctacaaacaagctcatctttggaactggcac...,TRBV20-1*01,GAVVSQHPSWVICKSGTSVKIECRSLDFQATTMFWYRQFPKQSLML...,TRBV20-1*01,atgctgctgcttctgctgcttctggggccaggctccgggcttggtg...,TRBD1*01,GTGG,TRBD1*01,gggacagggggc,TRBJ2-7*01,SYEQYFGPGTRLTVT,TRBJ2-7*01,ctcctacgagcagtacttcgggccgggcaccaggctcacggtcacag
6,CAGSTSGSRLTF,CSATYEQYF,TRAV39,TRAJ58,TRBV20-1,,TRBJ2-7,10x,TRAV39*01,ELKVEQNPLFLSMQEGKNYTIYCNYSTTSDRLYWYRQDPGKSLESL...,TRAV39*01,atgaagaagctactagcaatgattctgtggcttcaactagaccggt...,TRAJ58*01,*ETSGSRLTFGEGTQLTVNP,,,TRBV20-1*01,GAVVSQHPSWVICKSGTSVKIECRSLDFQATTMFWYRQFPKQSLML...,TRBV20-1*01,atgctgctgcttctgctgcttctggggccaggctccgggcttggtg...,,,,,TRBJ2-7*01,SYEQYFGPGTRLTVT,TRBJ2-7*01,ctcctacgagcagtacttcgggccgggcaccaggctcacggtcacag
7,CAGAHGSSNTGKLIF,CASSIRSAYEQYF,TRAV27,TRAJ37,TRBV19,,TRBJ2-7,10x,TRAV27*01,TQLLEQSPQFLSIQEGENLTVYCNSSSVFSSLQWYRQEPGEGPVLL...,TRAV27*01,atggtcctgaaattctccgtgtccattctttggattcagttggcat...,TRAJ37*01,GSGNTGKLIFGQGTTLQVKP,TRAJ37*01,tggctctggcaacacaggcaaactaatctttgggcaagggacaact...,TRBV19*01,DGGITQSPKYLFRKEGQNVTLSCEQNLNHDAMYWYRQDPGQGLRLI...,TRBV19*01,ATGAGCAACCAGGTGCTCTGCTGTGTGGTCCTTTGTCTCCTGGGAG...,,,,,TRBJ2-7*01,SYEQYFGPGTRLTVT,TRBJ2-7*01,ctcctacgagcagtacttcgggccgggcaccaggctcacggtcacag
8,CAAGGSQGNLIF,CASSIRSAYEQYF,TRAV27,TRAJ42,TRBV19,,TRBJ2-7,10x,TRAV27*01,TQLLEQSPQFLSIQEGENLTVYCNSSSVFSSLQWYRQEPGEGPVLL...,TRAV27*01,atggtcctgaaattctccgtgtccattctttggattcagttggcat...,TRAJ42*01,NYGGSQGNLIFGKGTKLSVKP,TRAJ42*01,tgaattatggaggaagccaaggaaatctcatctttggaaaaggcac...,TRBV19*01,DGGITQSPKYLFRKEGQNVTLSCEQNLNHDAMYWYRQDPGQGLRLI...,TRBV19*01,ATGAGCAACCAGGTGCTCTGCTGTGTGGTCCTTTGTCTCCTGGGAG...,,,,,TRBJ2-7*01,SYEQYFGPGTRLTVT,TRBJ2-7*01,ctcctacgagcagtacttcgggccgggcaccaggctcacggtcacag
9,CAREHMDSNYQLIW,CASSQLGRGDNEQFF,TRAV9-2,TRAJ33,TRBV7-9,TRBD1,TRBJ2-1,10x,TRAV9-2*01,GNSVTQMEGPVTLSEEAFLTINCTYTATGYPSLFWYVQYPGEGLQL...,TRAV9-2*01,atgaactattctccaggcttagtatctctgatactcttactgcttg...,TRAJ33*01,DSNYQLIWGAGTKLIIKP,TRAJ33*01,tggatagcaactatcagttaatctggggcgctgggaccaagctaat...,TRBV7-9*01,DTGVSQNPRHKITKRGQNVTFRCDPISEHNRLYWYRQTLGQGPEFL...,TRBV7-9*01,atgggcaccagcctcctctgctggatggccctgtgtctcctggggg...,TRBD1*01,GTGG,TRBD1*01,gggacagggggc,TRBJ2-1*01,SYNEQFFGPGTRLTVL,TRBJ2-1*01,ctcctacaatgagcagttcttcgggccagggacacggctcaccgtg...


### 5. Calculate how many annotations could be matched to an IMGT sequence:

In [None]:
total_len = len(dataset)
for nt_or_aa in ['aa', 'nt']:
    for vdj in ['TRAV', 'TRAJ', 'TRBV', 'TRBD', 'TRBJ']:
        # Rows with a matched sequence vs. rows with an annotation
        count = sum(dataset[vdj + '_seq_' + nt_or_aa].notna())
        count_original = sum(dataset[vdj].notna())
        print('{0} imputed: {1} / {2} total annotations ({3}) ({4})'.format(
            vdj, count, count_original, total_len, nt_or_aa.upper()))

TRAV imputed: 16448 / 16448 total annotations (16448) (AA)
TRAJ imputed: 16448 / 16448 total annotations (16448) (AA)
TRBV imputed: 16421 / 16448 total annotations (16448) (AA)
TRBD imputed: 11805 / 16448 total annotations (16448) (AA)
TRBJ imputed: 16448 / 16448 total annotations (16448) (AA)
TRAV imputed: 16448 / 16448 total annotations (16448) (NT)
TRAJ imputed: 16294 / 16448 total annotations (16448) (NT)
TRBV imputed: 16421 / 16448 total annotations (16448) (NT)
TRBD imputed: 11805 / 16448 total annotations (16448) (NT)
TRBJ imputed: 16448 / 16448 total annotations (16448) (NT)
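The counts above are easier to compare as percentages. The sketch below converts the NT counts printed above into match rates; `match_rates` is a hypothetical helper and the numbers are copied by hand from that output. Note that, judging from the example rows, a low TRBD rate partly reflects rows that have no TRBD annotation at all rather than failed lookups.

```python
# NT counts copied from the output above (matched sequences per segment).
counts_nt = {'TRAV': 16448, 'TRAJ': 16294, 'TRBV': 16421,
             'TRBD': 11805, 'TRBJ': 16448}
total_rows = 16448


def match_rates(counts, total):
    """Return the percentage of rows with a matched sequence per segment."""
    return {segment: 100 * n / total for segment, n in counts.items()}


rates = match_rates(counts_nt, total_rows)
for segment, pct in rates.items():
    print('{0}: {1:.2f}% of rows matched (NT)'.format(segment, pct))
```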


### 6. Reconstruct the full sequence for the beta and alpha TCR
For clarity, this step keeps only the original columns and the reconstructed sequences.
If you want to keep all columns, comment out the following line (add '#' in front of it):
`dataset = dataset[['full_seq_reconstruct_beta_aa', 'full_seq_reconstruct_alpha_aa', 'cdr3_alpha_aa', 'cdr3_beta_aa', 'TRAV', 'TRAJ', 'TRBV', 'TRBD', 'TRBJ']]`

In [None]:
# beta
dataset['full_seq_reconstruct_beta_aa'] = reconstruct_full_tcr(dataset['TRBV_seq_nt'],
                                                               dataset['TRBV_seq_aa'],
                                                               dataset['TRBJ_seq_nt'],
                                                               dataset['TRBJ_seq_aa'],
                                                               dataset['cdr3_beta_aa'],
                                                               include_leader=False)
# alpha
dataset['full_seq_reconstruct_alpha_aa'] = reconstruct_full_tcr(dataset['TRAV_seq_nt'],
                                                                dataset['TRAV_seq_aa'],
                                                                dataset['TRAJ_seq_nt'],
                                                                dataset['TRAJ_seq_aa'],
                                                                dataset['cdr3_alpha_aa'],
                                                                include_leader=False)

dataset = dataset[['full_seq_reconstruct_beta_aa', 'full_seq_reconstruct_alpha_aa', 'cdr3_alpha_aa', 'cdr3_beta_aa', 'TRAV', 'TRAJ', 'TRBV', 'TRBD', 'TRBJ']]
dataset.head(10)

Unnamed: 0,full_seq_reconstruct_beta_aa,full_seq_reconstruct_alpha_aa,cdr3_alpha_aa,cdr3_beta_aa,TRAV,TRAJ,TRBV,TRBD,TRBJ
0,GAGVSQSPSNKVTEKGKDVELRCDPISGHTALYWYRQSLGQGLEFL...,AQKVTQAQTEISVVEKEDVTLDCVYETRDTTYYLFWYKQPPSGELV...,CALRTYKYIF,CASGYWKLAGGPQETQYF,TRAV19,TRAJ40,TRBV7-2,TRBD2,TRBJ2-5
1,DGGITQSPKYLFRKEGQNVTLSCEQNLNHDAMYWYRQDPGQGLRLI...,AQSVSQHNHHVILSEAASLELGCNYSYGGTVNLFWYVQYPGQHLQL...,CAVNAGGGSQGNLIF,CASSTRSSYEQYF,TRAV8-1,TRAJ42,TRBV19,,TRBJ2-7
2,DGGITQSPKYLFRKEGQNVTLSCEQNLNHDAMYWYRQDPGQGLRLI...,TQLLEQSPQFLSIQEGENLTVYCNSSSVFSSLQWYRQEPGEGPVLL...,CAGGGSQGNLIF,CASSIRSAYEQYF,TRAV27,TRAJ42,TRBV19,TRBD1,TRBJ2-7
3,DGGITQSPKYLFRKEGQNVTLSCEQNLNHDAMYWYRQDPGQGLRLI...,GEDVEQSLFLSVREGDSSVINCTYTDSSSTYLYWYKQEPGAGLQLL...,CAENEGGGSQGNLIF,CASSSRAGGEQYF,TRAV5,TRAJ42,TRBV19,TRBD2,TRBJ2-7
4,DGGITQSPKYLFRKEGQNVTLSCEQNLNHDAMYWYRQDPGQGLRLI...,AQSVTQPDIHITVSEGASLELRCNYSYGATPYLFWYVQSPGQGLQL...,CAVGGGGGSQGNLIF,CASSIRASYEQYF,TRAV8-3,TRAJ42,TRBV19,,TRBJ2-7
5,GAVVSQHPSWVICKSGTSVKIECRSLDFQATTMFWYRQFPKQSLML...,QKEVEQDPGPLSVPEGAIVSLNCTYSNSAFQYFMWYRQYSRKGPEL...,CAMNPAWGGATNKLIF,CSASPGDYEQYF,TRAV12-3,TRAJ32,TRBV20-1,TRBD1,TRBJ2-7
6,GAVVSQHPSWVICKSGTSVKIECRSLDFQATTMFWYRQFPKQSLML...,ELKVEQNPLFLSMQEGKNYTIYCNYSTTSDRLYWYRQDPGKSLESL...,CAGSTSGSRLTF,CSATYEQYF,TRAV39,TRAJ58,TRBV20-1,,TRBJ2-7
7,DGGITQSPKYLFRKEGQNVTLSCEQNLNHDAMYWYRQDPGQGLRLI...,TQLLEQSPQFLSIQEGENLTVYCNSSSVFSSLQWYRQEPGEGPVLL...,CAGAHGSSNTGKLIF,CASSIRSAYEQYF,TRAV27,TRAJ37,TRBV19,,TRBJ2-7
8,DGGITQSPKYLFRKEGQNVTLSCEQNLNHDAMYWYRQDPGQGLRLI...,TQLLEQSPQFLSIQEGENLTVYCNSSSVFSSLQWYRQEPGEGPVLL...,CAAGGSQGNLIF,CASSIRSAYEQYF,TRAV27,TRAJ42,TRBV19,,TRBJ2-7
9,DTGVSQNPRHKITKRGQNVTFRCDPISEHNRLYWYRQTLGQGPEFL...,GNSVTQMEGPVTLSEEAFLTINCTYTATGYPSLFWYVQYPGEGLQL...,CAREHMDSNYQLIW,CASSQLGRGDNEQFF,TRAV9-2,TRAJ33,TRBV7-9,TRBD1,TRBJ2-1
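To give an idea of what joining V, CDR3 and J involves, here is a minimal sketch of one common way to splice the three parts at the amino acid level. It is an assumption, not the implementation of `reconstruct_full_tcr`: the function `splice_tcr` and its `min_overlap` parameter are hypothetical, the V region is cut at its last cysteine (assumed to be the conserved Cys that starts the CDR3), and the J region is attached where the end of the CDR3 overlaps with it. The J and CDR3 sequences come from rows 0 and 9 of the example output; the V fragment is a made-up tail, not a real gene.

```python
def splice_tcr(v_aa, cdr3_aa, j_aa, min_overlap=3):
    """Join V + CDR3 + J; return None when the parts cannot be aligned."""
    # V side: keep everything before the last Cys, which the CDR3 replaces.
    cys = v_aa.rfind('C')
    if cys == -1 or not cdr3_aa.startswith('C'):
        return None
    # J side: find the longest CDR3 suffix that also occurs in the J region,
    # then append only the part of J that follows this overlap.
    for k in range(len(cdr3_aa), min_overlap - 1, -1):
        idx = j_aa.find(cdr3_aa[-k:])
        if idx != -1:
            return v_aa[:cys] + cdr3_aa + j_aa[idx + k:]
    return None


toy_v = 'AAAAYFCAV'  # made-up V-region tail, for illustration only
print(splice_tcr(toy_v, 'CALRTYKYIF', 'TTSGTYKYIFGTGTRLKVLA'))    # TRAJ40*01
print(splice_tcr(toy_v, 'CAREHMDSNYQLIW', 'DSNYQLIWGAGTKLIIKP'))  # TRAJ33*01
```

Requiring a minimum overlap avoids anchoring the J region on a single residue that happens to match; the right threshold is a design choice and would need checking against real data.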


### 7. Calculate statistics on the reconstruction:

In [None]:
print('Could reconstruct full BETA TCR for {0} entries of total {1} CDR3b entries'.format(
    sum(dataset['full_seq_reconstruct_beta_aa'].notna()),
    sum(dataset['cdr3_beta_aa'].notna())))

print('Could reconstruct full ALPHA TCR for {0} entries of total {1} CDR3a entries'.format(
    sum(dataset['full_seq_reconstruct_alpha_aa'].notna()),
    sum(dataset['cdr3_alpha_aa'].notna())))

Could reconstruct full BETA TCR for 16421 entries of total 16448 CDR3b entries
Could reconstruct full ALPHA TCR for 16448 entries of total 16448 CDR3a entries


### 8. Lastly, save the output
To download the output, click the folder icon on the left, click the three dots next to the file, and click 'Download'.

In [None]:
dataset.to_csv('reconstructed_tcrs.csv')
