Skip to content
T cell receptor amino acid sequence probability estimation.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.

vdjRec version 0.1

TCR amino acid sequence generation and recombination probability estimation in R. Some R skills are required to use advanced features.


This software allows to calculate recombination probability of TCR sequence, simulate TCR repertoire, and identify candidate condition-associated T-cell receptor sequences using only patient cohort. For details of the approach, see preprint.

Software requirements

Any OS where R is available (Linux, OS X, Windows), however parallel computing is currently not available on Windows.


  1. Install R distribution of choice (i.e. from R Core team, or Microsoft R Open )
  2. Install BioStrings package from bioconductor: open R console and execute following commands:
  1. Download this repository and unzip

Quick start

Sequence generation

Currently only TCRB generation is supported (more loci would be available soon!).

User need to specify V, J and desired number of recombination events.

Note: translate and inframe_only options discard out-of-frame sequences from the output.

#Generate both inframe and out-of-frame nucleotide sequences:
#all 100 recombination events are in the output 

#Lets generate some in frame TCR beta amino acid sequences:
#Only in frame sequences are in the output (appox. third of recombination events), output is translated

#Lets generate some in frame TCR beta nucleotide sequences:
#Only in frame sequences are in the output (appox. third of recombination events), output is NOT translated

Immunoseq format uses different V and J segment naming than IMGT, use supplied conversion tables if V and J names are in immunoseq format:


Generative probability estimation

How to estimate generative probability for CDR3 amino acid sequence? Generate a lot of TCR aminoacid sequences and count, how many of them correspond to our sequences of interest!

To save memory, sequence generation is performed in iter batches of nrec size, and may be performed in parallel on several cores, so total number of simulated sequences is nrec *iter*cores

#load sample file with TCRB CDR3 amino acid sequences
#estimate counts
CDR3s_p<-estimate_pgen_aa(CDR3s,iter=3,nrec=5e5,V="TRBV5-1",J="TRBJ1-1",colname="CDR3.amino.acid.sequence")#specify column name with CDR3 aa seqs.
#we could easily convert counts to probs

#We could do more with more cores (currently OS X and Linux). 
#Note, that now total number of generated sequences is nrec*iter*cores = 5e5*3*2.

To check for contamination, we could also save generated nucleotide variants for few sequences of interest (target parameter):

#function now returns list with data and variants of target

#as expected, all have same amino acid sequence as target
#table with counts is now in data element of the list: 
sum(CDR3s_p$data$sim_num) #number of simulated amino acid sequences corresponding on data

Identification of condition-associated clonotypes

This is most advanced part. Note, that all analysis is needed to be done for each VJ combination separately.

We would need two things:

  1. Table with CDR3 amino acid sequence and columns indicating presence (or read count) in our donors. Typically, this is a result of merge of sample dataframes by CDR3 amino acid sequence. See demo/TCRBV05-01_TCRBJ01-01.csv for example.
  2. Table with sizes of repertoire of given donors VJ combination (= number of unique CDR3 beta clonotypes with this VJ combination), and column indicating if you want to use this sample for current analysis (normally this indicates, that sample corresponds to patient cohort). See demo/samples_TCRBV05-01_TCRBJ01-01.csv for example.

Table 1. example:

CDR3.amino.acid.sequence Donor1.Read.count Donor2.Read.count Donor3.Read.count

Table 2. example:

sample count analysis
Donor1 3040 TRUE
Donor2 5304 TRUE
Donor3 1356 FALSE

Below is a real world example of analysis for TRBV7-6 TRBJ1-4 combination from Emerson et al, Nature genetics, 2017.


#load sharing data for TRBV7-6 TRBJ1-4 combination for data from Emerson et al, Nature genetics, 2017.

#lets generate 2e9 sequences to estimate generation probability. 
#This takes some time, approx 2 hours on 8-core intel i7 processor. 
#For demo purposes skip it, and load precomputed table (see below)
CDR3s_p<-estimate_pgen_aa(CDR3s,iter=500,nrec=5e5,cores=8, V="TRBV7-6",J="TRBJ1-4")

#Analogous command for single core (use on windows), this simulation takes approx 16 hours of time, so do not run it.
CDR3s_p<-estimate_pgen_aa(CDR3s,iter=500*8,nrec=5e5,cores=1, V="TRBV7-6",J="TRBJ1-4")

#To save time, we could load precomputed table:
#read sample sheet


This outputs the same table with the additional columns, such as p-value and effect size:

#filter clones with 0 in silico rearrangements:

#do multiple testing correction and output significant results only with few columns:

Here is short final output description. See methods section in the manuscript for detailed description.

Field Description
sim_num raw number of simulated TCR amino acid sequences having this CDR3AA
ML ML estimate of probability of observing sequence from data (P_data)
donors number of donors in selected cohort having this CDR3AA
P_post theoretical probability to observe sequence, estimated from recombination model
pval_post p-value (not corrected for multiple testing)
effect_size log10(effect size) is log10(ML)-log10(P_post)

Experiment design and power analysis

For the details see Designing the experiment section in the manuscript. Idea is to make a simulation to check, if clone with given P_data and P_post (which is q times lower than P_data) would be found in cohort of size n, sequencing depths vector nvec, significance threshold thres by our method. One could do niter simulations, and function would return number of simulations when clone is below significant threshold, and also number of donors with clone for each simulation. Function return list of number of significant tests for each P_data value (power) and number of donors with sequence in each simulation (sizes, which is a matrix - columns are simulations, each row correspond to donor).
Let's do a quick example:


#do niter=100 simulations for clone with effect size q=5,
#and cohort size n=30 for given values of pdata:
tst_q5<-do_power_analysis(thres = 0.0001,niter = 100,q=5,n=30,nvec=rep(1e3,30),pdata=10^-seq(7,2,length.out = 18))
#lets plot the results!
#Number of significant test depending of clone Pdata
plot(10^-seq(7,2,length.out = 18),tst_q5$power,log="x",type="l",xlab="Pdata'",ylab="# significant results (out of 100)",ylim=c(0,100))
#If clone is less abundant in population, it is harder to find it.

#Average (over 100 simulations for each Pdata) number of donors with sequence:
plot(10^-seq(7,2,length.out = 18),rowSums(tst_q5$sizes)/100,log="x",type="l",xlab="Pdata'",ylab="# of donors with sequence")
You can’t perform that action at this time.