V(D)J junction mapping
A script to extract CDR3 sequences. It properly combines reads coming from paired and overlapped data and performs the analysis for both raw and assembled data.
Performs CDR3 extraction and V/J segment determination for both raw
(Checkout output) and assembled data. The gene parameter (-R) is
required unless a metadata file (--sample-metadata) is provided that
specifies the gene for each sample; supported genes are TRA, TRB, TRG,
TRD, IGH, IGK and IGL. If either assembly_output_folder or
checkout_output_folder is not specified, only the remaining input is
processed; this is useful, for example, if one wants to quickly process
the assembled data alone. Otherwise, only samples and file types
(paired, overlapped or single) that are present in both outputs are
used. Processing both raw and assembled data is required for
second-stage error correction (removal of hot-spot errors).
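The selection rule described above can be sketched as a set intersection. This is only an illustration, not MIGEC's own code: the function name and the representation of each output folder as a set of (sample, file type) pairs are assumptions made for the example.

```python
def samples_to_process(checkout, assembly):
    """Return the (sample, file_type) pairs that would be processed.

    Each argument is a set of (sample, file_type) pairs found in one
    output folder, or None if that folder was not specified.
    """
    if checkout is None and assembly is None:
        # Assumption: at least one input folder has to be given.
        raise ValueError("at least one input folder is required")
    if checkout is None:
        return set(assembly)   # only assembled data is available
    if assembly is None:
        return set(checkout)   # only raw (Checkout) data is available
    # Both inputs given: keep only the pairs present in both outputs.
    return checkout & assembly


# Hypothetical folder contents, for illustration only.
checkout = {("S1", "paired"), ("S1", "overlapped"), ("S2", "paired")}
assembly = {("S1", "paired"), ("S2", "paired"), ("S3", "single")}
both = samples_to_process(checkout, assembly)
print(sorted(both))
```

Here S1 overlapped reads and sample S3 are dropped because each is present in only one of the two outputs.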
java -jar migec.jar CdrBlastBatch [options] -R gene [checkout_output_folder/ or .] [assemble_output_folder/ or .] output_folder
Several default CdrBlast parameters can be set:
--default-mask <R1=[0,1]:R2=[0,1]> - mask that specifies for which
read(s) in paired-end data CDR3 extraction is performed. With a
0:0 mask, only overlapped reads are processed
--default-species - default species to be used for all samples, human (used by default) or mouse
--default-file-types - default file types (paired,
overlapped or single) to be processed for each sample. If several file
types are specified, the corresponding raw and assembled files will be
combined and used as an input to CdrBlast
--default-quality-threshold <Phred=[2..40],CQS=[2..40]> - quality
threshold pair, default for all samples. First threshold in pair is used
for raw sequence quality (sequencing quality phred) and the second one
is used for assembled sequence quality (CQS score, the fraction of reads
in MIG that contain dominant letter at a given position)
--no-sort - no sorting is performed for output files, which speeds up
processing. This can safely be used in the full pipeline, as
FilterCdrBlastResults will provide the final clonotype table in sorted format
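To make the value formats of --default-mask and --default-quality-threshold concrete, here is a minimal validation sketch. The helper names are hypothetical and the example values are illustrative, not MIGEC defaults; the sketch only assumes that the mask is written as two 0/1 flags joined by a colon and the threshold pair as two integers joined by a comma, as the placeholders above suggest.

```python
def parse_mask(mask):
    """Parse a mask such as '0:1' into (use_r1, use_r2) booleans."""
    parts = mask.split(":")
    if len(parts) != 2 or any(p not in ("0", "1") for p in parts):
        raise ValueError(f"mask must look like '0:1', got {mask!r}")
    return parts[0] == "1", parts[1] == "1"


def parse_quality_pair(pair):
    """Parse a threshold pair such as '25,30' into (phred, cqs) integers."""
    parts = pair.split(",")
    if len(parts) != 2:
        raise ValueError(f"expected 'Phred,CQS', got {pair!r}")
    phred, cqs = (int(p) for p in parts)  # int() raises ValueError on junk
    for name, value in (("Phred", phred), ("CQS", cqs)):
        if not 2 <= value <= 40:
            raise ValueError(f"{name} threshold {value} is outside 2..40")
    return phred, cqs


use_r1, use_r2 = parse_mask("0:1")        # extract CDR3 from R2 only
phred, cqs = parse_quality_pair("25,30")  # illustrative values
print(use_r1, use_r2, phred, cqs)
```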
A sample metadata file can also be provided with the
--sample-metadata <file_name> argument to guide the batch CDR3
extraction. This file should be a tab-separated table with the following columns:
Sample ID | Species | Gene | File types | Mask | Quality threshold pair
See section below for more details.
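As a rough illustration of the metadata layout, the following sketch reads such a tab-separated table into dictionaries. It is not part of MIGEC: the function name is hypothetical, the sample rows are invented, and treating a first row that repeats the column names as an optional header is an assumption.

```python
import csv
import io

COLUMNS = ["Sample ID", "Species", "Gene", "File types", "Mask",
           "Quality threshold pair"]


def read_sample_metadata(handle):
    """Read tab-separated sample metadata into a list of dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields == COLUMNS:
            continue  # skip blank lines and an optional header row (assumption)
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {fields}")
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


# Invented example content; a real file would be opened with open(path).
example = io.StringIO(
    "S1\thuman\tTRA\tpaired\t0:1\t25,30\n"
    "S2\thuman\tTRB\toverlapped\t0:0\t25,30\n"
)
rows = read_sample_metadata(example)
for row in rows:
    print(row["Sample ID"], row["Gene"], row["File types"])
```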
The output of V-D-J mapping routines of MIGEC is a standard tab-delimited clonotype table with some information on the number of reads and UMI tags that correspond to a given clonotype.
Each clonotype is specified by count, fraction, V, D and J segment identifier list, CDR3 nucleotide and amino acid sequence.
The positions of the last V nucleotide, the first and last D nucleotides, and the first J nucleotide specify the germline region markup within the hypervariable CDR3 sequence. They are given in 0-based coordinates, where 0 marks the first base of CDR3.
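The coordinate convention can be illustrated by slicing a CDR3 string into its germline and inserted parts. This is a sketch under stated assumptions: the function name is hypothetical, the sequence is synthetic, and representing a missing D segment as None is an assumption of the example rather than the value MIGEC writes to the table.

```python
def split_cdr3(cdr3, v_end, d_start, d_end, j_start):
    """Split a CDR3 using 0-based, inclusive segment positions.

    v_end   - position of the last V nucleotide
    d_start - position of the first D nucleotide (None if no D was mapped)
    d_end   - position of the last D nucleotide (None if no D was mapped)
    j_start - position of the first J nucleotide
    """
    v = cdr3[: v_end + 1]
    j = cdr3[j_start:]
    if d_start is None or d_end is None:
        # No D segment: everything between V and J counts as one insert.
        return {"V": v, "VJ insert": cdr3[v_end + 1 : j_start], "J": j}
    return {
        "V": v,
        "VD insert": cdr3[v_end + 1 : d_start],
        "D": cdr3[d_start : d_end + 1],
        "DJ insert": cdr3[d_end + 1 : j_start],
        "J": j,
    }


# Synthetic CDR3 assembled from known parts so the markup is easy to check.
parts = ["TGTGCC", "AG", "GGGAC", "T", "GAGCAGTTC"]
cdr3 = "".join(parts)
regions = split_cdr3(cdr3, v_end=5, d_start=8, d_end=12, j_start=14)
print(regions)
```

Because the end positions are inclusive, each slice adds one to the last V and last D positions.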
The total reads and good reads fields contain the number of reads
supporting a given clonotype before and after quality filtering.
The total events and good events fields contain the number of UMI tags
supporting a given clonotype before and after quality filtering. For raw
(unassembled) data these are equal to total reads and good reads, respectively.
A script to map V-(D)-J junctions, extract CDR3 sequences and assemble clonotypes.
java -jar migec.jar CdrBlast [options] -R gene file1.fastq[.gz] [file2.fastq[.gz] ...] output_file
Standard, assuming an example of a library containing T-cell Receptor Alpha Chain sequences
in case of MIG-assembled data:
java -jar migec.jar CdrBlast -a -R TRA assembly/S1_R2.fastq.gz cdrblast/S1_asm.cdrblast.txt
for raw data:
java -jar migec.jar CdrBlast -R TRA checkout/S1_R2.fastq.gz cdrblast/S1_raw.cdrblast.txt
to concatenate and process two or more FASTQ files at once:
java -jar migec.jar CdrBlast -R TRA checkout/S1_R2.fastq.gz checkout/S2_R2.fastq.gz cdrblast/S12_raw.cdrblast.txt
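When many samples are processed by script, the command lines above can be generated programmatically. The sketch below only builds the argument list and does not run anything; the helper name is hypothetical, and it uses just the -a and -R options shown in the examples above.

```python
def cdrblast_command(gene, inputs, output, assembled=False, jar="migec.jar"):
    """Build a CdrBlast command line as a list of arguments."""
    if not inputs:
        raise ValueError("at least one FASTQ file is required")
    cmd = ["java", "-jar", jar, "CdrBlast"]
    if assembled:
        cmd.append("-a")  # input is MIG-assembled data
    cmd += ["-R", gene, *inputs, output]
    return cmd


asm_cmd = cdrblast_command(
    "TRA",
    ["assembly/S1_R2.fastq.gz"],
    "cdrblast/S1_asm.cdrblast.txt",
    assembled=True,
)
raw_cmd = cdrblast_command(
    "TRA",
    ["checkout/S1_R2.fastq.gz", "checkout/S2_R2.fastq.gz"],
    "cdrblast/S12_raw.cdrblast.txt",
)
print(" ".join(asm_cmd))
print(" ".join(raw_cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once migec.jar and the input files are in place.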
The gene parameter -R is required; supported genes are TRA, TRB,
TRG, TRD, IGH, IGK and IGL. Several chains can be specified, e.g.
-R TRA,TRB or -R IGH,IGL. Species can be provided with the
-S parameter (HomoSapiens is used by default); supported species include
HomoSapiens, MusMusculus and others. Assembled data should be passed
to the script with the -a option. The --same-sample option should be
used if several assembled files are provided from the same sample, so
that duplicate UMIs are discarded and not counted twice.
To get sorted output, use the -o option; otherwise sorting will be
performed at the FilterCdrBlastResults step. Note that both raw and
assembled data must be processed to apply the last filtration step.
In order to use all alleles, not just the major (*01 ones), use the
--all-alleles option. To include non-coding segments (V segment
pseudogenes) use the