Releases: milaboratory/mixcr
MiXCR v4.2.0
Built-in support for new protocols
- BD Rhapsody full-length protocol
- Smart-Seq2 single cell RNA-Seq protocol
- Oxford Nanopore long-read technology
Sample barcodes
Complete support of sample barcodes that may be picked up from all possible sources:
- from names of input files;
- from index I1/I2 FASTQ files;
- from sequence header lines;
- from inside the tag pattern.
Now one can analyze multiple patient samples at once. Along with a powerful file name expansion functionality, one can process any kind of sequencing protocol with any custom combination of sample, cell and UMI barcoding.
Multiple samples can be processed in two principal modes with respect to sample barcodes: (1) data can be split by sample right at the align stage and processed separately, or (2) all samples can be processed as a single set of sequences and separated only at the very last exportClones step. Both approaches have their pros and cons, so the best strategy can be chosen for the given experimental setup and study goals.
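The difference between the two modes can be illustrated with a small Python sketch. It is not MiXCR code: the read tuples, the barcode-to-sample table and both function names are hypothetical, and the "analysis" is reduced to counting identical sequences.

```python
from collections import Counter, defaultdict

# Hypothetical sample table: sample barcode -> sample name.
SAMPLE_TABLE = {"ACGT": "patient_1", "TTAG": "patient_2"}

# Toy reads as (sample_barcode, sequence) pairs.
READS = [
    ("ACGT", "CASSLGQETQYF"),
    ("ACGT", "CASSLGQETQYF"),
    ("TTAG", "CASSLGQETQYF"),
    ("TTAG", "CASSPDRGYEQYF"),
]


def split_early(reads, sample_table):
    """Mode 1: split reads by sample first, then count clones per sample."""
    per_sample = defaultdict(list)
    for barcode, seq in reads:
        per_sample[sample_table[barcode]].append(seq)
    return {sample: Counter(seqs) for sample, seqs in per_sample.items()}


def split_late(reads, sample_table):
    """Mode 2: count (barcode, sequence) pairs over the pooled data and
    separate the table by sample only at export time."""
    pooled = Counter(reads)  # keys are (barcode, sequence)
    exported = defaultdict(Counter)
    for (barcode, seq), count in pooled.items():
        exported[sample_table[barcode]][seq] += count
    return dict(exported)


early = split_early(READS, SAMPLE_TABLE)
late = split_late(READS, SAMPLE_TABLE)
print(early["patient_1"])
print(late["patient_2"])
```

In this toy example both modes produce the same per-sample tables. In a real pipeline the results may differ, because intermediate steps see either a single sample or the pooled data.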
New robust filters for single cell and molecular barcoded data
For 10x Genomics and other fragmented protocols, a new powerful k-mer based filtering algorithm is now used to eliminate cross-cell contamination coming from plasma cells.
For UMI filtering, a new algorithm from the paper by J. Barron (2020) allows for better automated histogram thresholding in barcoded data filtering.
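To make the idea of automated histogram thresholding concrete, here is a minimal sketch that applies classic Otsu's method to log-scaled reads-per-UMI counts. This is a simpler stand-in chosen for illustration. It is neither the algorithm from the cited paper nor MiXCR's implementation, and the function name and the toy counts are assumptions.

```python
import math


def otsu_threshold(values):
    """Return the cut between two sorted neighbours that maximises
    the between-class variance (classic Otsu criterion)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_cut, best_score = None, -1.0
    left_sum = 0.0
    for i in range(1, n):  # split into xs[:i] and xs[i:]
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # no boundary between equal values
        w0, w1 = i / n, (n - i) / n
        m0 = left_sum / i
        m1 = (total - left_sum) / (n - i)
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_score = score
            best_cut = (xs[i - 1] + xs[i]) / 2
    return best_cut


# Toy reads-per-UMI counts: low-count noise plus well-covered UMIs.
reads_per_umi = [1, 1, 1, 2, 1, 2, 1, 40, 55, 61, 48, 70, 52]
log_counts = [math.log2(c) for c in reads_per_umi]
cut = otsu_threshold(log_counts)
kept = [c for c in reads_per_umi if math.log2(c) > cut]
print(f"threshold = {2 ** cut:.1f} reads per UMI, kept {len(kept)} UMIs")
```

The threshold is chosen on the log scale because reads-per-UMI distributions typically span orders of magnitude. UMIs below the cut are treated as noise and dropped.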
List of all changes
Sample barcodes
- support for more than two fastq files as input (I1 and I2 reads support)
- multiple possible sources of data for sample resolution:
  - sequences extracted with tag pattern (including those coming from I1 and I2 reads)
  - samples can be based on specific pattern variant (with multi-variant patterns, separated by ||, allows to easily adopt MiGEC-style-like sample files)
  - parts of file names (extracted using file name expansion mechanism)
- flexible sample table matching criteria
  - matching multiple tags
  - matching variant id from multi-variant tag patterns
- special --sample-table mixin option allowing for flexible sample table definition in a tab-delimited table form
- special --infer-sample-table mixin option to infer sample table for sample tags from file name expansion
- special generic presets for multiplexed data analysis scenarios (e.g. generic-tcr-amplicon-separate-samples-umi)
- align command now optionally allows to split output alignments by sample into separate vdjca files
- exportClones command now supports splitting the output into multiple files by sample
- analyze command supports new splitting behaviour of the align command, separately running all the analysis steps for all the output files (if splitting is enabled)
Filters and error correction
- preset for 10X VDJ BCR enhanced with k-mer-based filter to eliminate rare cross-cell contamination from plasma cells
- new advanced thresholding algorithm from the paper by J. Barron (2020) allows for better automated histogram thresholding in barcoded data filtering
- rework of clustering step aimed at PCR / reverse-transcription error correction in assemble; it now correctly handles any possible tag combination (sample, cell or molecule)
- new feature to add histogram preprocessing steps in automated thresholding
Quality trimming
- turn on default quality trimming (trimmingQualityThreshold changed from 0 to 10); this setting showed better performance in many real-world use cases
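As a rough illustration of what a quality threshold of 10 versus 0 means, the sketch below trims low-quality bases from both ends of a read. It is a deliberately simplified end-trimming routine written for this note. How MiXCR applies trimmingQualityThreshold internally is not described here, and the function names are hypothetical.

```python
def trim_low_quality_ends(seq, quals, threshold=10):
    """Drop bases from both ends while their Phred quality is below threshold.

    Simplified end trimming for illustration; a threshold of 0 trims nothing.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < threshold:
        start += 1
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return seq[start:end], quals[start:end]


def phred33(quality_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integers."""
    return [ord(ch) - 33 for ch in quality_string]


seq = "ACGTACGTAC"
quals = phred33("##IIIIII#!")  # '#' = Q2, 'I' = Q40, '!' = Q0
trimmed_seq, trimmed_quals = trim_low_quality_ends(seq, quals, threshold=10)
print(trimmed_seq)
```

With the threshold set to 0 the same call returns the read unchanged, which corresponds to trimming being switched off.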
Reference library
- reference V/D/J/C gene library upgrade to repseqio v2.1 (see changelog)
New commands
- added command exportReportsTable that prints a table with report data from the commands that were run
Other
- optimized aligner parameters for long-read data
- fixed system temp folder detection behaviour, now mixcr respects TMPDIR environment variable
- rework of preset-mixin logic, now external presets (like those starting from local:...) are packed into the output *.vdjca file on align step, the same applies to all externally linked information, like tag whitelists and sample lists. This behaviour facilitates better analysis reproducibility and more transparent parameter logistics.
- new mixin options to adjust tag refinement whitelists with analyze: --set-whitelist and --reset-whitelist
- removed refineTagsAndSort options -w and --whitelist; corresponding deprecation error message printed if used
- new grouping feature for exportClones, allowing to normalize values for -readFraction and -uniqueTagFraction ... columns to totals for certain compartments instead of normalizing to the whole dataset. This feature allows to output e.g. fractions of reads inside the cell.
- new mixin options --add-export-clone-table-splitting, --reset-export-clone-table-splitting, --add-export-clone-grouping and --reset-export-clone-grouping
- improved sensitivity of findAlleles command
- add tags info in exportAlignmentsPretty and exportClonesPretty
- add --chains filter for exportShmTrees, exportShmTreesWithNodes, exportShmTreesNewick and exportPlots shmTrees commands
- fixed old bug #353, now all aligners favor leftmost J gene in situations where multiple genes can be found in the sequence (i.e. mis-spliced mRNA)
- fixes exception in align happening for not-parsed sequences with writeFailedAlignments=true
- new filter and parameter added in assemblePartial; parameter name is minimalNOverlapShare, it controls minimal relative part of N region that must be covered by the overlap to conclude that two reads are from the same V(D)J rearrangement
- default paired-end overlap parameters changed to slightly more relaxed version
- better criteria for alignments to be accepted for the assemblePartial procedure
- fixed NPE in assemblePartial executed for the data without C-gene alignment settings
- fixed rare exception in exportAirr command
- by default, exports now show messages like 'region_not_covered' for data that can't be extracted (requesting -nFeature for a region that is not covered or a tag that does not exist). Option --not-covered-as-empty restores the previous behaviour
- info about genes with enough data to find alleles was added to the report of findAlleles and to the description of alleles
- fixed error message appearing when analysis parameter already assigned to null is overridden by null using the -O... option
- fixed wrong reporting of number of trimmed letters from the right side of R1 and R2 sequence
- fixed error message about repeated generic mixin overrides
- fixed error of exportClones with some arguments
- fixes for report indentation artefacts
- fixed bug when chains filter set to ALL in exportAlignments was preventing not-aligned records from being exported
- fixed runtime exception in assemble arising in analysis of data with CELL barcodes but without UMIs, with consensus assembly turned off
- fixed bug leading to incorrect mixin option ordering during its application to parameters bundle
- minor change to the contigAssembly filtering parametrization
- added mix-in --export-productive-clones-only
- warning message about automatically set -Xmx.. JVM option in mixcr script
- safer automatic value for -Xms..
- fix: added species flag to 10x, nanopore and smart-seq2 presets
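The grouping feature for exportClones listed above changes the denominator used for fraction columns. The following sketch shows the arithmetic only, under assumed data: the field names are hypothetical and are not MiXCR column names, and the function is not part of MiXCR.

```python
from collections import defaultdict

# Toy clone table; the field names are hypothetical, not MiXCR column names.
clones = [
    {"clone": "A", "cell": "cell_1", "reads": 80},
    {"clone": "B", "cell": "cell_1", "reads": 20},
    {"clone": "C", "cell": "cell_2", "reads": 300},
    {"clone": "D", "cell": "cell_2", "reads": 100},
]


def read_fractions(rows, group_key=None):
    """Return {clone: fraction}. Without group_key the denominator is the
    whole dataset; with it, the denominator is the total of the row's group."""

    def group_of(row):
        return row[group_key] if group_key else None

    totals = defaultdict(int)
    for row in rows:
        totals[group_of(row)] += row["reads"]
    return {row["clone"]: row["reads"] / totals[group_of(row)] for row in rows}


print(read_fractions(clones))          # normalised to the whole dataset
print(read_fractions(clones, "cell"))  # normalised within each cell
```

Normalising within a cell makes the dominant chain of a weakly covered cell visible, whereas dataset-wide fractions are dominated by highly covered cells.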
MiXCR v4.1.2
Major changes
- Command findShmTrees now can build trees from inputs with different tags
- Added --impute-germline-on-export and --dont-impute-germline-on-export to exportAlignments and exportClones commands
Minor improvements
- Now, instead of specifying separately multiple tags of the same type (i.e. CELL1+CELL2+CELL3) in filters, one can use convenient aliases (like allTags:Cell, allTags:Molecule). This also facilitates creation of more generic base presets implementing common single-cell and UMI filtering strategies.
- Several command line interface improvements
- Migration from <tag_name> to <tag_type> semantics in export columns and --split-by-tag options
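The allTags aliases mentioned above can be thought of as a lookup from a tag type to all tag names of that type. The sketch below illustrates that idea only. The tag table, the function name and the resolution logic are assumptions made for this example, not MiXCR internals.

```python
# Hypothetical tag table for one dataset: tag name -> tag type.
TAG_TYPES = {
    "CELL1": "Cell",
    "CELL2": "Cell",
    "CELL3": "Cell",
    "UMI": "Molecule",
}


def expand_tag_spec(spec, tag_types):
    """Expand 'allTags:<type>' into the sorted tag names of that type;
    an explicit 'A+B+C' list is split and returned as given."""
    prefix = "allTags:"
    if spec.startswith(prefix):
        wanted = spec[len(prefix):]
        return sorted(name for name, kind in tag_types.items() if kind == wanted)
    return spec.split("+")


print(expand_tag_spec("allTags:Cell", TAG_TYPES))
print(expand_tag_spec("CELL1+CELL2+CELL3", TAG_TYPES))
```

A filter written against the alias keeps working when a protocol uses one, two or three cell barcodes, which is what makes generic base presets possible.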
Fixes
- fixes bug with saveOriginalReads=true on align leading to errors down the pipeline
- analyze now correctly terminates on first error
- correct progress reporting in align with multiple input files provided by file name expansion mechanism
- fix --only-observed behaviour in exportShmTreesWithNodes
- fix missing tile in heatmap
- fix some cases of usage of -O...
Presets
- Fixed issue with mouse presets from MiLaboratories
- Fixed presets with whitelists
- Fixed missing material type and species in several presets
- Added template switch region trimming for RACE protocols
- Added presets for:
  - Thermo Fisher Oncomine kits
  - ParseBio single-cell protocols
  - iRepertoire kits
  - Preset for protocol described in Vergani et al. (2017)
  - Cellecta AIR kit
MiXCR v4.1.1
Overview
With this release we continue extending the set of supported single-cell protocols by adding new ready-to-use presets to our collection. In addition to the newly supported protocols and the features required for their reliable processing, this release comes with many usability optimizations and stability improvements. See details below:
Major changes
- presets for analysis of all types of BD Rhapsody data (see docs for the list of supported kits)
- analysis of data produced by single-cell protocol described in Han et al. (2014) (see docs)
- special presets for exome data analysis: exom-cdr3 and exom-full-length
- initial support for overlap-extension-based chain pairing protocols
- possibility to export groups of similar columns specifying single option (like -allAAFeatures <from_reference_point> <to_reference_point>)
- user-friendly alternative for -uniqueTagsCount: -allUniqueTagsCount; allows to export counts of unique tag combinations (useful for protocols with multiple CELL and UMI barcodes)
- new "by sequence" filters for all somatic hypermutation trees (SHMT) exports
- new weighted auto-threshold selection and complementary metric histogram aggregation modes (i.e. y-axis on reads-per-UMI plots can now show number of reads instead of number of UMIs)
- detected allelic variants can now also be exported in fasta format right from the findAlleles command
- better algorithm for seed sequence selection in consensus assembly routine in assemble; increases productive consensus count for cases with multi-variant tag groups (i.e. birthday paradoxes in UMI data or single-cell data analysis without UMIs)
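The "birthday paradox" mentioned in the last item is the reason one UMI can end up labelling several different molecules. A short worked calculation shows how quickly this happens. It assumes UMIs are drawn uniformly from all possible sequences, which is an idealisation, and the function name is hypothetical.

```python
import math


def umi_collision_probability(n_molecules, umi_length):
    """Probability that at least two of n molecules carry the same UMI,
    assuming UMIs are drawn uniformly from all 4**umi_length sequences."""
    n_umis = 4 ** umi_length
    if n_molecules > n_umis:
        return 1.0
    # log of prod_{i=0}^{n-1} (1 - i / n_umis), summed for numerical stability
    log_p_unique = sum(math.log1p(-i / n_umis) for i in range(n_molecules))
    return 1.0 - math.exp(log_p_unique)


for n in (100, 1_000, 10_000):
    print(n, round(umi_collision_probability(n, umi_length=12), 4))
```

Even with roughly 16.8 million possible 12-nt UMIs, collisions become likely once a sample contains on the order of ten thousand molecules, so a consensus assembler has to cope with tag groups that contain more than one true sequence.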
Minor changes:
- minor adjustments for existing presets
- many CLI and parameter validation fixes, more human-readable error messages, better protection from common input errors
- support for preset-embedded tag whitelists for protocols with small number of barcode variants
- options --use-local-temp, --threads, --not-aligned-R1 (R2) and --not-parsed-R1 (R2) are now available in analyze, additionally to individual step commands
- bugfix for imputation in export for compound gene features
- other minor fixes and enhancements
MiXCR v4.1
Overview
MiXCR 4.1 features two major functional upgrades:
- essential fixes and improvements for the single-cell and molecular-barcoded data processing algorithms
- new powerful set of tools for allelic variant discovery and analysis of antibody hypermutation trees
Along with these features, the release brings a radically simplified user interface, which reduces all the complexities of the repertoire analysis pipeline down to a single command, where only one option, the “preset”, has to be specified. MiXCR 4.1 is shipped with many specifically optimized presets covering most repertoire analysis cases. The upgrades introduced in this release also significantly increase the transparency of the analysis pipeline by providing a diverse set of new graphical QC reports and adding dozens of new metrics to textual and JSON reports. Additionally, this release incorporates tens of important fixes, performance optimizations and stability improvements.
Documentation portal
Along with the software release, we present a new documentation portal. It features a clean content organization, informative illustrations, deep guides on many real-world repertoire analysis scenarios and detailed descriptions for each of the MiXCR commands and analysis presets.
Welcome to https://docs.milaboratories.com/
Improvements for single-cell and molecular-barcoded data analysis
Based on our deep research of a large number of single-cell and molecular barcoded datasets, generated with dozens of protocols and instruments in a wide set of laboratory setups, we developed several important upgrades to the algorithms engaged in analysis of tagged data. With all the improvements and fixes, MiXCR 4.1 produces clean and reliable results for the majority of popular wet-lab protocols, being robust to a wide range of protocol noises, cross contamination mechanisms and artifacts. The set of tools offered by MiXCR 4.1 allows it to be applied for virtually any data of such type.
Featured fixes and upgrades:
- new high-performance aligner settings optimized for single-cell T- and B-cell receptor datasets
- important fixes for assemblePartial algorithm for tagged data
- redesign of tag correction algorithm to increase performance and decrease memory consumption
- whitelist-based barcode correction in refineTagsAndSort step (f/k/a correctAndSortTags)
- comprehensive options for data filtering, applied right after barcode sequence correction (in refineTagsAndSort)
- algorithms for automated threshold selection in refineTagsAndSort filters
- multiple improvements for consensus assembly algorithm (which pre-assembles consensuses from tagged groups in assemble); increased performance and stability in respect to data artifacts
- automated inference of minimal number of reads in consensus
- de-contamination filters in assemble to fight cross-cell contaminations
- rework of assembleContigs algorithm to increase robustness in respect to data artifacts
- many new QC metrics from tag pattern parsing, sequence correction, consensus to contig assembly algorithms
SHM trees & Allele discovery
MiXCR 4.1 introduces two new comprehensive tools for analysis of hypermutation trees of antibodies. The first is the de-novo discovery of V and J gene alleles provided by the findAlleles command. The second is the SHM trees reconstruction tool provided by the findShmTrees command. These two features go hand in hand and help each other to accurately separate allelic variants from somatic mutations and reconstruct mutation tree topology, given the set of samples for the same individual. We implemented new original algorithms for these tasks; both are based on sophisticated analysis of alignments with germline segments, rather than naive reconstruction of mutation histories regardless of the sequence structure, as implemented in other tools. This functionality is accompanied by a set of commands to export SHM trees in several formats: exportShmTrees, exportShmTreesWithNodes, exportShmTreesNewick and exportPlots shmTrees.
- For the correct lineage tree reconstruction, it is critical to first have accurate V- and J-gene allele information for a particular donor or mouse strain. Hence, it is highly recommended to first run findAlleles and re-align all clonotype sequences (option -o) to a newly generated individual reference V- and J-gene library. findAlleles utilizes an allele inference algorithm which can use even somatically hypermutated clonal sequences as input data.
- Both findAlleles and findShmTrees commands support multiple .clns files as input, so the alleles can be inferred and lineage trees can be reconstructed using all available datasets. Note that it only makes sense to use datasets derived from an individual donor (or homogenic mouse strain) per command launch.
- All commands produce extensive reports and auxiliary tables providing additional transparency in the algorithm performance
Presets and refreshed CLI
From now on, most users can run the whole pipeline, specifying just a single option, the preset name, in addition to the input and output file names.
MiXCR provides tens of fine tuned sets of parameters (presets) to extract repertoires from the data generated with most of the commercially available kits and instruments as well as with the well established open protocols, including single-cell, bulk repertoire sequencing with or without molecular-barcodes and non-enriched data like RNA-Seq.
For example you can run the whole analysis (from fastq to clonesets) for the dataset generated with MiLaboratories Human TCR RNA Multiplex kit using the following command:
mixcr analyze milab-human-tcr-rna-multiplex-cdr3 input_file_R1.fastq.gz input_file_R2.fastq.gz results_prefix
This will produce a full set of intermediate files, with tsv clonesets and extensive report files both in txt and json formats.
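When many datasets are processed with the same preset, the command above is often assembled by a script. A minimal Python sketch is given below. The helper name is hypothetical, and the only command line it produces is the one shown in the example above.

```python
import shlex


def build_analyze_command(preset, inputs, output_prefix):
    """Assemble the argument list: mixcr analyze <preset> <inputs...> <prefix>."""
    return ["mixcr", "analyze", preset, *inputs, output_prefix]


cmd = build_analyze_command(
    "milab-human-tcr-rna-multiplex-cdr3",
    ["input_file_R1.fastq.gz", "input_file_R2.fastq.gz"],
    "results_prefix",
)
print(shlex.join(cmd))
# To actually run it (requires MiXCR and a license file to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a string avoids quoting problems with unusual file names when it is passed to subprocess.run.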
The preset functionality is accompanied by a set of special high-level command line options, which we call mixins, that help adapt the selected preset if the experimental setup requires non-standard analysis (though this is not required in most cases).
The following improvements were made to MiXCR’s CLI:
- analyze command was completely redesigned (see example above)
- mixin options were introduced; can be specified on analyze, align or, for some mixins, on other pipeline stages
- new refreshed and polished CLI help
- new safer and more reliable file name expansion mechanism, {{a}} and {{R}} pattern elements added; now one can specify ... input_file_{{R}}.fastq.gz output.vdjca instead of ... input_file_R1.fastq.gz input_file_R2.fastq.gz output.vdjca
- all reports and analysis parameters are now embedded into the output files and can be easily retrieved afterwards
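The effect of the {{R}} element described above can be mimicked in a few lines of Python. This sketch covers only the R1/R2 case taken from the example in the list. It is not MiXCR's expansion code, the function name is hypothetical, and the {{a}} element is not modelled.

```python
import re


def expand_read_pattern(pattern, available_files):
    """Resolve a '{{R}}' pattern to its [R1, R2] files among available_files.

    Only the R1/R2 spelling is handled here; this is an illustrative subset.
    """
    if "{{R}}" not in pattern:
        return [pattern]
    head, tail = pattern.split("{{R}}", 1)
    regex = re.compile(re.escape(head) + r"R([12])" + re.escape(tail))
    found = {}
    for name in available_files:
        match = regex.fullmatch(name)
        if match:
            found[match.group(1)] = name
    if set(found) != {"1", "2"}:
        raise FileNotFoundError(f"expected both R1 and R2 files for {pattern!r}")
    return [found["1"], found["2"]]


files = ["input_file_R1.fastq.gz", "input_file_R2.fastq.gz", "other_R1.fastq.gz"]
print(expand_read_pattern("input_file_{{R}}.fastq.gz", files))
```

Failing loudly when one mate is missing is a deliberate choice in this sketch: silently running a paired-end analysis on a single file is a common source of confusing results.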
Graphical QC plots
MiXCR 4.1 introduces a new exportQc command to visualize different quality control metrics including alignment performance, chain usage, reads coverage, barcode abundance distribution, automatically selected correction threshold etc.
Many other fixes & improvements
- fix a bunch of visualization issues #743, #747, #748, #749, #750, #751
- added bar plot gene usage plots
- added gene family usage plots
- better naming for diversity and overlap measures
- rename biophysics to cdr3metrics in postanalysis
- support of svg / png and other graphical formats in exportPlots
- allow samples with different data types (umi/no-umi) been used in overlapScatter when
implement cutting contig results by assemble region
- introduce --pairwise-comparisons instead of --hide-pairwise-comparison in exportPlots diversity / biophysics
- fixed wrong sign for hydrophobicity metric in downstream analysis
- fixed incorrect behaviour of clonotype splitting by V, J and C genes
- multiple bug fixes for post analysis downsampling
- added --show-significance option in exportPlots diversity / biophysics
- fix NPE in overlap browser when some clones do not contain gene feature specified in overlap criteria
- splitting of clones on export; there is no need to run exportClones command multiple times (only “by chain” option is currently implemented)
- new export fields for single-cell and molecular barcodes (i.e. -tagFraction)
- fixes for --not-aligned-R1/2 option for tagged analysis
- incomplete V gene feature correction for AIRR export, if vFeatureToAlign was adjusted to exclude primer sequence from alignment
- options to export reads that were not parsed according to the tag pattern (--not-parsed-R1/2)
- start from BAM file
- CLI and several other parts are (re)implemented in Kotlin
- temporary files are now placed in the system temp folder by default; option to move them to the folder of output files: --use-system-temp
- fixed bug in assemble report caused by pre-clone assembler which did not report 'failed to extract target'
- fixed NPE in assembleContigs with disjoint features (#727)
- better ChainsUsage report (#732)
- factor-by option for overlap downstream analysis
- allow lowercase...
MiXCR v4.0
Comprehensive support for Single-Cell and Molecular barcodes
- flexible and fast pattern matching engine to parse barcodes from the data; allows to fit the pipeline to any commercially available or in-house wet lab protocol with molecular or/and cell barcodes
- error correction in barcode sequences
- two cooperating UMI and/or Cell-barcode-based steps for clonal sequence reconstruction:
  - consensus assembly (i.e. for well-framed amplicon sequencing)
  - contig assembly (i.e. for 10x-like enzymatically fragmented data)
- tag information preserved on all analysis steps and extensive QC reports are generated throughout the pipeline, providing maximal visibility into analysis performance and giving a powerful tool for wet lab issues investigation
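The idea behind pattern-based barcode parsing can be sketched with an ordinary Python regular expression. Note that this is standard regex syntax with named groups standing in for a tag pattern, not MiXCR's own pattern language. The barcode lengths and group names are assumptions chosen for the example.

```python
import re

# Python regex standing in for a tag pattern:
# 16-nt cell barcode, 10-nt UMI, then the rest of the read.
READ_PATTERN = re.compile(
    r"^(?P<CELL>[ACGT]{16})(?P<UMI>[ACGTN]{10})(?P<R1>[ACGTN]+)$"
)


def parse_tags(read):
    """Return {'CELL': ..., 'UMI': ..., 'R1': ...} or None if the read
    does not match the expected layout."""
    match = READ_PATTERN.match(read)
    return match.groupdict() if match else None


read = "AAACCTGAGAAACCAT" + "TTGCATGCAA" + "GGGAGTCTGCC"
print(parse_tags(read))
print(parse_tags("ACGT"))  # too short to contain both barcodes
```

Reads that return None here correspond to reads that fail tag pattern parsing; tracking how many there are is one of the most useful early QC signals for a barcoded protocol.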
See the following usage examples:
- https://github.com/milaboratory/mixcr/wiki/Analysis-of-10X-Single-Cell-data
- https://github.com/milaboratory/mixcr/wiki/Analysis-of-full-length-BCR-data-with-unique-molecular-identifiers
Downstream analysis
Set of powerful downstream analysis features with the ability to export postanalysis results in tabular format and vector plots with various statistical comparisons.
- Ability to group samples by metadata values and compare repertoire features between groups
- Comprehensive repertoire normalization and filtering
- Statistical significance tests with proper p-value adjustment
- Repertoire overlap analysis
- Vector plots output (.svg / .pdf)
- Tabular outputs
See the following usage guide:
Overlap browser
Added the exportClonesOverlap command, which efficiently builds and exports the overlap of an arbitrary number of clonesets.
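Conceptually, an overlap of several clonesets lists each clonotype together with the samples it occurs in. The sketch below shows that idea on toy in-memory data. It says nothing about how exportClonesOverlap is implemented, and the function name and sample names are hypothetical.

```python
from collections import defaultdict


def shared_clonotypes(clonesets):
    """Map each clonotype found in at least two samples to the sorted
    list of sample names it occurs in."""
    seen_in = defaultdict(list)
    for sample in sorted(clonesets):
        for clonotype in clonesets[sample]:
            seen_in[clonotype].append(sample)
    return {c: samples for c, samples in seen_in.items() if len(samples) >= 2}


clonesets = {
    "s1": {"CASSLGQETQYF", "CASSPDRGYEQYF"},
    "s2": {"CASSLGQETQYF"},
    "s3": {"CASSLGQETQYF", "CASSPDRGYEQYF", "CAWSVGNEQFF"},
}
overlap = shared_clonotypes(clonesets)
for clonotype in sorted(overlap):
    print(clonotype, overlap[clonotype])
```

This in-memory version only works for small inputs; handling an arbitrary number of large clonesets efficiently is exactly what a dedicated command is for.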
Major rework of contig assembly algorithm
- significantly increased accuracy and stability
- works with or without molecular or cell barcodes
- can be applied to (sc)RNASeq data with reasonable IG/TCR coverage to reconstruct long sequence outside the CDR3
Export in AIRR format
- multiple options to export alignment or clonal data in AIRR format
- provides better compatibility with 3rd-party tools from AIRR community (see also RepSeq.IO feature for generation of fasta libraries with IMGT-like gaps from repseqio formatted references)
See here for usage example.
Other improvements and changes
- new built-in reference library with new species and newest genome based library for human (see changelog here)
- complete rewrite of IO for intermediate files (much faster IO with parallel serialization and deserialization, more compact files - each block is compressed with LZ4, versatile random access features provides additional speedup)
- faster hash-based external (file-based) sorting algorithm for alignment and other regrouping tasks in UMI/Single-cell related tasks and operations requiring alignment to clone mapping
- input sequence quality-score based trimming enabled by default
- support for human-readable alignments export from *.clna files by clone index
- all steps are cleaned-up to be completely pure, i.e. for the same input, output will always be byte-to-byte equal (no analysis date or other variable pieces of information leaks to the output files)
- more stable amino acid and combined amino acid plus nucleotide mutations export
- slight default analysis parameter optimization
Obtaining a license file
MiXCR requires a license file to run. Academic users with no commercial funding can quickly obtain a MiXCR license for free at https://licensing.milaboratories.com/. We are committed to support academic community and provide our software free of charge for scientists doing non-profit research. Commercial trial license can be requested at https://licensing.milaboratories.com or by email to licensing@milaboratories.com.
For details see: https://github.com/milaboratory/mixcr/wiki/Using-license
MiXCR v3.0.13
- Fixed bug with wrong V gene selection in assembleContigs.
- analyze doesn't use .clna when contig assembly is not specified
- Fixed AlignConfiguration to account for trimming
- Added --threads option to analyze
- Added --library option to analyze
- Bug fix in partial assembler
MiXCR is free for non-profit use only (see LICENSE for details)!
For commercial use please contact licensing@milaboratory.com.
MiXCR v3.0.12
- Built-in reference library upgraded to v1.6 (see changes)
- Additional mixcr script optimizations for docker
MiXCR v3.0.11
- Fixes exception in assemble for multi-assembling-feature cases with zero length sequences
- Fixes empty FR3 imputed sequence in cases with zero assembled nucleotides on the 5' side of CDR3
- MiXCR execution script optimized for Docker (Java 11 is recommended for running MiXCR in container environment)
- Other fixes for MiXCR script
Starting from this release, we will maintain an official MiXCR Docker image.
MiXCR v3.0.10
- Fixes NPE in very rare cases with incompatible V gene selection for assembleContigs in case of partially annotated gene libraries
- Several fixes for sequence imputation algorithm in exportAlignments / exportClones
MiXCR v3.0.9
- Fixed wrong behaviour with score-based pre-filtering in split-by-V/J=true cases
- Chain usage statistics added to align and assemble JSON reports
- Fixed rare IndexOutOfBounds exception in -nFeatureImputed ...
- Added shortcut for --json-report = -j
- Sanity check for common mistake in analyze parameters