The PacBio Barcode Demultiplexer
Lima, the PacBio barcode demultiplexer, is the standard tool to identify barcode sequences in PacBio single-molecule sequencing data. Starting in SMRT Link v5.1.0, it is the tool that powers the Demultiplex Barcodes GUI-based analysis application. Previous versions of SMRT Link called lima's predecessors, pbbarcode and bam2bam, for demultiplexing. This new tool provides a better end-to-end user experience for analysis of multiplexed samples.
The latest version can be installed via bioconda.
Please refer to our official pbbioconda page for information on Installation, Support, License, Copyright, and Disclaimer.
Lima can demultiplex samples that have a unique per-sample barcode pair and have been pooled and sequenced on the same SMRT cell. There are four different methods to associate barcodes with a sample, by PCR or ligation:
- Sequence-specific primers
- Barcoded universal primers
- Barcoded adapters
- Probe-based linear barcoded adapters
In addition, there are three different barcode library designs. In order to describe a barcode library design, one can view it from a SMRTbell or read perspective. As lima supports CLR subread and CCS read demultiplexing, the following terminology is based on the per (sub-)read view.
In the overview above, the input sequence is flanked by adapters on both sides. The bases adjacent to an adapter are referred to as barcode regions. A read can have up to two barcode regions, leading and trailing. Either or both adapters can be missing and consequently the leading and/or trailing region is not being called.
For the symmetric and tailed library design, the same barcode is attached to both sides of the insert sequence of interest; the only difference is the orientation of the trailing barcode. For identification, one read with a single barcode region is sufficient.
For the asymmetric design, a different barcode pair is attached to the sides of the insert sequence of interest. In order to be able to identify a different barcode pair, a read with leading and trailing barcode regions is required.
Output barcode pairs are generated from the identified barcodes.
The barcode names are combined using the -- infix, for example bc1020--bc1045.
The sort order is defined by the barcode indices, lowest first.
Lima offers the following features:
- Process both CLR subreads and CCS reads
- BAM, FASTA, FASTQ in- and output
- Extensive reports that allow in-depth quality control
- Clip barcode sequences and annotate
- Agnostic of input barcode sequence orientation
- Split output files by barcode
- Full PacBio dataset support
- Peek into the first N ZMWs and get average barcode score
- Guess the subset of barcodes used in an input Barcode Set given a mean barcode score threshold
- Enhanced filtering options to remove ambiguous calls
- Double demux to remove PCR primers after barcode demultiplexing
Version 2.0.0: Full changelog here
Note: Any existing output files will be overwritten after execution.
Note: Always use --peek-guess to remove spurious barcode hits.
Run on CLR subread data:
$ lima movie.subreads.bam barcodes.fasta prefix.bam
$ lima movie.subreadset.xml barcodes.barcodeset.xml prefix.subreadset.xml
Run on CCS data:
$ lima --ccs movie.ccs.bam barcodes.fasta prefix.bam
$ lima --ccs movie.consensusreadset.xml barcodes.barcodeset.xml prefix.consensusreadset.xml
If you do not need to import the demultiplexed data into SMRT Link, it is advised to use --no-pbi, which omits the PBI index file, to minimize time to result.
Symmetric or Tailed options
CLR: --same
CCS: --same --ccs
Asymmetric options
CLR: --different
CCS: --different --ccs
Example execution
$ lima m54317_180718_075644.subreadset.xml Sequel_RSII_384_barcodes_v1.barcodeset.xml \
    m54317_180718_075644.demux.subreadset.xml --different --peek-guess
Input data is either unaligned CLR subreads, straight from a Sequel I/II, or
unaligned CCS reads, generated by CCS;
both in the PacBio enhanced BAM format. If you want to demux RSII data, first
use SMRT Link or bax2bam to convert h5 to BAM. In addition, a dataset XML
with one file entry, either a SubreadSet or ConsensusReadSet, is also allowed.
In addition, CCS reads input are also supported as FASTA or FASTQ, optionally
Barcodes are provided as a FASTA file, one entry per barcode sequence, no duplicate sequences, only upper-case bases, orientation agnostic (forward or reverse-complement, but NOT reversed). Example:
>bc1000
CTCTACTTACTTACTG
>bc1001
GTCGTATCATCATGTA
>bc1002
AATATACCTATCATTA
Please name your barcodes with an alphabetic character prefix to avoid later confusion of barcode name and index. Duplicate names or sequences are not permitted.
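The requirements above can be checked before running lima. The following is a minimal sketch in Python and is not part of lima; the function name validate_barcode_fasta and the wording of the returned messages are made up for illustration.

```python
def validate_barcode_fasta(fasta_text):
    """Return a list of problems found in a barcode FASTA (empty list means OK)."""
    records = []  # list of (name, sequence) in file order
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))

    problems = []
    seen_names, seen_seqs = set(), set()
    for name, seq in records:
        # Recommendation from the text: alphabetic prefix avoids name/index confusion.
        if not name[:1].isalpha():
            problems.append(f"{name!r}: name should start with an alphabetic character")
        if name in seen_names:
            problems.append(f"{name!r}: duplicate name")
        if seq in seen_seqs:
            problems.append(f"{name!r}: duplicate sequence")
        if not (seq.isalpha() and seq.isupper()):
            problems.append(f"{name!r}: sequence must consist of upper-case bases")
        seen_names.add(name)
        seen_seqs.add(seq)
    return problems
```

An empty result for the example barcodes above means the file satisfies the stated rules; any returned message points at the offending entry.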
Lima processes input reads grouped by ZMW, except if --per-read is chosen.
All barcode regions along the read are processed individually.
The final per-ZMW result is a summary over all barcode regions,
a pair of selected barcodes from the provided set of candidate barcodes;
subreads from the same ZMW will have the same barcode and barcode quality.
For a particular target barcode region, every barcode sequence gets aligned
as given and as reverse-complement, and the higher-scoring orientation is chosen;
the result is a list of scores over all candidate barcodes.
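To illustrate the orientation handling, here is a small Python sketch. It is not lima's implementation: toy_score is a hypothetical stand-in for the real alignment score, and best_orientation is an illustrative name. It also shows why a merely reversed barcode is not the same as its reverse-complement.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_orientation(barcode, region, score):
    """Score the barcode as given and as reverse-complement; keep the better one."""
    forward = score(barcode, region)
    revcomp = score(reverse_complement(barcode), region)
    if forward >= revcomp:
        return "forward", forward
    return "reverse-complement", revcomp


def toy_score(barcode, region):
    """Hypothetical placeholder score: 100 for an exact substring hit, else 0."""
    return 100 if barcode in region else 0


barcode = "CTCTACTTACTTACTG"
region = "AAAA" + reverse_complement(barcode) + "TTTT"
print(best_orientation(barcode, region, toy_score))        # found as reverse-complement
print(best_orientation(barcode[::-1], region, toy_score))  # reversed only: no hit
```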
If only same barcode pairs are of interest, symmetric/tailed, please use
--same to filter out different barcode pairs.
If only different barcode pairs are of interest, asymmetric, please use
--different to require at least two barcodes to be read and remove pairs with
the same barcode.
Lima generates multiple output files per default, all starting with the same
prefix as the output file, omitting the suffixes .bam, .subreadset.xml, and
.consensusreadset.xml. The report infix is lima. Example:
$ lima m54007_170702_064558.subreads.bam barcode.fasta /my/path/m54007_170702_064558_demux.subreadset.xml
For all output files, the prefix will be /my/path/m54007_170702_064558_demux.
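The prefix rule can be sketched in a few lines of Python. This is an illustration only: the suffix list is an assumption based on the file names used in the examples of this document, and the helper names output_prefix and report_files are hypothetical.

```python
# Assumed suffixes, taken from the output names used in this document.
KNOWN_SUFFIXES = (".consensusreadset.xml", ".subreadset.xml", ".bam")


def output_prefix(output_path):
    """Strip a known output suffix to obtain the shared file prefix."""
    for suffix in KNOWN_SUFFIXES:
        if output_path.endswith(suffix):
            return output_path[: -len(suffix)]
    return output_path


def report_files(output_path):
    """Names of the report files described in this section."""
    prefix = output_prefix(output_path)
    return [f"{prefix}.lima.{kind}" for kind in ("report", "summary", "counts")]


print(report_files("/my/path/m54007_170702_064558_demux.subreadset.xml"))
```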
The first file prefix.bam contains clipped records, annotated with
barcode tags, that passed filters.
Alternatively, if output file is fasta or fastq, the header of each sequence contains all tags, separated by a single whitespace, that would be present in the BAM format. Example FASTQ header:
@m54006_171006_044150/4588126/ccs bc=3,3 bl=CGCGCGTGTGTGCGTG bq=100 bt=CGCGCGTGTGTGCGTG bx=16,16 cx=12 qe=2235 ql=p\tttropqorrtnnH qs=16 qt=G^\IGR]K8S>>^\^p
In- and output compatibility matrix:
For CLR data, only XML and BAM are valid in- and output file types.
For CCS data, use the following compatibility matrix:
This means, you can use CCS FASTQ reads as input and FASTA as output, but not BAM as output.
$ lima movie.Q20.fastq Sequel_RSII_384_barcodes_v1.fasta demuxed.fastq --same
$ lima movie.Q20.fastq Sequel_RSII_384_barcodes_v1.fasta demuxed.bam --same
FATAL -|- Unsupported combination of FASTQ input and BAM output.
The second file is
prefix.lima.report, a tab-separated file about each ZMW, unfiltered.
This report contains any information necessary to investigate the demultiplexing
process and the underlying data.
A single row contains all reads of a single ZMW. For
--per-read, each row
contains one subread and ZMWs might span multiple rows.
The third file is prefix.lima.summary, which shows how many ZMWs have been filtered,
how many ZMWs are same/different, and how many reads have been filtered.
ZMWs input                (A) : 213120
ZMWs above all thresholds (B) : 176356 (83%)
ZMWs below any threshold  (C) : 36764 (17%)

ZMW marginals for (C):
Below min length              : 26 (0%)
Below min score               : 0 (0%)
Below min end score           : 5138 (13%)
Below min passes              : 0 (0%)
Below min score lead          : 11656 (32%)
Below min ref span            : 3124 (8%)
Without SMRTbell adapter      : 25094 (68%)
With bad adapter              : 10349 (28%) <- Only with --bad-adapter-ratio
Undesired hybrids             : xxx (xx%) <- Only with --peek-guess
Undesired same pairs          : xxx (xx%) <- Only with --different
Undesired diff pairs          : xxx (xx%) <- Only with --same
Undesired 5p--5p pairs        : xxx (xx%) <- Only with --isoseq
Undesired 3p--3p pairs        : xxx (xx%) <- Only with --isoseq
Undesired single side         : xxx (xx%) <- Only with --isoseq
Undesired no hit              : xxx (xx%) <- Only with --isoseq

ZMWs for (B):
With same pair                : 162244 (92%)
With different pair           : 14112 (8%)
Coefficient of correlation    : 32.79%

ZMWs for (A):
Allow diff pair               : 157264 (74%)
Allow same pair               : 188026 (88%)
Bad adapter yield loss        : 10112 (5%) <- Only with --bad-adapter-ratio
Bad adapter impurity          : 10348 (5%) <- Only without --bad-adapter-ratio

Reads for (B):
Above length                  : 1278461 (100%)
Below length                  : 2787 (0%)
Explanation of each block:
Number of ZMWs that went into lima, how many ZMWs have been passed into the output file, and how many did not qualify.
For those ZMWs that did not qualify, the marginal counts of each filter; each filter is explained in great detail elsewhere in this document.
For those ZMWs that passed, how many have been flagged as having a same or different barcode pair, and the coefficient of variation of the barcode ZMW yield distribution in percent (printed as "Coefficient of correlation" in the summary).
For all input ZMWs, how many allow calling a same or different barcode pair. This is a simplified view of how many ZMWs have at least one full pass, allowing a different barcode pair call, and how many ZMWs have at least half an adapter, allowing a same barcode pair call.
For those ZMWs that qualified, the number of reads that are above and below the provided --min-length threshold.
The fourth file is
prefix.lima.counts, a tsv file, that shows the counts of each
observed barcode pair; only passing ZMWs are counted.
$ column -t prefix.lima.counts
IdxFirst  IdxCombined  IdxFirstNamed  IdxCombinedNamed  Counts
1         1            bc1002         bc1002            113
14        14           bc1015         bc1015            129
18        18           bc1019         bc1019            106
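The per-pair counts can be used to recompute a coefficient of variation of the barcode yield, the metric described for the summary file. The Python sketch below is an illustration, not lima's code: it assumes the population standard deviation (lima's exact definition is not stated here), and the helper names are hypothetical.

```python
import math


def parse_counts(text):
    """Read the Counts column (last field) from prefix.lima.counts content."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    counts_col = header.index("Counts")
    return [int(row[counts_col]) for row in rows]


def coefficient_of_variation(counts):
    """Population standard deviation divided by the mean, in percent."""
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return 100.0 * math.sqrt(variance) / mean


example = """IdxFirst IdxCombined IdxFirstNamed IdxCombinedNamed Counts
1 1 bc1002 bc1002 113
14 14 bc1015 bc1015 129
18 18 bc1019 bc1019 106
"""
counts = parse_counts(example)
print(counts, round(coefficient_of_variation(counts), 2))
```

A low value indicates that ZMWs are distributed evenly across barcode pairs; a high value indicates uneven pooling.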
Using the option --dump-clips, clipped barcode regions are stored in the file
prefix.lima.clips. Example:
$ head -n 6 prefix.lima.clips
>m54007_170702_064558/4850602/6488_6512 bq:34 bc:11
CATGTCCCCTCAGTTAAGTTACAA
>m54007_170702_064558/4850602/6582_6605 bq:37 bc:11
TTTTGACTAACTGATACCAATAG
>m54007_170702_064558/4916040/4801_4816 bq:93 bc:10
Using the option
--dump-removed, records that did not pass provided thresholds
or are without barcodes, are stored in the file
One DataSet, SubreadSet or ConsensusReadSet, is generated per output BAM file.
One PBI file is generated per output BAM file.
Positive predictive values
Performance is measured as positive predictive value (PPV); it measures TP/(TP+FP), the ratio of true positive calls over all true and false positive calls. It informs us how much cross-calling has been observed between the desired barcode pairs. It is also known as precision. In order to compute a PPV, distinct amplicons of known lengths and origin are barcoded, sequenced, demultiplexed, and mapped back to the set of known references. With this approach, true and false positive calls can be counted per barcode pair. The resulting PPV is due to misidentification by the demultiplexing algorithm, caused by many different external factors, such as poorly synthesized barcode molecules, contamination between barcode wells, and insert contamination during the library preparation.
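As a worked illustration of TP/(TP+FP), the following Python sketch computes a PPV per called barcode pair from a list of (called pair, true pair) observations. The data layout and the function name ppv_per_pair are assumptions made for this example; they do not describe the evaluation pipeline actually used.

```python
from collections import defaultdict


def ppv_per_pair(calls):
    """calls: iterable of (called_pair, true_pair). Returns {called_pair: PPV}."""
    true_pos = defaultdict(int)
    false_pos = defaultdict(int)
    for called, truth in calls:
        if called == truth:
            true_pos[called] += 1
        else:
            false_pos[called] += 1  # cross-call attributed to the called pair
    pairs = set(true_pos) | set(false_pos)
    return {p: true_pos[p] / (true_pos[p] + false_pos[p]) for p in pairs}


# Three correct calls and one cross-call for the same called pair.
calls = [("bc1002--bc1002", "bc1002--bc1002")] * 3 + [("bc1002--bc1002", "bc1015--bc1015")]
print(ppv_per_pair(calls))
```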
Depending on the barcoding mode, same or different barcodes on the ends of the insert, and the number of barcodes used, PPV varies.
Examples for different barcoding schemes, (x) indicating use of a barcode pair:
8-plex same / symmetric
28-plex different / asymmetric
36-plex same+different / symmetric+asymmetric
The following libraries contain 2 kb amplicons amplified with vector-sequence-specific primers. Sequencing movies are 6 hours long with an additional 2 hours of pre-extension. The instrument version is 5.0.0 and the chemistry is S/P2-C2. For each ZMW, all sequenced barcode regions were respected.
- With increasing number of barcodes, PPV decreases.
- Same barcode pair libraries have higher PPV than different barcode pair libraries.
- Mixing same and different barcode pairs in one library leads to very bad PPV and is not supported.
The yield is, after the PPV, the next most important metric. Lima removes undesired barcode pairs to increase PPV, accepting a decrease in yield.
Example 384-plex symmetric (look at the bars above the x-axis):
Compare it to a 384-plex asymmetric run:
The yield decreases for asymmetric designs because, in order to identify a ZMW as asymmetric, both flanking barcodes of an insert have to be observed; ZMWs whose polymerase read does not contain at least two adapters have to be removed. In contrast, for the symmetric case, it is sufficient to see a single barcode region.
Lima offers a set of options for processing, including trivial and sophisticated filters.
Reads with length below
N bp after demultiplexing are omitted. The default is
ZMWs with no reads passing are omitted.
Reads with length above
N bp are omitted for scoring in the demultiplexing
step. The default is
0, meaning deactivated.
Threshold for the average barcode score of the leading and trailing ends.
ZMWs with barcode score below
N are omitted. The default is
It is advised to set it to
This threshold is applied to the two individual barcode scores, the leading and
ZMWs with at least one individual barcode score below
N are omitted.
The default is
Simplified example: A ZMW is tagged with two barcodes
All leading barcode regions match to
A with score
all trailing barcode regions match
B with score
On average, the barcode score is
filters on an individual barcode level, checking
--min-end-score 45, this ZMW would not pass, because
B is below the
This filter can be used to remove ZMWs that have one good and one bad call,
only useful for asymmetric barcoding schemes with different barcodes in a pair.
For libraries with the same barcode in pair, this option is identical to
Those options are used in combination to remove ZMWs that have spurious barcode
--min-ref-span defines the minimum reference span relative to
the barcode length to call a barcode region scoring.
--min-scoring-regions defines the minimum number of scoring barcode
regions. ZMWs with less than
-min-scoring-regions N scoring regions are
ZMWs with fewer than N full passes, a full pass being a read with a leading and
trailing adapter, are omitted. The default is
0, no full pass needed. Example:
0 pass  : insert - adapter - insert
1 pass  : insert - adapter - INSERT - adapter - insert
2 passes: insert - adapter - INSERT - adapter - INSERT - adapter - insert
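The pass counting shown in the diagram can be sketched as follows. This Python snippet is an illustration only: each subread is modelled as a hypothetical (has_leading_adapter, has_trailing_adapter) pair, which is a simplification of the adapter context stored in the BAM records.

```python
def count_full_passes(subreads):
    """A full pass is a subread flanked by an adapter on both sides."""
    return sum(1 for leading, trailing in subreads if leading and trailing)


def passes_filter(subreads, min_passes=0):
    """Keep the ZMW only if it has at least min_passes full passes."""
    return count_full_passes(subreads) >= min_passes


# insert - adapter - INSERT - adapter - insert  -> one full pass
one_pass = [(False, True), (True, True), (True, False)]
# insert - adapter - insert                     -> no full pass
zero_pass = [(False, True), (True, False)]
print(count_full_passes(one_pass), count_full_passes(zero_pass))
```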
Only use up to first
N barcode pair regions for barcode identification.
The default is
0, deactivated filter. This is equivalent to the
maximum number of scored adapters in bam2bam.
Only use up to first
N barcode regions for barcode identification.
The default is
0, deactivated filter. Setting to 1 enables single-pass
and single-barcode calculation.
Only use the flanking regions of up to first
N adapters for barcode identification.
Only full adapters, surrounded by subreads, are used.
The default is
0, deactivated filter. Setting to 1 enables single-pass barcode
calculation using forward and reverse pass.
Caution: If a subread between two full flanking adapters gets removed in PPA, lima will score Frankenstein flanking barcodes for two consecutive subreads with what it believes is one adapter in the middle.
Only use reads flanked by adapters on both sides for barcode identification, full-pass reads.
Per default, the two identified barcode idx are sorted ascending, as in CLR data,
the correct order cannot be determined. This affects the
--split-bam file names;
IdxHighestNamed will have the same order as
IdxCombined. This option only makes sense for single read data,
such as CCS.
If you are using an asymmetric barcode design with
and your input is CCS, you can use
--keep-idx-order to preserve
the order. If your input is CLR subreads and you use
NxN asymmetric pairs,
there is no way to distinguish between pairs
Score and tag per subread, instead of per ZMW.
The candidate region size multiplier:
barcode_length * multiplier, default
Optionally, you can specify the region size in base pairs with
-A,--match-score        Score for a sequence match.
-B,--mismatch-penalty   Penalty for a mismatch.
-D,--deletion-penalty   Deletions penalty.
-I,--insertion-penalty  Insertion penalty.
-X,--branch-penalty     Branch penalty.
Set defaults to
-A 1 -B 4 -D 3 -I 3 -X 1
--peek N looks at the first N ZMWs of the input and
returns the mean barcode score. This allows testing multiple barcode
files to see which set of barcodes has been used.
--guess N performs demultiplexing twice. In the first iteration,
all barcodes are tested per ZMW. Afterwards, the barcode pair occurrences are counted
and their mean barcode score is tested against the provided threshold N;
only those barcode pairs that pass this threshold are used in the second round.
In this second round of demultiplexing, only barcodes from the selected
barcode pairs are tested for each ZMW. Finally, only ZMWs from barcode
pairs that were selected in the first round are included in the BAM output.
--different are being respected and can be used as
prefix.lima.guess file shows the decision process. Example:
$ column -t *guess
IdxFirst  IdxCombined  IdxFirstNamed  IdxCombinedNamed  NumZMWs  MeanScore  Picked
0         0            bc1002         bc1002            174      76         1
0         4            bc1002         bc1048            1        43         0
9         9            bc1080         bc1080            3        16         0
10        10           bc1093         bc1093            742      75         1
10        14           bc1093         bc1115            2        55         1
12        12           bc1101         bc1101            4        18         0
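The selection step of the first round can be sketched in Python. This is a simplified illustration, not lima's code: it assumes a pair is picked when its mean score is at least the --guess threshold and its ZMW count is at least the minimum count, and the function name pick_barcode_pairs is hypothetical. With a threshold of 45 and a minimum count of 0 it reproduces the Picked column of the example above.

```python
def pick_barcode_pairs(rows, min_mean_score, min_count=0):
    """rows: (idx_first, idx_combined, num_zmws, mean_score) per barcode pair.

    Returns one 0/1 flag per row, 1 meaning the pair is kept for round two.
    """
    picked = []
    for _idx_first, _idx_combined, num_zmws, mean_score in rows:
        keep = mean_score >= min_mean_score and num_zmws >= min_count
        picked.append(1 if keep else 0)
    return picked


guess_rows = [
    (0, 0, 174, 76),
    (0, 4, 1, 43),
    (9, 9, 3, 16),
    (10, 10, 742, 75),
    (10, 14, 2, 55),
    (12, 12, 4, 18),
]
print(pick_barcode_pairs(guess_rows, min_mean_score=45))
```

Raising the minimum count, for example to 10, would additionally drop the low-abundance pair 10--14 in this sketch.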
The minimum ZMW abundance to whitelist a barcode. This filter is
applied in combination with the minimum barcode score provided by
--guess. The default is 0.
If there are in total less barcoded ZMWs than the provided threshold,
the guess feature is automatically deactivated.
The optimal way is to use both advanced options in combination, e.g.,
--peek 1000 --guess 45. Lima will run twice on the input data.
For the first 1000 ZMWs, lima will guess the barcodes and store the mask of
In the second run, the barcode mask is used to demultiplex all ZMWs.
Equivalent to the
Infer Barcodes Used parameter option in SMRT Link.
Sets the following options:
--peek 50000 --guess 45 --guess-min-count 10.
If used in combination with
--peek 50000 --guess 75 --guess-min-count 100.
If used in combination with
--peek 50000 --guess 75 --guess-min-count 10.
Identify barcodes in molecules that only have barcodes adjacent to one adapter. This approach makes no assumption about an alternating pattern of barcoded and barcode-free adapters. In contrast, a 1D k-means similar to the original Lloyd algorithm is employed to identify two clusters to separate low- and high-scoring barcode regions. This method does not suffer from irregular adapter calls, but the additional flexibility might lead to yet-unknown problems.
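To make the clustering idea concrete, here is a minimal 1D two-cluster k-means in the style of Lloyd's algorithm, written in Python. It is a sketch under assumptions, not lima's implementation: the initialization with the minimum and maximum score, the tie-breaking, and the function name two_means_1d are choices made for this example.

```python
def two_means_1d(scores, max_iter=100):
    """Split scores into a low- and a high-scoring cluster (Lloyd iterations)."""
    if not scores:
        raise ValueError("need at least one score")
    lo, hi = float(min(scores)), float(max(scores))  # initial centroids
    low, high = list(scores), []
    for _ in range(max_iter):
        # Assignment step: each score goes to the nearer centroid (ties go low).
        low = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        high = [s for s in scores if abs(s - lo) > abs(s - hi)]
        # Update step: centroids become the cluster means.
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else new_lo
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi
    return low, high


region_scores = [12, 15, 18, 80, 85, 90, 95]
low, high = two_means_1d(region_scores)
print(low, high)
```

In this picture, the high-scoring cluster corresponds to the barcode regions next to the barcoded adapter, and the low-scoring cluster to regions next to the barcode-free adapter.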
For this mode, high-scoring barcode regions are whitelisted. Only whitelisted
barcode regions contribute to the final mean barcode score and to the
Minimum ratio of scored vs sequenced adapters. The default is
Set the first barcode to be barcode index 0.
Spawn a threadpool of
The default is
0, meaning all available cores.
This option also controls the number of threads used for BAM and PBI compression.
By default, each thread consumes
N ZMWs per chunk for processing.
The default is
Do not produce BAM output, nor PBI. Useful if only the reports are of interest, as time to results is lower.
Do not produce any reports. Useful if only the demultiplexed BAM is of interest.
For quality control, we offer two R scripts to help you troubleshoot your data. The first is for low multiplex data. The second is for high plex data, easily showing 384 barcodes.
The first is for the
$ Rscript --vanilla scripts/r/report_detail.R prefix.lima.report
The second, optional argument is the output file type, png or pdf, with
png as default:
$ Rscript --vanilla scripts/r/report_detail.R prefix.lima.report pdf
You can also restrict output to only barcodes of interest, using the barcode name not the index. For example, all barcode pairs that contain the barcode "bc1002":
$ Rscript --vanilla scripts/r/report_detail.R prefix.lima.report png bc1002
A specific barcode pair "bc1020--bc1045"; note that, the script will look for both combinations "bc1020--bc1045" and "bc1045--bc1020":
$ Rscript --vanilla scripts/r/report_detail.R prefix.lima.report png bc1020--bc1045
Or any combination of those two:
$ Rscript --vanilla scripts/r/report_detail.R prefix.lima.report pdf bc1002 bc1020--bc1045 bc1321
Per-barcode read yield:
Score per number of adapters (lines) and all adapters (histogram). What are half adapters?
Read length (99.9% percentile, 1000 binwidth)
Grouped by barcode, same y-axis :
Grouped by barcode, free y-axis:
Not grouped into facets, line histogram:
Barcoded vs. non-barcoded:
HQ length (99.9% percentile, 2000 binwidth)
Grouped by barcode, same y-axis:
Grouped by barcode, free y-axis:
Not grouped into facets, line histogram:
Barcoded vs. non-barcoded:
Adapters (99.9% percentile, 1 binwidth)
Number of adapters:
The second script is for high-plex data in one
$ Rscript --vanilla scripts/r/report_summary.R prefix.lima.report
Yield per barcode:
Score distribution across all barcodes:
Score distribution per barcode:
Read length distribution per barcode:
HQ length distribution per barcode:
Bad adapter ratio histogram:
Barcode score and clipping position are computed by a Smith-Waterman algorithm. The dynamic-programming matrix has the barcode on the vertical and the target sequence on the horizontal axis. The initialization of the first row and column follows a glocal alignment; global in the reference, local in the query. The best score is determined by choosing the maximum in the last row, which is also the clipping position. This allows us to skip overhang from the adapter or alien DNA such as primer IDs, also known as molecular identifiers.
For the trailing barcode region, the sequence of the reference window gets reverse-complemented and the clipping position gets transformed back into the correct coordinate system.
The barcode score is an indicator of how well the chosen barcode pair matches. After identifying the highest barcode score, it gets normalized:
(100 * sw_score) / (sw_match_score * barcode_length)
The range is between 0 and 100, where 0 is no hit and 100 is a perfect match. The provided mean score is the mean of both normalized barcode scores.
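The alignment and normalization described above can be sketched in Python. This is a simplified illustration, not lima's implementation: it ignores the branch penalty, uses the values -A 1 -B 4 -D 3 -I 3 as example parameters, clamps negative scores to 0 (an assumption), and which gap direction counts as deletion versus insertion is also an assumption. The function name glocal_barcode_score is hypothetical.

```python
def glocal_barcode_score(barcode, window, match=1, mismatch=4, deletion=3, insertion=3):
    """Align the whole barcode (rows) against any part of the window (columns).

    Returns (normalized_score, clip_position), where clip_position is the
    column of the maximum in the last row.
    """
    n, m = len(barcode), len(window)
    # First row is all zeros: the alignment may start anywhere in the window.
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        # First column: barcode bases left unaligned are penalized.
        cur = [prev[0] - deletion] + [0] * m
        for j in range(1, m + 1):
            step = match if barcode[i - 1] == window[j - 1] else -mismatch
            cur[j] = max(
                prev[j - 1] + step,      # match or mismatch
                prev[j] - deletion,      # barcode base missing in the window
                cur[j - 1] - insertion,  # extra base in the window
            )
        prev = cur
    best = max(prev)            # maximum of the last row
    clip = prev.index(best)     # clipping position in window coordinates
    normalized = max(0.0, 100.0 * best / (match * n))
    return normalized, clip


barcode = "CTCTACTTACTTACTG"
window = "AAA" + barcode + "GGGG"
print(glocal_barcode_score(barcode, window))
```

Because the first row is free, bases before the barcode in the window, such as adapter overhang, do not lower the score.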
Why lima and not bam2bam?
Lima was developed to provide a better user experience in working with PacBio barcoded sequencing data. Both use an identical core alignment step, but the algorithm to identify barcode pairs and overall usability have been improved.
Which minimum barcode score?
The old bam2bam tool required a minimum barcode score threshold of
generate reliable output. This is not true for lima. Both, no threshold and
26 were tested extensively with downstream applications to assure that
results are not convoluted by contaminants.
A much lower threshold can be used, because additional internal filters in
lima remove unreliable calls that go beyond simplistic min-score thresholding.
How fast is fast?
Example: 200 barcodes, asymmetric mode (try each barcode forward and reverse-complement), 300,000 CCS reads. On my 2014 iMac with 4 cores + HT:
503.57s user 11.74s system 725% cpu 1:11.01 total
Those 1:11 minutes (71.01 s) translate into roughly 0.237 milliseconds per ZMW, 1.18 microseconds per barcode for both sides aligning forward and reverse-complement, and 296 nanoseconds per alignment. This includes IO.
Why doesn't lima utilize the maximum number of provided cores?
This might be a simple IO bottleneck. With a barcode.fasta containing only a few barcodes, most of the time is spent reading and writing BAM files, as the barcode identification is too fast.
Is there a way to show the progress?
No. Please run wc -l prefix.lima.report to get the number of already processed ZMWs.
Can I have upper- and lower-case bases in my barcodes?
You can, but lima is case-insensitive and will convert them to upper case before the alignment step.
Can I split my data by barcode?
You can either iterate over the
prefix.bam file N times or use
--split-bam. Each barcode has its own BAM file called
The optional parameter --split-bam-named names the files by their barcode names instead
of their barcode indices. Non-word characters, anything except [A-Za-z0-9_],
in barcode names are replaced with an underscore in the file name.
This mode might consume more memory. Read the next FAQ entry for more information.
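The name sanitization described for --split-bam-named can be mimicked with a one-line regular expression. The Python sketch below only illustrates the stated replacement rule; the function name sanitize_barcode_name is hypothetical and the full file naming scheme is not reproduced here.

```python
import re


def sanitize_barcode_name(name):
    """Replace every character outside [A-Za-z0-9_] with an underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


print(sanitize_barcode_name("sample brain-3p"))
print(sanitize_barcode_name("bc1002"))
```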
In addition, a
prefix.datastore.json is generated to wrap the individual dataset
Why is the memory consumption really high?
Most likely this is due to
The latter is activated per default in SMRT Link.
Lima is able to stream up to 500 barcode pairs to individual split BAM files.
If more than 500 barcode pairs are detected, additional output is buffered first.
In this case, memory usage (RES column in top) is approximately the size of the
The maximum concurrent output BAM file handles can be adjusted with
--bam-handles N. The default is 500.
Examples, how memory usage is affected by
--bam-handles-verbose is only used to visualize the BAM output file handles.
Memory usage reported using memusg:
$ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 9 --bam-handles-verbose
Open stream 7--7
Open stream 3--3
Open stream 5--5
Open stream 1--1
Open stream 4--4
Open stream 6--6
Open stream 0--0
Open stream 2--2
Open stream 210--210
memusg: peak=86,728

$ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 4 --bam-handles-verbose
Open stream 7--7
Open stream 3--3
Open stream 5--5
Open stream 1--1
Buffered stream 0--0
Buffered stream 2--2
Buffered stream 4--4
Buffered stream 6--6
Buffered stream 210--210
memusg: peak=113,476

$ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 0 --bam-handles-verbose
Buffered stream 0--0
Buffered stream 1--1
Buffered stream 2--2
Buffered stream 3--3
Buffered stream 4--4
Buffered stream 5--5
Buffered stream 6--6
Buffered stream 7--7
Buffered stream 210--210
memusg: peak=132,276
What are half adapters?
If there is an adapter call with only one barcode region, as the high-quality region finder cut right through the adapter, or the preceding or succeeding subread was too short and got removed, or the sequencing reaction started/stopped there, we call such an adapter half. Thus, there are also 1.5, 2.5, N+0.5 adapter calls.
ZMWs with half or only one adapter can be used to identify same barcode pairs; positive-predictive value might be reduced compared to high adapter calls. For asymmetric designs with different barcodes in a pair, at least a single full-pass read is required; this can be two adapters, two half-adapters, or a combination.
What are bad adapters?
In the subreads.bam file, each subread has a context flag
It annotates, among other things, if a subread has flanking adapters,
before and/or after. Adapter finding has been improved and can also find
molecularly missing adapters or those obscured by a local decrease in accuracy.
This may lead to missing or obscured bases in the flanking barcode.
Such adapters are called "bad", since they don't align with the adapter reference
Regions flanking those bad adapters are problematic, because they can fully or
partially miss the barcode bases, leading to wrong classification of the
Lima can handle those adapters, by ignoring regions flanking
bad adapters. For this, lima computes the ratio of
number of bad adapters divided by number of all adapters.
By default, --bad-adapter-ratio is set to
0 and does not perform any filtering.
In this mode, bad adapters are handled just like good adapters.
The *.lima.summary file contains one row with the number of
ZMWs that have at least 25% bad adapters, but otherwise pass all other filters.
This metric can be used as a diagnostic to assess the library prep.
If --bad-adapter-ratio is set to a non-zero positive value,
barcode regions flanking bad adapters are treated as missing.
If a ZMW has a higher ratio of bad adapters than provided, the ZMW
is filtered and consequently removed from the output.
The *.lima.summary file contains two additional rows.
With bad adapter       : 10349 (28%)
Bad adapter yield loss : 10112 (5%)
The first row counts the number of ZMWs that have too high bad adapter ratios; the percentage is with respect to the number of all ZMWs not passing. The second row counts the number of ZMWs that only get removed because of too high bad adapter ratios; the percentage is with respect to the number of all input ZMWs and consequently is the effective yield loss caused by bad adapters.
If a ZMW has ~50% bad adapters, one side of the molecule is molecularly missing an adapter. For 100% bad adapters, both sides are missing adapters. A percentage lower than ~40% indicates decreased local accuracy during sequencing, leading to adapter sequences not being found. If a high percentage of ZMWs is molecularly missing adapters, you should improve library prep.
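The ratio and the filter decision described in this section can be written down in a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, a threshold of 0 is treated as "filter deactivated" as stated above, and a ZMW is removed only if its ratio is strictly higher than the threshold.

```python
def bad_adapter_ratio(num_bad, num_adapters):
    """Number of bad adapters divided by the number of all adapters."""
    return num_bad / num_adapters if num_adapters else 0.0


def removed_by_bad_adapter_filter(num_bad, num_adapters, max_ratio=0.0):
    """True if the ZMW would be filtered for having too many bad adapters."""
    if max_ratio <= 0:
        return False  # a threshold of 0 performs no filtering
    return bad_adapter_ratio(num_bad, num_adapters) > max_ratio


# A ZMW with 4 adapters, 2 of them bad: ratio 0.5, one side likely lacks an adapter.
print(bad_adapter_ratio(2, 4), removed_by_bad_adapter_filter(2, 4, max_ratio=0.25))
```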
Why are different barcode pair hits reported in --same mode?
Lima tries all barcode combinations and
--same only filters BAM output.
Sequences flanked by different barcodes are still reported, but are not
written to BAM. By not enforcing only same barcode pairs, lima gains
higher PPV, as your sample might be contaminated and contains unwanted
barcode pairs; instead of enforcing one same pair, lima rather
filters such sequences. Every symmetric / tailed library contains few asymmetric
templates. If many different templates are called, your library preparation
might be bad.
Why are same barcode pair hits reported in the default different mode?
Even if your sample is labeled asymmetric, same hits are simply sequences flanked by the same barcode ID.
But my design does not include same barcode pairs! We are aware of this, but it happens that some ZMWs do not have sufficient signal to call a pair with different barcodes.
How do barcode indices correspond to the input sequences?
Input barcode sequences are tagged with an incrementing counter. The first
sequence is barcode
0 and the last barcode
numBarcodes - 1.
I used the tailed library prep, what options to choose?
How can I demultiplex data with one adapter only being barcoded?
What are undesired hybrids?
When running with
--peek-guess or similar manual option combination and
different barcode pairs are found during peek, the full chip may contain
low-abundant different barcode pairs that were identified during peek
individually, but not as a pair. Those unwanted barcode pairs are called
hybrids in lima.
How can I demultiplex IsoSeq data?
Even if you only want to remove IsoSeq primers, lima is the tool of choice.
- Remove all duplicate sequences.
- Annotate sequence names with a _5p or _3p suffix. Example:
>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>sample_brain_3p
AAGCAGTGGTATCAACGCAGAGTACCACATATCAGAGTGCG
>sample_liver_3p
AAGCAGTGGTATCAACGCAGAGTACACACACAGACTGTGAG
- Use the --isoseq mode. Run in combination with --peek-guess to remove spurious false positives.
- Output will be only different pairs with a
Those options are very conservative to remove any spurious and ambiguous calls, in order to guarantee that only proper asymmetric (barcoded) primers are used in downstream analyses. Good libraries reach >75% CCS reads passing lima filters.
What is a universal spacer sequence and how does it affect demultiplexing?
For library designs that include an identical sequence between adapter and barcode, e.g. probe-based linear barcoded adapters samples, lima offers a special mode that is activated if it finds a shared prefix sequence among all provided barcode sequences. Example:
>custombc1
ACATGACTGTGACTATCTCACACATATCAGAGTGCG
>custombc2
ACATGACTGTGACTATCTCAACACACAGACTGTGAG
In this case, lima detects the shared prefix ACATGACTGTGACTATCTCA and
removes it internally from all barcodes. Subsequently, it increases the
window size by the length L of the prefix sequence.
- If --window-size-bp N is used, the actual window size is L + N.
- If --window-size-mult M is used, the actual window size is (L + |bc|) * M.
Because the alignment is semi-global, a leading reference gap can be added without any penalty to the barcode score.
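The shared-prefix detection and the resulting window sizes can be reproduced with a short Python sketch. It is an illustration only: the helper names are hypothetical, and |bc| is taken here as the barcode length after removing the prefix, which is an assumption about the formula above.

```python
import os


def shared_prefix(barcodes):
    """Longest prefix common to all barcode sequences."""
    return os.path.commonprefix(list(barcodes))


def window_size_bp(prefix_len, n):
    """Actual window size with --window-size-bp N: L + N."""
    return prefix_len + n


def window_size_mult(prefix_len, barcode_len, m):
    """Actual window size with --window-size-mult M: (L + |bc|) * M."""
    return (prefix_len + barcode_len) * m


barcodes = [
    "ACATGACTGTGACTATCTCACACATATCAGAGTGCG",
    "ACATGACTGTGACTATCTCAACACACAGACTGTGAG",
]
prefix = shared_prefix(barcodes)
trimmed = [bc[len(prefix):] for bc in barcodes]
print(prefix, len(prefix), trimmed)
```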
Why do most of my ZMWs get filtered by the score lead threshold?
The score lead measures how close the best barcode call is to the second best. Possible solutions without seeing your data:
- Is that sample actually barcoded?
- Are your barcode sequences genetically too close for SMRT sequencing?
Try CCS calling first and demultiplex with
- Are the synthesized products clean and not degenerate?
- Did the sequencing run perform optimally, is the accuracy in the expected range?
- Did you run lima twice, first on the original and then on the already demultiplexed data? This is not supported, as the barcodes have been clipped and removed.
Try to decrease --score-lead, with the potential risk of introducing false-positive calls.
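To make the score lead concrete, here is a minimal sketch. It assumes that the lead is the plain difference between the best and second-best barcode scores and that a call passes when the lead is at least the threshold; the scores, the threshold value, and the function names are hypothetical and do not reproduce lima's internal scoring.

```python
def score_lead(scores):
    """Difference between the best and the second-best barcode score."""
    if len(scores) < 2:
        raise ValueError("need at least two candidate barcodes")
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1]


def passes_score_lead(scores, min_score_lead):
    """True if the best call is separated clearly enough from the runner-up."""
    return score_lead(scores) >= min_score_lead


# Hypothetical per-barcode scores for two ZMWs.
clear_call = {"bc1": 92, "bc2": 41, "bc3": 38}
ambiguous_call = {"bc1": 71, "bc2": 68, "bc3": 30}

threshold = 10  # illustrative threshold value

print(score_lead(clear_call), passes_score_lead(clear_call, threshold))          # 51 True
print(score_lead(ambiguous_call), passes_score_lead(ambiguous_call, threshold))  # 3 False
```

In this toy example, lowering the threshold to 3 would let the ambiguous ZMW pass, which illustrates why a lower threshold admits less certain calls.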
What is different in lima to bam2bam?
- CCS read support
- Barcodes of every adapter get scored for CLR subreads
- Does not enforce symmetric barcode pairing, which increases PPV
- For asymmetric barcodes, lima can report the identified order, instead of ascending sorting
- Calls barcodes per barcode region and does not enforce adapter coupling
- Nice reports for QC
Can I remove PCR primers after demultiplexing?
Yes! After demultiplexing, just run lima on the output again with your PCR primer(s).
Can I limit the output files per directory?
If you use output BAM splitting, you can end up with a large number of output files. --files-per-directory N creates subdirectories and outputs at most N barcodes per directory.
--peek-guess does not work with XML input!
If your input XML file contains <BioSamples>, lima will deactivate --peek-guess and only output barcodes specified in this section.
The assumption is that you know exactly which barcodes have been used and need no
inference. If this assumption is wrong, like the barcodes in the XML are wrong,
you can either just use BAM as input or use
Help, I get ERROR: Could not find matching barcodes!
If you happen to get the following error message
ERROR: Could not find matching barcodes! Check that the set of barcodes contains the used sequences and the correct mode has been selected: same or different.
then your XML input contains BioSamples with different barcode names than the
barcode.fasta file. Please check that you've used the correct
barcodes. You can ignore barcodes specified in the XML with
CCS or demux first?
Many people have asked what the recommended order is for a multiplexed HiFi pool:
- first ccs and then demux
- first demux and then ccs
Test data: 2 kb E. coli amplicons with symmetric barcoded overhang adapters. Workflow steps:
- Generate CCS
- Demux subreads and whitelist on CCS hole numbers
- Demux CCS
- Compare both sets of hole numbers
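The last workflow step, comparing both sets of hole numbers, amounts to plain set arithmetic. A minimal sketch follows; the hole numbers are made-up toy values and `compare_hole_numbers` is a hypothetical helper name.

```python
def compare_hole_numbers(ccs_holes, subread_holes):
    """Split ZMW hole numbers into shared and exclusive groups."""
    ccs_holes = set(ccs_holes)
    subread_holes = set(subread_holes)
    return {
        "both": ccs_holes & subread_holes,
        "ccs_only": ccs_holes - subread_holes,
        "subread_only": subread_holes - ccs_holes,
    }


# Toy hole numbers, for illustration only.
ccs_demuxed = [4194370, 4194371, 4194375, 4194380]
subread_demuxed = [4194370, 4194371, 4194376, 4194380, 4194381]

result = compare_hole_numbers(ccs_demuxed, subread_demuxed)
for group, holes in result.items():
    print(group, sorted(holes))
```

The three group sizes are the quantities that a two-set Venn diagram displays.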
Verbatim results for one chip:
Generated CCS reads : 274185
Demuxed CCS reads   : 269919 (98.44%)
Demuxed subreads    : 271068 (98.86%)
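The percentages above are relative to the number of generated CCS reads. A short check of that arithmetic, using only the counts reported for this chip:

```python
generated_ccs = 274185
demuxed_ccs = 269919
demuxed_subreads = 271068


def percent(part, total):
    """Share of `part` in `total`, in percent."""
    return 100.0 * part / total


ccs_pct = percent(demuxed_ccs, generated_ccs)
subread_pct = percent(demuxed_subreads, generated_ccs)
extra_reads = demuxed_subreads - demuxed_ccs

print(f"Demuxed CCS reads : {ccs_pct:.2f}%")      # 98.44%
print(f"Demuxed subreads  : {subread_pct:.2f}%")  # 98.86%
print(f"Difference        : {extra_reads} reads "
      f"({percent(extra_reads, generated_ccs):.2f}%)")  # 1149 reads (0.42%)
```

The gain from subread demultiplexing on this chip is therefore 1149 reads, or about 0.42% of the generated CCS reads.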
Venn diagrams for two chips:
Just based on those numbers, one would say, pick subread demuxing. Here comes the but. Demuxing subreads is very IO heavy and takes ~100x longer than demuxing CCS. For the sake of time to result and disk space, perform CCS first and demux afterwards.
Q: Is there any systematic reason for reads that get correctly called by subread demux but not ccs or vice versa?
The majority of reads that are called only in the subread output are on the verge of being called at all. The problem with the current CCS draft stage is that it sometimes trims a few bases. This is generally not a big issue for demuxing, but if the barcode is molecularly damaged, too short, or of low quality, a few missing bases can make it uncallable.
Again, these are reads on the verge of being called. The reason for the ~300 reads at a score of 100 is unknown so far. In general, this is 0.1% of the data. Let's investigate those ~300 calls and plot their subread demux barcode scores.
It is curious that they did not get called. For 0.1% of the data it is not worth changing any parameters now, but it is worth future investigation.
- Add support for FASTA and FASTQ
-k with by-strand HiFi reads
- Add barcode to read groups, use one barcode pair per RG
- Fix double demux, used to clip wrongly for the second round of demuxing
- Output N barcodes per subdirectory with --files-per-directory N and output splitting
- BioSample awareness for XML input and split output and allow ignoring them with
3 to allow longer spacers
- Do not report no adapter hits as too short inserts
--guess barcode score to
--peek-guess --ccs are combined
- Enable double demux of CCS data
- Print run time, CPU time, and peak memory consumption with
- New CLI UX
- Output N barcodes per subdirectory with
--bad-adapter-ratio to remove ZMWs with molecularly missing adapters
- Fix rare case, where a read only matches one barcode and not a single alternative
--no-bam to automatically omit pbi
- Allow combination of
- Add clip lengths as
- Enable single-barcode samples
- Implicitly call
- Add clip lengths as
- 1.7.1: Fix rare-case PBI generation bug, included in SMRT Link 6.0.0
- 1.7.0: Fix corner-case bug
- 1.6.1: Fix
--min-end-score in combination with
- New filter
- Add latest filters to summary file
- New IsoSeq default parameters
- Fix streaming of asymmetric BAM files
- New filter
- 1.5.0: Support spacer sequence between adapter and barcode
- New filter
- Single-side library improvements
- New filter
--peek-guess uses only full-length ZMWs
- Streaming of split BAM files
- New fat binary build approach
- 1.1.0: IsoSeq support
- 1.0.0: Initial release, included in SMRT Link 5.1.0
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.