some sequences of output file not matching input sequences #334

anakro · 2018-02-05T15:53:05Z

Hi. I am encountering a problem while annotating 100,000 and 1 million sequences with MiXCR version 2.1.8. In the readSeq column of the output file, one sequence in the 100,000-sequence run and around 100 sequences in the 1-million-sequence run do not match any sequence in the input file. Please let me know where the problem might be.

anakro · 2018-02-05T16:10:19Z

Forgot to add that this happens in the alignment step (AlignmentExport).

PoslavskySV · 2018-02-05T19:33:30Z

Could you paste the exact commands you used?

anakro · 2018-02-06T09:43:12Z

This is the alignment command for 100,000 sequences. I tried versions 2.1.5 and 2.1.8 and the result is always the same: one sequence is not found in the input file, and with one million sequences many more are not found.

/Users/anak/Downloads/mixcr-2.1.8/mixcr align -r alignmenReportIgSimulator_100000_sequences.log -c IGH -s HomoSapiens --save-description -f /Users/anak/Documents/masterThesis/simulators/ig_simulator/ig_simulator_test_100000_sequences/final_repertoire.fasta /Users/anak/Documents/gitMasterThesis/MiXCR_results/IgSimulator_aligned_sequences/100000_sequences/output_file_100000_sequences_IgSimulator_MiXCR.vdjca

For the export with 100,000 sequences:
/Users/anak/Downloads/mixcr-2.1.8/mixcr exportAlignments /Users/anak/Documents/gitMasterThesis/MiXCR_results/IgSimulator_aligned_sequences/100000_sequences/output_file_100000_sequences_IgSimulator_MiXCR.vdjca /Users/anak/Documents/gitMasterThesis/MiXCR_results/IgSimulator_aligned_sequences/100000_sequences/output_file_100000_sequences_IgSimulator_MiXCR.txt

PoslavskySV · 2018-02-06T09:49:36Z

Can you check whether the reverse complements of those missing sequences are also absent from the initial file?
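A minimal sketch of such a check, assuming the input sequences and the exported readSeq values have already been loaded into memory. The function names and the toy data are hypothetical and not part of MiXCR:

```python
# Translation table for nucleotide complements (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify(read_seq, input_seqs):
    """Return 'forward', 'reverse' or 'missing' for one exported sequence.

    input_seqs is a set of upper-case input sequences.
    """
    s = read_seq.upper()
    if s in input_seqs:
        return "forward"
    if reverse_complement(s) in input_seqs:
        return "reverse"
    return "missing"


# Toy data standing in for the FASTA input.
input_seqs = {"ACGTTGCA", "GGGAAATT"}

for read in ("ACGTTGCA", "AATTTCCC", "TTTTTTTT"):
    print(read, classify(read, input_seqs))
```

Keeping the input in a set makes each lookup constant time on average, which matters for a million sequences.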

anakro · 2018-02-06T10:50:15Z

I just checked. Unfortunately, they are not present either.

anakro · 2018-02-06T10:55:03Z

Just to make it a bit clearer: for 100,000 sequences the number of output sequences matches the number of input sequences, and for one million there are 4 fewer sequences in the output file because gene assignment failed for them. The counts are fine, but for some sequences the nucleotide sequence written in the readSeq column does not match the input file (it is not found in the input file).

dbolotin · 2018-02-06T11:22:31Z

Please try to remove all sequences with 'N' nucleotides from the dataset and rerun the analysis. 'N' nucleotides are converted to a random letter with zero quality inside the align pipeline. This might be the cause of the issue.

I am marking this as a bug, since this behaviour is not expected, but I can't give an estimate of when it will be fixed.
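One way to do that filtering is sketched below in plain Python. `read_fasta` and `drop_n_records` are hypothetical helper names, not MiXCR functionality, and the example lines stand in for a real FASTA file:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_n_records(records):
    """Keep only records whose sequence contains no 'N' (case-insensitive)."""
    return [(h, s) for h, s in records if "N" not in s.upper()]


# Toy input: seq2 contains an N and should be dropped.
example = [">seq1", "ACGT", "ACGT", ">seq2", "ACNT", ">seq3", "ggcc"]
kept = drop_n_records(read_fasta(example))
print(kept)
```

For a real file, `read_fasta(open(path))` can be passed in instead of the list, and the kept records written back out with a `>` before each header.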

anakro · 2018-02-14T16:18:59Z

I removed all the Ns, but some sequences still did not match the input sequences. Those sequences were reverse complements of the input sequences. Apparently the problem is the combination of N and reverse complement. It might be convenient for users to have a column in the output file with the names of the sequences (as they are labeled in the FASTA file); tracking the sequences would then be much simpler. Just a modest suggestion. Thank you very much for such quick help.
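Both effects can be accounted for when comparing one input sequence with one exported sequence. The sketch below treats every 'N' in the input as a wildcard and also tries the reverse complement of the exported read. It assumes the exported readSeq has the same length as the input sequence. All names are hypothetical:

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string (upper-cased)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def same_up_to_n(input_seq, candidate):
    """True if the sequences agree everywhere except where the input has 'N'."""
    a, b = input_seq.upper(), candidate.upper()
    return len(a) == len(b) and all(x == "N" or x == y for x, y in zip(a, b))


def explains(input_seq, read_seq):
    """Return 'forward' or 'reverse' if read_seq can come from input_seq, else None."""
    if same_up_to_n(input_seq, read_seq):
        return "forward"
    if same_up_to_n(input_seq, reverse_complement(read_seq)):
        return "reverse"
    return None


print(explains("ACNTG", "ACATG"))  # N replaced by A, same strand
print(explains("ACNTG", "CAGGT"))  # reverse complement of ACCTG
print(explains("ACNTG", "TTTTT"))  # unrelated sequence
```

This is a pairwise comparison, so it suits inspecting the handful of unmatched sequences rather than scanning the whole dataset.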

dbolotin · 2018-02-14T18:46:19Z

Thanks for the suggestion! In the latest stable version (2.1.9), reads can be tracked by their headers using the -g option at the align stage, which attaches the header content to each read, and then the -descrR1 field (see here) at the exportAlignments stage, which outputs the corresponding field along with the other alignment information.

Tracking of alignment to clone mapping is also possible with index files (see this section).

In 2.2, the commands and the pipeline in general for tracking the alignment-to-clone mapping will change. Many enhancements to this infrastructure have been made. For example, it will work correctly with assemblePartial (the current index and alignment formats can't track events where several alignments are merged into a single alignment). So don't rely on the current API in this respect in your scripts. MiXCR v2.2 will be released around the end of March. If you are interested in this functionality, I can share alpha binaries here, or you can build them yourself from the develop branch of this repo.

anakro · 2018-02-16T12:05:33Z

Thank you so much. I wrote a script for matching the sequences, so now they match perfectly. The only remaining question is how to determine whether a sequence in the alignment output file is productive. Is there any information about whether a sequence is in-frame or out-of-frame?

dbolotin · 2018-02-16T13:59:15Z

This is not yet implemented. Please see #206 for details. It is scheduled for 2.3 (this summer). Any suggestions in this respect are most welcome; please comment on issue #206.

PoslavskySV added the question label Feb 5, 2018

dbolotin added the bug label Feb 6, 2018

dbolotin closed this as completed Jul 21, 2019
