some sequences of output file not matching input sequences #334

Closed
anakro opened this issue Feb 5, 2018 · 11 comments

Comments

@anakro

anakro commented Feb 5, 2018

Hi. I am encountering a problem while annotating 100,000 and 1 million sequences with MiXCR 2.1.8. In the readSeq column of the output file, one sequence (for the 100,000-sequence input) and around 100 sequences (for the 1-million-sequence input) do not match any sequence in the input file. Please let me know where the problem might be.

@anakro
Author

anakro commented Feb 5, 2018

Forgot to add that this happens in the alignment step (AlignmentExport).

@PoslavskySV
Member

Could you paste the exact commands you used?

@anakro
Author

anakro commented Feb 6, 2018

For alignment with 100,000 sequences. I tried versions 2.1.5 and 2.1.8 and the result is always the same: one sequence is not found in the input file, and with one million sequences many more are not found.

/Users/anak/Downloads/mixcr-2.1.8/mixcr align -r alignmenReportIgSimulator_100000_sequences.log -c IGH -s HomoSapiens --save-description -f /Users/anak/Documents/masterThesis/simulators/ig_simulator/ig_simulator_test_100000_sequences/final_repertoire.fasta /Users/anak/Documents/gitMasterThesis/MiXCR_results/IgSimulator_aligned_sequences/100000_sequences/output_file_100000_sequences_IgSimulator_MiXCR.vdjca

For export with 100,000 sequences:
/Users/anak/Downloads/mixcr-2.1.8/mixcr exportAlignments /Users/anak/Documents/gitMasterThesis/MiXCR_results/IgSimulator_aligned_sequences/100000_sequences/output_file_100000_sequences_IgSimulator_MiXCR.vdjca /Users/anak/Documents/gitMasterThesis/MiXCR_results/IgSimulator_aligned_sequences/100000_sequences/output_file_100000_sequences_IgSimulator_MiXCR.txt

@PoslavskySV
Member

Can you check whether the reverse complements of those missing sequences are also absent from the initial file?

@anakro
Author

anakro commented Feb 6, 2018

I just checked. Unfortunately they are not.

@anakro
Author

anakro commented Feb 6, 2018

Just to make it a bit clearer: the number of output sequences matches the number of input sequences for 100,000, and for one million there are 4 fewer sequences in the output file because gene assignment failed for them. The counts are fine, but for some sequences the nucleotide sequence written in the readSeq column does not match anything in the input file (these sequences are not found there).

@dbolotin
Member

dbolotin commented Feb 6, 2018

Please try removing all sequences containing 'N' nucleotides from the dataset and rerunning the analysis. 'N' nucleotides are converted to a random letter with zero quality inside the align pipeline. This might be the cause of the issue.

I'm marking this as a bug, since this behaviour is not expected, but I can't give an estimate of when it will be fixed.
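For anyone who wants to try the suggestion above, here is a minimal Python sketch of dropping reads that contain 'N' from a FASTA input before running align. It is not part of MiXCR; the function names `read_fasta` and `drop_reads_with_n` are hypothetical, and it assumes a plain FASTA file in which every record starts with a `>` header line.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    Multi-line sequences are joined; blank lines are ignored.
    """
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_reads_with_n(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only records whose sequence contains no 'N' (case-insensitive)."""
    return [(h, s) for h, s in records if "N" not in s.upper()]


# Small in-memory example; with a real file, pass the open file object
# to read_fasta instead of a list of lines.
example = [">a", "ACGT", "AC", ">b", "ACNT", "", ">c", "ggta"]
records = list(read_fasta(example))
clean = drop_reads_with_n(records)
for header, seq in clean:
    print(f">{header}\n{seq}")
```

The filtered records can then be written to a new FASTA file and used as the align input, which keeps the original file untouched for later comparison.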

@dbolotin dbolotin added the bug label Feb 6, 2018
@anakro
Author

anakro commented Feb 14, 2018

I removed all the sequences with N, but some sequences still did not match the input file sequences. Those sequences were reverse complements of the input sequences. So apparently the problem is N plus reverse complementing. It might be convenient for users to have a column in the output file with the names of the sequences (as they are labeled in the FASTA file); tracking the sequences would then be much simpler. Just a modest suggestion. Thanks a lot for such quick help.
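The check described above can be sketched in Python roughly as follows. This is only an illustration, not the script used in this thread: the names `reverse_complement` and `classify_reads` are hypothetical, and it assumes the readSeq values and the input sequences have already been loaded into lists of strings.

```python
from typing import Dict, Iterable, List

# Translation table for complementing nucleotides (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_reads(read_seqs: Iterable[str],
                   input_seqs: Iterable[str]) -> Dict[str, List[str]]:
    """Sort output reads by how they match the input sequences.

    A read is 'forward' if it appears verbatim in the input,
    'reverse_complement' if only its reverse complement appears,
    and 'unmatched' otherwise. Comparison is case-insensitive.
    """
    known = {s.upper() for s in input_seqs}
    result: Dict[str, List[str]] = {
        "forward": [], "reverse_complement": [], "unmatched": []
    }
    for seq in read_seqs:
        s = seq.upper()
        if s in known:
            result["forward"].append(seq)
        elif reverse_complement(s) in known:
            result["reverse_complement"].append(seq)
        else:
            result["unmatched"].append(seq)
    return result


# Toy data: the second read is the reverse complement of "GGGTA".
input_seqs = ["AACGT", "GGGTA"]
read_seqs = ["AACGT", "TACCC", "TTTTT"]
report = classify_reads(read_seqs, input_seqs)
print({k: len(v) for k, v in report.items()})
```

Using a set for the input sequences keeps each lookup constant-time, which matters at the one-million-sequence scale discussed here. Reads left in `unmatched` after this check would be the ones worth inspecting for replaced 'N' positions.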

@dbolotin
Member

Thanks for the suggestion! In the latest stable version (2.1.9), reads can be tracked by their headers using the -g option at the align stage, which attaches the header content to each read, together with the -descrR1 field (see here) at the exportAlignments stage, which outputs the corresponding field along with the other alignment information.

Tracking of alignment to clone mapping is also possible with index files (see this section).

In 2.2, the commands and the pipeline in general for tracking the alignment-to-clone mapping will change. Many enhancements to this infrastructure have been made. For example, it will work correctly with assemblePartial (the current index and alignment formats can't track events where several alignments are merged into a single alignment). So don't rely on the current API in this respect in your scripts. MiXCR v2.2 will be released around the end of March. If you are interested in this functionality, I can share alpha binaries here, or you can build them yourself from the develop branch of this repo.

@anakro
Author

anakro commented Feb 16, 2018

Thank you so much. I wrote a script for matching the sequences, and they now match perfectly. The only remaining question is how to determine whether a sequence in the alignment output file is productive. Is there any information about whether a sequence is in-frame or out-of-frame?

@dbolotin
Member

This is not yet implemented. Please see #206 for details. It is scheduled for 2.3 (this summer). If you have any suggestions in this respect, they are most welcome; please comment on issue #206.
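Until such a feature exists, a rough check can be done outside MiXCR. The sketch below is not MiXCR functionality and is not the design discussed in #206. It applies a common heuristic, under the assumption that you already have a CDR3 nucleotide sequence as a string: treat the sequence as in-frame if its length is a multiple of 3, and as likely productive if it is also free of stop codons in that frame. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_in_frame(nt_seq: str) -> bool:
    """True if the sequence is non-empty and its length is a multiple of 3."""
    return len(nt_seq) > 0 and len(nt_seq) % 3 == 0


def has_stop_codon(nt_seq: str) -> bool:
    """True if any complete codon, read from position 0, is a stop codon."""
    seq = nt_seq.upper()
    return any(seq[i:i + 3] in STOP_CODONS
               for i in range(0, len(seq) - 2, 3))


def looks_productive(nt_seq: str) -> bool:
    """Heuristic only: in-frame and no stop codon in frame 0."""
    return is_in_frame(nt_seq) and not has_stop_codon(nt_seq)


# Toy sequences, not taken from the dataset in this thread.
for cdr3 in ["TGTGCGAGAGAT", "TGTGCGAGAGA", "TGTTAAGCG"]:
    print(cdr3, looks_productive(cdr3))
```

This ignores other criteria that stricter definitions of productivity may include, so the result is best read as a first-pass filter.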
