some sequences of output file not matching input sequences #334
Comments
Forgot to add that this happens in the alignment step (AlignmentExport).
Could you paste the exact commands you used?
For the alignment with 100.000 sequences I tried versions 2.1.5 and 2.1.8, and the result is always the same: one sequence is not found in the input file, and with one million sequences many more are not found.
/Users/anak/Downloads/mixcr-2.1.8/mixcr align -r alignmenReportIgSimulator_100000_sequences.log -c IGH -s HomoSapiens --save-description -f /Users/anak/Documents/masterThesis/simulators/ig_simulator/ig_simulator_test_100000_sequences/final_repertoire.fasta /Users/anak/Documents/gitMasterThesis/MiXCR_results/IgSimulator_aligned_sequences/100000_sequences/output_file_100000_sequences_IgSimulator_MiXCR.vdjca
For export with 100.000 sequences:
Can you check whether the reverse complements of those missing sequences are also absent from the initial file?
I just checked. Unfortunately they are not.
Just to clarify: the number of output sequences matches the number of input sequences for 100.000, and for one million there are 4 fewer sequences in the output file because gene assignment failed for them. The counts are fine, but for some entries the nucleotide sequence written in the readSeq column does not match any sequence in the input file.
Please try removing all sequences with 'N' nucleotides from the dataset and rerunning the analysis. 'N' nucleotides are converted to a random letter with zero quality inside the align pipeline. This might be the reason for the issue. I am marking it as a bug, as this behaviour is not expected, but I can't give an estimate of when it will be fixed.
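The filtering step suggested above could be sketched roughly as follows. This is a minimal illustration in plain Python, not part of MiXCR; the helper names `read_fasta` and `drop_n_records` are hypothetical, and the sketch assumes a simple FASTA layout with `>` header lines followed by one or more sequence lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_n_records(records):
    """Keep only records whose sequence contains no 'N' (case-insensitive)."""
    for header, seq in records:
        if "N" not in seq.upper():
            yield header, seq


if __name__ == "__main__":
    example = [">a", "ACGT", ">b", "ACNT", ">c", "ac", "gt"]
    for header, seq in drop_n_records(read_fasta(example)):
        print(">" + header)
        print(seq)
```

To apply this to a real file, one would pass an open file handle to `read_fasta` and write the surviving records to a new FASTA file before rerunning the alignment.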
I removed all the Ns, but some sequences still did not match the input sequences. Those sequences were reverse complements of the input sequences, so the problem is apparently both N and reverse complement. It might be convenient for users to have a column in the output file with the names of the sequences (as they are labeled in the FASTA file); tracking the sequences would then be much simpler. Just a modest suggestion. Thank you very much for such quick help.
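A matching check that accounts for reverse complements, and reports the original header for each hit, might look like the sketch below. It is an illustration only, not the script used in this thread: the function names are hypothetical, the input is assumed to be already parsed into (header, sequence) pairs, and the read sequence is assumed to come from the readSeq column of the export.

```python
# Translation table for complementing nucleotides; 'N' maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_index(records):
    """Map each upper-cased input sequence to the first header carrying it."""
    index = {}
    for header, seq in records:
        index.setdefault(seq.upper(), header)
    return index


def find_source(read_seq, index):
    """Return (header, orientation) for a read, or None if it is not found.

    The read is looked up as-is first, then as its reverse complement.
    """
    seq = read_seq.upper()
    if seq in index:
        return index[seq], "forward"
    rc = reverse_complement(seq)
    if rc in index:
        return index[rc], "reverse"
    return None


if __name__ == "__main__":
    index = build_index([("read1", "ACGTT"), ("read2", "GGGA")])
    for read in ["ACGTT", "TCCC", "AAAA"]:
        print(read, find_source(read, index))
```

Reads that come back as `None` in both orientations are the ones worth inspecting by hand, for example for positions where an N in the input was replaced by a random letter.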
Thanks for the suggestion! In the latest stable version (2.1.9) tracking of reads with their headers is possible using Tracking of In 2.2 the commands, and the pipeline in general, for tracking the alignments-to-clones mapping will change. Many enhancements to this infrastructure have been made. For example, it will work correctly with
Thank you so much. I wrote a script for matching the sequences, and now they match perfectly. The only remaining question is how to determine whether a sequence in the alignment output file is productive. Is there any information about in-frame versus out-of-frame sequences?
Hi. I am encountering a problem while annotating 100.000 and 1 million sequences with MiXCR version 2.1.8. In the readSeq column of the output file there is one sequence (for the 100.000-sequence input) and around 100 sequences (for the 1-million-sequence input) that do not match any sequence in the input file. Please let me know where the problem might be.