Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

maf file output correct? #1

Closed
kingralph80 opened this issue Apr 7, 2020 · 27 comments
Closed

maf file output correct? #1

kingralph80 opened this issue Apr 7, 2020 · 27 comments

Comments

@kingralph80
Copy link

kingralph80 commented Apr 7, 2020

Hi,

I am using GSAlign for the alignment of two diverse maize genomes with parameters:
-dp -ind 100 -clr 50 -alen 200 -t 48
-gp /usr/bin/gnuplot \

GSAlign runs through without errors, however, when I try to convert the MAF file to PSL by kent tool mafToPsl I always get:
Coordinates out of range line 311372 of B73v5_vs_Oh43_ch1_10.maf

Cheers

@hsinnan75
Copy link
Owner

Thanks for reporting this issue. I'll check the Maf output.

@kingralph80
Copy link
Author

Big thanks,

Also, I really like GSAlign as it's so fast. Is there a more detailed explanation for some of the parameters? I compared the -sen mode with default but got a shorter alignment and lower ASI with -sen.

@hsinnan75
Copy link
Owner

Thank you for saying that. I will keep improving GSAlign to make it more useful. I'll also put more explanation of the parameters in the help message. Thank you again.

@kingralph80
Copy link
Author

kingralph80 commented Apr 7, 2020

Hi I think I figured it out. It's the ---- that are added in the alignment.

When I checked the maf I found for the first line:
##maf version=1
a score=1654529
s chr1 102260581 1654865 + 308452471 aaggaaggagcatgcggaagagaaaaat...

When I then checked the length of the alignment, it was 1654992 characters long. Of those 127 were ----. If you subtract that -> 1654865. So this would explain why mafToPsl says out of range and why when I make a chain file from the maf downstream tools also give errors with that chain file.

I assume this is because of gapped alignments? Do you have to count the length of the alignment with --?

@hsinnan75
Copy link
Owner

It is wired. Gaps are not counted when I output the alignment length. I'll check the codes again. Thank you!

@hsinnan75
Copy link
Owner

I did some tests and confirmed that the gaps are not counted. Could you please give me your sequences? I'd like to use your sequences to check the output. Thank you.

@kingralph80
Copy link
Author

Correct, they are not counted. I thought that would be the issue given the error with mafToPsl. Okay, I will share the sequences through google drive.

@kingralph80
Copy link
Author

kingralph80 commented Apr 14, 2020

Question: If you use any of your maf files that you generate with GSAlign you can convert it to PSL with mafToPsl from uscs tools? I am trying the convert the maf to a chain file and add GenomeAlignmentTools to improve the chain file. Since mafToPsl failed I used maf convert which did convert the GSAlign maf file to chain but if I use that chain with any tool from GenomeAlignmentTools I get errors. Sorry I get to the upload today but could you also test your maf files and if you can convert them to chain format?

@hsinnan75
Copy link
Owner

I'll see if I could provide output with PSL format directly in GSAlign. Before that, I'll check my maf output using your dataset. Thank you!

@hsinnan75
Copy link
Owner

@kingralph80 I run some tests on your sequences with default parameters and your parameters, and then I use mafToPsl (from kentUtils) to convert maf into psl format, there is no any warnings or errors. The ucsc-maftopsl could not be run in my server, could you please download kentUtils and run that mafToPsl with the maf file. Thank you!

@kingralph80
Copy link
Author

kingralph80 commented Apr 15, 2020

You are right, no error message but the file size is 0 bytes. Could you tell me which kentUtils are you referring too? https://github.com/ENCODE-DCC/kentUtils is what I used
I am assuming the psl file you get from our test chr. is not 0 bytes?

Also changed to default parameters and my maf file is 200218210 with Alignment#=25114 (total alignment length=98983799) ANI=98.41%

@hsinnan75
Copy link
Owner

It's wired. The PSL file I got was exactly the same number of alignments as GSAlign outputed. We downloaded the same kentUtils. And my command was mafToPsl CML332.chr10 B73.chr10 output.maf output.psl. I found the order of two reference names does not matter. They both produced the same result. I was wonder how is the coverage rate you expected?

@kingralph80
Copy link
Author

kingralph80 commented Apr 21, 2020

Hi. so thanks for resolving this. In the end, the error originally was that the genomes we used have the same chromosome names. All the maize once we compare use chr1 to chr10. The tools we used to compare GSAlign to like cactus, renamed them in the maf using our Input in the command line. GSAlign does not seem to not have such an input parameter. For the test, we then renamed the chroms ourselves a typo in the chromosome name resulted in the 0 byte psl output.

@kingralph80
Copy link
Author

kingralph80 commented Apr 21, 2020

Would it be possible to add an optional input parameter for a suffix for target and query so that if needed the same chr labels can be changed in the output maf? This way the user does not need to rename them all? In our case, its ~30 genomes but others may have many more.

@hsinnan75
Copy link
Owner

hsinnan75 commented Apr 21, 2020

Hi, I've updated GSAlign to 1.0.14. It adds ref and qry as prefix for reference and query chromosomes, respectively in the MSA/ALN file. I'll add two more parameters to let user input their own prefix/suffix for the two genomes in the future version.

@kingralph80
Copy link
Author

Argh, when I use the full genome not just chr10 we are back to the original error. MafToPsl says:
"Coordinates out of range line 103592 of B73v5.CML322.50.200.maf"

I will post the line that causes the error (103592) plus the line prior in the next message but I think mafToPSL is correct. the line that causes the issue starts with:
s qry.chr1 304926865 658 + 304927522

If you add 304926865 (Start) + 658 (Alignment) = 304927523, this cannot be if the whole source length is 304927522. Its off by one bp.

@kingralph80
Copy link
Author

kingralph80 commented Apr 21, 2020

The full alignment that causes the error in the maf file:

a score=645
s ref.chr1 308444775 654 + 308452471 gtgttcatggtttcggttttgggtctcgggATTTAGGATTTAAGGGTTTGGGGTTTAGTGTGATGTAGCTCAAGGGATGATAAAAACAACCTTACTTATTTCAGTAGCCAAAACATGGGCGGTGG----GTGGGAGGCATGAAATTGTTGTTCAATGACGAAAACATGTGTCACGGGAGGCCAAAAATAGACTTGCAGGCCTCCGGGAACCAAAACGTGTACTATGGTTTGAACACTTATGTACTATGGTTGGACACTAAATTGGTTGTTCTACCATGAAAACATGTGTTGTAGCTCACGCGAGGCCAAAAACAACCTTACCAGTCTTAGTGGCCAAAACATGGGTGGCGGACGGGAGGCATGAAATCGTTGTTCAACGACGAAAACATGTGTCGCGGGAGGCCAAAAATAGTCTTACCGACCTCCGGGGACCAAAACGTGTACTATGGTTTGGAGACTCAAGTAATCTTCTATTATGGTTTGGTCATTAAAATCTtttagtgttggtgtttagggttatggtttcaggtgtaggTttattgattagggtttggggttcggggcttagggtttagggtttggggtttggggttaagggttcgggtgtggggttcatggttttGggtttggggctcatggtttagggtttaggattTTG
s qry.chr1 304926865 658 + 304927522 gtgttcatggtttcggttttgggtctcggggtttaggatttaagggtttggggtttAGTGTGATGTAGCTCAAGGGATGATAAAAACAACCTTACTTATTTCAGTAGCCAAAACATGGGCGGTGGGCGCGCGGGAGGCATGAAATTGTTGTTCAATGACGAAAACATGTGTCACGGGAGGCCAAAAATAGACTTGCAGGCCTCCGGGAACCAAAACATGTACTATGGTTTGAACACTTATGTACTATGGTTGGACACTAAATTGGTTGTTCTACCATGAAAACGTGTGTTGTAGCTCACGCGAGGCCAAAAACAACCTTACCAGTCTTAGTGGCCAAAACATGGGTGGGGGACGGGAGGCATGAAATCGTTGTTCGACGACGAAAACATGTGTAGCGGGAGGCCAAAAATAGTCTTACCGACCTCCGGGGACCAAAACGTGTACTATGGTTTGGAGACTCAAGTAATCTTCTATTATGGTTTGGTCATTAAAATCTtttagtgttggtgtttagggttatggtttcaggtgtaggattattgattagggtttggggttcggggcttagggtttagggtttggggtttggggttaagggttcgggtgtggggttcatggtttttggtttggggctcatggtttagggtttaggattTTG

@hsinnan75
Copy link
Owner

hsinnan75 commented Apr 21, 2020 via email

@kingralph80
Copy link
Author

Ok I will post this at the kent utilities.

@kingralph80
Copy link
Author

kingralph80 commented Apr 24, 2020

Hi,
I used the GSAlign output maf, converted it to chain (maf-convert) and wanted to use it with picard LiftOverVcf. However, the tool throws out an error that again sounds very similar: "Terminal block seen before end of chain".
So I went to the UCSC definition for the maf file format:

under
Lines starting with "s" -- a sequence within an alignment block
start -- The start of the aligning region in the source sequence. This is a zero-based number. If the strand field is "-" then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).

and under
Lines starting with "e" -- information about empty parts of the alignment block
start -- The start of the non-aligning region in the source sequence. This is a zero-based number. If the strand field is "-" then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).

From these lines, I would assume maf is 0-based not 1-based,or ? I double-checked and a 1- based coordinate is not mentioned in the definition of maf ( https://genome.ucsc.edu/FAQ/FAQformat.html#format5 ).

HOWEVER, there is a second maf format (Mutation Annotation Format instead of Multiple Alignment format). Are there different definitions of the MAF format?

@kingralph80
Copy link
Author

Sorry for the back and forth. I really just want to be able to use the GSAlign output further.

@hsinnan75
Copy link
Owner

hsinnan75 commented Apr 24, 2020 via email

@kingralph80
Copy link
Author

Big thanks for supporting the tool. I will try the new version right away.

@kingralph80
Copy link
Author

kingralph80 commented Apr 24, 2020

I check. Same error although the output looks different:

The line that caused the error:
s qry_chr7 180451555 45989 - 180455515

Not sure I see the error. Does - strand means the start is the stop position?
I doubled checked. The length of the chrom. 7 is 180455515.

@kingralph80
Copy link
Author

kingralph80 commented Apr 27, 2020

The authors of mafToPSL at UCSC confirmed that maf MAF is a 0-based. In case of this error the MAF definition is:

start -- The start of the aligning region in the source sequence. This is a zero-based number. If the strand field is "-" then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).

So the way I understand this should the alignment not read:
s qry_chr7 180405566 45989 - 180455515

180451555 - 45989 = Start position and - strand indicates its reverse complement?

@hsinnan75
Copy link
Owner

Thanks for the information. And yes, the "-" indicates the reverse-complemented strand.

@kingralph80
Copy link
Author

Dear Hsinnan,

your latest commit solved this reported issue. Thanks a lot for fixing and support your great tool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants