Support bigRmsk #1236

jrobinso · 2022-10-11T19:11:22Z

"bigRmsk" files will currently not work in IGV due to negative block starts. See issue #1130. The format is valid BED9, but we need the extra fields. Fix is to properly interpret the negative block starts.

Autosql: http://genome.ucsc.edu/goldenPath/help/examples/bigRmskBed.as
Also see: http://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlignBed.as

maximilianh · 2022-10-11T20:25:19Z

bigRmsk is a perfectly valid BED9. Does IGV not read the BED Type from the bigBed file? The bigBed header stores the number of defined fields (9 here) and so IGV should not even try to parse fields 10-12…

jrobinso · 2022-10-12T04:29:53Z

@maximilianh Perhaps I didn't phrase this correctly, will correct it, yes its perfectly valid but not really useful for display without properly treating the extra fields.

In general we read the number of defined fields and don't try to parse extra fields, but for certain special fields we do. This came up previously, and you had indicated (I think) that the defined names (e.g. blockstarts in this case) are never re-defined and if found in an extra field should be safe to use. So when we see "blockstarts" and "blocksizes" in extra fields we use them. This came up, I think, in "bigCat" format but perhaps for different fields.

In the bigRmsk case fields 10-12 are neccessary to do a reasonable display. Now that I know that blockstarts can be negative to table name bigRmsk I will just make a special case.

So yes I could strictly just look at defined fields but for bigGenePred, bigCat, and bigRmsk this would yield suboptimal displays, e.g. without the blocks. The extra fields are not just comments.

jrobinso · 2022-10-12T04:38:08Z

Another one we look for is exonFrames. So the complete set of "extra" fields currently parsed are as follows

blockCount 
blockSizes
blockStarts
exonFrames

If we see these attribute names we assume they are as defined for bed format, or in the case of exonFrames for genePredExt format. This has worked so far except for this special case, I need to dig more deeply into this to determine what IGV will do with the negative blockStarts, thus I opened this ticket for myself, there's no implication that anything is incorrect in the files. Just unexpected.

maximilianh · 2022-10-12T12:45:17Z

Maybe I didn't explain this well: the header of all bigBed files contains the number of fields that are "parsable". It's here in the C code: https://github.com/ucscGenomeBrowser/kent/blob/34d24ed91f2d7a04467840c6763532345e4c0529/src/lib/bbiRead.c#L116 /* Read rest of defined bits of header, byte swapping as needed. */ bbi->version = udcReadBits16(udc, isSwapped); bbi->zoomLevels = udcReadBits16(udc, isSwapped); bbi->chromTreeOffset = udcReadBits64(udc, isSwapped); bbi->unzoomedDataOffset = udcReadBits64(udc, isSwapped); bbi->unzoomedIndexOffset = udcReadBits64(udc, isSwapped); bbi->fieldCount = udcReadBits16(udc, isSwapped); bbi->fieldCode should be "9" here. And IGV should not parse more than 9 fields as a consequence. This is independent of any field names in the autoSql definition. I imagine the javascript parser also reads this field?

…

On Wed, Oct 12, 2022 at 6:38 AM Jim Robinson ***@***.***> wrote: Another one we look for is exonFrames. So the complete set of "extra" fields current parsed are as follows blockCount blockSizes blockStarts exonFrames If we see these attribute names we assume they are as defined for bed format, or in the case of exonFrames for genePredExt format. This has worked so far except for this special case, I need to dig more deeply into this to determine what IGV will do with the negative blockStarts, thus I opened this ticket for myself, there's no implication that anything is incorrect in the files. Just unexpected. — Reply to this email directly, view it on GitHub <#1236 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACL4TNZZMIKIYZYZ4QNINTWCY6DXANCNFSM6AAAAAARCSW6T4> . You are receiving this because you were mentioned.Message ID: ***@***.***>

jrobinso · 2022-10-12T16:15:54Z

@maximilianh Yes I understand, and fieldCount is used. However its necessary to parse some of the extra fields to support IGV tracks, otherwise the "big" versions of formats such as genePredExt/bigGenePred and rmsk/bigRmsk will be a regression from the point of view of the user, for example for rmsk we will not be able to draw the blocks, for bigGenePred we will not be able to do protein translation without "exonFrames" . This is not a problem in general, this special case of unexpected negative blockStarts just needs to be handled. This is not a major problem.

jrobinso self-assigned this Oct 5, 2023

jrobinso added this to the 2.17.0 milestone Oct 5, 2023

jrobinso removed this from the 3.0.0 milestone Nov 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support bigRmsk #1236

Support bigRmsk #1236

jrobinso commented Oct 11, 2022 •

edited

Loading

maximilianh commented Oct 11, 2022 •

edited

Loading

jrobinso commented Oct 12, 2022 •

edited

Loading

jrobinso commented Oct 12, 2022 •

edited

Loading

maximilianh commented Oct 12, 2022 via email

jrobinso commented Oct 12, 2022 •

edited

Loading

Support bigRmsk #1236

Support bigRmsk #1236

Comments

jrobinso commented Oct 11, 2022 • edited Loading

maximilianh commented Oct 11, 2022 • edited Loading

jrobinso commented Oct 12, 2022 • edited Loading

jrobinso commented Oct 12, 2022 • edited Loading

maximilianh commented Oct 12, 2022 via email

jrobinso commented Oct 12, 2022 • edited Loading

jrobinso commented Oct 11, 2022 •

edited

Loading

maximilianh commented Oct 11, 2022 •

edited

Loading

jrobinso commented Oct 12, 2022 •

edited

Loading

jrobinso commented Oct 12, 2022 •

edited

Loading

jrobinso commented Oct 12, 2022 •

edited

Loading