Layout of tabix based features is inconsistent due to assignment of IDs #1244

Closed
cmdcolin opened this Issue Oct 28, 2018 · 5 comments

3 participants
Contributor

cmdcolin commented Oct 28, 2018

While reviewing #1227, I saw that the layout of features from tabix-indexed files is not optimal, probably because feature IDs are assigned inconsistently. This issue exists on 1.15.4 and dev.

Small example file: histones.gff.gz

Screenshot: screenshot-localhost-2018 10 28-17-25-08

cmdcolin referenced this issue Oct 28, 2018: Use @gmod/vcf #1227 (Merged)

Contributor

cmdcolin commented Oct 28, 2018

Note that this possibly affects 1.15.3 and earlier too, but it seems more noticeable in 1.15.4.

1.15.3 (screenshot: screenshot-localhost-2018 10 28-17-43-26 2)

1.15.4 (screenshot: screenshot-localhost-2018 10 28-17-48-26)


garrettjstevens commented Oct 29, 2018

Yeah, it seems like the file offset isn't going to work as a unique ID. I think it's because (and I could be wrong here) the file offset is calculated not from the beginning of the file but from the beginning of a block. Since adjacent blocks are merged before decompressing, the offset of a given line can differ depending on whether prior blocks were merged.

I looked and I think GFF3Tabix is the only thing using the file offset now, so hopefully that's the only thing we have to fix. Maybe once I figure out what to do with VCFTabix I'll do the same here.
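To make the suspected offset problem concrete, here is a toy Python model. It is not JBrowse's or the tabix parser's actual code: the `blocks` list and the `offset_ids` helper are hypothetical, and the model simply assumes that an ID is the line's byte offset within whatever run of merged blocks was decompressed together.

```python
# Toy model (assumption): each entry is the decompressed text of one block.
blocks = [
    "chr1\t.\tgene\t100\t200\n",
    "chr1\t.\tgene\t300\t400\n",
    "chr1\t.\tgene\t500\t600\n",
]


def offset_ids(chunk_blocks):
    """Map each line to its byte offset within the merged chunk (hypothetical ID scheme)."""
    text = "".join(chunk_blocks)
    ids = {}
    pos = 0
    for line in text.splitlines(keepends=True):
        ids[line.rstrip("\n")] = pos
        pos += len(line.encode("utf-8"))
    return ids


# Query A happens to merge blocks 0..2 into one chunk.
ids_a = offset_ids(blocks[0:3])
# Query B only needs blocks 1..2, so the chunk starts one block later.
ids_b = offset_ids(blocks[1:3])

target = blocks[2].rstrip("\n")
# The same feature line gets a different ID in each query.
print(ids_a[target], ids_b[target])
```

If IDs behave like this, the same feature can look like two different features across requests, which would be consistent with the inconsistent layout in the screenshots.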

Collaborator

rbuels commented Nov 2, 2018

I think for GFF3Tabix we should just make the unique ID be the CRC32 of the whole line. }:-{

Collaborator

rbuels commented Nov 2, 2018

Or the ID could just be the whole line, which might actually be faster due to string pooling. I might try both ways.
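A minimal sketch of the two content-based ID ideas, in Python rather than JBrowse's own JavaScript. The helper names `crc32_id` and `line_id` are hypothetical and only illustrate the approach; `zlib.crc32` is the standard-library CRC32.

```python
import zlib


def crc32_id(line: str) -> str:
    """ID option 1: CRC32 of the whole line, as 8 hex digits."""
    return format(zlib.crc32(line.encode("utf-8")) & 0xFFFFFFFF, "08x")


def line_id(line: str) -> str:
    """ID option 2: the whole line itself is the ID."""
    return line


line = "chr1\t.\tgene\t500\t600\t.\t+\t.\tID=gene3"

# Both IDs depend only on the line's content, not on where the line
# sits in a decompressed chunk, so repeated queries agree.
print(crc32_id(line))
print(line_id(line) == line)
```

Either way the ID no longer depends on block merging. One caveat that applies to both options: two byte-identical lines in the file would get the same ID, and CRC32 can in principle also collide for different lines.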

Contributor

cmdcolin commented Nov 15, 2018

The fix for this issue in GFF is merged now.

cmdcolin closed this Nov 15, 2018