tskit tree format integration #18
@hyanwong directed me over here. Happy to answer questions re: tskit-rust. |
Thanks @molpopgen! Is it correct for me to represent parent-child relationships between lineages both in the individuals and the edges? |
|
I can answer that one: the individual parent is the "pedigree" parent, like your actual mother or father. The parent and child of an edge can represent a lineage that stretches over many pedigree generations, joining one node in the tree sequence with an ancestral node. It's not the case that every simulated individual has to have a node in the tree sequence: they could be simply thrown away during simulation, and only the lineage kept. Does that make sense? |
|
@molpopgen Integrating my code with tskit-rust was quite a smooth experience, which speaks to the great quality of your Rust wrapper. |
What he said. Different "parents", basically. |
|
Also, you don't need individuals to have parents. It's purely a helpful thing if you want to reconstruct the pedigree. Parents of individuals have only recently been introduced into tskit - we didn't bother with them until a few months ago. The edges are the key thing. |
Ah, I think this comes down to my confusing use of the word "individual" in my own simulation. Since I simulate lineages through their representative present-time individuals, I use a 1-1 mapping in necsim-rust. So I guess that's why I have the special case where the individual parent relations and the node relations are the same. |
This is quite interesting ... I remember that individuals must be ordered by their parent-child relationships. And the docs say that nodes have no ordering requirement (I think edges do, but I can sort those later on ...). Do you think I should produce the individual relationships (not doing so might be more efficient), and should I keep individuals at all or just set all nodes to -1? |
|
Right, if literally every parent and offspring is kept as a node, and you don't have any recombination (i.e. a node only ever has one parent), then the two are the same. However, you might want to "simplify" the tree, which (by default) removes "unary nodes" (i.e. non-coalescent nodes), in which case what was (say) two edges, from 0 -> 10 and 10 -> 20, will be merged into a single edge, 0 -> 20, and the individual parent of node 0 will point to null after simplification. |
I would create individuals (because that's where you'll probably save the X/Y location information), but simply not bother saving the parent information for them for the moment, and just save that in the edge table. You could always add data to the "parent" column in the individuals table later on, if you find you need it. |
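To make the edge merging concrete, here is a minimal sketch in plain Python. It is not tskit's `simplify` (which handles genomic intervals, recombination and more); `collapse_unary` and its arguments are hypothetical names, and it assumes every node has at most one parent.

```python
from collections import defaultdict


def collapse_unary(parent_of, samples):
    """Drop non-sample nodes with exactly one child and merge their edges.

    parent_of maps child node ID -> parent node ID. Hypothetical helper
    for illustration only; it is not part of tskit.
    """
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    def keep(node):
        # Keep sample nodes and nodes where at least two lineages coalesce.
        return node in samples or len(children[node]) >= 2

    merged = {}
    for child in parent_of:
        if not keep(child):
            continue
        ancestor = parent_of.get(child)
        # Walk past unary ancestors until a kept node (or the root) is reached.
        while ancestor is not None and not keep(ancestor):
            ancestor = parent_of.get(ancestor)
        if ancestor is not None:
            merged[child] = ancestor
    return merged


# Edges 0 -> 10 and 10 -> 20 from the comment above, plus a second
# sample (1 -> 20) so that node 20 is a real coalescence.
simplified = collapse_unary({0: 10, 10: 20, 1: 20}, samples={0, 1})
```

With this input, node 10 disappears and node 0 attaches directly to node 20, which is why a pedigree parent stored on the individual of node 0 would no longer have anything to point at.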
|
I've thought a bit more about this ... Right now, I will keep the relationships between individuals. Adding that information requires a special ordering of insertions, but not doing that still doesn't allow me to fully exploit the data format. I guess what I'm really curious about is what happens in tskit under the hood. Do you require that node and individual IDs are contiguous? If not, it might be most efficient to use arbitrary IDs. Would something like that be possible? Otherwise, my current solution seems like a good one. |
Sure - if you think you will use that information, then by all means store it.
Yes, the ID of a node or an individual is simply its row index in the node table and individual table respectively. If you want non-contiguous IDs then you'll need to have a load of unused individuals in the middle.
It's only the individuals that have a required order, I think. But you could allocate the individuals in any order, then sort the individuals table (I assume we can sort this table and it will renumber the parent IDs as required, but I haven't checked). The |
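As a rough illustration of "the ID is the row index", here is a toy table in plain Python. The `Table` class is a hypothetical stand-in, not the tskit or tskit-rust API; it only mirrors the idea that `add_row` hands back the new ID.

```python
class Table:
    """Toy stand-in for a node or individual table (hypothetical, not tskit)."""

    def __init__(self):
        self.rows = []

    def add_row(self, **columns):
        # The ID of the new row is simply its position in the table.
        self.rows.append(columns)
        return len(self.rows) - 1


nodes = Table()
first = nodes.add_row(time=0.0)
second = nodes.add_row(time=1.0)

# To end up with an arbitrary ID such as 5, the gap has to be filled
# with unused placeholder rows that nothing refers to.
while len(nodes.rows) < 5:
    nodes.add_row(time=None)
fifth = nodes.add_row(time=2.0)
```

Here `first` and `second` are 0 and 1, and reaching ID 5 costs three unused rows, which is the overhead mentioned above.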
Ok, that clears that up :)
That's a good point. Should I add it to just the individuals or the nodes as well? |
Hmm, we don't sort the individuals to be in the required order. That's a pain. Sorry. That's an issue that should probably be fixed. |
It depends. The "nodes" are conventionally thought of as genomes, so a (diploid) individual will usually have 2 nodes. I guess you probably don't have this distinction in necsim? Nodes are the key internal objects, so perhaps best just adding it to those? |
Ah - I was right the first time. We do sort individuals, but it's just not in the documentation yet: So you can allocate the individuals in an arbitrary order, then sort the tables, and it should all "just work". |
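For intuition, sorting individuals into the required order and renumbering the parent IDs could look roughly like this in plain Python. This is only a sketch of the idea, not how tskit implements its table sort; `sort_individuals` is a hypothetical name, and it assumes each individual has at most one parent and that the pedigree has no cycles.

```python
def sort_individuals(parents):
    """Reorder individuals so that every parent precedes its children.

    parents[i] is the parent ID of individual i, or -1 if it has none.
    Returns (new_parents, old_to_new). Hypothetical helper, not tskit.
    """
    old_to_new = {}
    order = []

    for start in range(len(parents)):
        # Collect the chain of ancestors that have not been placed yet.
        chain = []
        ind = start
        while ind != -1 and ind not in old_to_new:
            chain.append(ind)
            ind = parents[ind]
        # Place the oldest ancestor first, then work down to `start`.
        for ancestor in reversed(chain):
            old_to_new[ancestor] = len(order)
            order.append(ancestor)

    # Rewrite the parent column in terms of the new IDs.
    new_parents = [
        -1 if parents[old] == -1 else old_to_new[parents[old]]
        for old in order
    ]
    return new_parents, old_to_new


# Individual 0 was inserted before its parent 2; 1 is a child of 0.
new_parents, old_to_new = sort_individuals([2, 0, -1])
```

After the call, the former individual 2 becomes ID 0, and every remaining parent reference points to an earlier row.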
|
(NB: documentation is just about to be corrected: tskit-dev/tskit#1562) |
|
I've done a bit of code cleanup and added the metadata in #c24959b. Now, every node (and individual for good measure) stores the little endian encoding of the lineage reference, which tskit-demo-v2.zip demonstrates. |
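For readers of the demo notebook, decoding such metadata could look like the sketch below. The 8-byte (unsigned 64-bit) width is an assumption made for illustration, and both function names are hypothetical; check the plugin source for the actual layout.

```python
def encode_lineage_reference(reference: int) -> bytes:
    # Assumption: the lineage reference fits in an unsigned 64-bit integer.
    return reference.to_bytes(8, byteorder="little")


def decode_lineage_reference(metadata: bytes) -> int:
    # Little-endian: the least significant byte comes first.
    return int.from_bytes(metadata, byteorder="little")


encoded = encode_lineage_reference(258)  # 258 == 0x0102
decoded = decode_lineage_reference(encoded)
```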
That's good to hear! I guess what still binds me to at least a similar ordering is that I need to know the node IDs when creating edges. That's why I asked about contiguous IDs - if they could have been freely chosen, I could have used something I already know. Instead, I now have to postpone edge creation until both the parent and child lineage have been inserted. |
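One way to picture the postponed edge creation is a small buffer that holds edges until both endpoints have node IDs. This is a plain-Python sketch with hypothetical names (`EdgeBuilder`, `add_node`, `add_edge`), not the code in the PR and not the tskit API.

```python
class EdgeBuilder:
    """Buffers edges until both lineages have been inserted as nodes."""

    def __init__(self):
        self.node_ids = {}  # lineage reference -> node ID (row index)
        self.edges = []     # (parent node ID, child node ID)
        self.pending = []   # (parent reference, child reference)

    def add_node(self, reference):
        # Node IDs are handed out in insertion order, like table rows.
        node_id = len(self.node_ids)
        self.node_ids[reference] = node_id
        self._flush()
        return node_id

    def add_edge(self, parent_ref, child_ref):
        self.pending.append((parent_ref, child_ref))
        self._flush()

    def _flush(self):
        # Emit every pending edge whose endpoints are both known by now.
        still_waiting = []
        for parent_ref, child_ref in self.pending:
            if parent_ref in self.node_ids and child_ref in self.node_ids:
                self.edges.append(
                    (self.node_ids[parent_ref], self.node_ids[child_ref])
                )
            else:
                still_waiting.append((parent_ref, child_ref))
        self.pending = still_waiting


builder = EdgeBuilder()
builder.add_node("child")            # gets node ID 0
builder.add_edge("parent", "child")  # parent unknown, so the edge waits
edges_before = list(builder.edges)
builder.add_node("parent")           # gets node ID 1, edge is emitted
```

Rescanning the whole pending list on every insert is quadratic in the worst case; indexing pending edges by the missing reference would avoid that, at the cost of a little more bookkeeping.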
|
Yeah, you have to create the nodes first (the "add_row" returns the new ID) then make the edges. Or you could add a whole load of nodes in one go using |
The Rust API may (I should know this, right...) not allow adding columns directly. |
And, with good reason. There's no |
molpopgen
left a comment
This PR looks rather straightforward to me. Just a few comments, mostly out of curiosity and to make sure I understand what is happening.
Thanks @hyanwong and @molpopgen for your help with this PR! |
* (ml5717) Initial draft implementation of tskit tree integration
* (ml5717) Added some documenting comments
* (ml5717) Some code refactoring + added lineage reference metadata
* (ml5717) Fixed tskit reporter for independent lineages
* (ml5717) Experiment with WIP tskit newtype IDs
* (ml5717) Upgraded to tskit 0.5
* (ml5717) Switched HashMap to FnvHasher
Integration of the necsim-rust simulation library with the tskit data model inside the necsim-plugins-tskit plugin:
Demo Jupyter Notebook (execute inside repo root dir): tskit-demo.zip