Skip to content

Conversation

@juntyr
Copy link
Owner

@juntyr juntyr commented Jul 8, 2021

Integration of the necsim-rust simulation library with the tskit data model inside the necsim-plugins-tskit plugin:

  • lineages map to individuals
  • speciation and coalescence events map to nodes
  • parent-child relationships map to individual parent-child relationships and node edges
  • provenance information is provided

Demo Jupyter Notebook (execute inside repo root dir): tskit-demo.zip

@juntyr juntyr requested a review from hyanwong July 8, 2021 13:56
@molpopgen
Copy link

@hyanwong directed me over here. Happy to answer questions re: tskit-rust.

@juntyr
Copy link
Owner Author

juntyr commented Jul 8, 2021

@hyanwong directed me over here. Happy to answer questions re: tskit-rust.

Thanks @molpopgen! Is it correct for me to represent parent-child relationships between lineages both in the individuals and the edges?

@hyanwong
Copy link
Collaborator

hyanwong commented Jul 8, 2021

I can answer that one: the individual parent is the "pedigree" parent, like your actual mother or father. The parent and child of an edge can represent a lineage that stretches over many pedigree generations, joining one node in the tree sequence with an ancestral node. It's not the case that every simulated individual has to have a node in the tree sequence: they could be simply thrown away during simulation, and only the lineage kept.

Does that make sense?

@juntyr
Copy link
Owner Author

juntyr commented Jul 8, 2021

@molpopgen Integrating my code with tskit-rust was quite a smooth experience, which speaks to the great quality of your Rust wrapper.

@molpopgen
Copy link

I can answer that one: the individual parent is the "pedigree" parent, like your actual mother or father. The parent and child of an edge can represent a lineage that stretches over many pedigree generations, joining one node in the tree sequence with an ancestral node. It's not the case that every simulated individual has to have a node in the tree sequence: they could be simply thrown away during simulation, and only the lineage kept.

Does that make sense?

What he said. Different "parents", basically.

@hyanwong
Copy link
Collaborator

hyanwong commented Jul 8, 2021

Also, you don't need individuals to have parents. It's purely a helpful thing if you want to reconstruct the pedigree. Parents of individuals have only recently been introduced into tskit - we didn't bother with them until a few months ago. The edges are the key thing.

@juntyr
Copy link
Owner Author

juntyr commented Jul 8, 2021

I can answer that one: the individual parent is the "pedigree" parent, like your actual mother or father. The parent and child of an edge can represent a lineage that stretches over many pedigree generations, joining one node in the tree sequence with an ancestral node. It's not the case that every simulated individual has to have a node in the tree sequence: they could be simply thrown away during simulation, and only the lineage kept.

Does that make sense?

Ah, I think this goes at my confusing use of the word individual in my own simulation. Since I am simulating lineages by using their representative present-time individuals, I use a 1-1 mapping in necsim-rust. So I guess that's why I have the special case where the individual parent relations and node relations are the same.

@juntyr
Copy link
Owner Author

juntyr commented Jul 8, 2021

Also, you don't need individuals to have parents. It's purely a helpful thing if you want to reconstruct the pedigree. Parents of individuals have only recently been introduced into tskit - we didn't bother with them until a few months ago. The edges are the key thing.

This is quite interesting ... I remember that individuals must be ordered by their parent-child relationships. And the docs say that nodes have no ordering requirement (I think edges do, but I can sort those later on ...). Do you think I should produce the individual relationships (not doing so might be more efficient), and should I keep individuals at all or just set all nodes to -1?

@hyanwong
Copy link
Collaborator

hyanwong commented Jul 8, 2021

Right, if literally every parent and offspring is kept as a node, and you don't have any recombination (i.e. a node only every has one parent), then the two are the same.

However, you might want to "simplify" the tree, which (by default) removes "unary nodes" (i.e. non-coalescent nodes), in which case what was (say) two edges, from 0 -> 10 and 10 -> 20 will be merged into a single edge, 0 -> 20, and the individual parent of node 0 will point to null after simplification.

@hyanwong
Copy link
Collaborator

hyanwong commented Jul 8, 2021

should I keep individuals at all or just set all nodes to -1?

I would create individuals (because that's where you'll probably save the X/Y location information), but simply not bother saving the parent information them for the moment, and simply save that in the edge table. You could always add data to the "parent" column in the individuals table later on, if you find you need it.

@juntyr
Copy link
Owner Author

juntyr commented Jul 8, 2021

I've thought a bit more about this ...

Right now, I will keep the relationships between individuals. Adding that information requires a special ordering of insertions, but not doing that still doesn't allow me to fully exploit the data format.

I guess what I'm really curious about is what happens in tskit under the hood. Do you require that node and individual IDs are contiguous? If not, it might be most efficient to use arbitrary IDs. necsim-rust already has an internal lineage naming system that is consistent when different parts of a simulation are run independently. If I could directly feed in these u64 identifiers as individual / node names, then I wouldn't have to worry about the insertion order and the indices would be more meaningful.

Would something like that be possible? Otherwise, my current solution seems like a good one.

@hyanwong
Copy link
Collaborator

hyanwong commented Jul 8, 2021

I've thought a bit more about this ...

Right now, I will keep the relationships between individuals. Adding that information requires a special ordering of insertions, but not doing that still doesn't allow me to fully exploit the data format.

Sure - if you think you will use that information, then by all means store it.

I guess what I'm really curious about is what happens in tskit under the hood. Do you require that node and individual IDs are contiguous?

Yes, the ID of a node or an individual is simply its row index in the node table and individual table respectively. If you want non-contiguous IDs then you'll need to have a load of unused individuals in the middle.

If not, it might be most efficient to use arbitrary IDs. necsim-rust already has an internal lineage naming system that is consistent when different parts of a simulation are run independently. If I could directly feed in these u64 identifiers as individual / node names, then I wouldn't have to worry about the insertion order and the indices would be more meaningful.

It's only the individuals that have a required order, I think. But you could allocate the individuals in any order, then sort the individuals table (I assume we can sort this table and it will renumber the parent IDs as required, but I haven't checked).

The u64 identifiers sound like they might be useful node or individual metadata?

@juntyr
Copy link
Owner Author

juntyr commented Jul 8, 2021

Yes, the ID of a node or an individual is simply its row index in the node table and individual table respectively. If you want non-contiguous IDs then you'll need to have a load of unused individuals in the middle.

Ok, that clear that up :)

The u64 identifiers sound like they might be useful node or individual metadata?

That's a good point. Should I add it to just the individuals or the nodes as well?

@hyanwong
Copy link
Collaborator

hyanwong commented Jul 8, 2021

It's only the individuals that have a required order, I think. But you could allocate the individuals in any order, then sort the individuals table (I assume we can sort this table and it will renumber the parent IDs as required, but I haven't checked).

Hmm, we don't sort the individuals to be in the required order. That's a pain. Sorry. That's an issue that should probably be fixed.

@hyanwong
Copy link
Collaborator

hyanwong commented Jul 8, 2021

That's a good point. Should I add it to just the individuals or the nodes as well?

It depends. The "nodes" are conventionally thought of as genomes, so a (diploid) individual will usually have 2 nodes. I guess you probably don't have this distinction in necsim?

Nodes are the key internal objects, so perhaps best just adding it to those?

@hyanwong
Copy link
Collaborator

hyanwong commented Jul 8, 2021

It's only the individuals that have a required order, I think. But you could allocate the individuals in any order, then sort the individuals table (I assume we can sort this table and it will renumber the parent IDs as required, but I haven't checked).

Hmm, we don't sort the individuals to be in the required order. That's a pain. Sorry. That's an issue that should probably be fixed.

Ah - I was right the first time. We do sort individuals, but it's just not in the documentation yet:

tskit-dev/tskit#1199

So you can allocate the individuals in an arbitrary order, then sort the tables, and it should all "just work".

@hyanwong
Copy link
Collaborator

hyanwong commented Jul 8, 2021

(NB: documentation is just about to be corrected: tskit-dev/tskit#1562)

@juntyr
Copy link
Owner Author

juntyr commented Jul 8, 2021

I've done a bit of code cleanup and added the metadata in #c24959b. Now, every node (and individual for good measure) stores the little endian encoding of the lineage reference, which tskit-demo-v2.zip demonstrates.

@juntyr
Copy link
Owner Author

juntyr commented Jul 8, 2021

Ah - I was right the first time. We do sort individuals, but it's just not in the documentation yet:

tskit-dev/tskit#1199

So you can allocate the individuals in an arbitrary order, then sort the tables, and it should all "just work".

That's good to hear! I guess what still binds me to at least a similar ordering is that I need to know the node IDs when creating edges. That's why I asked about contiguous IDs - if they could have been freely chosen, I could have used something I already know. Instead, I now have to postpone edge creation until both the parent and child lineage have been inserted.

@hyanwong
Copy link
Collaborator

hyanwong commented Jul 8, 2021

Yeah, you have to create the nodes first (the "add_row" returns the new ID) then make the edges. Or you could add a whole load of nodes in one go using add_columns and keep track of the IDs yourself.

@molpopgen
Copy link

molpopgen commented Jul 8, 2021

Yeah, you have to create the nodes first (the "add_row" returns the new ID) then make the edges. Or you could add a whole load of nodes in one go using add_columns and keep track of the IDs yourself.

The rust API may (I should know this, right...) not allow for add columns directly.

@molpopgen
Copy link

molpopgen commented Jul 8, 2021

Yeah, you have to create the nodes first (the "add_row" returns the new ID) then make the edges. Or you could add a whole load of nodes in one go using add_columns and keep track of the IDs yourself.

The rust API may (I should know this, right...) not allow for add columns directly.

And, with good reason. There's no tsk_foo_add_columns for any table type foo in the C API. So those "column"-wise ops are a feature of the Python API only.

@github-actions github-actions bot force-pushed the tskit branch 7 times, most recently from 0b96094 to bd1f9ed Compare July 9, 2021 13:57
@github-actions github-actions bot force-pushed the tskit branch 5 times, most recently from c1770cc to ef630f4 Compare October 29, 2021 09:42
@github-actions
Copy link

Rebase of head ref tskit on base ref main failed. Conflicts must be resolved manually.

Copy link

@molpopgen molpopgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks rather straightforward to me. Just a few comments, mostly out of curiosity and to make sure I understand what is happening.

@github-actions github-actions bot force-pushed the tskit branch 3 times, most recently from 93a4fcc to b8ad384 Compare November 3, 2021 08:03
@github-actions
Copy link

github-actions bot commented Nov 8, 2021

Rebase of head ref tskit on base ref main failed. Conflicts must be resolved manually.

@juntyr
Copy link
Owner Author

juntyr commented Nov 9, 2021

Thanks @hyanwong and @molpopgen for your help with this PR!

@juntyr juntyr merged commit 17a69bd into main Nov 9, 2021
@juntyr juntyr deleted the tskit branch November 9, 2021 08:30
juntyr added a commit that referenced this pull request May 8, 2022
* (ml5717) Initial draft implementation of tskit tree integration

* (ml5717) Added some documenting comments

* (ml5717) Some code refactoring + added lineage reference metadata

* (ml5717) Fixed tskit reporter for independent lineages

* (ml5717) Experiment with WIP tskit newtype IDs

* (ml5717) Upgraded to tskit 0.5

* (ml5717) Switched HashMap to FnvHasher
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants