(tskitconvert_vignette)=

# Converting data to tskit format

:::{note}
The examples in this vignette involve population objects with no ancestry.
This is intentional, as it keeps the execution times to a minimum while still illustrating the concepts.
:::

In [1]:
import fwdpy11

At the end of a simulation, {func}`fwdpy11.DiploidPopulation.dump_tables_to_tskit` will generate a {class}`tskit.TreeSequence` object.
This object can be saved to a file using {func}`tskit.TreeSequence.dump`, resulting in a "trees file".

Saving data in `tskit`'s format gives you access to a huge array of methods for downstream analysis of your simulated data.
See the [tskit docs](https://tskit.dev/tskit/docs/stable/index.html) for much more on what you can do.
Much of what you need is found in the docs for {class}`tskit.TreeSequence`.

What happens when we call {func}`fwdpy11.DiploidPopulation.dump_tables_to_tskit`?

1. All tables are converted from `fwdpy11` format to `tskit` format in a {class}`tskit.TableCollection`.
2. For individuals, populations, and mutations, row-level metadata are encoded.
   The {ref}`next vignette <tskit_metadata_vignette>` covers how to interact with this information.
3. Optionally, top-level metadata are encoded, which is controlled by keyword arguments to {func}`fwdpy11.DiploidPopulation.dump_tables_to_tskit`.

## Obtaining a `tskit` tree sequence

In [2]:
pop = fwdpy11.DiploidPopulation(1000, 1.0)

By default, export to `tskit` returns a {class}`tskit.TreeSequence`:

In [3]:
ts = pop.dump_tables_to_tskit()
type(ts)

tskit.trees.TreeSequence

In general, we want this type in order to dump straight to a file:

```python
# not executed
ts.dump("treefile.trees")
```

We can also do a lot of analysis using this type.
However, certain tasks would be easier if this type knew about the metadata generated by `fwdpy11`.

### A lightweight wrapper

The class {class}`fwdpy11.tskit_tools.WrappedTreeSequence` is a light wrapper around {class}`tskit.TreeSequence`.
This class has properties to easily fetch the top-level metadata.
We use these properties in the examples below.

Instances of this wrapper class can be generated in several ways:

* Directly:

In [4]:
wrapper = fwdpy11.tskit_tools.WrappedTreeSequence(ts=ts)
print(type(wrapper), type(wrapper.ts))

<class 'fwdpy11.tskit_tools.trees.WrappedTreeSequence'> <class 'tskit.trees.TreeSequence'>


* Via a call to {func}`fwdpy11.tskit_tools.load`

* By passing an option when exporting from `fwdpy11`:

In [5]:
ts = pop.dump_tables_to_tskit(wrapped=True)
print(type(ts), type(ts.ts))

<class 'fwdpy11.tskit_tools.trees.WrappedTreeSequence'> <class 'tskit.trees.TreeSequence'>


To dump this wrapped type to a file:

```python
# not executed
ts.ts.dump("treefile.trees")
```

In fact, all "`tskit` things" can be done by accessing {attr}`fwdpy11.tskit_tools.WrappedTreeSequence.ts`.

Trying to create the wrapped type from a tree sequence not generated by `fwdpy11` is an error:

In [6]:
import tskit
tc = tskit.TableCollection(100.)
ts = tc.tree_sequence()
wrapper = fwdpy11.tskit_tools.WrappedTreeSequence(ts)

ValueError: this tree sequence was not generated by fwdpy11

## Metadata

The `tskit` data model includes [metadata](https://tskit.dev/tskit/docs/stable/metadata.html).
Metadata means anything not strictly needed for operations on trees/tree sequences.
Examples may include population names, properties of mutations such as selection coefficients, geographic locations of individuals, etc.
These sorts of metadata are included as optional columns in tables.
See the `tskit` [data model](https://tskit.dev/tskit/docs/stable/data-model.html) documentation for more detail.

Metadata can also include "top-level" information that is stored as metadata for the tree sequence (or {class}`tskit.TableCollection`) itself.
Much of this vignette concerns top-level metadata.

### Metadata schema

The `tskit` metadata require a specification, or *schema*, that is used to validate the metadata contents.

The top-level metadata schema is:

In [7]:
print(fwdpy11.tskit_tools.metadata_schema.TopLevelMetadata)

{'codec': 'json',
 'properties': {'data': {'description': 'This field is reserved for the user '
                                        'to fill.',
                         'type': ['object', 'string']},
                'demes_graph': {'description': 'A demographic model specified '
                                               'using demes.This information '
                                               'will be redundant with that '
                                               'stored in model_params,but it '
                                               'may be useful as it allows '
                                               'reconstruction of the YAML '
                                               'filefrom the tree sequence.',
                                'type': 'object'},
                'generation': {'description': 'The value of pop.generation at '
                                              'the time datawere exported to '
                                    

We see that the metadata are encoded as a [JSON schema](https://json-schema.org/).
Only one field is required, `generation`.
This required field is automatically populated by {func}`fwdpy11.DiploidPopulation.dump_tables_to_tskit`.
The remaining fields are filled in by keyword arguments passed to that function.
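
The idea of one required field plus optional fields can be sketched with the standard library alone.
The schema below is a hypothetical, trimmed-down stand-in and not the real `fwdpy11` schema: the field types, `TOY_SCHEMA`, and `check_required` are assumptions made for illustration only.

```python
import json

# Hypothetical, simplified stand-in for a top-level metadata schema.
# Only "generation" is required, mirroring the description above.
TOY_SCHEMA = {
    "codec": "json",
    "type": "object",
    "required": ["generation"],
    "properties": {
        "generation": {"type": "integer"},
        "seed": {"type": "integer"},
        "data": {"type": ["object", "string"]},
    },
}


def check_required(metadata, schema):
    """Raise ValueError if any required field is absent."""
    missing = [key for key in schema["required"] if key not in metadata]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return metadata


# A record with the required field passes and survives a JSON round trip.
encoded = json.dumps(check_required({"generation": 0, "seed": 54321}, TOY_SCHEMA))
decoded = json.loads(encoded)

# A record without "generation" is rejected.
try:
    check_required({"seed": 54321}, TOY_SCHEMA)
    rejected = False
except ValueError:
    rejected = True
```

A real schema validator checks types as well as required fields; this sketch only shows the required/optional distinction.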

Much of what follows concerns filling in this top-level metadata.
To do this, we use incomplete data (populations with no ancestry) to illustrate the concepts.

## Populating top-level metadata

Top-level metadata are an attempt to improve reproducibility.
However, all of these features are optional.
Because a simulation project using Python may be very complex, with many local and non-local imports, etc., fully documenting the top-level metadata and/or "provenance" may be impractical.
In some cases, reproducibility may instead be best managed by a `git` repository where scripts and `Makefile`/`Snakefile` workflows coexist.

### Adding random number seeds

You may add a random number seed to the top-level metadata.
Doing so is optional because we cannot guarantee that you used this seed to construct a {class}`fwdpy11.GSLrng` and then immediately ran the simulation.
For example, if you built the `rng` object and then simulated two replicates, the seed is only valid to reproduce the first.
You do not know the state of the `rng` when the second simulation started.

In [8]:
ts = pop.dump_tables_to_tskit(wrapped=True)
assert ts.seed is None

In [9]:
ts = pop.dump_tables_to_tskit(seed=54321, wrapped=True)
assert ts.seed == 54321

In [10]:
ts = pop.dump_tables_to_tskit(seed=-4616, wrapped=True)

ValueError: seed must be >=0, got -4616
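
The caveat about running two replicates from one generator can be illustrated without running a simulation.
The sketch below uses Python's `random.Random` as a stand-in for {class}`fwdpy11.GSLrng`, and the hypothetical `replicate` function stands in for a simulation that consumes random numbers.

```python
import random


def replicate(rng, n=5):
    """Stand-in for one simulation: draw n random numbers."""
    return [rng.random() for _ in range(n)]


seed = 54321
rng = random.Random(seed)
first = replicate(rng)   # starts from the freshly seeded state
second = replicate(rng)  # starts from whatever state `first` left behind

# Re-seeding reproduces the first replicate...
reproduced = replicate(random.Random(seed))
first_is_reproducible = reproduced == first

# ...but not the second, whose starting state was never recorded.
second_is_reproducible = reproduced == second
```

Storing the seed therefore documents only a run that began immediately after the generator was constructed.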

### Adding `ModelParams` objects

Another optional top-level metadata item is an instance of {class}`fwdpy11.ModelParams`.
Including this is optional for several reasons, but a big one is that this class can be arbitrarily hard to serialize.
The examples listed below will work with types provided by `fwdpy11` (although we haven't tested them all!), but user-defined types may be more challenging.

In [11]:
ts = pop.dump_tables_to_tskit(wrapped=True)
assert ts.model_params is None

Here is an example of Gaussian stabilizing selection around an optimum trait value of `0.0`:

In [12]:
pdict = {
    'nregions': [],
    'sregions': [fwdpy11.GaussianS(0, 1, 1, 0.10)],
    'recregions': [fwdpy11.PoissonInterval(0, 1, 1e-3)],
    'gvalue': fwdpy11.Additive(2., fwdpy11.GSS(optimum=0.0, VS=10.0)),
    'rates': (0.0, 1e-3, None),
    'simlen': 10 * pop.N,
    'prune_selected': False,
}

params = fwdpy11.ModelParams(**pdict)

We can pass the `params` object on when exporting the data to `tskit`:

In [13]:
ts = pop.dump_tables_to_tskit(model_params=params, wrapped=True)
assert ts.model_params == params

Some simulation designs make use of multiple `ModelParams` objects.
Imagine, for example, that we simulate the above model until equilibrium, and then simulate another 100 generations while adapting to a new optimum:

:::{note}
The details are omitted here:

1. This simulation requires two calls to {func}`fwdpy11.evolvets`.
2. It could have been done instead with a single call to {func}`fwdpy11.evolvets`, using
   {class}`fwdpy11.GSSmo` and one `ModelParams` object instead of {class}`fwdpy11.GSS`
   and two `ModelParams` objects.
:::

In [14]:
import copy
pdict2 = copy.deepcopy(pdict)

pdict2['gvalue'] = fwdpy11.Additive(2., fwdpy11.GSS(optimum=5.0, VS=10.0))
pdict2['simlen'] = 100

params2 = fwdpy11.ModelParams(**pdict2)

We can pass both objects to `tskit` via a {class}`dict`:

In [15]:
ts = pop.dump_tables_to_tskit(model_params={'burnin': params, 'adaptation': params2}, wrapped=True)
type(ts.model_params)

dict

In [16]:
assert ts.model_params['burnin'] == params
assert ts.model_params['adaptation'] == params2

### Including a `demes` graph

You may include a {class}`demes.Graph` in the top-level metadata.
If you also include a {class}`fwdpy11.ModelParams` (see above), including the `demes` graph gives redundant information.
However, including the graph may be useful if downstream analysis will involve other tools compatible with the `demes` specification.
With the graph as metadata, you can extract it and reconstruct the original `YAML` file, or send it to another Python package that understands it.

The following code block defines a function to return a {class}`demes.Graph` from `YAML` input stored in a string literal.

In [17]:
def gutenkunst():
    import demes
    yaml = """
description: The Gutenkunst et al. (2009) OOA model.
doi:
- https://doi.org/10.1371/journal.pgen.1000695
time_units: years
generation_time: 25

demes:
- name: ancestral
  description: Equilibrium/root population
  epochs:
  - {end_time: 220e3, start_size: 7300}
- name: AMH
  description: Anatomically modern humans
  ancestors: [ancestral]
  epochs:
  - {end_time: 140e3, start_size: 12300}
- name: OOA
  description: Bottleneck out-of-Africa population
  ancestors: [AMH]
  epochs:
  - {end_time: 21.2e3, start_size: 2100}
- name: YRI
  description: Yoruba in Ibadan, Nigeria
  ancestors: [AMH]
  epochs:
  - start_size: 12300
- name: CEU
  description: Utah Residents (CEPH) with Northern and Western European Ancestry
  ancestors: [OOA]
  epochs:
  - {start_size: 1000, end_size: 29725}
- name: CHB
  description: Han Chinese in Beijing, China
  ancestors: [OOA]
  epochs:
  - {start_size: 510, end_size: 54090}

migrations:
- {demes: [YRI, OOA], rate: 25e-5}
- {demes: [YRI, CEU], rate: 3e-5}
- {demes: [YRI, CHB], rate: 1.9e-5}
- {demes: [CEU, CHB], rate: 9.6e-5}
"""
    return demes.loads(yaml)

In [18]:
graph = gutenkunst()
ts = pop.dump_tables_to_tskit(demes_graph=graph, wrapped=True)

In [19]:
assert ts.demes_graph == graph

Since this is an optional metadata field, accessing it will return `None` if no graph was provided:

In [20]:
ts = pop.dump_tables_to_tskit(wrapped=True)
assert ts.demes_graph is None

### User-defined metadata

It may be useful to store arbitrary data in the output.
For example, this could be a {class}`dict` containing information obtained from a {ref}`time series <recorders_vignette>` analysis of a simulation.

The simplest situation is to use a `dict` containing simple objects:

In [21]:
ts = pop.dump_tables_to_tskit(data={'x': 3}, wrapped=True)
ts.data

{'x': 3}

Any use of user-defined types will require conversion to string, however:

In [22]:
class MyData(object):
    def __init__(self, x):
        self.x = x

    def __repr__(self):
        return f"MyData(x={self.x})"

# Send in the str representation:
ts = pop.dump_tables_to_tskit(data=str(MyData(x=77)), wrapped=True)

Getting the data back out requires an `eval`:

In [23]:
assert eval(ts.data).x == 77
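
For simple types, one way to avoid `eval` is to store a plain `dict` and rebuild the object when reading the data back.
The sketch below assumes the type can be written as a dataclass; `MyRecord` is a hypothetical name, and the JSON round trip stands in for the export to and import from `tskit`, which is not executed here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MyRecord:
    """Hypothetical user-defined type holding simple values."""
    x: int


# Convert to a plain dict. This is what would be passed as `data=payload`.
payload = asdict(MyRecord(x=77))

# Stand-in for the JSON encoding and decoding of the metadata.
decoded = json.loads(json.dumps(payload))

# Rebuild the object from the decoded dict, with no eval needed.
restored = MyRecord(**decoded)
```

This only works when every field is itself JSON-serializable; more complex types still need a custom conversion.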

:::{warning}
Recording data like this can become arbitrarily difficult.
The Python import system may make it more difficult to recover data than the examples here would indicate.
Further, if the data are very large, then other output formats are likely more appropriate.
:::

## Setting population table metadata

You can also set the population table metadata when exporting data.
Doing so requires a {class}`dict` mapping the integer label for each population to another {class}`dict`.

For example, let's create a demographic model from the `demes` graph used above:

In [24]:
model = fwdpy11.discrete_demography.from_demes(gutenkunst())
type(model)

fwdpy11.demographic_models.demographic_model_details.DemographicModelDetails

This object contains a mapping from integer labels to the deme names:

In [25]:
model.metadata['deme_labels']

{0: 'ancestral', 1: 'AMH', 2: 'OOA', 3: 'YRI', 4: 'CEU', 5: 'CHB'}

We can make the required `dict` like this:

In [26]:
pop_md = {}
for key, value in model.metadata['deme_labels'].items():
    pop_md[key] = {'name': value}
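
The same mapping can be written as a dictionary comprehension.
This sketch copies the `deme_labels` output shown above into a literal so that it runs on its own; the variable names are illustrative.

```python
# Literal copy of the deme label mapping shown above.
deme_labels = {0: 'ancestral', 1: 'AMH', 2: 'OOA', 3: 'YRI', 4: 'CEU', 5: 'CHB'}

# One metadata dict per integer population label.
pop_md_alt = {label: {'name': name} for label, name in deme_labels.items()}
```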

To illustrate the use of these metadata, we need `tskit` output that actually contains a population table:

In [27]:
# initialize a population with the right number of demes...
multideme_pop = fwdpy11.DiploidPopulation([100]*len(pop_md), 1.)
ts = multideme_pop.dump_tables_to_tskit(population_metadata=pop_md, wrapped=True)
# use a distinct loop variable so that `pop` is not overwritten
for population in ts.ts.populations():
    print(population.metadata)

{'name': 'ancestral'}
{'name': 'AMH'}
{'name': 'OOA'}
{'name': 'YRI'}
{'name': 'CEU'}
{'name': 'CHB'}


We could also easily add the deme description to the metadata from the `demes` graph:

In [28]:
graph = gutenkunst()
graph.demes

[Deme(name='ancestral', description='Equilibrium/root population', start_time=inf, ancestors=[], proportions=[], epochs=[Epoch(start_time=inf, end_time=220000.0, start_size=7300, end_size=7300, size_function='constant', selfing_rate=0, cloning_rate=0)]),
 Deme(name='AMH', description='Anatomically modern humans', start_time=220000.0, ancestors=['ancestral'], proportions=[1.0], epochs=[Epoch(start_time=220000.0, end_time=140000.0, start_size=12300, end_size=12300, size_function='constant', selfing_rate=0, cloning_rate=0)]),
 Deme(name='OOA', description='Bottleneck out-of-Africa population', start_time=140000.0, ancestors=['AMH'], proportions=[1.0], epochs=[Epoch(start_time=140000.0, end_time=21200.0, start_size=2100, end_size=2100, size_function='constant', selfing_rate=0, cloning_rate=0)]),
 Deme(name='YRI', description='Yoruba in Ibadan, Nigeria', start_time=140000.0, ancestors=['AMH'], proportions=[1.0], epochs=[Epoch(start_time=140000.0, end_time=0, start_size=12300, end_size=12300

In [29]:
pop_md = {}
for i, deme in enumerate(graph.demes):
    pop_md[i] = {'name': deme.name, 'description': deme.description}

pop_md

{0: {'name': 'ancestral', 'description': 'Equilibrium/root population'},
 1: {'name': 'AMH', 'description': 'Anatomically modern humans'},
 2: {'name': 'OOA', 'description': 'Bottleneck out-of-Africa population'},
 3: {'name': 'YRI', 'description': 'Yoruba in Ibadan, Nigeria'},
 4: {'name': 'CEU',
  'description': 'Utah Residents (CEPH) with Northern and Western European Ancestry'},
 5: {'name': 'CHB', 'description': 'Han Chinese in Beijing, China'}}