(tablefs)=

# Calculating frequency spectra from tree sequences

The mutation frequency spectrum is an array-like object representing the
number of mutation events occurring at different frequencies.  Generating
frequency spectra, or `fs`, from simulations is handled by {func}`fwdpy11.TableCollection.fs`.

We have to define some conventions:

* A sample list is a list of node indexes.  The node indexes
  are stored in a {class}`numpy.ndarray` with {class}`numpy.dtype`
  `numpy.int32`.
* We support `fs` calculated from multiple lists of samples.
* An `fs` always contains the zero and fixed bins.  Thus, for
  a sample of `n` nodes, there are `n + 1` entries in the `fs`.
  The first and last values are for frequencies zero and `n`, respectively.
  Singletons are in bin `1`, doubletons in bin `2`, and so on.
* An `fs` from a single sample list is represented by a {class}`numpy.ma.MaskedArray`
  object with the zero and fixed bins masked.
* An `fs` from more than one sample list is stored in a {class}`sparse.COO` sparse
  matrix.
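
To make these conventions concrete, here is a minimal NumPy-only sketch. The helper `fs_from_counts` and the toy derived-allele counts are hypothetical and are not part of fwdpy11; the sketch only illustrates the bin layout and the masking described above.

```python
import numpy as np


def fs_from_counts(derived_counts, n):
    """Hypothetical helper: bin per-mutation derived counts into a 1-d fs.

    Assumes every count lies in the closed interval [0, n].
    """
    counts = np.bincount(np.asarray(derived_counts, dtype=np.int64), minlength=n + 1)
    # Mask the zero bin (index 0) and the fixed bin (index n).
    mask = np.zeros(n + 1, dtype=bool)
    mask[[0, n]] = True
    return np.ma.masked_array(counts.astype(np.int32), mask=mask)


# A sample list is an int32 array of node indexes (values made up here).
sample_nodes = np.array([0, 1, 2, 3], dtype=np.int32)
n = len(sample_nodes)

# Made-up number of sample nodes carrying each of seven mutations.
toy_counts = [1, 1, 2, 4, 0, 3, 1]
toy_fs = fs_from_counts(toy_counts, n)
print(toy_fs)
print(toy_fs.data)
```

The masked array has `n + 1` entries. The zero and fixed bins are hidden by the mask, but their counts are still available through the `data` attribute.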

For our examples, we will initialize a {class}`fwdpy11.DiploidPopulation` from
a {class}`tskit.TreeSequence` generated by {func}`msprime.simulate`.

:::{note}

These examples do not show how to get `fs` separately
for neutral and non-neutral mutations.  See
{func}`fwdpy11.TableCollection.fs` for details.

:::

In [1]:
import fwdpy11
import msprime
import numpy as np

rng = fwdpy11.GSLrng(4321678)
config = [msprime.PopulationConfiguration(500), msprime.PopulationConfiguration(500)]
ts = msprime.simulate(
    population_configurations=config,
    Ne=500.0,
    random_seed=777,
    migration_matrix=np.array([0, 0.1, 0.1, 0]).reshape(2, 2),
    recombination_rate=0.25,
)

pop = fwdpy11.DiploidPopulation.create_from_tskit(ts)
md = np.array(pop.diploid_metadata, copy=False)
np.unique(md["deme"], return_counts=True)

nmuts = fwdpy11.infinite_sites(rng, pop, 0.1)
nmuts

2923

The following blocks show several methods for obtaining the `fs` from lists of nodes.
First, let's get the lists of nodes from the two demes in our population:

In [2]:
nodes = np.array(pop.tables.nodes, copy=False)
alive_nodes = pop.alive_nodes
deme0_nodes = alive_nodes[np.where(nodes["deme"][alive_nodes] == 0)[0]]
deme1_nodes = alive_nodes[np.where(nodes["deme"][alive_nodes] == 1)[0]]
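
The indexing pattern above can be tried on a toy array without running a simulation. In this sketch, the structured array is a hypothetical stand-in for the node table (only the `deme` field matters) and the `alive_nodes` values are made up. It also shows that a boolean mask selects the same nodes as `np.where(...)[0]`.

```python
import numpy as np

# Hypothetical stand-in for a node table with a "deme" field.
toy_nodes = np.zeros(8, dtype=[("deme", np.int32), ("time", np.float64)])
toy_nodes["deme"] = [0, 0, 1, 1, 0, 1, 0, 1]

# Made-up indexes of the nodes that are currently alive.
toy_alive = np.array([2, 3, 4, 5, 6, 7], dtype=np.int32)

# Same pattern as above: positions within toy_alive whose deme is 0.
toy_deme0 = toy_alive[np.where(toy_nodes["deme"][toy_alive] == 0)[0]]

# Equivalent selection with a boolean mask.
toy_deme0_mask = toy_alive[toy_nodes["deme"][toy_alive] == 0]
toy_deme1_mask = toy_alive[toy_nodes["deme"][toy_alive] == 1]

print(toy_deme0, toy_deme0_mask, toy_deme1_mask)
```

Both forms return node indexes with the `int32` dtype of the input, which matches the sample list convention.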

Get an `fs` from nodes found only in deme 0:

In [3]:
pop.tables.fs([deme0_nodes[:10]])

masked_array(data=[--, 434, 208, 111, 94, 62, 61, 47, 46, 45, --],
             mask=[ True, False, False, False, False, False, False, False,
                   False, False,  True],
       fill_value=999999,
            dtype=int32)

Get a joint `fs` from nodes from each deme:

In [4]:
fs = pop.tables.fs([deme0_nodes[:10], deme1_nodes[50:55]])
fs

| | |
|---|---|
| Format | coo |
| Data Type | int32 |
| Shape | (11, 6) |
| nnz | 54 |
| Density | 0.8181818181818182 |
| Read-only | True |
| Size | 1.1K |
| Storage ratio | 4.1 |


Obtain the full {class}`numpy.ndarray` for the joint `fs`:

In [5]:
fs.todense()

array([[  0, 155,  26,   0,   0,   0],
       [275, 117,  31,   8,   3,   0],
       [104,  61,  30,  11,   2,   0],
       [ 24,  33,  31,  16,   7,   0],
       [ 20,  20,  36,  13,   3,   2],
       [  1,  17,  22,  17,   4,   1],
       [  1,   6,  22,  18,  10,   4],
       [  0,   5,  12,  14,  13,   3],
       [  0,   2,   7,  13,  10,  14],
       [  0,   1,   3,   4,  26,  11],
       [  0,   0,   1,   4,   6,  18]], dtype=int32)
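
One way to read this array: entry `[i, j]` counts the mutations present in `i` copies in the first sample list and `j` copies in the second. The following sketch builds such a table with plain NumPy from hypothetical per-mutation counts. It illustrates the layout only and is not a description of how fwdpy11 computes the `fs` internally.

```python
import numpy as np

# Hypothetical sample list sizes.
n0, n1 = 3, 2

# Made-up derived counts of six mutations in each sample list.
counts0 = np.array([1, 1, 2, 0, 3, 1])
counts1 = np.array([0, 1, 2, 1, 2, 0])

# One row per bin of the first list, one column per bin of the second.
toy_joint = np.zeros((n0 + 1, n1 + 1), dtype=np.int32)

# np.add.at accumulates correctly when the same (i, j) pair repeats.
np.add.at(toy_joint, (counts0, counts1), 1)
print(toy_joint)
```

The shape is `(n0 + 1, n1 + 1)`, which is why sample lists of 10 and 5 nodes give the `(11, 6)` array shown above.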

:::{warning}

The joint `fs` can take a lot of memory!

:::
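
A rough way to see why: a dense joint `fs` has one cell per combination of bins, so the number of cells is the product of `n + 1` over the sample lists. The helper below is hypothetical. It gives the size of the equivalent dense `int32` array (for example, the result of `todense()`), not a measurement of the sparse object's memory use.

```python
import math


def dense_joint_fs_cells(sample_sizes):
    """Hypothetical helper: number of cells in a dense joint fs."""
    return math.prod(n + 1 for n in sample_sizes)


# The two-list example above: 10 and 5 nodes.
small_cells = dense_joint_fs_cells([10, 5])

# Three sample lists of 100 nodes each.
large_cells = dense_joint_fs_cells([100, 100, 100])
large_bytes = 4 * large_cells  # 4 bytes per int32 cell

print(small_cells, large_cells, large_bytes)
```

The cell count grows multiplicatively with each added sample list, so converting a many-list joint `fs` to a dense array is the step to be careful with.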

We can use standard array operations to get the marginal `fs` from our joint `fs`:

In [6]:
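fs.sum(axis=1).todense()  # marginal for the first sample list (11 bins)
fs.sum(axis=0).todense()  # marginal for the second sample list (6 bins); shown below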

array([425, 417, 221, 118,  84,  53])

:::{note}

Be careful when processing sparse matrix objects!  Naive application of regular
{mod}`numpy` functions can lead to erroneous results.  Be sure to check the
{mod}`sparse` documentation.

:::
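
As a cross-check that avoids sparse operations entirely, the dense array printed above can be marginalized with plain NumPy. In this sketch the helper `mask_ends` is hypothetical; it reapplies the convention that the zero and fixed bins of a 1-d `fs` are masked.

```python
import numpy as np

# Dense joint fs copied from the todense() output above.
joint = np.array(
    [
        [0, 155, 26, 0, 0, 0],
        [275, 117, 31, 8, 3, 0],
        [104, 61, 30, 11, 2, 0],
        [24, 33, 31, 16, 7, 0],
        [20, 20, 36, 13, 3, 2],
        [1, 17, 22, 17, 4, 1],
        [1, 6, 22, 18, 10, 4],
        [0, 5, 12, 14, 13, 3],
        [0, 2, 7, 13, 10, 14],
        [0, 1, 3, 4, 26, 11],
        [0, 0, 1, 4, 6, 18],
    ],
    dtype=np.int32,
)


def mask_ends(counts):
    """Hypothetical helper: mask the zero and fixed bins of a 1-d fs."""
    mask = np.zeros(len(counts), dtype=bool)
    mask[[0, -1]] = True
    return np.ma.masked_array(counts, mask=mask)


# Summing over axis 1 leaves the first sample list; axis 0 leaves the second.
marginal0 = mask_ends(joint.sum(axis=1))
marginal1 = mask_ends(joint.sum(axis=0))
print(marginal0.data)
print(marginal1.data)
```

Row sums recover the `fs` for the 10 nodes from deme 0, and column sums recover the `fs` for the 5 nodes from deme 1.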

Marginalizing by hand gets tedious with many sample lists.  Pass `marginalize=True` to have it done automatically,
in which case a {class}`dict` is returned, keyed by sample list index:

In [7]:
fs = pop.tables.fs([deme0_nodes[:10], deme1_nodes[50:55]], marginalize=True)
for key, value in fs.items():
    print(key)
    print(value)
    print(value.data)

0
[-- 434 208 111 94 62 61 47 46 45 --]
[181 434 208 111  94  62  61  47  46  45  29]
1
[-- 417 221 118 84 --]
[425 417 221 118  84  53]


:::{note}

Marginalizing in this way preserves the convention that the 1-d `fs`
objects are instances of {class}`numpy.ma.MaskedArray`.

:::

To see how the {class}`dict` keying works, let's flip the sample lists:

In [8]:
fs = pop.tables.fs([deme1_nodes[50:55], deme0_nodes[:10]], marginalize=True)
for key, value in fs.items():
    print(key)
    print(value)
    print(value.data)

0
[-- 417 221 118 84 --]
[425 417 221 118  84  53]
1
[-- 434 208 111 94 62 61 47 46 45 --]
[181 434 208 111  94  62  61  47  46  45  29]


If you only want the `fs` from particular regions of the genome, pass a list of `windows`.  By default,
the `fs` is the sum across windows:

In [9]:
pop.tables.fs([deme0_nodes[:10]], windows=[(0.1, 0.2), (0.8, 0.9)])

masked_array(data=[--, 101, 32, 13, 12, 13, 9, 13, 9, 10, --],
             mask=[ True, False, False, False, False, False, False, False,
                   False, False,  True],
       fill_value=999999,
            dtype=int32)

You can get the `fs` separately by window, too:

In [10]:
pop.tables.fs(
    [deme0_nodes[:10]], windows=[(0.1, 0.2), (0.8, 0.9)], separate_windows=True
)

[masked_array(data=[--, 56, 23, 5, 9, 7, 8, 8, 4, 4, --],
              mask=[ True, False, False, False, False, False, False, False,
                    False, False,  True],
        fill_value=999999,
             dtype=int32),
 masked_array(data=[--, 45, 9, 8, 3, 6, 1, 5, 5, 6, --],
              mask=[ True, False, False, False, False, False, False, False,
                    False, False,  True],
        fill_value=999999,
             dtype=int32)]
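
The two per-window spectra add up to the summed-across-windows `fs` shown earlier. The sketch below checks this with plain NumPy, using the values copied from the outputs above. The zeros in the masked positions are placeholders, because the masked values are not displayed.

```python
import numpy as np

# Zero and fixed bins are masked; 0 is only a placeholder for those bins.
ends_masked = [True] + [False] * 9 + [True]

window0 = np.ma.masked_array(
    [0, 56, 23, 5, 9, 7, 8, 8, 4, 4, 0], mask=ends_masked, dtype=np.int32
)
window1 = np.ma.masked_array(
    [0, 45, 9, 8, 3, 6, 1, 5, 5, 6, 0], mask=ends_masked, dtype=np.int32
)

# Adding masked arrays keeps a bin masked if it is masked in either operand.
combined = window0 + window1
print(combined)
```

The unmasked bins of `combined` match the output of the call without `separate_windows=True`.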

You can also get a joint `fs` marginalized by sample list and separated
by window.  In this case, the return value is a {class}`list` containing
the {class}`dict` for each window:

In [11]:
pop.tables.fs(
    [deme0_nodes[:10], deme1_nodes[:20]],
    windows=[(0.1, 0.2), (0.8, 0.9)],
    marginalize=True,
    separate_windows=True,
)

[{0: masked_array(data=[--, 56, 23, 5, 9, 7, 8, 8, 4, 4, --],
               mask=[ True, False, False, False, False, False, False, False,
                     False, False,  True],
         fill_value=999999),
  1: masked_array(data=[--, 27, 24, 19, 13, 12, 2, 4, 4, 5, 1, 7, 2, 4, 2, 4,
                     1, 2, 2, 2, --],
               mask=[ True, False, False, False, False, False, False, False,
                     False, False, False, False, False, False, False, False,
                     False, False, False, False,  True],
         fill_value=999999)},
 {0: masked_array(data=[--, 45, 9, 8, 3, 6, 1, 5, 5, 6, --],
               mask=[ True, False, False, False, False, False, False, False,
                     False, False,  True],
         fill_value=999999),
  1: masked_array(data=[--, 39, 29, 14, 3, 5, 11, 1, 5, 2, 2, 0, 1, 5, 1, 1, 1,
                     1, 6, 2, --],
               mask=[ True, False, False, False, False, False, False, False,
                     False, Fa

## Simplifying to the samples

Finally, it is sometimes more efficient to simplify the tree sequences with
respect to the sample nodes.  For example, if there are a vast number of
ancient samples and you are processing each time point separately
(see {func}`fwdpy11.DiploidPopulation.sample_timepoints`), then *not* simplifying
means iterating over trees that are redundant/irrelevant to the history of
the current time point.  In order to get the `fs` from a simplified
tree sequence, pass `simplify=True` when calling {func}`fwdpy11.TableCollection.fs`.