# Empirical Example

The purpose of this notebook is to demonstrate `ipcoal` simulations on a topology inferred from empirical data. We provide recommendations for how to scale units from a time-calibrated phylogeny to use in coalescent simulations, and how to incorporate biological information about species, such as generation times and population sizes, to perform more realistic simulations. 

Simulating coalescent genealogies and sequences on a parameterized species tree model using `ipcoal` can provide a null expectation for the amount of discordance that you expect to observe across different nodes of a species tree, and can even be used as a posterior predictive tool for phylogenetic analyses. 
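As a rough illustration of what such a null expectation looks like, the sketch below uses the standard three-taxon result from coalescent theory: for a rooted species tree whose single internal edge is $T$ coalescent units long, the probability that a gene tree matches the species tree topology is $1 - \tfrac{2}{3}e^{-T}$. The function names and example values are hypothetical and are not part of `ipcoal`. The conversion assumes a diploid population, in which one coalescent unit equals $2N_e$ generations.

```python
import math


def coalescent_units(generations, ne):
    """Convert an edge length in generations to coalescent units (diploid: 2Ne)."""
    return generations / (2.0 * ne)


def prob_concordant(t_coal):
    """Probability that a 3-taxon gene tree matches the species tree topology."""
    return 1.0 - (2.0 / 3.0) * math.exp(-t_coal)


# hypothetical internal edge: 200,000 generations long with Ne = 100,000
t = coalescent_units(200_000, 100_000)
p = prob_concordant(t)
print(f"T = {t:.2f} coalescent units, P(concordant) = {p:.3f}")
```

Short internal edges or large Ne values push the probability toward 1/3, the value expected when all three topologies are equally likely.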

In [1]:
import numpy as np
import pandas as pd
import ipcoal
import toytree
import toyplot
colormap = toyplot.color.brewer.map("BlueRed", reverse=True)

### Mammal phylogeny data set

In this example we use published data for mammals. We will use a time-calibrated MCC phylogeny by [Upham et al. (2019)](https://doi.org/10.1371/journal.pbio.3000494) as a species tree hypothesis; we will use species geographic range areas from the [PanTHERIA database](https://doi.org/10.1890/08-1494.1) as a proxy for effective population sizes; and we will use generation time estimates from the [Pacifici et al. (2014)](https://doi.org/10.5061/dryad.gd0m3) data set, which fills in many values that are missing from PanTHERIA by using mean values among close relatives.

In [3]:
# load the phylogenetic data (big tree, takes a few seconds)
TREE_URL = (
    "https://github.com/eaton-lab/ipcoal/blob/master/"
    "notebooks/mammal_dat/MamPhy_fullPosterior_BDvr_DNAonly"
    "_4098sp_topoFree_NDexp_MCC_v2_target.tre?raw=true"
)
tree = toytree.tree(TREE_URL, tree_format=10)
print(tree.ntips, "tips in the Upham mammal tree")

4100 tips in the Upham mammal tree


In [4]:
# load the mammal biological data (e.g., geo range)
PANTH_URL = (
    "https://github.com/eaton-lab/ipcoal/blob/master/"
    "notebooks/mammal_dat/PanTHERIA_1-0_WR05_Aug2008"
    ".txt?raw=true"
)
panthdf = pd.read_csv(PANTH_URL, sep='\t')
print(panthdf.shape[0], "taxa in PanTHERIA database")

5416 taxa in PanTHERIA database


In [5]:
# load the generation time data
GT_URL = (
    "https://github.com/eaton-lab/ipcoal/blob/master/"
    "notebooks/mammal_dat/5734-SP-2-Editor.csv?raw=true"
)
gentimedf = pd.read_csv(GT_URL)
print(gentimedf.shape[0], "taxa in Pacifici gentime database")

5427 taxa in Pacifici gentime database


### Filtering and selecting taxa

We will first trim the data down to include only taxa that are shared among all three data sources and for which no biological data are missing. This reduces the data set to 3121 taxa. Geographic range areas are in units of kilometers$^2$ (`georange`) and generation times are in units of years (`gentime`).
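The name matching in the cells that follow relies on tip labels having the form `{genus}_{species}_{FAMILY}_{ORDER}`, so that stripping the last two underscore-delimited fields leaves a binomial. A minimal sketch of that idea, using made-up placeholder names rather than labels from the actual files, also shows the set intersection that defines the shared taxa:

```python
# hypothetical tip labels in the {genus}_{species}_{FAMILY}_{ORDER} format
tip_labels = [
    "Aus_bus_FAMA_ORDA",
    "Cus_dus_FAMB_ORDA",
    "Eus_fus_FAMC_ORDB",
]

# hypothetical species names as they might appear in a trait table
table_names = ["Aus bus", "Cus dus", "Gus hus"]

# map {genus}_{species} to the full tip label by dropping the last two fields
tipdict = {label.rsplit("_", 2)[0]: label for label in tip_labels}

# names shared by both sources, after converting spaces to underscores
shared = sorted(set(tipdict) & {name.replace(" ", "_") for name in table_names})
print(shared)
```

Labels with a different number of underscore-delimited fields would not be parsed correctly by this approach, so it is worth checking how many names fail to match.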

In [6]:
# subselect species names and geo range columns from pantheria
sppdata = panthdf.loc[:, ['MSW05_Binomial', '26-1_GR_Area_km2']]

# rename sppdata columns
sppdata.columns = ["species", "georange"]

In [7]:
# make column to record tree tip label names
sppdata["treename"] = np.nan

# dict map: {gen}_{spp} to {gen}_{spp}_{fam}_{order}
tipdict = {i.rsplit("_", 2)[0]: i for i in tree.get_tip_labels()}

# record whether species in pantheria is in the tree tip labels
for idx in sppdata.index:
    
    # match data names to tree names which have underscores
    name = sppdata.species[idx]
    name_ = name.replace(" ", "_")
    
    # record treename if it is in the database
    if name_ in tipdict:
        sppdata.loc[idx, "treename"] = tipdict[name_]

In [8]:
# add gentime values to all species matching to names in Pacifici data set
sppdata["gentime"] = np.nan
for idx in gentimedf.index:
    
    # get generation time in units of years
    species, gent = gentimedf.loc[idx, ["Scientific_name", "GenerationLength_d"]] 
    mask = sppdata.species == species
    sppdata.loc[mask.values, "gentime"] = gent / 365.

In [10]:
# set missing data (-999) to NaN
sppdata[sppdata == -999.000] = np.nan

# remove rows where georange, gentime, or treename is missing
mask = sppdata.georange.notna() & sppdata.gentime.notna() & sppdata.treename.notna()
sppdata = sppdata.loc[mask, :]

# reorder and reset index for dropped rows
sppdata.sort_values(by="species", inplace=True)
sppdata.reset_index(drop=True, inplace=True)

# show first ten sorted rows
sppdata.head(10)

| | species | georange | treename | gentime |
|---|---|---|---|---|
| 0 | Abeomelomys sevia | 53261.73 | Abeomelomys_sevia_MURIDA... | 1.710684 |
| 1 | Abrocoma bennettii | 54615.98 | Abrocoma_bennettii_ABROC... | 2.829928 |
| 2 | Abrocoma boliviensis | 5773.97 | Abrocoma_boliviensis_ABR... | 2.829928 |
| 3 | Abrocoma cinerea | 381391.02 | Abrocoma_cinerea_ABROCOM... | 2.829928 |
| 4 | Abrothrix andinus | 722551.83 | Abrothrix_andinus_CRICET... | 1.614762 |
| 5 | Abrothrix hershkovitzi | 1775.72 | Abrothrix_hershkovitzi_C... | 1.614762 |
| 6 | Abrothrix illuteus | 35359.55 | Abrothrix_illuteus_CRICE... | 1.614762 |
| 7 | Abrothrix jelskii | 506394.71 | Abrothrix_jelskii_CRICET... | 1.614762 |
| 8 | Abrothrix lanosus | 43016.67 | Abrothrix_lanosus_CRICET... | 1.614762 |
| 9 | Abrothrix longipilis | 423823.71 | Abrothrix_longipilis_CRI... | 1.614762 |


### Filter the tree to include only taxa in the data table


In [11]:
# find names in tree but not in data table
names_in_data = set(sppdata.treename)
names_in_tree = set(tree.get_tip_labels())
names_to_remove = names_in_tree.difference(names_in_data)

In [12]:
# drop the tips from the tree not in data table
ftree = tree.drop_tips(names_to_remove)
print(len(ftree), "tips in filtered tree (ftree)")

3121 tips in filtered tree (ftree)


### Convert geographic ranges to Ne values
Here we generate Ne values within a selected range, scaled by the variation in geographic range area among taxa. The distribution is plotted as a histogram with a log-scaled y-axis. Many taxa have small Ne; few have very large Ne.
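The rescaling can be written as $N_e = \max(N_{min},\ N_{max} \cdot A / A_{max})$, where $A$ is a species' range area and $A_{max}$ is the largest range in the data set. A minimal pure-Python sketch with made-up areas (the function name `scale_ne` is hypothetical):

```python
def scale_ne(areas, min_ne=1_000, max_ne=1_000_000):
    """Linearly rescale areas so the largest maps to max_ne, floored at min_ne."""
    largest = max(areas)
    return [int(max(min_ne, max_ne * (area / largest))) for area in areas]


# hypothetical range areas in km^2
areas_km2 = [100.0, 250_000.0, 1_000_000.0]
ne_values = scale_ne(areas_km2)
print(ne_values)
```

Because the floor is applied after scaling, every species whose range is below `min_ne / max_ne` (here 0.1%) of the largest range receives the same minimum value, which is why many taxa share Ne = 1000.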

In [13]:
# transform georange into Ne values within selected range
max_Ne = 1000000
min_Ne = 1000

# set Ne values in range scaled by geographic ranges
Ne = max_Ne * (sppdata.georange / sppdata.georange.max())
Ne = [max(min_Ne, i) for i in Ne]
sppdata["Ne"] = np.array(Ne, dtype=int)

# show 10 random samples
sppdata.sample(10)

| | species | georange | treename | gentime | Ne |
|---|---|---|---|---|---|
| 2606 | Rusa alfredi | 24504.78 | Rusa_alfredi_CERVIDAE_CE... | 7.0025 | 1000 |
| 2051 | Otomops madagascariensis | 215525.86 | Otomops_madagascariensis... | 3.912091 | 3419 |
| 2713 | Sminthopsis granulipes | 76315.75 | Sminthopsis_granulipes_D... | 1.495963 | 1210 |
| 2770 | Sorex samniticus | 57915.64 | Sorex_samniticus_SORICID... | 1.158588 | 1000 |
| 3070 | Ursus americanus | 9650153.48 | Ursus_americanus_URSIDAE... | 12.970074 | 153093 |
| 3052 | Tylomys nudicaudus | 480905.77 | Tylomys_nudicaudus_CRICE... | 1.904993 | 7629 |
| 1964 | Nyctinomops macrotis | 18149920.2 | Nyctinomops_macrotis_MOL... | 3.912091 | 287937 |
| 1618 | Mimon cozumelae | 812740.52 | Mimon_cozumelae_PHYLLOST... | 5.655552 | 12893 |
| 534 | Cratogeomys merriami | 17777.06 | Cratogeomys_merriami_GEO... | 2.141185 | 1000 |
| 2944 | Tapirus bairdii | 188831.76 | Tapirus_bairdii_TAPIRIDA... | 11.0 | 2995 |


In [14]:
# plot a histogram of Ne values
a, b = np.histogram(sppdata.Ne, bins=25)
toyplot.bars((a, b), height=300, yscale="log", ylabel="bin count", xlabel="Ne");

### Set Ne and g values for tip nodes of the tree object

*ipcoal* can accept different Ne and g values to use in simulations, and the easiest way to set variable values across different parts of the tree is to map the values to the tree object that *ipcoal* accepts as an argument.

In [15]:
# make a copy of the filtered tree
tree_ng = ftree.copy() 

# dictionaries mapping names to values
dict_ne = {sppdata.treename[i]: sppdata.Ne[i] for i in range(sppdata.shape[0])}
dict_gt = {sppdata.treename[i]: sppdata.gentime[i] for i in range(sppdata.shape[0])}

# set values on nodes of the tree for all species (tips)
tree_ng = tree_ng.set_node_values("Ne", dict_ne)
tree_ng = tree_ng.set_node_values("g", dict_gt)

### Set Ne and g values for ancestral nodes of the tree object

We only have estimates of Ne and g for species that are alive today, but it would be useful to also include estimates for ancestral nodes in the species tree. Here we use a simple ancestral state reconstruction based on Brownian motion to infer states for ancestral nodes.
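To give a feel for what a Brownian motion reconstruction does, the sketch below computes the estimate for a single node from its two descendants. Under Brownian motion each descendant value is weighted by the inverse of the edge length leading to it, so a closer relative has more influence. This is only an illustration of the principle at one node, with hypothetical values and a hypothetical function name; it should not be read as the implementation used by `toytree`.

```python
def bm_node_estimate(x1, v1, x2, v2):
    """Inverse-edge-length weighted mean of two descendant trait values."""
    w1, w2 = 1.0 / v1, 1.0 / v2
    return (w1 * x1 + w2 * x2) / (w1 + w2)


# hypothetical generation times (years) and edge lengths (My) for two sister tips
g_anc = bm_node_estimate(x1=2.0, v1=1.0, x2=6.0, v2=3.0)
print(g_anc)
```

With equal edge lengths the estimate reduces to the simple mean of the two descendant values.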

In [16]:
# infer ancestral values of g and E for internal nodes
tree_ng = tree_ng.pcm.ancestral_state_reconstruction("g")
tree_ng = tree_ng.pcm.ancestral_state_reconstruction("Ne")

### Plot the tree with node values 

Let's plot just a subset of taxa to start, since it will be much easier to visualize than trying to examine the entire tree. Here we select only the taxa in the genus *Mustela*. The tree plot shows variation in Ne using the thickness of edges, and generation times are shown by the color of nodes, from blue to red, representing shorter to longer times.

In [17]:
# make a tree copy
atree = tree_ng.copy()

# get ancestor of all tips that have 'Mustela' in their name
mrca_node_idx = atree.get_mrca_idx_from_tip_labels(wildcard="Mustela_")

# make new toytree from selected node
node = atree.get_feature_dict("idx")[mrca_node_idx]
subtree = toytree.tree(node)

# scale the tree height from millions of years to years
subtree = subtree.mod.node_scale_root_height(subtree.treenode.height * 1e6)

In [18]:
subtree.draw(
    ts='p', 
    edge_type='p',
    node_sizes=10,
    node_labels=False,
    node_colors=[
        colormap.colors(i, 0.1, 10) for i in subtree.get_node_values('g', 1, 1)
    ],
    width=400, 
    height=600,
);

### Convert edge lengths from time to generations

Time in years is converted to units of generations by dividing each edge length by the generation time for that edge (in years per generation). After this conversion, the crown root of the *Mustela* tree is 2.4M generations from the furthest tip in the tree.
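Because each edge is divided by its own generation time, edges that span the same number of years can differ in length when measured in generations, and the converted tree is generally no longer ultrametric. That is why the root age is stated relative to the furthest tip. A small sketch with hypothetical edge values:

```python
# hypothetical edges: name -> (length in years, generation time in years)
edges = {
    "tip_A": (1_000_000.0, 2.0),
    "tip_B": (1_000_000.0, 5.0),
}

# number of generations spanned by each edge
generations = {name: years / g for name, (years, g) in edges.items()}

# root age measured from the furthest tip, in generations
root_age = max(generations.values())
print(generations, root_age)
```

Here the two sister edges cover the same time in years, but the lineage with the shorter generation time accumulates more generations.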


In [20]:
# divide the edge lengths (in absolute time) by the generation time, to get number of generations for each edge
ttree = subtree.set_node_values(
    "dist",
    {i.name: i.dist / i.g for i in subtree.get_feature_dict()}
)

In [22]:
ttree.draw(
    ts='p',
    edge_type='p',
    tip_labels_align=True,
    tip_labels=[i.rsplit("_", 2)[0] for i in ttree.get_tip_labels()],
    node_labels=False,
    node_sizes=0,
    width=400, 
    height=400,
);

### Simulate a chromosome with ipcoal
The tree topology, edge lengths, and Ne values will all be inherited from the tree object by *ipcoal* to set up the coalescent simulation parameters. We leave the rest of the parameters at their default values.

In [23]:
# initialize the ipcoal model object
mod = ipcoal.Model(ttree, seed=333)

# simulate 1 chromosome 1 Mb in length
mod.sim_trees(nloci=1, nsites=1e6)

# show the dataframe of genealogy results
mod.df.head()

| | locus | start | end | nbps | nsnps | tidx | genealogy |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 49 | 49 | 0 | 0 | ((Neovison_vison_MUSTELI... |
| 1 | 0 | 49 | 166 | 117 | 0 | 1 | ((Neovison_vison_MUSTELI... |
| 2 | 0 | 166 | 296 | 130 | 0 | 2 | ((Neovison_vison_MUSTELI... |
| 3 | 0 | 296 | 343 | 47 | 0 | 3 | ((((Mustela_nivalis_MUST... |
| 4 | 0 | 343 | 384 | 41 | 0 | 4 | ((((Mustela_nivalis_MUST... |


### The distribution of lengths of genealogical variation

In [24]:
canvas, axes, mark = toyplot.bars(
    np.histogram(mod.df.nbps, 50, range=[0, 1000]),
    width=300, 
    height=300,
    xlabel="gene tree length (bp)",
    ylabel="bin count",
    label="The size of non-recombined genomic blocks",
)

In [25]:
kwargs = {
    "tip_labels_align": True,
    "tip_labels_style": {"font-size": "9px"},
}

# draw linked genealogies
rand = mod.df.sample(8).tidx
toytree.mtree(mod.df.genealogy[rand]).draw_tree_grid(ncols=4, nrows=2, **kwargs);

### Simulate unlinked loci (e.g., UCE)


In [30]:
# initialize the ipcoal model object
mod = ipcoal.Model(ttree, seed=333)

# simulate 1000 unlinked loci, each 1000 bp in length
mod.sim_loci(nloci=1000, nsites=1000)

# show the dataframe of genealogy results
mod.df.head()

| | locus | start | end | nbps | nsnps | tidx | genealogy |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 7 | 7 | 1 | 0 | ((Mustela_erminea_MUSTEL... |
| 1 | 0 | 7 | 39 | 32 | 6 | 1 | (((Neovison_vison_MUSTEL... |
| 2 | 0 | 39 | 40 | 1 | 1 | 2 | (((Neovison_vison_MUSTEL... |
| 3 | 0 | 40 | 186 | 146 | 27 | 3 | (((Neovison_vison_MUSTEL... |
| 4 | 0 | 186 | 265 | 79 | 15 | 4 | (((Neovison_vison_MUSTEL... |


### Compute average number of genealogies per locus


In [39]:
mod.df.groupby("locus").apply(len).mean()

8.691

### Infer gene trees

In [None]:
mod.infer_gene_trees()

### Simulate on a larger clade  
  
Let's scale up to Mustelidae, the whole family containing weasels, ferrets, otters, etc.

In [None]:
# make a tree copy
atree = tree_ng.copy()

# get ancestor of all tips that have 'MUSTELIDAE' in their name
mrca_node_idx = atree.get_mrca_idx_from_tip_labels(wildcard="MUSTELIDAE")

# make new toytree from selected node
node = atree.get_feature_dict("idx")[mrca_node_idx]
subtree = toytree.tree(node)

# scale the tree height from millions of years to years
subtree = subtree.mod.node_scale_root_height(subtree.treenode.height * 1e6)

In [549]:
canvas, axes = subtree.draw(
    ts='p', 
    edge_type='p',
    node_sizes=10,
    node_labels=False,
    node_colors=[
        colormap.colors(i, 0.1, 10) for i in subtree.get_node_values('g', 1, 1)
    ],
    width=600, 
    height=600,
);

# set y axis ticks at 2My intervals and fit long tip names
axes.y.ticks.locator = toyplot.locator.Explicit(
    range(0, int(2e7), int(2e6)),
    [str(int(i / 1e6)) for i in range(0, int(2e7), int(2e6))],
)
axes.y.domain.min = -20e6

In [550]:
# divide the edge lengths (in absolute time) by the generation time, to get number of generations for each edge
ttree = subtree.set_node_values(
    "dist",
    {i.name: i.dist / i.g for i in subtree.get_feature_dict()}
)

In [558]:
ttree.draw(
    ts='p',
    edge_type='p',
    tip_labels_align=True,
    node_labels=False,
    node_sizes=0,
    width=600, 
    height=600,
);

In [559]:
# mod = ipcoal.Model(ttree, seed=333)
# mod.sim_trees(nloci=5, nsites=1e5)
# mod.df

**How long are the genealogies?**

In [560]:
# canvas = toyplot.Canvas(width=300, height=300)
# axes = canvas.cartesian()
# bars = axes.bars(np.histogram(mod.df.nbps, 100,range=[0,500]))

**What do the genealogies look like?**

In [561]:
# # draw linked genealogies
# toytree.mtree(mod.df.genealogy).draw_tree_grid(tip_labels=False);

In [562]:
# # draw unlinked genealogies
# toytree.mtree(mod.df[mod.df.tidx==0].genealogy).draw_tree_grid(tip_labels=False);