# Empirical Example

### Introduction: Mammal phylogeny
This notebook demonstrates `ipcoal` simulations on a species tree inferred from empirical data, parameterized with empirical generation times and species distribution areas (used here as a proxy for effective population size, Ne).  
  
The `ipcoal` results show how much discordance to expect from one genealogy to the next in this scenario, and the expected length (in sites) of each genealogy.

**Imports:**

In [1]:
import numpy as np
import toytree
import pandas as pd
import toyplot
import ipcoal
from copy import deepcopy
colormap = toyplot.color.brewer.map("BlueRed", reverse=True)

**Data sources:**

1) Species tree: dated maximum clade credibility tree from a recent mammal supertree analysis (Upham et al. 2019: https://doi.org/10.1371/journal.pbio.3000494)  
2) Species distribution areas: from PanTHERIA database (Jones et al. 2009: https://doi.org/10.1890/08-1494.1)  
3) Species generation lengths: from generation lengths dataset (Pacifici et al. 2014: https://doi.org/10.5061/dryad.gd0m3)

### Consolidate data

Combine the three data sources (species tree, species distribution areas, and species generation lengths) and keep only the species that are present in all three.

**Names from species tree:**

In [2]:
tree = toytree.tree("mammal_dat/MamPhy_fullPosterior_BDvr_DNAonly_4098sp_topoFree_NDexp_MCC_v2_target.tre",
                       tree_format=10)

In [3]:
trenames = np.array(["_".join(i.split("_")[:2]) for i in tree.get_tip_labels()])
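
The tip labels in the tree carry extra fields after the species name, so the cell above keeps only the first two underscore-separated fields. A minimal sketch of that step, using hypothetical labels (the `_FAMILY_ORDER` suffix shown here is an assumption about the label layout, and `to_binomial` is an illustrative helper, not part of `toytree`):

```python
def to_binomial(tip_label: str) -> str:
    """Keep only the first two underscore-separated fields (Genus_species)."""
    return "_".join(tip_label.split("_")[:2])

# hypothetical tip labels; the real labels come from tree.get_tip_labels()
labels = [
    "Mustela_nivalis_MUSTELIDAE_CARNIVORA",
    "Canis_lupus_CANIDAE_CARNIVORA",
]

binomials = [to_binomial(label) for label in labels]
print(binomials)
```

The same `Genus_species` form is what the other two datasets are converted to, which is what makes the names comparable across sources.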

**Names from PanTHERIA data:**

In [4]:
panth = pd.read_csv('./mammal_dat/PanTHERIA_1-0_WR05_Aug2008.txt',sep='\t')

In [5]:
panth.head()

|  | MSW05_Order | MSW05_Family | MSW05_Genus | MSW05_Species | MSW05_Binomial | 1-1_ActivityCycle | 5-1_AdultBodyMass_g | 8-1_AdultForearmLen_mm | 13-1_AdultHeadBodyLen_mm | 2-1_AgeatEyeOpening_d | ... | 26-6_GR_MinLong_dd | 26-7_GR_MidRangeLong_dd | 27-1_HuPopDen_Min_n/km2 | 27-2_HuPopDen_Mean_n/km2 | 27-3_HuPopDen_5p_n/km2 | 27-4_HuPopDen_Change | 28-1_Precip_Mean_mm | 28-2_Temp_Mean_01degC | 30-1_AET_Mean_mm | 30-2_PET_Mean_mm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Artiodactyla | Camelidae | Camelus | dromedarius | Camelus dromedarius | 3.0 | 492714.47 | -999.0 | -999.0 | -999.0 | ... | -999.0 | -999.0 | -999 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 |
| 1 | Carnivora | Canidae | Canis | adustus | Canis adustus | 1.0 | 10392.49 | -999.0 | 745.32 | -999.0 | ... | -17.53 | 13.0 | 0 | 35.2 | 1.0 | 0.14 | 90.75 | 236.51 | 922.9 | 1534.4 |
| 2 | Carnivora | Canidae | Canis | aureus | Canis aureus | 2.0 | 9658.7 | -999.0 | 827.53 | 7.5 | ... | -17.05 | 45.74 | 0 | 79.29 | 0.0 | 0.1 | 44.61 | 217.23 | 438.02 | 1358.98 |
| 3 | Carnivora | Canidae | Canis | latrans | Canis latrans | 2.0 | 11989.1 | -999.0 | 872.39 | 11.94 | ... | -168.12 | -117.6 | 0 | 27.27 | 0.0 | 0.06 | 53.03 | 58.18 | 503.02 | 728.37 |
| 4 | Carnivora | Canidae | Canis | lupus | Canis lupus | 2.0 | 31756.51 | -999.0 | 1055.0 | 14.01 | ... | -171.84 | 3.9 | 0 | 37.87 | 0.0 | 0.04 | 34.79 | 4.82 | 313.33 | 561.11 |


In [6]:
panthnames = [i.split(" ") for i in panth['MSW05_Binomial']]

In [7]:
panthnames = np.array(["_".join(i) for i in panthnames])

**Names from the generation length data (Pacifici et al. 2014):**

In [8]:
natcons = pd.read_csv('./mammal_dat/5734-SP-2-Editor.csv',sep=',')

In [9]:
natcons.head()

|  | TaxID | Order | Family | Genus | Scientific_name | AdultBodyMass_g | Sources_AdultBodyMass | Max_longevity_d | Sources_Max_longevity | Rspan_d | AFR_d | Data_AFR | Calculated_GL_d | GenerationLength_d | Sources_GL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42641 | Rodentia | Muridae | Abditomys | Abditomys latidens | 268.09 | PanTHERIA | no information | no information | no information | no information | no information | no information | 639.631832 | Mean_family_same_body_mass |
| 1 | 17879 | Rodentia | Muridae | Abeomelomys | Abeomelomys sevia | 54.88 | PanTHERIA | no information | no information | no information | no information | no information | no information | 624.399641 | Mean_family_same_body_mass |
| 2 | 16 | Rodentia | Cricetidae | Abrawayaomys | Abrawayaomys ruschii | 62.99 | PanTHERIA | no information | no information | no information | no information | no information | no information | 589.388299 | Mean_family_same_body_mass |
| 3 | 42656 | Rodentia | Abrocomidae | Abrocoma | Abrocoma bennettii | 250.5 | PanTHERIA | 839.5 | PanTHERIA;AnAge | no information | no information | no information | no information | 1032.923574 | Mean_order_same_mass |
| 4 | 18 | Rodentia | Abrocomidae | Abrocoma | Abrocoma boliviensis | 158.0 | PanTHERIA | no information | no information | no information | no information | no information | no information | 1032.923574 | Mean_order_same_mass |


In [10]:
natconsnames = np.array(["_".join(q) for q in [i.split(" ") for i in natcons['Scientific_name']]])

**See which names are common to all lists:**

In [11]:
# append name to intersection list if it is in all datasets
intersection_names = []
for trename in trenames:
    if trename in panthnames:
        if trename in natconsnames:
            intersection_names.append(trename)

In [12]:
# how many names are in all three datasets?
len(intersection_names)

3489
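
The loop above tests membership against NumPy arrays, which scans the whole array for every name. As a sketch of an equivalent approach, the comparison lists can be converted to sets first while the tree's tip order is preserved. The names below are a small hypothetical sample, and `shared_names` is an illustrative helper rather than anything defined in the notebook:

```python
def shared_names(tree_names, *other_name_lists):
    """Return names from tree_names (in order) that occur in every other list."""
    other_sets = [set(names) for names in other_name_lists]
    return [name for name in tree_names if all(name in s for s in other_sets)]

# hypothetical miniature versions of trenames, panthnames and natconsnames
tre = ["Canis_lupus", "Mustela_nivalis", "Akodon_sylvanus"]
pan = ["Mustela_nivalis", "Canis_lupus", "Camelus_dromedarius"]
gen = ["Canis_lupus", "Akodon_sylvanus", "Mustela_nivalis"]

common = shared_names(tre, pan, gen)
print(common)
```

Iterating over the tree names, rather than taking a plain set intersection, keeps the result in the same order as the tips, which matches what the loop above produces.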

In [13]:
# start a dataframe to build up
intersect_df = pd.DataFrame({'scientific_name':intersection_names})

In [14]:
# look at dataframe -- currently just rows of scientific names
intersect_df.head()

|  | scientific_name |
|---|---|
| 0 | Akodon_boliviensis |
| 1 | Akodon_spegazzinii |
| 2 | Akodon_sylvanus |
| 3 | Akodon_lutescens |
| 4 | Akodon_subfuscus |


**Gather the data and write to csv:**

In [16]:
gl_dat = np.zeros((len(intersect_df),1))
sp_area_dat = np.zeros((len(intersect_df),1))

for idx, name in enumerate(intersect_df['scientific_name']):
    # using mean because some entries have multiple generation lengths (from different datasets)
    gl_dat[idx] = np.mean(natcons['GenerationLength_d'][natcons['Scientific_name'] == " ".join(name.split("_"))])
    sp_area_dat[idx] = np.mean(panth['26-1_GR_Area_km2'][panth['MSW05_Binomial'] == " ".join(name.split("_"))])

In [17]:
# add data columns to the dataframe
intersect_df['generation_length'] = gl_dat
intersect_df['species_area'] = sp_area_dat

In [18]:
# look at the first few rows
intersect_df.head()

|  | scientific_name | generation_length | species_area |
|---|---|---|---|
| 0 | Akodon_boliviensis | 589.388299 | 530195.66 |
| 1 | Akodon_spegazzinii | 589.388299 | 83477.58 |
| 2 | Akodon_sylvanus | 589.388299 | 48393.63 |
| 3 | Akodon_lutescens | 589.388299 | 318909.65 |
| 4 | Akodon_subfuscus | 589.388299 | 353423.95 |


In [19]:
# PanTHERIA codes missing values as -999; drop species that have no range area.
intersect_df = intersect_df[~intersect_df.species_area.eq(-999)]
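
The gathering and filtering steps above can also be expressed as a single table join. The following is a sketch on hypothetical miniature tables (only the column names are taken from the real files); it averages duplicate generation-length entries, joins on the species name, and treats `-999` as missing:

```python
import numpy as np
import pandas as pd

# hypothetical miniature versions of the PanTHERIA and generation-length tables
panth_small = pd.DataFrame({
    "MSW05_Binomial": ["Canis lupus", "Camelus dromedarius", "Canis aureus"],
    "26-1_GR_Area_km2": [1000.0, -999.0, 250.0],
})
gl_small = pd.DataFrame({
    "Scientific_name": ["Canis lupus", "Camelus dromedarius", "Canis lupus"],
    "GenerationLength_d": [1500.0, 4000.0, 1700.0],
})

# average duplicate entries per species, as the loop does with np.mean
gl_mean = gl_small.groupby("Scientific_name", as_index=False)["GenerationLength_d"].mean()

# keep only species present in both tables
merged = panth_small.merge(
    gl_mean, left_on="MSW05_Binomial", right_on="Scientific_name", how="inner"
)

# recode the -999 sentinel as NaN, then drop rows without a range area
merged["26-1_GR_Area_km2"] = merged["26-1_GR_Area_km2"].replace(-999, np.nan)
clean = merged.dropna(subset=["26-1_GR_Area_km2"]).reset_index(drop=True)
print(clean)
```

Recoding the sentinel before any averaging or scaling matters, because a stray `-999` would otherwise be treated as a real (negative) area.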

**Trim the species tree down to just the tips shared among datasets, and write that out as a new file:**

In [23]:
tips = tree.get_tip_labels()

In [24]:
# collect the tips whose Genus_species name is not among the retained species
keep = set(intersect_df.scientific_name)
dropped_tips = [i for i in tips if "_".join(i.split("_")[:2]) not in keep]

In [25]:
tree = tree.drop_tips(list(dropped_tips))

### Merge the species data with the phylogeny

We can see that the number of tip labels in the tree is the same as the number of rows in the dataframe:

In [29]:
len(tree.get_tip_labels())

3121

In [30]:
intersect_df.shape[0]

3121

Designate minimum and maximum values for Ne:

In [31]:
max_Ne = 150000
min_Ne = 50000
Ne_range = max_Ne - min_Ne

In [32]:
max_area = np.max(intersect_df.species_area)
min_area = np.min(intersect_df.species_area)
area_range = max_area - min_area

Add an `Ne` column where we've scaled species area to fit between a minimum and maximum Ne value:

In [33]:
intersect_df['Ne'] = ( ((np.array(intersect_df.species_area)-min_area) / area_range) * Ne_range ) + min_Ne

In [34]:
intersect_df.head()

|  | scientific_name | generation_length | species_area | Ne |
|---|---|---|---|---|
| 0 | Akodon_boliviensis | 589.388299 | 530195.66 | 50841.12241 |
| 1 | Akodon_spegazzinii | 589.388299 | 83477.58 | 50132.431984 |
| 2 | Akodon_sylvanus | 589.388299 | 48393.63 | 50076.773481 |
| 3 | Akodon_lutescens | 589.388299 | 318909.65 | 50505.930308 |
| 4 | Akodon_subfuscus | 589.388299 | 353423.95 | 50560.685096 |
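
The `Ne` column is a linear (min-max) rescaling of species area onto the interval from `min_Ne` to `max_Ne`. A small worked sketch with hypothetical areas, where `scale_to_range` is an illustrative helper name:

```python
import numpy as np

def scale_to_range(values, new_min, new_max):
    """Linearly map values so the smallest becomes new_min and the largest new_max."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    return (values - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# hypothetical range areas in km^2
areas = [100.0, 550.0, 1000.0]
ne = scale_to_range(areas, 50_000, 150_000)
print(ne)
```

Because the mapping is linear in area, species with ranges far below the largest one end up close to `min_Ne`, which is consistent with the five *Akodon* rows above all falling within about 1,000 of 50,000.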


**Set tip vals on species tree for Ne and generation length:**

In [35]:
# first make a column where the literal tip label is matched to species name
intersect_df["tip_label"] = np.array(tree.get_tip_labels())[np.array([np.argmax(np.array(["_".join(i.split("_")[:2]) for i in tree.get_tip_labels()]) == q) for q in intersect_df.scientific_name])]
intersect_df.head()

|  | scientific_name | generation_length | species_area | Ne | tip_label |
|---|---|---|---|---|---|
| 0 | Akodon_boliviensis | 589.388299 | 530195.66 | 50841.12241 | Akodon_boliviensis_CRICE... |
| 1 | Akodon_spegazzinii | 589.388299 | 83477.58 | 50132.431984 | Akodon_spegazzinii_CRICE... |
| 2 | Akodon_sylvanus | 589.388299 | 48393.63 | 50076.773481 | Akodon_sylvanus_CRICETID... |
| 3 | Akodon_lutescens | 589.388299 | 318909.65 | 50505.930308 | Akodon_lutescens_CRICETI... |
| 4 | Akodon_subfuscus | 589.388299 | 353423.95 | 50560.685096 | Akodon_subfuscus_CRICETI... |


In [36]:
# match up Ne and g to the tip labels
ne_dict = dict(zip(np.array(intersect_df.tip_label),np.array(intersect_df.Ne)))
g_dict = dict(zip(np.array(intersect_df.tip_label),np.array(intersect_df.generation_length)))

In [37]:
# store g and Ne as node features on the tips
tree = tree.set_node_values("g",g_dict)
tree = tree.set_node_values("Ne",ne_dict)
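
The tip-label matching above relies on `np.argmax` over a boolean array, which returns index 0 when nothing matches and would therefore silently assign the first tip to an unmatched species. As a sketch of a stricter alternative, a dictionary lookup raises an error instead. The labels and values are hypothetical, and `match_tip_labels` is an illustrative helper:

```python
def match_tip_labels(tip_labels, species_names):
    """Map each Genus_species name to the first full tip label that starts with it."""
    first_label = {}
    for label in tip_labels:
        key = "_".join(label.split("_")[:2])
        # keep the first match only, mirroring np.argmax on a boolean array
        first_label.setdefault(key, label)
    # raises KeyError if a species has no matching tip
    return [first_label[name] for name in species_names]

# hypothetical tip labels and per-species Ne values
tips = ["Canis_lupus_CANIDAE_CARNIVORA", "Mustela_nivalis_MUSTELIDAE_CARNIVORA"]
names = ["Mustela_nivalis", "Canis_lupus"]
ne_values = [60_000.0, 90_000.0]

matched = match_tip_labels(tips, names)
ne_by_tip = dict(zip(matched, ne_values))
print(ne_by_tip)
```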

**Ancestral state reconstruction:**  
  
This is done using `toytree`'s PCM module.

In [38]:
recon_tree = toytree.PCM.PCM(tree)
ntree = recon_tree.ancestral_state_reconstruction("g")

In [39]:
recon_tree = toytree.PCM.PCM(ntree)
ntree = recon_tree.ancestral_state_reconstruction("Ne")
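
To give some intuition for what an ancestral state reconstruction produces, here is a sketch of the simplest case: a single ancestor with two descendant tips under a Brownian motion model, where the estimate is the average of the tip values weighted by the inverse of their branch lengths. This illustrates the general idea only; the notebook does not state which estimator `toytree`'s PCM module uses, and `bm_cherry_estimate` is a hypothetical helper:

```python
def bm_cherry_estimate(x1, t1, x2, t2):
    """Ancestral value for a two-tip clade under Brownian motion.

    x1, x2 are tip trait values; t1, t2 are the branch lengths leading to them.
    Shorter branches carry more information, so they get larger weights.
    """
    w1, w2 = 1.0 / t1, 1.0 / t2
    return (w1 * x1 + w2 * x2) / (w1 + w2)

# hypothetical generation lengths (days) and branch lengths (millions of years)
anc = bm_cherry_estimate(600.0, 1.0, 1200.0, 3.0)
print(anc)
```

With equal branch lengths the estimate reduces to the plain mean of the two tip values; with unequal lengths it is pulled toward the tip on the shorter branch.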

### Simulate on a subset of the species tree  
  
We'll pick out a genus from Carnivora.

In [40]:
# copy the tree to prune it down
s_plot = deepcopy(ntree)

In [41]:
# select the clade descended from the MRCA of all Mustela tips
mrca = s_plot.get_mrca_idx_from_tip_labels(wildcard="Mustela_")
snode = s_plot.treenode.search_nodes(idx=mrca)[0]
s_plot.treenode.prune(snode.get_descendants())
s_plot.ntips

15

In [42]:
# scale the height from millions of years to years
s_plot = s_plot.mod.node_scale_root_height(s_plot.treenode.height*1e6)
s_plot.treenode.height

8077930.0

**Visualize the species tree with generation time (node colors) and Ne (edge widths) mapped on:**

In [43]:
# draw tree with reconstructed g values as node colors and Ne values as edge widths
s_plot.draw(
    ts='p', 
    #node_labels=recon_tree.get_node_values("Ne", 1, 1),
    node_colors=[colormap.colors(i, 
                                 np.min(s_plot.get_node_values("g",1,1)), 
                                 np.max(s_plot.get_node_values("g",1,1))
                                ) for i in s_plot.get_node_values('g', 1, 1)],
    height=500,
    width=600,
);

**Adjust edge lengths to account for generation times in the `ipcoal` simulation:**

In [44]:
# divide the edge lengths (in years) by the generation time (g is in days, so g/365 is in years) to get the number of generations for each edge
ttree = s_plot.set_node_values(
    "dist",
    {i.name: i.dist / (i.g/365) for i in s_plot.get_feature_dict()}
)
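
The unit conversion in the cell above can be checked with a small worked example. Edge lengths are in years and generation lengths are in days, so dividing the generation length by 365 gives years per generation. The numbers below are hypothetical, and `years_to_generations` is an illustrative helper:

```python
DAYS_PER_YEAR = 365  # same convention as the cell above

def years_to_generations(edge_years, gen_length_days):
    """Convert an edge length in years to a number of generations."""
    years_per_generation = gen_length_days / DAYS_PER_YEAR
    return edge_years / years_per_generation

# a 1-million-year edge for a lineage with a 730-day (2-year) generation time
n_gen = years_to_generations(1_000_000, 730)
print(n_gen)  # 500000.0
```

Lineages with longer generation times therefore get shorter edges in generation units, which leaves less opportunity for coalescence along those edges.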

In [45]:
ttree.draw(ts='p',
          height=500,
          width=600);

**Simulate with `ipcoal`:**

In [46]:
mod = ipcoal.Model(ttree, seed=333)
mod.sim_trees(nloci=5, nsites=1e5)

**Look at the resulting dataframe:**

In [47]:
mod.df

|  | locus | start | end | nbps | nsnps | tidx | genealogy |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 197 | 197 | 0 | 0 | ((Neovison_vison_MUSTELI... |
| 1 | 0 | 197 | 523 | 326 | 0 | 1 | ((Neovison_vison_MUSTELI... |
| 2 | 0 | 523 | 959 | 436 | 0 | 2 | ((Neovison_vison_MUSTELI... |
| 3 | 0 | 959 | 1460 | 501 | 0 | 3 | ((Neovison_vison_MUSTELI... |
| 4 | 0 | 1460 | 1549 | 89 | 0 | 4 | ((Neovison_vison_MUSTELI... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2165 | 4 | 99445 | 99487 | 42 | 0 | 421 | (((Mustela_nudipes_MUSTE... |
| 2166 | 4 | 99487 | 99493 | 6 | 0 | 422 | ((Mustela_kathiah_MUSTEL... |
| 2167 | 4 | 99493 | 99510 | 17 | 0 | 423 | ((Mustela_kathiah_MUSTEL... |
| 2168 | 4 | 99510 | 99986 | 476 | 0 | 424 | ((Mustela_kathiah_MUSTEL... |
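
Each row of the dataframe is one genealogy spanning the interval from `start` to `end` within a locus, and `nbps` is the length of that interval. A sketch of a per-locus summary, built from only the first five rows shown above (so the numbers describe those five rows, not the full simulation):

```python
import pandas as pd

# the first five rows of the table above (genealogy column omitted)
rows = pd.DataFrame({
    "locus": [0, 0, 0, 0, 0],
    "start": [0, 197, 523, 959, 1460],
    "end": [197, 523, 959, 1460, 1549],
    "nbps": [197, 326, 436, 501, 89],
    "tidx": [0, 1, 2, 3, 4],
})

# nbps should equal the width of each interval
spans_match = bool(((rows["end"] - rows["start"]) == rows["nbps"]).all())

# number of genealogies and their mean length, per locus
summary = rows.groupby("locus").agg(
    n_genealogies=("tidx", "count"),
    mean_nbps=("nbps", "mean"),
)
print(spans_match)
print(summary)
```

Applied to the full `mod.df`, the same grouping would give the number of distinct genealogies per locus and their average length.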


**How long are the genealogies?**

In [48]:
canvas = toyplot.Canvas(width=300, height=300)
axes = canvas.cartesian()
bars = axes.bars(np.histogram(mod.df.nbps, 100,range=[0,1000]))
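
The plotting cell passes the output of `np.histogram` directly to `axes.bars`. As a sketch of what that output contains, `np.histogram` returns a pair of arrays: the count in each bin and the bin edges. The lengths below are the five `nbps` values from the first rows of the table, with coarser bins than the plot uses:

```python
import numpy as np

nbps = [197, 326, 436, 501, 89]

# 5 equal-width bins spanning 0 to 1000 sites
counts, edges = np.histogram(nbps, bins=5, range=(0, 1000))
print(counts)
print(edges)
```

Values outside the `range` argument are ignored, so genealogies longer than the upper limit do not appear in the plot.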

**What do the genealogies look like?**

In [49]:
# draw linked genealogies
toytree.mtree(mod.df.genealogy).draw_tree_grid(tip_labels=False);

In [50]:
# draw unlinked genealogies: the first genealogy (tidx == 0) from each locus
toytree.mtree(mod.df[mod.df.tidx==0].genealogy).draw_tree_grid(tip_labels=False);

### Simulate on a larger clade  
  
Let's scale up to Mustelidae, the whole family containing weasels, ferrets, otters, etc.

In [51]:
# copy the whole mammal tree to prune down
s_plot = deepcopy(ntree)

In [52]:
# subset the Mustelidae
mrca = s_plot.get_mrca_idx_from_tip_labels(wildcard="MUSTELIDAE")
snode = s_plot.treenode.search_nodes(idx=mrca)[0]
s_plot.treenode.prune(snode.get_descendants())
s_plot.ntips

45

In [53]:
# scale the height from millions of years to years
s_plot = s_plot.mod.node_scale_root_height(s_plot.treenode.height*1e6)
s_plot.treenode.height

15154079.999999998

**Visualize the species tree with generation time (node colors) and Ne (edge widths) mapped on:**

In [54]:
# draw tree with reconstructed g values as node colors and Ne values as edge widths
s_plot.draw(
    ts='p', 
    #node_labels=recon_tree.get_node_values("Ne", 1, 1),
    node_colors=[colormap.colors(i, 
                                 np.min(s_plot.get_node_values("g",1,1)), 
                                 np.max(s_plot.get_node_values("g",1,1))
                                ) for i in s_plot.get_node_values('g', 1, 1)],
    height=500,
    width = 1000,
);

**Adjust edge lengths to account for generation times in the `ipcoal` simulation:**

In [55]:
ttree = s_plot.set_node_values(
    "dist",
    {i.name: i.dist / (i.g/365) for i in s_plot.get_feature_dict()}
)

In [56]:
ttree.draw(ts='p',
          height=500,
          width=1000);

**Simulate with `ipcoal`:**

In [57]:
mod = ipcoal.Model(ttree, seed=333)
mod.sim_trees(nloci=5, nsites=1e5)

**Look at the resulting dataframe:**

In [58]:
mod.df

|  | locus | start | end | nbps | nsnps | tidx | genealogy |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 147 | 147 | 0 | 0 | (Arctonyx_collaris_MUSTE... |
| 1 | 0 | 147 | 148 | 1 | 0 | 1 | (Arctonyx_collaris_MUSTE... |
| 2 | 0 | 148 | 162 | 14 | 0 | 2 | (Arctonyx_collaris_MUSTE... |
| 3 | 0 | 162 | 359 | 197 | 0 | 3 | (Arctonyx_collaris_MUSTE... |
| 4 | 0 | 359 | 472 | 113 | 0 | 4 | (Arctonyx_collaris_MUSTE... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6188 | 4 | 99861 | 99880 | 19 | 0 | 1216 | (Taxidea_taxus_MUSTELIDA... |
| 6189 | 4 | 99880 | 99884 | 4 | 0 | 1217 | (Taxidea_taxus_MUSTELIDA... |
| 6190 | 4 | 99884 | 99906 | 22 | 0 | 1218 | (Taxidea_taxus_MUSTELIDA... |
| 6191 | 4 | 99906 | 99958 | 52 | 0 | 1219 | (Taxidea_taxus_MUSTELIDA... |


**How long are the genealogies?**

In [59]:
canvas = toyplot.Canvas(width=300, height=300)
axes = canvas.cartesian()
bars = axes.bars(np.histogram(mod.df.nbps, 100,range=[0,500]))

**What do the genealogies look like?**

In [60]:
# draw linked genealogies
toytree.mtree(mod.df.genealogy).draw_tree_grid(tip_labels=False);

In [61]:
# draw unlinked genealogies: the first genealogy (tidx == 0) from each locus
toytree.mtree(mod.df[mod.df.tidx==0].genealogy).draw_tree_grid(tip_labels=False);