# Tree Classes

In this notebook we are going to take a look at some of the methods and classes available for fooling around with phylogenetic trees. I basically started out by building a `PhyloTree` class, and then a `ResolvedTree` class, and finally building this into a `ParameterizedTree` class. 

These classes are all built to work with the data I constructed in an earlier workbook. 

Anyways, let's just pull in a fully parameterized Tree class and see what can be done with it. In a subsequent workbook, we will think about applying maximum likelihood methods, estimating trees, forming distributions over trees, and all that with an eye towards reconstructing distributions over past histories. 

As usual, the first thing to do is get all our modules up and running and installed. Here goes:

In [1]:
# My modules

import PyIETools
import PyIEClasses

# Other modules needed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import mplleaflet
import re

from scipy.optimize import minimize

# Import all the routines that I have written.

%matplotlib inline
%matplotlib notebook

# Reading in the data
# Read in the Pickle files

Data=pd.read_pickle('IEData\\MasterData.pkl')
Splits=pd.read_pickle('IEData\\Splits.pkl')
Depths=pd.read_pickle('IEData\\Depths.pkl')

## Khoisan Language Tree

The first thing we are going to do is take a look at the Khoisan Language tree and what can be done with it. We are also going to add a little fictional information to the tree - i.e., deaden some branches - mainly just to see how our methods are working in this regard. So, let's suppose that the **Nama** branch and the **Gwi** branch expired in the year 1500 and the year 100 respectively. Then:

In [2]:
# A single .loc call per assignment avoids chained indexing (and the SettingWithCopyWarning)
Data.loc[Data['name']=='NAMA', 'ex_date']=1500
Data.loc[Data['name']=='GWI', 'ex_date']=100
Data.loc[Data['name']=='NAMA', 'deadOne']=0
Data.loc[Data['name']=='GWI', 'deadOne']=0
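Setting values on a filtered subset of a DataFrame is safest with one `.loc[row_mask, column]` call, because chained indexing such as `df[col].loc[mask] = value` may write to a temporary copy. Here is a minimal sketch on a toy frame; the rows and values are made up and only the column names mirror the ones used above.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the notebook's.
df = pd.DataFrame({
    "name": ["NAMA", "GWI", "OTHER"],
    "ex_date": [2000, 2000, 2000],
    "deadOne": [1, 1, 1],
})

# One .loc call selects rows and columns together, so the write hits df itself.
df.loc[df["name"] == "NAMA", ["ex_date", "deadOne"]] = [1500, 0]
df.loc[df["name"] == "GWI", ["ex_date", "deadOne"]] = [100, 0]

print(df)
```

Rows that do not match the mask keep their original values.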


We seek to set up and parameterize a complete tree. To this end, we have to add some branching parameters to the tree (one for each internal branch), and we also need to add transition rate parameters to the tree - one for each word. We then need an overall depth parameter, and a parameter for each dead branch. The easiest way to initialize all of this information is to count the internal branches, dead spots, and words using utilities from `PyIETools` and then just invent a vector somewhere in a reasonable neighborhood of parameter values. I've gotten a feel for this from experience. Anyways:






In [3]:
KhoisanRT = PyIEClasses.ResolvedTree(Data.loc[Data['ruhlen_1'] == 'KHOISAN'], 'KTree1') #Create a resolved tree
numbranches = KhoisanRT.interiorbranches                                                #Get no. of interior branches
bInit       = np.matrix(- 1 - np.linspace(0,10,num=numbranches)/numbranches)            #Make a conformable set of parameters
rInit       = np.zeros((1, len(KhoisanRT.words)))                                       #initial rate parameters
dparms      = np.sum(KhoisanRT.deathmat[:,0] == 0)                                      #Number of death parameters needed
dInit       = np.zeros((1, dparms)) + 1                                                 #Values for death parameters
eInit       = np.matrix(5)                                                              #Overall depth parameter
parmsInit   = np.hstack((bInit, rInit, dInit, eInit))                                   #Stack them all together
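To make the layout of the stacked vector concrete, here is a small sketch with made-up counts (4 interior branches, 3 words, 2 dead branches). It uses plain 1-D arrays rather than `np.matrix`, and the split at the end is only an illustration of the ordering, not necessarily how `PyIEClasses` unpacks the vector internally.

```python
import numpy as np

# Hypothetical counts, chosen only for illustration.
n_branches, n_words, n_dead = 4, 3, 2

b = -1 - np.linspace(0, 10, num=n_branches) / n_branches  # branching parameters
r = np.zeros(n_words)                                      # one rate parameter per word
d = np.ones(n_dead)                                        # one parameter per dead branch
e = np.array([5.0])                                        # overall depth parameter

parms = np.hstack((b, r, d, e))

# Recover the four blocks from the flat vector using cumulative block sizes.
b2, r2, d2, e2 = np.split(parms, np.cumsum([n_branches, n_words, n_dead]))

print(parms.shape)
print(b2, r2, d2, e2)
```

The total length is always branches + words + dead branches + 1, which is a handy sanity check before handing the vector to a tree.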

Next, the parameters can be passed to the class constructor, along with the requisite data frame slice and a name for the tree. This will automatically create a (random) resolution of the tree, order all the data correctly, put branches in their respective places, etc. 

In [4]:
KhoisanPT=PyIEClasses.ParameterizedTree(Data.loc[Data['ruhlen_1']=='KHOISAN'],'KTree1',parmsInit)
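For intuition about what a random resolution means, here is a toy sketch that turns an unresolved group of siblings into a binary tree by repeatedly joining two randomly chosen nodes. This is an illustration of the idea only; the function names and leaf labels are hypothetical, and the actual algorithm inside `PyIEClasses` may differ.

```python
import random

def resolve_randomly(children, rng):
    """Randomly resolve a polytomy into nested 2-tuples (a binary tree)."""
    nodes = list(children)
    while len(nodes) > 1:
        i, j = sorted(rng.sample(range(len(nodes)), 2))
        right = nodes.pop(j)  # pop the larger index first so i stays valid
        left = nodes.pop(i)
        nodes.append((left, right))
    return nodes[0]

def leaves(tree):
    """Collect the leaf labels of a nested-tuple tree."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]

rng = random.Random(0)
tree = resolve_randomly(["A", "B", "C", "D"], rng)
print(tree)
print(sorted(leaves(tree)))
```

Different seeds give different binary trees over the same leaves, which is why repeated runs of the notebook can produce different pictures.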

Next, we include prior information on the depth of the tree, and also the information on splits. We first pull the prior information about tree depth out of the requisite data set, and then assign it to the tree using the `priordepth` method; the split information is attached with `splitinfo`. The `settimes()` method then translates parameters into times.

In [5]:
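mindepth = np.array(Depths['min'].loc[Depths['phylum'] == 'Khoisan'])   # named to avoid shadowing the built-in min()
maxdepth = np.array(Depths['max'].loc[Depths['phylum'] == 'Khoisan'])   # named to avoid shadowing the built-in max()
KhoisanPT.priordepth(mindepth,maxdepth)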
KhoisanPT.splitinfo(Splits[Splits['phylum'] == 'Khoisan'])

KhoisanPT.settimes()
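How `settimes()` maps parameters into times is not shown in this notebook. As a rough illustration of the general idea, one common way to turn an unconstrained parameter into a depth that respects prior bounds is a logistic transform. The function below is a hypothetical sketch with made-up bounds, not the `PyIEClasses` implementation.

```python
import math

def depth_from_param(e, lo, hi):
    """Map an unconstrained parameter e into the open interval (lo, hi)."""
    return lo + (hi - lo) / (1.0 + math.exp(-e))

# Hypothetical prior bounds on tree depth.
lo, hi = 5.0, 15.0

for e in (-5.0, 0.0, 5.0):
    print(e, depth_from_param(e, lo, hi))
```

With this kind of mapping an optimizer can move the parameter freely while the implied depth never leaves the prior range.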


### Exploring the Tree

We are now in a position to take a look at the tree, see its attributes, what it implies for points of origin, etc. One useful class method in this regard is the `showtree()` method, which plots out a phylogenetic picture of the tree (with branches proportioned accordingly and all that). Here it is for our tree:

In [6]:
KhoisanPT.showtree()


By applying the class methods `RouteChooser` and `TimeInPlace`, we come up with migratory routes through all the potential locations in the tree, and are also able to see what our parameters and tree structure imply for the organization of the tree. Let's give it a go and see what happens. 

In [7]:
KhoisanPT.RouteChooser()
KhoisanPT.TimeInPlace()

Once these two steps are done, we can plot latitudes and longitudes. The code opens a new `mplleaflet` map in which the bigger circles and deeper colors mark the locations our algorithm places along the route of the tree, with darker circles indicating greater time depth in the location. This is really just for fun and verification - later, when we actually start estimating trees, we will probably drag in the big guns - like **QGIS** - to make a more comprehensive interactive map with data and all that.

In [9]:
KhoisanPT.latlonplot('KhoisanPlot.html' ,"Blues")


I don't know why it is making such a big gap - it doesn't look great, but at least it seems to be working! Next, let's move on to another Tree and see where that gets us. 
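For a quick static check without a map tile layer, the same idea (marker size and color scaled by time in place) can be sketched with plain `matplotlib`. The coordinates and times below are invented for illustration, and this is not how `latlonplot()` is implemented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook; omit inside one
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical locations and time spent in each location.
lon = np.array([18.0, 20.5, 23.0, 25.5])
lat = np.array([-22.0, -24.0, -20.0, -26.0])
years = np.array([500.0, 1500.0, 3000.0, 6000.0])

# Larger and darker markers correspond to more time in place.
sizes = 20 + 200 * years / years.max()

fig, ax = plt.subplots()
points = ax.scatter(lon, lat, s=sizes, c=years, cmap="Blues")
fig.colorbar(points, ax=ax, label="time in place (hypothetical)")
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
```

Any named `matplotlib` colormap (for example `"Reds"` or `"afmhot"`) can be substituted for `"Blues"`.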

## Na Dene Language Stock

The so-called Na Dene languages include all the Athabaskan languages of Eastern and Western Canada, Apache, Navajo, and also, to varying degrees of controversy, the Tlingit, Eyak, and Haida languages. So, let's take a crack at setting up a tree and see how it goes. This is basically just a matter of copying and pasting the above code to see where it gets us.

In [10]:
NaDeneRT = PyIEClasses.ResolvedTree(Data.loc[Data['ruhlen_1'] == 'NADENE'], 'NTree1')  #Create a resolved tree
numbranches = NaDeneRT.interiorbranches                                                #Get no. of interior branches
bInit       = np.matrix(- 1 - np.linspace(0,10,num=numbranches)/numbranches)           #Make a conformable set of parameters
rInit       = np.zeros((1, len(NaDeneRT.words)))                                       #initial rate parameters
dparms      = np.sum(NaDeneRT.deathmat[:,0] == 0)                                      #Number of death parameters needed
dInit       = np.zeros((1, dparms)) + 1                                                #Values for death parameters
eInit       = np.matrix(5)                                                             #Overall depth parameter
parmsInit   = np.hstack((bInit, rInit, dInit, eInit))                                  #Stack them all together
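Since the same initialization block recurs for every language stock, it could be wrapped in a small helper. The sketch below relies only on the three attributes used above (`interiorbranches`, `words`, `deathmat`) and is exercised on a stand-in object, because `PyIEClasses` is not imported here. The helper name is hypothetical, and it returns a plain array rather than an `np.matrix`, which is an assumption about what the tree classes accept.

```python
from types import SimpleNamespace
import numpy as np

def initial_params(tree, depth=5.0):
    """Build a (1, n) starting vector: branch, rate, death, and depth parameters."""
    n_branches = tree.interiorbranches
    b = (-1 - np.linspace(0, 10, num=n_branches) / n_branches).reshape(1, -1)
    r = np.zeros((1, len(tree.words)))
    n_dead = int(np.sum(np.asarray(tree.deathmat)[:, 0] == 0))
    d = np.ones((1, n_dead))
    e = np.array([[depth]])
    return np.hstack((b, r, d, e))

# Stand-in for a resolved tree, with made-up contents.
stub = SimpleNamespace(
    interiorbranches=3,
    words=["w1", "w2"],
    deathmat=np.array([[0, 1500], [1, 0], [0, 100]]),
)

parms = initial_params(stub)
print(parms.shape)  # (1, 8): 3 branch + 2 rate + 2 death + 1 depth
```

With a helper like this, each section would reduce to creating the resolved tree and calling the function once.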

In [11]:
NaDenePT=PyIEClasses.ParameterizedTree(Data.loc[Data['ruhlen_1']=='NADENE'],'NTree1',parmsInit)

In [12]:
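mindepth = np.array(Depths['min'].loc[Depths['phylum'] == 'NaDene'])   # named to avoid shadowing the built-in min()
maxdepth = np.array(Depths['max'].loc[Depths['phylum'] == 'NaDene'])   # named to avoid shadowing the built-in max()
NaDenePT.priordepth(mindepth,maxdepth)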
NaDenePT.splitinfo(Splits[Splits['phylum'] == 'NaDene'])

NaDenePT.settimes()


For reasons that I can't ascertain, the `matplotlib` settings are being reset. So, here we go: 

In [13]:
%matplotlib notebook
NaDenePT.showtree()


Given this tree, what does it say about where the Na-Dene language group likely first originated? Let's do the usual steps and make a slippy map:

In [16]:
NaDenePT.RouteChooser()
NaDenePT.TimeInPlace()
NaDenePT.latlonplot('NaDeneMap.html', "Reds")


So we see again that the likeliest place of origin is where the most divergent group is located. How about we try a bigger tree like Afro-Asiatic? 

## Afro-Asiatic languages

This group of languages includes Hebrew, Arabic, Chadic and Omotic languages - not to mention ancient Egyptian! The languages are distributed throughout Northern Africa and the Middle East. Let's get a preliminary tree and see what it looks like:

In [17]:
AfroAsiaRT = PyIEClasses.ResolvedTree(Data.loc[Data['ruhlen_1'] == 'AFROASIA'], 'AATree1')  #Create a resolved tree
numbranches = AfroAsiaRT.interiorbranches                                                   #Get no. of interior branches
bInit       = np.matrix(- 1 - np.linspace(0,10,num=numbranches)/numbranches)                #Make a conformable set of parameters
rInit       = np.zeros((1, len(AfroAsiaRT.words)))                                          #initial rate parameters
dparms      = np.sum(AfroAsiaRT.deathmat[:,0] == 0)                                         #Number of death parameters needed
dInit       = np.zeros((1, dparms)) + 1                                                     #Values for death parameters
eInit       = np.matrix(5)                                                                  #Overall depth parameter
parmsInit   = np.hstack((bInit, rInit, dInit, eInit))                                       #Stack them all together

In [18]:
AfroAsiaPT=PyIEClasses.ParameterizedTree(Data.loc[Data['ruhlen_1']=='AFROASIA'],'AATree1',parmsInit)

In [19]:
mindepth = np.array(Depths['min'].loc[Depths['phylum'] == 'AfroAsia '])      # Notice the space after "AfroAsia"
maxdepth = np.array(Depths['max'].loc[Depths['phylum'] == 'AfroAsia '])      # Should be fixed!
AfroAsiaPT.priordepth(mindepth,maxdepth)                                     # names avoid shadowing built-in min()/max()
AfroAsiaPT.splitinfo(Splits[Splits['phylum'] == 'AfroAsia '])

AfroAsiaPT.settimes()
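The trailing space in some `phylum` labels could be removed once, right after loading, so that later filters can use the clean name. Here is a minimal sketch on a toy frame; the depth values are made up, and the same call would apply to the `Splits` frame.

```python
import pandas as pd

# Toy stand-in for the Depths frame; numbers are hypothetical.
depths = pd.DataFrame({
    "phylum": ["Khoisan", "AfroAsia ", "Altaic "],
    "min": [1.0, 2.0, 3.0],
    "max": [4.0, 5.0, 6.0],
})

# Strip leading and trailing whitespace from every label.
depths["phylum"] = depths["phylum"].str.strip()

row = depths.loc[depths["phylum"] == "AfroAsia"]
print(row)
```

After the cleanup, the filter no longer depends on remembering which labels carry a stray space.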


In [20]:
%matplotlib notebook
AfroAsiaPT.showtree()


The above is a bit hard to see, but one can zoom in and check it out in more detail. In any case, let's make a lat-lon plot of the above as well, to see what's happening. 

In [21]:
AfroAsiaPT.RouteChooser()
AfroAsiaPT.TimeInPlace()
AfroAsiaPT.latlonplot('AfroAsiaMap.html', "Purples")


Interestingly, how the tree is resolved has a large impact on the most likely point of origin. Because the resolution is random, running the above cells over and over again gives very different answers! This is why we want to optimize the tree, and also be able to make probabilistic statements about points of origin. 
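One simple way to turn this run-to-run variation into a probabilistic statement is to repeat the random resolution many times and tally how often each location comes out as the origin. The sketch below shows only the tallying. The function `origin_for_one_resolution` is a hypothetical stand-in that picks a site at random; in practice it would be replaced by building a freshly resolved tree and reading off its implied origin, which this notebook does not show how to do.

```python
import random
from collections import Counter

def origin_for_one_resolution(rng):
    """Hypothetical stand-in: pretend one random resolution implies one origin site."""
    return rng.choice(["site_A", "site_B", "site_C"])

rng = random.Random(42)
n_draws = 1000

# Tally the implied origin over many independent resolutions.
counts = Counter(origin_for_one_resolution(rng) for _ in range(n_draws))
shares = {site: n / n_draws for site, n in counts.items()}

for site, share in sorted(shares.items()):
    print(site, round(share, 3))
```

The resulting shares are a crude empirical distribution over origin points; weighting each resolution by its likelihood would be the natural refinement once estimation is in place.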

## Indo-Hittite Language Stock

A big and controversial family that includes almost all languages of Western Europe, but also languages of Iran and India. Competing theories about the point of origin are also out there - some scholars argue that the languages originated in Anatolia, while others place the origin on the steppes to the north. Anyways...

In [22]:
IndoHittRT = PyIEClasses.ResolvedTree(Data.loc[Data['ruhlen_1'] == 'INDOHITT'], 'IHTree1')  #Create a resolved tree
numbranches = IndoHittRT.interiorbranches                                                   #Get no. of interior branches
bInit       = np.matrix(- 1 - np.linspace(0,10,num=numbranches)/numbranches)                #Make a conformable set of parameters
rInit       = np.zeros((1, len(IndoHittRT.words)))                                          #initial rate parameters
dparms      = np.sum(IndoHittRT.deathmat[:,0] == 0)                                         #Number of death parameters needed
dInit       = np.zeros((1, dparms)) + 1                                                     #Values for death parameters
eInit       = np.matrix(5)                                                                  #Overall depth parameter
parmsInit   = np.hstack((bInit, rInit, dInit, eInit))                                       #Stack them all together

In [23]:
IndoHittPT=PyIEClasses.ParameterizedTree(Data.loc[Data['ruhlen_1']=='INDOHITT'],'IHTree1',parmsInit)

In [24]:
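mindepth = np.array(Depths['min'].loc[Depths['phylum'] == 'IndoHitt'])   # named to avoid shadowing the built-in min()
maxdepth = np.array(Depths['max'].loc[Depths['phylum'] == 'IndoHitt'])   # named to avoid shadowing the built-in max()
IndoHittPT.priordepth(mindepth,maxdepth)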
IndoHittPT.splitinfo(Splits[Splits['phylum'] == 'IndoHitt'])

IndoHittPT.settimes()


In [25]:
%matplotlib notebook
IndoHittPT.showtree()


In [26]:
IndoHittPT.RouteChooser()
IndoHittPT.TimeInPlace()
IndoHittPT.latlonplot('IndoHittiteMap.html', "Greys")


Anatolia is suggested as the most likely point of origin, at least given this tree resolution. 

## Altaic Languages

These languages include the Turkic and Mongolic languages (for example Kazakh and Khalkha) and, perhaps, very controversially, Japanese and other Japonic languages, Korean, and Ainu. I put them all together because of Ruhlen's book, but I'm not sure if this is correct. In any case:

In [27]:
AltaicRT    = PyIEClasses.ResolvedTree(Data.loc[Data['ruhlen_1'] == 'ALTAIC'], 'AlTree1')  #Create a resolved tree
numbranches = AltaicRT.interiorbranches                                                   #Get no. of interior branches
bInit       = np.matrix(- 1 - np.linspace(0,10,num=numbranches)/numbranches)                #Make a conformable set of parameters
rInit       = np.zeros((1, len(AltaicRT.words)))                                          #initial rate parameters
dparms      = np.sum(AltaicRT.deathmat[:,0] == 0)                                         #Number of death parameters needed
dInit       = np.zeros((1, dparms)) + 1                                                     #Values for death parameters
eInit       = np.matrix(5)                                                                  #Overall depth parameter
parmsInit   = np.hstack((bInit, rInit, dInit, eInit))                                       #Stack them all together

In [28]:
AltaicPT=PyIEClasses.ParameterizedTree(Data.loc[Data['ruhlen_1']=='ALTAIC'],'AlTree1',parmsInit)

In [29]:
mindepth = np.array(Depths['min'].loc[Depths['phylum'] == 'Altaic '])        # Notice the space after "Altaic"
maxdepth = np.array(Depths['max'].loc[Depths['phylum'] == 'Altaic '])        # Should be fixed!
AltaicPT.priordepth(mindepth,maxdepth)                                       # names avoid shadowing built-in min()/max()
AltaicPT.splitinfo(Splits[Splits['phylum'] == 'Altaic '])

AltaicPT.settimes()

In [30]:
%matplotlib notebook
AltaicPT.showtree()


In [31]:
AltaicPT.RouteChooser()
AltaicPT.TimeInPlace()
AltaicPT.latlonplot('AltaicMap.html')


## Amerind Language Stock

This is probably Joseph Greenberg's most controversial construction ever: essentially, the claim that all languages in North and South America belong to one stock, excepting the Na Dene languages and the Eskimo-Aleut languages. We might have to redo this, but let's see what happens if we try to model them all at once, and where our approach suggests that these languages originated. 

As an aside, I also demonstrate the `cmap` feature of the `latlonplot()` method with a less conventional colormap, `afmhot`. 

In [37]:
AmerindRT    = PyIEClasses.ResolvedTree(Data.loc[Data['ruhlen_1'] == 'AMERIND'], 'AmTree1')  #Create a resolved tree
numbranches = AmerindRT.interiorbranches                                                     #Get no. of interior branches
bInit       = np.matrix(- 1 - np.linspace(0,10,num=numbranches)/numbranches)                 #Make a conformable set of parameters
rInit       = np.zeros((1, len(AmerindRT.words)))                                            #initial rate parameters
dparms      = np.sum(AmerindRT.deathmat[:,0] == 0)                                           #Number of death parameters needed
dInit       = np.zeros((1, dparms)) + 1                                                      #Values for death parameters
eInit       = np.matrix(5)                                                                   #Overall depth parameter
parmsInit   = np.hstack((bInit, rInit, dInit, eInit))                                        #Stack them all together

In [38]:
AmerindPT=PyIEClasses.ParameterizedTree(Data.loc[Data['ruhlen_1']=='AMERIND'],'AmTree1',parmsInit)

In [39]:
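mindepth = np.array(Depths['min'].loc[Depths['phylum'] == 'Amerind'])   # named to avoid shadowing the built-in min()
maxdepth = np.array(Depths['max'].loc[Depths['phylum'] == 'Amerind'])   # named to avoid shadowing the built-in max()
AmerindPT.priordepth(mindepth,maxdepth)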
AmerindPT.splitinfo(Splits[Splits['phylum'] == 'Amerind'])

AmerindPT.settimes()

In [40]:
%matplotlib notebook
AmerindPT.showtree()


This is very hard to read, but let's see where it suggests this sprawling and probably wrong group started, according to our model.

In [41]:
AmerindPT.RouteChooser()
AmerindPT.TimeInPlace()
AmerindPT.latlonplot('AmerindMap.html', 'afmhot')
