# Exploring the Tree of Life dataset with Pandas

## First thing: import the pandas module.

In [None]:
import pandas as pd

You can find more information and tutorials on pandas here:

https://pandas.pydata.org/pandas-docs/stable/10min.html

https://pandas.pydata.org/pandas-docs/stable/tutorials.html

## Second: import the data.
We will play with an excerpt of the Tree of Life, which can be found together with this notebook. This dataset is reduced to the first 1000 taxa (starting from the root node). The full version is available here: [Open Tree of Life](https://tree.opentreeoflife.org/about/taxonomy-version/ott3.0).
![Public domain, https://en.wikipedia.org/wiki/File:Phylogenetic_tree.svg](imgTuto1/800px-Phylogenetic_tree.svg.png)
![Public Domain, https://commons.wikimedia.org/w/index.php?curid=3633804](imgTuto1/480px-Tree_of_life_SVG.svg.png)


In [None]:
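# The separator is a regular expression, so pass it as a raw string
# to avoid invalid-escape warnings for '\|'.
ToL = pd.read_csv('taxonomysmall.tsv', sep=r'\t\|\t?', encoding='utf-8', engine='python')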

If you do not remember the details of a function:

In [None]:
pd.read_csv?

The separator is given as a regular expression. For more info, see [regex](https://docs.python.org/3.6/library/re.html).
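To see what this separator does, here is a minimal sketch that applies the same pattern with Python's `re` module. The sample line is made up for illustration and only mimics the layout of the taxonomy file; it is not taken from the dataset.

```python
import re

# Same pattern as in the read_csv call: a tab, a literal pipe, then an optional tab.
SEP = r'\t\|\t?'

# Hypothetical line in the taxonomy layout (the second field is empty).
line = "805080\t|\t\t|\tlife\t|\tno rank\t|\t"

fields = re.split(SEP, line)
print(fields)
```

Because every line ends with a separator, the split yields an empty trailing field. This is probably why pandas creates an extra, unnamed last column when reading the file.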

Now, what is the object `ToL`? It is a pandas DataFrame!

In [None]:
ToL

The description of the entries is given here:
https://github.com/OpenTreeOfLife/reference-taxonomy/wiki/Interim-taxonomy-file-format

## Let us explore the table

In [None]:
ToL.columns

Let us drop some columns.

In [None]:
ToL = ToL.drop(columns= ['sourceinfo', 'uniqname', 'flags','Unnamed: 7'])

In [None]:
ToL.head()

Pandas inferred the type of the values in each column (int, float, string and string). The `parent_uid` column holds float values because it contains a missing value, which is stored as `NaN`.

In [None]:
print(ToL['uid'].dtype, ToL.parent_uid.dtype)
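The effect of a missing value on the column type can be reproduced on a toy Series. This is a small sketch with made-up values; the nullable `Int64` dtype shown at the end is an optional alternative that the notebook itself does not use.

```python
import numpy as np
import pandas as pd

# Without missing values, integers stay integers.
complete = pd.Series([1, 2, 3])

# A single NaN forces the whole column to float.
with_gap = pd.Series([1, np.nan, 3])

print(complete.dtype, with_gap.dtype)

# Optional: pandas' nullable integer dtype keeps integers and marks the gap as <NA>.
nullable = with_gap.astype('Int64')
print(nullable)
```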

How to access individual values: `iloc` selects by position, `loc` selects by label.

In [None]:
ToL.iloc[0,2]

In [None]:
ToL.loc[0,'name']
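In `ToL` the row labels happen to equal the row positions, so `iloc` and `loc` look interchangeable. The following sketch, on a made-up table with a shuffled index, shows where they differ.

```python
import pandas as pd

# Toy table whose row labels (2, 0, 1) do not match the row positions (0, 1, 2).
toy = pd.DataFrame({'uid': [10, 20, 30], 'name': ['a', 'b', 'c']},
                   index=[2, 0, 1])

by_position = toy.iloc[0, 1]   # first row, second column
by_label = toy.loc[0, 'name']  # row labelled 0, column 'name'

print(by_position, by_label)
```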

Exercise: Guess the output of this line:

In [None]:
# ToL.uid[0] == ToL.parent_uid[1]

Ordering the data

In [None]:
ToL.sort_values(by='name').head()

## Operations on the columns

Unique values, useful for categories:

In [None]:
ToL['rank'].unique()

In [None]:
# Select only one category
ToL[ToL['rank'] == 'species'].head()

How many species do we have?

In [None]:
len(ToL[ToL['rank'] == 'species'])

In [None]:
ToL['rank'].value_counts()
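Both ways of counting can be checked against each other on a small example. This sketch uses a made-up Series of ranks; summing a boolean mask counts its `True` entries, which avoids building the filtered table.

```python
import pandas as pd

# Toy column of ranks (illustrative values only).
ranks = pd.Series(['species', 'genus', 'species', 'family'], name='rank')

# Count one category by summing a boolean mask.
n_species = int((ranks == 'species').sum())

# Count all categories at once.
counts = ranks.value_counts()

print(n_species)
print(counts)
```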

# Building the graph

Let us build the adjacency matrix of the graph. For that we need to reorganize the data. First we separate the nodes and their properties from the edges.

In [None]:
ToLnodes = ToL[['uid','name','rank']]
ToLedges = ToL[['uid','parent_uid']]

When using an adjacency matrix, nodes are indexed by their row or column number and not by a `uid`. Let us create a new index for the nodes.

In [None]:
# Create a column for node index
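# Reassign instead of using inplace=True: ToLnodes is a slice of ToL,
# and modifying it in place can trigger a SettingWithCopyWarning.
ToLnodes = ToLnodes.reset_index()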
ToLnodes = ToLnodes.rename(columns={'index':'node_idx'})
ToLnodes.head()

In [None]:
# Create a conversion table from uid to node index
uid2idx = ToLnodes[['node_idx','uid']]
uid2idx = uid2idx.set_index('uid')
uid2idx.head()

In [None]:
ToLedges.head()

Now we are ready to use yet another powerful function of Pandas. Those familiar with SQL will recognize it: the `join` function.

In [None]:
# Add a new column, matching the uid with the node_idx
ToLedges = ToLedges.join(uid2idx,on='uid')

In [None]:
# Do the same with the parent_uid
ToLedges = ToLedges.join(uid2idx, on='parent_uid', rsuffix='_parent')

In [None]:
# Drop the uids
ToLedges = ToLedges.drop(columns=['uid','parent_uid'])

In [None]:
ToLedges.head()

This table is a list of edges connecting nodes and their parents.
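The two joins can be replayed on a three-node toy tree to see what `rsuffix` does. This is a sketch with made-up uids and names; `nodes`, `edges` and `lookup` are illustrative variables, not part of the dataset.

```python
import pandas as pd

# Toy tree: one root (uid 100) with two children (uids 200 and 300).
nodes = pd.DataFrame({'uid': [100, 200, 300],
                      'name': ['root', 'child_a', 'child_b']})
edges = pd.DataFrame({'uid': [100, 200, 300],
                      'parent_uid': [float('nan'), 100, 100]})

# Conversion table from uid to row number, indexed by uid.
lookup = (nodes.reset_index()
               .rename(columns={'index': 'node_idx'})[['node_idx', 'uid']]
               .set_index('uid'))

# The first join adds 'node_idx'. The second join would create a second
# 'node_idx' column, so rsuffix renames it to 'node_idx_parent'.
edges = edges.join(lookup, on='uid')
edges = edges.join(lookup, on='parent_uid', rsuffix='_parent')

print(edges)
```

The root has no parent, so its `node_idx_parent` is `NaN`, and the whole column becomes float.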

## The weight matrix

Let us use numpy to build this weight matrix.

In [None]:
import numpy as np
nb_nodes = len(ToLnodes)
W = np.zeros((nb_nodes,nb_nodes),dtype=int)

In [None]:
for idx,row in ToLedges.iterrows():
    if np.isnan(row.node_idx_parent):
        continue
    i,j=int(row.node_idx),int(row.node_idx_parent)
    W[i,j] = 1
    W[j,i] = 1

In [None]:
W[:15,:15]
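For larger tables, the row-by-row loop can be replaced by NumPy index arrays. The sketch below is one possible alternative, shown on a made-up three-node edge list rather than on `ToLedges`; it should give the same symmetric matrix as the loop.

```python
import numpy as np
import pandas as pd

# Toy edge list: nodes 1 and 2 are children of node 0; the root has no parent.
toy_edges = pd.DataFrame({'node_idx': [0, 1, 2],
                          'node_idx_parent': [np.nan, 0.0, 0.0]})

# Drop the root row, then convert both columns to integer index arrays.
valid = toy_edges.dropna(subset=['node_idx_parent'])
i = valid['node_idx'].to_numpy(dtype=int)
j = valid['node_idx_parent'].to_numpy(dtype=int)

# Fill both triangles at once so the matrix is symmetric.
W_toy = np.zeros((3, 3), dtype=int)
W_toy[i, j] = 1
W_toy[j, i] = 1

print(W_toy)
```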

Congratulations, you have built the weight matrix!
To conclude, let us visualize the graph. We will use the Python module `networkx`. Don't forget to run `pip install networkx` in the command line beforehand.

In [None]:
import networkx as nx

In [None]:
# A simple command to create the graph from the weight matrix
G = nx.from_numpy_array(W)

In addition, let us add some attributes to the nodes:

In [None]:
nodeprops = ToLnodes.to_dict()

In [None]:
for key in nodeprops:
    #print(key,nodeprops[key])
    nx.set_node_attributes(G,nodeprops[key],key)
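The loop works because `to_dict()` returns one dictionary per column, each mapping a row label to a value, and the row labels 0, 1, 2, ... coincide with the node numbers that `from_numpy_array` assigns. A minimal sketch on a made-up two-row table shows this nested shape:

```python
import pandas as pd

# Toy node table (illustrative values only).
toy_nodes = pd.DataFrame({'node_idx': [0, 1], 'name': ['root', 'child']})

# One dict per column, keyed by row label.
props = toy_nodes.to_dict()
print(props)
```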

Let us check if it is correctly recorded:

In [None]:
# G.node was removed in networkx 2.4; G.nodes is the supported accessor.
G.nodes[1]

In [None]:
nx.write_gexf(G,'ToL.gexf')

In [None]:
import matplotlib.pyplot as plt

In [None]:
nx.draw_spectral(G)

In [None]:
nx.draw_spring(G)

You may now explore the graph using Gephi and compare the visualizations.