# tiers - a hierarchical label handling library

The tiers library makes handling hierarchical labels easier. It is based on defining a hierarchy table with labels.

This notebook outlines the basics of using tiers.

Let's load a hierarchy table. A hierarchy table is a table of strings that represent a hierarchy, optionally followed by a label column. The strings in the hierarchy table are referred to as *nodes* in tiers. A label is an arbitrary string that maps to a node.

In [1]:
import pandas as pd

df = pd.read_csv("../tests/data/taxa_table.csv")
df

Unnamed: 0,kingdom,phylum,class,order,family,genus,species,type,label
0,Animalia,Arthropoda,Malacostraca,Isopoda,Asellidae,Asellus,,,As
1,Animalia,Arthropoda,Malacostraca,Isopoda,Asellidae,Asellus,Asellus aquaticus,,Asellus_aquaticus
2,Animalia,Arthropoda,Malacostraca,Isopoda,Asellidae,Asellus,Asellus aquaticus,,Asellus aquaticus
3,Animalia,Arthropoda,Insecta,Ephemeroptera,Caenidae,Caenis,Caenis horaria,,Caenis_horaria
4,Animalia,Arthropoda,Insecta,Ephemeroptera,Caenidae,Caenis,Caenis luctuosa,,Caenis_luctuosa
5,Animalia,Arthropoda,Insecta,Ephemeroptera,Caenidae,Caenis,Caenis rivulorum,,Caenis_rivulorum
6,Animalia,Arthropoda,Insecta,Ephemeroptera,Baetidae,,,,Baetidae
7,Animalia,Arthropoda,Insecta,Ephemeroptera,Baetidae,Cloeon,Cloeon dipterum,,Cloeon_dipterum
8,Animalia,Arthropoda,Insecta,Trichoptera,Polycentropodidae,Cyrnus,Cyrnus trimaculatus,,Cyrnus_trimaculatus
9,Animalia,Arthropoda,Insecta,Coleoptera,Elmidae,Oulimnius,Oulimnius tuberculatus,Oulimnius tuberculatus larva,Oulimnius_tuberculatus


Most rows are classified to the species level; the exceptions are *Asellus* and *Baetidae*. The species *Oulimnius tuberculatus* has two subclasses, an adult form and a larva form. The larva form also has two aliased labels: Oulimnius_tuberculatus and Oulimnius_tuberculatus_larv are both mapped to the larva type. Similarly, "Asellus_aquaticus" and "Asellus aquaticus" (the same string as the node) are mapped to the same node.

The label assignment is arbitrary and can be any string. For example, the species *Radix balthica* has the label RaBa in this dataset. The label can also be the same string as the leaf node (the rightmost non-NaN hierarchy value on the row).

tiers assumes that the last column of a hierarchy table is the label column. The mapping from labels to nodes can also be provided separately as a dict.
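
To make the table layout concrete, here is a minimal sketch in plain Python (it does not use tiers and is not its implementation) of how a label-to-node dict could be derived from rows like the ones above. The leaf node is taken as the deepest non-empty hierarchy value and the last column as the label. The helper name `leaf_of` and the use of `None` for missing values are assumptions made for illustration.

```python
# Illustrative only: plain Python, independent of tiers.
LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species", "type"]

# Three rows copied from the table above; None stands in for NaN.
rows = [
    ["Animalia", "Arthropoda", "Malacostraca", "Isopoda", "Asellidae",
     "Asellus", None, None, "As"],
    ["Animalia", "Arthropoda", "Malacostraca", "Isopoda", "Asellidae",
     "Asellus", "Asellus aquaticus", None, "Asellus_aquaticus"],
    ["Animalia", "Arthropoda", "Insecta", "Ephemeroptera", "Baetidae",
     None, None, None, "Baetidae"],
]


def leaf_of(row):
    """Return the deepest (last non-empty) hierarchy value of a row."""
    hierarchy = row[:len(LEVELS)]  # every column except the trailing label
    return [value for value in hierarchy if value is not None][-1]


# The last column is the label; map it to the leaf node of its row.
label_map = {row[-1]: leaf_of(row) for row in rows}
print(label_map)
```

The resulting dict has the same shape as a separately provided label mapping: keys are labels, values are node strings.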

## Trees

The basic object in tiers is a `Tree`. The label hierarchy is stored in the tree, which handles the mapping of labels to different levels in the hierarchy. A `Tree` also remembers the name of each level, taken from the column names of the original dataframe. A `Tree` can be set to a specific level that all labels are mapped to.

In [2]:
import importlib
import tiers
importlib.reload(tiers)

tree = tiers.Tree.from_dataframe(df)
tree.show()

Animalia
├── Arthropoda
│   ├── Malacostraca
│   │   └── Isopoda
│   │       └── Asellidae
│   │           └── Asellus
│   │               └── Asellus aquaticus
│   └── Insecta
│       ├── Ephemeroptera
│       │   ├── Caenidae
│       │   │   └── Caenis
│       │   │       ├── Caenis horaria
│       │   │       ├── Caenis luctuosa
│       │   │       └── Caenis rivulorum
│       │   └── Baetidae
│       │       └── Cloeon
│       │           └── Cloeon dipterum
│       ├── Trichoptera
│       │   └── Polycentropodidae
│       │       ├── Cyrnus
│       │       │   └── Cyrnus trimaculatus
│       │       └── Polycentropus
│       │           └── Polycentropus flavomaculatus
│       └── Coleoptera
│           └── Elmidae
│               └── Oulimnius
│                   └── Oulimnius tuberculatus
│                       ├── Oulimnius tuberculatus larva
│                       └── Oulimnius tuberculatus adult
└── Mollusca
    └── Gastropoda
        └── Hygrophila
            └── Lymnaeid

In addition to the full tree, a simplified version can be shown. Here leaves are shown on the level where they have siblings.

In [3]:
tree.show_simple()

Animalia
├── Arthropoda
│   ├── Malacostraca
│   └── Insecta
│       ├── Ephemeroptera
│       │   ├── Caenidae
│       │   │   └── Caenis
│       │   │       ├── Caenis horaria
│       │   │       ├── Caenis luctuosa
│       │   │       └── Caenis rivulorum
│       │   └── Baetidae
│       ├── Trichoptera
│       │   └── Polycentropodidae
│       │       ├── Cyrnus
│       │       └── Polycentropus
│       └── Coleoptera
│           └── Elmidae
│               └── Oulimnius
│                   └── Oulimnius tuberculatus
│                       ├── Oulimnius tuberculatus larva
│                       └── Oulimnius tuberculatus adult
└── Mollusca
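
Judging from the two outputs above, the simplified view appears to cut off every branch that never splits, so that a chain such as Malacostraca → Isopoda → Asellidae → ... is shown only as Malacostraca. The sketch below reproduces that rule on a small nested dict. It is an inference from the printed trees, not the code behind `show_simple`, and the names `is_chain` and `simplify` are hypothetical.

```python
# Illustrative only: a nested dict stands in for a part of the tree above.
tree_dict = {
    "Animalia": {
        "Arthropoda": {
            "Malacostraca": {
                "Isopoda": {"Asellidae": {"Asellus": {"Asellus aquaticus": {}}}}
            },
            "Insecta": {
                "Ephemeroptera": {
                    "Caenidae": {
                        "Caenis": {"Caenis horaria": {}, "Caenis luctuosa": {}}
                    },
                    "Baetidae": {"Cloeon": {"Cloeon dipterum": {}}},
                }
            },
        }
    }
}


def is_chain(children):
    """True if the subtree never branches (every node has at most one child)."""
    while children:
        if len(children) > 1:
            return False
        children = next(iter(children.values()))
    return True


def simplify(children):
    """Drop the descendants of every node whose subtree is an unbranched chain."""
    return {
        name: {} if is_chain(sub) else simplify(sub)
        for name, sub in children.items()
    }


simple = simplify(tree_dict)
print(simple)
```

In the result, Malacostraca and Baetidae become leaves, while Caenidae keeps its child Caenis because the branch splits further down.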


A `Tree` has a `level` property that shows the current level of the tree. It can be changed with `tree.set_level`. The possible levels are listed in the properties `levels` and `levels_sortable`; the latter prefixes each level name with its depth. `leaf` is the special default level, at which labels are mapped to the deepest leaf found in the hierarchy.

In [4]:
tree.set_level("genus")
tree.level

'05_genus'

In [5]:
print(tree.levels)
print(tree.levels_sortable)

['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'type']
['00_kingdom', '01_phylum', '02_class', '03_order', '04_family', '05_genus', '06_species', '07_type']
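
The depth prefix is what makes the level names sortable as plain strings. The following sketch shows one way such names could be built from the column names; it reproduces the output above but is only an illustration, not necessarily how tiers builds them.

```python
levels = ["kingdom", "phylum", "class", "order", "family", "genus", "species", "type"]

# Prefix each level with its zero-padded depth so that
# string sorting follows the order of the hierarchy.
levels_sortable = [f"{depth:02d}_{name}" for depth, name in enumerate(levels)]
print(levels_sortable)

# Plain names sort alphabetically, which breaks the hierarchy order;
# the prefixed names keep it.
plain_keeps_order = sorted(levels) == levels
prefixed_keeps_order = sorted(levels_sortable) == levels_sortable
print(plain_keeps_order, prefixed_keeps_order)
```

With prefixed names, comparisons such as "is this level deeper than that one" reduce to ordinary string comparison.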


## Mapping values

The basic functionality of a tree is to map values to different levels in the hierarchy. This is done by passing a label or a list of labels to the `tree.map()` method. Because we set the tree level to genus, the label `RaBa` is mapped to the genus level.

In [6]:
tree.map("RaBa")

'Radix'

In [7]:
# A list can also be passed
tree.map(["Asellus_aquaticus",
          "Caenis_horaria"])

['Asellus', 'Caenis']

In [8]:
# The level can be changed temporarily. This does not affect the level of the tree
tree.map(["As",
          "Asellus_aquaticus",
          "Oulimnius_tuberculatus"],
          level="type")

['Asellus', 'Asellus aquaticus', 'Oulimnius tuberculatus larva']

In [9]:
# The tree level is not changed
print(tree.level)


05_genus


### Strict level mapping
Above we see that even though we specify the mapping at the `type` level, the labels `As` and `Asellus_aquaticus` are mapped to their lowest levels, `genus` and `species`. To map strictly to the specified level, pass `strict=True` to the method. Labels whose leaf node is higher than the requested level are then returned as `None`.

In [10]:
tree.map(["As",
          "Asellus_aquaticus",
          "Oulimnius_tuberculatus"],
          level="type",
          strict=True)

[None, None, 'Oulimnius tuberculatus larva']

In [11]:
tree.map("Oulimnius_tuberculatus", level="leaf")

'Oulimnius tuberculatus larva'
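
The difference between the default and the strict behaviour can be sketched in plain Python. The example below is a simplified model, not the tiers implementation: each leaf node is stored with its lineage from the root, and the hypothetical function `map_to_level` either falls back to the leaf or returns `None` when the lineage does not reach the requested level.

```python
LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species", "type"]

# Lineages from the root to each leaf node, copied from the tree above.
lineages = {
    "Asellus": ["Animalia", "Arthropoda", "Malacostraca", "Isopoda",
                "Asellidae", "Asellus"],
    "Asellus aquaticus": ["Animalia", "Arthropoda", "Malacostraca", "Isopoda",
                          "Asellidae", "Asellus", "Asellus aquaticus"],
    "Oulimnius tuberculatus larva": ["Animalia", "Arthropoda", "Insecta",
                                     "Coleoptera", "Elmidae", "Oulimnius",
                                     "Oulimnius tuberculatus",
                                     "Oulimnius tuberculatus larva"],
}


def map_to_level(leaf, level, strict=False):
    """Return the ancestor of `leaf` at `level`, or handle a too-shallow leaf."""
    lineage = lineages[leaf]
    depth = LEVELS.index(level)
    if depth < len(lineage):
        return lineage[depth]
    # The leaf sits above the requested level.
    return None if strict else lineage[-1]


leaves = ["Asellus", "Asellus aquaticus", "Oulimnius tuberculatus larva"]
print([map_to_level(leaf, "type") for leaf in leaves])
print([map_to_level(leaf, "type", strict=True) for leaf in leaves])
```

The two printed lists correspond to the non-strict and strict outputs shown in the cells above.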

### Mapping node strings

The `tree.map` method accepts both label and node values. If a label has re-mapped a node value, a warning is given during tree creation. Strings are always first assumed to be labels and are mapped to their corresponding node values. If a string is not in the label list, it is assumed to be a node.

If you want to handle strings purely as node values (for example, if a label has re-mapped a node), pass `nodes=True` to the `map` method.

The labels can be seen in the `tree.label_map` dict.

In [12]:
tree.map(["Asellus", "Caenis"], level="phylum", nodes=True)

['Arthropoda', 'Arthropoda']

In [13]:
# when nodes is true, labels can't be passed for mapping
try:
    tree.map(["As", "Asellus_aquaticus"], nodes=True)
except Exception as e:
    print(e)

'Trying to get an nonexistent node As. Perhaps you are trying to find a label?'


In [14]:
# By default one can map for both labels and nodes
tree.map(["Caenis", "As"], level="phylum")

['Arthropoda', 'Arthropoda']
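
The lookup order described in this section (labels first, then nodes, or nodes only) can be summarised with a small stand-alone sketch. The function `resolve` and the two containers are hypothetical stand-ins, not part of the tiers API.

```python
# Illustrative only: a tiny label map and node set taken from the examples above.
label_map = {"As": "Asellus", "Asellus_aquaticus": "Asellus aquaticus"}
node_names = {"Asellus", "Asellus aquaticus", "Caenis"}


def resolve(name, nodes=False):
    """Return the node that `name` refers to.

    With nodes=False the string is first looked up as a label and only
    then as a node. With nodes=True the label map is skipped entirely.
    """
    if not nodes and name in label_map:
        return label_map[name]
    if name in node_names:
        return name
    raise KeyError(f"no node named {name!r}")


print(resolve("As"))      # found in the label map
print(resolve("Caenis"))  # not a label, so treated as a node

try:
    resolve("As", nodes=True)  # labels are not consulted here
except KeyError as error:
    print(error)
```

Skipping the label map is what makes `nodes=True` useful when a label shadows a node name.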

### Level mapping
We can also look up the level of a label or node with `tree.get_level`.

In [15]:
print(tree.get_level("As"))
print(tree.get_level(["Asellus aquaticus", "Caenis"], nodes=True, prefix=False))

05_genus
['species', 'genus']


## Other features

tiers also provides other useful functions for handling hierarchical labels.

### A tree from a dataframe without labels

A tiers `Tree` can also be created without label mappings.

In [16]:
df_nolbl = pd.read_csv("../tests/data/taxa_table_no_labels.csv")
tree_nolbl = tiers.Tree(df = df_nolbl)
tree_nolbl.show_simple()

Animalia
├── Arthropoda
│   ├── Malacostraca
│   └── Insecta
│       ├── Ephemeroptera
│       │   ├── Caenidae
│       │   │   └── Caenis
│       │   │       ├── Caenis horaria
│       │   │       ├── Caenis luctuosa
│       │   │       └── Caenis rivulorum
│       │   ├── Baetidae
│       │   │   ├── Centroptilum
│       │   │   └── Cloeon
│       │   ├── Ephemeridae
│       │   └── Heptageniidae
│       │       ├── Heptagenia
│       │       └── Kageronia
│       ├── Trichoptera
│       │   ├── Polycentropodidae
│       │   │   ├── Cyrnus
│       │   │   └── Polycentropus
│       │   ├── Ecnomidae
│       │   ├── Lepidostomatidae
│       │   ├── Leptoceridae
│       │   └── Psychomyiidae
│       │       ├── Psychomyia
│       │       └── Tinodes
│       └── Coleoptera
│           └── Elmidae
│               └── Oulimnius
│                   └── Oulimnius tuberculatus
│                       ├── Oulimnius tuberculatus larva
│                       └── Oulimnius tuberculatus adult
├──

In [17]:
tree_nolbl.map(["Cloeon",
                "Cyrnus"],
                level="family")

['Baetidae', 'Polycentropodidae']

A label mapping can be assigned later.

In [18]:
label_map = {"CaHo": "Caenis horaria"}
tree_nolbl.label_map = label_map
tree_nolbl.map(["CaHo", "Stylaria"], level="family")

['Caenidae', 'Naididae']

A label map can be safely updated with `tree.update_label_map`.

In [19]:
tree_nolbl = tree_nolbl.update_label_map({"test_label": "Caenis horaria"})
tree_nolbl.map(["test_label", "CaHo"])

['Caenis horaria', 'Caenis horaria']

## Labels with the same name as a node
If your labels contain strings that are also present in the hierarchy table, these values can be mapped to a different node. This must be specified during tree creation, or a warning will be given. Here we want to remap the label `Asellidae` to the node `Asellus`.

In [20]:
df_remap_node = pd.read_csv("../tests/data/ancestor_as_label.csv")
df_remap_node

Unnamed: 0,kingdom,phylum,class,order,family,genus,species,type,label
0,Animalia,Arthropoda,Malacostraca,Isopoda,Asellidae,Asellus,,,Asellidae
1,Animalia,Arthropoda,Malacostraca,Isopoda,Asellidae,Asellus,Asellus aquaticus,,Asellus_aquaticus
2,Animalia,Arthropoda,Insecta,Ephemeroptera,Caenidae,Caenis,Caenis horaria,,Caenis_horaria
3,Animalia,Arthropoda,Insecta,Ephemeroptera,Caenidae,Caenis,Caenis luctuosa,,Caenis_luctuosa
4,Animalia,Arthropoda,Insecta,Ephemeroptera,Caenidae,Caenis,Caenis rivulorum,,Caenis_rivulorum


In [21]:
tree_remap = tiers.Tree.from_dataframe(df_remap_node)



In [22]:
# The warning can be suppressed by setting node_remapping=True
tree_remap = tiers.Tree.from_dataframe(df_remap_node, node_remapping=True)
tree_remap.show()

Animalia
└── Arthropoda
    ├── Malacostraca
    │   └── Isopoda
    │       └── Asellidae
    │           └── Asellus
    │               └── Asellus aquaticus
    └── Insecta
        └── Ephemeroptera
            └── Caenidae
                └── Caenis
                    ├── Caenis horaria
                    ├── Caenis luctuosa
                    └── Caenis rivulorum


In [23]:
# if nodes=False, 'Asellidae' is handled as a label
tree_remap.map(["Asellidae", "Asellus_aquaticus"], nodes=False)

['Asellus', 'Asellus aquaticus']

In [24]:
# if nodes=True, it is handled as a node string
tree_remap.map(["Asellidae", "Asellus aquaticus"], nodes=True)

['Asellidae', 'Asellus aquaticus']

### Other functions
- `in_ancestors`: checks whether a node is among the ancestors of another node
- `match`: checks whether two nodes are of the same lineage
- `lca`: finds the lowest common ancestor of two nodes
- `match_level`: returns the level of the lowest common ancestor of two nodes

All of these functions take a `nodes` parameter (default `False`); pass `nodes=True` to treat the strings as node names instead of labels.

In [25]:
print(tree.in_ancestors("Caenis horaria", "Caenis"))
print(tree.in_ancestors("Caenis horaria", "Caenis horaria"))
print(tree.match("Caenis horaria", "Caenis"))
print(tree.match("Caenis", "Caenis horaria"))
print(tree.match("Baetidae", "Caenis horaria"))
print(tree.match("Caenis horaria", "Baetidae"))
print(tree.match("Caenis horaria", "Caenis horaria"))
print(tree.lca("Caenis horaria", "Baetidae"))
print(tree.lca("Caenis_horaria", "As"))
print(tree.match_level("Caenis horaria", "Baetidae"))

True
False
True
True
False
False
True
Ephemeroptera
Arthropoda
03_order
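
To show what these lineage checks compute, here is a minimal sketch built on a child-to-parent dict. It works on node names only and follows the outputs above (for instance, a node is not its own ancestor), but it is an illustration under those assumptions rather than the tiers implementation.

```python
LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species", "type"]

# Child -> parent links for a part of the tree above.
parent = {
    "Arthropoda": "Animalia",
    "Malacostraca": "Arthropoda",
    "Isopoda": "Malacostraca",
    "Asellidae": "Isopoda",
    "Asellus": "Asellidae",
    "Insecta": "Arthropoda",
    "Ephemeroptera": "Insecta",
    "Caenidae": "Ephemeroptera",
    "Caenis": "Caenidae",
    "Caenis horaria": "Caenis",
    "Baetidae": "Ephemeroptera",
}


def ancestors(node):
    """Return the ancestors of `node`, nearest first, excluding the node itself."""
    result = []
    while node in parent:
        node = parent[node]
        result.append(node)
    return result


def in_ancestors(node, candidate):
    return candidate in ancestors(node)


def match(a, b):
    """True if the nodes are equal or one is an ancestor of the other."""
    return a == b or in_ancestors(a, b) or in_ancestors(b, a)


def lca(a, b):
    """Return the deepest node shared by both root paths (assumes one root)."""
    path_b = [b] + ancestors(b)
    for node in [a] + ancestors(a):
        if node in path_b:
            return node
    return None


def match_level(a, b):
    """Return the sortable level name of the lowest common ancestor."""
    depth = len(ancestors(lca(a, b)))
    return f"{depth:02d}_{LEVELS[depth]}"


print(in_ancestors("Caenis horaria", "Caenis"))
print(match("Baetidae", "Caenis horaria"))
print(lca("Caenis horaria", "Baetidae"))
print(match_level("Caenis horaria", "Baetidae"))
```

Whether tiers counts a node as its own lowest common ancestor with a descendant is not shown in the cell above; the sketch does, which is an assumption.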


### Merging trees

Trees can be merged. The label mappings of the merged tree come from both trees.

In [26]:
new_tree = tree.merge(tree_nolbl)
new_tree.show()

Animalia
├── Arthropoda
│   ├── Malacostraca
│   │   └── Isopoda
│   │       └── Asellidae
│   │           └── Asellus
│   │               └── Asellus aquaticus
│   └── Insecta
│       ├── Ephemeroptera
│       │   ├── Caenidae
│       │   │   └── Caenis
│       │   │       ├── Caenis horaria
│       │   │       ├── Caenis luctuosa
│       │   │       └── Caenis rivulorum
│       │   ├── Baetidae
│       │   │   ├── Cloeon
│       │   │   │   └── Cloeon dipterum
│       │   │   └── Centroptilum
│       │   │       └── Centroptilum luteolum
│       │   ├── Ephemeridae
│       │   │   └── Ephemera
│       │   │       └── Ephemera vulgata
│       │   └── Heptageniidae
│       │       ├── Heptagenia
│       │       │   └── Heptagenia dalecarlica
│       │       └── Kageronia
│       │           └── Kageronia fuscogrisea
│       ├── Trichoptera
│       │   ├── Polycentropodidae
│       │   │   ├── Cyrnus
│       │   │   │   └── Cyrnus trimaculatus
│       │   │   └── Polycentropus
│       │

In [27]:
new_tree.label_map

{'As': 'Asellus',
 'Asellus_aquaticus': 'Asellus aquaticus',
 'Asellus aquaticus': 'Asellus aquaticus',
 'Caenis_horaria': 'Caenis horaria',
 'Caenis_luctuosa': 'Caenis luctuosa',
 'Caenis_rivulorum': 'Caenis rivulorum',
 'Baetidae': 'Baetidae',
 'Cloeon_dipterum': 'Cloeon dipterum',
 'Cyrnus_trimaculatus': 'Cyrnus trimaculatus',
 'Oulimnius_tuberculatus': 'Oulimnius tuberculatus larva',
 'Oulimnius_tuberculatus_adult': 'Oulimnius tuberculatus adult',
 'Oulimnius_tuberculatus_larv': 'Oulimnius tuberculatus larva',
 'Polycentropus_flavomaculatus': 'Polycentropus flavomaculatus',
 'RaBa': 'Radix balthica',
 'CaHo': 'Caenis horaria',
 'test_label': 'Caenis horaria'}

In [28]:
# New label, old label and node name are mapped to the same node
new_tree.map(["CaHo", "Caenis_horaria", "Caenis horaria"], level="species")

['Caenis horaria', 'Caenis horaria', 'Caenis horaria']
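
Combining the label maps of two trees amounts to a dict union. How tiers treats a label that points to different nodes in the two trees is not covered in this notebook; the sketch below raises an error in that case, which is only one possible policy. The function name `merge_label_maps` is hypothetical.

```python
def merge_label_maps(first, second):
    """Combine two label maps, refusing labels that point to different nodes."""
    shared = first.keys() & second.keys()
    conflicts = sorted(label for label in shared if first[label] != second[label])
    if conflicts:
        raise ValueError(f"conflicting labels: {conflicts}")
    return {**first, **second}


# Small excerpts of the two label maps used above.
labels_a = {"Caenis_horaria": "Caenis horaria", "As": "Asellus"}
labels_b = {"CaHo": "Caenis horaria", "test_label": "Caenis horaria"}

merged = merge_label_maps(labels_a, labels_b)
print(merged)
```

Several labels may point to the same node, as `CaHo`, `test_label` and `Caenis_horaria` do here; only a single label with two different targets is treated as a conflict.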