In [1]:
import pandas as pd
from taxumap.input_validation import fill_tax_table

# Outline

This notebook briefly describes the proper way of formatting the **taxonomy table** for use in TaxUMAP.

To use TaxUMAP properly, every taxonomic level of an ASV/OTU must be defined. For example:

| OTU       | Kingdom   | Phylum     | Class         | Order           | Family          | Genus       | Species             |
|:----------|:----------|:-----------|:--------------|:----------------|:----------------|:------------|:--------------------|
| Uniq53046 | Bacteria  | Firmicutes | Negativicutes | Selenomonadales | Veillonellaceae | Veillonella | Veillonella_atypica |

is a valid entry for the OTU 'Uniq53046', since every taxonomic level is resolved. If any information is missing, for example:


| OTU        | Kingdom   | Phylum     |   Class |   Order |   Family |   Genus |   Species |
|:-----------|:----------|:-----------|--------:|--------:|---------:|--------:|----------:|
| Uniq114339 | Bacteria  | Firmicutes |         |         |          |         |           |

then we ***must fill the taxonomic levels class through species with a placeholder***. **They cannot be left empty.**

To do this, first mark every missing value as `np.nan`:

| OTU        | Kingdom   | Phylum     |   Class |   Order |   Family |   Genus |   Species |
|:-----------|:----------|:-----------|--------:|--------:|---------:|--------:|----------:|
| Uniq114339 | Bacteria  | Firmicutes |     nan |     nan |      nan |     nan |       nan |

and then use the `fill_tax_table()` function, which can be imported using:
```python
from taxumap.input_validation import fill_tax_table
```
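
If the raw table stores unresolved levels as blank strings rather than `np.nan`, the conversion might look like the sketch below. The two-row table is built by hand for illustration, and the blank-string encoding is an assumption about the input; `pd.read_csv` already reads empty fields as `NaN` by default, so this step is only needed when blanks reach the DataFrame some other way.

```python
import numpy as np
import pandas as pd

# Hand-built illustration: unresolved levels are stored as empty strings.
raw = pd.DataFrame(
    {
        "OTU": ["Uniq114339", "Uniq53046"],
        "Kingdom": ["Bacteria", "Bacteria"],
        "Phylum": ["Firmicutes", "Firmicutes"],
        "Class": ["", "Negativicutes"],
        "Order": ["", "Selenomonadales"],
        "Family": ["", "Veillonellaceae"],
        "Genus": ["", "Veillonella"],
        "Species": ["", "Veillonella_atypica"],
    }
).set_index("OTU")

# Treat empty or whitespace-only cells as missing values.
with_nans = raw.replace(r"^\s*$", np.nan, regex=True)

# Count the missing values per taxonomic level.
print(with_nans.isna().sum())
```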

Using the `fill_tax_table` function on the `np.nan`-filled row for Uniq114339 will yield:

| OTU        | Kingdom   | Phylum     | Class                                      | Order                                      | Family                                      | Genus                                      | Species                                      |
|:-----------|:----------|:-----------|:-------------------------------------------|:-------------------------------------------|:--------------------------------------------|:-------------------------------------------|:---------------------------------------------|
| Uniq114339 | Bacteria  | Firmicutes | unk_Class_of_Phylum_Firmicutes__Uniq114339 | unk_Order_of_Phylum_Firmicutes__Uniq114339 | unk_Family_of_Phylum_Firmicutes__Uniq114339 | unk_Genus_of_Phylum_Firmicutes__Uniq114339 | unk_Species_of_Phylum_Firmicutes__Uniq114339 |

"unk" stands for "unknown". These labels are **unique** to that OTU, and are important for the proper aggregation of taxonomic information in the TaxUMAP algorithm.

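To make the label pattern concrete, the sketch below rebuilds it by hand. This is not TaxUMAP's own code: `sketch_fill` is a hypothetical helper, and the pattern `unk_<level>_of_<last resolved level>_<its name>__<OTU>` is inferred only from the example labels shown above. It also assumes the first column (Kingdom) is always resolved and that OTU labels are unique.

```python
import numpy as np
import pandas as pd


def sketch_fill(tax):
    """Hypothetical re-creation of the placeholder pattern (not the library's code)."""
    # Work on an object-dtype copy so strings can replace NaN in any column.
    filled = tax.astype(object)
    for otu, row in tax.iterrows():
        last_level, last_name = None, None
        for level in tax.columns:
            if pd.notna(row[level]):
                # Remember the most specific level resolved so far.
                last_level, last_name = level, row[level]
            else:
                # Build a placeholder that is unique to this OTU.
                filled.loc[otu, level] = (
                    f"unk_{level}_of_{last_level}_{last_name}__{otu}"
                )
    return filled


example = pd.DataFrame(
    {
        "Kingdom": ["Bacteria"],
        "Phylum": ["Firmicutes"],
        "Class": [np.nan],
        "Order": [np.nan],
        "Family": [np.nan],
        "Genus": [np.nan],
        "Species": [np.nan],
    },
    index=pd.Index(["Uniq114339"], name="OTU"),
)

sketched = sketch_fill(example)
print(sketched.loc["Uniq114339", "Class"])
```

Because the OTU label is appended to every placeholder, two OTUs that are both unresolved below the same phylum still receive distinct labels. In practice, use `fill_tax_table` itself rather than this sketch.
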
The cells below demonstrate `fill_tax_table` on an example dataset:

In [2]:
# Load in your uncleaned data
uncleaned_tax_table = pd.read_csv('example_data/uncleaned_olin_dataset.csv').set_index('OTU')
uncleaned_tax_table

OTU,Kingdom,Phylum,Class,Order,Family,Genus,Species
Uniq114339,Bacteria,Firmicutes,,,,,
Uniq53046,Bacteria,Firmicutes,Negativicutes,Selenomonadales,Veillonellaceae,Veillonella,Veillonella_atypica
Uniq5707,Bacteria,Firmicutes,Clostridia,Clostridiales,Lachnospiraceae,Lachnospiraceae_FCS020_group,
Uniq45364,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacteriales,Enterobacteriaceae,Escherichia-Shigella,Enterobacter
Uniq80019,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacteriales,Enterobacteriaceae,Escherichia-Shigella,Escherichia_coli
...,...,...,...,...,...,...,...
Uniq103183,Bacteria,Firmicutes,Bacilli,,,,
Uniq371,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Porphyromonadaceae,Parabacteroides,Parabacteroides_distasonis
Uniq75647,Bacteria,Proteobacteria,,,,,
Uniq12824,Bacteria,Firmicutes,Bacilli,Bacillales,Family_XI,Gemella,Gemella_haemolysans


The table above contains many missing (`NaN`) values that must be filled. Now, we use the `fill_tax_table` function:

In [3]:
cleaned_tax_table = fill_tax_table(uncleaned_tax_table)
cleaned_tax_table

OTU,Kingdom,Phylum,Class,Order,Family,Genus,Species
Uniq114339,Bacteria,Firmicutes,unk_Class_of_Phylum_Firmicutes__Uniq114339,unk_Order_of_Phylum_Firmicutes__Uniq114339,unk_Family_of_Phylum_Firmicutes__Uniq114339,unk_Genus_of_Phylum_Firmicutes__Uniq114339,unk_Species_of_Phylum_Firmicutes__Uniq114339
Uniq53046,Bacteria,Firmicutes,Negativicutes,Selenomonadales,Veillonellaceae,Veillonella,Veillonella_atypica
Uniq5707,Bacteria,Firmicutes,Clostridia,Clostridiales,Lachnospiraceae,Lachnospiraceae_FCS020_group,unk_Species_of_Genus_Lachnospiraceae_FCS020_gr...
Uniq45364,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacteriales,Enterobacteriaceae,Escherichia-Shigella,Enterobacter
Uniq80019,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacteriales,Enterobacteriaceae,Escherichia-Shigella,Escherichia_coli
...,...,...,...,...,...,...,...
Uniq103183,Bacteria,Firmicutes,Bacilli,unk_Order_of_Class_Bacilli__Uniq103183,unk_Family_of_Class_Bacilli__Uniq103183,unk_Genus_of_Class_Bacilli__Uniq103183,unk_Species_of_Class_Bacilli__Uniq103183
Uniq371,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Porphyromonadaceae,Parabacteroides,Parabacteroides_distasonis
Uniq75647,Bacteria,Proteobacteria,unk_Class_of_Phylum_Proteobacteria__Uniq75647,unk_Order_of_Phylum_Proteobacteria__Uniq75647,unk_Family_of_Phylum_Proteobacteria__Uniq75647,unk_Genus_of_Phylum_Proteobacteria__Uniq75647,unk_Species_of_Phylum_Proteobacteria__Uniq75647
Uniq12824,Bacteria,Firmicutes,Bacilli,Bacillales,Family_XI,Gemella,Gemella_haemolysans


The above taxonomy table is properly formatted for use with TaxUMAP:

1. Each taxonomic level is resolved.
2. The index is the OTU/ASV label.
3. Columns are listed in increasing specificity, e.g., Kingdom -> Species.
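
These three properties can be checked before running TaxUMAP. The sketch below is one possible check, not part of the TaxUMAP API: `check_tax_table` and `EXPECTED_LEVELS` are hypothetical names, the column names are taken from the example tables in this notebook, and the uniqueness test on the index is an added assumption about what a usable OTU/ASV label requires.

```python
import numpy as np
import pandas as pd

# Column order used by the example tables in this notebook.
EXPECTED_LEVELS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def check_tax_table(tax):
    """Return a list of formatting problems; an empty list means none were found."""
    problems = []
    # 1. Every taxonomic level must be resolved (no NaN left).
    if tax.isna().any().any():
        problems.append("unresolved taxonomic levels (NaN) remain")
    # 2. The index holds the OTU/ASV labels (assumed here to be unique).
    if not tax.index.is_unique:
        problems.append("OTU/ASV labels in the index are not unique")
    # 3. Columns run from least to most specific.
    if list(tax.columns) != EXPECTED_LEVELS:
        problems.append("columns are not ordered Kingdom -> Species")
    return problems


good = pd.DataFrame(
    [["Bacteria", "Firmicutes", "Negativicutes", "Selenomonadales",
      "Veillonellaceae", "Veillonella", "Veillonella_atypica"]],
    columns=EXPECTED_LEVELS,
    index=pd.Index(["Uniq53046"], name="OTU"),
)
bad = good.copy()
bad.loc["Uniq53046", "Species"] = np.nan

print(check_tax_table(good))
print(check_tax_table(bad))
```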


In [4]:
cleaned_tax_table.to_csv('example_data/taxonomy.csv')

# References

The Olin et al. dataset is used here to provide a practical example of using TaxUMAP. The original publication and dataset can be found below:

## Publication

> Olin A, Henckel E, Chen Y, et al. Stereotypic Immune System Development in Newborn Children. Cell. 2018;174(5):1277-1292.e14. doi:10.1016/j.cell.2018.06.045

## Dataset

> Olin, Axel (2018), “Stereotypic Immune System Development in Newborn Children”, Mendeley Data, v1
