# Taxonomy label generation from GTDB-Tk

Last updated: June 15 2023

GTDB-Tk is a tool that classifies bacterial and archaeal genomes using the **Genome Taxonomy Database** (GTDB) nomenclature. Instructions for running GTDB-Tk are available at https://github.com/Ecogenomics/GTDBTk

Unfortunately, the full classification string is long and impractical for labelling a long list of genomes (often the case with metagenome-assembled genomes). This notebook parses the GTDB-Tk summary files and condenses each taxonomy string into a short label suitable for trees, plots, etc.



### Some examples:

d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Xanthomonadales;f__Rhodanobacteraceae;g__Metallibacterium;s__Metallibacterium scheffleri --> **Metallibacterium scheffleri**

d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Acidithiobacillales;f__Acidithiobacillaceae;g__Acidithiobacillus;s__ --> **Acidithiobacillus sp.**

d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__UBA2770;f__;g__;s__ --> **c. Gammaproteobacteria (o. UBA2770)**



---

In [1]:
import pandas as pd
import numpy as np

Importing raw outputs from GTDB-Tk: **'gtdbtk.ar53.summary.tsv'** (archaea), and **'gtdbtk.bac120.summary.tsv'** (bacteria)

In [2]:
ar53_df = pd.read_table('gtdbtk.ar53.summary.tsv', sep='\t')
bac120_df = pd.read_table('gtdbtk.bac120.summary.tsv', sep='\t')

#filtering only bin ID and taxonomy columns
ar53_df = ar53_df[['user_genome', 'classification']]
bac120_df = bac120_df[['user_genome', 'classification']]

#concatenating dataframes
gtdb_all = pd.concat([ar53_df, bac120_df], axis=0)
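
The cell above expects both summary files to be present. A run that contains no genomes from one domain may not produce that domain's file (an assumption worth checking against your GTDB-Tk version), in which case `pd.read_table` raises an error. A minimal sketch of a more tolerant loader follows; `load_gtdbtk_summaries` is a hypothetical helper name, not part of GTDB-Tk or pandas.

```python
from pathlib import Path

import pandas as pd


def load_gtdbtk_summaries(paths):
    """Read whichever summary files exist and stack them into one dataframe.

    Hypothetical helper: only the 'user_genome' and 'classification'
    columns are kept, as in the cell above.
    """
    frames = []
    for path in paths:
        if Path(path).exists():
            df = pd.read_csv(path, sep='\t')
            frames.append(df[['user_genome', 'classification']])
    if not frames:
        raise FileNotFoundError('no GTDB-Tk summary files found')
    # ignore_index gives the combined dataframe a clean 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

It could be called as `gtdb_all = load_gtdbtk_summaries(['gtdbtk.ar53.summary.tsv', 'gtdbtk.bac120.summary.tsv'])`.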

Extracting classification data into separate columns and removing the prefixes:

In [3]:
taxonomy_df = gtdb_all.copy()

taxonomy_df[['domain','phylum', 'class', 'order', 'family', 'genus', 'species']] = \
taxonomy_df['classification'].str.split(';',expand=True)

#removes the first 3 characters from the newly-generated columns:
for col in taxonomy_df.columns[2:]:
    taxonomy_df[col] = taxonomy_df[col].str[3:]

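#GTDB appends suffixes to some names (e.g. 'Leptospirillum_A'); replacing the underscore
#with a space keeps these names from being discarded later as non-informative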
taxonomy_df['genus'] = taxonomy_df['genus'].str.replace('_', ' ')
taxonomy_df['species'] = taxonomy_df['species'].str.replace('_', ' ')
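
To make the column-wise operations above concrete, here is the same transformation applied to a single classification string in plain Python. This is only an illustrative sketch: `RANKS` and `split_classification` are hypothetical names, and the string is assumed to contain all seven `;`-separated fields.

```python
RANKS = ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']


def split_classification(classification):
    """Map each rank name to its value with the 'x__' prefix removed."""
    parts = classification.split(';')
    # each part looks like 'g__Acidithiobacillus'; an unassigned rank is just 's__'
    ranks = {rank: part[3:] for rank, part in zip(RANKS, parts)}
    # underscores are replaced only in genus and species, as in the cell above
    for rank in ('genus', 'species'):
        ranks[rank] = ranks[rank].replace('_', ' ')
    return ranks


example = ('d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;'
           'o__Acidithiobacillales;f__Acidithiobacillaceae;g__Acidithiobacillus;s__')
ranks = split_classification(example)
print(ranks['genus'])          # Acidithiobacillus
print(repr(ranks['species']))  # '' (no species assigned)
```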

The following function generates a condensed taxonomy label from the classification columns. It returns the species name if one was assigned, or the genus followed by 'sp.' if only the genus is known. Otherwise, it takes the lowest rank that was assigned a value and labels it with the matching prefix (f., o., c., or p.).

In [4]:
def make_taxon_label(df):
    
    #sequential conditional values starting from species, check if value exists
    cond_list = [
        df['species'] != '',
        df['genus'] != '',
        df['family'] != '',
        df['order'] != '',
        df['class'] != '',
        df['phylum'] != ''
    ]

    #outputs the genus + species, OR the lowest known taxonomy level
    choice_list = [
        df['species'],
        df['genus'] + ' sp.',
        'f. ' + df['family'],
        'o. ' + df['order'],
        'c. ' + df['class'],
        'p. ' + df['phylum'],
    ]

    #selects the first matching output; default='' avoids np.select's fallback value of 0
    return np.select(cond_list, choice_list, default='')
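
The function relies on `np.select` checking its conditions in order and taking the choice that belongs to the first condition that is true. The toy example below illustrates that behaviour with only two ranks; the dataframe `toy` is made up for the illustration, and `default=''` is a choice made here so that rows with no match get an empty string (the function's own default is 0).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'genus':  ['Acidithiobacillus', '', ''],
    'family': ['Acidithiobacillaceae', 'Rhodanobacteraceae', ''],
})

# conditions are evaluated in order, lowest rank first
conditions = [toy['genus'] != '', toy['family'] != '']
choices = [toy['genus'] + ' sp.', 'f. ' + toy['family']]

# row 0 matches both conditions, so the first one (genus) wins;
# row 1 matches only the family condition; row 2 matches neither
labels = np.select(conditions, choices, default='')
print(list(labels))
# ['Acidithiobacillus sp.', 'f. Rhodanobacteraceae', '']
```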

The next function filters out 'non-informative' placeholder names (e.g., 'UBA184', 'GCA-000496135'), so that the resulting taxonomy label describes the organism's phylogeny at a glance.

In [5]:
#returns an empty string if the name meets any of the following conditions:
## contains digits or special characters (-, _, etc.)
## is written entirely in uppercase letters
def delete_labels(s):
    has_symbol = any(not c.isalnum() for c in s.replace(' ', '')) #spaces are allowed
    has_digit = any(c.isdigit() for c in s)
    if has_symbol or has_digit or s.isupper():
        return ''
    return s

Applying functions to the dataframe to generate labels:

In [6]:
taxon_labels = taxonomy_df.copy()

#initial label that keeps all levels of classification
taxon_labels['lab'] = make_taxon_label(taxon_labels)

#new taxonomy df that filters out the non-informative labels
taxonomy_filtered = taxonomy_df.copy()
for col in taxonomy_filtered.columns[2:]:
    taxonomy_filtered[col] = taxonomy_filtered[col].apply(delete_labels)

#new set of labels, based on filtered dataset    
taxon_labels['filtered_label'] = make_taxon_label(taxonomy_filtered)

#generating the final label by comparing the two label sets:
#if they match, use original, otherwise concatenate them
taxon_labels['label'] = np.where(  
    taxon_labels['lab'] == taxon_labels['filtered_label'], taxon_labels['lab'],
    taxon_labels['filtered_label'].astype(str) + ' (' + taxon_labels['lab'].astype(str) + ')')

#adding an additional column which adds the genome ID to the label
taxon_labels['bin_w_label'] = taxon_labels['user_genome'] + '; '+ taxon_labels['label']

#filtering and sorting
taxon_labels = taxon_labels[['user_genome', 'classification', 'label', 'bin_w_label']]
taxon_labels = taxon_labels.sort_values('user_genome', axis=0)

#'user_genome' is already a column, so the row index is not written
taxon_labels.to_csv('gtdb_taxon_labels.csv', index=False)
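
As a sanity check, the whole procedure can be run on the three example strings from the introduction. The sketch below restates the two functions compactly so that it runs on its own; `RANKS` and `demo` are names introduced for the illustration, and `default=''` in `np.select` is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

RANKS = ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']


def make_taxon_label(df):
    # lowest rank first: the first non-empty rank decides the label
    order = ['species', 'genus', 'family', 'order', 'class', 'phylum']
    conds = [df[rank] != '' for rank in order]
    choices = [df['species'], df['genus'] + ' sp.', 'f. ' + df['family'],
               'o. ' + df['order'], 'c. ' + df['class'], 'p. ' + df['phylum']]
    return np.select(conds, choices, default='')


def delete_labels(s):
    has_symbol = any(not c.isalnum() for c in s.replace(' ', ''))
    has_digit = any(c.isdigit() for c in s)
    return '' if (has_symbol or has_digit or s.isupper()) else s


demo = pd.DataFrame({'classification': [
    'd__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Xanthomonadales;'
    'f__Rhodanobacteraceae;g__Metallibacterium;s__Metallibacterium scheffleri',
    'd__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Acidithiobacillales;'
    'f__Acidithiobacillaceae;g__Acidithiobacillus;s__',
    'd__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__UBA2770;f__;g__;s__',
]})

# split into one column per rank and strip the 'x__' prefixes
demo[RANKS] = demo['classification'].str.split(';', expand=True)
for col in RANKS:
    demo[col] = demo[col].str[3:]

# same dataframe with placeholder names blanked out
filtered = demo.copy()
for col in RANKS:
    filtered[col] = filtered[col].apply(delete_labels)

raw = pd.Series(make_taxon_label(demo), index=demo.index)
clean = pd.Series(make_taxon_label(filtered), index=demo.index)
demo['label'] = np.where(raw == clean, raw, clean + ' (' + raw + ')')

print(demo['label'].tolist())
# ['Metallibacterium scheffleri', 'Acidithiobacillus sp.',
#  'c. Gammaproteobacteria (o. UBA2770)']
```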
