# Propagation of MGnify taxonomic annotations to GBIF

The goal of this notebook is to quality-check taxonomic annotations from MGnify and to see how they propagate to GBIF. Problems with these annotations have been reported in [this blog post](https://iphylo.blogspot.com/2019/12/gbif-metagenomics-and-metacrap.html), and a [filter is being applied](https://github.com/gbif/mgnify-to-dwc/commit/f15abb91cf7ae35fb430efbc8a22784102fcbb73) at GBIF to exclude plants and metazoans.

## Utilities

In [53]:
import pandas as pd

from mgnifyextract.analyses import get_analysis
from mgnifyextract.downloads import MseqDownload
from mgnifyextract.dwc import split_taxonomy_column
from mgnifyextract.util import clean_taxonomy_string

# Marker gene whose mseq file is read for each analysis.
marker = "SSU"

def create_tax_table(mseq: pd.DataFrame) -> pd.DataFrame:
    """Clean the taxonomy string, group by taxonomy, and calculate the maximum identity."""
    # Work on a copy so the caller's DataFrame is not modified.
    mseq = mseq.copy()
    mseq["SILVA"] = [clean_taxonomy_string(tax) for tax in mseq["SILVA"]]
    # One row per distinct lineage with its best identity, lowest identity first.
    taxa = mseq.groupby(["SILVA"])["identity"].max().sort_values(ascending=True).to_frame()
    taxa["SILVA"] = taxa.index
    taxa = taxa.reset_index(drop=True)
    # Expand each lineage string into one column per rank.
    taxa = taxa.join(pd.DataFrame(taxa["SILVA"].apply(split_taxonomy_column).values.tolist()))
    return taxa

def get_mseq(accession: str) -> pd.DataFrame:
    """Read the mseq file for the configured marker from the downloads of an analysis."""
    analysis = get_analysis(accession)
    downloads = analysis.get_downloads()
    mseq_files = [download for download in downloads if isinstance(download, MseqDownload) and download.marker == marker]
    if not mseq_files:
        raise ValueError(f"No {marker} mseq download found for {accession}")
    return mseq_files[0].read()
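
For readers without `mgnifyextract` installed, the lineage handling can be illustrated with a minimal sketch. The function `split_lineage` and the `RANK_PREFIXES` mapping are hypothetical stand-ins, not the library's `split_taxonomy_column`. The prefix-to-rank mapping is inferred from the lineage strings and output columns in this notebook, and using the most specific rank as `scientificName` is likewise an assumption based on those tables.

```python
# Hypothetical stand-in for illustration; not the mgnifyextract implementation.
RANK_PREFIXES = {
    "sk": "superkingdom",
    "k": "kingdom",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}

def split_lineage(lineage: str) -> dict:
    """Split a lineage such as 'sk__Bacteria;p__Planctomycetes' into a rank -> name dict."""
    ranks = {}
    for part in lineage.split(";"):
        prefix, sep, name = part.partition("__")
        # Skip malformed parts, unknown prefixes, and empty names such as 'g__'.
        if not sep or not name or prefix not in RANK_PREFIXES:
            continue
        ranks[RANK_PREFIXES[prefix]] = name
    if ranks:
        # Assumption: the most specific rank present doubles as the scientific name.
        ranks["scientificName"] = list(ranks.values())[-1]
    return ranks

example = split_lineage("sk__Bacteria;p__Planctomycetes;c__Phycisphaerae")
```

Returning a dict per lineage means `pd.DataFrame` can build one column per rank and leave missing ranks empty, which matches the sparse layout of the tables in this notebook.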

## MGYA00167469

This analysis is from [Amplicon sequencing of Tara Oceans DNA samples corresponding to size fractions for protists](https://www.gbif.org/dataset/d596fccb-2319-42eb-b13b-986c932780ad). See occurrences [here](https://www.gbif.org/occurrence/search?dataset_key=d596fccb-2319-42eb-b13b-986c932780ad&advanced=1&event_id=MGYA00167469).

In [61]:
mseq = get_mseq("MGYA00167469")
taxa = create_tax_table(mseq)
taxa

Unnamed: 0,identity,SILVA,superkingdom,class,order,scientificName,kingdom,phylum,family,genus,species
0,0.855422,sk__Eukaryota;c__Oligohymenophorea;o__Pleurone...,Eukaryota,Oligohymenophorea,Pleuronematida,Pleuronematida,,,,,
1,0.857143,sk__Eukaryota;k__Viridiplantae;p__Streptophyta...,Eukaryota,,Malpighiales,Malpighiales,Viridiplantae,Streptophyta,,,
2,0.857143,sk__Eukaryota;c__Plagiopylea;o__Plagiopylida,Eukaryota,Plagiopylea,Plagiopylida,Plagiopylida,,,,,
3,0.857143,sk__Eukaryota;k__Metazoa;p__Arthropoda;c__Inse...,Eukaryota,Insecta,Hemiptera,Hemiptera,Metazoa,Arthropoda,,,
4,0.857143,sk__Eukaryota;k__Metazoa;p__Cnidaria;c__Hydroz...,Eukaryota,Hydrozoa,Anthoathecata,Anthoathecata,Metazoa,Cnidaria,,,
5,0.857143,sk__Bacteria;p__Planctomycetes;c__Phycisphaerae,Bacteria,Phycisphaerae,,Phycisphaerae,,Planctomycetes,,,
6,0.858065,sk__Eukaryota;c__Nassophorea;o__Microthoracida,Eukaryota,Nassophorea,Microthoracida,Microthoracida,,,,,
7,0.858824,sk__Eukaryota;k__Metazoa;p__Cnidaria;c__Stauro...,Eukaryota,Staurozoa,Stauromedusae,Stauromedusae,Metazoa,Cnidaria,,,
8,0.858896,sk__Eukaryota;c__Prostomatea;o__Prorodontida,Eukaryota,Prostomatea,Prorodontida,Prorodontida,,,,,
9,0.858896,sk__Eukaryota;p__Bacillariophyta;c__Fragilario...,Eukaryota,Fragilariophyceae,Striatellales,Striatellales,,Bacillariophyta,,,


In [62]:
taxa[taxa.apply(lambda row: row.astype(str).str.contains("Rafflesia", case=False).any(), axis=1)]

Unnamed: 0,identity,SILVA,superkingdom,class,order,scientificName,kingdom,phylum,family,genus,species
349,0.963415,sk__Eukaryota;k__Viridiplantae;p__Streptophyta...,Eukaryota,,Malpighiales,Rafflesia_cantleyi,Viridiplantae,Streptophyta,Rafflesiaceae,Rafflesia,Rafflesia_cantleyi
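
The row-wise search used above can be wrapped in a small helper so that each name needs only one call. This is a minimal sketch assuming only pandas: `find_taxon` is a hypothetical name, and the toy DataFrame stands in for the real `taxa` table.

```python
import pandas as pd

def find_taxon(taxa: pd.DataFrame, name: str) -> pd.DataFrame:
    """Return the rows in which any column contains `name`, ignoring case."""
    as_text = taxa.astype(str)
    # regex=False treats the name literally, so characters like '(' are safe.
    mask = as_text.apply(
        lambda col: col.str.contains(name, case=False, regex=False)
    ).any(axis=1)
    return taxa[mask]

# Toy data standing in for the real `taxa` table.
toy = pd.DataFrame({
    "identity": [0.963415, 0.857143],
    "genus": ["Rafflesia", None],
    "order": ["Malpighiales", "Plagiopylida"],
})
hits = find_taxon(toy, "rafflesia")
```

Searching column by column rather than row by row gives the same result and avoids converting every row to strings separately.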


In [64]:
taxa[taxa.apply(lambda row: row.astype(str).str.contains("Hordeum", case=False).any(), axis=1)]

Unnamed: 0,identity,SILVA,superkingdom,class,order,scientificName,kingdom,phylum,family,genus,species
946,1.0,sk__Eukaryota;k__Viridiplantae;p__Streptophyta...,Eukaryota,Liliopsida,Poales,Hordeum_vulgare,Viridiplantae,Streptophyta,Poaceae,Hordeum,Hordeum_vulgare
947,1.0,sk__Eukaryota;k__Viridiplantae;p__Streptophyta...,Eukaryota,Liliopsida,Poales,Hordeum,Viridiplantae,Streptophyta,Poaceae,Hordeum,


In [65]:
taxa[taxa.apply(lambda row: row.astype(str).str.contains("Vigna", case=False).any(), axis=1)]

Unnamed: 0,identity,SILVA,superkingdom,class,order,scientificName,kingdom,phylum,family,genus,species
221,0.942308,sk__Eukaryota;k__Viridiplantae;p__Streptophyta...,Eukaryota,,Fabales,Vigna,Viridiplantae,Streptophyta,Fabaceae,Vigna,
699,1.0,sk__Eukaryota;k__Viridiplantae;p__Streptophyta...,Eukaryota,,Fabales,Vigna_angularis,Viridiplantae,Streptophyta,Fabaceae,Vigna,Vigna_angularis


## MGYA00156024

This analysis is from [Bacterial 16s Amplicon Sequencing of the Atlantic Ocean](https://www.gbif.org/dataset/06fc84c8-f8e2-4ae9-a1ff-2020ad3bae29).

In [69]:
mseq = get_mseq("MGYA00156024")
taxa = create_tax_table(mseq)
taxa

Unnamed: 0,identity,SILVA,superkingdom,kingdom,scientificName,phylum,class,order,family,genus,species
0,0.780952,sk__Eukaryota;k__Viridiplantae,Eukaryota,Viridiplantae,Viridiplantae,,,,,,
1,0.842657,sk__Eukaryota;k__Fungi,Eukaryota,Fungi,Fungi,,,,,,
2,0.845714,sk__Eukaryota;k__Metazoa,Eukaryota,Metazoa,Metazoa,,,,,,
3,0.855932,sk__Bacteria;p__Nitrospinae;c__Nitrospinia,Bacteria,,Nitrospinia,Nitrospinae,Nitrospinia,,,,
4,0.857143,sk__Eukaryota;k__Fungi;p__Basidiomycota,Eukaryota,Fungi,Basidiomycota,Basidiomycota,,,,,
5,0.857924,sk__Bacteria;p__Calditrichaeota;c__Calditricha...,Bacteria,,Calditrichales,Calditrichaeota,Calditrichae,Calditrichales,,,
6,0.858025,sk__Eukaryota;p__Apicomplexa;c__Aconoidasida,Eukaryota,,Aconoidasida,Apicomplexa,Aconoidasida,,,,
7,0.858447,sk__Bacteria;p__Actinobacteria;c__Nitrilirupto...,Bacteria,,Euzebyales,Actinobacteria,Nitriliruptoria,Euzebyales,,,
8,0.860577,sk__Eukaryota;c__Spirotrichea,Eukaryota,,Spirotrichea,,Spirotrichea,,,,
9,0.862069,sk__Bacteria;p__Proteobacteria;c__Deltaproteob...,Bacteria,,Syntrophobacterales,Proteobacteria,Deltaproteobacteria,Syntrophobacterales,,,


In [70]:
taxa[taxa.apply(lambda row: row.astype(str).str.contains("Solanum", case=False).any(), axis=1)]

Unnamed: 0,identity,SILVA,superkingdom,kingdom,scientificName,phylum,class,order,family,genus,species
142,0.957143,sk__Mitochondria;s__Solanum_melongena_(eggplant),Mitochondria,,Solanum_melongena_(eggplant),,,,,,Solanum_melongena_(eggplant)
