# MGnify to Darwin Core export notes
## What do OTUs represent in the MGnify pipeline?

Get all SSU downloads for analysis <https://www.ebi.ac.uk/metagenomics/analyses/MGYA00166828> from the study [Amplicon sequencing of Tara Oceans DNA samples corresponding to size fractions for protists](https://www.ebi.ac.uk/metagenomics/studies/MGYS00002392#overview):

In [44]:
import pandas as pd
from mgnifyextract.analyses import get_analysis
from mgnifyextract.downloads import FastaDownload, MseqDownload, TsvDownload
from mgnifyextract.studies import get_superstudy_studies
from mgnifyextract.util import clean_taxonomy_string

analysis = get_analysis("MGYA00166828")
analysis

<Analysis https://www.ebi.ac.uk/metagenomics/analyses/MGYA00166828 >

In [45]:
downloads = analysis.get_downloads()

marker = "SSU"

fasta_files = [download for download in downloads if isinstance(download, FastaDownload) and download.marker == marker]
mseq_files = [download for download in downloads if isinstance(download, MseqDownload) and download.marker == marker]
tsv_files = [download for download in downloads if isinstance(download, TsvDownload) and download.marker == marker]

Let's take a look at the number of rows in the fasta, mseq, and OTU files.

In [46]:
fasta = fasta_files[0].read_pandas()
fasta

Unnamed: 0,reference,sequence
0,ERR1756078.100000-G3:64TC8AAXX:7:49:15015:1734...,TTGNACACACCGCCCGTCGCTACTACCGATTGAACGTTTTAGTGAG...
1,ERR1756078.100003-G3:64TC8AAXX:7:49:16463:1735...,GATTGTTTTATTTTTTGAAAATTTGCAAACTAGATTATCTAGAGGA...
2,ERR1756078.100006-G3:64TC8AAXX:7:49:10821:1736...,AGTGAACGTGTTAGTGAGGTCCTCGGACTGTGAGCCTGGCGGGTCA...
3,ERR1756078.100008-G3:64TC8AAXX:7:49:13456:1736...,ACTACCGATTGAACGTTTTAGTGAGGTATTTGGACTGGGCCTTGGG...
4,ERR1756078.100010-G3:64TC8AAXX:7:49:17365:1737...,TTGTACACACCGCCCGTCGCTACTACCGATTGAACGTTTTAGTGAG...
...,...,...
287467,ERR1756078.86631-G3:64TC8AAXX:7:48:11577:7507-...,TTGTACACACCGCCCGTCAAACCATCTTAGTTGTGGGTGGGTGAGG...
287468,ERR1756078.87133-G3:64TC8AAXX:7:48:14643:8632-...,TTGTACACACCGCCCGTCAAATCACCCGAGCAGGGTTTGGGTGAGG...
287469,ERR1756078.87133-G3:64TC8AAXX:7:48:14643:8632-...,TTGNACACACCGCCCGTCAAATCACCCGAGCAGGGTTTGGGTGAGG...
287470,ERR1756078.9253-G3:64TC8AAXX:7:15:13620:20207-...,TTGTACACACCGCCCGTCAAACCACTCGAGCAGGGTTTAGATGAGT...


In [47]:
fasta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287472 entries, 0 to 287471
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   reference  287472 non-null  object
 1   sequence   287472 non-null  object
dtypes: object(2)
memory usage: 4.4+ MB


In [48]:
mseq = mseq_files[0].read()
mseq

Unnamed: 0,#query,dbhit,bitscore,identity,matches,mismatches,gaps,query_start,query_end,dbhit_start,dbhit_end,strand,Unnamed: 12,SILVA,Unnamed: 14
0,ERR1756078.100000-G3:64TC8AAXX:7:49:15015:1734...,KJ761866.1.1804,158,0.981707,161,3,0,0,164,1634,1798,+,,sk__Eukaryota;k__Metazoa;p__Arthropoda;c__Maxi...,
1,ERR1756078.100003-G3:64TC8AAXX:7:49:16463:1735...,AB613239.1.1794,83,0.966292,86,3,0,0,89,1698,1787,+,,sk__Eukaryota;k__;p__;c__Polycystinea;o__Collo...,
2,ERR1756078.100006-G3:64TC8AAXX:7:49:10821:1736...,KJ759511.1.1801,129,0.977778,132,3,0,2,137,1661,1796,+,,sk__Eukaryota;k__Metazoa;p__Arthropoda;c__Maxi...,
3,ERR1756078.100008-G3:64TC8AAXX:7:49:13456:1736...,FLMP01002635.3161.4958,137,0.972414,141,4,0,0,145,1653,1798,+,,sk__Eukaryota;k__Metazoa;p__Arthropoda;c__Maxi...,
4,ERR1756078.100010-G3:64TC8AAXX:7:49:17365:1737...,KJ760557.1.1801,162,0.987952,164,2,0,0,166,1629,1795,+,,sk__Eukaryota;k__Metazoa;p__Arthropoda;c__Maxi...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
287462,ERR1756078.86631-G3:64TC8AAXX:7:48:11577:7507-...,KX427585.1.1451,114,0.983051,116,2,0,0,118,1327,1445,+,,sk__Archaea;k__;p__Euryarchaeota;c__;o__;f__;g...,
287463,ERR1756078.87133-G3:64TC8AAXX:7:48:14643:8632-...,CEUC01139520.1.1105,115,0.960000,120,5,0,0,125,976,1101,+,,sk__Archaea;k__;p__;c__;o__;f__;g__;s__,
287464,ERR1756078.87133-G3:64TC8AAXX:7:48:14643:8632-...,CEUC01139520.1.1105,122,0.984127,124,2,0,0,126,976,1102,+,,sk__Archaea;k__;p__;c__;o__;f__;g__;s__,
287465,ERR1756078.9253-G3:64TC8AAXX:7:15:13620:20207-...,KP308742.40.1587,92,0.881890,112,14,1,0,127,1408,1534,+,,sk__Archaea;k__;p__;c__;o__,


In [49]:
mseq.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287467 entries, 0 to 287466
Data columns (total 15 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   #query       287467 non-null  object 
 1   dbhit        287467 non-null  object 
 2   bitscore     287467 non-null  int64  
 3   identity     287467 non-null  float64
 4   matches      287467 non-null  int64  
 5   mismatches   287467 non-null  int64  
 6   gaps         287467 non-null  int64  
 7   query_start  287467 non-null  int64  
 8   query_end    287467 non-null  int64  
 9   dbhit_start  287467 non-null  int64  
 10  dbhit_end    287467 non-null  int64  
 11  strand       287467 non-null  object 
 12  Unnamed: 12  0 non-null       float64
 13  SILVA        287467 non-null  object 
 14  Unnamed: 14  0 non-null       float64
dtypes: float64(3), int64(8), object(4)
memory usage: 32.9+ MB


In [50]:
otu = tsv_files[0].read()
otu.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 701 entries, 0 to 700
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   # OTU ID    701 non-null    int64  
 1   ERR1756078  701 non-null    float64
 2   taxonomy    701 non-null    object 
dtypes: float64(1), int64(1), object(1)
memory usage: 16.6+ KB


So we go from 287,472 reads in the fasta file to 287,467 hits in the mseq file (5 fewer) to just 701 OTUs.

The mseq file and the OTU table contain the same number of distinct taxonomy values. This suggests that the OTUs correspond to the distinct taxonomic assignments, regardless of sequence similarity. For example, all reads assigned only to Bacteria are collapsed into a single OTU.

In [51]:
pd.Series([clean_taxonomy_string(tax) for tax in otu["taxonomy"]]).nunique()

701

In [52]:
pd.Series([clean_taxonomy_string(tax) for tax in mseq["SILVA"]]).nunique()

701
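
Equal counts are suggestive but do not prove that both tables hold the same taxonomy values. A stricter check compares the two sets directly. The sketch below uses made-up values; `otu_taxa` and `mseq_taxa` are hypothetical stand-ins for the cleaned taxonomy strings of the two tables.

```python
import pandas as pd

# Toy stand-ins for the cleaned taxonomy values of the two tables (made-up values).
otu_taxa = pd.Series(["sk__Bacteria", "sk__Archaea", "sk__Eukaryota;k__Metazoa"])
mseq_taxa = pd.Series(
    ["sk__Bacteria", "sk__Bacteria", "sk__Archaea", "sk__Eukaryota;k__Metazoa"]
)

# Values present in one table but missing from the other.
only_in_otu = set(otu_taxa) - set(mseq_taxa)
only_in_mseq = set(mseq_taxa) - set(otu_taxa)

# True only when both tables contain exactly the same taxonomy values.
same_taxa = not only_in_otu and not only_in_mseq
```

If both difference sets are empty on the real data, the OTU table and the mseq hits cover exactly the same taxonomic assignments.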

In [53]:
mseq_bacteria = mseq.loc[mseq["SILVA"].apply(clean_taxonomy_string) == "sk__Bacteria"]
mseq_bacteria

Unnamed: 0,#query,dbhit,bitscore,identity,matches,mismatches,gaps,query_start,query_end,dbhit_start,dbhit_end,strand,Unnamed: 12,SILVA,Unnamed: 14
279909,ERR1756078.109161-G3:64TC8AAXX:7:53:5680:3360-...,FJ153005.1.1495,127,0.963504,132,5,0,0,137,1342,1479,+,,sk__Bacteria;k__,
279913,ERR1756078.109218-G3:64TC8AAXX:7:53:8283:3436-...,CEVJ01037068.1.1505,125,1.000000,125,0,0,0,125,1380,1505,+,,sk__Bacteria;k__;p__;c__;o__;f__;g__;s__,
279936,ERR1756078.110215-G3:64TC8AAXX:7:53:1765:4617-...,CEVJ01037068.1.1505,125,1.000000,125,0,0,0,125,1380,1505,+,,sk__Bacteria;k__;p__;c__;o__;f__;g__;s__,
279966,ERR1756078.111304-G3:64TC8AAXX:7:53:2098:5791-...,CEVJ01037068.1.1505,125,1.000000,125,0,0,0,125,1380,1505,+,,sk__Bacteria;k__;p__;c__;o__;f__;g__;s__,
279971,ERR1756078.111413-G3:64TC8AAXX:7:53:12359:5899...,APPQ01000025.545475.547000,133,0.992593,134,1,0,0,135,1378,1513,+,,sk__Bacteria;k__,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
287062,ERR1756078.89898-G3:64TC8AAXX:7:48:13265:14480...,FJ152961.1.1495,125,0.956204,131,6,0,0,137,1342,1479,+,,sk__Bacteria;k__,
287082,ERR1756078.91392-G3:64TC8AAXX:7:48:18741:18679...,CENY01011605.269.1768,110,0.950820,116,6,0,0,122,1360,1482,+,,sk__Bacteria;k__;p__;c__;o__;f__;g__,
287097,ERR1756078.94312-G3:64TC8AAXX:7:49:10248:5814-...,JQ426190.1.1299,81,0.787234,111,30,0,0,141,1146,1287,+,,sk__Bacteria;k__,
287104,ERR1756078.95119-G3:64TC8AAXX:7:49:2482:7546-1...,KC541057.1.1554,89,0.824818,113,24,0,0,137,1399,1536,+,,sk__Bacteria;k__,


In [54]:
otu_bacteria = otu.loc[otu["taxonomy"].apply(clean_taxonomy_string) == "sk__Bacteria"]
otu_bacteria

Unnamed: 0,# OTU ID,ERR1756078,taxonomy
11,103181,300.0,sk__Bacteria


Grouping by taxonomy appears to be the approach currently used in [mgnify-to-dwc](https://github.com/gbif/mgnify-to-dwc). See the occurrences for this sample here: <https://www.gbif.org/occurrence/search?q=ssu&dataset_key=3de01c29-b3d4-4eeb-bd5c-b2bb475566cc&event_id=MGYA00463299>
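
If OTUs are just distinct taxonomic assignments, an OTU-like table can be rebuilt by counting hits per cleaned taxonomy string. The following is a minimal sketch on made-up rows, not the MGnify or mgnify-to-dwc implementation. The `toy_hits` frame and its `taxonomy` column are assumptions for illustration, and whether the resulting counts match the abundances in the MGnify OTU table is not verified here.

```python
import pandas as pd

# Toy per-read hits with already cleaned taxonomy strings (made-up rows).
toy_hits = pd.DataFrame({
    "#query": ["read1", "read2", "read3", "read4"],
    "taxonomy": [
        "sk__Bacteria",
        "sk__Bacteria",
        "sk__Archaea;p__Euryarchaeota",
        "sk__Bacteria",
    ],
})

# One row per distinct taxonomy string, with the number of reads assigned to it.
toy_counts = toy_hits.groupby("taxonomy").size().rename("reads").reset_index()
```

Here the three Bacteria reads collapse into a single row, which mirrors the single `sk__Bacteria` OTU seen above.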

## Alternative approach

Instead of relying on the OTU table as downloaded from MGnify, let's try grouping sequences by SILVA hit.

Merge fasta and mseq tables:

In [None]:
df = fasta.merge(mseq.rename({"#query": "reference"}, axis=1), how="left", on="reference")
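
A left merge keeps reads without a hit and silently multiplies rows if the key is duplicated on either side. The sketch below, on made-up data, shows how the `indicator` and `validate` options of `DataFrame.merge` can surface both cases. The `toy_fasta` and `toy_mseq` frames are hypothetical, and it is an assumption that `reference` is unique in the real tables.

```python
import pandas as pd

# Toy stand-ins for the fasta and mseq tables (made-up rows).
toy_fasta = pd.DataFrame({
    "reference": ["read1", "read2", "read3"],
    "sequence": ["ACGT", "ACGA", "TTGA"],
})
toy_mseq = pd.DataFrame({
    "#query": ["read1", "read3"],
    "dbhit": ["hitA", "hitB"],
})

# validate="one_to_one" raises a MergeError if either key column has duplicates;
# indicator=True adds a "_merge" column telling where each row came from.
merged = toy_fasta.merge(
    toy_mseq.rename(columns={"#query": "reference"}),
    how="left",
    on="reference",
    indicator=True,
    validate="one_to_one",
)

# Reads present in the fasta table but absent from the hit table.
unmatched = merged.loc[merged["_merge"] == "left_only", "reference"].tolist()
```

On the real data, the length of `unmatched` would show how many reads lack a hit.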

Next, clean the taxonomy strings by removing empty ranks:

In [None]:
df["SILVA"] = [clean_taxonomy_string(tax) for tax in df["SILVA"]]
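
For readers without `mgnifyextract` at hand, the cleaning step can be approximated as follows. `strip_empty_ranks` is a hypothetical stand-in, not the library's `clean_taxonomy_string`. It assumes that every rank with an empty name is dropped, including ranks in the middle of the string, and that missing values (reads without a hit after the left merge) are passed through unchanged. The real function may behave differently.

```python
import pandas as pd

def strip_empty_ranks(taxonomy):
    """Hypothetical stand-in for clean_taxonomy_string (real behavior may differ)."""
    if pd.isna(taxonomy):
        # Reads without a hit have no taxonomy after a left merge.
        return taxonomy
    # Keep only ranks that have a name after the "prefix__" part.
    kept = [rank for rank in taxonomy.split(";") if rank.split("__", 1)[-1]]
    return ";".join(kept)

cleaned = [
    strip_empty_ranks("sk__Bacteria;k__;p__;c__;o__;f__;g__;s__"),
    strip_empty_ranks("sk__Archaea;k__;p__Euryarchaeota;c__"),
]
```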

Let's take a look at the number of distinct sequences, taxonomy strings, and DB hits:

In [None]:
df["sequence"].nunique()

In [None]:
df["SILVA"].nunique()

In [None]:
df["dbhit"].nunique()

So an alternative approach would be to group sequences by DB hit and pick a random representative sequence for each group.
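
A minimal sketch of that idea on made-up data is shown below. The `toy` frame is hypothetical, dropping reads without a hit is an assumption about how they should be handled, and `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Toy merged table: one row per read, with its sequence and DB hit (made-up rows).
toy = pd.DataFrame({
    "reference": ["read1", "read2", "read3", "read4"],
    "sequence": ["ACGT", "ACGA", "TTGA", "TTGC"],
    "dbhit": ["hitA", "hitA", "hitB", None],
})

representatives = (
    toy.dropna(subset=["dbhit"])      # reads without a hit cannot be grouped
    .groupby("dbhit")
    .sample(n=1, random_state=0)      # fixed seed keeps the random pick reproducible
    .reset_index(drop=True)
)

# Number of reads behind each representative sequence.
counts = toy.groupby("dbhit").size()
```

Each row of `representatives` could then become one occurrence, with `counts` supplying the read count. A less arbitrary choice, such as the most frequent sequence per group, may be preferable to a random pick.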