# MGnify to Darwin Core export notes
## What do OTUs represent in the MGnify pipeline?

Get all SSU downloads for analysis <https://www.ebi.ac.uk/metagenomics/analyses/MGYA00166828> from the study [Amplicon sequencing of Tara Oceans DNA samples corresponding to size fractions for protists](https://www.ebi.ac.uk/metagenomics/studies/MGYS00002392#overview). Alternatively, for pipeline version 5.0, try analysis `MGYA00591034`.

In [71]:
import pandas as pd
from mgnifyextract.analyses import get_analysis
from mgnifyextract.downloads import FastaDownload, MseqDownload, TsvDownload
from mgnifyextract.util import clean_taxonomy_string

analysis = get_analysis("MGYA00166828")
analysis

<Analysis https://www.ebi.ac.uk/metagenomics/analyses/MGYA00591034 >

In [72]:
downloads = analysis.get_downloads()

marker = "SSU"

fasta_files = [download for download in downloads if isinstance(download, FastaDownload) and download.marker == marker]
mseq_files = [download for download in downloads if isinstance(download, MseqDownload) and download.marker == marker]
tsv_files = [download for download in downloads if isinstance(download, TsvDownload) and download.marker == marker]

Let's take a look at the number of rows in the fasta, mseq, and OTU files.

In [73]:
fasta = fasta_files[0].read_pandas()
fasta

Unnamed: 0,reference,sequence
0,SRR2071335.1-G5K3KBG01BR0D3-2-SSU_rRNA_archaea...,GGGCCACACGCGGGCTGCAATGGTAGCGACAATTGGTTTCGAATCC...
1,SRR2071335.10-G5K3KBG01C064E-2-SSU_rRNA_eukary...,AAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACA...
2,SRR2071335.100-G5K3KBG01AKUKN-2-SSU_rRNA_eukar...,AAACTTAAAGGAATTGGCGGAAGGGCACCACCAGGAGTGGAGCCTG...
3,SRR2071335.10001-G5K3KBG01BK5GE-2-SSU_rRNA_bac...,CCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAA...
4,SRR2071335.10002-G5K3KBG01A4BHK-2-SSU_rRNA_arc...,CTTAAAGGAATTGACGGGGGAGCACCACAAGGGGTGAAGCCTGCGG...
...,...,...
18851,SRR2071335.9994-G5K3KBG01BBZ7O-2-SSU_rRNA_arch...,AAACTTAAATGAATTGACGGGGGAGCACCACAAGGGGTGAAGCCTG...
18852,SRR2071335.9995-G5K3KBG01EHGVC-2-SSU_rRNA_bact...,CCCGCACAAGCGGTGGAGCATTGTGGTTTTAATTCGAAGCAACGCG...
18853,SRR2071335.9996-G5K3KBG01BY7UD-2-SSU_rRNA_arch...,AATTGGCGGGGAGCACCACAAGGGGTGAAGCCTGCGGTTCAATTGG...
18854,SRR2071335.9998-G5K3KBG01D7F1M-2-SSU_rRNA_arch...,AAACTTAAAGGAATTGACGGGGAGCACCACAAGGGGTGAAGCCTGC...


In [74]:
fasta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18856 entries, 0 to 18855
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   reference  18856 non-null  object
 1   sequence   18856 non-null  object
dtypes: object(2)
memory usage: 294.8+ KB


In [75]:
mseq = mseq_files[0].read()
mseq

Unnamed: 0,#query,dbhit,bitscore,identity,matches,mismatches,gaps,query_start,query_end,dbhit_start,dbhit_end,strand,Unnamed: 12,SILVA,Unnamed: 14
0,SRR2071335.1-G5K3KBG01BR0D3-2-SSU_rRNA_archaea...,JQ225313.1.1346,182,0.994565,183,1,0,0,184,1162,1346,+,,sk__Archaea;k__;p__Thaumarchaeota;c__;o__Nitro...,
1,SRR2071335.10099-G5K3KBG01D52OB-2-SSU_rRNA_bac...,AWNN01000021.8453.9992,464,0.979550,479,9,1,0,489,929,1417,+,,sk__Bacteria;k__;p__Candidatus_Marinimicrobia;...,
2,SRR2071335.1020-G5K3KBG01EODQC-2-SSU_rRNA_euka...,KJ762186.1.1796,512,0.994209,515,3,0,0,518,1127,1645,+,,sk__Eukaryota;k__;p__;c__Dinophyceae;o__Gymnod...,
3,SRR2071335.10300-G5K3KBG01B0E64-2-SSU_rRNA_bac...,KF271095.1.1439,468,0.991597,472,4,0,0,476,877,1353,+,,sk__Bacteria;k__;p__Actinobacteria;c__Acidimic...,
4,SRR2071335.10397-G5K3KBG01D8635-2-SSU_rRNA_bac...,FJ825920.1.1430,363,0.991870,366,3,0,0,369,967,1336,+,,sk__Bacteria;k__;p__Proteobacteria;c__Alphapro...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18851,SRR2071335.9832-G5K3KBG01D5G8Y-2-SSU_rRNA_bact...,GQ348765.1.1347,454,0.995633,456,2,0,0,458,886,1344,+,,sk__Bacteria;k__;p__Chloroflexi;c__Dehalococco...,
18852,SRR2071335.9833-G5K3KBG01A65V4-2-SSU_rRNA_arch...,JQ222985.1.1347,493,0.993988,496,3,0,0,499,847,1346,+,,sk__Archaea;k__;p__Euryarchaeota;c__Thermoplas...,
18853,SRR2071335.9834-G5K3KBG01CVM0K-2-SSU_rRNA_arch...,JQ225313.1.1346,454,0.995680,461,1,1,0,462,883,1346,+,,sk__Archaea;k__;p__Thaumarchaeota;c__;o__Nitro...,
18854,SRR2071335.9835-G5K3KBG01B90BN-2-SSU_rRNA_bact...,FR684427.1.1495,354,0.991781,362,2,1,0,364,1026,1391,+,,sk__Bacteria;k__;p__Proteobacteria;c__Gammapro...,


In [76]:
mseq.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18856 entries, 0 to 18855
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   #query       18856 non-null  object 
 1   dbhit        18856 non-null  object 
 2   bitscore     18856 non-null  int64  
 3   identity     18856 non-null  float64
 4   matches      18856 non-null  int64  
 5   mismatches   18856 non-null  int64  
 6   gaps         18856 non-null  int64  
 7   query_start  18856 non-null  int64  
 8   query_end    18856 non-null  int64  
 9   dbhit_start  18856 non-null  int64  
 10  dbhit_end    18856 non-null  int64  
 11  strand       18856 non-null  object 
 12  Unnamed: 12  0 non-null      float64
 13  SILVA        18856 non-null  object 
 14  Unnamed: 14  0 non-null      float64
dtypes: float64(3), int64(8), object(4)
memory usage: 2.2+ MB


In [77]:
otu = tsv_files[0].read()
otu.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 325 entries, 0 to 324
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   # OTU ID  325 non-null    int64  
 1   SSU_rRNA  325 non-null    float64
 2   taxonomy  325 non-null    object 
 3   taxid     325 non-null    int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 10.3+ KB


So we go from 18,856 reads in the fasta file to 18,856 hits in the mseq file to just 325 OTUs.

The mseq file and the OTU table contain the same number of distinct taxonomy values, which suggests that the OTUs correspond to the distinct taxonomic assignments, regardless of sequence similarity. For example, all reads that could only be assigned to Bacteria are collapsed into a single OTU.

In [78]:
pd.Series([clean_taxonomy_string(tax) for tax in otu["taxonomy"]]).nunique()

325

In [79]:
pd.Series([clean_taxonomy_string(tax) for tax in mseq["SILVA"]]).nunique()

325

In [80]:
mseq_bacteria = mseq.loc[mseq["SILVA"].apply(clean_taxonomy_string) == "sk__Bacteria"]
mseq_bacteria

Unnamed: 0,#query,dbhit,bitscore,identity,matches,mismatches,gaps,query_start,query_end,dbhit_start,dbhit_end,strand,Unnamed: 12,SILVA,Unnamed: 14
48,SRR2071335.1069-G5K3KBG01AZMUE-2-SSU_rRNA_bact...,LN870687.1.1403,109,1.000000,109,0,0,0,109,1294,1403,+,,sk__Bacteria;k__,
139,SRR2071335.10218-G5K3KBG01BDNN8-2-SSU_rRNA_bac...,EF646115.1.1485,105,0.990654,106,1,0,2,109,1276,1383,+,,sk__Bacteria;k__,
168,SRR2071335.10221-G5K3KBG01AEJTK-2-SSU_rRNA_bac...,KT318653.1.1359,111,1.000000,111,0,0,0,111,1247,1358,+,,sk__Bacteria;k__,
171,SRR2071335.10608-G5K3KBG01BE3K7-2-SSU_rRNA_bac...,GQ349306.1.1394,113,0.991304,114,1,0,0,115,1271,1386,+,,sk__Bacteria;k__,
214,SRR2071335.10325-G5K3KBG01EZIX6-2-SSU_rRNA_bac...,KJ365354.1.1446,111,0.991150,112,1,0,0,113,1246,1359,+,,sk__Bacteria;k__,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18475,SRR2071335.9669-G5K3KBG01CF1KM-2-SSU_rRNA_bact...,JQ451313.1.1396,98,1.000000,98,0,0,0,98,1297,1395,+,,sk__Bacteria;k__,
18562,SRR2071335.9505-G5K3KBG01C30B0-2-SSU_rRNA_bact...,JF796755.1.1376,109,1.000000,109,0,0,0,109,1264,1373,+,,sk__Bacteria;k__,
18581,SRR2071335.9616-G5K3KBG01DX9AM-2-SSU_rRNA_bact...,EF646115.1.1485,104,0.990566,105,1,0,7,113,1277,1383,+,,sk__Bacteria;k__,
18682,SRR2071335.9635-G5K3KBG01D13N8-2-SSU_rRNA_bact...,EF646115.1.1485,103,0.990476,104,1,0,0,105,1278,1383,+,,sk__Bacteria;k__,


In [81]:
otu_bacteria = otu.loc[otu["taxonomy"].apply(clean_taxonomy_string) == "sk__Bacteria"]
otu_bacteria

Unnamed: 0,# OTU ID,SSU_rRNA,taxonomy,taxid
13,5820,228.0,sk__Bacteria,2
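
If OTUs are indeed taxonomy groups, the abundance of each OTU should equal the number of mseq rows carrying the same cleaned taxonomy string. The sketch below shows one way to check this. It uses small made-up tables (`mseq_toy`, `otu_toy`) rather than the real downloads, and assumes the taxonomy strings have already been cleaned.

```python
import pandas as pd

# Toy stand-ins for the mseq and OTU tables (values are made up for illustration).
mseq_toy = pd.DataFrame({
    "#query": ["r1", "r2", "r3", "r4", "r5"],
    "SILVA": [
        "sk__Bacteria",
        "sk__Bacteria",
        "sk__Archaea;p__Thaumarchaeota",
        "sk__Bacteria",
        "sk__Archaea;p__Thaumarchaeota",
    ],
})
otu_toy = pd.DataFrame({
    "# OTU ID": [1, 2],
    "SSU_rRNA": [3.0, 2.0],
    "taxonomy": ["sk__Bacteria", "sk__Archaea;p__Thaumarchaeota"],
})

# Count reads per taxonomy string in the mseq table.
reads_per_taxon = mseq_toy.groupby("SILVA").size().rename("n_reads")

# Attach the counts to the OTU table and compare with the reported abundance.
check = otu_toy.merge(reads_per_taxon, left_on="taxonomy", right_index=True, how="left")
check["matches"] = check["SSU_rRNA"] == check["n_reads"]
print(check[["taxonomy", "SSU_rRNA", "n_reads", "matches"]])
```

On the real tables, a `matches` column that is `True` everywhere would support the interpretation that the OTU table is a per-taxonomy read count.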


Grouping by taxonomy appears to be the approach currently used in [mgnify-to-dwc](https://github.com/gbif/mgnify-to-dwc); see the occurrences for this sample at <https://www.gbif.org/occurrence/search?q=ssu&dataset_key=3de01c29-b3d4-4eeb-bd5c-b2bb475566cc&event_id=MGYA00463299>.

## Alternative approach

Instead of relying on the OTU table as downloaded from MGnify, let's try grouping sequences by SILVA hit.

Merge fasta and mseq tables:

In [82]:
df = fasta.merge(mseq.rename({"#query": "reference"}, axis=1), how="left", on="reference")
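
A left merge keeps reads that have no hit in the mseq table, so it can be useful to check how many reads were actually matched. A minimal sketch on made-up tables (`fasta_toy` and `mseq_toy` are illustrative, not the real downloads), using the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for the fasta and mseq tables (values are made up for illustration).
fasta_toy = pd.DataFrame({
    "reference": ["r1", "r2", "r3"],
    "sequence": ["ACGT", "ACGT", "GGCC"],
})
mseq_toy = pd.DataFrame({
    "#query": ["r1", "r3"],
    "dbhit": ["hitA", "hitB"],
})

merged = fasta_toy.merge(
    mseq_toy.rename(columns={"#query": "reference"}),
    how="left",
    on="reference",
    validate="one_to_one",  # raises if a read ID is duplicated on either side
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Reads present in the fasta table but without a hit in the mseq table.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), "reads,", len(unmatched), "without a hit")
```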

Then clean the taxonomy strings by removing empty ranks:

In [83]:
df["SILVA"] = [clean_taxonomy_string(tax) for tax in df["SILVA"]]
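
To illustrate what removing empty ranks means, here is a rough sketch. `drop_empty_ranks` is a hypothetical helper written for this note, not the actual implementation of `clean_taxonomy_string`, which may behave differently (for example, it might only strip trailing empty ranks). The sketch assumes that ranks are separated by `;` and that an empty rank is a bare prefix such as `k__`.

```python
def drop_empty_ranks(taxonomy: str) -> str:
    """Hypothetical helper: drop rank entries that have a prefix but no name."""
    ranks = [rank.strip() for rank in taxonomy.split(";")]
    # Keep only non-empty entries that carry a name after the "x__" prefix.
    kept = [rank for rank in ranks if rank and not rank.endswith("__")]
    return ";".join(kept)


print(drop_empty_ranks("sk__Bacteria;k__"))
# sk__Bacteria
print(drop_empty_ranks("sk__Archaea;k__;p__Thaumarchaeota;c__"))
# sk__Archaea;p__Thaumarchaeota
```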

Let's take a look at the number of distinct sequences, taxonomy strings, and DB hits:

In [84]:
df["sequence"].nunique()

16419

In [85]:
df["SILVA"].nunique()

325

In [86]:
df["dbhit"].nunique()

2819

So an alternative approach would be to group sequences by DB hit and pick a random representative sequence for each group. This would yield 2,819 groups instead of 325 OTUs.
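
The grouping by DB hit could be sketched as follows. This is only an illustration on a made-up table (`df_toy`), assuming the same `reference`, `sequence`, `dbhit`, and `SILVA` columns as the merged table; the `n_reads` column name is an assumption introduced here.

```python
import pandas as pd

# Toy stand-in for the merged fasta/mseq table (values are made up for illustration).
df_toy = pd.DataFrame({
    "reference": ["r1", "r2", "r3", "r4"],
    "sequence": ["ACGT", "ACGA", "GGCC", "GGCA"],
    "dbhit": ["hitA", "hitA", "hitB", "hitA"],
    "SILVA": ["sk__Bacteria", "sk__Bacteria", "sk__Archaea", "sk__Bacteria"],
})

# One randomly chosen read per DB hit; random_state makes the choice reproducible.
representatives = df_toy.groupby("dbhit").sample(n=1, random_state=42)

# Number of reads per DB hit, usable as the abundance of each group.
counts = df_toy.groupby("dbhit").size().rename("n_reads")

result = representatives.merge(counts, left_on="dbhit", right_index=True)
print(result[["dbhit", "SILVA", "sequence", "n_reads"]])
```

Note that `groupby` drops rows whose `dbhit` is missing, so reads without a hit would need separate handling.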