# MGnify to Darwin Core export notes
## What do OTUs represent in the MGnify pipeline?

Get all SSU downloads for analysis `MGYA00593805`:

In [42]:
import pandas as pd
from mgnifyextract.analyses import get_analysis
from mgnifyextract.downloads import FastaDownload, MseqDownload, TsvDownload
from mgnifyextract.studies import get_superstudy_studies
from mgnifyextract.util import clean_taxonomy_string

analysis = get_analysis("MGYA00593805")
analysis

<Analysis https://www.ebi.ac.uk/metagenomics/analyses/MGYA00593805 >

In [3]:
downloads = analysis.get_downloads()

marker = "SSU"

fasta_files = [download for download in downloads if isinstance(download, FastaDownload) and download.marker == marker]
mseq_files = [download for download in downloads if isinstance(download, MseqDownload) and download.marker == marker]
tsv_files = [download for download in downloads if isinstance(download, TsvDownload) and download.marker == marker]

Let's take a look at the number of rows in the fasta, mseq, and OTU files.

In [47]:
fasta = fasta_files[0].read_pandas()
fasta

Unnamed: 0,reference,sequence
0,SRR5788044.10000144-NS500496-106-HF3LCBGXX:2:1...,TTTGAAGTGGTGGCGTCAGCTGCCATGGAGTCGCCAGTGAAATACC...
1,SRR5788044.1000402-NS500496-106-HF3LCBGXX:1:11...,CGGCAGCGAAGTTGGTGATGTCATGCTTCCAAGAAAAGCCCTATAC...
2,SRR5788044.10005124-NS500496-106-HF3LCBGXX:2:1...,TGGATGACTTGTGGTAAGGGGTGAAAGGCCAACCAAATTCGTAGAT...
3,SRR5788044.10005627-NS500496-106-HF3LCBGXX:2:1...,ACTTATACAGCTAGGAGGTTGGCTTAGAAGCAGCCATCCTTTAAAG...
4,SRR5788044.10005627-NS500496-106-HF3LCBGXX:2:1...,GTAGAGCACTGTTTCGGCTAGGGGGTCATCCCGACTTACCAAACCG...
...,...,...
46989,SRR5788044.9855455-NS500496-106-HF3LCBGXX:2:13...,CACAGAGGGTGAAAGTCCCGTATACGTAACGGATATGGCCATGTAA...
46990,SRR5788044.9889048-NS500496-106-HF3LCBGXX:2:13...,TGAGAATCTAGGGAAGAGTAGCAGCATAGAGTGGTGAGAATCCGCT...
46991,SRR5788044.9942923-NS500496-106-HF3LCBGXX:2:13...,ATCCTGGGTGTGCAGAAGCACCCAAGGGTGGGGTTGTTCGCCCATT...
46992,SRR5788044.9985239-NS500496-106-HF3LCBGXX:2:13...,TCCTGAGTAGCGTGCGTTGGATATCGCTCGTGAATATGGGGGGCAG...


In [48]:
fasta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46994 entries, 0 to 46993
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   reference  46994 non-null  object
 1   sequence   46994 non-null  object
dtypes: object(2)
memory usage: 734.4+ KB


In [54]:
mseq = mseq_files[0].read()
mseq

Unnamed: 0,#query,dbhit,bitscore,identity,matches,mismatches,gaps,query_start,query_end,dbhit_start,dbhit_end,strand,Unnamed: 12,SILVA,Unnamed: 14
0,SRR5788044.10000144-NS500496-106-HF3LCBGXX:2:1...,MDKK01000001.24633.27373,238,0.954198,250,12,0,0,262,1983,2245,+,,sk__Bacteria;k__;p__Proteobacteria;c__Alphapro...,
1,SRR5788044.1000402-NS500496-106-HF3LCBGXX:1:11...,MWPE01000234.22117.24981,179,1.000000,179,0,0,0,179,1529,1708,+,,sk__Bacteria;k__;p__Cyanobacteria;c__;o__Synec...,
2,SRR5788044.10005124-NS500496-106-HF3LCBGXX:2:1...,GU574705.3739.6533,211,1.000000,211,0,0,0,211,742,953,+,,sk__Bacteria;k__;p__Proteobacteria;c__Alphapro...,
3,SRR5788044.10005627-NS500496-106-HF3LCBGXX:2:1...,AACY020565549.943.3654,142,0.993056,143,1,0,0,144,1002,1146,+,,sk__Bacteria;k__;p__Proteobacteria;c__Gammapro...,
4,SRR5788044.10005627-NS500496-106-HF3LCBGXX:2:1...,AACY020556324.4185.6607,141,0.973154,145,4,0,0,149,533,682,+,,sk__Bacteria;k__;p__Proteobacteria;c__Gammapro...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46979,SRR5788044.9855455-NS500496-106-HF3LCBGXX:2:13...,LURV01000038.566.3453,196,1.000000,196,0,0,0,196,345,541,+,,sk__Archaea;k__;p__Euryarchaeota;c__;o__;f__;g__,
46980,SRR5788044.9889048-NS500496-106-HF3LCBGXX:2:13...,CENG01045021.1.2603,251,0.996047,252,1,0,0,253,1316,1569,+,,sk__Archaea;k__;p__Euryarchaeota;c__;o__;f__;g...,
46981,SRR5788044.9942923-NS500496-106-HF3LCBGXX:2:13...,CESN01102825.363.3172,217,1.000000,217,0,0,0,217,2540,2757,+,,sk__Archaea;k__;p__Euryarchaeota;c__;o__;f__;g__,
46982,SRR5788044.9985239-NS500496-106-HF3LCBGXX:2:13...,LURP01000175.823.3708,98,1.000000,98,0,0,0,98,392,490,+,,sk__Archaea;k__;p__Euryarchaeota;c__;o__;f__;g__,


In [50]:
mseq.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46984 entries, 0 to 46983
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   #query       46984 non-null  object 
 1   dbhit        46984 non-null  object 
 2   bitscore     46984 non-null  int64  
 3   identity     46984 non-null  float64
 4   matches      46984 non-null  int64  
 5   mismatches   46984 non-null  int64  
 6   gaps         46984 non-null  int64  
 7   query_start  46984 non-null  int64  
 8   query_end    46984 non-null  int64  
 9   dbhit_start  46984 non-null  int64  
 10  dbhit_end    46984 non-null  int64  
 11  strand       46984 non-null  object 
 12  Unnamed: 12  0 non-null      float64
 13  SILVA        46984 non-null  object 
 14  Unnamed: 14  0 non-null      float64
dtypes: float64(3), int64(8), object(4)
memory usage: 5.4+ MB


In [5]:
otu = tsv_files[0].read()
otu.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121 entries, 0 to 120
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   # OTU ID  121 non-null    int64  
 1   SSU_rRNA  121 non-null    float64
 2   taxonomy  121 non-null    object 
 3   taxid     121 non-null    int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 3.9+ KB


So we go from 46,994 reads in fasta to 46,984 hits in the mseq file to just 121 OTUs.

The mseq file and the OTU table contain the same number of distinct taxonomy values (121), which suggests that each OTU corresponds to one distinct taxonomic assignment, regardless of sequence similarity. For example, all reads assigned only to Bacteria are collapsed into a single OTU.

In [6]:
pd.Series([clean_taxonomy_string(tax) for tax in otu["taxonomy"]]).nunique()

121

In [7]:
pd.Series([clean_taxonomy_string(tax) for tax in mseq["SILVA"]]).nunique()

121

In [39]:
mseq_bacteria = mseq.loc[mseq["SILVA"].apply(clean_taxonomy_string) == "sk__Bacteria"]
mseq_bacteria

Unnamed: 0,#query,dbhit,bitscore,identity,matches,mismatches,gaps,query_start,query_end,dbhit_start,dbhit_end,strand,Unnamed: 12,SILVA,Unnamed: 14
148,SRR5788044.10096039-NS500496-106-HF3LCBGXX:2:1...,FUFK010036927.1.2535,144,0.986486,146,2,0,0,148,2286,2434,+,,sk__Bacteria;k__;p__;c__;o__;f__;g__;s__,
149,SRR5788044.10096039-NS500496-106-HF3LCBGXX:2:1...,FUFK010036398.1.2602,135,0.959184,141,6,0,0,147,2246,2393,+,,sk__Bacteria;k__;p__;c__;o__;f__;g__,
222,SRR5788044.10139168-NS500496-106-HF3LCBGXX:2:1...,AACY020553637.2732.5725,143,0.979866,146,3,0,0,149,2434,2583,+,,sk__Bacteria;k__;p__;c__;o__;f__;g__,
223,SRR5788044.10139168-NS500496-106-HF3LCBGXX:2:1...,AACY020564201.393.3385,147,0.993289,148,1,0,0,149,2779,2928,+,,sk__Bacteria;k__;p__;c__;o__;f__;g__;s__,
241,SRR5788044.1015022-NS500496-106-HF3LCBGXX:1:11...,AACY020564201.393.3385,94,1.000000,94,0,0,0,94,1066,1160,+,,sk__Bacteria;k__;p__;c__;o__;f__;g__;s__,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44729,SRR5788044.26984355-NS500496-106-HF3LCBGXX:4:2...,CEQN01024248.2.2406,43,0.776596,73,19,2,58,150,448,542,+,,sk__Bacteria;k__,
45411,SRR5788044.658775-NS500496-106-HF3LCBGXX:1:112...,MEXB01000013.33683.35781,41,0.742857,78,25,2,29,132,520,625,+,,sk__Bacteria;k__,
46282,SRR5788044.20719679-NS500496-106-HF3LCBGXX:3:2...,AUOS02000042.824.3639,81,0.786667,118,25,7,4,147,34,184,+,,sk__Bacteria;k__,
46782,SRR5788044.5182489-NS500496-106-HF3LCBGXX:1:22...,FUFK010036927.1.2535,89,0.932039,96,7,0,0,103,1660,1763,+,,sk__Bacteria;k__;p__;c__;o__;f__,


In [41]:
otu_bacteria = otu.loc[otu["taxonomy"].apply(clean_taxonomy_string) == "sk__Bacteria"]
otu_bacteria

Unnamed: 0,# OTU ID,SSU_rRNA,taxonomy,taxid
0,5820,203.0,sk__Bacteria,2
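
The collapsing behaviour described above can be imitated on toy data. This is only a sketch: `strip_empty_ranks` is a hypothetical stand-in for `clean_taxonomy_string` (it drops every rank without a name, which may differ from what the real helper does), and the read names and taxonomy strings are made up for illustration.

```python
import pandas as pd


def strip_empty_ranks(taxonomy: str) -> str:
    """Hypothetical stand-in for clean_taxonomy_string: drop ranks that have no name."""
    ranks = [rank for rank in taxonomy.split(";") if rank and not rank.endswith("__")]
    return ";".join(ranks)


# Toy mseq-like table (invented values, same column names as the real file).
mseq = pd.DataFrame({
    "#query": ["read1", "read2", "read3", "read4", "read5"],
    "SILVA": [
        "sk__Bacteria;k__;p__;c__;o__;f__;g__;s__",
        "sk__Bacteria;k__;p__;c__;o__;f__;g__",
        "sk__Bacteria;k__",
        "sk__Bacteria;k__;p__Proteobacteria;c__Alphaproteobacteria",
        "sk__Archaea;k__;p__Euryarchaeota;c__;o__;f__;g__",
    ],
})

# One row per distinct cleaned taxonomy string, with the number of hits behind it.
mseq["taxonomy"] = mseq["SILVA"].map(strip_empty_ranks)
collapsed = mseq.groupby("taxonomy").size().reset_index(name="hits")
print(collapsed)
```

The three Bacteria-only hits end up in a single row, regardless of how many empty ranks the original strings carried. Whether these per-taxonomy hit counts match the `SSU_rRNA` column of the OTU table is not checked here.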


Grouping by taxonomy appears to be the approach currently used in [mgnify-to-dwc](https://github.com/gbif/mgnify-to-dwc); see the occurrences for this sample here: <https://www.gbif.org/occurrence/search?dataset_key=f6da16a0-ad5a-4f47-a347-aa6281de3d0d&advanced=1&event_id=MGYA00593805>

## Alternative approach

Instead of relying on the OTU table as downloaded from MGnify, let's try grouping sequences by SILVA hit. I'm working with analysis <https://www.ebi.ac.uk/metagenomics/analyses/MGYA00463299> here.

In [8]:
study = get_superstudy_studies("atlanteco")[0]
study


<Study https://www.ebi.ac.uk/metagenomics/studies/MGYS00005780 >

In [9]:
sample = study.get_samples(max_results=1)[0]
sample

<Sample https://www.ebi.ac.uk/metagenomics/samples/SRS2329696 >

In [10]:
run = sample.get_runs(max_results=1)[0]
run

<Run https://www.ebi.ac.uk/metagenomics/runs/SRR5788044 >

In [11]:
analysis = run.get_analyses(max_results=1)[0]
analysis

<Analysis https://www.ebi.ac.uk/metagenomics/analyses/MGYA00463299 >

In [12]:
marker = "LSU"
downloads = analysis.get_downloads()
fasta_files = [download for download in downloads if isinstance(download, FastaDownload) and download.marker == marker]
mseq_files = [download for download in downloads if isinstance(download, MseqDownload) and download.marker == marker]
fasta = fasta_files[0].read_pandas()
mseq = mseq_files[0].read()

Merge fasta and mseq tables:

In [13]:
df = fasta.merge(mseq.rename({"#query": "reference"}, axis=1), how="left", on="reference")
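
A left merge keeps reads that have no row in the mseq table, with missing values in the mseq columns. If it is useful to see which reads those are, pandas' `indicator=True` option adds a `_merge` column. A minimal sketch on made-up data (the read names, sequences, and hits are invented; only the column names follow the real files):

```python
import pandas as pd

# Toy stand-ins for the fasta and mseq tables.
fasta = pd.DataFrame({
    "reference": ["read1", "read2", "read3"],
    "sequence": ["ACGT", "GGCC", "TTAA"],
})
mseq = pd.DataFrame({
    "#query": ["read1", "read3"],
    "dbhit": ["hitA", "hitB"],
    "SILVA": ["sk__Bacteria", "sk__Archaea"],
})

# Same left merge as above, plus a _merge column marking where each row came from.
merged = fasta.merge(
    mseq.rename(columns={"#query": "reference"}),
    how="left",
    on="reference",
    indicator=True,
)

# Reads present in the fasta file but absent from the mseq table.
unmatched = merged.loc[merged["_merge"] == "left_only", "reference"].tolist()
print(unmatched)
```

Rows flagged `left_only` carry `NaN` in `dbhit` and `SILVA`, so any later step that processes those columns has to tolerate missing values.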

First clean taxonomy strings by removing empty ranks:

In [14]:
df["SILVA"] = [clean_taxonomy_string(tax) for tax in df["SILVA"]]

Let's take a look at the number of distinct sequences, taxonomy strings, and DB hits:

In [15]:
df["sequence"].nunique()

42037

In [16]:
df["SILVA"].nunique()

453

In [17]:
df["dbhit"].nunique()

3184

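So an alternative approach would be to group sequences by DB hit and pick a random representative sequence for each group. For this analysis, that would give 3,184 groups instead of 453 distinct taxonomy strings.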
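
One way this grouping could look, sketched on toy data rather than the real merged table. The values are invented, reads without a DB hit are simply dropped, and `random_state` is fixed only to make the pick reproducible. `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Toy stand-in for the merged fasta/mseq table (invented values).
df = pd.DataFrame({
    "reference": ["read1", "read2", "read3", "read4"],
    "sequence": ["ACGT", "ACGA", "GGCC", "TTAA"],
    "dbhit": ["hitA", "hitA", "hitB", None],
    "SILVA": ["sk__Bacteria", "sk__Bacteria", "sk__Archaea", None],
})

# Assumption: reads without a DB hit are left out of the grouping.
hits = df.dropna(subset=["dbhit"])

# One randomly chosen row per DB hit.
representatives = hits.groupby("dbhit").sample(n=1, random_state=42)

# Attach the number of reads that each representative stands for.
read_counts = hits["dbhit"].value_counts()
representatives = representatives.assign(
    reads=representatives["dbhit"].map(read_counts)
)
print(representatives[["dbhit", "sequence", "SILVA", "reads"]])
```

Each output row holds one representative sequence, its taxonomy string, and the size of its group, which is roughly the information an occurrence record per DB hit would need.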