# Sample dropout analysis
As mentioned in the paper, Qurro filters out "invalid" samples from its sample plot. Usually this means samples with an invalid log-ratio (that is, samples where the numerator and/or denominator is 0), but samples can also be dropped for other reasons (e.g. a non-numeric value in a field that's set as a quantitative scale).
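To make the "invalid log-ratio" idea concrete, here is a minimal sketch on a made-up count table. It is not Qurro's actual implementation: the `log_ratio` helper, the toy feature names, and the use of the natural log are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy feature counts (rows = samples, columns = features); all values are made up.
counts = pd.DataFrame(
    {"feat_a": [5, 0, 3], "feat_b": [2, 4, 0], "feat_c": [1, 1, 1]},
    index=["s1", "s2", "s3"],
)


def log_ratio(counts, numerator, denominator):
    """Hypothetical helper: log of summed numerator counts over summed denominator counts.

    Samples where either sum is 0 get NaN, which marks their log-ratio as invalid.
    """
    num = counts[numerator].sum(axis=1)
    den = counts[denominator].sum(axis=1)
    valid = (num > 0) & (den > 0)
    ratio = pd.Series(np.nan, index=counts.index)
    ratio.loc[valid] = np.log(num[valid] / den[valid])
    return ratio


ratios = log_ratio(counts, ["feat_a"], ["feat_b"])
# Samples with an invalid log-ratio are the ones that would be dropped
dropped = list(ratios[ratios.isna()].index)
print(dropped)
```

In this toy table, `s2` has a numerator of 0 and `s3` has a denominator of 0, so both end up in `dropped`.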

Here, we get a bit obsessive and prove a few things about the dataset:

1. All non-seawater samples in the dataset have numeric `age_2` values, and all seawater samples have non-numeric `age_2` values.
  - This lets us speak confidently about *why* certain samples have been dropped from Figs. 1(c--d) and 2(c--d) (the scatterplot figures) in the paper.


2. Only one seawater sample has a *Shewanella* feature. Furthermore, this one *Shewanella* feature is present in just two samples in the feature table.
  - This tells us why this one weird *Shewanella* is not present in the Qurro visualization (Songbird filtered it out due to the two-samples thing).
  - ...And since none of the seawater samples then contain any *Shewanella* features (from Qurro's perspective, at least), we now know why exactly all of the seawater samples cannot be displayed in Figs. 1(b--d) and 2(b--d). (This is the main reason, at least.)
  
I should probably mention that part number 1 of this analysis could be done more easily by just loading the exported sample plot data from Qurro and looking at that, rather than by loading the actual input files, doing the filtering, etc. as is shown here.

The reason I did this stuff the long way was so I could demonstrate part number 2 of this analysis -- which shows the impact that upstream filtering can have on Qurro visualizations.

In [21]:
import pandas as pd
import biom
import click
from qiime2 import Artifact, Metadata
from qurro._df_utils import biom_table_to_sparse_df

tbl = Artifact.load("output/table.qza").view(biom.Table)
tax = Artifact.load("output/taxonomy.qza").view(pd.DataFrame)
md = Metadata.load("input/11721_prep_4638_qiime_20190722-104633.txt").to_dataframe()
md

#SampleID,BarcodeSequence,LinkerPrimerSequence,center_name,experiment_design_description,instrument_model,library_construction_protocol,linker,pcr_primers,platform,run_center,...,water_collect_time,water_lot,water_pressure_dbar,water_salinity_psu,water_temp_c,weight_units,well_description,well_id,year,Description
11721.s15.5.skin,CCTACCATTGTT,GTGTGYCAGCMGCCGCGGTAA,UCSDMI,mackerel exp4,Illumina MiSeq,"Illumina EMP protocol 515fbc, 806r amplificati...",GT,FWD:GTGYCAGCMGCCGCGGTAA; REV:GGACTACNVGGGTWTCTAAT,Illumina,UCSDMI,...,1830,RNBF7110,3.62,33.45,18.88,kg,PCR1,b9,2017,not applicable
11721.s2.5.gill,ATGCTGCAACAC,GTGTGYCAGCMGCCGCGGTAA,UCSDMI,mackerel exp4,Illumina MiSeq,"Illumina EMP protocol 515fbc, 806r amplificati...",GT,FWD:GTGYCAGCMGCCGCGGTAA; REV:GGACTACNVGGGTWTCTAAT,Illumina,UCSDMI,...,1730,RNBF7110,3.46,33.07,15.8,kg,PCR1,b2,2017,not applicable
11721.s3.4.gill,CCGGACAAGAAG,GTGTGYCAGCMGCCGCGGTAA,UCSDMI,mackerel exp4,Illumina MiSeq,"Illumina EMP protocol 515fbc, 806r amplificati...",GT,FWD:GTGYCAGCMGCCGCGGTAA; REV:GGACTACNVGGGTWTCTAAT,Illumina,UCSDMI,...,1730,RNBF7110,2.98,33.06,14.95,kg,PCR1,f2,2017,not applicable
11721.s16.5.gill,AGGATCAGGGAA,GTGTGYCAGCMGCCGCGGTAA,UCSDMI,mackerel exp4,Illumina MiSeq,"Illumina EMP protocol 515fbc, 806r amplificati...",GT,FWD:GTGYCAGCMGCCGCGGTAA; REV:GGACTACNVGGGTWTCTAAT,Illumina,UCSDMI,...,1715,RNBF7110,3.43,33.53,23.85,kg,PCR1,d10,2017,not applicable
11721.s3.4.digesta,ACTAAGTACCCG,GTGTGYCAGCMGCCGCGGTAA,UCSDMI,mackerel exp4,Illumina MiSeq,"Illumina EMP protocol 515fbc, 806r amplificati...",GT,FWD:GTGYCAGCMGCCGCGGTAA; REV:GGACTACNVGGGTWTCTAAT,Illumina,UCSDMI,...,1730,RNBF7110,2.98,33.06,14.95,kg,PCR1,f2,2017,not applicable
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11721.s9.3.pyloricc,CGACCTCGCATA,GTGTGYCAGCMGCCGCGGTAA,UCSDMI,mackerel exp4,Illumina MiSeq,"Illumina EMP protocol 515fbc, 806r amplificati...",GT,FWD:GTGYCAGCMGCCGCGGTAA; REV:GGACTACNVGGGTWTCTAAT,Illumina,UCSDMI,...,1800,RNBF7110,3.38,33.34,16.4,kg,PCR3,c5,2017,not applicable
11721.s9.4.GI,CTCATCATGTTC,GTGTGYCAGCMGCCGCGGTAA,UCSDMI,mackerel exp4,Illumina MiSeq,"Illumina EMP protocol 515fbc, 806r amplificati...",GT,FWD:GTGYCAGCMGCCGCGGTAA; REV:GGACTACNVGGGTWTCTAAT,Illumina,UCSDMI,...,1800,RNBF7110,3.38,33.34,16.4,kg,PCR1,d5,2017,not applicable
11721.s9.4.pyloricc,GTCAGAGTATTG,GTGTGYCAGCMGCCGCGGTAA,UCSDMI,mackerel exp4,Illumina MiSeq,"Illumina EMP protocol 515fbc, 806r amplificati...",GT,FWD:GTGYCAGCMGCCGCGGTAA; REV:GGACTACNVGGGTWTCTAAT,Illumina,UCSDMI,...,1800,RNBF7110,3.38,33.34,16.4,kg,PCR3,d5,2017,not applicable
11721.s9.5.GI,TTATCCAGTCCT,GTGTGYCAGCMGCCGCGGTAA,UCSDMI,mackerel exp4,Illumina MiSeq,"Illumina EMP protocol 515fbc, 806r amplificati...",GT,FWD:GTGYCAGCMGCCGCGGTAA; REV:GGACTACNVGGGTWTCTAAT,Illumina,UCSDMI,...,1800,RNBF7110,3.38,33.34,16.4,kg,PCR1,e5,2017,not applicable


In [22]:
shared_sample_ids = set(md.index) & set(tbl.ids())
shared_feature_ids = set(tax.index) & set(tbl.ids(axis="observation"))
print("{} shared sample IDs, {} shared feature IDs.".format(len(shared_sample_ids), len(shared_feature_ids)))

639 shared sample IDs, 23253 shared feature IDs.


## 1. Show that all non-seawater samples in the dataset have numeric `age_2` values, and that all seawater samples have non-numeric `age_2` values
This lets us conclude two things:

1. Whenever a non-seawater sample is dropped from Figs. 1(c), 1(d), 2(c), or 2(d), the reason is that it has an invalid log-ratio (*not* that it lacks a proper estimated age value).

2. All of the seawater samples not shown in Figs. 1(c) and 2(c) would not have been shown there even if they had a valid log-ratio, because they have non-numeric `age_2` values.

Neither of these conclusions is that important, but I wanted to make sure they were clear so that I could speak confidently about the mentioned figures.
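As a rough sketch of the kind of check involved, the two claims can be tested in a vectorized way with `pd.to_numeric`. The metadata below is a made-up stand-in (including the "not applicable" placeholder values), not the real sample metadata.

```python
import pandas as pd

# Toy metadata; all values here are made up for illustration.
toy_md = pd.DataFrame(
    {
        "sample_type_body_site": ["gill", "skin", "sea water", "sea water"],
        "age_2": ["1.36", "0.64", "not applicable", "not applicable"],
    },
    index=["s1", "s2", "s3", "s4"],
)

# errors="coerce" turns anything that can't be parsed as a number into NaN
age_numeric = pd.to_numeric(toy_md["age_2"], errors="coerce")
is_sw = toy_md["sample_type_body_site"] == "sea water"

# Claim 1: every non-seawater sample has a numeric age_2 value
all_non_sw_numeric = age_numeric[~is_sw].notna().all()
# Claim 2: every seawater sample has a non-numeric age_2 value
all_sw_non_numeric = age_numeric[is_sw].isna().all()
print(all_non_sw_numeric, all_sw_non_numeric)
```

One caveat with this approach: `errors="coerce"` also leaves already-missing values as NaN, so it treats missing and non-numeric values alike.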

In [23]:
# Convert the set to a list, since recent pandas versions reject sets as .loc indexers
md_used = md.loc[list(shared_sample_ids)]
non_sw = md_used[md_used["sample_type_body_site"] != "sea water"]
sw = md_used[md_used["sample_type_body_site"] == "sea water"]

# Verify point 1 above -- all non seawater samples have numeric age_2 values
print(non_sw["age_2"].unique())
# (this would fail if there are any nonnumeric values, like "asdf")
[float(a) for a in non_sw["age_2"].unique()]

# Verify point 2 above -- all seawater samples have nonnumeric age_2 values
print(sw["age_2"].unique())

['1.364102642' '1.206180149' '1.228407355' '0.64378291' '3.602632119'
 '1.925380172' '2.057141704' '1.873726724' '2.843295371' '2.361206407'
 '1.623913614' '2.003977887' '1.822653664' '1.010879986' '2.536053139'
 '1.797330732' '1.480361013' '0.885191207' '2.972253085' '1.387117651'
 '0.108405475' '2.03048207' '0.742816338' '3.678391449' '1.162049175'
 '1.250743786' '1.722197541' '0.926702496' '1.433500242' '1.318419297'
 '0.947601174' '1.03217021' '2.447789112' '1.527709926' '1.27319052'
 '1.772148094' '4.035860209' '2.24827005' '2.276242167' '2.565859505'
 '2.083958622' '1.053560624' '3.913553605' '1.575557258' '0.196328034'
 '1.951428694' '1.977627352' '1.075052175' '0.702943674' '1.848118462'
 '1.295748654' '0.624232538' '2.138071816' '1.140143328' '3.17282038'
 '2.656469572' '2.304387486' '0.056456858' '3.104976516' '1.503973817'
 '1.69742661' '0.762884603' '2.418744575' '1.599671132' '0.823626547'
 '1.341203578' '2.389884449' '1.648286083' '2.717895569' '2.939669042'
 '1.45687024'

## 2. Show that only one of the seawater samples has any *Shewanella* features at all, and that the *Shewanella* feature in question is only present in two samples
This explains why none of the seawater samples are shown in Figs. 1(b–d) and 2(b–d).

In [27]:
def get_filtered_table(tbl, query, search_on_taxon=True):
    if search_on_taxon:
        query_features = tax[
            tax["Taxon"].str.lower().str.find(query.lower()) != -1
        ].index
    else:
        # search on feature ID
        query_features = tax[
            tax.index.str.lower().str.find(query.lower()) != -1
        ].index

    # Make a copy of the table filtered to just the matching features
    tbl_query = tbl.filter(query_features, axis="observation", inplace=False)

    # Return filtered table as a pandas SparseDataFrame
    return biom_table_to_sparse_df(
        tbl_query, min_row_ct=1, min_col_ct=1
    )

df_tbl_query = get_filtered_table(tbl, "Shewanella")

# Note that this loop doesn't break when it finds its first hit -- it keeps on going until it's checked
# every seawater sample.
# (Even though I *know* that only the one seawater sample in this dataset contains any Shewanella, the
# code doesn't actually assume this until it explicitly defines sw_shewanella_asv after this loop.)
for sample_id in sw.index:
    tax_match_features_in_sample = df_tbl_query[
        df_tbl_query[sample_id] > 0
    ].index
    if len(tax_match_features_in_sample) > 0:
        print("Found Shewanella feature(s) in sea water sample {}.".format(sample_id))
        print("Feature IDs found: {}".format(list(tax_match_features_in_sample)))
        
# The feature ID of the matching ASV was TACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTACGCAGGCGGTTTTTTAAGCGAGATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATTTCGAACTGAAGAACTAGAGTTTTGTAGAGGGTGGTAGAATTTCAGG
# Let's see which samples it's present in.
sw_shewanella_asv = "TACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTACGCAGGCGGTTTTTTAAGCGAGATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATTTCGAACTGAAGAACTAGAGTTTTGTAGAGGGTGGTAGAATTTCAGG"
df_tbl_shew_asv = biom_table_to_sparse_df(
    tbl.filter([sw_shewanella_asv], axis="observation", inplace=False),
    min_row_ct=1,
    min_col_ct=1
)
# There is definitely a more elegant vectorized way to do this but I'm on a plane right now and this is
# the easiest-to-read thing I can think of
for sample_id in df_tbl_shew_asv.columns:
    if df_tbl_shew_asv[sample_id][sw_shewanella_asv] > 0:
        print("That one Shewanella ASV is present in sample {}".format(sample_id))

Found Shewanella feature(s) in sea water sample 11721.s1.2.sw.
Feature IDs found: ['TACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTACGCAGGCGGTTTTTTAAGCGAGATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATTTCGAACTGAAGAACTAGAGTTTTGTAGAGGGTGGTAGAATTTCAGG']
That one Shewanella ASV is present in sample 11721.s1.3.gill
That one Shewanella ASV is present in sample 11721.s1.2.sw
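For what it's worth, the "more elegant vectorized way" alluded to in the code comment could look something like the sketch below. It uses a small made-up dense DataFrame as a stand-in for `df_tbl_shew_asv` (the sample and feature names are hypothetical), and I haven't verified it against the sparse DataFrame that `biom_table_to_sparse_df` returns.

```python
import pandas as pd

# Toy stand-in for df_tbl_shew_asv: one row (feature), one column per sample.
toy_df = pd.DataFrame(
    {"s1.gill": [3.0], "s2.sw": [1.0], "s3.skin": [0.0]},
    index=["asv_1"],
)

# Select the feature's row, then keep only the samples with a nonzero count
row = toy_df.loc["asv_1"]
samples_with_asv = list(row[row > 0].index)
print(samples_with_asv)
```

This replaces the per-sample loop with a single boolean mask over the feature's row.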


Since this particular *Shewanella* ASV is only present in two samples, Songbird's default `--min-feature-count` of 10 (as of writing) will filter this ASV out (i.e. this ASV is not included in the differentials).

Therefore, we can say that -- from the perspective of the Qurro visualization created for this case study, which only knows about the features in the rankings -- none of the seawater samples contain any *Shewanella* features. So when you take a log-ratio with *Shewanella* in the numerator, all of these seawater samples will get dropped (...if they weren't already dropped for another reason).
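The upstream filtering step described above can be illustrated with a small sketch. This is not Songbird's code: `filter_rare_features` is a hypothetical helper, and it assumes the threshold is compared against the number of samples in which a feature has a nonzero count. The toy table and the threshold of 3 are made up.

```python
import pandas as pd

# Toy table (rows = features, columns = samples); all counts are made up.
toy_tbl = pd.DataFrame(
    [[4, 0, 0, 7], [1, 2, 3, 4], [0, 0, 5, 0]],
    index=["rare_asv", "common_asv", "singleton_asv"],
    columns=["s1", "s2", "s3", "s4"],
)


def filter_rare_features(tbl, min_feature_count):
    """Keep only features observed (count > 0) in at least min_feature_count samples."""
    samples_per_feature = (tbl > 0).sum(axis=1)
    return tbl[samples_per_feature >= min_feature_count]


# A threshold of 3 is used only because the toy table has just four samples
filtered = filter_rare_features(toy_tbl, min_feature_count=3)
print(list(filtered.index))
```

Here `rare_asv` is present in two samples and `singleton_asv` in one, so both fall below the threshold and only `common_asv` survives. Any downstream tool that only sees the filtered features would treat the samples containing `rare_asv` as if they had none of it.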