# Mackerel Data Analysis
This notebook roughly follows the structure of the QIIME 2 "moving pictures" tutorial, but focuses on just getting the data ready for analysis in Songbird and Qurro.

The data is from [study 11721 on Qiita](https://qiita.ucsd.edu/study/description/11721), and is associated with a manuscript currently under submission (Minich et al. 2019).

The input sOTU data (representative sequences and a BIOM table) were demultiplexed, trimmed to 150nt, and processed using Deblur through Qiita.

## Setting up
Declare some environment variables and move into the output directory.

In [29]:
# Input Data Locations (trimmed-to-150-nt and deblurred BIOM table and representative sequences,
# as well as sample metadata)
%env INPUT_BIOM_TABLE_PATH=/projects/qiita_data/BIOM/56428/all.biom
%env INPUT_REP_SEQS_PATH=/projects/qiita_data/BIOM/56428/all.seqs.fa
%env INPUT_SAMPLE_METADATA_PATH=/projects/qiita_data/templates/11721_prep_4638_qiime_20190722-104633.txt

# Output directory (will contain all .qza and .qzv files generated by this analysis)
%env OUTPUT_DIRECTORY=/home/mfedarko/qurro-mackerel-analysis2/qurro-mackerel-analysis/20190731_MackerelAnalysisOutput/

env: INPUT_BIOM_TABLE_PATH=/projects/qiita_data/BIOM/56428/all.biom
env: INPUT_REP_SEQS_PATH=/projects/qiita_data/BIOM/56428/all.seqs.fa
env: INPUT_SAMPLE_METADATA_PATH=/projects/qiita_data/templates/11721_prep_4638_qiime_20190722-104633.txt
env: OUTPUT_DIRECTORY=/home/mfedarko/qurro-mackerel-analysis2/qurro-mackerel-analysis/20190731_MackerelAnalysisOutput/


In [30]:
import os
odir = os.environ["OUTPUT_DIRECTORY"]
os.chdir(odir)
print("Moved into output directory: {}".format(odir))

Moved into output directory: /home/mfedarko/qurro-mackerel-analysis2/qurro-mackerel-analysis/20190731_MackerelAnalysisOutput/
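Every later cell depends on the input variables declared above, so it can help to fail early if one is unset or points at a missing file. The following is a minimal sketch of such a check; the `missing_inputs` helper is hypothetical and was not part of the original analysis.

```python
import os

# The three input variables declared at the top of this notebook.
INPUT_VARS = (
    "INPUT_BIOM_TABLE_PATH",
    "INPUT_REP_SEQS_PATH",
    "INPUT_SAMPLE_METADATA_PATH",
)


def missing_inputs(env=os.environ, names=INPUT_VARS):
    """Return the names of variables that are unset or don't point at a file.

    Hypothetical helper for illustration; not part of the original analysis.
    """
    problems = []
    for name in names:
        path = env.get(name)
        if path is None or not os.path.isfile(path):
            problems.append(name)
    return problems
```

Calling `missing_inputs()` in the notebook returns an empty list when all three inputs are in place.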


## Get information about the current QIIME 2 environment

In [31]:
!qiime info

System versions
Python version: 3.6.7
QIIME 2 release: 2019.7
QIIME 2 version: 2019.7.0
q2cli version: 2019.7.0

Installed plugins
alignment: 2019.7.0
composition: 2019.7.0
cutadapt: 2019.7.0
dada2: 2019.7.0
deblur: 2019.7.0
demux: 2019.7.0
diversity: 2019.7.0
emperor: 2019.7.0
feature-classifier: 2019.7.0
feature-table: 2019.7.0
fragment-insertion: 2019.7.0
gneiss: 2019.7.0
longitudinal: 2019.7.0
metadata: 2019.7.0
phylogeny: 2019.7.0
quality-control: 2019.7.0
quality-filter: 2019.7.0
qurro: 0.3.0
sample-classifier: 2019.7.0
songbird: 0.8.4
taxa: 2019.7.0
types: 2019.7.0
vsearch: 2019.7.0

Application config directory
/home/mfedarko/.config/q2cli

Getting help
To get help with QIIME 2, visit https://qiime2.org


## Importing data into QIIME 2 artifacts
See [the QIIME 2 documentation on importing data](https://docs.qiime2.org/2019.4/tutorials/importing/) for context on why this is necessary and useful.

Note that this dataset doesn't just contain data about the microbiota of Pacific chub mackerel: it also contains samples taken from other species of fish, as well as environmental samples. We'll filter some of these samples out of the dataset soon.

In [32]:
!qiime tools import \
    --type "FeatureTable[Frequency]" \
    --input-path $INPUT_BIOM_TABLE_PATH \
    --output-path table-unfiltered.qza
!qiime tools import \
    --type "FeatureData[Sequence]" \
    --input-path $INPUT_REP_SEQS_PATH \
    --output-path rep-seqs-unfiltered.qza

Imported /projects/qiita_data/BIOM/56428/all.biom as BIOMV210DirFmt to table-unfiltered.qza
Imported /projects/qiita_data/BIOM/56428/all.seqs.fa as DNASequencesDirectoryFormat to rep-seqs-unfiltered.qza


### Import Greengenes 13_8 99% data as QIIME 2 artifacts
See [DeSantis et al. 2006](https://aem.asm.org/content/72/7/5069.short) and [McDonald et al. 2012](https://www.nature.com/articles/ismej2011139?report=reader). We'll use this in the "Taxonomic classification" section below.

In [33]:
# Import the Greengenes 13_8 99% data into QIIME 2 artifacts
!qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path /databases/gg/13_8/rep_set/99_otus.fasta \
    --output-path gg_13_8_99_otus.qza

!qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --input-format HeaderlessTSVTaxonomyFormat \
    --input-path /databases/gg/13_8/taxonomy/99_otu_taxonomy.txt \
    --output-path gg_13_8_99_taxonomy.qza

Imported /databases/gg/13_8/rep_set/99_otus.fasta as DNASequencesDirectoryFormat to gg_13_8_99_otus.qza
Imported /databases/gg/13_8/taxonomy/99_otu_taxonomy.txt as HeaderlessTSVTaxonomyFormat to gg_13_8_99_taxonomy.qza


### Summarize the imported table and representative sequence data
This gives us information about the number of samples and sequences present in these files. It's useful for sanity-checking the filtering that will be done in the next section.

In [34]:
!qiime feature-table summarize \
    --i-table table-unfiltered.qza \
    --o-visualization table-unfiltered-summary.qzv \
    --m-sample-metadata-file $INPUT_SAMPLE_METADATA_PATH

!qiime feature-table tabulate-seqs \
    --i-data rep-seqs-unfiltered.qza \
    --o-visualization rep-seqs-unfiltered-summary.qzv

Saved Visualization to: table-unfiltered-summary.qzv
Saved Visualization to: rep-seqs-unfiltered-summary.qzv


## Filter the feature table (and representative sequences) to just Pacific chub mackerel and sea water samples *and* samples with ≥ 1,362 sequences
If you examine `table-unfiltered-summary.qzv` (in particular the "Interactive Sample Detail" tab), you should see that only 1,173 of the 1,530 samples have a `host_common_name` of `pacific chub mackerel`. We're going to look at how samples taken from various body sites of these mackerel differ from environmental samples (specifically, samples taken from sea water).

So we'll filter the table to just samples where `host_common_name` is `pacific chub mackerel` *or* samples where `sample_type_body_site` is `sea water`.

Additionally, we filter the table to only include samples with at least 1,362 sequences. This isn't rarefaction (the remaining samples keep all of their sequences), but samples with fewer than 1,362 sequences are excluded from the rest of the analysis. (The 1,362 threshold comes from this study's results from using the KatharoSeq protocol: see [Minich et al. 2018](https://msystems.asm.org/content/3/3/e00218-17.abstract) for a description of how KatharoSeq works.)
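To make the two criteria concrete, here is a small pure-Python sketch of the per-sample decision. It is only an illustration of the logic, not how QIIME 2 implements filtering; the `keep_sample` helper, the sample IDs, the `gill` and `skin` body-site labels, and the counts are all made up.

```python
MIN_FREQUENCY = 1362  # threshold used in this analysis


def keep_sample(metadata_row, total_frequency, min_frequency=MIN_FREQUENCY):
    """Return True if a sample passes both filters described above."""
    is_mackerel = metadata_row.get("host_common_name") == "pacific chub mackerel"
    is_sea_water = metadata_row.get("sample_type_body_site") == "sea water"
    # A sample must match the metadata condition AND meet the frequency threshold.
    return (is_mackerel or is_sea_water) and total_frequency >= min_frequency


# Toy samples: (metadata, total sequence count). All values are hypothetical.
toy_samples = {
    "s1": ({"host_common_name": "pacific chub mackerel", "sample_type_body_site": "gill"}, 5000),
    "s2": ({"host_common_name": "pacific chub mackerel", "sample_type_body_site": "skin"}, 900),
    "s3": ({"host_common_name": "not applicable", "sample_type_body_site": "sea water"}, 1362),
    "s4": ({"host_common_name": "some other fish", "sample_type_body_site": "gill"}, 20000),
}

kept = sorted(sid for sid, (row, total) in toy_samples.items() if keep_sample(row, total))
print(kept)
```

Here `s2` is dropped for having too few sequences and `s4` for matching neither metadata condition, while `s3` is kept because the threshold is inclusive.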

In [35]:
!qiime feature-table filter-samples \
    --i-table table-unfiltered.qza \
    --m-metadata-file $INPUT_SAMPLE_METADATA_PATH \
    --p-where "host_common_name='pacific chub mackerel' OR sample_type_body_site='sea water'" \
    --p-min-frequency 1362 \
    --o-filtered-table table.qza

# Filter rep-seqs-unfiltered.qza to only include sequences present in the now-filtered table (table.qza).
!qiime feature-table filter-seqs \
    --i-table table.qza \
    --i-data rep-seqs-unfiltered.qza \
    --o-filtered-data rep-seqs.qza

Saved FeatureTable[Frequency] to: table.qza
Saved FeatureData[Sequence] to: rep-seqs.qza


### Summarize the filtered data
This will let us double-check that the filtering above was done properly.

In [36]:
!qiime feature-table summarize \
    --i-table table.qza \
    --o-visualization table-summary.qzv \
    --m-sample-metadata-file $INPUT_SAMPLE_METADATA_PATH

!qiime feature-table tabulate-seqs \
    --i-data rep-seqs.qza \
    --o-visualization rep-seqs-summary.qzv

Saved Visualization to: table-summary.qzv
Saved Visualization to: rep-seqs-summary.qzv


## Taxonomic classification
You don't *need* taxonomy information (i.e. feature metadata) to run Songbird or Qurro. However, having this information available is extremely useful in interpreting a Qurro visualization -- this is why we'll perform taxonomic classification on our dataset's sOTUs.

We're going to do this taxonomic classification using BLAST+ (see [Camacho et al. 2009](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-421)) and based on the Greengenes 13_8 99% database (see above for citations).

In [37]:
!qiime feature-classifier classify-consensus-blast \
    --i-query rep-seqs.qza \
    --i-reference-reads gg_13_8_99_otus.qza \
    --i-reference-taxonomy gg_13_8_99_taxonomy.qza \
    --o-classification taxonomy.qza

Saved FeatureData[Taxonomy] to: taxonomy.qza


## Run Songbird
This will generate feature differentials, which we'll visualize in Qurro.
For details on what Songbird does and how it works, please see [Songbird's GitHub page](https://github.com/biocore/songbird/), as well as [Morton and Marotz et al. 2019](https://www.nature.com/articles/s41467-019-10656-5).

Note that we're using the version of Songbird described in [this pull request](https://github.com/biocore/songbird/pull/60), at commit [`872068d`](https://github.com/biocore/songbird/pull/60/commits/872068df5c406908aad469b7ce2c1ae10dd661bc) (as of writing [July 31, 2019], this hasn't been merged into the main `biocore/songbird` repository yet). This pull request has a fix that makes the output differentials compatible with the latest version of QIIME 2 (2019.7), which was just released yesterday.

For reference, Songbird was installed using the following command (based on [this Stack Overflow answer](https://stackoverflow.com/a/13561621/10730311)):

```
pip install git+https://github.com/mortonjt/songbird.git@fix-type
```

### Explanations of Songbird parameters used
These parameters were chosen based on consulting Tensorboard to ensure that they resulted in a reasonable model fit.

#### `--p-formula`
This parameter is used by Songbird to determine what sample metadata fields should be used as covariates when generating differentials. Here, we generate differentials relative to the `sample_type_body_site` field (using the `sea water` values of this field as a reference), but there are plenty of other options for fields that could be used here.
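The `Treatment('sea water')` part of the formula asks for treatment (dummy) coding: a categorical field with k levels becomes k - 1 indicator columns, each comparing one level against the reference. The sketch below shows that coding in plain Python. It is an illustration of the idea only, not how Songbird builds its design matrix; the `treatment_code` helper and the `gill` and `skin` labels are hypothetical.

```python
def treatment_code(values, reference):
    """Dummy-code a categorical column against a reference level.

    Returns (column_names, rows): a leading intercept column of ones, plus
    one 0/1 indicator column per non-reference level.
    """
    levels = sorted(set(values))
    if reference not in levels:
        raise ValueError("reference level {!r} not found".format(reference))
    others = [lv for lv in levels if lv != reference]
    names = ["Intercept"] + ["[T.{}]".format(lv) for lv in others]
    rows = [[1] + [int(v == lv) for lv in others] for v in values]
    return names, rows


# "gill" and "skin" are hypothetical body-site labels used only for illustration.
names, rows = treatment_code(["sea water", "gill", "skin", "gill"], "sea water")
print(names)
for row in rows:
    print(row)
```

Sea water samples get zeros in every indicator column, so each non-intercept column can be read as "this body site versus sea water".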

#### `--p-min-feature-count`
To quote the Songbird documentation: the `--p-min-feature-count` parameter specifies "[the] minimum number of counts a feature needs for it to be included in the analysis." We manually specify this parameter here in order to be consistent (until recently, the default minimum feature count [was slightly different](https://github.com/biocore/songbird/issues/62) between the QIIME 2 and standalone Songbird versions).

(Also: since we already filtered samples with fewer than 1,362 total sequences out of the table, Songbird's default `--p-min-sample-count` of `1000` shouldn't do anything here.)

#### `--p-epochs`, `--p-learning-rate`, `--p-batch-size`
These parameters control how long, and in what increments, Songbird fits its model:

- We've increased `--p-epochs` from the default of 1,000 to 5,000 to make Songbird run for a bit longer (we're working with a fairly large dataset).
- We've decreased `--p-learning-rate` from the default of 0.001 to 0.0001 so that each update changes the model by a smaller amount, which pairs with the longer run above.
- We've increased `--p-batch-size` from the default of 5 to 10 to make Songbird process a larger number of samples at once in each iteration. Since our samples fall into six "categories" (the five mackerel body sites, plus sea water samples), using a larger batch size (that stands a better chance of reflecting this diversity) makes sense.
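As a rough guide to how these settings interact, the sketch below estimates the number of model updates under the assumption that one epoch is one full pass over the training samples in batches of `batch_size`. That assumption, the `estimated_updates` helper, and the sample count of 1,000 are illustrative only; Songbird's own bookkeeping may differ.

```python
import math


def estimated_updates(n_train_samples, batch_size, epochs):
    """Estimate the number of updates, assuming one epoch = one pass over the data.

    Illustrative assumption only; not taken from Songbird's source code.
    """
    batches_per_epoch = math.ceil(n_train_samples / batch_size)
    return batches_per_epoch * epochs


n = 1000  # hypothetical number of training samples
default_updates = estimated_updates(n, batch_size=5, epochs=1000)
chosen_updates = estimated_updates(n, batch_size=10, epochs=5000)
print(default_updates, chosen_updates)
```

Under this assumption, the settings used here perform 2.5 times as many updates as the defaults would, with each update computed from twice as many samples.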

#### `--p-num-random-test-examples`
Quoting Songbird's documentation again, this is "[the number] of random samples to hold out for cross-validation if `training-column` is not specified." The default for this is 5 (i.e. use just 5 samples for cross-validation); since we have the luxury of having a lot of samples in this dataset, we can afford to hold out more samples. This is why we've increased this to 50 samples.
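The idea of holding out a random subset of samples can be sketched as follows. This is a generic illustration, not Songbird's own sampling procedure; the `split_holdout` helper, the seed, and the sample IDs are hypothetical.

```python
import random


def split_holdout(sample_ids, num_test, seed=0):
    """Randomly split sample IDs into (train, test) lists.

    Illustrative only: Songbird's own sampling procedure may differ.
    """
    if num_test >= len(sample_ids):
        raise ValueError("num_test must be smaller than the number of samples")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test = set(rng.sample(sorted(sample_ids), num_test))
    train = [s for s in sample_ids if s not in test]
    return train, sorted(test)


ids = ["sample{}".format(i) for i in range(200)]  # hypothetical sample IDs
train, test = split_holdout(ids, num_test=50)
print(len(train), len(test))
```

With only 5 held-out samples, a single unusual sample can dominate the cross-validation error; holding out 50 gives a steadier estimate while still leaving most samples for training.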

#### `--p-summary-interval`
This just influences how often Songbird reports fitting statistics to Tensorboard (which lets us diagnose whether Songbird's model is fitting reasonably to the dataset). See [Songbird's FAQ](https://github.com/biocore/songbird#faqs) for details.

In [40]:
!qiime songbird multinomial \
    --i-table table.qza \
    --m-metadata-file $INPUT_SAMPLE_METADATA_PATH \
    --p-formula "C(sample_type_body_site, Treatment('sea water'))" \
    --p-min-feature-count 10 \
    --p-epochs 5000 \
    --p-learning-rate 0.0001 \
    --p-num-random-test-examples 50 \
    --p-batch-size 10 \
    --p-summary-interval 10 \
    --output-dir songbird-output/

Saved FeatureData[Differential] to: songbird-output/differentials.qza
Saved SampleData[SongbirdStats] to: songbird-output/regression_stats.qza
Saved PCoAResults % Properties('biplot') to: songbird-output/regression_biplot.qza


## Run Qurro!
The particular version of Qurro we use here was installed from my fork of Qurro (up-to-date with commit [`d9aae7a`](https://github.com/fedarko/qurro/commit/d9aae7aea0a76ed79a9c2ec0401fd4e4a67f2d19)). (As of writing, Qurro v0.3.0 has not been merged into the main `biocore/qurro` repository yet.)

In [42]:
!qiime qurro differential-plot \
    --i-table table.qza \
    --i-ranks songbird-output/differentials.qza \
    --m-sample-metadata-file $INPUT_SAMPLE_METADATA_PATH \
    --m-feature-metadata-file taxonomy.qza \
    --verbose \
    --o-visualization qurro-plot.qzv

28980 feature(s) in the BIOM table were not present in the feature rankings.
These feature(s) have been removed from the visualization.
1067 sample(s) in the sample metadata file were not present in the BIOM table.
These sample(s) have been removed from the visualization.
Saved Visualization to: qurro-plot.qzv

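The two messages in the output above reflect Qurro reconciling its inputs: features missing from the rankings and samples missing from the table are dropped before plotting. This is expected here, since the table was filtered earlier and Songbird applies its own minimum feature count. The set arithmetic behind such messages can be sketched as follows (a simplified illustration with made-up IDs, not Qurro's actual code; the `match_inputs` helper is hypothetical).

```python
def match_inputs(table_features, ranked_features, table_samples, metadata_samples):
    """Work out which features and samples are shared between the inputs.

    Simplified illustration; not Qurro's actual matching code.
    """
    table_features, ranked_features = set(table_features), set(ranked_features)
    table_samples, metadata_samples = set(table_samples), set(metadata_samples)
    return {
        "kept_features": table_features & ranked_features,
        "kept_samples": table_samples & metadata_samples,
        # In the table but never ranked (e.g. filtered out before ranking).
        "dropped_features": table_features - ranked_features,
        # Described in the metadata but absent from the (filtered) table.
        "dropped_samples": metadata_samples - table_samples,
    }


# Made-up IDs for illustration.
result = match_inputs(
    table_features=["f1", "f2", "f3"],
    ranked_features=["f1", "f3"],
    table_samples=["s1", "s2"],
    metadata_samples=["s1", "s2", "s3", "s4"],
)
print(len(result["dropped_features"]), "feature(s) dropped")
print(len(result["dropped_samples"]), "sample(s) dropped")
```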