# Mackerel Data Analysis
This notebook roughly follows the structure of the QIIME 2 "moving pictures" tutorial, but it focuses only on getting the data ready for analysis in Songbird and Qurro.

The data is from [study 11721 on Qiita](https://qiita.ucsd.edu/study/description/11721). The input sOTU data (representative sequences and a BIOM table) were demultiplexed, trimmed to 150nt, and processed using Deblur through Qiita.

## Setting up
Declare some environment variables and move into the output directory.

In [17]:
# Input Data Locations (trimmed-to-150-nt and deblurred BIOM table and representative sequences,
# as well as sample metadata)
%env INPUT_BIOM_TABLE_PATH=/projects/qiita_data/BIOM/56428/all.biom
%env INPUT_REP_SEQS_PATH=/projects/qiita_data/BIOM/56428/all.seqs.fa
%env INPUT_SAMPLE_METADATA_PATH=/projects/qiita_data/templates/11721_prep_4638_qiime_20190722-104633.txt

# Output directory (will contain all .qza and .qzv files generated by this analysis)
%env OUTPUT_DIRECTORY=/home/mfedarko/20190726_MackerelAnalysisOutput

env: INPUT_BIOM_TABLE_PATH=/projects/qiita_data/BIOM/56428/all.biom
env: INPUT_REP_SEQS_PATH=/projects/qiita_data/BIOM/56428/all.seqs.fa
env: INPUT_SAMPLE_METADATA_PATH=/projects/qiita_data/templates/11721_prep_4638_qiime_20190722-104633.txt
env: OUTPUT_DIRECTORY=/home/mfedarko/20190726_MackerelAnalysisOutput


In [18]:
import os
odir = os.environ["OUTPUT_DIRECTORY"]
os.chdir(odir)
print("Moved into output directory: {}".format(odir))

Moved into output directory: /home/mfedarko/20190726_MackerelAnalysisOutput
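
Note that `os.chdir` raises an error if the output directory doesn't exist yet. A minimal sketch of a more forgiving version is below; the helper name `enter_output_directory` and the temporary demo path are hypothetical and aren't part of the analysis above.

```python
import os
import tempfile


def enter_output_directory(path):
    """Create `path` if it doesn't exist, move into it, and return the new cwd."""
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    return os.getcwd()


# Demonstration using a throwaway directory (hypothetical path, NOT the
# OUTPUT_DIRECTORY declared above).
demo_dir = os.path.join(tempfile.mkdtemp(), "demo_output")
print("Moved into output directory: {}".format(enter_output_directory(demo_dir)))
```

In this notebook, the same helper could be called with `os.environ["OUTPUT_DIRECTORY"]` in place of the demo path.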


## Get information about the current QIIME 2 environment
This notebook is using a slightly modified kernel that has access to the `/bin/` folder of my `qiime2-2019.4` conda environment. This lets us just use QIIME 2 commands directly.

**Note that we're using QIIME 2 2019.4 here.** I'm writing this notebook in late July 2019; QIIME 2 2019.7 will be released in a few days. When QIIME 2 2019.7 is released, Qurro v0.3.0 will also be officially released. Because of the way the `Differential` QIIME 2 semantic type is defined, the newest version of Qurro will not work with older QIIME 2 versions (including 2019.4). If you'd like to replicate this notebook using QIIME 2 2019.4, you should download:

- QIIME 2 2019.4
- A modified version of Qurro v0.3.0 (in particular, from the [`fedarko/qurro@q2-2019.4-support`](https://github.com/fedarko/qurro/tree/q2-2019.4-support) branch).
  - This can be done by running `pip install git+https://github.com/fedarko/qurro.git@q2-2019.4-support`.
- A version of Songbird that defines the `Differential` type itself.
  - Songbird v0.8.3 should work for this.

In [19]:
!qiime info

System versions
Python version: 3.6.7
QIIME 2 release: 2019.4
QIIME 2 version: 2019.4.0
q2cli version: 2019.4.0

Installed plugins
alignment: 2019.4.0
composition: 2019.4.0
cutadapt: 2019.4.0
dada2: 2019.4.0
deblur: 2019.4.0
deicode: 0.2.3
demux: 2019.4.1
diversity: 2019.4.0
emperor: 2019.4.0
feature-classifier: 2019.4.0
feature-table: 2019.4.0
fragment-insertion: 2019.4.0
gneiss: 2019.4.0
longitudinal: 2019.4.0
metadata: 2019.4.0
phylogeny: 2019.4.0
quality-control: 2019.4.0
quality-filter: 2019.4.0
qurro: 0.3.0-q2-2019.4-support
sample-classifier: 2019.4.0
songbird: 0.8.3
taxa: 2019.4.0
types: 2019.4.1
vsearch: 2019.4.0

Application config directory
/home/mfedarko/.config/q2cli

Getting help
To get help with QIIME 2, visit https://qiime2.org


## Importing data into QIIME 2 artifacts
See [the QIIME 2 documentation on importing data](https://docs.qiime2.org/2019.4/tutorials/importing/) for context on why this is necessary and useful.

Note that this dataset doesn't just contain data about the microbiota of Pacific chub mackerel: it also contains samples taken from other species of fish, as well as environmental samples. We'll filter some of these samples out of the dataset soon.

In [20]:
!qiime tools import \
    --type "FeatureTable[Frequency]" \
    --input-path $INPUT_BIOM_TABLE_PATH \
    --output-path table-unfiltered.qza
!qiime tools import \
    --type "FeatureData[Sequence]" \
    --input-path $INPUT_REP_SEQS_PATH \
    --output-path rep-seqs-unfiltered.qza

Imported /projects/qiita_data/BIOM/56428/all.biom as BIOMV210DirFmt to table-unfiltered.qza
Imported /projects/qiita_data/BIOM/56428/all.seqs.fa as DNASequencesDirectoryFormat to rep-seqs-unfiltered.qza


### Import Greengenes 13_8 99% data as QIIME 2 artifacts
See [DeSantis et al. 2006](https://aem.asm.org/content/72/7/5069.short) and [McDonald et al. 2012](https://www.nature.com/articles/ismej2011139?report=reader). We'll use this in the "Taxonomic classification" section below.

(The main reason this is all the way up here is so we can just run all cells in this notebook below a certain point without having to repeatedly import this data.)

In [21]:
# Import the Greengenes 13_8 99% data into QIIME 2 artifacts
!qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path /databases/gg/13_8/rep_set/99_otus.fasta \
    --output-path gg_13_8_99_otus.qza

!qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --input-format HeaderlessTSVTaxonomyFormat \
    --input-path /databases/gg/13_8/taxonomy/99_otu_taxonomy.txt \
    --output-path gg_13_8_99_taxonomy.qza

Imported /databases/gg/13_8/rep_set/99_otus.fasta as DNASequencesDirectoryFormat to gg_13_8_99_otus.qza
Imported /databases/gg/13_8/taxonomy/99_otu_taxonomy.txt as HeaderlessTSVTaxonomyFormat to gg_13_8_99_taxonomy.qza


### Summarize the imported table and representative sequence data
This gives us information about the number of samples and sequences present in these files. It's useful for sanity-checking the filtering that will be done in the next section.

In [22]:
!qiime feature-table summarize \
    --i-table table-unfiltered.qza \
    --o-visualization table-unfiltered-summary.qzv \
    --m-sample-metadata-file $INPUT_SAMPLE_METADATA_PATH

!qiime feature-table tabulate-seqs \
    --i-data rep-seqs-unfiltered.qza \
    --o-visualization rep-seqs-unfiltered-summary.qzv

Saved Visualization to: table-unfiltered-summary.qzv
Saved Visualization to: rep-seqs-unfiltered-summary.qzv


## Filter the feature table (and representative sequences) to just pacific chub mackerel and sea water samples
If you examine `table-unfiltered-summary.qzv` (in particular the "Interactive Sample Detail" tab), you should see that only 1,173 / 1,530 samples have a `host_common_name` of `pacific chub mackerel`. We're going to look at how samples taken from various body sites of these mackerel differ from environmental samples (in particular, samples taken just from sea water).

So we'll filter the table to just samples where `host_common_name` is `pacific chub mackerel` *or* samples where `sample_type_body_site` is `sea water`.

In [23]:
!qiime feature-table filter-samples \
    --i-table table-unfiltered.qza \
    --m-metadata-file $INPUT_SAMPLE_METADATA_PATH \
    --p-where "host_common_name='pacific chub mackerel' OR sample_type_body_site='sea water'" \
    --o-filtered-table table.qza

# Filter rep-seqs-unfiltered.qza to only include sequences present in the now-filtered table (table.qza).
!qiime feature-table filter-seqs \
    --i-table table.qza \
    --i-data rep-seqs-unfiltered.qza \
    --o-filtered-data rep-seqs.qza

Saved FeatureTable[Frequency] to: table.qza
Saved FeatureData[Sequence] to: rep-seqs.qza
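
To make the logic of the `--p-where` clause concrete, here is a small sketch that applies the same "mackerel OR sea water" condition to a toy metadata table using only the standard library. The sample IDs, the `gill` / `skin` body sites, and the other host values are made up for illustration; only the two column names and the two filter values come from the command above.

```python
import csv
import io

# Toy metadata (hypothetical rows; the real file is $INPUT_SAMPLE_METADATA_PATH).
toy_metadata = (
    "sample_name\thost_common_name\tsample_type_body_site\n"
    "s1\tpacific chub mackerel\tgill\n"
    "s2\tnot applicable\tsea water\n"
    "s3\tsome other fish\tgill\n"
    "s4\tpacific chub mackerel\tskin\n"
)


def keep_sample(row):
    """Mirror the --p-where clause: mackerel samples OR sea water samples."""
    return (
        row["host_common_name"] == "pacific chub mackerel"
        or row["sample_type_body_site"] == "sea water"
    )


reader = csv.DictReader(io.StringIO(toy_metadata), delimiter="\t")
kept = [row["sample_name"] for row in reader if keep_sample(row)]
print(kept)
```

Running the same kind of check against the real metadata file is one way to estimate how many samples the filtered table should contain.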


### Summarize the filtered data
This will let us double-check that the filtering above was done properly.

In [24]:
!qiime feature-table summarize \
    --i-table table.qza \
    --o-visualization table-summary.qzv \
    --m-sample-metadata-file $INPUT_SAMPLE_METADATA_PATH

!qiime feature-table tabulate-seqs \
    --i-data rep-seqs.qza \
    --o-visualization rep-seqs-summary.qzv

Saved Visualization to: table-summary.qzv
Saved Visualization to: rep-seqs-summary.qzv


## Taxonomic classification
You don't *need* taxonomy information (i.e. feature metadata) to run Songbird or Qurro. However, having this information available is extremely useful in interpreting a Qurro visualization -- this is why we'll perform taxonomic classification on our dataset's sOTUs.

We're going to do this taxonomic classification using BLAST+ (see [Camacho et al. 2009](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-421)) and based on the Greengenes 13_8 99% database (see above for citations).

In [25]:
!qiime feature-classifier classify-consensus-blast \
    --i-query rep-seqs.qza \
    --i-reference-reads gg_13_8_99_otus.qza \
    --i-reference-taxonomy gg_13_8_99_taxonomy.qza \
    --o-classification taxonomy.qza

Saved FeatureData[Taxonomy] to: taxonomy.qza


## Run Songbird (outside of QIIME 2)
This will generate feature differentials, which we'll visualize in Qurro.
For details on what Songbird does and how it works, please see [Songbird's GitHub page](https://github.com/biocore/songbird/), as well as [_Morton and Marotz et al. 2019_](https://www.nature.com/articles/s41467-019-10656-5).

We'll run Songbird outside of QIIME 2 here so that we can easily use Tensorboard to visualize its diagnostic plots.

### Export the table QIIME 2 artifact
This will allow Songbird to just load it as a normal BIOM table.

In [26]:
!qiime tools export \
    --input-path table.qza \
    --output-path table

Exported table.qza as BIOMV210DirFmt to directory table


### Explanations of Songbird parameters used

#### `--formula`
This parameter is used by Songbird to determine what sample metadata fields should be used as covariates when generating differentials. Here, we generate differentials relative to the `sample_type_body_site` field (using the `sea water` values of this field as a reference), but there are plenty of other options for fields that could be used here.
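
As a rough illustration of what "using `sea water` as a reference" means, the sketch below builds a treatment-coded (dummy-coded) design matrix by hand. This assumes the formula is interpreted with patsy-style treatment coding; the function `treatment_code` and the `gill` / `skin` values are hypothetical, and this is not Songbird's actual code.

```python
def treatment_code(values, reference):
    """Dummy-code a categorical column, using `reference` as the baseline level.

    Returns (column_names, rows). Each row has an intercept term followed by
    one 0/1 indicator per non-reference level.
    """
    levels = sorted(set(values) - {reference})
    rows = []
    for value in values:
        rows.append([1] + [1 if value == level else 0 for level in levels])
    return ["Intercept"] + levels, rows


# Hypothetical body-site values for four samples.
columns, design = treatment_code(
    ["sea water", "gill", "skin", "gill"], reference="sea water"
)
print(columns)
for row in design:
    print(row)
```

The reference level gets no column of its own, so under this coding each remaining column describes a body site *relative to* sea water.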

#### `--min-sample-count` and `--min-feature-count`
To quote the Songbird documentation: the `--min-sample-count` and `--min-feature-count` parameters, respectively, specify "[the] minimum number of counts a [sample/feature] needs for it to be included in the analysis." We manually specify these parameters here to be consistent (until recently, the default minimum feature count [was slightly different](https://github.com/biocore/songbird/issues/62) between the QIIME 2 and standalone Songbird versions), but these parameters should line up with the default parameters as of writing.

#### `--epochs`, `--learning-rate`, `--batch-size`
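These parameters control how Songbird's model is fit (how many iterations it runs, how large each update step is, and how many samples are processed per iteration):

- We've increased `--epochs` from the default of 1,000 to 5,000 to make Songbird run for a bit longer (we're working with a fairly large dataset).
- We've decreased `--learning-rate` from the default of 0.001 to 0.0001 so that each update step is smaller; paired with the larger number of epochs, this lets the model converge more gradually.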
- We've increased `--batch-size` from the default of 5 to 10 to make Songbird process a larger amount of samples at once in each iteration. Since our samples fall into six "categories" (the five mackerel body sites, plus sea water samples), using a larger batch size (that stands a better chance of reflecting this diversity) makes sense.
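
The progress bar printed by the Songbird run in this notebook reports a total of 415,000 iterations. As a back-of-the-envelope check, and assuming (this is an assumption, not something stated in Songbird's documentation quoted here) that the total is roughly `epochs * (training samples // batch size)`, we can work backwards to the approximate number of training samples:

```python
epochs = 5000
batch_size = 10
total_iterations = 415000  # total shown by the progress bar in this notebook

# Assumption: total_iterations is approximately
# epochs * (training_samples // batch_size).
batches_per_epoch = total_iterations // epochs
approx_training_samples = batches_per_epoch * batch_size
print(batches_per_epoch, approx_training_samples)
```

Under that assumption, roughly 830 samples were used for training. Because of the integer division, the true number could be a few samples higher.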

#### `--num-random-test-examples`
Quoting Songbird's documentation again, this is "[the number] of random samples to hold out for cross-validation if `--training-column` is not specified." The default for this is 5 (i.e. use just 5 samples for cross-validation); since we have the luxury of having a lot of samples in this dataset, we can afford to hold out more samples. This is why we've increased this to 50 samples.
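
For intuition, holding out random samples can be sketched as below. This is only an illustration of a random hold-out split, not Songbird's implementation; the function name `hold_out_split`, the seed, and the 200 sample IDs are hypothetical.

```python
import random


def hold_out_split(sample_ids, num_test, seed=0):
    """Randomly set aside `num_test` samples for testing; the rest are for training."""
    rng = random.Random(seed)
    test = set(rng.sample(sample_ids, num_test))
    train = [s for s in sample_ids if s not in test]
    return train, sorted(test)


# Hypothetical sample IDs; 50 matches the --num-random-test-examples value used here.
sample_ids = ["sample{}".format(i) for i in range(200)]
train_ids, test_ids = hold_out_split(sample_ids, num_test=50)
print(len(train_ids), len(test_ids))
```

With only 5 held-out samples, a single unusual sample can dominate the cross-validation error; holding out more samples should make that estimate less noisy.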

#### `--summary-interval`
This just influences how often Songbird reports fitting statistics to Tensorboard (which lets us diagnose whether Songbird's model is fitting reasonably to the dataset). See [Songbird's FAQ](https://github.com/biocore/songbird#faqs) for details.

In [27]:
!songbird multinomial \
    --input-biom table/feature-table.biom \
    --metadata-file $INPUT_SAMPLE_METADATA_PATH \
    --formula "C(sample_type_body_site, Treatment('sea water'))" \
    --min-sample-count 1000 \
    --min-feature-count 10 \
    --epochs 5000 \
    --learning-rate 0.0001 \
    --num-random-test-examples 50 \
    --batch-size 10 \
    --summary-interval 10 \
    --summary-dir songbird-output/

  filepath, dtype=object)
W0726 15:35:59.506467 140101596735296 deprecation_wrapper.py:119] From /home/mfedarko/.conda/envs/qiime2-2019.4/bin/songbird:92: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-07-26 15:35:59.507573: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-07-26 15:35:59.571833: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599965000 Hz
2019-07-26 15:35:59.573530: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a4acb69530 executing computations on platform Host. Devices:
2019-07-26 15:35:59.573561: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
OMP: Info #212: KMP_AFFINIT

W0726 15:35:59.918335 140101596735296 deprecation_wrapper.py:119] From /home/mfedarko/.conda/envs/qiime2-2019.4/lib/python3.6/site-packages/songbird/multinomial.py:136: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

W0726 15:35:59.980182 140101596735296 deprecation_wrapper.py:119] From /home/mfedarko/.conda/envs/qiime2-2019.4/lib/python3.6/site-packages/songbird/multinomial.py:163: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

  0%|                                                | 0/415000 [00:00<?, ?it/s]W0726 15:35:59.992781 140101596735296 deprecation_wrapper.py:119] From /home/mfedarko/.conda/envs/qiime2-2019.4/lib/python3.6/site-packages/songbird/multinomial.py:172: The name tf.RunOptions is deprecated. Please use tf.compat.v1.RunOptions instead.

W0726 15:35:59.992933 140101596735296 deprecation_wrapper.py:119] From /home/mfedarko/.conda/envs/qiime2-2019.4/lib/python3.6/

 12%|████                               | 48703/415000 [01:10<08:27, 721.55it/s]2019-07-26 15:37:10.076775: I tensorflow/core/profiler/lib/profiler_session.cc:174] Profiler session started.
 13%|████▋                              | 55720/415000 [01:20<08:34, 698.64it/s]2019-07-26 15:37:20.077841: I tensorflow/core/profiler/lib/profiler_session.cc:174] Profiler session started.
 15%|█████▎                             | 62675/415000 [01:30<08:21, 703.07it/s]2019-07-26 15:37:30.077773: I tensorflow/core/profiler/lib/profiler_session.cc:174] Profiler session started.
 17%|█████▊                             | 69593/415000 [01:40<08:15, 697.31it/s]2019-07-26 15:37:40.078433: I tensorflow/core/profiler/lib/profiler_session.cc:174] Profiler session started.
 18%|██████▍                            | 76639/415000 [01:50<08:00, 704.48it/s]2019-07-26 15:37:50.078719: I tensorflow/core/profiler/lib/profiler_session.cc:174] Profiler session started.
 20%|███████                            | 83576/41

 84%|████████████████████████████▌     | 348268/415000 [08:20<01:35, 701.89it/s]2019-07-26 15:44:20.108206: I tensorflow/core/profiler/lib/profiler_session.cc:174] Profiler session started.
 86%|█████████████████████████████     | 355180/415000 [08:30<01:26, 692.71it/s]2019-07-26 15:44:30.109382: I tensorflow/core/profiler/lib/profiler_session.cc:174] Profiler session started.
 87%|█████████████████████████████▋    | 362137/415000 [08:40<01:15, 701.87it/s]2019-07-26 15:44:40.110502: I tensorflow/core/profiler/lib/profiler_session.cc:174] Profiler session started.
 89%|██████████████████████████████▏   | 369140/415000 [08:50<01:04, 709.76it/s]2019-07-26 15:44:50.111726: I tensorflow/core/profiler/lib/profiler_session.cc:174] Profiler session started.
 91%|██████████████████████████████▊   | 376075/415000 [09:00<00:55, 698.98it/s]2019-07-26 15:45:00.112242: I tensorflow/core/profiler/lib/profiler_session.cc:174] Profiler session started.
 92%|███████████████████████████████▍  | 383056/41

## Import the Songbird differentials as a QIIME 2 artifact

In [28]:
!qiime tools import \
    --type "FeatureData[Differential]" \
    --input-path "songbird-output/differentials.tsv" \
    --output-path "songbird-differentials.qza"

Imported songbird-output/differentials.tsv as DifferentialDirFmt to songbird-differentials.qza


## Run Qurro!

In [29]:
!qiime qurro differential-plot \
    --i-table table.qza \
    --i-ranks songbird-differentials.qza \
    --m-sample-metadata-file $INPUT_SAMPLE_METADATA_PATH \
    --m-feature-metadata-file taxonomy.qza \
    --verbose \
    --o-visualization qurro-plot.qzv

30659 feature(s) in the BIOM table were not present in the feature rankings.
These feature(s) have been removed from the visualization.
604 sample(s) in the sample metadata file were not present in the BIOM table.
These sample(s) have been removed from the visualization.
Removed 1 empty sample(s).
Saved Visualization to: qurro-plot.qzv
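
The messages above describe a matching step: only features present in both the BIOM table and the feature rankings are kept. One plausible reason so many table features are missing from the rankings is that Songbird was run with `--min-sample-count` and `--min-feature-count` filters, so it never ranked them. The sketch below shows the general idea with set operations; it is an illustration with hypothetical feature IDs, not Qurro's actual code.

```python
def match_ids(table_ids, ranked_ids):
    """Return (shared, dropped): IDs in both inputs, and table IDs missing from the ranks."""
    table_ids = set(table_ids)
    ranked_ids = set(ranked_ids)
    shared = sorted(table_ids & ranked_ids)
    dropped = sorted(table_ids - ranked_ids)
    return shared, dropped


# Hypothetical feature IDs.
shared, dropped = match_ids(
    table_ids=["f1", "f2", "f3", "f4"],
    ranked_ids=["f2", "f4"],
)
print("{} feature(s) in the table were not present in the rankings.".format(len(dropped)))
```

The same idea applies to samples, with the sample metadata file taking the place of the feature rankings.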
