# Mackerel Data Analysis
This notebook roughly follows the structure of the QIIME 2 "moving pictures" tutorial. It focuses on getting the data ready for analysis in Songbird and Qurro.

The data is from [study 11721 on Qiita](https://qiita.ucsd.edu/study/description/11721), and is associated with a manuscript currently in preparation (Minich et al. 2019, preprint available on bioRxiv [here](https://www.biorxiv.org/content/10.1101/721555v1)).

The input data in this notebook is 150nt sOTU data, corresponding to artifact ID `56427` on Qiita. These data were processed on Qiita using QIIME 1.9.1 (`Split libraries FASTQ` and `Trimming`), then denoised using Deblur 1.1.0. The `reference-hit` BIOM table and FASTA file were used as the starting point for this analysis.

## 0. Setting up
### 0.1. Declare some environment variables

In [1]:
# Input Data Locations (trimmed-to-150-nt and deblurred BIOM table and sequences,
# as well as sample metadata)
%env INPUT_BIOM_TABLE_PATH=/projects/qiita_data/BIOM/56427/reference-hit.biom
%env INPUT_REP_SEQS_PATH=/projects/qiita_data/BIOM/56427/reference-hit.seqs.fa
%env INPUT_SAMPLE_METADATA_PATH=/projects/qiita_data/templates/11721_prep_4638_qiime_20190722-104633.txt

# Output directory (will contain all .qza and .qzv files generated by this analysis)
%env OUTPUT_DIRECTORY=/home/mfedarko/analyses/qurro-mackerel-analysis/AnalysisOutput

env: INPUT_BIOM_TABLE_PATH=/projects/qiita_data/BIOM/56427/reference-hit.biom
env: INPUT_REP_SEQS_PATH=/projects/qiita_data/BIOM/56427/reference-hit.seqs.fa
env: INPUT_SAMPLE_METADATA_PATH=/projects/qiita_data/templates/11721_prep_4638_qiime_20190722-104633.txt
env: OUTPUT_DIRECTORY=/home/mfedarko/analyses/qurro-mackerel-analysis/AnalysisOutput
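Before running the rest of the notebook, it can help to confirm that the input variables are set and point at existing files. The following is a minimal sketch, assuming the three input variable names declared above; `find_missing_inputs` is a hypothetical helper written for illustration and is not part of QIIME 2.

```python
import os

# Names of the input variables declared in the cell above.
INPUT_VARS = [
    "INPUT_BIOM_TABLE_PATH",
    "INPUT_REP_SEQS_PATH",
    "INPUT_SAMPLE_METADATA_PATH",
]


def find_missing_inputs(env, names=INPUT_VARS):
    """Return the variable names that are unset or do not point at a file."""
    missing = []
    for name in names:
        path = env.get(name)
        if not path or not os.path.isfile(path):
            missing.append(name)
    return missing


# In the notebook this would be called with the real environment.
print(find_missing_inputs(os.environ))
```

An empty list means all three inputs were found; any names that are printed should be fixed before importing data in section 1.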


### 0.2. Move into the output directory

In [2]:
import os
odir = os.environ["OUTPUT_DIRECTORY"]
os.chdir(odir)
print("Moved into output directory: {}".format(odir))

Moved into output directory: /home/mfedarko/analyses/qurro-mackerel-analysis/AnalysisOutput


### 0.3. Get information about the current QIIME 2 environment
(For future reference.)

In [3]:
!qiime info

QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
System versions
Python version: 3.6.7
QIIME 2 release: 2019.7
QIIME 2 version: 2019.7.0
q2cli version: 2019.7.0

Installed plugins
alignment: 2019.7.0
composition: 2019.7.0
cutadapt: 2019.7.0
dada2: 2019.7.0
deblur: 2019.7.0
demux: 2019.7.0
diversity: 2019.7.0
emperor: 2019.7.0
feature-classifier: 2019.7.0
feature-table: 2019.7.0
fragment-insertion: 2019.7.0
gneiss: 2019.7.0
longitudinal: 2019.7.0
metadata: 2019.7.0
phylogeny: 2019.7.0
quality-control: 2019.7.0
quality-filter: 2019.7.0
qurro: 0.4.0
sample-classifier: 2019.7.0
songbird: 0.9.0
taxa: 2019.7.0
types: 2019.7.0
vsearch: 2019.7.0

Application config directory
/home/mfedarko/.config/q2cli

Getting help
To get help with QIIME 2, visit https://qiime2.org


## 1. Import data into QIIME 2 artifacts
See [the QIIME 2 documentation on importing data](https://docs.qiime2.org/2019.4/tutorials/importing/) for context on why this is necessary and useful.

### 1.1. Import the study's data
Note that this dataset doesn't just contain data about the microbiota of pacific chub mackerel: it also contains samples taken from other species of fish, as well as environmental samples. We'll filter some of these samples out of the dataset soon.

In [4]:
!qiime tools import \
    --type "FeatureTable[Frequency]" \
    --input-path $INPUT_BIOM_TABLE_PATH \
    --output-path table-unfiltered.qza
!qiime tools import \
    --type "FeatureData[Sequence]" \
    --input-path $INPUT_REP_SEQS_PATH \
    --output-path rep-seqs-unfiltered.qza

Imported /projects/qiita_data/BIOM/56427/reference-hit.biom as BIOMV210DirFmt to table-unfiltered.qza
Imported /projects/qiita_data/BIOM/56427/reference-hit.seqs.fa as DNASequencesDirectoryFormat to rep-seqs-unfiltered.qza


#### 1.1.1. Summarize the imported study table and representative sequence data
This gives us information about the number of samples and sequences present in these files. It's useful for sanity-checking the filtering that will be done in the next section.

In [5]:
!qiime feature-table summarize \
    --i-table table-unfiltered.qza \
    --o-visualization table-unfiltered-summary.qzv \
    --m-sample-metadata-file $INPUT_SAMPLE_METADATA_PATH

!qiime feature-table tabulate-seqs \
    --i-data rep-seqs-unfiltered.qza \
    --o-visualization rep-seqs-unfiltered-summary.qzv

Saved Visualization to: table-unfiltered-summary.qzv
Saved Visualization to: rep-seqs-unfiltered-summary.qzv


### 1.2. Import Greengenes 13_8 99% reference data
See [DeSantis et al. 2006](https://aem.asm.org/content/72/7/5069.short) and [McDonald et al. 2012](https://www.nature.com/articles/ismej2011139?report=reader). We'll use this in the "Taxonomic classification" section below.

In [6]:
!qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path /databases/gg/13_8/rep_set/99_otus.fasta \
    --output-path gg_13_8_99_otus.qza

!qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --input-format HeaderlessTSVTaxonomyFormat \
    --input-path /databases/gg/13_8/taxonomy/99_otu_taxonomy.txt \
    --output-path gg_13_8_99_taxonomy.qza

Imported /databases/gg/13_8/rep_set/99_otus.fasta as DNASequencesDirectoryFormat to gg_13_8_99_otus.qza
Imported /databases/gg/13_8/taxonomy/99_otu_taxonomy.txt as HeaderlessTSVTaxonomyFormat to gg_13_8_99_taxonomy.qza


## 2. Filter the study's data
In particular, we'll filter the feature table (and representative sequences) to only contain samples that satisfy both of the following two conditions:

1. have a `host_common_name` of `pacific chub mackerel` OR have a `sample_type_body_site` of `sea water`
2. contain at least 1,362 sequences

### 2.1. Why do we filter to just pacific chub mackerel and sea water samples?
If you examine `table-unfiltered-summary.qzv` (in particular the "Interactive Sample Detail" tab), you should see that only 1,173 of the 1,530 samples have a `host_common_name` of `pacific chub mackerel`. We're going to look at how samples taken from various body sites of these mackerel differ from environmental samples (specifically, samples taken from sea water).

In order to perform this analysis, we filter the table to just samples where `host_common_name` is `pacific chub mackerel` *or* samples where `sample_type_body_site` is `sea water`.

### 2.2. Why do we filter out samples with fewer than 1,362 sequences?
This isn't rarefaction: the remaining samples keep all of their sequences. Samples with fewer than this number of sequences are simply removed from the rest of the analysis.

The "1,362" threshold comes from the corresponding study's use of the KatharoSeq protocol: see [Minich et al. 2018](https://msystems.asm.org/content/3/3/e00218-17.abstract) for a description of how KatharoSeq works, and Minich et al. 2019 (see the intro section of this analysis) for details on how this number was determined.
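The two conditions can be sketched in plain Python to make the logic explicit. This is only an illustration of the rule, not how QIIME 2 implements `filter-samples`; the sample IDs, the `gill` and `skin` body-site values, and the counts below are made up for the example.

```python
MIN_SEQUENCES = 1362  # KatharoSeq-derived threshold used in this analysis


def keep_sample(metadata_row, sequence_count, min_sequences=MIN_SEQUENCES):
    """Return True if a sample passes both filtering conditions."""
    is_mackerel = metadata_row.get("host_common_name") == "pacific chub mackerel"
    is_sea_water = metadata_row.get("sample_type_body_site") == "sea water"
    # Condition 1 (mackerel OR sea water) AND condition 2 (enough sequences).
    return (is_mackerel or is_sea_water) and sequence_count >= min_sequences


# Toy samples (hypothetical IDs and values, not taken from the study).
samples = {
    "s1": ({"host_common_name": "pacific chub mackerel",
            "sample_type_body_site": "gill"}, 5000),
    "s2": ({"host_common_name": "not applicable",
            "sample_type_body_site": "sea water"}, 1362),
    "s3": ({"host_common_name": "pacific chub mackerel",
            "sample_type_body_site": "skin"}, 900),
    "s4": ({"host_common_name": "other fish",
            "sample_type_body_site": "gill"}, 8000),
}

kept = [sid for sid, (row, count) in samples.items() if keep_sample(row, count)]
print(kept)  # ['s1', 's2']
```

Here `s3` is dropped for having too few sequences and `s4` is dropped because it is neither a mackerel sample nor a sea water sample.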

In [7]:
!qiime feature-table filter-samples \
    --i-table table-unfiltered.qza \
    --m-metadata-file $INPUT_SAMPLE_METADATA_PATH \
    --p-where "host_common_name='pacific chub mackerel' OR sample_type_body_site='sea water'" \
    --p-min-frequency 1362 \
    --o-filtered-table table.qza

# Filter rep-seqs-unfiltered.qza to only include sequences present in the now-filtered table (table.qza).
!qiime feature-table filter-seqs \
    --i-table table.qza \
    --i-data rep-seqs-unfiltered.qza \
    --o-filtered-data rep-seqs.qza

Saved FeatureTable[Frequency] to: table.qza
Saved FeatureData[Sequence] to: rep-seqs.qza


### 2.3. Summarize the filtered data
This will let us double-check that the filtering above was done properly.

In [8]:
!qiime feature-table summarize \
    --i-table table.qza \
    --o-visualization table-summary.qzv \
    --m-sample-metadata-file $INPUT_SAMPLE_METADATA_PATH

!qiime feature-table tabulate-seqs \
    --i-data rep-seqs.qza \
    --o-visualization rep-seqs-summary.qzv

Saved Visualization to: table-summary.qzv
Saved Visualization to: rep-seqs-summary.qzv


## 3. Taxonomic classification
You don't *need* taxonomy information (i.e. feature metadata) to run Songbird or Qurro. However, having this information available is extremely useful in interpreting a Qurro visualization -- this is why we'll perform taxonomic classification on our dataset's features.

We're going to do this taxonomic classification using VSEARCH (see [Rognes et al. 2016](https://peerj.com/articles/2584/)) and based on the Greengenes 13_8 99% database (see above for citations).

In [9]:
!qiime feature-classifier classify-consensus-vsearch \
    --i-query rep-seqs.qza \
    --i-reference-reads gg_13_8_99_otus.qza \
    --i-reference-taxonomy gg_13_8_99_taxonomy.qza \
    --o-classification taxonomy.qza

Saved FeatureData[Taxonomy] to: taxonomy.qza


## 4. Filter out features whose taxonomy contains the text `mitochondria` or `chloroplast`
Based on [this QIIME 2 tutorial](https://docs.qiime2.org/2019.7/tutorials/filtering/#taxonomy-based-filtering-of-tables-and-sequences). There are actually a decent number of chloroplast features in this dataset (which makes sense, since these are marine samples), but since they're irrelevant to our analysis we filter them out.
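As a rough illustration of what excluding on `mitochondria,chloroplast` means, the sketch below flags any taxonomy string containing either term. It assumes case-insensitive substring matching, and the Greengenes-style taxonomy strings are illustrative examples rather than rows from this dataset; it does not call into QIIME 2.

```python
def is_excluded(taxonomy, exclude_terms):
    """Return True if any exclude term appears in the taxonomy string.

    Assumes case-insensitive substring matching.
    """
    lowered = taxonomy.lower()
    return any(term.lower() in lowered for term in exclude_terms)


terms = "mitochondria,chloroplast".split(",")

# Illustrative Greengenes-style annotations (not taken from this dataset).
examples = [
    "k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Stramenopiles",
    "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
    "o__Rickettsiales; f__mitochondria",
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Vibrionales",
]

flags = [is_excluded(taxonomy, terms) for taxonomy in examples]
print(flags)  # [True, True, False]
```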

In [15]:
!qiime taxa filter-table \
    --i-table table.qza \
    --i-taxonomy taxonomy.qza \
    --p-exclude "mitochondria,chloroplast" \
    --o-filtered-table table-no-mitochondria-or-chloroplast.qza

Saved FeatureTable[Frequency] to: table-no-mitochondria-or-chloroplast.qza


## 5. Run Songbird
This will generate feature differentials, which we'll visualize in Qurro.
For details on what Songbird does and how it works, please see [Songbird's GitHub page](https://github.com/biocore/songbird/), as well as [Morton and Marotz et al. 2019](https://www.nature.com/articles/s41467-019-10656-5).

### 5.1. Why do we manually set certain Songbird parameters?
These parameters were chosen by consulting TensorBoard to check that they resulted in a reasonable model fit.

#### `--p-formula`
This parameter is used by Songbird to determine what sample metadata fields should be used as covariates when generating differentials. Here, we generate differentials relative to the `sample_type_body_site` field (using the `sea water` values of this field as a reference), but there are of course plenty of other options for fields that could be used here if we'd like to ask different questions about this data.
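To make "using `sea water` as a reference" concrete, here is a hand-rolled sketch of treatment (dummy) coding, which is the kind of encoding a formula like `C(sample_type_body_site, Treatment('sea water'))` describes. It is an imitation for illustration only and does not call Songbird or its formula parser; the `gill` and `skin` values are hypothetical body sites.

```python
def treatment_code(values, reference):
    """Dummy-code a categorical column, dropping the reference level.

    Each non-reference level gets its own 0/1 column, so every coefficient
    is interpreted relative to the reference level.
    """
    levels = sorted(set(values) - {reference})
    columns = ["Intercept"] + ["[T.{}]".format(level) for level in levels]
    rows = [[1] + [int(value == level) for level in levels] for value in values]
    return columns, rows


# Hypothetical body-site values for four samples.
body_sites = ["sea water", "gill", "skin", "gill"]
columns, rows = treatment_code(body_sites, reference="sea water")

print(columns)  # ['Intercept', '[T.gill]', '[T.skin]']
for row in rows:
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 0, 1]
# [1, 1, 0]
```

The sea water sample is all zeros outside the intercept, which is why each resulting differential column describes one body site relative to sea water.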

#### `--p-epochs`, `--p-learning-rate`, `--p-batch-size`
These parameters influence how long, and how gradually, Songbird fits its model:

- We've increased `--p-epochs` from the default of 1,000 to 5,000 to make Songbird run for longer (we're working with a fairly large dataset).
- We've decreased `--p-learning-rate` from the default of 0.001 to 0.0001 so that the model is updated in smaller steps over that longer run.
- We've increased `--p-batch-size` from the default of 5 to 10 so that Songbird processes more samples in each iteration. Since our samples fall into six "categories" (the five mackerel body sites, plus sea water samples), using a larger batch size (that stands a better chance of reflecting this diversity) makes sense.

#### `--p-num-random-test-examples`
Quoting Songbird's documentation, this is "\[the number\] of random samples to hold out for cross-validation if `training-column` is not specified." The default for this is 5 (i.e. use just 5 samples for cross-validation); since we have the luxury of having a lot of samples in this dataset, we can afford to hold out more samples. This is why we've increased this to 30 samples.
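The idea of holding out random samples can be sketched as follows. This is a generic illustration of a random train/test split, not Songbird's own code; `split_holdout`, the seed, and the sample IDs are all hypothetical.

```python
import random


def split_holdout(sample_ids, num_test, seed=0):
    """Randomly hold out `num_test` samples; return (train_ids, test_ids)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test = set(rng.sample(sorted(sample_ids), num_test))
    train = [sample_id for sample_id in sample_ids if sample_id not in test]
    return train, sorted(test)


# Hypothetical sample IDs; the real IDs come from the feature table.
sample_ids = ["sample{}".format(i) for i in range(100)]
train_ids, test_ids = split_holdout(sample_ids, num_test=30)
print(len(train_ids), len(test_ids))  # 70 30
```

Holding out 30 samples instead of 5 leaves less data for fitting, which is affordable here only because the dataset is large.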

#### `--p-summary-interval`
This is how often (in seconds) Songbird saves model fitting statistics to the `--o-regression-stats` output artifact. More frequent measurements help us diagnose more accurately whether Songbird's model is fitting the dataset reasonably. See [this section of Songbird's README](https://github.com/biocore/songbird#interpreting-model-fitting) for details on model fitting.

In [25]:
!qiime songbird multinomial \
    --i-table table-no-mitochondria-or-chloroplast.qza \
    --m-metadata-file $INPUT_SAMPLE_METADATA_PATH \
    --p-formula "C(sample_type_body_site, Treatment('sea water'))" \
    --p-epochs 5000 \
    --p-learning-rate 0.0001 \
    --p-num-random-test-examples 30 \
    --p-batch-size 10 \
    --p-summary-interval 1 \
    --verbose \
    --o-differentials songbird-differentials.qza \
    --o-regression-stats songbird-regression-stats.qza \
    --o-regression-biplot songbird-regression-biplot.qza

W0926 20:59:14.648060 140270108981056 deprecation_wrapper.py:119] From /home/mfedarko/.conda/envs/qiime2-2019.7/lib/python3.6/site-packages/songbird/q2/_method.py:50: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-09-26 20:59:14.648933: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-09-26 20:59:14.664770: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2600035000 Hz
2019-09-26 20:59:14.665349: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e0e5122ae0 executing computations on platform Host. Devices:
2019-09-26 20:59:14.665399: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
OMP: Info #212:

[Progress output condensed: a progress bar over 300,000 iterations running at roughly 1,100 it/s, with a TensorFlow `Profiler session started.` message logged about once per second. The captured output is cut off at about 97% (290,514 / 300,000), a little over 4 minutes 23 seconds into the run.]

### 5.2. Visualize Songbird model fitting statistics
For more information, check out [this section of the Songbird README](https://github.com/biocore/songbird#interpreting-model-fitting). Note that (as of writing) the version of Songbird used here is a bit older than my in-development version of it, so the QIIME 2 summary statistics here might look a bit funky. Essentially, the ordering of plots is switched from how it *will* be in future versions of these summaries: the `Loglikehood` _[sic]_ plot on the top corresponds to loss, a.k.a. the bottom plot in Tensorflow.

In [26]:
!qiime songbird summarize-single \
    --i-feature-table table-no-mitochondria-or-chloroplast.qza \
    --i-regression-stats songbird-regression-stats.qza \
    --o-visualization songbird-regression-summary.qzv

Saved Visualization to: songbird-regression-summary.qzv


## 6. Run Qurro!
We're doing this using Qurro v0.4.0. Note that the version of the "mackerel demo" up on [Qurro's website](https://biocore.github.io/qurro) will be updated as Qurro itself is updated (so although the underlying data should remain the same, the visualization interface might look a bit different/contain a few more features in a few months).

In [27]:
!qiime qurro differential-plot \
    --i-table table-no-mitochondria-or-chloroplast.qza \
    --i-ranks songbird-differentials.qza \
    --m-sample-metadata-file $INPUT_SAMPLE_METADATA_PATH \
    --m-feature-metadata-file taxonomy.qza \
    --verbose \
    --o-visualization qurro-plot.qzv

20973 feature(s) in the BIOM table were not present in the feature rankings.
These feature(s) have been removed from the visualization.
1247 sample(s) in the sample metadata file were not present in the BIOM table.
These sample(s) have been removed from the visualization.
Saved Visualization to: qurro-plot.qzv
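
The two messages in the output above describe ID matching between the inputs: features in the table that have no ranking are dropped, and samples in the metadata that are absent from the table are dropped. A minimal sketch of that matching with Python sets is shown below; it mirrors only these two messages, uses made-up IDs, and is not Qurro's implementation.

```python
def match_for_visualization(table_features, ranked_features,
                            table_samples, metadata_samples):
    """Report which feature and sample IDs are kept or dropped."""
    table_features = set(table_features)
    ranked_features = set(ranked_features)
    table_samples = set(table_samples)
    metadata_samples = set(metadata_samples)
    return {
        "kept_features": table_features & ranked_features,
        "dropped_features": table_features - ranked_features,
        "kept_samples": table_samples & metadata_samples,
        "dropped_metadata_samples": metadata_samples - table_samples,
    }


# Made-up IDs for illustration.
result = match_for_visualization(
    table_features=["f1", "f2", "f3"],
    ranked_features=["f1", "f2"],
    table_samples=["s1", "s2"],
    metadata_samples=["s1", "s2", "s3"],
)
print(sorted(result["dropped_features"]))          # ['f3']
print(sorted(result["dropped_metadata_samples"]))  # ['s3']
```

The large number of dropped metadata samples is expected here, since the metadata file covers the whole study while the table was filtered in section 2.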
