# Amazon urbanization project analysis protocol

Contributors: Laura-Isobel McCall, Chris Callewaert, Qiyun Zhu, Se Jin Song, James T. Morton

## Sample information

To be added.

## DNA data analysis

### Information

Qiita project
 - Project ID: **10333**
 - Title: Dominguez Sloan SAWesternization gradient
 - Barnacle project directory: `sloan_10333`
 - Prep IDs:
   - 16S: 1227, 1228, 1229, 1234
   - 18S: 1243
   - ITS: 1235

For 16S, download the auto-deblurred BIOM tables from Qiita
 - Against Greengenes release 13_8, 88% OTU
 - dflt_30888, dflt_30777, dflt_30890, dflt_30585

For 18S and ITS data, run deblur locally against the appropriate reference databases.
 - Using deblur version 1.0.2
 - 18S against Silva release 123, 80% OTU
 - ITS against UNITE release 7.1, 97% OTU
 
Note: These database releases were chosen because they were already deployed in Barnacle. To download them fresh, the links are [18S](https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_123_release.zip) and [ITS](https://unite.ut.ee/sh_files/sh_qiime_release_s_22.08.2016.zip). The guideline in the QIIME [website](http://qiime.org/home_static/dataFiles.html) was followed in selecting databases.
 
Note: The deblur parameters were set following the default setting in Qiita.

### Pre-process data

In [None]:
%%bash

# 16S: 4 preps (already automated in Qiita)
pos_ref_fp=/databases/gg/13_8/rep_set/88_otus.fasta
pos_ref_db_fp=/databases/gg/13_8/sortmerna/88_otus

# 18S: dflt_29852
pos_ref_fp=/databases/silva_18s/silva123/silva_18s/80_otus_18S.fasta
pos_ref_db_fp=/databases/silva_18s/silva123/silva_18S/80_otus_18S

# ITS: dflt_29828
pos_ref_fp=/databases/unite/7_1/sh_refs_qiime_ver7_97_s_22.08.2016.fasta
pos_ref_db_fp=/databases/unite/7_1/unite_ITS

In [None]:
%%bash
source activate deblurenv

deblur workflow \
  --seqs-fp seqs.fasta \
  --output-dir $outdir \
  --trim-length -1 \
  --pos-ref-fp ${pos_ref_fp} \
  --pos-ref-db-fp ${pos_ref_db_fp} \
  --min-reads 0

Drop all blank samples, and translate sample IDs to a simpler, uniform format.

In [None]:
%%python
from biom import load_table
from biom.util import biom_open

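# Load the deblurred table for one prep.
table = load_table('16S/prep_1227/dflt_30888.biom')

# Drop blank samples (any sample ID containing "blank", case-insensitive).
ids_to_keep = {x for x in table.ids() if 'blank' not in x.lower()}
table.filter(ids_to_keep=ids_to_keep, inplace=True)

# Translate sample IDs using a two-column, tab-delimited map (old ID, new ID).
with open('id_map.txt', 'r') as f:
    id_map = dict(x.split('\t') for x in f.read().splitlines())
table.update_ids(id_map=id_map)

# Write the cleaned table in HDF5 format.
with biom_open('prep_1227.biom', 'w') as f:
    table.to_hdf5(f, table.generated_by)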
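
The blank-dropping and ID-translation logic can be checked on plain lists before touching a BIOM table. The following is a minimal sketch, not part of the original pipeline; the function name `clean_sample_ids` and the example sample IDs are hypothetical.

```python
def clean_sample_ids(sample_ids, id_map):
    """Drop blank samples, then map each remaining ID to its new name.

    Raises KeyError if a retained sample has no entry in id_map, so that
    gaps in the mapping file are caught before the table is rewritten.
    """
    kept = [s for s in sample_ids if 'blank' not in s.lower()]
    missing = [s for s in kept if s not in id_map]
    if missing:
        raise KeyError('no mapping for: ' + ', '.join(missing))
    return {s: id_map[s] for s in kept}


# Hypothetical sample IDs, for illustration only.
ids = ['10333.BLANK1', '10333.A1.skin', '10333.A2.floor']
mapping = {'10333.A1.skin': 'A1', '10333.A2.floor': 'A2'}
renamed = clean_sample_ids(ids, mapping)
print(renamed)
```

The returned dictionary has the same shape as the `id_map` passed to `table.update_ids` above, restricted to the samples that survive the blank filter.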

Merge the four 16S BIOM tables into one.

In [None]:
%%bash
merge_otu_tables.py \
  --input_fps prep_1227.biom,prep_1228.biom,prep_1229.biom,prep_1234.biom \
  --output_fp 16S.biom

### Assign taxonomy

Get sequences from BIOM tables

In [None]:
%%bash
biom convert --to-tsv -i 16S.biom -o 16S.tsv
while read line
do
  echo '>'$line >> 16S.fa
  echo $line >> 16S.fa
done < <(cat 16S.tsv | grep -v '#' | cut -f1)
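
Because deblur tables use the sequence itself as the feature ID, each FASTA record has the sequence as both header and body. The same conversion can be sketched in Python as follows; the function name `tsv_to_fasta` and the example rows are hypothetical, and unlike `grep -v '#'` this version skips only lines that start with `#`.

```python
def tsv_to_fasta(tsv_lines):
    """Build FASTA text from the first column of a BIOM-derived TSV.

    Each feature ID is assumed to be the sequence itself, so it is
    written as both the record header and the record body.
    """
    records = []
    for line in tsv_lines:
        # Skip empty lines and the comment/header lines written by biom convert.
        if not line.strip() or line.startswith('#'):
            continue
        seq = line.rstrip('\n').split('\t')[0]
        records.append('>{0}\n{0}\n'.format(seq))
    return ''.join(records)


# Hypothetical table content, for illustration only.
example = [
    '# Constructed from biom file\n',
    '#OTU ID\tS1\tS2\n',
    'ACGT\t3\t0\n',
    'TTGA\t1\t5\n',
]
fasta = tsv_to_fasta(example)
print(fasta)
```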

Reference databases (the finest clustering level, 99%, was used)

In [None]:
%%bash

# 16S:
reference_seqs_fp=/databases/gg/13_8/rep_set/99_otus.fasta
id_to_taxonomy_fp=/databases/gg/13_8/taxonomy/99_otu_taxonomy.txt

# 18S:
reference_seqs_fp=rep_set/rep_set_18S_only/99/99_otus_18S.fasta
id_to_taxonomy_fp=taxonomy/18S_only/99/taxonomy_7_levels.txt

# ITS:
reference_seqs_fp=sh_refs_qiime_ver7_99_s_22.08.2016.fasta
id_to_taxonomy_fp=sh_taxonomy_qiime_ver7_99_s_22.08.2016.txt

Assign taxonomy using the [SortMeRNA](https://github.com/biocore/sortmerna) ([Kopylova, Noé and Touzet, 2012](https://academic.oup.com/bioinformatics/article/28/24/3211/246053)) method

In [None]:
%%bash
assign_taxonomy.py \
  --input_fasta_fp 16S.fa \
  --output_dir 16S \
  --reference_seqs_fp ${reference_seqs_fp} \
  --id_to_taxonomy_fp ${id_to_taxonomy_fp} \
  --assignment_method sortmerna

Check assignment ratio

In [None]:
%%bash
# total sequences
cat 16S/16S_tax_assignments.txt | tail -n+2 | wc -l
# unassigned sequences
cat 16S/16S_tax_assignments.txt | tail -n+2 | grep $'\t'Unassigned$'\t' | wc -l

Unassignment ratios:
 - 16S: 7501 / 210370 = 3.57%
 - 18S: 6688 / 29778 = 22.46%
 - ITS: 5032 / 47062 = 10.69%
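
The percentages can be recomputed from the two counts printed by the cell above. This is a small sketch using the counts listed in this section; the function name `unassigned_percent` is hypothetical.

```python
def unassigned_percent(n_unassigned, n_total):
    """Return the share of unassigned sequences as a percentage."""
    return 100.0 * n_unassigned / n_total


# (unassigned, total) counts as reported above.
counts = {
    '16S': (7501, 210370),
    '18S': (6688, 29778),
    'ITS': (5032, 47062),
}
percents = {name: round(unassigned_percent(u, t), 2)
            for name, (u, t) in counts.items()}
for name in sorted(percents):
    print(name, percents[name])
```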

Append taxonomy to BIOM tables

In [None]:
%%bash
biom add-metadata \
  --input-fp 16S.biom \
  --output-fp 16S.wtax.biom \
  --observation-metadata-fp 16S/16S_tax_assignments.txt \
  --observation-header OTUID,taxonomy \
  --sc-separated taxonomy

There were non-standard characters in the ITS assignment result (specifically, `s__Montagnula_aloës`), which caused an error when running `biom add-metadata`. We followed the protocol [here](https://groups.google.com/forum/#!topic/qiime-forum/W6NqdoWhNfI) to resolve the issue.

### Post-process data

#### For 16S, perform bloom-filtering

Using the script and references provided in [Amir et al. (2017)](http://msystems.asm.org/content/2/2/e00199-16).

In [None]:
%%bash
python filterbiomseqs.py -i 16S.biom -o 16S.bf.biom -f newbloom.10.fna

No bloom sequences were found, so this step was omitted.

#### Filter out sequences with <10 counts study-wide.

In [None]:
%%bash
filter_otus_from_otu_table.py -i 16S.biom -o 16S.n10.biom -n 10
filter_otus_from_otu_table.py -i 16S.wtax.biom -o 16S.wtax.n10.biom -n 10
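
The effect of the `-n 10` threshold can be illustrated on a toy table: a sequence is kept only if its counts summed over all samples reach the minimum. This is a sketch on a plain dictionary, not the QIIME implementation; the function name `filter_min_count` and the example counts are hypothetical.

```python
def filter_min_count(table, min_count=10):
    """Keep features whose study-wide total count is at least min_count.

    table maps feature ID -> {sample ID: count}.
    """
    return {feature: per_sample
            for feature, per_sample in table.items()
            if sum(per_sample.values()) >= min_count}


# Hypothetical counts, for illustration only.
toy = {
    'seqA': {'S1': 6, 'S2': 4},   # total 10, kept
    'seqB': {'S1': 9, 'S2': 0},   # total 9, dropped
    'seqC': {'S1': 0, 'S2': 25},  # total 25, kept
}
kept = filter_min_count(toy, min_count=10)
print(sorted(kept))
```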

#### For 18S, perform taxonomic filtering

In [None]:
%%bash
# no fungi
filter_taxa_from_otu_table.py -i 18S.biom -n "D_3__Fungi" -o 18S.noFungi.biom
# animals only
filter_taxa_from_otu_table.py -i 18S.biom -p "D_3__Metazoa (Animalia)" -o 18S.animals.biom
# plants only (green algae and land plants)
filter_taxa_from_otu_table.py -i 18S.biom -p "D_2__Chloroplastida" -o 18S.plants.biom
# no animals, plants, or fungi
filter_taxa_from_otu_table.py -i 18S.biom -n "D_3__Fungi,D_3__Metazoa (Animalia),D_2__Chloroplastida" -o 18S.noAPF.biom

#### Rarefaction

After examining the BIOM table summaries, we decided to use a sampling depth of 1000 for all three data types.

In [None]:
%%bash
filter_samples_from_otu_table.py -i 16S.biom -o 16S.mc1000.biom -n 1000
single_rarefaction.py -i 16S.mc1000.biom -o 16S.even1000.biom -d 1000
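
Conceptually, the two commands drop samples with fewer than 1000 reads and then subsample each remaining sample, without replacement, down to exactly 1000 reads. The following sketch shows that idea for a single sample using only the standard library; it is not the QIIME implementation, and the function name `rarefy` and the example counts are hypothetical.

```python
import random


def rarefy(counts, depth, seed=None):
    """Subsample one sample's counts to `depth` reads without replacement.

    counts maps feature ID -> read count. Returns None when the sample
    has fewer than `depth` reads, i.e. it would be discarded.
    """
    if sum(counts.values()) < depth:
        return None
    rng = random.Random(seed)
    # Expand counts into one entry per read, then draw `depth` reads.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    rarefied = {}
    for feature in rng.sample(pool, depth):
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied


# Hypothetical sample, for illustration only.
sample = {'seqA': 900, 'seqB': 300, 'seqC': 50}
even = rarefy(sample, 1000, seed=0)
shallow = rarefy({'seqA': 10}, 1000, seed=0)
print(sum(even.values()), shallow)
```

Expanding counts into a per-read list is simple but memory-hungry; it is adequate for a depth of 1000 and is used here only to keep the sketch short.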

From this point on, all analyses are based on `16S.even1000.biom` unless otherwise stated.

#### Filter human vs house samples

In [None]:
%%bash
filter_samples_from_otu_table.py -i 16S.biom -m metadata.txt -s 'host_type:human' -o 16S.human.biom
filter_samples_from_otu_table.py -i 16S.biom -m metadata.txt -s 'host_type:house' -o 16S.house.biom

### Taxonomic profile

In [None]:
%%bash
sort_otu_table.py -i 16S.biom -o 16S.sorted.biom
summarize_taxa.py -i 16S.sorted.biom -o 16S

### Alpha diversity

In [None]:
%%bash
multiple_rarefactions.py -i 16S.mc1000.biom -m 10 -x 1000 -s 99 -o 16S.multi
alpha_diversity.py -i 16S.multi -o 16S.alpha --metrics observed_otus,chao1,shannon
collate_alpha.py -i 16S.alpha -o 16S
rm -rf 16S.alpha 16S.multi
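
For reference, two of the metrics requested above can be written out directly. This is a minimal sketch, not the QIIME or scikit-bio code; the logarithm base of 2 for Shannon entropy is an assumption about the default, and the function names and example counts are hypothetical.

```python
import math


def observed_otus(counts):
    """Number of features with a non-zero count in one sample."""
    return sum(1 for c in counts if c > 0)


def shannon(counts, base=2):
    """Shannon entropy of one sample's counts (base 2 assumed here)."""
    total = float(sum(counts))
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log(p, base)
    return entropy


# Hypothetical counts for one rarefied sample, for illustration only.
sample_counts = [500, 250, 250, 0]
richness = observed_otus(sample_counts)
diversity = shannon(sample_counts)
print(richness, diversity)
```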

### Beta diversity

In [None]:
%%bash
beta_diversity.py -i 16S.biom -o 16S --metrics bray_curtis
principal_coordinates.py -i 16S/bray_curtis_16S.txt -o 16S/bray_curtis_16S.pcoa
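
The Bray-Curtis dissimilarity between two samples is the sum of absolute count differences divided by the sum of all counts in both samples. The sketch below computes it for a pair of samples and for a small set of samples; it is illustrative only, and the function names and example counts are hypothetical.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    # Two empty samples are treated as identical in this sketch.
    return numerator / float(denominator) if denominator else 0.0


def distance_matrix(samples):
    """Pairwise Bray-Curtis dissimilarities as a nested dictionary."""
    return {i: {j: bray_curtis(samples[i], samples[j]) for j in samples}
            for i in samples}


# Hypothetical counts over the same three features, for illustration only.
samples = {
    'S1': [6, 4, 0],
    'S2': [4, 6, 0],
    'S3': [0, 0, 10],
}
dm = distance_matrix(samples)
print(dm['S1']['S2'], dm['S1']['S3'])
```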

## MS data analysis

To be added.

## Multi-omics analysis

We applied the Partial Least Squares Singular Value Decomposition (**PLSSVD**) method ([Kapono et al., 2018](https://www.nature.com/articles/s41598-018-21541-4)) to explore correlations between the microbiome and metabolome data, and between these data and the sample metadata.

The source code is in the `plssvd` directory. It was adapted from the [original source code](https://github.com/knightlab-analyses/office-study/tree/master/ipynb) used in Kapono et al. (2018).

In [None]:
%%bash
python plssvd.py \
  metadata/house.txt \
  microbes/ITS.biom.qza \
  metabolites/all.biom.qza \
  microbes/label/ITS.txt \
  metabolites/label/trim.txt \
  > ITS_all_house.log

## Obsolete analyses

#### Relevant metadata fields
 - village (Checherta, Puerto Almendras, Iquitos, Manaus)
 - socioeconomic_level (low, middle) (Manaus only)
 - village_socio (Checherta, Puerto Almendras, Iquitos, Manaus low, Manaus middle)
  - village_socio_number (1, 2, 3, 4, 5)
 - accult_score
 - age_category (baby: 0-0.5, infant: 0.5-3, child: 3-12, teen: 13-17, adult: 18+)
 - collection_year (2012, 2013) (Manaus only)
 - host_type (**human**, **house**, animal, water)
 - host_or_room
  - animals: anaconda, capuchin, cat, chicken, dog, monkey, parrot, sloth, turtle
  - rooms: bathroom, bedroom, kitchen, living
  - human
  - water
 - sample_site_general
  - animal: anal, nose, oral, skin
  - human: anal, areola, feces, nose, oral, skin
  - bed, chair handle, countertop, crib, cup, faucet, fire beam, floor, hammock, matress, table, wall, water container
  - water
  - misc: Swab PERU, Swab USA
  * In "sample_site", human skin is further divided into r arm, r foot and r hand.
 - metabolites: X230_248_320, X304_301_355, X332_335_381, X343_296_294, X369_384_393, X384_383_402, X599_437_371, X661_451_401, X705_477_401
 - sum of metabolite abundances: sum_369_384, sum_396_332, sum_599_437, sum_705_484

Create single-column metadata tables:

In [2]:
%%bash
cat metadata.txt | cut -f1,5 | grep -v "BLANK" | grep -v "NA" > village.txt


For certain methods and test sets, it was necessary to filter out unused samples from the input BIOM table:
 - e.g., when comparing 2012 vs 2013 for Manaus samples.

In [None]:
%%bash
filter_samples_from_otu_table.py \
  --input_fp 16S.biom \
  --output_fp 16S.fil.biom \
  --sample_id_fp $category.txt

#### observation-metadata correlation
 - Using Pearson or Spearman tests

In [None]:
%%bash
# method = pearson or spearman
observation_metadata_correlation.py \
  --otu_table_fp 16S.biom \
  --output_fp $category.txt \
  --mapping_fp design/$category.txt \
  --category $category \
  --test $method
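
The two tests differ only in what they correlate: Pearson uses the raw values, while Spearman applies the same formula to ranks (with tied values sharing their average rank). The sketch below computes only the correlation coefficients, not the p-values the QIIME script also reports; the function names and example data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = float(len(x))
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values given the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical feature abundances and a metadata value, for illustration only.
abundance = [1, 2, 3, 4]
score = [1, 4, 9, 16]
r = pearson(abundance, score)
rho = spearman(abundance, score)
print(r, rho)
```

In this example the relationship is monotonic but not linear, so the Spearman coefficient is 1 while the Pearson coefficient is slightly lower.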

#### Two-level metadata columns for LEfSe
 - host_type,host_or_room
 - sample_site_general,sample_site

#### Supervised classification
 - Using the **random forest** method
 - Categories: village_socio_number

In [3]:
%%bash
supervised_learning.py \
  --input_data 16S.biom \
  --output_fp $category/16S \
  --mapping_fp $category.txt \
  --category $category


#### Identify significantly differential taxa
 - Using DESeq2 1.14.1

In [None]:
%%bash
differential_abundance.py \
  -a DESeq2_nbinom -d \
  -i 16S.biom \
  -o $category/16S.txt \
  -m $category.txt \
  -c $category \
  -x $subcat1 -y $subcat2

#### Compare categories based on distance matrices
 - Using the [adonis](http://cc.oulu.fi/~jarioksa/softhelp/vegan/html/adonis.html) method as implemented in vegan 2.4-4.

In [None]:
%%bash
compare_categories.py \
  --method adonis \
  --input_dm bray_curtis_16S.txt \
  --output_dir $category/16S \
  --mapping_file $category.txt \
  --categories $category \
  --num_permutations 999

SourceTracker

PLSSVD

LEfSe

Procrustes