# Environmental analysis

![scheme](img01.jpg)

RW: Raw Water before treatment

BC: Before Chlorination

FW: Finished Water 

DS1-DS3: tap water

PB1-PB2: biofilm in pipes

WM: biofilm in water meters

We are going to study how the microbial populations vary between the water before it enters the circuit (RW), the water at the end of the treatment (FW) and the tap water sampled in three houses (DS1, DS2 and DS3). We will also compare samples taken in June and July.

I propose that you run the pipeline I showed you in Unit 2 with this dataset and answer the following questions:

1)	How many reads do we have for each sample?

2)	Which trimming length are you using for the denoising step?

3)	How many ASVs do you have before filtering the deblur table? And after filtering?

4)	What is the average frequency of sequences per sample before filtering the deblur table?

5)	Which sample has the lowest number of sequences after filtering the deblur table? And which has the highest?

6)	Which is the most abundant phylum in each sample?

7)	Does the study have enough coverage to allow any statistical inference on the communities’ diversity?

8)	Looking at the weighted UniFrac PCoA plot, is there any effect of water treatment or sampling time on the bacterial communities?

9)	If we compare untreated samples (RW) with treated samples (FW, DS1, DS2, DS3), which phyla or classes explain the differences between the two groups? (Hint: use LEfSe)

**Dataset Contents**:

1)	fastq folder: raw sample sequences

2)	85_otus.fasta and 85_otu_taxonomy.txt: taxonomy database

3)	metadata.txt: sample metadata. I have included some columns useful for diversity analyses.

4)	primers.txt: information on primers used for 16S PCR amplification

5)	quiz.docx: this file

6)	samplemanifest: manifest file with the location of the fastq files and their corresponding tags.

## Responses

1)	How many reads do we have for each sample?


>Counted with *grep* on the fastq files and verified after import into *qiime*.

| File              |#reads|
| :------------------|--------:|
|SRR3593621_1.fastq |52945| 
|SRR3593621_2.fastq |52945|
|SRR3593622_1.fastq |62218|
|SRR3593622_2.fastq |62218|
|SRR3593623_1.fastq |92740|
|SRR3593623_2.fastq |92740|
|SRR3593625_1.fastq |70366|
|SRR3593625_2.fastq |70366|
|SRR3593627_1.fastq |100615|
|SRR3593627_2.fastq |100615|
|SRR3593628_1.fastq |78495|
|SRR3593628_2.fastq |78495|
|SRR3593631_1.fastq |97332|
|SRR3593631_2.fastq |97332|
|SRR3593632_1.fastq |84361|
|SRR3593632_2.fastq |84361|
|SRR3593664_1.fastq |101827|
|SRR3593664_2.fastq |101827|
|SRR3593665_1.fastq |84850|
|SRR3593665_2.fastq |84850|
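
The *grep* count relies on every header line containing `@SRR`. As a cross-check, the reads can also be counted as the number of lines divided by four. The following is a minimal sketch, assuming uncompressed FASTQ files with unwrapped four-line records; the helper names `count_fastq_reads` and `count_reads_in_folder` are illustrative and not part of the course pipeline.

```python
from pathlib import Path


def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file, assuming 4 lines per record."""
    with open(path, "r") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def count_reads_in_folder(folder, pattern="SRR*.fastq"):
    """Return {file name: read count} for every FASTQ file matching the pattern."""
    return {p.name: count_fastq_reads(p) for p in sorted(Path(folder).glob(pattern))}
```

Calling `count_reads_in_folder("fastq")` on the dataset folder should give the same number for the `_1` and `_2` files of each sample, as in the table above.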


2)	Which trimming length are you using for the denoising step?

> The quality of the reads increases substantially after the VSEARCH merge step, a consequence of the large overlap between forward and reverse reads. In fact, filtering with 'qiime quality-filter q-score-joined' does not drop any reads.

>I chose a trim length of $372$, which retains $Q \geq 38$.
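
One way to make this choice reproducible is to take the longest prefix length over which the per-position quality stays at or above the threshold. The sketch below assumes the criterion is applied to the per-position median quality shown in the interactive quality plot; the function name and the example values are hypothetical and are not taken from this dataset.

```python
def longest_trim_length(median_quality, min_q):
    """Return the largest length L such that the first L positions all have
    median quality >= min_q (0 if the first position already fails)."""
    length = 0
    for q in median_quality:
        if q < min_q:
            break
        length += 1
    return length


# Hypothetical per-position medians, NOT taken from this dataset.
example_medians = [38, 38, 39, 38, 38, 37, 36]
print(longest_trim_length(example_medians, min_q=38))
```

With the real per-position medians exported from `joined-filt-reads.qzv`, the same function would return the candidate value for `--p-trim-length`.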

3)	How many ASVs do you have before filtering the deblur table? And after filtering?

>Before filtering there are $765$ features; after filtering, $104$.

4)	What is the average frequency of sequences per sample before filtering the deblur table?

>$30,070$

5)	Which sample has the lowest number of sequences after filtering the deblur table? And which has the highest?

The sample $SRR3593622$ has the lowest number of sequences: $15,296$.

The sample $SRR3593631$ has the highest number of sequences: $41,934$.

6)	Which is the most abundant phylum in each sample?

The most abundant phylum of each sample is marked in the table of figure 2.


![Taxonomy by phylum](taxonomy.jpg)

7)	Does the study have enough coverage to allow any statistical inference on the communities' diversity?

Yes. The rarefaction curves flatten out towards an asymptote, so additional reads would not be expected to add observed OTUs (figure 3).

![Rarefaction curves](rarefaction.jpg)
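
The saturation argument can be illustrated with a toy rarefaction: subsample the reads of one sample without replacement at increasing depths and count how many distinct features are observed. This is only a sketch using the standard library; the counts are invented for illustration, and the real curves come from `qiime diversity alpha-rarefaction` on the filtered table.

```python
import random


def observed_features_at_depth(counts, depth, seed=0):
    """Subsample `depth` reads without replacement and count distinct features."""
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    return len(set(rng.sample(pool, depth)))


def rarefaction_curve(counts, depths, seed=0):
    """Return [(depth, observed features)] for each requested depth."""
    return [(d, observed_features_at_depth(counts, d, seed)) for d in depths]


# Hypothetical counts for one sample (feature -> reads), for illustration only.
toy_sample = {"asv_a": 500, "asv_b": 300, "asv_c": 150, "asv_d": 45, "asv_e": 5}
for depth, observed in rarefaction_curve(toy_sample, [10, 100, 500, 1000]):
    print(depth, observed)
```

A curve that stops rising well before the maximum depth is what "enough coverage" means here: deeper sequencing would mostly resample features that have already been seen.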

8)	Looking at the weighted UniFrac PCoA plot, is there any effect of water treatment or sampling time on the bacterial communities?

The treated-water groups cluster together, close to the axis2-axis3 plane, while the untreated-water groups lie far from them along axis1. Since axis1 explains 43% of the variation, we conclude that the treated groups have more in common with each other than with the untreated groups (figure 4).

The plot also shows that the DS3 groups differ from the other treated groups, sitting higher on axis2. The presence of biofilm in the distribution system is a likely explanation.

![PCoA groups](pcoa_groups.png)

The differences between months are less relevant, as seen in figure 5. Among the treated groups, the June samples cluster closer to axis2.

![PCoA months](pcoa_months.png)

9)	If we compare untreated samples (RW) with treated samples (FW, DS1, DS2, DS3), which phyla or classes explain the differences between the two groups? (Hint: use LEfSe)

The classes are shown in the LDA plot of figure 6 (TT: treated, TN: untreated). The most explanatory ones are, respectively, Cyanobacteria and Spirochaetes + OP11, consistent with the table of question 6.

The plot also shows the higher diversity of the untreated water, as expected.

![LDA treated/not treated](lda_tt_nt.png)

## Pipeline

### Preprocessing and quality check


In [5]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import matplotlib.pyplot as plt

FILE_ID = "SRR"
FASTQ_STR = "@SRR"

In [103]:
%%bash -s "$FILE_ID" "$FASTQ_STR"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_2/fastq
head -4 ${FILE_ID}*fastq
grep -c $FASTQ_STR ${FILE_ID}*fastq
EOT

==> SRR3593621_1.fastq <==
@SRR3593621.1 1 length=300
TACGTAGGGTGCAAGCGTTATCCGGATTCACTGTTCTTCCCTATTCGTTTAGTTTTTTTTTTCCTTCTCACTTCACAGCCCTTTGCTTTACCTCGTCCTTCTTTTCTTCTTTACTATACTCGTTTTTTATATTTGTACGTGGTTCTCCTTTTTGTTCTGTGCCTTGCGTTGTGTTCTTGTTGTACCCCAATTGCCATTGCTCCTTTCTTCTTCATTCCTGTCACTCTTCCACGCAAGCTATCGTACTCCATCAGTTTAGTCCCCTCCTTTTTTCTAGCCCTCAACTCTTCCCTGCTAGTT
+SRR3593621.1 1 length=300
<6BCCGGGGGGGGGGGGGG7CEE6+,8@@,;C,,,<,<,,,,,,;6,;,6,,,;-,:+8+86,,,<9,:,6CE,,,,,,66,,,,,5:,,6696,,8,89,,,9,<,,,,9,,,:,<,,95+,+++9+,9,,,,,,,,+,,+++:>;,74,4,+,,5,,8,,,,8,,+66+++,66,7,,,,,,:,:5++,,+2,,,,+3+74<2=2@,,+22,5,5*4+,4*/5++++3*.)*)((//++(-(((*+*))+)))3.)))++,/;*/(/)//587))6((,(*),-.).)-44)43)).-

==> SRR3593621_2.fastq <==
@SRR3593621.1 1 length=300
TTTCCTCTTTCTCTCTTCCTCCCCACCCTTCCTCCTTCCCTCTTTCTCTTCCTTCTCTCCTCACATTCTCCTCCCTCCCCTGCTTCCCTTTATTTCTCCCTCTACTCCTCTCTCTTCTCTGCTTCTCTCCTCTACCTTTCGTTTCTCAGTGTCATCTATTTCCCTGCCCGTTGCCTTCTCCTTTCTTCTTCCTCCTCATCTCCACTCATTTCACTTCTCCTCCATGCTTTCCCCTTTCCTCTATCACCCTCCACTCTCTTCC

In [104]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
source activate qiime2-2018.11
cd Documentos/Tema_2
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
                   --input-path samplemanifest \
                   --output-path paired-end-demux.qza \
                   --input-format PairedEndFastqManifestPhred33
EOT

Imported samplemanifest as PairedEndFastqManifestPhred33 to paired-end-demux.qza
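
For the `PairedEndFastqManifestPhred33` format used in this import, the manifest is a CSV file with the header `sample-id,absolute-filepath,direction`. The sketch below shows how such a file could be generated from the `_1`/`_2` naming of the fastq files; the function name is an assumption, as is the use of the SRR accession as sample id, and the `samplemanifest` shipped with the dataset remains the reference.

```python
import csv
import io
from pathlib import Path


def build_manifest(fastq_dir):
    """Return manifest CSV text for files named <sample>_1.fastq / <sample>_2.fastq."""
    directions = {"1": "forward", "2": "reverse"}
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample-id", "absolute-filepath", "direction"])
    for path in sorted(Path(fastq_dir).glob("*_[12].fastq")):
        # "SRR3593621_1" -> sample id "SRR3593621", mate "1"
        sample_id, mate = path.stem.rsplit("_", 1)
        writer.writerow([sample_id, str(path.resolve()), directions[mate]])
    return buffer.getvalue()
```

Writing the returned text to a file and passing it to `--input-path` would reproduce the import step, provided the paths are valid on the machine where QIIME 2 runs.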


In [105]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
source activate qiime2-2018.11
cd Documentos/Tema_2
qiime demux summarize --i-data paired-end-demux.qza --o-visualization paired-end-demux.qzv
EOT

Saved Visualization to: paired-end-demux.qzv


### Determination of ASV using Deblur

#### Sample pre-processing

In [106]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
source activate qiime2-2018.11
cd Documentos/Tema_2

# Deblur does not currently support unjoined paired-end reads,
# so we have to use VSEARCH to merge the reads:
qiime vsearch join-pairs --i-demultiplexed-seqs paired-end-demux.qza \
                         --o-joined-sequences joined-reads.qza
qiime demux summarize --i-data joined-reads.qza --o-visualization joined-reads.qzv

# Filter the reads according to their quality.
qiime quality-filter q-score-joined --i-demux joined-reads.qza \
                                    --o-filter-stats filt_stats.qza \
                                    --o-filtered-sequences joined-filt-reads.qza

qiime demux summarize --i-data joined-filt-reads.qza --o-visualization joined-filt-reads.qzv

EOT

Saved SampleData[JoinedSequencesWithQuality] to: joined-reads.qza
Saved Visualization to: joined-reads.qzv
Saved SampleData[JoinedSequencesWithQuality] to: joined-filt-reads.qza
Saved QualityFilterStats to: filt_stats.qza
Saved Visualization to: joined-filt-reads.qzv


#### Determination of ASV/features

In [107]:
%%bash 
ssh microbioinf@192.168.56.101 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
source activate qiime2-2018.11
cd Documentos/Tema_2

rm -rf deblur_output
qiime deblur denoise-16S --i-demultiplexed-seqs joined-filt-reads.qza \
                         --p-trim-length 372 \
                         --p-sample-stats \
                         --p-jobs-to-start 2 \
                         --p-min-reads 1 \
                         --output-dir deblur_output

qiime feature-table summarize --i-table deblur_output/table.qza --o-visualization deblur_output/deblur_table_summary.qzv


EOT

Saved FeatureTable[Frequency] to: deblur_output/table.qza
Saved FeatureData[Sequence] to: deblur_output/representative_sequences.qza
Saved DeblurStats to: deblur_output/stats.qza
Saved Visualization to: deblur_output/deblur_table_summary.qzv


In [108]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
source activate qiime2-2018.11
cd Documentos/Tema_2

# Exclude sOTUs whose frequency is below 0.1% of the mean depth (30,070 * 0.001 is about 30).
# This threshold is meant to exclude sOTUs that are due to Illumina sequencing errors (0.1% of total reads).

qiime feature-table filter-features --i-table deblur_output/table.qza \
                                    --p-min-frequency 30 \
                                    --p-min-samples 1 \
                                    --o-filtered-table deblur_output/deblur_table_filt.qza

# Keep only the representative sequences of the sOTUs retained in the filtered table
qiime feature-table filter-seqs --i-data deblur_output/representative_sequences.qza \
                                --i-table deblur_output/deblur_table_filt.qza \
                                --o-filtered-data deblur_output/rep_seqs_filt.qza


# Summarize
qiime feature-table summarize --i-table deblur_output/deblur_table_filt.qza --o-visualization deblur_output/deblur_table_filt_summary.qzv

EOT

Saved FeatureTable[Frequency] to: deblur_output/deblur_table_filt.qza
Saved FeatureData[Sequence] to: deblur_output/rep_seqs_filt.qza
Saved Visualization to: deblur_output/deblur_table_filt_summary.qzv
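
The value passed to `--p-min-frequency` follows from the rule stated in the comment of this cell: 0.1% of the mean per-sample depth of the unfiltered table, which was 30,070 sequences. A minimal sketch of that arithmetic, with a helper name chosen here for illustration:

```python
def min_frequency_threshold(mean_depth, fraction=0.001):
    """Feature-frequency cutoff as a fraction of the mean per-sample depth."""
    return round(mean_depth * fraction)


# Mean frequency per sample before filtering, from the answer to question 4.
print(min_frequency_threshold(30070))
```

Keeping the cutoff as a function of the mean depth makes it easy to recompute if the denoising parameters, and therefore the table, change.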


### Phylogenetic distances determination using FastTree

In [112]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
source activate qiime2-2018.11
cd Documentos/Tema_2
rm -rf tree_out
mkdir tree_out
#Sequence alignment
qiime alignment mafft --i-sequences deblur_output/rep_seqs_filt.qza \
                      --p-n-threads 3 \
                      --o-alignment tree_out/rep_seqs_filt_aligned.qza

#Mask hypervariable regions
qiime alignment mask --i-alignment tree_out/rep_seqs_filt_aligned.qza \
                     --o-masked-alignment tree_out/rep_seqs_filt_aligned_masked.qza

#Calculate phylogeny

qiime phylogeny fasttree --i-alignment tree_out/rep_seqs_filt_aligned_masked.qza \
                         --p-n-threads 2 \
                         --o-tree tree_out/rep_seqs_filt_aligned_masked_tree.qza

#Root the tree

qiime phylogeny midpoint-root --i-tree tree_out/rep_seqs_filt_aligned_masked_tree.qza \
                              --o-rooted-tree tree_out/rep_seqs_filt_aligned_masked_tree_rooted.qza
                     
EOT

Saved FeatureData[AlignedSequence] to: tree_out/rep_seqs_filt_aligned.qza
Saved FeatureData[AlignedSequence] to: tree_out/rep_seqs_filt_aligned_masked.qza
Saved Phylogeny[Unrooted] to: tree_out/rep_seqs_filt_aligned_masked_tree.qza
Saved Phylogeny[Rooted] to: tree_out/rep_seqs_filt_aligned_masked_tree_rooted.qza


### Taxonomic assignment

#### Assignment database training

In [113]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
source activate qiime2-2018.11
cd Documentos/Tema_2

#Import reference sequences
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 85_otus.fasta \
  --output-path 85_otus.qza

#Import reference taxonomy
qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 85_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza
  
EOT

Imported 85_otus.fasta as DNASequencesDirectoryFormat to 85_otus.qza
Imported 85_otu_taxonomy.txt as HeaderlessTSVTaxonomyFormat to ref-taxonomy.qza


In [115]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
source activate qiime2-2018.11
cd Documentos/Tema_2

#Trim reference sequences to the region between the primers listed in primers.txt
qiime feature-classifier extract-reads \
  --i-sequences 85_otus.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer CCGTCAATTCMTTTRAGTTT \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads ref-seqs.qza

#Generate classifier naive-bayes
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza
  
EOT

Saved FeatureData[Sequence] to: ref-seqs.qza
Saved TaxonomicClassifier to: classifier.qza


#### Taxonomic assignment of representative sequences

In [116]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
source activate qiime2-2018.11
cd Documentos/Tema_2

rm -rf taxa
#Taxonomic assignment of the representative sequences of each sOTU
qiime feature-classifier classify-sklearn --i-reads deblur_output/rep_seqs_filt.qza \
--i-classifier classifier.qza \
--p-n-jobs 2 \
--output-dir taxa

#Export to tabular file
qiime tools export --input-path taxa/classification.qza --output-path taxa

#Obtain interactive graph to visualize the abundance in each sOTU per sample
qiime taxa barplot --i-table deblur_output/deblur_table_filt.qza \
--i-taxonomy taxa/classification.qza \
--m-metadata-file metadata.txt \
--o-visualization taxa/taxa_barplot.qzv

EOT

Saved FeatureData[Taxonomy] to: taxa/classification.qza
Exported taxa/classification.qza as TSVTaxonomyDirectoryFormat to directory taxa
Saved Visualization to: taxa/taxa_barplot.qzv


### Diversity analysis

#### Alpha diversity

In [118]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
source activate qiime2-2018.11
cd Documentos/Tema_2

# Obtain rarefaction curves with max-depth = number of reads of the richest sample
qiime diversity alpha-rarefaction --i-table deblur_output/deblur_table_filt.qza \
                                  --p-max-depth 41934 \
                                  --p-steps 20 \
                                  --i-phylogeny tree_out/rep_seqs_filt_aligned_masked_tree_rooted.qza \
                                  --m-metadata-file metadata.txt \
                                  --o-visualization rarefaction_curves.qzv
                                  
EOT

Saved Visualization to: rarefaction_curves.qzv


#### Beta diversity

In [119]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
source activate qiime2-2018.11
cd Documentos/Tema_2
rm -rf diversity
# Compute core diversity metrics with sampling-depth 15296 = number of reads of the poorest sample
qiime diversity core-metrics-phylogenetic --i-table deblur_output/deblur_table_filt.qza \
                                          --i-phylogeny tree_out/rep_seqs_filt_aligned_masked_tree_rooted.qza \
                                          --p-sampling-depth 15296 \
                                          --m-metadata-file metadata.txt \
                                          --p-n-jobs 2 \
                                          --output-dir diversity

EOT

Saved FeatureTable[Frequency] to: diversity/rarefied_table.qza
Saved SampleData[AlphaDiversity] % Properties(['phylogenetic']) to: diversity/faith_pd_vector.qza
Saved SampleData[AlphaDiversity] to: diversity/observed_otus_vector.qza
Saved SampleData[AlphaDiversity] to: diversity/shannon_vector.qza
Saved SampleData[AlphaDiversity] to: diversity/evenness_vector.qza
Saved DistanceMatrix % Properties(['phylogenetic']) to: diversity/unweighted_unifrac_distance_matrix.qza
Saved DistanceMatrix % Properties(['phylogenetic']) to: diversity/weighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: diversity/jaccard_distance_matrix.qza
Saved DistanceMatrix to: diversity/bray_curtis_distance_matrix.qza
Saved PCoAResults to: diversity/unweighted_unifrac_pcoa_results.qza
Saved PCoAResults to: diversity/weighted_unifrac_pcoa_results.qza
Saved PCoAResults to: diversity/jaccard_pcoa_results.qza
Saved PCoAResults to: diversity/bray_curtis_pcoa_results.qza
Saved Visualization to: diversity/unweighted

In [120]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<"EOT"
export PATH=$PATH:/home/microbioinf/miniconda3/bin
source activate qiime2-2018.11
cd Documentos/Tema_2
qiime taxa collapse --i-table deblur_output/deblur_table_filt.qza \
                    --o-collapsed-table deblur_output/L3_collapse_table.qza \
                    --p-level 3 \
                    --i-taxonomy taxa/classification.qza
                    
qiime tools export --input-path deblur_output/L3_collapse_table.qza \
                   --output-path lefse_table/

biom convert -i lefse_table/feature-table.biom \
             -o lefse_table/feature-table.txt \
             --header-key "taxonomy" --to-tsv
             
EOT

Saved FeatureTable[Frequency] to: deblur_output/L3_collapse_table.qza
Exported deblur_output/L3_collapse_table.qza as BIOMV210DirFmt to directory lefse_table/


## Outputs

In [128]:
%%bash
jupyter nbconvert --to=latex --template=~/report.tplx environmental_population_analysis.ipynb 1>/dev/null 2>/dev/null
/Library/TeX/texbin/pdflatex -shell-escape environmental_population_analysis 1>/dev/null 2>/dev/null
jupyter nbconvert --to html_toc environmental_population_analysis.ipynb 1>/dev/null 2>/dev/null