# 01 - Generate ASVs using DADA2
## Pipeline Description
This is the 16S rRNA amplicon soil data from CeMiST, re-analyzed with QIIME2 to obtain Amplicon Sequence Variants (ASVs) for downstream analysis. The V3-V4 region of the bacterial 16S rRNA gene was amplified with the following primers (the expected product is around 464 bp):
* _341F (5'-CCTACGGGNGGCWGCAG-3')_: len 17
* _805R (5'-GACTACHVGGGTATCTAATCC-3')_: len 21
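Both primers contain IUPAC degenerate bases (N, W, H, V), so a plain string search will not find them in the reads. A minimal sketch, using only the standard library, of how the two primers could be turned into regular expressions. The helper names `primer_to_regex` and `degeneracy` are illustrative and are not part of QIIME2 or cutadapt:

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}

def primer_to_regex(primer):
    """Translate a degenerate primer into a compiled regular expression."""
    parts = []
    for base in primer.upper():
        choices = IUPAC[base]
        parts.append(choices if len(choices) == 1 else '[' + choices + ']')
    return re.compile(''.join(parts))

def degeneracy(primer):
    """Number of distinct sequences a degenerate primer can represent."""
    total = 1
    for base in primer.upper():
        total *= len(IUPAC[base])
    return total

fwd = primer_to_regex('CCTACGGGNGGCWGCAG')
rev = primer_to_regex('GACTACHVGGGTATCTAATCC')
print(fwd.pattern, degeneracy('CCTACGGGNGGCWGCAG'))
print(rev.pattern, degeneracy('GACTACHVGGGTATCTAATCC'))
```

`fwd.match(read)` could then be used to test whether a read starts with the forward primer.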

There are three runs, each containing paired-end FASTQ files, in these folders:
* ../data/raw_files/Psoil_1_L001-ds.be0c008785964521859130b2f2ead9be
* ../data/raw_files/Psoil_2_L001-ds.6255278a9f7448a0b3e7c48b80f6d25f
* ../data/raw_files/Psoil_3_L001-ds.e31e51f5248b4af4ad90603ebbf46e08

The metadata are in these files:
* ../data/raw_files/Pool1.barcodes
* ../data/raw_files/Pool2.barcodes
* ../data/raw_files/Pool3.barcodes

Here, Amplicon Sequence Variants were generated from the raw FASTQ data using DADA2 in [QIIME2 version 2020.11](https://docs.qiime2.org/2020.11/).

The analysis was conducted within a Conda environment on the NBC shared machine.

## Steps: 
1. [Check input](#section1)
    - [ ] QC Raw Paired End FastQ 
    - [x] Build Metadata - make it compatible with QIIME2
    - [x] Import to QIIME2
2. [Demultiplexing](#section2)
    - [ ] QC Demultiplexed samples
    - [X] Run demultiplexing with cutadapt in QIIME2
3. [Denoising with DADA2](#section3)
    - [ ] QC denoised data 
    - [x] Run DADA2
4. [Merge outputs](#section4)
    - [x] Merge QZA files into 1

Notes: 
* _The metadata in Pool3 suggest that the 16S run was mixed with other experiments. Will this affect DADA2's capability to denoise the samples?_
* _The FASTQ files contain reads in mixed orientation (forward and reverse reads in the same file)._

In [37]:
_341F = 'CCTACGGGNGGCWGCAG'
_805R = 'GACTACHVGGGTATCTAATCC'
x = 461 -len(_341F) - len(_805R)
x

423

In [2]:
# Load Library
import pandas as pd
from qiime2 import Artifact, Visualization
import os

<a id='section1'></a>
## Check Input
### Build metadata

In [4]:
# create demultiplexing barcodes: sample barcode + the first 5 nucleotides of the forward primer
pools = [1, 2, 3]
for pool in pools:
    df_demux = pd.read_csv('../data/raw_files/Pool'+str(pool)+'.barcodes', sep='\t')
    # the ID header may carry a trailing space; rename() silently ignores a missing key
    df_demux = df_demux.rename(columns={'ID ':'ID'})
    df_demux = pd.DataFrame({'#SampleID' : [sample_id for sample_id in df_demux['ID']],
                             'BarcodeSequence' : [barcode+'CCTAC' for barcode in df_demux['#F-tag']],
                             'Sample' : [sample_id.split('_')[0] for sample_id in df_demux['ID']]
                             })
    df_demux.to_csv('../data/metadata/psoil'+str(pool)+'_metadata_fwd.tsv', sep='\t', index=False) #write into tsv file
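Before demultiplexing, it can help to confirm that the barcodes in each metadata file are unique and contain only A, C, G and T. A minimal sketch under stated assumptions: the function name `check_barcodes` is hypothetical, and the rows below are made-up placeholders rather than barcodes from the `Pool*.barcodes` files:

```python
import csv
import io
from collections import Counter

def check_barcodes(handle, column='BarcodeSequence'):
    """Return (duplicated barcodes, barcodes with non-ACGT characters)."""
    rows = list(csv.DictReader(handle, delimiter='\t'))
    barcodes = [row[column].strip().upper() for row in rows]
    counts = Counter(barcodes)
    duplicated = sorted(b for b, n in counts.items() if n > 1)
    invalid = sorted(b for b in counts if set(b) - set('ACGT'))
    return duplicated, invalid

# Placeholder rows for illustration only (not real barcodes from this project).
example = io.StringIO(
    '#SampleID\tBarcodeSequence\tSample\n'
    'P1_a\tACGTACGTCCTAC\tP1\n'
    'P1_b\tTTGCAAGGCCTAC\tP1\n'
    'P2_a\tACGTACGTCCTAC\tP2\n'
    'P2_b\tACGTNCGTCCTAC\tP2\n'
)
duplicated, invalid = check_barcodes(example)
print('duplicated:', duplicated)
print('invalid:', invalid)
```

For the real files, the same function could be called on `open('../data/metadata/psoil1_metadata_fwd.tsv')`.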

In [4]:
# metadata is already available
[i for i in os.listdir('../data/metadata') if i.endswith('metadata_fwd.tsv')]

['psoil1_metadata_fwd.tsv',
 'psoil3_metadata_fwd.tsv',
 'psoil2_metadata_fwd.tsv']

In [5]:
# import raw data into qza format
raw_dir = '../data/raw_files/'
# sort so that the enumerate index below matches the run number in the folder name
path = sorted(os.path.join(raw_dir, i) for i in os.listdir(raw_dir) if i.startswith('Psoil'))

# list the FASTQ files of each run; sorting is assumed to put the forward (R1) file first
filepath = []
for i in path:
    file = sorted(x for x in os.listdir(i) if x != '.ipynb_checkpoints')
    file = [os.path.join(i, x) for x in file]
    filepath.append(file)

for num, i in enumerate(path):
    try:
        os.rmdir(os.path.join(i, '.ipynb_checkpoints'))
    except OSError:
        pass  # the checkpoint folder is absent or not empty
    # temporarily rename to the file names that the QIIME2 import format expects
    os.rename(filepath[num][0], os.path.join(i, 'forward.fastq.gz'))
    os.rename(filepath[num][1], os.path.join(i, 'reverse.fastq.gz'))
    print(i)
    
    # Import raw files into QIIME 
    #! qiime tools import --type MultiplexedPairedEndBarcodeInSequence --input-path {i} --input-format MultiplexedPairedEndBarcodeInSequenceDirFmt --output-path ../data/qiime2/Psoil{num+1}_PE.qza 
    # restore the original file names
    os.rename(os.path.join(i, 'forward.fastq.gz'), filepath[num][0]) 
    os.rename(os.path.join(i, 'reverse.fastq.gz'), filepath[num][1]) 

../data/raw_files/Psoil_1_L001-ds.be0c008785964521859130b2f2ead9be
../data/raw_files/Psoil_2_L001-ds.6255278a9f7448a0b3e7c48b80f6d25f
../data/raw_files/Psoil_3_L001-ds.e31e51f5248b4af4ad90603ebbf46e08
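The import step relies on the order of the files in each run folder to decide which file is forward and which is reverse. A more explicit alternative is to pick the two files by name. The sketch below assumes Illumina-style names containing `_R1_` and `_R2_`; the actual file names are not shown in this notebook, so the example listing and the helper `split_read_pair` are hypothetical:

```python
import re

def split_read_pair(filenames):
    """Return (forward, reverse) by looking for _R1 / _R2 in the file names."""
    fastqs = [f for f in filenames if f.endswith(('.fastq.gz', '.fq.gz'))]
    forward = [f for f in fastqs if re.search(r'_R1(_|\.)', f)]
    reverse = [f for f in fastqs if re.search(r'_R2(_|\.)', f)]
    if len(forward) != 1 or len(reverse) != 1:
        raise ValueError('expected exactly one R1 and one R2 file, got %r' % (fastqs,))
    return forward[0], reverse[0]

# Hypothetical directory listing, deliberately out of order.
listing = [
    'Psoil_1_S1_L001_R2_001.fastq.gz',
    '.ipynb_checkpoints',
    'Psoil_1_S1_L001_R1_001.fastq.gz',
]
forward, reverse = split_read_pair(listing)
print(forward, reverse)
```

Raising an error when the folder does not contain exactly one file of each kind avoids silently importing a swapped pair.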


<a id='section2'></a>
## Demultiplexing

In [6]:
# demultiplex PE based on metadata
def demux(path, metadata, out):
    ! mkdir {out}
    ! qiime cutadapt demux-paired \
        --i-seqs {path} \
        --m-forward-barcodes-file {metadata} \
        --m-forward-barcodes-column BarcodeSequence \
        --p-error-rate 0.1 \
        --p-minimum-length 150 \
        --p-mixed-orientation \
        --o-per-sample-sequences {out}/demux-0.1.qza \
        --o-untrimmed-sequences {out}/untrimmed-0.1.qza \
        --verbose \
        > {out}/cutadapt-0.1.log
    return
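The `--p-error-rate 0.1` setting interacts with barcode length. As documented for cutadapt, the number of tolerated errors is the adapter length multiplied by the error rate, rounded down, so extending each barcode with 5 primer nucleotides can change how many mismatches are accepted. A small sketch of that arithmetic; the barcode lengths used here are hypothetical, since the lengths in the `Pool*.barcodes` files are not shown:

```python
import math

def allowed_errors(barcode_length, error_rate=0.1):
    """Errors tolerated for a barcode: floor(length * error rate)."""
    return math.floor(barcode_length * error_rate)

# Hypothetical lengths: an 8-nt barcode alone, and the same barcode
# extended with the 5 primer nucleotides ('CCTAC') used in this notebook.
for length in (8, 8 + 5):
    print(length, 'nt barcode ->', allowed_errors(length), 'allowed error(s)')
```

Under these assumed lengths, the bare barcode would have to match exactly, while the extended one would tolerate a single error.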

In [12]:
# sort so that the enumerate index matches the pool number in the file name
paths = sorted(os.path.join('../data/qiime2', i) for i in os.listdir('../data/qiime2') if i.endswith('PE.qza'))
for num, i in enumerate(paths):
    metadata = '../data/metadata/psoil'+str(num+1)+'_metadata_fwd.tsv'
    out = i.replace('.qza', '')
    if os.path.isdir(out):
        print(out)  # output folder exists, so this pool was already demultiplexed
    else:
        print('done')#demux(i, metadata, out)
        
#[       8=---] 00:26:18     6,318,906 reads  @    249.9 µs/read;   0.24 M reads/minute
#[------>8    ] 00:15:15     3,389,998 reads  @    270.2 µs/read;   0.22 M reads/minute
#[=8          ] 00:37:51     8,795,488 reads  @    258.2 µs/read;   0.23 M reads/minute
#[---->8      ] 00:22:08     4,747,122 reads  @    279.8 µs/read;   0.21 M reads/minute
#[-------->8  ] 00:19:56     6,869,491 reads  @    174.1 µs/read;   0.34 M reads/minute
#[------->8   ] 00:12:45     3,801,926 reads  @    201.4 µs/read;   0.30 M reads/minute

../data/qiime2/Psoil1_PE
../data/qiime2/Psoil2_PE
../data/qiime2/Psoil3_PE


In [None]:
# summarize
for i in range(3):
    ! qiime demux summarize \
        --i-data ../data/qiime2/Psoil{i+1}_PE/demux-0.1.qza \
        --o-visualization ../data/qiime2/Psoil{i+1}_PE/demux-0.1.qzv

In [None]:
Visualization.load('../data/qiime2/Psoil3_PE/demux-0.1.qzv')

<a id='section3'></a>
## Denoising

In [25]:
def denoise(path):
    ! qiime dada2 denoise-paired \
        --i-demultiplexed-seqs {path}/demux-0.1.qza \
        --p-trim-left-f 12 \
        --p-trim-left-r 29 \
        --p-trunc-len-f 200 \
        --p-trunc-len-r 250 \
        --p-n-threads 6 \
        --o-table {path}/table-0.1.qza \
        --o-representative-sequences {path}/rep-seqs-0.1.qza \
        --o-denoising-stats {path}/denoising-stats-0.1.qza \
        --verbose
    return
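The truncation and trimming values determine whether the forward and reverse reads still overlap enough to be merged. A rough check of that arithmetic, assuming the trimmed 5' bases are all barcode or primer and that the insert between the primers is 423 nt (the value computed at the top of this notebook from a 461 bp product). The helper `expected_overlap` is illustrative only:

```python
def expected_overlap(insert_len, trunc_f, trunc_r, trim_left_f, trim_left_r):
    """Overlap (nt) left between read pairs after trimming and truncation."""
    kept_f = trunc_f - trim_left_f  # forward bases that remain
    kept_r = trunc_r - trim_left_r  # reverse bases that remain
    return kept_f + kept_r - insert_len

# Parameters taken from the denoise() call above; insert_len is an assumption.
overlap = expected_overlap(insert_len=423, trunc_f=200, trunc_r=250,
                           trim_left_f=12, trim_left_r=29)
print(overlap)  # -14 under these assumptions
```

A result below DADA2's default minimum overlap of 12 nt would suggest that full-length amplicons may fail to merge; the `merged` column of the denoising stats is the place to confirm whether that happens in practice.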

In [9]:
# Pool 3 also contains samples that are not part of this dataset, so remove them before denoising
## create metadata for filtering
df_psoil3 = pd.read_csv('../data/metadata/psoil3_metadata_fwd.tsv', delimiter='\t')
sample = ['P9', 'Pmix', 'Neg']
df_psoil3 = df_psoil3[df_psoil3.Sample.isin(sample)]
df_psoil3.to_csv('../data/metadata/psoil3_metadata_fwd_filtered.tsv', sep='\t', index=False) #write into tsv file
## filtering using qiime2
! mv ../data/qiime2/Psoil3_PE/demux-0.1.qza ../data/qiime2/Psoil3_PE/demux-0.1-unfiltered.qza
! mv ../data/metadata/psoil3_metadata_fwd.tsv ../data/metadata/psoil3_metadata_fwd_unfiltered.tsv
! mv ../data/metadata/psoil3_metadata_fwd_filtered.tsv ../data/metadata/psoil3_metadata_fwd.tsv

In [None]:
## filtering using qiime2
! qiime demux filter-samples \
    --i-demux ../data/qiime2/Psoil3_PE/demux-0.1-unfiltered.qza \
    --m-metadata-file ../data/metadata/psoil3_metadata_fwd.tsv \
    --o-filtered-demux ../data/qiime2/Psoil3_PE/demux-0.1.qza \
    --verbose

In [30]:
# run denoising for all samples
paths = [os.path.join('../data/qiime2', i) for i in os.listdir('../data/qiime2') if os.path.isdir(os.path.join('../data/qiime2', i))]
paths = [i for i in paths if i.endswith('PE')]
for i in paths:
    # skip pools that already have representative sequences
    if 'rep-seqs-0.1.qza' in os.listdir(i):
        print('already denoised', i)
    else:
        print('denoising', i)
        denoise(i)

already denoised ../data/qiime2/Psoil2_PE
already denoised ../data/qiime2/Psoil1_PE
already denoised ../data/qiime2/Psoil3_PE


In [22]:
# summarize the denoising results
for i in range(3):   
    ! qiime feature-table summarize \
        --i-table ../data/qiime2/Psoil{i+1}_PE/table-0.1.qza \
        --o-visualization ../data/qiime2/Psoil{i+1}_PE/table-0.1.qzv \
        --m-sample-metadata-file ../data/metadata/psoil{i+1}_metadata_fwd.tsv

    ! qiime feature-table tabulate-seqs \
        --i-data ../data/qiime2/Psoil{i+1}_PE/rep-seqs-0.1.qza \
        --o-visualization ../data/qiime2/Psoil{i+1}_PE/rep-seqs-0.1.qzv

    ! qiime metadata tabulate \
        --m-input-file ../data/qiime2/Psoil{i+1}_PE/denoising-stats-0.1.qza \
        --o-visualization ../data/qiime2/Psoil{i+1}_PE/denoising-stats-0.1.qzv

Saved Visualization to: qiime2/Psoil1_PE/table-0.1.qzv
Saved Visualization to: qiime2/Psoil1_PE/rep-seqs-0.1.qzv
Saved Visualization to: qiime2/Psoil1_PE/denoising-stats-0.1.qzv
Saved Visualization to: qiime2/Psoil2_PE/table-0.1.qzv
Saved Visualization to: qiime2/Psoil2_PE/rep-seqs-0.1.qzv
Saved Visualization to: qiime2/Psoil2_PE/denoising-stats-0.1.qzv
Saved Visualization to: qiime2/Psoil3_PE/table-0.1.qzv
Saved Visualization to: qiime2/Psoil3_PE/rep-seqs-0.1.qzv
Saved Visualization to: qiime2/Psoil3_PE/denoising-stats-0.1.qzv
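The denoising stats can also be checked programmatically, for example to see what fraction of input reads survive to the non-chimeric stage. The sketch below assumes the stats have been exported to a TSV (for instance with `qiime tools export`) and that the columns are named `sample-id`, `input` and `non-chimeric`. The sample IDs and counts are made-up placeholders, not results from this dataset:

```python
import csv
import io

def retention(handle):
    """Map sample-id -> fraction of input reads that end up non-chimeric."""
    result = {}
    for row in csv.DictReader(handle, delimiter='\t'):
        if row['sample-id'].startswith('#q2:types'):
            continue  # skip the QIIME2 column-type directive row
        reads_in = int(row['input'])
        kept = int(row['non-chimeric'])
        result[row['sample-id']] = kept / reads_in if reads_in else 0.0
    return result

# Placeholder table for illustration only.
example = io.StringIO(
    'sample-id\tinput\tfiltered\tdenoised\tmerged\tnon-chimeric\n'
    '#q2:types\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\n'
    's1\t1000\t800\t780\t600\t500\n'
    's2\t2000\t1500\t1400\t400\t300\n'
)
fractions = retention(example)
for sample_id, fraction in fractions.items():
    print(sample_id, round(fraction, 2))
```

A sample whose retention is much lower than the rest is worth inspecting in the corresponding `.qzv` before merging the runs.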


<a id='section4'></a>
## Merge Outputs

In [24]:
! qiime feature-table merge \
  --i-tables ../data/qiime2/Psoil1_PE/table-0.1.qza \
  --i-tables ../data/qiime2/Psoil2_PE/table-0.1.qza \
  --i-tables ../data/qiime2/Psoil3_PE/table-0.1.qza \
  --o-merged-table ../data/qiime2/table.qza

! qiime feature-table merge-seqs \
  --i-data ../data/qiime2/Psoil1_PE/rep-seqs-0.1.qza \
  --i-data ../data/qiime2/Psoil2_PE/rep-seqs-0.1.qza \
  --i-data ../data/qiime2/Psoil3_PE/rep-seqs-0.1.qza \
  --o-merged-data ../data/qiime2/rep-seqs.qza

Saved FeatureTable[Frequency] to: qiime2/table.qza
Saved FeatureData[Sequence] to: qiime2/rep-seqs.qza


In [32]:
df1 = pd.read_csv('../data/metadata/psoil1_metadata_fwd.tsv', sep='\t')
df2 = pd.read_csv('../data/metadata/psoil2_metadata_fwd.tsv', sep='\t')
df3 = pd.read_csv('../data/metadata/psoil3_metadata_fwd.tsv', sep='\t')
df = pd.concat([df1, df2, df3])  # DataFrame.append is deprecated and removed in pandas 2.0
df.to_csv('../data/metadata/sample-metadata.tsv', sep='\t', index=False) #write into tsv file
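When metadata from several pools are concatenated, a sample ID that appears in more than one pool would make the combined file ambiguous. A minimal standard-library sketch of a duplicate check; the helper `duplicated_ids` and the sample IDs below are hypothetical placeholders:

```python
from collections import Counter

def duplicated_ids(*id_lists):
    """Return the sample IDs that occur more than once across all lists."""
    counts = Counter(sample_id for ids in id_lists for sample_id in ids)
    return sorted(s for s, n in counts.items() if n > 1)

# Placeholder IDs for illustration only.
pool1 = ['P1_a', 'P1_b', 'Neg_1']
pool2 = ['P5_a', 'Neg_1']
pool3 = ['P9_a', 'Pmix_a']
clashes = duplicated_ids(pool1, pool2, pool3)
print(clashes)
```

With the real data, the lists could be taken from the `#SampleID` column of each metadata file, e.g. `df1['#SampleID']`.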

In [33]:
! qiime feature-table summarize \
    --i-table ../data/qiime2/table.qza \
    --o-visualization ../data/qiime2/table.qzv \
    --m-sample-metadata-file ../data/metadata/sample-metadata.tsv

! qiime feature-table tabulate-seqs \
    --i-data ../data/qiime2/rep-seqs.qza \
    --o-visualization ../data/qiime2/rep-seqs.qzv

Saved Visualization to: qiime2/table.qzv
Saved Visualization to: qiime2/rep-seqs.qzv


In [31]:
Visualization.load('../data/qiime2/table.qzv')

In [32]:
Visualization.load('../data/qiime2/rep-seqs.qzv')