# Deblur 
> Deblur uses sequence error profiles to associate erroneous sequence reads with the true biological sequence from which they are derived, resulting in high quality sequence variant data. This is applied in two steps. First, an initial quality filtering process based on quality scores is applied. This method is an implementation of the quality filtering approach described by Bokulich et al. (2013).

> Next, the Deblur workflow is applied using the qiime deblur denoise-16S method. This method requires one parameter that is used in quality filtering, --p-trim-length n, which truncates the sequences at position n. In general, the Deblur developers recommend setting this value to a length where the median quality score begins to drop too low. On these data, the quality plots (prior to quality filtering) suggest a reasonable choice is in the 115 to 130 sequence position range. This is a subjective assessment. One situation where you might deviate from that recommendation is when performing a meta-analysis across multiple sequencing runs. In this type of meta-analysis, it is critical that the read lengths be the same for all of the sequencing runs being compared to avoid introducing a study-specific bias. Since we are already using a trim length of 120 for qiime dada2 denoise-single, and since 120 is reasonable given the quality plots, we'll pass --p-trim-length 120. This next command may take up to 10 minutes to run.
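
The trim-length reasoning above can be illustrated with a small, self-contained sketch. It is not Deblur's implementation: `suggest_trim_length`, `apply_trim_length`, the quality threshold of 20, and the toy reads are all assumptions made up for illustration. The sketch picks the first position at which the median quality falls below the threshold, then truncates reads to that length and drops reads that are shorter, so every retained read ends up the same length.

```python
from statistics import median


def suggest_trim_length(quality_rows, min_median_quality=20):
    """Count the leading positions whose median quality stays at or above the threshold.

    quality_rows: list of per-read quality score lists (toy input, not real data).
    Only reads that cover a position contribute to its median.
    """
    longest = max(len(row) for row in quality_rows)
    for pos in range(longest):
        scores = [row[pos] for row in quality_rows if len(row) > pos]
        if median(scores) < min_median_quality:
            return pos
    return longest


def apply_trim_length(reads, trim_length):
    """Truncate reads to trim_length and drop reads that are shorter."""
    return [r[:trim_length] for r in reads if len(r) >= trim_length]


# Toy data (made up): quality tails off towards the end of each read.
qualities = [
    [38, 38, 37, 30, 22, 12],
    [38, 37, 36, 28, 18, 10],
    [37, 36, 35, 25, 15],
]
reads = ["ACGTAC", "ACGTTC", "ACGTA"]

trim = suggest_trim_length(qualities)
trimmed = apply_trim_length(reads, trim)
print(trim, trimmed)
```

The threshold is the subjective part: a stricter value gives shorter reads, and a trim length longer than many reads would discard them outright.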

In [1]:
import os
import pandas as pd

def qiime_manifest(path, out):
    """Write a paired-end QIIME 2 manifest for the .fastq.gz files in `path`."""
    # Create the output directory (and any missing parents); no error if it exists.
    os.makedirs(out, exist_ok=True)

    # Sorting is assumed to place each sample's forward and reverse files side by side.
    manifest = sorted(f for f in os.listdir(os.path.abspath(path)) if f.endswith('.fastq.gz'))
    columns = ['sample-id', 'forward-absolute-filepath', 'reverse-absolute-filepath']
    rows = []

    # Walk the sorted list two files at a time: (forward, reverse).
    for forward, reverse in zip(manifest[0::2], manifest[1::2]):
        sampleid = forward.rsplit('.', 3)[0]
        if 'unknown' in sampleid:
            print('dropped:', sampleid)
        else:
            rows.append([sampleid.replace('_', '-'),
                         os.path.abspath(os.path.join(path, forward)),
                         os.path.abspath(os.path.join(path, reverse))])

    df = pd.DataFrame(rows, columns=columns)
    df.to_csv(os.path.join(out, 'manifest.txt'), sep='\t', index=False)

for i in range(3):
    path = '../data/preprocessing/Psoil-'+str(i+1)+'/clean/'
    out = '../data/qiime2/Psoil-'+str(i+1)
    qiime_manifest(path, out)

dropped: unknown
dropped: round2-unknown
dropped: unknown
dropped: unknown
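
To make the sample ID logic in `qiime_manifest` concrete, here is a small sketch of how an ID is derived from a file name and how files are paired. The file names are hypothetical, since the real names in `clean/` are not shown; the sketch assumes a `<sample>.<read>.fastq.gz` pattern, which is what stripping three dot-separated fields implies. `sample_id_from_filename` is an illustrative helper, not part of the notebook.

```python
def sample_id_from_filename(filename):
    """Strip the last three dot-separated fields and swap underscores for hyphens."""
    return filename.rsplit('.', 3)[0].replace('_', '-')


# Hypothetical file names; the real ones are not shown in this notebook.
files = sorted(['P5_rep1.R2.fastq.gz', 'P5_rep1.R1.fastq.gz',
                'unknown.R1.fastq.gz', 'unknown.R2.fastq.gz'])

# After sorting, adjacent files are treated as (forward, reverse).
for forward, reverse in zip(files[0::2], files[1::2]):
    sid = sample_id_from_filename(forward)
    status = 'dropped' if 'unknown' in sid else 'kept'
    print(status, sid, forward, reverse)
```

Pairing by position only works if every sample has exactly two files; a missing reverse file would shift all later pairs.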


In [2]:
os.path.abspath('../data/qiime2/Psoil-1/manifest.txt')

'/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn16s/data/qiime2/Psoil-1/manifest.txt'

In [3]:
df1 = pd.read_csv('../data/qiime2/Psoil-1/manifest.txt', sep='\t')
df2 = pd.read_csv('../data/qiime2/Psoil-2/manifest.txt', sep='\t')
df3 = pd.read_csv('../data/qiime2/Psoil-3/manifest.txt', sep='\t')
frames = [df1, df2, df3]
df_all = pd.concat(frames)
df_all = df_all[df_all['sample-id'].isin([i for i in df_all['sample-id'] if i.startswith(('P5', 'P8', 'P9'))])]
df_all.to_csv('../data/qiime2/manifest-selection.txt', sep='\t', index=False)
df_all

Unnamed: 0,sample-id,forward-absolute-filepath,reverse-absolute-filepath
0,P5-rep1,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...
1,P5-rep2,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...
2,P5-rep3,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...
3,P5-rep4,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...
4,P5-rep5,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...
5,P8-rep1,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...
6,P8-rep2,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...
7,P8-rep3,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...
8,P8-rep4,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...
9,P8-rep5,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...,/home/WIN.DTU.DK/matinnu/phd/projects/dyrehavn...
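
Before importing, it can be worth checking that the manifest has the layout written above. The sketch below is a hypothetical helper, not part of QIIME 2: `check_manifest` only verifies the three column names used in this notebook, that sample IDs are unique, and that both file paths are absolute. The toy manifest and its paths are made up.

```python
import csv
import io
import os

EXPECTED = ['sample-id', 'forward-absolute-filepath', 'reverse-absolute-filepath']


def check_manifest(handle):
    """Return a list of problems found in a tab-separated paired-end manifest."""
    reader = csv.DictReader(handle, delimiter='\t')
    problems = []
    if reader.fieldnames != EXPECTED:
        problems.append(f'unexpected header: {reader.fieldnames}')
        return problems
    seen = set()
    for row in reader:
        sid = row['sample-id']
        if sid in seen:
            problems.append(f'duplicate sample-id: {sid}')
        seen.add(sid)
        for col in EXPECTED[1:]:
            if not os.path.isabs(row[col]):
                problems.append(f'{sid}: {col} is not an absolute path')
    return problems


# Toy manifest (made-up paths) standing in for manifest-selection.txt.
toy = io.StringIO(
    'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
    'P5-rep1\t/data/P5_rep1.R1.fastq.gz\t/data/P5_rep1.R2.fastq.gz\n'
    'P5-rep1\tdata/P5_rep1.R1.fastq.gz\t/data/P5_rep1.R2.fastq.gz\n'
)
problems = check_manifest(toy)
print(problems)
```

For the real file, the same function could be called on `open('../data/qiime2/manifest-selection.txt', newline='')`.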


In [4]:
! qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path ../data/qiime2/manifest-selection.txt \
    --output-path ../data/qiime2/paired-end-demux.qza \
    --input-format PairedEndFastqManifestPhred33V2

Imported ../data/qiime2/manifest-selection.txt as PairedEndFastqManifestPhred33V2 to ../data/qiime2/paired-end-demux.qza


In [5]:
! qiime demux summarize \
  --i-data ../data/qiime2/paired-end-demux.qza \
  --o-visualization ../data/qiime2/paired-end-demux.qzv

Saved Visualization to: ../data/qiime2/paired-end-demux.qzv


In [6]:
from qiime2 import Artifact, Visualization
Visualization.load('../data/qiime2/paired-end-demux.qzv')

In [7]:
! qiime vsearch join-pairs \
    --i-demultiplexed-seqs ../data/qiime2/paired-end-demux.qza \
    --p-threads 0 \
    --o-joined-sequences ../data/qiime2/joined-demux.qza \
    --verbose > ../data/qiime2/joined.log

  demultiplexed_seqs.metadata.pathspec)))['phred-offset']
vsearch v2.7.0_linux_x86_64, 62.8GB RAM, 16 cores
https://github.com/torognes/vsearch

Merging reads 100%                                            
    128220  Pairs
    112472  Merged (87.7%)
     15748  Not merged (12.3%)

Pairs that failed merging due to various reasons:
        59  too few kmers found on same diagonal
        53  potential tandem repeat
     11044  too many differences
      4455  alignment score too low, or score drop to high
       137  staggered read pairs

Statistics of all reads:
    273.54  Mean read length

Statistics of merged reads:
    418.12  Mean fragment length
     11.12  Standard deviation of fragment length
      1.94  Mean expected error in forward sequences
      1.31  Mean expected error in reverse sequences
      0.47  Mean expected error in merged sequences
      1.52  Mean observed errors in merged region of forward sequences
      0.98  Mean observed errors in merged region of revers
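
The merge summary above can be sanity-checked with a little arithmetic: merged and unmerged counts should add up to the number of pairs, and the listed failure reasons should add up to the unmerged count. The numbers below are copied from the log; the variable names are only for illustration.

```python
pairs = 128220
merged = 112472
not_merged = 15748

# Failure reasons, worded as in the vsearch log above.
failures = {
    'too few kmers found on same diagonal': 59,
    'potential tandem repeat': 53,
    'too many differences': 11044,
    'alignment score too low, or score drop to high': 4455,
    'staggered read pairs': 137,
}

# The counts should be internally consistent.
assert merged + not_merged == pairs
assert sum(failures.values()) == not_merged

merged_pct = round(100 * merged / pairs, 1)
not_merged_pct = round(100 * not_merged / pairs, 1)
print(f'merged: {merged_pct}%  not merged: {not_merged_pct}%')

# Largest single cause of failed merges and its share of all failures.
top_reason = max(failures, key=failures.get)
top_share = round(100 * failures[top_reason] / not_merged, 1)
print(top_reason, top_share)
```

The percentages reproduce the 87.7% and 12.3% reported by vsearch, and most failed merges fall under "too many differences".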

In [8]:
! qiime demux summarize \
  --i-data ../data/qiime2/joined-demux.qza \
  --o-visualization ../data/qiime2/joined-demux.qzv

Saved Visualization to: ../data/qiime2/joined-demux.qzv


In [9]:
Visualization.load('../data/qiime2/joined-demux.qzv')

In [10]:
! qiime quality-filter q-score \
    --i-demux ../data/qiime2/joined-demux.qza \
    --o-filtered-sequences ../data/qiime2/demux-filtered.qza \
    --o-filter-stats ../data/qiime2/demux-filter-stats.qza

Saved SampleData[JoinedSequencesWithQuality] to: ../data/qiime2/demux-filtered.qza
Saved QualityFilterStats to: ../data/qiime2/demux-filter-stats.qza


In [11]:
! qiime deblur denoise-16S \
    --i-demultiplexed-seqs ../data/qiime2/demux-filtered.qza \
    --p-trim-length 400 \
    --o-representative-sequences ../data/qiime2/rep-seqs-deblur.qza \
    --o-table ../data/qiime2/table-deblur.qza \
    --p-sample-stats \
    --p-jobs-to-start 8 \
    --o-stats ../data/qiime2/deblur-stats.qza \
    --verbose

Saved FeatureTable[Frequency] to: ../data/qiime2/table-deblur.qza
Saved FeatureData[Sequence] to: ../data/qiime2/rep-seqs-deblur.qza
Saved DeblurStats to: ../data/qiime2/deblur-stats.qza


In [12]:
! qiime metadata tabulate \
  --m-input-file ../data/qiime2/demux-filter-stats.qza \
  --o-visualization ../data/qiime2/demux-filter-stats.qzv

! qiime deblur visualize-stats \
  --i-deblur-stats ../data/qiime2/deblur-stats.qza \
  --o-visualization ../data/qiime2/deblur-stats.qzv

Saved Visualization to: ../data/qiime2/demux-filter-stats.qzv
Saved Visualization to: ../data/qiime2/deblur-stats.qzv


In [13]:
Visualization.load('../data/qiime2/demux-filter-stats.qzv')

In [14]:
Visualization.load('../data/qiime2/deblur-stats.qzv')

In [15]:
df1 = pd.read_csv('../data/metadata/Psoil-1_barcode.tsv', sep='\t')
df2 = pd.read_csv('../data/metadata/Psoil-2_barcode.tsv', sep='\t')
df3 = pd.read_csv('../data/metadata/Psoil-3_barcode.tsv', sep='\t')
frames = [df1, df2, df3]
df_all = pd.concat(frames)
df_all = df_all[df_all['#SampleID'].isin([i for i in df_all['#SampleID'] if i.startswith(('P5', 'P8', 'P9'))])].reset_index(drop=True)
# Match the manifest: sample IDs use hyphens instead of underscores.
df_all['#SampleID'] = df_all['#SampleID'].str.replace('_', '-', regex=False)
df_all.to_csv('../data/metadata/metadata-selection.tsv', sep='\t', index=False)
df_all

Unnamed: 0,#SampleID,BarcodeSequence,Sample
0,P5-rep1,GCTTCTGA,P5
1,P5-rep2,GGCAAGAT,P5
2,P5-rep3,GTGCTTTC,P5
3,P5-rep4,ACACACTG,P5
4,P5-rep5,CGATTCTG,P5
5,P8-rep1,GCAGAGTT,P8
6,P8-rep2,CGTCCTAT,P8
7,P8-rep3,GCTTGGTT,P8
8,P8-rep4,ACAGGCTT,P8
9,P8-rep5,TGACGCTT,P8
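
The sample IDs in this metadata table need to match those written to the manifest, which is why underscores are replaced with hyphens in both places. A minimal sketch of that consistency check follows; the ID lists are made up rather than read from the real files, and `normalise` is an illustrative helper.

```python
def normalise(sample_id):
    """Apply the same underscore-to-hyphen rule used for the manifest."""
    return sample_id.replace('_', '-')


# Hypothetical raw IDs as they might appear in a barcode file.
metadata_ids = {normalise(s) for s in ['P5_rep1', 'P5_rep2', 'P8_rep1']}
# Hypothetical IDs as written to the manifest.
manifest_ids = {'P5-rep1', 'P5-rep2', 'P8-rep1', 'P8-rep2'}

# Set differences show which IDs appear on one side only.
missing_from_metadata = sorted(manifest_ids - metadata_ids)
missing_from_manifest = sorted(metadata_ids - manifest_ids)
print(missing_from_metadata, missing_from_manifest)
```

With the real data, the two sets could be taken from the `sample-id` column of the manifest and the `#SampleID` column of the metadata table.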


In [16]:
! qiime feature-table summarize \
        --i-table ../data/qiime2/table-deblur.qza \
        --o-visualization ../data/qiime2/table-deblur.qzv \
        --m-sample-metadata-file ../data/metadata/metadata-selection.tsv

! qiime feature-table tabulate-seqs \
        --i-data ../data/qiime2/rep-seqs-deblur.qza \
        --o-visualization ../data/qiime2/rep-seqs-deblur.qzv

Saved Visualization to: ../data/qiime2/table-deblur.qzv
Saved Visualization to: ../data/qiime2/rep-seqs-deblur.qzv


In [17]:
Visualization.load('../data/qiime2/table-deblur.qzv')

In [18]:
Visualization.load('../data/qiime2/rep-seqs-deblur.qzv')