## 3. Mapping, sorting, and calling trimmed data

We use [NextGenMap](https://github.com/Cibiv/NextGenMap) for mapping, [samtools](https://github.com/samtools/samtools) for SAM-to-BAM conversion, and [Sambamba](https://github.com/biod/sambamba) for sorting. One of our individuals was sequenced in colorspace (ABI SOLiD data) and had to be mapped with [BWA](https://github.com/lh3/bwa) instead of NextGenMap.

After this, we will use [FreeBayes](https://github.com/ekg/freebayes) for variant detection on the mapped and sorted files.

In [16]:
##!conda install -c bioconda samtools minimap2 sambamba bwa
##!conda install -c anaconda openssl=1.0.2
import os

In [1]:
%time !gunzip < /moto/eaton/projects/macaques/refpapio/refpapio.fna.gz > /moto/eaton/projects/macaques/refpapio/refpapio.fa

CPU times: user 125 ms, sys: 46.7 ms, total: 172 ms
Wall time: 19.8 s


Building a regular index for the Illumina data:

In [None]:
!samtools faidx /moto/eaton/projects/macaques/refpapio/refpapio.fa

In [None]:
!bwa index /moto/eaton/projects/macaques/refpapio/refpapio.fa

Building a colorspace index for our ABI SOLiD data:

In [None]:
!mkdir /moto/eaton/projects/macaques/refpapio/bowtie

In [None]:
!cp /moto/eaton/projects/macaques/refpapio/refpapio.fa /moto/eaton/projects/macaques/refpapio/bowtie/refpapio.fa

In [None]:
%%bash
bowtie-build --color --threads 12 /moto/eaton/projects/macaques/refpapio/bowtie/refpapio.fa \
    /moto/eaton/projects/macaques/refpapio/bowtie/refpapio

In [None]:
!mkdir /moto/eaton/projects/macaques/refpapio/color

In [None]:
!cp /moto/eaton/projects/macaques/refpapio/refpapio.fa /moto/eaton/projects/macaques/refpapio/color/refpapio.fa

In [None]:
!bwa index -a bwtsw -c /moto/eaton/projects/macaques/refpapio/color/refpapio.fa

#### Mapping colorspace data with `bwa aln` and `samse/sampe`:

If you have exclusive access to a computing node for long periods of time, you can iterate through all the files with the following for loops, where `-t` sets the number of threads:

In [None]:
##creating a folder to put the BAM files into 
!mkdir /moto/eaton/projects/macaques/mapped/

In [5]:
%%bash
bowtie /moto/eaton/projects/macaques/refpapio/bowtie/refpapio -C --threads 12 --sam \
    --sam-RG ID:fasso --sam-RG SM:fasso -f \
    /moto/eaton/projects/macaques/fastqdump/fasso/DRR001227_F3.csfasta \
    -Q /moto/eaton/projects/macaques/fastqdump/fasso/DRR001227_F3_QV.qual \
    | samtools view -Sb -@ 4 -> /moto/eaton/projects/macaques/mapped/bowtieDRR001227.rg.raw.bam

Process is interrupted.


In [6]:
%%bash
bowtie /moto/eaton/projects/macaques/refpapio/bowtie/refpapio -C --threads 12 --sam \
    --sam-RG ID:fasso --sam-RG SM:fasso -f \
    -1 /moto/eaton/projects/macaques/fastqdump/fasso/DRR001233_F3.csfasta \
    -2 /moto/eaton/projects/macaques/fastqdump/fasso/DRR001233_R3.csfasta \
    --Q1 /moto/eaton/projects/macaques/fastqdump/fasso/DRR001233_F3_QV.qual \
    --Q2 /moto/eaton/projects/macaques/fastqdump/fasso/DRR001233_R3_QV.qual \
    | samtools view -Sb -@ 4 -> /moto/eaton/projects/macaques/mapped/bowtieDRR001233.rg.raw.bam

Process is interrupted.


In [None]:
%%bash
# $i already holds the full path of each fastq file, so it is used directly;
# the output name swaps the .fastq extension for .sai
for i in /moto/eaton/projects/macaques/filteredfastq/*.fastq; do
    bwa aln -c -t 12 \
        /moto/eaton/projects/macaques/refpapio/color/refpapio.fa \
        "$i" \
        > "${i%.fastq}.sai"
    done
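The naming convention used by the later `samse`/`sampe` cells (`DRR001231_F.fastq` paired with `DRR001231_F.sai`) can be sketched in Python. This is a minimal sketch that only builds the command and does not run `bwa`; the helper name `bwa_aln_command` is hypothetical.

```python
from pathlib import Path

REF = "/moto/eaton/projects/macaques/refpapio/color/refpapio.fa"

def bwa_aln_command(fastq, threads=12):
    """Return the argument list and the .sai output path for one `bwa aln -c` run."""
    fastq = Path(fastq)
    # Swap the extension: DRR001231_F.fastq -> DRR001231_F.sai
    sai = fastq.with_suffix(".sai")
    args = ["bwa", "aln", "-c", "-t", str(threads), REF, str(fastq)]
    return args, sai

args, sai = bwa_aln_command(
    "/moto/eaton/projects/macaques/filteredfastq/DRR001231_F.fastq"
)
# The shell redirect is shown for reading only; nothing is executed here
print(" ".join(args), ">", sai)
```

Keeping the `.sai` name tied to the fastq name means the downstream cells can find both files from the run accession alone.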

For loop for the single-end (`bwa samse`) data:

In [None]:
%%bash
for i in DRR001227 DRR001228 DRR001229 DRR001230; do
    bwa samse /moto/eaton/projects/macaques/refpapio/color/refpapio.fa \
        -r '@RG\tID:fasso\tSM:fasso' \
        /moto/eaton/projects/macaques/filteredfastq/${i}_F.sai \
        /moto/eaton/projects/macaques/filteredfastq/${i}_F.fastq \
        > /moto/eaton/projects/macaques/mapped/$i.rg.raw.sam
    done

For loop for the mate-pair (`bwa sampe`) data:

In [None]:
%%bash
for i in DRR001231 DRR001232 DRR001233; do
    bwa sampe /moto/eaton/projects/macaques/refpapio/color/refpapio.fa \
        -r '@RG\tID:fasso\tSM:fasso' \
        /moto/eaton/projects/macaques/filteredfastq/${i}_R.sai \
        /moto/eaton/projects/macaques/filteredfastq/${i}_F.sai \
        /moto/eaton/projects/macaques/filteredfastq/${i}_R.fastq \
        /moto/eaton/projects/macaques/filteredfastq/${i}_F.fastq \
        > /moto/eaton/projects/macaques/mapped/$i.rg.raw.sam
    done
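The choice between `bwa samse` and `bwa sampe` depends only on whether a run is single-end or mate-pair. A minimal Python sketch of that dispatch is below; it assumes the run lists from the two loops above, the helper name `bwa_sam_command` is hypothetical, and the options are placed before the index prefix as in `bwa`'s usage line. It builds the argument list without running anything.

```python
FASTQ_DIR = "/moto/eaton/projects/macaques/filteredfastq"
REF = "/moto/eaton/projects/macaques/refpapio/color/refpapio.fa"
READ_GROUP = r"@RG\tID:fasso\tSM:fasso"

SINGLE_END = ["DRR001227", "DRR001228", "DRR001229", "DRR001230"]
MATE_PAIR = ["DRR001231", "DRR001232", "DRR001233"]

def bwa_sam_command(run):
    """Build the samse or sampe argument list for one run accession."""
    if run in MATE_PAIR:
        subcommand, tags = "sampe", ["R", "F"]  # same R-then-F order as the sampe loop
    elif run in SINGLE_END:
        subcommand, tags = "samse", ["F"]
    else:
        raise ValueError(f"unknown run: {run}")
    sai = [f"{FASTQ_DIR}/{run}_{tag}.sai" for tag in tags]
    fastq = [f"{FASTQ_DIR}/{run}_{tag}.fastq" for tag in tags]
    return ["bwa", subcommand, "-r", READ_GROUP, REF] + sai + fastq

for run in ["DRR001227", "DRR001231"]:
    print(" ".join(bwa_sam_command(run)))
```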

While working on this project we often had to share resources, so we submitted SLURM jobs for individual fastq files in the following format (shown for the mate-pair ABI SOLiD data; convert the cells to code cells if you want to run them this way):

%%bash
bwa aln -c -t 12 \
    /moto/eaton/projects/macaques/refpapio/color/refpapio.fa \
    /moto/eaton/projects/macaques/filteredfastq/DRR001231_F.fastq \
    > /moto/eaton/projects/macaques/filteredfastq/DRR001231_F.sai

%%bash
bwa aln -c -t 12 \
    /moto/eaton/projects/macaques/refpapio/color/refpapio.fa \
    /moto/eaton/projects/macaques/filteredfastq/DRR001231_R.fastq \
    > /moto/eaton/projects/macaques/filteredfastq/DRR001231_R.sai

%%bash
bwa sampe /moto/eaton/projects/macaques/refpapio/color/refpapio.fa \
    -r '@RG\tID:fasso\tSM:fasso' \
    /moto/eaton/projects/macaques/filteredfastq/DRR001231_R.sai \
    /moto/eaton/projects/macaques/filteredfastq/DRR001231_F.sai \
    /moto/eaton/projects/macaques/filteredfastq/DRR001231_R.fastq \
    /moto/eaton/projects/macaques/filteredfastq/DRR001231_F.fastq \
    > /moto/eaton/projects/macaques/mapped/DRR001231.rg.raw.sam
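One way to turn a single command like those above into a SLURM batch script is sketched below. The helper `slurm_script`, the job name, and the 12-hour time limit are assumptions for illustration; our actual submission scripts are not shown here, and a real cluster may also require an account or partition directive.

```python
def slurm_script(job_name, command, cpus=12, hours=12):
    """Return the text of a minimal sbatch script wrapping one shell command."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",  # should match the -t value given to bwa
        f"#SBATCH --time={hours}:00:00",    # hypothetical wall-time limit
        "",
        command,
    ]
    return "\n".join(lines) + "\n"

command = (
    "bwa aln -c -t 12 "
    "/moto/eaton/projects/macaques/refpapio/color/refpapio.fa "
    "/moto/eaton/projects/macaques/filteredfastq/DRR001231_F.fastq "
    "> /moto/eaton/projects/macaques/filteredfastq/DRR001231_F.sai"
)
script = slurm_script("aln_DRR001231_F", command)
print(script)
```

The resulting text can be written to a file and submitted with `sbatch`.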

And now we convert the SAM files to BAM files using `samtools` (change `-@` arg to the number of threads available to you):

In [1]:
!samtools flagstat /moto/eaton/projects/macaques/mapped/fuscata2.1.raw.sam

/usr/bin/sh: samtools: command not found


In [6]:
!samtools view -Sb -@ 10 /moto/eaton/projects/macaques/mapped/fuscata2.1.raw.sam > /moto/eaton/projects/macaques/mapped/fuscata2.1.raw.bam

In [None]:
!samtools view -Sb -@ 12 /moto/eaton/projects/macaques/mapped/fuscata2.2.raw.sam > /moto/eaton/projects/macaques/mapped/fuscata2.2.raw.bam

In [None]:
%%bash
for i in DRR001231f DRR001232f DRR001233f DRR001231r DRR001232r DRR001233r; do
    samtools view -Sb -@ 10 /moto/eaton/projects/macaques/mapped/$i.raw.sam \
        > /moto/eaton/projects/macaques/mapped/$i.raw.bam
    done

Since the colorspace fastq files were a mix of mate-pair and single-end data, we could not merge them before mapping. Instead, we sort each BAM file that resulted from the mapping with `sambamba` and then merge the sorted files:

In [2]:
%%bash
for i in fuscata2.1 fuscata2.2 DRR001231f DRR001232f DRR001233f DRR001231r DRR001232r DRR001233r; do
    sambamba sort -t 12 --tmpdir=/moto/eaton/projects/macaques/tmp/ \
        /moto/eaton/projects/macaques/mapped/$i.raw.bam
    done


sambamba 0.6.8 by Artem Tarasov and Pjotr Prins (C) 2012-2018
    LDC 1.11.0 / DMD v2.081.2 / LLVM6.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6git-0156298)



In [2]:
!echo /moto/eaton/projects/macaques/mapped/DRR0012*.raw.sorted.bam

/moto/eaton/projects/macaques/mapped/DRR001227.rg.raw.sorted.bam /moto/eaton/projects/macaques/mapped/DRR001228.rg.raw.sorted.bam /moto/eaton/projects/macaques/mapped/DRR001229.rg.raw.sorted.bam /moto/eaton/projects/macaques/mapped/DRR001230.rg.raw.sorted.bam /moto/eaton/projects/macaques/mapped/DRR001231f.raw.sorted.bam /moto/eaton/projects/macaques/mapped/DRR001231r.raw.sorted.bam /moto/eaton/projects/macaques/mapped/DRR001232f.raw.sorted.bam /moto/eaton/projects/macaques/mapped/DRR001232r.raw.sorted.bam /moto/eaton/projects/macaques/mapped/DRR001233f.raw.sorted.bam /moto/eaton/projects/macaques/mapped/DRR001233r.raw.sorted.bam


In [3]:
!sambamba merge -t 12 /moto/eaton/projects/macaques/mapped/fasso.rg.raw.bam /moto/eaton/projects/macaques/mapped/DRR0012*.raw.sorted.bam


sambamba 0.6.8 by Artem Tarasov and Pjotr Prins (C) 2012-2018
    LDC 1.11.0 / DMD v2.081.2 / LLVM6.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6git-0156298)



In [None]:
!sambamba merge -t 12 /moto/eaton/projects/macaques/mapped/fuscata2.rg.raw.bam /moto/eaton/projects/macaques/mapped/fuscata2.*.raw.sorted.bam
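To check which sorted files each glob pattern picks up before merging, the matching can be reproduced in Python. This is a minimal sketch: the file names are those listed by the `echo` cell plus the two `fuscata2` files implied by the sort loop, the helper `merge_command` is hypothetical, and the command is only built, not run.

```python
from fnmatch import fnmatchcase

MAPPED = "/moto/eaton/projects/macaques/mapped"

SORTED_BAMS = [
    "DRR001227.rg.raw.sorted.bam", "DRR001228.rg.raw.sorted.bam",
    "DRR001229.rg.raw.sorted.bam", "DRR001230.rg.raw.sorted.bam",
    "DRR001231f.raw.sorted.bam", "DRR001231r.raw.sorted.bam",
    "DRR001232f.raw.sorted.bam", "DRR001232r.raw.sorted.bam",
    "DRR001233f.raw.sorted.bam", "DRR001233r.raw.sorted.bam",
    "fuscata2.1.raw.sorted.bam", "fuscata2.2.raw.sorted.bam",
]

def merge_command(output, pattern, names=SORTED_BAMS, threads=12):
    """Build `sambamba merge <output> <inputs...>` for every name matching the glob."""
    inputs = sorted(name for name in names if fnmatchcase(name, pattern))
    return (["sambamba", "merge", "-t", str(threads), f"{MAPPED}/{output}"]
            + [f"{MAPPED}/{name}" for name in inputs])

fasso = merge_command("fasso.rg.raw.bam", "DRR0012*.raw.sorted.bam")
fuscata2 = merge_command("fuscata2.rg.raw.bam", "fuscata2.*.raw.sorted.bam")
print(len(fasso) - 5, "inputs for fasso;", len(fuscata2) - 5, "inputs for fuscata2")
```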

In [1]:
!mkdir /moto/eaton/projects/macaques/mapped/Chr4

In [2]:
%%bash
source ~/.bashrc
conda activate py2
for i in ngmDRR002233 ngmSRR1024051 ngmSRR2981114 ngmSRR2981139 ngmSRR2981140 ngmSRR4453966 \
ngmSRR4454020 ngmSRR4454026 ngmSRR5628058 ngmSRR5947292 ngmSRR5947293 ngmSRR5947294 \
ngmSRR7588781 ngmSRR8285768 ngmfasno ngmfasso ngmfuscata2 ngmnemestrina2 ngmsilenus ngmsylvanus; do
    samtools view -b -@ 12 /moto/eaton/projects/macaques/mapped/$i.rg.raw.sorted.bam NC_018155.2 > \
        /moto/eaton/projects/macaques/mapped/Chr4/$i.Chr4.bam
    done

In [3]:
!mkdir /moto/eaton/projects/macaques/mapped/Chr3

In [4]:
%%bash
source ~/.bashrc
conda activate py2
for i in ngmDRR002233 ngmSRR1024051 ngmSRR2981114 ngmSRR2981139 ngmSRR2981140 ngmSRR4453966 \
ngmSRR4454020 ngmSRR4454026 ngmSRR5628058 ngmSRR5947292 ngmSRR5947293 ngmSRR5947294 \
ngmSRR7588781 ngmSRR8285768 ngmfasno ngmfasso ngmfuscata2 ngmnemestrina2 ngmsilenus ngmsylvanus; do
    samtools view -b -@ 12 /moto/eaton/projects/macaques/mapped/$i.rg.raw.sorted.bam NC_018154.2 > \
        /moto/eaton/projects/macaques/mapped/Chr3/$i.Chr3.bam
    done

In [5]:
!mkdir /moto/eaton/projects/macaques/mapped/Chr2

In [6]:
%%bash
source ~/.bashrc
conda activate py2
for i in ngmDRR002233 ngmSRR1024051 ngmSRR2981114 ngmSRR2981139 ngmSRR2981140 ngmSRR4453966 \
ngmSRR4454020 ngmSRR4454026 ngmSRR5628058 ngmSRR5947292 ngmSRR5947293 ngmSRR5947294 \
ngmSRR7588781 ngmSRR8285768 ngmfasno ngmfasso ngmfuscata2 ngmnemestrina2 ngmsilenus ngmsylvanus; do
    samtools view -b -@ 12 /moto/eaton/projects/macaques/mapped/$i.rg.raw.sorted.bam NC_018153.2 > \
        /moto/eaton/projects/macaques/mapped/Chr2/$i.Chr2.bam
    done

In [7]:
!mkdir /moto/eaton/projects/macaques/mapped/Chr1

In [8]:
%%bash
source ~/.bashrc
conda activate py2
for i in ngmDRR002233 ngmSRR1024051 ngmSRR2981114 ngmSRR2981139 ngmSRR2981140 ngmSRR4453966 \
ngmSRR4454020 ngmSRR4454026 ngmSRR5628058 ngmSRR5947292 ngmSRR5947293 ngmSRR5947294 \
ngmSRR7588781 ngmSRR8285768 ngmfasno ngmfasso ngmfuscata2 ngmnemestrina2 ngmsilenus ngmsylvanus; do
    samtools view -b -@ 12 /moto/eaton/projects/macaques/mapped/$i.rg.raw.sorted.bam NC_018152.2 > \
        /moto/eaton/projects/macaques/mapped/Chr1/$i.Chr1.bam
    done
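The four cells above differ only in the chromosome label and the reference accession, so the same commands can be generated from a small lookup table. A minimal Python sketch follows; the mapping is taken from the cells above, the helper `split_commands` is hypothetical, and only two sample names are used to keep the example short.

```python
MAPPED = "/moto/eaton/projects/macaques/mapped"

# Chromosome label -> reference accession, as used in the cells above
CHROM_ACCESSIONS = {
    "Chr1": "NC_018152.2",
    "Chr2": "NC_018153.2",
    "Chr3": "NC_018154.2",
    "Chr4": "NC_018155.2",
}

def split_commands(samples, chroms=CHROM_ACCESSIONS, threads=12):
    """Return one `samtools view` shell command per (chromosome, sample) pair."""
    commands = []
    for label, accession in chroms.items():
        for sample in samples:
            src = f"{MAPPED}/{sample}.rg.raw.sorted.bam"
            dst = f"{MAPPED}/{label}/{sample}.{label}.bam"
            commands.append(
                f"samtools view -b -@ {threads} {src} {accession} > {dst}"
            )
    return commands

commands = split_commands(["ngmfasno", "ngmfasso"])
for command in commands:
    print(command)
```

Region queries like these require the sorted BAM to be indexed first.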

In [None]:
%%bash
source ~/.bashrc
conda activate py2
for j in Chr4 Chr3 Chr2 Chr1; do
    for i in ngmDRR002233 ngmSRR1024051 ngmSRR2981114 ngmSRR2981139 ngmSRR2981140 ngmSRR4453966 \
    ngmSRR4454020 ngmSRR4454026 ngmSRR5628058 ngmSRR5947292 ngmSRR5947293 ngmSRR5947294 \
    ngmSRR7588781 ngmSRR8285768 ngmfasno ngmfasso ngmfuscata2 ngmnemestrina2 ngmsilenus ngmsylvanus; do
        samtools index /moto/eaton/projects/macaques/mapped/$j/$i.$j.bam
    done
done

In [None]:
%%bash
source ~/.bashrc
conda activate py2
for j in Chr4 Chr3 Chr2 Chr1; do
    for i in ngmDRR002233 ngmSRR1024051 ngmSRR2981114 ngmSRR2981139 ngmSRR2981140 ngmSRR4453966 \
    ngmSRR4454020 ngmSRR4454026 ngmSRR5628058 ngmSRR5947292 ngmSRR5947293 ngmSRR5947294 \
    ngmSRR7588781 ngmSRR8285768 ngmfasno ngmfasso ngmfuscata2 ngmnemestrina2 ngmsilenus ngmsylvanus; do
        sambamba markdup -t 12 --overflow-list-size 6000000 --hash-table-size 6000000 \
            --tmpdir=/moto/eaton/projects/macaques/tmp \
            /moto/eaton/projects/macaques/mapped/$j/$i.$j.bam \
            /moto/eaton/projects/macaques/mapped/$j/$i.$j.mark.bam
    done
done

In [None]:
!sbatch /moto/eaton/projects/macaques/scripts/slurm-scripts/freebayesChr4.sh

In [None]:
!sbatch /moto/eaton/projects/macaques/scripts/slurm-scripts/freebayesChr3.sh

In [None]:
!sbatch /moto/eaton/projects/macaques/scripts/slurm-scripts/freebayesChr2.sh

In [None]:
!sbatch /moto/eaton/projects/macaques/scripts/slurm-scripts/freebayesChr1.sh

In [None]:
!scancel -u nsl2119 333457

In [None]:
%%bash
freebayes -f /moto/eaton/projects/macaques/refpapio/refpapio.fa \
    --genotype-qualities \
    /moto/eaton/projects/macaques/mapped/MT/ngmDRR002233.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR1024051.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR2981114.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR2981139.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR2981140.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR4453966.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR4454020.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR4454026.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR5628058.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR5947292.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR5947293.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR5947294.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR7588781.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmSRR8285768.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmfasno.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/fasso.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmfuscata2.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmnemestrina2.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmsilenus.MT.mark.bam \
    /moto/eaton/projects/macaques/mapped/MT/ngmsylvanus.MT.mark.bam \
    >/moto/eaton/projects/macaques/mapped/MT/calls/MT_2.raw.vcf

In [None]:
%%bash
freebayes-parallel <(fasta_generate_regions.py /moto/eaton/projects/macaques/refpapio/refpapio.fa 100000) 12 \
    -f /moto/eaton/projects/macaques/refpapio/refpapio.fa \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmDRR002233.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR1024051.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR2981114.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR2981139.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR2981140.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR4453966.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR4454020.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR4454026.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR5628058.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR5947292.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR5947293.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR5947294.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR7588781.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmSRR8285768.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmfasno.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/fasso.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmfuscata2.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmnemestrina2.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmsilenus.Chr19.mark.bam \
    /moto/eaton/projects/macaques/mapped/Chr19/ngmsylvanus.Chr19.mark.bam \
    >/moto/eaton/projects/macaques/mapped/Chr19/calls/Chr19_2.raw.vcf
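Because the long lists of BAM paths differ only in the region label, they can be generated from one sample list. The sketch below rebuilds the plain `freebayes` call for a given label; the helper names are hypothetical, the sample names are copied from the commands above (including `fasso` without the `ngm` prefix), and nothing is executed.

```python
MAPPED = "/moto/eaton/projects/macaques/mapped"
REF = "/moto/eaton/projects/macaques/refpapio/refpapio.fa"

SAMPLES = [
    "ngmDRR002233", "ngmSRR1024051", "ngmSRR2981114", "ngmSRR2981139",
    "ngmSRR2981140", "ngmSRR4453966", "ngmSRR4454020", "ngmSRR4454026",
    "ngmSRR5628058", "ngmSRR5947292", "ngmSRR5947293", "ngmSRR5947294",
    "ngmSRR7588781", "ngmSRR8285768", "ngmfasno", "fasso", "ngmfuscata2",
    "ngmnemestrina2", "ngmsilenus", "ngmsylvanus",
]

def marked_bams(label):
    """Paths of the duplicate-marked BAMs for one region label (e.g. MT)."""
    return [f"{MAPPED}/{label}/{s}.{label}.mark.bam" for s in SAMPLES]

def freebayes_command(label):
    """Argument list and VCF output path for one freebayes run."""
    args = ["freebayes", "-f", REF, "--genotype-qualities"] + marked_bams(label)
    output = f"{MAPPED}/{label}/calls/{label}_2.raw.vcf"
    return args, output

args, output = freebayes_command("MT")
print(len(args) - 4, "input BAMs ->", output)
```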

### A) Mapping and sorting northern _Macaca mulatta_:

In [8]:
%%bash
for i in /moto/eaton/projects/macaques/filteredfastq/*.filtered_*; do
    echo $i
    done

/moto/eaton/projects/macaques/filteredfastq/fuscata2.filtered_1.fastq.gz
/moto/eaton/projects/macaques/filteredfastq/fuscata2.filtered_2.fastq.gz
/moto/eaton/projects/macaques/filteredfastq/SRR8285768.filtered_1.fastq.gz
/moto/eaton/projects/macaques/filteredfastq/SRR8285768.filtered_2.fastq.gz


In [None]:
%time !ngm --no-progress --rg-id SRR5628058 --rg-sm SRR5628058 -t 24 -r /moto/eaton/projects/macaques/refpapio/refpapio.fna.gz \
    -1 /moto/eaton/projects/macaques/filteredfastq/SRR5628058.filtered_1.fastq.gz \
    -2 /moto/eaton/projects/macaques/filteredfastq/SRR5628058.filtered_2.fastq.gz \
    | samtools view -b -@ 3 -> /moto/eaton/projects/macaques/mapped/ngmSRR5628058.rg.raw.bam

[MAIN] NextGenMap 0.5.5
[MAIN] Startup : x64 (build Jul 15 2018 19:15:59)
[MAIN] Starting time: 2019-02-11.23:15:24
[CONFIG] Parameter:  --affine 0 --argos_min_score 0 --bin_size 2 --block_multiplier 2 --broken_pairs 0 --bs_cutoff 6 --bs_mapping 0 --cpu_threads 24 --dualstrand 1 --fast 0 --fast_pairing 0 --force_rlength_check 0 --format 1 --gap_extend_penalty 5 --gap_read_penalty 20 --gap_ref_penalty 20 --hard_clip 0 --keep_tags 0 --kmer 13 --kmer_min 0 --kmer_skip 2 --match_bonus 10 --match_bonus_tc 2 --match_bonus_tt 10 --max_cmrs 2147483647 --max_equal 1 --max_insert_size 1000 --max_polya -1 --max_read_length 0 --min_identity 0.650000 --min_insert_size 0 --min_mq 0 --min_residues 0.500000 --min_score 0.000000 --mismatch_penalty 15 --mode 0 --no_progress 1 --no_unal 0 --ocl_threads 1 --overwrite 1 --pair_score_cutoff 0.900000 --paired 1 --parse_all 1 --pe_delimiter / --qry1 /moto/eaton/projects/macaques/filteredfastq/SRR5628058.filtered_1.fastq.gz --qry2 /moto/eaton/projects/macaques

In [None]:
%time !ngm --no-progress --rg-id fasno --rg-sm fasno -t 24 -r /moto/eaton/projects/macaques/refpapio/refpapio.fna.gz \
    -1 /moto/eaton/projects/macaques/filteredfastq/fasno.filtered_1.fastq.gz \
    -2 /moto/eaton/projects/macaques/filteredfastq/fasno.filtered_2.fastq.gz \
    | samtools view -b -@ 3 -> /moto/eaton/projects/macaques/mapped/ngmfasno.rg.raw.bam

In [None]:
%time !ngm --no-progress --rg-id SRR2981139 --rg-sm SRR2981139 -t 24 -r /moto/eaton/projects/macaques/refpapio/refpapio.fna.gz \
    -1 /moto/eaton/projects/macaques/filteredfastq/SRR2981139.filtered_1.fastq.gz \
    -2 /moto/eaton/projects/macaques/filteredfastq/SRR2981139.filtered_2.fastq.gz \
    | samtools view -b -@ 3 -> /moto/eaton/projects/macaques/mapped/ngmSRR2981139.rg.raw.bam

In [None]:
%time !ngm --no-progress --rg-id SRR1024051 --rg-sm SRR1024051 -t 24 -r /moto/eaton/projects/macaques/refpapio/refpapio.fna.gz \
    -1 /moto/eaton/projects/macaques/filteredfastq/SRR1024051.filtered_1.fastq.gz \
    -2 /moto/eaton/projects/macaques/filteredfastq/SRR1024051.filtered_2.fastq.gz \
    | samtools view -b -@ 3 -> /moto/eaton/projects/macaques/mapped/ngmSRR1024051.rg.raw.bam

In [None]:
%time !ngm --no-progress --rg-id macintro --rg-sm SRR5628058 -t 24 -r /moto/eaton/projects/macaques/refpapio/refpapio.fna.gz \
    -1 /moto/eaton/projects/macaques/filteredfastq/SRR5628058.filtered_1.fastq.gz \
    -2 /moto/eaton/projects/macaques/filteredfastq/SRR5628058.filtered_2.fastq.gz \
    | samtools view -b -@ 3 -> /moto/eaton/projects/macaques/mapped/ngmSRR5628058.rg.raw.bam
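Since each mapping cell repeats the sample name in the read group, both fastq paths, and the output file, it is easy for one of them to fall out of sync. A minimal Python sketch that derives all of them from a single name is below; the helper `ngm_pipeline` is hypothetical, it assumes the read-group ID equals the sample name, and it only returns the shell command as a string.

```python
REF = "/moto/eaton/projects/macaques/refpapio/refpapio.fna.gz"
FASTQ_DIR = "/moto/eaton/projects/macaques/filteredfastq"
MAPPED = "/moto/eaton/projects/macaques/mapped"

def ngm_pipeline(sample, threads=24):
    """Return the ngm | samtools shell pipeline for one paired-end sample."""
    return (
        f"ngm --no-progress --rg-id {sample} --rg-sm {sample} -t {threads} -r {REF} "
        f"-1 {FASTQ_DIR}/{sample}.filtered_1.fastq.gz "
        f"-2 {FASTQ_DIR}/{sample}.filtered_2.fastq.gz "
        f"| samtools view -b -@ 3 - > {MAPPED}/ngm{sample}.rg.raw.bam"
    )

command = ngm_pipeline("SRR1024051")
print(command)
```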

In [None]:
test=['SRR5628058', 'fasno', 'SRR2981139', 'SRR1024051', 'silenus', 'sylvanus']

In [None]:
##mapping
for i in test:
    cmd='minimap2 -ax sr -t 19 /moto/eaton/projects/macaques/refpapio/refpapio.fa \
            /moto/eaton/projects/macaques/filteredfastq/'+i+'.filtered_1.fastq.gz \
            /moto/eaton/projects/macaques/filteredfastq/'+i+'.filtered_2.fastq.gz \
            | samtools view -b -> /moto/eaton/projects/macaques/mapped/'+i+'.raw.bam'
    os.system(cmd)

In [4]:
test=['SRR5628058', 'ngmfasno', 'ngmSRR2981139', 'ngmSRR1024051', 'ngmsilenus', 'ngmsylvanus']

In [None]:
##sorting mapped data
for i in test:
    cmd='sambamba sort -t 20 --tmpdir=/moto/eaton/projects/macaques/tmp/ \
            /moto/eaton/projects/macaques/mapped/'+i+'.raw.bam'
    os.system(cmd)

In [2]:
test=['ngmfasno']

In [None]:
##marking likely PCR duplicates
for i in test:
    cmd='sambamba markdup -t 20 -p --overflow-list-size 6000000 --hash-table-size 6000000 \
            --tmpdir=/moto/eaton/projects/macaques/tmp/ \
            /moto/eaton/projects/macaques/mapped/'+i+'.raw.sorted.bam \
            /moto/eaton/projects/macaques/mapped/'+i+'.mark.bam'
    os.system(cmd)

In [None]:
test=['ngmSRR2981139', 'ngmSRR1024051', 'ngmsilenus', 'ngmsylvanus']

In [None]:
%%bash
sambamba markdup -t 12 -p --overflow-list-size 6000000 --hash-table-size 6000000 \
            --tmpdir=/moto/eaton/projects/macaques/tmp/ \
            /moto/eaton/projects/macaques/mapped/fasso.rg.raw.bam \
            /moto/eaton/projects/macaques/mapped/fasso.mark.bam

In [None]:
for i in test:
    cmd='sambamba markdup -t 20 -p --overflow-list-size 6000000 --hash-table-size 6000000 \
            --tmpdir=/moto/eaton/projects/macaques/tmp/ \
            /moto/eaton/projects/macaques/mapped/'+i+'.raw.sorted.bam \
            /moto/eaton/projects/macaques/mapped/'+i+'.mark.bam'
    os.system(cmd)
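The cells above rely on a fixed naming chain: `sambamba sort` writes its result next to the input with a `.sorted.bam` extension when no output is given, and `markdup` then reads that file. A minimal Python sketch of the chain, with a hypothetical helper `stage_paths`:

```python
MAPPED = "/moto/eaton/projects/macaques/mapped"

def stage_paths(name):
    """Return the raw, sorted, and duplicate-marked BAM paths for one sample name."""
    raw = f"{MAPPED}/{name}.raw.bam"
    return {
        "raw": raw,
        # default output name of `sambamba sort` when -o is not given
        "sorted": raw[: -len(".bam")] + ".sorted.bam",
        "marked": f"{MAPPED}/{name}.mark.bam",
    }

for name in ["ngmfasno", "ngmsilenus"]:
    paths = stage_paths(name)
    print(paths["raw"], "->", paths["sorted"], "->", paths["marked"])
```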

### B) Calling Variants with FreeBayes

In [None]:
##!conda install -c bioconda freebayes

In [2]:
!mkdir /moto/eaton/projects/macaques/calls/

In [35]:
test=['SRR5628058', 'ngmfasno', 'ngmSRR2981139', 'ngmSRR1024051', 'ngmsilenus', 'ngmsylvanus']

In [37]:
!samtools addreplacerg -r 'ID:SRR5628058' -r 'SM:SRR5628058' \
        -o /moto/eaton/projects/macaques/mapped/SRR5628058.NC_018153.2.fixed.bam \
        /moto/eaton/projects/macaques/mapped/SRR5628058.NC_018153.2.mark.bam

In [38]:
!samtools addreplacerg -r 'ID:ngmfasno' -r 'SM:ngmfasno' \
        -o /moto/eaton/projects/macaques/mapped/ngmfasno.NC_018153.2.fixed.bam \
        /moto/eaton/projects/macaques/mapped/ngmfasno.NC_018153.2.mark.bam

In [39]:
!samtools addreplacerg -r 'ID:ngmSRR2981139' -r 'SM:ngmSRR2981139' \
        -o /moto/eaton/projects/macaques/mapped/ngmSRR2981139.NC_018153.2.fixed.bam \
        /moto/eaton/projects/macaques/mapped/ngmSRR2981139.NC_018153.2.mark.bam

In [40]:
!samtools addreplacerg -r 'ID:ngmSRR1024051' -r 'SM:ngmSRR1024051' \
        -o /moto/eaton/projects/macaques/mapped/ngmSRR1024051.NC_018153.2.fixed.bam \
        /moto/eaton/projects/macaques/mapped/ngmSRR1024051.NC_018153.2.mark.bam

In [41]:
!samtools addreplacerg -r 'ID:ngmsilenus' -r 'SM:ngmsilenus' \
        -o /moto/eaton/projects/macaques/mapped/ngmsilenus.NC_018153.2.fixed.bam \
        /moto/eaton/projects/macaques/mapped/ngmsilenus.NC_018153.2.mark.bam

In [42]:
!samtools addreplacerg -r 'ID:ngmsylvanus' -r 'SM:ngmsylvanus' \
        -o /moto/eaton/projects/macaques/mapped/ngmsylvanus.NC_018153.2.fixed.bam \
        /moto/eaton/projects/macaques/mapped/ngmsylvanus.NC_018153.2.mark.bam

In [43]:
%%bash
for i in SRR5628058 ngmfasno ngmSRR2981139 ngmSRR1024051 ngmsilenus ngmsylvanus; do
    samtools index /moto/eaton/projects/macaques/mapped/$i.NC_018153.2.fixed.bam
    done
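The six `samtools addreplacerg` cells follow one template, so they can be generated from the sample list. The sketch below only builds the shell commands; the helper `addreplacerg_command` is hypothetical, and the read-group ID and sample tags are both set to the sample name, as in the cells above.

```python
MAPPED = "/moto/eaton/projects/macaques/mapped"
REGION = "NC_018153.2"
SAMPLES = ["SRR5628058", "ngmfasno", "ngmSRR2981139",
           "ngmSRR1024051", "ngmsilenus", "ngmsylvanus"]

def addreplacerg_command(sample, region=REGION):
    """Return the shell command that rewrites the read group of one marked BAM."""
    return (
        f"samtools addreplacerg -r 'ID:{sample}' -r 'SM:{sample}' "
        f"-o {MAPPED}/{sample}.{region}.fixed.bam "
        f"{MAPPED}/{sample}.{region}.mark.bam"
    )

commands = [addreplacerg_command(sample) for sample in SAMPLES]
for command in commands:
    print(command)
```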

In [46]:
%%bash
freebayes-parallel <(fasta_generate_regions.py /moto/eaton/projects/macaques/refpapio/refpapio.fa 100000) 22 \
    -f /moto/eaton/projects/macaques/refpapio/refpapio.fa \
    /moto/eaton/projects/macaques/mapped/SRR5628058.NC_018153.2.fixed.bam \
    /moto/eaton/projects/macaques/mapped/ngmfasno.NC_018153.2.fixed.bam \
    /moto/eaton/projects/macaques/mapped/ngmSRR2981139.NC_018153.2.fixed.bam \
    /moto/eaton/projects/macaques/mapped/ngmSRR1024051.NC_018153.2.fixed.bam \
    /moto/eaton/projects/macaques/mapped/ngmsilenus.NC_018153.2.fixed.bam \
    /moto/eaton/projects/macaques/mapped/ngmsylvanus.NC_018153.2.fixed.bam \
    >/moto/eaton/projects/macaques/calls/test.chr2.vcf

Process is interrupted.


Finding the names of the scaffolds that reads were mapped to:

In [6]:
!samtools idxstats /moto/eaton/projects/macaques/mapped/ngmfasno.mark.bam \
    | cut -f 1 | head -22

NC_018152.2
NC_018153.2
NC_018154.2
NC_018155.2
NC_018156.2
NC_018157.2
NC_018158.2
NC_018159.2
NC_018160.2
NC_018161.2
NC_018162.2
NC_018163.2
NC_018164.2
NC_018165.2
NC_018166.2
NC_018167.2
NC_018168.2
NC_018169.2
NC_018170.2
NC_018171.2
NC_018172.2
NW_018761063.1
cut: write error: Broken pipe


Splitting by chromosome (`NC_*` records are chromosomes or the mitochondrion and `NW_*` records are unassigned scaffolds, so we only need the `NC_*` names):

In [4]:
test=['SRR5628058', 'ngmfasno', 'ngmSRR2981139', 'ngmSRR1024051', 'ngmsilenus', 'ngmsylvanus']

In [2]:
!mkdir /moto/eaton/projects/macaques/mapped/MT

In [1]:
%%bash
for i in ngmDRR002233 ngmSRR1024051 ngmSRR2981114 ngmSRR2981139 ngmSRR2981140 ngmSRR4453966 ngmSRR4454020 ngmSRR4454026 ngmSRR5628058 ngmSRR5947292 ngmSRR5947293 ngmSRR5947294 ngmSRR7588781 ngmSRR8285768 ngmfasno ngmfasso ngmfuscata2 ngmnemestrina2 ngmsilenus ngmsylvanus; do
    samtools view -b -@ 12 /moto/eaton/projects/macaques/mapped/$i.rg.raw.sorted.bam NC_020006.2 > \
        /moto/eaton/projects/macaques/mapped/MT/$i.MT.bam
    done

/moto/eaton/projects/macaques/mapped/ngmDRR002233.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR1024051.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR2981114.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR2981139.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR2981140.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR4453966.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR4454020.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR4454026.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR5628058.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR5947292.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR5947293.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR5947294.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR7588781.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mapped/ngmSRR8285768.rg.raw.sorted.bam
/moto/eaton/projects/macaques/mappe
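The `NC_*` versus `NW_*` distinction can be applied programmatically to the first column of `samtools idxstats`. A minimal Python sketch, using a short subset of the reference names seen in this notebook as input; the helper `chromosome_names` is hypothetical.

```python
# A subset of the names printed by `samtools idxstats ... | cut -f 1`
idxstats_names = """NC_018152.2
NC_018153.2
NC_018154.2
NC_020006.2
NW_018761063.1
""".split()

def chromosome_names(names):
    """Keep assembled chromosome/mitochondrial records, drop unassigned scaffolds."""
    return [name for name in names if name.startswith("NC_")]

kept = chromosome_names(idxstats_names)
print(kept)
```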

In [5]:
for i in test:
    cmd='samtools view -@ 7 \
            /moto/eaton/projects/macaques/mapped/'+i+'.mark.bam NC_018153.2 -b > \
            /moto/eaton/projects/macaques/mapped/'+i+'.NC_018153.2.bam'
    os.system(cmd)

In [6]:
for i in test:
    cmd='sambamba sort -t 20 --tmpdir=/moto/eaton/projects/macaques/tmp/ \
            /moto/eaton/projects/macaques/mapped/'+i+'.NC_018153.2.bam'
    os.system(cmd)

In [7]:
for i in test:
    cmd='sambamba markdup -t 20 -p --overflow-list-size 6000000 --hash-table-size 6000000 \
            --tmpdir=/moto/eaton/projects/macaques/tmp/ \
            /moto/eaton/projects/macaques/mapped/'+i+'.NC_018153.2.sorted.bam \
            /moto/eaton/projects/macaques/mapped/'+i+'.NC_018153.2.mark.bam'
    os.system(cmd)

In [None]:
%%bash
for i in NC_018152.2 NC_018153.2 NC_018154.2; do
    samtools view \
        /moto/eaton/projects/macaques/mulattanorthern/filtercall/mulattanorthern.SRR4454026.raw.minimap2.sorted.bam $i -b > \
        /moto/eaton/projects/macaques/mulattanorthern/filtercall/mulattanorthern.$i.bam
done

In [None]:
%%bash
for i in NC_018152.2 NC_018153.2 NC_018154.2 NC_018155.2 NC_018156.2 NC_018157.2 NC_018158.2 NC_018159.2 NC_018160.2 NC_018161.2 NC_018162.2 NC_018163.2 NC_018164.2 NC_018165.2 NC_018166.2 NC_018167.2 NC_018168.2 NC_018169.2 NC_018170.2 NC_018171.2 NC_018172.2 NC_020006.2; do
    sambamba sort -t 24 -p --tmpdir=/moto/eaton/projects/macaques/mulattanorthern/filtercall/ \
        /moto/eaton/projects/macaques/mulattanorthern/filtercall/mulattanorthern.$i.bam
done

In [None]:
%%bash
for i in NC_018152.2 NC_018153.2 NC_018154.2 NC_018155.2 NC_018156.2 NC_018157.2 NC_018158.2 NC_018159.2 NC_018160.2 NC_018161.2 NC_018162.2 NC_018163.2 NC_018164.2 NC_018165.2 NC_018166.2 NC_018167.2 NC_018168.2 NC_018169.2 NC_018170.2 NC_018171.2 NC_018172.2 NC_020006.2; do
    sambamba markdup -t 24 -p --tmpdir=/moto/eaton/projects/macaques/mulattanorthern/filtercall/ \
        /moto/eaton/projects/macaques/mulattanorthern/filtercall/mulattanorthern.$i.sorted.bam \
        /moto/eaton/projects/macaques/mulattanorthern/filtercall/mulattanorthern.$i.ready.bam
done

In [2]:
!mkdir /moto/eaton/projects/macaques/MAPPEDANDSORTED

In [3]:
##moving our mapped and sorted reads, separated by chromosome/mitochondria, to a staging directory
!mv /moto/eaton/projects/macaques/mulattanorthern/filtercall/*.ready.* \
    /moto/eaton/projects/macaques/MAPPEDANDSORTED

### C) Converting VCF to Phylip for Phylogenetic Analyses

In [8]:
##!git clone https://github.com/edgardomortiz/vcf2phylip
!python /moto/home/nsl2119/vcf2phylip/vcf2phylip.py -i /moto/eaton/projects/macaques/calls/test.chr2.biallelic.vcf


Converting file /moto/eaton/projects/macaques/calls/test.chr2.biallelic.vcf:

Number of samples in VCF: 6
500000 genotypes processed.
1000000 genotypes processed.
1500000 genotypes processed.
2000000 genotypes processed.
2500000 genotypes processed.
3000000 genotypes processed.
3500000 genotypes processed.
4000000 genotypes processed.
4500000 genotypes processed.
5000000 genotypes processed.
5500000 genotypes processed.
6000000 genotypes processed.
6500000 genotypes processed.
7000000 genotypes processed.
Total of genotypes processed: 7146409
Genotypes excluded because they exceeded the amount of missing data allowed: 7146409
Genotypes that passed missing data filter but were excluded for not being SNPs: 0
SNPs that passed the filters: 0

Sample 1 of 6, ngmSRR1024051, added to the nucleotide matrix(ces).
Sample 2 of 6, ngmSRR2981139, added to the nucleotide matrix(ces).
Sample 3 of 6, SRR5628058, added to the nucleotide matrix(ces).
Sample 4 of 6, ngmsilenus, added to the nucleotide m
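The log above reports that every genotype was excluded by the missing-data filter. One way to investigate this is to count, for each VCF record, how many samples have a called genotype. The sketch below is a minimal, assumption-laden illustration: the helper `called_samples` is hypothetical, and the example record is invented for demonstration and does not come from our data.

```python
def called_samples(vcf_line):
    """Count samples in one VCF data line whose GT field has no missing allele."""
    fields = vcf_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # column 9 is FORMAT
    count = 0
    for sample in fields[9:]:
        parts = sample.split(":")
        genotype = parts[gt_index] if gt_index < len(parts) else "."
        alleles = genotype.replace("|", "/").split("/")
        if all(allele != "." for allele in alleles):
            count += 1
    return count

# Invented record with four samples, two of them missing
toy_record = "chrom\t1\t.\tA\tG\t50\t.\t.\tGT:DP\t0/1:10\t./.:0\t.\t1/1:7"
print(called_samples(toy_record))
```

Comparing these counts with the converter's missing-data threshold shows whether the input VCF or the threshold is the likelier cause.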

### B) Mapping and sorting southern, low altitude _Macaca mulatta_:

In [4]:
!mkdir /moto/eaton/projects/macaques/mulattasouthernlow/filtercall

In [None]:
%time !minimap2 -ax sr -t 24 /moto/eaton/projects/macaques/refpapio/refpapio.fa \
    /moto/eaton/projects/macaques/TRIM/mulattasouthernlowSRR4454020_1.fastq.gz \
    /moto/eaton/projects/macaques/TRIM/mulattasouthernlowSRR4454020_2.fastq.gz \
    | samtools view -b -> /moto/eaton/projects/macaques/mulattasouthernlow/filtercall/mulattasouthernlow.SRR4454020.raw.minimap2.bam

In [6]:
%time !sambamba sort -t 24 -p --tmpdir=/moto/eaton/projects/macaques/mulattasouthernlow/filtercall/ \
    /moto/eaton/projects/macaques/mulattasouthernlow/filtercall/mulattasouthernlow.SRR4454020.raw.minimap2.bam


sambamba 0.6.8 by Artem Tarasov and Pjotr Prins (C) 2012-2018
    LDC 1.11.0 / DMD v2.081.2 / LLVM6.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6git-0156298)

Writing sorted chunks to temporary directory...
Merging sorted chunks...
CPU times: user 14.1 s, sys: 869 ms, total: 14.9 s
Wall time: 5min 28s


In [10]:
%%bash
for i in NC_018152.2 NC_018153.2 NC_018154.2 NC_018155.2 NC_018156.2 NC_018157.2 NC_018158.2 NC_018159.2 NC_018160.2 NC_018161.2 NC_018162.2 NC_018163.2 NC_018164.2 NC_018165.2 NC_018166.2 NC_018167.2 NC_018168.2 NC_018169.2 NC_018170.2 NC_018171.2 NC_018172.2 NC_020006.2; do
    samtools view \
        /moto/eaton/projects/macaques/mulattasouthernlow/filtercall/mulattasouthernlow.SRR4454020.raw.minimap2.sorted.bam $i -b > \
        /moto/eaton/projects/macaques/mulattasouthernlow/filtercall/mulattasouthernlow.$i.bam
done
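
The 22 accessions are typed out by hand in each loop. Because `samtools faidx` was already run on the reference, the names could instead be read from the resulting `.fai` index, whose first tab-separated column is the sequence name. A minimal sketch, assuming the index sits next to the reference as `refpapio.fa.fai`. The helper name `contigs_from_fai`, the `NC_` prefix filter, and the example index lines are illustrative, not taken from the real index, so the result should be checked against the hand-typed list:

```python
def contigs_from_fai(lines, prefix="NC_"):
    """Return sequence names (column 1 of a .fai index) that start with `prefix`."""
    names = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name = line.split("\t")[0]
        if name.startswith(prefix):
            names.append(name)
    return names


# Made-up index lines for illustration: name, length, offset, linebases, linewidth.
example_fai = [
    "NC_018152.2\t1000\t13\t80\t81",
    "NW_000000000.1\t500\t1100\t80\t81",  # hypothetical unplaced scaffold, filtered out
    "NC_018153.2\t900\t1700\t80\t81",
]
print(contigs_from_fai(example_fai))

# With the real index (path assumed from the faidx step):
# with open("/moto/eaton/projects/macaques/refpapio/refpapio.fa.fai") as fh:
#     chroms = contigs_from_fai(fh)
```

The returned list can then be joined with spaces and pasted into the `for i in ...` loops, or iterated over directly from Python.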

In [None]:
%%bash
for i in NC_018152.2 NC_018153.2 NC_018154.2 NC_018155.2 NC_018156.2 NC_018157.2 NC_018158.2 NC_018159.2 NC_018160.2 NC_018161.2 NC_018162.2 NC_018163.2 NC_018164.2 NC_018165.2 NC_018166.2 NC_018167.2 NC_018168.2 NC_018169.2 NC_018170.2 NC_018171.2 NC_018172.2 NC_020006.2; do
    sambamba sort -t 24 -p --tmpdir=/moto/eaton/projects/macaques/mulattasouthernlow/filtercall/ \
        /moto/eaton/projects/macaques/mulattasouthernlow/filtercall/mulattasouthernlow.$i.bam
done

In [None]:
%%bash
for i in NC_018152.2 NC_018153.2 NC_018154.2 NC_018155.2 NC_018156.2 NC_018157.2 NC_018158.2 NC_018159.2 NC_018160.2 NC_018161.2 NC_018162.2 NC_018163.2 NC_018164.2 NC_018165.2 NC_018166.2 NC_018167.2 NC_018168.2 NC_018169.2 NC_018170.2 NC_018171.2 NC_018172.2 NC_020006.2; do
    sambamba markdup -t 24 -p --tmpdir=/moto/eaton/projects/macaques/mulattasouthernlow/filtercall/ \
        /moto/eaton/projects/macaques/mulattasouthernlow/filtercall/mulattasouthernlow.$i.sorted.bam \
        /moto/eaton/projects/macaques/mulattasouthernlow/filtercall/mulattasouthernlow.$i.ready.bam
done

In [13]:
!mv /moto/eaton/projects/macaques/mulattasouthernlow/filtercall/*.ready.* \
    /moto/eaton/projects/macaques/MAPPEDANDSORTED

### C) Mapping and sorting southern, high-altitude _Macaca mulatta_:

In [14]:
!mkdir /moto/eaton/projects/macaques/mulattasouthernhigh/filtercall

In [None]:
%time !minimap2 -ax sr -t 24 /moto/eaton/projects/macaques/refpapio/refpapio.fa \
    /moto/eaton/projects/macaques/TRIM/mulattasouthernhighSRR4453966_1.fastq.gz \
    /moto/eaton/projects/macaques/TRIM/mulattasouthernhighSRR4453966_2.fastq.gz \
    | samtools view -b - > /moto/eaton/projects/macaques/mulattasouthernhigh/filtercall/mulattasouthernhigh.SRR4453966.raw.minimap2.bam

In [16]:
%time !sambamba sort -t 24 -p --tmpdir=/moto/eaton/projects/macaques/mulattasouthernhigh/filtercall/ \
    /moto/eaton/projects/macaques/mulattasouthernhigh/filtercall/mulattasouthernhigh.SRR4453966.raw.minimap2.bam


sambamba 0.6.8 by Artem Tarasov and Pjotr Prins (C) 2012-2018
    LDC 1.11.0 / DMD v2.081.2 / LLVM6.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6git-0156298)

Writing sorted chunks to temporary directory...
Merging sorted chunks...
CPU times: user 16.7 s, sys: 1.01 s, total: 17.7 s
Wall time: 6min 16s


In [17]:
%%bash
for i in NC_018152.2 NC_018153.2 NC_018154.2 NC_018155.2 NC_018156.2 NC_018157.2 NC_018158.2 NC_018159.2 NC_018160.2 NC_018161.2 NC_018162.2 NC_018163.2 NC_018164.2 NC_018165.2 NC_018166.2 NC_018167.2 NC_018168.2 NC_018169.2 NC_018170.2 NC_018171.2 NC_018172.2 NC_020006.2; do
    samtools view -b \
        /moto/eaton/projects/macaques/mulattasouthernhigh/filtercall/mulattasouthernhigh.SRR4453966.raw.minimap2.sorted.bam "$i" > \
        "/moto/eaton/projects/macaques/mulattasouthernhigh/filtercall/mulattasouthernhigh.$i.bam"
done

In [None]:
%%bash
for i in NC_018152.2 NC_018153.2 NC_018154.2 NC_018155.2 NC_018156.2 NC_018157.2 NC_018158.2 NC_018159.2 NC_018160.2 NC_018161.2 NC_018162.2 NC_018163.2 NC_018164.2 NC_018165.2 NC_018166.2 NC_018167.2 NC_018168.2 NC_018169.2 NC_018170.2 NC_018171.2 NC_018172.2 NC_020006.2; do
    sambamba sort -t 24 -p --tmpdir=/moto/eaton/projects/macaques/mulattasouthernhigh/filtercall/ \
        /moto/eaton/projects/macaques/mulattasouthernhigh/filtercall/mulattasouthernhigh.$i.bam
done

In [19]:
%%bash
for i in NC_018152.2 NC_018153.2 NC_018154.2 NC_018155.2 NC_018156.2 NC_018157.2 NC_018158.2 NC_018159.2 NC_018160.2 NC_018161.2 NC_018162.2 NC_018163.2 NC_018164.2 NC_018165.2 NC_018166.2 NC_018167.2 NC_018168.2 NC_018169.2 NC_018170.2 NC_018171.2 NC_018172.2 NC_020006.2; do
    sambamba markdup -t 24 -p --tmpdir=/moto/eaton/projects/macaques/mulattasouthernhigh/filtercall/ \
        /moto/eaton/projects/macaques/mulattasouthernhigh/filtercall/mulattasouthernhigh.$i.sorted.bam \
        /moto/eaton/projects/macaques/mulattasouthernhigh/filtercall/mulattasouthernhigh.$i.ready.bam
done

Process is interrupted.
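
Since the `markdup` loop was interrupted, some `.ready.bam` files may be missing. A minimal sketch for listing the chromosomes that still need processing, assuming the output naming used in the loop; the helper name `missing_ready_bams` is hypothetical. An existence check cannot tell a complete file from one cut off mid-write, so running `samtools quickcheck` on the files that do exist is a sensible follow-up.

```python
import os
import tempfile


def missing_ready_bams(directory, sample, chroms):
    """Return the chromosomes whose <sample>.<chrom>.ready.bam is absent or empty."""
    missing = []
    for chrom in chroms:
        path = os.path.join(directory, f"{sample}.{chrom}.ready.bam")
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(chrom)
    return missing


# Demonstration on a throwaway directory containing one finished output.
with tempfile.TemporaryDirectory() as demo_dir:
    done = os.path.join(demo_dir, "mulattasouthernhigh.NC_018152.2.ready.bam")
    with open(done, "wb") as fh:
        fh.write(b"placeholder")  # stands in for a real BAM file
    demo_missing = missing_ready_bams(
        demo_dir, "mulattasouthernhigh", ["NC_018152.2", "NC_018153.2"]
    )
    print(demo_missing)
```

The chromosomes returned can be fed back into the `markdup` loop in place of the full list.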


In [None]:
!mv /moto/eaton/projects/macaques/mulattasouthernhigh/filtercall/*.ready.* \
    /moto/eaton/projects/macaques/MAPPEDANDSORTED
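
Sections B and C run the same split, sort, and mark-duplicates steps and differ only in the sample and run names. One way to avoid copying the loops is to generate the commands from those names. A sketch under the file-naming conventions visible above; the function name `per_chromosome_commands` is hypothetical, and `samtools view -o` is swapped in for the shell redirection used in the loops:

```python
def per_chromosome_commands(workdir, sample, run, chroms, threads=24):
    """Build the split, sort, and markdup argument lists for each chromosome."""
    sorted_bam = f"{workdir}/{sample}.{run}.raw.minimap2.sorted.bam"
    commands = []
    for chrom in chroms:
        base = f"{workdir}/{sample}.{chrom}"
        # 1. Extract one chromosome from the coordinate-sorted BAM.
        commands.append(
            ["samtools", "view", "-b", "-o", f"{base}.bam", sorted_bam, chrom]
        )
        # 2. Sort it; sambamba writes <base>.sorted.bam by default.
        commands.append(
            ["sambamba", "sort", "-t", str(threads), "-p",
             f"--tmpdir={workdir}", f"{base}.bam"]
        )
        # 3. Mark duplicates into <base>.ready.bam.
        commands.append(
            ["sambamba", "markdup", "-t", str(threads), "-p",
             f"--tmpdir={workdir}", f"{base}.sorted.bam", f"{base}.ready.bam"]
        )
    return commands


cmds = per_chromosome_commands(
    "/moto/eaton/projects/macaques/mulattasouthernlow/filtercall",
    "mulattasouthernlow", "SRR4454020", ["NC_018152.2"],
)
for cmd in cmds:
    print(" ".join(cmd))
```

Each argument list can be handed to `subprocess.run(cmd, check=True)`, which stops at the first failing step instead of continuing silently.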

#### SAM to BAM conversion and sorting reads:

In [6]:
%time !samtools view -S -b results.sam > sample.bam ## simple conversion to BAM: approx. 21 min on a 12-thread desktop with 16 GB RAM, not bad

CPU times: user 23.8 s, sys: 3.48 s, total: 27.3 s
Wall time: 21min 13s


In [7]:
%time !samtools sort sample.bam -o sample.sorted.bam ## sorting the BAM file into genome (coordinate) order, ~26 min

[bam_sort_core] merging from 53 files and 1 in-memory blocks...
CPU times: user 28.7 s, sys: 3.85 s, total: 32.6 s
Wall time: 25min 41s


In [14]:
%time !samtools index sample.sorted.bam ##of course this will all be piped together...

CPU times: user 2.36 s, sys: 363 ms, total: 2.73 s
Wall time: 1min 57s


In [1]:
%time !samtools view sample.sorted.bam | head -n 1 ## Instead of logical chromosome names like Chr1, Chr2, etc., the reference genome labels chromosomes by accession (NC_027893.1, etc.)

SRR445694~125200.sra.858593	99	NC_027893.1	1	60	5S95M	=	294	393	AAGGCCATGGAAACAAGGAAAGTCTGAAAAACTCACAGTTTAGGAACCTAAAGAGACTTGACTACTAAATGGAATATATCTTGGGATCCTGGAAAAGAAA	CCCFFFFFHHHHHIIIIIIIIIHIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHFHHHHFFFFFFDDDDCCCDBCCDDBDD	AS:i:950	NM:i:0	NH:i:0	XI:f:1	X0:i:0	XE:i:28	XR:i:95	MD:Z:95
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
CPU times: user 5.74 ms, sys: 7.92 ms, total: 13.7 ms
Wall time: 224 ms
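
The second column of the record above is the SAM FLAG, a bit field. As a worked example, 99 = 1 + 2 + 32 + 64: the read is paired, mapped in a proper pair, its mate is on the reverse strand, and it is the first read of the pair. The CIGAR `5S95M` means 5 soft-clipped bases followed by 95 aligned bases. A small sketch of decoding both; the bit meanings follow the SAM specification, while the function and label names are illustrative:

```python
import re

# FLAG bits as defined in the SAM specification.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


print(decode_flag(99))
print(parse_cigar("5S95M"))
```

The `duplicate` bit (0x400) is the one that duplicate-marking tools such as `sambamba markdup` set on reads they flag.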


In [3]:
%time !samtools view -h -b sample.sorted.bam NC_027893.1 > chr1.bam ## This makes splitting files up for chromosome-level analyses a bit more annoying, but not too bad. I'll make a bash script

CPU times: user 1.19 s, sys: 147 ms, total: 1.34 s
Wall time: 58.6 s


Piping NGM output through samtools to produce a sorted BAM file:

In [None]:
!ngm -r ./reference-genome/Mmul8.fna.gz -1 out.R1.fq.gz -2 out.R2.fq.gz | samtools view -S -b - | samtools sort -o sample.sorted.bam -

#### Variant calling:

In [None]:
!freebayes -f ./reference-genome/Mmul8.fna.gz sample.sorted.bam > wholegenome.vcf ## example code for variant calling on the entire genome. Can be split by chromosome/region using -r
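
The `-r` option restricts FreeBayes to one region, so a whole-genome call can be split into one job per chromosome. A minimal sketch that builds the shell commands; the reference path, the `sample` output prefix, and the function name `freebayes_region_commands` are placeholders. As far as I know FreeBayes expects an uncompressed, `faidx`-indexed FASTA, so an unzipped reference is assumed here rather than the `.fna.gz` file:

```python
import shlex


def freebayes_region_commands(reference, bam, regions, prefix="sample"):
    """Return one shell command string per region, each writing its own VCF."""
    commands = []
    for region in regions:
        out = f"{prefix}.{region}.vcf"
        commands.append(
            "freebayes -f {} -r {} {} > {}".format(
                shlex.quote(reference),
                shlex.quote(region),
                shlex.quote(bam),
                shlex.quote(out),
            )
        )
    return commands


region_cmds = freebayes_region_commands(
    "reference.fa",        # placeholder: uncompressed, faidx-indexed reference
    "sample.sorted.bam",
    ["NC_027893.1"],
)
print(region_cmds[0])
```

The per-region VCFs would still need to be concatenated afterwards to recover a single genome-wide file.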