# Align SRA files using SRA ids read from a text file

This Jupyter notebook aligns the SRA files whose ids are listed in a text file. It supports single-end and paired-end reads from human and mouse samples, but all ids in one text file must come from the same species, so make sure the correct species is selected before aligning. Each sample is processed with kallisto, Salmon, STAR, HISAT2, and BWA, and the results are finally aggregated into a gene count matrix.

## Requirements

- 26GB of memory
- 100GB of free disk space
- CPU with high thread count (16)
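
The requirements above can be checked before any data is downloaded. The following is a minimal sketch, assuming a Linux host that provides `/proc/meminfo`, `df`, and `nproc`. The helper names (`mem_total_gb`, `disk_free_gb`, `check_min`) are hypothetical and not part of this notebook. Note that the kernel reports slightly less memory than is physically installed, so a warning close to the threshold is not necessarily a problem.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; thresholds mirror the requirements list.

# Total memory in whole GB, read from /proc/meminfo (Linux only).
mem_total_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Free disk space in whole GB for the file system holding directory $1.
disk_free_gb() {
    df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Print OK or WARN for one resource: check_min <label> <actual> <required>
check_min() {
    if [ "$2" -ge "$3" ]; then
        echo "OK   $1: $2 (need $3)"
    else
        echo "WARN $1: $2 (need $3)"
        return 1
    fi
}

check_min "memory_GB" "$(mem_total_gb)"   26  || true
check_min "disk_GB"   "$(disk_free_gb .)" 100 || true
check_min "threads"   "$(nproc)"          16  || true
```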

In [None]:
cd /alignment

# the number of threads the aligners should use.
thread_num=16

mkdir -p sradata
mkdir -p quant/combined
mkdir -p quant/salmon/
mkdir -p quant/kallisto/
mkdir -p quant/star/
mkdir -p quant/hisat2/
mkdir -p quant/bwa/
mkdir -p time

# With a high thread count STAR opens a large number of files, which can lead to errors.
# Setting ulimit -n 1024 should allow STAR to run with many threads.
ulimit -n 1024


# Disable the caching of SRA files before downloading any data.
# Otherwise the cache quickly fills up all disk space when many files are downloaded.
# The setting below is written to the SRA toolkit user configuration and turns caching off.
mkdir -p ~/.ncbi
echo '/repository/user/cache-disabled = "true"' > ~/.ncbi/user-settings.mkfg
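
The `echo ... >` call above replaces the whole settings file. If the file may already hold other settings, the line can be appended only when it is missing. The sketch below shows one way to do that; `ensure_line` is a hypothetical helper, not part of the SRA toolkit or of this notebook.

```shell
# Hypothetical helper: append a line to a file only if it is not already present.
# ensure_line <line> <file>
ensure_line() {
    local line=$1 file=$2
    mkdir -p "$(dirname "$file")"
    touch "$file"
    # -x matches the whole line, -F treats the pattern as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage with the setting from the cell above:
# ensure_line '/repository/user/cache-disabled = "true"' ~/.ncbi/user-settings.mkfg
```

Calling the helper repeatedly leaves a single copy of the line, so the cell can be re-run safely.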

Select the species to align the SRA samples against and the location of the precomputed index files. The bucket "mssm-genecount-combined" provides index files for the species "human" and "mouse". A separate Jupyter notebook is available for building new index files.

In [None]:
SPECIES="human"
AWS_BUCKET="mssm-genecount-combined"

Download the precomputed index files. The index files were built from Ensembl release 96 for the GRCh38 human and GRCm38 mouse reference genomes.

In [None]:
scripts/load_index_species.sh "${SPECIES}" "${AWS_BUCKET}"
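
Before starting a long run it can help to confirm that the download produced every path the alignment cells refer to. The sketch below collects the index and annotation paths used by the aligner calls later in this notebook and reports the ones that do not exist. `check_index_files` is a hypothetical helper. What `load_index_species.sh` actually downloads is not verified here; the list only reflects the paths referenced by the alignment functions.

```shell
# Hypothetical check: report index and annotation paths that are missing.
# check_index_files <species> [root_directory]
check_index_files() {
    local species=$1 root=${2:-.} missing=0 path
    for path in \
        "index/salmon/salmon_${species}_96" \
        "index/kallisto/kallisto_${species}_96.idx" \
        "index/star/${species}_96" \
        "index/hisat2/${species}_96" \
        "index/bwa/${species}_96" \
        "reference/${species}_96.gtf"
    do
        if [ ! -e "${root}/${path}" ]; then
            echo "missing: ${path}"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every path exists.
    [ "$missing" -eq 0 ]
}
```

A possible call is `check_index_files "${SPECIES}" || echo "index download incomplete"`.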

The following helper functions are called during the alignment: "center" prints a section banner and "downloadSRA" downloads the FASTQ files for one SRA id.

In [None]:
center() {
  echo "---------------------------------- ${1} ----------------------------------"
}

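downloadSRA(){
    # $1: SRA id, $2: output directory for the FASTQ files
    center "Download SRA file"
    # remove leftovers from a previous download
    rm -f sradata/*
    fasterq-dump \
        --split-files \
        --outdir "$2" \
        -e "$thread_num" -t "sradata" \
        "$1"
}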

The "alignFile" function iterates over the SRA ids in the input file and passes each one to the single-read or paired-read alignment function, depending on the number of FASTQ files that were downloaded. It also removes some of the larger files generated during gene quantification to free disk space.

In [None]:
alignFile(){
    input=$1
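    # the "|| [ -n ... ]" part also processes a last line that lacks a trailing newline
    while IFS= read -r line || [ -n "$line" ]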
    do
        SRA_FILE=$line
        
        printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' =
        center "$SRA_FILE"
        downloadSRA "$SRA_FILE" "sradata"
        # number of FASTQ files downloaded for this id: one file means single reads
        count=$(find sradata -name "${SRA_FILE}*" | wc -l)
        
        if [ "$count" -eq 1 ];
        then
            alignSingleSRA "$SRA_FILE" "$2"
        else
            alignPairedSRA "$SRA_FILE" "$2"
        fi
        
        rm -f sradata/${SRA_FILE}*
        rm -f quant/star/${SRA_FILE}/${SRA_FILE}*.bam
        rm -f quant/star/${SRA_FILE}/${SRA_FILE}*.mate1
        rm -f quant/star/${SRA_FILE}/${SRA_FILE}*.mate2
        rm -f quant/star/${SRA_FILE}/${SRA_FILE}*J.out.tab
        rm -f quant/hisat2/${SRA_FILE}/${SRA_FILE}*.sam
        rm -f quant/bwa/${SRA_FILE}/${SRA_FILE}*.sam
        
    done < "$input"
}
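
The file count in "alignFile" treats every result other than exactly one file as paired, which includes the case where the download failed and no file exists. The sketch below is a stricter alternative that tests for the file names the alignment functions expect (`<id>.fastq` for single reads, `<id>_1.fastq` and `<id>_2.fastq` for paired reads). `detect_layout` is a hypothetical helper and is not used by the notebook as written.

```shell
# Hypothetical helper: classify the downloaded FASTQ files for one SRA id.
# detect_layout <sra_id> [directory]   prints "paired", "single", or "missing"
detect_layout() {
    local id=$1 dir=${2:-sradata}
    # -s is true only for files that exist and are not empty.
    if [ -s "${dir}/${id}_1.fastq" ] && [ -s "${dir}/${id}_2.fastq" ]; then
        echo "paired"
    elif [ -s "${dir}/${id}.fastq" ]; then
        echo "single"
    else
        echo "missing"
        return 1
    fi
}
```

With this helper, a `case "$(detect_layout "$SRA_FILE")" in ... esac` block could skip ids whose download produced no usable files instead of sending them to the paired-read function.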

The following function aligns single read files using Salmon, kallisto, STAR, HISAT2, and BWA. Once all tools have finished, the gene counts are aggregated into a combined gene count matrix.

In [None]:
alignSingleSRA(){
    SRA_FILE=$1
    SPECIES=$2
    
    center "Salmon"
    salmon quant -i "index/salmon/salmon_${SPECIES}_96" -l A \
        -r "sradata/${SRA_FILE}.fastq" \
        -p $thread_num -q --validateMappings -o quant/salmon/$SRA_FILE

    center "kallisto"
    # single-end mode needs an assumed fragment length (-l) and its standard deviation (-s)
    kallisto quant -i index/kallisto/kallisto_${SPECIES}_96.idx \
        -t $thread_num -o quant/kallisto/$SRA_FILE \
        --single -l 180 -s 20 "sradata/${SRA_FILE}.fastq"

    center "STAR"
    mkdir -p quant/star/$SRA_FILE

    STAR \
        --genomeDir "index/star/${SPECIES}_96" \
        --limitBAMsortRAM 10000000000 \
        --runThreadN $thread_num \
        --outSAMstrandField intronMotif \
        --outFilterIntronMotifs RemoveNoncanonical \
        --outFileNamePrefix quant/star/$SRA_FILE/$SRA_FILE \
        --readFilesIn "sradata/${SRA_FILE}.fastq" \
        --outSAMtype BAM SortedByCoordinate \
        --outReadsUnmapped Fastx \
        --outSAMmode Full \
        --quantMode GeneCounts \
        --limitIObufferSize 50000000

    center "HISAT2"
    mkdir -p quant/hisat2/$SRA_FILE
    hisat2 \
        -x "index/hisat2/${SPECIES}_96/${SPECIES}" \
        -U "sradata/${SRA_FILE}.fastq" \
        -p $thread_num \
        -S "quant/hisat2/${SRA_FILE}/${SRA_FILE}.sam"
    featureCounts -T $thread_num -a "reference/${SPECIES}_96.gtf" -o "quant/hisat2/${SRA_FILE}/${SRA_FILE}.tsv" "quant/hisat2/${SRA_FILE}/${SRA_FILE}.sam"

    center "BWA"
    mkdir -p quant/bwa/$SRA_FILE
    bwa mem \
        -t $thread_num \
        "index/bwa/${SPECIES}_96/${SPECIES}_96" \
        "sradata/${SRA_FILE}.fastq" \
        > "quant/bwa/${SRA_FILE}/${SRA_FILE}.sam"
    featureCounts -T $thread_num -a "reference/${SPECIES}_96.gtf" -o "quant/bwa/${SRA_FILE}/${SRA_FILE}.tsv" "quant/bwa/${SRA_FILE}/${SRA_FILE}.sam"

    Rscript --vanilla scripts/aggregatecounts.r $SRA_FILE $SPECIES
}
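
To inspect the per-tool gene counts before they are aggregated, the count columns can be pulled out of the STAR and featureCounts outputs directly. The sketch below is only an illustration and makes no claim about what `scripts/aggregatecounts.r` does internally. It relies on two output layouts: with `--quantMode GeneCounts`, STAR writes a `ReadsPerGene.out.tab` file under the output prefix whose first four rows are `N_*` summary rows and whose second column holds the unstranded counts; featureCounts writes a `#` comment line, a header row starting with `Geneid`, and the counts for the first input file in column 7. The function names are hypothetical, and whether the unstranded column is the right choice depends on the library preparation.

```shell
# Hypothetical helpers that print "gene_id<TAB>count" for one sample.

# STAR: drop the four N_* summary rows, keep gene id and unstranded count (column 2).
star_gene_counts() {
    awk -F '\t' '$1 !~ /^N_/ { print $1 "\t" $2 }' "$1"
}

# featureCounts: skip the "#" comment line and the header row,
# keep gene id (column 1) and the count of the first input file (column 7).
featurecounts_gene_counts() {
    awk -F '\t' '!/^#/ && $1 != "Geneid" { print $1 "\t" $7 }' "$1"
}
```

For example, `star_gene_counts "quant/star/${SRA_FILE}/${SRA_FILE}ReadsPerGene.out.tab" | head` shows the first few genes for one sample.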

The following function aligns paired-end read files using Salmon, kallisto, STAR, HISAT2, and BWA. Once all tools have finished, the gene counts are aggregated into a combined gene count matrix.

In [None]:
alignPairedSRA(){
    SRA_FILE=$1
    SPECIES=$2
    
    center "Salmon"
    salmon quant -i "index/salmon/salmon_${SPECIES}_96" -l A \
        -1 "sradata/${SRA_FILE}_1.fastq" \
        -2 "sradata/${SRA_FILE}_2.fastq" \
        -p $thread_num -q --validateMappings -o quant/salmon/$SRA_FILE

    center "kallisto"
    kallisto quant -i index/kallisto/kallisto_${SPECIES}_96.idx \
        "sradata/${SRA_FILE}_1.fastq" \
        "sradata/${SRA_FILE}_2.fastq" \
        -t $thread_num -o quant/kallisto/$SRA_FILE

    center "STAR"
    mkdir -p quant/star/$SRA_FILE
    STAR \
        --genomeDir "index/star/${SPECIES}_96" \
        --limitBAMsortRAM 10000000000 \
        --runThreadN $thread_num \
        --outSAMstrandField intronMotif \
        --outFilterIntronMotifs RemoveNoncanonical \
        --outFileNamePrefix quant/star/$SRA_FILE/$SRA_FILE \
        --readFilesIn "sradata/${SRA_FILE}_1.fastq" "sradata/${SRA_FILE}_2.fastq" \
        --outSAMtype BAM SortedByCoordinate \
        --outReadsUnmapped Fastx \
        --outSAMmode Full \
        --quantMode GeneCounts \
        --limitIObufferSize 50000000

    center "HISAT2"
    mkdir -p quant/hisat2/$SRA_FILE
    hisat2 \
        -x "index/hisat2/${SPECIES}_96/${SPECIES}" \
        -1 "sradata/${SRA_FILE}_1.fastq" \
        -2 "sradata/${SRA_FILE}_2.fastq" \
        -p $thread_num \
        -S "quant/hisat2/${SRA_FILE}/${SRA_FILE}.sam"
    featureCounts -T $thread_num -a "reference/${SPECIES}_96.gtf" -o "quant/hisat2/${SRA_FILE}/${SRA_FILE}.tsv" "quant/hisat2/${SRA_FILE}/${SRA_FILE}.sam"

    center "BWA"
    mkdir -p quant/bwa/$SRA_FILE
    bwa mem \
        -t $thread_num \
        "index/bwa/${SPECIES}_96/${SPECIES}_96" \
        "sradata/${SRA_FILE}_1.fastq" \
        "sradata/${SRA_FILE}_2.fastq" \
        > "quant/bwa/${SRA_FILE}/${SRA_FILE}.sam"
    featureCounts -T $thread_num -a "reference/${SPECIES}_96.gtf" -o "quant/bwa/${SRA_FILE}/${SRA_FILE}.tsv" "quant/bwa/${SRA_FILE}/${SRA_FILE}.sam"

    Rscript --vanilla scripts/aggregatecounts.r $SRA_FILE $SPECIES
}
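
Paired-end aligners expect both mate files to contain the same number of reads in the same order. A quick sanity check before alignment is to compare the read counts of the two files. The sketch below assumes plain four-line FASTQ records (no wrapped sequence lines); `fastq_read_count` and `mates_match` are hypothetical helpers that the notebook does not call.

```shell
# Hypothetical helpers for a paired-end sanity check.

# Number of reads in a four-line-per-record FASTQ file.
fastq_read_count() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Succeeds when both mate files hold the same number of reads.
mates_match() {
    [ "$(fastq_read_count "$1")" -eq "$(fastq_read_count "$2")" ]
}
```

A possible use is `mates_match "sradata/${SRA_FILE}_1.fastq" "sradata/${SRA_FILE}_2.fastq" || echo "mate files differ in length"`.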

This starts the alignment process for all SRA ids listed in the input file. The second argument is the species and has to match the downloaded index files.

In [None]:
alignFile "supportfiles/test/sra_list_human.txt" "human"
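
Because "alignFile" passes every line of the input file to the download step, blank lines, stray whitespace, or Windows line endings in the list lead to failed downloads. The sketch below checks the list first. It assumes the file contains run accessions, which start with `SRR`, `ERR`, or `DRR` followed by digits; `validate_sra_list` is a hypothetical helper and needs bash for the `[[ ... =~ ... ]]` test.

```shell
# Hypothetical helper: report lines that do not look like a run accession.
# validate_sra_list <file>   succeeds only when every line is valid
validate_sra_list() {
    local file=$1 n=0 bad=0 line
    # "|| [ -n ... ]" also reads a last line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        if [[ ! $line =~ ^[SED]RR[0-9]+$ ]]; then
            printf 'line %d: unexpected entry "%s"\n' "$n" "$line"
            bad=$((bad + 1))
        fi
    done < "$file"
    [ "$bad" -eq 0 ]
}
```

Running `validate_sra_list "supportfiles/test/sra_list_human.txt"` before the cell above prints the offending lines, if any, and returns a non-zero status.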