# Build alignment index files

Here we build alignment index files for the aligners Salmon, kallisto, STAR, HISAT2, and BWA. In this configuration we process the Ensembl genomes of human and mouse.

When AWS is configured, the constructed indices are uploaded to a specified S3 bucket using the AWS credentials.

## Requirements

Building index files with BWA is very memory intensive, so the compute instance needs a large amount of memory for index construction. As an alternative, maayanlab/alignmentbenchmark provides a script at "/alignment/scripts/load_index.sh" that can download precomputed index files. Building the index files locally requires:

- 50GB of memory
- 200GB of free disk space
- CPU with high thread count (16)
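
Before starting a long build it can help to confirm that the machine meets these numbers. The following is a minimal sketch, assuming a Linux host that exposes `/proc/meminfo`. `check_resources` is a hypothetical helper that is not part of the original notebook; its arguments are the minimum memory in GB, the minimum free disk space in GB, and the minimum thread count.

```shell
# Hypothetical preflight check, not part of the original notebook.
# Assumes Linux (/proc/meminfo).
# Arguments: min memory (GB), min free disk (GB), min threads.
check_resources() {
    local need_mem_gb="$1"
    local need_disk_gb="$2"
    local need_threads="$3"
    local status=0

    # MemTotal is reported in kB; convert to whole GB
    local mem_gb
    mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
    # free space (1K blocks) on the file system holding the current directory
    local disk_gb
    disk_gb=$(df -Pk . | awk 'NR == 2 {print int($4 / 1024 / 1024)}')
    # number of online processors
    local threads
    threads=$(getconf _NPROCESSORS_ONLN)

    if [ "$mem_gb" -lt "$need_mem_gb" ]; then
        echo "memory: ${mem_gb}GB found, ${need_mem_gb}GB required" >&2
        status=1
    fi
    if [ "$disk_gb" -lt "$need_disk_gb" ]; then
        echo "disk: ${disk_gb}GB free, ${need_disk_gb}GB required" >&2
        status=1
    fi
    if [ "$threads" -lt "$need_threads" ]; then
        echo "threads: ${threads} found, ${need_threads} required" >&2
        status=1
    fi
    return "$status"
}

# thresholds taken from the list above
check_resources 50 200 16 || echo "this machine is below the listed requirements"
```

The check only reports; it does not stop the notebook, so smaller machines can still be used together with the precomputed index files.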

In [1]:
cd /alignment

# the thread count used by some of the alignment algorithms.
# Not all aligners support multithreading when constructing index files
num_threads=16

mkdir -p reference
mkdir -p index/salmon
mkdir -p index/kallisto
mkdir -p index/star/human_96
mkdir -p index/star/mouse_96
mkdir -p index/hisat2/human_96
mkdir -p index/hisat2/mouse_96
mkdir -p index/bwa/human_96
mkdir -p index/bwa/mouse_96

# set number of open files, needed for high thread count
ulimit -n 1024

The Jupyter notebook supports saving the generated index files to cloud storage for later reuse, since downloading the index files is faster than rebuilding them. To upload data to Amazon S3, a user account has to be configured and a bucket specified. The bucket name has to be unique in S3.

Here the AWS access key ID and the secret access key can be set. To do so, replace the placeholder strings starting with "yourid" and "yourkey", and adjust the region and the bucket name if needed. **Special care is advised when working with credentials. If code containing credentials is uploaded to a public repository, the account can be compromised by malicious third parties.**

In [2]:
mkdir -p ~/.aws
touch ~/.aws/config ~/.aws/credentials

# never upload credentials to GitHub, leaked credentials are abused almost instantly
echo "[default]" > ~/.aws/credentials
echo "aws_access_key_id = yourid4lKG0RY5v4tha" >> ~/.aws/credentials
echo "aws_secret_access_key = yourkeyWOQvMIc71yE8kfswJ5W" >> ~/.aws/credentials

echo "[default]" > ~/.aws/config
echo "region = us-east-1" >> ~/.aws/config
echo "output = json" >> ~/.aws/config

# the bucket name has to be changed to an existing bucket in your AWS account
aws_bucket="alignmentworkbench"
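
One way to keep secrets out of the notebook itself is to read them from environment variables. The sketch below is an assumption about how this could be done, not the notebook's own method: `write_aws_credentials` is a hypothetical helper that writes the credentials file from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (the variable names the AWS CLI also reads directly) and restricts the file to the current user.

```shell
# Hypothetical helper, not part of the original notebook.
# Writes <dir>/credentials from environment variables so that no secret
# has to be typed into a notebook cell. Default directory: ~/.aws
write_aws_credentials() {
    local aws_dir="${1:-$HOME/.aws}"

    if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set" >&2
        return 1
    fi

    mkdir -p "$aws_dir"
    # create the file with owner-only permissions before the secret is written
    (
        umask 077
        printf '[default]\naws_access_key_id = %s\naws_secret_access_key = %s\n' \
            "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" > "$aws_dir/credentials"
    )
    # also tighten permissions if the file already existed
    chmod 600 "$aws_dir/credentials"
}

# usage (the variables are exported in the shell that starts the notebook):
#   write_aws_credentials            # writes ~/.aws/credentials
#   write_aws_credentials /tmp/aws   # writes /tmp/aws/credentials
```

With this approach the cell can be committed to a repository without exposing the key pair.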

This function takes a file or directory, compresses it, and uploads it to an S3 bucket. For this to work the AWS credentials have to be set, and they need S3 write access.

In [3]:
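# usage: saveIndexS3 <local path> <object name> <bucket>
saveIndexS3(){
    # pack the path and compress it with pigz (parallel gzip)
    tar cf - "${1}" | pigz > "${1}.tar.gz"
    aws s3 cp "${1}.tar.gz" "s3://${3}/${2}.tar.gz"
    # remove the local archive
    rm "${1}.tar.gz"
}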
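
The compression step can be tried locally without any upload. The sketch below is a hypothetical demo, not part of the notebook: it packs a small placeholder directory the same way `saveIndexS3` does (falling back to `gzip` when `pigz` is not installed, which is an assumption added here) and then restores the archive into a fresh directory.

```shell
# Hypothetical local round trip, no S3 access needed.
workdir=$(mktemp -d)
mkdir -p "$workdir/index/demo"
echo "placeholder index content" > "$workdir/index/demo/part.bin"

# same pack-and-compress step as in saveIndexS3;
# gzip is used as a fallback when pigz is not installed
compressor=gzip
if command -v pigz >/dev/null 2>&1; then
    compressor=pigz
fi
(cd "$workdir" && tar cf - index/demo | "$compressor" > demo.tar.gz)

# restore into a fresh directory; the archive keeps the relative
# path that was passed to tar (here index/demo)
mkdir -p "$workdir/restored"
tar xzf "$workdir/demo.tar.gz" -C "$workdir/restored"
ls "$workdir/restored/index/demo"
```

Because the archive stores the relative path given to `tar`, an index saved as `index/star/human_96` unpacks into the same subdirectory when it is extracted from the working directory.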

This script was written to build index files for GRCh38 and GRCm38 from Ensembl release 96. It can be changed to use other genomes and annotations. The downloads are triggered in parallel and the script waits until all files are downloaded.

In [4]:
# fetch 96 release cDNA sequences.
# This is the reference used by the transcript quantification methods
# this will take some time depending on the network speed
# the downloads are forked, sometimes the FTP server of ensembl is slow
curl ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz -o reference/human_cdna_96.fa.gz &
curl ftp://ftp.ensembl.org/pub/release-96/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz -o reference/mouse_cdna_96.fa.gz &

# fetch 96 whole genome sequences
# This is the reference used by true aligners such as STAR and HISAT2
curl ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -o reference/human_dna_96.fa.gz &
curl ftp://ftp.ensembl.org/pub/release-96/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz -o reference/mouse_dna_96.fa.gz &

# fetch GTF files annotating the raw DNA sequences
# they are needed by the alignment algorithms STAR and HISAT2 to count transcript reads
curl ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.gtf.gz -o reference/human_96.gtf.gz &
curl ftp://ftp.ensembl.org/pub/release-96/gtf/mus_musculus/Mus_musculus.GRCm38.96.gtf.gz -o reference/mouse_96.gtf.gz &

# the script is now waiting for the downloads to finish before continuing
# the jobs -p command lists forked jobs (commands followed by &)
FAIL=0
for job in `jobs -p`
do
    echo $job
    wait $job || let "FAIL+=1"
done
echo "failed downloads: $FAIL"
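
The wait pattern can be tried without network access. This is a minimal sketch with two stand-in background jobs, one that succeeds and one that fails; the names `pids` and `fail` are illustrative. Instead of `jobs -p` it records each process ID with `$!`, which is an alternative to the loop above and not the notebook's own approach.

```shell
# Stand-in background jobs: the first exits with 0, the second with 1.
fail=0
pids=""

(exit 0) &
pids="$pids $!"
(exit 1) &
pids="$pids $!"

# wait returns the exit status of the job it waited for,
# so every failed job increments the counter
for pid in $pids; do
    wait "$pid" || fail=$((fail + 1))
done

echo "failed jobs: $fail"
```

A non-zero counter after the real downloads is a signal to rerun the affected `curl` commands before unpacking.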



Once the files are downloaded they need to be extracted for further processing, because some aligners can only work with uncompressed reference files.

In [None]:
# unpack some files that can't be used when zipped
unpigz reference/human_cdna_96.fa.gz
unpigz reference/human_dna_96.fa.gz
unpigz reference/human_96.gtf.gz

unpigz reference/mouse_cdna_96.fa.gz
unpigz reference/mouse_dna_96.fa.gz
unpigz reference/mouse_96.gtf.gz
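
Since the downloads can be interrupted, it may be worth testing each archive before unpacking it. The following sketch is an addition to the notebook: `check_gzip` is a hypothetical helper around `gzip -t`, demonstrated on a small valid file and on a deliberately truncated copy that stands in for an incomplete download.

```shell
# Hypothetical helper: succeeds only if the file is a complete gzip archive.
check_gzip() {
    gzip -t "$1" 2>/dev/null
}

demo=$(mktemp -d)
printf '>toy\nACGT\n' | gzip > "$demo/ok.fa.gz"
# simulate an interrupted download by keeping only the first 10 bytes
head -c 10 "$demo/ok.fa.gz" > "$demo/truncated.fa.gz"

for f in "$demo"/*.gz; do
    if check_gzip "$f"; then
        echo "complete:   $(basename "$f")"
    else
        echo "incomplete: $(basename "$f")"
    fi
done
```

Applied to the notebook, the same loop could run over `reference/*.gz` before the `unpigz` calls.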

The index building process starts here, now that all the prerequisites are met.

In [None]:
# Salmon run create index
time salmon index -p $num_threads -t reference/human_cdna_96.fa -i index/salmon/salmon_human_96
time salmon index -p $num_threads  -t reference/mouse_cdna_96.fa -i index/salmon/salmon_mouse_96


# kallisto run create index
time kallisto index -i index/kallisto/kallisto_human_96.idx reference/human_cdna_96.fa
time kallisto index -i index/kallisto/kallisto_mouse_96.idx reference/mouse_cdna_96.fa


# STAR run create index
# we use the genomeSAsparseD flag to build a sparser suffix array,
# which gives a smaller index that requires less memory

GENOME="reference/human_dna_96.fa"
GTF="reference/human_96.gtf"
OUTPUT="index/star/human_96"
time STAR \
    --runMode genomeGenerate \
    --genomeDir $OUTPUT \
    --genomeFastaFiles $GENOME \
    --sjdbGTFfile $GTF \
    --runThreadN $num_threads \
    --genomeSAsparseD 2 \
    --genomeSAindexNbases 13

GENOME="reference/mouse_dna_96.fa"
GTF="reference/mouse_96.gtf"
OUTPUT="index/star/mouse_96"
time STAR \
    --runMode genomeGenerate \
    --genomeDir $OUTPUT \
    --genomeFastaFiles $GENOME \
    --sjdbGTFfile $GTF \
    --runThreadN $num_threads \
    --genomeSAsparseD 2 \
    --genomeSAindexNbases 13


# HISAT2 run create index
time hisat2-build -p $num_threads reference/human_dna_96.fa index/hisat2/human_96/human
time hisat2-build -p $num_threads reference/mouse_dna_96.fa index/hisat2/mouse_96/mouse


# BWA run create index
time bwa index -p index/bwa/human_96/human_96 reference/human_dna_96.fa
time bwa index -p index/bwa/mouse_96/mouse_96 reference/mouse_dna_96.fa
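
Because the builds run for a long time, a quick check that an index is complete can save a rerun. The sketch below covers BWA only, whose `bwa index` command writes five files next to the given prefix (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`). `bwa_index_complete` is a hypothetical helper and not part of the notebook.

```shell
# Hypothetical helper: succeeds only if all five files written by
# `bwa index` exist and are non-empty for the given prefix.
bwa_index_complete() {
    local prefix="$1"
    local ext
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$prefix.$ext" ]; then
            echo "missing or empty: $prefix.$ext" >&2
            return 1
        fi
    done
    return 0
}

# usage: rebuild only when the index is incomplete
#   bwa_index_complete index/bwa/human_96/human_96 \
#       || bwa index -p index/bwa/human_96/human_96 reference/human_dna_96.fa
```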

During the alignment step some algorithms quantify RNA-Seq reads at the transcript level. These transcript counts are then mapped to gene counts using an R script, which requires a mapping table. The table can be constructed for human and mouse by running the R script below.

In [None]:
Rscript --vanilla scripts/buildMapping.r
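
To illustrate what such a mapping table contains, the sketch below extracts transcript ID, gene ID, and gene name from an Ensembl-style GTF file with `awk`. This is not how `scripts/buildMapping.r` works (its contents are not shown here); `gtf_to_mapping` is a hypothetical helper, the toy GTF lines are made up, and the parsing assumes that attribute values contain no semicolons.

```shell
# Hypothetical helper: print transcript_id, gene_id and gene_name
# (tab-separated) for every "transcript" feature of a GTF file.
gtf_to_mapping() {
    awk -F '\t' '
        # strip the key and the surrounding quotes from one attribute
        function value(attr, key) {
            sub("^" key " \"", "", attr)
            sub("\"$", "", attr)
            return attr
        }
        $3 == "transcript" {
            tid = ""; gid = ""; gname = ""
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                a = attrs[i]
                sub(/^ +/, "", a)
                if (a ~ /^transcript_id /) tid = value(a, "transcript_id")
                else if (a ~ /^gene_id /) gid = value(a, "gene_id")
                else if (a ~ /^gene_name /) gname = value(a, "gene_name")
            }
            print tid "\t" gid "\t" gname
        }
    ' "$1"
}

# toy input with made-up identifiers
demo_gtf=$(mktemp)
printf '#!genome-build toy\n' > "$demo_gtf"
printf '1\ttoy\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "ToyGene";\n' >> "$demo_gtf"
printf '1\ttoy\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "ToyGene";\n' >> "$demo_gtf"

gtf_to_mapping "$demo_gtf"
```

On the real annotation the call would be `gtf_to_mapping reference/human_96.gtf`, giving one row per annotated transcript.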

Once all the index files are built they can be uploaded to S3. This requires the AWS settings configured above.

In [None]:
saveIndexS3 "index/salmon/salmon_human_96" "salmon_human_96" $aws_bucket
saveIndexS3 "index/salmon/salmon_mouse_96" "salmon_mouse_96" $aws_bucket

saveIndexS3 "index/kallisto/kallisto_human_96.idx" "kallisto_human_96.idx" $aws_bucket
saveIndexS3 "index/kallisto/kallisto_mouse_96.idx" "kallisto_mouse_96" $aws_bucket

saveIndexS3 "index/star/human_96" "star_human_96" $aws_bucket
saveIndexS3 "index/star/mouse_96" "star_mouse_96" $aws_bucket

saveIndexS3 "index/hisat2/human_96" "hisat2_human_96" $aws_bucket
saveIndexS3 "index/hisat2/mouse_96" "hisat2_mouse_96" $aws_bucket

saveIndexS3 "index/bwa/human_96" "bwa_human_96" $aws_bucket
saveIndexS3 "index/bwa/mouse_96" "bwa_mouse_96" $aws_bucket

aws s3 cp supportfiles/gene_mapping.rda s3://${aws_bucket}/gene_mapping.rda