##Genome assembly with ABySS

All of the scripts you will need to complete this lab, as well as the sample dataset, will be copied to your Beocat directory as you follow the instructions below. Type or paste the text in the beige code blocks into your terminal as you go. If you are not used to the command line, practice with real data is one of the best ways to learn.
If you would like a quick primer on basic Linux commands, try these 10-minute lessons from Software Carpentry: http://software-carpentry.org/v4/shell/index.html. To learn how to start using Beocat and the terminal, go to https://github.com/i5K-KINBRE-script-share/FAQ/blob/master/UsingBeocat.md. To learn how to download files from Beocat, go to https://github.com/i5K-KINBRE-script-share/FAQ/blob/master/BeocatEditingTransferingFiles.md.
We will be using the script "AssembleG.pl" to organize our working directory and write scripts to clean our reads, assemble our de novo genome for Staphylococcus aureus, and summarize our assembly metrics.
To begin this lab you should read about the software we will be using. Prinseq will be used to clean raw reads. Prinseq cleaning is highly customizable. You can see a detailed parameter list by typing "perl /homes/sheltonj/abjc/prinseq-lite-0.20.3/prinseq-lite.pl -h" or by visiting the manual at http://prinseq.sourceforge.net/manual.html. You can read a detailed list of parameter options for the ABySS genome assembler by typing "/homes/bjsco/local/bin/abyss-pe -h" or by visiting the ABySS manual at https://github.com/bcgsc/abyss#abyss.
To find out more about the parameters for "AssembleG.pl" run "perl ~/transcriptome-and-genome-assembly/KSU_bioinfo_lab/AssembleG/AssembleG.pl -man" or visit its manual at https://github.com/i5K-KINBRE-script-share/transcriptome-and-genome-assembly/tree/master/KSU_bioinfo_lab/AssembleG/AssembleG_MANUAL.md.
###Step 1: Clone the Git repositories

    git clone https://github.com/i5K-KINBRE-script-share/transcriptome-and-genome-assembly
    git clone https://github.com/i5K-KINBRE-script-share/genome-annotation-and-comparison
###Step 2: Create project directory and add your input data to it
Make a working directory.

    mkdir de_novo_genome
    cd de_novo_genome
Create symbolic links to the raw Staphylococcus aureus DNA reads. Creating a symbolic link rather than copying avoids wasting disk space and protects your raw data from being altered. Our sample dataset is Staphylococcus aureus genomic DNA from the European Nucleotide Archive (http://www.ebi.ac.uk/ena/data/view/ERX012513), sequenced on an Illumina Genome Analyzer II.

    ln -s /homes/bioinfo/pipeline_datasets/AssembleG/* ~/de_novo_genome/
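To see how a symbolic link behaves, the minimal sketch below builds a throwaway directory, links a file into it using the same `ln -s source/* destination/` pattern, and confirms that the link points back to the original. The directory and file names (`raw`, `project`, `reads_1.fastq`) are hypothetical stand-ins, not the paths used in this lab.

```shell
# Hypothetical demo in a scratch directory; these are not the lab's real paths.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/raw" "$demo_dir/project"
printf '@read1/1\nACGT\n+\nIIII\n' > "$demo_dir/raw/reads_1.fastq"

# Link every file in raw/ into project/ (absolute source paths keep the links valid).
ln -s "$demo_dir"/raw/* "$demo_dir/project/"

# The link resolves to the original file; no second copy of the data is made.
link_target=$(readlink "$demo_dir/project/reads_1.fastq")
echo "$link_target"
ls -l "$demo_dir/project"
```

Running `ls -l` in your own `~/de_novo_genome` directory after the lab command should likewise show each entry as a link with an arrow to its source file.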
###Step 3: Write assembly scripts
Check to see if your fastq headers end in "/1" or "/2" (if they do not, you must add the parameter "-c" when you run "AssembleG.pl").
Your output will look similar to the output below for the sample data. Because these headers end in "/1" or "/2" we will not add "-c" when we call "AssembleG.pl".

    @ERR033278.1 IL35_4330:1:1:1008:1653#1/1
    NATAGTTAATGTATACTTTCGCTTCTTCAGAATCTACTTTATTATCTTTAGTAC
    +
    $**/);)%.()(/()6=667<18614--<=6/?936==8-*.*)1$)/'',*7.
    @ERR033278.2 IL35_4330:1:1:1009:6252#1/1
    NTACCACTACCAAATACTTCTGTTAACCCACCTTTATCATATGATTCGAATAAT
    +
    $,*3:),/5.A?:A?$'1*,88???76(6:/0/3,$'*'.,,,&.*+,;(:()'
    @ERR033278.3 IL35_4330:1:1:1009:6338#1/1
    NTTAACAGAACGTCAACGTGATATATTATTGTATGGTTCGGGTGCCAAAGAAAT
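If you would rather not inspect headers by eye, the check can be sketched with `awk`. This is a minimal example, assuming a FASTQ file with exactly four lines per record; the temporary file is a hypothetical stand-in that holds the first two records from the sample output. It counts header lines that do not end in "/1" or "/2", so a count of 0 suggests that "-c" is not needed.

```shell
# Hypothetical stand-in file holding two records from the sample output.
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@ERR033278.1 IL35_4330:1:1:1008:1653#1/1
NATAGTTAATGTATACTTTCGCTTCTTCAGAATCTACTTTATTATCTTTAGTAC
+
$**/);)%.()(/()6=667<18614--<=6/?936==8-*.*)1$)/'',*7.
@ERR033278.2 IL35_4330:1:1:1009:6252#1/1
NTACCACTACCAAATACTTCTGTTAACCCACCTTTATCATATGATTCGAATAAT
+
$,*3:),/5.A?:A?$'1*,88???76(6:/0/3,$'*'.,,,&.*+,;(:()'
EOF

# Header lines are lines 1, 5, 9, ... of a four-line-per-record FASTQ file.
# Count the headers that do NOT end in /1 or /2.
missing=$(awk 'NR % 4 == 1 && $0 !~ /\/[12]$/ { n++ } END { print n + 0 }' "$fastq")
echo "headers without /1 or /2: $missing"
```

To use this on your own data, point `fastq` at one of the read files linked into your project directory instead of the temporary file.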
Call "AssembleG.pl". Our reads are only ~50 bp long so we are setting our minimum read length to 35 bp. Generally you want to keep this length ~10 bp shorter than your read length. We would also raise the longest kmer value "-l" to ~61 if our reads were 100 bp.

    perl ~/transcriptome-and-genome-assembly/KSU_bioinfo_lab/AssembleG/AssembleG.pl -r ~/de_novo_genome/S_aureus_reads.txt -p S_aureus -n 35 --nodes 8 --mem_per_core 3 -s 21 -l 45 -i 2 -m 31
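Because the choice of minimum read length and longest kmer depends on how long your reads are, it can help to measure them first. The sketch below is one possible way to do that, assuming a FASTQ file with four lines per record. The three 50-base reads are synthetic and hypothetical, generated only so the example runs on its own.

```shell
# Build a tiny hypothetical FASTQ file with three synthetic 50-base reads.
fastq=$(mktemp)
awk 'BEGIN {
  bases = sprintf("%50s", ""); gsub(/ /, "A", bases)
  quals = sprintf("%50s", ""); gsub(/ /, "I", quals)
  for (i = 1; i <= 3; i++) printf "@read%d/1\n%s\n+\n%s\n", i, bases, quals
}' > "$fastq"

# Sequence lines are lines 2, 6, 10, ... of a four-line-per-record file.
# Report the shortest and longest read length seen.
read_lengths=$(awk 'NR % 4 == 2 {
  len = length($0)
  if (min == "" || len < min) min = len
  if (len > max) max = len
} END { print min, max }' "$fastq")
echo "shortest and longest read (bp): $read_lengths"
```

On real data, replace the temporary file with one of your own read files and use the reported lengths as a guide when setting "-n" and "-l".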
###Step 4: Run prinseq and the assembly scripts
Clean raw reads. When these jobs are complete, go to the next step. Test completion by typing "status" in a Beocat session. Download the ".gd" files in the "~/de_novo_genome/S_aureus_prinseq" directory and upload them to http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi?report=1 to evaluate read quality before and after cleaning.
Concatenate cleaned reads.
Assemble single kmer genomes. When these jobs are complete, go to the next step. Test completion by typing "status" in a Beocat session.
Merge single kmer genomes. When these jobs are complete, go to the next step. Test completion by typing "status" in a Beocat session.
This step will generate assembly metrics and summarize the cleaning step results.
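This lab does not list which metrics the summary reports. As an illustration of one commonly used assembly metric, the sketch below computes N50 (the contig length at which the running total of lengths, sorted longest first, reaches half of the assembly size). The contig names and sequences are hypothetical, and this is not necessarily how the pipeline's own scripts compute it.

```shell
# Hypothetical contigs with lengths 8, 6, 4 and 2 (total 20, half 10).
contigs=$(mktemp)
cat > "$contigs" <<'EOF'
>contig1
ACGTACGT
>contig2
ACGTAC
>contig3
ACGT
>contig4
AC
EOF

# Step 1: print one length per contig (handles sequences wrapped over several lines).
# Step 2: sort lengths from longest to shortest.
# Step 3: walk down the list until the running sum reaches half the total.
n50=$(awk '/^>/ { if (len) print len; len = 0; next }
           { len += length($0) }
           END { if (len) print len }' "$contigs" |
      sort -rn |
      awk '{ lens[NR] = $1; total += $1 }
           END {
             for (i = 1; i <= NR; i++) {
               run += lens[i]
               if (run >= total / 2) { print lens[i]; exit }
             }
           }')
echo "N50: $n50"
```

For these four contigs the running sum is 8 and then 14, which passes the halfway mark of 10 at the second contig, so the N50 is 6.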