# DEMO for STAR

In [4]:
!cat /fdb/STAR/00README

# David Hoover, 2015-03-02

This directory contains reference genomes for STAR.

Reference genomes have been partitioned into subdirectories based on their source and species.  Each terminal directory
contains the base reference genome in the ref directory, along with 5 genome references generated with annotation
from a related GTF file.  The annotated genome references were created with different read lengths (50, 75, 100, 
150, and 250 nt).  Links to the reference fasta and GTF files, along with a .dict and .fai file, are also available
in each terminal directory.

The structure of each terminal directory looks like this (for example mm9 from UCSC):

    genes-50/
    genes-75/
    genes-100/
    genes-150/
    genes-250/
    genes.gtf -> /fdb/igenomes/Mus_musculus/UCSC/mm9/Annotation/Genes/genes.gtf
    ref/
    ref.dict
    ref.fa -> /data/genome/fasta/mm9.fa
    ref.fa.fai

The sources for fasta and reference files are:

  UCSC               http://hgdownload.soe.ucsc.edu/download
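
The README lists five annotated indices per terminal directory, one per read length. A minimal sketch of choosing between them is given here, assuming the "closest read length" rule is acceptable; the helper name `pick_index_dir` and that rule are illustrative and not part of the README.

```python
from pathlib import PurePosixPath

# Read lengths for which annotated indices exist, per the README.
AVAILABLE_READ_LENGTHS = (50, 75, 100, 150, 250)

def pick_index_dir(terminal_dir, read_length):
    """Return the genes-N subdirectory whose N is closest to read_length.

    Ties go to the shorter length (an arbitrary choice for this sketch).
    """
    best = min(AVAILABLE_READ_LENGTHS, key=lambda n: (abs(n - read_length), n))
    return str(PurePosixPath(terminal_dir) / f"genes-{best}")

print(pick_index_dir("/fdb/STAR_indices/2.6.1c/GENCODE/Gencode_human/release_27", 100))
```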

In [17]:
%%bash
module load STAR
cat  $STAR_TEST_DATA/00README

ENCFF138LJO.fastq.gz
--------------------
description: mouse cerebral cortex RNA-Seq, 100nt SE from Wold lab
obtained:    Sep 01, 2015
source:      https://www.encodeproject.org/files/ENCFF138LJO/@@download/ENCFF138LJO.fastq.gz

ENCFF138LJO_250k.fastq.gz
-------------------------
description: first 250k reads from ENCFF138LJO.fastq.gz

ENCFF138LJO_1M.fastq.gz
-------------------------
description: first 1M reads from ENCFF138LJO.fastq.gz


[+] Loading STAR  2.6.1c 
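
The test-data README describes the smaller files as the first 250k and 1M reads of the full FASTQ, but does not say how they were produced. One way to make such a subset is sketched here, assuming standard four-line FASTQ records; `head_fastq` is a hypothetical helper.

```python
import gzip
from itertools import islice

def head_fastq(src, dst, n_reads):
    """Copy the first n_reads records (4 lines each) from src to dst.

    Both files are gzip-compressed. Returns the number of reads written.
    """
    lines = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in islice(fin, 4 * n_reads):
            fout.write(line)
            lines += 1
    return lines // 4

# Example call (not executed here):
# head_fastq("ENCFF138LJO.fastq.gz", "ENCFF138LJO_250k.fastq.gz", 250_000)
```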


In [12]:
%%bash
module load STAR
mkdir -p indices/star100-EF4
GENOME=/fdb/igenomes/Saccharomyces_cerevisiae/Ensembl/EF4
STAR \
    --runThreadN 12 \
    --runMode genomeGenerate \
    --genomeDir indices/star100-EF4 \
    --genomeFastaFiles $GENOME/Sequence/WholeGenomeFasta/genome.fa \
    --sjdbGTFfile $GENOME/Annotation/Genes/genes.gtf \
    --sjdbOverhang 99 \
    --genomeSAindexNbases 11

Feb 17 22:17:46 ..... started STAR run
Feb 17 22:17:46 ... starting to generate Genome files
Feb 17 22:17:46 ... starting to sort Suffix Array. This may take a long time...
Feb 17 22:17:47 ... sorting Suffix Array chunks and saving them to disk...
Feb 17 22:17:49 ... loading chunks from disk, packing SA...
Feb 17 22:17:49 ... finished generating suffix array
Feb 17 22:17:49 ... generating Suffix Array index
Feb 17 22:17:51 ... completed Suffix Array index
Feb 17 22:17:51 ..... processing annotations GTF
Feb 17 22:17:51 ..... inserting junctions into the genome indices
Feb 17 22:17:52 ... writing Genome to disk ...
Feb 17 22:17:52 ... writing Suffix Array to disk ...
Feb 17 22:17:52 ... writing SAindex to disk
Feb 17 22:17:52 ..... finished successfully


[+] Loading STAR  2.6.1c 
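
The values `--sjdbOverhang 99` and `--genomeSAindexNbases 11` used above follow the STAR manual's guidance: the overhang is the read length minus 1, and for small genomes the suffix-array index size is scaled to min(14, log2(GenomeLength)/2 - 1). A small sketch of that arithmetic follows. Rounding to the nearest integer and the genome size of about 12 Mb for yeast are assumptions of this sketch.

```python
import math

def sjdb_overhang(read_length):
    """Overhang recommended for annotated junctions: read length minus 1."""
    return read_length - 1

def sa_index_nbases(genome_length):
    """min(14, log2(GenomeLength)/2 - 1), rounded to the nearest integer."""
    return min(14, round(math.log2(genome_length) / 2 - 1))

print(sjdb_overhang(100))           # 100 nt reads
print(sa_index_nbases(12_000_000))  # yeast-sized genome (approximate length)
```

For a human-sized genome the formula exceeds 14, so the cap applies and the default can be left alone.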


# Demo with Paired RNA-seq
1. Paired-end mode
2. Single-end mode (skips GTF processing)

In [3]:
%%bash
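# Start from a clean output directory (-p also creates indices/ if it is missing).
rm -rf indices/starDEMO
mkdir -p indices/starDEMO
REF="/fdb/bwa/indexes/hg38.fa"   # not used below; the genome comes from the prebuilt index in --genomeDir
READ1="RNAsq1.fastq.gz"
READ2="RNAsq2.fastq.gz"
module load STAR
# Passing --sjdbGTFfile at the mapping step inserts the annotated junctions on the fly.
STAR \
    --runThreadN 10 \
    --genomeDir '/fdb/STAR_indices/2.6.1c/GENCODE/Gencode_human/release_27/genes-100/' \
    --readFilesIn "$READ1" "$READ2" \
    --readFilesCommand zcat \
    --sjdbGTFfile '/fdb/igenomes/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf' \
    --outFileNamePrefix indices/starDEMO/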

Feb 18 18:10:42 ..... started STAR run
Feb 18 18:10:42 ..... loading genome
Feb 18 18:11:06 ..... processing annotations GTF
Feb 18 18:11:14 ..... inserting junctions into the genome indices
Feb 18 18:13:30 ..... started mapping
Feb 18 18:14:02 ..... finished successfully


[+] Loading STAR  2.6.1c 
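
To see where the time went in this run, the timestamps in the progress lines can be turned into per-phase durations. The sketch below does this for the lines printed above; `phase_durations` is an illustrative helper, and it assumes every progress line starts with a month, day and time.

```python
from datetime import datetime

# Progress lines copied from the paired-end run above.
LOG = """\
Feb 18 18:10:42 ..... started STAR run
Feb 18 18:10:42 ..... loading genome
Feb 18 18:11:06 ..... processing annotations GTF
Feb 18 18:11:14 ..... inserting junctions into the genome indices
Feb 18 18:13:30 ..... started mapping
Feb 18 18:14:02 ..... finished successfully
"""

def phase_durations(log_text):
    """Return (message, seconds until the next progress line) pairs."""
    events = []
    for line in log_text.splitlines():
        parts = line.split(None, 3)  # month, day, time, rest of line
        if len(parts) < 4:
            continue
        stamp = datetime.strptime(" ".join(parts[:3]), "%b %d %H:%M:%S")
        events.append((stamp, parts[3].lstrip(". ")))
    return [
        (msg, int((nxt - start).total_seconds()))
        for (start, msg), (nxt, _) in zip(events, events[1:])
    ]

for message, seconds in phase_durations(LOG):
    print(f"{seconds:4d} s  {message}")
```

Inserting the junctions takes 136 of the 200 seconds, which is the cost of passing `--sjdbGTFfile` at the mapping step.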


In [1]:
!cat indices/starDEMO/Log.final.out

                                 Started job on |	Feb 18 18:10:42
                             Started mapping on |	Feb 18 18:13:30
                                    Finished on |	Feb 18 18:14:02
       Mapping speed, Million of reads per hour |	337.50

                          Number of input reads |	3000000
                      Average input read length |	96
                                    UNIQUE READS:
                   Uniquely mapped reads number |	1934961
                        Uniquely mapped reads % |	64.50%
                          Average mapped length |	95.64
                       Number of splices: Total |	278369
            Number of splices: Annotated (sjdb) |	275975
                       Number of splices: GT/AG |	275382
                       Number of splices: GC/AG |	1961
                       Number of splices: AT/AC |	269
               Number of splices: Non-canonical |	757
                      Mismatch rate per base, % |	0.15%
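
`Log.final.out` is a plain `key | value` table, so it is easy to load into a dictionary for sanity checks. The sketch below parses a few of the lines shown above and confirms that the four splice categories add up to the total and that the unique-mapping percentage follows from the read counts. `parse_log_final` is an illustrative helper, not part of STAR.

```python
# A subset of the lines printed above.
LOG_FINAL = """\
                          Number of input reads |\t3000000
                                    UNIQUE READS:
                   Uniquely mapped reads number |\t1934961
                       Number of splices: Total |\t278369
                       Number of splices: GT/AG |\t275382
                       Number of splices: GC/AG |\t1961
                       Number of splices: AT/AC |\t269
               Number of splices: Non-canonical |\t757
"""

def parse_log_final(text):
    """Map each 'key | value' line to a dict entry (values kept as strings)."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats

stats = parse_log_final(LOG_FINAL)
splice_sum = sum(
    int(stats[f"Number of splices: {kind}"])
    for kind in ("GT/AG", "GC/AG", "AT/AC", "Non-canonical")
)
unique_pct = 100 * int(stats["Uniquely mapped reads number"]) / int(stats["Number of input reads"])
print(splice_sum == int(stats["Number of splices: Total"]), f"{unique_pct:.2f}%")
```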
                      

In [14]:
!head -n800 indices/starDEMO/Log.out | tail

chr10_KI270825v1_alt	Gnomon	exon	33541	33650	.	-	.	gene_id "DLG5-2"; gene_name "DLG5"; p_id "P22855"; transcript_id "XM_006725121.1"; tss_id "TSS45465";
chr10_KI270825v1_alt	BestRefSeq	exon	34753	34880	.	-	.	gene_id "DLG5-2"; gene_name "DLG5"; p_id "P23880"; transcript_id "NM_004747.3-2"; tss_id "TSS3083";
chr10_KI270825v1_alt	Gnomon	exon	34753	34880	.	-	.	gene_id "DLG5-2"; gene_name "DLG5"; p_id "P3952"; transcript_id "XM_006725122.1"; tss_id "TSS45410";
chr10_KI270825v1_alt	Gnomon	exon	34753	34880	.	-	.	gene_id "DLG5-2"; gene_name "DLG5"; p_id "P1906"; transcript_id "XM_006725120.1"; tss_id "TSS45465";
chr10_KI270825v1_alt	Gnomon	exon	34753	34880	.	-	.	gene_id "DLG5-2"; gene_name "DLG5"; p_id "P22855"; transcript_id "XM_006725121.1"; tss_id "TSS45465";
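
The lines echoed in `Log.out` are GTF records: nine tab-separated columns, with the last column holding `key "value";` attribute pairs. A minimal parser is sketched below for illustration. It assumes every attribute value is double-quoted, as in the records above.

```python
import re

def parse_gtf_line(line):
    """Split one GTF record into its fixed columns and an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, attrs = cols
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": dict(re.findall(r'(\S+) "([^"]*)";', attrs)),
    }

# First record shown above, rebuilt with explicit tabs.
example = "\t".join([
    "chr10_KI270825v1_alt", "Gnomon", "exon", "33541", "33650", ".", "-", ".",
    'gene_id "DLG5-2"; gene_name "DLG5"; p_id "P22855"; '
    'transcript_id "XM_006725121.1"; tss_id "TSS45465";',
])
record = parse_gtf_line(example)
print(record["attributes"]["gene_id"], record["end"] - record["start"] + 1)
```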


In [20]:
!head indices/starDEMO/Aligned.out.sam
!tail indices/starDEMO/Aligned.out.sam

@HD	VN:1.4
@SQ	SN:chr1	LN:248956422
@SQ	SN:chr2	LN:242193529
@SQ	SN:chr3	LN:198295559
@SQ	SN:chr4	LN:190214555
@SQ	SN:chr5	LN:181538259
@SQ	SN:chr6	LN:170805979
@SQ	SN:chr7	LN:159345973
@SQ	SN:chr8	LN:145138636
@SQ	SN:chr9	LN:138394717
SRR1813891.2700849	163	chr18	13025029	255	48M	=	13025112	131	CCCAAAGTGCTGGAATTACAGGCGTGACCCACTGCACCCAGCCAGTTG	CCCFFFFEHHHHHIJJJJJJJJJIJJJJJJIJIJIJJJJIIJJEIIIJ	NH:i:1	HI:i:1	AS:i:94	nM:i:0
SRR1813891.2700849	83	chr18	13025112	255	48M	=	13025029	-131	TGATATAGTTGGATTAATGTCTGCCATGTTGTTCTTGTTTTTTTCCCC	IJJJIJJJJJJJJJJJJJJJJJJJJIJJJJJJIJJHHHHHFFFFFCBB	NH:i:1	HI:i:1	AS:i:94	nM:i:0
SRR1813891.2700850	99	chr7	26191886	255	48M	=	26191983	145	CTTCTTAACTCTACACACGCACTTAAATTTTTTTAAAGGAAAAACGTT	BB?DFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ	NH:i:1	HI:i:1	AS:i:94	nM:i:0
SRR1813891.2700850	147	chr7	26191983	255	48M	=	26191886	-145	AAAAATCTTAAAAAAGGTTTCACATGTCACCTGAAACTTACAAATTTA	JJJJJIJJJJJJJJJJJJJGHHJJJIIHHJJJJJJHHHHHFFFFFCCC	NH:i:1	HI:i:1	AS:i:94	nM:i:0
SRR1813891.270
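
The second column of each alignment line is the SAM FLAG, a bit field. Decoding the values seen above (163, 83, 99, 147) shows how the two mates of a properly paired read are marked. The bit meanings below come from the SAM specification; the short labels are this sketch's own.

```python
# FLAG bits as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]

for flag in (163, 83, 99, 147):
    print(flag, decode_flag(flag))
```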

In [2]:
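%%bash
REF="/fdb/bwa/indexes/hg38.fa"   # not used below; the genome comes from the prebuilt index in --genomeDir
READ1="RNAsq1.fastq.gz"
module load STAR
mkdir -p indices/starDEMO_SE
# Single-end run: one read file and no --sjdbGTFfile, so junction insertion is skipped.
STAR \
    --runThreadN 10 \
    --genomeDir '/fdb/STAR_indices/2.6.1c/GENCODE/Gencode_human/release_27/genes-100/' \
    --readFilesIn "$READ1" \
    --readFilesCommand zcat \
    --outFileNamePrefix indices/starDEMO_SE/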

Feb 18 18:18:29 ..... started STAR run
Feb 18 18:18:29 ..... loading genome
Feb 18 18:18:53 ..... started mapping
Feb 18 18:19:19 ..... finished successfully


[+] Loading STAR  2.6.1c 
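
The paired-end and single-end commands in this notebook differ only in the number of files given to `--readFilesIn` and in whether `--sjdbGTFfile` is passed. A sketch of building either command from Python is shown below. `star_align_cmd` is a hypothetical helper, and it only assembles the argument list without running STAR.

```python
def star_align_cmd(genome_dir, reads, out_prefix, threads=10, gtf=None):
    """Build a STAR mapping command for one (single-end) or two (paired-end) files."""
    if len(reads) not in (1, 2):
        raise ValueError("expected one or two read files")
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *reads,
        "--readFilesCommand", "zcat",  # inputs are gzip-compressed
        "--outFileNamePrefix", out_prefix,
    ]
    if gtf is not None:
        cmd += ["--sjdbGTFfile", gtf]  # triggers on-the-fly junction insertion
    return cmd

index = "/fdb/STAR_indices/2.6.1c/GENCODE/Gencode_human/release_27/genes-100/"
print(" ".join(star_align_cmd(index, ["RNAsq1.fastq.gz"], "indices/starDEMO_SE/")))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where STAR is on the `PATH`.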


In [3]:
!cat indices/starDEMO_SE/Log.final.out

                                 Started job on |	Feb 18 18:18:29
                             Started mapping on |	Feb 18 18:18:53
                                    Finished on |	Feb 18 18:19:19
       Mapping speed, Million of reads per hour |	415.38

                          Number of input reads |	3000000
                      Average input read length |	48
                                    UNIQUE READS:
                   Uniquely mapped reads number |	1903495
                        Uniquely mapped reads % |	63.45%
                          Average mapped length |	47.85
                       Number of splices: Total |	121953
            Number of splices: Annotated (sjdb) |	120508
                       Number of splices: GT/AG |	120555
                       Number of splices: GC/AG |	916
                       Number of splices: AT/AC |	112
               Number of splices: Non-canonical |	370
                      Mismatch rate per base, % |	0.15%
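
The reported mapping speed can be reproduced from the other fields: the number of input reads divided by the time between "Started mapping on" and "Finished on", expressed in millions of reads per hour. The sketch below checks this for both runs in this notebook. The formula is inferred from the numbers matching, not taken from STAR's source.

```python
from datetime import datetime

def mapping_speed(n_reads, started, finished):
    """Millions of reads per hour between two 'HH:MM:SS' timestamps on the same day."""
    fmt = "%H:%M:%S"
    seconds = (datetime.strptime(finished, fmt) - datetime.strptime(started, fmt)).total_seconds()
    return n_reads / 1e6 / (seconds / 3600)

# Paired-end run: 3,000,000 read pairs mapped in 32 s.
print(f"{mapping_speed(3_000_000, '18:13:30', '18:14:02'):.2f}")
# Single-end run: 3,000,000 reads mapped in 26 s.
print(f"{mapping_speed(3_000_000, '18:18:53', '18:19:19'):.2f}")
```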
                       

In [30]:
!head -n700 indices/starDEMO_SE/Log.out | tail

Completed: thread #6
Completed: thread #5
Joined thread # 5
Joined thread # 6
Joined thread # 7
Completed: thread #9
Completed: thread #8
Joined thread # 8
Joined thread # 9
ALL DONE!


In [22]:
!head indices/starDEMO_SE/Aligned.out.sam
!tail indices/starDEMO_SE/Aligned.out.sam

@HD	VN:1.4
@SQ	SN:chr1	LN:248956422
@SQ	SN:chr2	LN:242193529
@SQ	SN:chr3	LN:198295559
@SQ	SN:chr4	LN:190214555
@SQ	SN:chr5	LN:181538259
@SQ	SN:chr6	LN:170805979
@SQ	SN:chr7	LN:159345973
@SQ	SN:chr8	LN:145138636
@SQ	SN:chr9	LN:138394717
SRR1813891.2849989	16	chr4	85610024	255	48M	*	0	0	GGCATAAGACAATTTAATGAACTTGTTTATTTGTGTCAGGTCCCTGAG	JJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJHHHHHFFFDDB=B	NH:i:1	HI:i:1	AS:i:47	nM:i:0
SRR1813891.2849990	16	chr21	8216143	0	48M	*	0	0	TACGCCGCGACGAGTAGGAGGGCCGCTGCGGTGAGCCTTGAAGCCTAG	DDDCFHIGIHDJHHFJJJJIJJIJIIGGFHGGJIIHFFHHFFFFDB=B	NH:i:8	HI:i:1	AS:i:47	nM:i:0
SRR1813891.2849990	272	chr21	8399177	0	48M	*	0	0	TACGCCGCGACGAGTAGGAGGGCCGCTGCGGTGAGCCTTGAAGCCTAG	DDDCFHIGIHDJHHFJJJJIJJIJIIGGFHGGJIIHFFHHFFFFDB=B	NH:i:8	HI:i:2	AS:i:47	nM:i:0
SRR1813891.2849990	272	chr21	8260372	0	48M	*	0	0	TACGCCGCGACGAGTAGGAGGGCCGCTGCGGTGAGCCTTGAAGCCTAG	DDDCFHIGIHDJHHFJJJJIJJIJIIGGFHGGJIIHFFHHFFFFDB=B	NH:i:8	HI:i:3	AS:i:47	nM:i:0
SRR1813891.2849990	272	chr21	8443412	0	48M	*	0	0	TACGCCGCGAC
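
In the records above, the uniquely mapped read has MAPQ 255 and `NH:i:1`, while the multi-mapped read has MAPQ 0, `NH:i:8`, and FLAG 272 (secondary, reverse strand) on its extra alignments. The sketch below pulls these fields out of a SAM line and applies STAR's default MAPQ convention as described in its manual (255 for unique, 3 for two loci, 1 for three or four, 0 for five or more). The helper names are illustrative.

```python
def parse_sam_record(line):
    """Extract the leading fixed fields and the optional tags of one SAM line."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for item in fields[11:]:
        name, kind, value = item.split(":", 2)
        tags[name] = int(value) if kind == "i" else value
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),
        "mapq": int(fields[4]),
        "tags": tags,
    }

def star_default_mapq(nh):
    """MAPQ that STAR assigns by default for a read mapping to nh loci."""
    if nh == 1:
        return 255
    if nh == 2:
        return 3
    if nh <= 4:
        return 1
    return 0

# Primary alignment of the multi-mapped read shown above, rebuilt with explicit tabs.
line = "\t".join([
    "SRR1813891.2849990", "16", "chr21", "8216143", "0", "48M", "*", "0", "0",
    "TACGCCGCGACGAGTAGGAGGGCCGCTGCGGTGAGCCTTGAAGCCTAG",
    "DDDCFHIGIHDJHHFJJJJIJJIJIIGGFHGGJIIHFFHHFFFFDB=B",
    "NH:i:8", "HI:i:1", "AS:i:47", "nM:i:0",
])
rec = parse_sam_record(line)
print(rec["mapq"] == star_default_mapq(rec["tags"]["NH"]))
```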