## Processing B16 R499 Stat1KO RNA-seq data

**Updated 04/30/20**

Conditions (n=16): B16_cas, B16_SKO, R499_cas, R499_SKO

1. Trim adapter sequences
2. Quantify with salmon
3. Generate normalized count matrix

STAR alignments are not used for standard gene-level quantification. To quantify TEs with TEtranscripts, run STAR with multimapping parameters optimized for TE recovery. SQuIRE does its own alignment for TE quantification.

**Processed data** <br>
- Normalized count matrix: ~/Dropbox/Minn/ATAC_RNA_integration/mRFAR_v5/rna.dat_mRFAR_v5.txt
- RNA annotations: ~/Dropbox/Minn/ATAC_RNA_integration/mRFAR_v5/rna.anno_mRFAR_v5.txt
- DESeq2 results: ~/Dropbox/Minn/ATAC_RNA_integration/mRFAR_v5/dds_mRFAR_v5.rds
- STAR-mapped bigwig files: /home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/data/aligned/merged/bigwig/
- Unique reads only (for TE analysis): 


### fastQC

FastQC version 0.11.5 <br>
multiqc version 1.8

fastq files located at: /home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/data/fastq/ <br>

QC files located at: <br>
/home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/data/fastq/fastqc/ <br>
/Users/jingyaqiu/Dropbox/Minn/B16_R499_Stat1KO_RNA/data/fastqc/

Looks like no adapter contamination, but trim to be safe.

In [None]:
cd /home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/data/fastq
mkdir fastqc

for file in *fastq.gz; do /home/jingyaq/FastQC/fastqc "$file" -o fastqc; done

export PATH="/home/jingyaq/anaconda3-new/bin:$PATH"
multiqc fastqc
mv multiqc* fastqc

### Trim adapter sequences

cutadapt version 2.9

Trim reads with adapter contamination on 3' ends
- -m 30 - discard reads shorter than 30 bp after trimming (30 for RNA, 5 for ATAC)
- -e 0.2 - maximum tolerated error rate when searching for adapter sequence (20%)
- -q 10 - trim low-quality bases from 3' ends (Phred quality < 10, Phred+33 encoding)
- -O 5 - minimum overlap length between read and adapter for adapter to be found

RNA adapters: <br>
R1 - AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC <br>
R2 - AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
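
As a rough sketch, the parameters and adapters above correspond to a paired-end cutadapt call like the one printed below. The contents of `cutadapt_jobScripts.py` are not shown in this notebook, so the function name, the `sampleA` prefix, and the output file naming are assumptions for illustration only. The function prints the command instead of running it.

```shell
# Sketch only: prints (does not run) a paired-end cutadapt command.
# Adapter sequences and parameter values come from this notebook;
# the sample prefix and output naming are hypothetical.
ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
MIN_LEN=30   # 30 for RNA, 5 for ATAC

print_cutadapt_cmd() {
    local prefix="$1"
    # -a / -A: 3' adapters for R1 / R2; -o / -p: trimmed R1 / R2 outputs
    echo "cutadapt -a ${ADAPTER_R1} -A ${ADAPTER_R2}" \
         "-m ${MIN_LEN} -e 0.2 -q 10 -O 5" \
         "-o fastq_trimmed/${prefix}.R1.fastq.gz -p fastq_trimmed/${prefix}.R2.fastq.gz" \
         "${prefix}.R1.fastq.gz ${prefix}.R2.fastq.gz"
}

print_cutadapt_cmd sampleA
```

Printing the command first makes it easy to inspect what a job script would submit before anything is run.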

fastq files located at: <br>
/home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/data/fastq/fastq_trimmed/

**FASTQ FILES MUST BE IN THE FORM: PREFIX.R1.FASTQ.GZ - FIRST "." HAS TO SEPARATE PREFIX AND R1!**
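
The naming rule can be checked before submitting jobs. The helper below is a minimal sketch, not part of the processing scripts: `parse_fastq_name` and the sample names are hypothetical. It splits a file name on the first "." and rejects names where the next field is not R1 or R2.

```shell
# Sketch only: split a FASTQ file name on its first "." and validate the read label.
parse_fastq_name() {
    local name="${1##*/}"        # drop any leading directory
    local prefix="${name%%.*}"   # text before the first "."
    local rest="${name#*.}"      # text after the first "."
    local mate="${rest%%.*}"     # expected to be R1 or R2
    case "$mate" in
        R1|R2) echo "${prefix} ${mate}" ;;
        *)     echo "bad name: ${name}" >&2; return 1 ;;
    esac
}

parse_fastq_name B16_cas_1.R1.fastq.gz           # prints "B16_cas_1 R1"
parse_fastq_name B16.cas.1.R1.fastq.gz || true   # rejected: first "." does not precede R1
```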

In [None]:
cd /home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/data/fastq
mkdir fastq_trimmed

python /home/jingyaq/Minn/processing_scripts/cutadapt_jobScripts.py AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT 30

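# Count job logs reporting success (should match the number of submitted jobs)
grep -rnw *.out -e "Successfully" | wc -l
# Clean up job logs once all jobs have finished
rm *.out *.err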

### QC trimmed reads ###

cd /home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/data/fastq/fastq_trimmed/
mkdir fastqc
for file in *fastq.gz; do /home/jingyaq/FastQC/fastqc "$file" -o fastqc; done
export PATH="/home/jingyaq/anaconda3-new/bin:$PATH"
multiqc fastqc
mv multiqc* fastqc

### Quantification with salmon

**Run on 09/12/2018!!**

salmon version 0.13.1

Reference transcriptome located at: /home/jingyaq/Minn/references/GRCm38/gencode.GRCm38.p6.vmM18.transcripts.fa.gz <br>
Salmon index located at: /home/jingyaq/Minn/resources/salmon/GRCm38.p6_index

Parameters:
- --libType A - automatically infer library type
- --validateMappings - use a more sensitive and accurate mapping algorithm
- --rangeFactorizationBins 4 - likelihood factorization, can improve quantification accuracy on a class of "difficult" transcripts
- --numBootstraps 3 - compute bootstrapped abundance estimates. More accurate estimate of variance, but more computation and time required
- --seqBias - learn and correct for sequence-specific biases (random hexamer priming)
- --gcBias - learn and correct for fragment-level GC biases

https://salmon.readthedocs.io/en/latest/salmon.html

Output saved at: <br>
/home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/data/counts/salmon/
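
For reference, the parameters listed above map onto a `salmon quant` call roughly like the one printed below. This is a sketch under assumptions: `salmon_jobScripts.py` is not shown here, so the function name, the `sampleA` prefix, and the input directory are hypothetical. The index and output paths are the ones given in this section.

```shell
# Sketch only: prints (does not run) a paired-end salmon quant command.
# Flags mirror the parameter list in this notebook; sample prefix and
# FASTQ directory are hypothetical.
SALMON_INDEX="/home/jingyaq/Minn/resources/salmon/GRCm38.p6_index"
OUT_DIR="/home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/data/counts/salmon"

print_salmon_cmd() {
    local prefix="$1"
    local fastq_dir="$2"
    echo "salmon quant -i ${SALMON_INDEX} --libType A" \
         "-1 ${fastq_dir}/${prefix}.R1.fastq.gz -2 ${fastq_dir}/${prefix}.R2.fastq.gz" \
         "--validateMappings --rangeFactorizationBins 4 --numBootstraps 3" \
         "--seqBias --gcBias -o ${OUT_DIR}/${prefix}"
}

print_salmon_cmd sampleA fastq_trimmed
```

Writing each sample to its own directory under the output path keeps one `quant.sf` per sample, which is the layout tximport expects to be pointed at.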

#### Installation

In [None]:
export PATH="/home/jingyaq/anaconda3-new/bin:$PATH"

conda config --add channels conda-forge
conda config --add channels bioconda
conda create -n salmon salmon

source activate salmon # Latest version 1.2.0

#### Build index

In [None]:
# export PATH="/home/jingyaq/anaconda3-new/bin:$PATH"
# source activate salmon

salmon index \
    -t /home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/resources/gencode.GRCm38.p6.vmM18.transcripts.fa.gz \
    -i /home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/resources/GRCm38.p6_salmon_index \
    --gencode

#### Non-alignment-based quantification

In [None]:
# export PATH="/home/jingyaq/anaconda3-new/bin:$PATH"
# source activate salmon

python scripts/salmon_jobScripts.py

### Generate normalized count matrix

Import salmon counts, normalize with DESeq2 rlog

Output count matrix: <br>
~/Dropbox/Minn/ATAC_RNA_integration/mRFAR_v5/rna.dat_mRFAR_v5.txt
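
Before importing, it can help to confirm that every salmon output directory contains a non-empty `quant.sf`. The check below is a hypothetical helper, not part of `tximport_geneQuant.R`, and it assumes one subdirectory per sample under the salmon output path.

```shell
# Sketch only: report sample directories that lack a non-empty quant.sf.
# Assumes one subdirectory per sample; returns non-zero if any are missing.
list_missing_quant() {
    local salmon_dir="$1"
    local missing=0
    local sample_dir
    for sample_dir in "$salmon_dir"/*/; do
        [ -d "$sample_dir" ] || continue     # glob matched nothing
        if [ ! -s "${sample_dir}quant.sf" ]; then
            echo "missing: ${sample_dir}quant.sf"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

list_missing_quant /home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/data/counts/salmon \
    || echo "Some samples have no quant.sf; rerun salmon for those before tximport."
```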

In [None]:
Rscript scripts/tximport_geneQuant.R

### OPTIONAL: align with STAR

(Only need this for TEtranscripts)

Genome fasta file: STAR (manual 2.7.2b) strongly recommends "files marked with PRI (primary)" from GENCODE.

"It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should not be included in the genome."

GRCm38.primary_assembly.genome.fa - this includes major chromosomes and scaffolds, but excludes assembly patches and haplotypes.

GTF annotation file: STAR (manual 2.7.2b) strongly recommends the "most comprehensive annotations for a given species". Chromosome names in GTF and fasta files must match!!! <br>
gencode.vM17.chr_patch_hapl_scaff.annotation.gtf - contains comprehensive gene annotation on reference chromosomes, scaffolds, assembly patches, and alternate loci (haplotypes)

Genome indices located at: <br>
/home/jingyaq/Minn/STAR_genome_indices/GRCm38

Didn't use trimmed fastq reads!
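
For the TE-oriented alignment, the main change from a default STAR run is allowing many more multimapping alignments per read. The sketch below prints one such command. `scripts/STAR_TE.py` is not shown in this notebook, so the function name, the `sampleA` prefix, the thread count, and the value of 100 for both multimapping limits are assumptions. The value is a commonly used starting point for TEtranscripts, not a confirmed setting from this project.

```shell
# Sketch only: prints (does not run) a STAR command with relaxed multimapping
# limits for TE quantification. Paths come from this notebook; the sample
# prefix, thread count, and MULTIMAP_MAX are assumptions.
STAR_INDEX="/home/jingyaq/Minn/STAR_genome_indices/GRCm38"
FASTQ_DIR="/home/jingyaq/Minn/data/B16_R499_Stat1KO_RNA/data/fastq"   # untrimmed reads
MULTIMAP_MAX=100   # assumed value; tune for your data

print_star_te_cmd() {
    local prefix="$1"
    echo "STAR --runThreadN 8 --genomeDir ${STAR_INDEX}" \
         "--readFilesIn ${FASTQ_DIR}/${prefix}.R1.fastq.gz ${FASTQ_DIR}/${prefix}.R2.fastq.gz" \
         "--readFilesCommand zcat" \
         "--outFilterMultimapNmax ${MULTIMAP_MAX} --winAnchorMultimapNmax ${MULTIMAP_MAX}" \
         "--outSAMtype BAM Unsorted --outFileNamePrefix ${prefix}."
}

print_star_te_cmd sampleA
```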

#### Generate index

In [None]:
bsub -M 180000 \
    -o ~/Minn/STAR_genome_indices/output_logs/output_STAR_2.7_genomeGenerate \
    -e ~/Minn/STAR_genome_indices/output_logs/error_STAR_2.7_genomeGenerate \
    STAR --runThreadN 28 --runMode genomeGenerate \
    --genomeDir ~/Minn/STAR_genome_indices/STAR_2.7.1a/GRCm38/ \
    --genomeFastaFiles ~/Minn/references/GRCm38/GRCm38.primary_assembly.genome.fa \
    --sjdbGTFfile ~/Minn/references/GRCm38/gencode.vM22.chr_patch_hapl_scaff.annotation.gtf \
    --sjdbOverhang 74

#### Run STAR

In [None]:
# Load the STAR version matching the project (load only one)
module load STAR/2.5.2a    # B16_R499_Stat1KO_RNA
# module load STAR/2.7.1a  # epi_ATAC

python scripts/STAR_jobScripts.py
python scripts/STAR_TE.py  # optimized for transposable element recovery

### TE quantification

~/Dropbox/Minn/B16_R499_Stat1KO_RNA/analysis scripts/TE quantification.notebook.ipynb

~/Dropbox/Minn/B16_R499_Stat1KO_RNA/analysis scripts/SQuIRE.notebook.ipynb (SQuIRE) <br>
~/Dropbox/Minn/epi_ATAC/scripts/TEtranscripts.notebook.ipynb (TEtranscripts)