# RNA-sequencing Alignment Pipeline


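This pipeline trims and filters your reads with fastp, runs quality control with FastQC and MultiQC, aligns the reads to a reference genome with STAR, and counts reads per feature with featureCounts (Subread).
Before you start, place your raw data in the "rawdata" subfolder and activate the conda environment with "conda activate STAR_conda".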

The first step runs fastp on each of your samples in a loop to trim and filter all reads.

Required packages (conda):
fastp,
FastQC,
MultiQC,
STAR,
Subread,
GNU parallel (used by the FastQC step)
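
Before running anything, it can help to confirm that the tools listed above are actually on your PATH. The following is a minimal sketch, assuming the usual executable names (`fastqc`, `multiqc`, `featureCounts` for Subread, `parallel` for GNU parallel). `missing_tools` is a hypothetical helper written for this check, not part of any of these packages.

```bash
# Print every tool name passed as an argument that is NOT found on PATH.
# The executable names used below are assumptions: Subread is checked via
# featureCounts, and GNU parallel is included because the FastQC step uses it.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools fastp fastqc multiqc STAR featureCounts parallel)
if [ -n "$missing" ]; then
    echo "Not found on PATH (is the conda environment active?):"
    echo "$missing"
else
    echo "All pipeline tools found."
fi
```

If anything is reported as missing, activate the conda environment first and rerun the check.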

## Trim/Filter Reads

In [None]:
%%bash
mkdir -p ./fastp_output/data
for fq in ./rawdata/*.fastq.gz #loops over files in the rawdata folder that match *.fastq.gz
do
    file=${fq%.fastq.gz} #path without the .fastq.gz extension
    #fastp writes its reports to fastp.html and fastp.json in the working directory by default
    fastp -i "$file.fastq.gz" -o "${file}_fastp_output.fastq.gz" --trim_poly_x
    mv fastp.html "${file}_fastp.html" #renames the report to include the original filename
    mv fastp.json "${file}_fastp.json"
done
#move all fastp output out of rawdata into ./fastp_output, then move the trimmed reads into ./fastp_output/data
mv ./rawdata/*fastp* ./fastp_output
mv ./fastp_output/*.gz ./fastp_output/data
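
As an alternative to renaming and moving files after each run, fastp can write its reports to chosen paths through its `--html` and `--json` options. The sketch below is a dry run under that assumption: it only prints the command it would use for each sample. `fastp_cmd` is a hypothetical helper, and the output layout simply mirrors the folders used above.

```bash
# Print (without running) the fastp command for one raw FASTQ file.
# Trimmed reads and reports are written straight to their final locations,
# so no renaming or moving is needed afterwards.
# The output folders must already exist, e.g. mkdir -p ./fastp_output/data
fastp_cmd() {
    local fq=$1 outdir=$2
    local sample
    sample=$(basename "$fq" .fastq.gz)
    echo "fastp -i $fq -o $outdir/data/${sample}_fastp_output.fastq.gz" \
         "--trim_poly_x --html $outdir/${sample}_fastp.html --json $outdir/${sample}_fastp.json"
}

# Dry run over the raw data: one printed command per sample.
for fq in ./rawdata/*.fastq.gz; do
    [ -e "$fq" ] || continue   # skip when the glob matches nothing
    fastp_cmd "$fq" ./fastp_output
done
```

Reviewing the printed commands before running them is a cheap way to catch path mistakes. Filenames containing spaces would need extra quoting if the printed lines are executed directly.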

## FastQC - Quality Control of Samples

In [None]:
#running FastQC through GNU parallel is very fast - use in the future!
!mkdir -p ./FastQC_output
#quality control on trimmed samples
!find ./fastp_output/data -name "*_output.fastq.gz" | parallel fastqc -o ./FastQC_output/ {} 
#quality control on raw samples - not necessary but I like to compare 
!find ./rawdata -name "*.fastq.gz" | parallel fastqc -o ./FastQC_output/ {} 
!multiqc ./FastQC_output/

If you are working in a Jupyter notebook, download the MultiQC .html report (multiqc_report.html by default) to your computer to view it.

## Genome Generation (only done once!)

Make a STAR index for each genome if one has not already been created. Download the genome FASTA and GTF files of your choice and place them in a subfolder of ./STARindex/ (./STARindex/Ca22genome in the command below).

In [None]:
#make STAR indices 
#do not run if you already have indices! 
!STAR --runThreadN 5 \
    --runMode genomeGenerate \
    --genomeDir ./STARindex/Ca22genome \
    --genomeFastaFiles ./STARindex/Ca22genome/C_albicans_SC5314_version_A22-s07-m01-r105_chromosomes.fasta \
    --sjdbGTFfile ./STARindex/Ca22genome/C_albicans_SC5314_version_A22-s07-m01-r105_features.gtf \
    --sjdbOverhang 62 \
    --genomeSAindexNbases 10
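
The values passed to `--sjdbOverhang` and `--genomeSAindexNbases` above depend on your data. The STAR manual recommends setting `--sjdbOverhang` to the read length minus 1 and, for small genomes, scaling `--genomeSAindexNbases` down to min(14, log2(GenomeLength)/2 - 1). The sketch below computes the second value from a FASTA file. Both helper names are hypothetical, and rounding the result down is an assumption made here, so treat the output as a starting point rather than a definitive setting.

```bash
# Suggest a --genomeSAindexNbases value for a genome of the given length (bases),
# following min(14, log2(GenomeLength)/2 - 1). Rounding down is an assumption.
sa_index_nbases() {
    awk -v len="$1" 'BEGIN {
        n = int(log(len) / log(2) / 2 - 1)
        if (n > 14) n = 14
        print n
    }'
}

# Total sequence length of a FASTA file (header lines starting with ">" are skipped).
fasta_length() {
    awk '!/^>/ { total += length($0) } END { print total + 0 }' "$1"
}

fasta=./STARindex/Ca22genome/C_albicans_SC5314_version_A22-s07-m01-r105_chromosomes.fasta
if [ -f "$fasta" ]; then
    sa_index_nbases "$(fasta_length "$fasta")"
fi
```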

## STAR Alignment

Run STAR alignment on all trimmed files in a loop.

In [None]:
%%bash
mkdir -p ./starout #must match the folder used in --outFileNamePrefix below
for file in ./fastp_output/data/*.fastq.gz
do
    b=$(basename "$file" .fastq.gz) #sample name without the extension
    STAR --genomeDir ./STARindex/Ca22genome \
              --runMode alignReads \
              --readFilesIn "$file" \
              --runThreadN 20 \
              --readFilesCommand gunzip -c \
              --outSAMtype BAM SortedByCoordinate \
              --outReadsUnmapped Fastx \
              --outFileNamePrefix "./starout/${b}_" \
              --limitBAMsortRAM 31000000000 \
              --outSAMmultNmax 10 \
              --alignIntronMax 4000 --alignIntronMin 4
done
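
Once the loop finishes, STAR leaves a `Log.final.out` file next to each BAM, and its "Uniquely mapped reads %" line is a quick sanity check on the alignment. The following is a minimal sketch for collecting that value per sample, assuming the logs sit in ./starout as in the loop above. `unique_rate` is a hypothetical helper, not a STAR command.

```bash
# Extract the "Uniquely mapped reads %" value from one STAR Log.final.out file.
# Lines in that file have the form "<label> |<tab><value>".
unique_rate() {
    awk -F'|' '/Uniquely mapped reads %/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# One line per sample: sample prefix, then its unique mapping rate.
for log in ./starout/*Log.final.out; do
    [ -e "$log" ] || continue   # skip when the glob matches nothing
    printf '%s\t%s\n' "$(basename "$log" Log.final.out)" "$(unique_rate "$log")"
done
```

A sample whose rate is far below the others is worth checking in the MultiQC report before moving on to counting.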

## Grabbing Count Data Using Subread

In [None]:
!mkdir -p ./featurecounts
!featureCounts -T 20 -t mRNA -g Name -s 1 -a ~/STARgenomes/Ca22genome_Aallele/C_albicans_SC5314_version_A22-s07-m01-r105_features_Aallele_tuch.gff.txt -o ./featurecounts/C_alb_mRNA_STAR_counts.txt ./starout/*.bam

The output file in ./featurecounts is then loaded into another notebook to run DESeq (differential expression analysis).
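
The featureCounts table starts with a comment line and carries five annotation columns (Chr, Start, End, Strand, Length) between the gene ID and the per-sample counts. If the downstream notebook expects a plain gene-by-sample matrix, a sketch like the one below can strip those down first. `counts_matrix` and the output filename are hypothetical choices for illustration.

```bash
# Reduce a featureCounts table to a plain count matrix:
# drop the leading "#" comment line, keep Geneid (column 1)
# and the per-sample count columns (column 7 onward).
counts_matrix() {
    grep -v '^#' "$1" | cut -f1,7-
}

counts=./featurecounts/C_alb_mRNA_STAR_counts.txt
if [ -f "$counts" ]; then
    # Output filename is an assumption; pick whatever the DESeq notebook expects.
    counts_matrix "$counts" > ./featurecounts/C_alb_mRNA_STAR_counts_matrix.txt
fi
```

The column headers in the result are the BAM paths passed to featureCounts, so they may need shortening to match your sample sheet.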