Project: Gene expression analysis of gill tissue from red king crab (_Paralithodes camtschaticus_) reared in varying pH conditions

In this first notebook I will record the steps taken to process RNASeq files from raw data to trimmed/filtered data that is ready for alignment. All steps will be conducted on Sedna, the high-performance computing system. Locations of files, code, and scripts will be documented in this notebook.

## Step 1: Concatenate sequence data from the same individual that was sequenced in different lanes

There are 43 libraries, one per individual crab (each crab is identified by its Tank # and Crab #), and each library was run in 7 lanes. Therefore, sequencing data (paired reads) for the same crab was delivered to us from Univ. of Oregon in 7 separate files. Giles concatenated the data by Tank#/Crab# using the following script: [concat_fastq_files.sh](https://raw.githubusercontent.com/laurahspencer/red-king_RNASeq-2022/main/scripts/concat_fastq_files.sh). Script location on Sedna: `biodata/ggoetz/nichols/201910-redking_crab-rnaseq/scripts/concat_fastq_files.sh`
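As a rough illustration of the concatenation step (not the contents of `concat_fastq_files.sh`, which I have not reproduced here), the sketch below builds toy per-lane files in a temporary directory and joins them per read direction. The lane file naming pattern `<sample>_L00<lane>_<read>.fastq` and the toy reads are assumptions made purely for the example.

```shell
# Toy sandbox: the per-lane naming pattern below is hypothetical.
work=$(mktemp -d)
sample="Tank_1_Crab_1"

# Create 7 tiny lane files per read direction (one 4-line FASTQ record each).
for lane in 1 2 3 4 5 6 7; do
  for read in R1 R2; do
    printf '@read_lane%s_%s\nACGT\n+\nIIII\n' "$lane" "$read" \
      > "$work/${sample}_L00${lane}_${read}.fastq"
  done
done

# Concatenate lanes per read direction. The glob expands in sorted order,
# so R1 and R2 are joined in the same lane order and mates stay in sync.
for read in R1 R2; do
  cat "$work/${sample}"_L00[1-7]_"${read}".fastq > "$work/${sample}_${read}.fastq"
done

# Each output should hold 7 records x 4 lines.
wc -l "$work/${sample}_R1.fastq" "$work/${sample}_R2.fastq"
```

The key point is that R1 and R2 must be concatenated in the same lane order, otherwise read pairs no longer line up between the two files.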

He then compressed the concatenated files using this script: [compress_concat_files.sh](https://raw.githubusercontent.com/laurahspencer/red-king_RNASeq-2022/main/scripts/compress_concat_files.sh). Script location on Sedna: `biodata/ggoetz/nichols/201910-redking_crab-rnaseq/scripts/compress_concat_files.sh`

He copied the concatenated/compressed data over to a new directory on Sedna, which is where I will retrieve the data for further processing: `share/nwfsc/ggoetz/red_king_crab/illumina/`

Here is a list of the concatenated sequence files: 

```
Tank_10_Crab_1_R1.fastq.gz  Tank_15_Crab_3_R1.fastq.gz  Tank_20_Crab_2_R1.fastq.gz  Tank_5_Crab_1_R1.fastq.gz
Tank_10_Crab_1_R2.fastq.gz  Tank_15_Crab_3_R2.fastq.gz  Tank_20_Crab_2_R2.fastq.gz  Tank_5_Crab_1_R2.fastq.gz
Tank_10_Crab_2_R1.fastq.gz  Tank_16_Crab_1_R1.fastq.gz  Tank_20_Crab_3_R1.fastq.gz  Tank_5_Crab_2_R1.fastq.gz
Tank_10_Crab_2_R2.fastq.gz  Tank_16_Crab_1_R2.fastq.gz  Tank_20_Crab_3_R2.fastq.gz  Tank_5_Crab_2_R2.fastq.gz
Tank_10_Crab_3_R1.fastq.gz  Tank_16_Crab_2_R1.fastq.gz  Tank_2_Crab_1_R1.fastq.gz   Tank_5_Crab_3_R1.fastq.gz
Tank_10_Crab_3_R2.fastq.gz  Tank_16_Crab_2_R2.fastq.gz  Tank_2_Crab_1_R2.fastq.gz   Tank_5_Crab_3_R2.fastq.gz
Tank_11_Crab_1_R1.fastq.gz  Tank_16_Crab_4_R1.fastq.gz  Tank_2_Crab_2_R1.fastq.gz   Tank_7_Crab_1_R1.fastq.gz
Tank_11_Crab_1_R2.fastq.gz  Tank_16_Crab_4_R2.fastq.gz  Tank_2_Crab_2_R2.fastq.gz   Tank_7_Crab_1_R2.fastq.gz
Tank_11_Crab_2_R1.fastq.gz  Tank_18_Crab_1_R1.fastq.gz  Tank_2_Crab_3_R1.fastq.gz   Tank_7_Crab_3_R1.fastq.gz
Tank_11_Crab_2_R2.fastq.gz  Tank_18_Crab_1_R2.fastq.gz  Tank_2_Crab_3_R2.fastq.gz   Tank_7_Crab_3_R2.fastq.gz
Tank_11_Crab_3_R1.fastq.gz  Tank_18_Crab_2_R1.fastq.gz  Tank_3_Crab_1_R1.fastq.gz   Tank_7_Crab_4_R1.fastq.gz
Tank_11_Crab_3_R2.fastq.gz  Tank_18_Crab_2_R2.fastq.gz  Tank_3_Crab_1_R2.fastq.gz   Tank_7_Crab_4_R2.fastq.gz
Tank_13_Crab_1_R1.fastq.gz  Tank_18_Crab_3_R1.fastq.gz  Tank_3_Crab_2_R1.fastq.gz   Tank_9_Crab_1_R1.fastq.gz
Tank_13_Crab_1_R2.fastq.gz  Tank_18_Crab_3_R2.fastq.gz  Tank_3_Crab_2_R2.fastq.gz   Tank_9_Crab_1_R2.fastq.gz
Tank_13_Crab_2_R1.fastq.gz  Tank_1_Crab_1_R1.fastq.gz   Tank_3_Crab_3_R1.fastq.gz   Tank_9_Crab_2_R1.fastq.gz
Tank_13_Crab_2_R2.fastq.gz  Tank_1_Crab_1_R2.fastq.gz   Tank_3_Crab_3_R2.fastq.gz   Tank_9_Crab_2_R2.fastq.gz
Tank_13_Crab_3_R1.fastq.gz  Tank_1_Crab_2_R1.fastq.gz   Tank_4_Crab_1_R1.fastq.gz   Tank_9_Crab_3_R1.fastq.gz
Tank_13_Crab_3_R2.fastq.gz  Tank_1_Crab_2_R2.fastq.gz   Tank_4_Crab_1_R2.fastq.gz   Tank_9_Crab_3_R2.fastq.gz
Tank_15_Crab_1_R1.fastq.gz  Tank_1_Crab_3_R1.fastq.gz   Tank_4_Crab_2_R1.fastq.gz   Tank_9_Crab_4_R1.fastq.gz
Tank_15_Crab_1_R2.fastq.gz  Tank_1_Crab_3_R2.fastq.gz   Tank_4_Crab_2_R2.fastq.gz   Tank_9_Crab_4_R2.fastq.gz
Tank_15_Crab_2_R1.fastq.gz  Tank_20_Crab_1_R1.fastq.gz  Tank_4_Crab_3_R1.fastq.gz
Tank_15_Crab_2_R2.fastq.gz  Tank_20_Crab_1_R2.fastq.gz  Tank_4_Crab_3_R2.fastq.gz
```
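A quick sanity check on a listing like the one above is to confirm that every `R1` file has a matching `R2` file and to count the pairs. This is a minimal sketch; `count_pairs` and `DATA_DIR` are names I made up for the example, and `DATA_DIR` should be pointed at the directory holding the concatenated files.

```shell
# count_pairs DIR: report how many *_R1.fastq.gz files in DIR have an *_R2 mate.
count_pairs() {
  dir=$1
  n_paired=0
  n_unpaired=0
  for r1 in "$dir"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue               # glob matched nothing
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"    # expected mate file name
    if [ -e "$r2" ]; then
      n_paired=$((n_paired + 1))
    else
      echo "Missing mate for $r1" >&2
      n_unpaired=$((n_unpaired + 1))
    fi
  done
  echo "$n_paired paired samples, $n_unpaired R1 files without an R2 mate"
}

# DATA_DIR is an assumption: defaults to the current directory.
count_pairs "${DATA_DIR:-.}"
```

If all libraries are present, this should report 43 paired samples for this dataset.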

## Step 2: Assess quality of raw reads using FastQC/MultiQC



I wrote a SLURM script to run `fastqc` and `multiqc` on all the raw (but concatenated) RNASeq data: [2021-12-13_fastqc-concat.sh](https://raw.githubusercontent.com/laurahspencer/red-king_RNASeq-2022/main/scripts/2021-12-13_fastqc-concat.sh)

I tried to rsync the script from my local computer to Sedna using the below command, but `rsync` isn't installed on Cygwin yet. (NOTE: the below rsync code hasn't yet been tested.)

`rsync --archive --progress --verbose --relative 2021-12-13_fastqc-concat.sh lspencer@sedna.nwfsc2.noaa.gov:/home/lspencer/2022-redking-OA/fastqc`

So, instead I created the script directly on Sedna using `touch 2021-12-13_fastqc-concat.sh` then opened the file using `nano 2021-12-13_fastqc-concat.sh`, and pasted contents from my clipboard using keys Shift+Insert. 

I executed the fastqc slurm script via `sbatch 2021-12-13_fastqc-concat.sh`

NOTE: to increase `fastqc` speed I added the argument `--threads 8` and set the number of CPUs to 8 in the SLURM header.
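For reference, here is a minimal sketch of how the thread count and the SLURM CPU request can be kept in agreement. It is not a copy of `2021-12-13_fastqc-concat.sh`; the job name, log file name, `IN_DIR`, and `OUT_DIR` are placeholders I chose for illustration, and any module or environment loading needed on Sedna is left out.

```shell
#!/bin/bash
#SBATCH --job-name=fastqc-concat
#SBATCH --cpus-per-task=8              # CPUs requested; should match the FastQC thread count
#SBATCH --output=fastqc-concat_%j.log  # %j is replaced by the job ID

# Placeholder paths (assumptions): edit before use.
IN_DIR="${IN_DIR:-/path/to/concatenated_fastq}"
OUT_DIR="${OUT_DIR:-./fastqc_concat}"

# SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; fall back to 8 otherwise.
THREADS="${SLURM_CPUS_PER_TASK:-8}"

if command -v fastqc >/dev/null 2>&1 && command -v multiqc >/dev/null 2>&1; then
  mkdir -p "$OUT_DIR"
  # FastQC processes one file per thread, so 8 threads means 8 files at a time.
  fastqc --threads "$THREADS" --outdir "$OUT_DIR" "$IN_DIR"/*.fastq.gz
  # Summarize all FastQC outputs into a single MultiQC report.
  multiqc "$OUT_DIR" --outdir "$OUT_DIR"
else
  echo "fastqc/multiqc not on PATH; load them first (would use $THREADS threads)" >&2
fi
```

Reading the thread count from `SLURM_CPUS_PER_TASK` means only the `#SBATCH` line needs to change if I request a different number of CPUs.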

I transferred the MultiQC files over to my computer using `rsync` with this command:
```
rsync --archive --progress --verbose --relative lspencer@sedna.nwfsc2.noaa.gov:/home/lspencer/2022-redking-OA/fastqc/concat/multiqc* .
```
BUT because of the `--relative` option, that recreated the entire directory structure starting from `home/` on my computer, so next time I need to fix the code to transfer only the files I want.
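One way to check the path handling before running against Sedna is a local dry run in a throwaway directory. The sketch below mimics the remote layout under a temporary folder (a stand-in, not the real server) and compares three variants: `--relative` as used above, no `--relative`, and `--relative` with a `/./` anchor, which tells `rsync` to keep only the part of the path after the anchor.

```shell
# Sandbox: a temporary directory stands in for the remote filesystem.
sandbox=$(mktemp -d)
base="$sandbox/remote/home/lspencer/2022-redking-OA"
mkdir -p "$base/fastqc/concat" \
  "$sandbox/with_relative" "$sandbox/no_relative" "$sandbox/with_anchor"
touch "$base/fastqc/concat/multiqc_report.html"

if command -v rsync >/dev/null 2>&1; then
  # 1) --relative: the full source path is recreated under the destination.
  rsync --archive --relative "$base/fastqc/concat"/multiqc* "$sandbox/with_relative/"

  # 2) No --relative: only the matched files land in the destination.
  rsync --archive "$base/fastqc/concat"/multiqc* "$sandbox/no_relative/"

  # 3) --relative with a "/./" anchor: only the path after the anchor is kept.
  rsync --archive --relative "$base/./fastqc/concat"/multiqc* "$sandbox/with_anchor/"

  # Show where the file ended up in each case.
  find "$sandbox/with_relative" "$sandbox/no_relative" "$sandbox/with_anchor" -type f
fi
```

For the real transfer, the simplest fix is probably variant 2: the same command as before, minus `--relative`.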

#### Inspect MultiQC report

```
import IPython
IPython.display.HTML(filename='C:/Users/laura.spencer/Work/red-king_RNASeq-2022/results/fastqc/multiqc_report_raw.html')
```

| Sample Name | % Dups | % GC | Length | % Failed | M Seqs |
|---|---|---|---|---|---|
| Tank_10_Crab_1_R1 | 66.8% | 45% | 101 bp | 18% | 73.9 |
| Tank_10_Crab_1_R2 | 61.6% | 45% | 101 bp | 18% | 73.9 |
| Tank_10_Crab_2_R1 | 78.2% | 47% | 101 bp | 9% | 61.8 |
| Tank_10_Crab_2_R2 | 71.5% | 47% | 101 bp | 27% | 61.8 |
| Tank_10_Crab_3_R1 | 64.7% | 46% | 101 bp | 18% | 69.6 |
| Tank_10_Crab_3_R2 | 59.3% | 46% | 101 bp | 18% | 69.6 |
| Tank_11_Crab_1_R1 | 59.8% | 45% | 101 bp | 18% | 55.5 |
| Tank_11_Crab_1_R2 | 55.6% | 45% | 101 bp | 18% | 55.5 |
| Tank_11_Crab_2_R1 | 62.8% | 46% | 101 bp | 18% | 64.1 |
| Tank_11_Crab_2_R2 | 57.6% | 46% | 101 bp | 27% | 64.1 |

Column key (all columns come from FastQC):

- `% Dups`: % duplicate reads
- `% GC`: average % GC content
- `Length`: average sequence length (bp)
- `% Failed`: percentage of modules failed in the FastQC report (includes those not plotted here)
- `M Seqs`: total sequences (millions)


## Step 3: Adapter trimming and quality filtering

