🧪 Day 2 — Data Retrieval & Inspection (RNA-seq raw data)
🎯 Goal:
Download real RNA-seq data from SRA/GEO, inspect the files, and verify their integrity. The steps are dataset-agnostic; only the accession numbers change.

Source data for the workshop (SRA study SRP422095):
https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP422095&o=acc_s%3Aa
https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=study&acc=SRP422095
https://www.ncbi.nlm.nih.gov/sra/?term=SRP422095

✅ Step 1: Create a project folder. In your JupyterLab Terminal (NOT NOTEBOOK!):

In [None]:
mkdir -p ~/rnaseq_project/raw_data
cd ~/rnaseq_project/raw_data

✅ Step 2: Install Entrez Direct

In [None]:
conda install -c bioconda entrez-direct -y

✅ Step 3: Find SRA run IDs (SRR accessions). If you're using a known dataset like GSE225123, you can get the SRA links from NCBI or ENA. For any dataset, here's a quick way that writes the run table to runinfo.csv:

In [None]:
esearch -db gds -query GSE225123 | elink -target sra | efetch -format runinfo > runinfo.csv

Then either open the file in the JupyterLab file browser (double-click) or run:

In [None]:
head runinfo.csv
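
To pull the run accessions out of runinfo.csv programmatically, something like the following may help. It is a minimal sketch, not part of the workshop scripts: the function name read_runs and the sample rows are made up for illustration, and it assumes the usual runinfo columns Run and LibraryLayout are present.

```python
import csv
import io


def read_runs(handle):
    """Return (run accession, library layout) pairs from an SRA runinfo CSV."""
    runs = []
    for row in csv.DictReader(handle):
        run = (row.get("Run") or "").strip()
        # Skip rows without an accession, and any header row that
        # reappears in the middle of the file.
        if not run or run == "Run":
            continue
        runs.append((run, (row.get("LibraryLayout") or "").strip()))
    return runs


# Made-up sample with only three columns (a real runinfo.csv has many more).
sample_text = (
    "Run,spots,LibraryLayout\n"
    "SRR0000001,1000,PAIRED\n"
    "\n"
    "Run,spots,LibraryLayout\n"
    "SRR0000002,2000,SINGLE\n"
)
demo_runs = read_runs(io.StringIO(sample_text))
print(demo_runs)
```

With the real file, open it with open("runinfo.csv", newline="") and pass the handle to read_runs.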

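✅ Step 4: Download FASTQ files (2 samples for testing).
Use --split-files to write paired-end reads to separate _1 and _2 files. fasterq-dump has no --gzip option (that flag belongs to the older fastq-dump), so compress the files with gzip afterwards.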

In [None]:
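# Download the .sra archives (each one lands in its own SRRxxxx/ folder)
prefetch SRR23426208 SRR23426209
for run in SRR23426208 SRR23426209; do
    # -O writes the FASTQ files next to the .sra, matching the paths used in Step 5
    fasterq-dump "$run" --split-files -O "$run"
    gzip "$run"/*.fastq
done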

✅ Step 5: Inspect files
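In a notebook located in ~/rnaseq_project, run the cell below. The closing "gzip: stdout: Broken pipe" message is harmless: head simply stops reading after ten lines.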

In [14]:
!ls -lh ./raw_data/SRR23426208/SRR23426208_*.fastq.gz
!zcat ./raw_data/SRR23426208/SRR23426208_1.fastq.gz | head

-rw-r--r-- 1 henri henri 271M May 20 19:20 ./raw_data/SRR23426208/SRR23426208_1.fastq.gz
-rw-r--r-- 1 henri henri 329M May 20 19:20 ./raw_data/SRR23426208/SRR23426208_2.fastq.gz
@SRR23426208.1 1 length=76
TAAAGAATGCTGTAAAGTCAGTTAAGTGATAATTAGAGGGAAGATAAAATATTCAATTATATATAGGTATTTATTT
+SRR23426208.1 1 length=76
AAAAAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEAEAEEEEEEEEEEEE/EAEEEAAEEEEEEEEEA
@SRR23426208.2 2 length=76
CATAGTAAATATTAATATTTTATGTCGCTTATGGCGGCAAGATGGGGGTAAAAAATGATTATTAGTGCAAGAAATC
+SRR23426208.2 2 length=76
AAAAAEE6AEAAAE/E/EEEEEEEEEAAAAEEEEEEEE/EEEEE/AEEEE/E/AEEEEEA/EEEEAEAAAEE//<E
@SRR23426208.3 3 length=76
ATTAATATTTTATGTCGCTTATGGCGGCAAGATGGGGGTAAAAAATGATTATTAGTGCAAGAAATCAAATAACAGG

gzip: stdout: Broken pipe


In [None]:
Confirmed:
File sizes look right (hundreds of MBs here; GBs are also normal for deeper runs).
FASTQ format is valid (each read has four lines, and the header line starts with @).
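
The same format check can be automated instead of eyeballing the first ten lines. Below is a minimal sketch under simple assumptions: reads are stored as plain four-line records (no wrapped sequences), and the function name check_fastq and the tiny demo file are made up for illustration.

```python
import gzip
import itertools
import os
import tempfile


def check_fastq(path, max_records=None):
    """Validate four-line FASTQ records in a gzipped file; return how many were checked."""
    count = 0
    with gzip.open(path, "rt") as handle:
        while max_records is None or count < max_records:
            record = [line.rstrip("\r\n") for line in itertools.islice(handle, 4)]
            if not record:
                break  # clean end of file
            if len(record) < 4:
                raise ValueError(f"record {count + 1}: truncated ({len(record)} of 4 lines)")
            header, seq, plus, qual = record
            if not header.startswith("@"):
                raise ValueError(f"record {count + 1}: header does not start with '@'")
            if not plus.startswith("+"):
                raise ValueError(f"record {count + 1}: third line does not start with '+'")
            if len(seq) != len(qual):
                raise ValueError(f"record {count + 1}: sequence and quality lengths differ")
            count += 1
    return count


# Demo on a made-up two-read file written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fastq.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\nEEEE\n")
    demo_count = check_fastq(demo_path)

print(demo_count)
```

On the real data, check_fastq("./raw_data/SRR23426208/SRR23426208_1.fastq.gz", max_records=100000) spot-checks the first 100,000 reads. Running it without max_records on both the _1 and _2 files and comparing the counts also confirms that the mates are in sync.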

✅ Step 6: Run initial quality check
In a notebook (in a terminal, drop the leading !):

In [None]:
!fastqc ./raw_data/SRR23426208/SRR23426208_1.fastq.gz ./raw_data/SRR23426208/SRR23426208_2.fastq.gz

In [None]:
This creates .html reports, which you can open in a browser or in JupyterLab.
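
FastQC normally also writes a _fastqc.zip archive next to each report, and the summary.txt inside it lists one PASS/WARN/FAIL verdict per module. A minimal sketch for reading that summary without opening the HTML follows; the function name read_fastqc_summary and the in-memory demo archive are made up for illustration.

```python
import io
import zipfile


def read_fastqc_summary(zip_source):
    """Return {module name: PASS/WARN/FAIL} from a FastQC result archive."""
    results = {}
    with zipfile.ZipFile(zip_source) as archive:
        # The archive holds a single folder containing summary.txt.
        name = next(n for n in archive.namelist() if n.endswith("/summary.txt"))
        for line in archive.read(name).decode().splitlines():
            if not line.strip():
                continue
            # Each line is: status <tab> module name <tab> input file name
            status, module, _filename = line.split("\t", 2)
            results[module] = status
    return results


# Demo on a made-up archive built in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "demo_fastqc/summary.txt",
        "PASS\tBasic Statistics\tdemo.fastq.gz\n"
        "WARN\tPer base sequence content\tdemo.fastq.gz\n",
    )
buffer.seek(0)
demo_summary = read_fastqc_summary(buffer)
print(demo_summary)
```

The function also accepts a file path, so it can be pointed at a real _fastqc.zip once FastQC has run.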


##############################---------##############################---------##############################---------

Bonus material/tools:

1. Run download_ALL_runs.sh to download the SRA data for every run listed in runinfo.csv:

In [None]:
bash download_ALL_runs.sh

Folder/file structure after downloads:
    
rnaseq_project/
├── xxx.sh
├── raw_data/
│   ├── somefolder1/
│   │   └── SRRxxxxxx.sra
│   ├── somefolder2/
│   │   └── SRRyyyyyy.sra
│   └── ...
└── your_notebooks_here.ipynb

2. To get an overview of all the .sra and .fastq.gz files, including those in subfolders such as raw_data:

In [None]:
!find . -type f -name "*.sra"
!find . -type f -name "*.fastq.gz"
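
A per-folder count is sometimes easier to read than two long file lists, for example to see which runs still need extracting. The following is a minimal sketch, assuming each run sits in its own folder as in the tree above; the function name overview and the demo folders runA and runB are made up for illustration.

```python
import tempfile
from pathlib import Path


def overview(root):
    """Map each folder under root to the number of .sra and .fastq.gz files it holds."""
    table = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.name.endswith(".sra"):
            kind = "sra"
        elif path.name.endswith(".fastq.gz"):
            kind = "fastq.gz"
        else:
            continue
        counts = table.setdefault(path.parent.name, {"sra": 0, "fastq.gz": 0})
        counts[kind] += 1
    return table


# Demo on a made-up layout: runA is extracted, runB is only downloaded.
with tempfile.TemporaryDirectory() as tmp:
    for rel in [
        "raw_data/runA/runA.sra",
        "raw_data/runA/runA_1.fastq.gz",
        "raw_data/runA/runA_2.fastq.gz",
        "raw_data/runB/runB.sra",
    ]:
        target = Path(tmp, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    demo_overview = overview(tmp)

for folder, counts in demo_overview.items():
    print(folder, counts)
```

In the project folder, overview(".") gives the same kind of table for the real downloads; a folder with an .sra file but no .fastq.gz files has not been extracted yet.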

3. Run extract_all_fastq.sh to run fasterq-dump and gzip compression on every .sra file in the subdirectories:

In [None]:
bash extract_all_fastq.sh

4. Run run_fastqc.sh to run FastQC on every *.fastq or *.fastq.gz file found in the subdirectories:

In [None]:
bash run_fastqc.sh

In [24]:
##############################---------##############################---------##############################---------
# Final notes:
# - Downloaded all runs using download_ALL_runs.sh, extracted FASTQ files for the first 2 runs, and ran FastQC on the resulting four FASTQ files
!find . -type f -name "*.sra"
!find . -type f -name "*.fastq.gz"
!find . -type f -name "*fastqc.html"

./raw_data/SRR23426214/SRR23426214.sra
./raw_data/SRR23426208/SRR23426208.sra
./raw_data/SRR23426215/SRR23426215.sra
./raw_data/SRR23426220/SRR23426220.sra
./raw_data/SRR23426213/SRR23426213.sra
./raw_data/SRR23426219/SRR23426219.sra
./raw_data/SRR23426209/SRR23426209.sra
./raw_data/SRR23426221/SRR23426221.sra
./raw_data/SRR23426218/SRR23426218.sra
./raw_data/SRR23426211/SRR23426211.sra
./raw_data/SRR23426212/SRR23426212.sra
./raw_data/SRR23426222/SRR23426222.sra
./raw_data/SRR23426210/SRR23426210.sra
./raw_data/SRR23426216/SRR23426216.sra
./raw_data/SRR23426217/SRR23426217.sra
./raw_data/SRR23426208/SRR23426208_2.fastq.gz
./raw_data/SRR23426208/SRR23426208_1.fastq.gz
./raw_data/SRR23426209/SRR23426209_1.fastq.gz
./raw_data/SRR23426209/SRR23426209_2.fastq.gz
./raw_data/SRR23426208/SRR23426208_1_fastqc.html
./raw_data/SRR23426208/SRR23426208_2_fastqc.html
./raw_data/SRR23426209/SRR23426209_1_fastqc.html
./raw_data/SRR23426209/SRR23426209_2_fastqc.html
