# RNA-Seq Pipeline Walkthrough
This notebook guides new users through configuring and running the RNA-seq processing pipeline, which converts public SRA runs into aligned BAM files and CPM-normalized BigWig tracks.

## 1. Prerequisites
Ensure you are working inside the conda environment that provides STAR, Trimmomatic, samtools, deepTools, and the SRA Toolkit.
The repository root should contain `download_data.sh`, `processing_pipeline.sh`, and `launch.sh`.
Before running any pipeline step, set the environment-specific paths in `config/pipeline.env` (Section 2 creates this file from the template if it does not exist yet).

In [None]:
%%bash
# Verify key tools are on PATH
for tool in prefetch fasterq-dump STAR trimmomatic samtools bamCoverage; do
  if command -v "${tool}" >/dev/null 2>&1; then
    printf "FOUND %s -> %s\n" "${tool}" "$(command -v "${tool}")"
  else
    printf "MISSING %s (check your conda env)\n" "${tool}"
  fi
done
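
The same PATH check can be run from a Python cell, which is convenient when later cells need the result as a list. This is a minimal sketch using `shutil.which` from the standard library; the helper name `find_missing_tools` is made up for illustration, and the tool names are copied from the bash cell above.

```python
import shutil

# Tool names copied from the bash check above.
REQUIRED_TOOLS = ["prefetch", "fasterq-dump", "STAR", "trimmomatic", "samtools", "bamCoverage"]


def find_missing_tools(tools):
    """Return the tools that shutil.which cannot locate on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("MISSING:", ", ".join(missing), "(check your conda env)")
else:
    print("All required tools were found on PATH")
```

Like the bash version, this only confirms that each executable is discoverable; it does not check tool versions.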

## 2. Configure the Pipeline
Copy the example environment file and edit the values to match your installation.

In [None]:
%%bash
set -euo pipefail
if [[ ! -f config/pipeline.env ]]; then
  cp config/pipeline.env.example config/pipeline.env
  echo "Created config/pipeline.env from template"
else
  echo "config/pipeline.env already exists"
fi
echo "Current config snippet:"
head -n 20 config/pipeline.env
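
To read the settings from Python rather than eyeballing the `head` output, a small parser is usually enough. The sketch below assumes `config/pipeline.env` consists of plain `KEY=VALUE` lines with optional quotes and `#` comment lines; it does not handle inline comments or shell expansion. The function name `parse_env_text` and the sample keys other than `BASE_DIR` are hypothetical.

```python
def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comment lines."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, value = line.split("=", 1)
        # Drop surrounding single or double quotes from the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


# Hypothetical sample content; only BASE_DIR is a key named in this notebook.
sample = """
# example values
BASE_DIR="results"
EXAMPLE_THREADS=8
"""
settings = parse_env_text(sample)
print(settings["BASE_DIR"])
```

To apply it to the real file, pass `Path("config/pipeline.env").read_text()` instead of the sample string.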

## 3. Inspect Metadata
The pipeline expects a CSV file, `metadata.csv` in the repository root, with at least the columns `CellType`, `RunID`, and `Description`.
Use the cell below to preview the records that will be processed.

In [None]:
import csv
from pathlib import Path
metadata_path = Path('metadata.csv')
if not metadata_path.exists():
    raise FileNotFoundError('metadata.csv not found in repository root')
with metadata_path.open(newline='') as handle:
    reader = csv.DictReader(handle)
    rows = list(reader)
print(f'Total samples: {len(rows)}')
for row in rows[:5]:
    print(f"{row['CellType']}\t{row['RunID']}\t{row['Description']}")
if len(rows) > 5:
    print('...')
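
Beyond previewing rows, it can help to check that every record fills in the required columns. The following sketch validates rows as produced by `csv.DictReader`; the helper `validate_metadata` is hypothetical, and treating a repeated `RunID` as a problem is an assumption rather than something the pipeline is known to enforce. The sample rows are invented placeholders.

```python
import csv
import io
from collections import Counter

REQUIRED_COLUMNS = ("CellType", "RunID", "Description")


def validate_metadata(rows, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems found in the metadata rows."""
    problems = []
    for index, row in enumerate(rows, start=1):  # 1-based data row number
        for column in required:
            if not (row.get(column) or "").strip():
                problems.append(f"row {index}: missing {column}")
    # Assumption: each run should be listed only once.
    run_counts = Counter((row.get("RunID") or "").strip() for row in rows)
    for run_id, count in run_counts.items():
        if run_id and count > 1:
            problems.append(f"RunID {run_id} appears {count} times")
    return problems


# Invented placeholder rows, not real samples.
sample_csv = "CellType,RunID,Description\nTcell,SRR0000001,example\nBcell,SRR0000001,\n"
sample_rows = list(csv.DictReader(io.StringIO(sample_csv)))
for problem in validate_metadata(sample_rows):
    print(problem)
```

Calling `validate_metadata(rows)` on the `rows` list from the preview cell would report the same kinds of problems for the real file.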

## 4. Dry-Run the Download Stage
Run the downloader in dry-run mode to confirm the SRA accessions resolve and to inspect logging without pulling large files.

In [None]:
%%bash
set -euo pipefail
./download_data.sh --dry-run
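
A quick format check can catch typos in `RunID` values before the dry run contacts SRA. SRA run accessions begin with `SRR`, `ERR`, or `DRR` followed by digits; the sketch below tests only that shape and cannot confirm that an accession actually exists. The helper name `malformed_accessions` is hypothetical.

```python
import re

# Run accessions: SRR (NCBI), ERR (ENA), or DRR (DDBJ) followed by digits.
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def malformed_accessions(run_ids):
    """Return the IDs that do not look like SRA run accessions."""
    return [
        run_id for run_id in run_ids
        if not RUN_ACCESSION.fullmatch(run_id.strip())
    ]


# SRR17143399 is the accession used elsewhere in this notebook;
# SRX123 is an experiment-style ID and should be flagged.
print(malformed_accessions(["SRR17143399", "SRX123"]))
```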

## 5. Fetch Missing SRA Files
Remove `--dry-run` once you are ready to download the data. The command creates the `sra_files/` directory (or the path set in the config) and skips runs that already exist.

In [None]:
%%bash
set -euo pipefail
./download_data.sh
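
To see which runs are still missing after a download, the directory can be compared against the metadata. This sketch assumes `.sra` files sit either directly in the download directory or in a per-run subfolder (the layout `prefetch` typically produces); how `download_data.sh` actually arranges files is not shown here, so adjust the candidate paths as needed. The helper `runs_missing_locally` is hypothetical.

```python
from pathlib import Path


def runs_missing_locally(run_ids, sra_dir):
    """Return run IDs with no .sra file under sra_dir (flat or per-run folder)."""
    sra_dir = Path(sra_dir)
    missing = []
    for run_id in run_ids:
        candidates = [
            sra_dir / f"{run_id}.sra",           # assumed flat layout
            sra_dir / run_id / f"{run_id}.sra",  # assumed per-run folder layout
        ]
        if not any(path.is_file() for path in candidates):
            missing.append(run_id)
    return missing


# Uses the default directory name mentioned above; change it if the config differs.
print(runs_missing_locally(["SRR17143399"], "sra_files"))
```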

## 6. Process a Single Sample (Optional Sanity Check)
Start with one accession to validate permissions, reference indices, and overall runtime before scaling to the full cohort.

In [None]:
%%bash
set -euo pipefail
./processing_pipeline.sh --sample SRR17143399
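
Timing the single-sample run gives a rough basis for estimating the full cohort. The sketch below wraps any command with a timer and multiplies by the sample count; it assumes samples are processed one after another and take similar time, which may not hold for runs of very different depth. Both helper names are hypothetical.

```python
import subprocess
import sys
import time


def run_timed(command):
    """Run a command and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    return completed.returncode, time.perf_counter() - start


def estimate_total_seconds(seconds_per_sample, n_samples):
    """Naive serial estimate: every sample takes as long as the first."""
    return seconds_per_sample * n_samples


# Demonstration with a trivial command; for the real check, pass
# ["./processing_pipeline.sh", "--sample", "SRR17143399"] instead.
exit_code, elapsed = run_timed([sys.executable, "-c", "pass"])
print(f"exit code {exit_code}, {elapsed:.2f} s")
```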

## 7. Run the Full Pipeline
After verifying an individual run, execute the combined launcher to download any missing data and process all samples listed in your metadata.

In [None]:
%%bash
set -euo pipefail
./launch.sh

## 8. Inspect Outputs
Results are organized per cell type under the `BASE_DIR` defined in the config (defaults to `results/`). Each folder contains FASTQ, trimmed FASTQ, BAM, BigWig, logs, and temporary data.

In [None]:
%%bash
set -euo pipefail
BASE_DIR=$(grep '^BASE_DIR' config/pipeline.env | cut -d'=' -f2 | tr -d '"')
echo "Listing outputs under: ${BASE_DIR}"
find "${BASE_DIR}" -maxdepth 2 -type f | sort | head -n 20
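
For a more compact overview than a raw file listing, outputs can be counted by file type per cell-type folder. This sketch assumes the layout described above (one top-level folder per cell type under `BASE_DIR`) and groups files by their final suffix, so `reads.fastq.gz` is counted under `.gz`. The helper `count_outputs_by_type` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_outputs_by_type(base_dir):
    """Map each top-level folder under base_dir to a Counter of file suffixes."""
    summary = {}
    for folder in sorted(Path(base_dir).iterdir()):
        if folder.is_dir():
            summary[folder.name] = Counter(
                path.suffix.lower() or "(none)"
                for path in folder.rglob("*")
                if path.is_file()
            )
    return summary


# "results" is the default BASE_DIR mentioned above; skip quietly if absent.
if Path("results").is_dir():
    for cell_type, counts in count_outputs_by_type("results").items():
        print(cell_type, dict(counts))
```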

## 9. Next Steps
- Review `README.md` for additional CLI switches (dry runs, step skipping, sample filters).
- Version your `metadata.csv` and `config/pipeline.env.example` in Git so collaborators can reuse the setup.
- Consider summarizing alignment metrics from `logs/` or generating QC plots in a follow-up notebook.
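
As a starting point for the alignment summary, STAR writes a `Log.final.out` file whose metric lines have the form `name |<tab>value`. The sketch below parses that text into a dictionary; whether this pipeline stores those files under `logs/`, and under which names, is an assumption to verify. The helper `parse_star_final_log` is hypothetical, and the numbers in the sample are invented for illustration.

```python
def parse_star_final_log(text):
    """Parse 'name | value' lines from STAR Log.final.out text into a dict."""
    metrics = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' have no separator
        key, value = line.split("|", 1)
        metrics[key.strip()] = value.strip()
    return metrics


# Invented numbers, shown only to illustrate the file format.
sample_log = (
    "                          Number of input reads |\t1000\n"
    "                                  UNIQUE READS:\n"
    "                        Uniquely mapped reads % |\t90.00%\n"
)
metrics = parse_star_final_log(sample_log)
print(metrics["Uniquely mapped reads %"])
```

Collecting one such dictionary per sample gives a table that is straightforward to plot in the follow-up notebook.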