This repository contains two Python-based RNA-seq analysis pipelines:
- Part 1: From raw FASTQ files to read counts using HISAT2 and HTSeq
- Part 2: From read counts to differentially expressed genes (DEGs) using DESeq2 via R
Script: RNA-seq-ReadsCount.py
This pipeline processes paired-end RNA-seq FASTQ files to produce read count matrices using standard bioinformatics tools.
- Quality Control (FastQC) – Checks raw FASTQ quality
- Genome Alignment (HISAT2) – Aligns reads to the reference genome
- Read Counting (HTSeq-count) – Counts reads mapped to annotated genes
- (Optional) PCA clustering analysis
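The steps above can be sketched as a list of external commands for one sample. This is a minimal illustration, not the code in RNA-seq-ReadsCount.py: the function name `build_commands`, the `.sam` and `.sorted.bam` file names, and the `-s no` (unstranded) setting for htseq-count are assumptions chosen for the example.

```python
import shlex
from pathlib import Path


def build_commands(sample, r1, r2, index_prefix, gtf, out_dir):
    """Return the external commands for one paired-end sample, in run order."""
    out = Path(out_dir)
    # Hypothetical intermediate file names; the real script may name them differently.
    sam = out / "aligned_bam_files" / f"{sample}.sam"
    bam = out / "aligned_bam_files" / f"{sample}.sorted.bam"
    return [
        # 1. Quality control on both mates
        ["fastqc", "-o", str(out / "FastQC_reports"), str(r1), str(r2)],
        # 2. Align the read pair against the HISAT2 index
        ["hisat2", "-x", str(index_prefix), "-1", str(r1), "-2", str(r2), "-S", str(sam)],
        # 3. Coordinate-sort the alignments, then index the BAM
        ["samtools", "sort", "-o", str(bam), str(sam)],
        ["samtools", "index", str(bam)],
        # 4. Count reads per gene (htseq-count writes the table to stdout).
        #    "-s no" assumes an unstranded library.
        ["htseq-count", "-f", "bam", "-r", "pos", "-s", "no", str(bam), str(gtf)],
    ]


# Print the commands instead of running them, so the sketch needs no tools installed.
for cmd in build_commands(
    "sample1",
    "Fastq/sample1_1.fq.gz",
    "Fastq/sample1_2.fq.gz",
    "/path/to/hisat2_index_prefix",
    "/path/to/annotation.gtf",
    "hisat-count-dir",
):
    print(shlex.join(cmd))
```

To execute a command rather than print it, each list could be passed to `subprocess.run(cmd, check=True)`, with the htseq-count output redirected to a file.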
conda install -c bioconda fastqc
conda install -c bioconda hisat2
conda install -c bioconda samtools
conda install -c bioconda htseq
pip install pandas scikit-learn matplotlib seaborn
python3 -m venv env
source env/bin/activate
pip install pandas scikit-learn matplotlib seaborn
python RNA-seq-ReadsCount.py -i Fastq -o hisat-count-dir -g /path/to/hisat2_index_prefix -a /path/to/annotation.gtf
- FastQ/: Folder containing paired-end .fq.gz files (e.g., sample1_1.fq.gz, sample1_2.fq.gz)
- HISAT2-indexed reference genome (with .ht2 files)
- Gene annotation file (GTF or GFF3)
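The paired-end naming convention (sample1_1.fq.gz, sample1_2.fq.gz) implies that mates can be grouped by their shared prefix. The sketch below shows one way to do that; `pair_fastq` is a hypothetical helper, and whether the actual script skips unpaired files the same way is an assumption.

```python
from pathlib import Path


def pair_fastq(filenames):
    """Group names like sample1_1.fq.gz / sample1_2.fq.gz by sample name."""
    mates = {}
    for name in filenames:
        base = Path(name).name
        for mate, suffix in ((0, "_1.fq.gz"), (1, "_2.fq.gz")):
            if base.endswith(suffix):
                sample = base[: -len(suffix)]
                mates.setdefault(sample, [None, None])[mate] = name
    # Keep only samples for which both mates were found.
    return {s: tuple(p) for s, p in sorted(mates.items()) if None not in p}


files = [
    "Fastq/sample1_1.fq.gz",
    "Fastq/sample1_2.fq.gz",
    "Fastq/sample2_1.fq.gz",  # mate 2 missing, so this sample is skipped
]
print(pair_fastq(files))
```

In practice the file list could come from `Path("Fastq").glob("*.fq.gz")`.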
FastQC_reports/: Quality reports per FASTQ
aligned_bam_files/: Aligned, sorted, and indexed BAM files
htseq_counts/: Gene-level read count files
merged_counts_matrix.csv: Combined count matrix
pca_plot.png: Optional PCA clustering plot
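For orientation, the per-sample htseq-count files can be combined into one matrix roughly as follows. htseq-count writes two tab-separated columns (gene ID, count) followed by summary rows whose names start with `__` (for example `__no_feature`). The function names and the exact CSV layout here are assumptions; the real merged_counts_matrix.csv may differ.

```python
import csv
import io


def parse_htseq(text):
    """Parse htseq-count output into {gene_id: count}, skipping '__' summary rows."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        gene, value = line.split("\t")
        if gene.startswith("__"):
            continue  # e.g. __no_feature, __ambiguous
        counts[gene] = int(value)
    return counts


def merge_counts(per_sample):
    """Combine {sample: {gene: count}} into CSV text with genes as rows."""
    samples = list(per_sample)
    genes = sorted(set().union(*(c.keys() for c in per_sample.values())))
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["gene_id"] + samples)
    for gene in genes:
        # A gene absent from one sample's file is written as 0.
        writer.writerow([gene] + [per_sample[s].get(gene, 0) for s in samples])
    return buf.getvalue()


s1 = "geneA\t10\ngeneB\t0\n__no_feature\t5\n"
s2 = "geneA\t7\ngeneB\t3\n__no_feature\t2\n"
print(merge_counts({"sample1": parse_htseq(s1), "sample2": parse_htseq(s2)}))
```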
Script: run-DEG-analysis.py
This pipeline performs differential gene expression analysis using DESeq2 via R, then generates summary files and heatmaps.
conda install -c bioconda bioconductor-deseq2
conda install -c conda-forge rpy2
conda install pandas matplotlib seaborn scikit-learn
python run-DEG-analysis.py counts.txt control1,control2,control3 case1_rep1,case1_rep2,case1_rep3 ...
• counts.txt: Gene expression matrix (rows = genes, columns = samples), which is generated by RNA-seq-ReadsCount.py
• Sample names must match between the counts.txt header and the sample lists passed to this script
• The first group (group1) is the control; every other group is a test condition compared against group1
• All replicates are comma-separated
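The argument convention above (counts file first, then the control group, then one or more test groups, each a comma-separated list of replicates) could be parsed along these lines. Both function names are hypothetical and the error handling is an assumption, not a description of run-DEG-analysis.py.

```python
def parse_groups(argv):
    """Split [counts_file, group1, group2, ...] into file, control, and test groups."""
    if len(argv) < 3:
        raise ValueError("need a counts file, a control group, and at least one test group")
    counts_file, *group_args = argv
    groups = [[name.strip() for name in arg.split(",") if name.strip()] for arg in group_args]
    # The first group is the control; the rest are test conditions.
    return counts_file, groups[0], groups[1:]


def missing_samples(header, groups):
    """Return sample names that do not appear in the counts header."""
    known = set(header)
    return [s for group in groups for s in group if s not in known]


counts_file, control, tests = parse_groups(
    ["counts.txt", "control1,control2,control3", "case1_rep1,case1_rep2,case1_rep3"]
)
header = ["gene_id", "control1", "control2", "control3", "case1_rep1", "case1_rep2"]
print(missing_samples(header, [control] + tests))  # names absent from the header
```

Checking the names before calling DESeq2 gives a clearer error than a failed column lookup later on.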
All results are saved in the DEG_results/ folder:
*_filtered.csv: Filtered DEGs with padj < 0.05 and |log2FC| ≥ 1
*_unfiltered.csv: All DESeq2 results
*_heatmap.png: Heatmap of all filtered DEGs
summary_all_DEG.csv: Combined results from all group comparisons
summary_filtered_DEG.csv: Combined filtered DEGs from all comparisons
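The filtering rule (padj < 0.05 and |log2FC| ≥ 1) can be expressed as a small predicate. The sketch assumes rows are dictionaries keyed by the standard DESeq2 result columns `log2FoldChange` and `padj`, and that genes with a missing (NA) adjusted p-value are excluded; `is_significant` is a hypothetical name.

```python
import math


def is_significant(row, padj_max=0.05, lfc_min=1.0):
    """True if the row passes padj < padj_max and |log2FoldChange| >= lfc_min."""
    padj = row.get("padj")
    lfc = row.get("log2FoldChange")
    if padj is None or lfc is None:
        return False
    if math.isnan(padj) or math.isnan(lfc):
        return False  # DESeq2 can report NA for padj; treat those as not significant
    return padj < padj_max and abs(lfc) >= lfc_min


rows = [
    {"gene": "geneA", "log2FoldChange": 2.3, "padj": 0.001},
    {"gene": "geneB", "log2FoldChange": -1.0, "padj": 0.049},
    {"gene": "geneC", "log2FoldChange": 0.4, "padj": 0.0001},
    {"gene": "geneD", "log2FoldChange": 3.0, "padj": float("nan")},
]
print([r["gene"] for r in rows if is_significant(r)])
```

Note that the fold-change test uses the absolute value, so strongly down-regulated genes (negative log2FC) are kept as well.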
- Supports any number of test groups, each compared independently to the control
- DESeq2 handles normalization and dispersion estimation automatically
- Heatmaps show expression levels of all filtered DEGs
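As background on the normalization step: DESeq2's default size factors are generally described as a median-of-ratios estimate. The pure-Python sketch below illustrates that idea on toy data. It is not DESeq2's implementation, `size_factors` is a hypothetical name, and it assumes at least one gene has nonzero counts in every sample.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: {sample: [count per gene]}, genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Log geometric mean per gene, using only genes with no zero counts.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append((g, sum(math.log(v) for v in values) / len(values)))
    # Each sample's factor is the median ratio of its counts to the gene-wise means.
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - lgm for g, lgm in log_geo_means]
        factors[s] = math.exp(statistics.median(log_ratios))
    return factors


# Toy data: sample "b" was sequenced twice as deeply as sample "a".
toy = {"a": [10, 20, 30, 0], "b": [20, 40, 60, 5]}
print(size_factors(toy))
```

Dividing each sample's counts by its size factor puts samples of different sequencing depth on a comparable scale.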
- Run RNA-seq-ReadsCount.py to generate counts.txt
- Use run-DEG-analysis.py to analyze DEGs and visualize heatmaps