lipingzengGitHub/RNA-seq-Python

RNA-seq Bioinformatics Pipelines

This repository contains two Python-based RNA-seq analysis pipelines:

  • Part 1: From raw FASTQ files to read counts using HISAT2 and HTSeq
  • Part 2: From read counts to differentially expressed genes (DEGs) using DESeq2 via R

Part 1: RNA-seq Reads Count Pipeline

Script: RNA-seq-ReadsCount.py
This pipeline processes paired-end RNA-seq FASTQ files to produce read count matrices using standard bioinformatics tools.

Steps:

  1. Quality Control (FastQC) – Checks raw FASTQ quality
  2. Genome Alignment (HISAT2) – Aligns reads to the reference genome
  3. Read Counting (HTSeq-count) – Counts reads mapped to annotated genes
  4. (Optional) PCA clustering analysis
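The steps above map onto one command per tool. The following is a minimal sketch of how those commands could be assembled for a single sample; it is not the code in RNA-seq-ReadsCount.py. The function name `build_commands`, the `.sorted.bam` naming, the thread count, and the unstranded setting (`-s no`) are assumptions made for illustration.

```python
from pathlib import Path


def build_commands(sample, r1, r2, index_prefix, gtf, out_dir, threads=4):
    """Return the per-sample commands as argument lists, in pipeline order.

    Hypothetical helper: file naming and option choices are assumptions,
    not taken from the repository's script.
    """
    out = Path(out_dir)
    sam = out / "aligned_bam_files" / f"{sample}.sam"
    bam = out / "aligned_bam_files" / f"{sample}.sorted.bam"
    return [
        # 1. Quality control on both mates
        ["fastqc", "-o", str(out / "FastQC_reports"), str(r1), str(r2)],
        # 2. Paired-end alignment against the HISAT2 index prefix
        ["hisat2", "-p", str(threads), "-x", str(index_prefix),
         "-1", str(r1), "-2", str(r2), "-S", str(sam)],
        # Coordinate-sort the alignments, then index the sorted BAM
        ["samtools", "sort", "-@", str(threads), "-o", str(bam), str(sam)],
        ["samtools", "index", str(bam)],
        # 3. Gene-level counting; htseq-count writes its table to stdout.
        #    "-s no" assumes an unstranded library (htseq-count defaults to "yes").
        ["htseq-count", "-f", "bam", "-r", "pos", "-s", "no", str(bam), str(gtf)],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)`; for the htseq-count step, stdout would additionally need to be redirected to a per-sample file.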

Requirements

Conda-installed tools:

conda install -c bioconda fastqc
conda install -c bioconda hisat2
conda install -c bioconda samtools
conda install -c bioconda htseq

Python packages:

pip install pandas scikit-learn matplotlib seaborn

⚠️ If you are on a managed system, use a virtual environment:

python3 -m venv env
source env/bin/activate
pip install pandas scikit-learn matplotlib seaborn

How to Run

python RNA-seq-ReadsCount.py -i Fastq -o hisat-count-dir -g /path/to/hisat2_index_prefix -a /path/to/annotation.gtf

Input Files

• FastQ/: Folder containing paired-end .fq.gz files (e.g., sample1_1.fq.gz, sample1_2.fq.gz)

• HISAT2-indexed reference genome (with .ht2 files)

• Gene annotation file (GTF or GFF3)
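The paired-end naming convention (sample1_1.fq.gz, sample1_2.fq.gz) can be used to discover samples automatically. Below is a small sketch of that idea; `find_pairs` is a hypothetical helper, and the repository's script may pair files differently.

```python
from pathlib import Path


def find_pairs(fastq_dir, suffix1="_1.fq.gz", suffix2="_2.fq.gz"):
    """Map sample name -> (mate 1 path, mate 2 path) for every pair found.

    Assumes the *_1.fq.gz / *_2.fq.gz convention shown in the README example.
    """
    pairs = {}
    for r1 in sorted(Path(fastq_dir).glob(f"*{suffix1}")):
        sample = r1.name[: -len(suffix1)]      # strip the mate suffix
        r2 = r1.with_name(sample + suffix2)    # expected second mate
        if not r2.exists():
            raise FileNotFoundError(
                f"missing mate for {r1.name}: expected {r2.name}"
            )
        pairs[sample] = (r1, r2)
    return pairs
```

Failing early on a missing mate avoids aligning a sample as single-end by accident.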

Output Files

FastQC_reports/: Quality reports per FASTQ

aligned_bam_files/: Aligned, sorted, and indexed BAM files

htseq_counts/: Gene-level read count files

merged_counts_matrix.csv: Combined count matrix

pca_plot.png: Optional PCA clustering plot
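To illustrate how the per-sample files in htseq_counts/ could be combined into a single matrix such as merged_counts_matrix.csv, here is a sketch using only the standard library. It assumes the default two-column, tab-separated htseq-count output and takes the sample name from the file name up to the first dot; both the helper names and that naming rule are assumptions, not the script's actual behavior.

```python
import csv
from pathlib import Path


def read_htseq_counts(path):
    """Read one htseq-count output file into {gene_id: count}."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue                      # skip blank or malformed lines
            gene_id, count = row[0], row[-1]
            if gene_id.startswith("__"):
                continue                      # summary rows, e.g. __no_feature
            counts[gene_id] = int(count)
    return counts


def merge_counts(count_files, out_csv):
    """Write a genes x samples CSV; genes absent from a sample get 0."""
    per_sample = {
        Path(f).name.split(".")[0]: read_htseq_counts(f) for f in count_files
    }
    samples = sorted(per_sample)
    genes = sorted(set().union(*per_sample.values()))
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["gene_id", *samples])
        for gene in genes:
            writer.writerow([gene, *(per_sample[s].get(gene, 0) for s in samples)])
    return samples, genes
```

Dropping the `__`-prefixed rows matters because htseq-count appends them to every file, and they would otherwise appear as spurious genes in the matrix.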

Part 2: Differential Expression Analysis (DEG)

Script: run-DEG-analysis.py

This pipeline performs differential gene expression analysis using DESeq2 via R, then generates summary files and heatmaps.

Requirements

Conda-installed tools:

conda install -c bioconda bioconductor-deseq2
conda install -c conda-forge rpy2
conda install pandas matplotlib seaborn scikit-learn

⚠️ Make sure your R version is compatible with rpy2

How to Run

python run-DEG-analysis.py counts.txt control1,control2,control3 case1_rep1,case1_rep2,case1_rep3 ...
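The command line above takes the counts file first, then one comma-separated sample list per group, with the control group first. A minimal sketch of parsing that layout follows; `parse_args` is a hypothetical name and the real script may handle its arguments differently.

```python
def parse_args(argv):
    """Split `counts.txt ctrl1,ctrl2 caseA1,caseA2 ...` into its parts.

    argv is expected to exclude the script name (i.e. sys.argv[1:]).
    Returns (counts_path, control_samples, list_of_test_groups).
    """
    if len(argv) < 3:
        raise SystemExit(
            "usage: run-DEG-analysis.py counts.txt control_samples test_samples [...]"
        )
    counts_path, *group_args = argv
    # Each remaining argument is one group; replicates are comma-separated.
    groups = [
        [name.strip() for name in arg.split(",") if name.strip()]
        for arg in group_args
    ]
    control, tests = groups[0], groups[1:]
    return counts_path, control, tests
```

For example, `parse_args(["counts.txt", "control1,control2", "case1_rep1,case1_rep2"])` yields the counts path, the control list, and a one-element list of test groups.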

Input Files

• counts.txt: Gene expression matrix (rows = genes, columns = samples), which is generated by RNA-seq-ReadsCount.py

• Sample names must match between the counts.txt header and the sample lists passed to this script

• The first group (group1) is the control; every other group is a test condition compared against group1

• All replicates are comma-separated
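Because sample names must agree between the counts.txt header and the command line, it can help to check them before any analysis starts. The sketch below shows one way to do that; `check_samples` is a hypothetical helper, and it assumes the first header column holds gene IDs.

```python
def check_samples(header, groups):
    """Validate requested sample names against the count-matrix header.

    header: list of column names; the first is assumed to be the gene ID column.
    groups: list of sample-name lists (control first, then test groups).
    Returns the flat list of requested samples, or raises ValueError.
    """
    available = set(header[1:])
    requested = [name for group in groups for name in group]

    missing = [name for name in requested if name not in available]
    if missing:
        raise ValueError(f"samples not found in counts header: {missing}")

    duplicated = sorted({name for name in requested if requested.count(name) > 1})
    if duplicated:
        raise ValueError(f"samples listed more than once: {duplicated}")

    return requested
```

Reporting every missing name at once is more convenient than failing on the first mismatch, since typos in sample lists tend to come in groups.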

Output Files

All results are saved in the DEG_results/ folder:

*_filtered.csv: Filtered DEGs with padj < 0.05 and |log2FC| ≥ 1

*_unfiltered.csv: All DESeq2 results

*_heatmap.png: Heatmap of all filtered DEGs

summary_all_DEG.csv: Combined results from all group comparisons

summary_filtered_DEG.csv: Combined filtered DEGs from all comparisons
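The filter behind the *_filtered.csv files (padj < 0.05 and |log2FC| ≥ 1) can be expressed in a few lines. This sketch assumes rows are dictionaries keyed by the standard DESeq2 result columns `padj` and `log2FoldChange`; the function names are hypothetical. DESeq2 reports NA for some adjusted p-values, so those rows are treated as not significant here.

```python
import math


def is_significant(row, padj_cutoff=0.05, lfc_cutoff=1.0):
    """True if a DESeq2 result row passes padj < cutoff and |log2FC| >= cutoff."""
    try:
        padj = float(row["padj"])
        lfc = float(row["log2FoldChange"])
    except (KeyError, TypeError, ValueError):
        return False                 # missing column or non-numeric value such as "NA"
    if math.isnan(padj) or math.isnan(lfc):
        return False
    return padj < padj_cutoff and abs(lfc) >= lfc_cutoff


def filter_degs(rows, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Keep only the rows that pass both thresholds."""
    return [r for r in rows if is_significant(r, padj_cutoff, lfc_cutoff)]
```

Note that the fold-change test uses the absolute value, so both up- and down-regulated genes are retained.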

Notes

• Supports any number of test groups, each compared independently to the control

• DESeq2 handles normalization and dispersion estimation automatically

• Heatmaps show expression levels of all filtered DEGs
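For readers curious what the automatic normalization involves: DESeq2's default size factors follow a median-of-ratios scheme. The sketch below is a simplified pure-Python illustration of that idea, not DESeq2's implementation and not code from this repository; `size_factors` and its input layout are assumptions.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: {sample_name: [count for gene 0, gene 1, ...]} with every
    sample listing the genes in the same order.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean per gene, using only genes with no zero counts.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            mean_log = sum(math.log(v) for v in values) / len(values)
            log_geo_means.append((g, mean_log))
    if not log_geo_means:
        raise ValueError("every gene has at least one zero count")

    # Size factor = exp(median of log(count) - log(geometric mean)).
    return {
        s: math.exp(median(math.log(counts[s][g]) - lgm for g, lgm in log_geo_means))
        for s in samples
    }
```

Dividing each sample's counts by its size factor puts samples sequenced at different depths on a comparable scale; a sample with twice the depth of another ends up with twice the size factor.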

Suggested Workflow

  1. Run RNA-seq-ReadsCount.py to generate counts.txt
  2. Use run-DEG-analysis.py to analyze DEGs and visualize heatmaps
