<div align="left">
    <strong>RNA Seq Data Analysis Pipeline using Jupyter and Colab Notebook</strong>
</div>

![alt text](https://pydeseq2.readthedocs.io/en/latest/_static/pydeseq2_logo.svg)
![alt text](https://jupyter.org/assets/logos/rectanglelogo-greytext-orangebody-greymoons.svg)
![alt text](https://www.python.org/static/img/python-logo.png)
![alt text](https://colab.google/static/images/icons/colab.png)

</p> <a class="btn" href="#home" style="display: inline-block; margin-top: 10px; text-decoration: none; color: #ffffff; background-color: #4CAF85; padding: 8px 15px; border-radius: 4px;"> 
<b>Table of Contents</b>
    
- .........................................................................................................................................
    
### Part: 1
- Import Libraries
- Working Directory Info.
- Sample information
- Download Dataset From SRA
- Converting Files: SRA to FASTQ
- Initial Quality Check
- Adaptor Trim and Trim Quality Check
- Download Ref Genome Indexing and Read Alignment
- SAM to BAM Conversion sorted and indexed
- Deduplication: Use Picard to mark and remove duplicate reads
- Final Quality Check:  Qualimap, Samtools Stats, and MultiQC
- Reads Quantification using HTseq
- Reads Metrics integration with metadata

### Part: 2
- Normalization:  Filter lowly expressed genes and Normalized Dataset Creation
- Visualization: PCA, Volcano and Heatmap Plots
- Result Interpretation: Upregulated and Downregulated Genes
- Reference

<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/x18pG8B2/image.webp');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>Import Libraries</b>
</h1>

- Jupyter Working Directory Path: `/home/mahendra/Desktop/Python/Project/`
- If You Want to Run This Jupyter Notebook on Google Colab You Can Set the Following Path: `/content/Python/Project/`

In [5]:
# Step 1: System Updates
!pip install --upgrade pip
!apt-get update && apt-get upgrade -y                                    # Update and upgrade package list

# Step 2: Install Key Tools and Libraries
!pip install pysradb --upgrade                                           # Install pysradb (for SRA database access)
!apt-get install fastqc --upgrade -y                                     # Install FastQC (for quality control)
!pip install multiqc                                                     # Install MultiQC (for quality control report generation)
!apt-get update && apt-get install -y sra-toolkit                        # Install SRA Toolkit (for downloading data from SRA)
!pip install cutadapt                                                    # Install cutadapt (for adapter trimming)
!pip install samtools                                                    # Install samtools (for BAM/SAM file manipulation)
!pip install picard                                                      # Install picard (for working with BAM files, deduplication)
!pip install biopython                                                   # Install Biopython (for biological computations)
!apt-get update && apt-get install -y bwa                                # Install BWA (for sequence alignment)
!apt-get install -y default-jdk                                          # Install JDK (for running Java-based tools like Picard)

# Step 3: Download Picard JAR File
!wget https://github.com/broadinstitute/picard/releases/download/2.27.4/picard.jar  
!chmod +x /home/mahendra/Desktop/Python/Project/picard.jar               # Make picard.jar executable
# Optional: Install Picard via Bioconda # Install picard and qualimap using bioconda
!conda install -c conda-forge -c bioconda openjdk=8 picard -y
!conda install -c bioconda qualimap -y

# Step 4: Install Additional Python Libraries for RNA-Seq Analysis
!pip install pandas                                                      # Install pandas (for data manipulation)
!pip install numpy                                                       # Install numpy (for numerical computations)
!pip install matplotlib                                                  # Install matplotlib (for visualization)
!pip install seaborn                                                     # Install seaborn (for advanced data visualization)
!pip install scipy                                                       # Install scipy (for statistical analysis)
!pip install pysam                                                       # Install pysam (for BAM/SAM file reading and writing)
!pip install gffutils                                                    # Install gffutils (for working with GFF files)
!pip install rpy2                                                        # Install rpy2 (for interfacing R with Python, useful for DESeq2 and edgeR)
!pip install pydeseq2                                                    # Install pyDESeq2
!pip install bioinfokit                                                  # Install bioinfokit (for statistical analysis)
!pip install jupyterlab                                                  # Optional: Install JupyterLab (if running locally)
!pip install scikit-learn                                                # Install scikit-learn (for machine learning algorithms)
!pip install statsmodels                                                 # Install statsmodels (for statistical modeling)
!pip install plotly                                                      # Install Plotly (for interactive plots)
!pip install tqdm                                                        # Install tqdm (for progress bars)

Reading package lists... Done
E: Could not open lock file /var/lib/apt/lists/lock - open (13: Permission denied)
E: Unable to lock directory /var/lib/apt/lists/
W: Problem unlinking the file /var/cache/apt/pkgcache.bin - RemoveCaches (13: Permission denied)
W: Problem unlinking the file /var/cache/apt/srcpkgcache.bin - RemoveCaches (13: Permission denied)
E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
Reading package lists... Done
E: Could not open lock file /var/lib/apt/lists/lock - open (13: Permission denied)
E: Unable to lock directory /var/lib/apt/lists/
W: Problem unlinking the file /var/cache/apt/pkgcache.bin - RemoveCaches (13: Permission denied)
W: Problem unlinking the file /var/cache/apt/srcpkgcache.bin - RemoveCaches (13: Permission denied)
[31mERROR: Could not find a version that satisfies the requirement samtools (from versions: none)[0m[31m


In [7]:
# Importing installed libraries and Python packages
try:
  import pandas as pd                                                     # For data manipulation
  import pysradb                                                          # For accessing and managing SRA data
  import subprocess                                                       # For executing system commands from Python
  import os                                                               # For file management
  from tqdm import tqdm                                                   # For creating progress bars
  import numpy as np                                                      # For numerical computations
  import scipy.stats as stats                                             # For statistical analysis
  import numpy as np                                                      # For numerical computations
  import matplotlib.pyplot as plt                                         # For plotting
  import seaborn as sns                                                   # For creating advanced visualizations
  import pysam                                                            # For working with BAM and SAM files
  import gffutils                                                         # For handling GFF files
  import pydeseq2                                                         # For differential gene expression analysis
  from sklearn.preprocessing import StandardScaler                        # For preprocessing the data
  from pydeseq2.dds import DeseqDataSet                                   # For Read counts modeling with the DeseqDataSet class
  from pydeseq2.ds import DeseqStats                                      # Statistical analysis with the DeseqStats class
  from pydeseq2.default_inference import DefaultInference  
  from IPython.display import IFrame                                      # For Reading the web frame
  import warnings                                                         # For hiding the warning text
  from sklearn.exceptions import ConvergenceWarning
  warnings.filterwarnings("ignore", category=ConvergenceWarning)
  print("All Python packages imported successfully.")
except ImportError as e:
  print(f"Import failed: {e}")                                            # Print error message if import fails

All Python packages imported successfully.


<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/K87ByXmr/stage5.jpg');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>Sample information</b>
</h1>

In [11]:
# Create the data frame for sample information
data = {
    'Sample': [
        "Alzheimer's whole brain",
        "Normal brain, temporal lobe",
        "Alzheimer's brain, temporal lobe",
        "Normal brain, frontal lobe",
        "Alzheimer's brain, frontal lobe",
        "Normal whole brain"
    ],
    'SRR_ID': [
        "SRR087416", "SRR085471", "SRR085473", "SRR085474", "SRR085726", "SRR085725"
    ],
    'Spots': [
        14720816, 15256752, 14227702, 15772947, 15228832, 13442077
    ],
    'Bases': [
        "529.9M", "549.2M", "498M", "552.1M", "533M", "483.9M"
    ],
    'Size (Mb)': [
        362.1, 372.8, 350.2, 391.0, 377.3, 324.0
    ],
    'Published': [
        "2011-01-05", "2011-01-05", "2011-01-05", "2011-01-05", "2011-01-05", "2011-01-05"
    ],
    'Instrument': [
        "Illumina Genome Analyzer II",
        "Illumina Genome Analyzer II",
        "Illumina Genome Analyzer II",
        "Illumina Genome Analyzer II",
        "Illumina Genome Analyzer II",
        "Illumina Genome Analyzer II"
    ],
    'Strategy': [
        "WGS", "WGS", "WGS", "WGS", "WGS", "WGS"
    ],
    'Source': [
        "TRANSCRIPTOMIC", "TRANSCRIPTOMIC", "TRANSCRIPTOMIC", "TRANSCRIPTOMIC", "TRANSCRIPTOMIC", "TRANSCRIPTOMIC"
    ],
    'Layout': [
        "SINGLE", "SINGLE", "SINGLE", "SINGLE", "SINGLE", "SINGLE"
    ]
}

df = pd.DataFrame(data)
# Display the DataFrame
df.head(6)

Unnamed: 0,Sample,SRR_ID,Spots,Bases,Size (Mb),Published,Instrument,Strategy,Source,Layout
0,Alzheimer's whole brain,SRR087416,14720816,529.9M,362.1,2011-01-05,Illumina Genome Analyzer II,WGS,TRANSCRIPTOMIC,SINGLE
1,"Normal brain, temporal lobe",SRR085471,15256752,549.2M,372.8,2011-01-05,Illumina Genome Analyzer II,WGS,TRANSCRIPTOMIC,SINGLE
2,"Alzheimer's brain, temporal lobe",SRR085473,14227702,498M,350.2,2011-01-05,Illumina Genome Analyzer II,WGS,TRANSCRIPTOMIC,SINGLE
3,"Normal brain, frontal lobe",SRR085474,15772947,552.1M,391.0,2011-01-05,Illumina Genome Analyzer II,WGS,TRANSCRIPTOMIC,SINGLE
4,"Alzheimer's brain, frontal lobe",SRR085726,15228832,533M,377.3,2011-01-05,Illumina Genome Analyzer II,WGS,TRANSCRIPTOMIC,SINGLE
5,Normal whole brain,SRR085725,13442077,483.9M,324.0,2011-01-05,Illumina Genome Analyzer II,WGS,TRANSCRIPTOMIC,SINGLE


<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/K87ByXmr/stage5.jpg');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>Download Dataset</b>
</h1>

In [None]:
# NOTE: Please Say "Y" to Download the Dataset or Run this code on Google Colab.........

sra = pysradb.SRAweb()                                            # Initialize SRAweb object
study_accession = "SRA027308"                                     # Define the study accession

                                                                  # Get metadata for the study accession
try:
    metadata_df = sra.sra_metadata(study_accession)
    run_accessions = metadata_df['run_accession'].tolist()        # Access run_accession column
except Exception as e:
    print(f"Error retrieving metadata for {study_accession}: {e}")
    run_accessions = []

                                                                  # Check if run_accessions is empty and handle it
if not run_accessions:
    print("No run accessions found for the study. Skipping download.")
else:
                                                                  # Download the SRA files using run accessions
    sra.download(run_accessions, out_dir='sra_downloads/')
    print(f"Downloaded SRA files for study {study_accession} to 'sra_downloads/' directory.")

<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/K87ByXmr/stage5.jpg');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>Converting the .SRA files into .fastq format</b>
</h1>

In [7]:
# Define a list of SRA files with path
sra_files = [
    "/home/mahendra/Desktop/Python/Project/sra_downloads/SRP004879/SRX035170/SRR085725.sra",
    "/home/mahendra/Desktop/Python/Project/sra_downloads/SRP004879/SRX035760/SRR087416.sra",
    "/home/mahendra/Desktop/Python/Project/sra_downloads/SRP004879/SRX034874/SRR085471.sra",
    "/home/mahendra/Desktop/Python/Project/sra_downloads/SRP004879/SRX035166/SRR085473.sra",
    "/home/mahendra/Desktop/Python/Project/sra_downloads/SRP004879/SRX035167/SRR085474.sra",
    "/home/mahendra/Desktop/Python/Project/sra_downloads/SRP004879/SRX035171/SRR085726.sra"
]

# Create a FastQ folder if it doesn't exist
output_dir = "FastQ"
os.makedirs(output_dir, exist_ok=True)

# Loop through and run fasterq-dump for each file
for sra_file in sra_files:
  cmd = ["fasterq-dump", "--split-files", sra_file]
  subprocess.run(cmd)
# Move the output FASTQ files to the FastQ folder
  fastq_files = [f for f in os.listdir('.') if f.endswith('.fastq')]
  for fastq in fastq_files:
    subprocess.run(["mv", fastq, os.path.join(output_dir, fastq)])
# Compress the moved FASTQ files
    subprocess.run(["gzip", os.path.join(output_dir, fastq)])

spots read      : 13,442,077
reads read      : 13,442,077
reads written   : 13,442,077
spots read      : 14,720,816
reads read      : 14,720,816
reads written   : 14,720,816
spots read      : 15,256,752
reads read      : 15,256,752
reads written   : 15,256,752
spots read      : 14,227,702
reads read      : 14,227,702
reads written   : 14,227,702
spots read      : 15,772,947
reads read      : 15,772,947
reads written   : 15,772,947
spots read      : 15,228,832
reads read      : 15,228,832
reads written   : 15,228,832


<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/K87ByXmr/stage5.jpg');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>Initial Quality Check</b>
</h1>

In [17]:
# List of FastQ files (with path)
fastq_files = [os.path.join('FastQ', f) for f in ['SRR087416.fastq.gz', 'SRR085725.fastq.gz','SRR085471.fastq.gz','SRR085473.fastq.gz','SRR085474.fastq.gz', 'SRR085726.fastq.gz']]

# Directory to store FastQC results
output_dir_fastqc = 'fastqc_output'
os.makedirs(output_dir_fastqc, exist_ok=True)

# Run FastQC on all files
# Define parameters for threads and kmers
threads = 4   # Set the number of threads, e.g., 4 (adjust as needed based on available memory)
kmers = 10    # Set the kmer length, e.g., 7 (ensure it's between 2 and 10)

# Run FastQC on all files with specified threads and kmer length
for fastq_file in fastq_files:
    subprocess.run(['fastqc', fastq_file, '-o', output_dir_fastqc, '-t', str(threads), '-k', str(kmers)])


# After FastQC, run MultiQC to aggregate the results
output_dir_multiqc = 'multiqc_output'
os.makedirs(output_dir_multiqc, exist_ok=True)
subprocess.run(['multiqc', output_dir_fastqc, '-o', output_dir_multiqc])

application/gzip


Started analysis of SRR087416.fastq.gz
Approx 5% complete for SRR087416.fastq.gz
Approx 10% complete for SRR087416.fastq.gz
Approx 15% complete for SRR087416.fastq.gz
Approx 20% complete for SRR087416.fastq.gz
Approx 25% complete for SRR087416.fastq.gz
Approx 30% complete for SRR087416.fastq.gz
Approx 35% complete for SRR087416.fastq.gz
Approx 40% complete for SRR087416.fastq.gz
Approx 45% complete for SRR087416.fastq.gz
Approx 50% complete for SRR087416.fastq.gz
Approx 55% complete for SRR087416.fastq.gz
Approx 60% complete for SRR087416.fastq.gz
Approx 65% complete for SRR087416.fastq.gz
Approx 70% complete for SRR087416.fastq.gz
Approx 75% complete for SRR087416.fastq.gz
Approx 80% complete for SRR087416.fastq.gz
Approx 85% complete for SRR087416.fastq.gz
Approx 90% complete for SRR087416.fastq.gz
Approx 95% complete for SRR087416.fastq.gz


Analysis complete for SRR087416.fastq.gz
application/gzip


Started analysis of SRR085725.fastq.gz
Approx 5% complete for SRR085725.fastq.gz
Approx 10% complete for SRR085725.fastq.gz
Approx 15% complete for SRR085725.fastq.gz
Approx 20% complete for SRR085725.fastq.gz
Approx 25% complete for SRR085725.fastq.gz
Approx 30% complete for SRR085725.fastq.gz
Approx 35% complete for SRR085725.fastq.gz
Approx 40% complete for SRR085725.fastq.gz
Approx 45% complete for SRR085725.fastq.gz
Approx 50% complete for SRR085725.fastq.gz
Approx 55% complete for SRR085725.fastq.gz
Approx 60% complete for SRR085725.fastq.gz
Approx 65% complete for SRR085725.fastq.gz
Approx 70% complete for SRR085725.fastq.gz
Approx 75% complete for SRR085725.fastq.gz
Approx 80% complete for SRR085725.fastq.gz
Approx 85% complete for SRR085725.fastq.gz
Approx 90% complete for SRR085725.fastq.gz
Approx 95% complete for SRR085725.fastq.gz


Analysis complete for SRR085725.fastq.gz
application/gzip


Started analysis of SRR085471.fastq.gz
Approx 5% complete for SRR085471.fastq.gz
Approx 10% complete for SRR085471.fastq.gz
Approx 15% complete for SRR085471.fastq.gz
Approx 20% complete for SRR085471.fastq.gz
Approx 25% complete for SRR085471.fastq.gz
Approx 30% complete for SRR085471.fastq.gz
Approx 35% complete for SRR085471.fastq.gz
Approx 40% complete for SRR085471.fastq.gz
Approx 45% complete for SRR085471.fastq.gz
Approx 50% complete for SRR085471.fastq.gz
Approx 55% complete for SRR085471.fastq.gz
Approx 60% complete for SRR085471.fastq.gz
Approx 65% complete for SRR085471.fastq.gz
Approx 70% complete for SRR085471.fastq.gz
Approx 75% complete for SRR085471.fastq.gz
Approx 80% complete for SRR085471.fastq.gz
Approx 85% complete for SRR085471.fastq.gz
Approx 90% complete for SRR085471.fastq.gz
Approx 95% complete for SRR085471.fastq.gz


Analysis complete for SRR085471.fastq.gz
application/gzip


Started analysis of SRR085473.fastq.gz
Approx 5% complete for SRR085473.fastq.gz
Approx 10% complete for SRR085473.fastq.gz
Approx 15% complete for SRR085473.fastq.gz
Approx 20% complete for SRR085473.fastq.gz
Approx 25% complete for SRR085473.fastq.gz
Approx 30% complete for SRR085473.fastq.gz
Approx 35% complete for SRR085473.fastq.gz
Approx 40% complete for SRR085473.fastq.gz
Approx 45% complete for SRR085473.fastq.gz
Approx 50% complete for SRR085473.fastq.gz
Approx 55% complete for SRR085473.fastq.gz
Approx 60% complete for SRR085473.fastq.gz
Approx 65% complete for SRR085473.fastq.gz
Approx 70% complete for SRR085473.fastq.gz
Approx 75% complete for SRR085473.fastq.gz
Approx 80% complete for SRR085473.fastq.gz
Approx 85% complete for SRR085473.fastq.gz
Approx 90% complete for SRR085473.fastq.gz
Approx 95% complete for SRR085473.fastq.gz


Analysis complete for SRR085473.fastq.gz
application/gzip


Started analysis of SRR085474.fastq.gz
Approx 5% complete for SRR085474.fastq.gz
Approx 10% complete for SRR085474.fastq.gz
Approx 15% complete for SRR085474.fastq.gz
Approx 20% complete for SRR085474.fastq.gz
Approx 25% complete for SRR085474.fastq.gz
Approx 30% complete for SRR085474.fastq.gz
Approx 35% complete for SRR085474.fastq.gz
Approx 40% complete for SRR085474.fastq.gz
Approx 45% complete for SRR085474.fastq.gz
Approx 50% complete for SRR085474.fastq.gz
Approx 55% complete for SRR085474.fastq.gz
Approx 60% complete for SRR085474.fastq.gz
Approx 65% complete for SRR085474.fastq.gz
Approx 70% complete for SRR085474.fastq.gz
Approx 75% complete for SRR085474.fastq.gz
Approx 80% complete for SRR085474.fastq.gz
Approx 85% complete for SRR085474.fastq.gz
Approx 90% complete for SRR085474.fastq.gz
Approx 95% complete for SRR085474.fastq.gz


Analysis complete for SRR085474.fastq.gz
application/gzip


Started analysis of SRR085726.fastq.gz
Approx 5% complete for SRR085726.fastq.gz
Approx 10% complete for SRR085726.fastq.gz
Approx 15% complete for SRR085726.fastq.gz
Approx 20% complete for SRR085726.fastq.gz
Approx 25% complete for SRR085726.fastq.gz
Approx 30% complete for SRR085726.fastq.gz
Approx 35% complete for SRR085726.fastq.gz
Approx 40% complete for SRR085726.fastq.gz
Approx 45% complete for SRR085726.fastq.gz
Approx 50% complete for SRR085726.fastq.gz
Approx 55% complete for SRR085726.fastq.gz
Approx 60% complete for SRR085726.fastq.gz
Approx 65% complete for SRR085726.fastq.gz
Approx 70% complete for SRR085726.fastq.gz
Approx 75% complete for SRR085726.fastq.gz
Approx 80% complete for SRR085726.fastq.gz
Approx 85% complete for SRR085726.fastq.gz
Approx 90% complete for SRR085726.fastq.gz
Approx 95% complete for SRR085726.fastq.gz


Analysis complete for SRR085726.fastq.gz



[38;5;208m///[0m ]8;id=240177;https://multiqc.info\[1mMultiQC[0m]8;;\ 🔍 [2mv1.25.1[0m

[34m       file_search[0m | Search path: /home/mahendra/Desktop/Python/Project/fastqc_output
[2K         [34msearching[0m | [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [35m100%[0m [32m12/12[0m   
[?25h[34m            fastqc[0m | Found 6 reports
[34m     write_results[0m | Data        : multiqc_output/multiqc_data
[34m     write_results[0m | Report      : multiqc_output/multiqc_report.html
[34m           multiqc[0m | MultiQC complete


CompletedProcess(args=['multiqc', 'fastqc_output', '-o', 'multiqc_output'], returncode=0)

<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/K87ByXmr/stage5.jpg');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>Adapter Trimming using cutadapt</b>
</h1>

In [20]:
# List of FastQ files (with path)
fastq_files = [os.path.join('FastQ', f) for f in ['SRR087416.fastq.gz', 'SRR085725.fastq.gz', 'SRR085471.fastq.gz', 'SRR085473.fastq.gz', 'SRR085474.fastq.gz', 'SRR085726.fastq.gz']]

# Adapter sequence
adapter_seq = 'AGATCGGAAGAGC'  # Example adapter for Illumina

# Directory to store trimmed FastQ files
trimmed_dir = 'trimmed_fastq'
# Create output directory if it doesn't exist
os.makedirs(trimmed_dir, exist_ok=True)

# Run Cutadapt for each file
for fastq_file in fastq_files:
    output_file = os.path.join(trimmed_dir, os.path.basename(fastq_file).replace('.fastq.gz', '_trimmed.fastq.gz'))
    subprocess.run(['cutadapt', '-a', adapter_seq, '-q', '30', '-o', output_file, fastq_file])

print("All FastQ files have been trimmed using cutadapt.")

This is cutadapt 4.9 with Python 3.12.2
Command line parameters: -a AGATCGGAAGAGC -q 30 -o trimmed_fastq/SRR087416_trimmed.fastq.gz FastQ/SRR087416.fastq.gz
Processing single-end reads on 1 core ...
Finished in 63.311 s (4.301 µs/read; 13.95 M reads/minute).

=== Summary ===

Total reads processed:              14,720,816
Reads with adapters:                   337,690 (2.3%)
Reads written (passing filters):    14,720,816 (100.0%)

Total basepairs processed:   529,949,376 bp
Quality-trimmed:                       0 bp (0.0%)
Total written (filtered):    528,821,475 bp (99.8%)

=== Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 337690 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 29.5%
  C: 27.2%
  G: 26.1%
  T: 17.1%
  none/other: 0.2%

Overview of removed sequences
length	count	expect	max.err	error counts
3	269225	230012.8	0	269225
4	54562	57503.2	0	54562
5	9431	14375.8	0	9431
6	1288	3593.

<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/K87ByXmr/stage5.jpg');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>Post-Trim Quality Check</b>
</h1>

In [25]:
# List of trimmed files
trimmed_files = [os.path.join('trimmed_fastq', f) for f in ['SRR087416_trimmed.fastq.gz', 'SRR085725_trimmed.fastq.gz','SRR085471_trimmed.fastq.gz', 'SRR085473_trimmed.fastq.gz',
                                                            'SRR085474_trimmed.fastq.gz', 'SRR085726_trimmed.fastq.gz']]

# Directory to store FastQC results
fastqc_trim_dir = 'trim_fastqc'
os.makedirs(fastqc_trim_dir, exist_ok=True)

# Run FastQC on deduplicated BAM files
# Define parameters for threads and kmers
threads = 4   # Set the number of threads, e.g., 4 (adjust as needed based on available memory)
kmers = 10    # Set the kmer length, e.g., 7 (ensure it's between 2 and 10)
for trim_file in trimmed_files:
  subprocess.run(['fastqc', trim_file, '-o', fastqc_trim_dir, '-t', str(threads), '-k', str(kmers)])

# Run MultiQC to aggregate FastQC results
output_trim_multiqc = 'trim_mqc'
os.makedirs(output_trim_multiqc, exist_ok=True)
subprocess.run(['multiqc', fastqc_trim_dir, '-o', output_trim_multiqc])

print("Final quality control completed with FastQC and MultiQC.")

application/gzip


Started analysis of SRR087416_trimmed.fastq.gz
Approx 5% complete for SRR087416_trimmed.fastq.gz
Approx 10% complete for SRR087416_trimmed.fastq.gz
Approx 15% complete for SRR087416_trimmed.fastq.gz
Approx 20% complete for SRR087416_trimmed.fastq.gz
Approx 25% complete for SRR087416_trimmed.fastq.gz
Approx 30% complete for SRR087416_trimmed.fastq.gz
Approx 35% complete for SRR087416_trimmed.fastq.gz
Approx 40% complete for SRR087416_trimmed.fastq.gz
Approx 45% complete for SRR087416_trimmed.fastq.gz
Approx 50% complete for SRR087416_trimmed.fastq.gz
Approx 55% complete for SRR087416_trimmed.fastq.gz
Approx 60% complete for SRR087416_trimmed.fastq.gz
Approx 65% complete for SRR087416_trimmed.fastq.gz
Approx 70% complete for SRR087416_trimmed.fastq.gz
Approx 75% complete for SRR087416_trimmed.fastq.gz
Approx 80% complete for SRR087416_trimmed.fastq.gz
Approx 85% complete for SRR087416_trimmed.fastq.gz
Approx 90% complete for SRR087416_trimmed.fastq.gz
Approx 95% complete for SRR087416_tr

Analysis complete for SRR087416_trimmed.fastq.gz
application/gzip


Started analysis of SRR085725_trimmed.fastq.gz
Approx 5% complete for SRR085725_trimmed.fastq.gz
Approx 10% complete for SRR085725_trimmed.fastq.gz
Approx 15% complete for SRR085725_trimmed.fastq.gz
Approx 20% complete for SRR085725_trimmed.fastq.gz
Approx 25% complete for SRR085725_trimmed.fastq.gz
Approx 30% complete for SRR085725_trimmed.fastq.gz
Approx 35% complete for SRR085725_trimmed.fastq.gz
Approx 40% complete for SRR085725_trimmed.fastq.gz
Approx 45% complete for SRR085725_trimmed.fastq.gz
Approx 50% complete for SRR085725_trimmed.fastq.gz
Approx 55% complete for SRR085725_trimmed.fastq.gz
Approx 60% complete for SRR085725_trimmed.fastq.gz
Approx 65% complete for SRR085725_trimmed.fastq.gz
Approx 70% complete for SRR085725_trimmed.fastq.gz
Approx 75% complete for SRR085725_trimmed.fastq.gz
Approx 80% complete for SRR085725_trimmed.fastq.gz
Approx 85% complete for SRR085725_trimmed.fastq.gz
Approx 90% complete for SRR085725_trimmed.fastq.gz
Approx 95% complete for SRR085725_tr

Analysis complete for SRR085725_trimmed.fastq.gz
application/gzip


Started analysis of SRR085471_trimmed.fastq.gz
Approx 5% complete for SRR085471_trimmed.fastq.gz
Approx 10% complete for SRR085471_trimmed.fastq.gz
Approx 15% complete for SRR085471_trimmed.fastq.gz
Approx 20% complete for SRR085471_trimmed.fastq.gz
Approx 25% complete for SRR085471_trimmed.fastq.gz
Approx 30% complete for SRR085471_trimmed.fastq.gz
Approx 35% complete for SRR085471_trimmed.fastq.gz
Approx 40% complete for SRR085471_trimmed.fastq.gz
Approx 45% complete for SRR085471_trimmed.fastq.gz
Approx 50% complete for SRR085471_trimmed.fastq.gz
Approx 55% complete for SRR085471_trimmed.fastq.gz
Approx 60% complete for SRR085471_trimmed.fastq.gz
Approx 65% complete for SRR085471_trimmed.fastq.gz
Approx 70% complete for SRR085471_trimmed.fastq.gz
Approx 75% complete for SRR085471_trimmed.fastq.gz
Approx 80% complete for SRR085471_trimmed.fastq.gz
Approx 85% complete for SRR085471_trimmed.fastq.gz
Approx 90% complete for SRR085471_trimmed.fastq.gz
Approx 95% complete for SRR085471_tr

Analysis complete for SRR085471_trimmed.fastq.gz
application/gzip


Started analysis of SRR085473_trimmed.fastq.gz
Approx 5% complete for SRR085473_trimmed.fastq.gz
Approx 10% complete for SRR085473_trimmed.fastq.gz
Approx 15% complete for SRR085473_trimmed.fastq.gz
Approx 20% complete for SRR085473_trimmed.fastq.gz
Approx 25% complete for SRR085473_trimmed.fastq.gz
Approx 30% complete for SRR085473_trimmed.fastq.gz
Approx 35% complete for SRR085473_trimmed.fastq.gz
Approx 40% complete for SRR085473_trimmed.fastq.gz
Approx 45% complete for SRR085473_trimmed.fastq.gz
Approx 50% complete for SRR085473_trimmed.fastq.gz
Approx 55% complete for SRR085473_trimmed.fastq.gz
Approx 60% complete for SRR085473_trimmed.fastq.gz
Approx 65% complete for SRR085473_trimmed.fastq.gz
Approx 70% complete for SRR085473_trimmed.fastq.gz
Approx 75% complete for SRR085473_trimmed.fastq.gz
Approx 80% complete for SRR085473_trimmed.fastq.gz
Approx 85% complete for SRR085473_trimmed.fastq.gz
Approx 90% complete for SRR085473_trimmed.fastq.gz
Approx 95% complete for SRR085473_tr

Analysis complete for SRR085473_trimmed.fastq.gz
application/gzip


Started analysis of SRR085474_trimmed.fastq.gz
Approx 5% complete for SRR085474_trimmed.fastq.gz
Approx 10% complete for SRR085474_trimmed.fastq.gz
Approx 15% complete for SRR085474_trimmed.fastq.gz
Approx 20% complete for SRR085474_trimmed.fastq.gz
Approx 25% complete for SRR085474_trimmed.fastq.gz
Approx 30% complete for SRR085474_trimmed.fastq.gz
Approx 35% complete for SRR085474_trimmed.fastq.gz
Approx 40% complete for SRR085474_trimmed.fastq.gz
Approx 45% complete for SRR085474_trimmed.fastq.gz
Approx 50% complete for SRR085474_trimmed.fastq.gz
Approx 55% complete for SRR085474_trimmed.fastq.gz
Approx 60% complete for SRR085474_trimmed.fastq.gz
Approx 65% complete for SRR085474_trimmed.fastq.gz
Approx 70% complete for SRR085474_trimmed.fastq.gz
Approx 75% complete for SRR085474_trimmed.fastq.gz
Approx 80% complete for SRR085474_trimmed.fastq.gz
Approx 85% complete for SRR085474_trimmed.fastq.gz
Approx 90% complete for SRR085474_trimmed.fastq.gz
Approx 95% complete for SRR085474_tr

Analysis complete for SRR085474_trimmed.fastq.gz
application/gzip


Started analysis of SRR085726_trimmed.fastq.gz
Approx 5% complete for SRR085726_trimmed.fastq.gz
Approx 10% complete for SRR085726_trimmed.fastq.gz
Approx 15% complete for SRR085726_trimmed.fastq.gz
Approx 20% complete for SRR085726_trimmed.fastq.gz
Approx 25% complete for SRR085726_trimmed.fastq.gz
Approx 30% complete for SRR085726_trimmed.fastq.gz
Approx 35% complete for SRR085726_trimmed.fastq.gz
Approx 40% complete for SRR085726_trimmed.fastq.gz
Approx 45% complete for SRR085726_trimmed.fastq.gz
Approx 50% complete for SRR085726_trimmed.fastq.gz
Approx 55% complete for SRR085726_trimmed.fastq.gz
Approx 60% complete for SRR085726_trimmed.fastq.gz
Approx 65% complete for SRR085726_trimmed.fastq.gz
Approx 70% complete for SRR085726_trimmed.fastq.gz
Approx 75% complete for SRR085726_trimmed.fastq.gz
Approx 80% complete for SRR085726_trimmed.fastq.gz
Approx 85% complete for SRR085726_trimmed.fastq.gz
Approx 90% complete for SRR085726_trimmed.fastq.gz
Approx 95% complete for SRR085726_tr

Analysis complete for SRR085726_trimmed.fastq.gz



[38;5;208m///[0m ]8;id=305485;https://multiqc.info\[1mMultiQC[0m]8;;\ 🔍 [2mv1.25.1[0m

[34m       file_search[0m | Search path: /home/mahendra/Desktop/Python/Project/trim_fastqc
[2K         [34msearching[0m | [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [35m100%[0m [32m12/12[0m   
[?25h[34m            fastqc[0m | Found 6 reports
[34m     write_results[0m | Data        : trim_mqc/multiqc_data
[34m     write_results[0m | Report      : trim_mqc/multiqc_report.html
[34m           multiqc[0m | MultiQC complete


Final quality control completed with FastQC and MultiQC.


In [101]:
# Path to the MultiQC report
multiqc_report_path = os.path.join(output_trim_multiqc, "multiqc_report.html")

# Display the report in the notebook
IFrame(multiqc_report_path, width=1000, height=800)

<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/K87ByXmr/stage5.jpg');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>Download the human reference genome and Align FASTQ files with the Reference Genome using BWA</b>
</h1>

In [29]:
# Reference genome path
reference_genome = '/home/mahendra/Desktop/Python/Project/Ref/GCF_000001405.40_GRCh38.p14_genomic.fna'
if not os.path.exists(reference_genome):
    subprocess.run(['wget', 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz'])
    subprocess.run(['gunzip', 'GCF_000001405.40_GRCh38.p14_genomic.fna.gz'])
    RF = 'Ref'
    os.makedirs(RF, exist_ok=True)
    subprocess.run(['mv', 'GCF_000001405.40_GRCh38.p14_genomic.fna', '/home/mahendra/Desktop/Python/Project/Ref/'])

In [35]:
# Step 1: Check for the human reference genome
reference_genome = '/home/mahendra/Desktop/Python/Project/Ref/GCF_000001405.40_GRCh38.p14_genomic.fna'
if not os.path.exists(reference_genome):
    print("Reference genome is not found. Please download it first.")
else:
    print("Reference genome already downloaded.")

# Step 2: Ensure reference genome is indexed
index_files = [reference_genome + ext for ext in ['.bwt', '.sa', '.ann', '.amb', '.pac']]
if not all(os.path.exists(f) for f in index_files):
    print("Indexing the reference genome...")
    subprocess.run(['bwa', 'index', reference_genome])
else:
    print("Reference genome is already indexed.")

# Step 3: Run BWA mem to align each FASTQ file and save output to SAM files
fastq_files = [os.path.join('trimmed_fastq', f) for f in [
    'SRR087416_trimmed.fastq.gz', 'SRR085725_trimmed.fastq.gz',
    'SRR085471_trimmed.fastq.gz', 'SRR085473_trimmed.fastq.gz',
    'SRR085474_trimmed.fastq.gz', 'SRR085726_trimmed.fastq.gz'
]]
sam_dir = 'aligned_sam'
os.makedirs(sam_dir, exist_ok=True)

print("Running BWA mem for each FASTQ file...")
for fastq_file in tqdm(fastq_files, desc="Alignment Progress", unit="file", bar_format='{l_bar}{bar}| {n_fmt}/{total_fmt} files', colour="yellow"):
    output_sam = os.path.join(sam_dir, os.path.basename(fastq_file).replace('_trimmed.fastq.gz', '.sam'))

    # Run BWA mem with real-time output and error handling
    with open(output_sam, 'w') as sam_output:
        process = subprocess.Popen(['bwa', 'mem', reference_genome, fastq_file], stdout=sam_output, stderr=subprocess.PIPE, text=True)

        # Capture stderr output and print any errors
        for line in process.stderr:
            print(line, end="")  # Real-time error output

        # Wait for the process to complete and check for errors
        return_code = process.wait()
        if return_code != 0:
            print(f"Error during alignment for {fastq_file}. Check the SAM file or stderr for details.")
        else:
            print(f"Alignment completed for {fastq_file}, output saved to {output_sam}")

Reference genome already downloaded.
Reference genome is already indexed.
Running BWA mem for each FASTQ file...


Alignment Progress:   0%|[33m                                            [0m| 0/6 files[0m

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 278392 sequences (10000040 bp)...
[M::process] read 278374 sequences (10000032 bp)...
[M::mem_process_seqs] Processed 278392 reads in 40.399 CPU sec, 100.554 real sec
[M::process] read 278382 sequences (10000067 bp)...
[M::mem_process_seqs] Processed 278374 reads in 33.895 CPU sec, 60.862 real sec
[M::process] read 278366 sequences (10000060 bp)...
[M::mem_process_seqs] Processed 278382 reads in 32.662 CPU sec, 54.507 real sec
[M::process] read 278382 sequences (10000028 bp)...
[M::mem_process_seqs] Processed 278366 reads in 34.366 CPU sec, 61.419 real sec
[M::process] read 278378 sequences (10000034 bp)...
[M::mem_process_seqs] Processed 278382 reads in 34.655 CPU sec, 61.958 real sec
[M::process] read 278374 sequences (10000048 bp)...
[M::mem_process_seqs] Processed 278378 reads in 33.773 CPU sec, 58.527 real sec
[M::process] read 278392 sequences (10000028 bp)...
[M::mem_process_seqs] Processed 278374 reads in 35.154 C

Alignment Progress:  17%|[33m███████▎                                    [0m| 1/6 files[0m

[main] Version: 0.7.17-r1188
[main] CMD: bwa mem /home/mahendra/Desktop/Python/Project/Ref/GCF_000001405.40_GRCh38.p14_genomic.fna trimmed_fastq/SRR087416_trimmed.fastq.gz
[main] Real time: 3171.090 sec; CPU: 1801.563 sec
Alignment completed for trimmed_fastq/SRR087416_trimmed.fastq.gz, output saved to aligned_sam/SRR087416.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 278420 sequences (10000040 bp)...
[M::process] read 278418 sequences (10000056 bp)...
[M::mem_process_seqs] Processed 278420 reads in 33.517 CPU sec, 74.458 real sec
[M::process] read 278434 sequences (10000020 bp)...
[M::mem_process_seqs] Processed 278418 reads in 29.795 CPU sec, 54.491 real sec
[M::process] read 278424 sequences (10000042 bp)...
[M::mem_process_seqs] Processed 278434 reads in 28.898 CPU sec, 48.654 real sec
[M::process] read 278428 sequences (10000017 bp)...
[M::mem_process_seqs] Processed 278424 reads in 29.362 CPU sec, 51.353 real sec
[M::process] read 278444 sequences (1000005

Alignment Progress:  33%|[33m██████████████▋                             [0m| 2/6 files[0m

[main] Version: 0.7.17-r1188
[main] CMD: bwa mem /home/mahendra/Desktop/Python/Project/Ref/GCF_000001405.40_GRCh38.p14_genomic.fna trimmed_fastq/SRR085725_trimmed.fastq.gz
[main] Real time: 2424.286 sec; CPU: 1426.885 sec
Alignment completed for trimmed_fastq/SRR085725_trimmed.fastq.gz, output saved to aligned_sam/SRR085725.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 278398 sequences (10000033 bp)...
[M::process] read 278412 sequences (10000049 bp)...
[M::mem_process_seqs] Processed 278398 reads in 33.164 CPU sec, 75.198 real sec
[M::process] read 278428 sequences (10000042 bp)...
[M::mem_process_seqs] Processed 278412 reads in 28.216 CPU sec, 44.986 real sec
[M::process] read 278406 sequences (10000041 bp)...
[M::mem_process_seqs] Processed 278428 reads in 27.578 CPU sec, 42.791 real sec
[M::process] read 278394 sequences (10000009 bp)...
[M::mem_process_seqs] Processed 278406 reads in 29.142 CPU sec, 48.717 real sec
[M::process] read 278392 sequences (1000003

Alignment Progress:  50%|[33m██████████████████████                      [0m| 3/6 files[0m

[main] Version: 0.7.17-r1188
[main] CMD: bwa mem /home/mahendra/Desktop/Python/Project/Ref/GCF_000001405.40_GRCh38.p14_genomic.fna trimmed_fastq/SRR085471_trimmed.fastq.gz
[main] Real time: 2613.812 sec; CPU: 1572.619 sec
Alignment completed for trimmed_fastq/SRR085471_trimmed.fastq.gz, output saved to aligned_sam/SRR085471.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 286376 sequences (10000022 bp)...
[M::process] read 286374 sequences (10000056 bp)...
[M::mem_process_seqs] Processed 286376 reads in 39.932 CPU sec, 89.680 real sec
[M::process] read 286340 sequences (10000030 bp)...
[M::mem_process_seqs] Processed 286374 reads in 33.506 CPU sec, 48.115 real sec
[M::process] read 286342 sequences (10000023 bp)...
[M::mem_process_seqs] Processed 286340 reads in 35.674 CPU sec, 59.341 real sec
[M::process] read 286352 sequences (10000069 bp)...
[M::mem_process_seqs] Processed 286342 reads in 35.117 CPU sec, 55.816 real sec
[M::process] read 286336 sequences (1000005

Alignment Progress:  67%|[33m█████████████████████████████▎              [0m| 4/6 files[0m

[main] Version: 0.7.17-r1188
[main] CMD: bwa mem /home/mahendra/Desktop/Python/Project/Ref/GCF_000001405.40_GRCh38.p14_genomic.fna trimmed_fastq/SRR085473_trimmed.fastq.gz
[main] Real time: 2591.593 sec; CPU: 1708.399 sec
Alignment completed for trimmed_fastq/SRR085473_trimmed.fastq.gz, output saved to aligned_sam/SRR085473.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 286372 sequences (10000059 bp)...
[M::process] read 286358 sequences (10000024 bp)...
[M::mem_process_seqs] Processed 286372 reads in 34.433 CPU sec, 70.224 real sec
[M::process] read 286350 sequences (10000028 bp)...
[M::mem_process_seqs] Processed 286358 reads in 30.851 CPU sec, 48.177 real sec
[M::process] read 286362 sequences (10000016 bp)...
[M::mem_process_seqs] Processed 286350 reads in 30.761 CPU sec, 47.870 real sec
[M::process] read 286340 sequences (10000030 bp)...
[M::mem_process_seqs] Processed 286362 reads in 31.243 CPU sec, 49.331 real sec
[M::process] read 286342 sequences (1000000

Alignment Progress:  83%|[33m████████████████████████████████████▋       [0m| 5/6 files[0m

[main] Version: 0.7.17-r1188
[main] CMD: bwa mem /home/mahendra/Desktop/Python/Project/Ref/GCF_000001405.40_GRCh38.p14_genomic.fna trimmed_fastq/SRR085474_trimmed.fastq.gz
[main] Real time: 2826.703 sec; CPU: 1740.923 sec
Alignment completed for trimmed_fastq/SRR085474_trimmed.fastq.gz, output saved to aligned_sam/SRR085474.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 286374 sequences (10000065 bp)...
[M::process] read 286366 sequences (10000054 bp)...
[M::mem_process_seqs] Processed 286374 reads in 37.640 CPU sec, 80.877 real sec
[M::process] read 286354 sequences (10000023 bp)...
[M::mem_process_seqs] Processed 286366 reads in 32.345 CPU sec, 50.570 real sec
[M::process] read 286342 sequences (10000059 bp)...
[M::mem_process_seqs] Processed 286354 reads in 32.487 CPU sec, 51.513 real sec
[M::process] read 286358 sequences (10000047 bp)...
[M::mem_process_seqs] Processed 286342 reads in 32.500 CPU sec, 51.517 real sec
[M::process] read 286344 sequences (1000005

Alignment Progress: 100%|[33m████████████████████████████████████████████[0m| 6/6 files[0m

[main] Version: 0.7.17-r1188
[main] CMD: bwa mem /home/mahendra/Desktop/Python/Project/Ref/GCF_000001405.40_GRCh38.p14_genomic.fna trimmed_fastq/SRR085726_trimmed.fastq.gz
[main] Real time: 2844.202 sec; CPU: 1766.266 sec
Alignment completed for trimmed_fastq/SRR085726_trimmed.fastq.gz, output saved to aligned_sam/SRR085726.sam





<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/K87ByXmr/stage5.jpg');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>SAM File converted, sorted, and indexed</b>
</h1>

In [69]:
# Directories for output BAM files, and sorted BAM files
bam_dir = 'aligned_bam'
sorted_bam_dir = 'sorted_bam'
os.makedirs(bam_dir, exist_ok=True)
os.makedirs(sorted_bam_dir, exist_ok=True)

# List SAM files to process
sam_files = [os.path.join(sam_dir, f) for f in os.listdir(sam_dir) if f.endswith('.sam')]

# Process each SAM file: Convert to BAM, sort, and index
for sam_file in tqdm(sam_files, desc="Processing SAM files", unit="file"):
    try:
        # Step 1: Convert SAM to BAM
        bam_file = os.path.join(bam_dir, os.path.basename(sam_file).replace('.sam', '.bam'))
        subprocess.run(['samtools', 'view', '-b', '-o', bam_file, sam_file], check=True)

        # Step 2: Sort BAM file
        sorted_bam_file = os.path.join(sorted_bam_dir, os.path.basename(sam_file).replace('.sam', '_sorted.bam'))
        subprocess.run(['samtools', 'sort', '-o', sorted_bam_file, bam_file], check=True)

        # Step 3: Index sorted BAM file
        subprocess.run(['samtools', 'index', sorted_bam_file], check=True)

        # Optional: Delete intermediate BAM file to save space
        os.remove(bam_file)

    except subprocess.CalledProcessError as e:
        print(f"Error processing file {sam_file}: {e}")
        continue

print("All SAM files have been processed: converted, sorted, and indexed.")

Processing SAM files:   0%|                             | 0/6 [00:00<?, ?file/s][bam_sort_core] merging from 3 files and 1 in-memory blocks...
Processing SAM files:  17%|███▌                 | 1/6 [00:55<04:37, 55.42s/file][bam_sort_core] merging from 3 files and 1 in-memory blocks...
Processing SAM files:  33%|███████              | 2/6 [01:55<03:52, 58.20s/file][bam_sort_core] merging from 3 files and 1 in-memory blocks...
Processing SAM files:  50%|██████████▌          | 3/6 [02:57<02:59, 59.98s/file][bam_sort_core] merging from 3 files and 1 in-memory blocks...
Processing SAM files:  67%|██████████████       | 4/6 [03:53<01:56, 58.42s/file][bam_sort_core] merging from 3 files and 1 in-memory blocks...
Processing SAM files:  83%|█████████████████▌   | 5/6 [04:56<00:59, 59.85s/file][bam_sort_core] merging from 3 files and 1 in-memory blocks...
Processing SAM files: 100%|█████████████████████| 6/6 [06:01<00:00, 60.33s/file]

All SAM files have been processed: converted, sorted, and indexed.





<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/K87ByXmr/stage5.jpg');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>Deduplication: Use Picard to mark and remove duplicate reads</b>
</h1>

In [71]:
# Directory to store deduplicated BAM files
dedup_bam_dir = 'deduplicated_bam'
os.makedirs(dedup_bam_dir, exist_ok=True)

# List of sorted BAM files
sorted_bam_files = [os.path.join('sorted_bam', f) for f in [
    'SRR087416_sorted.bam', 'SRR085725_sorted.bam',
    'SRR085471_sorted.bam', 'SRR085473_sorted.bam',
    'SRR085474_sorted.bam', 'SRR085726_sorted.bam']]

# Path to Picard JAR file (adjust as needed)
picard_jar_path = '/home/mahendra/Desktop/Python/Project/picard.jar'

if not os.path.exists(picard_jar_path):
    raise FileNotFoundError(f"Picard JAR file not found at {picard_jar_path}")

# Run Picard MarkDuplicates
for sorted_bam in sorted_bam_files:
    dedup_bam = os.path.join(dedup_bam_dir, os.path.basename(sorted_bam).replace('_sorted.bam', '_dedup.bam'))
    metrics_file = os.path.join(dedup_bam_dir, os.path.basename(sorted_bam).replace('_sorted.bam', '_metrics.txt'))

    try:
        subprocess.run([
            'java', '-Xmx4g', '-jar', picard_jar_path, 'MarkDuplicates',
            f'I={sorted_bam}',
            f'O={dedup_bam}',
            f'M={metrics_file}',
            'REMOVE_DUPLICATES=true',
            'CREATE_INDEX=true'
        ], check=True)
        print(f"Deduplication completed for {sorted_bam}, output BAM: {dedup_bam}")
    except subprocess.CalledProcessError as e:
        print(f"Error during deduplication for {sorted_bam}: {e}")

INFO	2024-11-17 10:04:01	MarkDuplicates	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** 
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MarkDuplicates -I sorted_bam/SRR087416_sorted.bam -O deduplicated_bam/SRR087416_dedup.bam -M deduplicated_bam/SRR087416_metrics.txt -REMOVE_DUPLICATES true -CREATE_INDEX true
**********


10:04:02.300 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/mahendra/Desktop/Python/Project/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Nov 17 10:04:02 IST 2024] MarkDuplicates INPUT=[sorted_bam/SRR087416_sorted.bam] OUTPUT=deduplicated_bam/SRR087416_dedup.bam METRICS_FILE=deduplicated_bam/SRR087416_metrics.txt REMOVE_DUPLICATES=true CREATE_INDEX=true    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP

Deduplication completed for sorted_bam/SRR087416_sorted.bam, output BAM: deduplicated_bam/SRR087416_dedup.bam


INFO	2024-11-17 10:05:06	MarkDuplicates	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** 
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MarkDuplicates -I sorted_bam/SRR085725_sorted.bam -O deduplicated_bam/SRR085725_dedup.bam -M deduplicated_bam/SRR085725_metrics.txt -REMOVE_DUPLICATES true -CREATE_INDEX true
**********


10:05:06.524 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/mahendra/Desktop/Python/Project/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Nov 17 10:05:06 IST 2024] MarkDuplicates INPUT=[sorted_bam/SRR085725_sorted.bam] OUTPUT=deduplicated_bam/SRR085725_dedup.bam METRICS_FILE=deduplicated_bam/SRR085725_metrics.txt REMOVE_DUPLICATES=true CREATE_INDEX=true    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP

Deduplication completed for sorted_bam/SRR085725_sorted.bam, output BAM: deduplicated_bam/SRR085725_dedup.bam


INFO	2024-11-17 10:06:06	MarkDuplicates	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** 
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MarkDuplicates -I sorted_bam/SRR085471_sorted.bam -O deduplicated_bam/SRR085471_dedup.bam -M deduplicated_bam/SRR085471_metrics.txt -REMOVE_DUPLICATES true -CREATE_INDEX true
**********


10:06:07.140 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/mahendra/Desktop/Python/Project/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Nov 17 10:06:07 IST 2024] MarkDuplicates INPUT=[sorted_bam/SRR085471_sorted.bam] OUTPUT=deduplicated_bam/SRR085471_dedup.bam METRICS_FILE=deduplicated_bam/SRR085471_metrics.txt REMOVE_DUPLICATES=true CREATE_INDEX=true    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP

Deduplication completed for sorted_bam/SRR085471_sorted.bam, output BAM: deduplicated_bam/SRR085471_dedup.bam


INFO	2024-11-17 10:07:14	MarkDuplicates	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** 
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MarkDuplicates -I sorted_bam/SRR085473_sorted.bam -O deduplicated_bam/SRR085473_dedup.bam -M deduplicated_bam/SRR085473_metrics.txt -REMOVE_DUPLICATES true -CREATE_INDEX true
**********


10:07:14.777 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/mahendra/Desktop/Python/Project/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Nov 17 10:07:14 IST 2024] MarkDuplicates INPUT=[sorted_bam/SRR085473_sorted.bam] OUTPUT=deduplicated_bam/SRR085473_dedup.bam METRICS_FILE=deduplicated_bam/SRR085473_metrics.txt REMOVE_DUPLICATES=true CREATE_INDEX=true    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP

Deduplication completed for sorted_bam/SRR085473_sorted.bam, output BAM: deduplicated_bam/SRR085473_dedup.bam


INFO	2024-11-17 10:08:07	MarkDuplicates	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** 
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MarkDuplicates -I sorted_bam/SRR085474_sorted.bam -O deduplicated_bam/SRR085474_dedup.bam -M deduplicated_bam/SRR085474_metrics.txt -REMOVE_DUPLICATES true -CREATE_INDEX true
**********


10:08:08.218 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/mahendra/Desktop/Python/Project/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Nov 17 10:08:08 IST 2024] MarkDuplicates INPUT=[sorted_bam/SRR085474_sorted.bam] OUTPUT=deduplicated_bam/SRR085474_dedup.bam METRICS_FILE=deduplicated_bam/SRR085474_metrics.txt REMOVE_DUPLICATES=true CREATE_INDEX=true    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP

Deduplication completed for sorted_bam/SRR085474_sorted.bam, output BAM: deduplicated_bam/SRR085474_dedup.bam


INFO	2024-11-17 10:09:18	MarkDuplicates	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** 
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MarkDuplicates -I sorted_bam/SRR085726_sorted.bam -O deduplicated_bam/SRR085726_dedup.bam -M deduplicated_bam/SRR085726_metrics.txt -REMOVE_DUPLICATES true -CREATE_INDEX true
**********


10:09:18.675 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/mahendra/Desktop/Python/Project/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Nov 17 10:09:18 IST 2024] MarkDuplicates INPUT=[sorted_bam/SRR085726_sorted.bam] OUTPUT=deduplicated_bam/SRR085726_dedup.bam METRICS_FILE=deduplicated_bam/SRR085726_metrics.txt REMOVE_DUPLICATES=true CREATE_INDEX=true    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP

Deduplication completed for sorted_bam/SRR085726_sorted.bam, output BAM: deduplicated_bam/SRR085726_dedup.bam


<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/K87ByXmr/stage5.jpg');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>Final Quality Check:  Qualimap, Samtools Stats, and MultiQC</b>
</h1>

In [95]:
# Directories
#dedup_bam_dir = 'deduplicated_bam'  # Directory containing deduplicated BAM files
qualimap_output_dir = 'qualimap_output'  # Qualimap results directory
samtools_stats_output_dir = 'samtools_stats_output'  # Directory for Samtools stats output
multiqc_output_dir = 'multiqc_output'  # MultiQC aggregated results directory

# Ensure directories exist
os.makedirs(qualimap_output_dir, exist_ok=True)
os.makedirs(samtools_stats_output_dir, exist_ok=True)
os.makedirs(multiqc_output_dir, exist_ok=True)

# List deduplicated BAM files
dedup_bam_files = [os.path.join(dedup_bam_dir, f) for f in os.listdir(dedup_bam_dir) if f.endswith('_dedup.bam')]

# Step 1: Run Qualimap BAMQC
for bam_file in dedup_bam_files:
    sample_name = os.path.basename(bam_file).replace('_dedup.bam', '')
    sample_output_dir = os.path.join(qualimap_output_dir, sample_name)
    os.makedirs(sample_output_dir, exist_ok=True)
    try:
        subprocess.run([
            'qualimap', 'bamqc',
            '-bam', bam_file,
            '-outdir', sample_output_dir,
            '-outformat', 'PDF:HTML'
        ], check=True)
        print(f"Qualimap completed for {bam_file}. Results in {sample_output_dir}")
    except subprocess.CalledProcessError as e:
        print(f"Error running Qualimap for {bam_file}: {e}")

# Step 2: Run Samtools stats
for bam_file in dedup_bam_files:
    sample_name = os.path.basename(bam_file).replace('_dedup.bam', '')
    stats_output_file = os.path.join(samtools_stats_output_dir, f"{sample_name}_stats.txt")
    try:
        with open(stats_output_file, 'w') as stats_out:
            subprocess.run(['samtools', 'stats', bam_file], stdout=stats_out, check=True)
        print(f"Samtools stats completed for {bam_file}. Output in {stats_output_file}")
    except subprocess.CalledProcessError as e:
        print(f"Error running Samtools stats for {bam_file}: {e}")

# Step 3: Run MultiQC to aggregate results
try:
    subprocess.run(['multiqc', qualimap_output_dir, samtools_stats_output_dir, '-o', multiqc_output_dir], check=True)
    print("MultiQC completed. Aggregated report available in:", multiqc_output_dir)
except subprocess.CalledProcessError as e:
    print("Error running MultiQC:", e)

Java memory size is set to 1200M
Launching application...

QualiMap v.2.3
Built on 2023-05-19 16:57

Selected tool: bamqc
Available memory (Mb): 35
Max memory (Mb): 1258

Starting bam qc....
Loading sam header...
Loading locator...
Loading reference...
Number of windows: 400, effective number of windows: 1104
Chunk of reads size: 1000
Number of threads: 8
Processed 110 out of 1104 windows...
Processed 220 out of 1104 windows...
Processed 330 out of 1104 windows...
Processed 440 out of 1104 windows...
Processed 550 out of 1104 windows...
Processed 660 out of 1104 windows...
Processed 770 out of 1104 windows...
Processed 880 out of 1104 windows...
Processed 990 out of 1104 windows...
Processed 1100 out of 1104 windows...
Total processed windows:1104
Number of reads: 7642762
Number of valid reads: 6717101
Number of correct strand reads:0

Inside of regions...
Num mapped reads: 6717101
Num mapped first of pair: 0
Num mapped second of pair: 0
Num singletons: 0
Time taken to analyze reads: 2


[38;5;208m///[0m ]8;id=233344;https://multiqc.info\[1mMultiQC[0m]8;;\ 🔍 [2mv1.25.1[0m

[34m       file_search[0m | Search path: /home/mahendra/Desktop/Python/Project/qualimap_output
[34m       file_search[0m | Search path: /home/mahendra/Desktop/Python/Project/samtools_stats_output
[2K         [34msearching[0m | [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [35m100%[0m [32m288/288[0m  88[0m [2mqualimap_output/SRR085474/report.pdf[0m
[?25h[34m          qualimap[0m | Found 6 BamQC reports
[34m          samtools[0m | Found 6 stats reports
[34m     write_results[0m | Data        : multiqc_output/multiqc_data
[34m     write_results[0m | Report      : multiqc_output/multiqc_report.html
[34m           multiqc[0m | MultiQC complete


MultiQC completed. Aggregated report available in: multiqc_output


In [97]:
# Path to the MultiQC report
multiqc_report_path = os.path.join(multiqc_output_dir, "multiqc_report.html")
# Display the report in the notebook
IFrame(multiqc_report_path, width=1000, height=800)

<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/K87ByXmr/stage5.jpg');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>Reads Quantification using HTseq</b>
</h1>

In [204]:
# Directories for input and output
#dedup_bam_dir = 'dedup_bam'  # BAM files already deduplicated using Picard
count_output_dir = 'read_counts'
os.makedirs(count_output_dir, exist_ok=True)

# Reference annotation file for quantification
gff_file = '/home/mahendra/Desktop/Python/Project/GTF/GCF_000001405.40_GRCh38.p14_genomic.gtf'
if not os.path.exists(gff_file):
    print("GTF file not found. Downloading...")
    os.makedirs('GTF', exist_ok=True)
    subprocess.run(['wget', '-q', '-O', 'GCF_000001405.40_GRCh38.p14_genomic.gtf.gz',
                    'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz'])
    subprocess.run(['gunzip', 'GCF_000001405.40_GRCh38.p14_genomic.gtf.gz'])
    subprocess.run(['mv', 'GCF_000001405.40_GRCh38.p14_genomic.gtf', gff_file])
    print("GTF file downloaded and prepared.")

# List of deduplicated BAM files
dedup_bam_files = [os.path.join(dedup_bam_dir, f) for f in os.listdir(dedup_bam_dir) if f.endswith('_dedup.bam')]

# Process each deduplicated BAM file
for dedup_bam_file in tqdm(dedup_bam_files, desc="Quantifying Reads", unit="file"):
    # Define output file for counts
    count_file = os.path.join(count_output_dir, os.path.basename(dedup_bam_file).replace('_dedup.bam', '_counts.txt'))

    # Run htseq-count
    subprocess.run([
        'htseq-count',
        '-f', 'bam',        # Input file format
        '-r', 'pos',        # Sort order: positional
        '-s', 'no',         # Strand-specific: adjust based on your data
        '-t', 'exon',       # Feature type: exon
        '-i', 'gene_id',    # Identifier attribute
        dedup_bam_file,
        gff_file
    ], stdout=open(count_file, 'w'))

    print(f"Counts saved to {count_file}")

print("Read quantification completed for all BAM files.")

Quantifying Reads:   0%|                                                                                                                                        | 0/6 [00:00<?, ?file/s]100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
1000000 GFF lines processed.
1100000 GFF lines processed.
1200000 GFF lines processed.
1300000 GFF lines processed.
1400000 GFF lines processed.
1500000 GFF lines processed.
1600000 GFF lines processed.
1700000 GFF lines processed.
1800000 GFF lines processed.
1900000 GFF lines processed.
2000000 GFF lines processed.
2100000 GFF lines processed.
2200000 GFF lines processed.
2300000 GFF lines processed.
2400000 GFF lines processed.
2500000 GFF lines processed.
2600000 GFF lines processed.
2700000 GFF lines processed.
2800000 GFF lines processed.
2900000 GFF l

Counts saved to read_counts/SRR087416_counts.txt


100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
1000000 GFF lines processed.
1100000 GFF lines processed.
1200000 GFF lines processed.
1300000 GFF lines processed.
1400000 GFF lines processed.
1500000 GFF lines processed.
1600000 GFF lines processed.
1700000 GFF lines processed.
1800000 GFF lines processed.
1900000 GFF lines processed.
2000000 GFF lines processed.
2100000 GFF lines processed.
2200000 GFF lines processed.
2300000 GFF lines processed.
2400000 GFF lines processed.
2500000 GFF lines processed.
2600000 GFF lines processed.
2700000 GFF lines processed.
2800000 GFF lines processed.
2900000 GFF lines processed.
3000000 GFF lines processed.
3100000 GFF lines processed.
3200000 GFF lines processed.
3300000 GFF lines processed.
3400000 GFF lines processed.
3500000 GFF lines proce

Counts saved to read_counts/SRR085726_counts.txt


100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
1000000 GFF lines processed.
1100000 GFF lines processed.
1200000 GFF lines processed.
1300000 GFF lines processed.
1400000 GFF lines processed.
1500000 GFF lines processed.
1600000 GFF lines processed.
1700000 GFF lines processed.
1800000 GFF lines processed.
1900000 GFF lines processed.
2000000 GFF lines processed.
2100000 GFF lines processed.
2200000 GFF lines processed.
2300000 GFF lines processed.
2400000 GFF lines processed.
2500000 GFF lines processed.
2600000 GFF lines processed.
2700000 GFF lines processed.
2800000 GFF lines processed.
2900000 GFF lines processed.
3000000 GFF lines processed.
3100000 GFF lines processed.
3200000 GFF lines processed.
3300000 GFF lines processed.
3400000 GFF lines processed.
3500000 GFF lines proce

Counts saved to read_counts/SRR085725_counts.txt


100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
1000000 GFF lines processed.
1100000 GFF lines processed.
1200000 GFF lines processed.
1300000 GFF lines processed.
1400000 GFF lines processed.
1500000 GFF lines processed.
1600000 GFF lines processed.
1700000 GFF lines processed.
1800000 GFF lines processed.
1900000 GFF lines processed.
2000000 GFF lines processed.
2100000 GFF lines processed.
2200000 GFF lines processed.
2300000 GFF lines processed.
2400000 GFF lines processed.
2500000 GFF lines processed.
2600000 GFF lines processed.
2700000 GFF lines processed.
2800000 GFF lines processed.
2900000 GFF lines processed.
3000000 GFF lines processed.
3100000 GFF lines processed.
3200000 GFF lines processed.
3300000 GFF lines processed.
3400000 GFF lines processed.
3500000 GFF lines proce

Counts saved to read_counts/SRR085474_counts.txt


100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
1000000 GFF lines processed.
1100000 GFF lines processed.
1200000 GFF lines processed.
1300000 GFF lines processed.
1400000 GFF lines processed.
1500000 GFF lines processed.
1600000 GFF lines processed.
1700000 GFF lines processed.
1800000 GFF lines processed.
1900000 GFF lines processed.
2000000 GFF lines processed.
2100000 GFF lines processed.
2200000 GFF lines processed.
2300000 GFF lines processed.
2400000 GFF lines processed.
2500000 GFF lines processed.
2600000 GFF lines processed.
2700000 GFF lines processed.
2800000 GFF lines processed.
2900000 GFF lines processed.
3000000 GFF lines processed.
3100000 GFF lines processed.
3200000 GFF lines processed.
3300000 GFF lines processed.
3400000 GFF lines processed.
3500000 GFF lines proce

Counts saved to read_counts/SRR085473_counts.txt


100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
1000000 GFF lines processed.
1100000 GFF lines processed.
1200000 GFF lines processed.
1300000 GFF lines processed.
1400000 GFF lines processed.
1500000 GFF lines processed.
1600000 GFF lines processed.
1700000 GFF lines processed.
1800000 GFF lines processed.
1900000 GFF lines processed.
2000000 GFF lines processed.
2100000 GFF lines processed.
2200000 GFF lines processed.
2300000 GFF lines processed.
2400000 GFF lines processed.
2500000 GFF lines processed.
2600000 GFF lines processed.
2700000 GFF lines processed.
2800000 GFF lines processed.
2900000 GFF lines processed.
3000000 GFF lines processed.
3100000 GFF lines processed.
3200000 GFF lines processed.
3300000 GFF lines processed.
3400000 GFF lines processed.
3500000 GFF lines proce

Counts saved to read_counts/SRR085471_counts.txt
Read quantification completed for all BAM files.





<a id="1"></a>
<h1 style="
    background-image: url('https://i.postimg.cc/K87ByXmr/stage5.jpg');
    background-size: cover;
    background-repeat: no-repeat;
    font-family: 'Arial', sans-serif;
    font-size: 24px;
    color: white;
    text-align: center;
    border-radius: 15px 50px;
    padding: 20px 40px;
    margin: 20px 0;
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.5);">
    <b>Reads Metrics integration with metadata</b>
</h1>

In [9]:
# Step 1: Set the path to the directory containing the count files
count_output_dir = 'read_counts'  # Adjust this path to where your count files are stored

# Step 2.1: Combine count files into a single DataFrame
count_files = [os.path.join(count_output_dir, f) for f in os.listdir(count_output_dir) if f.endswith('_counts.txt')]

# Initialize an empty DataFrame to combine the counts
counts_df = pd.DataFrame()

# Read each count file and merge them into counts_df
for count_file in count_files:
    sample_name = os.path.basename(count_file).replace('_counts.txt', '')  # Get sample name from the file name
    counts = pd.read_csv(count_file, sep='\t', index_col=0, header=None, names=[sample_name])  # Read counts

    # Join the new count file to the counts_df
    if counts_df.empty:
        counts_df = counts
    else:
        counts_df = counts_df.join(counts, how='outer')

# Step 2.2: Prepare simplified metadata for DESeq2
# Create the simplified metadata DataFrame
df = pd.DataFrame({
    'SRR_ID': ["SRR087416", "SRR085471", "SRR085473", "SRR085474", "SRR085726", "SRR085725"],  # SRR_IDs corresponding to samples
    'Sample': [
        "Alzheimer's whole brain",
        "Normal brain, temporal lobe",
        "Alzheimer's brain, temporal lobe",
        "Normal brain, frontal lobe",
        "Alzheimer's brain, frontal lobe",
        "Normal whole brain"
    ]
})

# Simplify the metadata to focus on 'SRR_ID' and 'Sample'
metadata_simplified = df[['SRR_ID', 'Sample']].copy()

# Extract the condition from the 'Sample' column (e.g., 'Alzheimer's brain' -> 'AD')
metadata_simplified['condition'] = metadata_simplified['Sample'].apply(lambda x: 'AD' if "Alzheimer" in x else 'Normal')

# Set the SRR_ID as the index (to match counts columns)
metadata_simplified.set_index('SRR_ID', inplace=True)

# Ensure that the length of metadata matches the number of columns in counts_df
if len(metadata_simplified) != counts_df.shape[1]:
    raise ValueError(f"Metadata length ({len(metadata_simplified)}) does not match count data columns ({counts_df.shape[1]}).")

# Step 2.3: Align metadata with counts_df
metadata_simplified = metadata_simplified.loc[counts_df.columns]

# Display the final counts_df and metadata DataFrame to ensure proper alignment
print("Counts DataFrame:")
counts_df.head(10)

Counts DataFrame:


Unnamed: 0,SRR085474,SRR085473,SRR085726,SRR085471,SRR087416,SRR085725
A1BG,48,34,68,53,44,40
A1BG-AS1,11,3,17,16,13,10
A1CF,0,0,1,1,0,0
A2M,1173,438,439,1419,475,1068
A2M-AS1,21,1,3,15,8,19
A2ML1,55,47,28,50,31,21
A2ML1-AS1,0,5,0,0,2,0
A2MP1,1,0,3,0,3,1
A3GALT2,0,1,1,1,2,0
A4GALT,42,120,94,32,59,46


In [11]:
print("\nSimplified Metadata DataFrame:")
metadata_simplified.head(6)


Simplified Metadata DataFrame:


Unnamed: 0,Sample,condition
SRR085474,"Normal brain, frontal lobe",Normal
SRR085473,"Alzheimer's brain, temporal lobe",AD
SRR085726,"Alzheimer's brain, frontal lobe",AD
SRR085471,"Normal brain, temporal lobe",Normal
SRR087416,Alzheimer's whole brain,AD
SRR085725,Normal whole brain,Normal


In [13]:
counts_df.shape

(50042, 6)

In [15]:
metadata_simplified.shape

(6, 2)

In [17]:
# Check the shape of each count file
for count_file in count_files:
    counts = pd.read_csv(count_file, sep='\t', index_col=0, header=None)
    print(f"{os.path.basename(count_file)}: {counts.shape}")

SRR085474_counts.txt: (50042, 1)
SRR085473_counts.txt: (50042, 1)
SRR085726_counts.txt: (50042, 1)
SRR085471_counts.txt: (50042, 1)
SRR087416_counts.txt: (50042, 1)
SRR085725_counts.txt: (50042, 1)


In [19]:
print(counts_df.index[counts_df.index.str.startswith('__')])

Index(['__alignment_not_unique', '__ambiguous', '__no_feature',
       '__not_aligned', '__too_low_aQual'],
      dtype='object')


In [21]:
counts_df = counts_df[~counts_df.index.str.startswith('__')]
print("Filtered Counts DataFrame shape:", counts_df.shape)

Filtered Counts DataFrame shape: (50037, 6)


In [23]:
print("Counts columns:", counts_df.columns.tolist())
print("Metadata index:", metadata_simplified.index.tolist())

Counts columns: ['SRR085474', 'SRR085473', 'SRR085726', 'SRR085471', 'SRR087416', 'SRR085725']
Metadata index: ['SRR085474', 'SRR085473', 'SRR085726', 'SRR085471', 'SRR087416', 'SRR085725']


In [25]:
print("Counts DataFrame shape after filtering:", counts_df.shape)

Counts DataFrame shape after filtering: (50037, 6)


In [238]:
counts_df.to_csv('cleaned_counts_matrix.csv')
metadata_simplified.to_csv('cleaned_metadata.csv')

Part One Done:), Please Open `pipeline2.ipynb`