
# OligoSeeker

>[![DOI](https://zenodo.org/badge/946567115.svg)](https://doi.org/10.5281/zenodo.15011916)

> A Python library for processing FASTQ files to count oligo codons

Documentation: https://mtinti.github.io/OligoSeeker/


## Installation

You can install the package via pip:

```bash
pip install oligoseeker
```

Or directly from the repository:

```bash
pip install git+https://github.com/username/OligoSeeker.git
```

## Overview

OligoSeeker is a Python library designed to process paired FASTQ files and count occurrences of specific oligo codons. It provides a simple yet powerful interface for bioinformatics researchers working with oligonucleotide analysis.

## Features

- Process paired FASTQ files (gzipped or uncompressed)
- Search for custom oligo sequences with codon sites (NNN)
- Support for both forward and reverse complement matching
- Comprehensive results in CSV format
- Merge functionality to combine results from multiple samples
- User-friendly command-line interface with multiple modes
- Modular design for integration with other tools

## Scientific Background: Oligonucleotide-Targeted Mutagenesis

Oligonucleotide-targeted mutagenesis is a powerful technique in molecular biology that enables precise alterations of DNA sequences. In this approach, synthetic oligonucleotides (short DNA fragments, typically 20-60 nucleotides) are designed to target specific locations in a gene, allowing researchers to introduce defined mutations.

### The Structure of Mutagenic Oligos

A typical mutagenic oligo has three distinct components:

1. **5' Homology Arm**: A sequence that matches the target DNA upstream of the mutation site, providing specificity.
2. **Mutation Site (NNN)**: The actual mutation being introduced, often represented as "NNN" when a mixture of all possible codons is used.
3. **3' Homology Arm**: A sequence that matches the target DNA downstream of the mutation site, providing additional specificity.

For example, if our target DNA sequence is:
```
5'-ATGCATGCATGCATGCATGCATGCATGCATGC-3'
```

And we want to mutagenize the sixth codon (`CAT`, positions 16-18), marked here with underscores:
```
5'-ATGCATGCATGCATG___GCATGCATGCATGC-3'
```

We would design an oligo like:
```
5'-ATGCATGCATGCATGNNNGCATGCATGCATGC-3'
```
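The design step can be sketched in a few lines of Python. `make_mutagenic_oligo` is a hypothetical helper for illustration, not part of OligoSeeker; it assumes a 0-based codon start and simply truncates a homology arm when the target is too short.

```python
def make_mutagenic_oligo(target: str, codon_start: int, arm_len: int = 15) -> str:
    """Replace the codon at `codon_start` (0-based) with NNN and keep
    up to `arm_len` bases of flanking sequence on each side."""
    if codon_start < 0 or codon_start + 3 > len(target):
        raise ValueError("codon_start must leave room for a full codon")
    left = target[max(0, codon_start - arm_len):codon_start]    # 5' homology arm
    right = target[codon_start + 3:codon_start + 3 + arm_len]   # 3' homology arm
    return f"{left}NNN{right}"


target = "ATGCATGCATGCATGCATGCATGCATGCATGC"
# Index 15 is the first base of the sixth codon (CAT) in this toy target.
oligo = make_mutagenic_oligo(target, codon_start=15)
print(oligo)
```

In practice arm length is chosen per experiment (melting temperature, synthesis limits), so the default of 15 here is only a placeholder.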

### Why Use NNN Codons?

The "NNN" in the oligo represents a mixture of all possible nucleotide combinations at that position:
- N = A mixture of A, T, G, and C
- NNN = All 64 possible codons (4³ = 64)

This approach allows:
- **Saturation mutagenesis**: Testing all possible amino acid substitutions at a position
- **Structure-function studies**: Identifying critical residues in proteins
- **Protein engineering**: Optimizing enzyme activity or stability
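
To make the 64-codon space concrete, the sketch below enumerates every NNN codon and groups the codons by the residue they encode under the standard genetic code. The names (`codon_table`, `by_residue`) are illustrative and not part of OligoSeeker.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), listed in the
# conventional order TTT, TTC, TTA, TTG, TCT, ... GGG. "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons = ["".join(c) for c in product(BASES, repeat=3)]
codon_table = dict(zip(codons, AMINO_ACIDS))

# Group synonymous codons by the residue they encode.
by_residue = defaultdict(list)
for codon, residue in codon_table.items():
    by_residue[residue].append(codon)

print(len(codons))       # number of distinct codons
print(len(by_residue))   # 20 amino acids plus the stop signal
print(by_residue["*"])   # the three stop codons
```

Because the code is degenerate, an NNN library samples residues unevenly: residues with six codons are represented six times as often as those with one.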

### Deep Sequencing of Mutagenesis Libraries

After the mutagenesis reaction, the resulting DNA library contains a mixture of variants with different codons at the target position. Next-generation sequencing technologies allow researchers to sequence thousands or millions of these variants simultaneously.

`OligoSeeker` helps analyze this sequencing data by:
1. Identifying reads that contain the mutagenic oligo
2. Extracting the specific codon present at the NNN position
3. Counting the frequency of each codon variant

This information is crucial for:
- Verifying library coverage (were all possible codons incorporated?)
- Quantifying biases in the mutagenesis process
- Analyzing selection experiments where certain variants may be enriched
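
As a rough illustration of the coverage check, the following sketch takes the codon counts for one oligo and reports which of the 64 codons were never observed. `library_coverage` is a hypothetical helper, and the toy counts are made up; the `none` key mirrors the row of the same name in the example output, which appears to hold reads without a match.

```python
from itertools import product

ALL_CODONS = {"".join(c) for c in product("ACGT", repeat=3)}


def library_coverage(counts):
    """Return (fraction of the 64 codons observed, sorted list of missing codons)."""
    observed = {codon for codon, n in counts.items()
                if codon in ALL_CODONS and n > 0}
    missing = sorted(ALL_CODONS - observed)
    return len(observed) / len(ALL_CODONS), missing


# Toy counts for a single oligo; non-codon keys such as "none" are ignored.
counts = {"none": 1980, "ACT": 20, "GGC": 5, "AAA": 0}
coverage, missing = library_coverage(counts)
print(f"{coverage:.1%} of codons observed, {len(missing)} missing")
```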



## How It Works

OligoSeeker searches for specific oligonucleotide patterns in paired FASTQ reads. When it finds a match, it extracts the codon sequence (represented by NNN in the oligo pattern) and tallies its occurrence. The library handles both forward and reverse complement matching, ensuring comprehensive detection.

The basic count workflow is:
1. Load and validate oligo sequences
2. Process paired FASTQ files
3. Count codon occurrences for each oligo
4. Output results in CSV format
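
The matching idea can be sketched with a regular expression. This is a minimal illustration under stated assumptions, not OligoSeeker's actual implementation: the function names are hypothetical, each read pair contributes at most one codon, and the first mate or orientation that matches wins.

```python
import re
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def oligo_to_regex(oligo):
    """Turn 'AAANNNCCC' into a pattern whose only group captures the codon.
    Raises ValueError unless the oligo contains exactly one NNN site."""
    left, right = oligo.upper().split("NNN")
    return re.compile(f"{re.escape(left)}([ACGT]{{3}}){re.escape(right)}")


def find_codon(pattern, read_1, read_2):
    # Try each mate as sequenced, then as its reverse complement.
    for read in (read_1, read_2):
        for candidate in (read, reverse_complement(read)):
            match = pattern.search(candidate)
            if match:
                return match.group(1)
    return None


def count_codons(oligo, read_pairs):
    pattern = oligo_to_regex(oligo)
    counts = Counter()
    for read_1, read_2 in read_pairs:
        counts[find_codon(pattern, read_1, read_2) or "none"] += 1
    return counts


oligo = "GCGGATTACATTNNNAAATAACATCGT"
pairs = [
    ("TTGCGGATTACATTACTAAATAACATCGTGG", "ACGTACGTACGT"),              # forward hit
    ("ACGTACGTACGT", reverse_complement("GCGGATTACATTGGCAAATAACATCGT")),  # reverse hit
    ("AAAAAAAA", "CCCCCCCC"),                                          # no hit
]
counts = count_codons(oligo, pairs)
print(counts)
```

Matching the reverse complement of the read against the forward pattern means the captured codon is always reported in the oligo's own orientation.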

Additionally, the merge workflow allows you to:
1. Process multiple samples independently
2. Combine the count results from different runs
3. Sum the codon occurrences across samples
4. Analyze patterns across a larger dataset
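
Conceptually, merging is a per-oligo sum of codon counts. The sketch below shows that idea with plain dictionaries; `merge_counts` and the oligo names are hypothetical, and the library's own merge works on the CSV files instead.

```python
from collections import Counter, defaultdict


def merge_counts(runs):
    """Sum codon counts per oligo across runs.

    Each run is a mapping {oligo_name: {codon: count}}. An oligo that is
    absent from a run simply contributes nothing for that run.
    """
    merged = defaultdict(Counter)
    for run in runs:
        for oligo, codon_counts in run.items():
            merged[oligo].update(codon_counts)  # Counter.update adds counts
    return {oligo: dict(counts) for oligo, counts in merged.items()}


run_a = {"1_oligo": {"ACT": 20, "none": 1980}}
run_b = {"1_oligo": {"ACT": 20, "none": 1980},
         "2_oligo": {"GGC": 40, "none": 1960}}
merged = merge_counts([run_a, run_b])
print(merged)
```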

## Performance and Compatibility

OligoSeeker has been tested on both Linux and macOS platforms.

- **Test Case**: 1 oligo (33 bp) analyzed in 150 bp paired-end FASTQ files containing 300 million reads
- **Processing Time**:
  - ~1 hour on a high-performance compute cluster
  - ~1.5 hours on a standard MacBook Pro

### Scalability

For large datasets, throughput can be increased by splitting the input and processing the chunks in parallel:

1. **File Splitting**: Large FASTQ files are split into smaller chunks using [seqkit](https://bioinf.shenwei.me/seqkit/), a high-performance toolkit for FASTA/Q file manipulation
2. **Parallel Processing**: OligoSeeker is applied in parallel to each chunk independently
3. **Result Merging**: Individual results are merged using OligoSeeker's built-in merge functionality
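
One way to script these three steps from Python is sketched below. It only assembles and launches external commands: the `oligoseeker` flags come from the CLI reference in this README, and `seqkit split2` is called with its `-1`, `-2`, `-p` and `-O` options. The helper names, the directory layout, and the assumption that chunk files are named `<stem>.part_NNN...` are illustrative and should be checked against your seqkit version.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_command(fastq_1, fastq_2, n_parts, chunk_dir):
    # seqkit split2 splits both mates in step; -p sets the number of parts.
    return ["seqkit", "split2", "-1", fastq_1, "-2", fastq_2,
            "-p", str(n_parts), "-O", chunk_dir]


def count_command(chunk_1, chunk_2, oligos_file, out_dir, prefix):
    return ["oligoseeker", "-m", "count", "--f1", chunk_1, "--f2", chunk_2,
            "--oligos-file", oligos_file, "--output", out_dir, "--prefix", prefix]


def merge_command(counts_dir, merged_dir, output_file="merged_counts.csv"):
    return ["oligoseeker", "-m", "merge", "--input-dir", counts_dir,
            "--output", merged_dir, "--output-file", output_file]


def run_all(fastq_1, fastq_2, oligos_file, work_dir, n_parts=8, workers=4):
    work = Path(work_dir)
    chunks, counts, merged = work / "chunks", work / "counts", work / "merged"
    subprocess.run(split_command(fastq_1, fastq_2, n_parts, str(chunks)), check=True)

    # Assumption: chunk names start with the input file's stem, so sorting
    # each mate's chunks pairs part 001 with part 001, and so on.
    stem_1 = Path(fastq_1).name.split(".")[0]
    stem_2 = Path(fastq_2).name.split(".")[0]
    parts_1 = sorted(chunks.glob(f"{stem_1}.part_*"))
    parts_2 = sorted(chunks.glob(f"{stem_2}.part_*"))
    jobs = [count_command(str(a), str(b), oligos_file, str(counts), f"chunk{i:03d}")
            for i, (a, b) in enumerate(zip(parts_1, parts_2), start=1)]

    # Threads are sufficient here: each one just waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), jobs))

    subprocess.run(merge_command(str(counts), str(merged)), check=True)
```

On a cluster, the same per-chunk commands can be submitted as independent jobs instead of being run through a thread pool.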

## Quick Start

### Command-Line Usage

```bash
# Basic usage with oligos
oligoseeker -m count \
--f1 ../test_files/test_1.fq.gz \
--f2 ../test_files/test_2.fq.gz \
--oligos "GCGGATTACATTNNNAAATAACATCGT,TGTGGTAAGCGGNNNGAAAGCATTTGT" \
--output ../test_files/test_outs --prefix test_cm3

# Basic usage with oligos files
oligoseeker -m count \
--f1 ../test_files/test_1.fq.gz \
--f2 ../test_files/test_2.fq.gz \
--oligos-file '../test_files/oligos.txt' \
--output ../test_files/test_outs --prefix test_cm4

# Basic usage to merge oligo counts
oligoseeker -m merge \
--output-file 'merge_cl.csv' \
--input-dir ../test_files/test_outs \
--output ../test_files/merged 
```

### Python API Usage

Here's a simple example of using the Python API:

In [None]:
from OligoSeeker.pipeline import PipelineConfig, OligoCodonPipeline

# Create a configuration
config = PipelineConfig(
    fastq_1="../test_files/test_1.fq.gz",
    fastq_2="../test_files/test_2.fq.gz",
    oligos_list=["GCGGATTACATTNNNAAATAACATCGT", "TGTGGTAAGCGGNNNGAAAGCATTTGT", "GTCGTAGAAAATNNNTGGGTGATGAGC"],
    output_path="../test_files/test_outs",
    output_prefix='test1'
)

# Create and run the pipeline
pipeline = OligoCodonPipeline(config)
results = pipeline.run()

# Print the locations of output files
print(f"Results saved to: {results['csv_path']}")

2025-03-11 19:50:45,590 - INFO - Starting OligoCodonPipeline
2025-03-11 19:50:45,591 - INFO - Loading oligo sequences...
2025-03-11 19:50:45,591 - INFO - Using provided oligo list
2025-03-11 19:50:45,591 - INFO - Loaded 3 oligo sequences
2025-03-11 19:50:45,592 - INFO - Processing FASTQ files...


0it [00:00, ?it/s]

2025-03-11 19:50:45,666 - INFO - Formatting results...
2025-03-11 19:50:45,669 - INFO - Saving results to: ../test_files/test_outs/test1_counts.csv
2025-03-11 19:50:45,679 - INFO - Pipeline completed in 0.09 seconds


Results saved to: ../test_files/test_outs/test1_counts.csv


In [None]:
# This should show 20 (ACT), 40 (GGC) and 60 (AAA) matches for
# oligos 1, 2 and 3, respectively
import pandas as pd
out = pd.read_csv(results['csv_path'], index_col=[0])
out.head()

| | 1_GCGGATTACATTNNNAAATAACATCGT | 2_TGTGGTAAGCGGNNNGAAAGCATTTGT | 3_GTCGTAGAAAATNNNTGGGTGATGAGC |
|---|---|---|---|
| none | 1980.0 | 1960.0 | 1940.0 |
| ACT | 20.0 | 0.0 | 0.0 |
| GGC | 0.0 | 40.0 | 0.0 |
| AAA | 0.0 | 0.0 | 60.0 |


Here's the same example with the oligos listed in a file (one per line):

In [None]:
from OligoSeeker.pipeline import PipelineConfig, OligoCodonPipeline

# Create a configuration
config = PipelineConfig(
    fastq_1="../test_files/test_1.fq.gz",
    fastq_2="../test_files/test_2.fq.gz",
    oligos_file="../test_files/oligos.txt",
    output_path="../test_files/test_outs",
    output_prefix='test2'
)

# Create and run the pipeline
pipeline = OligoCodonPipeline(config)
results = pipeline.run()

# Print the locations of output files
print(f"Results saved to: {results['csv_path']}")

2025-03-11 19:51:08,402 - INFO - Starting OligoCodonPipeline
2025-03-11 19:51:08,403 - INFO - Loading oligo sequences...
2025-03-11 19:51:08,404 - INFO - Loading oligos from file: ../test_files/oligos.txt
2025-03-11 19:51:08,407 - INFO - Loaded 3 oligo sequences
2025-03-11 19:51:08,407 - INFO - Processing FASTQ files...


0it [00:00, ?it/s]

2025-03-11 19:51:08,462 - INFO - Formatting results...
2025-03-11 19:51:08,463 - INFO - Saving results to: ../test_files/test_outs/test2_counts.csv
2025-03-11 19:51:08,468 - INFO - Pipeline completed in 0.07 seconds


Results saved to: ../test_files/test_outs/test2_counts.csv


### Merging Count Files

You can merge multiple count files from different runs to combine results:


In [None]:
from OligoSeeker.merge import merge_count_csvs

# Merge all count files in a directory
merged_df = merge_count_csvs(
    input_dir="../test_files/test_outs",  # Directory containing count files
    output_file="merged_counts.csv",      # Output filename
    output_dir="../test_files/merged",    # Output directory
    pattern="*_counts.csv"                # Pattern to match files
)

print(f"Merged {len(merged_df)} codons across {len(merged_df.columns)} oligos")
merged_df.head()

Found 6 CSV files to merge
  Loaded ../test_files/test_outs/test_cm2_counts.csv with 4 rows and 3 columns
  Loaded ../test_files/test_outs/test2_counts.csv with 4 rows and 3 columns
  Loaded ../test_files/test_outs/test1_counts.csv with 4 rows and 3 columns
  Loaded ../test_files/test_outs/test_cm3_counts.csv with 3 rows and 2 columns
  Loaded ../test_files/test_outs/test_cm4_counts.csv with 4 rows and 3 columns
  Loaded ../test_files/test_outs/test_cm1_counts.csv with 4 rows and 3 columns
Merged data saved to ../test_files/merged/merged_counts.csv
Merged 4 codons across 3 oligos


| | 1_GCGGATTACATTNNNAAATAACATCGT | 2_TGTGGTAAGCGGNNNGAAAGCATTTGT | 3_GTCGTAGAAAATNNNTGGGTGATGAGC |
|---|---|---|---|
| AAA | 0.0 | 0.0 | 300.0 |
| ACT | 120.0 | 0.0 | 0.0 |
| GGC | 0.0 | 240.0 | 0.0 |
| none | 11880.0 | 11760.0 | 9700.0 |


## Modules

OligoSeeker is organized into several modules:

### Core

The [core module](./core.html) contains fundamental utilities and classes:
- DNA sequence operations (reverse complement, etc.)
- OligoRegex for pattern matching
- OligoLoader for loading and validating oligo sequences

### FASTQ Processing

The [FASTQ module](./fastq.html) handles reading and processing FASTQ files:
- FastqHandler for file operations
- OligoCodonProcessor for counting codons in FASTQ files

### Output

The [output module](./output.html) manages results formatting and saving:
- ResultsFormatter for converting results to DataFrames
- ResultsSaver for saving to various file formats

### Pipeline

The [pipeline module](./pipeline.html) provides the complete processing pipeline:
- PipelineConfig for configuration settings
- ProgressReporter for progress tracking
- OligoCodonPipeline for end-to-end processing

### Merge

The [merge module](./merge.html) provides functionality to combine multiple count results:
- Merge count CSV files by summing values
- Support for flexible output naming and location
- Pattern matching to select specific files

### CLI

The [CLI module](./cli.html) implements the command-line interface:
- Argument parsing
- Configuration validation
- Pipeline execution

## Command-Line Examples

For count mode (processing FASTQ files):
```bash
# Using oligos directly specified
oligoseeker -m count --f1 test_files/test_1.fq.gz --f2 test_files/test_2.fq.gz \
--oligos "GCGGATTACATTNNNAAATAACATCGT,TGTGGTAAGCGGNNNGAAAGCATTTGT" \
--output test_outs --prefix test_run1

# Using oligos from a file
oligoseeker -m count --f1 test_files/test_1.fq.gz --f2 test_files/test_2.fq.gz \
--oligos-file test_files/oligos.txt --output test_outs --prefix test_run2
```

For merge mode (combining multiple count files):
```bash
# Merge all count files in a directory
oligoseeker -m merge --input-dir test_outs --output test_outs/merged \
--output-file combined_counts.csv
```

## CLI Reference

```bash
usage: oligoseeker [-h] [-m {count,merge}] [--f1 FASTQ_PATH_1] [--f2 FASTQ_PATH_2]
                  [--oligos-file OLIGOS_FILE] [--oligos OLIGOS_STRING]
                  [--offset OFFSET_OLIGO] [--input-dir INPUT_DIR]
                  [--output-file OUTPUT_FILE] [--pattern PATTERN]
                  [-o OUTPUT_PATH] [--prefix OUTPUT_PREFIX]
                  [--log-file LOG_FILE]
                  [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

OligoSeeker: Process FASTQ files to count oligo codons

options:
  -h, --help            show this help message and exit
  -m {count,merge}, --mode {count,merge}
                        Operation mode: 'count' to process FASTQ files or 'merge' to combine CSV counts (default: count)
  -o OUTPUT_PATH, --output OUTPUT_PATH
                        Output directory for results (default: ../test_files/test_outs)
  --prefix OUTPUT_PREFIX
                        Prefix for output files (default: )
  --log-file LOG_FILE   Path to log file (if not specified, logs to console only)
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Logging level (default: INFO)

Count Mode Options:
  --f1 FASTQ_PATH_1, --fastq_1 FASTQ_PATH_1
                        Path to FASTQ 1 file (default: ../test_fastq_files/test_1.fq.gz)
  --f2 FASTQ_PATH_2, --fastq_2 FASTQ_PATH_2
                        Path to FASTQ 2 file (default: ../test_fastq_files/test_2.fq.gz)

Oligo Source Options:
  --oligos-file OLIGOS_FILE
                        File containing oligo sequences (one per line)
  --oligos OLIGOS_STRING
                        Comma-separated list of oligo sequences
                        (default: GCGGATTACATTNNNAAATAACATCGT,TGTGGTAAGCGGNNNGAAAGCATTTGT,GTCGTAGAAAATNNNTGGGTGATGAGC)
  --offset OFFSET_OLIGO
                        Value to add to oligo index in output (default: 1)

Merge Mode Options:
  --input-dir INPUT_DIR
                        Directory containing CSV files to merge (required for merge mode)
  --output-file OUTPUT_FILE
                        Name of the output merged file (default: merged_counts.csv)
  --pattern PATTERN     Pattern to match CSV files (default: *count*.csv)
```

## Data Requirements

OligoSeeker works with standard paired FASTQ files, which should be named according to common conventions:

- Read 1: `*_1.fq.gz`, `*_R1.fastq.gz`, or `*_R1_001.fastq.gz`
- Read 2: `*_2.fq.gz`, `*_R2.fastq.gz`, or `*_R2_001.fastq.gz`

The oligo sequences should include a codon site marked with `NNN`. For example:
```
GAACNNNCAT
TGACNNNTAG
```

This specifies that the 3 bases following `GAAC` or `TGAC` should be captured as the codon.
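
A simple pre-flight check for an oligo list might look like the sketch below. `validate_oligo` and `load_oligos` are hypothetical helpers, separate from the library's `OligoLoader`; the rule they enforce (exactly one `NNN` site with A/C/G/T flanks on both sides) is an assumption based on the examples above, not a statement of what OligoSeeker accepts.

```python
import re

# One NNN site, flanked on both sides by at least one unambiguous base.
_OLIGO_RE = re.compile(r"[ACGT]+NNN[ACGT]+")


def validate_oligo(oligo):
    """Return the cleaned, upper-cased oligo or raise ValueError."""
    cleaned = oligo.strip().upper()
    if not _OLIGO_RE.fullmatch(cleaned):
        raise ValueError(f"invalid oligo: {oligo!r}")
    return cleaned


def load_oligos(lines):
    """Validate an iterable of lines (one oligo per line), skipping blank lines."""
    return [validate_oligo(line) for line in lines if line.strip()]


oligos = load_oligos(["GAACNNNCAT\n", "\n", "tgacnnntag"])
print(oligos)
```

Passing an open file handle to `load_oligos` works the same way, since iterating a text file yields its lines.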

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

### Development Setup

1. Clone the repository
2. Install development dependencies:
   ```bash
   pip install -e ".[dev]"
   pip install nbdev
   ```
3. Make changes to the notebook files in the `nbs` directory
4. Build the library:
   ```bash
   nbdev_export
   ```
5. Build the documentation:
   ```bash
   nbdev_docs
   ```

## License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
