# BarcodeSeqKit

> Extract and classify sequences based on barcode presence in BAM and FASTQ files

## Overview

BarcodeSeqKit is a Python library for extracting and classifying sequencing reads based on the presence of specific barcode sequences. It supports both BAM and FASTQ file formats and provides flexible options for barcode matching and output generation.

Key features:
- Process both BAM and FASTQ files (including paired-end data)
- Support for single barcodes or specific 5'/3' barcodes
- Detect barcodes in forward and reverse complement orientations
- Options for fuzzy matching with configurable mismatches
- Search in softclipped regions of BAM alignments
- Optionally search both reads of paired-end FASTQ files
- Comprehensive statistics on barcode matches

## Installation

You can install the package via pip:

```bash
pip install barcodeseqkit
```

Or directly from the repository:

```bash
pip install git+https://github.com/username/BarcodeSeqKit.git
```

## Usage

BarcodeSeqKit can be used as a Python library or as a command-line tool. Below, we'll demonstrate both usage patterns.

### Command-line Usage

BarcodeSeqKit provides a command-line interface through the `barcodeseqkit` command. Here are the basic arguments:

```
usage: barcodeseqkit [-h] [--bam BAM | --fastq1 FASTQ1 | --fastq-dir FASTQ_DIR] [--fastq2 FASTQ2]
                      [--barcode-config BARCODE_CONFIG | --barcode BARCODE | --barcode5 BARCODE5] [--barcode3 BARCODE3]
                      [--max-mismatches MAX_MISMATCHES] --output-prefix OUTPUT_PREFIX
                      [--output-dir OUTPUT_DIR] [--merge-orientations] [--search-both-reads]
                      [--search-softclipped] [--no-compress] [--verbose] [--log-file LOG_FILE]
```

#### Single Barcode Example

When you have a single barcode sequence, you can use the `--barcode` option. This creates two output files: one for reads matching the barcode in forward orientation (`barcode_orientFR`) and one for reads matching it in reverse complement orientation (`barcode_orientRC`).

Let's run an example using a test BAM file:

```bash
barcodeseqkit --bam test.bam \
              --barcode CTGACTCCTTAAGGGCC \
              --output-prefix test_out_1 \
              --output-dir ./test_output
```

This command will:
1. Process the BAM file `test.bam`
2. Search for the barcode sequence `CTGACTCCTTAAGGGCC` (and its reverse complement)
3. Create the following output files in the `./test_output` directory:
   - `test_out_1_barcode_orientFR.bam`: Reads with the barcode in forward orientation
   - `test_out_1_barcode_orientRC.bam`: Reads with the barcode in reverse complement orientation
   - `test_out_1_extraction_stats.json` and `test_out_1_extraction_stats.tsv`: Extraction statistics
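
To make the forward/reverse-complement split concrete, here is a minimal sketch of how a read could be assigned to the `FR` or `RC` category with exact matching. The helper names `reverse_complement` and `classify_orientation` are hypothetical and are not taken from the BarcodeSeqKit API; the library's own logic may differ, for example in how it treats a read that contains both forms.

```python
# Illustrative only: these helpers are assumptions, not BarcodeSeqKit functions.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_orientation(read_seq: str, barcode: str):
    """Return 'FR', 'RC', or None for a read, using exact substring matching.

    A read containing both forms is reported as 'FR' in this sketch.
    """
    if barcode in read_seq:
        return "FR"
    if reverse_complement(barcode) in read_seq:
        return "RC"
    return None


BARCODE = "CTGACTCCTTAAGGGCC"
forward_read = "AAAA" + BARCODE + "TTTT"
reverse_read = "GGGG" + reverse_complement(BARCODE) + "CCCC"

print(classify_orientation(forward_read, BARCODE))   # FR
print(classify_orientation(reverse_read, BARCODE))   # RC
print(classify_orientation("ACGTACGT", BARCODE))     # None
```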

#### Dual Barcode Example

When you have two barcode sequences with specific locations (5' and 3'), you can use the `--barcode5` and `--barcode3` options. This creates four output files for all combinations of barcode locations and orientations.

Let's run an example with both 5' and 3' barcodes:

```bash
barcodeseqkit --bam test.bam \
              --barcode5 CTGACTCCTTAAGGGCC \
              --barcode3 TAACTGAGGCCGGC \
              --output-prefix test_out_2 \
              --output-dir ./test_output
```

This command will:
1. Process the BAM file `test.bam`
2. Search for the 5' barcode sequence `CTGACTCCTTAAGGGCC` and the 3' barcode sequence `TAACTGAGGCCGGC` (and their reverse complements)
3. Create the following output files in the `./test_output` directory:
   - `test_out_2_barcode5_orientFR.bam`: Reads with the 5' barcode in forward orientation
   - `test_out_2_barcode5_orientRC.bam`: Reads with the 5' barcode in reverse complement orientation
   - `test_out_2_barcode3_orientFR.bam`: Reads with the 3' barcode in forward orientation
   - `test_out_2_barcode3_orientRC.bam`: Reads with the 3' barcode in reverse complement orientation
   - `test_out_2_extraction_stats.json` and `test_out_2_extraction_stats.tsv`: Extraction statistics
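
The naming pattern in the file lists above can be sketched as a small helper that enumerates the expected outputs for a run. `expected_outputs` is a hypothetical function written for this README; it only reproduces the pattern shown above (`<prefix>_<barcode>_orient<FR|RC>.bam` plus the two statistics files) and does not call into BarcodeSeqKit.

```python
from itertools import product
from pathlib import Path


def expected_outputs(prefix, output_dir, barcode_names, extension="bam"):
    """List the output paths implied by the naming pattern shown above."""
    out = Path(output_dir)
    files = [
        out / f"{prefix}_{name}_orient{orient}.{extension}"
        for name, orient in product(barcode_names, ("FR", "RC"))
    ]
    # The statistics files are written once per run, in both formats.
    files += [out / f"{prefix}_extraction_stats.{ext}" for ext in ("json", "tsv")]
    return files


for path in expected_outputs("test_out_2", "./test_output", ["barcode5", "barcode3"]):
    print(path.name)
```

A helper like this can be handy in a pipeline for checking that every expected file exists after a run.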

#### FASTQ File Example

BarcodeSeqKit can also process FASTQ files with the same barcode specification options:

```bash
barcodeseqkit --fastq1 read1.fastq.gz \
              --fastq2 read2.fastq.gz \
              --barcode5 CTGACTCCTTAAGGGCC \
              --output-prefix fastq_test \
              --output-dir ./test_output \
              --search-both-reads
```

This will process the paired FASTQ files, searching for the barcode in both reads, and create corresponding output FASTQ files for each barcode category.
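
The effect of `--search-both-reads` can be illustrated with a small, self-contained sketch that walks two FASTQ streams in lockstep. The functions `read_fastq` and `pairs_with_barcode` are hypothetical names for this example, the matching is exact and forward-only for brevity, and the assumption that read 1 alone is searched when the flag is off is ours, not something the library documents here.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ text handle."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header[1:].split()[0], seq, qual


def pairs_with_barcode(fq1, fq2, barcode, search_both_reads=False):
    """Return names of read pairs in which the barcode is found.

    Assumption: with search_both_reads=False only read 1 is inspected.
    """
    hits = []
    for (name1, seq1, _), (_name2, seq2, _) in zip(read_fastq(fq1), read_fastq(fq2)):
        if barcode in seq1 or (search_both_reads and barcode in seq2):
            hits.append(name1)
    return hits


def fastq_text(records):
    """Build FASTQ text from (name, sequence) pairs with dummy qualities."""
    return "".join(f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n" for name, seq in records)


BARCODE = "CTGACTCCTTAAGGGCC"
r1_text = fastq_text([("pair1", "AA" + BARCODE + "TT"), ("pair2", "ACGTACGTACGT"), ("pair3", "TTTTGGGGCCCC")])
r2_text = fastq_text([("pair1", "ACGTACGTACGT"), ("pair2", "GG" + BARCODE + "AA"), ("pair3", "CCCCAAAATTTT")])

print(pairs_with_barcode(io.StringIO(r1_text), io.StringIO(r2_text), BARCODE))
print(pairs_with_barcode(io.StringIO(r1_text), io.StringIO(r2_text), BARCODE, search_both_reads=True))
```

For real gzipped inputs, the same functions would work on handles opened with `gzip.open(path, "rt")`.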

## Special Features

### Extracting Softclipped Regions from BAM Files

BarcodeSeqKit includes an option to analyze only the softclipped sequences from read alignments. Barcodes are often found in softclipped regions because they are not part of the reference sequence, so aligners tend to clip them rather than align them. This feature is also useful for finding spliced leader sequences in trypanosomatids.

The `--search-softclipped` option extracts orientation-specific softclipped sequences:
- For reads on the forward strand (+): extracts the softclipped sequence at the 5' end
- For reads on the reverse strand (-): extracts the softclipped sequence at the 3' end

```bash
barcodeseqkit --bam test.bam \
              --barcode CTGACTCCTTAAGGGCC \
              --output-prefix softclip_test \
              --search-softclipped
```
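
As a rough illustration of the strand-specific rule described above, the sketch below pulls softclipped bases out of a read using only its CIGAR string. The function names are hypothetical and this is not BarcodeSeqKit's implementation. It also assumes that "5' end" and "3' end" refer to the left and right ends of the sequence as stored in the BAM record; that reading of the rule is an assumption.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def softclipped_ends(sequence, cigar):
    """Return (left_clip, right_clip) of a read as stored in the BAM record."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are absent from the stored sequence, so ignore them.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return sequence[:left], sequence[len(sequence) - right:]


def oriented_softclip(sequence, cigar, is_reverse):
    """Pick the clip for the read's strand (assumed mapping: + left, - right)."""
    left, right = softclipped_ends(sequence, cigar)
    return right if is_reverse else left


seq = "ACGTACGTAC"
print(softclipped_ends(seq, "3S5M2S"))
print(oriented_softclip(seq, "3S5M2S", is_reverse=False))
print(oriented_softclip(seq, "3S5M2S", is_reverse=True))
```

With a library such as pysam, the sequence, CIGAR string, and strand would come from each alignment record; plain strings are used here to keep the example self-contained.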

### Examining Extraction Statistics

BarcodeSeqKit generates detailed extraction statistics in both JSON and TSV formats. Let's examine a sample statistics file:

```python
import pandas as pd

# Example (modify the path to point at a real stats file)
# df = pd.read_csv('./test_output/test_out_1_extraction_stats.tsv', sep='\t')
# df
```

The statistics file provides information about:
- Total reads processed
- Total barcode matches found
- Reads without barcode matches
- Match rate
- Counts by barcode type (5'/3'/generic)
- Counts by orientation (forward/reverse complement)
- Counts by specific category
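
To show how the quantities listed above relate to each other, here is a small sketch that derives them from a list of per-read classifications. It is an illustration on toy data, not the library's statistics code. Apart from `total_reads` and `total_barcode_matches`, which appear as attributes in the programmatic examples in this README, the key names are hypothetical, and the sketch assumes each read falls into at most one category.

```python
from collections import Counter


def summarize(categories):
    """Summarize classifications; each entry is a category label or None."""
    total = len(categories)
    by_category = Counter(c for c in categories if c is not None)
    matched = sum(by_category.values())
    return {
        "total_reads": total,
        "total_barcode_matches": matched,
        "reads_without_match": total - matched,      # hypothetical key name
        "match_rate": matched / total if total else 0.0,
        "by_category": dict(by_category),            # hypothetical key name
    }


# Toy input: four reads, two of which matched a category.
toy = ["barcode5_orientFR", None, "barcode5_orientRC", None]
summary = summarize(toy)
print(summary["match_rate"])
print(summary["by_category"])
```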

## Programmatic Usage

You can also use BarcodeSeqKit programmatically in your Python code. The library has been redesigned with a simplified, direct interface that makes it easy to extract barcodes from both BAM and FASTQ files.

### Processing BAM Files

```python
from BarcodeSeqKit.core import BarcodeConfig, BarcodeLocationType, BarcodeSeqKitConfig
from BarcodeSeqKit.bam_processing import extract_barcodes_from_bam

# Define barcode configurations
barcode5 = BarcodeConfig(
    sequence="CTGACTCCTTAAGGGCC",
    location=BarcodeLocationType.FIVE_PRIME,
    name="5",
    description="5' barcode sequence"
)

barcode3 = BarcodeConfig(
    sequence="TAACTGAGGCCGGC",
    location=BarcodeLocationType.THREE_PRIME,
    name="3",
    description="3' barcode sequence"
)

# Create configuration
config = BarcodeSeqKitConfig(
    barcodes=[barcode5, barcode3],
    output_prefix="programmatic_example",
    output_dir="./output",
    max_mismatches=0,
    search_softclipped=True,
    verbose=True
)

# Extract barcodes (commented out to avoid execution)
# stats = extract_barcodes_from_bam("test.bam", config)
# print(f"Extraction complete: {stats.total_barcode_matches} matches in {stats.total_reads} alignments")
```

### Processing FASTQ Files

```python
from BarcodeSeqKit.core import BarcodeSeqKitConfig
from BarcodeSeqKit.fastq_processing import extract_barcodes_from_fastq

# Create configuration for FASTQ processing
# (reuses barcode5 and barcode3 defined in the BAM example above)
config = BarcodeSeqKitConfig(
    barcodes=[barcode5, barcode3],
    output_prefix="fastq_example",
    output_dir="./output",
    search_both_reads=True,
    compress_output=True,
    verbose=True
)

# Extract barcodes (commented out to avoid execution)
# fastq_files = ["read1.fastq.gz", "read2.fastq.gz"]
# stats = extract_barcodes_from_fastq(fastq_files, config)
# print(f"Extraction complete: {stats.total_barcode_matches} matches in {stats.total_reads} reads")
```

## Advanced Options

BarcodeSeqKit provides several advanced options for barcode extraction:

### Command-Line Options

- **Fuzzy matching**: Use `--max-mismatches` to allow a specific number of mismatches in barcode detection
- **Paired-end FASTQ files**: Use `--fastq1` and `--fastq2` to process paired-end FASTQ files
- **Softclipped regions**: Use `--search-softclipped` to search in softclipped regions of BAM alignments
- **Both reads**: Use `--search-both-reads` to search for barcodes in both reads of paired FASTQ files
- **Merge orientations**: Use `--merge-orientations` to create combined output files for different orientations
- **Output compression**: Use `--no-compress` to disable compression for FASTQ output files
- **Verbose logging**: Use `--verbose` to enable detailed logging

For a complete list of options, run `barcodeseqkit --help`.
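
To give a feel for what a mismatch budget such as `--max-mismatches` means, the following sketch scans a read with a sliding window and accepts the first window whose Hamming distance to the barcode is within the limit. This is one simple way to do fuzzy matching and is only an assumption about the general idea. It handles substitutions but not insertions or deletions, and `hamming_within` and `find_barcode` are hypothetical names rather than BarcodeSeqKit functions.

```python
def hamming_within(a, b, max_mismatches):
    """Return True if equal-length strings differ at no more than max_mismatches positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False  # stop early once the budget is exceeded
    return True


def find_barcode(read_seq, barcode, max_mismatches=0):
    """Return the 0-based start of the first acceptable window, or -1 if none."""
    k = len(barcode)
    for start in range(len(read_seq) - k + 1):
        if hamming_within(read_seq[start:start + k], barcode, max_mismatches):
            return start
    return -1


BARCODE = "CTGACTCCTTAAGGGCC"
# Introduce a single substitution (T -> A) at position 9 of the barcode.
mutated = BARCODE[:9] + "A" + BARCODE[10:]
read = "TTT" + mutated + "GGG"

print(find_barcode(read, BARCODE, max_mismatches=0))  # -1
print(find_barcode(read, BARCODE, max_mismatches=1))  # 3
```

Allowing more mismatches raises sensitivity but also the chance of spurious hits, especially for short barcodes.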

## Library Structure

BarcodeSeqKit has been redesigned with a simplified, modular structure:

1. **Core Module** (`00_core.ipynb`):
   - Core data structures and enumerations
   - Configuration handling
   - Statistics tracking

2. **Sequence Utilities** (`01_sequence_utils.ipynb`):
   - Sequence manipulation functions
   - Barcode detection algorithms
   - Classification utilities

3. **BAM Processing** (`02_bam_processing.ipynb`):
   - BAM file handling
   - Barcode extraction from BAM alignments
   - Softclipped region analysis

4. **FASTQ Processing** (`03_fastq_processing.ipynb`):
   - FASTQ file handling
   - Paired-end read processing
   - Output file management

5. **Command-Line Interface** (`04_cli.ipynb`):
   - Argument parsing
   - Configuration setup
   - Main entry point

## Conclusion

BarcodeSeqKit provides a flexible and efficient way to extract and classify sequences based on barcode presence. Whether you're working with BAM or FASTQ files, single or multiple barcodes, BarcodeSeqKit offers a straightforward interface for your barcode extraction needs.

The library features:
- A simplified, direct design that's easy to understand and use
- Comprehensive support for both BAM and FASTQ file formats
- Flexible barcode configuration options
- Detailed extraction statistics
- Both command-line and programmatic interfaces

For more detailed information, check out the documentation for each module.