# BarcodeSeqKit

> Extract and classify sequences based on barcode presence in BAM and FASTQ files

## Overview

BarcodeSeqKit is a Python library for extracting and classifying sequencing reads based on the presence of specific barcode sequences. It supports both BAM and FASTQ file formats and provides flexible options for barcode matching and output generation.

## Installation

You can install the package via pip:

```bash
pip install barcodeseqkit
```

Or directly from the repository:

```bash
pip install git+https://github.com/username/BarcodeSeqKit.git
```

## Usage

BarcodeSeqKit can be used as a Python library or as a command-line tool. Below, we'll demonstrate both usage patterns.

### Command-line Usage

BarcodeSeqKit provides a command-line interface through the `barcodeseqkit` command. Here are the basic arguments:

```
usage: barcodeseqkit [-h] [--bam BAM | --fastq1 FASTQ1 | --fastq-dir FASTQ_DIR] [--fastq2 FASTQ2]
                       [--barcode-config BARCODE_CONFIG | --barcode BARCODE | --barcode5 BARCODE5] [--barcode3 BARCODE3]
                       [--max-mismatches MAX_MISMATCHES] --output-prefix OUTPUT_PREFIX
                       [--output-dir OUTPUT_DIR] [--discard-unmatched]
                       [--search-both-reads] [--verbose] [--log-file LOG_FILE]
```

#### Single Barcode Example

When you have a single barcode sequence, you can use the `--barcode` option. This creates two output files with reads matching the barcode in forward orientation (`barcode_orientFR`) and reverse complement orientation (`barcode_orientRC`).

Let's run an example using the test BAM file provided with the package:

```bash
barcodeseqkit --bam ../tests/test.bam \
  --barcode CTGACTCCTTAAGGGCC --output-prefix test_out_1 \
  --output-dir ../tests/test_out_1
```

This command will:
1. Process the BAM file at `tests/test.bam`
2. Search for the barcode sequence `CTGACTCCTTAAGGGCC` (and its reverse complement)
3. Create the following output files in the `tests/test_out_1` directory:
   - `test_out_1_barcode_orientFR.bam`: Reads with the barcode in forward orientation
   - `test_out_1_barcode_orientRC.bam`: Reads with the barcode in reverse complement orientation
   - `test_out_1_noBarcode.bam`: Reads without the barcode
   - `test_out_1_extraction_stats.json` and `test_out_1_extraction_stats.tsv`: Extraction statistics
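
To make the two orientations concrete, here is a minimal sketch of how a read could be assigned to `FR` or `RC` by exact substring search. It is an illustration only: the helper names `reverse_complement` and `classify_orientation` are hypothetical and are not part of BarcodeSeqKit, whose internal matching logic may differ.

```python
from typing import Optional

# Illustrative sketch only; not BarcodeSeqKit's internal implementation.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_orientation(read: str, barcode: str) -> Optional[str]:
    """Return 'FR' if the barcode occurs as given, 'RC' if only its
    reverse complement occurs, and None if neither is found."""
    if barcode in read:
        return "FR"
    if reverse_complement(barcode) in read:
        return "RC"
    return None


barcode = "CTGACTCCTTAAGGGCC"
forward_read = "AAA" + barcode + "TTT"
reverse_read = "AAA" + reverse_complement(barcode) + "TTT"

print(classify_orientation(forward_read, barcode))  # FR
print(classify_orientation(reverse_read, barcode))  # RC
print(classify_orientation("ACGTACGT", barcode))    # None
```

In this sketch a read containing both orientations is reported as `FR`, which is an arbitrary choice made for simplicity.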

#### Dual Barcode Example

When you have two barcode sequences with specific locations (5' and 3'), you can use the `--barcode5` and `--barcode3` options. This creates four output files for all combinations of barcode locations and orientations.

Let's run an example with both 5' and 3' barcodes:

```bash
barcodeseqkit --bam ../tests/test.bam \
  --barcode5 CTGACTCCTTAAGGGCC \
  --barcode3 TAACTGAGGCCGGC \
  --output-prefix test_out_2 --output-dir ../tests/test_out_2
```

This command will:
1. Process the BAM file at `tests/test.bam`
2. Search for the 5' barcode sequence `CTGACTCCTTAAGGGCC` and the 3' barcode sequence `TAACTGAGGCCGGC` (and their reverse complements)
3. Create the following output files in the `tests/test_out_2` directory:
   - `test_out_2_barcode5_orientFR.bam`: Reads with the 5' barcode in forward orientation
   - `test_out_2_barcode5_orientRC.bam`: Reads with the 5' barcode in reverse complement orientation
   - `test_out_2_barcode3_orientFR.bam`: Reads with the 3' barcode in forward orientation
   - `test_out_2_barcode3_orientRC.bam`: Reads with the 3' barcode in reverse complement orientation
   - `test_out_2_noBarcode.bam`: Reads without any barcode
   - `test_out_2_extraction_stats.json` and `test_out_2_extraction_stats.tsv`: Extraction statistics
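
The file names listed above follow a regular pattern, so downstream scripts can predict them. The sketch below reproduces that pattern for BAM output. The helper `expected_outputs` is hypothetical (not part of BarcodeSeqKit) and is inferred only from the file names shown in this README, so treat it as an assumption rather than a guaranteed contract.

```python
from typing import List


def expected_outputs(prefix: str, barcode_names: List[str]) -> List[str]:
    """List the BAM and statistics file names implied by the naming
    pattern shown in this README (hypothetical helper)."""
    names = []
    for name in barcode_names:
        for orientation in ("FR", "RC"):
            names.append(f"{prefix}_{name}_orient{orientation}.bam")
    # Reads without any barcode match go to a separate file.
    names.append(f"{prefix}_noBarcode.bam")
    # Statistics are written in both JSON and TSV form.
    names.append(f"{prefix}_extraction_stats.json")
    names.append(f"{prefix}_extraction_stats.tsv")
    return names


files = expected_outputs("test_out_2", ["barcode5", "barcode3"])
for file_name in files:
    print(file_name)
```

For the single barcode example, `expected_outputs("test_out_1", ["barcode"])` yields the five file names listed in that section.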

## Examining Extraction Statistics

Let's examine the extraction statistics from both examples. First, we'll load the necessary libraries:

```python
import pandas as pd
import json
import os
```

### Single Barcode Statistics

Let's load and display the statistics for the single barcode example:

```python
df = pd.read_csv('../tests/test_out_1/test_out_1_extraction_stats.tsv', sep='\t')
df
```

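|   | Metric | Value |
|---|--------|-------|
| 0 | TotalReads | 498 |
| 1 | TotalBarcodeMatches | 10 |
| 2 | NoBarcodeCount | 481 |
| 3 | MatchRate | 0.0201 |
| 4 | Barcode | Count |
| 5 | generic | 10 |
| 6 | Orientation | Count |
| 7 | FR | 7 |
| 8 | RC | 3 |
| 9 | Category | Count |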


### Dual Barcode Statistics

Now let's examine the statistics for the dual barcode example:

```python
df = pd.read_csv('../tests/test_out_2/test_out_2_extraction_stats.tsv', sep='\t')
df
```

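|   | Metric | Value |
|---|--------|-------|
| 0 | TotalReads | 498 |
| 1 | TotalBarcodeMatches | 18 |
| 2 | NoBarcodeCount | 470 |
| 3 | MatchRate | 0.0361 |
| 4 | Barcode | Count |
| 5 | 5 | 10 |
| 6 | 3 | 8 |
| 7 | Orientation | Count |
| 8 | FR | 10 |
| 9 | RC | 8 |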
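
The statistics table interleaves summary metrics with per-barcode and per-orientation counts, and rows whose value is `Count` act as sub-headers. The sketch below splits such a table into sections, assuming that layout holds. The helper `split_sections` is hypothetical and not part of BarcodeSeqKit. The rows are copied from the dual barcode statistics so the example runs without the output files.

```python
from typing import Dict, List, Tuple

# Rows copied from the dual barcode statistics shown in this section.
rows: List[Tuple[str, str]] = [
    ("TotalReads", "498"),
    ("TotalBarcodeMatches", "18"),
    ("NoBarcodeCount", "470"),
    ("MatchRate", "0.0361"),
    ("Barcode", "Count"),
    ("5", "10"),
    ("3", "8"),
    ("Orientation", "Count"),
    ("FR", "10"),
    ("RC", "8"),
]


def split_sections(table: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Group rows into sections; a row whose value is 'Count' starts a
    new section named after its metric (assumed layout)."""
    sections: Dict[str, Dict[str, float]] = {"Summary": {}}
    current = "Summary"
    for metric, value in table:
        if value == "Count":
            current = metric
            sections[current] = {}
        else:
            sections[current][metric] = float(value)
    return sections


sections = split_sections(rows)
summary = sections["Summary"]

# The reported match rate is consistent with matches / total reads.
match_rate = summary["TotalBarcodeMatches"] / summary["TotalReads"]
print(sections["Barcode"])
print(sections["Orientation"])
print(round(match_rate, 4))
```

With the DataFrame loaded earlier, equivalent rows could likely be built with `list(zip(df["Metric"], df["Value"].astype(str)))`, assuming the column names shown in the table.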


## Programmatic Usage

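You can also use BarcodeSeqKit programmatically in your Python code. The examples below cover both BAM and FASTQ input.

The paired-end FASTQ files used in the FASTQ example can be generated from the test BAM file with `samtools` and `bamToFastq`:

```bash
samtools sort -n -o ../tests/test_sorted_by_name.bam ../tests/test.bam
bamToFastq -i ../tests/test_sorted_by_name.bam -fq ../tests/test.1.fastq -fq2 ../tests/test.2.fastq
# The FASTQ example reads both files in gzipped form
gzip ../tests/test.1.fastq
gzip ../tests/test.2.fastq
```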

The following example extracts reads from the test BAM file using a 5' and a 3' barcode:

```python
from BarcodeSeqKit.core import BarcodeConfig, ExtractorConfig, ExtractorFactory, BarcodeLocationType

# Define barcode configurations
barcode5 = BarcodeConfig(
    sequence="CTGACTCCTTAAGGGCC",
    location=BarcodeLocationType.FIVE_PRIME,
    name="5",
    description="5' barcode sequence"
)

barcode3 = BarcodeConfig(
    sequence="TAACTGAGGCCGGC",
    location=BarcodeLocationType.THREE_PRIME,
    name="3",
    description="3' barcode sequence"
)

# Create extractor configuration
config = ExtractorConfig(
    barcodes=[barcode5, barcode3],
    output_prefix="programmatic_example",
    output_dir="../tests/programmatic_out",
    keep_unmatched=True,
    verbose=True
)

# Create the extractor and run the extraction
extractor = ExtractorFactory.create_extractor(config, "../tests/test.bam")
stats = extractor.extract()
print(f"Extraction complete: {stats.total_barcode_matches} matches in {stats.total_reads} alignments")
```

Output:

```
2025-03-14 10:00:29,646 - BarcodeSeqKit - INFO - Extraction complete: 18 matches in 498 reads
2025-03-14 10:00:29,740 - BarcodeSeqKit - INFO - Statistics saved to ../tests/programmatic_out/programmatic_example_extraction_stats.json and ../tests/programmatic_out/programmatic_example_extraction_stats.tsv

Extraction complete: 18 matches in 498 alignments
```


FASTQ input can be processed directly with `FastqExtractor`, reusing the barcode configurations defined above:

```python
from BarcodeSeqKit.fastq_processing import FastqExtractor

# Initialize the extractor
extractor = FastqExtractor(
    barcodes=[barcode5, barcode3],
    output_prefix="res",
    fastq_files=['../tests/test.1.fastq.gz', '../tests/test.2.fastq.gz'],
    output_dir='../tests/programmatic_out_fastq',
    verbose=True
)

# Run the extraction
stats = extractor.extract()
print(f"Extraction complete: {stats.total_barcode_matches} matches in {stats.total_reads} reads")
```

Output:

```
2025-03-14 10:00:42,365 - BarcodeSeqKit - INFO - Statistics saved to ../tests/programmatic_out_fastq/res_extraction_stats.json and ../tests/programmatic_out_fastq/res_extraction_stats.tsv

Extraction complete: 18 matches in 247 reads
```


## Advanced Options

BarcodeSeqKit provides several advanced options for barcode extraction:

- **Fuzzy matching**: Use `--max-mismatches` to allow a specific number of mismatches in barcode detection.
- **Paired-end FASTQ files**: Use `--fastq1` and `--fastq2` to process paired-end FASTQ files.
- **Discarding unmatched reads**: Use `--discard-unmatched` to exclude reads without barcode matches from the output.
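
As an illustration of what fuzzy matching means, the sketch below scans a read with a sliding window and accepts the first window within a given number of mismatches. It assumes mismatches are substitutions only (Hamming distance); this README does not say whether BarcodeSeqKit also tolerates insertions or deletions, and the helper names `hamming` and `find_barcode` are hypothetical.

```python
from typing import Optional


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_barcode(read: str, barcode: str, max_mismatches: int = 0) -> Optional[int]:
    """Return the start of the first window of the read that matches the
    barcode with at most max_mismatches substitutions, or None."""
    width = len(barcode)
    for start in range(len(read) - width + 1):
        if hamming(read[start:start + width], barcode) <= max_mismatches:
            return start
    return None


barcode = "CTGACTCCTTAAGGGCC"
# Same barcode with its first base substituted (C -> A).
mutated_read = "GGGG" + "ATGACTCCTTAAGGGCC" + "GGGG"

print(find_barcode(mutated_read, barcode, max_mismatches=0))  # None
print(find_barcode(mutated_read, barcode, max_mismatches=1))  # 4
```

Allowing more mismatches raises the chance of spurious hits, especially for short barcodes, so a small value is usually the safer starting point.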

For more options, run `barcodeseqkit --help`.
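
The options listed in this section can be combined in a single invocation. The sketch below only assembles and prints such a command from Python, using flags that appear in the usage summary. The FASTQ file names, output prefix, and output directory are hypothetical placeholders, and the command is not executed here.

```python
import shlex

# Flags are taken from the usage summary in this README;
# file names and the output location are hypothetical placeholders.
cmd = [
    "barcodeseqkit",
    "--fastq1", "reads_R1.fastq.gz",
    "--fastq2", "reads_R2.fastq.gz",
    "--barcode", "CTGACTCCTTAAGGGCC",
    "--max-mismatches", "1",
    "--discard-unmatched",
    "--output-prefix", "fastq_example",
    "--output-dir", "fastq_out",
]

# Print a shell-safe version of the command for inspection.
command_line = shlex.join(cmd)
print(command_line)

# To actually run it, one could pass the list to subprocess.run:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.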

## Conclusion

BarcodeSeqKit provides a flexible and efficient way to extract and classify sequences based on barcode presence. Whether you're working with BAM or FASTQ files, single or multiple barcodes, BarcodeSeqKit offers a straightforward interface for your barcode extraction needs.