# EpiBERT Data Processing Workflow

This notebook demonstrates the complete data processing pipeline for preparing input data for EpiBERT analysis. 

## Overview

EpiBERT requires two main types of input data:

1. **Motif enrichment file**: Generated using Simple Enrichment Analysis (SEA) from the MEME suite
2. **Processed ATAC-seq data**: Tn5-adjusted fragment end (cut-site) coverage in bedgraph format, bgzip-compressed and tabix-indexed

## Prerequisites

Before running this workflow, ensure you have the following tools installed:

- **bedtools**: For genomic interval operations
- **samtools**: For BAM file processing  
- **MEME suite**: For motif enrichment analysis
- **tabix** and **bgzip** (htslib): For compressing and indexing the final signal track

## Data Requirements

- **Peak calls**: ATAC-seq peaks in BED format (e.g., from ENCODE)
- **BAM files**: Aligned ATAC-seq reads
- **Reference genome**: hg38 FASTA file with index, plus a chromosome sizes file (`hg38.genome`) used in Step 7
- **Motif database**: Consensus PWMs for transcription factors

Let's walk through each step of the processing pipeline.  

In [None]:
## Step 1: Download Example ATAC-seq Peaks

We'll use K562 ATAC-seq peaks from ENCODE as an example dataset. These are IDR-thresholded peaks representing high-confidence chromatin accessibility regions.

In [None]:
# Download K562 ATAC-seq peaks from ENCODE
# File: ENCFF135AEX - K562 IDR thresholded peaks
wget -q https://www.encodeproject.org/files/ENCFF135AEX/@@download/ENCFF135AEX.bed.gz

echo "✓ Downloaded ATAC-seq peaks file"
ls -lh ENCFF135AEX.bed.gz


In [None]:
## Step 2: Prepare Reference Genome

Ensure you have the hg38 reference genome FASTA file (`hg38_erccpatch.fa`) downloaded; the next cell indexes it. You can download it from ENCODE: https://www.encodeproject.org/references/ENCSR938RZZ/
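Step 7 passes a file named `hg38.genome` to `bedtools genomecov -g`, which expects a two-column table of chromosome names and lengths. One way to obtain it, once the `.fai` index exists, is to take the first two columns of the index. The sketch below is only an illustration: `make_genome_file` is a hypothetical helper, and the toy index (`chrA`, `chrB`) stands in for the real `hg38_erccpatch.fa.fai`.

```shell
# Hypothetical helper: derive a chromosome-sizes file from a .fai index.
# In the .fai format, column 1 is the sequence name and column 2 its length.
make_genome_file() {
  cut -f1,2 "$1" > "$2"
}

# Toy index standing in for hg38_erccpatch.fa.fai (values are illustrative only)
printf 'chrA\t1000\t6\t60\t61\nchrB\t500\t1030\t60\t61\n' > toy.fa.fai

make_genome_file toy.fa.fai toy.genome
cat toy.genome
```

With real data the call would be `make_genome_file hg38_erccpatch.fa.fai hg38.genome`, run after the indexing cell.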

In [None]:
# Index the reference genome FASTA file
# This creates a .fai index file needed for bedtools getfasta
samtools faidx hg38_erccpatch.fa

echo "✓ Reference genome indexed"
ls -lh hg38_erccpatch.fa*


In [None]:
## Step 3: Extract Peak Sequences for Motif Analysis

Now we'll extract sequences from the top peaks for motif enrichment analysis. We take the top 50,000 peaks ranked by score (column 5 of the ENCODE peak file) and extract a 128 bp window (±64 bp) centered on each peak summit.
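To make the window arithmetic concrete, here is a minimal sketch on a single made-up narrowPeak record. It assumes the standard narrowPeak layout, in which column 2 is the peak start and column 10 is the summit offset from that start; the record and the file name `toy.narrowPeak` are illustrative only.

```shell
# Toy narrowPeak record: start = 1000, summit offset (column 10) = 250
printf 'chr1\t1000\t1500\tpeak1\t900\t.\t12.5\t-1\t3.2\t250\n' > toy.narrowPeak

# Summit = start + offset = 1250; window = summit - 64 .. summit + 64
window=$(awk -v OFS='\t' '{ c = $2 + $10; print $1, c - 64, c + 64 }' toy.narrowPeak)
echo "$window"
```

The resulting interval is 128 bp wide. For peaks whose summit lies within 64 bp of a chromosome start, the computed start would be negative, so such rows may need to be filtered before sequence extraction.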


In [None]:
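# Extract top 50,000 peaks and get 128bp around peak summits
# Sort by score (column 5; in narrowPeak format signalValue is column 7), take top peaks,
# then extract summit (start + column 10 offset) ± 64bp
# Note: sort may print a harmless "Broken pipe" message when head exits early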
zcat ENCFF135AEX.bed.gz | \
  sort -k5,5nr | \
  head -n 50000 | \
  awk '{OFS = "\t"}{print $1,$2+$10-64,$2+$10+64}' | \
  sort -k1,1 -k2,2n > ENCFF135AEX.sorted.peaks.bed

echo "✓ Extracted peak regions"
wc -l ENCFF135AEX.sorted.peaks.bed

# Extract sequences for foreground (peaks of interest)
bedtools getfasta -fi hg38_erccpatch.fa -bed ENCFF135AEX.sorted.peaks.bed > ENCFF135AEX.peaks.fasta

# Extract sequences for background (using provided background peaks)
bedtools getfasta -fi hg38_erccpatch.fa -bed all_peaks_merged.counts.shared.centered.bed > bg_peaks.fasta

echo "✓ Extracted FASTA sequences"
ls -lh *.fasta


In [None]:
## Step 4: Run Motif Enrichment Analysis

Run Simple Enrichment Analysis (SEA) from the MEME suite using consensus PWMs. We use the 693 consensus PWMs from Vierstra et al. (https://resources.altius.org/~jvierstra/projects/motif-clustering-v2.0beta/), stored here as `consensus_pwms.meme`.
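Before running SEA, it can be worth confirming that sequence extraction produced one FASTA record per BED interval, since `bedtools getfasta` may skip intervals that fall outside the reference. The sketch below shows the comparison on toy files; `toy.bed` and `toy.fasta` are stand-ins for `ENCFF135AEX.sorted.peaks.bed` and `ENCFF135AEX.peaks.fasta`.

```shell
# Toy stand-ins for the BED and FASTA files produced in Step 3
printf 'chr1\t10\t20\nchr1\t30\t40\n' > toy.bed
printf '>chr1:10-20\nACGTACGTAC\n>chr1:30-40\nTTGCATTGCA\n' > toy.fasta

# Count intervals and FASTA headers, then compare
n_bed=$(wc -l < toy.bed | tr -d ' ')
n_fa=$(grep -c '^>' toy.fasta)
if [ "$n_bed" -eq "$n_fa" ]; then
  status="ok"
else
  status="mismatch"
fi
echo "intervals=$n_bed sequences=$n_fa status=$status"
```

A mismatch does not necessarily invalidate the analysis, but it is a cue to look at the `bedtools getfasta` warnings.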

In [None]:
# Run Simple Enrichment Analysis (SEA) from MEME
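# Parameters:
# --p: positive sequences (peaks of interest)
# --m: motif database (consensus PWMs)
# --n: negative/background sequences
# --thresh: significance threshold for reporting motifs (an E-value by default);
#           50000 is set high so that no motif is filtered out
# --verbosity: output level
# Results are written to sea_out/ by default.
# Adjust the path to the sea binary to match your MEME installation.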

/home/jupyter/meme-5.5.6/src/sea \
  --p ENCFF135AEX.peaks.fasta \
  --m consensus_pwms.meme \
  --n bg_peaks.fasta \
  --thresh 50000.0 \
  --verbosity 1

# Move the output file to a more descriptive name
mv sea_out/sea.tsv ENCFF135AEX.motifs.tsv

echo "✓ Motif enrichment analysis completed"
ls -lh ENCFF135AEX.motifs.tsv
head -5 ENCFF135AEX.motifs.tsv


In [None]:
## Step 5: Prepare ATAC-seq BAM File

Next, we'll process the ATAC-seq BAM file to extract fragment end positions. You can download the processed BAM file from ENCODE for K562: https://www.encodeproject.org/files/ENCFF534DCE/@@download/ENCFF534DCE.bam

**Note**: This is a large file (~20GB). Make sure you have sufficient disk space and bandwidth.
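A quick way to check free disk space before starting the download is sketched here. `has_free_space_gb` is a hypothetical helper, and the 20 GB figure is simply the approximate file size quoted in the note.

```shell
# Hypothetical helper: succeed if the current directory has at least N GB free
has_free_space_gb() {
  need_kb=$(( $1 * 1024 * 1024 ))
  # df -P prints a stable POSIX layout; column 4 of line 2 is available KB
  avail_kb=$(df -Pk . | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge "$need_kb" ]
}

if has_free_space_gb 20; then
  echo "Enough space for the download"
else
  echo "Less than 20 GB free; free up space first"
fi
```

If the check passes, the file can be fetched with `wget -c` followed by the URL from this step; the `-c` flag lets an interrupted download resume.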

In [None]:
## Step 6: Process BAM to Extract Fragment Ends

Convert the BAM file to extract ATAC-seq fragment end positions with proper Tn5 adjustments.
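The strand-dependent Tn5 shift is easiest to follow on a toy example. The sketch below applies the same `awk` rule used in this step to two made-up BEDPE records that describe the same fragment (100 to 350), once with the first mate on the `+` strand and once on the `-` strand. The file name `toy.bedpe` and the records are illustrative only.

```shell
# Two toy BEDPE records for one fragment spanning 100..350
printf 'chr1\t100\t150\tchr1\t300\t350\tread1\t60\t+\t-\n' >  toy.bedpe
printf 'chr1\t300\t350\tchr1\t100\t150\tread2\t60\t-\t+\n' >> toy.bedpe

# Shift the fragment start by +4 and the fragment end by -5
shifted=$(awk -v OFS='\t' '
  $9 == "+" { print $1, $2 + 4, $6 - 5 }   # start = start1, end = end2
  $9 == "-" { print $1, $5 + 4, $3 - 5 }   # start = start2, end = end1
' toy.bedpe)
echo "$shifted"
```

Both records yield the interval `chr1 104 345`, which shows that the rule is independent of mate order.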


In [None]:
# First, sort the BAM file by read name so that mates are adjacent (required for -bedpe output)
# The pipeline below reads ENCFF534DCE.sorted.bam; skip this line only if that file already exists
samtools sort -n ENCFF534DCE.bam -o ENCFF534DCE.sorted.bam

# Extract fragment ends with Tn5 adjustments
# Steps:
# 1. Convert BAM to BEDPE format
# 2. Filter for proper pairs (same chromosome)
# 3. Keep pairs with score >= 20 (column 8 of BEDPE is the score, derived from mapping quality, not fragment length)
# 4. Apply Tn5 adjustments (+4bp for 5' end, -5bp for 3' end)
# 5. Ensure start < end coordinates
# 6. Sort and compress

bedtools bamtobed -i ENCFF534DCE.sorted.bam -bedpe | \
  awk '$1 == $4' | \
  awk '$8 >= 20' | \
  awk -v OFS="\t" '{if($9 == "+"){print $1,$2+4,$6-5}else if($9=="-"){print $1,$5+4,$3-5}}' | \
  awk -v OFS="\t" '{if($3<$2){print $1,$3,$2}else if($3>$2){print $1,$2,$3}}' | \
  awk '$3-$2 > 0' | \
  sort -k1,1 -k2,2n | gzip > K562.bed.gz

echo "✓ Fragment ends extracted"

# Count fragments in millions (used to derive the scale factor in Step 7)
zcat K562.bed.gz | wc -l | awk '{ print $1 / 1000000.0 }' > K562.num_fragments.out

echo "✓ Fragment count (millions):"
cat K562.num_fragments.out


In [None]:
## Step 7: Create Bedgraph Signal Track

Convert fragment ends to a bedgraph signal track representing chromatin accessibility.
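The `-scale` value used in the next cell is hard-coded for this K562 sample. A minimal sketch of how it could be derived for another sample follows, assuming the value in `K562.num_fragments.out` from Step 6 is the fragment count in millions and that the target depth is 20 million. The file `toy.num_fragments.out` is a stand-in holding the K562 figure quoted in this notebook.

```shell
# Toy stand-in for K562.num_fragments.out (fragment count in millions)
echo "111.279" > toy.num_fragments.out

# Scale to a common depth of 20 million fragments
scale=$(awk '{ printf "%.4f", 20 / $1 }' toy.num_fragments.out)
echo "scale factor: $scale"
```

The variable could then be passed as `-scale "$scale"` to `bedtools genomecov` instead of the literal value.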

In [None]:
# Create separate files for forward and reverse strand cut sites
zcat K562.bed.gz | awk '{OFS="\t"}{print $1,$2+4,$2+4+1}' | gzip > fwd.bed.gz
zcat K562.bed.gz | awk '{OFS="\t"}{print $1,$3-5,$3-5+1}' | gzip > rev.bed.gz

# Combine forward and reverse strand cut sites
zcat fwd.bed.gz rev.bed.gz | sort -k1,1 -k2,2n > HG_K562.bed.temp

echo "✓ Cut sites extracted for both strands"

# Create bedgraph with proper scaling
# Steps:
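# 1. Extend each 1bp cut site by ±5bp (11bp total window)
# 2. Filter out non-standard chromosomes
# 3. Remove negative coordinates
# 4. Scale to a depth of 20 million fragments (scale factor: 20/111.279 = 0.1797,
#    where 111.279 is this sample's fragment count in millions; update -scale for other samples)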
# 5. Generate bedgraph coverage

cat HG_K562.bed.temp | \
  awk '{OFS="\t"}{print $1,$2-5,$3+5}' | \
  grep -v 'KI\|GL\|EBV\|chrM\|chrMT\|K\|J' | \
  awk '$2 >= 0' | \
  sort -k1,1 -k2,2n | \
  bedtools genomecov -i - -g hg38.genome -scale 0.1797 -bg | \
  sort -k1,1 -k2,2n > K562.bedgraph

echo "✓ Bedgraph created"

# Rename and compress for tabix indexing
mv K562.bedgraph K562.adjust.bed
bgzip K562.adjust.bed
tabix K562.adjust.bed.gz

echo "✓ Final processed file created: K562.adjust.bed.gz"
ls -lh K562.adjust.bed.gz*

# Clean up intermediate files
rm -f fwd.bed.gz rev.bed.gz HG_K562.bed.temp


In [None]:
## Summary

This notebook walked through the complete data processing pipeline for EpiBERT.

### Output Files Generated:

1. **ENCFF135AEX.motifs.tsv** - Motif enrichment scores for input to EpiBERT
2. **K562.adjust.bed.gz** - Processed ATAC-seq signal track for input to EpiBERT

### Key Processing Steps:

1. **Peak Selection**: Extract top 50K peaks ranked by score
2. **Sequence Extraction**: Get 128bp regions centered on peak summits
3. **Motif Analysis**: Run SEA to identify enriched transcription factor motifs
4. **Fragment Processing**: Extract ATAC-seq fragment ends with Tn5 corrections
5. **Signal Generation**: Create normalized bedgraph tracks for accessibility

### Usage in EpiBERT:

These processed files can now be used as inputs for EpiBERT analysis:
- The motif file provides transcription factor binding context
- The bedgraph file provides chromatin accessibility signal
- Together they enable prediction of variant effects on accessibility

### Quality Control:

- Verify motif enrichment results show expected TF families
- Check bedgraph signal correlates with known accessible regions
- Ensure file formats are compatible with EpiBERT input requirements
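As one possible format check, the sketch below counts lines in a bedgraph that do not have four tab-separated columns, that have start >= end, or that are out of order within a chromosome (tabix indexing relies on sorted input). It runs on a toy file; `toy.bedgraph` is a stand-in for the decompressed `K562.adjust.bed.gz`, and the check itself is an illustration rather than an official EpiBERT validator.

```shell
# Toy bedgraph standing in for the decompressed K562.adjust.bed.gz
printf 'chr1\t0\t10\t0.1797\nchr1\t10\t25\t0.3594\nchr2\t5\t15\t0.1797\n' > toy.bedgraph

# Count lines that break the expected layout: 4 columns, start < end,
# and non-decreasing starts within each chromosome
bad=$(awk -F'\t' '
  NF != 4 || $2 >= $3 { bad++ }
  $1 == chrom && $2 < prev { bad++ }
  { chrom = $1; prev = $2 }
  END { print bad + 0 }
' toy.bedgraph)
echo "malformed or out-of-order lines: $bad"
```

On the real output, the same `awk` program can read from `zcat K562.adjust.bed.gz` through a pipe; a count of zero suggests the file is structurally sound.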

### Next Steps:

1. Use these files in the `caqTL_predict.ipynb` notebook
2. Adapt the pipeline for your own cell type and experimental data
3. Experiment with different peak selection criteria or motif databases
