# EpiBERT Data Processing Workflow

This notebook demonstrates preparing input data for EpiBERT analysis.

## Overview

We require two main types of input data:

1. Motif enrichment file: generated using Simple Enrichment Analysis (SEA) from the MEME suite
2. ATAC-seq data: fragment end positions in bedgraph format

## Prerequisites

Before running this workflow, ensure you have the following tools installed:

### System Requirements
- Python 3.8+ with EpiBERT dependencies (see `requirements.txt`)
- bedtools v2.30+: For genomic interval operations
- samtools v1.15+: For BAM file processing  
- MEME suite v5.4+: For motif enrichment analysis
- tabix: For indexing compressed files
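
A quick way to confirm that the command-line tools are on `PATH` before starting is sketched below. The helper name `missing_tools` and the executable names (in particular `sea` for the MEME suite and `bgzip`, which is used later in this workflow) are assumptions for illustration; adjust them to your installation. The check only looks for the executables and does not verify versions.

```python
import shutil

def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]

# Executable names assumed for this workflow; `sea` ships with the MEME suite.
required = ["bedtools", "samtools", "sea", "bgzip", "tabix"]
print("Missing tools:", missing_tools(required) or "none")
```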

## Data Requirements

- Peak calls: ATAC-seq peaks in BED format (e.g., from ENCODE)
- BAM files: Aligned ATAC-seq reads
- Reference genome: hg38 FASTA file with index
- Motif database: Consensus PWMs for transcription factors

## Step 1: Download Example ATAC-seq Peaks

# Use K562 IDR thresholded ATAC-seq peaks from ENCODE as an example dataset. 

# Download K562 ATAC-seq peaks from ENCODE
# File: ENCFF135AEX - K562 IDR thresholded peaks
!wget -q https://www.encodeproject.org/files/ENCFF135AEX/@@download/ENCFF135AEX.bed.gz

## Step 2: Prepare Reference Genome

# Ensure you have the hg38 reference genome FASTA file downloaded and indexed. You can download it from ENCODE: https://www.encodeproject.org/references/ENCSR938RZZ/

# Download and decompress the reference genome FASTA file, then index it
# samtools faidx creates the .fai index file needed for bedtools getfasta
!wget -O hg38_erccpatch.fa.gz ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
!gzip -d hg38_erccpatch.fa.gz
!samtools faidx hg38_erccpatch.fa

## Step 3: Extract Peak Sequences for Motif Analysis

# Extract the top 50,000 peaks and take 128 bp around each peak summit
# Sort by score (column 5), take the top peaks, extract summit (start + column 10 offset) ± 64 bp
!zcat ENCFF135AEX.bed.gz | \
  sort -k5,5nr | \
  head -n 50000 | \
  awk '{OFS = "\t"}{print $1,$2+$10-64,$2+$10+64}' | \
  sort -k1,1 -k2,2n > ENCFF135AEX.sorted.peaks.bed

!wc -l ENCFF135AEX.sorted.peaks.bed
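
The window arithmetic in the awk step can be sketched in Python as follows. This is only an illustration: the function name `summit_window` and the example record are made up, and the sketch assumes the ENCODE narrowPeak layout in which column 10 holds the summit offset from the peak start.

```python
def summit_window(fields, half_width=64):
    """Return (chrom, start, end) centered on the narrowPeak summit."""
    chrom = fields[0]
    # chromStart (column 2) plus the summit offset (column 10)
    summit = int(fields[1]) + int(fields[9])
    return chrom, summit - half_width, summit + half_width

# Made-up narrowPeak record, for illustration only.
line = "chr1\t1000\t1500\tpeak1\t900\t.\t12.5\t-1\t3.2\t200"
print(summit_window(line.split("\t")))
```

Note that, like the awk command, this sketch does not clamp the start coordinate, so a summit within 64 bp of a chromosome start would yield a negative start.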

!bedtools getfasta -fi hg38_erccpatch.fa -bed ENCFF135AEX.sorted.peaks.bed > ENCFF135AEX.peaks.fasta # Extract sequences for foreground (peaks of interest)
!bedtools getfasta -fi hg38_erccpatch.fa -bed all_peaks_merged.counts.shared.centered.bed > bg_peaks.fasta # Extract sequences for background (using provided background peaks)

## Step 4: Run Motif Enrichment Analysis

# Run Simple Enrichment Analysis (SEA) from MEME using consensus PWMs. We use the 693 consensus PWMs from Vierstra et al. (https://resources.altius.org/~jvierstra/projects/motif-clustering-v2.0beta/)

# SEA parameters:
# --p: positive sequences (peaks of interest)
# --m: motif database (consensus PWMs)
# --n: negative/background sequences
# --thresh: significance threshold for reporting motifs (set very high here, which effectively keeps every motif in the output)
# --verbosity: output level

!/insertpathtomemesuiteinstallation/meme-5.5.6/src/sea \
  --p ENCFF135AEX.peaks.fasta \
  --m consensus_pwms.meme \
  --n bg_peaks.fasta \
  --thresh 50000.0 \
  --verbosity 1

# Move the output file to a more descriptive name
!mv sea_out/sea.tsv ENCFF135AEX.motifs.tsv
!head -5 ENCFF135AEX.motifs.tsv
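
To inspect the motif table from Python rather than with `head`, a minimal reader could look like the sketch below. The helper name `read_sea_tsv` is hypothetical, the example table is made up, and the column names in it are assumptions about the SEA output; the sketch also assumes that any trailing lines starting with `#` are comments to be skipped.

```python
import csv
import io

def read_sea_tsv(handle):
    """Parse tab-separated SEA output, skipping blank and '#' comment lines."""
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    return list(csv.DictReader(rows, delimiter="\t"))

# Tiny made-up table; the column names are assumptions about sea.tsv.
example = io.StringIO(
    "RANK\tID\tQVALUE\n"
    "1\tMOTIF_A\t1e-10\n"
    "2\tMOTIF_B\t0.5\n"
    "\n"
    "# version and command-line comments\n"
)
motifs = read_sea_tsv(example)
print([m["ID"] for m in motifs])
```

For the real file, pass an open handle instead of the in-memory example, for instance `open("ENCFF135AEX.motifs.tsv")`.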


## Step 5/6: Process ATAC-seq BAM File

# Next, process the ATAC-seq BAM file to extract fragment end positions with Tn5 adjustments. You can download the processed BAM file from ENCODE for K562: https://www.encodeproject.org/files/ENCFF534DCE/@@download/ENCFF534DCE.bam
# **Note**: This is a large file (~20GB). Make sure you have sufficient disk space and bandwidth.

# Sort the BAM file by read name (required for bedpe output)
# Uncomment the line below if you need to sort your BAM file
# samtools sort -n ENCFF534DCE.bam -o ENCFF534DCE.sorted.bam

# Extract fragment ends with Tn5 adjustments
# Steps:
# 1. Convert BAM to BEDPE format
# 2. Filter for proper pairs (same chromosome)
# 3. Filter for pairs with a BEDPE score (column 8, mapping quality) >= 20
# 4. Apply Tn5 adjustments (+4bp for 5' end, -5bp for 3' end)
# 5. Ensure start < end coordinates
# 6. Sort and compress

# The input must be the name-sorted BAM (see the samtools sort command above)
# For each read pair, extract the 5' and 3' cut sites and apply the required Tn5 adjustment
!bedtools bamtobed -i ENCFF534DCE.sorted.bam -bedpe | \
        awk '$1 == $4' | \
        awk '$8 >= 20' | \
        awk -v OFS="\t" '{if($9 == "+"){print $1,$2+4,$6-5}else if($9=="-"){print $1,$5+4,$3-5}}' | \
        awk -v OFS="\t" '{if($3<$2){print $1,$3,$2}else if($3>$2){print $1,$2,$3}}' | \
        awk '$3-$2 > 0' | \
        sort -k1,1 -k2,2n | gzip > K562.bed.gz
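
The per-record logic of the awk chain above can be mirrored in Python as a rough sketch. The function name `tn5_adjust` and the example record are made up for illustration, and the sketch assumes the standard 10-column BEDPE layout written by `bedtools bamtobed -bedpe` (chrom1, start1, end1, chrom2, start2, end2, name, score, strand1, strand2).

```python
def tn5_adjust(fields, min_score=20):
    """Return (chrom, start, end) for one BEDPE record, or None if filtered out."""
    chrom1, start1, end1 = fields[0], int(fields[1]), int(fields[2])
    chrom2, start2, end2 = fields[3], int(fields[4]), int(fields[5])
    score, strand1 = int(fields[7]), fields[8]
    # Keep only pairs on the same chromosome with a sufficient column-8 score.
    if chrom1 != chrom2 or score < min_score:
        return None
    # Shift the fragment start by +4 and the fragment end by -5.
    if strand1 == "+":
        start, end = start1 + 4, end2 - 5
    elif strand1 == "-":
        start, end = start2 + 4, end1 - 5
    else:
        return None
    # Drop zero-length fragments and make sure start < end.
    if start == end:
        return None
    return chrom1, min(start, end), max(start, end)

# Made-up BEDPE record, for illustration only.
record = "chr1\t100\t150\tchr1\t250\t300\tread1\t60\t+\t-".split("\t")
print(tn5_adjust(record))
```

For whole files the shell pipeline is the practical choice; the function is only meant to make the coordinate arithmetic easy to check on single records.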


## Count fragments in millions (111.279 for this file); this value is used to derive the scale factor
!zcat K562.bed.gz | wc -l | awk '{ print $1 / 1000000.0 }' > K562.num_fragments.out
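
The scale factor passed to `bedtools genomecov` later in this workflow follows from this count. A small sketch of the arithmetic, with the hypothetical helper `scale_factor` and a target depth of 20 million as used in this workflow:

```python
def scale_factor(n_fragments, target_millions=20.0):
    """Factor that rescales coverage to `target_millions` million fragments."""
    return target_millions / (n_fragments / 1_000_000.0)

# 111.279 million fragments, the count reported for this K562 file.
print(round(scale_factor(111_279_000), 4))
```

For this file the result rounds to 0.1797. For a different BAM file, recompute the factor from the value written to `K562.num_fragments.out` rather than reusing it.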

# Create separate files for forward and reverse strand cut sites
!zcat K562.bed.gz | awk '{OFS="\t"}{print $1,$2,$2+1}' | gzip > fwd.bed.gz
!zcat K562.bed.gz | awk '{OFS="\t"}{print $1,$3,$3+1}' | gzip > rev.bed.gz

# Combine forward and reverse strand cut sites
!zcat fwd.bed.gz rev.bed.gz | sort -k1,1 -k2,2n > HG_K562.bed.temp

# Turn into bedgraph: take 10 bp around each insertion site (5 bp on each side), then scale to reads per 20 million (20 / 111.279 = 0.1797)
# Make sure you provide a genome file for bedtools genomecov
!cat HG_K562.bed.temp | awk '{ OFS="\t" } {print $1,$2-5,$3+5}' | \
                    grep -v 'KI\|GL\|EBV\|chrM\|chrMT\|K\|J' | \
                    awk '$2 >= 0' | sort -k1,1 -k2,2n | \
                    bedtools genomecov -i - -g hg38.genome -scale 0.1797 -bg | sort -k1,1 -k2,2n > K562.bedgraph
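
The genome file passed with `-g` is a two-column, tab-separated list of chromosome names and sizes. One possible way to build it, assuming the `.fai` index created by `samtools faidx` in Step 2 is available, is to keep the first two columns of that index. The helper name `fai_to_genome` and the example rows below are made up for illustration.

```python
def fai_to_genome(fai_lines):
    """Keep the first two .fai columns (name, length) as genome-file lines."""
    out = []
    for line in fai_lines:
        if line.strip():
            name, length = line.split("\t")[:2]
            out.append(f"{name}\t{int(length)}")
    return out

# Made-up .fai rows (name, length, offset, bases per line, bytes per line).
example_fai = ["chr1\t1000\t6\t60\t61\n", "chr2\t500\t1030\t60\t61\n"]
print("\n".join(fai_to_genome(example_fai)))
```

To produce `hg38.genome`, the same function could be applied to the lines of `hg38_erccpatch.fa.fai` and the result written to disk.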

!mv K562.bedgraph K562.adjust.bed # rename, since tabix will throw an error if the file is named .bedgraph
!bgzip K562.adjust.bed # bgzip for tabix
!tabix K562.adjust.bed.gz # tabix index

# Clean up intermediate files
!rm -f fwd.bed.gz rev.bed.gz HG_K562.bed.temp