# Complete Fragment Processing Workflow

This notebook performs the complete workflow for processing scATAC-seq fragments across multiple species:
1. **Link fragments** to organized folder structure (by cell type)
2. **Lift over fragments** to the human genome (hg38) using a parallel implementation
3. **Convert fragments** to coverage bigWig files

---

## Setup and Configuration

Define species, cell types, and file paths.

In [None]:
# Activate the genomes environment for liftover operations
mamba activate genomes

In [None]:
# Define cell types to process
CELL_TYPES=("Enterocytes")

# Define species to process
SPECIES=("Human" "Gorilla" "Chimpanzee" "Bonobo" "Macaque" "Marmoset")

# Base directories
BASE_DIR="/cluster/project/treutlein/USERS/jjans/analysis/adult_intestine/atac"
CHAIN_DIR="${BASE_DIR}/genomes/chains"
MARMOSET_CHAIN_DIR="/cluster/project/treutlein/USERS/jjans/data/intestine/nhp_atlas/genomes/chain_files"
CHROM_SIZES="/cluster/home/jjanssens/jjans/analysis/cerebellum/genomes_new/homo_sapiens/hg38.chrom.sizes"

# Chain file mapping
declare -A CHAIN_FILES
CHAIN_FILES["Gorilla"]="${CHAIN_DIR}/gorGor4ToHg38.over.chain.gz"
CHAIN_FILES["Chimpanzee"]="${CHAIN_DIR}/panTro5ToHg38.over.chain.gz"
CHAIN_FILES["Bonobo"]="${CHAIN_DIR}/panPan2ToHg38.over.chain.gz"
CHAIN_FILES["Macaque"]="${CHAIN_DIR}/rheMac10ToHg38.over.chain.gz"
CHAIN_FILES["Marmoset_step1"]="${MARMOSET_CHAIN_DIR}/calJac1ToCalJac4.over.chain"
CHAIN_FILES["Marmoset_step2"]="${MARMOSET_CHAIN_DIR}/calJac4ToHg38.over.chain"

echo "✅ Configuration loaded"

---
## Step 1: Link Fragment Files

Create organized directory structure and link fragment files by cell type.
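
The cells in this step derive every path from a lowercased species and cell type name. As a minimal sketch of that naming convention, the helpers below wrap the same `tr` call and build the link path used in Step 1. The function names `to_lower` and `fragment_link_path` are hypothetical and do not appear elsewhere in this notebook.

```shell
# Hypothetical helpers (not part of the original notebook).

# Lowercase a string, using the same tr call as the cells in this step.
to_lower() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# Build the Step 1 symlink path for a species / cell type pair.
fragment_link_path() {
    species_lower=$(to_lower "$1")
    cell_type_lower=$(to_lower "$2")
    printf 'fragment_files/%s/%s_%s.fragments.tsv.gz' \
        "$cell_type_lower" "$species_lower" "$cell_type_lower"
}

fragment_link_path "Gorilla" "Enterocytes"; echo
# prints: fragment_files/enterocytes/gorilla_enterocytes.fragments.tsv.gz
```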

In [None]:
# Create directory structure for each cell type
for cell_type in "${CELL_TYPES[@]}"; do
    cell_type_lower=$(echo "$cell_type" | tr '[:upper:]' '[:lower:]')
    mkdir -p "fragment_files/${cell_type_lower}"
    echo "📁 Created directory: fragment_files/${cell_type_lower}"
done

echo "✅ Directory structure created"

In [None]:
# Link fragment files for each species and cell type
for cell_type in "${CELL_TYPES[@]}"; do
    cell_type_lower=$(echo "$cell_type" | tr '[:upper:]' '[:lower:]')
    
    for species in "${SPECIES[@]}"; do
        species_lower=$(echo "$species" | tr '[:upper:]' '[:lower:]')
        
        # Source file path
        source_file="${BASE_DIR}/consensus_peak_calling_${species}/pseudobulk_bed_files/${cell_type}.fragments.tsv.gz"
        
        # Target link path
        target_link="fragment_files/${cell_type_lower}/${species_lower}_${cell_type_lower}.fragments.tsv.gz"
        
        # Create symlink if source exists
        if [[ -f "$source_file" ]]; then
            ln -sf "$source_file" "$target_link"
            echo "🔗 Linked: ${species} ${cell_type}"
        else
            echo "⚠️  Source not found: ${species} ${cell_type}"
        fi
    done
done

echo ""
echo "✅ All fragment files linked"

In [None]:
# Verify linked files
for cell_type in "${CELL_TYPES[@]}"; do
    cell_type_lower=$(echo "$cell_type" | tr '[:upper:]' '[:lower:]')
    # printf interprets \n; plain echo in bash would print it literally
    printf '\n📂 Fragment files for %s:\n' "$cell_type"
    ls -lh "fragment_files/${cell_type_lower}/"
done

---
## Step 2: Liftover Fragments to hg38

Use the parallel implementation to lift over fragments from each species to the human genome (hg38).

**Important:** This step requires the `liftover_fragments_par.sh` and `liftover_fragments_parchr.sh` scripts.
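
Since a missing script or chain file only shows up as a skipped species later on, a pre-flight check can be useful. The sketch below is one possible way to do it. `require_files` is a hypothetical helper, not something the liftover scripts provide, and the usage line is commented out so the sketch has no side effects.

```shell
# Hypothetical pre-flight helper: report every missing file on stderr and
# return 1 if at least one is missing, 0 otherwise.
require_files() {
    missing=0
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "Missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible usage in this notebook (assumes the configuration cell has been run):
# require_files liftover_fragments_par.sh liftover_fragments_parchr.sh "${CHAIN_FILES[@]}" \
#     || echo "Fix the paths listed above before running Step 2"
```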

In [None]:
# Create output directories for lifted fragments (organized by cell type)
for cell_type in "${CELL_TYPES[@]}"; do
    cell_type_lower=$(echo "$cell_type" | tr '[:upper:]' '[:lower:]')
    mkdir -p "lifted_fragments/${cell_type_lower}"
    echo "📁 Created directory: lifted_fragments/${cell_type_lower}"
done

echo "✅ Output directories created"

### Human Fragments (No Liftover Needed)

Human fragments are already in hg38, so we just create a symlink.
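
A relative symlink target is resolved against the directory that contains the link, not against the current working directory. That is why the link is created with a `../../` prefix: from `lifted_fragments/<cell_type>/`, two levels up is the working directory. The sketch below demonstrates this in a throwaway directory; the file it creates is a plain-text placeholder, not a real fragments file.

```shell
# Demonstration in a temporary directory; all files here are placeholders.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/fragment_files/enterocytes" "$demo_dir/lifted_fragments/enterocytes"
echo "demo" > "$demo_dir/fragment_files/enterocytes/human_enterocytes.fragments.tsv.gz"

(
    cd "$demo_dir" || exit 1
    source_file="fragment_files/enterocytes/human_enterocytes.fragments.tsv.gz"
    target_link="lifted_fragments/enterocytes/human_enterocytes.hg38.fragments.tsv.gz"
    # The target is interpreted relative to lifted_fragments/enterocytes/,
    # so two levels up lands back in the working directory.
    ln -sf "../../${source_file}" "$target_link"
    # -e follows the link: true only if it resolves to an existing file.
    if [ -e "$target_link" ]; then
        echo "link resolves"
    else
        echo "link is broken"
    fi
)
```

Dropping the `../../` prefix would produce a dangling link, because the target would then be looked up inside `lifted_fragments/enterocytes/`.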

In [None]:
# Link Human fragments directly (already in hg38)
for cell_type in "${CELL_TYPES[@]}"; do
    cell_type_lower=$(echo "$cell_type" | tr '[:upper:]' '[:lower:]')
    
    source_file="fragment_files/${cell_type_lower}/human_${cell_type_lower}.fragments.tsv.gz"
    target_link="lifted_fragments/${cell_type_lower}/human_${cell_type_lower}.hg38.fragments.tsv.gz"
    
    if [[ -f "$source_file" ]]; then
        ln -sf "../../${source_file}" "$target_link"
        echo "🔗 Linked Human ${cell_type} (no liftover needed)"
    fi
done

### Standard Species Liftover (Gorilla, Chimpanzee, Bonobo, Macaque)

Single-step liftover using the parallel implementation with 30 CPUs.
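
The CPU count is hard-coded to 30 in the cells below. On a machine with fewer cores, one option is to cap the requested value at the number of online cores. The sketch below shows that idea; `pick_ncpu` is a hypothetical helper. Under a batch scheduler the number of cores allocated to the job can be lower than the number of online cores, so treat the result as an upper bound only.

```shell
# Hypothetical helper: cap a requested CPU count at the number of online cores.
pick_ncpu() {
    requested="$1"
    # getconf _NPROCESSORS_ONLN is available on Linux and macOS; fall back to 1.
    available=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$requested" -gt "$available" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

NCPU=$(pick_ncpu 30)
echo "Using ${NCPU} CPUs"
```

The resulting `NCPU` could then be passed as `--ncpu "$NCPU"` instead of the literal 30.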

In [None]:
# Liftover for standard species (single-step)
STANDARD_SPECIES=("Gorilla" "Chimpanzee" "Bonobo" "Macaque")

for cell_type in "${CELL_TYPES[@]}"; do
    cell_type_lower=$(echo "$cell_type" | tr '[:upper:]' '[:lower:]')
    
    for species in "${STANDARD_SPECIES[@]}"; do
        species_lower=$(echo "$species" | tr '[:upper:]' '[:lower:]')
        
        input_file="fragment_files/${cell_type_lower}/${species_lower}_${cell_type_lower}.fragments.tsv.gz"
        chain_file="${CHAIN_FILES[$species]}"
        output_file="lifted_fragments/${cell_type_lower}/${species_lower}_${cell_type_lower}.hg38.fragments_par.tsv.gz"
        log_file="lifted_fragments/${cell_type_lower}/${species_lower}_${cell_type_lower}.log"
        
        if [[ -f "$input_file" && -f "$chain_file" ]]; then
            echo "🚀 Running liftover for ${species} ${cell_type}..."
            # Report success only if the liftover script exits with status 0
            if bash liftover_fragments_par.sh \
                --i "$input_file" \
                --c "$chain_file" \
                --o "$output_file" \
                --ncpu 30 \
                &> "$log_file"; then
                echo "✅ Done with ${species} ${cell_type}"
            else
                echo "❌ Liftover failed for ${species} ${cell_type} (see ${log_file})"
            fi
        else
            echo "⚠️  Skipping ${species} ${cell_type} (missing input or chain file)"
        fi
        echo ""
    done
done

echo "✅ Standard species liftover completed"

### Marmoset Two-Step Liftover

Marmoset requires two liftover steps:
1. calJac1 → calJac4
2. calJac4 → hg38
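
Because the second step reads the output of the first, it only makes sense to run it when the first step succeeded and produced a non-empty intermediate file. The sketch below illustrates that control flow with stand-in functions. `step1`, `step2` and `two_step_liftover` are hypothetical; in this notebook each step would be a `liftover_fragments_parchr.sh` call.

```shell
# Hypothetical stand-ins for the two liftover calls; they only write small text files.
step1() { printf 'chr1\t100\t200\n' > "$1"; }    # source -> intermediate
step2() { sed 's/^/lifted:/' "$1" > "$2"; }      # intermediate -> final

# Run step 2 only if step 1 succeeded and left a non-empty intermediate file.
two_step_liftover() {
    intermediate="$1"
    final="$2"
    step1 "$intermediate" || { echo "Step 1 failed" >&2; return 1; }
    [ -s "$intermediate" ] || { echo "Step 1 produced an empty file" >&2; return 1; }
    step2 "$intermediate" "$final"
}

work_dir=$(mktemp -d)
two_step_liftover "$work_dir/intermediate.txt" "$work_dir/final.txt" \
    && echo "both steps finished"
```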

In [None]:
# Marmoset liftover - Step 1: calJac1 to calJac4
for cell_type in "${CELL_TYPES[@]}"; do
    cell_type_lower=$(echo "$cell_type" | tr '[:upper:]' '[:lower:]')
    
    input_file="fragment_files/${cell_type_lower}/marmoset_${cell_type_lower}.fragments.tsv.gz"
    chain_file="${CHAIN_FILES[Marmoset_step1]}"
    output_file="lifted_fragments/${cell_type_lower}/marmoset_${cell_type_lower}.calJac4.fragments_par.tsv.gz"
    log_file="lifted_fragments/${cell_type_lower}/marmoset_${cell_type_lower}_step1.log"
    
    if [[ -f "$input_file" && -f "$chain_file" ]]; then
        echo "🚀 Running Marmoset ${cell_type} liftover - Step 1 (calJac1 → calJac4)..."
        # Report success only if the liftover script exits with status 0
        if bash liftover_fragments_parchr.sh \
            --i "$input_file" \
            --c "$chain_file" \
            --o "$output_file" \
            --ncpu 30 \
            &> "$log_file"; then
            echo "✅ Done with Marmoset ${cell_type} Step 1"
        else
            echo "❌ Marmoset ${cell_type} Step 1 failed (see ${log_file})"
        fi
    else
        echo "⚠️  Skipping Marmoset ${cell_type} Step 1 (missing input or chain file)"
    fi
    echo ""
done

In [None]:
# Marmoset liftover - Step 2: calJac4 to hg38
for cell_type in "${CELL_TYPES[@]}"; do
    cell_type_lower=$(echo "$cell_type" | tr '[:upper:]' '[:lower:]')
    
    input_file="lifted_fragments/${cell_type_lower}/marmoset_${cell_type_lower}.calJac4.fragments_par.tsv.gz"
    chain_file="${CHAIN_FILES[Marmoset_step2]}"
    output_file="lifted_fragments/${cell_type_lower}/marmoset_${cell_type_lower}.hg38.fragments_par.tsv.gz"
    log_file="lifted_fragments/${cell_type_lower}/marmoset_${cell_type_lower}_step2.log"
    
    if [[ -f "$input_file" && -f "$chain_file" ]]; then
        echo "🚀 Running Marmoset ${cell_type} liftover - Step 2 (calJac4 → hg38)..."
        # Report success only if the liftover script exits with status 0
        if bash liftover_fragments_parchr.sh \
            --i "$input_file" \
            --c "$chain_file" \
            --o "$output_file" \
            --ncpu 30 \
            &> "$log_file"; then
            echo "✅ Done with Marmoset ${cell_type} Step 2"
        else
            echo "❌ Marmoset ${cell_type} Step 2 failed (see ${log_file})"
        fi
    else
        echo "⚠️  Skipping Marmoset ${cell_type} Step 2 (missing input or chain file)"
    fi
    echo ""
done

echo "✅ Marmoset liftover completed"

In [None]:
# Verify lifted fragment files
for cell_type in "${CELL_TYPES[@]}"; do
    cell_type_lower=$(echo "$cell_type" | tr '[:upper:]' '[:lower:]')
    # printf interprets \n; plain echo in bash would print it literally
    printf '\n📂 Lifted fragments for %s:\n' "$cell_type"
    ls -lh "lifted_fragments/${cell_type_lower}/" | grep -E '\.hg38\.fragments.*\.tsv\.gz$'
done
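
Listing the files confirms that they exist, but not how many fragments survived the liftover. As a rough sanity check, one could compare line counts before and after. The sketch below assumes gzip-compressed, tab-separated fragment files in which lines starting with `#` are comments. `count_fragments` and `retention_percent` are hypothetical helpers, not part of the liftover scripts.

```shell
# Count fragments (lines that are not '#' comments) in a gzip-compressed file.
# Reading via stdin means symlinked inputs are handled like regular files.
count_fragments() {
    gzip -dc < "$1" | grep -vc '^#'
}

# Percentage of fragments kept after liftover, rounded to one decimal.
# Prints NA when the input file contains no fragments.
retention_percent() {
    before=$(count_fragments "$1")
    after=$(count_fragments "$2")
    awk -v b="$before" -v a="$after" \
        'BEGIN { if (b == 0) print "NA"; else printf "%.1f\n", 100 * a / b }'
}

# Possible usage in this notebook:
# retention_percent \
#     fragment_files/enterocytes/gorilla_enterocytes.fragments.tsv.gz \
#     lifted_fragments/enterocytes/gorilla_enterocytes.hg38.fragments_par.tsv.gz
```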

---
## Step 3: Convert Fragments to BigWig Coverage Files

Convert lifted fragment files to coverage bigWig format for visualization.

**Note:** Requires `scatac_fragment_tools` to be installed.
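
The conversion also depends on the chromosome sizes file referenced by `CHROM_SIZES`. A chrom.sizes file has one line per sequence, with the sequence name and its length separated by a tab. The sketch below is one way to validate that layout before converting; `check_chrom_sizes` is a hypothetical helper and not part of `scatac_fragment_tools`.

```shell
# Hypothetical check: every line must be "<name><TAB><positive integer>",
# and the file must not be empty. Returns 0 if valid, 1 otherwise.
check_chrom_sizes() {
    awk -F '\t' '
        NF != 2 || $2 !~ /^[0-9]+$/ || $2 == 0 {
            printf "Line %d is malformed: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
        }
        END {
            if (NR == 0) bad = 1
            exit (bad ? 1 : 0)
        }
    ' "$1"
}

# Possible usage in this notebook:
# check_chrom_sizes "$CHROM_SIZES" || echo "Check the chrom.sizes file before converting"
```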

In [None]:
# Activate the scatac_fragment_tools environment
mamba activate scatac_fragment_tools

In [None]:
# Create output directories for bigWig files (organized by cell type)
for cell_type in "${CELL_TYPES[@]}"; do
    cell_type_lower=$(echo "$cell_type" | tr '[:upper:]' '[:lower:]')
    mkdir -p "bigwigs/${cell_type_lower}"
    echo "📁 Created directory: bigwigs/${cell_type_lower}"
done

echo "✅ BigWig output directories created"

In [None]:
# Convert fragments to bigWig for all species and cell types
for cell_type in "${CELL_TYPES[@]}"; do
    cell_type_lower=$(echo "$cell_type" | tr '[:upper:]' '[:lower:]')
    
    for species in "${SPECIES[@]}"; do
        species_lower=$(echo "$species" | tr '[:upper:]' '[:lower:]')
        
        # Determine input file name based on species
        if [[ "$species" == "Human" ]]; then
            # Human uses direct link (no _par suffix)
            input_frag="lifted_fragments/${cell_type_lower}/${species_lower}_${cell_type_lower}.hg38.fragments.tsv.gz"
        else
            # Other species use parallel liftover output (_par suffix)
            input_frag="lifted_fragments/${cell_type_lower}/${species_lower}_${cell_type_lower}.hg38.fragments_par.tsv.gz"
        fi
        
        output_bw="bigwigs/${cell_type_lower}/${species_lower}_${cell_type_lower}.hg38.cov.bw"
        
        if [[ -f "$input_frag" && -f "$CHROM_SIZES" ]]; then
            echo "🎯 Converting ${species} ${cell_type} to bigWig..."
            # Report success only if the conversion exits with status 0
            if scatac_fragment_tools bigwig \
                -i "$input_frag" \
                -c "$CHROM_SIZES" \
                -o "$output_bw" \
                -n; then
                echo "✅ Done: $output_bw"
            else
                echo "❌ Conversion failed for ${species} ${cell_type}"
            fi
        else
            echo "⚠️  Missing input for ${species} ${cell_type}, skipping"
        fi
        echo ""
    done
done

echo "✅ All bigWig conversions completed"

In [None]:
# Verify bigWig files
for cell_type in "${CELL_TYPES[@]}"; do
    cell_type_lower=$(echo "$cell_type" | tr '[:upper:]' '[:lower:]')
    # printf interprets \n; plain echo in bash would print it literally
    printf '\n📂 BigWig files for %s:\n' "$cell_type"
    ls -lh "bigwigs/${cell_type_lower}/"
done

---
## Summary

### Output Structure

```
fragment_files/
└── enterocytes/
    ├── human_enterocytes.fragments.tsv.gz
    ├── gorilla_enterocytes.fragments.tsv.gz
    ├── chimpanzee_enterocytes.fragments.tsv.gz
    ├── bonobo_enterocytes.fragments.tsv.gz
    ├── macaque_enterocytes.fragments.tsv.gz
    └── marmoset_enterocytes.fragments.tsv.gz

lifted_fragments/
└── enterocytes/
    ├── human_enterocytes.hg38.fragments.tsv.gz
    ├── gorilla_enterocytes.hg38.fragments_par.tsv.gz
    ├── chimpanzee_enterocytes.hg38.fragments_par.tsv.gz
    ├── bonobo_enterocytes.hg38.fragments_par.tsv.gz
    ├── macaque_enterocytes.hg38.fragments_par.tsv.gz
    └── marmoset_enterocytes.hg38.fragments_par.tsv.gz

bigwigs/
└── enterocytes/
    ├── human_enterocytes.hg38.cov.bw
    ├── gorilla_enterocytes.hg38.cov.bw
    ├── chimpanzee_enterocytes.hg38.cov.bw
    ├── bonobo_enterocytes.hg38.cov.bw
    ├── macaque_enterocytes.hg38.cov.bw
    └── marmoset_enterocytes.hg38.cov.bw
```

### Processing Notes

- **Human**: No liftover needed (already hg38), direct symlink created
- **Standard species** (Gorilla, Chimpanzee, Bonobo, Macaque): Single-step liftover with 30 CPUs
- **Marmoset**: Two-step liftover (calJac1 → calJac4 → hg38) with 30 CPUs
- All outputs organized by cell type for easy access and management
- BigWig files use normalized coverage (`-n` flag) for cross-sample comparison