# MEPP Walkthrough: Motif Enrichment Positional Profiling
This notebook walks you through a score-based motif enrichment analysis that profiles motif enrichment at multiple positions across a set of sequences. The sequences are centered on a biologically relevant feature, in this case a central binding motif of interest.

## Quickstart
If you already have MEPP installed, have a scored BED file, and want to quickly get started, jump to the section ["Convert scored bed file to scored sequences, and run MEPP analysis"](#Convert-scored-bed-file-to-scored-sequences,-and-run-MEPP-analysis)

# Install prerequisites

You will need the following prerequisites:
* MEPP
* HOMER
* pandas
* numpy
* gtfparse
* coolbox
* wget
* samtools
* deeptools
* bedtools
* bedops
* wiggletools
* wigToBigWig
* bigWigToWig
* tensorflow

To install most of these through mamba:
```
mamba create -n mepp_walkthrough -c bioconda -c conda-forge homer pandas numpy gtfparse coolbox wget samtools deeptools bedtools bedops wiggletools ucsc-wigtobigwig ucsc-bigwigtowig tensorflow
```

To install most of these through conda (slower):
```
conda create -n mepp_walkthrough -c bioconda -c conda-forge homer pandas numpy gtfparse coolbox wget samtools deeptools bedtools bedops wiggletools ucsc-wigtobigwig ucsc-bigwigtowig tensorflow
```

To activate the environment:
```
conda activate mepp_walkthrough
```

To install MEPP, use pip:
```
pip install git+https://github.com/npdeloss/mepp@main
```

Or, if you only have user privileges:
```
pip install git+https://github.com/npdeloss/mepp@main --user
```

You may need to append the following to your ~/.bashrc:
```
export PATH="$HOME/.local/bin:$PATH"
```

# Import key libraries

In [None]:
import pandas as pd
import numpy as np

# Enumerate sample sheet with alignment files for download
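Here we will be comparing K562 and HCT116 cell lines from ENCODE. The sample sheet also lists HepG2 and DND-41 replicates; these are downloaded and processed along with the others, but are not used in the comparison below.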

In [None]:
%%file k562.vs.hct116.atac-seq.samples.txt
cell_type replicate bam_url
k562 1 https://www.encodeproject.org/files/ENCFF512VEZ/@@download/ENCFF512VEZ.bam
k562 2 https://www.encodeproject.org/files/ENCFF987XOV/@@download/ENCFF987XOV.bam
hct116 1 https://www.encodeproject.org/files/ENCFF724QHH/@@download/ENCFF724QHH.bam
hct116 2 https://www.encodeproject.org/files/ENCFF927YUB/@@download/ENCFF927YUB.bam
hepg2 1 https://www.encodeproject.org/files/ENCFF239RGZ/@@download/ENCFF239RGZ.bam
hepg2 2 https://www.encodeproject.org/files/ENCFF394BBD/@@download/ENCFF394BBD.bam
dnd-41 1 https://www.encodeproject.org/files/ENCFF538YYI/@@download/ENCFF538YYI.bam
dnd-41 2 https://www.encodeproject.org/files/ENCFF080WSN/@@download/ENCFF080WSN.bam
dnd-41 3 https://www.encodeproject.org/files/ENCFF626KDS/@@download/ENCFF626KDS.bam

# Load sample sheet

In [None]:
samplesheet_filepath = 'k562.vs.hct116.atac-seq.samples.txt'
samplesheet_sep = ' '

In [None]:
samplesheet_df = pd.read_csv(samplesheet_filepath, sep = samplesheet_sep)
samplesheet_df['sample'] = True
samplesheet_df

# Download Alignment files

In [None]:
samplesheet_df['basename'] = samplesheet_df['cell_type'] + '_rep' + samplesheet_df['replicate'].astype(str)
samplesheet_df['bam_filepath'] = samplesheet_df['basename'] + '.bam'

In [None]:
samplesheet_df['wget_bam_cmd'] = (
    'wget -nc -O ' +
    samplesheet_df['bam_filepath'] + ' ' +
    '"' + samplesheet_df['bam_url'] + '"'
)

In [None]:
%%time

run_wget_bam_cmds = run_cmd = True
for cmd in list(samplesheet_df['wget_bam_cmd']):
    print(cmd)
    if run_cmd:
        ! {cmd}

# Index alignment files
Necessary to generate bigWig files

In [None]:
index_bam_threads = 8
samplesheet_df['index_bam_cmd'] = (
    f'samtools index -@ {index_bam_threads} ' +
    samplesheet_df['bam_filepath']
)

In [None]:
%%time

run_index_bam_cmds = run_cmd = True
for cmd in list(samplesheet_df['index_bam_cmd']):
    print(cmd)
    if run_cmd:
        ! {cmd}

# Compute coverage bigWig files from alignment files
You will need to use [effective genome size](https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html) numbers for the relevant genome.
Here we write two tracks per sample with deeptools `bamCoverage`: one normalized with RPKM and one without normalization (raw). You could also use another normalization of your choice at this step.
These tracks let you visualize and quantify coverage later on.
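The commands below pass `--normalizeUsing RPKM` to `bamCoverage`. The deepTools documentation describes RPKM per bin as the number of reads in the bin divided by the product of mapped reads (in millions) and bin length (in kb). The sketch below works through that arithmetic with made-up numbers; the function name is hypothetical and is not a deepTools API.

```python
def rpkm_per_bin(reads_in_bin, total_mapped_reads, bin_size_bp):
    """RPKM for one bin: reads / (mapped reads in millions * bin length in kb)."""
    mapped_millions = total_mapped_reads / 1_000_000
    bin_length_kb = bin_size_bp / 1_000
    return reads_in_bin / (mapped_millions * bin_length_kb)

# Made-up example: 5 reads in a 10 bp bin, 20 million mapped reads in total.
example_rpkm = rpkm_per_bin(reads_in_bin=5, total_mapped_reads=20_000_000, bin_size_bp=10)
print(example_rpkm)
```

Because the bin size (10 bp here) is much smaller than a kilobase, even a handful of reads per bin gives a sizeable RPKM value.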

In [None]:
# Value for GRCh38, from:
# https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
effective_genome_size = 2913022398
bamcoverage_binsize = 10
bamcoverage_threads = 'max/2'

In [None]:
samplesheet_df['bw_filepath'] = samplesheet_df['basename'] + '.bw'
samplesheet_df['raw_bw_filepath'] = samplesheet_df['basename'] + '.raw.bw'

In [None]:
samplesheet_df['bamcoverage_cmd'] = (
    f'bamCoverage ' + 
    f' -p {bamcoverage_threads} ' + 
    f' --effectiveGenomeSize {effective_genome_size} ' + 
    f' --normalizeUsing RPKM '
    f' -bs {bamcoverage_binsize}'
    f' -b ' + samplesheet_df['bam_filepath'] + 
    f' -o ' + samplesheet_df['bw_filepath']
)

samplesheet_df['bamcoverage_raw_cmd'] = (
    f'bamCoverage ' + 
    f' -p {bamcoverage_threads} ' + 
    f' --effectiveGenomeSize {effective_genome_size} ' + 
    f' --normalizeUsing None '
    f' -bs {bamcoverage_binsize}'
    f' -b ' + samplesheet_df['bam_filepath'] + 
    f' -o ' + samplesheet_df['raw_bw_filepath']
)

bamcoverage_cmds = list(samplesheet_df['bamcoverage_cmd']) + list(samplesheet_df['bamcoverage_raw_cmd'])

In [None]:
%%time

run_bamcoverage_cmds = run_cmd = True
for cmd in bamcoverage_cmds:
    print(cmd)
    if run_cmd:
        ! {cmd}

# Download reference genome
Also generate index and chromosome size files for bedtools and wigToBigWig

In [None]:
genome_fa_filepath = 'hg38.fa'
genome_fai_filepath = f'{genome_fa_filepath}.fai'
genome_chromsizes_filepath = f'{genome_fa_filepath}.chromsizes.tab'
genome_fa_gz_url = 'https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.masked.gz'

In [None]:
download_genome_fa_cmd = f'wget -nc -O {genome_fa_filepath}.gz "{genome_fa_gz_url}"; zcat {genome_fa_filepath}.gz > {genome_fa_filepath}'

In [None]:
index_genome_fa_cmd = f'samtools faidx {genome_fa_filepath}'
genome_chromsizes_cmd = f'cut -f1,2 {genome_fai_filepath} > {genome_chromsizes_filepath}'

In [None]:
%%time

run_download_genome_fa_cmd = run_cmd = True
for cmd in [download_genome_fa_cmd, index_genome_fa_cmd, genome_chromsizes_cmd]:
    print(cmd)
    if run_cmd:
        ! {cmd}

# Designate comparison
Here we compare HCT116 vs. K562 cells. Comparison groups are designated by the `cell_type` column.

In [None]:
group_1  = 'hct116'
group_2  = 'k562'

group_column = 'cell_type'
sample_column = 'sample'
sort_column = 'replicate'

comparison_prefix = f'{group_1}.vs.{group_2}'


In [None]:
print(comparison_prefix)

# List bigwigs belonging to each group

In [None]:
sample_subset_df = samplesheet_df[samplesheet_df[sample_column]].sort_values(by = sort_column).copy()

group_1_sample_bw_filepaths = list(sample_subset_df[sample_subset_df[group_column] == group_1]['bw_filepath'])
group_2_sample_bw_filepaths = list(sample_subset_df[sample_subset_df[group_column] == group_2]['bw_filepath'])
sample_bw_filepaths = group_1_sample_bw_filepaths + group_2_sample_bw_filepaths
group_1_sample_bw_filepaths_str = ' '.join(group_1_sample_bw_filepaths)
group_2_sample_bw_filepaths_str = ' '.join(group_2_sample_bw_filepaths)
sample_bw_filepaths_str = ' '.join(sample_bw_filepaths)

group_1_sample_raw_bw_filepaths = list(sample_subset_df[sample_subset_df[group_column] == group_1]['raw_bw_filepath'])
group_2_sample_raw_bw_filepaths = list(sample_subset_df[sample_subset_df[group_column] == group_2]['raw_bw_filepath'])
sample_raw_bw_filepaths = group_1_sample_raw_bw_filepaths + group_2_sample_raw_bw_filepaths
group_1_sample_raw_bw_filepaths_str = ' '.join(group_1_sample_raw_bw_filepaths)
group_2_sample_raw_bw_filepaths_str = ' '.join(group_2_sample_raw_bw_filepaths)
sample_raw_bw_filepaths_str = ' '.join(sample_raw_bw_filepaths)

# Calculate bigWig of Log2FC between groups
First we calculate the means of each group, in `{group_1}.mean.bw` and `{group_2}.mean.bw`. Then we compute the Log2 Fold Change (with pseudocount) as `log2((group_1_mean+1)/(group_2_mean+1))`. A pseudocount prevents division by zero in the ratio calculation.

We also compute the sum of coverage across all samples, for use later.
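As a small worked example of the formula above, the sketch below applies the pseudocount Log2FC to made-up mean coverage values. The function name is hypothetical; the notebook itself performs this calculation with `wiggletools`.

```python
import math

def log2_fold_change(group_1_mean, group_2_mean, pseudocount=1.0):
    """log2((group_1_mean + pseudocount) / (group_2_mean + pseudocount))."""
    return math.log2((group_1_mean + pseudocount) / (group_2_mean + pseudocount))

# Made-up mean coverage values for three bins.
group_1_means = [0.0, 3.0, 7.0]
group_2_means = [0.0, 1.0, 0.0]
log2fcs = [log2_fold_change(a, b) for a, b in zip(group_1_means, group_2_means)]
print(log2fcs)  # [0.0, 1.0, 3.0]
```

Note that the first and third bins have zero coverage in group 2; without the pseudocount, the ratio would be undefined there.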

In [None]:
group_1_bw_filepath = f'{group_1}.mean.bw'
group_2_bw_filepath = f'{group_2}.mean.bw'

sum_bw_filepath = f'{group_1}.vs.{group_2}.sum.bw'
log2fc_bw_filepath = f'{group_1}.vs.{group_2}.log2fc.bw'

In [None]:
group_1_bw_cmd = (
    f'wiggletools write {group_1_bw_filepath}.wig mean {group_1_sample_bw_filepaths_str} ; '
    f'wigToBigWig -clip {group_1_bw_filepath}.wig {genome_chromsizes_filepath} {group_1_bw_filepath} ; '
    f'rm {group_1_bw_filepath}.wig'
)
group_2_bw_cmd = (
    f'wiggletools write {group_2_bw_filepath}.wig mean {group_2_sample_bw_filepaths_str} ; '
    f'wigToBigWig -clip {group_2_bw_filepath}.wig {genome_chromsizes_filepath} {group_2_bw_filepath} ; '
    f'rm {group_2_bw_filepath}.wig'
)

sum_bw_cmd = (
    f'wiggletools write {sum_bw_filepath}.wig sum {group_1_sample_raw_bw_filepaths_str} {group_2_sample_raw_bw_filepaths_str} ; '
    f'wigToBigWig -clip {sum_bw_filepath}.wig {genome_chromsizes_filepath} {sum_bw_filepath} ; '
    f'rm {sum_bw_filepath}.wig'
)

log2fc_bw_cmd = (
    f'wiggletools write {group_1_bw_filepath}.plus_1.wig offset 1 {group_1_bw_filepath} ; '
    f'wiggletools write {group_2_bw_filepath}.plus_1.wig offset 1 {group_2_bw_filepath} ; '
    f'wiggletools write {group_1}.vs.{group_2}.ratio.wig ratio {group_1_bw_filepath}.plus_1.wig {group_2_bw_filepath}.plus_1.wig ; '
    f'wiggletools write {log2fc_bw_filepath}.wig log 2 {group_1}.vs.{group_2}.ratio.wig ; '
    f'wigToBigWig -clip {log2fc_bw_filepath}.wig {genome_chromsizes_filepath} {log2fc_bw_filepath} ; '
    f'rm {group_1_bw_filepath}.plus_1.wig ; '
    f'rm {group_2_bw_filepath}.plus_1.wig ; '
    f'rm {group_1}.vs.{group_2}.ratio.wig ; '
    f'rm {log2fc_bw_filepath}.wig '
)

In [None]:
bw_cmds = [group_1_bw_cmd, group_2_bw_cmd, sum_bw_cmd, log2fc_bw_cmd]

In [None]:
%%time

run_bw_cmds = run_cmd = True
for cmd in list(bw_cmds):
    print(cmd)
    if run_cmd:
        ! {cmd}

# Download annotation file
This step is optional; the annotation is only used for visualization.

In [None]:
# genes_gtf_filepath = 'hg38_ensGene.gtf'
# genes_gtf_gz_url = 'https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ensGene.gtf.gz'

genes_gtf_filepath = 'ENCFF159KBI.gtf'
genes_bed_filepath = genes_gtf_filepath[:-len('.gtf')]+'.bed'
genes_gtf_gz_url = 'https://www.encodeproject.org/files/ENCFF159KBI/@@download/ENCFF159KBI.gtf.gz'

In [None]:
download_genes_gtf_cmd = f'wget -nc -O {genes_gtf_filepath}.gz "{genes_gtf_gz_url}"; zcat {genes_gtf_filepath}.gz > {genes_gtf_filepath}'
# index_gtf_cmd = f'tabix -p gff {genes_gtf_filepath}'

In [None]:
%%time

run_download_genes_gtf_cmd = run_cmd = True
for cmd in [download_genes_gtf_cmd]:
    print(cmd)
    if run_cmd:
        ! {cmd}

In [None]:
! head {genes_gtf_filepath}

In [None]:
from gtfparse import read_gtf

In [None]:
genes_df = read_gtf(genes_gtf_filepath)
genes_df = genes_df[genes_df["feature"] == "gene"].copy()
genes_df = genes_df[['seqname', 'start', 'end', 'gene_name', 'score', 'strand']].copy()
genes_df['start'] = genes_df['start']-1
genes_df['score'] = 0
genes_df.to_csv(genes_bed_filepath, sep = '\t', index = False, header = None)
genes_df
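The `start - 1` in the cell above reflects a coordinate convention: GTF intervals are 1-based and inclusive, while BED intervals are 0-based and half-open. A minimal sketch with made-up coordinates (the function name is hypothetical):

```python
def gtf_to_bed_interval(gtf_start, gtf_end):
    """Convert a 1-based, inclusive GTF interval to a 0-based, half-open BED interval."""
    return gtf_start - 1, gtf_end

# Made-up feature covering bases 101 through 200 (100 bases).
bed_start, bed_end = gtf_to_bed_interval(101, 200)
print(bed_start, bed_end, bed_end - bed_start)  # 100 200 100
```

Only the start shifts; the end stays the same, and `end - start` then gives the feature length directly.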

# Visualize Log2 Fold Change bigWig with Coolbox

In [None]:
import coolbox
from coolbox.api import *

In [None]:
coolbox.__version__

In [None]:
genes_df[genes_df['gene_name']=='HBB']

In [None]:
test_margin = 2000
test_gene_df=genes_df[genes_df['gene_name']=='HBB'].copy().reset_index(drop = True)
test_chr = list(test_gene_df['seqname'])[0]
test_start = list(test_gene_df['start'])[0] - test_margin
test_end = list(test_gene_df['end'])[0] + test_margin

In [None]:
# test_range = 'chr9:5000000-5500000'
test_range = f'{test_chr}:{test_start}-{test_end}'
frame = (
    XAxis() + 
    BED(genes_bed_filepath) +
    Title('Genes') + TrackHeight(8) + Color('#323232') +
    BigWig(log2fc_bw_filepath) + Title('Log2FC') + Color('#cf32cf') +
    BigWig(group_1_bw_filepath) + Title(group_1) + Color('#3232cf') +
    BigWig(group_2_bw_filepath) + Title(group_2) + Color('#cf3232') +
    BigWig(sum_bw_filepath) + Title('Coverage') + Color('#32cd32')
)
frame.plot(test_range)
# bsr = Browser(frame)
# bsr.show()

# Copy over the HOMER motif library
If you installed HOMER with conda/mamba, use the command below to locate and copy its motif library.

In [None]:
copy_homer_motif_cmd = f'cp -rf $(dirname $(which homer))/../share/homer/motifs ./homer_motifs'

In [None]:
%%time

run_copy_homer_motif_cmd = run_cmd = True
for cmd in [copy_homer_motif_cmd]:
    print(cmd)
    if run_cmd:
        ! {cmd}

# Scan for motif instances to center sequence on

In [None]:
motif_basename = 'gata'

In [None]:
motif_subpath = '/'.join(motif_basename.split(' '))
motif_safe_basename = '_'.join(motif_subpath.split('/'))
scanned_motif_filepath = f'homer_motifs/{motif_subpath}.motif'
motif_scans_bed_filepath = f'{motif_safe_basename}.scans.bed'

In [None]:
scan_motifs_cmd = (
    f'scanMotifGenomeWide.pl {scanned_motif_filepath} {genome_fa_filepath} '
    f'-bed -5p 1> {motif_scans_bed_filepath} 2> {motif_scans_bed_filepath}.log'
)


In [None]:
%%time

run_scan_motifs_cmd = run_cmd = True
for cmd in [scan_motifs_cmd]:
    print(cmd)
    if run_cmd:
        ! {cmd}

In [None]:
! head {scanned_motif_filepath}

In [None]:
! head {motif_scans_bed_filepath}
! wc -l {motif_scans_bed_filepath}

# Score motif scans by Log2 Fold Change, and annotate with coverage

In [None]:
scored_motif_scans_filepath = f'{motif_safe_basename}.scans.scored_by.{log2fc_bw_filepath}.bed'
coverage_summed_motif_scans_filepath = f'{motif_safe_basename}.scans.scored_by.{sum_bw_filepath}.bed'

In [None]:
score_scans_cmd = f'bigWigToWig {log2fc_bw_filepath} >(wig2bed -x) | bedmap --echo --delim \'\\t\' --wmean {motif_scans_bed_filepath} - | awk \'$7!="NAN"\' | awk \'{{FS=OFS="\\t";$5=$7;print $1,$2,$3,$4,$5,$6}}\' > {scored_motif_scans_filepath}'

In [None]:
coverage_sum_scans_cmd = f'bigWigToWig {sum_bw_filepath} >(wig2bed -x) | bedmap --echo --delim \'\\t\' --wmean {scored_motif_scans_filepath} - > {coverage_summed_motif_scans_filepath}'
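Both commands score each motif interval with `bedmap --wmean`, a weighted mean of the signal elements that overlap it. The sketch below illustrates one reading of that idea, weighting each signal value by its number of overlapping bases. It is an assumption for illustration, with a hypothetical function name, and the exact weighting used by `bedmap` may differ.

```python
def overlap_weighted_mean(ref_start, ref_end, signal_intervals):
    """Mean of signal values weighted by bases of overlap with [ref_start, ref_end).

    signal_intervals is a list of (start, end, value) tuples in BED-style
    0-based, half-open coordinates. Returns None when nothing overlaps,
    loosely mirroring the "NAN" rows that the scoring command filters out.
    """
    weighted_sum = 0.0
    total_overlap = 0
    for start, end, value in signal_intervals:
        overlap = min(ref_end, end) - max(ref_start, start)
        if overlap > 0:
            weighted_sum += overlap * value
            total_overlap += overlap
    if total_overlap == 0:
        return None
    return weighted_sum / total_overlap

# Made-up 10 bp signal bins; the reference interval covers half of the first
# bin and all of the second, and misses the third entirely.
bins = [(0, 10, 2.0), (10, 20, 4.0), (20, 30, 100.0)]
print(overlap_weighted_mean(5, 20, bins))
```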

In [None]:
%%time

run_score_scans_cmds = run_cmd = True
for cmd in [score_scans_cmd, coverage_sum_scans_cmd]:
    print(cmd)
    if run_cmd:
        ! {cmd}

In [None]:
! head {scored_motif_scans_filepath}

In [None]:
! head {coverage_summed_motif_scans_filepath}

# Perform cluster deduplication, designate sequence length
By default, HOMER genome-wide motif scans report intervals spanning +/-100bp around the motif's 5' end. This step removes overlapping intervals so that the same stretch of genomic DNA is not sampled more than once. Without it, a motif that repeats within +/-100bp of its own instances would be sampled several times at different offsets, producing artifactual periodicity or positional enrichment.

Briefly, we cluster overlapping intervals, then for each cluster we keep only the interval with the highest summed coverage across all samples.

We also keep only intervals with a minimum summed coverage of 5, to avoid picking up unbound intervals.
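The selection logic described above can be sketched in plain Python on toy data. This assumes each interval already carries a cluster ID (as assigned by `bedtools cluster`) and a summed coverage value; the function name and dictionary keys are hypothetical.

```python
def deduplicate_clusters(intervals, min_coverage=5.0):
    """Keep the highest-coverage interval per cluster, after a coverage filter.

    intervals is a list of dicts with 'name', 'coverage', and 'cluster' keys.
    """
    best_by_cluster = {}
    for interval in intervals:
        # Drop intervals below the minimum summed coverage.
        if interval['coverage'] < min_coverage:
            continue
        current = best_by_cluster.get(interval['cluster'])
        if current is None or interval['coverage'] > current['coverage']:
            best_by_cluster[interval['cluster']] = interval
    return [best_by_cluster[cluster] for cluster in sorted(best_by_cluster)]

toy_intervals = [
    {'name': 'a', 'coverage': 12.0, 'cluster': 1},
    {'name': 'b', 'coverage': 30.0, 'cluster': 1},
    {'name': 'c', 'coverage': 2.0, 'cluster': 2},
    {'name': 'd', 'coverage': 8.0, 'cluster': 3},
]
kept_names = [interval['name'] for interval in deduplicate_clusters(toy_intervals)]
print(kept_names)  # ['b', 'd']
```

Interval `a` loses to `b` within cluster 1, and `c` is dropped by the coverage filter, so cluster 2 contributes nothing.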

In [None]:
sequence_length = 200

In [None]:
sequence_length = max(sequence_length, 200)
slop = (sequence_length-200)//2
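The arithmetic in the cell above pads the 200bp scan intervals equally on both sides. The sketch below wraps it in a hypothetical function to show two edge cases: requests shorter than 200bp are raised to 200bp, and odd lengths are rounded down to the nearest even length.

```python
def slop_for_length(sequence_length, base_length=200):
    """Bases to add on each side of a base_length interval to reach sequence_length."""
    sequence_length = max(sequence_length, base_length)
    return (sequence_length - base_length) // 2

for requested_length in [100, 200, 300, 301]:
    slop_bp = slop_for_length(requested_length)
    final_length = 200 + 2 * slop_bp
    print(requested_length, slop_bp, final_length)
```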

In [None]:
cluster_deduplicated_scored_motif_scans_filepath = scored_motif_scans_filepath[:-len('.bed')] + '.cluster_deduplicated.bed'
slopped_cluster_deduplicated_scored_motif_scans_filepath = cluster_deduplicated_scored_motif_scans_filepath[:-len('.bed')] + f'.slop_{slop}.bed'

In [None]:
min_coverage = 5.0
cluster_deduplication_cmd = (
    f'bedtools cluster -s -i {coverage_summed_motif_scans_filepath} '
    f'|awk \'$7>={min_coverage}\' '
    f'| sort -k8,8n -k7,7nr | awk \'!a[$8]++\' '
    f'| bedtools sort -i - |cut -f1-6 '
    f'> {cluster_deduplicated_scored_motif_scans_filepath}'
)

slop_cmd = (
    f'bedtools slop -i {cluster_deduplicated_scored_motif_scans_filepath} -b {slop} -g {genome_chromsizes_filepath} > {slopped_cluster_deduplicated_scored_motif_scans_filepath}'
)



In [None]:
%%time

run_bedtools_cmds = run_cmd = True
for cmd in [cluster_deduplication_cmd, slop_cmd]:
    print(cmd)
    if run_cmd:
        ! {cmd}

In [None]:
! head {slopped_cluster_deduplicated_scored_motif_scans_filepath}

# Preview BED file of scored intervals 

In [None]:
bed_columns = 'Chr Start End Name Score Strand'.split()
bed_df = pd.read_csv(cluster_deduplicated_scored_motif_scans_filepath, sep = '\t', header = None, names = bed_columns)
bed_df

# Visualize score distribution
Like most score-based tools, MEPP works best when the scores are approximately normally distributed.

In [None]:
bed_df[['Score']].hist(bins=100)
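Alongside the histogram, a single number can give a rough sense of asymmetry. The sketch below computes sample skewness in plain Python; it is only a coarse diagnostic, not something MEPP requires, and the function name is hypothetical. Values near 0 suggest a roughly symmetric distribution.

```python
def sample_skewness(values):
    """Skewness as m3 / m2**1.5, using population (biased) central moments.

    Raises ZeroDivisionError if all values are identical.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed_scores = [0.0, 0.0, 0.0, 0.0, 10.0]
print(sample_skewness(symmetric_scores))
print(sample_skewness(right_skewed_scores))
```

In the notebook, `sample_skewness(list(bed_df['Score']))` would apply this to the real scores; pandas also provides `bed_df['Score'].skew()`, which uses a bias-corrected estimator and so gives slightly different numbers.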

# Download JASPAR-converted HOMER vertebrate motifs

In [None]:
motifs_url = 'https://raw.githubusercontent.com/npdeloss/mepp/main/data/homer.motifs.txt'
motifs_filepath = 'homer.motifs.txt'

In [None]:
wget_motifs_cmd = (
    f'wget -nc -O {motifs_filepath} "{motifs_url}"'
)

In [None]:
%%time

run_wget_motifs_cmds = run_cmd = True
for cmd in [wget_motifs_cmd]:
    print(cmd)
    if run_cmd:
        ! {cmd}

# Convert scored bed file to scored sequences, and run MEPP analysis
In `mepp.get_scored_fasta` we handle reverse complementation of each sequence according to the BED interval's strand value. The output can then be piped directly into MEPP.
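To make the strand handling concrete, here is a minimal sketch of strand-aware orientation. It illustrates the general idea only and is not the code used inside `mepp.get_scored_fasta`; the names `COMPLEMENT` and `orient_sequence` are hypothetical.

```python
# Translation table for complementing DNA bases, preserving case and N.
COMPLEMENT = str.maketrans('ACGTNacgtn', 'TGCANtgcan')

def orient_sequence(sequence, strand):
    """Return the sequence as-is for '+', or its reverse complement for '-'."""
    if strand == '-':
        return sequence.translate(COMPLEMENT)[::-1]
    return sequence

print(orient_sequence('AAGATAAN', '+'))  # AAGATAAN
print(orient_sequence('AAGATAAN', '-'))  # NTTATCTT
```

With this orientation, a motif on the minus strand reads 5' to 3' in the extracted sequence, so positions are comparable across intervals on both strands.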

## Explanation of parameters

`python -m mepp.get_scored_fasta`
* Utility for extracting scored FASTA files (sequence score in header) from scored bed files
    * `-fi {genome_fa_filepath}`: Extract sequence from the specified genome FASTA (required).
    * `-bed {bed_filepath}`: Extract sequences from the intervals specified in this BED file (required). 
    
`|python -m mepp.cli`
* Pipe output from previous command into MEPP
    * `--fa -`: Receive scored FASTA from the output of the previous command (Required).
    * `--motifs {motifs_filepath}`: Analyze these motifs from a JASPAR-formatted motif matrix collection file (Required).
    * `--out {mepp_filepath}`: Output to this directory (Required).
    * `--perms 100`: Use 100 permutations for confidence interval statistics (Default: 1000, can be costly in time & memory).
    * `--batch 1000`: Use 1000 as the TensorFlow batch size (Default: 1000, adjust according to machine memory).
    * `--dgt 50`: Only analyze sequences with less than 50% degenerate base content (Default: 100, adjust according to analysis needs).
    * `--jobs 15`: Use 15 jobs for multithreaded tasks. (Default: Use all cores)
    * `--gjobs 15`: Use 15 jobs for multithreaded tasks optimizable by TensorFlow GPU usage. (Default: 1)
    * `--nogpu`: Don't use the GPU (Default: Use GPU.)
        * if set, `--gjobs` is simply the number of cores used to process motifs in parallel.
    * `--dpi 100`: DPI of plots. Important, since the motif occurrence heatmap is DPI-dependent. (Default: 300)
    * `--orientations +,- `: Analyze these orientations of the motifs (Forward, and reverse). (Default: +,+/-, analyze Forward, and non-orientation specific)
    * Not specified here:
        * ` --margin {INTEGER}`: Number of bases along either side of motif to "blur" motif matches for smoothing. (Default: 2)
            * It can be useful to set this depending on how strictly your sequences have been centered. If centering on ChIP-seq peak centers, consider a larger margin.
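The `--dgt` threshold matters here because the genome FASTA downloaded earlier is a masked one, so some extracted sequences can contain runs of `N`. The sketch below shows one way to compute degenerate base content. It assumes that any base other than A, C, G, or T counts as degenerate and that the comparison is strict ("less than"); MEPP's own implementation may differ, and both function names are hypothetical.

```python
def degenerate_percent(sequence):
    """Percentage of bases in sequence that are not A, C, G, or T."""
    if not sequence:
        return 0.0
    degenerate = sum(1 for base in sequence.upper() if base not in 'ACGT')
    return 100.0 * degenerate / len(sequence)

def passes_degenerate_threshold(sequence, threshold_percent=50.0):
    """True if the sequence has less than threshold_percent degenerate bases."""
    return degenerate_percent(sequence) < threshold_percent

print(degenerate_percent('ACGTNNNN'))  # 50.0
print(passes_degenerate_threshold('ACGTNNNN'))  # False
print(passes_degenerate_threshold('ACGTACNN'))  # True
```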

In [None]:
mepp_filepath = slopped_cluster_deduplicated_scored_motif_scans_filepath[:-len('.bed')]+f'.for_notebook.mepp'

mepp_cmd = (
    f'python -m mepp.get_scored_fasta -fi {genome_fa_filepath} '
    f'-bed {slopped_cluster_deduplicated_scored_motif_scans_filepath} '
    f'|python -m mepp.cli '
    f'--fa - '
    f'--motifs {motifs_filepath} '
    f'--out {mepp_filepath} '
    f'--perms 100 '
    f'--batch 1000 '
    f'--dgt 50 '
    f'--jobs 15 '
    f'--gjobs 15 '
    f'--nogpu '
    f'--dpi 100 '
    f'--orientations +,- '
    f'&> {mepp_filepath}.log'
)

In [None]:
%%time

run_mepp_cmd = run_cmd = True
for cmd in [mepp_cmd]:
    print(cmd)
    if run_cmd:
        ! {cmd}

In [None]:
! tail {mepp_filepath}.log

# Show links to MEPP HTML outputs
MEPP outputs HTML files that are useful for visualizing and navigating your data

In [None]:
from IPython.display import display, Markdown

In [None]:

mepp_results_table_fwd_md = f'[Results table, + orientation]({mepp_filepath}/results_table_orientation_fwd.html)'
mepp_clustermap_fwd_md = f'[Clustermap, + orientation]({mepp_filepath}/clustermap_orientation_fwd.html)'

mepp_results_table_rev_md = f'[Results table, - orientation]({mepp_filepath}/results_table_orientation_rev.html)'
mepp_clustermap_rev_md = f'[Clustermap, - orientation]({mepp_filepath}/clustermap_orientation_rev.html)'

In [None]:
display(Markdown(mepp_results_table_fwd_md))
display(Markdown(mepp_clustermap_fwd_md))
display(Markdown(mepp_results_table_rev_md))
display(Markdown(mepp_clustermap_rev_md))

# Example commands for set-based analysis with CentriMo
Here we take the top and bottom 10% of scored sequences and use them as the positive and negative inputs to a set-based motif enrichment analysis (MEA) tool, e.g. CentriMo.

In [None]:
percent = 10
lower_percent = percent
higher_percent = 100.0-percent
lower_thresh, upper_thresh = list(np.percentile(bed_df['Score'], [lower_percent, higher_percent]))

In [None]:
upper_slopped_cluster_deduplicated_scored_motif_scans_filepath = cluster_deduplicated_scored_motif_scans_filepath[:-len('.bed')]+'.upper.bed'
lower_slopped_cluster_deduplicated_scored_motif_scans_filepath = cluster_deduplicated_scored_motif_scans_filepath[:-len('.bed')]+'.lower.bed'

upper_slopped_cluster_deduplicated_scored_motif_scans_fa_filepath = cluster_deduplicated_scored_motif_scans_filepath[:-len('.bed')]+'.upper.fa'
lower_slopped_cluster_deduplicated_scored_motif_scans_fa_filepath = cluster_deduplicated_scored_motif_scans_filepath[:-len('.bed')]+'.lower.fa'

In [None]:
upper_bed_df = bed_df[bed_df['Score']>=upper_thresh].copy()
lower_bed_df = bed_df[bed_df['Score']<=lower_thresh].copy()

In [None]:
upper_bed_df.to_csv(upper_slopped_cluster_deduplicated_scored_motif_scans_filepath, sep = '\t', index = False, header = None)
! head {upper_slopped_cluster_deduplicated_scored_motif_scans_filepath}

In [None]:
lower_bed_df.to_csv(lower_slopped_cluster_deduplicated_scored_motif_scans_filepath, sep = '\t', index = False, header = None)
! head {lower_slopped_cluster_deduplicated_scored_motif_scans_filepath}

In [None]:
upper_fa_cmd = ( f'python -m mepp.get_scored_fasta -fi {genome_fa_filepath} '
    f'-bed {upper_slopped_cluster_deduplicated_scored_motif_scans_filepath} '
    f'> {upper_slopped_cluster_deduplicated_scored_motif_scans_fa_filepath}'
)

lower_fa_cmd = ( f'python -m mepp.get_scored_fasta -fi {genome_fa_filepath} '
    f'-bed {lower_slopped_cluster_deduplicated_scored_motif_scans_filepath} '
    f'> {lower_slopped_cluster_deduplicated_scored_motif_scans_fa_filepath}'
)

In [None]:
%%time

run_thresh_fa_cmds = run_cmd = True
for cmd in [upper_fa_cmd, lower_fa_cmd]:
    print(cmd)
    if run_cmd:
        ! {cmd}

In [None]:
upper_bed_df.shape

In [None]:
lower_bed_df.shape

# Example equivalent CentriMo command

In [None]:
# Assumes a MEME-format motif file already exists at this path; this notebook does not create it.
meme_motifs_filepath = 'homer.motifs.id_fixed.meme'
centrimo_filepath = mepp_filepath[:-len('.mepp')]+f'.upper_vs_lower.for_notebook.centrimo'
centrimo_cmd = (
    f'mkdir -p {centrimo_filepath} ;'
    f'$(which time) --verbose '
    f'centrimo --oc {centrimo_filepath} '
    f'--neg {lower_slopped_cluster_deduplicated_scored_motif_scans_fa_filepath} '
    f'--norc --sep --local --noseq '
    f'{upper_slopped_cluster_deduplicated_scored_motif_scans_fa_filepath} '
    f'{meme_motifs_filepath}'
)
print(centrimo_cmd)