Skip to content

jvierstra/motif-clustering

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
bin
 
 
 
 
 
 
 
 

Non-redundant TF motif matches genome-wide

Below describes the general workflow for clustering motif models to remove redundancy, generating an "archetype" motif, and then finally, performing genome-wide scans of motifs and remvoval of redundancy. Please note that this documentation is incomplete, as should be used a a rough guide only.

If you are looking for the final results (motif clusters, genome-wide scans and a browser shot) please see the following website: https://resources.altius.org/~jvierstra/projects/motif-clustering/

Note: The above link is for version 1.0 of the motif clusters. While we have yet to update the website, version 2.0-beta (latest) can be found at https://resources.altius.org/~jvierstra/projects/motif-clustering-v2.0beta/.

Contact me at jvierstra (at) altius.org with any questions/requests/comments.

Requirements

Versions

Quick start

Step 1: Download and preproces motifs

See runall script in each motif database directory (databases/*)

Step 1: Compute pair-wise motif similarity

Here we use TOMTOM to determine the similarity between all motif models (all pairwise) with the following code:

meme2meme databases/*/*.meme > tomtom/all.dbs.meme

tomtom \
	-bfile /net/seq/data/projects/motifs/hg19.K36.mappable_only.5-order.markov \
	-dist kullback \
	-motif-pseudo 0.1 \
	-text \
	-min-overlap 1 \
	tomtom/all.dbs.meme tomtom/all.dbs.meme \
> tomtom/tomtom.all.txt

I have a provided a script that will load this operation up on a SLURM parallel compute cluster (see bin/runall.tomtom.v2.0beta-human for an example)

Step 2: Hierarchically cluster motifs

After running TOMTOM, open up the provided Jupyter Notebook to perform the clustering and visualization

We perform hierarchical clustering (distance: correlation, complete linkage) from the TOMTOM similarity E-values. Below is a heatmap representation of motifs clustered by simililarity and clusters identified cutting the dendrogram at height 0.7.

Clustered heatmap cut at height 0.7

Step 3: Process each cluster to build a motif archetype

Again, inside the notebook there is code that will process and visualize each motif cluster.

AC0002 (homeodomain) AC0240 (CCAAT-box)
AC0002 AC0240

Step 4: Make HTML output

Run the BASH script bin/runall.make-html to generate an HTML webpage (index.html) in the results directory

Step 5: Scan genome using all motif models (individually and archetype)

I use the software package MOODS to find motif matches genome-wide. Its a great tool and that I highly reccomend. See bin/runall.scan_models for an example of how to do this on a SLURM cluster.

Step 5: Create working and browser tracks

To create a bigBed file from a bed9+4, we need to include an AutoSql file (bed_format.as)

table hg38_motifs_collapsed
"Collapsed motifs matches in hg38 (see: http://www.github.com/jvierstra/motif-clustering)"
(
string  chrom;        "Reference sequence chromosome or scaffold"
uint    chromStart;    "Start position of feature on chromosome"
uint    chromEnd;    "End position of feature on chromosome"
string  name;        "Name of motif"
uint    score;        "Score"
char[1] strand;        "+ or - for strand"
uint    thickStart;    "Coding region start"
uint    thickEnd;    "Coding region end"
uint      reserved;    "itemRgb"
)

Make the tracks for the archetypes

bedToBigBed -as=bed_format.as -type=bed9+4 -tab moods.combined.all.bed chrom.sizes moods.combined.all.bb
awk -v OFS="\t" '{ print $1, $2, $3, $4, $11, $6, $10, $13}' moods.combined.all.bed | bgzip -c > moods.combined.all.bed.gz
tabix -p bed moods.combined.all.bed.gz

Make the tracks for the full motif scans.

fetchChromSizes hg38 > /tmp/chrom.sizes
awk -v OFS="\t" '{ print $1, 0, $2; }' /tmp/chrom.sizes | sort-bed - > /tmp/chrom.sizes.bed
bedops -e 100% moods.combined.all.bed /tmp/chrom.sizes.bed \
| awk -v OFS="\t" '{ print $1, $2, $3, $4, 0, $6, $2, $3, "0,0,0", $5, $7 }' > /tmp/moods
bedToBigBed -as=bed_format.as -type=bed9+2 -tab /tmp/moods chrom.sizes moods.combined.all.bb