Support for MEME-based initialization #57

AvantiShri · 2020-04-22T01:23:48Z

An initial clustering can be specified using the initclusterer_factory argument of TfModiscoSeqletsToPatternsFactory. See this notebook for an example. Here is an example for MEME-based initialization, which is the initialization supported at the time of writing:

initclusterer_factory=modisco.clusterinit.memeinit.MemeInitClustererFactory(
    meme_command="meme", base_outdir="meme_out",
    max_num_seqlets_to_use=10000,
    nmotifs=10,
    n_jobs=4)

Explanation of the arguments:

meme_command: this is just meme if the meme executable is in the PATH; if it's not in the PATH, meme_command should give the full path to the executable, e.g. /software/meme/5.0.1/bin/meme on the kundajelab servers.
base_outdir: output directory for writing the MEME results (relative to the current working directory unless an absolute path is provided). Within this directory, a subdirectory is created for each metacluster.
max_num_seqlets_to_use: to prevent MEME from taking too long, the number of seqlets used for running MEME is capped at this value.
nmotifs: the number of motifs for MEME to find. Only significant motifs (E-value < 0.05) will be used for the clustering.
n_jobs: specifies the value of the -p argument of MEME, and also the number of parallel jobs to launch when doing motif scanning with the MEME PWMs.
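Since meme_command can be either a bare name on the PATH or a full path, it can help to check that it resolves before launching a long run. The following is a minimal sketch using only the Python standard library; resolve_meme_command is a hypothetical helper name and is not part of modisco.

```python
import shutil


def resolve_meme_command(meme_command="meme"):
    """Return the resolved path of the MEME executable, or raise if missing.

    Hypothetical helper (not part of modisco): shutil.which accepts both a
    bare command name, which it looks up on the PATH, and an explicit path.
    """
    resolved = shutil.which(meme_command)
    if resolved is None:
        raise FileNotFoundError(
            f"Could not find an executable for {meme_command!r}; "
            "pass the full path to the meme binary as meme_command")
    return resolved
```

Calling resolve_meme_command("meme") before constructing the factory fails early with a clear message if MEME is not installed or not on the PATH.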

The cluster initialization with MEME is achieved as follows: the PWMs produced by MEME are used to scan all the seqlets, and only PWM matches that exceed the Bayes optimal threshold specified by MEME are considered. Seqlets that contain no PWM matches are assigned to their own cluster. The remaining seqlets are each assigned to a cluster corresponding to the PWM for which they had the strongest match by log-odds score.
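The assignment rule described above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the code in modisco.clusterinit.memeinit: the function name, the input layout (one best log-odds score per seqlet and PWM), and the choice to put all unmatched seqlets into a single extra cluster are assumptions made for the example.

```python
import numpy as np


def assign_initial_clusters(scores, thresholds):
    """Assign each seqlet to the PWM with its strongest above-threshold match.

    scores:     (n_seqlets, n_motifs) best log-odds score per seqlet and PWM
    thresholds: (n_motifs,) per-PWM score threshold
    Returns an integer array of cluster indices; index n_motifs is the
    (assumed) shared cluster for seqlets with no PWM match.
    """
    scores = np.asarray(scores, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    n_motifs = scores.shape[1]
    # A match only counts if it exceeds that PWM's threshold
    passing = scores > thresholds[None, :]
    # Mask out sub-threshold scores so they can never win the argmax
    masked = np.where(passing, scores, -np.inf)
    best = np.argmax(masked, axis=1)
    has_match = passing.any(axis=1)
    return np.where(has_match, best, n_motifs)


example_scores = [[5.0, 1.0], [2.0, 9.0], [1.0, 1.0], [6.0, 7.0]]
example_thresholds = [3.0, 4.0]
print(assign_initial_clusters(example_scores, example_thresholds))  # [0 1 2 1]
```

In the example, the third seqlet passes neither threshold and lands in the extra cluster (index 2), while the fourth passes both and goes to the PWM with the higher score.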

The cluster initialization affects the TF-MoDISco workflow in two places:

First, the fine-grained similarity is computed not just on the set of nearest neighbors that have the highest coarse-grained similarity across all seqlets, but also on the set of nearest neighbors that have the highest coarse-grained similarity within each initialized cluster.
Second, the initial clustering is used to initialize Leiden community detection.
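To make the first point concrete, here is a rough sketch of how the candidate set for fine-grained similarity could be built as a union of two neighbor lists. The function name, the parameter k, and the exclusion of each seqlet from its own neighbor list are assumptions for illustration; the actual TF-MoDISco code may differ.

```python
import numpy as np


def candidate_neighbors(coarse_sim, init_labels, k):
    """For each seqlet, list the seqlets to compare at fine-grained level.

    The candidates are the union of (a) the k seqlets with the highest
    coarse-grained similarity overall and (b) the k seqlets with the highest
    coarse-grained similarity within the seqlet's own initial cluster.
    """
    coarse_sim = np.asarray(coarse_sim, dtype=float)
    init_labels = np.asarray(init_labels)
    candidates = []
    for i in range(coarse_sim.shape[0]):
        # Indices sorted by decreasing coarse similarity, excluding self
        order = np.argsort(-coarse_sim[i], kind="stable")
        order = order[order != i]
        global_top = order[:k]
        # Restrict the same ranking to seqlets sharing the initial cluster
        within_top = order[init_labels[order] == init_labels[i]][:k]
        candidates.append(sorted(set(global_top.tolist()) | set(within_top.tolist())))
    return candidates
```

The effect is that a seqlet is always compared against the closest members of its own initial cluster, even when those members would not have made it into the global top-k list.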

Empirically, this seems to result in TF-MoDISco clusters that get the "best of both worlds" from MEME and TF-MoDISco.

Other changes:

Moved from Louvain to Leiden for the main community detection step. Note that I am no longer doing consensus clustering with Leiden because it didn't appear to work well (consistent with this discussion on Twitter); instead, I am just taking the partition with the best modularity over 50 runs of Leiden with different random seeds. To go back to using Louvain for the main community detection step, set the use_louvain argument to True in TfModiscoSeqletsToPatternsFactory, but note that the cluster initialization functionality isn't supported with Louvain.*
Updated the Nanog notebook to showcase the MEME initialization functionality
Updated the Nanog notebook to use better normalization (I'm now just doing mean normalization across ACGT at each position, which I think is more intuitive and has a similar effect to the normalization I described in the GkmExplain paper). Also updated the notebook to apply normalization to the importance scores of the dinuc-shuffled null (previously, the scores for the null distribution weren't normalized)
Added tests for the MEME-based initialization

*The reason I don't support cluster initialization with Louvain is that, when using Louvain, the number of clusters can only decrease from one iteration to the next, whereas with Leiden the number of clusters can go up because there's a cluster-splitting step. In other words, if initialization were used with Louvain, the number of discovered clusters would be capped at the number of clusters present during initialization, which is undesirable. By the way, Louvain is still used in the "spurious merging detection" step of the post-processing: in this step I attempt to split each cluster into two subclusters, and with Louvain this cap on the number of subclusters can be achieved by initializing Louvain with only 2 clusters (since the number of clusters in Louvain only decreases with each iteration).
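The "best modularity over 50 runs with different random seeds" selection mentioned under Other changes can be sketched as below. This is an illustration only: cluster_fn stands in for a single seeded Leiden run and is a hypothetical callable, and the modularity here is the standard Newman modularity computed with NumPy on a symmetric adjacency matrix, which may differ from the quality function used in the actual code.

```python
import numpy as np


def modularity(adj, labels):
    """Newman modularity of a partition of an undirected weighted graph."""
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    two_m = adj.sum()                      # total weight, counted in both directions
    degrees = adj.sum(axis=1)
    same = labels[:, None] == labels[None, :]
    expected = np.outer(degrees, degrees) / two_m
    return float(((adj - expected) * same).sum() / two_m)


def best_of_n_seeds(adj, cluster_fn, n_runs=50):
    """Run cluster_fn(adj, seed) for n_runs seeds; keep the best partition.

    cluster_fn is a hypothetical stand-in for one seeded community-detection
    run and must return one cluster label per node.
    """
    best_labels, best_q = None, -np.inf
    for seed in range(n_runs):
        labels = cluster_fn(adj, seed)
        q = modularity(adj, labels)
        if q > best_q:
            best_labels, best_q = labels, q
    return best_labels, best_q
```

Unlike consensus clustering, this keeps exactly one of the candidate partitions, so the result is always a partition that the clustering algorithm actually produced.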


AvantiShri added 26 commits April 3, 2020 03:08

added leiden clustering

603362e

added leiden clustering

51bafa3

todos

0b63ad3

partial implementation for cluster intialization

fd1e8d8

initial implementation in

27a1f8c

saving the MEME-DISco-derived motifs

1ba88d5

saving and loading of initcluster motifs implemented not tested

9d9f228

debugged saving

5c95d4d

notebook with meme init working on gkmexplain Nanog example

03a01fe

notebook with meme init working on gkmexplain Nanog example

a390190

results on Nanog gkmexplain without meme-disco init

a09be8e

gkmexplain Nanog results with MEME-DISco 10 init

b86c5ae

bugfix on seqlet subsample for meme run

01a08ee

trying to implement init with louvain again

0c42575

bugfix the argsort of the affinities

98d9052

With Leiden instead of Louvain

8911816

making Leiden the default, remove the init implementation for Louvain

c591859

comitting updated notebooks, changed arg name

f1c2d72

updated Nanog notebook

5f10648

Created using Colaboratory

40cafde

Delete Nanog_GkmExplain_Generate_Data.ipynb

716186d

Created using Colaboratory

de01768

updated notebook link

dc8061c

Merge branch 'cluster_init' of https://github.com/kundajelab/modisco_…

d76f68a

…private into cluster_init

updated JustExtractSeqlets as well

98c44bb

fixed test, added -p argument

e1db274

AvantiShri merged commit 1bfc63a into master Apr 22, 2020

AvantiShri deleted the cluster_init branch April 22, 2020 05:35
