Support for MEME-based initialization #57

Merged
merged 26 commits into from
Apr 22, 2020

@AvantiShri AvantiShri commented Apr 22, 2020

An initial clustering can be specified using the initclusterer_factory argument of TfModiscoSeqletsToPatternsFactory. See this notebook for an example. Here's an example for MEME-based initialization (which is what's supported at the time of writing):

initclusterer_factory=modisco.clusterinit.memeinit.MemeInitClustererFactory(
    meme_command="meme", base_outdir="meme_out",
    max_num_seqlets_to_use=10000,
    nmotifs=10,
    n_jobs=4)

Explanation of the arguments:

  • meme_command: the command used to invoke MEME. This is just meme if the meme executable is on the PATH; otherwise, meme_command should specify the full path to the executable, e.g. /software/meme/5.0.1/bin/meme on the kundajelab servers.
  • base_outdir: output directory for writing the MEME results (relative to the current working directory unless an absolute path is provided). Within this directory, a subdirectory is created for each metacluster.
  • max_num_seqlets_to_use: to prevent MEME from taking too long, the number of seqlets used for running MEME is capped at this value.
  • nmotifs: the number of motifs for MEME to find. Only significant motifs (E-value < 0.05) will be used for the clustering.
  • n_jobs: specifies the value of the -p argument of MEME, and also the number of parallel jobs to launch when doing motif scanning with the MEME PWMs.
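
Since meme_command can be either a bare command name or a full path, it may be worth checking that it resolves to an executable before starting a long run. The following is a minimal sketch using only the Python standard library; resolve_meme_command is a hypothetical helper for illustration and is not part of modisco.

```python
import shutil


def resolve_meme_command(meme_command="meme"):
    """Return the resolved path of the MEME executable, or raise if not found.

    Hypothetical helper (not part of modisco). shutil.which searches the PATH
    for a bare command name and checks the file directly when given a path.
    """
    resolved = shutil.which(meme_command)
    if resolved is None:
        raise FileNotFoundError(
            "Could not find an executable for meme_command=%r; "
            "pass the full path to the meme binary instead" % meme_command)
    return resolved
```

The returned path (or the original string) can then be passed as the meme_command argument shown above.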

The cluster initialization with MEME is achieved as follows: the PWMs produced by MEME are used to scan all the seqlets, and only PWM matches that exceed the Bayes optimal threshold specified by MEME are considered. Seqlets that contain no PWM matches are assigned to their own cluster. The remaining seqlets are each assigned to a cluster corresponding to the PWM for which they had the strongest match by log-odds score.
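
The assignment rule described above can be sketched in plain Python as follows. This is an illustration, not the code in this PR: the function name and input layout are made up, "exceed" is taken to mean strictly greater than the threshold, and seqlets with no match are assumed to share a single extra cluster.

```python
def assign_initial_clusters(scores, thresholds):
    """Assign each seqlet to the PWM with its strongest above-threshold match.

    scores[i][j]  : best log-odds score of PWM j anywhere on seqlet i
    thresholds[j] : match threshold for PWM j
    Returns one cluster index per seqlet. Index len(thresholds) is the
    "no match" cluster (assumed here to be one shared cluster).
    """
    no_match_cluster = len(thresholds)
    assignments = []
    for seqlet_scores in scores:
        # Keep only the PWMs whose score exceeds that PWM's threshold.
        passing = [(score, pwm_idx)
                   for pwm_idx, (score, thresh)
                   in enumerate(zip(seqlet_scores, thresholds))
                   if score > thresh]
        if passing:
            # Strongest match by log-odds score; ties go to the higher index.
            _, best_pwm = max(passing)
            assignments.append(best_pwm)
        else:
            assignments.append(no_match_cluster)
    return assignments


scores = [[5.0, 1.0], [2.0, 9.0], [0.5, 0.2], [6.0, 7.5]]
thresholds = [3.0, 4.0]
print(assign_initial_clusters(scores, thresholds))  # [0, 1, 2, 1]
```

The last seqlet passes both thresholds and goes to PWM 1 because 7.5 is the larger score; the third seqlet passes neither and lands in the no-match cluster.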

The cluster initialization affects the TF-MoDISco workflow in two places:

  • First, the fine-grained similarity is computed not just on the set of nearest neighbors that have the highest coarse-grained similarity across all seqlets, but also on the set of nearest neighbors that have the highest coarse-grained similarity within each initialized cluster.
  • Second, it is used to initialize Leiden community detection.
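
To illustrate the first point, here is a rough sketch of how the candidate set for fine-grained similarity could be built as the union of two neighbor lists. It is not the implementation in this PR: the function name is hypothetical, the same k is assumed for both lists, and each seqlet is excluded from its own neighbor list.

```python
def neighbors_for_fine_grained(coarse_sim, init_clusters, k):
    """For each seqlet, return the union of:
      - its top-k neighbors by coarse-grained similarity over all seqlets
      - its top-k neighbors restricted to its own initialized cluster
    coarse_sim is a square list-of-lists; init_clusters has one label per seqlet.
    """
    n = len(coarse_sim)
    result = []
    for i in range(n):
        # All other seqlets, most similar first.
        others = [j for j in range(n) if j != i]
        by_sim = sorted(others, key=lambda j: coarse_sim[i][j], reverse=True)
        global_top = by_sim[:k]
        within_top = [j for j in by_sim
                      if init_clusters[j] == init_clusters[i]][:k]
        result.append(sorted(set(global_top) | set(within_top)))
    return result


coarse_sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
init_clusters = [0, 1, 1, 0]
print(neighbors_for_fine_grained(coarse_sim, init_clusters, k=1))
# [[1, 3], [0, 2], [0, 1], [0, 2]]
```

Seqlet 0 would only be compared against seqlet 1 under a purely global top-1 rule; the within-cluster list adds seqlet 3, which shares its initialized cluster despite a low coarse-grained similarity.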

Empirically, this seems to result in TF-MoDISco clusters that get the "best of both worlds" from MEME and TF-MoDISco.

Other changes:

  • Moved from Louvain to Leiden for the main community detection step. Note that I am no longer doing consensus clustering with Leiden because it didn't appear to work well (consistent with this discussion on Twitter); instead, I am just taking the best modularity over 50 runs of Leiden with different random seeds. To go back to using Louvain for the main community detection step, set the use_louvain argument to True in TfModiscoSeqletsToPatternsFactory, but note that the cluster initialization functionality isn't supported with Louvain.*
  • Updated the Nanog notebook to showcase the MEME initialization functionality
  • Updated the Nanog notebook to use better normalization (I'm now just doing mean normalization across ACGT at each position, which I think is more intuitive and has a similar effect to the normalization I described in the GkmExplain paper). Also updated the notebook to apply normalization to the importance scores of the dinuc-shuffled null (previously, the scores for the null distribution weren't normalized)
  • Added tests for the MEME-based initialization
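
The "best modularity over 50 runs" selection mentioned in the first bullet can be sketched as below. This is only an illustration of the selection loop: stand_in_clusterer is a placeholder that produces a random two-way labelling and is not Leiden, and all function names are hypothetical. With a real Leiden implementation, run_once would wrap a single seeded run.

```python
import random


def modularity(edges, membership):
    """Newman modularity of an undirected, unweighted graph (edge list)."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Fraction of edges that fall inside a cluster.
    intra = sum(1 for u, v in edges if membership[u] == membership[v])
    # Total degree per cluster, for the expected-fraction term.
    degree_sum = {}
    for node, d in degree.items():
        c = membership[node]
        degree_sum[c] = degree_sum.get(c, 0) + d
    return intra / m - sum((d / (2 * m)) ** 2 for d in degree_sum.values())


def stand_in_clusterer(edges, n_nodes, seed):
    """Placeholder for one seeded clustering run. NOT Leiden."""
    rng = random.Random(seed)
    return [rng.randrange(2) for _ in range(n_nodes)]


def best_partition_over_seeds(edges, n_nodes, run_once, n_runs=50):
    """Run the clusterer with seeds 0..n_runs-1; keep the best modularity."""
    best_membership, best_quality = None, float("-inf")
    for seed in range(n_runs):
        membership = run_once(edges, n_nodes, seed)
        quality = modularity(edges, membership)
        if quality > best_quality:
            best_membership, best_quality = membership, quality
    return best_membership, best_quality


# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
membership, quality = best_partition_over_seeds(edges, 6, stand_in_clusterer)
print(membership, quality)
```

Unlike consensus clustering, this keeps exactly one of the candidate partitions, so the result is always a partition that the clusterer actually produced.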

*The reason I don't support cluster initialization with Louvain is that, when using Louvain, the number of clusters can only decrease from one iteration to the next, whereas with Leiden the number of clusters can go up because there's a cluster-splitting step. In other words, if initialization were used with Louvain, the number of discovered clusters would be capped at the number of clusters present during initialization, which is undesirable. By the way, Louvain is still used in the "spurious merging detection" step of the post-processing. In this step I attempt to split each cluster into two subclusters, and with Louvain this cap on the number of subclusters can be enforced by initializing Louvain with only 2 clusters (since the number of clusters in Louvain only decreases with each iteration).

@AvantiShri AvantiShri merged commit 1bfc63a into master Apr 22, 2020
@AvantiShri AvantiShri deleted the cluster_init branch April 22, 2020 05:35