Support for MEME-based initialization #57
Merged
An initial clustering can be specified using the `initclusterer_factory` argument of `TfModiscoSeqletsToPatternsFactory`. See this notebook for an example. Here's an example for MEME-based initialization (which is what's supported at the time of writing):

Explanation of the arguments:

- `meme_command`: this is just `meme` if the meme executable is in the PATH; if it's not in the PATH, then `meme_command` should specify the full path to the executable, e.g. `/software/meme/5.0.1/bin/meme` on the kundajelab servers.
- `base_outdir`: output directory for writing the MEME results (relative to the current working directory unless an absolute path is provided). Within this directory, a subdirectory is created for each metacluster.
- `max_num_seqlets_to_use`: to prevent MEME from taking too long, the number of seqlets used for running MEME is capped at this value.
- `nmotifs`: the number of motifs for MEME to find. Only significant motifs (E-value < 0.05) are used for the clustering.
- `njobs`: specifies the value of the `-p` argument of MEME, and also the number of parallel jobs to launch when doing motif scanning with the MEME PWMs.

The cluster initialization with MEME is achieved as follows: the PWMs produced by MEME are used to scan all the seqlets, and only PWM matches that exceed the Bayes optimal threshold specified by MEME are considered. Seqlets that contain no PWM matches are assigned to their own cluster. The remaining seqlets are each assigned to a cluster corresponding to the PWM for which they had the strongest match by log-odds score.
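The assignment rule described above can be sketched in a few lines of Python. This is only an illustration, not the code from this pull request: the function name `assign_initial_clusters` and its inputs are hypothetical, the per-seqlet best log-odds scores are assumed to have been computed already, and the sketch assumes that all seqlets without a match share a single extra cluster.

```python
from typing import List


def assign_initial_clusters(
    best_scores: List[List[float]],
    thresholds: List[float],
) -> List[int]:
    """Assign each seqlet to an initial cluster.

    best_scores[i][j] is the strongest log-odds score of PWM j on seqlet i.
    thresholds[j] is the match threshold for PWM j.
    Cluster j corresponds to PWM j; cluster len(thresholds) is assumed
    (for this sketch) to hold every seqlet that has no PWM match.
    """
    no_match_cluster = len(thresholds)
    assignments = []
    for seqlet_scores in best_scores:
        # Only matches that exceed the PWM's own threshold are considered.
        passing = [
            (score, pwm_idx)
            for pwm_idx, score in enumerate(seqlet_scores)
            if score > thresholds[pwm_idx]
        ]
        if passing:
            # Strongest passing match by log-odds score wins.
            _, best_pwm = max(passing, key=lambda pair: pair[0])
            assignments.append(best_pwm)
        else:
            assignments.append(no_match_cluster)
    return assignments
```

Note that a score only competes if it clears its own PWM's threshold, so a seqlet can be assigned to a PWM with a lower raw score when the higher-scoring PWM did not pass.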
The cluster initialization affects the TF-MoDISco workflow in two places:
Empirically, this seems to result in TF-MoDISco clusters that get the "best of both worlds" from MEME and TF-MoDISco.
Other changes:
- Set the `use_louvain` argument to `True` in `TfModiscoSeqletsToPatternsFactory` to use Louvain, but note that the cluster initialization functionality isn't supported with Louvain.

The reason I don't support cluster initialization with Louvain is that, when using Louvain, the number of clusters can only decrease from one iteration to the next (with Leiden, the number of clusters can go up because there's a cluster-splitting step). In other words, if initialization were used with Louvain, the number of discovered clusters would be capped at the number of clusters present at initialization, which is undesirable.

By the way, Louvain is still used in the "spurious merging detection" step of the post-processing. In this step I attempt to split each cluster into two subclusters, and with Louvain this cap on the number of subclusters can be achieved by initializing Louvain to have only 2 clusters (since the number of clusters in Louvain only decreases with each iteration).
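To illustrate why the cluster count cannot grow under Louvain, here is a simplified sketch of a Louvain-style local-moving pass on an unweighted graph without self-loops. It is not the implementation used by TF-MoDISco; the function name `louvain_local_moves` is hypothetical and the aggregation phase is omitted. The point is that a node can only stay where it is or join a community already held by one of its neighbors, so no new community label is ever created.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def louvain_local_moves(
    edges: List[Tuple[Hashable, Hashable]],
    labels: Dict[Hashable, int],
    max_sweeps: int = 100,
) -> Dict[Hashable, int]:
    """Greedy modularity-driven node moves, starting from `labels`."""
    labels = dict(labels)
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    degree = {node: len(nbrs) for node, nbrs in neighbors.items()}
    two_m = sum(degree.values())  # twice the number of edges

    # Total degree of each community.
    tot = defaultdict(int)
    for node, deg in degree.items():
        tot[labels[node]] += deg

    for _ in range(max_sweeps):
        moved = False
        for node in sorted(neighbors):
            current = labels[node]
            tot[current] -= degree[node]  # take the node out of its community

            # Number of edges from this node into each neighboring community.
            links: Dict[int, int] = {}
            for nbr in neighbors[node]:
                links[labels[nbr]] = links.get(labels[nbr], 0) + 1

            def gain(community: int) -> float:
                # Proportional to the modularity gain of joining `community`.
                return links.get(community, 0) - tot[community] * degree[node] / two_m

            best, best_gain = current, gain(current)
            for community in links:  # candidates are existing labels only
                candidate_gain = gain(community)
                if candidate_gain > best_gain:
                    best, best_gain = community, candidate_gain

            labels[node] = best
            tot[best] += degree[node]
            moved = moved or best != current
        if not moved:
            break
    return labels
```

Starting this pass from two labels therefore yields at most two clusters, which is the property the "spurious merging detection" step relies on.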