AssertionError: Probabilities don't sum to 1 #79

Closed
BeyondTheProof opened this issue Jan 22, 2021 · 10 comments

@BeyondTheProof

BeyondTheProof commented Jan 22, 2021

Hi Avanti, thanks so much for making TF-MoDISco available to the community. We're finding some interesting things already!

I've run it multiple times on different datasets, and never had a problem until now. Do you know how to address this error?

(Round 1) Computed affinity matrix on nearest neighbors in 76.26 s
MEMORY 6.714896384
Filtered down to 4911 of 5381
(Round 1) Retained 4911 rows out of 5381 after filtering
MEMORY 6.715236352
(Round 1) Computing density adapted affmat
MEMORY 6.560714752
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 4911 samples in 0.026s...
[t-SNE] Computed neighbors for 4911 samples in 0.156s...
[t-SNE] Computed conditional probabilities for sample 1000 / 4911
[t-SNE] Computed conditional probabilities for sample 2000 / 4911
[t-SNE] Computed conditional probabilities for sample 3000 / 4911
[t-SNE] Computed conditional probabilities for sample 4000 / 4911
[t-SNE] Computed conditional probabilities for sample 4911 / 4911
[t-SNE] Mean sigma: 0.137986
(Round 1) Computing clustering
MEMORY 6.560534528
Beginning preprocessing + Leiden
  0%|                                                                                                                                               | 0/50 [00:00<?, ?it/s]
Quality: 0.9017222780398779
  2%|██▋                                                                                                                                    | 1/50 [00:00<00:11,  4.38it/s]
Quality: 0.9017980768302472
 14%|██████████████████▉                                                                                                                    | 7/50 [00:01<00:10,  3.96it/s]
Quality: 0.9018023087167797
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:14<00:00,  3.48it/s]
Got 23 clusters after round 1
Counts:
{17: 38, 1: 612, 6: 337, 14: 80, 19: 35, 0: 702, 7: 313, 2: 540, 4: 419, 3: 471, 18: 37, 11: 103, 5: 392, 10: 125, 12: 99, 16: 60, 8: 182, 13: 89, 22: 19, 21: 21, 9: 129,
15: 73, 20: 35}
MEMORY 6.343176192
(Round 1) Aggregating seqlets in each cluster
MEMORY 6.343176192
Aggregating for cluster 0 with 702 seqlets
MEMORY 6.343176192
Trimming eliminated 0 seqlets out of 702
Skipped 15 seqlets
Aggregating for cluster 1 with 612 seqlets
MEMORY 6.347374592
Trimming eliminated 0 seqlets out of 612
Skipped 16 seqlets
Aggregating for cluster 2 with 540 seqlets
MEMORY 6.3483904
Trimming eliminated 0 seqlets out of 540
Skipped 25 seqlets
Aggregating for cluster 3 with 471 seqlets
MEMORY 6.348947456
Trimming eliminated 0 seqlets out of 471
Skipped 11 seqlets
Aggregating for cluster 4 with 419 seqlets
MEMORY 6.349836288
Trimming eliminated 0 seqlets out of 419
Skipped 26 seqlets
Skipped 3 seqlets
Aggregating for cluster 5 with 392 seqlets
MEMORY 6.35052032
Trimming eliminated 0 seqlets out of 392
Skipped 13 seqlets
Aggregating for cluster 6 with 337 seqlets
MEMORY 6.351568896
Trimming eliminated 0 seqlets out of 337
Skipped 21 seqlets
Skipped 1 seqlets
Aggregating for cluster 7 with 313 seqlets
MEMORY 6.351806464
Trimming eliminated 0 seqlets out of 313
Skipped 22 seqlets
Aggregating for cluster 8 with 182 seqlets
MEMORY 6.352617472
Trimming eliminated 0 seqlets out of 182
Skipped 6 seqlets
Aggregating for cluster 9 with 129 seqlets
MEMORY 6.352617472
Trimming eliminated 0 seqlets out of 129
Skipped 3 seqlets
Aggregating for cluster 10 with 125 seqlets
MEMORY 6.352617472
Trimming eliminated 0 seqlets out of 125
Skipped 3 seqlets
Aggregating for cluster 11 with 103 seqlets
MEMORY 6.352662528
Trimming eliminated 0 seqlets out of 103
Skipped 4 seqlets
Skipped 4 seqlets
Aggregating for cluster 12 with 99 seqlets
MEMORY 6.352662528
Trimming eliminated 0 seqlets out of 99
Skipped 1 seqlets
Aggregating for cluster 13 with 89 seqlets
MEMORY 6.352879616
Trimming eliminated 0 seqlets out of 89
Skipped 8 seqlets
Skipped 1 seqlets
Aggregating for cluster 14 with 80 seqlets
MEMORY 6.353133568
Trimming eliminated 0 seqlets out of 80
Skipped 2 seqlets
Aggregating for cluster 15 with 73 seqlets
MEMORY 6.353403904
Trimming eliminated 0 seqlets out of 73
Skipped 4 seqlets
Aggregating for cluster 16 with 60 seqlets
MEMORY 6.353403904
Trimming eliminated 0 seqlets out of 60
Skipped 2 seqlets
Aggregating for cluster 17 with 38 seqlets
MEMORY 6.353403904
Trimming eliminated 0 seqlets out of 38
Skipped 1 seqlets
Traceback (most recent call last):
  File "/home/artur/miniconda3/envs/basepairmodels/bin/modisco", line 8, in <module>
    sys.exit(modisco_main())
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/basepairmodels/cli/run_modisco.py", line 115, in modisco_main
    one_hot=onehot_data)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/tfmodisco_workflow/workflow.py", line 392, in __call__
    metacluster_seqlets)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py", line 884, in __call__
    min_seqlets_in_motif=0)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py", line 680, in get_cluster_to_aggregate_mo
tif
    motifs = self.seqlet_aggregator(cluster_to_seqlets[i])
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 516, in __call__
    to_return = self.postprocessor(to_return)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 30, in __call__
    return self.func(aggregated_seqlets)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 21, in <lambda>
    func=lambda x: postprocessor(self(x)))
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 30, in __call__
    return self.func(aggregated_seqlets)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 21, in <lambda>
    func=lambda x: postprocessor(self(x)))
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 30, in __call__
    return self.func(aggregated_seqlets)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 21, in <lambda>
    func=lambda x: postprocessor(self(x)))
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 59, in __call__
    arr=self.score_positions(aggregated_seqlet),
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 79, in score_positions
    ppm=ppm, background=self.bg_freq, pseudocount=0.001)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/util.py", line 479, in compute_per_position_ic
    +str(ppm)+"\n"+str(np.sum(ppm, axis=1)))
AssertionError: Probabilities don't sum to 1 along axis 1 in [[0.02702703 0.         0.94594595 0.02702703]
 [0.97297297 0.02702703 0.         0.        ]
 [0.83783784 0.13513514 0.02702703 0.        ]
 [0.05405405 0.08108108 0.         0.86486486]
 [0.08108108 0.13513514 0.72972973 0.05405405]
 [0.02702703 0.         0.91891892 0.05405405]
 [0.97297297 0.         0.02702703 0.        ]
 [0.86486486 0.13513514 0.         0.        ]
 [0.08108108 0.         0.         0.91891892]
 [0.         0.13513514 0.86486486 0.        ]
 [0.         0.05405405 0.94594595 0.        ]
 [0.91891892 0.         0.05405405 0.02702703]
 [0.81081081 0.16216216 0.         0.02702703]
 [0.         0.02702703 0.         0.97297297]
 [0.         0.21621622 0.75675676 0.02702703]
 [0.02702703 0.         0.97297297 0.        ]
 [1.         0.         0.         0.        ]
 [0.83783784 0.08108108 0.05405405 0.02702703]
 [0.02702703 0.         0.         0.97297297]
 [0.02702703 0.16216216 0.78378378 0.02702703]
 [0.         0.05405405 0.94594595 0.        ]
 [0.97297297 0.         0.02702703 0.        ]
 [0.89189189 0.08108108 0.         0.02702703]
 [0.         0.02702703 0.         0.97297297]
 [0.         0.10810811 0.89189189 0.        ]
 [0.         0.         0.97297297 0.        ]
 [0.97297297 0.         0.         0.        ]
 [0.81081081 0.13513514 0.02702703 0.        ]
 [0.         0.         0.         0.97297297]
 [0.         0.10810811 0.83783784 0.02702703]
 [0.         0.02702703 0.94594595 0.        ]
 [0.97297297 0.         0.         0.        ]
 [0.83783784 0.10810811 0.02702703 0.        ]
 [0.         0.02702703 0.         0.94594595]
 [0.         0.16216216 0.81081081 0.        ]
 [0.08108108 0.         0.86486486 0.02702703]
 [0.91891892 0.         0.05405405 0.        ]
 [0.89189189 0.08108108 0.         0.        ]
 [0.05405405 0.02702703 0.         0.89189189]
 [0.         0.16216216 0.78378378 0.02702703]
 [0.         0.05405405 0.89189189 0.02702703]
 [0.97297297 0.         0.         0.        ]
 [0.67567568 0.2972973  0.         0.        ]
 [0.05405405 0.05405405 0.         0.86486486]
 [0.         0.2972973  0.62162162 0.05405405]]
[1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         0.97297297 0.97297297 0.97297297 0.97297297 0.97297297
 0.97297297 0.97297297 0.97297297 0.97297297 0.97297297 0.97297297
 0.97297297 0.97297297 0.97297297 0.97297297 0.97297297 0.97297297
 0.97297297 0.97297297 0.97297297]

I tried updating to the latest version, but the error persists. If need be, I can share the importances file with you directly. Thanks so much!

Artur

@AvantiShri
Collaborator

AvantiShri commented Jan 22, 2021

Hi Artur, see my response in this other GitHub issue (#75) and let me know if that doesn't fix it!

@BeyondTheProof
Author

BeyondTheProof commented Jan 22, 2021

Hi Avanti,

Great insight. I've looked through all the peak files I'm passing to interpret (from basepairmodels), and it seems that every base in the peaks is exactly one of A, C, G, or T in the fasta file I'm using. I used our own one-hot encoding function to check, not TF-MoDISco's, since I'm using the CLI rather than a Jupyter notebook.

So it's not that. Unless maybe interpret extends the regions?

Thanks!
Artur

Edit: I would like to note that I do not get any division by 0 errors.
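
(For anyone wanting to reproduce this kind of check on the resized regions rather than the raw peaks, here is a minimal sketch using pyfaidx. It is not basepairmodels' own code; the file paths, the narrowPeak-style column layout, and the window length are placeholders to adapt.)

from pyfaidx import Fasta

# Illustrative sketch only: scan a fixed-length window around each peak summit
# for non-ACGT characters. Paths, the narrowPeak-style column layout (summit
# offset in column 10), and the window length are assumptions.
genome = Fasta("genome.fa")
window = 3088  # set this to whatever region length interpret actually uses

with open("peaks.bed") as peaks:
    for line in peaks:
        fields = line.rstrip("\n").split("\t")
        chrom, start, summit = fields[0], int(fields[1]), int(fields[9])
        center = start + summit
        region_start = max(0, center - window // 2)
        region_end = center + window // 2
        seq = genome[chrom][region_start:region_end].seq.upper()
        bad = set(seq) - set("ACGT")
        if bad:
            print(f"{chrom}:{region_start}-{region_end} contains {sorted(bad)}")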

@AvantiShri
Collaborator

Hmm, I am not very familiar with the "interpret" function of basepairmodels. Can you link me to the relevant function definition if you have it handy? I assume your hunch is correct that there might be some manipulation of the regions under the hood. Technically this is easy to work around by simply renormalizing the PPM that is passed to the "information content" function so that the rows sum to 1, and I wrote a workaround to that effect (the code now prints a warning rather than throwing an error: #80), but it's worth tracking down exactly what's causing the behavior.
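
(The actual change for #80 lives in modisco's utilities; the snippet below is only a rough sketch of the idea, renormalizing rows and warning instead of asserting, with a simplified pseudocount treatment rather than the exact one used in compute_per_position_ic.)

import numpy as np

def per_position_ic_renormalized(ppm, background, pseudocount=0.001, tol=1e-3):
    """Rough sketch of the workaround idea: renormalize PPM rows that do not sum
    to 1, warn instead of asserting, then compute per-position information
    content as relative entropy against the background frequencies."""
    ppm = np.asarray(ppm, dtype=float)
    background = np.asarray(background, dtype=float)
    row_sums = ppm.sum(axis=1)
    if not np.allclose(row_sums, 1.0, atol=tol):
        print("Warning: PPM rows do not sum to 1; renormalizing. Row sums:", row_sums)
        ppm = ppm / row_sums[:, None]
    # simplified pseudocount smoothing so log2 never sees a zero probability
    smoothed = (ppm + pseudocount) / (1.0 + ppm.shape[1] * pseudocount)
    return np.sum(smoothed * (np.log2(smoothed) - np.log2(background)), axis=1)

# e.g. per_position_ic_renormalized(ppm, background=[0.25, 0.25, 0.25, 0.25])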

@AvantiShri
Collaborator

@zahoorz are you able to point me to the lines of code corresponding to the "interpret" function of basepairmodels?

@AvantiShri
Collaborator

@BeyondTheProof in the meantime, can you try out the workaround and see if it satisfies your needs?

@BeyondTheProof
Author

@AvantiShri The workaround works!

I've narrowed down the problem. The one-hot encoding function I was using to check converts Ns to [0.25, 0.25, 0.25, 0.25] instead of all 0s. I thought the offending lines were in basepairmodels.cli.interpret.interpret, lines 82-83, where the regions are resized to a standard length (3088 by default). However, after checking, plenty of the regions have Ns in them, so I don't think it's that. In interpret, one_hot_encode does indeed translate Ns to all 0s.

So this doesn't solve the problem. There may, however, be a step where the probabilities are normalized for certain peaks but not others. It's likely due to something involving Ns, as the problematic peaks are H3K9me3 peaks, which often fall in repetitive regions.

@AvantiShri
Collaborator

Glad the workaround works. I'm a bit confused why you said "However, after checking, plenty of the regions have Ns in them, so I don't think it's that" - do you say "I don't think it's that" because this is the first time you are getting the error, and your previous runs didn't get the error? The error only occurs if a region containing Ns winds up inside or near a motif (I write "near a motif" because tfmodisco includes "flank expansion" steps where seqlets are expanded to include potentially relevant signal in the flanks). In most cases, your motif instances won't be near the Ns, so it makes sense to me that you didn't see this error for most runs!
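
(To make the arithmetic concrete: with Ns encoded as all-zero rows, averaging the one-hot matrices of the aligned seqlets gives a PPM row that sums to less than 1 at any position where even one seqlet contributed an N. The fractions in the assertion above are consistent with 37 contributing seqlets, e.g. 36/37 ≈ 0.97297297. A toy sketch with made-up numbers:)

import numpy as np

# Toy illustration of the failure mode (made-up data): one seqlet out of 37 has
# an N at this position, and the N was one-hot encoded as an all-zero row.
n_seqlets = 37
onehot_column = np.zeros((n_seqlets, 4))
onehot_column[:36, 3] = 1.0            # 36 seqlets have a T here, 1 seqlet has an N

ppm_row = onehot_column.mean(axis=0)   # corresponding row of the aggregated PPM
print(ppm_row)                         # [0. 0. 0. 0.97297297]
print(ppm_row.sum())                   # 0.9729729... < 1, so the assertion fires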

@BeyondTheProof
Author

I see, that makes a lot of sense. So yes, I think it only came up with the H3K9me3 data because of the abundance of repeats; those regions are much more likely to be near Ns, so it's all consistent with what you said. Thanks so much! Should I mark this resolved, or did you want to keep it open until we hear from @zahoorz?

@AvantiShri
Collaborator

Seems like you managed to nail it down!

@BeyondTheProof
Author

Hi Avanti, I've reopened this issue to flag a small bug: you had a < instead of a > in your fix. I left a one-line review on the PR.
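
(For context, a hypothetical sketch of the kind of flipped comparison being described; this is not the actual code from the fix, and the variable names are made up.)

import numpy as np

row_sums = np.array([1.0, 0.97297297, 1.0])   # example PPM row sums
tolerance = 1e-3

# With `<` the warning/renormalization would trigger exactly when the rows already
# sum to 1; the intended check fires when the deviation exceeds the tolerance:
if np.max(np.abs(row_sums - 1.0)) > tolerance:
    print("Warning: some PPM rows do not sum to 1; renormalizing")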
