AssertionError: Probabilities don't sum to 1 #79

Closed
BeyondTheProof opened this issue Jan 22, 2021 · 10 comments

@BeyondTheProof

BeyondTheProof commented Jan 22, 2021

Hi Avanti, thanks so much for making TF-MoDISco available to the community. We're finding some interesting things already!

I've run it multiple times on different datasets, and never had a problem until now. Do you know how to address this error?

(Round 1) Computed affinity matrix on nearest neighbors in 76.26 s
MEMORY 6.714896384
Filtered down to 4911 of 5381
(Round 1) Retained 4911 rows out of 5381 after filtering
MEMORY 6.715236352
(Round 1) Computing density adapted affmat
MEMORY 6.560714752
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 4911 samples in 0.026s...
[t-SNE] Computed neighbors for 4911 samples in 0.156s...
[t-SNE] Computed conditional probabilities for sample 1000 / 4911
[t-SNE] Computed conditional probabilities for sample 2000 / 4911
[t-SNE] Computed conditional probabilities for sample 3000 / 4911
[t-SNE] Computed conditional probabilities for sample 4000 / 4911
[t-SNE] Computed conditional probabilities for sample 4911 / 4911
[t-SNE] Mean sigma: 0.137986
(Round 1) Computing clustering
MEMORY 6.560534528
Beginning preprocessing + Leiden
  0%|                                                                                                                                               | 0/50 [00:00<?, ?it/s]
Quality: 0.9017222780398779
  2%|██▋                                                                                                                                    | 1/50 [00:00<00:11,  4.38it/s]
Quality: 0.9017980768302472
 14%|██████████████████▉                                                                                                                    | 7/50 [00:01<00:10,  3.96it/s]
Quality: 0.9018023087167797
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:14<00:00,  3.48it/s]
Got 23 clusters after round 1
Counts:
{17: 38, 1: 612, 6: 337, 14: 80, 19: 35, 0: 702, 7: 313, 2: 540, 4: 419, 3: 471, 18: 37, 11: 103, 5: 392, 10: 125, 12: 99, 16: 60, 8: 182, 13: 89, 22: 19, 21: 21, 9: 129,
15: 73, 20: 35}
MEMORY 6.343176192
(Round 1) Aggregating seqlets in each cluster
MEMORY 6.343176192
Aggregating for cluster 0 with 702 seqlets
MEMORY 6.343176192
Trimming eliminated 0 seqlets out of 702
Skipped 15 seqlets
Aggregating for cluster 1 with 612 seqlets
MEMORY 6.347374592
Trimming eliminated 0 seqlets out of 612
Skipped 16 seqlets
Aggregating for cluster 2 with 540 seqlets
MEMORY 6.3483904
Trimming eliminated 0 seqlets out of 540
Skipped 25 seqlets
Aggregating for cluster 3 with 471 seqlets
MEMORY 6.348947456
Trimming eliminated 0 seqlets out of 471
Skipped 11 seqlets
Aggregating for cluster 4 with 419 seqlets
MEMORY 6.349836288
Trimming eliminated 0 seqlets out of 419
Skipped 26 seqlets
Skipped 3 seqlets
Aggregating for cluster 5 with 392 seqlets
MEMORY 6.35052032
Trimming eliminated 0 seqlets out of 392
Skipped 13 seqlets
Aggregating for cluster 6 with 337 seqlets
MEMORY 6.351568896
Trimming eliminated 0 seqlets out of 337
Skipped 21 seqlets
Skipped 1 seqlets
Aggregating for cluster 7 with 313 seqlets
MEMORY 6.351806464
Trimming eliminated 0 seqlets out of 313
Skipped 22 seqlets
Aggregating for cluster 8 with 182 seqlets
MEMORY 6.352617472
Trimming eliminated 0 seqlets out of 182
Skipped 6 seqlets
Aggregating for cluster 9 with 129 seqlets
MEMORY 6.352617472
Trimming eliminated 0 seqlets out of 129
Skipped 3 seqlets
Aggregating for cluster 10 with 125 seqlets
MEMORY 6.352617472
Trimming eliminated 0 seqlets out of 125
Skipped 3 seqlets
Aggregating for cluster 11 with 103 seqlets
MEMORY 6.352662528
Trimming eliminated 0 seqlets out of 103
Skipped 4 seqlets
Skipped 4 seqlets
Aggregating for cluster 12 with 99 seqlets
MEMORY 6.352662528
Trimming eliminated 0 seqlets out of 99
Skipped 1 seqlets
Aggregating for cluster 13 with 89 seqlets
MEMORY 6.352879616
Trimming eliminated 0 seqlets out of 89
Skipped 8 seqlets
Skipped 1 seqlets
Aggregating for cluster 14 with 80 seqlets
MEMORY 6.353133568
Trimming eliminated 0 seqlets out of 80
Skipped 2 seqlets
Aggregating for cluster 15 with 73 seqlets
MEMORY 6.353403904
Trimming eliminated 0 seqlets out of 73
Skipped 4 seqlets
Aggregating for cluster 16 with 60 seqlets
MEMORY 6.353403904
Trimming eliminated 0 seqlets out of 60
Skipped 2 seqlets
Aggregating for cluster 17 with 38 seqlets
MEMORY 6.353403904
Trimming eliminated 0 seqlets out of 38
Skipped 1 seqlets
Traceback (most recent call last):
  File "/home/artur/miniconda3/envs/basepairmodels/bin/modisco", line 8, in <module>
    sys.exit(modisco_main())
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/basepairmodels/cli/run_modisco.py", line 115, in modisco_main
    one_hot=onehot_data)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/tfmodisco_workflow/workflow.py", line 392, in __call__
    metacluster_seqlets)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py", line 884, in __call__
    min_seqlets_in_motif=0)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py", line 680, in get_cluster_to_aggregate_mo
tif
    motifs = self.seqlet_aggregator(cluster_to_seqlets[i])
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 516, in __call__
    to_return = self.postprocessor(to_return)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 30, in __call__
    return self.func(aggregated_seqlets)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 21, in <lambda>
    func=lambda x: postprocessor(self(x)))
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 30, in __call__
    return self.func(aggregated_seqlets)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 21, in <lambda>
    func=lambda x: postprocessor(self(x)))
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 30, in __call__
    return self.func(aggregated_seqlets)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 21, in <lambda>
    func=lambda x: postprocessor(self(x)))
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 59, in __call__
    arr=self.score_positions(aggregated_seqlet),
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 79, in score_positions
    ppm=ppm, background=self.bg_freq, pseudocount=0.001)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/util.py", line 479, in compute_per_position_ic
    +str(ppm)+"\n"+str(np.sum(ppm, axis=1)))
AssertionError: Probabilities don't sum to 1 along axis 1 in [[0.02702703 0.         0.94594595 0.02702703]
 [0.97297297 0.02702703 0.         0.        ]
 [0.83783784 0.13513514 0.02702703 0.        ]
 [0.05405405 0.08108108 0.         0.86486486]
 [0.08108108 0.13513514 0.72972973 0.05405405]
 [0.02702703 0.         0.91891892 0.05405405]
 [0.97297297 0.         0.02702703 0.        ]
 [0.86486486 0.13513514 0.         0.        ]
 [0.08108108 0.         0.         0.91891892]
 [0.         0.13513514 0.86486486 0.        ]
 [0.         0.05405405 0.94594595 0.        ]
 [0.91891892 0.         0.05405405 0.02702703]
 [0.81081081 0.16216216 0.         0.02702703]
 [0.         0.02702703 0.         0.97297297]
 [0.         0.21621622 0.75675676 0.02702703]
 [0.02702703 0.         0.97297297 0.        ]
 [1.         0.         0.         0.        ]
 [0.83783784 0.08108108 0.05405405 0.02702703]
 [0.02702703 0.         0.         0.97297297]
 [0.02702703 0.16216216 0.78378378 0.02702703]
 [0.         0.05405405 0.94594595 0.        ]
 [0.97297297 0.         0.02702703 0.        ]
 [0.89189189 0.08108108 0.         0.02702703]
 [0.         0.02702703 0.         0.97297297]
 [0.         0.10810811 0.89189189 0.        ]
 [0.         0.         0.97297297 0.        ]
 [0.97297297 0.         0.         0.        ]
 [0.81081081 0.13513514 0.02702703 0.        ]
 [0.         0.         0.         0.97297297]
 [0.         0.10810811 0.83783784 0.02702703]
 [0.         0.02702703 0.94594595 0.        ]
 [0.97297297 0.         0.         0.        ]
 [0.83783784 0.10810811 0.02702703 0.        ]
 [0.         0.02702703 0.         0.94594595]
 [0.         0.16216216 0.81081081 0.        ]
 [0.08108108 0.         0.86486486 0.02702703]
 [0.91891892 0.         0.05405405 0.        ]
 [0.89189189 0.08108108 0.         0.        ]
 [0.05405405 0.02702703 0.         0.89189189]
 [0.         0.16216216 0.78378378 0.02702703]
 [0.         0.05405405 0.89189189 0.02702703]
 [0.97297297 0.         0.         0.        ]
 [0.67567568 0.2972973  0.         0.        ]
 [0.05405405 0.05405405 0.         0.86486486]
 [0.         0.2972973  0.62162162 0.05405405]]
[1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         0.97297297 0.97297297 0.97297297 0.97297297 0.97297297
 0.97297297 0.97297297 0.97297297 0.97297297 0.97297297 0.97297297
 0.97297297 0.97297297 0.97297297 0.97297297 0.97297297 0.97297297
 0.97297297 0.97297297 0.97297297]

I tried updating to the latest version, but the error persists. If need be, I can share the importances file with you directly. Thanks so much!

Artur

@AvantiShri
Collaborator

AvantiShri commented Jan 22, 2021

Hi Artur, see my response in this other GitHub issue (#75) and let me know if that doesn't fix it!

@BeyondTheProof
Author

BeyondTheProof commented Jan 22, 2021

Hi Avanti,

Great insight. I've looked through all the peak files I'm passing to interpret (from basepairmodels), and it seems that every base in the peaks is exactly one of A, C, G, or T in the fasta file I'm using. I used our own one-hot encoding function to check, not TF-MoDISco's, since I'm using the CLI rather than a Jupyter notebook.

So it's not that. Unless maybe interpret extends the regions?

Thanks!
Artur

Edit: I would like to note that I do not get any division by 0 errors.
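
(For anyone wanting to reproduce this kind of check on the resized regions rather than the raw peaks, here is a minimal sketch using pyfaidx. It is not basepairmodels' own code; the file paths, the narrowPeak-style column layout, and the window length are placeholders to adapt.)

from pyfaidx import Fasta

# Illustrative sketch only: scan a fixed-length window around each peak summit
# for non-ACGT characters. Paths, the narrowPeak-style column layout (summit
# offset in column 10), and the window length are assumptions.
genome = Fasta("genome.fa")
window = 3088  # set this to whatever region length interpret actually uses

with open("peaks.bed") as peaks:
    for line in peaks:
        fields = line.rstrip("\n").split("\t")
        chrom, start, summit = fields[0], int(fields[1]), int(fields[9])
        center = start + summit
        region_start = max(0, center - window // 2)
        region_end = center + window // 2
        seq = genome[chrom][region_start:region_end].seq.upper()
        bad = set(seq) - set("ACGT")
        if bad:
            print(f"{chrom}:{region_start}-{region_end} contains {sorted(bad)}")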

@AvantiShri
Collaborator

Hmm, I am not very familiar with the "interpret" function of basepairmodels. Can you link me to the relevant function definition if you have it handy? I assume your hunch is correct that there might be some manipulation of the regions under the hood. Technically this is easy to work around by simply renormalizing the PPM that is passed to the "information content" function so that the rows sum to 1, and I wrote a workaround to that effect (the code now prints a warning rather than throwing an error: #80), but it's worth tracking down exactly what's causing the behavior.
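
(The actual change for #80 lives in modisco's utilities; the snippet below is only a rough sketch of the idea, renormalizing rows and warning instead of asserting, with a simplified pseudocount treatment rather than the exact one used in compute_per_position_ic.)

import numpy as np

def per_position_ic_renormalized(ppm, background, pseudocount=0.001, tol=1e-3):
    """Rough sketch of the workaround idea: renormalize PPM rows that do not sum
    to 1, warn instead of asserting, then compute per-position information
    content as relative entropy against the background frequencies."""
    ppm = np.asarray(ppm, dtype=float)
    background = np.asarray(background, dtype=float)
    row_sums = ppm.sum(axis=1)
    if not np.allclose(row_sums, 1.0, atol=tol):
        print("Warning: PPM rows do not sum to 1; renormalizing. Row sums:", row_sums)
        ppm = ppm / row_sums[:, None]
    # simplified pseudocount smoothing so log2 never sees a zero probability
    smoothed = (ppm + pseudocount) / (1.0 + ppm.shape[1] * pseudocount)
    return np.sum(smoothed * (np.log2(smoothed) - np.log2(background)), axis=1)

# e.g. per_position_ic_renormalized(ppm, background=[0.25, 0.25, 0.25, 0.25])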

@AvantiShri
Collaborator

@zahoorz are you able to point me to the lines of code corresponding to the "interpret" function of basepairmodels?

@AvantiShri
Collaborator

@BeyondTheProof in the meantime, can you try out the workaround and see if it satisfies your needs?

@BeyondTheProof
Author

@AvantiShri The workaround works!

I've narrowed down the problem. The one-hot encoding function I was using to check converts Ns to [0.25, 0.25, 0.25, 0.25] instead of all 0s. I thought the offending lines were in basepairmodels.cli.interpret.interpret, lines 82-83, where the regions are resized to a standard length (3088 by default). However, after checking, plenty of the regions have Ns in them, so I don't think it's that. In interpret, one_hot_encode does indeed translate Ns to all 0s.

So this doesn't solve the problem. There may, however, be a step where the probabilities are normalized for certain peaks but not others. It's likely due to something involving Ns, as the problematic peaks are H3K9me3 peaks, which often fall in repetitive regions.

@AvantiShri
Collaborator

Glad the workaround works. I'm a bit confused why you said "However, after checking, plenty of the regions have Ns in them, so I don't think it's that" - do you say "I don't think it's that" because this is the first time you are getting the error, and your previous runs didn't get the error? The error only occurs if a region containing Ns winds up inside or near a motif (I write "near a motif" because tfmodisco includes "flank expansion" steps where seqlets are expanded to include potentially relevant signal in the flanks). In most cases, your motif instances won't be near the Ns, so it makes sense to me that you didn't see this error for most runs!
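
(To make the arithmetic concrete: with Ns encoded as all-zero rows, averaging the one-hot matrices of the aligned seqlets gives a PPM row that sums to less than 1 at any position where even one seqlet contributed an N. The fractions in the assertion above are consistent with 37 contributing seqlets, e.g. 36/37 ≈ 0.97297297. A toy sketch with made-up numbers:)

import numpy as np

# Toy illustration of the failure mode (made-up data): one seqlet out of 37 has
# an N at this position, and the N was one-hot encoded as an all-zero row.
n_seqlets = 37
onehot_column = np.zeros((n_seqlets, 4))
onehot_column[:36, 3] = 1.0            # 36 seqlets have a T here, 1 seqlet has an N

ppm_row = onehot_column.mean(axis=0)   # corresponding row of the aggregated PPM
print(ppm_row)                         # [0. 0. 0. 0.97297297]
print(ppm_row.sum())                   # 0.9729729... < 1, so the assertion fires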

@BeyondTheProof
Author

I see, that makes a lot of sense. So yes, I think it only came up with the H3K9me3 data because of the abundance of repeats; those regions are much more likely to be near Ns, so it's all consistent with what you said. Thanks so much! Should I mark this resolved, or did you want to keep it open until we hear from @zahoorz?

@AvantiShri
Collaborator

Seems like you managed to nail it down!

@BeyondTheProof
Author

Hi Avanti, I've reopened this issue to flag a small bug: you had a < instead of a > in your fix. I left a one-line review on the PR.
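
(For context, a hypothetical sketch of the kind of flipped comparison being described; this is not the actual code from the fix, and the variable names are made up.)

import numpy as np

row_sums = np.array([1.0, 0.97297297, 1.0])   # example PPM row sums
tolerance = 1e-3

# With `<` the warning/renormalization would trigger exactly when the rows already
# sum to 1; the intended check fires when the deviation exceeds the tolerance:
if np.max(np.abs(row_sums - 1.0)) > tolerance:
    print("Warning: some PPM rows do not sum to 1; renormalizing")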
