Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I don't understand the number of schemes analysed for the greedy algorithm #58

Closed
roblanf opened this issue Feb 14, 2022 · 3 comments
Closed
Labels
modelfinder2 things to do before benchmarking modelfinder2

Comments

@roblanf
Copy link
Collaborator

roblanf commented Feb 14, 2022

I'm analysing a dataset with 168 data blocks, using the greedy algorithm for merging subsets.

When I do this in PF2, the first step of the greedy algorithm computes the BIC score of 14028 subset pairs. This is expected - 168 choose 2 = 14028.

But when I run what I think is the same analysis in MF2, the first step of the algorithm analyses just 2328 subset pairs. At least, this is what's reported in the logged output. I really can't figure out quite what's going on.

It gets stranger when you look at the next step of the algorithm, which should only be analysing subsets that add to the new subset, but in this case my output is as below.

I'm guessing my commandline is just wrong, and in particular that --merge greedy is not actually doing what I think it's doing, i.e. implementing an algorithm the same as the PF2 greedy algorithm.

Commandline

Command: /data/rob/mf2/iqtree-2.1.2-Linux/bin/iqtree2 -s alignment.phy -spp partitions.nex -m TESTMERGEONLY -mset GTR -mrate E,G,I+G --merge-model GTR --merge-rate E,G,I+G --merge greedy -nt 128 --seed 123456742
Merging ES321806nucl_3rdpos+O10821nucl_3rdpos with BIC score: 572164.159 (LnL: -276386.833  df: 1908)
2497 GTR+F        572129.139  0.173       O19758nucl_2ndpos+O21222nucl_2ndpos   0h:0m:4s (0h:0m:26s left)
2498 GTR+F        572105.823  0.370       O16400nucl_2ndpos+O21222nucl_2ndpos   0h:0m:4s (0h:0m:26s left)
2499 GTR+F+I+G4   572118.259  0.481       O10295nucl_1stpos+O11722nucl_2ndpos   0h:0m:4s (0h:0m:26s left)
2500 GTR+F+G4     572085.885  1.141       O5959nucl_1stpos+O8818nucl_1stpos     0h:0m:4s (0h:0m:26s left)
2501 GTR+F+I+G4   572208.592  1.394       O10600nucl_2ndpos+O13660nucl_1stpos   0h:0m:4s (0h:0m:26s left)
2502 GTR+F+G4     572085.680  1.204       O5959nucl_1stpos+O23857nucl_1stpos    0h:0m:4s (0h:0m:26s left)
2503 GTR+F+G4     572067.794  0.866       O13507nucl_1stpos+O21944nucl_1stpos   0h:0m:5s (0h:0m:26s left)
2504 GTR+F+G4     572117.617  1.104       ES321879nucl_1stpos+O8818nucl_1stpos  0h:0m:5s (0h:0m:26s left)
2505 GTR+F+G4     572122.542  15.451      ES319935nucl_3rdpos+O988nucl_1stpos   0h:0m:5s (0h:0m:26s left)
2506 GTR+F+G4     572086.353  0.905       O19189nucl_1stpos+O22156nucl_1stpos   0h:0m:5s (0h:0m:26s left)
2507 GTR+F+G4     572078.146  1.229       ES321882nucl_1stpos+O16400nucl_1stpos 0h:0m:5s (0h:0m:26s left)
2508 GTR+F+G4     572084.614  1.052       O8775nucl_1stpos+O10821nucl_1stpos    0h:0m:5s (0h:0m:26s left)
2509 GTR+F+G4     572076.994  0.839       O13660nucl_1stpos+O19189nucl_1stpos   0h:0m:5s (0h:0m:26s left)
2510 GTR+F+G4     572104.762  9.295       O10600nucl_3rdpos+O17424nucl_3rdpos   0h:0m:5s (0h:0m:26s left)
2511 GTR+F+I+G4   572099.081  11.129      O1187nucl_3rdpos+O19102nucl_3rdpos    0h:0m:5s (0h:0m:27s left)
2512 GTR+F+I+G4   572144.710  14.689      ES319935nucl_3rdpos+ES321806nucl_3rdpos+O10821nucl_3rdpos     0h:0m:5s (0h:0m:27s left)
2513 GTR+F+I+G4   572065.166  10.102      O14147nucl_3rdpos+O15665nucl_3rdpos   0h:0m:5s (0h:0m:27s left)
2514 GTR+F+G4     572122.068  6.868       O9874nucl_3rdpos+O19058nucl_3rdpos    0h:0m:5s (0h:0m:27s left)
2515 GTR+F+I+G4   572087.934  1.532       ES321806nucl_1stpos+O10419nucl_1stpos 0h:0m:5s (0h:0m:27s left)
2516 GTR+F+I+G4   572133.599  9.612       O10867nucl_3rdpos+O17424nucl_3rdpos   0h:0m:5s (0h:0m:27s left)
2517 GTR+F+I+G4   572116.124  15.400      ES321806nucl_3rdpos+ES321879nucl_3rdpos+O10821nucl_3rdpos     0h:0m:5s (0h:0m:27s left)
2518 GTR+F+I+G4   572150.231  17.265      ES321806nucl_3rdpos+O6843nucl_3rdpos+O10821nucl_3rdpos        0h:0m:5s (0h:0m:27s left)
2519 GTR+F+I+G4   572289.800  16.016      ES321806nucl_3rdpos+O10821nucl_3rdpos+O18599nucl_3rdpos       0h:0m:5s (0h:0m:27s left)
2520 GTR+F+I+G4   572078.331  15.736      ES321806nucl_3rdpos+O988nucl_1stpos+O10821nucl_3rdpos 0h:0m:5s (0h:0m:28s left)
2521 GTR+F+I+G4   572060.003  15.333      ES321806nucl_3rdpos+O8818nucl_3rdpos+O10821nucl_3rdpos        0h:0m:5s (0h:0m:28s left)
2522 GTR+F+G4     572075.551  0.932       ES316890nucl_1stpos+O17157nucl_1stpos 0h:0m:5s (0h:0m:28s left)
2523 GTR+F+I+G4   572184.827  16.689      ES321806nucl_3rdpos+O10821nucl_3rdpos+O11240nucl_3rdpos       0h:0m:5s (0h:0m:28s left)
2524 GTR+F+I+G4   572059.969  15.301      ES316508nucl_3rdpos+ES321806nucl_3rdpos+O10821nucl_3rdpos     0h:0m:5s (0h:0m:28s left)
2525 GTR+F+I+G4   572132.526  16.415      ES321806nucl_3rdpos+O10821nucl_3rdpos+O13660nucl_3rdpos       0h:0m:5s (0h:0m:28s left)
2526 GTR+F+I+G4   572143.651  15.058      ES321806nucl_3rdpos+ES321882nucl_3rdpos+O10821nucl_3rdpos     0h:0m:5s (0h:0m:28s left)
2527 GTR+F+I+G4   572133.231  15.255      ES321806nucl_3rdpos+O10821nucl_3rdpos+O21286nucl_3rdpos       0h:0m:5s (0h:0m:28s left)
2528 GTR+F+I+G4   572107.812  15.332      ES321806nucl_3rdpos+O540nucl_3rdpos+O10821nucl_3rdpos 0h:0m:5s (0h:0m:29s left)
Merging O17252nucl_3rdpos+O17764nucl_3rdpos with BIC score: 572031.302 (LnL: -276371.219  df: 1898)
@roblanf roblanf added the modelfinder2 things to do before benchmarking modelfinder2 label Feb 14, 2022
@roblanf
Copy link
Collaborator Author

roblanf commented Feb 16, 2022

OK, there are two issues here.

FIrst, —merge greedy isn’t working to force it to do the greedy algorithm. The fix is to make sure that when the default -rcluster-max is specified, it doesn’t overwrite the —merge option. The upshot of this issue is that the algorithm here was rcluster, not greedy.

Second, we now understand why the number of schemes wasn’t right, even for the rcluster algorithm. We have a cool new feature for the rcluster algorithm which we forgot about! In the original in partition finder, we take the e.g. 10% of closest pairs based on absolute difference. THe issue with that is that we’ll choose lower-rate pairs preferentially, because they have the lower difference. So we’ve added a feature where we take the union of two sets:

  1. The rcluster-max closest in absolute rate difference
  2. The rcluster-max closest in log(rate) difference

The upshot is that the number of schemes in every step is between rcluster-max and 2*rcluster_max

@roblanf
Copy link
Collaborator Author

roblanf commented Feb 16, 2022

@bqminh can you post here when --merge greedy is working. Details in comment above.

@thomaskf
Copy link
Collaborator

Hi Rob,

Thanks for reporting the issue. It was due to an upper bound introduced for the number of candidates when computing the best-fit partition scheme.

In version 2.2.2.8 (https://github.com/iqtree/iqtree2/releases/tag/v2.2.2.8), the issue has been fixed and the greedy algorithm for computing the best-fit partitioning scheme (i.e. option: --merge greedy) should work as expected.

Thomas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
modelfinder2 things to do before benchmarking modelfinder2
Projects
None yet
Development

No branches or pull requests

2 participants