I don't understand the number of schemes analysed for the greedy algorithm #58

roblanf · 2022-02-14T21:52:32Z

I'm analysing a dataset with 168 data blocks, using the greedy algorithm for merging subsets.

When I do this in PF2, the first step of the greedy algorithm computes the BIC score of 14028 subset pairs. This is expected - 168 choose 2 = 14028.

But when I run what I think is the same analysis in MF2, the first step of the algorithm analyses just 2328 subset pairs. At least, this is what's reported in the logged output. I really can't figure out quite what's going on.

It gets stranger when you look at the next step of the algorithm, which should only be analysing subsets that add to the new subset, but in this case my output is as below.

I'm guessing my commandline is just wrong, and in particular that --merge greedy is not actually doing what I think it's doing, i.e. implementing an algorithm the same as the PF2 greedy algorithm.

Commandline

Command: /data/rob/mf2/iqtree-2.1.2-Linux/bin/iqtree2 -s alignment.phy -spp partitions.nex -m TESTMERGEONLY -mset GTR -mrate E,G,I+G --merge-model GTR --merge-rate E,G,I+G --merge greedy -nt 128 --seed 123456742

Merging ES321806nucl_3rdpos+O10821nucl_3rdpos with BIC score: 572164.159 (LnL: -276386.833  df: 1908)
2497 GTR+F        572129.139  0.173       O19758nucl_2ndpos+O21222nucl_2ndpos   0h:0m:4s (0h:0m:26s left)
2498 GTR+F        572105.823  0.370       O16400nucl_2ndpos+O21222nucl_2ndpos   0h:0m:4s (0h:0m:26s left)
2499 GTR+F+I+G4   572118.259  0.481       O10295nucl_1stpos+O11722nucl_2ndpos   0h:0m:4s (0h:0m:26s left)
2500 GTR+F+G4     572085.885  1.141       O5959nucl_1stpos+O8818nucl_1stpos     0h:0m:4s (0h:0m:26s left)
2501 GTR+F+I+G4   572208.592  1.394       O10600nucl_2ndpos+O13660nucl_1stpos   0h:0m:4s (0h:0m:26s left)
2502 GTR+F+G4     572085.680  1.204       O5959nucl_1stpos+O23857nucl_1stpos    0h:0m:4s (0h:0m:26s left)
2503 GTR+F+G4     572067.794  0.866       O13507nucl_1stpos+O21944nucl_1stpos   0h:0m:5s (0h:0m:26s left)
2504 GTR+F+G4     572117.617  1.104       ES321879nucl_1stpos+O8818nucl_1stpos  0h:0m:5s (0h:0m:26s left)
2505 GTR+F+G4     572122.542  15.451      ES319935nucl_3rdpos+O988nucl_1stpos   0h:0m:5s (0h:0m:26s left)
2506 GTR+F+G4     572086.353  0.905       O19189nucl_1stpos+O22156nucl_1stpos   0h:0m:5s (0h:0m:26s left)
2507 GTR+F+G4     572078.146  1.229       ES321882nucl_1stpos+O16400nucl_1stpos 0h:0m:5s (0h:0m:26s left)
2508 GTR+F+G4     572084.614  1.052       O8775nucl_1stpos+O10821nucl_1stpos    0h:0m:5s (0h:0m:26s left)
2509 GTR+F+G4     572076.994  0.839       O13660nucl_1stpos+O19189nucl_1stpos   0h:0m:5s (0h:0m:26s left)
2510 GTR+F+G4     572104.762  9.295       O10600nucl_3rdpos+O17424nucl_3rdpos   0h:0m:5s (0h:0m:26s left)
2511 GTR+F+I+G4   572099.081  11.129      O1187nucl_3rdpos+O19102nucl_3rdpos    0h:0m:5s (0h:0m:27s left)
2512 GTR+F+I+G4   572144.710  14.689      ES319935nucl_3rdpos+ES321806nucl_3rdpos+O10821nucl_3rdpos     0h:0m:5s (0h:0m:27s left)
2513 GTR+F+I+G4   572065.166  10.102      O14147nucl_3rdpos+O15665nucl_3rdpos   0h:0m:5s (0h:0m:27s left)
2514 GTR+F+G4     572122.068  6.868       O9874nucl_3rdpos+O19058nucl_3rdpos    0h:0m:5s (0h:0m:27s left)
2515 GTR+F+I+G4   572087.934  1.532       ES321806nucl_1stpos+O10419nucl_1stpos 0h:0m:5s (0h:0m:27s left)
2516 GTR+F+I+G4   572133.599  9.612       O10867nucl_3rdpos+O17424nucl_3rdpos   0h:0m:5s (0h:0m:27s left)
2517 GTR+F+I+G4   572116.124  15.400      ES321806nucl_3rdpos+ES321879nucl_3rdpos+O10821nucl_3rdpos     0h:0m:5s (0h:0m:27s left)
2518 GTR+F+I+G4   572150.231  17.265      ES321806nucl_3rdpos+O6843nucl_3rdpos+O10821nucl_3rdpos        0h:0m:5s (0h:0m:27s left)
2519 GTR+F+I+G4   572289.800  16.016      ES321806nucl_3rdpos+O10821nucl_3rdpos+O18599nucl_3rdpos       0h:0m:5s (0h:0m:27s left)
2520 GTR+F+I+G4   572078.331  15.736      ES321806nucl_3rdpos+O988nucl_1stpos+O10821nucl_3rdpos 0h:0m:5s (0h:0m:28s left)
2521 GTR+F+I+G4   572060.003  15.333      ES321806nucl_3rdpos+O8818nucl_3rdpos+O10821nucl_3rdpos        0h:0m:5s (0h:0m:28s left)
2522 GTR+F+G4     572075.551  0.932       ES316890nucl_1stpos+O17157nucl_1stpos 0h:0m:5s (0h:0m:28s left)
2523 GTR+F+I+G4   572184.827  16.689      ES321806nucl_3rdpos+O10821nucl_3rdpos+O11240nucl_3rdpos       0h:0m:5s (0h:0m:28s left)
2524 GTR+F+I+G4   572059.969  15.301      ES316508nucl_3rdpos+ES321806nucl_3rdpos+O10821nucl_3rdpos     0h:0m:5s (0h:0m:28s left)
2525 GTR+F+I+G4   572132.526  16.415      ES321806nucl_3rdpos+O10821nucl_3rdpos+O13660nucl_3rdpos       0h:0m:5s (0h:0m:28s left)
2526 GTR+F+I+G4   572143.651  15.058      ES321806nucl_3rdpos+ES321882nucl_3rdpos+O10821nucl_3rdpos     0h:0m:5s (0h:0m:28s left)
2527 GTR+F+I+G4   572133.231  15.255      ES321806nucl_3rdpos+O10821nucl_3rdpos+O21286nucl_3rdpos       0h:0m:5s (0h:0m:28s left)
2528 GTR+F+I+G4   572107.812  15.332      ES321806nucl_3rdpos+O540nucl_3rdpos+O10821nucl_3rdpos 0h:0m:5s (0h:0m:29s left)
Merging O17252nucl_3rdpos+O17764nucl_3rdpos with BIC score: 572031.302 (LnL: -276371.219  df: 1898)

The text was updated successfully, but these errors were encountered:

roblanf · 2022-02-16T04:28:48Z

OK, there are two issues here.

FIrst, —merge greedy isn’t working to force it to do the greedy algorithm. The fix is to make sure that when the default -rcluster-max is specified, it doesn’t overwrite the —merge option. The upshot of this issue is that the algorithm here was rcluster, not greedy.

Second, we now understand why the number of schemes wasn’t right, even for the rcluster algorithm. We have a cool new feature for the rcluster algorithm which we forgot about! In the original in partition finder, we take the e.g. 10% of closest pairs based on absolute difference. THe issue with that is that we’ll choose lower-rate pairs preferentially, because they have the lower difference. So we’ve added a feature where we take the union of two sets:

The rcluster-max closest in absolute rate difference
The rcluster-max closest in log(rate) difference

The upshot is that the number of schemes in every step is between rcluster-max and 2*rcluster_max

roblanf · 2022-02-16T05:16:06Z

@bqminh can you post here when --merge greedy is working. Details in comment above.

thomaskf · 2023-07-31T00:27:52Z

Hi Rob,

Thanks for reporting the issue. It was due to an upper bound introduced for the number of candidates when computing the best-fit partition scheme.

In version 2.2.2.8 (https://github.com/iqtree/iqtree2/releases/tag/v2.2.2.8), the issue has been fixed and the greedy algorithm for computing the best-fit partitioning scheme (i.e. option: --merge greedy) should work as expected.

Thomas

roblanf added the modelfinder2 things to do before benchmarking modelfinder2 label Feb 14, 2022

thomaskf closed this as completed Jul 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

I don't understand the number of schemes analysed for the greedy algorithm #58

I don't understand the number of schemes analysed for the greedy algorithm #58

roblanf commented Feb 14, 2022

roblanf commented Feb 16, 2022

roblanf commented Feb 16, 2022

thomaskf commented Jul 31, 2023

I don't understand the number of schemes analysed for the greedy algorithm #58

I don't understand the number of schemes analysed for the greedy algorithm #58

Comments

roblanf commented Feb 14, 2022

roblanf commented Feb 16, 2022

roblanf commented Feb 16, 2022

thomaskf commented Jul 31, 2023