Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BFGS+EM for optimising model parameters #1

Closed
roblanf opened this issue Jul 25, 2020 · 6 comments
Closed

BFGS+EM for optimising model parameters #1

roblanf opened this issue Jul 25, 2020 · 6 comments

Comments

@roblanf
Copy link
Collaborator

roblanf commented Jul 25, 2020

The other day we discussed which algorithm was better for model parameter estimation. We have two implementations:

  1. EM, which will fix all parameters and optimise each one in turn while holding the others fixed, then iterate until it's done.

  2. BFGS, which will optimise all parameters at once

@bqminh mentioned that in the original modelfinder, they had compared these and found that BFGS was quicker but gave worse likelihoods, so they decided to stick with EM.

@JamesBarbetti mentioned that often it's better to chain them together

I said I'd take a look at the huge SARS-CoV-2 alignments and see what happened. Here's what happened, confirming that this is hard and also interesting. This is an alignment of ~30K bp and ~40K sequences. Free rate models fit much better than the other models.

JC+I+R5 optimised with EM:
lnL: -388587.635
Proportion of invariable sites: 0.389
Site proportion and rates: (0.529,0.808) (0.074,5.739) (0.009,16.933)

JC+I+R5 optimised with BFGS:
lnL: <-375974.923841
Proportion of invariable sites: 0.616
Site proportion and rates: (0.426,0.269) (0.415,0.625) (0.159,3.936)

It's less than that because I had to kill the analysis before it had a chance to do the final round of fine-grained model parameter optimisations. The first analysis took about 2 days. I killed the second on after 1 day, and I'd guess it was going to take a fairly similar total execution time compared the first one. BFGS is certainly no more than twice as fast.

The surprise: BFGS got WAY better likelihoods.

So, my suggestion is that we implement @JamesBarbetti's suggestion of the two algorithms chained together and compare the performance on a range of datasets. I guess there are lots of options for how to chain them, including:

  1. BFGS until convergence followed EM until convergence
  2. As in 1 but EM followed by BFGS
  3. Switch between BFGS and EM on each iteration

@JamesBarbetti suggested that option 3 might be best, and that's my intuition too.

I think this could be a nice addition to ModelFinder2, and simple to compare on a representative datasets.

@bqminh
Copy link
Member

bqminh commented Jul 30, 2020

tagging @thomaskf here for discussions.
That's interesting. What Lars and Thomas did, was not exactly about log-likelihoods but about accuracy. So they simulated a bunch of alignments and free-rates, and ask whether BFGS or EM most frequently obtained the the true rates back. And the answer was EM.
I remember that, I also did a bunch of empirical alignments, and observed that none of them always obtained higher likelihood than the other. Sometimes BFGS got higher likelihood, sometimes EM.
@roblanf 's suggestion is sensible. I believe Thomas put such code in IQ-TREE after our meeting. Thomas, can you pls test this data set?

@roblanf
Copy link
Collaborator Author

roblanf commented Jul 30, 2020

@thomaskf if you can post the relevant commandline here, I can test it on this dataset (it's a GISAID dataset so hard to share).

@roblanf
Copy link
Collaborator Author

roblanf commented Jul 30, 2020

Also I'm not totally convinced about simulating data and then asking whether we get the 'true' rates back. We only expect to get the 'true' rates if we have infinite species (well, infinite tree length really) and infinite sites.

E.g. if the number of species is limited, more and more sites will look like constant sites rather than sites from a 'true' low rate category. Because of these limitations, I tend to think that in ML software we should really consider the methods (within reason, of course!) that give us the best likelihood. If that disagrees with simulation conditions, it will often be because of some other limitation.

JamesBarbetti added a commit that referenced this issue Jul 30, 2020
(to output in the list of sequences) in Alignment::checkSeqName.
1. Added AlignmentSummary::constructSequenceMatrixNoisily
    (which is like AlignmentSummary::constructSequenceMatrix
    but reports its progress via a progress_display).
2. Refactored Alignment::checkSeqName (which needs a rename!)
    so that it uses AlignmentSummary to construct a sequence
    matrix (via the member function added in change #1 above),
    so that gap characters can be counted via flat memory accesses,
    and then looks at sequences in parallel (all it needs to figure out
    for each sequence is percent_gaps, failed, and pvalue, which are
    written into members of (the local struct) SequenceInfo in a
    dynamically allocated array, to be written out in one sequential
    block after the parallelized bit.
3. Alignment::extractSubAlignment now reports its progress.
@thomaskf
Copy link
Collaborator

@roblanf sorry for the late reply. I agree with your point. I have updated the codes last Saturday so that in each iteration 1-BFGS->EM is performed, and the iteration will be repeated until the convergence is reached. I am uploading the codes soon so that you can have a test. Meanwhile, I still need a bit more time to finalize the paper with Allen and I will work at full speed on the project starting from next week. Apology for the delay and the inconvenience caused.

@roblanf
Copy link
Collaborator Author

roblanf commented Jul 31, 2020

Hi @thomaskf, that sounds good. Just make sure that the three algorithms are available as different flags:

  1. one for EM only
  2. one for BFGS only
  3. one for EM/BFGS combined

JamesBarbetti added a commit that referenced this issue Aug 5, 2020
lower triangular, as well as square), and zlib compression
(writing: changes #1 thru #4, #6 thru #9, reading: change #5).

1. In utils/tools.h, added a dist_format member to the Params class.
   (which defaults to "square"), and a dist_compression_level member
   (which defaults to 1).
2. In utils/tools.cpp, added code to recognise and process a
   new "-dist-format" command-line parameter (which specifies the
   output format to use for distance matrix files, is now recognized.
   It can be "square" (default), "upper", "lower', "square.gz",
   "upper.gz" or "lower.gz": the string is written to Params::dist_format.
   Suffixing it with a compression level sets dist_compression_level.
   (e.g. "lower.gz5" for lower-triangular matrix, gz compression, and
   Compression level 5) (a level > 9 is treated as 9).
3. Also in utils/tools.cpp, added supporting functions
   string_to_lower, throw_if_not_in_set, and strip_number_suffix
   (for now they are in an anonymous namespace in that file).
4. Alignment::printDist takes additional (format and compression_level)
   parameters (which will be set, by the caller, from Params::dist_format
   and Params::dist_compression_level); see the list of allowed values
   in #2 above).  There's an overload (templated on the stream type),
   which is passed an ogzstream (rather than an ofstream), for writing
   Compressed formats. Note: there *isn't* a trailing space on the end
   Of every line of the distance matrix, in the output file, any more.
5. Alignment::readDist now takes an igzstream rather than an ifstream
   parameter (and displays a report of progress). It detects
   upper-triangle or lower-triangle format (Boolean variables: upper
   and lower), by counting how many distances there are on the first
   line of distances (If 0, lower, if nseqs-1, upper, otherwise square).
   Mapping between row/column numbers is only done once (to speed up
   The copying between the temporary distance matrix "as read from the
   File" and the distance matrix.  Note: pos is now a size_t rather
   Than an int, so readDist should now work when nseqs * nseqs > (2^31).
   (It would've segfaulted, before, when pos overflowed and went -ve).
6. Updated the call, from PhyloTree::printDistanceFile to
   Alignment::printDist (it needs to pass the format and compression
   level parameters).
7. In PhyloTree::computeBioNJ, times to write distance files and
   run the distance matrix algorithm aren't logged until both are done.
8. Removed commented-out calls to Alignment::printDist, in other files
   (e.g. one in model/ratemeyerhaeseler.cpp).  They wouldn't work now
   (format and compression level parameters would be needed).
9. Added compression_level as a parameter to constructors, and open()
   member functions, of gzstreambase and ogzstream, so I could set it
   (in Alignment::printDist).  A setting of 9 achieves much
   better compression (8:1 in my tests) but it takes so much time, it
   probably isn't worth it unless you have a very slow HDD.
   A compression level of 5 results in (roughly) 4:1 compression,
   which is... adequate (even a compression level of 1 gives ~3:1).
   The default is 1.

Note that Alignment::readDist could, already, just about work "incrementally".
At present, it throws if/when a sequence is missing from the distance matrix
file.  But it could ignore that, instead (and write a "not-found" value into
actualToTemp). You'd need some changes in the for-loops that copy the
distances from the temporary distance matrix (write zero for unknown).
Because of how later steps treat zeroes in the distance matrix (they
recalculate this distance), this would "just about" work.

For a large matrix, and a compression level of 1, a lower.gz distance file
is about 1/6th the size of a square (un-compressed) square-matrix distance
file (a third the size of an (un-compressed) Triangular-matrix distance file).
JamesBarbetti added a commit that referenced this issue Aug 7, 2020
Sometimes PhyloTree::getBestNNIForBran did a delete [] on a pointer
it didn't own, resulting in segmentation faults (when called on
Partitioned trees).  Now, it *never* has an nniMoves pointer that
it allocates on the heap, so it never has to delete[] it
(Instead it allocates a size-2 array of NNIMove on the stack).

1. PartInfo constructor now sets cur_ptnlh to nullptr.
2. NNIMove constructor now sets node1, node2, ptnlh to nullptr,
   And sets newloglh to -DBL_MAX.
3. Removed code in PhyloSuperTree::getBestNNIForBran that explicitly
   initialised those members of NNIMove (and moved declarations of
   Local NNIMove instances closer to the places they're used).
4. In PhyloSuperTreePlen::getBestNNIForBran, the local allocation
   of two NNIMove instances is done on the stack, so there's no need
   to keep track of whether it was used, and no need to delete [] it
   before function exit (if so).  There's no need to initialise the
   ptnlh members of those instances (or the node member of the first),
   Since the constructor does that (see change #2 above).  That's why
   There's no ... if (newNNIMoves) { delete [] nniMoves; } at the end.
5. In PhyloTree::getBestNNIForBran, likewise.  There's no
   delete [] nniMoves at the end (that should've been if done only
   if (newNNIMoves) was true; the defect introduced in commit e9f7061
   was that it was *no longer* checking if newNNIMoves was true).
6. PhyloTree::computeNNIPatternLh removed code explicitly setting
   node1 and node2 members (of two locally-declared NNIMove instances)
   to nullptr.  The constructor now does that (see change #2 above).
7. Various getBestNNIForBran declarations now have nniMoves defaulting
   to nullptr rather than NULL.
8. Code in main/phylotesting.cpp that was explicitly initialising
   PartInfo::cur_ptnlh and NNIMove::ptnlh is no longer needed
   (Because the constructors do that, see changes #1 and #2 above).
JamesBarbetti added a commit that referenced this issue Dec 3, 2020
…(PhyloTree needs to *own* it!).

1. Renamed distance matrix computation members (of
    PhyloTree) to indicate they do a whole matrix.
    Changed their signatures, so they don't "supply"
    distance matrices to the PhyloTree instance (why?
    they're always the ones the PhyloTree already owns
    in its dist_matrix and var_matrix instance variables).
2.  Added a dist_matrix_rank member to PhyloTree
     (if PhyloTree doesn't know the size of
     its matrix, it doesn't know if it is big enough!)
     When -merge is used an existing matrix might no
     longer be large enough (after additional sequences
     have been added to the alignment, and the tree).
3. Added PhyloTree::ensureDistanceMatrixAllocated
     (so that clients don't need to know about the
     existence of dist_matrix_rank)
4. Made PhyloTree::printDistanceFile() const.
5. Consequences elsewhere (of change #1):
(a) computeMLDist no longer needs to do its own distance
     and variance matrix allocations (since all that is now
     done by the PhyloTree instance itself).
(b) computeInitialDist no longer needs to pass
     a PhyloTree its own distance and variance
    matrix pointers (!)
(c) PhyloTree::computeInitialTree, likewise
6. Likewise, computeObsDist is now
    computeObservedDistanceMatrix, and it too acts
    on the dist_matrix distance matrix instance member
    of the PhyloTree.
7. Added a HUNTING_HEAP_CORRUPTION symbol,
    for MTree::logLine.

Note: I'm not happy that IQTree::init reinitializes
members that belong to PhyloTree (that will already
have happened!).

It would be better if the distance and variance matrices were instances of FlatMatrix<double> rather than naked pointers (more sharing of code with distanceTree would be possible), and set-up and tear-down wouldn't be needed in PhyloTree or IQTree's init() member functions.
thomaskf added a commit that referenced this issue Jun 3, 2021
Merge with the master branch from iqtree/iqtree2
tamaramagdr pushed a commit that referenced this issue Oct 7, 2021
tamaramagdr pushed a commit that referenced this issue Oct 7, 2021
optimisation (~2x) & tree search (~2x) by "serialising" their
memory access patterns to improve spatial and temporal locality
of reference.

1. PhyloTree alignment summaries may now be "borrowed" from another tree
   that has the same alignment (the relevant member is isSummaryBorrowed;
   if it is true, this instance doesn't own the summary, it is only
   "Borrowing" a reference to a summary owned by another tree).
2. PhyloTree member functions copyPhyloTree and copyPhyloTreeMixlen take an
   extra parameter indicating whether the copy is to "borrow" a copy of the
   alignment summary of the original (if it has one).  This matters a lot
   for ratefree.cpp and +R free rate models, and modelmixture.cpp!
   The temporary copies of the phylo tree that are used during parameter
   Optimization can now re-use the AlignmentSummary of the original; which
   means they can "linearise" their memory access to sites, when they are
   Optimising branch lengths (see changes listed below, e.g. #4, #5, #6, #7).
3. PhyloTree::setAlignment does its "name check" a different way (rather
   than finding each sequence by name by scanning the tree, if asks
   MTree::getMapOfTaxonNameToNode for a name to leaf node map, and checks
   the array of sequence names against the map (updating the id on the
   node for each hit).  The new approach is equivalent but is much faster,
   O(n.ln(n)) rather than O(n^2).  This speeds up tree loads markedly
   (particularly for large trees), but it matters most for free rate
   parameter optimization (on middling inputs this was a significant
   factor: about ~10% of parameter optimization time).  This can be turned
   off by changing the FAST_NAME_CHECK symbol.
4. IQTree::optimizeModelParameters now calls prepareToComputeDistances()
   (So that AlignmentSummary's matrix of converted sequences) will be
   available (to be borrowed, via calls to PhyloTree::copyPhyloTree (see
   change # 2 above, and changes #5 through #7 below).
   Likewise IQTree::doNNISearch (so changes #5, #8 help tree searches too).
5. AlignmentPairwise::computeFunction and  AlignmentPairwise::computeFuncDerv(
   can now make use of AlignmentSummary's "Matrix of converted sequences"
   (if it is available) via PhyloTree's accessor methods,
   e.g. PhyloTree::getConvertedSequenceByNumber().
   For this to work as expected, it's necessary for callers to ask
   AlignmentSummary to construct that matrix *including* even sites where
   there is no variety at all (added the keepBoringSites parameter on the
   AlignmentSummary constructor for this).
6. RateMeyerDiscrete::computeFunction and RateMeyerDiscrete::computeFuncDerv
   likewise. And RateMeyerDiscrete::normalizeRates can make use of the "flat"
   frequency array exposed by PhyloTree::getConvertedSequenceFrequencies() too.
7. PhyloTree::computePartialLikelihoodGenericSIMD (in phylokernelnew.h)
   makes use of the matrix of converted sequences (if one is available), in
   about six (!) different places.  In terms of actual effect, this is the
   most important change in this commit, but it needs changes #1, #2, and #4
   committed too, if it is to have any effect.  This change speeds up both
   parameter optimisation and tree searching significantly.
8. As well as inv_eigenvectors, there is now an iv_eigenvectors_transposed
   (Using the transpose makes for some faster multiplications; see change #9
   listed below). ModelMarkov::calculateSquareMatrixTranspose is used to
   calculate the transpose of the inverse eigen vectors.  Unpleasant
   consequence: ModelMarkov::update_eigen_pointers has to take an extra
   parameter.  Keeping this additional member set correctly is the only
   Thing that forced changes to modelpomomixture.cpp (and .h), modelset.cpp,
   and modelsubst.h.
9. ModelMarkov::computeTransMatrix and ModelMarkov::computeTransDerv
   now use (a) calculateExponentOfScalarMultiply and
   (b) aTimesDiagonalBTimesTransposeOfC to calculate transition matrices
   (This is quite a bit faster than the Eigen code, since it doesn't
   bother to construct the diagonal matrix B.asDiagonal()...).
   (a) and (b) and the supporting functions, calculateHadamardProduct
   And dotProduct, are (for now) members of ModelMarkov.
10.Minor tweaks to vector processing code in phylokernelnew.h:
   (a) dotProductVec hand-unrolled treatment of the V array;
   (b) dotProductPairAdd treated the last item (in A and B) as the
       special case, when handling an odd number of items.
       Possibly the treatment of the AD and BD arrays should be
       hand-unrolled here, too, but I haven't tried that yet.
   (c) dotProductTriple (checking for odd uses & rather than %) (faster!)
11.The aligned_free free function (from phylotree.h ?!) does the "pointer
   Null?" check itself, and (because it takes a T*& rather than a T*),
   can itself set the pointer to nullptr.  This means that client code that
   used to go... if (x) { aligned_free(x); x=NULL; } ... can now be
   simplified to just... aligned_free(x);
12.Next to it (in phylotree.h), there is now an ensure_aligned_allocated
   method.  That lets you replace code like ... this:
        if (!eigenvalues)
            eigenvalues = aligned_alloc<double>(num_states);
   With:
        ensure_aligned_allocated(eigenvalues,  num_states);
   which is,
   I reckon, more readable.
13.In many places where there was code of the form... if (x) { delete x; }
   I have replaced it with delete x (likewise delete [] x).  delete always
   checks for null (it's required to, that's in the C++ standards), and
   "Rolling your own check" merely devalues the check that delete will
   later do! I've made similar "don't bother to check for null" changes
   in some other files, that I haven't included in this commit (since there
   aren't any *material* changes to anything in those files).
tamaramagdr pushed a commit that referenced this issue Oct 7, 2021
(to output in the list of sequences) in Alignment::checkSeqName.
1. Added AlignmentSummary::constructSequenceMatrixNoisily
    (which is like AlignmentSummary::constructSequenceMatrix
    but reports its progress via a progress_display).
2. Refactored Alignment::checkSeqName (which needs a rename!)
    so that it uses AlignmentSummary to construct a sequence
    matrix (via the member function added in change #1 above),
    so that gap characters can be counted via flat memory accesses,
    and then looks at sequences in parallel (all it needs to figure out
    for each sequence is percent_gaps, failed, and pvalue, which are
    written into members of (the local struct) SequenceInfo in a
    dynamically allocated array, to be written out in one sequential
    block after the parallelized bit.
3. Alignment::extractSubAlignment now reports its progress.
tamaramagdr pushed a commit that referenced this issue Oct 7, 2021
Merge with the master branch from iqtree/iqtree2
@bqminh
Copy link
Member

bqminh commented May 16, 2024

@thomaskf : I don't think 1-BFGS->EM a good approach, because after doing BFGS, you will be in a local optimum. If you now start EM from that local optimum, I don't see how it help to escape it, except for moving a bit around it. It's better to have independent starting points, with/without different algorithms. Since this is quite a long time with no sensible solution, I recommend that we close this discussion, also because everyone is busy with other projects and we don't really have person to pursue it further, esp. in testing different approaches.

@bqminh bqminh closed this as completed May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants