
Strange results on log-likelihood calculation #102

Closed
cbravo93 opened this issue Jul 22, 2020 · 14 comments

@cbravo93 commented Jul 22, 2020

Hi!

I have encountered a strange behaviour when using this package across several data sets, including a simulated data set with 100 documents and ~100K words in which I expect to find ~10 topics (used as the example here). When comparing the log-likelihood between models, I find that it increases as the number of topics increases. Below I attach the log-likelihood per iteration with alpha = 50/t and beta = 0.1 as an example:

[Figure: log-likelihood per iteration for models with increasing numbers of topics, fitted with this package]
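For reference, the comparison was done with a loop roughly like the following (a simplified sketch rather than the exact script; the document-term matrix path, the topic numbers and n_iter are placeholders):

import numpy as np
import lda

X = np.load("dtm.npy")  # placeholder: documents x vocabulary count matrix

ll_traces = {}
for k in (2, 5, 10, 20, 30, 40, 50):
    model = lda.LDA(n_topics=k, n_iter=500, alpha=50.0 / k, eta=0.1,
                    refresh=10, random_state=1)
    model.fit(X)
    # loglikelihoods_ holds log p(w, z) computed every `refresh` iterations
    ll_traces[k] = model.loglikelihoods_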

To find out whether it was caused by the data itself or by the implementation, I also used the lda R package (https://github.com/slycoder/R-lda), which also uses collapsed Gibbs sampling. In this case I got the expected results:

[Figure: log-likelihood per iteration for the same models, fitted with the R lda package]

Since both packages are supposed to implement the same algorithm, what could be the difference here? I have also checked the models with 10 topics from both implementations and they are comparable; could it be some issue in the log-likelihood calculation? I have also tested this on three other data sets, with a similar effect.

I am happy to share the data and/or code if it can help!

Cheers!

C

@internaut

You should probably compare how the log-likelihood (LL) is calculated in the R package and in this package (see https://github.com/lda-project/lda/blob/develop/lda/_lda.pyx#L81).

Btw, if you want to find out the "optimal" number of topics, you probably shouldn't (only) rely on the LL. There are several other metrics which may provide better results in "real world" datasets. See for example: https://tmtoolkit.readthedocs.io/en/latest/topic_modeling.html#Evaluation-of-topic-models
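For example, with tmtoolkit's lda backend such an evaluation looks roughly like this (a sketch from memory; please check the linked docs for the exact function and metric names, and replace the random placeholder matrix with your real document-term matrix):

import numpy as np
from tmtoolkit.topicmod.tm_lda import evaluate_topic_models
from tmtoolkit.topicmod.evaluate import results_by_parameter

dtm = np.random.randint(0, 5, size=(100, 500))   # placeholder document-term matrix

varying = [{'n_topics': k} for k in range(5, 51, 5)]
const = {'n_iter': 1000, 'alpha': 0.1, 'eta': 0.1, 'random_state': 1}

eval_results = evaluate_topic_models(dtm, varying_parameters=varying,
                                     constant_parameters=const,
                                     metric=('cao_juan_2009', 'arun_2010'))
results = results_by_parameter(eval_results, 'n_topics')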

PS: Shouldn't the LL be monotonically increasing with the number of iterations? It's a bit odd that in the R package the LL is decreasing after it reached a peak in the beginning.

@cbravo93 commented Jul 23, 2020

Thanks for your reply! Using the Arun_2010, Cao_Juan_2009 and Mimno_2011 metrics (as named in tmtoolkit) I also get 10 topics as the best, as I expected. Regarding the LL, with other implementations I normally see a maximum around the optimal number of topics and then a decrease (as in Griffiths & Steyvers (https://www.pnas.org/content/pnas/101/suppl_1/5228.full.pdf), for example), rather than a monotonic decrease with the number of topics. I will look deeper into the code to see if I can figure out why it is different here.

[Figure: expected log-likelihood behaviour, peaking around the optimal number of topics]

@internaut

I think you misread my "PS" note. I said "monotonically increasing with the number of iterations", not number of topics! This is what I found odd in your second picture.

@cbravo93 commented Jul 24, 2020

Hi!

Sorry if I wasn't clear! I didn't comment on that; it is from the other package and I have no explanation for it, but from Fig. 2 in Griffiths and Steyvers it seems that it can happen. It doesn't happen in all cases either; this is another example on a real data set:

[Figure: log-likelihood per iteration on a real data set, R lda package]

Regarding this package, what I find weird is that the likelihood decreases with the number of topics. I also found a related issue on Google Groups (https://groups.google.com/forum/#!topic/lda-users/P7sYtsj8b6E), but it does not seem to be a parameter/data issue to me.

@internaut

Thanks for clearing this up!

I would be interested to hear if you find a difference in the LL estimation code which might explain this behavior.

@cbravo93 commented Aug 7, 2020

Hi!

I found the code they use for the LL (https://github.com/slycoder/R-lda/blob/master/src/gibbs.c):

    double const_prior = 0;
    double const_ll = 0;

    if (compute_log_likelihood) {
      //                log B(\alpha)
      const_prior = (K * lgammafn(alpha) - lgammafn(alpha * K)) * nd;
      //                log B(\eta)
      const_ll = (V * lgammafn(eta) - lgammafn(eta * V)) * K;
    } 

    /* ... more code ... */

    /* Compute the likelihoods: */
    if (compute_log_likelihood) {
      double doc_ll = 0;
      for (dd = 0; dd < nd; ++dd) {
        double sum = alpha * K;
        for (kk = 0; kk < K; ++kk) {
          doc_ll += lgammafn(INTEGER(document_sums)[K * dd + kk] + alpha);
          sum += INTEGER(document_sums)[K * dd + kk];
        }
        doc_ll -= lgammafn(sum);
      }
      double topic_ll = 0;
      for (kk = 0; kk < K; ++kk) {
        double sum = eta * V;
        for (ii = 0; ii < V; ++ii) {
          topic_ll += lgammafn(INTEGER(topics)[kk + K * ii] + eta);
          sum += INTEGER(topics)[kk + K * ii];
        }
        topic_ll -= lgammafn(sum);
      }
      if (trace >= 2) {
        REprintf("ll: %g + %g - %g - %g = %g\n", doc_ll, topic_ll, const_ll, const_prior,
            doc_ll + topic_ll - const_ll - const_prior);
      }
      REAL(log_likelihood)[2 * iteration] = doc_ll - const_prior + topic_ll - const_ll;
      REAL(log_likelihood)[2 * iteration + 1] = topic_ll - const_ll;
    }
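If I read it correctly, this is the collapsed (Dirichlet-multinomial) form of the complete log-likelihood log p(w, z) from Griffiths & Steyvers, with the terms that do not depend on the counts factored out into const_prior and const_ll:

    \log p(w \mid z) = K\,[\log\Gamma(V\eta) - V\log\Gamma(\eta)]
                     + \sum_k \Big[ \sum_w \log\Gamma(n_{kw} + \eta) - \log\Gamma(n_{k\cdot} + V\eta) \Big]

    \log p(z) = D\,[\log\Gamma(K\alpha) - K\log\Gamma(\alpha)]
              + \sum_d \Big[ \sum_k \log\Gamma(n_{dk} + \alpha) - \log\Gamma(n_{d\cdot} + K\alpha) \Big]

    \log p(w, z) = \log p(w \mid z) + \log p(z)

Here n_kw and n_dk are the entries of topics and document_sums, the first bracketed terms correspond to -const_ll and -const_prior, and the second value stored per iteration (topic_ll - const_ll) is just log p(w | z).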
 

@cbravo93 commented Aug 7, 2020

My attempt to convert this to Cython (not tested):

from libc.math cimport lgamma   # or the lgamma wrapper already used in _lda.pyx


cpdef double _loglikelihood(int[:, :] nzw, int[:, :] ndz, double alpha, double eta) nogil:
    cdef int k, d, w
    cdef int D = ndz.shape[0]
    cdef int n_topics = ndz.shape[1]
    cdef int vocab_size = nzw.shape[1]

    cdef double topic_ll = 0
    cdef double doc_ll = 0
    cdef double row_sum

    # constant terms: D * log B(alpha) and K * log B(eta)
    cdef double const_prior = (n_topics * lgamma(alpha) - lgamma(alpha * n_topics)) * D
    cdef double const_ll = (vocab_size * lgamma(eta) - lgamma(eta * vocab_size)) * n_topics

    # calculate log p(w|z); zero counts are included, since their lgamma(eta)
    # terms are offset against const_ll below
    for k in range(n_topics):
        row_sum = eta * vocab_size
        for w in range(vocab_size):
            topic_ll += lgamma(nzw[k, w] + eta)
            row_sum += nzw[k, w]
        topic_ll -= lgamma(row_sum)

    # calculate log p(z)
    for d in range(D):
        row_sum = alpha * n_topics
        for k in range(n_topics):
            doc_ll += lgamma(ndz[d, k] + alpha)
            row_sum += ndz[d, k]
        doc_ll -= lgamma(row_sum)

    return doc_ll - const_prior + topic_ll - const_ll

@cbravo93 commented Aug 10, 2020

I have made a quick attempt at calculating the log-likelihood in the last iteration with this formula, and I get the expected results (y: log-likelihood in the last iteration, x: number of topics):

[Figure: last-iteration log-likelihood (new formula) vs. number of topics]

For this I adapted the function to Python:

import math

def loglikelihood(nzw, ndz, alpha, eta):
    D = ndz.shape[0]
    n_topics = ndz.shape[1]
    vocab_size = nzw.shape[1]

    # constant terms: D * log B(alpha) and K * log B(eta)
    const_prior = (n_topics * math.lgamma(alpha) - math.lgamma(alpha * n_topics)) * D
    const_ll = (vocab_size * math.lgamma(eta) - math.lgamma(eta * vocab_size)) * n_topics

    # calculate log p(w|z); zero counts are included, since their lgamma(eta)
    # terms are offset against const_ll below
    topic_ll = 0.0
    for k in range(n_topics):
        row_sum = eta * vocab_size
        for w in range(vocab_size):
            topic_ll += math.lgamma(nzw[k, w] + eta)
            row_sum += nzw[k, w]
        topic_ll -= math.lgamma(row_sum)

    # calculate log p(z)
    doc_ll = 0.0
    for d in range(D):
        row_sum = alpha * n_topics
        for k in range(n_topics):
            doc_ll += math.lgamma(ndz[d, k] + alpha)
            row_sum += ndz[d, k]
        doc_ll -= math.lgamma(row_sum)

    return doc_ll - const_prior + topic_ll - const_ll
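For reference, I apply it to the count matrices of an already fitted model, roughly like this (nzw_ and ndz_ are assumed to be the topic-word and document-topic count matrices kept on the fitted model; adjust the names if your version stores them differently):

import numpy as np
import lda

X = np.random.randint(0, 5, size=(100, 2000))   # placeholder count matrix
model = lda.LDA(n_topics=10, n_iter=500, alpha=5.0, eta=0.1, random_state=1)
model.fit(X)

print(loglikelihood(model.nzw_, model.ndz_, model.alpha, model.eta))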

The same plot, but using the last-iteration log-likelihood values as calculated in this package with the current formula. Here the likelihood decreases with the number of topics:

[Figure: last-iteration log-likelihood (current formula in this package) vs. number of topics]

I have not been able to spot where the difference is (I used exactly the same models for both plots, as calculated by this package), apart from the fact that they calculate nd[d] within the function instead of passing it as an argument.

Let me know if you are able to spot the difference and decide which one is the correct approach, and whether you are planning to update the log-likelihood calculation in this package.

Thanks for looking into it!

C

@internaut

Thanks, this looks promising! However, I'm not the author of the lda package, so if you want to get this integrated into lda you should make a PR, and maybe @ariddell will accept the changes. Unfortunately, it seems to me that development of this package has stalled since the end of last year. I therefore created a fork at https://github.com/WZBSocialScienceCenter/lda (mainly because I needed Python 3.8 binary wheels). You may also make a PR there. I haven't had the chance to review the code yet. Anyway, I think it should be converted to Cython, but I could also do that. I'll be on vacation soon, but at the start of September I could have a look at it.

@ariddell

Comparing loglikelihoods from models using different numbers of topics is a bit tricky because the models are very different (each additional topic adds a huge number of parameters). It's less tricky if you are calculating the probability of held-out data.
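Something along these lines, for example (just a sketch of the idea, not code from this package; it assumes dense document-term count matrices and uses transform() to infer doc-topic distributions for the held-out documents):

import numpy as np
import lda

def heldout_loglikelihood(model, X_test):
    theta = model.transform(X_test)   # inferred doc-topic distributions, (D_test, K)
    phi = model.topic_word_           # topic-word distributions, (K, V)
    p_w = theta @ phi                 # per-document word probabilities, (D_test, V)
    # each held-out token of word w in document d contributes log p_w[d, w]
    return float(np.sum(X_test * np.log(p_w)))

# fit models with different numbers of topics on the same training split and
# compare heldout_loglikelihood(model, X_test) on the same held-out documents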

There could, of course, also be some sort of typo in my code. This sort of thing happens all the time.

re: wheels, it is my intention to release new wheels for new versions of Python. I'll try to do this soon.

cbravo93 added a commit to cbravo93/lda that referenced this issue Aug 12, 2020
Update loglikelihood calculation function. Related to lda-project#102.
cbravo93 added a commit to cbravo93/lda that referenced this issue Aug 12, 2020
Update loglikelihood calculation function. Related to lda-project#102.
@cbravo93

I have submitted the PR adding the function in Cython; it works properly. @internaut I can also submit it to your repo if you want it there too.

Thanks both :)!

@ariddell

> I have not been able to spot where the difference is

I'm a little concerned by this. How do we know which is correct?

The loglikelihood on a synthetic dataset should be easy to verify by hand. If there's no test which does this, one should be added. I think it makes sense to start there. How does that sound?
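For instance, something like this (a sketch; loglikelihood here is the pure-Python version posted above, and the reference simply evaluates the Dirichlet-multinomial expression directly with scipy):

import numpy as np
from scipy.special import gammaln

def reference_ll(nzw, ndz, alpha, eta):
    """Direct evaluation of log p(w, z), no shortcuts."""
    K, V = nzw.shape
    D = ndz.shape[0]
    ll = K * (gammaln(V * eta) - V * gammaln(eta))
    ll += np.sum(gammaln(nzw + eta)) - np.sum(gammaln(nzw.sum(axis=1) + V * eta))
    ll += D * (gammaln(K * alpha) - K * gammaln(alpha))
    ll += np.sum(gammaln(ndz + alpha)) - np.sum(gammaln(ndz.sum(axis=1) + K * alpha))
    return ll

def test_loglikelihood_matches_reference():
    rng = np.random.default_rng(0)
    nzw = rng.integers(0, 20, size=(3, 7))   # tiny synthetic count matrices
    ndz = rng.integers(0, 20, size=(5, 3))
    assert np.isclose(loglikelihood(nzw, ndz, alpha=0.5, eta=0.1),
                      reference_ll(nzw, ndz, alpha=0.5, eta=0.1))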

@cbravo93 commented Aug 13, 2020

I am working with a synthetic data set here. It comes from biological data, but the idea is that we sample 20 'documents' from each of 5 very long documents, ending up with 100 documents. Of the 5 source documents, 2 are of one type and the other 3 are of another. We expect to have a topic representing each document of origin, at least two topics representing the type of the documents, and at least one generally shared topic; so ~10 topics in total. A quick check is to use the doc_topic distributions for clustering and dimensionality reduction (e.g. UMAP), and this works quite nicely with the 10-topic model.

[Figure: UMAP embedding of the document-topic distributions for the 10-topic model]
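The check itself is only a couple of lines on doc_topic_ (a sketch; it assumes the umap-learn package, and model is the fitted 10-topic model from this package):

import umap
import matplotlib.pyplot as plt

# embed the document-topic distributions in 2D
embedding = umap.UMAP(n_neighbors=15, random_state=1).fit_transform(model.doc_topic_)
plt.scatter(embedding[:, 0], embedding[:, 1])   # in practice, colour by document of origin
plt.show()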

Other measures also point to ~10 topics as the optimum. For example, in the plot below you can see Cao_Juan and Arun from tmtoolkit and the inverted log-likelihood with the new formula (so all three should be minimized).

[Figure: Cao_Juan_2009, Arun_2010 and inverted log-likelihood (new formula) vs. number of topics]

I have also used 'real' data and the results are always the same: the R package log-likelihood increases with the number of topics and peaks around where expected (and the topics make a lot of sense biologically), while the current formula here always decreases (not inverted here, so the maximum should be the 'best'):

[Figure: log-likelihood vs. number of topics on a real data set, R package vs. current formula]

I have not looked deeply for the difference between the two code bases, but there must be something. When comparing models with the same settings using different model selection criteria, the results are quite similar between both implementations, and the behaviour of the log-likelihood over iterations looks correct (increasing). Maybe some constant is subtracted more times as the number of topics grows? I will take a deeper look!

(Also I am happy to share the synthetic data if it can be useful, and the more tests the better :))

stale bot commented Oct 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Oct 25, 2020
stale bot closed this as completed Nov 1, 2020