BigramCollocationFinder.score_ngrams with BigramAssocMeasure.likelihood_ratio raises ValueError: math domain error #2200

Open
Querela opened this issue Dec 3, 2018 · 4 comments

Querela commented Dec 3, 2018

I want to compute bigram collocations per sentence and tried to output my demo results in a Jupyter notebook. For those four/five sentences and a window size larger than 4, the ranking with the log-likelihood ratio fails, probably because of a logarithm of a negative value.
I'm not sure what to do at this point. Below is my demo code to reproduce the error.

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from somajo import Tokenizer

# instantiate the SoMaJo tokenizer (it is used below but was never created in the original snippet)
tokenizer = Tokenizer()

sentences = [
    "Ich gehe gerne nach Hause.",
    "Arbeit an der Uni macht Spaß.",
    "Ich esse gerne Eis.",
    #"Warum ist die Erde rund?",
    "Jemand anderes isst gerne Eis."
]

sentences_tok = [tokenizer.tokenize(s) for s in sentences]
all_tok = [word for sentence in sentences_tok for word in sentence]

def compute_neighbor_collocations(sentences_tok, window_size=2):
    sentences_tok = iter(sentences_tok)
    
    sentence_tok = next(sentences_tok)
    first_finder = BigramCollocationFinder.from_words(sentence_tok, window_size=window_size)
    for sentence_tok in sentences_tok:
        finder = BigramCollocationFinder.from_words(sentence_tok, window_size=window_size)
        first_finder.ngram_fd.update(finder.ngram_fd)
        first_finder.word_fd.update(finder.word_fd)
    
    return first_finder

# window_size=2 raises no error; larger windows make the scoring below fail
finder_sentence = compute_neighbor_collocations(sentences_tok, window_size=255)
scored_sentence = finder_sentence.score_ngrams(BigramAssocMeasures.likelihood_ratio)
--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-39-3af87a5a33c6> in <module>
      1 #finder_sentence.apply_freq_filter(3)
      2 #finder_sentence.nbest(BigramAssocMeasures.raw_freq, 100)
----> 3 scored_sentence = finder_sentence.score_ngrams(BigramAssocMeasures.likelihood_ratio)
      4 scored_sentence

/opt/miniconda3/envs/ekoerner/lib/python3.6/site-packages/nltk/collocations.py in score_ngrams(self, score_fn)
    119         lowest score, as determined by the scoring function provided.
    120         """
--> 121         return sorted(self._score_ngrams(score_fn), key=lambda t: (-t[1], t[0]))
    122 
    123     def nbest(self, score_fn, n):

/opt/miniconda3/envs/ekoerner/lib/python3.6/site-packages/nltk/collocations.py in _score_ngrams(self, score_fn)
    111         """
    112         for tup in self.ngram_fd:
--> 113             score = self.score_ngram(score_fn, *tup)
    114             if score is not None:
    115                 yield tup, score

/opt/miniconda3/envs/ekoerner/lib/python3.6/site-packages/nltk/collocations.py in score_ngram(self, score_fn, w1, w2)
    183         n_ix = self.word_fd[w1]
    184         n_xi = self.word_fd[w2]
--> 185         return score_fn(n_ii, (n_ix, n_xi), n_all)
    186 
    187 

/opt/miniconda3/envs/ekoerner/lib/python3.6/site-packages/nltk/metrics/association.py in likelihood_ratio(cls, *marginals)
    141         return (cls._n *
    142                 sum(obs * _ln(obs / (exp + _SMALL) + _SMALL)
--> 143                     for obs, exp in zip(cont, cls._expected_values(cont))))
    144 
    145     @classmethod

/opt/miniconda3/envs/ekoerner/lib/python3.6/site-packages/nltk/metrics/association.py in <genexpr>(.0)
    141         return (cls._n *
    142                 sum(obs * _ln(obs / (exp + _SMALL) + _SMALL)
--> 143                     for obs, exp in zip(cont, cls._expected_values(cont))))
    144 
    145     @classmethod

ValueError: math domain error

BLKSerene (Contributor) commented Dec 8, 2018

The problem is with the calculation of the contingency table.

                w1    ~w1
             ------ ------
         w2 | n_ii | n_oi | = n_xi
             ------ ------
        ~w2 | n_io | n_oo |
             ------ ------
             = n_ix        TOTAL = n_xx
    """

When you call score_ngram, it first fetches the values of n_ii, n_ix, n_xi and n_xx.
n_ii is fetched from finder.ngram_fd and divided by (window_size - 1); n_ix (the number of occurrences of word 1) and n_xi (the number of occurrences of word 2) are fetched from finder.word_fd.
Since you update ngram_fd and word_fd while looping through the sentences, these three values are correct. But score_ngram also reads n_xx from self.N, which is the total length of the tokens you passed to first_finder at construction time. You never update that value in the loop, so it stays 6 (the token count of your first sentence "Ich gehe gerne nach Hause."), and n_oo is then derived as n_xx - n_xi - n_ix + n_ii.
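
Paraphrased from the nltk source shown in the traceback, score_ngram does roughly this (a sketch, not the verbatim code):

def score_ngram(self, score_fn, w1, w2):
    n_all = self.N                                    # total token count, fixed at construction
    n_ii = self.ngram_fd[(w1, w2)] / (self.window_size - 1)
    if not n_ii:
        return None
    n_ix = self.word_fd[w1]                           # occurrences of word 1
    n_xi = self.word_fd[w2]                           # occurrences of word 2
    return score_fn(n_ii, (n_ix, n_xi), n_all)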

So, for example, for the word token "gerne" and the period token ".":
n_ii = 3 / (255 - 1) = 0.011811023622047244 (approximately 0)
n_ix (the number of occurrences of "gerne") = 3
n_xi (the number of occurrences of the period ".") = 4
n_oo = n_xx - n_xi - n_ix + n_ii = 6 - 4 - 3 + 0.011811023622047244 = -0.988188976377952756
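
A quick check of that arithmetic in plain Python (values taken from above):

n_xx = 6                  # self.N, never updated in the loop
n_ii = 3 / (255 - 1)      # ~0.0118
n_ix, n_xi = 3, 4         # counts of "gerne" and "."
n_oo = n_xx - n_xi - n_ix + n_ii
print(n_oo)               # -0.9881889763779528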

The formula for the log-likelihood ratio test statistic is:

2 * (n_ii * math.log(n_ii / m_ii) +
     n_oi * math.log(n_oi / m_oi) +
     n_io * math.log(n_io / m_io) +
     n_oo * math.log(n_oo / m_oo))

where m_ii is the expected value of n_ii, i.e. m_ii = n_ix (the w1 column total) * n_xi (the w2 row total) / n_xx, and similarly for m_oi, m_io and m_oo.

Trying to compute the logarithm of a negative value would of course raise an error.
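
Continuing the numbers from the check above, the n_oo term is where it breaks (a walk-through of the formula just given, not nltk's exact code path):

import math

n_oo = -0.988188976377952756                 # from the check above
m_oo = (6 - 3) * (6 - 4) / 6                 # expected value of n_oo: 1.0
n_oo * math.log(n_oo / m_oo)                 # ValueError: math domain error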
And I'm wondering: why don't you just pass a list of all tokens into the finder? Then you wouldn't have to update ngram_fd and word_fd in the loop.

Querela (Author) commented Dec 8, 2018

Ahh, thank you. It seems I was not as thorough as I thought. My workplace has a Java tool that computes word and sentence cooccurrences, but for 'quick' prototyping it is somewhat complicated to use. So I found the NLTK CollocationFinder and played around a bit. The generic way to compute frequencies and log-likelihood was an even better incentive. :-)
I found the n_** variables but could not easily trace where they come from.

The reason I don't just paste the whole text / concatenate the sentences is that I want to compute the collocations/cooccurrences per sentence, not per text. My/our sentences carry no reference to their source text and are therefore 'isolated'. I would not like to get pairs like (".", "Ich"), because that combination may not occur naturally. The other reason: I want to compute sentence cooccurrences, i.e. for a given word I want to know which other words appear in the same sentence. With a window size of e.g. 255 I get those pairs for all words in a sentence (just separated into left-/right-hand side; I am still working on combining those correctly into something like a word-sentence matrix, similar to a word-document matrix). If I input the whole text, I would get wrong cooccurrences, since the window would span sentence boundaries and the sentences have different lengths.

And thank you for giving an example computation. I may use it later to look up details.

Again, thank you. Next time I will look through the source more carefully. I think this issue can be closed?

BLKSerene (Contributor) commented Dec 8, 2018

You mean you want to find collocations in each sentence of your text?
You could instantiate a new BigramCollocationFinder for each of your sentences and call its score_ngrams method to measure the association of the bigrams in that sentence, as sketched below.
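
A minimal sketch of that suggestion, reusing sentences_tok from the snippet above (the per-sentence window size is just one possible choice and assumes every sentence has at least two tokens):

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

for sentence_tok in sentences_tok:
    # one finder per sentence, so self.N matches that sentence's length
    finder = BigramCollocationFinder.from_words(sentence_tok, window_size=len(sentence_tok))
    print(finder.score_ngrams(BigramAssocMeasures.likelihood_ratio))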

Querela (Author) commented Dec 10, 2018

Yes and no? I have a whole corpus, but the sentences are not contiguous. I want to generate statistics for all the sentences, so a BigramCollocationFinder per sentence but with aggregated counts.
E.g. the collocation (".", "The") should score almost zero (if all sentences were pasted together, its count would be roughly equal to the number of occurrences of sentence-initial "The"). But I want to know whether ("The", "king") occurs more often than ("The", "queen"), or the same for some verbs etc., and then compare their log-likelihood scores to see which is more significant.
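
For what it's worth, one possible variant of the original aggregation that keeps N consistent: merge the frequency distributions first, then build a single finder from them, since BigramCollocationFinder's constructor accepts the two FreqDists directly and takes N from word_fd at construction time (a sketch under that assumption, not tested against every nltk version):

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def compute_sentence_collocations(sentences_tok, window_size=255):
    word_fd = nltk.FreqDist()
    ngram_fd = nltk.FreqDist()
    for sentence_tok in sentences_tok:
        finder = BigramCollocationFinder.from_words(sentence_tok, window_size=window_size)
        word_fd.update(finder.word_fd)
        ngram_fd.update(finder.ngram_fd)
    # rebuilding the finder makes N = word_fd.N(), i.e. the token count
    # over all sentences instead of only the first one
    return BigramCollocationFinder(word_fd, ngram_fd, window_size=window_size)

finder = compute_sentence_collocations(sentences_tok, window_size=255)
scored = finder.score_ngrams(BigramAssocMeasures.likelihood_ratio)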
