Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error while summarizing text #805

Closed
pedro-walter opened this issue Jul 25, 2016 · 14 comments · Fixed by #1653
Closed

Error while summarizing text #805

pedro-walter opened this issue Jul 25, 2016 · 14 comments · Fixed by #1653
Assignees
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills good first issue Issue for new contributors (not required gensim understanding + very simple)

Comments

@pedro-walter
Copy link

Hi,

I've received the following error when trying to summarize the body of this news article:

https://www.theguardian.com/media/2016/jun/19/sun-times-brexit-in-out-shake-it-all-about

The error follows:

File "/home/apps/comment_parser/venv/local/lib/python2.7/site-packages/gensim/summarization/summarizer.py", line 202, in summarize
most_important_docs = summarize_corpus(corpus, ratio=ratio if word_count is None else 1)
File "/home/apps/comment_parser/venv/local/lib/python2.7/site-packages/gensim/summarization/summarizer.py", line 161, in summarize_corpus
pagerank_scores = _pagerank(graph)
File "/home/apps/comment_parser/venv/local/lib/python2.7/site-packages/gensim/summarization/pagerank_weighted.py", line 24, in pagerank_weighted
vals, vecs = eigs(pagerank_matrix.T, k=1) # TODO raise an error if matrix has complex eigenvectors?
File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1271, in eigs
ncv, v0, maxiter, which, tol)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 685, in init
raise ValueError("k must be less than ndim(A)-1, k=%d" % k)
ValueError: k must be less than ndim(A)-1, k=1

Regards,

@piskvorky piskvorky added the bug Issue described a bug label Jul 27, 2016
@madewild
Copy link

I am also affected by this bug...

@tmylk
Copy link
Contributor

tmylk commented Aug 16, 2016

@madewild could you post a code snippet to reproduce?

@madewild
Copy link

Unfortunately the text fed to the summarizer is confidential, but my guess is that the error was triggered by an unusually high repetition of some sentences... I also notice now that the error raised was not exactly the same:
ValueError: k must be less than rank(A)-1, k=1

@MridulS
Copy link
Contributor

MridulS commented Aug 25, 2016

@tmylk The reason for this failure looks like the number of nodes in the graph that is used to calculate the pagerank of the corpus graph, after removing unreachable nodes the graph is left with only 2 nodes and hence it builds a matrix of shape 2 * 2 (for which scipy.sparse.linalg.eigs() will fail for k=1). We should probably raise an error if number of nodes (after removing unreachable nodes) goes below 3.

@tmylk
Copy link
Contributor

tmylk commented Aug 30, 2016

@MridulS could you submit a pr for this?

@MridulS
Copy link
Contributor

MridulS commented Aug 30, 2016

@tmylk What kind of error should I raise?

@tmylk
Copy link
Contributor

tmylk commented Sep 2, 2016

The error should say "Please add more sentences to the text. The number of reachable nodes is below 3"

@tmylk tmylk added the difficulty easy Easy issue: required small fix label Sep 25, 2016
@anmolgulati
Copy link
Contributor

Hi I worked on this issue. I have sent out a pull request for the same. Please review.

anmolgulati added a commit to anmolgulati/gensim that referenced this issue Sep 25, 2016
@tmylk tmylk closed this as completed in f7dd826 Sep 29, 2016
harshuljain13 pushed a commit to harshuljain13/gensim that referenced this issue Sep 30, 2016
…words (piskvorky#885)

* Added check in summarize_corpus to fix bug in summarizer

* Fix piskvorky#805: Added check in summarizing text

* Added test for checking low number of distinct words in text

* Text split method changed to allow running in Python 3.3 and above.

* Change to fix test in python versions 3.3 and higher

* Added blank line test_wikicorpus.py file

Added blank line to fix issue with travis CI
@vitordouzi
Copy link

Hello, I think that the problem is still open. I replicated this error with the document 1403 from the Hulth2003 dataset):
from gensim.summarization import keywords
print(keywords('IT: Utilities A look at five utilities to make your PCs more, efficient, effective, and efficacious'))

Traceback (most recent call last):
File "<stdin>", line 1, in
File "/gensim/summarization/keywords.py", line 214, in keywords
pagerank_scores = _pagerank(graph)
File "/gensim/summarization/pagerank_weighted.py", line 24, in pagerank_weighted
vals, vecs = eigs(pagerank_matrix.T, k=1) # TODO raise an error if matrix has complex eigenvectors?
File "/anaconda3/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1299, in eigs
ncv, v0, maxiter, which, tol)
File "/anaconda3/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 692, in init
raise ValueError("k must be less than ndim(A)-1, k=%d" % k)
ValueError: k must be less than ndim(A)-1, k=1

Looking to the document, make sense say that the possible problems are the terms frequencies! All the terms have frequency equal 1.

@piskvorky
Copy link
Owner

Hmm, that's not good, looks like a bug.

Can you suggest a fix @vitordouzi ?

@piskvorky piskvorky reopened this Aug 22, 2017
@vitordouzi
Copy link

@piskvorky, no, I don't! sorry! Maybe this TODO in the pagerank_weighted.py file can help.

File "/gensim/summarization/pagerank_weighted.py", line 24, in pagerank_weighted
vals, vecs = eigs(pagerank_matrix.T, k=1) # TODO raise an error if matrix has complex eigenvectors?

What exactly are the complex eigenvectors?

@menshikh-iv menshikh-iv added difficulty medium Medium issue: required good gensim understanding & python skills test before incubator and removed difficulty easy Easy issue: required small fix labels Oct 2, 2017
@menshikh-iv menshikh-iv added good first issue Issue for new contributors (not required gensim understanding + very simple) and removed test before incubator labels Oct 16, 2017
@xelez
Copy link
Contributor

xelez commented Oct 25, 2017

Hello everyone. I started investigating this issue and basically, this is the same one as @MridulS described, but in different function:

The reason for this failure looks like the number of nodes in the graph that is used to calculate the pagerank of the corpus graph, after removing unreachable nodes the graph is left with only 2 nodes and hence it builds a matrix of shape 2 * 2 (for which scipy.sparse.linalg.eigs() will fail for k=1). We should probably raise an error if number of nodes (after removing unreachable nodes) goes below 3.

On text, given by @vitordouzi we end up with graph:

('effect', 'effici'), ('effici', 'effect'), ('effici', 'effici'), ('effect', 'effect')

which ends in 2x2 matrix and pagerank fails.

But I'm not sure how to fix this. @vitordouzi, @menshikh-iv any ideas on the desired outcome? An exception this time doesn't feel right. Maybe set some predefined scores instead of running pagerank? Or maybe add special case to pagerank?

@xelez
Copy link
Contributor

xelez commented Oct 25, 2017

Anyway, some notes about pagerank_weighted:

https://github.com/RaRe-Technologies/gensim/blob/9481915915bf61aa6e4e719a2f26d509677e6779/gensim/summarization/pagerank_weighted.py#L18-L26

  1. Matrix can have complex eigenvalues and eigenvector, but we a finding the largest eigenvalue, which is proven to be real. (because pagerank_matrix is positive and Perron–Frobenius theorem)
  2. Either assert about less than 3 vertices should be added, or, maybe, special handling of this case.
  3. It took me awhile to figure out why eigs() is used instead of eig() on a dense matrix. It was discussed in Summarisation optimisations #441, Summarize calculates all eigenvectors when only the largest is needed #438. I'd like to add a comment about this.

@menshikh-iv
Copy link
Contributor

About (1), (2) @xelez - need to handle special case, the comment from (3) should be useful too.

menshikh-iv pushed a commit that referenced this issue Oct 26, 2017
* added a regression test for summarization.keywords()
 * handled case with graph smaller than 3 nodes
 * removed TODO about complex eigenvectors
 * added more comments
horpto pushed a commit to horpto/gensim that referenced this issue Oct 28, 2017
* added a regression test for summarization.keywords()
 * handled case with graph smaller than 3 nodes
 * removed TODO about complex eigenvectors
 * added more comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills good first issue Issue for new contributors (not required gensim understanding + very simple)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants