Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error due to mallet 2.0.7 solved #664

Merged
merged 10 commits into from
Jun 3, 2016
Merged

Conversation

RishabGoel
Copy link
Contributor

purposing a solution for : gensim==0.12.4 mallet wrapper is not compatible with mallet-2.0.7 #599

@@ -36,6 +36,8 @@

import numpy

from bs4 import BeautifulSoup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Beautiful soup for parsing correct XML is an overkill. Please use from xml.etree import ElementTree

@@ -245,11 +247,20 @@ def show_topic(self, topicid, topn=10):
def print_topic(self, topicid, topn=10):
return ' + '.join(['%.3f*%s' % v for v in self.show_topic(topicid, topn)])

######### function to return the version of mallet ######
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use Gensim style of doc-strings (PEP257).

doc = []
while pointer < len(parts):
if float(parts[pointer]) == int(parts[pointer]):
if float(parts[pointer+1]) > eps:
Copy link
Owner

@piskvorky piskvorky Apr 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the while condition, this [pointer + 1] access can result in out-of-bounds IndexError, no?

Also, the logic around these ifs deserves some comments, it's not clear what's happening and why.

Copy link
Contributor Author

@RishabGoel RishabGoel Apr 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think so. I am adding comments for this if logic. Except when there are no topics for a document, it will work fine. Neverthless I wiill run some more test cases to validate it.

Copy link
Owner

@piskvorky piskvorky Apr 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I read this branch correctly, then if the last number on a line is an int, this if will raise an IndexError.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the boundary condition as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will. But consider the following:
There are only 2 formats possible namely 20.2345 or 2 0.2345, right
in the former case the while block will go to the else condition and in latter case it will go to if section. The format type produced by mallet is a kind of security for this error.

@RishabGoel
Copy link
Contributor Author

RishabGoel commented Apr 20, 2016

@janrygl : Could you post the parameters you used that lead to issue #599.
Thanks

@RishabGoel
Copy link
Contributor Author

@cscorley Please review the fix in this PR. I have also sent you an email about the same with other details.

"""
Check version of mallet via jar file
"""
archive = zipfile.ZipFile(direc_path, 'r')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use smart_open

Copy link
Owner

@piskvorky piskvorky Apr 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smart_open doesn't handle zip files (=multiple files in a single archive). It only handles file compression, not multi-file archives like zip, tar, rar etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, according to the following eg in the smart_open repo's read me:
for line in smart_open.smart_open('./foo.txt.gz'):
... print line

We can only unzip a file (say a txt file).
I think zipfile is OK. If there not then I will look for another library to handle jar file.

@tmylk tmylk merged commit 8787f8c into piskvorky:develop Jun 3, 2016
@syting
Copy link

syting commented Jun 3, 2016

Testing this with Mallet 2.0.8RC3 it seems that when topic_threshold is set the output reverts back to topic/proportion format from the dense format.

@RishabGoel
Copy link
Contributor Author

That seems to be correct. Thanks for pointing out @syting . I have taken care of that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants