nltk fixes #441

Merged 14 commits on Jul 16, 2013
nltk/classify/maxent.py (2 additions, 2 deletions)

@@ -66,7 +66,7 @@
 from collections import defaultdict

 from nltk import compat
-from nltk.data import GzipUnicodeFile
+from nltk.data import gzip_open_unicode
 from nltk.util import OrderedDict
 from nltk.probability import DictionaryProbDist

@@ -1440,7 +1440,7 @@ def train(cls, train_toks, **kwargs):
         weightfile_fd, weightfile_name = \
             tempfile.mkstemp(prefix='nltk-tadm-weights-')

-        trainfile = GzipUnicodeFile(trainfile_name, 'w')
+        trainfile = gzip_open_unicode(trainfile_name, 'w')
         write_tadm_file(train_toks, encoding, trainfile)
         trainfile.close()
nltk/classify/tadm.py (2 additions, 2 deletions)

@@ -48,15 +48,15 @@ def write_tadm_file(train_toks, encoding, stream):
     labels = encoding.labels()
     for featureset, label in train_toks:
         length_line = '%d\n' % len(labels)
-        stream.write(length_line.encode('utf8'))
+        stream.write(length_line)
         for known_label in labels:
             v = encoding.encode(featureset, known_label)
             line = '%d %d %s\n' % (
                 int(label == known_label),
                 len(v),
                 ' '.join('%d %d' % u for u in v)
             )
-            stream.write(line.encode('utf8'))
+            stream.write(line)

 def parse_tadm_weights(paramfile):
     """
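The .encode('utf8') calls above were dropped because write_tadm_file now receives a text-mode stream (opened via gzip_open_unicode), whose write() accepts str rather than bytes. A minimal sketch of the same idea using the stdlib's gzip.open in text mode as a stand-in for nltk's gzip_open_unicode helper (the file name and sample lines are illustrative):

```python
import gzip
import os
import tempfile

# A text-mode gzip stream encodes on write, so callers pass str, not bytes.
path = os.path.join(tempfile.mkdtemp(), 'train.gz')
with gzip.open(path, 'wt', encoding='utf-8') as stream:
    stream.write('%d\n' % 3)                        # no .encode('utf8') needed
    stream.write('%d %d %s\n' % (1, 2, '0 1 1 2'))

# Reading back in text mode decodes transparently as well.
with gzip.open(path, 'rt', encoding='utf-8') as stream:
    lines = stream.read().splitlines()

print(lines)  # ['3', '1 2 0 1 1 2']
```

Pushing the encoding into the stream keeps write_tadm_file working unchanged whether it is handed a gzip-backed file or any other text stream.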
nltk/compat.py (5 additions, 0 deletions)

@@ -61,6 +61,11 @@ def u(s):
     reload = reload
     raw_input = raw_input

+    # Python 2.7 pickle compatibility
+    import UserString
+    import collections
+    UserString.defaultdict = collections.defaultdict
+
Member: Could you please elaborate on this? What is it for?

Author: I re-pickled the data files under Python 3, using code like the following:

    pickle.dump(pickle.load(open(srcpath, 'rb'), encoding='latin-1'),
                open(destpath, 'wb'), protocol=2)

If there's a collections.defaultdict in the pickled data, Python 3.3 pickles it as UserString.defaultdict instead of collections.defaultdict. I'm not sure why, but since Python 2.6 and 2.7 have no defaultdict class in UserString, I've added one as a compatibility fix.
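The compat.py addition above can be sketched as a guarded shim; the try/except is an assumption added here so the snippet also runs on Python 3, where the UserString module no longer exists:

```python
import collections

try:
    import UserString                 # Python 2 only
except ImportError:                   # Python 3 merged it into collections
    UserString = None

if UserString is not None and not hasattr(UserString, 'defaultdict'):
    # Let Python 2's pickle resolve the bogus UserString.defaultdict
    # reference that Python 3.3 writes into protocol-2 pickles.
    UserString.defaultdict = collections.defaultdict

print(UserString is None or UserString.defaultdict is collections.defaultdict)  # True
```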

Member: Hmm, that's quite strange, because Python 3.3 doesn't have a UserString module.

Author: Yes indeed. Try the following:

    pickle.dump(collections.defaultdict(), open('test.pickle', 'wb'), protocol=2)

Then disassemble the pickle file:

    python -m pickletools test.pickle

Member: Cool. I believe it is a Python bug.

    In [6]: dct = collections.defaultdict()

    In [8]: pickle.dumps(dct, protocol=2)
    Out[8]: b'\x80\x02cUserString\ndefaultdict\nq\x00)Rq\x01.'

    In [10]: pickle.dumps(dct, protocol=2, fix_imports=False)
    Out[10]: b'\x80\x02ccollections\ndefaultdict\nq\x00)Rq\x01.'

The cause of the error is here: http://hg.python.org/cpython/file/7272ef213b7c/Lib/_compat_pickle.py#l80

We could try fix_imports=False: it will work as long as we don't use moved stdlib functions directly, e.g. if we import urlencode from compat.py. Another way is to create workarounds in compat.py, as you've done in your pull request. In that case I believe we should also add defaultdict to UserList, because REVERSE_IMPORT_MAPPING relies on dict order, which can vary between runs. But these workarounds are weird; maybe it is better to patch REVERSE_IMPORT_MAPPING directly, which would make the workaround clearer.

But I hope fix_imports=False will work.
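The fix_imports difference discussed above can be checked directly. A minimal sketch; since later Python 3 releases fixed the _compat_pickle mapping, the default may already emit collections, so only the fix_imports=False behavior is asserted here:

```python
import collections
import pickle

dct = collections.defaultdict()

# With fix_imports=False, a protocol-2 pickle keeps the Python 3 module name
# instead of mapping it back to a (possibly wrong) Python 2 location.
data = pickle.dumps(dct, protocol=2, fix_imports=False)
print(b'collections' in data, b'UserString' in data)  # True False

# The object round-trips regardless of the mapping used at dump time.
assert isinstance(pickle.loads(data), collections.defaultdict)
```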

Author: Yup, you're right. Good job on finding the bug.

Member: I was wrong about "it will work as long as we don't use moved stdlib functions directly, e.g. if we import urlencode from compat.py" — please disregard that.

Member: But fix_imports=False could still work, and I think we should try it first.


     from itertools import imap, izip

     try: