UnicodeDecodeError #32

Closed
scus1 opened this issue Mar 8, 2016 · 10 comments
Labels
🐛 bug Something isn't working, or a fix is proposed

Comments

@scus1

scus1 commented Mar 8, 2016

I tried to use mdedup on my maildir with 8276 messages and got the following error:

Traceback (most recent call last):
  File "/home/user/bin/mdedup", line 11, in <module>
    sys.exit(cli())
  File "/home/user/.local/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/user/.local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/user/.local/lib/python2.7/site-packages/maildir_deduplicate/cli.py", line 139, in deduplicate
    dedup.add_maildir(maildir)
  File "/home/user/.local/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 80, in add_maildir
    mail_file, message, self.use_message_id)
  File "/home/user/.local/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 103, in compute_hash
    canonical_headers_text = cls.canonical_headers(mail_file, message)
  File "/home/user/.local/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 125, in canonical_headers
    canonical_value = cls.canonical_header_value(header, value)
  File "/home/user/.local/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 148, in canonical_header_value
    value = re.sub('\s+', ' ', value).strip()
  File "/usr/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 4: ordinal not in range(128)

Unfortunately, I can't identify the message that causes this error, and mdedup -v doesn't show any more information. How can I find the problematic message?
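
One way to narrow this down, sketched under assumptions and not taken from mdedup itself: walk the maildir with the standard-library `mailbox` module and report every header line that contains a non-ASCII byte. The function name `find_non_ascii_headers` and the path in the usage comment are hypothetical, and the sketch targets Python 3 rather than the Python 2.7 shown in the traceback.

```python
import mailbox


def find_non_ascii_headers(maildir_path):
    """Yield (key, header_line) for each header line holding a byte > 0x7F."""
    box = mailbox.Maildir(maildir_path, factory=None, create=False)
    for key in box.iterkeys():
        raw = box.get_bytes(key)
        # Headers end at the first blank line; tolerate CRLF line endings.
        head = raw.replace(b"\r\n", b"\n").partition(b"\n\n")[0]
        for line in head.split(b"\n"):
            if any(byte > 0x7F for byte in line):
                yield key, line


# Example use (the path is hypothetical):
# for key, line in find_non_ascii_headers("/home/user/Maildir"):
#     print(key, line)
```

The key is the maildir's unique name for the message, which should also be the start of its file name under cur/ or new/. Only raw headers are checked here, since the traceback above ends in the header canonicalization step.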

@kdeldycke kdeldycke added the bug label Mar 13, 2016
@jameseck

Hi,

I've encountered similar issues with a migration I've been doing.

To replicate this, you can grab the 2015-2016 archive files from the freeipa-users list (https://www.redhat.com/archives/freeipa-users/).

I used the Python script available at https://blogs.gnome.org/muelli/2012/11/converting-mailman-archives-mboxes-to-maildir/ to convert the mbox files into a maildir.

Unfortunately, this script seems to have produced a number of duplicates, so I tried to use your script to remove them.

I've encountered several different exceptions:

Subject: Re: [Freeipa-users] 389 DS & admin consoles
Traceback (most recent call last):
  File "/usr/bin/mdedup", line 11, in <module>
    sys.exit(cli())
  File "/usr/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/maildir_deduplicate/cli.py", line 140, in deduplicate
    dedup.run()
  File "/usr/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 220, in run
    sorted_messages_size = self.size_sort(messages)
  File "/usr/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 328, in size_sort
    size = len(''.join(body).decode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 15: ordinal not in range(128)

I got around this exception by commenting out line 328 in the deduplicate.py file and uncommenting line 327. This let the deduplication get further, until I hit the exception below:

Subject: Re: [Freeipa-users] Squid authentication in FreeIPA
Traceback (most recent call last):
  File "/usr/bin/mdedup", line 11, in <module>
    sys.exit(cli())
  File "/usr/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/maildir_deduplicate/cli.py", line 140, in deduplicate
    dedup.run()
  File "/usr/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 220, in run
    sorted_messages_size = self.size_sort(messages)
  File "/usr/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 326, in size_sort
    body = cls.get_lines_from_message_body(message)
  File "/usr/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 342, in get_lines_from_message_body
    header_text, sep, body = message.as_string().partition("\n\n")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3584: ordinal not in range(128)

@boagoa

boagoa commented Mar 19, 2016

I was able to solve it by adding

import sys
sys.setdefaultencoding("latin-1")

to python27\lib\site-packages\sitecustomize.py (setting utf-8 also threw the error ...)

@kdeldycke
Owner

I removed the setdefaultencoding code in commit de94ee0, as it was clearly a hack. I thought the work I did on unicode / string stuff to support both Python 2 and Python 3 made the hack irrelevant. Looks like I was wrong.

I still think we would not need this hack if strings were handled properly in maildir-deduplicate. Anyway, the codebase is starting to show its age and needs some cleanup and refactoring.

A first step might be to provide a unittest to clearly expose the issue discussed here, so we can try to find a cleaner way to handle that case.
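
A minimal sketch of what such a test could look like, assuming Python 3. The helper `normalize_whitespace` is hypothetical: it stands in for the whitespace-collapsing step seen in `canonical_header_value` in the tracebacks above and is not the project's actual code or its eventual fix. The 0xdc byte is the one reported in the first traceback; the sample header values are made up.

```python
import re
import unittest


def normalize_whitespace(value):
    """Hypothetical stand-in: collapse whitespace runs in a header value.

    Bytes are decoded explicitly (UTF-8 first, then Latin-1, which accepts
    any byte) instead of relying on an implicit ASCII conversion.
    """
    if isinstance(value, bytes):
        try:
            value = value.decode("utf-8")
        except UnicodeDecodeError:
            value = value.decode("latin-1")
    return re.sub(r"\s+", " ", value).strip()


class NonAsciiHeaderTest(unittest.TestCase):
    def test_latin1_bytes(self):
        # A lone 0xdc is invalid UTF-8, so the Latin-1 fallback is used.
        self.assertEqual(normalize_whitespace(b"  \xdcber \t uns "), "\xdcber uns")

    def test_utf8_bytes(self):
        # b"\xc3\xa9" is the UTF-8 encoding of U+00E9.
        self.assertEqual(
            normalize_whitespace(b"caf\xc3\xa9\n  au lait"), "caf\xe9 au lait"
        )

    def test_text_is_passed_through(self):
        self.assertEqual(normalize_whitespace("a  b"), "a b")
```

Saved as a module, this can be run with `python -m unittest`. Against code that relies on implicit conversion, the first test is the one that would be expected to expose the failure discussed here.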

@juantascon juantascon mentioned this issue Mar 28, 2016
@kdeldycke
Owner

Fixed by #33.

@dmacvicar

dmacvicar commented Mar 30, 2016

I still get this error (using the develop branch):

Traceback (most recent call last):
  File "/home/duncan/.pyenv/versions/2.7.9/bin/mdedup", line 9, in <module>
    load_entry_point('maildir-deduplicate==1.2.1', 'console_scripts', 'mdedup')()
  File "/home/duncan/.pyenv/versions/2.7.9/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/duncan/.pyenv/versions/2.7.9/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/duncan/.pyenv/versions/2.7.9/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/duncan/.pyenv/versions/2.7.9/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/duncan/.pyenv/versions/2.7.9/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/duncan/.pyenv/versions/2.7.9/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/space/git/tmp/maildir-deduplicate/maildir_deduplicate/cli.py", line 145, in deduplicate
    dedup.add_maildir(maildir)
  File "/space/git/tmp/maildir-deduplicate/maildir_deduplicate/deduplicate.py", line 82, in add_maildir
    mail_file, message, self.use_message_id)
  File "/space/git/tmp/maildir-deduplicate/maildir_deduplicate/deduplicate.py", line 106, in compute_hash
    canonical_headers_text = cls.canonical_headers(mail_file, message)
  File "/space/git/tmp/maildir-deduplicate/maildir_deduplicate/deduplicate.py", line 128, in canonical_headers
    canonical_value = cls.canonical_header_value(header, value)
  File "/space/git/tmp/maildir-deduplicate/maildir_deduplicate/deduplicate.py", line 155, in canonical_header_value
    value = re.sub('\s+', ' ', value).strip()
  File "/home/duncan/.pyenv/versions/2.7.9/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 36: ordinal not in range(128)

@kdeldycke
Owner

@dmacvicar: OK, I reopened the issue.

We definitely need unit tests to cover this area.

@kdeldycke kdeldycke reopened this Mar 30, 2016
@ychaouche

ychaouche commented Apr 25, 2016

I put the setdefaultencoding("utf-8") call back in the code as a temporary workaround, and it worked fine for me.

@asifiqbal

asifiqbal commented May 5, 2016

I put these three lines back in deduplicate.py and the UnicodeDecodeError went away:

import sys
reload(sys)
sys.setdefaultencoding('latin-1')

If I put 'utf-8' instead of 'latin-1', I get a UnicodeDecodeError like the one below:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 72: invalid start byte

I installed it today using pip install maildir-deduplicate on Ubuntu 16.04 LTS 64-bit.
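
The difference between the two codecs can be reproduced in isolation. A minimal sketch in Python 3, where the sample bytes are made up and only the 0x92 byte comes from the error above: UTF-8 rejects a lone 0x92 because it is a continuation byte and cannot start a sequence, while Latin-1 maps every byte to a code point and so never fails, even when the text was really written in another encoding such as Windows-1252.

```python
raw = b"it\x92s"  # made-up sample containing the reported byte

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # 0x92 cannot start a UTF-8 sequence

as_latin1 = raw.decode("latin-1")  # always succeeds; 0x92 becomes U+0092
as_cp1252 = raw.decode("cp1252")   # 0x92 becomes U+2019 (right single quote)
```

So Latin-1 silences the error but does not guarantee the right characters. That is probably why it works as a workaround here, though it should be treated as a stopgap and not as proof that the mail is Latin-1.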

@kdeldycke
Owner

kdeldycke commented Aug 7, 2016

Fixed in 587cae2 and 7a206ec.

@github-actions

github-actions bot commented Oct 5, 2020

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 5, 2020
@kdeldycke kdeldycke added 🐛 bug Something isn't working, or a fix is proposed and removed bug labels Nov 23, 2022