New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG in clipboard (linux, python2) with unicode and separator #13747

Closed
pijucha opened this Issue Jul 22, 2016 · 9 comments

Comments

Projects
None yet
4 participants
@pijucha
Contributor

pijucha commented Jul 22, 2016

This is probably a known bug but I couldn't find a github issue.

There is a disabled test test_clipboard.py which fails with the following error

======================================================================
FAIL: test_round_trip_frame_sep (pandas.io.tests.test_clipboard.TestClipboard)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/tests/test_clipboard.py", line 73, in test_round_trip_frame_sep
    self.check_round_trip_frame(dt, sep=',')
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/tests/test_clipboard.py", line 69, in check_round_trip_frame
    tm.assert_frame_equal(data, result, check_dtype=False)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/util/testing.py", line 1276, in assert_frame_equal
    right.columns))
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/util/testing.py", line 1022, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: DataFrame are different

DataFrame shape (number of columns) are different
[left]:  2, Index([u'en', u'es'], dtype='object')
[right]: 0, Index([], dtype='object')

Code Sample, a copy-pastable example if possible

More explicitly (the example from the above test):

nonascii = pd.DataFrame({'en': 'in English'.split(), 'es': 'en español'.split()})

nonascii.to_clipboard(sep=',')

read_clipboard(sep=',', index_col=0)
Out[154]: 
Empty DataFrame
Columns: []
Index: [0       in       en, 1  English  español]

read_clipboard()
Out[155]: 
        en       es
0       in       en
1  English  español

Expected Output

read_clipboard(sep=',', index_col=0)
Out[134]: 
        en       es
0       in       en
1  English  español

read_clipboard()
Out[135]: 
              ,en,es
0            0,in,en
1  1,English,español

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)_i5-2520M_CPU_@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.18.1+240.gbb6b5e5
nose: 1.3.7
pip: 8.1.2
setuptools: 21.2.0
Cython: 0.24.1
numpy: 1.11.0

There are probably 2 issues in the code.

  1. `.encode('utf-8') is called on a py2 string, which raises if there is a non-ascii character in the string, and then
  2. to_clipboard falls back to to_string method.
    (In this case, fixing 1 solves the problem. But in general, if something else raises and we fall back here, a separator is ignored.)

I don't know what to do about 2, but 1 seems to be easy.
Part of the code in util.clipboard.py calls subprocess.Popen.communicate(), which operates on byte types (bytes in PY3 and strings in PY2). So, encode/decode are needed only in PY3.

I believe this 6d4fdb0 fixes the problem. But for now I tested only one pair of functions (in KDE) and couldn't possibly test it on OS X.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jul 22, 2016

Contributor

this is the same as #12580 / #12529 ; title is not very clear though

Contributor

jreback commented Jul 22, 2016

this is the same as #12580 / #12529 ; title is not very clear though

@jreback jreback closed this Jul 22, 2016

@jreback jreback added this to the No action milestone Jul 22, 2016

@pijucha

This comment has been minimized.

Show comment
Hide comment
@pijucha

pijucha Jul 22, 2016

Contributor

OK. Actually, I saw that but thought it was purely windows related. The bug here is an incorrect use of subprocess in py2.

Contributor

pijucha commented Jul 22, 2016

OK. Actually, I saw that but thought it was purely windows related. The bug here is an incorrect use of subprocess in py2.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jul 22, 2016

Contributor

I get the same on macosx / py2. so your report is prob better here. actually we cannot repro this on the builds anyhow which would be ideal.

ok will re-open and make this an xref issue of that one.

Contributor

jreback commented Jul 22, 2016

I get the same on macosx / py2. so your report is prob better here. actually we cannot repro this on the builds anyhow which would be ideal.

ok will re-open and make this an xref issue of that one.

@jreback jreback reopened this Jul 22, 2016

@jreback jreback modified the milestones: Next Major Release, No action Jul 22, 2016

@jreback jreback added the Bug label Jul 22, 2016

@pijucha

This comment has been minimized.

Show comment
Hide comment
@pijucha

pijucha Jul 24, 2016

Contributor

I see now that pandas.util.clipboard is probably entirely taken from pyperclip and it actually expects unicode input. So maybe it's better not tinker inside and just pass unicode to it. I tested this pijucha@e53dcb0 and it seems to work for both windows and linux.

Apparently, it also solves #12529. So, indeed, these issues are more closely related than I thought.

Contributor

pijucha commented Jul 24, 2016

I see now that pandas.util.clipboard is probably entirely taken from pyperclip and it actually expects unicode input. So maybe it's better not tinker inside and just pass unicode to it. I tested this pijucha@e53dcb0 and it seems to work for both windows and linux.

Apparently, it also solves #12529. So, indeed, these issues are more closely related than I thought.

@jlou2u

This comment has been minimized.

Show comment
Hide comment
@jlou2u

jlou2u Sep 15, 2016

+1 to fix this when possible. It's the only test that fails for me (OSX) in the codebase. More of an annoyance but still...

I'd also suggest not changing/setting default values for kwargs when to_clipboard is called, seems confusing at best and I think functionality is unchanged (proposed fix in jlou2u@01277af that includes pijucha's change)

I haven't done anything with travis before but looking at .travis.yml it seems that xsel is only added onto python 3 builds, but not python 2 builds. I think pandas.util.clipboard will raise import error if it can't find a clipboard utility and test_clipboard.py will raise nose.SkipTest("no clipboard found") if it can't find a clipboard. That's my best guess at why this can't be reproduced in the builds.

jlou2u commented Sep 15, 2016

+1 to fix this when possible. It's the only test that fails for me (OSX) in the codebase. More of an annoyance but still...

I'd also suggest not changing/setting default values for kwargs when to_clipboard is called, seems confusing at best and I think functionality is unchanged (proposed fix in jlou2u@01277af that includes pijucha's change)

I haven't done anything with travis before but looking at .travis.yml it seems that xsel is only added onto python 3 builds, but not python 2 builds. I think pandas.util.clipboard will raise import error if it can't find a clipboard utility and test_clipboard.py will raise nose.SkipTest("no clipboard found") if it can't find a clipboard. That's my best guess at why this can't be reproduced in the builds.

@aileronajay

This comment has been minimized.

Show comment
Hide comment
@aileronajay

aileronajay Nov 5, 2016

Contributor

This test is failing for me on OSX, (with the latest code)

Contributor

aileronajay commented Nov 5, 2016

This test is failing for me on OSX, (with the latest code)

@aileronajay

This comment has been minimized.

Show comment
Hide comment
@aileronajay

aileronajay Nov 6, 2016

Contributor

i was able to change _copyOSX function in pandas.util.clipboard.py

def _copyOSX(text):
p = Popen(['pbcopy', 'w'], stdin=PIPE, close_fds=True)
try:
p.communicate(input=text.encode('utf-8'))
except UnicodeDecodeError:
p.communicate(input=text)

to make the test pass. The test fails because to_clipboard fails for the data frame and it falls back to the string to clipboard. The to_clipboard fails because, we are trying to encode from ascii to utf8 when we call encode, but the str is already in UTF-8 when we have non ascii characters in the dataframe, hence when it tries to read the non ascii character using ascii, we get a UnicodeDecodeError. By capturing the UnicodeDecodeError exception and passing the string as it (as it is unicode encoded) we can make it work

Contributor

aileronajay commented Nov 6, 2016

i was able to change _copyOSX function in pandas.util.clipboard.py

def _copyOSX(text):
p = Popen(['pbcopy', 'w'], stdin=PIPE, close_fds=True)
try:
p.communicate(input=text.encode('utf-8'))
except UnicodeDecodeError:
p.communicate(input=text)

to make the test pass. The test fails because to_clipboard fails for the data frame and it falls back to the string to clipboard. The to_clipboard fails because, we are trying to encode from ascii to utf8 when we call encode, but the str is already in UTF-8 when we have non ascii characters in the dataframe, hence when it tries to read the non ascii character using ascii, we get a UnicodeDecodeError. By capturing the UnicodeDecodeError exception and passing the string as it (as it is unicode encoded) we can make it work

@pijucha

This comment has been minimized.

Show comment
Hide comment
@pijucha

pijucha Nov 7, 2016

Contributor

@aileronajay The thing is the file pandas.util.clipboard.py contains external code (unmodified Pyperclip v1.5.15 - if nothing has changed since July) and it expects unicode on input. I believe it's better to provide it with such input instead of tinkering inside (I tried both). The fix linked in my comment above should do the trick but I can't test it on OSX.

If I have some time this week I can submit a PR. Unless someone else can do it faster and better.

Contributor

pijucha commented Nov 7, 2016

@aileronajay The thing is the file pandas.util.clipboard.py contains external code (unmodified Pyperclip v1.5.15 - if nothing has changed since July) and it expects unicode on input. I believe it's better to provide it with such input instead of tinkering inside (I tried both). The fix linked in my comment above should do the trick but I can't test it on OSX.

If I have some time this week I can submit a PR. Unless someone else can do it faster and better.

@aileronajay

This comment has been minimized.

Show comment
Hide comment
@aileronajay

aileronajay Nov 7, 2016

Contributor

created a pull request (containing the same change as your commit pijucha/pandas@e53dcb0 ), fb922d6

Contributor

aileronajay commented Nov 7, 2016

created a pull request (containing the same change as your commit pijucha/pandas@e53dcb0 ), fb922d6

@jreback jreback closed this in 4a1a330 Nov 18, 2016

@jreback jreback modified the milestones: 0.19.2, Next Major Release Nov 18, 2016

amolkahat added a commit to amolkahat/pandas that referenced this issue Nov 26, 2016

BUG in clipboard (linux, python2) with unicode and separator (GH13747)
vendered updated version of Pyperclip

closes #13747
closes #14362
closes #12807
closes #12529

Author: Ajay Saxena <ajasaxen@Ajays-MacBook-Pro.local>
Author: Ajay Saxena <aileronajay@gmail.com>

Closes #14599 from aileronajay/master and squashes the following commits:

2aafb66 [Ajay Saxena] moved comment inside test and added github issue labels to test
b74fbc1 [Ajay Saxena] ignore lint test for pyperclip files
9db42d8 [Ajay Saxena] whatsnew conflict
1dca292 [Ajay Saxena] conflict resolution
98b61e8 [Ajay Saxena] merge conflict
cedb690 [Ajay Saxena] merge conflict in whats new file
7af95da [Ajay Saxena] merging lastest changes
ac8ae60 [Ajay Saxena] skip clipboard test if clipboard primitives are absent
b03ed56 [Ajay Saxena] changed whatsnew file
c0aafd7 [Ajay Saxena] Merge branch 'test_branch'
9946fb7 [Ajay Saxena] Merge branch 'master' of https://github.com/pandas-dev/pandas into test_branch
ed1375f [Ajay Saxena] Merge branch 'test_branch'
0665fd4 [Ajay Saxena] fixed linting and test case as per code review
d202fd0 [Ajay Saxena] added test for valid encoding, modified setup.py so that pandas/util/clipboard can be found
dd57ae3 [Ajay Saxena] code review changes and read clipboard invalid encoding test
71d58d0 [Ajay Saxena] testing encoding in kwargs to to_clipboard and test case for the same
02f87b0 [Ajay Saxena] removed duplicate files
825bbe2 [Ajay Saxena] all files related to pyperclip are under pandas.util.clipboard
c5a87d8 [Ajay Saxena] Merge branch 'test_branch' of https://github.com/aileronajay/pandas into test_branch
f708c2e [Ajay Saxena] Merge branch 'master' of https://github.com/aileronajay/pandas
d565b1f [Ajay Saxena] updated pyperclip to the latest version
14d94a0 [Ajay Saxena] changed the pandas util clipboard file to return unicode if the python version is 2, else str
66d8ebf [Ajay Saxena] removed the disabled tag for clipboard test so that we can check if they pass after this change
edb8553 [Ajay Saxena] refactored the new unicode test to be in sync with the rest of the file
c83d000 [Ajay Saxena] added test case for unicode round trip
fb922d6 [Ajay Saxena] changes for GH 13747

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Dec 14, 2016

BUG in clipboard (linux, python2) with unicode and separator (GH13747)
vendered updated version of Pyperclip

closes #13747
closes #14362
closes #12807
closes #12529

Author: Ajay Saxena <ajasaxen@Ajays-MacBook-Pro.local>
Author: Ajay Saxena <aileronajay@gmail.com>

Closes #14599 from aileronajay/master and squashes the following commits:

2aafb66 [Ajay Saxena] moved comment inside test and added github issue labels to test
b74fbc1 [Ajay Saxena] ignore lint test for pyperclip files
9db42d8 [Ajay Saxena] whatsnew conflict
1dca292 [Ajay Saxena] conflict resolution
98b61e8 [Ajay Saxena] merge conflict
cedb690 [Ajay Saxena] merge conflict in whats new file
7af95da [Ajay Saxena] merging lastest changes
ac8ae60 [Ajay Saxena] skip clipboard test if clipboard primitives are absent
b03ed56 [Ajay Saxena] changed whatsnew file
c0aafd7 [Ajay Saxena] Merge branch 'test_branch'
9946fb7 [Ajay Saxena] Merge branch 'master' of https://github.com/pandas-dev/pandas into test_branch
ed1375f [Ajay Saxena] Merge branch 'test_branch'
0665fd4 [Ajay Saxena] fixed linting and test case as per code review
d202fd0 [Ajay Saxena] added test for valid encoding, modified setup.py so that pandas/util/clipboard can be found
dd57ae3 [Ajay Saxena] code review changes and read clipboard invalid encoding test
71d58d0 [Ajay Saxena] testing encoding in kwargs to to_clipboard and test case for the same
02f87b0 [Ajay Saxena] removed duplicate files
825bbe2 [Ajay Saxena] all files related to pyperclip are under pandas.util.clipboard
c5a87d8 [Ajay Saxena] Merge branch 'test_branch' of https://github.com/aileronajay/pandas into test_branch
f708c2e [Ajay Saxena] Merge branch 'master' of https://github.com/aileronajay/pandas
d565b1f [Ajay Saxena] updated pyperclip to the latest version
14d94a0 [Ajay Saxena] changed the pandas util clipboard file to return unicode if the python version is 2, else str
66d8ebf [Ajay Saxena] removed the disabled tag for clipboard test so that we can check if they pass after this change
edb8553 [Ajay Saxena] refactored the new unicode test to be in sync with the rest of the file
c83d000 [Ajay Saxena] added test case for unicode round trip
fb922d6 [Ajay Saxena] changes for GH 13747

(cherry picked from commit 4a1a330)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment