Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: TestMetrics.test_formatted_output may fail on Windows #1929

Closed
Suor opened this issue Apr 25, 2019 · 7 comments

Comments

3 participants
@Suor
Copy link
Collaborator

commented Apr 25, 2019

It runs successfully on appveyor, but fails on Windows 10 free image with:

E AssertionError: assert '\tmetrics.csv:\n\t\tvalor_mse desviación_mse data_set \n\t\t0.421601 0.173461 entrenamiento \n\t\t0.67528 0.289545 pruebas \n\t\t0.671502 0.297848 validación' in 'unable to read metric in 'metrics.csv' in branch ''\n\tmetrics.tsv:\n\t\tvalue_mse deviation_mse data_set ...\t\t "0.671502"\n\t\t ]\n\t\t}\n\tmetrics.txt:\n\t\tROC_AUC: 0.64\n\t\tKS: 78.9999999996\n\t\tF_SCORE: 77'

Note that "unable to read metric in 'metrics.csv' in branch ...", the full error is:

 unable to read metric in 'metrics.csv' in branch ''
Traceback (most recent call last):
  File "E:\dvc\repo\metrics\show.py", line 138, in _read_metric
    return _format_output(fd.read().strip(), typ)
  File "c:\users\user\envs\dvc\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 18: invalid continuation byte

So it's something about encodongs. Most probably some open() somewhere doesn't specify encoding and that falls back to system default, say cp1252, which somewhat works for text in the test, but fails later then it's read as unicode.

@Suor

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 25, 2019

NOTE: to prevent future errors of this type we may setlocale() globally to use UTF-8.

@efiop efiop added this to To do in Weekly tasks via automation Apr 25, 2019

@mroutis mroutis moved this from To do to In progress in Weekly tasks Apr 25, 2019

@mroutis

This comment has been minimized.

Copy link
Collaborator

commented Apr 26, 2019

@Suor , do you have reproduce steps? I'm having some trouble trying to come up with a solution, maybe you or @efiop can shade some light into the issue.

❯ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
# encoding: utf-8

from __future__ import unicode_literals
import locale

locale.setlocale(locale.LC_ALL, "en_US.UTF-8")

with open('á', 'w') as fobj:
    fobj.write('hello')
# ❯ python scratchpad.py
Traceback (most recent call last):
  File "scratchpad.py", line 8, in <module>
    with open('á', 'w') as fobj:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

However, when I prepend it with LANG=en_US.UTF-8 it works as expected.
I'm not sure this can be handle correctly on the application layer, and also, how is this supposed to guarantee consistency on a multi-thread environment.

Maybe I'm overlooking something 😞

@mroutis

This comment has been minimized.

Copy link
Collaborator

commented Apr 26, 2019

Also, I don't know how the locale should look like in Windows, I have the following code but haven't test it on Windows:

if os.name == "nt":
    locale.setlocale(locale.LC_ALL, "English_United States.1252'")
else:
    locale.setlocale(locale.LC_ALL, "en_US.UTF-8")

The idea was taken from matplotlib testing module

@Suor

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 26, 2019

@mroutis "English_United States.1252'" is not what we are looking for, this is not utf-8. And we are not trying to fix unicode filenames here, but unicode file contents.

BTW, I can't break this on Linux with any combinations of LANG and setlocale().

@mroutis

This comment has been minimized.

Copy link
Collaborator

commented Apr 26, 2019

@Suor , thanks for taking a look at this! I'll try to come up with a way to reproduce it on a Windows machine.

@Suor

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 26, 2019

If using setlocale() wouldn't prove to be reliable or will complicate things more that help. We can solve this with a wrapper + organizational measures:

  • provide our wrapper for open(),
  • not placed in compat.py, as it not something to go away,
  • using utf-8 by default,
  • working as context manager,
  • ensure we always use it to open files.
@Suor

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 26, 2019

P.S. We will also need same wrapper for pathlib.Path.write_text() and pathlib.Path.read_text().

@mroutis mroutis assigned mroutis and unassigned mroutis Apr 26, 2019

mroutis added a commit to mroutis/dvc that referenced this issue May 1, 2019

@mroutis mroutis referenced this issue May 1, 2019

Merged

tests: use UTF-8 when creating metrics files #1956

2 of 2 tasks complete

@mroutis mroutis moved this from In progress to Needs review in Weekly tasks May 2, 2019

@mroutis mroutis closed this in #1956 May 2, 2019

Weekly tasks automation moved this from Needs review to Done May 2, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.