
load_train_test(): UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte #20

Closed
pbenner opened this issue Apr 30, 2023 · 1 comment · Fixed by #21
Labels
bug Something isn't working

Comments

@pbenner
Collaborator

pbenner commented Apr 30, 2023

Running the following script fails:

>>> from matbench_discovery.data import load_train_test
>>> load_train_test('mp_computed_structure_entries')
Downloading 'mp_computed_structure_entries' from https://figshare.com/ndownloader/files/40344436
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pbenner/Source/tmp/matbench-discovery/matbench_discovery/data.py", line 95, in load_train_test
    df = reader(url)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 733, in read_json
    json_reader = JsonReader(
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 819, in __init__
    self.data = self._preprocess_data(data)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 831, in _preprocess_data
    data = data.read()
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

The only files for which the file download works are 'wbm_summary' and 'mp_energies'.

@janosh janosh added the bug Something isn't working label Apr 30, 2023
@janosh
Owner

janosh commented Apr 30, 2023

Ah, that error occurs because pandas cannot infer that the file is compressed JSON: we only pass it a Figshare URL (https://figshare.com/ndownloader/files/40344436), which has no file extension to infer the compression from. Should be easy to fix.
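
For readers hitting the same traceback, here is a minimal standard-library sketch of why the error mentions byte 0x8b at position 1: every gzip stream starts with the two bytes 0x1f 0x8b, so decoding a compressed payload as UTF-8 fails on the second byte. The helper name load_json_bytes and the sample records are hypothetical and only for illustration; this is not the code from the actual fix.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def load_json_bytes(raw: bytes):
    """Parse JSON from raw bytes, decompressing first if they look like gzip.

    Hypothetical helper for illustration, not part of matbench_discovery.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


# Simulate a compressed download whose URL carries no file extension.
# The records below are made-up sample data.
records = {"material_id": ["mp-1", "mp-2"], "energy": [-1.5, -2.25]}
payload = gzip.compress(json.dumps(records).encode("utf-8"))

# Decoding the compressed bytes directly fails the same way as in the report.
error_message = ""
try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Checking the magic bytes first lets the same payload load cleanly.
data = load_json_bytes(payload)
```

When reading with pandas directly, passing the compression explicitly, e.g. pd.read_json(url, compression="gzip"), should avoid relying on extension-based inference, assuming the file is in fact gzip-compressed.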

@janosh janosh closed this as completed in 5d7c620 Apr 30, 2023
janosh added a commit that referenced this issue Jun 20, 2023
* fix load_train_test() for compressed figshare data (closes #20)

* load_train_test() only accept answer 'y' or 'n' (as orig intended) (close #17)

* add test covering load_train_test() with compressed JSON file from URL

* mv run-scripts.yml test-scripts.yml

* add slow-tests.yml for running slow tests only on PR merges (to save CI budget)