Error occurs with downloading kkbox dataset #68

Closed
daehwanahn opened this issue Jan 29, 2021 · 10 comments

@daehwanahn

Hi, havakv

I have the same issue as #42 (an error occurs with 'download_kkbox()').

My OS is Windows, and I installed the package with pip.
I checked the path you suggested, and the 3 files (for training) were there.
But I got the same error (FileNotFoundError) during 'extracting train...'.

Do you have any idea about this?

Thank you


Hi,
What OS are you on (Windows, Mac, Linux)?
Have you installed the package using pip or by pulling this repo?
Have you set the PYCOX_DATA_DIR environment variable?

Can you check the directory <pycox_path>/datasets/data/kkbox and list the files there?
The <pycox_path> can be found by running

import pycox
pycox.__file__  # '/Users/teboozas/anaconda3/envs/some_env/lib/python3.8/site-packages/pycox/__init__.py'

and removing the __init__.py part. So I want you to list the contents of a folder such as
/Users/teboozas/anaconda3/envs/some_env/lib/python3.8/site-packages/pycox/datasets/data/kkbox (if you're on a mac).
You should be seeing some *.7z files.

Originally posted by @havakv in #42 (comment)
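
The directory check described above can be sketched in Python. This is only an illustration: the helper name `kkbox_archives` is hypothetical (it is not part of pycox), and it assumes the default data location under the package directory, i.e. that PYCOX_DATA_DIR has not been set to point somewhere else.

```python
from pathlib import Path


def kkbox_archives(package_init_file):
    """List the *.7z files in <pycox_path>/datasets/data/kkbox.

    `package_init_file` is the value of pycox.__file__ (hypothetical helper,
    assumes the default data directory inside the installed package).
    """
    pycox_path = Path(package_init_file).parent  # drop the __init__.py part
    data_dir = pycox_path / "datasets" / "data" / "kkbox"
    if not data_dir.is_dir():
        return []  # directory missing: nothing has been downloaded there
    return sorted(p.name for p in data_dir.glob("*.7z"))


# Usage (requires pycox to be installed):
#   import pycox
#   print(kkbox_archives(pycox.__file__))
```

An empty list would suggest that the download step itself did not place any archives in that folder.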

@havakv
Owner

havakv commented Jan 31, 2021

Thank you for posting the issue. I've not tested obtaining this dataset on Windows, so it's not that surprising that there might be some bugs.

It looks like the code is failing here, so if there is a file not found, then that path might not be correct. The other alternative would be that the 7z command doesn't work as expected.
To verify that the path is correct, can you try this:

from pycox.datasets import kkbox
self = kkbox
train_path = self._path_dir /  "train.csv.7z"
print(train_path.exists())  # This should print "True" if the file is found

And if this prints "True", can you then try:

import subprocess
print(subprocess.check_output(['7z', '--help']).decode('utf-8'))

which should print out the help pages for 7z to ensure that 7z works on your machine.

Finally, if both of these work, can you try this and post the error message that you get from it?

import subprocess
subprocess.check_output(['7z',  'x', str(train_path), f"-o{self._path_dir}", '-y'])

@daehwanahn
Author

Thanks for your reply!
I tested your suggestions and I got the following results.

from pycox.datasets import kkbox
self = kkbox
train_path = self._path_dir /  "train.csv.7z"
print(train_path.exists()) 

=> True

import subprocess
print(subprocess.check_output(['7z', '--help']).decode('utf-8'))

=> [WinError 2] The system cannot find the file specified

import subprocess
subprocess.check_output(['7z',  'x', str(train_path), f"-o{self._path_dir}", '-y'])

=> [WinError 2] The system cannot find the file specified

@havakv
Owner

havakv commented Jan 31, 2021

So then the issue seems to be that 7z doesn't work. Do you know how to check whether it is installed? And if it is not installed, could you try installing it?

In the meantime, I'll check whether there is a way to unzip with a Python package, so that we don't have to call a non-Python program for unzipping as we do now.
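
One way to check from Python whether 7z is reachable is to look it up on the PATH with the standard library's `shutil.which`. This is a minimal sketch: the helper name `find_7z` is hypothetical, and the alternative executable names `7za` and `7zr` are an assumption about how some 7-Zip builds are named, not something confirmed in this thread.

```python
import shutil


def find_7z(candidates=("7z", "7za", "7zr")):
    """Return the full path of the first 7-Zip executable found on PATH, or None.

    Only "7z" is used by the commands in this thread; the other names are
    assumed alternatives.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


print(find_7z() or "No 7z executable found on PATH")
```

If this reports that nothing was found, a `subprocess` call to `7z` fails because the program cannot be located, which is consistent with the `[WinError 2]` message above.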

@havakv
Owner

havakv commented Jan 31, 2021

So, can you try installing py7zr with pip install py7zr and running the following?

import py7zr
archive = py7zr.SevenZipFile(str(train_path), mode='r')
archive.extractall(path=str(self._path_dir))
print((self._path_dir / 'train.csv').exists())

If this doesn't raise an error and prints "True", we can use this package for decompression instead of the OS command.

@daehwanahn
Author

daehwanahn commented Feb 1, 2021

Hi, havakv

  1. I found that I didn't have py7zr, so I installed it.

import py7zr
archive = py7zr.SevenZipFile(str(train_path), mode='r')
archive.extractall(path=str(self._path_dir))
print((self._path_dir / 'train.csv').exists())

This command works; it prints 'True'.

  2. But I got the same error with 'subprocess' and 'datasets.kkbox.download_kkbox()'.
    You're right. It seems we need to use py7zr instead of subprocess on Windows.

@daehwanahn
Author

Hi, havakv

I extracted the data using Google Colab.
So, this is not an urgent problem.

Many thanks~!

@daehwanahn
Author

!pip install pycox
from google.colab import files
files.upload() #upload your kaggle.json
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json
from pycox import datasets
datasets.kkbox.download_kkbox()
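import numpy as np
# `files` is already imported from google.colab above
kkbox_survival = np.array(datasets.kkbox.read_df())  # convert the DataFrame to a NumPy array
np.save('kkbox_survival.npy', kkbox_survival)
files.download('kkbox_survival.npy')  # download the saved array to the local machine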

@havakv
Owner

havakv commented Feb 1, 2021

It's great that you found a way to get your data @daehwanahn, and thank you for testing py7zr on Windows for me. I'll rewrite the code to use py7zr on Windows then.
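
A fallback along these lines could look like the sketch below. It is not the change that was actually made in pycox; the function names `pick_backend` and `extract_7z` are hypothetical. It reuses the two approaches tested in this thread: the `7z x ... -o<dir> -y` command when a `7z` executable is on the PATH, and `py7zr.SevenZipFile(...).extractall(...)` otherwise.

```python
import shutil
import subprocess


def pick_backend(which=shutil.which):
    """Return "cli" if a 7z executable is on PATH, otherwise "py7zr"."""
    return "cli" if which("7z") else "py7zr"


def extract_7z(archive_path, out_dir, which=shutil.which):
    """Extract a .7z archive into out_dir with whichever backend is available."""
    if pick_backend(which) == "cli":
        # Same command as the one tried earlier in this thread.
        subprocess.check_output(
            ["7z", "x", str(archive_path), f"-o{out_dir}", "-y"]
        )
    else:
        import py7zr  # third-party package: pip install py7zr

        with py7zr.SevenZipFile(str(archive_path), mode="r") as archive:
            archive.extractall(path=str(out_dir))


# Usage, with the names from the earlier snippets:
#   extract_7z(train_path, self._path_dir)
```

Importing py7zr only inside the fallback branch keeps it an optional dependency for machines where the 7z command already works.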

@havakv
Owner

havakv commented Feb 1, 2021

Let's just keep it open until this works smoothly on Windows too.

@havakv havakv reopened this Feb 1, 2021
@havakv
Owner

havakv commented Feb 2, 2021

#69

@havakv havakv closed this as completed Feb 2, 2021