Loading Data From S3 Path in Sagemaker #878
This would be a neat feature
I didn't get this clearly. Can you please elaborate on how to work on these?
It could maybe work almost out of the box just by using
Thanks @thomwolf and @julien-c. I'm still confused about what you said, but I have solved the problem as follows:
We were brainstorming around your use-case. Let's keep the issue open for now; I think this is an interesting question to think about.
Sure @thomwolf, thanks for your concern.
I agree it would be cool to have that feature. Also, it's good to know that pandas supports this.
Don't get
Any updates on this issue?
Hi @dorlavie,
@mahesh1amour, thanks for the fast reply. Unfortunately, in my case I cannot read it with pandas: the dataset is too big (50GB).
@dorlavie you could use boto3 to download the files first. Example from the boto3 documentation:

```python
import boto3

s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')
```

Example from the datasets documentation:

```python
from datasets import load_dataset

dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])
```
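Putting those two suggestions together: a minimal sketch of a helper that splits an `s3://` URI into the bucket and key that `s3.download_file` expects. The helper name and the example bucket/key are made up for illustration; the boto3/datasets calls are shown as comments since they need credentials and network access.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key) for boto3."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket/key names:
bucket, key = split_s3_uri("s3://my-bucket/data/train.csv")
print(bucket, key)  # my-bucket data/train.csv

# Then, roughly:
#   import boto3
#   s3 = boto3.client("s3")
#   s3.download_file(bucket, key, "train.csv")
#   from datasets import load_dataset
#   dataset = load_dataset("csv", data_files=["train.csv"])
```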
Thanks @philschmid for the suggestion. I guess many other people try to train/fit those models on huge datasets (e.g. the entire Wikipedia); what is the best practice in those cases?
If I understand correctly, you're not allowed to write data that you downloaded from S3 to disk, for example?
@lhoestq yes you are correct. |
@dorlavie are you using SageMaker for training too? Then you could use an S3 URI. For example, to start the SageMaker training job:

```python
pytorch_estimator.fit({'train': 's3://my-bucket/my-training-data',
                       'eval': 's3://my-bucket/my-evaluation-data'})
```

and in the `train.py` script:

```python
import os

from datasets import load_from_disk

train_dataset = load_from_disk(os.environ['SM_CHANNEL_TRAIN'])
```

I have created an example of how to use transformers and datasets with SageMaker. The example contains a jupyter notebook.
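For context, SageMaker downloads each channel's S3 data into a local directory inside the training container and exposes that directory through environment variables such as `SM_CHANNEL_TRAIN`, which is why the training script reads local files rather than S3 paths. A stdlib-only sketch simulating that mechanism (the directory and file contents here are invented for illustration):

```python
import os
import tempfile

# Simulate what SageMaker does: copy the channel data to a local
# directory and point SM_CHANNEL_TRAIN at it.
channel_dir = tempfile.mkdtemp()
with open(os.path.join(channel_dir, "train.csv"), "w") as f:
    f.write("text,label\nhello,0\n")
os.environ["SM_CHANNEL_TRAIN"] = channel_dir

# Inside train.py, the data is then read from the local path, not S3:
train_dir = os.environ["SM_CHANNEL_TRAIN"]
files = os.listdir(train_dir)
print(files)  # ['train.csv']
```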
In SageMaker I'm trying to load the dataset from an S3 path as follows:

```python
train_path = 's3://xxxxxxxxxx/xxxxxxxxxx/train.csv'
valid_path = 's3://xxxxxxxxxx/xxxxxxxxxx/validation.csv'
test_path = 's3://xxxxxxxxxx/xxxxxxxxxx/test.csv'
```

I'm getting this error:

```
algo-1-7plil_1 |   File "main.py", line 21, in <module>
algo-1-7plil_1 |     datasets = load_dataset(extension, data_files=data_files)
algo-1-7plil_1 |   File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 603, in load_dataset
algo-1-7plil_1 |     **config_kwargs,
algo-1-7plil_1 |   File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 155, in __init__
algo-1-7plil_1 |     **config_kwargs,
algo-1-7plil_1 |   File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 305, in _create_builder_config
algo-1-7plil_1 |     m.update(str(os.path.getmtime(data_file)))
algo-1-7plil_1 |   File "/opt/conda/lib/python3.6/genericpath.py", line 55, in getmtime
algo-1-7plil_1 |     return os.stat(filename).st_mtime
algo-1-7plil_1 | FileNotFoundError: [Errno 2] No such file or directory: 's3://lsmv-sagemaker/pubmedbert/test.csv'
```
But when I try with pandas, it is able to load from S3. Does the datasets library support loading from an S3 path?
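Since pandas can read the S3 path directly (via its optional `s3fs` dependency), one workaround is to load the CSV with pandas and convert it to a dataset with `Dataset.from_pandas`. A minimal sketch, using an in-memory CSV as a stand-in for the S3 object; the `datasets` conversion is shown as a comment so the snippet runs without that library installed:

```python
import io

import pandas as pd

# Stand-in for pd.read_csv('s3://bucket/train.csv'), which requires s3fs.
csv_data = io.StringIO("text,label\nhello,0\nworld,1\n")
df = pd.read_csv(csv_data)
print(df.shape)  # (2, 2)

# Convert the DataFrame into a datasets.Dataset:
#   from datasets import Dataset
#   dataset = Dataset.from_pandas(df)
```

Note that this reads the whole file into memory, so it does not help with very large datasets like the 50GB case mentioned above.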