
Loading Data From S3 Path in Sagemaker #878

Open
mahesh1amour opened this issue Nov 23, 2020 · 16 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@mahesh1amour

In SageMaker I'm trying to load the dataset from an S3 path as follows:

from datasets import load_dataset

train_path = 's3://xxxxxxxxxx/xxxxxxxxxx/train.csv'
valid_path = 's3://xxxxxxxxxx/xxxxxxxxxx/validation.csv'
test_path = 's3://xxxxxxxxxx/xxxxxxxxxx/test.csv'

data_files = {}
data_files["train"] = train_path
data_files["validation"] = valid_path
data_files["test"] = test_path
extension = train_path.split(".")[-1]
datasets = load_dataset(extension, data_files=data_files, s3_enabled=True)
print(datasets)

I'm getting the following error:

algo-1-7plil_1 | File "main.py", line 21, in <module>
algo-1-7plil_1 |     datasets = load_dataset(extension, data_files=data_files)
algo-1-7plil_1 | File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 603, in load_dataset
algo-1-7plil_1 |     **config_kwargs,
algo-1-7plil_1 | File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 155, in __init__
algo-1-7plil_1 |     **config_kwargs,
algo-1-7plil_1 | File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 305, in _create_builder_config
algo-1-7plil_1 |     m.update(str(os.path.getmtime(data_file)))
algo-1-7plil_1 | File "/opt/conda/lib/python3.6/genericpath.py", line 55, in getmtime
algo-1-7plil_1 |     return os.stat(filename).st_mtime
algo-1-7plil_1 | FileNotFoundError: [Errno 2] No such file or directory: 's3://lsmv-sagemaker/pubmedbert/test.csv'

But when I try with pandas, it is able to load from S3.

Does the datasets library support loading from an S3 path?

@julien-c
Member

This would be a neat feature

@mahesh1amour
Author

> This would be a neat feature

I didn't get this clearly, can you please elaborate on how to work on this?

@thomwolf
Member

It could maybe work almost out of the box just by using cached_path in the text/csv/json scripts, no?

@mahesh1amour
Author

Thanks @thomwolf and @julien-c.

I'm still a bit confused about what you suggested.

I have solved the problem as follows (see the sketch after these steps):

  1. Read the CSV file from S3 using pandas.
  2. Convert it to a dictionary with column names as keys and column data as list values.
  3. Convert it to a Dataset using
    from datasets import Dataset
    train_dataset = Dataset.from_dict(train_dict)
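
A minimal sketch of this workaround (the bucket path and train_dict are placeholders; it assumes s3fs is installed so pandas can read s3:// paths, and that the file fits in memory):

import pandas as pd
from datasets import Dataset

# Read the CSV directly from S3 (pandas delegates s3:// paths to s3fs)
train_df = pd.read_csv('s3://my-bucket/my-prefix/train.csv')

# Build a {column name: list of values} dict and turn it into a Dataset
train_dict = train_df.to_dict(orient='list')
train_dataset = Dataset.from_dict(train_dict)
print(train_dataset)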

@thomwolf
Member

We were brainstorming around your use-case.

Let's keep the issue open for now, I think this is an interesting question to think about.

@mahesh1amour
Author

> We were brainstorming around your use-case.
>
> Let's keep the issue open for now, I think this is an interesting question to think about.

Sure @thomwolf, thanks for your concern.

@lhoestq
Member

lhoestq commented Nov 24, 2020

I agree it would be cool to have that feature. Also, it's good to know that pandas supports this.
For the moment I'd suggest first downloading the files locally, as Thom suggested, and then loading the dataset by providing paths to the local files.

@jman1973

Don't get

@lhoestq added the enhancement (New feature or request) and question (Further information is requested) labels on Nov 27, 2020
@dorlavie

Any updates on this issue?
I am facing a similar issue: I have many parquet files in S3 and I would like to train on them.
To be honest, I even face issues with just getting the last layer embeddings out of them.

@mahesh1amour
Author

Hi @dorlavie,
You can find the solution I mentioned above; it may help you.
There is also another solution, which is downloading the files locally.

@dorlavie

> Hi @dorlavie,
> You can find the solution I mentioned above; it may help you.
> There is also another solution, which is downloading the files locally.

@mahesh1amour, thanks for the fast reply.

Unfortunately, in my case I cannot read with pandas: the dataset is too big (50GB).
In addition, due to security concerns I am not allowed to save the data locally.

@philschmid
Member

@dorlavie you could use boto3 to download the data to your local machine and then load it with datasets.

boto3 example documentation

import boto3

# Download the object from S3 to a local file
s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')

datasets example documentation

from datasets import load_dataset

# Load the downloaded local CSV files as a dataset
dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])

@dorlavie

Thanks @philschmid for the suggestion.
As I mentioned in the previous comment, due to security issues I cannot save the data locally.
I need to read it from S3 and process it directly.

I guess that many other people try to train / fit those models on huge datasets (e.g. the entire Wikipedia). What is the best practice in those cases?

@lhoestq
Member

lhoestq commented Dec 21, 2020

If I understand correctly, you're not allowed to write data that you downloaded from S3 to disk, for example?
Or is it the use of the boto3 library that is not allowed in your case?

@dorlavie

@lhoestq yes, you are correct.
I am not allowed to save the "raw text" locally; the "raw text" must be saved only on S3.
I am allowed to save the output of any model locally.
It doesn't matter how I do it (boto3/pandas/pyarrow), it is forbidden.

@philschmid
Member

@dorlavie are you using SageMaker for training too? Then you could use an S3 URI, for example s3://my-bucket/my-training-data, and pass it within the .fit() function when you start the SageMaker training job. SageMaker would then download the data from S3 into the training runtime and you could load it from disk.
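
As a rough sketch of how such an estimator could be set up (the entry point, role ARN, instance type, and framework versions below are illustrative assumptions, not taken from this thread):

from sagemaker.pytorch import PyTorch

# Hypothetical estimator configuration; adjust entry point, role, and instance type to your setup
pytorch_estimator = PyTorch(
    entry_point='train.py',
    source_dir='src',
    role='arn:aws:iam::123456789012:role/MySageMakerRole',  # placeholder role ARN
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='1.6.0',
    py_version='py3',
)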

sagemaker start training job

pytorch_estimator.fit({'train':'s3://my-bucket/my-training-data','eval':'s3://my-bucket/my-evaluation-data'})

in the train.py script

import os
from datasets import load_from_disk

# SM_CHANNEL_TRAIN is the local directory where SageMaker placed the "train" channel data
train_dataset = load_from_disk(os.environ['SM_CHANNEL_TRAIN'])
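
Note that load_from_disk expects data that was previously written with save_to_disk. A minimal sketch of that preparation step, run from the notebook before calling .fit() (the file names and key prefix are placeholders, and it assumes the sagemaker Python SDK is configured):

import sagemaker
from datasets import load_dataset

# Build the dataset locally and serialize it with save_to_disk
dataset = load_dataset('csv', data_files={'train': 'train.csv'})['train']
dataset.save_to_disk('train_dataset')

# Upload the serialized dataset to the default SageMaker bucket; pass the returned URI to .fit()
session = sagemaker.Session()
training_input_path = session.upload_data('train_dataset', key_prefix='my-training-data')
print(training_input_path)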

I have created an example of how to use transformers and datasets with SageMaker.
https://github.com/philschmid/huggingface-sagemaker-example/tree/main/03_huggingface_sagemaker_trainer_with_data_from_s3

The example contains a Jupyter notebook, sagemaker-example.ipynb, and an src/ folder. The notebook is used to create the training job on AWS SageMaker. The src/ folder contains train.py, our training script, and requirements.txt for additional dependencies.
