
Loading Data From S3 Path in Sagemaker #878

Open
mahesh1amour opened this issue Nov 23, 2020 · 16 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@mahesh1amour

In SageMaker I'm trying to load the dataset from an S3 path as follows:

from datasets import load_dataset

train_path = 's3://xxxxxxxxxx/xxxxxxxxxx/train.csv'
valid_path = 's3://xxxxxxxxxx/xxxxxxxxxx/validation.csv'
test_path = 's3://xxxxxxxxxx/xxxxxxxxxx/test.csv'

data_files = {}
data_files["train"] = train_path
data_files["validation"] = valid_path
data_files["test"] = test_path
extension = train_path.split(".")[-1]
datasets = load_dataset(extension, data_files=data_files, s3_enabled=True)
print(datasets)

I'm getting the following error:

algo-1-7plil_1 | File "main.py", line 21, in <module>
algo-1-7plil_1 |     datasets = load_dataset(extension, data_files=data_files)
algo-1-7plil_1 | File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 603, in load_dataset
algo-1-7plil_1 |     **config_kwargs,
algo-1-7plil_1 | File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 155, in __init__
algo-1-7plil_1 |     **config_kwargs,
algo-1-7plil_1 | File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 305, in _create_builder_config
algo-1-7plil_1 |     m.update(str(os.path.getmtime(data_file)))
algo-1-7plil_1 | File "/opt/conda/lib/python3.6/genericpath.py", line 55, in getmtime
algo-1-7plil_1 |     return os.stat(filename).st_mtime
algo-1-7plil_1 | FileNotFoundError: [Errno 2] No such file or directory: 's3://lsmv-sagemaker/pubmedbert/test.csv'

But when I try with pandas, it is able to load from S3.

Does the datasets library support loading from an S3 path?

@julien-c
Member

This would be a neat feature

@mahesh1amour
Author

> This would be a neat feature

I didn't get this clearly, can you please elaborate on how to work on this?

@thomwolf
Member

It could maybe work almost out of the box just by using cached_path in the text/csv/json scripts, no?

@mahesh1amour
Author

Thanks @thomwolf and @julien-c.

I'm still a bit confused about what you suggested.

I have solved the problem as follows (see the sketch after these steps):

  1. Read the CSV file from S3 using pandas.
  2. Convert it to a dictionary with column names as keys and column data as list values.
  3. Convert it to a Dataset using
    from datasets import Dataset
    train_dataset = Dataset.from_dict(train_dict)
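
A minimal sketch of this workaround (the bucket path and train_dict are placeholders; it assumes s3fs is installed so pandas can read s3:// paths, and that the file fits in memory):

import pandas as pd
from datasets import Dataset

# Read the CSV directly from S3 (pandas delegates s3:// paths to s3fs)
train_df = pd.read_csv('s3://my-bucket/my-prefix/train.csv')

# Build a {column name: list of values} dict and turn it into a Dataset
train_dict = train_df.to_dict(orient='list')
train_dataset = Dataset.from_dict(train_dict)
print(train_dataset)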

@thomwolf
Member

We were brainstorming around your use-case.

Let's keep the issue open for now, I think this is an interesting question to think about.

@mahesh1amour
Author

> We were brainstorming around your use-case.
>
> Let's keep the issue open for now, I think this is an interesting question to think about.

Sure @thomwolf, thanks for your concern.

@lhoestq
Member

lhoestq commented Nov 24, 2020

I agree it would be cool to have that feature. Also, it's good to know that pandas supports this.
For the moment I'd suggest first downloading the files locally, as Thom suggested, and then loading the dataset by providing paths to the local files.

@jman1973

Don't get

@lhoestq added the enhancement (New feature or request) and question (Further information is requested) labels on Nov 27, 2020
@dorlavie

Any updates on this issue?
I am facing a similar issue: I have many parquet files in S3 and I would like to train on them.
To be honest, I even face issues with just getting the last layer embeddings out of them.

@mahesh1amour
Author

Hi @dorlavie,
You can find the solution I mentioned above; it may help you.
There is also another solution, which is downloading the files locally.

@dorlavie

> Hi @dorlavie,
> You can find the solution I mentioned above; it may help you.
> There is also another solution, which is downloading the files locally.

@mahesh1amour, thanks for the fast reply.

Unfortunately, in my case I cannot read with pandas: the dataset is too big (50GB).
In addition, due to security concerns I am not allowed to save the data locally.

@philschmid
Member

@dorlavie you could use boto3 to download the data to your local machine and then load it with datasets.

boto3 example documentation

import boto3

# Download the object from S3 to a local file
s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')

datasets example documentation

from datasets import load_dataset

# Load the downloaded local CSV files as a dataset
dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])

@dorlavie

Thanks @philschmid for the suggestion.
As I mentioned in the previous comment, due to security issues I cannot save the data locally.
I need to read it from S3 and process it directly.

I guess that many other people try to train / fit those models on huge datasets (e.g. the entire Wikipedia). What is the best practice in those cases?

@lhoestq
Member

lhoestq commented Dec 21, 2020

If I understand correctly, you're not allowed to write data that you downloaded from S3 to disk, for example?
Or is it the use of the boto3 library that is not allowed in your case?

@dorlavie

@lhoestq yes, you are correct.
I am not allowed to save the "raw text" locally; the "raw text" must be saved only on S3.
I am allowed to save the output of any model locally.
It doesn't matter how I do it (boto3/pandas/pyarrow), it is forbidden.

@philschmid
Member

@dorlavie are you using SageMaker for training too? Then you could use an S3 URI, for example s3://my-bucket/my-training-data, and pass it within the .fit() function when you start the SageMaker training job. SageMaker would then download the data from S3 into the training runtime and you could load it from disk.
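
As a rough sketch of how such an estimator could be set up (the entry point, role ARN, instance type, and framework versions below are illustrative assumptions, not taken from this thread):

from sagemaker.pytorch import PyTorch

# Hypothetical estimator configuration; adjust entry point, role, and instance type to your setup
pytorch_estimator = PyTorch(
    entry_point='train.py',
    source_dir='src',
    role='arn:aws:iam::123456789012:role/MySageMakerRole',  # placeholder role ARN
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='1.6.0',
    py_version='py3',
)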

sagemaker start training job

pytorch_estimator.fit({'train':'s3://my-bucket/my-training-data','eval':'s3://my-bucket/my-evaluation-data'})

in the train.py script

import os
from datasets import load_from_disk

# SM_CHANNEL_TRAIN is the local directory where SageMaker placed the "train" channel data
train_dataset = load_from_disk(os.environ['SM_CHANNEL_TRAIN'])
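
Note that load_from_disk expects data that was previously written with save_to_disk. A minimal sketch of that preparation step, run from the notebook before calling .fit() (the file names and key prefix are placeholders, and it assumes the sagemaker Python SDK is configured):

import sagemaker
from datasets import load_dataset

# Build the dataset locally and serialize it with save_to_disk
dataset = load_dataset('csv', data_files={'train': 'train.csv'})['train']
dataset.save_to_disk('train_dataset')

# Upload the serialized dataset to the default SageMaker bucket; pass the returned URI to .fit()
session = sagemaker.Session()
training_input_path = session.upload_data('train_dataset', key_prefix='my-training-data')
print(training_input_path)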

I have created an example of how to use transformers and datasets with SageMaker.
https://github.com/philschmid/huggingface-sagemaker-example/tree/main/03_huggingface_sagemaker_trainer_with_data_from_s3

The example contains a Jupyter notebook, sagemaker-example.ipynb, and an src/ folder. The notebook is used to create the training job on AWS SageMaker. The src/ folder contains train.py, our training script, and requirements.txt for additional dependencies.
