
[datasets] Add remote filesystem support to datasets module #1244

Merged
75 commits merged into ludwig-ai:master on Aug 18, 2021

Conversation

ANarayan
Collaborator

This PR adds support for reading and writing to remote filesystems (e.g. S3 buckets) in the Datasets module.

cc: @w4nderlust , @tgaddair


@tgaddair tgaddair left a comment


Nice! Few comments about moving some of this into fs_utils.

else:
    shutil.copytree(tmpdir, self.raw_dataset_path)

#os.rename(self.raw_temp_path, self.raw_dataset_path)

Can remove this line.

os.rename(self.raw_temp_path, self.raw_dataset_path)
#os.makedirs(self.raw_temp_path, exist_ok=True)

with tempfile.TemporaryDirectory() as tmpdir:

Can we simplify to something like this?

with fs_utils.upload_output_file(self.raw_dataset_path) as tmpdir:
    ...

If that doesn't work, we should still make something like this in fs_utils as we do the same thing below.
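A helper along the lines the reviewer suggests could be sketched with fsspec roughly as below. The name `upload_output_directory`, the trailing-separator upload semantics, and the deferred fsspec import are assumptions for illustration, not the PR's actual implementation:

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def upload_output_directory(url):
    """Yield a local directory to write into; sync it to `url` on exit (sketch)."""
    if "://" not in url:
        # Local path: create it and let callers write in place.
        os.makedirs(url, exist_ok=True)
        yield url
        return
    import fsspec  # only needed for remote targets like s3://...

    protocol = url.split("://", 1)[0]
    with tempfile.TemporaryDirectory() as tmpdir:
        yield tmpdir
        fs = fsspec.filesystem(protocol)
        # Trailing separator: upload the *contents* of tmpdir to url.
        fs.put(tmpdir + os.sep, url, recursive=True)
```

Callers then write into the yielded directory exactly as they would for a local dataset path, and the remote upload (if any) happens when the block exits.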

with ZipFile(BytesIO(zipresp.read())) as zfile:
    zfile.extractall(self.raw_temp_path)
os.rename(self.raw_temp_path, self.raw_dataset_path)
#os.makedirs(self.raw_temp_path, exist_ok=True)

Can remove this.

shutil.copyfile(f, os.path.join(self.raw_temp_path, f.name))

os.rename(self.raw_temp_path, self.raw_dataset_path)
#os.makedirs(self.raw_temp_path, exist_ok=True)

Can remove this.

os.rename(self.raw_temp_path, self.raw_dataset_path)
#os.makedirs(self.raw_temp_path, exist_ok=True)

with tempfile.TemporaryDirectory() as tmpdir:

Same as above regarding using fs_utils.

filename))

os.rename(self.raw_temp_path, self.raw_dataset_path)
with tempfile.TemporaryDirectory() as tmpdir:

Same as above.

@@ -26,7 +30,13 @@ class IdentityProcessMixin:
processed_dataset_path: str

def process_downloaded_dataset(self):
os.rename(self.raw_dataset_path, self.processed_dataset_path)
protocol, _ = split_protocol(self.processed_dataset_path)

I would move this into fs_utils as a utility function, like fs_utils.rename.

protocol, _ = split_protocol(self.processed_dataset_path)
if protocol is not None:
    fs = fsspec.filesystem(protocol)
    fs.copy(self.raw_dataset_path,

Maybe we can use fsspec.mv or fsspec.rename here. Otherwise the behavior is different between local and remote versions.
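A `fs_utils.rename`-style helper that keeps local and remote behavior consistent, as suggested, could look roughly like this. The dispatch on `"://"` and the deferred fsspec import are illustrative assumptions; the real fs_utils helper may differ:

```python
import os


def rename(src, dst):
    """Move src to dst for both local paths and remote URLs (sketch)."""
    if "://" not in dst:
        # Plain local path; os.rename is atomic on the same filesystem.
        os.rename(src, dst)
        return
    import fsspec  # only needed for remote targets

    protocol = dst.split("://", 1)[0]
    fs = fsspec.filesystem(protocol)
    # fsspec's mv copies then deletes, so the end state matches the
    # local os.rename case even on object stores.
    fs.mv(src, dst, recursive=True)
```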

os.makedirs(self.processed_dataset_path, exist_ok=True)
#os.makedirs(self.processed_dataset_path, exist_ok=True)
#makedirs(self.process_downloaded_dataset, exist_ok=True)
with fsspec.open(self.processed_dataset_path, mode="wb") as f:

Is it necessary to write this file? Does it work without this?

@@ -65,7 +66,8 @@ def process_downloaded_dataset(self):
final_train = pd.merge(
train_dfs['train_transaction'], train_dfs['train_identity'], on='TransactionID', how='left')

os.makedirs(self.processed_dataset_path, exist_ok=True)
with fsspec.open(self.processed_dataset_path, mode="wb") as f:

Do we need to create this file?

@w4nderlust
Collaborator

The new changes look good to me. Are all the places where fsspec is used covered? Asking just to be sure.
Other than that, let's address the failures, but the structure of the code looks fine to me. Thanks!


@tgaddair tgaddair left a comment


Nice! Just a couple things we can refactor.

df.to_csv(os.path.join(self.processed_temp_path, self.csv_filename),
          index=False)
os.rename(self.processed_temp_path, self.processed_dataset_path)

protocol, _ = split_protocol(self.processed_dataset_path)

Can we make this fs_utils.mv(self.processed_temp_path, self.processed_dataset_path)?


We should probably also remove self.processed_temp_path at the end of the remote case, right?

download_func(self.competition_name, path=self.raw_temp_path)
#os.makedirs(self.raw_temp_path, exist_ok=True)

with tempfile.TemporaryDirectory() as tmpdir:

Can this also use fs_utils.upload_output_directory?

def makedirs(url, exist_ok=False):
    fs, path = get_fs_and_path(url)
    return fs.makedirs(path, exist_ok=exist_ok)
if fs == "s3":

Not sure why this is needed for s3, since directories don't exist. Is this so checks like exists(dirname) will pass?

This may also be true for other object stores. Maybe something more generic we could do here is:

fs.makedirs(path, exist_ok=exist_ok)
if not exists(path):
    # Directory was not created -> no directory concept in filesystem
    with fsspec.open(url, mode="wb") as f:
        pass

What do you think?
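Made concrete with fsspec, the reviewer's idea could be sketched as below. Using `fsspec.core.url_to_fs` in place of the PR's `get_fs_and_path` helper is an assumption; the marker-object fallback is the generic behavior being proposed, not code from the PR:

```python
import fsspec
from fsspec.core import url_to_fs


def makedirs(url, exist_ok=False):
    """makedirs that also works on object stores without real directories."""
    fs, path = url_to_fs(url)
    fs.makedirs(path, exist_ok=exist_ok)
    if not fs.exists(path):
        # Directory was not created -> the filesystem (e.g. s3) has no
        # directory concept; touch an empty marker object so later
        # exists() checks on the "directory" pass.
        with fsspec.open(url, mode="wb"):
            pass
```

On a local filesystem the `makedirs` call creates a real directory and the fallback never runs; on S3-like stores the empty object serves only as a placeholder.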

download_func = api.dataset_download_files
# Download all files for a competition/dataset
download_func(self.competition_name, path=self.raw_temp_path)
#os.makedirs(self.raw_temp_path, exist_ok=True)

Can remove this line.


@tgaddair tgaddair left a comment


LGTM!

@tgaddair tgaddair merged commit 30d1fdf into ludwig-ai:master Aug 18, 2021
3 participants