
[Bug]: pydantic causing simple parse script to fail on build #141

Open
riley-brady opened this issue Nov 3, 2022 · 3 comments
Labels
bug Something isn't working

Comments


riley-brady commented Nov 3, 2022

What happened?

When attempting to run Build.build(custom_parser) I get a pydantic error that kills the catalog build, despite following the tutorial and successfully running the parser on test files.

What did you expect to happen?

A catalog to build successfully

Minimal Complete Verifiable Example

I am trying to build a simple custom parser. I am following [this guide](https://ecgtools.readthedocs.io/en/latest/how-to/use-a-custom-parser.html).

0. Make some fake data.

```python
import pathlib
import xarray as xr

pathlib.Path("./data").mkdir(exist_ok=True)

for var in ['pr', 'tas', 'tasmax']:
    ds = xr.DataArray([1, 2, 3], dims='time').rename(var).to_dataset()
    ds.to_netcdf(f"./data/{var}_daily_EC-Earth3_historical.nc")
```


1. Put together list of paths.
```python
root_path = pathlib.Path("./data/")
files = list(root_path.glob("*.nc"))
# Convert to string paths since the builder only takes string paths.
files = [str(f) for f in files]
```
2. I built a simple custom parser to test this out, following the guide.

```python
import pathlib
import traceback

from ecgtools.builder import INVALID_ASSET, TRACEBACK

def parse_dummy(file):
    fpath = pathlib.Path(file)
    info = {}

    try:
        # Just extracting metadata from the filename in this case. Same errors
        # occur when including loading into a dataset.
        variable, temporal_resolution, model, scenario = fpath.stem.split('_')
        info = {
            'variable': variable,
            'temporal': temporal_resolution,
            'source': model,
            'path': str(file),
        }

        return info

    except Exception:
        return {INVALID_ASSET: file, TRACEBACK: traceback.format_exc()}
```
3. I tested this on a simple file and it was successful.

```python
parse_dummy(files[0])
```
```
{'variable': 'pr',
 'temporal': 'daily',
 'source': 'EC-Earth3',
 'path': './data/pr_daily_EC-Earth3_historical.nc'}
```
4. Now I make a builder object. The object successfully returns the expected list of files.

```python
from ecgtools import Builder

# Tried the pathlib objects here following the demo but got an error that pydantic
# wanted strings only.
cat_builder = Builder(files)
```
```
Builder(paths=['data/pr_daily_EC-Earth3_historical.nc', 'data/tasmax_daily_EC-Earth3_historical.nc', 'data/tas_daily_EC-Earth3_historical.nc'], storage_options={}, depth=0, exclude_patterns=[], include_patterns=[], joblib_parallel_kwargs={})
```
5. Build the catalog with the parse script.

```python
cat_builder.build(parse_dummy)
```

Relevant log output

```
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Input In [84], in <cell line: 1>()
----> 1 cat_builder.build(parse_fake)

File ~/miniconda3/envs/analysis_py39/lib/python3.9/site-packages/pydantic/decorator.py:40, in pydantic.decorator.validate_arguments.validate.wrapper_function()

File ~/miniconda3/envs/analysis_py39/lib/python3.9/site-packages/pydantic/decorator.py:133, in pydantic.decorator.ValidatedFunction.call()

File ~/miniconda3/envs/analysis_py39/lib/python3.9/site-packages/pydantic/decorator.py:130, in pydantic.decorator.ValidatedFunction.init_model_instance()

File ~/miniconda3/envs/analysis_py39/lib/python3.9/site-packages/pydantic/main.py:342, in pydantic.main.BaseModel.__init__()

ValidationError: 2 validation errors for Build
parsing_func
  field required (type=value_error.missing)
args
  1 positional arguments expected but 2 given (type=type_error)
```

Anything else we need to know?

pydantic version: '1.10.2'
ecgtools version: '2022.10.7'

@riley-brady riley-brady added the bug Something isn't working label Nov 3, 2022

riley-brady commented Nov 3, 2022

FYI, this works just fine if I revert to `pip install ecgtools==2021.9.23`, so it seems this is a breaking change in the new ecgtools release.

That said, I would prefer to use the new release, since I'd rather pass a list of files to the builder and do the directory crawling on my own. In most cases I'll be traversing an S3 directory and want to write custom code for deriving the file list. It doesn't seem like the built-in crawler from 2021.9.23 can handle S3Path (see https://github.com/liormizr/s3path).
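For what it's worth, one way to derive the file list yourself is via fsspec (the filesystem layer that s3fs plugs into). This is a hedged sketch, not ecgtools' API: it uses fsspec's in-memory `memory://` filesystem as a stand-in for an S3 bucket, and the resulting string paths would then be handed to `Builder`.

```python
import fsspec

# memory:// stands in for s3:// here; with s3fs installed,
# fsspec.filesystem("s3") follows the same pattern.
fs = fsspec.filesystem("memory")

# Stand-in for objects that would already exist in the bucket.
for var in ["pr", "tas", "tasmax"]:
    with fs.open(f"/data/{var}_daily_EC-Earth3_historical.nc", "wb") as f:
        f.write(b"\x00")

# Glob the store and convert to protocol-qualified string paths.
files = [f"memory://{p}" for p in sorted(fs.glob("/data/*.nc"))]
print(files)
```

With s3fs installed, the same pattern would be `fs = fsspec.filesystem("s3")` and `fs.glob("bucket/prefix/*.zarr")`, yielding plain strings that avoid the S3Path issue entirely.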


riley-brady commented Nov 3, 2022

Sorry for these streams of consciousness.

I'm realizing that even the new version uses the base paths to crawl and find the files, which has been breaking for me when trying to traverse an S3 directory. I'd like an option to just pass in raw S3 files.

I hacked this together with a local install of the package via:

1. Deleting the `*` from the `.build(...)` signature, which seems to force `.build(...)` to take two arguments rather than just the custom parser.
2. Deleting all cases of `@pydantic.validate_arguments` to shut down the pydantic checker.
3. Changing

   ```python
   def get_assets(self):
       assets = [directory.walk() for directory in self._root_dirs]
       self.assets = sorted(toolz.unique(toolz.concat(assets)))
       return self
   ```

   to the following:

   ```python
   def get_assets(self):
       self.assets = self.paths
       return self
   ```

4. Deleting the `*` from the `.save(...)` signature to get my catalog to save (it similarly threw an error about expecting 2 arguments... maybe this is a necessity for pydantic?).

Probably (1) and (4) are unsustainable, because I'm sure the `*` does something, but I'm in a rush. (2) could of course be fixed with a focus on argument validation (which might be related to the issues with (1)).

(3) Could be some sort of switch in Builder, something like crawl=bool.
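To illustrate what I mean, the switch could look roughly like this (a hypothetical sketch; `SketchBuilder` is not the real `ecgtools.Builder`): when `crawl=False`, the supplied paths are used directly as the asset list instead of being walked.

```python
from dataclasses import dataclass, field

@dataclass
class SketchBuilder:
    """Hypothetical sketch -- not the real ecgtools.Builder."""
    paths: list
    crawl: bool = True
    assets: list = field(default_factory=list)

    def get_assets(self):
        if self.crawl:
            # The real Builder walks each directory here.
            raise NotImplementedError("crawling elided in this sketch")
        # crawl=False: treat the supplied paths as the final asset list.
        self.assets = sorted(set(self.paths))
        return self

sketch = SketchBuilder(
    paths=["s3://bucket/b.zarr", "s3://bucket/a.zarr"], crawl=False
)
print(sketch.get_assets().assets)
```

That would let users hand in raw S3/Zarr paths without touching the crawler at all.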

Happy to lead an effort on a PR here if you find this valuable. This would be huge for my work, since I can just do the crawling myself and pass a list of Zarr files from a private S3 store. Please let me know if there's some way else to make the crawler work with an S3 store, though. Based on the size of our datasets on S3, it's not feasible for me to build the catalog locally (we're deriving/downloading/publishing the datasets natively through AWS).


andersy005 commented Nov 15, 2022

Thank you for your patience @riley-brady! I just realized I accidentally stopped watching this repository a while ago.

> I am trying to build a simple custom parser. I am following this guide.

It turns out that guide is a bit outdated :(

> ```
> ValidationError: 2 validation errors for Build
> parsing_func
>   field required (type=value_error.missing)
> args
>   1 positional arguments expected but 2 given (type=type_error)
> ```

With the keyword-only argument requirement via `*`, the following should fix the `ValidationError`:

```python
cat_builder.build(parsing_func=parse_dummy)
cat_builder.save(name=..., another_argument=...)
```
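For anyone hitting the same traceback, the keyword-only mechanics can be shown with plain Python (`FakeBuilder` below is illustrative, not the real ecgtools class): calling a `*`-guarded method positionally raises a TypeError with essentially the same "1 positional argument expected but 2 given" wording that pydantic surfaces.

```python
class FakeBuilder:
    def build(self, *, parsing_func):
        # The `*` makes parsing_func keyword-only.
        return parsing_func("pr_daily_EC-Earth3_historical.nc")

builder = FakeBuilder()

try:
    builder.build(lambda f: f)  # positional call -> TypeError
    failed = False
except TypeError:
    failed = True

entry = builder.build(parsing_func=lambda f: {"path": f})  # keyword call works
print(failed, entry)
```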

> (3) Could be some sort of switch in Builder, something like crawl=bool.

👍🏽 for providing an option to disable the crawling or allowing users to provide custom crawlers.

If you are looking for a quick workaround, the following might work:

```python
import joblib
import pandas as pd

def parsing_func(file):
    ...

cat_builder = Builder(...)
cat_builder.assets = assets  # `assets` here is the list of files to parse
cat_builder.entries = joblib.Parallel(**cat_builder.joblib_parallel_kwargs)(
    joblib.delayed(parsing_func)(asset, **parsing_func_kwargs)
    for asset in cat_builder.assets
)
cat_builder.df = pd.DataFrame(cat_builder.entries)
cat_builder = cat_builder.clean_dataframe()
```

After this, you should be able to save the catalog:

```python
cat_builder.save(name=..., another_argument=...)
```

👍🏽 👍🏽 for a PR when you have time...
