
[Bug]: pydantic causing simple parse script to fail on build #141

Open
riley-brady opened this issue Nov 3, 2022 · 3 comments
Labels
bug Something isn't working

Comments


riley-brady commented Nov 3, 2022

What happened?

When attempting to run Build.build(custom_parser) I get a pydantic error that kills the catalog build, despite following the tutorial and successfully running the parser on test files.

What did you expect to happen?

A catalog to build successfully

Minimal Complete Verifiable Example

I am trying to build a simple custom parser. I am following [this guide](https://ecgtools.readthedocs.io/en/latest/how-to/use-a-custom-parser.html).

0. Make some fake data.

```python
import pathlib
import xarray as xr

pathlib.Path("./data").mkdir(exist_ok=True)

for var in ['pr', 'tas', 'tasmax']:
    ds = xr.DataArray([1, 2, 3], dims='time').rename(var).to_dataset()
    ds.to_netcdf(f"./data/{var}_daily_EC-Earth3_historical.nc")
```


1. Put together list of paths.
```python
root_path = pathlib.Path("./data/")
files = list(root_path.glob("*.nc"))
# Convert to string paths since the builder only takes string paths.
files = [str(f) for f in files]
```
2. I built a simple custom parser to test this out, following the guide.

```python
import pathlib
import traceback

from ecgtools.builder import INVALID_ASSET, TRACEBACK

def parse_dummy(file):
    fpath = pathlib.Path(file)
    info = {}

    try:
        # Just extracting metadata from the filename in this case. Same errors
        # occur when including loading into a dataset.
        variable, temporal_resolution, model, scenario = fpath.stem.split('_')
        info = {
            'variable': variable,
            'temporal': temporal_resolution,
            'source': model,
            'path': str(file),
        }

        return info

    except Exception:
        return {INVALID_ASSET: file, TRACEBACK: traceback.format_exc()}
```
3. I tested this on a simple file and it was successful.

```python
parse_dummy(files[0])
```
```
{'variable': 'pr',
 'temporal': 'daily',
 'source': 'EC-Earth3',
 'path': './data/pr_daily_EC-Earth3_historical.nc'}
```
4. Now I make a builder object. The object successfully returns the expected list of files.

```python
from ecgtools import Builder

# Tried the pathlib objects here following the demo but got an error that pydantic
# wanted strings only.
cat_builder = Builder(files)
```
```
Builder(paths=['data/pr_daily_EC-Earth3_historical.nc', 'data/tasmax_daily_EC-Earth3_historical.nc', 'data/tas_daily_EC-Earth3_historical.nc'], storage_options={}, depth=0, exclude_patterns=[], include_patterns=[], joblib_parallel_kwargs={})
```
5. Build the catalog with the parse script.

```python
cat_builder.build(parse_dummy)
```

Relevant log output

```
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Input In [84], in <cell line: 1>()
----> 1 cat_builder.build(parse_fake)

File ~/miniconda3/envs/analysis_py39/lib/python3.9/site-packages/pydantic/decorator.py:40, in pydantic.decorator.validate_arguments.validate.wrapper_function()

File ~/miniconda3/envs/analysis_py39/lib/python3.9/site-packages/pydantic/decorator.py:133, in pydantic.decorator.ValidatedFunction.call()

File ~/miniconda3/envs/analysis_py39/lib/python3.9/site-packages/pydantic/decorator.py:130, in pydantic.decorator.ValidatedFunction.init_model_instance()

File ~/miniconda3/envs/analysis_py39/lib/python3.9/site-packages/pydantic/main.py:342, in pydantic.main.BaseModel.__init__()

ValidationError: 2 validation errors for Build
parsing_func
  field required (type=value_error.missing)
args
  1 positional arguments expected but 2 given (type=type_error)
```

Anything else we need to know?

pydantic version: '1.10.2'
ecgtools version: '2022.10.7'

@riley-brady riley-brady added the bug Something isn't working label Nov 3, 2022

riley-brady commented Nov 3, 2022

FYI, this works just fine if I revert to `pip install ecgtools==2021.9.23`, so it seems this is a breaking change in the new ecgtools release.

That said, I would prefer to use the new release, since I'd rather pass a list of files to the builder and do the directory crawling on my own. In most cases I'll be traversing an S3 directory and want to write custom code for deriving the file list. It doesn't seem like the built-in crawler from 2021.9.23 can handle S3Path (see https://github.com/liormizr/s3path).
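For what it's worth, one way to derive the file list yourself is via fsspec (the filesystem layer that s3fs plugs into). This is a hedged sketch, not ecgtools' API: it uses fsspec's in-memory `memory://` filesystem as a stand-in for an S3 bucket, and the resulting string paths would then be handed to `Builder`.

```python
import fsspec

# memory:// stands in for s3:// here; with s3fs installed,
# fsspec.filesystem("s3") follows the same pattern.
fs = fsspec.filesystem("memory")

# Stand-in for objects that would already exist in the bucket.
for var in ["pr", "tas", "tasmax"]:
    with fs.open(f"/data/{var}_daily_EC-Earth3_historical.nc", "wb") as f:
        f.write(b"\x00")

# Glob the store and convert to protocol-qualified string paths.
files = [f"memory://{p}" for p in sorted(fs.glob("/data/*.nc"))]
print(files)
```

With s3fs installed, the same pattern would be `fs = fsspec.filesystem("s3")` and `fs.glob("bucket/prefix/*.zarr")`, yielding plain strings that avoid the S3Path issue entirely.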


riley-brady commented Nov 3, 2022

Sorry for these streams of consciousness.

I'm realizing that even the new version uses the base paths to crawl and find the files, which has been breaking for me when trying to traverse an S3 directory. I'd like an option to just pass in raw S3 files.

I hacked this together with a local install of the package via:

1. Deleting the `*` from the `.build(...)` signature, which seems to force `.build(...)` to take two arguments rather than just the custom parser.
2. Deleting all cases of `@pydantic.validate_arguments` to shut down the pydantic checker.
3. Changing

   ```python
   def get_assets(self):
       assets = [directory.walk() for directory in self._root_dirs]
       self.assets = sorted(toolz.unique(toolz.concat(assets)))
       return self
   ```

   to the following:

   ```python
   def get_assets(self):
       self.assets = self.paths
       return self
   ```

4. Deleting the `*` from the `.save(...)` signature to get my catalog to save (it similarly threw an error about expecting 2 arguments... maybe this is a necessity for pydantic?).

Probably (1) and (4) are unsustainable, because I'm sure the `*` does something, but I'm in a rush. (2) could of course be fixed with a focus on argument validation (which might be related to the issues with (1)).

(3) Could be some sort of switch in Builder, something like crawl=bool.
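To illustrate what I mean, the switch could look roughly like this (a hypothetical sketch; `SketchBuilder` is not the real `ecgtools.Builder`): when `crawl=False`, the supplied paths are used directly as the asset list instead of being walked.

```python
from dataclasses import dataclass, field

@dataclass
class SketchBuilder:
    """Hypothetical sketch -- not the real ecgtools.Builder."""
    paths: list
    crawl: bool = True
    assets: list = field(default_factory=list)

    def get_assets(self):
        if self.crawl:
            # The real Builder walks each directory here.
            raise NotImplementedError("crawling elided in this sketch")
        # crawl=False: treat the supplied paths as the final asset list.
        self.assets = sorted(set(self.paths))
        return self

sketch = SketchBuilder(
    paths=["s3://bucket/b.zarr", "s3://bucket/a.zarr"], crawl=False
)
print(sketch.get_assets().assets)
```

That would let users hand in raw S3/Zarr paths without touching the crawler at all.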

Happy to lead an effort on a PR here if you find this valuable. This would be huge for my work, since I can just do the crawling myself and pass a list of Zarr files from a private S3 store. Please let me know if there's some way else to make the crawler work with an S3 store, though. Based on the size of our datasets on S3, it's not feasible for me to build the catalog locally (we're deriving/downloading/publishing the datasets natively through AWS).


andersy005 commented Nov 15, 2022

Thank you for your patience @riley-brady! I just realized I accidentally stopped watching this repository a while ago.

> I am trying to build a simple custom parser. I am following this guide.

It turns out that guide is a bit outdated :(

> ```
> ValidationError: 2 validation errors for Build
> parsing_func
>   field required (type=value_error.missing)
> args
>   1 positional arguments expected but 2 given (type=type_error)
> ```

With the keyword-only argument requirement via `*`, the following should fix the `ValidationError`:

```python
cat_builder.build(parsing_func=parse_dummy)
cat_builder.save(name=..., another_argument=...)
```
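For anyone hitting the same traceback, the keyword-only mechanics can be shown with plain Python (`FakeBuilder` below is illustrative, not the real ecgtools class): calling a `*`-guarded method positionally raises a TypeError with essentially the same "1 positional argument expected but 2 given" wording that pydantic surfaces.

```python
class FakeBuilder:
    def build(self, *, parsing_func):
        # The `*` makes parsing_func keyword-only.
        return parsing_func("pr_daily_EC-Earth3_historical.nc")

builder = FakeBuilder()

try:
    builder.build(lambda f: f)  # positional call -> TypeError
    failed = False
except TypeError:
    failed = True

entry = builder.build(parsing_func=lambda f: {"path": f})  # keyword call works
print(failed, entry)
```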

> (3) Could be some sort of switch in Builder, something like crawl=bool.

👍🏽 for providing an option to disable the crawling or allowing users to provide custom crawlers.

If you are looking for a quick workaround, the following might work:

```python
import joblib
import pandas as pd

def parsing_func(file):
    ...

cat_builder = Builder(...)
cat_builder.assets = assets  # `assets` here is the list of files to parse
cat_builder.entries = joblib.Parallel(**cat_builder.joblib_parallel_kwargs)(
    joblib.delayed(parsing_func)(asset, **parsing_func_kwargs)
    for asset in cat_builder.assets
)
cat_builder.df = pd.DataFrame(cat_builder.entries)
cat_builder = cat_builder.clean_dataframe()
```

After this, you should be able to save the catalog:

```python
cat_builder.save(name=..., another_argument=...)
```

👍🏽 👍🏽 for a PR when you have time...
