Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with most recent format from kart diff #487

Closed
volaya opened this issue Oct 3, 2021 · 6 comments · Fixed by #489
Closed

Issues with most recent format from kart diff #487

volaya opened this issue Oct 3, 2021 · 6 comments · Fixed by #489
Assignees

Comments

@volaya
Copy link

volaya commented Oct 3, 2021

Looks like diff output format has changed in the latest release. Before, if several layers had changes, the output could be sent to a folder, and kart created several files, one for each layer, with the corresponding changes. Now all changes are written to a single file, with no information about the layer that each feature belongs to. It can maybe be figured out from the attributes schema, but that adds complexity to the workflow, and there can be cases where it not possible to do it (several layers might have the same schema, or the user might be looking at an older diff where the layer had a different schema to its most recent one...)

Would it be possible to add an additional property somewhere in the feature definition, so it can be used to separate the affected features into groups, depending on the layer they belong to?

@craigds
Copy link
Member

craigds commented Oct 3, 2021

Thanks for the issue. To clarify are you referring to the GeoJSON diff output? --output-format=geojson

@craigds craigds self-assigned this Oct 3, 2021
@craigds
Copy link
Member

craigds commented Oct 3, 2021

triage

I can confirm the bug

0.10.0: works fine, output to directory
0.10.1: crashes with a traceback (ref #453)
0.10.2: broken, outputs features from all layers to the same file

The bad codepath seems to be this line:

if isinstance(self.output_path, Path) and self.output_path.is_dir():

Where, if the directory doesn't exist, the is_dir() returns False and so the single-file mode is used.

additional bug

In addition, if I create the directory beforehand I get this traceback instead:

Traceback (most recent call last):
  File "/usr/local/bin/kart", line 33, in <module>
    sys.exit(load_entry_point('kart', 'console_scripts', 'kart')())
  File "/Users/cdestigter/checkout/kart/kart/cli.py", line 335, in entrypoint
    cli()
  File "/Users/cdestigter/checkout/kart/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/cdestigter/checkout/kart/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/cdestigter/checkout/kart/kart/cli.py", line 158, in invoke
    return super().invoke(ctx)
  File "/Users/cdestigter/checkout/kart/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/cdestigter/checkout/kart/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/cdestigter/checkout/kart/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/cdestigter/checkout/kart/venv/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/cdestigter/checkout/kart/kart/diff.py", line 145, in diff
    diff_writer.write_diff()
  File "/Users/cdestigter/checkout/kart/kart/json_diff_writers.py", line 269, in write_diff
    if isinstance(self.output_path, Path) and self.output_path.is_dir():
  File "/Users/cdestigter/checkout/kart/kart/json_diff_writers.py", line 303, in write_file_per_dataset
    json_style=self.json_style,
  File "/Users/cdestigter/checkout/kart/kart/output_util.py", line 207, in dump_json_output
    for chunk in json_encoder.iterencode(output):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/encoder.py", line 439, in _iterencode
    yield from _iterencode(o, _current_indent_level)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/encoder.py", line 429, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/Users/cdestigter/checkout/kart/kart/output_util.py", line 63, in default
    return json.JSONEncoder.default(self, obj)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Delta is not JSON serializable

Fixing that might be the first thing, since at least then we would have a workaround (create the directory first)

patch fix proposal

Since this is a regression in 0.10.x I think the right fix for the next 0.10.x release would be to change it to assume directory output always, and only use single-file mode when the output is going to stdout.

That latter case isn't a regression since 0.10.0 and earlier refused to do geojson diff output at all when multiple datasets were present and no directory was specified:

$ kart diff --output-format=geojson 0ed6aace073385fc90d260abc2ae4916a1966c5c..
Usage: kart diff [OPTIONS] [COMMIT_SPEC] [FILTERS]...

Error: Invalid value for --output: Need to specify a directory via --output for --geojson with >1 dataset

longer term fix proposal

I think probably to avoid ambiguity, maybe we should split the geojson diff type into two separate types.

some options:

  • geojson-dir (dir) and geojson (single-file) - this would be backward-incompatible, but is an easy incompatibility to deal with
  • geojson (dir) and geojson-flat (single-file)

@craigds
Copy link
Member

craigds commented Oct 3, 2021

alternatively, just remove the single-file mode

looks like the issue was introduced in bf73da7 - as part of refactoring the diff output code, show was expanded to support all the formats that diff supports. But because show wants to dump output to a single stream, the geojson diff code had to be able to do that.

I think it's probably not terribly useful though. Perhaps we could just remove it again. We could change diff back to using directory output only for multi-dataset diffs, and either:

  1. remove geojson output for the show command altogether
  2. make show work the same as diff - refuse to show multi-dataset commits when --output-format=geojson is specified

@craigds
Copy link
Member

craigds commented Oct 4, 2021

I was hoping to work on this today but had a lot of other stuff on and so it will be tomorrow at the earliest

I think just removing the single-file mode is the best solution, since it fixes the issue, keeps everything simple and removes an unnecessary feature. I also think show doesn't really need geojson output so I'd just do whatever's easiest there - if it's easy to keep it for single-dataset commits then I'd do that, otherwise remove it entirely. @rcoup does that sound like a plan?

@volaya
Copy link
Author

volaya commented Oct 4, 2021

Yes, That's the problem that I am seeing. One thing to note is that I only see the issue in 0.10.4. Reverting to kart 10.2 fixes it and the output is correct. So not exactly what you mention, but I guess it's the same issue

@rcoup
Copy link
Member

rcoup commented Oct 4, 2021

I think just removing the single-file mode is the best solution, since it fixes the issue, keeps everything simple and removes an unnecessary feature.

If there's no useful way for multiple datasets to co-exist in the same file then this seems like a reasonable approach.

I also think show doesn't really need geojson output so I'd just do whatever's easiest there - if it's easy to keep it for single-dataset commits then I'd do that

We shouldn't have features that only work sometimes, so I'm -1 on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants