Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: Databook Write #18

Closed
rbreslavsky opened this issue Jun 6, 2023 · 7 comments
Closed

Bugfix: Databook Write #18

rbreslavsky opened this issue Jun 6, 2023 · 7 comments

Comments

@rbreslavsky
Copy link

I noticed that when I have custom columns in my run matrix that use strings with a comma, CAPE cannot write databook files correctly. In cape/cfdx/dataBook.py/write(), the run matrix cells are correctly read and interpreted, but on the write command the comma inside the string is recognized as a delimiter. The result is that the strings are split into multiple columns. This breaks anything that needs to process databooks.

I wonder if there's a way to make sure that strings are recognized and not split up? Another approach could be to use pandas to read/write csv files, but that's probably a longer term solution. I'm sure it would allow you to eliminate some code.

@nasa-ddalle
Copy link
Collaborator

I remember encountering this problem years ago and deciding that our team would just use a different delimiter for run matrix keys, e.g 1.5,2.0,poweroff,bro|ta using | as a secondary delimiter. The problem seemed intractable b/c the DataBook read used np.loadtxt(), which completely ignored quotation marks. Looking at the current code, I don't think it will be hard to fix this, though.

@nasa-ddalle
Copy link
Collaborator

Should be fixed with commit e72d019 "Move updated DataBook line splitter to cape.util".

There's an update to test/001_cape/007_databook that makes sure the str with comma within can be read by the databook (see aero_fuselage.csv).

@rbreslavsky
Copy link
Author

rbreslavsky commented Jun 7, 2023

Thanks Derek. Now a list of strings, when read from a CSV, is encoded in CAPE as "['one', 'two']", which writes correctly to CSV format. However, a list of values would be encoded as as '[1, 2]'; this is not recognized as a string in CSV, and is still delimited (split) by the comma. What we'd want is something like "[1, 2]".

With that in mind, line 3481 of cfdx/dataBook.py might benefit from this change:
fmtj = '%r' --> fmtj = '%r' if "'" in vj else '"%r"'

The downside is that this will break if the number of apostrophes isn't 2. So maybe there's a more robust way to code it. What do you think?

@nasa-ddalle
Copy link
Collaborator

This makes me a bit uncomfortable b/c there's a silent transition from a list to a string. So in your example writing out [1, 2] would then cause it to be read back in as "[1, 2]". I can do it, but there's an argument you should just convert your lists to strings prior to writing... I can put it in though. As for the quotations worry, I'll just double-call repr(). For example repr(repr([1, 'a'])) -> "[1, 'a']", and repr(repr([1, 'a "b"'])) -> '[1, \'a "b"\']'.

@nasa-ddalle
Copy link
Collaborator

Ok, I pushed a new commit d1a6a313 that's supposed to fix this. I didn't update the test yet, but it should work as you described. I'm not sure what happens when you read something like '[1, \'a "b"\']' back into a new databook instance, though. In particular, if that column is expecting an int or float, I think it will raise an exception. If that column was already set to be a str, the solution should be correct.

@rbreslavsky
Copy link
Author

rbreslavsky commented Jun 8, 2023

Just to be clear, all of the lists in my run matrix are indeed written as strings. I've been using pandas to generate the run matrix in the first place, and it writes to CSV in the correct manner. So all I'm trying to do is make sure that when my matrix columns are read in and written to the databooks, they're still strings (as recognized by CSVs).

@nasa-ddalle
Copy link
Collaborator

Ok, so I now understand the issue to be that single quotes from Python's repr() are not going over well. So for example writing the string "[1, 2]" to file with the text '[1, 2]' has compatibility issues. With that in mind, bad8e2d uses json.dumps() instead of repr(). That will always produce double quotes for the outer string definition while still being able to escape quotes within, e.g. "[1, \"a\"]".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants