Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDEP-12: compact-and-reversible-JSON-interface.md #53714

Merged
merged 31 commits into from Oct 2, 2023

Conversation

loco-philippe
Copy link
Contributor

This new PDEP is proposed on the advise of Dr-Irv

@loco-philippe
Copy link
Contributor Author

I'm not sure the PDEP number is correct. If not, can you send me the correct number ?

@MarcoGorelli
Copy link
Member

I'm not sure the PDEP number is correct. If not, can you send me the correct number ?

thanks for opening this - I haven't taken a look at the content yet, but the next number in line would be 12

@MarcoGorelli MarcoGorelli added the PDEP pandas enhancement proposal label Jun 19, 2023
@loco-philippe
Copy link
Contributor Author

PDEP number is now 12

@MarcoGorelli MarcoGorelli changed the title Create 0007-compact-and-reversible-JSON-interface.md PDEP-12: compact-and-reversible-JSON-interface.md Jun 19, 2023
@rhshadrach
Copy link
Member

I think this does not meet the bar of being a commonly used format for implementation within pandas. I would support a third party package that reads/writes this format to/from pandas DataFrames being listed in the ecosystem.

@loco-philippe
Copy link
Contributor Author

Thanks Richard for your advice.

And regarding the main feature: having a reversible JSON interface, do you think it will be important (or strategic) for pandas to have this interface?

@jreback
Copy link
Contributor

jreback commented Jun 21, 2023

The table interface already exists for JSON interchange. This PDEP proposes an alternative format w/o even mentioning the existing or commenting on any defects.

@loco-philippe
Copy link
Contributor Author

Thanks Jeff,

if your remark concerns the option orient='table', examples and defects are given in the linked NoteBook included in the PDEP.

If you think the list provided is not sufficient, I can complete it.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 22, 2023
@phofl
Copy link
Member

phofl commented Aug 3, 2023

Not read all of this, but the fact that the table interface is buggy is not a reason to add a new interface. I'd rather fix those bugs.

@loco-philippe
Copy link
Contributor Author

Not read all of this, but the fact that the table interface is buggy is not a reason to add a new interface. I'd rather fix those bugs.

Thank you Patrick for this remark which is quite relevant (see other similar remarks in the thread).

In addition to the answers given (and indicated in the FAQ), I specify that it is mainly a functional problem:

Today, the orient='table' interface only partially meets the Table-Schema specification (only 5 data types out of 20 are taken into account - table here) and without addition to the notion of dtype, it will not be possible to respect the specification.

Thus, if the objective is to implement an interface that respects the Table-schema specification, it will be necessary to implement a solution similar to the one proposed (this has moreover been partially achieved with the notion of extDtype found in the interface for several formats).

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Aug 26, 2023

I think this does not meet the bar of being a commonly used format for implementation within pandas. I would support a third party package that reads/writes this format to/from pandas DataFrames being listed in the ecosystem.

I finally had a chance to read this, and I agree with @rhshadrach . Having a third party package that uses this format seems to be in the best interest of pandas at this time.

The JSON-NTV format and the NTV structure are now included in the IETF data-base as an Internet-Draft. Comments are welcome !

If that format does get adopted as a standard, then we can consider including this within pandas.

We've also had discussions about creating an interface to allow various file formats to be externalized. If we end up doing that, then that would be an easier way of supporting this format versus putting the burden on the pandas team to maintain any interface.

Based on our currently proposed procedures for handling PDEP's, we're past the 60 day discussion period, but because we didn't make it very visible, I'll ping @pandas-dev/pandas-core here in case there are additional comments, and call for a vote on September 11.

@loco-philippe
Copy link
Contributor Author

Thanks @Dr-Irv for your answer.

Indeed, during this two-month period, I was surprised not to have any comments on the difficulties and expectations around the JSON interface as well as on the contributions of the proposal. Perhaps this interface remains secondary.

In relation to the work done, it seems to me that several actions could be taken:

1. carry out an inventory of the JSON interface: There are inconsistencies and shortcomings that should be identified (first work done in the Notebook)
2. define objectives and orientations for the JSON interface: For example:

  • (orientation) have an alternative format to CSV (deprecated) especially for open-data needs
  • (orientation) share pandas data by API
  • (goal) consider all pandas dtypes
  • (goal) take into account the main types of Python
  • (goal) to have a reversible interface ( df.equals(read_json(to_json(df))) )

3. have feedback on the proposed solution (which meets the guidelines proposed above): the proposal to integrate the proposed solution into the ecosystem can meet this need
4. define an action plan when the three points above have been addressed

Of course, I can help on these points if necessary.

@attack68
Copy link
Contributor

sorry I missed this. I am a big fan of the JSON interface and use it with all my web microservices. Pandas is not great with JSON as you have highlighted and I think generally working towards its improvement is a good thing. In my use cases I have monkey patched some things to suit my dtypes.

Also other web developers have commented that a better Styler.to_json output would be useful to translate the conditional attribute-value pairs into JSON for interface with Javascript developed components, which I have also monkey patched at times. I included this comment in my own PDEP (4?) but have unfortunately not had the time to follow it up

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Aug 28, 2023

Indeed, during this two-month period, I was surprised not to have any comments on the difficulties and expectations around the JSON interface as well as on the contributions of the proposal. Perhaps this interface remains secondary.

Partially my fault for not reading what you had put together and subsequently alerting the core team. Apologies for that.

In relation to the work done, it seems to me that several actions could be taken:

One thing that you could consider is to be incremental in modifying the existing JSON interface, i.e., create PR's that make small incremental improvements to the existing interface rather than bring forth an entirely new format. There are currently 38 open issues in pandas with read_json in them, so working on those might be a better way to move forward.

@loco-philippe
Copy link
Contributor Author

Indeed, during this two-month period, I was surprised not to have any comments on the difficulties and expectations around the JSON interface as well as on the contributions of the proposal. Perhaps this interface remains secondary.

Partially my fault for not reading what you had put together and subsequently alerting the core team. Apologies for that.

In fact, my intention in this remark was to clarify that the answers obtained were only about the "how" but not the "what" nor the "why", which surprised me (it was not about your support for this action) .

In relation to the work done, it seems to me that several actions could be taken:

One thing that you could consider is to be incremental in modifying the existing JSON interface, i.e., create PR's that make small incremental improvements to the existing interface rather than bring forth an entirely new format. There are currently 38 open issues in pandas with read_json in them, so working on those might be a better way to move forward.

I understand that the priority of the project is to address existing issues rather than adding new features that might add new issues!
But let me make a few points:

  • the proposal made includes elements that already exist in the current interface:
    • the proposed JSON format already works with read_json (e.g. pd.read_json('{"test integer":[1,2,3], "test string": ["a", "b", "c"]} ') but not with to_json,
    • the notion of type (as defined in NTV) already exists under the name extDtype (e.g. 'extDtype': 'Int32' is added in the result of to_json(orient='table').
  • there are inconsistencies in the current interface which seem to me to be a priority to deal with. For example :
    • why JSON format '{"test integer":[1,2,3], "test string": ["a", "b", "c"]}' is accepted in read_json but cannot be generated by to-json ?
    • why are dates treated as datetime in the JSON interface while they are treated as date in the CSV interface?
    • the TableSchema interface only deals with 6 data types out of the 24 defined, read_json does not exist for dtype=period[ ] and dtype=timedelta64[ns] (while to_json works)

It therefore seems to me that it is also a priority to make an inventory of the Json interface to highlight the priorities.

That said, I also agree with the remark and I will first understand these 38 issues and share a summary of them (but I am not sure that I am the right profile to correct them).

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Aug 30, 2023

It therefore seems to me that it is also a priority to make an inventory of the Json interface to highlight the priorities.

That said, I also agree with the remark and I will first understand these 38 issues and share a summary of them (but I am not sure that I am the right profile to correct them).

It seems to me that you should consider withdrawing this PDEP, and you can create a new meta-issue in pandas that summarizes the inventory you create, pointing to each of the relevant issues, and then that allows you or others to work on different aspects.

However, you could keep the PDEP open, and I'll call for a vote on September 11, although my sense is that it is unlikely to get accepted.

@loco-philippe
Copy link
Contributor Author

It therefore seems to me that it is also a priority to make an inventory of the Json interface to highlight the priorities.
That said, I also agree with the remark and I will first understand these 38 issues and share a summary of them (but I am not sure that I am the right profile to correct them).

It seems to me that you should consider withdrawing this PDEP, and you can create a new meta-issue in pandas that summarizes the inventory you create, pointing to each of the relevant issues, and then that allows you or others to work on different aspects.

I don't understand why I should consider withdrawing this PDEP:

  • the proposed interface provides benefits (extended types, systematic reversibility) that the current interface cannot provide,
  • the code can be decoupled from the current code (no risk of interference) and does not increase dependencies,
  • the only negative point received concerns the NTV format which although simple is not yet a standard format (the return of the IETF on the Internet-Draft filed will be interesting). From my point of view, it is not blocking (it may be from Pandas point of view).

For the meta-issue, I agree with the proposal to integrate the analysis of the issues in addition to the analysis already carried out (shared via the Notebook) concerning the questions of type conservation and reversibility (roundtrip). I will try to provide it before September 11th.

However, you could keep the PDEP open, and I'll call for a vote on September 11, although my sense is that it is unlikely to get accepted.

I prefer this second solution which will allow explicit feedback on the proposal (with the uncertainties mentioned)

@loco-philippe
Copy link
Contributor Author

Several comments remind us that the Table Schema interface already exists for JSON interchange. The principles of Table Schema are also consistent with those addressed in this proposal (notion of type).

I therefore took into account in the proposal an extension of the Table Schema interface (addition of types defined by Table Schema and not yet implemented in orient="table" - issue #55038).

Furthermore, the analysis of the 71 JSON issues shows that the proposal resolves or provides an alternative solution for the following issues: #12997, #14358, #16492, #35420, #35464, #36211, #39537, #49585, #50782, #51375, #52595, #53252

@loco-philippe
Copy link
Contributor Author

As proposed by @Dr-Irv, I shared the IO JSON parsing in a QST issue #55046 (maybe not the right channel !)

@loco-philippe
Copy link
Contributor Author

attack68-comment

sorry I missed this. I am a big fan of the JSON interface and use it with all my web microservices. Pandas is not great with JSON as you have highlighted and I think generally working towards its improvement is a good thing. In my use cases I have monkey patched some things to suit my dtypes.

Thank you @attack68 for your support !
If you have some example requirements about dtypes, I'm interested

Also other web developers have commented that a better Styler.to_json output would be useful to translate the conditional attribute-value pairs into JSON for interface with Javascript developed components, which I have also monkey patched at times. I included this comment in my own PDEP (4?) but have unfortunately not had the time to follow it up

I don't use Styler but I see in the documentation that Styler.to_json no longer exists??
Do you have another way to interface with Javascript?

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 26, 2023

This PDEP has failed to gain enough support. See #55095 (comment)

@loco-philippe Can you update the status as rejected, and then push, so we keep it in the archives?

@loco-philippe
Copy link
Contributor Author

This PDEP has failed to gain enough support. See #55095 (comment)

@loco-philippe Can you update the status as rejected, and then push, so we keep it in the archives?

I followed and took the vote into account.

My conclusion, after reflections on this PDEP, is the following:

  • the comments are constructive and interesting,
  • the PDEP process is adapted,
  • it seems necessary to me that someone analyzes the JSON interface in detail (regardless of the bugs) which presents some inconsistencies.

So I will update the PDEP to add the conclusions and the result of the vote to close this PR (this weekend).

As indicated in the latest version of the PDEP, I have integrated the Table Schema Interface into the JSON-NTV library, so this module is now compatible with all of the data types defined in the Table Schema specification.

Thank you again Irv for your involvement in this PDEP !

@loco-philippe
Copy link
Contributor Author

PDEP is updated with rejected status and vote results.

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @loco-philippe for your patience while we went through this process. We had a healthy discussion.

@Dr-Irv Dr-Irv merged commit 9282d9f into pandas-dev:main Oct 2, 2023
13 checks passed
@loco-philippe loco-philippe deleted the ntv branch October 3, 2023 12:47
@loco-philippe loco-philippe restored the ntv branch October 6, 2023 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PDEP pandas enhancement proposal Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants