Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: eliminate method _write() in json writers #36218

Merged
merged 11 commits into from
Oct 22, 2020

Conversation

ivanovmg
Copy link
Member

@ivanovmg ivanovmg commented Sep 8, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Attempted to figure out what is wrong with the issue #36211.
I noticed that the private methods _write are defined in each child of class Writer, which simply re-define obj and call the parent _write method.
It turns out that the source of the recursion does not lie there, but I refactored the module slightly by removing the mediator _write method. Instead I put the differences between the writers into their __init__ method.

It seems that source of the problem with the recursion mentioned in the bug report is in pandas/_libs/src/ujson/python/objToJSON.c.

class SeriesWriter(Writer):
_default_orient = "index"
if not index and orient == "split":
self.obj = {"name": obj.name, "data": obj.values}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm why do we need to do this now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know. I simply used the original implementation and moved whatever assignments were in _write into __init__.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh OK - I see what was done here. Let's not assign to self.obj as that will mess with the idempotency

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We assign to self.obj at the moment of the writer initialization, not when calling a method. So, I would argue that it is not harmful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it is confusing to change what self.obj is for certain orients - would prefer not to do this

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Sep 8, 2020
@ivanovmg
Copy link
Member Author

ivanovmg commented Sep 9, 2020

If there is a problem with updating self.obj in __init__ of child classes, then we probably should abandon this PR. Maybe other opinion?

@WillAyd
Copy link
Member

WillAyd commented Sep 9, 2020

Yea I don't think this is an improvement as it is better for the subclasses to override the implementation where appropriate rather than modifying the object that gets stored during initialization.

Let's keep open for a few days as someone may disagree

@ivanovmg
Copy link
Member Author

Would you consider another implementation? I created a new property obj_to_write (we can call it a table_obj, or something like this). This allowed us to avoid re-definition of self.obj, reduce __init__ boilerplate and also (as initially) get rid of _write method.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think this is a better approach

)

@property
@abstractmethod
def obj_to_write(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an annotation for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -33,7 +35,7 @@
# interface to/from
def to_json(
path_or_buf,
obj,
obj: NDFrame,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be FrameOrSeries?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with FrameOrSeries is that it does not allow changing the type of return value.
Inside the function to_json there is a statement, which changes object from series to frame, which is not allowed.

    if orient == "table" and isinstance(obj, Series):
        obj = obj.to_frame(name=obj.name or "values")

mypy error:

pandas/io/json/_json.py:78: error: Incompatible types in assignment (expression has type "DataFrame", variable has type "FrameOrSeries")  [assignment]

I tried to use FrameOrSeriesUnion, which allows changing type from series to frame, but I get another mypy error.
Apparently for NDFrame only FrameOrSeries is a correct casting.

pandas/core/generic.py:2159: error: The erased type of self "Union[pandas.core.frame.DataFrame, pandas.core.series.Series]" is not a supertype of its class "pandas.core.generic.NDFrame"  [misc]

So, I have no clue how to deal with typing here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd i expected FrameOrSeriesUnion would work for this, am i missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivanovmg can you merge master and update the error message here? I don't think that line number is reflective of the latest code base

I think the error message is stemming from a mixture of using a Union and a TypeVar (the latter being somewhere in generic). Ideally should be cleaned up but I don't think needs to hold up this PR

Copy link
Member Author

@ivanovmg ivanovmg Oct 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we go.
I merged master, then modified pandas/core/generic.py.

line 2164:

    def to_json(
        self: FrameOrSeriesUnion,
        path_or_buf: Optional[FilePathOrBuffer] = None,

And replaced NDFrame with FrameOrSeriesUnion in _json.py.

The error would be:

pandas/core/generic.py:2164: error: The erased type of self "Union[pandas.core.frame.DataFrame, pandas.core.series.Series]" is not a supertype of its class "pandas.core.generic.NDFrame"  [misc]

However, if I do not touch generic.py, then the following error with be thrown.

pandas/core/generic.py:2437: error: Argument "obj" to "to_json" has incompatible type "NDFrame"; expected "Union[DataFrame, Series]"  [arg-type]

@abstractmethod
def obj_to_write(self) -> Union[NDFrame, Dict[str, Any]]:
"""Object to write in JSON format."""

Copy link
Member

@jbrockmendel jbrockmendel Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explicitly pass here

are the keys in the Dict necessarily str, or can they be any Index entries?

Copy link
Member Author

@ivanovmg ivanovmg Oct 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, actually the keys in the dictionary can be of IndexLabel type.
In FrameWriter the resulting dictionary keys can be tuples of strings in case of multicolumns.
In other cases the keys seem to be only strings.

@jbrockmendel
Copy link
Member

nitpick about annotations, otherwise LGTM

@ivanovmg
Copy link
Member Author

All checks are good after merging with master, except Windows:

##[error]We stopped hearing from agent Azure Pipelines 11. 

@jbrockmendel, unfortunately, I cannot fully resolve the issue with typing. Please see my comment above in relation to NDFrame and FrameOrSeries.

@ivanovmg ivanovmg requested a review from WillAyd October 20, 2020 08:19
@WillAyd WillAyd added this to the 1.2 milestone Oct 22, 2020
@WillAyd WillAyd merged commit dd25f1a into pandas-dev:master Oct 22, 2020
@WillAyd
Copy link
Member

WillAyd commented Oct 22, 2020

Thanks @ivanovmg

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Oct 26, 2020
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
@ivanovmg ivanovmg deleted the bug_36211 branch November 6, 2020 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants