pandas read from json don't infer data types #916

Merged
merged 6 commits into from Mar 8, 2019

mparkhe
Contributor

@mparkhe mparkhe commented Feb 23, 2019

  • Don't infer data types while converting from JSON
  • unit tests for parse_json_input

By default, pandas.read_json attempts to infer and up-convert data types. In the following example, the "zip_code" column is given as strings in the JSON but is converted to int64 by default. This automatic inference can be troublesome when writing custom model scoring code. This PR passes the dtype=False argument to disable it.

>>> import pandas as pd

>>> json_string = '{"columns":["zip_code","cost"],"index":[0,1,2],"data":[["95120",10.45],["95128",23.0],["95128",12.1]]}'

>>> str(pd.read_json(json_string, orient="split").dtypes["zip_code"])
'int64'

>>> str(pd.read_json(json_string, orient="split", dtype=False).dtypes["zip_code"])
'object'

The same behavior is seen when using orient="records".
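For illustration, the records-oriented case can be sketched as follows. This is a minimal example, not code from the patch: the variable names are made up here, and the input is wrapped in `io.StringIO` only because newer pandas versions deprecate passing a literal JSON string to `read_json`.

```python
from io import StringIO

import pandas as pd

# Same data as in the description, in records orientation (illustrative names).
records_json = (
    '[{"zip_code": "95120", "cost": 10.45},'
    ' {"zip_code": "95128", "cost": 23.0},'
    ' {"zip_code": "95128", "cost": 12.1}]'
)

# Default: pandas may up-convert the quoted zip codes to a numeric dtype.
inferred = pd.read_json(StringIO(records_json), orient="records")

# dtype=False: the quoted values are kept as the strings that were sent.
as_sent = pd.read_json(StringIO(records_json), orient="records", dtype=False)

print(inferred.dtypes)
print(as_sent["zip_code"].tolist())
```

Comparing the two printed results shows whether the installed pandas version rewrites the "zip_code" column when inference is left on.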

@mateiz
Contributor

mateiz commented Feb 24, 2019

What was the problem exactly? It's hard to tell from the patch.

@mparkhe
Contributor Author

mparkhe commented Feb 25, 2019

> What was the problem exactly? It's hard to tell from the patch.

Added more details to the description.

@dbczumar
Collaborator

This seems like it might create backwards compatibility issues if users' existing pipelines depend on the inferred data types. Perhaps we should add an interim warning that this behavior will be changing (e.g. in 0.8.3) and implement the change in a later version (e.g. 0.9.0).
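One way such an interim warning could look is sketched below. This is only an illustration of the idea, not what the PR implements: the function name `parse_json_with_transition_warning` and the `infer_dtypes` parameter are hypothetical, and `io.StringIO` is used because newer pandas versions deprecate literal JSON strings in `read_json`.

```python
import warnings
from io import StringIO

import pandas as pd


def parse_json_with_transition_warning(json_input, orient="split", infer_dtypes=None):
    """Hypothetical helper: parse JSON into a DataFrame during a transition period.

    infer_dtypes=None means the caller made no explicit choice, so the old
    behavior (inference on) is kept for now and a warning is emitted.
    """
    if infer_dtypes is None:
        warnings.warn(
            "dtype inference for JSON input will be disabled by default in a "
            "future release; pass infer_dtypes explicitly to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
        infer_dtypes = True  # keep the old behavior until the default changes
    # read_json accepts a bool for dtype: True infers, False leaves values as parsed.
    return pd.read_json(StringIO(json_input), orient=orient, dtype=infer_dtypes)
```

Callers who already pass `infer_dtypes=True` or `infer_dtypes=False` see no warning, so only pipelines that rely on the implicit default are notified.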

@tomasatdatabricks
Contributor

This would mean all integers are parsed as object / string if I understand it correctly? I am not sure if that is a good idea.

Can we instead add a parameter to the REST api or during the deployment process?

@mparkhe
Contributor Author

mparkhe commented Feb 27, 2019

Reply to @tomasatdatabricks, re:

> This would mean all integers are parsed as object / string if I understand it correctly? I am not sure if that is a good idea.
>
> Can we instead add a parameter to the REST api or during the deployment process?

No. Users can pass integers and floats as part of the JSON. See the examples in the tests test_records_oriented_json_to_df and test_split_oriented_json_to_df.

Or see the description above, where the "cost" field is sent without quotes and comes through as float64. Unquoted numbers are not turned into objects.
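A small check along these lines, assuming an extra unquoted integer column (the "quantity" column is invented for this sketch and is not in the PR), shows that dtype=False only leaves values as they were sent: quoted values stay strings, unquoted numbers stay numeric.

```python
from io import StringIO

import pandas as pd

# "quantity" is a made-up integer column added for illustration.
json_string = (
    '{"columns":["zip_code","cost","quantity"],"index":[0,1],'
    '"data":[["95120",10.45,3],["95128",23.0,7]]}'
)

# StringIO is used because newer pandas deprecates literal JSON strings here.
df = pd.read_json(StringIO(json_string), orient="split", dtype=False)

# Quoted values remain strings; unquoted numbers keep float / integer dtypes.
print(df.dtypes)
```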

Collaborator

@dbczumar dbczumar left a comment


LGTM, though we definitely need docs as well prior to the release.

@mparkhe mparkhe merged commit e20e712 into mlflow:master Mar 8, 2019
@mparkhe mparkhe deleted the pandas_from_json_dtypes branch March 8, 2019 03:00
eedeleon pushed a commit to eedeleon/mlflow that referenced this pull request Mar 13, 2019
* pandas read from json don't infer data types
* added more tests
* Adding int64 columns for json -> pandas

4 participants