ENH: new orient setting for read_json to support common API format #39913

Open
ryancasburn-KAI opened this issue Feb 19, 2021 · 2 comments
Labels: Enhancement, IO JSON (read_json, to_json, json_normalize)

Comments

@ryancasburn-KAI

ryancasburn-KAI commented Feb 19, 2021

Is your feature request related to a problem?

Many APIs return results in this form:

{
  "results": [
    {"a": 1, "b": 2},
    {"a": 3, "b": 4}
  ]
}

pandas does not support this format directly. The data is in the "records" orient, but it is wrapped in an extra outer layer. To load it, I currently use the requests module to fetch the data over https, the json module to strip the outer layer, and then pass the result to pd.read_json as text. This feels like overkill: pandas can already read from https, yet this format (which is common for APIs) needs extra packages and several lines of code.

This change would turn a three-import, multi-line workaround into a single-import, single-line solution.

Describe the solution you'd like

While I initially described this as a new orient, I don't think that is the best way to implement it. I believe read_json should have a new parameter (such as "strip_layer") whose value is the key of the outer layer. In the example above that would be "results". I suggest a parameter because the data inside the outer layer could be in any of several orients, so that choice needs to stay open. The stripping would happen first, and then the data would be processed as usual.
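To make the "strip first, then process" idea concrete, here is a minimal sketch using only the json module. The helper name strip_layer mirrors the parameter proposed above, but it is hypothetical: it is not an existing pandas function, and the error handling is only one possible choice.

```python
import json


def strip_layer(text, key):
    """Hypothetical helper: return the JSON text stored under `key`.

    `text` is a JSON document whose top level is an object, e.g.
    {"results": [...]}. The returned string holds only the inner value,
    which can be in any orient that read_json understands.
    """
    payload = json.loads(text)
    if not isinstance(payload, dict) or key not in payload:
        raise KeyError(f"outer layer {key!r} not found")
    return json.dumps(payload[key])


# Sample payload in the shape described in this issue.
wrapped = '{"results": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}'
inner = strip_layer(wrapped, "results")
print(inner)
```

The returned text is plain "records"-orient JSON, so it could then be handed to read_json with whatever orient the inner data actually uses.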

API breaking implications

Need to consider what this means for chunking.
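On the chunking point: read_json accepts chunksize only together with lines=True, and a wrapped payload is a single JSON document rather than line-delimited JSON. One possible reading is that the outer layer has to be parsed in full before any chunking can happen. The sketch below illustrates that under this assumption; iter_record_chunks is a hypothetical name, not a pandas API.

```python
import json


def iter_record_chunks(text, key, chunksize):
    """Hypothetical helper: yield lists of records, `chunksize` at a time.

    The whole document is parsed up front, because the outer layer
    has to be read before the inner records are reachable.
    """
    records = json.loads(text)[key]
    for start in range(0, len(records), chunksize):
        yield records[start:start + chunksize]


wrapped = '{"results": [{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]}'
chunks = list(iter_record_chunks(wrapped, "results", 2))
print(len(chunks))
```

Each yielded list could be passed to pd.DataFrame, but this saves no memory during parsing, which is the part that would need discussion.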

Additional context

My current code:

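import pandas as pd
import requests
import json

# url is the https endpoint that returns the wrapped JSON (not shown here)
data = requests.get(url).json()  # fetch and parse the full response
data = data["results"]           # strip the outer layer
data = json.dumps(data)          # back to JSON text
data = pd.read_json(data)        # parse the text into a DataFrame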

versus my desired code with this improvement:

import pandas as pd

data = pd.read_json(url, strip_layer="results")

Might I suggest this gets added to the IO Method Robustness/Input Types Project?

ryancasburn-KAI added the Enhancement and Needs Triage labels on Feb 19, 2021
@attack68
Contributor

You don't have to dump and re-read. What about:

data = requests.get(url).json()
data = pd.DataFrame(data["results"])

But I see your point.

@ryancasburn-KAI
Author

True, my example wasn't the most efficient. It still needs another package either way.

attack68 added the IO JSON and Styler labels and removed the Needs Triage label on Feb 19, 2021
attack68 removed the Styler label on Jul 11, 2021