The ability to specify a schema when using `dfr open` and `dfr into-df`
#11634
Conversation
Includes:
- Some type conversion from Value to a polars schema
- A command for displaying the schema
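As a rough sketch of the schema-display feature described above (the column names and values here are purely illustrative, not taken from the PR's test suite):

```
# Build a dataframe from a nushell record and inspect the schema
# polars inferred for it. Without the -s flag, types come from the
# default Value-to-polars conversion.
{a: 1, b: two} | dfr into-df | dfr schema
```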
Wow, you're making great progress on this PR. Looks like the CI just needs to get happy with it too. 😆 One thing that stood out as a little odd was this.
nushell has standardized on `<>`
@fdncred, I am going to look into the test today. It doesn't fail on my mac, so I'll see if I can get it to fail on my linux box. My first thought was to go with `<>` to be consistent with nushell. I ended up choosing `[]` because that is what polars uses in all their logging messages. Most of the time it wouldn't be an issue, but I would need to re-implement both `dfr schema` and `dfr dtypes` to not rely on polars' DataType `to_string` to make it consistent. Thoughts?
That sounds like a pain. Dataframes are confusing and different enough without adding more differences if we can help it. We'll have to get some other people to chime in.
The consensus so far is that we should only have one way of displaying a generic.
@fdncred, The tests have been fixed, the generic parameter format has been changed, and I have updated the description documentation. This should be good to go. |
Compiling and testing now.
Let's go! This looks rad! Thanks for all the time and effort you put into it!
…f` (nushell#11634)

# Description

There are times when explicitly specifying a schema for a dataframe is needed, such as:

- Opening CSV and JSON lines files, where polars needs more information to keep it from failing, or where the default type conversion should be overridden
- Converting a nushell value to a dataframe while overriding the default conversion behaviors

This pull request provides:

- A flag to allow specifying a schema when using `dfr into-df`
- A flag to allow specifying a schema when using `dfr open` that works for the CSV and JSON types
- A new command, `dfr schema`, which displays schema information and the supported schema dtypes

A schema is specified by creating a record that maps each column name to its dtype. Example usages:

```
{a:1, b:{a:2}} | dfr into-df -s {a: u8, b: {a: i32}} | dfr schema
{a: 1, b: {a: [1 2 3]}, c: [a b c]} | dfr into-df -s {a: u8, b: {a: list<u64>}, c: list<str>} | dfr schema
dfr open -s {pid: i32, ppid: i32, name: str, status: str, cpu: f64, mem: i64, virtual: i64} /tmp/ps.jsonl | dfr schema
```

Supported dtypes:

- null
- bool
- u8, u16, u32, u64
- i8, i16, i32, i64
- f32, f64
- str
- binary
- date
- datetime[time_unit: (ms, us, ns) timezone (optional)]
- duration[time_unit: (ms, us, ns)]
- time
- object
- unknown
- list[dtype]

Structs are also supported, and are specified via a nested record: `{a: u8, b: {d: str}}`

Another feature of the `dfr schema` command is that it returns the data in a format that can be passed back in as a valid schema argument.

(Screenshot, 2024-01-29: `dfr schema` output being reused as a schema argument.)

Co-authored-by: Jack Wright <jack.wright@disqo.com>
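The round-trip property mentioned above can be sketched as follows (the file path and column layout are illustrative assumptions, not taken from the PR):

```
# Capture the schema of a dataframe built from a sample record, then
# reuse it as the -s argument when opening a file with the same shape.
# This relies on `dfr schema` returning a record in the same format
# that the schema flags accept.
let schema = ({pid: 1, name: example} | dfr into-df -s {pid: i32, name: str} | dfr schema)
dfr open -s $schema /tmp/ps.jsonl | dfr schema
```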