The ability to specify a schema when using dfr open and dfr into-df #11634

Merged: 23 commits merged into nushell:main on Jan 29, 2024

ayax79 (Contributor) commented Jan 25, 2024

Description

There are times when explicitly specifying a schema for a dataframe is needed, such as:

  • Opening CSV and JSON lines files and needing to provide more information to polars to keep it from failing, or wanting to override the default type conversion
  • Converting a nushell value to a dataframe and wanting to override the default conversion behavior

This pull request provides:

  • A flag to allow specifying a schema when using dfr into-df
  • A flag to allow specifying a schema when using dfr open, which works for CSV and JSON types
  • A new command, dfr schema, which displays schema information and can also display the supported schema dtypes

A schema is specified by creating a record that maps each key (column name) to its dtype. Example usages:

{a:1, b:{a:2}} | dfr into-df -s {a: u8, b: {a: i32}} | dfr schema
{a: 1, b: {a: [1 2 3]}, c: [a b c]} | dfr into-df -s {a: u8, b: {a: list<u64>}, c: list<str>} | dfr schema
dfr open -s {pid: i32, ppid: i32, name: str, status: str, cpu: f64, mem: i64, virtual: i64} /tmp/ps.jsonl | dfr schema

Supported dtypes:
null
bool
u8
u16
u32
u64
i8
i16
i32
i64
f32
f64
str
binary
date
datetime[time_unit: (ms, us, ns) timezone (optional)]
duration[time_unit: (ms, us, ns)]
time
object
unknown
list<dtype>

Structs are also supported, but they are specified via a nested record:
{a: u8, b: {d: str}}
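
To make the dtype strings above more concrete, here is a minimal sketch of how a string such as `list<u64>` could be parsed recursively. It is only an illustration: the `DType` enum and the `parse_dtype` function are hypothetical names invented for this example, they cover only a subset of the dtypes listed above, and they are not the types or functions used by nushell or polars.

```rust
/// Hypothetical dtype model for illustration only.
/// This is NOT the polars `DataType` enum and covers only a subset of dtypes.
#[derive(Debug, Clone, PartialEq)]
enum DType {
    Bool,
    U8,
    U64,
    I32,
    I64,
    F64,
    Str,
    /// A list whose elements all share the inner dtype, e.g. `list<u64>`.
    List(Box<DType>),
}

/// Parse a dtype string such as `u8`, `list<u64>` or `list<list<str>>`.
/// Hypothetical helper: returns an error message for anything it does not know.
fn parse_dtype(s: &str) -> Result<DType, String> {
    let s = s.trim();

    // Generic form: strip the `list<` prefix and the final `>`,
    // then parse whatever is left as the element dtype.
    if let Some(inner) = s.strip_prefix("list<").and_then(|rest| rest.strip_suffix('>')) {
        return parse_dtype(inner).map(|d| DType::List(Box::new(d)));
    }

    // Simple (non-generic) dtypes.
    match s {
        "bool" => Ok(DType::Bool),
        "u8" => Ok(DType::U8),
        "u64" => Ok(DType::U64),
        "i32" => Ok(DType::I32),
        "i64" => Ok(DType::I64),
        "f64" => Ok(DType::F64),
        "str" => Ok(DType::Str),
        other => Err(format!("unsupported dtype: {other}")),
    }
}
```

In this sketch, struct columns would not go through `parse_dtype` at all, since the description specifies them as nested records rather than as strings. Parameterized dtypes such as datetime and duration are left out for brevity.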

Another feature of the dfr schema command is that it returns its data in a format that can be passed back in as a valid schema argument:

[Image: Screenshot 2024-01-29 at 10 23 58]

fdncred (Collaborator) commented Jan 25, 2024

Wow, you're making great progress on this PR. Looks like the CI just needs to get happy with it too. 😆

One thing that stood out as a little odd was this.

dfr into-df -s {a: u8, b: {a: list[u64]}, c: list[str]} 

nushell has standardized on list<int> and here you have list[u64]. I think it would be better to support list<u64> here if possible. Thoughts?

ayax79 (Contributor, Author) commented Jan 25, 2024

@fdncred, I am going to look into the test today. It doesn't fail on my Mac, so I'll see if I can get it to fail on my Linux box.

My first thought was to go with <> to be consistent with nushell. I ended up choosing [] because that is what polars uses in all its logging messages. Most of the time it wouldn't be an issue, but I would need to re-implement both dfr schema and dfr dtypes so that they don't rely on polars' DataType to_string in order to make it consistent. Thoughts?

fdncred (Collaborator) commented Jan 25, 2024

but I would need to re-implement both dfr schema and dfr dtypes to not rely on the polar's DataType to_string to make it consistent

That sounds like a pain. Dataframes are confusing and different enough without adding more differences, if we can help it. We'll have to get some other people to chime in.

fdncred (Collaborator) commented Jan 25, 2024

The consensus so far is that we should only have 1 way of displaying a generic like list<u8> and not have multiple ways to do it. It just gets too confusing.
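
One way to get a single display form is to stop relying on the upstream string conversion and render the type names locally. The following is a rough sketch under that assumption. The `DType` enum is a hypothetical stand-in invented for this example, not the polars `DataType`, and this is not necessarily how the PR implemented the change.

```rust
use std::fmt;

/// Hypothetical dtype model for illustration only (not the polars `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum DType {
    U8,
    U64,
    Str,
    /// A list whose elements all share the inner dtype.
    List(Box<DType>),
}

impl fmt::Display for DType {
    /// Render generics with angle brackets, e.g. `list<u8>`,
    /// so there is exactly one textual form for each dtype.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DType::U8 => write!(f, "u8"),
            DType::U64 => write!(f, "u64"),
            DType::Str => write!(f, "str"),
            // Recurse through the inner dtype's own Display impl.
            DType::List(inner) => write!(f, "list<{}>", inner),
        }
    }
}
```

With this sketch, `DType::List(Box::new(DType::U8)).to_string()` yields `list<u8>`, and nested lists are rendered by the same recursive rule.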

ayax79 (Contributor, Author) commented Jan 29, 2024

@fdncred, The tests have been fixed, the generic parameter format has been changed, and I have updated the description documentation. This should be good to go.

fdncred (Collaborator) commented Jan 29, 2024

compiling and testing now

fdncred merged commit f879c00 into nushell:main on Jan 29, 2024
19 checks passed
fdncred (Collaborator) commented Jan 29, 2024

Let's go! This looks rad! Thanks for all the time and effort you put into it!

ayax79 deleted the dataframe_schema branch on January 29, 2024 19:37
hustcer added this to the v0.90.0 milestone on Feb 3, 2024
dmatos2012 pushed a commit to dmatos2012/nushell that referenced this pull request Feb 20, 2024
…f` (nushell#11634)
Co-authored-by: Jack Wright <jack.wright@disqo.com>