Skip to content

Commit

Permalink
The ability to specify a schema when using dfr open and `dfr into-d…
Browse files Browse the repository at this point in the history
…f` (#11634)

# Description

There are times where explicitly specifying a schema for a dataframe is
needed such as:
- Opening CSV and JSON lines files and needing provide more information
to polars to keep it from failing or in a desire to override default
type conversion
- When converting a nushell value to a dataframe and wanting to override
the default conversion behaviors.

This pull requests provides:
- A flag to allow specifying a schema when using dfr into-df
- A flag to allow specifying a schema when using dfr open that works for
CSV and JSON types
- A new command `dfr schema` which displays schema information and will
allow display support schema dtypes

Schema is specified creating a record that has the key value and the
dtype. Examples usages:

```
{a:1, b:{a:2}} | dfr into-df -s {a: u8, b: {a: i32}} | dfr schema
{a: 1, b: {a: [1 2 3]}, c: [a b c]} | dfr into-df -s {a: u8, b: {a: list<u64>}, c: list<str>} | dfr schema
 dfr open -s {pid: i32, ppid: i32, name: str, status: str, cpu: f64, mem: i64, virtual: i64} /tmp/ps.jsonl  | dfr schema
```

Supported dtypes:
null                                                   
bool                                                   
u8                                                     
u16                                                    
u32                                                    
u64                                                    
i8                                                     
i16                                                    
i32                                                    
i64                                                    
f32                                                    
f64                                                    
str                                                    
binary                                                 
date                                                   
datetime[time_unit: (ms, us, ns) timezone (optional)]  
duration[time_unit: (ms, us, ns)]                      
time                                                   
object                                                 
unknown                                                
list[dtype]


structs are also supported but are specified via another record:
{a: u8, b: {d: str}}

Another feature with the dfr schema command is that it returns the data
back in a format that can be passed to provide a valid schema that can
be passed in as schema argument:

<img width="638" alt="Screenshot 2024-01-29 at 10 23 58"
src="https://github.com/nushell/nushell/assets/56345/b49c3bff-5cda-4c86-975a-dfd91d991373">

---------

Co-authored-by: Jack Wright <jack.wright@disqo.com>
  • Loading branch information
ayax79 and ayax79 committed Jan 29, 2024
1 parent d03ad6a commit f879c00
Show file tree
Hide file tree
Showing 90 changed files with 2,392 additions and 1,261 deletions.
82 changes: 44 additions & 38 deletions crates/nu-cmd-dataframe/src/dataframe/eager/append.rs
Original file line number Diff line number Diff line change
Expand Up @@ -37,24 +37,27 @@ impl Command for AppendDF {
example: r#"let a = ([[a b]; [1 2] [3 4]] | dfr into-df);
$a | dfr append $a"#,
result: Some(
NuDataFrame::try_from_columns(vec![
Column::new(
"a".to_string(),
vec![Value::test_int(1), Value::test_int(3)],
),
Column::new(
"b".to_string(),
vec![Value::test_int(2), Value::test_int(4)],
),
Column::new(
"a_x".to_string(),
vec![Value::test_int(1), Value::test_int(3)],
),
Column::new(
"b_x".to_string(),
vec![Value::test_int(2), Value::test_int(4)],
),
])
NuDataFrame::try_from_columns(
vec![
Column::new(
"a".to_string(),
vec![Value::test_int(1), Value::test_int(3)],
),
Column::new(
"b".to_string(),
vec![Value::test_int(2), Value::test_int(4)],
),
Column::new(
"a_x".to_string(),
vec![Value::test_int(1), Value::test_int(3)],
),
Column::new(
"b_x".to_string(),
vec![Value::test_int(2), Value::test_int(4)],
),
],
None,
)
.expect("simple df for test should not fail")
.into_value(Span::test_data()),
),
Expand All @@ -64,26 +67,29 @@ impl Command for AppendDF {
example: r#"let a = ([[a b]; [1 2] [3 4]] | dfr into-df);
$a | dfr append $a --col"#,
result: Some(
NuDataFrame::try_from_columns(vec![
Column::new(
"a".to_string(),
vec![
Value::test_int(1),
Value::test_int(3),
Value::test_int(1),
Value::test_int(3),
],
),
Column::new(
"b".to_string(),
vec![
Value::test_int(2),
Value::test_int(4),
Value::test_int(2),
Value::test_int(4),
],
),
])
NuDataFrame::try_from_columns(
vec![
Column::new(
"a".to_string(),
vec![
Value::test_int(1),
Value::test_int(3),
Value::test_int(1),
Value::test_int(3),
],
),
Column::new(
"b".to_string(),
vec![
Value::test_int(2),
Value::test_int(4),
Value::test_int(2),
Value::test_int(4),
],
),
],
None,
)
.expect("simple df for test should not fail")
.into_value(Span::test_data()),
),
Expand Down
11 changes: 7 additions & 4 deletions crates/nu-cmd-dataframe/src/dataframe/eager/drop.rs
Original file line number Diff line number Diff line change
Expand Up @@ -35,10 +35,13 @@ impl Command for DropDF {
description: "drop column a",
example: "[[a b]; [1 2] [3 4]] | dfr into-df | dfr drop a",
result: Some(
NuDataFrame::try_from_columns(vec![Column::new(
"b".to_string(),
vec![Value::test_int(2), Value::test_int(4)],
)])
NuDataFrame::try_from_columns(
vec![Column::new(
"b".to_string(),
vec![Value::test_int(2), Value::test_int(4)],
)],
None,
)
.expect("simple df for test should not fail")
.into_value(Span::test_data()),
),
Expand Down
23 changes: 13 additions & 10 deletions crates/nu-cmd-dataframe/src/dataframe/eager/drop_duplicates.rs
Original file line number Diff line number Diff line change
Expand Up @@ -46,16 +46,19 @@ impl Command for DropDuplicates {
description: "drop duplicates",
example: "[[a b]; [1 2] [3 4] [1 2]] | dfr into-df | dfr drop-duplicates",
result: Some(
NuDataFrame::try_from_columns(vec![
Column::new(
"a".to_string(),
vec![Value::test_int(3), Value::test_int(1)],
),
Column::new(
"b".to_string(),
vec![Value::test_int(4), Value::test_int(2)],
),
])
NuDataFrame::try_from_columns(
vec![
Column::new(
"a".to_string(),
vec![Value::test_int(3), Value::test_int(1)],
),
Column::new(
"b".to_string(),
vec![Value::test_int(4), Value::test_int(2)],
),
],
None,
)
.expect("simple df for test should not fail")
.into_value(Span::test_data()),
),
Expand Down
52 changes: 29 additions & 23 deletions crates/nu-cmd-dataframe/src/dataframe/eager/drop_nulls.rs
Original file line number Diff line number Diff line change
Expand Up @@ -43,20 +43,23 @@ impl Command for DropNulls {
let a = ($df | dfr with-column $res --name res);
$a | dfr drop-nulls"#,
result: Some(
NuDataFrame::try_from_columns(vec![
Column::new(
"a".to_string(),
vec![Value::test_int(1), Value::test_int(1)],
),
Column::new(
"b".to_string(),
vec![Value::test_int(2), Value::test_int(2)],
),
Column::new(
"res".to_string(),
vec![Value::test_int(1), Value::test_int(1)],
),
])
NuDataFrame::try_from_columns(
vec![
Column::new(
"a".to_string(),
vec![Value::test_int(1), Value::test_int(1)],
),
Column::new(
"b".to_string(),
vec![Value::test_int(2), Value::test_int(2)],
),
Column::new(
"res".to_string(),
vec![Value::test_int(1), Value::test_int(1)],
),
],
None,
)
.expect("simple df for test should not fail")
.into_value(Span::test_data()),
),
Expand All @@ -66,15 +69,18 @@ impl Command for DropNulls {
example: r#"let s = ([1 2 0 0 3 4] | dfr into-df);
($s / $s) | dfr drop-nulls"#,
result: Some(
NuDataFrame::try_from_columns(vec![Column::new(
"div_0_0".to_string(),
vec![
Value::test_int(1),
Value::test_int(1),
Value::test_int(1),
Value::test_int(1),
],
)])
NuDataFrame::try_from_columns(
vec![Column::new(
"div_0_0".to_string(),
vec![
Value::test_int(1),
Value::test_int(1),
Value::test_int(1),
Value::test_int(1),
],
)],
None,
)
.expect("simple df for test should not fail")
.into_value(Span::test_data()),
),
Expand Down
26 changes: 15 additions & 11 deletions crates/nu-cmd-dataframe/src/dataframe/eager/dtypes.rs
Original file line number Diff line number Diff line change
Expand Up @@ -31,16 +31,19 @@ impl Command for DataTypes {
description: "Dataframe dtypes",
example: "[[a b]; [1 2] [3 4]] | dfr into-df | dfr dtypes",
result: Some(
NuDataFrame::try_from_columns(vec![
Column::new(
"column".to_string(),
vec![Value::test_string("a"), Value::test_string("b")],
),
Column::new(
"dtype".to_string(),
vec![Value::test_string("i64"), Value::test_string("i64")],
),
])
NuDataFrame::try_from_columns(
vec![
Column::new(
"column".to_string(),
vec![Value::test_string("a"), Value::test_string("b")],
),
Column::new(
"dtype".to_string(),
vec![Value::test_string("i64"), Value::test_string("i64")],
),
],
None,
)
.expect("simple df for test should not fail")
.into_value(Span::test_data()),
),
Expand Down Expand Up @@ -79,6 +82,7 @@ fn command(
.dtype();

let dtype_str = dtype.to_string();

dtypes.push(Value::string(dtype_str, call.head));

Value::string(*v, call.head)
Expand All @@ -88,7 +92,7 @@ fn command(
let names_col = Column::new("column".to_string(), names);
let dtypes_col = Column::new("dtype".to_string(), dtypes);

NuDataFrame::try_from_columns(vec![names_col, dtypes_col])
NuDataFrame::try_from_columns(vec![names_col, dtypes_col], None)
.map(|df| PipelineData::Value(df.into_value(call.head), None))
}

Expand Down
22 changes: 14 additions & 8 deletions crates/nu-cmd-dataframe/src/dataframe/eager/filter_with.rs
Original file line number Diff line number Diff line change
Expand Up @@ -43,10 +43,13 @@ impl Command for FilterWith {
example: r#"let mask = ([true false] | dfr into-df);
[[a b]; [1 2] [3 4]] | dfr into-df | dfr filter-with $mask"#,
result: Some(
NuDataFrame::try_from_columns(vec![
Column::new("a".to_string(), vec![Value::test_int(1)]),
Column::new("b".to_string(), vec![Value::test_int(2)]),
])
NuDataFrame::try_from_columns(
vec![
Column::new("a".to_string(), vec![Value::test_int(1)]),
Column::new("b".to_string(), vec![Value::test_int(2)]),
],
None,
)
.expect("simple df for test should not fail")
.into_value(Span::test_data()),
),
Expand All @@ -55,10 +58,13 @@ impl Command for FilterWith {
description: "Filter dataframe using an expression",
example: "[[a b]; [1 2] [3 4]] | dfr into-df | dfr filter-with ((dfr col a) > 1)",
result: Some(
NuDataFrame::try_from_columns(vec![
Column::new("a".to_string(), vec![Value::test_int(3)]),
Column::new("b".to_string(), vec![Value::test_int(4)]),
])
NuDataFrame::try_from_columns(
vec![
Column::new("a".to_string(), vec![Value::test_int(3)]),
Column::new("b".to_string(), vec![Value::test_int(4)]),
],
None,
)
.expect("simple df for test should not fail")
.into_value(Span::test_data()),
),
Expand Down
34 changes: 20 additions & 14 deletions crates/nu-cmd-dataframe/src/dataframe/eager/first.rs
Original file line number Diff line number Diff line change
Expand Up @@ -44,10 +44,13 @@ impl Command for FirstDF {
description: "Return the first row of a dataframe",
example: "[[a b]; [1 2] [3 4]] | dfr into-df | dfr first",
result: Some(
NuDataFrame::try_from_columns(vec![
Column::new("a".to_string(), vec![Value::test_int(1)]),
Column::new("b".to_string(), vec![Value::test_int(2)]),
])
NuDataFrame::try_from_columns(
vec![
Column::new("a".to_string(), vec![Value::test_int(1)]),
Column::new("b".to_string(), vec![Value::test_int(2)]),
],
None,
)
.expect("should not fail")
.into_value(Span::test_data()),
),
Expand All @@ -56,16 +59,19 @@ impl Command for FirstDF {
description: "Return the first two rows of a dataframe",
example: "[[a b]; [1 2] [3 4]] | dfr into-df | dfr first 2",
result: Some(
NuDataFrame::try_from_columns(vec![
Column::new(
"a".to_string(),
vec![Value::test_int(1), Value::test_int(3)],
),
Column::new(
"b".to_string(),
vec![Value::test_int(2), Value::test_int(4)],
),
])
NuDataFrame::try_from_columns(
vec![
Column::new(
"a".to_string(),
vec![Value::test_int(1), Value::test_int(3)],
),
Column::new(
"b".to_string(),
vec![Value::test_int(2), Value::test_int(4)],
),
],
None,
)
.expect("should not fail")
.into_value(Span::test_data()),
),
Expand Down
11 changes: 7 additions & 4 deletions crates/nu-cmd-dataframe/src/dataframe/eager/get.rs
Original file line number Diff line number Diff line change
Expand Up @@ -36,10 +36,13 @@ impl Command for GetDF {
description: "Returns the selected column",
example: "[[a b]; [1 2] [3 4]] | dfr into-df | dfr get a",
result: Some(
NuDataFrame::try_from_columns(vec![Column::new(
"a".to_string(),
vec![Value::test_int(1), Value::test_int(3)],
)])
NuDataFrame::try_from_columns(
vec![Column::new(
"a".to_string(),
vec![Value::test_int(1), Value::test_int(3)],
)],
None,
)
.expect("simple df for test should not fail")
.into_value(Span::test_data()),
),
Expand Down
11 changes: 7 additions & 4 deletions crates/nu-cmd-dataframe/src/dataframe/eager/last.rs
Original file line number Diff line number Diff line change
Expand Up @@ -40,10 +40,13 @@ impl Command for LastDF {
description: "Create new dataframe with last rows",
example: "[[a b]; [1 2] [3 4]] | dfr into-df | dfr last 1",
result: Some(
NuDataFrame::try_from_columns(vec![
Column::new("a".to_string(), vec![Value::test_int(3)]),
Column::new("b".to_string(), vec![Value::test_int(4)]),
])
NuDataFrame::try_from_columns(
vec![
Column::new("a".to_string(), vec![Value::test_int(3)]),
Column::new("b".to_string(), vec![Value::test_int(4)]),
],
None,
)
.expect("simple df for test should not fail")
.into_value(Span::test_data()),
),
Expand Down
2 changes: 1 addition & 1 deletion crates/nu-cmd-dataframe/src/dataframe/eager/melt.rs
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,7 @@ impl Command for MeltDF {
Value::test_string("c"),
],
),
])
], None)
.expect("simple df for test should not fail")
.into_value(Span::test_data()),
),
Expand Down
Loading

0 comments on commit f879c00

Please sign in to comment.