Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pivoting #28

Open
moodymudskipper opened this issue Sep 18, 2020 · 7 comments
Open

pivoting #28

moodymudskipper opened this issue Sep 18, 2020 · 7 comments

Comments

@moodymudskipper
Copy link
Owner

moodymudskipper commented Sep 18, 2020

It's part of #29 .

I want to find a syntax that sees pivoting as a grouped operation and supports easier multi-spread or multi-gather ops.

@moodymudskipper
Copy link
Owner Author

moodymudskipper commented Sep 18, 2020

Reshaping Theory

Reshaping has been done through various packages, the best known being {stats} (in base R),
{reshape2}, and {tidyr}. {data.table} used to use {reshape2} and now has its own implementation of melt() and dcast() which behaves slightly differently but not enough to have its own item here.

There are 2 main types of reshaping, reshaping to wide, and reshaping to long.

reshaping wide

in our source table we have 4 types of columns :

  • the id columns, are kept and contain unique combinations in the end
  • the key columns, contain combinations used to define new columns
  • the value column, unique in all known existing solutions, contains the values used to fill new created columns
  • the ignored columns, dropped from the input

The new columns will require names, they are in all existing solutions built from
the values of the key columns.

Finally, we might have a way to aggregate the values when the sets of ids and key columns
are not unique.

Because of this important number of parameters, all approaches use default values to reduce
verbosity. Additionally starting from a given table we can deduce the columns of any 4th type knowing the
3 others, and if we postulate that we have no ignored columns we just need to give 2,
all solutions use this information in different ways.

Other differences are :

  • order of arguments
  • way to give them (naked names, strings, use of dots etc)
  • column positions in the output
  • default separators
  • possibility to include margins
  • possibility to subset before pivoting
  • possibility to apply an aggregation function

pivot to wider

function id columns key columns value column
unstack() ~ rhs lhs ~
reshape() idvar timevar v.names
dcast() lhs ~ ~ rhs value.var
spread() key value
pivot_wider() id_cols names_from values_from

reshaping long

When reshaping long, the key and value columns don't exist yet

pivot to longer

function id columns key columns value column varying columns
stack() select
reshape() idvar timevar v.names varying
melt() id.vars variable.name value.name measure.vars
gather() key value ...
pivot_longer() id_cols names_to values_to cols

for reshape() in order to have column names in the key name we must also use times = names(.)[varying]
(replacing varying by the value fed to varying).

@moodymudskipper
Copy link
Owner Author

moodymudskipper commented Sep 18, 2020

Pivoting to wider

  • id columns : these are actually the grouping columns and we'll treat them this way, i.e. on the rhs of ~
  • value column : as a value, it'll be on the rhs of =
  • key columns : as they describe the names, they're on the lhs of =

so, taking the simplest example from pivot_wider()'s doc :

fish_encounters
#> # A tibble: 114 x 3
#>    fish  station  seen
#>    <fct> <fct>   <int>
#>  1 4842  Release     1
#>  2 4842  I80_1       1
#>  3 4842  Lisbon      1
#>  4 4842  Rstr        1
#>  5 4842  Base_TD     1
#>  6 4842  BCE         1
#>  7 4842  BCW         1
#>  8 4842  BCE2        1
#>  9 4842  BCW2        1
#> 10 4842  MAE         1
#> # ... with 104 more rows
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen)
#> # A tibble: 19 x 12
#>    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
#>    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
#>  1 4842        1     1      1     1       1     1     1     1     1     1     1
#>  2 4843        1     1      1     1       1     1     1     1     1     1     1
#>  3 4844        1     1      1     1       1     1     1     1     1     1     1
#>  4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA

We can write :

fish_encounters %.% {
  { "{station}" = seen } ~ fish
}

A few observations :

  • in the pivot_wider() solution, fish is implicit, we could also have { "{station}" = seen } ~ (unused) (I don't like { "{station}" = seen } ~ . because the dot is already used a lot for other things).
  • all names need to be checked for the presence of "{", maybe we could find a syntax that signals more clearly that we're spreading (edit: only string literals in fact, not all names)
  • We can mix with other aggregation operations, and we can have several spreads in one go
  • pivot_wider has values_fill to replace the NAs, I don't know how to do it here
  • We don't have to name our columns exactly like the column content, we could have written "station_{station}" = (as we do below).
  • We can use several columns as keys too : "spread_{col1}_{col2}" = so no need to unite() beforehand

Let's see how we would handle the multi spread :

us_rent_income %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe))
#> # A tibble: 52 x 6
#>    GEOID NAME                 estimate_income estimate_rent moe_income moe_rent
#>    <chr> <chr>                          <dbl>         <dbl>      <dbl>    <dbl>
#>  1 01    Alabama                        24476           747        136        3
#>  2 02    Alaska                         32940          1200        508       13
#>  3 04    Arizona                        27517           972        148        4
#>  4 05    Arkansas                       23789           709        165        5
#>  5 06    California                     29454          1358        109        3
#>  6 08    Colorado                       32401          1125        109        5
#>  7 09    Connecticut                    35326          1123        195        5
#>  8 10    Delaware                       31560          1076        247       10
#>  9 11    District of Columbia           43198          1424        681       17
#> 10 12    Florida                        25952          1077         70        3
#> # ... with 42 more rows
us_rent_income %.% {
  {
    "estimate_{variable}" = estimate
    "estimate_{moe}" = moe
  } ~ (unused)
}

Our proposal is less general than pivot_wider() here, but I don't think there are many cases where one will want to spread many value columns. if we need it we could allow a data.frame on the rhs, and in that case we can do (though it would mean we can't spread a data.frame, which is another strange/rare thing to do):

us_rent_income %.% {
  { "{.value}_{variable}" = tibble(estimate, moe)   } ~ (unused)
  # or using our  `?selection`
  { "{.value}_{variable}" = ?c("estimate", "moe")   } ~ (unused)
}

Final example :

warpbreaks <- as_tibble(warpbreaks[c("wool", "tension", "breaks")])
warpbreaks
#> # A tibble: 54 x 3
#>    wool  tension breaks
#>    <fct> <fct>    <dbl>
#>  1 A     L           26
#>  2 A     L           30
#>  3 A     L           54
#>  4 A     L           25
#>  5 A     L           70
#>  6 A     L           52
#>  7 A     L           51
#>  8 A     L           26
#>  9 A     L           67
#> 10 A     M           18
#> # ... with 44 more rows
warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = list(breaks = mean)
  )
#> # A tibble: 3 x 3
#>   tension     A     B
#>   <fct>   <dbl> <dbl>
#> 1 L        44.6  28.2
#> 2 M        24    28.8
#> 3 H        24.6  18.8

Here our solution is very straightforward :

warpbreaks %.% {
  { "{wool}" = mean(breaks) } ~ tension
}

with {tidyr} I think we cannot do the following without copying the breaks column first:

warpbreaks %.% {
  { 
    "mean_{wool}" = mean(breaks)
    "median_{wool}" = median(breaks)
   } ~ tension
  # or { "{.value}_{wool} = tibble(mean = mean(breaks), median = median(breaks)) } ~ tension
}

I looked at all the examples from reshape(), dcast() and spread() too, and I believe the solution proposed above can do everything apart from :

  • replacing NAs ({tidyr})
  • adding margins ({reshape2})

I don't have a clear idea on how to fill in NAs and it's a useful feature so I must keep thinking.

Here's a crazy idea, using a pseudo op =fill:<value>=

"mean_{wool}" =fill:0= mean(breaks)

Margins are not an issue specific to reshaping but specific to aggregation, tidyverse won't work on this because margins are not tidy (one row is one obs). But still they're sometimes useful. to be treated in another issue.

@moodymudskipper
Copy link
Owner Author

moodymudskipper commented Sep 18, 2020

Pivoting to longer

We need : id columns, key columns, value column and varying columns

That's one more than to pivot to wider, and it makes the challenge much harder.

The simplest way is to use stack (if we want to stay in base R) :

relig_income %>%
  pivot_longer(
    -religion, 
    names_to = "income", 
    values_to = "count")

becomes :

relig_income %.% {
  stack() ~ religion
}

And then we rename our columns, because stack doesn't support custom names.

Note that we can summarize to more than one row per group, dplyr::summarize() allows it too (since v1). It's unlikely to mix this operation with other operations though, but it would be supported.

This is not satisfactory because pivot_longer() has much fancier features, and unfortunately, the fancier the harder to use/understand. So it'd be nice to have something more intuitive.

Let's try to rewrite the above with a candidate syntax :

relig_income %.% {
  { c("income", "count") = ?(".*") } ~ religion
}

We solve the problem of having one more type of column by using 2 on the lhs. I believe the order is intuitive, first the labels, then the values. ?(".*") ensures we consider all (non grouping) columns.


next example :

billboard %>%
 pivot_longer(
   cols = starts_with("wk"),
   names_to = "week",
   names_prefix = "wk",
   values_to = "rank",
   values_drop_na = TRUE
 )
#> # A tibble: 5,307 x 5
#>    artist  track                   date.entered week   rank
#>    <chr>   <chr>                   <date>       <chr> <dbl>
#>  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   1        87
#>  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   2        82
#>  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   3        72
#>  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   4        77
#>  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   5        87
#>  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   6        94
#>  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   7        99
#>  8 2Ge+her The Hardest Part Of ... 2000-09-02   1        91
#>  9 2Ge+her The Hardest Part Of ... 2000-09-02   2        87
#> 10 2Ge+her The Hardest Part Of ... 2000-09-02   3        92
#> # ... with 5,297 more rows

The issue here is we need to spot a column, and to rename it, we can get close with the following, using previous syntax, but we'll have to rename downstream, and filter out. The output is still arguably more readable, through I would have liked to do the renaming in the gathering operation.

relig_income %.% {
  { c("week", "rank") = ?("^wk") } ~ religion
  week = gsub("^wk", "", week)
  subset(!is.na(rank))
}

We can use regex and capture groups, we could support named capture groups in case of ambiguity :

relig_income %.% {
  { c("week", "rank") = ?("^wk_(.*?)") } ~ religion
  subset(!is.na(rank))
}

A difference is that all columns that match the regex are to be kept, while in principle, the cols argument could impose a subset first. In practice I think it will virtually never be an issue. If we had a good syntax for selection intersection we could leverage it here.


next example :

who %>% 
  pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = c("diagnosis", "gender", "age"),
  names_pattern = "new_?(.*)_(.)(.*)",
  values_to = "count"
)
#> # A tibble: 405,440 x 8
#>    country     iso2  iso3   year diagnosis gender age   count
#>    <chr>       <chr> <chr> <int> <chr>     <chr>  <chr> <int>
#>  1 Afghanistan AF    AFG    1980 sp        m      014      NA
#>  2 Afghanistan AF    AFG    1980 sp        m      1524     NA
#>  3 Afghanistan AF    AFG    1980 sp        m      2534     NA
#>  4 Afghanistan AF    AFG    1980 sp        m      3544     NA
#>  5 Afghanistan AF    AFG    1980 sp        m      4554     NA
#>  6 Afghanistan AF    AFG    1980 sp        m      5564     NA
#>  7 Afghanistan AF    AFG    1980 sp        m      65       NA
#>  8 Afghanistan AF    AFG    1980 sp        f      014      NA
#>  9 Afghanistan AF    AFG    1980 sp        f      1524     NA
#> 10 Afghanistan AF    AFG    1980 sp        f      2534     NA
#> # ... with 405,430 more rows

Here we are going to create several columns, it's quite straightforward following our last example :

who %.% {
  { 
    c("diagnosis", "gender", "age", "count") = ?("^new_(.*)_(.)(.*)") 
  } ~ (unused)
}

cols = new_sp_m014:newrel_f65 means here : take everything that satisfies the pattern, so it's actually a feature not to have to type it. If it was more restrictive, we'd have to select out the columns as a previous step.


last example :

This one is tricky because we want several value columns, pivot_longer() itself is not easy to understand here :

anscombe
#>    x1 x2 x3 x4    y1   y2    y3    y4
#> 1  10 10 10  8  8.04 9.14  7.46  6.58
#> 2   8  8  8  8  6.95 8.14  6.77  5.76
#> 3  13 13 13  8  7.58 8.74 12.74  7.71
#> 4   9  9  9  8  8.81 8.77  7.11  8.84
#> 5  11 11 11  8  8.33 9.26  7.81  8.47
#> 6  14 14 14  8  9.96 8.10  8.84  7.04
#> 7   6  6  6  8  7.24 6.13  6.08  5.25
#> 8   4  4  4 19  4.26 3.10  5.39 12.50
#> 9  12 12 12  8 10.84 9.13  8.15  5.56
#> 10  7  7  7  8  4.82 7.26  6.42  7.91
#> 11  5  5  5  8  5.68 4.74  5.73  6.89
anscombe %>%
 pivot_longer(everything(),
   names_to = c(".value", "set"),
   names_pattern = "(.)(.)"
 )
#> # A tibble: 44 x 3
#>    set       x     y
#>    <chr> <dbl> <dbl>
#>  1 1        10  8.04
#>  2 2        10  9.14
#>  3 3        10  7.46
#>  4 4         8  6.58
#>  5 1         8  6.95
#>  6 2         8  8.14
#>  7 3         8  6.77
#>  8 4         8  5.76
#>  9 1        13  7.58
#> 10 2        13  8.74
#> # ... with 34 more rows

We can borrow the idea and do :

anscombe %.% {
  { 
   c("set", ".value") = ?("(?<.value>.)(.)")  
  } ~ (none)
}

I believe we're better off imposing that the last name on the lhs be the value column(s), this means we have to use named capture to solve this one. but we don't have to name all groups, it can work like named argument to functions, we go by position after matching names.

@moodymudskipper
Copy link
Owner Author

moodymudskipper commented Sep 18, 2020

some ambiguity remains


This looks like a summarizing call and i'll have to inspect the symbols to know it's a pivot op :

fish_encounters %.% {
  { "{station}" = seen } ~ fish
}

Although, strings are highlighter and it would look special, so maybe not that bad.


And this is weird too :

relig_income %.% {
  { c("income", "count") = ?(".*") } ~ religion
}

Here it's not ambiguous because the lhs is not a symbol, and not a call to ?.


Now would we understand right away that the above are pivot operations ? it would be nice to tag them in some way


We often click on [+] to expand and on [-] to collapse, so maybe we could use those ?

fish_encounters %.% {
  { "{station}" = -- seen } ~ fish
}

relig_income %.% {
  { c("income", "count") = ++ ?(".*") } ~ religion
}

maybe better:

fish_encounters %.% {
  { -- "{station}" = seen } ~ fish
}

relig_income %.% {
  { ++ c("income", "count") = ?(".*") } ~ religion
}

Or we can use a different symbol to signal we're pivoting, but we're a bit short on those, this doesn't look so good:

fish_encounters %.% {
  { "{station}" := seen } ~ fish
}

relig_income %.% {
  { c("income", "count") := ?(".*") } ~ religion
}

Or some infix, used for syntax only, or containing the actual logic :

fish_encounters %.% {
  { "{station}" %=wider%  seen } ~ fish
}

relig_income %.% {
  { c("income", "count") %=longer% ?(".*") } ~ religion
}

They don't look really good though, and I don't like using %% ops for assignment.

@moodymudskipper
Copy link
Owner Author

moodymudskipper commented Nov 18, 2020

A couple months later I think to pivot wider and longer respectively those are fine :

fish_encounters %.% {
  { "{station}" = seen } ~ fish
}

relig_income %.% {
  { c("income", "count") = ?(".*") } ~ religion
}

I still think fill is needed for pivoting to wider though, what about :

fish_encounters %.% {
  { "{station}" ~ 0 = seen } ~ fish
}

This has the fill value on the lhs, where we expect only names, so not very intuitive.

or

fish_encounters %.% {
  { "{station}" = seen ~ 0} ~ fish
}

It uses the mutate by syntax, under a summarizing call, which is confusing.

or

fish_encounters %.% {
  { "{station}, fill = 0" = seen} ~ fish
}

This looks clunky for sure, but is readable enough. Can be shortened :

fish_encounters %.% {
  { "{station}/0" = seen} ~ fish
}

we can use '' (simple quotes) to use strings, not the best but maybe the best we can do.

To be compared with

fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)

@moodymudskipper
Copy link
Owner Author

moodymudskipper commented Nov 19, 2020

more on the "fill" syntax.

I think reusing ~ creates unnecessary confusion, we have other operators, I think / and : look the best here

fish_encounters %.% {
  { "{station}" / 0 = seen } ~ fish
  { "{station}" : 0 = seen } ~ fish
}

We could also have a special pseudo operator, or consider | special in this position (might backfire)

fish_encounters %.% {
  { "{station}" = seen %fill% 0} ~ fish
  { "{station}" = seen | 0} ~ fish
  { "{station}" = seen = 0} ~ fish
  { "{station}" = seen ? 0} ~ fish
  { "{station}" = seen := 0} ~ fish
}

There's also an option to use () or []

fish_encounters %.% {
  { "{station}" (0) = seen} ~ fish
  { "{station}" [0] = seen} ~ fish
}

@moodymudskipper
Copy link
Owner Author

moodymudskipper commented Nov 19, 2020

A crazy idea that would solve the ambiguity, we use :=, and for fill, we insert something between : and = :

fish_encounters %.% {
  { "{station}" := seen } ~ fish   # no fill
  { "{station}" :0= seen } ~ fish # fill with 0
}

We'd use := to pivot to longer too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant