Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nest_string()/unnest_string() functions #69

Merged
merged 14 commits into from May 24, 2016
Merged

Conversation

aaronwolen
Copy link
Contributor

This PR slightly modifies unnest() so the transform step from the example isn't necessary and can just be written as:

df <- data.frame(
  x = 1:3,
  y = c("a", "d,e,f", "g,h"),
  stringsAsFactors = FALSE
)
unnest(df, y)

I'm often doing these kind of unnesting operations and just wanted to save myself some typing by letting unnest() handle the string splitting.

This also adds a nest() function so it's possible to round trip:

df %>% unnest(y) %>% nest(y)

@hadley
Copy link
Member

hadley commented Mar 25, 2015

Hmmmm, that's only one type of nesting - the other is when you have something like:

dplyr::data_frame(x = c(1, 2), y = list(1:3, 9:10))

So maybe a more specific name like unnest_string() ?

@aaronwolen
Copy link
Contributor Author

Ah, good call. unnest_string() makes sense and fits with the precedent set by extract_numeric().

But which type of nesting do you think is more common? If it's string nesting then maybe unnest_list() makes more sense?

@aaronwolen
Copy link
Contributor Author

Thinking about it for the past 15 seconds, unnest_list() sounds more like a base::unsplit() replacement than a data.frame function. unnest_string() is probably the way to go.

Are you envisioning nest_string()/unnest_string() and nest()/unnest() for list columns? Or is that nest overload?

@aaronwolen aaronwolen changed the title Simplify unnest's interface and add nest function Add nest_string()/unnest_string() functions Mar 25, 2015
@hadley
Copy link
Member

hadley commented Mar 30, 2015

To me, unnest() seems like the atomic operation (in the sense you combine it with other atoms to do something useful), so I'd like to keep it the primary verb.

#' stringsAsFactors = FALSE
#' )
#' unnest_string(df, y)
unnest_string <- function(data, col = NULL, sep = "[^[:alnum:]]+", ...) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this always need a col?

@hadley
Copy link
Member

hadley commented Mar 30, 2015

Could you PTAL at the build failure?

@aaronwolen
Copy link
Contributor Author

Looks like the build failure was introduced in #59. I added a fix here (9c6fcc4) but can submit a separate PR If you prefer.

@aaronwolen
Copy link
Contributor Author

Thanks for the comments. I see your point about nest_string() and removed it to wrap up this PR.

Any interest in extending unnest() (and unnest_string()) to accept multiple columns as suggested in #44?

@@ -17,6 +17,8 @@ Imports:
dplyr (>= 0.2),
stringi,
lazyeval
Suggets:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this deliberate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to add data.table to the suggested packages to address this build failure, but listing it under a redundant/misspelled key was just a silly mistake (insert embarrassed emoji here).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I just discovered that too and fixed it (in a slightly different way). Could you please merge/rebase?

@hadley
Copy link
Member

hadley commented Apr 16, 2015

Would you mind including a couple of unit tests too please?

@aaronwolen
Copy link
Contributor Author

Sure, I'll rebase and add a few tests.

@coveralls
Copy link

Coverage Status

Coverage increased (+5.75%) to 63.83% when pulling a907a56 on aaronwolen:nest into a2bab46 on hadley:master.

@hadley
Copy link
Member

hadley commented May 22, 2016

I wonder if this should be unnest_separate() since it's similar to separate?

@aaronwolen
Copy link
Contributor Author

Yeah, I think that makes sense. Should I update the PR?

@hadley
Copy link
Member

hadley commented May 22, 2016

Yes, that would be great! I think maybe it should have col and convert arguments to be as similar as possible to separate(). And maybe it should be called separate_rows()? Or something like that.

It would also be useful give this function, extract(), and separate() a common @family, maybe @family string splitting ?

@aaronwolen
Copy link
Contributor Author

From a readability perspective I definitely prefer separate_rows() to unnest_separate().

However, I think an argument can be made that this function is conceptually more similar to unnest(), which produces new rows, than it is to separate(), which produces a new column. If you buy that argument, unnest_*something*() might be more inline with user expectations, and if you don't... well, then separate_rows() it is!

@hadley
Copy link
Member

hadley commented May 23, 2016

It's half-way between both, but I think people are more likely to look for it after discovering that separate() doesn't do what they want. I think if you understand unnest() you might not need unnest_string().

* Preserves grouping
* Avoid modifying grouped variable
* Convert works as expected
#' @export
separate_rows_.grouped_df <- function(data, cols, sep = "[^[:alnum:].]+",
convert = FALSE) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should separate_rows() prevent modification of grouped variables, @hadley?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't - you can copy the approach from separate_:

#' @export
separate_.grouped_df <- function(data, col, into, sep = "[^[:alnum:]]+",
                                 remove = TRUE, convert = FALSE,
                                 extra = "warn", fill = "warn", ...) {
  regroup(NextMethod(), data, if (remove) col)
}

@aaronwolen
Copy link
Contributor Author

Cool, this should be ready for review.

@hadley hadley merged commit 32d9dd2 into tidyverse:master May 24, 2016
@hadley
Copy link
Member

hadley commented May 24, 2016

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants