-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serialisation with indention #21
Conversation
Codecov Report
@@ Coverage Diff @@
## master #21 +/- ##
===========================================
+ Coverage 67.87% 86.97% +19.09%
===========================================
Files 7 8 +1
Lines 193 238 +45
===========================================
+ Hits 131 207 +76
+ Misses 62 31 -31
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. The results look great, judging by the tests. We should work a bit on the code.
R/nested.R
Outdated
raw <- serialize_parse_data_nested_helper(pd_nested, pass_indent = 0) %>% | ||
unlist() | ||
newline <- which(raw == "\n") | ||
token <- setdiff(1:length(raw), union(which(raw == ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seq_along()
is slightly safer. Why do we need this post-processing at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Well I could not find a way how to add indention to the first token on a line only. Also, it seems not a big performance deal to just remove the spaces immediately before every token that are not proceeded by a new line in the serialisation. I think it's safe but not very nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give an example that fails without postprocessing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generally every multi-line input for which at least one level of indention is required and there is more than one token per line. Since the indention level is currently added before every token all but the first token end up with too many spaces before them.
Here, I drop the lines 127 to 131 in nested.R, which are
newline <- which(raw == "\n")
token <- setdiff(1:length(raw), union(which(raw == ""),
union(grep("^ +$", raw), newline)))
to_zero <- setdiff(token - 1, newline + 1)
raw[to_zero] <- ""
Using reprex , we get the following example:
library(magrittr)
indented_multi_line_random <- c(
" call( ",
" 1,",
" call2( ",
" 2, 3,", "call3(1, 2, 22),",
" 5",
"),",
" 144",
")")
raw <- styler:::compute_parse_data_nested(indented_multi_line_random) %>%
styler:::create_filler_nested() %>%
styler:::indent_round_nested() %>%
styler:::serialize_parse_data_nested_helper(pass_indent = 0) %>%
unlist()
out <- raw %>%
paste0(collapse = "") %>%
strsplit("\n", fixed = TRUE) %>%
.[[1L]]
out
#> [1] "call("
#> [2] " 1,"
#> [3] " call2 ("
#> [4] " 2 , 3 ,"
#> [5] " call3 ( 1 , 2 , 22 ) ,"
#> [6] " 5"
#> [7] " ),"
#> [8] " 144"
#> [9] ")"
cat(out, sep = "\n")
#> call(
#> 1,
#> call2 (
#> 2 , 3 ,
#> call3 ( 1 , 2 , 22 ) ,
#> 5
#> ),
#> 144
#> )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a problem with the current data structure: The line break information (newline
variable) is given as newline after a token, but perhaps the information that a newline appears before a token is more useful. Can you replace newline
with a new variable lag_newline = lag(newline)
on the flat parse data (before nesting), and work with that? I hope this will eliminate the need for postprocessing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I tried that and it works. The question is how exactly it should be implemented. I see two possible approaches.
-
Use lag_newlines only (and drop newlines) and always inserting new lines and indention before a token, spaces after it.
-
Use lag_newlines to determine whether there is indention before a token. Continue using newlines and spaces for inserting non-indention spacing after each token.
I think for now we can stick with the former.
However, the latter might be preferable because it is more flexible in the sense that spacing information other than our computed indention can technically be stored. At the moment, we just set spacing after a token to zero when a line break comes after the token (see R/modify_pd.R, line 20). Maybe the flexibility is not needed later and it just complicates things, so I suggest to go with the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should try resorting only on lag_newlines
. Do we still need the postprocessing now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's not necessary anymore. In the next step, I can maybe even see if we need newlines at all, maybe we can get rid of it completely.
R/nested.R
Outdated
pd_nested$spaces, pd_nested$newlines, pd_nested$indent), | ||
function(terminal, text, child, spaces, newlines, indent) { | ||
if (terminal) { | ||
c(rep_char(" ", pass_indent), text, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can text
contain newlines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in general I think it can, but if this if clause applies, we are in a terminal token, for which text cannot be newlines ("\n") I think. It could be a string like "'\n'" though, but then it would remain a string and not get evaluated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about newlines embedded in strings, like this:
"a
b"
But we just cannot apply indention on these without changing their semantics. So the point is moot.
R/modify_pd.R
Outdated
indent_round <- function(pd, indent_by) { | ||
start <- which(pd$token == "'('") + 1 | ||
stop <- which(pd$token == "')'") - 1 | ||
if (length(start) == 0 && length(stop) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand what you mean. Probably it's enough to check for just one of them to be zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, resolved that now differently anyways.
R/modify_pd.R
Outdated
#' Update indention information of parse data | ||
#' | ||
#' @param pd A nested or flat parse table that is already enhanced with | ||
#' line break and space information via [create_filler] or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[create_filler()]
gives nicer markup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for that one, should know it by now...
R/modify_pd.R
Outdated
|
||
#' @rdname update_indention | ||
indent_round <- function(pd, indent_by) { | ||
start <- which(pd$token == "'('") + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there cases where start != 2
or stop != nrow(pd)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, do you mean stop != nrow(pd) - 1
?
Probably not in the case of round brackets. However, I wanted to keep the function rather general, since later on, we probably want to use a closure that returns functions like indent_round
, but also one that indents after operators such as %>%
.
data_frame(x = 1, y = 2) %>%
__ do_one_thing()
Here, we want stop = nrow(pd)
, which is different from the round brackets, where probably always stop != nrow(pd) -1
.
But anyways, I think you are right, in the end we might just use two different closures then, i.e. one for backets and one for other operators such as %>%
+
etc.
The thing here is rather that you need which
since it might also be the case that there are no round brackets and you want indention to be zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to use a different handler for other expression types, but we can think about it when we're done with the indention of function calls.
R/modify_pd.R
Outdated
pd$indent <- ifelse(1:nrow(pd) %in% start[1]:stop[1], indent_by, 0) * | ||
lag(pd$newlines, default = 0) | ||
} | ||
# general, should maybe not go here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, we could wrap this in a function that updates the spaces
attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you going to extract a function here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I will do that separately.
code <- "a <- xyz(x, 22, if(x > 1) 33 else 4)" | ||
|
||
back_and_forth <- code %>% | ||
styler:::compute_parse_data_nested() %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can omit styler:::
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Maybe our data structure needs some tweaking?
R/nested.R
Outdated
raw <- serialize_parse_data_nested_helper(pd_nested, pass_indent = 0) %>% | ||
unlist() | ||
newline <- which(raw == "\n") | ||
token <- setdiff(1:length(raw), union(which(raw == ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a problem with the current data structure: The line break information (newline
variable) is given as newline after a token, but perhaps the information that a newline appears before a token is more useful. Can you replace newline
with a new variable lag_newline = lag(newline)
on the flat parse data (before nesting), and work with that? I hope this will eliminate the need for postprocessing.
R/modify_pd.R
Outdated
|
||
#' @rdname update_indention | ||
indent_round <- function(pd, indent_by) { | ||
start <- which(pd$token == "'('") + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to use a different handler for other expression types, but we can think about it when we're done with the indention of function calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, have you pushed new changes?
R/nested.R
Outdated
raw <- serialize_parse_data_nested_helper(pd_nested, pass_indent = 0) %>% | ||
unlist() | ||
newline <- which(raw == "\n") | ||
token <- setdiff(1:length(raw), union(which(raw == ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should try resorting only on lag_newlines
. Do we still need the postprocessing now?
R/modify_pd.R
Outdated
pd$indent <- ifelse(1:nrow(pd) %in% start[1]:stop[1], indent_by, 0) * | ||
lag(pd$newlines, default = 0) | ||
} | ||
# general, should maybe not go here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you going to extract a function here?
No I have not yet pushed because things in this PR depended quite a bit on how to use |
* use lag_newlines instead of post-processing to handle indention properly. * use specific handler for indent_round(): (seq_along, new start / stop), remove dependency with new_lines. * utility functions add_newlines and add_spaces due to use of lag_newlines.
Regarding EOL stripping, I am not sure whether that is the best way to do it. I mean I looked at how we can integrate these internal functions with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks great. Could you please look at my final comments and merge, and also create new issues for the comments not covered in this PR?
unlist() %>% | ||
paste0(collapse = "") %>% | ||
strsplit("\n", fixed = TRUE) %>% | ||
.[[1L]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dplyr 0.7.0 has pull()
, but we can address this in a separate PR
|
||
## ............................................................................ | ||
|
||
indented_multi_line_correct <- c( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is easier in example files (like in.R
and out.R
). Perhaps you can extend to support an arbitrary number of in.R
files? Again, separate PR.
R/modify_pd.R
Outdated
|
||
#' @rdname update_indention | ||
indent_round <- function(pd, indent_by) { | ||
if (any(pd$token == "')'")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we look only at pd$token[[2]]
or pd$token[[nrow(pd)]]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so because we have some tibbles with only one row. Looking at the last token is probably also not a good idea because we have two cases
mycall(1, 2)
and (1 + x)
Which have their opening brace at the second and the first position respectively. I think that's really the reason why I implemented the approach with which
in the first place: To get the start right.
I think this function should really handle both cases and it's easier to implement it in one function instead of two. Hence, I propose to go back to the which
approach, at least for the start. I think the end is unambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't aware that this also covers parens when they are used to group arithmetic expressions. Would you mind adding a test?
} else { | ||
start <- stop <- 0 | ||
} | ||
pd$indent <- ifelse(seq_len(nrow(pd)) %in% start:stop, indent_by, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might look clearer if we move this assignment to the if
branch, and remove the else
branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you @krlmlr mean. We need to assign a value to pd$indent
anyways, since this column is used in the recursion of serialize_parse_data_nested_helper
. Otherwise we have to change this function so it can handle the case for which the column does not exist, which I think is not a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. Let's leave it for now.
R/modify_pd.R
Outdated
#' @rdname update_indention | ||
indent_round <- function(pd, indent_by) { | ||
if (any(pd$token == "')'")) { | ||
start <- 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check boundaries: Do we really need to indent the second token, or perhaps only starting from the third?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above, I think we need to explicitly find the start token because of the two scenarios.
indent_round_nested <- function(pd) { | ||
if (is.null(pd)) return(pd) | ||
pd <- indent_round(pd, indent_by = 2) | ||
pd$child <- map(pd$child, indent_round_nested) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer mutate()
over sub-assignment here and in the functions below and above. This should work well with dplyr 0.7.0. Separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. In a pipe like this?
pd <- pd %>%
indent_round(pd, indent_by = 2) %>%
mutate(child = map(child, indent_round_nested)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. What exactly do you want to add to the style guide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoiding sub assignment if possible. See here. I think this particular case is not yet covered. We won't be able to enforce that with styler though I think.
This PR contains an implementation of functions
indent_round
andserialize_parse_data_nested
, so thatindent_round
)serialize_parse_data_nested
).Further, a new utility function
newlines_and_spaces
is introduced to facilitate a common operation.These functions are accompanied by tests for function calls, a vignette and further documentation.