Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialisation with indention #21

Merged
merged 14 commits into from
Jun 14, 2017
Merged

Conversation

lorenzwalthert
Copy link
Collaborator

This PR contains an implementation of functions indent_round and serialize_parse_data_nested, so that

  • indention information for round brackets can be added to the nested parse table (indent_round)
  • and this information can be used when serialising the parse table (serialize_parse_data_nested).

Further, a new utility function newlines_and_spaces is introduced to facilitate a common operation.
These functions are accompanied by tests for function calls, a vignette and further documentation.

@codecov
Copy link

codecov bot commented Jun 12, 2017

Codecov Report

Merging #21 into master will increase coverage by 19.09%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master      #21       +/-   ##
===========================================
+ Coverage   67.87%   86.97%   +19.09%     
===========================================
  Files           7        8        +1     
  Lines         193      238       +45     
===========================================
+ Hits          131      207       +76     
+ Misses         62       31       -31
Impacted Files Coverage Δ
R/nested.R 100% <100%> (+100%) ⬆️
R/parsed.R 98.03% <100%> (-0.04%) ⬇️
R/utils.R 100% <100%> (ø) ⬆️
R/modify_pd.R 100% <100%> (ø)
R/nested_to_tree.R 0% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c41c50...8eee6ae. Read the comment docs.

Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. The results look great, judging by the tests. We should work a bit on the code.

R/nested.R Outdated
raw <- serialize_parse_data_nested_helper(pd_nested, pass_indent = 0) %>%
unlist()
newline <- which(raw == "\n")
token <- setdiff(1:length(raw), union(which(raw == ""),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seq_along() is slightly safer. Why do we need this post-processing at all?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Well I could not find a way how to add indention to the first token on a line only. Also, it seems not a big performance deal to just remove the spaces immediately before every token that are not proceeded by a new line in the serialisation. I think it's safe but not very nice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an example that fails without postprocessing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally every multi-line input for which at least one level of indention is required and there is more than one token per line. Since the indention level is currently added before every token all but the first token end up with too many spaces before them.
Here, I drop the lines 127 to 131 in nested.R, which are

  newline <- which(raw == "\n")
  token <- setdiff(1:length(raw), union(which(raw == ""),
                                        union(grep("^ +$", raw), newline)))
  to_zero <- setdiff(token - 1, newline + 1)
  raw[to_zero] <- ""

Using reprex , we get the following example:

library(magrittr)
indented_multi_line_random <- c(
"            call(       ", 
"  1,", 
"  call2(               ", 
"    2, 3,", "call3(1, 2, 22),", 
"    5", 
"),",
"           144", 
")")
raw <- styler:::compute_parse_data_nested(indented_multi_line_random) %>%
  styler:::create_filler_nested() %>% 
  styler:::indent_round_nested() %>% 
  styler:::serialize_parse_data_nested_helper(pass_indent = 0) %>% 
  unlist()

out <- raw %>% 
  paste0(collapse = "") %>% 
  strsplit("\n", fixed = TRUE) %>% 
  .[[1L]]

out
#> [1] "call("                                             
#> [2] "  1,"                                              
#> [3] "  call2  ("                                        
#> [4] "    2  ,   3  ,"                                   
#> [5] "    call3    (    1    ,     2    ,     22    )  ,"
#> [6] "    5"                                             
#> [7] "  ),"                                              
#> [8] "  144"                                             
#> [9] ")"
cat(out, sep = "\n")
#> call(
#>   1,
#>   call2  (
#>     2  ,   3  ,
#>     call3    (    1    ,     2    ,     22    )  ,
#>     5
#>   ),
#>   144
#> )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a problem with the current data structure: The line break information (newline variable) is given as newline after a token, but perhaps the information that a newline appears before a token is more useful. Can you replace newline with a new variable lag_newline = lag(newline) on the flat parse data (before nesting), and work with that? I hope this will eliminate the need for postprocessing.

Copy link
Collaborator Author

@lorenzwalthert lorenzwalthert Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I tried that and it works. The question is how exactly it should be implemented. I see two possible approaches.

  • Use lag_newlines only (and drop newlines) and always inserting new lines and indention before a token, spaces after it.

  • Use lag_newlines to determine whether there is indention before a token. Continue using newlines and spaces for inserting non-indention spacing after each token.

I think for now we can stick with the former.
However, the latter might be preferable because it is more flexible in the sense that spacing information other than our computed indention can technically be stored. At the moment, we just set spacing after a token to zero when a line break comes after the token (see R/modify_pd.R, line 20). Maybe the flexibility is not needed later and it just complicates things, so I suggest to go with the former.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should try resorting only on lag_newlines. Do we still need the postprocessing now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's not necessary anymore. In the next step, I can maybe even see if we need newlines at all, maybe we can get rid of it completely.

R/nested.R Outdated
pd_nested$spaces, pd_nested$newlines, pd_nested$indent),
function(terminal, text, child, spaces, newlines, indent) {
if (terminal) {
c(rep_char(" ", pass_indent), text,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can text contain newlines?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in general I think it can, but if this if clause applies, we are in a terminal token, for which text cannot be newlines ("\n") I think. It could be a string like "'\n'" though, but then it would remain a string and not get evaluated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about newlines embedded in strings, like this:

"a
b"

But we just cannot apply indention on these without changing their semantics. So the point is moot.

R/modify_pd.R Outdated
indent_round <- function(pd, indent_by) {
start <- which(pd$token == "'('") + 1
stop <- which(pd$token == "')'") - 1
if (length(start) == 0 && length(stop) == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check logic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand what you mean. Probably it's enough to check for just one of them to be zero.

Copy link
Collaborator Author

@lorenzwalthert lorenzwalthert Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, resolved that now differently anyways.

R/modify_pd.R Outdated
#' Update indention information of parse data
#'
#' @param pd A nested or flat parse table that is already enhanced with
#' line break and space information via [create_filler] or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[create_filler()] gives nicer markup.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for that one, should know it by now...

R/modify_pd.R Outdated

#' @rdname update_indention
indent_round <- function(pd, indent_by) {
start <- which(pd$token == "'('") + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there cases where start != 2 or stop != nrow(pd) ?

Copy link
Collaborator Author

@lorenzwalthert lorenzwalthert Jun 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, do you mean stop != nrow(pd) - 1?
Probably not in the case of round brackets. However, I wanted to keep the function rather general, since later on, we probably want to use a closure that returns functions like indent_round, but also one that indents after operators such as %>%.
data_frame(x = 1, y = 2) %>%
__ do_one_thing()

Here, we want stop = nrow(pd), which is different from the round brackets, where probably always stop != nrow(pd) -1 .
But anyways, I think you are right, in the end we might just use two different closures then, i.e. one for backets and one for other operators such as %>% + etc.

The thing here is rather that you need which since it might also be the case that there are no round brackets and you want indention to be zero.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to use a different handler for other expression types, but we can think about it when we're done with the indention of function calls.

R/modify_pd.R Outdated
pd$indent <- ifelse(1:nrow(pd) %in% start[1]:stop[1], indent_by, 0) *
lag(pd$newlines, default = 0)
}
# general, should maybe not go here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, we could wrap this in a function that updates the spaces attribute.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will do so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you going to extract a function here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I will do that separately.

code <- "a <- xyz(x, 22, if(x > 1) 33 else 4)"

back_and_forth <- code %>%
styler:::compute_parse_data_nested() %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can omit styler::: .

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will do so.

Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Maybe our data structure needs some tweaking?

R/nested.R Outdated
raw <- serialize_parse_data_nested_helper(pd_nested, pass_indent = 0) %>%
unlist()
newline <- which(raw == "\n")
token <- setdiff(1:length(raw), union(which(raw == ""),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a problem with the current data structure: The line break information (newline variable) is given as newline after a token, but perhaps the information that a newline appears before a token is more useful. Can you replace newline with a new variable lag_newline = lag(newline) on the flat parse data (before nesting), and work with that? I hope this will eliminate the need for postprocessing.

R/modify_pd.R Outdated

#' @rdname update_indention
indent_round <- function(pd, indent_by) {
start <- which(pd$token == "'('") + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to use a different handler for other expression types, but we can think about it when we're done with the indention of function calls.

Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, have you pushed new changes?

R/nested.R Outdated
raw <- serialize_parse_data_nested_helper(pd_nested, pass_indent = 0) %>%
unlist()
newline <- which(raw == "\n")
token <- setdiff(1:length(raw), union(which(raw == ""),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should try resorting only on lag_newlines. Do we still need the postprocessing now?

R/modify_pd.R Outdated
pd$indent <- ifelse(1:nrow(pd) %in% start[1]:stop[1], indent_by, 0) *
lag(pd$newlines, default = 0)
}
# general, should maybe not go here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you going to extract a function here?

@lorenzwalthert
Copy link
Collaborator Author

No I have not yet pushed because things in this PR depended quite a bit on how to use lag_newlines and I was not sure which approach you prefer. Now it's clear.

* use lag_newlines instead of post-processing to handle indention properly.
* use specific handler for indent_round(): (seq_along, new start / stop),
  remove dependency with new_lines.
* utility functions add_newlines and add_spaces due to use of lag_newlines.
@lorenzwalthert
Copy link
Collaborator Author

lorenzwalthert commented Jun 13, 2017

Regarding EOL stripping, I am not sure whether that is the best way to do it. I mean I looked at how we can integrate these internal functions with make_transformer() and I think we can use all (flat) functions from R/rules.R without the need of writing nested versions of them (see this commit in another branch). Hence, maybe the EOL stripping should just be one of them - or maybe not, if you don't want to think of EOL stripping as it as a rule, but it can be thought of being one.

Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks great. Could you please look at my final comments and merge, and also create new issues for the comments not covered in this PR?

unlist() %>%
paste0(collapse = "") %>%
strsplit("\n", fixed = TRUE) %>%
.[[1L]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dplyr 0.7.0 has pull(), but we can address this in a separate PR


## ............................................................................

indented_multi_line_correct <- c(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is easier in example files (like in.R and out.R). Perhaps you can extend to support an arbitrary number of in.R files? Again, separate PR.

R/modify_pd.R Outdated

#' @rdname update_indention
indent_round <- function(pd, indent_by) {
if (any(pd$token == "')'")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we look only at pd$token[[2]] or pd$token[[nrow(pd)]]?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so because we have some tibbles with only one row. Looking at the last token is probably also not a good idea because we have two cases
mycall(1, 2) and (1 + x)
Which have their opening brace at the second and the first position respectively. I think that's really the reason why I implemented the approach with which in the first place: To get the start right.
I think this function should really handle both cases and it's easier to implement it in one function instead of two. Hence, I propose to go back to the which approach, at least for the start. I think the end is unambiguous.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware that this also covers parens when they are used to group arithmetic expressions. Would you mind adding a test?

} else {
start <- stop <- 0
}
pd$indent <- ifelse(seq_len(nrow(pd)) %in% start:stop, indent_by, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might look clearer if we move this assignment to the if branch, and remove the else branch.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you @krlmlr mean. We need to assign a value to pd$indent anyways, since this column is used in the recursion of serialize_parse_data_nested_helper. Otherwise we have to change this function so it can handle the case for which the column does not exist, which I think is not a good idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Let's leave it for now.

R/modify_pd.R Outdated
#' @rdname update_indention
indent_round <- function(pd, indent_by) {
if (any(pd$token == "')'")) {
start <- 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check boundaries: Do we really need to indent the second token, or perhaps only starting from the third?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above, I think we need to explicitly find the start token because of the two scenarios.

indent_round_nested <- function(pd) {
if (is.null(pd)) return(pd)
pd <- indent_round(pd, indent_by = 2)
pd$child <- map(pd$child, indent_round_nested)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer mutate() over sub-assignment here and in the functions below and above. This should work well with dplyr 0.7.0. Separate PR?

Copy link
Collaborator Author

@lorenzwalthert lorenzwalthert Jun 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. In a pipe like this?

pd <- pd %>%
  indent_round(pd, indent_by = 2) %>%
  mutate(child = map(child, indent_round_nested)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. What exactly do you want to add to the style guide?

Copy link
Collaborator Author

@lorenzwalthert lorenzwalthert Jun 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoiding sub assignment if possible. See here. I think this particular case is not yet covered. We won't be able to enforce that with styler though I think.

@lorenzwalthert lorenzwalthert merged commit b3666f0 into r-lib:master Jun 14, 2017
@lorenzwalthert lorenzwalthert mentioned this pull request Jun 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants