apply styler
lorenzwalthert committed Feb 20, 2021
1 parent 4082218 commit d861f38
Showing 26 changed files with 581 additions and 527 deletions.
12 changes: 6 additions & 6 deletions content/blog/2020/corrr-0-4-3/index.Rmd
@@ -68,8 +68,8 @@ We can create a `cor_df` object containing the pairwise correlations between a f
```{r message = FALSE}
library(palmerpenguins)
penguins_cor <- penguins %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
correlate()
penguins_cor
@@ -80,7 +80,7 @@ penguins_cor
Previously, the default behavior of `rplot()` was that the variables were displayed in alphabetical order in the output. This was an artifact of using `ggplot2` and inheriting its behavior. The new default is to retain the ordering of variables in the input data:

```{r message = FALSE}
rplot(penguins_cor)
```

If alphabetical ordering is desired, set `.order` to "alphabet":
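A minimal sketch of that call (this part of the post is collapsed in the diff above, so the exact code is an assumption based on the `.order` argument just described):

```{r}
rplot(penguins_cor, .order = "alphabet")
```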
@@ -116,14 +116,14 @@ cov_df
The resulting data frame behaves just like one returned by `correlate()`, except that it is populated with covariance values rather than correlations. This means we still have access to all of corrr's other tooling when working with it. For example, we can still use `shave()` to remove duplication, which sets the upper triangle of values to `NA`.

```{r}
cov_df %>%
  shave()
```

Similarly, we can still use `stretch()` to get the resulting data frame into a longer format:

```{r}
cov_df %>%
  stretch()
```

@@ -132,7 +132,7 @@ The first part of the name ("colpair_") comes from the fact that we are comparin
As such, any function passed to `colpair_map()` must accept a vector for both its first and second arguments. To illustrate, let's say we wanted to run a series of t-tests to see which of our variables are significantly related to one another. We can write a function to do so as follows:

```{r}
calc_ttest_p_value <- function(vec_a, vec_b) {
t.test(vec_a, vec_b)$p.value
}
```
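A sketch of how this helper might then be used (the variable selection here is an assumption that mirrors the earlier example; the post's actual usage is collapsed in this diff):

```{r}
penguins %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
  colpair_map(calc_ttest_p_value)
```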
100 changes: 52 additions & 48 deletions content/blog/2020/dbplyr-2-0-0/index.Rmd
@@ -56,40 +56,40 @@ dbplyr now supports all relevant features added in dplyr 1.0.0:
- `across()` is now translated into individual SQL statements.

```{r}
lf <- lazy_frame(g = 1, a = 1, b = 2, c = 3)
lf %>%
  group_by(g) %>%
  summarise(across(everything(), mean, na.rm = TRUE))
```

- `rename()` and `select()` support dplyr tidyselect syntax, apart from predicate functions which can't easily work on computed queries.
You can now use `rename_with()` to programmatically rename columns.

```{r}
lf <- lazy_frame(x1 = 1, x2 = 2, x3 = 3, y1 = 4, y2 = 3)
lf %>% select(starts_with("x") & !"x3")
lf %>% select(ends_with("2") | ends_with("3"))
lf %>% rename_with(toupper)
```

- `relocate()` makes it easy to move columns around:

```{r}
lf <- lazy_frame(x1 = 1, x2 = 2, y1 = 4, y2 = 3)
lf %>% relocate(starts_with("y"))
```

- `slice_min()`, `slice_max()`, and `slice_sample()` are now supported, and `slice_head()` and `slice_tail()` throw informative error messages (since they don't make sense for databases).

```{r}
lf <- lazy_frame(g = rep(1:2, 5), x = 1:10)
lf %>%
  group_by(g) %>%
  slice_min(x, prop = 0.5)
lf %>%
  group_by(g) %>%
  slice_sample(x, n = 10, with_ties = TRUE)
```

Note that these slices are translated into window functions, and because you can't use a window function directly inside a `WHERE` clause, they must be wrapped in a subquery.
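A sketch of how to inspect that subquery (output omitted here, since the generated SQL varies by backend):

```{r}
lf %>%
  group_by(g) %>%
  slice_min(x, prop = 0.5) %>%
  show_query()
```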
@@ -109,55 +109,59 @@ Here are a few of the most important:
You can set `na_matches = "na"` to match R's usual join behaviour.

```{r}
df1 <- tibble(x = c(1, 2, NA))
df2 <- tibble(x = c(NA, 1), y = 1:2)
df1 %>% inner_join(df2, by = "x")

db1 <- memdb_frame(x = c(1, 2, NA))
db2 <- memdb_frame(x = c(NA, 1), y = 1:2)
db1 %>% inner_join(db2, by = "x")
db1 %>% inner_join(db2, by = "x", na_matches = "na")
```

This translation is powered by the new `sql_expr_matches()` generic, because every database seems to have a slightly different way to express this idea.
Learn more at <https://modern-sql.com/feature/is-distinct-from>.

```{r}
db1 %>%
  inner_join(db2, by = "x") %>%
  show_query()
db1 %>%
  inner_join(db2, by = "x", na_matches = "na") %>%
  show_query()
```

- Subqueries no longer include an `ORDER BY` clause.
This is not part of the formal SQL specification, so it has very limited support across databases.
Such queries now generate a warning suggesting that you move your `arrange()` call later in the pipeline.

```{r}
lf <- lazy_frame(g = rep(1:2, each = 5), x = sample(1:10))
lf %>%
  group_by(g) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  filter(n > 1)
```

As the warning suggests, there's one exception: `ORDER BY` is still generated if a `LIMIT` is present.
Across databases, this tends to change which rows are returned, but not necessarily their order.

```{r}
lf %>%
  group_by(g) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(5) %>%
  filter(n > 1)
```

- dbplyr includes built-in backends for Redshift (which only differs from PostgreSQL in a few places) and SAP HANA. These require the development versions of [RPostgres](https://github.com/r-dbi/RPostgres) and [odbc](https://github.com/r-dbi/odbc) respectively.

```{r}
lf <- lazy_frame(x = "a", y = "b", con = simulate_redshift())
lf %>% mutate(z = paste0(x, y))
```

There are a number of minor changes that affect the translation of individual functions.
@@ -166,23 +170,23 @@ Here are a few of the most important:
- All backends now translate `n()` to `count(*)` and support the `::` operator.

```{r}
lf <- lazy_frame(x = 1:10)
lf %>% summarise(n = dplyr::n())
```

- PostgreSQL gets translations for lubridate period functions:

```{r}
lf <- lazy_frame(x = Sys.Date(), con = simulate_postgres())
lf %>%
  mutate(year = x + years(1))
```

- Oracle assumes version 12c is available, so we can use a simpler translation for `head()` that works in more places:

```{r}
lf <- lazy_frame(x = 1, con = simulate_oracle())
lf %>% head(5)
```
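To see the translation itself, append `show_query()` (a sketch; on Oracle 12c and later this is expected to use the ANSI `FETCH FIRST ... ROWS ONLY` syntax rather than a `ROWNUM` subquery):

```{r}
lf %>%
  head(5) %>%
  show_query()
```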

## New logo
44 changes: 22 additions & 22 deletions content/blog/2020/dplyr-1-0-0-and-vctrs/index.Rmd
@@ -66,7 +66,7 @@ You might wonder why we can't just copy the behaviour of `c()`. Unfortunately `c
underlying integer levels.

```{r}
c(factor("x"), factor("y"))
```

* It's difficult to implement methods when different classes are involved.
@@ -75,17 +75,17 @@ You might wonder why we can't just copy the behaviour of `c()`. Unfortunately `c
first being translated.

```{r}
today <- as.Date("2020-03-24")
now <- as.POSIXct("2020-03-24 10:34")
c(today, now)
# (the second value is the date 4341727-12-11)
class(c(today, now))
unclass(c(today, now))
c(now, today)
class(c(now, today))
unclass(c(now, today))
```

It's difficult to change how `c()` works because any changes are likely to break some existing code, and base R is committed to backward compatibility. Additionally, `c()` isn't the only way that base R combines vectors. `rbind()` and `unlist()` can also be used to perform a similar job, but return different results. This is not to say that the tidyverse has been any better in the past --- we have used a variety of ad hoc methods, undoubtedly using well more than three different approaches.
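A small illustration of that inconsistency (not from the original post; the behaviour of `c()` on factors shown here is that of R versions before 4.1, current when this was written):

```{r}
f1 <- factor("x")
f2 <- factor("y")
c(f1, f2)             # integer codes in R < 4.1, a factor in R >= 4.1
unlist(list(f1, f2))  # a factor with the union of the levels
rbind(data.frame(f = f1), data.frame(f = f2))$f  # also a factor
```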
@@ -97,9 +97,9 @@ Given that it's hard to fix the problem in base R, we've come up with our own al
always get a date-time.

```{r}
vec_c(today, now)
vec_c(now, today)
```

* Enrichment: `vec_c(x, y)` should return the richer type, where type `<x>`
@@ -108,17 +108,17 @@ Given that it's hard to fix the problem in base R, we've come up with our own al
double, and that combining a date and date-time should return a date-time.

```{r}
vec_c(1, 1.5)
vec_c(today, now)
```

* Consistency: `vec_c(x, y)` should error if `x` and `y` are of fundamentally
different types. For example, this implies that combining a string and a
number or a factor and a date should error.

```{r, error = TRUE}
vec_c("a", 1)
vec_c(factor("x"), today)
```

## Errors
@@ -157,8 +157,8 @@ Where possible, we attempt to give you more information to solve the problem. Fo

```{r, error = TRUE}
df <- tibble(g = c(1, 2))
df %>%
  group_by(g) %>%
mutate(y = if (g == 1) "a" else 1)
```

@@ -175,14 +175,14 @@ Using vctrs in dplyr also causes two behaviour changes. We hope that these don't
create a factor with the union of the individual levels:

```{r}
vec_c(factor("x"), factor("y"))
```

* When combining a factor and a character, dplyr previously warned about
creating a character vector. It now silently creates a character vector:

```{r}
vec_c("x", factor("y"))
```

These changes are motivated more by pragmatism than by theory. Strictly speaking, one should probably consider `factor("red")` and `factor("male")` to be incompatible, but this level of strictness causes much pain because character vectors can usually be used interchangeably with factors.
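In practice this pragmatism means that two factors encoding unrelated categories still combine cleanly, taking the union of their levels (a small sketch, not from the original post):

```{r}
vec_c(factor("red"), factor("male"))
# a factor with levels "red" and "male", even though the inputs
# arguably represent incompatible categorical variables
```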
40 changes: 20 additions & 20 deletions content/blog/2020/dplyr-1-0-0-colwise/index.Rmd
@@ -38,21 +38,21 @@ Today, I wanted to talk a little bit about the new `across()` function that make
It's often useful to perform the same operation on multiple columns, but copying and pasting is both tedious and error prone:

```{r, eval = FALSE}
df %>%
  group_by(g1, g2) %>%
  summarise(a = mean(a), b = mean(b), c = mean(c), d = mean(c))
```

You can now rewrite such code using `across()`, which lets you apply a transformation to multiple variables selected with the same syntax as [`select()` and `rename()`](https://www.tidyverse.org/blog/2020/03/dplyr-1-0-0-select-rename-relocate/#select-and-renaming):

```{r, eval = FALSE}
df %>%
  group_by(g1, g2) %>%
  summarise(across(a:d, mean))
# or with a function
df %>%
  group_by(g1, g2) %>%
  summarise(across(where(is.numeric), mean))
```

@@ -74,17 +74,17 @@ Here are a couple of examples of `across()` used with `summarise()`:
```{r}
library(dplyr, warn.conflicts = FALSE)
starwars %>%
  summarise(across(where(is.character), n_distinct))
starwars %>%
  group_by(species) %>%
  filter(n() > 1) %>%
  summarise(across(c(sex, gender, homeworld), n_distinct))
starwars %>%
  group_by(homeworld) %>%
  filter(n() > 1) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE), n = n())
```
## Other cool features
@@ -110,13 +110,13 @@ Why did we decide to move away from these functions in favour of `across()`?
compute the number of rows in each group:

```{r, eval = FALSE}
df %>%
  group_by(g1, g2) %>%
  summarise(
    across(where(is.numeric), mean),
    across(where(is.factor), nlevels),
    n = n(),
  )
```

2. `across()` reduces the number of functions that dplyr needs to provide.