---
title: "Demonstration of tidyext"
author: "Michael Clark"
date: <span style="font-style:normal;font-family:'Open Sans'">`r Sys.Date()`</span>
output:
  html_vignette:
    toc: true
    toc_depth: 3
    df_print: kable
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Demonstration of tidyext}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
  echo = TRUE, message = FALSE, warning = FALSE, error = FALSE, collapse = TRUE,
  comment = NA, R.options = list(width = 220),           # code
  dev.args = list(bg = 'transparent'), dev = 'svglite',  # viz
  fig.align = 'center', out.width = '75%', fig.asp = .75,
  cache.rebuild = FALSE, cache = FALSE                    # cache
)
```
# Getting started with tidyext
## Data Summaries
To begin, we can load up the <span class="pack">tidyverse</span> and this package. I'll also create some data that will be useful for demonstration.
```{r packages}
library(tidyverse)
library(tidyext)
set.seed(8675309)
df1 <- tibble(
  g1 = factor(sample(1:2, 50, replace = TRUE), labels = c('a', 'b')),
  g2 = sample(1:4, 50, replace = TRUE),
  a  = rnorm(50),
  b  = rpois(50, 10),
  c  = sample(letters, 50, replace = TRUE),
  d  = sample(c(T, F), 50, replace = TRUE)
)
df_miss = df1
df_miss[sample(1:nrow(df1), 10), sample(1:ncol(df1), 3)] = NA
```
We can start by getting a quick numerical summary for a single column. As the name suggests, this will only work with numeric data.
```{r num_summary}
num_summary(mtcars$mpg)
num_summary(df_miss$a, extra = T)
```
Note that the result is a <span class="objclass">data.frame</span>, which makes it easy to work with, for example when stacking the summaries of several columns.
```{r class_num_sum}
x = num_summary(mtcars$mpg)
glimpse(x)
mtcars %>%
  map_dfr(num_summary, .id = 'Variable')
```
There are also functions for summarizing missingness.
```{r missing}
sum_NA(df_miss$a)
sum_blank(c(letters, '', ' '))
sum_NaN(c(1, NaN, 2))
```
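For orientation, here is a minimal base R sketch of the kind of counts these helpers report. It is not tidyext's implementation; in particular, whether <span class="func">sum_blank</span> treats whitespace-only strings as blank is an assumption here, so both variants are shown.

```{r missing_base}
num_vec <- c(1, NaN, NA, 2)
chr_vec <- c(letters, '', ' ')

n_na  <- sum(is.na(num_vec))    # is.na() is TRUE for both NA and NaN
n_nan <- sum(is.nan(num_vec))   # is.nan() is TRUE for NaN only

n_empty        <- sum(chr_vec == '')          # strictly empty strings
n_empty_or_ws  <- sum(trimws(chr_vec) == '')  # empty or whitespace-only strings

c(n_na = n_na, n_nan = n_nan, n_empty = n_empty, n_empty_or_ws = n_empty_or_ws)
```

Note that base R counts `NaN` as missing in `is.na()`, which is one reason it can be useful to tally the two separately.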
When dealing with a data frame of mixed types we can use the <span class="func">describe_*</span> functions.
```{r describe}
describe_all(df1)
describe_all_cat(df1)
describe_all_num(df1, digits = 1)
```
Note how the categorical result is just as ready for visualization as the numeric one, since it can be filtered by the `Variable` column. The function also has options to include <span class="objclass">NA</span> as its own category (`include_NAcat`) and to sort categories by frequency (`sort_by_freq`).
```{r describe_cat, out.width='50%'}
describe_all_cat(df_miss, include_NAcat = TRUE, sort_by_freq = TRUE) %>%
  filter(Variable == 'g1') %>%
  ggplot(aes(x = Group, y = `%`)) +
  geom_col(width = .25)
```
Typically during data processing, we are performing grouped operations. As such, there are corresponding <span class="func">num_by</span> and <span class="func">cat_by</span> functions that provide the same information by some grouping variable. They save you from writing `group_by() %>% summarize()` and creating a variable for each of these values yourself. They can also take a set of variables to summarize via <span class="func">vars</span>.
```{r num_by}
df_miss %>%
  num_by(a, group_var = g2)

df_miss %>%
  num_by(vars(a, b), group_var = g2)
```
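To make the saving concrete, here is a rough sketch of the manual `group_by() %>% summarize()` code that a grouped numeric summary would otherwise require. The particular statistics and column names are illustrative assumptions, not the exact output of <span class="func">num_by</span>.

```{r num_by_manual}
library(dplyr)

# Manual grouped summary of a single numeric variable (mpg) by group (cyl)
manual_summary <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    N       = sum(!is.na(mpg)),          # non-missing count
    Mean    = mean(mpg, na.rm = TRUE),
    SD      = sd(mpg, na.rm = TRUE),
    Min     = min(mpg, na.rm = TRUE),
    Median  = median(mpg, na.rm = TRUE),
    Max     = max(mpg, na.rm = TRUE),
    Missing = sum(is.na(mpg))            # missing count
  )

manual_summary
```

Every additional variable to summarize would mean repeating or generalizing this block, which is the repetition the wrapper is meant to remove.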
For categorical variables summarized by group, you can choose whether the resulting percentages are calculated within each group (`perc_by_group = TRUE`) or irrespective of the grouping (`perc_by_group = FALSE`).
```{r cat_by}
df1 %>%
  cat_by(d,
         group_var = g1,
         perc_by_group = TRUE)

df1 %>%
  cat_by(d,
         group_var = g1,
         perc_by_group = FALSE,
         sort_by_group = FALSE)
```
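The distinction between the two kinds of percentage can be sketched with plain dplyr. This is only an illustration of the idea on `mtcars`, with made-up object names, and not how <span class="func">cat_by</span> is implemented.

```{r cat_by_manual}
library(dplyr)

cell_counts <- mtcars %>% count(cyl, am)

# Percentages within each level of cyl: they sum to 100 inside every group
pct_within_group <- cell_counts %>%
  group_by(cyl) %>%
  mutate(pct = 100 * n / sum(n)) %>%
  ungroup()

# Percentages irrespective of grouping: they sum to 100 over the whole table
pct_overall <- cell_counts %>%
  mutate(pct = 100 * n / sum(n))

pct_within_group
pct_overall
```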
## Data Processing
In addition there are some functions for data processing. We can start with the simple one-hot encoding function.
```{r onehot}
onehot(iris) %>%
  slice(c(1:2, 51:52, 101:102))
```
It can also produce sparse output.
```{r onehot2}
iris %>%
  slice(c(1:2, 51:52, 101:102)) %>%
  onehot(sparse = TRUE)
```
You can also choose specific variables to encode, whether to keep the other columns, and how to handle <span class="objclass">NA</span> values.
```{r onehot3}
df_miss %>%
  onehot(var = c('g1', 'g2'), nas = 'na.omit', keep.original = FALSE) %>%
  head()
```
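For readers new to the term, one-hot encoding turns each level of a categorical variable into its own 0/1 indicator column. A minimal base R sketch of the idea, using `model.matrix` rather than tidyext, might look like the following; the column names it produces will differ from those of <span class="func">onehot</span>.

```{r onehot_base}
# Dropping the intercept (- 1) keeps an indicator column for every level
species_indicators <- model.matrix(~ Species - 1, data = iris)

# Each row has exactly one indicator set to 1
species_indicators[c(1, 51, 101), ]
```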
With <span class="func">create_prediction_data</span>, we can quickly create data for use with <span class="func">predict</span> after fitting a model. By default it will put numeric variables at their mean, and categorical variables at their most common category.
```{r create_prediction_data}
create_prediction_data(iris)
create_prediction_data(iris, num = function(x) quantile(x, probs = .25))
```
We can also supply specific values.
```{r create_prediction_data2}
cd = data.frame(cyl=4, hp=100)
create_prediction_data(mtcars, conditional_data = cd)
```
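As a sketch of where such data ends up, the following builds a one-row prediction data set by hand (numeric predictors at their means, the categorical predictor at its most common level) and passes it to <span class="func">predict</span>. The model and the object names are illustrative assumptions; the point is only to show the manual work being replaced.

```{r prediction_data_manual}
fit <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# Most common value of cyl, recovered from the frequency table
most_common_cyl <- as.numeric(names(which.max(table(mtcars$cyl))))

# Hand-built prediction data: means for numeric variables,
# the modal category for the categorical one
new_data <- data.frame(
  wt  = mean(mtcars$wt),
  hp  = mean(mtcars$hp),
  cyl = most_common_cyl
)

predicted_mpg <- predict(fit, newdata = new_data)
predicted_mpg
```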
For modeling purposes, we often want to center or scale the data, take logs, etc. The <span class="func">pre_process</span> function will standardize numeric data by default.
```{r preprocess}
pre_process(df1)
```
Other options are to simply center the data (`scale_by = 0`), start some variables at zero (e.g. time indicators), log some variables (with chosen base), and scale some to range from zero to one.
```{r preprocess2}
pre_process(mtcars,
            scale_by = 0,
            log_vars = vars(mpg, wt),
            zero_start = vars(cyl),
            zero_one = vars(hp, starts_with('d'))) %>%
  describe_all_num()
```
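The transformations themselves are simple. As a base R sketch applied to a single column (this illustrates the arithmetic only, and is not tidyext's code):

```{r preprocess_base}
hp <- mtcars$hp

hp_standardized <- (hp - mean(hp)) / sd(hp)                  # mean 0, sd 1
hp_centered     <- hp - mean(hp)                             # mean 0, original units
hp_zero_start   <- hp - min(hp)                              # smallest value becomes 0
hp_zero_one     <- (hp - min(hp)) / (max(hp) - min(hp))      # rescaled to [0, 1]
hp_logged       <- log(hp)                                   # natural log; log(hp, base = 2) for another base

summary(hp_zero_one)
```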
Note that centering/standardizing is applied to any numeric variables *not* chosen for `log_vars`, `zero_start`, or `zero_one`.

Here's a specific function you will probably never need, but will be glad to have if you do. Some data columns have multiple entries for each observation/cell. While it's understandable why someone would do this, it's not very good practice. <span class="func">combn_2_col</span> will split out the entries, or any particular combination of them, into their own indicator columns.
```{r combn, message=TRUE}
d = data.frame(id = 1:4, labs = c('A-B', 'B-C-D-E', 'A-E', 'D-E'))

combn_2_col(
  data = d,
  var = 'labs',
  max_m = 2,
  sep = '-',
  collapse = ':',
  toInteger = TRUE
)

combn_2_col(
  data = d,
  var = 'labs',
  max_m = 2,
  sparse = TRUE
)
```
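To see what is being automated, here is a base R sketch of the simplest case only: one indicator column per individual label, with no combinations of labels. The object names are illustrative, and the approach is an assumption about how one might do this by hand rather than a description of the function's internals.

```{r split_indicators_base}
multi <- data.frame(id = 1:4, labs = c('A-B', 'B-C-D-E', 'A-E', 'D-E'))

# Split each cell on the separator, then collect the distinct labels
entries    <- strsplit(as.character(multi$labs), split = '-', fixed = TRUE)
all_labels <- sort(unique(unlist(entries)))

# One row per observation, one 0/1 column per label
label_indicators <- t(sapply(entries, function(e) as.integer(all_labels %in% e)))
colnames(label_indicators) <- all_labels

cbind(multi, label_indicators)
```

Extending this by hand to pairs of labels would require enumerating combinations within each cell, which is where a dedicated function becomes worthwhile.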
Oftentimes I need to create a column that represents the total score or mean of just a few columns. This is a slight annoyance in the tidyverse, and there isn't much support behind the <span class="pack" style = "">dplyr</span>::<span class="func" style = "">rowwise</span> function. As such, tidyext has a few simple wrappers: <span class="func">row_sums</span>, <span class="func">row_means</span>, and <span class="func">row_apply</span>.
```{r rowapply}
d = data.frame(
  x = 1:3,
  y = 4:6,
  z = 7:9,
  q = NA
)

d %>%
  row_sums(x:y)

d %>%
  row_means(matches('x|z'))

row_apply(
  d,
  x:z,
  .fun = function(x)
    apply(x, 1, paste, collapse = '')
)
```
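For comparison, a sketch of one way to get similar row-wise results without the wrappers, assuming dplyr 1.0.0 or later for `across()`. The new column names are made up for illustration.

```{r rowwise_base}
library(dplyr)

scores <- data.frame(x = 1:3, y = 4:6, z = 7:9, q = NA)

row_results <- scores %>%
  mutate(
    total_xy = rowSums(across(x:y)),               # row sum of x and y
    mean_xz  = rowMeans(across(matches('x|z')))    # row mean of x and z
  )

row_results
```

This works, but the selection has to be wrapped in `across()` inside a base function for each new column, which is the small annoyance mentioned above.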
<br>
<br>
<br>
<br>
<br>
<br>