---
title: "Working with Text in R: Text-to-Columns Search Across Columns Parse Medications"
author: "Melinda Higgins"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_float: true
    code_folding: show
    highlight: arrow
    theme:
      bg: "#f2ede1"
      fg: "#000000"
      primary: "#2c6b3e"
      base_font:
        google: "Inter"
      code_font:
        google: "JetBrains Mono"
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      error = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      attr.output='style="background: #bfe0c8;"',
                      attr.source='style="padding-right: 10px;"')
# turn on thematic
thematic::thematic_rmd()
library(dplyr)
library(tidyr)
```
## Split text into separate columns
Excel has a "Text to Columns" feature under the "Data" tab that lets you designate a "delimiter" for splitting chunks of text into separate columns. The `tidyr` package has a similar function, `separate()`. Let's see an example of how this works.
### Example 1: Make and Model of Cars in `mtcars` dataset
Let's take a look at the built-in `mtcars` dataset. This dataset stores each car's make and model as "row names". Here are the top 6 rows of the `mtcars` dataset:
```{r}
# top 6 rows of mtcars dataset
mtcars %>%
  head() %>%
  knitr::kable(caption = "Top 6 rows of mtcars dataset")
```
Here are the row names of the `mtcars` dataset:
```{r}
# see the list of the row names
row.names(mtcars)
```
Let's add these text "strings" for the names of the cars to the dataset in a new column called `makemodel`:
```{r}
library(dplyr)
makemodel <- row.names(mtcars)
mtcars2 <- mtcars %>%
  mutate(makemodel = makemodel)
# view top 6 rows again
mtcars2 %>%
  head() %>%
  knitr::kable(caption = "Top 6 rows of mtcars dataset")
```
Suppose we now want to break up the make and model into separate columns using the space as our column divider. We can use the `separate()` function from the `tidyr` package to do this. Note: some of the makes and models contain 2 spaces, so splitting produces up to 3 pieces. That is why the code below uses `into = c("make", "model", "type")`, which names the 3 new columns added to the dataset. Names with fewer than 3 pieces are filled with `NA` in the remaining columns.
```{r bonus2}
library(tidyr)
df <-
  tidyr::separate(
    data = mtcars2,
    col = makemodel,
    sep = " ",
    into = c("make", "model", "type"),
    remove = FALSE
  )
df %>% knitr::kable()
```
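Because the car names do not all split into the same number of pieces, it can help to control how `separate()` handles the mismatch. The chunk below is a minimal sketch of one possible variation, not part of the original example. It assumes a small hypothetical data frame, `cars_small`, built from three of the `mtcars` row names, and uses the `extra` and `fill` arguments of `separate()` so that everything after the first space stays together in a single `model` column.

```{r}
library(tidyr)
# hypothetical mini dataset: three car names taken from the mtcars row names
cars_small <- data.frame(
  makemodel = c("Mazda RX4 Wag", "Hornet Sportabout", "Valiant")
)
# extra = "merge": split at the first space only and keep the rest together
# fill = "right": names without a space get NA in the model column
cars_two <-
  tidyr::separate(
    data = cars_small,
    col = makemodel,
    sep = " ",
    into = c("make", "model"),
    extra = "merge",
    fill = "right",
    remove = FALSE
  )
cars_two
```

With two target columns, "Mazda RX4 Wag" keeps "RX4 Wag" together as the model, while "Valiant" gets `NA` for the model. Whether two or three columns is the better choice depends on how the pieces will be used later.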
### Example 2: Extracting "Data" from filenames
Here is a small hypothetical dataset from a lab that created custom IDs to track the subject, visit number and year by combining them into one long "string" (text field) separated by underscores "_". This is the variable `idlong` in the `labdata` dataset (created in code below).
Building on the code example above, here is another application of the `tidyr::separate()` function. It separates the long string `idlong` into 3 new columns, "ID", "visit" and "year", which are added to the `labdata` dataset.
```{r}
# create hypothetical dataset
idlong <- c(
  "001_v1_2020",
  "001_v2_2021",
  "002_v1_2020",
  "002_v2_2021",
  "003_v1_2020",
  "003_v2_2021",
  "004_v1_2021",
  "004_v2_2022",
  "005_v1_2021",
  "005_v2_2022"
)
values <- c(34, 31, 28, 26, 34, 34, 27, 28, 30, 25)
labdata <- data.frame(idlong, values)
labdata %>%
  knitr::kable(caption = "Hypothetical Dataset With Long filenames")
```
Create 3 new variables, "ID", "visit" and "year", from `idlong`.
```{r bonus2code}
df <-
  tidyr::separate(
    data = labdata,
    col = idlong,
    sep = "_",
    into = c("ID", "visit", "year"),
    remove = FALSE
  )
df %>% knitr::kable(caption = "Three new variables added: ID, Visit, Year - extracted from idlong")
```
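Once the pieces are in their own columns, they can be used like any other variable. The chunk below is a small sketch under stated assumptions rather than part of the original analysis: it recreates the first four rows of the hypothetical lab data as `labdata_small` so that it runs on its own, and the names `parsed` and `by_visit` are illustrative. It converts `year` to an integer and averages `values` within each visit.

```{r}
library(dplyr)
library(tidyr)
# first four rows of the hypothetical lab data, recreated so this chunk stands alone
labdata_small <- data.frame(
  idlong = c("001_v1_2020", "001_v2_2021", "002_v1_2020", "002_v2_2021"),
  values = c(34, 31, 28, 26)
)
parsed <- labdata_small %>%
  tidyr::separate(
    col = idlong,
    sep = "_",
    into = c("ID", "visit", "year"),
    remove = FALSE
  ) %>%
  # separate() returns character columns, so store year as an integer
  mutate(year = as.integer(year))
# mean of values at each visit
by_visit <- parsed %>%
  group_by(visit) %>%
  summarise(mean_value = mean(values))
by_visit
```

`ID` is left as character here so that the leading zeros in values such as "001" are preserved.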
### Updated tidyr::separate_wider_delim() function
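**IMPORTANT NOTE**: In `tidyr` version 1.3.0 and later, `separate()` is superseded. It still works, but new code can use `separate_wider_delim()`, which splits a column at a delimiter in the same way.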
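```{r}
# small illustrative data frame: "x" holds gender and unit joined by "-"
df <- data.frame(id = 1:3,
                 x = c("m-123", "f-455", "f-123"))
df %>%
  separate_wider_delim(cols = x,
                       delim = "-",
                       names = c("gender", "unit"))
```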
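By default, `separate_wider_delim()` stops with an error when a value has fewer pieces than there are `names`, which matters for data like the car names in Example 1. The chunk below is a hedged sketch of one way to handle that. It assumes `tidyr` 1.3.0 or later and a hypothetical mini data frame, `cars_small`, built from three of the `mtcars` row names. It uses `too_few = "align_start"` to fill the missing pieces with `NA` on the right and `cols_remove = FALSE` to keep the original column.

```{r}
library(tidyr)
# hypothetical mini dataset: three car names taken from the mtcars row names
cars_small <- data.frame(
  makemodel = c("Mazda RX4 Wag", "Hornet Sportabout", "Valiant")
)
# too_few = "align_start": short names are padded with NA on the right
# cols_remove = FALSE: keep the original makemodel column
cars_wide <-
  tidyr::separate_wider_delim(
    data = cars_small,
    cols = makemodel,
    delim = " ",
    names = c("make", "model", "type"),
    too_few = "align_start",
    cols_remove = FALSE
  )
cars_wide
```

A matching `too_many` argument controls what happens when a value has more pieces than `names`.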
## Additional Resources:
* `tidyr` - learn more at: [https://tidyr.tidyverse.org/](https://tidyr.tidyverse.org/)
* `stringr` - learn more at:
    - [https://stringr.tidyverse.org/](https://stringr.tidyverse.org/)
    - [https://r4ds.had.co.nz/strings.html](https://r4ds.had.co.nz/strings.html)
* `stringi` - learn more at:
    - [https://cran.r-project.org/web/packages/stringi/index.html](https://cran.r-project.org/web/packages/stringi/index.html)
    - [https://r4ds.had.co.nz/strings.html#stringi](https://r4ds.had.co.nz/strings.html#stringi)
* BOOK: [Text Mining with R](https://www.tidytextmining.com/)
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r cars}
# summary of cars dataset
summary(cars)
```
## Including Plots
You can also embed plots, for example:
```{r pressure}
# plot of pressure dataset
plot(pressure)
```
Note that adding the `echo = FALSE` parameter to the code chunk would prevent printing of the R code that generated the plot.
## Results {.tabset .tabset-pills}
### Plots
We show a scatter plot in this section.
```{r, fig.dim=c(5, 3)}
# a simple base R plot
par(mar = c(4, 4, .5, .1))
plot(mpg ~ hp, data = mtcars, pch = 19)
```
### Tables
We show the data in this tab.
```{r}
# one table
head(mtcars)
# another table
library(dplyr)
mtcars %>%
  select(mpg, disp) %>%
  summary() %>%
  knitr::kable()
```
## ggplot2
```{r}
library(ggplot2)
ggplot(mtcars, aes(x = disp, y = mpg, color = as.factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm")
```