index.Rmd

---
title: "Working with Text in R: Text-to-Columns Search Across Columns Parse Medications"
author: "Melinda Higgins"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_float: true
    code_folding: show
    highlight: arrow
    theme:
      bg: "#f2ede1"
      fg: "#000000"
      primary: "#2c6b3e"
      base_font:
        google: "Inter"
      code_font:
        google: "JetBrains Mono"
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      error = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      attr.output='style="background: #bfe0c8;"',
                      attr.source='style="padding-right: 10px;"')

# turn on thematic
thematic::thematic_rmd()

library(dplyr)
library(tidyr)
```

## Split text into separate columns

There is a function in EXCEL under the "DATA" tab for "test-to-columns" allowing you to designate a "delimiter" for splitting text chunks into separate columns. There is a similar function in the `tidyr` package, `separate()`. Let's see an example of how this works.

### Example 1: Make and Model of Cars in `mtcars` dataset

Let's take a look at the builtin `mtcars` dataset. This dataset has "row names" for each car's make and model. Here is an example of the top 6 rows of the `mtcars` dataset:

```{r}
# top 6 rows of mtcars dataset
mtcars %>%
  head() %>%
  knitr::kable(caption = "Top 6 rows of mtcars dataset")
```

Row names of the `mtcars` dataset.

```{r}
# see the list of the row names
row.names(mtcars)
```

Let's add these text "strings" for the names of the cars to the dataset in a new column called `makemodel`:

```{r}
library(dplyr)
makemodel <- row.names(mtcars)
mtcars2 <- mtcars %>%
  mutate(makemodel = makemodel)

# view top 6 rows again
mtcars2 %>%
  head() %>%
  knitr::kable(caption = "Top 6 rows of mtcars dataset")
```

Suppose we now want to break up the make and model into separate columns using the space as our column divider. We can use the `separate()` function from `tidyr` package to do this. Note: given the full list of makes and models some have 2 spaces so you'll end up with 3 columns that we'll call "make", "model" and "type" which is why `into = c("make", "model", "type")` in the code below. This defines the new columns we are adding to the dataset.

```{r bonus2}
library(tidyr)

df <-
  tidyr::separate(
    data = mtcars2,
    col = makemodel,
    sep = " ",
    into = c("make", "model", "type"),
    remove = FALSE
  )

df %>% knitr::kable()
```

### Example 2: Extracting "Data" from filenames

Here is a small hypothetical dataset from a lab that created custom IDs to track the subject, visit number and year by combining them into one long "string" (text field) separated by underscores "_". This is the variable `idlong` in the `labdata` dataset (created in code below).

Using the code example above, here is another application of the `tifyr::separate()` function to separate the long string `idlong` into 3 new columns added to the `labdata` dataset individually for "ID", "visit" and "year".

```{r}
# create hypothetical dataset
idlong <- c(
  "001_v1_2020",
  "001_v2_2021",
  "002_v1_2020",
  "002_v2_2021",
  "003_v1_2020",
  "003_v2_2021",
  "004_v1_2021",
  "004_v2_2022",
  "005_v1_2021",
  "005_v2_2022"
)

values <- c(34, 31, 28, 26, 34, 34, 27, 28, 30, 25)

labdata <- data.frame(idlong, values)

labdata %>%
  knitr::kable(caption = "Hypothetical Dataset With Long filenames")
```

Create 3 new variables "ID", "Visit" and "Year" from `idlong`.

```{r bonus2code}
df <-
  tidyr::separate(
    data = labdata,
    col = idlong,
    sep = "_",
    into = c("ID", "visit", "year"),
    remove = FALSE
  )

df %>% knitr::kable(caption = "Three new variables added: ID, Visit, Year - extracted from idlong")
```

### Updated tidyr::separate_wider_delim() function

**IMPORTANT NOTE**

```{r}
df %>% 
  separate_wider_delim(cols = x, 
                       delim = "-", 
                       names = c("gender", "unit"))
```


## Additional Resources:

* `tidyr` - learn more at: [https://tidyr.tidyverse.org/](https://tidyr.tidyverse.org/)
* `stringr` - learn more at:
    - [https://stringr.tidyverse.org/](https://stringr.tidyverse.org/)
    - [https://r4ds.had.co.nz/strings.html](https://r4ds.had.co.nz/strings.html)
* `stringi` - learn more at:
    - [https://cran.r-project.org/web/packages/stringi/index.html](https://cran.r-project.org/web/packages/stringi/index.html)
    - [https://r4ds.had.co.nz/strings.html#stringi](https://r4ds.had.co.nz/strings.html#stringi)
* BOOK: [Text Mining with R](https://www.tidytextmining.com/)


## R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

```{r cars}
# summary of cars dataset
summary(cars)
```

## Including Plots

You can also embed plots, for example:

```{r pressure}
# plot of pressure dataset
plot(pressure)
```

Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.

## Results {.tabset .tabset-pills}

### Plots

We show a scatter plot in this section.

```{r, fig.dim=c(5, 3)}
# a simple base R plot
par(mar = c(4, 4, .5, .1))
plot(mpg ~ hp, data = mtcars, pch = 19)
```

### Tables

We show the data in this tab.

```{r}
# one table
head(mtcars)

# another table
library(dplyr)
mtcars %>%
  select(mpg, disp) %>%
  summary() %>%
  knitr::kable()
```

## ggplot2

```{r}
library(ggplot2)
ggplot(mtcars, aes(x=disp, y=mpg, color=as.factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm")
```