<a href="https://colab.research.google.com/github/mikeniemant/QS_shiny/blob/master/tidyverse_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An introduction to Tidyverse

In this tutorial, we will explore the first four steps of the Tidyverse data science workflow:
1. Import
2. Tidy
3. Transform
4. Visualise
5. (Model)

Today we will be working with an electronic health records (EHR) dataset. Don't worry about privacy-related information: this is synthetic data based on summary statistics and values from the UMCU.

We will use R in this Google Colab environment. Apart from Tidyverse, we will also go through the basics of how to work with this Google Colab environment. But before we can start, we first have to do two things:
- initialize R
- install tidyverse

## 1. Initialize work environment
### 1.1. Initialize R
To run R in this Python environment, we need to load the `rpy2` library. Either click on the play button, or put your cursor in the chunk of code below and press CMD + ENTER (Mac) or CTRL + ENTER (Windows).

In [None]:
%load_ext rpy2.ipython

Run the following command to test whether R code is executed.

In [None]:
%%R
x <- seq(1,5)
x

As you can see, Google Colab is instructed to run R code when the 'code chunk' (or 'code cell') starts with the following statement: `%%R`. If we remove this statement, it will execute Python code.

Feel free to add more cells (either text or code) to this notebook, by using one of the following options:
- hover your mouse either above or below a chunk, and click on 'code' or 'text'
- in the menu bar, click on 'Insert', and then on 'Code cell' / 'Text cell'

Correct me if I am wrong, but this notebook is automatically stored on your Google Drive.

### 1.2 Install Tidyverse
Google Colab allows us to install our own packages. Normally it takes quite a while to install the complete `tidyverse` meta-package, but Colab apparently already has it pre-installed. Lucky us!

In [None]:
%%R
install.packages("tidyverse")

After a cell is run, the environment is updated. You can remove the output of a cell by right-clicking on the output and selecting `remove output`. Try removing the installation output above.

Import the `tidyverse` package and test it by running the following chunk.

In [None]:
%%R
library(tidyverse)
mtcars %>% 
  nrow()

Did you get `32` as output? Perfetto!

## 2. Tidyverse data science workflow
### 2.1 Import dataset
Now we will import the EHR dataset from a GitHub repository I created: https://github.com/mikeniemant/ehr_tutorial.

First, have a look at the two .csv files in the repo:
- ehr.csv: this is the dataset
- data_dic.csv: this is the data dictionary with the description of each variable

We can download the EHR data directly from the GitHub repo with the `read_csv` function from the `readr` package.

In [None]:
%%R
dat <- read_csv("https://github.com/mikeniemant/ehr_tutorial/raw/main/ehr.csv")

The `read_csv` function automatically assigns a class to each column. Let's have a look at the dataset.

In [None]:
%%R
dat

As we have imported the data into a `tibble`, R only returns the first ten rows (instead of all 384) and specifies the class of each column under the column name (`chr`, `dbl`, etc.). 

To keep the overview readable, some columns may be omitted from the output, depending on your screen width.

This dataset consists of four types of data:
- demographic
  - age
  - sex, 0 = female, 1 = male
- hospital visit
  - date of visit
  - time of visit
  - department
- vital
  - heart rate (hr)
  - systolic blood pressure (sbp)
  - diastolic blood pressure (dbp)
  - respiratory rate (rr)
- laboratory
  - serum creatinine (scr)
  - white blood cell count (wbc)

Take a look at the dataset for a few seconds. Do you notice any errors? Does the `dat` object have a 'tidy' data structure?

It looks like there is a typo. The fourth column should be 'sex' instead of 'sx'. We can fix this with the `rename` function from the `dplyr` package, which takes the new name on the left and the old name on the right.

In [None]:
%%R
dat <- dat %>%
  rename(sex = sx)

Did it work? Run the next chunk.

In [None]:
%%R
dat

Bravo!

I think long column names are quite annoying. Can you change 'white_blood_cell_count' to 'wbc'? Double-click on the cell below and write your code. Make sure not to delete the '%%R' statement, otherwise your code will not run! You can check your answer by clicking the `Show code` button on the second cell. Attenzione! Do not run this second cell after you have executed your own code.

In [None]:
%%R
# YOUR R CODE HERE

In [None]:
#@title
%%R
dat <- dat %>%
  rename(wbc = white_blood_cell_count)

Grazie mille! 

The `study_id` values are not in order, and patients can visit the hospital multiple times within one year. We can sort on multiple columns with the `arrange` function.

In [None]:
%%R
dat <- dat %>%
  arrange(study_id, visit_date, visit_time)

In [None]:
%%R
dat

That looks a lot better! Sei fantastico!
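
A minimal sketch of how `arrange` handles several sort keys, using a small made-up tibble (the `toy` object and its values are hypothetical, not taken from the EHR data). Rows are ordered by the first column, and ties are broken by the next one. Wrapping a key in `desc()` reverses the order for that key only.

```r
library(dplyr)

# Hypothetical toy data, not part of the EHR dataset
toy <- tibble(
  study_id   = c(2, 1, 2, 1),
  visit_date = as.Date(c("2020-03-01", "2020-05-10", "2020-01-15", "2020-02-20"))
)

# Sort by study_id first, then by visit_date within each study_id
toy_sorted <- toy %>%
  arrange(study_id, visit_date)

# desc() sorts one key in descending order (latest visit first per patient)
toy_latest_first <- toy %>%
  arrange(study_id, desc(visit_date))
```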

### 2.2 Tidy

Let's have a look at the department columns. We can use the `select` function to make a selection without assigning it to the `dat` object. There are multiple ways to get the same result. Run the following chunks.

In [None]:
%%R
dat %>% 
  select(study_id, oncology, obstetrics, neurology, nephrology, 
         internal_medicine, cardiology, hematology)

In [None]:
%%R
dat %>% 
  select(study_id, 13:19)

Phew. This does not look tidy. We can fix this by pivoting the data from a wide to a long format. To avoid breaking `dat`, first create a new object called `departments` by selecting `study_id`, `visit_date`, `visit_time` and all department columns.

In [None]:
%%R
# YOUR R CODE HERE

In [None]:
#@title
%%R
departments <- dat %>%
  select(study_id, visit_date, visit_time, 13:19)

Run the following code to pivot the data. As you can see, we can either select the columns that we want to pivot, or exclude the columns that we do not want to pivot.

In [None]:
%%R
departments %>% 
  pivot_longer(cols = -c(study_id, visit_date, visit_time), names_to = "department", values_to = "value")

In [None]:
%%R
departments %>% 
  pivot_longer(cols = 4:10, names_to = "department", values_to = "value")

As expected, all `NA` values are pivoted as well. We can remove them with the `na.omit` function.

In [None]:
%%R
departments %>% 
  pivot_longer(cols = 4:10, names_to = "department", values_to = "value") %>%
  na.omit()
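
As an alternative sketch, `pivot_longer` has a `values_drop_na` argument that drops the missing values during the pivot itself. The `wide` tibble below is hypothetical toy data, not the real `departments` object. Note the difference in scope: `na.omit()` removes rows with an `NA` in any column, while `values_drop_na = TRUE` only looks at the pivoted value column.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per department, NA when not visited
wide <- tibble(
  study_id   = c(1, 2),
  cardiology = c(1, NA),
  neurology  = c(NA, 1)
)

# Pivot to long format, then drop rows containing any NA
long_omit <- wide %>%
  pivot_longer(cols = -study_id, names_to = "department", values_to = "value") %>%
  na.omit()

# One step: rows whose pivoted value is NA are dropped while pivoting
long_drop <- wide %>%
  pivot_longer(cols = -study_id, names_to = "department", values_to = "value",
               values_drop_na = TRUE)
```

For this toy table both approaches keep the same two rows.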

Now that we have a clean `department` column, we can `left_join` this tibble to our main object (`dat`), after first removing the wide department columns from `dat`.



In [None]:
%%R
departments %>% 
  pivot_longer(cols = 4:10, names_to = "department", values_to = "value") %>%
  na.omit()

In [None]:
%%R
departments %>% 
  pivot_longer(cols = 4:10, names_to = "department", values_to = "value") %>%
  na.omit() %>%
  select(-value)

In [None]:
%%R
dat

In [None]:
%%R
dat <- dat %>%
  select(-c(13:19)) %>%
  left_join(departments %>% 
            pivot_longer(cols = 4:10, names_to = "department", values_to = "value") %>%
            na.omit() %>%
            select(-value),
            by = c("study_id", "visit_date", "visit_time"))

In [None]:
%%R
dat

I confess, this is not the most elegant approach for this problem, but it is a good example of how you can combine multiple steps with Tidyverse.
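
To see what `left_join` does with several key columns, here is a small sketch on hypothetical toy tables (`visits` and `depts` are made up for illustration). Every row of the left table is kept, and rows without a match in the right table get `NA` in the joined columns.

```r
library(dplyr)

# Hypothetical toy tables sharing two key columns
visits <- tibble(
  study_id   = c(1, 1, 2),
  visit_date = as.Date(c("2020-01-01", "2020-02-01", "2020-01-01")),
  hr         = c(70, 82, 65)
)
depts <- tibble(
  study_id   = c(1, 2),
  visit_date = as.Date(c("2020-01-01", "2020-01-01")),
  department = c("cardiology", "neurology")
)

# Keep every row of visits; the second visit of patient 1 has no match,
# so its department becomes NA
joined <- visits %>%
  left_join(depts, by = c("study_id", "visit_date"))
```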

### 2.3 Transform
In this section we will transform some variables and create some new features.

Let's round the `age` column.

In [None]:
%%R
dat <- dat %>%
  mutate(age = round(age))

In [None]:
%%R
dat

We can also do this for multiple columns in one go using functional programming with tidyverse.

In [None]:
%%R
dat <- dat %>% mutate(across(c("hr", "sbp", "rr", "scr", "wbc"), round))

First, we tell R to transform the object with `mutate`. With `across` we specify that we would like to run a function over a selection of columns. 

In base R we would use the `sapply` function.

`dat[, c("hr", "sbp", "rr", "scr", "wbc")] <- sapply(dat[, c("hr", "sbp", "rr", "scr", "wbc")], round)`

What do you prefer? Base R or Tidyverse?
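
The comparison can be tried out on a small scale. The sketch below uses hypothetical toy data and, as an assumption of this example, stores the column names in a variable (`cols`), which is why `all_of()` is used inside `across`. The base R version uses `lapply` rather than `sapply`, because `lapply` always returns a list of columns.

```r
library(dplyr)

# Hypothetical toy measurements
toy <- tibble(
  id  = c("a", "b"),
  hr  = c(71.4, 80.6),
  sbp = c(119.7, 131.2)
)
cols <- c("hr", "sbp")

# Tidyverse: all_of() selects the columns named in a character vector
toy_tidy <- toy %>%
  mutate(across(all_of(cols), round))

# Base R: lapply() applies round() column by column
toy_base <- toy
toy_base[cols] <- lapply(toy_base[cols], round)
```

Both objects end up with the same rounded values, and the `id` column is left untouched.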

Create two additional columns with the `mutate` function:
- ratio of sbp to dbp (`sbp_dbp_ratio`): sbp / dbp
- mean arterial blood pressure (`map`): (2 x diastolic blood pressure + systolic blood pressure) / 3


In [None]:
%%R
# YOUR R CODE HERE

In [None]:
#@title
%%R
dat <- dat %>%
  mutate(sbp_dbp_ratio = sbp / dbp,
         map = (2 * dbp + sbp) / 3)

For our study, we want to exclude children and pregnant patients. Let's first look at the departments and the age distribution.

In [None]:
%%R
dat %>% count(department)

In [None]:
%%R
dat %>% pull(age) %>% summary()

This is equivalent to the base R command `summary(dat$age)`. The `pull` function extracts a column as a `vector`, whereas the `select` function returns a `tibble` object.

In [None]:
%%R
dat %>% pull(wbc) %>% head()
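
The difference between `pull` and `select` is easy to check on a hypothetical one-column tibble (toy values, not EHR data):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(age = c(34, 58, 71))

age_vector <- toy %>% pull(age)    # plain numeric vector
age_tibble <- toy %>% select(age)  # still a tibble with one column

is.numeric(age_vector)     # TRUE
is.data.frame(age_tibble)  # TRUE
```

Functions that expect a vector, such as `mean()` or `summary()`, can be applied directly to the result of `pull`.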

With the `filter` function, we can remove all patients younger than 20 years old.

In [None]:
%%R
dat <- dat %>%
  filter(age >= 20)

We can disregard all pregnant patients by removing the `obstetrics` department.

In [None]:
%%R
# YOUR R CODE HERE

In [None]:
#@title
%%R
dat <- dat %>%
  filter(department != "obstetrics")
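
One thing to keep in mind, sketched below with hypothetical toy data: `filter` keeps only rows where the condition is `TRUE`. A comparison with a missing value evaluates to `NA`, so rows with a missing `department` are dropped too. Whether the EHR data contains such rows is not checked here; if it does and you want to keep them, an explicit `is.na()` check is one option.

```r
library(dplyr)

# Hypothetical toy data with one missing department
toy <- tibble(
  study_id   = c(1, 2, 3),
  department = c("obstetrics", "cardiology", NA)
)

# Drops the obstetrics row and also the row with a missing department
strict <- toy %>%
  filter(department != "obstetrics")

# Keeps rows with a missing department
keep_na <- toy %>%
  filter(department != "obstetrics" | is.na(department))
```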

### 2.4 Visualise


Which features are predictive of the `outcome`? 
First, we will compute some summary statistics, followed by some plots with the `ggplot2` Tidyverse package.



Is `age` related to `outcome`? We can compute summary statistics for each `outcome` level by combining the `group_by` and `summarize` functions. Look at the example below.


In [None]:
%%R
dat %>% 
  group_by(outcome) %>%
  summarize(mean_age = mean(age))
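
`summarize` can return several statistics per group at once. A minimal sketch on hypothetical toy data (the values are made up), adding the group size with `n()` and the standard deviation with `sd()`:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  outcome = c(0, 0, 1, 1, 1),
  age     = c(30, 40, 60, 70, 80)
)

# One row per outcome level, one column per summary statistic
age_summary <- toy %>%
  group_by(outcome) %>%
  summarize(n        = n(),
            mean_age = mean(age),
            sd_age   = sd(age))
```

Reporting `n` next to the mean helps to judge how much weight a group average deserves.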

In [None]:
%%R
ggplot(data = dat, mapping = aes(x = hr, y = sbp, colour = factor(outcome))) +
  geom_point() +
  labs(colour = "Outcome")

Instead of a scatter plot, we can also plot other graph types.

In [None]:
%%R
ggplot(data = dat, mapping = aes(x = age, fill = factor(outcome))) +
  geom_density(alpha = 0.5) +
  labs(fill = "Outcome")

As you can see, plots made with `ggplot2` always follow the same structure: data, aesthetic mapping, and one or more geometric layers.
We can go through each step line by line.

In [None]:
%%R
ggplot(data = dat, mapping = aes(x = age, y = map))

Above, we only specified the `data` and `mapping` arguments. `ggplot2` already reads the data and plots the canvas.

By adding the geometric function `geom_point()` we will draw points on this canvas.

Attenzione! With `ggplot2` we will have to use the `+` to combine multiple commands, instead of the pipe operator `%>%`.

In [None]:
%%R
ggplot(data = dat, mapping = aes(x = age, y = map)) +
  geom_point()

We can also draw some box plots if we bin the age variable. We can put the data into the ggplot function with the pipe operator, fancy!

In [None]:
%%R
dat %>% 
  mutate(bin_age = cut(age, 
                       breaks = seq(min(age), max(age), by = 10),
                       include.lowest = TRUE)) %>%
  ggplot(aes(x = bin_age, y = map)) +
  geom_boxplot()
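
The binning step deserves a closer look. The sketch below uses a hypothetical vector of ages to show how `cut` treats values at the edges of the breaks. By default the intervals are open on the left, so the minimum value falls outside the first bin and becomes `NA`; `include.lowest = TRUE` closes the first interval. Values above the last break also become `NA`, which happens when `max(ages)` is not exactly on the `by = 10` grid. Extending the sequence by one step is one possible way around this.

```r
# Hypothetical ages; base R only
ages   <- c(20, 25, 30, 47)
breaks <- seq(min(ages), max(ages), by = 10)   # 20, 30, 40

# Default intervals (20,30] and (30,40]: both 20 and 47 fall outside
default_bins <- cut(ages, breaks = breaks)

# include.lowest = TRUE gives [20,30], but 47 is still above the last break
closed_bins <- cut(ages, breaks = breaks, include.lowest = TRUE)

# Extending the breaks past max(ages) covers every value
full_breaks <- seq(min(ages), max(ages) + 10, by = 10)   # 20, 30, 40, 50
full_bins   <- cut(ages, breaks = full_breaks, include.lowest = TRUE)
```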

In [None]:
%%R
dat %>% 
  mutate(bin_age = cut(age, seq(min(age), max(age), by = 10))) %>% 
  select(1:2, age, bin_age)

In [None]:
%%R
min(dat$age)