
In [1]:
#| output: false
rm(list = ls())
library(tidyverse)
library(rstatix)
library(easystats)
library(ggfortify)
library(ggpubr)
library(jtools)
library(pubh)
library(sjlabelled)

import::from(latex2exp, TeX)
theme_set(see::theme_modern(base_size = 10))
options('huxtable.knit_print_df' = FALSE)
options('huxtable.autoformat_number_format' = list(numeric = "%5.2f"))



# Overview

The first step in data analysis that involves `R` will be acquiring your data in an appropriate format, either by entering it directly into `R`, importing it from a spreadsheet or other format, or by opening a pre-existing `R` data file.

In this laboratory, we will look at performing simple data handling in `R`. The most important part of this lab is to get you familiar with the software.

Once you have completed this lab, you should feel comfortable:

-   Creating variables in `R`.
-   Understanding the difference between continuous variables and categorical ones (factors).
-   Transforming data (e.g., converting from pounds to kilograms, calculating BMIs from weights and heights, log transformations, etc.).
-   Extracting subsets of your data.
-   Assigning labels to variables and categories within variables.
-   Saving your data.
-   Using `R`'s help system.
-   Importing data in `R` from an Excel spreadsheet.
-   Creating script files.

## Summary of New Commands

| **Command**          | **Library**  | **Function**                                               |
|----------------------|--------------|------------------------------------------------------------|
| **%\$%**             | *magrittr*   | Exposition pipe operator                                   |
| **\|>**              | *base*       | Forward pipe operator                                      |
| **%in%**             | *base*       | Value matching                                             |
| **as_tibble**        | *tibble*     | Coerces objects into tibbles                               |
| **c**                | *base*       | Concatenates values                                        |
| **copy_labels**      | *sjlabelled* | Copies labels from a data frame                            |
| **count**            | *dplyr*      | Counts observations by group                               |
| **data**             | *utils*      | Loads data from `R` packages                               |
| **data_codebook**    | *datawizard* | Generates codebooks from data frames                       |
| **factor**           | *base*       | Defines *factors*                                          |
| **filter**           | *dplyr*      | Filters data frames, given conditions                      |
| **freq_table**       | *rstatix*    | Frequency tables for categorical variables                 |
| **glimpse**          | *tibble*     | Displays information about a dataset                       |
| **head**             | *utils*      | First rows of a data frame                                 |
| **help (?)**         | *utils*      | Help function                                              |
| **install.packages** | *utils*      | Installs packages in the system                            |
| **is.factor**        | *base*       | Evaluates if a variable is a factor or not                 |
| **library**          | *base*       | Loads (attaches) functions from a package                  |
| **length**           | *base*       | Number of observations in a variable                       |
| **levels**           | *base*       | Levels of categorical variables                            |
| **mutate**           | *dplyr*      | Transforms/generates variables                             |
| **mean**             | *base*       | Calculates the arithmetic mean                             |
| **max**              | *base*       | Calculates maximum value                                   |
| **names**            | *base*       | Column names of variables in a data frame                  |
| **nrow**             | *base*       | Number of rows (observations) in data frames               |
| **read_csv**         | *readr*      | Loads files with `csv` format/extension                    |
| **read_rds**         | *readr*      | Loads RDS files                                            |
| **relevel**          | *stats*      | Changes the reference category                             |
| **rep**              | *base*       | Replicates numbers or characters                           |
| **rm**               | *base*       | Deletes (*removes*) objects from the workspace             |
| **round**            | *base*       | Rounds variables                                           |
| **RSiteSearch**      | *utils*      | Searches for `R` functions on the web                      |
| **select**           | *dplyr*      | Selects variables from a data frame or tibble              |
| **setwd**            | *base*       | Sets the working directory (path)                          |
| **tibble**           | *tibble*     | Constructs *tibbles*                                       |
| **var_labels**       | *sjlabelled* | Assigns labels to variables                                |
| **View**             | *utils*      | Displays data frames                                       |
| **which**            | *base*       | Finds the positions where the stated conditionals are true |
| **which.max**        | *base*       | Finds the position of the maximum value                    |
| **with**             | *base*       | Evaluates commands in a defined data frame or tibble       |
| **write_csv**        | *readr*      | Exports files with `csv` format/extension                  |
| **write_rds**        | *readr*      | Writes RDS files                                           |

## RStudio

A typical session on `RStudio` would look something like:

![RStudio displaying a typical .Rmd Markdown script file on the top left corner](figures/RStudio.png)

`RStudio` can have up to four panels open (displaying information):

1.  **Source** panel (top left). This panel shows the *script* files.
2.  **Console** panel (bottom left). This panel is where you interact with the program to perform an analysis. It shows the path of the current working directory at the top.
3.  **Workspace** panel (top right). You can select what to show in here. In the current example, it shows information about the workspace, like currently loaded objects (*Environment*) and the *History* of our commands.
4.  **Display** panel (bottom right). You can select what to show in here. In the current example, it shows *Files*, the *Help* files, generated *Plots*, *Packages* available on your `R` installation and the *Viewer*.

## Scripts

Files that document our analysis are known as *scripts*, and they open on the *Source* panel. The standard script file has a `.R` extension and can be opened with any text editor.

![](figures/Script.png)

The previous figure displays an example of a *script* file. These kinds of files are intended to be read by `R`, so any text that is not part of a command has to be *commented*. `R` interprets anything written after a `#` (on the same line) as a comment.

Let's clarify the first set of lines from the script file:

1.  The first line is a comment and gives the name of the file. It is not needed, but it is good to have.
2.  The second line is a comment about the content of the script. It can be a short or long description. The important thing to remember is that once you start a new line, for example to write a different paragraph, you need to include the `#` symbol at the beginning of each new line.
3.  The third line is a comment about the author.
4.  The fourth line is a comment about the date.
5.  The sixth line sets the working directory. As explained in the preface, this can be done with the menus and it's not needed when working in `R` projects.
6.  Line 7 loads the `pubh` package.
7.  Line 9 estimates measures of association of the *exposure* `treat` on the *outcome* `fate` from the `Bernard` data set.

## Notebooks

The disadvantage of `.R` scripts is that you cannot produce a single document with both the analysis and its results. Notebooks allow us to record text, commands and results (including plots). We will be using Notebooks to document our analysis in both *PUBH 725* and *PUBH 726*.

We will start by creating a mock notebook, using the template that comes with the `pubh` package [@pubh]. Open `RStudio` and select `File > New File > R Markdown...`.

A window will pop up like the one shown in the following figure. Select `From Template > PUBH Template`.

![](figures/Notebook1.png)

Click `OK` or press the `Return` key. Have a look at the template; we will edit the script later. For the moment, let's run the template as it is. To execute the Notebook, you only need to click on the `Knit` button, which can be found in the *Source* panel. When you click *Knit* (see **1** in the next figure) for the first time, it will ask for a name to save your file. Give a name like *Lab1*. The result will appear on the *Display* panel, and two files will be created: *Lab1.Rmd* is the script, and *Lab1.html* is the output that you can open with any web browser.

![](figures/SourcePanel.png)

Let's create the Notebook for the first laboratory. Change the title to *Data management* and edit the author field.

Code is inserted in sections called *chunks* or *blocks*. Go to the end of the script and click on `Insert > R` (see **2**). Your cursor will be placed, by default, where you can insert your commands. The easiest thing to do is to copy the code from the lab book and paste it in the chunk.

The option `message = FALSE` was added to the first chunk to hide messages from the output. You can change the options of a chunk by clicking on the *Cog button* (see **6**). Compile the *RMarkdown* script by clicking on the *Knit* button (see **1**).

When you *Knit* the document, it compiles the full document and displays the results on the *Display* panel. Sometimes, we want to check a particular command. To do that, you can *transfer* your command from the *script* file by clicking on the small green arrow at the right of the code (see **3** and **6**). Depending on your preferences, the results will show on the *Console* panel or directly in your script file.

## Accessing help

When you do not know the specific options or syntax of a particular command, you can access the help files. For example, let's say you want to learn more about the `mean` function. One way is to use the `help` command:

In [None]:
#| eval: false
help(mean)

As an alternative, we can use the question mark, without any parentheses, as in:

In [None]:
#| eval: false
?sd

It's important that you get familiar with help files. At the end of each help file, you can see some examples. You can select a particular example and then type Ctrl + Return (Windows) or Command + Return (Macintosh) to transfer the selection to the console.

When you do not know the name of the function, or the function is part of a package that is installed on the computer but not loaded yet, you can search by using double question marks. For example, let's say that you would like to know how to perform diagnostic tests (e.g., sensitivity, specificity, etc.). In that case, you would type (please note the use of quotes):

In [None]:
#| eval: false
?? "diagnostic tests"

When we use `??` the system searches the help files of the currently installed packages. When we need to make a further search on the web, we can use `RSiteSearch` in the *Console* panel. For example:

In [None]:
#| eval: false
RSiteSearch("meta-analysis")

## Browsing help

Other times, you just want to browse the functions of a particular package of interest, to find new commands. For example, go to the **Display** panel and select the **Packages** tab. Look for the package `pubh` and click on it. A help file will open with the description of all functions and data that are part of that package. You can click on any of the listed functions to gather more information. Some packages also have **Vignettes**, which are more helpful as they guide you through the use of the functions contained in the package.

In the help file of `pubh` click on `User guides, package vignettes and other documentation`. Next, select the first one: `Introduction to the pubh package`. You do not have to read through that vignette today, but now you know how to access vignettes.

## Packages

When you open `R`, it loads a standard set of *packages*, each of which includes a particular set of functions and data. We can extend the number of available functions by loading more packages into the system.

When you start with the template provided by the `pubh` package, the first chunk loads recommended packages for *PUBH 725* and *PUBH 726* into the session. When loading a package, its *required* packages are also automatically attached.

In particular, when loading `pubh` the following packages are loaded too:

-   `emmeans`
-   `ggformula`
-   `magrittr`
-   `huxtable`
-   `gtsummary`

The `tidyverse` loads a collection of packages:

-   `dplyr`
-   `forcats`
-   `ggplot2`
-   `lubridate`
-   `purrr`
-   `readr`
-   `stringr`
-   `tibble`
-   `tidyr`

When we load package `easystats` the following packages are loaded:

-   `insight`
-   `bayestestR`
-   `parameters`
-   `modelbased`
-   `see`
-   `datawizard`
-   `effectsize`
-   `correlation`
-   `report`

Sometimes, we would like to load more packages either because of a particular function or because we would like to access data from that library. We use the function `library` to load a package; for example, to load the `ISwR` package (ISwR stands for *Introductory Statistics with R*), we type:

In [None]:
#| eval: false
library(ISwR)

A list of most of the available `R` packages can be found at the [CRAN](https://cran.r-project.org/) web page, under *Packages*. Packages associated with Bioinformatics can be found at the [Bioconductor](http://www.bioconductor.org/) web page. Finally, packages organised by topic can be found [here](https://cran.r-project.org/web/views/).

To install a new package, you can go to the *Display* panel, and click on *Install* under the *Packages* tab.

Another option is to type the command in the *Console* panel. For example, to install the `epibasix` package we type:

In [None]:
#| eval: false
install.packages("epibasix")
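Installing a package that is already present just repeats the download, so it can help to check first. The following is a minimal sketch, not part of the course template: `is_installed` is a hypothetical helper name, built on the base function `requireNamespace`, and the block only reports what it finds rather than installing anything.

```r
# Hypothetical helper: TRUE when the package can be found in the local library
is_installed = function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

pkg = "epibasix"
if (is_installed(pkg)) {
  message(pkg, " is already installed.")
} else {
  # We only report here; run install.packages() yourself in the Console
  message(pkg, " is missing; install it with install.packages(\"", pkg, "\").")
}
```

The check is done with `requireNamespace` rather than `library` because it returns `FALSE` for a missing package instead of stopping with an error.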

# Objects in `R`

## Short-cuts

The following are important short-cuts that we can use in `RStudio`:

| Function   | Output | MacOS Short-cut   | Windows Short-cut |
|------------|--------|-------------------|-------------------|
| Assignment | `<-`   | `Option -`        | `Alt -`           |
| Pipe       | `%>%`  | `Shift Command M` | `Shift Ctrl M`    |

| Action                           | MacOS Short-cut    | Windows Short-cut |
|----------------------------------|--------------------|-------------------|
| Insert Chunk                     | `Option Command I` | `Alt Ctrl I`      |
| Run line/selection               | `Command Return`   | `Ctrl Return`     |
| Formats text selection as `code` | `Command D`        | `Ctrl D`          |

## Assignments

An excellent introduction to `R` and its objects can be found in the first chapter of [@iswr].

A variable that holds a single number or character string is known as a *scalar*. Assignments are done with `<-` (with no space between the two symbols) or with the equal symbol `=`. For example:

In [7]:
x = 5
x + 3

When a variable holds two or more numbers or characters, it is called a *vector*. For example, a vector of 3 weights (in pounds) is generated using the command `c` (concatenate):

In [8]:
weight = c(151.45, 194, 121.25)

The advantage of vectors is that an operation is applied to every element at once, which is faster and shorter than operating on each value separately. For example, let's say that we want to convert the weights from pounds to kg and, rather than creating another variable, simply replace the old one. One kg equals 2.2046 pounds:

In [9]:
weight = weight/2.2046
weight

There is no way we could measure human weight with that precision using standard devices, so let's round the values. First, look at the help file of `round`. The default is `digits = 0` (i.e., no decimals). We are going to keep one decimal. As `round` has no other options, we can pass the object (`x` in the help file) and the number of digits by position. On other occasions, for clarity, we will type things like `digits = 1`.

In [10]:
round(weight, 1)
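As a worked version of the two steps above, the sketch below keeps the original pounds in one object and stores the rounded kilograms in another, naming the `digits` argument explicitly. The object names `weight_lb` and `weight_kg` are illustrative only and are not used elsewhere in the lab.

```r
# Weights in pounds, as entered above
weight_lb = c(151.45, 194, 121.25)

# One vectorised division converts every element; digits = 1 keeps one decimal
weight_kg = round(weight_lb / 2.2046, digits = 1)
weight_kg
```

Keeping both objects makes it easy to check the conversion, at the cost of one extra variable in the workspace.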

## Pipe-work flow

`R` is a computer language, so nested functions are evaluated from the innermost one outwards. For example, if we want to estimate the mean value of the vector of weights that we created, and report only one decimal, we write:

In [None]:
round(mean(weight, na.rm = TRUE), 1)

When we have several parentheses, it's easy to get lost in the code. One option is to use the native pipe command `|>`, which passes the result of one command on to the next in what is known as a *pipe-work flow*. In this case, commands are simply read from left to right and top to bottom:

In [11]:
weight |>
  mean(na.rm = TRUE) |>
  round(1)
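To confirm that the nested and the piped forms are the same computation, a small check like the following can be run. It is only a sketch: the weights are illustrative values in kg, and the native pipe requires `R` 4.1 or later.

```r
# Illustrative weights (kg)
weight = c(68.7, 88.0, 55.0)

# Nested form: read from the inside out
nested = round(mean(weight, na.rm = TRUE), 1)

# Piped form: read from left to right, top to bottom
piped = weight |>
  mean(na.rm = TRUE) |>
  round(1)

identical(nested, piped)
```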

To *remove* (clear) objects from the workspace, we use `rm`:

In [12]:
rm(x, weight)

# Generating data

For small datasets, the easiest thing to do is to generate the data directly in `R`. We will start by entering some data of our own. This particular dataset describes the levels of uric acid in the bloodstream of twenty subjects aged from 21 to 25. There were five individuals with each combination of Down's syndrome being present/absent and sex being male/female.

The variable `uric` contains the values for the uric acid, the variable `downs` contains numerical values for Down's syndrome (0 = "No", 1 = "Yes") and the variable `sex` contains numerical values representing sex (0 = "Male", 1 = "Female").

In [13]:
uric_down = tibble(
  uric = c(5.84, 6.3, 6.95, 5.92, 7.94, 5.5, 6.08, 5.12, 7.58, 6.78,
           4.9, 6.95, 6.73, 5.32, 4.81, 4.94, 7.2, 5.22, 4.6, 3.88),
  downs = c(rep(0, 5), rep(1, 5), rep(0, 5), rep(1, 5)),
  sex = c(rep(0, 10), rep(1, 10))
)

We used the `rep` (replicate) command for both `downs` and `sex`. In the case of `downs`, it alternates 5 zeros (meaning "No") and 5 ones (meaning "Yes"). For sex, we are entering the males first (10 of them) and the females later (10 of them).
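The same patterns can be written more compactly with the `each` and `times` arguments of `rep`. This is an optional alternative, sketched here with illustrative object names, and not the form used in the rest of the lab.

```r
# Pattern used above: 5 zeros, 5 ones, 5 zeros, 5 ones
downs_long = c(rep(0, 5), rep(1, 5), rep(0, 5), rep(1, 5))

# each = 5 repeats every value 5 times; times = 2 repeats that whole block twice
downs_short = rep(c(0, 1), each = 5, times = 2)

# 10 males (0) followed by 10 females (1)
sex_long = c(rep(0, 10), rep(1, 10))
sex_short = rep(c(0, 1), each = 10)

identical(downs_long, downs_short)
identical(sex_long, sex_short)
```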

The most common way to work with data sets in `R` is through data frames (`data.frame`). A data frame is a rectangular object in which all components (variables) have the same length. A data frame can have variables of different types (character, logical, double, etc.), but each variable holds a single type. A modern version of data frames are `tibbles`.

We can look at all columns but only the first rows of the data:

In [14]:
uric_down |> head()

uric,downs,sex
<dbl>,<dbl>,<dbl>
5.84,0,0
6.3,0,0
6.95,0,0
5.92,0,0
7.94,0,0
5.5,1,0


## Categorical variables: Factors

When we defined the data set `uric_down`, we used only numbers. Both `downs` and `sex` are categorical variables. A categorical variable in `R` is known as `factor`. Each `factor` contains two or more `levels` or *categories*.

To convert the variable `sex` from the data set `uric_down` to a factor we use the command `factor` indicating the names of the `levels` for each category in the same order as the corresponding sequence of numbers.

In this kind of operation, we need to give information of the variable, the name of the dataset and the actual function with options. One way to accomplish this, would be (please do **NOT** run):

In [None]:
#| eval: false
uric_down$sex = factor(uric_down$sex, labels = c("Male", "Female"))

If you understood the instructions, you did NOT run the previous code! Alas, we need to explain.

The name of the variable is `sex` and the name of the dataset is `uric_down`. We use the `$` symbol to give an address, like saying:

> *Variable `sex` lives at `uric_down`.*

We would need to do something similar to convert `downs` to a `factor`.

We will use a different approach, one more elegant, modern and *posh* (just saying!). We will use a pipe-workflow to transform variables.

Package `magrittr` introduced the concept of *pipes* in `R`; it's like passing information between objects and functions. The symbol to *pass* or *pipe* the information in `magrittr` is `%>%`, while the native pipe command in `R` is `|>`.


For the conversion to factors, we use the function `mutate` from the `dplyr` package.

Let's transform `downs` and `sex` from numerical (`double`) to categorical (`factor`) variables:

In [15]:
uric_down2 = uric_down |>
  mutate(
    downs = factor(downs, labels = c("No", "Yes")),
    sex = factor(sex, labels = c("Male", "Female"))
  ) 

uric_down2 |> head()

uric,downs,sex
<dbl>,<fct>,<fct>
5.84,No,Male
6.3,No,Male
6.95,No,Male
5.92,No,Male
7.94,No,Male
5.5,Yes,Male
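When converting codes to factors, it is worth checking that each number was mapped to the intended label. The sketch below does this on a toy vector (the values are illustrative) and, as an extra safeguard that the code above does not use, states the `levels` explicitly so that the mapping does not depend on which codes happen to be present.

```r
# Toy numeric codes: 0 = "Male", 1 = "Female"
sex_code = c(0, 0, 1, 1, 0)

# levels gives the codes in order; labels gives their names in the same order
sex_fct = factor(sex_code, levels = c(0, 1), labels = c("Male", "Female"))

# Cross-tabulate codes against labels: all counts should sit on the diagonal
table(code = sex_code, label = sex_fct)
levels(sex_fct)
```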


## Labels

We would also like to display more information than the current variable name, in tables and figures. To accomplish this, we associate `labels` with variables. We will use `var_labels` from the `sjlabelled` package. Notice, that the way we assign labels is similar to the one we used for transformations:

In [16]:
uric_down2 = uric_down2 |>
  var_labels(
    uric = "Uric acid (mg/dl)",
    downs = "Down's syndrome",
    sex = "Sex"
  )
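To check that labels were stored, they can be read back with `get_label`, also from `sjlabelled`. The data frame `toy` below is a small illustrative example, not the lab data.

```r
library(sjlabelled)

# Small illustrative data frame
toy = data.frame(
  uric = c(5.84, 6.3),
  sex = factor(c("Male", "Female"))
)

# Assign labels, then read them back as a named character vector
toy = var_labels(toy, uric = "Uric acid (mg/dl)", sex = "Sex")
get_label(toy)
```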

For small datasets (like our current one) it is easier to make the transformation and the labelling as part of the same pipe-workflow.

We have not modified the original data set `uric_down`; instead, we created a new one, `uric_down2`, so we could go one step at a time. You can remove `uric_down2` with:

In [17]:
rm(uric_down2)

To do both transformation and labelling in the same pipe-workflow, we type:

In [18]:
uric_down = uric_down |>
  mutate(
    downs = factor(downs, labels = c("No", "Yes")),
    sex = factor(sex, labels = c("Male", "Female"))
  ) |>
  var_labels(
    uric = "Uric acid (mg/dl)",
    downs = "Down's syndrome",
    sex = "Sex"
  )

## Saving `R` data frames

The advantage of saving data frames as `R` data (`.rds`) is that `R` will have access to factors, levels and labels. I will save the data in the subdirectory *data*.

In [19]:
write_rds(uric_down, "data/uric_down.rds")

If you are following instructions, you have everything recorded in your script in case you did something wrong. To show how to import, we will remove all objects associated with the data frame first:

In [20]:
rm(uric_down)

To load the data, we use `read_rds` and assign the file to a new object (in this case, a tibble). For simplicity, I use the same name of the file as the name of the new data frame, but you can change that.

In [21]:
uric_down = read_rds("data/uric_down.rds")
uric_down |> head()

uric,downs,sex
<dbl>,<fct>,<fct>
5.84,No,Male
6.3,No,Male
6.95,No,Male
5.92,No,Male
7.94,No,Male
5.5,Yes,Male
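The point about factors surviving in `.rds` files can be seen in a small round trip. This sketch uses the base functions `saveRDS`/`readRDS` (which `write_rds`/`read_rds` wrap) and `write.csv`/`read.csv` as stand-ins, and writes to temporary files so nothing in your project is touched; `toy` is an illustrative data frame.

```r
# Illustrative data frame with one factor
toy = data.frame(
  uric = c(5.84, 5.5),
  downs = factor(c("No", "Yes"), levels = c("No", "Yes"))
)

# Temporary files, removed when the R session ends
rds_file = tempfile(fileext = ".rds")
csv_file = tempfile(fileext = ".csv")

saveRDS(toy, rds_file)
write.csv(toy, csv_file, row.names = FALSE)

from_rds = readRDS(rds_file)
from_csv = read.csv(csv_file, stringsAsFactors = FALSE)

is.factor(from_rds$downs)    # the factor and its levels are kept
is.character(from_csv$downs) # plain text: the factor definition is lost
```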


# Variables in data frames

The `uric_down` data frame has three variables. `R` can have more than one data frame loaded in the same session. That feature has the disadvantage that we need to tell `R` where to find individual variables.

For example, `uric_down` has a vector (variable) named `sex`. We could have another data frame which also has a variable named `sex`, so how do we know which one we are analysing? We have to give information about both the data frame and the vector. One way is by using the `$` symbol. The syntax is:

**data\$vector**

For example:

In [22]:
uric_down$sex

Another option is to **select** the variables (columns) we are interested in:

In [23]:
uric_down |>
  select(uric) |>
  describe_distribution()

Variable,Mean,SD,IQR,Min,Max,Skewness,Kurtosis,n,n_Missing
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>
uric,5.928,1.099529,1.9225,3.88,7.94,0.1086282,-0.8318228,20,0


In [24]:
uric_down |>
  freq_table(sex, downs)

sex,downs,n,prop
<fct>,<fct>,<int>,<dbl>
Male,No,5,50
Male,Yes,5,50
Female,No,5,50
Female,Yes,5,50


An alternative to `$` is the function `with`. The syntax is: `with(data, function(x))`. For example, if we want to know if `sex` is a factor we can type:

In [25]:
with(uric_down, is.factor(sex))

Or using pipes:

In [26]:
uric_down %$% is.factor(sex)

The same command using the `$` symbol:

In [27]:
is.factor(uric_down$sex)

In the current example, the last command was simpler, but in many cases it is better to use a pipe-workflow.

# Export and import objects

To export a data frame (or other objects) to *Excel*, we will write `.csv` files (comma-separated values). In the following code, the file will be saved in the subdirectory `data`:

In [28]:
write_csv(uric_down, "data/uric_down.csv")

For importing data from *Excel*, remember to:

1.  Don't use complicated names for the variables, in particular:

-   Don't start a variable name with a number.
-   Don't leave spaces as part of the name, e.g. don't use `blood pressure`. Some alternatives are:
    -   `blood_pressure`
    -   `bp`
    -   `blood.pressure`
    -   `BloodPressure`

2.  Don't use a long, complicated name for the name of your file; it is better to avoid spaces.
3.  Don't leave cells blank (without any information). For missing data, we will type `NA`.
4.  It is easier to record only numbers and to add labels in `R` later, to avoid mistakes in the names (`female`, `Female`, and `female` with a blank space before the `f` are not the same).
5.  Export your data as *comma-separated values* (.csv).

Let's load our data frame:

In [29]:
uric_down = read_csv("data/uric_down.csv",  col_types = "dff")
uric_down |> data_codebook()

ID,Name,Type,Missings,Values,N,Prop,.row_id
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>
1.0,uric,numeric,0 (0.0%),"[3.88, 7.94]",20.0,,1
,,,,,,,1
2.0,downs,categorical,0 (0.0%),No,10.0,50.0%,2
,,,,Yes,10.0,50.0%,2
,,,,,,,2
3.0,sex,categorical,0 (0.0%),Male,10.0,50.0%,3
,,,,Female,10.0,50.0%,3
,,,,,,,3
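The string `"dff"` passed to `col_types` is a compact specification with one letter per column: `d` for double and `f` for factor. The sketch below, which assumes `readr` 2.0 or later so that literal text wrapped in `I()` can be read instead of a file, compares it with the longer `cols()` form on two illustrative rows.

```r
library(readr)

# Literal CSV text (illustrative rows), read in place of a file
csv_text = I("uric,downs,sex\n5.84,No,Male\n5.5,Yes,Female\n")

# Compact form: one letter per column, in column order
compact = read_csv(csv_text, col_types = "dff")

# Explicit form: one column specification per column name
explicit = read_csv(
  csv_text,
  col_types = cols(
    uric = col_double(),
    downs = col_factor(),
    sex = col_factor()
  )
)

sapply(compact, class)
levels(compact$sex)
```

With `f` (or `col_factor()` without `levels`), the levels are taken in the order in which they first appear in the file, so it is worth checking `levels()` after importing.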


# Data manipulation

In most cases, you start by inspecting your data, cleaning, defining factors and making transformations. As mentioned in the previous section, your data will usually come from a spreadsheet. In this paper, we will use data contained in `R` *packages* most of the time.

First read the help file for the dataset `wcgs` by typing: `?epitools::wcgs` in the **Console** panel.

We use the function `data` to load data from `R` *packages*.

In [30]:
data(wcgs, package = "epitools")
wcgs |> names()

The first thing I would like to do is to change the name of the variables:

In [31]:
wcgs = as_tibble(wcgs)
names(wcgs) = c(
  "id", "age", "height", "weight", "sbp", "dbp",
  "chol", "beh_pat", "ncigs", "dib_pat", "chd",
  "type_chd", "time", "arcus"
  )

wcgs |> names()

## Defining factors

Now, we define categorical variables as factors. By default, the first level (the lowest numeric code, usually zero) is our reference.

In [32]:
wcgs = wcgs |>
  mutate(
    chd = factor(chd, labels = c("No CHD", "CHD")),
    arcus = factor(arcus, labels = c("Absent", "Present")),
    beh_pat = factor(beh_pat, labels = c("A1", "A2", "B3", "B4")),
    dib_pat = factor(dib_pat, labels = c("B", "A")),
    type_chd = factor(
      type_chd, labels = c("No CHD", "MI or SD", "Angina", "Silent MI")
    )
  )

## Transforming to a binary variable

One of our variables is a count and stores the number of cigarettes smoked per day. We can define a new variable `smoker` in which everyone who smokes one or more cigarettes/day is a smoker. One of the easiest ways to create binary variables is to use a *conditional* statement. For example, the result of `wcgs$ncigs > 0` is a logical vector of `TRUE` and `FALSE` values.

In [33]:
wcgs = wcgs |>
  mutate(
    smoker = factor(ncigs > 0, labels = c("Non-Smoker", "Smoker"))
  )
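The conversion works because `FALSE` sorts before `TRUE`, so the first label goes to non-smokers. The sketch below shows the same idea on a toy vector (illustrative values) with the `levels` spelled out, which is an addition to the code above: it keeps both categories even if a subset happens to contain only smokers.

```r
# Toy cigarette counts per day
ncigs = c(25, 0, 3, 0)

# Logical condition, then labelled factor; levels fixes the FALSE/TRUE order
smoker = factor(
  ncigs > 0,
  levels = c(FALSE, TRUE),
  labels = c("Non-Smoker", "Smoker")
)
table(smoker)

# A subset with smokers only still carries both levels
only_smokers = factor(
  c(5, 10) > 0,
  levels = c(FALSE, TRUE),
  labels = c("Non-Smoker", "Smoker")
)
levels(only_smokers)
```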

## Simple numeric transformations

We also prefer units in the metric system. We will convert from inches to centimetres and from pounds to kg.

In [34]:
wcgs = wcgs |>
  mutate(
    height = height * 2.54,
    weight = weight * 0.4536
  )
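Once height and weight are in metric units, derived variables such as the body mass index (BMI, weight in kg divided by the square of height in metres) follow directly. The sketch below is not part of the lab's data set preparation; it uses two illustrative subjects and hypothetical object names.

```r
# Illustrative values in imperial units
height_in = c(73, 70)
weight_lb = c(150, 160)

# Same conversions as above
height_cm = height_in * 2.54
weight_kg = weight_lb * 0.4536

# BMI = weight (kg) / height (m)^2; height is converted from cm to m first
bmi = weight_kg / (height_cm / 100)^2
round(bmi, 1)
```

In a pipe-workflow, the same formula could be placed inside `mutate`, after the lines that convert `height` and `weight`.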

## Changing the reference for factors

First, check the reference level for the variable `dib_pat`:

In [35]:
levels(wcgs$dib_pat)

It would make more sense to have `A` as our reference category.

In [36]:
wcgs = wcgs |>
  mutate(dib_pat = relevel(dib_pat, ref = "A"))
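Changing the reference only reorders the levels; the observed values stay the same. A small sketch on a toy factor (illustrative values) makes this visible:

```r
# Toy factor with "B" as the first (reference) level
dib = factor(c("B", "A", "A", "B"), levels = c("B", "A"))
levels(dib)

# Move "A" to the front so it becomes the reference category
dib2 = relevel(dib, ref = "A")
levels(dib2)

# The data themselves are unchanged
identical(as.character(dib), as.character(dib2))
```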

## Labels

It is also helpful to add labels to variables.

In [37]:
wcgs = wcgs |>
  var_labels(
    age =  "Age (years)",
    height = "Height (cm)",
    weight = "Weight (kg)",
    sbp = "SBP (mm Hg)",
    dbp = "DBP (mm Hg)",
    chol = "Cholesterol (mg/dl)",
    beh_pat = "Behaviour pattern",
    ncigs = "Cigarettes (n/day)",
    dib_pat = "Dichotomous behaviour",
    chd = "Coronary Heart Disease",
    type_chd = "Type of CHD",
    time = "Follow up time (days)",
    arcus = "Corneal arcus",
    smoker = "Smoking status"
  )

In [38]:
#| code-fold: true
wcgs |> glimpse()

Rows: 3,154
Columns: 15
$ id       <int> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2…
$ age      <int> 49, 42, 42, 41, 59, 44, 44, 40, 43, 42, 53, 41, 50, 43, 44, 5…
$ height   <dbl> 185.42, 177.80, 175.26, 172.72, 177.80, 182.88, 182.88, 180.3…
$ weight   <dbl> 68.0400, 72.5760, 72.5760, 68.9472, 68.0400, 92.5344, 74.3904…
$ sbp      <int> 110, 154, 110, 124, 144, 150, 130, 138, 146, 132, 146, 138, 1…
$ dbp      <int> 76, 84, 78, 78, 86, 90, 84, 60, 76, 90, 94, 96, 90, 80, 80, 8…
$ chol     <int> 225, 177, 181, 132, 255, 182, 155, 140, 149, 325, 223, 271, 2…
$ beh_pat  <fct> A2, A2, B3, B4, B3, B4, B4, A2, B3, A2, A2, A2, A1, B3, B3, B…
$ ncigs    <int> 25, 20, 0, 20, 20, 0, 0, 0, 25, 0, 25, 20, 50, 30, 0, 3, 9, 0…
$ dib_pat  <fct> A, A, B, B, B, B, B, A, B, A, A, A, A, B, B, B, B

In [39]:
wcgs |>
  data_codebook()

ID,Name,Label,Type,Missings,Values,N,Prop,.row_id
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>
1.0,id,,integer,0 (0.0%),"[2001, 22101]",3154.0,,1
,,,,,,,,1
2.0,age,Age (years),integer,0 (0.0%),"[39, 59]",3154.0,,2
,,,,,,,,2
3.0,height,Height (cm),numeric,0 (0.0%),"[152.4, 198.12]",3154.0,,3
,,,,,,,,3
4.0,weight,Weight (kg),numeric,0 (0.0%),"[35.38, 145.15]",3154.0,,4
,,,,,,,,4
5.0,sbp,SBP (mm Hg),integer,0 (0.0%),"[98, 230]",3154.0,,5
,,,,,,,,5


In [40]:
write_rds(wcgs, "data/wcgs.rds")

## Indexing and subsets

Let's say that we are only interested in subjects who are smokers. If that is the case, we can create a new data frame:

In [41]:
smokers = wcgs |>
  filter(smoker == "Smoker") |>
  copy_labels(wcgs)

One way to check that we did not make a terrible mistake is to check the number of observations. The number of observations in a data frame is, most of the time, equal to the number of rows (`nrow`):

In [42]:
wcgs |> nrow()
smokers |> nrow()

The function `nrow` works on *arrays*, i.e., data frames and matrices. For vectors, we use the function `length` instead. For example:

In [43]:
length(wcgs$smoker)

Another important concept is that of *indexing*. For indexing, we write the conditional inside square brackets. For example, another way to look at the number of smokers:

In [44]:
length(wcgs$smoker[wcgs$smoker == "Smoker"])

Using `with`:

In [45]:
with(wcgs, length(smoker[smoker == "Smoker"]))

Using a pipe-workflow:

In [46]:
wcgs |> count(smoker)

smoker,n
<fct>,<int>
Non-Smoker,1652
Smoker,1502


In [47]:
wcgs |> freq_table(smoker)

smoker,n,prop
<fct>,<int>,<dbl>
Non-Smoker,1652,52.4
Smoker,1502,47.6
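The indexing idea used above can be broken into steps on a toy vector. This is only a sketch with illustrative values: the conditional produces a logical vector, the square brackets keep the elements where it is `TRUE`, and, because `TRUE` counts as 1, `sum` gives the same count in one step.

```r
# Toy vector of smoking status
smoker = c("Smoker", "Non-Smoker", "Smoker", "Smoker")

# Step 1: the conditional gives one TRUE/FALSE per element
is_smoker = smoker == "Smoker"
is_smoker

# Step 2: square brackets keep only the elements where the condition is TRUE
length(smoker[is_smoker])

# Shortcut: TRUE counts as 1 and FALSE as 0
sum(is_smoker)
```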


Suppose we want to know the number of smokers who weigh 100 kg or more:

In [48]:
wcgs |>
  filter(weight >= 100) |>
  count(smoker)

smoker,n
<fct>,<int>
Non-Smoker,28
Smoker,20


In [49]:
smokers |>
  filter(weight >= 100) |>
  count()

n
<int>
20


For obtaining the same result, but working on the original dataset, we would need to use two conditionals. We use the symbol `&` for **AND** and the symbol `|` for **OR**.

In [50]:
wcgs |>
  filter(weight >= 100 & smoker == "Smoker") |>
  count()

n
<int>
20


In [51]:
#| code-fold: true
smokers |>
  filter(beh_pat == "A2" & (type_chd == "Angina" | type_chd == "Silent MI")) |>
  count()

n
<int>
38


It's possible to answer the exercise (counting smokers with behaviour pattern `A2` and either angina or a silent MI) using `%in%`, which is a variant of the `match` command:

In [52]:
smokers |>
  filter(beh_pat == "A2" & type_chd %in% c("Angina", "Silent MI")) |>
  count()

n
<int>
38
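The two forms differ only in how they treat missing values, which the following sketch shows on a toy vector (illustrative values, including one `NA`). With `==` and `|` a missing value gives `NA`, while `%in%` gives `FALSE`. Since `filter` drops rows where the condition is `NA`, both forms return the same count.

```r
# Toy vector with one missing value
type_chd = c("Angina", "No CHD", "Silent MI", NA)

# Two comparisons joined with OR: the missing value stays missing
with_or = type_chd == "Angina" | type_chd == "Silent MI"
with_or

# Value matching: the missing value is simply not in the set
with_in = type_chd %in% c("Angina", "Silent MI")
with_in
```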


Let's say we want to know all the variable values for the subject who has the maximum weight. The function `which` gives us the positions where the given condition is true. I will assign the result to a variable named `pos`.

In [53]:
pos = wcgs %$% which(weight == max(weight))
pos

Alternatively, we can use `which.max`:

In [54]:
wcgs %$% which.max(weight)
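The two approaches agree when the maximum is unique, but they differ when there are ties. A minimal sketch on a toy vector (illustrative values with a tied maximum):

```r
# Toy weights with the maximum appearing twice
weight = c(70, 95, 82, 95)

# which() returns every position where the condition is TRUE
all_pos = which(weight == max(weight))
all_pos

# which.max() returns only the first position of the maximum
first_pos = which.max(weight)
first_pos
```

If you need every subject who shares the maximum, `which` with a conditional is the safer choice.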

For indexing arrays, we use square brackets. The first number refers to the row and the second to the column. If one of them is missing, that means we are asking for all the values.

In [55]:
wcgs[pos, ]

id,age,height,weight,sbp,dbp,chol,beh_pat,ncigs,dib_pat,chd,type_chd,time,arcus,smoker
<int>,<int>,<dbl>,<dbl>,<int>,<int>,<int>,<fct>,<int>,<fct>,<fct>,<fct>,<int>,<fct>,<fct>
10078,43,193.04,145.152,166,102,188,B3,0,B,CHD,Angina,1795,Absent,Non-Smoker


In [57]:
wcgs |>
  filter(id == id[pos])

id,age,height,weight,sbp,dbp,chol,beh_pat,ncigs,dib_pat,chd,type_chd,time,arcus,smoker
<int>,<int>,<dbl>,<dbl>,<int>,<int>,<int>,<fct>,<int>,<fct>,<fct>,<fct>,<int>,<fct>,<fct>
10078,43,193.04,145.152,166,102,188,B3,0,B,CHD,Angina,1795,Absent,Non-Smoker
