# Chapter 1 - R Foundations

Joshua French

To open this information in an interactive Colab notebook, click the
Open in Colab graphic below.

<a href="https://colab.research.google.com/github/jfrench/lqr/blob/master/notebooks/01-r-foundations-notebook.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg">
</a>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a>
This work is licensed under a
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative
Commons Attribution-NonCommercial-ShareAlike 4.0 International
License</a>.

## Setting up R and RStudio Desktop

### What is R?

R is a programming language and environment designed for statistical
computing.
[(https://www.r-project.org/about.html)](https://www.r-project.org/about.html)

Some important facts about R are that:

-   R is free, open source, and runs on many different types of
    computers (Windows, Mac, Linux, and others).
-   R is an interactive programming language.
    -   You type and run a command in the Console for immediate
        feedback, in contrast to a compiled programming language, which
        compiles a program that is then executed.
-   R is highly extendable.
    -   Many user-created packages are available to extend the
        functionality beyond what is installed by default.
    -   Users can write their own functions and easily add software
        libraries to R.

### Installing R and RStudio

Install R by downloading an installer program from the R Project’s
website [(https://www.r-project.org/)](https://www.r-project.org/).

RStudio Desktop is a free “front end” for R that makes doing data
analysis with R much easier by adding an Integrated Development
Environment (IDE) and providing many other features. Currently, you may
download RStudio at <https://posit.co/download/rstudio-desktop/>.

### RStudio Layout

RStudio Desktop has four panes:

1.  Console: the pane where commands are run.
2.  Source: the pane where you prepare commands to be run.
3.  Environment/History: the pane where you can see all the objects in
    your workspace, your command history, and other information.
4.  Files/Plots/Packages/Help: the pane where you navigate between
    directories, where plots can be viewed, where you can see the
    packages available to be loaded, and where you can get help.

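Tip: Change your RStudio settings so that RStudio doesn’t save the
current workspace when you exit (Tools → Global Options → General, then
set “Save workspace to .RData on exit” to “Never”).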

## Running code, scripts, and comments

You can run code in R by typing it in the Console next to the `>` symbol
and pressing the Enter key.

It’s better to write your commands in a “Script” file, save the Script,
and then run your commands from the Script. The commands in a Script
file are generically referred to as “code”.

Script files make it easy to:

-   Reproduce your data analysis without retyping all your commands.
-   Share your code with others.

A new Script file can be obtained by:

-   Clicking File → New File → R Script in the RStudio menu bar.
-   Pressing `Ctrl + Shift + n` on a PC or `Cmd + Shift + n` on a Mac.

To run code from a Script file:

-   Highlight the code you want to run and click the Run button at the
    top of the Script pane.
-   Highlight the code you want to run and press `Ctrl + Enter` on a PC
    or `Cmd + Enter` on a Mac. If you don’t highlight anything, RStudio
    runs the command the cursor is currently on.

To save a Script file:

-   Click File → Save in the RStudio menu bar.
-   Press `Ctrl + s` on a PC or `Cmd + s` on a Mac.

A comment is text that R ignores when code is submitted to the Console.

A comment is indicated by the `#` symbol. Nothing to the right of the
`#` on a line is executed by the Console.

To comment (or uncomment) multiple lines of code in the Source pane of
RStudio, highlight the code you want to comment and press
`Ctrl + Shift + c` on a PC or `Cmd + Shift + c` on a Mac.
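
As a small illustration (the object name `total` is made up for this
sketch), the lines below show a full-line comment and a comment placed
after a command.

```r
# This entire line is a comment, so R ignores it.
total <- 1 + 1  # R runs the code to the left of the # and ignores this text
total           # prints 2
```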

------------------------------------------------------------------------

### Hands-on Practice

Perform the following tasks:

1.  Type `1+1` in the Console and press Enter.
2.  Open a new Script in RStudio.
3.  Type `mean(1:3)` in your Script file.
4.  Type `# mean(1:3)` in your Script file.
5.  Run the commands from the Script using an approach mentioned above.
6.  Save your Script file.
7.  Use the keyboard shortcut to “comment out” some of the lines of your
    Script file.

In [1]:
# type your commands here

------------------------------------------------------------------------

## Assignment in R

R works on various types of objects that we’ll learn more about later.

To store an object in the computer’s memory we must assign it a name
using the assignment operator `<-` or the equal sign `=`.

Some comments:

-   In general, both `<-` and `=` can be used for assignment.
-   Pressing `Alt + -` on a PC or `Option + -` on a Mac will insert `<-`
    into the R Console and Script files.
    -   If you are creating an R Markdown file, then this shortcut will
        only insert `<-` if you are in an R code block.
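-   `<-` and `=` are NOT synonyms: inside a function call, `=` matches a
    value to an argument name instead of creating an object. At the top
    level of a Script or the Console, they can be used identically.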
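
The sketch below shows where the two operators agree and where they
differ. The object names `a_val`, `b_val`, and `c_val` are hypothetical
and chosen only for this example.

```r
# At the top level, `<-` and `=` both assign a value to a name.
a_val <- 5
b_val = 5

# Inside a function call, `=` matches the value to the argument named `x`.
# No object named `x` is created by this call.
mean(x = c(1, 2, 3))

# Inside a function call, `<-` first creates `c_val` in the workspace and
# then passes its value as the first argument of the function.
mean(c_val <- c(1, 2, 3))
c_val
```

Assigning inside a function call is legal but easy to misread, so it is
usually clearer to assign on a separate line first.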

## Functions

To use a function, you type the function’s name in the Console (or
Script) and then supply the function’s “arguments” between parentheses,
`()`.

The arguments of a function are pieces of data or information the
function needs to perform the requested task (i.e., the function
“inputs”). Each argument you supply is separated by a comma, `,`.
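
As a minimal sketch, the base R function `substr` (not otherwise used in
this chapter) takes three arguments: a character string, a start
position, and a stop position. The object names `by_position` and
`by_name` are made up for the example.

```r
# Arguments matched by position: string, start, stop.
by_position <- substr("statistics", 1, 4)

# Arguments matched by name: the order no longer matters.
by_name <- substr(stop = 4, start = 1, x = "statistics")

by_position  # "stat"
by_name      # "stat"
```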

------------------------------------------------------------------------

### Hands-on Practice

Run the following commands in the Console.

What is each command doing?

In [2]:
m <- mean(1:10)

In [3]:
m

In [4]:
print(m)

In [5]:
mean(c(1, 5, 3, 4, 10))

In [6]:
mean(c(1, 5, 3, 4, 10), trim = 0.2)

------------------------------------------------------------------------

## Packages

Packages are collections of functions, data, and other objects that
extend the functionality available in R by default.

R packages can be installed using the `install.packages` function and
loaded using the `library` function.

The **tidyverse** package (actually, a collection of packages) contains
data and some useful functions we will be using later in the course.

------------------------------------------------------------------------

### Hands-on Practice

In [7]:
# install tidyverse if it's not already installed
if (!require("tidyverse")){
    install.packages("tidyverse")
}

After you install **tidyverse**, load the package by running the command
below.

In [8]:
library(tidyverse)

------------------------------------------------------------------------

## Getting Help

There are many ways to get help in R.

-   To get help for a function named `command`, run `?command` to access
    the documentation.

    -   The documentation provides information on the function’s use,
        arguments, usage examples, and more.

-   Running `??topic` will search the documentation for any occurrence
    of the word “topic” and provide a list of relevant documentation to
    consider.

-   Stack Overflow (<https://www.stackoverflow.com>) is a great resource
    to find solutions.

------------------------------------------------------------------------

### Hands-on Practice

Let’s get help on the `lm` function.

In [9]:
?lm

In [10]:
??lm

------------------------------------------------------------------------

## Data Types and Structures

### Basic Data Types

R has 6 basic vector types:

1.  character: collections of characters. E.g., `"a"`, `"hello world!"`.
2.  double: decimal numbers. E.g., `1.2`, `1.0`.
3.  integer: whole numbers. In R, you must add `L` to the end of a
    number to specify it as an integer. E.g., `1L` is an integer but `1`
    is a double.
4.  logical: boolean values, `TRUE` and `FALSE`.
5.  complex: complex numbers. E.g., `1+3i`.
6.  raw: a type to hold raw bytes.
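
One way to check these types yourself is the `typeof` function. The
values below are arbitrary examples, and `as.raw` is used only to
construct a raw value for the demonstration.

```r
typeof("hello world!")  # "character"
typeof(1.2)             # "double"
typeof(1)               # "double" (no L suffix, so not an integer)
typeof(1L)              # "integer"
typeof(TRUE)            # "logical"
typeof(1+3i)            # "complex"
typeof(as.raw(255))     # "raw"
```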

### Other important object types

-   **Numeric**: An object is `numeric` if it is of type `integer` or
    `double`. In that case, its `mode` is said to be `numeric`.
-   **NULL**: `NULL` is a special object to indicate an object is
    absent.
    -   An object having a length of zero is not the same thing as an
        object being absent.
-   **NA**: A “missing value” occurs when the value of something isn’t
    known. R uses the special object `NA` to represent a missing value.
    -   If you have a missing value, you should represent that value as
        `NA`. Note: `"NA"` is not the same thing as `NA`.
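
The following sketch illustrates these distinctions with a few base R
checks. The vector name `scores` is hypothetical.

```r
# integer and double values are both numeric
is.numeric(1L)       # TRUE
is.numeric(1.5)      # TRUE
mode(1L)             # "numeric"

# NULL (absent) versus a zero-length vector (present but empty)
is.null(NULL)        # TRUE
is.null(numeric(0))  # FALSE
length(numeric(0))   # 0

# NA (a missing value) versus the character string "NA"
is.na(NA)            # TRUE
is.na("NA")          # FALSE

# a missing value propagates through calculations unless it is removed
scores <- c(1, NA, 3)
mean(scores)                # NA
mean(scores, na.rm = TRUE)  # 2
```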

## Data structures

R operates on data structures. A data structure is a “container” that
holds certain kinds of information.

R has 5 basic data structures:

1.  vector.
2.  matrix.
3.  array.
4.  data frame.
5.  list.
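
As a rough preview, each structure can be created with a base R
function. The object names and values below are made up for
illustration.

```r
vec <- c(1.5, 2, 3)                                  # vector
mat <- matrix(1:6, nrow = 2, ncol = 3)               # matrix: 2 rows, 3 columns
arr <- array(1:24, dim = c(2, 3, 4))                 # array: 2 x 3 x 4
dtf <- data.frame(id = 1:3, grp = c("a", "b", "c"))  # data frame
lst <- list(vec, mat, "some text")                   # list: elements may differ in type

dim(mat)     # 2 3
dim(arr)     # 2 3 4
dim(dtf)     # 3 2
length(lst)  # 3
```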

## Vectors

### Creating vectors

A *vector* is a one-dimensional set of data of the same type.

The most basic way to create a vector is the `c` (combine) function.

The following commands create vectors of type `numeric`, `character`,
and `logical`, respectively.

-   `c(1, 2, 5.3, 6, -2, 4)`
-   `c("one", "two", "three")`
-   `c(TRUE, TRUE, FALSE, TRUE)`
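
Because every element of a vector must have the same type, `c` converts
mixed inputs to a common type. The short sketch below shows this
behavior; the names `mixed` and `num_lgl` are hypothetical.

```r
# numbers and logicals are converted to character when combined with a string
mixed <- c(1, "a", TRUE)
mixed            # "1" "a" "TRUE"
typeof(mixed)    # "character"

# logicals are converted to numbers (TRUE is 1, FALSE is 0)
num_lgl <- c(1.5, TRUE, FALSE)
num_lgl          # 1.5 1.0 0.0
typeof(num_lgl)  # "double"
```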

### The `seq` function

The `seq` (sequence) function is used to create an equidistant series of
numeric values.

-   `seq(1, 10)` or `1:10` creates a sequence of numbers from 1 to 10 in
    increments of 1.
-   `seq(1, 20, by = 2)` creates a sequence of numbers from 1 to 20 in
    increments of 2.
-   `seq(10, 20, len = 100)` creates a sequence of 100 equally spaced
    numbers from 10 to 20 (`len` is shorthand for the full argument name
    `length.out`).

------------------------------------------------------------------------

### Hands-on Practice

Run the commands below in the Console and try to answer the questions
below.

What does the `by` argument of the `seq` function control?

What does the `len` argument of the `seq` function control?

In [11]:
seq(1, 10)

In [12]:
1:10

In [13]:
seq(1, 20, by = 2)

In [14]:
seq(10, 20, len = 100)

------------------------------------------------------------------------

### The `rep` function

The `rep` (replicate) function can be used to create a vector by
replicating values.

-   `rep(1:3, times = 3)` replicates the sequence `1, 2, 3` three times.
-   `rep(c("trt1", "trt2", "trt3"), times = 1:3)` replicates `"trt1"`
    once, `"trt2"` twice, and `"trt3"` three times.
-   `rep(1:3, each = 3)` replicates each element three times.

------------------------------------------------------------------------

### Hands-on Practice

Run the commands below in the Console and try to answer the questions
below.

What does the `times` argument of the `rep` function control?

What does the `each` argument of the `rep` function control?

In [15]:
rep(1:3, times = 3)

In [16]:
rep(c("trt1", "trt2", "trt3"), times = 1:3)

In [17]:
rep(1:3, each = 3)

------------------------------------------------------------------------

### Combining vectors

Multiple vectors can be combined into a new vector object using the `c`
function.

-   E.g., `c(v1, v2, v3)` would combine vectors `v1`, `v2`, and `v3`.

------------------------------------------------------------------------

### Hands-on Practice

Run the commands below in the Console. Determine what action each
command performs.

In [18]:
v1 <- 1:5

In [19]:
v2 <- c(1, 10, 11)

In [20]:
v3 <- rep(1:2, each = 3)

In [21]:
new <- c(v1, v2, v3)

In [22]:
new

------------------------------------------------------------------------

### Categorical vectors

Categorical data should be stored as a `factor` in R.

#### Creating a `factor` object

We create some `factor` variables below.

In [23]:
f1 <- factor(rep(1:6, times = 3))
f1

In [24]:
f2 <- factor(c("a", 7, "blue", "blue", FALSE))
f2

A printed `factor` object lists the `Levels` (i.e., unique categories)
of the object.

Some additional comments:

-   `factor` objects aren’t technically vectors but behave like vectors,
    which is why they are included here.
-   The `is.factor` function can be used to determine whether an object
    is a `factor`.
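
A few base R functions help inspect a `factor`. The sketch below uses a
hypothetical factor named `paint`.

```r
paint <- factor(c("red", "blue", "red", "green"))

levels(paint)      # "blue" "green" "red" (sorted alphabetically by default)
nlevels(paint)     # 3
table(paint)       # number of observations in each level
as.integer(paint)  # 3 1 3 2: the integer codes stored for each value
```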

------------------------------------------------------------------------

#### Hands-on Practice

Attempt to complete the following tasks:

1.  Create a factor named `grp` that has two levels: `a` and `b`, where
    the first 7 values are `a` and the last 4 values are `b`.
2.  Run `is.factor(grp)` in the Console.
3.  Run `is.vector(grp)` in the Console.
4.  Run `typeof(grp)` in the Console.

In [25]:
# type your code here

------------------------------------------------------------------------

#### Creating an ordered `factor` object

We can create `factor` objects with specific orderings of categories
using the `levels` and `ordered` arguments of the `factor` function.

-   See `?factor` for more details.

In [26]:
size <- c("small", "medium", "small", "large", "medium", "medium", "large")
factor(size)

In [27]:
# create ordered factor
factor(size, levels = c("small", "medium", "large"), ordered = TRUE)
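
One practical consequence of ordering, sketched below with a shortened
version of the `size` vector, is that the levels of an ordered factor
can be compared and sorted. The name `size_ord` is hypothetical.

```r
size <- c("small", "medium", "small", "large")
size_ord <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)

size_ord < "large"  # TRUE TRUE TRUE FALSE
max(size_ord)       # large
sort(size_ord)      # small small medium large
```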

------------------------------------------------------------------------

### Extracting parts of a vector

Parts of a vector can be extracted by appending an index vector in
square brackets `[]`.

In [28]:
# define a sequence 2, 4, ..., 16
a <- seq(2, 16, by = 2)
# extract subset of vector
a[c(2, 4, 6)]

Supplying a negative index vector indicates the values you want to
exclude from your selection.

In [29]:
a[-c(2, 4, 6)] # select all but elements 2, 4, and 6

In [30]:
a[-(3:6)] # select all but elements 3-6

### Logical Expressions

A logical expression uses one or more logical operators to determine
which elements of an object satisfy the specified statement.

The basic logical operators are:

-   `<`, `<=`: less than, less than or equal to.
-   `>`, `>=`: greater than, greater than or equal to.
-   `==`: equal to.
-   `!=`: not equal to.

Creating a logical expression with a vector will result in a logical
vector indicating whether each element satisfies the logical expression.

------------------------------------------------------------------------

#### Hands-on Practice

Run the following commands in R and see what is printed. What task is
each statement performing?

In [31]:
a > 10

In [32]:
a <= 4

In [33]:
a == 10

In [34]:
a != 10

#### The “and”, “or”, and “not” operators

We can create more complicated logical expressions using the “and”,
“or”, and “not” operators.

-   `&`: and.
-   `|`: or.
-   `!`: not (negation).

------------------------------------------------------------------------

#### Hands-on Practice

Run the following commands in the Console.

What action is each command performing?

What role does `&` serve in a sequence of logical values?

Similarly, what roles do `|` and `!` serve in a sequence of logical
values?

In [35]:
TRUE & TRUE & TRUE

In [36]:
TRUE & TRUE & FALSE

In [37]:
FALSE | TRUE | FALSE

In [38]:
FALSE | FALSE | FALSE

In [39]:
!TRUE

In [40]:
!FALSE

------------------------------------------------------------------------

#### Connecting logical expressions

Logical expressions can be connected via `&` and `|` (and negated via
`!`).

-   The operators are applied elementwise to values of the vectors.

------------------------------------------------------------------------

#### Hands-on Practice

Run the following commands in R and see what is printed. What task is
each statement performing?

Note that the parentheses `()` are used to group logical expressions to
more easily understand what is being done. This is a good coding style
to follow.

In [41]:
(a > 6) & (a <= 10)

In [42]:
(a <= 4) | (a >= 12)

In [43]:
!((a <= 4) | (a >= 12))

#### Selection using logical expressions

Logical expressions can be used to return parts of an object satisfying
the appropriate criteria.

-   We pass logical expressions within the square brackets to access
    part of a data structure.
-   This syntax will return each element of the object for which the
    expression is `TRUE`.
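
A logical vector can also be summarized rather than used for selection.
The sketch below recreates the vector `a` and uses the base R functions
`which` and `sum`, which are not covered in the text above and are shown
only as a related illustration.

```r
a <- seq(2, 16, by = 2)  # 2, 4, ..., 16

a[a > 10]      # 12 14 16: the values that satisfy the expression
which(a > 10)  # 6 7 8: the positions of those values
sum(a > 10)    # 3: the number of values (each TRUE counts as 1)
```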

------------------------------------------------------------------------

#### Hands-on Practice

Run the following commands in R and see what is printed. What task is
each statement performing?

In [44]:
a[a < 6]

In [45]:
a[a == 10]

In [46]:
a[(a < 6)|(a == 10)]

------------------------------------------------------------------------

## Helpful Functions

Here is a list of helpful functions in R. We will use a hands-on example
to show what each function does.

-   `length`
-   `sum`
-   `mean`
-   `var`
-   `sd`
-   `range`
-   `log`
-   `summary`
-   `str`

------------------------------------------------------------------------

### Hands-on Practice

Run the following commands in the Console. Determine what task each
command is performing.

In [47]:
x <- rexp(100) # sample 100 iid values from an Exponential(1) distribution

In [48]:
length(x)

In [49]:
sum(x)

In [50]:
mean(x)

In [51]:
var(x)

In [52]:
sd(x)

In [53]:
range(x)

In [54]:
log(x)

In [55]:
summary(x)

In [56]:
str(x) # structure of x

------------------------------------------------------------------------

### Functions for Statistical Distributions

Suppose that a random variable $X$ has the `dist` distribution. The
function templates in the list below describe how to obtain certain
properties of $X$.

-   `p[dist](q, ...)`: returns the cdf of $X$ evaluated at `q`, i.e.,
    $p=P(X\leq q)$.
-   `q[dist](p, ...)`: returns the inverse cdf (or quantile function) of
    $X$ evaluated at $p$, i.e., $q = \inf\{x: P(X\leq x) \geq p\}$.
-   `d[dist](x, ...)`: returns the mass or density of $X$ evaluated at
    $x$ (depending on whether it’s discrete or continuous).
-   `r[dist](n, ...)`: returns an independent and identically
    distributed random sample of size `n` having the same distribution
    as $X$.
-   The `...` indicates that additional arguments describing the
    parameters of the distribution may be required.
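
The sketch below checks how these templates relate to one another for
the normal and binomial distributions. The object names and the seed
value are arbitrary choices for the example.

```r
# p and q are inverses: the quantile of the cdf value returns the original point
p <- pnorm(1.96, mean = 0, sd = 1)
q <- qnorm(p, mean = 0, sd = 1)
q  # 1.96, up to floating-point error

# d gives the mass function for a discrete distribution, so summing it
# over all possible values gives 1 (up to floating-point error)
total_mass <- sum(dbinom(0:20, size = 20, prob = 0.2))
total_mass

# r draws a random sample; setting a seed makes the draws reproducible
set.seed(42)
draws <- rnorm(5, mean = 0, sd = 1)
length(draws)  # 5
```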

------------------------------------------------------------------------

### Hands-on Practice

Run the following commands in R to see the output. Each command is
preceded by a description of the action it performs.

`pnorm(1.96, mean = 0, sd = 1)` returns the probability that a standard
normal random variable is less than or equal to 1.96, i.e.,
$P(X \leq 1.96)$.

In [57]:
pnorm(1.96, mean = 0, sd = 1)

`qunif(0.6, min = 0, max = 1)` returns the value $x$ such that
$P(X\leq x) = 0.6$ for a uniform random variable on the interval
$[0, 1]$.

In [58]:
qunif(0.6, min = 0, max = 1)

`dbinom(2, size = 20, prob = .2)` returns the probability that $X$
equals 2 when $X$ has a Binomial distribution with $n=20$ trials and the
probability of a successful trial is $0.2$.

In [59]:
dbinom(2, size = 20, prob = .2)

`dexp(1, rate = 2)` evaluates the density of an exponential random
variable with mean = 1/2 (i.e., the reciprocal of the `rate`) at $x=1$.

In [60]:
dexp(1, rate = 2)

`rchisq(100, df = 5)` draws a sample of 100 observations from a
chi-squared distribution with 5 degrees of freedom.

In [61]:
rchisq(100, df = 5)

------------------------------------------------------------------------

## Data Frames

Data frames are a *fundamental* data structure used by most of R’s
modeling software.

A data frame:

-   is a two-dimensional data object.
-   stores each column as a vector (all columns have the same length).

### Direct creation

Data frames are directly created by passing vectors into the
`data.frame` function.

In [62]:
# create basic data frame
d <- c(1, 2, 3, 4)
e <- c("red", "white", "blue", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
df <- data.frame(d,e,f)
df

The columns of a data frame can be renamed using the `names` function on
the data frame.

In [63]:
names(df) <- c("ID", "Color", "Passed")
df

The columns of a data frame can be named when you are first creating the
data frame by using `name =` for each vector of data.

In [64]:
df2 <- data.frame(ID = d, Color = e, Passed = f)
df2

### Importing Data

In practice, we likely want to import data from a file into R.

The `read.table` function imports data in table format from a file into
R as a data frame.

-   `file` is the file path and name of the file you want to import into
    R.
    -   If you don’t know the file path, setting `file = file.choose()`
        will bring up a dialog box asking you to locate the file you
        want to import.
-   `header` specifies whether the data file has a header (variable
    labels for each column of data in the first row of the data file).
    -   `header = FALSE` tells R that the file doesn’t have any
        headings. If you don’t specify this option, `read.table` assumes
        there is a header only when the first row contains one fewer
        field than the remaining rows.
    -   `header = TRUE` tells R to read in the data as a data frame with
        column names taken from the first row of the data file.
-   `sep` specifies the delimiter separating elements in the file.
    -   If each column of data in the file is separated by a space, then
        use `sep = " "`.
    -   If each column of data in the file is separated by a comma, then
        use `sep = ","`.
    -   If each column of data in the file is separated by a tab, then
        use `sep = "\t"`.
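
To try these arguments without a data file on hand, the sketch below
writes a small comma-separated file to a temporary location and reads it
back. The file contents and the names `tmp`, `scores`, and `scores_csv`
are made up for the example. `read.csv` is a base R wrapper around
`read.table` that uses `header = TRUE` and `sep = ","` by default.

```r
# write a small comma-separated file with a header row to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score,passed",
             "1,90.5,TRUE",
             "2,72.0,FALSE",
             "3,88.25,TRUE"),
           con = tmp)

# import the file, stating that it has a header and is comma-delimited
scores <- read.table(file = tmp, header = TRUE, sep = ",")
str(scores)

# read.csv supplies the header and sep settings for us
scores_csv <- read.csv(tmp)
all.equal(scores, scores_csv)

# remove the temporary file
unlink(tmp)
```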

------------------------------------------------------------------------

### Hands-on Practice

Consider reading in a CSV (comma-separated values) file with a header.
The file in question contains information related to U.S. crime data in
2009.

In [65]:
path <- "https://raw.githubusercontent.com/jfrench/api2lm/main/inst/extdata/crime2009.csv"
# import data as data frame
crime2009 <- read.table(file = path, header = TRUE, sep = ",")
# view data structure
str(crime2009)

------------------------------------------------------------------------

### Extracting parts of a data frame

R provides many ways to extract parts of a data frame.

The `mtcars` data frame has 32 observations for 11 variables.

In [66]:
data(mtcars) # load data set
str(mtcars)  # examine data structure

#### Direct extraction

We can extract the `mpg` variable from the `mtcars` data frame using the
`$` operator.

In [67]:
mtcars$mpg

Another way to extract a variable from a data frame uses a
`df[rows, columns]` style syntax.

-   `rows` and `columns` indicate the desired rows or columns.
-   If either the `rows` or `columns` are left blank, then all `rows` or
    `columns`, respectively, are extracted.

In [68]:
mtcars[,"mpg"]

To select multiple variables in a data frame, we can provide a character
vector with multiple variable names between `[]`.

In [69]:
mtcars[c("mpg", "cyl")]

You can also use numeric indices to directly indicate the rows or
columns of the data frame that you would like to extract.

In [70]:
mtcars[c(1, 2)]
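
The two styles of extraction return different kinds of objects when a
single column is selected, which the following sketch illustrates. The
object names are hypothetical.

```r
data(mtcars)

mpg_vec <- mtcars[, "mpg"]                # rows-and-columns style: a numeric vector
mpg_df  <- mtcars["mpg"]                  # single-bracket style: a one-column data frame
mpg_df2 <- mtcars[, "mpg", drop = FALSE]  # drop = FALSE keeps the data frame structure

is.vector(mpg_vec)      # TRUE
is.data.frame(mpg_df)   # TRUE
is.data.frame(mpg_df2)  # TRUE

# rows and columns can be selected at the same time
mtcars[1:3, c("mpg", "cyl")]
```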

------------------------------------------------------------------------

#### Hands-on Practice

Run the following commands in the Console. Determine what task each
command is performing.

In [71]:
df3 <- data.frame(numbers = 1:5,
                  characters = letters[1:5],
                  logicals = c(TRUE, TRUE, FALSE, TRUE, FALSE))

In [72]:
df3

In [73]:
df3$logicals

In [74]:
df3[1, ]

In [75]:
df3[, 3]

In [76]:
df3[, 2:3]

In [77]:
df3[, c("numbers", "logicals")]

In [78]:
df3[c("numbers", "logicals")]

------------------------------------------------------------------------

#### Extraction using logical expressions

Logical expressions can be used to subset a data frame.

The following command creates a vector of logical values.

In [79]:
mtcars$hp > 250

This vector can be used to extract rows for all of the `TRUE` values.

In [80]:
# extract rows with hp > 250
mtcars[mtcars$hp > 250,]

We can make the logical expression more complicated.

In [81]:
# return rows with `cyl == 8` and `mpg > 17`
# return columns mpg, cyl, disp, hp
mtcars[mtcars$cyl == 8 & mtcars$mpg > 17,
       c("mpg", "cyl", "disp", "hp")]

### Extraction using the `subset` function

The `subset` function returns the part of a data frame that meets the
specified conditions.

The basic usage of this function is:
`subset(x, subset, select, drop = FALSE)`

-   `x` is the object you want to subset.
    -   `x` can be a vector, matrix, or data frame.
-   `subset` is a logical expression that indicates the elements or rows
    of `x` to keep (`TRUE` means keep).
-   `select` is a vector that indicates the columns to keep.
-   `drop` is a logical value indicating whether the data frame should
    “drop” into a vector if only a single row or column is kept. The
    default is `FALSE`.
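
One difference between `subset` and square-bracket extraction is how
missing values are handled. The sketch below uses a small made-up data
frame named `dat` to show that `subset` drops rows where the condition
is `NA`, while bracket extraction returns a row of `NA` values for them.

```r
dat <- data.frame(id = 1:4, score = c(10, NA, 30, 40))

kept_subset  <- subset(dat, subset = score > 15)
kept_bracket <- dat[dat$score > 15, ]

nrow(kept_subset)   # 2: the rows with scores 30 and 40
nrow(kept_bracket)  # 3: an extra row of NA values for the missing score
```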

### Hands-on Practice

Run the following commands in the Console to use the `subset` function
to extract parts of the `mtcars` data frame.

What is each command doing?

In [82]:
subset(mtcars, subset = gear > 4)

In [83]:
subset(mtcars, select = c(disp, hp, gear))

In [84]:
subset(mtcars, subset = gear > 4, select = c(disp, hp, gear))

### Modifying a Data Frame

Columns can be added to a data frame using `$` and the assignment
operator. In the example below, we add a new column, `kpg` (kilometers
per gallon, using the approximation 1 mile ≈ 1.6 km), to the `mtcars`
data set based on a transformation of the `mpg` column.

In [85]:
mtcars$kpg <- mtcars$mpg*1.6
head(mtcars)

## Using the pipe operator

R’s native pipe operator (`|>`), available in R 4.1.0 and later, allows
you to “pipe” the object on the left side of the operator into the first
argument of the function on the right side of the operator. The pipe
operator is a convenient way to chain several steps into a single
sequence of commands.

When reading code with pipes, the pipe can be thought of as the word
“then”.

In the code below, we take `mtcars` *then* subset it based on `disp` and
*then* select some columns.

In [86]:
mtcars |>
  subset(subset = disp > 400) |>
  subset(select = c(mpg, disp, hp))
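
A piped command is another way of writing a nested function call. The
sketch below, which assumes R 4.1.0 or later and uses the hypothetical
names `nested` and `piped`, builds the same result both ways.

```r
# nested calls are read from the inside out
nested <- head(subset(mtcars, subset = disp > 400), n = 2)

# piped calls are read from top to bottom
piped <- mtcars |>
  subset(subset = disp > 400) |>
  head(n = 2)

identical(nested, piped)  # TRUE
```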

Here is a more complicated sequence of piped commands. What do you think
is happening?

In [87]:
# create new variable (approximate liters per 100 km), select columns,
# extract first 5 rows
mtcars |>
  transform(lp100km = 237.5/mpg) |>
  subset(select = c(mpg, lp100km)) |>
  head(n = 5)