# Chapter 8: A Rapid Introduction to the R Language

## R Language Basics

### Significant Digits, print(), and Options in R

- By default, R will print seven significant digits.
- Use the `digits` argument of `print()` to set the number of significant digits for a single call:

```r
print(sqrt(3.5), digits = 10) 
```

- `getOption('digits')`: Get default global number of significant digits in R
- `options(digits=9)`: Set the default global number of significant digits in R
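As a small sketch of how the global option can be changed and then restored (the variable names `default_digits` and `old` are only illustrative):

```R
default_digits <- getOption("digits")  # 7 unless it has been changed
old <- options(digits = 9)             # options() returns the previous settings
print(sqrt(3.5))                       # printed with up to 9 significant digits
options(old)                           # restore the previous setting
```

Saving the return value of `options()` makes it easy to undo the change afterward.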

### Getting help in R

- `help()` or `?`: Access R’s built-in documentation with:

```R
help(log)
?log
```
- `help.search()` or `??`: Search the documentation by describing a task, for when you can't remember a function's name (e.g., which function cross-tabulates vectors?)

```R
help.search("cross tabulate")
??"cross tabulate"
```

- `example()`: Executes all examples in an R help file

```R
example(log)
```

- `library(help="package_name")`: Lists all functions in a package
- `apropos("part_of_func_name")`: Finds functions by name

```R
library(help="base") # List all functions in base
apropos("norm") # Find objects whose names contain "norm"
```

### Variables and Assignment

When we assign a value in our R session, we’re assigning it to an *environment* known as the *global environment*. Two functions are useful here:

- `ls()`: Lists objects created in the global environment
- `search()`: Returns where R looks when searching for the value of a variable.

### Vectors, Vectorization, and Indexing

- A vector is a container of contiguous data.
- `length()`: Returns length of a vector.
- `c()`: Creates a vector by combining values (of same type): `x <- c(1, 2, 3)`
- R’s vectors are the basis of one of R’s most important features: vectorization. Vectorization allows us to loop over vectors elementwise, without the need to write an explicit loop.
- `seq()` or `:`: Creates integer sequences: `seq(3,6)` or `3:6`
- If one vector is longer than the other, R will recycle the values in the shorter vector:

```R
c(1, 2) + c(0, 0, 0, 0)
# [1] 1 2 1 2
c(1, 2) + c(0, 0, 0)
# [1] 1 2 1
# Warning message:
# In c(1, 2) + c(0, 0, 0) :
#   longer object length is not a multiple of shorter object length
```

- In addition to operators like `+` and `*`, many of R’s mathematical functions (e.g., `sqrt()`, `round()`, `log()`) are vectorized.
- We can access specific elements of a vector through *indexing*. R’s vectors are 1-indexed, meaning that the index 1 corresponds to the first element of a vector (in contrast to 0-indexed languages like Python).
- Trying to access an element that doesn’t exist in the vector leads R to return `NA`, the “not available” missing value.
- Vectors can also have names, which you can set while combining values with `c()`, or afterward with the `names()` function:

```R
x <- c(a=1, b=3)
y <- c(1,2,3)
names(y) <- c("a", "b", "c")
```
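To make vectorization and 1-based indexing concrete, here is a minimal sketch using a made-up vector `v`:

```R
v <- c(1, 4, 9, 16)
v * 2      # elementwise arithmetic: 2 8 18 32
sqrt(v)    # vectorized function: 1 2 3 4
v[2]       # 1-indexed, so this is the second element: 4
v[10]      # index beyond the end of the vector: NA
```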

- And just as we can access elements by their positional index, we can also access them by their name: `x['a']`.
- It is also possible to extract more than one element simultaneously from a vector using indexing: `x[1:2]` or `x[c(1,2)]`.
- It’s also possible to exclude certain elements from a vector using negative indexes:

```R
z <- c(3.4, 2.2, 0.4, -0.4, 1.2)
z[c(-4, -5)] # exclude fourth and fifth elements
# [1] 3.4 2.2 0.4
```

- We could reverse the elements of a vector by creating the sequence of integers from 5 down to 1 using `5:1`:

```R
z <- c(3.4, 2.2, 0.4, -0.4, 1.2)
z[5:1]
# [1]  1.2 -0.4  0.4  2.2  3.4
```

- The function `order()` returns a vector of indexes that indicate the (ascending) order of the elements. We can use `order()` to sort a vector:

```R
x <- c(2.3, 3.4, -1)
x[order(x, decreasing = TRUE)]
# [1]  3.4  2.3 -1.0
x[order(x, decreasing = FALSE)]
# [1] -1.0  2.3  3.4
```

- Often we use functions to generate indexing vectors for us. For example, one way to resample a vector (with replacement) is to randomly sample its indexes using the `sample()` function:

```R
# we set the random number seed so this example is reproducible
set.seed(0)
z <- c(3.4, 2.2, 0.4, -0.4, 1.2)
i <- sample(length(z), replace=TRUE)
i
# [1] 5 2 2 3 5
# (the exact draw can differ across R versions: the default sampling
# algorithm changed in R 3.6.0)
z[i]
# [1] 1.2 2.2 2.2 0.4 1.2

Here are R’s comparison and logical operators:

| Operator | Description |
|----------|-------------|
| >        | Greater than |
| <        | Less than |
| >=       | Greater than or equal to |
| <=       | Less than or equal to |
| ==       | Equal to |
| !=       | Not equal to |
| &        | Elementwise logical AND |
| \|       | Elementwise logical OR |
| !        | Elementwise logical NOT |
| &&       | Logical AND (for single values, as in if statements) |
| \|\|     | Logical OR (for single values, as in if statements) |


Table 8-3. R’s vector types:

|Type | Example | Creation function | Test function | Coercion function |
|-----| --------| ----------------- | ------------- | ----------------- |
| Numeric | c(23.1, 42, -1) | numeric() | is.numeric() | as.numeric() |
| Integer | c(1L, -3L, 4L) | integer() | is.integer() | as.integer() |
| Character |c("a", "c") | character() | is.character() | as.character() |
| Logical | c(TRUE, FALSE) | logical() | is.logical() | as.logical() |

R has four special values used to represent different types of data states:

* **NA** stands for "not available" and signifies **missing data**. Operations involving NA typically result in NA. You can check for NA values using the `is.na()` function.
* **NULL** represents the **absence of a value** altogether, which is distinct from a missing value. The `is.null()` function is used to test for NULL.
* **Inf** and **-Inf** denote **positive and negative infinity**, respectively. You can use the `is.infinite()` function to check for these values.
* **NaN** stands for "not a number" and results from **mathematical computations that produce an undefined numerical value**, such as dividing zero by zero. The `is.nan()` function checks for these values.
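
The differences between these special values are easiest to see side by side. A short sketch using only base R:

```R
is.na(c(1, NA, 3))         # FALSE TRUE FALSE
NA + 1                     # NA: operations on missing data stay missing
is.null(NULL)              # TRUE
length(c(1, NULL, 3))      # 2: NULL contributes nothing, unlike NA
is.infinite(c(1/0, -1/0))  # TRUE TRUE
is.nan(0/0)                # TRUE
is.na(0/0)                 # TRUE: NaN also counts as missing for is.na()
```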

---



### Factors and classes in R

Factors store categorical variables, such as a treatment group (e.g., “high,” “medium,” “low,” “control”), strand (forward or reverse), or chromosome (“chr1,” “chr2,” etc.).

- We can create a factor from a vector using the function `factor()`:

```R
chr_hits <- c("chr2", "chr2", "chr3", "chrX", "chr2", "chr3", "chr3")
hits <- factor(chr_hits)
hits
# [1] chr2 chr2 chr3 chrX chr2 chr3 chr3
# Levels: chr2 chr3 chrX
```

- The levels are the possible values a factor can contain (these are fixed and must be unique). We can view a factor’s levels by using the function `levels()`:

```R
levels(hits)
# [1] "chr2" "chr3" "chrX"
```

- We can count up how many of each level there are in a factor using the function `table()`:

```R
table(hits)
# hits
# chr2 chr3 chrX
#    3    3    1
```

- To discern the difference between an object’s class and its type, notice that factors are just integer vectors under the hood:

```R
typeof(hits)
# [1] "integer"
as.integer(hits)
# [1] 1 1 2 3 1 2 2
```
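
Levels can also be set explicitly with the `levels` argument of `factor()`, which is useful when some categories are possible but absent from the data. The sketch below assumes a hypothetical full set of chromosomes, `all_chrs`:

```R
chr_hits <- c("chr2", "chr2", "chr3", "chrX", "chr2", "chr3", "chr3")
all_chrs <- c("chrX", "chrY", "chr2", "chr3", "chr4")  # assumed set of levels
hits_all <- factor(chr_hits, levels = all_chrs)
table(hits_all)        # chrY and chr4 are reported with a count of 0
as.integer(hits_all)   # 3 3 4 1 3 4 4: each value's position in all_chrs
```

The integer codes follow the order of the levels, not the order in which values appear in the data.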


## Working with and Visualizing Data in R

### Loading Data into R

You can use `getwd()` to get R’s current working directory and `setwd()` to set the working directory:

```R
getwd()
# [1] "/home/payman"
# path to this chapter's directory in the GitHub repository
setwd("~/bds-files/chapter-08-r")
```

### Loading Large Genomics Data in R

* **Reduce data size:** If a dataset is too large to fit in your computer's memory, you should reduce its size before loading it. You can do this by summarizing the data, omitting unnecessary columns, splitting the data into smaller chunks, or using a random subset.

* **Speed up loading:** If the data fits in memory but takes a long time to load, you can significantly speed up the process by using the `colClasses` argument in R's data-reading functions (`read.csv()`, `read.delim()`).

* **Specify data types:** Explicitly setting the data type for each column with the `colClasses` argument saves R time.

* **Skip columns:** Use `"NULL"` for the value in `colClasses` to tell R to skip columns you don't need, which saves both time and memory.

* **Specify the number of rows**: You can improve the performance of `read.delim()` by setting the `nrows` argument, which tells R how many rows to expect. An easy way to estimate this number is to count the file's lines with the `wc -l` command. It's fine to slightly overestimate this value.

* **Use `data.table::fread()`**: For a significant speed boost, consider using the `fread()` function from the `data.table` package. It is much faster than the standard `read.*` functions but returns a `data.table` object, which behaves differently from a standard `data.frame`.

* **Use SQLite for very large data**: If your data is too big for your computer's memory, you can use a database solution like SQLite. The **RSQLite** R package allows you to store the data on disk and query subsets for analysis, avoiding the need to load the entire dataset into memory.

* **Work with compressed files**: R's data-reading functions can directly read gzipped files (files ending in `.gz`), eliminating the need to uncompress them first. This saves disk space and can offer slight performance gains due to fewer bytes being read from the disk.
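
Several of these tips can be combined in one call. The following is a minimal sketch on a tiny made-up file written to a temporary path; the column names and values are hypothetical:

```R
# Write a small tab-delimited example file to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(c("chrom\tstart\tend\tnote",
             "chr1\t100\t200\ta",
             "chr2\t300\t400\tb"), tmp)

bed <- read.delim(tmp,
                  # declare column types; "NULL" skips the fourth column
                  colClasses = c("character", "integer", "integer", "NULL"),
                  nrows = 3)  # a slight overestimate of the 2 data rows
str(bed)
unlink(tmp)  # remove the temporary file
```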

### Reading data tables in R

- `read.csv()`: Used for reading CSV files
- `read.delim()`: Used for reading tab-delimited files
- Both functions are thin wrappers around the function `read.table()` with the proper arguments set for CSV and tab-delimited files.

```R
d <- read.csv("file.csv")
bd <- read.delim("noheader.bed", header=FALSE, col.names=c("chrom", "start", "end"))
```

### Getting Data into Shape

In many cases, data is recorded by humans in wide format, but we need data in long format when working with plotting and statistical modeling functions. Hadley Wickham’s `reshape2` package provides functions to reshape data: **the function `melt()` turns wide data into long data, and `dcast()` turns long data back into a wide dataframe.**


### Exploring and Transforming Dataframes

- `read.csv()` returns a *dataframe*.
- Each column of a dataframe is a vector.

Some useful functions for exploring a dataframe:
- `head(df, n=XX)`: Returns the first XX rows of the dataframe
- `nrow(df)`: Number of rows
- `ncol(df)`: Number of columns
- `dim(df)`: Dimensions of the dataframe (rows, then columns)
- `rownames(df)`: Names of rows
- `colnames(df)`: Names of columns
- `df$col_name` or `df[["col_name"]]`: Access a single column as a vector (`df["col_name"]` returns a one-column dataframe instead)
- `df[,1:2]`: Access entire columns 1 and 2
- `df[,c("col1", "col2")]`: Access entire "col1" and "col2"
- `df[1, ]`: Access entire row 1
- `df[1,2]`: Access the cell in row 1 and column 2
- `df[, "start", drop=FALSE]`: Returns the column as a dataframe, not a vector
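
These accessors can be tried on a small dataframe. The sketch below builds a made-up one with `data.frame()`; the column names and values are purely illustrative:

```R
df <- data.frame(chrom = c("chr1", "chr1", "chr2"),
                 start = c(100L, 500L, 50L),
                 end   = c(200L, 650L, 90L))
dim(df)                       # 3 3
df$start                      # a vector
df[, "start"]                 # also a vector
df[, "start", drop = FALSE]   # a one-column dataframe
df["start"]                   # a one-column dataframe
df[2, c("start", "end")]      # row 2, two columns
```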

### Creating Dataframes from Scratch

Use the function `data.frame()` to create a dataframe from vector arguments (shorter vectors are recycled when necessary):

```R
x <- sample(1:50, 300, replace=TRUE)
y <- 3.2*x + rnorm(300, 0, 40)
d_sim <- data.frame(y=y, x=x)
```

### Exploring Data Through Slicing and Dicing: Subsetting Dataframes

- Omitting the column argument in the bracket operator (e.g., `col` in `df[row, col]`) extracts all columns. If we only care about a few particular columns, we can specify them by their position or their name:

```R
d[d$Pi > 16 & d$percent.GC > 80, c("start", "end", "depth", "Pi")]
#          start      end depth     Pi
# 58550 63097001 63098000 2.39 41.172
# 58641 63188001 63189000 3.21 16.436
# 58642 63189001 63190000 1.89 41.099
```

- It’s also possible to subset rows by referring to their integer positions. The function `which()` takes a vector of logical values and returns the positions of all TRUE values.
- `which()` also has two related functions that return the index of the first minimum or maximum element of a vector: `which.min()` and `which.max()`. For example:

```R
d[which.min(d$total.Bases),]
#         start      end total.SNPs total.Bases depth [...]
#25689 25785001 25786000          0         110  1.24 [...]
d[which.max(d$depth),]
#        start     end total.SNPs total.Bases depth [...]
# 8718 8773001 8774000         58      21914  21.91 [...]
```

- `subset()`: A convenience function (intended primarily for interactive use). Its first two arguments are the dataframe to operate on and the condition that rows must satisfy to be kept; column names can be used in the condition without the `d$` prefix.

```R
subset(d, Pi > 16 & percent.GC > 80)
#          start      end total.SNPs total.Bases depth [...]
# 58550 63097001 63098000          5         947  2.39 [...]
```
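
Because the examples above depend on the chapter's dataset, here is a self-contained sketch of the same subsetting tools on a made-up dataframe `toy` (its values are invented for illustration):

```R
toy <- data.frame(start = c(1, 1001, 2001, 3001),
                  depth = c(2.4, 8.1, 0.9, 5.5),
                  Pi    = c(17.2, 3.1, 20.5, 0))
which(toy$Pi > 16)            # positions of the TRUE values: 1 3
toy[which.max(toy$depth), ]   # the row with the greatest depth (row 2)
high <- subset(toy, Pi > 16 & depth > 1, select = c(start, Pi))
high                          # only row 1 satisfies both conditions
```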

### Exploring Data Visually with ggplot2 I: Scatterplots and Densities

- The best up-to-date reference for ggplot2 is [the ggplot2 online documentation](http://docs.ggplot2.org/).
- Recommended books:
    - **ggplot2: Elegant Graphics for Data Analysis** by Hadley Wickham (Springer, 2010)
    - **R Graphics Cookbook** by Winston Chang (O’Reilly, 2012)

- To install and load the `ggplot2` package in R:

```R
install.packages("ggplot2")
library(ggplot2)
```

- Each `ggplot2` plot is built by adding layers to a plot that map the *aesthetic properties* of *geometric objects* to data. Layers can also apply statistical transformations to data and change the scales of axes and colors.
- We specify the mapping of aesthetic attributes to columns in our dataframe using the function `aes()`.
- `ggplot2` works exclusively with **dataframes**, so you’ll need to get your data tidy and into a dataframe before visualizing it with ggplot2.

```R
d$position <- (d$end + d$start) / 2
p <- ggplot(d) + geom_point(aes(x=position, y=diversity))
```

- Aesthetic mappings can also be specified in the call to `ggplot()`; all geoms will then use this mapping.

```R
p <- ggplot(d, aes(x=position, y=diversity)) + geom_point()
```

- `ggplot2` has many geoms (e.g., `geom_line()`, `geom_bar()`, `geom_density()`, `geom_boxplot()`, etc.)
- **Labels and Title**: You can change the default axis labels and add a title using specific functions:
    - `xlab()` for the x-axis label.
    - `ylab()` for the y-axis label.
    - `ggtitle()` for the main plot title.

    ```R
    p <- p + xlab("xlabel") + ylab("ylabel") + ggtitle("Title")
    ```
- **Axis Scales**: You can control the range and transformation of continuous axes:
    - `scale_x_continuous(limits = c(start, end))` and `scale_y_continuous()` let you manually set the start and end points of the axes.
    - `scale_x_log10()` and `scale_y_log10()` transform the axes to a log base 10 scale.
- We can map color to a column using the `aes()` function:

```R
p <- ggplot(d, aes(x=position, y=diversity, color=cent)) + geom_point()
```

- We can plot densities using the `geom_density()` layer:

```R
ggplot(d) + geom_density(aes(x=diversity, color=cent), fill="black", alpha=0.5)
```

### Exploring Data Visually with ggplot2 II: Smoothing

There are numerous potential confounders in genomic data (e.g., sequencing read depth; GC content; mappability, or whether a region is capable of having reads correctly align to it; batch effects; etc.). Adding a smoothing curve to a scatterplot makes trends between such variables easier to see:

```R
ggplot(d, aes(x=depth, y=total.SNPs)) + geom_point() + geom_smooth()
```

- By default, `ggplot2` uses generalized additive models (GAM) to fit this smoothed curve for datasets with more than 1,000 rows.
- `ggplot2` adds confidence intervals around the smoothing curve; this can be disabled by using `geom_smooth(se=FALSE)`.

### Binning Data with cut() and Bar Plots with ggplot2

- Binning takes continuous numeric values and places them into a discrete number of ranged bins.
- The benefit is that discrete bins facilitate conditioning on a variable. Conditioning is an incredibly powerful way to reveal patterns in data.
- `cut()`: Bins data

```R
d$GC.binned <- cut(d$percent.GC, 5) # Number of breaks=5
d$GC.binned <- cut(d$percent.GC, c(0, 25, 50, 75, 100)) # we can directly specify breaks
```

- When you manually specify breaks that don’t fully enclose all values, values outside the range of breaks will be given the value NA. You can check if your manually specified breaks have created NA values using `any(is.na(cut(x, breaks)))`.
- We can plot binned columns using the `geom_bar()` layer:

```R
ggplot(d) + geom_bar(aes(x=GC.binned))
```
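
The NA behavior of `cut()` described above can be checked on a few made-up values (`gc_pct` and `breaks` are hypothetical):

```R
gc_pct <- c(12.5, 40, 61, 99)      # made-up GC percentages
breaks <- c(25, 50, 75, 100)       # does not cover values at or below 25
binned <- cut(gc_pct, breaks)
binned                             # the first value becomes NA
any(is.na(binned))                 # TRUE
full <- cut(gc_pct, c(0, 25, 50, 75, 100))  # breaks that enclose every value
any(is.na(full))                   # FALSE
```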

### Merging and Combining Data: Matching Vectors and Merging Dataframes

* `x %in% y`: Returns a logical vector indicating which of the values of x are in y.

```R
c(3, 4, -1) %in% c(1, 3, 4, 8)
# [1]  TRUE  TRUE FALSE
```

Let's show how to select desired rows using `%in%`:

```R
reps <- read.delim("chrX_rmsk.txt.gz", header=TRUE) # Repeat data found by Repeat Masker
common_reps <- c("SINE", "LINE", "LTR", "DNA", "Simple_repeat")
reps[reps$repClass %in% common_reps, ]
```

* `match(x, y)`: Returns the position in y of the first occurrence of each of x’s values. If `match()` can’t find one of x’s elements in y, it returns its nomatch argument (which by default has the value NA).

* **Merging DataFrames with `match()`**: The `match()` function in R can be used to combine two data frames, acting similarly to a left outer join. It finds the positions of elements from one data frame (`mtfs$pos`) within a second data frame (`rpts$pos`).

```R
i <- match(mtfs$pos, rpts$pos)
```

* **Handling Missing Data**: Positions from the first data frame (`mtfs`) that don't have a corresponding match in the second (`rpts`) will result in `NA` values in the merged column. The `table(is.na(i))` command can be used to count how many matches were found versus how many were not.

```R
table(is.na(i))
```

* **Performing a Left Outer Join**: The new column, in this case `mtfs$repeat_name`, is created by using the index vector generated by `match()` to select the appropriate values from the second data frame (`rpts$name`) and assign them to the first (`mtfs`).

```R
mtfs$repeat_name <- rpts$name[match(mtfs$pos, rpts$pos)]
```

* **Performing an Inner Join**: To perform an inner join, where only rows with matching data are kept, filter out the rows whose merged column is `NA`:

```R
mtfs_inner <- mtfs[!is.na(mtfs$repeat_name), ]
```
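
The `match()`-based joins above rely on the chapter's `mtfs` and `rpts` data. A self-contained sketch with two tiny made-up dataframes (`mtfs_toy` and `rpts_toy`, with invented positions and names) shows each step:

```R
mtfs_toy <- data.frame(pos = c(10, 20, 30, 40))
rpts_toy <- data.frame(pos = c(20, 40, 50),
                       name = c("L2", "THE1B", "MIR"),
                       stringsAsFactors = FALSE)

i <- match(mtfs_toy$pos, rpts_toy$pos)   # NA 1 NA 2
table(is.na(i))                          # 2 matched, 2 unmatched

mtfs_toy$repeat_name <- rpts_toy$name[i]                # left outer join
mtfs_inner <- mtfs_toy[!is.na(mtfs_toy$repeat_name), ]  # inner join
mtfs_inner
```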


The `merge()` function in R is a user-friendly way to combine two data frames. By default, it performs a join similar to an **inner join**, keeping only the rows where the specified columns have matching values in both data frames.

```R
recm <- merge(mtfs, rpts, by.x="pos", by.y="pos")
```

#### Key Features of `merge()`

* **Specify Columns**: It's best practice to explicitly name the columns to merge by using the `by.x` and `by.y` arguments. For instance, `by.x="pos"` and `by.y="pos"` tells R to find matching values in the `pos` column of both data frames.
* **Different Types of Joins**:
    * **Inner Join (Default)**: Combines rows that have matching values in both data frames. This is the default behavior.
    * **Left Outer Join**: Use `all.x=TRUE` to keep all rows from the first data frame (`x`) and include matching rows from the second (`y`). Rows of `x` with no match in `y` get `NA` in the columns that come from `y`.
    * **Right Outer Join**: Use `all.y=TRUE` to keep all rows from the second data frame (`y`) and include matching rows from the first (`x`).
    * **Full Outer Join**: Use `all=TRUE` to keep all rows from both data frames, filling in `NA`s where there are no matches.
* **Avoiding Duplicates**: The `merge()` function can sometimes create duplicated columns (e.g., `chr.x` and `chr.y`) if a column name exists in both data frames but isn't used for merging. For this reason, sometimes a custom approach with `match()` can be more efficient if you want to avoid these extra columns.
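
The four join types can be compared on two small made-up dataframes. In this sketch, `x`, `y`, and their contents are hypothetical:

```R
x <- data.frame(pos = c(10, 20, 30), motif = c("m1", "m2", "m3"),
                stringsAsFactors = FALSE)
y <- data.frame(pos = c(20, 30, 40), name = c("L2", "MIR", "Alu"),
                stringsAsFactors = FALSE)

inner <- merge(x, y, by.x = "pos", by.y = "pos")                # pos 20, 30
left  <- merge(x, y, by.x = "pos", by.y = "pos", all.x = TRUE)  # pos 10, 20, 30
right <- merge(x, y, by.x = "pos", by.y = "pos", all.y = TRUE)  # pos 20, 30, 40
full  <- merge(x, y, by.x = "pos", by.y = "pos", all = TRUE)    # pos 10 to 40
left   # the row for pos 10 has NA in the name column
```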

### Using ggplot2 Facets

* `ggplot2` has two facet methods: `facet_wrap()` and `facet_grid()`
* `facet_wrap()` takes a factor column and creates a panel for each level and wraps around horizontally.

```R
p <- ggplot(mtfs, aes(x=dist, y=recom)) + geom_point(size=1, color="grey")
p <- p + geom_smooth(method='loess', se=FALSE, span=1/10)
p <- p + facet_wrap(~ motif)
print(p)
```

* `facet_grid()` allows finer control of facets by allowing you to specify the columns to use for vertical and horizontal facets. For example:

```R
p <- ggplot(mtfs, aes(x=dist, y=recom)) + geom_point(size=1, color="grey")
p <- p + geom_smooth(method='loess', se=FALSE, span=1/16)
p <- p + facet_grid(repeat_name ~ motif)
print(p)
```

`ggplot2`'s `facet_wrap()` and `facet_grid()` functions use the tilde (`~`) symbol to specify the variables for creating separate plots. This syntax comes from R's formula notation, commonly used in statistical models.

#### Key Concepts

* **Tilde (`~`) Syntax**: The tilde is used to define the faceting variable. For example, `facet_wrap(~variable_name)` tells `ggplot2` to create a new panel for each unique value in `variable_name`.
* **Fixed Scales (Default)**: By default, both `facet_wrap()` and `facet_grid()` use **fixed scales** for the x and y axes. This means all the individual plots share the same axis limits, which is useful for direct visual comparisons between panels.
* **Free Scales**: To allow each panel to have its own unique axis limits, you can change the `scales` argument:
    * `scales = "free_x"`: Frees the x-axis, allowing each plot to have its own x-axis range.
    * `scales = "free_y"`: Frees the y-axis, allowing each plot to have its own y-axis range.
    * `scales = "free"`: Frees both the x and y axes, giving each plot a unique range for both.

Using free scales can be helpful when a fixed scale would hide important patterns in the data due to large differences in magnitude between panels.

### More R Data Structures: Lists

* Lists can contain elements of different types (they are heterogeneous).
* Elements can be any object in R (vectors with different types, other lists, environments, dataframes, matrices, functions, etc.).
* Because lists can store other lists, they allow for storing data in a recursive way (in contrast, vectors cannot contain other vectors).
* We create lists with the `list()` function:

```R
adh <- list(chr="2L", start=14615555L, end=14618902L, name="Adh")
```

Accessing elements from an R list is done using two distinct indexing operators:

* **Single Brackets (`[]`)**: Use single brackets to **extract a subset of a list**, which will always return another list. For example, `my_list[1:2]` will return a new list containing the first and second elements of `my_list`.
* **Double Brackets (`[[]]`)**: Use double brackets to **access a single element from a list**. This returns the element itself, not a list. For example, `my_list[[3]]` will return the third element from `my_list` in its original data type (e.g., a numeric vector, a data frame, etc.).
* We can create new elements or change existing elements in a list using the familiar `<-`.
* Assigning a list element the value `NULL` removes it from the list.
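
A short sketch of these operations on the `adh` list; the `id` value added here is only an illustrative placeholder:

```R
adh <- list(chr = "2L", start = 14615555L, end = 14618902L, name = "Adh")
first_two <- adh[1:2]      # single brackets: a list with two elements
adh[["start"]]             # double brackets: the integer itself
adh$name                   # `$` also extracts a single named element
adh$id <- "FBgn0000055"    # add a new element (illustrative value)
adh$chr <- NULL            # remove an element
names(adh)                 # "start" "end" "name" "id"
```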

The `str()` function in R is a useful tool for getting a compact, human-readable summary of a data structure. It's especially helpful for complex objects like nested lists or data frames.

* `str()` stands for "structure" and it provides a concise description of an R object.
* The output shows the **type** of each element (e.g., `List`, `num` for numeric), its **length** or dimensions, and the **first few values** it contains.
* For nested structures, `str()` indents the output to show the hierarchy, giving you a clear view of the object's organization.
* The `max.level` argument allows you to control how deeply `str()` explores nested objects.
* By default, `max.level` is set to `NA`, which means it shows the full structure.
* You can set a specific number (e.g., `str(z, max.level = 1)`) to limit the output to a certain depth, which is great for simplifying the view of very complex objects.
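
As a quick illustration of `max.level`, the sketch below inspects a made-up nested list `z`:

```R
z <- list(a = list(rn = rnorm(20), sp = c("x", "y")), b = 1:3)
str(z)                  # full nested structure
str(z, max.level = 1)   # top level only: `a` is summarized, not expanded
```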



### Writing and Applying Functions to Lists with lapply() and sapply()

#### Using lapply()

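```R
ll <- list(a=c(1, 2, NA), b=c(4, 5, 6)) # example list
lapply(ll, mean)             # apply a function to each element
lapply(ll, mean, na.rm=TRUE) # extra arguments are passed on to the function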
```

`lapply()` has several advantages over a for loop:
* It creates the output list for us.
* It uses fewer lines of code.
* It leads to clearer code.
* In some cases, it is faster.


Slow `lapply()` calls can often be sped up by running them in parallel with the `mclapply()` function from the `parallel` package.

* **`mclapply()` is a parallel version of `lapply()`**. It applies a function to each element of a list or vector, but spreads the work across multiple processor cores. This is particularly useful for computationally intensive tasks.
* **Basic Syntax**: You use `mclapply()` just like `lapply()`, providing the list or vector and the function you want to apply. For example, `mclapply(my_samples, slowFunction)` will run `slowFunction` on the elements of `my_samples` in parallel.
* **Controlling Cores**: By default, `mclapply()` uses two cores, or the number set by the global `mc.cores` option. You can set the number of cores per call with the `mc.cores` argument or globally with `options(mc.cores = 4)`. Because `mclapply()` relies on forking, it cannot use more than one core on Windows.

While parallelization can significantly speed up some tasks, it's not a replacement for writing clean, efficient R code. Optimizing your code first can often provide greater performance benefits than simply parallelizing a slow process.
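
A minimal sketch of `mclapply()` in use, assuming a hypothetical slow function `slow_square()` that stands in for real work:

```R
library(parallel)

slow_square <- function(x) {  # hypothetical stand-in for a slow function
  Sys.sleep(0.1)
  x^2
}

# Forking is unavailable on Windows, where mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(1:4, slow_square, mc.cores = n_cores)
unlist(res)   # 1 4 9 16
```

Like `lapply()`, `mclapply()` returns a list with one element per input, in the original order.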

#### Writing functions

General syntax for R functions:

```R
fun_name <- function(args) {
  # body, containing R expressions
  return(value)
}
```

* Function definitions consist of arguments, a body, and a return value.
* Functions whose body is a single expression can omit the braces.
* Using `return()` to specify the return value is optional; R’s functions will automatically return the last evaluated expression in the body.
* We could forgo creating a function with a specific name in our global environment altogether and instead use an *anonymous* function (named so because anonymous functions are functions without a name). Anonymous functions are useful when we only need a function once for a specific task. For example: `lapply(ll, function(x) mean(x, na.rm=TRUE))`
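
These points can be seen together in a short sketch. The function names `mean_no_na` and `range_width` and the list `ll` are made up for illustration:

```R
# Named function: the last evaluated expression is the return value.
mean_no_na <- function(x) {
  mean(x, na.rm = TRUE)
}
# Single-expression body, so the braces can be omitted.
range_width <- function(x) max(x) - min(x)

ll <- list(a = c(1, 2, NA), b = c(10, 20, 30))
named_res <- lapply(ll, mean_no_na)                        # a: 1.5, b: 20
anon_res  <- lapply(ll, function(x) mean(x, na.rm = TRUE)) # same, anonymous
range_width(ll$b)                                          # 20
```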

### Digression: Debugging R Code

To debug an R function, you can use the `browser()` function to pause its execution at a specific point. This allows you to:

* **Inspect variables**: Check the current values of all variables.
* **Step through code**: Execute the code line by line to see how it behaves.
* **Examine the call stack**: View the sequence of functions that led to the current point in the code.

```R
foo <- function(x) {
  a <- 2
  browser()
  y <- x + a
  return(y)
}
```

We use one-letter commands to control stepping through code with `browser()`. The most frequently used are:
*  `n`: Execute the next line
*  `c`: Continue running the code
*  `Q`: Exit without continuing to run code

Within `browser()`, we can view variables’ values:

```R
Browse[1]> ls() # list all variables in local scope
[1] "a" "x"
Browse[1]> a
[1] 2
```

Using `options(error = recover)` is a powerful way to debug an R function that's throwing an error. This setting tells R to automatically enter an interactive debugging session whenever an error occurs.

When you set `options(error = recover)` and then run a function that contains an error, the execution will pause at the exact point of the error. A menu will appear, asking you to choose which function from the call stack you want to inspect. By selecting the relevant function (e.g., `bar(2)`), you are placed at a `Browse` prompt, which is the same as if you had manually inserted a `browser()` call.

From this prompt, you can:

* **Inspect variables** to see their values.
* **Step through the code** line by line.
* **Examine the call stack** to understand the sequence of function calls.

To turn this debugging mode off, simply run `options(error = NULL)`.

### More list apply functions: sapply() and mapply()

#### sapply()
* The `sapply()` function is similar to `lapply()`, except that it simplifies the results into a vector, array, or matrix.
* `sapply()` can simplify fairly complex results, but occasionally it simplifies something in an unexpected way, leading to more headaches than it’s worth.


#### mapply()
* `mapply()` is a multivariate version of `sapply()`: the function you pass to `mapply()` can take in and use multiple arguments.
* Unlike `lapply()` and `sapply()`, `mapply()`’s first argument is the function you want to apply.

```R
mapply(func_name, var1, var2) 
```

* To prevent oversimplifying output, specify `SIMPLIFY=FALSE`.
* `mapply(fun, x, y, SIMPLIFY=FALSE)` is equivalent to using the function `Map()` like `Map(fun, x, y)`, which saves some typing.
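
The differences between these functions show up in the shape of what they return. A small sketch with made-up inputs:

```R
ll <- list(a = 1:3, b = 4:9)
lens <- sapply(ll, length)                 # named integer vector: a = 3, b = 6
sums <- mapply(function(x, y) x + y, 1:3, 4:6)   # simplified to 5 7 9
sums_list <- mapply(function(x, y) x + y, 1:3, 4:6, SIMPLIFY = FALSE)
map_list  <- Map(function(x, y) x + y, 1:3, 4:6) # the same list of three
```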


### Working with the Split-Apply-Combine Pattern

The "split-apply-combine" strategy is a common data analysis pattern in R. First, we will use base R’s `split()` and `lapply()` functions to implement it. Next, we will use the `dplyr` package to perform the same operations more concisely.

#### Split-Apply-Combine with `split()` and `lapply()`

* **Split**: The `split(x, f)` function divides a data frame or vector (`x`) into a list of subsets based on the levels of a grouping factor (`f`). Each element of the resulting list corresponds to a unique level of the factor and contains all the data belonging to that group.

```R
d_split <- split(d$depth, d$GC.binned) 
```

* **Apply**: After splitting the data, you can use the `lapply()` function to apply a specific function (e.g., `mean()`, `median()`) to each element of the list created in the split step. This performs a calculation on each of the groups independently.

```R
grp_mean <- lapply(d_split, mean)
```

* **Combine**: The final step involves combining the results of the `lapply()` operation into a single, more usable data structure, often a vector or a data frame, using `rbind()` (row binding) or `cbind()` (column binding) and `do.call()`:

```R
rbind(grp_mean[[1]], grp_mean[[2]])
```

```R
do.call(rbind, grp_mean) # Calls rbind on all elements of grp_mean list
```

More on `split()` function:

* **Grouping by Multiple Factors**: You can group data by the unique combinations of several factors by providing `split()` with a **list of factors** as its second argument.
* **Reversing the Split**: The `unsplit()` function can reconstruct the original vector or data frame from the list created by `split()`. It requires the split list and the **original grouping factor(s)** used in the `split()` function.
* **Splitting Entire DataFrames**: While the examples typically split single columns (vectors), `split()` can also be used to split an **entire data frame**. This is necessary when your application step (the function used in `lapply()`) requires access to **multiple columns** from the group, such as fitting a linear model (`lm()`).
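
The whole pattern, including `unsplit()` and splitting an entire dataframe, can be run end to end on a made-up dataframe. In this sketch, `toy` and its columns are hypothetical:

```R
toy <- data.frame(depth = c(2, 4, 6, 8, 10, 12),
                  bin   = factor(c("low", "low", "mid", "mid", "high", "high"),
                                 levels = c("low", "mid", "high")))
d_split  <- split(toy$depth, toy$bin)     # split
grp_mean <- lapply(d_split, mean)         # apply
combined <- do.call(rbind, grp_mean)      # combine: one row per group
combined
restored <- unsplit(d_split, toy$bin)     # recovers the original depth vector

# Splitting the whole dataframe gives each group access to every column.
rows_per_group <- sapply(split(toy, toy$bin), nrow)
```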

### Exploring Dataframes with dplyr


While base R functions can perform data manipulation, the **`dplyr` package** offers a faster, simpler, and more consistent alternative for the common **split-apply-combine** data pattern.

---

* **Base R Limitations**: Base R functions like `split()` and `lapply()`, while versatile, are often not the fastest or simplest approach for these tasks. Convenience functions like `tapply()` and `aggregate()` often require extra steps for output cleanup.
* **Introducing `dplyr`**: The `dplyr` package (developed by Hadley Wickham) is designed to consolidate and simplify routine data frame operations.
* **Performance Advantage**: `dplyr` is highly performant because much of its core functionality is written in **C++** for speed.
* **Core `dplyr` Functions**: The package centers around five basic functions for manipulating data frames:
    * `arrange()`
    * `filter()`
    * `mutate()`
    * `select()`
    * `summarize()`
* **Simplified Interface**: `dplyr`'s main advantage is its added **consistency, speed, and versatility**, which drastically simplifies data manipulation and allows for more effective data exploration.


#### select()

```R
library(dplyr)
d_df <- as_tibble(d) # Convert the dataframe d to a tibble
select(d_df, start, end) # Select start and end columns
select(d_df, start:depth) # Select range of columns from start to depth
select(d_df, -start, -end) # Drop start and end columns from tibble
select(d_df, -(start:depth)) # Drop a range of columns from tibble
```

#### filter()

`filter()` is similar to subsetting dataframes using expressions like `d[d$Pi > 16 & d$percent.GC > 80, ]`, though you can use multiple statements (separated by commas) instead of chaining them with `&`:

```R
filter(d_df, Pi > 16, percent.GC > 80) 
```

#### arrange()

`arrange()` sorts rows by the values of one or more columns, much like `d[order(d$depth), ]`:

```R
arrange(d_df, depth) 
arrange(d_df, desc(depth)) # Sort a column in descending order
arrange(d_df, desc(total.SNPs), desc(depth)) # Additional columns can be specified to break ties
```

#### mutate()

Using the `mutate()` function, we can add new columns to our dataframe.
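A minimal sketch of `mutate()`, assuming `dplyr` is installed. The tibble `toy`, its values, and the new column name `diversity` are illustrative stand-ins rather than the chapter's actual data.

```r
library(dplyr)

# Hypothetical stand-in for the chapter's data frame.
toy <- tibble(start = c(1, 1001, 2001), Pi = c(12.5, 0, 20))

# Add a rescaled column; the divisor is an arbitrary illustrative choice.
toy <- mutate(toy, diversity = Pi / 10000)
toy
```

`mutate()` returns a new data frame, so the result has to be assigned if you want to keep the added column.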

#### Chaining dplyr functions

* **Problem with Sequential Operations**: Manipulating a data frame through multiple sequential steps (like selecting, filtering, and mutating) can lead to either:
    1.  Assigning output to **intermediate variables** (which is inefficient).
    2.  **Nesting functions** (e.g., `filter(select(...))`), which makes code difficult to read and understand (reading from the inside out).
* **Solution: The Pipe Operator (`%>%`)**: `dplyr` uses the **pipe operator (`%>%`)** from the `magrittr` package to chain operations.
* **How the Pipe Works**: The pipe takes the output of the expression on its left-hand side and passes it as the **first argument** of the function on its right-hand side.
    * Example: `d_df %>% filter(percent.GC > 40)` is equivalent to `filter(d_df, percent.GC > 40)`.
* **Benefit**: Using the pipe allows for the creation of clear, natural-reading **data-processing pipelines** that express complex data manipulation operations in a straightforward, sequential manner.
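To make the contrast concrete, here is a small sketch comparing the nested and piped forms, assuming `dplyr` is installed. The tibble `toy` is hypothetical; only its column names mirror those used in this chapter.

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(
  start      = c(1, 1001, 2001, 3001),
  depth      = c(3.4, 8.1, 5.0, 9.7),
  percent.GC = c(39, 45, 52, 41)
)

# Nested form: must be read from the inside out.
nested <- arrange(select(filter(toy, percent.GC > 40), start, depth), desc(depth))

# Piped form: reads top to bottom in the order the steps run.
piped <- toy %>%
  filter(percent.GC > 40) %>%
  select(start, depth) %>%
  arrange(desc(depth))

# Both forms should produce the same result.
isTRUE(all.equal(as.data.frame(nested), as.data.frame(piped)))
```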

#### group_by() and summarize()

* We can group by one or more columns by calling `group_by()` with their names as arguments.
* `summarize()` handles passing the relevant column to each function and automatically creates columns with the supplied argument names.

The `dplyr` package offers several convenience functions designed specifically for summarizing data within groups:

* **`n()`**: Returns the **total number of observations** (rows) in the current group.
* **`n_distinct()`**: Returns the **number of unique observations** in the current group.
* **`first()`, `last()`, and `nth()`**: These functions return specific observations from the group:
    * `first()`: Returns the **very first observation**.
    * `last()`: Returns the **very last observation**.
    * `nth()`: Returns the **observation at a specified position** (e.g., the 5th row).
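The grouping and summary helpers above can be combined as in the following sketch, assuming `dplyr` is installed. The tibble `toy` and the names of the summary columns are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: windows on two chromosomes.
toy <- tibble(
  chr   = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  depth = c(4, 6, 6, 10, 2)
)

per_chr <- toy %>%
  group_by(chr) %>%
  summarize(
    windows         = n(),                # rows in the group
    distinct_depths = n_distinct(depth),  # unique depth values
    mean_depth      = mean(depth),
    first_depth     = first(depth),
    second_depth    = nth(depth, 2)
  )
per_chr
```

The result has one row per group, and each argument name passed to `summarize()` becomes a column.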


One of the best features of dplyr is that all of these same methods also work with *database connections*. For example, you can manipulate a SQLite database with all of the same verbs we’ve used here. See dplyr’s databases vignette for more information on this.

### Working with Strings

#### Key Points on R String Processing in Bioinformatics

* **Necessity**: String manipulation is often required when working with bioinformatics data within R.
* **R's Drawbacks for Text Processing**:
    * **Memory Use**: R loads all data into memory, which is inefficient for large bioinformatics files that are better handled with **stream-based approaches**.
    * **Clunky Functions**: R's string processing functions are considered less user-friendly and more difficult to remember compared to those in languages like Python.
* **When to Use R's Functions**:
    * **Practicality**: If data has **already been loaded into R** for exploration or analysis, it is simpler and more convenient to use R's built-in string functions than to switch to a separate language like Python.
    * **Performance**: Since the initial cost of reading the data into memory has already been incurred, there's **no significant performance gain** in switching to another language just for string processing at that point.

#### R functions to work with strings

* `nchar()`: Returns the number of characters in each element of a character vector; like most R functions, it is vectorized.
* `grep(pattern, x)`: Returns the positions of all elements in `x` that match `pattern`:

```R
> re_sites <- c("CTGCAG", "CGATCG", "CAGCTG", "CCCACA") 
> grep("CAG", re_sites) 
# [1] 1 3
```

By default, `grep()` uses POSIX extended regular expressions, so we could use more sophisticated patterns:

```R
> grep("CT[CG]", re_sites) 
# [1] 1 3
```

* **Controlling Regular Expression Dialect**:
    * **Perl Compatible Regular Expressions (PCRE)**: Enable this modern, powerful dialect using the argument `perl=TRUE`. This is often required for special symbols like `\d` (which matches any digit).
    * **Fixed String Matching**: Use `fixed=TRUE` to treat the pattern as a literal string, disabling the interpretation of special regular expression characters.
* **Writing Precise Patterns**: To avoid matching unwanted strings (e.g., matching "6" but excluding "16"), you must use a **restrictive regular expression**.
    * **Example**: The pattern `[^\\d]6` matches any non-digit character (`[^\\d]`) immediately followed by the number `6`, successfully filtering out entries like "chr16".
* **Backslash Escaping**: A backslash in a regular expression must be written twice in an R string (e.g., `\\d`), because R's string parser consumes one backslash before the pattern reaches the regex engine.

```R
> chrs <- c("chrom6", "chr2", "chr6", "chr4", "chr1", "chr16", " chrom8")
> grep("[^\\d]6", chrs, perl=TRUE)
# [1] 1 3
> chrs[grep("[^\\d]6", chrs, perl=TRUE)]
# [1] "chrom6" "chr6"
```
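The effect of `fixed=TRUE` is easiest to see with a pattern that contains a regex metacharacter. In this sketch the file names are hypothetical:

```r
# Hypothetical file names; the "." in the pattern is the point of interest.
files <- c("reads.bed", "readsXbed", "reads.bed.gz")

# As a regular expression, "." matches any single character.
as_regex <- grep(".bed", files)

# As a fixed string, "." only matches a literal period.
as_fixed <- grep(".bed", files, fixed = TRUE)

as_regex
as_fixed
```

With the regex interpretation, `"readsXbed"` also matches, which is usually not what is intended when searching for a file extension.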

* `regexpr(pattern, x)`: Unlike `grep()`, `regexpr(pattern, x)` returns where in each element of `x` it matched `pattern`. If an element doesn’t match the pattern, `regexpr()` returns -1. For example:

```R
> pos <- regexpr("[^\\d]6", chrs, perl=TRUE)
> pos
# [1] 5 -1 3 -1 -1 -1 -1
# attr(,"match.length")
# [1] 2 -1 2 -1 -1 -1 -1
# attr(,"useBytes")
# [1] TRUE
```

* You can access attributes with the function `attributes()`:

```R
attributes(pos)$match.length
```

* `substr()`: `substr(x, start, stop)` takes a string `x` and returns the characters between `start` and `stop`.
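As a sketch of how `regexpr()` and `substr()` fit together, the following extracts the numeric part of each chromosome name. The pattern `\\d+` and the variable names `len` and `chr_nums` are choices made for this illustration.

```r
chrs <- c("chrom6", "chr2", "chr6", "chr4", "chr1", "chr16", " chrom8")

# Match the run of digits in each element; -1 would mean no match.
pos <- regexpr("\\d+", chrs, perl = TRUE)
len <- attributes(pos)$match.length

# substr() is vectorized; the last matched character is at start + length - 1.
chr_nums <- substr(chrs, pos, pos + len - 1)
chr_nums
```

Every element here contains at least one digit, so no -1 positions need to be handled; real data may require filtering out non-matching elements first.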

* `sub()`: `sub(pattern, replacement, x)` replaces the first occurrence of `pattern` with `replacement` for each element in character vector `x`. Like `regexpr()` and `grep()`, `sub()` supports `perl=TRUE` and `fixed=TRUE`:

```R
> sub(pattern="Watson", replacement="Watson, Franklin,",
x="Watson and Crick discovered DNA's structure.")
# [1] "Watson, Franklin, and Crick discovered DNA's structure." 
```

* `paste()`: `paste()` takes any number of arguments and concatenates them together using the separating string specified by the `sep` argument (which is a space by default). Like many of R’s functions, `paste()` is vectorized:

```R
> paste("chr", c(1:22, "X", "Y"), sep="")
# [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
# [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
# [19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY"
```

* `strsplit()`: `strsplit(x, split)` splits string `x` by `split`. Like R’s other string processing functions, `strsplit()` supports optional `perl` and `fixed` arguments. For example:

```R
> leafy <- "gene=LEAFY;locus=2159208;gene_model=AT5G61850.1"
> strsplit(leafy, ";")
# [[1]]
# [1] "gene=LEAFY"             "locus=2159208"          "gene_model=AT5G61850.1"
```

### Developing Workflows with R Scripts

#### Control Flow: if, for, and while

The basic syntax of `if`, `for`, and `while` is:

```R
if (x == some_value) {
  # do some stuff in here
} else {
  # else is optional
}

for (element in some_vector) {
  # iteration happens here
}

while (something_is_true) {
  # do some stuff
}
```

* `ifelse()`: A vectorized version of `if`. Rather than controlling program flow, `ifelse(test, yes, no)` returns the `yes` value wherever `test` is `TRUE` and the `no` value wherever it is `FALSE`. For example:

```R
> x <- c(-3, 1, -5, 2)
> ifelse(x < 0, -1, 1)
# [1] -1 1 -1 1
```

#### Working with R Scripts

You can run R scripts from R using the function `source()`. For example, to execute an R script named "my_analysis.R" use:

```R
> source("my_analysis.R")
```

Alternatively, we can execute a script in batch mode from the command line with:

```bash
$ Rscript --vanilla my_analysis.R
```

* **Recommended Flag**: Use `Rscript --vanilla` when running scripts from the command line.
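* **Default Behavior (Avoided)**: A default R startup restores any previously saved environment (`.RData`) and reads startup files such as `.Rprofile`; `R CMD BATCH` also saves the environment upon completion. `Rscript` itself already implies `--no-restore`, but it still reads the startup files unless `--vanilla` is given.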
* **Irreproducible Results**: Restoring past saved states (`.RData` files) makes results **irreproducible** because the script's outcome depends on local files and context, not just the code itself.
* **Debugging Nightmare**: Saved environments can severely complicate **debugging** efforts by introducing unexpected variables and states.
* **Benefit of `--vanilla`**: It ensures R starts in a clean state without loading any past saved environments, guaranteeing that the analysis runs independently and is more easily debugged.

* `commandArgs()`: Retrieves command-line arguments passed to your script. For example, this simple R script just prints all arguments:

```R
## args.R -- a simple script to show command line args
args <- commandArgs(TRUE)
print(args)
```

We run this with:

```bash
$ Rscript --vanilla args.R arg1 arg2 arg3
[1] "arg1" "arg2" "arg3"
```
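Command-line arguments always arrive as character strings, so scripts usually validate and convert them. The following is only a sketch: `parse_args()` is a hypothetical helper, and the expected arguments (a file name and a numeric threshold) are assumptions made for illustration.

```r
# Hypothetical helper: expects a file name followed by a numeric threshold.
parse_args <- function(args) {
  if (length(args) != 2) {
    stop("usage: Rscript --vanilla script.R <file> <threshold>")
  }
  # Arguments are strings, so numeric values must be converted explicitly.
  threshold <- suppressWarnings(as.numeric(args[2]))
  if (is.na(threshold)) stop("threshold must be a number")
  list(file = args[1], threshold = threshold)
}

# In a real script: opts <- parse_args(commandArgs(trailingOnly = TRUE))
opts <- parse_args(c("hotspots.bed", "0.5"))
str(opts)
```

Failing early with a usage message tends to be easier to debug than letting a malformed argument surface later in the analysis.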

#### Reproducibility and sessionInfo()

* **The Problem**: Results from R analyses can change over time due to **updates in R and package versions**, causing **reproducibility headaches**.
* **The Solution**: At a minimum, you must **always record the versions of R and all packages** used for an analysis.
* **How to Record Versions**: R makes this process easy with the built-in function **`sessionInfo()`**, which provides a detailed summary of the current R environment.
* **Advanced Solutions**: Addressing these versioning issues is an active area of development, with tools like the **`packrat`** package available for more comprehensive environment management.
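One simple way to record versions is to write the output of `sessionInfo()` to a text file alongside the results. In this sketch the file name is an arbitrary choice, and `tempdir()` is used only to keep the example free of side effects; in a real project you would write into the project directory.

```r
# Capture the printed session details and save them as plain text.
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)

# The file begins with the R version, followed by platform and package details.
readLines(session_file, n = 1)
```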


### Workflows for Loading and Combining Multiple Files

* `list.files()`: Lists all files within a specified directory.

* **Filtering with Regex**: The function optionally accepts a **regular-expression pattern** via the `pattern` argument, which is used to select only files that match the expression.
* **Best Practice**: It's highly recommended to make the regex pattern **as restrictive as possible** (e.g., `pattern="hotspots.*\\.bed"`) to prevent accidentally loading incorrect or extraneous files that might end up in the data directory.

* **Example**: To list all `.bed` files for a specific dataset in a "hotspots" directory, you would use a command like:

```R
list.files("hotspots", pattern="hotspots.*\\.bed") 
```

Some details of the command above:

`pattern = "hotspots.*\\.bed"`: This argument provides a regular expression (regex) used to filter the files found in the directory. Only files whose names match this pattern will be returned:
* `hotspots`: Matches the literal string "hotspots".
* `.*`: Matches any character (.) zero or more times (*). This is a flexible wildcard that allows for any characters (like _chr1) to be between "hotspots" and the file extension.
* `\\.`: Matches the literal period (.) before the file extension. In a regular expression, `\.` escapes the period so that it is treated as a literal character rather than the wildcard; the backslash must itself be doubled because R's string parser treats a single backslash as the start of an escape sequence.
* `bed`: Matches the literal file extension "bed".

* `list.files()` also has an argument, `full.names=TRUE`, that returns the path to each file (including the directory) rather than just the file name:

```R
> hs_files <- list.files("hotspots", pattern="hotspots.*\\.bed", full.names=TRUE)
> hs_files
# [1] "hotspots/hotspots_chr1.bed" "hotspots/hotspots_chr10.bed"
```

Now, let's read multiple bed files and combine them into a single dataframe:

```R
# Define column names for the combined bed dataframe.
bedcol <- c("chr", "start", "end")
# Define a function to read a tab-delimited bed file.
loadFile <- function(x) read.delim(x, header=FALSE, col.names=bedcol)
# Apply loadFile to each element of hs_files.
hs <- lapply(hs_files, loadFile)
# Concatenate the dataframes into one big dataframe.
hsd <- do.call(rbind, hs)
# Reset row names
row.names(hsd) <- NULL
```

Often, we need to include a column in our dataframe containing meta-information about files that’s stored in each file’s filename. As a simple example of this, let’s pretend we did not have the column `chr` in our hotspot files and needed to extract this information from each filename using `sub()`. We’ll modify our `loadFile()` function accordingly:

```R
loadFile <- function(x) {
  # read in a BED file, extract the chromosome name from the file name,
  # and add it as a column
  df <- read.delim(x, header=FALSE, col.names=bedcol)
  df$chr_name <- sub("hotspots_([^\\.]+)\\.bed", "\\1", basename(x))
  df$file <- x
  df
}
```

**Bonus note**: You can use the `fix()` function to edit a function that was defined interactively, for example in RStudio's console. It opens the function in an editor so you can modify it and save the changes.

Processing files this way is very convenient for large files, and it is even more powerful because you can parallelize data processing by simply replacing `lapply()` with `mclapply()` from the `parallel` package.
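The swap from `lapply()` to `mclapply()` might look like the sketch below. Everything about the setup is hypothetical: the temporary directory, the two tiny BED-like files, and the choice of two cores are made up so the example is self-contained.

```r
library(parallel)  # ships with R; provides mclapply()

# Hypothetical setup: write two tiny BED-like files to a temporary directory.
bed_dir <- file.path(tempdir(), "hotspots_demo")
dir.create(bed_dir, showWarnings = FALSE)
for (chr in c("chr1", "chr2")) {
  write.table(data.frame(chr, start = c(1, 100), end = c(50, 150)),
              file = file.path(bed_dir, paste0("hotspots_", chr, ".bed")),
              sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}

bedcols <- c("chr", "start", "end")
loadFile <- function(x) read.delim(x, header = FALSE, col.names = bedcols)
hs_files <- list.files(bed_dir, pattern = "hotspots.*\\.bed", full.names = TRUE)

# mclapply() forks worker processes; forking is unavailable on Windows,
# where mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
hs <- mclapply(hs_files, loadFile, mc.cores = n_cores)
hsd <- do.call(rbind, hs)
nrow(hsd)
```

Because `mclapply()` takes the same first two arguments as `lapply()`, the rest of the workflow (`do.call(rbind, ...)`) stays unchanged.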


### Exporting Data

To export data frames from R to plain-text files, you use the `write.table()` function. However, its default settings often need adjustment for standard data export.

#### Key Points for Using `write.table()`

  * **Required Arguments**: The first two arguments are the **data frame/matrix** to export and the **file path** (`file`) where it should be saved.
  * **Adjusting Defaults (Standard Practice)**: To create a clean, tab-delimited file without extra quotes or row numbers, you typically set the following arguments:
      * `quote = FALSE`: **Disables quotation marks** around character and factor columns.
      * `row.names = FALSE`: **Excludes R's internal row numbers** from the output file.
      * `sep = "\t"`: **Sets the column separator to a tab** (`\t`) for creating a tab-delimited file.
      * `col.names = TRUE`: **Includes column headers** (this is already the default, but stating it makes the intent explicit).
  * **Writing Compressed Files (Gzip)**: You can write compressed (gzipped) files directly by passing a **file connection** to the `file` argument instead of a simple string path. Create the connection with the `gzfile()` function; this is useful for integration with Unix tools.

**Example of Best Practice Export:**

```r
write.table(mtfs, file="hotspot_motifs.txt", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE) 
```

```r
hs_gzf <- gzfile("hotspots.txt.gz")
write.table(mtfs, file=hs_gzf, quote=FALSE, row.names=FALSE, col.names=TRUE, sep='\t')
```
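A quick way to check a compressed export is to read it back. In this sketch the data frame `demo` is a hypothetical stand-in for `mtfs`, and the file is written to `tempdir()` only to avoid side effects.

```r
# Hypothetical stand-in for the chapter's `mtfs` data frame.
demo <- data.frame(chr = c("chr1", "chr2"), pos = c(100L, 200L))

gz_path <- file.path(tempdir(), "demo.txt.gz")
write.table(demo, file = gzfile(gz_path), quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = TRUE)

# read.delim() detects gzip compression and decompresses transparently.
demo_back <- read.delim(gz_path)
demo_back
```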

**Serialization**: Encoding and saving objects to disk in a way that allows them to be restored as the original object is known as serialization. R’s functions for saving and loading R objects are `save()` and `load()`.

```r
tmp <- list(vec=rnorm(4), df=data.frame(a=1:3, b=3:5))
save(tmp, file="example.Rdata")
rm(tmp) # remove the original 'tmp' list
load("example.Rdata") # this fully restores the 'tmp' list from file
str(tmp)
# List of 2
#  $ vec: num [1:4] -0.655 0.274 -1.724 -0.49
#  $ df :'data.frame':	3 obs. of  2 variables:
#   ..$ a: int [1:3] 1 2 3
#   ..$ b: int [1:3] 3 4 5
```

The `save.image()` function is a convenient way to quickly save your entire R workspace, including all objects currently defined. When combined with `savehistory()`, which saves the commands you've run, these two functions can be used in a rush to store all of your recent work.
