# Short Introduction to R Commands

> **This Jupyter Notebook (JN) was built as a SJSU/GeneLab collaboration to introduce students to running R commands within a Jupyter environment. Many essential bioinformatics tools are written and run in the R language, including the tools we will use for differential gene expression analysis of RNA sequence data.**  

---

# Table of Contents

* [0. What is R?](#zeroR)
* [1. Introduction to programming in R](#oneR)
    * [1a. Install and load packages/libraries in R](#lib)
    * [1b. Define locations and assign variables in R](#variable)
    * [1c. Loading and viewing a data file in R](#file)
        * [help()](#help)
        * [read.csv()](#read)
        * [head()](#headR)
        * [dim()](#dim)
        * [summary()](#summary)
    * [1d. Data frame manipulations](#df)
        * [Add a value to all cells](#dfadd)
        * [Take the log of all cells](#dflog)
        * [Convert data frame to contain only integers](#dfint)
        * [Slice a data frame column](#dfcol)
        * [Slice a data frame row](#dfrow)
        * [Filter data in a data frame](#dffilter)
        * [Add columns to a data frame](#dfmore)
        * [Combine data frames](#dfcombine)
    * [1e. Export data from R](#export)
    * [1f. Visualizations](#viz)

<a class="anchor" id="zeroR"></a>

# 0. What is R?

R is a programming language designed specifically for statistical analysis and computing. R has many bioinformatic libraries for statistical analysis, as well as for data visualization.

This Jupyter Notebook (JN) is running an R kernel, which means that each cell is designed to interpret and run commands written in the R language. A JN cannot run both an R kernel and a Python kernel at the same time.

Which language did we use when running Unix commands in the previous JN? 

<a class="anchor" id="oneR"></a>

# 1. Introduction to programming in R

<a class="anchor" id="lib"></a>

## 1a. Install and load packages/libraries in R

An R library, or package, is a collection of functions, compiled code, and sample data. Some R packages are installed automatically with the base R installation, but there are thousands of additional packages that do not come with it. If you want to use one of these packages, you can install it using the `install.packages()` function and then load it using the `library()` function. 

For example, the following is the syntax to install the BiocManager package, which is then itself used to install any package from the Bioconductor suite.

`install.packages("BiocManager")`

Bioconductor packages, including DESeq2, which we will use for differential gene expression analysis, are installed using the BiocManager `install()` function (instead of the base R `install.packages()` function) and loaded using the R `library()` function, as shown in the example below: 

`BiocManager::install("DESeq2")`

`library(DESeq2)`
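
Before installing a package, it can be useful to check whether it is already available. The sketch below uses the base R function `requireNamespace()` for that check. It tests the `stats` package only because that package ships with R; the commented-out lines show how the same check might look for BiocManager and are an illustration, not a required step in this notebook.

```r
# "stats" is distributed with R itself, so this check needs no installation.
pkgAvailable <- requireNamespace("stats", quietly = TRUE)
print(pkgAvailable)

# The same check for a package that may be missing. These lines are commented
# out because running them would download the package from the internet.
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
#     install.packages("BiocManager")
# }
```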

<a class="anchor" id="variable"></a>

## 1b. Define locations and assign variables in R

#### getwd() and setwd()

The `getwd()` function in R stands for "get working directory". This function lists your current working directory, and is similar to the `pwd` command in Unix. 

The `setwd()` function in R stands for "set working directory". This function allows you to change your current working directory, and is similar to the `cd` command in Unix.

Let's run `getwd()` in the next cell to print your current working directory.

In [None]:
getwd()

Where are you located within the cluster?

Notice that the output of `getwd()` has quotation marks around it. That is because R is treating the path as a **string**: a sequence of characters that represents only the text contained between the quotation marks.

For example, 'Sally Ride became the first American woman in space on June 18, 1983' is a string. 
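
As a small sketch of how R treats strings, the cell below checks the type of the value returned by `getwd()` and counts the characters in a short string. The variable names `wd` and `astronaut` are made up for this illustration.

```r
wd <- getwd()
print(class(wd))         # the type R assigns to the path
print(is.character(wd))  # TRUE when the value is a string

astronaut <- 'Sally Ride'
print(nchar(astronaut))  # number of characters, counting the space
```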

In the next cell, try running `setwd()` to change your current working directory to the `SolarSystem` folder you made in the Unix Intro JN (hint: the `SolarSystem` folder is located in your home directory). Then, run `getwd()` to check if you were successful. 
- Remember that R expects paths as **strings**, so be sure to pass your path to `setwd()` as a string.

Next, use `setwd()` to change your working directory back to your home directory, then use `getwd()` to check if you navigated back successfully.
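
The same move-and-return pattern can be sketched with a directory that exists in every R session. Here `tempdir()`, R's temporary directory, stands in for the folder you want to visit; it is only a placeholder for this illustration, and `startDir` is a made-up variable name.

```r
startDir <- getwd()    # remember where we started
setwd(tempdir())       # move into R's temporary directory (path given as a string)
print(getwd())         # confirm the move
setwd(startDir)        # return to the starting directory
print(getwd())         # confirm we are back
```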

#### Assign variables in R

In most programming languages, including R, we use **variables** to store information. Variables are named objects which refer to data stored in memory. In R, we use the `<-` operator to assign information to a variable. Remember that it is very important to choose informative, memorable and short variable names.

For example, in the next cell, we are going to use the variable `x` to refer to the result of the equation `2+4`.

In [None]:
x <- 2+4

We can access the data stored in the variable `x` using the function `print()`:

In [None]:
print(x)

One of the cool things about the Jupyter environment is that it automatically prints the value of an expression, so we can view the data stored in a variable simply by typing the variable name into a cell and running it:

In [None]:
x

We can use variables to hold paths to locations so that we don't have to type out the path every time. Run the cell below to assign your home directory to the `homeDir` variable. We chose the `homeDir` variable as a short representation of the phrase "home directory".

In [None]:
homeDir <- '/home/jovyan'

Use the next cell to print the content of the `homeDir` variable. Did you assign your variable successfully?

We can also use variables to store large amounts of data, such as assigning data within a matrix to a variable as you'll see in the next section.

**WARNING:** Never name your variable after a common function or a built-in variable in R. For a list of built-in R functions and variables, see "Appendix D Function and variable index" of the [R manual](https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf).  
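
To see why this matters, here is a minimal sketch using the built-in variable `T`, which R provides as shorthand for `TRUE`. The base R function `exists()` reports whether a name is already in use. Treat this as an illustration of the risk rather than something to do in your own work.

```r
# exists() reports whether a name is already defined in the current session.
print(exists("mean"))   # a built-in function
print(exists("T"))      # a built-in variable (shorthand for TRUE)

# R allows assignment to a built-in name, which is why it is risky:
T <- 0                  # T no longer means TRUE in this session
print(isTRUE(T))
rm(T)                   # remove our variable so the built-in T is visible again
print(isTRUE(T))
```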

<a class="anchor" id="file"></a>

## 1c. Loading and viewing a data file in R

To begin working with a data file in R, we first have to load (aka read in) the file. How you load the data file in R will depend on the file type. R has several built-in functions for reading in common file types, including .TXT files, .CSV (comma-separated values) files and .TSV (tab-separated-values) files.

In this tutorial, we are going to read in a .CSV file.

To do this, we will use an R function called `read.csv()`, which is specifically designed to handle data in the .CSV format. This function will create a data frame, which is a very common data structure in R. **Note:** A data frame is similar to a matrix, but each of its columns can hold a different type of data. 
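
The difference between a matrix and a data frame can be sketched with a tiny made-up example. The names `toyMatrix` and `toyDF` and their values are invented for illustration; `$` (covered later in this notebook) pulls out one column.

```r
# A matrix holds a single data type; mixing text and numbers turns everything into text.
toyMatrix <- matrix(c("geneA", "geneB", 10, 25), nrow = 2)
print(toyMatrix)
print(class(toyMatrix[, 2]))   # type of the second (count) column

# A data frame keeps a separate type for each column.
toyDF <- data.frame(gene = c("geneA", "geneB"), count = c(10, 25))
print(toyDF)
print(class(toyDF$count))      # type of the count column
```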

<a class="anchor" id="help"></a>
#### help()

Before reading in the .CSV file, we'll use the `help()` function in R to view all possible arguments for the `read.csv()` function. (Note: this is similar to the `--help` option that's used in most Unix commands). 
>**After you run the `help()` command:** The output from this command is long, so when you are done reading it, you can comment out this command by adding `#` at the beginning of the line, and rerun the cell to hide the output.

In [None]:
help(read.csv)

<a class="anchor" id="read"></a>
#### read.csv()

Notice that to run `read.csv()`, we have to provide the function with a set of arguments. These arguments give R information about the data in your .CSV file. Arguments are separated by a `,` (note: all arguments listed for the `read.table()` function can also be used in the `read.csv()` function). The values shown for the `read.csv()` arguments in the help output are the default settings, which are used for any argument you do not specify.

In this tutorial, we are going to provide `read.csv()` with the following arguments: 

* `example.csv` – the name of the file we want to read in
* `header=TRUE` – this argument specifies that the data starts on the second line of the file and the first line is a header, or column names
* `row.names=1` – this argument specifies that the data starts on the second column of the file and the first column contains row names
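
To see what `header=TRUE` and `row.names=1` do, the sketch below writes a tiny made-up CSV file to a temporary location and reads it back. The file contents and the names `toyPath` and `toyDF` are invented for this illustration and are unrelated to `example.csv`.

```r
# Write a tiny, made-up CSV file to a temporary location.
toyPath <- tempfile(fileext = ".csv")
writeLines(c("gene,sample1,sample2",
             "geneA,5,8",
             "geneB,0,3"), toyPath)

# header=TRUE uses the first line as column names;
# row.names=1 uses the first column (gene) as row names.
toyDF <- read.csv(toyPath, header = TRUE, row.names = 1)
print(toyDF)
print(rownames(toyDF))
print(colnames(toyDF))
```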

In the next cell, we will read in the `example.csv` file, and store the data frame in the variable `myDF`. We chose the variable `myDF` as a short representation of the phrase "my data frame".

In [None]:
myDF <- read.csv('example.csv', header=TRUE, row.names=1)

Since the `example.csv` file is in your home directory, which is also your current working directory, you did not need to specify a path to the file you read in. However, if the file you want to read in is located in a directory other than your working directory, you would have to define the path to the directory that holds your file. 

If the directory location of the file is held in a variable (as with your home directory in [section 1b](#variable)), you can use the function `file.path()` to construct the path to the file from the directory variable and the filename. 

For example, since you stored the path to your home directory in the `homeDir` variable, you could have also read in the `example.csv` file using the following syntax: 

`myDF <- read.csv(file.path(homeDir, 'example.csv'), header=TRUE, row.names=1)`
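
As a quick sketch, you can inspect the string that `file.path()` builds before passing it to `read.csv()`. The variable `examplePath` is a made-up name, and `file.exists()` is a base R function that reports whether anything is actually stored at that path.

```r
homeDir <- '/home/jovyan'
examplePath <- file.path(homeDir, 'example.csv')
print(examplePath)                # the directory and file name joined with "/"
print(file.exists(examplePath))   # TRUE only if the file is really there
```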

Use the next cell to view the data stored in the `myDF` variable.

How many columns are in the `myDF` data frame?

How many rows?

<a class="anchor" id="headR"></a>
#### head()

Although in this example there is a manageable amount of data in `myDF`, when working with a data frame containing RNA sequence data, viewing all the data may be unfeasible. Thus, similar to the `head` command in Unix, R also has a built-in function to view only a certain number of rows of a data frame or matrix, called `head()`. 

In the next cell, run `head()` and provide only one argument, the data frame variable.

How many lines are given with the `head()` command in R? Is this different from the number of lines given with the `head` command in Unix? 

We can also specify the number of lines we want to view, by providing `head()` with another argument, `n=3`:

In [None]:
head(myDF, n=3)

Use the next cell to print the first 8 rows of the `myDF` data frame.

<a class="anchor" id="dim"></a>
#### dim()

R also has a function called `dim()` that will allow you to print the dimensions of the data frame without having to print any lines. 

In the next cell, you will use the `dim()` function to report the *dimensions*, or number of rows and columns, of your `myDF` data frame. 

In [None]:
dim(myDF)
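
If you only need one of the two dimensions, the closely related base R functions `nrow()` and `ncol()` return them separately. The sketch below uses a made-up data frame called `toyDF`; run it and compare the three outputs.

```r
toyDF <- data.frame(sample1 = c(5, 0, 12, 7),
                    sample2 = c(8, 3, 0, 2),
                    sample3 = c(1, 1, 4, 9))
print(dim(toyDF))    # both dimensions at once
print(nrow(toyDF))   # number of rows only
print(ncol(toyDF))   # number of columns only
```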

Is the row or column dimension reported first?

How many rows does `myDF` have? How many columns? 

<a class="anchor" id="summary"></a>
#### summary()

To get more information about a loaded data frame without having to print the entire data frame, we can use the `summary()` function in R, which will print a mathematical summary of the data frame. 
Note: `summary()` can also be used on non-data frame objects to report all the data components contained in the object.
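
As a minimal sketch of both uses, the cell below calls `summary()` on a made-up data frame (`toyDF`, invented for illustration) and then on a single column extracted as a vector with `$`.

```r
toyDF <- data.frame(sample1 = c(5, 0, 12, 7),
                    sample2 = c(8, 3, 0, 2))
print(summary(toyDF))           # one block of statistics per column
print(summary(toyDF$sample1))   # the same statistics for a single numeric vector
```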

Run the `summary()` function in the next cell to view a mathematical summary of your `myDF` data frame.

Which column contains the highest median value in the data frame, `myDF`? Which column contains the highest overall value?

<a class="anchor" id="df"></a>

## 1d. Data frame manipulations

Much of the data we work with in bioinformatics is in the data frame or matrix format. For example, gene expression data is usually held in matrix format, with samples as columns and genes as rows. Each entry or cell in the matrix contains the expression of a particular gene in a particular sample. 

When analyzing gene expression data, it can be useful to be able to perform mathematical functions on all cells in a data frame, such as adding a value to all cells or taking the log of all cells. Fortunately, R makes that easy for us to do. 

Below are some examples of common mathematical manipulations we often perform on data frames in bioinformatics.  

<a class="anchor" id="dfadd"></a>
#### Add a value to all cells

In R, you can add, subtract, multiply, or divide the number in every cell of a data frame by a specific value very easily. Run the command in the next cell to add `1` to every value in your `myDF` data frame.

In [None]:
myDF + 1
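
The other arithmetic operators work the same way. Here is a short sketch on a made-up data frame (`toyDF` and its values are invented for illustration):

```r
toyDF <- data.frame(sample1 = c(5, 0, 12),
                    sample2 = c(8, 3, 0))
print(toyDF + 1)    # add 1 to every cell
print(toyDF * 2)    # multiply every cell by 2
print(toyDF / 10)   # divide every cell by 10
```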

Use the next cell to subtract 2 from all values in your `myDF` data frame.

<a class="anchor" id="dflog"></a>
#### Take the log of all cells

R also has a `log()` function that will allow you to take the log of all values in a data frame. By default, the `log()` function will calculate the natural log. However, you can specify which base you want to use as follows: 

`log10()` will compute the common logarithm (base 10)

`log2()` will compute the binary logarithm (base 2)

Run the next cell to compute the natural logarithm of all values in your `myDF` data frame.

In [None]:
log(myDF)
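
To get a feel for the different log functions, the sketch below applies them to single numbers whose logs are easy to check by hand. It also shows that the log of zero is `-Inf`, which is one reason a small value such as 1 is often added to count data before taking the log.

```r
print(log(exp(1)))        # natural log of e
print(log2(8))            # 2^3 = 8
print(log10(1000))        # 10^3 = 1000
print(log(8, base = 2))   # the base argument is another way to choose the base
print(log2(0))            # the log of zero is -Inf
print(log2(0 + 1))        # adding 1 first gives 0 instead
```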

Use the next cell to compute the binary logarithm of all values in your `myDF` data frame.

<a class="anchor" id="dfint"></a>
#### Convert data frame to contain only integers

Some bioinformatics applications, including the DESeq2 method we will use to perform differential gene expression, require that the input data contain only integers. There is a function in R called `ceiling()` that will round decimal values up to the nearest integer. 

Run the next cell to test the `ceiling()` function.

In [None]:
ceiling(1.2)
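
`ceiling()` also accepts several values at once. In this sketch, `c()` combines a few made-up numbers into a vector (the name `decimals` is invented for illustration); note that values that are already whole numbers are left unchanged.

```r
decimals <- c(1.2, 2.0, 3.7, 0.01)
print(ceiling(decimals))   # each value rounded up: 2 2 4 1
```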

Before we test this function on a data frame, we first have to create a data frame that contains decimal values. Note that although the `log()` function was used to print the natural logarithm of all values in your `myDF` data frame, those calculations were not saved in the `myDF` variable.

Use the next cell to print the data held in the `myDF` variable. Are the values indeed the same as what we started with?

To make changes to your `myDF` variable, the calculations must be assigned to `myDF`. In the following cell, you will subtract 0.3 from all values in your `myDF` data frame and assign the new values back to the `myDF` variable.

In [None]:
myDF <- myDF - 0.3

In the next cell, print the data held in the `myDF` variable. Have the values in `myDF` changed?

Note: If you want to keep a variable containing the original data in the data frame and also preserve the calculations performed, you can assign the calculations performed on `myDF` to a different variable as follows:
`myDFsub <- myDF - 0.3`

Now that your `myDF` data frame contains decimal values, use the `ceiling()` function to round all the values in `myDF` up to the nearest integer, and assign the new values back to the `myDF` variable.

In the next cell, print the data held in the `myDF` variable. How have the values in `myDF` changed?

<a class="anchor" id="dfcol"></a>
#### Slice a data frame column

When analyzing bioinformatics data, you may need to extract only one column from a data frame. To subset a data frame based on column names, we use the bracket `[` operator. This type of operation is also referred to as "slicing" the data frame. 

Run the cell below to slice `column1` of your `myDF` data frame.

In [None]:
myDF['column1']

**Challenge:** In the next cell, try subsetting `myDF` to "column2", but only view the first 3 rows of the output using `head()`.

<a class="anchor" id="dfrow"></a>
#### Slice a data frame row

To subset a data frame based on row names, we again use the bracket `[` operator, but we add a comma after indicating the row name, which lets R know that we are slicing along rows instead of columns.

Run the cell below to slice `row1` of your `myDF` data frame.

In [None]:
myDF['row1',]
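
Column and row slicing can be compared side by side on a small made-up data frame. The gene and sample names below are invented for illustration. The last line combines both kinds of slicing to pull out a single cell, with the row given before the comma and the column after it.

```r
toyDF <- data.frame(sample1 = c(5, 0, 12),
                    sample2 = c(8, 3, 0),
                    row.names = c("geneA", "geneB", "geneC"))
print(toyDF['sample2'])            # one column, all rows
print(toyDF['geneB', ])            # one row, all columns
print(toyDF['geneB', 'sample2'])   # a single cell: row first, then column
```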

<a class="anchor" id="dffilter"></a>
#### Filter data in a data frame 

When analyzing bioinformatics data, we often need to filter the data to reduce noise. A common filtering method is to remove rows that have all zero values. To do this, we will remove all rows whose values sum to zero using a function called `rowSums()`.

First let's calculate the sum of each row in your `myDF` data frame using the `rowSums()` function:

In [None]:
rowSums(myDF)

Next, we'll use the greater than mathematical operator, `>`, to identify which rows have sums greater than zero:

In [None]:
rowSums(myDF)>0

Finally, we can apply this output to subset the `myDF` data frame by removing all rows whose values sum to zero, using the same row slicing method we used above:

In [None]:
myDF[ rowSums(myDF)>0 , ]
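
The three steps above can be sketched on a made-up data frame that really does contain an all-zero row. The names `toyDF`, `keepRows` and `filteredDF` are invented for illustration. Storing the TRUE/FALSE result in its own variable makes it easier to see which rows will be kept.

```r
toyDF <- data.frame(sample1 = c(5, 0, 12),
                    sample2 = c(8, 0, 0),
                    row.names = c("geneA", "geneB", "geneC"))
keepRows <- rowSums(toyDF) > 0    # TRUE for rows to keep, FALSE for rows to drop
print(keepRows)
filteredDF <- toyDF[keepRows, ]   # only rows marked TRUE are returned
print(filteredDF)
```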

Use what you've just learned to remove all rows in `myDF` whose sum is less than 20, in the next cell. 

<a class="anchor" id="dfmore"></a>
#### Add columns to a data frame

When generating a table containing results from a bioinformatic analysis, it may be useful to add a column. To add a column to a data frame, we use the `[` bracket operator to name the new column, and then use the `<-` operator to assign a set of values to that column, one value per row. 

To create a set of values in R, we use the `c()` function, which combines all of its arguments into a vector (an ordered collection of values of the same type).

Run the next cell to add a 4th column to your `myDF` data frame, then print the revised data frame, `myDF`.

In [None]:
myDF['column4'] <- c(1,2,3,4,5,6,7,8,9,10)
myDF
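
Here is a minimal sketch of the same idea on a made-up data frame with 3 rows, so each new column is given 3 values. The base R functions `seq_len()` and `nrow()` are used in the second example to build the sequence 1, 2, ..., up to the number of rows without counting by hand; the column names are invented for illustration.

```r
toyDF <- data.frame(sample1 = c(5, 0, 12),
                    sample2 = c(8, 3, 0))
# Supply one value per row (here, 3 values for 3 rows).
toyDF['sample3'] <- c(1, 2, 3)
# seq_len(nrow(...)) builds 1, 2, ..., number of rows.
toyDF['rowNumber'] <- seq_len(nrow(toyDF))
print(toyDF)
```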

Use the next cell to add a 5th column to your `myDF` data frame, then print the revised data frame.

<a class="anchor" id="dfcombine"></a>
#### Combine data frames

Sometimes in bioinformatics, we have two (or more) data frames that we want to combine into one data frame. To do this, we can use the R function `cbind()`. `cbind()` requires at least two arguments: the names of the two data frames that need to be combined.

Let's duplicate `myDF` and then use `cbind()` to merge the original and duplicated data frames.

Run the following cell to create a copy of `myDF` in a variable called `myDF2`, then view the contents of the `myDF2` data frame.

In [None]:
myDF2 <- myDF
myDF2

Now that we have two data frames, `myDF` and `myDF2`, let's use `cbind()` to combine them into one data frame:

In [None]:
cbind(myDF, myDF2)
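
The sketch below combines two made-up data frames that describe the same three genes. Note that `cbind()` pairs rows up by their position, not by their names, so the data frames should list their rows in the same order. All names and values here are invented for illustration.

```r
countsDF <- data.frame(sample1 = c(5, 0, 12),
                       row.names = c("geneA", "geneB", "geneC"))
moreCountsDF <- data.frame(sample2 = c(8, 3, 0),
                           row.names = c("geneA", "geneB", "geneC"))
bothDF <- cbind(countsDF, moreCountsDF)
print(bothDF)
```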

Now, instead of merely printing the combined data frame, use the next cell to create a variable called `combinedDF` that holds the combined data frame. We suggest `combinedDF` as a short representation of the phrase "combined data frame".

What are the dimensions of the `combinedDF` data frame? Hint: use the `dim()` function in the cell below.

Did `cbind()` merge these data frames on the row dimension or the column dimension? Why do you think that is? Hint: you can take a look at the `cbind()` documentation using the `help()` function.

<a class="anchor" id="export"></a>

## 1e. Export data from R

<a class="anchor" id="write"></a>

#### write.csv()

Thus far, we have manipulated data frame variables in R, but the altered data is only stored in memory until you export it. How you export data in R will depend on the file type to which you want to export the data. Similar to loading data in R, R has several built-in functions for exporting common file types, including .TXT files, .CSV (comma-separated values) files and .TSV (tab-separated-values) files.

In this tutorial, we will export the data as a .CSV file. To do this, we will use the `write.csv()` function to write a data frame out to a file. The following arguments are needed to execute the `write.csv()` function: 

* the data frame we want to write out 
* the file name we want to write to

In the next cell, you will use `write.csv()` to export your `combinedDF` data frame to a file called `combinedDF.csv`.

In [None]:
write.csv(combinedDF, 'combinedDF.csv')

List all files in your current directory using the `list.files()` function:

In [None]:
list.files()
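
One way to check an export, sketched below, is to read the file back in and compare it with the original. This example writes a made-up data frame to a temporary file created with the base R function `tempfile()`; the names `toyDF`, `toyPath` and `roundTrip` are invented for illustration.

```r
toyDF <- data.frame(sample1 = c(5, 0, 12),
                    sample2 = c(8, 3, 0),
                    row.names = c("geneA", "geneB", "geneC"))
toyPath <- tempfile(fileext = ".csv")
write.csv(toyDF, toyPath)    # row names are written as the first column
print(basename(toyPath) %in% list.files(tempdir()))

# Read the file back, treating the first column as row names again.
roundTrip <- read.csv(toyPath, header = TRUE, row.names = 1)
print(roundTrip)
```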

Do you see your `combinedDF.csv` file in your current directory?

Challenge: Use the next few cells to export your `myDF2` data frame as a .CSV file called `myDF2.csv` but this time, specify your home directory as the path you want to write your file to (hint: use the `homeDir` variable you created previously). Then list all files in your home directory. Did you successfully export the `myDF2.csv` file?

<a class="anchor" id="viz"></a>

## 1f. Visualizations

<a class="anchor" id="plot"></a>

#### plot()

R has a basic built-in function called `plot()` that can draw many common plot types. 

At its most basic, the function call is: `plot(x,y)`, where x and y are numeric vectors containing the (x,y) points for the plot.
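
As a minimal sketch, here is that basic call with two short made-up vectors (`xValues` and `yValues` are invented for illustration). The two vectors should have the same length, because each x value is paired with the y value in the same position.

```r
xValues <- c(1, 2, 3, 4, 5)
yValues <- c(2, 4, 6, 8, 10)
plot(xValues, yValues)   # draws one point for each (x, y) pair
```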

Let's call `plot()` with the following parameters: 

* x = the values from myDF2, column 1
* y = the values from myDF2, column 4


We saw before that we can use the following syntax to extract just one column from a data frame: 
`myDF2['column1']`.  But recall that this produces a subset data frame, not a vector of numbers. 

If we need just the values from a column as a vector of numbers, we can use the following syntax: 
`myDF2$column1`
> Note: The `$` operator extracts a single column by its name (the column title).
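
The difference between the two kinds of extraction can be checked with `class()`. This sketch uses a made-up data frame (`toyDF`, invented for illustration):

```r
toyDF <- data.frame(sample1 = c(5, 0, 12),
                    sample2 = c(8, 3, 0))
print(class(toyDF['sample1']))   # bracket slicing keeps the data frame structure
print(class(toyDF$sample1))      # $ returns the bare values as a vector
print(mean(toyDF$sample1))       # a numeric vector can go straight into functions like mean()
```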

Use this syntax to fill in the x and y vectors in the function call below: 


In [None]:
plot()

Let's see if we can make this plot more interesting. Let's look at the parameters for `plot()`:

In [None]:
help(plot)

In the next few cells, recreate this plot but pass different values to the following 2 parameters: `type` and `col`. 

**Challenge:** Can you create a plot with both points and lines, colored purple? 