# Installation

* Created an `Rlang-env` environment in Anaconda with the `r-base` and `r-essentials` packages
  * Followed [these](https://docs.anaconda.com/anaconda/navigator/tutorials/create-r-environment/) and [these](https://docs.anaconda.com/anaconda/user-guide/tasks/using-r-language/) instructions

* Installed RStudio from the Anaconda homepage with the `Rlang-env` environment active

# RStudio Main Interface

<img src="https://datascienceplus.com/wp-content/uploads/2019/02/r-studio.png" width="500">

1. Menu Bar: runs along the top
  * Session tab: where you can restart, interrupt, or terminate R if it is behaving strangely (similar to kernel operations in a Jupyter notebook)
  * Tools tab: where you can install new packages, link to Github, set up aesthetics

2. Source Quadrant: aka "Script Editor Panel"
  * The main panel
  * Where you store R commands (code) as either a history of what you did or to rerun later
    * Similar to the combination of all cells in a Jupyter notebook

3. Console Quadrant: located in the bottom left corner of main interface upon initial installation, but can be rearranged from the "workspace Panes" icon along the top
  * Can type and execute commands in this space and they will be saved in the "History" tab of the Environment Panel
  * This is where the code typed into the "Source" panel is evaluated by R

4. Environment Quadrant: located in the upper right corner upon initial installation, but again can be rearranged
  * Can view global variables for the environment with the "Environment" tab
  * Can view previously run commands with the "History" tab

5. Files/Plots/Packages/Help Quadrant: bottom right corner
  * "Files" tab: can see the current working directory and its files
  * "Plots" tab: any plots generated by the code will display here
  * "Packages" tab: view all installed packages and update if needed
  * "Help" tab: view documentation for packages and functions

[HotKey Shortcuts for RStudio](https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts)

# R Packages

3 Main Repositories:
* CRAN - main repository (~12000 packages)
  * Installation Method 1:
    * Type `install.packages("package_name")` in the Console panel
    * To install multiple packages at once, `install.packages(c("package_a", "package_b", "package_c"))`
  * Installation Method 2:
    * Use the "Tools" menu at the top of RStudio and select "Install Packages..."
    * Select the source repository to install from with the dropdown menu at the top of the screen that pops up
    * Type the package name(s) in the middle bar on that screen
* GitHub - popular, open source
  * Installation Method:
    * Make sure the "devtools" package is installed. If not, `install.packages("devtools")` in Console panel
    * Type `library(devtools)` in Console panel
    * Type `install_github("author_name/package_name")` in Console panel
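* Bioconductor - mainly focused on bioinformatics packages
  * Installation Method:
    * Make sure the "BiocManager" package is installed. If not, `install.packages("BiocManager")` in the Console panel
    * Type `BiocManager::install("package_name")` in the Console panel
      * Older tutorials use `source("https://bioconductor.org/biocLite.R")` followed by `biocLite("package_name")`; that installer has been retired in favor of BiocManager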

Once the package is installed, it has to be **loaded** into R/RStudio
* Type `library(package_name)` into Console Panel
  * No `" "` around the package name
* Or, the RStudio interface can be used to load packages from the "Packages" tab in the Files/Plots/Packages/Help Panel
  * Check the box to the left of the package name to load
* **Note**: There is an order to loading packages - some packages require other packages to be loaded first (dependencies). That package’s manual/help pages will help you out in finding that order, if they are picky.


To find a package that might be useful for a project:
* Use "Task View" on the CRAN site to group packages by 35 specific themes
* Go to RDocumentation, a search engine for packages across the three main repositories

To view available functions within a package, use `help(package="package_name")`. For extended help files (vignettes), use `browseVignettes("package_name")` and a new browser tab will open with links to the vignettes.

# R Projects

# Connect with GitHub

1. Link GitHub and RStudio
  * In RStudio, go to Tools > Global Options > Git/SVN
  * In the Git/SVN option window, click “Create RSA Key” and when this completes, click “Close” for the new window that pops up.
  * Following this, in that same Git/SVN window, click “View public key” and copy the string of numbers and letters that appear in the pop up, then Close this window.
  * Go to Github > Account > Settings > SSH and GPG Keys. Click "New SSH Key"
  * Paste in the public key you have copied from RStudio into the Key box and give it a Title related to RStudio. Confirm the addition of the key with your GitHub password.
  * GitHub and RStudio are now linked!

2. Link R Project with GitHub Repository
  * Create a new repository on GitHub and copy the URL
  * In RStudio, go to File > New Project. 
  * Select Version Control and Select Git as your version control software. 
  * Paste in the repository URL from before, select the location where you would like the project stored. When done, click on “Create Project”.
    * This will initialize a new project, linked to the GitHub repository, and open a new session of RStudio.

3. Committing and Pushing Changes in RStudio to GitHub
  * Save the file you are working on
  * Go to the Git tab of the environment quadrant (you should see your file you just saved/created) 
  * Click the checkbox under “Staged” to stage your file.
  * Click “Commit”
    * A new window should open that lists all of the changed files from earlier, and below that shows the differences in the staged files from previous versions. 
    * In the upper quadrant of the “Commit message” box, write yourself a commit message. 
    * Click Commit and Close the window.
  * Click the "Push" icon in the upper right corner of the screen to push the file changes to the GitHub repository specified in step 2
  * Go to your GitHub repository and see that the commit has been recorded.


# Entering Input

* Code entered is called an "expression"

* Assignment operator: `<-`
  * assigns a variable name to some data
  * `x <- 26`

* When a variable is called, the number inside brackets to the left of the output indicates the index position of what is being returned from the variable vector
  * Input: `x`
  * Output: `[1] 26`

* Can use the colon operator (`:`) to generate a sequence of integers (inclusive of both endpoints)
  * Input: `x <- 1:10`, then `x`
  * Output: `[1]  1  2  3  4  5  6  7  8  9 10`

# Data Types

1. Objects
  * everything in R is an object
  * 5 Basic "Atomic" Classes:
    * Character - string
    * Numeric - real numbers (stored as double-precision floating point)
    * Integer - whole numbers
      * Specify a number as an integer by adding the `L` suffix. EX: `1L`
    * Complex - complex numbers with real and imaginary parts
      * EX: `1+4i`
    * Logical - `TRUE`/`FALSE` (boolean)
  * Most basic object is a vector
    * Vectors can only contain objects of the same class
      * Except for lists, which can contain mixed classes
    * Empty vectors can be created with the `vector()` function

2. Attributes
  * R Objects can have **attributes**
    * names, dimnames
    * dimensions - think matrices and arrays
    * class - one of the basic atomic classes listed above
    * length - number of elements in vector
    * user-defined - user can set for the object
  * Can be accessed with the `attributes()` function
    * Can also be used to set/modify object attributes

3. Vectors
  * The `c()` function can be used to concatenate multiple values or vectors together
    * `x <- c(1, 5, 7, 9)`
  * The `vector()` function can also be used to create vectors
    * EX: `vec <- vector("numeric", length=16)`
      * Output: `[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0`
  * When creating vectors with mixed data types, the elements of the vector will be _coerced_ to the most general data type present (logical < integer < numeric < complex < character), so that every element can still be represented
  * Can explicitly coerce a data type into a specified datatype
    * `as.numeric(variable_name)`
    * `as.logical(variable_name)`
    * If the coercion is nonsensical, the result is `NA` values along with a warning
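
A minimal sketch of implicit and explicit coercion, using made-up variable names (`mixed_num_chr`, `bad`, and so on are illustrative only):

```r
# Implicit coercion: c() converts every element to the most general type present
mixed_num_chr <- c(1.7, "a")    # numeric + character -> character
mixed_lgl_num <- c(TRUE, 2)     # logical + numeric   -> numeric (TRUE becomes 1)
mixed_chr_lgl <- c("a", TRUE)   # character + logical -> character

class(mixed_num_chr)
class(mixed_lgl_num)
class(mixed_chr_lgl)

# Explicit coercion with the as.*() functions
x <- 0:3
class(x)          # "integer"
as.numeric(x)
as.logical(x)     # 0 becomes FALSE, every other number becomes TRUE
as.character(x)

# A coercion that makes no sense produces NA (R also emits a warning,
# which is silenced here so the example runs quietly)
bad <- suppressWarnings(as.numeric(c("1", "two", "3")))
bad
```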

4. Lists
  * Can contain different classes of data (different data types)
    * `x <- list(1, "a", FALSE, 1+6i)`
  * The output is indexed by double brackets
    * `[[2]]`
    * `[1] "a"`

5. Matrices
  * Special type of vector with a dimension attribute for the number of rows and columns
    * `m <- matrix(1:6, nrow=2, ncol=3)`
    * Constructed column-wise; so each column is filled before moving on to filling the next column
  * Can transform a vector into a matrix
    * Declare a vector: `m <- 1:10`
    * Assign dimensions: `dim(m) <- c(2,5)`
    * Now `m` is a 2x5 matrix of the numbers from 1 to 10
  * Can form matrix with "cbind'ing" - column binding
    * `x <- 1:3`
    * `y <- 10:12`
    * `cbind(x,y)`
    * Resulting matrix is 3x2 where the first column is the numbers in x and the second column is the numbers in y
  * Can also form matrix with "rbind'ing" - row binding
    * `rbind(x, y)`
    * Resulting matrix is 2x3 where the first row is the numbers in x and the second row is the numbers in y
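
The three ways of building a matrix described above can be tried side by side. This is only a small sketch; the names `v`, `by_col`, and `by_row` are illustrative:

```r
# 1. matrix(): values fill column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
m
dim(m)

# 2. Assign a dim attribute to an existing vector
v <- 1:10
dim(v) <- c(2, 5)
v

# 3. Bind vectors together as columns or as rows
x <- 1:3
y <- 10:12
by_col <- cbind(x, y)   # 3 rows, 2 columns
by_row <- rbind(x, y)   # 2 rows, 3 columns
by_col
by_row
```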

6. Factors
  * Vectors used to represent categorical data
    * Ordered - think ranks; small/medium/large
    * Unordered - male/female
  * Need special consideration when used with modeling functions; ex. `lm()` and `glm()`
  * Can be created with the "factor" function
    * `f <- factor(c('yes', 'yes', 'no', 'yes', 'no'))`
  * Use the "table" function to get a frequency count of the unique levels (values) in the factor
    * `table(f)`
  * Can use the "unclass" function to convert to integer elements
    * `unclass(f)`
  * Can set the hierarchy of the levels in the factor with the "levels" argument in the factor() function
    * `f <- factor(c('yes', 'yes', 'no', 'yes', 'no'), levels = c('yes', 'no'))`
      * Now "yes" is the baseline level
      * The default hierarchy is alphabetical order
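
A short sketch of the factor operations above. The `sizes` example is an added illustration of an ordered factor and is not part of the original notes:

```r
f <- factor(c("yes", "yes", "no", "yes", "no"))
levels(f)     # alphabetical by default: "no" comes before "yes"
table(f)      # frequency count per level
unclass(f)    # underlying integer codes plus a "levels" attribute

# Set the level order explicitly so that "yes" is the baseline
f2 <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
levels(f2)
as.integer(f2)

# Ordered factor: the levels have a rank, so comparisons are meaningful
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)
sizes[1] < sizes[2]   # is "small" ranked below "large"?
```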

7. Missing values
  * `is.nan()` tests for `NaN` ("not a number"), the result of undefined mathematical operations such as `0/0`
  * `is.na()` tests for missing values (`NA`) of any type
    * `NA` values also have classes; there is an integer `NA`, character `NA`, etc.
      * `NaN` is also an `NA`, but the converse is not true
        * `x <- c(1, 2, NA, 4)`
        * `is.na(x)` $\to$ `FALSE FALSE TRUE FALSE`
        * `is.nan(x)` $\to$ `FALSE FALSE FALSE FALSE`
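
To see the difference between the two tests, it helps to include an actual `NaN` in the vector. A minimal sketch (the vector `y` is an added illustration):

```r
x <- c(1, 2, NA, 4)
is.na(x)
is.nan(x)

# 0/0 is undefined, so R returns NaN
y <- c(1, NaN, NA, 0/0)
is.na(y)    # NaN counts as missing, so three of the four are TRUE
is.nan(y)   # only the two NaN values are TRUE; the plain NA is not

# NA values carry a class
class(NA)             # "logical"
class(NA_integer_)    # "integer"
class(NA_character_)  # "character"

# Many summary functions can skip missing values on request
mean(x)                # NA, because one element is missing
mean(x, na.rm = TRUE)  # mean of 1, 2 and 4
```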

8. Data Frames
  * Used to store tabular data
  * Can be thought of as a special type of list, where each element in the list has the same length
    * Each element in the list can be thought of as a column; and the length of each element is the number of rows
  * Can store different classes of data in each column
  * Has special attribute `row.names`
  * Usually created with `read.table()` or `read.csv()`
    * Can also use the `data.frame(col1, col2, col3, ...)` function
  * Can be converted to a matrix with `data.matrix()`
    * Be careful of coercion
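
A small sketch of building a data frame by hand and converting it to a matrix. The column names and values are made up for illustration:

```r
df <- data.frame(
  id     = 1:3,
  name   = c("a", "b", "c"),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

nrow(df)
ncol(df)
sapply(df, class)   # each column keeps its own class
row.names(df)

# data.matrix() forces everything to numbers: logicals become 0/1 and
# character columns are turned into factor codes
num_mat <- data.matrix(df)
num_mat

# as.matrix() on mixed columns falls back to a character matrix instead
chr_mat <- as.matrix(df)
chr_mat
```

Either conversion loses information about the original column types, which is the coercion to watch out for.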

9. Names Attribute
  * Can give a name to each element in the vector
    * `x <- 1:3`
    * `names(x) <- c('A', 'B', 'C')`
    * Now element 1 is named 'A', element 2 is named 'B', and element 3 is named 'C'
  * Can do this too when creating a list
    * `x <- list(a=1, b=2, c=3)`
  * Can name matrices with a similar convention
    * `dimnames(m) <- list(c(row_names), c(col_names))`

  


# Reading Tabular Data

1. Function: `read.table()`
  * file - name of file or connection
  * header - logical indicating if the file has a header line
  * sep - string indicating how the columns are separated
    * Default is `""`, which means any whitespace (one or more spaces, tabs, or newlines)
  * colClasses - character vector indicating the class of each column
  * nrows - number of rows in the dataset
  * comment.char - character string indicating the comment character
    * Lines beginning with the specified comment character will be skipped
    * default is `#`
  * skip - the number of lines to skip from the beginning
  * stringsAsFactors - should character variables be coded as factors? Default is FALSE since R 4.0.0 (it was TRUE in earlier versions)

2. Function: `read.csv()`
  * Same as `read.table()` except that the default `sep` argument value is the comma (",")
  * Also sets `header = TRUE` by default

3. Tips for Reading in Larger Datasets
  * If there are no commented lines, set `comment.char = ""`
  * Use the `colClasses` argument to speed up load time
    * Boilerplate for figuring out and setting `colClasses`
      * `initial <- read.table("datatable.txt", nrows=100)`
      * `classes <- sapply(initial, class)`
      * `tabAll <- read.table("datatable.txt", colClasses=classes)`
  * Setting nrows with a mild overestimate doesn't necessarily make R run faster, but it will help with memory usage
  * Can use functions `dump()` and `dput()` to output data from R
    * Contains the metadata and data in a textual format
    * `y <- data.frame(a=1, b="a")`
    * `dput(y, file="y.R")`
      * writes the dataframe to the file named y.R
      * Use `dget("y.R")` to reconstruct the dataframe from the file
      * `dput()` can only be used on a single R object
    * Use `dump()` function for multiple R objects
      * `x <- "moose"`
      * `y <- data.frame(a=1, b="a")`
      * `dump(c("x", "y"), file="data.R")`
        * Use `source("file_name")` to retrieve the objects in the file
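
The tips above can be tried end to end on a throwaway file. This is a sketch under assumptions: the data, the temporary file paths, and the names `sample_data`, `tab_all`, and `restored` are all invented for the example.

```r
# Write a small tab-separated file to a temporary location
path <- tempfile(fileext = ".txt")
sample_data <- data.frame(
  id    = 1:5,
  score = c(2.5, 3.1, 4.8, 1.2, 3.3),
  group = c("a", "b", "a", "b", "a")
)
write.table(sample_data, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read a few rows to discover the column classes, then read the whole file
initial <- read.table(path, header = TRUE, sep = "\t", nrows = 3)
classes <- sapply(initial, class)
tab_all <- read.table(path, header = TRUE, sep = "\t",
                      colClasses = classes, comment.char = "")

# dput()/dget(): round-trip a single object through a text file
obj_path <- tempfile(fileext = ".R")
dput(tab_all, file = obj_path)
restored <- dget(obj_path)

# dump()/source(): round-trip several objects by name
x <- "moose"
dump_path <- tempfile(fileext = ".R")
dump(c("x", "tab_all"), file = dump_path)
rm(x)
source(dump_path, local = TRUE)   # recreates x and tab_all
x
```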

4. Interface to Outside World
  * `file` function
    * `file(description="file_name", open="r", blocking=TRUE, encoding=getOption("encoding"))`
      * Read only = "r"
      * Writing (and initializing new file) = "w"
      * Appending = "a"
      * Binary modes (the distinction matters mainly on Windows): "rb", "wb", "ab"
  * `url` function
    * `con <- url("web_address", "r")`
    * `x <- readLines(con)`
      * read lines from a website
    * `head(x)`
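
A minimal sketch of the three text modes of `file()`, using a temporary file so it runs without network access (the path and the name `text_lines` are illustrative):

```r
path <- tempfile(fileext = ".txt")

# "w": create the file (or overwrite it) and write to it
con <- file(path, open = "w")
writeLines(c("first line", "second line"), con)
close(con)

# "a": append to the existing contents
con <- file(path, open = "a")
writeLines("third line", con)
close(con)

# "r": read the contents back
con <- file(path, open = "r")
text_lines <- readLines(con)
close(con)

text_lines
```

Closing each connection with `close()` releases the file handle; `url()` connections are read the same way with `readLines()`.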



# Subsetting

* Single Square Bracket (`[ ]`)
  * Always returns an object of the same class as the original
    * Except when subsetting a single element, row/col from a matrix. Then it returns a vector of length 1 (for single element) or of length row/col (if subsetting single row or column)
  * Can use to select more than one element
    * Use integers within the brackets to denote index positions (starts at 1)
      * Can also be a range (eg. 1:4)
    * Can use logicals within the brackets (eg. x > 3)
* Double Square Bracket (`[[ ]]`)
  * Used to extract elements of a list or dataframe
  * Can only extract a single element and it may not be the same class as the original
* Dollar Sign (`$`)
  * Used to extract elements from a list/dataframe by name
  * Can only extract a single element

* Subsetting Lists
  * `x <- list(foo=1:4, bar=0.6, baz="hello")`
  * `x[1]` $\to$ list of values 1 to 4
  * `x[[1]]` $\to$ series of values 1 to 4
  * `x$bar` = `x[["bar"]] `$\to$ the number 0.6
  * `x["bar"]` $\to$ list with 0.6 as only element
  * `x[c(1,3)]` $\to$ returns a list of the values 1 to 4 and the word "hello"

  * For nested lists:
    * `x <- list(a=list(1, 2, 3, 4), b=c(5,6))`
    * `x[[c(1,3)]]` = `x[[1]][[3]]` $\to$ returns the third element of the first element $\to$ the number 3
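
A short sketch contrasting the three list-subsetting operators. The `nested` list and the `name` variable are added for illustration:

```r
x <- list(foo = 1:4, bar = 0.6, baz = "hello")

class(x[1])     # "list": single brackets keep the container
class(x[[1]])   # "integer": double brackets pull out the contents
x$bar
x[["bar"]]
x[c(1, 3)]      # list holding foo and baz

# [[ ]] accepts a computed name; $ only accepts a literal name
name <- "foo"
x[[name]]

# Nested list: an index vector inside [[ ]] drills down one level at a time
nested <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
nested[[c(1, 3)]]   # third element of the first element
nested[[1]][[3]]    # same thing, written step by step
nested[[c(2, 1)]]   # first element of the second element
```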

* Subsetting Matrices
  * Use single bracket notation [row, col]
    * Leave row or col value blank if you want everything instead of a particular element in the matrix
  * Returning a single element from the matrix results in an output of a vector of length 1, instead of a 1x1 matrix
    * Same is true if subsetting single row or column $\to$ get a vector with length equal to the length of the row/col
    * can turn off this default behavior by setting the `drop` argument in the single bracket matrix subsetting call equal to FALSE
      * `x[1,2, drop=FALSE]` $\to$ returns a 1x1 matrix
        * Same for single row/col
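
The effect of `drop` is easiest to see by checking `dim()` on each result. A minimal sketch with an illustrative 2x3 matrix:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

m[1, 2]                  # single element, returned as a plain vector
m[1, ]                   # first row as a vector of length 3
m[, 2]                   # second column as a vector of length 2

dim(m[1, 2])             # NULL: the dimensions were dropped
dim(m[1, 2, drop = FALSE])   # 1 1
dim(m[1, , drop = FALSE])    # 1 3
dim(m[, 2, drop = FALSE])    # 2 1
```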

* Partial Matching
  * `x <- list(aardvark=1:5)`
  * `x$a` = `x[["a", exact=FALSE]]` $\to$ returns series of values 1 to 5

* Removing Missing Values
  * Use logical with single bracket subsetting
    * `x <- c(1,2,NA,4,NA)`
    * `bad <- is.na(x)`
    * `x[!bad]`
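  * Subsetting multiple vectors with NA values (the vectors must have the same length)
    * `y <- c("a", NA, "c", NA, "e")`
    * `good <- complete.cases(x,y)` $\to$ `TRUE` only at positions where neither `x` nor `y` is `NA`
    * `x[good]` and `y[good]`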
  * Remove rows of dataframe with NA values
    * `good <- complete.cases(df)`
    * `df[good, ]`
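
A runnable sketch of the three patterns above. The vectors and the data frame `df` are invented for the example:

```r
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", NA, NA, "d", NA, "f")

# One vector: keep the elements that are not NA
x[!is.na(x)]

# Two vectors of equal length: keep positions where both are present
good <- complete.cases(x, y)
good
x[good]
y[good]

# Data frame: keep only the rows without any NA
df <- data.frame(x = x, y = y)
clean <- df[complete.cases(df), ]
clean
```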
    
     


# Vectorized Operations

* Adding/subtracting/multiplying/dividing two vectors is performed elementwise
  * `x <- 1:3` and `y <- 4:6`
  * `x + y` $\to$ `5 7 9`
  * This is true for matrices as well: `*` multiplies element by element, so for true matrix multiplication you have to use the `%*%` operator
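
A brief sketch of elementwise versus matrix multiplication; the matrices `a` and `b` are illustrative:

```r
x <- 1:3
y <- 4:6
x + y     # 5 7 9
x * y     # 4 10 18
x > 2     # comparisons are vectorized too

a <- matrix(1:4, nrow = 2, ncol = 2)
b <- matrix(rep(10, 4), nrow = 2, ncol = 2)

elementwise <- a * b     # each entry of a times the matching entry of b
product     <- a %*% b   # rows of a combined with columns of b
elementwise
product
```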

# Control Flow

1. If-Else

```r
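x <- 7  # example value to test

if (x > 10) {
  # Do something
  print("x is greater than 10")
} else if (x > 5) {
  # Do something different
  print("x is greater than 5 but not greater than 10")
} else {
  # Do something more different
  print("x is 5 or less")
}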

```

2. For Loop

```r
# All of these loops are equivalent 
x <- c("a", "b", "c", "d")

for(i in 1:4) {
  print(x[i])
}

for(i in seq_along(x)) {
  print(x[i])
}

for(letter in x) {
  print(letter)
}

for(i in 1:4) print(x[i])
```

```r
# Nested for loops
m <- matrix(1:6, 2, 3)

for(i in seq_len(nrow(m))) {
  for(j in seq_len(ncol(m))) {
    print(m[i,j])
  }
}
```

3. While Loop

```r
# Standard while loop
count <- 0
while(count<10) {
  print(count)
  count <- count + 1
}

# While loop with conditionals
z <- 5
while(z>=3 && z<=10) { # checks conditionals left to right
  print(z)
  coin <- rbinom(1,1,0.5)

  if(coin == 1) {
    z <- z + 1
  } else {
    z <- z - 1
  }
}
```

4. Repeat, Next, Return, and Break
  * `repeat` initiates an infinite loop; the only way out is an explicit `break` (or `return()` inside a function)
  * `next` skips the rest of the current iteration and moves on to the next one
    * Set up as a conditional inside the loop
      * `if(i<=20) next # skips the first 20 iterations in a loop`
  * `return()` exits the enclosing function immediately (which also ends any loop it is in) and returns a value
  * `break` stops/exits the loop
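
A minimal sketch of the four keywords; the function `first_negative` and the variable names are hypothetical:

```r
# next: skip odd numbers, keep even ones
evens <- integer(0)
for (i in 1:10) {
  if (i %% 2 != 0) next
  evens <- c(evens, i)
}
evens

# repeat + break: keep doubling until the value passes 100
n <- 1
repeat {
  n <- n * 2
  if (n > 100) break
}
n

# return: leave the function (and its loop) as soon as a match is found
first_negative <- function(values) {
  for (v in values) {
    if (v < 0) return(v)
  }
  NA   # reached only if no negative value exists
}
first_negative(c(3, -1, -5))
first_negative(c(1, 2))
```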


# Writing Functions

* Write functions in R script file and then source into R
* Functions will always return the last expression
* Exhibit lazy evaluation in that arguments are only evaluated as needed
* The `...` argument
  * Used to indicate a variable number of other arguments that are usually passed on to other functions
  * Use when extending another function and don't want to copy the entire arguments list of the original function
  * Generic functions use `...` so that extra arguments can be passed to methods

```r
# General syntax for functions
func_name <- function(arg1, arg2, arg3) {
  # Do something
  # Do something
  # Do something
}
```

```r
# Example function: returns the mean of each column in a matrix

column_means <- function(m) {

  # determine number of columns
  nc <- ncol(m)

  # initialize empty numeric vector
  means <- numeric(nc)

  # loop through each column in matrix m
  for(i in seq_len(nc)) {

    # calculate the mean and add to means vector at index i
    means[i] <- mean(m[, i], na.rm=TRUE)
  }

  # return means vector
  means
}

```
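
The points about default values, lazy evaluation, and `...` can be illustrated with three tiny functions. All function names here (`power`, `lazy`, `mean_with_options`) are hypothetical:

```r
# Default argument: exponent is 2 unless the caller overrides it
power <- function(x, exponent = 2) {
  x^exponent
}
power(3)
power(2, 3)

# Lazy evaluation: b is never used, so omitting it causes no error
lazy <- function(a, b) {
  a * 2
}
lazy(5)

# ...: forward any extra arguments to mean() without listing them
mean_with_options <- function(x, ...) {
  mean(x, ...)
}
mean_with_options(c(1, NA, 3))                 # NA
mean_with_options(c(1, NA, 3), na.rm = TRUE)   # na.rm is passed through
```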



# Dates and Times

* Dates are stored as objects of the `Date` class (internally, the number of days since 1970-01-01)
* Times can be stored as one of two classes
  * `POSIXct` - a single number (seconds since 1970-01-01); use when storing times in a dataframe
  * `POSIXlt` - list; stores other info like day of week, month, year
* `strptime()` function is similar to Python's `datetime.strptime()`: it parses a character string into a date-time using a format string
  * `strptime("date_as_characters", "%formatting_strings")`
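
A short sketch of the date and time classes. The dates are arbitrary, and `tz = "UTC"` is set explicitly as an assumption so the results do not depend on the local time zone:

```r
# Date: stored as days since 1970-01-01
d <- as.Date("1970-01-02")
class(d)
unclass(d)   # 1, i.e. one day after the origin

# Date arithmetic (2012 was a leap year, so February had 29 days)
gap <- as.Date("2012-03-01") - as.Date("2012-02-28")
gap

# POSIXct: a single number of seconds since the origin
ct <- as.POSIXct("2012-10-25 06:00:00", tz = "UTC")
unclass(ct)

# POSIXlt: a list with named components
lt <- as.POSIXlt(ct)
lt$hour
lt$year + 1900   # years are stored as an offset from 1900

# strptime(): parse text with an explicit format, returns POSIXlt
parsed <- strptime("2012-01-10 10:40", "%Y-%m-%d %H:%M", tz = "UTC")
class(parsed)
parsed$min
```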

# Logical Expressions

1. Double equals (`==`)
  * Equality operator
  * Tests whether the two operands are equal, element by element; returns a logical (boolean) value for each comparison

2. Inequality Operators (`>, <, >=, <=`)
  * Evaluates for truthiness; returns boolean

3. Not-Equal Operator (`!=`)
  * Evaluates if expressions are unequal; returns boolean
  * Can just use the not operator (`!`) to reverse the logical expression
    * Ex. `!(5==7)` $\to$ `TRUE`

4. "And" Operator (`&`/`&&`)
  * Evaluates to `TRUE` only if both the left and the right operands are `TRUE`
  * The single ampersand (`&`) is vectorized: it compares the operands element by element, recycling the shorter one
    * Ex. `TRUE & c(TRUE, FALSE, FALSE)` $\to$ `TRUE FALSE FALSE`
  * The double ampersand (`&&`) works on single values and short-circuits (the right operand is only evaluated if the left one is `TRUE`); use it in `if` and `while` conditions
    * Ex. `TRUE && FALSE` $\to$ `FALSE`
    * Older versions of R silently used only the first element of a longer vector, so `TRUE && c(TRUE, FALSE, FALSE)` returned `TRUE`; since R 4.3.0 this is an error
  * **In a chain of logical operators, all "and" operators are evaluated before any "or" operators**

5. "Or" Operator (`|, ||`)
  * Evaluates to `TRUE` if either the left or the right operand is `TRUE`
  * Similar to the "and" operator in that the single bar (`|`) is vectorized and compares element by element, while the double bar (`||`) works on single values and short-circuits (the right operand is only evaluated if the left one is `FALSE`)

6. Logical Functions (`isTRUE()`, `identical()`, `xor()`, `which()`, `any()`, `all()`)
  * `isTRUE()` returns TRUE only if its argument evaluates to a single logical `TRUE`; otherwise returns FALSE
    * Ex. `isTRUE(6 > 4)` $\to$ `TRUE`, but `isTRUE(c(TRUE, TRUE))` $\to$ `FALSE`
  * `identical()` evaluates if the two objects in the parentheses are the same and then returns TRUE; otherwise FALSE
    * Ex. `identical('twins', 'twins')` $\to$ `TRUE`
    * Can pass objects or expressions
  * `xor()` is the exclusive or: returns TRUE if exactly one of the two passed arguments is TRUE; otherwise returns FALSE
    * Will return FALSE if both arguments are TRUE
  * `which()` takes in a logical vector as an argument and returns the indices of the TRUE elements
  * `any()` returns TRUE if at least one element of a logical vector is TRUE
  * `all()` returns TRUE only if every element of a logical vector is TRUE
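
A quick sketch that exercises the operators and functions in this section. The values are arbitrary, and `&&` is only applied to single values so the example behaves the same across R versions:

```r
5 == 7
!(5 == 7)
5 != 7

# Vectorized forms compare element by element
TRUE & c(TRUE, FALSE, FALSE)
FALSE | c(TRUE, FALSE, FALSE)

# Scalar forms, as used in if/while conditions
x <- 4
x > 3 && x < 10
x < 3 || x > 10

# "and" binds tighter than "or"
TRUE | FALSE & FALSE      # read as TRUE | (FALSE & FALSE)
(TRUE | FALSE) & FALSE

isTRUE(6 > 4)
identical(c(1, 2), c(1, 2))
xor(TRUE, FALSE)
xor(TRUE, TRUE)

v <- c(3, 9, 12)
which(v > 8)   # positions of the TRUE elements
any(v > 8)
all(v > 8)
```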


# Looping Functions

1. `lapply()` - loop over a list and apply a function at each element
  * Arguments: list object, function, and ... (additional args to be passed into the specified function)
  * Always returns a list
2. `sapply()` - same as `lapply()`, but it tries to simplify the output if possible
  * If the result is a list where every element is of length 1, then it converts result to a vector
  * If the result is a list where every element is of the same length (>1), then result is converted to a matrix
  * If the result cannot be simplified, a list is returned
3. `apply()` - apply a function over the margins of an array
  * Most often used to apply functions over the rows or columns of a matrix
    * It can be used with general arrays too
  * Not necessarily faster than writing a loop, but it is completed in one line
  * Arguments: array, margin, function, ... (again, additional args to pass into specified function)
    * array - array/matrix
    * margin - integer vector indicating which margins (dimension of the array/matrix; rows=1, cols=2) should be retained
  * Could also use `rowSums(), rowMeans(), colSums(), colMeans()` if they apply to the task

4. `tapply()` - apply a function over subsets of a vector
  * Arguments: vector of data, vector of group indicator, function
5. `mapply()` - multivariate version of `lapply()`
  * Applies a function in parallel over elements in multiple lists
    * Arguments: function, lists for function to operate on, args for function, boolean for whether to simplify result 

6. `split()` - not a looping function, but can be useful for splitting data into groups to prepare it for a looping function
  * Arguments: vector/list/dataframe, vector of group indicator, boolean for if empty vector levels should be dropped
  * Always returns a list
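
One small example per function, as a sketch. The data (`scores`, `values`, `groups`) are invented for illustration:

```r
scores <- list(a = 1:5, b = c(10, 20, 30))

by_list   <- lapply(scores, mean)    # always a list
by_vector <- sapply(scores, mean)    # simplified to a named vector
by_matrix <- sapply(scores, range)   # each result has length 2, so a matrix

m <- matrix(1:6, nrow = 2, ncol = 3)
row_totals <- apply(m, 1, sum)   # margin 1 = rows
col_totals <- apply(m, 2, sum)   # margin 2 = columns

values <- c(1, 2, 3, 4, 5, 6)
groups <- c("x", "x", "y", "y", "y", "z")
group_means <- tapply(values, groups, mean)

# mapply walks several inputs in parallel: rep(1, 3), rep(2, 2), rep(3, 1)
repeated <- mapply(rep, 1:3, 3:1)

# split only groups the data; a looping function can then be applied
pieces <- split(values, groups)
sapply(pieces, sum)
```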

# Plotting

[Tutorial](http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html)

[Cheat Sheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)