# R: A Quick Overview

R is a widely used programming language with many resources available. If you're looking to do some data wrangling in R, chances are someone before you has already had to do the same thing, and maybe even written a package for those specific operations, so it's usually worth looking around a bit before trying to implement your own function.

 - **Caveat**: Beware external package installations in R! Updates or version changes required by a particular package can end up breaking your entire conda environment and lead you to bang your head against a wall endlessly while you attempt to rectify the issue, before ultimately starting over and re-installing Anaconda. A backup plan is to keep a saved conda environment .yaml file that you can fall back on if you end up in this unenviable state.

<div>
<img src="attachment:image.png" width="550"/>
</div>


While here we're working within a notebook on TSCC, I would also recommend that you get [RStudio](https://www.rstudio.com/) if R becomes a large part of your data analysis, instead of, say, Python. It tends to be a more user-friendly interface and has some great capabilities for plotting and debugging. RStudio also provides several [cheatsheets](https://www.rstudio.com/resources/cheatsheets) for some of the most commonly used functionality. Additionally, when you know the name of a function you can directly access its documentation by entering: ?functionOfInterest.

The allure of R lies primarily in both its statistical packages and its plotting options for visualizing data. Higher-level visualizations are usually performed using [ggplot2](http://ggplot2.org/), which has an enormous range of possibilities. You can find some examples in the [R Graph Gallery](http://www.r-graph-gallery.com/).

There are six types of data structures in R, out of which we'll look at the main 4 (the other two are rarely used). Knowing about these structures will help you get a better understanding of the language which in turn improves your programming skills.

**Notes:** R has some annoying tendencies relative to other programming languages (depending on where your background is primarily). The two that to me are the most frustrating are initialization and indexing.

 1) *Initialization*: In R, values are primarily assigned with a left-pointing arrow <- as opposed to an = sign. You can usually get away with the = sign, though.<br>
 2) *Indexing*: Indexing in R starts at 1 and not 0! This can be extremely annoying if you are used to the classical 0-based indexing of many other programming languages.
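
The two points above can be seen in a minimal sketch. The variable names here (a, b, v) are made up for illustration and are not used elsewhere in this notebook.

```r
# Assignment: both forms create the same binding at the top level.
a <- 10
b = 10
identical(a, b)      # TRUE

# Indexing starts at 1.
v <- c("x", "y", "z")
v[1]                 # "x"
v[length(v)]         # "z" (the last element)
v[0]                 # character(0): index 0 selects nothing
```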

# Getting Started

Okay, similar to Python, let's start off with numbers and some basic operations:

In [1]:
# R can be used to perform simple operations:

# addition
5 + 3
# subtraction
10 - 3
# multiplication
5 * 4
# division
14 / 7
# raise a number to a power
2^8
# take a root
sqrt(4)
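
A few more arithmetic operators come up often enough to be worth a quick sketch. The variable names below are only for illustration.

```r
# integer division and remainder (modulo)
quotient <- 17 %/% 5     # 3
remainder <- 17 %% 5     # 2

# other roots can be written as fractional powers
cube.root <- 27^(1/3)

# rounding and absolute value
rounded <- round(3.14159, 2)   # 3.14
magnitude <- abs(-4.2)         # 4.2
```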


What about our irrational numbers?

In [2]:
euler = exp(1)
euler

In [3]:
pie = pi
pie
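
Because irrational numbers are stored as finite-precision doubles, exact comparisons can surprise you. The sketch below assumes nothing beyond base R; root.two is an illustrative name.

```r
# show more digits than the default
print(pi, digits = 10)              # 3.141592654

root.two <- sqrt(2)
exact.match <- root.two^2 == 2                  # FALSE, because of rounding error
close.match <- isTRUE(all.equal(root.two^2, 2)) # TRUE, compares within a tolerance
```

When comparing computed doubles, all.equal (wrapped in isTRUE) is usually the safer choice than ==.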

# Data Structure: Vector, Atomic

Atomic vectors can only contain one type of data: logical, integer, double, character

In [8]:
my.vector <- c(1, 2, 3)
my.vector2 <- c(1:3)
# atomic vectors are one dimensional
print(my.vector)

# they take only one type of data
typeof(my.vector2)
my.char.vector <- c('a', 'b', 'c')
typeof(my.char.vector)

length(my.vector)

[1] 1 2 3
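
Since an atomic vector holds only one type, R silently converts mixed input to the most flexible type present. A small sketch of that coercion (the name mixed is illustrative):

```r
# numbers and logicals are converted to strings when a string is present
mixed <- c(1, "a", TRUE)
mixed                          # "1" "a" "TRUE"
mixed.type <- typeof(mixed)    # "character"

# integer + double becomes double; logical + integer becomes integer
num.type <- typeof(c(1L, 2.5))     # "double"
int.type <- typeof(c(TRUE, 2L))    # "integer"
```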


**Question:** Get the first element from my.vector

In [15]:
#YOUR CODE HERE!

**Wait why are there dots in these names?**

These "." can have a wide range of applications and meanings, but usually they are used for variable naming in a way similar to the underscores in Python. This tends to be a style thing again, but at other times the "." separates the name of a generic function from the class it handles (for example print.data.frame), which is how R picks the right method for an object. For now you can just interpret it as part of the name.

If you're curious, though, the following two links should be helpful: 1) https://stats.stackexchange.com/questions/10712/what-is-the-meaning-of-the-dot-in-r 2) https://stackoverflow.com/questions/7526467/what-does-the-dot-mean-in-r-personal-preference-naming-convention-or-more
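
As a rough sketch of the case where the dot is more than a naming style: R's S3 system looks for a function named generic.class. The generic describe and its methods below are hypothetical names invented for this example.

```r
# a generic function that dispatches on the class of its argument
describe <- function(x, ...) UseMethod("describe")

# methods are named <generic>.<class>
describe.numeric <- function(x, ...) "a numeric vector"
describe.default <- function(x, ...) "something else"

numeric.result <- describe(c(1.5, 2))   # "a numeric vector"
default.result <- describe("hello")     # "something else"
```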



Conveniently, we can apply an operation to every element of an atomic vector at once. For example:

In [18]:
# vectorized operations
my.vector + 3
my.vector + my.vector2
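
When the two vectors differ in length, R recycles the shorter one. A minimal sketch, using made-up vectors short.vec and long.vec:

```r
short.vec <- c(10, 100)
long.vec <- c(1, 2, 3, 4)

# short.vec is reused as 10, 100, 10, 100
recycled <- long.vec * short.vec   # 10 200 30 400

# adding a single number is the same mechanism with a length-1 vector
shifted <- long.vec + 3            # 4 5 6 7
```

If the longer length is not a multiple of the shorter one, R still recycles but emits a warning, which is usually a sign of a bug.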

# Data Structure: Vector, Lists

Okay, what if we want a flexible structure like our Python list? Let's look at the R equivalent:

In [19]:
my.list <- list(1:3, 'a', c(TRUE, FALSE))

my.list

We can also give our individual list elements named keys, similar to the python dictionary:

In [20]:
my.named.list <- list(one = 1:3, two = 'a', three = c(TRUE, FALSE))
my.named.list

and we can use these names to grab elements.

In [24]:
my.named.list$one
my.named.list[['two']]
my.named.list[[3]]

R Lists are made up of atomic vectors or other lists

In [25]:
typeof(my.named.list)
typeof(my.named.list$one)
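
One detail worth a quick sketch is the difference between single and double brackets on a list: single brackets return a smaller list, double brackets return the element itself. The list is re-created here so the example stands alone; sub.list and element are illustrative names.

```r
my.named.list <- list(one = 1:3, two = 'a', three = c(TRUE, FALSE))

sub.list <- my.named.list['one']     # still a list, of length 1
element <- my.named.list[['one']]    # the integer vector itself

typeof(sub.list)       # "list"
typeof(element)        # "integer"
names(my.named.list)   # "one" "two" "three"
```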


We can also put lists inside lists:

In [27]:
long.list <- list(first = 1, second = list(1,2))
long.list

typeof(long.list$second)

In [38]:
long.list$second

# Data Structure: Attributes

Attributes are not as important in the beginning but good to know about. They are used to store metadata.

In [39]:
# Names
x <- c(a = 1, b = 2, c = 3)
x
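
Names are just one kind of attribute. The sketch below reads them back and attaches a custom one; the "units" attribute is a hypothetical example, not something R defines.

```r
x <- c(a = 1, b = 2, c = 3)

names(x)                    # "a" "b" "c"
attr(x, "units") <- "cm"    # attach arbitrary metadata (hypothetical attribute)
all.attributes <- attributes(x)   # a list holding both names and units

# names can also be used for lookup
b.value <- x[["b"]]         # 2
```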

Factors are used to store categorical data (such as sex information). R encodes factors as integers so that under the hood any strings in your categorical vector are actually represented as integers. A helpful explanation of this can be found here: https://monashbioinformaticsplatform.github.io/2015-09-28-rbioinformatics-intro-r/01-supp-factors.html

In [41]:
# Factors
sex.char <- c('m', 'm', 'f')
sex.factor <- factor(sex.char, levels = c('m', 'f'))
sex.factor
table(sex.factor)

sex.factor
m f 
2 1 
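
To see the integer encoding directly, here is a small sketch. The second factor, number.factor, is a made-up example of a common pitfall when a factor holds numbers stored as text.

```r
sex.factor <- factor(c('m', 'm', 'f'), levels = c('m', 'f'))
levels(sex.factor)                       # "m" "f"
sex.codes <- as.integer(sex.factor)      # 1 1 2 (positions in the levels)
sex.labels <- as.character(sex.factor)   # "m" "m" "f"

# pitfall: converting a factor of numbers gives the codes, not the values
number.factor <- factor(c('10', '20', '10'))
codes <- as.numeric(number.factor)                  # 1 2 1
values <- as.numeric(as.character(number.factor))   # 10 20 10
```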

# Data Structure: Matrices / Arrays

These multi-dimensional data structures can only hold one type of data (usually numeric). A Matrix is a sub-category of Arrays and only has two dimensions while Arrays can have more.

In [12]:
my.matrix <- matrix(1:6, ncol = 3, nrow = 2)
my.matrix
typeof(my.matrix)
# dimensions are shown as: rows columns
dim(my.matrix)

| | [,1] | [,2] | [,3] |
|---|---|---|---|
| [1,] | 1 | 3 | 5 |
| [2,] | 2 | 4 | 6 |


In [13]:
# accessing matrix row
my.matrix[1,]

# accessing matrix column
my.matrix[,1]
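
Two details of matrices are easy to miss: values are filled column by column unless you ask otherwise, and pulling out a single row or column drops the result down to a plain vector. A sketch, with by.row and row.as.matrix as illustrative names:

```r
my.matrix <- matrix(1:6, nrow = 2, ncol = 3)
first.row <- my.matrix[1, ]           # 1 3 5 (filled column by column)

# byrow = TRUE fills row by row instead
by.row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by.row.first <- by.row[1, ]           # 1 2 3

# drop = FALSE keeps the result as a 1 x 3 matrix
row.as.matrix <- my.matrix[1, , drop = FALSE]
dim(row.as.matrix)                    # 1 3

# a single cell: row 2, column 3
cell <- my.matrix[2, 3]               # 6
```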

In [52]:
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)

# Take these vectors as input to the array.
array1 <- array(c(vector1,vector2),dim = c(3,3,2))
array1[1,,]
array1[,1,]
array1[,,1]

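| | [,1] | [,2] |
|---|---|---|
| [1,] | 5 | 5 |
| [2,] | 10 | 10 |
| [3,] | 13 | 13 |


| | [,1] | [,2] |
|---|---|---|
| [1,] | 5 | 5 |
| [2,] | 9 | 9 |
| [3,] | 3 | 3 |


| | [,1] | [,2] | [,3] |
|---|---|---|---|
| [1,] | 5 | 10 | 13 |
| [2,] | 9 | 11 | 14 |
| [3,] | 3 | 12 | 15 |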
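
The two layers above are identical because the nine input values are recycled to fill all 3 x 3 x 2 = 18 cells. A short sketch that checks this:

```r
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)
array1 <- array(c(vector1, vector2), dim = c(3, 3, 2))

n.input <- length(c(vector1, vector2))   # 9 values supplied
n.cells <- length(array1)                # 18 cells to fill

# the second layer is a repeat of the first
same.layers <- identical(array1[, , 1], array1[, , 2])   # TRUE

# a single cell: row 2, column 3, layer 1
one.cell <- array1[2, 3, 1]              # 14
```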


# Data Structure: Data Frame

One of the most commonly used data structures. This is your pandas equivalent structure in R.

In [53]:
my.df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))
my.df

| x | y |
|---|---|
| <int> | <fct> |
| 1 | a |
| 2 | b |
| 3 | c |


Under the hood data frames are lists and can be accessed as such

In [54]:
typeof(my.df)
my.df$x
my.df[['y']]
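
Because a data frame is a list of equal-length columns, the usual list functions work on it. A minimal sketch:

```r
my.df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))

is.list(my.df)               # TRUE
n.columns <- length(my.df)   # 2: the length of the list is the number of columns
names(my.df)                 # "x" "y"

# str gives a compact overview of every column and its type
str(my.df)
```

The type reported for column y depends on your R version: older versions convert strings to factors by default, newer ones keep them as character.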

But they also have matrix-like properties: they possess rows and columns

In [62]:
dim(my.df)

my.df[,1]
my.df[1,]

| x | y |
|---|---|
| <int> | <fct> |
| 1 | a |


This mixed nature makes data frames flexible. Additionally, two data frames can be combined if the dimensions match up: cbind needs the same number of rows, and rbind needs the same columns.

In [63]:
my.other.df <- data.frame(x = 5:7, y = c('x', 'y', 'z'))

#combine by columns --> stack together left to right
my.col.combined.df <- cbind(my.df, my.other.df)
my.col.combined.df

#combine by rows --> stack together top to bottom
my.row.comb.df <- rbind(my.df, my.other.df)
my.row.comb.df

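| x | y | x | y |
|---|---|---|---|
| <int> | <fct> | <int> | <fct> |
| 1 | a | 5 | x |
| 2 | b | 6 | y |
| 3 | c | 7 | z |


| x | y |
|---|---|
| <int> | <fct> |
| 1 | a |
| 2 | b |
| 3 | c |
| 5 | x |
| 6 | y |
| 7 | z |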
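
Note that combining by columns leaves you with duplicated column names, which makes name-based access ambiguous. One possible way around this, sketched below, is to rename before combining; the names x2 and y2 are arbitrary choices for illustration.

```r
my.df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))
my.other.df <- data.frame(x = 5:7, y = c('x', 'y', 'z'))

combined <- cbind(my.df, my.other.df)
names(combined)             # "x" "y" "x" "y"
first.x <- combined$x       # 1 2 3: only the first match is returned

# renaming first keeps every column reachable by name
names(my.other.df) <- c('x2', 'y2')
renamed <- cbind(my.df, my.other.df)
names(renamed)              # "x" "y" "x2" "y2"
```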


Subsetting (these should look familiar based on above operations): 

In [67]:
# subset operators: [, [[, $
my.df[1,]
my.df$x
my.df[[1]]

| x | y |
|---|---|
| <int> | <fct> |
| 1 | a |
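
The three operators differ in what they return, which matters when you pass the result on to another function. A small sketch of the return types:

```r
my.df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))

class(my.df['x'])                   # "data.frame": single brackets keep the structure
class(my.df[['x']])                 # "integer": double brackets return the column
class(my.df$x)                      # "integer": same as [[ ]] with a name
class(my.df[, 'x'])                 # "integer": a single column is simplified
class(my.df[, 'x', drop = FALSE])   # "data.frame": drop = FALSE prevents that

kept <- my.df[, 'x', drop = FALSE]
```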


In [85]:
# subset types
# 1. using positive integers
my.col.combined.df[c(1,3),]

# 2. using negative integers (omitting these parts)
my.col.combined.df[c(-1, -3),]

# 3. Logical Vectors (careful with usage here)
my.col.combined.df[c(TRUE, FALSE, TRUE, TRUE),] #Note the extra TRUE gives us another row filled with NAs
my.col.combined.df[,c(TRUE, FALSE, FALSE, TRUE)]

dim(my.col.combined.df)


| | x | y | x | y |
|---|---|---|---|---|
| | <int> | <fct> | <int> | <fct> |
| 1 | 1 | a | 5 | x |
| 3 | 3 | c | 7 | z |


| | x | y | x | y |
|---|---|---|---|---|
| | <int> | <fct> | <int> | <fct> |
| 2 | 2 | b | 6 | y |


| | x | y | x | y |
|---|---|---|---|---|
| | <int> | <fct> | <int> | <fct> |
| 1 | 1 | a | 5 | x |
| 3 | 3 | c | 7 | z |
| NA | NA | NA | NA | NA |


| x | y |
|---|---|
| <int> | <fct> |
| 1 | x |
| 2 | y |
| 3 | z |
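
The opposite case also deserves care: a logical vector that is too short is recycled rather than rejected. A sketch with a made-up data frame, small.df:

```r
small.df <- data.frame(id = 1:4, value = c(10, 20, 30, 40))

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE
every.other <- small.df[c(TRUE, FALSE), ]
every.other$id                 # 1 3

# safest is a logical vector with exactly one entry per row
keep <- c(FALSE, TRUE, TRUE, FALSE)
length(keep) == nrow(small.df) # TRUE
middle <- small.df[keep, ]
middle$id                      # 2 3
```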


In [94]:
# subset types continued
# 4. Nothing: empty brackets return the original object (used especially with matrices, arrays and data frames)
my.col.combined.df[]

# 5. Zero: returns a zero-length result. Mainly used in generating test data.
#    On a numeric vector the output is numeric(0); on a data frame, [0] keeps the rows but drops every column
dim(my.col.combined.df[0])

# 6. Character vectors, if names are present
my.df['x']

| x | y | x | y |
|---|---|---|---|
| <int> | <fct> | <int> | <fct> |
| 1 | a | 5 | x |
| 2 | b | 6 | y |
| 3 | c | 7 | z |


| x |
|---|
| <int> |
| 1 |
| 2 |
| 3 |


One of the most widely used subsetting types is logical subsetting: given a condition, the elements that satisfy it are extracted.

In [95]:
# we want to extract the rows where column x is greater than or equal to 2
my.df
my.df[my.df$x >= 2, ]
# What actually happens is that your condition creates a logical vector, which is then used for the extraction
my.df$x >= 2

| x | y |
|---|---|
| <int> | <fct> |
| 1 | a |
| 2 | b |
| 3 | c |


| | x | y |
|---|---|---|
| | <int> | <fct> |
| 2 | 2 | b |
| 3 | 3 | c |


In [113]:
# it is also possible to provide multiple conditions
my.df[my.df$x >= 2 & my.df$y == 'c',]

| | x | y |
|---|---|---|
| | <int> | <fct> |
| 3 | 3 | c |
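
Conditions can be combined in other ways too. The sketch below shows "or", set membership and negation on the same data frame; the result names are illustrative.

```r
my.df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))

# | means "or": rows where x is 1 or y is 'c'
either <- my.df[my.df$x == 1 | my.df$y == 'c', ]
either$x                 # 1 3

# %in% tests membership in a set of values
in.set <- my.df[my.df$y %in% c('a', 'b'), ]
in.set$x                 # 1 2

# ! negates a condition
not.b <- my.df[!(my.df$y == 'b'), ]
not.b$x                  # 1 3
```

Use the single & and | here: they work element by element, whereas && and || expect a single TRUE or FALSE.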


In [114]:
# if we have a regular vector we can use the command 'which' to do exactly the same
some.vector <- c(11:20)
some.vector[which(some.vector >= 17)]
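
On a vector without missing values, which gives the same result as using the logical vector directly. The difference shows up with NA, as in this sketch (with.na is a made-up vector):

```r
with.na <- c(11, NA, 20)

# a logical index keeps the NA, because NA >= 17 is itself NA
direct <- with.na[with.na >= 17]            # NA 20

# which returns only the positions that are TRUE, so the NA is dropped
positions <- which(with.na >= 17)           # 3
via.which <- with.na[positions]             # 20
```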

# Loops and 'if - else'

Loops are used when an operation has to be repeated several times. There are two main types of loops: for-loops and while-loops. We briefly touched on for-loops in our intro to Python walkthrough during list comprehensions.

In [124]:
# for loops are used if a definitive end is in sight and you know the number of times an operation has to be performed
# here I want to know what the value is for each element in my.vector
for(i in seq_along(my.vector)){
    print(paste('Iteration:', i))
    print(paste('Value at position', i, 'is:', my.vector[i]))
}
# while loops are used if the total number of iterations is not clear from the beginning.
# here we keep drawing random numbers until their sum exceeds 3 (or we pass 20 iterations)
my.condition <- TRUE
my.counter <- 0
while(my.condition){
    my.counter <- my.counter + 1
    my.sum <- sum(runif(5)) #runif(5) generates 5 uniform random numbers on [0, 1]
    print(paste(my.sum))
    if(my.sum > 3){ # test the sum we just printed, not a fresh random draw
        my.condition <- FALSE
    }
    if(my.counter > 20){
        print('max iter reached')
        break
    }
}

[1] "Iteration: 1"
[1] "Value at position 1 is: 1"
[1] "Iteration: 2"
[1] "Value at position 2 is: 2"
[1] "Iteration: 3"
[1] "Value at position 3 is: 3"
[1] "2.2967026217375"
[1] "2.05857916292734"
[1] "3.31857038335875"
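
Two habits are worth sketching for loops over vectors. Building the index with 1:length(x) misbehaves on an empty vector, and many simple loops can be replaced by an apply-style function. The names empty and squares are illustrative.

```r
empty <- c()
risky <- 1:length(empty)     # 1 0: a loop over this would run twice
safe <- seq_along(empty)     # integer(0): a loop over this runs zero times

# vapply applies a function to each element and checks the result type
my.vector <- c(1, 2, 3)
squares <- vapply(my.vector, function(v) v^2, numeric(1))   # 1 4 9
```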


In [116]:
# if-else statements check whether or not a certain condition holds true and react accordingly
if(length(my.vector) == 3){
    print('expected')
} else{
    print('this is odd')
}

[1] "expected"
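
A plain if only looks at a single TRUE or FALSE. For a whole vector of conditions, ifelse does the same check element by element, and several cases can be chained with else if. A sketch, where classify is a hypothetical helper written for this example:

```r
# vectorized version: one answer per element
values <- c(1, 5, 10)
sizes <- ifelse(values > 3, 'big', 'small')   # "small" "big" "big"

# chaining several conditions
classify <- function(n) {
    if (n < 0) {
        'negative'
    } else if (n == 0) {
        'zero'
    } else {
        'positive'
    }
}
classify(-2)   # "negative"
classify(0)    # "zero"
classify(7)    # "positive"
```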
