This first part in the Introduction to R series will introduce you to using the R interpreter (which works very similarly to this Jupyter notebook), data types, and parts of the R syntax.

## Vectors, arithmetic operations and functions
Arithmetic operators work just as you learnt to use them on a calculator, including precedence of operators and brackets. Some examples are below.

In [15]:
100 / 2 - 2 ^ 3 + 5 * 2

In [16]:
100 / (2 - 2 ^ 3) + 5 * 2
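
The two cells above are shown without their results, so here is a minimal sketch that works through the same expressions step by step. The variable names `r1` and `r2` are only illustrative and do not appear elsewhere in this notebook.

```r
# ^ is evaluated first, then * and /, then + and -
r1 <- 100 / 2 - 2 ^ 3 + 5 * 2     # 50 - 8 + 10 = 52

# Brackets are evaluated before everything else
r2 <- 100 / (2 - 2 ^ 3) + 5 * 2   # 100 / (-6) + 10

print(r1)
print(r2)
```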

If you want to perform the same operation on more than one number, vectors are there to help. In R, a vector is an ordered collection of elements of the same type (typically numeric values, but there are character, complex, logical and integer vectors too); even a single number is a vector of length one. Vectors can be built "manually" using the concatenation function c(), or they can be read from files.
To store a vector in a variable (called x below), assign it to x with the <- assignment operator.

In [2]:
x <- c(1, 1, 2, 3, 5, 8, 13)

In [3]:
print(x)

[1]  1  1  2  3  5  8 13
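
To make the "same type" rule more concrete, the sketch below uses the base R function `class()` to inspect a few vectors. The variable name `mixed` is only illustrative.

```r
# class() reports the type of the elements stored in a vector
class(c(1.5, 2, 3))     # "numeric"
class(c(1L, 2L, 3L))    # "integer" (the L suffix marks whole numbers)
class(c("a", "b"))      # "character"
class(c(TRUE, FALSE))   # "logical"

# Mixing types inside c() converts every element to one common type,
# here character strings
mixed <- c(1, "two", TRUE)
print(mixed)
class(mixed)            # "character"
```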


In [21]:
x * 2 - 1

Operations can be carried out between vectors too:

In [12]:
y <- c(3, 1, 4, 1, 5, 9, 2)

In [25]:
x + y
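
As a self-contained sketch of the arithmetic above, the block below repeats it on two vectors with the same values as x and y and writes the results as comments. The names `a`, `b`, `scaled` and `total` are hypothetical.

```r
# Same values as x and y above, under illustrative names
a <- c(1, 1, 2, 3, 5, 8, 13)
b <- c(3, 1, 4, 1, 5, 9, 2)

# Arithmetic with a single number is applied to every element
scaled <- a * 2 - 1    # 1 1 3 5 9 15 25

# Arithmetic between two vectors of equal length works element by element
total <- a + b         # 4 2 6 4 10 17 15

print(scaled)
print(total)
```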

There are other ways of defining a vector. Numerical sequences can be generated using the seq() function. These can come in handy when timepoints have to be defined to examine a physical process over some period.  
In the form used here, seq() takes two numbers as arguments: the lower and upper bounds of an interval. The R interpreter then generates a sequence of numbers between the two limits using a step size of 1. If a third number is provided, it is used as the step size.

In [8]:
t <- seq(0, 1)
print(t)

[1] 0 1


In [10]:
t <- seq(0, 1, 0.1)
print(t)

 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
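
seq() also accepts named arguments, which can make the intent easier to read. The sketch below uses the `by` and `length.out` arguments of base R's seq(); the variable names are illustrative.

```r
# Step size given by name rather than by position
quarters <- seq(0, 1, by = 0.25)           # 0.00 0.25 0.50 0.75 1.00

# Ask for a fixed number of equally spaced points instead of a step size
five_points <- seq(0, 1, length.out = 5)   # the same five values

# The colon operator is shorthand for a sequence with a step size of 1
days <- 1:7                                # 1 2 3 4 5 6 7

print(quarters)
print(five_points)
print(days)
```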


As mentioned before, vectors can contain logical values too (i.e. TRUE or FALSE). These values can represent states (e.g. a patient having type 2 diabetes) or can be a result of evaluating a logical expression, as shown below. We want to find the negative values in the difference between vectors x and y (defined earlier).

In [14]:
print(x - y)
negatives <- (x - y) < 0
print(negatives)

[1] -2  0 -2  2  0 -1 11
[1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
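
Once we have a logical vector, we can also count or locate the TRUE values. The sketch below uses the base R functions `sum()` and `which()` on the same differences as above; the names `diffs` and `is_negative` are illustrative.

```r
# Same values as x - y above
diffs <- c(-2, 0, -2, 2, 0, -1, 11)
is_negative <- diffs < 0

# TRUE counts as 1 and FALSE as 0, so sum() counts the negative values
n_negative <- sum(is_negative)             # 3

# which() returns the positions of the TRUE elements
negative_positions <- which(is_negative)   # 1 3 6

print(n_negative)
print(negative_positions)
```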


R has two special values, NaN (not a number) and NA (not available). Neither refers to a specific number: NaN represents an undefined result (e.g. 0/0 evaluates to NaN), while NA represents a missing value.

In [15]:
x <- c(1, 2, 3, 0/0)
print(x)

[1]   1   2   3 NaN
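
To check for these values, base R provides `is.na()` and `is.nan()`. The short sketch below shows how they differ; the vector `v` is illustrative.

```r
v <- c(1, NA, 0/0)

# is.na() is TRUE for both missing (NA) and undefined (NaN) values
missing_or_undefined <- is.na(v)   # FALSE TRUE TRUE

# is.nan() is TRUE only for NaN
undefined_only <- is.nan(v)        # FALSE FALSE TRUE

print(missing_or_undefined)
print(undefined_only)
```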


The R interpreter does exactly what it is told, so operations on a vector that contains NaN or NA values will usually return NaN or NA as a result. R has to be told explicitly to skip these values, which many functions allow through the na.rm argument. The example below shows that the average of the numbers in vector x cannot be determined if we also consider the NaN element. Once we remove it, we get the average of 1, 2 and 3.

In [16]:
mean(x)

In [17]:
mean(x, na.rm=TRUE)
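
The na.rm argument is accepted by several other summary functions in base R as well. The sketch below shows a few of them on an illustrative vector `v` that contains a missing value.

```r
v <- c(1, 2, 3, NA)

with_na    <- mean(v)                 # NA: the missing value propagates
without_na <- mean(v, na.rm = TRUE)   # 2
total      <- sum(v, na.rm = TRUE)    # 6
largest    <- max(v, na.rm = TRUE)    # 3

print(with_na)
print(without_na)
print(total)
print(largest)
```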

It is sometimes necessary to refer to a __specific element__ or a __portion__ of a vector as part of a subset calculation. This can be done by indexing. R allows indexing of vectors using square brackets ([]) that take numbers or a logical expression as input. For instance, the following line will return the second element of vector x:

In [18]:
x[2]

The following line returns all elements from element 2 to element 3 (so just elements 2 and 3 in this case):

In [19]:
x[2:3]

Indexing can also be used to __exclude__ elements from a vector by prefixing the index term with a minus sign.

In [20]:
x[-2]
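
The index does not have to be a single number or a range. As a small sketch, the block below selects and excludes several positions at once by passing a vector built with c(); the vector `v` is illustrative.

```r
v <- c(10, 20, 30, 40, 50)

# Positions are counted from 1, so v[1] is the first element
first <- v[1]                # 10

# A vector of positions selects several elements at once
picked <- v[c(1, 4)]         # 10 40

# A negated vector of positions excludes several elements at once
remaining <- v[-c(1, 2)]     # 30 40 50

print(first)
print(picked)
print(remaining)
```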

A more flexible way to refer to vector elements is by constructing a logical expression. Let's say we want to keep only the numeric elements that are greater than 1. It may be tempting to try this:

In [21]:
x[x > 1]

... but the result will contain an NA element too, which we didn't want: comparing NaN with 1 gives NA, and indexing with NA returns NA. So we have to "filter" it out by combining the greater-than condition with is.na(), a function that returns TRUE for NA and NaN elements, and negating its result. Since we want the two conditions to be true simultaneously, we use the & operator, which corresponds to logical __and__. Logical __or__ is written as | and logical __negation__ as !.

In [22]:
x[!is.na(x) & x > 1]

In [43]:
x[!is.na(x) & (x == 1 | x == 3)]
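
For reference, the sketch below repeats the three selections on an illustrative vector `v` with the same values as x and writes the results as comments.

```r
v <- c(1, 2, 3, NaN)

# The comparison yields NA for the NaN element, and indexing with NA returns NA
with_na <- v[v > 1]                          # 2 3 NA

# Requiring !is.na(v) as well removes the unwanted element
greater_than_one <- v[!is.na(v) & v > 1]     # 2 3

# | combines alternatives: keep elements equal to 1 or equal to 3
one_or_three <- v[!is.na(v) & (v == 1 | v == 3)]   # 1 3

print(with_na)
print(greater_than_one)
print(one_or_three)
```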

In addition to arithmetic and logical operations, several mathematical functions are available in R: min, max, log, exp, sin, cos, tan all work in the usual way.

In [23]:
print(y)
max(y)

[1] 3 1 4 1 5 9 2


There are two special functions that can tell more about a vector. length() provides the number of elements in a vector, while range() returns the minimum and maximum values of a vector - corresponding to ```c(min(y), max(y))```.

In [24]:
length(y)
range(y)
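
As a quick sketch of these functions in action, the block below applies them to an illustrative vector `v` with the same values as y. Note that functions such as log() and exp() are applied to each element in turn.

```r
v <- c(3, 1, 4, 1, 5, 9, 2)

n_elements <- length(v)    # 7
limits     <- range(v)     # 1 9

# range() gives the same result as combining min() and max()
same_limits <- c(min(v), max(v))

# Mathematical functions work element by element
logs <- log(c(1, exp(1)))  # 0 1

print(n_elements)
print(limits)
print(logs)
```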

## Other ways to structure data
In the first part we only used vectors - a collection of values of the same type (e.g. numeric, logical). They are useful for storing simple, homogeneous datasets (e.g. Fibonacci numbers, heights of a group of people). As soon as we have to deal with a more complicated set of measurements or an entire data set, vectors become impractical.  
Let's look at an experiment involving growing bacteria. Let's assume that we have 5 petri dishes each containing the same type of bacteria; however, the bacteria in four of the five dishes have been genetically modified to alter their growth rate. We will track the area of each petri dish taken up by bacteria over 7 days, noting the occupied area every day. By the end of the week we will end up with 5 x 7 = 35 measurements. If we wanted to analyse these measurements we could put them all in a vector, but then teasing out the different days and petri dishes would be difficult. Instead, we can organise the measurements in a "table" or a __matrix__, as it is called in R. Matrices are two-dimensional data structures that can contain the same data types as a vector - in our case numeric values. A matrix can be defined as a vector whose elements we reorganise into rows and columns to fit the particulars of our data set.

In [41]:
# Values in this matrix refer to the percentage of area occupied by bacteria in each petri dish
areas <- matrix(c(0, 0, 0, 0, 0, 10, 10, 20, 10, 40, 20, 20, 40, 30, 80, 30, 40, 60, 70, 100, 40, 80, 80, 100, 100, 60, 90, 100, 100, 100, 70, 100, 100, 100, 100), 
                nrow=5, ncol=7)
print(areas)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0   10   20   30   40   60   70
[2,]    0   10   20   40   80   90  100
[3,]    0   20   40   60   80  100  100
[4,]    0   10   30   70  100  100  100
[5,]    0   40   80  100  100  100  100
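
Notice that matrix() fills the matrix column by column, which is why the first five values above became the first column. The sketch below illustrates this on a small example and shows the `byrow` argument of matrix(), which fills row by row instead; the variable names are illustrative.

```r
values <- 1:6

# Default: values are placed column by column
by_column <- matrix(values, nrow = 2, ncol = 3)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE places the values row by row
by_row <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

# dim() returns the number of rows and columns
print(dim(by_column))   # 2 3
```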


Indexing of matrices is very similar to indexing vectors, the difference being that for matrices both a row and a column index can be specified. To select the weekly evolution of areas in dish number 3, we can write:

In [42]:
print(areas[3,])

[1]   0  20  40  60  80 100 100


Notice that there is now a comma inside the square brackets containing the indices. The term before the comma refers to rows and the one after the comma refers to columns. In the example above we only provided a row index, meaning that only values in row 3 are returned; because we left out the column index, values from all columns are included in the result. To return all observations made on day 5 we would do the opposite:

In [43]:
print(areas[,5])

[1]  40  80  80 100 100
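
Row and column indices can also be combined, and they accept the same ranges and negative indices as vectors do. The sketch below uses a small illustrative matrix `m` rather than the areas matrix.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12

# A single element: row 2, column 3
single <- m[2, 3]             # 8

# A sub-matrix: rows 1 to 2, columns 1 and 4
block <- m[1:2, c(1, 4)]

# Everything except the first column
without_first <- m[, -1]

print(single)
print(block)
print(dim(without_first))     # 3 3
```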


__Arrays__ extend matrices beyond two dimensions by allowing additional dimensions to be added. Using the experiment from above, we may choose to examine multiple bacteria at the same time and evaluate their growth rates. Thus, we may end up with two or more 5 x 7 matrices. Instead of using a separate matrix for each bacterium, we could store all the measurements in an array. Note that the elements of an array are passed as a vector and so are the dimensions. ```dim=c(5, 7, 2)``` refers to an array of size 5 x 7 x 2.

In [44]:
areasArray <- array(c(0, 0, 0, 0, 0, 10, 10, 20, 10, 40, 20, 20, 40, 30, 80, 30, 40, 60, 70, 100, 40, 80, 80, 100, 100, 60, 90, 100, 100, 100, 70, 100, 100, 100, 100, 
                      0, 0, 0, 0, 0, 10, 10, 10, 10, 20, 20, 20, 30, 30, 55, 30, 35, 50, 59, 77, 40, 70, 60, 78, 90, 75, 90, 80, 88, 100, 90, 100, 100, 100, 100), 
                    dim=c(5, 7, 2))
print(areasArray)

, , 1

     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0   10   20   30   40   60   70
[2,]    0   10   20   40   80   90  100
[3,]    0   20   40   60   80  100  100
[4,]    0   10   30   70  100  100  100
[5,]    0   40   80  100  100  100  100

, , 2

     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0   10   20   30   40   75   90
[2,]    0   10   20   35   70   90  100
[3,]    0   10   30   50   60   80  100
[4,]    0   10   30   59   78   88  100
[5,]    0   20   55   77   90  100  100
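
Arrays are indexed like matrices, with one index per dimension separated by commas. The sketch below uses a small illustrative array `arr` instead of areasArray to keep the output short.

```r
# 2 rows x 3 columns x 4 slices, filled with the numbers 1 to 24
arr <- array(1:24, dim = c(2, 3, 4))
print(dim(arr))             # 2 3 4

# Leaving the first two indices empty returns a whole 2 x 3 slice
second_slice <- arr[, , 2]
#      [,1] [,2] [,3]
# [1,]    7    9   11
# [2,]    8   10   12

# Three indices select a single element: row 1, column 2, slice 2
element <- arr[1, 2, 2]     # 9

print(second_slice)
print(element)
```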



R also allows mixing of different data types via the use of __lists__. This data structure allows the "packaging" of more than one data type into a single data unit. In addition, elements of a list can be named, making it easier to reference them. Let's now consider a series of concentration measurements taken at regular time intervals. Let's also assume that once the concentration exceeds a fixed value, a light bulb is turned on (for instance to signal a dangerous concentration of lead in running water). If we want to store the time points, concentrations and the state of the light bulb, we could construct a list:

In [56]:
# We will read the concentration value every minute; concentration is defined in ug/l;
# we encode the bulb's state with TRUE if it's on and FALSE if it's off. A cut-off of 10 ug/l
# will be considered as the upper limit of acceptable lead concentration in water.
# There are 9 readings, so the time points run from minute 0 to minute 8.
waterQualityList <- list(seq(0, 8, 1), 
                         c(1.0, 2.0, 1.5, 3.0, 4.0, 6.0, 11.0, 15.0, 8.0), 
                         c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE))
names(waterQualityList) <- c("TimePoints", "LeadConcentration", "BulbState")
print(waterQualityList)

$TimePoints
[1] 0 1 2 3 4 5 6 7 8

$LeadConcentration
[1]  1.0  2.0  1.5  3.0  4.0  6.0 11.0 15.0  8.0

$BulbState
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
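
Named list elements can be retrieved with the `$` operator or with double square brackets. The sketch below uses a shortened, made-up list called `measurements` (not the full waterQualityList) to show both forms.

```r
# A shortened, hypothetical version of the list above
measurements <- list(TimePoints = c(0, 1, 2),
                     LeadConcentration = c(1.0, 11.0, 8.0),
                     BulbState = c(FALSE, TRUE, FALSE))

# Access an element by name with $ ...
concentrations <- measurements$LeadConcentration

# ... or with double square brackets, by name or by position
bulb  <- measurements[["BulbState"]]
times <- measurements[[1]]

# Elements are ordinary vectors, so they can be indexed as before:
# the time points at which the bulb was on
times_bulb_on <- measurements$TimePoints[measurements$BulbState]   # 1

print(concentrations)
print(times_bulb_on)
```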



In [58]:
steatosisGrade <- c("S0", "S0", "S2", "S1", "S2", "S0", "S1", "S3", "S1", "S2", "S3")

In [59]:
length(steatosisGrade)

In [60]:
steatosisGradeF <- factor(steatosisGrade)

In [61]:
levels(steatosisGradeF)

In [62]:
fatFraction <- c(2.3, 3.0, 45.3, 12.5, 35.6, 0.6, 7.8, 73.7, 13.3, 40.5, 89.0)

In [64]:
tapply(fatFraction, steatosisGradeF, mean)
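
The cells above come without an explanation, so the following sketch is only a reading aid. It assumes they are meant to introduce __factors__: factor() turns a character vector into a categorical variable, levels() lists its distinct categories, and tapply() applies a function (here mean) to the values belonging to each category. The data below are made up and much shorter than the vectors above; `table()` is a base R function that counts the observations in each category.

```r
# Hypothetical grades and measurements, for illustration only
grade <- factor(c("S0", "S0", "S1", "S2", "S1"))
fat   <- c(2, 4, 10, 40, 14)

# The distinct categories, in alphabetical order
print(levels(grade))         # "S0" "S1" "S2"

# Number of observations in each category: S0 = 2, S1 = 2, S2 = 1
counts <- table(grade)
print(counts)

# Mean of fat within each category: S0 = 3, S1 = 12, S2 = 40
group_means <- tapply(fat, grade, mean)
print(group_means)
```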