### R Programming 1 ###

R is a dialect of the S programming language, which was created by John Chambers et al. at Bell Labs. S was originally implemented as a set of Fortran libraries for statistical analysis (hence the name S).

In 1988, S was rewritten in C to include statistical modeling.

R is an implementation of the S language.

R is free software: anyone can download, use, and modify it, but only members of the R Core Team can commit changes to the official source code.

It has strong graphics and visualization packages (better than most other options).

#### Console Input and Evaluation ####

The things we input into R are called expressions, for example:

In [2]:
x <- 1

The `<-` above is the assignment operator, which assigns the value `1` to the variable `x`. It works as an arrow, so the value and the variable can be in either order as long as the arrow points at the variable: `x <- 1` is the same as `1 -> x`. Nonetheless, convention says we place the variable first.

`print()` simply prints the contents of a variable to the console. 

In [3]:
print(x)

[1] 1


Nonetheless in R you can also just call the object without the `print()` function and it gets printed to the console.

In [4]:
x

We can also create string objects, like this:

In [5]:
msg <- 'hello'

In [6]:
msg

Notice the quotes around the word; this is what tells R that this is a string object and not a numerical one.

`#` is used for commenting, so anything to the right of `#` will be ignored by the interpreter.

In [7]:
# this is a comment

There is a difference between auto printing and explicit printing. Typing an object's name at the console auto prints it; calling the `print()` function prints it explicitly. We need explicit printing when building applications that must print a value, because auto printing does not happen inside functions or loops.

In [11]:
x <- 5 # Nothing is printed since it is only assigning the value 5 to the variable x

In [9]:
x # Auto printing

In [10]:
print(x) # Explicit printing

[1] 5


The `[1]` printed before the `5` is the index of the first element shown on that line: R is saying that `5` is the first element of a vector (of length 1 in this case). Remember, R is very much focused on statistics, and its variables are built for this purpose.

R has a nifty shorthand for creating vectors, and it works like this:

In [14]:
x <- 1:10

In [15]:
x

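Here, we created a vector from 1 to 10 by using `:` to separate the first value from the last. The operator is meant for integer sequences. With non-integer endpoints it starts at the first value and steps by 1, stopping before it would pass the last value, which is rarely what you want.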

In [16]:
x <- 2.1:6.7

In [17]:
x
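
To make the behaviour of `:` with non-integer endpoints concrete, here is a minimal sketch. It also shows base R's `seq()` as one way to get a fractional step; the variable names `a` and `b` are only illustrative.

```r
# The colon operator starts at the left value and adds 1 each time,
# stopping before it would pass the right value.
a <- 2.1:6.7
print(a)        # 2.1 3.1 4.1 5.1 6.1

# seq() lets us choose the step size explicitly.
b <- seq(2, 3, by = 0.5)
print(b)        # 2.0 2.5 3.0

# Integer endpoints give an integer vector; fractional ones give a numeric vector.
class(1:10)     # "integer"
class(a)        # "numeric"
```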

#### Objects ####

All things we create in R are called objects. There are 5 basic 'atomic' classes of objects:

- Character
- Numeric (real numbers)
- Integer
- Complex
- Logical (TRUE/FALSE)

The most basic object in R is a vector, which obeys these rules:

- A vector can only contain objects of the same class.
- BUT: There is one exception, the `list`, which is represented as a vector but can contain objects of different classes.

Empty vectors can be created with the `vector()` function.

In [18]:
x <- vector()

In [21]:
x

In [20]:
print(x)

logical(0)


##### Numbers #####

- Numbers in R are generally treated as numeric objects. 
- If you explicitly want an integer, you need to add the `L` suffix (entering `x <- 1` gives you a numeric object, entering `x <- 1L` gives you an integer).
- There is a special number `Inf` which represents infinity. 
- The value `NaN` represents an undefined, Not a Number value. `NaN` can also be thought of as a missing value.
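
The points above can be checked directly at the console. This is a small sketch; the variable names are arbitrary.

```r
num <- 1      # a plain number is numeric (double precision)
int <- 1L     # the L suffix asks for an integer
class(num)    # "numeric"
class(int)    # "integer"

pos_inf <- 1 / 0    # Inf
neg_inf <- -1 / 0   # -Inf
undefined <- 0 / 0  # NaN

# Inf can be used in ordinary arithmetic.
1 / pos_inf         # 0
is.nan(undefined)   # TRUE
```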

##### Attributes #####

R objects can have attributes:

- names, dimnames (dimension names)
- dimensions (e.g. matrices, arrays)
- class
- length
- other user defined attributes or metadata

The attributes of an object can be accessed, set, and modified using the `attributes()` function (or `attr()` for a single attribute).

In [28]:
x <- 1

In [39]:
attributes(x)

NULL
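
As a sketch of setting attributes rather than just reading them, the example below adds a `dim` attribute and a user-defined one. The attribute name `units` and its value are made up for illustration.

```r
x <- 1:6
attributes(x)                 # NULL: a plain vector has no attributes yet

# Setting the dim attribute turns the vector into a 2 x 3 matrix.
attr(x, "dim") <- c(2L, 3L)

# A user-defined attribute (hypothetical metadata).
attr(x, "units") <- "cm"

attributes(x)                 # a list with $dim and $units
```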

##### Data Types Vectors and Lists #####

The `c()` function can be used to create vectors of objects. This function concatenates the objects given into a vector (hence the `c()`).

In [44]:
x <- c(0.5,0.6) # Numeric Vector
x

In [45]:
x <- c(TRUE, FALSE) # Logical Vector
x

In [46]:
x <- c(T, F) # Logical Vector
x

In [47]:
x <- c('a', 'b', 'c') # Character Vector
x

In [48]:
x <- 1:5 # Integer Vector
x

In [50]:
x <- c(1+0i, 2+4i) # Complex Vector
x

Also, using the `vector()` function we can create vectors with specific qualities.

In [51]:
x <- vector('numeric', length = 10)
x

##### Mixing Objects #####

R will try to accommodate our mistakes when creating vectors: when objects of different classes are mixed, it converts them to a common class that can represent every element, which prevents errors. This is called coercion.

In [53]:
x <- c(1.7, 'a') # creates a vector of characters
x

In [54]:
x <- c(TRUE, 2) # Creates a numeric vector
x

In [55]:
x <- c('a', TRUE) # creates a vector of characters
x
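
The cells above suggest an ordering: a mixed vector takes the most general class present, roughly logical, then integer, then numeric, then complex, then character. A short sketch that checks this with `class()`; the vector `flags` is just an example.

```r
class(c(TRUE, 2L))         # "integer"
class(c(TRUE, 2L, 3.5))    # "numeric"
class(c(1.7, 1+0i))        # "complex"
class(c(1.7, "a"))         # "character"

# Logical values become 1 and 0 when coerced to numbers,
# which makes counting TRUE values easy.
flags <- c(TRUE, FALSE, TRUE, TRUE)
sum(flags)                 # 3
```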

##### Explicit Coercion #####

We can convert objects from one class to another with the `as.*()` family of functions, such as `as.numeric()`, `as.logical()`, and `as.character()`.

In [70]:
x <- 0:6
x

In [71]:
class(x)

In [72]:
as.numeric(x)

In [73]:
as.logical(x)

In [74]:
as.character(x)

It is important to mention that this does NOT change the original vector `x`; each call returns a converted copy. If you need to, you can save the result in another variable, or assign it back to `x` to replace the original.

In [75]:
x

In [76]:
x <- as.logical(x)
x

In [77]:
class(x)

Coercion does not always work, since some values cannot be interpreted as a different class. When that happens R returns `NA`, usually with a warning. For example:

In [78]:
x <- c('a', 'b', 'c')
x

In [79]:
class(x)

In [80]:
as.numeric(x)

"NAs introduced by coercion"

In [81]:
as.logical(x)

In [82]:
as.complex(x)

"NAs introduced by coercion"

#### Lists ####

Lists can have objects of different classes, making them very useful.

In [84]:
x <- list(1, 'a', TRUE, 1+4i)
print(x)

[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i



Notice that printing a list gives you indices in double brackets.

#### Matrices ####

Matrices are vectors with a dimension attribute. This attribute is an integer vector of length 2 (nrow, ncol).

In [86]:
m <- matrix(nrow = 2, ncol = 3)
print(m)

     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA


In [87]:
dim(m)

In [88]:
attributes(m)

We can also create matrices by assigning the values first and then adding the dimension attribute.

In [89]:
m <- 1:10
print(m)

 [1]  1  2  3  4  5  6  7  8  9 10


In [90]:
dim(m) <- c(2,5)
print(m)

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
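
The output above shows that values fill a matrix column by column. A small sketch comparing that default with the `byrow` argument of `matrix()`; the names `m_col` and `m_row` are illustrative.

```r
m_col <- matrix(1:6, nrow = 2, ncol = 3)                 # default: fill column by column
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # fill row by row instead

m_col[1, ]   # 1 3 5
m_row[1, ]   # 1 2 3
```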


##### cbind-ing and rbind-ing #####

Matrices can be created by column binding or row binding using `cbind()` and `rbind()`.

In [94]:
x <- 1:3
y <- 10:12
print(x)
print(y)

[1] 1 2 3
[1] 10 11 12


In [95]:
cbind(x,y)

     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12


In [96]:
rbind(x,y)

  [,1] [,2] [,3]
x    1    2    3
y   10   11   12


#### Factors ####

Factors are a special kind of vector used to represent categorical data. They can be ordered or unordered. We can think of factors as an integer vector with labels.

- Factors are treated specially by modelling functions like `lm()` and `glm()`.
- Using factors with labels is better than using integers because they are self describing. Having a variable with labels `Male` and `Female` is better than having variables with values `0` and `1`.

In [97]:
x <- factor(c('yes', 'yes', 'no', 'yes', 'no'))
x

In [98]:
table(x)

x
 no yes 
  2   3 

In [99]:
unclass(x)

In [101]:
attr(x,'levels')

The order of the levels can be set by using the `levels` argument to `factor()`. This can be important in linear modeling because the first level is used as the baseline level.

In [102]:
x <- factor(c('yes', 'yes', 'no', 'yes', 'no'),
           levels = c('yes', 'no'))
x
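
To see the "integer vector with labels" idea after reordering the levels, here is a brief sketch that reuses the same data. `as.integer()` exposes the underlying codes.

```r
x <- factor(c("yes", "yes", "no", "yes", "no"),
            levels = c("yes", "no"))

levels(x)       # "yes" "no"  ("yes" is now the first, i.e. baseline, level)
as.integer(x)   # 1 1 2 1 2   (codes follow the level order)
table(x)        # counts are reported in level order: yes 3, no 2
```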

##### Missing Values #####

Missing values are denoted by `NA` (Not Available) or `NaN` (Not a Number) for undefined mathematical operations.

- `is.na()` is used to test if objects are `NA`.
- `is.nan()` is used to test if objects are `NaN`.
- `NA` values have class, so they can be `NA` integer, `NA` character, etc.
- A `NaN` value is also `NA` but an `NA` value is not necessarily a `NaN` (all `NaN` are `NA` but not all `NA` are `NaN`).

In [103]:
x <- c(1,2,NA,10,3)
is.na(x)

In [104]:
is.nan(x)

In [105]:
x <- c(1, 2, NaN, NA, 5)
is.na(x)

In [106]:
is.nan(x)
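
The bullet points above can be verified with a short sketch. The typed constants `NA_integer_` and `NA_character_` are part of base R; the use of `mean()` with `na.rm = TRUE` is an added illustration of why missing values matter in practice.

```r
x <- c(1, 2, NaN, NA, 5)

is.na(x)    # FALSE FALSE  TRUE  TRUE FALSE  (NaN counts as NA)
is.nan(x)   # FALSE FALSE  TRUE FALSE FALSE  (NA is not NaN)

# NA values carry a class.
class(NA)             # "logical"
class(NA_integer_)    # "integer"
class(NA_character_)  # "character"

# Many summary functions can drop missing values on request.
avg <- mean(x, na.rm = TRUE)   # (1 + 2 + 5) / 3
```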

#### Data Frames ####

Data frames are used to store tabular data.

- They are represented by a special type of list where every element of the list has the same length.
- Each element of the list can be thought of as a column, and the length of each element of the list is the number of rows.
- Unlike matrices, data frames can store different classes of objects in each column (just like lists); matrices must have every element of the same class.
- Data frames also have a special attribute called `row.names`.
- Data frames are usually created by calling `read.table()` or `read.csv()`, but can be directly created with `data.frame()`.
- Can be converted into a matrix by calling `data.matrix()`.

In [107]:
x <- data.frame(foo = 1:4, bar = c(T,T,F,F))
x

  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE


In [108]:
nrow(x)

In [109]:
ncol(x)
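
A short sketch of two points from the list above: each column keeps its own class, and `data.matrix()` forces everything into a single type. The name `df` is illustrative.

```r
df <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

# Each column is a list element with its own class.
sapply(df, class)      # foo: "integer", bar: "logical"
dim(df)                # 4 2

# Converting to a matrix forces one type: the logical column becomes 1/0.
m <- data.matrix(df)
m[, "bar"]             # 1 1 0 0
```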

#### Names ####

R objects can also have names, which is very useful for writing readable code and self-describing objects.

In [110]:
x <- 1:3
names(x)

NULL

In [111]:
names(x) <- c('one', 'two', 'three')
x

In [112]:
names(x)

Lists can also have names.

In [113]:
x <- list(a = 1, b = 2, c = 3)
x

In [114]:
names(x)

In [115]:
m <- matrix(1:4, nrow = 2, ncol = 2)
m

     [,1] [,2]
[1,]    1    3
[2,]    2    4


In [116]:
dimnames(m)

NULL

In [117]:
dimnames(m) <- list(c('a', 'b'), c('c', 'd'))
m

  c d
a 1 3
b 2 4


In [118]:
dimnames(m)

#### Reading Tabular Data ####

These are the most common functions for reading data into R:

- `read.table()`, `read.csv()`, for reading tabular data.
- `readLines()`, for reading lines of a text file.
- `source()`, for reading in R code files (inverse of `dump()`).
- `dget()`, for reading in R code files (inverse of `dput()`).
- `load()`, for reading in saved workspaces.
- `unserialize()`, for reading single R objects in binary form.

#### Writing Data ####

These are the writing versions of the above.

- `write.table()`
- `writeLines()`
- `dump()`
- `dput()`
- `save()`
- `serialize()`

#### Reading Data Files with read.table ####

The `read.table()` function is one of the most commonly used for reading data, and here are its most important arguments:

- `file`, the name of a file or a connection.
- `header`, logical indicating if the file has a header.
- `sep`, a string indicating how the columns are separated.
- `colClasses`, a character vector indicating the class of each column in the dataset.
- `nrows`, the number of rows in the dataset.
- `comment.char`, a character string indicating the comment character (for using other characters other than `#`).
- `skip`, the number of lines to skip from the beginning.
- `stringsAsFactors`, should character variables be coded as factors?
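
As an illustration of these arguments working together, the sketch below parses a small made-up dataset. It uses the `text` argument of `read.table()` in place of a file so that the example is self-contained; the data and column names are hypothetical.

```r
# A tiny semicolon-separated dataset with one junk line before the header.
raw <- "exported by a hypothetical survey tool
id;name;score
1;ann;3.5
2;bob;4.0
3;cy;2.5"

d <- read.table(text = raw,
                skip = 1,                  # ignore the junk first line
                header = TRUE,             # the next line holds column names
                sep = ";",                 # columns are separated by semicolons
                colClasses = c("integer", "character", "numeric"),
                nrows = 3,                 # number of data rows to read
                stringsAsFactors = FALSE)  # keep character columns as characters

str(d)
```

With a real file, the first argument would be the file name or connection instead of `text = raw`.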

#### read.table ####

For a small to moderately sized dataset, you can usually use `read.table()` without specifying any other arguments.

```
data <- read.table('example.txt')
```

R will automatically:

- Skip lines that begin with `#`.
- Figure out how many rows there are (and how much memory needs to be allocated).
- Figure out what type of variable is in each column of the table.

Telling R these things directly makes it run faster and more efficiently.

`read.csv()` is identical to `read.table()` except that its defaults suit CSV files: the separator is a comma and `header = TRUE`.

#### Reading in Larger Datasets with read.table ####

With much larger datasets, doing the following will make your life easier and prevent R from choking:

- Read the help page for `read.table()`, which contains many hints.

In [None]:
?read.table

- Make a rough calculation of the memory required to store your dataset. If the size is larger than the RAM of your computer, simply stop.
- Set `comment.char = ''` if there are no commented lines in your file.
- Use `colClasses` argument. This will help R run much faster. Although you have to know the class of each column.
- Set `nrows`. This does not help R run faster, but it does help with memory usage.
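
One common way to apply these hints is to read a few rows first, let R infer the column classes, and then reuse them for the full read. The sketch below assumes a small temporary file standing in for a large dataset; the file contents, row counts, and the 1,500,000 x 120 table in the memory estimate are all hypothetical.

```r
# Write a small stand-in file; a real dataset would already be on disk.
path <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:1000, value = seq(0.5, 500, by = 0.5)),
            file = path, row.names = FALSE)

# Step 1: read a few rows and let R work out the column classes.
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# Step 2: read everything, telling R the classes and row count up front.
full <- read.table(path, header = TRUE, colClasses = classes,
                   nrows = 1000, comment.char = "")

# Rough memory estimate for an all-numeric table: rows x columns x 8 bytes.
rows <- 1500000
cols <- 120
gib <- rows * cols * 8 / 2^30   # about 1.34 GiB for the raw data alone

unlink(path)  # remove the temporary file
```

This only works if the first rows are representative: a column that looks like integers in the first 100 rows but contains decimals later would fail to read with the inferred classes.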

#### Know Thy System ####

It is important to know the capabilities of the system you are using, in particular:

- How much memory is available?
- What other applications are in use?
- Are there other users logged into the same system?
- What is the operating system?
- Is the OS 32 bit or 64 bit?

### Subsetting ###

In [123]:
x <- c('a', 'b', 'c', 'c', 'd', 'a')
x[1] # get the first element

In [125]:
x[2] # get the second element

In [127]:
x[2:4] # get elements 2 to 4 inclusive

In [128]:
x[x > 'a'] # get all elements of x that are greater than 'a' (lexical ordering)

In [129]:
u <- x > 'c' # creating in u a logical vector that tells me which elements of x are greater than 'c'
u

In [130]:
x[u] # getting all elements from x where u is TRUE

### Subsetting Lists ###

In [131]:
x <- list(f1 = 1:4, f2 = 0.6)
x[1] # returns a list containing the first element, f1 (the name is kept)

In [132]:
x[[1]] # just get the sequence without the name of the object

In [133]:
x$f2 # get the item from the list by name, in this case f2

In [134]:
x[['f2']] # calling by name, same as x[[2]]

In [135]:
x['f2'] # also calling by name, but with the header.

In [137]:
x[[1]][[1]] # extract the first element of the first element
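
The difference between `[`, `[[`, and `$` is easiest to see by checking what each one returns. A minimal sketch using the same list; the variable `name` is introduced only for illustration.

```r
x <- list(f1 = 1:4, f2 = 0.6)

class(x[1])      # "list"    (single brackets always return a list)
class(x[[1]])    # "integer" (double brackets return the element itself)

# [[ ]] accepts a computed name; $ only accepts a literal name.
name <- "f2"
x[[name]]        # 0.6
x$name           # NULL, because no element is literally called "name"

# Single brackets can extract several elements at once.
both <- x[c("f1", "f2")]
length(both)     # 2
```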

### Subsetting Matrices ###

In [138]:
x <- matrix(1:6, 2, 3) # A matrix with 2 rows and 3 columns with the numbers 1 to 6
x

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


In [139]:
x[1,2] # extracts the second element of the first row x[row,column]

In [140]:
x[2,1] # extracts the first element of the second row

In [142]:
x[1,] # extracts all of the first row as a vector

In [144]:
x[,2] # extracts all of the second column as a vector

In [145]:
x[1, , drop = FALSE] # extracts the first row as a 1 x 3 matrix instead of a vector

     [,1] [,2] [,3]
[1,]    1    3    5
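
The effect of `drop = FALSE` shows up clearly in the dimensions of the result. A small sketch using the same matrix:

```r
x <- matrix(1:6, 2, 3)

dim(x[1, ])                  # NULL: the row was dropped to a plain vector
dim(x[1, , drop = FALSE])    # 1 3:  still a matrix with one row
dim(x[, 2, drop = FALSE])    # 2 1:  still a matrix with one column
```

Keeping the matrix shape matters when later code expects a matrix, for example when indexing with `[row, column]`.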


#### Removing NA Values ####

In [146]:
x <- c(1,2,NA,4,NA,5)
x

In [147]:
bad <- is.na(x) # creates a logical vector for NA values
bad

In [148]:
x[!bad] # ! before the name of the logical vector, will return all non NA values from x without modifying x
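
When several vectors must stay aligned, base R's `complete.cases()` gives one logical vector marking the positions where none of them is missing. A brief sketch with made-up data:

```r
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", NA, "c", "d", NA, "f")

# TRUE only where both x and y are present.
good <- complete.cases(x, y)
good        # TRUE FALSE FALSE  TRUE FALSE  TRUE

x[good]     # 1 4 5
y[good]     # "a" "d" "f"
```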

#### Vectorized Operations ####

Many operations in R are vectorized: they are applied element by element across whole vectors, so no explicit loop is needed.

In [150]:
x <- 1:4
y <- 6:9
x
y

In [151]:
x + y

In [152]:
x > 2

In [153]:
x >= 2

In [154]:
y == 8

In [155]:
x * y

In [156]:
x / y
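
The same element-by-element behaviour applies when one operand is shorter (it is recycled) and when the operands are matrices. The sketch below contrasts element-wise `*` with true matrix multiplication, `%*%`; the matrices `a` and `b` are arbitrary examples.

```r
x <- 1:4
x * 2            # 2 4 6 8 (the single value 2 is recycled across x)

a <- matrix(1:4, 2, 2)           # columns are (1, 2) and (3, 4)
b <- matrix(rep(10, 4), 2, 2)    # every entry is 10

elementwise <- a * b     # multiplies matching entries: 10 30 / 20 40
product <- a %*% b       # matrix multiplication:       40 40 / 60 60
```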