# R Basics
### by [Jason DeBacker](http://jasondebacker.com), October 2019

This notebook will introduce you to R, explaining its syntax and the data structures that are available in R.

R is not as general-purpose a programming language as Python, but it excels at statistical analysis and offers much more versatility than Stata.

## Assignment operators

The first obvious difference you'll notice with R is the syntax to assign a value to a variable.  Most R code you'll come across will use `<-` as the assignment operator.  There are actually several ways to assign variables in R:

* `x <- value`
* `x <<- value`
* `value -> x`
* `value ->> x`

In each of these, the "arrow" points towards the object that is being assigned the value.  The double arrow operators are typically only used in functions: they look outside the function for a variable with that name and update it, creating it in the global environment if no such variable exists.

In addition, one can often (but not always) interchange `x <- value` with `x = value`.  However, one needs to be careful because:
* `<-` is given precedence over `=` (which can matter if both are used.  E.g., `x <- y <- 5` assigns 5 to both variables, while `x <- y = 5` is parsed as `(x <- y) = 5` and raises an error)
* `=` only works as assignment at the top level of an expression (or inside braces).  Inside a function call it names an argument instead.  E.g., `median(x = 1:10)` does not create a variable `x`, whereas `median(x <- 1:10)` does.

*In short*, although more difficult to type (use your keyboard shortcuts!), use `<-` rather than `=`.  But be careful about using appropriate white space.  E.g., don't write `x< -3` ("x is less than -3") when you want `x <- 3`.
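
As a rough illustration of the operators above, the sketch below contrasts `<-` and `<<-` inside a function and shows how `=` behaves inside a function call. The names `counter`, `bump_local`, `bump_enclosing`, and `z` are hypothetical and chosen only for this example.

```r
counter <- 0

# `<-` inside a function creates a local variable; the outer `counter` is untouched
bump_local <- function() {
  counter <- counter + 1
  counter
}

# `<<-` looks outside the function and updates the existing outer `counter`
bump_enclosing <- function() {
  counter <<- counter + 1
  counter
}

local_result <- bump_local()          # the function returns 1
after_local <- counter                # the outer value is still 0
enclosing_result <- bump_enclosing()
after_enclosing <- counter            # the outer value is now 1

# Inside a call, `=` names an argument, while `<-` assigns and then passes the value
m1 <- mean(x = c(1, 2, 3))            # this call does not create a variable `x`
m2 <- mean(z <- c(1, 2, 3))           # this call creates `z` as a side effect

print(c(local_result, after_local, after_enclosing))
print(z)
```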


## Data Structures

The most often used data structures in R are:
* Vectors
* Matrices
* Arrays
* Lists
* Factors
* Dataframes

### Vectors

Vectors can be numeric, character (strings), or logical.  Scalar values are simply vectors of length 1.

In [1]:
# a pound sign comments out the rest of the line
# no analogue to the docstring block comments in Python

# Assign 
x <- 4
y <- 5
xx <- TRUE  # use all caps for TRUE/FALSE for logical variables
yy <- "True"  # put strings in quotes

print(class(x)) # to print what type of object x is
print(class(xx))
print(class(yy))
print(x * y)

[1] "numeric"
[1] "logical"
[1] "character"
[1] 20


#### Useful vector operations:

* `c()` concatenates elements into a vector. Note that elements of the vector have to be of the same mode (numeric, logical, character, etc.). If you try to mix modes, `c()` will coerce the elements to a common mode.  See:


In [4]:
mixed_type <- c(TRUE, 2)
print(class(mixed_type))
print(mixed_type)

[1] "numeric"
[1] 1 2
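
The same coercion applies to other mixes of modes. As a small sketch (the variable names are made up for illustration), logical values give way to numbers, and both give way to character strings:

```r
lgl_int <- c(TRUE, 2L)     # logical + integer   -> integer
num_chr <- c(1.5, "a")     # numeric + character -> character
lgl_chr <- c(TRUE, "a")    # logical + character -> character

print(class(lgl_int))
print(num_chr)
print(lgl_chr)
```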


* Mathematical operations:
    * Are element by element by default, but matrix operations are available without the need for additional packages (just different syntax)
    * `^`  or `**` both used for powers
    * Division of integers does not use integer division
    * Logical variables are treated as 0/1 when used in mathematical operations

In [5]:
a <- c(1, 2, 3)
b <- c(2, 2, 2)
# element by element multiplication of vectors
a * b

In [6]:
# matrix multiplication of vectors
a %*% b

     [,1]
[1,]   12


In [8]:
# element by element powers (** also works - try it)
a ^ b

In [9]:
# division
a / b

In [10]:
# operation between numeric and logical vectors
a * c(TRUE, FALSE, TRUE)
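
To make the points about division and logicals concrete, here is a minimal sketch using base R operators only. The variable names are hypothetical.

```r
int_a <- 5L   # the L suffix makes these integers
int_b <- 2L

quotient <- int_a / int_b            # ordinary division, even for integers
int_quotient <- int_a %/% int_b      # integer division needs its own operator
remainder <- int_a %% int_b          # modulo
n_true <- sum(c(TRUE, FALSE, TRUE))  # logicals are counted as 1/0

print(quotient)
print(int_quotient)
print(remainder)
print(n_true)
```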

* Slicing vectors is done in a similar manner as in Python BUT:
     * Indexing starts with 1 (not 0)
     * You always need values on both sides of a `:` when slicing
         * to take from a given element to the end of a vector (or from the start to a given place in the vector), you can use the `head()` or `tail()` functions

In [13]:
print(a[1])
print(a[1:3])
print(tail(a, 2)) # take the last two elements (here, the second through the end)
print(head(a, 2)) # take the first two elements (up to and including the second)

[1] 1
[1] 1 2 3
[1] 2 3
[1] 1 2
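
A few other ways to slice may be useful alongside `head()` and `tail()`. The sketch below uses a made-up vector `v` and shows negative indices (which drop elements rather than count from the end, unlike in Python) and logical indexing:

```r
v <- c(10, 20, 30, 40, 50)

from_second <- v[2:length(v)]   # second element through the end
drop_first <- v[-1]             # a negative index drops that element
drop_last <- head(v, -1)        # everything except the last element
big <- v[v > 25]                # logical indexing keeps elements where the test is TRUE

print(from_second)
print(drop_first)
print(drop_last)
print(big)
```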


### Matrices

Matrices are 2-dimensional objects.  They have rows and columns.  All elements of a matrix must have the same mode (just as with vectors).

To create a matrix:

```
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, 
  	dimnames=list(char_vector_rownames, char_vector_colnames))
```  

`byrow=TRUE` indicates that the matrix should be filled by rows. `byrow=FALSE` indicates that the matrix should be filled by columns (the default). `dimnames` provides optional labels for the columns and rows.
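
As a quick sketch of the difference between the two fill orders (the names `by_col` and `by_row` are just for illustration):

```r
by_col <- matrix(1:6, nrow = 2, ncol = 3)                # default: fill down each column
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # fill across each row

print(by_col)
print(by_row)
print(by_col[1, ])  # first row when filled by columns
print(by_row[1, ])  # first row when filled by rows
```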

In [14]:
# generates 5 x 4 numeric matrix 
y <- matrix(1:20, nrow=5,ncol=4)
print(y)

# another example - where label rows and columns
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2") 
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
  dimnames=list(rnames, cnames))
# mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE)
mymatrix

     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20


   C1 C2
R1  1 26
R2 24 68


In [15]:
# to identify rows, columns, separate by commas
y[,4] # 4th column of matrix
y[3,] # 3rd row of matrix 
y[2:4,1:3] # rows 2,3,4 of columns 1,2,3

# can also reference by labels if have them
mymatrix['R1',]

     [,1] [,2] [,3]
[1,]    2    7   12
[2,]    3    8   13
[3,]    4    9   14


### Arrays

Arrays are like matrices, but can have more than 2 dimensions.  Create an array with:

```
A = array(data = NA, dim = length(data), dimnames = NULL)
```
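
For instance, a 3-dimensional array could be built and indexed roughly as follows. The name `arr` and the dimensions are arbitrary choices for this sketch.

```r
# 24 values arranged as 2 rows x 3 columns x 4 "layers"
arr <- array(1:24, dim = c(2, 3, 4))

print(dim(arr))
print(arr[, , 1])    # the first layer is a 2 x 3 matrix
print(arr[1, 2, 3])  # row 1, column 2, layer 3
```

As with matrices, the values fill the first dimension fastest, then the second, then the third.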

### Lists

Lists can contain different types of elements.  E.g., a list can contain vectors, functions, or other lists inside it.  Lists are ordered, so you can reference an element by its index.  You can also assign labels to elements in a list and reference an element by its label.

In [16]:
# create a list
a_list <- list(a, 99, y)
a_list

[[1]]
[1] 1 2 3

[[2]]
[1] 99

[[3]]
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20


In [19]:
# Reference an element of a list by its position
a_list[[1]]

In [20]:
a_list[[3]]

     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20


In [21]:
# assign labels to elements
b_list <- list(first = a,second = 99, third = y)
print(b_list)
# reference the element labeled "first"
# b_list[['first']]
b_list$first # an alternative way to get the element labeled 'first'

$first
[1] 1 2 3

$second
[1] 99

$third
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
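
One detail worth seeing in code is the difference between single and double brackets on a list. The sketch below uses a hypothetical list `person`; single brackets return a smaller list, while double brackets (or `$`) return the element itself.

```r
person <- list(name = "Ada", scores = c(90, 85))

sub_list <- person["scores"]   # single bracket: a list of length 1
scores <- person[["scores"]]   # double bracket: the numeric vector itself
person$passed <- TRUE          # assigning to a new label appends an element

print(class(sub_list))
print(class(scores))
print(names(person))
```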



### Factors

Factors are R objects created from vectors, typically used to represent categorical data.  They are somewhat like a set in Python in that they store the unique values of the vector (the *levels*), but they also record which level each element of the vector belongs to.  So factor objects have two parts: a vector of the unique values (which are always stored as labels that are strings, regardless of their mode in the vector the factor came from) and a vector of integer codes, one per element, that index into those levels.  The number of occurrences of each unique value is not stored, but is easy to compute (e.g., with `summary()` or `table()`).

Factors are created using the `factor()` function.   The `nlevels()` function gives the number of levels.

In [23]:
# create a factor and inspect it
degree <- c(rep("PhD", 10), rep("MA", 5)) # vector with PhD repeated 10 times and MA repeated five times
degree <- factor(degree) # turn vector into factor with factor()
summary(degree) # summary function on factors
# degree

In [24]:
print(degree)
print(nlevels(degree))

 [1] PhD PhD PhD PhD PhD PhD PhD PhD PhD PhD MA  MA  MA  MA  MA 
Levels: MA PhD
[1] 2
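
To look inside a factor, something like the following sketch can help. It rebuilds the same vector under the hypothetical name `degree_fct` and pulls out the levels, the underlying integer codes, and the counts.

```r
degree_fct <- factor(c(rep("PhD", 10), rep("MA", 5)))

lvls <- levels(degree_fct)       # unique values, sorted alphabetically by default
codes <- as.integer(degree_fct)  # one integer code per element, indexing into the levels
counts <- table(degree_fct)      # occurrences of each level

print(lvls)
print(codes)
print(counts)
```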


An ordered factor is used to represent an ordinal variable.

R will treat factors as nominal variables and ordered factors as ordinal variables in statistical procedures and graphical analyses. You can use options in the `factor()` and `ordered()` functions (e.g., the `levels` argument) to control the mapping of integers to strings (overriding the alphabetical ordering). You can also use factors to create value labels.

In [25]:
# creating an ordered factor
ordered_degree <- ordered(degree)
print(ordered_degree)

 [1] PhD PhD PhD PhD PhD PhD PhD PhD PhD PhD MA  MA  MA  MA  MA 
Levels: MA < PhD


### Data Frames

R data frames are similar to Pandas DataFrames (and to tables of data in Stata, SAS, etc.).  They are 2-dimensional objects, but unlike a matrix in R, they can have columns that contain different modes of data.  Each column has a label and the rows are indexed.  Like Pandas DataFrames, R data frames can have non-integer indexes for the rows.

Data Frames are created using the `data.frame()` function.

In [26]:
# Create the data frame.
BMI <- data.frame(
   gender = c("Male", "Male", "Female"),
   height = c(152, 171.5, 165),
   weight = c(81, 93, 78),
   Age = c(42, 38, 26)
)
print(BMI)

  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26


There are several ways to slice a data frame:

In [27]:
BMI[2:3] # columns 2 and 3 of data frame
BMI[c("height","weight")] # columns height and weight from data frame
BMI$height # variable height in the data frame
BMI[1,] # first row of data frame
BMI[1, 'gender'] # element in the first row and the gender column

  height weight
1  152.0     81
2  171.5     93
3  165.0     78


  height weight
1  152.0     81
2  171.5     93
3  165.0     78


  gender height weight Age
1   Male    152     81  42
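
Beyond slicing, two common operations are adding a computed column and filtering rows by a condition. The sketch below uses a separate hypothetical data frame `people` with the same values, and assumes (the data above does not state units) that height is in centimeters and weight is in kilograms.

```r
people <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),   # assumed to be centimeters
  weight = c(81, 93, 78),        # assumed to be kilograms
  stringsAsFactors = FALSE       # keep gender as character in every R version
)

# Add a computed column: weight (kg) divided by height (m) squared
people$bmi <- people$weight / (people$height / 100)^2

# Keep only the rows where the condition is TRUE (note the trailing comma)
tall <- people[people$height > 160, ]

print(people)
print(tall)
```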


## Useful Functions

* `length(object)` # number of elements or components
* `str(object)`    # structure of an object 
* `class(object)`  # class or type of an object
* `names(object)`  # names

* `c(object,object,...)`       # combine objects into a vector
* `cbind(object, object, ...)` # combine objects as columns
* `rbind(object, object, ...)` # combine objects as rows 

* `object`     # prints the object

* `ls()`       # list current objects
* `rm(object)` # delete an object

* `newobject <- edit(object)` # edit copy and save as `newobject`
* `fix(object)`               # edit in place
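
As a brief sketch of the combining functions listed above (the vectors `v1` and `v2` are made up for this example):

```r
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)

as_cols <- cbind(v1, v2)   # each vector becomes a column: 3 x 2 matrix
as_rows <- rbind(v1, v2)   # each vector becomes a row: 2 x 3 matrix

print(length(v1))
print(dim(as_cols))
print(dim(as_rows))
print(colnames(as_cols))   # names are taken from the variable names
str(as_rows)
```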

In [None]:
# Show the structure of the BMI data frame
str(BMI)

In [None]:
# List all the objects we created in this notebook
ls()

In [None]:
rm(BMI)

In [None]:
ls()