# R and RStudio


<b>First, let's learn the difference between R and RStudio</b>:

R is a programming language for statistical computing, while RStudio is an integrated development environment (IDE) that provides an editor, a console, and other tools for writing and running R code. R and RStudio are not separate versions of the same program and cannot be substituted for one another. R may be used without RStudio, but RStudio may not be used without R.

   
## Introduction to R

We will use the R interactive console. This is where you will run all of your code, and it can be a useful environment to try out ideas before adding them to an R script file. The console in RStudio is the same as the one you would get if you typed `R` in your command-line environment.

The first thing you will see in the R interactive session is a bunch of information, followed by a “>” and a blinking cursor. In many ways this is similar to the shell environment you learned about during the shell lessons: it operates on the same idea of a “Read, evaluate, print loop”: you type in commands, R tries to execute them, and then returns a result.

### Using R as a calculator

This might be one of the simplest things in R.

In [1]:
1 + 100

<mark> Note: </mark>

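If you type an incomplete command, R shows a `+` prompt and waits for you to complete it; press `Esc` in RStudio to cancel. If you’re using R from the command line instead of from within RStudio, you need to use `Ctrl`+`C` instead of `Esc` to cancel the command. This applies to Mac users as well!

R follows the standard mathematical order of operations, often remembered as “PEMDAS”: parentheses, exponents, multiplication and division, then addition and subtraction.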

In [2]:
3 + 5 * 2

In [3]:
(3 + (5 * (2 ^ 2))) # hard to read
3 + 5 * 2 ^ 2       # clear, if you remember the rules
3 + 5 * (2 ^ 2)     # if you forget some rules, this might help

In [4]:
# Really small or large numbers get a scientific notation:
2/10000

In [5]:
# You can write numbers in scientific notation too:
5e3  # Note the lack of minus here

### Mathematical functions

R has many built-in mathematical functions. To call a function, we type its name, followed by opening and closing parentheses. The values we type inside the parentheses are called the function’s arguments:

In [6]:
sin(1)  # trigonometry functions

In [7]:
log(1)  # natural logarithm

In [8]:
log10(10) # base-10 logarithm

In [9]:
exp(0.5) # e^(1/2)

Don’t worry about trying to remember every function in R. You can look them up on Google, or if you can remember the start of the function’s name, use the tab completion in RStudio.

This is one advantage that RStudio has over R on its own: it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take. Typing a `?` before the name of a command will open the help page for that command.

### Comparing things

We can also do comparisons in R:

In [10]:
1 == 1  # equality (note two equals signs, read as "is equal to")
1 != 2  # inequality (read as "is not equal to")
1 < 2  # less than
1 <= 1  # less than or equal to
1 > 0  # greater than
1 >= -9 # greater than or equal to
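
One caveat worth knowing when using `==` on numbers: doubles are stored with finite precision, so two values that look equal can differ by a tiny amount. The sketch below is only an illustration with made-up values; it uses base R's `all.equal()`, which compares within a small tolerance.

```r
# Floating-point arithmetic can introduce tiny rounding errors
a <- 0.1 + 0.2
b <- 0.3

exact_match <- a == b                    # exact comparison of two doubles
close_match <- isTRUE(all.equal(a, b))   # comparison within a small tolerance

exact_match
close_match
```

On typical hardware the exact comparison is `FALSE` while the tolerant one is `TRUE`, which is why `all.equal()` is usually the safer choice for computed doubles.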

### Variables and assignment

We can store values in variables using the assignment operator `<-`, like this:

In [11]:
x <- 1/40

Notice that assignment does not print a value. Instead, we stored it for later in something called a <b>variable</b>. `x` now contains the value `0.025`:

In [12]:
x
log(x)

In [13]:
# Notice also that variables can be reassigned:
x <- 100

`x` used to contain the value `0.025` and now it has the value `100`.

Assignment values can contain the variable being assigned to:

In [14]:
x <- x + 1  # notice how RStudio updates its description of x in the top-right tab
y <- x * 2

The right-hand side of the assignment can be any valid R expression. The right-hand side is fully evaluated before the assignment occurs.
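
As a small sketch of this evaluation order (the variable names `total` and `root` are made up for illustration), the expression on the right is computed with the current values first, and only then is the result stored:

```r
# The right-hand side is evaluated first, using the current value of `total`
total <- 10
total <- total * 2 + 1   # 10 * 2 + 1 is computed, then stored back in `total`
total

# Any valid expression can appear on the right, including function calls
root <- sqrt(total + 4)
root
```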

Variable names can contain letters, numbers, underscores and periods but no spaces. They must start with a letter or a period followed by a letter (they cannot start with a number or an underscore). Variables beginning with a period are hidden variables. Different people use different conventions for long variable names. These include:

+ periods.between.words
+ underscores_between_words
+ camelCaseToSeparateWords

What you use is up to you, but be <b>consistent</b>.

It is also possible to use the `=` operator for assignment:

In [15]:
x = 1/40

But this is much less common among R users. The most important thing is to be consistent with the operator you use. There are occasionally places where it is less confusing to use `<-` than `=`, and it is the most common symbol used in the community. So the recommendation is to use `<-`.

# Seeking Help

### Reading Help files

R, and every package, provides help files for functions. The general syntax to search for help on any function, `function_name`, that is in a package loaded into your namespace (your interactive R session) is:

```r
?function_name
help(function_name)
```

This will load up a help page in RStudio (or as plain text in R by itself).

Each help page is broken down into sections:

+ Description: An extended description of what the function does.
+ Usage: The arguments of the function and their default values.
+ Arguments: An explanation of the data each argument is expecting.
+ Details: Any important details to be aware of.
+ Value: The data the function returns.
+ See Also: Any related functions you might find useful.
+ Examples: Some examples for how to use the function.

Different functions might have different sections, but these are the main ones you should be aware of.

### Special Operators

To seek help on special operators, use quotes:

```r
?"<-"
```

### Getting help on packages

Many packages come with “vignettes”: tutorials and extended example documentation. Without any arguments, `vignette()` will list all vignettes for all installed packages; `vignette(package="package-name")` will list all available vignettes for `package-name`, and `vignette("vignette-name")` will open the specified vignette.

If a package doesn’t have any vignettes, you can usually find help by typing `help("package-name")`.

### When you kind of remember the function

If you’re not sure what package a function is in, or how exactly it is spelled, you can do a fuzzy search:

```r
??function_name
```
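
`??` searches the text of help pages. As a complementary sketch, base R's `apropos()` searches the names of objects on the search path by pattern, which can help when you remember only part of a function's name. The patterns used here are just examples.

```r
# List objects on the search path whose names start with "read."
read_functions <- apropos("^read\\.")
read_functions

# Matching ignores case by default, so this finds names containing "csv" or "CSV"
csv_functions <- apropos("csv")
csv_functions
```

Once a likely name turns up, `?name` opens its help page as described above.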

### When your code doesn’t work: seeking help from your peers

If you’re having trouble using a function, 9 times out of 10 the question you are asking has already been answered on Stack Overflow. You can search using the `[r]` tag.

# Data Structures

One of R’s most powerful features is its ability to deal with tabular data, such as you may already have in a spreadsheet or a CSV file.
Let’s start by making a toy dataset called `feline-data.csv` in your `r_data/` directory (the directory must already exist):

In [16]:
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))
write.csv(x = cats, file = "r_data/feline-data.csv", row.names = FALSE)

The contents of the new file, `feline-data.csv`:

In [17]:
read.csv("r_data/feline-data.csv")

| coat | weight | likes_string |
|---|---|---|
| `<chr>` | `<dbl>` | `<int>` |
| calico | 2.1 | 1 |
| black | 5.0 | 0 |
| tabby | 3.2 | 1 |


### What name can I assign to a variable?

You are free to assign almost any letter or name to a variable as long as it follows these rules:

+ does not contain spaces,
+ starts with a letter (and not a number),
+ contains only alphanumeric characters, underscores _, and dots .,
+ is not a reserved word.

You can see a list of reserved words by typing the following at the command line:

In [18]:
?reserved

## Core data types

There are three core data types in R: numeric (both integer and double), character, and logical. You can get an object’s type (also referred to as mode in R) using the `typeof()` function. R also has a built-in `mode()` function that serves a similar purpose, with the one exception that it does not distinguish integers from doubles: it reports both as `numeric`.
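
A quick sketch of the difference between `typeof()` and `mode()`, using made-up variable names:

```r
whole <- 4L        # the L suffix creates an integer
fraction <- 4.5    # a number with a decimal point is a double

typeof(whole)      # "integer"
typeof(fraction)   # "double"

# mode() groups both under the same label
mode(whole)        # "numeric"
mode(fraction)     # "numeric"

typeof("a")        # "character"
typeof(TRUE)       # "logical"
```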

### Numeric
The numeric data type is probably the simplest. It consists of numbers such as integers (e.g. whole numbers such as `1, -3, 33, 0`) or doubles (e.g. numbers with a decimal point such as `0.3, 12.4, -0.04, 1.0`).

In [19]:
x <- c(1.0, -3.4, 2, 140.1)
typeof(x)

In [20]:
x <- 4L
typeof(x)

### Logical
Logical values can take on one of two values: `TRUE` or `FALSE`. These can also be represented as `1` or `0`. For example, to create a logical vector of 4 elements, you can type:

In [21]:
x <- c(TRUE, FALSE, FALSE, TRUE)

In [22]:
x <- as.logical(c(1,0,0,1))

Note that in both cases, `typeof(x)` returns `"logical"`. Also note that the `1`s and `0`s in the last example are converted to `TRUE`s and `FALSE`s internally.
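
Because `TRUE` and `FALSE` behave like `1` and `0` in arithmetic, logical vectors can be counted and averaged directly. The following is a small sketch with a made-up vector named `passed`:

```r
passed <- c(TRUE, FALSE, FALSE, TRUE)

as.integer(passed)             # 1 0 0 1

n_passed <- sum(passed)        # number of TRUE values
share_passed <- mean(passed)   # proportion of TRUE values

n_passed
share_passed
```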

### Coercing from one data type to another

Data can be coerced from one type to another. For example, to coerce the following vector object from character to numeric, use the `as.double()` function.

In [23]:
y   <- c("23.8", "6", "100.01", "6")
as.double(y)

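The `as.double()` function coerces the vector `y` to double (numeric). Using `as.integer()` instead coerces it to integer, which drops the fractional part of each value: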

In [24]:
as.integer(y)

There are many other coercion functions in R. A summary of some of the most common ones follows:

![image4](r_images/image4.png)
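
For readers without access to the figure, here is a short sketch of a few common base R coercion functions. It is not a transcription of the figure, and the input values are made up for illustration.

```r
chars <- as.character(c(1.5, 2))                 # numbers to text
nums  <- as.numeric(c("3.2", "10"))              # text to doubles
ints  <- as.integer(c("23.8", "6"))              # text to integers (fraction dropped)
flags <- as.logical(c("TRUE", "false", "yes"))   # "yes" is not recognized

# Values that cannot be converted become NA (R also emits a warning,
# which is suppressed here to keep the output short)
bad <- suppressWarnings(as.numeric("abc"))

chars
nums
ints
flags
bad
```

Checking for `NA` with `is.na()` after a coercion is a simple way to spot values that did not convert.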

## Data structures

Most datasets we work with consist of batches of values such as a table of temperature values or a list of survey results. These batches are stored in R in one of several data structures. These include <b>(atomic) vectors</b> and <b>data frames</b>. Other data structures not explicitly covered in this workshop include matrices and lists.

![image5](r_images/image5.png)

### (Atomic) vector
The atomic vector (or vector for short) is the simplest data structure in R. It consists of an ordered set of values of the same type and/or class (e.g. numeric, character, etc.). This is the data structure we have worked with thus far. You can think of a vector as a single column of values in a spreadsheet. As such, one important property of a vector is that it cannot mix data types. For example, let’s mix double, integer and character in the vector variable `x`.

In [25]:
x <- c(1.2, 5L, "Rt", "2000")

R does not stop us from doing this (if it did, it would have returned an error message). However, if we pass `x` to the `typeof` function, we get:

In [26]:
typeof(x)

When data types are mixed in a vector, R will convert the element types to the highest common type following the order <b>logical < integer < double < character </b>. 
In our last example, character is the highest data type in this hierarchy thus forcing all elements in that vector to character.
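
The coercion order can be checked directly with `typeof()`. This is a minimal sketch; the vectors are made up for illustration.

```r
typeof(c(TRUE, 2L))     # logical + integer   -> "integer"
typeof(c(2L, 3.5))      # integer + double    -> "double"
typeof(c(3.5, "a"))     # double + character  -> "character"

# TRUE is converted to 1 when combined with numbers
mixed <- c(TRUE, 2L, 3.5)
mixed
```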

Elements of a vector can be accessed by index using square brackets. If you are interested in a range of indexed values, such as index 2 through 4, use the sequence operator, `:`.

In [27]:
x[2:4]
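
A few more indexing patterns, sketched on a made-up vector named `fruits`. Note that R indexes from 1, and a negative index drops elements rather than counting from the end.

```r
fruits <- c("apple", "banana", "cherry", "date", "elderberry")

first  <- fruits[1]         # a single element
middle <- fruits[2:4]       # a range of elements
ends   <- fruits[c(1, 5)]   # arbitrary positions
rest   <- fruits[-1]        # everything except the first element

first
middle
ends
rest
```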

### Dataframe
A dataframe is what comes closest to our perception of a data table. You can think of a dataframe as a collection of vector elements where each vector represents a column. As such, it’s important that the vectors have the same number of elements.

In [28]:
name <- c("a1", "a2", "b3")
col1 <- c(23, 4, 12)
col2 <- c(1, 45, 5)
dat  <- data.frame(name, col1, col2)
dat

| name | col1 | col2 |
|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` |
| a1 | 23 | 1 |
| a2 | 4 | 45 |
| b3 | 12 | 5 |


To view each column’s data type we’ll make use of a new function: the structure, `str`, function.

In [29]:
str(dat)

'data.frame':	3 obs. of  3 variables:
 $ name: chr  "a1" "a2" "b3"
 $ col1: num  23 4 12
 $ col2: num  1 45 5


You’ll notice that the `col1` and `col2` columns are stored as numeric (i.e. as doubles) and not as integer. There is some inconsistency in R’s characterization of data type. Here, numeric represents double whereas an integer datatype would display integer. For example:

In [30]:
col2 <- c(1L, 45L, 5L)
dat  <- data.frame(name, col1, col2)
str(dat)

'data.frame':	3 obs. of  3 variables:
 $ name: chr  "a1" "a2" "b3"
 $ col1: num  23 4 12
 $ col2: int  1 45 5


Data frames can also be constructed without needing to create separate vector objects.

In [31]:
dat  <- data.frame(name = c("a1", "a2", "b3"),
                   col1 = c(23, 4, 12),
                   col2 = c(1, 45, 5))
dat

| name | col1 | col2 |
|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` |
| a1 | 23 | 1 |
| a2 | 4 | 45 |
| b3 | 12 | 5 |


Like a vector, elements of a data frame can be accessed by their index (aka subscripts). The first index represents the row number and the second index represents the column number. For example, to list the value in the second row of the third column, type:

In [32]:
dat[2, 3]

If you wish to list all rows for columns one through two, leave the first index blank:

In [33]:
dat[ , 1:2]

| name | col1 |
|---|---|
| `<chr>` | `<dbl>` |
| a1 | 23 |
| a2 | 4 |
| b3 | 12 |


Or, if you wish to list the third row for all columns, leave the second index blank:

In [34]:
dat[3, ]

|   | name | col1 | col2 |
|---|---|---|---|
|   | `<chr>` | `<dbl>` | `<dbl>` |
| 3 | b3 | 12 | 5 |


To get the column names of a table, use the `names()` function.

In [35]:
names(dat)
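
Once the column names are known, they can be used in place of numeric indices. The sketch below rebuilds the same small `dat` data frame so that it runs on its own; the `$` operator and `dim()` are base R features not otherwise covered in this section.

```r
dat <- data.frame(name = c("a1", "a2", "b3"),
                  col1 = c(23, 4, 12),
                  col2 = c(1, 45, 5))

cell   <- dat[2, "col2"]              # same cell as dat[2, 3]
subset <- dat[, c("name", "col1")]    # same columns as dat[, 1:2]
column <- dat$col1                    # a single column, returned as a vector
size   <- dim(dat)                    # number of rows, then number of columns

cell
subset
column
size
```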

## R coding style guide

### Spacing

Spaces help improve readability. Add spaces around operators (this includes the assignment operator) and after commas.

Place a space before an open parenthesis/curly brace <b>except</b> when an open parenthesis is preceded with a function name. Place a space after a closed parenthesis/curly brace.

![image6](r_images/image6.png)
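
As a rough sketch of these spacing rules (this is not a reproduction of the figure, and the variable names are made up), compare the two versions below. Both run and give the same result; only readability differs.

```r
# Harder to read: no spaces around operators, after commas, or before braces
scores<-c(2,4,6)
if(mean(scores)>3){label<-"high"}

# Easier to read: spaces around operators, after commas, and before ( and {
# but no space between a function name and its opening parenthesis
scores <- c(2, 4, 6)
if (mean(scores) > 3) {
  label <- "high"
}
label
```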

### Parentheses
Use parentheses to isolate conditional statements. Do not wrap overall statements with parentheses.

![image7](r_images/image7.png)
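
One way to read this guideline is sketched below. It is an interpretation rather than a copy of the figure, and the variable names are made up.

```r
temperature <- 22
humidity <- 40

# Parentheses make each condition easy to pick out
comfortable <- (temperature > 18) & (humidity < 60)
comfortable

# Avoid wrapping the whole statement in an extra pair of parentheses, e.g.
# (comfortable <- temperature > 18 & humidity < 60)
```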

### Comments
Comments allow the user to document parts of the code without the comments being interpreted by R as code. All comments are preceded by the `#` character.

In [36]:
# Assign three values to p
p <- c(23, 1.2, 5)

Comments should be used to isolate key steps in a workflow. But they should not be used to document each and every line of code (except when used in an instructional setting).

An empty line should be placed before the comment but not after. A space should separate the first letter of a comment and the `#` symbol.

![image8](r_images/image8.png)

## Reading a data file

A popular (and universal) data file format is the comma-separated values format, known as a CSV file. To open a CSV data table in R, use the `read.csv()` function. In the next example, we will load the registrar’s course schedule for the Spring of 2019. But first, we will need to let our R session know where to find the data file. We’ll make use of RStudio’s interface to specify our working directory. In the menu bar, navigate to <b>Session >> Set Working Directory >> Choose Directory</b> and select the folder where you have the `SP1819.csv` file saved. Or, if you are familiar with directory structures, you can type the full path in R using the `setwd()` function as in:

```r
# On Windows ...
setwd("C:/Users/ck/Workshop/Data")

# On Macs ...
setwd("/Users/ck/Workshop/Data")
```
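
Before reading a file, it can help to confirm where R is currently looking. The sketch below uses base R's `getwd()`, `file.path()`, and `file.exists()`; the relative path matches the one used in the next cell and is assumed to sit under your working directory.

```r
# Show the current working directory
current_dir <- getwd()
current_dir

# Build the path in a way that works on Windows, macOS, and Linux
data_path <- file.path("r_data", "SP1819.csv")
data_path

# Check that the file is reachable before trying to read it
if (file.exists(data_path)) {
  message("Found ", data_path)
} else {
  message("Could not find ", data_path, " from ", current_dir)
}
```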

Next, we’ll open the data file and store its contents in an object we’ll name `dat`.

In [38]:
dat <- read.csv("r_data/SP1819.csv", stringsAsFactors = FALSE)

Now, identify the data types associated with each variable (aka column).

In [39]:
str(dat)

'data.frame':	717 obs. of  13 variables:
 $ Course   : chr  "AA118" "AA223" "AA223" "AA231" ...
 $ Section  : chr  "A" "A" "A" "A" ...
 $ Cr       : chr  "2" "4" "4" "4" ...
 $ Days     : chr  "TR" "TR" "TR" "MW" ...
 $ Times    : chr  " 2:30pm- 4:00pm" " 8:00am- 9:30am" " 9:45am-10:45am" "11:00am-12:15pm" ...
 $ Title    : chr  "Dance Technique Lab: Dance Forms of the African Diaspora: Hip-hop (See TD118)" "Critical Race Feminisms and Tap Dance (See WG223)" "Critical Race Feminisms and Tap Dance (See WG223)" "Caribbean Cultures (See AY231)" ...
 $ DistReq  : chr  "" "A" "A" "" ...
 $ Diversity: chr  " " "U " "U " "I " ...
 $ Room     : chr  " " " " " " " " ...
 $ Reg      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Max      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ Exam     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ Faculty  : chr  "Akuchu     " "Thomas, S  " "Thomas, S  " "Bhimull    " ...
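
Notice that `Cr` was read as character rather than integer. A column is read as character when at least one of its entries is not a plain number. The sketch below illustrates this with a tiny inline CSV; the rows are hypothetical (in particular the `XX999` row and its `var` entry are invented and are not taken from the real file).

```r
# Hypothetical miniature of the schedule, supplied as text instead of a file
csv_text <- "Course,Cr,Reg
AA118,2,0
AA223,4,0
XX999,var,5"

courses <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(courses)

# Coerce the credits to integer; entries that are not numbers become NA
# (the coercion warning is suppressed to keep the output short)
credits <- suppressWarnings(as.integer(courses$Cr))
credits
```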


## Control Flow

There are several ways you can control flow in R. For conditional statements, the most commonly used approaches are the constructs:

```r
# if
if (condition is true) {
  perform action
}

# if ... else
if (condition is true) {
  perform action
} else {  # that is, if the condition is false,
  perform alternative action
}
```

Say, for example, that we want R to print a message if a variable `x` has a particular value:

In [41]:
x <- 8

if (x >= 10) {
  print("x is greater than or equal to 10")
}

x

The print statement does not appear in the console because `x` is not greater than or equal to 10; only the value of `x` is shown. To print a different message for numbers less than 10, we can add an `else` statement.

In [42]:
x <- 8

if (x >= 10) {
  print("x is greater than or equal to 10")
} else {
  print("x is less than 10")
}

[1] "x is less than 10"


You can also test multiple conditions by using `else if`.

In [43]:
x <- 8

if (x >= 10) {
  print("x is greater than or equal to 10")
} else if (x > 5) {
  print("x is greater than 5, but less than 10")
} else {
  print("x is less than or equal to 5")
}

[1] "x is greater than 5, but less than 10"


<b>Important</b>: when R evaluates the condition inside `if()` statements, it is looking for a logical element, i.e., `TRUE` or `FALSE`. This can cause some headaches for beginners. For example:

In [44]:
x <- 4 == 3
if (x) {
  "4 equals 3"
} else {
  "4 does not equal 3"
}

As we can see, the “does not equal” message is returned because `x` is `FALSE`.

In [45]:
x <- 4 == 3
x
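
A related stumbling block is a condition that produces more than one logical value. `if()` expects a single `TRUE` or `FALSE`, so a comparison on a longer vector should be collapsed first, for example with base R's `any()` or `all()`. The sketch below uses a made-up vector named `readings`.

```r
readings <- c(3, 12, 7)

# This comparison yields one logical value per element, not a single value
readings >= 10

# any() is TRUE if at least one element satisfies the condition
if (any(readings >= 10)) {
  status <- "at least one reading is 10 or more"
} else {
  status <- "all readings are below 10"
}
status

# all() is TRUE only if every element satisfies the condition
every_high <- all(readings >= 10)
every_high
```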

## Repeating operations

If you want to iterate over a set of values, when the order of iteration is important, and perform the same operation on each, a `for()` loop will do the job. We saw `for()` loops in the shell lessons earlier.
This is the most flexible of looping operations, but therefore also the hardest to use correctly. In general, the advice of many `R` users is to learn about `for()` loops, but to avoid using them unless the order of iteration is important, i.e. the calculation at each iteration depends on the results of previous iterations. If the order of iteration is not important, then you should learn about vectorized alternatives, such as the `purrr` package, as they pay off in computational efficiency.

The basic structure of a `for()` loop is:

```r
for (iterator in set of values) {
  do a thing
}
```

For example:

In [46]:
for (i in 1:10) {
  print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10


The `1:10` bit creates a vector on the fly; you can iterate over any other vector as well.
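
To tie this back to the advice about when a loop is needed, here is a small sketch with a made-up vector named `weights`. The running total depends on the previous iteration, so a loop is a natural fit; doubling each value does not, so a vectorized expression is enough.

```r
weights <- c(2.1, 5.0, 3.2)

# Order matters: each step builds on the total from the previous step
running_total <- numeric(length(weights))
total <- 0
for (i in seq_along(weights)) {
  total <- total + weights[i]
  running_total[i] <- total
}
running_total

# Order does not matter: operate on the whole vector at once
doubled <- weights * 2
doubled
```

`seq_along(weights)` is used instead of `1:length(weights)` because it also behaves sensibly when the vector is empty.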