# Lab 0 Tutorial

## Prerequisite: load required packages

In [1]:
library(tidyverse) #tidyverse is a collection of packages for working with data

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.0     ✔ dplyr   1.0.4
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

“package ‘ggplot2’ was built under R version 4.0.3”
“package ‘tibble’ was built under R version 4.0.3”
“package ‘tidyr’ was built under R version 4.0.3”
“package ‘readr’ was built under R version 4.0.3”
“package ‘dplyr’ was built under R version 4.0.3”
“package ‘forcats’ was built under R version 4.0.3”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Some R basics

### Objects
You'll often want to store values as objects (with descriptive names) for later use. 

The syntax for this is:
object_name <- object_value

It's good practice to do this rather than "hard-code" numbers and names all over your scripts. Hard-coding means repeatedly typing a value rather than assigning it to an object and using that object instead. Hard-coded values are hard to read because "10" means less to a reader than "sample_size." They are also hard to modify. If you had stored the value in an object and wanted to change it from 10 to 15, you'd only have to modify one line of code (the line where you initialize the object, AKA first assign it a value). If you had hard-coded it, you'd have to go through and replace every 10 with a 15, except in cases where the 10 is there for a different reason and doesn't refer to the sample size (which is hard to tell, again because raw values are harder to read and interpret than names).
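
Here is a small sketch of the difference. The object names and numbers are made up for illustration.

```r
# Hard-coded: the reader has to guess what each 10 means
total_mass_hc <- 10 * 101.5   # is this 10 the sample size?
n_days_hc <- 10 + 4           # ...or is this one?

# Named objects: each value is set once and then reused by name
sample_size <- 10             # number of individuals (hypothetical)
mean_mass <- 101.5            # mean mass per individual in mg (hypothetical)
total_mass <- sample_size * mean_mass
total_mass                    # print total_mass
```

To change the sample size to 15 in the second version, you would only edit the line that initializes sample_size.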

I'll discuss some basic classes of objects: numeric, character, and logical.
Numeric refers to numbers.
Character refers to text; the values go in quotes to differentiate them from object names.
Logical is either TRUE or FALSE and will be useful for performing operations only in certain cases (it's the basis of subsetting datasets).

In [2]:
#Examples of each of these three classes

age <- 20 #age in days
class(age)

species <- "Melanoplus boulderensis" #species name (Genus species)
class(species)

is_ctrl <- FALSE #whether or not the individual was in the control treatment (this individual was in the test treatment)
class(is_ctrl)

Imagine you have data on pillbugs. You store your observed value for the mass of an individual pillbug in an object, and you pick a concise, descriptive name for this object. When you call this object, the value is displayed. The class of this object is 'numeric'.

In [3]:
PB_mass <- 107.2 #the mass of an individual pillbug in mg
PB_mass #print PB_mass
class(PB_mass) #print class of PB_mass

You can also modify this object (or use it to create a new object if you want to preserve the original, as well). Let's say you now want to change the units of mass. 

In [4]:
PB_mass <- PB_mass/1000 #the mass of an individual pillbug, now in g; overwrites the old value of PB_mass
PB_mass #print the updated PB_mass

#note: if you wanted separate objects for both units, you could have instead run: PB_mass_g <- PB_mass/1000

But rarely is one observation all you have. Let's say you're working with the masses of 10 pillbugs now. Instead of storing these masses in 10 objects, you can store them in a single object: a vector. A vector is a one-dimensional collection of values all of the same class (in this case, numeric).

In [5]:
PB_mass_vec <- c(107.2, 101.0, 89.4, 96.8, 104.3, 102.8, 112.5, 108.1, 101.6, 105.9) #the mass of 10 pillbugs in mg
PB_mass_vec #print PB_mass_vec
class(PB_mass_vec) #print class of PB_mass_vec

An array is similar to a vector, but it can be multidimensional (for a 2-dimensional array, called a matrix, imagine the mass of 10 pillbugs each measured over 15 days). A data frame is sort of like a matrix, but it is more useful for working with data. One big difference is that columns have names that they can be accessed by and different columns can contain different classes of data. You'll encounter one when we work with data later in this tutorial.
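
As a rough sketch of the difference between a matrix and a data frame (all names and values below are made up for illustration):

```r
# A matrix: every value shares one class (here numeric)
# 2 pillbugs (rows) measured on 3 days (columns); matrix() fills column by column
mass_mat <- matrix(c(107.2, 101.0,
                     108.0, 101.4,
                     108.9, 102.1),
                   nrow = 2, ncol = 3)
dim(mass_mat)          # number of rows, then number of columns

# A data frame: columns have names and each column can have its own class
pb_df <- data.frame(
  id      = c("pb1", "pb2"),    # character
  mass_mg = c(107.2, 101.0),    # numeric
  is_ctrl = c(TRUE, FALSE),     # logical
  stringsAsFactors = FALSE      # keep text as character in older versions of R
)
class(pb_df$id)        # class of the id column
class(pb_df$mass_mg)   # class of the mass_mg column
class(pb_df$is_ctrl)   # class of the is_ctrl column
```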

### Functions

A function typically takes in inputs (AKA arguments) and gives you an output.
The syntax of calling a function is as follows (this function has two arguments but functions can have any number):

func_name(arg1, arg2)

If you want to store the output of the function as an object (you often do), you can:

obj_name <- func_name(arg1, arg2)

It is important to understand the functions you're using. What inputs does a function want (a single numerical value? a vector with values of any class?)? Does it care about the order of those inputs? How exactly does it manipulate those inputs and what sort of output does it give? You can read documentation online (just google the function's name and R). Often at the end there will be examples of the function in action that may also help you understand. Writing your own simple examples is another great way to figure things out. 

At some point it is inevitable that you'll call a function and it won't run -- instead you'll get an error. The wording of the error may not make sense to you or be very helpful. If you copy and paste that wording into google, you'll find that someone has encountered this error before and asked about it on https://stackoverflow.com/. Definitely don't be afraid to google things! Becoming a skilled R programmer isn't so much not needing to google things anymore as it is being better at figuring out what to search and being better at interpreting the answers you find online.

Here's an example with the function rep(), which makes a vector by repeating a single value. rep() takes in two values and the order of these values matters. The first is the value to be repeated and the second is the number of times to repeat it. 
Here's a link to the documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rep and you can also type ?rep into a code cell.

In [6]:
rep(2,5) #make a vector by repeating 2 five times 
rep(5,2) #make a vector by repeating 5 two times

### Operators

These are like functions but have different syntax.
-Arithmetic operators: +, -, *, /, ^
-Relational operators: <, >, <=, >=, == ("equals"), != ("does not equal")
-Logical operators: & ("and"), | ("or"), ! ("not")
-Assignment operators: <-, = (for assigning arguments within a function)
-The pipe operator: %>% (“and then”)
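
Here is a short sketch of most of these operators in action (the pipe gets its own section below). The value of mass_mg is made up.

```r
mass_mg <- 107.2                  # assignment with <-
mass_g <- mass_mg / 1000          # arithmetic: division

mass_mg > 100                     # relational: is the mass greater than 100?
mass_mg == 100                    # relational: "equals" (note the two equals signs)
mass_mg > 100 & mass_mg < 110     # logical "and": both sides must be TRUE
mass_mg < 100 | mass_mg > 110     # logical "or": at least one side must be TRUE
!(mass_mg > 100)                  # logical "not": flips TRUE to FALSE

mass_g_rounded <- round(mass_g, digits = 2)  # = assigns a value to the argument digits
mass_g_rounded
```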

### Exercise 1

-Initialize an object with a meaningful name (i.e. not num <- 20 but age <- 20).

-Use an operator on this object either to modify that object or to create a new object.

-Remember to comment your code!


In [7]:
#YOUR CODE HERE!

### The pipe operator: %>%

The pipe operator is part of tidyverse syntax and is quite useful for working with data frames (a class for datasets in R).
It lets you write code that is easier to read and modify, because you can chain functions without the clutter of nested parentheses.
You always start with the object you're working on (usually a data frame, in this example a vector) and then apply functions to it. Read %>% as “and then”: reading top to bottom and left to right, the result of each step is the input to the next (it would go in the parentheses of the next function).

One example of how it's easier to modify: if I decided I didn't actually want to use the exponential function, I could just comment out or delete that line.

In [8]:
# Initialize x
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907) #create arbitrary vector


# Snippet 1
round(exp(diff(log(x))), 1) #round the exponential of the lagged difference of the log of x to 1 decimal place

# Snippet 2
x %>% #ok so you're working with x and then
  log() %>% #you take the log of it and then
  diff() %>% #you take the lagged difference of that and then
  exp() %>% #you take the exponential of that and then
  round(1) #you round to one decimal place

### On packages
Packages are groups of functions that do a specific task. If you need to do a specific type of analysis, somebody else might have done it already and created a lot of useful functions. It is worth looking for a relevant R package before writing a lot of new code yourself. Packages and the community of people (including many biologists) who make them are a great feature of R! 

The packages you need should already be installed on Jupyterhub, but you’ll need to load them into your Jupyter notebook. We did this with tidyverse at the beginning of this tutorial. The general form is: library(name_of_package)

## Working with data

### Reading in the data

You'll often deal with data in comma separated values (csv) files. Within the folder you have for a project (where your R scripts live), it's good practice to have a data folder where these csvs can go. Inside the read.csv function, you put the path from this jupyter notebook to the csv you want to access. In this case that is "data/crabs.csv" since we're in the lab0 folder and crabs.csv is just nested one folder deeper in data. This approach should work fine for you in this class, but sometime in the future you may encounter more complex file structures. For (totally optional) further reading on more complex paths, see this: https://ssc.wisc.edu/sscc/pubs/R_intro/book/1-10-paths-and-working-directories.html.
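
To see how the path works without touching the lab data, here is a minimal sketch that writes a tiny made-up csv into a temporary data folder and reads it back in. The file name toy.csv and its contents are hypothetical.

```r
# Make a temporary "data" folder and write a tiny made-up csv into it
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
toy <- data.frame(id = 1:3, mass_mg = c(107.2, 101.0, 89.4))
write.csv(toy, file.path(data_dir, "toy.csv"), row.names = FALSE)

# Read it back in: the argument to read.csv is the path to the file
toy_in <- read.csv(file.path(data_dir, "toy.csv"))
toy_in   # print the data frame
```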

Read in the data

In [9]:
#1)
#YOUR CODE HERE


### Exploring the data frame
How many rows and columns are there?

In [10]:
#2.1-2.3)
#YOUR CODE HERE


What are the names of the columns (the variables)?

In [11]:
#2.4)
#YOUR CODE HERE


Looking at the whole data frame can be a lot. Let's just look at the first few rows.

In [12]:
#2.5)
#YOUR CODE HERE


### Accessing variables
Use the dollar sign ($) to access the columns by name.

In [13]:
#3)
#YOUR CODE HERE


### Summarizing data frames
There are many useful functions for summarizing data. Summarize the frontal lobe size (FL) data by calculating the following statistics: mean, median, maximum, minimum, and standard deviation.
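
As a sketch of what these functions look like on a made-up column (toy and len_mm are hypothetical names, not part of the crabs data):

```r
toy <- data.frame(len_mm = c(8, 10, 12, 14, 16))  # made-up measurements

mean(toy$len_mm)     # arithmetic mean
median(toy$len_mm)   # middle value
max(toy$len_mm)      # largest value
min(toy$len_mm)      # smallest value
sd(toy$len_mm)       # sample standard deviation
```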

In [14]:
#4)
#YOUR CODE HERE


We can also create a table summarizing, for example, the mean by sex. This uses the function summarize() and the function group_by(). Let's start with a very simple example to introduce you to summarize().

In [15]:
#what is the mean BD?

#fairly equivalent

mean(crabs$BD) #class: numeric
#OR
crabs %>% summarize(meanBD=mean(BD)) #class: data.frame



Now we incorporate group_by in order to get the mean by sex.
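
A minimal sketch of the pattern on a made-up data frame (toy, site, and len_mm are hypothetical names, not part of the crabs data):

```r
library(dplyr)   # provides %>%, group_by(), and summarize()

# Made-up data: two sites, two measurements each
toy <- data.frame(
  site   = c("a", "a", "b", "b"),
  len_mm = c(10, 12, 20, 24)
)

toy_means <- toy %>%
  group_by(site) %>%                   # split the rows into groups by site and then
  summarize(mean_len = mean(len_mm))   # calculate one mean per group
toy_means
```

The result has one row per group, so here it has two rows.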

In [None]:
#5.1)
#YOUR CODE HERE


Now find the mean by species.

In [None]:
#5.2)
#YOUR CODE HERE (BRAINSTORM)


Now find the mean for each combination of species and sex.

In [None]:
#5.3)
#YOUR CODE HERE (BRAINSTORM)


What do you notice about the means? What hypotheses does this table generate that you could test statistically?

5.4 
YOUR THOUGHTS HERE (BRAINSTORM)


Note that you can have multiple statistics in one place. Here I add the standard deviation and a sample size for each group.

In [None]:
#5.5)
#YOUR CODE HERE


### Subsetting
We use the filter function to extract a subset of all the observations (rows) that meet our criteria.
Note the use of relational operators (and a logical operator in 6.3).
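
Here is a sketch of the same idea on a made-up data frame (toy, site, and len_mm are hypothetical names, not part of the crabs data):

```r
library(dplyr)   # provides %>% and filter()

toy <- data.frame(
  site   = c("a", "a", "b", "b"),
  len_mm = c(10, 12, 20, 24)
)

toy %>% filter(site == "b")     # rows where site equals "b"
toy %>% filter(len_mm >= 12)    # rows with a len_mm value of 12 or greater

# Combine two criteria with & ("and"): both must be TRUE for a row to be kept
toy_sub <- toy %>% filter(site == "b" & len_mm >= 22)
toy_sub
```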

Retrieve data for males only

In [None]:
#6.1)
#YOUR CODE HERE


Retrieve data for individuals with an FL value of 20 or greater

In [None]:
#6.2)
#YOUR CODE HERE


Retrieve data for males with an FL value of 20 or greater

In [None]:
#6.3)
#YOUR CODE HERE


### Creating new columns
Use the function mutate() to create a new column from existing ones.
Add a new column that is the sum of FL and RW.
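
As a sketch on made-up data (toy, len_mm, and width_mm are hypothetical names), mutate() takes the form new_column_name = expression:

```r
library(dplyr)   # provides %>% and mutate()

toy <- data.frame(len_mm = c(10, 12), width_mm = c(4, 5))  # made-up data

# Add a column that is calculated row by row from two existing columns
toy_sum <- toy %>% mutate(len_plus_width = len_mm + width_mm)
toy_sum
```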

In [None]:
#7)
#YOUR CODE HERE


### Dropping columns
Sometimes you don't plan to use all the columns. You can use select() to drop some either by saying which you want to keep or which you want to omit.
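
Both styles are sketched here on a made-up data frame (toy and its column names are hypothetical):

```r
library(dplyr)   # provides %>% and select()

toy <- data.frame(id = 1:2, len_mm = c(10, 12), width_mm = c(4, 5))

toy_keep <- toy %>% select(id, len_mm)   # say which columns you want to keep
toy_drop <- toy %>% select(-width_mm)    # or say which column to omit with a minus sign

names(toy_keep)
names(toy_drop)
```

In this case both approaches leave the same two columns.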

In [None]:
#8.1)
#YOUR CODE HERE


In [None]:
#8.2)
#YOUR CODE HERE


OK, so we've gotten to the end of manipulating a single data frame. Here's a good moment to note that if you modify a data frame and don't assign the result to an object, then the modification is lost after that line. For example, if I ran just crabs %>% select(-index) and then called crabs, crabs would be unchanged. If I want to use a data frame without index in the future, I need to assign it either to crabs (modifying the original data frame directly) or to some other object name like crabs_new (leaving the crabs data frame unchanged).

It's only out of convenience that I haven't assigned the results to objects in the scenarios above. I wanted crabs to stay unmodified so I could go through different examples, and coming up with new object names for these modified data frames was unappealing (plus if you don't assign it, it displays automatically when you run the code). But you will typically be assigning to objects!
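
A small sketch of this behavior on a made-up data frame (toy and toy_new are hypothetical names):

```r
library(dplyr)   # provides %>% and select()

toy <- data.frame(id = 1:2, len_mm = c(10, 12))

toy %>% select(-id)   # the result is displayed but not stored anywhere
names(toy)            # toy still has both columns

toy_new <- toy %>% select(-id)   # the result is stored under a new name
names(toy_new)                   # toy_new has only len_mm
```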

### Exercise 2
-Read in the crabs data file and assign it the name “crabs_og”
-Drop BD and name the new data frame “crabs_mod”
-In crabs_mod, create a new column containing the product of CL and CW for each individual and call it “CLxCW”
-Use subsetting to create a new data frame of the females that have an FL value of 15 or higher and call it “fem.lg.crabs”
-Calculate the mean of CLxCW for this subset by species

In [None]:
#YOUR CODE HERE

### Plotting data
ggplot is a versatile tool for making beautiful plots. 

Here, we'll make a simple scatter plot of RW over BD
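
As a rough sketch of the basic pattern on made-up data (toy, x_var, and y_var are hypothetical names): you give ggplot() a data frame, map columns to the axes with aes(), and add a layer with +.

```r
library(ggplot2)

toy <- data.frame(x_var = c(1, 2, 3, 4), y_var = c(2.1, 3.9, 6.2, 8.0))

p <- ggplot(data = toy, mapping = aes(x = x_var, y = y_var)) +  # map columns to the axes
  geom_point()                                                  # draw one point per row
p   # display the plot
```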

In [None]:
#9.1)
#YOUR CODE HERE


Now let's add an aesthetic mapping so that color corresponds to species.

In [None]:
#9.2)
#YOUR CODE HERE



Aw we can do better than that -- let's make blue be blue and orange be orange. Let's also make the x and y axes more descriptive.

In [None]:
#9.3)
#YOUR CODE HERE


### Linear modeling
-Linear models are used to reveal associations between variables

-They contain a single dependent variable (y) and one or more independent variables (x)

-The linear model predicts the value of the dependent variable given the values of each of the independent variables

To create a linear model we need to estimate the y-intercept (B0, the value of y where the independent variable x = 0) and the slope (B1) in the following equation:
y = B0 + B1*x
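
Here is a minimal sketch using base R's lm() on made-up data that roughly follow y = 2 + 3*x (toy, x, and y are hypothetical names, not part of the crabs data):

```r
# Made-up data: y is close to 2 + 3*x with a little noise
toy <- data.frame(x = 1:6,
                  y = c(5.1, 7.9, 11.2, 13.8, 17.1, 19.9))

toy_lm <- lm(y ~ x, data = toy)   # the formula reads: dependent ~ independent
coef(toy_lm)                      # estimates of B0 ("(Intercept)") and B1 ("x")
```

The estimates should land close to the intercept of 2 and slope of 3 that the data were built around.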

Now create a linear model where RW is the dependent variable and BD is the independent variable. Use summary to view the output.

In [None]:
#10)
#YOUR CODE HERE


In [None]:
#11)
#YOUR CODE HERE


### Exercise 3
-Create a linear model with body depth (BD) as the dependent variable and carapace width (CW) as the independent variable
-Write out the linear model equation using the Y intercept and slope from the model summary
-Write out the R2 and p value
-Plot BD as a function of CW and add the model line to the plot
-If you finish early, try making your plot more interesting (axis labels, color, etc.)

In [None]:
#YOUR CODE HERE