ecology_r.md

Introduction

R

  • R and RStudio are two different pieces of software.
    • Describe the differences
  • In R there is no pointing and clicking, and that's a good thing ...
  • R allows you to reproduce your work faster.
  • R has over 10,000 packages.
  • R was made for data, and it scales.
  • R produces high resolution graphics that are ready for publication.
  • R is free, open-source, cross platform, and community supported.

RStudio

  • RStudio is an Integrated Development Environment (IDE).
  • RStudio can: write code, navigate files, inspect variables, visualize plots, integrate with Git for version control, develop packages, and write shiny apps.
  • It is separate from R, but gives us a nicer way to interact with R.
  • Show the "Help," "Plots," and "Files" tabs.
  • Show the "Console" window.
    • Console commands are forgotten after your session.
  • Show the "Script" window.
    • You can save script commands for later.
    • Ctrl+Enter shortcut for testing a line of code in the console.

Project directory

  • Good practice to contain the whole project in a single "working directory."
  • Relative paths are portable.
  • Absolute paths are not.
  • It's a good idea to have separate folders for raw and clean data, and for scripts.
  • It's a good idea to determine a system you like and keep it consistent across projects.
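
As a minimal sketch of the difference between relative and absolute paths, the snippet below builds a relative path with file.path() and then expands it against the current working directory. The file name surveys.csv is hypothetical and used only for illustration.

```r
# Relative path: interpreted from the project's working directory.
# "surveys.csv" is a hypothetical file name used only for illustration.
rel_path <- file.path("data", "surveys.csv")

# Absolute path: built here by prefixing the current working directory,
# so it differs from one computer to the next.
abs_path <- file.path(getwd(), rel_path)

rel_path
abs_path
```

The relative path stays the same wherever the project folder is copied, which is why it is the portable choice.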

Creating Objects in R

  • Let's start a new project: File -> New Project -> New Directory -> New Project -> [ Give your project a descriptive name ] -> Create Project
  • Under the "Files" tab, create a new folder called data.
  • To begin with, R will do most math operations if you type them into the console.
  • > is the prompt. It means R is ready to accept input.
  • + means that R is expecting more input and that your command is not complete yet (this allows for multi-line commands).
  • Two ways of coding in RStudio
    • scripts (saves work)
      • keyboard shortcut Ctrl + Enter
    • console window
3 + 5
12/7
  • To do useful things, we need to assign values to objects
  • The assignment operator in R is <-
  • You can also use =, but historically R has used <-
    • Careful: = is used to pass arguments within a function which is different than assignment!

In many programming languages, data is stored in variables. R uses the term object due to differences in the way memory is managed in R.
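
The difference between <- and = inside a function call can be shown with a short sketch. The object names here are chosen only for illustration.

```r
x <- 5

# Inside a call, = only names the argument; the object x is untouched.
result_named <- mean(x = c(1, 2, 3))
x_after_named <- x              # still 5

# Inside a call, <- performs an assignment first, so x is overwritten.
result_assigned <- mean(x <- c(10, 20, 30))

result_named                    # 2
result_assigned                 # 20
x                               # 10 20 30
```

This is one reason to reserve <- for assignment and = for naming arguments.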

!!!REVIEW AT BEGINNING OF SECOND WORKSHOP!!!

weight_kg <- 55
  • Shortcut for <- is Alt + -

  • Object names are case-sensitive

  • It's best to avoid . in object names

  • When assigning a value to an object, R doesn't print anything

  • You can force it to by wrapping the assignment in parentheses

weight_kg <- 55
(weight_kg <- 55)
  • You can also check the value stored in an object by retyping its name
weight_kg
  • We can use objects to do arithmetic
2.2 * weight_kg
weight_kg
  • We can change the value of a variable by assigning it a new value
weight_kg <- 57.5
2.2 * weight_kg
  • We can store the results in a new variable

!!!REVIEW AT BEGINNING OF SECOND WORKSHOP!!!

weight_lb <- 2.2 * weight_kg
weight_lb
  • If I now change the value of weight_kg, it will not change the value saved in weight_lb
weight_kg <- 100
weight_lb
  • Anything after # is treated as a comment in R

  • The computer will ignore that line and it is for your reference

  • Making comments is a good practice as it provides reference for others (or yourself in six months) as to what is happening in your code

  • RStudio has the shortcut Ctrl+Shift+C for commenting multiple lines at once when scripting.

Demonstrate Ctrl + Shift + C on the following lines of code:

mass <- 47.5
age <- 122
mass <- mass * 2.0
age <- age - 20
mass_index <- mass/age

Functions and their Arguments

  • Functions are independent modules of code that can be called from your script/console
  • They save us time so that we don't have to rewrite a piece of code if we need to use it multiple times
  • sqrt() is an example of a function in R
sqrt(2)
  • The inputs of a function are called arguments

  • If the function returns something, it returns a value

  • We call a function when we want to execute the code in it

  • Functions are not limited to working with numbers

  • Functions can take as arguments and return almost any object we can create within the R environment

  • Functions can also take multiple arguments, like round()

round(3.14159)
  • round() defaults to rounding to the nearest whole number

  • We can change this by changing the number of arguments we give round()

  • Let's find out more about round()'s arguments

args(round)
  • We can look up the manual entry for a specific function.
?round
  • We can make general help searches if we don't know the function name.
??ceiling
  • RStudio has great built-in help, but sometimes a quick Google or Stack Exchange search is the best place to get help with examples.

  • From the output of args(round), we see that the digits argument defaults to 0

  • We can look up more details about the function by using a ?

?round
  • Now we can round to two digits
round(3.14159, 2)
  • I can also label the arguments that I pass to a function.
round(x = 3.14159, digits = 2)
  • If I explicitly name all the arguments, then I can list them in any order that I want
round(digits = 2, x = 3.14159)
  • It's good practice to name at least the optional arguments so that somebody else can read your code without referencing the help manual
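
The same ideas apply to other base R functions. As a small sketch, seq() takes several arguments that can be passed by position or by name, and toupper() shows a function that works on text rather than numbers.

```r
# Positional arguments: from, to, by.
by_position <- seq(1, 10, 3)

# Named arguments can be given in any order.
by_name <- seq(by = 3, to = 10, from = 1)

by_position                     # 1 4 7 10
identical(by_position, by_name) # TRUE

# Functions are not limited to numbers.
shouted <- toupper("mouse")
shouted                         # "MOUSE"
```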

Vectors and Data Types

  • In R (or any programming language), we organize our data into abstract structures called data structures
  • The most common data structure in R is the vector
  • A vector is a series of values that can be either characters or numbers
  • We stitch a series of values together by using the c() function

!!!REVIEW AT BEGINNING OF SECOND WORKSHOP!!!

weight_g <- c(50, 60, 65, 82)
weight_g
  • Vectors can also contain characters
animals <- c("mouse", "rat", "dog")
animals
  • The quotes tell R that this is a string and not a variable

  • There are many functions that help us work with vectors

  • length() tells us how long a vector is

length(weight_g)
length(animals)
  • class() tells us what type of objects are in the vector
class(weight_g)
class(animals)
  • We can use c() to add another element to a vector
weight_g <- c(weight_g, 90)   # add to the end of the vector
weight_g <- c(30, weight_g)   # add to the beginning of the vector
weight_g
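
Because a vector holds a single data type, mixing numbers and strings in c() makes R convert everything to character. A quick sketch, with values chosen only for illustration:

```r
# Numbers and strings combined in one vector.
mixed <- c(1, 2, "three")

mixed          # "1" "2" "three"
class(mixed)   # "character"
```

This is worth checking with class() whenever arithmetic on a vector fails unexpectedly.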

!!!REVIEW AT BEGINNING OF SECOND WORKSHOP!!!

  • We reference values within a vector using [ ] to indicate index

Notice that R uses 1-based indexing: the first element is at index 1.

animals <- c(animals, "cat")
animals
animals[2]
animals[c(3, 2)]
  • We can pull the same element out multiple times
more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals
  • We can also index elements by using logical values

In R, a logical value is a value that is either true or false

weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
  • You rarely type logical values out by hand; they are usually the output of another statement
  • For example, suppose you want all the weights in weight_g that are over 50

!!!REVIEW AT BEGINNING OF SECOND WORKSHOP!!!

weight_g > 50
weight_g[weight_g > 50]
  • We can use the AND (&) or the OR (|) tests to combine multiple statements

    • AND: both statements must be true
    • OR: one of the statements must be true
  • Suppose we want all the weights that are less than 30 grams or greater than 50 grams

weight_g[weight_g < 30 | weight_g > 50]
  • Suppose we want all the weights greater than or equal to 30, or equal to 21
weight_g[weight_g >= 30 | weight_g == 21]
  • == tests for equality; it does not assign a value the way a single = can

Challenge: Use & to return all the values in weight_g that are greater than 30 and less than 50.

weight_g[weight_g > 30 & weight_g < 50]
  • We can use these kinds of statements to search for whether or not a vector has certain elements
animals <- c("mouse", "rat", "dog", "cat")
animals[animals == "cat" | animals == "rat"]
  • To make this less tedious for multiple values, we can use %in%
animals %in% c("rat", "cat", "dog", "duck", "goat")
animals[animals %in% c("rat", "cat", "dog", "duck", "goat")]
  • To return the index of where an item is found, we can use the match() function.
match("dog", animals)
  • To insert an object into the middle of a vector, use the append() function.
append(animals, c("giraffe", "zebra"), after = 3)
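
Two details of match() and append() are easy to miss, so here is a short sketch: match() returns NA when the item is not found, and append() returns a new vector rather than changing the original.

```r
animals <- c("mouse", "rat", "dog", "cat")

# "duck" is not in the vector, so there is no index to return.
duck_position <- match("duck", animals)
duck_position                   # NA

# append() leaves animals untouched; store the result to keep it.
longer <- append(animals, c("giraffe", "zebra"), after = 3)
length(animals)                 # 4
longer
```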

Missing Data

  • R represents missing data with NA
  • Many functions will return NA if your data is missing values
heights <- c(2, 4, 4, NA, 6)
mean(heights)
max(heights)
  • To fix this, we tell mean() and max() to ignore the missing data
mean(heights, na.rm = TRUE)

Challenge: Ignore the missing values and find the maximum value of heights using max().

max(heights, na.rm = TRUE)
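
Not every function has an na.rm argument, so it can help to find or drop the missing values directly. A minimal sketch using is.na() and logical indexing:

```r
heights <- c(2, 4, 4, NA, 6)

# TRUE wherever a value is missing.
is.na(heights)

# Count the missing values.
n_missing <- sum(is.na(heights))
n_missing                       # 1

# Keep only the values that are not missing.
heights_complete <- heights[!is.na(heights)]
heights_complete                # 2 4 4 6
```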

Starting with Data

  • Begin by downloading the data into your data folder
download.file("https://ndownloader.figshare.com/files/2292169", "data/portal_data_joined.csv")
  • Now load the data into your workspace
surveys <- read.csv('data/portal_data_joined.csv')
surveys
  • To just output the first 6 lines, use head()
head(surveys)

Note: we can examine the original .csv file from within RStudio. Have a look around and get a sense of the data.

Data Frames

  • When data is brought in from a spreadsheet, it is stored in R as a data frame.

  • Data frames are used for statistics and plotting

  • Each column is a vector

    • This forces each column to only have one data type
  • To inspect the structure of a data frame, use str()

str(surveys)
  • There are several functions for examining the size of a data frame
dim(surveys)
nrow(surveys)
ncol(surveys)
  • Functions for examining data frame content
head(surveys)
tail(surveys)
  • Functions for examining row and column names
names(surveys)   # Column names
rownames(surveys)
  • summary() prints out summary statistics for each column
summary(surveys)

Indexing and Subsetting Data Frames

  • Since data frames have two dimensions, we must account for rows and columns when we reference elements
head(surveys)
surveys[1,1]      # 1st element in the first column
surveys[1,6]      # 1st element in the 6th column
surveys[,1]       # 1st column in the data frame
surveys[1:3, 7]   # 1st 3 elements of the 7th column
surveys[3,]       # 3rd row, all columns
  • : is a special function that creates numeric vectors of integers in increasing or decreasing order.
1:10
10:1
  • You can use the - sign to exclude certain parts of the data frame
surveys[,-1]  # Include whole data frame *except* the first column
  • You can also use the title of a column
surveys$species_id
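
The indexing rules can be tried out on a small data frame where every value is visible. The toy data frame below is made up for illustration and only borrows column names from surveys.

```r
# A made-up three-row data frame (values are illustrative only).
toy <- data.frame(
  record_id  = 1:3,
  species_id = c("NL", "DM", "PF"),
  weight     = c(NA, 40, 8)
)

toy[2, 3]                       # row 2, column 3: 40

# Negative indices drop columns.
no_first <- toy[, -1]           # everything except the first column
no_last  <- toy[, -ncol(toy)]   # everything except the last column

names(no_first)                 # "species_id" "weight"
names(no_last)                  # "record_id" "species_id"

# Two equivalent ways to pull out a column by name.
toy$weight
toy[["weight"]]
```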

!!!END OF FIRST WORKSHOP. REMAINING MATERIAL OPTIONAL BASED ON REMAINING TIME!!!

Factors

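  • Notice how some of the vectors in the surveys data frame are "factors" (in R 4.0 and later, read.csv() keeps text columns as characters unless stringsAsFactors = TRUE is set)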
str(surveys)
  • Factors are like vectors of strings, but they are actually vectors of integers
  • They correspond to the "categorical variable" concept in statistics.
  • R lists all the unique strings and assigns each one an integer code (after sorting in alphabetical order)
  • Once created, the factor can only contain this predefined list. We can change these entries, but we can't add to them.
  • This list of possible values is called the factor's "levels"
sex <- factor(c("male","female","female","male"))
levels(sex)
nlevels(sex)
  • In this case, R will assign 1 to the level "female" since alphabetically it comes before "male"

    • Even though the first string in the list was "male"
  • Sometimes we need to reorder the factors for our specific analysis

sex
sex <- factor(sex, levels = c("male", "female"))
sex
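
To see the integer codes behind a factor, as.integer() can be applied before and after reordering the levels. A short sketch using the same four values:

```r
sex <- factor(c("male", "female", "female", "male"))

# Default levels are alphabetical: female = 1, male = 2.
codes_default <- as.integer(sex)
codes_default                   # 2 1 1 2

# After reordering: male = 1, female = 2.
sex_reordered <- factor(sex, levels = c("male", "female"))
codes_reordered <- as.integer(sex_reordered)
codes_reordered                 # 1 2 2 1

# The labels themselves are unchanged.
as.character(sex_reordered)
```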
  • One nice feature about factors is that you can quickly plot the number of observations represented by each factor level
  • Let's look at the number of males and females captured during the experiment
plot(surveys$sex)
  • We can change the entries in a factor
  • For example, suppose we wanted to label all the observations that were not assigned a sex with the word "missing"
sex <- surveys$sex
head(sex)
levels(sex)
levels(sex)[1] <- "missing"
levels(sex)
head(sex)
  • While factors are useful for plotting, they can also be cumbersome at times
  • You can import new data and specify that you don't want any factors
  • Let's re-import our data and we'll set only the plot_type column to be a factor
surveys <- read.csv("data/portal_data_joined.csv", stringsAsFactors = FALSE)
str(surveys)
surveys$plot_type <- factor(surveys$plot_type)

Formatting Dates

  • One big challenge when wrangling data is to format dates correctly
  • To do this we will use an R library called lubridate, which is installed along with the tidyverse package
  • A library is a collection of functions and code (or other libraries) that extends the base functionalities of R
  • tidyverse is an "umbrella package" that contains several sublibraries. These libraries all abide by the tidy data principles laid out in the paper "Tidy Data" by Hadley Wickham.
install.packages("tidyverse")

Many installation errors can be addressed by using `install.packages("tidyverse", dependencies = TRUE)`.

  • The ymd() function converts year-month-day strings into dates
library(lubridate)
paste(surveys$year, surveys$month, surveys$day, sep='-')
surveys$date <- ymd(paste(surveys$year, surveys$month, surveys$day, sep='-'))
str(surveys)
  • Notice the warning: 129 failed to parse.
    • Some of the dates didn't convert
    • Let's check it out
surveys_dates <- paste(surveys$year, surveys$month, surveys$day, sep = "-")
head(surveys_dates)
surveys_dates[is.na(surveys$date)]
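
The same warning can be reproduced on a tiny example. In the sketch below, the two date strings are made up for illustration: the first is a real calendar date and the second is not, since September has 30 days.

```r
library(lubridate)

# Two made-up date strings; the second one is not a real calendar date.
date_strings <- c("2000-9-30", "2000-9-31")

# ymd() warns that one string failed to parse and returns NA for it.
parsed <- ymd(date_strings)
parsed

# Recover the original strings that could not be converted.
failed <- date_strings[is.na(parsed)]
failed                          # "2000-9-31"
```

Whether the failures in surveys have the same cause is something to confirm by looking at the output of the check above.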

Manipulating and Analyzing Data with dplyr

  • dplyr is a set of tools that makes many tasks in data manipulation easier
  • To use dplyr we need to load the right library
library("tidyverse")
  • The tidyverse provides read_csv() (from the readr package) as an alternative to read.csv()
surveys <- read_csv("data/portal_data_joined.csv")
str(surveys)
  • surveys is now a "tibble"
  • tibbles are the tidyverse's updated take on data frames
    • The only difference we need to worry about today is that read_csv() never converts text columns to factors; they stay as character vectors

Selecting Columns and Filtering Rows

  • One of the big advantages of dplyr is that referencing pieces of the tibble is easier than the traditional index notation of data frames
select(surveys, plot_id, species_id, weight)
  • Notice the object is the first argument

  • The columns that we want then follow as the rest of the arguments

  • We can filter rows based on specific criteria

filter(surveys, year == 1995)

Pipes

  • Suppose that I want to select and filter at the same time, now what?
  • Pipes allow us to take the output of one function and input it into another function
surveys %>% 
  filter(weight < 5) %>% 
  select(species_id, sex, weight)
  • Each function call is missing the first argument since it was supplied through the pipe

  • You can use the shortcut Ctrl + Shift + M to type a pipe in RStudio

  • To save this as an object, do the following

surveys_sml <- surveys %>% 
  filter(weight < 5) %>% 
  select(species_id, sex, weight)

surveys_sml
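
To make clear what the pipe is doing, here is a sketch that writes the same steps twice, once with pipes and once as nested function calls. The toy tibble is made up for illustration and only borrows column names from surveys.

```r
library(dplyr)

# A made-up tibble (values are illustrative only).
toy <- tibble(
  species_id = c("PF", "DM", "PF", "NL"),
  sex        = c("F", "M", "M", "F"),
  weight     = c(4, 40, 8, NA)
)

# With pipes: read top to bottom.
piped <- toy %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

# Without pipes: the same steps, read from the inside out.
nested <- select(filter(toy, weight < 5), species_id, sex, weight)

piped
all.equal(piped, nested)        # TRUE
```

Note that filter() also drops the row where weight is NA, because the comparison NA < 5 is neither TRUE nor FALSE.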

Mutate

  • Often we want to create new columns from existing columns
  • We could create a new column from the weight column (which is currently in grams)
surveys %>% 
  mutate(weight_kg = weight / 1000) %>% 
  head()
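
A single mutate() call can create several columns, and a later column can use one created earlier in the same call. A minimal sketch on made-up data, reusing the 2.2 conversion factor from earlier in the lesson:

```r
library(dplyr)

# A made-up tibble with weights in grams (values are illustrative only).
toy <- tibble(
  species_id = c("DM", "PF"),
  weight     = c(40, 8)
)

converted <- toy %>%
  mutate(
    weight_kg = weight / 1000,     # grams to kilograms
    weight_lb = weight_kg * 2.2    # uses the column created just above
  )

converted
```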
  • dplyr allows us to group our data into categories and create statistical summaries of the data as well
surveys %>% 
  group_by(sex) %>% 
  summarize(mean_weight = mean(weight, na.rm = TRUE))
  • We are not limited to just one category
    • We can summarize with subgroups as well
surveys %>% 
  group_by(sex, species_id) %>% 
  summarize(mean_weight = mean(weight, na.rm = TRUE))
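
summarize() can return more than one statistic per group. The sketch below, on made-up data, drops rows with a missing sex first and then reports both the mean weight and the number of rows in each group using n().

```r
library(dplyr)

# A made-up tibble (values are illustrative only).
toy <- tibble(
  sex        = c("F", "F", "M", "M", NA),
  species_id = c("DM", "DM", "DM", "PF", "PF"),
  weight     = c(40, 44, 48, 8, NA)
)

by_sex <- toy %>%
  filter(!is.na(sex)) %>%                          # drop rows with no recorded sex
  group_by(sex) %>%
  summarize(
    mean_weight = mean(weight, na.rm = TRUE),      # mean per group
    n           = n()                              # rows per group
  )

by_sex
```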

Exporting Data

  • R has functions for exporting .csv files just as it has functions for reading them

  • Begin by creating a new folder in your project directory called /data_output

    • Raw data should be kept separate from processed data
    • The original data should never be changed
  • Let's clean up our data and write out the results

surveys_complete <- surveys %>%
  filter(species_id != "",
         !is.na(weight),
         !is.na(hindfoot_length),
         sex != "")
  • In the next section, we will look at plotting how species abundances have changed through time, so we will remove observations for rare species

  • We will begin by counting how many times each species was observed and keeping only the species with at least 50 observations

species_counts <- surveys_complete %>% 
  group_by(species_id) %>% 
  tally() %>% 
  filter(n >= 50)
  • Now we will only keep the most common species
surveys_complete <- surveys_complete %>% 
  filter(species_id %in% species_counts$species_id)
  • Check that we all have the same clean data
dim(surveys_complete)
  • We should all have a tibble that is 30,463 rows and 13 columns

  • We're now ready to write out to a .csv

  • If you check your folder with RStudio, it should be there
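
The write step itself can be sketched with write_csv() from readr. To keep the example self-contained, it writes a made-up two-row data frame into a temporary folder; the object toy_complete and the temporary location are assumptions for illustration only.

```r
library(readr)

# A made-up stand-in for the cleaned data (values are illustrative only).
toy_complete <- data.frame(
  species_id = c("DM", "PF"),
  weight     = c(40, 8)
)

# Write into a temporary folder so the sketch does not touch a real project.
out_dir <- file.path(tempdir(), "data_output")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "surveys_complete.csv")

write_csv(toy_complete, out_file)

# Confirm the file exists and read it back.
file.exists(out_file)
read_back <- read.csv(out_file)
read_back
```

In the workshop project, the equivalent call would presumably be write_csv(surveys_complete, "data_output/surveys_complete.csv"), which matches the path read back in at the start of the plotting section.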

Plotting Data

If not still in the workspace, then the students will need tidyverse and the data that we wrote out from the last workshop

library(tidyverse)
surveys_complete <- read_csv("data_output/surveys_complete.csv")
  • The package ggplot2 can help us create publication quality graphs with minimal amounts of adjustments and tweaking

  • ggplot graphics are built step by step by adding new elements

  • Begin by binding the plot to a specific data frame

ggplot(data = surveys_complete)
  • We can change the presentation of things like size, shape, color, variables to be plotted, etc. by using the aesthetics argument
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length))
  • We can specify the graphical representation of our data points with geom_point()
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) + geom_point()
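
As a sketch of how further elements are added step by step, the example below maps color to species, makes the points semi-transparent, and relabels the axes. The toy data frame is made up for illustration; with the workshop data, surveys_complete would take its place.

```r
library(ggplot2)

# A made-up data frame (values are illustrative only).
toy <- data.frame(
  weight          = c(40, 44, 8, 7),
  hindfoot_length = c(35, 36, 16, 15),
  species_id      = c("DM", "DM", "PF", "PF")
)

p <- ggplot(data = toy, aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(color = species_id), alpha = 0.5) +   # color by species, half transparent
  labs(x = "Weight (g)", y = "Hindfoot length", color = "Species")

p
```

Each + adds one layer or setting, so the plot can be built up and checked one piece at a time.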
  • This is just the surface of what ggplot2 has to offer
  • Happy data wrangling!