- R and RStudio are two different pieces of software.
- Describe the differences
- In R there is no pointing and clicking, and that's a good thing ...
- R allows you to reproduce your work faster.
- R has over 10,000 packages.
- R was made for data, and it scales.
- R produces high resolution graphics that are ready for publication.
- R is free, open-source, cross platform, and community supported.
- RStudio is an Integrated Development Environment (IDE).
- With RStudio you can write code, navigate files, inspect variables, visualize plots, integrate with Git for version control, develop packages, and write Shiny apps.
- It is separate from R, but gives us a nicer way to interact with R.
- Show the "Help," "Plots," and "Files" tabs.
- Show the "Console" window.
- Console commands are forgotten after your session.
- Show the "Script" window.
- You can save script commands for later.
- `Ctrl`+`Enter` is the shortcut for testing a line of code in the console.
- It is good practice to keep the whole project in a single "working directory."
- Relative paths are portable.
- Absolute paths are not.
- It's a good idea to have separate folders for raw and clean data, and for scripts.
- It's a good idea to determine a system you like and keep it consistent across projects.
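As a small sketch of why relative paths travel well, the lines below build a path from folder names with `file.path()`. The folder and file names (`data`, `raw`, `surveys.csv`) are made up for illustration and are not part of the workshop project.

```r
# Build a path relative to the project's working directory.
# "data", "raw", and "surveys.csv" are hypothetical names.
raw_file <- file.path("data", "raw", "surveys.csv")
raw_file

# An absolute path is the working directory on *this* machine plus the
# relative path, so it differs from computer to computer.
absolute_file <- file.path(getwd(), raw_file)
absolute_file
```

The relative path stays the same wherever the project folder is copied; only the absolute path changes.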
- Let's start a new project: File -> New Project -> New Directory -> New Project -> [ Give your project a descriptive name ] -> Create Project
- Under the "Files" tab, create a new folder called `data`.
- To begin with, R will do most math operations if you type them into the console window.
- `>` is the prompt. It means R is ready to accept input.
- `+` means that R is expecting more input and that your command is not complete yet (this allows for multi-line commands).
- Two ways of coding in RStudio
  - scripts (saves work)
    - keyboard shortcut `Ctrl`+`Enter`
  - console window
3 + 5
12/7
- To do useful things, we need to assign values to objects
- The assignment operator in R is `<-`
- You can also use `=`, but historically R has used `<-`
- Careful: `=` is also used to pass arguments within a function, which is different from assignment!
- In many programming languages, data is stored in variables. R uses the term object due to differences in the way memory is managed in R.
!!!REVIEW AT BEGINNING OF SECOND WORKSHOP!!!
weight_kg <- 55
- Shortcut for `<-` is `Alt`+`-`
- Object names are case-sensitive
- It's best to avoid `.` in object names
- When assigning a value to an object, R doesn't print anything
- You can force it to by wrapping the assignment in parentheses
weight_kg <- 55
(weight_kg <- 55)
- You can also check the value stored in an object by retyping its name
weight_kg
- We can use objects to do arithmetic
2.2 * weight_kg
weight_kg
- We can change the value of a variable by assigning it a new value
weight_kg <- 57.5
2.2 * weight_kg
- We can store the results in a new variable
!!!REVIEW AT BEGINNING OF SECOND WORKSHOP!!!
weight_lb <- 2.2 * weight_kg
weight_lb
- If I now change the value of `weight_kg`, it will not change the value saved in `weight_lb`
weight_kg <- 100
weight_lb
- Anything after `#` is treated as a comment in R
- The computer will ignore the rest of that line; it is for your reference
- Making comments is a good practice as it provides reference for others (or yourself in six months) as to what is happening in your code
- RStudio has the shortcut `Ctrl`+`Shift`+`C` for commenting multiple lines at once when scripting.
- Demonstrate `Ctrl`+`Shift`+`C` on the following lines of code:
mass <- 47.5
age <- 122
mass <- mass * 2.0
age <- age - 20
mass_index <- mass/age
- Functions are independent modules of code that can be called from your script/console
- They save us time so that we don't have to rewrite a piece of code if we need to use it multiple times
- `sqrt()` is an example of a function in R
sqrt(2)
- The inputs of a function are called arguments
- If the function returns something, it returns a value
- We call a function when we want to execute the code in it
- Functions are not limited to working with numbers
- Functions can take as arguments, and return, almost any object we can create within the R environment
- Functions can also take multiple arguments, like `round()`
round(3.14159)
- `round()` defaults to rounding to the nearest whole number
- We can change this by changing the number of arguments we give `round()`
- Let's find out more about `round()`'s arguments
args(round)
- We can look up the manual entry for a specific function.
?round
- We can make general help searches if we don't know the function name.
??ceiling
- RStudio has great built-in help, but sometimes a quick Google or Stack Exchange search is the best place to get help with examples.
- The output of `args(round)` lists the arguments, and we see that `digits` defaults to 0
- We can look up more details about the function by using a `?`
?round
- Now we can round to two digits
round(3.14159, 2)
- I can also label the arguments that I pass to a function.
round(x = 3.14159, digits = 2)
- If I explicitly name all the arguments, then I can list them in any order that I want
round(digits = 2, x = 3.14159)
- It's good practice to name at least the optional arguments so that somebody else can read your code without referencing the help manual
- In R (or any programming language), we organize our data into abstract structures called data structures
- The most common data structure in R is the vector
- A vector is a series of values that can be either characters or numbers
- We stitch a series of values together by using the `c()` function
!!!REVIEW AT BEGINNING OF SECOND WORKSHOP!!!
weight_g <- c(50, 60, 65, 82)
weight_g
- Vectors can also contain characters
animals <- c("mouse", "rat", "dog")
animals
- The quotes tell R that this is a string and not a variable
- There are many functions that help us work with vectors
- `length()` tells us how long a vector is
length(weight_g)
length(animals)
- `class()` tells us what type of objects are in the vector
class(weight_g)
class(animals)
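A vector holds a single type of value. As a quick illustration (the object names `numbers` and `mixed` are made up here), mixing numbers and text inside `c()` makes R convert everything to character:

```r
# Hypothetical example: a vector can hold only one type of value
numbers <- c(1, 2, 3)
mixed <- c(1, 2, "three")   # the numbers are converted to text

class(numbers)
class(mixed)
mixed
```

This is why `class()` reports one type for the whole vector rather than one type per element.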
- We can use `c()` to add another element to a vector
weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
!!!REVIEW AT BEGINNING OF SECOND WORKSHOP!!!
- We reference values within a vector using `[ ]` to indicate the index
- Notice the 1-based indexing: the first element is at index 1.
animals <- c(animals, "cat")
animals
animals[2]
animals[c(3, 2)]
- We can pull the same element out multiple times
more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals
- We can also index elements by using logical values
- In R, a logical value is a value that is either `TRUE` or `FALSE`
weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
- Usually you don't type logical values out by hand; they are the output of another statement
- For example, suppose you want all the weights in `weight_g` that are over 50
!!!REVIEW AT BEGINNING OF SECOND WORKSHOP!!!
weight_g > 50
weight_g[weight_g > 50]
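To see what is going on, the comparison and the indexing can be split into two steps. The name `heavy` is chosen here for illustration only.

```r
weight_g <- c(21, 34, 39, 54, 55)

# Step 1: the comparison gives one TRUE/FALSE per element
heavy <- weight_g > 50
heavy

# Step 2: the logical vector picks out the elements where it is TRUE
weight_g[heavy]

# Logical values count as 1 (TRUE) and 0 (FALSE), so sum() counts the matches
sum(heavy)
```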
- We can use the AND (`&`) or the OR (`|`) tests to combine multiple statements
  - AND: both statements must be true
  - OR: at least one of the statements must be true
- Suppose we want all the weights that are less than 30 grams or greater than 50 grams
weight_g[weight_g < 30 | weight_g > 50]
- Suppose we want all the weights greater than or equal to 30, or equal to 21
weight_g[weight_g >= 30 | weight_g == 21]
- `==` tests for equality; it is a comparison, not an assignment like `=`
Challenge: Use `&` to return all the values in `weight_g` that are greater than 30 and less than 50.
weight_g[weight_g > 30 & weight_g < 50]
- We can use these kinds of statements to check whether or not a vector has certain elements
animals <- c("mouse", "rat", "dog", "cat")
animals[animals == "cat" | animals == "rat"]
- To make this less tedious for multiple values, we can use `%in%`
animals %in% c("rat", "cat", "dog", "duck", "goat")
animals[animals %in% c("rat", "cat", "dog", "duck", "goat")]
- To return the index of where an item is found, we can use the `match()` function.
match("dog", animals)
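The following sketch puts `%in%` and `match()` side by side and shows what each returns when a value is not found. The names `in_list`, `dog_position`, and `duck_position` are made up for this example.

```r
animals <- c("mouse", "rat", "dog", "cat")

# %in% returns one TRUE/FALSE for each element on its left-hand side
in_list <- animals %in% c("rat", "cat", "dog", "duck", "goat")
in_list

# match() returns the position of the first match, or NA if there is none
dog_position <- match("dog", animals)
duck_position <- match("duck", animals)
dog_position
duck_position
```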
- To insert an object into the middle of a vector, use the `append()` function.
append(animals, c("giraffe", "zebra"), after = 3)
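One detail worth demonstrating: `append()` returns a new vector and leaves the original untouched, so the result has to be assigned if we want to keep it. The name `zoo_animals` is made up for this sketch.

```r
animals <- c("mouse", "rat", "dog", "cat")

# append() returns a new vector; animals itself is unchanged
append(animals, c("giraffe", "zebra"), after = 3)
length(animals)

# Assign the result to keep it
zoo_animals <- append(animals, c("giraffe", "zebra"), after = 3)
zoo_animals
```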
- R represents missing data with `NA`
- Many functions will return `NA` if your data is missing values
heights <- c(2, 4, 4, NA, 6)
mean(heights)
max(heights)
- To fix this, we tell `mean()` and `max()` to ignore the missing data
mean(heights, na.rm = TRUE)
Challenge: Ignore the missing values and find the maximum value of `heights` using `max()`.
max(heights, na.rm = TRUE)
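Besides `na.rm = TRUE`, it can help to look at the missing values directly. A minimal sketch using `is.na()`; the names `n_missing` and `heights_present` are made up here.

```r
heights <- c(2, 4, 4, NA, 6)

# is.na() flags the missing entries
is.na(heights)

# Count the missing values
n_missing <- sum(is.na(heights))
n_missing

# Keep only the values that are present
heights_present <- heights[!is.na(heights)]
heights_present
mean(heights_present)
```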
- Begin by downloading the data into your `data` folder
download.file("https://ndownloader.figshare.com/files/2292169", "data/portal_data_joined.csv")
- Now load the data into your workspace
surveys <- read.csv('data/portal_data_joined.csv')
surveys
- To just output the first 6 lines, use `head()`
head(surveys)
Note: we can examine the original `.csv` file from within RStudio. Have a look around and get a sense of the data.
- When data is brought in from a spreadsheet, it is stored in R as a data frame.
- Data frames are used for statistics and plotting
- Each column is a vector
  - This forces each column to only have one data type
- To inspect the structure of a data frame, use `str()`
str(surveys)
- There are several functions for examining the size of a data frame
dim(surveys)
nrow(surveys)
ncol(surveys)
- Functions for examining data frame content
head(surveys)
tail(surveys)
- Functions for examining row and column names
names(surveys) # Column names
rownames(surveys)
- `summary()` prints out summary statistics for each column
summary(surveys)
- Since data frames have two dimensions, we must account for rows and columns when we reference elements
head(surveys)
surveys[1,1] # 1st element in the first column
surveys[1,6] # 1st element in the 6th column
surveys[,1] # 1st column in the data frame
surveys[1:3, 7] # 1st 3 elements of the 7th column
surveys[3,] # 3rd row, all columns
- `:` is a special function that creates numeric vectors of integers in increasing or decreasing order.
1:10
10:1
- You can use the `-` sign to exclude certain parts of the data frame
surveys[,-1] # Include whole data frame *except* first column
- You can also use the title of a column
surveys$species_id
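The same indexing rules can be tried on a data frame small enough to check by eye. The `toy` data frame below is made up for this sketch; its column names only imitate the survey data.

```r
# A made-up data frame, small enough to check every result by eye
toy <- data.frame(
  record_id = 1:4,
  species_id = c("NL", "DM", "DM", "PF"),
  weight = c(180, 40, 48, 7)
)

toy[1, 1]      # row 1, column 1
toy[, 3]       # every row of column 3, returned as a vector
toy[2:3, ]     # rows 2 and 3, all columns
toy[, -1]      # everything except the first column
toy$weight     # the same column as toy[, 3], picked by name
```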
!!!END OF FIRST WORKSHOP. REMAINING MATERIAL OPTIONAL BASED ON REMAINING TIME!!!
- Notice how some of the vectors in the `surveys` data frame are "factors" (in R 4.0 and later, `read.csv()` only creates factors if you pass `stringsAsFactors = TRUE`)
str(surveys)
- Factors look like vectors of strings, but they are actually stored as vectors of integers
- Correspond to the "categorical variable" concept in statistics.
- R lists all the unique strings and assigns each one an integer code (after sorting in alphabetical order)
- Once created, the factor can only contain this predefined list. We can change these entries, but we can't add to them.
- This list of possible values is called the factor's "levels"
sex <- factor(c("male","female","female","male"))
levels(sex)
nlevels(sex)
- In this case, R will assign `1` to the level "female" since alphabetically it comes before "male"
  - Even though the first string in the list was "male"
- Sometimes we need to reorder the factors for our specific analysis
sex
sex <- factor(sex, levels = c("male", "female"))
sex
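The integer codes behind a factor can be inspected with `as.integer()`. The sketch below shows the codes before and after reordering the levels; `codes_default` and `codes_reordered` are names made up for illustration.

```r
sex <- factor(c("male", "female", "female", "male"))

# By default the levels are sorted alphabetically: "female" is 1, "male" is 2
levels(sex)
codes_default <- as.integer(sex)
codes_default

# Reordering the levels changes the codes, not the labels
sex <- factor(sex, levels = c("male", "female"))
levels(sex)
codes_reordered <- as.integer(sex)
codes_reordered
as.character(sex)
```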
- One nice feature about factors is that you can quickly plot the number of observations represented by each factor level
- Let's look at the number of males and females captured during the experiment
plot(surveys$sex)
- We can change the entries in a factor
- For example, suppose we wanted to label all the observations that were not assigned a sex with the word "missing"
sex <- surveys$sex
head(sex)
levels(sex)
levels(sex)[1] <- "missing"
levels(sex)
head(sex)
- While factors are useful for plotting, they can also be cumbersome at times
- You can import new data and specify that you don't want any factors
- Let's re-import our data and we'll set only the `plot_type` column to be a factor
surveys <- read.csv("data/portal_data_joined.csv", stringsAsFactors = FALSE)
str(surveys)
surveys$plot_type <- factor(surveys$plot_type)
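The effect of `stringsAsFactors` can be checked without touching the survey file. This sketch reads a tiny made-up CSV from a string (via the `text` argument of `read.csv()`); the values in `csv_text` are invented for illustration.

```r
# A tiny made-up CSV, read from a string instead of a file
csv_text <- "plot_id,plot_type\n1,Control\n2,Rodent Exclosure\n3,Control"

# Ask for factors explicitly
with_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(with_factors$plot_type)

# Keep strings as character vectors
no_factors <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(no_factors$plot_type)

# Convert a single column afterwards
no_factors$plot_type <- factor(no_factors$plot_type)
levels(no_factors$plot_type)
```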
- One big challenge when wrangling data is to format dates correctly
- To do this we will use an R library called `lubridate`, which is installed as part of the `tidyverse` package
- A library is a collection of functions and code (or other libraries) that extends the base functionalities of R
- `tidyverse` is an "umbrella package" that contains several sublibraries. These libraries all abide by the tidy data principles laid out in the paper "Tidy Data" by Hadley Wickham.
install.packages("tidyverse")
- Many errors can be addressed by using `install.packages("tidyverse", dependencies = TRUE)`
- The `ymd()` function changes the data type to dates
library(lubridate)
paste(surveys$year, surveys$month, surveys$day, sep='-')
surveys$date <- ymd(paste(surveys$year, surveys$month, surveys$day, sep='-'))
str(surveys)
- Notice the warning: `Warning: 129 failed to parse.`
- Some of the dates didn't convert
- Let's check it out
surveys_dates <- paste(surveys$year, surveys$month, surveys$day, sep = "-")
head(surveys_dates)
surveys_dates[is.na(surveys$date)]
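One way a date string can fail to parse is if it names a day that does not exist. The sketch below uses base R's `as.Date()` as a stand-in for `ymd()`, and made-up date strings rather than the survey data, so it only illustrates the idea; it does not confirm what went wrong in these particular rows.

```r
# Made-up date strings: September has only 30 days
date_strings <- c("2000-9-30", "2000-9-31")

# Base R's as.Date() is used here as a stand-in for lubridate's ymd()
parsed <- as.Date(date_strings, format = "%Y-%m-%d")
parsed

# The impossible date becomes NA, so is.na() finds it
date_strings[is.na(parsed)]
```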
- `dplyr` is a set of tools that makes many tasks in data manipulation easier
- To use `dplyr` we need to load the right library
library("tidyverse")
- The tidyverse uses `read_csv()` (from its `readr` package) instead of `read.csv()`
surveys <- read_csv("data/portal_data_joined.csv")
str(surveys)
- `surveys` is now a "tibble"
  - tibbles are the tidyverse's modern take on data frames
  - The only difference we need to worry about today is that `read_csv()` never converts text columns to factors; they stay vectors of characters
- One of the big advantages of `dplyr` is that referencing pieces of the tibble is easier than the traditional index notation of data frames
select(surveys, plot_id, species_id, weight)
- Notice the object is the first argument
- The columns that we want then follow as the rest of the arguments
- We can filter rows based on specific criteria
filter(surveys, year == 1995)
- Suppose that I want to select and filter at the same time, now what?
- Pipes allow us to take the output of one function and input it into another function
surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
- Each function call is missing the first argument since it was supplied through the pipe
- You can use the shortcut `Ctrl`+`Shift`+`M` to type a pipe in RStudio
- To save this as an object, do the following
surveys_sml <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
surveys_sml
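To make the role of the pipe concrete, the sketch below writes the same filter-then-select step twice, once with nested calls and once with `%>%`. It assumes `dplyr` is installed and uses a made-up `toy` data frame in place of `surveys`.

```r
library(dplyr)

# Made-up data standing in for surveys
toy <- data.frame(
  species_id = c("PF", "DM", "PF", "NL"),
  sex = c("F", "M", "M", "F"),
  weight = c(4, 40, 7, 180)
)

# Without a pipe: the calls are nested and read inside-out
nested <- select(filter(toy, weight < 10), species_id, weight)

# With a pipe: the same steps, read top to bottom
piped <- toy %>%
  filter(weight < 10) %>%
  select(species_id, weight)

identical(nested, piped)
```

Both versions produce the same result; the pipe only changes how the steps are written.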
- Often we want to create new columns from existing columns
- We could create a new column from the weight column (which is currently in grams)
surveys %>%
mutate(weight_kg = weight / 1000) %>%
head()
- `dplyr` allows us to group our data into categories and create statistical summaries of the data as well
surveys %>%
group_by(sex) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
- We are not limited to just one category
- We can summarize with subgroups as well
surveys %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
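A grouped summary is easier to verify on data small enough to compute by hand. This sketch assumes `dplyr` is installed; the `toy` data frame and the name `by_sex` are made up for illustration.

```r
library(dplyr)

# Made-up data: two females (10 and 20) and two males (30 and a missing value)
toy <- data.frame(
  sex = c("F", "M", "F", "M"),
  weight = c(10, 30, 20, NA)
)

# One row per group; na.rm = TRUE drops the missing weight before averaging
by_sex <- toy %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
by_sex
```

The female mean is (10 + 20) / 2 and the male mean is computed from the single non-missing value.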
- R has functions for exporting `.csv` files just as it has functions for reading them
- Begin by creating a new folder in your project directory called `data_output`
  - Raw data should be kept separate from processed data
  - The original data should never be changed
- Let's clean up our data and write out the results
surveys_complete <- surveys %>%
filter(species_id != "",
!is.na(weight),
!is.na(hindfoot_length),
sex != "")
- In the next section, we will look at plotting how species abundances have changed through time, so we will remove observations for rare species
- We will begin by counting how many times each species was observed and keeping only the species that were observed at least 50 times
species_counts <- surveys_complete %>%
group_by(species_id) %>%
tally() %>%
filter(n >= 50)
- Now we will only keep the most common species
surveys_complete <- surveys_complete %>%
filter(species_id %in% species_counts$species_id)
- Check that we all have the same clean data
dim(surveys_complete)
- We should all have a tibble that is 30,463 rows and 13 columns
- We're now ready to write out to a `.csv`
- If you check your folder with RStudio, it should be there
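These notes do not show the call that writes the file. Since the next workshop reads `data_output/surveys_complete.csv` with `read_csv()`, a natural candidate is `write_csv()` from `readr`; that choice is an assumption, not something the notes confirm. The sketch below uses a made-up `toy` data frame and a temporary folder so that it runs anywhere.

```r
library(readr)

# Made-up stand-in for surveys_complete
toy <- data.frame(species_id = c("DM", "PF"), weight = c(40, 7))

# A temporary folder is used here so the sketch runs anywhere.
# In the workshop project the call would presumably be:
#   write_csv(surveys_complete, "data_output/surveys_complete.csv")
out_file <- file.path(tempdir(), "surveys_complete.csv")
write_csv(toy, out_file)

# Read the file back to confirm the round trip
round_trip <- read_csv(out_file)
round_trip
```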
- If `surveys_complete` is no longer in the workspace, then the students will need `tidyverse` and the data that we wrote out in the last workshop
library(tidyverse)
surveys_complete <- read_csv("data_output/surveys_complete.csv")
- The package `ggplot2` can help us create publication quality graphs with minimal amounts of adjustments and tweaking
- `ggplot` graphics are built step by step by adding new elements
- Begin by binding the plot to a specific data frame
ggplot(data = surveys_complete)
- We can change the presentation of things like size, shape, color, variables to be plotted, etc. by using the aesthetics argument
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length))
- We can specify the graphical representation of our data points with `geom_point()`
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) + geom_point()
- This is just the surface of what `ggplot2` has to offer
- Happy data wrangling!