-
Notifications
You must be signed in to change notification settings - Fork 1
acquainted R
Starting R is simple. If you're using Windows simply navigate to the R
subfolder from the Start Menu. On a Unix/Linux system invoke the program
by typing R. On OS X start R by clicking the R icon from the Dock or
in the Applications folder.
The OS X and Windows version of R provide a simple GUI interface that simplifies certain tasks. When you start up the R GUI you’ll be presented with a single window, the R console. The rest of this document will assume you’re using R under Windows or OS X.
RStudio is a new IDE for R being developed under an open source model (see the RStudio GitHub Site). It provides a nicely designed graphical interface that is consistent across platforms. It can even run as a server, allowing you to access R via a web interface!
Check out the RStudio Docs for detailed info on configuring Rstudio. We'll go over some of RStudio's nice features in class.
R comes with fairly extensive documentation and a simple help system. You can access HTML versions of R documentation under the Help menu. The HTML documentation also includes information on any packages you’ve installed. Take a few minutes to browse through the R HTML documentation.
The help system can be invoked from the console itself using the help
function or the ? operator.
> help(length)
> ?length
> ?log
What if you don’t know the name of the function you want? You can use
the help.search() function.
> help.search("log")
In this case help.search(log) returns all the functions with the string ‘log’ in them.
For more on help.search type ?help.search. Other useful help related
functions include apropos() and example().
When you start the R environment your ‘working directory’ (i.e. the directory on your computer’s file system that R currently ‘sees’) defaults to a specific directory. On Windows this is usually the same directory that R is installed in, on OS X it is typically your home directory. Here are examples showing how you can get information about your working directory and change your working directory.
> getwd()
[1] "/Users/pmagwene"
> setwd("/Users")
> getwd()
[1] "/Users"
Note that on Windows you can change your working directory by using the
Change dir... item under the File menu.
To get a list of the file in your current working directory use the
list.files() function.
> list.files()
[1] "Shared" "pmagwene"
The simplest way to use R is as a fancy calculator.
> 3.14 * 2.5^2
[1] 19.625
> pi * 2.5^2 # R knows about some mathematical constants such as Pi
[1] 19.63495
> cos(pi/3)
[1] 0.5
> sin(pi/3)
[1] 0.8660254
> log(10)
[1] 2.302585
> log10(10) # log base 10
[1] 1
> log2(10) # log base 2
[1] 3.321928
> (10 + 2)/(4-5)
[1] -12
> (10 + 2)/4-5 # compare the answer to the above
[1] -2
Be aware that certain operators have precedence over others. For example multiplication and division have higher precedence than addition and subtraction. Use parentheses to disambiguate potentially confusing statements.
> sqrt(pi)
[1] 1.772454
> sqrt(-1)
[1] NaN
Warning message:
NaNs produced in: sqrt(-1)
> sqrt(-1+0i)
[1] 0+1i
What happened when you tried to calculate sqrt(-1)? -1 is treated as a
real number and since square roots are undefined for the negative reals,
R produced a warning message and returned a special value called NaN
(Not a Number). Note that square roots of negative complex numbers are
well defined so sqrt(-1+0i) works fine.
> 1/0
[1] Inf
Division by zero produces an object that represents infinite numbers.
You’ve already been introduced to the most commonly used arithmetic operators. Also useful are the comparison operators:
> 10 < 9 # less than
[1] FALSE
> 10 > 9 # greater than
[1] TRUE
> 10 <= (5 * 2) # less than or equal to
[1] TRUE
> 10 >= pi # greater than or equal to
[1] TRUE
> 10 == 10 # equals (note that '=' is an alternative assignment operator)
[1] TRUE
> 10 != 10 # does not equal
[1] FALSE
> 10 == (sqrt(10)^2) # Are you surprised by the result? See below..
[1] FALSE
> 4 == (sqrt(4)^2) # Even more confused?
[1] TRUE
Comparisons return boolean values. Be careful to distinguish between
== (tests equality) and = (the alternative assignment operator
equivalent to <-).
How about the last two statement comparing two values to the square of their square roots? Mathematically we know that both ((\sqrt{10})^2 = 10) and ((\sqrt{4})^2 = 4) are true statements. Why does R tell us the first statement is false? What we’re running into here are the limits of computer precision. A computer can’t represent (\sqrt 10) exactly, whereas (\sqrt 4) can be exactly represented. Precision in numerical computing is a complex subject and beyond the scope of this course. Later in the course we’ll discuss some ways of implementing sanity checks to avoid situations like that illustrated above.
Remember that in R arithmetic operations work on vectors as well as on single numbers (in fact single numbers are vectors).
> x <- c(2, 4, 6, 8, 10)
> x * 2
[1] 4 8 12 16 20
> x * pi
[1] 6.283185 12.566371 18.849556 25.132741 31.415927
> y <- c(0, 1, 3, 5, 9)
> x + y
[1] 2 5 9 13 19
> x * y
[1] 0 4 18 40 90
> x/y
[1] Inf 4.000000 2.000000 1.600000 1.111111
> z <- c(1, 4, 7, 11)
> x + z
[1] 3 8 13 19 11
Warning message:
longer object length
is not a multiple of shorter object length in: x + z
When vectors are not of the same length R ’recycles’ the elements of the
shorter vector to make the lengths conform. In the example above z was
treated as if it was the vector (1, 4, 7, 11, 1).
The comparison operators also work on vectors as shown below. Comparisons involving vectors return vectors of booleans.
> x > 5
[1] FALSE FALSE TRUE TRUE TRUE
> x != 4
[1] TRUE FALSE TRUE TRUE TRUE
> length(x)
[1] 5
> x[1]
[1] 2
> x[4]
[1] 8
> x[6]
[1] NA
> x[-1]
[1] 4 6 8 10
> x[c(3,5)]
[1] 6 10
For a vector of length (n), we can access the elements by the indices
(1 \ldots n). Trying to access an element beyond these limits returns a
special constant called NA (Not Available) that indicates missing or
non-existent values.
Negative indices are used to exclude particular elements. x[-1]
returns all elements of x except the first.
You can get multiple elements of a vector by indexing by another vector.
In the example above x[c(3,5)] returns the third and fifth element of
x.
A very powerful feature of R is the ability to combine the comparison operators with indexing. This facilitates data filtering and subsetting. Some examples:
> x <- c(2, 4, 6, 8, 10)
> x[x > 5]
[1] 6 8 10
> x[x < 4 | x > 6]
[1] 2 8 10
In the first example we retrieved all the elements of x that are
larger than 5 (read ‘x where x is greater than 5’).
In the second example we retrieved those elements of x that were
smaller than four or greater than six. The symbol | is the
'logical or' operator. Other logical operators include & ('logical
and' or 'intersection') and ! (negation).
Combining indexing and comparison is a powerful concept and one you’ll probably find useful for analyzing your own data.
Creating sequences of numbers that are separated by a specified value or that follow a particular patterns turns out to be a common task in programming. R has some built-in operators and functions to simplify this task.
> s <- 1:10
> s
[1] 1 2 3 4 5 6 7 8 9 10
> s <- 10:1
> s
[1] 10 9 8 7 6 5 4 3 2 1
> s <- seq(0.5,1.5,by=0.1)
> s
[1] 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5
> s <- seq(0.5, 1.5, 0.33) # 'by' is the 3rd argument
# so you don't have to specify it
> s
[1] 0.50 0.83 1.16 1.49
rep() is another way to generate patterned data.
> rep(c("Male","Female"),3)
[1] "Male" "Female" "Male" "Female" "Male" "Female"
> rep(c(T,T, F),2)
[1] TRUE TRUE FALSE TRUE TRUE FALSE
You’ve already seem a number of functions (e.g. sin(), log,
length(), etc). Functions are called by invoking the function name
followed by parentheses containing zero or more arguments to the
function. Arguments can include the data the function operates on as
well as settings for function parameter values. We’ll discuss function
arguments in greater detail below.
An important function that you’ve used extensively but we’ve glossed
over is the c() function. This is short for ‘concatenate’ or ‘combine’
and as you’ve seen it combines it’s arguments to form a vector.
For vectors of more than 10 or so elements it gets tiresome and error
prone to create vectors using c(). For medium length vectors the
scan() function is very useful.
> test.scores <- scan()
1: 98 92 78 65 52 59 75 77 84 31 83 72 59 69 71 66
17:
Read 16 items
> test.scores
[1] 98 92 78 65 52 59 75 77 84 31 83 72 59 69 71 66
When you invoke scan() without any arguments the function will read in
a list of values separated by white space (usually spaces or tabs).
Values are read until scan() encounters a blank line or the end of
file (EOF) signal (platform dependent).
Note that we created a variable with the name ‘test.scores’. If you
have previous programming experience you might be surpised that this
works. Unlike most languages, R (and S) allow you to use periods in
variable names. Descriptive variable names generally improve readability
but they can also become cumbersome (e.g.
my.long.and.obnoxious.variable.name). As a general rule of thumb use
short variable names when working at the interpreter and more
descriptive variable names in functions.
Let’s introduce some additional numerical functions that are useful for operating on vectors.
> sum(test.scores)
[1] 1131
> min(test.scores)
[1] 31
> max(test.scores)
[1] 98
> range(test.scores) # min and max returned as a vector of length 2
[1] 31 98
> sorted.scores <- sort(test.scores)
> sorted.scores
[1] 31 52 59 59 65 66 69 71 72 75 77 78 83 84 92 98
> w <- c(-1, 2, -3, 3)
> abs(w) # absolute value function
Function arguments can specify the data that a function operates on or parameters that the function uses. Some arguments are required, while others are optional and are assigned default values if not specified.
Take for example the log() function. If you examine the help file for
the log() function you’ll see that it takes two arguments, refered to
as ‘x’ and ‘base’. The argument x represents the numeric vector
you pass to the function and is a required argument (see what happens
when you type log() without giving an argument). The argument base
is optional. By default the value of base is (e = 2.71828\ldots).
Therefore by default the log() function returns natural logarithms. If
you want logarithms to a different base you can change the base
argument as in the following examples:
> log(2) # log of 2, base e
[1] 0.6931472
> log(2,2) # log of 2, base 2
[1] 1
> log(2, 4) # log of 2, base 4
[1] 0.5
The c() and scan() functions are fine for creating small to medium
vectors at the interpreter, but eventually you’ll want to start
manipulating larger collections of data. There are a variety of
functions in R for retrieving data from files.
The most convenient file format to work with are tab delimited text files. Text files have the advantage that they are human readable and are easily shared across different platforms. If you get in the habit of archiving data as text files you’ll never find yourself in a situation where you’re unable to retrieve important data because the binary data format has changed between versions of a program.
scan() itself can be used to read data out of a file. Download the
file algae.txt from the class website and try the following (after
changing your working directory):
> algae <- scan('algae.txt')
Read 12 items
> algae
[1] 0.530 0.183 0.603 0.994 0.708 0.006 0.867 0.059 0.349 0.699 0.983 0.100
One of the things to be aware of when using scan() is that if the data
type contained in the file can not be coerced to doubles than you must
specify the data type using the what argument. The what argument
is also used to enable the use of scan() with columnar data. Download
algae2.txt and try the following:
> algae.table <- scan('algae2.txt', what=list('',double(0)))
# note use of list argument to what
> algae.table
[[1]]
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
[[2]]
[1] 0.530 0.183 0.603 0.994 0.708 0.006 0.867 0.059 0.349 0.699 0.983 0.100
> algae.table[[1]]
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> algae.table[[2]]
[1] 0.530 0.183 0.603 0.994 0.708 0.006 0.867 0.059 0.349 0.699 0.983 0.100
Use help to learn more about scan().
read.table() (and it’s derivates - see the help file) provides a more
convenient interface for reading tabular data. Using the file
turtles.txt:
> turtles <- read.table('turtles.txt', header=T)
> turtles
sex length width height
1 f 98 81 38
2 f 103 84 38
3 f 103 86 42
# output truncated
> names(turtles)
[1] "sex" "length" "width" "height"
> length(turtles)
[1] 4
> length(turtles$sex)
[1] 48
What kind of data structure is turtles? What happens when you call the read.table() function without specifying the argument header=T?
You'll be using the read.table()}function frequently. Spend some time reading the documentation and playing around with different argument values (for example, try and figure out how to specify different column names on input).
Note: read.table() is more convenient but scan() is more efficient for large files. See the R documentation for more info.
There are a wealth of statistical functions built into R. Let's start to put these to use.
If you wanted to know the mean carapace width of turtles in your sample you could calculate this simply as follows:
> sum(turtles$width)/length(turtles$width)
[1] 95.4375
Of course R has a built in mean() function.
mean(turtles$width) [1] 95.4375
One of the advantages of the built in mean() function is that it knows
how to operate on lists as well as vectors:
> mean(turtles)
sex length width height
NA 124.68750 95.43750 46.33333
Warning message:
argument is not numeric or logical: returning NA in: mean.default(X[[1]], ...)
Can you figure out why the above produced a warning message? Let’s take a look at some more standard statistical functions:
> min(turtles$width)
[1] 74
> max(turtles$width)
[1] 132
> range(turtles$width)
[1] 74 132
> median(turtles$width)
[1] 93
> summary(turtles$width)
Min. 1st Qu. Median Mean 3rd Qu. Max.
74.00 86.00 93.00 95.44 102.00 132.00
> var(turtles$width) # variance
[1] 160.6769
> sd(turtles$width) # standard deviation
[1] 12.67584
One of the advantages of R is it's ability to produce a variety of plots and statistical graphics. Try out the following:
> hist(turtles$width) # histogram plot
> hist(turtles$width,10) # produces a histogram with 10 bins
> hist(turtles$width,breaks=10, xlab="Carapace Width", probability=T)
>
> boxplot(turtles$width) # simple box plot
> boxplot(list(turtles$length, turtles$width, turtles$height),
+ names=c("Carapace\nLength","Carapace\nWidth","Carapace\nHeight"),
+ ylab="millimeters") # a fancy box plot showing multiple variables
> title("Turtle Shell Variables")
>
> plot(turtles$length, turtles$width)
> plot(turtles$length ~ turtles$width) # how does this differ from the plot above?
> plot(turtles$length, turtles$width, xlab="Carapace Length(mm)",
+ ylab="Carapace Width(mm)")
> title("Relationship Between\nLength and Width")
To get a sense of some of the graphical power of R try the demo()
function:
> demo(graphics)