# Introduction to R

In this introductory tutorial, we will explore some R basics. Topics include vectors, matrices, factors, data frames, lists, and statistical models. Getting help is simple: use the help command. For example, to get help on the objects command:

In [17]:
help(objects)

Examples for a command can be run by using the example command:

In [18]:
example(assign)


assign> for(i in 1:6) { #-- Create objects  'r.1', 'r.2', ... 'r.6' --
assign+     nam <- paste("r", i, sep = ".")
assign+     assign(nam, 1:i)
assign+ }

assign> ls(pattern = "^r..$")
[1] "r.1" "r.2" "r.3" "r.4" "r.5" "r.6"

assign> ##-- Global assignment within a function:
assign> myf <- function(x) {
assign+     innerf <- function(x) assign("Global.res", x^2, envir = .GlobalEnv)
assign+     innerf(x+1)
assign+ }

assign> myf(3)

assign> Global.res # 16
[1] 16

assign> a <- 1:4

assign> assign("a[1]", 2)

assign> a[1] == 2          # FALSE
[1] FALSE

assign> get("a[1]") == 2   # TRUE
[1] TRUE


## Vectors and Vector arithmetic

Vectors are the simplest R data structures. Examples of vector assignment:

In [19]:
x1 <- c(1,5,4)

In [20]:
x1

In [21]:
assign("x2",c(2,1,-1,4,5))

In [22]:
x2

In [23]:
c(4,4,1,2) -> x3

In [24]:
x3

In [25]:
seq(-1, 5, by=.5) -> seq1

In [26]:
seq1

Arithmetic operators include +, -, /, *, and ^ for power. The operations are done element by element. For example:

In [27]:
y1=4*x1+x1^2

In [28]:
y1

What happens if we do operations with vectors of varying length?

In [29]:
y2=x1+2*x3

Warning message: “longer object length is not a multiple of shorter object length”

In [30]:
y2

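As you can see, the shorter vector is recycled as often as needed to match the length of the longer one, and R warns when the longer length is not a multiple of the shorter. Other common functions include log, exp, sin, cos, tan, and sqrt. Basic statistical functions include max(x), min(x), mean(x), and var(x). The last cell below computes the sample variance by hand, which gives the same value as var(x).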

In [31]:
max(y2)

In [32]:
min(y2)

In [33]:
mean(y2)

In [34]:
var(y2)

In [35]:
sum((y2-mean(y2))^2)/(length(y2)-1)
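The recycling rule and the variance check above can be sketched in one place. This is a minimal illustration, not part of the original session: the vectors `a` and `b` are hypothetical, and `suppressWarnings()` is used only to keep the output quiet.

```r
# Hypothetical vectors: length 2 divides length 6, so b is recycled silently.
a <- c(1, 2, 3, 4, 5, 6)
b <- c(10, 20)
a + b                      # 11 22 13 24 15 26

# Same vectors as in the tutorial: length 3 does not divide length 4,
# so R still recycles x1 (as 1, 5, 4, 1) but issues a warning.
x1 <- c(1, 5, 4)
x3 <- c(4, 4, 1, 2)
y2 <- suppressWarnings(x1 + 2 * x3)
y2                         # 9 13 6 5

# The hand-written sample variance agrees with var().
manual_var <- sum((y2 - mean(y2))^2) / (length(y2) - 1)
all.equal(manual_var, var(y2))
```

If silent recycling is a concern, checking `length()` of both operands before the operation is a simple safeguard.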

Character vectors are also available. Elements are enclosed in quotes (double quotes here) and separated by commas. For example:

In [36]:
cvector <- c("name1","name2","name3")

In [37]:
cvector

Index vectors can be used to select and modify subsets of a dataset. There are four types:
(a) Logical vectors: elements where the index is TRUE are selected

In [38]:
seq(-1, 5, by=.5) -> seq2

In [39]:
seq2

In [40]:
y <- seq2[seq2 > 0]

In [41]:
y

(b) Vectors of positive integers: the elements at those positions are selected

In [42]:
seq2[2:5]

(c) Vectors of negative integers: the elements at those positions are excluded

In [43]:
seq2[-(2:5)]

(d) Vectors of character strings: if an object has a names attribute, elements can be selected by name

In [44]:
week <- c(1,2,3,4,5,6,7)

In [45]:
names(week) <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")

In [46]:
weekend <- week[c("Saturday","Sunday")]

In [47]:
weekend
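The cells above only select elements, but the same index vectors can also appear on the left-hand side of an assignment to modify a subset. A minimal sketch, assuming the same `seq2` and a named `week` vector as above (the names `clipped` and `week2` are hypothetical):

```r
seq2 <- seq(-1, 5, by = 0.5)

# Logical index on the left-hand side: replace negative values with 0.
clipped <- seq2
clipped[clipped < 0] <- 0

# Positive integer index: overwrite the elements at positions 2 to 5.
clipped[2:5] <- NA

# Character index: modify elements by name.
week2 <- c(Monday = 1, Tuesday = 2, Wednesday = 3, Thursday = 4,
           Friday = 5, Saturday = 6, Sunday = 7)
week2[c("Saturday", "Sunday")] <- 0
week2
```

Working on copies (`clipped`, `week2`) keeps the original vectors intact for the later cells.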

In [48]:
objects()

## Arrays and Matrices

Arrays can have multiple subscripts, with the extent of each given by the dimension vector (of positive integers). A matrix is the special case of a 2-dimensional array. Elements are stored in column-major order (as in FORTRAN), so the first subscript varies fastest.

In [49]:
h <- seq(1,12)

In [50]:
Matrix1 <- array(h, dim=c(4,3))

In [51]:
Matrix1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
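Column-major ordering can be checked directly. The sketch below is illustrative only; the names `m_col`, `m_row`, and `arr` are hypothetical, and `matrix(..., byrow = TRUE)` is shown for contrast with the default fill order.

```r
h <- 1:12

# Default fill: down the columns (column-major).
m_col <- array(h, dim = c(4, 3))
m_col[2, 3]                 # 10

# For contrast, matrix() can fill across the rows instead.
m_row <- matrix(h, nrow = 4, ncol = 3, byrow = TRUE)
m_row[2, 3]                 # 6

# In a matrix with 4 rows, element [i, j] sits at linear position (j - 1) * 4 + i.
i <- 2; j <- 3
m_col[(j - 1) * 4 + i] == m_col[i, j]   # TRUE

# The same rule extends to higher dimensions: the first subscript varies fastest.
arr <- array(1:24, dim = c(2, 3, 4))
arr[2, 3, 4]                # 24
```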


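Just like index vectors, we have array indexing for extracting subsets of arrays, and index matrices for extraction and assignment operations on a subset of the data. An index matrix has one row per selected element, and each row holds that element's (row, column) subscripts.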

In [52]:
indexm <- array(c(2,4,1,2,1,3),dim=c(3,2))

In [53]:
indexm

     [,1] [,2]
[1,]    2    2
[2,]    4    1
[3,]    1    3


In [54]:
Matrix1[indexm]

In [55]:
Matrix1[indexm] <- -1

In [56]:
Matrix1

     [,1] [,2] [,3]
[1,]    1    5   -1
[2,]    2   -1   10
[3,]    3    7   11
[4,]   -1    8   12


Several standard matrix operations are part of R. These include transpose, multiplication, inversion (linear equation solution), eigenvalues and eigenvectors, determinants, singular value decomposition, least squares fitting, and QR decomposition.

In [57]:
svd(Matrix1)

$u
            [,1]       [,2]       [,3]
[1,] -0.06959598  -0.630956 -0.3921813
[2,] -0.40494016   0.723435 -0.2176596
[3,] -0.62196691 -0.1405692  -0.560512
[4,]  -0.6665861  -0.242439  0.6961641

$v
            [,1]        [,2]        [,3]
[1,] -0.09785836  0.08338371 -0.99170101
[2,] -0.45330523 -0.89084453 -0.03017262
[3,] -0.88596733  0.44659061  0.12497486


In [58]:
AMAT <- array(1:12,dim=c(4,3))

In [59]:
AMAT

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12


In [60]:
BMAT <- array(7:-4,dim=c(3,4))

In [61]:
BMAT

     [,1] [,2] [,3] [,4]
[1,]    7    4    1   -2
[2,]    6    3    0   -3
[3,]    5    2   -1   -4


In [62]:
CMAT=AMAT %*% BMAT

In [63]:
CMAT

     [,1] [,2] [,3] [,4]
[1,]   82   37   -8  -53
[2,]  100   46   -8  -62
[3,]  118   55   -8  -71
[4,]  136   64   -8  -80


In [64]:
determinant(CMAT)

$modulus
[1] -60.0973
attr(,"logarithm")
[1] TRUE

$sign
[1] 1

attr(,"class")
[1] "det"
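Note that `determinant()` reports the modulus on a log scale by default, so a modulus of -60.0973 corresponds to a determinant of roughly 8e-27, which is zero up to rounding error. That is expected here: CMAT is the product of a 4x3 and a 3x4 matrix, so its rank is at most 3 and it cannot be inverted. The sketch below contrasts this with a small invertible matrix; `A` and `b` are hypothetical values chosen for illustration.

```r
# A small invertible system: 2x + y = 3, x + 3y = 5.
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(3, 5)
x <- solve(A, b)            # solves A %*% x = b; x is about (0.8, 1.4)

# determinant() returns sign and log-modulus; det() returns the plain value.
ld <- determinant(A)        # logarithm = TRUE by default
det_from_log <- as.numeric(ld$sign * exp(ld$modulus))
det_plain <- det(A)         # about 5

# The singular product from the tutorial.
AMAT <- array(1:12, dim = c(4, 3))
BMAT <- array(7:-4, dim = c(3, 4))
CMAT <- AMAT %*% BMAT
qr(CMAT)$rank               # numerical rank, smaller than 4
abs(det(CMAT)) < 1e-8       # TRUE: zero up to rounding error
```

Because CMAT is numerically singular, calling `solve(CMAT)` on it would be expected to fail or give meaningless results.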

## Factors

Factors are data objects used to categorize data and store it as levels. A factor is stored as a vector of integer codes with a corresponding set of character labels (the levels), which are used when the factor is displayed. R has both ordered and unordered factors.

In [65]:
months = c(4,11,2,3,3,4,5,1,2,6,9,8,8,6,7,10,12)

In [66]:
fmons <- factor(months)

In [67]:
fmons

In [68]:
levels(fmons)

In [69]:
levels(fmons)=c('January','February','March','April','May','June','July','August','September','October','November','December')

In [70]:
fmons

In [71]:
tickets=c(4,5,1,0,7,4,27,2,55,2,11,3,8,10,9,22,3)

In [72]:
ticketav=tapply(tickets,fmons,mean)

In [73]:
ticketav
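Overwriting `levels()` as above works because every month from 1 to 12 appears in the data. A more defensive sketch, assuming the built-in constant `month.name`, fixes the levels and labels when the factor is created. It also shows an ordered factor, which the text mentions but the cells above do not use. The vectors `some_months` and `sizes` are hypothetical.

```r
# Only a few months are present, but all 12 levels are declared up front.
some_months <- c(4, 11, 2, 3, 3, 4)
fmons2 <- factor(some_months, levels = 1:12, labels = month.name)
nlevels(fmons2)             # 12
table(fmons2)[["March"]]    # 2

# An ordered factor: the order of the levels is meaningful,
# so comparison operators work.
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"), ordered = TRUE)
sizes < "high"              # TRUE FALSE TRUE TRUE
as.integer(sizes)           # underlying integer codes: 1 3 2 1
```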

## Lists and Data Frames

An R list is an object that consists of an ordered collection of objects (components). The components don't have to be of the same kind, and they are numbered. Components can also be named, in which case a component can be referred to either by its name or by its number.

In [74]:
FamilyList <- list(name="John", wife="Jane", numberofkids=4, kids.ages=c(5,6,11,18), numcars=3, carmodels=c('Volvo 230','Ford Ranger','Ford Fiesta'))

In [76]:
FamilyList["carmodels"]

In [83]:
length(FamilyList)

In [85]:
AveKidAge=(FamilyList[[4]][1]+FamilyList[[4]][2]+FamilyList[[4]][3]+FamilyList[[4]][4])/4

In [86]:
AveKidAge
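The difference between single and double brackets is easy to miss: `[` returns a sub-list, while `[[` and `$` return the component itself. A minimal sketch, using a shortened, hypothetical version of the list above (`fam` and its `pet` component are not from the original):

```r
fam <- list(name = "John", numberofkids = 4, kids.ages = c(5, 6, 11, 18))

class(fam["kids.ages"])     # "list": a sub-list with one component
class(fam[["kids.ages"]])   # "numeric": the component itself
fam$kids.ages               # same as fam[["kids.ages"]]

# The average age can be computed directly on the component.
ave_kid_age <- mean(fam$kids.ages)
ave_kid_age                 # 10

# Components can be added by assignment.
fam$pet <- "dog"
length(fam)                 # 4
```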

A data frame is a list of vectors of equal length. The restrictions are: column names must be non-empty; row names must be unique; every column must have the same number of data items; and components can be vectors (numeric, character, or logical), factors, numeric matrices, lists, or other data frames. For practical purposes, a data frame can be considered a matrix whose columns may have differing modes and attributes. Here is an example of a built-in data frame (only the first 10 of its 32 rows are shown):

In [87]:
mtcars

                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110  3.9  2.62 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110  3.9 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85  2.32 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15  3.44 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76  3.46 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21  3.57 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69  3.19  20.0  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92  3.15  22.9  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92  3.44  18.3  1  0    4    4


There are many options for slicing and dicing the information in a data frame. Some examples:

In [88]:
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110  3.9  2.62 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110  3.9 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85  2.32 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15  3.44 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76  3.46 20.22  1  0    3    1


In [89]:
mtcars[['disp']]

In [90]:
mtcars[c(2,15),]

                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4 Wag      21.0   6  160 110  3.9 2.875 17.02  0  1    4    4
Cadillac Fleetwood 10.4   8  472 205 2.93  5.25 17.98  0  0    3    4


In [91]:
mtcars['Hornet Sportabout',]

                   mpg cyl disp  hp drat   wt  qsec vs am gear carb
Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2
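Data frames can also be built by hand and filtered with logical conditions, in the same way as the index vectors earlier. The sketch below is illustrative: `small_cars` and its values are hypothetical, and `subset()` is shown as one convenient alternative to bracket indexing.

```r
# A data frame is a list of equal-length columns, possibly of different types.
small_cars <- data.frame(model = c("A", "B", "C"),
                         mpg   = c(21.0, 18.7, 24.4),
                         cyl   = c(6, 8, 4),
                         stringsAsFactors = FALSE)
is.list(small_cars)         # TRUE

# Rows by logical condition, columns by name.
efficient <- small_cars[small_cars$mpg > 20, c("model", "mpg")]
efficient

# The same idea on the built-in data, using subset().
four_cyl <- subset(mtcars, cyl == 4, select = c(mpg, hp))
head(four_cyl)

# $ and [[ ]] extract the same column vector.
identical(mtcars$disp, mtcars[["disp"]])   # TRUE
```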


Much more information and hands-on material on data frames is coming in tomorrow's talk on R. Note that we used a built-in dataset above. To get the full list of available datasets, you can run:

In [97]:
data()

## Statistical Models

R has a wide array of options to make fitting statistical models easy. These include functions for extracting model information, analysis of variance and model comparison, generalized linear models, nonlinear least squares, and maximum likelihood models. The class of generalized linear models includes gaussian, binomial, poisson, inverse gaussian, and gamma response distributions. Quasi-likelihood models are an option where the response distribution is not specified. A simple glm example is provided below; more details are coming in other talks.

In [100]:
mlfit <- glm( mpg ~ cyl + disp + hp, data=mtcars, family=quasi)

In [101]:
summary(mlfit)


Call:
glm(formula = mpg ~ cyl + disp + hp, family = quasi, data = mtcars)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.0889  -2.0845  -0.7745   1.3972   6.9183  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.18492    2.59078  13.195 1.54e-13 ***
cyl         -1.22742    0.79728  -1.540   0.1349    
disp        -0.01884    0.01040  -1.811   0.0809 .  
hp          -0.01468    0.01465  -1.002   0.3250    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasi family taken to be 9.33462)

    Null deviance: 1126.05  on 31  degrees of freedom
Residual deviance:  261.37  on 28  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 2
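With its default settings, `quasi` uses an identity link and constant variance, so the fit above should coincide with ordinary least squares. The sketch below checks this against `lm()` and shows a few of the extractor functions mentioned earlier. The reduced model `small_fit` is a hypothetical comparison model, not part of the original analysis.

```r
qfit <- glm(mpg ~ cyl + disp + hp, data = mtcars, family = quasi)
lfit <- lm(mpg ~ cyl + disp + hp, data = mtcars)

# Same coefficients as ordinary least squares.
all.equal(coef(qfit), coef(lfit))

# The estimated dispersion matches the residual variance from lm().
summary(qfit)$dispersion
summary(lfit)$sigma^2

# Extracting model information.
coef(qfit)                  # estimated coefficients
deviance(qfit)              # residual deviance
df.residual(qfit)           # residual degrees of freedom

# Model comparison against a smaller, hypothetical model.
small_fit <- glm(mpg ~ cyl, data = mtcars, family = quasi)
anova(small_fit, qfit, test = "F")
```

AIC is reported as NA in the summary because a quasi-likelihood model does not specify a full likelihood.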
