# Creating Vectors

In [1]:
x <- c(0.5, 0.6)       # numeric
y <- c(TRUE, FALSE)    # logical
z <- c("a", "b", "c")  # character


## Explicit Coercion


In [5]:
x <- 0:6
class(x)

In [6]:
as.numeric(x)

In [7]:
as.logical(x)

In [8]:
as.character(x)
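When a value has no sensible interpretation in the target type, explicit coercion produces NA instead of failing. The sketch below uses illustrative vectors (s and the logical strings are assumptions, not part of the original notebook):

```r
# Illustrative vector (not from the original notebook)
s <- c("1", "2", "three")
# "three" has no numeric interpretation, so it becomes NA (R also emits a warning,
# suppressed here to keep the output quiet)
n <- suppressWarnings(as.numeric(s))
n
# as.logical() on a character vector understands strings such as "TRUE" and "F"
b <- as.logical(c("TRUE", "F", "maybe"))
b
```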

## Matrices

In [9]:
m <- matrix(nrow=2, ncol=3)
m

     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA


In [10]:
dim(m)

In [11]:
attributes(m)

Matrices are constructed column-wise: values fill the first column from top to bottom, then the second column, and so on.

m <- matrix(1:6, nrow=2, ncol=3)
m
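To make the fill order concrete, here is a small sketch contrasting the default with the byrow argument of matrix(). The variable names by_col and by_row are illustrative only.

```r
# Default: values fill down each column first
by_col <- matrix(1:6, nrow = 2, ncol = 3)
# byrow = TRUE fills across each row instead
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_col[1, ]  # first row under column-wise filling
by_row[1, ]  # first row under row-wise filling
```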

A matrix can also be created from a vector by assigning a dimension attribute with dim()

In [13]:
m <- 1:10
m

In [15]:
dim(m) <- c(2,5)
m

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10


Matrices can also be created by column binding with cbind() and row binding with rbind()

In [16]:
x <- 1:3
y <- 10:12
cbind(x,y)

     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12


In [17]:
rbind(x,y)

  [,1] [,2] [,3]
x    1    2    3
y   10   11   12


## Factors

Factors are used to represent categorical data and can be unordered or ordered. One can think of
a factor as an integer vector where each integer has a label. Factors are important in statistical
modeling and are treated specially by modeling functions like lm() and glm().
Using factors with labels is better than using integers because factors are self-describing. Having a
variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.
Factor objects can be created with the factor() function.

In [18]:
x <- factor(c("Yes","Yes","No","Yes","No"))
x

In [19]:
levels(x)

In [21]:
table(x)

x
 No Yes 
  2   3 
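The "integer vector with labels" description can be checked directly. This is a minimal sketch; codes and labs are illustrative names.

```r
x <- factor(c("Yes", "Yes", "No", "Yes", "No"))
codes <- as.integer(x)   # underlying integer codes
labs  <- levels(x)       # labels; sorted alphabetically by default
codes
labs
# Mapping the codes back through the labels recovers the original strings
labs[codes]
```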

In versions of R before 4.0.0, factors were often created automatically when you read a dataset in
using a function like read.table(), because those functions defaulted to converting character columns
to factors. Since R 4.0.0 the default is stringsAsFactors = FALSE, so character columns stay as
character vectors unless you request factors explicitly.
The order of the levels of a factor can be set using the levels argument to factor(); otherwise the
levels are sorted alphabetically. This can be important in linear modeling because the first level is
used as the baseline level.

In [23]:
x <- factor(c("Yes","Yes","No","Yes","No"), levels=c("Yes","No"))
levels(x)
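Whether read.table() produces factors depends on the R version, so passing stringsAsFactors explicitly removes the ambiguity. The sketch below reads a small inline table (illustrative data, standing in for a file on disk) and then changes the baseline level with relevel().

```r
# Small inline table standing in for a file on disk (illustrative data)
txt <- "id answer
1 Yes
2 No
3 Yes"
# Request factors explicitly instead of relying on the version-dependent default
d <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)
class(d$answer)
levels(d$answer)   # alphabetical order, so "No" is the baseline
# relevel() moves a chosen level to the front so it becomes the baseline
d$answer <- relevel(d$answer, ref = "Yes")
levels(d$answer)
```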

## Missing Values

Missing values are denoted by NA, or by NaN for undefined mathematical operations.
• is.na() is used to test objects if they are NA
• is.nan() is used to test for NaN
• NA values have a class also, so there are integer NA, character NA, etc.
• A NaN value is also NA but the converse is not true

In [24]:
# create a vector with NAs in it
x <- c(1,2,NA,10,3)
# return logical vector indicating which elements are NA
is.na(x)
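The asymmetry between NA and NaN, and the typed NA variants, can be seen in a short sketch. The vector v is illustrative and not part of the original notebook.

```r
v <- c(1, NA, 0 / 0)   # 0/0 is an undefined operation and yields NaN
na_flags  <- is.na(v)  # TRUE for both NA and NaN
nan_flags <- is.nan(v) # TRUE only for NaN
na_flags
nan_flags
# NA comes in typed variants
class(NA)              # plain NA is logical
class(NA_integer_)
class(NA_character_)
```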


## Data Frames

In [25]:
x <- data.frame(f00=1:4, bar=c(T,T,F,F))
x

  f00   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE


In [26]:
nrow(x)
ncol(x)

## Names

In [27]:
x <- 1:3
names(x)

NULL

In [28]:
names(x) <- c("New York", "Seattle", "Los Angeles")
x

## Reading Data Files with read.table()

## Subsetting a Vector

In [30]:
x <- c("a","b","c","c","d","a")
x[1] # first element
x[2]  # second element

In [31]:
x[1:4]  # first to 4th element

In [33]:
u <- x > "a"
u
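A logical vector such as u is typically used as an index, keeping the elements where it is TRUE. A minimal sketch (kept and kept_direct are illustrative names):

```r
x <- c("a", "b", "c", "c", "d", "a")
u <- x > "a"                 # TRUE where the element sorts after "a"
kept <- x[u]                 # keep only the elements flagged TRUE
kept_direct <- x[x > "a"]    # same result without the intermediate vector
kept
```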

## Removing NA values

In [34]:
x <- c(1, 2, NA, 4, NA, 5)
bad <- is.na(x)
print(bad)

[1] FALSE FALSE  TRUE FALSE  TRUE FALSE


In [35]:
x[!bad]

In [36]:
head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6


In [38]:
# complete.cases() returns a logical vector that is TRUE for rows with no missing values
good <- complete.cases(airquality)
head(airquality[good,])

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
7    23     299  8.6   65     5   7
8    19      99 13.8   59     5   8
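The same idea on a data frame small enough to check by eye. The data frame df is illustrative and not part of the original notebook; na.omit() is shown as a one-step alternative.

```r
# Illustrative data frame (not from the original notebook)
df <- data.frame(a = c(1, NA, 3, 4), b = c("w", "x", NA, "z"))
good <- complete.cases(df)   # TRUE only for rows with no NA in any column
good
kept <- df[good, ]           # only the complete rows remain
kept
# na.omit() drops the incomplete rows in one step
n_complete <- nrow(na.omit(df))
n_complete
```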


## Vectorized operations

Many operations in R are vectorized, meaning that operations occur in parallel in certain R objects.
This allows you to write code that is efficient, concise, and easier to read than in non-vectorized
languages.

In [40]:
x <- 1:4
y <- 6:9
z <- x+y
z

In [41]:
x > 2
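For comparison, here is a sketch of the element-by-element loop that the vectorized expression replaces. The names z_loop and z_vec are illustrative.

```r
x <- 1:4
y <- 6:9
# Element-by-element loop, as one would write in a non-vectorized language
z_loop <- integer(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}
# Vectorized form: one expression, no explicit loop
z_vec <- x + y
identical(z_loop, z_vec)
# Multiplication and division are element-wise as well
x * y
x / y
```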

## Managing Data Frames with the dplyr Package

Some of the key “verbs” provided by the dplyr package are:
1. select: return a subset of the columns of a data frame, using a flexible notation
2. filter: extract a subset of rows from a data frame based on logical conditions
3. arrange: reorder rows of a data frame
4. rename: rename variables in a data frame
5. mutate: add new variables/columns or transform existing variables
6. summarise / summarize: generate summary statistics of different variables in the data frame,
possibly within strata
7. %>%: the “pipe” operator is used to connect multiple verb actions together into a pipeline

## Common dplyr Function Properties

All of the functions that we will discuss in this chapter have a few common characteristics. In
particular,
1. The first argument is a data frame.
2. The subsequent arguments describe what to do with the data frame specified in the first
argument, and you can refer to columns in the data frame directly without using the $ operator
(just use the column names).
3. The return result of a function is a new data frame.
4. Data frames must be properly formatted and annotated for this to all be useful. In particular,
the data must be tidy. In short, there should be one observation per row, and each column
should represent a feature or characteristic of that observation.
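These shared properties can be seen in a single call. The sketch below assumes the dplyr package is installed and uses the airquality data set that ships with R rather than the chicago data introduced later.

```r
library(dplyr)
# airquality ships with R and is used here in place of the chicago data
hot <- filter(airquality, Temp > 90)   # first argument is the data frame;
                                       # Temp is referenced without airquality$
class(hot)         # the result is again a data frame
nrow(airquality)   # the input data frame is left unchanged
```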

In [42]:
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [43]:
chicago <- readRDS("chicago.rds")
dim(chicago)

In [45]:
# display structure of dataFrame
str(chicago)

'data.frame':	6940 obs. of  8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
 $ dptp      : num  31.5 29.9 27.4 28.6 28.9 ...
 $ date      : Date, format: "1987-01-01" "1987-01-02" ...
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...


### select()

The select() function can be used to select columns of a data frame that you want to focus on.
Often you’ll have a large data frame containing “all” of the data, but any given analysis might only
use a subset of variables or observations. The select() function allows you to get the few columns
you might need.
Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could for
example use numerical indices. But we can also use the names directly.

In [46]:
names(chicago)[1:3]

In [48]:
subset <- select(chicago, city:dptp)
head(subset)

city,tmpd,dptp
chic,31.5,31.5
chic,33.0,29.875
chic,33.0,27.375
chic,29.0,28.625
chic,32.0,28.875
chic,40.0,35.125


Note that the : normally cannot be used with names or strings, but inside the select() function
you can use it to specify a range of variable names.
You can also omit variables using the select() function by using the negative sign. With select()
you can do

In [49]:
select(chicago, -(city:dptp))

date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
1987-01-01,,34.00000,4.250000,19.98810
1987-01-02,,,3.304348,23.19099
1987-01-03,,34.16667,3.333333,23.81548
1987-01-04,,47.00000,4.375000,30.43452
1987-01-05,,,4.750000,30.33333
1987-01-06,,48.00000,5.833333,25.77233
1987-01-07,,41.00000,9.291667,20.58171
1987-01-08,,36.00000,11.291667,17.03723
1987-01-09,,33.28571,4.500000,23.38889
1987-01-10,,,4.958333,19.54167


The select() function also allows a special syntax that allows you to specify variable names based
on patterns. So, for example, if you wanted to keep every variable that ends with a “2”, we could do

In [50]:
# selecting variables which end with 2
subset <- select(chicago, ends_with("2"))
str(subset)

'data.frame':	6940 obs. of  4 variables:
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...
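Other selection helpers work the same way as ends_with(). The sketch below assumes dplyr is installed and uses airquality (which ships with R) in place of chicago so that it runs without chicago.rds. The variable names are illustrative.

```r
library(dplyr)
# airquality stands in for chicago so the example runs without chicago.rds
s_cols    <- names(select(airquality, starts_with("S")))  # names beginning with "S"
o_cols    <- names(select(airquality, contains("o")))     # names containing "o"
range_in  <- names(select(airquality, Ozone:Wind))        # a range of adjacent columns
range_out <- names(select(airquality, -(Ozone:Wind)))     # everything except that range
s_cols
o_cols
range_in
range_out
```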


### filter()

The filter() function is used to extract subsets of rows from a data frame. This function is similar
to the existing subset() function in R but is quite a bit faster in my experience.
Suppose we wanted to extract the rows of the chicago data frame where the levels of PM2.5 are
greater than 30 (which is a reasonably high level). We could do

In [52]:
# keep the rows where PM2.5 is greater than 30
chic.f <- filter(chicago, pm25tmean2 > 30)
str(chic.f)

'data.frame':	194 obs. of  8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  23 28 55 59 57 57 75 61 73 78 ...
 $ dptp      : num  21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...
 $ date      : Date, format: "1998-01-17" "1998-01-23" ...
 $ pm25tmean2: num  38.1 34 39.4 35.4 33.3 ...
 $ pm10tmean2: num  32.5 38.7 34 28.5 35 ...
 $ o3tmean2  : num  3.18 1.75 10.79 14.3 20.66 ...
 $ no2tmean2 : num  25.3 29.4 25.3 31.4 26.8 ...


In [57]:
summary(chic.f$pm25tmean2)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  30.05   32.12   35.04   36.63   39.53   61.50 

We can place an arbitrarily complex logical expression inside of filter(), so we could for example
extract the rows where PM2.5 is greater than 30 and temperature is greater than 80 degrees
Fahrenheit.

In [61]:
chic.f <- filter(chicago, pm25tmean2>30 & tmpd>80)
select(chic.f, date, tmpd, pm25tmean2)

date,tmpd,pm25tmean2
1998-08-23,81,39.6
1998-09-06,81,31.5
2001-07-20,82,32.3
2001-08-01,84,43.7
2001-08-08,85,38.8375
2001-08-09,84,38.2
2002-06-20,82,33.0
2002-06-23,82,42.5
2002-07-08,81,33.1
2002-07-18,82,38.85


### arrange()

The arrange() function is used to reorder rows of a data frame according to one or more of the variables/columns.
Reordering rows of a data frame (while preserving corresponding order of other columns)
is normally a pain to do in R. The arrange() function simplifies the process quite a bit.
Here we can order the rows of the data frame by date, so that the first row is the earliest (oldest)
observation and the last row is the latest (most recent) observation.

In [62]:
# arranging by date and selecting first 3 rows
chicago <- arrange(chicago, date)
head(select(chicago, date, pm25tmean2), 3)

date,pm25tmean2
1987-01-01,
1987-01-02,
1987-01-03,
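Rows can also be put in descending order with desc(), and additional columns break ties. A sketch assuming dplyr is installed, with airquality standing in for chicago:

```r
library(dplyr)
# airquality stands in for chicago so the example runs without chicago.rds
asc <- arrange(airquality, Temp)         # coldest day first
dsc <- arrange(airquality, desc(Temp))   # hottest day first
head(asc$Temp, 3)
head(dsc$Temp, 3)
# Further columns break ties: sort by Month, then by Temp (descending) within each month
by_month <- arrange(airquality, Month, desc(Temp))
head(by_month, 3)
```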


### rename()
Renaming a variable in a data frame in R is surprisingly hard to do! The rename() function is designed to make this process easier.
Here you can see the names of the first five variables in the chicago data frame.

In [63]:
head(chicago[ ,1:5],3)

city,tmpd,dptp,date,pm25tmean2
chic,31.5,31.5,1987-01-01,
chic,33.0,29.875,1987-01-02,
chic,33.0,27.375,1987-01-03,


In [64]:
chicago <- rename(chicago, dewpoint=dptp, pm25 = pm25tmean2)
head(chicago[ ,1:5],3)

city,tmpd,dewpoint,date,pm25
chic,31.5,31.5,1987-01-01,
chic,33.0,29.875,1987-01-02,
chic,33.0,27.375,1987-01-03,


The syntax inside the rename() function is to have the new name on the left-hand side of the = sign
and the old name on the right-hand side.
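The new = old pattern, sketched on airquality (standing in for chicago; dplyr assumed installed), with a base R equivalent for comparison. The new column names are illustrative.

```r
library(dplyr)
# airquality stands in for chicago so the example runs without chicago.rds
aq <- rename(airquality, solar = Solar.R, temperature = Temp)  # new = old
names(aq)
# Base R equivalent for a single column, shown for comparison
aq2 <- airquality
names(aq2)[names(aq2) == "Temp"] <- "temperature"
names(aq2)
```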

### mutate()
The mutate() function exists to compute transformations of variables in a data frame. Often, you
want to create new variables that are derived from existing variables and mutate() provides a clean
interface for doing that.
For example, with air pollution data, we often want to detrend the data by subtracting the mean
from the data. That way we can look at whether a given day’s air pollution level is higher than or
less than average (as opposed to looking at its absolute level).
Here we create a pm25detrend variable that subtracts the mean from the pm25 variable.

In [72]:
chicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm=TRUE))
head(chicago)

city,tmpd,dewpoint,date,pm25,pm10tmean2,o3tmean2,no2tmean2,pm25detrend
chic,31.5,31.5,1987-01-01,,34.0,4.25,19.9881,
chic,33.0,29.875,1987-01-02,,,3.304348,23.19099,
chic,33.0,27.375,1987-01-03,,34.16667,3.333333,23.81548,
chic,29.0,28.625,1987-01-04,,47.0,4.375,30.43452,
chic,32.0,28.875,1987-01-05,,,4.75,30.33333,
chic,40.0,35.125,1987-01-06,,48.0,5.833333,25.77233,
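Several derived columns can be created in one mutate() call. The sketch below assumes dplyr is installed and uses airquality in place of chicago; ozone_detrend and temp_c are illustrative column names.

```r
library(dplyr)
# airquality stands in for chicago so the example runs without chicago.rds
aq <- mutate(airquality,
             ozone_detrend = Ozone - mean(Ozone, na.rm = TRUE),  # ozone centered on its mean
             temp_c = (Temp - 32) * 5 / 9)                       # Fahrenheit to Celsius
head(select(aq, Ozone, ozone_detrend, Temp, temp_c), 3)
# The detrended values average to (numerically) zero
mean(aq$ozone_detrend, na.rm = TRUE)
```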


### group_by()
The group_by() function is used to generate summary statistics from the data frame within strata
defined by a variable. For example, in this air pollution dataset, you might want to know what the
average annual level of PM2.5 is. So the stratum is the year, and that is something we can derive
from the date variable. In conjunction with the group_by() function we often use the summarize()
function (or summarise() for some parts of the world).
The general operation here is a combination of splitting a data frame into separate pieces defined by
a variable or group of variables (group_by()), and then applying a summary function across those
subsets (summarize()).
First, we can create a year variable using as.POSIXlt().

In [74]:
chicago <- mutate(chicago, year=as.POSIXlt(date)$year + 1900)

years <- group_by(chicago, year)

In [79]:

summarize(years, pm25 = mean(pm25, na.rm=TRUE), o3 = max(o3tmean2, na.rm=TRUE),
          no2 = median(no2tmean2, na.rm=TRUE))

year,pm25,o3,no2
1987,,62.96966,23.49369
1988,,61.67708,24.52296
1989,,59.72727,26.14062
1990,,52.22917,22.59583
1991,,63.10417,21.38194
1992,,50.8287,24.78921
1993,,44.30093,25.76993
1994,,52.17844,28.475
1995,,66.5875,27.26042
1996,,58.39583,26.38715
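The split-then-summarize pattern can be tried without chicago.rds. This sketch assumes dplyr is installed and groups airquality by its Month column; the summary column names are illustrative.

```r
library(dplyr)
# airquality stands in for chicago so the example runs without chicago.rds
months <- group_by(airquality, Month)   # split: one group per month
by_month <- summarize(months,
                      ozone = mean(Ozone, na.rm = TRUE),  # apply: one value per group
                      temp  = max(Temp),
                      n     = n())                        # number of rows in each group
by_month
```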


### %>% Pipeline operator
The pipeline operator %>% is very handy for stringing together multiple dplyr functions in a sequence
of operations. Notice above that every time we wanted to apply more than one function, the sequence
gets buried in nested function calls that are difficult to read, i.e.

#### third(second(first(x)))
The %>% operator lets you write the same operations from left to right, with each result passed as the
first argument of the next function, i.e.

#### first(x) %>% second() %>% third()
Another example might be computing the average pollutant level by month. This could be useful to
see if there are any seasonal trends in the data.

In [81]:
mutate(chicago, month = as.POSIXlt(date)$mon + 1) %>%
    group_by(month) %>%
    summarize(pm25 = mean(pm25, na.rm=TRUE), o3 = max(o3tmean2, na.rm=TRUE),
              no2 = median(no2tmean2, na.rm=TRUE))

month,pm25,o3,no2
1,17.76996,28.22222,25.35417
2,20.37513,37.375,26.78034
3,17.40818,39.05,26.76984
4,13.85879,47.94907,25.03125
5,14.0742,52.75,24.22222
6,15.86461,66.5875,25.0114
7,16.57087,59.54167,22.38442
8,16.9338,53.96701,22.98333
9,15.91279,57.48864,24.47917
10,14.23557,47.09275,24.15217


### Summary 
The dplyr package provides a concise set of operations for managing data frames. With these
functions we can do a number of complex operations in just a few lines of code. In particular,
we can often conduct the beginnings of an exploratory analysis with the powerful combination of
group_by() and summarize().
Once you learn the dplyr grammar there are a few additional benefits:
1. dplyr can work with other data frame “backends” such as SQL databases. There is an SQL
interface for relational databases via the DBI package
2. dplyr can be integrated with the data.table package for large fast tables
The dplyr package is a handy way to both simplify and speed up your data frame management code.
It’s rare that you get such a combination at the same time!