In [1]:
options(jupyter.rich_display = FALSE)

Data vectors in R
========== 

* The fundamental data type in R is the _vector_.
* The single-number objects we have seen so far are one-element vectors.

Variables with single values are inconvenient if we want to process data in bulk, e.g.:

In [2]:
height1 <- 1.70
weight1 <- 65
bmi1 <- weight1 / height1^2

height2 <- 1.75
weight2 <- 66
bmi2 <- weight2 / height2^2

In [6]:
bmi2

[1] 21.55102

Better way: Hold related values in a **vector**.

Creating vectors
=========
The most general way to create data vectors is to use the `c()` function (_concatenate_).

In [8]:
heights <- c(1.70, 1.75, 1.62)
weights <- c(65, 66, 61)

In [9]:
heights

[1] 1.70 1.75 1.62

In [10]:
weights

[1] 65 66 61

Vectors of consecutive integers can also be created with the _colon operator_ (`:`).

In [12]:
x <- 2:10 # assign integers from 2 to 10, inclusive.
x

[1]  2  3  4  5  6  7  8  9 10

Extending vectors
==========
The function `c()` can also be used to add new elements to vectors.

Suppose initially we have only two pieces of data:

In [13]:
heights <- c(1.70, 1.75)
heights

[1] 1.70 1.75

Then we get another data point, and we extend the vector.

In [14]:
c(heights, 1.62)

[1] 1.70 1.75 1.62

In [15]:
heights

[1] 1.70 1.75

In [16]:
heights <- c(heights, 1.62)

In [17]:
heights

[1] 1.70 1.75 1.62

In [18]:
heights <- c(1.62, heights)

In [19]:
heights

[1] 1.62 1.70 1.75 1.62

In [20]:
c(heights, heights)

[1] 1.62 1.70 1.75 1.62 1.62 1.70 1.75 1.62

Modes
=====

* R variable types are called _modes_.
* Modes include: "numeric", "character", "logical", "complex", and so on.
* All elements in a vector must be of the same mode.

In [25]:
mode(c(1,2))
mode(c("abc","xyz"))
mode(c(TRUE,FALSE))
mode(2+4i)

[1] "numeric"

[1] "character"

[1] "logical"

[1] "complex"

In [23]:
typeof("abc")

[1] "character"
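When values of different modes are combined with `c()`, R does not raise an error; it silently converts (coerces) them to a common mode. The following is a small sketch of that behaviour, together with `typeof()`, which reports the finer-grained storage type. The variable names `mixed_num` and `mixed_chr` are illustrative only.

```r
mixed_num <- c(1, TRUE, FALSE)   # logicals become numbers: 1 1 0
mixed_chr <- c(1, "abc", TRUE)   # everything becomes a string: "1" "abc" "TRUE"

mode(mixed_num)   # "numeric"
mode(mixed_chr)   # "character"

# typeof() distinguishes the two numeric storage types
typeof(2:10)      # "integer"
typeof(c(1, 2))   # "double"
```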

# Vector arithmetic

If you add two vectors with the same number of elements, they are added elementwise.

In [26]:
c(1,4,9) + c(2,16,5)

[1]  3 20 14

Same applies to all basic operations:

In [27]:
c(1,4,9) * c(2,16,5)

[1]  2 64 45

In [28]:
c(1,4,9) / c(2,16,5)

[1] 0.50 0.25 1.80

In [29]:
3 > 2

[1] TRUE

In [30]:
c(1,4,9) > c(2,16,5)

[1] FALSE FALSE  TRUE

If an arithmetic or logical operation involves a vector and a single number, the single number is _recycled_: it is paired with every element of the vector.

In [31]:
c(1,4,9) + 5  # converted to: c(1,4,9) + c(5,5,5)

[1]  6  9 14

In [32]:
c(1,4,9) < 5  # converted to: c(1,4,9) < c(5,5,5)

[1]  TRUE  TRUE FALSE

In [33]:
c(1,4,9)^2   # converted to: c(1,4,9) ^ c(2,2,2)

[1]  1 16 81
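Elementwise arithmetic and recycling together remove the need for one variable per person, which was the problem with the BMI calculation at the start of this section. A minimal sketch, using the illustrative names `height_m`, `weight_kg`, and `bmi`:

```r
height_m  <- c(1.70, 1.75, 1.62)
weight_kg <- c(65, 66, 61)

# The exponent 2 is recycled over every height; the division is elementwise
bmi <- weight_kg / height_m^2
round(bmi, 2)   # 22.49 21.55 23.24
```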

## Pause to think

What is the output of the operation `2 * c(1,2,3) + 3`?

* `5 7 9`
* `2 4 6 3`
* `4 5 6 4 5 6`
* `1 2 3 1 2 3 3`

# Vectorized functions 

## sum(), cumsum()
`sum()` adds up all elements of a vector; `cumsum()` returns the running (cumulative) sums.

In [34]:
sum(c(1,4,9))

[1] 14

In [35]:
sum(1:1000)

[1] 500500

In [36]:
cumsum(1:10)

 [1]  1  3  6 10 15 21 28 36 45 55

## prod(), cumprod()
`prod()` multiplies all elements of a vector; `cumprod()` returns the running (cumulative) products.

In [37]:
prod(c(1,4,9))

[1] 36

In [39]:
prod(1:5)  # 5!

[1] 120

In [40]:
cumprod(1:5)

[1]   1   2   6  24 120

## Pause to think

Which of the following commands can be used to calculate $\sum_{i=1}^{10} i^2$?

* `sum(1:10^2)`
* `sum(1:10)^2`
* `sum((1:10)^2)`
* `sum(1^2:10^2)`

# Mathematical functions
Familiar mathematical functions are applied to vectors elementwise.

In [41]:
sqrt(c(4,9,16))

pi

sin(c(0, pi/4, pi/2, 3*pi/4, pi))  # or: sin( 0:4*pi/4 )

exp(1:5)

log(exp(1:5))

[1] 2 3 4

[1] 3.141593

[1] 0.000000e+00 7.071068e-01 1.000000e+00 7.071068e-01 1.224647e-16

[1]   2.718282   7.389056  20.085537  54.598150 148.413159

[1] 1 2 3 4 5
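Elementwise functions combine naturally with the vectorized functions of the previous section. As a small sketch, the Euclidean length of a vector can be written as the square root of the sum of squares (the name `v` is illustrative):

```r
v <- c(3, 4)

# Square each element, add them up, take the square root
sqrt(sum(v^2))             # 5

# Another elementwise function: base-10 logarithm
log10(c(10, 100, 1000))    # 1 2 3
```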

Missing data
========
* Many data sets have missing data, i.e., observations for which some values are not available.
* In R, missing values are denoted with `NA`.
* Any vector can contain missing values.

In [42]:
weights <- c(65, NA, 61)
names <- c("Can","Cem",NA)

Vector element names
===========
For readability, we can assign name labels to the elements of a data vector.

In [43]:
heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
heights

  Can   Cem Hande 
 1.70  1.75  1.62 

In [44]:
weights <- c(Can=65, Cem=66, Hande=61)
weights

  Can   Cem Hande 
   65    66    61 

We can retrieve these names with the `names()` function.

In [45]:
names(heights)

[1] "Can"   "Cem"   "Hande"

We can assign names to the elements of a vector that already exists.

In [46]:
heights <- c(1.70, 1.75, 1.62)
names(heights) <- c("Can","Cem","Hande")
heights

  Can   Cem Hande 
 1.70  1.75  1.62 

If for some reason we want to remove the names, we use the `unname()` function.

In [47]:
unname(heights)

[1] 1.70 1.75 1.62

The original vector is not changed with this function call, because we did not assign the result to `heights`.

In [48]:
heights

  Can   Cem Hande 
 1.70  1.75  1.62 

Vector indexing
=========
We can access a single element of a vector by providing the index of the element in square brackets.

In [49]:
heights

  Can   Cem Hande 
 1.70  1.75  1.62 

In [50]:
heights[1]  # first element

Can 
1.7 

In [51]:
heights[3] # third element

Hande 
 1.62 

We can select a slice of the vector by providing a range inside brackets.

In [52]:
heights[1:2]  # select from element 1 to element 2, inclusive.

 Can  Cem 
1.70 1.75 

We can also give a vector consisting of element indices.

In [53]:
heights[c(1,3)]  # select elements 1 and 3.

  Can Hande 
 1.70  1.62 

The indices do not have to be in order:

In [54]:
heights[c(2,1,3)]

  Cem   Can Hande 
 1.75  1.70  1.62 

We can select the same element more than once.

In [55]:
heights[c(1,1,3,2,3)]

  Can   Can Hande   Cem Hande 
 1.70  1.70  1.62  1.75  1.62 

We can provide a Boolean (true/false) vector for indexing. This will select only elements with corresponding `TRUE` values.

In [56]:
heights

  Can   Cem Hande 
 1.70  1.75  1.62 

In [57]:
heights[c(T,F,F)]  # T is a shorthand for TRUE, F is for FALSE.

Can 
1.7 

We can **exclude** elements using negative indices.

In [58]:
heights[-1]  # exclude first element.

  Cem Hande 
 1.75  1.62 

In [59]:
heights[c(-1,-3)]  # exclude 1st and 3rd elements

 Cem 
1.75 

## Pause to think

Suppose we define a four-element vector

`v <- c(3,6,2,-1)`.

Which of the following CANNOT be used to select the second and third elements of this vector?

* `v[2:3]`
* `v[c(2,3)]`
* `v[c(6,2)]`
* `v[c(F,T,T,F)]`
* `v[c(-1,-4)]`

Using names to select elements
=======================
If the elements are given names consisting of strings, we can use these names in brackets instead of indices.

In [60]:
heights["Can"]

Can 
1.7 

In [61]:
heights[c("Can","Can","Hande","Cem","Hande")]

  Can   Can Hande   Cem Hande 
 1.70  1.70  1.62  1.75  1.62 

Modify element values in a vector
=================

In [62]:
heights

  Can   Cem Hande 
 1.70  1.75  1.62 

In [63]:
heights[1] <- 1.72

In [64]:
heights

  Can   Cem Hande 
 1.72  1.75  1.62 

In [None]:
heights[1] <- 1.70

Insert values to an existing vector
============
A vector's size is determined at its creation, and its elements are stored contiguously (side by side) in memory. Therefore it is not possible to add an element to, or remove an element from, an existing vector in place. However, we can build a new vector and reassign the variable name to it.

In [65]:
heights

  Can   Cem Hande 
 1.72  1.75  1.62 

In [66]:
heights <- c(heights[1:2], Lale=1.76, heights[3])

In [67]:
heights

  Can   Cem  Lale Hande 
 1.72  1.75  1.76  1.62 
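Base R also provides `append()`, whose `after` argument gives the position after which the new values go. Like `c()`, it returns a new vector rather than changing the original. A brief sketch with the illustrative names `h` and `h2`:

```r
h <- c(Can = 1.72, Cem = 1.75, Hande = 1.62)

# Insert Lale's height after the second element; h itself is unchanged
h2 <- append(h, c(Lale = 1.76), after = 2)

names(h2)    # "Can" "Cem" "Lale" "Hande"
length(h)    # 3
```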

Delete elements from vector
==========
Again, we cannot directly remove an element from an existing vector, but we can create a new vector without the element we want to delete, and reassign to the name.

In [68]:
heights

  Can   Cem  Lale Hande 
 1.72  1.75  1.76  1.62 

In [69]:
heights <- heights[-3]  # exclude element 3

In [70]:
heights

  Can   Cem Hande 
 1.72  1.75  1.62 

## Pause to think

Suppose we define a vector with

`v <- c(3,4,5)`

What is the output of the following commands?

    v <- c(5, v, 1:2)
    v <- v[-2]
    v[2:4]
    
* `2 3 4`
* `5 3 4 5 3 4`
* `4 5 3`
* `4 5 1`

In [None]:
v <- c(3,4,5)
v <- c(5,v,1:2)
v

In [None]:
v <- v[-2]
v

In [None]:
v[2:4]

Getting the length of a vector
==========
We can get the number of elements in a vector using the `length()` function.

In [71]:
length(heights)

[1] 3

In [72]:
length(10:17)

[1] 8
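Two common uses of `length()` are picking out the last element and computing an average by hand. A short sketch (the name `h` is illustrative; `mean()` is the built-in equivalent of the manual calculation):

```r
h <- c(1.72, 1.75, 1.62)

h[length(h)]         # last element: 1.62

sum(h) / length(h)   # arithmetic mean, computed by hand
mean(h)              # built-in function giving the same value
```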

Vector filtering
===========

* Apply a Boolean function (e.g., greater than, less than, ...) to each element of the vector.
* Returns a Boolean vector according to the result on each element.

In [73]:
heights

  Can   Cem Hande 
 1.72  1.75  1.62 

In [74]:
heights > 1.65

  Can   Cem Hande 
 TRUE  TRUE FALSE 

Using this Boolean vector, we can select data points satisfying the condition.

In [75]:
tall_people <- heights>1.65
tall_people

  Can   Cem Hande 
 TRUE  TRUE FALSE 

In [76]:
heights[tall_people]

 Can  Cem 
1.72 1.75 

Obviously, this can be done in a single line, too.

In [77]:
heights[heights>1.65]

 Can  Cem 
1.72 1.75 

One can also filter a vector according to another vector's values.

In [78]:
heights

  Can   Cem Hande 
 1.72  1.75  1.62 

In [79]:
weights

  Can   Cem Hande 
   65    66    61 

In [80]:
weights[ heights > 1.65 ]  # weights of people who are taller than 1.65

Can Cem 
 65  66 
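Conditions can be combined before filtering. In R, `&` is the elementwise AND, `|` is the elementwise OR, and `!` negates a Boolean vector. A minimal sketch, using the illustrative names `heights_m` and `weights_kg`:

```r
heights_m  <- c(Can = 1.72, Cem = 1.75, Hande = 1.62)
weights_kg <- c(Can = 65, Cem = 66, Hande = 61)

# Taller than 1.65 AND lighter than 66
weights_kg[heights_m > 1.65 & weights_kg < 66]   # Can: 65

# Shorter than 1.65 OR heavier than 65
weights_kg[heights_m < 1.65 | weights_kg > 65]   # Cem: 66, Hande: 61

# NOT taller than 1.65
heights_m[!(heights_m > 1.65)]                   # Hande: 1.62
```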

## Pause to think

Given the vectors with named values:

    ages <- c(Ali=18, Hasan=21, Fatma=18, Hande=22, Cem=21)
    weights <- c(Ali=75, Hasan=72, Fatma=60, Hande=56, Cem=67)

which of the following commands prints the weights of people who are 18 years old?

* `weights[ages==18]`
* `ages[weights]==18`
* `weights[names(ages==18)]`
* `names(weights[ages==18])`

Modify a vector by filtering
=========
* We can use filtering to selectively change only the elements that satisfy a condition.
* **Example**: For people who weigh more than 65 kg, decrease the weight by 1 kg.

In [87]:
weights

  Ali Hasan Fatma Hande   Cem 
   75    72    60    56    67 

In [88]:
weights[weights > 65] - 1

  Ali Hasan   Cem 
   74    71    66 

In [89]:
weights[weights > 65] <- weights[weights > 65] - 1
weights

  Ali Hasan Fatma Hande   Cem 
   74    71    60    56    66 

In [91]:
weights["Cem"] <- 66
weights

  Ali Hasan Fatma Hande   Cem 
   74    71    60    56    66 
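The same pattern works with a single replacement value, which is recycled over all selected positions. A sketch with hypothetical data in `scores`, where negative entries are replaced by zero:

```r
scores <- c(12, -3, 7, -1, 20)   # hypothetical data

# The single 0 is recycled over every element that matches the condition
scores[scores < 0] <- 0
scores                           # 12 0 7 0 20
```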

Get indices of elements that satisfy a condition
===========
The `which()` function returns the indices (and labels, if available) of elements in a vector for which a Boolean function returns `TRUE`.

In [92]:
heights

  Can   Cem Hande 
 1.72  1.75  1.62 

In [93]:
heights > 1.65

  Can   Cem Hande 
 TRUE  TRUE FALSE 

In [94]:
which(heights > 1.65)

Can Cem 
  1   2 
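Two related base R functions, `which.max()` and `which.min()`, return the index of the largest and smallest element directly. A brief sketch (the name `h` is illustrative):

```r
h <- c(Can = 1.72, Cem = 1.75, Hande = 1.62)

which.max(h)               # named index: Cem = 2
names(which.min(h))        # "Hande"
unname(which(h > 1.65))    # plain indices without labels: 1 2
```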

Using all() and any()
==========
* We use the `all()` function to check if **all** elements in a vector are `TRUE`.
* We use the `any()` function to check if **at least one** of the elements in a vector is `TRUE`.

In [95]:
heights

  Can   Cem Hande 
 1.72  1.75  1.62 

In [96]:
all(heights > 1.60) # TRUE

all(heights > 1.70) # FALSE

any(heights > 1.70) # TRUE

[1] TRUE

[1] FALSE

[1] TRUE
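Besides asking whether all or any elements satisfy a condition, we can count how many do. In arithmetic, `TRUE` is treated as 1 and `FALSE` as 0, so `sum()` counts the matches and `mean()` gives their proportion. A small sketch (the name `h` is illustrative):

```r
h <- c(Can = 1.72, Cem = 1.75, Hande = 1.62)

sum(h > 1.70)    # 2 people are taller than 1.70
mean(h > 1.70)   # proportion of people taller than 1.70 (two out of three)
```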

## Pause to think

Suppose a vector named `ages` holds the ages of a group who want to enter a museum. You want to make sure that there is at least one grownup among them. Which command do you use?

* `any(ages > 18)`
* `all(ages > 18)`
* `any(ages < 18)`
* `all(ages < 18)`

## Pause to think

Suppose a vector named `ages` holds the ages of a group who want to enter a bar. You want to make sure that everybody is of proper age to drink. Which command do you use?

* `any(ages > 18)`
* `all(ages > 18)`
* `any(ages < 18)`
* `all(ages < 18)`

# Generating vectors with repeated elements
The `rep()` function can be used to replicate values or vectors a specified number of times.

In [None]:
rep(3,10)

In [None]:
rep("abc",5)

In [None]:
rep(c(1,2,3),5)

In [None]:
rep(c(1,2,3),length.out=10)
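`rep()` has further arguments that control how the repetition is done. The sketch below contrasts `each`, a vector-valued `times`, and `length.out`; the expected results are given as comments.

```r
# Repeat each element twice before moving on to the next
rep(c(1, 2, 3), each = 2)             # 1 1 2 2 3 3

# Give a separate repetition count for every element
rep(c("a", "b"), times = c(3, 1))     # "a" "a" "a" "b"

# Keep cycling until the result has exactly 10 elements
rep(c(1, 2, 3), length.out = 10)      # 1 2 3 1 2 3 1 2 3 1
```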

Generating sequences with seq()
==========
The `seq()` function generates a vector of numbers in arithmetic progression. It is a generalization of the colon (`:`) operator.

In [None]:
seq(4,9)  # same as 4:9

In [None]:
seq(from=12, to=29, by=3)

In [None]:
seq(from=1.1, to=6, length.out=10)
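A few more variations, sketched with expected results as comments: a negative step for a decreasing sequence, and the base R helpers `seq_len()` and `seq_along()`, which generate index sequences.

```r
# The sequence stops at the last value that does not pass `to`
seq(from = 12, to = 29, by = 3)    # 12 15 18 21 24 27

# A negative step produces a decreasing sequence
seq(10, 1, by = -3)                # 10 7 4 1

# Index sequences: 1 up to n, and 1 up to the length of a vector
seq_len(5)                         # 1 2 3 4 5
seq_along(c("a", "b", "c"))        # 1 2 3
```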

Sorting a vector
=========

In [None]:
sort(heights)

In [None]:
sort(heights, decreasing = TRUE)

* Often we need to sort a vector according to the values of another vector.
* First we compute an _ordering_.

In [None]:
heights

In [None]:
order(heights)

Then we use this ordering with the other vector:

In [None]:
weights <- c(Can=65, Cem=66, Hande=61)  # restore the weights that match heights
weights[order(heights)]  # return the weights of people ordered by their heights.

In [None]:
heights[sort(names(heights))]  # heights ordered alphabetically by name
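`order()` also accepts `decreasing = TRUE`, and `names()` can be applied to a sorted vector to get only the labels. A self-contained sketch, with the illustrative names `h` and `w` standing in for matching height and weight vectors:

```r
h <- c(Can = 1.72, Cem = 1.75, Hande = 1.62)
w <- c(Can = 65, Cem = 66, Hande = 61)

order(h)                             # 3 1 2: positions from shortest to tallest

# Weights listed from the tallest person to the shortest
w[order(h, decreasing = TRUE)]       # Cem: 66, Can: 65, Hande: 61

# Only the names, from tallest to shortest
names(sort(h, decreasing = TRUE))    # "Cem" "Can" "Hande"
```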

# Exercises

1. Create and store a sequence of values from 5 to −11 that progresses in steps of 0.3.


2. Create and store a 20-element vector that contains, in any configuration, the following:

    (a) A sequence of integers from 6 to 12 (inclusive)
    
    (b) A threefold repetition of the value 5.3
    
    (c) The number −3
    
    (d) A sequence of nine values starting at 102 and ending at the number that is the total length of the whole vector, i.e., 20.
    

3. A set of temperature measurements is given on the Fahrenheit scale as follows:

    `temperatures <- c(87, 89, 101, 91, 86, 71, 76)`
    
    Write an R expression that returns a vector of corresponding Celsius values.

4. Consider the following data:

|Country|Area (km²)|Population|
|------|------|------|
|Russia|17,098,242|142,257,519|
|United States|9,833,517|326,625,791|
|China|9,596,960|1,379,302,771|
|Brazil|8,515,770|207,353,391|
|Australia|7,741,220|23,232,413|
|India|3,287,263|1,281,935,911|
|Turkey|783,562|80,845,215|
|France|643,801|67,106,161|
|Japan|377,915|126,451,398|
|United Kingdom|243,610|65,648,100|

   A. Create two vectors `area` and `population` that hold the data in the respective columns. Label the elements in each vector with the country name.

   B. Create a new vector called `density` that holds the population density of the countries.

   C. Print the names of countries sorted by population density, in descending order (from highest to lowest).