# Introduction to R Programming: Loops, Conditionals, & Functions

Matthew D. Turner, PhD  
Georgia State University

Some rights reserved: [cc by-nc-sa](https://creativecommons.org/licenses/by-nc-sa/4.0/). See the bottom of the document for details.
***
# Exercises
This notebook is a collection of basic exercises for making loops, using conditionals, and writing functions. You should start here if you are completely new to these topics. The exercises below are of varying complexity, so if you find some to be too hard, you may want to proceed to the next section. There are also some answers at the end if you get stuck.

In the notebook below, empty cells with comments contain instructions for you to build simple programs in R code. Cells are executed by either using the key combination shift-enter (or shift-return on some computers) or by using the mouse to press the "Run" button at the top of the notebook.

Start by running the cells below. **You need to run all of the cells with code, even ones where you do not make any changes.** If something does not work, it could be due to skipping a cell. Also, you may press shift-enter to move through the notebook one cell at a time; this includes the text cells!

The following cell sets the figure size in inches (approximate inches that depend on other display settings on your browser). 

In [None]:
# Setting up the R environment; just run this cell

options(repr.plot.width = 4, repr.plot.height = 4)  # Set the figure size 

## 1. Sorting and Filtering Examples with `for`
Here we will use loops to display some data and do basic data filtering. Please note that using loops to filter data is not always optimal, but it does lend itself to easy examples for practice.

In [None]:
# Here is some data drawn from a mathematically defined (known) distribution
#  >> It has a true mean of 25 and a true standard deviation of 9 <<

d <- c(39.89503,26.84281,24.25861,28.87009,21.98997,21.62833,3.836511,33.59383,
       29.82358,10.65092,33.86132,37.29709,1.178438,9.14365,24.33669,18.59312,
       3.755524,35.13025,12.3738,22.3297,17.46969,26.15035,24.44927,33.22489,
       30.54006,17.60494,11.47602,50.5574,22.49891,17.17721,34.62822,21.65989,
       25.29415,22.08202,31.12628,28.16447,26.65634,26.81129,38.41278,21.12782,
       10.29205,25.77778,1.574856,14.65169,30.01658,32.61243,22.40103,41.40533,
       6.766324,34.11612,23.20139,25.23146,22.6141,16.48503,23.39496,27.68777,
       20.16894,19.71992,22.51938,18.84068,47.7751,33.34769,40.17129,25.83041,
       25.5106,29.01418,4.318843,21.87696,31.89288,17.99988,26.79175,9.075963,
       17.81769,29.30444,40.21988,39.85456,16.40074,24.58396,26.97367,20.57295,
       29.30642,20.47403,33.41683,9.886789,32.07099,27.86358,23.91271,8.362674,
       20.9204,27.89309,-0.5909693,30.50101,25.05787,25.68484,31.23876,11.40253,
       18.55688,5.459615,16.87926,27.25582)

Determine the number of data points in `d`. We will use this in some `for` loops below.

In [None]:
# How many elements in d?
#
# Hint: If you've never used it, look up the help on the "length" function 
#       first (?length). You should get used to using it.



Note that the `length` function in R gives different sorts of values for different sorts of objects. For a data frame, it gives the number of columns. For a matrix, vector, or list it gives the total number of elements in the object. For a _vector_ like `d` above it will give the total number of items.
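
As a small illustration of the differences described above, the sketch below calls `length` on a few toy objects. The names `v`, `m`, `df`, and `lst` are made up for this example and are not used elsewhere in the notebook.

```r
v   <- c(2, 4, 6)                        # a vector with 3 items
m   <- matrix(1:6, nrow = 2)             # a 2 x 3 matrix
df  <- data.frame(a = 1:3, b = 4:6)      # a data frame: 3 rows, 2 columns
lst <- list(1, "a", TRUE)                # a list with 3 top-level items

length(v)    # 3: number of items
length(m)    # 6: total number of cells
length(df)   # 2: number of columns, not rows
length(lst)  # 3: number of top-level items
nrow(df)     # 3: use nrow when you want the rows of a data frame
```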

### 1.1 Filter and Print `d`

In [None]:
# Write a for loop that prints out the elements of d
#
# Hint: If you use tabs to arrange things, your columns might not line up
#       perfectly, but this is ok. If you want to use commas, see the demo 
#       notebook, section 1.1.1. But that is a little harder.
# Hint: If i = 3, then d[i] is the 3rd element of d, for instance. 



Note that when arranging things on the screen, tabs will look better or worse depending on the size of the browser window. You might want to adjust the size of your browser window and re-run the cell above to get a nicer display.

In [None]:
# Modify the loop above to print out only elements that are greater
# than the mean of d
#
# Hint: You do not need an else for this exercise
# Hint: You should get 53 numbers printed out



If you get stuck on the previous exercise, see the appendix at the bottom of the notebook.

In R the `quantile` function reports the values in data (like our `d`) that correspond to particular percentage points of the distribution. The cell below computes the 25th and 75th percentiles (equivalent to the 1st and 3rd qua**r**tiles) of `d`.

In [None]:
quantile(d, c(0.25, 0.75))

The code below puts these values into variables `Q1` and `Q3`. You will use these in the next exercises.

In [None]:
out <- quantile(d, c(0.25, 0.75))
Q1 <- out[1]
Q3 <- out[2]
cat(Q1, Q3, sep=", ")   # This is just to show you the values

We can use comparisons to see if elements of `d` are between the `Q1` and `Q3` limits. In the examples below, there is some "fluff" from R printed along with the results (the bold labels from the `quantile` function output) which can safely be ignored. 
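
The labels come from the names that `quantile` attaches to its output. If they are distracting, one option (a sketch on a made-up vector `v`, not part of the exercises) is to strip them with `unname` or to pull out a single value with `[[ ]]`:

```r
v <- 1:9                          # toy data, just for this illustration
q <- quantile(v, c(0.25, 0.75))

q            # printed with the labels 25% and 75%
names(q)     # the labels themselves: "25%" "75%"
unname(q)    # the same values without labels: 3 7
q[[1]]       # a single value without its label: 3
```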

To work out how to make these comparisons we will select out the first element of `d` and build the test for that one element, then we will replace `d[1]` with `d[i]` and put that in a loop.

In [None]:
cat(d[1])   # Just look at this value: is it between 17.95 and 29.87?

In [None]:
# Now do the tests automatically

d[1] > Q1   # TRUE
d[1] < Q3   # FALSE

We can combine the comparisons with the [logical and operator](https://en.wikipedia.org/wiki/Logical_conjunction), which is written `&` in R. This operator combines two logical conditions and returns `TRUE` only when **both** conditions are true. 

For this problem, we want to find points  `d[i]` where both `d[i] > Q1` **and** `d[i] < Q3` are `TRUE` at the same time. For the specific point we are considering this should return `FALSE`.

In [None]:
# Combining conditions with &

d[1] > Q1 & d[1] < Q3  # Is d[1] **INSIDE** the (Q1, Q3) interval?
                       # That is, are both statements simultaneously TRUE?
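
To see how `&` behaves in general, the sketch below runs through all four combinations and then applies it to a made-up variable `value` and a made-up vector `vals` (neither is part of the exercises):

```r
TRUE  & TRUE     # TRUE
TRUE  & FALSE    # FALSE
FALSE & TRUE     # FALSE
FALSE & FALSE    # FALSE

value <- 7                   # hypothetical number to test
value > 5 & value < 10       # TRUE:  both conditions hold
value > 5 & value < 6        # FALSE: the second condition fails

vals <- c(3, 7, 12)          # & also works element by element on vectors
vals > 5 & vals < 10         # FALSE TRUE FALSE
```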

This combined condition can be placed inside of an `if` statement just like a simple comparison.

In [None]:
# Start with the first for loop above, and modify it to only print out
# elements of d that are INSIDE the interval (Q1, Q3).
#
# Hint: If you think this is hard, you **might** be overthinking it.
# Hint: If you copied and pasted, did you remember to change the d[1] 
#       from the example in the cell above?



In [None]:
# Copy the loop just above, but **add** an else clause to cat the string
# "........" (a sequence of 8 dots) for each element of d that is NOT in 
# the interval. The output should be a list of numbers inside the interval
# from Q1 to Q3 and dots in place of the numbers that were outside the
# interval
#
# Hint: In case you wonder, "........" made things line up nicely for me
#       when I did this using tabs ("\t") to separate values. You may need
#       to adjust the number of dots for your screen.



In case you had problems with the previous two exercises, here is my solution to a **related** problem: The following code prints out the values that are **outside** the (Q1, Q3) interval, rather than inside, and prints dots for values inside the interval. Compare it to your result above: where you have dots I should have numbers, and _vice-versa_. Hopefully this will get you unstuck if you have any problems.

In [None]:
# Print the points in d OUTSIDE of the (Q1, Q3) interval:

for(i in 1:length(d)){
    if(d[i] > Q1 & d[i] < Q3){
        cat("........", "\t")
    } else{
        cat(d[i], "\t")
    }
}

**NB:** The code that I wrote does not have to look exactly like your code. For instance, I wrote: `i in 1:length(d)` where you could have written `i in 1:100` or `i in 1:n` for some variable `n` that you defined above. Neither way is better or worse, just different.

**However, the version that I wrote will adapt itself if it is used with data sets of different lengths, while the instruction `i in 1:100` will not change itself automatically.** You should make sure you understand why this is and how to do equivalent things in your own code. Making loops and functions adapt themselves automatically to different sized data is a key concept and one you should always be thinking about as you work and develop your skills.
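
As a sketch of the same idea, the loop can be wrapped in a small function so that it adapts to whatever vector it is given. The name `print_all` is made up for this example. It uses `seq_along(v)` in place of `1:length(v)`; the two agree for any non-empty vector, but `seq_along` also behaves sensibly when the vector is empty.

```r
# Print every element of v, whatever its length
print_all <- function(v){
    for(i in seq_along(v)){
        cat(v[i], "\t")
    }
}

print_all(c(10, 20, 30))      # prints three values
print_all(c(1, 2))            # prints two values, no code changes needed
print_all(numeric(0))         # prints nothing for an empty vector

# Why seq_along is the safer habit:
1:length(numeric(0))          # 1 0  (a loop over this would run twice!)
seq_along(numeric(0))         # integer(0)  (a loop over this runs zero times)
```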

## 2. Simple Function Exercises
Here we will make some small functions. 

### 2.1 Z-Scores
In the first example we will make a simple z-score function. 

The z-score is defined as: $$z = \frac{x - \bar{x}}{s_x}\\[2ex]$$ where $x$ is any of the data points, $\bar{x}$ is the mean of the whole data set, and $s_x$ is the standard deviation of the set. In the following you will calculate this then wrap it up as a function.
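
A worked example by hand may help before coding the formula. The data set $(2, 4, 6)$ is made up for illustration, and $s_x$ is computed with the $n - 1$ denominator, as R's `sd` does:

$$\bar{x} = \frac{2 + 4 + 6}{3} = 4, \qquad s_x = \sqrt{\frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3 - 1}} = \sqrt{4} = 2$$

$$z = \left(\frac{2 - 4}{2},\ \frac{4 - 4}{2},\ \frac{6 - 4}{2}\right) = (-1,\ 0,\ 1)$$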

In [None]:
# For the x provided, (1) use the z formula to compute z from x, and 
#                     (2) print (or cat) z
#
# Note that we just want to copy the z formula here, we will turn it
# into a function in the cells below.

x <- c(5, 8, 7, 9)



To make a function you need to wrap your calculations inside of the following:

```r 
function_name <- function(INPUT){
    calculations                                 
    return(OUTPUT)
}
```
where the `calculations` set the value of the `OUTPUT`. In the above cell you actually did all of the calculations for `z`. So to make it a function, you just have to add the formula you wrote above to the middle of the function definition.
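
Here is a sketch of the template filled in for an unrelated calculation, so you can see the pieces in place without giving away the exercise. The function name `to_fahrenheit` and its input name `celsius` are made up for this example.

```r
# Convert temperatures from Celsius to Fahrenheit
to_fahrenheit <- function(celsius){
    fahrenheit <- celsius * 9 / 5 + 32    # the calculation
    return(fahrenheit)                    # the output
}

to_fahrenheit(c(0, 100))   # 32 212
```

Because the calculation works on a whole vector, the function does too.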

In [None]:
# (1) Copy the function definition from the text cell above (or retype it)
# (2) Change the function_name to zf (for "z function")
# (3) Change the INPUT to x
# (4) Place your z formula where the calculations go
# (5) Change OUTPUT to z 
# (6) Run this cell



The next few cells will test your `zf` function. If you are having problems, a solution is provided in the appendix below.

In [None]:
# The results here should be the same as in your formula cell above

zf(x)  

In [None]:
# The following is an example of what math people call a "fixed point":
# these particular numbers are their own z-scores. So if your 
# function works, the output should be the same as the input

y <- c(-1, 0, 1)

zf(y)     

The process of coding solutions to problems requires a willingness on your part to break problems into smaller pieces and limited cases, solve these, then generalize the answers (often with iteration/loops, abstraction/functions, or other processes). Another thing you should get in the habit of doing is trying to come up with _test cases_ where you know the answer and can see what your function does. The cell above is an example of this; a good test for any function is to give it some data with a known answer and see what you get. 
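
One possible way to write a known-answer test down in code is with `stopifnot` and `all.equal`, both of which are part of base R. The function `half_range` below is a made-up example, not one of the exercises; the point is the pattern of checking a function against answers you worked out by hand.

```r
# Hypothetical function to test: half the distance from min to max
half_range <- function(v){
    return((max(v) - min(v)) / 2)
}

# Known answers, worked out by hand
stopifnot(all.equal(half_range(c(2, 4, 10)), 4))   # (10 - 2) / 2 = 4
stopifnot(all.equal(half_range(c(5, 5, 5)), 0))    # no spread at all

# all.equal allows for tiny rounding errors, which == does not
0.1 + 0.2 == 0.3                   # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE
```

If every check passes, `stopifnot` prints nothing; if one fails, it stops with an error naming the failed check.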

In [None]:
# Here we make 100 random normal numbers. They are drawn from a 
# normal distribution with mean of 100 and standard deviation of 15

random_data <- rnorm(100, mean = 100, sd = 15)
random_data

If you apply the $z$ transformation to any normally distributed data, $N(\mu, \sigma)$, the data should become $N(0,1)$, that is, normally distributed with a mean or center of 0 and a standard deviation of 1. We can test the random data above:

In [None]:
# Apply your z function to the data
random_z <- zf(random_data)

# Check if the results of this are as expected
mean(random_z)   # Should be close to 0; very small ~ 1 x 10^-12 (1e-12) or less
sd(random_z)     # Should be close to 1

When writing functions:

+ It is often easiest to simply do the calculation first (as we did for `z`) and then, once it works, transfer it into the function form.
+ If you do a calculation on a vector (like `x` above) then the function will work on other vectors naturally. 
+ Functions written in R by the developers tend to have a lot of extra parts: to check that inputs are correct, to look for problems, and generally to protect users from errors. Your functions will usually not have these features. If you start developing stuff for other people to use, you may want to learn how to do this sort of thing.
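
As a rough sketch of what such input checks can look like, the function below refuses inputs it cannot handle before doing any work. The name `center` and the wording of the error messages are made up for this example; `stop` and `is.numeric` are standard base R functions.

```r
# Subtract the mean from each value, with some basic input checks
center <- function(x){
    if(!is.numeric(x)){
        stop("x must be numeric")
    }
    if(length(x) == 0){
        stop("x must not be empty")
    }
    return(x - mean(x))
}

center(c(1, 2, 3))     # -1 0 1
# center("hello")      # would stop with the error: x must be numeric
```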

[This blog post](https://nicercode.github.io/guides/functions/) describes some reasons for using functions. You may find it helpful in understanding why functional abstraction is a good thing for program clarity and also maintenance.

### 2.2 Selector Functions

One really useful type of short function in R is the **selector**: a function that does something, picks out just the part of the output you want, and returns only that. Selectors are useful for keeping your programs clean.

#### Selecting the t-Interval
As part of the output of the `t.test` function, the (default) 95% confidence interval is given. Write a function to do the t-test and return **only** the confidence interval.

In [None]:
# Fake data for a 2 group t-test

g1 <- c(1,2,3,4,5)
g2 <- c(5,5,4,4,6)

t.test(g1, g2)   # Generate the t.test output

When you use R functions their results can appear in two ways &mdash; (1) as a print out on the screen (like the `t.test` result above) or (2) as a list of items that can be stored in a variable. When you store the items, you often get much more information than is printed out.

You can look at the items in the list with the `str` command:

In [None]:
t <- t.test(g1, g2)   # Instead of printing, store the results

str(t)   # This shows what information was put in t

The part we want is the part called `conf.int` in the output above. Notice that the items listed above are all written with `$`'s before them. This is a hint about how to access each item:

In [None]:
t$conf.int    # This picks out the confidence interval and leaves
              # the rest alone
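
The same kind of access works on any named list, not just the output of `t.test`. The sketch below builds a small list by hand (the name `result` and its contents are invented for illustration) and shows three ways to look inside it:

```r
# A hand-made list standing in for the stored output of some analysis
result <- list(estimate = 3.2, interval = c(1.1, 5.3))

names(result)          # "estimate" "interval": what is available
result$interval        # 1.1 5.3: pick an item out by name with $
result[["interval"]]   # 1.1 5.3: the same item, using [[ ]] and a string
```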

In [None]:
# Write a function that takes 2 data sets, does the t.test on them
# and then selects out the confidence interval part and returns just
# that. Call the function t_conf_int
#
# Hint: Although it is not necessary, if you recall the use of the 
#       names function from the demo, that trick will work here, too



If you get stuck there is an example solution in the appendix.

In [None]:
# Test of t_conf_int

t_conf_int(g1, g2)   # Same as above

If you followed the instructions above you have a `t_conf_int` function that works fine for the two sample t-test. But it does not work for other cases that `t.test` allows. There is a nice trick to fix this.

#### R's `...` Operator
One of R's features is the `...` operator. This lets you pass a whole set of arguments to a function, which will in turn pass them on to other functions. Naming the extra arguments is the safest habit, because unnamed ones are matched by position. The R documentation describes `t.test` as having the following format for its input:

```R
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95)
```
If you want to do a simple two sample t-test, like the one above, there are two ways to specify it:

1. `t.test(g1, g2)`
1. `t.test(x = g1, y = g2)`

All of the other values are optional and will use defaults if not given by you. Most people use the first format for simple things, but we can choose to use the second if we need to.

In R, when you place `...` inside of a function definition, that spot collects all of the arguments that the current function does **not** itself define. These can then be sent on to other functions as needed. The receiving function matches them against its own parameters, by name or, for unnamed items, by position.

This gives us a trick to build a very flexible `t_conf_int` selector:

In [None]:
# This function passes **all** of its arguments to t.test

t_conf_int2 <- function(...){
    return(t.test(...)$conf.int)
}

In [None]:
# This is the same case as above

t_conf_int2(g1, g2)

In [None]:
# Here we pass the additional parameter conf.level as 0.99 to get 
# a 99% interval; our original function can't do that

t_conf_int2(x = g1, y = g2, conf.level = 0.99)

You can try passing the additional arguments with your `t_conf_int` function but it will fail.
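
The same forwarding pattern works with functions other than `t.test`. In the sketch below, the made-up function `rounded_mean` keeps `x` and `digits` for itself and hands everything else on to `mean`, so options such as `na.rm` can be used without being listed in the definition.

```r
# Mean of x, rounded; any extra arguments are forwarded to mean()
rounded_mean <- function(x, digits = 1, ...){
    return(round(mean(x, ...), digits))
}

vals <- c(1, 2, NA, 4)                        # toy data with a missing value

rounded_mean(vals)                            # NA: mean() sees the missing value
rounded_mean(vals, na.rm = TRUE)              # 2.3: na.rm is passed on to mean()
rounded_mean(vals, digits = 2, na.rm = TRUE)  # 2.33
```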

### 2.3 A Robust Example
The [median absolute deviation (MAD)](https://en.m.wikipedia.org/wiki/Median_absolute_deviation) is a robust measure of variability for use in cases where the standard deviation might have problems. (Cases such as analyzing data with outliers, or data from _contaminated_ distributions.) Here we will compute the MAD and _also convert it into a robust estimate of the standard deviation_.

The MAD is defined as the median of the _absolute deviations_ from the data's median. As R has a `median` function this is easy to write.

In [None]:
# Data for testing in a:

a <- c(6, 1, 2, 4, 1, 9, 2)

d <- a - median(a)  # Subtract median from each point, store result in d
cat(d)

Note that the values in `d` are positive and negative. We need to use the absolute value to make all of these positive, because we want to look at the size of the deviations and do not care about their sign (direction). R provides an absolute value function, `abs`, which makes negative values positive.

In [None]:
abs(d)

Finally to get the MAD, we just take the median of this list of absolute `d`'s. In the following cell the entire computation is shown:

In [None]:
d <- abs(a - median(a))
median(d)

We can convert the MAD into an estimate of the standard deviation by multiplying it by 1.4826. You will just have to trust me on this, unless you have taken some probability theory. If you have, the factor that converts the MAD into an estimate of $s$ (`sd`) for normally distributed data is:

$$\frac{1}{\Phi^{-1}(0.75)} \approx 1.482602$$

(This can be computed in R with the command: `1/qnorm(.75)`, if you want a better estimate.)
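
A short sketch of that computation follows; the variable name `k` is made up here. The factor arises because, for normally distributed data, half of the values lie within one MAD of the median, which puts the MAD at $\sigma \cdot \Phi^{-1}(0.75)$.

```r
k <- 1 / qnorm(0.75)   # qnorm(0.75) is the 75th percentile of N(0, 1)

k             # about 1.482602
round(k, 4)   # 1.4826, the rounded value used in this notebook
```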

In [None]:
# So to make a standard deviation estimate, we just multiply:

s_estimate <- median(d) * 1.4826
cat(s_estimate)

In [None]:
# Take the steps outlined above and make a "s_est" function



In [None]:
# Test your function against the original case; the result should
# match s_estimate above

s_est(a)

In [None]:
newdata <- c(10, 20, 40, 40, 50, 70, 80)

s_est(newdata)   # Should be about 29.65

As has been the case for a few of our simple examples today, this function is actually implemented in R as a standard feature where it is called `mad`. We can use this to check our results above against what the professionals did.

In [None]:
mad(a)
mad(newdata)
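
R's `mad` applies the scaling by default through its `constant` argument, which is set to 1.4826 unless you change it. As a sketch, setting `constant = 1` returns the raw, unscaled MAD, which is handy for checking the intermediate step by hand:

```r
newdata <- c(10, 20, 40, 40, 50, 70, 80)   # same test data as above

mad(newdata, constant = 1)   # 20: the raw median absolute deviation
mad(newdata)                 # 29.652: the raw MAD times 1.4826
```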

## Appendix

Here are some solutions in case you get stuck. Try not to use them without trying for a bit first.

In [None]:
# The second exercise from 1.1
# (uses the 100-element d from section 1; section 2.3 later reassigns d)

for(i in 1:length(d)){
    if(d[i] > mean(d)){
        cat(d[i], "\t")
    }
}

In [None]:
# The zf function from 2.1

zf <- function(x){
    z <- (x - mean(x))/sd(x)
    return(z)
}

In [None]:
# t_conf_int from 2.2.1

t_conf_int <- function(a, b){
    return(t.test(a, b)$conf.int)
}

In [None]:
# s_est from 2.3

s_est <- function(a){
    d <- abs(a - median(a))
    return(median(d)*1.4826)
}

***
Version 1.0  
2018.07.11

To contact the author, email [mturner46@gsu.edu](mailto:mturner46@gsu.edu). Please contact me with recommendations for improvement or if you find any errors. This work may be adapted for any non-commercial purpose within the bounds of the license.

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.