# Loops

Up to this point, the scripts we've been writing have been quite straightforward: each line of code runs in sequence, one after another. Moreover, everything we want R to do, we have had to write out by hand.

In this lesson, we'll learn about loops -- a programming structure ideal for getting R to do LOTS of work with very little code. 

The basic idea of a loop is that when we want R to do lots of very similar things, we can decompose our code into (a) generic code we want to run over and over, and (b) an input to that code that changes each time we repeat the generic code.

Let's start with a simple example -- suppose I wanted to thank my TAs and everyone in class for a great bootcamp. I *could* type it out with:

```r
print("Thank you Yi for a great class!")
print("Thank you Zoe for a great class!")
print("Thank you Ishani for a great class!")
print("Thank you Kelly for a great class!")
...
```
etc. But that'd take FOREVER! Moreover, it's easy to see that most of what I'm typing out is exactly the same on every line. If only there were some way to leverage that information...

Enter the loop!

## The For-Loop

The most basic loop is called a for-loop, and it loops over a collection of items, doing the loop once per item in the collection. 

In this case, our collection would be a vector with the names of all the students in the class. I won't make you look at the full list, though, so here's a little toy vector:

```r
names <- c("Yi", "Zoe", "Ishani", "Kelly")
```

Then we need to write out a for-loop, which looks like this:

```r
for(i in names) {
    # code to repeat here
}
```

The way this loop works is that it iterates over the names in `names`, and each time it reaches a new name, it sets `i` to equal that name, then runs the code in the middle. For example, I could print all the names like this:

```r
for (i in names) {
    print(i)
}
```

```
[1] "Yi"
[1] "Zoe"
[1] "Ishani"
[1] "Kelly"
```


But of course, we usually don't just want to print `i`; we want to use it. So let's try to thank everyone in the class with a loop.

We will start by writing out the code we eventually want in the loop, setting `i <- "Yi"` by hand to practice:

```r
i <- "Yi"
message <- paste0("Thank you ", i, " for a great class! ")
print(message)
```

```
[1] "Thank you Yi for a great class! "
```


If you haven't seen `paste0()` before, it's a function that takes character strings and concatenates them (sticks them together). (There's also a function called `paste()` that does the same thing but, by default, adds a space between entries.)
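To make the difference between the two concrete, here is a small sketch (the variable names `no_space` and `with_space` are just for illustration):

```r
i <- "Yi"

# paste0() glues the pieces together exactly as given
no_space <- paste0("Thank you ", i, "!")

# paste() puts a space between every piece by default
with_space <- paste("Thank you", i, "!")

print(no_space)    # [1] "Thank you Yi!"
print(with_space)  # [1] "Thank you Yi !"
```

Notice that with `paste0()` we have to include the spaces we want ourselves, which is why `"Thank you "` ends with a space.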

OK, so now that we've got working code to put in the middle, we can move it into our loop:

```r
for (i in names) {
    message <- paste0("Thank you ", i,
                      " for a great class! ")
    print(message)
}

print("OK, bye now!")
```

```
[1] "Thank you Yi for a great class! "
[1] "Thank you Zoe for a great class! "
[1] "Thank you Ishani for a great class! "
[1] "Thank you Kelly for a great class! "
[1] "OK, bye now!"
```


A few things to note about this:

- When developing the code to go into our loop, we started by assigning one of our names to `i`. In the real loop, we don't do that because it's done automatically at the top of the loop. 
- Remember how `print()` always seemed pointless, since you could always just type the name of a variable to see its value? Well that trick doesn't work in loops, which is why we need a tool like `print()`. 
- I also added a print statement at the end of my code -- as you can see, when the loop has iterated over all the values in `names`, R just moves on and runs the next command it sees. 
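To see the point about `print()` for yourself, here is a minimal sketch using a shortened name vector (the two-name vector is just for illustration):

```r
names <- c("Yi", "Zoe")

# Inside a loop, a bare variable name is evaluated but nothing is displayed
for (i in names) {
    i
}

# Wrapping it in print() makes each value show up
for (i in names) {
    print(i)
}
```

Only the second loop produces any output.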

OK, let's take a moment to diagram what's happening here just to make sure everything is clear. To keep things short, though, we'll shorten our name vector to just Yi and Zoe.

![loop1](images/loop1.png)

![loop2](images/loop2.png)

![loop3](images/loop3.png)

![loop4](images/loop4.png)

![loop5](images/loop5.png)

![loop6](images/loop6.png)

![loop7](images/loop7.png)

## An Application

For loops can be used to carry out Monte Carlo simulations. In the
example below, we'll draw repeated samples from a population,
calculate the mean for each sample, and test whether we on average do
a good job of estimating the population mean. 

Say the population consists of 10 individuals with the following heights: 
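As a minimal sketch of the kind of simulation described above: the heights below are made-up values for illustration, and the sample size of 4, the 1,000 repetitions, and the names `heights`, `n_sims`, and `sample_means` are all assumptions.

```r
set.seed(123)  # so the random draws are reproducible

# Hypothetical population: heights (in cm) of 10 individuals (made-up values)
heights <- c(150, 155, 160, 162, 165, 168, 170, 175, 180, 185)

n_sims <- 1000                    # how many samples to draw (assumed)
sample_means <- numeric(n_sims)   # empty vector to hold one mean per sample

for (s in 1:n_sims) {
    # Draw 4 individuals at random (without replacement)
    smpl <- sample(heights, size = 4)
    # Store this sample's mean in slot s
    sample_means[s] <- mean(smpl)
}

# Compare the population mean to the average of the sample means
print(mean(heights))
print(mean(sample_means))
```

Here the loop runs over the numbers `1` to `n_sims` rather than over names, and uses `s` as an index to decide where in `sample_means` each result gets stored. If the sampling is doing a good job on average, the two printed numbers should be close to one another.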

## Don't Loop Over Big Vectors / Data 


**CAUTION:** Do not loop over your dataset rows

Loops are powerful, but one thing you almost never want to do is loop over the rows of your dataset. The reason is that looping is **much** slower than doing an operation using vector math (a practice called "vectorization"). 

To illustrate, suppose I want to add up two vectors, each with 1,000,000 entries:

```r
v1 <- rnorm(1000000)
v2 <- rnorm(1000000)

# add up with vector math:

vector_time <- system.time(v1 + v2)
vector_time
```

```
   user  system elapsed 
  0.002   0.000   0.001 
```

Note that I put the loop inside a function so that `system.time()` can time the whole thing in one call:

```r
# Now add them up in a loop.

looped <- function(v1, v2) {
    # Add up in a loop
    for (i in 1:1000000) {
        v1[i] <- v1[i] + v2[i]
    }
    return(v1)
}

looped_time <- system.time(looped(v1, v2))
looped_time
```

```
   user  system elapsed 
  0.070   0.003   0.073 
```

```r
# Looped took about this many times longer:

round(looped_time[["elapsed"]] / vector_time[["elapsed"]])
```

```
[1] 73
```


So... yeah. 73x slower. Don't do it. 

(If you want to know why, I have an explanation for the [same phenomenon in Python here](https://www.practicaldatascience.org/html/performance_understanding.html). The examples have Python code, but the principles are the same). 

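Also, if the `[["elapsed"]]` in that code looks weird: `system.time()` returns a named numeric vector (of class `proc_time`), and double square brackets pull out a single entry by name. Double brackets are also how you pull items out of a `list`, which is an interesting data structure we sadly don't have time to cover, but which you can read about [here](lists.ipynb).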
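If you want to poke at what `system.time()` hands back, here is a small sketch (the variable name `timing` and the toy loop being timed are just for illustration):

```r
# Time a small, throwaway loop
timing <- system.time(for (k in 1:1000) sqrt(k))

class(timing)         # "proc_time"
names(timing)         # the entry names, including "elapsed"
timing[["elapsed"]]   # a single number: elapsed seconds
```

The exact timings will differ from machine to machine and from run to run.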

## Exercises


1. Use a for loop to take the square root of each value in the following 
vector: `vec1 <- c(4, 9, 81, 100, 1000, 10^6)`. Save the results to a new 
vector called `vec2`. 

2. Monte Carlo Simulation: Imagine that the values in the vector `pop`
below represent vote shares for a presidential candidate across the
3,144 counties in the United States. If we were to take a sample of 50
counties and estimate mean support for the presidential candidate,
would we, on average, estimate the vote share across all counties
accurately? (Don't worry about the fact that we really should be
weighting counties by their population size to estimate overall
support.) Draw 10,000 samples of 50 counties from `pop` and estimate
mean support for each sample, saving each mean estimate into a vector
called `smpl_means`. How does the mean of the sample means compare to 
the population mean? Do we, on average, do a good job of estimating the 
population mean? 

<div class="indent">