# Functions

Up to this point in our course, we've mostly *used* functions without really thinking about how they work. And to some degree, that's by design -- as discussed in our earlier reading, you don't actually need to know what's going on inside the function. You only need to know the arguments you need to pass to it to get back the thing that you want. In that sense, a function is kind of like a toaster: you put bread in, you get toast back; how the toaster is turning the bread into toast isn't really something you need to worry about.

But in your career, you will often find it useful to write your own functions, and to do that we have to understand a little more about how functions work.

Why do we care about writing functions? Functions are useful when you want to execute a task over and over again, resulting in less verbose and more easily interpretable code. And as we learned in our [defensive programming reading](defensive_programming.ipynb), that will not only save you time, but it will also make it less likely that you will end up with errors in your code!

**But wait... isn't that what you told me loops were for?**

Yes! Both loops and functions are, broadly speaking, for the same purpose: helping write more succinct code when you're doing something similar over and over. The big difference is that with a loop you only really get one variable that changes with each pass, whereas with a function, you can generalize behavior a lot more. In addition, as you'll see, functions are a little more flexible and reusable than loops.

## Defining a Function

To illustrate how a function works, let's begin with a very simple function that takes a number, adds one to that number, then doubles it. It is admittedly a bit of a contrived example, but it has just enough complexity to be interesting.

We would write this function as follows:

In [6]:
add_1_and_double <- function(input_number) {
    plus_1 <- input_number + 1
    doubled <- plus_1 * 2
    return(doubled)
}

In [7]:
x <- 5
y <- add_1_and_double(x)
y

Let's walk through this line by line to understand what's going on. 

The first thing we see -- the name to which the function is being assigned -- will become the function name. 

The names between the parentheses after `function` (here, `input_number`) are the arguments the function will accept. We are writing this function to accept only one argument, so we've put only one name between the parentheses. This is called the function signature.

Then between the curly brackets is the body of the function. Inside it, the value passed to the function is referred to by whatever name you gave the argument in the function signature (here, `input_number`). So here, within the function, we add one to the input and then double that value.

Finally, we passed that doubled value to the `return()` function, which means that we want the value assigned to `doubled` to be what the function returns.
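
The pieces described above (the arguments in the signature and the code between the curly brackets) can be inspected directly. The sketch below redefines the same function so it runs on its own, then uses base R's `formals()` and `body()` to look at it. The names `arg_names` and `result` are just illustrative.

```r
# Same function as above, redefined so this block runs on its own
add_1_and_double <- function(input_number) {
    plus_1 <- input_number + 1
    doubled <- plus_1 * 2
    return(doubled)
}

# The names of the arguments listed in the function signature
arg_names <- names(formals(add_1_and_double))
arg_names

# The code between the curly brackets
body(add_1_and_double)

# Calling the function with the value 5
result <- add_1_and_double(5)
result
```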

## What Happens When a Function is Called

Now that we've seen how to write a function, let's pause for a moment to work through exactly what happens when that function is called. For example, above we ran the code `add_1_and_double(x)` and got back 12. How did that happen? Well...

We begin with a simple assignment of `5` to `x`:

![function1](images/function1.png)

After which, we pass `x` to our function `add_1_and_double()`. When that happens, a new "stack frame" is created by R to execute that function, and the value passed to the function -- `5` -- is assigned to a variable with the name it was given in the function signature.

![function2](images/function2.png)

The function then begins to execute. When `input_number` is added to `1`, a new variable -- `plus_1` -- is created within the function frame. 

![function3](images/function3.png)


Then `plus_1` is doubled, and that value is assigned to `doubled`:

![function4](images/function4.png)

But then things get interesting, as we arrive at the `return` statement: 

![function5](images/function5.png)

The return statement tells R that the function is done. When R encounters a return statement, it does two things: it (a) returns the value given to `return()`, and (b) ends the function and deletes the function's frame:

![function6](images/function6.png)

Notice that all the variables that had been defined within the frame `add_1_and_double` (`input_number`, `plus_1`, and `doubled`) are gone -- when a function ends, none of the variables defined within the function live on. All that's left of the function is its return value, which is now stored in `y`.
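
One way to check this for yourself is to ask R whether a name from inside the function still exists after the call has finished. This sketch assumes you have not separately created a variable called `plus_1` in your session. `exists()` is a base R function, and `plus_1_survived` is just an illustrative name.

```r
add_1_and_double <- function(input_number) {
    plus_1 <- input_number + 1
    doubled <- plus_1 * 2
    return(doubled)
}

x <- 5
y <- add_1_and_double(x)

# The return value lives on in y
y

# But plus_1 was created inside the function's frame, which is gone.
# (This is FALSE only if no variable named plus_1 exists outside the function.)
plus_1_survived <- exists("plus_1")
plus_1_survived
```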

## Function Arguments

In the example above, our function accepted a single argument, but of course we've seen functions can accept more than one argument! To accept more than one argument, just add more argument names to the function signature in the first line:

In [1]:
add_two_numbers <- function(number1, number2) {
    sum <- number1 + number2
    return(sum)
}
add_two_numbers(1, 2)

You can also set default values for arguments by writing `my_argument="default value"` in the function signature. If an argument has a default value, it becomes an optional argument that users *may* specify, but don't have to:

In [2]:
add_two_numbers <- function(number1, number2, return_as_character=FALSE) {
    sum <- number1 + number2

    # If return_as_character is TRUE, then sum will be returned as a character
    if (return_as_character == TRUE) {
        sum <- as.character(sum)
    }

    return(sum)
}
add_two_numbers(1, 2)

In [4]:
add_two_numbers(1, 2, return_as_character = TRUE)
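
Arguments can be matched by position or by name, and arguments with defaults can be left out. The sketch below uses a hypothetical `divide_numbers()` function (not part of the examples above) because, unlike addition, the order of its arguments matters.

```r
# Hypothetical function: denominator defaults to 1 if not supplied
divide_numbers <- function(numerator, denominator = 1) {
    quotient <- numerator / denominator
    return(quotient)
}

# Matched by position: numerator = 10, denominator = 2
by_position <- divide_numbers(10, 2)

# Matched by name: the order no longer matters
by_name <- divide_numbers(denominator = 2, numerator = 10)

# denominator omitted, so the default value of 1 is used
with_default <- divide_numbers(10)

c(by_position, by_name, with_default)
```

The first two calls give the same answer, while the third divides by the default of 1.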

## Using Functions

Functions are nice tools both for creating generalizable code, and also just for organizing your work. Because a function is a bundle of code that does one thing, putting parts of your work in functions can be helpful in defining the goal of that chunk of code. 

To illustrate one use of a function, we'll write a function that reads and manipulates a .csv file. We can then put this in a for loop to iterate over several files with a similar structure and combine the resulting data frames into one data frame.

As you'll see, one *could* cram all the code we're going to write directly into the for loop at the end, but by breaking out part of it into a function, the problem is more easily broken into smaller parts. 

**Note:** In what follows, I use a couple of tricks for working with data on dates. If you don't follow them, don't worry about it -- it's not critical to understand those couple of lines!

### Reading several files


Begin by downloading a [.zip file with service request data from NYC](https://github.com/nickeubank/computational_methods_boot_camp/blob/main/source/data/nyc-311-sample.zip). The zip file contains six files for years 2004-2009, each with 10,000 observations. The data are originally from [NYC's Open Data portal](https://nycopendata.socrata.com/data?cat=social%20services), which hosts datasets with millions of service requests filed by residents through the city's 311 program. For the purpose of this example, I have taken a random sample of 10,000 for each year.

Here's what the 2004 file looks like (the other years have the same structure).

In [35]:
url2004 <- "https://raw.githubusercontent.com/nickeubank/computational_methods_boot_camp/main/source/data/nyc-311-sample/nyc-311-2004-sample.csv"
nyc04 <- read.csv(url2004)
head(nyc04)

| | Unique.Key | Created.Date | Closed.Date | Complaint.Type | Location |
|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 4735434 | 01/23/2004 12:00 AM | 02/02/2004 12:00 AM | Boilers | (40.71511134258793, -73.98998982667266) |
| 2 | 7547062 | 06/04/2004 12:00 AM | 06/09/2004 12:00 AM | HEATING | (40.871781348425515, -73.88238262118011) |
| 3 | 5050661 | 08/04/2004 12:00 AM | 08/06/2004 12:00 AM | General Construction/Plumbing | (40.59418801428136, -73.80082145383885) |
| 4 | 7281795 | 11/26/2004 12:00 AM | 12/10/2004 12:00 AM | PLUMBING | (40.85911979460089, -73.90605127158484) |
| 5 | 1443894 | 08/22/2004 12:00 AM | 08/22/2004 12:00 AM | Noise - Street/Sidewalk | (40.54800892371052, -74.17041676351323) |
| 6 | 3244577 | 12/02/2004 12:00 AM | 12/15/2004 12:00 AM | Noise | |


The variables in the data are as follows: 

* `Unique.Key`: An id number unique to each request.
* `Created.Date`: The date the request was filed in the 311 system.
* `Closed.Date`: The date the request was resolved by city workers (`NA`
implies that it was never resolved).
* `Complaint.Type`: The subject of the complaint.
* `Location`: Coordinates that give the location of the service issue.

Our goal with the function is to read the file and clean it. In particular,
we want to convert the `Created.Date` and `Closed.Date` variables so that
R recognizes them as dates. From these variables, we can then calculate
measures of *government responsiveness*: (1) how many days it took city
workers to resolve a request, and (2) whether or not a request was resolved
within a week. 

In [37]:
library(lubridate)  # to work with dates


In [16]:
# Requires the lubridate package (loaded above) for mdy()

# Create a function that reads and cleans a service request file.
# The input is the name of a service request file and the
# output is a data frame with cleaned variables.
clean_dta <- function(file_name) {

    # Read the file and save it to an object called 'dta'
    dta <- read.csv(file_name)

    # Clean the dates in the dta file and generate responsiveness measures.
    # substring(dta$Created.Date, 1, 10) pulls just the month/day/year text
    # from our columns with dates, then `mdy` tells R to read it as a date 
    # in month-day-year format.

    dta$opened <- mdy(substring(dta$Created.Date, 1, 10))
    dta$closed <- mdy(substring(dta$Closed.Date, 1, 10))

    # Number of days between when an issue is opened and when it is resolved.
    dta$resptime <- as.numeric(difftime(dta$closed, dta$opened, units = "days"))

    # A response time below 0 is bad data, so set it to missing.
    dta[dta$resptime < 0 | is.na(dta$resptime), "resptime"] <- NA

    # Create indicator of whether the request was solved within 7 days.
    dta$solvedin7 <- as.numeric(dta$resptime <= 7)

    # Return the cleaned data
    return(dta)
}
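
To see what the date steps inside `clean_dta()` do, it can help to run them on a single record. The sketch below uses the first row of the 2004 file shown earlier. To stay free of extra packages it uses base R's `as.Date()` with an explicit format as a stand-in for lubridate's `mdy()`, which is an assumption made for this illustration and not what the function above calls.

```r
# One record from the 2004 file shown above
created <- "01/23/2004 12:00 AM"
closed <- "02/02/2004 12:00 AM"

# Keep only the first 10 characters: the month/day/year part
created_mdy <- substring(created, 1, 10)
created_mdy

# Base R stand-in for lubridate's mdy(): parse month/day/year text as a Date
opened_date <- as.Date(created_mdy, format = "%m/%d/%Y")
closed_date <- as.Date(substring(closed, 1, 10), format = "%m/%d/%Y")

# Days between opening and closing, then the within-a-week indicator
resptime <- as.numeric(difftime(closed_date, opened_date, units = "days"))
solvedin7 <- as.numeric(resptime <= 7)

c(resptime = resptime, solvedin7 = solvedin7)
```

This request took 10 days to resolve, so it is not counted as solved within a week.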

Let's test the function on the 2004 data:

In [17]:
# Execute function on the 2004 data
nyc04 <- clean_dta(url2004)
head(nyc04)

| | Unique.Key | Created.Date | Closed.Date | Complaint.Type | Location | opened | closed | resptime | solvedin7 |
|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<date>` | `<date>` | `<dbl>` | `<dbl>` |
| 1 | 4735434 | 01/23/2004 12:00 AM | 02/02/2004 12:00 AM | Boilers | (40.71511134258793, -73.98998982667266) | 2004-01-23 | 2004-02-02 | 10 | 0 |
| 2 | 7547062 | 06/04/2004 12:00 AM | 06/09/2004 12:00 AM | HEATING | (40.871781348425515, -73.88238262118011) | 2004-06-04 | 2004-06-09 | 5 | 1 |
| 3 | 5050661 | 08/04/2004 12:00 AM | 08/06/2004 12:00 AM | General Construction/Plumbing | (40.59418801428136, -73.80082145383885) | 2004-08-04 | 2004-08-06 | 2 | 1 |
| 4 | 7281795 | 11/26/2004 12:00 AM | 12/10/2004 12:00 AM | PLUMBING | (40.85911979460089, -73.90605127158484) | 2004-11-26 | 2004-12-10 | 14 | 0 |
| 5 | 1443894 | 08/22/2004 12:00 AM | 08/22/2004 12:00 AM | Noise - Street/Sidewalk | (40.54800892371052, -74.17041676351323) | 2004-08-22 | 2004-08-22 | 0 | 1 |
| 6 | 3244577 | 12/02/2004 12:00 AM | 12/15/2004 12:00 AM | Noise | | 2004-12-02 | 2004-12-15 | 13 | 0 |


The cleaned dataset has four new variables:

* `opened`: The date the request was filed in date format. 
* `closed`: The date the request was resolved in date format. 
* `resptime`: The number of days it took to resolve the request (`closed` - `opened`).
* `solvedin7`: A dummy variable equal to 1 if the request was solved within a week
  and 0 otherwise. 

We can now use this function on all six files using a for loop, or a function called `lapply()` (read more about `lapply()`
[here](http://www.r-bloggers.com/using-apply-sapply-lapply-in-r/)).

In [18]:
# First create a vector with the URLs of the files we want to read.
# Every file name has the same start and end; only the year changes.
url_stem <- "https://raw.githubusercontent.com/nickeubank/computational_methods_boot_camp/main/source/data/nyc-311-sample/nyc-311-"
url_suffix <- "-sample.csv"

file_names <- paste0(url_stem, 2004:2009, url_suffix)
file_names

In [25]:
# Loop over the files and stack the cleaned data frames into one!

# Get first one so we can append others to the bottom
nyc_all <- clean_dta(paste0(url_stem, 2004, url_suffix))

for (year in 2005:2009) {
    url <- paste0(url_stem, year, url_suffix)
    new_data <- clean_dta(url)
    nyc_all <- rbind(nyc_all, new_data)
}
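
The same stacking can be written with `lapply()`, mentioned above, followed by `do.call(rbind, ...)`. To keep this sketch runnable without downloading anything, it uses a hypothetical stand-in function, `make_fake_year()`, in place of `clean_dta()`, and loops over years in place of file URLs. With the real data, one would pass the vector of file URLs and `clean_dta` to `lapply()` instead.

```r
# Hypothetical stand-in for clean_dta(): returns a tiny data frame per year
make_fake_year <- function(year) {
    data.frame(year = rep(year, 2), resptime = c(1, 8))
}

# Apply the function to each year; the result is a list of data frames
yearly_list <- lapply(2004:2009, make_fake_year)
length(yearly_list)

# Stack all the data frames in the list into one
all_years <- do.call(rbind, yearly_list)
nrow(all_years)
```

Compared with the for loop, this avoids treating the first file as a special case, since every data frame is collected in the list before any stacking happens.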

In [26]:
# 10 random rows
nyc_all[sample(nrow(nyc_all), 10), ]


| | Unique.Key | Created.Date | Closed.Date | Complaint.Type | Location | opened | closed | resptime | solvedin7 |
|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<date>` | `<date>` | `<dbl>` | `<dbl>` |
| 53321 | 13901265 | 05/07/2009 12:00:00 AM | 05/07/2009 12:00:00 AM | Derelict Vehicle | (40.617799993451946, -73.93627682089715) | 2009-05-07 | 2009-05-07 | 0 | 1 |
| 57449 | 15438339 | 12/06/2009 12:00:00 AM | 01/04/2010 12:00:00 AM | ELECTRIC | (40.80005453928576, -73.94058468907889) | 2009-12-06 | 2010-01-04 | 29 | 0 |
| 21611 | 6773413 | 10/17/2006 12:00 AM | 10/18/2006 12:00 AM | HEATING | (40.61383002382055, -73.95104077615885) | 2006-10-17 | 2006-10-18 | 1 | 1 |
| 26441 | 8528938 | 12/10/2006 12:00 AM | 12/18/2006 12:00 AM | HEATING | (40.82384059281061, -73.89186881921808) | 2006-12-10 | 2006-12-18 | 8 | 0 |
| 24293 | 3637086 | 01/27/2006 12:00 AM | 01/27/2006 12:00 AM | Street Condition | (40.82105078782504, -73.8822663446518) | 2006-01-27 | 2006-01-27 | 0 | 1 |
| 28327 | 366213 | 07/31/2006 12:00 AM | 07/31/2006 12:00 AM | Derelict Vehicle | (40.61916174589555, -73.93688427328587) | 2006-07-31 | 2006-07-31 | 0 | 1 |
| 58533 | 15528269 | 12/17/2009 12:00:00 AM | 12/23/2009 12:00:00 AM | HEATING | (40.68596848970488, -73.93964069232293) | 2009-12-17 | 2009-12-23 | 6 | 1 |
| 5238 | 1455730 | 09/14/2004 12:00 AM | 09/14/2004 12:00 AM | Noise - Street/Sidewalk | (40.822455126458685, -73.87869412517016) | 2004-09-14 | 2004-09-14 | 0 | 1 |
| 32209 | 9006249 | 08/01/2007 12:00:00 AM | 08/06/2007 12:00:00 AM | APPLIANCE | (40.64359535901695, -74.07741192402966) | 2007-08-01 | 2007-08-06 | 5 | 1 |
| 1481 | 511930 | 12/12/2004 12:00 AM | 12/14/2004 12:00 AM | Street Light Condition | | 2004-12-12 | 2004-12-14 | 2 | 1 |


## A Note About Scope

There's a concept in programming called "scope", which refers to which variables are visible at a given moment of execution. If you write a function so that it only works with (a) the arguments given to the function, and (b) the variables that you define within the function, you don't need to worry about scope. And indeed, there's a whole philosophy of programming -- called *functional programming* -- that says that's the only way you should write a function. 

In general, I would recommend sticking to this approach. However, I would not be doing my duty as an instructor if I did not mention that functions can see variables that exist outside of themselves. For example, in our `add_1_and_double()` example above, if we'd added the line `doubled <- doubled + x` right above our return statement, the function would have been able to "see" that there was a variable `x` in the world outside the function and that it had been assigned the value `5`, and it would have incremented `doubled` by 5. But... that's a very dangerous method of programming, because if you write a function that way, the behavior of the function now depends on the values assigned to variables outside the function. So `add_1_and_double(5)` would return one thing if you had earlier defined `x <- 2` and something different if you defined `x <- 7`. So... don't do it? I just want to warn you that code written like that will run, but it's something you won't want to use unless you really know what you're doing.
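
To make the danger concrete, here is a minimal sketch of that modified function. The name `add_1_and_double_leaky()` is hypothetical and chosen only for this illustration. The same call returns different values depending on what `x` happens to be outside the function.

```r
# A version of the function that (unwisely) reaches outside itself for x
add_1_and_double_leaky <- function(input_number) {
    plus_1 <- input_number + 1
    doubled <- plus_1 * 2
    doubled <- doubled + x  # x is not an argument or a local variable
    return(doubled)
}

x <- 2
result_when_x_is_2 <- add_1_and_double_leaky(5)

x <- 7
result_when_x_is_7 <- add_1_and_double_leaky(5)

# Same call, different answers
c(result_when_x_is_2, result_when_x_is_7)
```

If the function needs `x`, passing it in as a second argument keeps the behavior predictable.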



