# Lecture 16: Writing Functions in R
<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will understand:**
* how to write functions in R
</div>

This corresponds to Chapter 26 of your book.





In [None]:
library(tidyverse)
remotes::install_github("bradleyboehmke/harrypotter")
install.packages("tidytext")
library(harrypotter)
library(tidytext)


# Function
Functions are not new to us; we have been using functions since day 1.

`print`, `filter`, `tibble` - are all functions. 

Try 

`class(print)`

Today we will learn to write our own functions

### Why write our own functions?

Often when programming we find ourselves repeating the same block of code with minor modifications. 

Did you encounter this situation in HW7, when you applied the same logic again and again while deriving the sentiment score for each of the Harry Potter books?

That is precisely when a function is defined to save ourselves from repetition!

Let us start with a simple example. When building machine learning models (which you will learn next week), it is common practice to normalize all column values to the same scale, typically between 0 and 1. Let us take the `hwy` column of the `mpg` dataset and see its current min and max values:

In [None]:
mpg$hwy %>% range

Let us now normalize it

In [None]:
hwy_a <- (mpg$hwy - min(mpg$hwy, na.rm = TRUE)) / (max(mpg$hwy, na.rm = TRUE) - min(mpg$hwy, na.rm = TRUE))
hwy_a %>% range

We need to do exactly the same for all columns; to begin with, let us do it for all numerical columns. It is easy to copy and paste the code from above and change the variable names:

In [None]:
cty_a <- (mpg$cty - min(mpg$cty, na.rm = TRUE)) / (max(mpg$cty, na.rm = TRUE) - min(mpg$hwy, na.rm = TRUE))
cty_a %>% range

🤔 Quiz

Why is the range not showing a maximum of 1?

* A. For the `cty` values it cannot have a maximum value of 1
* B. The numerator should use `max` function
* C. Copy-paste technique not done correctly is the reason for a `flaw` in the formula

We still need to repeat the same procedure for all the other columns, so instead of repeating ourselves we can introduce a function to do the repeated work.

## Anatomy of a function
To write a function we should first think about the inputs and output. A function takes input(s), does something(s) to them, and then returns an output.
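
As a minimal sketch of this anatomy, the normalization formula from above can be wrapped in a function. The name `rescale01` and the small demo vector are assumptions made for illustration; the notes do not show the function body.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1]
rescale01 <- function(x) {
  # Compute min and max once, ignoring missing values
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Small made-up vector so the example runs without any packages
hwy_demo <- c(12, 20, 26, 44)
range(rescale01(hwy_demo))
```

Because the formula now lives in one place, applying it to another column only requires changing the argument, which removes the opportunity for a copy-paste slip.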



🤔 Quiz

What is the output of our rescale function?

* A. x
* B. nothing
* C. rescaled vector values


In R, it is not necessary to use a `return` keyword to specify the return value of a function. The return value of a function is simply the value of the last expression evaluated within the function body. 


### Arguments

Functions will often have multiple arguments. Some arguments have default values, others do not. All arguments without default values must be passed to a function. Arguments can be passed by name or position. For instance, 




In [None]:
# generate 5 numbers from a Normal(0, 1) distribution.
w = rnorm(5, mean = 0, sd = 1)
x = rnorm(n = 5, mean = 0, sd = 1)
y = rnorm(5, 0, 1)
z = rnorm(5)
round(cbind(w, x, y, z), 1)

| w | x | y | z |
|---|---|---|---|
| 0.0 | 1.5 | -1.3 | -0.6 |
| -2.8 | 0.0 | 0.1 | -0.9 |
| -0.2 | 1.3 | -0.2 | -1.2 |
| -0.8 | 0.7 | -0.4 | -0.7 |
| -0.5 | -1.2 | -0.1 | 0.7 |


Arguments passed by name need not be in order:

In [None]:
w = rnorm(mean = 0, sd = 1, n = 5)
u = rnorm(mean = 0, sd = 1, 5) # This also works but is bad style. 
round(rbind(u = u, w = w), 1)

# unnamed arguments are matched by position to the remaining parameters after the named arguments are assigned

|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| u | 0.2 | -0.1 | 1.8 | -1.2 | -0.1 |
| w | 1.7 | -1.2 | 0.5 | -1.2 | -0.1 |


# z-score function

Let us create a function to compute z-scores of a vector

In [None]:
# function to compute z-scores
z_score1 = function(x) {
  #inputs: x - a numeric vector
  #outputs: the z-scores for x
  xbar = mean(x)
  s = sd(x)
  z = (x - xbar) / s  # use the intermediate values computed above
  return(z)  
}

z_score1

The return statement is not strictly necessary, but can make complex functions more readable. It is good practice to avoid creating intermediate objects to store values only used once.



In [None]:
# function to compute z-scores
z_score2 = function(x){
  #inputs: x - a numeric vector
  #outputs: the z-scores for x
  (x - mean(x)) / sd(x)
}

In [None]:
x = rnorm(10, 3, 1) ## generate some normally distributed values
round( cbind(x, 'Z1' = z_score1(x), 'Z2' = z_score2(x) ), 1)

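| x | Z1 | Z2 |
|---|---|---|
| 3.0 | -0.3 | -0.3 |
| 3.2 | 0.0 | 0.0 |
| 2.9 | -0.5 | -0.5 |
| 3.4 | 0.3 | 0.3 |
| 3.2 | -0.1 | -0.1 |
| 2.2 | -1.5 | -1.5 |
| 2.8 | -0.6 | -0.6 |
| 4.9 | 2.4 | 2.4 |
| 3.3 | 0.1 | 0.1 |
| 3.3 | 0.1 | 0.1 |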


### Default Parameters

We can set default values for parameters using the construction `parameter = xx` in the function definition.




In [None]:
# function to compute z-scores
z_score3 = function(x, na.rm = TRUE){
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

In [None]:
x = c(NA, x, NA)
round( cbind(x, 'Z1' = z_score1(x), 'Z2' = z_score2(x), 'Z3' = z_score3(x) ), 1)

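| x | Z1 | Z2 | Z3 |
|---|---|---|---|
| NA | NA | NA | NA |
| 3.0 | NA | NA | -0.3 |
| 3.2 | NA | NA | 0.0 |
| 2.9 | NA | NA | -0.5 |
| 3.4 | NA | NA | 0.3 |
| 3.2 | NA | NA | -0.1 |
| 2.2 | NA | NA | -1.5 |
| 2.8 | NA | NA | -0.6 |
| 4.9 | NA | NA | 2.4 |
| 3.3 | NA | NA | 0.1 |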


## Scope

Scoping refers to how R looks up the value associated with an object referred to by name. There are two types of scoping – lexical and dynamic – but we will concern ourselves only with lexical scoping here. There are four keys to understanding scoping:

- environments
- name masking
- variables vs functions
- dynamic look up 


An environment can be thought of as a context in which names are associated with objects. Each time a function is called, it generates a new environment for the computation.

Consider the following examples:

In [None]:
ls()

In [None]:
f1 = function() {
  f1_message = "I'm defined inside of f!"  # `message` is a function in base
  ls()
}
f1()

In [None]:
exists('f1') # equivalent here to 'f1' %in% ls()

In [None]:
# what about f1_message?
exists('f1_message') # FALSE: it only existed inside the environment created by the call to f1()


In [None]:
environment() # here we are in the global environment

<environment: R_GlobalEnv>

In [None]:
f2 = function(){
  environment() # here we are in the local environment -- each time we get a different local environment
    # created for the purpose of this function
}
f2()

<environment: 0x564d05c250a0>

In [None]:
rm(f1, f2)

Scoping rules determine where and in what order `R` looks for object names.
When we call `f1` above, `R` first looks for `f1` in the current environment, which happens to be the global environment. The call to `ls()`, however, happens within the environment created by the function call and hence returns only the objects defined in that local environment.

When a function is called, the environment created for the call is nested within the environment in which the function was defined, referred to as the “parent environment”. When an object is referenced, R first looks in the current environment and then moves recursively up through parent environments until it finds a value bound to that name.
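
The lookup through parent environments can be sketched with a function defined inside another function. All names here (`make_reader`, `reader`, and the three `*_value` variables) are hypothetical and exist only for illustration.

```r
outer_value <- "global"

make_reader <- function() {
  middle_value <- "enclosing function"
  reader <- function() {
    inner_value <- "current function"
    # inner_value:  found in reader's own environment
    # middle_value: found one level up, in make_reader's environment
    # outer_value:  found two levels up, in the global environment
    c(inner_value, middle_value, outer_value)
  }
  reader()
}

make_reader()
```

Each name is resolved at the first level where a binding exists, so adding a `middle_value` inside `reader` would hide the one in `make_reader`.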



Name masking refers to the notion that objects of the same name can exist in different environments. Consider these examples:



In [None]:
#  Example 3 -- lexical scoping
y = x = 'I came from outside of f!'
f3 = function(){
  x =  'I came from inside of f!'
  print(paste("x =", x, "and y =", y))
}
f3()
print(paste("outside-x =", x, "and outside-y =", y))

[1] "x = I came from inside of f! and y = I came from outside of f!"
[1] "outside-x = I came from outside of f! and outside-y = I came from outside of f!"


* `x` is redefined inside the function environment
* `y` is not, so R searches for `y` in the parent environment and keeps moving up until it finds it
* the `x` inside `f3` does not change the `x` in the global environment, unless we explicitly write code to do that
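
One way to explicitly modify a variable in a parent environment is the super-assignment operator `<<-`. The sketch below uses hypothetical names (`counter`, `bump_local`, `bump_global`) and is meant only to contrast it with ordinary assignment; modifying global state from inside a function is usually best avoided.

```r
counter <- 0

bump_local <- function() {
  counter <- counter + 1   # creates a new local `counter`; the global one is untouched
  counter
}

bump_global <- function() {
  counter <<- counter + 1  # assigns to the existing `counter` found in a parent environment
  counter
}

local_result <- bump_local()
after_local <- counter      # the global value is still 0 here
global_result <- bump_global()
after_global <- counter     # the global value is now 1

c(local_result, after_local, global_result, after_global)
```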

In [6]:
#  Example 4 -- masking
mean = function(x){ 
    sum(x)
}
mean(1:10)

In [5]:
base::mean(1:10)

In [None]:
rm(mean)

R also uses dynamic lookup, meaning values are looked up when a function is called, not when it is created. In Example 3 above, `y` was defined in the global environment rather than within the function body. This means the value printed by `f3` depends on the value of `y` in the global environment at the time of the call. You should generally avoid this, but there are occasions where it can be useful.



In [None]:
# Example 5 - dynamic lookup
y = "I have been reinvented!"
f3()

## Dataframe functions

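Functions that wrap `dplyr` verbs run into the problem of indirection. `dplyr` uses tidy evaluation to let you refer to the names of variables inside your data frame without any special treatment, so inside a function `group_by(group_var)` looks for a column literally named `group_var` rather than the column the caller passed in.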

In [None]:
grouped_mean <- function(df, group_var, mean_var) {
  df |> 
    group_by(group_var) |> 
    summarize(mean(mean_var))
}

grouped_mean(mpg, model, hwy)

ERROR: ignored

### Fix with embracing `{{ }}`

In [None]:
grouped_mean <- function(df, group_var, mean_var) {
  df |> 
    group_by({{ group_var }}) |> 
    summarize(mean({{ mean_var }}))
}

grouped_mean(mpg, model, hwy) %>% head

| model `<chr>` | mean(hwy) `<dbl>` |
|---|---|
| 4runner 4wd | 18.83333 |
| a4 | 28.28571 |
| a4 quattro | 25.75 |
| a6 quattro | 24.0 |
| altima | 28.66667 |
| c1500 suburban 2wd | 17.8 |


In [None]:
grouped_mean <- function(df, year, group_var, mean_var) {
  df %>% filter(year == {{ year }}) %>%
    group_by({{ group_var }}) |> 
    summarize(mean({{ mean_var }}))
}

grouped_mean(mpg, 1999, model, hwy) %>% head

| model `<chr>` | mean(hwy) `<dbl>` |
|---|---|
| 4runner 4wd | 19.0 |
| a4 | 27.5 |
| a4 quattro | 25.25 |
| a6 quattro | 24.0 |
| altima | 28.0 |
| c1500 suburban 2wd | 17.0 |


### When to embrace?🤔

There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:

* Data-masking: this is used in functions like arrange(), filter(), and summarize() that compute with variables.

* Tidy-selection: this is used for functions like select(), relocate(), and rename() that select variables.
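
A minimal sketch of embracing in each of the two cases, assuming `dplyr` is installed. The helper names (`summarize_mean`, `keep_columns`) and the `demo` tibble are made up for illustration.

```r
library(dplyr)

# Data-masking: `var` is computed on inside summarize()
summarize_mean <- function(df, var) {
  df |> summarize(mean = mean({{ var }}, na.rm = TRUE))
}

# Tidy-selection: `cols` picks columns inside select()
keep_columns <- function(df, cols) {
  df |> select({{ cols }})
}

demo <- tibble(a = c(1, 2, 3), b = c(10, 20, 30), label = c("x", "y", "z"))

summarize_mean(demo, b)
keep_columns(demo, c(a, label))
keep_columns(demo, starts_with("l"))
```

In both helpers the embraced argument is forwarded to the verb as the caller wrote it, so a tidy-selection argument can also accept selection helpers such as `starts_with()`.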

In [None]:
?select

In [None]:
?summarize

## Pick function

`pick()` is for verbs that use data-masking, such as `group_by()`, when an argument should select several columns at once using tidy-selection.

In [None]:
?pick

In [None]:
mean_hwy <- function(df, var1, var2, var_3) {
  df |> 
    group_by({{ var1 }}, {{ var2 }}) |> 
    summarize(
      mean = mean(is.na({{ var_3 }})),
      .groups = "drop"
    )
}

mpg |> 
  mean_hwy(manufacturer, model, hwy) %>% head

| manufacturer `<chr>` | model `<chr>` | mean `<dbl>` |
|---|---|---|
| audi | a4 | 0 |
| audi | a4 quattro | 0 |
| audi | a6 quattro | 0 |
| chevrolet | c1500 suburban 2wd | 0 |
| chevrolet | corvette | 0 |
| chevrolet | k1500 tahoe 4wd | 0 |


## Using pick

In [None]:
mean_hwy <- function(df, var1, var2) {
  df |> 
    group_by(pick({{ var1 }})) |> 
    summarize(
      mean = mean(is.na({{ var2 }})),
      .groups = "drop"
    )
}

mpg |> 
  mean_hwy(c(manufacturer, model), hwy) %>% head

| manufacturer `<chr>` | model `<chr>` | mean `<dbl>` |
|---|---|---|
| audi | a4 | 0 |
| audi | a4 quattro | 0 |
| audi | a6 quattro | 0 |
| chevrolet | c1500 suburban 2wd | 0 |
| chevrolet | corvette | 0 |
| chevrolet | k1500 tahoe 4wd | 0 |


### Other useful functions

### `bind_rows`

To combine multiple data frames into one large data frame, you can use `bind_rows`.
Recall that you encountered the need for this in HW7. You could have solved the problems without it, but it would have been easier with it!



In [None]:
phil_tbl <- tibble(chapter=seq_along(philosophers_stone), text=philosophers_stone)
chamber_tbl <- tibble(chapter=seq_along(chamber_of_secrets), text=chamber_of_secrets)
prisoner_tbl <- tibble(chapter=seq_along(prisoner_of_azkaban), text=prisoner_of_azkaban)
goblet_tbl <- tibble(chapter=seq_along(goblet_of_fire), text=goblet_of_fire)
order_of_tbl <- tibble(chapter=seq_along(order_of_the_phoenix), text=order_of_the_phoenix)
half_blood_tbl <- tibble(chapter=seq_along(half_blood_prince), text=half_blood_prince)
deathly_tbl <- tibble(chapter=seq_along(deathly_hallows), text=deathly_hallows)

combined <- bind_rows(phil_tbl, chamber_tbl, prisoner_tbl, goblet_tbl, order_of_tbl, half_blood_tbl, deathly_tbl)

In [None]:
combined %>% glimpse

If the data frames have different column names or types, `bind_rows` matches columns by name, fills columns that are missing from a data frame with `NA`, and coerces compatible columns to a common type.
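
A small sketch of this matching behaviour, assuming `dplyr` is installed. The two tibbles are made up for illustration and are unrelated to the Harry Potter data.

```r
library(dplyr)

first  <- tibble(id = 1:2, score = c(90L, 85L))         # integer score
second <- tibble(id = 3L, score = 77.5, note = "late")  # double score, extra column

# Columns are matched by name; `note` is filled with NA for rows from `first`
stacked <- bind_rows(first, second)
stacked

# Naming the inputs and supplying `.id` records which data frame each row came from
labelled <- bind_rows(first = first, second = second, .id = "source")
labelled
```

The `.id` column is one way to keep track of which book a chapter came from when stacking per-book tibbles.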

### `purrr::map` function

When you want to apply a function to individual elements of a list, a good option to consider is `map`


In [None]:
my_list <- list(1:5, 6:10, 11:15)

print(my_list)

# Add 2 to each element in the list using map()
my_list_plus_two <- map(my_list, ~ .x + 2)

# Print the result
my_list_plus_two

### Using `map_dfr` 

While `map` always returns a list, `map_dfr` applies a function to each element of a list or vector and row-binds the results into a single data frame.



In [None]:
# Book names in publication order, matching the bind_rows() example above
books = c(
    "philosophers_stone",
    "chamber_of_secrets", 
    "prisoner_of_azkaban",
    "goblet_of_fire", 
    "order_of_the_phoenix",
    "half_blood_prince",
    "deathly_hallows"
)

combined <- map_dfr(books, ~ tibble(book=., chapter=seq_along(get(.)), text=get(.))) %>%
  glimpse
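
As an alternative sketch, assuming `purrr` 1.0.0 or later (where `map_dfr()` is marked as superseded), the same pattern can be written with `imap()` followed by `list_rbind()`. The two short character vectors below are made-up stand-ins for the book objects, and `books_demo` and `combined_demo` are hypothetical names.

```r
library(purrr)
library(tibble)

# Made-up stand-ins for the book vectors (one string per chapter)
book_one <- c("chapter one text", "chapter two text")
book_two <- c("only chapter")

# A named list avoids looking objects up by name with get()
books_demo <- list(book_one = book_one, book_two = book_two)

combined_demo <- books_demo |>
  # imap() passes each element and its name to the function
  imap(\(text, name) tibble(book = name, chapter = seq_along(text), text = text)) |>
  # list_rbind() stacks the list of tibbles into one tibble
  list_rbind()

combined_demo
```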

Now it is a good idea to refactor your HW7 to use all of these new functions!

## What is refactoring?

In software it is customary to first get 'a' solution to a problem. Once you have working software, you should always think it through and see if there are opportunities to make things better. That step is called refactoring!
Invariably, every time you look at your code again, you will find opportunities to improve it. Some of the things that stand out are:
* Repeated code: pull it out into a function
* Variable names: rename them to be more meaningful, so that a third person who looks at your code can understand what you are trying to do (it also helps you recollect what you were doing when you revisit your code after a few months)
* Built-in functions: see if there are existing functions that you can replace your code with

These are only some tips to get you started on the refactoring path. However, attempting to write the best possible code from the start can lead to analysis paralysis. So the first step is always to get it working in some form, and then think about refactoring. As you gain experience, you will apply the right methods in your first attempt.
In any case, writing code is an iterative process.