# Introduction to Writing Functions in R

Being able to write your own functions makes your analyses more readable, less error-prone, and more reusable from project to project. Few skills will increase your productivity as much as function writing. In this course you'll learn the basics of function writing, focusing on the arguments going into the function and the return values. You'll write useful data science functions using real-world data on Wyoming tourism, stock price/earnings ratios, and grain yields.

## How to write a function

Learn why writing your own functions is useful, how to convert a script into a function, and in what order you should include the arguments.

### Calling functions
 
One way to make your code more readable is to be careful about the order in which you pass arguments when you call functions,
and about whether you pass each argument by position or by name.

medals.csv, a file with a numeric vector of the number of medals won by each country in a Summer Olympics, is provided.

For convenience, the arguments of median() and rank() are displayed using args(). 
Setting rank()'s na.last argument to "keep" means "keep the rank of NA values as NA".

Best practice for calling functions is to pass arguments in the order shown by args(), and to name only the rarely used arguments.

In [4]:
# Look at the medals data
medals_file <- read.csv("medals.csv")
medals <- medals_file$medals

# with median() function
# Note the arguments to median()
args(median)

# Rewrite this function call, following best practices
# bad writing -->  median(TRUE, medals)
median(medals, na.rm = TRUE)

# with rank() function
# Note the arguments to rank()
args(rank)

# Rewrite this function call, following best practices
# bad writing --> rank("keep", "min", -medals)
rank(-medals, na.last = "keep", ties.method = "min")
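
As a self-contained check of the advice above, the sketch below uses a small made-up vector (`medal_counts` is hypothetical, not the course data). It illustrates that passing the common argument by position and the rare ones by name gives the same result as a fully named call, whatever order the named arguments appear in.

```r
# Hypothetical stand-in for the medals vector (not the course data)
medal_counts <- c(10, 3, NA, 3, 7)

# Common argument (x) by position, rare argument (na.rm) by name
med_best <- median(medal_counts, na.rm = TRUE)

# Fully named arguments can appear in any order and give the same answer
med_named <- median(na.rm = TRUE, x = medal_counts)

# Rank from most to fewest medals; NA stays NA, ties share the lowest rank
ranks <- rank(-medal_counts, na.last = "keep", ties.method = "min")

print(med_best)                        # 5
print(identical(med_best, med_named))  # TRUE
print(ranks)                           # 1 3 NA 3 2
```

The fully named version works, but it is noisier to read, which is why naming only the rare arguments is usually the better trade-off.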

### Your first function: tossing a coin
Time to write your first function! It's a really good idea when writing functions to start simple. 
You can always make a function more complicated later if it's really necessary, so let's not worry about arguments for now.

In [5]:
# Simulate a single coin toss by using sample() to sample from coin_sides once.

coin_sides <- c("head", "tail")

# Sample from coin_sides once
sample(coin_sides, size = 1)

# Paste your script into the function body
toss_coin <- function() {
  coin_sides <- c("head", "tail")
  sample(coin_sides, 1)
}

# Call your function
toss_coin()
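
The following sketch illustrates two points about a function like this: the value of the last expression in the body is what the function returns, and the result is random unless the seed is fixed. The use of `set.seed()` and the seed value 42 are assumptions added here for illustration; they are not part of the exercise.

```r
# A function with no arguments: the value of the last expression is returned
toss_coin <- function() {
  coin_sides <- c("head", "tail")
  sample(coin_sides, size = 1)
}

# Assumption for illustration: fix the random seed so a toss can be repeated
set.seed(42)
first_toss <- toss_coin()

set.seed(42)
second_toss <- toss_coin()

# Same seed, same result
print(identical(first_toss, second_toss))

# The result is always exactly one of the two sides
print(first_toss %in% c("head", "tail"))
```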

### Inputs to functions
Most functions require some sort of input to determine what to compute. The inputs to functions are called arguments.
You specify them inside the parentheses after the word "function."
The following exercises assume that you are using sample() to do random sampling.
Sample from coin_sides n_flips times with replacement.

In [9]:
# Use args(sample) to check the names of the function's arguments

args(sample)

coin_sides <- c("head", "tail")
n_flips <- 10

# Sample from coin_sides n_flips times with replacement
sample(coin_sides, size = n_flips, replace = TRUE)

# Update the definition of toss_coin() to accept a single argument, n_flips. 
# The function should sample coin_sides n_flips times with replacement. Remember to change the signature and the body.

# Update the function to return n_flips coin tosses
toss_coin <- function(n_flips) {
  coin_sides <- c("head", "tail")
  sample(coin_sides, size = n_flips, replace = TRUE)
}

# Generate 10 coin tosses
toss_coin(10)
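
A quick way to see what the new argument does is to check the length of the result. The sketch below also shows why `replace = TRUE` matters: with only two sides to draw from, `sample()` cannot return more than two values without replacement. The variable names here are chosen for illustration only.

```r
toss_coin <- function(n_flips) {
  coin_sides <- c("head", "tail")
  sample(coin_sides, size = n_flips, replace = TRUE)
}

# The argument can be passed by position or by name
five_tosses <- toss_coin(5)
twenty_tosses <- toss_coin(n_flips = 20)

print(length(five_tosses))    # 5
print(length(twenty_tosses))  # 20

# Without replacement, asking for 3 values from 2 sides raises an error;
# tryCatch() turns that error into a marker string so the script keeps running
no_replace <- tryCatch(
  sample(c("head", "tail"), size = 3),
  error = function(e) "error"
)
print(no_replace)

# Tabulate the outcomes of a longer run
print(table(toss_coin(1000)))
```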

### Multiple inputs to functions
If a function should have more than one argument, list them in the function signature, separated by commas.
To solve this exercise, you need to know how to specify sampling weights to sample(). 
Set the prob argument to a numeric vector with the same length as x. Each value of prob is the probability of sampling
the corresponding element of x, so the values should add up to one. In the following example, each sample has a 20% chance
of "bat", a 30% chance of "cat", and a 50% chance of "rat":
sample(c("bat", "cat", "rat"), 10, replace = TRUE, prob = c(0.2, 0.3, 0.5))

In [11]:
# Bias the coin by weighting the sampling. Specify the prob argument so that 
# heads are sampled with probability p_head (and tails are sampled with probability 1 - p_head).

coin_sides <- c("head", "tail")
n_flips <- 10
p_head <- 0.8

# Define a vector of weights
weights <- c(p_head, 1 - p_head)

# Update so that heads are sampled with prob p_head
sample(coin_sides, n_flips, replace = TRUE, prob = weights)


# Update the definition of toss_coin() so it accepts an argument, p_head, and weights the samples using 
# the code you wrote in the previous step.

# Update the function so heads have probability p_head
toss_coin <- function(n_flips, p_head) {
  coin_sides <- c("head", "tail")
  # Define a vector of weights
  weights <- c(p_head, 1 - p_head)
  # Modify the sampling to be weighted
  sample(coin_sides, n_flips, replace = TRUE, prob = weights)
}

# Generate 10 coin tosses with an 80% chance of heads on each toss.
toss_coin(10, 0.8)
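
Ten tosses are too few to tell whether the weighting works. As a rough sanity check, the sketch below tosses the biased coin many times and compares the observed proportion of heads with p_head. The seed and the number of tosses are assumptions chosen for illustration.

```r
toss_coin <- function(n_flips, p_head) {
  coin_sides <- c("head", "tail")
  # Weights line up with coin_sides: first heads, then tails
  weights <- c(p_head, 1 - p_head)
  sample(coin_sides, n_flips, replace = TRUE, prob = weights)
}

# Assumption for illustration: a fixed seed makes the run repeatable
set.seed(1)
tosses <- toss_coin(10000, p_head = 0.8)

# Observed share of heads; should be close to 0.8
prop_head <- mean(tosses == "head")
print(prop_head)

# Edge case: a probability of 1 always gives heads
always_head <- toss_coin(5, p_head = 1)
print(always_head)
```

Naming p_head in the call makes it clear which of the two numbers is the probability and which is the number of flips.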



### Renaming GLM
R's generalized linear regression function, glm(), suffers the same usability problems as lm(): its name is an acronym,
and its formula and data arguments are in the wrong order for piping, because data is not the first argument.
To solve this exercise, you need to know two things about generalized linear regression:
- glm() formulas are specified like lm() formulas: the response is on the left, and explanatory variables are added on the right.
- To model count data, set glm()'s family argument to poisson, making it a Poisson regression.

Here you'll use snake_river_visits, data on the number of yearly visits to Snake River at Jackson Hole, Wyoming.

In [17]:
library(dplyr)

snake_river_visits <- readRDS(file = "snake_river_visits.rds")
print(head(snake_river_visits))

# Run a generalized linear regression 
glm(
  # Model no. of visits vs. gender, income, travel
  n_visits ~ gender + income + travel, 
  # Use the snake_river_visits dataset
  data = snake_river_visits, 
  # Make it a Poisson regression
  family = poisson
)

# Define a function, run_poisson_regression(), to run a Poisson regression. 
# This should take two arguments: data and formula, and call glm(), passing those arguments and setting family to poisson.

# Write a function to run a Poisson regression
run_poisson_regression <- function(data, formula) {
  # Name data explicitly: it is not glm()'s second argument
  glm(formula, data = data, family = poisson)
}

# Recreate the Poisson regression model from the first step, this time by calling your run_poisson_regression() function.

# Re-run the Poisson regression, using your function
model <- snake_river_visits %>%
  run_poisson_regression(n_visits ~ gender + income + travel)

snake_river_explanatory <- snake_river_visits %>%
    select(gender, income, travel)

# Run this to see the predictions
snake_river_explanatory %>%
  mutate(predicted_n_visits = predict(model, ., type = "response")) %>%
  arrange(desc(predicted_n_visits))



  n_visits gender      income travel
1        0   male ($95k,$Inf)   <NA>
2        0   male ($25k,$55k]   <NA>
3        0   male ($95k,$Inf)   <NA>
4        0 female ($25k,$55k]   <NA>
5        0   male ($95k,$Inf)   <NA>
6        0 female ($25k,$55k]   <NA>



Call:  glm(formula = n_visits ~ gender + income + travel, family = poisson, 
    data = snake_river_visits)

Coefficients:
      (Intercept)       genderfemale  income($25k,$55k]  income($55k,$95k]  
           4.0864             0.3740            -0.0199            -0.5807  
income($95k,$Inf)   travel(0.25h,4h]    travel(4h,Infh)  
          -0.5782            -0.6271            -2.4230  

Degrees of Freedom: 345 Total (i.e. Null);  339 Residual
  (64 observations deleted due to missingness)
Null Deviance:	    18850 
Residual Deviance: 11530 	AIC: 12860

gender    income     travel predicted_n_visits
female [$0,$25k] [0h,0.25h]           86.51860
female [$0,$25k] [0h,0.25h]           86.51860
female [$0,$25k] [0h,0.25h]           86.51860
female [$0,$25k] [0h,0.25h]           86.51860
female [$0,$25k] [0h,0.25h]           86.51860
female [$0,$25k] [0h,0.25h]           86.51860
female [$0,$25k] [0h,0.25h]           86.51860
female [$0,$25k] [0h,0.25h]           86.51860
female [$0,$25k] [0h,0.25h]           86.51860
female [$0,$25k] [0h,0.25h]           86.51860
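
The wrapper idea from this exercise can also be tried without the course data. The sketch below is a minimal illustration on simulated counts, using base R only. The data frame `sim_visits`, its `group` column, and the true coefficients (1 and 0.5) are assumptions made up for this example; they are not part of the Snake River dataset.

```r
# Wrapper with data as the first argument, so it can end a pipeline
run_poisson_regression <- function(data, formula) {
  glm(formula, data = data, family = poisson)
}

# Simulated count data (hypothetical; not the Snake River dataset)
set.seed(123)
sim_visits <- data.frame(
  group = factor(rep(c("a", "b"), each = 200))
)
# True mean count: exp(1) for group a, exp(1 + 0.5) for group b
sim_visits$n_visits <- rpois(
  400,
  lambda = exp(1 + 0.5 * (sim_visits$group == "b"))
)

# Data first, formula second
model <- run_poisson_regression(sim_visits, n_visits ~ group)

# Estimates are on the log scale; they should be near 1 and 0.5
print(coef(model))

# type = "response" returns predicted counts rather than log counts
predicted <- predict(model, sim_visits, type = "response")
print(unique(round(predicted, 3)))
```

Because the model has a single two-level explanatory variable, there are only two distinct predicted counts, one per group.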
