# Interlude 4: Writing Functions in R

The first two exercises are warm-ups; the exercises using sapply() and by() dig into R's functional, vectorized style.

## Miles to km conversion

In [1]:
# Miles to Kilometers Conversion
# 
# Write a simple function that accepts a numeric value, in miles, and 
# converts the value(s) to kilometers.
#
# Use the following approximation (1 mile is 1.609344 km exactly): 
#   kilometers = (8/5) * miles
# 
# Test the function with the following vector, and print the results.
miles = c(50, 100, 200, 275)

# What is the type (class) of the output, given the input vector miles?

# What is the type (class) of the output if you supply a single numeric value?
# What about a single character value?
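One possible solution is sketched below. The function name `miles_to_km` is illustrative and not part of the exercise; the sketch leans on the fact that arithmetic in R is vectorized, so the same function body handles a single value and a whole vector.

```r
# Hypothetical helper name; uses the 8/5 approximation from the exercise.
miles_to_km <- function(miles) {
  (8 / 5) * miles  # vectorized: one value or a whole vector
}

miles <- c(50, 100, 200, 275)
km <- miles_to_km(miles)
print(km)

# Inspect the class of the result for a vector and for a single value
print(class(km))
print(class(miles_to_km(26.2)))

# Arithmetic on a character value raises an error rather than converting it,
# so the error message is captured here instead of stopping the script
char_result <- tryCatch(miles_to_km("50"),
                        error = function(e) conditionMessage(e))
print(char_result)
```

If character input should be accepted, the function could call as.numeric() on its argument first; whether that is desirable depends on how strict you want the interface to be.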

## Average and standard deviation

In [2]:
# Write a simple function that prints out the mean and standard deviation
# of an input set of numbers. Test the result on the body and brain columns of
# the dataset mammals (mammals$body in units of kg, mammals$brain in units of g).
#
# Hints: 
#   - The native R function mean() can be used for the mean.
#   - The standard deviation is calculated as the square root (sqrt()) of the variance (var())
#       of a set of numbers. Or, use native R function sd().
#   - To output the values, build a vector or data frame from your
#       calculated values, then return that:
#         c(mean=avg, stdDev=sd)
#
library(MASS) # loads the dataset called "mammals"
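A minimal sketch of one way to approach this exercise follows. The function name `mean_sd` and the local variable names are assumptions made for illustration; the local is called `std_dev` rather than `sd` so that it does not shadow the built-in sd() function.

```r
library(MASS)  # provides the mammals dataset

# Hypothetical helper: returns the mean and standard deviation as a named vector
mean_sd <- function(x) {
  avg <- mean(x)
  std_dev <- sqrt(var(x))  # same value as sd(x)
  c(mean = avg, stdDev = std_dev)
}

print(mean_sd(mammals$body))   # body weight, in kg
print(mean_sd(mammals$brain))  # brain weight, in g
```

Returning a named vector instead of calling print() inside the function keeps the values available for further computation.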

## Using apply() functions

The apply() family of functions calls another function repeatedly, once for each element of a vector or list, or once for each column of a data frame. In this exercise we will explore the sapply() function. We will use 
it to call several R functions on a predefined dataset, and then look at the output.

You can use the apply() family with a native R function or with a function you wrote yourself.
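As a small illustration of that point, the sketch below applies one built-in function and one user-written function to the columns of a toy data frame. The data frame `toy` and the function `spread` are made up for this example.

```r
# Toy data, invented for illustration only
toy <- data.frame(a = c(1, 2, 6), b = c(10, 20, 30))

# Built-in function: one mean per column
col_means <- sapply(toy, mean)
print(col_means)

# User-written function (hypothetical): distance between largest and smallest value
spread <- function(x) max(x) - min(x)
col_spreads <- sapply(toy, spread)
print(col_spreads)
```

In both calls sapply() treats the data frame as a list of columns and simplifies the per-column results into a named vector.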

In [4]:
# a. First, create the sample dataset of US car data by running the following:
data(car.test.frame, package = "rpart")
US = car.test.frame[car.test.frame$Country=="USA", ]    # Only use American Cars
US = droplevels(US[ ,c(1,4,6:8)])                       # Only use specified columns

# b. Call head() or str() on the US dataset, to get a sense of the contents.
# How many columns does it have? What are their types?

# c. Now, call sapply() on the US dataset to apply the mean function to each column.

# Hint: Use the syntax sapply(US, mean). Do you agree that this is equivalent to running
# mean(US$Price), then mean(US$Mileage), and so on for every other column
# in the US data frame? Can you see why the apply() family is useful?

# d. Call sapply() on the US dataset again, this time applying the range function.
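Steps c and d might look like the sketch below. The variable names `col_means` and `col_ranges` are illustrative, and the sketch assumes the rpart package is installed, as the exercise already requires.

```r
# Rebuild the US dataset exactly as in step a
data(car.test.frame, package = "rpart")
US <- car.test.frame[car.test.frame$Country == "USA", ]
US <- droplevels(US[, c(1, 4, 6:8)])

# c. One mean per column, returned as a named numeric vector
col_means <- sapply(US, mean)
print(col_means)

# d. range() returns two values (min, max) per column,
#    so sapply() simplifies the result to a matrix with two rows
col_ranges <- sapply(US, range)
print(col_ranges)
```

The shape of the result follows the function: a function that returns one value per column gives a vector, and one that returns several values of equal length gives a matrix.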

## Using the by() function

The by() function can be used to call some other function multiple times, once per group, after first splitting the data into groups.

In this exercise we will explore the use of the by() function. We will use it to call several R functions on a predefined dataset, and then look at the output.

As with the apply() functions, you can use by() with a native R function or with a function you wrote yourself.
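The grouping behaviour can be seen on a tiny example before moving to real data. The data frame `scores` and the helper `value_spread` below are invented for illustration.

```r
# Toy data, invented for illustration only
scores <- data.frame(group = c("a", "a", "b", "b"),
                     value = c(1, 3, 10, 20))

# Built-in function: mean of value within each group
group_means <- by(scores$value, scores$group, mean)
print(group_means)

# User-written function (hypothetical): receives the rows of one group
# as a data frame and returns the spread of its value column
value_spread <- function(d) max(d$value) - min(d$value)
group_spreads <- by(scores, scores$group, value_spread)
print(group_spreads)

# A by object can be flattened to a plain numeric vector when needed
print(as.numeric(group_means))
```

Note the difference between the two calls: passing a single column hands each group to the function as a vector, while passing the whole data frame hands each group over as a data frame.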

### Iris dataset with by()

In [None]:
# First, show yourself what by() does. For this, we'll use the iris dataset. 
attach(iris)  # lets you refer to columns such as Species without the iris$ prefix

# Use the mean() function to find the overall mean of the iris Petal.Width column. 
# Hint: use a call like: mean(iris[,"Petal.Width"])

# Then, use the by() function to find the mean of the Petal.Width column for each 
# iris Species.  Hint: use a call like: by(iris[,"Petal.Width"], Species, mean)

# Using the output of the by() call, find the mean of the means for all species.  
# It should match the overall mean you computed.
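One way to work through these three steps is sketched below. It refers to the grouping column as `iris$Species` so that it runs without attach(); the variable names are illustrative.

```r
# Overall mean of Petal.Width
overall <- mean(iris[, "Petal.Width"])

# Mean of Petal.Width within each species
by_species <- by(iris[, "Petal.Width"], iris$Species, mean)
print(by_species)

# Mean of the per-species means
mean_of_means <- mean(as.numeric(by_species))

print(overall)
print(mean_of_means)
print(isTRUE(all.equal(overall, mean_of_means)))
```

The two numbers agree because every species in iris has the same number of rows (50). With unequal group sizes, the mean of the group means would generally differ from the overall mean.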

### Cars93 dataset with by()

In [None]:
# a. Now, create a sample dataset of car data by running the following:
library(MASS)                                          # Cars93 ships with the MASS package
d = droplevels(Cars93[,c(3,5,7,8,12)] )                # Only use specified columns

# b. Call head(), View(), or str() on the d dataset, to get a sense of the contents.
# How many columns does it have? What are their types?

# To find out more about the columns in the dataset, type:
?Cars93

# Remember that the first argument to by() specifies the columns to operate on; the second 
# specifies the column whose values form groups for the data; and the last specifies the
# function. Use R help for further information (?by)

# c. Now, call the by() function on the d dataset, to apply the summary function to the price
# column while grouping on auto Type. Which auto type has the highest median price?
# What about the lowest median price? (Note that price is reported in thousands of dollars). 

# Use by() to determine how many cars in our dataset are found for each Type. 

# d. Call the by() function on the d dataset, to apply the colMeans function to the 
# MPG and engine size columns, while grouping on auto Type. What does colMeans do?
# What happens if you try to use the mean() function with by()?

# e. Call the by() function on the d dataset, to determine the standard deviation (sd) of 
# the engine size column, while grouping on auto Type. 

# Which auto type has the largest variation (standard deviation) about the engine size mean? 
# Which type has the smallest and largest mean engine size? Any surprises there? 
# Does the by() function make it easy to answer a question like this one?
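The calls for steps c through e could be sketched as follows. The result variable names are illustrative, and interpreting the printed output is left to the reader.

```r
library(MASS)  # provides Cars93
d <- droplevels(Cars93[, c(3, 5, 7, 8, 12)])

# c. Summary of Price within each Type, and the number of cars per Type
price_by_type <- by(d$Price, d$Type, summary)
print(price_by_type)
count_by_type <- by(d, d$Type, nrow)
print(count_by_type)

# d. colMeans() takes a multi-column subset and returns one mean per column
means_by_type <- by(d[, c("MPG.city", "MPG.highway", "EngineSize")],
                    d$Type, colMeans)
print(means_by_type)

# e. Standard deviation and mean of EngineSize within each Type
sd_by_type <- by(d$EngineSize, d$Type, sd)
mean_engine_by_type <- by(d$EngineSize, d$Type, mean)
print(sd_by_type)
print(mean_engine_by_type)

# Flattening to a named vector makes it easy to pick out extremes
sd_vec <- setNames(as.numeric(sd_by_type), names(sd_by_type))
print(names(which.max(sd_vec)))
```

On step d: mean() expects a vector, so handing it a multi-column data frame subset is expected to produce NA with a warning in current versions of R, which is why colMeans() is used there. The last two lines hint at the final question: by() prints readable per-group output, but ranking groups takes an extra conversion step.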
