### R Fundamentals: Lesson 2

In [None]:
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}

In [None]:
analyze_all <- function(folder = "data", pattern) {
  # Runs the function analyze for each file in the given folder
  # that contains the given pattern.
  filenames <- list.files(path = folder, pattern = pattern, full.names = TRUE)
  for (f in filenames) {
    analyze(f)
  }
}

In [None]:
pdf("inflammation-01.pdf")
analyze("data/inflammation-01.csv")
dev.off()

### Conditionals

In [None]:
num = 37
num > 100

In [None]:
num < 100

In [None]:
num = 37
if (num > 100) {
    print('greater')
} else {
    print('not greater')
}
print('done')

In [None]:
num > 100

In [None]:
num < 100

In [None]:
num <- 53
if (num > 100) {
    print('num is greater than 100')
}

---

In [None]:
sign = function(num) {
    if (num > 0) {
        return(1)
    } else if (num == 0) {
        return(0)
    } else {
        return(-1)
    }
}
sign(-3)

In [None]:
sign(0)

In [None]:
sign(2/3)

We can also combine tests. Two ampersands, `&&`, symbolize “and”. Two vertical bars, `||`, symbolize “or”. `&&` is only true if both parts are true:

In [None]:
if (1 > 0 && -1 > 0) {
    print('both parts are true')
} else {
    print('at least one part is not true')
}

In contrast, `||` is true if at least one part is true:

In [None]:
if (1 > 0 || -1 > 0) {
    print('at least one part is true')
} else {
    print('neither part is true')
}
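
As a small illustration of combining tests, the sketch below checks whether a value lies inside or outside a range. The helper names `in_range` and `out_of_range` and the bounds are hypothetical, chosen only for this example. Both operators are meant for single `TRUE`/`FALSE` values, which is what `if` expects.

```r
# Hypothetical helpers for this example only (not part of the lesson code)
in_range <- function(x, lower, upper) {
  # `&&` is TRUE only when both comparisons are TRUE
  x >= lower && x <= upper
}

out_of_range <- function(x, lower, upper) {
  # `||` is TRUE when at least one comparison is TRUE
  x < lower || x > upper
}

num <- 37
if (in_range(num, 0, 100)) {
  print('num is between 0 and 100')
} else {
  print('num is outside 0 to 100')
}

if (out_of_range(num, 0, 100)) {
  print('num is outside 0 to 100')
} else {
  print('num is between 0 and 100')
}
```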

## Choosing Plots Based on Data

Write a function `plot_dist` that plots a boxplot if the length of the vector is greater than a specified threshold and a stripchart otherwise. To do this you’ll use the R functions `boxplot` and `stripchart`.

In [None]:
plot_dist = function(x, threshold) {
    if (length(x) > threshold) {
        boxplot(x)
    } else {
        stripchart(x)
    }
}

In [None]:
df = read.csv('data/inflammation-01.csv', header = FALSE)
plot_dist(x = df[, 10], threshold = 10)

## Histograms Instead

One of your collaborators prefers to see the distributions of the larger vectors as a histogram instead of as a boxplot. In order to choose between a histogram and a boxplot we will edit the function `plot_dist` and add an additional argument `use_boxplot`. By default we will set `use_boxplot` to `TRUE` which will create a boxplot when the vector is longer than `threshold`. When `use_boxplot` is set to `FALSE`, `plot_dist` will instead plot a histogram for the larger vectors. As before, if the length of the vector is shorter than `threshold`, `plot_dist` will create a stripchart. A histogram is made with the `hist` command in R.

In [None]:
plot_dist = function(x, threshold, use_boxplot = TRUE) {
    if (length(x) > threshold && use_boxplot) {
        boxplot(x)
    } else if (length(x) > threshold && !use_boxplot) {
        hist(x)
    } else {
        stripchart(x)
    }
}

In [None]:
df <- read.csv("data/inflammation-01.csv", header = FALSE)
plot_dist(df[, 10], threshold = 10, use_boxplot = TRUE)
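
The call above only exercises the boxplot branch. The sketch below is one possible variant, under the hypothetical name `plot_dist_nested`, that writes the length test once and nests the `use_boxplot` decision inside it. The two vectors are made-up values so that all three branches can be tried without the data files.

```r
# Hypothetical variant of plot_dist: the length test appears only once
plot_dist_nested <- function(x, threshold, use_boxplot = TRUE) {
  if (length(x) > threshold) {
    # Long vector: choose between boxplot and histogram
    if (use_boxplot) {
      boxplot(x)
    } else {
      hist(x)
    }
  } else {
    # Short vector: always a stripchart
    stripchart(x)
  }
}

# Illustrative vectors, made up for this example
long_x <- c(0, 1, 3, 1, 2, 4, 7, 8, 3, 3, 3, 10, 5, 7, 4)  # 15 values
short_x <- long_x[1:5]                                      # 5 values

plot_dist_nested(long_x, threshold = 10)                        # boxplot
plot_dist_nested(long_x, threshold = 10, use_boxplot = FALSE)   # histogram
plot_dist_nested(short_x, threshold = 10, use_boxplot = FALSE)  # stripchart
```

Both forms should behave the same. The nested version simply avoids repeating `length(x) > threshold` in two conditions.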

## Find the Maximum Score

Find the file containing the patient with the highest average inflammation score. Print the file name, the patient number (row number) and the value of the maximum average inflammation score.

Tips:

- Use variables to store the maximum average and update it as you go through files and patients.
- You can use nested loops (one loop inside the other) to go through the files as well as through the patients in each file (every row).

Complete the code below:

In [None]:
filenames <- list.files(path = "data", pattern = "inflammation-[0-9]{2}\\.csv$", full.names = TRUE)
filename_max <- "" # filename where the maximum average inflammation patient is found
patient_max <- 0 # index (row number) for this patient in this file
average_inf_max <- 0 # value of the average inflammation score for this patient
for (f in filenames) {
  dat <- read.csv(file = f, header = FALSE)
  dat.means <- apply(dat, 1, mean)
  for (patient_index in seq_along(dat.means)) {
    patient_average_inf <- dat.means[patient_index]
    # Add your code here ...
    if (patient_average_inf > average_inf_max) {
        average_inf_max <- patient_average_inf
        filename_max <- f
        patient_max <- patient_index
    }
  }
}
print(filename_max)
print(patient_max)
print(average_inf_max)
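
As an alternative sketch, the inner loop can be replaced by `rowMeans` and `which.max`, which return the per-patient averages and the row index of the largest one. The example below writes two tiny CSV files with made-up values to a temporary folder so that it runs on its own. The folder name and the numbers are assumptions for illustration, not the lesson's data.

```r
# Illustrative data only: two tiny files in a temporary folder
tmp_dir <- file.path(tempdir(), "inflammation-demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.table(matrix(c(0, 1, 2,
                     3, 4, 5), nrow = 2, byrow = TRUE),
            file = file.path(tmp_dir, "inflammation-01.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(matrix(c(6, 7, 8,
                     1, 1, 1), nrow = 2, byrow = TRUE),
            file = file.path(tmp_dir, "inflammation-02.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

filenames <- list.files(path = tmp_dir, pattern = "inflammation-[0-9]{2}\\.csv$",
                        full.names = TRUE)
filename_max <- ""
patient_max <- 0
average_inf_max <- -Inf  # any real average will be larger than this
for (f in filenames) {
  dat <- read.csv(file = f, header = FALSE)
  patient_means <- rowMeans(dat)             # one average per patient (row)
  best <- unname(which.max(patient_means))   # row with the largest average in this file
  if (patient_means[[best]] > average_inf_max) {
    average_inf_max <- patient_means[[best]]
    filename_max <- f
    patient_max <- best
  }
}
print(basename(filename_max))  # "inflammation-02.csv"
print(patient_max)             # 1
print(average_inf_max)         # 7
```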

---

## Saving Automatically Generated Figures 

Now that we know how to have R make decisions based on input values, let’s update `analyze`:

In [None]:
analyze <- function(filename, output = NULL) {
  # Plots the average, min, and max inflammation over time.
  # Input:
  #    filename: character string of a csv file
  #    output: character string of pdf file for saving
  if (!is.null(output)) {
    pdf(output)
  }
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
  if (!is.null(output)) {
    dev.off()
  }
}
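
One caveat with this pattern: if an error occurs after `pdf(output)` but before `dev.off()` (for example, `read.csv` cannot find the file), the PDF device stays open. A minimal sketch of one way to guard against that uses `on.exit`, which registers code to run whenever the function exits. The name `analyze_safe` and the temporary files are hypothetical, used only for this example.

```r
# Hypothetical variant of analyze: the device is closed even if an error occurs
analyze_safe <- function(filename, output = NULL) {
  if (!is.null(output)) {
    pdf(output)
    # Runs when the function exits, whether normally or because of an error
    on.exit(dev.off(), add = TRUE)
  }
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}

# Illustrative input with made-up values, written to a temporary file
demo_csv <- tempfile(fileext = ".csv")
write.table(matrix(c(0, 1, 2,
                     3, 4, 5), nrow = 2, byrow = TRUE),
            file = demo_csv, sep = ",", row.names = FALSE, col.names = FALSE)
demo_pdf <- tempfile(fileext = ".pdf")

analyze_safe(demo_csv, output = demo_pdf)
file.exists(demo_pdf)
```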

In [None]:
output <- NULL
is.null(output)

In [None]:
!is.null(output)

In [None]:
analyze("data/inflammation-01.csv")

In [None]:
analyze("data/inflammation-01.csv", output = "inflammation-01.pdf")

In [None]:
dir.create("results")

In [None]:
analyze("data/inflammation-01.csv", output = "results/inflammation-01.pdf")

In [None]:
f <- "inflammation-01.csv"
sub("csv", "pdf", f)

In [None]:
file.path("results", sub("csv", "pdf", f))
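
Note that `sub` treats its first argument as a regular expression and replaces only the first match, wherever it occurs in the string. That is fine for the file names used here. For names that contain "csv" elsewhere, a stricter pattern anchored to the extension may be safer. The file name `f2` below is hypothetical and only illustrates the difference.

```r
# Hypothetical file name with "csv" in two places
f2 <- "csv-inflammation-01.csv"

# Replaces the first occurrence of "csv", which is not the extension here
loose <- sub("csv", "pdf", f2)
print(loose)   # "pdf-inflammation-01.csv"

# Anchored pattern: a literal dot followed by "csv" at the end of the string
strict <- sub("\\.csv$", ".pdf", f2)
print(strict)  # "csv-inflammation-01.pdf"
```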

## Let's update analyze_all

In [None]:
analyze_all <- function(pattern) {
  # Directory name containing the data
  data_dir <- "data"
  # Directory name for results
  results_dir <- "results"
  # Runs the function analyze for each file in data_dir whose name
  # matches the given pattern, saving the plots as a PDF in results_dir.
  filenames <- list.files(path = data_dir, pattern = pattern)
  for (f in filenames) {
    pdf_name <- file.path(results_dir, sub("csv", "pdf", f))
    analyze(file.path(data_dir, f), output = pdf_name)
  }
}

## Saved with one line of code

In [None]:
analyze_all("inflammation.*csv")

In [None]:
# Pull up documentation
?plot

One of your collaborators asks if you can recreate the figures with lines instead of points. Find the relevant argument to `plot` by reading the documentation (`?plot`), update `analyze`, and then recreate all the figures with `analyze_all`.

In [None]:
analyze <- function(filename, output = NULL) {
  # Plots the average, min, and max inflammation over time.
  # Input:
  #    filename: character string of a csv file
  #    output: character string of pdf file for saving
  if (!is.null(output)) {
    pdf(output)
  }
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation, type = "l")
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation, type = "l")
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation, type = "l")
  if (!is.null(output)) {
    dev.off()
  }
}

In [None]:
analyze_all("inflammation.*csv")

---